Back to Tutorials Download .Rmd

1. A Behavioral Story: Mobile Health App Engagement

Dr. Marcus Chen, a behavioral health researcher, developed a mobile app to help people track their daily mood and encourage healthy behaviors. The app sends personalized notifications to users throughout the day. After six months, Dr. Chen wants to understand what factors predict daily app engagement (measured as the number of logins per day).

The data structure is complex: he has repeated daily measurements (up to 180 days) for each of 50 participants. Some users check the app multiple times daily, others rarely engage. The outcome is count data (number of logins), and observations within the same person are correlated—violating the independence assumption of standard GLMs.

This scenario requires Generalized Linear Mixed Models (GLMMs)—a combination of Generalized Linear Models (to handle count data) and mixed-effects models (to account for repeated measures and clustering). GLMMs allow us to model non-normal outcomes while accounting for the hierarchical structure of the data.

2. What are Generalized Linear Mixed Models?

Generalized Linear Mixed Models (GLMMs) extend GLMs by incorporating random effects to account for correlation structures in data. They’re essential when:

  • Observations are clustered or nested (students within schools, days within persons)
  • You have repeated measures on the same individuals
  • Your outcome variable is non-normal (counts, binary, etc.)

Key Features:

  1. Fixed effects: Population-level parameters (e.g., effect of age on engagement)
  2. Random effects: Individual or group-specific deviations from the population average
  3. Link function: Connects the linear predictor to the outcome distribution
  4. Outcome distribution: Can be Poisson, binomial, or other exponential family distributions

GLMMs combine the flexibility of GLMs (non-normal outcomes) with the power of mixed models (correlated data).

3. The GLMM Framework in Plain Form

The general GLMM can be written as:

\[ g(\mu_{ij}) = \beta_0 + \beta_1 X_{ij} + b_{0i} \]

where: - \(\mu_{ij} = E(Y_{ij} | b_{0i})\) is the conditional expected value - \(Y_{ij}\) is the outcome for observation \(j\) within cluster \(i\) - \(g(\cdot)\) is the link function - \(\beta_0, \beta_1\) are fixed effects (population parameters) - \(b_{0i}\) is the random intercept for cluster \(i\), assumed \(b_{0i} \sim N(0, \sigma^2_b)\)

For Poisson GLMM (Count Data with Clustering):

We assume:

\[ Y_{ij} | b_{0i} \sim \text{Poisson}(\lambda_{ij}) \]

with the log link:

\[ \log(\lambda_{ij}) = \beta_0 + \beta_1 X_{ij} + b_{0i} \]

The random intercept \(b_{0i}\) allows each cluster (person, school, etc.) to have its own baseline rate, capturing within-cluster correlation.

Random Slopes:

We can also allow slopes to vary:

\[ \log(\lambda_{ij}) = \beta_0 + \beta_1 X_{ij} + b_{0i} + b_{1i} X_{ij} \]

where \(b_{1i}\) is a random slope, allowing the effect of \(X\) to differ across clusters.

4. Simulating a Dataset

Let’s create data similar to Dr. Chen’s mobile health study. We’ll simulate daily app logins for 50 users over 30 days, with user-specific baselines and a time trend.

# Set seed for reproducibility
set.seed(789)

# Parameters
n_users <- 50       # Number of users
n_days <- 30        # Days per user
n_total <- n_users * n_days

# User IDs
user_id <- rep(1:n_users, each = n_days)

# Day (centered)
day <- rep(1:n_days, times = n_users)
day_centered <- day - mean(day)

# Random intercepts for users (varying baseline engagement)
user_random_intercept <- rnorm(n_users, mean = 0, sd = 0.5)
user_random_effect <- rep(user_random_intercept, each = n_days)

# Fixed effects
beta0 <- 0.8         # Baseline log count
beta1 <- -0.01       # Time trend (slight decline)

# Linear predictor
log_lambda <- beta0 + beta1 * day_centered + user_random_effect

# Expected counts
lambda <- exp(log_lambda)

# Simulate Poisson counts
logins <- rpois(n_total, lambda = lambda)

# Create dataframe
app_data <- data.frame(
  user_id = factor(user_id),
  day = day,
  day_centered = day_centered,
  logins = logins
)

# Display first few rows
head(app_data, 15)
##    user_id day day_centered logins
## 1        1   1        -14.5      3
## 2        1   2        -13.5      2
## 3        1   3        -12.5      1
## 4        1   4        -11.5      6
## 5        1   5        -10.5      5
## 6        1   6         -9.5      1
## 7        1   7         -8.5      3
## 8        1   8         -7.5      3
## 9        1   9         -6.5      2
## 10       1  10         -5.5      3
## 11       1  11         -4.5      4
## 12       1  12         -3.5      3
## 13       1  13         -2.5      5
## 14       1  14         -1.5      2
## 15       1  15         -0.5      2
# Summary statistics
summary(app_data)
##     user_id          day        day_centered       logins      
##  1      :  30   Min.   : 1.0   Min.   :-14.5   Min.   : 0.000  
##  2      :  30   1st Qu.: 8.0   1st Qu.: -7.5   1st Qu.: 1.000  
##  3      :  30   Median :15.5   Median :  0.0   Median : 2.000  
##  4      :  30   Mean   :15.5   Mean   :  0.0   Mean   : 2.296  
##  5      :  30   3rd Qu.:23.0   3rd Qu.:  7.5   3rd Qu.: 3.000  
##  6      :  30   Max.   :30.0   Max.   : 14.5   Max.   :11.000  
##  (Other):1320

5. Visualizing the Data Structure

Let’s visualize the hierarchical structure and variability across users:

library(ggplot2)

# Spaghetti plot: Individual trajectories
ggplot(app_data, aes(x = day, y = logins, group = user_id)) +
  geom_line(alpha = 0.3, color = "steelblue") +
  geom_smooth(aes(group = 1), method = "loess", color = "red", size = 1.2, se = FALSE) +
  labs(
    title = "Daily App Logins Across Users",
    x = "Day",
    y = "Number of Logins"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    axis.line.x = element_line(color = "black"),
    axis.line.y = element_line(color = "black"),
    panel.border = element_blank(),
    axis.line.x.top = element_blank(),
    axis.line.y.right = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

Each thin line represents one user’s trajectory over time. The red line shows the average trend. We can see substantial variability between users—some consistently high engagers, others consistently low.

# Show a subset of users
subset_users <- app_data[app_data$user_id %in% c(1, 5, 10, 15, 20, 25), ]

ggplot(subset_users, aes(x = day, y = logins, color = user_id)) +
  geom_line(size = 0.8) +
  geom_point(size = 1.5) +
  facet_wrap(~ user_id, ncol = 3) +
  labs(
    title = "Sample of Individual User Trajectories",
    x = "Day",
    y = "Number of Logins"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    axis.line.x = element_line(color = "black"),
    axis.line.y = element_line(color = "black"),
    panel.border = element_blank(),
    axis.line.x.top = element_blank(),
    axis.line.y.right = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "none"
  )

6. Fitting the Poisson GLMM in R

We use the glmer() function from the lme4 package to fit GLMMs:

library(lme4)

# Fit Poisson GLMM with random intercepts
glmm_model <- glmer(logins ~ day_centered + (1 | user_id), 
                    family = poisson(link = "log"),
                    data = app_data)

# Display summary
summary(glmm_model)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: logins ~ day_centered + (1 | user_id)
##    Data: app_data
## 
##      AIC      BIC   logLik deviance df.resid 
##   5290.3   5306.2  -2642.1   5284.3     1497 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1126 -0.6950 -0.1269  0.5945  4.5797 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  user_id (Intercept) 0.1875   0.433   
## Number of obs: 1500, groups:  user_id, 50
## 
## Fixed effects:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.737944   0.064045  11.522  < 2e-16 ***
## day_centered -0.009100   0.001966  -4.628  3.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## day_centerd 0.021

Model specification breakdown: - logins ~ day_centered: Fixed effect of day on login count - (1 | user_id): Random intercept for each user - family = poisson: Poisson distribution for count outcome - link = "log": Log link function

7. Interpreting the Results

Let’s extract and interpret the key components:

# Fixed effects
fixed_effects <- fixef(glmm_model)
cat("Fixed Effects:\n")
## Fixed Effects:
print(fixed_effects)
##  (Intercept) day_centered 
##  0.737944364 -0.009099868
# Extract specific coefficients
beta0_hat <- fixed_effects[1]
beta1_hat <- fixed_effects[2]

cat("\nIntercept (beta0):", round(beta0_hat, 4), "\n")
## 
## Intercept (beta0): 0.7379
cat("Day effect (beta1):", round(beta1_hat, 4), "\n")
## Day effect (beta1): -0.0091
# Exponentiate for rate ratios
exp_fixed <- exp(fixed_effects)
cat("\nExponentiated coefficients (Rate Ratios):\n")
## 
## Exponentiated coefficients (Rate Ratios):
print(round(exp_fixed, 4))
##  (Intercept) day_centered 
##       2.0916       0.9909
# Random effects variance
random_effects_var <- as.data.frame(VarCorr(glmm_model))
cat("\nRandom Effects:\n")
## 
## Random Effects:
print(random_effects_var)
##       grp        var1 var2      vcov     sdcor
## 1 user_id (Intercept) <NA> 0.1875051 0.4330186
sigma_b <- random_effects_var$sdcor[1]
cat("\nRandom intercept SD:", round(sigma_b, 4), "\n")
## 
## Random intercept SD: 0.433

Interpretation:

Fixed Effects: - Intercept (\(\beta_0\)) = 0.738: The log expected count at the average day (day 15.5) for an average user - Day effect (\(\beta_1\)) = -0.0091: For each additional day, the log expected count changes by -0.0091

Rate Ratios: - Day: \(\exp(\beta_1)\) = 0.9909 means each additional day multiplies expected logins by 0.9909, or a -0.91% change per day

Random Effects: - User random intercept SD = 0.433: Substantial variability in baseline engagement across users

8. Examining Individual Random Effects

Let’s look at the estimated random intercepts for each user:

# Extract random effects (BLUPs - Best Linear Unbiased Predictors)
user_random_effects <- ranef(glmm_model)$user_id
user_random_effects$user_id <- rownames(user_random_effects)
colnames(user_random_effects) <- c("random_intercept", "user_id")

# Show first 10 users
head(user_random_effects, 10)
##    random_intercept user_id
## 1        0.41962493       1
## 2       -1.02921534       2
## 3        0.02970535       3
## 4        0.16354850       4
## 5       -0.10677276       5
## 6       -0.09073649       6
## 7       -0.12304753       7
## 8       -0.10677276       8
## 9       -0.22610942       9
## 10       0.52317482      10
# Visualize distribution of random intercepts
ggplot(user_random_effects, aes(x = random_intercept)) +
  geom_histogram(bins = 15, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(
    title = "Distribution of User Random Intercepts",
    x = "Random Intercept (log scale)",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    axis.line.x = element_line(color = "black"),
    axis.line.y = element_line(color = "black"),
    panel.border = element_blank(),
    axis.line.x.top = element_blank(),
    axis.line.y.right = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

Users with positive random intercepts have higher-than-average engagement; those with negative values have lower engagement.

9. Making Predictions

Let’s predict login counts for specific scenarios:

# Predictions for an "average" user at different days
new_data <- data.frame(
  day_centered = c(-14.5, 0, 14.5),  # Day 1, 15, 30
  user_id = NA  # Average user
)

# Population-level predictions (fixed effects only)
pred_pop <- predict(glmm_model, newdata = new_data, 
                    re.form = NA, type = "response")

new_data$predicted_logins <- pred_pop
new_data$day_actual <- c(1, 15.5, 30)

cat("Population-level predictions (average user):\n")
## Population-level predictions (average user):
print(new_data[, c("day_actual", "predicted_logins")])
##   day_actual predicted_logins
## 1        1.0         2.386654
## 2       15.5         2.091631
## 3       30.0         1.833078
# Predictions for specific users (including random effects)
user_specific <- data.frame(
  day_centered = rep(0, 5),
  user_id = factor(c(1, 10, 20, 30, 40))
)

pred_user <- predict(glmm_model, newdata = user_specific, type = "response")
user_specific$predicted_logins <- pred_user

cat("\n\nUser-specific predictions at day 15:\n")
## 
## 
## User-specific predictions at day 15:
print(user_specific)
##   day_centered user_id predicted_logins
## 1            0       1         3.182189
## 2            0      10         3.529369
## 3            0      20         1.728530
## 4            0      30         1.971098
## 5            0      40         2.837006

10. Random Slopes Model (Extension)

We can allow the effect of time to vary across users by adding a random slope:

# Fit GLMM with random intercepts and slopes
glmm_slopes <- glmer(logins ~ day_centered + (day_centered | user_id), 
                     family = poisson(link = "log"),
                     data = app_data,
                     control = glmerControl(optimizer = "bobyqa"))

# Summary
summary(glmm_slopes)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: logins ~ day_centered + (day_centered | user_id)
##    Data: app_data
## Control: glmerControl(optimizer = "bobyqa")
## 
##      AIC      BIC   logLik deviance df.resid 
##   5292.1   5318.6  -2641.0   5282.1     1495 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1152 -0.6930 -0.1244  0.5845  4.5340 
## 
## Random effects:
##  Groups  Name         Variance  Std.Dev. Corr
##  user_id (Intercept)  1.888e-01 0.434503     
##          day_centered 5.109e-05 0.007148 0.27
## Number of obs: 1500, groups:  user_id, 50
## 
## Fixed effects:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.735462   0.064275  11.442  < 2e-16 ***
## day_centered -0.009859   0.002385  -4.133 3.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## day_centerd 0.134
# Compare models with AIC
cat("\nModel Comparison:\n")
## 
## Model Comparison:
cat("Random Intercept Only AIC:", AIC(glmm_model), "\n")
## Random Intercept Only AIC: 5290.259
cat("Random Intercept + Slope AIC:", AIC(glmm_slopes), "\n")
## Random Intercept + Slope AIC: 5292.065

The random slopes model allows each user to have a different trajectory over time. Compare AICs: lower AIC indicates better fit.

11. Visualizing the Fitted Model

# Add fitted values to the data
app_data$fitted <- fitted(glmm_model)

# Plot observed vs fitted for a subset of users
subset_data <- app_data[app_data$user_id %in% c(1, 5, 10, 15, 20, 25), ]

ggplot(subset_data, aes(x = day)) +
  geom_point(aes(y = logins), alpha = 0.5, color = "gray40") +
  geom_line(aes(y = fitted), color = "red", size = 1) +
  facet_wrap(~ user_id, ncol = 3) +
  labs(
    title = "Observed vs. Fitted Login Counts",
    x = "Day",
    y = "Number of Logins",
    caption = "Gray points = observed, Red line = fitted"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    axis.line.x = element_line(color = "black"),
    axis.line.y = element_line(color = "black"),
    panel.border = element_blank(),
    axis.line.x.top = element_blank(),
    axis.line.y.right = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

12. Model Diagnostics

# Residuals vs fitted
plot(fitted(glmm_model), residuals(glmm_model, type = "pearson"),
     xlab = "Fitted Values", ylab = "Pearson Residuals",
     main = "Residuals vs. Fitted",
     pch = 20, col = rgb(0, 0, 0, 0.3))
abline(h = 0, col = "red", lty = 2)

# Q-Q plot of random effects
qqnorm(ranef(glmm_model)$user_id[,1], main = "Q-Q Plot of Random Intercepts")
qqline(ranef(glmm_model)$user_id[,1], col = "red")

Diagnostic checks: - Residuals vs. Fitted: Look for patterns or heteroscedasticity - Q-Q Plot of Random Effects: Check normality assumption for random intercepts

13. Checking for Overdispersion

Even in GLMMs, we should check for overdispersion:

# Calculate overdispersion parameter
pearson_resid <- residuals(glmm_model, type = "pearson")
n <- nrow(app_data)
p <- length(fixef(glmm_model))
overdispersion <- sum(pearson_resid^2) / (n - p)

cat("Overdispersion parameter:", round(overdispersion, 3), "\n")
## Overdispersion parameter: 0.928
if (overdispersion > 1.5) {
  cat("Overdispersion detected. Consider negative binomial GLMM.\n")
} else {
  cat("No severe overdispersion.\n")
}
## No severe overdispersion.

14. Reporting Results in APA Style

Here’s how to report GLMM results:

A Poisson generalized linear mixed model was fitted to examine daily app engagement (login count) over time, accounting for repeated measures within users. The model included day (centered) as a fixed effect and random intercepts for users.

The fixed effect of day was statistically significant, b = -0.0091, z = -4.63, p < .001, indicating a -0.91% change in expected logins per day (rate ratio = 0.9909).

There was substantial between-user variability in baseline engagement, with a random intercept standard deviation of 0.433. This suggests that individual differences in app usage patterns are considerable and must be accounted for in the analysis.

15. Practical Considerations

When to Use GLMMs: - Non-normal outcomes (counts, binary) with clustered/repeated measures - Hierarchical data structures (patients within hospitals, students within schools) - Longitudinal studies with discrete outcomes

Model Building: 1. Start with random intercepts only 2. Add random slopes if theoretically justified 3. Compare models using AIC, BIC, or likelihood ratio tests 4. Check convergence warnings and refit with different optimizers if needed

Common Challenges: - Convergence issues: GLMMs are computationally intensive; try different optimizers - Overdispersion: Consider negative binomial GLMMs - Small cluster sizes: Random effects estimates may be unreliable with few observations per cluster

Software Options: - lme4::glmer(): Fast, widely used, good for most GLMMs - glmmTMB: Handles more complex models (zero-inflation, negative binomial) - MCMCglmm: Bayesian approach via MCMC

16. Conclusion

Generalized Linear Mixed Models combine the flexibility of GLMs (handling non-normal data) with the power of mixed models (accounting for clustering and repeated measures). In our mobile health app study, we found that daily engagement slightly declines over time, but there is substantial variability across users in their baseline engagement levels.

GLMMs are essential tools for modern health and psychology research, where hierarchical data structures are ubiquitous: patients nested within clinics, repeated assessments within individuals, students within classrooms. By properly accounting for these dependencies while handling non-normal outcomes, GLMMs provide more accurate inference and better predictions than simpler models.

Mastering GLMMs requires understanding both the GLM framework (link functions, exponential family distributions) and mixed models (random effects, variance components). Together, these tools enable sophisticated analysis of complex real-world data.