Dr. Marcus Chen, a behavioral health researcher, developed a mobile app to help people track their daily mood and encourage healthy behaviors. The app sends personalized notifications to users throughout the day. After six months, Dr. Chen wants to understand what factors predict daily app engagement (measured as the number of logins per day).
The data structure is complex: he has repeated daily measurements (up to 180 days) for each of 50 participants. Some users check the app multiple times daily, others rarely engage. The outcome is count data (number of logins), and observations within the same person are correlated—violating the independence assumption of standard GLMs.
This scenario requires Generalized Linear Mixed Models (GLMMs)—a combination of Generalized Linear Models (to handle count data) and mixed-effects models (to account for repeated measures and clustering). GLMMs allow us to model non-normal outcomes while accounting for the hierarchical structure of the data.
Generalized Linear Mixed Models (GLMMs) extend GLMs by incorporating random effects to account for correlation structures in data. They’re essential when:
Key Features:
GLMMs combine the flexibility of GLMs (non-normal outcomes) with the power of mixed models (correlated data).
The general GLMM can be written as:
\[ g(\mu_{ij}) = \beta_0 + \beta_1 X_{ij} + b_{0i} \]
where: - \(\mu_{ij} = E(Y_{ij} | b_{0i})\) is the conditional expected value - \(Y_{ij}\) is the outcome for observation \(j\) within cluster \(i\) - \(g(\cdot)\) is the link function - \(\beta_0, \beta_1\) are fixed effects (population parameters) - \(b_{0i}\) is the random intercept for cluster \(i\), assumed \(b_{0i} \sim N(0, \sigma^2_b)\)
For Poisson GLMM (Count Data with Clustering):
We assume:
\[ Y_{ij} | b_{0i} \sim \text{Poisson}(\lambda_{ij}) \]
with the log link:
\[ \log(\lambda_{ij}) = \beta_0 + \beta_1 X_{ij} + b_{0i} \]
The random intercept \(b_{0i}\) allows each cluster (person, school, etc.) to have its own baseline rate, capturing within-cluster correlation.
Random Slopes:
We can also allow slopes to vary:
\[ \log(\lambda_{ij}) = \beta_0 + \beta_1 X_{ij} + b_{0i} + b_{1i} X_{ij} \]
where \(b_{1i}\) is a random slope, allowing the effect of \(X\) to differ across clusters.
Let’s create data similar to Dr. Chen’s mobile health study. We’ll simulate daily app logins for 50 users over 30 days, with user-specific baselines and a time trend.
# Set seed for reproducibility
set.seed(789)
# Parameters
n_users <- 50 # Number of users
n_days <- 30 # Days per user
n_total <- n_users * n_days
# User IDs
user_id <- rep(1:n_users, each = n_days)
# Day (centered)
day <- rep(1:n_days, times = n_users)
day_centered <- day - mean(day)
# Random intercepts for users (varying baseline engagement)
user_random_intercept <- rnorm(n_users, mean = 0, sd = 0.5)
user_random_effect <- rep(user_random_intercept, each = n_days)
# Fixed effects
beta0 <- 0.8 # Baseline log count
beta1 <- -0.01 # Time trend (slight decline)
# Linear predictor
log_lambda <- beta0 + beta1 * day_centered + user_random_effect
# Expected counts
lambda <- exp(log_lambda)
# Simulate Poisson counts
logins <- rpois(n_total, lambda = lambda)
# Create dataframe
app_data <- data.frame(
user_id = factor(user_id),
day = day,
day_centered = day_centered,
logins = logins
)
# Display first few rows
head(app_data, 15)
## user_id day day_centered logins
## 1 1 1 -14.5 3
## 2 1 2 -13.5 2
## 3 1 3 -12.5 1
## 4 1 4 -11.5 6
## 5 1 5 -10.5 5
## 6 1 6 -9.5 1
## 7 1 7 -8.5 3
## 8 1 8 -7.5 3
## 9 1 9 -6.5 2
## 10 1 10 -5.5 3
## 11 1 11 -4.5 4
## 12 1 12 -3.5 3
## 13 1 13 -2.5 5
## 14 1 14 -1.5 2
## 15 1 15 -0.5 2
# Summary statistics
summary(app_data)
## user_id day day_centered logins
## 1 : 30 Min. : 1.0 Min. :-14.5 Min. : 0.000
## 2 : 30 1st Qu.: 8.0 1st Qu.: -7.5 1st Qu.: 1.000
## 3 : 30 Median :15.5 Median : 0.0 Median : 2.000
## 4 : 30 Mean :15.5 Mean : 0.0 Mean : 2.296
## 5 : 30 3rd Qu.:23.0 3rd Qu.: 7.5 3rd Qu.: 3.000
## 6 : 30 Max. :30.0 Max. : 14.5 Max. :11.000
## (Other):1320
Let’s visualize the hierarchical structure and variability across users:
library(ggplot2)
# Spaghetti plot: Individual trajectories
ggplot(app_data, aes(x = day, y = logins, group = user_id)) +
geom_line(alpha = 0.3, color = "steelblue") +
geom_smooth(aes(group = 1), method = "loess", color = "red", size = 1.2, se = FALSE) +
labs(
title = "Daily App Logins Across Users",
x = "Day",
y = "Number of Logins"
) +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
Each thin line represents one user’s trajectory over time. The red line shows the average trend. We can see substantial variability between users—some consistently high engagers, others consistently low.
# Show a subset of users
subset_users <- app_data[app_data$user_id %in% c(1, 5, 10, 15, 20, 25), ]
ggplot(subset_users, aes(x = day, y = logins, color = user_id)) +
geom_line(size = 0.8) +
geom_point(size = 1.5) +
facet_wrap(~ user_id, ncol = 3) +
labs(
title = "Sample of Individual User Trajectories",
x = "Day",
y = "Number of Logins"
) +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none"
)
We use the glmer() function from the lme4
package to fit GLMMs:
library(lme4)
# Fit Poisson GLMM with random intercepts
glmm_model <- glmer(logins ~ day_centered + (1 | user_id),
family = poisson(link = "log"),
data = app_data)
# Display summary
summary(glmm_model)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: logins ~ day_centered + (1 | user_id)
## Data: app_data
##
## AIC BIC logLik deviance df.resid
## 5290.3 5306.2 -2642.1 5284.3 1497
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1126 -0.6950 -0.1269 0.5945 4.5797
##
## Random effects:
## Groups Name Variance Std.Dev.
## user_id (Intercept) 0.1875 0.433
## Number of obs: 1500, groups: user_id, 50
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.737944 0.064045 11.522 < 2e-16 ***
## day_centered -0.009100 0.001966 -4.628 3.7e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## day_centerd 0.021
Model specification breakdown: -
logins ~ day_centered: Fixed effect of day on login count -
(1 | user_id): Random intercept for each user -
family = poisson: Poisson distribution for count outcome -
link = "log": Log link function
Let’s extract and interpret the key components:
# Fixed effects
fixed_effects <- fixef(glmm_model)
cat("Fixed Effects:\n")
## Fixed Effects:
print(fixed_effects)
## (Intercept) day_centered
## 0.737944364 -0.009099868
# Extract specific coefficients
beta0_hat <- fixed_effects[1]
beta1_hat <- fixed_effects[2]
cat("\nIntercept (beta0):", round(beta0_hat, 4), "\n")
##
## Intercept (beta0): 0.7379
cat("Day effect (beta1):", round(beta1_hat, 4), "\n")
## Day effect (beta1): -0.0091
# Exponentiate for rate ratios
exp_fixed <- exp(fixed_effects)
cat("\nExponentiated coefficients (Rate Ratios):\n")
##
## Exponentiated coefficients (Rate Ratios):
print(round(exp_fixed, 4))
## (Intercept) day_centered
## 2.0916 0.9909
# Random effects variance
random_effects_var <- as.data.frame(VarCorr(glmm_model))
cat("\nRandom Effects:\n")
##
## Random Effects:
print(random_effects_var)
## grp var1 var2 vcov sdcor
## 1 user_id (Intercept) <NA> 0.1875051 0.4330186
sigma_b <- random_effects_var$sdcor[1]
cat("\nRandom intercept SD:", round(sigma_b, 4), "\n")
##
## Random intercept SD: 0.433
Interpretation:
Fixed Effects: - Intercept (\(\beta_0\)) = 0.738: The log expected count at the average day (day 15.5) for an average user - Day effect (\(\beta_1\)) = -0.0091: For each additional day, the log expected count changes by -0.0091
Rate Ratios: - Day: \(\exp(\beta_1)\) = 0.9909 means each additional day multiplies expected logins by 0.9909, or a -0.91% change per day
Random Effects: - User random intercept SD = 0.433: Substantial variability in baseline engagement across users
Let’s look at the estimated random intercepts for each user:
# Extract random effects (BLUPs - Best Linear Unbiased Predictors)
user_random_effects <- ranef(glmm_model)$user_id
user_random_effects$user_id <- rownames(user_random_effects)
colnames(user_random_effects) <- c("random_intercept", "user_id")
# Show first 10 users
head(user_random_effects, 10)
## random_intercept user_id
## 1 0.41962493 1
## 2 -1.02921534 2
## 3 0.02970535 3
## 4 0.16354850 4
## 5 -0.10677276 5
## 6 -0.09073649 6
## 7 -0.12304753 7
## 8 -0.10677276 8
## 9 -0.22610942 9
## 10 0.52317482 10
# Visualize distribution of random intercepts
ggplot(user_random_effects, aes(x = random_intercept)) +
geom_histogram(bins = 15, fill = "steelblue", color = "black", alpha = 0.7) +
labs(
title = "Distribution of User Random Intercepts",
x = "Random Intercept (log scale)",
y = "Frequency"
) +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
Users with positive random intercepts have higher-than-average engagement; those with negative values have lower engagement.
Let’s predict login counts for specific scenarios:
# Predictions for an "average" user at different days
new_data <- data.frame(
day_centered = c(-14.5, 0, 14.5), # Day 1, 15, 30
user_id = NA # Average user
)
# Population-level predictions (fixed effects only)
pred_pop <- predict(glmm_model, newdata = new_data,
re.form = NA, type = "response")
new_data$predicted_logins <- pred_pop
new_data$day_actual <- c(1, 15.5, 30)
cat("Population-level predictions (average user):\n")
## Population-level predictions (average user):
print(new_data[, c("day_actual", "predicted_logins")])
## day_actual predicted_logins
## 1 1.0 2.386654
## 2 15.5 2.091631
## 3 30.0 1.833078
# Predictions for specific users (including random effects)
user_specific <- data.frame(
day_centered = rep(0, 5),
user_id = factor(c(1, 10, 20, 30, 40))
)
pred_user <- predict(glmm_model, newdata = user_specific, type = "response")
user_specific$predicted_logins <- pred_user
cat("\n\nUser-specific predictions at day 15:\n")
##
##
## User-specific predictions at day 15:
print(user_specific)
## day_centered user_id predicted_logins
## 1 0 1 3.182189
## 2 0 10 3.529369
## 3 0 20 1.728530
## 4 0 30 1.971098
## 5 0 40 2.837006
We can allow the effect of time to vary across users by adding a random slope:
# Fit GLMM with random intercepts and slopes
glmm_slopes <- glmer(logins ~ day_centered + (day_centered | user_id),
family = poisson(link = "log"),
data = app_data,
control = glmerControl(optimizer = "bobyqa"))
# Summary
summary(glmm_slopes)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: logins ~ day_centered + (day_centered | user_id)
## Data: app_data
## Control: glmerControl(optimizer = "bobyqa")
##
## AIC BIC logLik deviance df.resid
## 5292.1 5318.6 -2641.0 5282.1 1495
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1152 -0.6930 -0.1244 0.5845 4.5340
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## user_id (Intercept) 1.888e-01 0.434503
## day_centered 5.109e-05 0.007148 0.27
## Number of obs: 1500, groups: user_id, 50
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.735462 0.064275 11.442 < 2e-16 ***
## day_centered -0.009859 0.002385 -4.133 3.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## day_centerd 0.134
# Compare models with AIC
cat("\nModel Comparison:\n")
##
## Model Comparison:
cat("Random Intercept Only AIC:", AIC(glmm_model), "\n")
## Random Intercept Only AIC: 5290.259
cat("Random Intercept + Slope AIC:", AIC(glmm_slopes), "\n")
## Random Intercept + Slope AIC: 5292.065
The random slopes model allows each user to have a different trajectory over time. Compare AICs: lower AIC indicates better fit.
# Add fitted values to the data
app_data$fitted <- fitted(glmm_model)
# Plot observed vs fitted for a subset of users
subset_data <- app_data[app_data$user_id %in% c(1, 5, 10, 15, 20, 25), ]
ggplot(subset_data, aes(x = day)) +
geom_point(aes(y = logins), alpha = 0.5, color = "gray40") +
geom_line(aes(y = fitted), color = "red", size = 1) +
facet_wrap(~ user_id, ncol = 3) +
labs(
title = "Observed vs. Fitted Login Counts",
x = "Day",
y = "Number of Logins",
caption = "Gray points = observed, Red line = fitted"
) +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
# Residuals vs fitted
plot(fitted(glmm_model), residuals(glmm_model, type = "pearson"),
xlab = "Fitted Values", ylab = "Pearson Residuals",
main = "Residuals vs. Fitted",
pch = 20, col = rgb(0, 0, 0, 0.3))
abline(h = 0, col = "red", lty = 2)
# Q-Q plot of random effects
qqnorm(ranef(glmm_model)$user_id[,1], main = "Q-Q Plot of Random Intercepts")
qqline(ranef(glmm_model)$user_id[,1], col = "red")
Diagnostic checks: - Residuals vs. Fitted: Look for patterns or heteroscedasticity - Q-Q Plot of Random Effects: Check normality assumption for random intercepts
Even in GLMMs, we should check for overdispersion:
# Calculate overdispersion parameter
pearson_resid <- residuals(glmm_model, type = "pearson")
n <- nrow(app_data)
p <- length(fixef(glmm_model))
overdispersion <- sum(pearson_resid^2) / (n - p)
cat("Overdispersion parameter:", round(overdispersion, 3), "\n")
## Overdispersion parameter: 0.928
if (overdispersion > 1.5) {
cat("Overdispersion detected. Consider negative binomial GLMM.\n")
} else {
cat("No severe overdispersion.\n")
}
## No severe overdispersion.
Here’s how to report GLMM results:
A Poisson generalized linear mixed model was fitted to examine daily app engagement (login count) over time, accounting for repeated measures within users. The model included day (centered) as a fixed effect and random intercepts for users.
The fixed effect of day was statistically significant, b = -0.0091, z = -4.63, p < .001, indicating a -0.91% change in expected logins per day (rate ratio = 0.9909).
There was substantial between-user variability in baseline engagement, with a random intercept standard deviation of 0.433. This suggests that individual differences in app usage patterns are considerable and must be accounted for in the analysis.
When to Use GLMMs: - Non-normal outcomes (counts, binary) with clustered/repeated measures - Hierarchical data structures (patients within hospitals, students within schools) - Longitudinal studies with discrete outcomes
Model Building: 1. Start with random intercepts only 2. Add random slopes if theoretically justified 3. Compare models using AIC, BIC, or likelihood ratio tests 4. Check convergence warnings and refit with different optimizers if needed
Common Challenges: - Convergence issues: GLMMs are computationally intensive; try different optimizers - Overdispersion: Consider negative binomial GLMMs - Small cluster sizes: Random effects estimates may be unreliable with few observations per cluster
Software Options: - lme4::glmer():
Fast, widely used, good for most GLMMs - glmmTMB: Handles
more complex models (zero-inflation, negative binomial) -
MCMCglmm: Bayesian approach via MCMC
Generalized Linear Mixed Models combine the flexibility of GLMs (handling non-normal data) with the power of mixed models (accounting for clustering and repeated measures). In our mobile health app study, we found that daily engagement slightly declines over time, but there is substantial variability across users in their baseline engagement levels.
GLMMs are essential tools for modern health and psychology research, where hierarchical data structures are ubiquitous: patients nested within clinics, repeated assessments within individuals, students within classrooms. By properly accounting for these dependencies while handling non-normal outcomes, GLMMs provide more accurate inference and better predictions than simpler models.
Mastering GLMMs requires understanding both the GLM framework (link functions, exponential family distributions) and mixed models (random effects, variance components). Together, these tools enable sophisticated analysis of complex real-world data.