Generalized Linear Models (GLM) in R

1. A Behavioral Story: Counting Emergency Room Visits

Dr. Elena Rodriguez, a public health researcher, works with an urban hospital to understand patterns in emergency room visits. She’s particularly interested in how social and behavioral factors influence the number of ER visits patients make in a year.

After analyzing patient records, she notices that the data doesn’t look like typical linear regression data. Some patients have zero ER visits, others have one or two, and a few have many more. The distribution is skewed, with most values clustered near zero. The residuals from a linear regression model looked terrible—predictions included impossible negative visit counts!

This is where Generalized Linear Models (GLMs) become essential. GLMs extend ordinary linear regression to handle different types of outcome variables: counts, binary outcomes, proportions, and other non-normal distributions. For count data like ER visits, we use Poisson regression or negative binomial regression, two common GLM variants.

2. What are Generalized Linear Models?

Generalized Linear Models (GLMs) extend ordinary linear regression to accommodate response variables that don’t follow a normal distribution. While standard linear regression assumes:

\[ Y_i \sim N(\mu_i, \sigma^2) \]

GLMs allow the response variable to follow various distributions from the exponential family, including: - Poisson distribution (count data) - Binomial distribution (binary or proportion data) - Gamma distribution (positive continuous data) - Negative binomial distribution (overdispersed count data)

The Three Components of GLMs:

Random component: Specifies the probability distribution of the response variable
Systematic component: Linear combination of predictors (\(X\beta\))
Link function: Connects the mean of the response to the linear predictors

3. The GLM Framework in Plain Form

The general GLM can be written as:

\[ g(\mu_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} \]

where: - \(\mu_i = E(Y_i)\) is the expected value of the response - \(g(\cdot)\) is the link function - \(\beta_0, \beta_1, \ldots, \beta_p\) are regression coefficients

For Poisson Regression (Count Data):

We assume:

\[ Y_i \sim \text{Poisson}(\lambda_i) \]

with the log link function:

\[ \log(\lambda_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots \]

This ensures \(\lambda_i > 0\) (counts can’t be negative). Exponentiating both sides:

\[ \lambda_i = \exp(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots) \]

Interpreting Coefficients:

For Poisson regression, \(\exp(\beta_j)\) represents the multiplicative change in the expected count for a one-unit increase in \(X_j\).

4. Simulating a Dataset

Let’s create data similar to Dr. Rodriguez’s study. We’ll simulate ER visits based on age and chronic disease status.

# Set seed for reproducibility
set.seed(456)

# Sample size
n <- 200

# Simulate predictors
age <- round(rnorm(n, mean = 45, sd = 15))
age <- pmax(age, 18)  # Minimum age 18

chronic_disease <- rbinom(n, size = 1, prob = 0.3)  # 30% have chronic disease

# Generate Poisson counts
# Log-linear model: log(lambda) = beta0 + beta1*age + beta2*chronic_disease
beta0 <- 0.2
beta1 <- 0.01  # Age effect
beta2 <- 0.8   # Chronic disease effect

log_lambda <- beta0 + beta1 * age + beta2 * chronic_disease
lambda <- exp(log_lambda)

# Simulate ER visits from Poisson distribution
er_visits <- rpois(n, lambda = lambda)

# Create dataframe
er_data <- data.frame(
  age = age,
  chronic_disease = factor(chronic_disease, levels = c(0, 1), 
                           labels = c("No", "Yes")),
  er_visits = er_visits
)

# Display first few rows
head(er_data)

##   age chronic_disease er_visits
## 1  25              No         0
## 2  54              No         4
## 3  57             Yes         6
## 4  24              No         4
## 5  34              No         1
## 6  40              No         3

# Summary statistics
summary(er_data)

##       age        chronic_disease   er_visits    
##  Min.   :18.00   No :128         Min.   : 0.00  
##  1st Qu.:34.75   Yes: 72         1st Qu.: 2.00  
##  Median :44.00                   Median : 2.00  
##  Mean   :45.16                   Mean   : 2.99  
##  3rd Qu.:57.00                   3rd Qu.: 4.00  
##  Max.   :79.00                   Max.   :10.00

# Distribution of ER visits
table(er_data$er_visits)

## 
##  0  1  2  3  4  5  6  7  8  9 10 
## 16 31 54 27 28 19 14  5  3  2  1

5. Visualizing the Relationship

Let’s visualize the distribution of ER visits and its relationship with our predictors.

library(ggplot2)

# Histogram of ER visits
ggplot(er_data, aes(x = er_visits)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(
    title = "Distribution of ER Visits",
    x = "Number of ER Visits",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    axis.line.x = element_line(color = "black"),
    axis.line.y = element_line(color = "black"),
    panel.border = element_blank(),
    axis.line.x.top = element_blank(),
    axis.line.y.right = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

The distribution is right-skewed with many low values and few high values—typical of count data.

# ER visits by chronic disease
ggplot(er_data, aes(x = chronic_disease, y = er_visits, fill = chronic_disease)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.3) +
  labs(
    title = "ER Visits by Chronic Disease Status",
    x = "Chronic Disease",
    y = "Number of ER Visits"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    axis.line.x = element_line(color = "black"),
    axis.line.y = element_line(color = "black"),
    panel.border = element_blank(),
    axis.line.x.top = element_blank(),
    axis.line.y.right = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "none"
  )

# ER visits vs age
ggplot(er_data, aes(x = age, y = er_visits, color = chronic_disease)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "glm", method.args = list(family = "poisson"), se = FALSE) +
  labs(
    title = "ER Visits by Age and Chronic Disease Status",
    x = "Age (years)",
    y = "Number of ER Visits",
    color = "Chronic Disease"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    axis.line.x = element_line(color = "black"),
    axis.line.y = element_line(color = "black"),
    panel.border = element_blank(),
    axis.line.x.top = element_blank(),
    axis.line.y.right = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

6. Fitting the Poisson GLM in R

We use the glm() function with family = poisson for Poisson regression:

# Fit Poisson GLM
poisson_model <- glm(er_visits ~ age + chronic_disease, 
                     family = poisson(link = "log"), 
                     data = er_data)

# Display summary
summary(poisson_model)

## 
## Call:
## glm(formula = er_visits ~ age + chronic_disease, family = poisson(link = "log"), 
##     data = er_data)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.470338   0.145359   3.236  0.00121 ** 
## age                0.005804   0.002849   2.037  0.04163 *  
## chronic_diseaseYes 0.788195   0.082262   9.581  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 296.27  on 199  degrees of freedom
## Residual deviance: 199.61  on 197  degrees of freedom
## AIC: 741.42
## 
## Number of Fisher Scoring iterations: 5

7. Interpreting the Results

Let’s extract and interpret the key values:

# Extract coefficients
coef_summary <- summary(poisson_model)$coefficients
print(coef_summary)

##                       Estimate Std. Error  z value     Pr(>|z|)
## (Intercept)        0.470337554 0.14535855 3.235706 1.213422e-03
## age                0.005803575 0.00284876 2.037229 4.162712e-02
## chronic_diseaseYes 0.788194813 0.08226219 9.581496 9.564937e-22

# Extract specific coefficients
beta0_hat <- coef(poisson_model)[1]
beta_age <- coef(poisson_model)[2]
beta_chronic <- coef(poisson_model)[3]

cat("\nIntercept:", round(beta0_hat, 4), "\n")

## 
## Intercept: 0.4703

cat("Age coefficient:", round(beta_age, 4), "\n")

## Age coefficient: 0.0058

cat("Chronic disease coefficient:", round(beta_chronic, 4), "\n")

## Chronic disease coefficient: 0.7882

# Calculate exponentiated coefficients (rate ratios)
exp_coefs <- exp(coef(poisson_model))
cat("\nExponentiated coefficients (Rate Ratios):\n")

## 
## Exponentiated coefficients (Rate Ratios):

print(round(exp_coefs, 4))

##        (Intercept)                age chronic_diseaseYes 
##             1.6005             1.0058             2.1994

Interpretation on the Log Scale:

Intercept (\(\beta_0\)) = 0.47: The log expected count of ER visits for a 0-year-old without chronic disease (not meaningful; we need to center age for practical interpretation)
Age coefficient (\(\beta_1\)) = 0.0058: For each additional year of age, the log expected count of ER visits increases by 0.0058
Chronic disease coefficient (\(\beta_2\)) = 0.788: Having a chronic disease increases the log expected count of ER visits by 0.788

Interpretation on the Count Scale (Rate Ratios):

Exponentiating the coefficients gives us rate ratios (also called incidence rate ratios):

Age: \(\exp(\beta_1)\) = 1.0058 means for each additional year of age, the expected number of ER visits is multiplied by 1.0058, or increases by approximately 0.58%
Chronic disease: \(\exp(\beta_2)\) = 2.199 means having a chronic disease multiplies the expected ER visits by 2.199, or increases them by approximately 119.9%

8. Making Predictions

Let’s predict ER visits for specific patient profiles:

# Create new data for prediction
new_patients <- data.frame(
  age = c(30, 30, 60, 60),
  chronic_disease = factor(c("No", "Yes", "No", "Yes"), levels = c("No", "Yes"))
)

# Predict expected counts
predictions <- predict(poisson_model, newdata = new_patients, type = "response")
new_patients$predicted_visits <- predictions

print(new_patients)

##   age chronic_disease predicted_visits
## 1  30              No         1.904929
## 2  30             Yes         4.189744
## 3  60              No         2.267215
## 4  60             Yes         4.986563

For example: - A 30-year-old without chronic disease: 1.9 expected ER visits - A 30-year-old with chronic disease: 4.19 expected ER visits - A 60-year-old without chronic disease: 2.27 expected ER visits - A 60-year-old with chronic disease: 4.99 expected ER visits

9. Checking Model Fit: Deviance and Overdispersion

Deviance:

The deviance measures how well the model fits the data. For Poisson regression, we check the residual deviance against the residual degrees of freedom.

# Extract deviance information
residual_deviance <- poisson_model$deviance
residual_df <- poisson_model$df.residual
deviance_ratio <- residual_deviance / residual_df

cat("Residual Deviance:", round(residual_deviance, 2), "\n")

## Residual Deviance: 199.61

cat("Residual DF:", residual_df, "\n")

## Residual DF: 197

cat("Deviance / DF:", round(deviance_ratio, 3), "\n")

## Deviance / DF: 1.013

If the deviance/DF ratio is close to 1, the Poisson model fits well. A ratio substantially greater than 1 indicates overdispersion—more variance than the Poisson distribution allows.

Testing for Overdispersion:

# Dispersion parameter estimate
dispersion <- sum(residuals(poisson_model, type = "pearson")^2) / residual_df
cat("Dispersion parameter:", round(dispersion, 3), "\n")

## Dispersion parameter: 0.89

if (dispersion > 1.5) {
  cat("Warning: Overdispersion detected. Consider negative binomial regression.\n")
} else {
  cat("No severe overdispersion detected.\n")
}

## No severe overdispersion detected.

10. Negative Binomial Regression (for Overdispersion)

If overdispersion is detected, we can fit a negative binomial model, which includes an extra parameter to account for excess variance:

library(MASS)

# Fit negative binomial GLM
nb_model <- glm.nb(er_visits ~ age + chronic_disease, data = er_data)

# Display summary
summary(nb_model)

## 
## Call:
## glm.nb(formula = er_visits ~ age + chronic_disease, data = er_data, 
##     init.theta = 36985.14912, link = log)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.470338   0.145365   3.236  0.00121 ** 
## age                0.005804   0.002849   2.037  0.04164 *  
## chronic_diseaseYes 0.788195   0.082266   9.581  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(36985.15) family taken to be 1)
## 
##     Null deviance: 296.24  on 199  degrees of freedom
## Residual deviance: 199.59  on 197  degrees of freedom
## AIC: 743.42
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  36985 
##           Std. Err.:  445673 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -735.423

# Compare with Poisson
cat("\nPoisson AIC:", AIC(poisson_model), "\n")

## 
## Poisson AIC: 741.4206

cat("Negative Binomial AIC:", AIC(nb_model), "\n")

## Negative Binomial AIC: 743.4226

The negative binomial model includes a dispersion parameter (theta). A lower AIC indicates better model fit. If the negative binomial AIC is substantially lower, use that model instead.

11. Visualizing the Fitted Model

# Create prediction grid
age_seq <- seq(min(er_data$age), max(er_data$age), length.out = 100)
pred_data <- expand.grid(
  age = age_seq,
  chronic_disease = factor(c("No", "Yes"), levels = c("No", "Yes"))
)

# Get predictions with confidence intervals
pred_data$predicted <- predict(poisson_model, newdata = pred_data, type = "response")

# Plot
ggplot(er_data, aes(x = age, y = er_visits, color = chronic_disease)) +
  geom_point(alpha = 0.4) +
  geom_line(data = pred_data, aes(y = predicted), size = 1.2) +
  labs(
    title = "Fitted Poisson Regression Model",
    x = "Age (years)",
    y = "Expected ER Visits",
    color = "Chronic Disease"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    axis.line.x = element_line(color = "black"),
    axis.line.y = element_line(color = "black"),
    panel.border = element_blank(),
    axis.line.x.top = element_blank(),
    axis.line.y.right = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

12. Model Diagnostics

# Residuals vs fitted
plot(poisson_model, which = 1, main = "Residuals vs Fitted")

# Q-Q plot
plot(poisson_model, which = 2, main = "Q-Q Plot")

# Influential observations
plot(poisson_model, which = 5, main = "Residuals vs Leverage")

For GLMs, we examine: - Residuals vs Fitted: Look for patterns suggesting poor fit - Q-Q Plot: Assess normality of deviance residuals - Influential Points: Check for observations with high leverage or Cook’s distance

13. Reporting Results in APA Style

Here’s how to report Poisson regression results:

A Poisson regression was conducted to examine the relationship between age, chronic disease status, and the number of emergency room visits per year. The model provided a good fit to the data, with a residual deviance of 199.61 on 197 degrees of freedom (dispersion parameter = 0.89).

Age was a significant predictor of ER visits, b = 0.0058, z = 2.04, p = 0.042. Specifically, each additional year of age was associated with a 0.58% increase in the expected number of ER visits (rate ratio = 1.006).

Chronic disease status was also a significant predictor, b = 0.788, z = 9.58, p < .001. Patients with chronic diseases had 119.9% more ER visits compared to those without chronic diseases (rate ratio = 2.2), holding age constant.

14. Practical Considerations

When to Use GLMs: - When the outcome variable is not normally distributed - For count data (Poisson or negative binomial) - For binary outcomes (logistic regression) - For rates or proportions

Model Selection: - Poisson: When variance equals the mean - Negative Binomial: When variance exceeds the mean (overdispersion) - Compare models using AIC or likelihood ratio tests

Common Mistakes: - Using linear regression for count data (can predict negative counts) - Ignoring overdispersion in Poisson models - Misinterpreting coefficients (they’re on the log scale) - Not checking model assumptions and diagnostics

Extensions: - Zero-inflated models: When there are excess zeros - Hurdle models: Separate processes for zero vs. non-zero counts - Offset terms: When exposure varies (e.g., person-years)

15. Conclusion

Generalized Linear Models extend regression to handle non-normal response variables. For count data like ER visits, Poisson regression provides a natural framework that respects the discrete, non-negative nature of counts.

In our analysis, we found that both age and chronic disease significantly predict ER visit frequency. Older patients and those with chronic diseases make more ER visits, with chronic disease having a particularly strong effect (more than doubling expected visits).

GLMs are powerful tools in public health and psychology research, enabling analysis of diverse outcome types while maintaining the interpretability of regression coefficients. Mastering GLMs opens doors to analyzing binary outcomes (logistic regression), survival times (Cox regression), and many other data types common in health sciences.