
1. A Behavioral Story: The Exercise and Well-being Study

Dr. Sarah Martinez, a health psychologist, noticed something interesting at her university’s wellness center. Students who participated in different types of exercise programs—yoga, strength training, or aerobic exercise—seemed to report varying levels of psychological well-being. Some swore by the calming effects of yoga, while others felt most energized after a cardio session or strength workout.

She wondered: Does the type of exercise actually influence well-being scores? And if there are differences, which specific exercise types differ from each other? To answer these questions, she recruited 90 university students and randomly assigned them to one of three exercise groups. After 8 weeks, she measured their well-being using a validated psychological scale.

This research question requires Analysis of Variance (ANOVA)—a method to compare means across three or more groups simultaneously. But finding a significant ANOVA result is just the beginning; we need post-hoc tests to identify exactly which groups differ from each other.

2. What is ANOVA?

Analysis of Variance (ANOVA) is a statistical method for comparing means across multiple groups. While a t-test compares two groups, ANOVA extends this logic to three or more groups while controlling for Type I error (false positives).

Why not just do multiple t-tests?

If we compared three groups using multiple t-tests (Yoga vs. Strength, Yoga vs. Aerobic, Strength vs. Aerobic), we would conduct 3 tests. Each test has a 5% chance of a Type I error. The probability of making at least one Type I error inflates to approximately 14.3%. With more groups, this problem gets worse. ANOVA controls this error rate by testing all groups simultaneously.
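The inflation described above is easy to verify in a couple of lines of R (a quick sketch using the numbers from this example: a per-test alpha of .05 and three pairwise comparisons):

```r
alpha <- 0.05               # per-test Type I error rate
m <- choose(3, 2)           # number of pairwise tests among 3 groups: 3
fwer <- 1 - (1 - alpha)^m   # probability of at least one false positive
round(fwer, 3)              # 0.143, i.e. roughly 14.3%
```

With five groups there are `choose(5, 2) = 10` pairwise tests and the same calculation gives about 40%, which is why the problem worsens quickly.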

The Logic of ANOVA:

ANOVA partitions the total variance in our data into two components:

  • Between-group variance: How much do the group means differ from each other?
  • Within-group variance: How much do individual observations vary within each group?

If the between-group variance is substantially larger than the within-group variance, we have evidence that the groups differ.

3. The ANOVA Model in Plain Form

The one-way ANOVA model can be written as:

\[ Y_{ij} = \mu + \alpha_i + \varepsilon_{ij} \]

where:

  • \(Y_{ij}\) is the observation for the \(j\)-th individual in the \(i\)-th group
  • \(\mu\) is the overall (grand) mean
  • \(\alpha_i\) is the effect of the \(i\)-th group (its deviation from the grand mean)
  • \(\varepsilon_{ij}\) is the random error for individual \(j\) in group \(i\)

We assume that \(\varepsilon_{ij} \sim N(0, \sigma^2)\)—errors are normally distributed with constant variance.

The null and alternative hypotheses:

\[ H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k \]

\[ H_1: \text{At least one group mean differs} \]

The test statistic is the F-ratio:

\[ F = \frac{\text{Between-group variance}}{\text{Within-group variance}} = \frac{MS_{between}}{MS_{within}} \]

where \(MS\) stands for “Mean Square.” A large F-value suggests that between-group differences are larger than what we’d expect by chance.

4. Simulating a Dataset

Let’s create a dataset similar to Dr. Martinez’s study. We’ll simulate well-being scores for three exercise groups with known differences.

# Set seed for reproducibility
set.seed(123)

# Number of participants per group
n_per_group <- 30

# Create exercise groups
exercise_type <- rep(c("Yoga", "Strength", "Aerobic"), each = n_per_group)

# Simulate well-being scores with different means
# Yoga: mean = 72, SD = 10
# Strength: mean = 68, SD = 10
# Aerobic: mean = 75, SD = 10

wellbeing <- c(
  rnorm(n_per_group, mean = 72, sd = 10),  # Yoga
  rnorm(n_per_group, mean = 68, sd = 10),  # Strength
  rnorm(n_per_group, mean = 75, sd = 10)   # Aerobic
)

# Create dataframe
exercise_data <- data.frame(
  exercise_type = factor(exercise_type, levels = c("Yoga", "Strength", "Aerobic")),
  wellbeing = wellbeing
)

# Display first few rows
head(exercise_data)
##   exercise_type wellbeing
## 1          Yoga  66.39524
## 2          Yoga  69.69823
## 3          Yoga  87.58708
## 4          Yoga  72.70508
## 5          Yoga  73.29288
## 6          Yoga  89.15065
# Summary statistics by group
library(dplyr)
exercise_data %>%
  group_by(exercise_type) %>%
  summarise(
    N = n(),
    Mean = mean(wellbeing),
    SD = sd(wellbeing),
    Min = min(wellbeing),
    Max = max(wellbeing)
  )
## # A tibble: 3 × 6
##   exercise_type     N  Mean    SD   Min   Max
##   <fct>         <int> <dbl> <dbl> <dbl> <dbl>
## 1 Yoga             30  71.5  9.81  52.3  89.9
## 2 Strength         30  69.8  8.35  52.5  89.7
## 3 Aerobic          30  75.2  8.70  51.9  95.5

5. Visualizing the Relationship

Before running ANOVA, let’s visualize the data to see if there appear to be differences between groups.

library(ggplot2)

# Boxplot with individual points
ggplot(exercise_data, aes(x = exercise_type, y = wellbeing, fill = exercise_type)) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.4, size = 2) +
  labs(
    title = "Well-being Scores by Exercise Type",
    x = "Exercise Type",
    y = "Well-being Score"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    axis.line.x = element_line(color = "black"),
    axis.line.y = element_line(color = "black"),
    panel.border = element_blank(),
    axis.line.x.top = element_blank(),
    axis.line.y.right = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "none"
  )

The boxplot suggests that Aerobic exercise may have higher well-being scores, while Strength training appears slightly lower. The Yoga group falls in between. But are these differences statistically significant?

6. Fitting the ANOVA Model in R

We use the aov() function to fit an ANOVA model:

# Fit ANOVA model
anova_model <- aov(wellbeing ~ exercise_type, data = exercise_data)

# Display ANOVA table
summary(anova_model)
##               Df Sum Sq Mean Sq F value Pr(>F)  
## exercise_type  2    467  233.35   2.897 0.0605 .
## Residuals     87   7008   80.55                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
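The F value in this table is simply the ratio of the two mean squares. We can reproduce it by hand from the printed values (a quick check using the rounded numbers shown above):

```r
ms_between <- 233.35   # Mean Sq for exercise_type
ms_within  <- 80.55    # Mean Sq for Residuals

round(ms_between / ms_within, 3)  # 2.897, matching the F value column
```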

7. Interpreting the Results

Let’s extract and interpret the key values from our ANOVA output:

# Get ANOVA summary
anova_summary <- summary(anova_model)
anova_table <- anova_summary[[1]]

# Extract key statistics
f_value <- anova_table$`F value`[1]
p_value <- anova_table$`Pr(>F)`[1]
df_between <- anova_table$Df[1]
df_within <- anova_table$Df[2]

cat("F-statistic:", round(f_value, 3), "\n")
## F-statistic: 2.897
cat("p-value:", format.pval(p_value, digits = 3), "\n")
## p-value: 0.0605
cat("Degrees of freedom:", df_between, "and", df_within, "\n")
## Degrees of freedom: 2 and 87

Interpretation:

The F-statistic is 2.897 with 2 and 87 degrees of freedom, yielding a p-value of 0.0605. Because the p-value exceeds the conventional α = .05 threshold, we fail to reject the null hypothesis: the evidence for differences in well-being across the three exercise types is suggestive, but not statistically significant.

Note also that even a significant ANOVA only tells us that at least one group differs—it doesn’t tell us which groups differ. For that, we need post-hoc tests.

Effect Size: Eta-squared (\(\eta^2\))

# Calculate eta-squared
ss_between <- anova_table$`Sum Sq`[1]
ss_total <- sum(anova_table$`Sum Sq`)
eta_squared <- ss_between / ss_total

cat("Eta-squared:", round(eta_squared, 3), "\n")
## Eta-squared: 0.062

Eta-squared = 0.062, meaning approximately 6.2% of the variance in well-being scores is explained by exercise type. This is considered a medium effect size (Cohen, 1988).

8. Post-hoc Tests: Pairwise Comparisons

When the ANOVA is significant, we conduct post-hoc tests to determine which specific groups differ. Our omnibus test fell just short of significance (p = .061), but we will run the post-hoc procedure anyway for illustration. We’ll use Tukey’s Honest Significant Difference (HSD) test, which controls the family-wise error rate across all pairwise comparisons.

# Tukey HSD post-hoc test
tukey_results <- TukeyHSD(anova_model)
print(tukey_results)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = wellbeing ~ exercise_type, data = exercise_data)
## 
## $exercise_type
##                       diff         lwr       upr     p adj
## Strength-Yoga    -1.745579 -7.27108268  3.779925 0.7324555
## Aerobic-Yoga      3.715241 -1.81026213  9.240745 0.2497115
## Aerobic-Strength  5.460821 -0.06468305 10.986324 0.0534551

Interpreting Tukey HSD Results:

# Extract pairwise comparison details
tukey_diff <- tukey_results$exercise_type

# Display in a cleaner format
library(knitr)
kable(round(tukey_diff, 3), caption = "Tukey HSD Pairwise Comparisons")
Table: Tukey HSD Pairwise Comparisons

| Comparison       |   diff |    lwr |    upr | p adj |
|------------------|-------:|-------:|-------:|------:|
| Strength-Yoga    | -1.746 | -7.271 |  3.780 | 0.732 |
| Aerobic-Yoga     |  3.715 | -1.810 |  9.241 | 0.250 |
| Aerobic-Strength |  5.461 | -0.065 | 10.986 | 0.053 |

Let’s examine each comparison:

  1. Strength vs. Yoga: Mean difference = -1.75, p = .732
    • No significant difference between strength training and yoga
  2. Aerobic vs. Yoga: Mean difference = 3.72, p = .250
    • Aerobic exercise scores higher on average, but the difference is not significant
  3. Aerobic vs. Strength: Mean difference = 5.46, p = .053
    • The largest difference in the study, but it narrowly misses the α = .05 threshold

9. Visualizing Post-hoc Results

# Visualize Tukey HSD results
par(mar = c(5, 10, 4, 2))
plot(tukey_results, las = 1, col = "steelblue")
abline(v = 0, lty = 2, col = "red")

Confidence intervals that don’t cross zero indicate significant differences. Here, all three intervals include zero: Aerobic-Strength (-0.06 to 10.99) only barely does, consistent with its adjusted p-value of .053, while the other two comparisons are clearly non-significant.

Mean Comparison Summary:

Let’s create a summary table showing the means and groupings based on the Tukey test:

# Summary of means by group
mean_summary <- exercise_data %>%
  group_by(exercise_type) %>%
  summarise(
    Mean = mean(wellbeing),
    SD = sd(wellbeing),
    N = n()
  )

print(mean_summary)
## # A tibble: 3 × 4
##   exercise_type  Mean    SD     N
##   <fct>         <dbl> <dbl> <int>
## 1 Yoga           71.5  9.81    30
## 2 Strength       69.8  8.35    30
## 3 Aerobic        75.2  8.70    30

10. Checking ANOVA Assumptions

ANOVA relies on three key assumptions:

  1. Independence of observations
  2. Normality of residuals
  3. Homogeneity of variance (equal variances across groups)

Normality Check:

# Q-Q plot
qqnorm(residuals(anova_model))
qqline(residuals(anova_model), col = "red")

# Shapiro-Wilk test
shapiro.test(residuals(anova_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(anova_model)
## W = 0.99295, p-value = 0.917

The Q-Q plot shows points roughly following the line, and the Shapiro-Wilk test is non-significant, indicating normality is satisfied.

Homogeneity of Variance:

# Levene's test
library(car)
leveneTest(wellbeing ~ exercise_type, data = exercise_data)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2  0.6446 0.5274
##       87

Levene’s test is non-significant, indicating that the assumption of equal variances is met.

Residual Plot:

# Residuals vs fitted values
plot(anova_model, which = 1)

The residuals show random scatter around zero with no clear pattern, suggesting the assumptions are satisfied.

11. Alternative Post-hoc Methods

While Tukey’s HSD is the most common post-hoc test, other methods exist:

Bonferroni Correction:

# Pairwise t-tests with Bonferroni correction
pairwise.t.test(exercise_data$wellbeing, exercise_data$exercise_type, 
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  exercise_data$wellbeing and exercise_data$exercise_type 
## 
##          Yoga  Strength
## Strength 1.000 -       
## Aerobic  0.337 0.062   
## 
## P value adjustment method: bonferroni

Bonferroni is more conservative than Tukey’s HSD and reduces power but provides stronger control over Type I error.

Holm Adjustment:

# Pairwise t-tests with Holm adjustment
pairwise.t.test(exercise_data$wellbeing, exercise_data$exercise_type, 
                p.adjust.method = "holm")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  exercise_data$wellbeing and exercise_data$exercise_type 
## 
##          Yoga  Strength
## Strength 0.453 -       
## Aerobic  0.225 0.062   
## 
## P value adjustment method: holm

The Holm method is less conservative than Bonferroni but still controls family-wise error rate.
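The difference between the two corrections is easiest to see by applying p.adjust() directly to a small set of raw p-values (the three values below are hypothetical, chosen only to make the contrast visible):

```r
p_raw <- c(0.010, 0.030, 0.040)         # hypothetical unadjusted p-values

p.adjust(p_raw, method = "bonferroni")  # 0.03 0.09 0.12: each multiplied by 3
p.adjust(p_raw, method = "holm")        # 0.03 0.06 0.06: stepwise, less severe
```

Holm multiplies the smallest p-value by the number of tests, the next by one fewer, and so on (enforcing monotonicity), which is why its adjusted values are never larger than Bonferroni’s.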

12. Reporting Results in APA Style

Here’s how to report these results in an academic paper:

A one-way ANOVA was conducted to examine the effect of exercise type (Yoga, Strength Training, Aerobic Exercise) on psychological well-being scores. The assumption of homogeneity of variance was tested and satisfied via Levene’s test, F(2, 87) = 0.64, p = .527. Normality was assessed using the Shapiro-Wilk test and confirmed, W = 0.99, p = .917.

The ANOVA did not reveal a statistically significant effect of exercise type on well-being scores, F(2, 87) = 2.90, p = .061, \(\eta^2\) = 0.062. Post-hoc comparisons using Tukey’s HSD test indicated that the largest pairwise difference, between aerobic exercise (M = 75.24, SD = 8.70) and strength training (M = 69.78, SD = 8.35), approached but did not reach significance, p = .053. Differences involving yoga (M = 71.53, SD = 9.81) were likewise non-significant at the α = .05 level.

13. Practical Considerations

When to Use ANOVA:

  • Comparing means across three or more independent groups
  • When assumptions of normality and homogeneity of variance are met
  • With continuous dependent variables

Power Considerations:

  • Larger sample sizes increase power to detect true differences
  • Effect sizes help interpret practical significance
  • A significant ANOVA with a small effect size may not be clinically meaningful

Alternatives to ANOVA:

  • Kruskal-Wallis test: Non-parametric alternative when normality is violated
  • Welch’s ANOVA: When homogeneity of variance is violated
  • Repeated measures ANOVA: When the same participants are measured multiple times
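The first two alternatives are available in base R. A minimal sketch, assuming the exercise_data frame from the simulation above is still in the workspace:

```r
# Kruskal-Wallis: rank-based test, no normality assumption
kruskal.test(wellbeing ~ exercise_type, data = exercise_data)

# Welch's ANOVA: does not assume equal variances across groups
oneway.test(wellbeing ~ exercise_type, data = exercise_data, var.equal = FALSE)
```

With our well-behaved simulated data these should lead to the same substantive conclusion as the standard ANOVA; they matter most when the assumption checks in Section 10 fail.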

Common Mistakes:

  • Conducting multiple t-tests instead of ANOVA (inflates Type I error)
  • Reporting ANOVA results without post-hoc tests
  • Ignoring assumption violations
  • Confusing statistical significance with practical importance

14. Conclusion

ANOVA is a foundational statistical method for comparing means across multiple groups while controlling the Type I error rate. In our exercise and well-being study, the omnibus test fell just short of statistical significance (p = .061): aerobic exercise was associated with the highest mean well-being scores and strength training with the lowest, but no pairwise difference reached the α = .05 threshold. A larger sample would provide more power to detect differences of this size, if they are real.

The key takeaway: ANOVA tells us if groups differ, while post-hoc tests tell us which groups differ. Always check assumptions, report effect sizes, and consider the practical significance of your findings alongside statistical significance.

Understanding ANOVA opens the door to more complex designs including two-way ANOVA (multiple factors), repeated measures ANOVA (within-subjects designs), and mixed-design ANOVAs combining between- and within-subjects factors.