A university implemented a new workplace wellness program for its administrative staff, featuring mindfulness training, ergonomic assessments, and flexible break schedules. After six months, the human resources director conducted a survey measuring job satisfaction scores before and after the intervention. The t-test showed a statistically significant difference (p = 0.032), and she proudly announced the program’s “success” to the administration.
However, when the department chair asked, “But how much did satisfaction actually improve?”, she realized the p-value only told part of the story. Statistical significance indicates that an effect likely exists, but it doesn’t reveal whether that effect is trivial, moderate, or substantial. With 200 employees in the study, even a tiny improvement could be “statistically significant” but practically meaningless.
This is where effect sizes become essential. They quantify the magnitude of differences or relationships, allowing researchers to communicate not just whether something works, but how well it works—information crucial for making informed decisions about interventions, policies, and resource allocation.
Effect size is a quantitative measure of the magnitude of a phenomenon. Unlike p-values, which tell us the probability of observing our data if the null hypothesis were true, effect sizes tell us the strength or magnitude of the relationship or difference we’ve found.
Effect sizes serve several critical purposes:

- They convey the practical importance of a finding, not just whether an effect exists.
- They make results comparable across studies that use different samples and measures.
- They are the key input for power analysis and sample-size planning.
There are many types of effect sizes, each appropriate for different statistical tests and research designs.
Used for comparing two groups (t-tests), Cohen’s d expresses the difference between means in standard deviation units:
\[ d = \frac{\bar{X}_1 - \bar{X}_2}{s_{\text{pooled}}} \]
where \(\bar{X}_1\) and \(\bar{X}_2\) are the group means, and \(s_{\text{pooled}}\) is the pooled standard deviation:
\[ s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]
Cohen’s interpretation guidelines:

- Small effect: \(|d| = 0.2\)
- Medium effect: \(|d| = 0.5\)
- Large effect: \(|d| = 0.8\)
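For example, using round numbers similar to the wellness scenario below (a treatment mean of 71, a control mean of 65, and a pooled standard deviation of 12):

\[ d = \frac{71 - 65}{12} = 0.50, \]

a medium effect by these benchmarks.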
Used in ANOVA contexts, eta-squared represents the proportion of total variance explained by a factor:
\[ \eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}} \]
Partial eta-squared represents the proportion of variance explained controlling for other factors:
\[ \eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}} \]
Interpretation guidelines for η²:

- Small effect: 0.01
- Medium effect: 0.06
- Large effect: 0.14
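As a quick hypothetical check of the formula: if a factor accounts for 140 units out of a total sum of squares of 2,000, then

\[ \eta^2 = \frac{140}{2000} = 0.07, \]

a medium effect by these benchmarks.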
A less biased estimate than eta-squared, omega-squared adjusts for sample size:
\[ \omega^2 = \frac{SS_{\text{effect}} - (df_{\text{effect}})(MS_{\text{error}})}{SS_{\text{total}} + MS_{\text{error}}} \]
Used in regression, R² represents the proportion of variance in the dependent variable explained by the model:
\[ R^2 = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}} \]
Pearson’s r itself serves as an effect size, representing the strength of linear association:
\[ r = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum(X_i - \bar{X})^2 \sum(Y_i - \bar{Y})^2}} \]
Interpretation guidelines for |r|:

- Small effect: 0.10
- Medium effect: 0.30
- Large effect: 0.50
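Squaring r gives the proportion of shared variance, which is often easier to communicate: a “medium” r = 0.30 corresponds to \(r^2 = 0.09\), meaning only 9% of the variance in one variable is associated with the other.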
We’ll create multiple datasets to demonstrate effect size calculations across different statistical contexts.
set.seed(2024)
# Scenario 1: Two-group comparison (Wellness Program)
# Pre-intervention satisfaction scores (control group)
control_satisfaction <- rnorm(100, mean = 65, sd = 12)
# Post-intervention satisfaction scores (treatment group)
treatment_satisfaction <- rnorm(100, mean = 71, sd = 12)
# Create data frame
wellness_data <- data.frame(
satisfaction = c(control_satisfaction, treatment_satisfaction),
group = factor(rep(c("Control", "Treatment"), each = 100))
)
# Scenario 2: ANOVA (Training Method Comparison)
# Three different employee training methods
traditional <- rnorm(50, mean = 75, sd = 10)
online <- rnorm(50, mean = 78, sd = 10)
blended <- rnorm(50, mean = 82, sd = 10)
training_data <- data.frame(
performance = c(traditional, online, blended),
method = factor(rep(c("Traditional", "Online", "Blended"), each = 50))
)
# Scenario 3: Correlation (Work Hours and Burnout)
work_hours <- rnorm(120, mean = 45, sd = 8)
burnout <- 20 + 0.8 * work_hours + rnorm(120, mean = 0, sd = 8)
correlation_data <- data.frame(
work_hours = work_hours,
burnout = burnout
)
# Scenario 4: Regression (Multiple Predictors of Job Performance)
n <- 150
age <- rnorm(n, mean = 35, sd = 8)
experience <- rnorm(n, mean = 8, sd = 4)
training_score <- rnorm(n, mean = 75, sd = 10)
performance <- 40 + 0.3 * age + 1.2 * experience + 0.4 * training_score + rnorm(n, mean = 0, sd = 5)
regression_data <- data.frame(
performance = performance,
age = age,
experience = experience,
training_score = training_score
)
Before calculating effect sizes, let’s visualize our data to understand the patterns.
library(ggplot2)
# Plot 1: Two-group comparison
p1 <- ggplot(wellness_data, aes(x = group, y = satisfaction, fill = group)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.3) +
labs(title = "Wellness Program: Satisfaction Scores by Group",
x = "Group", y = "Satisfaction Score") +
scale_fill_manual(values = c("#E74C3C", "#3498DB")) +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none"
)
# Plot 2: ANOVA
p2 <- ggplot(training_data, aes(x = method, y = performance, fill = method)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.3) +
labs(title = "Training Methods: Performance Scores",
x = "Training Method", y = "Performance Score") +
scale_fill_manual(values = c("#E74C3C", "#F39C12", "#27AE60")) +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none"
)
# Plot 3: Correlation
p3 <- ggplot(correlation_data, aes(x = work_hours, y = burnout)) +
geom_point(alpha = 0.6, color = "#3498DB") +
geom_smooth(method = "lm", se = TRUE, color = "#E74C3C") +
labs(title = "Work Hours and Burnout Relationship",
x = "Weekly Work Hours", y = "Burnout Score") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
# Plot 4: Multiple scatter (simplified)
p4 <- ggplot(regression_data, aes(x = training_score, y = performance)) +
geom_point(alpha = 0.6, color = "#9B59B6") +
geom_smooth(method = "lm", se = TRUE, color = "#E74C3C") +
labs(title = "Training Score vs. Job Performance",
x = "Training Score", y = "Performance") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
# Arrange plots
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)
Now we calculate effect sizes for each scenario using actual R output.
# Perform t-test
t_test_result <- t.test(satisfaction ~ group, data = wellness_data)
# Calculate Cohen's d manually
mean_control <- mean(wellness_data$satisfaction[wellness_data$group == "Control"])
mean_treatment <- mean(wellness_data$satisfaction[wellness_data$group == "Treatment"])
sd_control <- sd(wellness_data$satisfaction[wellness_data$group == "Control"])
sd_treatment <- sd(wellness_data$satisfaction[wellness_data$group == "Treatment"])
n_control <- sum(wellness_data$group == "Control")
n_treatment <- sum(wellness_data$group == "Treatment")
# Pooled standard deviation
pooled_sd <- sqrt(((n_control - 1) * sd_control^2 + (n_treatment - 1) * sd_treatment^2) /
(n_control + n_treatment - 2))
# Cohen's d
cohens_d <- (mean_treatment - mean_control) / pooled_sd
# Display results
cat("Two-Group Comparison Results:\n")
## Two-Group Comparison Results:
cat("Control Mean:", round(mean_control, 2), "\n")
## Control Mean: 63.98
cat("Treatment Mean:", round(mean_treatment, 2), "\n")
## Treatment Mean: 72.58
cat("Mean Difference:", round(mean_treatment - mean_control, 2), "\n")
## Mean Difference: 8.6
cat("t-statistic:", round(t_test_result$statistic, 3), "\n")
## t-statistic: -4.961
cat("p-value:", round(t_test_result$p.value, 4), "\n")
## p-value: 0
cat("Cohen's d:", round(cohens_d, 3), "\n")
## Cohen's d: 0.702
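For routine analyses, a package can compute this directly. A minimal cross-check, assuming the effectsize package is installed (its default is the same pooled-SD d computed above; note the sign follows factor-level order, so Control minus Treatment comes out negative):

# Optional cross-check with the effectsize package (assumed installed)
library(effectsize)
cohens_d(satisfaction ~ group, data = wellness_data)  # approximately -0.70, with a 95% CI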
# Perform ANOVA
anova_model <- aov(performance ~ method, data = training_data)
anova_summary <- summary(anova_model)
# Extract SS components
ss_effect <- anova_summary[[1]]["method", "Sum Sq"]
ss_total <- sum(anova_summary[[1]][, "Sum Sq"])
ss_error <- anova_summary[[1]]["Residuals", "Sum Sq"]
df_effect <- anova_summary[[1]]["method", "Df"]
ms_error <- anova_summary[[1]]["Residuals", "Mean Sq"]
# Calculate eta-squared
eta_squared <- ss_effect / ss_total
# Calculate partial eta-squared
partial_eta_squared <- ss_effect / (ss_effect + ss_error)
# Calculate omega-squared
omega_squared <- (ss_effect - df_effect * ms_error) / (ss_total + ms_error)
# Extract F and p-value
f_value <- anova_summary[[1]]["method", "F value"]
p_value <- anova_summary[[1]]["method", "Pr(>F)"]
# Display results
cat("ANOVA Results:\n")
## ANOVA Results:
cat("F-statistic:", round(f_value, 3), "\n")
## F-statistic: 5.618
cat("p-value:", format(p_value, scientific = TRUE, digits = 3), "\n")
## p-value: 4.46e-03
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.071
cat("Partial Eta-squared (η²ₚ):", round(partial_eta_squared, 4), "\n")
## Partial Eta-squared (η²ₚ): 0.071
cat("Omega-squared (ω²):", round(omega_squared, 4), "\n")
## Omega-squared (ω²): 0.058
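The same quantities can be obtained from a package as a sanity check; a minimal sketch, assuming the effectsize package is installed:

# Optional cross-check (values should agree with the manual calculations above)
library(effectsize)
eta_squared(anova_model, partial = FALSE)    # eta-squared
eta_squared(anova_model, partial = TRUE)     # partial eta-squared
omega_squared(anova_model, partial = FALSE)  # omega-squared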
# Calculate correlation
cor_test <- cor.test(correlation_data$work_hours, correlation_data$burnout)
r_value <- cor_test$estimate
r_squared <- r_value^2
# Display results
cat("Correlation Results:\n")
## Correlation Results:
cat("Pearson's r:", round(r_value, 3), "\n")
## Pearson's r: 0.655
cat("R² (variance explained):", round(r_squared, 3), "\n")
## R² (variance explained): 0.429
cat("t-statistic:", round(cor_test$statistic, 3), "\n")
## t-statistic: 9.412
cat("p-value:", format(cor_test$p.value, scientific = TRUE, digits = 3), "\n")
## p-value: 4.96e-16
# Fit multiple regression model
reg_model <- lm(performance ~ age + experience + training_score,
data = regression_data)
reg_summary <- summary(reg_model)
# Extract R²
r_squared_reg <- reg_summary$r.squared
adj_r_squared <- reg_summary$adj.r.squared
f_stat <- reg_summary$fstatistic[1]
f_p_value <- pf(f_stat, reg_summary$fstatistic[2],
reg_summary$fstatistic[3], lower.tail = FALSE)
# Display results
cat("Multiple Regression Results:\n")
## Multiple Regression Results:
cat("R²:", round(r_squared_reg, 4), "\n")
## R²: 0.6431
cat("Adjusted R²:", round(adj_r_squared, 4), "\n")
## Adjusted R²: 0.6358
cat("F-statistic:", round(f_stat, 3), "\n")
## F-statistic: 87.702
cat("p-value:", format(f_p_value, scientific = TRUE, digits = 3), "\n")
## p-value: 1.68e-32
Let’s interpret each effect size in practical terms.
The wellness program showed a Cohen’s d of 0.702, which sits between Cohen’s medium (0.5) and large (0.8) benchmarks, a medium-to-large effect. This means that the treatment group’s satisfaction scores were approximately 0.7 standard deviations higher than the control group’s.
Practical interpretation: While the p-value (printed as 0 only because of rounding; in fact p < .001) indicates statistical significance, the effect size reveals that the improvement is substantial and meaningful. Average satisfaction increased by 8.6 points on the scale.
The eta-squared value of 0.071 indicates that approximately 7.1% of the variance in performance scores is explained by the training method. With a single factor, partial η² equals η², which is why both values are 0.071 here. The omega-squared (0.058) provides a more conservative estimate at 5.8%.
Practical interpretation: According to Cohen’s guidelines (small = 0.01, medium = 0.06, large = 0.14), this represents a medium effect. The training method makes a moderate difference in employee performance, suggesting that the choice of training approach has meaningful implications.
The correlation coefficient of 0.655 indicates a strong positive relationship between work hours and burnout. The R² value of 0.429 shows that 42.9% of the variance in burnout scores is explained by work hours.
Practical interpretation: The correlation tells us the association is strong, but not how many burnout points each additional weekly hour adds; that requires the regression slope (see the sketch below). Either way, the relationship is strong and concerning, suggesting that work hours are an important factor in employee burnout.
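To see that per-hour change in concrete terms, a simple regression of burnout on work hours gives the slope directly. A quick sketch (slope_model is just an illustrative name; with this simulated data the estimate should land near the generating value of 0.8 points per hour):

# Estimated change in burnout score per additional weekly work hour
slope_model <- lm(burnout ~ work_hours, data = correlation_data)
round(coef(slope_model)["work_hours"], 2)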
The multiple regression model explains 64.3% of the variance in job performance (R² = 0.6431). The adjusted R² of 0.6358 accounts for the number of predictors in the model.
Practical interpretation: This represents a strong model. The combination of age, experience, and training score provides substantial predictive power for job performance, explaining roughly two-thirds of the individual differences in performance scores.
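R² describes the model as a whole; for the contribution of individual predictors, one common effect size is the standardized (beta) coefficient. A sketch, standardizing all variables before refitting (std_model is an illustrative name, not part of the analysis above):

# Standardized coefficients: change in performance (in SDs) per 1-SD change in each predictor
std_model <- lm(scale(performance) ~ scale(age) + scale(experience) + scale(training_score),
                data = regression_data)
round(coef(std_model), 3)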
Confidence intervals provide a range of plausible values for the population effect size.
# Function to calculate CI for Cohen's d
cohens_d_ci <- function(d, n1, n2, conf.level = 0.95) {
# Large-sample approximation (an exact interval would use the non-central t distribution)
df <- n1 + n2 - 2
# Approximate standard error of d
se <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
# Critical value
alpha <- 1 - conf.level
t_crit <- qt(1 - alpha/2, df)
# CI
lower <- d - t_crit * se
upper <- d + t_crit * se
return(c(lower = lower, upper = upper))
}
# Calculate CI for Cohen's d
d_ci <- cohens_d_ci(cohens_d, n_control, n_treatment)
cat("95% Confidence Interval for Cohen's d:\n")
## 95% Confidence Interval for Cohen's d:
cat("[", round(d_ci[1], 3), ",", round(d_ci[2], 3), "]\n\n")
## [ 0.414 , 0.989 ]
# CI for correlation (using Fisher's Z transformation)
fisher_z <- 0.5 * log((1 + r_value) / (1 - r_value))
se_z <- 1 / sqrt(nrow(correlation_data) - 3)
z_crit <- qnorm(0.975)
z_lower <- fisher_z - z_crit * se_z
z_upper <- fisher_z + z_crit * se_z
# Transform back to r
r_lower <- (exp(2 * z_lower) - 1) / (exp(2 * z_lower) + 1)
r_upper <- (exp(2 * z_upper) - 1) / (exp(2 * z_upper) + 1)
cat("95% Confidence Interval for Pearson's r:\n")
## 95% Confidence Interval for Pearson's r:
cat("[", round(r_lower, 3), ",", round(r_upper, 3), "]\n")
## [ 0.539 , 0.746 ]
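As a cross-check, cor.test() already reports this interval (it uses the same Fisher z transformation internally), so the manual calculation above should match it:

# Built-in 95% CI for Pearson's r
round(cor_test$conf.int, 3)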
The confidence interval for Cohen’s d [0.414, 0.989] does not include zero, consistent with the significant t-test result. The interval is wide, however: the lower bound falls between the small and medium benchmarks while the upper bound exceeds the large benchmark, so the direction of the effect is clear but its precise magnitude remains uncertain.
Let’s create a visual summary of our effect sizes.
# Create data frame of effect sizes
effect_data <- data.frame(
Analysis = c("Wellness Program\n(Cohen's d)",
"Training Methods\n(η²)",
"Work-Burnout\n(r)",
"Performance Model\n(R²)"),
Effect_Size = c(cohens_d, eta_squared, r_value, r_squared_reg),
Type = c("Cohen's d", "Eta²", "Correlation", "R²"),
Lower_CI = c(d_ci[1], NA, r_lower, NA),
Upper_CI = c(d_ci[2], NA, r_upper, NA)
)
# Create plot
ggplot(effect_data, aes(x = Analysis, y = Effect_Size, fill = Type)) +
geom_bar(stat = "identity", alpha = 0.7, width = 0.6) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI),
width = 0.2, na.rm = TRUE) +
geom_hline(yintercept = 0, linetype = "solid", color = "black") +
labs(title = "Effect Sizes Across Different Analyses",
subtitle = "Error bars show 95% confidence intervals where available",
x = "Analysis",
y = "Effect Size Value") +
scale_fill_manual(values = c("#E74C3C", "#3498DB", "#27AE60", "#F39C12")) +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "right"
)
Effect sizes are crucial for power analysis—determining the sample size needed to detect an effect.
# Power analysis for detecting Cohen's d = 0.5 (medium effect)
# Using two independent samples t-test
# Required inputs
desired_power <- 0.80
alpha <- 0.05
target_d <- 0.5 # Medium effect
# Calculate required sample size per group
# Using approximation formula
n_per_group <- ceiling(2 * ((qnorm(1 - alpha/2) + qnorm(desired_power)) / target_d)^2)
cat("Power Analysis for Two-Group Comparison:\n")
## Power Analysis for Two-Group Comparison:
cat("To detect Cohen's d =", target_d, "with 80% power at α = 0.05:\n")
## To detect Cohen's d = 0.5 with 80% power at α = 0.05:
cat("Required sample size per group:", n_per_group, "\n")
## Required sample size per group: 63
cat("Total sample size:", n_per_group * 2, "\n\n")
## Total sample size: 126
# Calculate power for different sample sizes
sample_sizes <- seq(20, 200, by = 20)
powers <- sapply(sample_sizes, function(n) {
ncp <- target_d * sqrt(n / 2) # Non-centrality parameter
crit_value <- qt(1 - alpha/2, df = 2*n - 2)
power <- 1 - pt(crit_value, df = 2*n - 2, ncp = ncp) +
pt(-crit_value, df = 2*n - 2, ncp = ncp)
return(power)
})
# Plot power curve
power_df <- data.frame(
n_per_group = sample_sizes,
power = powers
)
ggplot(power_df, aes(x = n_per_group, y = power)) +
geom_line(color = "#3498DB", linewidth = 1.2) +
geom_point(color = "#3498DB", size = 2) +
geom_hline(yintercept = 0.80, linetype = "dashed", color = "#E74C3C") +
geom_vline(xintercept = n_per_group, linetype = "dashed", color = "#27AE60") +
labs(title = "Statistical Power vs. Sample Size",
subtitle = paste("For detecting Cohen's d =", target_d),
x = "Sample Size per Group",
y = "Statistical Power (1 - β)") +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5)
)
This power curve shows that with approximately 63 participants per group, we achieve 80% power to detect a medium effect (d = 0.5). Larger samples provide higher power, reducing the risk of Type II errors (false negatives).
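The sample-size formula above is a normal approximation. The pwr package solves the exact noncentral-t version and gives a very similar answer (about 64 per group); a minimal sketch, assuming pwr is installed:

# Exact power calculation for an independent-samples t-test
library(pwr)
pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05,
           type = "two.sample", alternative = "two.sided")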
Here’s how to report each analysis with effect sizes in APA format:
Two-Group Comparison (t-test with Cohen’s d):
An independent samples t-test revealed that participants in the wellness program (M = 72.58, SD = 12.26) reported significantly higher job satisfaction than control participants (M = 63.98, SD = 12.27), t(198.00) = -4.96, p < .001, Cohen’s d = 0.70, 95% CI [0.41, 0.99]. This represents a medium-to-large effect size.
ANOVA (with η² and ω²):
A one-way ANOVA revealed a significant effect of training method on performance scores, F(2, 147) = 5.62, p = 0.004, η² = 0.071, ω² = 0.058. Approximately 7.1% of the variance in performance was explained by training method, indicating a medium effect.
Correlation:
Work hours and burnout scores were significantly positively correlated, r(118) = 0.65, p < .001, 95% CI [0.54, 0.75]. Approximately 42.9% of the variance in burnout was associated with work hours, representing a strong effect.
Multiple Regression:
A multiple regression analysis revealed that age, experience, and training score significantly predicted job performance, F(3, 146) = 87.7, p < .001, R² = 0.643, adjusted R² = 0.636. The model explained 64.3% of the variance in performance, indicating strong predictive utility.
Always report effect sizes alongside statistical significance tests. Major reporting guidelines (APA, AMA) now require or strongly recommend effect size reporting because p-values depend heavily on sample size, whereas effect sizes convey the magnitude of a finding, can be compared across studies, and feed directly into power analyses.
Cohen’s benchmarks (small = 0.2, medium = 0.5, large = 0.8 for d) are guidelines, not rules. Context matters:
Always consider:

- The specific research domain
- Cost-benefit considerations
- Practical implementation constraints
- Existing literature in your field
A statistically significant effect, even one with a large effect size, may not be practically significant if the outcome has little real-world meaning or the intervention is too costly or difficult to implement.
Conversely, small effect sizes can be meaningful when:

- Applied to large populations (small improvement × many people = substantial impact)
- Addressing serious outcomes (e.g., mortality, severe illness)
- Effects compound over time
- No better alternatives exist
Point estimates of effect sizes are subject to sampling error. Always report confidence intervals to indicate precision:
# Compare narrow vs. wide CI
cat("Small sample (n = 30 per group):\n")
## Small sample (n = 30 per group):
cat("Cohen's d might be 0.5, but CI could be [-0.2, 1.2]\n")
## Cohen's d might be 0.5, but CI could be [-0.2, 1.2]
cat("Conclusion: Uncertain whether small, medium, or large\n\n")
## Conclusion: Uncertain whether small, medium, or large
cat("Large sample (n = 300 per group):\n")
## Large sample (n = 300 per group):
cat("Cohen's d might be 0.5, with CI [0.4, 0.6]\n")
## Cohen's d might be 0.5, with CI [0.4, 0.6]
cat("Conclusion: Confidently a medium effect\n")
## Conclusion: Confidently a medium effect
Different effect size metrics have different scales:
Don’t directly compare d = 0.5 to η² = 0.5—they represent different quantities.
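If you need to put results on a common scale before comparing them, convert first. For two groups of equal size, a standard approximation linking d and r is

\[ r = \frac{d}{\sqrt{d^2 + 4}}, \]

so d = 0.5 corresponds to r ≈ 0.24, not 0.5.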
Cohen’s benchmarks are useful heuristics, not absolute truth. A d = 0.19 is not fundamentally different from d = 0.21, even though only the latter crosses the “small” benchmark.
Report whether effects are positive or negative:

- Cohen’s d = -0.5 means the first group scored lower
- r = -0.3 indicates a negative correlation
Direction matters for interpretation!
Effect sizes are essential complements to statistical significance testing, providing crucial information about the magnitude and practical importance of research findings. While p-values answer “Is there an effect?”, effect sizes answer “How large is the effect?”
Key takeaways:

- Statistical significance tells you an effect likely exists; the effect size tells you how large it is.
- Choose the metric that matches your design: Cohen’s d for two groups, η² or ω² for ANOVA, r for correlation, R² for regression.
- Report confidence intervals alongside point estimates of effect size.
- Interpret magnitude in context rather than leaning only on Cohen’s benchmarks.
- Use effect sizes to plan sample sizes through power analysis.
In our workplace scenarios, we saw how effect sizes revealed:

- A substantial improvement from the wellness program (d = 0.70)
- A moderate effect of training method on performance (η² = 0.071)
- A strong relationship between work hours and burnout (r = 0.65)
- Strong prediction of job performance from multiple factors (R² = 0.643)
By combining statistical significance with effect size reporting, we provide a complete picture of our research findings—enabling better decisions, more accurate interpretations, and more meaningful contributions to scientific knowledge.