A university implemented a new workplace wellness program for its administrative staff, featuring mindfulness training, ergonomic assessments, and flexible break schedules. After six months, the human resources director conducted a survey measuring job satisfaction scores before and after the intervention. The t-test showed a statistically significant difference (p = 0.032), and she proudly announced the program’s “success” to the administration.
However, when the department chair asked, “But how much did satisfaction actually improve?”, she realized the p-value only told part of the story. Statistical significance indicates that an effect likely exists, but it doesn’t reveal whether that effect is trivial, moderate, or substantial. With 200 employees in the study, even a tiny improvement could be “statistically significant” but practically meaningless.
This is where effect sizes become essential. They quantify the magnitude of differences or relationships, allowing researchers to communicate not just whether something works, but how well it works—information crucial for making informed decisions about interventions, policies, and resource allocation.
Effect size is a quantitative measure of the magnitude of a phenomenon. Unlike p-values, which tell us the probability of observing our data if the null hypothesis were true, effect sizes tell us the strength or magnitude of the relationship or difference we’ve found.
Effect sizes serve several critical purposes:

- They convey the practical importance of a finding, not just whether an effect exists.
- They make results comparable across studies that use different samples and measures.
- They are the key input for power analysis and sample-size planning.
There are many types of effect sizes, each appropriate for different statistical tests and research designs.
Used for comparing two groups (t-tests), Cohen’s d expresses the difference between means in standard deviation units:
\[ d = \frac{\bar{X}_1 - \bar{X}_2}{s_{\text{pooled}}} \]
where \(\bar{X}_1\) and \(\bar{X}_2\) are the group means, and \(s_{\text{pooled}}\) is the pooled standard deviation:
\[ s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]
Cohen’s interpretation guidelines:

- Small effect: \(|d| = 0.2\)
- Medium effect: \(|d| = 0.5\)
- Large effect: \(|d| = 0.8\)
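For example, using round numbers similar to the wellness scenario below (a treatment mean of 71, a control mean of 65, and a pooled standard deviation of 12):

\[ d = \frac{71 - 65}{12} = 0.50, \]

a medium effect by these benchmarks.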
Used in ANOVA contexts, eta-squared represents the proportion of total variance explained by a factor:
\[ \eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}} \]
Partial eta-squared represents the proportion of variance explained controlling for other factors:
\[ \eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}} \]
Interpretation guidelines for η²:

- Small effect: 0.01
- Medium effect: 0.06
- Large effect: 0.14
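As a quick hypothetical check of the formula: if a factor accounts for 140 units out of a total sum of squares of 2,000, then

\[ \eta^2 = \frac{140}{2000} = 0.07, \]

a medium effect by these benchmarks.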
A less biased estimate than eta-squared, omega-squared adjusts for sample size:
\[ \omega^2 = \frac{SS_{\text{effect}} - (df_{\text{effect}})(MS_{\text{error}})}{SS_{\text{total}} + MS_{\text{error}}} \]
Used in regression, R² represents the proportion of variance in the dependent variable explained by the model:
\[ R^2 = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}} \]
Pearson’s r itself serves as an effect size, representing the strength of linear association:
\[ r = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum(X_i - \bar{X})^2 \sum(Y_i - \bar{Y})^2}} \]
Interpretation guidelines for |r|:

- Small effect: 0.10
- Medium effect: 0.30
- Large effect: 0.50
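Squaring r gives the proportion of shared variance, which is often easier to communicate: a “medium” r = 0.30 corresponds to \(r^2 = 0.09\), meaning only 9% of the variance in one variable is associated with the other.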
We’ll create multiple datasets to demonstrate effect size calculations across different statistical contexts.
set.seed(2024)
# Scenario 1: Two-group comparison (Wellness Program)
# Pre-intervention satisfaction scores (control group)
control_satisfaction <- rnorm(100, mean = 65, sd = 12)
# Post-intervention satisfaction scores (treatment group)
treatment_satisfaction <- rnorm(100, mean = 71, sd = 12)
# Create data frame
wellness_data <- data.frame(
satisfaction = c(control_satisfaction, treatment_satisfaction),
group = factor(rep(c("Control", "Treatment"), each = 100))
)
# Scenario 2: ANOVA (Training Method Comparison)
# Three different employee training methods
traditional <- rnorm(50, mean = 75, sd = 10)
online <- rnorm(50, mean = 78, sd = 10)
blended <- rnorm(50, mean = 82, sd = 10)
training_data <- data.frame(
performance = c(traditional, online, blended),
method = factor(rep(c("Traditional", "Online", "Blended"), each = 50))
)
# Scenario 3: Correlation (Work Hours and Burnout)
work_hours <- rnorm(120, mean = 45, sd = 8)
burnout <- 20 + 0.8 * work_hours + rnorm(120, mean = 0, sd = 8)
correlation_data <- data.frame(
work_hours = work_hours,
burnout = burnout
)
# Scenario 4: Regression (Multiple Predictors of Job Performance)
n <- 150
age <- rnorm(n, mean = 35, sd = 8)
experience <- rnorm(n, mean = 8, sd = 4)
training_score <- rnorm(n, mean = 75, sd = 10)
performance <- 40 + 0.3 * age + 1.2 * experience + 0.4 * training_score + rnorm(n, mean = 0, sd = 5)
regression_data <- data.frame(
performance = performance,
age = age,
experience = experience,
training_score = training_score
)
Before calculating effect sizes, let’s visualize our data to understand the patterns.
library(ggplot2)
# Plot 1: Two-group comparison
p1 <- ggplot(wellness_data, aes(x = group, y = satisfaction, fill = group)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.3) +
labs(title = "Wellness Program: Satisfaction Scores by Group",
x = "Group", y = "Satisfaction Score") +
scale_fill_manual(values = c("#E74C3C", "#3498DB")) +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none"
)
# Plot 2: ANOVA
p2 <- ggplot(training_data, aes(x = method, y = performance, fill = method)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.3) +
labs(title = "Training Methods: Performance Scores",
x = "Training Method", y = "Performance Score") +
scale_fill_manual(values = c("#E74C3C", "#F39C12", "#27AE60")) +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none"
)
# Plot 3: Correlation
p3 <- ggplot(correlation_data, aes(x = work_hours, y = burnout)) +
geom_point(alpha = 0.6, color = "#3498DB") +
geom_smooth(method = "lm", se = TRUE, color = "#E74C3C") +
labs(title = "Work Hours and Burnout Relationship",
x = "Weekly Work Hours", y = "Burnout Score") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
# Plot 4: Multiple scatter (simplified)
p4 <- ggplot(regression_data, aes(x = training_score, y = performance)) +
geom_point(alpha = 0.6, color = "#9B59B6") +
geom_smooth(method = "lm", se = TRUE, color = "#E74C3C") +
labs(title = "Training Score vs. Job Performance",
x = "Training Score", y = "Performance") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
# Arrange plots
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)
Now we calculate effect sizes for each scenario using actual R output.
# Perform t-test
t_test_result <- t.test(satisfaction ~ group, data = wellness_data)
# Calculate Cohen's d manually
mean_control <- mean(wellness_data$satisfaction[wellness_data$group == "Control"])
mean_treatment <- mean(wellness_data$satisfaction[wellness_data$group == "Treatment"])
sd_control <- sd(wellness_data$satisfaction[wellness_data$group == "Control"])
sd_treatment <- sd(wellness_data$satisfaction[wellness_data$group == "Treatment"])
n_control <- sum(wellness_data$group == "Control")
n_treatment <- sum(wellness_data$group == "Treatment")
# Pooled standard deviation
pooled_sd <- sqrt(((n_control - 1) * sd_control^2 + (n_treatment - 1) * sd_treatment^2) /
(n_control + n_treatment - 2))
# Cohen's d
cohens_d <- (mean_treatment - mean_control) / pooled_sd
# Display results
cat("Two-Group Comparison Results:\n")
## Two-Group Comparison Results:
cat("Control Mean:", round(mean_control, 2), "\n")
## Control Mean: 63.98
cat("Treatment Mean:", round(mean_treatment, 2), "\n")
## Treatment Mean: 72.58
cat("Mean Difference:", round(mean_treatment - mean_control, 2), "\n")
## Mean Difference: 8.6
cat("t-statistic:", round(t_test_result$statistic, 3), "\n")
## t-statistic: -4.961
cat("p-value:", round(t_test_result$p.value, 4), "\n")
## p-value: 0
cat("Cohen's d:", round(cohens_d, 3), "\n")
## Cohen's d: 0.702
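For routine analyses, a package can compute this directly. A minimal cross-check, assuming the effectsize package is installed (its default is the same pooled-SD d computed above; note the sign follows factor-level order, so Control minus Treatment comes out negative):

# Optional cross-check with the effectsize package (assumed installed)
library(effectsize)
cohens_d(satisfaction ~ group, data = wellness_data)  # approximately -0.70, with a 95% CI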
# Perform ANOVA
anova_model <- aov(performance ~ method, data = training_data)
anova_summary <- summary(anova_model)
# Extract SS components
ss_effect <- anova_summary[[1]]["method", "Sum Sq"]
ss_total <- sum(anova_summary[[1]][, "Sum Sq"])
ss_error <- anova_summary[[1]]["Residuals", "Sum Sq"]
df_effect <- anova_summary[[1]]["method", "Df"]
ms_error <- anova_summary[[1]]["Residuals", "Mean Sq"]
# Calculate eta-squared
eta_squared <- ss_effect / ss_total
# Calculate partial eta-squared
partial_eta_squared <- ss_effect / (ss_effect + ss_error)
# Calculate omega-squared
omega_squared <- (ss_effect - df_effect * ms_error) / (ss_total + ms_error)
# Extract F and p-value
f_value <- anova_summary[[1]]["method", "F value"]
p_value <- anova_summary[[1]]["method", "Pr(>F)"]
# Display results
cat("ANOVA Results:\n")
## ANOVA Results:
cat("F-statistic:", round(f_value, 3), "\n")
## F-statistic: 5.618
cat("p-value:", format(p_value, scientific = TRUE, digits = 3), "\n")
## p-value: 4.46e-03
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.071
cat("Partial Eta-squared (η²ₚ):", round(partial_eta_squared, 4), "\n")
## Partial Eta-squared (η²ₚ): 0.071
cat("Omega-squared (ω²):", round(omega_squared, 4), "\n")
## Omega-squared (ω²): 0.058
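The same quantities can be obtained from a package as a sanity check; a minimal sketch, assuming the effectsize package is installed:

# Optional cross-check (values should agree with the manual calculations above)
library(effectsize)
eta_squared(anova_model, partial = FALSE)    # eta-squared
eta_squared(anova_model, partial = TRUE)     # partial eta-squared
omega_squared(anova_model, partial = FALSE)  # omega-squared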
# Calculate correlation
cor_test <- cor.test(correlation_data$work_hours, correlation_data$burnout)
r_value <- cor_test$estimate
r_squared <- r_value^2
# Display results
cat("Correlation Results:\n")
## Correlation Results:
cat("Pearson's r:", round(r_value, 3), "\n")
## Pearson's r: 0.655
cat("R² (variance explained):", round(r_squared, 3), "\n")
## R² (variance explained): 0.429
cat("t-statistic:", round(cor_test$statistic, 3), "\n")
## t-statistic: 9.412
cat("p-value:", format(cor_test$p.value, scientific = TRUE, digits = 3), "\n")
## p-value: 4.96e-16
# Fit multiple regression model
reg_model <- lm(performance ~ age + experience + training_score,
data = regression_data)
reg_summary <- summary(reg_model)
# Extract R²
r_squared_reg <- reg_summary$r.squared
adj_r_squared <- reg_summary$adj.r.squared
f_stat <- reg_summary$fstatistic[1]
f_p_value <- pf(f_stat, reg_summary$fstatistic[2],
reg_summary$fstatistic[3], lower.tail = FALSE)
# Display results
cat("Multiple Regression Results:\n")
## Multiple Regression Results:
cat("R²:", round(r_squared_reg, 4), "\n")
## R²: 0.6431
cat("Adjusted R²:", round(adj_r_squared, 4), "\n")
## Adjusted R²: 0.6358
cat("F-statistic:", round(f_stat, 3), "\n")
## F-statistic: 87.702
cat("p-value:", format(f_p_value, scientific = TRUE, digits = 3), "\n")
## p-value: 1.68e-32
Let’s interpret each effect size in practical terms.
The wellness program showed a Cohen’s d of 0.702, which sits between Cohen’s medium (0.5) and large (0.8) benchmarks, a medium-to-large effect. This means that the treatment group’s satisfaction scores were approximately 0.7 standard deviations higher than the control group’s.
Practical interpretation: While the p-value (printed as 0 only because of rounding; in fact p < .001) indicates statistical significance, the effect size reveals that the improvement is substantial and meaningful. Average satisfaction increased by 8.6 points on the scale.
The eta-squared value of 0.071 indicates that approximately 7.1% of the variance in performance scores is explained by the training method. With a single factor, partial η² equals η², which is why both values are 0.071 here. The omega-squared (0.058) provides a more conservative estimate at 5.8%.
Practical interpretation: According to Cohen’s guidelines (small = 0.01, medium = 0.06, large = 0.14), this represents a medium effect. The training method makes a moderate difference in employee performance, suggesting that the choice of training approach has meaningful implications.
The correlation coefficient of 0.655 indicates a strong positive relationship between work hours and burnout. The R² value of 0.429 shows that 42.9% of the variance in burnout scores is explained by work hours.
Practical interpretation: The correlation tells us the association is strong, but not how many burnout points each additional weekly hour adds; that requires the regression slope (see the sketch below). Either way, the relationship is strong and concerning, suggesting that work hours are an important factor in employee burnout.
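To see that per-hour change in concrete terms, a simple regression of burnout on work hours gives the slope directly. A quick sketch (slope_model is just an illustrative name; with this simulated data the estimate should land near the generating value of 0.8 points per hour):

# Estimated change in burnout score per additional weekly work hour
slope_model <- lm(burnout ~ work_hours, data = correlation_data)
round(coef(slope_model)["work_hours"], 2)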
The multiple regression model explains 64.3% of the variance in job performance (R² = 0.6431). The adjusted R² of 0.6358 accounts for the number of predictors in the model.
Practical interpretation: This represents a strong model. The combination of age, experience, and training score provides substantial predictive power for job performance, explaining roughly two-thirds of the individual differences in performance scores.
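R² describes the model as a whole; for the contribution of individual predictors, one common effect size is the standardized (beta) coefficient. A sketch, standardizing all variables before refitting (std_model is an illustrative name, not part of the analysis above):

# Standardized coefficients: change in performance (in SDs) per 1-SD change in each predictor
std_model <- lm(scale(performance) ~ scale(age) + scale(experience) + scale(training_score),
                data = regression_data)
round(coef(std_model), 3)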
Confidence intervals provide a range of plausible values for the population effect size.
# Function to calculate CI for Cohen's d
cohens_d_ci <- function(d, n1, n2, conf.level = 0.95) {
# Large-sample approximation (an exact interval would use the non-central t distribution)
df <- n1 + n2 - 2
# Approximate standard error of d
se <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
# Critical value
alpha <- 1 - conf.level
t_crit <- qt(1 - alpha/2, df)
# CI
lower <- d - t_crit * se
upper <- d + t_crit * se
return(c(lower = lower, upper = upper))
}
# Calculate CI for Cohen's d
d_ci <- cohens_d_ci(cohens_d, n_control, n_treatment)
cat("95% Confidence Interval for Cohen's d:\n")
## 95% Confidence Interval for Cohen's d:
cat("[", round(d_ci[1], 3), ",", round(d_ci[2], 3), "]\n\n")
## [ 0.414 , 0.989 ]
# CI for correlation (using Fisher's Z transformation)
fisher_z <- 0.5 * log((1 + r_value) / (1 - r_value))
se_z <- 1 / sqrt(nrow(correlation_data) - 3)
z_crit <- qnorm(0.975)
z_lower <- fisher_z - z_crit * se_z
z_upper <- fisher_z + z_crit * se_z
# Transform back to r
r_lower <- (exp(2 * z_lower) - 1) / (exp(2 * z_lower) + 1)
r_upper <- (exp(2 * z_upper) - 1) / (exp(2 * z_upper) + 1)
cat("95% Confidence Interval for Pearson's r:\n")
## 95% Confidence Interval for Pearson's r:
cat("[", round(r_lower, 3), ",", round(r_upper, 3), "]\n")
## [ 0.539 , 0.746 ]
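As a cross-check, cor.test() already reports this interval (it uses the same Fisher z transformation internally), so the manual calculation above should match it:

# Built-in 95% CI for Pearson's r
round(cor_test$conf.int, 3)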
The confidence interval for Cohen’s d [0.414, 0.989] does not include zero, consistent with the significant t-test result. The interval is wide, however: the lower bound falls between the small and medium benchmarks while the upper bound exceeds the large benchmark, so the direction of the effect is clear but its precise magnitude remains uncertain.
Let’s create a visual summary of our effect sizes.
# Create data frame of effect sizes
effect_data <- data.frame(
Analysis = c("Wellness Program\n(Cohen's d)",
"Training Methods\n(η²)",
"Work-Burnout\n(r)",
"Performance Model\n(R²)"),
Effect_Size = c(cohens_d, eta_squared, r_value, r_squared_reg),
Type = c("Cohen's d", "Eta²", "Correlation", "R²"),
Lower_CI = c(d_ci[1], NA, r_lower, NA),
Upper_CI = c(d_ci[2], NA, r_upper, NA)
)
# Create plot
ggplot(effect_data, aes(x = Analysis, y = Effect_Size, fill = Type)) +
geom_bar(stat = "identity", alpha = 0.7, width = 0.6) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI),
width = 0.2, na.rm = TRUE) +
geom_hline(yintercept = 0, linetype = "solid", color = "black") +
labs(title = "Effect Sizes Across Different Analyses",
subtitle = "Error bars show 95% confidence intervals where available",
x = "Analysis",
y = "Effect Size Value") +
scale_fill_manual(values = c("#E74C3C", "#3498DB", "#27AE60", "#F39C12")) +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "right"
)
Effect sizes are crucial for power analysis—determining the sample size needed to detect an effect.
# Power analysis for detecting Cohen's d = 0.5 (medium effect)
# Using two independent samples t-test
# Required inputs
desired_power <- 0.80
alpha <- 0.05
target_d <- 0.5 # Medium effect
# Calculate required sample size per group
# Using approximation formula
n_per_group <- ceiling(2 * ((qnorm(1 - alpha/2) + qnorm(desired_power)) / target_d)^2)
cat("Power Analysis for Two-Group Comparison:\n")
## Power Analysis for Two-Group Comparison:
cat("To detect Cohen's d =", target_d, "with 80% power at α = 0.05:\n")
## To detect Cohen's d = 0.5 with 80% power at α = 0.05:
cat("Required sample size per group:", n_per_group, "\n")
## Required sample size per group: 63
cat("Total sample size:", n_per_group * 2, "\n\n")
## Total sample size: 126
# Calculate power for different sample sizes
sample_sizes <- seq(20, 200, by = 20)
powers <- sapply(sample_sizes, function(n) {
ncp <- target_d * sqrt(n / 2) # Non-centrality parameter
crit_value <- qt(1 - alpha/2, df = 2*n - 2)
power <- 1 - pt(crit_value, df = 2*n - 2, ncp = ncp) +
pt(-crit_value, df = 2*n - 2, ncp = ncp)
return(power)
})
# Plot power curve
power_df <- data.frame(
n_per_group = sample_sizes,
power = powers
)
ggplot(power_df, aes(x = n_per_group, y = power)) +
geom_line(color = "#3498DB", linewidth = 1.2) +
geom_point(color = "#3498DB", size = 2) +
geom_hline(yintercept = 0.80, linetype = "dashed", color = "#E74C3C") +
geom_vline(xintercept = n_per_group, linetype = "dashed", color = "#27AE60") +
labs(title = "Statistical Power vs. Sample Size",
subtitle = paste("For detecting Cohen's d =", target_d),
x = "Sample Size per Group",
y = "Statistical Power (1 - β)") +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
axis.line.x.top = element_blank(),
axis.line.y.right = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5)
)
This power curve shows that with approximately 63 participants per group, we achieve 80% power to detect a medium effect (d = 0.5). Larger samples provide higher power, reducing the risk of Type II errors (false negatives).
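The sample-size formula above is a normal approximation. The pwr package solves the exact noncentral-t version and gives a very similar answer (about 64 per group); a minimal sketch, assuming pwr is installed:

# Exact power calculation for an independent-samples t-test
library(pwr)
pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05,
           type = "two.sample", alternative = "two.sided")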
Here’s how to report each analysis with effect sizes in APA format:
Two-Group Comparison (t-test with Cohen’s d):
An independent samples t-test revealed that participants in the wellness program (M = 72.58, SD = 12.26) reported significantly higher job satisfaction than control participants (M = 63.98, SD = 12.27), t(198.00) = -4.96, p < .001, Cohen’s d = 0.70, 95% CI [0.41, 0.99]. This represents a medium-to-large effect size.
ANOVA (with η² and ω²):
A one-way ANOVA revealed a significant effect of training method on performance scores, F(2, 147) = 5.62, p = 0.004, η² = 0.071, ω² = 0.058. Approximately 7.1% of the variance in performance was explained by training method, indicating a medium effect.
Correlation:
Work hours and burnout scores were significantly positively correlated, r(118) = 0.65, p < .001, 95% CI [0.54, 0.75]. Approximately 42.9% of the variance in burnout was associated with work hours, representing a strong effect.
Multiple Regression:
A multiple regression analysis revealed that age, experience, and training score significantly predicted job performance, F(3, 146) = 87.7, p < .001, R² = 0.643, adjusted R² = 0.636. The model explained 64.3% of the variance in performance, indicating strong predictive utility.
Always report effect sizes alongside statistical significance tests. Major reporting guidelines (APA, AMA) now require or strongly recommend effect size reporting because p-values depend heavily on sample size, whereas effect sizes convey the magnitude of a finding, can be compared across studies, and feed directly into power analyses.
Cohen’s benchmarks (small = 0.2, medium = 0.5, large = 0.8 for d) are guidelines, not rules. Context matters:
Always consider:

- The specific research domain
- Cost-benefit considerations
- Practical implementation constraints
- Existing literature in your field
A statistically significant effect, even one with a large effect size, may not be practically significant if the outcome has little real-world meaning or the intervention is too costly or difficult to implement.
Conversely, small effect sizes can be meaningful when:

- Applied to large populations (small improvement × many people = substantial impact)
- Addressing serious outcomes (e.g., mortality, severe illness)
- Effects compound over time
- No better alternatives exist
Point estimates of effect sizes are subject to sampling error. Always report confidence intervals to indicate precision:
# Compare narrow vs. wide CI
cat("Small sample (n = 30 per group):\n")
## Small sample (n = 30 per group):
cat("Cohen's d might be 0.5, but CI could be [-0.2, 1.2]\n")
## Cohen's d might be 0.5, but CI could be [-0.2, 1.2]
cat("Conclusion: Uncertain whether small, medium, or large\n\n")
## Conclusion: Uncertain whether small, medium, or large
cat("Large sample (n = 300 per group):\n")
## Large sample (n = 300 per group):
cat("Cohen's d might be 0.5, with CI [0.4, 0.6]\n")
## Cohen's d might be 0.5, with CI [0.4, 0.6]
cat("Conclusion: Confidently a medium effect\n")
## Conclusion: Confidently a medium effect
Different effect size metrics have different scales:
Don’t directly compare d = 0.5 to η² = 0.5—they represent different quantities.
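If you need to put results on a common scale before comparing them, convert first. For two groups of equal size, a standard approximation linking d and r is

\[ r = \frac{d}{\sqrt{d^2 + 4}}, \]

so d = 0.5 corresponds to r ≈ 0.24, not 0.5.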
Cohen’s benchmarks are useful heuristics, not absolute truth. A d = 0.19 is not fundamentally different from d = 0.21, even though only the latter crosses the “small” benchmark.
Report whether effects are positive or negative:

- Cohen’s d = -0.5 means the first group scored lower
- r = -0.3 indicates a negative correlation
Direction matters for interpretation!
Effect sizes are essential complements to statistical significance testing, providing crucial information about the magnitude and practical importance of research findings. While p-values answer “Is there an effect?”, effect sizes answer “How large is the effect?”
Key takeaways:

- Statistical significance tells you an effect likely exists; the effect size tells you how large it is.
- Choose the metric that matches your design: Cohen’s d for two groups, η² or ω² for ANOVA, r for correlation, R² for regression.
- Report confidence intervals alongside point estimates of effect size.
- Interpret magnitude in context rather than leaning only on Cohen’s benchmarks.
- Use effect sizes to plan sample sizes through power analysis.
In our workplace scenarios, we saw how effect sizes revealed:

- A substantial improvement from the wellness program (d = 0.70)
- A moderate effect of training method on performance (η² = 0.071)
- A strong relationship between work hours and burnout (r = 0.65)
- Strong prediction of job performance from multiple factors (R² = 0.643)
By combining statistical significance with effect size reporting, we provide a complete picture of our research findings—enabling better decisions, more accurate interpretations, and more meaningful contributions to scientific knowledge.