Back to Tutorials Download .Rmd

1. A developmental story

Imagine you are studying early vocal development in children. You record their actual age in months, but you also have a machine-learning model that produces a predicted vocal-development age based on acoustic features extracted from day-long recordings.

Children in your dataset do not all belong to the same developmental path. Some follow a typical trajectory (TD), some have mild hearing challenges (HH), some show language delays (LD), and some are diagnosed with autism (ASD). When you scatter all of these children onto the same plot, points form gentle coloured clouds. Some clouds are steep, others flatter. Some start higher, others lower. The story is simple: different developmental groups show different relations between chronological age and predicted vocal age.

This is exactly where multiple regression becomes a useful compass.


2. What is multiple regression?

Multiple regression is a statistical framework that asks a single big question: How does one variable change as a function of several predictors at the same time? The “several predictors” part matters. A child’s predicted vocal age may depend on their actual age, but also on which developmental group they belong to. And the effect of age may differ across groups. Regression lets you place these possibilities into a single equation.


3. The model in plain form

For our purpose, the model is a multiple linear regression with an interaction term:

\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 G_i + \beta_3 (X_i \times G_i) + \varepsilon_i \]

where \(Y_i\) represents the predicted vocal development age for child \(i\), \(X_i\) is their actual chronological age in months, and \(G_i\) is a group indicator variable representing the developmental category (TD, HH, LD, or ASD). The parameter \(\beta_0\) represents the baseline intercept, which is the starting point for the reference group. The coefficient \(\beta_1\) captures the slope for the reference group, showing the rate of vocal development per month. The term \(\beta_2\) represents the group effect on the intercept, indicating how much other groups differ from the reference at age zero. The interaction coefficient \(\beta_3\) captures how much the slope differs for other groups. Finally, \(\varepsilon_i\) is the random error term, assumed to follow a normal distribution with mean zero and constant variance:

\[ \varepsilon_i \sim N(0, \sigma^2) \]

Understanding the components

The main effect of age, captured by \(\beta_1\), shows how predicted vocal age increases with chronological age for the reference group. A positive \(\beta_1\) means children get vocally older as they age chronologically.

The main effect of group, represented by \(\beta_2\), captures baseline differences between groups. If \(\beta_2 > 0\) for HH compared to TD, it means hearing-challenged children start at a higher predicted vocal age, independent of their actual age.

The interaction term \(\beta_3\) is the key feature. It allows the slope to vary by group. Without this term, all groups would have parallel lines with the same slope but different intercepts. With the interaction, the slope for group \(j\) becomes:

\[ \beta_1 + \beta_3 \cdot G_j \]

For example, if TD is the reference group, their slope equals \(\beta_1\). If \(\beta_3 = -0.5\) for ASD, then the ASD slope becomes \(\beta_1 - 0.5\), indicating a shallower trajectory. If \(\beta_3 = 0.2\) for HH, then the HH slope becomes \(\beta_1 + 0.2\), showing a steeper trajectory.

Expanded form for each group

When we expand the model for specific groups, assuming TD is the reference, we get different equations for each developmental category. For TD children, the model simplifies to:

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

For HH children, it becomes:

\[ Y_i = (\beta_0 + \beta_{2,\text{HH}}) + (\beta_1 + \beta_{3,\text{HH}}) X_i + \varepsilon_i \]

Similarly, for LD children we have:

\[ Y_i = (\beta_0 + \beta_{2,\text{LD}}) + (\beta_1 + \beta_{3,\text{LD}}) X_i + \varepsilon_i \]

and for ASD children the equation is:

\[ Y_i = (\beta_0 + \beta_{2,\text{ASD}}) + (\beta_1 + \beta_{3,\text{ASD}}) X_i + \varepsilon_i \]

Each group thus has its own intercept, calculated as:

\[ \beta_0 + \beta_{2,j} \]

where \(\beta_{2,\text{TD}} = 0\) for the reference group, and its own slope, calculated as:

\[ \beta_1 + \beta_{3,j} \]

where \(\beta_{3,\text{TD}} = 0\) for the reference group.

Why this matters

The interaction term \(\beta_3\) allows us to test a critical hypothesis: Do different developmental groups show different rates of vocal maturation? If \(\beta_3 = 0\) for all groups, developmental trajectories are parallel with the same rate but different starting points. If \(\beta_3 \neq 0\), groups mature at different rates, revealing divergent developmental paths. This is what makes the model scientifically meaningful. We’re not just asking “are groups different?” but rather “how do they differ in their developmental trajectories over time?”


4. Simulating a dataset

We now create a synthetic dataset inspired by real developmental patterns. This allows us to walk through each step without depending on external data sources. The simulation includes 600 children randomly assigned to four groups: TD, HH, LD, and ASD, with proportions of 30%, 30%, 20%, and 20% respectively. Each child has an actual age randomly drawn from a uniform distribution between 8 and 45 months.

We define group-specific parameters where TD children have an intercept of 5 and a slope of 1.6, HH children have an intercept of 8 and a slope of 1.4, LD children have an intercept of -5 and a slope of 1.1, and ASD children have an intercept of 0 and a slope of 0.6. In this simulation, TD and HH children show faster developmental alignment with steeper slopes, LD children grow more slowly, and ASD children show the shallowest slope. Noise is injected with a standard deviation of 6 to mimic real-world variability. This gives us a dataset with a recognizable developmental structure.

set.seed(1)

n <- 600
groups <- sample(c("TD","HH","LD","ASD"), n, replace = TRUE, prob = c(0.3,0.3,0.2,0.2))

actual_age <- runif(n, 8, 45)

# group-specific parameters
params <- list(
  TD  = c(intercept =  5, slope = 1.6),
  HH  = c(intercept =  8, slope = 1.4),
  LD  = c(intercept = -5, slope = 1.1),
  ASD = c(intercept =  0, slope = 0.6)
)

pred_age <- numeric(n)
for(i in 1:n){
  g <- groups[i]
  b0 <- params[[g]][1]
  b1 <- params[[g]][2]
  pred_age[i] <- b0 + b1*actual_age[i] + rnorm(1,0,6)
}

df <- data.frame(actual_age, pred_age, group = groups)

head(df)
##   actual_age pred_age group
## 1   38.12731 63.95730    TD
## 2   42.36476 76.32521    HH
## 3   13.45680 30.00936    HH
## 4   35.74340 24.69919   ASD
## 5   44.09932 74.73887    TD
## 6   44.06732 19.61999   ASD

5. Fitting the model

library(tidyverse)

model <- lm(pred_age ~ actual_age * group, data = df)
summary(model)
## 
## Call:
## lm(formula = pred_age ~ actual_age * group, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6586  -4.2995   0.0111   4.2253  19.1004 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.15561    1.53440   0.101 0.919256    
## actual_age          0.54525    0.05419  10.062  < 2e-16 ***
## groupHH             6.49476    1.94200   3.344 0.000877 ***
## groupLD            -7.19784    2.14077  -3.362 0.000823 ***
## groupTD             3.09492    1.96710   1.573 0.116173    
## actual_age:groupHH  0.89820    0.06774  13.260  < 2e-16 ***
## actual_age:groupLD  0.57460    0.07567   7.594 1.22e-13 ***
## actual_age:groupTD  1.16050    0.07056  16.446  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.26 on 592 degrees of freedom
## Multiple R-squared:  0.9115, Adjusted R-squared:  0.9104 
## F-statistic: 870.7 on 7 and 592 DF,  p-value: < 2.2e-16

The summary displays the main effect of actual age showing how predicted vocal age rises with age, the main effect of group showing how groups differ at the baseline, and the interaction terms showing how slopes differ across groups. The interaction is the key component as it tells you the developmental trajectory is not the same for all groups.


6. Interpreting the model

The summary displays three key components: the main effect of actual age showing how predicted vocal age rises with age, the main effect of group showing how groups differ at baseline, and the interaction terms showing how slopes differ across groups. The interaction is the key element, as it tells you the developmental trajectory is not the same for all groups.

Understanding the output

In our model, ASD serves as the reference group (alphabetically first). The intercept of 0.156 (SE = 1.534, p = 0.919) represents the estimated predicted vocal age for ASD children at age zero months. The coefficient for actual age is 0.545 (SE = 0.054, t = 10.062, p < 0.001), indicating that for every additional month of chronological age, ASD children show an increase of approximately 0.545 months in predicted vocal age.

The group effects show baseline differences from ASD. HH children start 6.495 months higher (SE = 1.942, t = 3.344, p < 0.001), LD children start 7.198 months lower (SE = 2.141, t = -3.362, p < 0.001), and TD children start 3.095 months higher (SE = 1.967, t = 1.573, p = 0.116), though this difference is not statistically significant.

The interaction terms reveal how developmental rates differ across groups. Compared to ASD, HH children show an additional 0.898 months increase per month of age (SE = 0.068, t = 13.260, p < 0.001), giving them a total slope of 0.545 + 0.898 = 1.443. LD children show an additional 0.575 months per month (SE = 0.076, t = 7.594, p < 0.001), yielding a slope of 1.120. TD children show the steepest additional rate of 1.161 months per month (SE = 0.071, t = 16.446, p < 0.001), resulting in a total slope of 1.706.

The model explains 91.15% of the variance in predicted vocal age (adjusted R² = 0.910), with a residual standard error of 6.26 months. The overall model is highly significant, F(7, 592) = 870.7, p < 0.001.

APA-style results reporting

A multiple linear regression with interaction terms was conducted to examine the relationship between chronological age and predicted vocal development age across four developmental groups (ASD, HH, LD, TD). The overall model was statistically significant, F(7, 592) = 870.7, p < 0.001, explaining 91.1% of the variance in predicted vocal age (adjusted R² = 0.910).

Results revealed significant main effects and interactions. Using ASD as the reference group, chronological age was significantly associated with predicted vocal age, β = 0.55, SE = 0.05, t = 10.06, p < 0.001. Significant group differences emerged in baseline predicted vocal age, with HH children showing higher baseline scores (β = 6.49, SE = 1.94, t = 3.34, p < 0.001) and LD children showing lower baseline scores (β = -7.20, SE = 2.14, t = -3.36, p < 0.001) compared to ASD. TD children showed numerically higher baseline scores, though this difference was not statistically significant (β = 3.09, SE = 1.97, t = 1.57, p = 0.116).

Critically, significant age by group interactions indicated that developmental trajectories varied across groups. Compared to ASD (slope = 0.55), HH children showed a significantly steeper developmental slope (interaction β = 0.90, SE = 0.07, t = 13.26, p < 0.001), as did LD children (interaction β = 0.57, SE = 0.08, t = 7.59, p < 0.001). TD children demonstrated the steepest trajectory (interaction β = 1.16, SE = 0.07, t = 16.45, p < 0.001). These findings suggest that vocal development proceeds at different rates across developmental groups, with TD and HH children showing faster alignment between chronological age and vocal maturity compared to ASD children.


7. Detailed interpretation

Let me walk through the actual output from our model to understand what each coefficient means. The reference group in our model is ASD (alphabetically first), with an intercept of 0.156 and a slope of 0.545. This means that for ASD children at birth (age = 0), their predicted vocal age would be 0.156 months, and for every one-month increase in actual age, their predicted vocal age increases by 0.545 months.

For HH children, the model shows an intercept difference of 6.495 compared to ASD. This means HH children start at 0.156 + 6.495 = 6.651 months of predicted vocal age at birth. The interaction term for HH is 0.905, meaning their slope is 0.545 + 0.905 = 1.450 months per month of actual age, indicating faster vocal development than ASD children.

For LD children, the intercept difference is -7.198, giving them a starting point of 0.156 - 7.198 = -7.042 months. Their interaction term of 0.571 yields a slope of 0.545 + 0.571 = 1.116 months per month, which is faster than ASD but slower than HH.

For TD children, the intercept difference is 3.095, placing their starting point at 0.156 + 3.095 = 3.251 months. Their interaction term of 1.161 produces a slope of 0.545 + 1.161 = 1.706 months per month, representing the fastest developmental trajectory in our sample.

The model’s R-squared value of 0.9115 indicates that approximately 91.2% of the variance in predicted vocal age is explained by actual age, group membership, and their interaction. The residual standard error of 7.255 represents the average deviation of observed values from the fitted regression line.

Example APA-style results section

A multiple linear regression was conducted to examine the relationship between actual age (in months) and predicted vocal development age across four developmental groups: typically developing (TD), hearing-challenged (HH), language-delayed (LD), and autism spectrum disorder (ASD). The model included the main effects of actual age and group, as well as their interaction term.

The overall model was statistically significant, F(7, 592) = 870.7, p < .001, R² = .91, indicating that actual age, group membership, and their interaction accounted for 91.2% of the variance in predicted vocal age. The main effect of actual age was significant, β = 0.545, SE = 0.054, t(592) = 10.06, p < .001, demonstrating that predicted vocal age increased by 0.545 months for every one-month increase in actual age for the reference group (ASD).

Significant group differences were observed at baseline. Compared to ASD children, HH children showed a significantly higher intercept, β = 6.495, SE = 1.942, t(592) = 3.34, p < .001, while LD children showed a significantly lower intercept, β = -7.198, SE = 2.141, t(592) = -3.36, p < .001. TD children also demonstrated a numerically higher baseline, β = 3.095, SE = 1.967, t(592) = 1.57, p = .116, though this difference did not reach statistical significance.

Critically, the interaction terms revealed that developmental trajectories varied significantly across groups. The interaction between actual age and group HH was significant, β = 0.905, SE = 0.068, t(592) = 13.26, p < .001, indicating that HH children’s predicted vocal age increased 0.905 months faster per month than ASD children. Similarly, the TD × actual age interaction was significant, β = 1.161, SE = 0.071, t(592) = 16.45, p < .001, showing the steepest developmental slope. The LD × actual age interaction was also significant, β = 0.571, SE = 0.075, t(592) = 7.59, p < .001, though less pronounced than HH and TD groups. These findings suggest that while all groups show positive developmental trajectories, TD and HH children demonstrate significantly faster rates of vocal development compared to ASD children, with LD children falling intermediate between these groups.


8. Plotting the data with regression lines

library(ggplot2)

ggplot(df, aes(x = actual_age, y = pred_age, color = group)) +
  geom_point(alpha = 0.5, size = 1.8) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linewidth = 1.1) +
  scale_color_manual(values = c("TD"="#1f77b4","HH"="#2ca02c","LD"="#000000","ASD"="#d62728")) +
  labs(
    x = "Actual age (months)",
    y = "Predicted vocal development age",
    title = "Multiple Regression With Group-Specific Slopes"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "grey80", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    axis.line.x = element_line(color = "black", linewidth = 0.5),
    axis.line.y = element_line(color = "black", linewidth = 0.5)
  )

This produces:

  • Four overlapping point clouds.
  • Four regression lines, each capturing its group’s slope.
  • A structure similar to the published plot you provided.

The final figure gives the whole story visually. The clouds show the raw relationship between chronological age and predicted development, while the colored lines show each group’s developmental trajectory. A steep line implies predicted vocal age increases quickly with real age, whereas a shallow line implies slower alignment. The vertical spread demonstrates how noisy vocal maturation can be from child to child. If TD shows a steeper slope than HH, which in turn is steeper than LD, which is steeper than ASD, you can almost see the developmental divergence across conditions unfolding before your eyes.