
1. A behavioral story

Imagine you are a health psychologist studying the relationship between physical activity and mental well-being. You measure weekly exercise hours for a group of adults and assess their subjective well-being using a standardized happiness scale. You notice that people who exercise more tend to report higher happiness scores, but the relationship is not perfect. Some people exercise frequently but report moderate happiness, while others exercise less but still feel quite happy.

This variability is natural in behavioral science. People differ in countless ways: genetics, social support, sleep quality, diet, and stress levels all play roles. Yet, despite this noise, you suspect a systematic pattern exists. As exercise hours increase, happiness scores tend to rise. This is exactly where simple linear regression becomes a powerful analytical tool.

Simple linear regression helps us answer a fundamental question: Is there a linear relationship between two continuous variables, and if so, how strong is it? Unlike correlation, which only tells us whether variables move together, regression provides a predictive equation that allows us to estimate one variable from another.

2. What is simple linear regression?

Simple linear regression is a statistical method that models the relationship between one predictor variable (also called the independent variable or explanatory variable) and one outcome variable (also called the dependent variable or response variable). The word “simple” indicates that we have only one predictor, distinguishing it from multiple regression, which involves two or more predictors.

The goal is to find the best-fitting straight line through a cloud of data points. This line summarizes the relationship between the predictor and outcome, allowing us to make predictions and test hypotheses about how changes in the predictor are associated with changes in the outcome.

Regression is widely used across behavioral sciences, public health, psychology, and medicine. Researchers use it to examine how sleep duration affects cognitive performance, how therapy sessions influence depression scores, how years of education predict income, or how medication dosage relates to symptom reduction.

3. The model in plain form

The simple linear regression model can be written as:

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

where \(Y_i\) represents the observed outcome for individual \(i\), \(X_i\) is the predictor value for individual \(i\), \(\beta_0\) is the intercept (the expected value of \(Y\) when \(X = 0\)), \(\beta_1\) is the slope (the expected change in \(Y\) for a one-unit increase in \(X\)), and \(\varepsilon_i\) is the random error term for individual \(i\).

The error term \(\varepsilon_i\) captures all the variability in \(Y\) that is not explained by \(X\). We assume that errors are normally distributed with mean zero and constant variance:

\[ \varepsilon_i \sim N(0, \sigma^2) \]

This assumption means that for any given value of \(X\), the errors around the regression line follow a bell-shaped distribution centered at zero, with the same spread across all values of \(X\).

Understanding the components

The intercept \(\beta_0\) tells us where the regression line crosses the y-axis. In our example, it represents the predicted happiness score for someone who exercises zero hours per week. Sometimes this value is meaningful (e.g., baseline happiness with no exercise), and sometimes it is purely mathematical (e.g., if negative exercise hours are impossible).

The slope \(\beta_1\) is the heart of the regression model. It quantifies the rate of change in \(Y\) as \(X\) increases by one unit. If \(\beta_1 = 2.5\), it means that for every additional hour of weekly exercise, happiness scores increase by 2.5 points on average. If \(\beta_1 = 0\), there is no linear relationship. If \(\beta_1 < 0\), the relationship is negative: as \(X\) increases, \(Y\) decreases.

The error term \(\varepsilon_i\) represents individual deviations from the regression line. No model is perfect, and regression acknowledges this by including random variation. The smaller the errors, the tighter the data points cluster around the line, and the better the model fits the data.

The fitted line and predictions

Once we estimate \(\beta_0\) and \(\beta_1\) from data, we obtain the fitted regression equation:

\[ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \]

The hat symbol (as in \(\hat{\beta}_1\)) indicates an estimated value. The fitted value \(\hat{Y}_i\) is the predicted outcome for individual \(i\) based on their predictor value \(X_i\). The difference between the observed value \(Y_i\) and the predicted value \(\hat{Y}_i\) is called the residual:

\[ e_i = Y_i - \hat{Y}_i \]

Residuals measure how far each data point lies from the regression line. Positive residuals indicate observations above the line, and negative residuals indicate observations below the line. The regression line is chosen to minimize the sum of squared residuals:

\[ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]

This method, called ordinary least squares (OLS), ensures that the fitted line is as close as possible to all data points in a mathematical sense. The resulting estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the best linear unbiased estimators under the model assumptions.
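
To make the least-squares idea concrete, here is a minimal sketch of the closed-form OLS solution, applied to a small set of hypothetical values (the vectors x and y below are invented purely for illustration):

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Closed-form OLS estimates: slope from centered cross-products, intercept from the means
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# lm() minimizes the same sum of squared residuals, so the estimates should agree
c(beta0_hat, beta1_hat)
coef(lm(y ~ x))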

Key assumptions

Simple linear regression relies on several assumptions:

  1. Linearity: The relationship between \(X\) and \(Y\) is linear. If the true relationship is curved, a straight line will not fit well.

  2. Independence: Observations are independent of each other. Repeated measurements on the same individuals or clustered data violate this assumption.

  3. Homoscedasticity: The variance of errors is constant across all levels of \(X\). If variability increases or decreases with \(X\), predictions become less reliable.

  4. Normality: Errors are normally distributed. This matters most for hypothesis testing and confidence intervals, especially with small sample sizes.

  5. No measurement error in \(X\): The predictor is measured without error. If \(X\) contains substantial measurement error, estimates of \(\beta_1\) become biased.

Violations of these assumptions can lead to inefficient estimates, biased standard errors, and invalid hypothesis tests. Diagnostic plots help us check these assumptions after fitting the model.

4. Simulating a dataset

We now create a synthetic dataset to illustrate simple linear regression step by step. This allows us to control the true parameters and see how well regression recovers them. We simulate data for 200 adults, measuring weekly exercise hours (ranging from 0 to 10 hours) and happiness scores (on a scale from 0 to 100).

We set the true intercept \(\beta_0 = 40\), meaning someone who does not exercise at all would have an expected happiness score of 40. We set the true slope \(\beta_1 = 3.5\), meaning each additional hour of weekly exercise increases happiness by 3.5 points on average. We add random noise with a standard deviation of \(\sigma = 10\) to mimic real-world variability.

set.seed(123)  # For reproducibility
n <- 200

# Simulate predictor variable (exercise hours per week)
exercise <- runif(n, min = 0, max = 10)

# True parameters
beta_0 <- 40   # Intercept
beta_1 <- 3.5  # Slope
sigma <- 10    # Error standard deviation

# Generate outcome variable (happiness scores)
happiness <- beta_0 + beta_1 * exercise + rnorm(n, mean = 0, sd = sigma)

# Create a data frame
data <- data.frame(exercise = exercise, happiness = happiness)

# View first few rows
head(data)
##   exercise happiness
## 1 2.875775  42.96115
## 2 7.883051  70.15952
## 3 4.089769  51.84727
## 4 8.830174  67.43018
## 5 9.404673  63.40017
## 6 0.455565  41.14420

This gives us a dataset where the true relationship is known. We can now use regression to estimate the parameters and see how close our estimates come to the true values of \(\beta_0 = 40\) and \(\beta_1 = 3.5\).

5. Visualizing the relationship

Before fitting any model, it is essential to visualize the data. A scatterplot reveals the overall pattern, the strength of the relationship, and potential outliers or nonlinear trends. Let’s create a scatterplot of exercise hours versus happiness scores.

library(ggplot2)

ggplot(data, aes(x = exercise, y = happiness)) +
  geom_point(alpha = 0.6, size = 2, color = "steelblue") +
  labs(title = "Relationship Between Exercise and Happiness",
       x = "Weekly Exercise Hours",
       y = "Happiness Score") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

The scatterplot should show a positive trend: as exercise hours increase, happiness scores tend to increase. The points scatter around the underlying linear trend, reflecting the random noise we added. If we see a clear linear pattern, regression will fit well. If the pattern is curved or the points show no discernible trend, linear regression may not be appropriate.

6. Fitting the regression model in R

We fit the simple linear regression model using the lm() function in R, which stands for “linear model.” The syntax is straightforward:

model <- lm(happiness ~ exercise, data = data)
summary(model)
## 
## Call:
## lm(formula = happiness ~ exercise, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.271  -6.249  -1.110   5.954  32.021 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40.7726     1.4359   28.39   <2e-16 ***
## exercise      3.3602     0.2495   13.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.645 on 198 degrees of freedom
## Multiple R-squared:  0.478,  Adjusted R-squared:  0.4754 
## F-statistic: 181.3 on 1 and 198 DF,  p-value: < 2.2e-16

The formula happiness ~ exercise tells R to model happiness as a function of exercise. The data = data argument specifies which data frame contains the variables. The summary() function provides detailed output, including coefficient estimates, standard errors, t-values, p-values, and model fit statistics.

Let’s examine the key components of the output:

Coefficients: The estimated intercept \(\hat{\beta}_0\) and slope \(\hat{\beta}_1\) appear in the “Estimate” column. These are the best-fitting values based on our data.

Standard Errors: These measure the precision of our estimates. Smaller standard errors indicate more precise estimates.

t-values: These are test statistics for hypothesis tests. For the slope, the null hypothesis is \(H_0: \beta_1 = 0\) (no relationship). The t-value equals the estimate divided by its standard error.

p-values: These indicate the probability of observing a t-value at least as extreme as ours if the null hypothesis were true. Small p-values (typically \(p < 0.05\)) provide evidence against the null hypothesis.

R-squared: This measures the proportion of variance in \(Y\) explained by \(X\). It ranges from 0 to 1, where 0 means \(X\) explains nothing and 1 means \(X\) explains everything.

Residual standard error: This estimates \(\sigma\), the standard deviation of the errors. It measures the typical distance between observed and predicted values.

F-statistic: This tests the overall significance of the model. For simple linear regression, it is equivalent to the t-test for the slope (the F-statistic is the square of the slope's t-value).

7. Interpreting the results

Let’s extract and interpret the estimated coefficients from our fitted model.

coefficients(model)
## (Intercept)    exercise 
##   40.772573    3.360191

The output shows \(\hat{\beta}_0 = 40.77\) and \(\hat{\beta}_1 = 3.36\). These estimates are very close to the true values we used in the simulation (\(\beta_0 = 40\) and \(\beta_1 = 3.5\)), which is reassuring.

Intercept interpretation: The estimated intercept \(\hat{\beta}_0 = 40.77\) represents the predicted happiness score for someone who exercises zero hours per week. We would say: “Among adults who do not engage in any weekly exercise, the expected happiness score is approximately 40.77 points.”

Slope interpretation: The estimated slope \(\hat{\beta}_1 = 3.36\) represents the expected change in happiness for each additional hour of weekly exercise. We would say: “For every one-hour increase in weekly exercise, happiness scores are expected to increase by approximately 3.36 points, on average.”

It is crucial to use careful language. We say “associated with” or “expected to increase” rather than “causes” because regression alone does not establish causality. Confounding variables, reverse causation, and measurement error can all influence the observed relationship. Causal inference requires additional assumptions and research designs (e.g., randomized experiments, instrumental variables, or longitudinal data with proper controls).

Statistical significance

The summary output includes hypothesis tests for each coefficient. For the slope, the null hypothesis is:

\[ H_0: \beta_1 = 0 \]

This tests whether there is no linear relationship between exercise and happiness. The alternative hypothesis is:

\[ H_a: \beta_1 \neq 0 \]

which states that a relationship exists.

The test statistic is:

\[ t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} \]

If the p-value is less than our chosen significance level (typically \(\alpha = 0.05\)), we reject the null hypothesis and conclude that there is statistically significant evidence of a linear relationship.
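
As a quick check on the summary output, the slope's t-value and p-value can be reproduced directly from the coefficient table; this is a minimal sketch using the model object fitted above:

# Coefficient table from the fitted model
coefs <- summary(model)$coefficients

# t statistic: estimate divided by its standard error
t_slope <- coefs["exercise", "Estimate"] / coefs["exercise", "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_slope <- 2 * pt(abs(t_slope), df = df.residual(model), lower.tail = FALSE)

c(t = t_slope, p = p_slope)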

Confidence intervals

We can construct confidence intervals for the coefficients to quantify uncertainty in our estimates. A 95% confidence interval for \(\beta_1\) is calculated as:

\[ \hat{\beta}_1 \pm t_{n-2, 0.975} \cdot \text{SE}(\hat{\beta}_1) \]

where \(t_{n-2, 0.975}\) is the critical value from the t-distribution with \(n-2\) degrees of freedom.

In R, we obtain confidence intervals using:

confint(model, level = 0.95)
##                 2.5 %    97.5 %
## (Intercept) 37.940940 43.604205
## exercise     2.868116  3.852266

The 95% confidence interval for \(\beta_1\) is \([2.87, 3.85]\). We interpret this as: “We are 95% confident that the true increase in happiness per additional hour of exercise lies between 2.87 and 3.85 points.” This interval does not contain zero, providing additional evidence of a positive relationship.
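
The interval reported by confint() can also be reconstructed by hand from the formula above; a short sketch:

# Critical value from the t distribution with n - 2 degrees of freedom
t_crit <- qt(0.975, df = df.residual(model))

# Slope estimate and its standard error
b1    <- unname(coef(model)["exercise"])
se_b1 <- summary(model)$coefficients["exercise", "Std. Error"]

# 95% confidence interval for the slope
c(lower = b1 - t_crit * se_b1, upper = b1 + t_crit * se_b1)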

Model fit: R-squared

The R-squared value, denoted \(R^2\), measures the proportion of variance in the outcome explained by the predictor:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} \]

where \(\sum e_i^2\) is the sum of squared residuals (unexplained variance) and \(\sum (Y_i - \bar{Y})^2\) is the total variance in \(Y\).

Our model has \(R^2 = 0.478\), which means that 47.8% of the variability in happiness scores is explained by exercise hours, while the remaining 52.2% is due to other factors not included in the model. Higher \(R^2\) values indicate better fit, but what counts as “good” depends on the field. In tightly controlled laboratory experiments, \(R^2 > 0.80\) is common. In behavioral and social sciences, where many unmeasured factors influence outcomes, \(R^2\) values between 0.20 and 0.50 are often considered meaningful.

It is important not to overinterpret \(R^2\). A low \(R^2\) does not mean the model is useless—if the slope is statistically significant and theoretically meaningful, the predictor still provides valuable information. Conversely, a high \(R^2\) does not guarantee causality or practical importance.
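
As a check against the summary output, \(R^2\) can be computed directly from its definition; a brief sketch:

# Residual (unexplained) and total sums of squares
ss_res <- sum(residuals(model)^2)
ss_tot <- sum((data$happiness - mean(data$happiness))^2)

# Proportion of variance in happiness explained by exercise
1 - ss_res / ss_tot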

8. Visualizing the fitted line

We can overlay the fitted regression line on the scatterplot to visualize how well the model captures the data pattern.

ggplot(data, aes(x = exercise, y = happiness)) +
  geom_point(alpha = 0.6, size = 2, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "darkred", fill = "pink", alpha = 0.3) +
  labs(title = "Simple Linear Regression: Exercise and Happiness",
       x = "Weekly Exercise Hours",
       y = "Happiness Score") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
## `geom_smooth()` using formula = 'y ~ x'

The geom_smooth(method = "lm") function adds the regression line, and se = TRUE includes a shaded confidence band around the line. The band reflects uncertainty about the fitted line itself: at each value of \(X\), it gives a 95% confidence interval for the mean happiness score at that exercise level.

The plot shows that the fitted line passes through the center of the point cloud, with roughly equal scatter above and below the line. If the model fits well, the line should follow the general trend without systematic deviations.

9. Making predictions

Once we have a fitted model, we can make predictions for new observations. Suppose we want to predict happiness for someone who exercises 5 hours per week. We use the predict() function:

new_data <- data.frame(exercise = 5)
predicted_happiness <- predict(model, newdata = new_data, interval = "confidence", level = 0.95)
predicted_happiness
##        fit      lwr      upr
## 1 57.57353 56.22819 58.91886

The output includes three values:

  • fit: The predicted happiness score for someone exercising 5 hours per week.
  • lwr: The lower bound of the 95% confidence interval for the mean happiness at 5 hours.
  • upr: The upper bound of the 95% confidence interval.

The predicted value is 57.57 with a confidence interval of \([56.23, 58.92]\). We would say: “For adults who exercise 5 hours per week, the expected happiness score is 57.57, and we are 95% confident that the true mean happiness for this group lies between 56.23 and 58.92.”
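
The fitted value itself is simply the estimated regression equation evaluated at \(X = 5\); a quick check by hand:

# Predicted happiness at 5 hours of weekly exercise, computed from the coefficients
coef(model)["(Intercept)"] + coef(model)["exercise"] * 5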

We can also construct a prediction interval, which accounts for both the uncertainty in estimating the mean and the random variability of individual observations:

predict(model, newdata = new_data, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 57.57353 38.50529 76.64177

Prediction intervals are wider than confidence intervals because they include the additional uncertainty of individual variation (\(\sigma^2\)). The prediction interval is \([38.51, 76.64]\), so we would say: “We are 95% confident that a randomly selected adult who exercises 5 hours per week will have a happiness score between 38.51 and 76.64.”

Extrapolation warning

Be cautious when predicting outside the range of observed data. This is called extrapolation, and it assumes the linear relationship continues beyond the observed range. If our exercise data ranges from 0 to 10 hours, predicting happiness for someone exercising 20 hours per week is risky. The relationship may become nonlinear, or other factors may dominate. Always restrict predictions to the range of the data unless you have strong theoretical reasons to extrapolate.

10. Checking model assumptions

Regression assumptions are not just abstract mathematical requirements—they determine whether our results are valid. We check assumptions using diagnostic plots and statistical tests.

Residual plots

The most important diagnostic is a plot of residuals versus fitted values:

data$fitted <- fitted(model)
data$residuals <- residuals(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

What to look for:

  • Random scatter: Points should scatter randomly around the horizontal line at zero with no clear pattern. This indicates that the linearity and homoscedasticity assumptions hold.

  • Funnel shape: If residuals spread out as fitted values increase (or decrease), this suggests heteroscedasticity (non-constant variance). Solutions include transforming the outcome variable or using weighted least squares; a formal test is sketched after this list.

  • Curved pattern: If residuals form a curve, the relationship is nonlinear. Consider adding polynomial terms (e.g., \(X^2\)) or using nonlinear regression.

  • Outliers: Points far from the horizontal line are outliers. They may unduly influence the regression line. Investigate whether they are data errors, special cases, or legitimate observations.
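
A formal complement to the visual check for non-constant variance is the Breusch-Pagan test. The sketch below assumes the lmtest package (not used elsewhere in this tutorial) is installed:

# Breusch-Pagan test for heteroscedasticity
# (requires the lmtest package: install.packages("lmtest"))
library(lmtest)
bptest(model)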

Normality of residuals

We check whether residuals are normally distributed using a Q-Q plot (quantile-quantile plot):

ggplot(data, aes(sample = residuals)) +
  stat_qq(color = "steelblue", alpha = 0.6) +
  stat_qq_line(color = "darkred", linetype = "dashed") +
  labs(title = "Q-Q Plot of Residuals",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

What to look for:

  • Points on the line: If residuals are normally distributed, points should lie close to the diagonal line.

  • Heavy tails: If points deviate from the line at the extremes, residuals have heavier tails than a normal distribution (more extreme values).

  • Skewness: If points curve away from the line, residuals are skewed. Consider transforming the outcome variable (e.g., log transformation).

With large sample sizes (\(n > 100\)), mild departures from normality are usually not problematic due to the central limit theorem. Hypothesis tests and confidence intervals remain approximately valid.
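
Alongside the Q-Q plot, the Shapiro-Wilk test in base R offers a numerical check of normality; note that with samples this large it can flag departures too small to matter in practice:

# Shapiro-Wilk test of normality applied to the residuals
shapiro.test(residuals(model))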

Influential points

Some observations have more influence on the regression line than others. We identify influential points using Cook’s distance, which measures how much the regression coefficients change when an observation is removed.

data$cooks_d <- cooks.distance(model)

ggplot(data, aes(x = 1:nrow(data), y = cooks_d)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7) +
  geom_hline(yintercept = 4/nrow(data), linetype = "dashed", color = "red") +
  labs(title = "Cook's Distance for Each Observation",
       x = "Observation Index",
       y = "Cook's Distance") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

A common rule of thumb is that observations with Cook’s distance greater than \(4/n\) (where \(n\) is the sample size) are potentially influential. If you find influential points, investigate them carefully. They may represent data entry errors, unique subgroups, or important cases that should not be excluded without justification.
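
To see which observations cross the \(4/n\) threshold, a short sketch using the Cook's distance values computed above:

# Observations whose Cook's distance exceeds the 4/n rule of thumb
influential <- which(data$cooks_d > 4 / nrow(data))
data[influential, c("exercise", "happiness", "cooks_d")]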

11. Reporting results in APA style

When reporting regression results in scientific papers, follow the Publication Manual of the American Psychological Association (APA) guidelines. Here is how to report our example:

A simple linear regression was conducted to examine the relationship between weekly exercise hours and happiness scores. The analysis revealed a statistically significant positive relationship, \(b = 3.36\), \(SE = 0.25\), \(t(198) = 13.47\), \(p < .001\), 95% CI \([2.87, 3.85]\). The model explained 47.8% of the variance in happiness scores, \(R^2 = .478\), \(F(1, 198) = 181.3\), \(p < .001\). For each additional hour of weekly exercise, happiness scores increased by approximately 3.36 points. The intercept was \(b_0 = 40.77\), \(t(198) = 28.39\), \(p < .001\), indicating that individuals with no weekly exercise had an expected happiness score of 40.77. (Note that \(b\) denotes the unstandardized coefficient; in APA style, \(\beta\) is reserved for standardized coefficients.)

This format includes:

  • A brief description of the analysis and research question.
  • The estimated slope with its standard error or confidence interval.
  • The t-statistic, degrees of freedom, and p-value for the slope.
  • The \(R^2\) value and the F-test for overall model significance.
  • A clear interpretation in plain language.

Always accompany statistical results with a well-labeled scatterplot showing the data and the fitted regression line. Visualizations make results more accessible and allow readers to assess the fit visually.

12. Practical considerations

Sample size

Simple linear regression requires a sufficient sample size for reliable estimates. A common rule is at least 10-20 observations per predictor. For simple regression (one predictor), a minimum of \(n = 30\) is recommended, but larger samples (\(n > 100\)) provide more stable estimates and better power to detect relationships.

Outliers and leverage

Outliers can distort regression results. An outlier is an observation with an unusual \(Y\) value given its \(X\) value. A high-leverage point is an observation with an unusual \(X\) value. High-leverage points have greater influence on the slope. If an outlier also has high leverage, it can dramatically change the regression line. Always inspect your data for such points and investigate their validity.
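
Leverage can be inspected directly with hatvalues(); a common rule of thumb flags observations whose leverage exceeds twice the average. A brief sketch:

# Leverage (hat values) for each observation
lev <- hatvalues(model)

# Average leverage equals (number of coefficients) / n, here 2 / n;
# flag points with more than twice the average leverage
high_leverage <- which(lev > 2 * mean(lev))
data[high_leverage, c("exercise", "happiness")]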

Correlation vs. regression

Correlation and regression are closely related but serve different purposes. Correlation measures the strength and direction of a linear relationship using Pearson’s \(r\), which ranges from \(-1\) to \(+1\). Regression provides a predictive equation and allows for asymmetric relationships (predicting \(Y\) from \(X\) is different from predicting \(X\) from \(Y\)). In simple linear regression, the relationship between \(r\) and \(\beta_1\) is:

\[ \beta_1 = r \cdot \frac{s_Y}{s_X} \]

where \(s_Y\) is the standard deviation of \(Y\) and \(s_X\) is the standard deviation of \(X\). Moreover, \(R^2 = r^2\) in simple linear regression.
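
Both identities are easy to verify with the simulated data; a quick sketch:

# Pearson correlation between exercise and happiness
r <- cor(data$exercise, data$happiness)

# Slope reconstructed from the correlation and the two standard deviations
r * sd(data$happiness) / sd(data$exercise)  # matches coef(model)["exercise"]

# R-squared as the squared correlation
r^2                                         # matches summary(model)$r.squared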

Transformations

If assumptions are violated, consider transforming variables. Common transformations include:

  • Log transformation: Used when the outcome is right-skewed or when the relationship is exponential.
  • Square root transformation: Used for count data or moderate skewness.
  • Reciprocal transformation: Used when the relationship is hyperbolic.

After transformation, interpret coefficients carefully, as they represent effects on the transformed scale.
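
To illustrate how a transformation changes interpretation, here is a hedged sketch of a log-outcome model fit to the simulated data (no transformation is actually needed here; the point is the syntax and the multiplicative reading of the slope):

# Model the natural log of happiness (all simulated values are positive)
model_log <- lm(log(happiness) ~ exercise, data = data)
summary(model_log)

# With a logged outcome, each extra hour of exercise multiplies the predicted
# (geometric mean) happiness by approximately exp(slope)
exp(coef(model_log)["exercise"])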

13. Conclusion

Simple linear regression is a foundational tool in quantitative research. It allows us to quantify relationships, make predictions, and test hypotheses about how variables are associated. The method is intuitive: fit the best straight line through the data, interpret the slope and intercept, and check that assumptions hold.

While simple, this technique is powerful. It forms the basis for more complex models, including multiple regression, logistic regression, and mixed-effects models. Mastering simple linear regression equips you with the skills to understand and apply these advanced methods.

Remember that regression reveals associations, not causes. Strong relationships in observational data can arise from confounding, measurement error, or reverse causation. Combine regression with thoughtful research design, domain knowledge, and careful interpretation to draw meaningful conclusions.

With practice, regression becomes an indispensable part of your analytical toolkit, enabling you to uncover patterns, test theories, and communicate findings with clarity and precision.