Imagine you are a health psychologist studying the relationship between daily screen time and subjective well-being. You collect data from 200 adults who report their average daily screen time (in hours) and complete a well-being questionnaire yielding scores from 0 to 100.
At first glance, you expect a simple negative relationship: more screen time, lower well-being. However, as you plot the data, something more nuanced emerges. People with very little screen time (less than 1 hour per day) report only moderate well-being, perhaps because they miss out on social connection and information. As screen time increases to moderate levels (2-4 hours), well-being improves, as people likely benefit from social media interaction and entertainment. But beyond about 6 hours, well-being declines sharply as excessive screen use displaces sleep, exercise, and face-to-face interaction.
This creates a curved, inverted U-shaped relationship. A straight line cannot capture this pattern. The relationship is non-linear, and the rate of change in well-being varies across different levels of screen time. This is exactly where polynomial regression becomes essential. By adding polynomial terms (squared, cubed) to the regression model, we can fit curves that capture these complex, non-linear relationships.
Polynomial regression is an extension of linear regression that models non-linear relationships between a predictor and an outcome by including polynomial terms of the predictor. While simple linear regression fits a straight line, polynomial regression fits curves such as parabolas (quadratic), S-shapes (cubic), or more complex patterns.
The key insight is that although the relationship between \(X\) and \(Y\) is non-linear, the model remains linear in its parameters. We transform the predictor by raising it to powers (e.g., \(X^2\), \(X^3\)), but we still estimate coefficients using ordinary least squares, just as in linear regression.
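To make "linear in parameters" concrete, here is a minimal illustration with a toy vector (not the study data): the design matrix of a quadratic model simply contains columns for the intercept, \(X\), and \(X^2\), and OLS estimates the coefficients exactly as in ordinary linear regression.

# Toy illustration: the design matrix of a quadratic model has columns 1, x, x^2
x <- 1:5
model.matrix(~ x + I(x^2))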
Polynomial regression is widely used in psychology, medicine, economics, and environmental science to model phenomena such as:

- inverted U-shaped relationships (e.g., arousal and performance, or screen time and well-being)
- dose-response curves that level off or reverse at high doses
- growth or learning curves with diminishing returns
The most common polynomial models are:

- Quadratic (degree 2): one bend, capturing U-shaped or inverted U-shaped patterns
- Cubic (degree 3): two bends, capturing S-shaped patterns
A polynomial regression model of degree \(p\) can be written as:
\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3 + \cdots + \beta_p X_i^p + \varepsilon_i \]
where \(Y_i\) is the outcome for individual \(i\), \(X_i\) is the predictor value, \(\beta_0\) is the intercept, \(\beta_1, \beta_2, \ldots, \beta_p\) are the polynomial coefficients, and \(\varepsilon_i\) is the random error term assumed to follow:
\[ \varepsilon_i \sim N(0, \sigma^2) \]
For our screen time example, we use a quadratic model (degree 2):
\[ \text{WellBeing}_i = \beta_0 + \beta_1 \cdot \text{ScreenTime}_i + \beta_2 \cdot \text{ScreenTime}_i^2 + \varepsilon_i \]
The intercept \(\beta_0\) represents the expected well-being when screen time equals zero. This may or may not be meaningful depending on the context.
The linear coefficient \(\beta_1\) captures the initial rate of change. If \(\beta_1 > 0\), well-being initially increases with screen time. If \(\beta_1 < 0\), it initially decreases.
The quadratic coefficient \(\beta_2\) determines the curvature:

- If \(\beta_2 < 0\), the curve is concave (inverted U-shape): the slope decreases as \(X\) increases.
- If \(\beta_2 > 0\), the curve is convex (U-shape): the slope increases as \(X\) increases.
The quadratic term allows the slope to change as \(X\) increases. The instantaneous slope at any point \(X\) is:
\[ \frac{dY}{dX} = \beta_1 + 2\beta_2 X \]
This shows that the effect of screen time on well-being is not constant—it depends on the current level of screen time.
For a quadratic model, the turning point (maximum or minimum) occurs where the derivative equals zero:
\[ \frac{dY}{dX} = \beta_1 + 2\beta_2 X = 0 \]
Solving for \(X\):
\[ X_{\text{optimal}} = -\frac{\beta_1}{2\beta_2} \]
If \(\beta_2 < 0\), this is a maximum (peak well-being). If \(\beta_2 > 0\), it is a minimum (lowest well-being).
Linear regression assumes a constant rate of change: each additional hour of screen time has the same effect on well-being. This is often unrealistic. Polynomial regression allows the effect to vary, capturing accelerating, decelerating, or reversing trends.
However, polynomial models have limitations:

- Overfitting: high-degree polynomials chase noise and generalize poorly.
- Poor extrapolation: fitted curves can behave erratically outside the observed data range.
- Multicollinearity: raw polynomial terms (\(X\), \(X^2\), \(X^3\)) are highly correlated.
Always visualize the fitted curve and avoid using polynomials beyond degree 3 unless there is strong theoretical justification.
We simulate data for 200 adults to illustrate polynomial regression. Screen time ranges from 0 to 12 hours per day, and well-being follows an inverted U-shape with a peak around 3-4 hours.
We use a quadratic relationship with noise added to mimic real-world variability.
set.seed(789) # For reproducibility
n <- 200
# Predictor: daily screen time (hours)
screen_time <- runif(n, min = 0, max = 12)
# True quadratic relationship: inverted U-shape
beta_0 <- 30 # Intercept
beta_1 <- 15 # Linear term (positive initial slope)
beta_2 <- -2 # Quadratic term (negative curvature)
# Generate well-being scores with noise
well_being <- beta_0 + beta_1 * screen_time + beta_2 * screen_time^2 + rnorm(n, mean = 0, sd = 8)
# Clamp well-being to the realistic range [0, 100]; note this truncation flattens the curve at the extremes, so fitted coefficients will deviate from the true values
well_being <- pmax(0, pmin(100, well_being))
# Create data frame
data <- data.frame(screen_time = screen_time, well_being = well_being)
# View first few rows
head(data)
## screen_time well_being
## 1 8.3987324 10.53835
## 2 1.1219865 41.12034
## 3 0.1426418 36.78198
## 4 7.0992762 37.00060
## 5 5.9057932 49.95831
## 6 0.2419629 38.08662
This dataset has a built-in quadratic relationship. The optimal screen time is:
\[ X_{\text{optimal}} = -\frac{15}{2 \times (-2)} = \frac{15}{4} = 3.75 \text{ hours} \]
We expect the fitted model to recover coefficients close to \(\beta_0 = 30\), \(\beta_1 = 15\), and \(\beta_2 = -2\).
Before fitting any model, we visualize the data with a scatterplot. This reveals the non-linear pattern and helps us decide whether a polynomial model is appropriate.
library(ggplot2)
ggplot(data, aes(x = screen_time, y = well_being)) +
geom_point(alpha = 0.6, size = 2.5, color = "steelblue") +
labs(title = "Relationship Between Screen Time and Well-Being",
x = "Daily Screen Time (hours)",
y = "Well-Being Score (0-100)") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
The scatterplot should show an inverted U-shape: well-being increases at low screen time, peaks around 3-4 hours, and declines at high screen time. This curvature confirms that a linear model would be inadequate, and a quadratic model is needed.
We fit the quadratic model using the lm() function. To include the quadratic term, we add I(screen_time^2) to the formula. The I() function tells R to treat ^ as arithmetic exponentiation; without it, screen_time^2 would be parsed as a formula operator (crossing) and the model would silently reduce to the plain linear term.
# Fit quadratic regression model
model_quadratic <- lm(well_being ~ screen_time + I(screen_time^2), data = data)
summary(model_quadratic)
##
## Call:
## lm(formula = well_being ~ screen_time + I(screen_time^2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.1640 -7.6753 0.7098 7.5463 26.6336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.64414 2.28228 19.123 < 2e-16 ***
## screen_time 5.56017 0.90357 6.154 4.16e-09 ***
## I(screen_time^2) -0.95428 0.07504 -12.717 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.54 on 197 degrees of freedom
## Multiple R-squared: 0.7988, Adjusted R-squared: 0.7968
## F-statistic: 391.1 on 2 and 197 DF, p-value: < 2.2e-16
Alternatively, we can use the poly() function, which creates orthogonal polynomials. Orthogonal polynomials reduce multicollinearity between \(X\) and \(X^2\), improving numerical stability:
# Fit using orthogonal polynomials
model_poly <- lm(well_being ~ poly(screen_time, 2, raw = FALSE), data = data)
summary(model_poly)
##
## Call:
## lm(formula = well_being ~ poly(screen_time, 2, raw = FALSE),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.1640 -7.6753 0.7098 7.5463 26.6336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.0925 0.7455 44.39 <2e-16 ***
## poly(screen_time, 2, raw = FALSE)1 -262.6061 10.5429 -24.91 <2e-16 ***
## poly(screen_time, 2, raw = FALSE)2 -134.0733 10.5429 -12.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.54 on 197 degrees of freedom
## Multiple R-squared: 0.7988, Adjusted R-squared: 0.7968
## F-statistic: 391.1 on 2 and 197 DF, p-value: < 2.2e-16
The raw = FALSE argument (the default) creates orthogonal polynomials; setting raw = TRUE gives raw polynomials equivalent to the first approach. Raw polynomials are easier to interpret, while orthogonal polynomials are preferred for numerical stability.
For this tutorial, we focus on the raw polynomial model for easier interpretation.
To demonstrate the importance of the quadratic term, we also fit a simple linear model and compare:
# Fit linear model (for comparison)
model_linear <- lm(well_being ~ screen_time, data = data)
summary(model_linear)
##
## Call:
## lm(formula = well_being ~ screen_time, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.903 -10.396 -1.098 12.010 30.924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.5174 2.0192 32.45 <2e-16 ***
## screen_time -5.5726 0.3011 -18.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.19 on 198 degrees of freedom
## Multiple R-squared: 0.6336, Adjusted R-squared: 0.6318
## F-statistic: 342.5 on 1 and 198 DF, p-value: < 2.2e-16
We can compare models using \(R^2\), AIC, or an F-test. The quadratic model should fit much better if the relationship is truly curved.
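The F-test comparison appears later in this section; AIC is not shown in the model summaries, but a one-line comparison is straightforward (lower AIC is better):

# Compare the two models by AIC; lower values indicate a better fit-complexity trade-off
AIC(model_linear, model_quadratic)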
Let’s extract and interpret the coefficients from the quadratic model.
# Extract coefficients
coef_quadratic <- coef(model_quadratic)
coef_quadratic
## (Intercept) screen_time I(screen_time^2)
## 43.644138 5.560166 -0.954281
# Calculate optimal screen time
optimal_time <- -coef_quadratic[2] / (2 * coef_quadratic[3])
cat("Optimal screen time:", round(optimal_time, 2), "hours\n")
## Optimal screen time: 2.91 hours
# Predict well-being at optimal screen time
optimal_wellbeing <- predict(model_quadratic, newdata = data.frame(screen_time = optimal_time))
cat("Predicted well-being at optimal screen time:", round(optimal_wellbeing, 2), "\n")
## Predicted well-being at optimal screen time: 51.74
The estimated coefficients are: intercept \(\beta_0 = 43.64\), linear term \(\beta_1 = 5.56\), and quadratic term \(\beta_2 = -0.95\). These deviate noticeably from the simulation values (\(\beta_0 = 30\), \(\beta_1 = 15\), \(\beta_2 = -2\)): clamping well-being to the [0, 100] range truncates the true curve at the extremes, which shrinks the estimated curvature, and random noise adds further variation. The fitted curve nevertheless recovers the qualitative inverted U-shape.
Interpreting the quadratic coefficient: The negative quadratic coefficient (\(\beta_2 = -0.95\)) confirms an inverted U-shaped relationship. Each additional hour of screen time has a diminishing effect on well-being, eventually becoming negative. The relationship is concave down, indicating diminishing returns.
Optimal screen time: The turning point occurs at \(2.91\) hours, where well-being reaches its maximum predicted value of \(51.74\) points. Beyond this point, additional screen time reduces well-being.
Statistical significance: The quadratic term is highly significant (\(p < 0.001\), \(t = -12.72\)), confirming that the non-linear model is necessary. The linear term is also significant (\(p < 0.001\), \(t = 6.15\)), indicating an initial positive slope.
We compare \(R^2\) values:
r2_linear <- summary(model_linear)$r.squared
r2_quadratic <- summary(model_quadratic)$r.squared
cat("Linear model R²:", round(r2_linear, 3), "\n")
## Linear model R²: 0.634
cat("Quadratic model R²:", round(r2_quadratic, 3), "\n")
## Quadratic model R²: 0.799
cat("Improvement:", round(r2_quadratic - r2_linear, 3), "\n")
## Improvement: 0.165
The linear model explains \(R^2 = 0.634\) (63.4%) of the variance in well-being. The quadratic model explains \(R^2 = 0.799\) (79.9%), a substantial improvement of 16.5 percentage points. This shows that capturing the curvature substantially improves the fit; the F-test below makes the comparison formal.
We can also use an F-test to formally compare the models:
anova(model_linear, model_quadratic)
## Analysis of Variance Table
##
## Model 1: well_being ~ screen_time
## Model 2: well_being ~ screen_time + I(screen_time^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 198 39873
## 2 197 21897 1 17976 161.72 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A significant F-test (\(p < 0.05\)) confirms that the quadratic model fits significantly better than the linear model.
We overlay the fitted quadratic curve on the scatterplot to visualize how well the model captures the non-linear relationship.
# Create prediction data
screen_time_seq <- seq(min(data$screen_time), max(data$screen_time), length.out = 200)
predicted_wellbeing <- predict(model_quadratic, newdata = data.frame(screen_time = screen_time_seq), interval = "confidence")
pred_data <- data.frame(screen_time = screen_time_seq,
fit = predicted_wellbeing[, "fit"],
lwr = predicted_wellbeing[, "lwr"],
upr = predicted_wellbeing[, "upr"])
# Plot data with fitted curve
ggplot(data, aes(x = screen_time, y = well_being)) +
geom_point(alpha = 0.6, size = 2.5, color = "steelblue") +
geom_ribbon(data = pred_data, aes(x = screen_time, ymin = lwr, ymax = upr), fill = "pink", alpha = 0.3, inherit.aes = FALSE) +
geom_line(data = pred_data, aes(y = fit), color = "darkred", linewidth = 1.2) +
labs(title = "Polynomial Regression: Screen Time and Well-Being",
x = "Daily Screen Time (hours)",
y = "Well-Being Score (0-100)") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
The fitted curve should follow the inverted U-shape, peaking near the estimated optimum of about 2.9 hours and declining at both extremes. The shaded confidence band shows uncertainty in the fitted values. If the curve fits the data well, points should scatter evenly around the curve without systematic deviations.
To highlight the improvement, we compare the linear and quadratic fits on the same plot:
# Predictions from linear model
predicted_linear <- predict(model_linear, newdata = data.frame(screen_time = screen_time_seq))
pred_linear_data <- data.frame(screen_time = screen_time_seq, fit = predicted_linear)
# Plot comparison
ggplot(data, aes(x = screen_time, y = well_being)) +
geom_point(alpha = 0.4, size = 2, color = "gray50") +
geom_line(data = pred_linear_data, aes(y = fit), color = "blue", linetype = "dashed", linewidth = 1, alpha = 0.7) +
geom_line(data = pred_data, aes(y = fit), color = "darkred", linewidth = 1.2) +
annotate("text", x = 10, y = 40, label = "Linear Model", color = "blue", fontface = "bold") +
annotate("text", x = 10, y = 50, label = "Quadratic Model", color = "darkred", fontface = "bold") +
labs(title = "Linear vs Quadratic Model Comparison",
x = "Daily Screen Time (hours)",
y = "Well-Being Score (0-100)") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
The linear model fails to capture the curvature, systematically underestimating well-being at moderate screen time and overestimating at extremes. The quadratic model follows the data much more closely.
Once we have a fitted polynomial model, we can predict well-being for specific screen time values.
# Predict well-being at 2, 4, and 8 hours of screen time
new_screen_times <- data.frame(screen_time = c(2, 4, 8))
predictions <- predict(model_quadratic, newdata = new_screen_times, interval = "confidence", level = 0.95)
cbind(new_screen_times, predictions)
## screen_time fit lwr upr
## 1 2 50.94735 48.64909 53.24560
## 2 4 50.61630 48.58694 52.64567
## 3 8 27.05148 25.05473 29.04823
The predictions show expected well-being at different screen time levels, along with 95% confidence intervals. We interpret these as: “For individuals with 2 hours of daily screen time, we expect a well-being score of [value], with 95% confidence that the true mean lies between [lwr] and [upr].”
For individual predictions (not just the mean), we use prediction intervals:
predictions_indiv <- predict(model_quadratic, newdata = new_screen_times, interval = "prediction", level = 0.95)
cbind(new_screen_times, predictions_indiv)
## screen_time fit lwr upr
## 1 2 50.94735 30.029307 71.86538
## 2 4 50.61630 29.726099 71.50651
## 3 8 27.05148 6.164417 47.93854
Prediction intervals are wider because they account for individual variability around the mean.
Be extremely cautious when predicting outside the observed data range. Polynomial models can behave erratically beyond the data. For example, predicting well-being at 20 hours of screen time may yield unrealistic values because the quadratic curve continues declining without bound. Always restrict predictions to the range of observed data unless there is strong theoretical justification.
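As a concrete sketch using the coefficients estimated above, predicting at 20 hours illustrates the problem:

# Extrapolating to 20 hours, far outside the observed 0-12 hour range
predict(model_quadratic, newdata = data.frame(screen_time = 20))
# roughly 43.64 + 5.56 * 20 - 0.954 * 20^2, i.e., about -227: an impossible well-being score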
Polynomial regression relies on the same assumptions as linear regression: linearity (in parameters), independence, homoscedasticity, and normality of residuals. We check these using diagnostic plots.
data$residuals <- residuals(model_quadratic)
data$fitted <- fitted(model_quadratic)
ggplot(data, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Residuals vs Fitted Values",
x = "Fitted Values",
y = "Residuals") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
What to look for: Residuals should scatter randomly around zero with no patterns. If residuals show curvature, a higher-degree polynomial or a different model may be needed. Funnel shapes indicate heteroscedasticity.
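If a formal check alongside the plot is desired, the Breusch-Pagan test from the lmtest package (assumed installed here) tests for heteroscedasticity:

# Breusch-Pagan test; a small p-value suggests non-constant residual variance
library(lmtest)
bptest(model_quadratic)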
ggplot(data, aes(sample = residuals)) +
stat_qq(color = "steelblue", alpha = 0.6) +
stat_qq_line(color = "darkred", linetype = "dashed") +
labs(title = "Q-Q Plot of Residuals",
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
What to look for: Points should lie close to the diagonal line. Deviations at the extremes indicate heavy tails or skewness. With large samples, mild departures are usually acceptable.
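As a formal complement to the Q-Q plot, the Shapiro-Wilk test in base R can be applied to the residuals (the test accepts samples of 3 to 5000 observations):

# Shapiro-Wilk test; a small p-value suggests non-normal residuals
shapiro.test(residuals(model_quadratic))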
data$cooks_d <- cooks.distance(model_quadratic)
ggplot(data, aes(x = seq_len(nrow(data)), y = cooks_d)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_hline(yintercept = 4/nrow(data), linetype = "dashed", color = "red") +
labs(title = "Cook's Distance for Each Observation",
x = "Observation Index",
y = "Cook's Distance") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
Observations with Cook’s distance greater than \(4/n\) are potentially influential. Investigate these points to determine if they are errors or legitimate outliers.
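A short sketch for flagging and inspecting such observations:

# Identify observations whose Cook's distance exceeds the 4/n rule of thumb
influential <- which(data$cooks_d > 4 / nrow(data))
data[influential, c("screen_time", "well_being")]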
When reporting polynomial regression results, clearly describe the model, coefficients, fit statistics, and practical interpretation.
A quadratic regression analysis was conducted to examine the relationship between daily screen time and subjective well-being in a sample of 200 adults. The quadratic model provided a significantly better fit than a linear model, \(F(1, 197) = 161.72\), \(p < .001\). The quadratic term was statistically significant, \(\beta_2 = -0.95\), \(t(197) = -12.72\), \(p < .001\), indicating a non-linear relationship. The model explained 79.9% of the variance in well-being scores, \(R^2 = .799\), \(F(2, 197) = 391.1\), \(p < .001\).
The fitted model revealed an inverted U-shaped relationship. Well-being increased with screen time up to approximately 2.91 hours per day, where it reached a predicted maximum of 51.74 points. Beyond this optimal level, additional screen time was associated with declining well-being. These findings suggest that moderate screen use may enhance well-being through social connection and information access, but excessive use (beyond 2.91 hours) may interfere with sleep, physical activity, and face-to-face interactions, leading to reduced well-being.
This format includes:

- a formal model comparison justifying the quadratic term (the F-test)
- the key coefficient with its test statistic and p-value
- overall fit statistics (\(R^2\) and the omnibus F-test)
- the turning point and its substantive interpretation
Always accompany results with a plot showing the data and fitted curve.
Start with a quadratic model (degree 2) if you expect one turning point. Use a cubic model (degree 3) if you expect two turning points. Avoid polynomials beyond degree 3 unless there is strong theoretical justification. Higher-degree polynomials overfit and generalize poorly.
Use model comparison tools (AIC, BIC, cross-validation) to choose the appropriate degree. Simpler models (lower degrees) are preferred unless the added complexity significantly improves fit.
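As a minimal sketch of such a comparison on the simulated data, we can fit degrees 1 through 3 and extract AIC and BIC (lower is better):

# Fit polynomial models of degree 1-3 and compare information criteria
fits <- lapply(1:3, function(d) lm(well_being ~ poly(screen_time, d), data = data))
data.frame(degree = 1:3, AIC = sapply(fits, AIC), BIC = sapply(fits, BIC))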
Raw polynomial terms (\(X\), \(X^2\), \(X^3\)) are highly correlated, causing multicollinearity. This inflates standard errors and makes coefficients unstable. Solutions include:
- Centering the predictor (subtracting its mean) before creating the squared term
- Using orthogonal polynomials via poly(X, degree, raw = FALSE) in R

Orthogonal polynomials are preferred for numerical stability, but raw polynomials are easier to interpret.
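To see the multicollinearity directly, a quick check on the simulated data: the raw terms are strongly correlated, while the orthogonal polynomial columns are uncorrelated by construction.

# Correlation between raw X and X^2, versus the orthogonal polynomial columns
cor(data$screen_time, data$screen_time^2)
round(cor(poly(data$screen_time, 2)), 10)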
If polynomial regression does not fit well, consider:

- transforming the predictor or outcome (e.g., a log transformation)
- regression splines or generalized additive models (GAMs), which fit flexible curves locally
- non-linear regression when theory specifies a particular functional form
Polynomial regression requires larger samples than linear regression because you estimate more parameters. A rough rule is at least 10-20 observations per parameter. For a quadratic model with 3 parameters, aim for \(n \geq 50\). For a cubic model, \(n \geq 100\) is safer.
Unlike linear regression, polynomial coefficients do not have simple interpretations. You cannot say "a one-unit increase in \(X\) increases \(Y\) by \(\beta_1\)" because the effect depends on the current value of \(X\). Instead, focus on:

- the sign and significance of the highest-order term, which determine the shape of the curve
- the location of turning points, such as \(X_{\text{optimal}}\)
- predicted values and instantaneous slopes at substantively meaningful values of \(X\) (sketched below)
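For instance, using the derivative \(\frac{dY}{dX} = \beta_1 + 2\beta_2 X\) from earlier, a small sketch computes the instantaneous slope at several screen time values, reusing coef_quadratic from above:

# Instantaneous change in well-being per additional hour, at selected screen times
slope_at <- function(x) unname(coef_quadratic[2] + 2 * coef_quadratic[3] * x)
round(sapply(c(1, 3, 6, 9), slope_at), 2)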
Always visualize the fitted curve to communicate results clearly.
Polynomial regression extends linear regression to model non-linear relationships by adding polynomial terms to the predictor. It is a powerful tool for capturing curves, peaks, and valleys in data, allowing researchers to go beyond the assumption of constant rates of change.
While polynomial regression is flexible, it requires careful use. Always visualize the data first, choose the polynomial degree thoughtfully, check assumptions, and avoid extrapolation. Coefficients are harder to interpret than in linear regression, so emphasize graphical presentation and predictions at meaningful predictor values.
When used appropriately, polynomial regression reveals important patterns in behavioral, health, and social science data, such as optimal intervention levels, tipping points, and non-monotonic relationships. It forms a bridge between simple linear models and more advanced non-linear techniques, equipping you with the tools to handle the complexity of real-world data.
With practice, polynomial regression becomes a valuable addition to your analytical toolkit, enabling you to model and interpret the rich, non-linear patterns that characterize human behavior and health outcomes.