Imagine you are a health psychologist studying the relationship between daily screen time and subjective well-being. You collect data from 200 adults who report their average daily screen time (in hours) and complete a well-being questionnaire yielding scores from 0 to 100.
At first glance, you expect a simple negative relationship: more screen time, lower well-being. However, as you plot the data, something more nuanced emerges. People with very little screen time (less than 1 hour per day) report only moderate well-being, perhaps because they miss out on social connection and information. As screen time increases to moderate levels (2-4 hours), well-being improves, as people likely benefit from social media interaction and entertainment. But beyond about 6 hours, well-being declines sharply as excessive screen use displaces sleep, exercise, and face-to-face interaction.
This creates a curved, inverted U-shaped relationship. A straight line cannot capture this pattern. The relationship is non-linear, and the rate of change in well-being varies across different levels of screen time. This is exactly where polynomial regression becomes essential. By adding polynomial terms (squared, cubed) to the regression model, we can fit curves that capture these complex, non-linear relationships.
Polynomial regression is an extension of linear regression that models non-linear relationships between a predictor and an outcome by including polynomial terms of the predictor. While simple linear regression fits a straight line, polynomial regression fits curves such as parabolas (quadratic), S-shapes (cubic), or more complex patterns.
The key insight is that although the relationship between \(X\) and \(Y\) is non-linear, the model remains linear in its parameters. We transform the predictor by raising it to powers (e.g., \(X^2\), \(X^3\)), but we still estimate coefficients using ordinary least squares, just as in linear regression.
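To make "linear in parameters" concrete, here is a minimal illustration with a toy vector (not the study data): the design matrix of a quadratic model simply contains columns for the intercept, \(X\), and \(X^2\), and OLS estimates the coefficients exactly as in ordinary linear regression.

# Toy illustration: the design matrix of a quadratic model has columns 1, x, x^2
x <- 1:5
model.matrix(~ x + I(x^2))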
Polynomial regression is widely used in psychology, medicine, economics, and environmental science to model phenomena such as:

- inverted U-shaped relationships (e.g., arousal and performance, or screen time and well-being)
- dose-response curves that level off or reverse at high doses
- growth or learning curves with diminishing returns
The most common polynomial models are:

- Quadratic (degree 2): one bend, capturing U-shaped or inverted U-shaped patterns
- Cubic (degree 3): two bends, capturing S-shaped patterns
A polynomial regression model of degree \(p\) can be written as:
\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3 + \cdots + \beta_p X_i^p + \varepsilon_i \]
where \(Y_i\) is the outcome for individual \(i\), \(X_i\) is the predictor value, \(\beta_0\) is the intercept, \(\beta_1, \beta_2, \ldots, \beta_p\) are the polynomial coefficients, and \(\varepsilon_i\) is the random error term assumed to follow:
\[ \varepsilon_i \sim N(0, \sigma^2) \]
For our screen time example, we use a quadratic model (degree 2):
\[ \text{WellBeing}_i = \beta_0 + \beta_1 \cdot \text{ScreenTime}_i + \beta_2 \cdot \text{ScreenTime}_i^2 + \varepsilon_i \]
The intercept \(\beta_0\) represents the expected well-being when screen time equals zero. This may or may not be meaningful depending on the context.
The linear coefficient \(\beta_1\) captures the initial rate of change. If \(\beta_1 > 0\), well-being initially increases with screen time. If \(\beta_1 < 0\), it initially decreases.
The quadratic coefficient \(\beta_2\) determines the curvature:

- If \(\beta_2 < 0\), the curve is concave (inverted U-shape): the slope decreases as \(X\) increases.
- If \(\beta_2 > 0\), the curve is convex (U-shape): the slope increases as \(X\) increases.
The quadratic term allows the slope to change as \(X\) increases. The instantaneous slope at any point \(X\) is:
\[ \frac{dY}{dX} = \beta_1 + 2\beta_2 X \]
This shows that the effect of screen time on well-being is not constant—it depends on the current level of screen time.
For a quadratic model, the turning point (maximum or minimum) occurs where the derivative equals zero:
\[ \frac{dY}{dX} = \beta_1 + 2\beta_2 X = 0 \]
Solving for \(X\):
\[ X_{\text{optimal}} = -\frac{\beta_1}{2\beta_2} \]
If \(\beta_2 < 0\), this is a maximum (peak well-being). If \(\beta_2 > 0\), it is a minimum (lowest well-being).
Linear regression assumes a constant rate of change: each additional hour of screen time has the same effect on well-being. This is often unrealistic. Polynomial regression allows the effect to vary, capturing accelerating, decelerating, or reversing trends.
However, polynomial models have limitations:

- Overfitting: high-degree polynomials chase noise and generalize poorly.
- Poor extrapolation: fitted curves can behave erratically outside the observed data range.
- Multicollinearity: raw polynomial terms (\(X\), \(X^2\), \(X^3\)) are highly correlated.
Always visualize the fitted curve and avoid using polynomials beyond degree 3 unless there is strong theoretical justification.
We simulate data for 200 adults to illustrate polynomial regression. Screen time ranges from 0 to 12 hours per day, and well-being follows an inverted U-shape with a peak around 3-4 hours.
We use a quadratic relationship with noise added to mimic real-world variability.
set.seed(789) # For reproducibility
n <- 200
# Predictor: daily screen time (hours)
screen_time <- runif(n, min = 0, max = 12)
# True quadratic relationship: inverted U-shape
beta_0 <- 30 # Intercept
beta_1 <- 15 # Linear term (positive initial slope)
beta_2 <- -2 # Quadratic term (negative curvature)
# Generate well-being scores with noise
well_being <- beta_0 + beta_1 * screen_time + beta_2 * screen_time^2 + rnorm(n, mean = 0, sd = 8)
# Clamp well-being to the realistic range [0, 100]; note this truncation flattens the curve at the extremes, so fitted coefficients will deviate from the true values
well_being <- pmax(0, pmin(100, well_being))
# Create data frame
data <- data.frame(screen_time = screen_time, well_being = well_being)
# View first few rows
head(data)
## screen_time well_being
## 1 8.3987324 10.53835
## 2 1.1219865 41.12034
## 3 0.1426418 36.78198
## 4 7.0992762 37.00060
## 5 5.9057932 49.95831
## 6 0.2419629 38.08662
This dataset has a built-in quadratic relationship. The optimal screen time is:
\[ X_{\text{optimal}} = -\frac{15}{2 \times (-2)} = \frac{15}{4} = 3.75 \text{ hours} \]
We expect the fitted model to recover coefficients close to \(\beta_0 = 30\), \(\beta_1 = 15\), and \(\beta_2 = -2\).
Before fitting any model, we visualize the data with a scatterplot. This reveals the non-linear pattern and helps us decide whether a polynomial model is appropriate.
library(ggplot2)
ggplot(data, aes(x = screen_time, y = well_being)) +
geom_point(alpha = 0.6, size = 2.5, color = "steelblue") +
labs(title = "Relationship Between Screen Time and Well-Being",
x = "Daily Screen Time (hours)",
y = "Well-Being Score (0-100)") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
The scatterplot should show an inverted U-shape: well-being increases at low screen time, peaks around 3-4 hours, and declines at high screen time. This curvature confirms that a linear model would be inadequate, and a quadratic model is needed.
We fit the quadratic model using the lm() function. To include the quadratic term, we add I(screen_time^2) to the formula. The I() function tells R to treat ^ as arithmetic exponentiation; without it, screen_time^2 would be parsed as a formula operator (crossing) and the model would silently reduce to the plain linear term.
# Fit quadratic regression model
model_quadratic <- lm(well_being ~ screen_time + I(screen_time^2), data = data)
summary(model_quadratic)
##
## Call:
## lm(formula = well_being ~ screen_time + I(screen_time^2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.1640 -7.6753 0.7098 7.5463 26.6336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.64414 2.28228 19.123 < 2e-16 ***
## screen_time 5.56017 0.90357 6.154 4.16e-09 ***
## I(screen_time^2) -0.95428 0.07504 -12.717 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.54 on 197 degrees of freedom
## Multiple R-squared: 0.7988, Adjusted R-squared: 0.7968
## F-statistic: 391.1 on 2 and 197 DF, p-value: < 2.2e-16
Alternatively, we can use the poly() function, which creates orthogonal polynomials. Orthogonal polynomials reduce multicollinearity between \(X\) and \(X^2\), improving numerical stability:
# Fit using orthogonal polynomials
model_poly <- lm(well_being ~ poly(screen_time, 2, raw = FALSE), data = data)
summary(model_poly)
##
## Call:
## lm(formula = well_being ~ poly(screen_time, 2, raw = FALSE),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.1640 -7.6753 0.7098 7.5463 26.6336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.0925 0.7455 44.39 <2e-16 ***
## poly(screen_time, 2, raw = FALSE)1 -262.6061 10.5429 -24.91 <2e-16 ***
## poly(screen_time, 2, raw = FALSE)2 -134.0733 10.5429 -12.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.54 on 197 degrees of freedom
## Multiple R-squared: 0.7988, Adjusted R-squared: 0.7968
## F-statistic: 391.1 on 2 and 197 DF, p-value: < 2.2e-16
The raw = FALSE argument (the default) creates orthogonal polynomials; setting raw = TRUE gives raw polynomials equivalent to the first approach. Raw polynomials are easier to interpret, while orthogonal polynomials are preferred for numerical stability.
For this tutorial, we focus on the raw polynomial model for easier interpretation.
To demonstrate the importance of the quadratic term, we also fit a simple linear model and compare:
# Fit linear model (for comparison)
model_linear <- lm(well_being ~ screen_time, data = data)
summary(model_linear)
##
## Call:
## lm(formula = well_being ~ screen_time, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.903 -10.396 -1.098 12.010 30.924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.5174 2.0192 32.45 <2e-16 ***
## screen_time -5.5726 0.3011 -18.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.19 on 198 degrees of freedom
## Multiple R-squared: 0.6336, Adjusted R-squared: 0.6318
## F-statistic: 342.5 on 1 and 198 DF, p-value: < 2.2e-16
We can compare models using \(R^2\), AIC, or an F-test. The quadratic model should fit much better if the relationship is truly curved.
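The F-test comparison appears later in this section; AIC is not shown in the model summaries, but a one-line comparison is straightforward (lower AIC is better):

# Compare the two models by AIC; lower values indicate a better fit-complexity trade-off
AIC(model_linear, model_quadratic)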
Let’s extract and interpret the coefficients from the quadratic model.
# Extract coefficients
coef_quadratic <- coef(model_quadratic)
coef_quadratic
## (Intercept) screen_time I(screen_time^2)
## 43.644138 5.560166 -0.954281
# Calculate optimal screen time
optimal_time <- -coef_quadratic[2] / (2 * coef_quadratic[3])
cat("Optimal screen time:", round(optimal_time, 2), "hours\n")
## Optimal screen time: 2.91 hours
# Predict well-being at optimal screen time
optimal_wellbeing <- predict(model_quadratic, newdata = data.frame(screen_time = optimal_time))
cat("Predicted well-being at optimal screen time:", round(optimal_wellbeing, 2), "\n")
## Predicted well-being at optimal screen time: 51.74
The estimated coefficients are: intercept \(\beta_0 = 43.64\), linear term \(\beta_1 = 5.56\), and quadratic term \(\beta_2 = -0.95\). These deviate noticeably from the simulation values (\(\beta_0 = 30\), \(\beta_1 = 15\), \(\beta_2 = -2\)): clamping well-being to the [0, 100] range truncates the true curve at the extremes, which shrinks the estimated curvature, and random noise adds further variation. The fitted curve nevertheless recovers the qualitative inverted U-shape.
Interpreting the quadratic coefficient: The negative quadratic coefficient (\(\beta_2 = -0.95\)) confirms an inverted U-shaped relationship. Each additional hour of screen time has a diminishing effect on well-being, eventually becoming negative. The relationship is concave down, indicating diminishing returns.
Optimal screen time: The turning point occurs at \(2.91\) hours, where well-being reaches its maximum predicted value of \(51.74\) points. Beyond this point, additional screen time reduces well-being.
Statistical significance: The quadratic term is highly significant (\(p < 0.001\), \(t = -12.72\)), confirming that the non-linear model is necessary. The linear term is also significant (\(p < 0.001\), \(t = 6.15\)), indicating an initial positive slope.
We compare \(R^2\) values:
r2_linear <- summary(model_linear)$r.squared
r2_quadratic <- summary(model_quadratic)$r.squared
cat("Linear model R²:", round(r2_linear, 3), "\n")
## Linear model R²: 0.634
cat("Quadratic model R²:", round(r2_quadratic, 3), "\n")
## Quadratic model R²: 0.799
cat("Improvement:", round(r2_quadratic - r2_linear, 3), "\n")
## Improvement: 0.165
The linear model explains \(R^2 = 0.634\) (63.4%) of the variance in well-being. The quadratic model explains \(R^2 = 0.799\) (79.9%), a substantial improvement of 16.5 percentage points. This shows that capturing the curvature substantially improves the fit; the F-test below makes the comparison formal.
We can also use an F-test to formally compare the models:
anova(model_linear, model_quadratic)
## Analysis of Variance Table
##
## Model 1: well_being ~ screen_time
## Model 2: well_being ~ screen_time + I(screen_time^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 198 39873
## 2 197 21897 1 17976 161.72 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A significant F-test (\(p < 0.05\)) confirms that the quadratic model fits significantly better than the linear model.
We overlay the fitted quadratic curve on the scatterplot to visualize how well the model captures the non-linear relationship.
# Create prediction data
screen_time_seq <- seq(min(data$screen_time), max(data$screen_time), length.out = 200)
predicted_wellbeing <- predict(model_quadratic, newdata = data.frame(screen_time = screen_time_seq), interval = "confidence")
pred_data <- data.frame(screen_time = screen_time_seq,
fit = predicted_wellbeing[, "fit"],
lwr = predicted_wellbeing[, "lwr"],
upr = predicted_wellbeing[, "upr"])
# Plot data with fitted curve
ggplot(data, aes(x = screen_time, y = well_being)) +
geom_point(alpha = 0.6, size = 2.5, color = "steelblue") +
geom_ribbon(data = pred_data, aes(x = screen_time, ymin = lwr, ymax = upr), fill = "pink", alpha = 0.3, inherit.aes = FALSE) +
geom_line(data = pred_data, aes(y = fit), color = "darkred", linewidth = 1.2) +
labs(title = "Polynomial Regression: Screen Time and Well-Being",
x = "Daily Screen Time (hours)",
y = "Well-Being Score (0-100)") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
The fitted curve should follow the inverted U-shape, peaking near the estimated optimum of about 2.9 hours and declining at both extremes. The shaded confidence band shows uncertainty in the fitted values. If the curve fits the data well, points should scatter evenly around the curve without systematic deviations.
To highlight the improvement, we compare the linear and quadratic fits on the same plot:
# Predictions from linear model
predicted_linear <- predict(model_linear, newdata = data.frame(screen_time = screen_time_seq))
pred_linear_data <- data.frame(screen_time = screen_time_seq, fit = predicted_linear)
# Plot comparison
ggplot(data, aes(x = screen_time, y = well_being)) +
geom_point(alpha = 0.4, size = 2, color = "gray50") +
geom_line(data = pred_linear_data, aes(y = fit), color = "blue", linetype = "dashed", linewidth = 1, alpha = 0.7) +
geom_line(data = pred_data, aes(y = fit), color = "darkred", linewidth = 1.2) +
annotate("text", x = 10, y = 40, label = "Linear Model", color = "blue", fontface = "bold") +
annotate("text", x = 10, y = 50, label = "Quadratic Model", color = "darkred", fontface = "bold") +
labs(title = "Linear vs Quadratic Model Comparison",
x = "Daily Screen Time (hours)",
y = "Well-Being Score (0-100)") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
The linear model fails to capture the curvature, systematically underestimating well-being at moderate screen time and overestimating at extremes. The quadratic model follows the data much more closely.
Once we have a fitted polynomial model, we can predict well-being for specific screen time values.
# Predict well-being at 2, 4, and 8 hours of screen time
new_screen_times <- data.frame(screen_time = c(2, 4, 8))
predictions <- predict(model_quadratic, newdata = new_screen_times, interval = "confidence", level = 0.95)
cbind(new_screen_times, predictions)
## screen_time fit lwr upr
## 1 2 50.94735 48.64909 53.24560
## 2 4 50.61630 48.58694 52.64567
## 3 8 27.05148 25.05473 29.04823
The predictions show expected well-being at different screen time levels, along with 95% confidence intervals. We interpret these as: “For individuals with 2 hours of daily screen time, we expect a well-being score of [value], with 95% confidence that the true mean lies between [lwr] and [upr].”
For individual predictions (not just the mean), we use prediction intervals:
predictions_indiv <- predict(model_quadratic, newdata = new_screen_times, interval = "prediction", level = 0.95)
cbind(new_screen_times, predictions_indiv)
## screen_time fit lwr upr
## 1 2 50.94735 30.029307 71.86538
## 2 4 50.61630 29.726099 71.50651
## 3 8 27.05148 6.164417 47.93854
Prediction intervals are wider because they account for individual variability around the mean.
Be extremely cautious when predicting outside the observed data range. Polynomial models can behave erratically beyond the data. For example, predicting well-being at 20 hours of screen time may yield unrealistic values because the quadratic curve continues declining without bound. Always restrict predictions to the range of observed data unless there is strong theoretical justification.
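As a concrete sketch using the coefficients estimated above, predicting at 20 hours illustrates the problem:

# Extrapolating to 20 hours, far outside the observed 0-12 hour range
predict(model_quadratic, newdata = data.frame(screen_time = 20))
# roughly 43.64 + 5.56 * 20 - 0.954 * 20^2, i.e., about -227: an impossible well-being score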
Polynomial regression relies on the same assumptions as linear regression: linearity (in parameters), independence, homoscedasticity, and normality of residuals. We check these using diagnostic plots.
data$residuals <- residuals(model_quadratic)
data$fitted <- fitted(model_quadratic)
ggplot(data, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Residuals vs Fitted Values",
x = "Fitted Values",
y = "Residuals") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
What to look for: Residuals should scatter randomly around zero with no patterns. If residuals show curvature, a higher-degree polynomial or a different model may be needed. Funnel shapes indicate heteroscedasticity.
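If a formal check alongside the plot is desired, the Breusch-Pagan test from the lmtest package (assumed installed here) tests for heteroscedasticity:

# Breusch-Pagan test; a small p-value suggests non-constant residual variance
library(lmtest)
bptest(model_quadratic)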
ggplot(data, aes(sample = residuals)) +
stat_qq(color = "steelblue", alpha = 0.6) +
stat_qq_line(color = "darkred", linetype = "dashed") +
labs(title = "Q-Q Plot of Residuals",
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
What to look for: Points should lie close to the diagonal line. Deviations at the extremes indicate heavy tails or skewness. With large samples, mild departures are usually acceptable.
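As a formal complement to the Q-Q plot, the Shapiro-Wilk test in base R can be applied to the residuals (the test accepts samples of 3 to 5000 observations):

# Shapiro-Wilk test; a small p-value suggests non-normal residuals
shapiro.test(residuals(model_quadratic))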
data$cooks_d <- cooks.distance(model_quadratic)
ggplot(data, aes(x = seq_len(nrow(data)), y = cooks_d)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_hline(yintercept = 4/nrow(data), linetype = "dashed", color = "red") +
labs(title = "Cook's Distance for Each Observation",
x = "Observation Index",
y = "Cook's Distance") +
theme_minimal() +
theme(
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
)
Observations with Cook’s distance greater than \(4/n\) are potentially influential. Investigate these points to determine if they are errors or legitimate outliers.
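A short sketch for flagging and inspecting such observations:

# Identify observations whose Cook's distance exceeds the 4/n rule of thumb
influential <- which(data$cooks_d > 4 / nrow(data))
data[influential, c("screen_time", "well_being")]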
When reporting polynomial regression results, clearly describe the model, coefficients, fit statistics, and practical interpretation.
A quadratic regression analysis was conducted to examine the relationship between daily screen time and subjective well-being in a sample of 200 adults. The quadratic model provided a significantly better fit than a linear model, \(F(1, 197) = 161.72\), \(p < .001\). The quadratic term was statistically significant, \(\beta_2 = -0.95\), \(t(197) = -12.72\), \(p < .001\), indicating a non-linear relationship. The model explained 79.9% of the variance in well-being scores, \(R^2 = .799\), \(F(2, 197) = 391.1\), \(p < .001\).
The fitted model revealed an inverted U-shaped relationship. Well-being increased with screen time up to approximately 2.91 hours per day, where it reached a predicted maximum of 51.74 points. Beyond this optimal level, additional screen time was associated with declining well-being. These findings suggest that moderate screen use may enhance well-being through social connection and information access, but excessive use (beyond 2.91 hours) may interfere with sleep, physical activity, and face-to-face interactions, leading to reduced well-being.
This format includes:

- a formal model comparison justifying the quadratic term (the F-test)
- the key coefficient with its test statistic and p-value
- overall fit statistics (\(R^2\) and the omnibus F-test)
- the turning point and its substantive interpretation
Always accompany results with a plot showing the data and fitted curve.
Start with a quadratic model (degree 2) if you expect one turning point. Use a cubic model (degree 3) if you expect two turning points. Avoid polynomials beyond degree 3 unless there is strong theoretical justification. Higher-degree polynomials overfit and generalize poorly.
Use model comparison tools (AIC, BIC, cross-validation) to choose the appropriate degree. Simpler models (lower degrees) are preferred unless the added complexity significantly improves fit.
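As a minimal sketch of such a comparison on the simulated data, we can fit degrees 1 through 3 and extract AIC and BIC (lower is better):

# Fit polynomial models of degree 1-3 and compare information criteria
fits <- lapply(1:3, function(d) lm(well_being ~ poly(screen_time, d), data = data))
data.frame(degree = 1:3, AIC = sapply(fits, AIC), BIC = sapply(fits, BIC))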
Raw polynomial terms (\(X\), \(X^2\), \(X^3\)) are highly correlated, causing multicollinearity. This inflates standard errors and makes coefficients unstable. Solutions include:
- Centering the predictor (subtracting its mean) before creating the squared term
- Using orthogonal polynomials via poly(X, degree, raw = FALSE) in R

Orthogonal polynomials are preferred for numerical stability, but raw polynomials are easier to interpret.
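To see the multicollinearity directly, a quick check on the simulated data: the raw terms are strongly correlated, while the orthogonal polynomial columns are uncorrelated by construction.

# Correlation between raw X and X^2, versus the orthogonal polynomial columns
cor(data$screen_time, data$screen_time^2)
round(cor(poly(data$screen_time, 2)), 10)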
If polynomial regression does not fit well, consider:

- transforming the predictor or outcome (e.g., a log transformation)
- regression splines or generalized additive models (GAMs), which fit flexible curves locally
- non-linear regression when theory specifies a particular functional form
Polynomial regression requires larger samples than linear regression because you estimate more parameters. A rough rule is at least 10-20 observations per parameter. For a quadratic model with 3 parameters, aim for \(n \geq 50\). For a cubic model, \(n \geq 100\) is safer.
Unlike linear regression, polynomial coefficients do not have simple interpretations. You cannot say "a one-unit increase in \(X\) increases \(Y\) by \(\beta_1\)" because the effect depends on the current value of \(X\). Instead, focus on:

- the sign and significance of the highest-order term, which determine the shape of the curve
- the location of turning points, such as \(X_{\text{optimal}}\)
- predicted values and instantaneous slopes at substantively meaningful values of \(X\) (sketched below)
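For instance, using the derivative \(\frac{dY}{dX} = \beta_1 + 2\beta_2 X\) from earlier, a small sketch computes the instantaneous slope at several screen time values, reusing coef_quadratic from above:

# Instantaneous change in well-being per additional hour, at selected screen times
slope_at <- function(x) unname(coef_quadratic[2] + 2 * coef_quadratic[3] * x)
round(sapply(c(1, 3, 6, 9), slope_at), 2)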
Always visualize the fitted curve to communicate results clearly.
Polynomial regression extends linear regression to model non-linear relationships by adding polynomial terms to the predictor. It is a powerful tool for capturing curves, peaks, and valleys in data, allowing researchers to go beyond the assumption of constant rates of change.
While polynomial regression is flexible, it requires careful use. Always visualize the data first, choose the polynomial degree thoughtfully, check assumptions, and avoid extrapolation. Coefficients are harder to interpret than in linear regression, so emphasize graphical presentation and predictions at meaningful predictor values.
When used appropriately, polynomial regression reveals important patterns in behavioral, health, and social science data, such as optimal intervention levels, tipping points, and non-monotonic relationships. It forms a bridge between simple linear models and more advanced non-linear techniques, equipping you with the tools to handle the complexity of real-world data.
With practice, polynomial regression becomes a valuable addition to your analytical toolkit, enabling you to model and interpret the rich, non-linear patterns that characterize human behavior and health outcomes.