Multiple Linear Regression

The general purpose of multiple regression (the term was first used by Pearson, 1908), as a generalization of simple linear regression, is to learn how several independent variables or predictors (IVs) together predict a dependent variable (DV). Multiple regression analysis often focuses on understanding (1) how much variance in the DV a set of IVs explains and (2) the relative importance of the IVs in predicting the DV.

Multiple regression analysis is very widely used in research in the social and natural sciences. Multiple regression allows a researcher to ask (and hopefully answer) the general question "what is the best predictor of ...". For example, educational researchers might want to learn what the best predictors of success in college are. Psychologists may want to determine which personality dimensions best predict social adjustment.

Multiple regression model

A general multiple linear regression model at the population level can be written as

\[y_{i}=\beta_{0}+\beta_{1}x_{1i}+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki}+\varepsilon_{i} \]

  • $y_{i}$: the observed score of individual $i$ on the DV.
  • $x_{1},x_{2},\ldots,x_{k}$ : a set of predictors.
  • $x_{1i}$: the observed score of individual $i$ on IV 1; $x_{ki}$: observed score of individual $i$ on IV $k$.
  • $\beta_{0}$: the intercept at the population level, representing the predicted $y$ score when all the independent variables are 0.
  • $\beta_{1},\ldots,\beta_{k}$: regression coefficients at the population level; $\beta_{1}$ represents the amount the predicted $y$ changes when $x_{1}$ increases by 1 unit while the other IVs are held constant; $\beta_{k}$ represents the amount the predicted $y$ changes when $x_{k}$ increases by 1 unit while the other IVs are held constant.
  • $\varepsilon_{i}$: the unobserved error for individual $i$, with mean 0 and variance $\sigma^{2}$.

Parameter estimation

The least squares method used for simple linear regression can also be used to estimate the parameters in a multiple regression model. The basic idea is to minimize the sum of squared residuals or errors. Let $b_{0},b_{1},\ldots,b_{k}$ represent the estimated regression coefficients. Individual $i$'s residual $e_{i}$ is the difference between the observed $y_{i}$ and the predicted value $\hat{y}_{i}$:

\[ e_{i}=y_{i}-\hat{y}_{i}=y_{i}-b_{0}-b_{1}x_{1i}-\ldots-b_{k}x_{ki}.\]

The sum of squared residuals is

\[ SSE=\sum_{i=1}^{n}e_{i}^{2}=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}. \]

By minimizing $SSE$, the regression coefficient estimates can be obtained in matrix form as

\[ \boldsymbol{b}=(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}=\left(\sum_{i=1}^{n}\boldsymbol{x}_{i}\boldsymbol{x}_{i}'\right)^{-1}\left(\sum_{i=1}^{n}\boldsymbol{x}_{i}y_{i}\right), \]

where $\boldsymbol{X}$ is the $n\times(k+1)$ design matrix whose first column is all 1s (for the intercept), $\boldsymbol{x}_{i}'$ is its $i$th row, and $\boldsymbol{y}$ is the vector of observed outcomes.
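As a minimal numerical sketch of this matrix formula (with a small made-up data set, not from the book), the estimates can be computed directly with numpy; the design matrix includes a column of 1s for the intercept.

>>> import numpy as np
>>> 
>>> ## made-up data: n = 5 observations and k = 2 predictors
>>> X = np.column_stack([np.ones(5),
...                      [1.0, 2.0, 3.0, 4.0, 5.0],
...                      [2.0, 1.0, 4.0, 3.0, 5.0]])
>>> y = np.array([1.5, 2.0, 3.5, 3.0, 5.0])
>>> 
>>> ## b = (X'X)^{-1} X'y, solved without forming the inverse explicitly
>>> b = np.linalg.solve(X.T @ X, X.T @ y)
>>> b
array([0.33333333, 0.44444444, 0.44444444])

In practice, the least squares fit is obtained from a statistical package such as statsmodels, as in the example later in this chapter.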

\(R^2\)

How well the multiple regression model fits the data can be assessed using the $R^{2}$. Its calculation is the same as for the simple regression

\[\begin{align*} R^{2} & = 1-\frac{\sum_{i=1}^{n}e_{i}^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}\\ & = \frac{\text{Variation explained by IVs}}{\text{Total variation}}. \end{align*} \]

In multiple regression, $R^{2}$ is the total proportion of variation in $y$ explained by the multiple predictors.

The $R^{2}$ increases, or at least stays the same, as more predictors are added. However, with more predictors, the model becomes more complex and potentially more difficult to interpret. To take model complexity into account, the adjusted $R^{2}$ has been defined, which is calculated as

\[aR^{2}=1-(1-R^{2})\frac{n-1}{n-k-1}.\]
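As a quick numeric check of this formula, a small helper function (hypothetical, not part of any library) gives the adjusted $R^{2}$ for a given $R^{2}$, sample size $n$, and number of predictors $k$.

>>> def adjusted_r2(r2, n, k):
...     """Adjusted R-squared for n observations and k predictors."""
...     return 1 - (1 - r2) * (n - 1) / (n - k - 1)
... 
>>> ## e.g., an R-squared of 0.3997 with n = 100 and k = 3
>>> round(adjusted_r2(0.3997, 100, 3), 4)
0.3809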

Hypothesis testing of regression coefficient(s)

With the estimates of the regression coefficients and their standard errors, we can conduct hypothesis tests for one, a subset, or all of the regression coefficients.

Testing a single regression coefficient

First, we can test the significance of the coefficient for a single predictor. In this situation, the null and alternative hypotheses are

\[ H_{0}:\beta_{j}=0\text{ vs }H_{1}:\beta_{j}\neq0 \]

with $\beta_{j}$ denoting the regression coefficient of $x_{j}$ at the population level.

As in the simple regression, we use a test statistic

\[ t_{j}=\frac{b_{j}-\beta_{j}}{s.e.(b_{j})}\]

where $b_{j}$ is the estimated regression coefficient of $x_{j}$ using data from a sample. If the null hypothesis is true and $\beta_j = 0$, the test statistic follows a t-distribution with degrees of freedom \(n-k-1\) where \(k\) is the number of predictors.

One can also test the significance of \(\beta_j\) by constructing a confidence interval for it. Based on the t distribution, the \(100(1-\alpha)\%\) confidence interval is

\[ [b_{j}+t_{n-k-1}(\alpha/2)\times s.e.(b_{j}),\;b_{j}+t_{n-k-1}(1-\alpha/2)\times s.e.(b_{j})]\]

where $t_{n-k-1}(\alpha/2)$ is the $\alpha/2$ percentile of the t distribution. As previously discussed, if the confidence interval includes 0, the regression coefficient is not statistically significant at the significance level $\alpha$.
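As a small numerical sketch of such an interval (the coefficient estimate, standard error, and sample size below are made-up values), the endpoints can be computed with scipy's t distribution.

>>> import scipy.stats as stats
>>> 
>>> b_j, se_j = 0.5, 0.1   # made-up coefficient estimate and standard error
>>> n, k = 100, 3          # made-up sample size and number of predictors
>>> alpha = 0.05
>>> 
>>> ## percentiles of the t distribution with n - k - 1 degrees of freedom
>>> lower = b_j + stats.t.ppf(alpha / 2, n - k - 1) * se_j
>>> upper = b_j + stats.t.ppf(1 - alpha / 2, n - k - 1) * se_j
>>> print(round(lower, 3), round(upper, 3))
0.302 0.698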

Testing all the regression coefficients together (overall model fit)

Given the multiple predictors, we can also test whether all of the regression coefficients are 0 at the same time. This is equivalent to testing whether the predictors combined explain a significant portion of the variance in the outcome variable. Since $R^2$ is a measure of the variance explained, this test is naturally related to it.

For this hypothesis testing, the null and alternative hypothesis are

\[H_{0}:\beta_{1}=\beta_{2}=\ldots=\beta_{k}=0\]

vs.

\[H_{1}:\text{ at least one of the regression coefficients is different from 0}.\]

In this kind of test, an F test is used. The F-statistic is defined as

\[F=\frac{n-k-1}{k}\frac{R^{2}}{1-R^{2}}.\]

It follows an F-distribution with degrees of freedom $k$ and $n-k-1$ when the null hypothesis is true. Given an F statistic, its corresponding p-value can be calculated from the F distribution as shown below. Note that we only look at one side of the distribution because extreme values of the F statistic fall in the upper (large-value) tail.

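As a rough sketch (the F value and degrees of freedom below are made-up placeholders), the upper-tail area can be computed with scipy, and the F density with its upper tail can be plotted with matplotlib.

>>> import numpy as np
>>> import scipy.stats as stats
>>> import matplotlib.pyplot as plt
>>> 
>>> df1, df2 = 3, 96   # placeholder degrees of freedom: k and n - k - 1
>>> f_obs = 5.0        # a made-up observed F statistic
>>> 
>>> ## p-value: the area in the upper tail beyond the observed F
>>> p_value = stats.f.sf(f_obs, df1, df2)
>>> 
>>> ## plot the F density and shade the upper tail beyond the observed F
>>> x = np.linspace(0, 10, 500)
>>> plt.plot(x, stats.f.pdf(x, df1, df2))
>>> plt.fill_between(x, stats.f.pdf(x, df1, df2), where=x >= f_obs, alpha=0.5)
>>> plt.xlabel('F')
>>> plt.savefig('f-dist.svg', format='svg')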

Testing a subset of the regression coefficients

We can also test whether a subset of $p$ regression coefficients, where $p$ can range from 1 to the total number of coefficients $k$, are equal to zero. For convenience, we can rearrange the regression coefficients so that the $p$ coefficients to be tested come first. The null hypothesis is then

\[H_{0}:\beta_{1}=\beta_{2}=\ldots=\beta_{p}=0\]

and the alternative hypothesis is that at least one of them is not equal to 0.

As for testing the overall model fit, an F test can be used here. In this situation, the F statistic can be calculated as

\[F=\frac{n-k-1}{p}\frac{R^{2}-R_{0}^{2}}{1-R^{2}},\]

which follows an F-distribution with degrees of freedom $p$ and $n-k-1$. Here $R^2$ is from the regression model with all the predictors and $R_0^2$ is from the regression model without the first $p$ predictors $x_{1},x_{2},\ldots,x_{p}$ but with the remaining predictors $x_{p+1},x_{p+2},\ldots,x_{k}$.

Intuitively, this test determines whether the variance explained by the first \(p\) predictors, above and beyond the other $k-p$ predictors, is significant. That extra explained variance is the increase in R-squared, $R^{2}-R_{0}^{2}$.

An example

As an example, suppose that we wanted to predict student success in college. Why might we want to do this? There's an ongoing debate in college and university admission offices (and in the courts) regarding what factors should be considered important in deciding which applicants to admit. Should admissions officers pay most attention to more easily quantifiable measures such as high school GPA and SAT scores? Or should they give more weight to more subjective measures such as the quality of letters of recommendation? What are the pros and cons of the approaches? Of course, how we define college success is also an open question. For the sake of this example, let's measure college success using college GPA.

In this example, we use a set of simulated data (generated by us). The data are saved in the file gpa.csv. As shown below, the sample size is 100 and there are 4 variables: college GPA (c.gpa), high school GPA (h.gpa), SAT, and quality of recommendation letters (recommd).


>>> import pandas as pd
>>> gpa = pd.read_csv('https://advstats.psychstat.org/data/gpa.csv')
>>> gpa
    c.gpa  h.gpa   SAT  recommd
0    2.04   2.01  1070        5
1    2.56   3.40  1254        6
2    3.75   3.68  1466        6
3    1.10   1.54   706        4
4    3.00   3.32  1160        5
..    ...    ...   ...      ...
95   1.80   1.64   772        4
96   2.64   1.87  1304        6
97   2.08   2.53  1212        4
98   0.70   1.78   818        6
99   0.89   1.20   864        2

[100 rows x 4 columns]

Graph the data

Before fitting a regression model, we should check the relationship between college GPA and each predictor through a scatterplot. A scatterplot can tell us the form of the relationship (e.g., linear, nonlinear, or none), its direction (positive or negative), and its strength (strong, moderate, or weak). It can also reveal potential outliers.

The scatterplots between college GPA and the three potential predictors are given below. From the plots, we can roughly see that all three predictors are positively related to college GPA. The relationships are close to linear and appear to be stronger for high school GPA and SAT than for the quality of recommendation letters.


>>> import pandas as pd
>>> gpa = pd.read_csv('https://advstats.psychstat.org/data/gpa.csv')
>>> 
>>> import seaborn as sns
>>> import matplotlib.pyplot as plt
>>> 
>>> fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
>>> sns.regplot(x=gpa['h.gpa'], y=gpa['c.gpa'], ax=axes[0])

>>> sns.regplot(x=gpa['SAT'], y=gpa['c.gpa'], ax=axes[1])

>>> sns.regplot(x=gpa['recommd'], y=gpa['c.gpa'], ax=axes[2])

>>> 
>>> plt.savefig('regression.svg', format='svg')

Descriptive statistics

Next, we can calculate some summary statistics to explore our data further. For each variable, the describe() function reports the count, mean, standard deviation, minimum, 1st quartile, median, 3rd quartile, and maximum. To look at the relationships among the variables, we can calculate the correlation matrix using the corr() function.

Based on the correlation matrix, the correlation between college GPA and high school GPA is about 0.545, which is larger than the correlation (0.523) between college GPA and SAT, which in turn is larger than the correlation (0.350) between college GPA and the quality of recommendation letters.


>>> import pandas as pd
>>> gpa = pd.read_csv('https://advstats.psychstat.org/data/gpa.csv')
>>> 
>>> ## descriptive information
>>> print(gpa.describe())
            c.gpa       h.gpa          SAT     recommd
count  100.000000  100.000000   100.000000  100.000000
mean     1.980500    2.048600  1014.760000    5.190000
std      0.749226    0.722341   217.347049    1.495414
min      0.050000    0.330000   400.000000    2.000000
25%      1.562500    1.640000   852.000000    4.000000
50%      1.985000    1.930000  1036.000000    5.000000
75%      2.410000    2.535000  1168.500000    6.000000
max      4.010000    4.250000  1500.000000   10.000000
>>> 
>>> ## correlation matrix
>>> print(gpa.corr())
            c.gpa     h.gpa       SAT   recommd
c.gpa    1.000000  0.545198  0.522755  0.350077
h.gpa    0.545198  1.000000  0.432625  0.626584
SAT      0.522755  0.432625  1.000000  0.217593
recommd  0.350077  0.626584  0.217593  1.000000

Fit a multiple regression model

As for simple linear regression, a multiple regression analysis can be carried out using the ols() function from the statsmodels package. Because the formula interface does not handle dots in variable names well, the columns c.gpa and h.gpa are first renamed to c_gpa and h_gpa. From the output, we can write out the estimated regression model as

\[ c.gpa = -0.153+ 0.376 \times h.gpa + 0.00122 \times SAT + 0.023 \times recommd \]


>>> import pandas as pd
>>> gpa = pd.read_csv('https://advstats.psychstat.org/data/gpa.csv')
>>> gpa = gpa.rename(columns={'c.gpa': 'c_gpa', 'h.gpa': 'h_gpa'})
>>> gpa
    c_gpa  h_gpa   SAT  recommd
0    2.04   2.01  1070        5
1    2.56   3.40  1254        6
2    3.75   3.68  1466        6
3    1.10   1.54   706        4
4    3.00   3.32  1160        5
..    ...    ...   ...      ...
95   1.80   1.64   772        4
96   2.64   1.87  1304        6
97   2.08   2.53  1212        4
98   0.70   1.78   818        6
99   0.89   1.20   864        2

[100 rows x 4 columns]
>>> ## using formula
>>> import statsmodels.formula.api as smf
>>> reg = smf.ols("c_gpa ~ h_gpa + SAT +  recommd", data=gpa).fit()
>>> print(reg.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  c_gpa   R-squared:                       0.400
Model:                            OLS   Adj. R-squared:                  0.381
Method:                 Least Squares   F-statistic:                     21.31
Date:                Sat, 23 Nov 2024   Prob (F-statistic):           1.16e-10
Time:                        21:10:41   Log-Likelihood:                -87.001
No. Observations:                 100   AIC:                             182.0
Df Residuals:                      96   BIC:                             192.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.1533      0.323     -0.475      0.636      -0.794       0.488
h_gpa          0.3764      0.114      3.294      0.001       0.150       0.603
SAT            0.0012      0.000      4.046      0.000       0.001       0.002
recommd        0.0227      0.051      0.445      0.657      -0.079       0.124
==============================================================================
Omnibus:                        1.679   Durbin-Watson:                   1.906
Prob(Omnibus):                  0.432   Jarque-Bera (JB):                1.658
Skew:                           0.303   Prob(JB):                        0.437
Kurtosis:                       2.823   Cond. No.                     5.71e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.71e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Interpret the results / output

From the output, we see the intercept is -0.153. Its literal meaning is that when all the predictors are 0, the predicted college GPA is -0.153. This clearly does not make much sense, because one can never get a negative GPA; it results from the unrealistic scenario in which all the predictors take the value 0.

The regression coefficient for the predictor high school GPA (h.gpa) is 0.376. It can be interpreted as follows: holding SAT and recommd constant, the predicted college GPA increases by 0.376 for a one-unit increase in high school GPA. This, again, might be problematic because it might be impossible to increase high school GPA while keeping the other two predictors unchanged. The other two regression coefficients can be interpreted in the same way.
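To see what these estimates imply in practice, the fitted model can be used to predict the college GPA of a hypothetical applicant; the predictor values below are made up for illustration.

>>> import pandas as pd
>>> import statsmodels.formula.api as smf
>>> 
>>> gpa = pd.read_csv('https://advstats.psychstat.org/data/gpa.csv')
>>> gpa = gpa.rename(columns={'c.gpa': 'c_gpa', 'h.gpa': 'h_gpa'})
>>> reg = smf.ols("c_gpa ~ h_gpa + SAT + recommd", data=gpa).fit()
>>> 
>>> ## a hypothetical applicant (made-up values)
>>> new_student = pd.DataFrame({'h_gpa': [3.0], 'SAT': [1200], 'recommd': [6]})
>>> 
>>> ## predicted c.gpa, roughly -0.153 + 0.376*3.0 + 0.00122*1200 + 0.023*6, or about 2.58
>>> pred = reg.predict(new_student)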

From the output, we can also see that the multiple R-squared ($R^2$) is 0.3997. Therefore, about 40% of the variation in college GPA can be explained by the multiple linear regression with h.gpa, SAT, and recommd as the predictors. The adjusted $R^2$ is slightly smaller because it takes the number of predictors into account. In fact,

\[ \begin{eqnarray*} aR^{2} & = & 1-(1-R^{2})\frac{n-1}{n-k-1}\\& = & 1-(1-.3997)\frac{100-1}{100-3-1}\\& = & .3809 \end{eqnarray*} \]
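Both quantities are also available directly from the fitted results object (the reg object from the earlier smf.ols() call) through its rsquared and rsquared_adj attributes, matching the summary table above.

>>> ## reg is the fitted model from the earlier smf.ols() call
>>> print(f"{reg.rsquared:.3f} {reg.rsquared_adj:.3f}")
0.400 0.381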

Hypothesis testing

Testing Individual Regression Coefficient

For the regression coefficient of each of the three predictors (and the intercept), a t test can be conducted. For example, for high school GPA, the estimated coefficient is 0.376 with a standard error of 0.114. Therefore, the corresponding t statistic is \(t = 0.376/0.114 = 3.294\). Since the statistic follows a t distribution with degrees of freedom \(df = n - k - 1 = 100 - 3 - 1 = 96\), we can obtain the two-tailed p-value as \(p = 2\times[1 - P(t_{96} \le 3.294)] = 0.0013\). Since the p-value is less than 0.05, we conclude the coefficient is statistically significant. Note that the t value and p-value are provided directly in the output; the calculation using scipy is shown below.


>>> import scipy.stats as stats
>>> 
>>> t_statistic = 3.294  # t-statistic from the analysis
>>> df = 96  # degrees of freedom
>>> 
>>> # Two-tailed p-value calculation
>>> p_value = 2 * (1 - stats.t.cdf(abs(t_statistic), df))
>>> 
>>> print("p-value:", p_value)
p-value: 0.0013840435314131927

Overall model fit (testing all coefficients together)

To test all coefficients together or the overall model fit, we use the F test. Given the $R^2$, the F statistic is

\[ \begin{eqnarray*} F & = & \frac{n-k-1}{k}\frac{R^{2}}{1-R^{2}}\\ & = & \left(\frac{100-3-1}{3}\right)\times \left(\frac{0.3997}{1-.3997}\right )=21.307\end{eqnarray*} \]

which follows the F distribution with degrees of freedom $df1=k=3$ and $df2=n-k-1=96$. The corresponding p-value is 1.16e-10. Note that this information is directly shown in the output as "F-statistic: 21.31" and "Prob (F-statistic): 1.16e-10".

Therefore, at least one of the regression coefficients is statistically significantly different from 0. Overall, the three predictors explained a significant portion of the variance in college GPA. The regression model with the 3 predictors is significantly better than the regression model with intercept only (i.e., predict c.gpa by the mean of c.gpa).


>>> import scipy.stats as stats
>>> 
>>> F_statistic = 21.307  # F statistic from the analysis
>>> df1 = 3  # numerator degrees of freedom
>>> df2 = 96  # denominator degrees of freedom
>>> 
>>> # p-value calculation
>>> p_value = 1 - stats.f.cdf(F_statistic, df1, df2)
>>> 
>>> print("p-value:", p_value)
p-value: 1.1624656792719179e-10

Testing a subset of regression coefficients

Suppose we are interested in testing whether the regression coefficients of high school GPA and SAT together are significant or not. In other words, we want to see whether, above and beyond the quality of recommendation letters, these two predictors can explain a significant portion of the variance in college GPA. To conduct the test, we need to fit two models:

  • A full model: includes all the predictors, i.e., c.gpa is predicted by the intercept, h.gpa, SAT, and recommd.
  • A reduced model: obtained by removing the predictors to be tested from the full model; here, c.gpa is predicted by the intercept and recommd only.

From the full model, we can get the $R^2 = 0.3997$ with all three predictors and from the reduced model, we can get the $R_0^2 = 0.1226$ with only quality of recommendation letters. Then the F statistic is constructed as

\[F=\frac{n-k-1}{p}\frac{R^{2}-R_{0}^{2}}{1-R^{2}}=\left(\frac{100-3-1}{2}\right )\times\frac{.3997-.1226}{1-.3997}=22.157.\]

Using the F distribution with the degrees of freedom $p=2$ (the number of coefficients to be tested) and $n-k-1 = 96$, we can get the p-value close to 0 ($p=1.22e-08$).


>>> import scipy.stats as stats
>>> 
>>> F_statistic = 22.157  # F statistic from the analysis
>>> df1 = 2  # numerator degrees of freedom
>>> df2 = 96  # denominator degrees of freedom
>>> 
>>> # p-value calculation
>>> p_value = 1 - stats.f.cdf(F_statistic, df1, df2)
>>> 
>>> print("p-value:", p_value)
p-value: 1.2250982672767918e-08
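The $R^{2}$ and $R_{0}^{2}$ values used above can be obtained by fitting the full and reduced models separately; a short sketch, again renaming the columns with dots first:

>>> import pandas as pd
>>> import statsmodels.formula.api as smf
>>> 
>>> gpa = pd.read_csv('https://advstats.psychstat.org/data/gpa.csv')
>>> gpa = gpa.rename(columns={'c.gpa': 'c_gpa', 'h.gpa': 'h_gpa'})
>>> 
>>> ## full model with all three predictors; reduced model with recommd only
>>> full = smf.ols("c_gpa ~ h_gpa + SAT + recommd", data=gpa).fit()
>>> reduced = smf.ols("c_gpa ~ recommd", data=gpa).fit()
>>> print(round(full.rsquared, 4), round(reduced.rsquared, 4))
0.3997 0.1226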

Note that the test conducted here is based on the comparison of two models. In Python, two fitted models can be compared conveniently using the anova_lm() function from statsmodels. As shown below, we obtain the same F statistic and p-value (up to rounding of the $R^2$ values used in the hand calculation).


>>> import pandas as pd
>>> gpa = pd.read_csv('https://advstats.psychstat.org/data/gpa.csv')
>>> gpa = gpa.rename(columns={'c.gpa': 'c_gpa', 'h.gpa': 'h_gpa'})
>>> 
>>> ## using formula
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> model1 = smf.ols("c_gpa ~ h_gpa + SAT +  recommd", data=gpa).fit()
>>> model2 = smf.ols("c_gpa ~ recommd", data=gpa).fit()
>>> 
>>> # Perform ANOVA to compare the models
>>> anova_results = sm.stats.anova_lm(model2, model1)
>>> 
>>> print("\nANOVA Comparison of Models:")

ANOVA Comparison of Models:
>>> print(anova_results)
   df_resid        ssr  df_diff    ss_diff          F        Pr(>F)
0      98.0  48.762034      0.0        NaN        NaN           NaN
1      96.0  33.358309      2.0  15.403725  22.164757  1.218614e-08
