Centering and Standardization
In the GPA example, we already showed that the intercept did not make much sense in practice. If we further look at the coefficients for the three predictors, they are 0.376, 0.0012, and 0.023. They clearly cannot be compared directly, because a one-unit change in high school GPA means something very different from a one-unit change in SAT score. To improve the interpretability of the regression, we can center and/or standardize the variables.
Centering
First, we can make the intercept more interpretable by centering the predictors. For the GPA example, we create the centered predictors by subtracting the corresponding mean from each variable. Therefore, we have
\[ \begin{eqnarray*} h.gpa^c & = & h.gpa - \overline{h.gpa} \\ SAT^c & = & SAT - \overline{SAT} \\ recommd^c & = & recommd - \overline{recommd} \end{eqnarray*}.\]
The centered predictors can then be used in the regression analysis. In R, the function scale() can be used to center a variable around its mean; setting the argument scale = F (i.e., FALSE) requests centering without standardization. The function can be used inside the regression function lm() directly. Note that after centering, the intercept becomes 1.98. When all three predictors are at their average values, the centered predictors are all 0, so the intercept can be interpreted as the predicted y value when the predictors are at their average values. Specifically, the predicted college GPA would be 1.98 for a student with an average high school GPA, an average SAT score, and an average quality of recommendation letter. Furthermore, we can see that centering does not change the estimated regression coefficients of the predictors.
> usedata('gpa')
> gpa.model.c<-lm(c.gpa~ scale(h.gpa,scale=F) + scale(SAT,scale=F) + scale(recommd,scale=F), data=gpa)
> summary(gpa.model.c)
Call:
lm(formula = c.gpa ~ scale(h.gpa, scale = F) + scale(SAT, scale = F) +
scale(recommd, scale = F), data = gpa)
Residuals:
Min 1Q Median 3Q Max
-1.0979 -0.4407 -0.0094 0.3859 1.7606
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9805000 0.0589476 33.598 < 2e-16 ***
scale(h.gpa, scale = F) 0.3763511 0.1142615 3.294 0.001385 **
scale(SAT, scale = F) 0.0012269 0.0003032 4.046 0.000105 ***
scale(recommd, scale = F) 0.0226843 0.0509817 0.445 0.657358
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5895 on 96 degrees of freedom
Multiple R-squared: 0.3997, Adjusted R-squared: 0.381
F-statistic: 21.31 on 3 and 96 DF, p-value: 1.16e-10
>
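The point above can be checked directly: centering leaves the slopes untouched, and the new intercept is the model's prediction at the average of every predictor, which for least squares is simply the mean of the outcome. A minimal sketch with simulated data (the gpa data set itself is assumed to be loaded by the book's usedata() function and is not used here):

```r
## Simulated stand-ins for the predictors and outcome
set.seed(1)
n  <- 100
x1 <- rnorm(n, 3, 0.5)        # plays the role of h.gpa
x2 <- rnorm(n, 1100, 100)     # plays the role of SAT
y  <- 0.4 * x1 + 0.001 * x2 + rnorm(n, 0, 0.5)

m.raw <- lm(y ~ x1 + x2)                                        # uncentered
m.ctr <- lm(y ~ scale(x1, scale = FALSE) + scale(x2, scale = FALSE))

## The slopes are identical; only the intercept changes
all.equal(unname(coef(m.raw)[-1]), unname(coef(m.ctr)[-1]))     # TRUE

## After centering, the intercept is the predicted y at the average
## predictor values, which for least squares is the mean of y
all.equal(unname(coef(m.ctr)[1]), mean(y))                      # TRUE
```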
Relationship between coefficients before and after centering
For a regression model before centering, we have
\[y_{i}=\beta_{0}+\beta_{1}x_{1i}+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki}+\varepsilon_{i}.\]
After centering the predictors, the model becomes
\[y_{i}=\beta_{0}^{c}+\beta_{1}^{c}(x_{1i}-\bar{x}_{1})+\beta_{2}^{c}(x_{2i}-\bar{x}_{2})+\ldots+\beta_{k}^{c}(x_{ki}-\bar{x}_{k})+\varepsilon_{i}.\]
Expanding the right-hand side gives
\[y_{i}=\left(\beta_{0}^{c}-\beta_{1}^{c}\bar{x}_{1}-\beta_{2}^{c}\bar{x}_{2}-\ldots-\beta_{k}^{c}\bar{x}_{k}\right)+\beta_{1}^{c}x_{1i}+\beta_{2}^{c}x_{2i}+\ldots+\beta_{k}^{c}x_{ki}+\varepsilon_{i}.\]
Comparing this with the model before centering, the slope coefficients are the same, \(\beta_{j}^{c}=\beta_{j}\) for \(j=1,\ldots,k\); only the intercepts differ, with \(\beta_{0}^{c}=\beta_{0}+\beta_{1}\bar{x}_{1}+\beta_{2}\bar{x}_{2}+\ldots+\beta_{k}\bar{x}_{k}\).
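The intercept relationship can be verified numerically. The sketch below uses simulated data with illustrative variable names, not the gpa data:

```r
## Simulated data for a three-predictor regression
set.seed(2)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n, 5, 2); x3 <- rnorm(n, 50, 10)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + 0.02 * x3 + rnorm(n)

b  <- coef(lm(y ~ x1 + x2 + x3))                     # uncentered fit
bc <- coef(lm(y ~ scale(x1, scale = FALSE) +
                  scale(x2, scale = FALSE) +
                  scale(x3, scale = FALSE)))         # centered fit

## Slopes agree across the two fits
all.equal(unname(bc[-1]), unname(b[-1]))             # TRUE

## Centered intercept = beta_0 + beta_1*xbar_1 + ... + beta_k*xbar_k
all.equal(unname(bc[1]),
          unname(b[1] + b[2] * mean(x1) +
                        b[3] * mean(x2) + b[4] * mean(x3)))  # TRUE
```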
Standardization
Even after centering, the estimated regression coefficients are not comparable when the predictors originally have very different scales (e.g., SAT, h.gpa, and recommd). Through standardization, however, we can remove the scales of the predictors and therefore make the coefficients more comparable. We can standardize the predictors only, or both the predictors and the outcome variable.
After standardization, the variable means are all 0 and the variances are all 1. An estimated standardized regression coefficient, also called a beta coefficient, tells us how many standard deviations the predicted DV changes for a one-standard-deviation change in an IV when the other IVs are held constant. For example, if an IV has an estimated standardized regression coefficient of .5, this means that when the other IVs are held constant, the predicted DV value will increase by half a standard deviation if the IV increases by one standard deviation. The estimated standardized regression coefficients are thus comparable even when the IVs originally have different scales. Some researchers argue that, in this case, the predictor with the larger beta coefficient predicts the outcome better. Others, however, point out that a one-standard-deviation change in one predictor is not necessarily equivalent to a one-standard-deviation change in another.
For the GPA example, the estimated beta coefficients are 0.363, 0.356, and 0.045, respectively. Based on these values, we might say that high school GPA and SAT are better predictors than the quality of the recommendation letter. However, such a comparison might not be reliable when the predictors are correlated. Note that when both x and y are standardized, there is no need to estimate the intercept, since it is automatically 0.
In R, scale() can also be used for standardization; by default it both centers a variable and divides it by its standard deviation. Note that to skip the estimation of the intercept, one can add -1 to the regression model formula.
> usedata('gpa')
> gpa.model.s<-lm(scale(c.gpa) ~ scale(h.gpa) + scale(SAT) + scale(recommd)-1, data=gpa)
> summary(gpa.model.s)
Call:
lm(formula = scale(c.gpa) ~ scale(h.gpa) + scale(SAT) + scale(recommd) -
1, data = gpa)
Residuals:
Min 1Q Median 3Q Max
-1.46544 -0.58820 -0.01254 0.51510 2.34994
Coefficients:
Estimate Std. Error t value Pr(>|t|)
scale(h.gpa) 0.36285 0.10959 3.311 0.00131 **
scale(SAT) 0.35593 0.08751 4.067 9.68e-05 ***
scale(recommd) 0.04528 0.10123 0.447 0.65568
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7827 on 97 degrees of freedom
Multiple R-squared: 0.3997, Adjusted R-squared: 0.3812
F-statistic: 21.53 on 3 and 97 DF, p-value: 9.028e-11
>
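The beta coefficients can also be recovered from the unstandardized ones through the relationship \(\beta_{j}^{std}=b_{j}\times sd(x_{j})/sd(y)\), which explains why standardization makes the scales drop out. A small check with simulated data (not the gpa data):

```r
## Simulated stand-ins with very different scales
set.seed(3)
n  <- 100
x1 <- rnorm(n, 3, 0.5)
x2 <- rnorm(n, 1100, 100)
y  <- 0.4 * x1 + 0.001 * x2 + rnorm(n, 0, 0.5)

b    <- coef(lm(y ~ x1 + x2))                           # unstandardized
beta <- coef(lm(scale(y) ~ scale(x1) + scale(x2) - 1))  # standardized

## beta_j equals b_j rescaled by the ratio of standard deviations
all.equal(unname(beta),
          unname(c(b[2] * sd(x1) / sd(y),
                   b[3] * sd(x2) / sd(y))))             # TRUE
```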
To cite the book, use:
Zhang, Z. & Wang, L. (2017-2026). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.
