Variable Selection in Multiple Regression
Variable selection in regression is arguably the hardest part of model building. The purpose of variable selection in regression is to identify the best subset of predictors among many variables to include in a model. The issue is how to find the necessary variables among the complete set of variables by deleting both irrelevant variables (variables not affecting the dependent variable) and redundant variables (variables adding nothing beyond what the other variables already explain). Many variable selection methods exist, and each provides a solution to one of the most important problems in statistics.
The general theme of variable selection is to examine certain subsets of predictors and select the best subset, which either maximizes or minimizes an appropriate criterion. More specifically, a model selection method usually includes the following three components:
- Select a test statistic
- Select a criterion for the selected test statistic
- Make a decision on removing / keeping a variable.
Statistics/criteria for variable selection
In the literature, many statistics have been used for variable selection. Before we discuss them, bear in mind that different statistics/criteria may lead to very different choices of variables.
t-test for a single predictor at a time
We have learned how to use the t-test to test the significance of a single predictor. It is often used as a way to select predictors: the general rule is that if a predictor is significant, it can be included in a regression model.
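As a minimal sketch (using the birthwt data from the MASS package, which is analyzed in detail later in this chapter, with two predictors chosen only for illustration), the t-test for each predictor can be read from the coefficient table returned by summary():

library(MASS)                    # provides the birthwt data
data(birthwt)
m <- lm(bwt ~ lwt + smoke, data=birthwt)
summary(m)$coefficients          # the 't value' and 'Pr(>|t|)' columns give the t-tests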
F-test for the whole model or for comparing two nested models
The F-test can be used to test the significance of one or more predictors jointly. Therefore, it can also be used for variable selection. For example, if the overall F-test for a subset of predictors in a model is not significant, one might simply remove them from the regression model.
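A minimal sketch of this idea, with the nested models chosen only for illustration, is to compare two models with anova(), which carries out the F-test for the added predictors:

library(MASS)
data(birthwt)
reduced <- lm(bwt ~ lwt + smoke, data=birthwt)
full    <- lm(bwt ~ lwt + smoke + ptl + ftv, data=birthwt)
anova(reduced, full)             # F-test for the added predictors ptl and ftv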
$R^{2}$ and Adjusted $R^{2}$
$R^{2}$ can be used to measure the practical importance of a predictor. If a predictor contributes substantially to the overall $R^{2}$ or adjusted $R^{2}$, it should be considered for inclusion in the model.
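For example, one can compare $R^{2}$ and adjusted $R^{2}$ before and after adding a predictor (a small sketch; the predictors are chosen only for illustration):

library(MASS)
data(birthwt)
m1 <- lm(bwt ~ lwt, data=birthwt)
m2 <- lm(bwt ~ lwt + smoke, data=birthwt)
c(summary(m1)$r.squared, summary(m2)$r.squared)             # R-squared
c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared)     # adjusted R-squared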
Mallows' $C_{p}$
Mallows' $C_{p}$ is widely used in variable selection. It compares a model with $p$ predictors vs. all $k$ predictors ($k > p$) using a $C_p$ statistic:
\[C_{p}=\frac{SSE_{p}}{MSE_{k}}-N+2(p+1)\]
where $SSE_{p}$ is the sum of squared errors for the model with $p$ predictors and $MSE_{k}$ is the mean squared residual for the model with all $k$ predictors. The expectation of $C_{p}$ is $p+1$. Intuitively, if the model with $p$ predictors fits as well as the model with all $k$ predictors -- that is, the simpler model fits as well as the more complex one -- the two models should have about the same mean squared error. In that case, we would expect $SSE_{p}/MSE_{k} \approx N-p-1$ and thus $C_{p} \approx p+1$. In variable selection, we therefore look for a subset of variables with $C_{p}$ around $p+1$ or smaller ($C_{p} \le p+1$). A model with a poor fit, on the other hand, would have a $C_{p}$ much bigger than $p+1$.
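The formula can be computed by hand. The sketch below assumes the birthwt data used later in this chapter and a 5-predictor subset chosen only for illustration:

library(MASS)
data(birthwt)
N <- nrow(birthwt)
full <- lm(bwt ~ lwt + race + smoke + ptl + ht + ui + ftv, data=birthwt)   # all k = 7 predictors
sub  <- lm(bwt ~ lwt + race + smoke + ht + ui, data=birthwt)               # p = 5 predictors
SSE.p <- sum(resid(sub)^2)
MSE.k <- sum(resid(full)^2) / full$df.residual    # SSE_k / (N - k - 1)
p <- 5
SSE.p / MSE.k - N + 2 * (p + 1)                   # Cp; compare with p + 1 = 6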
Information criteria
Information criteria such as AIC (Akaike information criterion) and BIC (Bayesian information criterion) are often used in variable selection. AIC and BIC are defined as
\[ \begin{eqnarray*} AIC & = & n\ln(SSE/n)+2p \\ BIC & = & n\ln(SSE/n)+p\ln(n). \end{eqnarray*} \]
Note that AIC and BIC represent a trade-off between goodness of model fit and model complexity. With more predictors in a regression model, $SSE$ typically becomes smaller, or at least stays the same, and therefore the first part of AIC and BIC becomes smaller. However, with more predictors the model becomes more complex, and therefore the second part of AIC and BIC becomes bigger. An information criterion tries to identify the model with the smallest AIC or BIC, balancing model fit and model complexity.
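As a minimal sketch of the formulas above (applied to one candidate model chosen only for illustration), AIC and BIC can be computed directly from the residual sum of squares. Note that R's built-in AIC() and BIC() functions are based on the full log-likelihood and therefore differ from these values by a constant for a given data set; the ranking of models is what matters.

library(MASS)
data(birthwt)
m <- lm(bwt ~ lwt + race + smoke + ht + ui, data=birthwt)
n <- nrow(birthwt)
SSE <- sum(resid(m)^2)
p <- length(coef(m)) - 1           # number of predictors
n * log(SSE/n) + 2 * p             # AIC as defined above
n * log(SSE/n) + p * log(n)        # BIC as defined above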
An example
Through an example, we introduce different variable selection methods and illustrate their use. The data here were collected from 189 infants and mothers at the Baystate Medical Center, Springfield, Massachusetts, in 1986, on the following variables.
- low: indicator of birth weight less than 2.5 kg.
- age: mother's age in years.
- lwt: mother's weight in pounds at last menstrual period.
- race: mother's race (1 = white, 2 = black, 3 = other).
- smoke: smoking status during pregnancy.
- ptl: number of previous premature labors.
- ht: history of hypertension.
- ui: presence of uterine irritability.
- ftv: number of physician visits during the first trimester.
- bwt: birth weight in grams.
A subset of the data is shown below. Note that the data are included with the R package MASS. Therefore, once the package is loaded, one can access the data using data(birthwt).
> library(MASS)
> data(birthwt)
> head(birthwt)
   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
87   0  20 105    1     1   0  0  0   1 2557
88   0  21 108    1     1   0  0  1   2 2594
89   0  18 107    1     1   0  0  1   0 2600
91   0  21 124    3     0   0  0  0   0 2622
The purpose of the study is to identify possible risk factors associated with low infant birth weight. Using the study and the data, we introduce four methods for variable selection: (1) all possible subsets (best subsets) analysis, (2) backward elimination, (3) forward selection, and (4) stepwise selection/regression.
All possible (best) subsets
The basic idea of the all possible subsets approach is to run every possible combination of the predictors to find the best subset that meets some pre-defined objective criterion such as \(C_{p}\) or adjusted \(R^{2}\). It is hoped that one ends up with a reasonable and useful regression model. Manually, we can fit each possible model one by one using lm() and compare the model fits. To run the procedure automatically, we can use the regsubsets() function in the R package leaps.
Using the birth weight data, we can run the analysis as shown below. In the function regsubsets(),
- The regular formula can be used to specify the model with all the predictors to be studied. In this example, it is bwt~lwt+race+smoke+ptl+ht+ui+ftv. One can also provide the outcome variable as a vector and the predictors in a matrix.
- data tells the data set to be used.
- nbest is the number of best subsets of each size to save. If nbest=1, only the best model will be saved for each number of predictors. If nbest=2, the best two models will be saved for each number of predictors.
- nvmax is the maximum size of subsets of predictors to examine. It specifies the maximum number of predictors you want to include in the final regression model. For example, if you have 7 predictors but set nvmax=5, then the most complex model to be evaluated will have only 5 predictors. Using this option can largely reduce computing time when a large number of predictors are evaluated.
> library(MASS); data(birthwt)
> library(leaps)
> all<-regsubsets(bwt~lwt+race+smoke+ptl+ht+ui+ftv, data=birthwt,
+   nbest=1, nvmax=7)
> all
Subset selection object
Call: regsubsets.formula(bwt ~ lwt + race + smoke + ptl + ht + ui + ftv, data = birthwt, nbest = 1, nvmax = 7)
7 Variables  (and intercept)
      Forced in Forced out
lwt       FALSE      FALSE
race      FALSE      FALSE
smoke     FALSE      FALSE
ptl       FALSE      FALSE
ht        FALSE      FALSE
ui        FALSE      FALSE
ftv       FALSE      FALSE
1 subsets of each size up to 7
Selection Algorithm: exhaustive
The immediate output of the function regsubsets() does not provide much information. To extract more useful information, the function summary() can be applied. This will include the following objects that can be printed.
- which: A logical matrix indicating which predictors are in each model. 1 indicates a variable is included and 0 not.
- rsq: The r-squared for each model (higher, better)
- adjr2: Adjusted r-squared (higher, better)
- cp: Mallows' Cp (smaller, better)
- bic: Schwartz's Bayesian information criterion, BIC (lower, better)
- rss: Residual sum of squares for each model (lower, better)
Note that from the output below, we have $R^2$, adjusted $R^2$, Mallows' cp, BIC and RSS for the best models with 1 to 7 predictors. We can then select the best model among the 7 best models. For example, based on adjusted $R^2$, the model with 5 predictors is the best because it has the largest adjusted $R^2$ (0.200). Based on BIC, however, the model with 4 predictors is the best since it has the smallest BIC (-16.648). Obviously, different criteria might lead to different best models.
> library(MASS); data(birthwt)
> library(leaps)
> all<-regsubsets(bwt~lwt+race+smoke+ptl+ht+ui+ftv, data=birthwt,
+   nbest=1, nvmax=7)
> info <- summary(all)
> cbind(info$which, round(cbind(rsq=info$rsq,
+   adjr2=info$adjr2, cp=info$cp, bic=info$bic, rss=info$rss), 3))
  (Intercept) lwt race smoke ptl ht ui ftv   rsq adjr2     cp     bic      rss
1           1   0    0     0   0  0  1   0 0.081 0.076 29.155  -5.402 91910625
2           1   0    1     0   0  0  1   0 0.113 0.103 23.629  -6.922 88680498
3           1   0    1     1   0  0  1   0 0.175 0.162 11.058 -15.501 82427257
4           1   0    1     1   0  1  1   0 0.203 0.186  6.674 -16.648 79687294
5           1   1    1     1   0  1  1   0 0.221 0.200  4.371 -15.838 77840556
6           1   1    1     1   1  1  1   0 0.222 0.197  6.117 -10.861 77731620
7           1   1    1     1   1  1  1   1 0.223 0.193  8.000  -5.742 77681278
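To let R pick out the winner under each criterion, the corresponding index can be extracted from the summary object created above (with nbest=1, the index equals the number of predictors in the model):

which.max(info$adjr2)    # subset size with the largest adjusted R-squared
which.min(info$bic)      # subset size with the smallest BIC
which.min(info$cp)       # subset size with the smallest Cp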
We can also plot the different statistics to visually inspect the best models. Mallows' Cp plot is one popular choice. In such a plot, Mallows' Cp is plotted against $p+1$, the number of predictors plus one. As mentioned earlier, for a good model, $C_p \approx p+1$. Therefore, the models on or below the line $y=x$ can be considered acceptable models. In this example, both the model with 5 predictors and the one with 6 predictors are good models.
> library(MASS); data(birthwt)
> library(leaps)
> all<-regsubsets(bwt~lwt+race+smoke+ptl+ht+ui+ftv, data=birthwt,
+   nbest=1, nvmax=7)
> info <- summary(all)
>
> plot(2:8, info$cp, xlab='P (# of predictors + 1)', ylab='Cp')
> abline(a=0,b=1)
Remarks
Using the all possible subsets method, one would select a model with a larger adjusted R-squared, a smaller Cp, and a smaller BIC. The different criteria quantify different aspects of the regression model, and therefore often yield different choices for the best set of predictors. That's okay as long as we don't misuse best subsets regression by claiming that it yields the best model. Rather, we should use best subsets regression as a screening tool, that is, as a way to reduce the large number of possible regression models to just a handful that we can evaluate further before arriving at one final model. If there are two competing models, one can select the one with fewer predictors or the one that makes more practical or theoretical sense.
With many predictors, for example, more than 40, the number of possible subsets can be huge. Often, there are several good models, although some are unstable. The best subset may be no better than a subset of randomly selected variables if the sample size is small relative to the number of predictors. The regression fit statistics and regression coefficient estimates can also be biased. In addition, all-possible-subsets selection can yield models that are too small. Generally speaking, one should not blindly trust the results. The data analyst knows more than the computer, and failure to use that human knowledge produces inadequate data analysis.
Backward elimination
Backward elimination begins with a model that includes all candidate variables. Variables are then deleted from the model one by one until all the variables remaining in the model satisfy a pre-specified criterion (e.g., all are statistically significant). At each step, the variable contributing the least to the model is deleted. Once a variable is deleted, it cannot come back into the model.
The R package MASS has a function stepAIC() that can be used to conduct backward elimination. To use the function, one first needs to define a null model and a full model. The null model is typically the model without any predictors (the intercept-only model), and the full model is often the one with all the candidate predictors included. For the birth weight example, the R code is shown below. Note that backward elimination here is based on AIC: it stops when removing any remaining predictor would increase the AIC.
> library(MASS)
> data(birthwt)
>
> null<-lm(bwt~ 1, data=birthwt) # 1 here means the intercept
> full<-lm(bwt~ lwt+race+smoke+ptl+ht+ui+ftv, data=birthwt)
>
> stepAIC(full, scope=list(lower=null, upper=full),
+   data=birthwt, direction='backward')
Start: AIC=2459.09
bwt ~ lwt + race + smoke + ptl + ht + ui + ftv

        Df Sum of Sq      RSS    AIC
- ftv    1     50343 77731620 2457.2
- ptl    1    110047 77791325 2457.3
<none>               77681278 2459.1
- lwt    1   1789656 79470934 2461.4
- ht     1   3731126 81412404 2465.9
- race   1   4707970 82389248 2468.2
- smoke  1   4843734 82525012 2468.5
- ui     1   5749594 83430871 2470.6

Step: AIC=2457.21
bwt ~ lwt + race + smoke + ptl + ht + ui

        Df Sum of Sq      RSS    AIC
- ptl    1    108936 77840556 2455.5
<none>               77731620 2457.2
- lwt    1   1741198 79472818 2459.4
- ht     1   3681167 81412788 2463.9
- race   1   4660187 82391807 2466.2
- smoke  1   4810582 82542203 2466.6
- ui     1   5716074 83447695 2468.6

Step: AIC=2455.47
bwt ~ lwt + race + smoke + ht + ui

        Df Sum of Sq      RSS    AIC
<none>               77840556 2455.5
- lwt    1   1846738 79687294 2457.9
- ht     1   3718531 81559088 2462.3
- race   1   4727071 82567628 2464.6
- smoke  1   5237430 83077987 2465.8
- ui     1   6302771 84143327 2468.2

Call:
lm(formula = bwt ~ lwt + race + smoke + ht + ui, data = birthwt)

Coefficients:
(Intercept)          lwt         race        smoke           ht           ui
   3104.438        3.434     -187.849     -366.135     -595.820     -523.419
Forward selection
Forward selection begins with a model that includes no predictors (the intercept-only model). Variables are then added to the model one by one until no remaining variable improves the model according to a certain criterion. At each step, the variable showing the biggest improvement to the model is added. Once a variable is in the model, it remains there.
The function stepAIC() can also be used to conduct forward selection. For the birth weight example, the R code is shown below. Note that forward selection stops when adding any of the remaining predictors would no longer decrease the AIC.
> library(MASS)
> data(birthwt)
>
> null<-lm(bwt~ 1, data=birthwt) # 1 here means the intercept
> full<-lm(bwt~ lwt+race+smoke+ptl+ht+ui+ftv, data=birthwt)
>
> stepAIC(null, scope=list(lower=null, upper=full),
+   data=birthwt, direction='forward')
Start: AIC=2492.76
bwt ~ 1

        Df Sum of Sq      RSS    AIC
+ ui     1   8059031 91910625 2478.9
+ race   1   3790184 96179472 2487.5
+ smoke  1   3625946 96343710 2487.8
+ lwt    1   3448639 96521017 2488.1
+ ptl    1   2391041 97578614 2490.2
+ ht     1   2130425 97839231 2490.7
<none>               99969656 2492.8
+ ftv    1    339993 99629663 2494.1

Step: AIC=2478.88
bwt ~ ui

        Df Sum of Sq      RSS    AIC
+ race   1   3230127 88680498 2474.1
+ ht     1   3162595 88748030 2474.3
+ smoke  1   2996636 88913988 2474.6
+ lwt    1   2074421 89836203 2476.6
<none>               91910625 2478.9
+ ptl    1    854664 91055961 2479.1
+ ftv    1    172098 91738526 2480.5

Step: AIC=2474.11
bwt ~ ui + race

        Df Sum of Sq      RSS    AIC
+ smoke  1   6253241 82427257 2462.3
+ ht     1   3000965 85679533 2469.6
+ lwt    1   1367676 87312822 2473.2
<none>               88680498 2474.1
+ ptl    1    869259 87811239 2474.2
+ ftv    1     59737 88620761 2476.0

Step: AIC=2462.29
bwt ~ ui + race + smoke

        Df Sum of Sq      RSS    AIC
+ ht     1   2739963 79687294 2457.9
+ lwt    1    868170 81559088 2462.3
<none>               82427257 2462.3
+ ptl    1    220563 82206694 2463.8
+ ftv    1      8390 82418867 2464.3

Step: AIC=2457.9
bwt ~ ui + race + smoke + ht

        Df Sum of Sq      RSS    AIC
+ lwt    1   1846738 77840556 2455.5
<none>               79687294 2457.9
+ ptl    1    214476 79472818 2459.4
+ ftv    1      1134 79686160 2459.9

Step: AIC=2455.47
bwt ~ ui + race + smoke + ht + lwt

        Df Sum of Sq      RSS    AIC
<none>               77840556 2455.5
+ ptl    1    108936 77731620 2457.2
+ ftv    1     49231 77791325 2457.3

Call:
lm(formula = bwt ~ ui + race + smoke + ht + lwt, data = birthwt)

Coefficients:
(Intercept)           ui         race        smoke           ht          lwt
   3104.438     -523.419     -187.849     -366.135     -595.820        3.434
Stepwise regression
Stepwise regression is a combination of the backward elimination and forward selection methods. The stepwise method is a modification of the forward selection approach: it differs in that variables already in the model do not necessarily stay there. As in forward selection, stepwise regression adds one variable to the model at a time. After a variable is added, however, stepwise regression rechecks all the variables already included to see whether any of them should be deleted because it no longer improves the model according to a certain criterion.
The function stepAIC() can also be used to conduct stepwise selection, by setting direction='both'. For the birth weight example, the R code is shown below.
> library(MASS)
> data(birthwt)
>
> null<-lm(bwt~ 1, data=birthwt) # 1 here means the intercept
> full<-lm(bwt~ lwt+race+smoke+ptl+ht+ui+ftv, data=birthwt)
>
> stepAIC(null, scope=list(lower=null, upper=full),
+   data=birthwt, direction='both')
Start: AIC=2492.76
bwt ~ 1

        Df Sum of Sq      RSS    AIC
+ ui     1   8059031 91910625 2478.9
+ race   1   3790184 96179472 2487.5
+ smoke  1   3625946 96343710 2487.8
+ lwt    1   3448639 96521017 2488.1
+ ptl    1   2391041 97578614 2490.2
+ ht     1   2130425 97839231 2490.7
<none>               99969656 2492.8
+ ftv    1    339993 99629663 2494.1

Step: AIC=2478.88
bwt ~ ui

        Df Sum of Sq      RSS    AIC
+ race   1   3230127 88680498 2474.1
+ ht     1   3162595 88748030 2474.3
+ smoke  1   2996636 88913988 2474.6
+ lwt    1   2074421 89836203 2476.6
<none>               91910625 2478.9
+ ptl    1    854664 91055961 2479.1
+ ftv    1    172098 91738526 2480.5
- ui     1   8059031 99969656 2492.8

Step: AIC=2474.11
bwt ~ ui + race

        Df Sum of Sq      RSS    AIC
+ smoke  1   6253241 82427257 2462.3
+ ht     1   3000965 85679533 2469.6
+ lwt    1   1367676 87312822 2473.2
<none>               88680498 2474.1
+ ptl    1    869259 87811239 2474.2
+ ftv    1     59737 88620761 2476.0
- race   1   3230127 91910625 2478.9
- ui     1   7498974 96179472 2487.5

Step: AIC=2462.29
bwt ~ ui + race + smoke

        Df Sum of Sq      RSS    AIC
+ ht     1   2739963 79687294 2457.9
+ lwt    1    868170 81559088 2462.3
<none>               82427257 2462.3
+ ptl    1    220563 82206694 2463.8
+ ftv    1      8390 82418867 2464.3
- smoke  1   6253241 88680498 2474.1
- ui     1   6323012 88750270 2474.3
- race   1   6486731 88913988 2474.6

Step: AIC=2457.9
bwt ~ ui + race + smoke + ht

        Df Sum of Sq      RSS    AIC
+ lwt    1   1846738 77840556 2455.5
<none>               79687294 2457.9
+ ptl    1    214476 79472818 2459.4
+ ftv    1      1134 79686160 2459.9
- ht     1   2739963 82427257 2462.3
- smoke  1   5992239 85679533 2469.6
- race   1   6186692 85873986 2470.0
- ui     1   7205312 86892607 2472.3

Step: AIC=2455.47
bwt ~ ui + race + smoke + ht + lwt

        Df Sum of Sq      RSS    AIC
<none>               77840556 2455.5
+ ptl    1    108936 77731620 2457.2
+ ftv    1     49231 77791325 2457.3
- lwt    1   1846738 79687294 2457.9
- ht     1   3718531 81559088 2462.3
- race   1   4727071 82567628 2464.6
- smoke  1   5237430 83077987 2465.8
- ui     1   6302771 84143327 2468.2

Call:
lm(formula = bwt ~ ui + race + smoke + ht + lwt, data = birthwt)

Coefficients:
(Intercept)           ui         race        smoke           ht          lwt
   3104.438     -523.419     -187.849     -366.135     -595.820        3.434
Remarks
Forward or backward?
If you have a very large set of candidate predictors from which you wish to extract a few (i.e., if you're on a fishing expedition), you should generally go forward. If, on the other hand, you have a modest-sized set of potential variables from which you wish to eliminate a few (i.e., if you're fine-tuning some prior selection of variables), you should generally go backward. If you're on a fishing expedition, you should still be careful not to cast too wide a net, selecting variables that are only accidentally related to your dependent variable.
Stepwise regression
Stepwise regression can yield R-squared values that are badly biased upward. The method can also yield confidence intervals for effects and predicted values that are falsely narrow. It gives biased regression coefficients that need shrinkage; that is, the coefficients of the remaining variables tend to be too large in magnitude. It also has severe problems in the presence of collinearity, and increasing the sample size does not help very much.
Stepwise or all-possible-subsets?
Stepwise regression often works reasonably well as an automatic variable selection method, but this is not guaranteed. If the number of candidate predictors is large relative to the number of observations in your data set (say, more than 1 variable for every 10 observations), or if there is excessive multicollinearity (predictors are highly correlated), then the stepwise algorithms may go astray and end up throwing nearly all the variables into the model, especially if a low threshold is used on a criterion such as the F statistic.
All-possible-subsets goes beyond stepwise regression and literally tests all possible subsets of the set of potential independent variables. But it carries all the caveats of stepwise regression.
Use your knowledge
A model selected by automatic methods can only find the "best" combination from among the set of variables you start with: if you omit some important variables, no amount of searching will compensate! Remember that the computer is not necessarily right in its choice of a model during the automatic phase of the search. Don't accept a model just because the computer gave it its blessing. Use your own judgment and intuition about your data to try to fine-tune whatever the computer comes up with.