Variable selection in regression is arguably the hardest part of model building. The purpose of variable selection in regression is to identify the best subset of predictors among many variables to include in a model. The issue is how to find the necessary variables among the complete set of variables by deleting both irrelevant variables (variables not affecting the dependent variable) and redundant variables (variables adding nothing beyond what the other predictors already provide). Many variable selection methods exist, each providing a solution to one of the most important problems in statistics.
The general theme of variable selection is to examine certain subsets and select the best subset, which either maximizes or minimizes an appropriate criterion. More specifically, a model selection method usually includes the following three components:
Select a test statistic
Select a criterion for the selected test statistic
Make a decision on removing / keeping a variable.
Statistics/criteria for variable selection
In the literature, many statistics have been used for the variable selection purpose. Before we discuss them, bear in mind that different statistics/criteria may lead to very different choices of variables.
t-test for a single predictor at a time
We have learned how to use the t-test to test the significance of a single predictor. It is often used as a way to select predictors: the general rule is that if a predictor is significant, it can be included in a regression model.
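As a quick illustration (not part of the birth weight example introduced later; the built-in mtcars data and the model below are used purely for demonstration), the t-test for each predictor appears in the coefficient table produced by summary():

# illustration only: t-tests for individual predictors in a multiple regression
m1 <- lm(mpg ~ wt + hp + drat, data = mtcars)
summary(m1)   # the "t value" and "Pr(>|t|)" columns give the t-test for each predictor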
F-test for the whole model or for comparing two nested models
The F-test can be used to test the significance of one or more predictors jointly. Therefore, it can also be used for variable selection. For example, if the overall F-test for a subset of predictors in a model is not significant, one might simply remove them from the regression model.
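For instance (again a generic sketch with the mtcars data rather than the book's example), anova() carries out the F-test comparing two nested models:

# F-test for whether hp and drat improve the model beyond wt alone
m.small <- lm(mpg ~ wt, data = mtcars)
m.large <- lm(mpg ~ wt + hp + drat, data = mtcars)
anova(m.small, m.large)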
$R^{2}$ and Adjusted $R^{2}$
$R^{2}$ can be used to measure the practical importance of a predictor. If a predictor contributes substantially to the overall $R^{2}$ or adjusted $R^{2}$, it should be considered for inclusion in the model.
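Continuing the same illustrative mtcars sketch, the $R^2$ and adjusted $R^2$ of competing models can be extracted from summary():

m.small <- lm(mpg ~ wt, data = mtcars)
m.large <- lm(mpg ~ wt + hp + drat, data = mtcars)
summary(m.small)$adj.r.squared   # adjusted R-squared without hp and drat
summary(m.large)$adj.r.squared   # adjusted R-squared with hp and drat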
Mallows' $C_{p}$
Mallows' $C_{p}$ is widely used in variable selection. It compares a model with $p$ predictors vs. all $k$ predictors ($k > p$) using a $C_p$ statistic:
\[C_{p}=\frac{SSE_{p}}{MSE_{k}}-N+2(p+1)\]
where $SSE_{p}$ is the sum of squared errors for the model with $p$ predictors, $MSE_{k}$ is the mean squared residual for the model with all $k$ predictors, and $N$ is the sample size. The expectation of $C_{p}$ is $p+1$. Intuitively, if the model with $p$ predictors fits as well as the model with all $k$ predictors, that is, the simpler model fits as well as the more complex one, the two models should have about the same mean squared error. Since $SSE_{p}=(N-p-1)MSE_{p}$, we would then expect $SSE_{p}/MSE_{k} = N-p-1$ and therefore $C_p = p+1$. In variable selection, we should therefore look for a subset of variables with $C_{p}$ around $p+1$ ($C_{p}\approx p+1$) or smaller ($C_{p} < p+1$). On the other hand, a model with a bad fit would have a $C_{p}$ much bigger than $p+1$.
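A minimal sketch of computing $C_p$ directly from this formula, again using the built-in mtcars data (the models and object names are only for illustration):

# sub-model with p = 1 predictor vs. full model with k = 3 predictors
fit.p <- lm(mpg ~ wt, data = mtcars)
fit.k <- lm(mpg ~ wt + hp + drat, data = mtcars)
N    <- nrow(mtcars)
p    <- 1
k    <- 3
SSEp <- sum(resid(fit.p)^2)                 # SSE of the sub-model
MSEk <- sum(resid(fit.k)^2) / (N - k - 1)   # mean squared residual of the full model
Cp   <- SSEp / MSEk - N + 2 * (p + 1)
Cp   # compare with p + 1 = 2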
Information criteria
Information criteria such as the AIC (Akaike information criterion) and BIC (Bayesian information criterion) are often used in variable selection. For a regression model with $p$ predictors, AIC and BIC can be defined (up to an additive constant) as
\[AIC = N\ln\left(\frac{SSE}{N}\right)+2(p+1)\]
\[BIC = N\ln\left(\frac{SSE}{N}\right)+(p+1)\ln(N).\]
Note that AIC and BIC represent a trade-off between goodness of model fit and model complexity. With more predictors in a regression model, $SSE$ typically becomes smaller, or at least no larger, and therefore the first part of AIC and BIC becomes smaller. However, with more predictors the model becomes more complex, and therefore the second part of AIC and BIC becomes bigger. Using an information criterion, we try to identify the model with the smallest AIC or BIC, balancing model fit against model complexity.
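In R, the built-in AIC() and BIC() functions compute these criteria from the full likelihood, so their values differ from the $SSE$-based expressions above by an additive constant; the ranking of models is the same. A quick illustration with the mtcars data (not the book's example):

fit.p <- lm(mpg ~ wt, data = mtcars)
fit.k <- lm(mpg ~ wt + hp + drat, data = mtcars)
AIC(fit.p); AIC(fit.k)   # smaller is better
BIC(fit.p); BIC(fit.k)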
An example
Through an example, we introduce different variable selection methods and illustrate their use. The data here were collected from 189 infants and mothers at the Baystate Medical Center, Springfield, Mass in 1986 on the following variables.
low: indicator of birth weight less than 2.5 kg.
age: mother's age in years.
lwt: mother's weight in pounds at last menstrual period.
race: mother's race (1 = white, 2 = black, 3 = other).
smoke: smoking status during pregnancy.
ptl: number of previous premature labours.
ht: history of hypertension.
ui: presence of uterine irritability.
ftv: number of physician visits during the first trimester.
bwt: birth weight in grams.
A subset of the data is shown below. Note that the data are included with the R package MASS. Therefore, once the package is loaded, one can access the data using data(birthwt).
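A minimal sketch for accessing the data (assuming the MASS package is installed):

library(MASS)      # the birthwt data are included in the MASS package
data(birthwt)
head(birthwt)      # view the first few rows of the data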
The purpose of the study is to identify possible risk factors associated with low infant birth weight. Using the study and the data, we introduce four methods for variable selection: (1) all possible subsets (best subsets) analysis, (2) backward elimination, (3) forward selection, and (4) stepwise selection/regression.
All possible (best) subsets
The basic idea of the all possible subsets approach is to run every possible combination of the predictors to find the best subset to meet some pre-defined objective criterion such as \(C_{p}\) or adjusted \(R^{2}\). It is hoped that one ends up with a reasonable and useful regression model. Manually, we can fit each possible model one by one using lm() and compare the model fits. To run the procedure automatically, we can use the regsubsets() function in the R package leaps.
Using the birth weight data, we can run the analysis as shown below. In the function regsubsets(),
The regular formula interface can be used to specify the model with all the predictors to be studied. In this example, it is bwt~lwt+race+smoke+ptl+ht+ui+ftv. One can also provide the outcome variable as a vector and the predictors in a matrix.
data specifies the data set to be used.
nbest is the number of the best subsets of each size to save. If nbest=1, only the best model will be saved for each number of predictors. If nbest=2, the best two models will be saved for each number of predictors.
nvmax is the maximum size of subsets of predictors to examine. It specifies the maximum number of predictors you want to include in the final regression model. For example, if you have 7 predictors but set nvmax=5, then the most complex model to be evaluated will have only 5 predictors. Using this option can greatly reduce computing time when a large number of predictors are evaluated.
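Putting these options together, a call along the following lines runs the analysis (a sketch; the object name all.subsets is arbitrary, and nbest/nvmax can be adjusted as described above):

library(leaps)
library(MASS)
data(birthwt)
# best subset of each size among the 7 candidate predictors
all.subsets <- regsubsets(bwt ~ lwt + race + smoke + ptl + ht + ui + ftv,
                          data = birthwt, nbest = 1, nvmax = 7)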
The immediate output of the function regsubsets() does not provide much information. To extract more useful information, the function summary() can be applied. This will include the following objects that can be printed.
which: A logical matrix indicating which predictors are in each model. TRUE (or 1) indicates a variable is included and FALSE (or 0) that it is not.
rsq: The r-squared for each model (higher, better)
adjr2: Adjusted r-squared (higher, better)
cp: Mallows' Cp (smaller, better)
bic: Schwarz's Bayesian information criterion, BIC (lower, better)
rss: Residual sum of squares for each model (lower, better)
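For example, continuing the regsubsets() sketch above, these objects can be extracted as follows:

subsets.summary <- summary(all.subsets)
subsets.summary$which    # predictors included in each best model
subsets.summary$adjr2    # adjusted R-squared
subsets.summary$cp       # Mallows' Cp
subsets.summary$bic      # BIC
subsets.summary$rss      # residual sum of squares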
Note that from the summary output, we have $R^2$, adjusted $R^2$, Mallows' Cp, BIC, and RSS for the best models with 1 predictor up to 7 predictors. We can then select the best model among the 7 best models. For example, based on adjusted $R^2$, we would say the model with 6 predictors is best because it has the largest adjusted $R^2$. But based on BIC, the model with 5 predictors is the best since it has the smallest BIC. Obviously, different criteria might lead to different best models.
We can also plot the different statistics to visually inspect the best models. The Mallows' Cp plot is one popular plot to use. In such a plot, Mallows' Cp is plotted against the number of predictors. As mentioned earlier, for a good model $C_p \approx p+1$. Therefore, the models on or below the reference line $C_p = p+1$ can be considered acceptable models. In this example, both the model with 5 predictors and the one with 6 predictors are good models.
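One way to produce such a plot with base R graphics, continuing the summary object above (the leaps and car packages also offer their own plotting functions):

p.size <- 1:7   # number of predictors in each best model
plot(p.size, subsets.summary$cp, xlab = "Number of predictors p",
     ylab = "Mallows' Cp", pch = 19)
abline(a = 1, b = 1)   # reference line Cp = p + 1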
Using the all possible subsets method, one would select a model with a larger adjusted R-squared, a smaller Cp, and a smaller BIC. The different criteria quantify different aspects of the regression model, and therefore often yield different choices for the best set of predictors. That is okay as long as we don't misuse best subsets regression by claiming that it yields the best model. Rather, we should use best subsets regression as a screening tool, that is, as a way to reduce the large number of possible regression models to just a handful that we can evaluate further before arriving at one final model. If there are two competing models, one can select the one with fewer predictors or the one that makes more practical or theoretical sense.
With many predictors, for example, more than 40 predictors, the number of possible subsets can be huge. Often, there are several good models, although some are unstable. The best subset may be no better than a subset of some randomly selected variables if the sample size is small relative to the number of predictors. The regression fit statistics and regression coefficient estimates can also be biased. In addition, all-possible-subsets selection can yield models that are too small. Generally speaking, one should not blindly trust the results. The data analyst knows more than the computer, and failure to use human knowledge produces inadequate data analysis.
Backward elimination
Backward elimination begins with a model which includes all candidate variables. Variables are then deleted from the model one by one until all the variables remaining in the model are significant and exceed certain criteria. At each step, the variable showing the smallest improvement to the model is deleted. Once a variable is deleted, it cannot come back to the model.
The R package MASS has a function stepAIC() that can be used to conduct backward elimination. To use the function, one first needs to define a null model and a full model. The null model is typically a model without any predictors (the intercept-only model), and the full model is often the one with all the candidate predictors included. For the birth weight example, the R code is shown below. Note that backward elimination is based on AIC here. It stops when removing any remaining predictor would increase the AIC.
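A sketch of the backward elimination for the birth weight data (the candidate predictors are those listed earlier; object names are arbitrary):

library(MASS)
data(birthwt)
# full model with all candidate predictors
model.full <- lm(bwt ~ lwt + race + smoke + ptl + ht + ui + ftv, data = birthwt)
# backward elimination based on AIC
model.backward <- stepAIC(model.full, direction = "backward")
summary(model.backward)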
Forward selection
Forward selection begins with a model which includes no predictors (the intercept-only model). Variables are then added to the model one by one until no remaining variable improves the model by a certain criterion. At each step, the variable showing the biggest improvement to the model is added. Once a variable is in the model, it remains there.
The function stepAIC() can also be used to conduct forward selection. For the birth weight example, the R code is shown below. Note that forward selection stops when adding any additional predictor would no longer decrease the AIC.
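A sketch of the forward selection (the scope formula lists the candidate predictors; object names are arbitrary):

library(MASS)
data(birthwt)
# null model with the intercept only
model.null <- lm(bwt ~ 1, data = birthwt)
# forward selection based on AIC, searching among the candidate predictors
model.forward <- stepAIC(model.null, direction = "forward",
                         scope = ~ lwt + race + smoke + ptl + ht + ui + ftv)
summary(model.forward)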
Stepwise selection
Stepwise regression is a combination of both backward elimination and forward selection methods. The stepwise method is a modification of the forward selection approach and differs in that variables already in the model do not necessarily stay. As in forward selection, stepwise regression adds one variable to the model at a time. After a variable is added, however, stepwise regression checks all the variables already included again to see whether there is a need to delete any variable that does not provide an improvement to the model based on a certain criterion.
The function stepAIC() can also be used to conduct stepwise selection. For the birth weight example, the R code is shown below.
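A sketch of the stepwise selection, starting from the intercept-only model and allowing variables both to enter and to leave (again, object names are arbitrary):

library(MASS)
data(birthwt)
model.null <- lm(bwt ~ 1, data = birthwt)
# stepwise selection based on AIC: variables can be both added and dropped
model.stepwise <- stepAIC(model.null, direction = "both",
                          scope = ~ lwt + race + smoke + ptl + ht + ui + ftv)
summary(model.stepwise)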
If you have a very large set of candidate predictors from which you wish to extract a few (i.e., if you're on a fishing expedition), you should generally go forward. If, on the other hand, you have a modest-sized set of potential variables from which you wish to eliminate a few (i.e., if you're fine-tuning some prior selection of variables), you should generally go backward. If you're on a fishing expedition, you should still be careful not to cast too wide a net, selecting variables that are only accidentally related to your dependent variable.
Stepwise regression
Stepwise regression can yield R-squared values that are badly biased high. The method can also yield confidence intervals for effects and predicted values that are falsely narrow. It gives biased regression coefficients that need shrinkage (i.e., the coefficients for the remaining variables are too large). It also has severe problems in the presence of collinearity, and increasing the sample size does not help very much.
Stepwise or all-possible-subsets?
Stepwise regression often works reasonably well as an automatic variable selection method, but this is not guaranteed. If the number of candidate predictors is large compared to the number of observations in your data set (say, more than 1 variable for every 10 observations), or if there is excessive multicollinearity (predictors are highly correlated), then the stepwise algorithms may go crazy and end up throwing nearly all the variables into the model, especially if you use a low threshold on a criterion such as the F statistic.
All-possible-subsets goes beyond stepwise regression and literally tests all possible subsets of the set of potential independent variables. But it carries all the caveats of stepwise regression.
Use your knowledge
A model selected by automatic methods can only find the "best" combination from among the set of variables you start with: if you omit some important variables, no amount of searching will compensate! Remember that the computer is not necessarily right in its choice of a model during the automatic phase of the search. Don't accept a model just because the computer gave it its blessing. Use your own judgment and intuition about your data to try to fine-tune whatever the computer comes up with.