Relative Importance of Predictors

With multiple predictors, a natural question is which predictor is more important or useful for predicting the outcome variable. A correlation describes the relationship between two variables, but it falls short when there are multiple predictors. In the regression framework, the standardized regression coefficients can be compared. But as mentioned earlier, even after standardization, the predictors might not be directly comparable. In addition, with correlated predictors, the standardized coefficients might not tell which predictor is more important. One recommended alternative is to calculate the "relative importance" of the predictors. There are different ways to estimate relative importance, among which the method developed by Lindeman, Merenda and Gold (lmg; 1980) is often recommended. lmg calculates the relative contribution of each predictor to the $R^2$, taking into account the order in which the predictors enter the model.

Basic idea of lmg

$R^2$ represents the proportion of variance explained by a set of predictors. If one can estimate the share of the $R^2$ contributed by each individual predictor, the predictor with the larger share would be more important in explaining the outcome variable. However, the difficulty lies in how to get the $R^2$ contribution of each predictor.

The most intuitive way to decompose the total $R^2$ is to add the predictors to the regression model sequentially. Then, the increase in $R^2$ can be considered as the contribution of the predictor just added. However, if the predictors are correlated, the result depends on the order in which the predictors are added.
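The order dependence can be seen in a small sketch. The data below are simulated purely for illustration (they are not the GPA data), and the names x1, x2, y and the helper r2() are assumptions made for this example.

```r
# Simulated data (not the GPA data): two correlated predictors and an outcome.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = sqrt(1 - 0.7^2))  # correlated with x1
y  <- 0.5 * x1 + 0.5 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

# Hypothetical helper: R^2 of a linear model for the given formula.
r2 <- function(formula) summary(lm(formula, data = dat))$r.squared

r2.full <- r2(y ~ x1 + x2)

# Ordering 1: x1 enters first, then x2.
order1 <- c(x1 = r2(y ~ x1), x2 = r2.full - r2(y ~ x1))
# Ordering 2: x2 enters first, then x1.
order2 <- c(x1 = r2.full - r2(y ~ x2), x2 = r2(y ~ x2))

# Both decompositions sum to the full-model R^2, but each predictor's
# share differs between the two orderings.
print(rbind(order1, order2))
print(r2.full)
```

Whichever predictor enters first is credited with the variance the two predictors share, which is why a single ordering can be misleading.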

The lmg approach is based on sequential $R^2$ but takes care of the dependence on orderings by averaging over orderings. For example, for a model with 4 predictors, there are a total of $4! = 24$ orderings. For each ordering, the $R^2$ contributed by each predictor can be calculated. The lmg value of a predictor is the average of its contributed $R^2$ across the 24 orderings.
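The averaging over orderings can be sketched in base R as follows. This is a brute-force illustration of the idea on simulated data, not the implementation used by any package; the helpers r2() and perms() and the variable names are assumptions made for this example.

```r
# Simulated data (not the GPA data): three correlated predictors.
set.seed(2)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
x3 <- 0.3 * x2 + rnorm(n)
y  <- 0.5 * x1 + 0.4 * x2 + 0.2 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)
preds <- c("x1", "x2", "x3")

# R^2 of the model containing the predictors in `vars` (0 for the empty model).
r2 <- function(vars) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response = "y"), data = dat))$r.squared
}

# All orderings of a vector, built recursively.
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

orderings <- perms(preds)   # 3! = 6 orderings

# For each ordering, record the R^2 increase when each predictor enters.
increments <- sapply(orderings, function(ord) {
  inc <- numeric(length(ord))
  names(inc) <- ord
  for (k in seq_along(ord)) {
    inc[k] <- r2(ord[seq_len(k)]) - r2(ord[seq_len(k - 1)])
  }
  inc[preds]                # put the increments back in a common order
})

lmg <- rowMeans(increments) # average over the orderings
print(round(lmg, 4))
print(sum(lmg))             # equals the full-model R^2
print(r2(preds))
```

Because every ordering's increments sum to the full-model $R^2$, the averaged shares do as well. The number of orderings grows factorially with the number of predictors, so this brute-force version is only practical for a handful of predictors.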

R package relaimpo 

The R package relaimpo developed by Groemping (2007) includes the method lmg and 7 other methods to calculate the relative importance of predictors.

An example

We now calculate the relative importance of the three predictors: high school GPA, SAT and quality of recommendation letters in the GPA example. To do that, the function calc.relimp() is used. Before using the function, we have to fit the regression model first.

From the output, we can see that the total proportion of variance explained by the model with all three predictors is 39.97%. Of this, high school GPA contributes 0.172, SAT 0.177 and recommd 0.051. Note that the three numbers add up to 0.3997. We can also get each predictor's share of the overall $R^2$ by adding the option rela=TRUE to the function call. Clearly, the relative importance of high school GPA and SAT is similar, whereas the relative importance of the quality of recommendation letters is lower.

> library('relaimpo')
Loading required package: MASS
Loading required package: boot
Loading required package: survey
Loading required package: grid
Loading required package: Matrix
Loading required package: survival

Attaching package: 'survival'

The following object is masked from 'package:boot':

    aml


Attaching package: 'survey'

The following object is masked from 'package:graphics':

    dotchart

Loading required package: mitools
This is the global version of package relaimpo.

If you are a non-US user, a version with the interesting additional metric pmvd is available

from Ulrike Groempings web site at prof.beuth-hochschule.de/groemping.

> usedata('gpa')
> 
> gpa.model<-lm(c.gpa~h.gpa+SAT+recommd, data=gpa)
> calc.relimp(gpa.model)
Response variable: c.gpa 
Total response variance: 0.5613402 
Analysis based on 100 observations 

3 Regressors: 
h.gpa SAT recommd 
Proportion of variance explained by model: 39.97%
Metrics are not normalized (rela=FALSE). 

Relative importance metrics: 

               lmg
h.gpa   0.17169723
SAT     0.17698341
recommd 0.05105477

Average coefficients for different model sizes: 

                1X         2Xs        3Xs
h.gpa   0.56549057 0.481773394 0.37635109
SAT     0.00180201 0.001416298 0.00122693
recommd 0.17539410 0.065635789 0.02268425
> calc.relimp(gpa.model, rela=TRUE)
Response variable: c.gpa 
Total response variance: 0.5613402 
Analysis based on 100 observations 

3 Regressors: 
h.gpa SAT recommd 
Proportion of variance explained by model: 39.97%
Metrics are normalized to sum to 100% (rela=TRUE). 

Relative importance metrics: 

              lmg
h.gpa   0.4295272
SAT     0.4427514
recommd 0.1277214

Average coefficients for different model sizes: 

                1X         2Xs        3Xs
h.gpa   0.56549057 0.481773394 0.37635109
SAT     0.00180201 0.001416298 0.00122693
recommd 0.17539410 0.065635789 0.02268425
> 
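As a quick arithmetic check, the normalized metrics reported with rela=TRUE can be recovered from the unnormalized ones by dividing each by their sum. The sketch below simply copies the three lmg values from the output above; the object names are chosen for this example.

```r
# lmg shares copied from the calc.relimp() output above.
lmg <- c(h.gpa = 0.17169723, SAT = 0.17698341, recommd = 0.05105477)

total <- sum(lmg)          # total R^2 of the model
print(round(total, 4))

# Normalized shares, as reported with rela=TRUE.
rela <- lmg / total
print(round(rela, 4))
```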

Bootstrap relative importance

For two predictors, after we get their relative importance measured by $R^2$, we might want to test whether one predictor is significantly more important than the other. However, unlike for a t-test, it is rather difficult to derive an analytical test statistic. Instead, the bootstrap can be used. The package relaimpo includes two functions -- boot.relimp() and booteval.relimp() -- for the task. The first function conducts the bootstrap and the second one computes the confidence intervals (CIs).

From the output, we can see that the lower and upper bounds of the 95% CI for the $R^2=0.1717$ of high school GPA are 0.0852 and 0.2837. For SAT, the CI is [0.069, 0.3223] and for recommd, it is [0.0146, 0.1355]. Using the CIs, we can conduct a test. For example, since the interval for h.gpa covers the $R^2$ of SAT, there is no significant difference in relative importance between the two predictors. On the other hand, neither the CI for h.gpa nor the CI for SAT covers the $R^2 = 0.0511$ of recommd. Therefore, the two predictors are statistically more important than the quality of recommendation letters.

The output also includes the CIs for the differences in the $R^2$ of any two predictors. For example, the difference in $R^2$ between h.gpa and SAT is -0.0053 with a CI [-0.1827, 0.1605]. Since the CI covers 0, the difference is not significant. In contrast, the CI for the difference between h.gpa and recommd, [0.0055, 0.2313], does not cover 0, so that difference is statistically significant.

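Note that the two ways of conducting the test do not necessarily lead to the same conclusion. For example, the CI for SAT does not cover the $R^2$ of recommd, yet the CI for the difference between SAT and recommd, [-0.0298, 0.2839], covers 0.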
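The two decision rules can be compared side by side with a short sketch. The numbers are the estimates and 95% intervals reported for the GPA example; the helper functions covers() and excludes0() are hypothetical names introduced only for this illustration.

```r
# Estimates and intervals reported by booteval.relimp() for the GPA example.
est <- c(h.gpa = 0.1717, SAT = 0.1770, recommd = 0.0511)
ci  <- rbind(h.gpa   = c(lower = 0.0852, upper = 0.2837),
             SAT     = c(lower = 0.0690, upper = 0.3223),
             recommd = c(lower = 0.0146, upper = 0.1355))
diff.ci <- rbind("h.gpa-SAT"     = c(lower = -0.1827, upper = 0.1605),
                 "h.gpa-recommd" = c(lower =  0.0055, upper = 0.2313),
                 "SAT-recommd"   = c(lower = -0.0298, upper = 0.2839))

# Approach 1: does the CI of predictor a cover the estimate of predictor b?
covers <- function(a, b) {
  est[[b]] >= ci[a, "lower"] && est[[b]] <= ci[a, "upper"]
}
# Approach 2: does the CI of the difference exclude 0?
excludes0 <- function(pair) {
  diff.ci[pair, "lower"] > 0 || diff.ci[pair, "upper"] < 0
}

approach1 <- c("h.gpa-SAT"     = !covers("h.gpa", "SAT"),
               "h.gpa-recommd" = !covers("h.gpa", "recommd"),
               "SAT-recommd"   = !covers("SAT", "recommd"))
approach2 <- sapply(rownames(diff.ci), excludes0)

# TRUE means "significantly different" under that approach.
print(cbind(approach1, approach2))
```

The two approaches agree for the first two pairs and disagree for SAT versus recommd.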

> library('relaimpo')
> usedata('gpa')
> 
> gpa.model<-lm(c.gpa~h.gpa+SAT+recommd, data=gpa)
> bootresults<-boot.relimp(gpa.model, b=1000) 
> ci<-booteval.relimp(bootresults, norank=TRUE)
> ci
Response variable: c.gpa 
Total response variance: 0.5613402 
Analysis based on 100 observations 

3 Regressors: 
h.gpa SAT recommd 
Proportion of variance explained by model: 39.97%
Metrics are not normalized (rela=FALSE). 

Relative importance metrics: 

               lmg
h.gpa   0.17169723
SAT     0.17698341
recommd 0.05105477

Average coefficients for different model sizes: 

                1X         2Xs        3Xs
h.gpa   0.56549057 0.481773394 0.37635109
SAT     0.00180201 0.001416298 0.00122693
recommd 0.17539410 0.065635789 0.02268425

 
 Confidence interval information ( 1000 bootstrap replicates, bty= perc ): 
Relative Contributions with confidence intervals: 
 
                       Lower  Upper
            percentage 0.95   0.95  
h.gpa.lmg   0.1717     0.0852 0.2837
SAT.lmg     0.1770     0.0690 0.3223
recommd.lmg 0.0511     0.0146 0.1355

CAUTION: Bootstrap confidence intervals can be somewhat liberal. 

 
 Differences between Relative Contributions: 
 
                                  Lower   Upper
                  difference 0.95 0.95    0.95   
h.gpa-SAT.lmg     -0.0053         -0.1827  0.1605
h.gpa-recommd.lmg  0.1206     *    0.0055  0.2313
SAT-recommd.lmg    0.1259         -0.0298  0.2839

* indicates that CI for difference does not include 0. 
CAUTION: Bootstrap confidence intervals can be somewhat liberal. 
> 
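The percentile bootstrap behind intervals of this kind (bty= perc in the output) can be sketched by hand. The example below is only an illustration of the general idea on simulated data with two predictors, where lmg reduces to the average over two orderings; it is not the implementation used by relaimpo, and the function lmg2() and all variable names are assumptions made for this example.

```r
# Simulated data (not the GPA data): two correlated predictors.
set.seed(3)
n  <- 150
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 0.6 * x1 + 0.1 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

# lmg for exactly two predictors: average the R^2 increment of each
# predictor over the two possible orderings.
lmg2 <- function(d) {
  r2 <- function(f) summary(lm(f, data = d))$r.squared
  full <- r2(y ~ x1 + x2)
  a <- r2(y ~ x1)
  b <- r2(y ~ x2)
  c(x1 = (a + (full - b)) / 2, x2 = (b + (full - a)) / 2)
}

est <- lmg2(dat)

# Resample rows with replacement and recompute lmg each time.
B <- 500
boot <- t(replicate(B, lmg2(dat[sample(n, replace = TRUE), ])))

# 95% percentile intervals for each share and for their difference.
ci.x1   <- quantile(boot[, "x1"], c(0.025, 0.975))
ci.x2   <- quantile(boot[, "x2"], c(0.025, 0.975))
ci.diff <- quantile(boot[, "x1"] - boot[, "x2"], c(0.025, 0.975))

print(round(est, 4))
print(round(rbind(x1 = ci.x1, x2 = ci.x2, diff = ci.diff), 4))
```

Bootstrapping the difference directly, rather than comparing two separate intervals, accounts for the fact that the two shares are estimated from the same resamples and are therefore dependent.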

The relative importance together with the bootstrap CIs can also be plotted conveniently in R by calling plot() on the result of booteval.relimp(). The error bars in the plot show the bootstrap CIs.

> library('relaimpo')
> usedata('gpa')
> 
> gpa.model<-lm(c.gpa~h.gpa+SAT+recommd, data=gpa)
> bootresults<-boot.relimp(gpa.model, b=1000) 
> ci<-booteval.relimp(bootresults, norank=TRUE)
> plot(ci)
> 

To cite the book, use: Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.