Relative Importance of Predictors
With multiple predictors, a natural question is which predictor is more important, or more useful, for predicting the outcome variable. A correlation describes the relationship between two variables, but it falls short when there are multiple predictors. In the regression framework, the standardized regression coefficients can be compared. But as mentioned earlier, even after standardization, the predictors might not be directly comparable. In addition, with correlated predictors, the standardized coefficients might not tell which predictor is more important. One recommended alternative is to calculate the "relative importance" of the predictors. There are different ways to estimate relative importance, among which the method developed by Lindeman, Merenda and Gold (lmg; 1980) is often recommended. lmg calculates the relative contribution of each predictor to the $R^2$, taking into account the order in which the predictors enter the model.
Basic idea of lmg
$R^2$ represents the proportion of variance explained by a set of predictors. If one can estimate the portion of the $R^2$ contributed by each individual predictor, the predictor with the larger contribution would be more important in explaining the outcome variable. However, the difficulty lies in how to get the $R^2$ for each predictor.
The most intuitive way to decompose the total $R^2$ is to add the predictors to the regression model sequentially. Then, the increase in $R^2$ can be considered as the contribution of the predictor just added. However, if the predictors are correlated, the result depends on the order in which the predictors are added.
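The order dependence can be illustrated with a small simulation. The following is a minimal sketch in base R; the data are simulated, and the names x1, x2, y, r2 and contrib are hypothetical and unrelated to the GPA example used in this chapter.

```r
# Simulated data (hypothetical; not the GPA data used in this chapter)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)        # x2 is correlated with x1
y  <- x1 + x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

# Helper: R-squared of a fitted linear model
r2 <- function(formula, data) summary(lm(formula, data = data))$r.squared

r2.full <- r2(y ~ x1 + x2, dat)

# Ordering 1: x1 enters first, then x2
x1.first  <- r2(y ~ x1, dat)
x2.second <- r2.full - x1.first

# Ordering 2: x2 enters first, then x1
x2.first  <- r2(y ~ x2, dat)
x1.second <- r2.full - x2.first

# Rows are orderings, columns are the R-squared credited to each predictor
contrib <- rbind(
  "x1 then x2" = c(x1 = x1.first,  x2 = x2.second),
  "x2 then x1" = c(x1 = x1.second, x2 = x2.first)
)
print(round(contrib, 3))
```

Each row sums to the same total $R^2$, but the amount credited to a given predictor is larger when it enters the model first.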
The lmg approach is based on sequential $R^2$ but removes the dependence on ordering by averaging over all orderings. For example, for a model with 4 predictors, there are a total of 4! = 24 orderings. For each ordering, the $R^2$ contributed by each predictor can be calculated. The lmg value of a predictor is the average of its contributed $R^2$ across the 24 orderings.
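The averaging over orderings can be sketched directly in base R. This is only an illustration of the idea under stated assumptions, not the implementation used by any package: the data are simulated with three predictors, and the helper names r2, perms, gains and lmg are hypothetical.

```r
# Simulated data with three correlated predictors (hypothetical example)
set.seed(123)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x3 <- 0.3 * x1 + 0.3 * x2 + rnorm(n)
y  <- 0.5 * x1 + 0.4 * x2 + 0.2 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)
preds <- c("x1", "x2", "x3")

# R-squared of y regressed on a set of predictors (0 for the empty set)
r2 <- function(vars) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response = "y"), data = dat))$r.squared
}

# All orderings of a vector, one per list element (simple recursion)
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

orderings <- perms(preds)     # 3! = 6 orderings

# For each ordering, record the increase in R-squared when each predictor enters
gains <- sapply(orderings, function(ord) {
  g <- setNames(numeric(length(preds)), preds)
  for (k in seq_along(ord)) {
    g[ord[k]] <- r2(ord[seq_len(k)]) - r2(ord[seq_len(k - 1)])
  }
  g
})

lmg <- rowMeans(gains)        # average contribution over all orderings
print(round(lmg, 4))
print(c(sum.lmg = sum(lmg), full.R2 = r2(preds)))
```

The averaged contributions sum to the $R^2$ of the full model. Because the number of orderings grows as the factorial of the number of predictors, this brute-force enumeration is only practical for a small number of predictors.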
R package relaimpo
The R package relaimpo, developed by Groemping (2007), includes the lmg method and 7 other methods to calculate the relative importance of predictors.
An example
We now calculate the relative importance of the three predictors in the GPA example: high school GPA, SAT, and quality of recommendation letters. To do that, the function calc.relimp() is used. Before using the function, we have to fit the regression model first.
From the output, we can see that the total proportion of variance explained by the model with all three predictors is 39.97%. Of this, high school GPA contributes 0.172, SAT 0.177, and recommd 0.051. Note that the three numbers add up to 0.3997. We can also get each predictor's share of the overall $R^2$ by adding the option rela=TRUE in the function, which gives 0.430 for high school GPA, 0.443 for SAT, and 0.128 for recommd. Clearly, the relative importance of high school GPA and SAT is similar, whereas the relative importance of the quality of recommendation letters is lower.
> library('relaimpo')
Loading required package: MASS
Loading required package: boot
Loading required package: survey
Loading required package: grid
Loading required package: Matrix
Loading required package: survival

Attaching package: 'survival'

The following object is masked from 'package:boot':

    aml

Attaching package: 'survey'

The following object is masked from 'package:graphics':

    dotchart

Loading required package: mitools
This is the global version of package relaimpo.
If you are a non-US user, a version with the interesting additional metric pmvd is available
from Ulrike Groempings web site at prof.beuth-hochschule.de/groemping.

> usedata('gpa')
>
> gpa.model<-lm(c.gpa~h.gpa+SAT+recommd, data=gpa)
> calc.relimp(gpa.model)
Response variable: c.gpa
Total response variance: 0.5613402
Analysis based on 100 observations

3 Regressors:
h.gpa SAT recommd
Proportion of variance explained by model: 39.97%
Metrics are not normalized (rela=FALSE).

Relative importance metrics:

               lmg
h.gpa   0.17169723
SAT     0.17698341
recommd 0.05105477

Average coefficients for different model sizes:

                1X         2Xs        3Xs
h.gpa   0.56549057 0.481773394 0.37635109
SAT     0.00180201 0.001416298 0.00122693
recommd 0.17539410 0.065635789 0.02268425

> calc.relimp(gpa.model, rela=TRUE)
Response variable: c.gpa
Total response variance: 0.5613402
Analysis based on 100 observations

3 Regressors:
h.gpa SAT recommd
Proportion of variance explained by model: 39.97%
Metrics are normalized to sum to 100% (rela=TRUE).

Relative importance metrics:

              lmg
h.gpa   0.4295272
SAT     0.4427514
recommd 0.1277214

Average coefficients for different model sizes:

                1X         2Xs        3Xs
h.gpa   0.56549057 0.481773394 0.37635109
SAT     0.00180201 0.001416298 0.00122693
recommd 0.17539410 0.065635789 0.02268425
>
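The link between the unnormalized and the normalized metrics is simple arithmetic and can be checked by hand. The short sketch below copies the three lmg values from the calc.relimp() output and divides each by their sum; the object names lmg, total.r2 and rela are hypothetical.

```r
# lmg values copied from the calc.relimp() output
lmg <- c(h.gpa = 0.17169723, SAT = 0.17698341, recommd = 0.05105477)

total.r2 <- sum(lmg)          # total R-squared explained by the model
rela     <- lmg / total.r2    # each predictor's share of the explained variance

print(round(total.r2, 4))
print(round(rela, 4))
```

The sum matches the model $R^2$ of 39.97%, and the shares agree with the values reported under rela=TRUE.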
Bootstrap relative importance
For two predictors, after we get their relative importance measured by $R^2$, we might want to test whether one predictor is significantly more important than the other. However, unlike the situation for a t-test, it is difficult to derive an analytical test statistic for this comparison. Instead, the bootstrap can be used. The package relaimpo includes two functions for the task, boot.relimp() and booteval.relimp(). The first function conducts the bootstrap and the second one computes the confidence intervals.
From the output, we can see that the lower and upper bounds of the 95% CI for the $R^2=0.1717$ of high school GPA are 0.0852 and 0.2837. For SAT, the CI is [0.069, 0.3223] and for recommd, it is [0.0146, 0.1355]. Using the CIs, we can conduct a test. For example, since the interval for h.gpa covers the $R^2$ of SAT, there is no evidence of a difference in relative importance between the two predictors. On the other hand, neither the CI for h.gpa nor the CI for SAT covers the $R^2 = 0.0511$ of recommd. Therefore, the two predictors are statistically more important than the quality of recommendation letters.
The output also includes the CIs for the differences in the $R^2$ of any two predictors. For example, the difference in $R^2$ between h.gpa and SAT is -0.0053 with a CI [-0.1827, 0.1605]. Since the CI covers 0, the difference is not significant. In contrast, the difference between h.gpa and recommd is 0.1206 with a CI [0.0055, 0.2313], which does not cover 0, so the difference is statistically significant. The difference between SAT and recommd is 0.1259 with a CI [-0.0298, 0.2839], which covers 0, so it is not significant.
Note that the two ways of conducting the test do not necessarily lead to the same conclusion, as the comparison of SAT and recommd shows.
> library('relaimpo')
> usedata('gpa')
>
> gpa.model<-lm(c.gpa~h.gpa+SAT+recommd, data=gpa)
> bootresults<-boot.relimp(gpa.model, b=1000)
> ci<-booteval.relimp(bootresults, norank=T)
> ci
Response variable: c.gpa
Total response variance: 0.5613402
Analysis based on 100 observations

3 Regressors:
h.gpa SAT recommd
Proportion of variance explained by model: 39.97%
Metrics are not normalized (rela=FALSE).

Relative importance metrics:

               lmg
h.gpa   0.17169723
SAT     0.17698341
recommd 0.05105477

Average coefficients for different model sizes:

                1X         2Xs        3Xs
h.gpa   0.56549057 0.481773394 0.37635109
SAT     0.00180201 0.001416298 0.00122693
recommd 0.17539410 0.065635789 0.02268425


 Confidence interval information ( 1000 bootstrap replicates, bty= perc ):
Relative Contributions with confidence intervals:

                       Lower  Upper
            percentage 0.95   0.95
h.gpa.lmg   0.1717     0.0852 0.2837
SAT.lmg     0.1770     0.0690 0.3223
recommd.lmg 0.0511     0.0146 0.1355

CAUTION: Bootstrap confidence intervals can be somewhat liberal.

 Differences between Relative Contributions:

                                  Lower   Upper
                  difference 0.95 0.95    0.95
h.gpa-SAT.lmg     -0.0053         -0.1827 0.1605
h.gpa-recommd.lmg  0.1206     *    0.0055 0.2313
SAT-recommd.lmg    0.1259         -0.0298 0.2839

* indicates that CI for difference does not include 0.
CAUTION: Bootstrap confidence intervals can be somewhat liberal.
>
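To make the bootstrap logic concrete, the following base R sketch resamples cases with replacement and forms a 95% percentile interval for the difference in lmg between two predictors. It is a simplified illustration under stated assumptions, not a reproduction of what boot.relimp() does internally: the data are simulated, only two predictors are used so that lmg can be written in closed form, and the names lmg2, boot.diff, est and ci are hypothetical.

```r
# Simulated data (hypothetical; stands in for a real data set)
set.seed(2024)
n  <- 150
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 0.6 * x1 + 0.1 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

# Helper: R-squared of a fitted linear model
r2 <- function(formula, data) summary(lm(formula, data = data))$r.squared

# lmg for two predictors: average each predictor's contribution
# over the two possible orderings
lmg2 <- function(data) {
  full <- r2(y ~ x1 + x2, data)
  a    <- r2(y ~ x1, data)
  b    <- r2(y ~ x2, data)
  c(x1 = (a + (full - b)) / 2,
    x2 = (b + (full - a)) / 2)
}

est <- lmg2(dat)

# Nonparametric bootstrap: resample rows with replacement
B <- 500
boot.diff <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)
  s   <- lmg2(dat[idx, ])
  unname(s["x1"] - s["x2"])
})

# 95% percentile interval for the difference in lmg
ci <- quantile(boot.diff, c(0.025, 0.975))
print(round(est, 4))
print(round(ci, 4))
```

If the interval excludes 0, the two predictors differ in relative importance at the 5% level under this procedure. The sketch uses percentile intervals, the same interval type the relaimpo output reports as bty= perc.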
The relative importance with CI can also be plotted conveniently in R. The bars are the bootstrap CIs.
> library('relaimpo')
> usedata('gpa')
>
> gpa.model<-lm(c.gpa~h.gpa+SAT+recommd, data=gpa)
> bootresults<-boot.relimp(gpa.model, b=1000)
> ci<-booteval.relimp(bootresults, norank=T)
> plot(ci)
>
To cite the book, use:
Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.