Multilevel Regression
Multilevel data
Lee and Bryk (1989) analyzed a set of data to illustrate the use of multilevel modeling. The data set includes mathematics scores for senior-year high school students from 160 schools. For each student, information on her/his socioeconomic status (SES) is also available. For each school, data are available on the sector (private or public) of the school. In addition, a school-level variable called average SES (MSES) was created from the students' SES to measure the SES of each school.
There is a clear two-level structure in the data. At the first level, the student level, we have information on individual students. At the second level, the school level, we have information on schools. The students are nested within schools. Such data are called multilevel data.
Lee and Bryk school achievement data
As an example, we use a subset of data from the Lee and Bryk (1989) study. In total, there are \(n=7,185\) students at the first level. At the second level, there are \(J=160\) schools. The data are saved in the file LeeBryk.csv. A subset of the data is shown below. sector=0 indicates a public school and sector=1 indicates a private Catholic high school.
> usedata('LeeBryk')
> head(LeeBryk)
  schoolid  math   ses  mses sector
1        1  5.88 -1.53 -0.43      0
2        1 19.71 -0.59 -0.43      0
3        1 20.35 -0.53 -0.43      0
4        1  8.78 -0.67 -0.43      0
5        1 17.90 -0.16 -0.43      0
6        1  4.58  0.02 -0.43      0
>
Source of variances
We first look at the math variable alone. The variance of the variable math is 47.31. This variance can be viewed as the residual variance after removing the mean, that is, the residual variance from fitting a regression model with an intercept only. Such a model assumes that every student across all schools has the same mean score, or intercept, such that
\[math_{ij} = \beta_0 + e_{ij}.\]
Given there are 160 schools, it is more reasonable to believe that the intercept (or average math score) is different for each school. Using a regression model, we then have
\[math_{ij}=\beta_{j}+e_{ij},\]
where $\beta_j$ is the average math score or intercept for each school $j$.
Because the intercept differs across schools, it has its own variation, or variance, which can also be modeled using a regression model with an intercept only:
\[\beta_{j}=\beta_{0}+v_{j}\]
where $v_j$ is the deviation of the average score of school $j$ from the overall average $\beta_0$.
With such a specification, the variance of math can be expressed as the sum of the variance of the residuals \(e_{ij}\), the within-school variance, and the variance of \(v_{j}\), the between-school variance. Specifically, the variance of math decomposes approximately as between-school variance + within-school variance = \(8.614 + 39.148\), with the two components estimated by the multilevel model below.
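As a quick check, the total variance and a crude moment-based decomposition can be computed directly from the data. This is a minimal sketch assuming the LeeBryk data set has been loaded as above; the moment-based components will not exactly match the model-based estimates used here.

# total variance of math (about 47.31)
var(LeeBryk$math)

# crude decomposition using school means (moment-based, not from the multilevel model)
school.mean <- ave(LeeBryk$math, LeeBryk$schoolid)  # each student's school mean
var(school.mean)                   # rough between-school variance
var(LeeBryk$math - school.mean)    # rough within-school variance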
Combining the two regressions, we have a two-level regression model. Note that the model can be written as
\[math_{ij}=\beta_{0}+v_{j}+e_{ij}.\]
The model is called a mixed-effects model, in which \(\beta_{0}\) is called the fixed effect, the average intercept for all schools, and \(v_{j}\) is called the random effect.
Use of R package lme4
A multilevel model or a mixed-effects model can be estimated using the R package lme4. In particular, the function lmer() should be used. The function estimates not only the fixed effect $\beta_0$ but also the random effects $v_{j}$. For the model above, the call takes the form lmer(math ~ 1 + (1|schoolid), data=LeeBryk). In the formula, the first "1" requests a fixed effect to be estimated as the overall intercept, and (1|schoolid) specifies a random component in the intercept that varies across the schools identified by schoolid.
The output includes the Fixed effects, for which a t-test of significance is also conducted. The Random effects part includes the variance of the residuals ($e_{ij}$) and the variance of the random intercept ($v_j$).
> library(lme4)
Loading required package: Matrix
> usedata('LeeBryk')
> 
> m2<-lmer(math ~ 1 + (1|schoolid), data=LeeBryk)
> summary(m2)
Linear mixed model fit by REML ['lmerMod']
Formula: math ~ 1 + (1 | schoolid)
   Data: LeeBryk

REML criterion at convergence: 47116.7

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-3.06294 -0.75337  0.02658  0.76060  2.74221 

Random effects:
 Groups   Name        Variance Std.Dev.
 schoolid (Intercept)  8.614   2.935   
 Residual             39.148   6.257   
Number of obs: 7185, groups:  schoolid, 160

Fixed effects:
            Estimate Std. Error t value
(Intercept)  12.6374     0.2444   51.71
>
The random effects $v_j$ can be obtained using the function ranef(). With them, the intercept or average math score for each school can be calculated and also plotted as shown below.
> library(lme4)
Loading required package: Matrix
> usedata('LeeBryk')
> 
> m2<-lmer(math ~ 1 + (1|schoolid), data=LeeBryk)
Warning message:
'rBind' is deprecated. Since R version 3.2.0, base's rbind() should work fine with S4 objects
> bi<-ranef(m2)
> 
> school.intercept <- bi$schoolid + 12.6374
> plot(school.intercept[,1], type='h',
+      xlab='School ID', ylab='Intercept')
> abline(h=12.6374)
> 
> hist(school.intercept[,1])
>
Intraclass correlation coefficient (ICC)
The intraclass correlation coefficient is defined as the ratio of the variance explained by the multilevel structure, the between-school variance, to the total variance of the outcome variable. For the example above, we have the intraclass correlation coefficient
\[\tau=\frac{8.614}{8.614+39.148}=0.18.\]
In social science research, the ICC often ranges from 0.05 to 0.25. When the ICC is large, the between-class variance cannot be ignored and a multilevel model is preferred. It has been suggested that a multilevel model should be considered when the ICC is larger than 0.1.
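The ICC can also be computed from the variance components of the fitted model. The following is a brief sketch assuming the model m2 has been estimated as shown earlier.

# extract the variance components of m2 as a data frame
vc <- as.data.frame(VarCorr(m2))
# between-school variance divided by total variance
icc <- vc$vcov[vc$grp == "schoolid"] / sum(vc$vcov)
icc   # about 8.614/(8.614 + 39.148) = 0.18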
Explain/model the difference
We have shown that the average score, or intercept, differs across schools. What causes the differences? School-level covariates, such as the average SES (mses) or whether the school is private or public (sector), can be used to explore potential factors related to the differences. Using a two-level model, we can specify the model as
\begin{eqnarray*}math_{ij} & = & \beta_{j}+e_{ij}\\ \beta_{j} & = & \beta_{0}+\beta_{1}mses_{j}+\beta_{2}sector_{j}+v_{j}. \end{eqnarray*}
As a mixed-effects model, it can be written as
\[math_{ij} = \beta_{0}+\beta_{1}mses_{j}+\beta_{2}sector_{j}+v_{j}+e_{ij},\]
where $\beta_k, k=0,1,2$, are fixed-effects parameters. Based on the output of lmer() below, both mses and sector are significant given the t-values in the fixed-effects table. Note that to get the associated p-values, the R package lmerTest can be used.
> library(lme4)
Loading required package: Matrix
> library(lmerTest)

Attaching package: 'lmerTest'

The following object is masked from 'package:lme4':

    lmer

The following object is masked from 'package:stats':

    step

> usedata('LeeBryk')
> 
> m3<-lmer(math~1 + mses + sector + (1|schoolid), data=LeeBryk)
> summary(m3)
Linear mixed model fit by REML 
t-tests use Satterthwaite approximations to degrees of freedom [lmerMod]
Formula: math ~ 1 + mses + sector + (1 | schoolid)
   Data: LeeBryk

REML criterion at convergence: 46946.3

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-3.08323 -0.75079  0.01932  0.76659  2.78831 

Random effects:
 Groups   Name        Variance Std.Dev.
 schoolid (Intercept)  2.312   1.520   
 Residual             39.161   6.258   
Number of obs: 7185, groups:  schoolid, 160

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  12.0994     0.1986 160.5500  60.919  < 2e-16 ***
mses          5.3336     0.3685 151.0000  14.475  < 2e-16 ***
sector        1.2196     0.3058 149.6000   3.988 0.000104 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
       (Intr) mses  
mses    0.246       
sector -0.698 -0.357
> 
> anova(m3)
Analysis of Variance Table of type III with Satterthwaite approximation for degrees of freedom
       Sum Sq Mean Sq NumDF  DenDF F.value    Pr(>F)    
mses   8205.1  8205.1     1  151.0 209.522 < 2.2e-16 ***
sector  622.9   622.9     1  149.6  15.907 0.0001038 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
Multilevel regression
The variable math can be predicted by certain variables such as individual SES. If we ignore the multilevel structure, we can fit a simple regression as
\[math_{ij}=\beta_{0}+\beta_{1}ses_{ij}+e_{ij}.\]
This assumes that the relationship between math and ses, that is, the slope, is the same for all schools. However, as we showed earlier, the intercepts differ across schools, so the slopes can also differ. In this case, the model should be written as
\[ math_{ij}=\beta_{0j}+\beta_{1j}ses_{ij}+e_{ij},\]
where \(\beta_{0j}\) and \(\beta_{1j}\) are the intercept and slope for the $j$th school. We can also predict the intercept and slope using school-level covariates such as the sector of the school and the average SES of the school. Then, we would have the two-level model shown below:
\begin{eqnarray*} math_{ij} & = & \beta_{0j}+\beta_{1j}ses_{ij}+e_{ij}\\ \beta_{0j} & = & \gamma_{0}+\gamma_{1}mses_{j}+\gamma_{2}sector_{j}+v_{0j}\\ \beta_{1j} & = & \gamma_{3}+\gamma_{4}mses_{j}+\gamma_{5}sector_{j}+v_{1j}.\end{eqnarray*}
The above two-level model can again be written as a mixed-effects model
\begin{eqnarray} math_{ij} & = & \beta_{0j}+\beta_{1j}ses_{ij}+e_{ij}\nonumber \\ & = & \gamma_{0}+\gamma_{1}mses_{j}+\gamma_{2}sector_{j}+v_{0j}\nonumber \\ & & +(\gamma_{3}+\gamma_{4}mses_{j}+\gamma_{5}sector_{j}+v_{1j})*ses_{ij}+e_{ij}. \end{eqnarray}
To use the R package for model estimation, we first need to plug the second-level equations into the first-level equation to obtain the mixed model. In the mixed model, the $\gamma$s are fixed effects and $v_{0j}$ and $v_{1j}$ are random effects. The variances of $v_{0j}$ and $v_{1j}$ and their correlation can be obtained, as can the variance of $e_{ij}$. These variance parameters are called random-effects parameters, while the $\gamma$s are called fixed-effects parameters. In R, each term in the mixed-effects model needs to be specified except for $e_{ij}$. The random effects are specified using the notation |, where the terms before it are the random-effects terms and the term after it is the grouping, or class, variable.
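For example, a model with a random intercept and a random slope of ses but without the school-level covariates could be specified as below. This call is shown only to illustrate the notation; it is not the model fitted next.

# random intercept and random slope of ses, grouped by schoolid
m.sketch <- lmer(math ~ 1 + ses + (1 + ses | schoolid), data = LeeBryk)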
For the math example, the R input and output are shown below.
> library(lme4)
Loading required package: Matrix
> library(lmerTest)

Attaching package: 'lmerTest'

The following object is masked from 'package:lme4':

    lmer

The following object is masked from 'package:stats':

    step

> usedata('LeeBryk')
> 
> m4<-lmer(math~1 + mses + sector +
+          ses + ses*mses + ses*sector +
+          (1 + ses|schoolid), data=LeeBryk)
> summary(m4)
Linear mixed model fit by REML 
t-tests use Satterthwaite approximations to degrees of freedom [lmerMod]
Formula: math ~ 1 + mses + sector + ses + ses * mses + ses * sector +  
    (1 + ses | schoolid)
   Data: LeeBryk

REML criterion at convergence: 46505.1

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-3.14225 -0.72479  0.01375  0.75506  2.98329 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 schoolid (Intercept)  2.40703 1.5515       
          ses          0.01443 0.1201   1.00
 Residual             36.75766 6.0628       
Number of obs: 7185, groups:  schoolid, 160

Fixed effects:
            Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)  12.1026     0.2028  166.0000  59.678  < 2e-16 ***
mses          3.3225     0.3885  178.0000   8.553 5.33e-15 ***
sector        1.1882     0.3081  148.0000   3.857 0.000171 ***
ses           2.9057     0.1483 4352.0000  19.595  < 2e-16 ***
mses:ses      0.8476     0.2717 3546.0000   3.119 0.001829 ** 
sector:ses   -1.5793     0.2246 4347.0000  -7.033 2.34e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
           (Intr) mses   sector ses    mss:ss
mses        0.213                            
sector     -0.676 -0.345                     
ses         0.077 -0.146 -0.065              
mses:ses   -0.143  0.179 -0.082  0.279       
sector:ses -0.062 -0.081  0.094 -0.679 -0.357
> anova(m4)
Analysis of Variance Table of type III with Satterthwaite approximation for degrees of freedom
            Sum Sq Mean Sq NumDF  DenDF F.value    Pr(>F)    
mses        2688.9  2688.9     1  178.5   73.15 5.329e-15 ***
sector       546.8   546.8     1  148.3   14.88 0.0001707 ***
ses        14113.4 14113.4     1 4351.7  383.96 < 2.2e-16 ***
mses:ses     357.6   357.6     1 3546.5    9.73 0.0018288 ** 
sector:ses  1818.0  1818.0     1 4346.7   49.46 2.342e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
The parameters in the table of Fixed effects give the estimates of the $\gamma$s. Hence, we have
\begin{eqnarray*} \beta_{0j} & = & 12.10+3.32mses_{j}+1.19sector_{j}+v_{0j}\\ \beta_{1j} & = & 2.91+0.85mses_{j}-1.58sector_{j}+v_{1j} \end{eqnarray*}
Note that based on the F-tests, both the school average SES (mses) and the sector of the school are significant predictors of both the intercept and the slope.
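The same estimates can be extracted programmatically with the fixef() function in lme4. The snippet below is a small sketch assuming m4 has been fitted as above.

gamma <- fixef(m4)     # named vector of fixed-effects estimates
gamma["(Intercept)"]   # 12.10, the intercept in the equation for beta_0j
gamma["ses"]           # 2.91, the intercept in the equation for beta_1j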
The parameters in the table of Random effects give the variance estimates. Therefore, the variance of $e_{ij}$ is 36.76, the variance of $v_{0j}$ is 2.41, and the variance of $v_{1j}$ is 0.014. The correlation between $v_{0j}$ and $v_{1j}$ is almost 1. Individual values of $v_{0j}$ and $v_{1j}$ can be obtained using the function ranef().
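These variance components can also be pulled from the fitted model directly. The lines below are a brief sketch assuming m4 from above.

VarCorr(m4)               # variances and correlation of v0j and v1j, plus the residual variance
head(ranef(m4)$schoolid)  # v0j and v1j for the first few schools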
Note that with the fixed effects and random effects, we can calculate the intercept $\beta_{0j}$ and slope $\beta_{1j}$ for each school. Then, the individual regression lines can be plotted together with the scatterplot of all the data.
> library(lme4)
Loading required package: Matrix
> library(lmerTest)

Attaching package: 'lmerTest'

The following object is masked from 'package:lme4':

    lmer

The following object is masked from 'package:stats':

    step

> usedata('LeeBryk')
> attach(LeeBryk)
> 
> m4<-lmer(math~1 + mses + sector +
+          ses + ses*mses + ses*sector +
+          (1 + ses|schoolid), data=LeeBryk)
Warning message:
'rBind' is deprecated. Since R version 3.2.0, base's rbind() should work fine with S4 objects
> 
> ## calculate betas for each school from the fixed and random effects
> # random effects
> rancof<-ranef(m4)$schoolid
> beta0<-beta1<-rep(0, 160)
> 
> for (i in 1:160){
+   index<-min(which(schoolid==i))
+   beta0[i] <- 12.1026 + 3.3225*mses[index] +
+     1.1882*sector[index] + rancof[i,1]
+   beta1[i] <- 2.9057 + 0.8476*mses[index] -
+     1.5793*sector[index] + rancof[i,2]
+ }
> 
> hist(beta0)
> hist(beta1)
> 
> ## plot the relationship for each school
> plot(ses, math)
> 
> for (i in 1:160){
+   abline(beta0[i], beta1[i])
+ }
>