Multilevel Regression

Multilevel data

Lee and Bryk (1989) analyzed a data set to illustrate the use of multilevel modeling. The data set includes mathematics scores for senior-year high school students from 160 schools. For each student, information on his or her social and economic status (SES) is also available. For each school, data are available on the sector (private or public) of the school. In addition, a school-level variable called average SES (MSES) was created from the students' SES to measure the SES of each school.

There is a clear two-level structure in the data. At the first level, the student level, we have information on individual students. At the second level, the school level, we have information on schools. The students are nested within schools. This kind of data is called multilevel data.

Lee and Bryk school achievement data

As an example, we use a subset of the data from the Lee and Bryk (1989) study. In total, there are \(n=7,185\) students at the first level and \(J=160\) schools at the second level. The data are saved in the file LeeBryk.csv. A subset of the data is shown below; sector=0 indicates a public school and sector=1 indicates a private Catholic high school.


>>> import pandas as pd
>>> LB = pd.read_csv('https://advstats.psychstat.org/data/LeeBryk.csv')
>>> LB
      schoolid   math   ses  mses  sector
0            1   5.88 -1.53 -0.43       0
1            1  19.71 -0.59 -0.43       0
2            1  20.35 -0.53 -0.43       0
3            1   8.78 -0.67 -0.43       0
4            1  17.90 -0.16 -0.43       0
...        ...    ...   ...   ...     ...
7180       160  20.40  1.51  0.63       1
7181       160  14.79 -0.04  0.63       1
7182       160  19.64  1.33  0.63       1
7183       160  16.24 -0.01  0.63       1
7184       160  22.73  0.79  0.63       1

[7185 rows x 5 columns]

Source of variances

We first look at the variable math alone. The variance of math is 47.31. This variance can be viewed as the residual variance after removing the mean, or equivalently, the residual variance from fitting an intercept-only regression model. Such a model assumes that every student across all schools has the same mean score, or intercept, such that

\[math_{ij} = \beta_0 + e_{ij}.\]
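To see this concretely, an intercept-only regression simply estimates the mean of the outcome, and its residual variance equals the variance of the outcome. Below is a minimal sketch using simulated scores (the data are illustrative, not the LeeBryk data).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated scores (illustrative, not the LeeBryk data)
rng = np.random.default_rng(0)
d = pd.DataFrame({'math': rng.normal(12.7, 6.9, size=500)})

# An intercept-only regression estimates the mean of math
fit = smf.ols('math ~ 1', data=d).fit()

print(fit.params['Intercept'])   # equals d['math'].mean()
print(fit.resid.var(ddof=1))     # equals d['math'].var(ddof=1)
```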

Given that there are 160 schools, it is more reasonable to believe that the intercept, or average math score, differs across schools. In regression form, we then have

\[math_{ij}=\beta_{j}+e_{ij},\]

where $\beta_j$ is the average math score or intercept for each school $j$.

Because the intercepts differ, they have their own variation, or variance, across schools, which can again be captured by an intercept-only regression model

\[\beta_{j}=\beta_{0}+v_{j}\]

where $v_j$ is the deviation of the average score of school $j$ from the overall average $\beta_0$.

With such a specification, the variance of math can be decomposed into the variance of \(v_{j}\), the between-school variance, and the variance of the residuals \(e_{ij}\), the within-school variance. Specifically, the variance of math is \(8.615 + 39.148\) = between-school variance + within-school variance.
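The decomposition can be verified directly: the total sum of squares around the grand mean splits exactly into a between-school part and a within-school part. The sketch below uses simulated two-level data whose variance components are illustrative assumptions (chosen to be close to the values reported above), not the LeeBryk estimates.

```python
import numpy as np
import pandas as pd

# Simulated two-level data (illustrative variance components)
rng = np.random.default_rng(1)
J, n_per = 160, 45
school = np.repeat(np.arange(J), n_per)
v = rng.normal(0, np.sqrt(8.6), size=J)           # school-level deviations
e = rng.normal(0, np.sqrt(39.1), size=J * n_per)  # student-level residuals
d = pd.DataFrame({'school': school,
                  'math': 12.6 + v[school] + e})

grand = d['math'].mean()
school_mean = d.groupby('school')['math'].transform('mean')

n = len(d)
between = ((school_mean - grand) ** 2).sum() / n
within = ((d['math'] - school_mean) ** 2).sum() / n
total = ((d['math'] - grand) ** 2).sum() / n

# ANOVA identity: total = between + within (exactly)
print(between, within, total)
```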

Combining the two regressions, we have a two-level regression model, which can be written as

\[math_{ij}=\beta_{0}+v_{j}+e_{ij}.\]

The model is called a mixed-effects model, in which \(\beta_{0}\), the average intercept across all schools, is called the fixed effect, and \(v_{j}\) is called the random effect.

Use of the Python library statsmodels (the mixedlm function)

The mixed-effects model can be estimated using the Python library statsmodels; in particular, the function mixedlm() should be used. The function estimates not only the fixed effect $\beta_0$ but also the random effects $v_{j}$. The model is specified using a formula.

The input and output of the analysis are shown below. In the output, we can find the following information:

  • The fixed effect $\beta_0$, labeled "Intercept" in the output: 12.637. A z-test is conducted for the significance of the fixed effect.
  • The variance of the random effects, $var(v_j)$, labeled "Group Var": 8.615.
  • The residual variance $var(e_{ij})$, labeled "Scale": 39.148.

>>> ## Data
>>> import pandas as pd
>>> LB = pd.read_csv('https://advstats.psychstat.org/data/LeeBryk.csv')
>>> 
>>> ## use the statsmodels library
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> 
>>> ## specify the model
>>> md = smf.mixedlm("math ~ 1", data=LB, groups=LB["schoolid"])
>>> 
>>> ## fit the model
>>> results = md.fit()
>>> 
>>> ## display the results
>>> print(results.summary())
          Mixed Linear Model Regression Results
=========================================================
Model:            MixedLM Dependent Variable: math       
No. Observations: 7185    Method:             REML       
No. Groups:       160     Scale:              39.1477    
Min. group size:  14      Log-Likelihood:     -23558.3435
Max. group size:  67      Converged:          Yes        
Mean group size:  44.9                                   
----------------------------------------------------------
           Coef.   Std.Err.    z     P>|z|  [0.025  0.975]
----------------------------------------------------------
Intercept  12.637     0.244  51.706  0.000  12.158  13.116
Group Var   8.615     0.174                               
=========================================================

The random effects $v_j$ can be obtained from the random_effects attribute of the fitted model. With them, the intercept, or average math score, for each school can be calculated and plotted, as shown below.


>>> ## Data
>>> import pandas as pd
>>> LB = pd.read_csv('https://advstats.psychstat.org/data/LeeBryk.csv')
>>> 
>>> ## use the statsmodels library
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> 
>>> ## specify the model
>>> md = smf.mixedlm("math ~ 1", data=LB, groups=LB["schoolid"])
>>> 
>>> ## fit the model
>>> results = md.fit()
>>> 
>>> # Extract random effects
>>> random_effects = results.random_effects
>>> df = pd.DataFrame.from_dict(random_effects, orient='index')
>>> df
        Group
1   -2.663728
2    0.739361
3   -4.568097
4    2.948594
5    0.493920
..        ...
156  2.426226
157 -2.080346
158  0.829436
159 -1.337513
160  2.067075

[160 rows x 1 columns]
>>> 
>>> # calculate the estimated means for each group
>>> df['Group'] = df['Group'] + 12.637
>>> 
>>> # visualize it
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> 
>>> sns.histplot(df['Group'], bins=15, kde=True, color='skyblue', edgecolor='black')

>>> plt.xlabel('Group mean')
Text(0.5, 0, 'Group mean')
>>> plt.ylabel('Frequency')
Text(0, 0.5, 'Frequency')
>>> 
>>> plt.savefig('hist.svg', format='svg')

Intraclass correlation coefficient (ICC)

The intraclass correlation coefficient is defined as the ratio of the variance explained by the multilevel structure (the between-class variance) to the total variance of the outcome variable. For the example above, the intraclass correlation coefficient is

\[\tau=\frac{8.615}{8.615+39.148}=0.18.\]

In social science, the ICC often ranges from 0.05 to 0.25. When the ICC is large, the between-class variance cannot be ignored, and therefore a multilevel model is preferred. It has been suggested that a multilevel model should be considered whenever ICC > 0.1.
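As a small helper, the ICC can be computed directly from the two variance components; plugging in the estimates from the intercept-only mixed model above gives about 0.18.

```python
def icc(between_var, within_var):
    """Intraclass correlation: proportion of the total variance
    that lies between groups."""
    return between_var / (between_var + within_var)

# Variance components from the intercept-only mixed model
print(icc(8.615, 39.148))   # about 0.18
```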

Explain/model the difference

We have shown that the average score, or intercept, differs across schools. What causes the differences? School-level covariates, such as the average SES or whether the school is private or public, can be used to explore potential factors related to these differences. Using a two-level model, we can specify

\begin{eqnarray*}math_{ij} & = & \beta_{j}+e_{ij}\\ \beta_{j} & = & \beta_{0}+\beta_{1}mses_{j}+\beta_{2}sector_{j}+v_{j}. \end{eqnarray*}

As a mixed-effects model, it can be written as

\[math_{ij} = \beta_{0}+\beta_{1}mses_{j}+\beta_{2}sector_{j}+v_{j}+e_{ij},\]

where $\beta_k, k=0,1,2$, are fixed-effects parameters. Based on the output of the analysis, both mses and sector are significant predictors given the z-values in the results table.


>>> ## Data
>>> import pandas as pd
>>> LB = pd.read_csv('https://advstats.psychstat.org/data/LeeBryk.csv')
>>> 
>>> ## use the statsmodels library
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> 
>>> ## specify the model
>>> md = smf.mixedlm("math ~ mses + sector", data=LB, groups=LB["schoolid"])
>>> 
>>> ## fit the model
>>> results = md.fit()
>>> 
>>> ## display the results
>>> print(results.summary())
          Mixed Linear Model Regression Results
=========================================================
Model:            MixedLM Dependent Variable: math       
No. Observations: 7185    Method:             REML       
No. Groups:       160     Scale:              39.1609    
Min. group size:  14      Log-Likelihood:     -23473.1517
Max. group size:  67      Converged:          Yes        
Mean group size:  44.9                                   
----------------------------------------------------------
           Coef.   Std.Err.    z     P>|z|  [0.025  0.975]
----------------------------------------------------------
Intercept  12.099     0.199  60.918  0.000  11.710  12.489
mses        5.334     0.369  14.472  0.000   4.611   6.056
sector      1.220     0.306   3.988  0.000   0.620   1.819
Group Var   2.312     0.060                               
=========================================================

Multilevel regression

The variable math can be predicted by certain variables such as individual SES. If we ignore the multilevel structure, we can fit a simple regression as

\[math_{ij}=\beta_{0}+\beta_{1}ses_{ij}+e_{ij}.\]

This assumes that the relationship between math and ses, i.e., the slope, is the same across schools. However, as we showed earlier, the intercepts differ across schools, so the slopes can also differ. In that case, the model should be written as

\[ math_{ij}=\beta_{0j}+\beta_{1j}ses_{ij}+e_{ij},\]

where \(\beta_{0j}\) and \(\beta_{1j}\) are the intercept and slope for the $j$th school. We can also predict the intercept and slope using the school-level covariates, such as the sector of the school and the average SES of the school. Then, we have the two-level model shown below:

\begin{eqnarray*} math_{ij} & = & \beta_{0j}+\beta_{1j}ses_{ij}+e_{ij}\\ \beta_{0j} & = & \gamma_{0}+\gamma_{1}mses_{j}+\gamma_{2}sector_{j}+v_{0j}\\ \beta_{1j} & = & \gamma_{3}+\gamma_{4}mses_{j}+\gamma_{5}sector_{j}+v_{1j}. \end{eqnarray*}

The above two-level model can again be written as a mixed-effects model

\begin{eqnarray} math_{ij} & = & \beta_{0j}+\beta_{1j}ses_{ij}+e_{ij}\nonumber \\ & = & \gamma_{0}+\gamma_{1}mses_{j}+\gamma_{2}sector_{j}+v_{0j}\nonumber \\ & & +(\gamma_{3}+\gamma_{4}mses_{j}+\gamma_{5}sector_{j}+v_{1j})\times ses_{ij}+e_{ij}. \end{eqnarray}

To estimate the model, we first substitute the second-level equations into the first-level equation to obtain a mixed model. In the mixed model, the $\gamma$s are fixed effects, and $v_{0j}$ and $v_{1j}$ are random effects. The variances of $v_{0j}$ and $v_{1j}$ and their covariance can also be obtained, along with the variance of $e_{ij}$. These variance parameters are called random-effects parameters, while the $\gamma$s are called fixed-effects parameters. In Python, each term in the mixed-effects model needs to be specified except for $e_{ij}$. The random effects are specified using the option re_formula="~1 + ses", where 1 and ses indicate that the intercept and the coefficient of ses vary across the grouping variable, here the school.

For the math example, the Python input and output are shown below.


>>> ## Data
>>> import pandas as pd
>>> LB = pd.read_csv('https://advstats.psychstat.org/data/LeeBryk.csv')
>>> 
>>> ## use the statsmodels library
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> 
>>> ## specify the model
>>> md = smf.mixedlm("math ~ mses + sector + ses + mses*ses + sector*ses", 
...                  data=LB, groups=LB["schoolid"],
...                  re_formula="~1 + ses")
>>> 
>>> ## fit the model
>>> results = md.fit()
>>> 
>>> ## display the results
>>> print(results.summary())
          Mixed Linear Model Regression Results
==========================================================
Model:             MixedLM Dependent Variable: math       
No. Observations:  7185    Method:             REML       
No. Groups:        160     Scale:              36.7577    
Min. group size:   14      Log-Likelihood:     -23252.5424
Max. group size:   67      Converged:          Yes        
Mean group size:   44.9                                   
----------------------------------------------------------
                Coef.  Std.Err.   z    P>|z| [0.025 0.975]
----------------------------------------------------------
Intercept       12.103    0.203 59.648 0.000 11.705 12.500
mses             3.322    0.390  8.518 0.000  2.558  4.087
sector           1.188    0.308  3.856 0.000  0.584  1.792
ses              2.906    0.148 19.581 0.000  2.615  3.197
mses:ses         0.848    0.274  3.097 0.002  0.311  1.384
sector:ses      -1.579    0.225 -7.024 0.000 -2.020 -1.139
Group Var        2.407    0.063                           
Group x ses Cov  0.185    0.034                           
ses Var          0.014    0.032                           
==========================================================

The parameters in the fixed-effects part of the table give the estimates of the $\gamma$s. Hence, we have

\begin{eqnarray*} \beta_{0j} & = & 12.10+3.32mses_{j}+1.19sector_{j}+v_{0j}\\ \beta_{1j} & = & 2.91+0.85mses_{j}-1.58sector_{j}+v_{1j} \end{eqnarray*}

Note that based on the z-tests, both mses and sector are significant predictors of both the intercept and the slope.

The random-effects parameters are also given in the output. Here, the variance of $e_{ij}$ is 36.76, the variance of $v_{0j}$ is 2.41, and the variance of $v_{1j}$ is 0.014. The covariance between $v_{0j}$ and $v_{1j}$ is 0.185. Individual values of $v_{0j}$ and $v_{1j}$ can be obtained using the random_effects attribute of the fitted model.

Note that with the fixed effects and random effects, we can calculate the intercept $\beta_{0j}$ and slope $\beta_{1j}$ for each school. Then, the individual regression lines can be plotted together with the scatterplot of all the data.


>>> ## Data
>>> import pandas as pd
>>> LB = pd.read_csv('https://advstats.psychstat.org/data/LeeBryk.csv')
>>> 
>>> ## use the statsmodels library
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> 
>>> ## specify the model
>>> md = smf.mixedlm("math ~ mses + sector + ses + mses*ses + sector*ses", 
...                  data=LB, groups=LB["schoolid"],
...                  re_formula="~1 + ses")
>>> 
>>> ## fit the model
>>> results = md.fit()
>>> 
>>> ## display the results
>>> print(results.summary())
          Mixed Linear Model Regression Results
==========================================================
Model:             MixedLM Dependent Variable: math       
No. Observations:  7185    Method:             REML       
No. Groups:        160     Scale:              36.7577    
Min. group size:   14      Log-Likelihood:     -23252.5424
Max. group size:   67      Converged:          Yes        
Mean group size:   44.9                                   
----------------------------------------------------------
                Coef.  Std.Err.   z    P>|z| [0.025 0.975]
----------------------------------------------------------
Intercept       12.103    0.203 59.648 0.000 11.705 12.500
mses             3.322    0.390  8.518 0.000  2.558  4.087
sector           1.188    0.308  3.856 0.000  0.584  1.792
ses              2.906    0.148 19.581 0.000  2.615  3.197
mses:ses         0.848    0.274  3.097 0.002  0.311  1.384
sector:ses      -1.579    0.225 -7.024 0.000 -2.020 -1.139
Group Var        2.407    0.063                           
Group x ses Cov  0.185    0.034                           
ses Var          0.014    0.032                           
==========================================================

>>> 
>>> # Extract random effects
>>> random_effects = results.random_effects
>>> df = pd.DataFrame.from_dict(random_effects, orient='index')
>>> #df
>>> 
>>> ## get means for mses and sector
>>> group_means = LB.groupby('schoolid').mean()
>>> #group_means
>>> 
>>> ## calculate the intercept and slope for each school
>>> beta0 = 12.1026 + 3.3225*group_means['mses'] + 1.1882*group_means['sector']+ df['Group']
>>> beta1 = 2.9057 + 0.8476*group_means['mses'] - 1.5793*group_means['sector'] + df['ses']
>>> #beta0
>>> #beta1
>>> 
>>> ## now plot the lines
>>> 
>>> # Define x values (minimum to maximum of training)
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> 
>>> x = np.linspace(np.min(LB['ses']), np.max(LB['ses']), 100)
>>> 
>>> # Create the plot
>>> _=plt.figure(figsize=(8, 6))
>>> 
>>> # Plot the lines
>>> for i in range(160):
...   y = beta0.iloc[i] + beta1.iloc[i]*x
...   _=plt.plot(x, y, color='blue', linewidth=0.5)
... 
>>> # Add labels and title
>>> plt.xlabel('ses')
Text(0.5, 0, 'ses')
>>> plt.ylabel('Math score')
Text(0, 0.5, 'Math score')
>>> 
>>> # Show the plot
>>> plt.savefig('multilevel.svg', format='svg')
>>>             
>>> plt.show()

To cite the book, use: Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.