Regression with Categorical Predictors

In the variable selection example, we had a predictor, race, with three categories: 1 = white, 2 = black, 3 = other. In the previous example, we blindly treated it as continuous by taking the numbers literally. This can cause problems. For example, when we interpret a regression slope, we say how much the outcome changes for a one-unit change in the predictor. Although numerically the change from 1 to 2 is the same as the change from 2 to 3, it does not make sense at all to compare the actual race categories this way. Therefore, although the categories are coded using numerical values, they should be treated as discrete values. Such variables are called nominal variables. A typical example is ANOVA, where the different levels of a factor are treated as discrete values. In regression, such variables need special treatment for valid statistical inference. We show how to conduct such a regression analysis through an example.

An example

In this example, we study the mid-career median salary of college graduates. The data set college.csv includes information on salary and college background. Specifically, there are 6 variables in the data set: id, name of the college, mid-career median salary of graduates, cost of study, whether the college is public (1) or private (0), and the location of the college: 1, Southern colleges; 2, Midwestern; 3, Northeastern; and 4, Western. A subset of the data as well as some summary information is given below. In total, there are 85 colleges in the data set and 50 of them are private. Clearly, the variables public and location should be treated as categorical variables. Using this example, we study the factors related to median salary.

> usedata('college')
> head(college)
  id                                        name salary   cost public location
1  1 Massachusetts Institute of Technology (MIT) 119000 189300      0        3
2  2                          Harvard University 121000 189600      0        3
3  3                           Dartmouth College 123000 188400      0        3
4  4                        Princeton University 123000 188700      0        3
5  5                             Yale University 110000 194200      0        3
6  6                    University of Notre Dame 112000 181900      0        2
> attach(college)
> table(public)
public
 0  1 
50 35 
> table(location)
location
 1  2  3  4 
20 19 25 21 
> 

A naive analysis

Suppose we simply treat the variables public and location as continuous and fit a multiple regression model directly. We would then get the results shown below. Note that in the code, salary is rescaled to thousands of dollars. The fitted regression model is

\[ salary = 105.48 - 11.679*public - 1.869*location. \]

Using the typical way to interpret regression coefficients, we would say (1) when public = 0 and location = 0, the average salary is 105.48 thousand dollars; (2) when public changes from 0 to 1, the salary decreases by 11.679; and (3) when location increases by 1, the salary decreases by 1.869. Clearly, (1) and (3) do not make sense at all even though (2) seems to be fine: no college has location = 0, and a one-unit change in location is not a meaningful quantity.

> usedata('college'); attach(college)
> salary <- salary / 1000
> summary(lm(salary ~ public + location))

Call:
lm(formula = salary ~ public + location)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.1028  -6.5626  -0.2319   6.1280  26.6758 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  105.480      2.905  36.307  < 2e-16 ***
public       -11.679      2.270  -5.145 1.79e-06 ***
location      -1.869      1.015  -1.842   0.0691 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.27 on 82 degrees of freedom
Multiple R-squared:  0.278,	Adjusted R-squared:  0.2604 
F-statistic: 15.78 on 2 and 82 DF,  p-value: 1.588e-06

> 

By fitting the above regression model, we assume that each predictor has equal intervals between its values. For example, for the location variable, the change from Southern to Midwestern is assumed to be the same as the change from Midwestern to Northeastern. This assumption is certainly questionable for variables with distinct, non-numerical categories. The location variable here is a categorical variable, not a continuous one. The use of numerical values in the data file for categorical variables is for convenience of data input and storage; the values should be viewed as discrete rather than continuous. For example, for the public variable, 0 should be read as private and 1 as public. Similarly, for the location variable, 1 means Southern, 2 means Midwestern, and so on. To deal with such variables, we need to recode them.

Regression with categorical predictors

Code categorical numerical values to avoid confusion

To avoid mistakenly treating categorical data as continuous, we can recode the numerical values as discrete values or, better yet, as more descriptive strings. In R, this can be done easily using the function factor(). For example, the following R code changes the value 0 to Private and 1 to Public for the public variable, and recodes the values 1 to 4 of the location variable as S, MW, NE, and W, respectively. Note that each category of a variable is called a level. So the public variable has two levels and the location variable has 4 levels.

> usedata('college'); attach(college)
> 
> public<-factor(public, c(0,1), labels=c('Private', 'Public'))
> public
 [1] Private Private Private Private Private Private Private Private Private
[10] Private Private Private Private Private Private Private Private Public 
[19] Private Private Public  Private Private Private Public  Private Private
[28] Private Private Private Private Public  Private Private Public  Public 
[37] Private Private Public  Private Public  Private Private Private Public 
[46] Public  Public  Private Private Private Public  Private Private Public 
[55] Public  Public  Public  Private Public  Public  Public  Private Public 
[64] Private Public  Private Public  Public  Public  Public  Private Public 
[73] Private Public  Public  Public  Public  Public  Private Public  Public 
[82] Public  Private Public  Private
Levels: Private Public
> 
> location<-factor(location, c(1,2,3,4), labels=c('S', 'MW','NE', 'W'))
> location
 [1] NE NE NE NE NE MW NE S  NE NE NE NE NE NE S  NE S  W  NE NE S  NE MW NE S 
[26] NE MW S  NE S  NE S  S  MW MW MW MW MW S  MW S  S  W  S  NE S  MW MW S  S 
[51] MW S  MW S  S  NE MW S  MW W  NE W  W  MW NE W  W  MW NE W  W  MW W  W  W 
[76] W  W  W  W  MW W  W  W  W  W 
Levels: S MW NE W
> 

Dummy coding

Dummy coding provides a way of using categorical predictor variables in regression and other statistical analyses. Dummy coding uses only ones and zeros to convey all of the necessary information about categories or groups. In general, a categorical variable with \(k\) levels / categories is transformed into \(k-1\) dummy variables. A regression model can then be fitted using the dummy variables as predictors.
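To see concretely what dummy coding does, the following sketch builds the three dummy variables for location by hand from the original numeric codes and fits a regression with them. The names d.MW, d.NE, and d.W are our own; the fitted coefficients should match those obtained later when lm() codes the factor version of location automatically.

> usedata('college'); attach(college)
> salary <- salary / 1000
> 
> # Build the dummy variables by hand from the original numeric codes
> d.MW <- as.numeric(location == 2)  # 1 for Midwestern colleges, 0 otherwise
> d.NE <- as.numeric(location == 3)  # 1 for Northeastern colleges, 0 otherwise
> d.W  <- as.numeric(location == 4)  # 1 for Western colleges, 0 otherwise
> # Southern colleges (location == 1) are the reference: all three dummies are 0
> summary(lm(salary ~ d.MW + d.NE + d.W))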

In R, when lm() is used for regression analysis, if a predictor is set as a categorical variable (a factor), the dummy coding is done automatically. However, we need to figure out how the coding is done. In R, the dummy coding scheme of a categorical variable can be inspected using the function contrasts(). For example, for the public variable, we need one dummy variable, in which 0 means a private school and 1 means a public school. In the regression output, the dummy variable is named by concatenating the original variable name and the category label shown in the output. In this case, it is publicPublic.

For the location variable, since it has 4 categories, we need three dummy variables. In the output of contrasts(location), there are three columns, representing the three dummy variables. In the regression output they appear as locationMW, locationNE, and locationW. Note that with the three dummy variables, the four categories can be uniquely determined. For example, locationMW = locationNE = locationW = 0 indicates that the college is from the South, while locationMW = 1 and locationNE = locationW = 0 indicates that the college is from the Midwest.

> usedata('college'); attach(college)
> 
> public<-factor(public, c(0,1), labels=c('Private', 'Public'))
> location<-factor(location, c(1,2,3,4), labels=c('S', 'MW','NE', 'W'))
> 
> contrasts(public)
        Public
Private      0
Public       1
> contrasts(location)
   MW NE W
S   0  0 0
MW  1  0 0
NE  0  1 0
W   0  0 1
> 

lm() for data analysis

Conducting regression analysis with categorical predictors is actually not difficult. The same function for multiple regression analysis can be applied. We use several examples to illustrate this.

Example 1. A predictor with two categories (one-way ANOVA)

Suppose we want to see if there is a difference in salary for private and public colleges. Then, we use the public variable as a predictor, which has two categories. The input and output for the analysis are given below.

> usedata('college'); attach(college)
> salary <- salary / 1000
> 
> public<-factor(public, c(0,1), labels=c('Private', 'Public'))
> summary(lm(salary ~ public))

Call:
lm(formula = salary ~ public)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.944  -7.134   0.156   5.156  24.166 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   100.844      1.473  68.478  < 2e-16 ***
publicPublic  -12.010      2.295  -5.233 1.23e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.41 on 83 degrees of freedom
Multiple R-squared:  0.2481,	Adjusted R-squared:  0.239 
F-statistic: 27.39 on 1 and 83 DF,  p-value: 1.231e-06

> 

First, note that the same formula as for a regular regression analysis is used; the only difference is that the public variable is now a categorical variable. Second, in the output there is a variable called publicPublic, which R created automatically. It takes two values, 0 and 1. From the output of contrasts(public), we know that it takes the value 0 for a private college and 1 for a public college.

Write out the model

When using a categorical variable, it's best to write out the model for all the different categories. 

\begin{eqnarray*} salary & = & b_{0}+b_{1}*publicPublic\\ & = & 100.8-12*publicPublic\\ & = & \begin{cases} 100.8 & \;\mbox{for private colleges }(publicPublic=0)\\ 88.8 & \;\mbox{for public colleges }(publicPublic=1) \end{cases} \end{eqnarray*}

Interpretation

First, note that the difference in the average salaries between the private colleges and the public colleges is about 12 (in thousands of dollars), which is the estimated regression coefficient for publicPublic. Testing whether this difference is significantly different from 0 is therefore equivalent to testing the significance of the regression coefficient. In this case, it is significant given \(t = -5.233\) and \(p < .001\). Note that this is just another way of doing the pooled two-independent-sample t-test!
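To verify this equivalence, we can run the pooled two-sample t-test directly. A minimal sketch, assuming salary and the recoded public variable from the analysis above, is given below; the t statistic and p-value match those of the publicPublic coefficient, with the sign of t possibly flipped depending on the group order.

> # Pooled (equal-variance) two-independent-sample t-test
> t.test(salary ~ public, var.equal = TRUE)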

Example 2. A predictor with four categories

Suppose we are interested in whether the location of a college is related to the salary. The location variable has four categories, so, as expected, three dummy variables are needed to conduct the regression analysis, as shown below. The three dummy-variable predictors are locationMW, locationNE, and locationW.

> usedata('college'); attach(college)
> salary <- salary / 1000
> 
> location<-factor(location, c(1,2,3,4), labels=c('S', 'MW','NE', 'W'))
> summary(lm(salary ~ location))

Call:
lm(formula = salary ~ location)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.988  -4.010  -0.888   3.605  28.890 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   96.635      1.913  50.514  < 2e-16 ***
locationMW    -2.940      2.741  -1.073 0.286556    
locationNE    10.253      2.567   3.995 0.000142 ***
locationW    -12.525      2.673  -4.686 1.11e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.555 on 81 degrees of freedom
Multiple R-squared:  0.5047,	Adjusted R-squared:  0.4863 
F-statistic: 27.51 on 3 and 81 DF,  p-value: 2.287e-12

> 

Based on the output, we can write out the model for the predicted salaries as below.

\begin{eqnarray*} salary & = & b_{0}+b_{1}*locationMW+b_{2}*locationNE+b_{3}*locationW\\ & = & 96.6-2.9*locationMW+10.3*locationNE-12.5*locationW\\ & = & \begin{cases} 96.6 & \;\mbox{for Southern colleges}\\ 93.7 & \;\mbox{for Midwestern colleges}\\ 106.9 & \;\mbox{for Northeastern colleges}\\ 84.1 & \;\mbox{for Western colleges} \end{cases} \end{eqnarray*}

Based on the analysis, we can get the following information:

  • The average salary of each area. The South is the reference group: its average salary is the intercept, 96.64, and the other groups are compared directly with it.
  • -2.94 is the difference in average salaries between the Midwest and the South. Whether this difference is significantly different from 0 can therefore be tested through the regression coefficient of locationMW. Since t = -1.073 and p = 0.287, the difference is not significant at the 0.05 level.
  • 10.25 is the difference in average salaries between the Northeast and the South. Since t = 3.995 and p = .0001 for this coefficient, the difference is significantly different from 0.
  • -12.53 is the difference in average salaries between the West and the South. Since t = -4.686 and p = 1.11e-05 for this coefficient, the difference is significantly different from 0. A comparison between two non-reference groups, such as Midwest and Northeast, can be obtained by changing the reference level, as sketched below.
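A minimal sketch of such a comparison, using relevel() to make the Midwest the reference group, is given below (location2 is our own name). Each coefficient in the refitted model then estimates a difference from the Midwest average.

> # Change the reference category from S to MW and refit
> location2 <- relevel(location, ref = 'MW')
> summary(lm(salary ~ location2))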

Testing the significance of the predictor

To test the significance of a categorical predictor as a whole, one can check the overall model fit of the regression analysis based on the F-test. This is equivalent to testing the significance of all the dummy variables together. For this specific example, we have F = 27.51 and p-value = 2.287e-12. Therefore, the predictor location is statistically significant.
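Because location is the only predictor here, the overall F-test is the same as comparing the intercept-only model with the location model. A sketch of this model comparison, which should reproduce F = 27.51, is given below.

> # Test location by comparing nested models
> m0 <- lm(salary ~ 1)  # intercept-only model
> m.loc <- lm(salary ~ location)
> anova(m0, m.loc)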

Example 3. Two categorical predictors (two-way ANOVA)

Now let's consider the use of both predictors, public and location. With two categorical variables, we dummy code each of them separately and then combine the dummy-coded variables to identify the group membership of every college in the data. For the current analysis, we have the following 4 dummy-coded variables.

Sector    Location    MW  NE  W  Public
Public    South        0   0  0       1
Public    Midwest      1   0  0       1
Public    Northeast    0   1  0       1
Public    West         0   0  1       1
Private   South        0   0  0       0
Private   Midwest      1   0  0       0
Private   Northeast    0   1  0       0
Private   West         0   0  1       0

In fitting the model, we expect the three dummy variables locationMW, locationNE, and locationW for the categorical variable location, and publicPublic for the categorical variable public. Note that when locationMW = 0, locationNE = 0, locationW = 0, and publicPublic = 1, the college is a public college in the South. The other colleges can be identified in the same way using the 4 dummy-coded variables.

The R input and output for the analysis are given below.

> usedata('college'); attach(college)
> salary <- salary / 1000
> 
> public<-factor(public, c(0,1), labels=c('Private', 'Public'))
> location<-factor(location, c(1,2,3,4), labels=c('S', 'MW','NE', 'W'))
> 
> summary(lm(salary~public+location))

Call:
lm(formula = salary ~ public + location)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.147  -4.754  -0.348   2.847  31.672 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    99.556      1.904  52.296  < 2e-16 ***
publicPublic   -7.301      1.828  -3.994 0.000143 ***
locationMW     -2.402      2.522  -0.953 0.343662    
locationNE      8.793      2.386   3.685 0.000415 ***
locationW     -10.926      2.488  -4.391 3.42e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.86 on 80 degrees of freedom
Multiple R-squared:  0.587,	Adjusted R-squared:  0.5664 
F-statistic: 28.43 on 4 and 80 DF,  p-value: 1.06e-14

> 

Based on the output, we can calculate the expected salary for each type of college as below. For example, the expected salary for a private Southern college is the intercept, 99.56, and for a public Western college it is 99.556 - 7.301 - 10.926 = 81.33.

Sector    South   Midwest   Northeast   West
Private   99.56   97.15     108.35      88.63
Public    92.26   89.85     101.05      81.33

With this, we can easily calculate the difference in salary between any two types of colleges.
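Instead of doing the arithmetic by hand, the eight expected salaries can also be computed with predict(). The sketch below builds a grid of all public-by-location combinations (m.add and newdat are our own names).

> # Predicted salaries for all eight types of colleges
> m.add <- lm(salary ~ public + location)
> newdat <- expand.grid(public = levels(public), location = levels(location))
> cbind(newdat, predicted = predict(m.add, newdata = newdat))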

Testing the significance of the predictor

Note that to test the significance of the public variable, we can directly look at the coefficient for publicPublic, since there is only one dummy variable. For this example, it is significant given t = -3.994 with a p-value of 0.00014.

Testing the significance of location is equivalent to testing the significance of the subset of coefficients for the three dummy variables related to location. In this case, we can compare two models, one with both categorical predictors and the other with the public predictor only, and then conduct an F-test for comparing the two models. From the comparison below, we have F = 21.887 with a p-value = 1.908e-10. Therefore, location is significant above and beyond the predictor public.

> usedata('college'); attach(college)
> salary <- salary / 1000
> 
> public<-factor(public, c(0,1), labels=c('Private', 'Public'))
> location<-factor(location, c(1,2,3,4), labels=c('S', 'MW','NE', 'W'))
> 
> m1 <- lm(salary~public+location)
> m2 <- lm(salary~public)
> anova(m2,m1)
Analysis of Variance Table

Model 1: salary ~ public
Model 2: salary ~ public + location
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     83 9000.1                                  
2     80 4943.0  3    4057.1 21.887 1.908e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 

Example 4. Two categorical predictors with interaction

We can also look at the interaction effect between the two categorical predictors. In doing the analysis, we simply include the product of the two predictors in the model formula; R automatically generates the interaction terms among the dummy-coded variables. To investigate the significance of the interaction, we can similarly compare the models with and without the interaction terms. For this particular example, we have F = 8.4569 with a p-value = 6.316e-05. Therefore, the interaction is significant.

> usedata('college'); attach(college)
> salary <- salary / 1000
> 
> public<-factor(public, c(0,1), labels=c('Private', 'Public'))
> location<-factor(location, c(1,2,3,4), labels=c('S', 'MW','NE', 'W'))
> 
> m1 <- lm(salary~public+location)
> m2 <- lm(salary~public*location)
> anova(m1,m2)
Analysis of Variance Table

Model 1: salary ~ public + location
Model 2: salary ~ public * location
  Res.Df  RSS Df Sum of Sq      F    Pr(>F)    
1     80 4943                                  
2     77 3718  3      1225 8.4569 6.316e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 

Note that one can also apply the anova() function directly to a fitted regression model to obtain an ANOVA-style table of the effects, as in a traditional ANOVA.
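For example, applying anova() to the interaction model fitted above yields a sequential ANOVA table with rows for public, location, and their interaction; a minimal sketch is given below.

> # ANOVA-style table for the two-way model with interaction
> anova(lm(salary ~ public * location))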

Example 5. Regression with one categorical and one continuous predictor (ANCOVA)

One might argue that the salary is related to the cost of education. Therefore, when looking at the salary difference across locations, one should first control for the effect of the cost of education. Such an analysis can be carried out conveniently as below. From the output, we still observe a significant location effect after controlling for the cost of education.

> usedata('college'); attach(college)
> salary <- salary / 1000
> 
> public<-factor(public, c(0,1), labels=c('Private', 'Public'))
> location<-factor(location, c(1,2,3,4), labels=c('S', 'MW','NE', 'W'))
> 
> cost <- cost/1000
> summary(res<-lm(salary~cost + location))

Call:
lm(formula = salary ~ cost + location)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.2444  -4.9326  -0.5719   3.1620  29.9388 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  87.78906    3.09770  28.340  < 2e-16 ***
cost          0.06052    0.01728   3.501  0.00076 ***
locationMW   -2.80035    2.56841  -1.090  0.27885    
locationNE    9.23247    2.42247   3.811  0.00027 ***
locationW   -10.52177    2.56915  -4.095  0.00010 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.016 on 80 degrees of freedom
Multiple R-squared:  0.5705,	Adjusted R-squared:  0.549 
F-statistic: 26.57 on 4 and 80 DF,  p-value: 4.956e-14

> anova(res)
Analysis of Variance Table

Response: salary
          Df Sum Sq Mean Sq F value    Pr(>F)    
cost       1 2757.0 2756.97  42.903 5.120e-09 ***
location   3 4071.8 1357.26  21.121 3.571e-10 ***
Residuals 80 5140.8   64.26                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 

Post-hoc comparison

Given the above regression analysis, we can conclude that the location of a university and its private/public sector are related to the average salary its graduates earn. We can further compare, for example, a private Midwestern university with a public Western university. To make such a comparison, we use the function contrast() in the package contrast. For example, based on the analysis below, students from public Western colleges earn significantly less than students from private Midwestern colleges. Note that in using the function, we use a list() to specify the categories of each predictor in the comparison. Keep in mind that this kind of comparison can run into the multiple-comparison problem, and therefore a Bonferroni correction should be considered.

> usedata('college'); attach(college)
> salary <- salary / 1000
> 
> public<-factor(public, c(0,1), labels=c('Private', 'Public'))
> location<-factor(location, c(1,2,3,4), labels=c('S', 'MW','NE', 'W'))
> 
> m1 <- lm(salary~public+location)
> 
> library(contrast)
Loading required package: rms
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, round.POSIXt, trunc.POSIXt, units

Loading required package: SparseM

Attaching package: 'SparseM'

The following object is masked from 'package:base':

    backsolve

> contrast(m1, list(location='W', public='Public'), 
+              list(location='MW', public='Private'))
lm model parameter contrast

   Contrast    S.E.     Lower    Upper     t df Pr(>|t|)
1 -15.82522 2.93855 -21.67312 -9.97732 -5.39 80        0
> 
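As a sketch of how a Bonferroni correction could be applied, suppose the p-values of several planned contrasts were collected into a vector (the values below are hypothetical). The function p.adjust() multiplies each p-value by the number of comparisons, capping the result at 1.

> # Hypothetical p-values from three planned contrasts
> pvals <- c(1e-07, 0.02, 0.04)
> p.adjust(pvals, method = 'bonferroni')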
