Logistic Regression

Logistic regression is widely used in social and behavioral research for analyzing binary (dichotomous) outcome data. In logistic regression, the outcome can take only two values, 0 and 1. Some examples that can utilize logistic regression are given below.

  • Whether a Democratic or Republican president is elected can depend on factors such as the economic status of the country, the amount of money spent on the campaign, and the gender and income of the voters.
  • Whether an assistant professor can be tenured may be predicted from the number of publications and teaching performance in the first three years.
  • Whether or not someone has a heart attack may be related to age, gender and living habits.
  • Whether a student is admitted may be predicted by her/his high school GPA, SAT score, and quality of recommendation letters.

We use an example to illustrate how to conduct logistic regression in R.

An example

In this example, the aim is to predict whether a woman is in compliance with mammography screening recommendations from four predictors, one reflecting medical input and three reflecting a woman's psychological status with regard to screening.

  • Outcome y: whether a woman is in compliance with mammography screening recommendations (1: in compliance; 0: not in compliance)
  • Predictors:
    • x1: whether she has received a recommendation for screening from a physician;
    • x2: her knowledge about breast cancer and mammography screening;
    • x3: her perception of benefit of such a screening;
    • x4: her perception of the barriers to being screened.
> usedata('mamm')
> head(mamm)
  y x1   x2 x3 x4
1 1  1 0.22  4  3
2 0  0 0.56  1  1
3 1  0 0.44  4  3
4 0  0 0.33  3  0
5 1  1 0.44  5  0
6 0  1 0.56  5  0
> 
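
To get a quick feel for the data before fitting a model, one can tabulate the outcome variable; this is just a descriptive check, not part of the book's analysis:

table(mamm$y)   ## number of women not in compliance (0) and in compliance (1)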

Basic ideas

With a binary outcome, linear regression no longer works. Simply speaking, the predictors can take any value but the outcome cannot, so a linear regression cannot predict the outcome well. To deal with this problem, we instead model the probability of observing an outcome of 1, that is, $p = \Pr(y=1)$. In the mammography example, this is the probability that a woman is in compliance with the screening recommendation.

Intuitively, even directly modeling the probability should work better than predicting the 1/0 outcome. A remaining problem is that the probability is bounded between 0 and 1, while the predicted values from a linear model generally are not. To deal with this, we apply the transformation

\[ \eta = \log\frac{p}{1-p}.\]

After the transformation, $\eta$ can take any value, from $-\infty$ when $p=0$ to $\infty$ when $p=1$. Such a transformation is called the logit transformation, denoted by $\text{logit}(p)$. Note that $p/(1-p)$ is called the odds, which is simply the ratio of the probabilities of the two possible outcomes. For example, if the probability that a woman is in compliance is 0.8, then the odds is 0.8/(1-0.8)=4. Clearly, when the two outcomes are equally likely, the odds equals 1. If the odds is greater than 1, the probability of observing the outcome 1 is higher than 0.5. With the transformation, $\eta$ can be modeled directly.
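
As a quick numerical check of these definitions (an illustration, not part of the original analysis), the odds and logit for a probability of 0.8 can be computed in R; qlogis() is R's built-in logit function:

p <- 0.8
odds <- p / (1 - p)   ## 0.8/0.2 = 4
eta <- log(odds)      ## log(4), about 1.386
qlogis(p)             ## the built-in logit gives the same value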

Therefore, the logistic regression model is

\[ \mbox{logit}(p_{i})=\log(\frac{p_{i}}{1-p_{i}})=\eta_i=\beta_{0}+\beta_{1}x_{1i}+\ldots+\beta_{k}x_{ki} \]

where $p_i = \Pr(y_i = 1)$. Unlike regular linear regression, there is no residual term in the model.

Why is this?

For a variable $y$ with two and only two outcome values, it is often assumed to follow a Bernoulli or binomial distribution, with probability $p$ for the outcome 1 and probability $1-p$ for 0. The density function is

\[ p^y (1-p)^{1-y}. \] 

Note that when $y=1$, $p^y (1-p)^{1-y} = p$ exactly, and when $y=0$ it equals $1-p$.

Furthermore, we assume there is a continuous variable $y^*$ underlying the observed binary variable. If the continuous variable takes a value larger than a certain threshold, we observe 1; otherwise we observe 0. For logistic regression, we assume the continuous variable has a logistic distribution with the density function

\[ \frac{e^{-y^*}}{(1+e^{-y^*})^{2}} .\]

The probability of observing 1 can therefore be calculated directly from the logistic distribution as

\[ p = \frac{1}{1 + e^{-y^*}},\]

which transforms to 

\[ \log\frac{p}{1-p} = y^*.\]

Since $y^*$ is a continuous variable, it can be predicted as in a regular regression model.
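
In R, the logistic density and distribution functions are available as dlogis() and plogis(), and qlogis() is the logit. The following sketch (for illustration only) verifies that plogis() is the inverse of the logit transformation:

ystar <- 1.5        ## an arbitrary value of the latent variable
dlogis(ystar)       ## density: exp(-ystar)/(1 + exp(-ystar))^2
p <- plogis(ystar)  ## distribution function: 1/(1 + exp(-ystar))
log(p / (1 - p))    ## the logit recovers ystar (1.5)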

Fitting a logistic regression model in R

In R, the model can be estimated using the glm() function, since logistic regression is one example of a generalized linear model (GLM). The analysis of the mammography data is given below.

  • glm() uses the same model formula as the linear regression model.
  • family = specifies the distribution of the outcome variable. For binary data, the binomial distribution is used.
  • link = specifies the transformation method. Here, the logit transformation is used.
  • The output includes the regression coefficients and their z-statistics and p-values.
  • The dispersion parameter is related to the variance of the response variable.
> usedata('mamm')
> m1<-glm(y~x1+x2+x3+x4, family=binomial(link='logit'), data=mamm)
> summary(m1)

Call:
glm(formula = y ~ x1 + x2 + x3 + x4, family = binomial(link = "logit"), 
    data = mamm)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3941  -0.7660   0.3756   0.7881   1.6004  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.4466     1.1257  -1.285 0.198754    
x1            1.7731     0.4841   3.663 0.000249 ***
x2           -0.8694     1.1201  -0.776 0.437628    
x3            0.5935     0.2058   2.883 0.003933 ** 
x4           -0.1527     0.1542  -0.990 0.321946    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 203.32  on 163  degrees of freedom
Residual deviance: 155.48  on 159  degrees of freedom
AIC: 165.48

Number of Fisher Scoring iterations: 5

> 

Interpret the results

We first focus on how to interpret the parameter estimates from the analysis. For the intercept, when all the predictors take the value 0, we have

\[ \beta_0 = \log(\frac{p}{1-p}), \]

which is the log odds that the observed outcome is 1.
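
As an illustration (assuming the model m1 fitted above), the intercept can be converted to an odds and then to a probability using the inverse logit function plogis():

b0 <- coef(m1)["(Intercept)"]  ## estimated intercept, the log odds
exp(b0)                        ## odds of compliance when all predictors are 0
plogis(b0)                     ## corresponding probability, 1/(1 + exp(-b0))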

We now look at the coefficient of each predictor. For the mammography example, let's assume $x_2$, $x_3$, and $x_4$ are held the same and look at $x_1$ only. If a woman has received a recommendation ($x_1=1$), then the log odds is

\[ \log(\frac{p}{1-p})|(x_1=1)=\beta_{0}+\beta_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+\beta_{4}x_{4}.\]

If a woman has not received a recommendation ($x_1=0$), then the log odds is

\[\log(\frac{p}{1-p})|(x_1=0)=\beta_{0}+\beta_{2}x_{2}+\beta_{3}x_{3}+\beta_{4}x_{4}.\]

The difference is

\[\log(\frac{p}{1-p})|(x_1=1)-\log(\frac{p}{1-p})|(x_1=0)=\beta_{1}.\]

Therefore, the logistic regression coefficient for a predictor is the difference in the log odds when the predictor changes by one unit, with the other predictors unchanged.

The above equation is equivalent to

\[\log\left(\frac{\frac{p(x_1=1)}{1-p(x_1=1)}}{\frac{p(x_1=0)}{1-p(x_1=0)}}\right)=\beta_{1}.\]

More descriptively, we have

\[\log\left(\frac{\mbox{ODDS(received recommendation)}}{\mbox{ODDS(not received recommendation)}}\right)=\beta_{1}.\]

Therefore, the regression coefficient is the log odds ratio. By a simple transformation, we have

\[\frac{\mbox{ODDS(received recommendation)}}{\mbox{ODDS(not received recommendation)}}=\exp(\beta_{1})\]

or

\[\mbox{ODDS(received recommendation)} = \exp(\beta_{1})*\mbox{ODDS(not received recommendation)}.\]

Therefore, the exponential of a regression coefficient is the odds ratio. For the example, $\exp(\beta_{1})=\exp(1.7731)\approx 5.9$. Thus, the odds of compliance with screening for those who received a recommendation is about 5.9 times the odds of those who did not receive a recommendation.

The regression coefficients for continuous predictors can be interpreted in the same way. For example, in the admission example, if the odds ratio for high school GPA were 6, we would say that when high school GPA increases by one unit, the odds of a student being admitted become 6 times as large, given the other variables stay the same.

Although the output does not directly show the odds ratios, they can be calculated easily in R as shown below.

> usedata('mamm')
> m1<-glm(y~x1+x2+x3+x4, family=binomial(link='logit'), data=mamm)
> summary(m1)

Call:
glm(formula = y ~ x1 + x2 + x3 + x4, family = binomial(link = "logit"), 
    data = mamm)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3941  -0.7660   0.3756   0.7881   1.6004  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.4466     1.1257  -1.285 0.198754    
x1            1.7731     0.4841   3.663 0.000249 ***
x2           -0.8694     1.1201  -0.776 0.437628    
x3            0.5935     0.2058   2.883 0.003933 ** 
x4           -0.1527     0.1542  -0.990 0.321946    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 203.32  on 163  degrees of freedom
Residual deviance: 155.48  on 159  degrees of freedom
AIC: 165.48

Number of Fisher Scoring iterations: 5

> 
> exp(coef(m1))
(Intercept)          x1          x2          x3          x4 
  0.2353638   5.8892237   0.4192066   1.8103735   0.8583652 
> 

Using the odds ratios, we can interpret the parameters as follows (a one-line calculation of the percentage changes used below is shown after the list).

  • For x1, if a woman receives a screening recommendation, the odds of her being in compliance with screening are about 5.9 times the odds of a woman who does not receive a recommendation, given that x2, x3, and x4 are the same. Alternatively (perhaps more intuitively), if a woman receives a screening recommendation, the odds of her being in compliance with screening increase by about 4.9 times (5.889 - 1 = 4.889 ≈ 4.9), given the other variables are the same.
  • For x2, if a woman has one more unit of knowledge about breast cancer and mammography screening, the odds of her being in compliance with screening decrease by 58.1% (0.419 - 1 = -0.581; a negative number means a decrease), keeping the other variables constant.
  • For x3, if a woman's perception of the benefit increases by one unit, the odds of her being in compliance with screening increase by 81% (1.810 - 1 = 0.810; a positive number means an increase), keeping the other variables constant.
  • For x4, if a woman's perception of the barriers increases by one unit, the odds of her being in compliance with screening decrease by 14.2% (0.858 - 1 = -0.142; a negative number means a decrease), keeping the other variables constant.
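
The percentage changes used above can be obtained in one step from the fitted model; this is only a convenience calculation, not additional output of glm():

## percentage change in the odds for a one-unit increase in each predictor
(exp(coef(m1)) - 1) * 100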

Statistical inference for logistic regression

Statistical inference for logistic regression is very similar to statistical inference for simple linear regression. We can (1) conduct a significance test for each individual parameter, (2) test the overall model, and (3) test a subset of the parameters.

Test a single coefficient (z-test and confidence interval)

For the regression coefficient of each predictor, we can use a z-test (note: not a t-test). In the output, we have z-values and the corresponding p-values. The coefficients of x1 and x3 are significant at the alpha level 0.05, but those of x2 and x4 are not. Note that some software outputs the Wald statistic for testing significance. The Wald statistic is the square of the z-statistic, and thus the Wald test gives the same conclusion as the z-test.
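
If the Wald statistics are needed, they can be obtained by squaring the z values from the coefficient table; a minimal sketch using the fitted model m1:

ctab <- summary(m1)$coefficients        ## estimates, SEs, z values, p-values
cbind(ctab, Wald = ctab[, "z value"]^2) ## Wald statistic is the squared z value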

We can also conduct the hypothesis testing by constructing confidence intervals. For a fitted model, the function confint() can be used to obtain the confidence intervals of the coefficients. Since one is often interested in the odds ratio, its confidence interval can also be obtained.

Note that if the CI for an odds ratio includes 1, the corresponding coefficient is not significant; if it does not include 1, the coefficient is significant. This is because for the original coefficient we compare the CI with 0, and for the odds ratio, exp(0)=1.

If we were reporting the results in terms of the odds ratio and its CI, we could say, “The odds of being in compliance with screening increase by a factor of 5.9 when a screening recommendation is received (z = 3.66, p = 0.0002; 95% CI = 2.38 to 16.23), given everything else is the same.”

> usedata('mamm')
> m1<-glm(y~x1+x2+x3+x4, family=binomial(link='logit'), data=mamm)
> confint(m1)
Waiting for profiling to be done...
                 2.5 %    97.5 %
(Intercept) -3.7161449 0.7286948
x1           0.8665475 2.7871626
x2          -3.1466137 1.2831677
x3           0.2023731 1.0134006
x4          -0.4577415 0.1506678
> exp(confint(m1))
Waiting for profiling to be done...
                 2.5 %    97.5 %
(Intercept) 0.02432757  2.072374
x1          2.37868419 16.234889
x2          0.04299748  3.608051
x3          1.22430472  2.754954
x4          0.63271101  1.162610
> 

Test the overall model

For linear regression, we evaluate the overall model fit by looking at the variance explained by all the predictors. For logistic regression, we cannot calculate such a variance. However, we can define and evaluate the deviance instead. For a model without any predictors, we can calculate the null deviance, which plays a role similar to the variance of a normal outcome variable. After including the predictors, we have the residual deviance. The difference between the null deviance and the residual deviance tells us how much the predictors help in predicting the outcome. If the difference is significant, then, overall, the predictors are statistically significant.

The difference, or the decrease, in deviance after including the predictors follows a chi-square ($\chi^{2}$) distribution. The chi-square distribution is widely used in statistical inference and is closely related to the F distribution. For example, the ratio of two independent chi-square variables, each divided by its degrees of freedom, follows an F distribution. In addition, as the denominator degrees of freedom of an F distribution go to infinity, the F statistic multiplied by its numerator degrees of freedom converges to a chi-square distribution with that many degrees of freedom.

There are two ways to conduct the test. From the output, we can find the null and residual deviances and the corresponding degrees of freedom, and then compute the differences. For the mammography example, the difference between the null deviance and the residual deviance is 203.32 - 155.48 = 47.84, and the difference in the degrees of freedom is 163 - 159 = 4. The p-value can then be calculated from a chi-square distribution with 4 degrees of freedom. Because the p-value is smaller than 0.05, the overall model is significant.
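
Rather than copying the numbers from the printed output, the deviances and degrees of freedom can also be extracted from the fitted model object; a small sketch, assuming the model m1 fitted earlier:

d_chi <- m1$null.deviance - deviance(m1)  ## 203.32 - 155.48
d_df <- m1$df.null - df.residual(m1)      ## 163 - 159
pchisq(d_chi, d_df, lower.tail = FALSE)   ## p-value for the overall test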

The test can also be conducted in a simpler way. We first fit a model without any predictors and another model with all the predictors. Then, we can use anova() to obtain the difference in deviance; the chi-square test itself can be requested as shown after the output.

> usedata('mamm')
> m1<-glm(y~x1+x2+x3+x4, family=binomial(link='logit'), data=mamm)
> summary(m1)

Call:
glm(formula = y ~ x1 + x2 + x3 + x4, family = binomial(link = "logit"), 
    data = mamm)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3941  -0.7660   0.3756   0.7881   1.6004  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.4466     1.1257  -1.285 0.198754    
x1            1.7731     0.4841   3.663 0.000249 ***
x2           -0.8694     1.1201  -0.776 0.437628    
x3            0.5935     0.2058   2.883 0.003933 ** 
x4           -0.1527     0.1542  -0.990 0.321946    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 203.32  on 163  degrees of freedom
Residual deviance: 155.48  on 159  degrees of freedom
AIC: 165.48

Number of Fisher Scoring iterations: 5

> 
> ## method 1
> d_chi <- 203.32 - 155.48
> d_df <- 163 - 159
> 
> 1 - pchisq(d_chi, d_df)
[1] 1.019117e-09
> 
> ## method 2
> m0 <- glm(y~1, family=binomial(link='logit'), data=mamm)
> anova(m0,m1)
Analysis of Deviance Table

Model 1: y ~ 1
Model 2: y ~ x1 + x2 + x3 + x4
  Resid. Df Resid. Dev Df Deviance
1       163     203.32            
2       159     155.48  4   47.837
> 
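
Note that the anova() call above reports only the change in deviance. Adding the test argument asks R to compute the chi-square p-value directly:

anova(m0, m1, test = "Chisq")  ## same deviance table, plus a chi-square p-value column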

Test a subset of predictors

We can also test the significance of a subset of the predictors, for example, whether x3 and x4 are significant above and beyond x1 and x2. This can also be done using the chi-square test based on the difference in deviance. In this case, we compare the model with all predictors to a model without x3 and x4 to see if the change in deviance is significant. In this example, the p-value is about 0.002, indicating the change is significant. Therefore, x3 and x4 are statistically significant above and beyond x1 and x2.

> usedata('mamm')
> m1<-glm(y~x1+x2+x3+x4, family=binomial(link='logit'), data=mamm)
> m2 <- glm(y~x1+x2, family=binomial(link='logit'), data=mamm)
> 
> anova(m2, m1)
Analysis of Deviance Table

Model 1: y ~ x1 + x2
Model 2: y ~ x1 + x2 + x3 + x4
  Resid. Df Resid. Dev Df Deviance
1       161     168.23            
2       159     155.48  2   12.749
> 
> 1 - pchisq(12.749, 2)
[1] 0.001704472
> 
