Logistic Regression
Logistic regression is widely used in social and behavioral research to analyze binary (dichotomous) outcome data. In logistic regression, the outcome can take only two values, 0 and 1. Some examples that can utilize logistic regression are given below.
- Whether a Democratic or Republican president is elected can depend on factors such as the state of the economy, the amount of money spent on the campaign, and the gender and income of the voters.
- Whether an assistant professor is tenured may be predicted from the number of publications and teaching performance in the first three years.
- Whether or not someone has a heart attack may be related to age, gender and living habits.
- Whether a student is admitted may be predicted by her/his high school GPA, SAT score, and quality of recommendation letters.
We use an example to illustrate how to conduct logistic regression in Python.
An example
In this example, the aim is to predict whether a woman is in compliance with mammography screening recommendations from four predictors, one reflecting medical input and three reflecting the woman's psychological status with regard to screening. The data are in the file mamm.csv.
- Outcome y: whether a woman is in compliance with mammography screening recommendations (1: in compliance; 0: not in compliance)
- Predictors:
- x1: whether she has received a recommendation for screening from a physician;
- x2: her knowledge about breast cancer and mammography screening;
- x3: her perception of benefit of such a screening;
- x4: her perception of the barriers to being screened.
>>> import pandas as pd
>>> pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
     y  x1    x2  x3  x4
0    1   1  0.22   4   3
1    0   0  0.56   1   1
2    1   0  0.44   4   3
3    0   0  0.33   3   0
4    1   1  0.44   5   0
..  ..  ..   ...  ..  ..
159  1   1  0.89   5   1
160  0   0  0.78   5   1
161  1   1  0.78   5   0
162  1   1  0.56   3   0
163  1   0  0.67   4   1

[164 rows x 5 columns]
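Before fitting a model, it can be useful to take a quick look at the data. Below is a minimal sketch using standard pandas functions (output omitted; the exact numbers depend on the data):

## a quick descriptive look at the data (sketch)
import pandas as pd

mamm = pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
print(mamm['y'].value_counts())   # how many women are / are not in compliance
print(mamm.describe())            # basic summary statistics for all five variables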
Basic ideas
With a binary outcome, linear regression no longer works well. Simply speaking, the predictors can take any value but the outcome cannot, so a linear regression cannot predict the outcome well. To deal with this problem, we instead model the probability of observing the outcome 1, that is, $p = \Pr(y=1)$. In the mammography example, this is the probability that a woman is in compliance with the screening recommendation.
Intuitively, directly modeling the probability should work better than predicting the 0/1 outcome itself. A remaining problem is that the probability is bounded between 0 and 1, whereas the predicted values from a linear model generally are not. To deal with this, we apply a transformation:
\[ \eta = \log\frac{p}{1-p}.\]
After the transformation, $\eta$ can take any value, from $-\infty$ when $p=0$ to $\infty$ when $p=1$. This transformation is called the logit transformation, denoted by $\text{logit}(p)$. Note that $p/(1-p)$ is called the odds, which is simply the ratio of the probabilities of the two possible outcomes. For example, if the probability that a woman is in compliance is 0.8, then her odds are 0.8/(1-0.8)=4. Clearly, when the two outcomes are equally probable, the odds equal 1; if the odds are greater than 1, the probability of observing the outcome 1 is higher than 0.5. With this transformation, $\eta$ can be modeled directly.
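As a quick numeric illustration of these definitions, the example above can be restated in code (a minimal sketch using only numpy):

import numpy as np

p = 0.8                  # probability of being in compliance
odds = p / (1 - p)       # odds = 0.8/0.2 = 4
eta = np.log(odds)       # logit(p) = log(4), about 1.386
print(odds, eta)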
Therefore, the logistic regression model is
\[ \mbox{logit}(p_{i})=\log(\frac{p_{i}}{1-p_{i}})=\eta_i=\beta_{0}+\beta_{1}x_{1i}+\ldots+\beta_{k}x_{ki} \]
where $p_i = \Pr(y_i = 1)$. Unlike the regular linear regression model, there is no residual term in the model.
Why is this?
For a variable $y$ with two and only two outcome values, it is often assumed to follow a Bernoulli or binomial distribution, with probability $p$ for the outcome 1 and probability $1-p$ for the outcome 0. The density function is
\[ p^y (1-p)^{1-y}. \]
Note that when $y=1$, $p^y (1-p)^{1-y} = p$ exactly.
Furthermore, we assume there is a continuous variable $y^*$ underlying the observed binary variable. If the continuous variable takes a value larger than a certain threshold, we observe 1, otherwise 0. For logistic regression, we assume this continuous variable follows a logistic distribution, whose cumulative distribution function is $1/(1+e^{-y^{*}})$. The probability of observing 1 can therefore be calculated directly from the logistic distribution as
\[ p = \frac{1}{1 + e^{-y^*}},\]
which transforms to
\[ \log\frac{p}{1-p} = y^*.\]
Since $y^*$ is a continuous variable, it can be predicted as in a regular regression model.
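The logit transformation and its inverse (the logistic function) can be checked numerically. Below is a small sketch using only numpy:

import numpy as np

def logit(p):
    # log odds of a probability
    return np.log(p / (1 - p))

def inv_logit(eta):
    # logistic function: maps any real value back into (0, 1)
    return 1 / (1 + np.exp(-eta))

p = 0.7
eta = logit(p)            # about 0.847
print(inv_logit(eta))     # recovers 0.7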
Fitting a logistic regression model in Python
In Python, the model can be estimated using the glm() function of the statsmodels library. Logistic regression is one example of the generalized linear model (GLM). The analysis of the mammography data is given below.
- glm() uses the same model formula syntax as the linear regression model.
- family = specifies the distribution of the outcome variable. For binary data, the binomial distribution is used.
- link = specifies the transformation method. Here, the logit transformation is used.
- The output includes the regression coefficients along with their z-statistics and p-values.
- The scale (dispersion) parameter is related to the variance of the response variable.
>>> import pandas as pd
>>> mamm = pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
>>>
>>> ## fit the model
>>> import statsmodels.api as sm
>>> from statsmodels.genmod.families.links import logit
>>> import statsmodels.formula.api as smf
>>> m1 = smf.glm(formula="y ~ x1 + x2 + x3 + x4", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m1.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      159
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -77.741
Date:                Sun, 24 Nov 2024   Deviance:                       155.48
Time:                        13:43:16   Pearson chi2:                     169.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2530
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.4466      1.126     -1.285      0.199      -3.653       0.760
x1             1.7731      0.484      3.663      0.000       0.824       2.722
x2            -0.8694      1.120     -0.776      0.438      -3.065       1.326
x3             0.5935      0.206      2.883      0.004       0.190       0.997
x4            -0.1527      0.154     -0.990      0.322      -0.455       0.149
==============================================================================
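Once the model has been fitted, the predicted probability of compliance for each woman can be obtained from the fitted results. A short sketch, assuming the fitted object m1 and the data frame mamm from the code above (the 0.5 cutoff for classification is an illustrative choice, not part of the model):

## predicted probabilities and a simple classification check (sketch)
p_hat = m1.predict()                  # fitted probabilities for the observed data
print(p_hat[:5])                      # predicted probabilities for the first five women
y_hat = (p_hat > 0.5).astype(int)     # classify as in compliance if probability > 0.5
print((y_hat == mamm['y']).mean())    # proportion classified correctly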
Interpret the results
We first focus on how to interpret the parameter estimates from the analysis. For the intercept, when all the predictors take the value 0, we have
\[ \beta_0 = \log(\frac{p}{1-p}), \]
which is the log odds that the observed outcome is 1.
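For the mammography data, the estimated intercept is -1.4466, so the baseline odds and probability of compliance when all predictors equal 0 can be recovered as in the small sketch below:

import numpy as np

b0 = -1.4466              # estimated intercept from the output above
odds0 = np.exp(b0)        # baseline odds, about 0.235
p0 = odds0 / (1 + odds0)  # baseline probability, about 0.19
print(odds0, p0)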
We now look at the coefficient of each predictor. For the mammography example, let's hold $x_2$, $x_3$, and $x_4$ constant and look at $x_1$ only. If a woman has received a recommendation ($x_1=1$), the log odds is
\[ \log(\frac{p}{1-p})|(x_1=1)=\beta_{0}+\beta_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+\beta_{4}x_{4}.\]
If a woman has not received a recommendation ($x_1=0$), the log odds is
\[\log(\frac{p}{1-p})|(x_1=0)=\beta_{0}+\beta_{2}x_{2}+\beta_{3}x_{3}+\beta_{4}x_{4}.\]
The difference is
\[\log(\frac{p}{1-p})|(x_1=1)-\log(\frac{p}{1-p})|(x_1=0)=\beta_{1}.\]
Therefore, the logistic regression coefficient for a predictor is the difference in the log odds when the predictor increases by 1 unit, with the other predictors held unchanged.
The above equation is equivalent to
\[\log\left(\frac{\frac{p(x_1=1)}{1-p(x_1=1)}}{\frac{p(x_1=0)}{1-p(x_1=0)}}\right)=\beta_{1}.\]
More descriptively, we have
\[\log\left(\frac{\mbox{ODDS(received recommendation)}}{\mbox{ODDS(not received recommendation)}}\right)=\beta_{1}.\]
Therefore, the regression coefficient is the log odds ratio. By a simple transformation, we have
\[\frac{\mbox{ODDS(received recommendation)}}{\mbox{ODDS(not received recommendation)}}=\exp(\beta_{1})\]
or
\[\mbox{ODDS(received recommendation)} = \exp(\beta_{1})*\mbox{ODDS(not received recommendation)}.\]
Therefore, the exponential of a regression coefficient is the odds ratio. For the example, $\exp(\beta_{1})=\exp(1.7731)=5.9$. Thus, the odds of being in compliance with screening for women who received a recommendation are about 5.9 times the odds for those who did not.
For continuous predictors, the regression coefficients can be interpreted the same way. For example, if the odds ratio for high school GPA were 6, we could say that a one-unit increase in GPA multiplies the odds of being admitted by 6, given the other variables are the same.
Although the output does not directly show odds ratios, they can be calculated easily in Python as shown below.
>>> import pandas as pd
>>> mamm = pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
>>>
>>> ## fit the model
>>> import statsmodels.api as sm
>>> from statsmodels.genmod.families.links import logit
>>> import statsmodels.formula.api as smf
>>> m1 = smf.glm(formula="y ~ x1 + x2 + x3 + x4", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m1.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      159
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -77.741
Date:                Sun, 24 Nov 2024   Deviance:                       155.48
Time:                        13:43:16   Pearson chi2:                     169.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2530
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.4466      1.126     -1.285      0.199      -3.653       0.760
x1             1.7731      0.484      3.663      0.000       0.824       2.722
x2            -0.8694      1.120     -0.776      0.438      -3.065       1.326
x3             0.5935      0.206      2.883      0.004       0.190       0.997
x4            -0.1527      0.154     -0.990      0.322      -0.455       0.149
==============================================================================
>>>
>>> ## calculate the odds ratio
>>> import numpy as np
>>> print("Odds Ratios:")
Odds Ratios:
>>> print(np.exp(m1.params))
Intercept    0.235364
x1           5.889224
x2           0.419207
x3           1.810373
x4           0.858365
dtype: float64
Using odds ratios, we can interpret the parameters as follows (a short sketch for computing the percent changes in the odds follows the list).
- For x1, if a woman receives a screening recommendation, the odds that she is in compliance with screening are about 5.9 times the odds for a woman who does not receive a recommendation, given x2, x3, and x4 are the same. Alternatively (perhaps more intuitively), if a woman receives a screening recommendation, the odds that she is in compliance with screening increase by about 489% (5.889 - 1 = 4.889), given the other variables are the same.
- For x2, if a woman has one unit more knowledge about breast cancer and mammography screening, the odds that she is in compliance with screening decrease by about 58.1% (0.419 - 1 = -0.581; a negative number means a decrease), keeping the other variables constant.
- For x3, if a woman's perception of the benefit increases by one unit, the odds that she is in compliance with screening increase by about 81% (1.81 - 1 = 0.81; a positive number means an increase), keeping the other variables constant.
- For x4, if a woman's perception of the barriers increases by one unit, the odds that she is in compliance with screening decrease by about 14.2% (0.858 - 1 = -0.142; a negative number means a decrease), keeping the other variables constant.
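The percent changes in the odds quoted above can be computed directly from the fitted model. A short sketch, assuming the fitted object m1 from the earlier code:

import numpy as np

odds_ratios = np.exp(m1.params)        # exponentiated coefficients
pct_change = (odds_ratios - 1) * 100   # percent change in the odds per one-unit change
print(pct_change)                      # about +488.9 for x1, -58.1 for x2, +81.0 for x3, -14.2 for x4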
Statistical inference for logistic regression
Statistical inference for logistic regression is very similar to statistical inference for linear regression. We can (1) conduct significance testing for each individual parameter, (2) test the overall model, and (3) test a subset of the predictors.
Test a single coefficient (z-test and confidence interval)
For the regression coefficient of each predictor, we can use a z-test (note: not a t-test). In the output, we have the z-values and corresponding p-values. For x1 and x3, the coefficients are significant at the alpha level 0.05, but those for x2 and x4 are not. Note that some software reports the Wald statistic for testing significance. The Wald statistic is the square of the z-statistic, so the Wald test gives the same conclusion as the z-test.
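The z-statistics and p-values in the output can be reproduced from the estimates and standard errors. A minimal sketch using scipy, with the numbers for x1 taken from the output above:

import numpy as np
from scipy.stats import norm

b1, se1 = 1.7731, 0.484                 # estimate and standard error for x1
z = b1 / se1                            # z-statistic, about 3.66
wald = z ** 2                           # Wald statistic is the square of z, about 13.4
p_value = 2 * (1 - norm.cdf(abs(z)))    # two-sided p-value, roughly 0.0002
print(z, wald, p_value)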
We can also conduct hypothesis testing by constructing confidence intervals. For a fitted model, the conf_int() function can be used to obtain the confidence intervals. Since one is often interested in the odds ratio, its confidence interval can be obtained as well by exponentiating the interval limits.
Note that if the CI for an odds ratio includes 1, the coefficient is not significant; if it does not include 1, the coefficient is significant. This is because for the original coefficient we compare the CI with 0, and for the odds ratio, exp(0)=1.
If we were reporting the results in terms of the odds ratio and its CI, we could say, "The odds of being in compliance with screening increase by a factor of 5.9 for women who receive a screening recommendation (z = 3.66, p = 0.0002; 95% CI: 2.28 to 15.21), everything else being the same."
>>> import pandas as pd
>>> mamm = pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
>>>
>>> ## fit the model
>>> import statsmodels.api as sm
>>> from statsmodels.genmod.families.links import logit
>>> import statsmodels.formula.api as smf
>>> m1 = smf.glm(formula="y ~ x1 + x2 + x3 + x4", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m1.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      159
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -77.741
Date:                Sun, 24 Nov 2024   Deviance:                       155.48
Time:                        13:49:27   Pearson chi2:                     169.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2530
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.4466      1.126     -1.285      0.199      -3.653       0.760
x1             1.7731      0.484      3.663      0.000       0.824       2.722
x2            -0.8694      1.120     -0.776      0.438      -3.065       1.326
x3             0.5935      0.206      2.883      0.004       0.190       0.997
x4            -0.1527      0.154     -0.990      0.322      -0.455       0.149
==============================================================================
>>>
>>> ## confidence interval
>>> conf_int = m1.conf_int(alpha = .05)
>>> conf_int.columns = ['2.5%', '97.5%']
>>> print(conf_int)
               2.5%     97.5%
Intercept -3.652915  0.759669
x1         0.824331  2.721918
x2        -3.064657  1.325874
x3         0.190095  0.996971
x4        -0.454943  0.149492
>>>
>>> ## calculate the odds ratio
>>> import numpy as np
>>> print("Odds Ratios:")
Odds Ratios:
>>> print(np.exp(m1.params))
Intercept    0.235364
x1           5.889224
x2           0.419207
x3           1.810373
x4           0.858365
dtype: float64
>>> print(np.exp(conf_int))
               2.5%      97.5%
Intercept  0.025915   2.137568
x1         2.280354  15.209462
x2         0.046670   3.765475
x3         1.209365   2.710060
x4         0.634484   1.161244
Test the overall model
For linear regression, we evaluate the overall model fit by looking at the variance explained by all the predictors. For logistic regression, we cannot calculate such a variance; instead, we define and evaluate the deviance. For a model without any predictors, we can calculate the null deviance, which plays a role similar to the variance of a normal outcome variable. After including the predictors, we have the residual deviance. The difference between the null deviance and the residual deviance tells how much the predictors help in predicting the outcome. If the difference is significant, then, overall, the predictors are statistically significant.
The difference, or the decrease, in deviance after including the predictors follows a chi-square ($\chi^{2}$) distribution. The chi-square distribution is widely used in statistical inference and has a close relationship to the F distribution: the ratio of two independent chi-square variables, each divided by its degrees of freedom, follows an F distribution, and as the denominator degrees of freedom goes to infinity, the F distribution multiplied by its numerator degrees of freedom converges to a chi-square distribution.
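In symbols, if $X_{1}\sim\chi_{d_{1}}^{2}$ and $X_{2}\sim\chi_{d_{2}}^{2}$ are independent, then
\[ \frac{X_{1}/d_{1}}{X_{2}/d_{2}}\sim F_{d_{1},d_{2}}, \]
and $d_{1}F_{d_{1},d_{2}}$ approaches a $\chi_{d_{1}}^{2}$ distribution as $d_{2}\rightarrow\infty$.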
From the output, we can find the null and residual deviances and the corresponding degrees of freedom, and then calculate their differences. For the mammography example, the difference between the null deviance and the residual deviance is 203.32 - 155.48 = 47.84, and the difference in the degrees of freedom is 163 - 159 = 4. The p-value can then be calculated based on a chi-square distribution with 4 degrees of freedom. Because the p-value is smaller than 0.05, the overall model is significant.
>>> import pandas as pd
>>> mamm = pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
>>>
>>> ## fit the model
>>> import statsmodels.api as sm
>>> from statsmodels.genmod.families.links import logit
>>> import statsmodels.formula.api as smf
>>> m1 = smf.glm(formula="y ~ 1", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m1.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      163
Model Family:                Binomial   Df Model:                            0
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -101.66
Date:                Sun, 24 Nov 2024   Deviance:                       203.32
Time:                        13:59:07   Pearson chi2:                     164.
No. Iterations:                     4   Pseudo R-squ. (CS):          3.331e-16
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.7956      0.169      4.716      0.000       0.465       1.126
==============================================================================
>>>
>>> m2 = smf.glm(formula="y ~ 1 + x1 + x2 + x3 + x4", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m2.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      159
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -77.741
Date:                Sun, 24 Nov 2024   Deviance:                       155.48
Time:                        13:59:07   Pearson chi2:                     169.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2530
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.4466      1.126     -1.285      0.199      -3.653       0.760
x1             1.7731      0.484      3.663      0.000       0.824       2.722
x2            -0.8694      1.120     -0.776      0.438      -3.065       1.326
x3             0.5935      0.206      2.883      0.004       0.190       0.997
x4            -0.1527      0.154     -0.990      0.322      -0.455       0.149
==============================================================================
>>>
>>> ## compare two models
>>> chi_diff = 203.32-155.48
>>> df_diff = 163-159
>>>
>>> from scipy.stats import chi2
>>> print(1 - chi2.cdf(chi_diff, df_diff))
1.0191170130013916e-09
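Alternatively, the deviances do not have to be typed in by hand: statsmodels results objects expose the deviance, the null deviance, and the model degrees of freedom as attributes, so the same test can be computed from the fitted model directly. A sketch, assuming the fitted object m2 from the code above:

from scipy.stats import chi2

## likelihood ratio (deviance) test of the overall model from the fitted results
chi_diff = m2.null_deviance - m2.deviance   # 203.32 - 155.48
df_diff = m2.df_model                       # 4 predictors
print(chi2.sf(chi_diff, df_diff))           # upper-tail p-value, same as 1 - chi2.cdf(...)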
Test a subset of predictors
We can also test the significance of a subset of predictors, for example, whether x3 and x4 are significant above and beyond x1 and x2. This can also be done using a chi-square test based on the difference in deviance. In this case, we compare the model with all predictors to the model without x3 and x4 and check whether the change in deviance is significant. In this example, the p-value is 0.002, indicating the change is significant. Therefore, x3 and x4 are statistically significant above and beyond x1 and x2.
>>> import pandas as pd
>>> mamm = pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
>>>
>>> ## fit the model
>>> import statsmodels.api as sm
>>> from statsmodels.genmod.families.links import logit
>>> import statsmodels.formula.api as smf
>>> m1 = smf.glm(formula="y ~ x1 + x2", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m1.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      161
Model Family:                Binomial   Df Model:                            2
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -84.116
Date:                Sun, 24 Nov 2024   Deviance:                       168.23
Time:                        14:00:18   Pearson chi2:                     163.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.1926
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4479      0.681      0.658      0.511      -0.887       1.782
x1             2.2766      0.451      5.050      0.000       1.393       3.160
x2            -0.7159      1.033     -0.693      0.488      -2.740       1.308
==============================================================================
>>>
>>> m2 = smf.glm(formula="y ~ x1 + x2 + x3 + x4", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m2.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      159
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -77.741
Date:                Sun, 24 Nov 2024   Deviance:                       155.48
Time:                        14:00:18   Pearson chi2:                     169.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2530
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.4466      1.126     -1.285      0.199      -3.653       0.760
x1             1.7731      0.484      3.663      0.000       0.824       2.722
x2            -0.8694      1.120     -0.776      0.438      -3.065       1.326
x3             0.5935      0.206      2.883      0.004       0.190       0.997
x4            -0.1527      0.154     -0.990      0.322      -0.455       0.149
==============================================================================
>>>
>>> ## compare two models
>>> chi_diff = 168.23-155.48
>>> df_diff = 161-159
>>>
>>> from scipy.stats import chi2
>>> print(1 - chi2.cdf(chi_diff, df_diff))
0.00170361979580258
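As before, the deviances and residual degrees of freedom can also be pulled directly from the two fitted results objects rather than copied from the printed output. A short sketch, assuming m1 (the reduced model) and m2 (the full model) from the code above:

from scipy.stats import chi2

chi_diff = m1.deviance - m2.deviance    # 168.23 - 155.48
df_diff = m1.df_resid - m2.df_resid     # 161 - 159 = 2
print(chi2.sf(chi_diff, df_diff))       # about 0.0017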