Logistic Regression
Logistic regression is widely used in social and behavioral research to analyze binary (dichotomous) outcome data. In logistic regression, the outcome can take only two values, 0 and 1. Some examples that can utilize logistic regression are given below.
- Whether a Democratic or Republican president is elected can depend on factors such as the state of the economy, the amount of money spent on the campaign, and the gender and income of the voters.
- Whether an assistant professor is tenured may be predicted from the number of publications and teaching performance in the first three years.
- Whether or not someone has a heart attack may be related to age, gender and living habits.
- Whether a student is admitted may be predicted by her/his high school GPA, SAT score, and quality of recommendation letters.
We use an example to illustrate how to conduct logistic regression in Python.
An example
In this example, the aim is to predict whether a woman is in compliance with mammography screening recommendations from four predictors, one reflecting medical input and three reflecting the woman's psychological status with regard to screening. The data are in the file mamm.csv.
- Outcome y: whether a woman is in compliance with mammography screening recommendations (1: in compliance; 0: not in compliance)
- Predictors:
- x1: whether she has received a recommendation for screening from a physician;
- x2: her knowledge about breast cancer and mammography screening;
- x3: her perception of benefit of such a screening;
- x4: her perception of the barriers to being screened.
>>> import pandas as pd
>>> pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
     y  x1    x2  x3  x4
0    1   1  0.22   4   3
1    0   0  0.56   1   1
2    1   0  0.44   4   3
3    0   0  0.33   3   0
4    1   1  0.44   5   0
..  ..  ..   ...  ..  ..
159  1   1  0.89   5   1
160  0   0  0.78   5   1
161  1   1  0.78   5   0
162  1   1  0.56   3   0
163  1   0  0.67   4   1

[164 rows x 5 columns]
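Before fitting a model, it can be useful to take a quick look at the data. Below is a minimal sketch using standard pandas functions (output omitted; the exact numbers depend on the data):

## a quick descriptive look at the data (sketch)
import pandas as pd

mamm = pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
print(mamm['y'].value_counts())   # how many women are / are not in compliance
print(mamm.describe())            # basic summary statistics for all five variables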
Basic ideas
With a binary outcome, linear regression no longer works well. Simply speaking, the predictors can take any value but the outcome cannot, so a linear regression cannot predict the outcome well. To deal with this problem, we instead model the probability of observing the outcome 1, that is, $p = \Pr(y=1)$. In the mammography example, this is the probability that a woman is in compliance with the screening recommendation.
Intuitively, directly modeling the probability should work better than predicting the 0/1 outcome itself. A remaining problem is that the probability is bounded between 0 and 1, whereas the predicted values from a linear model generally are not. To deal with this, we apply a transformation:
\[ \eta = \log\frac{p}{1-p}.\]
After the transformation, $\eta$ can take any value, from $-\infty$ when $p=0$ to $\infty$ when $p=1$. This transformation is called the logit transformation, denoted by $\text{logit}(p)$. Note that $p/(1-p)$ is called the odds, which is simply the ratio of the probabilities of the two possible outcomes. For example, if the probability that a woman is in compliance is 0.8, then her odds are 0.8/(1-0.8)=4. Clearly, when the two outcomes are equally probable, the odds equal 1; if the odds are greater than 1, the probability of observing the outcome 1 is higher than 0.5. With this transformation, $\eta$ can be modeled directly.
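As a quick numeric illustration of these definitions, the example above can be restated in code (a minimal sketch using only numpy):

import numpy as np

p = 0.8                  # probability of being in compliance
odds = p / (1 - p)       # odds = 0.8/0.2 = 4
eta = np.log(odds)       # logit(p) = log(4), about 1.386
print(odds, eta)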
Therefore, the logistic regression model is
\[ \mbox{logit}(p_{i})=\log(\frac{p_{i}}{1-p_{i}})=\eta_i=\beta_{0}+\beta_{1}x_{1i}+\ldots+\beta_{k}x_{ki} \]
where $p_i = \Pr(y_i = 1)$. Unlike the regular linear regression model, there is no residual term in the model.
Why is this?
For a variable $y$ with two and only two outcome values, it is often assumed to follow a Bernoulli or binomial distribution, with probability $p$ for the outcome 1 and probability $1-p$ for the outcome 0. The density function is
\[ p^y (1-p)^{1-y}. \]
Note that when $y=1$, $p^y (1-p)^{1-y} = p$ exactly.
Furthermore, we assume there is a continuous variable $y^*$ underlying the observed binary variable. If the continuous variable takes a value larger than a certain threshold, we observe 1, otherwise 0. For logistic regression, we assume this continuous variable follows a logistic distribution, whose cumulative distribution function is $1/(1+e^{-y^{*}})$. The probability of observing 1 can therefore be calculated directly from the logistic distribution as
\[ p = \frac{1}{1 + e^{-y^*}},\]
which transforms to
\[ \log\frac{p}{1-p} = y^*.\]
Since $y^*$ is a continuous variable, it can be predicted as in a regular regression model.
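The logit transformation and its inverse (the logistic function) can be checked numerically. Below is a small sketch using only numpy:

import numpy as np

def logit(p):
    # log odds of a probability
    return np.log(p / (1 - p))

def inv_logit(eta):
    # logistic function: maps any real value back into (0, 1)
    return 1 / (1 + np.exp(-eta))

p = 0.7
eta = logit(p)            # about 0.847
print(inv_logit(eta))     # recovers 0.7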
Fitting a logistic regression model in Python
In Python, the model can be estimated using the glm() function of the statsmodels library. Logistic regression is one example of the generalized linear model (GLM). The analysis of the mammography data is given below.
- glm() uses the same model formula syntax as the linear regression model.
- family = specifies the distribution of the outcome variable. For binary data, the binomial distribution is used.
- link = specifies the transformation method. Here, the logit transformation is used.
- The output includes the regression coefficients along with their z-statistics and p-values.
- The scale (dispersion) parameter is related to the variance of the response variable.
>>> import pandas as pd
>>> mamm = pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
>>>
>>> ## fit the model
>>> import statsmodels.api as sm
>>> from statsmodels.genmod.families.links import logit
>>> import statsmodels.formula.api as smf
>>> m1 = smf.glm(formula="y ~ x1 + x2 + x3 + x4", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m1.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      159
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -77.741
Date:                Sun, 24 Nov 2024   Deviance:                       155.48
Time:                        13:43:16   Pearson chi2:                     169.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2530
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.4466      1.126     -1.285      0.199      -3.653       0.760
x1             1.7731      0.484      3.663      0.000       0.824       2.722
x2            -0.8694      1.120     -0.776      0.438      -3.065       1.326
x3             0.5935      0.206      2.883      0.004       0.190       0.997
x4            -0.1527      0.154     -0.990      0.322      -0.455       0.149
==============================================================================
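Once the model has been fitted, the predicted probability of compliance for each woman can be obtained from the fitted results. A short sketch, assuming the fitted object m1 and the data frame mamm from the code above (the 0.5 cutoff for classification is an illustrative choice, not part of the model):

## predicted probabilities and a simple classification check (sketch)
p_hat = m1.predict()                  # fitted probabilities for the observed data
print(p_hat[:5])                      # predicted probabilities for the first five women
y_hat = (p_hat > 0.5).astype(int)     # classify as in compliance if probability > 0.5
print((y_hat == mamm['y']).mean())    # proportion classified correctly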
Interpret the results
We first focus on how to interpret the parameter estimates from the analysis. For the intercept, when all the predictors take the value 0, we have
\[ \beta_0 = \log(\frac{p}{1-p}), \]
which is the log odds that the observed outcome is 1.
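For the mammography data, the estimated intercept is -1.4466, so the baseline odds and probability of compliance when all predictors equal 0 can be recovered as in the small sketch below:

import numpy as np

b0 = -1.4466              # estimated intercept from the output above
odds0 = np.exp(b0)        # baseline odds, about 0.235
p0 = odds0 / (1 + odds0)  # baseline probability, about 0.19
print(odds0, p0)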
We now look at the coefficient of each predictor. For the mammography example, let's hold $x_2$, $x_3$, and $x_4$ constant and look at $x_1$ only. If a woman has received a recommendation ($x_1=1$), the log odds is
\[ \log(\frac{p}{1-p})|(x_1=1)=\beta_{0}+\beta_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+\beta_{4}x_{4}.\]
If a woman has not received a recommendation ($x_1=0$), the log odds is
\[\log(\frac{p}{1-p})|(x_1=0)=\beta_{0}+\beta_{2}x_{2}+\beta_{3}x_{3}+\beta_{4}x_{4}.\]
The difference is
\[\log(\frac{p}{1-p})|(x_1=1)-\log(\frac{p}{1-p})|(x_1=0)=\beta_{1}.\]
Therefore, the logistic regression coefficient for a predictor is the difference in the log odds when the predictor increases by 1 unit, with the other predictors held unchanged.
The above equation is equivalent to
\[\log\left(\frac{\frac{p(x_1=1)}{1-p(x_1=1)}}{\frac{p(x_1=0)}{1-p(x_1=0)}}\right)=\beta_{1}.\]
More descriptively, we have
\[\log\left(\frac{\mbox{ODDS(received recommendation)}}{\mbox{ODDS(not received recommendation)}}\right)=\beta_{1}.\]
Therefore, the regression coefficient is the log odds ratio. By a simple transformation, we have
\[\frac{\mbox{ODDS(received recommendation)}}{\mbox{ODDS(not received recommendation)}}=\exp(\beta_{1})\]
or
\[\mbox{ODDS(received recommendation)} = \exp(\beta_{1})*\mbox{ODDS(not received recommendation)}.\]
Therefore, the exponential of a regression coefficient is the odds ratio. For the example, $\exp(\beta_{1})=\exp(1.7731)=5.9$. Thus, the odds of being in compliance with screening for women who received a recommendation are about 5.9 times the odds for those who did not.
For continuous predictors, the regression coefficients can be interpreted the same way. For example, if the odds ratio for high school GPA were 6, we could say that a one-unit increase in GPA multiplies the odds of being admitted by 6, given the other variables are the same.
Although the output does not directly show odds ratios, they can be calculated easily in Python as shown below.
>>> import pandas as pd
>>> mamm = pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
>>>
>>> ## fit the model
>>> import statsmodels.api as sm
>>> from statsmodels.genmod.families.links import logit
>>> import statsmodels.formula.api as smf
>>> m1 = smf.glm(formula="y ~ x1 + x2 + x3 + x4", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m1.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      159
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -77.741
Date:                Sun, 24 Nov 2024   Deviance:                       155.48
Time:                        13:43:16   Pearson chi2:                     169.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2530
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.4466      1.126     -1.285      0.199      -3.653       0.760
x1             1.7731      0.484      3.663      0.000       0.824       2.722
x2            -0.8694      1.120     -0.776      0.438      -3.065       1.326
x3             0.5935      0.206      2.883      0.004       0.190       0.997
x4            -0.1527      0.154     -0.990      0.322      -0.455       0.149
==============================================================================
>>>
>>> ## calculate the odds ratio
>>> import numpy as np
>>> print("Odds Ratios:")
Odds Ratios:
>>> print(np.exp(m1.params))
Intercept    0.235364
x1           5.889224
x2           0.419207
x3           1.810373
x4           0.858365
dtype: float64
Using odds ratios, we can interpret the parameters as follows (a short sketch for computing the percent changes in the odds follows the list).
- For x1, if a woman receives a screening recommendation, the odds that she is in compliance with screening are about 5.9 times the odds for a woman who does not receive a recommendation, given x2, x3, and x4 are the same. Alternatively (perhaps more intuitively), if a woman receives a screening recommendation, the odds that she is in compliance with screening increase by about 489% (5.889 - 1 = 4.889), given the other variables are the same.
- For x2, if a woman has one unit more knowledge about breast cancer and mammography screening, the odds that she is in compliance with screening decrease by about 58.1% (0.419 - 1 = -0.581; a negative number means a decrease), keeping the other variables constant.
- For x3, if a woman's perception of the benefit increases by one unit, the odds that she is in compliance with screening increase by about 81% (1.81 - 1 = 0.81; a positive number means an increase), keeping the other variables constant.
- For x4, if a woman's perception of the barriers increases by one unit, the odds that she is in compliance with screening decrease by about 14.2% (0.858 - 1 = -0.142; a negative number means a decrease), keeping the other variables constant.
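The percent changes in the odds quoted above can be computed directly from the fitted model. A short sketch, assuming the fitted object m1 from the earlier code:

import numpy as np

odds_ratios = np.exp(m1.params)        # exponentiated coefficients
pct_change = (odds_ratios - 1) * 100   # percent change in the odds per one-unit change
print(pct_change)                      # about +488.9 for x1, -58.1 for x2, +81.0 for x3, -14.2 for x4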
Statistical inference for logistic regression
Statistical inference for logistic regression is very similar to statistical inference for linear regression. We can (1) conduct significance testing for each individual parameter, (2) test the overall model, and (3) test a subset of the predictors.
Test a single coefficient (z-test and confidence interval)
For the regression coefficient of each predictor, we can use a z-test (note: not a t-test). In the output, we have the z-values and corresponding p-values. For x1 and x3, the coefficients are significant at the alpha level 0.05, but those for x2 and x4 are not. Note that some software reports the Wald statistic for testing significance. The Wald statistic is the square of the z-statistic, so the Wald test gives the same conclusion as the z-test.
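The z-statistics and p-values in the output can be reproduced from the estimates and standard errors. A minimal sketch using scipy, with the numbers for x1 taken from the output above:

import numpy as np
from scipy.stats import norm

b1, se1 = 1.7731, 0.484                 # estimate and standard error for x1
z = b1 / se1                            # z-statistic, about 3.66
wald = z ** 2                           # Wald statistic is the square of z, about 13.4
p_value = 2 * (1 - norm.cdf(abs(z)))    # two-sided p-value, roughly 0.0002
print(z, wald, p_value)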
We can also conduct hypothesis testing by constructing confidence intervals. For a fitted model, the conf_int() function can be used to obtain the confidence intervals. Since one is often interested in the odds ratio, its confidence interval can be obtained as well by exponentiating the interval limits.
Note that if the CI for an odds ratio includes 1, the coefficient is not significant; if it does not include 1, the coefficient is significant. This is because for the original coefficient we compare the CI with 0, and for the odds ratio, exp(0)=1.
If we were reporting the results in terms of the odds ratio and its CI, we could say, "The odds of being in compliance with screening increase by a factor of 5.9 for women who receive a screening recommendation (z = 3.66, p = 0.0002; 95% CI: 2.28 to 15.21), everything else being the same."
>>> import pandas as pd
>>> mamm = pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
>>>
>>> ## fit the model
>>> import statsmodels.api as sm
>>> from statsmodels.genmod.families.links import logit
>>> import statsmodels.formula.api as smf
>>> m1 = smf.glm(formula="y ~ x1 + x2 + x3 + x4", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m1.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      159
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -77.741
Date:                Sun, 24 Nov 2024   Deviance:                       155.48
Time:                        13:49:27   Pearson chi2:                     169.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2530
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.4466      1.126     -1.285      0.199      -3.653       0.760
x1             1.7731      0.484      3.663      0.000       0.824       2.722
x2            -0.8694      1.120     -0.776      0.438      -3.065       1.326
x3             0.5935      0.206      2.883      0.004       0.190       0.997
x4            -0.1527      0.154     -0.990      0.322      -0.455       0.149
==============================================================================
>>>
>>> ## confidence interval
>>> conf_int = m1.conf_int(alpha = .05)
>>> conf_int.columns = ['2.5%', '97.5%']
>>> print(conf_int)
               2.5%     97.5%
Intercept -3.652915  0.759669
x1         0.824331  2.721918
x2        -3.064657  1.325874
x3         0.190095  0.996971
x4        -0.454943  0.149492
>>>
>>> ## calculate the odds ratio
>>> import numpy as np
>>> print("Odds Ratios:")
Odds Ratios:
>>> print(np.exp(m1.params))
Intercept    0.235364
x1           5.889224
x2           0.419207
x3           1.810373
x4           0.858365
dtype: float64
>>> print(np.exp(conf_int))
               2.5%      97.5%
Intercept  0.025915   2.137568
x1         2.280354  15.209462
x2         0.046670   3.765475
x3         1.209365   2.710060
x4         0.634484   1.161244
Test the overall model
For linear regression, we evaluate the overall model fit by looking at the variance explained by all the predictors. For logistic regression, we cannot calculate such a variance; instead, we define and evaluate the deviance. For a model without any predictors, we can calculate the null deviance, which plays a role similar to the variance of a normal outcome variable. After including the predictors, we have the residual deviance. The difference between the null deviance and the residual deviance tells how much the predictors help in predicting the outcome. If the difference is significant, then, overall, the predictors are statistically significant.
The difference, or the decrease, in deviance after including the predictors follows a chi-square ($\chi^{2}$) distribution. The chi-square distribution is widely used in statistical inference and has a close relationship to the F distribution: the ratio of two independent chi-square variables, each divided by its degrees of freedom, follows an F distribution, and as the denominator degrees of freedom goes to infinity, the F distribution multiplied by its numerator degrees of freedom converges to a chi-square distribution.
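In symbols, if $X_{1}\sim\chi_{d_{1}}^{2}$ and $X_{2}\sim\chi_{d_{2}}^{2}$ are independent, then
\[ \frac{X_{1}/d_{1}}{X_{2}/d_{2}}\sim F_{d_{1},d_{2}}, \]
and $d_{1}F_{d_{1},d_{2}}$ approaches a $\chi_{d_{1}}^{2}$ distribution as $d_{2}\rightarrow\infty$.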
From the output, we can find the null and residual deviances and the corresponding degrees of freedom, and then calculate their differences. For the mammography example, the difference between the null deviance and the residual deviance is 203.32 - 155.48 = 47.84, and the difference in the degrees of freedom is 163 - 159 = 4. The p-value can then be calculated based on a chi-square distribution with 4 degrees of freedom. Because the p-value is smaller than 0.05, the overall model is significant.
>>> import pandas as pd
>>> mamm = pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
>>>
>>> ## fit the model
>>> import statsmodels.api as sm
>>> from statsmodels.genmod.families.links import logit
>>> import statsmodels.formula.api as smf
>>> m1 = smf.glm(formula="y ~ 1", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m1.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      163
Model Family:                Binomial   Df Model:                            0
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -101.66
Date:                Sun, 24 Nov 2024   Deviance:                       203.32
Time:                        13:59:07   Pearson chi2:                     164.
No. Iterations:                     4   Pseudo R-squ. (CS):          3.331e-16
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.7956      0.169      4.716      0.000       0.465       1.126
==============================================================================
>>>
>>> m2 = smf.glm(formula="y ~ 1 + x1 + x2 + x3 + x4", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m2.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      159
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -77.741
Date:                Sun, 24 Nov 2024   Deviance:                       155.48
Time:                        13:59:07   Pearson chi2:                     169.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2530
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.4466      1.126     -1.285      0.199      -3.653       0.760
x1             1.7731      0.484      3.663      0.000       0.824       2.722
x2            -0.8694      1.120     -0.776      0.438      -3.065       1.326
x3             0.5935      0.206      2.883      0.004       0.190       0.997
x4            -0.1527      0.154     -0.990      0.322      -0.455       0.149
==============================================================================
>>>
>>> ## compare two models
>>> chi_diff = 203.32-155.48
>>> df_diff = 163-159
>>>
>>> from scipy.stats import chi2
>>> print(1 - chi2.cdf(chi_diff, df_diff))
1.0191170130013916e-09
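Alternatively, the deviances do not have to be typed in by hand: statsmodels results objects expose the deviance, the null deviance, and the model degrees of freedom as attributes, so the same test can be computed from the fitted model directly. A sketch, assuming the fitted object m2 from the code above:

from scipy.stats import chi2

## likelihood ratio (deviance) test of the overall model from the fitted results
chi_diff = m2.null_deviance - m2.deviance   # 203.32 - 155.48
df_diff = m2.df_model                       # 4 predictors
print(chi2.sf(chi_diff, df_diff))           # upper-tail p-value, same as 1 - chi2.cdf(...)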
Test a subset of predictors
We can also test the significance of a subset of predictors, for example, whether x3 and x4 are significant above and beyond x1 and x2. This can also be done using a chi-square test based on the difference in deviance. In this case, we compare the model with all predictors to the model without x3 and x4 and check whether the change in deviance is significant. In this example, the p-value is 0.002, indicating the change is significant. Therefore, x3 and x4 are statistically significant above and beyond x1 and x2.
>>> import pandas as pd
>>> mamm = pd.read_csv("https://advstats.psychstat.org/data/mamm.csv")
>>>
>>> ## fit the model
>>> import statsmodels.api as sm
>>> from statsmodels.genmod.families.links import logit
>>> import statsmodels.formula.api as smf
>>> m1 = smf.glm(formula="y ~ x1 + x2", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m1.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      161
Model Family:                Binomial   Df Model:                            2
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -84.116
Date:                Sun, 24 Nov 2024   Deviance:                       168.23
Time:                        14:00:18   Pearson chi2:                     163.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.1926
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4479      0.681      0.658      0.511      -0.887       1.782
x1             2.2766      0.451      5.050      0.000       1.393       3.160
x2            -0.7159      1.033     -0.693      0.488      -2.740       1.308
==============================================================================
>>>
>>> m2 = smf.glm(formula="y ~ x1 + x2 + x3 + x4", data=mamm, family=sm.families.Binomial(link=logit())).fit()
>>> print(m2.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  164
Model:                            GLM   Df Residuals:                      159
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -77.741
Date:                Sun, 24 Nov 2024   Deviance:                       155.48
Time:                        14:00:18   Pearson chi2:                     169.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2530
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.4466      1.126     -1.285      0.199      -3.653       0.760
x1             1.7731      0.484      3.663      0.000       0.824       2.722
x2            -0.8694      1.120     -0.776      0.438      -3.065       1.326
x3             0.5935      0.206      2.883      0.004       0.190       0.997
x4            -0.1527      0.154     -0.990      0.322      -0.455       0.149
==============================================================================
>>>
>>> ## compare two models
>>> chi_diff = 168.23-155.48
>>> df_diff = 161-159
>>>
>>> from scipy.stats import chi2
>>> print(1 - chi2.cdf(chi_diff, df_diff))
0.00170361979580258
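As before, the deviances and residual degrees of freedom can also be pulled directly from the two fitted results objects rather than copied from the printed output. A short sketch, assuming m1 (the reduced model) and m2 (the full model) from the code above:

from scipy.stats import chi2

chi_diff = m1.deviance - m2.deviance    # 168.23 - 155.48
df_diff = m1.df_resid - m2.df_resid     # 161 - 159 = 2
print(chi2.sf(chi_diff, df_diff))       # about 0.0017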