
Logistic Regression

Motivation

Note: In this notebook, we will treat $y_i$ as a binary variable that can take on only the values 1 or 0.

In a linear probability model, the relationship between the predictors and the probability of the binary outcome is assumed to be linear. However, one of the fundamental limitations of the linear probability model is that it can predict probabilities that fall outside the logical bounds of 0 and 1. This happens because, in a linear model, the estimated probability is a linear function of the predictors, and there's nothing inherent in the linear formulation to restrict the predicted probability to lie between 0 and 1.

This limitation can lead to nonsensical predictions in practical applications. For instance, with a sufficiently large positive or negative predictor value, the linear model might predict a probability greater than 1 or less than 0, which is not meaningful in a probabilistic context. Such predictions defy the basic principles of probability and can significantly impair the interpretability and usefulness of the model.

Logistic regression, on the other hand, overcomes this limitation by using a logistic function to model the probability. The logistic function takes any input from the linear combination of predictors and transforms it into a value between 0 and 1. This transformation ensures that the model's outputs are sensible probabilities, regardless of the values of the predictors. Thus, logistic regression is inherently more suitable for modeling probabilities, as it respects the probabilistic boundaries and provides more reliable and interpretable results in scenarios where the dependent variable is binary.
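As a small illustration (a minimal sketch in Python, with hypothetical index values), the logistic function maps any real-valued linear index into the interval (0, 1), whereas a linear probability model would happily return values outside it:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear-index values, including extreme ones that a
# linear probability model would map outside [0, 1].
z = np.array([-8.0, -2.0, 0.0, 2.0, 8.0])
print(logistic(z))  # ~[0.0003, 0.119, 0.5, 0.881, 0.9997]
```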

Random Utility Basis

Consider the following dataset as an illustration:

|     | inlf | nwifeinc | educ | exper | expersq | age | kidslt6 | kidsge6 | Utility |
|-----|------|----------|------|-------|---------|-----|---------|---------|---------|
| 0   | 1    | 10.9101  | 12   | 14    | 196     | 32  | 1       | 0       | $U_L>U_H$ |
| 1   | 1    | 19.5     | 12   | 5     | 25      | 30  | 0       | 2       | $U_L>U_H$ |
| 2   | 1    | 12.0399  | 12   | 15    | 225     | 35  | 1       | 3       | $U_L>U_H$ |
| 3   | 1    | 6.8      | 12   | 6     | 36      | 34  | 0       | 3       | $U_L>U_H$ |
| 4   | 1    | 20.1001  | 14   | 7     | 49      | 31  | 1       | 2       | $U_L>U_H$ |
| 5   | 1    | 9.85905  | 12   | 33    | 1089    | 54  | 0       | 0       | $U_L>U_H$ |
| ... | ...  | ...      | ...  | ...   | ...     | ... | ...     | ...     | ...     |
| 749 | 0    | 10       | 12   | 14    | 196     | 31  | 2       | 3       | $U_H>U_L$ |
| 750 | 0    | 9.952    | 12   | 4     | 16      | 43  | 0       | 0       | $U_H>U_L$ |
| 751 | 0    | 24.984   | 12   | 15    | 225     | 60  | 0       | 0       | $U_H>U_L$ |
| 752 | 0    | 28.363   | 9    | 12    | 144     | 39  | 0       | 3       | $U_H>U_L$ |

Variables are defined as follows:

  • inlf: 1 if in the labor force in 1975, 0 otherwise
  • nwifeinc: (faminc - wage*hours)/1000
  • educ: years of schooling
  • exper: actual labor market experience
  • expersq: exper^2
  • age: woman's age in years
  • kidslt6: # kids < 6 years
  • kidsge6: # kids 6-18
  • Utility: $U_L>U_H$ if `inlf == 1`; $U_H>U_L$ if `inlf == 0`
  • $U_H>U_L$: the utility of staying at home is higher than the utility of being in the labor force
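The table above is the familiar MROZ labor-force-participation sample from Wooldridge. As a hedged sketch of how it might be loaded, assuming the third-party `wooldridge` helper package is installed (any CSV copy of the dataset works equally well with pandas):

```python
# A sketch, assuming `pip install wooldridge`; otherwise read a
# CSV copy of the MROZ dataset with pandas instead.
import wooldridge

df = wooldridge.data('mroz')
cols = ['inlf', 'nwifeinc', 'educ', 'exper', 'expersq',
        'age', 'kidslt6', 'kidsge6']
print(df[cols].head())
```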

Utility model

  • $U_{iL} = \beta_0 + \beta_1\text{nwifeinc} + \beta_2 \text{educ} + \beta_3 \text{exper} + \beta_4 \text{expersq} + \beta_5 \text{age} + \beta_6 \text{kidslt6} + \beta_7 \text{kidsge6} + \varepsilon_{iL}$
  • $U_{iH} = \alpha_0 + \alpha_1\text{nwifeinc} + \alpha_2 \text{educ} + \alpha_3 \text{exper} + \alpha_4 \text{expersq} + \alpha_5 \text{age} + \alpha_6 \text{kidslt6} + \alpha_7 \text{kidsge6} + \varepsilon_{iH}$
    Question: Can we think of the above two equations as two different regressions? (Yes.)
  • $U_{iL}-U_{iH}=(\beta_0-\alpha_0)+ (\beta_1-\alpha_1)\text{nwifeinc} + (\beta_2-\alpha_2) \text{educ} + (\beta_3-\alpha_3) \text{exper} + (\beta_4-\alpha_4) \text{expersq} + (\beta_5-\alpha_5) \text{age} + (\beta_6-\alpha_6) \text{kidslt6} + (\beta_7-\alpha_7) \text{kidsge6} + \varepsilon_{iL}-\varepsilon_{iH}=\bold{x}_i'\boldsymbol{\gamma} + u_i$
    where
$$\bold{x}_i=\begin{bmatrix} \text{const}\\ \text{nwifeinc}\\ \text{educ}\\ \text{exper}\\ \text{expersq}\\ \text{age}\\ \text{kidslt6}\\ \text{kidsge6} \end{bmatrix}, \quad \boldsymbol{\gamma}=\begin{bmatrix} \beta_0-\alpha_0\\ \beta_1-\alpha_1\\ \beta_2-\alpha_2\\ \beta_3-\alpha_3\\ \beta_4-\alpha_4\\ \beta_5-\alpha_5\\ \beta_6-\alpha_6\\ \beta_7-\alpha_7 \end{bmatrix} \text{ and } u_i=\varepsilon_{iL}-\varepsilon_{iH}.$$

Let $P[\text{inlf}_i=1|\bold{x}_i]$ be the probability that individual $i$ chooses $1$, in our case that s/he is in the LABOR FORCE. Observe the following equivalence:

$$\begin{align*} P[\text{inlf}_i=1|\bold{x}_i] &\equiv P[U_{iL}>U_{iH}|\bold{x}_i]\\ &=P[(U_{iL}-U_{iH})>0|\bold{x}_i]\\ &=P[(\bold{x}_i'\boldsymbol{\gamma} + u_i)>0|\bold{x}_i]\\ &=P[u_i>-\bold{x}_i'\boldsymbol{\gamma} \,|\,\bold{x}_i] \end{align*}$$

If the distribution of uiu_i is symmetric, then

$$P[\text{inlf}_i=1|\bold{x}_i] = P[u_i<\bold{x}_i'\boldsymbol{\gamma} \,|\,\bold{x}_i]=F(\bold{x}_i'\boldsymbol{\gamma}),$$

where $F$ is the CDF of $u_i$.

Assume that $u_i$ has a logistic distribution with $\mu=0$ and $\sigma^2=\pi^2/3$; then

$$P[\text{inlf}_i=1|\bold{x}_i] = F(\bold{x}_i'\boldsymbol{\gamma})=\frac{e^{\bold{x}_i'\boldsymbol{\gamma}}}{1+e^{\bold{x}_i'\boldsymbol{\gamma}}}.$$

Similarly

$$P[\text{inlf}_i=0|\bold{x}_i] = 1-F(\bold{x}_i'\boldsymbol{\gamma})=\frac{1}{1+e^{\bold{x}_i'\boldsymbol{\gamma}}}.$$
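A quick numeric sanity check of the last two formulas (a sketch with a hypothetical value for the index $\bold{x}_i'\boldsymbol{\gamma}$): both probabilities lie in (0, 1) and sum to one.

```python
import numpy as np

xg = 0.7                            # hypothetical value of x_i'γ
p1 = np.exp(xg) / (1 + np.exp(xg))  # P[inlf = 1 | x]
p0 = 1 / (1 + np.exp(xg))           # P[inlf = 0 | x]
print(p1, p0, p1 + p0)              # 0.668..., 0.331..., 1.0
```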

Aside

Let $X\sim \text{Logistic}(\mu,s)$, where $\mu$ is the mean and $s$ is the scale. The variance of $X$ is $\frac{s^2\pi^2}{3}$ and the CDF of $X$ is as follows:

$$F(x)=\frac{e^{\frac{x-\mu}{s}}}{1+e^{\frac{x-\mu}{s}}}.$$

In our case, $\mu=0$ and $s=1$; therefore

$$F(\bold{x}_i'\boldsymbol{\gamma})=\frac{e^{\bold{x}_i'\boldsymbol{\gamma}}}{1+e^{\bold{x}_i'\boldsymbol{\gamma}}}.$$
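These two facts are easy to verify numerically with `scipy.stats.logistic`, whose `loc` and `scale` arguments correspond to $\mu$ and $s$:

```python
import numpy as np
from scipy.stats import logistic

# Variance of the standard logistic (mu = 0, s = 1) equals pi^2 / 3.
print(logistic.var(loc=0, scale=1))  # 3.2898...
print(np.pi**2 / 3)                  # 3.2898...

# Its CDF matches e^x / (1 + e^x).
x = 0.7
print(logistic.cdf(x), np.exp(x) / (1 + np.exp(x)))
```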

Estimation

Likelihood function, $f_i$

$$\begin{align*} f_i&=\Big[P[\text{inlf}_i=1|\bold{x}_i]\Big]^{y_i}\cdot \Big[P[\text{inlf}_i=0|\bold{x}_i]\Big]^{1-y_i}\\ &= [F(\bold{x}_i'\boldsymbol{\gamma})]^{y_i} \cdot [1- F(\bold{x}_i'\boldsymbol{\gamma})]^{1-y_i}, \end{align*}$$

where $y_i\in\{0,1\}$.

Log-likelihood function, $\log f_i$

$$\log f_i= y_i\log[F(\bold{x}_i'\boldsymbol{\gamma})] + (1-y_i)\log[1- F(\bold{x}_i'\boldsymbol{\gamma})].$$

Summing this over all $n$ observations,

$$\sum_{i=1}^n\log f_i= \sum_{i=1}^n \bigg[y_i\log[F(\bold{x}_i'\boldsymbol{\gamma})] + (1-y_i)\log[1- F(\bold{x}_i'\boldsymbol{\gamma})]\bigg].$$

Find the $\boldsymbol{\gamma}$ that maximizes the above sum. Is this $\boldsymbol{\gamma}$ unique? (Yes: the logit log-likelihood is globally concave in $\boldsymbol{\gamma}$, so the maximizer is unique provided no regressor is a perfect linear combination of the others.)
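In practice this maximization is done numerically. Below is a minimal sketch using `statsmodels`, whose `Logit` estimator maximizes exactly the log-likelihood sum above; the column names assume the MROZ data loaded earlier.

```python
import statsmodels.formula.api as smf
import wooldridge  # assumption: same helper package as above

df = wooldridge.data('mroz')

# Numerical maximization of the logit log-likelihood.
res = smf.logit(
    'inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6',
    data=df,
).fit()
print(res.summary())  # estimated gamma-hat with standard errors
```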

Interpretation of the coefficients

Partial Effect

We have

$$P[y_i=1|\bold{x}_i] = F(\bold{x}_i'\boldsymbol{\gamma})=\frac{e^{\bold{x}_i'\boldsymbol{\gamma}}}{1+e^{\bold{x}_i'\boldsymbol{\gamma}}}=\frac{e^{\gamma_0+\gamma_1 x_1+\dots+\gamma_j x_j+\dots}}{1+e^{\gamma_0+\gamma_1 x_1+\dots+\gamma_j x_j+\dots}}.$$

Taking the derivative with respect to $x_j$,

$$\frac{\partial P[y_i=1|\bold{x}_i]}{\partial x_j}=\frac{e^{\gamma_0+\gamma_1 x_1+\dots+\gamma_j x_j+\dots}}{(1+e^{\gamma_0+\gamma_1 x_1+\dots+\gamma_j x_j+\dots})^2}\cdot \gamma_j = F(\bold{x}_i'\boldsymbol{\gamma})\cdot[1-F(\bold{x}_i'\boldsymbol{\gamma})]\cdot \gamma_j$$

Aside

Derivative with respect to the vector $\bold{x}$

$$\frac{\partial F(\bold{x}'\boldsymbol{\gamma})}{\partial \bold{x}}=\frac{d F(\bold{x}'\boldsymbol{\gamma})}{d (\bold{x}'\boldsymbol{\gamma})} \cdot \boldsymbol{\gamma}=f(\bold{x}'\boldsymbol{\gamma}) \cdot \boldsymbol{\gamma}=F(\bold{x}'\boldsymbol{\gamma})\cdot[1-F(\bold{x}'\boldsymbol{\gamma})]\cdot \boldsymbol{\gamma}$$
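With a fitted model in hand, averaging $F(\bold{x}_i'\boldsymbol{\gamma})[1-F(\bold{x}_i'\boldsymbol{\gamma})]\gamma_j$ over the sample gives the average partial effect. A sketch, reusing the `res` object from the estimation sketch above:

```python
import numpy as np

# Average partial effects computed by statsmodels...
ape = res.get_margeff(at='overall')
print(ape.summary())

# ...and the same quantity by hand for one regressor, e.g. educ.
F = res.predict()                # fitted probabilities F(x_i'γ)
g = res.params['educ']
print(np.mean(F * (1 - F) * g))  # matches the margeff row for educ
```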


Complex Case

So far we have assumed that all of the independent variables, {nwifeinc, educ, exper, expersq, age, kidslt6, kidsge6}, affect both $U_L$ and $U_H$. Therefore we had the following equations:

  • $U_{iL} = \beta_0 + \beta_1\text{nwifeinc} + \beta_2 \text{educ} + \beta_3 \text{exper} + \beta_4 \text{expersq} + \beta_5 \text{age} + \beta_6 \text{kidslt6} + \beta_7 \text{kidsge6} + \varepsilon_{iL}$
  • $U_{iH} = \alpha_0 + \alpha_1\text{nwifeinc} + \alpha_2 \text{educ} + \alpha_3 \text{exper} + \alpha_4 \text{expersq} + \alpha_5 \text{age} + \alpha_6 \text{kidslt6} + \alpha_7 \text{kidsge6} + \varepsilon_{iH}$

But what if $U_L$ and $U_H$ are affected by different independent variables?

To analyze such a scenario, consider the following dataset:

|     | inlf | $L_1$   | $L_2$ | $L_3$ | $H_1$ | $H_2$ | $H_3$ | $C_1$ | Utility |
|-----|------|---------|-------|-------|-------|-------|-------|-------|---------|
| 0   | 1    | 10.9101 | 12    | 14    | 196   | 32    | 1     | 0     | $U_L>U_H$ |
| 1   | 1    | 19.5    | 12    | 5     | 25    | 30    | 0     | 2     | $U_L>U_H$ |
| 2   | 1    | 12.0399 | 12    | 15    | 225   | 35    | 1     | 3     | $U_L>U_H$ |
| 3   | 1    | 6.8     | 12    | 6     | 36    | 34    | 0     | 3     | $U_L>U_H$ |
| 4   | 1    | 20.1001 | 14    | 7     | 49    | 31    | 1     | 2     | $U_L>U_H$ |
| 5   | 1    | 9.85905 | 12    | 33    | 1089  | 54    | 0     | 0     | $U_L>U_H$ |
| ... | ...  | ...     | ...   | ...   | ...   | ...   | ...   | ...   | ...     |
| 749 | 0    | 10      | 12    | 14    | 196   | 31    | 2     | 3     | $U_H>U_L$ |
| 750 | 0    | 9.952   | 12    | 4     | 16    | 43    | 0     | 0     | $U_H>U_L$ |
| 751 | 0    | 24.984  | 12    | 15    | 225   | 60    | 0     | 0     | $U_H>U_L$ |
| 752 | 0    | 28.363  | 9     | 12    | 144   | 39    | 0     | 3     | $U_H>U_L$ |

With the help of some rationale, we theorize that $L_1, L_2, L_3$ affect $U_L$ and $H_1, H_2, H_3$ affect $U_H$. In addition, $C_1$ affects both. Therefore

  • $U_{iL} = \underbrace{\beta_0 + \beta_1\text{L}_1 + \beta_2 \text{L}_2 + \beta_3 \text{L}_3 + \beta_4 \text{C}_1}_{=\bold{V}_{iL}} + \varepsilon_{iL}$
  • $U_{iH} = \underbrace{\alpha_0 + \alpha_1\text{H}_1 + \alpha_2 \text{H}_2 + \beta_3 \text{H}_3 + \alpha_4 \text{C}_1}_{=\bold{V}_{iH}} + \varepsilon_{iH}$

Important note: the coefficient on $C_1$ differs across the two equations ($\beta_4$ vs. $\alpha_4$), while $L_3$ and $H_3$ share the same coefficient $\beta_3$. Whether coefficients are restricted to be equal or allowed to differ is guided by the underlying economic theory. However, this distinction does not affect the estimation methodology, provided the model is properly identified.

$$\begin{align*} P[\text{inlf}_i=1] &\equiv P[U_{iL}>U_{iH}]\\ &=P[(U_{iL}-U_{iH})>0]\\ &=P[(\bold{V}_{iL} - \bold{V}_{iH} + \underbrace{\varepsilon_{iL} - \varepsilon_{iH}}_{u_i})>0]\\ &=P[(\bold{V}_{iL} - \bold{V}_{iH} + u_i)>0]\\ &=P[u_i>-(\bold{V}_{iL} - \bold{V}_{iH})]\\ P[\text{inlf}_i=1|\bold{V}_{iL} - \bold{V}_{iH}] &=P[u_i>-(\bold{V}_{iL} - \bold{V}_{iH})|\bold{V}_{iL} - \bold{V}_{iH}] \end{align*}$$

If the distribution of uiu_i is symmetric, then

$$P[\text{inlf}_i=1|\bold{V}_{iL} - \bold{V}_{iH}] =P[u_i<(\bold{V}_{iL} - \bold{V}_{iH})|\bold{V}_{iL} - \bold{V}_{iH}]$$

Assume that $u_i$ has a logistic distribution with $\mu=0$ and $\sigma^2=\pi^2/3$; then

$$\begin{align*} P[\text{inlf}_i=1|\bold{V}_{iL} - \bold{V}_{iH}] &= F(\bold{V}_{iL} - \bold{V}_{iH})=\frac{e^{\bold{V}_{iL} - \bold{V}_{iH}}}{1+e^{\bold{V}_{iL} - \bold{V}_{iH}}}\\ &=\frac{e^{\bold{V}_{iL}}}{e^{\bold{V}_{iL}}+e^{\bold{V}_{iH}}}\\ &=\frac{e^{\beta_0 + \beta_1\text{L}_1 + \beta_2 \text{L}_2 + \beta_3 \text{L}_3 + \beta_4 \text{C}_1}}{e^{\beta_0 + \beta_1\text{L}_1 + \beta_2 \text{L}_2 + \beta_3 \text{L}_3 + \beta_4 \text{C}_1}+e^{\alpha_0 + \alpha_1\text{H}_1 + \alpha_2 \text{H}_2 + \beta_3 \text{H}_3 + \alpha_4 \text{C}_1}} \end{align*}$$

Similarly

$$P[\text{inlf}_i=0|\bold{V}_{iL} - \bold{V}_{iH}] = 1-\frac{e^{\bold{V}_{iL}}}{e^{\bold{V}_{iL}}+e^{\bold{V}_{iH}}}=\frac{e^{\bold{V}_{iH}}}{e^{\bold{V}_{iL}}+e^{\bold{V}_{iH}}}$$

The final step is to conduct maximum likelihood estimation to find the set of parameters $\{\beta_0,\beta_1,\beta_2,\beta_3,\beta_4,\alpha_0,\alpha_1,\alpha_2,\alpha_4\}$ that maximizes the log-likelihood function. Note that only the differences $\beta_0-\alpha_0$ and $\beta_4-\alpha_4$ enter $\bold{V}_{iL}-\bold{V}_{iH}$, so those parameters are identified only up to a normalization.
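A minimal sketch of this estimation with `scipy.optimize`, using simulated stand-in data (the variables $L_1,\dots,C_1$ here are hypothetical placeholders). Since only the differences $\beta_0-\alpha_0$ and $\beta_4-\alpha_4$ are identified, the sketch imposes the normalizations $\alpha_0=0$ and $\alpha_4=0$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 500
L1, L2, L3 = rng.normal(size=(3, n))  # hypothetical L-variables
H1, H2, H3 = rng.normal(size=(3, n))  # hypothetical H-variables
C1 = rng.normal(size=n)               # common variable
inlf = rng.integers(0, 2, size=n)     # hypothetical binary outcome

def neg_loglik(theta):
    """Negative log-likelihood of the complex-case logit model."""
    b0, b1, b2, b3, b4, a1, a2 = theta
    VL = b0 + b1*L1 + b2*L2 + b3*L3 + b4*C1  # V_iL (b0, b4 absorb the normalized alphas)
    VH = a1*H1 + a2*H2 + b3*H3               # V_iH; H3 shares beta_3 with L3
    p = 1.0 / (1.0 + np.exp(-(VL - VH)))     # P[inlf=1] = e^{VL} / (e^{VL} + e^{VH})
    p = np.clip(p, 1e-12, 1 - 1e-12)         # guard the logarithms
    return -np.sum(inlf*np.log(p) + (1 - inlf)*np.log(1 - p))

opt = minimize(neg_loglik, x0=np.zeros(7), method='BFGS')
print(opt.x)  # (beta_0 - alpha_0, beta_1, beta_2, beta_3,
              #  beta_4 - alpha_4, alpha_1, alpha_2)
```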