Note: In this notebook, we treat $y_i$ as a binary variable that takes only the values 0 or 1.
In a linear probability model, the relationship between the predictors and the probability of the binary outcome is assumed to be linear. However, one of the fundamental limitations of the linear probability model is that it can predict probabilities that fall outside the logical bounds of 0 and 1. This happens because, in a linear model, the estimated probability is a linear function of the predictors, and there's nothing inherent in the linear formulation to restrict the predicted probability to lie between 0 and 1.
This limitation can lead to nonsensical predictions in practical applications. For instance, with a sufficiently large positive or negative predictor value, the linear model might predict a probability greater than 1 or less than 0, which is not meaningful in a probabilistic context. Such predictions defy the basic principles of probability and can significantly impair the interpretability and usefulness of the model.
Logistic regression, on the other hand, overcomes this limitation by using a logistic function to model the probability. The logistic function takes any input from the linear combination of predictors and transforms it into a value between 0 and 1. This transformation ensures that the model's outputs are sensible probabilities, regardless of the values of the predictors. Thus, logistic regression is inherently more suitable for modeling probabilities, as it respects the probabilistic boundaries and provides more reliable and interpretable results in scenarios where the dependent variable is binary.
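The contrast described above can be checked numerically. The following is a minimal sketch on simulated data, not part of the original analysis: the sample size, the coefficients 0.5 and 2.0, and the names `beta_lpm` and `beta_logit` are all assumptions made for illustration. It fits a linear probability model by least squares and a logit by Newton's method, then predicts at extreme predictor values.

```python
import numpy as np

def logistic(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Simulated data (assumed for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (rng.uniform(size=500) < logistic(0.5 + 2.0 * x)).astype(float)
X = np.column_stack([np.ones_like(x), x])

# Linear probability model: ordinary least squares of y on X.
beta_lpm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Logit: maximum likelihood via a few Newton-Raphson steps.
beta_logit = np.zeros(2)
for _ in range(25):
    p = logistic(X @ beta_logit)
    grad = X.T @ (y - p)                        # score vector
    hess = (X * (p * (1 - p))[:, None]).T @ X   # negative Hessian
    beta_logit = beta_logit + np.linalg.solve(hess, grad)

# Predict at extreme predictor values.
x_new = np.array([-10.0, 0.0, 10.0])
X_new = np.column_stack([np.ones_like(x_new), x_new])
lpm_pred = X_new @ beta_lpm
logit_pred = logistic(X_new @ beta_logit)

print("LPM predictions:  ", lpm_pred)    # can fall below 0 or above 1
print("Logit predictions:", logit_pred)  # always between 0 and 1
```

In this setup the linear fit extrapolates past the unit interval at both extremes, while the logit predictions remain valid probabilities.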
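$$U_i^L = \beta_0 + \beta_1\,\text{nwifeinc} + \beta_2\,\text{educ} + \beta_3\,\text{exper} + \beta_4\,\text{expersq} + \beta_5\,\text{age} + \beta_6\,\text{kidslt6} + \beta_7\,\text{kidsge6} + \varepsilon_i^L$$

$$U_i^H = \alpha_0 + \alpha_1\,\text{nwifeinc} + \alpha_2\,\text{educ} + \alpha_3\,\text{exper} + \alpha_4\,\text{expersq} + \alpha_5\,\text{age} + \alpha_6\,\text{kidslt6} + \alpha_7\,\text{kidsge6} + \varepsilon_i^H$$

Question: Can we think of the two equations above as two different regressions? (Yes)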
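$$
\begin{aligned}
U_i^L - U_i^H ={}& (\beta_0-\alpha_0) + (\beta_1-\alpha_1)\,\text{nwifeinc} + (\beta_2-\alpha_2)\,\text{educ} + (\beta_3-\alpha_3)\,\text{exper} \\
& + (\beta_4-\alpha_4)\,\text{expersq} + (\beta_5-\alpha_5)\,\text{age} + (\beta_6-\alpha_6)\,\text{kidslt6} + (\beta_7-\alpha_7)\,\text{kidsge6} + \varepsilon_i^L - \varepsilon_i^H \\
={}& x_i'\gamma + u_i
\end{aligned}
$$

where

$$
x_i = \begin{pmatrix} \text{const} \\ \text{nwifeinc} \\ \text{educ} \\ \text{exper} \\ \text{expersq} \\ \text{age} \\ \text{kidslt6} \\ \text{kidsge6} \end{pmatrix},
\qquad
\gamma = \begin{pmatrix} \beta_0-\alpha_0 \\ \beta_1-\alpha_1 \\ \beta_2-\alpha_2 \\ \beta_3-\alpha_3 \\ \beta_4-\alpha_4 \\ \beta_5-\alpha_5 \\ \beta_6-\alpha_6 \\ \beta_7-\alpha_7 \end{pmatrix}
\qquad \text{and} \qquad
u_i = \varepsilon_i^L - \varepsilon_i^H.
$$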
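The differencing step can be verified numerically. The sketch below uses made-up values throughout: the regressor vector `x`, the coefficient vectors `beta` and `alpha`, and the error draws `eps_L` and `eps_H` are all hypothetical and serve only to show that the utility difference equals $x_i'\gamma + u_i$, so the choice depends on the coefficients only through $\gamma$.

```python
import numpy as np

# Illustrative regressor vector, ordered as
# (const, nwifeinc, educ, exper, expersq, age, kidslt6, kidsge6).
x = np.array([1.0, 10.9101, 12.0, 14.0, 196.0, 32.0, 1.0, 0.0])

# Hypothetical coefficients and error terms (not estimates).
beta = np.array([0.4, -0.02, 0.2, 0.1, -0.002, -0.05, -1.0, 0.05])
alpha = np.array([0.1, 0.0, 0.05, 0.02, 0.0, 0.01, 0.3, 0.0])
eps_L, eps_H = 0.3, -0.1

# The two utilities, each with its own coefficients and error.
U_L = x @ beta + eps_L
U_H = x @ alpha + eps_H

# Differenced form: only gamma and u enter.
gamma = beta - alpha
u = eps_L - eps_H
diff = x @ gamma + u

# The individual picks L exactly when the difference is positive.
inlf = int(U_L > U_H)

print(U_L - U_H, diff, inlf)
```

Because only `gamma` enters the choice, adding the same constant vector to both `beta` and `alpha` would leave `inlf` unchanged.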
Let $P[\text{inlf}_i = 1 \mid x_i']$ be the probability that individual $i$ chooses 1, which in our case means that she or he is in the labor force. Observe the following equivalence:
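As a sketch of the chain of equalities this refers to, using only the definitions of $\gamma$ and $u_i$ above: the individual chooses 1 exactly when $U_i^L > U_i^H$. The symbol $F_u$, the distribution function of $u_i$ conditional on $x_i$, is notation introduced here for illustration.

```latex
\begin{aligned}
P[\text{inlf}_i = 1 \mid x_i]
  &= P[U_i^L > U_i^H \mid x_i] \\
  &= P[U_i^L - U_i^H > 0 \mid x_i] \\
  &= P[x_i'\gamma + u_i > 0 \mid x_i] \\
  &= P[u_i > -x_i'\gamma \mid x_i] \\
  &= 1 - F_u(-x_i'\gamma).
\end{aligned}
```

If the distribution of $u_i$ is symmetric about zero, as the logistic and normal distributions are, the last line simplifies to $F_u(x_i'\gamma)$.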
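Until now, we assumed that all the independent variables, {nwifeinc, educ, exper, expersq, age, kidslt6, kidsge6}, affect both $U^L$ and $U^H$. Therefore we had the following equations:

$$U_i^L = \beta_0 + \beta_1\,\text{nwifeinc} + \beta_2\,\text{educ} + \beta_3\,\text{exper} + \beta_4\,\text{expersq} + \beta_5\,\text{age} + \beta_6\,\text{kidslt6} + \beta_7\,\text{kidsge6} + \varepsilon_i^L$$

$$U_i^H = \alpha_0 + \alpha_1\,\text{nwifeinc} + \alpha_2\,\text{educ} + \alpha_3\,\text{exper} + \alpha_4\,\text{expersq} + \alpha_5\,\text{age} + \alpha_6\,\text{kidslt6} + \alpha_7\,\text{kidsge6} + \varepsilon_i^H$$

But what if $U^L$ and $U^H$ are affected by different independent variables?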
To analyze such a scenario, consider the following dataset:
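|     | inlf | L1      | L2 | L3 | H1   | H2 | H3 | C1 | Utility     |
|-----|------|---------|----|----|------|----|----|----|-------------|
| 0   | 1    | 10.9101 | 12 | 14 | 196  | 32 | 1  | 0  | $U^L > U^H$ |
| 1   | 1    | 19.5    | 12 | 5  | 25   | 30 | 0  | 2  | $U^L > U^H$ |
| 2   | 1    | 12.0399 | 12 | 15 | 225  | 35 | 1  | 3  | $U^L > U^H$ |
| 3   | 1    | 6.8     | 12 | 6  | 36   | 34 | 0  | 3  | $U^L > U^H$ |
| 4   | 1    | 20.1001 | 14 | 7  | 49   | 31 | 1  | 2  | $U^L > U^H$ |
| 5   | 1    | 9.85905 | 12 | 33 | 1089 | 54 | 0  | 0  | $U^L > U^H$ |
| ... |      |         |    |    |      |    |    |    |             |
| 749 | 0    | 10      | 12 | 14 | 196  | 31 | 2  | 3  | $U^H > U^L$ |
| 750 | 0    | 9.952   | 12 | 4  | 16   | 43 | 0  | 0  | $U^H > U^L$ |
| 751 | 0    | 24.984  | 12 | 15 | 225  | 60 | 0  | 0  | $U^H > U^L$ |
| 752 | 0    | 28.363  | 9  | 12 | 144  | 39 | 0  | 3  | $U^H > U^L$ |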
Based on some rationale, we theorize that L1, L2, and L3 affect $U^L$, while H1, H2, and H3 affect $U^H$. In addition, C1 affects both. Therefore:
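$$U_i^L = \underbrace{\beta_0 + \beta_1\,\text{L1} + \beta_2\,\text{L2} + \beta_3\,\text{L3} + \beta_4\,\text{C1}}_{=\,V_i^L} + \varepsilon_i^L$$

$$U_i^H = \underbrace{\alpha_0 + \alpha_1\,\text{H1} + \alpha_2\,\text{H2} + \beta_3\,\text{H3} + \alpha_4\,\text{C1}}_{=\,V_i^H} + \varepsilon_i^H$$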
Important note: The coefficients on C1 differ across the two equations ($\beta_4$ versus $\alpha_4$), while L3 and H3 share the same coefficient ($\beta_3$). Whether coefficients are shared or distinct across equations is dictated by the underlying economic theory we consider. This choice does not change the estimation methodology, provided that the model is properly identified.
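To make the structure concrete, here is a minimal sketch that builds $V_i^L$ and $V_i^H$ for the first three rows of the table. All coefficient values are hypothetical, not estimates. The final step additionally assumes that $\varepsilon_i^L - \varepsilon_i^H$ follows a logistic distribution, which the text above does not state; under that assumption the probability of choosing L is the logistic function of $V_i^L - V_i^H$.

```python
import numpy as np

# First three rows of the table; columns are L1, L2, L3, H1, H2, H3, C1.
rows = np.array([
    [10.9101, 12.0, 14.0, 196.0, 32.0, 1.0, 0.0],
    [19.5,    12.0,  5.0,  25.0, 30.0, 0.0, 2.0],
    [12.0399, 12.0, 15.0, 225.0, 35.0, 1.0, 3.0],
])
L, H, C1 = rows[:, 0:3], rows[:, 3:6], rows[:, 6]

# Hypothetical coefficients, chosen only for illustration.
beta0, beta = 0.5, np.array([-0.02, 0.10, 0.05])  # beta[2] is the shared beta_3
alpha0, alpha = 0.2, np.array([0.001, 0.01])      # alpha_1 and alpha_2
beta4, alpha4 = -0.05, 0.03                       # C1 has a separate coefficient in each equation

# Systematic utilities: L3 and H3 both use beta_3.
V_L = beta0 + L @ beta + beta4 * C1
V_H = alpha0 + H[:, :2] @ alpha + beta[2] * H[:, 2] + alpha4 * C1

# Assumed logistic error difference: P[inlf = 1] = logistic(V_L - V_H).
p_L = 1.0 / (1.0 + np.exp(-(V_L - V_H)))

print(V_L, V_H, p_L)
```

The shared coefficient appears as the single parameter `beta[2]` in both utilities, which is how an equality restriction of this kind would enter a likelihood function.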