
Deriving the OLS Estimates

Here we will derive OLS estimates for both simple and multiple linear regression.

Simple linear regression

We will use the Method of Moments to derive the estimates.

Derivation

Our objective is to estimate the following model:

$$y=\beta_0 + \beta_1x + u$$

Since there are two unknowns ($\beta_0$ and $\beta_1$), we need two equations. We will use the following two conditions:

  • $\mathbb{E}[U]=0$, and
  • $\mathbb{E}[U\mid X]=0$, which implies $\mathrm{Cov}(X,U)=\mathbb{E}[XU]=0$ (see the short derivation below)
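To see why the second condition gives us $\mathbb{E}[XU]=0$, apply the law of iterated expectations:

$$\begin{align*} \mathbb{E}[XU] &= \mathbb{E}\big[\mathbb{E}[XU \mid X]\big] = \mathbb{E}\big[X\,\mathbb{E}[U \mid X]\big] = \mathbb{E}[X\cdot 0] = 0,\\ \mathrm{Cov}(X,U) &= \mathbb{E}[XU]-\mathbb{E}[X]\,\mathbb{E}[U] = 0. \end{align*}$$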

The above conditions can be written as:

$$\begin{align*} \mathbb{E}[y-\beta_0 - \beta_1x] &=0,\\ \mathbb{E}[x(y-\beta_0 - \beta_1x)] &=0 \end{align*}$$

Taking the sample counterparts of the above equations:

$$\begin{align*} n^{-1}\sum_{i=1}^n(y_i-\hat{\beta}_0 - \hat{\beta}_1x_i) &=0,\tag{1}\\ n^{-1}\sum_{i=1}^n x_i(y_i-\hat{\beta}_0 - \hat{\beta}_1x_i) &=0 \tag{2} \end{align*}$$
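Before solving these by hand, note that $(1)$ and $(2)$ are simply a $2\times 2$ linear system in $(\hat{\beta}_0, \hat{\beta}_1)$. A minimal numerical sketch, using hypothetical simulated data, that solves this system directly:

```python
import numpy as np

# Hypothetical simulated sample; any paired (x, y) data would work.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)
n = len(x)

# Conditions (1) and (2), multiplied through by n, are linear in (b0, b1):
#   n*b0          + (sum of x)*b1   = sum of y
#   (sum of x)*b0 + (sum of x^2)*b1 = sum of x*y
A = np.array([[n, x.sum()],
              [x.sum(), (x ** 2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
b0_hat, b1_hat = np.linalg.solve(A, rhs)
print(b0_hat, b1_hat)  # close to the true values 2.0 and 0.5
```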

Solving $(1)$, we get

$$\begin{align*} \bar{y}&=\hat{\beta}_0 + \hat{\beta}_1\bar{x}\\ \implies \hat{\beta}_0 &=\bar{y} - \hat{\beta}_1\bar{x} \tag{3} \end{align*}$$

where $\bar{y}$ and $\bar{x}$ are the sample means of $y$ and $x$.

Substituting $(3)$ in $(2)$, we get

$$n^{-1}\sum_{i=1}^n x_i(y_i-\bar{y} + \hat{\beta}_1\bar{x} - \hat{\beta}_1x_i) =0$$

The $n^{-1}$ factor drops out because the right-hand side is zero. Rearranging the above equation:

$$\begin{align*} \sum_{i=1}^n x_i(y_i-\bar{y}) - \sum_{i=1}^n x_i\hat{\beta}_1(x_i-\bar{x}) &=0\\ \sum_{i=1}^n x_i(y_i-\bar{y})&=\hat{\beta}_1\sum_{i=1}^n x_i(x_i-\bar{x}). \tag{4} \end{align*}$$

We know that:

$$\sum_{i=1}^n x_i(x_i-\bar{x})=\sum_{i=1}^n(x_i-\bar{x})^2 \quad\text{and}\quad \sum_{i=1}^n x_i(y_i-\bar{y})=\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y}).$$

These identities hold because $\sum_{i=1}^n(x_i-\bar{x})=0$ and $\sum_{i=1}^n(y_i-\bar{y})=0$: subtracting $\bar{x}\sum_{i=1}^n(x_i-\bar{x})=0$ and $\bar{x}\sum_{i=1}^n(y_i-\bar{y})=0$ from the respective left-hand sides turns each $x_i$ into $(x_i-\bar{x})$ without changing the sums.

Using the above equalities, $(4)$ can be written as:

$$\hat{\beta}_1=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2},$$

provided that

$$\sum_{i=1}^n(x_i-\bar{x})^2>0.$$

We can also find $\hat{\beta}_0$ by substituting $\hat{\beta}_1$ in $(3)$.
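As a quick sanity check, here is a minimal sketch (again with hypothetical simulated data) that computes $\hat{\beta}_1$ and $\hat{\beta}_0$ from the closed-form expressions and verifies that the fitted residuals satisfy the sample moment conditions $(1)$ and $(2)$:

```python
import numpy as np

# Hypothetical simulated data; the formulas apply to any sample with Var(x) > 0.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)

# Closed-form estimates from the derivation above.
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()  # equation (3)

# The residuals satisfy the sample moment conditions (1) and (2).
u_hat = y - b0_hat - b1_hat * x
print(np.isclose(u_hat.mean(), 0.0))         # condition (1)
print(np.isclose(np.mean(x * u_hat), 0.0))   # condition (2)
```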

With some algebraic manipulation, $\hat{\beta}_1$ can be written as

$$\hat{\beta}_1=\hat{\rho}_{xy}\cdot\Big(\frac{\hat{\sigma}_y}{\hat{\sigma}_x}\Big),$$

where $\hat{\rho}_{xy}$ is the sample correlation between $x_i$ and $y_i$, and $\hat{\sigma}_y, \hat{\sigma}_x$ denote the sample standard deviations.

The population counterpart is

$$\beta_1=\rho_{xy}\cdot\Big(\frac{\sigma_y}{\sigma_x}\Big). \tag{5}$$
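The equivalence of the covariance and correlation forms of $\hat{\beta}_1$ is easy to confirm numerically; a small sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data for checking the correlation form of the slope estimate.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)

# Covariance form (the formula derived above).
b1_cov_form = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Correlation form: rho_hat * (sigma_y_hat / sigma_x_hat).
rho_xy = np.corrcoef(x, y)[0, 1]
b1_corr_form = rho_xy * (y.std(ddof=1) / x.std(ddof=1))

print(np.isclose(b1_cov_form, b1_corr_form))  # True
```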

Multiple linear regression

We will use the first-order condition to derive the estimates.

Derivation

Our objective is to estimate the following model:

$$\bold{y} =\bold{X} \boldsymbol{\beta} + \bold{u},$$

where $\bold{y}, \boldsymbol{\beta}$, and $\bold{u}$ are column vectors and $\bold{X}$ is a matrix.

For individual ii, the above equation can be written as

$$y_i =\bold{x_i}' \boldsymbol{\beta} + u_i.$$

Let $\bold{b}$ be the column vector of OLS regression coefficients; then

$$y_i =\bold{x_i}' \bold{b} + e_i,$$

where $\bold{x_i}' \bold{b}=\hat{y}_i$.

The least squares coefficient vector minimizes the sum of squared residuals:

$$\sum_{i=1}^n e_{i0}^2=\sum_{i=1}^n(y_i -\bold{x_i}' \bold{b}_0)^2,$$

where b0\bold{b}_0 denotes a choice for the coefficient vector. Consider the following manipulation

$$\sum_{i=1}^n e_{i0}^2=e_{10}^2+e_{20}^2+\dots+e_{n0}^2=\bold{e}_0'\bold{e}_0=\begin{bmatrix} e_{10} & e_{20} & \cdots & e_{n0} \end{bmatrix} \begin{bmatrix} e_{10}\\ e_{20}\\ \vdots \\ e_{n0} \end{bmatrix}.$$

Hence we have to solve the following problem

$$\min_{\bold{b}_0} \Big(\bold{e}_0'\bold{e}_0\Big)= \min_{\bold{b}_0} \Big((\bold{y}-\bold{Xb_0})'(\bold{y}-\bold{Xb_0})\Big)$$

Expand the above term:

$$\begin{align*} (\bold{y}-\bold{Xb_0})'(\bold{y}-\bold{Xb_0})&=(\bold{y}'-\bold{b_0'X'})(\bold{y}-\bold{Xb_0})\\ &=\bold{y}'\bold{y}-\bold{y}'\bold{Xb_0}-\bold{b_0'X'}\bold{y}+\bold{b_0'X'}\bold{Xb_0} \end{align*}$$

Since $\bold{y}'\bold{Xb_0}$ is a scalar, it equals its transpose $\bold{b_0'X'y}$, so the two middle terms are equal. Differentiating with respect to $\bold{b}_0$ and setting the derivative to zero gives the first-order condition:

$$\begin{align*} \frac{\partial}{\partial \bold{b}_0}\Big(\bold{y}'\bold{y}-\bold{y}'\bold{Xb_0}-\bold{b_0'X'}\bold{y}+\bold{b_0'X'}\bold{Xb_0}\Big)&=0\\ -2\bold{X'y}+2\bold{X'Xb_0}&=0 \end{align*}$$

Let $\bold{b}$ be the solution (assuming $\bold{X'X}$ is invertible); then

$$\bold{b}=\bold{(X'X)}^{-1}\bold{X'y}.$$
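A minimal numerical sketch of this result, using a hypothetical simulated design matrix; in practice it is preferable to solve the normal equations $\bold{X'Xb}=\bold{X'y}$ rather than form the inverse explicitly:

```python
import numpy as np

# Hypothetical design: an intercept column plus two simulated regressors.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n)

# b = (X'X)^{-1} X'y, computed by solving the normal equations X'X b = X'y.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against numpy's built-in least-squares solver.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b, b_lstsq))  # True
```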

Important points

  1. As we can see in $(5)$, $\beta_1$ is just a scaled version of $\rho_{xy}$. This highlights an important limitation of simple regression when we do not have experimental data: in effect, simple regression is an analysis of the correlation between two variables, and so one must be careful in inferring causality.
  2. Why not minimize some other function of the residuals, such as the absolute values of the residuals?