Let (X,Y) be a pair of random variables for which the regression function \nu (x) = \mathbb E[Y | X = x] takes the form

\nu (x) = a + bx

for some pair of real numbers (a,b).

What is the random variable \hat{Y}, a function of X, that minimizes

\mathbb E\left[ (Y - \hat{Y})^2 | X = x \right]

over all possible choices of \hat{Y} and for all x? Enter your answer in terms of a, b and the random variable X (capital letter "X").

(Remark: for a clean, quick solution, it may be helpful to review the law of iterated expectations: \mathbb E_{X,Y}[\cdot ] = \mathbb E_{X}[\mathbb E_{Y}[\cdot \; |\; X]], where \mathbb E_{Y}[\cdot |X] denotes the conditional expectation, which is a random variable. Use the insight from the previous exercise.)
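
For a concrete feel for this identity, consider a tiny made-up example (the numbers are illustrative only): let X be a fair coin flip with \mathbb E[Y \;|\; X = 0] = 1 and \mathbb E[Y \;|\; X = 1] = 3. Then

\begin{align*}
\mathbb E[Y] = \mathbb E_{X}[\mathbb E_{Y}[Y \; |\; X]] = \tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 3 = 2
\end{align*}

so the outer expectation simply averages the conditional means over the distribution of X.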

1 answer

Write \hat{Y} = g(X) for some function g, so that, conditioned on X = x, the prediction is the constant g(x). Adding and subtracting the regression function \nu(x) inside the square:

\begin{align*}
\mathbb{E}[(Y - g(X))^2|X=x] &= \mathbb{E}[(Y - \nu(x) + \nu(x) - g(x))^2|X=x] \\
&= \mathbb{E}[(Y - \nu(x))^2|X=x] + 2(\nu(x) - g(x))\mathbb{E}[Y - \nu(x)|X=x] + (\nu(x) - g(x))^2
\end{align*}

The factors \nu(x) - g(x) come out of the expectations because, given X = x, they are deterministic constants.

The first term is the conditional variance of Y given X = x; it does not depend on the choice of g. The cross term vanishes, because by the definition of the regression function

\begin{align*}
\mathbb{E}[Y - \nu(x)|X=x] &= \mathbb{E}[Y|X=x] - \nu(x) = \nu(x) - \nu(x) = 0
\end{align*}

Therefore, we have:

\begin{align*}
\mathbb{E}[(Y - g(X))^2|X=x] &= \mathbb{E}[(Y - \nu(x))^2|X=x] + (\nu(x) - g(x))^2
\end{align*}
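
As a quick sanity check of this decomposition, one can simulate a joint distribution whose regression function is linear and compare empirical conditional mean squared errors over a thin slab of x values. The particular model below (Gaussian X, additive unit-variance noise, and the coefficient values) is an assumption made purely for illustration, not part of the problem:

import numpy as np

rng = np.random.default_rng(0)
a, b = 1.5, 2.0                        # illustrative values, not given in the problem
n = 1_000_000

# Simulate one joint distribution whose regression function is a + b*x.
# Gaussian X and additive Gaussian noise are assumptions for this check only.
X = rng.normal(size=n)
Y = a + b * X + rng.normal(size=n)     # so E[Y | X = x] = a + b*x

def nu(x):
    return a + b * x                   # the regression function

# Condition on X landing in a thin slab around x = 1.
mask = (X > 0.9) & (X < 1.1)
resid = Y[mask] - nu(X[mask])

print(resid.mean())                         # ~ 0: the cross term vanishes
mse_nu = np.mean(resid ** 2)
mse_shifted = np.mean((resid - 0.5) ** 2)   # predictor g(x) = nu(x) + 0.5
print(mse_shifted - mse_nu)                 # ~ 0.25 = (nu(x) - g(x))^2

The printed gap between the two empirical MSEs matches the (\nu(x) - g(x))^2 penalty predicted by the decomposition.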

Now, since the second term is nonnegative and equals 0 exactly when g(x) = \nu(x), the conditional mean squared error is minimized at every x by choosing g = \nu. Using the fact that \nu(x) = a + bx, the minimizing random variable is

\begin{align*}
\hat{Y} &= \nu(X) = a + bX
\end{align*}

By the law of iterated expectations, this choice also minimizes the unconditional mean squared error: \mathbb{E}[(Y - \hat{Y})^2] = \mathbb{E}_{X}[\mathbb{E}[(Y - \hat{Y})^2 \; |\; X]], and \hat{Y} = a + bX minimizes each inner conditional expectation simultaneously. This is the insight from the previous exercise: the conditional expectation \mathbb{E}[Y|X] is the minimum mean squared error predictor of Y given X.
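
To see the conclusion numerically, here is a minimal Monte Carlo sketch comparing \hat{Y} = a + bX against nearby perturbed linear predictors; again, the data-generating model and coefficient values are illustrative assumptions, not part of the problem:

import numpy as np

rng = np.random.default_rng(1)
a, b = 1.5, 2.0                          # illustrative values, not given in the problem
X = rng.uniform(-2.0, 2.0, size=500_000)
Y = a + b * X + rng.normal(size=X.size)  # regression function is a + b*x by construction

# Unconditional MSE of perturbed linear predictors (a+da) + (b+db)*X.
for da in (-0.3, 0.0, 0.3):
    for db in (-0.3, 0.0, 0.3):
        y_hat = (a + da) + (b + db) * X
        mse = np.mean((Y - y_hat) ** 2)
        print(f"da={da:+.1f}  db={db:+.1f}  MSE={mse:.4f}")
# The minimum appears at da = db = 0, i.e. at Y_hat = a + b*X, matching
# the law-of-iterated-expectations argument above.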