Variable Selection in Machine Learning

I've had a discussion with a colleague on the selection of variables in a model. The discussion boils down to the following question:

Which is better?

supply all the variables that you have into the model and remove the ones that add a risk of overfitting
start out small and add values to make the model more and more complex?

You should always use a test set to determine the performance of your models, but starting out small and growing the model brings inherit bias into the model. In this document I will provide a mathematical proof of this that I remember from college.

Linear Algrebra/Regression Primer

In this document I've assumed that you have some remembrance of college level linear algebra. The following equations should feel readable to you:

\[ \begin{aligned} M_x & = I_n - X(X'X)^{-1}X' \\\ M_x X & = I_nX - X(X'X)^{-1}X'X \\\ & = X - XI_n\\\ & = 0 \end{aligned} \]

From statistics you should hopyfully feel familiar with the following:

\[ \begin{aligned} Y & = X\beta + \epsilon \text{ where } \epsilon \sim N(0,\sigma) \\\ \hat{\beta} & = (X'X)^{-1} X'Y \\\ \mathbb{E}(\hat{\beta} ) & = \mathbb{E}\big((X'X)^{-1} X'Y)\big) \end{aligned} \]

I am going to proove that for linear models you will introduce bias if you use few variables and include more and more as you are building a model and that this will not happen when you start with a lot of variables and reduce. For each case I will show what goes wrong in terms of the expected value of the $\beta$ variables.

A few things on notation, I will refer to the following linear regression formula:

\[Y = X_1\beta_1 + X_2\beta_2 + \epsilon\]

In this notation, $X_1,X_2$ are matrices containing data, not vectors, such that $\beta_1,\beta_2$ are matrices as well.

Small to Large Problems

Suppose that the true model is given through:

\[Y = X_1\beta_1 + X_2\beta_2 + \epsilon\]

If we start out with a smaller model, say by only looking at $\beta_1$ we would estimate for $ Y = X_1\beta_1 + \epsilon$ while the whole model should be $ Y = X_1\beta_1 + X_2\beta_2 + \epsilon $. Then our expected value of $\beta_1$ can be derived analytically.

\[ \begin{aligned} \mathbb{E}(\beta_1) & = \mathbb{E}\big((X_1'X_1)^{-1} X_1'Y)\big)\\\ & = \mathbb{E}\Big((X_1'X_1)^{-1} X_1'\big(X_1\beta_1 + X_2\beta_2 + \epsilon\big)\Big)\\\ & = \mathbb{E}\Big((X_1'X_1)^{-1} X_1'X_1\beta_1 + (X_1'X_1)^{-1} X_1'X_2\beta_2 + (X_1'X_1)^{-1} X_1'\epsilon\big)\Big)\\\ & = \mathbb{E}\Big(\beta_1 + (X_1'X_1)^{-1} X_1'X_2\beta_2 + (X_1'X_1)^{-1} X_1'\epsilon\big)\Big) \\\ & = \beta_1 + (X_1'X_1)^{-1} X_1'X_2\beta_2 + (X_1'X_1)^{-1} X_1'\mathbb{E}(\epsilon) \\\ & = \beta_1 + (X_1'X_1)^{-1} X_1'X_2\beta_2 \\\ & \ne \beta_1 \end{aligned} \]

So our estimate of $\beta_1$ is biased. This holds for every subset of variables $\{\beta_1, \beta_2\}$ that make up $\beta$.

Large to Small Solution

Suppose that the true model is given through:

\[ Y = X_1\beta_1 + \epsilon \]

If we start out with a larger model, say by including some parameters $\beta_2$ as well while they do not have any influence on the model then we will initially estimate a wrong model $Y = X_1\beta_1 + X_2\beta_2 + \epsilon$.

A lemma in between

Let's define a matrix $M_{X_1} = I_n -X_1(X_1'X_1)^{-1}X_1'$. We can use this matrix to get an estimate of $\beta_2$.

Start out with the original formula.

\[ \begin{aligned} M_{X_1}Y & = M_{X_1}X_1\beta_1 + M_{X_1}X_2\beta_2 + M_{X_1}\epsilon \\\ M_{X_1}Y & = M_{X_1}X_2\beta_2 + \epsilon \\\ X_2'M_{X_1}Y & = X_2'M_{X_1}X_2\beta_2 + X_2'\epsilon \\\ X_2'M_{X_1}Y & = X_2'M_{X_1}X_2\beta_2 \\\ \beta_2 & = ( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}Y \\\ \end{aligned} \]

Notice that $M_{X_1}X_1 = 0$ and that $M_{X_1}\epsilon = \epsilon$ because of the definition while $X_2\epsilon = 0$ because $\epsilon$ is normally distributed around zero and orthogonal to any of the explanatory variables.

The derivation for large to small

With this definition of $\beta_2$ we can analyse it to confirm that it should not converge to any other value than zero.

\[ \begin{aligned} \mathbb{E}(\beta_2) & = \mathbb{E}\big(( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}Y\big) \\\ & = \mathbb{E}\big(( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}\big(X_1\beta_1 + \epsilon\big)\big) \\\ & = \mathbb{E}\big(( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}X_1\beta_1 + ( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}\epsilon\big) \\\ & = \mathbb{E}\big(( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}\epsilon\big) \\\ & = ( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}\mathbb{E}(\epsilon) \\\ & = 0 \end{aligned} \]

Notice that $( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}X_1\beta_1 = 0$ because $M_{X_1}X_1 = 0$. So we see that $\beta_2$ is correctly estimated, what about $\beta_1$?

\[ \begin{aligned} \mathbb{E}(\beta_1) & = \mathbb{E}\big((X_1'X_1)^{-1} X_1'Y)\big)\\\ & = \mathbb{E}\Big((X_1'X_1)^{-1} X_1'\big(X_1\beta_1 + \epsilon\big)\Big)\\\ & = \mathbb{E}\Big(\beta_1 + (X_1'X_1)^{-1} X_1'\epsilon\Big)\\\ & = \beta_1 \end{aligned} \]

So in this case we would remove the variables $\beta_2$ that are not of influence while our estimate of $\beta_1$ does not have any bias. This is exactly what we want.

Conclusion

I've shown that by starting only a few variables and then adding them to the model has a bias risk in linear models. One can imagine a similar thing happening in other models.