# Variable Selection in Machine Learning

I've had a discussion with a colleague on the selection of variables in a model. The discussion boils down to the following question:

### Which is better?

• supply all the variables that you have into the model and remove the ones that add a risk of overfitting
• start out small and add values to make the model more and more complex?

You should always use a test set to determine the performance of your models, but starting out small and growing the model brings inherit bias into the model. In this document I will provide a mathematical proof of this that I remember from college.

### Linear Algrebra/Regression Primer

In this document I've assumed that you have some remembrance of college level linear algebra. The following equations should feel readable to you:

\begin{aligned} M_x & = I_n - X(X'X)^{-1}X' \\\ M_x X & = I_nX - X(X'X)^{-1}X'X \\\ & = X - XI_n\\\ & = 0 \end{aligned}

From statistics you should hopyfully feel familiar with the following:

\begin{aligned} Y & = X\beta + \epsilon \text{ where } \epsilon \sim N(0,\sigma) \\\ \hat{\beta} & = (X'X)^{-1} X'Y \\\ \mathbb{E}(\hat{\beta} ) & = \mathbb{E}\big((X'X)^{-1} X'Y)\big) \end{aligned}

I am going to proove that for linear models you will introduce bias if you use few variables and include more and more as you are building a model and that this will not happen when you start with a lot of variables and reduce. For each case I will show what goes wrong in terms of the expected value of the $\beta$ variables.

A few things on notation, I will refer to the following linear regression formula:

$Y = X_1\beta_1 + X_2\beta_2 + \epsilon$

In this notation, $$X_1,X_2$$ are matrices containing data, not vectors, such that $$\beta_1,\beta_2$$ are matrices as well.

#### Small to Large Problems

Suppose that the true model is given through:

$Y = X_1\beta_1 + X_2\beta_2 + \epsilon$

If we start out with a smaller model, say by only looking at $$\beta_1$$ we would estimate for $Y = X_1\beta_1 + \epsilon$ while the whole model should be $Y = X_1\beta_1 + X_2\beta_2 + \epsilon$. Then our expected value of $$\beta_1$$ can be derived analytically.

\begin{aligned} \mathbb{E}(\beta_1) & = \mathbb{E}\big((X_1'X_1)^{-1} X_1'Y)\big)\\\ & = \mathbb{E}\Big((X_1'X_1)^{-1} X_1'\big(X_1\beta_1 + X_2\beta_2 + \epsilon\big)\Big)\\\ & = \mathbb{E}\Big((X_1'X_1)^{-1} X_1'X_1\beta_1 + (X_1'X_1)^{-1} X_1'X_2\beta_2 + (X_1'X_1)^{-1} X_1'\epsilon\big)\Big)\\\ & = \mathbb{E}\Big(\beta_1 + (X_1'X_1)^{-1} X_1'X_2\beta_2 + (X_1'X_1)^{-1} X_1'\epsilon\big)\Big) \\\ & = \beta_1 + (X_1'X_1)^{-1} X_1'X_2\beta_2 + (X_1'X_1)^{-1} X_1'\mathbb{E}(\epsilon) \\\ & = \beta_1 + (X_1'X_1)^{-1} X_1'X_2\beta_2 \\\ & \ne \beta_1 \end{aligned}

So our estimate of $$\beta_1$$ is biased. This holds for every subset of variables $$\{\beta_1, \beta_2\}$$ that make up $$\beta$$.

### Large to Small Solution

Suppose that the true model is given through:

$Y = X_1\beta_1 + \epsilon$

If we start out with a larger model, say by including some parameters $$\beta_2$$ as well while they do not have any influence on the model then we will initially estimate a wrong model $$Y = X_1\beta_1 + X_2\beta_2 + \epsilon$$.

### A lemma in between

Let's define a matrix $$M_{X_1} = I_n -X_1(X_1'X_1)^{-1}X_1'$$. We can use this matrix to get an estimate of $$\beta_2$$.

Start out with the original formula.

\begin{aligned} M_{X_1}Y & = M_{X_1}X_1\beta_1 + M_{X_1}X_2\beta_2 + M_{X_1}\epsilon \\\ M_{X_1}Y & = M_{X_1}X_2\beta_2 + \epsilon \\\ X_2'M_{X_1}Y & = X_2'M_{X_1}X_2\beta_2 + X_2'\epsilon \\\ X_2'M_{X_1}Y & = X_2'M_{X_1}X_2\beta_2 \\\ \beta_2 & = ( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}Y \\\ \end{aligned}

Notice that $$M_{X_1}X_1 = 0$$ and that $$M_{X_1}\epsilon = \epsilon$$ because of the definition while $$X_2\epsilon = 0$$ because $$\epsilon$$ is normally distributed around zero and orthogonal to any of the explanatory variables.

### The derivation for large to small

With this definition of $$\beta_2$$ we can analyse it to confirm that it should not converge to any other value than zero.

\begin{aligned} \mathbb{E}(\beta_2) & = \mathbb{E}\big(( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}Y\big) \\\ & = \mathbb{E}\big(( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}\big(X_1\beta_1 + \epsilon\big)\big) \\\ & = \mathbb{E}\big(( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}X_1\beta_1 + ( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}\epsilon\big) \\\ & = \mathbb{E}\big(( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}\epsilon\big) \\\ & = ( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}\mathbb{E}(\epsilon) \\\ & = 0 \end{aligned}

Notice that $( X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}X_1\beta_1 = 0$ because $M_{X_1}X_1 = 0$. So we see that $\beta_2$ is correctly estimated, what about $\beta_1$?

\begin{aligned} \mathbb{E}(\beta_1) & = \mathbb{E}\big((X_1'X_1)^{-1} X_1'Y)\big)\\\ & = \mathbb{E}\Big((X_1'X_1)^{-1} X_1'\big(X_1\beta_1 + \epsilon\big)\Big)\\\ & = \mathbb{E}\Big(\beta_1 + (X_1'X_1)^{-1} X_1'\epsilon\Big)\\\ & = \beta_1 \end{aligned}

So in this case we would remove the variables $$\beta_2$$ that are not of influence while our estimate of $$\beta_1$$ does not have any bias. This is exactly what we want.

### Conclusion

I've shown that by starting only a few variables and then adding them to the model has a bias risk in linear models. One can imagine a similar thing happening in other models.