Variable Selection in Machine Learning
I've had a discussion with a colleague on the selection of variables in a model. The discussion boils down to the following question:
Which is better?
- supply all the variables you have to the model and remove the ones that add a risk of overfitting, or
- start out small and add variables to make the model more and more complex?
You should always use a test set to determine the performance of your models, but starting out small and growing the model brings an inherent bias into the model. In this document I will provide a mathematical proof of this that I remember from college.
Linear Algebra/Regression Primer
In this document I've assumed that you have some recollection of college-level linear algebra. The following equations should feel readable to you:
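At a minimum, that means the matrix form of linear regression and the closed-form least squares estimator, both of which the derivations below lean on:

$$Y = X\beta + \epsilon, \qquad \hat{\beta} = (X'X)^{-1}X'Y$$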
From statistics you should hopefully feel familiar with the following:
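In particular, the standard assumptions on the error term, which the proofs below use repeatedly:

$$\epsilon \sim N(0, \sigma^2 I_n), \qquad E[\epsilon] = 0, \qquad E[X'\epsilon] = 0$$

together with linearity of expectation: \(E[A\epsilon] = A\,E[\epsilon]\) for any fixed matrix \(A\).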
I am going to prove that for linear models you will introduce bias if you start with few variables and include more and more as you build the model, and that this will not happen when you start with a lot of variables and reduce. For each case I will show what happens in terms of the expected value of the $\beta$ estimates.
A few words on notation: I will refer to the following linear regression formula:

$$Y = X_1\beta_1 + X_2\beta_2 + \epsilon$$
In this notation, \(X_1,X_2\) are matrices containing data, not single columns, such that \(\beta_1,\beta_2\) are coefficient vectors rather than scalars.
Small to Large Problems
Suppose that the true model is given by:

$$Y = X_1\beta_1 + X_2\beta_2 + \epsilon$$
If we start out with a smaller model, say by only looking at \(\beta_1\), we would estimate $Y = X_1\beta_1 + \epsilon$ while the whole model should be $Y = X_1\beta_1 + X_2\beta_2 + \epsilon$. Then the expected value of our estimate of \(\beta_1\) can be derived analytically.
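This is the classic omitted-variable-bias calculation: substitute the true model into the least squares estimator of the small model and take expectations.

$$
\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'Y = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + \epsilon) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'\epsilon
$$

Taking expectations and using \(E[X_1'\epsilon] = 0\):

$$
E[\hat{\beta}_1] = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2
$$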
So our estimate of \(\beta_1\) is biased: the extra term \((X_1'X_1)^{-1}X_1'X_2\beta_2\) only vanishes if \(X_1\) happens to be orthogonal to \(X_2\), or if \(\beta_2 = 0\). This holds for every subset of the variables \(\{\beta_1, \beta_2\}\) that make up \(\beta\).
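To see the bias numerically, here is a minimal simulation sketch in numpy; the sample size, coefficients, and the correlation between the two regressors are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# True model: Y = X1*b1 + X2*b2 + eps, where X2 is correlated with X1.
b1, b2 = 2.0, 3.0
x1 = rng.normal(size=(n, 1))
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=(n, 1))
y = b1 * x1[:, 0] + b2 * x2[:, 0] + rng.normal(size=n)

# Small model: regress Y on X1 only, omitting X2.
beta1_small, *_ = np.linalg.lstsq(x1, y, rcond=None)
print(beta1_small)  # ~[4.4]: biased by (X1'X1)^-1 X1'X2 * b2, roughly 0.8 * 3.0

# Full model: regress Y on both X1 and X2.
X = np.hstack([x1, x2])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_full)    # ~[2.0, 3.0]: both coefficients estimated without bias
```

The small model's estimate lands well away from the true \(b_1\), exactly as the bias term predicts, while the full model recovers both coefficients.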
Large to Small Solution
Suppose that the true model is given by:

$$Y = X_1\beta_1 + \epsilon$$
If we start out with a larger model, say by also including some parameters \(\beta_2\) that have no influence on the outcome, then we will initially estimate an overspecified model \(Y = X_1\beta_1 + X_2\beta_2 + \epsilon\).
An intermediate lemma
Let's define a matrix \(M_{X_1} = I_n - X_1(X_1'X_1)^{-1}X_1'\). This matrix, sometimes called the residual maker, is symmetric and idempotent, and we can use it to get an estimate of \(\beta_2\).
Start out with the original formula.
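The steps presumably ran along these lines: left-multiply by \(M_{X_1}\), then solve for \(\beta_2\) in the usual least squares way.

$$
\begin{aligned}
Y &= X_1\beta_1 + X_2\beta_2 + \epsilon \\
M_{X_1}Y &= M_{X_1}X_1\beta_1 + M_{X_1}X_2\beta_2 + M_{X_1}\epsilon \\
M_{X_1}Y &= M_{X_1}X_2\beta_2 + \epsilon \\
X_2'M_{X_1}Y &= X_2'M_{X_1}X_2\beta_2 \\
\hat{\beta}_2 &= (X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}Y
\end{aligned}
$$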
Notice that \(M_{X_1}X_1 = 0\) by the definition, and that, in expectation, \(M_{X_1}\epsilon = \epsilon\) and \(X_2'\epsilon = 0\), because \(\epsilon\) is normally distributed around zero and orthogonal to all of the explanatory variables.
The derivation for large to small
With this expression for \(\hat{\beta}_2\) we can analyse its expected value to confirm that it does not converge to any value other than zero.
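Substituting the true model \(Y = X_1\beta_1 + \epsilon\) into the expression from the lemma:

$$
\hat{\beta}_2 = (X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}(X_1\beta_1 + \epsilon) = (X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}X_1\beta_1 + (X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}\epsilon
$$

Taking expectations, the second term vanishes because \(E[\epsilon] = 0\), leaving:

$$
E[\hat{\beta}_2] = (X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}X_1\beta_1
$$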
Notice that $(X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}X_1\beta_1 = 0$ because $M_{X_1}X_1 = 0$. So we see that $\beta_2$ is correctly estimated as zero; what about $\beta_1$?
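Presumably the original argument mirrored the lemma with the roles of \(X_1\) and \(X_2\) swapped; a sketch: define \(M_{X_2} = I_n - X_2(X_2'X_2)^{-1}X_2'\) analogously, so that \(\hat{\beta}_1 = (X_1'M_{X_2}X_1)^{-1}X_1'M_{X_2}Y\). Substituting the true model:

$$
\hat{\beta}_1 = (X_1'M_{X_2}X_1)^{-1}X_1'M_{X_2}(X_1\beta_1 + \epsilon) = \beta_1 + (X_1'M_{X_2}X_1)^{-1}X_1'M_{X_2}\epsilon
$$

and taking expectations gives \(E[\hat{\beta}_1] = \beta_1\), with no bias term.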
So in this case we would remove the variables in \(\beta_2\) that have no influence, while our estimate of \(\beta_1\) does not have any bias. This is exactly what we want.
Conclusion
I've shown that starting with only a few variables and then adding more to the model carries a bias risk in linear models. One can imagine a similar thing happening in other types of models.