# Enjoy the Silence
I frequently write blog posts on why you need more than a grid search to properly judge a machine learning model. This blog post continues the topic by demonstrating yet another reason: a grid search doesn't tell you whether your model is robust against noise.
??? info "The full saga"
    If you're interested in the full saga, you can find the other posts here:

    - [Part 1](https://koaning.io/posts/high-on-probability-low-on-certainty/)
    - [Part 2](https://koaning.io/posts/goodheart-bad-metric/)
    - [Part 3](https://koaning.io/posts/inconvenient-feedback/)
    - [Part 4](https://koaning.io/posts/mean-squared-terror/)
    - [Part 5](https://koaning.io/posts/oops-and-optimality/)
    - [Part 6](https://koaning.io/posts/labels/)
## An Experiment
We're going to run a few experiments related to noise. The idea is that the training dataset you're learning from may not perfectly reflect what you'll encounter in production. We should expect drift, and we'd like our models to be robust to such changes. To simulate this, we will add noise to our data and see what happens.
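To make the mechanism concrete, here's a minimal sketch of what "adding noise" means in the experiments below. The array `X` here is just a stand-in for a feature matrix, and the scale of 0.5 is an arbitrary choice:

```python
import numpy as np

np.random.seed(42)
X = np.random.normal(size=(1000, 5))  # stand-in for a feature matrix

# "Adding noise" means perturbing every feature value with a Gaussian draw;
# the `scale` parameter controls how severe the perturbation is.
X_noisy = X + np.random.normal(loc=0.0, scale=0.5, size=X.shape)
```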
### Part 1: Noise on `X_valid`
Let's start with an experiment where we add noise to the validation set.
We will use the `make_classification` function from scikit-learn to generate a dataset. We'll split it into `X_train` and `X_valid`. The `X_train` set is kept as-is and will be used to train a logistic regression, a decision tree, and a random forest.
<details>
<summary><b>Code for training the models</b></summary>

```python
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate a simple dataset and split off a validation set.
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

classifiers = {
    "logreg": LogisticRegression(),
    "tree": DecisionTreeClassifier(),
    "forest": RandomForestClassifier(),
}

# Train every model on the clean training set.
for clf in classifiers.values():
    clf.fit(X_train, y_train)
```
</details>
Now that the models are done training, we'll check how well different algorithms
score on `X_valid`. We're not going to stop there though! We're *also* going to
check if the performance changes if we add noise to `X_valid`.
<details>
<summary><b>Code for collecting the results</b></summary>
```python
np.random.seed(42)

data = []
for noise in np.linspace(0.001, 10, 50):
    # Perturb the validation set with increasingly severe Gaussian noise.
    X_tmp = X_valid + np.random.normal(0, noise, X_valid.shape)
    for name, clf in classifiers.items():
        data.append({
            "classifier": name,
            "performance": np.mean(clf.predict(X_tmp) == y_valid),
            "noise": noise,
        })
```
</details>
<details>
<summary><b>Code for plotting the results</b></summary>
```python
import altair as alt

pltr = pd.DataFrame(data)

base = (
    alt.Chart(pltr)
    .mark_point()
    .encode(x="noise", y="performance", color="classifier")
)

# Overlay a smoothed loess trend per classifier on top of the raw points.
p = base + base.transform_loess("noise", "performance", groupby=["classifier"]).mark_line(size=4)
p.interactive().properties(width=700, title="effect of X_test noise on prediction")
```
</details>
Here's what the results of the experiment look like (note that you can click/drag/zoom):
<vegachart style="width: 100%" schema-url="chart1.json"></vegachart>
Each point represents a simulation. The point's color reflects the machine learning
algorithm used. The x-axis shows the amount of noise, and the y-axis shows the
accuracy on the validation set. To cut through the jitter, we've also added a smoothed
loess curve per algorithm to highlight the global pattern.
There are a few things to notice.
1. If there is no noise, the random forest seems to have the highest accuracy.
2. Logistic regression is the worst performing model if there's no noise.
3. If we increase the noise, the decision tree takes a big hit. The random
forest also has trouble, but the decision tree seems more brittle.
4. The logistic regression also suffers from the noise, but
it's the most robust out of the three. There's a considerable margin at higher noise levels.
You could certainly argue that this simulation does not reflect reality
and that one shouldn't draw too many conclusions from it about general machine learning
practice. That's fair.

But what happens once the model is in production? The new data coming in
will likely differ from the data you trained on. If that holds in general,
you want algorithms that can withstand change. That means that maybe, just maybe,
we should prefer a simple logistic regression because it's more robust against noise.
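One way to make that preference measurable is sketched below. The `accuracy_drop` helper is hypothetical (it's not part of the experiment code in this post), and the noise scale of 2.0 is an arbitrary assumption:

```python
def accuracy_drop(clf, X_valid, y_valid, scale=2.0, seed=42):
    """How much accuracy a trained model loses when Gaussian noise hits the validation set."""
    rng = np.random.RandomState(seed)
    clean = np.mean(clf.predict(X_valid) == y_valid)
    noisy = np.mean(clf.predict(X_valid + rng.normal(0, scale, X_valid.shape)) == y_valid)
    return clean - noisy

# A smaller drop suggests a model that is more robust to drift.
for name, clf in classifiers.items():
    print(name, accuracy_drop(clf, X_valid, y_valid))
```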
### Part 2: Noise on `X_train`
Let's re-run the experiment, but with a twist: this time we add noise to `X_train`
and keep `X_valid` as-is.
<details>
<summary><b>New code for getting the results</b></summary>
```python
np.random.seed(42)

data = []
for noise in np.linspace(0.001, 10, 50):
    # Fresh, untrained models and a fresh split for every noise level.
    classifiers = {
        "logreg": LogisticRegression(),
        "tree": DecisionTreeClassifier(),
        "forest": RandomForestClassifier(),
    }
    X_train, X_valid, y_train, y_valid = train_test_split(X, y)

    # This time the *training* data is perturbed; the validation set stays clean.
    X_tmp = X_train + np.random.normal(0, noise, X_train.shape)
    for clf in classifiers.values():
        clf.fit(X_tmp, y_train)

    for name, clf in classifiers.items():
        data.append({
            "classifier": name,
            "performance": np.mean(clf.predict(X_valid) == y_valid),
            "noise": noise,
        })
```
</details>
<vegachart style="width: 100%" schema-url="chart2.json"></vegachart>
This chart again shows that the logistic regression is more robust to noise,
but the differences are more drastic now! This makes sense when you consider
that the models can now learn the patterns of the noise instead of the
pattern that we're actually interested in. It's much easier to deal with noise
at prediction time if you're sure that you've learned from a clean dataset. If the
noise is in the data you're learning from, there's more opportunity to overfit on a wrong pattern.
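You can see that overfitting-on-noise effect directly with a small sketch (reusing the variables from the block above; the specific noise level of 5.0 is an arbitrary choice of mine):

```python
noise = 5.0
X_tmp = X_train + np.random.normal(0, noise, X_train.shape)

tree = DecisionTreeClassifier().fit(X_tmp, y_train)

# The tree can memorise the noisy training set almost perfectly...
train_acc = np.mean(tree.predict(X_tmp) == y_train)
# ...but the clean validation set reveals it learned noise, not signal.
valid_acc = np.mean(tree.predict(X_valid) == y_valid)
print(f"train: {train_acc:.2f}, valid: {valid_acc:.2f}")
```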
### Part 3: What about GridSearch?
What happens if we train on noisy data, but we also perform a grid-search on
the parameters? Will that improve things?
<details>
<summary><b>New code for getting the results</b></summary>
```python
from sklearn.model_selection import GridSearchCV
from tqdm import tqdm

np.random.seed(42)
noise_range = np.linspace(0.001, 10, 50)

for noise in tqdm(noise_range):
    # Wrap each model in a small grid search; the best settings are refit automatically.
    classifiers = {
        "logreg": GridSearchCV(LogisticRegression(),
                               param_grid={"C": [0.1, 1.0, 2.0, 3.0]}, n_jobs=-1),
        "tree": GridSearchCV(DecisionTreeClassifier(),
                             param_grid={"criterion": ["gini", "entropy"],
                                         "splitter": ["best", "random"]}, n_jobs=-1),
        "forest": GridSearchCV(RandomForestClassifier(),
                               param_grid={"n_estimators": [100, 200, 300, 400]}, n_jobs=-1),
    }
    X_train, X_valid, y_train, y_valid = train_test_split(X, y)

    X_tmp = X_train + np.random.normal(0, noise, X_train.shape)
    for clf in classifiers.values():
        clf.fit(X_tmp, y_train)

    # Keep appending to `data` from the previous experiment, tagged with the situation.
    for name, clf in classifiers.items():
        data.append({
            "classifier": name,
            "performance": np.mean(clf.predict(X_valid) == y_valid),
            "noise": noise,
            "situation": "noise-and-gridsearch",
        })
```
</details>
<vegachart style="width: 100%" schema-url="chart3.json"></vegachart>
The pattern persists. Logistic regression seems the most robust against noise,
despite not being the best-performing model when there's zero noise.
## Discussion
Again, I don't want to suggest that my simulation perfectly reflects reality.
It's extremely unlikely that your dataset resembles `make_classification`
from scikit-learn. It's equally absurd to assume that production will add neat Gaussian noise
on top of what you saw in your training data. So please keep that grain
of salt in mind.
But what I *do* hope to point out here is that a grid search assumes that your
data does not change. If your inbound data is likely to change in production, then
you'll need more than standard hyperparameter tuning to find a helpful model. Relying
on the grid search alone means you might select a model that cannot handle noise. *That*
is an effect that this simulation *does* demonstrate, and it's a valid concern.
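As a minimal sketch of what "more than standard hyperparameter tuning" could look like (the `worst_case_accuracy` helper, the noise levels, and the selection rule are all assumptions on my part, not a recipe from this post), you could pick the model with the best worst-case accuracy across a few simulated drift scenarios:

```python
def worst_case_accuracy(clf, X_valid, y_valid, noise_levels=(0.0, 1.0, 2.0), seed=42):
    """Lowest validation accuracy across a few simulated noise levels."""
    rng = np.random.RandomState(seed)
    scores = []
    for scale in noise_levels:
        X_noisy = X_valid + rng.normal(0, scale, X_valid.shape)
        scores.append(np.mean(clf.predict(X_noisy) == y_valid))
    return min(scores)

# Select the model that degrades the least, not the one that peaks the highest.
best = max(classifiers, key=lambda name: worst_case_accuracy(classifiers[name], X_valid, y_valid))
```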
It's also my impression that it's uncommon in the industry to check for this.
It'd be a swell idea to simulate the changes one might expect in production
to confirm that your model is robust against them. I don't want to suggest it's
a straightforward exercise. After all, it's *very* tricky to simulate the future
without making some assumptions. But not doing this exercise
might give you an optimal model... but only if there's absolute silence.
!!! quote "Appendix"
    Part of the inspiration for this blog post came from this gem of a paper from 2006:
    [Classifier Technology and the Illusion of Progress - David J. Hand](https://arxiv.org/pdf/math/0606441.pdf).
    It's a good read, especially the latter part, where there's a discussion of common failures in machine learning. Not much has changed since 2006.