
Enjoy the Silence

I frequently write blog posts on why you need more than grid search to judge a machine learning model properly. This blogpost continues the topic by demonstrating yet another reason: grid search alone isn't robust against noise.

The full saga

If you're interested in the full saga, you can find the other posts here:

Part 1 Part 2 Part 3 Part 4 Part 5 Part 6

An Experiment

We're going to do some experiments that are related to noise. The idea is that the training dataset you're learning from may not perfectly reflect what you might have in production. We should expect drift, and we'd like our models to be robust to changes. To simulate this, we will add noise to our data to see what might happen.

Part 1: Noise on X_valid

Let's start with an experiment where we add noise to the validation set. We'll use the make_classification function from scikit-learn to generate a dataset and split it into X_train and X_valid. The X_train set is kept as-is and is used to train a logistic regression, a decision tree, and a random forest.

Code for training the models.
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

classifiers = {
    "logreg": LogisticRegression(),
    "tree": DecisionTreeClassifier(), 
    "forest": RandomForestClassifier()
}

X_train, X_valid, y_train, y_valid = train_test_split(X, y)

classifiers["tree"].fit(X_train, y_train)
classifiers["forest"].fit(X_train, y_train)
classifiers["logreg"].fit(X_train, y_train);

Now that the models are done training, we'll check how well different algorithms score on X_valid. We're not going to stop there though! We're also going to check if the performance changes if we add noise to X_valid.

Code for collecting the results
np.random.seed(42)

data = []
for noise in np.linspace(0.001, 10, 50):
    X_tmp = (X_valid + np.random.normal(0, noise, X_valid.shape))
    for clf in classifiers.keys():
        data.append({
            "classifier": clf,
            "performance": np.mean(classifiers[clf].predict(X_tmp) == y_valid), 
            "noise": noise
        })
Code for plotting the results
import altair as alt 

pltr = pd.DataFrame(data)

base = (alt.Chart(pltr)
 .mark_point()
 .encode(x="noise", y="performance", color="classifier"))

p = (base + base.transform_loess('noise', 'performance', groupby=['classifier']).mark_line(size=4))
p.interactive().properties(width=700, title="effect of X_valid noise on prediction")

Here's what the results of the experiment look like (note that you can click/drag/zoom):

Each point represents a single simulation, and its color reflects the machine learning algorithm used. The x-axis shows the noise level, and the y-axis shows the accuracy on the validation set. To reduce the jitter, we've also added a smoothed loess curve per classifier to highlight the global pattern.

There are a few things to notice.

  1. If there is no noise, the random forest seems to have the highest accuracy.
  2. Logistic regression is the worst performing model if there's no noise.
  3. If we increase the noise, the decision tree takes a big hit. The random forest also has trouble, but the decision tree seems more brittle.
  4. The logistic regression also suffers from the noise, but it's the most robust out of the three. There's a considerable margin at higher noise levels.

You could certainly argue that this simulation does not reflect reality and that one shouldn't draw too many conclusions for general machine learning practices. That's fair.

But what will happen when the model is in production? New data that's coming in is likely going to differ from the old information that you trained on. If that's something that holds in general, you do want to have algorithms that can withstand changes. That means that maybe, just maybe, we may prefer a simple logistic regression because it's more robust against noise.

Part 2: Noise on X_train

Let's re-run the experiment, but with a twist: this time we add the noise to X_train and keep X_valid as-is.

New code for getting the results
np.random.seed(42)

data = []
for noise in np.linspace(0.001, 10, 50):
    classifiers = {
        "logreg": LogisticRegression(),
        "tree": DecisionTreeClassifier(), 
        "forest": RandomForestClassifier()
    }
    X_train, X_valid, y_train, y_valid = train_test_split(X, y)

    X_tmp = (X_train + np.random.normal(0, noise, X_train.shape))
    classifiers["tree"].fit(X_tmp, y_train)
    classifiers["forest"].fit(X_tmp, y_train)
    classifiers["logreg"].fit(X_tmp, y_train)

    for clf in classifiers.keys():
        data.append({
            "classifier": clf,
            "performance": np.mean(classifiers[clf].predict(X_valid) == y_valid), 
            "noise": noise
        })

This chart again shows that the logistic regression is more robust to noise, but the differences are more drastic now! This makes sense if you consider that the models can now learn the patterns of the noise instead of the pattern that we're interested in. It's much easier to deal with noise if you're sure that you've learned on a pure dataset. If the noise is on the data you're learning from, there's more opportunity to overfit on a wrong pattern.
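To make that last point concrete, here's a minimal sketch of the overfitting effect, assuming the X and y arrays from make_classification above are still in scope; the single noise level of 5.0 is an arbitrary pick purely for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

# Perturb only the training features, with an arbitrary (fairly large) amount of noise.
rng = np.random.default_rng(42)
X_noisy = X_train + rng.normal(0, 5.0, X_train.shape)

tree = DecisionTreeClassifier().fit(X_noisy, y_train)

# The tree can still memorize the noisy training points ...
train_acc = np.mean(tree.predict(X_noisy) == y_train)
# ... but that memorized "pattern" does not carry over to the clean validation data.
valid_acc = np.mean(tree.predict(X_valid) == y_valid)

print(f"train accuracy: {train_acc:.2f}, valid accuracy: {valid_acc:.2f}")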

Part 3: What about GridSearch?

What happens if we train on noisy data, but we also perform a grid-search on the parameters? Will that improve things?

New code for getting the results
from sklearn.model_selection import GridSearchCV
from tqdm import tqdm

np.random.seed(42)

noise_range = np.linspace(0.001, 10, 50)
data = []

for noise in tqdm(noise_range):
    classifiers = {
        "logreg": GridSearchCV(LogisticRegression(), param_grid={"C": [0.1, 1.0, 2.0, 3.0]}, n_jobs=-1),
        "tree": GridSearchCV(DecisionTreeClassifier(), param_grid={"criterion": ["gini", "entropy"], "splitter": ["best", "random"]}, n_jobs=-1),
        "forest": GridSearchCV(RandomForestClassifier(), param_grid={"n_estimators": [100, 200, 300, 400]}, n_jobs=-1),
    }
    X_train, X_valid, y_train, y_valid = train_test_split(X, y)

    X_tmp = (X_train + np.random.normal(0, noise, X_train.shape))
    classifiers["tree"].fit(X_tmp, y_train)
    classifiers["forest"].fit(X_tmp, y_train)
    classifiers["logreg"].fit(X_tmp, y_train)

    for clf in classifiers.keys():
        data.append({
            "classifier": clf,
            "performance": np.mean(classifiers[clf].predict(X_valid) == y_valid), 
            "noise": noise,
            "situation": "noise-and-gridsearch"
        })

The pattern persists. Logistic regression seems the most robust against noise, despite not being the best performing model when there's zero noise.
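If you prefer a table over a chart, a quick pandas summary of the collected data list (a sketch, assuming the loops above have run) shows the same pattern: accuracy per classifier, averaged over buckets of noise.

import pandas as pd

df = pd.DataFrame(data)

# Bucket the noise levels (arbitrary bucket edges) and average the accuracy
# per classifier; the "robust at high noise" pattern is then easy to read off.
df["noise_bucket"] = pd.cut(df["noise"], bins=[0, 1, 5, 10], labels=["low", "mid", "high"])
print(df.pivot_table(index="classifier", columns="noise_bucket",
                     values="performance", aggfunc="mean").round(3))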

Discussion

Again, I don't want to suggest that my simulation perfectly reflects reality. It's extremely unlikely that your dataset is similar to make_classification from scikit-learn, and it's equally absurd to assume that production will add neat Gaussian noise on top of what you saw in your training data. So please keep that grain of salt in mind.

But what I do hope to point out here is that a grid search assumes that your data does not change. If your inbound data is likely to change in production, standard hyperparameter tuning alone might select a model that cannot handle noise; you'll need more than that to find a helpful model. That is an effect this simulation does demonstrate, and it's a valid concern. It's also my impression that checking for this is uncommon in the industry.

It'd be a swell idea to simulate the changes one might expect in production to confirm that your model is robust against them. I don't want to suggest it's a straightforward exercise; after all, it's very tricky to simulate the future without making some assumptions. But skipping the exercise might leave you with an optimal model... one that's only optimal if there's absolute silence.
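As a rough starting point, a check like the sketch below could be bolted onto any fitted model. The noise_robustness helper, the Gaussian perturbation, and the noise levels are all assumptions about what production drift might look like, not a standard API.

import numpy as np

# A hypothetical helper, not part of scikit-learn.
def noise_robustness(model, X_valid, y_valid, noise_levels=(0.1, 0.5, 1.0, 2.0), seed=42):
    """Score a fitted model on increasingly perturbed copies of the validation set."""
    rng = np.random.default_rng(seed)
    scores = {}
    for noise in noise_levels:
        X_shifted = X_valid + rng.normal(0, noise, X_valid.shape)
        scores[noise] = np.mean(model.predict(X_shifted) == y_valid)
    return scores

# For example, compare the classifiers fitted earlier under simulated drift.
for name, clf in classifiers.items():
    print(name, noise_robustness(clf, X_valid, y_valid))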

Appendix

Part of the inspiration for this blogpost came from this gem of a paper from 2006:

Classifier Technology and the Illusion of Progress - David J. Hand.

It's a good read, especially the latter part where there's a discussion of common failures in machine learning. Not much has changed since 2006.