What Overfitting Looks Like

I've written before on why grid-search is not enough (exhibit A, exhibit B). In this document, I'd like to run an experiment to demonstrate the strongest reason why I believe this is true.

We need to talk about, and learn to listen to, feedback.

Applications over Models

Let's consider a recruitment application. We'll pretend that it is for a typical job and that we are a recruitment agency that has experience in hiring for this position. This means that we have a dataset of characteristics of people who applied, including an indicator of whether they ended up being a good fit for the job. The use-case is clear: we will make a classifier that will attempt to predict job performance based on the characteristics of the applicants.

At this point, I could get a dataset, make a scikit-learn pipeline, and let the grids search. But instead, I would like to think one step further in the application layer. After all, we're not going to be training this model once but many times. Over the lifetime of this application, we'll gather new data, and we will retrain accordingly.

First, the model is trained on the original dataset. Next, we will receive a new batch of (let's say \(c\)) candidates. From this batch, we will select the top (let's say \(n\)) candidates according to the model. Only these selected candidates give us an actual label, and this is our new data. This new data can be used to retrain the model, such that the next round of candidates is predicted with even more accuracy.

This cycle is going to repeat itself over and over.
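One round of this cycle can be sketched in a few lines. This is only an illustration under my own assumptions, not the code behind the experiment: `select_top`, `run_round`, the `score` function and the `observe_label` callback are hypothetical names, and the toy batch uses a single made-up number per candidate.

```python
from typing import Callable, List, Sequence, Tuple


def select_top(scores: Sequence[float], n: int) -> List[int]:
    """Return the indices of the n highest scores (ties keep their original order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


def run_round(batch: Sequence, score: Callable, observe_label: Callable, n: int) -> List[Tuple]:
    """One pass of the loop: score c candidates, select n, learn only their labels."""
    scores = [score(candidate) for candidate in batch]
    hired = select_top(scores, n)
    # Only the selected candidates ever produce a label; the rest stay unknown.
    return [(batch[i], observe_label(batch[i])) for i in hired]


# Toy usage: c = 4 hypothetical candidates, n = 2 get selected.
batch = [0.2, 0.9, 0.4, 0.7]
new_data = run_round(batch, score=lambda x: x, observe_label=lambda x: x > 0.5, n=2)
print(new_data)
```

Note that `new_data` never contains the two rejected candidates, which is exactly the gap the rest of this post is about.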

Concerns

The crux of all of this is that we can only see the actual label of the candidates to whom we give a job. The rest do not get the job, and we can never learn if they would have been good at it.

Let's just list a few concerns here:

if the starting dataset is biased then the predictions will be biased
if the predictions are biased then the candidates who will be selected will be biased
these selected candidates are the only ones who feed new data back into the mechanism
this keeps the dataset that the algorithm learns from biased

You might observe how the feedback mechanism is not, and cannot be, captured by the standard grid-search. Cross-validation does not protect against biased data.
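To make that last claim concrete, here is a small sketch using scikit-learn. Everything about the data is an assumption on my part: four hypothetical Gaussian blobs (two groups of good candidates, two of poor ones), of which the biased history only contains one good and one poor group. The point is only to compare a cross-validated score on the biased sample with the score on the full population.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)


def blob(center, size=100):
    """Draw `size` 2-D points around `center` with a spread of 0.5."""
    return rng.normal(loc=center, scale=0.5, size=(size, 2))


# Hypothetical population: two groups of good candidates, two of poor ones.
good_a, good_b = blob((2, 2)), blob((-2, -2))
poor_a, poor_b = blob((2, -2)), blob((-2, 2))

X_all = np.vstack([good_a, good_b, poor_a, poor_b])
y_all = np.array([1] * 200 + [0] * 200)

# Biased history: we only ever observed one good group and one poor group.
X_biased = np.vstack([good_a, poor_a])
y_biased = np.array([1] * 100 + [0] * 100)

model = KNeighborsClassifier(n_neighbors=5)
cv_accuracy = cross_val_score(model, X_biased, y_biased, cv=5).mean()
true_accuracy = model.fit(X_biased, y_biased).score(X_all, y_all)

print(f"cross-validated accuracy on the biased data: {cv_accuracy:.2f}")
print(f"accuracy on the full population:             {true_accuracy:.2f}")
```

On toy data like this, the cross-validated score should look excellent while the score on the full population is much worse, because every fold is drawn from the same biased sample.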

Which part of this system deserves more attention?

It might make sense to not focus on how we select a model (the red bits) but rather on how we collect data for it (the green bits). So I've set up an experiment to explore this.

Experiment

I've written up some code that simulates the system that's described above. We start with a biased dataset and train a k-nearest-neighbor classifier on it. Once trained, the model receives a set of 10 random candidates from the data we have not seen yet. Next, we select whichever candidates the algorithm deems the best performing. These will be the candidates that actually get a job offer, and these candidates will be able to give a label to us. The actual label of these candidates is logged, and the model is retrained. This is repeated over and over until we've exhausted the dataset.
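A rough sketch of such a simulation is shown here. It is not the original experiment code: the function name `simulate`, the choice of two hires per batch (`n_select=2`), five neighbors, and the toy data in the usage part are all assumptions made for illustration. The starting set needs at least `n_neighbors` points for the classifier to work.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def simulate(X, y, start_idx, batch_size=10, n_select=2, n_neighbors=5, seed=0):
    """Greedy feedback loop; returns a boolean mask of candidates whose label was observed."""
    rng = np.random.default_rng(seed)
    seen = np.zeros(len(X), dtype=bool)
    seen[start_idx] = True                           # the biased starting dataset
    pool = rng.permutation(np.flatnonzero(~seen))    # unseen candidates in random order

    for start in range(0, len(pool), batch_size):
        batch = pool[start:start + batch_size]
        model = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X[seen], y[seen])
        # Probability of the positive class; zero if no positive label was seen yet.
        classes = list(model.classes_)
        proba = model.predict_proba(X[batch])
        p_good = proba[:, classes.index(1)] if 1 in classes else np.zeros(len(batch))
        # Greedy selection: hire the candidates the model likes best.
        hired = batch[np.argsort(-p_good, kind="stable")[:n_select]]
        seen[hired] = True                           # only hired candidates reveal a label
    return seen


# Toy usage on made-up data: positives live in two regions, the start covers one side only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (np.abs(X[:, 0]) > 1).astype(int)
start_idx = np.flatnonzero(X[:, 0] > 0)[:30]
seen = simulate(X, y, start_idx)
print(f"labels observed for {seen.sum()} of {len(X)} candidates")
```

The returned mask is what separates the candidates the system has a label for from the candidates it never learned anything about.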

We can then calculate lots of metrics. In general there are two datasets of interest. The first dataset contains all the candidates for which the system receives a label.

These points have been seen by the algorithm.

The second dataset contains all the candidates. For a subset of these candidates we receive a label so there is overlap with the previous dataset.

The goal of the experiment will be to see what the difference in performance is between these two datasets.
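As a minimal sketch of that comparison, assume we have the true labels, the model's predictions, and a boolean mask marking which candidates the system received a label for. The helper names `accuracy` and `score_gap`, the use of plain accuracy as the metric, and the tiny arrays below are all made up for illustration.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def score_gap(y_true, y_pred, seen):
    """Accuracy on the labelled (seen) candidates versus accuracy on everyone."""
    y_true, y_pred, seen = map(np.asarray, (y_true, y_pred, seen))
    return {
        "seen": accuracy(y_true[seen], y_pred[seen]),
        "all": accuracy(y_true, y_pred),
    }


# Made-up example: the model is perfect on what it has seen, wrong on what it has not.
y_true = [1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0]
seen = [True, True, True, True, False, False]
gap = score_gap(y_true, y_pred, seen)
print(gap)
```

A large difference between the two numbers means the system looks better from the inside than it really is.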

Obvious Dataset

I'll start with a dataset that has two clusters of candidates that have a positive label. In the plots below, you'll see the entire (true) dataset, the subset that is taken as a biased start, the probability assignment by the algorithm and the greedily selected candidates from the model's predictions.

I've also listed the scores from two sources. One set is from all the data that our system actually observes. This set contains the biased dataset we started with and the labels of the selected candidates. The other set is from all the data, including the candidates that we have not seen.