Plenty of Bad Labels

I read a fun paper the other day.

The short version is that common benchmarking datasets contain bad labels. It was already well known that in MNIST a 5 can sometimes look like an 8, so those mistakes might be forgiven. This paper, however, shows that the errors go a fair bit beyond that.

The paper builds an algorithm to detect these mislabelled instances. The user specifies a percentage \(\alpha\) of images they are willing to re-evaluate manually, and the algorithm then tries to surface the most suspicious candidates within that budget.

Their approach is straightforward. A model \(g\) is trained and applied to each datapoint \(x_i\). The model output \(g(x_i)\) is then compared with the given label \(y_i\): the confidence the model assigns to that label can be used to sort examples so that a user checks the least trustworthy labels first.
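To make that concrete, here's a minimal sketch of the idea rather than the paper's exact method. It scores each example by the probability a classifier assigns to its given label and returns the least-confident \(\alpha\) fraction for manual review. The logistic-regression model and the `flag_suspicious_labels` helper are placeholders of my own, and integer class labels are assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_suspicious_labels(X, y, alpha=0.05):
    """Return indices of the alpha fraction of examples whose given label
    the model is least confident about (candidates for manual review)."""
    model = LogisticRegression(max_iter=1000)
    # Out-of-fold predicted probabilities, so each point is scored by a
    # model that never trained on its (possibly wrong) label.
    probs = cross_val_predict(model, X, y, cv=5, method="predict_proba")
    # Confidence the model assigns to the *given* label y_i of each x_i.
    confidence_in_given_label = probs[np.arange(len(y)), y]
    # Lowest confidence first: these are the review candidates.
    order = np.argsort(confidence_in_given_label)
    n_review = int(np.ceil(alpha * len(y)))
    return order[:n_review]
```

The out-of-fold predictions matter: if the model were scored on data it had trained on, it would tend to simply reproduce the noisy labels instead of flagging them.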

It's a neat trick, but given that the state of the art is often 99%+ on some of these datasets, it might be time for a few of these models to be re-run.