The worst kind of duplicate

Duplicates in a dataset can be very bad, especially when they appear in both the train and validation set: the validation score then partly rewards memorization.

But, as it turns out, there are even worse duplicates out there. You can also have duplicates with inconsistent labels, as demonstrated by this paper from Isha Garg and Kaushik Roy. Here's one of the main figures from the paper:

You could argue that these aren't the worst kind of mistakes; whales and sharks can look alike, just like otters and seals. But imagine being an algorithm and being presented with "two truths". That'd mess up your gradient signal for sure.

The paper mentions a trick to find these examples, which involves measuring the curvature of the loss function around a datapoint. It's interesting that this approach works, but I'd imagine that simpler techniques would also work. Finding duplicate images via imagehash should be relatively easy, and once you spot duplicates with different labels, you're done.
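Here's a rough sketch of what that simpler approach could look like. This is my own illustration, not the paper's method. It assumes the dataset is available as `(path, label)` pairs. The helper names `perceptual_hash` and `find_conflicting_duplicates` are made up for this example, and `phash` is just one of the hash functions that imagehash offers.

```python
from collections import defaultdict


def perceptual_hash(path):
    """Hash one image file. Assumes Pillow and imagehash are installed."""
    # Imported lazily so the grouping logic below works without these packages.
    from PIL import Image
    import imagehash

    with Image.open(path) as img:
        # str() gives the hex form of the hash, which is easy to log and compare.
        return str(imagehash.phash(img))


def find_conflicting_duplicates(labelled_paths, hash_fn=perceptual_hash):
    """Return {hash: {label: [paths]}} for hashes that show up with more than one label."""
    groups = defaultdict(lambda: defaultdict(list))
    for path, label in labelled_paths:
        groups[hash_fn(path)][label].append(path)

    # Keep only the hashes whose images disagree on the label.
    return {
        h: dict(by_label)
        for h, by_label in groups.items()
        if len(by_label) > 1
    }
```

Calling `find_conflicting_duplicates(pairs)` on a list of `(path, label)` pairs returns the groups worth inspecting by hand. Note that this only catches images whose hashes match exactly. For near-duplicates you'd probably want to compare hashes by Hamming distance instead, which imagehash supports by subtracting two hash objects.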