Label errors in Tobacco3482

It's been a while, but my arXiv scraper found a new dataset with label errors. Based on this paper it seems that 11.7% of the examples need a new label (or no label at all) and that 16.7% of the examples can reasonably be assigned to more than one class.

Here's a figure from the paper that helps explain what the dataset entails:

The dataset is all about turning photocopies of text documents into categories. One of these categories can be "email" and another can be "form" but there is also "report" and "memo". But you can kind of already smell a problem on the horizon there. What if the report is an email? What if the email is a memo? These classes are not mutually exclusive, and I think that if they were, some of the labelling mishaps could have been prevented.

Unfortunately, as is often the case with these things, all the benchmarks on this dataset should be reconsidered. To quote the paper:

If we consider these mistakes as correct, the DiT model’s accuracy would increase to 89.7%. This marks a 5.7% improvement from the original 84.0% accuracy and brings it closer to a more recent method that achieved 90.7% accuracy using four times more parameters
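To make that re-scoring idea concrete, here is a minimal sketch of how I'd picture it. It is not the authors' evaluation code: the function names, the `acceptable` sets, and the five toy documents are all made up for illustration. The assumption is that a prediction counts as a hit if it matches the original label or any label that a relabelling pass deemed reasonable for that document.

```python
def accuracy(predictions, labels):
    """Plain accuracy: fraction of predictions equal to the given label."""
    hits = sum(pred == label for pred, label in zip(predictions, labels))
    return hits / len(labels)


def lenient_accuracy(predictions, original, acceptable):
    """Accuracy where a prediction also counts if it is in the set of
    acceptable labels for that document (hypothetical relabelling output)."""
    hits = sum(
        pred == orig or pred in ok
        for pred, orig, ok in zip(predictions, original, acceptable)
    )
    return hits / len(original)


# Toy data, invented for this example only.
predictions = ["email", "memo", "report", "form", "memo"]
original = ["email", "report", "report", "form", "email"]
# The second document could reasonably be a report *or* a memo.
acceptable = [{"email"}, {"report", "memo"}, {"report"}, {"form"}, {"email"}]

strict = accuracy(predictions, original)
lenient = lenient_accuracy(predictions, original, acceptable)
print(f"strict: {strict:.1%}, lenient: {lenient:.1%}")
```

On this toy data the strict score counts three of five predictions as correct and the lenient score counts four, because the "memo" prediction on the second document is no longer treated as a mistake. The model is the same in both cases and only the labels it is judged against have changed.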

It's a bit of a bummer, but it's also a good reminder that the data you use is rarely as clean as you might hope it is. If you are curious, the authors of the paper have made this GitHub repo with extra information and the corrected labels.