For a while, I’ve been interested in studying annotators and their labels more closely. Many datasets, however, don’t provide this information, so I started looking for ones that do.
Thankfully, I found an article titled “On Releasing Annotator-Level Labels and Information in Datasets” that makes the case for sharing annotator information. The authors also list a few datasets that do provide it.
- There’s the Gab Hate Speech Corpus, which comes with an article.
- There’s Google Emotions, which has so many inconsistencies that it led me to write a package for it. The data also comes with a paper.
- There’s the Age Sentiment dataset, which is part of an age-discrimination research paper.
I figured I’d keep a running list. There are a lot of tricky issues related to annotator disagreement, and it’d be great to have more datasets to study.