
Typo/Spelling Error Dataset

I was searching around for typo/spelling datasets, since these may be useful for testing robustness for text classifiers. I found two interesting sources.

GitHub

GitHub has a typo corpus, which comes with a paper and can be downloaded here. It's an interesting dataset, since it seems they mined commits on public README files to find common typos.

Figure from the paper.

It's a neat idea! What's also cool about this dataset is that it comes with non-English typos as well.

Microsoft

There's another dataset of spelling mistakes, collected by Microsoft and found here, that did something interesting with Mechanical Turk. The data was derived automatically by processing keystroke logs. So they probably gave users a task to get them to type, while the main thing they were interested in was the self-corrections. Interesting!

To give an impression, here are a few examples from the dataset:

.beause     .because
abay        away
acheve      achieve
accient     accident
acciient    accident

The English dataset found in the link has 44104 of these pairs.
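
Pairs in this shape are straightforward to load. Here's a minimal sketch, assuming each line holds a misspelling followed by its correction, separated by whitespace. The actual file layout may differ, and `parse_pairs` is just a name I made up for illustration.

```python
from collections import defaultdict


def parse_pairs(lines):
    """Map each correct word to the misspellings observed for it.

    Assumes every line is "<misspelling><whitespace><correction>";
    lines that don't split into exactly two fields are skipped.
    """
    misspellings = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # blank or malformed line
        typo, correct = parts
        misspellings[correct].append(typo)
    return dict(misspellings)


# The five example rows shown above, used as stand-in data.
sample = [
    ".beause .because",
    "abay    away",
    "acheve  achieve",
    "accient accident",
    "acciient    accident",
]

pairs = parse_pairs(sample)
print(pairs["accident"])  # ['accient', 'acciient']
```

Keying on the correct word makes it easy to ask "which typos have people actually made for this word?", which is the lookup you'd want when injecting noise into clean text.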

Kind of Weird?

There's one thing that I wonder about, though: why did Microsoft bother running this on Mechanical Turk? I'd assume MS Word would be a more impressive source of typo data. If anybody knows, I'd love to hear about it!

Use Case

These will be fun datasets to play with. There are many augmentation libraries that can simulate keyboard typos, but they are synthetic. The empirical spelling errors from these datasets won't cover everything either, since plenty of phonetic errors won't have been corrected, but I can imagine them being a fun alternative to purely simulated keystrokes.
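
To make the robustness-testing idea concrete, here's a rough sketch of swapping words for misspellings that people actually produced. Everything in it is my own assumption rather than something shipped with either dataset: the `inject_typos` helper, the `rate` parameter, and the tiny lookup table built from the examples above.

```python
import random


def inject_typos(text, misspellings, rate=0.3, seed=None):
    """Replace known words with an observed misspelling with probability `rate`.

    `misspellings` maps a correct (lowercase) word to a list of typos.
    Tokenization is a plain whitespace split, so punctuation attached to
    a word prevents a match and replaced words lose their capitalization.
    """
    rng = random.Random(seed)
    out = []
    for token in text.split():
        options = misspellings.get(token.lower())
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(token)
    return " ".join(out)


# Hypothetical lookup table, built from the example rows in this post.
misspellings = {
    "accident": ["accient", "acciient"],
    "achieve": ["acheve"],
    "away": ["abay"],
}

clean = "the accident happened far away"
noisy = inject_typos(clean, misspellings, rate=1.0, seed=0)
print(noisy)
```

You could then run a text classifier on both `clean` and `noisy` and compare the predictions. A fixed `seed` keeps the perturbations reproducible between runs.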
