Urban Dictionary Embeddings

When you train embeddings, as with many things in data science, the data you train on matters more than the algorithm you use. That's why I was immediately interested when I learned that a group of researchers trained custom embeddings on Urban Dictionary.

In case you're not familiar, Urban Dictionary is a user-contributed dictionary of slang. It helps folks get familiar with abbreviations, internet slang, consultancy jargon, and even some terms that double as insults. It's an interesting project for sure, even though it certainly has a problem with internet hate speech.

The paper can be found here and the embeddings can be found here.


One of the interesting goals the authors had for these embeddings is sarcasm detection. They also used the embeddings as a featurization step for a Twitter sentiment task, where they seem to be on par with fastText.
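A common way to turn word embeddings into features for a sentiment classifier is to mean-pool the vectors of the words in a tweet. The sketch below illustrates that idea with a tiny made-up embedding table (the words, dimensions, and values are hypothetical stand-ins for the real Urban Dictionary vectors, not taken from the paper):

```python
import numpy as np

# Hypothetical 4-dimensional vectors standing in for the real embeddings.
embeddings = {
    "lol": np.array([0.1, 0.3, -0.2, 0.5]),
    "smh": np.array([-0.4, 0.1, 0.2, 0.0]),
    "great": np.array([0.6, -0.1, 0.3, 0.2]),
}

def featurize(tweet, table, dim=4):
    """Mean-pool the word vectors of a tweet; unknown words are skipped."""
    vecs = [table[w] for w in tweet.lower().split() if w in table]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

features = featurize("lol great", embeddings)
```

The resulting fixed-length vector can then be fed to any off-the-shelf classifier, which is presumably how the comparison against fastText was set up.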


The embeddings themselves are still 4GB, which is too big for comfort in a lot of applications, but they might be worth playing with when you're dealing with a task that involves a lot of slang. Another alternative might be the sense2vec implementation. It might also capture some slang, but it was trained on a Reddit corpus instead.
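Once you have slang embeddings loaded, the typical first experiment is a nearest-neighbour lookup via cosine similarity. A minimal sketch, again using a toy in-memory table (the words and vectors are made up for illustration):

```python
import numpy as np

# Toy 3-dimensional vectors standing in for the (4GB) slang embeddings.
vocab = {
    "lit": np.array([0.9, 0.1, 0.0]),
    "fire": np.array([0.8, 0.2, 0.1]),
    "boring": np.array([-0.7, 0.3, 0.2]),
}

def most_similar(word, table):
    """Rank the other vocabulary words by cosine similarity to `word`."""
    q = table[word]

    def cos(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))

    return sorted(
        ((w, cos(v)) for w, v in table.items() if w != word),
        key=lambda pair: pair[1],
        reverse=True,
    )

ranking = most_similar("lit", vocab)
```

With real embeddings of this size, you would likely load only the top-N most frequent vectors rather than the full 4GB file.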


A downside of both of these approaches is that slang goes stale very quickly. Slang that was hip last year is, well, so last year now.