TIL: Gensim Koan

Ten Year Old Bug

Vincent Warmerdam koaning.io

A while ago one of my managers shared a paper on Slack. It’s a paper about a ten-year-old software bug in gensim. There are many ways to train word embeddings, but some common advice may be based on a software bug. To quote the abstract:

It is a common belief in the NLP community that continuous bag-of-words (CBOW) word embeddings tend to underperform skip-gram (SG) embeddings. We find that this belief is founded less on theoretical differences in their training objectives but more on faulty CBOW implementations in standard software libraries such as the official implementation word2vec.c and Gensim. We show that our correct implementation of CBOW yields word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks while being more than three times as fast to train.

Ouch. The bug itself is also a pretty big one.

Specifically, we find that in these implementations, the gradient for source embeddings is incorrectly multiplied by the context window size, resulting in incorrect weight updates and inferior performance.
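To make the bug concrete, here is a minimal NumPy sketch of a single CBOW negative-sampling step. All names (`context_vecs`, `target_vec`, `lr`) are illustrative and not gensim’s actual API; the point is only to contrast the faulty update with the corrected one. In CBOW the hidden layer is the *average* of the context vectors, so the true gradient for each context word carries a factor of 1/window — the buggy implementations dropped that factor, which is the same as multiplying the gradient by the context window size.

```python
import numpy as np

# Hypothetical sketch of one CBOW update step (not gensim's code).
rng = np.random.default_rng(0)
dim, window = 5, 4

context_vecs = rng.normal(size=(window, dim))  # input embeddings of context words
target_vec = rng.normal(size=dim)              # output embedding of the target word
lr, label = 0.025, 1.0                         # learning rate, positive example

h = context_vecs.mean(axis=0)                  # CBOW hidden layer: average of context
score = 1.0 / (1.0 + np.exp(-h @ target_vec))  # sigmoid score for the pair
g = lr * (label - score)                       # scalar error term

# Buggy update (word2vec.c / gensim): every context vector receives the
# full gradient, so summed over the context it is `window` times too large.
buggy_update = g * target_vec

# Corrected update (as in the paper): divide by the number of context
# words, matching the derivative of the averaged hidden layer.
correct_update = g * target_vec / window
```

Note that `buggy_update` is exactly `window * correct_update`: a positive scalar multiple of the true gradient. That is why, as the authors observe below, the faulty step is still a descent direction — it just behaves like an inflated learning rate.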

The bug also has an effect on the hyperparameters.

We believe that Gensim requires a smaller learning rate for CBOW, because a smaller learning rate is more likely to reduce the stochastic loss, since the incorrect update is still a descent direction, although not following the negative stochastic gradient.

The article then goes on to show that their implementation is faster to train while maintaining predictive performance.


There are a few lessons you might take from this. One is that open source software can still be fragile. While true, I think there’s an even better lesson: if the code hadn’t been open source, this error might never have been found.

Then again, another big lesson is that numeric algorithms are very tricky to debug.