I stumbled on an interesting paper about stopwords. Stop words are words that are typically filtered out before processing natural language data. Usually these are the most common words in a language, but there is no single universal list that all common open-source packages share. It turns out that stopword lists are surprisingly inconsistent across these packages. The paper is a bit dated (2018), but it offers a fair critique and is a very good read.
To quote the abstract:
Open-source software (OSS) packages for natural language processing often include stop word lists. Users may apply them without awareness of their surprising omissions (e.g. hasn’t but not hadn’t) and inclusions (e.g. computer), or their incompatibility with particular tokenizers. Motivated by issues raised about the Scikit-Learn stop list, we investigate variation among and consistency within 52 popular English-language stop lists, and propose strategies for mitigating these issues.
The paper dives into the scikit-learn list in particular. The project used a copy of the list from the Glasgow Information Retrieval Group, but the scikit-learn variant wasn't without issues.
- The list seems to contain spelling errors. The word fifty appeared in the list but was originally misspelled as fify.
- The list contains words that arguably shouldn't be stopwords for a general audience. The word computer was on the stop list, as were system and cry.
- The list seems to miss a few words. The list contains has but omits does. Likewise, it contains very but omits really.
- The list contains words with fewer than two characters (I, a), which the default tokeniser that scikit-learn provides ignores anyway.
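The last point is easy to check by hand. Below is a minimal sketch, assuming the regular expression mirrors the documented default `token_pattern` of scikit-learn's `CountVectorizer`; the `tokenize` helper and the tiny stop list are made up for illustration and are not scikit-learn's own code.

```python
import re

# Assumption: this mirrors the documented default `token_pattern` of
# scikit-learn's CountVectorizer, which keeps tokens of 2+ word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and return tokens of two or more word characters."""
    return DEFAULT_TOKEN_PATTERN.findall(text.lower())


# Hypothetical excerpt of a stop list, for illustration only.
stop_words = {"i", "a", "has", "very"}

tokens = tokenize("I bought a very old computer")
print(tokens)  # ['bought', 'very', 'old', 'computer']

# Stop words that the tokeniser can never emit are dead entries in the list.
unreachable = sorted(w for w in stop_words if tokenize(w) != [w])
print(unreachable)  # ['a', 'i']
```

The single-character entries never get a chance to match, because the tokeniser drops those tokens before the stop list is ever consulted.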
While some of these issues have been addressed, others haven't. The scikit-learn maintainers argued that making big changes to the word list would break reproducibility across software versions, so some of the issues remain. A main takeaway for me is how hard it is to offer (and maintain) an appropriate stop word list for a global community.
Scikit-Learn is one library, but the paper explores many packages. The authors compare 52 lists and find that ~40% of the words across these lists appear in only one list.
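To make that statistic concrete, here is a small sketch of how such a share could be computed. The three lists and their package names are invented toy data, not the lists from the paper, so the resulting number only illustrates the calculation.

```python
from collections import Counter

# Hypothetical toy stop lists; the names and contents are made up.
stop_lists = {
    "package_a": {"the", "a", "has", "computer"},
    "package_b": {"the", "a", "has", "does"},
    "package_c": {"the", "a", "really"},
}

# Count in how many lists each word appears.
list_counts = Counter(word for words in stop_lists.values() for word in words)

# Words that appear in exactly one list.
singletons = sorted(word for word, n in list_counts.items() if n == 1)
share = len(singletons) / len(list_counts)

print(singletons)  # ['computer', 'does', 'really']
print(share)       # 0.5
```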
A big part of the issue is that word lists are highly dependent on their associated tokeniser. A tokeniser that splits on punctuation might take the word "hasn't" and turn it into [hasn, t], while the spaCy tokeniser would turn it into [has, n't].
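The effect of such a mismatch can be sketched with two toy tokenisers. Both functions and the two-word stop list are hypothetical stand-ins written for this example; neither is the tokeniser of a real package.

```python
import re


def tokenize_keep_apostrophes(text):
    """Toy tokeniser that treats apostrophes as part of a word."""
    return re.findall(r"[\w']+", text.lower())


def tokenize_split_apostrophes(text):
    """Toy tokeniser that breaks tokens at every non-word character."""
    return re.findall(r"\w+", text.lower())


# Hypothetical stop list that was written with the first tokeniser in mind.
stop_words = {"the", "hasn't"}


def remove_stop_words(text, tokenize):
    """Tokenise the text and drop every token found in the stop list."""
    return [token for token in tokenize(text) if token not in stop_words]


sentence = "The parcel hasn't arrived"

kept_matching = remove_stop_words(sentence, tokenize_keep_apostrophes)
print(kept_matching)    # ['parcel', 'arrived']

kept_mismatched = remove_stop_words(sentence, tokenize_split_apostrophes)
print(kept_mismatched)  # ['parcel', 'hasn', 't', 'arrived']
```

With the second tokeniser the stop word is never seen as a whole, so its fragments slip through the filter as if they were content words.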
Another big issue is that a stopword may only become a stopword given a specific context. A list of stopwords for a medical document is not going to be useful in a newspaper use-case and vice versa.
It's that last bit that really rings true to me. Stopword lists are usually blindly copied and this is where the trouble can start. To quote the conclusion of the paper:
Stop word lists are a simple but useful tool for managing noise, with ubiquitous support in natural language processing software. We have found that popular stop lists, which users often apply blindly, may suffer from surprising omissions and inclusions, or their incompatibility with particular tokenizers. Many of these issues may derive from generating stop lists using corpus statistics. We hence recommend better documentation, dynamically adapting stop lists during preprocessing, as well as creating tools for stop list quality control and automatically generating stop lists.
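As a rough illustration of what adapting a stop list to the corpus could look like, here is a naive document-frequency heuristic. The `frequent_terms` function, its `min_doc_share` threshold and both toy corpora are assumptions made for this sketch. It is not the method proposed in the paper, which itself cautions that corpus statistics may be behind many of the odd entries in existing lists.

```python
from collections import Counter


def frequent_terms(documents, min_doc_share=0.8):
    """Return terms that occur in at least `min_doc_share` of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    return {
        term
        for term, n in doc_freq.items()
        if n / len(documents) >= min_doc_share
    }


# Hypothetical toy corpora, invented for illustration.
medical_notes = [
    "patient reports mild headache",
    "patient denies chest pain",
    "patient prescribed ibuprofen",
]
news_items = [
    "council approves new budget",
    "patient waits as council delays vote",
    "storm closes coastal road",
]

medical_candidates = sorted(frequent_terms(medical_notes))
news_candidates = sorted(frequent_terms(news_items))

print(medical_candidates)  # ['patient']
print(news_candidates)     # []
```

The word patient carries no signal in the medical notes but is perfectly informative in the news items, which is why candidates like these are best reviewed by a person before they end up in a list.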