1.4 Million Jupyter Notebooks
I stumbled upon a very interesting paper the other day.
The paper is titled "A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks" by Joao Felipe Pimentel, Leonardo Murta, Vanessa Braganholo and Juliana Freire.
The paper describes an effort to run 1.4 million Jupyter notebooks found on GitHub in order to study their reproducibility. The main result: only 24.1% of the notebooks were able to run and only 4% of all notebook runs yielded the same results. To quote the paper:
The most common causes of failures were related to missing dependencies, the presence of hidden states and out-of-order executions, and data accessibility.
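To make "out-of-order executions" a bit more concrete, here is a minimal sketch of how you might spot them yourself. It is not the method used in the paper. It assumes a notebook saved in the nbformat 4 JSON layout, where every code cell carries an `execution_count`, and the function names `execution_counts` and `ran_top_to_bottom` are my own invention.

```python
import json


def execution_counts(notebook_json: str) -> list[int]:
    """Return the execution counts of executed code cells, in document order."""
    nb = json.loads(notebook_json)
    return [
        cell["execution_count"]
        for cell in nb.get("cells", [])
        # Cells that were never executed have a null count; skip them.
        if cell.get("cell_type") == "code"
        and cell.get("execution_count") is not None
    ]


def ran_top_to_bottom(notebook_json: str) -> bool:
    """True if the executed cells are numbered 1, 2, 3, ... from top to bottom."""
    counts = execution_counts(notebook_json)
    return counts == list(range(1, len(counts) + 1))
```

The check is deliberately strict: a notebook where a cell was re-run (so the counts read 1, 3) is flagged as well, because the saved outputs may then depend on state that is no longer visible in the file.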
Extra Advice
The paper also lists some general recommendations based on the findings, all of which ring true to me.
- Use short titles with a restricted character set for notebook files, and markdown headings for more detailed ones in the body.
- Pay attention to the bottom of the notebook. Check whether it can benefit from descriptive markdown cells or can have code cells executed or removed.
- Abstract code into functions, classes, and modules and test them.
- Declare the dependencies in requirement files and pin the versions of all packages.
- Use a clean environment for testing the dependencies to check if all of them are declared.
- Put imports at the beginning of notebooks.
- Use relative paths for accessing data in the repository.
- Re-run notebooks top to bottom before committing.
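A couple of these recommendations are easy to check automatically. Below is a rough sketch, under some assumptions of my own: dependencies live in a pip-style requirements file, "pinned" means an exact `==` version, and the notebook is stored as nbformat 4 JSON. The helper names `unpinned_requirements` and `cells_with_late_imports` are hypothetical, and the parsing is simplified on purpose (URL requirements, environment markers and imports inside multi-line strings are not handled).

```python
import json


def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as '-r other.txt'.
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


def cells_with_late_imports(notebook_json: str) -> list[int]:
    """Return indexes of code cells after the first code cell that contain
    top-level import statements."""
    nb = json.loads(notebook_json)
    code_cells = [
        (index, cell)
        for index, cell in enumerate(nb.get("cells", []))
        if cell.get("cell_type") == "code"
    ]
    late = []
    for index, cell in code_cells[1:]:
        source = cell.get("source", "")
        # nbformat allows the source to be a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if any(
            line.startswith(("import ", "from "))
            for line in source.splitlines()
        ):
            late.append(index)
    return late
```

Something like this could run in a pre-commit hook next to a full top-to-bottom re-run of the notebook. It does not replace testing in a clean environment, which is the only way to find dependencies that were never declared at all.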
Appendix
If you want to follow the original authors, you can find them on Twitter: @joaofelipenp, @leomurta, @vanbraganholo, @jfreirenet.