Folks have been experimenting with building LLMs just for docstring generation in Python. It's an interesting subtask in aiding programmers with copilots, and a pretty frequent one, so these researchers set out to train a lightweight LLM for just this task.

One cool aspect of all this is that they released a dataset for this task as well as a finetuned model. Both are public and can be found on Hugging Face. The dataset they release consists of a curated subset of the World of Code dataset, which definitely took some effort. The paper reports that they were able to improve the CodeGemma 2B model with this dataset, but there were also some lessons learned, which to me felt like the most interesting bit of the paper.

  1. While generating their dataset they mention that they extracted only 66,000 out of 148,000 functions from the World of Code dataset. Why? It turns out a lot of the functions contain syntax errors, and the authors wanted to skip those.
  2. How do you judge the quality of a docstring? Are longer docstrings always better? This is very subjective.
  3. Many implementations of functions are identical. You may have different docstrings for them, but will this aid the model or confuse it? Moreover, you can imagine that having many duplicate functions doesn't bode well for dataset diversity.
  4. It turns out that LLMs finetuned for docstring generation tend to make more spelling errors.
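
The syntax-error filtering in point 1 can be sketched with Python's own parser. To be clear, this is a hypothetical illustration of the idea, not the authors' actual pipeline code:

```python
import ast


def keep_parseable(functions):
    """Keep only function sources that Python can actually parse.

    Hypothetical helper: anything that raises a SyntaxError is dropped,
    mirroring the kind of filtering described in the paper.
    """
    kept = []
    for src in functions:
        try:
            ast.parse(src)
        except SyntaxError:
            continue
        kept.append(src)
    return kept
```

Running this over a mix of valid and broken snippets keeps only the ones that compile, which is presumably why the 148,000 functions shrank to 66,000.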
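
Point 3 hints at a deduplication step. One way to spot duplicate implementations that differ only in their docstrings is to strip the docstring and hash the remaining AST; again, a hypothetical sketch rather than the paper's method:

```python
import ast
import hashlib


def _strip_docstring(tree):
    """Remove leading docstring expressions from modules and functions."""
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.FunctionDef, ast.AsyncFunctionDef)):
            if (node.body
                    and isinstance(node.body[0], ast.Expr)
                    and isinstance(node.body[0].value, ast.Constant)
                    and isinstance(node.body[0].value.value, str)):
                node.body = node.body[1:]
    return tree


def dedupe_functions(sources):
    """Keep the first occurrence of each distinct implementation.

    Hypothetical sketch: two functions whose ASTs match after docstring
    removal count as duplicates, so formatting and comments don't matter.
    """
    seen = set()
    unique = []
    for src in sources:
        tree = _strip_docstring(ast.parse(src))
        key = hashlib.sha256(ast.dump(tree).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(src)
    return unique
```

Whether you'd want to drop such duplicates outright or keep one docstring variant per implementation is exactly the curation question the authors raise.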

There are more points in the paper, but the big takeaway is that curation and evaluation remain super hard in this field.