Adding Structured Data to Docstrings
Writing excellent documentation is crucial for open source software projects. It’s also a lot of
hard work. While I consider tools like Sphinx combined with services
like ReadTheDocs completely invaluable, I’ve recently hit a bit of a
roadblock when it comes to making the README of a GitHub repository a bit more dynamic. This blog
post is about the dark magic I invented as a solution (i.e., the docdata
package).
How Python Documentation Works
Before beginning, I want to give a quick refresher on what documentation looks like in Python. For any function or class, you can write a docstring using triple-double quotes on the first line after the definition:
def sin(x):
    """Compute the sin of x."""
    ...

class MyClass:
    """This is my class."""
    ...
It turns out that this is a bit more flexible than I thought. While the triple-double quoted string is the community standard, you can also get away with using triple-single quotes, single quotes, or double quotes. Even crazier, you can introduce a blank line before the docstring. While these are possible, please don't do this.
There's another interesting feature of the docstring that makes it different from any old string sitting in Python code: it has to be a string literal, because Python does not evaluate an expression to produce it. If the first statement in the body is anything other than a string literal (for example, a call to a function that returns a string), it won't be set as the docstring. The following code illustrates this:
def get_docstr():
    return """this is my docstr"""

def my_func():
    get_docstr()
    return 5

assert my_func.__doc__ is None
assert my_func() == 5
Not only are docstrings useful for readers of your code, but Python treats them with great
respect. PEP 257 outlines the conventions in full detail, but the
important thing to keep in mind is that Python code can introspect on the docstring for any
function, class, etc. with the special __doc__ attribute (e.g., MyClass.__doc__ stores the
docstring itself). This feature is what enables tools like Sphinx to exist without having to write
an entirely new parser for Python files.
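As a small illustration of that introspection, the sketch below reads a docstring at runtime. The class and function here are made up for the example; `inspect.getdoc` is the standard-library helper that returns the docstring with its indentation normalized.

```python
import inspect


class MyClass:
    """This is my class.

    It has a longer description on later lines.
    """


def undocumented():
    return 5


# __doc__ holds the docstring as written in the source.
raw = MyClass.__doc__

# inspect.getdoc() returns the same text with indentation normalized,
# which makes it easier to process line by line.
clean = inspect.getdoc(MyClass)
first_line = clean.splitlines()[0]

print(first_line)
print(undocumented.__doc__)  # a function without a docstring has __doc__ set to None
```

Working from the cleaned text rather than the raw attribute avoids having to think about how the docstring was indented in the source file.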
A Tale of Two READMEs
The rest of this journey will be told through the perspective of my work on the documentation of
PyKEEN, a machine learning library for learning low-dimensional
embeddings for nodes and edges in knowledge graphs. In this blog post, you don’t need to understand
anything about the package itself other than it has several types of interchangeable components that
can be combined to create a model that gets trained on a dataset. Its documentation uses the
sphinx-automodapi extension to generate pretty lists
of all the datasets, models, loss functions, regularizers, etc. (example).
The problem is that most people start to use a given package by either looking at the README file in
the GitHub repository, or the splash text on the PyPI project page (which, for PyKEEN and most
packages, is created with the README on upload). I wanted to generate beautiful tables describing
the components on the README file the same as in the Sphinx documentation, so I started by writing a
template markdown file using jinja as a templating language.
For each type of component, I programmatically built a table, formatted it as markdown
with tabulate, and inserted it into the template.
The tricky part was making these tables better than just lists of the names of the classes. Sphinx
has a deep integration with the reStructuredText (RST) format and provides custom roles
like :class: that allow for automatic linking between documentation for modules, classes,
functions, variables, or anything else. Luckily, sphinx-automodapi uses a standard format
for the URLs of the pages it generates. For example, the pykeen.models.ComplEx class gets built with a URL
like https://pykeen.readthedocs.io/en/stable/api/pykeen.models.ComplEx.html. The general form for <X>
is https://pykeen.readthedocs.io/en/stable/api/<X>.html. I was able
to take advantage of this and generate a column with the name of the class in PyKEEN with a link to
the documentation for the class on ReadTheDocs. This also gives insight to users who might want to
import these classes themselves.
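The URL pattern above can be turned into a table cell with a few lines of Python. This is only a sketch: `docs_url` and `docs_link` are hypothetical helper names, not functions from PyKEEN, and the output is a markdown link whose label doubles as the import path.

```python
# Base of the URL pattern described above.
DOCS_BASE = "https://pykeen.readthedocs.io/en/stable/api"


def docs_url(qualified_name: str) -> str:
    """Return the documentation URL for a dotted name like ``pykeen.models.ComplEx``."""
    return f"{DOCS_BASE}/{qualified_name}.html"


def docs_link(package: str, class_name: str) -> str:
    """Format a markdown link whose label is the importable dotted path."""
    qualified_name = f"{package}.{class_name}"
    return f"[`{qualified_name}`]({docs_url(qualified_name)})"


print(docs_link("pykeen.models", "ComplEx"))
```

The sketch takes the public package path as an argument instead of reading `cls.__module__`, since a class's `__module__` names the module where it is defined, which may be a submodule rather than the path the documentation uses.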
The next tricky part was providing some context besides just the name and class name. For new users looking at the models in PyKEEN, it's also useful to show a citation. This typically includes the first author's last name and the year, as in "Ali et al., 2019". Additionally, the citation should link to the paper itself for further reading past what the PyKEEN documentation for the model provides. As an aside, it's one of my personal goals for PyKEEN's documentation to be an educational resource that in many cases will be more useful than reading manuscripts written by computer scientists, whose goals are to make themselves look smart more than to motivate and educate the reader.
In RST, there's a syntax for linking citations that the PyKEEN documentation organizes in a documentation-wide bibliography. Unfortunately, my templating system is not as powerful as Sphinx, and does not parse all of these files. The solution I had was to standardize the citation keys and the format of the first line of each model's docstring. Because I enforced the standard that the first line of every model docstring ends with the RST citation, the name and year could be extracted with some simple text processing:
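from pykeen.models import ComplEx

doc = ComplEx.__doc__
first_line = doc.splitlines()[0]  # the first line of the docstring ends with the citation
# locate the square brackets that surround the citation key
start, end = first_line.find('['), first_line.find(']')
# the key is the author's name followed by a four-digit year
author, year = first_line[start + 1: end - 4], first_line[end - 4: end]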
Then the author and year could be formatted into a new column of the table. However, programmatically getting the citation link was a completely different problem. One solution I considered was to start adding class variables with this information, but that would quickly become a distraction to users.
My Solution
Then, I remembered a cool feature of flasgger, which
automatically generates a Swagger interface for Flask applications by embedding the Swagger
definition as YAML inside each route's docstring. I
didn't look into their code for an implementation and tried it on my own, with some pretty fast success!
The simple (but robust) code that I wrote for PyKEEN seemed generally useful, so I moved it into its
own package, docdata, so others could easily use it. It went
very fast because I had recently put a lot of effort into creating a Cookiecutter package
with all of my favorite settings that I've covered in previous posts on this blog.
It does the same thing as flasgger: it allows the final few lines following a delimiter string
(i.e., ---) of a docstring to be parsed as YAML and stored in the object. It can be applied as a
decorator to functions and classes, or simply to any data or Python object that has a docstring. A
demonstration shows it all:
from docdata import parse_docdata, get_docdata

@parse_docdata
class MyClass:
    """This is my class.

    ---
    author: Charlie
    motto:
    - docs
    - are
    - cool
    """

assert get_docdata(MyClass) == {
    'author': 'Charlie',
    'motto': ['docs', 'are', 'cool'],
}
The data can also be accessed directly via MyClass.__docdata__, but dunders are scary, and I
thought it would be nice to have a getter as well.
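To make the idea concrete, here is a minimal sketch of how a decorator of this kind could work. It is not docdata's actual implementation: `parse_tail` is a hypothetical name, and `json.loads` is swapped in for a YAML parser so the sketch needs only the standard library. It reuses the `__docdata__` attribute name mentioned above purely for illustration.

```python
import inspect
import json

DELIMITER = "---"  # the delimiter string described above


def parse_tail(obj, loader=json.loads):
    """Hypothetical decorator: parse the docstring text after DELIMITER.

    The real package parses YAML; json.loads is swapped in here so the
    sketch only needs the standard library.
    """
    # Normalize indentation so the delimiter sits alone on its own line.
    lines = inspect.cleandoc(obj.__doc__ or "").splitlines()
    if DELIMITER in lines:
        index = lines.index(DELIMITER)
        # Keep only the human-readable part as the docstring (a choice of this sketch).
        obj.__doc__ = "\n".join(lines[:index]).rstrip()
        obj.__docdata__ = loader("\n".join(lines[index + 1:]))
    else:
        obj.__docdata__ = None
    return obj


@parse_tail
class MyClass:
    """This is my class.

    ---
    {"author": "Charlie", "motto": ["docs", "are", "cool"]}
    """


print(MyClass.__docdata__)
print(MyClass.__doc__)
```

Because the decorator returns the object it was given, it can be stacked with other decorators or called directly on any object with a writable docstring.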
This was immediately useful for PyKEEN models because I was able to store all the citation information in a structured way at the bottom of the docstring. Then, I was able to improve my table generator to make a much richer column for citations that included the link to each paper. I did something similar for datasets in PyKEEN, but additionally included statistics about each dataset's entities, relations, and triples to make the PyKEEN README even more useful. The full pull request on PyKEEN can be found at pykeen/pykeen!303.
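As a rough sketch of what such a citation column could look like, the function below turns structured citation data into a markdown cell. The key names (`citation`, `author`, `year`, `link`) and the example URL are assumptions made for illustration, not necessarily the schema PyKEEN uses.

```python
def format_citation(docdata: dict) -> str:
    """Format a markdown citation cell from structured docstring data.

    The key names used here are assumptions for this sketch.
    """
    citation = docdata.get("citation")
    if not citation:
        return ""  # leave the cell empty when no citation is recorded
    label = f"{citation['author']} et al., {citation['year']}"
    link = citation.get("link")
    # Link to the paper when a URL is available, otherwise show plain text.
    return f"[{label}]({link})" if link else label


# Placeholder data; the URL is not a real paper link.
example = {"citation": {"author": "Ali", "year": 2019, "link": "https://example.org/paper"}}
print(format_citation(example))
```

Compared with slicing the first line of the docstring, reading named fields keeps working when the wording of the docstring changes.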
If you’re interested in the philosophy of documentation, a good place to start is here, or any other talk given by Daniele Procida.