Adding Structured Data to Docstrings

Writing excellent documentation is crucial for open source software projects. It’s also a lot of hard work. While I consider tools like Sphinx combine with services like ReadTheDocs completely invaluable, I’ve recently hit a bit of a roadblock when it comes to making the README of a GitHub repository a bit more dynamic. This blog post is about the dark magic I invented as a solution (i.e., the docdata package).

How Python Documentation Works

Before beginning, I want to give a quick refresher on what documentation looks like in Python. For any class or object, you can write a docstring using triple-double quotes on the first line after the definition

def sin(x):
    """Compute the sin of x."""
    ...


class MyClass:
    """This is my class."""
    ...

It turns out that this is a bit more extensible than I thought. While the triple-double quoted string is the community standard, you can also get away with using triple-single quote, single quote, or double quote as well. Even crazier, you can introduce a blank line before the docstring. While these are possible, please don’t do this. There’s another interesting feature about the docstring that makes it different from any old string sitting in Python code - it’s not evaluated. This means that if you use an expression that isn’t a string literal, it won’t be set to the docstring. The following code illustrates this:

def get_docstr():
    return """this is my docstr"""


def my_func():
    get_docstr()
    return 5


assert my_func.__doc__ is None
assert my_func() == 5

Not only are docstrings useful for readers of your code, but Python considers them with great respect. PEP 257 outlines in full detail, but the important thing to keep in mind is that Python code can introspect on the docstring for any function, class, etc. with the special __doc__ attribute (e.g., MyClass.__doc__ stores the docstring itself). This feature is what enables tools like Sphinx to exist without having to write an entirely new parser for Python files.

A Tale of Two READMEs

The rest of this journey will be told through the perspective of my work on the documentation of PyKEEN, a machine learning library for learning low-dimensional embeddings for nodes and edges in knowledge graphs. In this blog post, you don’t need to understand anything about the package itself other than it has several types of interchangeable components that can be combined to create a model that gets trained on a dataset. Its documentation uses the sphinx-automodapi extension to generate pretty lists of all the datasets, models, loss functions, regularizers, etc. (example).

The problem is that most people start to use a given package by either looking at the README file in the GitHub repository, or the splash text on the PyPI project page (which, for PyKEEN and most packages, is created with the README on upload). I wanted to generate beautiful tables describing the components on the README file the same as in the Sphinx documentation, so I started by writing a template markdown file using jinja as a templating language. For each type of component, I programatically built a table, formatted it as markdown with tabulate, and formatted it into the template.

The tricky part was making these tables better than just lists of the names of the classes. Sphinx has a deep integration with the restructured text (RST) format and provides custom “directives” like :class: that allow for automatic linking between documentation for modules, classes, functions, variables, or anything else. Luckily, the sphinx-automodapi uses a standard format for its documentation. For example, the pykeen.models.ComplEx class gets built with a URL like https://pykeen.readthedocs.io/en/stable/api/pykeen.models.ComplEx.html . The general form for <X> is https://pykeen.readthedocs.io/en/stable/api/<X>.html. I was able to take advantage of this and generate a column with the name of the class in PyKEEN with a link to the documentation for the class on ReadTheDocs. This also gives insight to users who might want to import these classes themselves.

The next tricky part was providing some context besides just the name and class name. For new users looking at the models in PyKEEN, it’s also useful to show a citation. This typically includes the first author’s last name and the year as in “Ali et al., 2019”. Additionally, the citation should link to the paper itself for further reading past what the PyKEEN documentation for the model provides. As an aside, it’s one of my personal goals for PyKEEN’s documentation to be an educational resource that in many cases will be more useful than reading manuscripts written by computer scientists, whose goals are to make themselves smart more than to motivate and educate the reader. In RST, there’s a syntax for linking citations that the PyKEEN documentation organizes in a documentation-wide bibliography. Unfortunately, my templating system is not as powerful as Sphinx, and does not parse all of these files. The solution I had was to standardize the citation keys and the format of the first line of each model’s docstring such that the name and year could be extracted with some simple text processing because I enforced the standard that all model docstrings ended with the RST citation.

from pykeen.models import ComplEx

doc = ComplEx.__doc__
doc_lines = doc.splitlines()
line = doc_lines[0]  # get the first line of the docstring
l, r = line.find('['), line.find(']')
author, year = line[1 + l: r - 4], line[r - 4: r]

Then the author and year could be formatted into a new column in the previous format. However, programatically getting the citation link was a completely different problem. One solution I considered was to start adding class variables with this information, but that would quickly become a distraction to users.

My Solution

Then, I remembered a cool feature of flasgger, which automatically generates a Swagger interface for Flask applications by embedding the Swagger definition as YAML inside each route’s docstring. I didn’t look into their code for an implementation and tried it my own with some pretty fast success! The simple (but robust) code that I wrote for PyKEEN seemed generally useful, so I moved it into its own package docdata - so others could easily use it. It went very fast because I recently put a lot of effort into creating a Cookiecutter package with all of my favorite settings that I’ve covered in previous posts on this blog.

It does the same thing as flasgger - it allows the final few lines following a delimiter string (i.e., ---) of a docstring to be parsed as YAML and stored in the object. It can be applied as a decorator to functions and classes, or simply to any data or Python object that has a docstring. A demonstration shows it all:

from docdata import parse_docdata, get_docdata


@parse_docdata
class MyClass:
    """This is my class.

    ---
    author: Charlie
    motto:
    - docs
    - are
    - cool
    """


assert get_docdata(MyClass) == {
    'author': 'Charlie',
    'motto': ['docs', 'are', 'cool'],
}

The data can also be accessed directly via MyClass.__docdata__ but dunders are scary, and I thought it would be nice to have a getter as well.

This was immediately useful for PyKEEN models because I was able to store all the citation information in a structured way at the bottom of the docstring. Then, I was able to improve my table generator to make a much more rich column for citations that included the link to each. I also did something for datasets in PyKEEN, but additionally included statistics about each dataset’s entities, relations, and triples to make the PyKEEN README even more useful. The full pull request on PyKEEN can be found at pykeen/pykeen!303.

If you’re interested in the philosophy of documentation, a good place to start is here, or any other talk given by Daniele Procida.