Semantic Pydantic

Using Pydantic for encoding data models and FastAPI for implementing APIs on top of them has become a staple for many Python programmers. When this intersects with the semantic web, linked open data, and the natural sciences, we are still lacking a bridge to annotate our data models and APIs to make them more FAIR (findable, accessible, interoperable, and reusable). In this post, we build an extension to Pydantic and FastAPI to annotate data models’ fields and API endpoints’ query, path, and other parameters using the Bioregistry, a comprehensive catalog of metadata about semantic spaces from the semantic web and the natural sciences.

As a demonstration, we will build a data model and API that serves information about scholars.

First Steps with Pydantic

We’ll use Open Researcher and Contributor (ORCID) identifiers as primary keys, include the researcher’s name, and start with a single cross-reference, e.g., to the author’s DBLP identifier. We’ll encode this data model using Pydantic in the Python programming language as follows:

from pydantic import BaseModel, Field

class ScholarV1(BaseModel):
    """A model representing a researcher, who might have several IDs on different services."""

    orcid: str = Field(...)
    name: str = Field(...)
    dblp: str | None = Field(None)

print(ScholarV1.schema_json(indent=2))

JSON Schema - Version 1

```json { "title": "ScholarV1", "description": "A model representing a researcher, who might have several IDs on different services.", "type": "object", "properties": { "orcid": { "title": "Orcid", "type": "string" }, "name": { "title": "Name", "type": "string" }, "dblp": { "title": "Dblp", "type": "string" } }, "required": ["orcid", "name"] } ```

There are several places for improvement here:

Correct capitalization of the titles (ORCID instead of Orcid and DBLP instead of Dblp)
Add useful descriptions of what each field is
Have regular expression patterns to validate input
Give an example

All of these are possible to annotate into Pydantic’s Field object, but it requires lots of effort and takes lots of space. Even worse, this might have to be partially duplicated if multiple models share the same fields. In the example below, I annotated ORCID but will skip the others for brevity.

from pydantic import BaseModel, Field


class ScholarV2(BaseModel):
    """A model representing a researcher, who might have several IDs on different services."""

    orcid: str = Field(
        ...,
        title="ORCID",
        description="A stable, public identifier for a researcher from https://orcid.com",
        pattern="^\d{4}-\d{4}-\d{4}-\d{3}(\d|X)$",
        example="0000-0003-4423-4370",
    )
    name: str = Field(...)
    dblp: str | None = Field(None)


print(ScholarV2.schema_json(indent=2))

JSON Schema - Version 2

```json { "title": "ScholarV2", "description": "A model representing a researcher, who might have several IDs on different services.", "type": "object", "properties": { "orcid": { "title": "ORCID", "description": "A stable, public identifier for a researcher from https://orcid.com", "pattern": "^\\d{4}-\\d{4}-\\d{4}-\\d{3}(\\d|X)$", "example": "0000-0003-4423-4370", "type": "string" }, "name": { "title": "Name", "type": "string" }, "dblp": { "title": "Dblp", "type": "string" } }, "required": ["orcid", "name"] } ```

However, this was a lot of work. It would be nice if there were some database of all the semantic spaces in the semantic web and natural sciences that contained the name, description, regular expression pattern, and examples. Then, we could draw from this database to automatically populate our fields.

Adding Some Bioregistry Magic

The good news is that such a database exists - it’s called the Bioregistry. Each semantic space (e.g., ORCID, DBLP) gets a prefix which is usually an acronym for the name of the resource that serves as the primary key for the semantic space. These prefixes are also useful in making references to entities in the semantic space more FAIR (findable, accessible, interoperable, reusable) using the compact URI (CURIE) syntax, though this isn’t the goal of this demo.

I’ve mocked some Python code that bridges Pydantic and the Bioregistry in this repository (https://github.com/cthoyt/semantic-pydantic). I’m calling it Semantic Pydantic because it lets us annotate our data models with external metadata (and because it rhymes).

Here’s the same model as before, but now using a SemanticField that extends Pydantic’s Field. It has a special keyword prefix that lets you give a Bioregistry prefix, then it is smart enough to fill out all the fields on its own. I also took the liberty of adding several more semantic spaces that identify scholars like Web of Science (wos), Scopus, and even GitHub.

from pydantic import BaseModel, Field

from semantic_pydantic import SemanticField


class ScholarV3(BaseModel):
    """A model representing a researcher, who might have several IDs on different services."""

    orcid: str = SemanticField(..., prefix="orcid")
    name: str = Field(..., example="Charles Tapley Hoyt")

    wos: str | None = SemanticField(default=None, prefix="wos.researcher")
    dblp: str | None = SemanticField(default=None, prefix="dblp.author")
    github: str | None = SemanticField(default=None, prefix="github")
    scopus: str | None = SemanticField(default=None, prefix="scopus")
    semion: str | None = SemanticField(default=None, prefix="semion")
    publons: str | None = SemanticField(default=None, prefix="publons.researcher")
    authorea: str | None = SemanticField(default=None, prefix="authorea.author")


print(ScholarV3.schema_json(indent=2))

Finally, we can see a very detailed JSON schema, which includes everything from before plus additional context from the Bioregistry, including the prefix itself as well as mappings from the Bioregistry prefix to external registries like BARTOC, FAIRsharing, and others. Together, these make the data model more FAIR and support interoperability, since now it is possible to directly match the fields annotated with Bioregistry prefixes in this model to fields annotated with the same prefix in other models, even external to the project.

JSON Schema - Version 3

```json { "title": "ScholarV3", "description": "A model representing a researcher, who might have several IDs on different services.", "type": "object", "properties": { "orcid": { "title": "Open Researcher and Contributor", "description": "

This field corresponds to a local unique identifier from Open Researcher and Contributor</a>.\n

Provenance

The semantics of this field are derived from the\n<a href=\"https://bioregistry.io/orcid\">orcid</a> entry in\nthe <a href=\"https://bioregistry.io\">Bioregistry</a>: a registry of semantic web and linked \nopen data compact URI (CURIE) prefixes and URI prefixes.\n

Description of Semantic Space

ORCID (Open Researcher and Contributor ID) is an open, non-profit, community-based effort to create and maintain a registry of unique identifiers for individual researchers. ORCID records hold non-sensitive information such as name, email, organization name, and research activities.", "pattern": "^\\d{4}-\\d{4}-\\d{4}-\\d{3}(\\d|X)$", "example": "0000-0003-4423-4370", "json_schema_extra": { "bioregistry": { "prefix": "orcid", "mappings": { "bartoc": "2021", "biocontext": "ORCID", "biolink": "ORCID", "fairsharing": "FAIRsharing.nx58jg", "go": "orcid", "miriam": "orcid", "n2t": "orcid", "wikidata": "P496" } } }, "type": "string" }, "name": { "title": "Name", "example": "Charles Tapley Hoyt", "type": "string" }, ... } } ``` Note that the extra semantic fields have been truncated for brevity. Each of them also has lots of detailed information.

Web Application

Let’s take the next step to a web application using FastAPI. The goal of this web application will be to look up the information for a scholar in Wikidata based on their ORCID. You don’t really have to understand how the query works other than that it takes in an ORCID string and gives back an instance of the Scholar model we’ve been working on above.

The app uses annotations for the query parameters, path parameters, and other inputs to routes using extensions of Pydantic Fields. So similar to before, we can extend their custom fields to be semantic in Semantic Pydantic.

from fastapi import FastAPI
from semantic_pydantic import SemanticPath

app = FastAPI(title="Semantic Pydantic Demo")
Scholar = ...  # defined before


@app.get("/api/orcid/{orcid}", response_model=Scholar)
def get_scholar_from_orcid(orcid: str = SemanticPath(prefix="orcid")):
    """Get xrefs for a researcher in Wikidata, given ORCID identifier."""
    ...  # full implementation in https://github.com/cthoyt/semantic-pydantic
    return Scholar(...)

The real power is how this translates to the API, and more importantly, the automatically generated API documentation. First, the SemanticPath object which we used in place of a normal fastapi.Path also knows it is for ORCID identifiers. Second, the response model points to the Scholar class from before which already knows about its semantics. Below, we see this in a screenshot of the OpenAPI (formerly known as Swagger) user interface automatically generated by FastAPI.

A screenshot of the Swagger UI, enriched by the semantic annotations described above.

There are two big things to note here:

All the detailed information about ORCID makes it into the parameter, also giving an example for users to get started
All the detailed information from the response model (Scholar) gets shown as a real example.

Now, we have an API that is also annotated with detailed semantics. If you take a look at the OpenAPI JSON file, it has similar references to Bioregistry prefixes for the routes themselves, and directly reuses the JSON schema for the response model.

Next Steps

So far, this is a proof-of-concept that lives in an ad hoc repository. It’s not clear yet if this code is just a neat demo, whether it should live inside the Bioregistry Python package, I haven’t decided yet if this should go inside the Bioregistry Python package, or if it should be in a stand-alone package that might be extensible even further. There are a few other things to think about in the meantime:

Infrastructure for FAIR Models and APIs

The first version of this idea just throws the Bioregistry data into the JSON schema. It would be interesting to develop this infrastructure further, such as keeping a catalog of all APIs that consume or produce data models containing semantic fields. A few places this would be great:

The Cheminformatics Microservice contains tons of references to InChi Keys (inchikey)
The Bioregistry could dog-food its own API
The INDRA Discovery API could be refactored to use semantic inputs and outputs

There are also so many more examples, please let me know some services you think would benefit in the comments on my blog post. Looking forward, it’s also a question on how to automatically discover such semantic APIs (e.g., by cleverly searching GitHub) or if it would have to be a manually curated catalog.

Resolving URLs

A key feature of the Bioregistry is that it provides a way to take a local unique identifier for an entity in a given semantic space and make a URL that points to a web page describing the entity. For example, if you have an ORCID identifier, you can make a URL for the ORCID page following the format https://orcid.org/<put ID here>. It would be very cool to extend Semantic Pydantic to add some properties that auto-generate URLs, like in the following:

from pydantic import BaseModel, Field

from semantic_pydantic import SemanticField

class Scholar(BaseModel):
    orcid: str = SemanticField(..., prefix="orcid")
    name: str = Field(...)

charlie = Scholar(orcid="0000-0003-4423-4370", name="Charles Tapley Hoyt")
assert charlie.orcid_url == 'https://orcid.org/0000-0003-4423-4370'

Run the Demo

The demo can be run by cloning the repository, installing its requirements, and running the self-contained app.py.

git clone https://github.com/cthoyt/semantic-pydantic
cd semantic-pydantic
python -m pip install -r requirements.txt
python app.py

This is my first science post of 2024! I’m very happy that the Bioregistry is currently supported by the Chan Zuckerberg Initiative (CZI) under award 2023-329850.