At the 4th Ontologies4Chem Workshop in Limburg an der Lahn, I proposed an initial crosswalk between the Simple Standard for Sharing Ontological Mappings (SSSOM) and the Wikidata semantic mapping data model. This post describes the motivation for this proposal and the concrete implementation I’ve developed in sssom-pydantic.

This work is part of the NFDI’s Ontology Harmonization and Mapping Working Group, which is interested in enabling interoperability between SSSOM and related data standards that encode semantic mappings.

The TL;DR for this post is that I implemented a mapping from SSSOM to Wikidata in sssom-pydantic in cthoyt/sssom-pydantic#32. One high-level entrypoint is read_and_open_quickstatements(), which reads an SSSOM file and prepares QuickStatements that can be reviewed in the web browser, then uploaded to Wikidata.

A demo script can be run from Gist with uv run https://gist.github.com/cthoyt/f38d37426a288989158a9804f74e731a#file-sssom-wikidata-demo-py

Semantic Mappings in SSSOM

The Simple Standard for Sharing Ontological Mappings (SSSOM) is a community-driven data standard for semantic mappings, which are necessary to support (semi-)automated data integration and knowledge integration, such as in the construction of knowledge graphs.

While SSSOM is primarily a tabular data format that is best serialized in TSV, it uses LinkML to formalize the semantics of each field such that SSSOM can be serialized to and read from OWL, RDF, and JSON-LD. Here’s a brief example:

| subject_id | subject_label | predicate_id | object_id | object_label | mapping_justification |
|---|---|---|---|---|---|
| wikidata:Q128700 | cell wall | skos:exactMatch | GO:0005618 | cell wall | semapv:ManualMappingCuration |
| wikidata:Q47512 | acetic acid | skos:exactMatch | CHEBI:15366 | acetic acid | semapv:ManualMappingCuration |
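
Because the core of SSSOM is plain TSV, the two example mappings can be read with nothing but Python's standard library. The following is a minimal sketch: the function name `read_mappings` is made up for illustration, and it simply skips any `#`-prefixed metadata lines instead of interpreting them the way a real SSSOM parser such as sssom-pydantic would.

```python
import csv
import io

# The two example mappings, as tab-separated text.
SSSOM_TSV = (
    "subject_id\tsubject_label\tpredicate_id\tobject_id\tobject_label\tmapping_justification\n"
    "wikidata:Q128700\tcell wall\tskos:exactMatch\tGO:0005618\tcell wall\tsemapv:ManualMappingCuration\n"
    "wikidata:Q47512\tacetic acid\tskos:exactMatch\tCHEBI:15366\tacetic acid\tsemapv:ManualMappingCuration\n"
)


def read_mappings(text: str) -> list[dict[str, str]]:
    """Parse SSSOM-style TSV text into one dictionary per mapping row."""
    # Drop '#'-prefixed lines (the embedded metadata header) without parsing them.
    lines = (line for line in io.StringIO(text) if not line.startswith("#"))
    return list(csv.DictReader(lines, delimiter="\t"))


mappings = read_mappings(SSSOM_TSV)
for mapping in mappings:
    print(mapping["subject_id"], mapping["predicate_id"], mapping["object_id"])
```

Each row becomes a dictionary keyed by the SSSOM column names, which is enough to follow the rest of this post.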

Semantic Mappings in Wikidata

Wikidata has two complementary formalisms for representing semantic mappings. The first uses the exact match (P2888) property with a URI as the object. For example, cell wall (Q128700) maps to the Gene Ontology (GO) term for cell wall by its URI http://purl.obolibrary.org/obo/GO_0005618.

A screenshot of the exact match section of webpage for Wikidata's cell wall record

The second formalism uses semantic space-specific properties (e.g. P683 for ChEBI) with local unique identifiers as the object. For example, acetic acid (Q47512) maps to the ChEBI term for acetic acid using the P683 property for ChEBI and ChEBI’s local unique identifier for acetic acid, 15366.

A screenshot of the ChEBI mapping section of webpage for Wikidata's acetic acid record

Wikidata has a data structure that enables annotating qualifiers onto triples. Therefore, other parts of semantic mappings modeled in SSSOM can be ported:

  1. Authors and reviewers can be mapped from ORCID identifiers to Wikidata identifiers, then encoded using the S50 and S4032 properties, respectively
  2. A SKOS-flavored mapping predicate (i.e., exact, narrow, broad, close, related) can be encoded using the S4390 property
  3. The publication date can be encoded using the S577 property
  4. The license can be mapped from text to a Wikidata identifier, then encoded using the S275 property

Note that properties that normally start with a P when used in triples are changed to start with an S when used as qualifiers. Other fields in SSSOM could potentially be mapped to Wikidata later.
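
To make the P-to-S renaming concrete, here is a small sketch of how one statement plus its annotations could be rendered as a tab-separated QuickStatements line. The `Statement` class is hypothetical and is not the object model from quickstatements-client. It assumes the QuickStatements text conventions that string values are double-quoted and that day-precision dates end in `/11`, and the publication date in the usage example is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """A hypothetical, minimal model of one Wikidata statement with annotations."""

    subject: str  # Wikidata item, e.g. "Q47512"
    prop: str  # property used in the triple, e.g. "P683"
    value: str  # value already formatted for QuickStatements
    annotations: list[tuple[str, str]] = field(default_factory=list)

    def to_line(self) -> str:
        """Render the statement as one tab-separated QuickStatements line."""
        parts = [self.subject, self.prop, self.value]
        for annotation_prop, annotation_value in self.annotations:
            # Apply the renaming described above: P577 becomes S577, and so on.
            parts.append("S" + annotation_prop.removeprefix("P"))
            parts.append(annotation_value)
        return "\t".join(parts)


statement = Statement(
    subject="Q47512",
    prop="P683",
    value='"15366"',
    # Invented publication date, for illustration only.
    annotations=[("P577", "+2025-01-01T00:00:00Z/11")],
)
line = statement.to_line()
print(line)
```

Authors, reviewers, licenses, and the mapping predicate would be added the same way, with Wikidata item identifiers as the annotation values.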

Finding Wikidata Properties using the Semantic Farm

The Semantic Farm (previously called the Bioregistry) maintains mappings between prefixes that appear in compact URIs (CURIEs) and their corresponding Wikidata properties. For example, the prefix CHEBI maps to the Wikidata property P683.

These mappings can be accessed in several ways:

  1. via the Semantic Farm’s SSSOM export. Note: this requires subsetting to mappings where Wikidata properties are the object.
  2. via the Semantic Farm’s live API
  3. via the Bioregistry Python package (this will get renamed to match Semantic Farm, eventually) using the following code:

    import bioregistry
    
    # get bulk
    prefix_to_property = bioregistry.get_registry_map("wikidata")
    
    # get for a single resource
    resource = bioregistry.get_resource("chebi")
    chebi_wikidata_property_id = resource.get_mapped_prefix("wikidata")
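
With a prefix-to-property lookup in hand, choosing between the two Wikidata formalisms can be sketched as follows. This is an illustration under assumptions, not the logic in sssom-pydantic: the function name is made up, and both lookup tables are tiny hard-coded stand-ins that mirror only the two examples in this post rather than the full registry.

```python
# Assumed stand-in for the bulk lookup shown above, keyed by lowercased prefix.
PREFIX_TO_PROPERTY = {"chebi": "P683"}

# Assumed URI prefixes for expanding CURIEs that have no entry above.
URI_PREFIXES = {"go": "http://purl.obolibrary.org/obo/GO_"}

EXACT_MATCH_PROPERTY = "P2888"


def to_wikidata_statement(object_curie: str) -> tuple[str, str]:
    """Return a (property, value) pair for the object of a mapping."""
    prefix, local_id = object_curie.split(":", 1)
    key = prefix.lower()
    if key in PREFIX_TO_PROPERTY:
        # Second formalism: dedicated property plus local unique identifier.
        return PREFIX_TO_PROPERTY[key], local_id
    if key in URI_PREFIXES:
        # First formalism: exact match (P2888) with the full URI as the object.
        return EXACT_MATCH_PROPERTY, URI_PREFIXES[key] + local_id
    raise KeyError(f"no Wikidata property or URI prefix known for {prefix!r}")


print(to_wikidata_statement("CHEBI:15366"))
print(to_wikidata_statement("GO:0005618"))
```

Preferring the dedicated property when one exists is a design choice of this sketch; the exact match property with a URI serves as the fallback.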
    

Notable Implementation Details

I’ve previously built two packages that were key to making this work:

  1. wikidata-client, which interacts with the Wikidata SPARQL endpoint and has high-level wrappers around lookup functionality. I’m also aware of WikidataIntegrator, to which I’ve contributed several improvements, but working with its codebase doesn’t spark joy, and the last time I tried to use it, it was fully broken because some of its dependencies didn’t work on modern Python.
  2. quickstatements-client, which implements an object model for QuickStatements v2 and an API client.

Along the way to this PR, I made improvements to the wikidata-client in cthoyt/wikidata-client#2 to add high-level functionality for looking up multiple Wikidata records based on values for a property (e.g., to support ORCID lookup in bulk).
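
For intuition, a bulk lookup by property value boils down to a SPARQL query with a VALUES clause. The sketch below only builds the query string and does not call the endpoint. The function name `build_lookup_query` is hypothetical and is not the wikidata-client API, P496 is Wikidata's property for ORCID iD, and the identifier in the usage example is the fictitious one that ORCID uses in its own documentation.

```python
def build_lookup_query(property_id: str, values: list[str]) -> str:
    """Build a SPARQL query that finds Wikidata items by string values of a property."""
    # Escape backslashes and double quotes so each value is a valid SPARQL literal.
    literals = " ".join(
        '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"' for value in values
    )
    return (
        "SELECT ?item ?value WHERE { "
        f"VALUES ?value {{ {literals} }} "
        f"?item wdt:{property_id} ?value . "
        "}"
    )


ORCID_PROPERTY = "P496"
query = build_lookup_query(ORCID_PROPERTY, ["0000-0002-1825-0097"])
print(query)
```

Sending all identifiers in one query avoids a round trip to the endpoint per author or reviewer.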

All other changes were made in sssom-pydantic in cthoyt/sssom-pydantic#32.

The other key challenge was to avoid adding duplicate information to Wikidata. Unlike in a simple triple store, where adding the same triple twice has no effect, the same mapping could accidentally end up in Wikidata as duplicate statements. Therefore, the sssom-pydantic implementation looks up all existing semantic mappings in Wikidata for entities appearing in an SSSOM file, then filters appropriately to avoid uploading duplicate mappings to Wikidata.
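
The filtering step can be pictured as a set difference over (item, property, value) triples. The following is a simplified sketch rather than the code in sssom-pydantic: the function name is made up, the existing statements are hard-coded instead of being fetched from Wikidata, and it ignores annotations on the statements entirely.

```python
Triple = tuple[str, str, str]  # (Wikidata item, property, value)


def filter_new(candidates: list[Triple], existing: set[Triple]) -> list[Triple]:
    """Keep only candidates that are neither in Wikidata already nor repeated."""
    seen = set(existing)
    new_triples = []
    for triple in candidates:
        if triple not in seen:
            # Remember it so a repeat within the same SSSOM file is also skipped.
            seen.add(triple)
            new_triples.append(triple)
    return new_triples


# Pretend that only the ChEBI mapping is already present in Wikidata.
existing = {("Q47512", "P683", "15366")}
candidates = [
    ("Q47512", "P683", "15366"),
    ("Q128700", "P2888", "http://purl.obolibrary.org/obo/GO_0005618"),
    ("Q128700", "P2888", "http://purl.obolibrary.org/obo/GO_0005618"),
]
to_upload = filter_new(candidates, existing)
print(to_upload)
```

Only the first occurrence of the GO mapping survives in this example, so nothing is uploaded twice.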

Pulling it All Together

This new module in sssom-pydantic implements the following interactive workflows:

  1. Read an SSSOM file, convert mappings to Wikidata schema, then open a QuickStatements tab in the web browser using read_and_open_quickstatements()
  2. Convert in-memory semantic mappings to the Wikidata schema, then open a QuickStatements tab in the web browser using open_quickstatements()

Here’s what the QuickStatements web interface looks like after preparing some demo mappings:

A screenshot of the QuickStatements queue

It also implements the following non-interactive workflows, which should be used with caution since they write directly to Wikidata:

  1. Read an SSSOM file, convert mappings to Wikidata schema, then post non-interactively to Wikidata via QuickStatements using read_and_post()
  2. Convert in-memory semantic mappings to the Wikidata schema, then post non-interactively to Wikidata via QuickStatements using post()

I’m a bit hesitant to start uploading SSSOM content to Wikidata in bulk, because I don’t yet have a plan for how to maintain mappings that might change over time in their upstream single source of truth, e.g., mappings curated in Biomappings. Otherwise, I think this is a good proof of concept, and I would like to get feedback about additional qualifiers that could be added and whether the ones I chose so far are the best.