Extracting Semantic Mappings from BioPortal in SSSOM
Earlier this week, someone asked on the OBO Foundry Slack where to find semantic mappings to terms in the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT). While some are available in the SeMRA Disease Mappings Database, many more are available within BioPortal, which has access to the entire SNOMED-CT source data and has produced semantic mapping predictions using LOOM. This post is about how I implemented an API wrapper for the mapping endpoints of generic OntoPortal instances and a post-processing pipeline that converts OntoPortal’s custom mapping format into SSSOM.
Interacting with BioPortal
BioPortal is an instance of a more generic backend called
OntoPortal. I’ve previously developed
ontoportal-client, a Python
package that has both a generic wrapper for any OntoPortal instance’s API and
pre-configured wrappers for BioPortal,
AgroPortal,
EcoPortal, and several others.
The OntoPortal API endpoint for retrieving mappings is /mappings, which takes
a comma-separated pair of ontologies as a parameter, as in
https://data.bioontology.org/mappings?apikey=<API KEY>&ontologies=SNOMEDCT,AERO.
I was able to relatively easily implement this in ontoportal-client in
cthoyt/ontoportal-client#10,
which enables automatically paging through results using the following code:
from ontoportal_client import BioPortalClient
# follow https://github.com/cthoyt/ontoportal-client?tab=readme-ov-file#%EF%B8%8F-configuration
# to configure BioPortalClient to be instantiated without need for explicit configuration
client = BioPortalClient()
for record in client.get_mappings("SNOMEDCT", "AERO"):
    pass
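As an aside, the request behind that loop can be assembled by hand. The following is a minimal sketch that only builds the URL from the example above and sends nothing over the network; the helper name `build_mappings_url` is hypothetical and is not part of ontoportal-client.

```python
from urllib.parse import urlencode

# Base URL taken from the BioPortal example above
BIOPORTAL_BASE = "https://data.bioontology.org"


def build_mappings_url(
    ontology_1: str, ontology_2: str, api_key: str, base: str = BIOPORTAL_BASE
) -> str:
    """Build a /mappings URL for a pair of ontologies (hypothetical helper)."""
    query = urlencode(
        {"apikey": api_key, "ontologies": f"{ontology_1},{ontology_2}"},
        safe=",",  # leave the comma between the two ontology acronyms unescaped
    )
    return f"{base}/mappings?{query}"


url = build_mappings_url("SNOMEDCT", "AERO", api_key="my-key")
print(url)
```

The client takes care of this, plus paging, so the sketch is only meant to show what the two arguments to `get_mappings` correspond to.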
Each record is a dictionary object corresponding to the JSON returned by the
API (after stripping pagination metadata):
{
"id": null,
"source": "LOOM",
"classes": [
{
"@id": "http://purl.obolibrary.org/obo/ogms/OMRE_0000023",
"@type": "http://www.w3.org/2002/07/owl#Class",
"links": {
"self": "https://data.bioontology.org/ontologies/AERO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fogms%2FOMRE_0000023",
"ontology": "https://data.bioontology.org/ontologies/AERO",
"children": "https://data.bioontology.org/ontologies/AERO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fogms%2FOMRE_0000023/children",
"parents": "https://data.bioontology.org/ontologies/AERO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fogms%2FOMRE_0000023/parents",
"descendants": "https://data.bioontology.org/ontologies/AERO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fogms%2FOMRE_0000023/descendants",
"ancestors": "https://data.bioontology.org/ontologies/AERO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fogms%2FOMRE_0000023/ancestors",
"instances": "https://data.bioontology.org/ontologies/AERO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fogms%2FOMRE_0000023/instances",
"tree": "https://data.bioontology.org/ontologies/AERO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fogms%2FOMRE_0000023/tree",
"notes": "https://data.bioontology.org/ontologies/AERO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fogms%2FOMRE_0000023/notes",
"mappings": "https://data.bioontology.org/ontologies/AERO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fogms%2FOMRE_0000023/mappings",
"ui": "http://bioportal.bioontology.org/ontologies/AERO?p=classes&conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fogms%2FOMRE_0000023",
"@context": {
"self": "http://www.w3.org/2002/07/owl#Class",
"ontology": "http://data.bioontology.org/metadata/Ontology",
"children": "http://www.w3.org/2002/07/owl#Class",
"parents": "http://www.w3.org/2002/07/owl#Class",
"descendants": "http://www.w3.org/2002/07/owl#Class",
"ancestors": "http://www.w3.org/2002/07/owl#Class",
"instances": "http://data.bioontology.org/metadata/Instance",
"tree": "http://www.w3.org/2002/07/owl#Class",
"notes": "http://data.bioontology.org/metadata/Note",
"mappings": "http://data.bioontology.org/metadata/Mapping",
"ui": "http://www.w3.org/2002/07/owl#Class"
}
},
"@context": {
"@vocab": "http://data.bioontology.org/metadata/",
"@language": "en"
}
},
{
"@id": "http://purl.bioontology.org/ontology/SNOMEDCT/3415004",
"@type": "http://www.w3.org/2002/07/owl#Class",
"links": {
"self": "https://data.bioontology.org/ontologies/SNOMEDCT/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F3415004",
"ontology": "https://data.bioontology.org/ontologies/SNOMEDCT",
"children": "https://data.bioontology.org/ontologies/SNOMEDCT/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F3415004/children",
"parents": "https://data.bioontology.org/ontologies/SNOMEDCT/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F3415004/parents",
"descendants": "https://data.bioontology.org/ontologies/SNOMEDCT/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F3415004/descendants",
"ancestors": "https://data.bioontology.org/ontologies/SNOMEDCT/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F3415004/ancestors",
"instances": "https://data.bioontology.org/ontologies/SNOMEDCT/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F3415004/instances",
"tree": "https://data.bioontology.org/ontologies/SNOMEDCT/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F3415004/tree",
"notes": "https://data.bioontology.org/ontologies/SNOMEDCT/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F3415004/notes",
"mappings": "https://data.bioontology.org/ontologies/SNOMEDCT/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F3415004/mappings",
"ui": "http://bioportal.bioontology.org/ontologies/SNOMEDCT?p=classes&conceptid=http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F3415004",
"@context": {
"self": "http://www.w3.org/2002/07/owl#Class",
"ontology": "http://data.bioontology.org/metadata/Ontology",
"children": "http://www.w3.org/2002/07/owl#Class",
"parents": "http://www.w3.org/2002/07/owl#Class",
"descendants": "http://www.w3.org/2002/07/owl#Class",
"ancestors": "http://www.w3.org/2002/07/owl#Class",
"instances": "http://data.bioontology.org/metadata/Instance",
"tree": "http://www.w3.org/2002/07/owl#Class",
"notes": "http://data.bioontology.org/metadata/Note",
"mappings": "http://data.bioontology.org/metadata/Mapping",
"ui": "http://www.w3.org/2002/07/owl#Class"
}
},
"@context": {
"@vocab": "http://data.bioontology.org/metadata/",
"@language": "en"
}
}
],
"process": null,
"@id": "",
"@type": "http://data.bioontology.org/metadata/Mapping"
}
There’s both a lot of noise in this output and several pieces of key information
that need to be inferred. When designing ontoportal-client (and other similar
wrappers), I’ve had to grapple with staying true to the source versus injecting
logic that processes the data and makes it more useful. For now, I’ve decided that
ontoportal-client shouldn’t make any judgments on the data that comes out of
the API. Also, the format has changed since I wrote the package, and I
am not keen on taking on that maintenance burden (which makes the
suggestion in
cthoyt/ontoportal-client#3)
difficult to address.
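To make the noise concrete, here is a minimal sketch that reduces a record to the handful of fields that carry information. It assumes every record has exactly two entries under "classes", as in the example above; the helper name `summarize_record` and the output keys are hypothetical and are not how ontoportal-client or sssom-pydantic do it.

```python
from typing import Any, Dict


def summarize_record(record: Dict[str, Any]) -> Dict[str, str]:
    """Keep only the informative fields of one mapping record (hypothetical helper)."""
    first, second = record["classes"]  # assumes exactly two classes per record
    return {
        "source": record["source"],
        "class_1": first["@id"],
        # the ontology acronym is the last path segment of the "ontology" link
        "class_1_ontology": first["links"]["ontology"].rsplit("/", 1)[-1],
        "class_2": second["@id"],
        "class_2_ontology": second["links"]["ontology"].rsplit("/", 1)[-1],
    }


# A trimmed copy of the record above, keeping only the keys used here
record = {
    "source": "LOOM",
    "classes": [
        {
            "@id": "http://purl.obolibrary.org/obo/ogms/OMRE_0000023",
            "links": {"ontology": "https://data.bioontology.org/ontologies/AERO"},
        },
        {
            "@id": "http://purl.bioontology.org/ontology/SNOMEDCT/3415004",
            "links": {"ontology": "https://data.bioontology.org/ontologies/SNOMEDCT"},
        },
    ],
}
summary = summarize_record(record)
print(summary)
```

Note what is still missing after this reduction: the record names no mapping predicate and no explicit subject or object, which is the kind of information that has to be inferred.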
Converting to SSSOM
If not in ontoportal-client, then where should I put the code that processes
OntoPortal mappings? I had two options. The first is in the Semantic Mapping
Reasoner and Assembler (SeMRA; code,
paper), which is a generic
place for assembling semantic mappings. At the time, I designed the internal
data model in SeMRA to go beyond what’s possible in SSSOM because I was
interested in keeping track of provenance of how semantic mappings were used to
infer other ones. Slowly, I’m porting the SSSOM-specific code out of SeMRA
into a stand-alone library,
sssom-pydantic, which was the second option. It serves as an
alternative to sssom-py
(which I also help maintain) that is more focused on creating a reusable and
high-performance data structure based on Pydantic.
Therefore, I implemented processing around a generic OntoPortal client in cthoyt/sssom-pydantic#14. It can be used like this (warning: subject to change):
import bioregistry
from sssom_pydantic.contrib.ontoportal import from_bioportal
from sssom_pydantic import SemanticMapping
converter = bioregistry.get_converter()
mappings: list[SemanticMapping] = from_bioportal("SNOMEDCT", "AERO", converter=converter)
You have to bring your own curies.Converter because OntoPortal’s data model
doesn’t return a meaningful prefix map for parsing IRIs. The Bioregistry is a
good and quick way to get a comprehensive prefix map.
Warning: BioPortal doesn’t provide an option to only return mappings between
entities defined in the two given ontologies. For example, if you ask for
mappings between SNOMEDCT and AERO, you will also get mappings between OGMS
and SNOMEDCT (because OGMS terms are imported in AERO). This means that you
should probably apply post-hoc filtering to only retain relevant mappings.
One way to do this is to rely on the definition of the converter, since any mapping whose subject or object IRI can’t be parsed is discarded:
import curies
from sssom_pydantic.contrib.ontoportal import from_bioportal
converter = curies.Converter.from_prefix_map(
    {
        "AERO": "http://purl.obolibrary.org/obo/AERO_",
        "SNOMEDCT": "http://purl.bioontology.org/ontology/SNOMEDCT/",
    }
)
mappings = from_bioportal("SNOMEDCT", "AERO", converter=converter)
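To see why a narrow prefix map acts as a filter, here is a standard-library-only sketch of the idea. It is not the curies or sssom-pydantic implementation: `compress` and `keep_parseable` are hypothetical helpers that do simple first-match prefix checks, and the AERO IRI in the second pair is made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

PREFIX_MAP = {
    "AERO": "http://purl.obolibrary.org/obo/AERO_",
    "SNOMEDCT": "http://purl.bioontology.org/ontology/SNOMEDCT/",
}


def compress(iri: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Return a CURIE if the IRI starts with a known URI prefix, else None."""
    for prefix, uri_prefix in prefix_map.items():
        if iri.startswith(uri_prefix):
            local_id = iri[len(uri_prefix):]
            return f"{prefix}:{local_id}"
    return None


def keep_parseable(
    pairs: List[Tuple[str, str]], prefix_map: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Keep only the pairs where both IRIs can be compressed to CURIEs."""
    kept = []
    for subject_iri, object_iri in pairs:
        subject = compress(subject_iri, prefix_map)
        obj = compress(object_iri, prefix_map)
        if subject is not None and obj is not None:
            kept.append((subject, obj))
    return kept


pairs = [
    # the pair from the example record: an OGMS term imported into AERO
    (
        "http://purl.bioontology.org/ontology/SNOMEDCT/3415004",
        "http://purl.obolibrary.org/obo/ogms/OMRE_0000023",
    ),
    # a made-up AERO IRI, for illustration only
    (
        "http://purl.bioontology.org/ontology/SNOMEDCT/3415004",
        "http://purl.obolibrary.org/obo/AERO_0000001",
    ),
]
kept = keep_parseable(pairs, PREFIX_MAP)
print(kept)
```

The OGMS pair is dropped because its IRI matches neither URI prefix, while the pair with the AERO-style IRI survives.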
Bulk Download
Ideally, I could get all mappings from BioPortal in bulk, instead of needing to hit the mappings API many times for each pair of ontologies. Since the question that motivated this post was about SNOMED-CT mappings, I wrote the following script, which goes through all ontologies in the Bioregistry that have a BioPortal alignment and checks for semantic mappings between SNOMED-CT and that ontology.
import bioregistry
import click
import pystow
import requests.exceptions
import sssom_pydantic
from sssom_pydantic import MappingSet
from sssom_pydantic.contrib.ontoportal import from_bioportal
from tqdm.contrib.logging import logging_redirect_tqdm
from tqdm import tqdm
MODULE = pystow.module("semra", "bioportal")
internal_to_bioportal = bioregistry.get_registry_map("bioportal")
converter = bioregistry.get_converter()
for internal, bioportal in tqdm(sorted(internal_to_bioportal.items())):
    if bioportal == "SNOMEDCT":
        continue
    name = f"snomedct-{internal}.sssom.tsv"
    path = MODULE.join(name=name)
    if path.is_file():
        tqdm.write(click.style(f"{bioportal} already cached to {path}", fg="green"))
        continue
    tqdm.write(click.style(bioportal, fg="green"))
    metadata = MappingSet(id=f'https://w3id.org/biopragmatics/mappings/bioportal/{name}')
    with logging_redirect_tqdm():
        try:
            mappings = from_bioportal("SNOMEDCT", bioportal, converter=converter)
        except requests.exceptions.HTTPError:
            tqdm.write(click.style(f"failed on {bioportal}\n", fg="red"))
        else:
            tqdm.write(click.style(f"{bioportal} got {len(mappings):,} mappings", fg="green"))
            if mappings:
                sssom_pydantic.write(mappings, path, converter=converter, metadata=metadata)
                tqdm.write(click.style(f"{bioportal} wrote to {path}\n", fg="green"))
As of this writing, I haven’t been able to get this script to run to completion. The BioPortal API is often slow and times out. I included caching so I could resume after failure. As I mentioned earlier, this script doesn’t yet post-process mappings to the correct subset.
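One thing that might help with the timeouts is retrying a failed call after a pause. The following is a minimal sketch of retrying with exponential backoff; it is not something the script above does, and `call_with_retries` is a hypothetical helper written against the standard library only.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    *,
    retries: int = 3,
    base_delay: float = 1.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call ``func`` up to ``retries`` times, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return func()
        except exceptions:
            if attempt == retries - 1:
                raise  # out of attempts, let the caller see the error
            sleep(base_delay * 2**attempt)
    raise ValueError("retries must be at least 1")


# Demonstration with a stand-in function that fails twice before succeeding
attempts: List[int] = []
delays: List[float] = []


def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


# Record the delays instead of actually sleeping
result = call_with_retries(flaky, exceptions=(TimeoutError,), sleep=delays.append)
print(result, len(attempts), delays)
```

In the script, the call to from_bioportal could be wrapped in a lambda and passed to such a helper, with `exceptions` set to whichever errors the client actually raises on a timeout; I have not verified which ones those are.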
Thanks to John Graybeal for the suggestion on where to begin. He’s also helped me get in touch with the BioPortal team, so hopefully we can collaborate to get the API working using SSSOM directly or at least to get a bulk export of mappings.