How to Pick a Unique Prefix
After the recent incident on the OBO Foundry where an inexperienced group submitted a new ontology request using a prefix that already existed in the BioPortal, there has been a renewed interest in implementing an automated solution to protect against this.
The Big Picture
A more general issue is that there can be prefix conflicts between different
registries. I found a few examples of this happening while building
the Bioregistry in the Spring 2021. Notably, this
included the conflict between
the Geographical Entity Ontology
and the Gene Expression Omnibus which both
used the prefix geo
in the OBO Foundry
and Identifiers.org,
respectively. These conflicts needed thoughtful mediation. After
discussing on GitHub
with Bill Hogan, the responsible author of the
Geographical Entity Ontology, we decided that to use the prefix
geogeo
in the Bioregistry.
The Gene Expression Omnibus maintained its usage of
geo
due to its much wider usage
and longer history.
As a follow-up, I began curating a list of conflicts and implemented a technical solution in the Bioregistry’s nightly automated alignment workflow to prevent automated alignment between known conflicting prefixes from different registries, but this only a partial solution to a problem that is ultimately dependent on having confidence in the external resources (which is indeed its own issue).
What the Bioregistry Aligns
A small, select set of registries are fully automatically ingested in the Bioregistry which includes (at the time of writing) Identifiers.org, Name-to-Thing, the OBO Foundry, and the Ontology Lookup Service. The remaining registries are excluded for a variety of reasons including redundancy with other resources (e.g., AberOWL and OntoBee), a lack of modernization or alignment (e.g., NCBI’s registry), general inclusion of non-nomenclature resources (e.g., UniProt’s registry, FAIRsharing), and a lack of minimum quality standards (e.g., BioPortal). A summary and slightly more detailed explanation about these sources can be found here.
In many ways, the fact that the Bioregistry fully imports some resources and automatically aligns with others
The Bioregistry imports Identifiers.org, OBO Foundry, and N2T as well as many other resources (see for a full list), so it can be a one-stop shop for most resources. However, it does not import all of BioPortal, so users should check there too.
How to Check Your Prefix is Unique
Ultimately the point of this post is to present a workflow for any potential who want to check their new ontology request has a unique prefix (which will soon be a technical requirement in the OBO Foundry). Because the Bioregistry imports many resources, it’s sufficient to just check the Bioregistry and BioPortal (assuming you’re interested in respecting the BioPortal content).
Manual Check
The first way to check if your prefix is unique is to manually read through some of the sites.
Resource | Home Page | Prefix List |
---|---|---|
Bioregistry | https://bioregistry.io | https://bioregistry.io/registry |
Bioportal | https://bioportal.bioontology.org | https://bioportal.bioontology.org/ontologies |
While the BioPortal API is locked behind API key access, the Bioregistry
additionally has a search endpoint at https://bioregistry.io/api/search?q=...
Data Dumps
The second way to check if your prefix is unique is by comparing it to full dumps of the Bioregistry and BioPortal. The Bioregistry can be downloaded in several formats that are updated on a nightly basis:
BioPortal doesn’t offer any first-party data dumps, but the Bioregistry generates one nightly here
Programmatic Access
The third way to check if your prefix is unique is by comparing it to the
Bioregistry and BioPortal using code from the bioregistry
python package
(which is updated nightly).
Programmatic way to check if something is in the Bioregistry:
import bioregistry
query = "EPSO"
available_in_bioregistry = bioregistry.normalize_prefix(query) is None
Programmatic way to check if something is in BioPortal:
from bioregistry.external.bioportal import get_bioportal
query = "EPSO"
bioportal_dict = get_bioportal()
available_in_bioportal = query not in bioportal_dict
Being high quality and enabling external contribution and improvements are core to the philosophy of the Bioregistry. While no solution is perfect for listing all possible prefixes and it might be necessary to do a bit of extra googling before picking a prefix, this is a great place to start. If during the process of choosing a prefix you find you might create a conflict, please consider also suggesting a new entry in the Bioregistry.