We have a big problem in the bioinformatics community with namespaces, identifiers, and names. And nobody’s posed the question better than Rihanna herself.

During my Ph.D. at Fraunhofer, one of the old text miners reminisced to me about the late 90’s and early naughties when they had to curate their own dictionaries of synonyms for entities. I was lucky enough to have joined the bioinformatics community after excellent nomenclature resources like ChEBI and the HGNC were established and accepted by the community as gospel.

I consider these sources excellent because it’s quite easy to get a list of the identifiers and corresponding names that they maintain (TSV, etc.). There are other nomenclatures, like the ExPASy Enzyme Classes, that are stored as text files in non-standard formats.

The Open Biomedical Ontology (OBO) format and OBO Foundry were first published in 2007 as a solution for standardizing a growing set of biomedical ontologies that few shared semantics. Many ontology maintainers adopted their format, or at least used the OWL to OBO converter tools to include their ontologies in a reusable format. However, there remain some notable holdouts like the Cell Line Ontology that have not begun to distribute their content as OBO.

In parallel, the Ontology Lookup Service (OLS) was published as one of many front-ends for exploring this growing list of resources. In comparison, it may have been one of the first tools to provide a nice user experience that included a search engine (powered by solr, because they’re living in the Java world).

Both are lacking - there does not exist a solid OBO ecosystem (though Martin Larralde’s pronto may well soon change that) and even worse, the content in OBO loosely follows the standard, at best. On the other hand, the OLS has both an over-engineered interface that isn’t quite user friendly. For example, if you want to look up programmed cell death (GO:0012501), you have to know the internal OLS key for the namespace and the PURL for the identifier, which is not so obvious. Then you can finally hit the API.

And still, both of them lack some of my favorite, and arguably most important namespaces, like HGNC, RGD, MGI, UniProt, Entrez Gene, and PubChem. As an aside, dealing with PubChem is for people operating on a whole different level, so I’m not blaming anyone for dropping the ball on that one. Later, I will confess to doing the same.

Even worse, the OBO Foundry and OLS can’t even agree on what to call some namespaces. A great example is the NCBI taxonomy database. On the NCBI site, they say that the namespace is called NCBI and compact uniform identifiers (CURIEs) should look like NCBI:txid175694, OBO Foundry says the namespace is NCBITaxon (one of the few notable mixed-case namespace names) and CURIEs should look like NCBITaxon:175694.

Identifiers.org came along to solve some of these ambiguities with a curated database, but it’s missing lots of the things in OBO Foundry and OLS, and it even disagrees on others. They call the NCBI taxonomy namespace taxonomy and say that identifiers should look like taxonomy:175694. Exhausting!

Registry Comparison

One more issue is the GOGO problem. Many OBO ontologies use local identifiers that also include the prefix because a given ontology might contain terms imported from other ones. However, this means that ontologies that originated from the OBO world have redundant identifiers, like from GO (e.g., GO:GO:0012501). I know what you’re wondering: is Dr. Claw in charge? Maybe.

The reason I went down this rabbit hole is because I want to support people to do better curation. This means I want them to use identifiers instead of ever changing names. For example, it turns out the half life of an HGNC gene symbol is very short - thousands of them change every year. However, if I want people to use identifiers instead of names in their databases, their papers, and other writing, there need to be really good tools for looking up the names that go with each identifier and the cross-references (equivalences) to other databases that are talking about the same thing.

So I built PyOBO. It includes tools for reading the OBO Foundry and getting all of the OBO resources that are available (as well as many manual fixes for incorrect metadata), it uses Daniel Himmelstein’s Obonet for parsing and storing pre-parsed files for fast loading, and it applies a swath of rule-based normalization that I’ve manually curated by personally reading all of the OBO files, their identifiers, their cross-references, relationships, properties, and everything else. When it comes to data, there really is no way around getting your hands dirty.

I also went ahead and wrote parsers and converters for lots of other databases like Entrez, ComplexPortal, InterPro, and others so they could play nice with the rest of the ecosystem. Of course, this is an ongoing process. There are always more databases to include, and when it comes to super-sized ones like PubChem, the paradigms I used might not hold up anymore (though I did write parser/converter for it and you’re welcome to use it).

After this long journey of a blog post, I think we’re ready to address Rihanna’s perrenial question: what’s my name? Until now, there really didn’t exist a service that let you look up the name for an entity by its CURIE. The link I gave for the OLS is the closest I have found, and that just doesn’t cut it.

After all of this coding, I wrote a script (just run obo ooh-na-na) that takes all of the available sources, normalizes their namespaces, normalizes their identifiers, and dumps them as a big ‘ol TSV file. 3 columns - namespace, identifier, and name. No nonsense. Probably legal! Get it at DOI. I’ll make updates periodically as I add more sources, such as if/when I feel comfortable with including the PubChem dump - the CID-Title.gz file is about 1.3 gigabytes, which means this will significantly increase the size, but not so much that it’s unreasonable.

I can imagine that most people probably won’t want to download this file, or load it in memory (un-gzipped) every time they want to use it. I wrote a simple web service that wraps this dataset included in PyOBO. It should be as easy as running with the shell with python -m pyobo.apps.resolver then running the following python code:

import requests

# This is an exact match
successful_request = requests.get('http://localhost:5000/resolve/DOID:14330').json()
# {"identifier": "14330", "name": "Parkinson's disease", "prefix": "doid", "query": "DOID:14330", "success": True}

# This one remaps the prefix if you get it slightly wrong
successful_remapped_request = requests.get('http://localhost:5000/resolve/DO:14330').json()
# {"identifier": "14330", "name": "Parkinson's disease", "prefix": "doid", "query": "DO:14330", "success": True}

# This one can't find the identifier.
unsuccessful_request = requests.get('http://localhost:5000/resolve/DO:00000').json()
# {"identifier": "00000", "message": "Could not look up identifier", "prefix": "doid", "query": "DO:00000", "success": False}

# Keep in mind, the point of this service isn't to validate identifiers.
unsuccessful_crazy_request = requests.get('http://localhost:5000/resolve/DO:thisIsNotRightAtAll').json()
# {"identifier": "thisIsNotRightAtAll", "message": "Could not look up identifier", "prefix": "doid", "query": "DO:thisIsNotRightAtAll", "success": False}

# No mercy for bad prefixes
unsuccessful_prefix_lookup = requests.get('http://localhost:5000/resolve/notanamespace:0000').json()
# {"message": "Could not identify prefix", "query": "notanamespace:0000", "success": False}

It’s especially important that the service normalizes curies first, so both DOID:14330, doid:14330, and DO:14330 can all be resolved to their name, Parkinson’s disease. Because I did extensive manual curation of namespaces and their synonyms, NCBITaxon and taxonomy are both acceptable as well. However, this service doesn’t load from the aforementioned TSV, but rather takes advantage of PyOBO’s internal code for looking up mappings. I can imagine lots of ways I might re-write this service to directly take advantage of this dump (I also invite you to do the same, however best suits you) such as loading it into EdgeDB and auto-generating a GraphQL endpoint.

The last thing that I’m looking into getting this service hosted so everyone can benefit from it without doing dev-ops in their own organizations. Then I will continue to obfuscate all usage and documentation with references to pop culture. Enjoy!