A Glossary for the Bioregistry and Biopragmatics Stack
There are a lot of terms that I’ve been throwing around when talking about the Bioregistry, so this blog post is a first draft of a glossary of all of them.
Later, I will revise this further and put it either on the Bioregistry website, or make a totally new repo on the Biopragmatics GitHub organization.
Semantic spaces
While a controlled vocabulary enumerates a set of named entities, a semantic space enumerates a set of stable local identifiers for entities. Most high-quality controlled vocabularies also assign local identifiers to their named entities and are therefore also semantic spaces. For example, the Chemical Entities of Biological Interest (ChEBI) is a well-known ontology in the biomedical domain that is both a controlled vocabulary and a semantic space.
The term local identifier is synonymous with identifier and accession,
but has the added qualifier local as a reminder that two semantic
spaces may use the same one. For example, the Chemical Entities of Biological Interest (ChEBI)
entry for 6-methoxy-2-octaprenyl-1,4-benzoquinone
and the Human Disease Ontology (DOID) entry for
gender identity disorder
share the local identifier 1234.
Formalizing local identifiers
It’s often useful to have a regular expression
that describes local identifiers of a given semantic space. For example,
both ChEBI and DOID use local identifiers that look like numbers, which match
the regular expression ^\d+$. The ^ and $ anchor the match to the beginning
and end of the string and appear exactly the same in all regular expressions
for local identifiers. The \d will match a single digit and the + means that the
preceding token (\d) can be matched one or more times in a row.
It’s important to remember that identifiers might look like numbers, but they
should never be treated as such. For example, the Gene Ontology (GO)
uses identifiers that are left-padded with zeros like in 0032571
for response to vitamin K. The regular
expression pattern for GO entries is ^\d{7}$, since there are always exactly
seven digits. Regular expressions don’t have a straightforward way to describe
numbers that are left-padded with zeros, so keep in mind that this
approximation is a good balance between precision and simplicity.
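Here is a minimal sketch of these two points in Python. The PATTERNS dictionary and the is_valid helper are hypothetical names I made up for illustration (they are not part of any library); only the patterns and identifiers come from the text above.

```python
import re

# Hypothetical lookup from prefix to local identifier pattern,
# using the patterns described in the text above.
PATTERNS = {
    "chebi": re.compile(r"^\d+$"),
    "doid": re.compile(r"^\d+$"),
    "go": re.compile(r"^\d{7}$"),
}


def is_valid(prefix: str, identifier: str) -> bool:
    """Check a local identifier against the pattern for its semantic space."""
    return PATTERNS[prefix].fullmatch(identifier) is not None


print(is_valid("chebi", "1234"))  # True
print(is_valid("go", "0032571"))  # True
print(is_valid("go", "32571"))    # False, since the zero padding is missing

# Treating the identifier as a number silently drops the zero padding
print(str(int("0032571")))        # 32571
```

Keeping identifiers as strings from the moment they are read avoids this kind of silent corruption.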
There are a variety of patterns used for identifiers, including integers (^\d+$;
e.g., PubMed), zero-padded integers (^\d{7}$; e.g., GO and other OBO
ontologies), universally unique identifiers (UUIDs; e.g., NCI Pathway
Interaction Database, NDEx), and many other variations.
Origins
Semantic spaces arise from several kinds of resources such as:
- Ontologies like the Gene Ontology (GO), Chemical Entities of Biological Interest (ChEBI), and Experimental Factor Ontology (EFO)
- Controlled Vocabularies like Entrez Gene, InterPro, and FamPlex
- Databases like Protein Data Bank and Gene Expression Omnibus
Completeness
Semantic spaces typically fall into one of several “completeness” categories:
- Complete by Definition like Enzyme Classification
- Complete, but Subject to Change like HGNC
- Always Incomplete like Chemical Entities of Biological Interest (ChEBI) and the Protein Data Bank (PDB)
Scope
Semantic spaces have a variety of scopes:
- Single entity type like HGNC
- A few entity types like the Gene Ontology (GO)
- Many entity types like Medical Subject Headings (MeSH), Unified Medical Language System (UMLS), National Cancer Institute Thesaurus (NCIT)
Relationship to Projects and Organizations
Semantic spaces do not always correspond one-to-one with projects. For example, the ChEMBL database contains both the ChEMBL Compound and ChEMBL Target semantic spaces, and the Uber Anatomy Ontology (UBERON) contains both the UBERON and UBPROP semantic spaces for terms and properties, respectively.
Providers
A provider returns information about entities from a given semantic space. A provider
is characterized by a URI format string, or URI formatter, into which a
local identifier from its semantic space can be substituted for a special
token (e.g., $1). For example, you can get a web page about HRAS by replacing $1 in the
URI format string http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$1
with its HGNC identifier, 5173.
Well-behaved URI format strings have only one instance of the special token, which
occurs at the end. Poorly behaved URI format strings may have additional
characters following the special token, as in
http://rebase.neb.com/rebase/enz/$1.html for REBASE, or several instances of it, as in
http://eawag-bbd.ethz.ch/$1/$1_map.html for the UM-BBD Pathway database.
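As a rough sketch, substituting into a URI format string and checking whether it is well-behaved could look like the following in Python. The function names format_uri and is_well_behaved are my own hypothetical choices, and the REBASE identifier 101 is the one for Asp14HI used later in this post.

```python
def format_uri(uri_format: str, identifier: str) -> str:
    """Substitute a local identifier for every $1 token in a URI format string."""
    return uri_format.replace("$1", identifier)


def is_well_behaved(uri_format: str) -> bool:
    """Check that the $1 token appears exactly once, at the very end."""
    return uri_format.count("$1") == 1 and uri_format.endswith("$1")


HGNC = "http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$1"
REBASE = "http://rebase.neb.com/rebase/enz/$1.html"
UMBBD = "http://eawag-bbd.ethz.ch/$1/$1_map.html"

print(format_uri(HGNC, "5173"))
# http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=5173
print(format_uri(REBASE, "101"))
# http://rebase.neb.com/rebase/enz/101.html
print(is_well_behaved(HGNC), is_well_behaved(REBASE), is_well_behaved(UMBBD))
# True False False
```

Replacing every instance of the token means the same function works for both well-behaved and poorly behaved format strings.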
Content Type
While providers typically return human-readable HTML, they can also return many other data types, including:
- Images (e.g., https://www.ebi.ac.uk/chebi/displayImage.do?defaultImage=true&chebiId=132964 for the ChEBI entry on fluazifop-P-butyl)
- XML (e.g., https://www.uniprot.org/uniprot/P10636.xml for UniProt entry on human Microtubule-associated protein tau)
- JSON (e.g., https://gen3.biodatacatalyst.nhlbi.nih.gov/ga4gh/drs/v1/objects/0000ffeb-36e0-4a29-b21d-84423bda979d for NCBI’s BioData Catalyst)
- RDF
Providers can return any other information that can be transferred via HTTP, FTP, or related data transfer protocols. Alternatively, content negotiation could be used to return multiple kinds of data from the same provider URI.
Responsibility
Most semantic spaces have an associated first-party provider that returns information via a web page. Some semantic spaces, like ChEBI, have several first-party providers for different content types (e.g., HTML, image). Some semantic spaces, like Entrez Gene, have additional external providers, including databases that use its identifiers like the Comparative Toxicogenomics Database. Some semantic spaces, such as many OBO ontologies, do not have an associated first-party provider and rely solely on third-party browsers like AberOWL, OntoBee, and the Ontology Lookup Service.
Naming things on the semantic web
There are two (mostly) interchangeable formalisms for naming things in the semantic web: uniform resource identifiers (URIs) and compact uniform resource identifiers (CURIEs).
Uniform Resource Identifiers (URIs)
The semantic web community has adopted the internationalized resource identifier (IRI) as the de facto standard for naming entities. In practice, usage is often restricted to IRIs that are also uniform resource identifiers (URIs) (i.e., they only use ASCII characters) and that are also valid uniform resource locators (URLs) (i.e., they point to a web page). In applied semantic web contexts like biomedicine, the subtleties between URLs, URIs, and IRIs are disregarded and the term URI is preferred, such as in the seminal paper Identifiers for the 21st Century. A more detailed explanation of the difference between URLs, URIs, and IRIs can be found here.
For a given semantic space like ChEBI, URIs can usually be constructed given two parts:
- A URI prefix
- A local identifier
All URIs from the same semantic space have the same URI prefix, but a different local identifier. Here’s an example, using the ChEBI local identifier for alsterpaullone, where the URI prefix is https://www.ebi.ac.uk/chebi/searchId.do?chebiId= and the local identifier is 138488:
https://www.ebi.ac.uk/chebi/searchId.do?chebiId=138488
There may be many URI prefixes corresponding to the same semantic space and therefore many URIs describing the same entity. For example, ChEBI also serves images with:
https://www.ebi.ac.uk/chebi/displayImage.do?defaultImage=true&imageIndex=0&chebiId=138488
Compact Uniform Resource Identifiers (CURIEs)
A compact uniform resource identifier (CURIE) allows for the replacement of a URI prefix in a URI with a short prefix. As a short recapitulation of the W3C specification, a CURIE has three parts:
- A prefix
- A delimiter
- A local identifier from the given semantic space
Since everyone agrees on what ChEBI is within the biomedical domain, it makes
sense to use chebi as the prefix for ChEBI local identifiers. However, there
is no globally unique set of prefixes used across the semantic web (nor should
there be). Therefore, when using CURIEs, you need at minimum a prefix map
(described below) and ideally a registry that stores additional metadata about
each prefix.
Here’s the same example as in the URI section above for alsterpaullone, but now condensed into a CURIE with the prefix chebi, a colon as the delimiter, and the local identifier 138488:
chebi:138488
Converting between URIs and CURIEs
A prefix map associates each prefix to exactly one URI prefix. It can be used to expand CURIEs into URIs. Disregarding (for now) how to choose the best URI prefix, one potential prefix map that could be used to expand the example CURIE for alsterpaullone could be:
{
"chebi": "https://www.ebi.ac.uk/chebi/searchId.do?chebiId="
}
A simple algorithm for expanding a CURIE to a URI is as follows:
- Split the CURIE on the first instance of the delimiter, usually a colon (:)
- Look up the left-hand side of the split (i.e., the prefix) in the prefix map
- String concatenate the resulting URI prefix with the right-hand side of the split (i.e., the local identifier)
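The three steps above can be sketched in a few lines of Python. The function name expand and its signature are hypothetical choices for illustration; the prefix map is the one from the example above.

```python
PREFIX_MAP = {
    "chebi": "https://www.ebi.ac.uk/chebi/searchId.do?chebiId=",
}


def expand(curie: str, prefix_map: dict, delimiter: str = ":") -> str:
    """Expand a CURIE into a URI using a prefix map."""
    # 1. Split on the first instance of the delimiter only
    #    (raises a ValueError if the delimiter is missing)
    prefix, identifier = curie.split(delimiter, 1)
    # 2. Look up the prefix (raises a KeyError if it is not in the map)
    uri_prefix = prefix_map[prefix]
    # 3. Concatenate the URI prefix with the local identifier
    return uri_prefix + identifier


print(expand("chebi:138488", PREFIX_MAP))
# https://www.ebi.ac.uk/chebi/searchId.do?chebiId=138488
```

Splitting only on the first delimiter matters because a local identifier may itself contain a colon.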
A reverse prefix map associates one or more URI prefixes to each prefix. It can be used to contract URIs into CURIEs. Disregarding (for now) how to choose the best prefix for each URI prefix, one potential reverse prefix map that could be used to contract the two example URIs for alsterpaullone could be:
{
"https://www.ebi.ac.uk/chebi/searchId.do?chebiId=": "chebi",
"https://www.ebi.ac.uk/chebi/displayImage.do?defaultImage=true&imageIndex=0&chebiId=": "chebi"
}
Because it’s possible some URI prefixes might overlap, it’s a good heuristic to check a given URI against a reverse prefix map in decreasing order by URI prefix length.
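Here is a minimal sketch of that heuristic in Python. The function name contract is hypothetical, and I have added two overlapping OBO entries (with assumed prefixes obo and dron) purely to illustrate why the ordering matters; they are not a recommendation for how to register those prefixes.

```python
from typing import Optional

REVERSE_PREFIX_MAP = {
    "https://www.ebi.ac.uk/chebi/searchId.do?chebiId=": "chebi",
    "https://www.ebi.ac.uk/chebi/displayImage.do?defaultImage=true&imageIndex=0&chebiId=": "chebi",
    # Illustrative overlapping URI prefixes (assumed for this example)
    "http://purl.obolibrary.org/obo/": "obo",
    "http://purl.obolibrary.org/obo/DRON_": "dron",
}


def contract(uri: str, reverse_prefix_map: dict, delimiter: str = ":") -> Optional[str]:
    """Contract a URI into a CURIE, trying the longest URI prefixes first."""
    for uri_prefix in sorted(reverse_prefix_map, key=len, reverse=True):
        if uri.startswith(uri_prefix):
            identifier = uri[len(uri_prefix):]
            return reverse_prefix_map[uri_prefix] + delimiter + identifier
    return None  # no URI prefix matched


print(contract("https://www.ebi.ac.uk/chebi/searchId.do?chebiId=138488", REVERSE_PREFIX_MAP))
# chebi:138488
print(contract("http://purl.obolibrary.org/obo/DRON_0000005", REVERSE_PREFIX_MAP))
# dron:0000005
```

Without the sort, the second URI could match the shorter OBO prefix first and come out as obo:DRON_0000005.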
Poorly Behaved URIs
Unfortunately, not all URLs that provide information about entities
in semantic spaces can be trivially split into a URI prefix and a
local identifier. For example, the REBASE
entry for Asp14HI has the URI
http://rebase.neb.com/rebase/enz/101.html.
Note the pesky .html at the end, which, if removed, causes an HTTP 404 error
due to the implementation of the REBASE website.
While this creates a big problem for parsing URIs into CURIEs, it’s still possible to generate a URI from a CURIE given a slight variation on a prefix map, which relies on the previously described notion of URI formatters (see the section above on Providers).
A URI prefix corresponds to a special case of a URI formatter where there
is exactly one instance of $1 that appears at the end of the string.
Therefore, it is more valuable to curate URI formatters and programmatically
generate prefix maps when possible. The fact that some URIs are hard to
construct easily is also one of the motivations for resolver services, described
in a later section.
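To make this concrete, here is a sketch of generating a prefix map from curated URI formatters in Python. The names URI_FORMATS, get_prefix_map, and expand_with_format are hypothetical, as are the hgnc and rebase prefixes; the format strings themselves are the ones used earlier in this post.

```python
URI_FORMATS = {
    "chebi": "https://www.ebi.ac.uk/chebi/searchId.do?chebiId=$1",
    "hgnc": "http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$1",
    "rebase": "http://rebase.neb.com/rebase/enz/$1.html",
}


def get_prefix_map(uri_formats: dict) -> dict:
    """Derive a prefix map, keeping only the well-behaved URI formatters."""
    prefix_map = {}
    for prefix, uri_format in uri_formats.items():
        if uri_format.count("$1") == 1 and uri_format.endswith("$1"):
            # Strip the trailing $1 token to get the URI prefix
            prefix_map[prefix] = uri_format[: -len("$1")]
    return prefix_map


def expand_with_format(curie: str, uri_formats: dict, delimiter: str = ":") -> str:
    """Expand a CURIE using a URI formatter, which also works for poorly behaved ones."""
    prefix, identifier = curie.split(delimiter, 1)
    return uri_formats[prefix].replace("$1", identifier)


print(sorted(get_prefix_map(URI_FORMATS)))
# ['chebi', 'hgnc']
print(expand_with_format("rebase:101", URI_FORMATS))
# http://rebase.neb.com/rebase/enz/101.html
```

REBASE drops out of the derived prefix map, but its CURIEs can still be expanded directly from the formatter.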
Open Biomedical Ontologies CURIEs
The Open Biomedical Ontologies (OBO) Foundry provides a persistent URL service (PURL) to create stable URIs for biomedical entities curated in their ontologies (e.g., Human Disease Ontology, Phenotype And Trait Ontology). These URIs have four parts:
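- A URI prefix (always the same)
- An ontology prefix
- A delimiter (always the same)
- An ontology local identifier
For example, the following URI combines the URI prefix http://purl.obolibrary.org/obo/, the ontology prefix DRON, the delimiter _, and the ontology local identifier 0000005:
http://purl.obolibrary.org/obo/DRON_0000005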
Confusingly, the entire combination of the ontology’s prefix, the delimiter,
and the ontology’s local identifier (e.g., DRON_0000005) is considered in
some contexts as a local identifier in a theoretical semantic space
for OBO, whose URI prefix is http://purl.obolibrary.org/obo/. This confusion
led services like Identifiers.org to denote these ontologies as having the
“namespace embedded in the local unique identifier” and therefore include the
prefix again in the regular expression pattern describing the local
identifiers, e.g., ^DOID:\d+$ for the Human Disease Ontology.
This notation of the regular expression makes no sense for several reasons:
- The regular expression should correspond to the local identifiers of a
semantic space like DOID, not a registry like the OBO PURL system.
- If you follow the simple algorithm for constructing a CURIE from a prefix and
identifier, you end up with identifiers that look like CURIEs like DOID:11337
or redundant CURIEs that look like DOID:DOID:11337.
- Identifiers.org doesn’t even handle CURIEs constructed following the rules for embedding the prefix in the local identifier.
- It creates ambiguities in spreadsheets where columns are supposed to contain local identifiers or CURIEs.
The solution is simply to drop the entire notion of namespaces embedded in local unique identifiers. Since this would require updating a lot of data in a lot of places, the interim solution is to programmatically normalize identifiers and CURIEs to remove instances of this redundancy.
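One way this normalization could look in Python is sketched below. The function names normalize_identifier and normalize_curie are hypothetical, and the case-insensitive comparison is my own assumption about what counts as a redundant prefix.

```python
def normalize_identifier(prefix: str, identifier: str) -> str:
    """Strip any redundant, case-insensitive "PREFIX:" from a local identifier."""
    redundant = prefix.lower() + ":"
    while identifier.lower().startswith(redundant):
        identifier = identifier[len(redundant):]
    return identifier


def normalize_curie(curie: str) -> str:
    """Remove a redundantly embedded prefix from a CURIE, if there is one."""
    prefix, identifier = curie.split(":", 1)
    return prefix + ":" + normalize_identifier(prefix, identifier)


print(normalize_identifier("doid", "DOID:11337"))  # 11337
print(normalize_identifier("doid", "11337"))       # 11337
print(normalize_curie("DOID:DOID:11337"))          # DOID:11337
print(normalize_curie("DOID:11337"))               # DOID:11337
```

Both functions leave already-clean input unchanged, so they are safe to apply to a whole spreadsheet column regardless of which convention each row used.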
Registry
A registry is a special kind of semantic space that enumerates other semantic spaces and assigns them local identifiers. Due to the connection with prefix maps and CURIEs, the local identifiers in registries are also colloquially called prefixes.
A registry also collects additional metadata about each semantic space, including its name, its canonical prefix, its stylized prefix, additional prefix synonyms, its homepage, an example local identifier, a regular expression pattern for local identifiers, and one or more URI format strings from both first-party and third-party sources. However, there are a wide variety of metadata standards across various biomedical and semantic web registries, and not all fields are included.
Like with semantic spaces, a high-quality registry should have an associated first-party provider that comprises a website for exploring its entries and their associated metadata.
Metaregistry
A metaregistry is a special kind of registry that assigns local identifiers to a collection of registries; it could even contain an entry about itself. It collects additional metadata about each registry, such as a description of its metadata standards and capabilities. Most importantly, a metaregistry contains mappings between equivalent entries in its constituent registries. Before the publication of this article, to the best of our knowledge, there were no dedicated metaregistries. Some registries such as FAIRSharing and the MIRIAM/Identifiers.org registry contain limited numbers of entries referring to other registries (e.g., BioPortal), but they do not delineate these records as representing registries, provide additional metadata about them, or provide mappings.
The only metaregistry in the biomedical domain is the Bioregistry.
Resolver
A resolver uses a registry to generate a URI for a given CURIE based on the registry’s default provider for the semantic space with the given prefix, then redirects the requester to the constructed URI. Resolvers are different from providers in that they are general for many semantic spaces and do not host content themselves. Two well-known resolvers are Identifiers.org and Name-To-Thing.
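The core logic of a resolver can be sketched as a function that turns a CURIE into an HTTP redirect. This is only an illustration under my own assumptions: the REGISTRY dictionary, the hgnc prefix, and the resolve function are hypothetical, and real resolvers like Identifiers.org have far richer registries and error handling.

```python
# Hypothetical registry mapping each prefix to its default provider's
# URI format string (taken from the examples earlier in this post)
REGISTRY = {
    "chebi": "https://www.ebi.ac.uk/chebi/searchId.do?chebiId=$1",
    "hgnc": "http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$1",
}


def resolve(curie: str):
    """Return an HTTP status code and headers for resolving a CURIE."""
    prefix, _, identifier = curie.partition(":")
    uri_format = REGISTRY.get(prefix)
    if uri_format is None or not identifier:
        return 404, {}  # unknown prefix or missing local identifier
    # The resolver hosts no content itself and only redirects
    return 302, {"Location": uri_format.replace("$1", identifier)}


print(resolve("chebi:138488"))
# (302, {'Location': 'https://www.ebi.ac.uk/chebi/searchId.do?chebiId=138488'})
print(resolve("unknown:1"))
# (404, {})
```

Returning a redirect rather than content is what distinguishes a resolver from a provider.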
Lookup Service
A lookup service is like a provider but generalized to serve many semantic spaces. Lookup services typically have a URI format string into which a compact identifier can be placed, like OntoBee, but many require more complicated programmatic logic to construct a URI. Some well-known lookup services are the OLS, AberOWL, OntoBee, and BioPortal.