What’s a CURIE, and Why You Should be Using Them

Compact uniform resource identifiers, or CURIEs, are an important formalism for referencing biomedical entities. This post explains what they are, shows how to write them yourself, and briefly outlines how they fit into the semantic web, linked open data, and open biomedical ontology worlds.

In the semantic web, linked open data, and ontology communities, uniform resource identifiers (URIs) are used to reference named entities. For a given nomenclature, like the Chemical Entities of Biological Interest (ChEBI), URIs usually have two parts:

A URI prefix, such as https://www.ebi.ac.uk/chebi/searchId.do?chebiId=
A unique local identifier from the given nomenclature, such as 138488

All URIs from the same nomenclature share the same URI prefix but have different unique local identifiers. Here’s an example, using the ChEBI unique local identifier for alsterpaullone:

https://www.ebi.ac.uk/chebi/searchId.do?chebiId=138488

The Trouble with URIs

URIs are inconvenient because each named entity can be referenced by many different URIs. For example, a URI could start with either http or https. Even worse, there are several competing services that each try to mint the one true URI for each entity. XKCD sums up the ensuing chaos pretty well:

For the example molecule, alsterpaullone, here are some (but not all) of the possible URIs that could be used to reference it:

Provider	URI
First-party	https://www.ebi.ac.uk/chebi/searchId.do?chebiId=138488
Identifiers.org	https://identifiers.org/CHEBI:138488
Identifiers.org	https://identifiers.org/CHEBI/138488
Identifiers.org	http://identifiers.org/CHEBI:138488
Identifiers.org	http://identifiers.org/CHEBI/138488
OBO Library PURL	http://purl.obolibrary.org/obo/CHEBI_138488
Name-to-Thing	https://n2t.net/chebi:138488

The real issue with URIs is that the URI prefix (the beginning part) doesn’t really tell you anything. In fact, given a URI, you usually have to do some detective work to figure out which nomenclature authority it goes with.
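That detective work can be partly automated by checking a URI against a table of known URI prefixes. The sketch below is only an illustration, assuming a small hand-written table built from the alsterpaullone URIs above. The names URI_PREFIXES and parse_uri are hypothetical, and a realistic table would need far more entries.

```python
from typing import Optional, Tuple

# Hypothetical lookup table: known URI prefixes mapped to a short resource name.
# Every entry is copied from the ChEBI examples in the table above.
URI_PREFIXES = {
    "https://www.ebi.ac.uk/chebi/searchId.do?chebiId=": "chebi",
    "https://identifiers.org/CHEBI:": "chebi",
    "https://identifiers.org/CHEBI/": "chebi",
    "http://identifiers.org/CHEBI:": "chebi",
    "http://identifiers.org/CHEBI/": "chebi",
    "http://purl.obolibrary.org/obo/CHEBI_": "chebi",
    "https://n2t.net/chebi:": "chebi",
}


def parse_uri(uri: str) -> Optional[Tuple[str, str]]:
    """Return (resource name, local identifier), or None if the URI is unknown."""
    # Try longer URI prefixes first so the most specific match wins.
    for uri_prefix in sorted(URI_PREFIXES, key=len, reverse=True):
        if uri.startswith(uri_prefix):
            return URI_PREFIXES[uri_prefix], uri[len(uri_prefix):]
    return None
```

With this table, every alsterpaullone URI listed above resolves to the same pair of resource name and local identifier, while a URI from an unlisted provider returns None.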

One solution was to use resolvers that create “persistent URLs”, but in the end, there are many competing resolvers that don’t cover everything. For example, the OBO PURL system doesn’t cover HGNC and UniProt. The Identifiers.org system doesn’t cover many ontologies.

For practical purposes, it makes sense to keep track of the commonly used names of each resource, then just generate the kinds of URIs that people might want depending on what software or data systems they’re working with rather than prescribing one URI to be the canonical one.
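One way to picture this is a small registry that stores the URI prefixes each provider uses for a resource, so any flavor of URI can be generated on demand. This is a sketch under assumptions: the URI_PREFIXES_BY_PROVIDER dictionary and the generate_uris function are hypothetical names, and the URI prefixes are copied from the ChEBI examples earlier in this post.

```python
from typing import Dict

# Hypothetical registry: resource name -> provider -> URI prefix.
# Only one URI style per provider is kept here for brevity.
URI_PREFIXES_BY_PROVIDER: Dict[str, Dict[str, str]] = {
    "chebi": {
        "first-party": "https://www.ebi.ac.uk/chebi/searchId.do?chebiId=",
        "identifiers.org": "https://identifiers.org/CHEBI:",
        "obo": "http://purl.obolibrary.org/obo/CHEBI_",
        "n2t": "https://n2t.net/chebi:",
    },
}


def generate_uris(resource: str, local_id: str) -> Dict[str, str]:
    """Return one URI per known provider for the given entity."""
    # An unknown resource name raises KeyError rather than guessing a URI.
    providers = URI_PREFIXES_BY_PROVIDER[resource]
    return {
        provider: uri_prefix + local_id
        for provider, uri_prefix in providers.items()
    }
```

Calling generate_uris("chebi", "138488") gives back the first-party, Identifiers.org, OBO PURL, and Name-to-Thing URIs for alsterpaullone without declaring any one of them canonical.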

Come to the Dark Side, We have CURIEs

The solution is to use compact uniform resource identifiers (CURIEs), which replace the URI prefix with a more approachable prefix. A CURIE has three parts:

A prefix, such as chebi
A delimiter, which is a colon (:)
A unique local identifier from the given nomenclature, such as 138488

Since everyone agrees on what ChEBI is, it makes sense to use chebi as the prefix for ChEBI unique local identifiers. Here’s the same example for alsterpaullone, condensed as a CURIE:

chebi:138488

The best part of a CURIE is that you can associate each prefix with whichever URI prefix suits your use case. You can even have a database that stores all of the possible ones for you. Replacing URI prefixes with short prefixes is so common that it’s a core part of the SPARQL query language, which is used in both the semantic web and ontology communities to query data stored using the Resource Description Framework (RDF). Here’s an example SPARQL query that has these prefixes prominently at the top:

prefix obo: <http://purl.obolibrary.org/obo/>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?x ?p ?y
WHERE {
  {?x rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty ?p ;
    owl:someValuesFrom ?y ]
  }
  UNION {
   ?x rdfs:subClassOf ?y .
   BIND(rdfs:subClassOf AS ?p)
  }
  ?x a owl:Class .
  ?y a owl:Class .
}

This example was borrowed from the SPARQL queries in the repository for the Core Ontology for Biology and Biomedicine (COB). You don’t have to understand the SPARQL itself; just look at the first four lines, which each start with prefix.
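What a SPARQL engine does with those prefix lines can be mimicked in a few lines of Python. This is a minimal sketch, not how any particular SPARQL implementation works: PREFIX_MAP and expand are hypothetical names, and the map simply copies the four declarations from the query above.

```python
from typing import Dict

# Prefix map copied from the four prefix declarations in the SPARQL query above.
PREFIX_MAP: Dict[str, str] = {
    "obo": "http://purl.obolibrary.org/obo/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(compact: str, prefix_map: Dict[str, str] = PREFIX_MAP) -> str:
    """Replace the short prefix with its URI prefix.

    Raises ValueError if there is no colon and KeyError if the prefix is unknown.
    """
    # Split on the first colon only, in case the local part contains one.
    prefix, local_id = compact.split(":", 1)
    return prefix_map[prefix] + local_id
```

For instance, expand("rdfs:subClassOf") yields http://www.w3.org/2000/01/rdf-schema#subClassOf, which is the full URI the query refers to each time it writes rdfs:subClassOf.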

How to Build a CURIE

Most common vocabularies can be written as CURIEs. Here are a few examples:

Name	Prefix	Example Unique Local Identifier	Example CURIE
Gene Ontology	go	0032571	go:0032571
HGNC	hgnc	16793	hgnc:16793
UniProt	uniprot	P0DP23	uniprot:P0DP23
Disease Ontology	doid	0110974	doid:0110974
Medical Subject Headings	mesh	C063233	mesh:C063233

As you might guess, most prefixes are either the acronym for a nomenclature authority or the name itself. There are a few cases where this isn’t true, like for the Disease Ontology. Its acronym is DO, but its maintainers add ID, which is usually shorthand for “identifier”, and therefore get doid as a prefix.

Just to reiterate, it’s really easy to make a CURIE. You take the prefix, a colon, then the unique local identifier and smash them together! Note: some communities, such as the Open Biomedical Ontologies Foundry, like to stylize prefixes in uppercase or mixed case. If you live in URI world, this is a big deal, but for most practical purposes, it’s nice to be able to just keep it all lowercase.
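The recipe is short enough to write down directly. In the sketch below, make_curie and split_curie are hypothetical helper names, and lowercasing only the prefix (while leaving the local identifier untouched) is an assumption about what "keep it all lowercase" should mean in practice, since identifiers such as P0DP23 are case-sensitive.

```python
from typing import Tuple


def make_curie(prefix: str, local_id: str) -> str:
    """Join a prefix and a local identifier with a colon."""
    # Assumption: normalize the prefix to lowercase, keep the identifier as-is.
    return f"{prefix.lower()}:{local_id}"


def split_curie(curie: str) -> Tuple[str, str]:
    """Split a CURIE into (lowercased prefix, local identifier)."""
    # Split on the first colon only; raises ValueError if there is none.
    prefix, local_id = curie.split(":", 1)
    return prefix.lower(), local_id
```

So make_curie("CHEBI", "138488") and make_curie("chebi", "138488") both produce chebi:138488, and split_curie("GO:0032571") gives back the prefix go with the identifier 0032571.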

How do you know what the right prefix is for each resource? And who even keeps track of this stuff? The Bioregistry keeps an up-to-date list that you can browse or search. My team at Harvard Medical School has been building this with help from the community to serve the needs that previous registries didn’t: most importantly, to make the data open and transparent and to enable community suggestions in an open and fair way.