Idiomatic conversion between URIs and compact URIs

The semantic web and ontology communities needed a reusable Python package for converting between uniform resource identifiers (URIs) and compact URIs (CURIEs) that is reliable, idiomatic, generic, and performant. This post describes the curies Python package that fills this need.

After installing with pip install curies, or checking out the code on GitHub and installing a local copy, you can jump straight into using the curies package. Its main data structure is curies.Converter, which can be instantiated through several class methods, each corresponding to a different input data format.

The most common format is a prefix map, a dictionary containing a one-to-one mapping from CURIE prefixes to URI prefixes. It can be loaded with the Converter.from_prefix_map class method. The following example includes some (but not all) of the CURIE and URI prefixes used by ontologies in the Open Biological and Biomedical Ontology (OBO) Foundry.

from curies import Converter

prefix_map = {
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "GO": "http://purl.obolibrary.org/obo/GO_",
    # ... and so on
    "OBO": "http://purl.obolibrary.org/obo/",
}
converter = Converter.from_prefix_map(prefix_map)

The Converter class indexes the prefix map using a trie data structure, which makes it efficient to find which registered prefixes match the beginning of a sequence (such as a string). The curies implementation builds on the implementation of this data structure in the PyTrie package.
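To make the idea of a trie lookup concrete, here is a minimal standard-library sketch. The PrefixTrie class and its method names are hypothetical and exist only for illustration; this is not the PyTrie API and not how curies is written internally.

```python
from __future__ import annotations

from typing import Optional


class PrefixTrie:
    """A tiny character trie mapping URI prefixes to CURIE prefixes (illustrative only)."""

    def __init__(self) -> None:
        self.children: dict[str, PrefixTrie] = {}
        # Set only on the node where a registered URI prefix ends
        self.curie_prefix: Optional[str] = None

    def insert(self, uri_prefix: str, curie_prefix: str) -> None:
        node = self
        for char in uri_prefix:
            node = node.children.setdefault(char, PrefixTrie())
        node.curie_prefix = curie_prefix

    def longest_match(self, uri: str) -> Optional[tuple[str, str]]:
        """Return (CURIE prefix, local identifier) for the longest registered URI prefix."""
        node = self
        best = None
        for index, char in enumerate(uri):
            node = node.children.get(char)
            if node is None:
                break  # no registered prefix continues along this path
            if node.curie_prefix is not None:
                # Remember the deepest (longest) prefix seen so far
                best = (node.curie_prefix, uri[index + 1:])
        return best


trie = PrefixTrie()
trie.insert("http://purl.obolibrary.org/obo/", "OBO")
trie.insert("http://purl.obolibrary.org/obo/GO_", "GO")
match = trie.longest_match("http://purl.obolibrary.org/obo/GO_0032571")
```

The lookup walks the URI one character at a time, so its cost depends on the length of the URI rather than on the number of registered prefixes.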

Conversion

A uniform resource identifier (URI) that corresponds to one of the URI prefixes registered in the converter can be compressed into a compact URI (CURIE) using the Converter.compress method. In the following example, we use the canonical URI (within the scope of the OBO Foundry) for the Gene Ontology term for response to vitamin K (GO:0032571).

>>> converter.compress("http://purl.obolibrary.org/obo/GO_0032571")
'GO:0032571'

When some URI prefixes partially overlap (e.g., http://purl.obolibrary.org/obo/GO_ for GO and http://purl.obolibrary.org/obo/ for OBO), the longest URI prefix will always be matched. For example, compressing http://purl.obolibrary.org/obo/GO_0032571 returns GO:0032571 instead of OBO:GO_0032571.

If there’s no matching URI prefix, then compress() will return None.

>>> converter.compress("http://example.com/missing:0000000") is None
True

Similarly, a CURIE can be expanded into a URI using the Converter.expand method.

>>> converter.expand("GO:0032571")
'http://purl.obolibrary.org/obo/GO_0032571'

If there’s no matching CURIE prefix, then expand() will return None.

>>> converter.expand("missing:0000000") is None
True
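The behavior described in this section (longest URI prefix wins, None when nothing matches) can be sketched over a plain dictionary. The standalone compress and expand functions below are illustrative stand-ins written for this post's examples, not the curies implementation, which uses a trie rather than a sorted scan.

```python
from __future__ import annotations

from typing import Optional

PREFIX_MAP = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "OBO": "http://purl.obolibrary.org/obo/",
}


def compress(uri: str, prefix_map: dict[str, str] = PREFIX_MAP) -> Optional[str]:
    """Compress a URI to a CURIE, preferring the longest matching URI prefix."""
    # Try longer URI prefixes first so the most specific one wins
    by_length = sorted(prefix_map.items(), key=lambda pair: len(pair[1]), reverse=True)
    for curie_prefix, uri_prefix in by_length:
        if uri.startswith(uri_prefix):
            return f"{curie_prefix}:{uri[len(uri_prefix):]}"
    return None


def expand(curie: str, prefix_map: dict[str, str] = PREFIX_MAP) -> Optional[str]:
    """Expand a CURIE to a URI, or return None if the prefix is unknown."""
    prefix, separator, identifier = curie.partition(":")
    uri_prefix = prefix_map.get(prefix)
    if not separator or uri_prefix is None:
        return None
    return uri_prefix + identifier


curie = compress("http://purl.obolibrary.org/obo/GO_0032571")
uri = expand("GO:0032571")
```

Because both functions can return None, calling code should check the result before using it as a string.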

Getting Prefix Maps

The curies package includes functions for loading several prefix maps from external resources. These are not cached in order to take advantage of the most recent versions. This is particularly important for resources like the Bioregistry that are updated frequently.

Name	Function	Description
Bioregistry	`curies.get_bioregistry_converter`	A high-coverage, general purpose registry for the life and natural sciences.
OBO Foundry	`curies.get_obo_converter`	A set of orthogonal ontologies for the life sciences constructed for mutual interoperability
Prefix Commons	`curies.get_prefixcommons_converter`	A medium-coverage, general purpose registry for the life and natural sciences
Gene Ontology	`curies.get_go_converter`	A project-specific prefix map for the Gene Ontology, includes several duplicate and non-standard definitions
Monarch	`curies.get_monarch_converter`	A project-specific prefix map for the Monarch Initiative, includes several duplicate and non-standard definitions

Loading from the Bioregistry

The bioregistry Python package has first-class support for the curies package through the generic function bioregistry.get_converter. This can be used as an alternative to curies.get_bioregistry_converter when the bioregistry package is installed and you want to use its local data.

import bioregistry
from curies import Converter

converter: Converter = bioregistry.get_converter()

Loading from `prefixmaps`

The prefixmaps Python package keeps various prefix maps under version control. It also has partial support for the curies package through the extended prefix map data structure which, unlike a plain prefix map, also includes synonyms. See Converter.from_extended_prefix_map for more information on how to use this data structure.

from prefixmaps import load_context
from curies import Converter

extended_prefix_map = load_context("obo").as_extended_prefix_map()
converter = Converter.from_extended_prefix_map(extended_prefix_map)
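For readers who have not seen an extended prefix map, the sketch below shows the general shape as I understand it from the curies documentation: a list of records, each with a canonical prefix and URI prefix plus optional synonyms. The example values and the flatten helper are illustrative assumptions, not part of either package.

```python
from __future__ import annotations

# Illustrative records; the synonym values are examples, not an authoritative list
extended_prefix_map = [
    {
        "prefix": "CHEBI",
        "prefix_synonyms": ["chebi"],
        "uri_prefix": "http://purl.obolibrary.org/obo/CHEBI_",
        "uri_prefix_synonyms": ["https://identifiers.org/chebi:"],
    },
    {
        "prefix": "GO",
        "prefix_synonyms": ["go"],
        "uri_prefix": "http://purl.obolibrary.org/obo/GO_",
        "uri_prefix_synonyms": [],
    },
]


def flatten(records: list[dict]) -> tuple[dict[str, str], dict[str, str]]:
    """Map every CURIE prefix and URI prefix (including synonyms) to its canonical CURIE prefix."""
    curie_prefixes: dict[str, str] = {}
    uri_prefixes: dict[str, str] = {}
    for record in records:
        canonical = record["prefix"]
        for prefix in [canonical, *record.get("prefix_synonyms", [])]:
            curie_prefixes[prefix] = canonical
        for uri_prefix in [record["uri_prefix"], *record.get("uri_prefix_synonyms", [])]:
            uri_prefixes[uri_prefix] = canonical
    return curie_prefixes, uri_prefixes


curie_prefixes, uri_prefixes = flatten(extended_prefix_map)
```

With synonyms in the index, a non-canonical CURIE prefix or URI prefix can be resolved to the same canonical prefix.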

Related

Here’s a short (probably incomplete) list of other packages I’ve found that have related functionality:

https://github.com/prefixcommons/prefixcommons-py (Python)
https://github.com/prefixcommons/curie-util (Java)
https://github.com/geneontology/curie-util-py (Python)
https://github.com/geneontology/curie-util-es5 (Node.js)
https://github.com/endoli/curie.rs (Rust)

This post didn’t touch on the more advanced features of the Converter class, such as its support for CURIE prefix synonyms and URI prefix synonyms. It also didn’t cover the curies.chain() function, which enables several pre-instantiated converters to be used in succession, similarly to the collections.ChainMap class in the Python standard library. These are described in the documentation at curies.readthedocs.io.
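For readers unfamiliar with collections.ChainMap, the sketch below shows the lookup behavior the analogy refers to, using plain prefix maps. It does not call curies.chain(), and the project_map URI prefix is a made-up example; check the curies documentation for exactly how chained converters resolve conflicts.

```python
from collections import ChainMap

# Hypothetical project-specific override (the example.org URI prefix is made up)
project_map = {"GO": "https://example.org/go/"}
obo_map = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
}

# Lookups search the maps in the order given and return the first hit
combined = ChainMap(project_map, obo_map)
go_uri_prefix = combined["GO"]        # found in project_map, listed first
chebi_uri_prefix = combined["CHEBI"]  # falls through to obo_map
```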
