Idiomatic conversion between URIs and compact URIs
The semantic web and ontology communities needed a reusable Python package for
converting between uniform resource identifiers (URIs) and compact URIs (CURIEs)
that is reliable, idiomatic, generic, and performant. This post describes the
curies
Python package that fills this
need.
After installing with pip install curies
or checking out the code on
GitHub and installing a local copy, you can
directly jump in to using the curies
package. Its main data structure is
curies.Converter
.
It can be instantiated with various class methods corresponding to data in one
of several formats.
The most common format is a prefix map, a dictionary containing a one-to-many
mapping from CURIE prefixes to URI prefixes. It can be used in combination with
the
Converter.from_prefix_map
class method. The following example includes some (but not all) of the CURIE and
URI prefixes used by ontologies in the
Open Biological and Biomedical Ontology (OBO) Foundry.
from curies import Converter
prefix_map = {
"CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
"MONDO": "http://purl.obolibrary.org/obo/MONDO_",
"GO": "http://purl.obolibrary.org/obo/GO_",
# ... and so on
"OBO": "http://purl.obolibrary.org/obo/",
}
converter = Converter.from_prefix_map(prefix_map)
The Converter
class indexes the prefix map using a
trie data structure, which makes search of
the beginning of sequences (such as strings) efficient. The curies
implementation builds on the implementation of this data structure in the
PyTrie
package.
Conversion
A uniform resource identifier (URI) that corresponds to one of the URI prefixes
registered in the converter can be compressed into a compact URI (CURIE) using
the
Converter.compress
method. In the following example, we use the canonical URI (within the scope of
the OBO Foundry) for the Gene Ontology term for
response to vitamin K (GO:0032571).
>>> converter.compress("http://purl.obolibrary.org/obo/GO_0032571")
'GO:0032571'
When some URI prefixes are partially overlapping (e.g.,
http://purl.obolibrary.org/obo/CHEBI_
for GO
and
http://purl.obolibrary.org/obo/
for OBO
), the longest URI prefix will always
be matched. For example, compressing http://purl.obolibrary.org/obo/GO_0032571
returns GO:0032571
instead of OBO:GO_0032571
.
If there’s no matching URI prefix, then compress()
will return None
.
>>> converter.compress("http://example.com/missing:0000000") is None
True
Similarly, a CURIE can be expanded into a URI using the
Converter.expand
method.
>>> converter.expand("GO:0032571")
'http://purl.obolibrary.org/obo/GO_0032571'
If there’s no matching CURIE prefix, then expand()
will return None
.
>>> converter.expand("missing:0000000") is None
True
Getting Prefix Maps
The curies
package includes functions for loading several prefix maps from
external resources. These are not cached in order to take advantage of the most
recent versions. This is particularly important for resources like the
Bioregistry that are updated frequently.
Name | Function | Description |
---|---|---|
Bioregistry | curies.get_bioregistry_converter |
A high-coverage, general purpose registry for the life and natural sciences. |
OBO Foundry | curies.get_obo_converter |
A set of orthogonal ontologies for the life sciences constructed for mutual interoperability |
Prefix Commons | curies.get_prefixcommons_converter |
A medium-coverage, general purpose registry for the life and natural sciences |
Gene Ontology | curies.get_go_converter |
A project-specific prefix map for the Gene Ontology, includes several duplicate and non-standard definitions |
Monarch | curies.get_monarch_converter |
A project-specific prefix map for the Monarch Initiative, includes several duplicate and non-standard definitions |
Loading from the Bioregistry
The bioregistry
Python package
has first-class support for the curies
package through the generic function
bioregistry.get_converter
.
This can be used as an alternative to curies.get_bioregistry_converter
in
cases when the Bioregistry is installed and it’s desired to use local data.
import bioregistry
from curies import Converter
converter: Converter = bioregistry.get_converter()
Loading from prefixmaps
The prefixmaps
Python package keeps
various prefix maps under version control that also has partial support for the
curies
package using the extended prefix map data structure (as opposed to a
prefix map, this includes synonyms). See
Converter.from_extended_prefix_map
for more information on how to use this data structure.
from prefixmaps import load_context
from curies import Converter
extended_prefix_map = load_context("obo").as_extended_prefix_map()
converter = Converter.from_extended_prefix_map(extended_prefix_map)
Related
Here’s a short (probably incomplete) list of other packages I’ve found that have related functionalities:
- https://github.com/prefixcommons/prefixcommons-py (Python)
- https://github.com/prefixcommons/curie-util (Java)
- https://github.com/geneontology/curie-util-py (Python)
- https://github.com/geneontology/curie-util-es5 (Node.js)
- https://github.com/endoli/curie.rs (Rust)
This post didn’t touch the more advanced features of the Converter
class such
as its support for CURIE prefix synonyms and URI prefix synonyms. It also didn’t
touch the curies.chain()
function which enables several pre-instantiated
converters to be used in succession, similarly to the Python built-in
collections.ChainMap
class. These are described in the documentation at
curies.readthedocs.io