Compliance of Bioregistry Prefixes to the W3C Standard

This post gives a brief background on the formal definition of the syntax and semantics of compact uniform resource identifiers (CURIEs) from the World Wide Web Consortium (W3C) and investigates how many prefixes in the Bioregistry are compliant with the standard.

Syntax

The W3C’s CURIE 1.0 Syntax is unfortunately obfuscated. Understanding it requires navigating through several pages and reading cryptic definitions in a BNF-like notation. Below is a short explanation of the two important parts and a nice simplification:

safe_curie  :=  '[' curie ']'
curie       :=  [ [ prefix ] ':' ] reference
prefix      :=  NCName
reference   :=  irelative-ref

where NCName is defined on this page as

NCName     ::= (Letter | '_') (NCNameChar)*
NCNameChar ::= Letter | Digit  | '.' | '-' | '_' | CombiningChar | Extender

and irelative-ref is defined by reference to the external RFC 3987 (Internationalized Resource Identifiers). Understanding this part is not strictly necessary for checking Bioregistry prefixes.
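To make the grammar concrete, here is a minimal sketch of splitting a CURIE into its prefix and reference. The function name `parse_curie` is an assumption made for illustration and is not part of the W3C specification or the Bioregistry. It only handles the bracket and colon structure; it does not validate the prefix as an NCName or the reference as an irelative-ref.

```python
from typing import Optional, Tuple


def parse_curie(curie: str) -> Tuple[Optional[str], str]:
    """Split a CURIE or safe CURIE into (prefix, reference).

    Mirrors ``curie := [ [ prefix ] ':' ] reference``: the prefix is None
    when there is no colon and the empty string when the CURIE starts
    with a colon.
    """
    # A safe CURIE is an ordinary CURIE wrapped in square brackets
    if curie.startswith("[") and curie.endswith("]"):
        curie = curie[1:-1]
    # Split on the first colon only, since the reference may contain colons
    prefix, sep, reference = curie.partition(":")
    if not sep:
        # No colon at all, so the whole string is the reference
        return None, curie
    return prefix, reference
```

For example, under these assumptions `parse_curie("[go:0008150]")` and `parse_curie("go:0008150")` both give the prefix `go` and the reference `0008150`.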

After unpacking all of these nested references and making the reasonable assumption that the strange characters referenced by CombiningChar and Extender are unlikely to appear in any real prefixes, we arrive at the following regular expression for validating prefixes: ^[a-zA-Z_][a-zA-Z0-9._-]*$
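One detail deserves care when transcribing this expression into Python: inside a character class, a hyphen between two characters denotes a range, so writing `.-_` means every character from `.` (U+002E) through `_` (U+005F) rather than three literal characters. Placing the hyphen last (or escaping it) avoids this, and anchoring the end of the pattern ensures the whole prefix is checked, not just its beginning. The sketch below contrasts the two spellings; the names `RANGE_BUG` and `is_w3c_prefix` are introduced here purely for illustration.

```python
import re

# The hyphen between '.' and '_' is read as the range U+002E..U+005F,
# which admits '/', ':' and '@' but does not admit '-' itself
RANGE_BUG = re.compile(r"^[a-zA-Z_][a-zA-Z0-9.-_]*$")

# With the hyphen placed last, '.', '_' and '-' are all literals
W3C_PREFIX = re.compile(r"^[a-zA-Z_][a-zA-Z0-9._-]*$")


def is_w3c_prefix(prefix: str) -> bool:
    """Check a prefix against the simplified, ASCII-only NCName pattern."""
    # fullmatch also rejects a trailing newline, which '$' alone would allow
    return W3C_PREFIX.fullmatch(prefix) is not None


print(is_w3c_prefix("chebi"))  # True
print(is_w3c_prefix("3dmet"))  # False, starts with a digit
print(is_w3c_prefix("ab-c"))  # True, the hyphen is a literal here
print(RANGE_BUG.match("ab/c") is not None)  # True, '/' falls inside the range
```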

Bioregistry Compliance

It’s relatively easy to write a script that checks Bioregistry prefixes against this regular expression.

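import re

import bioregistry
from tabulate import tabulate

# First character: a letter or underscore. Remaining characters: letters,
# digits, '.', '_' or '-'. The hyphen goes last in the class so that it is
# a literal rather than a range, and '$' makes the whole prefix count.
W3C_PREFIX = re.compile(r"^[a-zA-Z_][a-zA-Z0-9._-]*$")

# Collect a (markdown link, name) row for each prefix that fails the pattern
failed = [
    (
        f"[{resource.prefix}](https://bioregistry.io/{resource.prefix})",
        resource.get_name(),
    )
    for resource in bioregistry.resources()
    if not W3C_PREFIX.match(resource.prefix)
]
print(tabulate(failed, headers=["prefix", "name"], tablefmt="github"))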

This script produces the following table as an output:

prefix	name
3dmet	3D Metabolites
4dn.biosource	4D Nucleome Data Portal Biosource
4dn.replicate	4D Nucleome Data Portal Experiment Replicate

Note that only three prefixes (at the time of writing) are non-compliant, each because it starts with a number instead of a letter or underscore. Overall, the Bioregistry is doing pretty well! This check covers neither preferred prefixes nor synonyms, which might be good material for a future update to this post or a follow-up post.

In the future, it might be nice to enforce some kind of prefix compliance at the unit test level to automatically check that prefixes are appropriate. This might also include a blacklist of certain generic prefixes (e.g., gene) or other rules discussed in the project’s contribution guidelines.
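As a rough idea of what such a test could look like, here is a minimal sketch using the standard library's unittest module. Everything in it is hypothetical: the names `GENERIC_PREFIXES`, `KNOWN_EXCEPTIONS` and `check_prefix`, the contents of both sets, and the hard-coded list of prefixes, which stands in for iterating over bioregistry.resources() in a real test suite. It is not how the Bioregistry currently tests its prefixes.

```python
import re
import unittest

W3C_PREFIX = re.compile(r"^[a-zA-Z_][a-zA-Z0-9._-]*$")

# Hypothetical blacklist of prefixes considered too generic
GENERIC_PREFIXES = {"gene", "protein", "disease"}

# Hypothetical allowlist of existing prefixes known to be non-compliant
KNOWN_EXCEPTIONS = {"3dmet", "4dn.biosource", "4dn.replicate"}


def check_prefix(prefix: str) -> list:
    """Return a list of human-readable problems found with a prefix."""
    problems = []
    if prefix not in KNOWN_EXCEPTIONS and not W3C_PREFIX.fullmatch(prefix):
        problems.append("not a valid NCName")
    if prefix in GENERIC_PREFIXES:
        problems.append("too generic")
    return problems


class TestPrefixes(unittest.TestCase):
    """Check that every prefix passes the compliance rules."""

    def test_prefixes(self):
        # Stand-in data; a real test would iterate over the registry instead
        prefixes = ["chebi", "go", "3dmet"]
        for prefix in prefixes:
            with self.subTest(prefix=prefix):
                self.assertEqual([], check_prefix(prefix))
```

Keeping the already registered exceptions in an explicit allowlist would let the test pass today while still rejecting any new prefix that breaks the rules.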