Compliance of Bioregistry Prefixes to the W3C Standard
This post gives a brief background on the formal definition of the syntax and semantics of compact uniform resource identifiers (CURIEs) from the World Wide Web Consortium (W3C) and investigates how many prefixes in the Bioregistry comply with the standard.
Syntax
The W3C’s CURIE 1.0 Syntax is unfortunately obfuscated. Understanding it requires navigating through several pages and reading cryptic definitions in a BNF-like notation. Below is a short explanation of the two important parts and a nice simplification:
safe_curie := '[' curie ']'
curie := [ [ prefix ] ':' ] reference
prefix := NCName
reference := irelative-ref
where NCName is defined in the W3C’s Namespaces in XML recommendation as
NCName ::= (Letter | '_') (NCNameChar)*
NCNameChar ::= Letter | Digit | '.' | '-' | '_' | CombiningChar | Extender
and irelative-ref is defined by reference to the external RFC 3987. Understanding this part is not strictly necessary for checking Bioregistry prefixes.
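To make the grammar concrete, here is a minimal sketch of splitting a CURIE into its prefix and reference. The function name `split_curie` is hypothetical, and the sketch assumes that the first colon separates the prefix from the reference; it does not validate either part.

```python
from typing import Optional, Tuple


def split_curie(curie: str) -> Tuple[Optional[str], str]:
    """Split a CURIE (or safe CURIE) into ``(prefix, reference)``.

    The prefix is ``None`` when there is no colon at all and ``""`` when
    the CURIE starts with a bare colon, mirroring the two optional parts
    of ``curie := [ [ prefix ] ':' ] reference``.
    """
    # A safe CURIE is a CURIE wrapped in square brackets
    if curie.startswith("[") and curie.endswith("]"):
        curie = curie[1:-1]
    # Assumption: the first colon is the prefix/reference delimiter
    prefix, sep, reference = curie.partition(":")
    if not sep:
        return None, curie
    return prefix, reference


print(split_curie("go:0006915"))
print(split_curie("[chebi:1234]"))
```

Returning `None` versus an empty string keeps the "no prefix" and "empty prefix" cases distinguishable, since the grammar allows both.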
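After unpacking all of these nested references, restricting Letter to ASCII letters, and making the reasonable assumption that the unusual characters referenced by CombiningChar and Extender are unlikely to appear in any real prefixes, we arrive at the following regular expression for validating prefixes: ^[a-zA-Z_][a-zA-Z0-9._-]*$

The hyphen is placed last in the character class so that it is read as a literal character rather than as a range, and the trailing $ ensures that the whole prefix is checked rather than just its first character.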
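As a quick sanity check of the simplified grammar, the sketch below applies it to a handful of illustrative prefixes. The helper name `is_w3c_prefix` and the example list are assumptions made for illustration. The sketch uses `fullmatch` so that every character of the prefix is checked, and it places the hyphen last in the character class so that it is a literal rather than a range.

```python
import re

# ASCII-only simplification of NCName: a letter or underscore, followed by
# letters, digits, dots, underscores, or hyphens (hyphen last = literal).
PREFIX_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9._-]*")


def is_w3c_prefix(prefix: str) -> bool:
    """Return True if the whole prefix matches the simplified pattern."""
    return PREFIX_PATTERN.fullmatch(prefix) is not None


# Illustrative examples only
for example in ["chebi", "kegg.pathway", "_private", "3dmet", "my/prefix", ""]:
    print(repr(example), is_w3c_prefix(example))
```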
Bioregistry Compliance
It’s relatively easy to write a script that checks Bioregistry prefixes against this regular expression.
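import re

import bioregistry
from tabulate import tabulate

# Simplified W3C prefix pattern. The hyphen comes last in the character class
# so it is a literal, and the trailing $ makes the pattern cover the whole prefix.
W3C_PREFIX = re.compile(r"^[a-zA-Z_][a-zA-Z0-9._-]*$")

# Collect a (markdown link, name) row for each non-compliant prefix
failed = [
    (
        f"[{resource.prefix}](https://bioregistry.io/{resource.prefix})",
        resource.get_name(),
    )
    for resource in bioregistry.resources()
    if not W3C_PREFIX.match(resource.prefix)
]

print(tabulate(failed, headers=["prefix", "name"], tablefmt="github"))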
This script produces the following table as an output:
| prefix | name |
|---|---|
| 3dmet | 3D Metabolites |
| 4dn.biosource | 4D Nucleome Data Portal Biosource |
| 4dn.replicate | 4D Nucleome Data Portal Experiment Replicate |
Note that only three prefixes (at the time of writing) are non-compliant, each because it starts with a digit instead of a letter or underscore. Overall, the Bioregistry is doing pretty well! This check does not cover preferred prefixes or synonyms, which might be good material for a future update to this post or a follow-up post.
In the future, it might be nice to enforce some kind of prefix compliance at the unit test level so that prefixes are checked automatically. This might also include a blacklist of certain generic prefixes (e.g., gene) or other rules discussed in the project’s contribution guidelines.
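A rough sketch of what such a unit test could look like is below. This is not the Bioregistry's actual test suite: the helper `prefix_problems`, the contents of `GENERIC_PREFIXES`, and the stand-in prefix list are all hypothetical, and a real test would iterate over the registry itself.

```python
import re
import unittest

# ASCII-only simplification of the W3C prefix grammar (hyphen last = literal)
PREFIX_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9._-]*")

# Hypothetical list of prefixes considered too generic to register
GENERIC_PREFIXES = {"gene", "protein", "disease"}


def prefix_problems(prefix):
    """Return a list of human-readable problems found with a prefix."""
    problems = []
    if not PREFIX_PATTERN.fullmatch(prefix):
        problems.append("does not match the W3C prefix pattern")
    if prefix.lower() in GENERIC_PREFIXES:
        problems.append("is too generic")
    return problems


class TestPrefixes(unittest.TestCase):
    """Check that every prefix passes all of the rules."""

    # Stand-in data; a real test would load prefixes from the registry
    prefixes = ["chebi", "go", "kegg.pathway"]

    def test_prefixes(self):
        for prefix in self.prefixes:
            with self.subTest(prefix=prefix):
                self.assertEqual([], prefix_problems(prefix))
```

Saved as a module, this can be run with `python -m unittest`. Returning a list of problems rather than a boolean means a failing test reports every rule that a prefix breaks.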