This post gives brief background on the formal definition of the syntax and semantics of compact uniform resource identifiers (CURIEs) from the World Wide Web Consortium (W3C) and investigates how many prefixes in the Bioregistry comply with the standard.

Syntax

The W3C’s CURIE Syntax 1.0 specification is unfortunately hard to follow: understanding it requires navigating several pages and reading terse definitions in a BNF-like notation. Below is a short explanation of the two important parts, followed by a simplification:

safe_curie  :=  '[' curie ']'
curie       :=  [ [ prefix ] ':' ] reference
prefix      :=  NCName
reference   :=  irelative-ref


where NCName is defined on this page as

NCName     ::= (Letter | '_') (NCNameChar)*
NCNameChar ::= Letter | Digit  | '.' | '-' | '_' | CombiningChar | Extender


and irelative-ref is defined here by referencing external RFC 3987. Understanding this part is not strictly necessary for checking Bioregistry prefixes.
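
The curie production above can be sketched in a few lines of Python. This is only an illustration, not part of the W3C specification or the Bioregistry: the function name split_curie is hypothetical, the reference is not validated against irelative-ref, and the sketch assumes the first colon always separates the prefix from the reference.

```python
from typing import Optional, Tuple


def split_curie(curie: str) -> Tuple[Optional[str], str]:
    """Split a CURIE or safe CURIE into (prefix, reference).

    Hypothetical helper: returns None as the prefix when there is no colon,
    and an empty string when the colon is present but the prefix is omitted.
    """
    # A safe CURIE wraps the CURIE in square brackets, e.g. "[go:0008150]".
    if curie.startswith("[") and curie.endswith("]"):
        curie = curie[1:-1]
    # Assumption: the first colon is the prefix separator.
    prefix, sep, reference = curie.partition(":")
    if not sep:
        # No colon at all, so the whole string is the reference.
        return None, curie
    return prefix, reference
```

For example, split_curie("[go:0008150]") and split_curie("go:0008150") both give the prefix "go" and the reference "0008150".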

After unpacking all of these nested references and making the reasonable assumption that the unusual characters covered by CombiningChar and Extender are unlikely to appear in any real prefixes (and restricting Letter and Digit to ASCII), we arrive at the following regular expression for validating prefixes: ^[a-zA-Z_][a-zA-Z0-9._-]*$
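
As a quick sanity check, here is a minimal sketch of the ASCII-only prefix check applied to a handful of strings. The function name is_w3c_prefix and the example strings are made up for illustration. Note that the hyphen is placed last in the character class: written between two other characters (as in ".-_") it would denote a range of characters rather than a literal hyphen. The sketch uses fullmatch so that the whole prefix, not just its beginning, has to conform.

```python
import re

# ASCII-only approximation of NCName: a letter or underscore, followed by
# letters, digits, periods, underscores, or hyphens (hyphen last = literal).
PREFIX_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_w3c_prefix(prefix: str) -> bool:
    """Return True if the entire string fits the simplified NCName pattern."""
    return PREFIX_PATTERN.fullmatch(prefix) is not None


# Illustrative strings only; these are not taken from any registry.
for example in ["abc", "a-b.c_d", "_abc", "1abc", "a:b", "a/b"]:
    print(f"{example!r}: {is_w3c_prefix(example)}")
```

Strings that start with a digit, or that contain characters such as a colon or slash, are rejected.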

Bioregistry Compliance

It’s relatively easy to write a script that checks Bioregistry prefixes against this regular expression.

import re
import bioregistry
from tabulate import tabulate

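# ASCII approximation of the W3C NCName production. The hyphen comes last in
# the character class so it is a literal, and "$" makes the whole prefix match.
W3C_PREFIX = re.compile(r"^[a-zA-Z_][a-zA-Z0-9._-]*$")

# Collect a (markdown link, name) row for every prefix that fails the check.
failed = [
    (
        f"[{resource.prefix}](https://bioregistry.io/{resource.prefix})",
        resource.get_name(),
    )
    for resource in bioregistry.resources()
    if not W3C_PREFIX.match(resource.prefix)
]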