Databases as Ontologies Part 2 - A Case Study with HGNC

This is the second of a two-part post about encoding databases as ontologies. In the first part, I gave background on how I started working on this problem and the software stack I developed along the way. In this post, I explain the philosophy and design behind how I encoded the HGNC (HUGO Gene Nomenclature Committee) database as an ontology using PyOBO.

The previous post used CRediT (Contributor Roles Taxonomy) to demonstrate how to encode a simple resource, one that only contains names and descriptions for its identifiers, as an ontology. The goal of this post is to describe the design decisions taken to ontologize a more complex resource: the HGNC database.

The HGNC assigns names, symbols, and numeric identifiers to human genes. Gene symbols like AKT1 are the primary names referenced in the biomedical literature (and are sometimes misinterpreted by Microsoft Excel). HGNC identifiers are used to unambiguously reference human genes in biocuration efforts like DECIPHER, MedlinePlus, GeneCards, and the Alliance of Genome Resources. They are also the targets for grounding human genes in manual literature curation workflows (like for BEL, BioPAX, SBML) and text mining workflows (like INDRA).

I see the following three major benefits in ontologizing HGNC:

To support the standardized reuse of HGNC terms within semantic web applications and ontologies. While databases can create fields with well-defined semantics where they place either numeric HGNC identifiers or references to HGNC gene symbols, semantic web applications often require the use of (consistent) URIs, and ontologies further require consistent encoding as classes/individuals with the appropriate axioms. For example, the MONDO Disease Ontology annotates genes’ relationships to disease (such as being a disease driver), but it is forced to use workarounds to reference HGNC records, since they are not encoded in an ontology.
To support the standardized distribution of HGNC. HGNC has its own ad hoc distribution formats (JSON, SQL, TSV). Ontologizing HGNC enables standard tooling to consume and reuse the database.
To support the standardized interpretation of HGNC. The content of HGNC does not have formally defined semantics - for example, if you download the JSON dump, how does one know what the mane_select key means, or what virus integration site means in the locus_type field? Ontologizing HGNC enables a single person or small group to do the hard work of understanding the meaning of the fields and values used in the source data, then encode their hard-earned domain knowledge with formal semantics such that everyone can understand it. I’ll use the locus_type and locus_group fields as an example to illustrate this.

I don’t want to bury the lede, so here’s a link to the PyOBO source script for HGNC that implements everything I’m about to describe. Actionable feedback and pull requests are welcome if you have concrete ideas for improvement.

Lexicalization of a Gene

Each record contains up to five lexical components (i.e., name, description, synonyms), which are mapped to the ontology as follows:

Key	Cardinality	Predicate	Synonym Type
`symbol`	one	`rdfs:label`	N/A
`name`	one	`dcterms:description`	N/A
`alias_symbol`	zero or more	`oboInOwl:hasExactSynonym`	`OMO:0003016` (gene symbol synonym)
`alias_name`	zero or more	`oboInOwl:hasExactSynonym`	`OMO:0003008` (previous name)
`previous_symbol`	zero or more	`oboInOwl:hasExactSynonym`	`OMO:0003015` (previous gene symbol)

The dichotomy of gene symbols (short form) and gene names (long form) requires making an important design decision about which to use as the label. I chose the gene symbol because of its ubiquitous use, see discussion here. An alternative lexicalization would be to mark the name as the primary label with rdfs:label and to use the symbol as an exact synonym with the abbreviation synonym type (OMO:0003000). Further, HGNC does not provide dedicated textual descriptions, and in their place, the name is often a reasonable alternative.

Here’s an example record in OBO flat file format to illustrate:

[Term]
id: hgnc:100
name: ASIC1
def: "acid sensing ion channel subunit 1"
synonym: "ACCN2" RELATED OMO:0003015 []
synonym: "BNaC2" RELATED OMO:0003016 []
synonym: "acid sensing (proton gated) ion channel 1" RELATED OMO:0003008 []
synonym: "acid-sensing (proton-gated) ion channel 1" RELATED OMO:0003008 []
synonym: "amiloride-sensitive cation channel 2, neuronal" RELATED OMO:0003008 []
synonym: "hBNaC2" RELATED OMO:0003016 []
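As a rough illustration of the mapping table above, here is a minimal Python sketch that renders a simplified record as an OBO stanza. It is not the PyOBO implementation. The record layout (a plain dictionary keyed by the field names in the table, plus a hypothetical `hgnc_id` key holding the numeric identifier) and the `record_to_stanza` helper are assumptions made for illustration. The synonym scope is a parameter that defaults to the scope shown in the example stanza, and escaping of quotes inside strings is not handled.

```python
# Synonym type CURIEs copied from the mapping table above.
SYNONYM_TYPES = {
    "alias_symbol": "OMO:0003016",     # gene symbol synonym
    "alias_name": "OMO:0003008",       # previous name
    "previous_symbol": "OMO:0003015",  # previous gene symbol
}


def record_to_stanza(record: dict, scope: str = "RELATED") -> str:
    """Render a simplified HGNC-like record as an OBO [Term] stanza."""
    lines = [
        "[Term]",
        f"id: hgnc:{record['hgnc_id']}",
        f"name: {record['symbol']}",  # the symbol is used as the label
        f'def: "{record["name"]}"',   # the name stands in for a description
    ]
    # Collect (text, synonym type) pairs from all three synonym fields
    synonyms = [
        (text, SYNONYM_TYPES[key])
        for key in SYNONYM_TYPES
        for text in record.get(key, [])
    ]
    for text, synonym_type in sorted(synonyms):
        lines.append(f'synonym: "{text}" {scope} {synonym_type} []')
    return "\n".join(lines)


# Hypothetical, simplified record holding the ASIC1 values shown above
example = {
    "hgnc_id": "100",
    "symbol": "ASIC1",
    "name": "acid sensing ion channel subunit 1",
    "alias_symbol": ["BNaC2", "hBNaC2"],
    "alias_name": [
        "acid sensing (proton gated) ion channel 1",
        "acid-sensing (proton-gated) ion channel 1",
        "amiloride-sensitive cation channel 2, neuronal",
    ],
    "previous_symbol": ["ACCN2"],
}
print(record_to_stanza(example))
```

For the ASIC1 values, this prints the same stanza as the example above.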

As an aside: the classes and properties needed to curate an ontology, or ontologize a database, aren’t always available from the start. In many situations, this leads to making ad hoc classes or properties to get the job done - I am not above this. Initially, I had created ad hoc synonym types for gene symbol synonyms and previous gene symbols. Later, I requested two new synonym types in the OBO Metadata Ontology (OMO) to cover these use cases. This is actually a difficult step, because it requires justifying to the community why they are useful. In this case, I think it’s clear, since all model organism databases (MODs) make these kinds of synonyms, and I was able to give a good justification based on the fact that I also made similar ad hoc synonym types for the PyOBO source for the Rat Genome Database (RGD). After doing the design work and making the pull request, I updated both the HGNC and RGD sources in PyOBO to reuse these terms in biopragmatics/pyobo#447.

Classification of a Gene by Locus Type

Each gene is annotated with a locus group and locus type. These correspond to a classification, which translates into an ontology as parent-child relationships between classes, mediated by the rdfs:subClassOf relationship (often abbreviated as is a). Here’s a count summary of all locus groups at the time of writing:

Locus Group	Frequency
protein-coding gene	19,297
pseudogene	14,602
non-coding RNA	9,634
other	1,004

Here’s a count summary of all locus types at the time of writing. It’s clear that the locus type is more granular than the locus group and completely subsumes it. Therefore, I’ll throw away the locus group and only discuss the locus type here. Looking ahead, I’ve included my manual mapping from each ad hoc value used in HGNC to a formal term in the Sequence Ontology (SO).

Locus Type	Frequency	Sequence Ontology
gene with protein product	19,297	SO:0001217
pseudogene	14,361	SO:0000336
RNA, long non-coding	6,296	SO:0002127
RNA, micro	1,912	SO:0001265
RNA, transfer	591	SO:0001272
RNA, small nucleolar	568	SO:0001267
immunoglobulin gene	230	SO:0002122
T cell receptor gene	205	SO:0002133
immunoglobulin pseudogene	203	SO:0002098
readthrough	151	SO:0000697
RNA, cluster	119	SO:0003001 (see PR)
fragile site	116	SO:0002349
endogenous retrovirus	110	SO:0000100
unknown	69	SO:0000704 (mapped to top-level gene)
complex locus constituent	69	SO:0000997
RNA, ribosomal	60	SO:0001637
RNA, small nuclear	51	SO:0001268
region	46	SO:0001411
T cell receptor pseudogene	38	SO:0002099
RNA, misc	29	SO:0001266
virus integration site	8	SO:0003002 (see PR)
RNA, Y	4	SO:0002359
RNA, vault	4	SO:0002358
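The table above amounts to a lookup from HGNC’s ad hoc locus type strings to Sequence Ontology parents. Here is a minimal sketch of that lookup as a plain Python dictionary. The `get_parent` helper and its fallback to the top-level gene term for unrecognized values are assumptions for illustration, not necessarily what the PyOBO source does.

```python
# Locus type -> Sequence Ontology CURIE, copied from the table above.
LOCUS_TYPE_TO_SO = {
    "gene with protein product": "SO:0001217",
    "pseudogene": "SO:0000336",
    "RNA, long non-coding": "SO:0002127",
    "RNA, micro": "SO:0001265",
    "RNA, transfer": "SO:0001272",
    "RNA, small nucleolar": "SO:0001267",
    "immunoglobulin gene": "SO:0002122",
    "T cell receptor gene": "SO:0002133",
    "immunoglobulin pseudogene": "SO:0002098",
    "readthrough": "SO:0000697",
    "RNA, cluster": "SO:0003001",  # placeholder identifier from an open PR
    "fragile site": "SO:0002349",
    "endogenous retrovirus": "SO:0000100",
    "unknown": "SO:0000704",  # mapped to top-level gene
    "complex locus constituent": "SO:0000997",
    "RNA, ribosomal": "SO:0001637",
    "RNA, small nuclear": "SO:0001268",
    "region": "SO:0001411",
    "T cell receptor pseudogene": "SO:0002099",
    "RNA, misc": "SO:0001266",
    "virus integration site": "SO:0003002",  # placeholder identifier from an open PR
    "RNA, Y": "SO:0002359",
    "RNA, vault": "SO:0002358",
}

TOP_LEVEL_GENE = "SO:0000704"


def get_parent(locus_type: str) -> str:
    """Return the SO parent for a locus type.

    Falling back to the top-level gene term for unrecognized values is an
    assumption of this sketch.
    """
    return LOCUS_TYPE_TO_SO.get(locus_type, TOP_LEVEL_GENE)


print(f"is_a: {get_parent('RNA, long non-coding')}")
```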

I created this issue on the PyOBO tracker when I started preparing this mapping, since there were a few already available in the info box for the locus type on a given gene page on the HGNC website. However, several were incorrect and most were missing. Therefore, I had to manually map several to terms in the Sequence Ontology. Many mappings were easy, but several required discussion with the HGNC and Sequence Ontology teams (as you can see on the issue). HGNC was proactive and incorporated my mappings into their front-end.

There were several cases where there was no appropriate term in the Sequence Ontology. For some, the maintainers created new terms. Unfortunately, for some, the maintainers were unresponsive, so I had to make my own PRs to the repository which probably won’t get accepted in a timely fashion. However, I was able to use the placeholder identifiers in the PyOBO source module even though they haven’t yet been merged and released.

As an aside: annotating locus types is not just a human gene problem, but one that all model organism databases (MODs) need to work on. I already have a thread for taking a similar approach for FlyBase, but it would be great to do the same for MGI (mouse), RGD (rat), and other MODs for which PyOBO encodes a source. In general, it would be great to see the Alliance of Genome Resources (AGR) push their members towards adopting more shared semantics in the way they curate, especially for locus types.

Chromosomal Locations

The location field connects a gene to its chromosomal location by encoding the location as a string. Initially, I had created an ad hoc relation to encode this string field (obo:hgnc#has_location). In biopragmatics/pyobo#451, I adapted this to map the chromosomal location strings to classes in the Chromosome Ontology (CHR) and use a combination of well-established relations, based on the apparent values for the chromosomes. Note that this is a first attempt at ontologization, and the relations might need updating.

Single Point Annotations

RO:0001025 (located in) is used for single point annotations, such as in hgnc:10080.

[Term]
id: hgnc:10080
name: RNPS1
is_a: SO:0001217 ! protein_coding_gene
relationship: RO:0001025 CHR:9606-chr16p13.3 ! located in 16p13.3 (Human)

Pairs of Points

Multiple RO:0001025 (located in) relationships are used for pairs of point annotations, e.g., when written like Xq28 and Yq12, like in hgnc:38513:

[Term]
id: hgnc:38513
name: WASIR1
is_a: SO:0002127 ! lncRNA_gene
relationship: RO:0001025 CHR:9606-chrXq28 ! located in Xq28 (Human)
relationship: RO:0001025 CHR:9606-chrYq12 ! located in Yq12 (Human)

There’s also a single example of a location containing an “or” in hgnc:3829, which looks like 10q23.3 or 10q24.2. There are more sophisticated ways of representing “or” logic in OWL, but they are not directly serializable in the OBO flat file format.

Ranges

RO:0002223 (starts) and RO:0002229 (ends) are used for ranges of chromosomal locations, e.g., when written like 8q11.23-q12.1, like in hgnc:10263:

[Term]
id: hgnc:10263
name: RP1
is_a: SO:0001217 ! protein_coding_gene
relationship: RO:0002223 CHR:9606-chr8q11.23 ! starts 8q11.23 (Human)
relationship: RO:0002229 CHR:9606-chr8q12.1 ! ends 8q12.1 (Human)

Special case: Mitochondria

Genes that are mapped to the mitochondrial chromosome get mapped to the Gene Ontology (GO) term GO:0000262 instead of a Chromosome Ontology term, like in hgnc:50279:

[Term]
id: hgnc:50279
name: MT-LIPCAR
is_a: SO:0002127 ! lncRNA_gene
relationship: RO:0001025 GO:0000262 ! located in mitochondrial chromosome

Qualified Annotations

Some locations end with a qualifier such as “not on reference assembly”, “unplaced”, or “alternate reference locus”. These qualifiers are encoded as comment annotations on the corresponding relationship axiom.

[Term]
id: hgnc:10082
name: RNR1
is_a: SO:0003001
relationship: RO:0001025 CHR:9606-chr13p12 {rdfs:comment="not on reference assembly -named gene is not annotated on the current version of the Genome Reference Consortium human reference assembly; may have been annotated on previous assembly versions or on a non-reference human assembly"} ! located in 13p12 (Human)
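To tie the cases above together (single points, pairs, ranges, mitochondria, and qualifiers), here is a minimal Python sketch of a location parser. It is not the code from the PyOBO source. The `parse_location` and `band_to_curie` helpers are hypothetical, the regular expression only checks the syntax of a band rather than its existence in CHR, and the literal `mitochondria` and the qualifier appearing as a trailing suffix are assumptions about how the values are spelled.

```python
import re

LOCATED_IN, STARTS, ENDS = "RO:0001025", "RO:0002223", "RO:0002229"
MITOCHONDRIAL_CHROMOSOME = "GO:0000262"
QUALIFIERS = ("not on reference assembly", "unplaced", "alternate reference locus")
# Syntax check only: chromosome, arm, band, optional sub-band (e.g., 16p13.3)
BAND = re.compile(r"^(\d{1,2}|X|Y)[pq]\d+(\.\d+)?$")


def band_to_curie(band: str) -> str:
    """Turn a band like 16p13.3 into a CHR CURIE, or raise if it looks malformed."""
    if not BAND.match(band):
        raise ValueError(f"unhandled location: {band!r}")
    return f"CHR:9606-chr{band}"


def parse_location(location: str):
    """Return (relationships, qualifier) for a location string."""
    qualifier = None
    for candidate in QUALIFIERS:
        if location.endswith(candidate):
            qualifier = candidate
            location = location[: -len(candidate)].strip()
            break
    if not location:  # nothing left after removing the qualifier
        return [], qualifier
    if location == "mitochondria":  # assumed spelling of the mitochondrial value
        return [(LOCATED_IN, MITOCHONDRIAL_CHROMOSOME)], qualifier
    if " and " in location:  # pair of points
        parts = location.split(" and ")
        return [(LOCATED_IN, band_to_curie(part)) for part in parts], qualifier
    if "-" in location:  # range; the end may omit the chromosome (8q11.23-q12.1)
        start, end = location.split("-", 1)
        match = BAND.match(start)
        if match and end.startswith(("p", "q")):
            end = match.group(1) + end
        return [(STARTS, band_to_curie(start)), (ENDS, band_to_curie(end))], qualifier
    return [(LOCATED_IN, band_to_curie(location))], qualifier


for text in ["16p13.3", "Xq28 and Yq12", "8q11.23-q12.1", "13p12 not on reference assembly"]:
    print(text, "->", parse_location(text))
```

Strings that fit none of these shapes, such as 17qter, raise a `ValueError` in this sketch, so they can be collected and reported rather than silently dropped.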

Unprocessable Locations

After processing HGNC, there were several locations that could not be mapped to CHR. I made an issue on the Chromosome Ontology’s issue tracker noting all the locations that were not mappable. However, several of these could be errors on the side of HGNC as well, so each one requires manual checking.

unhandled location	count	appears in
10q23.3 or 10q24.2	1	hgnc:3829
Yp13.3	1	hgnc:6012
17qter	1	hgnc:8841
13cen, GRCh38 novel patch	1	hgnc:15732
Xp22.22	1	hgnc:10199
1p36.13q41	1	hgnc:36026
12q22.32	1	hgnc:58534
1q13.1	1	hgnc:32558
3q25.22	1	hgnc:32563
7p36.1	1	hgnc:34871
11p11.2	1	hgnc:58650
22pter	1	hgnc:1838
18p22.3	1	hgnc:58557
Xp11.32	1	hgnc:37114
17q12b	1	hgnc:49316

Here are a few observations I had on this:

the ter suffix (as in 17qter and 22pter) conventionally denotes the terminal end of a chromosome arm in cytogenetic nomenclature
13cen, GRCh38 novel patch is a weird outlier
1p36.13q41 might be a typo
the “or” entry probably should be processed and not actually get a term, but keeping here for completeness
some of them might have typos between “p” and “q”

In general, these kinds of unmapped items are not blockers towards ontologizing a resource. It’s generally valuable to include logging in a PyOBO source when there is content that is unhandled, since this can be valuable feedback for the upstream resources themselves.
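As a minimal sketch of this kind of logging, the following Python snippet counts and logs location strings that are absent from a set of known locations. The `collect_unhandled` helper and the tiny `known` set are hypothetical. A real source would check against the terms actually available in CHR.

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)


def collect_unhandled(locations: dict, known: set) -> Counter:
    """Count location strings that are not in the set of known locations.

    Each unhandled value is logged together with the gene it appears in,
    which makes the log directly usable as feedback for the upstream resource.
    """
    unhandled = Counter()
    for gene_id, location in locations.items():
        if location not in known:
            logger.warning("[%s] unhandled location: %s", gene_id, location)
            unhandled[location] += 1
    return unhandled


# Gene/location pairs taken from the examples and the table above
locations = {
    "hgnc:10080": "16p13.3",
    "hgnc:8841": "17qter",
    "hgnc:1838": "22pter",
}
known = {"16p13.3"}  # hypothetical stand-in for the locations covered by CHR
for location, count in collect_unhandled(locations, known).most_common():
    print(location, count, sep="\t")
```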

Membership in Gene Groups

HGNC has a secondary categorization of genes into gene groups (formerly called gene families). Gene groups serve a variety of purposes and themselves have a hierarchical classification. However, based on the contents of gene groups, I don’t think that it’s appropriate to use rdfs:subClassOf for relations between genes and gene groups. Instead, I have opted to use RO:0002350 (member of), which is defined as a mereological relation (i.e., a part-of relation) between an item and a collection.

Genes and Enzymes

HGNC annotates genes with Enzyme Commission (EC) codes. There’s spirited discussion in the ontology world about how we should ontologize enzymes. For example, the Gene Ontology (GO) models them as catalytic activities within GO’s molecular function branch.

We typically classify proteins based on their catalytic activities, which means that to model the relationship between a gene and an enzyme, we need to use a property chain connecting the gene to the protein it encodes (RO:0002205), and the catalytic activity that the protein enables (RO:0002327).

graph LR
    gene["AKT1 (HGNC:391)"] -- " has gene product (RO:0002205) " --> protein["RAC-alpha serine/threonine-protein kinase (uniprot:P31749)"] -- " enables (RO:0002327) " --> activity["non-specific serine/threonine protein kinase (EC:2.7.11.1)"]
    gene -- has gene product that enables --> activity
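To make the idea of a property chain concrete, here is a minimal Python sketch that infers the shortcut edge from the two edges in the graph above. The `apply_property_chain` helper is hypothetical, and the shortcut is identified only by its label, since no RO identifier for it is assumed here.

```python
from collections import defaultdict

HAS_GENE_PRODUCT = "RO:0002205"
ENABLES = "RO:0002327"
SHORTCUT = "has gene product that enables"  # label only; no identifier assumed


def apply_property_chain(triples, first, second, shortcut):
    """Infer (x, shortcut, z) whenever (x, first, y) and (y, second, z) both hold."""
    # Index the objects of the second property by their subject
    second_index = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == second:
            second_index[subject].append(obj)
    inferred = []
    for subject, predicate, obj in triples:
        if predicate == first:
            for target in second_index.get(obj, []):
                inferred.append((subject, shortcut, target))
    return inferred


# The two edges from the graph above
triples = [
    ("HGNC:391", HAS_GENE_PRODUCT, "uniprot:P31749"),
    ("uniprot:P31749", ENABLES, "EC:2.7.11.1"),
]
print(apply_property_chain(triples, HAS_GENE_PRODUCT, ENABLES, SHORTCUT))
```

In an OWL ontology, a reasoner would derive the same edge from a property chain axiom. The sketch only shows what that inference amounts to on explicit triples.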

This property chain doesn’t yet exist in RO, so I made a new term request and associated pull request to proactively mint a new identifier for use in PyOBO. I have a pull request to PyOBO (biopragmatics/pyobo#455) waiting to reflect this change depending on feedback. Otherwise, the old ontologization uses a property chain of gene product of and member of, to consider an EC class as a more general classification class.

I’m still undecided on what the best modeling choice is. I am keen to fill out the following chart in a more satisfying way that captures a bit more nuance in the fact that enzymes are a classification that implies the ability to carry out an activity. However, when going against the historical choices of a resource as large as GO, I am punching outside my ontological weight class 🤷.

Remaining Logical Axioms and Semantic Mappings

As I come to a close, the only remaining content to ontologize is the many database cross-references. A first and simple approach is to use oboInOwl:hasDbXref, but this is a missed opportunity to encode domain knowledge about each of the resources. The following chart gives an overview of the remaining logical axioms and semantic mapping types:

graph LR
    genegroup[Gene Group<br>HGNC] -- " member of<br>(RO:0002350) " --- gene[Gene]
    geneclass[Gene Class<br>SO] -- is a --- gene
    gene -- " transcribed to (RO:0002511) " --> rna[RNA<br>RNA Central, miRBase, snoRNABase]
    gene -- " has gene product<br>(RO:0002205) " --> protein[Protein<br>UniProt]
    gene -- " has exact match<br>(skos:exactMatch) " --> external1[External<br>NCBIGene, Ensembl, Orphanet, OMIM, RefSeq,...]
    gene -- " has database cross-reference<br>(oboInOwl:hasDbXref) " --> external2[External<br>CCDS,...]
    gene -- " is orthologous to<br>(RO:HOM0000017) " --> orthology[Orthologous Gene<br>MGI, RGD]
    gene -- " has gene product that enables<br>(RO:0002205 + RO:0002327) " --> enzyme[Enzyme<br>EC]
    gene -- " located in<br>(RO:0001025) " --> chr[Chromosome Region<br>CHR]

To briefly summarize this diagram:

References to other model organism databases are modeled as orthology relationships
References to UniProt (the protein database) are modeled as has gene product (i.e., a broader relationship than translation)
References to RNA databases are modeled as transcription
References to databases that are nomenclature resources for genes are modeled as exact matches
References to databases that can have potentially multiple experimental measurements for a given gene are modeled with database cross-references
I already mentioned in more detail above how enzymes, gene groups, and locations are annotated.
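The summary above can be sketched as a lookup from the prefix of a cross-reference to the predicate used for it. This is a minimal Python illustration and not the PyOBO implementation. The lowercase prefix spellings and the `xref_to_triple` helper are assumptions, and falling back to oboInOwl:hasDbXref for unlisted prefixes follows the first and simple approach mentioned earlier.

```python
ORTHOLOGOUS_TO = "RO:HOM0000017"
HAS_GENE_PRODUCT = "RO:0002205"
TRANSCRIBED_TO = "RO:0002511"
EXACT_MATCH = "skos:exactMatch"
DB_XREF = "oboInOwl:hasDbXref"

# Prefix -> predicate, following the diagram; prefix spellings are assumed
PREDICATES = {
    # other model organism databases
    "mgi": ORTHOLOGOUS_TO,
    "rgd": ORTHOLOGOUS_TO,
    # protein database
    "uniprot": HAS_GENE_PRODUCT,
    # RNA databases
    "rnacentral": TRANSCRIBED_TO,
    "mirbase": TRANSCRIBED_TO,
    "snornabase": TRANSCRIBED_TO,
    # nomenclature resources for genes
    "ncbigene": EXACT_MATCH,
    "ensembl": EXACT_MATCH,
    "orphanet": EXACT_MATCH,
    "omim": EXACT_MATCH,
    "refseq": EXACT_MATCH,
    # resources with potentially multiple records per gene
    "ccds": DB_XREF,
}


def xref_to_triple(gene: str, xref: str):
    """Pick a predicate for a cross-reference based on its prefix."""
    prefix = xref.partition(":")[0]
    predicate = PREDICATES.get(prefix.lower(), DB_XREF)
    return gene, predicate, xref


print(xref_to_triple("hgnc:391", "uniprot:P31749"))
```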

The implementation of this logic can be found in the PyOBO source module for HGNC.

What Was Skipped

There are many extra fields in HGNC that I throw away which effectively duplicate the HGNC identifier or gene symbol, such as agr (which reuses the HGNC identifier) and lncrnadb (which reuses the HGNC gene symbol).

Rather than representing the fact that the external database provides information about this term, a better solution is to add additional providers to the Bioregistry, such that a given HGNC identifier can be used to create a link to the database itself. This isn’t a perfect solution, because some databases only cover a subset of genes. There’s more discussion about this in this issue on the OMO issue tracker, specifically in this comment.

While I don’t offer an isomorphic (i.e., covers everything that’s there) solution for ontologization of this part of the content, I do believe that the rest of my choices address the three big benefits I mentioned at the start.

Wow, this is my first ever double blog post. It took me a full day to write it, not to mention the years of work that went into the software ecosystem itself and the time put into improving the HGNC PyOBO source module in preparation for writing it. I am very happy to be reporting on this, and to see how it will positively impact the community.

If you made it this far and are interested in collaborating to make your own resource accessible through PyOBO, please get in touch using my contact information at the bottom of the page or by opening an issue on the PyOBO issue tracker.

I’m also open to collaboration through grant writing or contract/consulting work via my current employer (RWTH Aachen University) for extending and applying PyOBO and the wider Biopragmatics Stack in new domains. This has been previously successful in the DARPA ASKEM and DTRA RAPTER projects, and is now a key contribution to several of the DFG-funded German NFDI consortia.