I talked to Guy Cochrane and Chuck Cook from the Global Biodata Coalition (GBC). They chaired a session on the sustainability of biocurated resources, with a specific focus on the GBC’s Global Core Biodata Resources (GCBR) initiative. I felt like my talk from last year’s Biocuration conference on the Open Code, Open Data, Open Infrastructure (O3) roadmap (preprint) would have fit right in here. I am very keen to have their perspectives, since the GBC has already worked on evaluating resources and is now working towards funding them. Since they have not yet worked on practical recommendations for supporting sustainability, I eagerly volunteered to join their work in some capacity to help advise on this.
GBC also published a workflow for evaluating the landscape of biological databases (press release / publication / code). When possible, this workflow aligned on FAIRsharing, but FAIRsharing is a limited resource that only has partial mappings to relevant related resources like re3data, BARTOC, etc. I therefore suggested using the Bioregistry as a mapping hub to enrich the output of this workflow, which will definitely be run again on a periodic basis.
Lynn Schriml presented recent updates on the Disease Ontology, which prompted a relevant question from Harpreet Singh, Chief Data Officer at the Indian Council of Medical Research (ICMR), who himself works with clinical data and has wondered how best to annotate it - using MeSH, SNOMED, ICD, or other disease resources. I had an interesting discussion with him following the talk, which gave big motivation for the talk I was about to give on the large-scale assembly of and reasoning over semantic mappings. I was very excited, since I love to add (last-minute) shout-outs into my conference talks that motivate parts of the work based on questions or discussions from earlier in the conference.
There were a series of talks that motivated further discussions about mappings. One of the most interesting was the talk from Shivani Sharma, a curator at the Indian Biological Data Centre (IBDC) and one of the local organizers. She works on the Indian Metabolome Data Archive. Many of the lines of work at the IBDC have practical applications towards agriculture and integrate medium- and large-scale experimental work, biocuration, and downstream analysis. Often, these applications are oriented towards improving crop yields and avoiding disease. Shivani showed a slide where they considered a large number of metabolomics nomenclature resources to use for annotating their data. However, they were not familiar with methods for incorporating multiple nomenclature resources, meaning that their curators were running into issues where their chosen metabolomics database did not cover chemicals they needed to annotate. This often led to them having to create their own ad hoc annotations, which in turn creates issues for data integration. I am looking forward to catching up with them again, incorporating new metabolomics resources into PyOBO, ingesting mappings into SeMRA, and filling in the gaps using Biomappings to support their curators.
Scott V. Nguyen from the American Type Culture Collection (ATCC) also approached me about this work, since he’s currently trying to curate mappings between cell lines in their resource and other public resources. It was lucky that one of the examples from my talk was specifically about the cell line scenario, which I hope he can ingest to reduce his curation workload. Rachel Lyne also presented on COSMIC, a cancer cell line resource that also creates its own accession numbers and could benefit from this work, but I didn’t get a chance to talk with her about it yet.
I also met Yasunori Yamamoto, who works on TogoID, a secondary database of semantic mappings that covers select domains within biomedicine. We discussed how they could make use of the Simple Standard for Sharing Ontology Mappings (SSSOM) to ingest more mappings from different resources, especially from Biomappings or potentially from the outputs of SeMRA (which I presented on).
Matt Jeffreys presented on the annotations database in European PubMed Central, which allows for tagging articles, sentences, or tokens in articles with annotations. They already showed how this applies to named entity recognition (NER) and MeSH term annotations, but we discussed how SeMRA and comprehensive semantic mapping databases could help unify annotations from overlapping vocabularies, e.g., if someone deposited Disease Ontology (DO) NER annotations, which overlap with MeSH terms in the disease (C) and psychiatric disorders (F) branches.
I discussed with Raja Mazumdar and Jeet Vora from George Washington University, who both work on GlyGen and are plugged into the NIH’s Common Fund Data Ecosystem (CFDE), how they can continue to use the Bioregistry to standardize the annotations in their resources. Jeet got in touch earlier this year and helped update the records in the Bioregistry related to GlyGen. Raja’s talk also motivated two new prefix additions to the Bioregistry, for BioCompute Objects and for OncoMX data objects. Further, Raja is very interested in improving his data using the Bioregistry; since GlyGen already uses a Python script to validate its JSON and TSV components, it will be easy to incorporate the Bioregistry Python package’s validation functions.
Earlier this winter, I presented to the American National Institutes of Health (NIH) BISTI group about different avenues through which they could use the Bioregistry to create more value for the NIH and its grantees. One of those discussions was about improving GenBank’s internal database catalog. By chance, I talked about this with Ilene Karsch Mizrachi, a program head at the NIH. She was attending the conference and made big contributions to the discussions about India’s relationship to the International Nucleotide Sequence Database Collaboration (INSDC). It turns out she was the one who made/contributed to this GenBank table many years ago. We will try to follow up by enriching this table with information from the Bioregistry.
At last year’s Biocuration conference, Chris Hunter presented on GigaDB, and we had some initial discussions about using the Bioregistry (or other related parts of the Biopragmatics Stack) to make standardized annotations on datasets deposited in their database, such as cell line annotations. We picked that conversation back up, and it seems that the GigaDB developers are working with PHP. Since we got CZI funding to make the Bioregistry available in other languages, making a wrapper from Rust to PHP (within the curies.rs framework) could be a good way to support them.
There was an entire session on the final day of the conference on structural bioinformatics, which included several presentations from the American and European loci of the Protein Data Bank (PDB). The first discussion was with Marcus Bage, who is currently trying to annotate protein modifications. We discussed the implications of the vast number of resources that partially cover this domain in different senses, including GO, MOD, SBO, MOP / PSI-MI, and UniProt’s internal vocabulary. A long time ago, I mapped these together in PyBEL, but that was only a partial solution, too!
The second discussion was with Brinda Vallat about the upcoming change to PDB accession numbers. It turns out that the 4-character code is estimated to fill up in 2029, so it’s time for the PDB to make a change. Unfortunately, their solution is to switch to local unique identifiers that look like pdb_000002GC4, which is problematic for two main reasons. First, it’s not backwards compatible with existing IDs. Second, it introduces a banana (i.e., a redundant copy of the name/acronym of the database in the local unique identifier). The reasoning behind adding in the banana was to make it easier to find references in papers. I can understand this, since we don’t yet have general solutions for referencing concepts across different publishers (though we solved this in Manubot by integrating the Bioregistry). However, this increases confusion for consumers. I suggested they simply extend the existing IDs to allow more than 4 characters, and suggest people reference their entities within papers with CURIEs like PDB:2GC4, which solves both issues simultaneously. Similarly, I talked to Ibrahim Roshan Kunnakkattu about creating more careful identifier recommendations for the PDB’s Chemical Component Dictionary as well as using some of the automated mapping tools I presented for filling out references to ChEBI, ChEMBL, PubChem, and more.
I also had the unique pleasure to spend time in person with Tiago Lubiana, who is highly aligned on many of my interests in data standardization, semantic web, and open science. He has been a helpful contributor in the Bioregistry, Wikidata, and the OBO Foundry. Writing up some of the things we discussed would take a whole blog post, so instead, here’s a nice picture we got together.
Overall, like every Biocuration conference, I was very happy to find people interested in my work, and more importantly, interested in the idea of improving their own data standardization! I also had lots of other interesting discussions that don’t require any follow-up. I am also planning on writing a post that gives a more high-level summary of the different parts of the conference itself, not just focusing on my work.
As a demonstration, we will build a data model and API that serves information about scholars.
We’ll use Open Researcher and Contributor (ORCID) identifiers as primary keys, include the researcher’s name, and start with a single cross-reference, e.g., to the author’s DBLP identifier. We’ll encode this data model using Pydantic in the Python programming language as follows:
from pydantic import BaseModel, Field

class ScholarV1(BaseModel):
    """A model representing a researcher, who might have several IDs on different services."""

    orcid: str = Field(...)
    name: str = Field(...)
    dblp: str | None = Field(None)

print(ScholarV1.schema_json(indent=2))
There are several places for improvement here: the fields have no human-readable titles (e.g., showing ORCID instead of Orcid and DBLP instead of Dblp), no descriptions, no examples, and no regular expression patterns. All of these are possible to annotate into Pydantic’s Field object, but it requires lots of effort and takes lots of space. Even worse, this might have to be partially duplicated if multiple models share the same fields. In the example below, I annotated ORCID but will skip the others for brevity.
from pydantic import BaseModel, Field

class ScholarV2(BaseModel):
    """A model representing a researcher, who might have several IDs on different services."""

    orcid: str = Field(
        ...,
        title="ORCID",
        description="A stable, public identifier for a researcher from https://orcid.org",
        pattern=r"^\d{4}-\d{4}-\d{4}-\d{3}(\d|X)$",
        example="0000-0003-4423-4370",
    )
    name: str = Field(...)
    dblp: str | None = Field(None)

print(ScholarV2.schema_json(indent=2))
However, this was a lot of work. It would be nice if there were some database of all the semantic spaces in the semantic web and natural sciences that contained the name, description, regular expression pattern, and examples. Then, we could draw from this database to automatically populate our fields.
The good news is that such a database exists - it’s called the Bioregistry. Each semantic space (e.g., ORCID, DBLP) gets a prefix which is usually an acronym for the name of the resource that serves as the primary key for the semantic space. These prefixes are also useful in making references to entities in the semantic space more FAIR (findable, accessible, interoperable, reusable) using the compact URI (CURIE) syntax, though this isn’t the goal of this demo.
I’ve mocked some Python code that bridges Pydantic and the Bioregistry in this repository (https://github.com/cthoyt/semantic-pydantic). I’m calling it Semantic Pydantic because it lets us annotate our data models with external metadata (and because it rhymes).
Here’s the same model as before, but now using a SemanticField that extends Pydantic’s Field. It has a special keyword prefix that lets you give a Bioregistry prefix; then it is smart enough to fill out all the fields on its own. I also took the liberty of adding several more semantic spaces that identify scholars, like Web of Science (wos), Scopus, and even GitHub.
from pydantic import BaseModel, Field
from semantic_pydantic import SemanticField

class ScholarV3(BaseModel):
    """A model representing a researcher, who might have several IDs on different services."""

    orcid: str = SemanticField(..., prefix="orcid")
    name: str = Field(..., example="Charles Tapley Hoyt")
    wos: str | None = SemanticField(default=None, prefix="wos.researcher")
    dblp: str | None = SemanticField(default=None, prefix="dblp.author")
    github: str | None = SemanticField(default=None, prefix="github")
    scopus: str | None = SemanticField(default=None, prefix="scopus")
    semion: str | None = SemanticField(default=None, prefix="semion")
    publons: str | None = SemanticField(default=None, prefix="publons.researcher")
    authorea: str | None = SemanticField(default=None, prefix="authorea.author")

print(ScholarV3.schema_json(indent=2))
Finally, we can see a very detailed JSON schema, which includes everything from before plus additional context from the Bioregistry, including the prefix itself as well as mappings from the Bioregistry prefix to external registries like BARTOC, FAIRsharing, and others. Together, these make the data model more FAIR and support interoperability, since now it is possible to directly match the fields annotated with Bioregistry prefixes in this model to fields annotated with the same prefix in other models, even external to the project.
For example, the generated description of the orcid field reads: “This field corresponds to a local unique identifier from Open Researcher and Contributor. The semantics of this field are derived from the orcid entry in the Bioregistry: a registry of semantic web and linked open data compact URI (CURIE) prefixes and URI prefixes.” (In the JSON schema, the references to ORCID and to the Bioregistry entry are rendered as hyperlinks.)
Let’s take the next step to a web application using FastAPI. The goal of this web application will be to look up the information for a scholar in Wikidata based on their ORCID. You don’t really have to understand how the query works other than that it takes in an ORCID string and gives back an instance of the Scholar model we’ve been working on above.
The app uses annotations for the query parameters, path parameters, and other inputs to routes using extensions of Pydantic Fields. So, similar to before, we can extend their custom fields to be semantic in Semantic Pydantic.
from fastapi import FastAPI
from semantic_pydantic import SemanticPath

app = FastAPI(title="Semantic Pydantic Demo")
Scholar = ...  # defined before

@app.get("/api/orcid/{orcid}", response_model=Scholar)
def get_scholar_from_orcid(orcid: str = SemanticPath(prefix="orcid")):
    """Get xrefs for a researcher in Wikidata, given ORCID identifier."""
    ...  # full implementation in https://github.com/cthoyt/semantic-pydantic
    return Scholar(...)
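Once the app is running locally, the route can be exercised like any other HTTP endpoint. Below is a minimal sketch using the requests library; the port and the exact response fields depend on how you launch the demo, and the ORCID used here is just the example from above.

```python
import requests

# Assumes the demo app is being served locally on port 8000;
# adjust the base URL to match however you launch it.
response = requests.get("http://localhost:8000/api/orcid/0000-0003-4423-4370")
response.raise_for_status()

scholar = response.json()
# The response follows the Scholar model, so keys like "orcid" and "name" should be present.
print(scholar.get("name"), scholar.get("dblp"))
```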
The real power is how this translates to the API, and more importantly, the automatically generated API documentation. First, the SemanticPath object, which we used in place of a normal fastapi.Path, also knows it is for ORCID identifiers. Second, the response model points to the Scholar class from before, which already knows about its semantics. Below, we see this in a screenshot of the OpenAPI (formerly known as Swagger) user interface automatically generated by FastAPI.
There are two big things to note here:
Now, we have an API that is also annotated with detailed semantics. If you take a look at the OpenAPI JSON file, it has similar references to Bioregistry prefixes for the routes themselves, and directly reuses the JSON schema for the response model.
So far, this is a proof-of-concept that lives in an ad hoc repository. I haven’t decided yet whether this code is just a neat demo, whether it should live inside the Bioregistry Python package, or whether it should be in a stand-alone package that might be extensible even further. There are a few other things to think about in the meantime:
The first version of this idea just throws the Bioregistry data into the JSON schema. It would be interesting to develop this infrastructure further, such as keeping a catalog of all APIs that consume or produce data models containing semantic fields. A few places this would be great include services that exchange chemical structure identifiers such as InChIKeys (inchikey). There are also so many more examples - please let me know some services you think would benefit in the comments on my blog post. Looking forward, it’s also a question how to automatically discover such semantic APIs (e.g., by cleverly searching GitHub) or whether it would have to be a manually curated catalog.
A key feature of the Bioregistry is that it provides a way to take a local unique identifier for an entity in a given semantic space and make a URL that points to a web page describing the entity. For example, if you have an ORCID identifier, you can make a URL for the ORCID page following the format https://orcid.org/<put ID here>. It would be very cool to extend Semantic Pydantic to add some properties that auto-generate URLs, like in the following:
from pydantic import BaseModel, Field
from semantic_pydantic import SemanticField

class Scholar(BaseModel):
    orcid: str = SemanticField(..., prefix="orcid")
    name: str = Field(...)

charlie = Scholar(orcid="0000-0003-4423-4370", name="Charles Tapley Hoyt")
assert charlie.orcid_url == 'https://orcid.org/0000-0003-4423-4370'
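This isn’t implemented yet, but the behavior in the assert above can already be approximated by hand with an ordinary property. A real implementation would presumably generate these properties automatically from each SemanticField’s prefix and look up the URI format string in the Bioregistry rather than hard-coding it; the sketch below is just an illustration of the idea.

```python
from pydantic import BaseModel, Field


class Scholar(BaseModel):
    orcid: str = Field(...)
    name: str = Field(...)

    @property
    def orcid_url(self) -> str:
        # Hard-coded here for illustration; a real implementation would look up
        # the URI format string for the "orcid" prefix in the Bioregistry.
        return "https://orcid.org/" + self.orcid


charlie = Scholar(orcid="0000-0003-4423-4370", name="Charles Tapley Hoyt")
assert charlie.orcid_url == "https://orcid.org/0000-0003-4423-4370"
```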
The demo can be run by cloning the repository, installing its requirements, and running the self-contained app.py:
git clone https://github.com/cthoyt/semantic-pydantic
cd semantic-pydantic
python -m pip install -r requirements.txt
python app.py
This is my first science post of 2024! I’m very happy that the Bioregistry is currently supported by the Chan Zuckerberg Initiative (CZI) under award 2023-329850.
Cosmere moments I really enjoyed (spoilers):
Other non-Cosmere highlights (spoilers):
Disappointments:
My goal in 2024 is to read more books from different genres, especially ones I’ve never touched before.
The first big issue with the UMLS is its licensing. Here’s an excerpt from the How to License and Access the Unified Medical Language System® (UMLS®) Data page accessed on August 28th, 2023:
- Please sign up for a new UMLS Terminology Services (UTS) account with your preferred identity provider at the UTS homepage.
- Complete and submit the license request form. NLM will send the license approval e-mail within 5 business days after reviewing your authenticated license request.
- You will sign in using identity provider credentials to download files or access web interfaces that require UTS authentication such as the UTS, VSAC, SNOMED CT, or RxNorm.
There are a few big hurdles here: you need to create a UTS account with an identity provider, wait up to five business days for NLM to approve your license request, and then authenticate every time you want to download files.
I want to 1) convert UMLS into an OWL ontology and 2) extract and encode its semantic mappings to external vocabularies like the Medical Subject Headings (MeSH) with Simple Standard for Sharing Ontology Mappings (SSSOM). Given all of these hurdles, it’s probably the case that I am not allowed to redistribute these artifacts.
Altogether, I consider this a big bummer. The United States National Library of Medicine (NLM) maintains several highly influential resources, but I have found in many instances that they lack a community perspective. Regardless, even as an expat, I pay American taxes, and it makes me upset that the government funds the development and maintenance of resources that I can’t easily use.
Despite all of this rigamarole, there’s a way to work around these issues by automating the interaction with the UMLS Terminology Services (UTS), therefore enabling automated download of UMLS and several related resources (e.g., RxNorm, SNOMED CT).
This has been implemented in the open source umls_downloader Python package. It can be installed with the following one-liner in your shell:
$ pip install umls_downloader
Below, I’ll walk you through using it.
Throughout, keep in mind that full documentation for the umls_downloader package is available at umls-downloader.readthedocs.io, which describes the other functionality and other data that can be downloaded.
UMLS has three different distributions that are described here. The following Python code downloads the most simple and straightforward file, MRCONSO.RRF, as a zip archive:
from umls_downloader import download_umls
path = download_umls(version="2023AA", api_key="<your API key>")
This code is smart and does not need to download the file more than once. It uses pystow to choose a stable path (~/.data/bio/umls) relative to the current user’s home directory. Inside this directory, it also uses the version of the data to create a subdirectory. Finally, this function returns the path to the data, such that no file paths ever need to be hard-coded.
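For intuition, here’s roughly how pystow builds such paths - a small sketch assuming the default PYSTOW_HOME of ~/.data. The exact file name inside the versioned directory is chosen by umls_downloader, so treat this as illustrative.

```python
import pystow

# pystow.join builds (and creates) a directory under ~/.data by default,
# e.g. ~/.data/bio/umls/2023AA for a versioned subdirectory.
directory = pystow.join("bio", "umls", "2023AA")
print(directory)
```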
Warning This still requires an API key, which requires creating an account, agreeing to UMLS’s terms and conditions, etc. This can be done here: https://uts.nlm.nih.gov/uts/edit-profile.
There are two ways to automatically set the API key, so you don’t have to worry about getting it and passing it around in your Python code:

- Set the UMLS_API_KEY variable in the environment. This can be done in your interactive session or in the configuration for your shell, such as in a .bashrc file for the Bourne Again Shell (bash).
- Create a configuration file at ~/.config/umls.ini and set an api_key key in the [umls] section. Mine looks like:
[umls]
api_key=1234567890abcdefghijklmno
Now you can omit the api_key keyword like in the following:
from umls_downloader import download_umls
# Same path as before
path = download_umls(version="2023AA")
First, you’ll have to install bioversions with pip install bioversions, whose job it is to look up the latest version of many databases. Then, you can modify the previous code slightly by omitting the version keyword argument:
from umls_downloader import download_umls
# Same path as before (when run on September 1st, 2023)
path = download_umls()
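Under the hood, this is just a version lookup that you can also do yourself. A quick sketch, assuming bioversions exposes a get_version function (its main entry point); the returned string will of course change over time.

```python
import bioversions

# Look up the most recent UMLS version known to bioversions,
# e.g. "2023AA" at the time this post was written.
version = bioversions.get_version("umls")
print(version)
```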
The UMLS file is zipped, so it’s usually accompanied by the following boilerplate code:
import zipfile

from umls_downloader import download_umls

path = download_umls()
with zipfile.ZipFile(path) as zip_file:
    with zip_file.open("MRCONSO.RRF", mode="r") as file:
        for line in file:
            ...
This exact code is wrapped with umls_downloader.open_umls() using Python’s context manager, so it can more simply be written as:
from umls_downloader import open_umls

with open_umls() as file:
    for line in file:
        ...
Note: The version and api_key arguments work the same for umls_downloader.open_umls() as in umls_downloader.download_umls().
At this point, it’s up to you to decide how you want to consume the MRCONSO.RRF file. Below, I give a demo of how I parsed this file in PyOBO in order to convert UMLS to an OWL ontology.
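If you just want to poke at the raw file first, MRCONSO.RRF is a pipe-delimited file in which each row is a concept atom. The sketch below pulls out a few well-known columns (CUI, language, source abbreviation, source code, and string); the column positions follow the standard MRCONSO layout documented by the NLM, and the decoding and filtering choices here are my own.

```python
from umls_downloader import open_umls

with open_umls() as file:
    for line in file:
        # Lines come back as bytes from the zip archive; MRCONSO.RRF is UTF-8 and pipe-delimited.
        parts = line.decode("utf-8").rstrip("\n").split("|")
        cui, language, sab, code, name = parts[0], parts[1], parts[11], parts[13], parts[14]
        if language != "ENG":
            continue  # keep only English atoms for this example
        print(cui, sab, code, name)
        break  # remove this to process the whole file
```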
The UMLS provides an API for access to tiny bits of data at a time. There are even two recent (last 5 years) packages, umls-api and connect-umls, that provide wrappers around it. However, API access is generally rate limited, difficult to use in bulk, and slow. For working with UMLS (or any other database, for that matter) in bulk, it’s necessary to download full database dumps.
Building on top of the automated download of UMLS, I implemented a fit-for-purpose processor with the PyOBO framework that converts UMLS into an ontology (encoded either as OWL, OBO, or OBO Graph JSON), which can therefore be used to generate semantic mappings in the SSSOM format. The code that implements this can be found here. After installing PyOBO with pip install pyobo, you can automatically download and convert UMLS first into an ontology encoded in the OBO flat file format, then convert it to OWL with the following code. Note: you’ll need robot for the second step:
import pyobo
umls = pyobo.get_ontology("umls")
# Write simple OBO Format
umls.write_obo("umls.obo")
# Convert to OWL
from pyobo.utils.misc import obo_to_owl
obo_to_owl("umls.obo", "umls.owl")
In an ideal world, the results of such a conversion could be included as a part of the OBO Database Ingestion, which converts database resources available through PyOBO into ontology artifacts, archives them on GitHub and Zenodo, and gives them PURLs all on a weekly basis to make sure the most up-to-date version is available as well as all previous named versions. Instead, we live in a world with pineapple pizza and restrictive licenses.
One of the nice qualities of UMLS is that it is a semantic mapping hub. It provides mostly complete mappings between many vocabularies including MeSH, NCIT, SNOMED-CT, HPO, LOINC, and more. However, there are a few caveats to consider:

- UMLS does not say how its mappings were produced, so the mapping_justification field in SSSOM is uniformly filled with semapv:UnspecifiedMatching.
- The mappings are encoded with oboInOwl:hasDbXref instead of more detailed types such as skos:exactMatch, skos:narrowMatch, and skos:broadMatch. Tools like Boomer can be used to address this (in part). The Semantic Mapping Reasoning Assembler (SeMRA) can also be configured with prior knowledge about UMLS mapping assumptions when aggregating and reasoning over semantic mappings at scale.

With that in mind, anything that can be loaded as an ontology in PyOBO can also be exported with SSSOM, which I show below. For UMLS, this looks like:
import pyobo
df = pyobo.get_sssom_df("umls", names=False)
df.to_csv("umls.sssom.tsv", sep="\t", index=False)
Note: You can set names=True to have PyOBO look up the names for all entities, but this is a bit of a rabbit hole since it requires getting and processing many external resources.
There’s much more to say about UMLS and SSSOM, but this is a good place to pause and publish this post, since getting UMLS as SSSOM is a task a lot of people have asked me for help with lately. I might also come back and explain more about how I use the other resources from UMLS’s UTS.
There are many potential directions for reproducibility. Given the fact that typical computational scientists are not trained as software engineers, we decided on seven very simple criteria that can be easily reviewed and easily addressed, covering things like whether the repository has a license and a README and whether the code has been formatted with a community-standard tool (e.g., black for Python). These correspond to important details that are complementary to other considerations of reproducibility, but often overlooked. Throughout the pilot, the editors and reviewers will try to support authors in addressing each of these points during revision. I imagine that there will be future iterations of these criteria as the community begins to expect these as standard practice. For example, we could narrow criterion 1 to specifically say that the software should be licensed with an OSI-approved license and not accept science made with non-open licenses. We could further narrow point 7 to have additional community style requirements (e.g., passing parts of flake8, as you know I love from my post on flake8 hell). We could also include additional guidelines that, e.g., say that the results presented in the paper should be reproducible with a single command from the command line, e.g., a shell script. The rabbit hole could go very deep, so again, it’s worth saying that these are very non-controversial criteria for the first generation.
That being said, many repositories don’t follow these! Since these criteria are so simple, I’m interested in automating their assessment and further applying it to the entire Journal of Cheminformatics backlog. I’ll describe this more in a future post.
Without further ado, the text below is what I sent verbatim in the review for Drug-Protein Interaction Prediction via Multi-View Variational Autoencoder and Cascade Deep Forests, which is pre-printed on Research Square and has associated code here. I have tried my best to include actionable links and information with each piece. I would like to also automate sending separate GitHub issues for each of these points as a more concrete to-do list for authors, then also send an “epic” issue that lists all of them together. With the magic of the GitHub API, this is possible.
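For the curious, opening such issues programmatically is only a small amount of code. Here’s a rough sketch using the GitHub REST API’s create-issue endpoint via the requests library; the repository name, token handling, and issue text are placeholders.

```python
import requests

# Placeholders: swap in the real repository and a personal access token with issue scope.
OWNER_REPO = "some-owner/some-repo"
TOKEN = "ghp_..."


def open_issue(title: str, body: str) -> int:
    """Open a GitHub issue and return its number."""
    response = requests.post(
        f"https://api.github.com/repos/{OWNER_REPO}/issues",
        headers={"Authorization": f"token {TOKEN}", "Accept": "application/vnd.github+json"},
        json={"title": title, "body": body},
    )
    response.raise_for_status()
    return response.json()["number"]


issue_number = open_issue(
    title="Reproducibility review: add a README",
    body="The repository does not yet have a README explaining what the code does and how to run it.",
)
print(f"Opened issue #{issue_number}")
```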
Below, I apply the seven point reproducibility review prescribed by Improving reproducibility and reusability in the Journal of Cheminformatics to the default branch of repository https://github.com/Macau-LYXia/MVAE-DFDTnet (commit c0858c8), accessed on August 27th, 2023.
The repository does not have a README. GitHub can be used to create a README file with https://github.com/Macau-LYXia/MVAE-DFDTnet/new/main?filename=README.md. Repositories typically use the Markdown format, which is explained here.
Has the code been formatted with a community-standard tool (e.g., black for Python)? The Python code has not been formatted with black. Similarly, the Matlab code has not been linted, e.g., using checkcode.
Scientific integrity depends on enabling others to understand the methodology (written as computer code) and reproduce the results generated from it. This reproducibility review reflects steps towards this goal that may be new for some researchers, but will ultimately raise standards across our community and lead to better science. Because the work presented in this article only yet addresses one of the seven points of the reproducibility review, I recommend rejecting the article and inviting later resubmission after the points have been addressed.
For posterity, this review has also been included on https://github.com/Macau-LYXia/MVAE-DFDTnet/issues/1.
The example above isn’t so great - it’s possible that these authors have never considered most of these points about reproducibility before. The reality is that many computational scientists are not trained in this since their mentors were not primarily trained as computational scientists themselves. Combined with the perverse incentive structure in academia, it’s understandable how this can be left out of some publications. I experienced something similar in my doctoral studies, and had to bootstrap my own philosophy on reproducibility as well as the practical skills to achieve it. I also understand not everyone is in the position where they have the flexibility/freedom/initiative to do this.
That all being said, we are now entering an era where progressive and newly minted PIs actually have training as computational scientists. The next paper in my queue for a reproducibility review is for https://github.com/Steinbeck-Lab/cheminformatics-python-microservice, which will pass the 7 criteria with flying colors. I’m looking forward to the future when we expect more excellent science on the regular. See you there!
I’m not sure how people will view the way I talk about reviews - I am quite open with posting reviews on GitHub and also openly discussing the fact that I’ve reviewed something. Ideally, I don’t accept reviews for papers that don’t have pre-prints, since I personally think the review process should be open. I hope I haven’t been rude or unfair; if I have, I hope someone will help me change the way I write about these topics.
Each of the following queries can be readily copy-pasted into the Wikidata Query Service and run in the browser.
The following SPARQL query gets information about journals:
SELECT ?journal ?journalLabel (GROUP_CONCAT(?issn) as ?issns)
WHERE
{
?journal wdt:P31 wd:Q5633421 ;
wdt:P236 ?issn .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language
}
GROUP BY ?journal ?journalLabel
Follow this link to populate the Wikidata Query Service with this query. Note that this query takes a while to run and may time out since there are on the scale of 100K journals.
Journals might have multiple International Standard Serial Numbers (ISSNs) because a different one is assigned to the print and electronic versions of the journal, among other things.
Get the ISSN-L (the normalized/preferred ISSN) for each:
SELECT ?journal ?journalLabel ?issnl
WHERE
{
?journal wdt:P31 wd:Q5633421 .
OPTIONAL { ?journal wdt:P7363 ?issnl }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language
}
Get a forward mapping from all ISSNs to ISSN-L. Note that these have been filtered to scientific journals (wd:Q5633421).
SELECT ?issn ?issnl
WHERE
{
?journal wdt:P31 wd:Q5633421 ;
wdt:P7363 ?issnl ;
wdt:P236 ?issn .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language
}
The following SPARQL query gets information about publishers:
SELECT DISTINCT ?publisher ?publisherLabel ?ror ?grid ?isni
WHERE
{
?publisher wdt:P31/wdt:P279+ wd:Q2085381 ;
rdfs:label ?publisherLabel .
FILTER ( LANG(?publisherLabel) = "en" )
OPTIONAL { ?publisher wdt:P6782 ?ror }
OPTIONAL { ?publisher wdt:P2427 ?grid }
OPTIONAL { ?publisher wdt:P213 ?isni }
}
Follow this link to populate the Wikidata Query Service with this query. This query returns the Research Organization Registry (ROR) identifier when available. This registry effectively subsumes the Global Research Identifier Database (GRID), which has since been shut down, but this might be helpful for integrating data that hasn’t been updated. The International Standard Name Identifier (ISNI) is also included when available. Wikidata has several other nomenclature authorities such as GND, VIAF, RingGold, and others that are omitted for brevity (each has its own corresponding Wikidata property).
Later, I could consider adding a clause to make sure each publisher actually publishes a “scientific journal” in order to remove some irrelevant records.
Finally, the publisher (P123) relation can be used to identify the relationships between journals and their respective publishers.
SELECT DISTINCT ?journal ?journalLabel ?publisher ?publisherLabel
WHERE
{
?journal wdt:P31 wd:Q5633421 ;
rdfs:label ?journalLabel ;
wdt:P123 ?publisher .
FILTER ( LANG(?journalLabel) = "en" )
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language
}
ORDER BY ?journalLabel
Follow this link to populate the Wikidata Query Service with this query.
Rather than using the Wikidata label service for the journal label, I wrote it out more explicitly to ensure that there is an English label, and to remove anything without an English label.
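If you’d rather run these queries from a script than from the browser, the Wikidata Query Service exposes a standard SPARQL endpoint at https://query.wikidata.org/sparql. Here’s a small sketch using requests; the query is the ISSN-to-ISSN-L mapping from above, and the User-Agent string is a placeholder that you should customize per Wikimedia’s etiquette guidelines.

```python
import requests

SPARQL = """
SELECT ?issn ?issnl WHERE {
  ?journal wdt:P31 wd:Q5633421 ;
           wdt:P7363 ?issnl ;
           wdt:P236 ?issn .
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "issn-mapping-demo/0.1 (your-contact-here)"},
)
response.raise_for_status()

# Build a dictionary from each ISSN to its preferred ISSN-L.
issn_to_issnl = {
    row["issn"]["value"]: row["issnl"]["value"]
    for row in response.json()["results"]["bindings"]
}
print(len(issn_to_issnl))
```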
I’m going to use SPARQL with the Wikidata Query Service to see what’s already in Wikidata. First, I want to find all of the awards that I’ve personally received using the P166 (award received) property. Note that the following query also takes advantage of Wikidata’s reification so I can reach into the qualifiers of each statement to figure out when the award was given.
SELECT ?award ?awardLabel ?year ?conferer ?confererLabel
WHERE {
VALUES ?person { wd:Q47475003 }
?person p:P166 ?award_statement .
?award_statement ps:P166 ?award .
OPTIONAL { ?award wdt:P1027 ?conferer . }
OPTIONAL {
?award_statement pq:P585 ?date .
BIND(year(?date) AS ?year)
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
See this query in action at https://w.wiki/6odU.
As of the time of writing, the only award that is listed here is the Bernie Lemire Award. This was given to me by the Northeastern University Department of Chemistry at the end of my bachelor’s degree for service to the department and academic excellence. I am very proud of this award! You can switch out wd:Q47475003 for your own Wikidata identifier.
A similar SPARQL query can be written to identify all of the awards for which I was nominated by swapping the predicate to P1411 (nominated for). This isn’t necessarily a superset of the awards received since some awards are decided without a nomination. It might also be the case depending on how curation is done that these are out of sync.
SELECT ?award ?awardLabel ?year ?conferer ?confererLabel
WHERE {
VALUES ?person { wd:Q47475003 }
?person p:P1411 ?award_statement .
?award_statement ps:P1411 ?award .
OPTIONAL { ?award wdt:P1027 ?conferer . }
OPTIONAL {
?award_statement pq:P585 ?date .
BIND(year(?date) AS ?year)
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
See this query in action at https://w.wiki/6odV.
Many awards are given on a periodic basis (e.g., yearly, bi-yearly). Scholia is an excellent frontend to Wikidata that already has a way of summarizing awards. Some examples:
Finally, I want to summarize all awards nominated or given by an organization. In this example, I’m going to look at the International Society for Biocuration (ISB; Q23809291).
The following query shows all of the recipients for all of the various awards conferred by the ISB:
SELECT ?award ?awardLabel ?recipient ?recipientLabel ?year
WHERE {
?recipient p:P166 ?award_statement .
?award_statement ps:P166 ?award .
OPTIONAL {
?award_statement pq:P585 ?date .
BIND(year(?date) AS ?year)
}
?award wdt:P1027 wd:Q23809291 .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?year) ?awardLabel
See this query in action at https://w.wiki/6odW or the results embedded below.
At the time of writing, this only returned a paltry 9 rows, meaning more curation is necessary! Considering this award is about biocurators, we better get our act together 🙃. Update June 4th, 2023: I went back and curated the full catalog.
Similarly, the following query can be used to identify all nominations:
SELECT ?award ?awardLabel ?nominee ?nomineeLabel ?year
WHERE {
?nominee p:P1411 ?award_statement .
?award_statement ps:P1411 ?award .
OPTIONAL {
?award_statement pq:P585 ?date .
BIND(year(?date) AS ?year)
}
?award wdt:P1027 wd:Q23809291 .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?year) ?awardLabel
See this query in action at https://w.wiki/6odX or the results embedded below.
There are only 5 results at the time of writing, and these are for my fellow nominees for the Excellence in Biocuration Early Career Award that I recently curated! There’s a lot of work to do here for keeping a history of the ISB’s awards. Update June 4th, 2023: I went back and curated the full catalog. It turns out that the ISB did not publish the list of nominees for any awards until 2022, so this list will remain short.
More generally, it turns out that there are only a bit more than 55K nomination relations in total for all of Wikidata. You can check this with:
SELECT (count(*) AS ?count)
WHERE { ?nominee wdt:P1411 ?award . }
Award objects don’t have to be complicated - the most important information is to include a useful instance annotation (e.g., to science award (Q11448906)) and a few key statements, such as the organization that confers the award (P1027).
See https://www.wikidata.org/wiki/Q118947746 as an example.
On a given Wikidata page, you can add a statement for either nominated for or award received using Wikidata’s amazing curation interface that has search built in. It’s recommended to add a point in time (P585) annotation to make a distinction between different periods. Further, it’s recommended to add a reference using the reference URL (P854) property that points to a webpage with an announcement about the nomination or award.
Overall, I think modeling awards is hard, since these are less concrete than other academic information such as employment or education. Still, this is the next step in making my resume 100% auto-generated by SPARQL and Wikidata!
See also an analysis by Chris Mungall of the gender distribution of awards in Wikidata.
Update June 4th, 2023: I won the Excellence in Biocuration Early Career Award! Nico Matentzoglu won the Excellence in Biocuration Advanced Career Award and we are both excited to see that the community was interested to recognize people who work on fundamental underlying technologies.
In this post, I show how to build a resolver for Archival Resource Keys (ARKs) using the curies Python package.
In a lot of ways, ARKs look and act like CURIEs. For example, ark:/53355/cl010277627 could be interpreted as having the prefix ark and the local unique identifier /53355/cl010277627. The first part of each ARK between the first two slashes corresponds to the provider. In this example, 53355 corresponds to the Louvre museum in Paris, France and cl010277627 is the local unique identifier corresponding to the Vénus de Milo statue.
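To make that interpretation concrete, here’s a tiny sketch that pulls an ARK string apart into its provider code and provider-local name using plain string operations - this just mirrors the description above and is not an official ARK parser.

```python
def parse_ark(ark: str) -> tuple[str, str]:
    """Split an ARK like 'ark:/53355/cl010277627' into a provider code and local name."""
    if not ark.startswith("ark:/"):
        raise ValueError(f"not an ARK: {ark}")
    provider, _, name = ark.removeprefix("ark:/").partition("/")
    return provider, name


provider, name = parse_ark("ark:/53355/cl010277627")
assert provider == "53355"  # the Louvre's provider code
assert name == "cl010277627"  # the Vénus de Milo
```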
However, I might have just committed ARK blasphemy. In N2T, it appears that the ARK prefix and provider code stay grouped together in the front half like ark:/53355/ and then the back half cl010277627 represents the local unique identifier. This is very similar to the two-layer identifiers in DOI and the arbitrary number of layers of identifiers in OID.
The point is, if we can interpret this enough like CURIEs, we can use the curies package to implement a resolver. The first step we can take is to download the N2T data from https://n2t.net/e/n2t_full_prefixes.yaml. Then we can parse out the ARKs (there are other things in N2T we’ll disregard) with the following code:
import pystow
import yaml

URL = "https://n2t.net/e/n2t_full_prefixes.yaml"
PROTOCOLS = {"https://", "http://", "ftp://"}

def get_prefix_map():
    """Get the prefix map from N2T, not including redundant ``ark:/`` in prefixes."""
    with pystow.ensure_open("n2t", url=URL) as file:
        records = yaml.safe_load(file)
    prefix_map = {}
    for key, record in records.items():
        uri_prefix = record.get("redirect")
        if (
            not uri_prefix
            or all(not uri_prefix.startswith(protocol) for protocol in PROTOCOLS)
            or uri_prefix.count("$id") != 1
            or not uri_prefix.endswith("$id")
            or not key.startswith("ark:/")
        ):
            continue
        key = key.removeprefix("ark:/")
        prefix_map[key] = uri_prefix.removesuffix("$id") + "/" + key + "/"
    return prefix_map
This prefix map removes ark:/ from the beginning of the prefixes in N2T and also adds the provider code into the URI prefix to make the URIs more focused on the local unique identifiers within each provider, rather than the entire ARK space.
Once we have a prefix map, we can make a curies.Converter and a Flask web application for resolving in a few lines:
from curies import Converter, get_flask_app

def get_app():
    """Get an ARK resolver app, noting that it uses a non-standard delimiter and URL prefix."""
    prefix_map = get_prefix_map()
    print(prefix_map)
    converter = Converter.from_prefix_map(prefix_map, delimiter="/")
    app = get_flask_app(converter, blueprint_kwargs=dict(url_prefix="/ark:"))
    return app
The two tricks here are:

- We strip ark:/ and then interpret the ARK provider code as the prefix and the rest as the local unique identifier. However, we still want to be able to write URLs in our resolver that have the ark:/ prefix. Luckily, Flask has the facility to define a default url_prefix for a given blueprint, which we invoke directly.
- Instead of a colon : as the delimiter between the prefix and local unique identifier, ARKs use a slash /. We can also set this in the Converter’s settings.

Now, all we need to do is instantiate the app and serve it with any WSGI tool like Gunicorn, Uvicorn, or Flask’s built-in development server (from Werkzeug). Navigating to http://localhost:5000/ark:/53355/cl010277627 redirects to https://collections.louvre.fr/ark:/53355/cl010277627 and gets some nice art from the Louvre. In general, you can stick any ARK after http://localhost:5000/ark: that is resolvable via N2T when running this server.
All of this code is on GitHub and can be run with the following:
git clone https://github.com/cthoyt/n2t-ark-resolver
cd n2t-ark-resolver
python -m pip install -r requirements.txt
python wsgi.py
Update: since posting this, I have heard from John Kunze that the ARK format is currently being updated to look more like URNs and therefore not have the slash after ark: anymore. If/when that happens, there are only a few bits of string pre-processing in this script that need to be updated to keep everything running.
Here’s a video of us playing. If you love it, let me know. If you hate it, slap like and subscribe.
The show is on April 8th at the Live Music Hall in Cologne, Germany. Because this is Easter weekend, it’s the perfect thing to do while you’re relaxing with your family. The show starts at 15.30 CEST / 9:30AM EST. The order of the 10 bands playing will be determined on the morning of, so I’ll send out a mass text about what time that will be for everyone who’s streaming.
The battle of the bands is judged in two ways: half of the score is by the judges and half by the audience. Each audience member gets two votes - this usually means they vote for the band they support and a second band that they liked after seeing them for the first time.
There will be 10 bands playing, which means this is going to be a loooooong day. The best spots in the show will be in the middle or towards the end, after the afternoon settles into evening. The order will be determined just before the show starts, based on which band sells the most pre-sale tickets. This means that getting a streaming ticket will support us to get a better spot, which is highly correlated with winning.
As you noticed, we’re the representatives of Monkey Jack. We will tell The Tale of Monkey Jack at our show. You should prepare by bringing a banana with you to the show, which you will need at the end of the Tale when we begin The Ritual.
First, navigate here for tickets.
Second, click Streamingtickets. If you want to be cool while speaking German, you should throw in some English words (or internationalisms).
Third, click Auswählen. This verb means that you are pledging your allegiance to Monkey Jack and promise to follow Him.
Fourth, click Bitte Wählen (please choose). This is a drop down menu to show your support for Monkey Jack and his representatives. Note that you only need one streaming ticket per stream, obviously you should throw a party/ritual to represent Monkey Jack yourself.
Fifth, click Monkey Jack.
Sixth, you can fill in the form with your information. The image below annotates what each of the fields means. PLZ is short for “Postleitzahl”, which means zip code. Don’t worry about the part of the form with the country picker. You’re a German now.
After you click it, it will bring you to a PayPal page. They’ll email you a confirmation within 5-10 minutes and send the streaming link the day of the show.
The OBO Foundry is a set of independent, interoperable biomedical ontologies that aspire to using shared development principles. One such principle is to use a principled approach for creating persistent uniform resource locators (PURLs) for local unique identifiers in each ontology. These PURLs follow the form http://purl.obolibrary.org/obo/<PREFIX>_<LOCAL UNIQUE IDENTIFIER>. For example, a prefix might be GO (for the Gene Ontology) and a local unique identifier might be 0032571 (for response to vitamin K in GO), resulting in the PURL http://purl.obolibrary.org/obo/GO_0032571.
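Since the PURL pattern is completely regular, constructing one is a one-liner; here’s a trivial illustration using the GO example above.

```python
def obo_purl(prefix: str, identifier: str) -> str:
    """Construct an OBO Foundry PURL from a prefix and local unique identifier."""
    return f"http://purl.obolibrary.org/obo/{prefix}_{identifier}"


assert obo_purl("GO", "0032571") == "http://purl.obolibrary.org/obo/GO_0032571"
```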
While most semantic web resources allow the use of any IRIs (internationalized resource identifiers), the OBO Foundry enforces that its PURLs resolve to something useful for readers (e.g., to the Ontology Lookup Service). The resolver behind http://purl.obolibrary.org is implemented and maintained in a GitHub repository with corresponding .htaccess files for each OBO Foundry ontology. Correct and useful configuration for each ontology is a requirement for acceptance to the OBO Foundry.
At the core of the OBO Foundry are several high quality, well-known, generally useful ontologies such as the Gene Ontology and the Cell Ontology. Inclusion in the OBO Foundry has therefore become a de facto stamp of approval for ontologies that (until now) 254 ontologies have (for better or worse) successfully sought out.
Unfortunately, some ontologies and controlled vocabularies have adopted OBO PURLs even though they are not OBO Foundry ontologies. This is a problem for a few reasons: such PURLs are not configured in the OBO PURL resolver, so they typically do not resolve, and they falsely suggest that the resource has been reviewed and accepted by the OBO Foundry.
One of the jobs of the Bioregistry is to catalog the URI format strings for identifier resources useful for the life and natural sciences. This allows us to assess how big the problem of non-OBO Foundry ontologies are using OBO PURLs, and why. Without further ado, here’s the list of offending resources that appear in the Bioregistry:
prefix | name | evidence | uri_prefix |
---|---|---|---|
aeon | Academic Event Ontology | curated | http://purl.obolibrary.org/obo/AEON_ |
cemo | COVID-19 epidemiology and monitoring ontology | extra | http://purl.obolibrary.org/obo/cemo.owl# |
covoc | CoVoc Coronavirus Vocabulary | curated | http://purl.obolibrary.org/obo/COVOC_ |
decipher | DECIPHER CNV Syndromes | biocontext | http://purl.obolibrary.org/obo/DECIPHER_ |
dermo | Human Dermatological Disease Ontology | curated | http://purl.obolibrary.org/obo/DERMO_ |
efo | Experimental Factor Ontology | biocontext | http://purl.obolibrary.org/obo/EFO_ |
gorel | GO Relations | biolink | http://purl.obolibrary.org/obo/GOREL_ |
hpath | Histopathology Ontology | curated | http://purl.obolibrary.org/obo/MC_ |
idocovid19 | COVID-19 Infectious Disease Ontology | curated | http://purl.obolibrary.org/obo/COVIDO_ |
lbo | Livestock Breed Ontology | curated | http://purl.obolibrary.org/obo/LBO_ |
lpt | Livestock Product Trait Ontology | curated | http://purl.obolibrary.org/obo/LPT_ |
mesh | Medical Subject Headings | biocontext | http://purl.obolibrary.org/obo/MESH_ |
msio | Metabolomics Standards Initiative Ontology | curated | http://purl.obolibrary.org/obo/MSIO_ |
omia | Online Mendelian Inheritance in Animals | biocontext | http://purl.obolibrary.org/obo/OMIA_ |
omim | Online Mendelian Inheritance in Man | biocontext | http://purl.obolibrary.org/obo/OMIM_ |
pride | PRIDE Controlled Vocabulary | curated | http://purl.obolibrary.org/obo/PRIDE_ |
reo | Reagent Ontology | curated | http://purl.obolibrary.org/obo/REO_ |
roleo | Role Ontology | curated | http://purl.obolibrary.org/obo/RoleO_ |
soybase | SoyBase | prefixcommons | http://purl.obolibrary.org/obo/ |
uniprot.isoform | UniProt Isoform | extra | http://purl.obolibrary.org/obo/UniProtKB_ |
vido | Virus Infectious Disease Ontology | curated | http://purl.obolibrary.org/obo/VIDO_ |
vsmo | Ontology for vector surveillance and management | curated | http://purl.obolibrary.org/obo/VSMO_ |
xl | Cross-linker reagents ontology | curated | http://purl.obolibrary.org/obo/XL_ |
In the evidence column, curated means the URI prefix was curated directly in the Bioregistry, while the other entries (biocontext, biolink, prefixcommons, extra) indicate the external source from which the URI prefix was imported or derived.
It’s worth noting that there are probably lots more resources doing this, e.g., that are listed in BioPortal, but have not been included in the Bioregistry because of their lack of notability, utility, or reuse.
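Reproducing a list like the one above is straightforward with the bioregistry Python package. Here’s a rough sketch that flags any resource whose URI prefix uses an OBO PURL but that has no OBO Foundry annotation; it assumes the package exposes read_registry() and that each record has get_uri_format() and get_obofoundry_prefix() helpers, so check the current API before relying on it.

```python
import bioregistry

OBO_PURL = "http://purl.obolibrary.org/obo/"

for prefix, resource in sorted(bioregistry.read_registry().items()):
    uri_format = resource.get_uri_format()  # e.g. "http://purl.obolibrary.org/obo/AEON_$1"
    if not uri_format or not uri_format.startswith(OBO_PURL):
        continue
    if resource.get_obofoundry_prefix():
        continue  # a real OBO Foundry ontology, so the PURL is sanctioned
    print(prefix, uri_format)
```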
Based on the table above, there are several situations in which an OBO PURL appears: either the developers/maintainers of the primary resource adopted an unsanctioned OBO PURL themselves, or a third-party resource (e.g., BioContext, Biolink, Prefix Commons) assigned one on their behalf.
It’s hard to know for sure what situations led the developers/maintainers of primary resources, or of third-party resources, to use unsanctioned OBO PURLs. Regardless, it’s still valuable for the community to know about these problems and potentially use comprehensive resources like the Bioregistry as a guide towards improving interoperability and interpretability.