I talked to Guy Cochrane and Chuck Cook from the Global Biodata Coalition (GBC). They chaired a session on the sustainability of biocurated resources, with a specific focus on the GBC’s Global Core Biodata Resources (GCBR) initiative. I felt like my talk from last year’s Biocuration conference on the Open Code, Open Data, Open Infrastructure (O3) roadmap (preprint) would have fit right in here. I am very keen to have their perspectives, since the GBC has already worked on evaluating resources and is now working towards funding them. Since they have not yet worked on practical recommendations for supporting sustainability, I eagerly volunteered to join their work in some capacity to help advise on this.
GBC also published a workflow for evaluating the landscape of biological databases (press release / publication / code). When possible, this workflow aligned on FAIRsharing, but FAIRsharing is a limited resource that only has partial mappings to relevant related resources like re3data, BARTOC, etc. I therefore suggested using the Bioregistry as a mapping hub to enrich the output of this workflow, which will definitely be run again on a periodic basis.
Lynn Schriml presented recent updates on the Disease Ontology, which prompted a relevant question from Harpreet Singh, Chief Data Officer at the Indian Council of Medical Research (ICMR), who himself works with clinical data and has wondered how best to annotate it - using MeSH, SNOMED, ICD, or other disease resources. I had an interesting discussion with him following the talk, which gave big motivation for the talk I was about to give on the large-scale assembly of and reasoning over semantic mappings. I was very excited, since I love to add (last-minute) shout-outs into my conference talks that motivate parts of the work based on questions or discussions from earlier in the conference.
There were a series of talks that motivated further discussions about mappings. One of the most interesting was the talk from Shivani Sharma, a curator at the Indian Biological Data Centre (IBDC) and one of the local organizers. She works on the Indian Metabolome Data Archive. Many of the lines of work at the IBDC have practical applications towards agriculture and integrate medium- and large-scale experimental work, biocuration, and downstream analysis. Often, these applications are oriented towards improving crop yields and avoiding disease. Shivani showed a slide where they considered a large number of metabolomics nomenclature resources to use for annotating their data. However, they were not familiar with methods for incorporating multiple nomenclature resources, meaning that their curators were running into issues where their chosen metabolomics database did not cover chemicals they needed to annotate. This often led to them having to create their own ad hoc annotations, which in turn creates issues for data integration. I am looking forward to catching up with them again, incorporating new metabolomics resources into PyOBO, ingesting mappings into SeMRA, and filling in the gaps using Biomappings to support their curators.
Scott V. Nguyen from the American Type Culture Collection (ATCC) also approached me about this work, since he’s currently trying to curate mappings between cell lines in their resource and other public resources. It was lucky that one of the examples from my talk was specifically about the cell line scenario, which I hope he can ingest to reduce his curation workload. Rachel Lyne also presented on COSMIC, a cancer cell line resource that also creates its own accession numbers and could benefit from this work, but I didn’t get a chance to talk with her about it yet.
I also met Yasunori Yamamoto, who works on TogoID, a secondary database of semantic mappings that covers select domains within biomedicine. We discussed how they could make use of the Simple Standard for Sharing Ontology Mappings (SSSOM) to ingest more mappings from different resources, especially from Biomappings or potentially from the outputs of SeMRA (which I presented on).
Matt Jeffreys presented on the annotations database in European PubMed Central, which allows for tagging articles, sentences, or tokens in articles with annotations. They already showed how this applies to named entity recognition (NER) and MeSH term annotations, but we discussed how SeMRA and comprehensive semantic mapping databases could help unify annotations from overlapping vocabularies, e.g., if someone deposited Disease Ontology (DO) NER annotations, which overlap with MeSH terms in the disease (C) and psychiatric disorders (F) branches.
I discussed with Raja Mazumdar and Jeet Vora from George Washington University, who both work on GlyGen and are plugged into the NIH’s Common Fund Data Ecosystem (CFDE), how they can continue to use the Bioregistry to standardize the annotations in their resources. Jeet got in touch earlier this year and helped update the records in the Bioregistry related to GlyGen. Raja’s talk also motivated two new prefix additions to the Bioregistry, for BioCompute Objects and for OncoMX data objects. Further, Raja is very interested in improving his data using the Bioregistry; since GlyGen already uses a Python script to validate its JSON and TSV components, it will be easy to incorporate the Bioregistry Python package’s validation functions.
Earlier this winter, I presented to the American National Institutes of Health (NIH) BISTI group about different avenues through which they could use the Bioregistry to create more value for the NIH and its grantees. One of those discussions was about improving GenBank’s internal database catalog. By chance, I talked about this with Ilene Karsch Mizrachi, a program head at the NIH. She was attending the conference and made big contributions to the discussions about India’s relationship to the International Nucleotide Sequence Database Collaboration (INSDC). It turns out she was the one who made/contributed to this GenBank table many years ago. We will try to follow up by enriching this table with information from the Bioregistry.
At last year’s Biocuration conference, Chris Hunter presented on GigaDB, and we had some initial discussions about using the Bioregistry (or other related parts of the Biopragmatics Stack) to make standardized annotations on datasets deposited in their database, such as cell line annotations. We picked that conversation back up, and it seems that the GigaDB developers are working with PHP. Since we got CZI funding to make the Bioregistry available in other languages, making a wrapper from Rust to PHP (within the curies.rs framework) could be a good way to support them.
There was an entire session on the final day of the conference on structural bioinformatics, which included several presentations from the American and European loci of the Protein Data Bank (PDB). The first discussion was with Marcus Bage, who is currently trying to annotate protein modifications. We discussed the implications of the vast number of resources that partially cover this domain in different senses, including GO, MOD, SBO, MOP / PSI-MI, and UniProt’s internal vocabulary. A long time ago, I mapped these together in PyBEL, but that was only a partial solution, too!
The second discussion was with Brinda Vallat about the upcoming change to PDB accession numbers. It turns out that the 4-character code is estimated to fill up in 2029, so it’s time for the PDB to make a change. Unfortunately, their solution is to switch to local unique identifiers that look like pdb_000002GC4, which is problematic for two main reasons. First, it’s not backwards compatible with existing IDs. Second, it introduces a banana (i.e., a redundant copy of the name/acronym of the database in the local unique identifier). The reasoning behind adding in the banana was to make it easier to find references in papers. I can understand this, since we don’t yet have general solutions for referencing concepts across different publishers (though we solved this in Manubot by integrating the Bioregistry). However, this increases confusion for consumers. I suggested they simply extend the existing IDs to allow more than 4 characters, and suggest people reference their entities within papers with CURIEs like PDB:2GC4, which solves both issues simultaneously. Similarly, I talked to Ibrahim Roshan Kunnakkattu about creating more careful identifier recommendations for the PDB’s Chemical Component Dictionary as well as using some of the automated mapping tools I presented for filling out references to ChEBI, ChEMBL, PubChem, and more.
I also had the unique pleasure to spend time in person with Tiago Lubiana, who is highly aligned on many of my interests in data standardization, semantic web, and open science. He has been a helpful contributor in the Bioregistry, Wikidata, and the OBO Foundry. Writing up some of the things we discussed would take a whole blog post, so instead, here’s a nice picture we got together.
Overall, like every Biocuration conference, I was very happy to find people interested in my work, and more importantly, interested in the idea of improving their own data standardization! I also had lots of other interesting discussions that don’t require any follow-up. I am also planning on writing a post that gives a more high-level summary of the different parts of the conference itself, not just focusing on my work.
As a demonstration, we will build a data model and API that serves information about scholars.
We’ll use Open Researcher and Contributor (ORCID) identifiers as primary keys, include the researcher’s name, and start with a single cross-reference, e.g., to the author’s DBLP identifier. We’ll encode this data model using Pydantic in the Python programming language as follows:
from pydantic import BaseModel, Field

class ScholarV1(BaseModel):
    """A model representing a researcher, who might have several IDs on different services."""

    orcid: str = Field(...)
    name: str = Field(...)
    dblp: str | None = Field(None)

print(ScholarV1.schema_json(indent=2))
There are several places for improvement here: the fields have no human-readable titles (e.g., showing ORCID instead of Orcid and DBLP instead of Dblp), no descriptions, no examples, and no regular expression patterns. All of these are possible to annotate into Pydantic’s Field object, but it requires lots of effort and takes lots of space. Even worse, this might have to be partially duplicated if multiple models share the same fields. In the example below, I annotated ORCID but will skip the others for brevity.
from pydantic import BaseModel, Field

class ScholarV2(BaseModel):
    """A model representing a researcher, who might have several IDs on different services."""

    orcid: str = Field(
        ...,
        title="ORCID",
        description="A stable, public identifier for a researcher from https://orcid.org",
        pattern=r"^\d{4}-\d{4}-\d{4}-\d{3}(\d|X)$",
        example="0000-0003-4423-4370",
    )
    name: str = Field(...)
    dblp: str | None = Field(None)

print(ScholarV2.schema_json(indent=2))
However, this was a lot of work. It would be nice if there were some database of all the semantic spaces in the semantic web and natural sciences that contained the name, description, regular expression pattern, and examples. Then, we could draw from this database to automatically populate our fields.
The good news is that such a database exists - it’s called the Bioregistry. Each semantic space (e.g., ORCID, DBLP) gets a prefix which is usually an acronym for the name of the resource that serves as the primary key for the semantic space. These prefixes are also useful in making references to entities in the semantic space more FAIR (findable, accessible, interoperable, reusable) using the compact URI (CURIE) syntax, though this isn’t the goal of this demo.
I’ve mocked some Python code that bridges Pydantic and the Bioregistry in this repository (https://github.com/cthoyt/semantic-pydantic). I’m calling it Semantic Pydantic because it lets us annotate our data models with external metadata (and because it rhymes).
Here’s the same model as before, but now using a SemanticField that extends Pydantic’s Field. It has a special keyword prefix that lets you give a Bioregistry prefix; then it is smart enough to fill out all the fields on its own. I also took the liberty of adding several more semantic spaces that identify scholars, like Web of Science (wos), Scopus, and even GitHub.
from pydantic import BaseModel, Field
from semantic_pydantic import SemanticField

class ScholarV3(BaseModel):
    """A model representing a researcher, who might have several IDs on different services."""

    orcid: str = SemanticField(..., prefix="orcid")
    name: str = Field(..., example="Charles Tapley Hoyt")
    wos: str | None = SemanticField(default=None, prefix="wos.researcher")
    dblp: str | None = SemanticField(default=None, prefix="dblp.author")
    github: str | None = SemanticField(default=None, prefix="github")
    scopus: str | None = SemanticField(default=None, prefix="scopus")
    semion: str | None = SemanticField(default=None, prefix="semion")
    publons: str | None = SemanticField(default=None, prefix="publons.researcher")
    authorea: str | None = SemanticField(default=None, prefix="authorea.author")

print(ScholarV3.schema_json(indent=2))
Finally, we can see a very detailed JSON schema, which includes everything from before plus additional context from the Bioregistry, including the prefix itself as well as mappings from the Bioregistry prefix to external registries like BARTOC, FAIRsharing, and others. Together, these make the data model more FAIR and support interoperability, since now it is possible to directly match the fields annotated with Bioregistry prefixes in this model to fields annotated with the same prefix in other models, even external to the project.
For example, the generated description of the orcid field reads: “This field corresponds to a local unique identifier from Open Researcher and Contributor. The semantics of this field are derived from the orcid entry in the Bioregistry: a registry of semantic web and linked open data compact URI (CURIE) prefixes and URI prefixes.” (In the JSON schema, the references to ORCID and to the Bioregistry entry are rendered as hyperlinks.)
Let’s take the next step to a web application using FastAPI. The goal of this web application will be to look up the information for a scholar in Wikidata based on their ORCID. You don’t really have to understand how the query works other than that it takes in an ORCID string and gives back an instance of the Scholar model we’ve been working on above.
The app uses annotations for the query parameters, path parameters, and other inputs to routes using extensions of Pydantic Fields. So, similar to before, we can extend their custom fields to be semantic in Semantic Pydantic.
from fastapi import FastAPI
from semantic_pydantic import SemanticPath

app = FastAPI(title="Semantic Pydantic Demo")
Scholar = ...  # defined before

@app.get("/api/orcid/{orcid}", response_model=Scholar)
def get_scholar_from_orcid(orcid: str = SemanticPath(prefix="orcid")):
    """Get xrefs for a researcher in Wikidata, given ORCID identifier."""
    ...  # full implementation in https://github.com/cthoyt/semantic-pydantic
    return Scholar(...)
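Once the app is running locally, the route can be exercised like any other HTTP endpoint. Below is a minimal sketch using the requests library; the port and the exact response fields depend on how you launch the demo, and the ORCID used here is just the example from above.

```python
import requests

# Assumes the demo app is being served locally on port 8000;
# adjust the base URL to match however you launch it.
response = requests.get("http://localhost:8000/api/orcid/0000-0003-4423-4370")
response.raise_for_status()

scholar = response.json()
# The response follows the Scholar model, so keys like "orcid" and "name" should be present.
print(scholar.get("name"), scholar.get("dblp"))
```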
The real power is how this translates to the API, and more importantly, the automatically generated API documentation. First, the SemanticPath object, which we used in place of a normal fastapi.Path, also knows it is for ORCID identifiers. Second, the response model points to the Scholar class from before, which already knows about its semantics. Below, we see this in a screenshot of the OpenAPI (formerly known as Swagger) user interface automatically generated by FastAPI.
There are two big things to note here:
Now, we have an API that is also annotated with detailed semantics. If you take a look at the OpenAPI JSON file, it has similar references to Bioregistry prefixes for the routes themselves, and directly reuses the JSON schema for the response model.
So far, this is a proof-of-concept that lives in an ad hoc repository. I haven’t decided yet whether this code is just a neat demo, whether it should live inside the Bioregistry Python package, or whether it should be in a stand-alone package that might be extensible even further. There are a few other things to think about in the meantime:
The first version of this idea just throws the Bioregistry data into the JSON schema. It would be interesting to develop this infrastructure further, such as keeping a catalog of all APIs that consume or produce data models containing semantic fields. A few places this would be great include services that exchange chemical structure identifiers such as InChIKeys (inchikey). There are also so many more examples - please let me know some services you think would benefit in the comments on my blog post. Looking forward, it’s also a question how to automatically discover such semantic APIs (e.g., by cleverly searching GitHub) or whether it would have to be a manually curated catalog.
A key feature of the Bioregistry is that it provides a way to take a local unique identifier for an entity in a given semantic space and make a URL that points to a web page describing the entity. For example, if you have an ORCID identifier, you can make a URL for the ORCID page following the format https://orcid.org/<put ID here>. It would be very cool to extend Semantic Pydantic to add some properties that auto-generate URLs, like in the following:
from pydantic import BaseModel, Field
from semantic_pydantic import SemanticField

class Scholar(BaseModel):
    orcid: str = SemanticField(..., prefix="orcid")
    name: str = Field(...)

charlie = Scholar(orcid="0000-0003-4423-4370", name="Charles Tapley Hoyt")
assert charlie.orcid_url == 'https://orcid.org/0000-0003-4423-4370'
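This isn’t implemented yet, but the behavior in the assert above can already be approximated by hand with an ordinary property. A real implementation would presumably generate these properties automatically from each SemanticField’s prefix and look up the URI format string in the Bioregistry rather than hard-coding it; the sketch below is just an illustration of the idea.

```python
from pydantic import BaseModel, Field


class Scholar(BaseModel):
    orcid: str = Field(...)
    name: str = Field(...)

    @property
    def orcid_url(self) -> str:
        # Hard-coded here for illustration; a real implementation would look up
        # the URI format string for the "orcid" prefix in the Bioregistry.
        return "https://orcid.org/" + self.orcid


charlie = Scholar(orcid="0000-0003-4423-4370", name="Charles Tapley Hoyt")
assert charlie.orcid_url == "https://orcid.org/0000-0003-4423-4370"
```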
The demo can be run by cloning the repository, installing its requirements, and running the self-contained app.py:
git clone https://github.com/cthoyt/semantic-pydantic
cd semantic-pydantic
python -m pip install -r requirements.txt
python app.py
This is my first science post of 2024! I’m very happy that the Bioregistry is currently supported by the Chan Zuckerberg Initiative (CZI) under award 2023-329850.
Cosmere moments I really enjoyed (spoilers):
Other non-Cosmere highlights (spoilers):
Disappointments:
My goal in 2024 is to read more books from different genres, especially ones I’ve never touched before.
The first big issue with the UMLS is its licensing. Here’s an excerpt from the How to License and Access the Unified Medical Language System® (UMLS®) Data page accessed on August 28th, 2023:
- Please sign up for a new UMLS Terminology Services (UTS) account with your preferred identity provider at the UTS homepage.
- Complete and submit the license request form. NLM will send the license approval e-mail within 5 business days after reviewing your authenticated license request.
- You will sign in using identity provider credentials to download files or access web interfaces that require UTS authentication such as the UTS, VSAC, SNOMED CT, or RxNorm.
There are a few big hurdles here: you need to create a UTS account with an identity provider, wait up to five business days for NLM to approve your license request, and then authenticate every time you want to download files.
I want to 1) convert UMLS into an OWL ontology and 2) extract and encode its semantic mappings to external vocabularies like the Medical Subject Headings (MeSH) with Simple Standard for Sharing Ontology Mappings (SSSOM). Given all of these hurdles, it’s probably the case that I am not allowed to redistribute these artifacts.
Altogether, I consider this a big bummer. The United States National Library of Medicine (NLM) maintains several highly influential resources, but I have found in many instances that they lack a community perspective. Regardless, even as an expat, I pay American taxes, and it makes me upset that the government funds the development and maintenance of resources that I can’t easily use.
Despite all of this rigamarole, there’s a way to work around these issues by automating the interaction with the UMLS Terminology Services (UTS), therefore enabling automated download of UMLS and several related resources (e.g., RxNorm, SNOMED CT).
This has been implemented in the open source umls_downloader Python package. It can be installed with the following one-liner in your shell:
$ pip install umls_downloader
Below, I’ll walk you through using it.
Throughout, keep in mind that full documentation for the umls_downloader package is available at umls-downloader.readthedocs.io, which describes the other functionality and other data that can be downloaded.
UMLS has three different distributions that are described here. The following Python code downloads the most simple and straightforward file, MRCONSO.RRF, as a zip archive:
from umls_downloader import download_umls
path = download_umls(version="2023AA", api_key="<your API key>")
This code is smart and does not need to download the file more than once. It uses pystow to choose a stable path (~/.data/bio/umls) relative to the current user’s home directory. Inside this directory, it also uses the version of the data to create a subdirectory. Finally, this function returns the path to the data, such that no file paths ever need to be hard-coded.
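For intuition, here’s roughly how pystow builds such paths - a small sketch assuming the default PYSTOW_HOME of ~/.data. The exact file name inside the versioned directory is chosen by umls_downloader, so treat this as illustrative.

```python
import pystow

# pystow.join builds (and creates) a directory under ~/.data by default,
# e.g. ~/.data/bio/umls/2023AA for a versioned subdirectory.
directory = pystow.join("bio", "umls", "2023AA")
print(directory)
```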
Warning This still requires an API key, which requires creating an account, agreeing to UMLS’s terms and conditions, etc. This can be done here: https://uts.nlm.nih.gov/uts/edit-profile.
There are two ways to automatically set the API key, so you don’t have to worry about getting it and passing it around in your Python code:

- Set the UMLS_API_KEY variable in the environment. This can be done in your interactive session or in the configuration for your shell, such as in a .bashrc file for the Bourne Again Shell (bash).
- Create a configuration file at ~/.config/umls.ini and set an api_key key in the [umls] section. Mine looks like:
[umls]
api_key=1234567890abcdefghijklmno
Now you can omit the api_key keyword like in the following:
from umls_downloader import download_umls
# Same path as before
path = download_umls(version="2023AA")
First, you’ll have to install bioversions with pip install bioversions, whose job it is to look up the latest version of many databases. Then, you can modify the previous code slightly by omitting the version keyword argument:
from umls_downloader import download_umls
# Same path as before (when run on September 1st, 2023)
path = download_umls()
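Under the hood, this is just a version lookup that you can also do yourself. A quick sketch, assuming bioversions exposes a get_version function (its main entry point); the returned string will of course change over time.

```python
import bioversions

# Look up the most recent UMLS version known to bioversions,
# e.g. "2023AA" at the time this post was written.
version = bioversions.get_version("umls")
print(version)
```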
The UMLS file is zipped, so it’s usually accompanied by the following boilerplate code:
import zipfile

from umls_downloader import download_umls

path = download_umls()
with zipfile.ZipFile(path) as zip_file:
    with zip_file.open("MRCONSO.RRF", mode="r") as file:
        for line in file:
            ...
This exact code is wrapped with umls_downloader.open_umls() using Python’s context manager, so it can more simply be written as:
from umls_downloader import open_umls

with open_umls() as file:
    for line in file:
        ...
Note: The version and api_key arguments work the same for umls_downloader.open_umls() as in umls_downloader.download_umls().
At this point, it’s up to you to decide how you want to consume the MRCONSO.RRF file. Below, I give a demo of how I parsed this file in PyOBO in order to convert UMLS to an OWL ontology.
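If you just want to poke at the raw file first, MRCONSO.RRF is a pipe-delimited file in which each row is a concept atom. The sketch below pulls out a few well-known columns (CUI, language, source abbreviation, source code, and string); the column positions follow the standard MRCONSO layout documented by the NLM, and the decoding and filtering choices here are my own.

```python
from umls_downloader import open_umls

with open_umls() as file:
    for line in file:
        # Lines come back as bytes from the zip archive; MRCONSO.RRF is UTF-8 and pipe-delimited.
        parts = line.decode("utf-8").rstrip("\n").split("|")
        cui, language, sab, code, name = parts[0], parts[1], parts[11], parts[13], parts[14]
        if language != "ENG":
            continue  # keep only English atoms for this example
        print(cui, sab, code, name)
        break  # remove this to process the whole file
```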
The UMLS provides an API for access to tiny bits of data at a time. There are even two recent (last 5 years) packages, umls-api and connect-umls, that provide wrappers around it. However, API access is generally rate limited, difficult to use in bulk, and slow. For working with UMLS (or any other database, for that matter) in bulk, it’s necessary to download full database dumps.
Building on top of the automated download of UMLS, I implemented a fit-for-purpose processor with the PyOBO framework that converts UMLS into an ontology (encoded either as OWL, OBO, or OBO Graph JSON), which can therefore be used to generate semantic mappings in the SSSOM format. The code that implements this can be found here. After installing PyOBO with pip install pyobo, you can automatically download and convert UMLS first into an ontology encoded in the OBO flat file format, then convert it to OWL with the following code. Note: you’ll need robot for the second step:
import pyobo
umls = pyobo.get_ontology("umls")
# Write simple OBO Format
umls.write_obo("umls.obo")
# Convert to OWL
from pyobo.utils.misc import obo_to_owl
obo_to_owl("umls.obo", "umls.owl")
In an ideal world, the results of such a conversion could be included as a part of the OBO Database Ingestion, which converts database resources available through PyOBO into ontology artifacts, archives them on GitHub and Zenodo, and gives them PURLs all on a weekly basis to make sure the most up-to-date version is available as well as all previous named versions. Instead, we live in a world with pineapple pizza and restrictive licenses.
One of the nice qualities of UMLS is that it is a semantic mapping hub. It provides mostly complete mappings between many vocabularies including MeSH, NCIT, SNOMED-CT, HPO, LOINC, and more. However, there are a few caveats to consider:

- UMLS does not say how its mappings were produced, so the mapping_justification field in SSSOM is uniformly filled with semapv:UnspecifiedMatching.
- The mappings are encoded with oboInOwl:hasDbXref instead of more detailed types such as skos:exactMatch, skos:narrowMatch, and skos:broadMatch. Tools like Boomer can be used to address this (in part). The Semantic Mapping Reasoning Assembler (SeMRA) can also be configured with prior knowledge about UMLS mapping assumptions when aggregating and reasoning over semantic mappings at scale.

With that in mind, anything that can be loaded as an ontology in PyOBO can also be exported with SSSOM, which I show below. For UMLS, this looks like:
import pyobo
df = pyobo.get_sssom_df("umls", names=False)
df.to_csv("umls.sssom.tsv", sep="\t", index=False)
Note: You can set names=True to have PyOBO look up the names for all entities, but this is a bit of a rabbit hole since it requires getting and processing many external resources.
There’s much more to say about UMLS and SSSOM, but this is a good place to pause and publish this post, since getting UMLS as SSSOM is a task a lot of people have asked me for help with lately. I might also come back and explain more about how I use the other resources from UMLS’s UTS.
There are many potential directions for reproducibility. Given the fact that typical computational scientists are not trained as software engineers, we decided on seven very simple criteria that can be easily reviewed and easily addressed, covering things like whether the repository has a license and a README and whether the code has been formatted with a community-standard tool (e.g., black for Python). These correspond to important details that are complementary to other considerations of reproducibility, but often overlooked. Throughout the pilot, the editors and reviewers will try to support authors in addressing each of these points during revision. I imagine that there will be future iterations of these criteria as the community begins to expect these as standard practice. For example, we could narrow criterion 1 to specifically say that the software should be licensed with an OSI-approved license and not accept science made with non-open licenses. We could further narrow point 7 to have additional community style requirements (e.g., passing parts of flake8, as you know I love from my post on flake8 hell). We could also include additional guidelines that, e.g., say that the results presented in the paper should be reproducible with a single command from the command line, e.g., a shell script. The rabbit hole could go very deep, so again, it’s worth saying that these are very non-controversial criteria for the first generation.
That being said, many repositories don’t follow these! Since these criteria are so simple, I’m interested in automating their assessment and further applying it to the entire Journal of Cheminformatics backlog. I’ll describe this more in a future post.
Without further ado, the text below is what I sent verbatim in the review for Drug-Protein Interaction Prediction via Multi-View Variational Autoencoder and Cascade Deep Forests, which is pre-printed on Research Square and has associated code here. I have tried my best to include actionable links and information with each piece. I would like to also automate sending separate GitHub issues for each of these points as a more concrete to-do list for authors, then also send an “epic” issue that lists all of them together. With the magic of the GitHub API, this is possible.
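For the curious, opening such issues programmatically is only a small amount of code. Here’s a rough sketch using the GitHub REST API’s create-issue endpoint via the requests library; the repository name, token handling, and issue text are placeholders.

```python
import requests

# Placeholders: swap in the real repository and a personal access token with issue scope.
OWNER_REPO = "some-owner/some-repo"
TOKEN = "ghp_..."


def open_issue(title: str, body: str) -> int:
    """Open a GitHub issue and return its number."""
    response = requests.post(
        f"https://api.github.com/repos/{OWNER_REPO}/issues",
        headers={"Authorization": f"token {TOKEN}", "Accept": "application/vnd.github+json"},
        json={"title": title, "body": body},
    )
    response.raise_for_status()
    return response.json()["number"]


issue_number = open_issue(
    title="Reproducibility review: add a README",
    body="The repository does not yet have a README explaining what the code does and how to run it.",
)
print(f"Opened issue #{issue_number}")
```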
Below, I apply the seven point reproducibility review prescribed by Improving reproducibility and reusability in the Journal of Cheminformatics to the default branch of repository https://github.com/Macau-LYXia/MVAE-DFDTnet (commit c0858c8), accessed on August 27th, 2023.
The repository does not have a README. GitHub can be used to create a README file with https://github.com/Macau-LYXia/MVAE-DFDTnet/new/main?filename=README.md. Repositories typically use the Markdown format, which is explained here.
Has the code been formatted with a community-standard tool (e.g., black for Python)? The Python code has not been formatted with black. Similarly, the Matlab code has not been linted, e.g., using checkcode.
Scientific integrity depends on enabling others to understand the methodology (written as computer code) and reproduce the results generated from it. This reproducibility review reflects steps towards this goal that may be new for some researchers, but will ultimately raise standards across our community and lead to better science. Because the work presented in this article only yet addresses one of the seven points of the reproducibility review, I recommend rejecting the article and inviting later resubmission after the points have been addressed.
For posterity, this review has also been included on https://github.com/Macau-LYXia/MVAE-DFDTnet/issues/1.
The example above isn’t so great - it’s possible that these authors have never considered most of these points about reproducibility before. The reality is that many computational scientists are not trained in this since their mentors were not primarily trained as computational scientists themselves. Combined with the perverse incentive structure in academia, it’s understandable how this can be left out of some publications. I experienced something similar in my doctoral studies, and had to bootstrap my own philosophy on reproducibility as well as the practical skills to achieve it. I also understand not everyone is in the position where they have the flexibility/freedom/initiative to do this.
That all being said, we are now entering an era where progressive and newly minted PIs actually have training as computational scientists. The next paper in my queue for a reproducibility review is for https://github.com/Steinbeck-Lab/cheminformatics-python-microservice, which will pass the 7 criteria with flying colors. I’m looking forward to the future when we expect more excellent science on the regular. See you there!
I’m not sure how people will view the way I talk about reviews - I am quite open with posting reviews on GitHub and also openly discussing the fact that I’ve reviewed something. Ideally, I don’t accept reviews for papers that don’t have pre-prints, since I personally think the review process should be open. I hope I haven’t been rude or unfair; if I have, I hope someone will help me change the way I write about these topics.
Each of the following queries can be readily copy-pasted into the Wikidata Query Service and run in the browser.
The following SPARQL query gets information about journals:
SELECT ?journal ?journalLabel (GROUP_CONCAT(?issn) as ?issns)
WHERE
{
?journal wdt:P31 wd:Q5633421 ;
wdt:P236 ?issn .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language
}
GROUP BY ?journal ?journalLabel
Follow this link to populate the Wikidata Query Service with this query. Note that this query takes a while to run and may time out since there are on the scale of 100K journals.
Journals might have multiple International Standard Serial Numbers (ISSNs) because a different one is assigned to the print and electronic versions of the journal, among other things.
Get the ISSN-L (the normalized/preferred ISSN) for each:
SELECT ?journal ?journalLabel ?issnl
WHERE
{
?journal wdt:P31 wd:Q5633421 .
OPTIONAL { ?journal wdt:P7363 ?issnl }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language
}
Get a forward mapping from all ISSNs to ISSN-L. Note that these have been filtered to scientific journals (wd:Q5633421).
SELECT ?issn ?issnl
WHERE
{
?journal wdt:P31 wd:Q5633421 ;
wdt:P7363 ?issnl ;
wdt:P236 ?issn .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language
}
The following SPARQL query gets information about publishers:
SELECT DISTINCT ?publisher ?publisherLabel ?ror ?grid ?isni
WHERE
{
?publisher wdt:P31/wdt:P279+ wd:Q2085381 ;
rdfs:label ?publisherLabel .
FILTER ( LANG(?publisherLabel) = "en" )
OPTIONAL { ?publisher wdt:P6782 ?ror }
OPTIONAL { ?publisher wdt:P2427 ?grid }
OPTIONAL { ?publisher wdt:P213 ?isni }
}
Follow this link to populate the Wikidata Query Service with this query. This query returns the Research Organization Registry (ROR) identifier when available. This registry effectively subsumes the Global Research Identifier Database (GRID), which has since been shut down, but this might be helpful for integrating data that hasn’t been updated. The International Standard Name Identifier (ISNI) is also included when available. Wikidata has several other nomenclature authorities such as GND, VIAF, RingGold, and others that are omitted for brevity (each has its own corresponding Wikidata property).
Later, I could consider adding a clause to make sure each publisher actually publishes a “scientific journal” in order to remove some irrelevant records.
Finally, the publisher (P123) relation can be used to identify the relationships between journals and their respective publishers.
SELECT DISTINCT ?journal ?journalLabel ?publisher ?publisherLabel
WHERE
{
?journal wdt:P31 wd:Q5633421 ;
rdfs:label ?journalLabel ;
wdt:P123 ?publisher .
FILTER ( LANG(?journalLabel) = "en" )
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language
}
ORDER BY ?journalLabel
Follow this link to populate the Wikidata Query Service with this query.
Rather than using the Wikidata label service for the journal label, I wrote it out more explicitly to ensure that there is an English label, and to remove anything without an English label.
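If you’d rather run these queries from a script than from the browser, the Wikidata Query Service exposes a standard SPARQL endpoint at https://query.wikidata.org/sparql. Here’s a small sketch using requests; the query is the ISSN-to-ISSN-L mapping from above, and the User-Agent string is a placeholder that you should customize per Wikimedia’s etiquette guidelines.

```python
import requests

SPARQL = """
SELECT ?issn ?issnl WHERE {
  ?journal wdt:P31 wd:Q5633421 ;
           wdt:P7363 ?issnl ;
           wdt:P236 ?issn .
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "issn-mapping-demo/0.1 (your-contact-here)"},
)
response.raise_for_status()

# Build a dictionary from each ISSN to its preferred ISSN-L.
issn_to_issnl = {
    row["issn"]["value"]: row["issnl"]["value"]
    for row in response.json()["results"]["bindings"]
}
print(len(issn_to_issnl))
```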
I’m going to use SPARQL with the Wikidata Query Service to see what’s already in Wikidata. First, I want to find all of the awards that I’ve personally received using the P166 (award received) property. Note that the following query also takes advantage of Wikidata’s reification so I can reach into the qualifiers of each statement to figure out when the award was given.
SELECT ?award ?awardLabel ?year ?conferer ?confererLabel
WHERE {
VALUES ?person { wd:Q47475003 }
?person p:P166 ?award_statement .
?award_statement ps:P166 ?award .
OPTIONAL { ?award wdt:P1027 ?conferer . }
OPTIONAL {
?award_statement pq:P585 ?date .
BIND(year(?date) AS ?year)
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
See this query in action at https://w.wiki/6odU.
As of the time of writing, the only award that is listed here is the Bernie Lemire Award. This was given to me by the Northeastern University Department of Chemistry at the end of my bachelor’s degree for service to the department and academic excellence. I am very proud of this award! You can switch out wd:Q47475003 for your own Wikidata identifier.
A similar SPARQL query can be written to identify all of the awards for which I was nominated by swapping the predicate to P1411 (nominated for). This isn’t necessarily a superset of the awards received since some awards are decided without a nomination. It might also be the case depending on how curation is done that these are out of sync.
SELECT ?award ?awardLabel ?year ?conferer ?confererLabel
WHERE {
VALUES ?person { wd:Q47475003 }
?person p:P1411 ?award_statement .
?award_statement ps:P1411 ?award .
OPTIONAL { ?award wdt:P1027 ?conferer . }
OPTIONAL {
?award_statement pq:P585 ?date .
BIND(year(?date) AS ?year)
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
See this query in action at https://w.wiki/6odV.
Many awards are given on a periodic basis (e.g., yearly, bi-yearly). Scholia is an excellent frontend to Wikidata that already has a way of summarizing awards. Some examples:
Finally, I want to summarize all awards nominated or given by an organization. In this example, I’m going to look at the International Society for Biocuration (ISB; Q23809291).
The following query shows all of the recipients for all of the various awards conferred by the ISB:
SELECT ?award ?awardLabel ?recipient ?recipientLabel ?year
WHERE {
?recipient p:P166 ?award_statement .
?award_statement ps:P166 ?award .
OPTIONAL {
?award_statement pq:P585 ?date .
BIND(year(?date) AS ?year)
}
?award wdt:P1027 wd:Q23809291 .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?year) ?awardLabel
See this query in action at https://w.wiki/6odW or the results embedded below.
At the time of writing, this only returned a paltry 9 rows, meaning more curation is necessary! Considering this award is about biocurators, we better get our act together 🙃. Update June 4th, 2023: I went back and curated the full catalog.
Similarly, the following query can be used to identify all nominations:
SELECT ?award ?awardLabel ?nominee ?nomineeLabel ?year
WHERE {
?nominee p:P1411 ?award_statement .
?award_statement ps:P1411 ?award .
OPTIONAL {
?award_statement pq:P585 ?date .
BIND(year(?date) AS ?year)
}
?award wdt:P1027 wd:Q23809291 .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?year) ?awardLabel
See this query in action at https://w.wiki/6odX or the results embedded below.
There are only 5 results at the time of writing, and these are for my fellow nominees for the Excellence in Biocuration Early Career Award that I recently curated! There’s a lot of work to do here for keeping a history of the ISB’s awards. Update June 4th, 2023: I went back and curated the full catalog. It turns out that the ISB did not publish the list of nominees for any awards until 2022, so this list will remain short.
More generally, it turns out that there are only a bit more than 55K nomination relations in total for all of Wikidata. You can check this with:
SELECT (count(*) AS ?count)
WHERE { ?nominee wdt:P1411 ?award . }
Award objects don’t have to be complicated - the most important information is to include a useful instance annotation (e.g., to science award (Q11448906)) and a few key statements, such as the organization that confers the award (P1027).
See https://www.wikidata.org/wiki/Q118947746 as an example.
On a given Wikidata page, you can add a statement for either nominated for or award received using Wikidata’s amazing curation interface that has search built in. It’s recommended to add a point in time (P585) annotation to make a distinction between different periods. Further, it’s recommended to add a reference using the reference URL (P854) property that points to a webpage with an announcement about the nomination or award.
Overall, I think modeling awards is hard, since these are less concrete than other academic information such as employment or education. Still, this is the next step in making my resume 100% auto-generated by SPARQL and Wikidata!
See also an analysis by Chris Mungall of the gender distribution of awards in Wikidata.
Update June 4th, 2023: I won the Excellence in Biocuration Early Career Award! Nico Matentzoglu won the Excellence in Biocuration Advanced Career Award and we are both excited to see that the community was interested to recognize people who work on fundamental underlying technologies.
In this post, I show how to build a resolver for Archival Resource Keys (ARKs) using the curies Python package.
In a lot of ways, ARKs look and act like CURIEs. For example, ark:/53355/cl010277627 could be interpreted as having the prefix ark and the local unique identifier /53355/cl010277627. The first part of each ARK between the first two slashes corresponds to the provider. In this example, 53355 corresponds to the Louvre museum in Paris, France and cl010277627 is the local unique identifier corresponding to the Vénus de Milo statue.
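To make that interpretation concrete, here’s a tiny sketch that pulls an ARK string apart into its provider code and provider-local name using plain string operations - this just mirrors the description above and is not an official ARK parser.

```python
def parse_ark(ark: str) -> tuple[str, str]:
    """Split an ARK like 'ark:/53355/cl010277627' into a provider code and local name."""
    if not ark.startswith("ark:/"):
        raise ValueError(f"not an ARK: {ark}")
    provider, _, name = ark.removeprefix("ark:/").partition("/")
    return provider, name


provider, name = parse_ark("ark:/53355/cl010277627")
assert provider == "53355"  # the Louvre's provider code
assert name == "cl010277627"  # the Vénus de Milo
```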
However, I might have just committed ARK blasphemy. In N2T, it appears that the ARK prefix and provider code stay grouped together in the front half like ark:/53355/ and then the back half cl010277627 represents the local unique identifier. This is very similar to the two-layer identifiers in DOI and the arbitrary number of layers of identifiers in OID.
The point is, if we can interpret this enough like CURIEs, we can use the curies package to implement a resolver. The first step we can take is to download the N2T data from https://n2t.net/e/n2t_full_prefixes.yaml. Then we can parse out the ARKs (there are other things in N2T we’ll disregard) with the following code:
import pystow
import yaml

URL = "https://n2t.net/e/n2t_full_prefixes.yaml"
PROTOCOLS = {"https://", "http://", "ftp://"}

def get_prefix_map():
    """Get the prefix map from N2T, not including redundant ``ark:/`` in prefixes."""
    with pystow.ensure_open("n2t", url=URL) as file:
        records = yaml.safe_load(file)
    prefix_map = {}
    for key, record in records.items():
        uri_prefix = record.get("redirect")
        if (
            not uri_prefix
            or all(not uri_prefix.startswith(protocol) for protocol in PROTOCOLS)
            or uri_prefix.count("$id") != 1
            or not uri_prefix.endswith("$id")
            or not key.startswith("ark:/")
        ):
            continue
        key = key.removeprefix("ark:/")
        prefix_map[key] = uri_prefix.removesuffix("$id") + "/" + key + "/"
    return prefix_map
This prefix map removes ark:/ from the beginning of the prefixes in N2T and also adds the provider code into the URI prefix to make the URIs more focused on the local unique identifiers within each provider, rather than the entire ARK space.
Once we have a prefix map, we can make a curies.Converter and a Flask web application for resolving in a few lines:
from curies import Converter, get_flask_app

def get_app():
    """Get an ARK resolver app, noting that it uses a non-standard delimiter and URL prefix."""
    prefix_map = get_prefix_map()
    print(prefix_map)
    converter = Converter.from_prefix_map(prefix_map, delimiter="/")
    app = get_flask_app(converter, blueprint_kwargs=dict(url_prefix="/ark:"))
    return app
The two tricks here are:

- We strip ark:/ and then interpret the ARK provider code as the prefix and the rest as the local unique identifier. However, we still want to be able to write URLs in our resolver that have the ark:/ prefix. Luckily, Flask has the facility to define a default url_prefix for a given blueprint, which we invoke directly.
- Instead of a colon : as the delimiter between the prefix and local unique identifier, ARKs use a slash /. We can also set this in the Converter’s settings.

Now, all we need to do is instantiate the app and serve it with any WSGI tool like Gunicorn, Uvicorn, or Flask’s built-in development server (from Werkzeug). Navigating to http://localhost:5000/ark:/53355/cl010277627 redirects to https://collections.louvre.fr/ark:/53355/cl010277627 and gets some nice art from the Louvre. In general, you can stick any ARK after http://localhost:5000/ark: that is resolvable via N2T when running this server.
All of this code is on GitHub and can be run with the following:
git clone https://github.com/cthoyt/n2t-ark-resolver
cd n2t-ark-resolver
python -m pip install -r requirements.txt
python wsgi.py
Update: since posting this, I have heard from John Kunze that the ARK format is currently being updated to look more like URNs and therefore not have the slash after ark: anymore. If/when that happens, there are only a few bits of string pre-processing in this script that need to be updated to keep everything running.
Here’s a video of us playing. If you love it, let me know. If you hate it, slap like and subscribe.
The show is on April 8th at the Live Music Hall in Cologne, Germany. Because this is Easter weekend, it’s the perfect thing to do while you’re relaxing with your family. The show starts at 15.30 CEST / 9:30AM EST. The order of the 10 bands playing will be determined on the morning of, so I’ll send out a mass text about what time that will be for everyone who’s streaming.
The battle of the bands is judged in two ways: half of the score is by the judges and half by the audience. Each audience member gets two votes - this usually means they vote for the band they support and a second band that they liked after seeing them for the first time.
There will be 10 bands playing, which means this is going to be a loooooong day. The best spots in the show will be in the middle or towards the end, after the afternoon settles into evening. The order will be determined just before the show starts, based on which band sells the most pre-sale tickets. This means that getting a streaming ticket will support us to get a better spot, which is highly correlated with winning.
As you noticed, we’re the representatives of Monkey Jack. We will tell The Tale of Monkey Jack at our show. You should prepare by bringing a banana with you to the show, which you will need at the end of the Tale when we begin The Ritual.
First, navigate here for tickets.
Second, click Streamingtickets. If you want to be cool while speaking German, you should throw in some English words (or internationalisms).
Third, click Auswählen. This verb means that you are pledging your allegiance to Monkey Jack and promise to follow Him.
Fourth, click Bitte Wählen (please choose). This is a drop down menu to show your support for Monkey Jack and his representatives. Note that you only need one streaming ticket per stream, obviously you should throw a party/ritual to represent Monkey Jack yourself.
Fifth, click Monkey Jack.
Sixth, you can fill in the form with your information. The image below annotates what each of the fields means. PLZ is short for “Postleitzahl”, which means zip code. Don’t worry about the part of the form with the country picker. You’re a German now.
After you click it, it will bring you to a PayPal page. They’ll email you a confirmation within 5-10 minutes and send the streaming link the day of the show.
The OBO Foundry is a set of independent, interoperable biomedical ontologies that aspire to using shared development principles. One such principle is to use a principled approach for creating persistent uniform resource locators (PURLs) for local unique identifiers in each ontology. These PURLs follow the form http://purl.obolibrary.org/obo/<PREFIX>_<LOCAL UNIQUE IDENTIFIER>. For example, a prefix might be GO (for the Gene Ontology) and a local unique identifier might be 0032571 (for response to vitamin K in GO), resulting in the PURL http://purl.obolibrary.org/obo/GO_0032571.
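Since the PURL pattern is completely regular, constructing one is a one-liner; here’s a trivial illustration using the GO example above.

```python
def obo_purl(prefix: str, identifier: str) -> str:
    """Construct an OBO Foundry PURL from a prefix and local unique identifier."""
    return f"http://purl.obolibrary.org/obo/{prefix}_{identifier}"


assert obo_purl("GO", "0032571") == "http://purl.obolibrary.org/obo/GO_0032571"
```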
While most semantic web resources allow the use of any IRIs (internationalized resource identifiers), the OBO Foundry enforces that its PURLs resolve to something useful for readers (e.g., to the Ontology Lookup Service). The resolver behind http://purl.obolibrary.org is implemented and maintained in a GitHub repository with corresponding .htaccess files for each OBO Foundry ontology. Correct and useful configuration for each ontology is a requirement for acceptance to the OBO Foundry.
At the core of the OBO Foundry are several high quality, well-known, generally useful ontologies such as the Gene Ontology and the Cell Ontology. Inclusion in the OBO Foundry has therefore become a de facto stamp of approval for ontologies that (until now) 254 ontologies have (for better or worse) successfully sought out.
Unfortunately, some ontologies and controlled vocabularies have adopted OBO PURLs even though they are not OBO Foundry ontologies. This is a problem for a few reasons: such PURLs are not configured in the OBO PURL resolver, so they typically do not resolve, and they falsely suggest that the resource has been reviewed and accepted by the OBO Foundry.
One of the jobs of the Bioregistry is to catalog the URI format strings for identifier resources useful for the life and natural sciences. This allows us to assess how big the problem of non-OBO Foundry ontologies are using OBO PURLs, and why. Without further ado, here’s the list of offending resources that appear in the Bioregistry:
prefix | name | evidence | uri_prefix |
---|---|---|---|
aeon | Academic Event Ontology | curated | http://purl.obolibrary.org/obo/AEON_ |
cemo | COVID-19 epidemiology and monitoring ontology | extra | http://purl.obolibrary.org/obo/cemo.owl# |
covoc | CoVoc Coronavirus Vocabulary | curated | http://purl.obolibrary.org/obo/COVOC_ |
decipher | DECIPHER CNV Syndromes | biocontext | http://purl.obolibrary.org/obo/DECIPHER_ |
dermo | Human Dermatological Disease Ontology | curated | http://purl.obolibrary.org/obo/DERMO_ |
efo | Experimental Factor Ontology | biocontext | http://purl.obolibrary.org/obo/EFO_ |
gorel | GO Relations | biolink | http://purl.obolibrary.org/obo/GOREL_ |
hpath | Histopathology Ontology | curated | http://purl.obolibrary.org/obo/MC_ |
idocovid19 | COVID-19 Infectious Disease Ontology | curated | http://purl.obolibrary.org/obo/COVIDO_ |
lbo | Livestock Breed Ontology | curated | http://purl.obolibrary.org/obo/LBO_ |
lpt | Livestock Product Trait Ontology | curated | http://purl.obolibrary.org/obo/LPT_ |
mesh | Medical Subject Headings | biocontext | http://purl.obolibrary.org/obo/MESH_ |
msio | Metabolomics Standards Initiative Ontology | curated | http://purl.obolibrary.org/obo/MSIO_ |
omia | Online Mendelian Inheritance in Animals | biocontext | http://purl.obolibrary.org/obo/OMIA_ |
omim | Online Mendelian Inheritance in Man | biocontext | http://purl.obolibrary.org/obo/OMIM_ |
pride | PRIDE Controlled Vocabulary | curated | http://purl.obolibrary.org/obo/PRIDE_ |
reo | Reagent Ontology | curated | http://purl.obolibrary.org/obo/REO_ |
roleo | Role Ontology | curated | http://purl.obolibrary.org/obo/RoleO_ |
soybase | SoyBase | prefixcommons | http://purl.obolibrary.org/obo/ |
uniprot.isoform | UniProt Isoform | extra | http://purl.obolibrary.org/obo/UniProtKB_ |
vido | Virus Infectious Disease Ontology | curated | http://purl.obolibrary.org/obo/VIDO_ |
vsmo | Ontology for vector surveillance and management | curated | http://purl.obolibrary.org/obo/VSMO_ |
xl | Cross-linker reagents ontology | curated | http://purl.obolibrary.org/obo/XL_ |
In the evidence column, curated means the URI prefix was curated directly in the Bioregistry, while the other entries (biocontext, biolink, prefixcommons, extra) indicate the external source from which the URI prefix was imported or derived.
It’s worth noting that there are probably lots more resources doing this, e.g., that are listed in BioPortal, but have not been included in the Bioregistry because of their lack of notability, utility, or reuse.
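Reproducing a list like the one above is straightforward with the bioregistry Python package. Here’s a rough sketch that flags any resource whose URI prefix uses an OBO PURL but that has no OBO Foundry annotation; it assumes the package exposes read_registry() and that each record has get_uri_format() and get_obofoundry_prefix() helpers, so check the current API before relying on it.

```python
import bioregistry

OBO_PURL = "http://purl.obolibrary.org/obo/"

for prefix, resource in sorted(bioregistry.read_registry().items()):
    uri_format = resource.get_uri_format()  # e.g. "http://purl.obolibrary.org/obo/AEON_$1"
    if not uri_format or not uri_format.startswith(OBO_PURL):
        continue
    if resource.get_obofoundry_prefix():
        continue  # a real OBO Foundry ontology, so the PURL is sanctioned
    print(prefix, uri_format)
```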
Based on the table above, there are several situations in which an OBO PURL appears: either the developers/maintainers of the primary resource adopted an unsanctioned OBO PURL themselves, or a third-party resource (e.g., BioContext, Biolink, Prefix Commons) assigned one on their behalf.
It’s hard to know for sure what situations led the developers/maintainers of primary resources, or of third-party resources, to use unsanctioned OBO PURLs. Regardless, it’s still valuable for the community to know about these problems and potentially use comprehensive resources like the Bioregistry as a guide towards improving interoperability and interpretability.