Biopragmatics

My name is Charles Tapley Hoyt (he/his). I’m a scientist in the Institute of Inorganic Chemistry at RWTH Aachen University. I’m building my own research group focused on software development, data standardization/FAIRification/integration, and applications of ML/AI in the chemical, biological, and health sciences - specifically in drug discovery and precision medicine.

Through my position at RWTH Aachen University, I’m establishing academic collaborations through German, European, and international grants and developing project-based contracts for organizations with unmet business needs addressed by the semantic technologies and capabilities that I’ve developed and write about here. Privately, I can offer consulting services, speaking engagements, and training for organizations interested in these topics.

Here are some more details about me and my research. You can download my résumé (single page) or CV, or see my ORCID page at https://orcid.org/0000-0003-4423-4370. Content on this site is licensed as CC BY 4.0. See also my family recipe blog.

Posts

Jan 20, 2026
Challenges with Semantic Mappings
There are many challenges associated with the curation, publication, acquisition, and usage of semantic mappings. This post examines their philosophical, technical, and practical implications, highlights existing solutions, and describes opportunities for next steps for the community of curators, semantic engineers, software developers, and data scientists who make and use semantic mappings.
Jan 16, 2026
Semantic Mappings Enable Automated Assembly
Data and knowledge originating from heterogeneous sources often use heterogeneous controlled vocabularies and/or ontologies for annotating named entities. Semantic mappings are essential for resolving these discrepancies and integrating the sources in a coherent way. This post highlights how this looks in two scenarios: when constructing a knowledge graph for graph machine learning and when constructing comprehensive lexica for natural language processing, text mining, and curation.
Jan 15, 2026
Mapping from SSSOM to JSKOS
JSKOS (JSON for Knowledge Organization Systems) is a JSON-based data model for representing terminologies, thesauri, classifications, and other semantic artifacts. Like the Simple Standard for Sharing Ontological Mappings (SSSOM), it can also encode semantic mappings. This post is about developing and implementing a crosswalk between them in the sssom-pydantic Python package.
Jan 8, 2026
Mapping from SSSOM to Wikidata
At the 4th Ontologies4Chem Workshop in Limburg an der Lahn, I proposed an initial crosswalk between the Simple Standard for Sharing Ontological Mappings (SSSOM) and the Wikidata semantic mapping data model. This post describes the motivation for this proposal and the concrete implementation I’ve developed in sssom-pydantic.
Jan 6, 2026
Validating Prefix Maps in LinkML Schemas
LinkML enables defining data models and data schemas in YAML informed by semantic web best practices. As such, each definition includes a prefix map. Similarly to my previous posts on validating the prefix maps appearing in Turtle files and in unfamiliar SPARQL endpoints, this post describes a new extension to the Bioregistry that validates prefix maps in LinkML definitions.
Jan 1, 2026
Books I Read in 2025
Here are the books I read in 2025. My goal for the year was to get some more variety, and I think I managed that.
Dec 19, 2025
Annotating the Literature with Named Entity Recognition
Annotating the literature with mentions of key concepts from a given domain is often the first step towards extracting more substantial structured knowledge. This can be challenging, as it typically encompasses acquiring and processing the relevant literature and ontologies, then installing and applying difficult-to-use named entity recognition (NER) workflows. This post highlights software components I’ve implemented to simplify this workflow. I demonstrate it by annotating the biomedical literature available through PubMed with Medical Subject Headings (MeSH) terms, and also comment on how this can be generalized to other natural sciences, engineering, and humanities disciplines.
Dec 9, 2025
Machine-Actionable Training Materials at BioHackathon Germany 2025
I recently attended the 4th BioHackathon Germany hosted by the German Network for Bioinformatics Infrastructure (de.NBI). I participated in the project On the Path to Machine-actionable Training Materials in order to improve the interoperability between DALIA, TeSS, mTeSS-X, and Schema.org. This post gives a summary of the activities leading up to the hackathon and the results of our happy hacking.
Nov 23, 2025
Extracting Semantic Mappings from BioPortal in SSSOM
Earlier this week, a question was asked on the OBO Foundry Slack about where to find semantic mappings to terms in the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT). While some are available in the SeMRA Disease Mappings Database, there are many more available within BioPortal, which has access to the entire SNOMED-CT source data and has produced semantic mapping predictions using LOOM. This post is about how I implemented an API wrapper for generic OntoPortal instances’ mapping endpoints and a post-processing pipeline that converts OntoPortal’s custom mapping format into SSSOM.
Oct 14, 2025
Databases as Ontologies Part 2 - A Case Study with HGNC
This is the second of a two-part post about encoding databases as ontologies. In the first part, I gave a background on how I started working on this problem and the software stack I developed along the way. In this post, I explain the philosophy and design behind how I encoded the HGNC (HUGO Gene Nomenclature Committee) database as an ontology using PyOBO.
Oct 14, 2025
Databases as Ontologies Part 1 - Background and Software
This is the first of a two-part post about encoding databases as ontologies. In this post, I give a background on the problems in biocuration that led me to start encoding databases as ontologies, the software I have written to do it, and the repository I have created to store the resulting artifacts in a FAIR, open, and sustainable way. See also the second part which describes how I applied these tools to encode the HGNC (HUGO Gene Nomenclature Committee) database as an ontology.
Oct 7, 2025
Representing Negative Knowledge
Representing negative knowledge in the semantic web is an open problem. This post is going to be a living document where I keep notes on use cases, potential solutions, and awful hacks.
Oct 7, 2025
Bridging NFDI's culture and chemistry knowledge graphs
At the sixth NFDI4Chem consortium meeting, Torsten Schrade from the NFDI4Culture consortium gave a lovely and whimsical talk entitled A Data Alchemist’s Journey through NFDI which explored ways that we might federate and jointly query both consortia’s knowledge via their respective SPARQL endpoints. He proposed a toy example in which he linked paintings depicting alchemists trying to make gold to compounds containing gold. This post is about the steps I took to automate his toy example and extend it to not only chemicals or compounds represented in Iconclass, but also equipment and devices.
Sep 25, 2025
Suggesting new relations in ROR from Wikidata
I was looking at the different NFDI consortia in the Research Organization Registry (ROR), and found that the only two that have a parent relation to the NFDI (ror:05qj6w324) are NFDI4DS (ror:00bb4nn95) and MaRDI (ror:04ncnzm65). This felt strange to me, so I started looking around Wikidata to see if I could automatically make a curation sheet to send along to them. I found that Wikidata already has detailed pages for all NFDI consortia, and that they also include relationships to the parent. This blog post is about the steps I took to write a workflow to find relationships in Wikidata that are appropriate for submission to ROR.
Sep 21, 2025
Switching from using Tox to Just
I became aware of just while watching Hynek’s second video on uv a few months ago. I immediately fell in love with its elegance and simplicity, so I have begun replacing task running in my repositories that relied on tox with just. This post gives a bit of background and context, then walks through making the switch on one of my repositories that has some annoying dependencies.
Sep 11, 2025
Exploring an unfamiliar SPARQL endpoint with the Bioregistry - a case study from NFDI4Culture
Earlier this week at the sixth NFDI4Chem consortium meeting, Torsten Schrade from the NFDI4Culture consortium gave a lovely and whimsical talk entitled A Data Alchemist’s Journey through NFDI which explored ways that we might federate and jointly query both consortia’s knowledge via their respective SPARQL endpoints. This post is about the very first steps I took when looking into this new (to me) SPARQL endpoint, namely identifying which prefixes and semantic spaces are present, and about the new CLI tool I added to the Bioregistry to do this reproducibly.
Sep 4, 2025
Validating the FAIRness of knowledge graphs and ontologies in RDF using the Bioregistry
Using standard CURIE prefixes and URI prefixes in semantic web artifacts such as Resource Description Framework (RDF) promotes interoperability, enables reuse in downstream data integration, and makes data more FAIR. The Bioregistry defines a set of standard CURIE prefixes and URI prefixes against which RDF files can be validated/standardized. This blog post describes a new CLI tool bioregistry validate ttl in the Bioregistry Python package that can run validation on Turtle files (a common serialization of RDF).
Aug 26, 2025
A historical analysis of ChEMBL
I’ve recently submitted an article to the Journal of Open Source Software (JOSS) describing chembl-downloader, a Python package for automating downloading and using ChEMBL data in a reproducible way. In this post, I use chembl-downloader to show how the numbers of compounds, assays, activities, and other entities in ChEMBL have changed over time.
Aug 22, 2025
Measuring the impact of the Bioregistry
The Bioregistry is a database and toolchain for standardization of prefixes, CURIEs, and URIs that appear in linked (open) data. While I created it in 2019 as a component of PyOBO in order to support parsing database cross-references appearing in biomedical ontologies, it has since become an independent project with a community-driven governance model and much broader applications. This post is a first attempt to quantify its usage and impact.
Aug 22, 2025
The Bioregistry and BiomarkerKB
The Bioregistry is a community-driven registry of semantic spaces and their metadata. When I learned about BiomarkerKB at the International Society for Biocuration’s 18th Annual International Biocuration Conference, I was excited to curate new records (and prefixes) in the Bioregistry to cover BiomarkerKB’s semantic spaces on biomarkers. This post summarizes the discussions I’ve had with its maintainers, Jeet and Raja, throughout the Bioregistry curation process and also gives insight into how databases can benefit from being represented in the Bioregistry.
Aug 4, 2025
Text-based embeddings of ontology terms
The Ontology Lookup Service (OLS) is now indexing dense embeddings for ontology terms constructed from term labels, synonyms, and descriptions using LLMs. I maintain a Python client library for the OLS (ols-client) and was recently asked to implement a wrapper to the OLS’s API endpoint that exposes these embeddings. This post is a demo of how to use that code, and how I replicated the same embedding functionality with PyOBO to arbitrarily extend it to ontologies and databases not in OLS.
Apr 28, 2025
Inference over Semantic Mappings with SeMRA
Assembling and inferring missing semantic mappings is a timely problem in biomedical data and knowledge integration. I’ve been developing the Semantic Mapping Assembler and Reasoner (SeMRA) as a generic toolkit for this. In this blog post, I highlight its inference capabilities.
Apr 23, 2025
I wish I could unpack Callables in Python type annotations
Following the theme of my previous two posts, I’ve run into another typing conundrum where I want to unpack a pre-existing Callable into a class with Generic[P, T] where P is a parameter specification variable (i.e., ParamSpec).
Apr 22, 2025
Using ParamSpec with Python Generics
I’ve been working on applying strict static typing to my Python package class-resolver and ran into an interesting way of using generics in combination with parameter specification variables (i.e., ParamSpecs).
Apr 19, 2025
A dilemma with PEP-696 default generics when using optional static typing in Python
This post describes an issue I’ve had with writing correct types when using PEP-696 defaults in typing.TypeVar. I posted the exploration in a companion repository on GitHub.
Apr 17, 2025
The EFO_ID column in ChEMBL's drug indications table isn't what you think it is
ChEMBL periodically curates clinical trial information into its DRUG_INDICATION table. However, there are some weird inconsistencies in the way it references disease concepts in external vocabularies. This blog post is an exploration of that table.
Jan 23, 2025
Data Modeling and Integration with Clinical Trials
I’ve recently worked with clinical studies from ClinicalTrials.gov and other international registries. This post is a review on how to access data, a proposal for how it can be modeled using the Ontology for Biomedical Investigations (OBI), a proof-of-concept ontologization of ClinicalTrials.gov, and some insights into how this data can be integrated with other resources to address classical problems in drug discovery from a knowledge graph perspective.
Jan 18, 2025
Books I Read in 2024
Here are the books I read in 2024. If I were Dudley Dursley, I’d be very upset that I read one fewer new book than in 2023. But then, I’d remember that I re-read a lot of Cosmere in 2024 to prepare for Wind and Truth, which was great.
Jan 17, 2025
Exploring Event Venues in Wikidata
I was working on making data about scholarly conferences more FAIR and a big question crossed my mind: what are all the conference venues? This post is about some queries I wrote for Wikidata, data issues I found, a few drive-by curations that I did while looking for an answer, and my ideas for the future.
Dec 3, 2024
Notes on Open Source Funding
This stub post contains my notes about funding for open source software. It doesn’t follow a story like a lot of my posts, and is more like an ever-evolving notes sheet.
Dec 3, 2024
Downloading Audio from Soundcloud
Brandon Sanderson has been releasing a few chapters a week of his upcoming novel, Wind and Truth, on his publisher’s website leading up to its December 6th release. This includes the audiobook chapters, but they’re posted to Soundcloud and there’s no good way to listen at 1.6x speed. This post is a note sheet on how to download audio from Soundcloud and prepare it for my audiobook reader.
Nov 19, 2024
Dependency Groups and ReadTheDocs
PEP 735 introduced dependency groups in packaging metadata, which are complementary to optional dependencies in that they might not correspond to features in the package, but rather be something like development or release dependencies. I am slowly working towards updating my cookiecutter template cookiecutter-snekpack to use PEP 735. So far, uv and tox have released support - all that’s left is ReadTheDocs. This post summarizes the issue I added to their issue tracker and the following discussion.
Nov 5, 2024
Building Graphviz when installing PyGraphviz
Graphviz is software for graph visualization written in C. PyGraphviz provides a nice Python wrapper for it. The issue is that the way to get Python to know about the C headers changes every few months. I’ll try and keep this blog post updated every time there are some changes.
Sep 26, 2024
Some Haskell I Tried to Write
I’m working through making a contribution to pandoc that adds first-class support for author role annotations using the Contribution Role Taxonomy (CRediT) and also outputs compliant Journal Publishing Tag Set (JATS) XML. This has led me down a (losing) journey with learning the Haskell programming language, so I thought I would post a short note on a function I tried to understand.
Sep 20, 2024
Programmatic Access to a Wordpress User List
The International Society of Biocuration (ISB) partners with the journal Database to get discounts for its members when they publish there. This means the ISB’s executive committee needs to send a member list to the journal’s editor. Historically, this has been done manually by exporting the list from the membership management plugin in the ISB Wordpress blog once per month and emailing it to the editor. This post is about my journey trying to automate it.
Jun 8, 2024
Easier ORCID
The Open Researcher and Contributor Identifier (ORCID) database is an invaluable resource that supports the unambiguous identification of researchers. However, its first party data dump is too complex, verbose, and unstandardized for many use cases. This post describes open source software I wrote that automates downloading, processing, and exporting ORCID into a more usable form. I put the results on Zenodo under the CC0 license.
Mar 11, 2024
Discussions and Follow-ups from Biocuration 2024
I’ve just returned from the 17th Annual International Biocuration Conference at the Indian Biological Data Centre (IBDC) in Faridabad, India. I wanted to highlight some of the interesting conversations I had while I was there, and ideas for follow-up. Most were centered around the Bioregistry and the Semantic Mapping Assembler and Reasoner (SeMRA), which I gave an oral presentation on.
Jan 10, 2024
Semantic Pydantic
Using Pydantic for encoding data models and FastAPI for implementing APIs on top of them has become a staple for many Python programmers. When this intersects with the semantic web, linked open data, and the natural sciences, we are still lacking a bridge to annotate our data models and APIs to make them more FAIR (findable, accessible, interoperable, and reusable). In this post, we build an extension to Pydantic and FastAPI to annotate data models’ fields and API endpoints’ query, path, and other parameters using the Bioregistry, a comprehensive catalog of metadata about semantic spaces from the semantic web and the natural sciences.
Jan 1, 2024
Books I Read in 2023
I finally got back into reading! Over winter break 2022, I started the Stormlight Archive then followed up in 2023 by reading the entirety of Brandon Sanderson’s Cosmere, as well as some other fantasy, science fiction, and literary fiction. Here’s the list.
Sep 1, 2023
Unlocking UMLS
The Unified Medical Language System (UMLS) is a widely used biomedical and clinical vocabulary maintained by the United States National Library of Medicine. However, it is notoriously difficult to access and work with due to licensing restrictions and its complex download system. In the same vein as my previous posts about DrugBank and ChEMBL, this post describes open source software I’ve developed for downloading and working with this data. It also works for RxNorm, SemMedDB, SNOMED-CT, and any other data accessible through the UMLS Terminology Services (UTS) ticket granting system.
Aug 27, 2023
Reproducibility Pilot in the Journal of Cheminformatics
I’ve been working on improving reproducibility in the field of cheminformatics for some time now. For example, I’ve written posts about making data from DrugBank and ChEMBL more actionable. Over the last year, I’ve been preparing a concept with the editors of the Journal of Cheminformatics on how to include an assessment of reproducibility in reviews of manuscripts submitted to the journal. This has resulted in an editorial Improving reproducibility and reusability in the Journal of Cheminformatics as well as a call for papers. In this post, I want to summarize the first-generation review criteria we developed and give an example of them applied in practice.
Jun 22, 2023
Querying Journals and Publishers in Wikidata
Today’s short post is about three SPARQL queries I wrote to get bibliometric information about journals and publishers out of Wikidata.
Jun 8, 2023
Modeling and Querying Awards in Wikidata
I was recently nominated for the International Society for Biocuration’s Excellence in Biocuration Early Career Award (results will be announced on June 14th!). This made me curious about how to model nominations and awards on Wikidata. In this post, I’ll describe how to curate awards, nominations, recipients, and how to make SPARQL queries to get them.
Apr 11, 2023
Re-implementing the N2T ARK Resolver
Archival Resource Keys (ARKs) are a flavor of persistent identifiers like DOIs, URNs, and Handles that have the benefit of being free, flexible with what metadata gets attached, and natively able to resolve to web pages. Name-to-Thing (N2T) implements a resolver for a variety of ARKs, so this blog post is about how that resolver can be re-implemented with the curies Python package.
Mar 29, 2023
The Representatives of Monkey Jack - German Battle of the Bands Finale
This blog is normally about very serious science, but I’m taking a break from that for the evening to advertise my band’s upcoming show on April 8th in the SPH Music Masters Finale (aka, the German Battle of the Bands). We need your support! There are streaming tickets available, and this post has a guide on how to navigate the German website to get tickets (or just text me, I’ll hook you up).
Mar 11, 2023
Resources masquerading as OBO Foundry ontologies
Several controlled vocabularies and ontologies that aren’t themselves OBO Foundry ontologies use unsanctioned OBO PURLs. This post is about how to use the Bioregistry to identify which resources are doing this and to give some insight into how we arrived in this situation.
Jan 11, 2023
Compliance of Bioregistry Prefixes to the W3C Standard
This post gives a brief background on the formal definition of the syntax and semantics of compact uniform resource identifiers (CURIEs) from the World Wide Web Consortium (W3C) and investigates how many prefixes in the Bioregistry are compliant with the standard.
Jan 10, 2023
Idiomatic conversion between URIs and compact URIs
The semantic web and ontology communities needed a reusable Python package for converting between uniform resource identifiers (URIs) and compact URIs (CURIEs) that is reliable, idiomatic, generic, and performant. This post describes the curies Python package that fills this need.
Jan 4, 2023
Long-term Funding for Small Biomedical Databases
Way back in 2021, during the annual general assembly of the International Society for Biocuration (ISB) at the 14th Annual International Biocuration Conference (Biocuration 2021), there was a discussion about the notably underutilized budget of the society that resulted in an informal open call for ideas for new small funding schemes. Concurrently, discussions with external stakeholders for the relatively new (at the time) Bioregistry project often included questions about the sustainability and longevity of the resource. We had conservatively estimated it would cost about 100 USD/year to run the Bioregistry site, so this seemed like the perfect opportunity to ask for a small amount of funding distributed over a relatively long period of time. This post is about the more general reality of funding for small resources in the life sciences, how we petitioned the ISB for funding, and what happened next.
Jan 3, 2023
Promoting the longevity of curated scientific resources through open code, open data, and public infrastructure
The 16th Annual International Biocuration Conference (Biocuration 2023) is taking place in Padua, Italy from April 24-26, 2023. While I’m serving as a co-chair of the conference, I also think this is a great venue to communicate some of my thoughts on longevity and sustainability that have been gestating during the development of the Bioregistry and other Biopragmatics projects. This blog post contains the abstract I’ve submitted for oral presentation.
Jan 2, 2023
Connecting Preprints to Peer-reviewed Articles on Wikidata
After the BioCypher preprint went up on the arXiv, I checked in on the missing co-author items list on the Scholia page that reflects my Wikidata entry. In addition to the several co-authors of the BioCypher manuscript that I don’t know personally, I was curious to see which other papers of mine did not have fully complete co-author annotations. This post has a few SPARQL queries that I used to look into this as well as a few ongoing questions I have about the relationship between distinct entries for preprints and published articles.
Dec 19, 2022
Global Core Biodata Resources in the Bioregistry
The Global Biodata Coalition released a list of Global Core Biodata Resources (GCBRs) in December 2022, comprising 37 life science databases that they considered as having significant importance (selected following this procedure). While the Bioregistry does not generally cover databases, many notable databases have one or more associated semantic spaces that are relevant for inclusion. Accordingly, 33 of 37 of the GCBRs (that’s 89%) have one or more directly-related prefixes in the Bioregistry. This post gives some insight into this landscape.
Nov 15, 2022
A First Look at OpenCheck
There has been legitimate concern about the future of Twitter over the last week due to its new ownership and management. This is pretty upsetting considering how great it’s been to use to connect to and to follow other researchers. OpenCheck is currently working to map Twitter handles to ORCID identifiers and capture the directed follow graph of researchers on Twitter in case the service becomes unusable in the near future. This post is about my initial exploration of the resource. Update in November 2024 - OpenCheck has been shut down.
Feb 12, 2022
Curating Publications on Wikidata
This blog post is a tutorial on how to curate the links between a researcher and scholarly works (e.g., pre-prints, publications, presentations) on Wikidata using Scholia and the Author Disambiguator tool.
Feb 6, 2022
You Should Use a Private Email on Publications
While we were recently preparing to submit a manuscript, the lead author said they looked at my last few papers and noticed I always used a private email address instead of an institutional email address. They asked, perplexed, if they should also use my private email address with our submission. The answer was a resounding yes; always use a private email address. Here’s why.
Feb 6, 2022
Abstracting the parameters of a Machine Learning Model
As a follow-up to my previous post on refactoring and improving a machine learning model implemented with PyTorch, this post will be a tutorial on how to generalize the implementation of a multilayer perceptron (MLP) to use one of several potential non-linear activation functions in an elegant way.
Feb 6, 2022
Refactoring a Machine Learning Model
This blog post is a tutorial that will take you from a naive implementation of a multilayer perceptron (MLP) in PyTorch to an enlightened implementation that simultaneously leverages the power of PyTorch, Python’s built-ins, and some powerful third party Python packages.
Jan 24, 2022
The Official Rules of Python Packaging Speedrunning
I figured over the holiday break or early days of the new year, I’d catch up on some serious blogging. Instead, here’s my first post of 2022: a silly take on a topic I actually care a lot about. Here are the rules for Python Packaging Speedruns.
Dec 17, 2021
How to Pick a Unique Prefix
After the recent incident on the OBO Foundry where an inexperienced group submitted a new ontology request using a prefix that already existed in the BioPortal, there has been a renewed interest in implementing an automated solution to protect against this.
Oct 7, 2021
A Glossary for the Bioregistry and Biopragmatics Stack
There are a lot of terms that I’ve been throwing around when talking about the Bioregistry, so this blog post is a first draft of a glossary of all of them.
Sep 16, 2021
How to Curate the INDRA Database
With the recent paper on Gilda and the INDRA 2 and INDRA database papers coming up, I’ve put together a visual guide on how to curate statements extracted by INDRA through the web interface at https://db.indra.bio.
Sep 14, 2021
What's a CURIE, and Why You Should be Using Them
Compact uniform resource identifiers, or CURIEs, are an important formalism for referencing biomedical entities. This post explains what they are and how to write them yourself, and gives a brief outline of how they fit into the semantic web, linked open data, and open biomedical ontology worlds.
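As a rough illustration of the idea, a CURIE pairs a short prefix with a local identifier, and a prefix map says which URI prefix each CURIE prefix stands for. The sketch below uses only the standard library; the two-entry prefix map and the function names `expand` and `compress` are assumptions made for this example and are not the API of any particular package.

```python
# A tiny, hand-written prefix map for illustration only. In practice, prefix
# maps come from a registry rather than being typed out by hand.
PREFIX_MAP = {
    "chebi": "http://purl.obolibrary.org/obo/CHEBI_",
    "go": "http://purl.obolibrary.org/obo/GO_",
}


def expand(curie: str, prefix_map: dict[str, str] = PREFIX_MAP) -> str:
    """Turn a CURIE like ``chebi:138488`` into a full URI."""
    prefix, separator, identifier = curie.partition(":")
    if not separator:
        raise ValueError(f"not a CURIE (missing colon): {curie}")
    if prefix not in prefix_map:
        raise ValueError(f"unknown prefix: {prefix}")
    return prefix_map[prefix] + identifier


def compress(uri: str, prefix_map: dict[str, str] = PREFIX_MAP) -> str:
    """Turn a full URI back into a CURIE, preferring the longest URI prefix."""
    candidates = sorted(prefix_map.items(), key=lambda pair: len(pair[1]), reverse=True)
    for prefix, uri_prefix in candidates:
        if uri.startswith(uri_prefix):
            return f"{prefix}:{uri[len(uri_prefix):]}"
    raise ValueError(f"no URI prefix matches: {uri}")


uri = expand("chebi:138488")
curie = compress(uri)
```

Checking the longest URI prefix first is a small design choice that avoids mismatches when one URI prefix is a leading substring of another.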
Sep 13, 2021
How to Code with Me - Beyond Linters
This post is about the parts of my personal code style guide that go beyond what my flake8 plugins or black can enforce. I’ll try and update it over time.
Aug 28, 2021
Pre-loading a PostgreSQL Docker Container
PostgreSQL is a powerful relational database management system that can be easily downloaded and installed from its official image on DockerHub using Docker. However, it’s not so straightforward to pre-load your own data. This blog post is about preparing a derivative of the base PostgreSQL Docker image that’s preloaded with your own database and pushing it back to DockerHub for redistribution.
Aug 18, 2021
Machine Learning Needs More Generators
I’ve spent the last two days cleaning up some research machine learning code that blew up when I tried applying it to my own data due to memory constraints. This post is about the anti-pattern that caused this, how I fixed it, and how you can avoid it too.
Aug 17, 2021
Organizing the Public Data about a Researcher
In a previous post, I described how to formalize the information about a research organization using Wikidata. This post follows the same theme, but this time about a given researcher. Not only can you follow this post to make your own scientific profile easier to find and navigate, but you can also use Wikidata to improve the profiles of your co-workers and collaborators.
Aug 5, 2021
Reproducibly Loading the ChEMBL Relational Database
In his blog post, Some Thoughts on Comparing Classification Models, Pat Walters illustrated enlightened ways to convey the results of training and evaluating machine learning models on hERG activity data from ChEMBL (spoiler: it includes box plots). It started by querying the ChEMBL relational database, but featured a common issue that hampers reproducibility: hard-coded configuration to a local database based on a specific database (MySQL). This blog post is about how to address this using chembl_downloader and make code using ChEMBL’s SQL dump more reusable and reproducible.
Aug 4, 2021
Reproducibly Loading the ChEMBL SDF
ChEMBL is easily the most useful database in a cheminformatician’s toolbox, containing structural and activity information for millions of diverse compounds. In his recent blog post, Generalized Substructure Search, Greg Landrum highlighted some new RDKit features that enable more advanced substructure queries. It started by loading molecules from the ChEMBL 29 SDF dump, but it featured a common issue that hampers reproducibility: a hard-coded local file path to the ChEMBL data. This blog post is about how to address this using chembl_downloader and make code using ChEMBL’s SDF dump more reusable and reproducible.
Jul 26, 2021
Tales from the Bonner Ausländeramt
This is a more personal blog post about my experience as an American expat in Germany - specifically about my experiences at the Bonner Ausländeramt (the Foreigners’ Office of the City of Bonn).
Apr 19, 2021
Pythagorean Mean Rank Metrics
The mean rank (MR) and mean reciprocal rank (MRR) are among the most popular metrics reported for the evaluation of knowledge graph embedding models in the link prediction task. While they are reported on very different intervals ($\text{MR} \in [1,\infty)$ and $\text{MRR} \in (0,1]$), their deep theoretical connection can be elegantly described through the lens of Pythagorean means. This blog post describes ideas Max Berrendorf shared with me that I recently implemented in PyKEEN and later wrote up as a full manuscript.
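To make the connection concrete: the three Pythagorean means are the arithmetic, geometric, and harmonic means. The MR is the arithmetic mean of the ranks, and the MRR is the reciprocal of their harmonic mean. The following is a minimal standard-library sketch with hypothetical function names and made-up ranks; it is not the PyKEEN implementation.

```python
import math


def arithmetic_mean_rank(ranks: list[int]) -> float:
    """The ordinary mean rank (MR)."""
    return sum(ranks) / len(ranks)


def geometric_mean_rank(ranks: list[int]) -> float:
    """The geometric mean of the ranks, computed in log space."""
    return math.exp(sum(math.log(rank) for rank in ranks) / len(ranks))


def harmonic_mean_rank(ranks: list[int]) -> float:
    """The harmonic mean of the ranks."""
    return len(ranks) / sum(1 / rank for rank in ranks)


def mean_reciprocal_rank(ranks: list[int]) -> float:
    """The MRR, i.e., the arithmetic mean of the reciprocal ranks."""
    return sum(1 / rank for rank in ranks) / len(ranks)


# Made-up ranks of the true entity in three link prediction queries
ranks = [1, 2, 4]
amr = arithmetic_mean_rank(ranks)
gmr = geometric_mean_rank(ranks)
hmr = harmonic_mean_rank(ranks)
mrr = mean_reciprocal_rank(ranks)
```

Because the harmonic mean rank is exactly `1 / mrr`, it carries the same information as the MRR while living on the same interval $[1,\infty)$ as the MR, so the three means can be compared directly. For positive ranks they always satisfy harmonic ≤ geometric ≤ arithmetic.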
Apr 5, 2021
Current Perspectives on KGEMs in and out of Biomedicine
After many discussions with scientists from AstraZeneca’s knowledge graph and target prioritization platform (BIKG) about the PyKEEN knowledge graph embedding model package, I joined them in writing a review on biomedical knowledge graphs. I’m giving a talk in their group tomorrow - this blog post is a longer form of some ideas I’ll be presenting there. Here are the slides.
Feb 23, 2021
Explaining MCI Conversion with Path Queries to NeuroMMSig
In late 2017, I visited the Critical Path Institute in Tucson, Arizona with my colleague Daniel Domingo-Fernández to use our Alzheimer’s disease map encoded in the Biological Expression Language (BEL) and the tools we built with PyBEL to help contextualize their mild cognitive impairment (MCI) conversion models. We got very interesting results, but they had a major overlap with unpublished work of one of our colleagues on the role of KANSL1 in Alzheimer’s disease, so we never reported them. Last week, his paper finally made it to publication (congratulations, Sepehr!) so I thought it would be fun to rehash the old results and look at how the results might have changed over time with improvements to the underlying knowledge graph.
Feb 20, 2021
Adding Structured Data to Docstrings
Writing excellent documentation is crucial for open source software projects. It’s also a lot of hard work. While I consider tools like Sphinx combined with services like ReadTheDocs completely invaluable, I’ve recently hit a bit of a roadblock when it comes to making the README of a GitHub repository a bit more dynamic. This blog post is about the dark magic I invented as a solution (i.e., the docdata package).
Jan 23, 2021
Adding New Literature Sources to the Wikidata Integrator
Scholia is a powerful frontend for summarizing authors, publications, institutions, topics, etc. that draws content from Wikidata. However, the content that’s available in Wikidata depends on what has been manually curated by community members and what has been (semi-)automatically imported by scripts and bots. The Wikidata Integrator from the Su Lab at Scripps automates the import of bibliometric information from Crossref and Europe PMC. This blog post is about how I added functionality to it to import from three prominent preprint servers in the natural sciences (arXiv, bioRxiv, and ChemRxiv), which can serve as a guide to others who want to have content about their field included with this tool.
Jan 17, 2021
Organizing the Public Data about your Research Organization
If you’ve ever read a scientific paper, you know that the information that makes it into the author affiliations is a mess. I’m a big fan of Manubot and fully support its mission to upend the modern scientific publishing model. Like how they use structured ORCID identifiers for identifying authors in manuscript metadata, they are also working towards using ROR identifiers for organizations. There are still a few growing pains for ROR, so I chimed in on a discussion on GitHub about how Wikidata might be a potential solution for organizing and retrieving information about research organizations. I said I’d describe my idea in more detail, so here I go!
Jan 11, 2021
How to Code with Me - Wrapping a Flask App in a CLI
Previous posts in my “How to Code with Me” series have addressed packaging Python code and setting up a command line interface (CLI) using click. This post is about how to do this when your Python code is running a web application made with Flask and how to set it up to run through your CLI.
Dec 30, 2020
Pathway Relationships
Domingo-Fernandez et al. published ComPath: An ecosystem for exploring, analyzing, and curating mappings across pathway databases in 2018, describing the overlap between human pathways in KEGG, Reactome, and WikiPathways. A lot of the underlying machinery I developed to support this project has been improved since, and it’s time to extend the search to other organisms besides humans and to other databases. This blog post is about some additional relation types needed to capture the relations between pathways appearing in these databases.
Dec 14, 2020
Making DrugBank Reproducible
If you’re reading my blog, there’s a pretty high chance you’ve used DrugBank, a database of drug-target interactions, drug-drug interactions, and other high-granularity information about clinically-studied chemicals. DrugBank has two major problems, though: its data are password-protected, and its license does not allow redistribution. Time to solve these problems once and for all.
Dec 11, 2020
Scoring Inverse Triples
When training a knowledge graph embedding model with inverse triples, two scores are learned for every triple (h, r, t) - one for the original and one for the inverse triple (t, r', h). This blog post is about investigating when/why there might be meaningful differences between those scores depending on the dataset, model, and training assumption.
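To make the setup concrete, here is a minimal sketch of how a set of triples might be augmented with inverse triples before training. It does not use PyKEEN's API. The function name `add_inverse_triples`, the `_inverse` suffix, and the toy triple are all assumptions made for this illustration.

```python
def add_inverse_triples(triples, suffix="_inverse"):
    """Return the original triples plus one inverse triple (t, r', h) for each.

    Hypothetical helper for illustration only; the inverse relation r' is
    represented here by appending a suffix to the original relation label.
    """
    augmented = list(triples)
    for head, relation, tail in triples:
        # Swap head and tail and use a distinct relation for the inverse
        augmented.append((tail, relation + suffix, head))
    return augmented


# Toy triple with placeholder labels
triples = [("h1", "r1", "t1")]
augmented = add_inverse_triples(triples)
```

Since `r1` and `r1_inverse` are separate relations with their own learned representations, a model can assign different scores to `("h1", "r1", "t1")` and `("t1", "r1_inverse", "h1")` even though they encode the same fact.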
Dec 7, 2020
Generating Testing Knowledge Graphs with Literals
PyKEEN has a wide variety of functionality related to knowledge graph embedding models and handling various sources of knowledge graphs. This post describes the journey towards properly testing the functionality of an exotic set of knowledge graph embedding models that incorporate feature vectors for entities via triples with numeric literals.
Sep 17, 2020
Referring to SARS-CoV-2 Proteins in BEL
Many of the proteins in the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are cleavage products of the replicase polyprotein 1ab (uniprot:P0DTD1). Unfortunately, the bioinformatics community is not so comfortable with proteins like this and nomenclature remains tricky. Luckily, the Biological Expression Language (BEL) has exactly the right tool to encode information about these proteins using the fragment() function.
Jun 11, 2020
How to Code with Me - Making a CLI
One of the cardinal sins in computational science is to hard code a file path in your analysis. This post is a guide to reorganizing your code to avoid this and then to generate a command line interface (CLI) using click.
Jun 9, 2020
The Curation of Neurodegeneration Supporting Ontology
While I led the curation program in the Human Brain Pharmacome project during my Ph.D. from 2018-2019 at Fraunhofer, we built the Curation of Neurodegeneration Supporting Ontology (CONSO). This post outlines the project’s needs for quality control and re-curation that led to its generation, the curation process, and how CONSO constitutes an example of how to follow the guidelines I proposed in a previous blog post on building ontologies.
Jun 3, 2020
How to Code with Me - Organizing a Package
This blog post is the next installment in the series about all of the very particular ways I do software development in Python. This round is about where to put your code, your tests, your CLI, and the right metadata for each.
May 22, 2020
A Reading List of Academic Articles using the Biological Expression Language (BEL)
This post is evolving from a reading list to a review of the academic papers published that are either about or use the Biological Expression Language (BEL). It’s divided into the categories of software/visualization tools, algorithms/analytical frameworks, data integration, natural language processing, curation workflows, and downstream applications.
May 12, 2020
The Trouble with Ontologies, or, How to Build an Ontology
Everyone’s talking about biomedical ontologies! Let’s look at where most people go wrong and how to do it right.
Apr 30, 2020
A Listing of Publicly Available Content in the Biological Expression Language (BEL)
While many researchers have a pathway or pathology of interest, their first time curating content in the Biological Expression Language (BEL) may seem intimidating. This post lists several disease maps and BEL content sources that are directly available for re-use.
Apr 28, 2020
An Incomplete History of Selventa and the Biological Expression Language (BEL)
The company and community that surround the Biological Expression Language (BEL) are enigmatic, to say the least. This post represents the best I could do to tell the history of Selventa and BEL.
Apr 25, 2020
How to Code with Me - Flake8 Hell
As scientists, we place huge importance on the communication of our results. We spend lots of time on editing, revising, and formatting so people can understand what we did. We also write a lot of code, so why aren’t we investing the same amount of love? Enter, flake8.
Apr 19, 2020
Inspector Javert's Xref Database
On top of the issue of resolving identifiers to their names, the bioinformatics community has a hard time figuring out when two identifiers from different databases are equivalent. You know who else has the same problem? Inspector Javert. Get ready for a Les Misérables-themed post on how to address this long-standing problem.
Apr 18, 2020
Ooh Na Na, What's My Name?
We have a big problem in the bioinformatics community with namespaces, identifiers, and names. And nobody’s posed the question better than Rihanna herself.
Apr 15, 2020
Summarizing ChemRxiv
A few months ago, the question was posed on science Twitter: “How many people have published on ChemRxiv?”
Mar 20, 2020
How to Fix Your Monolithic Pull Request
We’ve all been there. You started a new branch from master. You had a very specific goal in mind, The Original Goal. You made a pull request (PR) to go with it, too, The Original Pull Request. But then, you had an idea! And also, someone on your team asked you to solve another problem! Now the original code you wrote to address The Original Goal relies on that code … and now you’ve got dozens of files changed, hundreds of lines of diff, and nobody (including you) can understand what you’ve done. Like I said, we’ve all been there. Here’s what you can do to fix it:
Feb 9, 2020
Host a Graduate Seminar Before Writing Your Thesis
The other day I saw a tweet lamenting the drag that is literature review during preparation for writing your thesis.
Jan 23, 2020
Encoding Biology in Knowledge Graphs
How many molecular biology papers have you read today? This week? This month? If you’re like me, it’s not so many, and we’re falling behind very quickly. Here’s a chart made by the new PubMed that summarizes how many papers were published mentioning RAS in the last 50 years.
Jan 22, 2020
Biosemantics vs. Biopragmatics
In language, semantics describes the names and meanings of words. The bioinformatics community has aptly adopted biosemantics as a concept that encompasses the issues with the names and meanings of biological entities, usually in natural language processing and data integration. However, semantics does not capture the context of words, and biosemantics fails to describe the biological context and complex relationships between biological entities.