My name is Charles Tapley Hoyt (he/his). I’m working in bio/cheminformatics - more specifically using biological knowledge graphs to generate testable hypotheses for drug discovery and precision medicine.
Here are some more details about me and my research. You can download my résumé (single page), CV, or see my ORCiD page at https://orcid.org/0000-0003-4423-4370. Content on this site is licensed as CC BY 4.0. See also my family recipe blog.
Posts
Exploring Event Venues in Wikidata
I was working on making data about scholarly conferences more FAIR and a big question crossed my mind: what are all the conference venues? This post is about some queries I wrote for Wikidata, data issues I found, a few drive-by curations that I did while looking for an answer, and my ideas for the future.
Downloading Audio from Soundcloud
Brandon Sanderson has been releasing a few chapters a week of his upcoming novel, Wind and Truth, on his publisher’s website leading up to its December 6th release. This includes the audiobook chapters, but they’re posted to Soundcloud and there’s no good way to listen at 1.6x speed. This post is a note sheet on how to download audio from Soundcloud and prepare it for my audiobook reader.
Dependency Groups and ReadTheDocs
PEP 735 introduced dependency groups in packaging metadata, which are complementary to optional dependencies in that they might not correspond to features in the package, but rather be something like development or release dependencies. I am slowly working towards updating my cookiecutter template cookiecutter-snekpack to use PEP 735. So far, uv and tox have released support - all that’s left is ReadTheDocs. This post summarizes the issue I added to their issue tracker and the following discussion.
Building Graphviz when installing PyGraphviz
Graphviz is software for graph visualization written in C. PyGraphviz provides a nice Python wrapper for it. The issue is that getting Python to know about the C headers changes every few months. I’ll try and keep this blog post updated every time there are some changes.
Some Haskell I Tried to Write
I’m working through making a contribution to pandoc that adds first-class support for author role annotations using the Contribution Role Taxonomy (CRediT) and also outputs compliant Journal Publishing Tag Set (JATS) XML. This has led me down a (losing) journey with learning the Haskell programming language, so I thought I would post a short note on a function I tried to understand.
Programmatic Access to a Wordpress User List
The International Society of Biocuration (ISB) partners with the journal Database to get discounts for its members when they publish there. This means the ISB’s executive committee needs to send a member list to the journal’s editor. Historically, this has been done manually by exporting the list from the membership management plugin in the ISB Wordpress blog once per month and emailing it to the editor. This post is about my journey trying to automate it.
Easier ORCID
The Open Researcher and Contributor Identifier (ORCID) database is an invaluable resource that supports the unambiguous identification of researchers. However, its first party data dump is too complex, verbose, and unstandardized for many use cases. This post describes open source software I wrote that automates downloading, processing, and exporting ORCID into a more usable form. I put the results on Zenodo under the CC0 license.
Discussions and Follow-ups from Biocuration 2024
I’ve just returned from the 17th Annual International Biocuration Conference at the Indian Biological Data Centre (IBDC) in Faridabad, India. I wanted to highlight some of the interesting conversations I had while I was there, and ideas for follow-up. Most were centered around the Bioregistry and the Semantic Mapping Assembler and Reasoner (SeMRA), which I gave an oral presentation on.
Semantic Pydantic
Using Pydantic for encoding data models and FastAPI for implementing APIs on top of them has become a staple for many Python programmers. When this intersects with the semantic web, linked open data, and the natural sciences, we are still lacking a bridge to annotate our data models and APIs to make them more FAIR (findable, accessible, interoperable, and reusable). In this post, we build an extension to Pydantic and FastAPI to annotate data models’ fields and API endpoints’ query, path, and other parameters using the Bioregistry, a comprehensive catalog of metadata about semantic spaces from the semantic web and the natural sciences.
Books I Read in 2023
I finally got back into reading! Over winter break 2022, I started the Stormlight Archive then followed up in 2023 by reading the entirety of Brandon Sanderson’s Cosmere, as well as some other fantasy, science fiction, and literary fiction. Here’s the list.
Unlocking UMLS
The Unified Medical Language System (UMLS) is a widely used biomedical and clinical vocabulary maintained by the United States National Library of Medicine. However, it is notoriously difficult to access and work with due to licensing restrictions and its complex download system. In the same vein as my previous posts about DrugBank and ChEMBL, this post describes open source software I’ve developed for downloading and working with this data. It also works for RxNorm, SemMedDB, SNOMED-CT, and any other data accessible through the UMLS Terminology Services (UTS) ticket granting system.
Reproducibility Pilot in the Journal of Cheminformatics
I’ve been working on improving reproducibility in the field of cheminformatics for some time now. For example, I’ve written posts about making data from DrugBank and ChEMBL more actionable. Over the last year, I’ve been preparing a concept with the editors of the Journal of Cheminformatics on how to include an assessment of reproducibility in reviews of manuscripts submitted to the journal. This has resulted in an editorial, Improving reproducibility and reusability in the Journal of Cheminformatics, as well as a call for papers. In this post, I want to summarize the first generation review criteria we developed, give an example of it applied in practice
Querying Journals and Publishers in Wikidata
Today’s short post is about three SPARQL queries I wrote to get bibliometric information about journals and publishers out of Wikidata.
Modeling and Querying Awards in Wikidata
I was recently nominated for the International Society for Biocuration’s Excellence in Biocuration Early Career Award (results will be announced on June 14th!). This made me curious about how to model nominations and awards on Wikidata. In this post, I’ll describe how to curate awards, nominations, recipients, and how to make SPARQL queries to get them.
Re-implementing the N2T ARK Resolver
Archival Resource Keys (ARKs) are a flavor of persistent identifiers like DOIs, URNs, and Handles that have the benefit of being free, flexible with what metadata gets attached, and natively able to resolve to web pages. Name-to-Thing (N2T) implements a resolver for a variety of ARKs, so this blog post is about how that resolver can be re-implemented with the curies Python package.
The Representatives of Monkey Jack - German Battle of the Bands Finale
This blog is normally about very serious science, but I’m taking a break from that for the evening to advertise my band’s upcoming show on April 8th in the SPH Music Masters Finale (aka, the German Battle of the Bands). We need your support! There are streaming tickets available, and this post has a guide on how to navigate the German website to get tickets (or just text me, I’ll hook you up).
Resources masquerading as OBO Foundry ontologies
Several controlled vocabularies and ontologies that aren’t themselves OBO Foundry ontologies use unsanctioned OBO PURLs. This post is about how to use the Bioregistry to identify which resources are doing this and to give some insight into how we arrived in this situation.
Compliance of Bioregistry Prefixes to the W3C Standard
This post gives a brief background on the formal definition of the syntax and semantics of compact uniform resource identifiers (CURIEs) from the World Wide Web Consortium (W3C) and investigates how many prefixes in the Bioregistry are compliant with the standard.
Idiomatic conversion between URIs and compact URIs
The semantic web and ontology communities needed a reusable Python package for converting between uniform resource identifiers (URIs) and compact URIs (CURIEs) that is reliable, idiomatic, generic, and performant. This post describes the curies Python package that fills this need.
Long-term Funding for Small Biomedical Databases
Way back in 2021, during the annual general assembly of the International Society for Biocuration (ISB) at the 14th Annual International Biocuration Conference (Biocuration 2021), there was a discussion about the notably underutilized budget of the society that resulted in an informal open call for ideas for new small funding schemes. Concurrently, discussions with external stakeholders for the relatively new (at the time) Bioregistry project often included questions about the sustainability and longevity of the resource. We had conservatively estimated it would cost about 100 USD/year to run the Bioregistry site, so this seemed like the perfect opportunity to ask for a small amount of funding distributed over a relatively long period of time. This post is about the more general reality of funding for small resources in the life sciences, how we petitioned the ISB for funding, and what happened next.
Promoting the longevity of curated scientific resources through open code, open data, and public infrastructure
The 16th Annual International Biocuration Conference (Biocuration 2023) is taking place in Padua, Italy from April 24-26th, 2023. While I’m serving as a co-chair of the conference, I also think this is a great venue to communicate some of my thoughts on longevity and sustainability that have been gestating during the development of the Bioregistry and other Biopragmatics projects. This blog post contains the abstract I’ve submitted for oral presentation.
Connecting Preprints to Peer-reviewed Articles on Wikidata
After the BioCypher preprint went up on the arXiv, I checked in on the missing co-author items list on the Scholia page that reflects my Wikidata entry. In addition to the several co-authors of the BioCypher manuscript that I don’t know personally, I was curious to see which other papers of mine did not have fully complete co-author annotations. This post has a few SPARQL queries that I used to look into this as well as a few ongoing questions I have about the relationship between distinct entries for preprints and published articles.
Global Core Biodata Resources in the Bioregistry
The Global Biodata Coalition released a list of Global Core Biodata Resources (GCBRs) in December 2022, comprising 37 life science databases that they considered as having significant importance (selected following this procedure). While the Bioregistry does not generally cover databases, many notable databases have one or more associated semantic spaces that are relevant for inclusion. Accordingly, 33 of 37 of the GCBRs (that’s 89%) have one or more directly-related prefixes in the Bioregistry. This post gives some insight into this landscape.
A First Look at OpenCheck
There has been legitimate concern about the future of Twitter over the last week due to its new ownership and management. This is pretty upsetting considering how great it’s been to use to connect to and to follow other researchers. OpenCheck is currently working to map Twitter handles to ORCID identifiers and capture the directed follow graph of researchers on Twitter in case the service becomes unusable in the near future. This post is about my initial exploration of the resource. Update in November 2024 - OpenCheck has been shut down.
Curating Publications on Wikidata
This blog post is a tutorial on how to curate the links between a researcher and scholarly works (e.g., pre-prints, publications, presentations) on Wikidata using Scholia and the Author Disambiguator tool.
You Should Use a Private Email on Publications
While we were recently preparing to submit a manuscript, the lead author said they looked at my last few papers and noticed I always used a private email address instead of an institutional email address. They asked, perplexed, if they should also use my private email address with our submission. The answer was a resounding yes; always use a private email address. Here’s why.
Abstracting the parameters of a Machine Learning Model
As a follow-up to my previous post on refactoring and improving a machine learning model implemented with PyTorch, this post will be a tutorial on how to generalize the implementation of a multilayer perceptron (MLP) to use one of several potential non-linear activation functions in an elegant way.
Refactoring a Machine Learning Model
This blog post is a tutorial that will take you from a naive implementation of a multilayer perceptron (MLP) in PyTorch to an enlightened implementation that simultaneously leverages the power of PyTorch, Python’s built-ins, and some powerful third party Python packages.
The Official Rules of Python Packaging Speedrunning
I figured over the holiday break or early days of the new year, I’d catch up on some serious blogging. Instead, here’s my first post of 2022: a silly take on a topic I actually care a lot about. Here are the rules for Python Packaging Speedruns.
How to Pick a Unique Prefix
After the recent incident on the OBO Foundry where an inexperienced group submitted a new ontology request using a prefix that already existed in the BioPortal, there has been a renewed interest in implementing an automated solution to protect against this.
A Glossary for the Bioregistry and Biopragmatics Stack
There are a lot of terms that I’ve been throwing around when talking about the Bioregistry, so this blog post is a first draft of a glossary of all of them.
How to Curate the INDRA Database
With the recent paper on Gilda and the INDRA 2 and INDRA database papers coming up, I’ve put together a visual guide on how to curate statements extracted by INDRA through the web interface at https://db.indra.bio.
What's a CURIE, and Why You Should be Using Them
Compact uniform resource identifiers, or CURIEs, are an important formalism for referencing biomedical entities. This post explains what they are, how to write them yourself, and a brief outline of how they fit in to the semantic web, linked open data, and open biomedical ontology worlds.
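As a rough illustration of the idea (and not the implementation from the post or from any particular package), here is a minimal sketch of expanding a CURIE into a URI and compressing it back. The prefix map and the expand and compress helpers are hypothetical and only cover two prefixes for demonstration.

```python
from typing import Optional

# Illustrative prefix map (an assumption for this sketch): each CURIE prefix
# points to the URI prefix that identifiers get appended to.
PREFIX_MAP = {
    "chebi": "http://purl.obolibrary.org/obo/CHEBI_",
    "uniprot": "http://purl.uniprot.org/uniprot/",
}


def expand(curie: str) -> str:
    """Turn a CURIE like ``prefix:identifier`` into a full URI."""
    prefix, identifier = curie.split(":", 1)
    return PREFIX_MAP[prefix] + identifier


def compress(uri: str) -> Optional[str]:
    """Turn a URI into a CURIE, or return None if no URI prefix matches."""
    for prefix, uri_prefix in PREFIX_MAP.items():
        if uri.startswith(uri_prefix):
            return f"{prefix}:{uri[len(uri_prefix):]}"
    return None


uri = expand("uniprot:P0DTD1")
curie = compress(uri)
```

A real prefix map would come from a registry rather than being written by hand, and would need to handle synonyms and overlapping URI prefixes, which this sketch ignores.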
How to Code with Me - Beyond Linters
This post is about my personal code style guidelines that are beyond the enforcement of my flake8 plugins or black. I’ll try and update it over time.
Pre-loading a PostgreSQL Docker Container
PostgreSQL is a powerful relational database management system that can be easily downloaded and installed from its official image on DockerHub using Docker. However, it’s not so straightforward to pre-load your own data. This blog post is about preparing a derivative of the base PostgreSQL Docker image that’s preloaded with your own database and pushing it back to DockerHub for redistribution.
Machine Learning Needs More Generators
I’ve spent the last two days cleaning up some research machine learning code that blew up when I tried applying it to my own data due to memory constraints. This post is about the anti-pattern that caused this, how I fixed it, and how you can avoid it too.
Organizing the Public Data about a Researcher
In a previous post, I described how to formalize the information about a research organization using Wikidata. This post follows the same theme, but this time about a given researcher. Not only can you follow this post to make your own scientific profile easier to find and navigate, but you can also use Wikidata to improve the profiles of your co-workers and collaborators.
Reproducibly Loading the ChEMBL Relational Database
In his blog post, Some Thoughts on Comparing Classification Models, Pat Walters illustrated enlightened ways to convey the results of training and evaluating machine learning models on hERG activity data from ChEMBL (spoiler: it includes box plots). It started by querying the ChEMBL relational database, but featured a common issue that hampers reproducibility: hard-coded configuration to a local database based on a specific database (MySQL). This blog post is about how to address this using chembl_downloader and make code using ChEMBL’s SQL dump more reusable and reproducible.
Reproducibly Loading the ChEMBL SDF
ChEMBL is easily the most useful database in a cheminformatician’s toolbox, containing structural and activity information for millions of diverse compounds. In his recent blog post, Generalized Substructure Search, Greg Landrum highlighted some new RDKit features that enable more advanced substructure queries. It started by loading molecules from the ChEMBL 29 SDF dump, but it featured a common issue that hampers reproducibility: a hard-coded local file path to the ChEMBL data. This blog post is about how to address this using chembl_downloader and make code using ChEMBL’s SDF dump more reusable and reproducible.
Tales from the Bonner Ausländeramt
This is a more personal blog post about my experience as an American expat in Germany - specifically about my experiences at the Bonner Ausländeramt (the Foreigners’ Office of the City of Bonn).
Pythagorean Mean Rank Metrics
The mean rank (MR) and mean reciprocal rank (MRR) are among the most popular metrics reported for the evaluation of knowledge graph embedding models in the link prediction task. While they are reported on very different intervals ($\text{MR} \in [1,\infty)$ and $\text{MRR} \in (0,1]$), their deep theoretical connection can be elegantly described through the lens of Pythagorean means. This blog post describes ideas Max Berrendorf shared with me that I recently implemented in PyKEEN and later wrote up as a full manuscript.
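To make the connection concrete, here is a minimal sketch using only the Python standard library (this is not PyKEEN's implementation). The ranks are made up for illustration and the function name is hypothetical. It computes the arithmetic, geometric, and harmonic mean of the ranks, and shows that the MRR is the reciprocal of the harmonic mean rank.

```python
import math
from statistics import fmean, geometric_mean, harmonic_mean


def pythagorean_rank_metrics(ranks):
    """Compute rank-based metrics from a list of positive integer ranks."""
    return {
        # Arithmetic mean of the ranks, i.e., the usual mean rank (MR)
        "arithmetic_mean_rank": fmean(ranks),
        # Geometric mean of the ranks
        "geometric_mean_rank": geometric_mean(ranks),
        # Harmonic mean of the ranks
        "harmonic_mean_rank": harmonic_mean(ranks),
        # Mean reciprocal rank (MRR), computed directly from its definition
        "mean_reciprocal_rank": fmean([1 / rank for rank in ranks]),
    }


# Hypothetical ranks of the true entities for three test triples
metrics = pythagorean_rank_metrics([1, 2, 4])

# The MRR is exactly the reciprocal of the harmonic mean rank
mrr_matches_harmonic = math.isclose(
    metrics["mean_reciprocal_rank"], 1 / metrics["harmonic_mean_rank"]
)
```

For any list of positive ranks, the harmonic mean is at most the geometric mean, which is at most the arithmetic mean, so the three metrics always appear in that order.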
Current Perspectives on KGEMs in and out of Biomedicine
After many discussions with scientists from AstraZeneca’s knowledge graph and target prioritization platform (BIKG) about the PyKEEN knowledge graph embedding model package, I joined them in writing a review on biomedical knowledge graphs. I’m giving a talk in their group tomorrow - this blog post is a longer form of some ideas I’ll be presenting there. Here are the slides.
Explaining MCI Conversion with Path Queries to NeuroMMSig
In late 2017, I visited the Critical Path Institute in Tucson, Arizona with my colleague Daniel Domingo-Fernández to use our Alzheimer’s disease map encoded in the Biological Expression Language (BEL) and the tools we built with PyBEL to help contextualize their mild cognitive impairment (MCI) conversion models. We got very interesting results, but they had a major overlap with unpublished work of one of our colleagues on the role of KANSL1 in Alzheimer’s disease, so we never reported them. Last week, his paper finally made it to publication (congratulations, Sepehr!) so I thought it would be fun to rehash the old results and look at how the results might have changed over time with improvements to the underlying knowledge graph.
Adding Structured Data to Docstrings
Writing excellent documentation is crucial for open source software projects. It’s also a lot of hard work. While I consider tools like Sphinx combined with services like ReadTheDocs completely invaluable, I’ve recently hit a bit of a roadblock when it comes to making the README of a GitHub repository a bit more dynamic. This blog post is about the dark magic I invented as a solution (i.e., the docdata package).
Adding New Literature Sources to the Wikidata Integrator
Scholia is a powerful frontend for summarizing authors, publications, institutions, topics, etc. that draws content from Wikidata. However, the content that’s available in Wikidata depends on what has been manually curated by community members and what has been (semi-) automatically imported by scripts and bots. The Wikidata Integrator from the Su Lab at Scripps automates the import of bibliometric information from Crossref and Europe PMC. This blog post is about how I added functionality to it to import from three prominent preprint servers in the natural sciences (arXiv, bioRxiv, and ChemRxiv) that can serve as a guide to others who want to have content about their field included with this tool.
Organizing the Public Data about your Research Organization
If you’ve ever read a scientific paper, you know that the information that makes it into the author affiliations is a mess. I’m a big fan of Manubot and fully support its mission to upend the modern scientific publishing model. Like how they use structured ORCID identifiers for identifying authors in manuscript metadata, they are also working towards using ROR identifiers for organizations. There are still a few growing pains for ROR, so I chimed in on a discussion on GitHub about how Wikidata might be a potential solution for organizing and retrieving information about research organizations. I said I’d describe my idea more in detail, so here I go!
How to Code with Me - Wrapping a Flask App in a CLI
Previous posts in my “How to Code with Me” series have addressed packaging Python code and setting up a command line interface (CLI) using click. This post is about how to do this when your Python code is running a web application made with Flask and how to set it up to run through your CLI.
Pathway Relationships
Domingo-Fernandez et al. published ComPath: An ecosystem for exploring, analyzing, and curating mappings across pathway databases in 2018, describing the overlap between human pathways in KEGG, Reactome, and WikiPathways. A lot of the underlying machinery I developed to support this project has been improved since, and it’s time to spread the search to other organisms besides humans and other databases. This blog post is about some additional relation types needed to capture the relations between pathways appearing in these databases.
Making DrugBank Reproducible
If you’re reading my blog, there’s a pretty high chance you’ve used DrugBank, a database of drug-target interactions, drug-drug interactions, and other high-granularity information about clinically-studied chemicals. DrugBank has two major problems, though: its data are password-protected, and its license does not allow redistribution. Time to solve these problems once and for all.
Scoring Inverse Triples
When training a knowledge graph embedding model with inverse triples, two scores are learned for every triple (h, r, t) - one for the original and one for the inverse triple (t, r', h). This blog post is about investigating when/why there might be meaningful differences between those scores depending on the dataset, model, and training assumption.
Generating Testing Knowledge Graphs with Literals
PyKEEN has a wide variety of functionality related to knowledge graph embedding models and handling various sources of knowledge graphs. This post describes the journey towards properly testing the functionality of an exotic set of knowledge graph embedding models that incorporate feature vectors for entities via triples with numeric literals.
Referring to SARS-CoV-2 Proteins in BEL
Many of the proteins in the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are cleavage products of the replicase polyprotein 1ab (uniprot:P0DTD1). Unfortunately, the bioinformatics community is not so comfortable with proteins like this and nomenclature remains tricky. Luckily, the Biological Expression Language (BEL) has exactly the right tool to encode information about these proteins using the fragment() function.
How to Code with Me - Making a CLI
One of the cardinal sins in computational science is to hard code a file path in your analysis. This post is a guide to reorganizing your code to avoid this and then to generate a command line interface (CLI) using click.
The Curation of Neurodegeneration Supporting Ontology
While I led the curation program in the Human Brain Pharmacome project during my Ph.D. from 2018-2019 at Fraunhofer, we built the Curation of Neurodegeneration Supporting Ontology (CONSO). This post outlines the project’s needs for quality control and re-curation that led to its generation, the curation process, and how CONSO constitutes an example of how to follow the guidelines I proposed in a previous blog post on building ontologies.
How to Code with Me - Organizing a Package
This blog post is the next installment in the series about all of the very particular ways I do software development in Python. This round is about where to put your code, your tests, your CLI, and the right metadata for each.
A Reading List of Academic Articles using the Biological Expression Language (BEL)
This post is evolving from a reading list to a review of the academic papers published that are either about or use the Biological Expression Language (BEL). It’s divided into the categories of software/visualization tools, algorithms/analytical frameworks, data integration, natural language processing, curation workflows, and downstream applications.
The Trouble with Ontologies, or, How to Build an Ontology
Everyone’s talking about biomedical ontologies! Let’s look at where most people go wrong and how to do it right.
A Listing of Publicly Available Content in the Biological Expression Language (BEL)
While many researchers have a pathway or pathology of interest, their first time curating content in the Biological Expression Language (BEL) may seem intimidating. This post lists several disease maps and BEL content sources that are directly available for re-use.
An Incomplete History of Selventa and the Biological Expression Language (BEL)
The company and community that surround the Biological Expression Language (BEL) are enigmatic, to say the least. This post represents the best I could do to tell the history of Selventa and BEL.
How to Code with Me - Flake8 Hell
As scientists, we place huge importance on the communication of our results. We spend lots of time on editing, revising, and formatting so people can understand what we did. We also write a lot of code, so why aren’t we investing the same amount of love? Enter, flake8.
Inspector Javert's Xref Database
On top of the issue of resolving identifiers to their names, the bioinformatics community has a hard time figuring out when two identifiers from different databases are equivalent. You know who else has the same problem? Inspector Javert. Get ready for a Les Misérables-themed post on how to address this long-standing problem.
Ooh Na Na, What's My Name?
We have a big problem in the bioinformatics community with namespaces, identifiers, and names. And nobody’s posed the question better than Rihanna herself.
Summarizing ChemRxiv
A few months ago, the question was posed on science Twitter: “How many people have published on ChemRxiv?”
How to Fix Your Monolithic Pull Request
We’ve all been there. You started a new branch from master. You had a very specific goal in mind, The Original Goal. You made a pull request (PR) to go with it, too, The Original Pull Request. But then, you had an idea! And also, someone on your team asked you to solve another problem! Now the original code you wrote to address The Original Goal relies on that code … and now you’ve got dozens of files changed, hundreds of lines of diff, and nobody (including you) can understand what you’ve done. Like I said, we’ve all been there. Here’s what you can do to fix it:
Host a Graduate Seminar Before Writing Your Thesis
The other day I saw a tweet lamenting the drag that is literature review during preparation for writing your thesis.
Encoding Biology in Knowledge Graphs
How many molecular biology papers have you read today? This week? This month? If you’re like me, it’s not so many, and we’re falling behind very quickly. Here’s a chart made by the new PubMed that summarizes how many papers were published mentioning RAS in the last 50 years.
Biosemantics vs. Biopragmatics
In language, semantics describe the names and meanings of words. The bioinformatics community has aptly adopted biosemantics as a concept that encompasses the issues with the names and meanings of biological entities, usually in natural language processing and data integration. However, semantics does not capture the context of words, and biosemantics fails to describe the biological context and complex relationships between biological entities.