Charles Tapley Hoyt

My name is Charles Tapley Hoyt. I’m a Postdoctoral Research Fellow at Harvard Medical School working remotely from Germany. I work in bio/cheminformatics - more specifically, using biological knowledge graphs to generate testable hypotheses for drug discovery and precision medicine.

Here are some more details about me and my research. My résumé can be found here, and everything else is on ORCID at https://orcid.org/0000-0003-4423-4370. Content on this site is licensed as CC BY 4.0. See also my family recipe blog.

Posts

  • A First Look at OpenCheck

    There has been legitimate concern about the future of Twitter over the last week due to its new ownership and management. This is pretty upsetting considering how great it’s been for connecting with and following other researchers. OpenCheck is currently working to map Twitter handles to ORCID identifiers and capture the directed follow graph of researchers on Twitter in case the service becomes unusable in the near future. This post is about my initial exploration of the resource.

  • Curating Publications on Wikidata

    This blog post is a tutorial on how to curate the links between a researcher and scholarly works (e.g., pre-prints, publications, presentations) on Wikidata using Scholia and the Author Disambiguator tool.

  • You Should Use a Private Email on Publications

    While we were recently preparing to submit a manuscript, the lead author said they looked at my last few papers and noticed I always used a private email address instead of an institutional email address. They asked, perplexed, if they should also use my private email address with our submission. The answer was a resounding yes; always use a private email address. Here’s why.

  • Abstracting the parameters of a Machine Learning Model

    As a follow-up to my previous post on refactoring and improving a machine learning model implemented with PyTorch, this post will be a tutorial on how to generalize the implementation of a multilayer perceptron (MLP) to use one of several potential non-linear activation functions in an elegant way.
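    To give a flavor of the pattern, here’s a minimal, hypothetical sketch in plain Python rather than PyTorch - the registry and function names are illustrative, not the post’s actual code:

```python
import math

# Hypothetical registry mapping names to activation functions;
# in a PyTorch setting these would be torch.nn modules instead.
ACTIVATIONS = {
    "relu": lambda x: max(0.0, x),
    "tanh": math.tanh,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
}


def make_layer(activation: str = "relu"):
    """Return a toy 'layer' that applies the activation looked up by name."""
    try:
        return ACTIVATIONS[activation]
    except KeyError:
        raise ValueError(f"unknown activation: {activation}")


# Swapping the non-linearity is now a one-word change
layer = make_layer("tanh")
```

    The point of the lookup-by-name pattern is that the choice of non-linearity becomes a configurable string rather than a hard-coded class.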

  • Refactoring a Machine Learning Model

    This blog post is a tutorial that will take you from a naive implementation of a multilayer perceptron (MLP) in PyTorch to an enlightened implementation that simultaneously leverages the power of PyTorch, Python’s built-ins, and some powerful third party Python packages.

  • The Official Rules of Python Packaging Speedrunning

    I figured over the holiday break or early days of the new year, I’d catch up on some serious blogging. Instead, here’s my first post of 2022: a silly take on a topic I actually care a lot about. Here are the rules for Python Packaging Speedruns.

  • How to Pick a Unique Prefix

    After the recent incident on the OBO Foundry where an inexperienced group submitted a new ontology request using a prefix that already existed in BioPortal, there has been renewed interest in implementing an automated solution to protect against this.

  • A Glossary for the Bioregistry and Biopragmatics Stack

    There are a lot of terms that I’ve been throwing around when talking about the Bioregistry, so this blog post is a first draft of a glossary of all of them.

  • How to Curate the INDRA Database

    With the recent paper on Gilda and the upcoming INDRA 2 and INDRA Database papers, I’ve put together a visual guide on how to curate statements extracted by INDRA through the web interface at https://db.indra.bio.

  • What's a CURIE, and Why You Should be Using Them

    Compact uniform resource identifiers, or CURIEs, are an important formalism for referencing biomedical entities. This post explains what they are, how to write them yourself, and briefly outlines how they fit into the semantic web, linked open data, and open biomedical ontology worlds.
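    As a minimal sketch of the syntax: a CURIE pairs a prefix with a local identifier, separated by the first colon. The helper below is hypothetical, not taken from any particular library:

```python
def parse_curie(curie: str) -> tuple[str, str]:
    """Split a CURIE of the form prefix:identifier into its two parts.

    Only the first colon is used as the delimiter, since some local
    identifiers (e.g., in certain ontologies) themselves contain colons.
    """
    prefix, _, identifier = curie.partition(":")
    if not prefix or not identifier:
        raise ValueError(f"not a valid CURIE: {curie}")
    return prefix, identifier


# A hypothetical reference to a ChEBI term
print(parse_curie("chebi:1234"))  # prints ('chebi', '1234')
```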

  • How to Code with Me - Beyond Linters

    This post is about my personal code style guide, which goes beyond the enforcement of my flake8 plugins or black. I’ll try to update it over time.

  • Pre-loading a PostgreSQL Docker Container

    PostgreSQL is a powerful relational database management system that can be easily downloaded and installed from its official image on DockerHub using Docker. However, it’s not so straightforward to pre-load your own data. This blog post is about preparing a derivative of the base PostgreSQL Docker image that’s preloaded with your own database and pushing it back to DockerHub for redistribution.

  • Machine Learning Needs More Generators

    I’ve spent the last two days cleaning up some research machine learning code that blew up when I tried applying it to my own data due to memory constraints. This post is about the anti-pattern that caused this, how I fixed it, and how you can avoid it too.
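    A minimal sketch of the generator-based alternative, under the assumption that the anti-pattern was materializing a full list of batches in memory (this is illustrative, not the post’s actual code):

```python
from itertools import islice


def batched(iterable, batch_size):
    """Lazily yield fixed-size batches instead of building one big list.

    An eager version like `[items[i:i + n] for i in range(0, len(items), n)]`
    holds the whole dataset in memory at once; this generator holds only
    one batch at a time.
    """
    iterator = iter(iterable)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Works on any iterable, including streams too large to fit in memory
batches = batched(range(10), batch_size=4)
```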

  • Organizing the Public Data about a Researcher

    In a previous post, I described how to formalize the information about a research organization using Wikidata. This post follows the same theme, but this time about a given researcher. Not only can you follow this post to make your own scientific profile easier to find and navigate, but you can also use Wikidata to improve the profiles of your co-workers and collaborators.

  • Reproducibly Loading the ChEMBL Relational Database

    In his blog post, Some Thoughts on Comparing Classification Models, Pat Walters illustrated enlightened ways to convey the results of training and evaluating machine learning models on hERG activity data from ChEMBL (spoiler: it includes box plots). It started by querying the ChEMBL relational database, but featured a common issue that hampers reproducibility: hard-coded configuration for a local instance of a specific database system (MySQL). This blog post is about how to address this using chembl_downloader and make code using ChEMBL’s SQL dump more reusable and reproducible.

  • Reproducibly Loading the ChEMBL SDF

    ChEMBL is easily the most useful database in a cheminformatician’s toolbox, containing structural and activity information for millions of diverse compounds. In his recent blog post, Generalized Substructure Search, Greg Landrum highlighted some new RDKit features that enable more advanced substructure queries. It started by loading molecules from the ChEMBL 29 SDF dump, but it featured a common issue that hampers reproducibility: a hard-coded local file path to the ChEMBL data. This blog post is about how to address this using chembl_downloader and make code using ChEMBL’s SDF dump more reusable and reproducible.

  • Tales from the Bonner Ausländeramt

    This is a more personal blog post about my experience as an American expat in Germany - specifically, about my experiences at the Bonner Ausländeramt (the Foreigners’ Office of the City of Bonn).

  • Pythagorean Mean Rank Metrics

    The mean rank (MR) and mean reciprocal rank (MRR) are among the most popular metrics reported for the evaluation of knowledge graph embedding models in the link prediction task. While they are reported on very different intervals ($\text{MR} \in [1,\infty)$ and $\text{MRR} \in (0,1]$), their deep theoretical connection can be elegantly described through the lens of Pythagorean means. This blog post describes ideas Max Berrendorf shared with me that I recently implemented in PyKEEN and later wrote up as a full manuscript.
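    For context, both metrics are computed over the same ranks $r_1, \ldots, r_n$: the MR is their arithmetic mean, while the MRR is the arithmetic mean of the reciprocal ranks, i.e., the reciprocal of the harmonic mean of the ranks:

```latex
\text{MR}  = \frac{1}{n} \sum_{i=1}^{n} r_i,
\qquad
\text{MRR} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{r_i}
           = \left( \frac{n}{\sum_{i=1}^{n} 1/r_i} \right)^{-1}
```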

  • Current Perspectives on KGEMs in and out of Biomedicine

    After many discussions with scientists from AstraZeneca’s knowledge graph and target prioritization platform (BIKG) about the PyKEEN knowledge graph embedding model package, I joined them in writing a review on biomedical knowledge graphs. I’m giving a talk in their group tomorrow - this blog post is a longer form of some ideas I’ll be presenting there. Here are the slides.

  • Explaining MCI Conversion with Path Queries to NeuroMMSig

    In late 2017, I visited the Critical Path Institute in Tucson, Arizona with my colleague Daniel Domingo-Fernández to use our Alzheimer’s disease map encoded in the Biological Expression Language (BEL) and the tools we built with PyBEL to help contextualize their mild cognitive impairment (MCI) conversion models. We got very interesting results, but they had a major overlap with unpublished work of one of our colleagues on the role of KANSL1 in Alzheimer’s disease, so we never reported them. Last week, his paper finally made it to publication (congratulations, Sepehr!), so I thought it would be fun to rehash the old results and look at how they might have changed over time with improvements to the underlying knowledge graph.

  • Adding Structured Data to Docstrings

    Writing excellent documentation is crucial for open source software projects. It’s also a lot of hard work. While I consider tools like Sphinx combined with services like ReadTheDocs invaluable, I’ve recently hit a bit of a roadblock when it comes to making the README of a GitHub repository a bit more dynamic. This blog post is about the dark magic I invented as a solution (i.e., the docdata package).

  • Adding New Literature Sources to the Wikidata Integrator

    Scholia is a powerful frontend for summarizing authors, publications, institutions, topics, etc. that draws content from Wikidata. However, the content that’s available in Wikidata depends on what has been manually curated by community members and what has been (semi-) automatically imported by scripts and bots. The Wikidata Integrator from the Su Lab at Scripps automates the import of bibliometric information from Crossref and Europe PMC. This blog post is about how I added functionality to it to import from three prominent preprint servers in the natural sciences (arXiv, bioRxiv, and ChemRxiv) that can serve as a guide to others who want to have content about their field included with this tool.

  • Organizing the Public Data about your Research Organization

    If you’ve ever read a scientific paper, you know that the information that makes it into the author affiliations is a mess. I’m a big fan of Manubot and fully support its mission to upend the modern scientific publishing model. Just as they use structured ORCID identifiers for identifying authors in manuscript metadata, they are also working towards using ROR identifiers for organizations. There are still a few growing pains for ROR, so I chimed in on a discussion on GitHub about how Wikidata might be a potential solution for organizing and retrieving information about research organizations. I said I’d describe my idea in more detail, so here I go!

  • How to Code with Me - Wrapping a Flask App in a CLI

    Previous posts in my “How to Code with Me” series have addressed packaging Python code and setting up a command line interface (CLI) using click. This post is about how to do this when your Python code is running a web application made with Flask and how to set it up to run through your CLI.

  • Pathway Relationships

    Domingo-Fernández et al. published ComPath: An ecosystem for exploring, analyzing, and curating mappings across pathway databases in 2018, describing the overlap between human pathways in KEGG, Reactome, and WikiPathways. A lot of the underlying machinery I developed to support this project has since been improved, and it’s time to spread the search to other organisms besides humans and to other databases. This blog post is about some additional relation types needed to capture the relations between pathways appearing in these databases.

  • Making DrugBank Reproducible

    If you’re reading my blog, there’s a pretty high chance you’ve used DrugBank, a database of drug-target interactions, drug-drug interactions, and other high-granularity information about clinically-studied chemicals. DrugBank has two major problems, though: its data are password-protected, and its license does not allow redistribution. Time to solve these problems once and for all.

  • Scoring Inverse Triples

    When training a knowledge graph embedding model with inverse triples, two scores are learned for every triple (h, r, t) - one for the original and one for the inverse triple (t, r', h). This blog post is about investigating when/why there might be meaningful differences between those scores depending on the dataset, model, and training assumption.
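    To make the setup concrete, here is a hypothetical sketch of the augmentation itself. The naming convention for r' is an assumption for illustration; this is not PyKEEN’s implementation:

```python
def add_inverse_triples(triples):
    """For each (h, r, t), also emit the inverse triple (t, r', h).

    Here r' is materialized by suffixing the relation name; libraries
    differ in how they actually represent the inverse relation.
    """
    augmented = []
    for head, relation, tail in triples:
        augmented.append((head, relation, tail))
        augmented.append((tail, relation + "_inverse", head))
    return augmented
```

    After augmentation, the model learns separate scores for the original and inverse directions of every triple.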

  • Generating Testing Knowledge Graphs with Literals

    PyKEEN has a wide variety of functionality related to knowledge graph embedding models and handling various sources of knowledge graphs. This post describes the journey towards properly testing the functionality of an exotic set of knowledge graph embedding models that incorporate feature vectors for entities via triples with numeric literals.

  • Referring to SARS-CoV-2 Proteins in BEL

    Many of the proteins in the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are cleavage products of the replicase polyprotein 1ab (uniprot:P0DTD1). Unfortunately, the bioinformatics community is not so comfortable with proteins like this and nomenclature remains tricky. Luckily, the Biological Expression Language (BEL) has exactly the right tool to encode information about these proteins using the fragment() function.

  • How to Code with Me - Making a CLI

    One of the cardinal sins in computational science is to hard code a file path in your analysis. This post is a guide to reorganizing your code to avoid this and then to generate a command line interface (CLI) using click.

  • The Curation of Neurodegeneration Supporting Ontology

    While I led the curation program in the Human Brain Pharmacome project during my Ph.D. from 2018-2019 at Fraunhofer, we built the Curation of Neurodegeneration Supporting Ontology (CONSO). This post outlines the project’s needs for quality control and re-curation that led to its generation, the curation process, and how CONSO constitutes an example of how to follow the guidelines I proposed in a previous blog post on building ontologies.

  • How to Code with Me - Organizing a Package

    This blog post is the next installment in the series about all of the very particular ways I do software development in Python. This round is about where to put your code, your tests, your CLI, and the right metadata for each.

  • A Reading List of Academic Articles using the Biological Expression Language (BEL)

    This post is evolving from a reading list to a review of the academic papers published that are either about or use the Biological Expression Language (BEL). It’s divided into the categories of software/visualization tools, algorithms/analytical frameworks, data integration, natural language processing, curation workflows, and downstream applications.

  • The Trouble with Ontologies, or, How to Build an Ontology

    Everyone’s talking about biomedical ontologies! Let’s look at where most people go wrong and how to do it right.

  • A Listing of Publicly Available Content in the Biological Expression Language (BEL)

    While many researchers have a pathway or pathology of interest, their first time curating content in the Biological Expression Language (BEL) may seem intimidating. This post lists several disease maps and BEL content sources that are directly available for re-use.

  • An Incomplete History of Selventa and the Biological Expression Language (BEL)

    The company and community that surround the Biological Expression Language (BEL) are enigmatic, to say the least. This post represents the best I could do to tell the history of Selventa and BEL.

  • How to Code with Me - Flake8 Hell

    As scientists, we place huge importance on the communication of our results. We spend lots of time on editing, revising, and formatting so people can understand what we did. We also write a lot of code, so why aren’t we investing the same amount of love? Enter, flake8.

  • Inspector Javert's Xref Database

    On top of the issue of resolving identifiers to their names, the bioinformatics community has a hard time figuring out when two identifiers from different databases are equivalent. You know who else has the same problem? Inspector Javert. Get ready for a Les Misérables-themed post on how to address this long-standing problem.

  • Ooh Na Na, What's My Name?

    We have a big problem in the bioinformatics community with namespaces, identifiers, and names. And nobody’s posed the question better than Rihanna herself.

  • Summarizing ChemRxiv

    A few months ago, the question was posed on science Twitter: “How many people have published on ChemRxiv?”

  • How to Fix Your Monolithic Pull Request

    We’ve all been there. You started a new branch from master. You had a very specific goal in mind, The Original Goal. You made a pull request (PR) to go with it, too, The Original Pull Request. But then, you had an idea! And also, someone on your team asked you to solve another problem! Now the original code you wrote to address The Original Goal relies on that code … and now you’ve got dozens of files changed, hundreds of lines of diff, and nobody (including you) can understand what you’ve done. Like I said, we’ve all been there. Here’s what you can do to fix it:

  • Host a Graduate Seminar Before Writing Your Thesis

    The other day I saw a tweet lamenting the drag that is literature review during preparation for writing your thesis.

  • Encoding Biology in Knowledge Graphs

    How many molecular biology papers have you read today? This week? This month? If you’re like me, it’s not so many, and we’re falling behind very quickly. Here’s a chart made by the new PubMed that summarizes how many papers were published mentioning RAS in the last 50 years.

  • Biosemantics vs. Biopragmatics

    In language, semantics describe the names and meanings of words. The bioinformatics community has aptly adopted biosemantics as a concept that encompasses the issues with the names and meanings of biological entities, usually in natural language processing and data integration. However, semantics does not capture the context of words, and biosemantics fails to describe the biological context and complex relationships between biological entities.

subscribe via RSS