Reproducibly Loading the ChEMBL Relational Database

In his blog post, Some Thoughts on Comparing Classification Models, Pat Walters illustrated enlightened ways to convey the results of training and evaluating machine learning models on hERG activity data from ChEMBL (spoiler: it includes box plots). It started by querying the ChEMBL relational database, but featured a common issue that hampers reproducibility: hard-coded configuration to a local database based on a specific database (MySQL). This blog post is about how to address this using chembl_downloader and make code using ChEMBL’s SQL dump more reusable and reproducible.

While the original blog post pointed to code PatWalters/comparing_classifiers, there’s an updated version at PatWalters/jcamd_model_comparison that includes the code that queries ChEMBL (among other things). The original notebook began like this in cells 2 and 4 (edited for clarity):

import mysql.connector as sql
import pandas as pd

sql = """
select
    canonical_smiles,
    cs.molregno,
    md.chembl_id as mol_chembl_id,
    standard_relation, standard_value, standard_type, standard_units, description,
    td.organism, assay_type, confidence_score,
    td.pref_name, td.chembl_id as tgt_chembl_id
from activities act
    join assays ass on act.assay_id = ass.assay_id
    join target_dictionary td on td.tid = ass.tid
    join compound_structures cs on cs.molregno = act.molregno
    join molecule_dictionary md on md.molregno = cs.molregno
where ass.tid = 165
    and assay_type in ('B','F')
    and standard_value is not null
    and standard_units = 'nM'
    and act.standard_relation is not null
    and standard_type = 'IC50'
    and standard_relation = '='
"""

with sql.connect(
    host='localhost',
    database='chembl_26',
    user='pwalters',
    password='itsasecret',
) as con:
    df = pd.read_sql(sql, con=con)

There are two main issues with this code:

It uses a specific database (MySQL) and user-specific connection configuration which means:
- Nobody else can run this code without editing it.
- There’s no information about how to get, preprocess, or load the database before running the script.
It relies on a specific version of ChEMBL, which means that we can’t benefit from new compounds and assays in new releases without editing it.

To be fair, this is from a Jupyter notebook that’s not necessarily supposed to be reused. But what if were so easy to fix this anti-pattern that there’s no excuse not to? Here’s how using the chembl_downloader Python package:

import chembl_downloader
import pandas as pd

sql = ...  # omitted for brevity

version = "26"
with chembl_downloader.connect(version=version) as con:
    df = pd.read_sql(sql, con=con)

With only a single (logical) line changed, this code now knows how to download the ChEMBL 26 SQLite dump from the source, store it in a deterministic location, extract it, and load with SQLite.

This means that anyone can run it without knowing how to download ChEMBL themselves, which version to get, how to name the file, or where to put it on their machine. It also relies on SQLite, which is effectively available on all devices that run Python and has exactly the same programmatic API, but without the need to run or connect to extra software. While a RDBMS like MySQL might be more powerful for some kinds of queries, the difference is negligible when querying single assays. It also implicitly solves the problem that the user doesn’t know if there was any pre-processing done to the file.

Under the hood, it’s using the pystow package to deterministically pick a folder (~/.data/chembl/26/) into which the file ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_26/chembl_26_sqlite.tar.gz is download (~/.data/chembl/26/chembl_26_sqlite.tar.gz).

Since the pattern of connecting to the database then running a SQL query with pandas is so common, the chembl_downloader has a query() function that wraps the two lines from the previous example:

import chembl_downloader

sql = ...  # omitted for brevity

version = "26"
df = chembl_downloader.query(sql, version=version)

Getting the Newest Version

What about making this code automatically updating to the newest version of ChEMBL? Just use the chembl_downloader.latest() to the latest version up for you. Under the hood, it’s using the bioversions package to do this.

import chembl_downloader

sql = ...  # omitted for brevity

version = chembl_downloader.latest()
df = chembl_downloader.query(sql, version=version)

Doing this is so common, that you can actually just omit the version argument and it will look up the latest for you.

Meaningful Improvement

I made a pull request to update the notebook based on these suggestions. The new notebook is here and features the most recent version of ChEMBL at the time of writing (ChEMBL 29) instead of ChEMBL 26. The table below shows how much improvement updating ChEMBL gives:

Flag	ChEMBL 26	ChEMBL 29	Increase	Percent Increase
Active	4191	4601	410	9%
Inactive	2048	2274	226	11%

I’d say getting 9% more actives and 11% more inactives basically for free by writing better code is a pretty big success. In this notebook, the AUC-ROC of the prominently presented LGBM classifier improved by about 1%. This could have just as easily have gone down, but I think it was worth checking.

One time, I received negative feedback from authors I asked why they hadn’t updated the analysis they presented in their manuscript using the newest version of ChEMBL (this was a few months ago, so the jump was from ChEMBL 25 to ChEMBL 28). One excuse they gave was that new data would (probably) not change their results. Depending on what kind of stuff you do, 1% might be a big deal. Or not.

When I got that feedback, I checked in on the code that had been released along with the manuscript to see if I could do it myself. It wasn’t pretty. I’d guess the authors really just didn’t want to ever touch their code again because it was very complicated, relied on tons of finnicky dependencies, and was overall written poorly. I don’t think shaming scientists for writing bad code is a very constructive nor a good way to motivate them to write better code. I’ve found on many occassions that authors usually just don’t have the right training or mindset to do reproducible/reusable science. A better solution is to offer pull requests to their code that demonstrates how to fix the issues and explain in detail how it works. Then, the best you can do is hope that they learn something and use it in their next publication.

So for the case of these authors, I looked into their downstream dependencies and began getting in touch with their code owners (maintainer would be a strong word) then sending pull requests to make them more reusable. I also ended up writing a bit of my own code to see if I could ultimately re-write the analysis to be a little more automatic. This chembl_downloader package is one of the tools I built along the way! I might come back and write a blog post about the original paper that caused the negative feedback too, because it was indeed a very cool paper! But first, I want to show that it can be reproduced.