Reproducibly Loading the ChEMBL SDF

ChEMBL is easily the most useful database in a cheminformatician’s toolbox, containing structural and activity information for millions of diverse compounds. In his recent blog post, Generalized Substructure Search, Greg Landrum highlighted some new RDKit features that enable more advanced substructure queries. It started by loading molecules from the ChEMBL 29 SDF dump, but it featured a common issue that hampers reproducibility: a hard-coded local file path to the ChEMBL data. This blog post is how to address this using chembl_downloader and make code using ChEMBL’s SDF dump more reusable and reproducible.

Getting Data Reproducibly

The code in the blog post began by loading up ChEMBL 29 like this (edited for clarity and imports omitted for brevity):

in_path = "/home/glandrum/Downloads/chembl_29.sdf.gz"

with gzip.open(in_path) as file:
    data = []
    for i, mol in enumerate(rdkit.Chem.ForwardSDMolSupplier(file)):
        ...
        data.append(...)

out_path = "../data/chembl29_sssdata.pkl"
with open(out_path, 'wb') as file:
    pickle.dump(data, file)

There are three main issues with this code:

It relies on a local file path to the ChEMBL data, which means:
- Nobody else can run this code without editing it.
- There’s no information on how to get or preprocess this file before running the script.
It relies on a specific version of ChEMBL, which means that we can’t benefit from new compounds in new releases without editing it.
It outputs data to a relative file path, which might not work based on the way the script is run or the directory structure on your drive

To be fair, this is a blog post that’s not necessarily supposed to be reused. But what if were so easy to fix this anti-pattern that there’s no excuse not to? Here’s how using the chembl_downloader Python package:

import chembl_downloader

version = "29"                                             # <-- This line changed for this example
in_path = chembl_downloader.download_sdf(version=version)  # <-- This line changed for this example

with gzip.open(in_path) as file:
    data = []
    for i, mol in enumerate(rdkit.Chem.ForwardSDMolSupplier(file)):
        ...
        data.append(...)

out_path = "../data/chembl29_sssdata.pkl"
with open(out_path, 'wb') as file:
    pickle.dump(data, file)

With only a single line changed, this code now knows how to download the ChEMBL 29 from the source and store it in a deterministic location on your hard drive. This means that anyone can run it without knowing how to download ChEMBL themselves, which version to get, how to name the file, or where to put it on their machine. It also implicitly solves the problem that the user doesn’t know if there was any pre-processing done to the file at "/home/glandrum/Downloads/chembl_29.sdf.gz". Under the hood, it’s using the pystow package to determinisically pick a folder (~/.data/chembl/29/) into which the file ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_29/chembl_29.sdf.gz is download (~/.data/chembl/29/chembl_29.sdf.gz).

Getting the Newest Version

What about making this code automatically updating to the newest version of ChEMBL? Just use the chembl_downloader.latest() to the latest version up for you. Under the hood, it’s using the bioversions package to do this.

import chembl_downloader

version = chembl_downloader.latest()                       # <-- This line changed for this example
in_path = chembl_downloader.download_sdf(version=version)

with gzip.open(in_path) as file:
    data = []
    for i, mol in enumerate(rdkit.Chem.ForwardSDMolSupplier(file)):
        ...
        data.append(...)

out_path = "../data/chembl29_sssdata.pkl"
with open(out_path, 'wb') as file:
    pickle.dump(data, file)

Note, if you omit the version argument completely, it automatically looks up the version as well. However, there’s one more thing to update before we’ve addressed our third point: where the file is output. There are two goals in fixing the output:

Make the path deterministic
Make the path based on the version of ChEMBL that’s being used, so if a newer version gets used, it doesn’t delete the old file

The solution comes by using pystow to pick a deterministic path, which the download_sdf() function is actually using under the hood, too:

import chembl_downloader

version = chembl_downloader.latest()
in_path = chembl_downloader.download_sdf(version=version)

with gzip.open(in_path) as file:
    data = []
    for i, mol in enumerate(rdkit.Chem.ForwardSDMolSupplier(file)):
        ...
        data.append(...)

import pystow                                                  # <-- This line changed for this example
out_path = pystow.join("chembl", version, name="sssdata.pkl")  # <-- This line changed for this example
with open(out_path, 'wb') as file:
    pickle.dump(data, file)

The pystow.join method creates a path to ~/.data/chembl/<version>/sssdata.pkl. Now this code is ready to stand the test of time and a variety of different uses!

Coda

Because the pattern of getting the SDF from ChEMBL then opening it with a ForwardSDMolSupplier is so common, it’s actually included in its own function supplier(). The code could be compressed one more time like:

import chembl_downloader

version = chembl_downloader.latest()

with chembl_downloader.supplier(version=version) as suppl:
    data = []
    for i, mol in enumerate(suppl):
        ...
        data.append(...)

import pystow

out_path = pystow.join("chembl", version, name="sssdata.pkl")
with open(out_path, 'wb') as file:
    pickle.dump(data, file)