A historical analysis of ChEMBL

I’ve recently submitted an article to the Journal of Open Source Software (JOSS) describing chembl-downloader, a Python package for automating downloading and using ChEMBL data in a reproducible way. In this post, I use chembl-downloader to show how the numbers of compounds, assays, activities, and other entities in ChEMBL have changed over time.

ChEMBL has made 37 releases so far. 35 of them have been major releases, and two have been minor releases (v22.1 and v24.1). While it only began bundling a SQLite dump of the database with v19, Eloy Felix recently informed me that the team constructed SQLite dumps for all previous versions, too. This is great, because I use the SQLite dump as the primary mechanism for querying the database.

chembl-downloader automates downloading any version of ChEMBL’s SQLite database, unpacking it from the TAR archive, connecting to it, and making a SQL query. It returns the results either as a Pandas DataFrame with chembl_downloader.query() or as a single value with chembl_downloader.query_scalar(), as in:

import chembl_downloader

sql = "SELECT * FROM activities LIMIT 5"
rows = chembl_downloader.query(sql, version=1)

sql = "SELECT COUNT(*) FROM activities"
count: int = chembl_downloader.query_scalar(sql, version=1)

These functions can be used to write a loop and run the same SQL query over every version of ChEMBL, with two caveats:

To get all versions of ChEMBL, the loop needs to include the minor releases v22.1 and v24.1. chembl_downloader.versions() provides convenient access to the list of versions to iterate over.
SQLite does not allow for opening a compressed database (without paying for an extension), so each version needs to be uncompressed. Unfortunately, most personal computers (including mine) don’t have enough hard disk space to hold an uncompressed copy of each version of ChEMBL.
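
The second caveat suggests an extract, query, delete pattern. Below is a minimal sketch of that pattern using only the Python standard library. It is not how chembl-downloader is implemented: the function name query_scalar_from_archive is hypothetical, and the sketch assumes the archive is a .tar.gz containing a single SQLite file whose name ends in .db.

```python
import sqlite3
import tarfile
import tempfile
from pathlib import Path


def query_scalar_from_archive(archive: Path, sql: str):
    """Extract a .tar.gz holding one SQLite file, run a scalar query, then clean up.

    Hypothetical helper for illustration; assumes the database file ends in ".db".
    """
    # The temporary directory (and the uncompressed database inside it)
    # is deleted as soon as the "with" block exits.
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(archive, mode="r:gz") as tar:
            tar.extractall(tmp)
        # Locate the extracted database, wherever it sits in the archive layout
        db_path = next(Path(tmp).rglob("*.db"))
        conn = sqlite3.connect(db_path)
        try:
            return conn.execute(sql).fetchone()[0]
        finally:
            # Close before the directory is removed so the file is not locked
            conn.close()
```

Calling this once per archive in a loop keeps at most one uncompressed database on disk at any time, at the cost of re-extracting an archive whenever it is queried again.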

Results of Temporal Analysis

I wrote a CLI utility, chembl_downloader history, which downloads, decompresses, analyzes, and then deletes each version of ChEMBL iteratively over the span of about three hours.

It summarizes the release date, number of compounds, number of named compounds (i.e., those with a pref_name), number of assays, number of activities, and several other entity types to (almost) match what’s summarized on the ChEMBL homepage. The results can be downloaded as a TSV and are as follows:

Version	Date	Compounds	Named Compounds	Assays	Activities	Documents	Targets	Cells	Tissues	Drug Warnings	Drug Indications	Drug Mechanisms
35	2024-12-01	2,496,335	42,231	1,740,546	21,123,501	92,121	16,003	2,129	782	1,676	55,442	7,330
34	2024-03-28	2,431,025	42,387	1,644,390	20,772,701	89,892	15,598	2,023	782	1,676	55,442	7,330
33	2023-05-31	2,399,743	41,923	1,610,596	20,334,684	88,630	15,398	2,021	782	1,636	51,582	7,098
32	2023-01-26	2,354,965	41,923	1,536,903	20,038,828	86,361	15,139	2,015	759	1,636	51,582	7,098
31	2022-07-12	2,331,700	41,585	1,498,681	19,780,369	85,431	15,072	2,000	757	1,293	48,816	6,656
30	2022-02-22	2,157,379	41,549	1,458,215	19,286,751	84,092	14,855	1,991	752	1,293	48,816	6,656
29	2021-07-01	2,105,464	41,383	1,383,553	18,635,916	81,544	14,554	1,978	743	1,262	45,902	6,202
28	2021-01-15	2,086,898	41,049	1,358,549	17,276,334	80,480	14,347	1,950	739	1,256	42,988	5,347
27	2020-05-18	1,961,462	40,834	1,221,361	16,066,124	76,086	13,382	1,831	707	0	37,259	5,134
26	2020-02-14	1,950,765	40,822	1,221,311	15,996,368	76,076	13,377	1,830	707	0	37,259	5,070
25	2019-02-01	1,879,206	39,885	1,125,387	15,504,603	72,271	12,482	1,670	655	0	29,457	4,992
24.1	2018-05-01	1,828,820	39,877	1,060,283	15,207,914	69,861	12,091	1,667	655	0	29,163	4,992
24	2018-05-01	1,828,820	39,877	1,060,283	15,207,914	69,861	12,091	1,667	655	0	29,163	4,992
23	2017-05-18	1,735,442	39,584	1,302,147	14,675,320	67,722	11,538	1,624	125	0	13,504	4,305
22.1	2016-11-17	1,686,695	39,422	1,246,683	14,371,197	65,213	11,224	1,619	111	0	12,573	3,834
22	2016-09-28	1,686,695	39,422	1,246,132	14,371,219	65,213	11,224	1,619	111	0	12,573	3,834
21	2015-02-12	1,592,191	39,347	1,212,831	13,968,617	62,502	11,019	1,612	0	0	5,951	3,799
20	2015-02-03	1,463,270	39,016	1,148,942	13,520,737	59,610	10,774	1,647	0	0	0	2,266
19	2014-07-23	1,411,786	38,910	1,106,285	12,843,338	57,156	10,579	1,653	0	0	0	2,239
18	2014-04-02	1,359,508	35,817	1,042,374	12,419,715	53,298	9,414	1,655	0	0	0	2,233
17	2013-09-16	1,324,941	32,692	734,201	12,077,491	51,277	9,356	1,746	0	0	0	2,213
16	2013-05-15	1,295,510	23,532	712,836	11,420,351	50,095	9,844	1,432	0	0	0	0
15	2013-01-30	1,254,575	23,528	679,259	10,509,572	48,735	9,570	1,432	0	0	0	0
14	2012-07-18	1,213,242	16,573	644,734	10,129,256	46,133	9,003	0	0	0	0	0
13	2012-02-29	1,143,682	16,397	617,681	6,933,068	44,682	8,845	0	0	0	0	0
12	2011-11-30	1,077,189	16,658	596,122	5,654,847	43,418	8,703	0	0	0	0	0
11	2011-06-07	1,060,258	16,264	582,982	5,479,146	42,516	8,603	0	0	0	0	0
10	2011-06-07	1,000,468	16,159	534,391	4,668,202	40,624	8,372	0	0	0	0	0
9	2011-01-04	658,075	3,746	499,867	3,030,317	39,094	8,091	0	0	0	0	0
8	2010-11-05	636,269	0	488,898	2,973,034	38,462	8,088	0	0	0	0	0
7	2010-09-03	602,500	0	485,095	2,948,069	38,204	8,078	0	0	0	0	0
6	2010-09-03	600,625	0	481,752	2,925,588	38,029	8,054	0	0	0	0	0
5	2010-06-07	578,715	0	459,823	2,787,240	36,624	7,493	0	0	0	0	0
4	2010-05-26	565,245	0	446,645	2,705,136	35,821	7,330	0	0	0	0	0
3	2010-04-30	547,133	0	432,022	2,490,742	34,982	7,330	0	0	0	0	0
2	2009-12-07	517,261	0	416,284	2,404,622	33,956	7,192	0	0	0	0	0
1	2009-10-28	440,055	0	329,250	1,936,969	26,299	5,694	0	0	0	0	0
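
Counts like those in the table can be gathered with one COUNT(*) query per entity type. The sketch below is an illustration, not the actual queries used by the CLI. The table names are assumptions taken from recent ChEMBL schemas. The helper summarize is hypothetical. Reporting zero when a table or column is missing is a guess at how the zeros for early releases could arise.

```python
import sqlite3

# Assumed table names from recent ChEMBL schemas; older releases may lack some.
COUNT_QUERIES = {
    "compounds": "SELECT COUNT(*) FROM molecule_dictionary",
    "named_compounds": (
        "SELECT COUNT(*) FROM molecule_dictionary WHERE pref_name IS NOT NULL"
    ),
    "assays": "SELECT COUNT(*) FROM assays",
    "activities": "SELECT COUNT(*) FROM activities",
    "tissues": "SELECT COUNT(*) FROM tissue_dictionary",
}


def summarize(conn: sqlite3.Connection) -> dict:
    """Run every count query, reporting 0 when a table or column is missing."""
    summary = {}
    for name, sql in COUNT_QUERIES.items():
        try:
            summary[name] = conn.execute(sql).fetchone()[0]
        except sqlite3.OperationalError:
            # Raised for "no such table" / "no such column" in this release
            summary[name] = 0
    return summary
```

The same dictionary can be extended with further entity types (documents, targets, cells, and so on) without changing the function.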

The same results can be viewed as charts:

These charts show when certain features were introduced, such as cells in v15, drug indications in v21, tissues in v22, and drug warnings in v28.
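
The introduction of a feature can also be read directly from the table as the first release with a non-zero count. Here is a small sketch of that check, using an excerpt of the table that includes the rows on either side of each transition. The function name first_nonzero_version and the dictionary keys are hypothetical.

```python
def first_nonzero_version(rows, column):
    """Return the earliest version whose count in `column` is non-zero, else None.

    `rows` must be ordered from oldest to newest release.
    """
    for row in rows:
        if row[column] > 0:
            return row["version"]
    return None


# Excerpt of the table above, oldest first (values copied from the table)
rows = [
    {"version": "14", "cells": 0, "tissues": 0, "drug_warnings": 0, "drug_indications": 0},
    {"version": "15", "cells": 1432, "tissues": 0, "drug_warnings": 0, "drug_indications": 0},
    {"version": "20", "cells": 1647, "tissues": 0, "drug_warnings": 0, "drug_indications": 0},
    {"version": "21", "cells": 1612, "tissues": 0, "drug_warnings": 0, "drug_indications": 5951},
    {"version": "22", "cells": 1619, "tissues": 111, "drug_warnings": 0, "drug_indications": 12573},
    {"version": "27", "cells": 1831, "tissues": 707, "drug_warnings": 0, "drug_indications": 37259},
    {"version": "28", "cells": 1950, "tissues": 739, "drug_warnings": 1256, "drug_indications": 42988},
]

for column in ("cells", "drug_indications", "tissues", "drug_warnings"):
    print(column, first_nonzero_version(rows, column))
```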

The number of named compounds seems to have plateaued in v19 in 2014. This is strange, considering that ChEMBL links to many external resources like ChEBI that have nice preferred names that could be imported. However, much like I found in my recent post about the EFO identifier column in ChEMBL’s diseases table, the pref_name column in the compounds table might not actually mean what I guess it does.

Change over Time

In order to investigate the changes over time, I also took the discrete derivative of each:

There are a few interesting places where the numbers dropped, such as the number of targets in v17 and the number of assays in v24 (which might have been a mistake that triggered the v24.1 release). I’m sure there’s a bit of explanation in the READMEs for these releases - please comment at the end of the post if you happen to take a look and have more explanation.
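
The discrete derivative here is simply the difference between each release's count and the count of the release before it. Here is a minimal sketch, using the target counts for v15 through v18 from the table. The function name discrete_derivative is hypothetical.

```python
def discrete_derivative(counts):
    """Return (version, change) pairs for consecutive releases.

    `counts` is a list of (version, count) tuples ordered oldest first.
    """
    return [
        (version, count - previous_count)
        for (_, previous_count), (version, count) in zip(counts, counts[1:])
    ]


# Target counts copied from the table above, oldest first
targets = [("15", 9570), ("16", 9844), ("17", 9356), ("18", 9414)]

for version, change in discrete_derivative(targets):
    print(f"v{version}: {change:+d}")
```

A negative change, like the one for v17 in this excerpt, marks a release in which the count dropped.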

Overall, this analysis shows that the amount of content added between ChEMBL versions is relatively consistent (though keep in mind it’s on a log axis). The time between releases is also only slightly increasing on average.

Future Ideas

I would love to extend the idea of a temporal analysis to target-centric questions like:

Are there examples of targets where the chemical space gets a lot bigger?
Conversely, are there targets where new compounds just seem to be in the same old neighborhood?
Are there widely conflicting activities added over time?
How well does a QSAR model trained on a given version of ChEMBL perform on the data that’s added later?

I presented one such example in the chembl-downloader manuscript where I re-ran one of Pat Walters’ analyses on 5-lipoxygenase activating protein (CHEMBL4550) in this notebook. There, the number of activities has more than doubled since the original analysis, but the distribution was roughly the same.

If you’re interested in teaming up to do a retrospective analysis on your favorite target (or, maybe even using knowledge graphs for interesting aggregations of targets based on gene sets, disease associations, etc.), then let me know.