A historical analysis of ChEMBL
I’ve recently submitted an article to the
Journal of Open Source Software (JOSS) describing
chembl-downloader
, a Python package for
automating downloading and using ChEMBL data in a reproducible way. In this
post, I use chembl-downloader
to show how the number of compounds, assays,
activities, and other entities in ChEMBL have changed over time.
ChEMBL has made 37 releases so far. 35 of them have been major releases, and two have been minor releases (v22.1 and v24.1). While it only began bundling a SQLite dump of the database v19, Eloy Felix recently informed me that the team constructed SQLite dumps for all previous versions, too. This is great, because I use the SQLite dump as the primary mechanism for querying the database.
chembl-downloader
automates downloading any version of ChEMBL’s SQLite
database, unpacking it from the TAR archive, connecting to it, making a SQL
query, and returning the results as a Pandas DataFrame object with
chembl_downloader.query()
or
chembl_downloader.query_scalar()
like in:
import chembl_downloader
sql = "SELECT * FROM activities LIMIT 5"
rows = chembl_downloader.query(sql, version=1)
sql = "SELECT COUNT(*) FROM activities"
count: int = chembl_downloader.query_scalar(sql, version=1)
These functions can be used to write a loop and run the same SQL query over every version of ChEMBL with a few two caveats:
- to get all versions of ChEMBL, it needs to include v22.1 and v24.1.
chembl.versions()
provides convenient access to construct this iterator - SQLite does not allow for opening a compressed database (without paying for an extension), so each version needs to be uncompressed. Unfortunately, most personal computers (including mine) don’t enough hard disk space to have an uncompressed copy of each version of ChEMBL
Results of Temporal Analysis
I wrote a CLI utility
chembl_downloader history
which downloads, decompresses, analyzes, and then deletes each version of ChEMBL
iteratively over the span of about three hours.
It summarizes the dates of release, number of compounds, number of named
compounds, (i.e., with a pref_name
), number of assays, and number of
activities, and several other entity types to (almost) match what’s summarized
on the ChEMBL homepage. The results can be
downloaded as a TSV
and are as follows:
Version | Date | Compounds | Named Compounds | Assays | Activities | Documents | Targets | Cells | Tissues | Drug Warnings | Drug Indications | Drug Mechanisms |
---|---|---|---|---|---|---|---|---|---|---|---|---|
35 | 2024-12-01 | 2,496,335 | 42,231 | 1,740,546 | 21,123,501 | 92,121 | 16,003 | 2,129 | 782 | 1,676 | 55,442 | 7,330 |
34 | 2024-03-28 | 2,431,025 | 42,387 | 1,644,390 | 20,772,701 | 89,892 | 15,598 | 2,023 | 782 | 1,676 | 55,442 | 7,330 |
33 | 2023-05-31 | 2,399,743 | 41,923 | 1,610,596 | 20,334,684 | 88,630 | 15,398 | 2,021 | 782 | 1,636 | 51,582 | 7,098 |
32 | 2023-01-26 | 2,354,965 | 41,923 | 1,536,903 | 20,038,828 | 86,361 | 15,139 | 2,015 | 759 | 1,636 | 51,582 | 7,098 |
31 | 2022-07-12 | 2,331,700 | 41,585 | 1,498,681 | 19,780,369 | 85,431 | 15,072 | 2,000 | 757 | 1,293 | 48,816 | 6,656 |
30 | 2022-02-22 | 2,157,379 | 41,549 | 1,458,215 | 19,286,751 | 84,092 | 14,855 | 1,991 | 752 | 1,293 | 48,816 | 6,656 |
29 | 2021-07-01 | 2,105,464 | 41,383 | 1,383,553 | 18,635,916 | 81,544 | 14,554 | 1,978 | 743 | 1,262 | 45,902 | 6,202 |
28 | 2021-01-15 | 2,086,898 | 41,049 | 1,358,549 | 17,276,334 | 80,480 | 14,347 | 1,950 | 739 | 1,256 | 42,988 | 5,347 |
27 | 2020-05-18 | 1,961,462 | 40,834 | 1,221,361 | 16,066,124 | 76,086 | 13,382 | 1,831 | 707 | 0 | 37,259 | 5,134 |
26 | 2020-02-14 | 1,950,765 | 40,822 | 1,221,311 | 15,996,368 | 76,076 | 13,377 | 1,830 | 707 | 0 | 37,259 | 5,070 |
25 | 2019-02-01 | 1,879,206 | 39,885 | 1,125,387 | 15,504,603 | 72,271 | 12,482 | 1,670 | 655 | 0 | 29,457 | 4,992 |
24.1 | 2018-05-01 | 1,828,820 | 39,877 | 1,060,283 | 15,207,914 | 69,861 | 12,091 | 1,667 | 655 | 0 | 29,163 | 4,992 |
24 | 2018-05-01 | 1,828,820 | 39,877 | 1,060,283 | 15,207,914 | 69,861 | 12,091 | 1,667 | 655 | 0 | 29,163 | 4,992 |
23 | 2017-05-18 | 1,735,442 | 39,584 | 1,302,147 | 14,675,320 | 67,722 | 11,538 | 1,624 | 125 | 0 | 13,504 | 4,305 |
22.1 | 2016-11-17 | 1,686,695 | 39,422 | 1,246,683 | 14,371,197 | 65,213 | 11,224 | 1,619 | 111 | 0 | 12,573 | 3,834 |
22 | 2016-09-28 | 1,686,695 | 39,422 | 1,246,132 | 14,371,219 | 65,213 | 11,224 | 1,619 | 111 | 0 | 12,573 | 3,834 |
21 | 2015-02-12 | 1,592,191 | 39,347 | 1,212,831 | 13,968,617 | 62,502 | 11,019 | 1,612 | 0 | 0 | 5,951 | 3,799 |
20 | 2015-02-03 | 1,463,270 | 39,016 | 1,148,942 | 13,520,737 | 59,610 | 10,774 | 1,647 | 0 | 0 | 0 | 2,266 |
19 | 2014-07-23 | 1,411,786 | 38,910 | 1,106,285 | 12,843,338 | 57,156 | 10,579 | 1,653 | 0 | 0 | 0 | 2,239 |
18 | 2014-04-02 | 1,359,508 | 35,817 | 1,042,374 | 12,419,715 | 53,298 | 9,414 | 1,655 | 0 | 0 | 0 | 2,233 |
17 | 2013-09-16 | 1,324,941 | 32,692 | 734,201 | 12,077,491 | 51,277 | 9,356 | 1,746 | 0 | 0 | 0 | 2,213 |
16 | 2013-05-15 | 1,295,510 | 23,532 | 712,836 | 11,420,351 | 50,095 | 9,844 | 1,432 | 0 | 0 | 0 | 0 |
15 | 2013-01-30 | 1,254,575 | 23,528 | 679,259 | 10,509,572 | 48,735 | 9,570 | 1,432 | 0 | 0 | 0 | 0 |
14 | 2012-07-18 | 1,213,242 | 16,573 | 644,734 | 10,129,256 | 46,133 | 9,003 | 0 | 0 | 0 | 0 | 0 |
13 | 2012-02-29 | 1,143,682 | 16,397 | 617,681 | 6,933,068 | 44,682 | 8,845 | 0 | 0 | 0 | 0 | 0 |
12 | 2011-11-30 | 1,077,189 | 16,658 | 596,122 | 5,654,847 | 43,418 | 8,703 | 0 | 0 | 0 | 0 | 0 |
11 | 2011-06-07 | 1,060,258 | 16,264 | 582,982 | 5,479,146 | 42,516 | 8,603 | 0 | 0 | 0 | 0 | 0 |
10 | 2011-06-07 | 1,000,468 | 16,159 | 534,391 | 4,668,202 | 40,624 | 8,372 | 0 | 0 | 0 | 0 | 0 |
9 | 2011-01-04 | 658,075 | 3,746 | 499,867 | 3,030,317 | 39,094 | 8,091 | 0 | 0 | 0 | 0 | 0 |
8 | 2010-11-05 | 636,269 | 0 | 488,898 | 2,973,034 | 38,462 | 8,088 | 0 | 0 | 0 | 0 | 0 |
7 | 2010-09-03 | 602,500 | 0 | 485,095 | 2,948,069 | 38,204 | 8,078 | 0 | 0 | 0 | 0 | 0 |
6 | 2010-09-03 | 600,625 | 0 | 481,752 | 2,925,588 | 38,029 | 8,054 | 0 | 0 | 0 | 0 | 0 |
5 | 2010-06-07 | 578,715 | 0 | 459,823 | 2,787,240 | 36,624 | 7,493 | 0 | 0 | 0 | 0 | 0 |
4 | 2010-05-26 | 565,245 | 0 | 446,645 | 2,705,136 | 35,821 | 7,330 | 0 | 0 | 0 | 0 | 0 |
3 | 2010-04-30 | 547,133 | 0 | 432,022 | 2,490,742 | 34,982 | 7,330 | 0 | 0 | 0 | 0 | 0 |
2 | 2009-12-07 | 517,261 | 0 | 416,284 | 2,404,622 | 33,956 | 7,192 | 0 | 0 | 0 | 0 | 0 |
1 | 2009-10-28 | 440,055 | 0 | 329,250 | 1,936,969 | 26,299 | 5,694 | 0 | 0 | 0 | 0 | 0 |
The same results can be viewed as charts:
These charts show when certain features were introduced, such as cells in v15, drug indications in v20, tissues in v22, and drug warnings in v28.
The number of named compounds seems to have plateaued in v19 in 2014. This is
strange, considering that ChEMBL links to many external resources like ChEBI
that have nice preferred names that be imported. However, much like I found in
my recent post
about the EFO identifier column in ChEMBL’s diseases table, the pref_name
column in the compounds table might not actually mean what I guess it does.
Change over Time
In order to investigate the changes over time, I also took the discrete derivative of each:
There are a few interesting places where the numbers dropped, such as the number of targets in v17 and the number of assays in v24 (which might have been a mistake that triggered the v24.1 release). I’m sure there’s a bit of explanation in the READMEs for these releases - please comment at the end of the post if you happen to take a look and have more explanation.
Overall, this analysis shows that the amount of content added between ChEMBL versions is relatively consistent (though keep in mind it’s on a log axis). The time for each release is also only slightly increasing on average.
Future Ideas
I would love to extend the idea of a temporal analysis towards other target-centric metrics like:
- Are there examples of targets where the chemical space gets a lot bigger?
- Conversely, are there targets where new compounds just seem to be in the same old neighborhood?
- Are there widely conflicting activities added over time?
- How does the ability of a QSAR model trained on a given version of ChEMBL perform with respect to the data that’s added later?
I presented one such example in the chembl-downloader
manuscript where I
re-ran one of Pat Walter’s analyses on
5-lipoxygenase activating protein (CHEMBL4550)
in
this notebook.
There, the number of activities increased by more than double since the original
analysis, but the distribution was roughly the same.
If you’re interested in teaming up to do a retrospective analysis on your favorite target (or, maybe even using knowledge graphs for interesting aggregations of targets based on gene sets, disease associations, etc.), then let me know.