The EFO_ID column in ChEMBL's drug indications table isn't what you think it is
ChEMBL periodically curates clinical trial information into its DRUG_INDICATION table. However, there are some weird inconsistencies in the way it references disease concepts in external vocabularies. This blog post is an exploration of that table.
As of ChEMBL v35, the DRUG_INDICATION table contains the following columns:
- DRUGIND_ID - a unique identifier for the chemical-indication pair
- MOLREGNO - a foreign key to the molecules table
- MAX_PHASE_FOR_IND - the maximum phase achieved by clinical trials of the chemical-indication pair
- MESH_ID - the local unique identifier from Medical Subject Headings (MeSH) for the indication
- MESH_HEADING - the label in MeSH for the given MeSH ID
- EFO_ID - a compact URI (CURIE) for a term in the Experimental Factor Ontology (EFO) (in theory)
- EFO_TERM - a label for the CURIE in the EFO_ID column
This is already strange, considering that for cell lines, tissues, and targets, ChEMBL has created its own table which contains the cross-references to external vocabularies. Here, they’re baked into the pivot table.
The funny business is about the EFO_ID column:
- It’s strange that the MESH_ID column uses local unique identifiers but the EFO_ID column uses compact URIs (CURIEs). CURIEs are a syntax for referencing an entity in an ontology or database that takes the form of <prefix>:<local unique identifier>. The prefix is usually the acronym for the resource and the local unique identifier is the ID inside the resource (usually a number). More on this in my previous post.
- The CURIEs in the EFO_ID column aren’t all using EFO as the prefix!
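To make the CURIE syntax concrete, here is a minimal sketch of splitting a CURIE into its two parts and building one from a bare local identifier, which is what the MESH_ID column would need to match the EFO_ID column. The helper names `parse_curie` and `to_curie` are made up for illustration, `mesh` is the prefix the Bioregistry uses for MeSH, and `D005334` is only an example MeSH identifier, not a value taken from the table.

```python
def parse_curie(curie: str) -> tuple[str, str]:
    """Split a CURIE into its prefix and local unique identifier."""
    # Split on the first colon only, since local identifiers may contain colons
    prefix, sep, identifier = curie.partition(":")
    if not sep or not prefix or not identifier:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, identifier


def to_curie(prefix: str, identifier: str) -> str:
    """Build a CURIE from a prefix and a local unique identifier."""
    return f"{prefix}:{identifier}"


print(parse_curie("HP:0001945"))
print(parse_curie("Orphanet:309005"))
# Example MeSH local identifier (illustrative value only)
print(to_curie("mesh", "D005334"))
```

This only checks the shape of the string. It doesn't validate that the prefix is a registered one, which is a job better left to the Bioregistry.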
Let’s have a look at what’s actually in the EFO_ID column by using the chembl_downloader Python package to automatically download the latest version of ChEMBL and run SQL queries over it.
import chembl_downloader
chembl_downloader.query("""\
SELECT DISTINCT efo_id, efo_term
FROM DRUG_INDICATION
WHERE efo_id NOT LIKE 'EFO:%'
""")
efo_id | efo_term |
---|---|
HP:0001945 | Fever |
Orphanet:309005 | Disorder of lipid metabolism |
HP:0003124 | Hypercholesterolemia |
Orphanet:79211 | Combined hyperlipidemia |
HP:0000023 | Inguinal hernia |
… | … |
Using a bit of SQL string processing to identify the prefixes and the Bioregistry to retrieve the name and homepage gives a bit more context about what the prefixes in CURIEs in the EFO_ID column represent.
import bioregistry
import chembl_downloader
sql = """\
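SELECT prefix, count(prefix) AS count
FROM (
    -- everything before the first colon is the prefix
    SELECT substr(efo_id, 1, instr(efo_id, ':') - 1) AS prefix
    FROM DRUG_INDICATION
)
GROUP BY prefix
-- count(prefix) is 0 for rows where efo_id is NULL, so this drops them
HAVING count(prefix) > 0
ORDER BY count(prefix) DESC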
"""
df = chembl_downloader.query(sql)
df["name"] = df["prefix"].map(bioregistry.get_name)
df["homepage"] = df["prefix"].map(bioregistry.get_homepage)
prefix | count | name | homepage |
---|---|---|---|
EFO | 37,603 | Experimental Factor Ontology | http://www.ebi.ac.uk/efo |
MONDO | 13,532 | Mondo Disease Ontology | https://monarch-initiative.github.io/mondo |
HP | 3,381 | Human Phenotype Ontology | http://www.human-phenotype-ontology.org/ |
Orphanet | 359 | Orphanet | http://www.orpha.net/consor/ |
MP | 281 | Mammalian Phenotype Ontology | https://www.informatics.jax.org/vocab/mp_ontology/ |
GO | 50 | Gene Ontology | http://geneontology.org/ |
DOID | 45 | Human Disease Ontology | http://www.disease-ontology.org |
CHEBI | 19 | Chemical Entities of Biological Interest | http://www.ebi.ac.uk/chebi |
UBERON | 1 | Uber Anatomy Ontology | http://uberon.org |
The ones that stand out to me are CHEBI, UBERON, and GO, since these resources are respectively for chemicals, anatomical entities, and biological processes/cellular components/molecular functions.
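To put the counts in proportion, here is a quick back-of-the-envelope sketch that works only from the numbers in the table above (so it covers rows that have a CURIE in EFO_ID at all). The variable names are just for illustration.

```python
# Row counts per prefix, copied from the table above
counts = {
    "EFO": 37_603,
    "MONDO": 13_532,
    "HP": 3_381,
    "Orphanet": 359,
    "MP": 281,
    "GO": 50,
    "DOID": 45,
    "CHEBI": 19,
    "UBERON": 1,
}

total = sum(counts.values())
non_efo = total - counts["EFO"]
share = non_efo / total

# Prefixes called out above as not describing diseases or phenotypes
unexpected = {"CHEBI", "UBERON", "GO"}
unexpected_rows = sum(counts[prefix] for prefix in unexpected)

print(f"{non_efo:,} of {total:,} rows ({share:.1%}) do not use the EFO prefix")
print(f"{unexpected_rows} rows use CHEBI, UBERON, or GO")
```

So roughly a third of the CURIEs in the column aren't EFO CURIEs, while the really surprising prefixes account for only a handful of rows.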
I wrote the following function to do a bit of exploring, based on the prefix.
import chembl_downloader


def print_indications_with_prefix(prefix: str) -> None:
    """Print a Markdown table of molecules whose indication CURIE uses the given prefix."""
    sql = f"""\
        SELECT DISTINCT
            MOLECULE_DICTIONARY.chembl_id,
            MOLECULE_DICTIONARY.pref_name,
            DRUG_INDICATION.efo_id,
            DRUG_INDICATION.efo_term
        FROM MOLECULE_DICTIONARY
        JOIN DRUG_INDICATION ON MOLECULE_DICTIONARY.molregno = DRUG_INDICATION.molregno
        WHERE DRUG_INDICATION.efo_id LIKE '{prefix}:%'
        ORDER BY MOLECULE_DICTIONARY.pref_name
    """
    df = chembl_downloader.query(sql)
    # Turn identifiers into Markdown links that resolve via the Bioregistry
    df["chembl_id"] = df["chembl_id"].map(lambda s: f"[{s}](https://bioregistry.io/chembl.compound:{s})")
    df["efo_id"] = df["efo_id"].map(lambda s: f"[{s}](https://bioregistry.io/{s})")
    # DataFrame.to_markdown() needs the optional tabulate package
    print(df.to_markdown(tablefmt="github", index=False))
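The function above formats the prefix straight into the SQL string, which is fine for exploring by hand. As a hedged alternative, here is a sketch of the same LIKE filter with a bound parameter. It uses the standard library sqlite3 module against a tiny in-memory table rather than chembl_downloader, the table only has the columns needed for the demo, the CURIEs and labels are copied from the tables in this post, and the molregno values are placeholders. The helper name `terms_with_prefix` is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DRUG_INDICATION (molregno INTEGER, efo_id TEXT, efo_term TEXT)"
)
conn.executemany(
    "INSERT INTO DRUG_INDICATION VALUES (?, ?, ?)",
    [
        # molregno values are placeholders, not real ChEMBL keys
        (1, "HP:0001945", "Fever"),
        (2, "UBERON:0000029", "lymph node"),
        (3, "GO:0042060", "wound healing"),
        (4, "GO:0007568", "aging"),
    ],
)


def terms_with_prefix(conn: sqlite3.Connection, prefix: str) -> list[tuple[str, str]]:
    """Return distinct (CURIE, label) pairs whose CURIE starts with the prefix."""
    sql = """
        SELECT DISTINCT efo_id, efo_term
        FROM DRUG_INDICATION
        WHERE efo_id LIKE ? || ':%'
        ORDER BY efo_id
    """
    # The prefix is passed as a bound parameter instead of being formatted in
    return conn.execute(sql, (prefix,)).fetchall()


print(terms_with_prefix(conn, "GO"))
print(terms_with_prefix(conn, "UBERON"))
```

One thing to keep in mind is that LIKE in SQLite is case-insensitive for ASCII characters by default, so `go` would match the same rows as `GO`.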
Using UBERON returns a single result, which appears to be a mistake or an abuse of the database schema: a lymph node is an anatomical structure, not something a drug is indicated for.
chembl_id | pref_name | efo_id | efo_term |
---|---|---|---|
CHEMBL4650497 | PEGSITACIANINE | UBERON:0000029 | lymph node |
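Using CHEBI returns mostly diagnostic agents (16 of the 19 results), alongside methotrexate, growth hormone, and antineoplastic agent. Diagnostic agent is part of the “role” hierarchy within ChEBI, and using it here is also what I would consider an abuse of the database schema.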
chembl_id | pref_name | efo_id | efo_term |
---|---|---|---|
CHEMBL1234270 | ARFOLITIXORIN | CHEBI:44185 | methotrexate |
CHEMBL5314823 | DIGADOGLUCITOL | CHEBI:33295 | diagnostic agent |
CHEMBL4650354 | FLORBENGUANE F18 | CHEBI:33295 | diagnostic agent |
CHEMBL5095045 | FLORZOLOTAU (18F) | CHEBI:33295 | diagnostic agent |
CHEMBL5314559 | FLOTUFOLASTAT F 18 GALLIUM | CHEBI:33295 | diagnostic agent |
CHEMBL4298157 | FLUBROBENGUANE F18 | CHEBI:33295 | diagnostic agent |
CHEMBL5314633 | IODINE I124 EVUZAMITIDE | CHEBI:33295 | diagnostic agent |
CHEMBL4298185 | LONAPEGSOMATROPIN | CHEBI:37845 | growth hormone |
CHEMBL5314761 | PEGFOSIMER MANGANESE | CHEBI:33295 | diagnostic agent |
CHEMBL4650497 | PEGSITACIANINE | CHEBI:33295 | diagnostic agent |
CHEMBL4297334 | PIFLUFOLASTAT F18 | CHEBI:33295 | diagnostic agent |
CHEMBL5314483 | RIZEDISBEN | CHEBI:33295 | diagnostic agent |
CHEMBL5314650 | TECHNETIUM TC-99M LABELED CARBON | CHEBI:33295 | diagnostic agent |
CHEMBL4298067 | TOMARALIMAB | CHEBI:35610 | antineoplastic agent |
CHEMBL5314445 | VIDOFLUFOLASTAT(18F) | CHEBI:33295 | diagnostic agent |
CHEMBL4594280 | VIPIVOTIDE TETRAXETAN | CHEBI:33295 | diagnostic agent |
CHEMBL4594411 | XENON XE-129, HYPERPOLARIZED | CHEBI:33295 | diagnostic agent |
CHEMBL5314610 | ZOPOCIANINE | CHEBI:33295 | diagnostic agent |
CHEMBL5314611 | ZOPOCIANINE SODIUM | CHEBI:33295 | diagnostic agent |
Using GO returns aging, regulation of ovulation (both positive and negative), and wound healing as the four unique biological processes. This is a little less controversial than UBERON and CHEBI, but it’s still a bit of a mismatch for the idea of an “indication”.
chembl_id | pref_name | efo_id | efo_term |
---|---|---|---|
CHEMBL1566 | ACARBOSE | GO:0007568 | aging |
CHEMBL600 | ACETYLCYSTEINE | GO:0007568 | aging |
CHEMBL1399 | ANASTROZOLE | GO:0007568 | aging |
CHEMBL25 | ASPIRIN | GO:0007568 | aging |
CHEMBL1201556 | BECAPLERMIN | GO:0042060 | wound healing |
CHEMBL5315086 | BETULA PUBESCENS BARK | GO:0042060 | wound healing |
CHEMBL1200800 | CALCIUM ACETATE | GO:0007568 | aging |
CHEMBL1200539 | CALCIUM CARBONATE | GO:0007568 | aging |
CHEMBL113313 | CAPROMORELIN | GO:0007568 | aging |
CHEMBL1042 | CHOLECALCIFEROL | GO:0007568 | aging |
CHEMBL2108185 | CORIFOLLITROPIN ALFA | GO:0060279 | positive regulation of ovulation |
CHEMBL429910 | DAPAGLIFLOZIN | GO:0007568 | aging |
CHEMBL1421 | DASATINIB | GO:0007568 | aging |
CHEMBL139 | DICLOFENAC | GO:0007568 | aging |
CHEMBL367149 | DOCONEXENT | GO:0007568 | aging |
CHEMBL1200969 | DUTASTERIDE | GO:0007568 | aging |
CHEMBL135 | ESTRADIOL | GO:0007568 | aging |
CHEMBL2108390 | FIBRIN | GO:0007568 | aging |
CHEMBL500468 | GHRELIN | GO:0007568 | aging |
CHEMBL389621 | HYDROCORTISONE | GO:0007568 | aging |
CHEMBL13817 | IBUTAMOREN | GO:0007568 | aging |
CHEMBL460026 | ICOSAPENT | GO:0007568 | aging |
CHEMBL2109042 | INFLUENZA VIRUS VACCINE | GO:0007568 | aging |
CHEMBL471737 | IVABRADINE | GO:0007568 | aging |
CHEMBL444814 | L-CITRULLINE | GO:0007568 | aging |
CHEMBL191 | LOSARTAN | GO:0007568 | aging |
CHEMBL1201419 | LUTROPIN ALFA | GO:0060279 | positive regulation of ovulation |
CHEMBL2107951 | MALTODEXTRIN | GO:0007568 | aging |
CHEMBL267936 | MECAMYLAMINE | GO:0007568 | aging |
CHEMBL1201716 | MECASERMIN | GO:0007568 | aging |
CHEMBL1431 | METFORMIN | GO:0007568 | aging |
CHEMBL650 | METHYLPREDNISOLONE | GO:0007568 | aging |
CHEMBL4074884 | MITOQUINONE MESYLATE | GO:0007568 | aging |
CHEMBL438497 | NICOTINAMIDE RIBOSIDE | GO:0007568 | aging |
CHEMBL3 | NICOTINE | GO:0007568 | aging |
CHEMBL1201574 | ONABOTULINUMTOXINA | GO:0007568 | aging |
CHEMBL1234886 | OXYGEN | GO:0007568 | aging |
CHEMBL395429 | OXYTOCIN | GO:0007568 | aging |
CHEMBL90593 | PRASTERONE | GO:0007568 | aging |
CHEMBL50 | QUERCETIN | GO:0007568 | aging |
CHEMBL165 | RESVERATROL | GO:0007568 | aging |
CHEMBL413 | SIROLIMUS | GO:0007568 | aging |
CHEMBL1200574 | SODIUM CHLORIDE | GO:0007568 | aging |
CHEMBL93268 | SODIUM NITRITE | GO:0007568 | aging |
CHEMBL136478 | SODIUM NITROPRUSSIDE | GO:0007568 | aging |
CHEMBL1201621 | SOMATROPIN | GO:0007568 | aging |
CHEMBL48802 | SULFORAPHANE | GO:0007568 | aging |
CHEMBL386630 | TESTOSTERONE | GO:0007568 | aging |
CHEMBL2107067 | TESTOSTERONE UNDECANOATE | GO:0007568 | aging |
CHEMBL3545347 | TXA127 | GO:0007568 | aging |
There wasn’t really a point to this post other than to show off a quirk I found in ChEMBL. This is useful to be aware of when automatically processing the database in bulk, e.g., for building a knowledge graph.
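For bulk processing, one defensive option is to split rows by prefix before they go into a knowledge graph, so the odd ones get reviewed instead of silently becoming disease nodes. The sketch below is only that, a sketch: the allowlist is my assumption about what a pipeline might accept (where MP belongs is a judgment call), the function name is made up, and the last input row uses a placeholder ChEMBL ID because the post doesn't pair HP:0001945 with a molecule.

```python
from collections import defaultdict

# Assumption: prefixes a downstream pipeline is willing to treat as indications
ALLOWED_PREFIXES = frozenset({"EFO", "MONDO", "HP", "Orphanet", "DOID"})


def partition_by_prefix(rows, allowed=ALLOWED_PREFIXES):
    """Split (chembl_id, curie) pairs into kept rows and rows flagged by prefix."""
    kept = []
    flagged = defaultdict(list)
    for chembl_id, curie in rows:
        prefix = curie.partition(":")[0]
        if prefix in allowed:
            kept.append((chembl_id, curie))
        else:
            flagged[prefix].append((chembl_id, curie))
    return kept, dict(flagged)


rows = [
    ("CHEMBL4650497", "UBERON:0000029"),
    ("CHEMBL4650497", "CHEBI:33295"),
    ("CHEMBL25", "GO:0007568"),
    ("CHEMBL_EXAMPLE", "HP:0001945"),  # placeholder ChEMBL ID
]

kept, flagged = partition_by_prefix(rows)
print(kept)
print(sorted(flagged))
```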
There are two other follow-up questions I would have about this table:
- Are there any EFO terms that are outside the disease hierarchy (i.e., not a descendant of EFO:0000408, disease)?
- Why are there DOID terms? The combination of EFO and MONDO should cover everything. Answering this question actually isn’t so difficult given my recent work on assembling mappings with SeMRA, specifically for the disease landscape. I’ll try to come back to this in a future post.
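For the first question, the check itself is a plain graph traversal once the parent links are available. Here is a minimal sketch on a toy hierarchy: the `EX:` identifiers are made up and stand in for real terms, and a real check would need the parent links from an actual EFO release, which this sketch doesn't load.

```python
DISEASE = "EFO:0000408"

# Toy hierarchy with made-up identifiers, mapping each term to its parents
parents = {
    "EX:1": [DISEASE],
    "EX:2": ["EX:1"],
    "EX:3": ["EX:9"],
    "EX:9": [],
}


def is_descendant(term: str, ancestor: str, parents: dict[str, list[str]]) -> bool:
    """Return True if the ancestor is reachable from the term via parent links."""
    stack = list(parents.get(term, ()))
    seen = set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return False


print(is_descendant("EX:2", DISEASE, parents))
print(is_descendant("EX:3", DISEASE, parents))
```

Note that a term doesn't count as its own descendant here, so EFO:0000408 itself would come back as False.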
If you made it this far: what did you think about my clickbait title?