Unlocking UMLS
The Unified Medical Language System (UMLS) is a widely used biomedical and clinical vocabulary maintained by the United States National Library of Medicine. However, it is notoriously difficult to access and work with due to licensing restrictions and its complex download system. In the same vein as my previous posts about DrugBank and ChEMBL, this post describes open source software I’ve developed for downloading and working with this data. It also works for RxNorm, SemMedDB, SNOMED-CT, and any other data accessible through the UMLS Terminology Services (UTS) ticket granting system.
The first big issue with the UMLS is its licensing. Here’s an excerpt from the How to License and Access the Unified Medical Language System® (UMLS®) Data page accessed on August 28th, 2023:
- Please sign up for a new UMLS Terminology Services (UTS) account with your preferred identity provider at the UTS homepage.
- Complete and submit the license request form. NLM will send the license approval e-mail within 5 business days after reviewing your authenticated license request.
- You will sign in using identity provider credentials to download files or access web interfaces that require UTS authentication such as the UTS, VSAC, SNOMED CT, or RxNorm.
These are a few big hurdles:
- We typically expect scientific data to be available for download without login. Specifically, most data can be downloaded by following a link that points directly to a file. For example, ChEMBL v33 can be downloaded as a gzipped SQLite file from https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_33_sqlite.tar.gz. Rather than providing a data download link, UMLS has a complicated API called the UMLS Terminology Services (UTS) ticket granting system. It has to be asked for a specific file, then polled for a unique access key, and only then can the file be downloaded via an ephemeral (i.e., disappearing) URL that only works once.
- We typically expect scientific data to be licensed under a standard, widely used license such as those from Creative Commons. Using well-understood licenses reduces the cognitive and legal burden on consumers when deciding if and how they can reuse, modify, or redistribute data. UMLS uses non-standard terms of service that make it more difficult to understand how the data can be stored, modified, or redistributed.
- Further, we hope that data is permissively licensed such that it can be reused, modified, and redistributed. The Creative Commons CC BY 4.0 and CC0 licenses are gold standards of permissiveness. UMLS does not have a permissive license, meaning (from my best interpretation) that you cannot redistribute UMLS and you (probably) cannot redistribute data derived from UMLS. As an aside, Creative Commons also has licenses containing clauses that are explicit about restrictions, such as share-alike (SA), non-commercial (NC), and no-derivatives (ND). While these clauses aren’t ideal for scientific data, it would at least be nice for UMLS to use a Creative Commons license with the appropriate combination of these clauses (I guess all three) to make its restrictions more explicit.
- The most bizarre facet of UMLS is that they require you to fill out a user survey each year to keep access.
I want to 1) convert UMLS into an OWL ontology and 2) extract and encode its semantic mappings to external vocabularies like the Medical Subject Headings (MeSH) with the Simple Standard for Sharing Ontological Mappings (SSSOM). Given all of these hurdles, it’s probably the case that I am not allowed to redistribute these artifacts.
Altogether, I consider this a big bummer. The United States National Library of Medicine (NLM) maintains several highly influential resources, but I have found in many instances that they lack a community perspective. Regardless, even as an expat, I pay American taxes, and it makes me upset that the government funds the development and maintenance of resources that I can’t easily use.
How To Break Free
Despite all of this rigmarole, there’s a way around these issues: automating the interaction with the UMLS Terminology Services (UTS), which enables automated download of UMLS and of other resources served through it, such as RxNorm, SemMedDB, and SNOMED-CT (a non-exhaustive list).
This has been implemented in the open source umls_downloader Python package. It can be installed with the following one-liner in your shell:
$ pip install umls_downloader
Below, I’ll walk you through using it.
Throughout, keep in mind that full documentation for the umls_downloader is available at umls-downloader.readthedocs.io, which describes the other functionality and other data that can be downloaded.
Usage
UMLS has three different distributions that are described here. The following Python code downloads the simplest and most straightforward file, MRCONSO.RRF, as a zip archive:
from umls_downloader import download_umls
path = download_umls(version="2023AA", api_key="<your API key>")
This code is smart and does not need to download the file more than once. It uses pystow to choose a stable path ~/.data/bio/umls relative to the current user’s home directory. Inside this directory, it also uses the version of the data to create a subdirectory. Finally, this function returns the path to the data, such that no file paths ever need to be hard-coded.
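To make the "download once, reuse the path" idea concrete, here is a minimal standard-library sketch. It is not how pystow or umls_downloader are actually implemented: the helper names versioned_directory and ensure_file and the fetch callback are hypothetical, and only the ~/.data/bio/umls location comes from the description above.

```python
from pathlib import Path
from typing import Callable, Optional


def versioned_directory(version: str, base: Optional[Path] = None) -> Path:
    """Return <base>/<version>; base defaults to ~/.data/bio/umls (assumed layout)."""
    if base is None:
        base = Path.home() / ".data" / "bio" / "umls"
    return base / version


def ensure_file(path: Path, fetch: Callable[[], bytes]) -> Path:
    """Write the bytes returned by ``fetch`` to ``path`` unless the file already exists."""
    if not path.exists():
        # Only reach out for the data the first time this path is requested
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch())
    return path
```

Because the version is part of the directory, asking for a different release yields a different path, and a second call for the same release returns immediately without calling fetch again.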
Warning: This still requires an API key, which requires creating an account, agreeing to UMLS’s terms and conditions, etc. This can be done here: https://uts.nlm.nih.gov/uts/edit-profile.
Automating Configuration of UTS Credentials
There are two ways to automatically set the API key, so you don’t have to worry about getting it and passing it around in your Python code:
- Set UMLS_API_KEY in the environment. This can be done in your interactive session or in the configuration for your shell, such as in a .bashrc file for the Bourne Again Shell (bash).
- Create ~/.config/umls.ini and set an api_key key in the [umls] section. Mine looks like:
[umls]
api_key=1234567890abcdefghijklmno
Now you can omit the api_key keyword like in the following:
from umls_downloader import download_umls
# Same path as before
path = download_umls(version="2023AA")
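For readers curious how such a lookup can work, the following is a rough standard-library sketch of resolving a key from the two places described above. The function name resolve_api_key is hypothetical, and the precedence (environment variable first, then the INI file) is my assumption for illustration, not a statement about what umls_downloader or pystow do internally.

```python
import configparser
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_api_key(
    environ: Optional[Mapping[str, str]] = None,
    config_path: Optional[Path] = None,
) -> Optional[str]:
    """Check UMLS_API_KEY first, then the [umls] section of an INI file (assumed order)."""
    environ = os.environ if environ is None else environ
    if environ.get("UMLS_API_KEY"):
        return environ["UMLS_API_KEY"]
    if config_path is None:
        config_path = Path.home() / ".config" / "umls.ini"
    parser = configparser.ConfigParser()
    parser.read(config_path)  # files that do not exist are silently skipped
    # Returns None when the section or the key is missing
    return parser.get("umls", "api_key", fallback=None)
```

Passing environ and config_path explicitly is only there to make the function easy to test; calling resolve_api_key() with no arguments reads the real environment and ~/.config/umls.ini.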
Download the Latest Version
First, you’ll have to install bioversions with pip install bioversions, whose job it is to look up the latest version of many databases. Then, you can modify the previous code slightly by omitting the version keyword argument:
from umls_downloader import download_umls
# Same path as before (when run on September 1st, 2023)
path = download_umls()
Download and Open the File
The UMLS file is zipped, so it’s usually accompanied by the following boilerplate code:
import zipfile
from umls_downloader import download_umls
path = download_umls()
with zipfile.ZipFile(path) as zip_file:
    with zip_file.open("MRCONSO.RRF", mode="r") as file:
        for line in file:
            ...
This exact logic is wrapped by umls_downloader.open_umls(), which works as a Python context manager, so it can more simply be written as:
from umls_downloader import open_umls
with open_umls() as file:
    for line in file:
        ...
Note: The version and api_key arguments work the same for umls_downloader.open_umls() as in umls_downloader.download_umls().
At this point, it’s up to you to decide how you want to consume the MRCONSO.RRF file. Below, I give a demo of how I parsed this file in PyOBO in order to convert UMLS to an OWL ontology.
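As a simpler starting point, here is a small sketch of turning MRCONSO.RRF lines into dictionaries. It assumes the pipe-delimited layout with a trailing pipe and the 18 column names as I understand them from the UMLS reference documentation, so check them against your release. The function name iter_mrconso is hypothetical and this is not the PyOBO parser.

```python
from typing import Dict, Iterable, Iterator, Union

# Column names as I understand them from the UMLS documentation (verify for your release)
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def iter_mrconso(lines: Iterable[Union[bytes, str]]) -> Iterator[Dict[str, str]]:
    """Yield one dict per row, keyed by column name."""
    for line in lines:
        # Files opened from a zip archive yield bytes, so decode when needed
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Each row ends with "|", so drop the empty element after the last pipe
        values = line.split("|")[: len(MRCONSO_COLUMNS)]
        yield dict(zip(MRCONSO_COLUMNS, values))
```

Assuming the file object from open_umls() yields raw lines, it could be passed straight in, for example to keep only English rows with `row["LAT"] == "ENG"` or rows from a single source vocabulary via `row["SAB"]`.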
Why not an API?
The UMLS provides an API for access to tiny bits of data at a time. There are even two recent (last 5 years) packages, umls-api and connect-umls, that provide wrappers around it. However, API access is generally rate limited, difficult to use in bulk, and slow. For working with UMLS (or any other database, for that matter) in bulk, it’s necessary to download full database dumps.
UMLS Conversions
Building on top of the automated download of UMLS, I implemented a fit-for-purpose processor with the PyOBO framework that converts UMLS into an ontology (encoded as OWL, OBO, or OBO Graph JSON), which can therefore be used to generate semantic mappings in the SSSOM format. The code that implements this can be found here. After installing PyOBO with pip install pyobo, you can automatically download and convert UMLS first into an ontology encoded in the OBO flat file format, then convert to OWL with the following code. Note: you’ll need robot for the second step:
import pyobo
umls = pyobo.get_ontology("umls")
# Write simple OBO Format
umls.write_obo("umls.obo")
# Convert to OWL
from pyobo.utils.misc import obo_to_owl
obo_to_owl("umls.obo", "umls.owl")
In an ideal world, the results of such a conversion could be included as a part of the OBO Database Ingestion, which converts database resources available through PyOBO into ontology artifacts, archives them on GitHub and Zenodo, and gives them PURLs all on a weekly basis to make sure the most up-to-date version is available as well as all previous named versions. Instead, we live in a world with pineapple pizza and restrictive licenses.
One of the nice qualities of UMLS is that it is a semantic mapping hub. It provides mostly complete mappings between many vocabularies including MeSH, NCIT, SNOMED-CT, HPO, LOINC, and more. However, there are a few caveats to consider:
- UMLS mappings aren’t all 1-to-1. For example, MeSH mappings typically include many UMLS terms (narrower) pointing to the same MeSH term (broader). For other vocabularies, such as NCBITaxon, UMLS mappings are more reliably 1-to-1. Thanks to Tiago Lubiana for pointing this out.
- Mapping provenance is not available, so the mapping_justification field in SSSOM is uniformly filled with semapv:UnspecifiedMatching.
- Similarly, UMLS does not apply precise semantic predicates for each mapping. This means that they are output in PyOBO and as SSSOM with oboInOwl:hasDbXref instead of more detailed types such as skos:exactMatch, skos:narrowMatch, and skos:broadMatch. Tools like Boomer can be used to address this (in part). The Semantic Mapping Reasoning Assembler (SeMRA) can also be configured with prior knowledge about UMLS mapping assumptions when aggregating and reasoning over semantic mappings at scale.
With that in mind, anything that can be loaded as an ontology in PyOBO can also be exported as SSSOM. For UMLS, this looks like:
import pyobo
df = pyobo.get_sssom_df("umls", names=False)
df.to_csv("umls.sssom.tsv", sep="\t", index=False)
Note: You can set names=True to have PyOBO look up the names for all entities, but this is a bit of a rabbit hole since it requires getting and processing many external resources.
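The first caveat above (mappings aren’t all 1-to-1) can be checked on any SSSOM table with a few lines of standard-library Python. The sketch below groups subjects by the object they map to and reports objects with more than one subject. The function names are hypothetical and the sample rows are made-up placeholders in SSSOM’s column layout, not real UMLS mappings.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set

# Made-up rows using SSSOM column names; the identifiers are placeholders
SAMPLE_TSV = (
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\n"
    "umls:C0000001\toboInOwl:hasDbXref\tmesh:D000001\tsemapv:UnspecifiedMatching\n"
    "umls:C0000002\toboInOwl:hasDbXref\tmesh:D000001\tsemapv:UnspecifiedMatching\n"
    "umls:C0000003\toboInOwl:hasDbXref\tmesh:D000002\tsemapv:UnspecifiedMatching\n"
)


def subjects_per_object(rows: Iterable[Mapping[str, str]]) -> Dict[str, Set[str]]:
    """Group subject_id values by the object_id they map to."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        grouped[row["object_id"]].add(row["subject_id"])
    return dict(grouped)


def many_to_one_objects(rows: Iterable[Mapping[str, str]]) -> Dict[str, Set[str]]:
    """Keep only the objects that more than one subject maps to."""
    return {obj: subs for obj, subs in subjects_per_object(rows).items() if len(subs) > 1}


sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
ambiguous = many_to_one_objects(sample_rows)
```

To run this on a real export, the same csv.DictReader call can be pointed at an open umls.sssom.tsv file instead of the in-memory sample, provided the file starts directly with the header row.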
There’s much more to say about UMLS and SSSOM, but this is a good place to pause and publish this post, since getting UMLS as SSSOM is a task a lot of people have asked me for help with lately. I might also come back and explain more about how I use the other resources from UMLS’s UTS.