Unlocking UMLS

The Unified Medical Language System (UMLS) is a widely used biomedical and clinical vocabulary maintained by the United States National Library of Medicine. However, it is notoriously difficult to access and work with due to licensing restrictions and its complex download system. In the same vein as my previous posts about DrugBank and ChEMBL, this post describes open source software I’ve developed for downloading and working with this data. It also works for RxNorm, SemMedDB, SNOMED-CT, and any other data accessible through the UMLS Terminology Services (UTS) ticket granting system.

The first big issue with the UMLS is its licensing. Here’s an excerpt from the How to License and Access the Unified Medical Language System® (UMLS®) Data page accessed on August 28^th, 2023:

Please sign up for a new UMLS Terminology Services (UTS) account with your preferred identity provider at the UTS

 homepage.

Complete and submit the license request form. NLM will send the license approval e-mail within 5 business days

 after reviewing your authenticated license request.

You will sign in using identity provider credentials to download files or access web interfaces that require UTS

 authentication such as the UTS, VSAC, SNOMED CT, or RxNorm.

These are a few big hurdles:

We typically expect scientific data to be available for download without login. Specifically, most data can be downloaded by following a link that points directly to a file. For example, ChEMBL v33 can be downloaded as a gzipped SQLite file from https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_33_sqlite.tar.gz. Rather than providing a data download link, UMLS, has a complicated API called the UMLS Terminology Services (UTS) ticket granting system that needs to be asked for a specific file, polled for a unique access key, then downloaded via an ephemeral (i.e., disappearing) URL that only works once.
We typically expect scientific data to be licensed under a standard, widely used license such as those from Creative Commons. Using well-understood licenses reduces the cognitive and legal burden of consumers when deciding if and how they can reuse, modify, or redistribute data. UMLS uses a non-standard terms of service that makes it more difficult to understand how the data can be stored, modified, or redistributed.
Further, we hope that data is permissively licensed such that it can be re-used, modified, and re-distributed. The Creative Commons CC BY 4.0 and CC0 licenses are golden standards of permissibility. UMLS does not have a permissive license, meaning (from my best interpretation) that you can not redistribute UMLS and you can (probably) not redistribute data derived from UMLS. As an aside, Creative Commons also has license containing clauses to be explicit about restrictions such as the share-alike (SA), non-commercial (NC), and non-distribution (ND). While these clauses aren’t ideal for scientific data, it would at least be nice for UMLS to use a Creative Commons license with the appropriate combination of these clauses (I guess all three) to make it more explicit about its restrictions.
The most bizarre facet of UMLS is that they require you to fill out a user survey each year to keep access.

I want to 1) convert UMLS into an OWL ontology and 2) extract and encode its semantic mappings to external vocabularies like the Medical Subject Headings (MeSH) with Simple Standard for Sharing Ontology Mappings (SSSOM). Given all of these hurdles, it’s probably the case that I am not allowed to redistribute these artifacts.

All together, I consider this a big bummer. The United States National Library of Medicine (NLM) maintains several highly influential resources, but I have found in many instances that they lack a community perspective. Regardless, even as an expat, I pay American taxes, and it makes me upset that the government funds the development and maintenance of resources that I can’t easily use.

How To Break Free

A clip of Queen's "I Want to Break Free" music video

Despite all of this rigamarole, there’s a process to subvert these issues by automating the interaction with the UMLS Terminology Services (UTS) and therefore enabling automated download of UMLS and the following (non-exhaustive) list of resources:

RxNorm
SemMedDB
SNOMED-CT
potentially more in the future

This has been implemented in the open source umls_downloader Python package. It can be installed with the following one-liner in your shell:

$ pip install umls_downloader

Below, I’ll walk you through using it. Throughout, keep in mind that full documentation for the umls_downloader is available at umls-downloader.readthedocs.io, which describes the other functionality and other data that can be downloaded.

Usage

UMLS has three different distributions that are described here. The following Python code downloads the most simple and straightforward file, MRCONSO.RRF as a zip archive:

from umls_downloader import download_umls

path = download_umls(version="2023AA", api_key="<your API key>")

This code is smart and does not need to download the file more than once. It uses pystow to choose a stable path ~/.data/bio/umls relative to the current user’s home directory. Inside this directory, it also uses the version of the data to create a subdirectory. Finally, this function returns the path to the data, such that no file paths ever need to be hard-coded.

Warning This still requires an API key, which requires creating an account, agreeing to UMLS’s terms and conditions, etc. This can be done here: https://uts.nlm.nih.gov/uts/edit-profile.

Automating Configuration of UTS Credentials

There are two ways to automatically set the API key, so you don’t have to worry about getting it and passing it around in your python code:

Set UMLS_API_KEY in the environment. This can be done in your interactive session or in the configuration for your shell such as in a .bashrc file for the Bourne Again Shell (bash).
Create ~/.config/umls.ini and set in the [umls] section a api_key key. Mine looks like:
```
[umls]
api_key=1234567890abcdefghijklmno
```

Now you can omit the api_key keyword like in the following:

from umls_downloader import download_umls

# Same path as before
path = download_umls(version="2023AA")

Download the Latest Version

First, you’ll have to install bioversions with pip install bioversions, whose job it is to look up the latest version of many databases. Then, you can modify the previous code slightly by omitting the version keyword argument:

from umls_downloader import download_umls

# Same path as before (when run on September 1st, 2023)
path = download_umls()

Download and open the file

The UMLS file is zipped, so it’s usually accompanied by the following boilerplate code:

import zipfile
from umls_downloader import download_umls

path = download_umls()
with zipfile.ZipFile(path) as zip_file:
    with zip_file.open("MRCONSO.RRF", mode="r") as file:
        for line in file:
            ...

This exact code is wrapped with the umls_downloader.open_umls() using Python’s context manager, so it can more simply be written as:

from umls_downloader import open_umls

with open_umls() as file:
    for line in file:
        ...

Note The version and api_key arguments work the same for umls_downloader.open_umls() as in umls_downloader.download_umls()

At this point, it’s up to you to decide how you want to consume the MRCONSO.RRF file. Below, I give a demo on how parsed this file in PyOBO in order to convert UMLS to an OWL ontology.

Why not an API?

The UMLS provides an API for access to tiny bits of data at a time. There are even two recent (last 5 years) packages umls-api connect-umls that provide a wrapper around them. However, API access is generally rate limited, difficult to use in bulk, and slow. For working with UMLS (or any other database, for that matter) in bulk, it’s necessary to download full database dumps.

UMLS Conversions

Building on top of the automated download of UMLS, I implemented a fit-for-purpose processor with the PyOBO framework that converts UMLS into an ontology (encoded either as OWL, OBO, or OBO Graph JSON) which can therefore be used to generate semantic mappings in the SSSOM format. The code that implements this can be found here. After installing PyOBO with pip install pyobo, you can automatically download and convert UMLS first into an ontology encoded in the OBO flat file format, then convert to OWL with the following code. Note: you’ll need robot for the second step:

import pyobo

umls = pyobo.get_ontology("umls")

# Write simple OBO Format
umls.write_obo("umls.obo")

# Convert to OWL
from pyobo.utils.misc import obo_to_owl

obo_to_owl("umls.obo", "umls.owl")

In an ideal world, the results of such a conversion could be included as a part of the OBO Database Ingestion, which converts database resources available through PyOBO into ontology artifacts, archives them on GitHub and Zenodo, and gives them PURLs all on a weekly basis to make sure the most up-to-date version is available as well as all previous named versions. Instead, we live in a world with pineapple pizza and restrictive licenses.

One of the nice qualities of UMLS is that it is a semantic mapping hub. It provides mostly complete mappings between many vocabularies including MeSH, NCIT, SNOMED-CT, HPO, LOINC, and more. However, there are a few caveats to consider:

UMLS mappings aren’t all 1-to-1. For example, MeSH mappings typically include many UMLS terms (narrower) pointing to the same MeSH term (broader). For other vocabularies, such as NCBITaxon, UMLS mappings are more reliably 1-to-1. Thanks to Tiago Lubiana for pointing this out.
Mapping provenance is not available, so the mapping_justification field in SSSOM is uniformly filled with sempav:UnspecifiedMatching.
Similarly, UMLS does not apply precise semantic predicates for each mapping. This means that they are output in PyOBO and as SSSOM with oboInOwl:hasDbXref instead of more detailed types such as skos:exactMatch, skos:narrowMatch, and skos:broaderMatch. Tools like Boomer can be used to address this (in part). The Semantic Mapping Reasoning Assembler (SeMRA) can also be configured with prior knowledge about UMLS mapping assumptions when aggregating and reasoning over semantic mappings at scale.

With that in mind, anything that can be loaded as an ontology in PyOBO can also be exported with SSSOM, which I show below. For UMLS, this looks like:

import pyobo

df = pyobo.get_sssom_df("umls", names=False)
df.to_csv("umls.sssom.tsv", sep="\t", index=False)

Note You can set names=True to have PyOBO look up the names for all entities, but this is a bit of a rabbit hole since it requires getting and processing many external resources.

There’s much more to say about UMLS and SSSOM, but this is a good place to pause and publish this post, since getting UMLS as SSSOM is a task a lot of people have asked me for help with lately. I might also come back and explain more about how I use the other resources from UMLS’s UTS.