Efficient Bulk Access to Citations in OpenCitations

OpenCitations aggregates and deduplicates bibliographic information from CrossRef, Europe PubMed Central, and other sources to construct a comprehensive, open index of citations between scientific works. This post describes the opencitations-client package which wraps the OpenCitations API and implements an automated pipeline for locally downloading, caching, and accessing OpenCitations in bulk.

Background

OpenCitations both provides access via an API and bulk data downloads distributed across FigShare and Zenodo. Importantly, it publishes its data under the CC0 public domain license to democratize access to citations - previously, this data was only available through paid access to commercial databases owned by publishers.

While API access can be convenient for ad-hoc usage, it’s generally slow, rate-limited, susceptible to DDoS (e.g., from crawlers), and therefore difficult (if not impossible) to use in bulk. My solution is to write software that automates downloading, processing, and caching databases in bulk and provides fast, highly available, local access. I’ve previously written about developing standalone software packages for several large databases including DrugBank, ChEMBL, UMLS, ORCiD, and ClinicalTrials.gov. Similarly, I maintain several similar workflows in the PyOBO software package for converting resources into ontology-like data structures. I previously wrote about how this looks for HGNC.

Building on an Existing Ecosystem

I’ve been developing a software ecosystem over the last decade to support common workflows in research data management and data integration. When I start a new project, I try and reuse or improve existing components from that ecosystem wherever possible. Importantly, I try and find meaningful ways of organizing code across my ecosystem to reduce duplication, separate concerns, reduce the burden of testing, and ease maintenance.

OpenCitations publishes its bulk data dumps across several records in Figshare and Zenodo. I’ve previously written zenodo-client to interact with Zenodo’s API and orchestrates downloading and caching. zenodo-client heavily builds on pystow, which implements I/O and filesystem operations to enable reproducible, automated downloading, caching, and opening of data.

I had not previously written software to interact with Figshare, so I followed the form of zenodo-client and created a new package, figshare-client. I’m able to quickly create new high-quality packages because I’ve encoded all the wisdom and experience I’ve gained over the years in a Cookiecutter template, cookiecutter-snekpack, which I can use to set up a new project in mere minutes.

Along the way, I realized that the archives in Zenodo and Figshare were a combination of TAR and ZIP archives, each with many CSV files inside. In Python, TAR and ZIP archives have lots of weird quirks, even though they mostly do the same thing. However, rather than addressing those issues in opencitations-client, it made more sense to add utility functions in PyStow in cthoyt/pystow#125 (tar and zip archive iteration), which I was much better able to test in the PyStow archive.

A key functionality of OpenCitations is to implement graph-like queries to find incoming and outgoing citations. I considered several solutions for efficiently caching and querying graph-like data including pickles and SQLite, but these were respectively slow and disk inefficient. I found better solutions based on NumPy’s memory maps and was surprised that I couldn’t find an implementation in a popular package (e.g., SciPy). So, I had to decide where to put an implementation of disk-based cached graph. I didn’t want to put it in OpenCitations nor make a tiny package for just this one operation, so I decided to expand the scope of PyStow and add it there in cthoyt/pystow#121.

Finally, OpenCitations deals with a variety of identifier spaces including first-party OpenCitations Metadata IDs (OMIDs) and OpenCitations Citation IDs (OCIs) as well as third-party identifiers from Wikidata, OpenAlex, PubMed, DOI, and others. I’ve written the curies to handle identifiers in an explicit and transparent way. In the end, the opencitations-client relies on several components from my ecosystem, and of course, several more generic and popular packages. Here’s how the dependencies look:

flowchart LR
    opencitations-client -- depends on --> figshare-client
    opencitations-client -- depends on --> zenodo-client
    opencitations-client -- depends on --> curies
    figshare-client -- depends on --> pystow
    zenodo-client -- depends on --> pystow

Demo

It’s important for software packages to implement simple, top-level APIs that cover 99% of use cases with reasonable defaults. Most use cases for OpenCitations are to get incoming/outgoing citations for a DOI, PubMed identifiers, or OpenCitations identifiers. Here’s how this looks:

from curies import Reference
from opencitations_client import get_incoming_citations, get_outgoing_citations

# a CURIE for the DOI for the Bioregistry paper
bioregistry_curie = "doi:10.1038/s41597-022-01807-3"

# who did the Bioregistry paper cite?
outgoing: list[Reference] = get_outgoing_citations(bioregistry_curie)

# who cited the Bioregistry paper?
incoming: list[Reference] = get_incoming_citations(bioregistry_curie)

Importantly, each of these functions has a backend argument that defaults to api and can be swapped to local. Because everything is built on software that is smart about caching, loading, and data workflows, on the first time backend='local' is used, all processing happens automatically (warning, takes a few hours on a single core). This function also has a return_value argument that can be used to swap between principled curies.Reference data structures that explicitly encode identifiers, simple string local unique identifiers that match the input prefix, or full citation objects (only available through OpenCitations API).

See the opencitations-client code on GitHub (https://github.com/cthoyt/opencitations-client) and documentation on ReadTheDocs (https://opencitations-client.readthedocs.io).

While I’ve been thinking about adding citations to the bibliographic components of knowledge graph construction workflows for several years, I was finally pushed to implement opencitations-client for the Catalaix project, where we’re developing new methods for recycling and reuse of (bio)plastics. I wanted to get all seventeen laboratories’ publications, who they cited, and who cited them as a seed for information extraction and curation. Here’s a small example of a citation network from those queries:

flowchart TD
    26802344["Mechanism-specific and whole-organism ecotoxicity of mono-rhamnolipids.
Blank (2016)"]
34492827["The Green toxicology approach: Insight towards the eco-toxicologically safe development of benign catalysts.
Herres-Pawlis (2021)"]
28779508["Highly Active N,O Zinc Guanidine Catalysts for the Ring-Opening Polymerization of Lactide.
Herres-Pawlis (2017)"]
33195133["Genetic Cell-Surface Modification for Optimized Foam Fractionation.
Blank (2020)"]
32974309["Integration of Genetic and Process Engineering for Optimized Rhamnolipid Production Using
Jupke, Blank (2020)"]
30811863["New Kids in Lactide Polymerization: Highly Active and Robust Iron Guanidine Complexes as Superior Catalysts.
Pich, Herres-Pawlis (2019)"]
30758389["Tuning a robust system: N,O zinc guanidine catalysts for the ROP of lactide.
Pich, Herres-Pawlis (2019)"]
28524364["Biofunctional Microgel-Based Fertilizers for Controlled Foliar Delivery of Nutrients to Plants.
Pich, Schwaneberg (2017)"]
34865895["A plea for the integration of Green Toxicology in sustainable bioeconomy strategies - Biosurfactants and microgel-based pesticide release systems as examples.
Pich, Blank, Schwaneberg (2022)"]
32449840["Robust Guanidine Metal Catalysts for the Ring-Opening Polymerization of Lactide under Industrially Relevant Conditions.
Herres-Pawlis (2020)"]
34492827 --> 30811863
34492827 --> 30758389
34492827 --> 28779508
34492827 --> 32449840
32974309 --> 33195133
34865895 --> 26802344
34865895 --> 32974309
34865895 --> 28524364
34865895 --> 34492827