I was looking at the different NFDI consortia in the Research Organization Registry (ROR), and found that the only two that have a parent relations to the NFDI (ror:05qj6w324) are NFDI4DS (ror:00bb4nn95) and MaRDI (ror:04ncnzm65). This felt strange to me, so I started looking around Wikidata to see if I could automatically make a curation sheet to send along to them. I found that Wikidata already has detailed pages for all NFDI consortia, and that they also include relationships to the parent. This blog post is about the steps I took to write a workflow to find relationships in Wikidata that are appropriate for submission to ROR.

Getting Wikidata

In Wikidata, an entity can be annotated with a ROR identifier via property P6782. I wanted to write a SPARQL query for the Wikidata Query Service to retrieve all triples for which both the subject and object have and ROR identifier.

SELECT ?subject ?subjectROR ?subjectLabel ?predicate ?object ?objectROR ?objectLabel
{
  ?subject ?predicate ?object ;
           wdt:P6782 ?subjectROR .
  ?object wdt:P6782 ?objectROR .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}

While I now know this query should return about 67K rows, at the time, I ran into the issue that it was too complicated and caused the Wikidata Query Service to timeout. The next step in any investigation with a blasphemous ?subject ?predicate ?object pattern is to look into the predicates and try to cut them down. I set to reformulating the query to count the frequency of appearance of each predicate.

SELECT DISTINCT ?p ?pLabel (COUNT(?p) as ?count)
{
  ?subject wdt:P6782 ?subjectROR;
           ?predicate ?object .
  ?object wdt:P6782 ?objectROR .
  ?p wikibase:directClaim ?predicate .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
GROUP BY ?p ?pLabel
ORDER BY DESC(?count)

This query uses the sneaky wikibase:directClaim to map between the wd: entity namespace and wdt: direct property namespace so the query service could look up the label for the link. The problem was, this query was still too heavy and caused a timeout. Therefore, I had to simplify the query to just get the counts without the label, then use a second query and join the data externally (I also tried a nested query along the way, but it still timed out).

SELECT DISTINCT ?predicate (COUNT(?predicate) as ?count)
{
  ?subject wdt:P6782 ?subjectROR ;
           ?predicate ?object .
  ?object wdt:P6782 ?objectROR .
}
GROUP BY ?predicate
ORDER BY DESC(?count)

With that out of the way, I tried re-writing the original query by formatting in the 147 predicates I pulled out into the VALUES ?predicate { ... } (abbreviated), like:


SELECT ?subject ?subjectROR ?subjectLabel ?predicate ?object ?objectROR ?objectLabel
{
  VALUES ?predicate { ... }
  ?subject ?predicate ?object ;
           wdt:P6782 ?subjectROR .
  ?object wdt:P6782 ?objectROR .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}

This still caused timeouts, so I resorted to a loop in Python, which also let me simplify the query to skip the Wikidata IDs and just pull out RORs for the subject and object (where the {...} gets replaced with a different property on each):

SELECT ?subjectROR ?objectROR
WHERE {
    ?subjectROR ^wdt:P6782/wdt:{...}/wdt:P6782 ?objectROR .
}

I really like this because it uses paths to reduce the need to specify the middle entities which don’t get used. I don’t know if the SPARQL engine is able to optimize on it, but it’s cool. Maybe not so readable, but cool. The loop created a super-sized TSV with the predicate and labels added back.

The workflow I implemented for this lives in https://github.com/cthoyt/ror-wikidata-enrichment. The data from Wikidata is in this file, licensed under CC0.

Do you want this workflow to better reflect your organization? Check out my other blog post on how to curate data about your research organization: https://cthoyt.com/2021/01/17/organization-organization.html.

Getting ROR

I’ve previously implemented a source in PyOBO that wraps downloading and structuring ROR’s data dump into a readily usable format, so getting ROR’s triples was as easy as:

import pyobo

df = pyobo.get_relations_df("ror")

I also had to map the part of and has part relations from BFO to Wikidata properties. I did this by hand because it was faster than doing it the sustainable way, which would have been to pull the mappings from SSSOM-like annotations in the BFO ontology or from Wikidata itself (since I curated those into Wikidata years ago when we were preparing the (unpublished) relation ontology paper).

I made an intermediate output of all of thet triples here, licensed under CC0.

Putting it all together

While I’m glossing over a few steps that you can grok by reading my python script, it was possible to finish getting the data in the right shape to compare with tools in PyOBO and the Bioregistry

The final step was to take the difference between the Wikidata triples and the ROR triples, filter for triples that make sense within the ROR schema (which for now is just part of and has part relationships), and then dump the results out. There were around 67K records before filtering around 2.8K after filtering. Here are a few examples:

subjectROR subjectLabel predicate predicateLabel objectROR objectLabel
00k4nrj32 Essex County Hospital P361 part of 02wnqcb97 National Health Service
022efad20 University of Gabès P527 has part(s) 01hwc7828 Institut des Régions Arides
04p4gjp18 Center of Excellence on Hazardous Substance Management P361 part of 028wp3y58 Chulalongkorn University
04tnv7w23 École Supérieure Polytechnique d’Antsiranana P361 part of 00pd4qq98 Université d’Antsiranana
02f4ya153 Barro Colorado Island P361 part of 01pp8nd67 Smithsonian Institution

Coda

The point of all of this was to automate adding the missing NFDI consortia relationships to the parent NFDI organization in ROR, because I’m interested in creating queries over the organization landscape related to NFDI to support an upcoming section on Internationalization. And like most things in my work life, I ended up cleaning some data and making upstream contributions along the way. Let’s see how receptive ROR is to this! The triples are all here and I can easily make them a different format for submission.


Caveat: if you look into the data, you might notice that some of the entities don’t have labels. I realized this is happening because I haven’t updated my PyOBO importer to get the 2.0 data dump from ROR, and I’m stuck on old version 1.36. This can be fixed independently of this workflow. Here’s the rows related to the NFDI consortia that need new relations, which are all missing labels until I fix this.

subjectROR subjectLabel predicate predicateLabel objectROR objectLabel
00enhv193   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
02cxb1m07   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
03xrvbe74   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
020tty630   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
04ncnzm65   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
01f5dqg10   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
001jhv750   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
0310v3480   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
01d2qgg03   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
01k9z4a50   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
03a4sp974   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
05wwzbv21   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
0305k8y39   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
0238fds33   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
03f6sdf65   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
0033j3009   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
01vnkaz16   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
01v7r4v08   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
04dy2xw62   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
01xptp363   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
034pbpe12   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
05nfk7108   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
00r0qs524   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
00bb4nn95   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur
03fqpzb44   P361 part of 05qj6w324 Nationale Forschungsdateninfrastruktur