After the BioCypher preprint went up on arXiv, I checked the missing co-author items list on the Scholia page that reflects my Wikidata entry. Besides the several co-authors of the BioCypher manuscript that I don’t know personally, I was curious to see which of my other papers had incomplete co-author annotations. This post has a few SPARQL queries that I used to look into this, as well as a few open questions I have about the relationship between distinct entries for preprints and published articles.

First, I wrote two SPARQL queries for the Wikidata Query Service:

There were around 200 co-authors that I had included through a painstaking combination of manual curation and usage of the Author Disambiguator tool. I then looked at the ambiguous authors, i.e., authors that are only stored by name via author name string (P2093) instead of by reference to a Wikidata entry via author (P50). At the time, these were co-authors of only four unique manuscripts:

  1. Democratising Knowledge Representation with BioCypher
  2. A Simple Standard for Sharing Ontological Mappings (SSSOM)
  3. Leveraging Structured Biological Knowledge for Counterfactual Inference: A Case Study of Viral Pathogenesis
  4. Ontology Development Kit: a toolkit for building, maintaining, and standardising biomedical ontologies
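To make the distinction between author (P50) and author name string (P2093) concrete, here is a minimal Python sketch that builds a SPARQL query listing works which link a given person via P50 but still carry at least one unresolved P2093 name string. This is only an illustration of the idea and not necessarily the query behind the Scholia list; the function name `unresolved_coauthor_query` is hypothetical, and the query assumes the default prefixes of the Wikidata Query Service.

```python
def unresolved_coauthor_query(author_qid: str) -> str:
    """Build a SPARQL query for works by `author_qid` that still have
    co-authors stored only as author name strings (P2093).

    Hypothetical helper for illustration; assumes the default prefixes
    (wd:, wdt:, wikibase:, bd:) of the Wikidata Query Service.
    """
    # Guard against malformed identifiers being spliced into the query.
    if not (author_qid.startswith("Q") and author_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item identifier: {author_qid!r}")
    return f"""
SELECT ?work ?workLabel (COUNT(?name) AS ?unresolved)
WHERE {{
  ?work wdt:P50 wd:{author_qid} ;   # the person is linked as an item
        wdt:P2093 ?name .           # but some co-author is only a string
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
GROUP BY ?work ?workLabel
ORDER BY DESC(?unresolved)
""".strip()


if __name__ == "__main__":
    # Q47475003 is the author item used in the query later in this post.
    print(unresolved_coauthor_query("Q47475003"))
```

The printed query can be pasted into the Wikidata Query Service. Each row is one work, with the number of name strings that still need to be matched to items.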

The appearance of the BioCypher manuscript on this list was not surprising because we had just posted the preprint on arXiv. However, I very clearly remember carefully curating all of the co-authors of the other papers. After more careful inspection, it turns out that I had indeed done this curation for the preprints of each of these articles, but the ones appearing on the list corresponded to duplicate Wikidata entries, not for the preprints, but for the published papers. This led me to a few questions, which I don’t have answers for yet:

  1. Should there be two different entries for a preprint and a publication?
  2. What’s even the right word for the dichotomy between a preprint and a publication? I don’t think it’s post-print.
  3. It appears that there are multiple different entries between preprints and publications, so question 1) is a “perfect world” question. Since, in reality, there are duplicates, how should we handle them?
    • Should we connect them via some kind of relationship? I recently noticed Tiago Lubiana had been using followed by (P156) for papers he had curated and have started using that myself in some cases (see notes below).
    • Should we merge these two entries into one? There are various properties for the identifiers within preprint servers that can help point to preprints, though preprints are given different DOIs than the publication, so maybe this would create confusion.
    • How will this work with the advent of “overlay journals”, like what eLife is doing by more heavily relying on peer review attached to existing preprints?
  4. To what extent does this confusion affect Wikidata content related to me (e.g., my papers)?

In order to assess my own Wikidata cleanliness, I wrote the following SPARQL query:

SELECT ?preprint ?preprintDate ?followedBy ?article ?articleDate ?label
WHERE 
{
  VALUES ?author { wd:Q47475003 }
  ?preprint wdt:P31 wd:Q580922 ;
    wdt:P50 ?author ;
    rdfs:label ?preprintLabel .
  ?article wdt:P31 wd:Q13442814 ;
    wdt:P50 ?author ;
    rdfs:label ?label .
  OPTIONAL { ?preprint wdt:P577 ?preprintDate }
  OPTIONAL { ?article wdt:P577 ?articleDate }
  OPTIONAL { ?preprint wdt:P156 ?followedBy }
  FILTER (LCASE(?preprintLabel) = LCASE(?label))
  FILTER (?preprint != ?article)
  FILTER (LANG(?preprintLabel) = "en")
  FILTER (LANG(?label) = "en")
}
ORDER BY DESC(?articleDate)

Here are the live results from running that SPARQL query, embedded via the Wikidata Query Service:

At the time of writing, there are 15 duplicates (based on case-insensitive string matching). I’ve begun curating the followed-by relationships, but am holding out since I might be able to come up with a script to automatically generate appropriate QuickStatements.
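As a starting point for such a script, here is a minimal Python sketch. It assumes the query results have already been flattened into plain dictionaries whose keys mirror the query variables (`preprint`, `article`, `followedBy`) and whose values are entity URIs. It emits one tab-separated QuickStatements (V1 format) line per pair that has no followed by (P156) statement yet. The helper names and the example identifiers are placeholders, not real preprint/article pairs.

```python
from typing import Iterable, List, Mapping, Optional


def qid(uri: str) -> str:
    """Return the bare identifier (e.g. Q123) from an entity URI."""
    return uri.rsplit("/", 1)[-1]


def followed_by_statements(rows: Iterable[Mapping[str, Optional[str]]]) -> List[str]:
    """Build QuickStatements V1 lines linking preprints to articles via P156.

    Hypothetical helper: each row is assumed to have "preprint" and
    "article" keys and, optionally, "followedBy".
    """
    lines = []
    seen = set()
    for row in rows:
        if row.get("followedBy"):
            continue  # already curated, nothing to add
        pair = (qid(row["preprint"]), qid(row["article"]))
        if pair in seen:
            continue  # the query can return the same pair more than once
        seen.add(pair)
        lines.append(f"{pair[0]}\tP156\t{pair[1]}")
    return lines


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    example_rows = [
        {
            "preprint": "http://www.wikidata.org/entity/Q4115189",
            "article": "http://www.wikidata.org/entity/Q13406268",
            "followedBy": None,
        },
    ]
    print("\n".join(followed_by_statements(example_rows)))
```

Since the matching is based only on titles, the generated lines should be reviewed by hand before being submitted, because two different works can share a title.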

Interestingly, the fact that these are different entries allows for an alternate view that gives insight into the turnaround time from preprint date to publication date. Since I typically post the preprint and submit the paper for peer review simultaneously, this is an interesting statistic.
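A minimal sketch of how that turnaround could be computed from the query results is below. It assumes the rows are plain dictionaries with `preprintDate` and `articleDate` strings in the `2022-01-31T00:00:00Z` form that the Wikidata Query Service typically returns for publication date (P577). The function names and the example dates are made up for illustration.

```python
from datetime import datetime
from statistics import median
from typing import Iterable, List, Mapping, Optional

# Timestamp layout assumed for P577 values in query results.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_date(value: str) -> datetime:
    """Parse a timestamp such as 2022-01-31T00:00:00Z."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def turnaround_days(rows: Iterable[Mapping[str, Optional[str]]]) -> List[int]:
    """Days from preprint date to article date for rows that have both."""
    days = []
    for row in rows:
        preprint_date = row.get("preprintDate")
        article_date = row.get("articleDate")
        if not preprint_date or not article_date:
            continue  # both dates are OPTIONAL in the query
        days.append((parse_date(article_date) - parse_date(preprint_date)).days)
    return days


if __name__ == "__main__":
    # Made-up dates for illustration, not real results.
    example_rows = [
        {"preprintDate": "2022-01-01T00:00:00Z", "articleDate": "2022-01-31T00:00:00Z"},
        {"preprintDate": "2021-03-01T00:00:00Z", "articleDate": "2022-03-01T00:00:00Z"},
        {"preprintDate": "2022-05-01T00:00:00Z", "articleDate": None},
    ]
    days = turnaround_days(example_rows)
    print(days, median(days))
```

One caveat is that publication dates on Wikidata are sometimes only precise to the month or year, in which case the day counts are approximate.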


Usually I try to write blog posts about something I made or some insight that I got out of working on something, but I’m not really sure where to go from here. Ideally, I’d like to see the entirety of PubMed, PMC, other major scholarly article indexes, arXiv, bioRxiv, other preprint servers, and other bibliographic content automatically aligned in full on Wikidata. I’ve heard that there are concerns about the technical limitations of the service, so this might not be feasible in the near future. In the meantime, if you’ve got some answers for my questions, please let me know.