Exploring Event Venues in Wikidata
I was working on making data about scholarly conferences more FAIR when a big question crossed my mind: what are all the conference venues? This post covers some queries I wrote for Wikidata, data issues I found, a few drive-by curations I did while looking for an answer, and my ideas for the future.
Querying Wikidata
Wikidata is always a good place to start looking for structured data because it contains a detailed and multidisciplinary ontology whose classes, subclass relationships, and instances are all queryable via SPARQL.
It has a top-level class for event venues (Q18674739), which even comes with a high-level schematic diagram of how this class relates to performing arts spaces.
I used the following SPARQL query to retrieve all the subclasses and investigate which might be relevant for conferences.
SELECT DISTINCT ?venueType ?venueTypeLabel ?venueTypeDescription
WHERE
{
  # event venue (Q18674739) and all of its direct and indirect subclasses
  ?venueType wdt:P279* wd:Q18674739 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
The results from this query are embedded in the table below:
The results contain a lot of irrelevant venue types such as cinemas, theaters, and clubs. I made a list of these and refined the query to exclude them, like so:
SELECT DISTINCT ?venueType ?venueTypeLabel ?venueTypeDescription
WHERE
{
  ?venueType wdt:P279* wd:Q18674739 .
  FILTER(?venueType NOT IN (wd:Q41253, wd:Q622425)) # this list is shortened for demo purposes
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
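As the exclusion list grows during curation, it may be easier to generate the query from a plain list of identifiers than to edit it by hand. The following is a minimal Python sketch of that idea; the function name `build_subclass_query` and its parameters are hypothetical and not part of any Wikidata tooling. Note that `NOT IN` only removes the listed classes themselves, so subclasses of an excluded class still appear unless they are listed too.

```python
import re

# Wikidata item identifiers look like "Q" followed by digits.
QID_PATTERN = re.compile(r"Q[1-9]\d*")


def build_subclass_query(root_qid, excluded_qids=()):
    """Build a SPARQL query listing subclasses of root_qid, minus excluded classes."""
    for qid in (root_qid, *excluded_qids):
        if not QID_PATTERN.fullmatch(qid):
            raise ValueError(f"not a Wikidata item identifier: {qid!r}")
    lines = [
        "SELECT DISTINCT ?venueType ?venueTypeLabel ?venueTypeDescription",
        "WHERE",
        "{",
        f"  ?venueType wdt:P279* wd:{root_qid} .",
    ]
    if excluded_qids:
        excluded = ", ".join(f"wd:{qid}" for qid in excluded_qids)
        lines.append(f"  FILTER(?venueType NOT IN ({excluded}))")
    lines.append(
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }'
    )
    lines.append("}")
    return "\n".join(lines)


# The identifiers below are the ones used in the queries above.
query = build_subclass_query("Q18674739", ["Q41253", "Q622425"])
print(query)
```

The printed text can be pasted into the Wikidata Query Service as-is, and the list of excluded identifiers could just as well be read from a spreadsheet column.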
Data Exploration
I encountered two noteworthy data issues the first time I ran this query:
- Some results had Wikidata identifiers in the labels column
- Some results appeared to be for specific event venues, and not classes
The first issue can be attributed to the fact that not all Wikidata records have labels in English. When no label is available in any of the requested languages, the label service shows the identifier instead. This doesn’t mean that these records aren’t valuable; it’s also an opportunity to improve the data in Wikidata through manual or semi-automated translation.
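Rows affected by the first issue can be picked out of exported results by checking whether the label is nothing more than a bare identifier. Here is a small Python sketch, assuming the results have been loaded as dictionaries with `venueType` and `venueTypeLabel` keys; that row layout, the helper name `rows_missing_labels`, and the sample rows are all assumptions made for illustration.

```python
import re

# A label that is only "Q" plus digits is treated as a missing label.
QID_LABEL = re.compile(r"Q\d+")


def rows_missing_labels(rows):
    """Return the rows whose label is just a bare Wikidata identifier."""
    return [row for row in rows if QID_LABEL.fullmatch(row["venueTypeLabel"])]


# Made-up sample rows, not real query output.
rows = [
    {"venueType": "Q18674739", "venueTypeLabel": "event venue"},
    {"venueType": "Q999999999", "venueTypeLabel": "Q999999999"},
]
missing = rows_missing_labels(rows)
print([row["venueType"] for row in missing])
```

The resulting list could serve as a to-do list for adding labels.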
The second issue is related to the ontology structure of Wikidata. Because this is a query over subclasses of event venues, specific venues shouldn’t show up: a Wikidata record representing a specific venue should use the instance of (P31) relationship, not subclass of (P279), to state what kind of event venue it is.
I identified a few groups of event venues that seemed to have this issue:
- dive sites, like Q14213916
- cinemas in Brazil, like Q123856958
- museums in Sweden, like Q60628067
I was able to go through and update the relationship types to be correct, so if you’re reading this later, you won’t see the issue. That’s the power of drive-by curation!
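One possible heuristic for finding more records like these is to look for "subclasses" that carry a coordinate location (P625), since an abstract class of venue rarely has coordinates of its own. This is an assumption made for illustration, not how the records above were found, and it will miss specific venues that have no coordinates. The sketch below builds such a query in Python; the function name `build_suspect_query` is hypothetical.

```python
def build_suspect_query(root_qid):
    """Build a query for subclasses of root_qid that also have coordinates.

    Assumption: an item that is a subclass (P279) of a venue class but also
    has a coordinate location (P625) is probably a specific venue and is
    worth checking by hand.
    """
    return "\n".join([
        "SELECT DISTINCT ?item ?itemLabel",
        "WHERE",
        "{",
        # P279+ (one or more steps) leaves out the root class itself.
        f"  ?item wdt:P279+ wd:{root_qid} .",
        "  ?item wdt:P625 ?coordinates .",
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }',
        "}",
    ])


suspect_query = build_suspect_query("Q18674739")
print(suspect_query)
```

Every hit still needs a manual look before changing its relationship type, because some legitimate classes may carry coordinates.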
Follow-up
The next step is to make a spreadsheet in which I curate whether each event venue subclass is relevant for conferences.
After that, I can construct a SPARQL query that uses the VALUES syntax to pick the list of superclasses, then the wdt:P31/wdt:P279* property path to get all instances of any subclass of the selected event venue classes.
SELECT DISTINCT ?venue ?venueLabel
WHERE
{
  VALUES ?venueType { wd:Q1329623 }
  # instances (P31) of the selected class or of any of its subclasses (P279*)
  ?venue wdt:P31/wdt:P279* ?venueType .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
At the time of writing, this query resulted in 8,728 distinct records in about 27 seconds.
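Once the curated spreadsheet exists, the `VALUES` block could be generated from it and the results turned into plain rows. The Python sketch below shows one way this might look; the helper names `build_instance_query` and `parse_venues` are hypothetical, the sample payload is invented, and the parsing assumes the standard SPARQL JSON results layout (`results` → `bindings`).

```python
def build_instance_query(class_qids):
    """Build a query for instances of the given classes or any of their subclasses."""
    values = " ".join(f"wd:{qid}" for qid in class_qids)
    return "\n".join([
        "SELECT DISTINCT ?venue ?venueLabel",
        "WHERE",
        "{",
        f"  VALUES ?venueType {{ {values} }}",
        "  ?venue wdt:P31/wdt:P279* ?venueType .",
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }',
        "}",
    ])


def parse_venues(payload):
    """Turn a SPARQL JSON results payload into (identifier, label) pairs."""
    venues = []
    for binding in payload["results"]["bindings"]:
        # Entity URIs end in the item identifier, e.g. ".../entity/Q123".
        qid = binding["venue"]["value"].rsplit("/", 1)[-1]
        label = binding.get("venueLabel", {}).get("value", qid)
        venues.append((qid, label))
    return venues


# Identifiers taken from the queries above; combining them is only for illustration.
instance_query = build_instance_query(["Q1329623", "Q41253"])

# Invented payload in the SPARQL JSON results format, not real query output.
sample_payload = {
    "head": {"vars": ["venue", "venueLabel"]},
    "results": {
        "bindings": [
            {
                "venue": {"type": "uri", "value": "http://www.wikidata.org/entity/Q123"},
                "venueLabel": {"type": "literal", "value": "Example Hall"},
            }
        ]
    },
}
print(parse_venues(sample_payload))
```

Keeping query construction separate from result parsing means the list of classes can change without touching the code that consumes the venues.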
While Wikidata is a great place to start, its nature as a fully open and community-driven resource has the drawback of being less trustworthy than resources that have an expert curation component (whether manual or semi-automated).
Persistent Identifier (PID) resources like the Research Organization Registry (ROR) have demonstrated that there’s added value in having an expert-curated layer that both draws from and contributes back to Wikidata.
I’m working towards seeding a registry for event venues with a similar philosophy at https://github.com/event-venue-registry/evr. Importantly, this resource will follow the Open Data, Open Code, Open Infrastructure (O3) principles, meaning that the data and code are all in one GitHub repository that anyone can contribute to. Please reach out if you’d like to help.