SPARQL/Basics
SPARQL may look complicated, but the simple basics will already get you a long way – if you want, you can stop reading after this chapter, and you’ll already know enough to write many interesting queries. The other chapters just add information about more topics that you can use to write different queries. Each of them will empower you to write even more awesome queries, but none of them are necessary – you can stop reading at any point and hopefully still walk away with a lot of useful knowledge!
Also, if you’ve never heard of Wikidata, SPARQL, or WDQS before, here’s a short explanation of those terms:
- Wikidata is a knowledge database. It contains lots of statements, like “the capital of Canada is Ottawa”, or “the Mona Lisa is painted in oil paint on poplar wood”, or “gold has a thermal capacity of 25.418 joule per mole kelvin”.
- SPARQL is a language to formulate questions (queries) for knowledge databases. With the right database, a SPARQL query could answer questions like “what is the most popular tonality in music?” or “which character was portrayed by the most actors?” or “what’s the distribution of blood types?” or “which authors’ works entered the public domain this year?”.
- WDQS, the Wikidata Query Service, brings the two together: You enter a SPARQL query, it runs it against Wikidata’s dataset and shows you the result.
SPARQL basics
A simple SPARQL query looks like this:
SELECT ?a ?b ?c
WHERE
{
x y ?a.
m n ?b.
?b f ?c.
}
The SELECT clause lists variables that you want returned (variables start with a question mark), and the WHERE clause contains restrictions on them, mostly in the form of triples. All information in Wikidata (and similar knowledge databases) is stored in the form of triples; when you run the query, the query service tries to fill in the variables with actual values so that the resulting triples appear in the knowledge database, and returns one result for each combination of variables it finds.
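To make the “fill in the variables” step concrete, here is a minimal Python sketch of triple matching. It is only a toy model under stated assumptions: the fruit facts are made up for illustration, the function names (`match_pattern`, `solve`, `select`) are hypothetical, and a real query service such as WDQS is far more sophisticated than this brute-force search.

```python
# Toy knowledge base: each fact is a (subject, predicate, object) triple.
# These facts are invented for illustration; they are not Wikidata data.
TRIPLES = {
    ("lemon", "color", "yellow"),
    ("lemon", "taste", "sour"),
    ("banana", "color", "yellow"),
    ("banana", "taste", "sweet"),
    ("lime", "color", "green"),
    ("lime", "taste", "sour"),
}


def is_variable(term):
    """Variables start with a question mark, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, bindings):
    """Try to extend `bindings` so that `pattern` becomes equal to `triple`.

    Returns the extended bindings, or None if the triple does not fit.
    """
    new_bindings = dict(bindings)
    for term, value in zip(pattern, triple):
        if is_variable(term):
            # Bind the variable, or check it against its earlier binding.
            if new_bindings.setdefault(term, value) != value:
                return None
        elif term != value:
            return None  # a fixed value in the pattern does not match
    return new_bindings


def solve(patterns, triples):
    """Return every combination of variable values that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


def select(variables, patterns, triples=TRIPLES):
    """Keep only the requested variables of each solution, like SELECT."""
    rows = [tuple(s[v] for v in variables) for s in solve(patterns, triples)]
    return sorted(rows)
```

With these made-up facts, `select(["?fruit"], [("?fruit", "color", "yellow"), ("?fruit", "taste", "sour")])` returns `[("lemon",)]`: the banana is yellow but not sour, and the lime is sour but not yellow, so only one combination satisfies both triples.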
A triple can be read like a sentence (which is why it ends with a period), with a subject, a predicate, and an object:
SELECT ?fruit
WHERE
{
?fruit hasColor yellow.
?fruit tastes sour.
}
The results for this query could include, for example, “lemon”. In Wikidata, most properties are “has”-kind properties, so the query might instead read:
SELECT ?fruit
WHERE
{
?fruit color yellow.
?fruit taste sour.
}
which reads like “?fruit has color ‘yellow’” (not “?fruit is the color of ‘yellow’” – keep this in mind for property pairs like “parent”/“child”!).
However, that’s not a good example for WDQS. Taste is subjective, so Wikidata doesn’t have a property for it. Instead, let’s think about parent/child relationships, which are mostly unambiguous.
Our first query
Suppose we want to list all children of the baroque composer Johann Sebastian Bach. Using pseudo-elements like in the queries above, how would you write that query?
Hopefully you got something like this:
SELECT ?child
WHERE
{
# either this...
?child parent Bach.
# or this...
?child father Bach.
# or this.
Bach child ?child.
# (note: everything after a ‘#’ is a comment and ignored by WDQS.)
}
The first two triples say that the ?child must have the parent/father Bach; the third says that Bach must have the child ?child. Let’s go with the second one for now.
So what remains to be done in order to turn this into a proper WDQS query? On Wikidata, items and properties are not identified by human-readable names like “father” (property) or “Bach” (item). (For good reason: “Johann Sebastian Bach” is also the name of a German painter, and “Bach” might also refer to the surname, the French commune, the Mercury crater, etc.) Instead, Wikidata items and properties are assigned an identifier. To find the identifier for an item, we search for the item and copy the Q-number of the result that sounds like it’s the item we’re looking for (based on the description, for example). To find the identifier for a property, we do the same, but search for “P:search term” instead of just “search term”, which limits the search to properties. This tells us that the famous composer Johann Sebastian Bach is Q1339, and the property to designate an item’s father is P22.
And last but not least, we need to include prefixes. For simple WDQS triples, items should be prefixed with wd:, and properties with wdt:. (But this only applies to fixed values – variables don’t get a prefix!)
Putting this together, we arrive at our first proper WDQS query:
SELECT ?child
WHERE
{
# ?child father Bach
?child wdt:P22 wd:Q1339.
}
Open this query in WDQS (for example via its “Try it” link), then click “Run”. What do you get?
child |
---|
wd:Q57225 |
wd:Q76428 |
… |
Well, that’s disappointing. You just see the identifiers. You can click on them to see their Wikidata page (including a human-readable label), but isn’t there a better way to see the results?
Well, as it happens, there is! (Aren’t rhetorical questions great?) If you include the magic text
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
somewhere within the WHERE clause, you get additional variables: For every variable ?foo in your query, you now also have a variable ?fooLabel, which contains the label of the item behind ?foo. If you add this to the SELECT clause, you get the item as well as its label:
SELECT ?child ?childLabel
WHERE
{
# ?child father Bach
?child wdt:P22 wd:Q1339.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Try running that query – you should see not only the item numbers, but also the names of the various children.
child | childLabel |
---|---|
wd:Q57225 | Johann Christoph Friedrich Bach |
wd:Q76428 | Carl Philipp Emanuel Bach |
… | … |
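If you would rather run the query from a script than in the browser, the following Python sketch shows one possible way, using only the standard library. It assumes the public endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results format; the function names and the User-Agent string are placeholders invented for this example, so replace the latter with something that identifies your own script.

```python
import json
import urllib.parse
import urllib.request

# Public WDQS endpoint (assumed here; check the WDQS documentation).
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?child ?childLabel
WHERE
{
  # ?child father Bach
  ?child wdt:P22 wd:Q1339.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_url(query):
    """Encode the query as URL parameters and ask for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return ENDPOINT + "?" + params


def fetch(query, user_agent="sparql-basics-example/0.1 (placeholder)"):
    """Send the query to WDQS and return the decoded JSON response."""
    request = urllib.request.Request(
        build_url(query), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def rows(result, variables):
    """Turn SPARQL JSON results into plain tuples, one per result row."""
    return [
        tuple(binding[v]["value"] if v in binding else None for v in variables)
        for binding in result["results"]["bindings"]
    ]


if __name__ == "__main__":
    # Requires network access.
    for child, label in rows(fetch(QUERY), ["?child".lstrip("?"), "childLabel"]):
        print(child, label)
```

Note that the JSON results contain full URIs such as http://www.wikidata.org/entity/Q57225 rather than the wd:Q57225 shorthand shown in the tables above.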
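This completes the basics. Try amending the query by varying the properties: for example, look up the identifier of another property as described above and use it in place of wdt:P22.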