Query Languages for Graph Databases
Query Languages for Graph Databases
Peter T. Wood
Department of Computer Science and Information Systems, Birkbeck, University of London
ptw@dcs.bbk.ac.uk
w i citizenOf |
w ans
w b ((bornIn | livesIn) · locatedIn∗ )
b w
w Figure 2: Query to find places related to
w l
b l
authors who have won both the Nobel and
Gordimer Carey Bacchus Marsh Victoria Booker prizes.
Figure 1: A graph of authors, prizes they [2], StruQL [26] and UnQL [13], as well as in terms
have won, and places where they were born. of query containment [16, 28], query rewriting [15]
and so on. CRPQs reappeared more recently in
NAGA [43], for example, in a form similar to that
hasW on, N obel and Booker are constants. This
in Figure 2. The ability to query paths using regu-
syntax is reminiscent of Datalog, except that atoms
lar expressions, and hence provide the functionality
in the body do not have predicate names since only
of CRPQs, has only very recently been introduced
a single graph is being queried; indeed, the atoms
in SPARQL 1.1 [36].
are similar to the triple patterns used in SPARQL
However, for a number of problems arising in
[36], the W3C query language for RDF [44].
graph querying CRPQs are insufficiently powerful
It is common in querying graphs that users may
[10]. These problems include comparing semantic
want to find pairs (x, y) of nodes such that there is
associations in RDF graphs [8], comparing biologi-
a path from x to y whose sequence of edge labels
cal sequences [31], and so on. In such settings, we
matches some pattern. One way of specifying such
want to be able to express relations among paths.
a pattern is by using a regular expression defined
For example, the following query Q finds entities x
over the alphabet of edge labels [52]. Such a query is
and y such that the same sequence of edge labels π
called a regular path query (RPQ). So, using the ex-
connects x and y as connects Coetzee and y:
ample of Figure 1, an RPQ using the regular expres-
sion citizenOf | ((bornIn | livesIn) · locatedIn∗ ) (x, y) ← (Coetzee, π, y), (x, π, y), Σ∗ (π) (1)
asks for pairs of author x and place y such that
x is a citizen of y or was born in or lives in y, or Here, π is a path variable, and Σ∗ denotes any se-
where x was born in or lives in some place which quence of edge labels. Q is an example of an ex-
is connected to y by a sequence of any number of tended conjunctive regular path query (ECRPQ),
locatedIn relationships. as proposed in [10].
CQs and RPQs can be combined to form con- We might also want to include paths themselves
junctive regular path queries (CRPQs). For exam- in the output of a query. This has been proposed,
ple, the following query Q adds the conjuncts from for example, as an extension to SPARQL [45], and
the example CQ to those of the RPQ as follows: is also provided by ECRPQs; to return the paths
ans(x, y) as part of the answer to query Q above, one simply
← (x, hasW on, N obel), (x, hasW on, Booker) includes the path variable π in the head. Of course,
(x, (citizenOf | ((bornIn | livesIn) · locatedIn∗ )), y) if there are cycles in the input graph, the answer to
a query may be infinite. In such cases a compact
CRPQs formed the basis of the languages G [21] representation of the set of answers to an ECRPQ
and GraphLog [17], although those languages used can be returned in the form of an automaton [10].
a syntax of graph patterns. Figure 2 shows how the ECRPQs are described in more detail in Section 3.3.
CRPQ Q above would be expressed in GraphLog, Requirements arising from querying and analysing
where there is an obvious mapping between the social networks bring the need for further capabili-
edges in the graph pattern and the atoms in the ties to be provided by graph query languages [5, 24,
body of Q. The thick edge in Figure 2 is called the 60, 61]. In particular, aggregation functions play an
distinguished edge, representing edges which occur essential role in network analysis, while the ability
in the answer of the query; hence, it corresponds to to transform networks by creating new nodes based
the head of Q. on (aggregations of) sets of existing nodes is also
CRPQs were much studied with respect to query- crucial. We discuss these requirements further, and
ing semistructured data in languages such as Lorel provide examples, in Sections 3.4 and 3.5.