Lecture 8 Graph Databases
Lecture 8 Graph Databases
6
Basic Terminology
● Data: a set of entities and their relationships
○ => we need to efficiently represent graphs
● Basic operations:
■ finding the neighbors of a node,
■ checking if two nodes are connected by an edge,
■ updating the graph structure, …
○ => we need efficient graph operations
● Graph G = (V, E) is usually modelled as
○ set of nodes (vertices) V, |V| = n
○ set of (directed) edges E = (V1,V2), |E| = m
● Which data structure to use?
7
Data Structure: Adjacency Matrix
● Two-dimensional array A of
n ⨉ n Boolean values
○ Indexes of the array = node
identifiers of the graph
○ Boolean value Aij indicates
whether nodes i, j are connected
● Variants:
○ (Un)directed graphs
○ Weighted graphs…
8
Adjacency Matrix: Properties
● Pros:
○ Adding/removing edges
○ Checking if 2 nodes are
connected
● Cons:
○ Quadratic space: O(n2)
○ We usually have sparse graphs
○ Adding nodes is expensive
○ Retrieval of all the neighboring
nodes takes linear time: O(n)
9
Data Structure: Adjacency List
● A set of lists, each enumerating
neighbors of one node
○ Vector of n pointers to adjacency lists
● Undirected graph:
○ An edge connects nodes i and j
○ => the adjacency list of i contains
node j and vice versa
● Often compressed
○ Exploiting regularities in graphs 10
Adjacency List: Properties
● Pros:
○ Getting the neighbors of a node
○ Cheap addition of nodes
○ More compact representation of
sparse graphs
● Cons:
○ Checking if there is an edge
between two nodes
■ Optimization: sorted lists => logarithmic
scan, but also logarithmic insertion
11
Data Structure: Incidence Matrix
● Two-dimensional Boolean
matrix of n rows and m columns
○ Each row represents a node
■ All edges that are connected to the node
○ Each column represents an edge
■ Nodes that are connected by a certain edge
12
Incidence Matrix: Properties
● Pros:
○ Representation of hypergraphs
■ where one edge connects an
arbitrary number of nodes
● Cons:
○ Requires n ⨉ m bits (for most
graphs m ⋙ n)
○ Listing neighborhood is slow
13
Data Structure: Laplacian Matrix
● Two-dimensional array of
n ⨉ n integers
○ Similar structure as adjacency matrix
○ Diagonal of the Laplacian matrix
indicates the degree of the node
○ The rest of positions are set to -1 if
the two vertices are connected, 0
otherwise
14
Laplacian Matrix: Properties
All features of adjacency matrix
● Pros:
○ Analyzing the graph structure by
means of spectral analysis
■ Calculating the number of spanning trees
■ Approximation of the sparsest cut of the
graph
■ Calculate eigenvalues of the matrix
○ A good summary: Wikipedia
15
A Bit of a Theory
16
Basic Graph Algorithms
● Access all nodes reachable from a given source:
○ Breadth-first Search (BFS)
○ Depth-first Search (DFS)
● Shortest path between two nodes
● Single-source shortest path problem
○ BFS (unweighted),
○ Dijkstra (nonnegative weights),
○ Bellman-Ford algorithm
● All-pairs shortest path problem
○ Floyd-Warshall algorithm
http://en.wikipedia.org/wiki/Shortest_path_problem 17
Improving Data Locality
● Performance of the read/write operations
○ Depends also on physical organization of the data
○ Objective: Achieve the best “data locality”
● Spatial locality:
○ if a data item has been accessed, the nearby data items
are likely to be accessed in the following computations
■ e.g., during graph traversal
● Strategy:
○ in graph adjacency matrix representation, exchange rows
and columns to improve the disk cache hit ratio
○ Specific methods: BFSL, Bandwidth of a Matrix, ... 18
Data Locality: Example
19
Breadth First Search Layout (BFSL)
● Input: vertices of a graph
● Output: a permutation of the vertices
○ with better cache performance for graph traversals
● BFSL algorithm:
1. Select a node (at random, the origin of the traversal)
2. Traverse the graph using the BFS alg.
■ generating a list of vertex identifiers in the order they are visited
3. Take the generated list as the new vertices permutation
20
Breadth First Search Layout (2)
● Let us recall:
Breadth First Search (BFS)
○ FIFO queue of frontier vertices
22
Matrix Bandwidth: Formalization
● The minimum bandwidth problem
○ Bandwidth of a row in a matrix = the maximum distance
between nonzero elements, where one is left of the
diagonal and the other is right of the diagonal
○ Bandwidth of a matrix = maximum bandwidth of its rows
Graph partitioning
24
Graph Partitioning
● Some graphs are too large to be fully loaded into
the main memory of a single computer
○ Usage of secondary storage degrades the performance
○ Scalable solution: distribute the graph on multiple nodes
25
Example: 1-Dimensional Partitioning
● Aim: Partition the graph to solve BFS efficiently
○ Distributed into shared-nothing parallel system
○ Partitioning of the adjacency matrix
26
Starting BFS traversal at node 1:
1. (at black) 1 -> 10, 11
visit green server
2. (at green) 10, 11 ->
a. 1, back to black
b. 6, visit red
c. 7,9, visit blue
d. 10, 11, myself
3. (at red) 6 -> 7
visit blue
3. (at blue) 7,9 ->
a. 3, back to black …
b. 6, back to red
c. 8 -> 2,3, back to black
d. 10,11,12, back to green
27
Traversing Graph
● Traversing with 1D partitioning (e.g., BFS)
1. Each processor keeps information about frontier vertices
2. ...and also list of neighboring vertices in other processors
3. Messages are sent to other processors…
29
Types of Graphs
● Single-relational graphs
○ Edges are homogeneous in meaning
■ e.g., all edges represent friendship
30
Graph Databases
● A graph database = a set of graphs
31
Transactional DBs: Queries
● Types of Queries
○ Subgraph queries
■ Searches for a specific pattern in the graph database
■ Query = a small graph
● or a graph, where some parts are uncertain, e.g., vertices with wildcard labels
■ More general type: allow sub-graph isomorphism
32
Transactional DBs: Queries (2)
○ Super-graph queries
■ Search for graphs whose whole structure is contained in the query graph
34
Subgraph Query Processing
1. Mining-based Graph Indexing Techniques
○ Idea: if some features of query graph q do not exist in data
graph G, then G cannot contain q as its subgraph
○ Apply graph-mining methods to extract some features
(sub-structures) from the graph database members
■ e.g., frequent sub-trees, frequent sub-graphs
○ An inverted index is created for each feature
35
Mining-based Technique
● Example method: GIndex [2004]
○ Indexing “frequent discriminative graphs”
○ Build inverted index for selected discriminative subgraphs
36
Non Mining-based Techniques
● Example: GString (2007)
○ Model the graphs in the context of organic chemistry
using basic structures
■ Line = series of vertices connected end to end
■ Cycle = series of vertices that form a close loop
■ Star = core vertex directly connects to several vertices
37
Non Mining-based Techniques
● GDIndex (2007)
○ all connected and induced subgraphs of a given graph are
enumerated (at most 2n)
○ due to isomorfisms, there much less subgraphs.
■ if all labels are identical, a complete graph of size n is decomposed into
just n+1 subgraphs.
38
Graph Databases
Non-transactional Databases
39
Non-transactional Databases
● A few very large graphs
○ e.g., Web graph, social networks, …
● Queries:
○ Nodes/edges with properties
○ Neighboring nodes/edges
○ Paths (all, shortest, etc.)
40
Basic Characteristics
● Different types of relationships between nodes
○ To represent relationships between domain entities
○ Or to model any kind of secondary relationships
■ Category, path, time-trees, spatial relationships, …
41
Relationship Properties: Example
42
source: Sadalage & Fowler: NoSQL Distilled, 2012
Graph DB vs. RDBMS
● RDBMS designed for a single type of relationship
○ “Who is my manager”
● Adding another relationship usually means a lot of
schema changes
44
Neo4j: Basic Info
● Open source graph database
○ The most popular
● Initial release: 2007
● Written in: Java
● OS: cross-platform
● Stores data as nodes connected
by directed, typed relationships
○ With properties on both
nodes and relationships
45
Neo4j: Basic Features
● reliable – with full ACID transactions
● durable and fast – disk-based, native storage engine
● scalable – up to several billion nodes/relationships/properties
● highly-available – when distributed (replicated)
● expressive – powerful, human readable graph query language
● fast – powerful traversal framework
● embeddable - in Java program
● accessible – simple REST interface & Java API
46
Data Model: Nodes
● Fundamental unit: node
● Nodes have properties
○ Key-value pairs
○ null is not a valid property value
■ nulls can be modelled by the absence of a key
http://db-engines.com/en/system/Neo4j 47
Data Model: Properties
Type Description
boolean true/false
49
What How
What How
get all paths for a file incoming file and symbolic link relationships
get all files in a directory outgoing file and symbolic link relationships,
depth one
get all files in a directory, excluding outgoing file relationships, depth one
symbolic links
get all files in a directory, recursively outgoing file and symbolic link relationships 50
Access to Neo4j
● Embedded database in Java system
● Language-specific connectors
○ Libraries to connect to a running Neo4j server
● Cypher query language
○ Standard language to query graph data
● HTTP REST API
● Gremlin graph traversal language (plugin)
● etc.
51
Neo4J: Native Java API & Graph Traversal
52
Native Java Interface: Example
Node irena = graphDb.createNode();
irena.setProperty("name", "Irena");
Node jirka = graphDb.createNode();
jirka.setProperty("name", "Jirka");
● Undirected edge:
○ Relationship between the nodes in both directions
○ INCOMING and OUTGOING relationships from a node
53
Data Model: Path & Traversal
● Path = specific nodes + connecting relationships
○ Path can be a result of a query or a traversal
● .relationships()
○ Specify the relationship types to traverse
■ e.g., traverse only edge types: FRIEND, RELATIVE
■ Empty (default) = traverse all relationships
○ Can also specify direction
■ Direction.BOTH
■ Direction.INCOMING
■ Direction.OUTGOING
56
Traversal Framework – Java API (2)
● org.neo4j...Evaluator
○ Used for deciding at each node: should the traversal
continue, and should the node be included in the result
■ INCLUDE_AND_CONTINUE: Include this node in the result and
continue the traversal
■ INCLUDE_AND_PRUNE: Include this node, do not continue traversal
■ EXCLUDE_AND_CONTINUE: Exclude this node, but continue traversal
■ EXCLUDE_AND_PRUNE: Exclude this node and do not continue
○ Pre-defined evaluators:
■ Evaluators.toDepth(int depth) /
Evaluators.fromDepth(int depth),
■ Evaluators.excludeStartPosition()
■ …
57
Traversal Framework – Java API (3)
● org.neo4j...Uniqueness
○ Indicates under what circumstances a traversal may
revisit the same position in the graph
● Traverser
○ Starts actual traversal given a TraversalDescription and
starting node(s)
○ Returns an iterator over “steps” in the traversal
■ Steps can be: Path (default), Node, Relationship
○ The graph is actually traversed “lazily” (on request)
58
Example of Traversal
TraversalDescription desc =
db.traversalDescription()
.depthFirst()
.relationships( Rels.KNOWS,
Direction.BOTH )
.evaluator(Evaluators.toDepth(3));
60
Neo4J: Cypher Language
61
Cypher Language
● Neo4j graph query language
○ For querying and updating
● Declarative – we say what we want
○ Not how to get it
○ Not necessary to express traversals
● Human-readable
● Inspired by SQL and SPARQL
● Still growing = syntax changes are often
http://neo4j.com/docs/stable/cypher-query-lang.html 62
Cypher: Clauses
● MATCH: The graph pattern to match
● WHERE: Filtering criteria
● RETURN: What to return
● WITH: Divides a query into multiple parts
○ can define starting points in the graph
66
Cypher: Queries (2)
MATCH (andres: Person {name: 'Andres'})-[*1..3]-(node)
RETURN andres, node ;
(find all ‘nodes’ within three hops from ‘Andres’)
MATCH p=shortestPath(
(andres:Person {name: 'Andres'})-[*]-(david {name:'David'})
)
RETURN p ;
(find the shortest connection between ‘Andres’ and ‘David’)
67
Neo4J: Behind the Scene
68
Neo4j Internals: Indexes
CREATE INDEX ON :Person(name);
(Create index on property name of nodes with label Person)
Indexes added: 1
70
Neo4j Internals: High Availability
● Master-slave replication
○ Several Neo4j slave databases can be configured to be
exact replicas of a single Neo4j master database
73
Graph DBs: Suitable Use Cases
● Connected Data
○ Social networks
○ Any link-rich domain is well suited for graph databases
● Routing, Dispatch, and Location-Based Services
○ Node = location or address that has a delivery
○ Graph = nodes where a delivery has to be made
○ Relationships = distance
● Recommendation Engines
○ “your friends also bought this product”
○ “when buying this item, these others are usually bought”
74
Graph DBs: Modeling Issues
● Node modeling:
● tradeoff between placing all attributes and properties in
a single node,
● and separating each attribute into an individual node.
● Relationship modeling:
○ “unlabeled” all,
■ e.g., person connected_to person/address/product
○ versus semantic meaning encoded labels
■ e.g., person peters_work_colleague person,
person peters_home_address address
75
Graph DBs: When Not to Use
● If we want to update all or a subset of entities
○ Changing a property on many nodes is not straightforward
■ e.g., analytics solution where all entities may need to be updated with a
changed property
76
Questions?
77
References
● I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a
NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
● RNDr. Irena Holubova, Ph.D. MMF UK course NDBI040:
Big Data Management and NoSQL Databases
● Sherif Sakr - Eric Pardede: Graph Data Management:
Techniques and Applications
● Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A
Brief Guide to the Emerging World of Polyglot
Persistence. Addison-Wesley Professional, 192 p.
● http://neo4j.com/docs/stable/
78