NETS 212: Scalable and Cloud Computing
Graph algorithms in MapReduce
October 15, 2013
© 2013 A. Haeberlen, Z. Ives, University of Pennsylvania
Announcements
No class on October 22nd or 24th (Andreas at IMC in Barcelona)
Special 'catch-up' class on October 30th, 4:30-6:00pm, Location TBA
Please work on HW2 and HW3 (will be released this week)
Any questions about HW2?
If you haven't started yet: Please start early!!!
What we have seen so far
In the first half of the semester, we saw how the map/reduce model could be used to filter, collect, and aggregate data values
This is useful for data with limited structure
We could extract pieces of input data items and collect them to run various reduce operations
We could join two different data sets on a common key
Beyond average/sum/count
Much of the world is a network of relationships and shared features
Members of a social network can be friends, and may have shared interests / memberships / etc.
Customers might view similar movies, and might even be clustered by interest groups
The Web consists of documents with links
Documents are also related by topics, words, authors, etc.
Goal: Develop a toolbox
We need a toolbox of algorithms useful for analyzing data that has both relationships and properties
For the next ~2 lectures we'll start to build this toolbox
Some of the problems are studied in courses you may not have taken yet:
CIS 320 (algorithms), CIS 391/520 (AI), CIS 455 (Web Systems)
So we'll see both the traditional solution and the MapReduce one
Plan for today
Representing data in graphs [NEXT]
Graph algorithms in MapReduce
Computation model
Iterative MapReduce
A toolbox of algorithms
Single-source shortest path (SSSP)
k-means clustering
Classification with Naïve Bayes
Thinking about related objects
[Figure: a small social graph - Alice, Sunita, and Jose connected by friend-of edges; Alice, Sunita, Mikhail, and Jose connected to Facebook and Magna Carta by fan-of edges. Images by Jojo Mendoza, Creative Commons license]
We can represent related objects as a labeled, directed graph
Entities are typically represented as nodes; relationships are typically edges
Nodes all have IDs, and possibly other properties
Edges typically have values, possibly IDs and other properties
Encoding the data in a graph
[Figure: the same social graph, shown without edge labels]
Recall the basic definition of a graph:
G = (V, E), where V is a set of vertices and E is a set of edges of the form (v1, v2) with v1, v2 ∈ V
Assume we only care about connected vertices
Then we can capture a graph simply as the set of its edges
Graph encodings: Set of edges
[Figure: the social graph, encoded as the edge list below]
(Alice, Facebook)
(Alice, Sunita)
(Jose, Magna Carta)
(Jose, Sunita)
(Mikhail, Facebook)
(Mikhail, Magna Carta)
(Sunita, Facebook)
(Sunita, Alice)
(Sunita, Jose)
Graph encodings: Adding edge types
[Figure: the social graph with fan-of and friend-of edge labels]
(Alice, fan-of, Facebook)
(Alice, friend-of, Sunita)
(Jose, fan-of, Magna Carta)
(Jose, friend-of, Sunita)
(Mikhail, fan-of, Facebook)
(Mikhail, fan-of, Magna Carta)
(Sunita, fan-of, Facebook)
(Sunita, friend-of, Alice)
(Sunita, friend-of, Jose)
Graph encodings: Adding weights
[Figure: the social graph with labeled, weighted edges]
(Alice, fan-of, 0.5, Facebook)
(Alice, friend-of, 0.9, Sunita)
(Jose, fan-of, 0.5, Magna Carta)
(Jose, friend-of, 0.3, Sunita)
(Mikhail, fan-of, 0.8, Facebook)
(Mikhail, fan-of, 0.7, Magna Carta)
(Sunita, fan-of, 0.7, Facebook)
(Sunita, friend-of, 0.9, Alice)
(Sunita, friend-of, 0.3, Jose)
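To make these encodings concrete, here is a minimal Python sketch (names are illustrative, not from the slides) that stores the weighted, typed edge set above and derives an adjacency-list view from it:

from collections import defaultdict

# The weighted, typed edge set from the slide above
edges = [
    ("Alice", "fan-of", 0.5, "Facebook"),
    ("Alice", "friend-of", 0.9, "Sunita"),
    ("Jose", "fan-of", 0.5, "Magna Carta"),
    ("Jose", "friend-of", 0.3, "Sunita"),
    ("Mikhail", "fan-of", 0.8, "Facebook"),
    ("Mikhail", "fan-of", 0.7, "Magna Carta"),
    ("Sunita", "fan-of", 0.7, "Facebook"),
    ("Sunita", "friend-of", 0.9, "Alice"),
    ("Sunita", "friend-of", 0.3, "Jose"),
]

# Adjacency-list view: source -> [(relType, weight, target), ...]
adjacency = defaultdict(list)
for src, rel, weight, dst in edges:
    adjacency[src].append((rel, weight, dst))

print(adjacency["Sunita"])   # all edges radiating from Sunita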
Recap: Related objects
We can represent the relationships between related objects as a directed, labeled graph
Vertices represent the objects
Edges represent relationships
We can annotate this graph in various ways
Add labels to edges to distinguish different types
Add weights to edges
...
We can encode the graph in various ways (e.g., as a set of edges or as adjacency lists)
Plan for today
Representing data in graphs
Graph algorithms in MapReduce [NEXT]
Computation model
Iterative MapReduce
A toolbox of algorithms
Single-source shortest path (SSSP)
k-means clustering
Classification with Naïve Bayes
A computation model for graphs
[Figure: the weighted social graph from the previous slides]
Once the data is encoded in this way, we can perform various computations on it
Simple example: Which users are their friends' best friend?
More complicated examples (later): PageRank, adsorption, ...
This is often done by annotating the vertices with additional information, and then propagating the annotations along the edges
A computation model for graphs
[Figure: the weighted social graph; callout: "Slightly more technical: How many of my friends have me as their best friend?"]
Example: Am I my friends' best friend?
Step #1: Discard irrelevant vertices and edges
A computation model for graphs
[Figure: after step 1, only the friend-of edges remain (Mikhail is now isolated); each vertex is annotated with its friendships: Alice with alice→sunita: 0.9; Sunita with sunita→alice: 0.9 and sunita→jose: 0.3; Jose with jose→sunita: 0.3]
Example: Am I my friends' best friend?
Step #1: Discard irrelevant vertices and edges
Step #2: Annotate each vertex with list of friends
Step #3: Push annotations along each edge
A computation model for graphs
[Figure: after pushing, each vertex also holds its neighbors' annotations; e.g., Alice now has sunita→alice: 0.9 and sunita→jose: 0.3 in addition to her own alice→sunita: 0.9]
Example: Am I my friends' best friend?
Step #1: Discard irrelevant vertices and edges
Step #2: Annotate each vertex with list of friends
Step #3: Push annotations along each edge
A computation model for graphs
[Figure: the same annotated graph; each vertex now has all the information it needs]
Example: Am I my friends' best friend?
Step #1: Discard irrelevant vertices and edges
Step #2: Annotate each vertex with list of friends
Step #3: Push annotations along each edge
Step #4: Determine result at each vertex
Can we do this in MapReduce?
Using adjacency list representation?

map(key: node, value: [<otherNode, relType, strength>])
{
}

reduce(key: ________, values: list of _________)
{
}
Can we do this in MapReduce?
Using single-edge data representation?

map(key: node, value: <otherNode, relType, strength>)
{
}

reduce(key: ________, values: list of _________)
{
}

(One possible answer for the adjacency-list variant is sketched below.)
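Here is one possible way to fill in the blanks for the adjacency-list representation, as a Python sketch (illustrative, not the official solution): each node's mapper determines its own best friend and sends that fact to every friend; the reducer then counts, per node, how many friends named it.

def map_fn(node, edges):
    # edges: [(otherNode, relType, strength), ...] - the adjacency list
    friends = [(other, s) for (other, rel, s) in edges if rel == "friend-of"]
    if not friends:
        return
    best = max(friends, key=lambda f: f[1])[0]   # this node's best friend
    for other, _ in friends:
        yield other, (node, best)   # tell each friend who my best friend is

def reduce_fn(node, values):
    # values: [(friend, friendsBestFriend), ...]
    yield node, sum(1 for (_, best) in values if best == node)

For the single-edge representation, we would first need a stage that groups each node's edges together (e.g., a preliminary reduce), since no single record contains the whole neighborhood.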
A real-world use case
A variant that is actually used in social networks today: "Who are the friends of multiple of my friends?"
Where have you seen this before? Friend recommendation!
Maybe these people should be my friends too!
Generalizing
Now suppose we want to go beyond direct friend relationships
Example: How many of my friends' friends (distance-2 neighbors) have me as their best friend's best friend?
What do we need to do? How about distance k > 2?
To compute the answer, we need to run multiple iterations of MapReduce!
Iterative MapReduce
The basic model:

copy files from input dir → staging dir 1
(optional: do some preprocessing)
while (!terminating condition) {
  map from staging dir 1
  reduce into staging dir 2
  move files from staging dir 2 → staging dir 1
}
(optional: postprocessing)
move files from staging dir 1 → output dir

Note that reduce output must be compatible with the map input!
What can happen if we filter out some information in the mapper or in the reducer?
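A minimal sketch of this driver loop in Python (run_mapreduce and converged are assumed helpers standing in for a real job submission and a termination test; they are not a real API):

import os, shutil

def iterate(input_dir, output_dir, max_iters=50):
    shutil.copytree(input_dir, "staging1")            # input dir -> staging dir 1
    for _ in range(max_iters):
        run_mapreduce(src="staging1", dst="staging2") # map from 1, reduce into 2
        done = converged("staging1", "staging2")      # terminating condition?
        shutil.rmtree("staging1")
        os.rename("staging2", "staging1")             # staging dir 2 -> staging dir 1
        if done:
            break
    os.rename("staging1", output_dir)                 # move results to output dir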
Graph algorithms and MapReduce
A centralized algorithm typically traverses a tree or a graph one item at a time (there's only one cursor)
You've learned breadth-first and depth-first traversals
Most algorithms that are based on graphs make use of multiple map/reduce stages, processing one "wave" at a time
Sometimes iterative MapReduce, other times chains of map/reduce
Recap: MapReduce on graphs
Suppose we want to:
compute a function for each vertex in a graph...
... using data from vertices at most k hops away
We can do this as follows:
"Push" information along the edges
"Think like a vertex"
Finally, perform the computation at each vertex
May need more than one MapReduce phase
Iterative MapReduce: outputs of stage i → inputs of stage i+1
Plan for today
Representing data in graphs
Graph algorithms in MapReduce
Computation model
Iterative MapReduce
A toolbox of algorithms [NEXT]
Single-source shortest path (SSSP)
k-means clustering
Classification with Naïve Bayes
Path-based algorithms
Sometimes our goal is to compute information about the paths (sets of paths) between nodes
Edges may be annotated with cost, distance, or similarity
Examples of such problems (see CIS 121/320):
Shortest path from one node to another
Minimum spanning tree (minimal-cost tree connecting all vertices in a graph)
Steiner tree (minimal-cost tree connecting certain nodes)
Single-Source Shortest Path (SSSP)
Given a directed graph G = (V, E) in which each edge e has a cost c(e):
Compute the cost of reaching each node from the source node s in the most efficient way (potentially after multiple 'hops')
[Figure: example graph with source s and nodes a, b, c, d; all distances still unknown (?)]
SSSP: Intuition
We can formulate the problem using induction
The shortest path follows the principle of optimality: the last step (u,v) makes use of the shortest path to u
We can express this as follows:

bestDistanceAndPath(v) {
  if (v == source) then {
    return <distance 0, path [v]>
  } else {
    find argmin_u (bestDistanceAndPath[u] + dist[u,v])
    return <bestDistanceAndPath[u] + dist[u,v], path[u] + v>
  }
}
SSSP: CIS 320-style solution
Traditional approach: Dijkstra's algorithm

// V: vertices, E: edges, S: start node
foreach v in V                        // initialize length and last step of
  dist_S_To[v] := infinity            // path to default values
  predecessor[v] := nil
spSet := {}
Q := V
while (Q not empty) do
  u := Q.removeNodeClosestTo(S)
  spSet := spSet + {u}
  foreach v in V where (u,v) in E     // update length and path based on
    if (dist_S_To[v] > dist_S_To[u] + cost(u,v)) then   // edges radiating from u
      dist_S_To[v] := dist_S_To[u] + cost(u,v)
      predecessor[v] := u
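For reference, a runnable Python version of the same algorithm; this is a sketch that replaces Q.removeNodeClosestTo(S) with a binary heap (same behavior, better asymptotics):

import heapq

def dijkstra(graph, source):
    # graph: {node: [(neighbor, cost), ...]}
    dist = {source: 0}
    predecessor = {}
    sp_set = set()
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)            # node in Q closest to source
        if u in sp_set:
            continue                          # stale heap entry
        sp_set.add(u)
        for v, cost in graph.get(u, []):      # relax edges radiating from u
            if dist.get(v, float("inf")) > d + cost:
                dist[v] = d + cost
                predecessor[v] = u
                heapq.heappush(heap, (dist[v], v))
    return dist, predecessor

# The example graph from the following slides (CLR, 2nd ed., p. 528):
graph = {"s": [("a", 10), ("c", 5)], "a": [("b", 1), ("c", 2)],
         "b": [("d", 4)], "c": [("a", 3), ("b", 9), ("d", 2)],
         "d": [("b", 6), ("s", 7)]}
print(dijkstra(graph, "s")[0])   # final distances: s=0, a=8, b=9, c=5, d=7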
SSSP: Dijkstra in Action
[Figure: the example graph (from CLR, 2nd ed., p. 528) with source s and nodes a, b, c, d; this and the following figures show the algorithm's progress]
Q = {s,a,b,c,d}
spSet = {}
dist_S_To: {(a,∞), (b,∞), (c,∞), (d,∞)}
predecessor: {(a,nil), (b,nil), (c,nil), (d,nil)}
SSSP: Dijkstra in Action
[Figure: s has been processed]
Q = {a,b,c,d}
spSet = {s}
dist_S_To: {(a,10), (b,∞), (c,5), (d,∞)}
predecessor: {(a,s), (b,nil), (c,s), (d,nil)}
SSSP: Dijkstra in Action
[Figure: c has been processed]
Q = {a,b,d}
spSet = {c,s}
dist_S_To: {(a,8), (b,14), (c,5), (d,7)}
predecessor: {(a,c), (b,c), (c,s), (d,c)}
SSSP: Dijkstra in Action
[Figure: d has been processed]
Q = {a,b}
spSet = {c,d,s}
dist_S_To: {(a,8), (b,13), (c,5), (d,7)}
predecessor: {(a,c), (b,d), (c,s), (d,c)}
SSSP: Dijkstra in Action
[Figure: a has been processed]
Q = {b}
spSet = {a,c,d,s}
dist_S_To: {(a,8), (b,9), (c,5), (d,7)}
predecessor: {(a,c), (b,a), (c,s), (d,c)}
SSSP: Dijkstra in Action
[Figure: all nodes processed]
Q = {}
spSet = {a,b,c,d,s}
dist_S_To: {(a,8), (b,9), (c,5), (d,7)}
predecessor: {(a,c), (b,a), (c,s), (d,c)}
SSSP: How to parallelize?
Dijkstra traverses the graph along a single route at a time, prioritizing its traversal to the next step based on total path length (and avoiding cycles)
No real parallelism to be had here!
Intuitively, we want something that radiates from the origin, one edge hop distance at a time
Each step outwards can be done in parallel, before another iteration occurs - or we are done
Recall our earlier discussion: Scalability depends on the algorithm, not (just) on the problem!
SSSP: Revisiting the inductive definition

bestDistanceAndPath(v) {
  if (v == source) then {
    return <distance 0, path [v]>
  } else {
    find argmin_u (bestDistanceAndPath[u] + dist[u,v])
    return <bestDistanceAndPath[u] + dist[u,v], path[u] + v>
  }
}

Dijkstra's algorithm carefully considered each u in a way that allowed us to prune certain points
Instead, we can look at all potential u's for each v
Compute iteratively, by keeping a 'frontier' set of u nodes i edge-hops from the source
SSSP: MapReduce formulation
init: for each node, emit node ID → <∞, -, {<succ-node-ID, edge-cost>}>
map: take node ID → <dist, next, {<succ-node-ID, edge-cost>}>
  (dist is the length of the shortest path we have found so far from the source to node ID, next is the next hop on that path, and the set is the adjacency list for node ID)
  for each succ-node-ID:
    emit succ-node-ID → {<node ID, dist + edge-cost>}
    (this is a new path from the source to succ-node-ID that we just discovered - not necessarily the shortest)
  emit node ID → dist, {<succ-node-ID, edge-cost>}
  (Why is this necessary?)
reduce:
  distance := min cost from a predecessor; next := that predecessor
  emit node ID → <distance, next, {<succ-node-ID, edge-cost>}>
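As a sketch, the map and reduce functions above in Python (a plain simulation, not Hadoop code; INF marks nodes that have not been reached yet):

INF = float("inf")

def map_fn(node, state):
    dist, nxt, adj = state                    # adj: [(succ_node, edge_cost), ...]
    if dist < INF:
        for succ, cost in adj:
            yield succ, ("path", node, dist + cost)   # newly discovered path
    yield node, ("graph", dist, nxt, adj)     # pass along the adjacency list

def reduce_fn(node, values):
    best_dist, best_next, adj = INF, None, []
    for v in values:
        if v[0] == "path":                    # candidate distance via a predecessor
            _, pred, d = v
            if d < best_dist:
                best_dist, best_next = d, pred
        else:                                 # the node's own previous state
            _, old_dist, old_next, adj = v
            if old_dist < best_dist:
                best_dist, best_next = old_dist, old_next
    yield node, (best_dist, best_next, adj)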
Iteration 0: Base case
mapper: (a,<s,10>) (c,<s,5>) + edges
reducer: (a,<10, ...>) (c,<5, ...>)
[Figure: the "wave" has reached a and c; s has distance 0]
Iteration 1
mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>) + edges
reducer: (a,<8, ...>) (c,<5, ...>) (b,<11, ...>) (d,<7, ...>)
[Figure: the "wave" has reached b and d]
Iteration 2
mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,15>) + edges
reducer: (a,<8>) (c,<5>) (b,<11>) (d,<7>)
[Figure: the "wave" has passed over the whole graph]
Iteration 3
mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,15>) + edges
reducer: (a,<8>) (c,<5>) (b,<11>) (d,<7>)
No change! Convergence!
[Figure: the final distances on the graph]
Question: If a vertex's path cost is the same in two consecutive rounds, can we be sure that this vertex has converged?
Summary: SSSP
Path-based algorithms typically involve iterative map/reduce
They are typically formulated in a way that traverses in "waves" or stages, like breadth-first search
This allows for parallelism
They need a way to test for convergence
Example: Single-source shortest path (SSSP)
Original Dijkstra formulation is hard to parallelize
But we can make it work with the "wave" approach
Plan for today
Representing data in graphs
Graph algorithms in MapReduce
Computation model
Iterative MapReduce
A toolbox of algorithms
Single-source shortest path (SSSP)
k-means clustering [NEXT]
Classification with Naïve Bayes
Learning (clustering / classification)
Sometimes our goal is to take a set of entities, possibly related, and group them
If the groups are based on similarity, we call this clustering
If the groups are based on putting them into a semantically meaningful class, we call this classification
Both are instances of machine learning
The k-clustering Problem
[Figure: items in a two-dimensional feature space (axes: Expenses, Age), grouped into clusters]
Given: a set of items in an n-dimensional feature space
Example: data points from a survey, people in a social network
Goal: group the items into k clusters
What would be a 'good' set of clusters?
Approach: k-Means
Let m1, m2, ..., mk be representative points for each of our k clusters
Specifically: the centroid of the cluster
Initialize m1, m2, ..., mk to random values in the data
For t = 1, 2, ...:
Map each observation to the closest mean:
  $S_i^{(t)} = \{ x_j : \|x_j - m_i^{(t)}\| \le \|x_j - m_{i^*}^{(t)}\| \;\forall\, i^* = 1, \ldots, k \}$
Assign each m_i to be the new centroid of its set:
  $m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$
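A minimal NumPy sketch of one iteration, matching the two update rules above (it assumes no cluster ends up empty):

import numpy as np

def kmeans_step(points, means):
    # points: (n, d) array; means: (k, d) array
    # Step 1: map each observation to the closest mean
    dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)               # index of S_i for each point
    # Step 2: recompute each centroid as the mean of its assigned points
    new_means = np.array([points[assignment == i].mean(axis=0)
                          for i in range(len(means))])
    return assignment, new_means

Iterating kmeans_step until new_means stops changing reproduces the fixpoint behavior shown in the example that follows.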
A simple example (1/4)
[Figure: six data points in (Expenses, Age) space: (20,21), (18,20), (30,21), (11,16), (10,10), (15,12)]
A simple example (2/4)
[Figure: the same points; two of them are randomly chosen as initial centers]
A simple example (3/4)
[Figure: points are assigned to the nearest center; the new centroids are (19.75,19.5) and (12.5,11)]
A simple example (4/4)
[Figure: after reassignment, the centroids are (22.67,20.67) and (12,12.67) - Stable!]
k-Means in MapReduce
Map #1:
Input: node ID → <position, centroid ID, [centroid IDs and positions]>
Compute nearest centroid; emit centroid ID → <node ID, position>
Reduce #1:
Recompute centroid position from positions of nodes in it
Emit centroid ID → <node IDs, positions>; and for all other centroid IDs, emit otherCentroidID → centroid(centroidID, X, Y)
(Each centroid will need to know where all the other centroids are)
Map #2:
Pass through values to Reducer #2
Reduce #2:
For each node in the current centroid, emit node ID → <position, centroid ID, [centroid IDs and positions]>
(Input for the next map iteration)
Also, emit <X, <centroid ID, position>>
(This will be the 'result' - remember that we wanted the centroids!)
Repeat until no change
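As a sketch of Map #1 and Reduce #1 from this scheme (a Python simulation; the centroid re-broadcast and the pass-through second job are omitted for brevity):

def map1(node_id, value):
    position, _, centroids = value            # centroids: [(cid, cpos), ...]
    nearest = min(centroids, key=lambda c: sum((p - q) ** 2
                                               for p, q in zip(position, c[1])))
    yield nearest[0], (node_id, position)     # assign node to nearest centroid

def reduce1(centroid_id, values):
    # Recompute the centroid position from the positions of its nodes
    positions = [pos for (_, pos) in values]
    new_pos = tuple(sum(coords) / len(positions) for coords in zip(*positions))
    yield centroid_id, (new_pos, [nid for (nid, _) in values])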
Plan for today
Representing data in graphs
Graph algorithms in MapReduce
Computation model
Iterative MapReduce
A toolbox of algorithms
Single-source shortest path (SSSP)
k-means clustering
Classification with Naïve Bayes [NEXT]
Classification
Suppose we want to learn what is spam (or interesting, or ...)
Predefine a set of classes with semantic meaning
Train an algorithm to look at data and assign a class
Based on giving it some examples of data in each class and the sets of features they have
Many probabilistic techniques exist
Each class has probabilistic relationships with others
e.g., p(spam | isSentLocally), p(isSentLocally | fromBob), ...
Typically represented as a graph(ical model)! See CIS 520
But we'll focus on a simple, flat model: Naïve Bayes
A simple example
Suppose we just look at the keywords in the email's title:
Message(1, "Won contract")
Message(2, "Won award")
Message(3, "Won the lottery")
Message(4, "Unsubscribe")
Message(5, "Millions of customers")
Message(6, "Millions of dollars")
What is the probability that the message "Won Millions" is spam?
Bayes' Theorem:
p(spam | containsWon, containsMillions) = p(spam) * p(containsWon, containsMillions | spam) / p(containsWon, containsMillions)
Classification using Naïve Bayes
Basic assumption: probabilities of events are independent
This is why it is called 'naïve'
Under this assumption,
p(spam) * p(containsWon, containsMillions | spam) / p(containsWon, containsMillions)
= p(spam) * p(containsWon | spam) * p(containsMillions | spam) / (p(containsWon) * p(containsMillions))
= 0.5 * 0.67 * 0.33 / (0.5 * 0.33) = 0.67
So how do we train a learner (compute the above probabilities) using MapReduce?
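A sketch that reproduces these numbers in Python. Note that the slides do not say which of the six example messages are spam; the labels below (messages 3, 5, and 6) are an assumption chosen so that p(spam) = 0.5 and the conditionals match:

messages = {
    1: ("Won contract", False),
    2: ("Won award", False),
    3: ("Won the lottery", True),        # assumed spam
    4: ("Unsubscribe", False),
    5: ("Millions of customers", True),  # assumed spam
    6: ("Millions of dollars", True),    # assumed spam
}

def p(word, spam_only=False):
    pool = [t for (t, s) in messages.values() if s or not spam_only]
    return sum(word in t for t in pool) / len(pool)

p_spam = sum(s for (_, s) in messages.values()) / len(messages)    # 0.5
score = (p_spam * p("Won", spam_only=True) * p("Millions", spam_only=True)
         / (p("Won") * p("Millions")))
print(round(score, 2))    # 0.67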
What do we need to train the learner?
p(spam):
Count how many spam emails there are
Count total number of emails
(Easy)
p(containsXYZ | spam):
Count how many spam emails contain XYZ (job 1)
Count how many spam emails there are (easy)
p(containsXYZ):
Count how many emails contain XYZ overall (job 2)
Count total number of emails (easy)
Training a Naïve Bayes Learner
Job #1 (modified WordCount): count how many emails in each class contain each word
map 1: takes messageId → <class, {words}>; emits <word, class> → 1
reduce 1: emits <word, class> → <count>
Job #2 (WordCount): count how many emails contain each word overall
map 2: takes messageId → <class, {words}>; emits word → 1
reduce 2: emits word → <totalCount>
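A sketch of the two counting jobs, simulated with Counters rather than Hadoop (input records: messageId → (class, words)):

from collections import Counter

def train(messages):
    per_class = Counter()    # job 1: (word, class) -> number of emails
    overall = Counter()      # job 2: word -> number of emails
    for msg_id, (cls, words) in messages.items():
        for w in set(words):             # count each email at most once per word
            per_class[(w, cls)] += 1     # map 1 emits <word, class> -> 1
            overall[w] += 1              # map 2 emits word -> 1
    return per_class, overall            # the reducers just sum the 1s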
Summary: Learning and MapReduce
Clustering algorithms typically have multiple aggregation stages or iterations
k-means clustering repeatedly computes centroids, then maps items to them
Fixpoint computation
Classification algorithms can be quite complex
In general: need to capture conditional probabilities
Naïve Bayes assumes everything is independent
Training is a matter of computing probability distributions
Stay tuned
Next time, you will learn about: PageRank and Adsorption