Spark is a fast and general engine for large-scale data processing. It provides an abstraction called resilient distributed datasets (RDDs), which lets data be cached in memory across a cluster and manipulated with parallel operations. Shark is a system built on Spark that runs SQL queries over large datasets with Spark's speed and generality. This deck discusses Spark and Shark's performance advantages over Hadoop for iterative and interactive applications.
1. Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin
UC Berkeley
spark-project.org
2. What is Spark?
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop
Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
3. What is Shark?
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x
4. Project History
Spark project started in 2009, open sourced in 2010
Shark started summer 2011, alpha April 2012
In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research & others
200+ member meetup, 500+ watchers on GitHub
5. This Talk
Spark programming model
User applications
Shark overview
Demo
Next major addition: Streaming Spark
6. Why a New Programming Model?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries
Both multi-stage and interactive apps require faster data sharing across parallel jobs
7. Data Sharing in MapReduce
[Diagram: each iteration reads its input from HDFS and writes its output back to HDFS before the next iteration; each ad-hoc query re-reads the input from HDFS]
Slow due to replication, serialization, and disk IO
8. Data Sharing in Spark
[Diagram: one-time processing loads the input into distributed memory; subsequent iterations and queries read from memory]
10-100× faster than network and disk
9. Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure
Interface
» Clean language-integrated API in Scala
» Can be used interactively from the Scala console
10. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")           // Base RDD
errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count     // Action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker caches its block of messages (Cache 1-3 over Blocks 1-3) in memory and returns results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
11. Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) and use it to recompute lost data
E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
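To make the lineage idea concrete, here is a minimal standalone Scala sketch (illustrative only, not Spark's actual internals) that models an RDD's lineage as a replayable chain of transformations:

    // Illustrative model: each node remembers its parent and the function applied,
    // so a lost result can be rebuilt by replaying compute() from the source.
    sealed trait Lineage[A] { def compute(): Seq[A] }

    case class Source(path: String, read: String => Seq[String]) extends Lineage[String] {
      def compute(): Seq[String] = read(path)
    }
    case class Filtered[A](parent: Lineage[A], f: A => Boolean) extends Lineage[A] {
      def compute(): Seq[A] = parent.compute().filter(f)
    }
    case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
      def compute(): Seq[B] = parent.compute().map(f)
    }

    // The slide's chain: HadoopRDD -> FilteredRDD -> MappedRDD
    // (the stub reader and sample line are hypothetical stand-ins for HDFS)
    val messages = Mapped(
      Filtered(Source("hdfs://...", _ => Seq("error\t12:01\tpayload")),
               (s: String) => s.contains("error")),
      (s: String) => s.split('\t')(2))
    println(messages.compute())  // recomputed from the source, as after a failure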
12. Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()  // Load data in memory once
var w = Vector.random(D)                               // Initial parameter vector

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient          // Repeated MapReduce steps to do gradient descent
}

println("Final w: " + w)
13. Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]
Hadoop: 127 s / iteration
Spark: 174 s first iteration, 6 s per further iteration
14. Supported Operators

map            reduce         sample
filter         count          cogroup
groupBy        reduceByKey    take
sort           groupByKey     partitionBy
join           first          pipe
leftOuterJoin  union          save
rightOuterJoin cross          ...
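A brief sketch exercising a few of these operators; the org.apache.spark package name and local-mode setup are assumptions that postdate this talk, and the word list is made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object OperatorDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("demo"))

        val words = sc.parallelize(Seq("spark", "shark", "spark", "hive", "spark"))

        // map + reduceByKey: classic word count
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

        // filter + count: how many elements say "spark"?
        val sparkMentions = words.filter(_ == "spark").count()

        // sample: take a ~50% random subset without replacement
        val subset = words.sample(withReplacement = false, fraction = 0.5)

        println(counts.collect().mkString(", ") + "; spark mentions: " + sparkMentions)
        sc.stop()
      }
    }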
15. Other Engine Features
General graphs of operators (e.g. map-reduce-reduce)
Hash-based reduces (faster than Hadoop's sort)
Controlled data partitioning to lower communication (see the sketch below)

PageRank Performance
[Chart: iteration time (s)] Hadoop: 171; Basic Spark: 72; Spark + Controlled Partitioning: 23
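As a sketch of the controlled-partitioning idea behind the PageRank numbers (a minimal example assuming a SparkContext named sc, as in the demo above; the input file name is hypothetical):

    import org.apache.spark.HashPartitioner

    // Hash-partition the link table once and cache it. Ranks derived with
    // mapValues keep the same partitioner, so the per-iteration join does
    // not have to shuffle the large links dataset across the network.
    val links = sc.textFile("links.txt")
      .map { line => val parts = line.split(" "); (parts(0), parts(1)) }
      .groupByKey(new HashPartitioner(8))
      .cache()
    var ranks = links.mapValues(_ => 1.0)     // co-partitioned with links
    val joined = links.join(ranks)            // local join, no shuffle of links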
17. User Applications
In-memory analytics & anomaly detection (Conviva)
Interactive queries on data streams (Quantifind)
Exploratory log analysis (Foursquare)
Traffic estimation w/ GPS data (Mobile Millennium)
Twitter spam classification (Monarch)
. . .
18. Conviva GeoReport
[Chart: time (hours)] Hive: 20; Spark: 0.5
Group aggregations on many keys w/ same filter
40× gain over Hive from avoiding repeated reading, deserialization, and filtering
19. Mobile Millennium Project
Estimate city traffic from crowdsourced GPS data
Iterative EM algorithm scaling to 160 nodes
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
21. Motivation
Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes
Scala is good for programmers, but many data users only know SQL
Can we extend Hive to run on Spark?
22. Hive Architecture
[Diagram: Client (CLI, JDBC) → Driver (Meta store; SQL Query Parser; Query Optimizer; Physical Plan; Execution) → MapReduce → HDFS]
23. Shark Architecture
[Diagram: same as Hive, but the Driver adds a Cache Mgr. and the Physical Plan executes on Spark instead of MapReduce, over HDFS]
[Engle et al, SIGMOD 2012]
24. Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead
Instead, Shark employs column-oriented storage using arrays of primitive types

Row Storage: (1, john, 4.1) (2, mike, 3.5) (3, sally, 6.4)
Column Storage: [1, 2, 3] [john, mike, sally] [4.1, 3.5, 6.4]
25. Efficient In-Memory Storage (cont.)
(Same row vs. column layout as the previous slide)
Benefit: similarly compact size to serialized data, but >5x faster to access
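A minimal Scala sketch of the row-vs-column contrast above (illustrative types only, not Shark's implementation):

    // Row storage: one object per record; each carries JVM per-object overhead.
    case class Record(id: Int, name: String, score: Double)
    val rows = Array(Record(1, "john", 4.1), Record(2, "mike", 3.5), Record(3, "sally", 6.4))

    // Column storage: one array per field; the id and score columns are
    // primitive arrays with no per-record headers or pointer chasing.
    val ids    = Array(1, 2, 3)
    val names  = Array("john", "mike", "sally")
    val scores = Array(4.1, 3.5, 6.4)

    // Scanning a single column touches contiguous memory:
    val avgScore = scores.sum / scores.length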
26. Using Shark

CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported
Can also call from Scala to mix with Spark (see the sketch below)
Early alpha release at shark.cs.berkeley.edu
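A sketch of mixing HiveQL with Spark operations from Scala. The API surface here is hypothetical: the sharkContext value, the sql2rdd call, and the row accessors are assumed for illustration and may not match the alpha exactly:

    // Run a HiveQL query, get the result back as an RDD, then keep going in Spark.
    val visits = sharkContext.sql2rdd(
      "SELECT sourceIP, adRevenue FROM userVisits WHERE adRevenue > 0")

    // Continue with ordinary Spark transformations on the query result.
    val topSpenders = visits
      .map(row => (row.getString(0), row.getDouble(1)))
      .reduceByKey(_ + _)
      .top(10)(Ordering.by(_._2))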
27. Benchmark Query 1

SELECT * FROM grep WHERE field LIKE '%XYZ%';

[Chart: execution time (s)] Shark (cached): 12 s; Shark: 182 s; Hive: 207 s
28. Benchmark
Query
2
SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R, userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN ‘1999-01-01’ AND ‘2000-01-01’
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;
Shark (cached) 126s
Shark 270s
Hive 447s
0 100 200 300 400 500
Execution Time (secs)
30. What's Next?
Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)
Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.
31. Streaming Spark
Extends Spark to perform streaming computations
Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
Intermix seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLowerCase.split(" "))
  .map(word => (word, 1))
  .reduceByWindow(5, _ + _)

[Diagram: the stream is chopped into batches at T=1, T=2, …; each batch runs map and reduceByWindow]
[Zaharia et al, HotCloud 2012]
32. Streaming Spark (cont.)
(Same slide, adding:) Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency
33. Streaming Spark (cont.)
(Same slide, adding:) Alpha coming this summer
34. Conclusion
Spark and Shark speed up your interactive and complex analytics on Hadoop data
Download and docs: www.spark-project.org
» Easy to run locally, on EC2, or on Mesos (and soon YARN)
User meetup: meetup.com/spark-users
Training camp at Berkeley in August!
matei@berkeley.edu / @matei_zaharia
35. Behavior with Not Enough RAM
[Chart: iteration time (s) vs. % of working set in memory] Cache disabled: 68.8; 25%: 58.1; 50%: 40.7; 75%: 29.7; Fully cached: 11.5
36. Software Stack
[Diagram: Shark (Hive on Spark), Bagel (Pregel on Spark), Streaming Spark, … all run on Spark; Spark runs in local mode, on Apache Mesos, on EC2, or on YARN]