Spark is a fast and general engine for large-scale data processing. It provides an abstraction called resilient distributed datasets (RDDs), which lets data be cached in memory across a cluster and manipulated with parallel operations. Shark is a system built on Spark that runs SQL queries over large datasets with Spark's speed and generality. This deck discusses Spark and Shark's performance advantages over Hadoop for iterative and interactive applications.
1. Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin
UC Berkeley
spark-project.org
2. What is Spark?
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop
Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
3. What is Shark?
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x
4. Project History
Spark project started in 2009, open sourced in 2010
Shark started summer 2011, alpha April 2012
In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research & others
200+ member meetup, 500+ watchers on GitHub
5. This Talk
Spark programming model
User applications
Shark overview
Demo
Next major addition: Streaming Spark
6. Why a New Programming Model?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries
Both multi-stage and interactive apps require faster data sharing across parallel jobs
7. Data Sharing in MapReduce
[Diagram: each iteration reads its input from HDFS and writes its output back to HDFS before the next iteration; each ad-hoc query re-reads the input from HDFS]
Slow due to replication, serialization, and disk IO
8. Data Sharing in Spark
[Diagram: one-time processing loads the input into distributed memory; subsequent iterations and queries read from memory]
10-100× faster than network and disk
9. Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure
Interface
» Clean language-integrated API in Scala
» Can be used interactively from the Scala console
10. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")           // Base RDD
errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count     // Action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker caches its block of messages (Cache 1-3 over Blocks 1-3) in memory and returns results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
11. Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) and use it to recompute lost data
E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
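To make the lineage idea concrete, here is a minimal standalone Scala sketch (illustrative only, not Spark's actual internals) that models an RDD's lineage as a replayable chain of transformations:

    // Illustrative model: each node remembers its parent and the function applied,
    // so a lost result can be rebuilt by replaying compute() from the source.
    sealed trait Lineage[A] { def compute(): Seq[A] }

    case class Source(path: String, read: String => Seq[String]) extends Lineage[String] {
      def compute(): Seq[String] = read(path)
    }
    case class Filtered[A](parent: Lineage[A], f: A => Boolean) extends Lineage[A] {
      def compute(): Seq[A] = parent.compute().filter(f)
    }
    case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
      def compute(): Seq[B] = parent.compute().map(f)
    }

    // The slide's chain: HadoopRDD -> FilteredRDD -> MappedRDD
    // (the stub reader and sample line are hypothetical stand-ins for HDFS)
    val messages = Mapped(
      Filtered(Source("hdfs://...", _ => Seq("error\t12:01\tpayload")),
               (s: String) => s.contains("error")),
      (s: String) => s.split('\t')(2))
    println(messages.compute())  // recomputed from the source, as after a failure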
12. Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()  // Load data in memory once
var w = Vector.random(D)                               // Initial parameter vector

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient          // Repeated MapReduce steps to do gradient descent
}

println("Final w: " + w)
13. Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]
Hadoop: 127 s / iteration
Spark: 174 s first iteration, 6 s per further iteration
14. Supported Operators

map            reduce         sample
filter         count          cogroup
groupBy        reduceByKey    take
sort           groupByKey     partitionBy
join           first          pipe
leftOuterJoin  union          save
rightOuterJoin cross          ...
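A brief sketch exercising a few of these operators; the org.apache.spark package name and local-mode setup are assumptions that postdate this talk, and the word list is made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object OperatorDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("demo"))

        val words = sc.parallelize(Seq("spark", "shark", "spark", "hive", "spark"))

        // map + reduceByKey: classic word count
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

        // filter + count: how many elements say "spark"?
        val sparkMentions = words.filter(_ == "spark").count()

        // sample: take a ~50% random subset without replacement
        val subset = words.sample(withReplacement = false, fraction = 0.5)

        println(counts.collect().mkString(", ") + "; spark mentions: " + sparkMentions)
        sc.stop()
      }
    }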
15. Other Engine Features
General graphs of operators (e.g. map-reduce-reduce)
Hash-based reduces (faster than Hadoop's sort)
Controlled data partitioning to lower communication (see the sketch below)

PageRank Performance
[Chart: iteration time (s)] Hadoop: 171; Basic Spark: 72; Spark + Controlled Partitioning: 23
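As a sketch of the controlled-partitioning idea behind the PageRank numbers (a minimal example assuming a SparkContext named sc, as in the demo above; the input file name is hypothetical):

    import org.apache.spark.HashPartitioner

    // Hash-partition the link table once and cache it. Ranks derived with
    // mapValues keep the same partitioner, so the per-iteration join does
    // not have to shuffle the large links dataset across the network.
    val links = sc.textFile("links.txt")
      .map { line => val parts = line.split(" "); (parts(0), parts(1)) }
      .groupByKey(new HashPartitioner(8))
      .cache()
    var ranks = links.mapValues(_ => 1.0)     // co-partitioned with links
    val joined = links.join(ranks)            // local join, no shuffle of links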
17. User Applications
In-memory analytics & anomaly detection (Conviva)
Interactive queries on data streams (Quantifind)
Exploratory log analysis (Foursquare)
Traffic estimation w/ GPS data (Mobile Millennium)
Twitter spam classification (Monarch)
. . .
18. Conviva GeoReport
[Chart: time (hours)] Hive: 20; Spark: 0.5
Group aggregations on many keys w/ same filter
40× gain over Hive from avoiding repeated reading, deserialization, and filtering
19. Mobile Millennium Project
Estimate city traffic from crowdsourced GPS data
Iterative EM algorithm scaling to 160 nodes
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
21. Motivation
Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes
Scala is good for programmers, but many data users only know SQL
Can we extend Hive to run on Spark?
22. Hive Architecture
[Diagram: Client (CLI, JDBC) → Driver (Meta store; SQL Query Parser; Query Optimizer; Physical Plan; Execution) → MapReduce → HDFS]
23. Shark Architecture
[Diagram: same as Hive, but the Driver adds a Cache Mgr. and the Physical Plan executes on Spark instead of MapReduce, over HDFS]
[Engle et al, SIGMOD 2012]
24. Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead
Instead, Shark employs column-oriented storage using arrays of primitive types

Row Storage: (1, john, 4.1) (2, mike, 3.5) (3, sally, 6.4)
Column Storage: [1, 2, 3] [john, mike, sally] [4.1, 3.5, 6.4]
25. Efficient In-Memory Storage (cont.)
(Same row vs. column layout as the previous slide)
Benefit: similarly compact size to serialized data, but >5x faster to access
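A minimal Scala sketch of the row-vs-column contrast above (illustrative types only, not Shark's implementation):

    // Row storage: one object per record; each carries JVM per-object overhead.
    case class Record(id: Int, name: String, score: Double)
    val rows = Array(Record(1, "john", 4.1), Record(2, "mike", 3.5), Record(3, "sally", 6.4))

    // Column storage: one array per field; the id and score columns are
    // primitive arrays with no per-record headers or pointer chasing.
    val ids    = Array(1, 2, 3)
    val names  = Array("john", "mike", "sally")
    val scores = Array(4.1, 3.5, 6.4)

    // Scanning a single column touches contiguous memory:
    val avgScore = scores.sum / scores.length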
26. Using Shark

CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported
Can also call from Scala to mix with Spark (see the sketch below)
Early alpha release at shark.cs.berkeley.edu
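A sketch of mixing HiveQL with Spark operations from Scala. The API surface here is hypothetical: the sharkContext value, the sql2rdd call, and the row accessors are assumed for illustration and may not match the alpha exactly:

    // Run a HiveQL query, get the result back as an RDD, then keep going in Spark.
    val visits = sharkContext.sql2rdd(
      "SELECT sourceIP, adRevenue FROM userVisits WHERE adRevenue > 0")

    // Continue with ordinary Spark transformations on the query result.
    val topSpenders = visits
      .map(row => (row.getString(0), row.getDouble(1)))
      .reduceByKey(_ + _)
      .top(10)(Ordering.by(_._2))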
27. Benchmark Query 1

SELECT * FROM grep WHERE field LIKE '%XYZ%';

[Chart: execution time (s)] Shark (cached): 12 s; Shark: 182 s; Hive: 207 s
28. Benchmark
Query
2
SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R, userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN ‘1999-01-01’ AND ‘2000-01-01’
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;
Shark (cached) 126s
Shark 270s
Hive 447s
0 100 200 300 400 500
Execution Time (secs)
30. What's Next?
Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)
Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.
31. Streaming Spark
Extends Spark to perform streaming computations
Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
Intermix seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLowerCase.split(" "))
  .map(word => (word, 1))
  .reduceByWindow(5, _ + _)

[Diagram: the stream is chopped into batches at T=1, T=2, …; each batch runs map and reduceByWindow]
[Zaharia et al, HotCloud 2012]
32. Streaming Spark (cont.)
(Same slide, adding:) Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency
33. Streaming Spark (cont.)
(Same slide, adding:) Alpha coming this summer
34. Conclusion
Spark and Shark speed up your interactive and complex analytics on Hadoop data
Download and docs: www.spark-project.org
» Easy to run locally, on EC2, or on Mesos (and soon YARN)
User meetup: meetup.com/spark-users
Training camp at Berkeley in August!
matei@berkeley.edu / @matei_zaharia
35. Behavior with Not Enough RAM
[Chart: iteration time (s) vs. % of working set in memory] Cache disabled: 68.8; 25%: 58.1; 50%: 40.7; 75%: 29.7; Fully cached: 11.5
36. Software Stack
[Diagram: Shark (Hive on Spark), Bagel (Pregel on Spark), Streaming Spark, … all run on Spark; Spark runs in local mode, on Apache Mesos, on EC2, or on YARN]