SlideShare a Scribd company logo
Spark	
  and	
  Shark	
  
High-­‐Speed	
  In-­‐Memory	
  Analytics	
  
over	
  Hadoop	
  and	
  Hive	
  Data	
  
Matei	
  Zaharia,	
  in	
  collaboration	
  with	
  
Mosharaf	
  Chowdhury,	
  Tathagata	
  Das,	
  Ankur	
  Dave,	
  Cliff	
  Engle,	
  
Michael	
  Franklin,	
  Haoyuan	
  Li,	
  Antonio	
  Lupher,	
  Justin	
  Ma,	
  Murphy	
  
McCauley,	
  Scott	
  Shenker,	
  Ion	
  Stoica,	
  Reynold	
  Xin	
  
	
  
UC	
  Berkeley	
  
spark-­‐project.org	
                                                 UC	
  BERKELEY	
  
What	
  is	
  Spark?	
  
Not	
  a	
  modified	
  version	
  of	
  Hadoop	
  
Separate,	
  fast,	
  MapReduce-­‐like	
  engine	
  
 » In-­‐memory	
  data	
  storage	
  for	
  very	
  fast	
  iterative	
  queries	
  
 » General	
  execution	
  graphs	
  and	
  powerful	
  optimizations	
  
 » Up	
  to	
  40x	
  faster	
  than	
  Hadoop	
  

Compatible	
  with	
  Hadoop’s	
  storage	
  APIs	
  
 » Can	
  read/write	
  to	
  any	
  Hadoop-­‐supported	
  system,	
  
   including	
  HDFS,	
  HBase,	
  SequenceFiles,	
  etc	
  
What	
  is	
  Shark?	
  
Port	
  of	
  Apache	
  Hive	
  to	
  run	
  on	
  Spark	
  
Compatible	
  with	
  existing	
  Hive	
  data,	
  metastores,	
  
and	
  queries	
  (HiveQL,	
  UDFs,	
  etc)	
  
Similar	
  speedups	
  of	
  up	
  to	
  40x	
  
Project	
  History	
  
Spark	
  project	
  started	
  in	
  2009,	
  open	
  sourced	
  2010	
  
Shark	
  started	
  summer	
  2011,	
  alpha	
  April	
  2012	
  
In	
  use	
  at	
  Berkeley,	
  Princeton,	
  Klout,	
  Foursquare,	
  
Conviva,	
  Quantifind,	
  Yahoo!	
  Research	
  &	
  others	
  
200+	
  member	
  meetup,	
  500+	
  watchers	
  on	
  GitHub	
  
This	
  Talk	
  
Spark	
  programming	
  model	
  
User	
  applications	
  
Shark	
  overview	
  
Demo	
  
Next	
  major	
  addition:	
  Streaming	
  Spark	
  
Why	
  a	
  New	
  Programming	
  Model?	
  
MapReduce	
  greatly	
  simplified	
  big	
  data	
  analysis	
  
But	
  as	
  soon	
  as	
  it	
  got	
  popular,	
  users	
  wanted	
  more:	
  
  » More	
  complex,	
  multi-­‐stage	
  applications	
  (e.g.	
  
    iterative	
  graph	
  algorithms	
  and	
  machine	
  learning)	
  
  » More	
  interactive	
  ad-­‐hoc	
  queries	
  

Both	
  multi-­‐stage	
  and	
  interactive	
  apps	
  require	
  
faster	
  data	
  sharing	
  across	
  parallel	
  jobs	
  
Data	
  Sharing	
  in	
  MapReduce	
  
               HDFS	
                      HDFS	
                                HDFS	
                      HDFS	
  
               read	
                      write	
                               read	
                      write	
  
                          iter.	
  1	
                                                      iter.	
  2	
                         .	
  	
  .	
  	
  .	
  

   Input	
  

                HDFS	
                             query	
  1	
                                                result	
  1	
  
                read	
  
                                                   query	
  2	
                                                result	
  2	
  


                                                   query	
  3	
                                                result	
  3	
  
   Input	
  
                                                       .	
  	
  .	
  	
  .	
  

Slow	
  due	
  to	
  replication,	
  serialization,	
  and	
  disk	
  IO	
  
Data	
  Sharing	
  in	
  Spark	
  

                       iter.	
  1	
           iter.	
  2	
                .	
  	
  .	
  	
  .	
  

   Input	
  

                                              query	
  1	
  
                one-­‐time	
  
               processing	
  
                                              query	
  2	
  

                                              query	
  3	
  
   Input	
                  Distributed	
  
                             memory	
           .	
  	
  .	
  	
  .	
  

         10-­‐100×	
  faster	
  than	
  network	
  and	
  disk	
  
Spark	
  Programming	
  Model	
  
Key	
  idea:	
  resilient	
  distributed	
  datasets	
  (RDDs)	
  
  » Distributed	
  collections	
  of	
  objects	
  that	
  can	
  be	
  cached	
  
    in	
  memory	
  across	
  cluster	
  nodes	
  
  » Manipulated	
  through	
  various	
  parallel	
  operators	
  
  » Automatically	
  rebuilt	
  on	
  failure	
  

Interface	
  
  » Clean	
  language-­‐integrated	
  API	
  in	
  Scala	
  
  » Can	
  be	
  used	
  interactively	
  from	
  Scala	
  console	
  
Example:	
  Log	
  Mining	
  
 Load	
  error	
  messages	
  from	
  a	
  log	
  into	
  memory,	
  then	
  
 interactively	
  search	
  for	
  various	
  patterns	
  
                                                                         Base	
  RDD	
                                        Cache	
  1	
  
lines = spark.textFile(“hdfs://...”)                                         Transformed	
  RDD	
  
                                                                                                                    Worker	
  
                                                                                               results	
  
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(‘t’)(2))                                                                  tasks	
   Block	
  1	
  
                                                                                     Driver	
  
cachedMsgs = messages.cache()

                                                                                    Action	
  
cachedMsgs.filter(_.contains(“foo”)).count
cachedMsgs.filter(_.contains(“bar”)).count                                                                                   Cache	
  2	
  
                                                                                                                  Worker	
  
. . .
                                                                                        Cache	
  3	
  
                                                                             Worker	
                              Block	
  2	
  
 Result:	
  sull-­‐text	
  s1	
  TB	
  data	
  in	
  5-­‐7	
  sec	
  
               fcaled	
  to	
   earch	
  of	
  Wikipedia	
  
 in	
  <1	
  sec	
  (vs	
  ec	
  for	
  on-­‐disk	
  data)	
  ata)	
  
        (vs	
  170	
  s 20	
  sec	
  for	
  on-­‐disk	
  d                    Block	
  3	
  
Fault	
  Tolerance	
  
RDDs	
  track	
  the	
  series	
  of	
  transformations	
  used	
  to	
  
build	
  them	
  (their	
  lineage)	
  to	
  recompute	
  lost	
  data	
  
E.g:	
   messages             = textFile(...).filter(_.contains(“error”))
                                             .map(_.split(‘t’)(2))
	
  
	
  
       HadoopRDD	
                      FilteredRDD	
                    MappedRDD	
  
       path	
  =	
  hdfs://…	
        func	
  =	
  _.contains(...)	
     func	
  =	
  _.split(…)	
  
Example:	
  Logistic	
  Regression	
  
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)                            Load	
  data	
  in	
  memory	
  once	
  
                                    Initial	
  parameter	
  vector	
  
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient              Repeated	
  MapReduce	
  steps	
  
}                              to	
  do	
  gradient	
  descent	
  

println("Final w: " + w)
Logistic	
  Regression	
  Performance	
  
                             4500	
  
                             4000	
  
                                                                                               127	
  s	
  /	
  iteration	
  
                             3500	
  
Running	
  Time	
  (s)	
  




                             3000	
  
                             2500	
                                                           Hadoop	
  
                             2000	
  
                                                                                              Spark	
  
                             1500	
  
                             1000	
  
                              500	
                                                        first	
  iteration	
  174	
  s	
  
                                0	
                                                       further	
  iterations	
  6	
  s	
  
                                        1	
      5	
       10	
        20	
      30	
  
                                                Number	
  of	
  Iterations	
  
Supported	
  Operators	
  
map              reduce        sample

filter           count         cogroup

groupBy          reduceByKey   take

sort             groupByKey    partitionBy

join             first         pipe

leftOuterJoin    union         save

rightOuterJoin   cross         ...
Other	
  Engine	
  Features	
  
General	
  graphs	
  of	
  operators	
  (e.g.	
  map-­‐reduce-­‐reduce)	
  
Hash-­‐based	
  reduces	
  (faster	
  than	
  Hadoop’s	
  sort)	
  
Controlled	
  data	
  partitioning	
  to	
  lower	
  communication	
  

                                                    PageRank	
  Performance	
  
                                        200	
       171	
  
         Iteration	
  time	
  (s)	
  




                                                                                Hadoop	
  
                                        150	
  
                                                                                Basic	
  Spark	
  
                                        100	
                 72	
  
                                          50	
                         23	
     Spark	
  +	
  Controlled	
  
                                                                                Partitioning	
  
                                            0	
  
Spark	
  Users	
  
User	
  Applications	
  
In-­‐memory	
  analytics	
  &	
  anomaly	
  detection	
  (Conviva)	
  
Interactive	
  queries	
  on	
  data	
  streams	
  (Quantifind)	
  
Exploratory	
  log	
  analysis	
  (Foursquare)	
  
Traffic	
  estimation	
  w/	
  GPS	
  data	
  (Mobile	
  Millennium)	
  
Twitter	
  spam	
  classification	
  (Monarch)	
  
.	
  .	
  .	
  
Conviva	
  GeoReport	
  
  Hive	
                                                   20	
  

Spark	
              0.5	
  
                                                             Time	
  (hours)	
  
             0	
               5	
     10	
     15	
     20	
  


Group	
  aggregations	
  on	
  many	
  keys	
  w/	
  same	
  filter	
  
40×	
  gain	
  over	
  Hive	
  from	
  avoiding	
  repeated	
  
reading,	
  deserialization	
  and	
  filtering	
  
Mobile	
  Millennium	
  Project	
  
Estimate	
  city	
  traffic	
  from	
  crowdsourced	
  GPS	
  data	
  
                                                                     Iterative	
  EM	
  algorithm	
  
                                                                     scaling	
  to	
  160	
  nodes	
  




          Credit:	
  Tim	
  Hunter,	
  with	
  support	
  of	
  the	
  Mobile	
  Millennium	
  team;	
  P.I.	
  Alex	
  Bayen;	
  traffic.berkeley.edu	
  
Shark:	
  Hive	
  on	
  Spark	
  
Motivation	
  
Hive	
  is	
  great,	
  but	
  Hadoop’s	
  execution	
  engine	
  
makes	
  even	
  the	
  smallest	
  queries	
  take	
  minutes	
  
Scala	
  is	
  good	
  for	
  programmers,	
  but	
  many	
  data	
  
users	
  only	
  know	
  SQL	
  
Can	
  we	
  extend	
  Hive	
  to	
  run	
  on	
  Spark?	
  
Hive	
  Architecture	
  
              	
  


              	
  	
  Client	
                   CLI	
             JDBC	
  
                                              Driver	
  
  Meta	
                                                   Physical	
  Plan	
  
  store	
             SQL	
      Query	
  
                     Parser	
   Optimizer	
                  Execution	
  

                                        MapReduce	
  

                                   HDFS	
  
Shark	
  Architecture	
  
              	
  


              	
  	
  Client	
                   CLI	
             JDBC	
  
                                              Driver	
        Cache	
  Mgr.	
  
  Meta	
                                                   Physical	
  Plan	
  
  store	
             SQL	
      Query	
  
                     Parser	
   Optimizer	
                  Execution	
  

                                              Spark	
  

                                   HDFS	
  
                                                      [Engle	
  et	
  al,	
  SIGMOD	
  2012]	
  
Efficient	
  In-­‐Memory	
  Storage	
  
Simply	
  caching	
  Hive	
  records	
  as	
  Java	
  objects	
  is	
  
inefficient	
  due	
  to	
  high	
  per-­‐object	
  overhead	
  
Instead,	
  Shark	
  employs	
  column-­‐oriented	
  
storage	
  using	
  arrays	
  of	
  primitive	
  types	
  
            Row	
  Storage	
               Column	
  Storage	
  
             1	
     john	
      4.1	
        1	
       2	
       3	
  

             2	
     mike	
      3.5	
      john	
   mike	
   sally	
  

             3	
     sally	
     6.4	
       4.1	
     3.5	
     6.4	
  
Efficient	
  In-­‐Memory	
  Storage	
  
Simply	
  caching	
  Hive	
  records	
  as	
  Java	
  objects	
  is	
  
inefficient	
  due	
  to	
  high	
  per-­‐object	
  overhead	
  
Instead,	
  Shark	
  employs	
  column-­‐oriented	
  
storage	
  using	
  arrays	
  of	
  primitive	
  types	
  
            Row	
  Storage	
               Column	
  Storage	
  
             1	
     john	
      4.1	
        1	
       2	
       3	
  
Benefit:	
  similarly	
  compact	
  size	
  to	
  serialized	
  data,	
  
          2	
   mike	
   3.5	
   faster	
  to	
  access	
   sally	
  
                 but	
  >5x	
                john	
   mike	
  

             3	
     sally	
     6.4	
       4.1	
     3.5	
     6.4	
  
Using	
  Shark	
  
CREATE TABLE mydata_cached AS SELECT …
	
  




Run	
  standard	
  HiveQL	
  on	
  it,	
  including	
  UDFs	
  
       » A	
  few	
  esoteric	
  features	
  are	
  not	
  yet	
  supported	
  

Can	
  also	
  call	
  from	
  Scala	
  to	
  mix	
  with	
  Spark	
  
	
  
        Early	
  alpha	
  release	
  at	
  shark.cs.berkeley.edu	
  
Benchmark	
  Query	
  1	
  
SELECT * FROM grep WHERE field LIKE ‘%XYZ%’;



  Shark (cached) 12s



          Shark                                     182s




           Hive                                            207s



                  0    50    100            150            200    250
                            Execution Time (secs)
Benchmark	
  Query	
  2	
  
SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R, userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN ‘1999-01-01’ AND ‘2000-01-01’
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;


   Shark (cached)        126s




           Shark                         270s




            Hive                                              447s



                    0   100      200            300     400          500
                                Execution Time (secs)
Demo	
  
What’s	
  Next?	
  
Recall	
  that	
  Spark’s	
  model	
  was	
  motivated	
  by	
  two	
  
emerging	
  uses	
  (interactive	
  and	
  multi-­‐stage	
  apps)	
  
Another	
  emerging	
  use	
  case	
  that	
  needs	
  fast	
  data	
  
sharing	
  is	
  stream	
  processing	
  
  » Track	
  and	
  update	
  state	
  in	
  memory	
  as	
  events	
  arrive	
  
  » Large-­‐scale	
  reporting,	
  click	
  analysis,	
  spam	
  filtering,	
  etc	
  
Streaming	
  Spark	
  
      Extends	
  Spark	
  to	
  perform	
  streaming	
  computations	
  
      Runs	
  as	
  a	
  series	
  of	
  small	
  (~1	
  s)	
  batch	
  jobs,	
  keeping	
  
      state	
  in	
  memory	
  as	
  fault-­‐tolerant	
  RDDs	
  
      Intermix	
  seamlessly	
  with	
  batch	
  and	
  ad-­‐hoc	
  queries	
  
                                                                   map	
     reduceByWindow	
  

     tweetStream                                         T=1	
  
      .flatMap(_.toLower.split)
      .map(word => (word, 1))
      .reduceByWindow(5, _ + _)
                                                         T=2	
  

                                                                                  …	
  
[Zaharia	
  et	
  al,	
  HotCloud	
  2012]	
  
Streaming	
  Spark	
  
      Extends	
  Spark	
  to	
  perform	
  streaming	
  computations	
  
      Runs	
  as	
  a	
  series	
  of	
  small	
  (~1	
  s)	
  batch	
  jobs,	
  keeping	
  
      state	
  in	
  memory	
  as	
  fault-­‐tolerant	
  RDDs	
  
      Intermix	
  seamlessly	
  with	
  batch	
  and	
  ad-­‐hoc	
  queries	
  
                                                                   map	
     reduceByWindow	
  

     tweetStream                                         T=1	
  
      .flatMap(_.toLower.split)
          Result:	
  can	
  process	
  42	
  million	
  records/second	
  
      .map(word => (word, 1))
           (4	
  GB/s)	
  on	
  100	
  nodes	
  at	
  sub-­‐second	
  latency	
  
      .reduceByWindow(5, _ + _)
                                                         T=2	
  

                                                                                  …	
  
[Zaharia	
  et	
  al,	
  HotCloud	
  2012]	
  
Streaming	
  Spark	
  
      Extends	
  Spark	
  to	
  perform	
  streaming	
  computations	
  
      Runs	
  as	
  a	
  series	
  of	
  small	
  (~1	
  s)	
  batch	
  jobs,	
  keeping	
  
      state	
  in	
  memory	
  as	
  fault-­‐tolerant	
  RDDs	
  
      Intermix	
  seamlessly	
  with	
  batch	
  and	
  ad-­‐hoc	
  queries	
  
                                                                   map	
     reduceByWindow	
  

     tweetStream                                         T=1	
  
      .flatMap(_.toLower.split)
      .map(word => (word, 1))Alpha	
  coming	
  this	
  summer	
  
      .reduceByWindow(5, _ + _)
                                                         T=2	
  

                                                                                  …	
  
[Zaharia	
  et	
  al,	
  HotCloud	
  2012]	
  
Conclusion	
  
Spark	
  and	
  Shark	
  speed	
  up	
  your	
  interactive	
  and	
  
complex	
  analytics	
  on	
  Hadoop	
  data	
  
Download	
  and	
  docs:	
  www.spark-­‐project.org	
  	
  
  » Easy	
  to	
  run	
  locally,	
  on	
  EC2,	
  or	
  on	
  Mesos	
  and	
  soon	
  YARN	
  

User	
  meetup:	
  meetup.com/spark-­‐users	
  	
  
Training	
  camp	
  at	
  Berkeley	
  in	
  August!	
  

                                   matei@berkeley.edu	
  /	
  @matei_zaharia	
  
Behavior	
  with	
  Not	
  Enough	
  RAM	
  
                               100	
  
                                              68.8	
  
Iteration	
  time	
  (s)	
  




                                                           58.1	
  
                                80	
  




                                                                              40.7	
  
                                60	
  




                                                                                                29.7	
  
                                40	
  




                                                                                                             11.5	
  
                                20	
  
                                   0	
  
                                            Cache	
       25%	
             50%	
              75%	
        Fully	
  
                                           disabled	
                                                      cached	
  
                                                          %	
  of	
  working	
  set	
  in	
  memory	
  
Software	
  Stack	
  
          Shark	
                      Bagel	
                Streaming	
  
    (Hive	
  on	
  Spark)	
     (Pregel	
  on	
  Spark)	
       Spark	
  
                                                                              …	
  
                                           Spark	
  

      Local	
                                       Apache	
  
                                EC2	
                                 YARN	
  
      mode	
                                        Mesos	
  
Thank You!


             Page 37

More Related Content

Spark and shark

  • 1. Spark  and  Shark   High-­‐Speed  In-­‐Memory  Analytics   over  Hadoop  and  Hive  Data   Matei  Zaharia,  in  collaboration  with   Mosharaf  Chowdhury,  Tathagata  Das,  Ankur  Dave,  Cliff  Engle,   Michael  Franklin,  Haoyuan  Li,  Antonio  Lupher,  Justin  Ma,  Murphy   McCauley,  Scott  Shenker,  Ion  Stoica,  Reynold  Xin     UC  Berkeley   spark-­‐project.org   UC  BERKELEY  
  • 2. What  is  Spark?   Not  a  modified  version  of  Hadoop   Separate,  fast,  MapReduce-­‐like  engine   » In-­‐memory  data  storage  for  very  fast  iterative  queries   » General  execution  graphs  and  powerful  optimizations   » Up  to  40x  faster  than  Hadoop   Compatible  with  Hadoop’s  storage  APIs   » Can  read/write  to  any  Hadoop-­‐supported  system,   including  HDFS,  HBase,  SequenceFiles,  etc  
  • 3. What  is  Shark?   Port  of  Apache  Hive  to  run  on  Spark   Compatible  with  existing  Hive  data,  metastores,   and  queries  (HiveQL,  UDFs,  etc)   Similar  speedups  of  up  to  40x  
  • 4. Project  History   Spark  project  started  in  2009,  open  sourced  2010   Shark  started  summer  2011,  alpha  April  2012   In  use  at  Berkeley,  Princeton,  Klout,  Foursquare,   Conviva,  Quantifind,  Yahoo!  Research  &  others   200+  member  meetup,  500+  watchers  on  GitHub  
  • 5. This  Talk   Spark  programming  model   User  applications   Shark  overview   Demo   Next  major  addition:  Streaming  Spark  
  • 6. Why  a  New  Programming  Model?   MapReduce  greatly  simplified  big  data  analysis   But  as  soon  as  it  got  popular,  users  wanted  more:   » More  complex,  multi-­‐stage  applications  (e.g.   iterative  graph  algorithms  and  machine  learning)   » More  interactive  ad-­‐hoc  queries   Both  multi-­‐stage  and  interactive  apps  require   faster  data  sharing  across  parallel  jobs  
  • 7. Data  Sharing  in  MapReduce   HDFS   HDFS   HDFS   HDFS   read   write   read   write   iter.  1   iter.  2   .    .    .   Input   HDFS   query  1   result  1   read   query  2   result  2   query  3   result  3   Input   .    .    .   Slow  due  to  replication,  serialization,  and  disk  IO  
  • 8. Data  Sharing  in  Spark   iter.  1   iter.  2   .    .    .   Input   query  1   one-­‐time   processing   query  2   query  3   Input   Distributed   memory   .    .    .   10-­‐100×  faster  than  network  and  disk  
  • 9. Spark  Programming  Model   Key  idea:  resilient  distributed  datasets  (RDDs)   » Distributed  collections  of  objects  that  can  be  cached   in  memory  across  cluster  nodes   » Manipulated  through  various  parallel  operators   » Automatically  rebuilt  on  failure   Interface   » Clean  language-­‐integrated  API  in  Scala   » Can  be  used  interactively  from  Scala  console  
  • 10. Example:  Log  Mining   Load  error  messages  from  a  log  into  memory,  then   interactively  search  for  various  patterns   Base  RDD   Cache  1   lines = spark.textFile(“hdfs://...”) Transformed  RDD   Worker   results   errors = lines.filter(_.startsWith(“ERROR”)) messages = errors.map(_.split(‘t’)(2)) tasks   Block  1   Driver   cachedMsgs = messages.cache() Action   cachedMsgs.filter(_.contains(“foo”)).count cachedMsgs.filter(_.contains(“bar”)).count Cache  2   Worker   . . . Cache  3   Worker   Block  2   Result:  sull-­‐text  s1  TB  data  in  5-­‐7  sec   fcaled  to   earch  of  Wikipedia   in  <1  sec  (vs  ec  for  on-­‐disk  data)  ata)   (vs  170  s 20  sec  for  on-­‐disk  d Block  3  
  • 11. Fault  Tolerance   RDDs  track  the  series  of  transformations  used  to   build  them  (their  lineage)  to  recompute  lost  data   E.g:   messages = textFile(...).filter(_.contains(“error”)) .map(_.split(‘t’)(2))     HadoopRDD   FilteredRDD   MappedRDD   path  =  hdfs://…   func  =  _.contains(...)   func  =  _.split(…)  
  • 12. Example:  Logistic  Regression   val data = spark.textFile(...).map(readPoint).cache() var w = Vector.random(D) Load  data  in  memory  once   Initial  parameter  vector   for (i <- 1 to ITERATIONS) { val gradient = data.map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient Repeated  MapReduce  steps   } to  do  gradient  descent   println("Final w: " + w)
  • 13. Logistic  Regression  Performance   4500   4000   127  s  /  iteration   3500   Running  Time  (s)   3000   2500   Hadoop   2000   Spark   1500   1000   500   first  iteration  174  s   0   further  iterations  6  s   1   5   10   20   30   Number  of  Iterations  
  • 14. Supported  Operators   map reduce sample filter count cogroup groupBy reduceByKey take sort groupByKey partitionBy join first pipe leftOuterJoin union save rightOuterJoin cross ...
  • 15. Other  Engine  Features   General  graphs  of  operators  (e.g.  map-­‐reduce-­‐reduce)   Hash-­‐based  reduces  (faster  than  Hadoop’s  sort)   Controlled  data  partitioning  to  lower  communication   PageRank  Performance   200   171   Iteration  time  (s)   Hadoop   150   Basic  Spark   100   72   50   23   Spark  +  Controlled   Partitioning   0  
  • 17. User  Applications   In-­‐memory  analytics  &  anomaly  detection  (Conviva)   Interactive  queries  on  data  streams  (Quantifind)   Exploratory  log  analysis  (Foursquare)   Traffic  estimation  w/  GPS  data  (Mobile  Millennium)   Twitter  spam  classification  (Monarch)   .  .  .  
  • 18. Conviva  GeoReport   Hive   20   Spark   0.5   Time  (hours)   0   5   10   15   20   Group  aggregations  on  many  keys  w/  same  filter   40×  gain  over  Hive  from  avoiding  repeated   reading,  deserialization  and  filtering  
  • 19. Mobile  Millennium  Project   Estimate  city  traffic  from  crowdsourced  GPS  data   Iterative  EM  algorithm   scaling  to  160  nodes   Credit:  Tim  Hunter,  with  support  of  the  Mobile  Millennium  team;  P.I.  Alex  Bayen;  traffic.berkeley.edu  
  • 20. Shark:  Hive  on  Spark  
  • 21. Motivation   Hive  is  great,  but  Hadoop’s  execution  engine   makes  even  the  smallest  queries  take  minutes   Scala  is  good  for  programmers,  but  many  data   users  only  know  SQL   Can  we  extend  Hive  to  run  on  Spark?  
  • 22. Hive  Architecture        Client   CLI   JDBC   Driver   Meta   Physical  Plan   store   SQL   Query   Parser   Optimizer   Execution   MapReduce   HDFS  
  • 23. Shark  Architecture        Client   CLI   JDBC   Driver   Cache  Mgr.   Meta   Physical  Plan   store   SQL   Query   Parser   Optimizer   Execution   Spark   HDFS   [Engle  et  al,  SIGMOD  2012]  
  • 24. Efficient  In-­‐Memory  Storage   Simply  caching  Hive  records  as  Java  objects  is   inefficient  due  to  high  per-­‐object  overhead   Instead,  Shark  employs  column-­‐oriented   storage  using  arrays  of  primitive  types   Row  Storage   Column  Storage   1   john   4.1   1   2   3   2   mike   3.5   john   mike   sally   3   sally   6.4   4.1   3.5   6.4  
  • 25. Efficient  In-­‐Memory  Storage   Simply  caching  Hive  records  as  Java  objects  is   inefficient  due  to  high  per-­‐object  overhead   Instead,  Shark  employs  column-­‐oriented   storage  using  arrays  of  primitive  types   Row  Storage   Column  Storage   1   john   4.1   1   2   3   Benefit:  similarly  compact  size  to  serialized  data,   2   mike   3.5   faster  to  access   sally   but  >5x   john   mike   3   sally   6.4   4.1   3.5   6.4  
  • 26. Using  Shark   CREATE TABLE mydata_cached AS SELECT …   Run  standard  HiveQL  on  it,  including  UDFs   » A  few  esoteric  features  are  not  yet  supported   Can  also  call  from  Scala  to  mix  with  Spark     Early  alpha  release  at  shark.cs.berkeley.edu  
  • 27. Benchmark  Query  1   SELECT * FROM grep WHERE field LIKE ‘%XYZ%’; Shark (cached) 12s Shark 182s Hive 207s 0 50 100 150 200 250 Execution Time (secs)
  • 28. Benchmark  Query  2   SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings FROM rankings AS R, userVisits AS V ON R.pageURL = V.destURL WHERE V.visitDate BETWEEN ‘1999-01-01’ AND ‘2000-01-01’ GROUP BY V.sourceIP ORDER BY earnings DESC LIMIT 1; Shark (cached) 126s Shark 270s Hive 447s 0 100 200 300 400 500 Execution Time (secs)
  • 30. What’s  Next?   Recall  that  Spark’s  model  was  motivated  by  two   emerging  uses  (interactive  and  multi-­‐stage  apps)   Another  emerging  use  case  that  needs  fast  data   sharing  is  stream  processing   » Track  and  update  state  in  memory  as  events  arrive   » Large-­‐scale  reporting,  click  analysis,  spam  filtering,  etc  
  • 31. Streaming  Spark   Extends  Spark  to  perform  streaming  computations   Runs  as  a  series  of  small  (~1  s)  batch  jobs,  keeping   state  in  memory  as  fault-­‐tolerant  RDDs   Intermix  seamlessly  with  batch  and  ad-­‐hoc  queries   map   reduceByWindow   tweetStream T=1   .flatMap(_.toLower.split) .map(word => (word, 1)) .reduceByWindow(5, _ + _) T=2   …   [Zaharia  et  al,  HotCloud  2012]  
  • 32. Streaming  Spark   Extends  Spark  to  perform  streaming  computations   Runs  as  a  series  of  small  (~1  s)  batch  jobs,  keeping   state  in  memory  as  fault-­‐tolerant  RDDs   Intermix  seamlessly  with  batch  and  ad-­‐hoc  queries   map   reduceByWindow   tweetStream T=1   .flatMap(_.toLower.split) Result:  can  process  42  million  records/second   .map(word => (word, 1)) (4  GB/s)  on  100  nodes  at  sub-­‐second  latency   .reduceByWindow(5, _ + _) T=2   …   [Zaharia  et  al,  HotCloud  2012]  
  • 33. Streaming  Spark   Extends  Spark  to  perform  streaming  computations   Runs  as  a  series  of  small  (~1  s)  batch  jobs,  keeping   state  in  memory  as  fault-­‐tolerant  RDDs   Intermix  seamlessly  with  batch  and  ad-­‐hoc  queries   map   reduceByWindow   tweetStream T=1   .flatMap(_.toLower.split) .map(word => (word, 1))Alpha  coming  this  summer   .reduceByWindow(5, _ + _) T=2   …   [Zaharia  et  al,  HotCloud  2012]  
  • 34. Conclusion   Spark  and  Shark  speed  up  your  interactive  and   complex  analytics  on  Hadoop  data   Download  and  docs:  www.spark-­‐project.org     » Easy  to  run  locally,  on  EC2,  or  on  Mesos  and  soon  YARN   User  meetup:  meetup.com/spark-­‐users     Training  camp  at  Berkeley  in  August!   matei@berkeley.edu  /  @matei_zaharia  
  • 35. Behavior  with  Not  Enough  RAM   100   68.8   Iteration  time  (s)   58.1   80   40.7   60   29.7   40   11.5   20   0   Cache   25%   50%   75%   Fully   disabled   cached   %  of  working  set  in  memory  
  • 36. Software  Stack   Shark   Bagel   Streaming   (Hive  on  Spark)   (Pregel  on  Spark)   Spark   …   Spark   Local   Apache   EC2   YARN   mode   Mesos  
  • 37. Thank You! Page 37