Week 8 - Lecture Notes

The document provides an introduction to Apache Spark, focusing on its framework, Resilient Distributed Datasets (RDDs), and applications like PageRank and GraphX. It discusses the need for Spark over MapReduce, emphasizing its efficiency in handling big data analytics through features like fault tolerance and data sharing. Additionally, the document touches on Spark's capabilities in various programming models and its applications in different domains.

Introduction to Spark

Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Cloud Computing and Distributed Systems
Vu Pham
Preface

Content of this Lecture:

In this lecture, we will discuss the Spark framework and Resilient Distributed Datasets (RDDs), along with some of Spark's applications, such as PageRank and GraphX.

Need of Spark
Apache Spark is a big data analytics framework that was originally developed at the University of California, Berkeley's AMPLab, in 2012. Since then, it has gained a lot of traction in both academia and industry.

It is another system for big data analytics.
Isn’t MapReduce good enough?
  MapReduce simplifies batch processing on large commodity clusters.

[Figure: MapReduce data flow from Input to Output, annotated "Expensive save to disk for fault tolerance".]

Need of Spark
MapReduce can be expensive for some applications, e.g.:
  Iterative
  Interactive
It lacks efficient data sharing.
Specialized frameworks did evolve for different programming models:
  Bulk Synchronous Parallel (Pregel)
  Iterative MapReduce (Hadoop) ….
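
The cost of iterative jobs without efficient data sharing can be illustrated with a small pure-Python sketch. It is not MapReduce or Spark code: `CountingStore` is a hypothetical stand-in for stable storage that simply counts how often it is read, and the two functions contrast reloading the input on every pass with loading it once and reusing it in memory.

```python
class CountingStore:
    """Hypothetical stand-in for stable storage; counts full reads."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._data)


def iterate_reloading(store, iterations):
    # Each pass goes back to stable storage, as a chain of independent jobs would.
    total = 0
    for _ in range(iterations):
        total += sum(store.read())
    return total


def iterate_in_memory(store, iterations):
    # The input is read once and then shared across all passes.
    records = store.read()
    total = 0
    for _ in range(iterations):
        total += sum(records)
    return total


reloading_store = CountingStore([1, 2, 3])
shared_store = CountingStore([1, 2, 3])
reloading_total = iterate_reloading(reloading_store, iterations=5)
shared_total = iterate_in_memory(shared_store, iterations=5)
```

Both variants compute the same result; only the number of storage reads differs (one per iteration versus one in total).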

Solution: Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs)

  Immutable, partitioned collection of records
  Built through coarse-grained transformations (map, join …)
  Can be cached for efficient reuse
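
The three properties above can be sketched with a toy model in plain Python. `ToyRDD` is a hypothetical class written for illustration only; it is not Spark's API and ignores distribution entirely. It assumes round-robin partitioning and stores each partition as an immutable tuple.

```python
class ToyRDD:
    """Hypothetical toy model of an RDD (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute   # function that builds the list of partitions
        self._keep = False        # set by cache()
        self._cached = None
        self.compute_count = 0    # how many times the partitions were built

    @staticmethod
    def from_list(data, num_partitions):
        # Partitioned: records are split round-robin into immutable tuples.
        parts = [tuple(data[i::num_partitions]) for i in range(num_partitions)]
        return ToyRDD(lambda: parts)

    def map(self, fn):
        # Coarse-grained: fn is applied to every record, and a new ToyRDD is
        # returned; the parent's records are never modified.
        return ToyRDD(lambda: [tuple(fn(x) for x in p) for p in self.partitions()])

    def cache(self):
        # Mark this dataset to be kept in memory after its first computation.
        self._keep = True
        return self

    def partitions(self):
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        result = self._compute()
        if self._keep:
            self._cached = result
        return result

    def collect(self):
        return [x for p in self.partitions() for x in p]


base = ToyRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=2)

squares = base.map(lambda x: x * x).cache()
first = squares.collect()
second = squares.collect()      # served from the cached partitions

uncached = base.map(lambda x: x + 1)
uncached.collect()
uncached.collect()              # recomputed, because cache() was not called
```

After this runs, `squares.compute_count` is 1 while `uncached.compute_count` is 2, which is the reuse benefit that caching is meant to provide.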

Need of Spark
[Figure: Spark data flow with the labels HDFS, Read, RDD (three times), Map, Reduce, and Cache.]

Solution: Resilient Distributed Datasets (RDDs)

Fault Recovery?
Lineage!
  Log the coarse-grained operation applied to a partitioned dataset
  Simply recompute the lost partition if a failure occurs!
  No cost if no failure
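
Lineage-based recovery as described above can be sketched in plain Python. `LineageRDD` is a hypothetical toy class, not Spark's implementation: each derived dataset remembers only its parent and the function applied to it, and a lost partition is rebuilt by replaying that function on the matching parent partition. The sketch assumes element-wise operations, so partition `i` depends only on partition `i` of the parent.

```python
class LineageRDD:
    """Hypothetical toy: stores lineage (parent + function), not replicas."""

    def __init__(self, num_partitions, parent=None, fn=None, source=None):
        self.num_partitions = num_partitions
        self.parent = parent      # lineage: the dataset this one was built from
        self.fn = fn              # lineage: the coarse-grained operation applied
        self.source = source      # only set on the base dataset (stable storage)
        self.materialized = {}    # partition index -> records held in memory

    def map(self, fn):
        return LineageRDD(self.num_partitions, parent=self, fn=fn)

    def partition(self, i):
        # Compute partition i only if it is not already in memory.
        if i not in self.materialized:
            if self.parent is None:
                records = list(self.source[i])
            else:
                records = [self.fn(x) for x in self.parent.partition(i)]
            self.materialized[i] = records
        return self.materialized[i]

    def lose(self, i):
        # Simulate a failure that wipes partition i from memory.
        self.materialized.pop(i, None)


base = LineageRDD(2, source=[[1, 2], [3, 4]])
doubled = base.map(lambda x: 2 * x)
plus_one = doubled.map(lambda x: x + 1)

before = [plus_one.partition(i) for i in range(2)]

# Partition 1 is lost on both derived datasets; partition 0 is untouched.
plus_one.lose(1)
doubled.lose(1)
recovered = plus_one.partition(1)   # rebuilt by replaying the logged operations
```

Nothing extra is stored or copied while the job runs normally; the recomputation cost is paid only when `lose` has actually removed a partition.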

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data.
[Figure: Spark data flow (HDFS, Read, RDD, Map, Reduce, Cache) with its lineage highlighted.]

What can you do with Spark?
RDD operations
  Transformations, e.g., filter, join, map, group-by …
  Actions, e.g., count, print …

Control
  Partitioning: Spark gives you control over how your RDDs are partitioned.
  Persistence: Spark lets you choose whether or not to persist an RDD to disk.
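
The split between transformations and actions listed above can be sketched in plain Python. In Spark, transformations are lazy and only an action triggers computation; the hypothetical `LazyDataset` class below imitates that behaviour on a local list. It is a toy for illustration, not Spark's API, and the `parse` helper and sample records are made up.

```python
class LazyDataset:
    """Hypothetical toy: records transformations, runs them only on an action."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)    # recorded transformations, not yet executed

    # Transformations: return a new dataset description, do no work.
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _run(self):
        for record in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    # Actions: execute the recorded transformations and return a result.
    def count(self):
        return sum(1 for _ in self._run())

    def collect(self):
        return list(self._run())


calls = []


def parse(line):
    calls.append(line)            # lets us observe when the work really happens
    return line.split(",")


lines = ["a,1", "b,2", "c,3"]
dataset = LazyDataset(lines).map(parse).filter(lambda rec: int(rec[1]) >= 2)
calls_before_action = len(calls)  # still 0: nothing has run yet
matching = dataset.count()        # the action triggers parse and the filter
```

Building `dataset` records two transformations without calling `parse`; the work happens only when `count` is invoked.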

Partitioning: PageRank

Joins take place repeatedly
Good partitioning reduces shuffles
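
Why partitioning matters when joins repeat can be sketched with a small counting model in plain Python. Everything here is an assumption made for illustration: page ids are integers, the partitioner is `key % n`, and only the link table's movement is counted (the other side of the join is ignored). A record counts as shuffled when it is not already on the partition its key maps to.

```python
NUM_PARTITIONS = 2
ITERATIONS = 10

# (page id, outgoing links) records for six hypothetical pages.
links = [(page, [(page + 1) % 6]) for page in range(6)]


def chunk(records, n):
    # Arbitrary placement: contiguous chunks in arrival order.
    size = (len(records) + n - 1) // n
    return [records[i * size:(i + 1) * size] for i in range(n)]


def partition_by_key(records, n):
    # Key-based placement: every record goes to partition key % n.
    parts = [[] for _ in range(n)]
    for key, value in records:
        parts[key % n].append((key, value))
    return parts


def misplaced(parts, n):
    # Records that a key-based join would have to move to another partition.
    return sum(1 for index, part in enumerate(parts)
               for key, _ in part if key % n != index)


unpartitioned = chunk(links, NUM_PARTITIONS)
per_join = misplaced(unpartitioned, NUM_PARTITIONS)

# Without control over partitioning, the same records move on every join.
shuffled_without = per_join * ITERATIONS

# Partition once up front (paying per_join a single time), then reuse it.
prepartitioned = partition_by_key(links, NUM_PARTITIONS)
shuffled_with = per_join + misplaced(prepartitioned, NUM_PARTITIONS) * ITERATIONS
```

In this model the one-time repartitioning costs as much as a single join's shuffle, after which the repeated joins move no link records at all.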

Example: PageRank

Give pages ranks (scores) based on links to them
  Links from many pages → high rank
  Links from a high-rank page → high rank
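
The two rules above can be tried out with a minimal single-machine sketch in plain Python. It is a simplified iteration, not the lecture's Spark implementation: the damping factor of 0.85, the iteration count, the starting rank of 1.0, and the four-page graph are all assumptions chosen for illustration, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, iterations=50, damping=0.85):
    """Simplified PageRank on a dict of page -> list of linked pages."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, neighbors in links.items():
            if neighbors:
                # A page splits its current rank evenly among the pages it links to.
                share = ranks[page] / len(neighbors)
                for dest in neighbors:
                    contribs[dest] += share
        ranks = {page: (1 - damping) + damping * received
                 for page, received in contribs.items()}
    return ranks


# Hypothetical graph: "c" is linked from three pages, "d" from none.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
```

Here `c` ends up with the highest rank because many pages link to it, and `a` outranks `b` because its only inbound link comes from the high-rank page `c`.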