Spark 101 - Overview and Efficient Use



Spark Model of Parallel Computing:


RDDs & DataFrames

• Immutable and fault-tolerant
• Support a wide range of input sources, such as local files, HDFS, and ES
• Can be persisted for faster reuse
• A DataFrame is the distributed equivalent of a pandas or R data frame
• Lazily evaluated
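The immutability and persistence points above can be sketched with a toy model in plain Python. This is not Spark's API: `ToyDataset` and `expensive_square` are hypothetical names invented for illustration, and the class only imitates the way a transformation returns a new dataset while `persist()` keeps a computed result for reuse.

```python
class ToyDataset:
    """Toy stand-in for an RDD (an assumption, not Spark code)."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function producing the rows
        self._persisted = False
        self._cache = None

    def map(self, fn):
        # A transformation returns a NEW dataset; the parent is left untouched.
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def persist(self):
        # Ask for the result to be kept after the first computation.
        self._persisted = True
        return self

    def collect(self):
        if self._persisted and self._cache is not None:
            return list(self._cache)         # reuse, no recomputation
        rows = self._compute()
        if self._persisted:
            self._cache = tuple(rows)
        return list(rows)


calls = {"n": 0}

def expensive_square(x):
    calls["n"] += 1                          # count how often work is really done
    return x * x

base = ToyDataset(lambda: [1, 2, 3])
squares = base.map(expensive_square).persist()

first = squares.collect()                    # computes and caches
second = squares.collect()                   # served from the cache
```

After both calls, `calls["n"]` stays at 3, because the second `collect()` reuses the persisted rows, and `base` still holds its original values.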

Performance Comparison: RDD & DataFrame

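Lazy Evaluation

• Transformations: lazy, they only describe the computation
• Actions: trigger execution and return a result
• Directed acyclic graph (DAG) of the recorded transformations
• Logical and execution plans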
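As a rough illustration of the bullets above, the following plain-Python sketch records transformations in a plan and runs them only when an action is called. `LazyPipeline`, its `label` arguments, and `double` are hypothetical names, not Spark classes. The `explain()` method here merely prints the recorded steps, whereas Spark's real planner is far more involved.

```python
class LazyPipeline:
    """Toy model of lazy evaluation (an assumption, not Spark code)."""

    def __init__(self, source, plan=()):
        self.source = source
        self.plan = plan                     # tuple of (kind, label, fn)

    def map(self, fn, label):
        # Transformation: extend the plan, touch no data.
        return LazyPipeline(self.source, self.plan + (("map", label, fn),))

    def filter(self, fn, label):
        # Transformation: extend the plan, touch no data.
        return LazyPipeline(self.source, self.plan + (("filter", label, fn),))

    def explain(self):
        # Human-readable view of the recorded steps.
        steps = [f"{kind}({label})" for kind, label, _ in self.plan]
        return " -> ".join(["source"] + steps)

    def collect(self):
        # Action: only now is the plan executed, one row at a time.
        out = []
        for row in self.source:
            keep = True
            for kind, _, fn in self.plan:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out


evaluated = []

def double(x):
    evaluated.append(x)                      # record that real work happened
    return 2 * x

pipeline = (LazyPipeline(range(5))
            .map(double, "double")
            .filter(lambda x: x > 2, "gt2"))

plan_text = pipeline.explain()
ran_before_action = len(evaluated)           # still 0: nothing has executed yet
result = pipeline.collect()
```

Building the pipeline costs nothing. `double` first runs inside `collect()`, which plays the role of the action.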


Unified Spark Stack

Spark SQL Flow

Wide Versus Narrow Dependencies

• Narrow dependencies, e.g. map, filter, flatMap
• Wide dependencies, e.g. sort, join, groupByKey

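The difference between the two kinds of dependency can be sketched on partitions modelled as plain Python lists. `narrow_map` and `wide_group_by_key` are hypothetical helpers, and the `key % num_out` partitioner is an assumption chosen to keep the example deterministic. It is not how Spark itself assigns keys.

```python
from collections import defaultdict

def narrow_map(partitions, fn):
    # Narrow: output partition i is built from input partition i only.
    return [[fn(row) for row in part] for part in partitions]

def wide_group_by_key(partitions, num_out):
    # Wide: every output partition may need rows from every input
    # partition, so rows are redistributed (shuffled) by key.
    out = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            out[key % num_out][key].append(value)
    return [dict(p) for p in out]


partitions = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]

upper = narrow_map(partitions, lambda kv: (kv[0], kv[1].upper()))
grouped = wide_group_by_key(upper, num_out=2)
```

In `upper`, each row stays in the partition it started in. In `grouped`, the two values for key 1 came from different input partitions and end up together, which is the data movement that makes wide dependencies expensive.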
The Anatomy of a Spark Job

• The DAG
• Job: started by an action
• Stages: split at wide dependencies
• Tasks: one per partition in each stage
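The hierarchy above can be approximated with a small plain-Python sketch. `split_into_stages`, `count_tasks`, and the `WIDE` set are hypothetical names. Treating each wide operation as the start of a new stage is a simplification of how Spark places shuffle boundaries, and the partition counts are made-up inputs.

```python
# Operations treated as wide (shuffle-producing) in this toy model.
WIDE = {"sort", "join", "groupByKey"}

def split_into_stages(operations):
    # Cut the chain of operations whenever a wide operation appears.
    stages, current = [], []
    for op in operations:
        if op in WIDE and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

def count_tasks(stages, partitions_per_stage):
    # One task per partition in each stage.
    if len(stages) != len(partitions_per_stage):
        raise ValueError("need one partition count per stage")
    return sum(partitions_per_stage)


job_ops = ["map", "filter", "groupByKey", "map", "sort"]
stages = split_into_stages(job_ops)
total_tasks = count_tasks(stages, [4, 4, 2])
```

Here one job yields three stages, and with 4, 4, and 2 partitions the toy model counts 10 tasks.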
