Apache Spark Theory by Arsh
Apache Spark is an open-source cluster computing framework for large-scale data processing, including near-real-time stream processing.
The main feature of Apache Spark is its in-memory cluster computing, which increases the
processing speed of an application.
Spark provides an interface for programming entire clusters with implicit data parallelism and
fault tolerance.
Spark supports multiple libraries on top of the Core API component, including Spark SQL, Spark Streaming, MLlib, and GraphX.
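The sketch below illustrates the in-memory computing idea, assuming a local Spark installation; the object name, app name, and data are placeholders, and cache() is what keeps the computed partitions in executor memory.

```scala
// Minimal sketch of Spark's in-memory computing (names and data are illustrative).
import org.apache.spark.sql.SparkSession

object InMemoryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("InMemoryDemo")
      .master("local[*]")              // run locally, using all available cores
      .getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000)

    // cache() keeps the computed partitions in executor memory,
    // so the second action below reuses them instead of recomputing.
    val squares = numbers.map(n => n.toLong * n).cache()

    println(squares.sum())    // first action: computes and caches
    println(squares.count())  // second action: served from memory

    spark.stop()
  }
}
```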
1. Transformations: operations applied to an existing RDD to create a new RDD. They are evaluated lazily (see the sketch after this list).
2. Actions: operations applied on an RDD that instruct Apache Spark to perform the computation and return the result to the driver.
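A minimal sketch of the distinction, reusing the SparkContext sc from the sketch above; the sample lines and the expected count are illustrative.

```scala
// Transformations vs. actions, reusing the SparkContext `sc` from the sketch above.
val lines = sc.parallelize(Seq("spark is fast", "spark is in-memory", "hadoop is disk-based"))

// Transformations: each returns a new RDD and is evaluated lazily; nothing runs yet.
val words      = lines.flatMap(_.split(" "))
val sparkWords = words.filter(_ == "spark")

// Action: triggers the computation and returns the result to the driver.
val count = sparkWords.count()
println(s"occurrences of the word 'spark': $count")  // prints 2
```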
DAG - Directed Acyclic Graph: the graph of transformations that Spark builds from the application code before an action triggers execution.
Spark Architecture
Working of Spark -
STEP 1: The client submits the Spark user application code. When the application code is submitted,
the driver implicitly converts the user code, which contains transformations and actions, into a
logical directed acyclic graph (DAG). At this stage, it also performs optimizations such as
pipelining transformations.
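A minimal sketch of this step, reusing sc from the first sketch; the data and names are illustrative. toDebugString prints the lineage (DAG) the driver has built, and its indentation marks the shuffle boundaries the scheduler will use.

```scala
// DAG construction and pipelining, reusing `sc` from the first sketch.
val events  = sc.parallelize(1 to 1000, numSlices = 4)
val tagged  = events.map(n => (n % 10, n))     // narrow transformation
val bigOnly = tagged.filter(_._2 > 100)        // narrow transformation, pipelined with the map above
val totals  = bigOnly.reduceByKey(_ + _)       // wide transformation: introduces a shuffle boundary

// toDebugString prints the lineage (DAG); each indentation level marks a shuffle boundary,
// so the map and filter are pipelined into the same stage as the parallelize.
println(totals.toDebugString)

// Nothing has executed yet; the action below turns the DAG into stages and tasks.
println(totals.count())
```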
STEP 2: The driver then converts the logical graph (DAG) into a physical execution plan with
multiple stages. After building the physical execution plan, it creates physical execution units
called tasks under each stage. The tasks are then bundled and sent to the cluster.
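A small sketch of the stage-to-task relationship under the same assumptions: the number of tasks in a stage equals the number of partitions of the RDD that the stage computes.

```scala
// Tasks per stage follow partition counts (reusing `sc` from the first sketch).
val pairs = sc.parallelize(1 to 1000, numSlices = 8).map(n => (n % 10, n))
val sums  = pairs.reduceByKey(_ + _)

println(pairs.getNumPartitions)  // 8 -> the map-side stage runs 8 tasks
println(sums.getNumPartitions)   // 8 by default here -> the reduce-side stage also runs 8 tasks

sums.collect().foreach(println)  // the action ships the bundled tasks to the cluster
```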
STEP 3: Now the driver talks to the cluster manager and negotiates resources. The cluster
manager launches executors on worker nodes on behalf of the driver. At this point, the driver
sends tasks to the executors based on data placement. When executors start, they register
themselves with the driver, so the driver has a complete view of the executors that are executing
the tasks.
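A hedged sketch of the resource settings the driver hands to a cluster manager (YARN in this example); the values and app name are placeholders, and the exact behaviour depends on the cluster manager in use.

```scala
// Sketch: resources the driver requests from the cluster manager (values are placeholders).
import org.apache.spark.sql.SparkSession

val sparkOnYarn = SparkSession.builder()
  .appName("ResourceNegotiationDemo")
  .master("yarn")                            // cluster manager; could also be a standalone or k8s URL
  .config("spark.executor.instances", "4")   // executors the cluster manager launches on worker nodes
  .config("spark.executor.memory", "4g")     // memory per executor
  .config("spark.executor.cores", "2")       // cores (parallel task slots) per executor
  .getOrCreate()
```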
STEP 4: During task execution, the driver program monitors the set of executors that are running.
The driver node also schedules future tasks based on data placement.
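A brief, hedged sketch of the locality setting the scheduler consults when placing future tasks; the value shown is Spark's documented default, and the app name is a placeholder.

```scala
// Sketch: data-locality scheduling knob (3s is the documented default).
import org.apache.spark.sql.SparkSession

val sparkWithLocality = SparkSession.builder()
  .appName("LocalityDemo")
  .master("local[*]")
  .config("spark.locality.wait", "3s")  // how long to wait for a data-local executor slot
                                        // before falling back to a less local one
  .getOrCreate()
```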