Apache Spark Theory by Arsh


Spark & its Features

Apache Spark is an open-source cluster computing framework for large-scale data processing,
including near-real-time workloads. Its main feature is in-memory cluster computing, which
increases the processing speed of an application.

Spark provides an interface for programming entire clusters with implicit data parallelism and
fault tolerance.

It is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries, and streaming.

Features of Apache Spark:

Fig: Features of Spark


1. Speed
Spark can run up to 100 times faster than Hadoop MapReduce for large-scale data
processing, mainly when the data fits in memory. Controlled partitioning also
contributes to this speed.
2. Powerful Caching
A simple programming layer provides powerful caching and disk persistence capabilities.
3. Deployment
It can be deployed through Mesos, Hadoop via YARN, or Spark’s own cluster manager.
4. Real-Time
It offers near-real-time computation and low latency because of in-memory computation.
5. Polyglot
Spark provides high-level APIs in Java, Scala, Python, and R. Spark code can be written
in any of these four languages. It also provides a shell in Scala and Python.
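
The "controlled partitioning" mentioned under Speed can be pictured as deciding up front which partition each key lands in, so that records with the same key end up together. The following is a minimal pure-Python sketch of that idea, not Spark code: `partition_for`, `hash_partition`, the CRC32 hash, and the sample records are all assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash (CRC32, illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(records, num_partitions):
    """Group (key, value) records so that equal keys always share a partition."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


# Hypothetical sample data: five records spread over three partitions.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partitions = hash_partition(records, num_partitions=3)
```

Because all records for a given key sit in one partition, a per-key aggregation in this toy model would not need to move data between partitions.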
Fig: Spark Architecture
Spark Eco-System
Spark Core is the base engine for large-scale parallel and distributed data processing.
It is responsible for memory management, fault recovery, scheduling, distributing and
monitoring jobs on a cluster, and interacting with storage systems.
Additional libraries built on top of the core allow diverse workloads such as
streaming, SQL, and machine learning.

Spark supports multiple components, such as Spark SQL, Spark Streaming, MLlib, and GraphX,
alongside the Core API.

Fig: Spark Eco-System

Resilient Distributed Dataset (RDD)


RDDs are the building blocks of any Spark application. RDD stands for:

● Resilient: Fault tolerant and capable of rebuilding data on failure
● Distributed: Data is distributed among multiple nodes in a cluster
● Dataset: A collection of partitioned data with values
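
The "resilient" and "distributed" properties can be illustrated with a toy model: data is kept as a list of partitions, and each dataset remembers the functions that produced it, so a lost partition can be recomputed from the source. This is a simplified pure-Python sketch under those assumptions; `ToyRDD` and its methods are hypothetical names, not Spark's implementation.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitioned data plus the recipe (lineage) that produced it."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # original input, one list per partition
        self.lineage = tuple(lineage)                # functions applied element-wise, in order
        self._computed = {}                          # partition index -> materialized result

    def map(self, fn):
        """Return a new ToyRDD whose lineage has one more step; nothing is computed yet."""
        return ToyRDD(self.source_partitions, self.lineage + (fn,))

    def compute_partition(self, index):
        """Materialize one partition by replaying the lineage over its source data."""
        if index not in self._computed:
            data = self.source_partitions[index]
            for fn in self.lineage:
                data = [fn(x) for x in data]
            self._computed[index] = data
        return self._computed[index]

    def lose_partition(self, index):
        """Simulate a node failure by dropping one materialized partition."""
        self._computed.pop(index, None)


rdd = ToyRDD([[1, 2], [3, 4], [5, 6]]).map(lambda x: x * 10).map(lambda x: x + 1)
before = rdd.compute_partition(1)
rdd.lose_partition(1)
after = rdd.compute_partition(1)  # rebuilt from source data and lineage
```

Only the lost partition is recomputed here; the other partitions are untouched, which is the point of tracking lineage per dataset rather than replicating every intermediate result.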
Workflow of RDD

With RDDs, you can perform two types of operations:

1. Transformations: Operations applied to an RDD to create a new RDD. They are lazy and
are not computed until an action is called.
2. Actions: Operations applied to an RDD that instruct Apache Spark to run the computation
and pass the result back to the driver.
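
The difference between the two kinds of operations can be sketched with a small pure-Python model in which transformations only record what to do and an action triggers the work. `LazyDataset` and `traced_square` are hypothetical names for illustration; in PySpark the corresponding calls would be methods such as `rdd.map`, `rdd.filter`, and `rdd.collect`.

```python
class LazyDataset:
    """Toy dataset: transformations are recorded, actions execute them."""

    def __init__(self, data, ops=()):
        self._data = list(data)
        self._ops = tuple(ops)

    # Transformations: return a new dataset, run nothing.
    def map(self, fn):
        return LazyDataset(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + (("filter", pred),))

    # Actions: replay the recorded operations and return a plain result.
    def collect(self):
        result = self._data
        for kind, fn in self._ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

    def count(self):
        return len(self.collect())


calls = []


def traced_square(x):
    calls.append(x)  # record every invocation so laziness is observable
    return x * x


numbers = LazyDataset(range(1, 6))
evens_squared = numbers.map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # no transformation has run yet
result = evens_squared.collect()  # the action triggers the computation
```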

DAG - Directed Acyclic Graph

1. It represents the flow chart of your Spark application.
2. It decides the flow of processing of your Spark application.
3. According to this flow, the Spark driver creates an execution plan.
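
A DAG can be written down as a mapping from each step to the steps it depends on, and a valid execution order is any ordering that respects those dependencies. The sketch below uses Python's standard `graphlib` module for this; the step names describe a hypothetical join job and are not taken from Spark.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a step; its set lists the steps that must finish first.
dag = {
    "read_users": set(),
    "read_orders": set(),
    "filter_orders": {"read_orders"},
    "join": {"read_users", "filter_orders"},
    "collect": {"join"},
}


def execution_order(graph):
    """Return the steps in an order where every step comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def is_acyclic(graph):
    """A graph with a cycle has no valid execution order."""
    try:
        execution_order(graph)
        return True
    except CycleError:
        return False


order = execution_order(dag)
```

The exact order of independent steps (for example `read_users` and `read_orders`) is not fixed; only the dependency constraints are guaranteed.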

Spark Architecture
Working of Spark -

STEP 1: The client submits the Spark user application code. When the code is submitted,
the driver implicitly converts the user code, which contains transformations and actions, into a
logical directed acyclic graph (DAG). At this stage, it also performs optimizations such as
pipelining transformations.
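
Pipelining transformations means that consecutive element-wise steps are fused so each record flows through all of them in one pass, without building an intermediate collection per step. A minimal pure-Python sketch of that idea follows; the `pipeline` helper and the `("map" | "filter", fn)` step encoding are assumptions made for illustration.

```python
def pipeline(ops):
    """Fuse a list of ("map" | "filter", fn) steps into one function over an iterable."""

    def run(records):
        for record in records:
            keep = True
            for kind, fn in ops:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):  # any non-"map" step is treated as a filter
                    keep = False
                    break
            if keep:
                yield record

    return run


ops = [
    ("map", lambda x: x + 1),
    ("filter", lambda x: x % 2 == 0),
    ("map", lambda x: x * 10),
]
fused = pipeline(ops)
output = list(fused([1, 2, 3, 4, 5]))
```

Each input record is read once and either dropped at the filter or emitted fully transformed, which is why a chain of such steps can run as a single task per partition.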

STEP 2: Next, the driver converts the logical graph (DAG) into a physical execution plan with
many stages. It then creates physical execution units called tasks under each stage. The tasks
are bundled and sent to the cluster.
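
The split into stages and tasks can be sketched as follows: the chain of operations is cut wherever data must be redistributed across the cluster (a shuffle), and each stage then gets one task per partition. This is a simplified pure-Python model; `WIDE_OPS`, `Task`, `split_into_stages`, and `make_tasks` are hypothetical names, and the operation names are borrowed from the RDD API purely as labels.

```python
from dataclasses import dataclass

# Operations assumed to require a shuffle in this toy model.
WIDE_OPS = {"reduceByKey", "groupByKey", "join"}


@dataclass(frozen=True)
class Task:
    stage_id: int
    partition: int
    ops: tuple


def split_into_stages(ops):
    """Cut the op chain before each wide op, so each stage holds only pipelinable work."""
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


def make_tasks(stages, num_partitions):
    """Create one task per (stage, partition) pair."""
    return [
        Task(stage_id, partition, tuple(stage))
        for stage_id, stage in enumerate(stages)
        for partition in range(num_partitions)
    ]


ops = ["textFile", "flatMap", "map", "reduceByKey", "map"]
stages = split_into_stages(ops)
tasks = make_tasks(stages, num_partitions=4)
```

A real scheduler handles more than this (for example, work done on both sides of a shuffle), so the sketch only shows where stage boundaries come from and why the task count scales with the number of partitions.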

STEP 3: The driver then talks to the cluster manager and negotiates resources. The cluster
manager launches executors on worker nodes on behalf of the driver. When executors start, they
register themselves with the driver, so the driver has a complete view of the executors that are
running its tasks. The driver then sends tasks to the executors based on data placement.
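
Sending tasks "based on data placement" can be illustrated with a simple rule: prefer a registered executor that already holds the task's data, and otherwise fall back to the least-loaded executor. The sketch below is a pure-Python toy under that assumption; `assign_tasks`, the executor names, and the task names are hypothetical and do not reflect Spark's actual scheduler.

```python
def assign_tasks(task_locations, executors):
    """Assign each task to an executor holding its data if possible, else the least loaded."""
    load = {executor: 0 for executor in executors}
    assignment = {}
    for task, preferred in task_locations.items():
        # Keep only preferred locations that are registered executors.
        local = [executor for executor in preferred if executor in load]
        candidates = local or list(executors)
        chosen = min(candidates, key=lambda executor: load[executor])
        assignment[task] = chosen
        load[chosen] += 1
    return assignment


executors = ["exec-1", "exec-2"]
task_locations = {
    "task-0": ["exec-1"],
    "task-1": ["exec-2"],
    "task-2": ["exec-1", "exec-2"],
    "task-3": ["exec-9"],  # data lives where no executor is registered
}
assignment = assign_tasks(task_locations, executors)
```

Running a task next to its data avoids a network transfer, which is why the driver needs the complete view of registered executors before it can place tasks.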

STEP 4: During task execution, the driver program monitors the set of executors that are
running. The driver also schedules future tasks based on data placement.

This was all about Spark Architecture.
