Apache Spark Theory by Arsh
Apache Spark is an open-source cluster computing framework for large-scale data processing, including near-real-time stream processing.
The main feature of Apache Spark is its in-memory cluster computing, which increases the
processing speed of an application.
Spark provides an interface for programming entire clusters with implicit data parallelism and
fault tolerance.
Spark supports multiple libraries on top of the Core API component, including Spark SQL, Spark Streaming, MLlib, and GraphX.
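The sketch below illustrates the in-memory computing idea, assuming a local Spark installation; the object name, app name, and data are placeholders, and cache() is what keeps the computed partitions in executor memory.

```scala
// Minimal sketch of Spark's in-memory computing (names and data are illustrative).
import org.apache.spark.sql.SparkSession

object InMemoryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("InMemoryDemo")
      .master("local[*]")              // run locally, using all available cores
      .getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000)

    // cache() keeps the computed partitions in executor memory,
    // so the second action below reuses them instead of recomputing.
    val squares = numbers.map(n => n.toLong * n).cache()

    println(squares.sum())    // first action: computes and caches
    println(squares.count())  // second action: served from memory

    spark.stop()
  }
}
```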
1. Transformations: operations applied to an existing RDD to create a new RDD. They are evaluated lazily (see the sketch after this list).
2. Actions: operations applied on an RDD that instruct Apache Spark to perform the computation and return the result to the driver.
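A minimal sketch of the distinction, reusing the SparkContext sc from the sketch above; the sample lines and the expected count are illustrative.

```scala
// Transformations vs. actions, reusing the SparkContext `sc` from the sketch above.
val lines = sc.parallelize(Seq("spark is fast", "spark is in-memory", "hadoop is disk-based"))

// Transformations: each returns a new RDD and is evaluated lazily; nothing runs yet.
val words      = lines.flatMap(_.split(" "))
val sparkWords = words.filter(_ == "spark")

// Action: triggers the computation and returns the result to the driver.
val count = sparkWords.count()
println(s"occurrences of the word 'spark': $count")  // prints 2
```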
DAG - Directed Acyclic Graph: the graph of transformations that Spark builds from the application code before an action triggers execution.
Spark Architecture
Working of Spark -
STEP 1: The client submits the Spark user application code. When the application code is submitted,
the driver implicitly converts the user code, which contains transformations and actions, into a
logical directed acyclic graph (DAG). At this stage, it also performs optimizations such as
pipelining transformations.
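A minimal sketch of this step, reusing sc from the first sketch; the data and names are illustrative. toDebugString prints the lineage (DAG) the driver has built, and its indentation marks the shuffle boundaries the scheduler will use.

```scala
// DAG construction and pipelining, reusing `sc` from the first sketch.
val events  = sc.parallelize(1 to 1000, numSlices = 4)
val tagged  = events.map(n => (n % 10, n))     // narrow transformation
val bigOnly = tagged.filter(_._2 > 100)        // narrow transformation, pipelined with the map above
val totals  = bigOnly.reduceByKey(_ + _)       // wide transformation: introduces a shuffle boundary

// toDebugString prints the lineage (DAG); each indentation level marks a shuffle boundary,
// so the map and filter are pipelined into the same stage as the parallelize.
println(totals.toDebugString)

// Nothing has executed yet; the action below turns the DAG into stages and tasks.
println(totals.count())
```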
STEP 2: The driver then converts the logical graph (DAG) into a physical execution plan with
multiple stages. After building the physical execution plan, it creates physical execution units
called tasks under each stage. The tasks are then bundled and sent to the cluster.
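A small sketch of the stage-to-task relationship under the same assumptions: the number of tasks in a stage equals the number of partitions of the RDD that the stage computes.

```scala
// Tasks per stage follow partition counts (reusing `sc` from the first sketch).
val pairs = sc.parallelize(1 to 1000, numSlices = 8).map(n => (n % 10, n))
val sums  = pairs.reduceByKey(_ + _)

println(pairs.getNumPartitions)  // 8 -> the map-side stage runs 8 tasks
println(sums.getNumPartitions)   // 8 by default here -> the reduce-side stage also runs 8 tasks

sums.collect().foreach(println)  // the action ships the bundled tasks to the cluster
```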
STEP 3: Now the driver talks to the cluster manager and negotiates resources. The cluster
manager launches executors on worker nodes on behalf of the driver. At this point, the driver
sends tasks to the executors based on data placement. When executors start, they register
themselves with the driver, so the driver has a complete view of the executors that are executing
the tasks.
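A hedged sketch of the resource settings the driver hands to a cluster manager (YARN in this example); the values and app name are placeholders, and the exact behaviour depends on the cluster manager in use.

```scala
// Sketch: resources the driver requests from the cluster manager (values are placeholders).
import org.apache.spark.sql.SparkSession

val sparkOnYarn = SparkSession.builder()
  .appName("ResourceNegotiationDemo")
  .master("yarn")                            // cluster manager; could also be a standalone or k8s URL
  .config("spark.executor.instances", "4")   // executors the cluster manager launches on worker nodes
  .config("spark.executor.memory", "4g")     // memory per executor
  .config("spark.executor.cores", "2")       // cores (parallel task slots) per executor
  .getOrCreate()
```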
STEP 4: During task execution, the driver program monitors the set of executors that are running.
The driver node also schedules future tasks based on data placement.
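A brief, hedged sketch of the locality setting the scheduler consults when placing future tasks; the value shown is Spark's documented default, and the app name is a placeholder.

```scala
// Sketch: data-locality scheduling knob (3s is the documented default).
import org.apache.spark.sql.SparkSession

val sparkWithLocality = SparkSession.builder()
  .appName("LocalityDemo")
  .master("local[*]")
  .config("spark.locality.wait", "3s")  // how long to wait for a data-local executor slot
                                        // before falling back to a less local one
  .getOrCreate()
```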