
Apache Spark Tutorial

Contents
1. Introduction
2. What is Spark?
3. History of Apache Spark
4. Why Spark?
5. Apache Spark Components
   5.1. Spark Core
   5.2. Spark SQL
   5.3. Spark Streaming
   5.4. Spark MLlib
   5.5. Spark GraphX
   5.6. SparkR
6. Resilient Distributed Dataset – RDD
7. Spark Shell
8. Conclusion


1. Introduction
What is Spark? Why is there such a buzz around this technology? I hope this Spark
introduction tutorial will help answer some of these questions. Apache Spark is an open-source
cluster computing system that provides high-level APIs in Java, Scala, Python, and R. It can access data
from HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source, and it can run under the
standalone, YARN, and Mesos cluster managers.

This Spark tutorial covers the Spark ecosystem components, the accompanying Spark video tutorial,
and Spark's core abstraction – the RDD, with its transformations and actions. The objective of this
introductory guide is to provide a detailed Spark overview: its history, architecture, deployment models, and RDDs.

2. What is Spark?
Apache Spark is a general-purpose & lightning fast cluster computing system. It provides high-level API.
For example, Java, Scala, Python and R. Apache Spark is a tool for Running Spark Applications. Spark is
100 times faster than Bigdata Hadoop and 10 times faster than accessing data from disk.

Spark is written in Scala but provides rich APIs in Scala, Java, Python, and R.

It can be integrated with Hadoop and can process existing Hadoop HDFS data. Follow this guide to learn
how Spark is compatible with Hadoop.

As the saying goes, a picture is worth a thousand words. With this in mind, we have also provided a
Spark video tutorial for a better understanding of Apache Spark.

3. History of Apache Spark


Apache Spark was introduced in 2009 in the UC Berkeley R&D Lab, which later became the AMPLab. It was open-sourced
in 2010 under a BSD license. In 2013, Spark was donated to the Apache Software Foundation, where it
became a top-level Apache project in 2014.

4. Why Spark?
After studying this Apache Spark introduction, let's discuss why Spark came into existence.

In the industry, there was a need for a general-purpose cluster computing tool, because each existing engine handled only one kind of workload:

Hadoop MapReduce can only perform batch processing.

Apache Storm / S4 can only perform stream processing.


Apache Impala / Apache Tez can only perform interactive processing.

Neo4j / Apache Giraph can only perform graph processing.

Hence, there was a big demand in the industry for a powerful engine that can process data in real time
(streaming) as well as in batch mode, respond in sub-second latencies, and perform in-memory processing.

Apache Spark is exactly such an engine: a powerful open-source engine that provides real-time stream
processing, interactive processing, graph processing, and in-memory processing as well as batch processing,
with very high speed, ease of use, and a standard interface. This is what distinguishes Spark from Hadoop
and from Storm.

In this "What is Spark" section, we discussed the definition, history, and importance of Spark.
Now let's move on to Spark's components.

5. Apache Spark Components


Apache Spark promises faster data processing and easier development. How does Spark achieve
this? To answer this question, let's look at the Apache Spark ecosystem, an important topic
in any Apache Spark introduction, since its components are what make Spark fast and reliable. These components
resolve the issues that cropped up while using Hadoop MapReduce.

5.1. Spark Core


It is the kernel of Spark, providing the execution platform for all Spark applications. It is a
generalized platform that supports a wide array of applications.

5.2. Spark SQL


It enables users to run SQL/HQL queries on top of Spark. Using Apache Spark SQL, we can process
structured as well as semi-structured data. It also provides an engine that lets Hive run unmodified queries
up to 100 times faster on existing deployments. Refer to the Spark SQL tutorial for a detailed study.
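
To make this concrete, here is a minimal Spark SQL sketch as a spark-shell (Scala) snippet; the shell predefines the SparkSession as spark (Spark 2.x and later), and people.json is a hypothetical semi-structured input file:

// Read semi-structured JSON into a DataFrame and query it with SQL.
// ("people.json" is a hypothetical input file; `spark` is predefined by spark-shell.)
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()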

5.3. Spark Streaming


Apache Spark Streaming enables powerful interactive and analytical applications over live
streaming data. The live streams are divided into micro-batches, which are executed on top of Spark
Core. Refer to our Spark Streaming tutorial for a detailed study of Apache Spark Streaming.
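
As a minimal sketch of the micro-batch model (assuming the DStream-based spark-streaming module and a hypothetical text source on localhost:9999; sc is the SparkContext predefined by spark-shell):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Group the live stream into 5-second micro-batches executed on Spark Core.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical socket source
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()   // from here on, each micro-batch runs as a regular Spark job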


5.4. Spark MLlib


It is Spark's scalable machine learning library, which delivers both efficiency and high-quality
algorithms. Apache Spark MLlib is one of the most popular choices for data scientists because of its in-
memory data processing, which drastically improves the performance of iterative algorithms.
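
As a minimal sketch, here is k-means clustering with the RDD-based MLlib API on a toy, made-up dataset (spark-shell snippet; sc is predefined by the shell):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// cache() keeps the data in memory; iterative algorithms like k-means
// re-scan the dataset on every pass, which is where the speed-up comes from.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1))).cache()

val model = KMeans.train(points, k = 2, maxIterations = 20)
model.clusterCenters.foreach(println)   // prints the two learned cluster centers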

5.5. Spark GraphX


Apache Spark GraphX is the graph computation engine built on top of Spark that enables graph data to
be processed at scale.
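
A minimal sketch with toy, made-up vertices and edges (spark-shell snippet; sc is predefined by the shell):

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, attribute) pairs; edges carry their own attribute.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph    = Graph(vertices, edges)

// PageRank runs as a series of Spark jobs over the distributed graph.
graph.pageRank(tol = 0.001).vertices.collect().foreach(println)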

5.6. SparkR

It is an R package that provides a lightweight frontend for using Apache Spark from R. It allows data scientists to
analyze large datasets and to run jobs on them interactively from the R shell. The main idea
behind SparkR was to explore different techniques for integrating the usability of R with the scalability of
Spark.

6. Resilient Distributed Dataset – RDD


In this section of the Apache Spark tutorial, we will discuss the key abstraction of Spark, known as the RDD.


The Resilient Distributed Dataset (RDD) is the fundamental unit of data in Apache Spark: a collection of
elements distributed across cluster nodes that can be operated on in parallel. Spark RDDs
are immutable, but a new RDD can be generated by transforming an existing one.

There are three ways to create RDDs in Spark; a short sketch of each follows the list:

Parallelized collections – created by invoking the parallelize method on a collection in the
driver program.

External datasets – created by calling the textFile method, which takes the URL of a file
and reads it as a collection of lines.

Existing RDDs – created by applying a transformation operation to an existing RDD.
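
Here is a sketch of all three creation paths as spark-shell snippets (sc is the SparkContext predefined by the shell; data.txt is a hypothetical input file):

// 1. Parallelized collection: distribute a local collection across the cluster.
val nums = sc.parallelize(1 to 100)

// 2. External dataset: read a file (path or URL) as an RDD of lines.
val lines = sc.textFile("data.txt")   // "data.txt" is a hypothetical file

// 3. Existing RDD: a transformation returns a new, immutable RDD.
val evens = nums.filter(_ % 2 == 0)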

Apache Spark RDDs support two types of operations:

Transformation – creates a new RDD from an existing one by passing each element of the dataset through a
function and returning a new dataset.

Action – returns a final result to the driver program or writes it to an external data store.
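
The snippet below contrasts the two (spark-shell; the output path is hypothetical): transformations are lazy and only describe a new RDD, while actions trigger the actual computation.

val words  = sc.parallelize(Seq("spark", "rdd", "spark"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // transformations: lazy, build a new RDD
counts.collect().foreach(println)                        // action: runs the job, returns results to the driver
counts.saveAsTextFile("word-counts")                     // action: writes to an external store (hypothetical path)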

Refer to this link to learn the RDD transformation and action APIs with more examples.


7. Spark Shell
Apache Spark provides an interactive spark-shell. It helps Spark applications to easily run on the
command line of the system. Using Spark shell we can run/test our application code interactively. Spark
can read from many types of data sources so that it can access and process a large amount of data.
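
For example, a short interactive session might look like this (assuming Spark is on your PATH and README.md is any local text file):

$ spark-shell
scala> val lines = sc.textFile("README.md")        // sc is created by the shell
scala> lines.filter(_.contains("Spark")).count()   // runs a job and prints the count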

8. Conclusion
Spark provides a collection of technologies that increase the value of big data and permit new use
cases. It gives us a unified framework for creating, managing, and implementing big
data processing requirements. The Spark video tutorial provides detailed information about Spark.

In addition to MapReduce-style operations, one can also run SQL queries and process streaming
data through Spark, capabilities that were drawbacks of Hadoop 1. With Spark, developers can use
these features either on a stand-alone basis or combine them with MapReduce programming
techniques.

This conclusion is not the end, but a foundation for learning more about Apache Spark. Here are the
next steps once you are through with this Apache Spark tutorial:

1. Learn different terminologies of Apache Spark
2. Install Spark on Ubuntu
3. Learn Spark shell commands
4. Learn the internal working of Spark
5. Spark limitations and drawbacks
