Big Data
(Spark)
Instructor: Trong-Hop Do
May 5th 2021
S3Lab
Smart Software System Laboratory
1
“Big data is at the foundation of all the
megatrends that are happening today, from
social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems
2
Spark shell
● Open Spark shell
● Command: spark-shell
3
What is SparkContext?
● SparkContext is the entry point to Spark. It is defined in the
org.apache.spark package and is used to programmatically create Spark
RDDs, accumulators, and broadcast variables on the cluster.
● Its object sc is available by default in spark-shell, and a SparkContext can also be
created programmatically using the SparkContext class.
● Most of the operations/methods or functions we use in Spark come from
SparkContext, for example accumulators, broadcast variables, parallelize,
and more.
4
What is SparkContext?
● Test some methods with the default SparkContext object sc in Spark shell
● There are many more methods of SparkContext
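● As a rough sketch of what you might try in spark-shell (exact output depends on your Spark version):
scala> sc.version              // Spark version string
scala> sc.appName              // application name, e.g. "Spark shell"
scala> sc.master               // master URL the shell is connected to
scala> sc.defaultParallelism   // default number of partitions
scala> sc.parallelize(1 to 5)  // creates an RDD[Int]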
5
Spark RDD
● A Resilient Distributed Dataset (RDD) is the fundamental data structure of
Spark. It is an immutable distributed collection of objects. Each dataset in an
RDD is divided into logical partitions, which may be computed on different
nodes of the cluster.
● A Spark RDD can be created in several ways using the Scala and PySpark
languages: for example, by using
sparkContext.parallelize(), from a text file, from another RDD, or from a
DataFrame or Dataset.
6
Spark RDD
Create an RDD through Parallelized Collection
● Spark shell provides the SparkContext variable “sc”; use sc.parallelize() to create an RDD[Int].
Note: the printed list might appear unordered because foreach runs in
parallel across the partitions, which makes the output ordering
non-deterministic. The action collect() returns an array with the partitions
concatenated in their original order. Alternatively, you can set the number
of partitions to 1.
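● A minimal sketch of this step (the values and variable names are illustrative):
scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
scala> rdd.foreach(println)          // printed order may vary across partitions
scala> rdd.collect()                 // Array(1, 2, 3, 4, 5), in the original order
scala> sc.parallelize(List(1, 2, 3, 4, 5), 1).foreach(println)  // one partition, so the order is kept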
7
Spark RDD
Create an RDD through Parallelized Collection
● Spark shell provides the SparkContext variable “sc”; use sc.parallelize() to create an RDD[Int].
8
Spark RDD
Read single or multiple text files into a single RDD
● Create files and put them to HDFS
9
Spark RDD
Read single or multiple text files into a single RDD
● Read single file and create an RDD[String]
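● A possible command for this step (the HDFS path is a placeholder):
scala> val rdd1 = sc.textFile("/user/cloudera/files/file1.txt")   // RDD[String], one element per line
scala> rdd1.foreach(println)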
10
Spark RDD
Read single or multiple text files into a single RDD
● Read multiple files and create an RDD[String]
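● One way to do this is with a wildcard (the directory and extension are placeholders):
scala> val rddAll = sc.textFile("/user/cloudera/files/*.txt")   // all matching files merged into one RDD[String]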
11
Spark RDD
Read single or multiple text files into a single RDD
● Read multiple specific files and create an RDD[String]
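● textFile() also accepts a comma-separated list of paths (the file names are placeholders):
scala> val rddSome = sc.textFile("/user/cloudera/files/file1.txt,/user/cloudera/files/file3.txt")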
12
Spark RDD
Load CSV File into RDD
● Create .csv files and put them to HDFS
13
Spark RDD
Load CSV File into RDD
● The textFile() method reads each CSV record as a String and returns an RDD[String]
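● A sketch of this step (the CSV path is a placeholder):
scala> val rddCsv = sc.textFile("/user/cloudera/csv/data.csv")   // each CSV line becomes one String
scala> rddCsv.foreach(println)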
14
Spark RDD
Load CSV File into RDD
● We need every record in the CSV to be split on the comma delimiter and stored in the RDD as
multiple columns. To achieve this, we use the map() transformation on the RDD, which
converts RDD[String] to RDD[Array[String]] by splitting every record on the comma delimiter.
The map() method returns a new RDD instead of updating the existing one.
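● A sketch of this transformation (csvData follows the slide; rddCsv is the placeholder RDD from the previous step):
scala> val csvData = rddCsv.map(f => f.split(","))          // RDD[Array[String]]
scala> csvData.foreach(f => println(f(0) + " , " + f(1)))   // print the first and second columns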
Each record in the csvData RDD is an Array[String]
f(0) and f(1) are the first and second elements of the array f
15
Spark RDD
Load CSV File into RDD
● Skip header from csv file
● Command: rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
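● For example (a sketch, reusing the placeholder rddCsv from above):
scala> val noHeader = rddCsv.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }   // drops the first line of partition 0, i.e. the header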
16
Spark RDD Operations
● https://spark.apache.org/docs/latest/api/python/reference/pyspark.html
● https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
17
Spark RDD Operations
Two types of RDD operations
● Apache Spark RDD supports two types of Operations-
○ Transformations
○ Actions
● A Spark transformation is a function that produces a new RDD from an existing RDD. It takes an
RDD as input and produces one or more RDDs as output. Each time we apply a
transformation, a new RDD is created.
● RDD actions are operations that return raw values. In other words, any RDD function
that returns something other than an RDD is considered an action in Spark programming.
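● A small sketch of the difference (names are illustrative):
scala> val nums = sc.parallelize(1 to 10)
scala> val doubled = nums.map(_ * 2)   // transformation: returns a new RDD, nothing is computed yet
scala> doubled.count()                 // action: triggers the job and returns a value (10)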
18
Spark RDD Operations
RDD Transformation
● Two types of RDD Transformation
○ Narrow transformations are the result of map() and filter() functions
and these compute data that live on a single partition meaning there
will not be any data movement between partitions to execute narrow
transformations.
○ Wider transformations are the result of groupByKey() and
reduceByKey() functions and these compute data that live on many
partitions meaning there will be data movements between partitions
to execute wider transformations.
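● For example (a sketch), filter() below needs no shuffle, while reduceByKey() moves data between partitions:
scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
scala> pairs.filter(_._1 == "a")    // narrow: each partition is processed independently
scala> pairs.reduceByKey(_ + _)     // wide: values with the same key are shuffled together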
19
Spark RDD Operations
RDD Transformation
20
Spark RDD Operations
RDD Action
21
Spark RDD Operations
RDD Transformation – Word count example
● Create a .txt file and put it to HDFS
22
Spark RDD Operations
RDD Transformation – Word count example
● Read the file and create an RDD[String]
23
Spark RDD Operations
RDD Transformation – Word count example
● Apply flatMap() Transformation
● The flatMap() transformation applies a function to each element and flattens the result, returning a new
RDD. In the example below, it first splits each record in the RDD by space and then flattens the result. The
resulting RDD contains a single word in each record.
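● A sketch of these two steps (the file path and the RDD names testfile/testfile2 are assumptions):
scala> val testfile = sc.textFile("/user/cloudera/wordcount.txt")   // RDD[String], one line per record
scala> val testfile2 = testfile.flatMap(_.split(" "))               // RDD[String], one word per record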
24
Spark RDD Operations
RDD Transformation – Word count example
● Apply map() Transformation
● The map() transformation is used to apply operations such as adding a column,
updating a column, etc. The output of a map transformation always has the same
number of records as its input.
● In this word count example, we add a new column with value 1 for each word; the
result is a pair RDD that contains key-value pairs, with the word (of type String) as the key
and 1 (of type Int) as the value.
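● A sketch of this step, reusing the names above (testfile3 follows the slide):
scala> val testfile3 = testfile2.map(w => (w, 1))   // RDD[(String, Int)] of (word, 1) pairs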
Each record in the testfile3 RDD is a tuple (String, Int), not an array
25
Spark RDD Operations
RDD Transformation – Word count example
● Let's print out the content of these RDDs (each record is printed on its own line)
26
Spark RDD Operations
RDD Transformation – Word count example
● Apply reduceByKey() Transformation
● reduceByKey() merges the values for each key using the specified function. In our example, it
reduces the values for each word by applying the sum function. The resulting RDD
contains unique words and their counts.
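● A sketch of this step (the variable name count follows the slide):
scala> val count = testfile3.reduceByKey(_ + _)   // RDD[(String, Int)] of (word, total) pairs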
● Note: reduceByKey() is a transformation, so count is an RDD
27
Spark RDD Operations
RDD Transformation – Word count example
● Let’s print out the result
28
Spark RDD Operations
RDD Transformation – Word count example
● Another way to use reduceByKey
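● For instance (a sketch), the same reduction can be written with an explicit function instead of the underscore shorthand:
scala> val count2 = testfile3.reduceByKey((a, b) => a + b)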
29
Spark RDD Operations
RDD Transformation – Word count example
● Try sortByKey() Transformation
● The sortByKey() transformation sorts the elements of an RDD by key.
● In this example, we want to sort the RDD elements by the count of each word. Therefore, we
need to convert RDD[(String, Int)] to RDD[(Int, String)] using a map transformation and then apply
sortByKey, which sorts on the integer key.
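● A sketch of these two steps (the names swapped and sorted are illustrative):
scala> val swapped = count.map(a => (a._2, a._1))   // RDD[(Int, String)]: the count becomes the key
scala> val sorted = swapped.sortByKey()             // sorted by the integer count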
Notice how each element of a tuple is accessed. If a were an array, the second and first elements would be
accessed as a(1) and a(0). Here, a is a tuple, so the second and first elements are accessed as a._2 and a._1
30
Spark RDD Operations
RDD Transformation – Word count example
● Let's print out the result (and see that the count of each word has now become the key)
31
Spark RDD Operations
RDD Transformation – Word count example
● Apply sortByKey() transformation
If there is more than one partition, the records within each partition are sorted, but when
printed in parallel the sorted results of the different partitions may appear interleaved.
To avoid this, use sortByKey(numPartitions = 1)
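● For example (a sketch, reusing the swapped RDD from above):
scala> swapped.sortByKey(numPartitions = 1).foreach(println)   // one partition, so printing keeps the sorted order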
32
Spark RDD Operations
RDD Transformation – Word count example
● You can also apply a map transformation to switch the word back to being the key
33
Spark RDD Operations
RDD Transformation – Word count example
● You can do all these steps in one line of code
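● A sketch of such a one-liner (the file path is a placeholder):
scala> sc.textFile("/user/cloudera/wordcount.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).map(a => (a._2, a._1)).sortByKey().foreach(println)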
34
Spark RDD Operations
RDD Transformation – Word count example
● Or you can use sortBy
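● With sortBy, the key/value swap is not needed (a sketch):
scala> count.sortBy(_._2).foreach(println)   // sort the (word, count) pairs by their value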
35
Spark RDD Operations
RDD Transformation – Word count example
● Apply filter() Transformation
● The filter() transformation is used to filter the records in an RDD. In this example we keep
only the words that start with “t”.
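● A sketch of this step:
scala> count.filter(_._1.startsWith("t")).foreach(println)   // keep only words beginning with "t"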
36
Spark RDD Operations
RDD Transformation – Word count example
● Merge two RDDs
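● For example (a sketch, assuming two word-count style pair RDDs rddA and rddB):
scala> val merged = rddA.union(rddB)   // union keeps duplicates; apply distinct() afterwards to remove them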
37
Spark RDD Operations
RDD Action – Easiest Wordcount using countByValue
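● A sketch (the file path is a placeholder); countByValue() is an action that returns a Map, not an RDD:
scala> sc.textFile("/user/cloudera/wordcount.txt").flatMap(_.split(" ")).countByValue()   // Map(word -> count)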
38
Spark RDD Operations
RDD Action
● Let's create some RDDs
39
Spark RDD Operations
RDD Action
● reduce() - Reduces the elements of the dataset using the specified binary
operator.
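● For instance (a sketch):
scala> val listRdd = sc.parallelize(List(1, 2, 3, 4, 5))
scala> listRdd.reduce(_ + _)     // 15
scala> listRdd.reduce(_ min _)   // 1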
40
Spark RDD Operations
RDD Action
● collect() – Return the complete dataset as an Array.
41
Spark RDD Operations
RDD Action
● count() – Return the count of elements in the dataset.
● countApprox() – Return an approximate count of elements in the dataset; the result may be
incomplete if execution does not finish within the given timeout.
● countApproxDistinct() – Return an approximate number of distinct elements in the dataset.
42
Spark RDD Operations
RDD Action
● first() – Return the first element in the dataset.
● top() – Return top n elements from the dataset.
● min() – Return the minimum value from the dataset.
● max() – Return the maximum value from the dataset.
● take() – Return the first num elements of the dataset.
● takeOrdered() – Return the first num (smallest) elements from the dataset; this is
the opposite of the top() action.
● takeSample() – Return a random subset of the dataset as an Array.
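● A few of these actions applied to the listRdd sketched earlier (expected results shown as comments):
scala> listRdd.first()          // 1
scala> listRdd.top(2)           // Array(5, 4)
scala> listRdd.min()            // 1
scala> listRdd.max()            // 5
scala> listRdd.take(2)          // Array(1, 2)
scala> listRdd.takeOrdered(2)   // Array(1, 2)
scala> listRdd.takeSample(false, 3)   // 3 random elements, without replacement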
43
Spark Pair RDD Functions
Pair RDD Transformation
Spark defines the PairRDDFunctions class with several functions for working with pair RDDs, i.e. RDDs of key-value pairs
44
Spark Pair RDD Functions
Pair RDD Action
Spark defines the PairRDDFunctions class with several functions for working with pair RDDs, i.e. RDDs of key-value pairs
45
Spark Pair RDD Functions
Wordcount using reduceByKey
46
Spark Pair RDD Functions
Wordcount using reduceByKey
● Save the output in a text file
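● A sketch (the output path is a placeholder); saveAsTextFile() writes one part file per partition:
scala> count.saveAsTextFile("/user/cloudera/wordcount_output")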
47
Spark Pair RDD Functions
Wordcount using reduceByKey
● Check the result
48
Spark Pair RDD Functions
Another usage of reduceByKey
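● For instance (a sketch with made-up data), reduceByKey can sum any numeric values per key:
scala> val sales = sc.parallelize(Seq(("apple", 2), ("banana", 3), ("apple", 5)))
scala> sales.reduceByKey(_ + _).collect()   // (apple,7) and (banana,3), in some order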
49
Spark Pair RDD Functions
Wordcount using countByKey action
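● A sketch; countByKey() is an action that counts how many records exist for each key:
scala> testfile3.countByKey()   // Map(word -> number of occurrences)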
50
Spark Pair RDD Functions
Join two RDDs
● Create two RDDs from text files
51
Spark Pair RDD Functions
Join two RDDs
● Check the result
If there are duplicate keys in either RDD, the join will produce a Cartesian product of the matching values.
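● A sketch of the join (the RDD names and contents are made up):
scala> val emp = sc.parallelize(Seq((1, "An"), (2, "Binh")))
scala> val dept = sc.parallelize(Seq((1, "IT"), (2, "HR"), (2, "Sales")))
scala> emp.join(dept).collect()   // e.g. (1,(An,IT)), (2,(Binh,HR)), (2,(Binh,Sales)); key 2 appears twice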
52
Spark Pair RDD Functions
Join two RDDs
● Use cogroup or groupByKey to get unique keys in the join result
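● A sketch reusing the RDDs above; cogroup keeps one record per key, with the matching values grouped into Iterables:
scala> emp.cogroup(dept).collect()   // each key appears once, with two Iterables of grouped values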
53
Spark shell Web UI
Accessing the Web UI of Spark
● Open http://quickstart.cloudera:4040 in a web browser
54
Spark shell Web UI
Explore Spark shell Web UI
● Click DAG Visualization
Number of partitions
55
Spark shell Web UI
Explore Spark shell Web UI
● View the Directed Acyclic Graph (DAG) of the completed job
56
Spark shell Web UI
Partition and parallelism in RDDs
57
Spark shell Web UI
Partition and parallelism in RDDs
● Execute a parallel task in the shell
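● For example (a sketch), a job whose 5 partitions will show up as 5 parallel tasks in the Web UI:
scala> sc.parallelize(1 to 100, 5).count()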
58
Spark shell Web UI
Partition and parallelism in RDDs
● Open Spark shell Web UI
In total, 5 partitions have been processed
59
Spark shell Web UI
Partition and parallelism in RDDs
Parallel execution of 5 completed jobs
60
Q&A
Thank you for your attention
We hope to reach success together.
61