Big Data
(Spark)
Instructor: Trong-Hop Do
May 5th 2021
S3Lab
Smart Software System Laboratory
1
“Big data is at the foundation of all the
megatrends that are happening today, from
social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems
2
Spark shell
● Open Spark shell
● Command: spark-shell
3
What is SparkContext?
● SparkContext is the entry point to Spark. It is defined in the
org.apache.spark package and is used to programmatically create Spark
RDDs, accumulators, and broadcast variables on the cluster.
● Its object sc is available by default in spark-shell, and a SparkContext can also be
created programmatically using the SparkContext class.
● Most of the operations/methods or functions we use in Spark come from
SparkContext, for example accumulators, broadcast variables, parallelize,
and more.
4
What is SparkContext?
● Test some methods with the default SparkContext object sc in Spark shell
● There are many more methods of SparkContext
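● As a rough sketch of what you might try in spark-shell (exact output depends on your Spark version):
scala> sc.version              // Spark version string
scala> sc.appName              // application name, e.g. "Spark shell"
scala> sc.master               // master URL the shell is connected to
scala> sc.defaultParallelism   // default number of partitions
scala> sc.parallelize(1 to 5)  // creates an RDD[Int]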
5
Spark RDD
● A Resilient Distributed Dataset (RDD) is the fundamental data structure of
Spark. It is an immutable distributed collection of objects. Each dataset in an
RDD is divided into logical partitions, which may be computed on different
nodes of the cluster.
● A Spark RDD can be created in several ways using the Scala and PySpark
languages: for example, by using
sparkContext.parallelize(), from a text file, from another RDD, or from a
DataFrame or Dataset.
6
Spark RDD
Create an RDD through Parallelized Collection
● Spark shell provides the SparkContext variable “sc”; use sc.parallelize() to create an RDD[Int].
Note: the printed list might appear unordered because foreach runs in
parallel across the partitions, which makes the output ordering
non-deterministic. The action collect() returns an array with the partitions
concatenated in their original order. Alternatively, you can set the number
of partitions to 1.
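● A minimal sketch of this step (the values and variable names are illustrative):
scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
scala> rdd.foreach(println)          // printed order may vary across partitions
scala> rdd.collect()                 // Array(1, 2, 3, 4, 5), in the original order
scala> sc.parallelize(List(1, 2, 3, 4, 5), 1).foreach(println)  // one partition, so the order is kept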
7
Spark RDD
Create an RDD through Parallelized Collection
● Spark shell provides the SparkContext variable “sc”; use sc.parallelize() to create an RDD[Int].
8
Spark RDD
Read single or multiple text files into a single RDD
● Create files and put them to HDFS
9
Spark RDD
Read single or multiple text files into a single RDD
● Read single file and create an RDD[String]
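● A possible command for this step (the HDFS path is a placeholder):
scala> val rdd1 = sc.textFile("/user/cloudera/files/file1.txt")   // RDD[String], one element per line
scala> rdd1.foreach(println)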
10
Spark RDD
Read single or multiple text files into a single RDD
● Read multiple files and create an RDD[String]
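● One way to do this is with a wildcard (the directory and extension are placeholders):
scala> val rddAll = sc.textFile("/user/cloudera/files/*.txt")   // all matching files merged into one RDD[String]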
11
Spark RDD
Read single or multiple text files into a single RDD
● Read multiple specific files and create an RDD[String]
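● textFile() also accepts a comma-separated list of paths (the file names are placeholders):
scala> val rddSome = sc.textFile("/user/cloudera/files/file1.txt,/user/cloudera/files/file3.txt")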
12
Spark RDD
Load CSV File into RDD
● Create .csv files and put them to HDFS
13
Spark RDD
Load CSV File into RDD
● The textFile() method reads each CSV record as a String and returns an RDD[String]
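● A sketch of this step (the CSV path is a placeholder):
scala> val rddCsv = sc.textFile("/user/cloudera/csv/data.csv")   // each CSV line becomes one String
scala> rddCsv.foreach(println)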
14
Spark RDD
Load CSV File into RDD
● We need every record in the CSV to be split on the comma delimiter and stored in the RDD as
multiple columns. To achieve this, we use the map() transformation on the RDD, which
converts RDD[String] to RDD[Array[String]] by splitting every record on the comma delimiter.
The map() method returns a new RDD instead of updating the existing one.
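● A sketch of this transformation (csvData follows the slide; rddCsv is the placeholder RDD from the previous step):
scala> val csvData = rddCsv.map(f => f.split(","))          // RDD[Array[String]]
scala> csvData.foreach(f => println(f(0) + " , " + f(1)))   // print the first and second columns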
Each record in the csvData RDD is an Array[String]
f(0) and f(1) are the first and second elements of the array f
15
Spark RDD
Load CSV File into RDD
● Skip header from csv file
● Command: rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
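● For example (a sketch, reusing the placeholder rddCsv from above):
scala> val noHeader = rddCsv.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }   // drops the first line of partition 0, i.e. the header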
16
Spark RDD Operations
● https://spark.apache.org/docs/latest/api/python/reference/pyspark.html
● https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
17
Spark RDD Operations
Two types of RDD operations
● Apache Spark RDD supports two types of Operations-
○ Transformations
○ Actions
● A Spark transformation is a function that produces a new RDD from an existing RDD. It takes an
RDD as input and produces one or more RDDs as output. Each time we apply a
transformation, a new RDD is created.
● RDD actions are operations that return raw values. In other words, any RDD function
that returns something other than an RDD is considered an action in Spark programming.
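● A small sketch of the difference (names are illustrative):
scala> val nums = sc.parallelize(1 to 10)
scala> val doubled = nums.map(_ * 2)   // transformation: returns a new RDD, nothing is computed yet
scala> doubled.count()                 // action: triggers the job and returns a value (10)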
18
Spark RDD Operations
RDD Transformation
● Two types of RDD Transformation
○ Narrow transformations are the result of map() and filter() functions
and these compute data that live on a single partition meaning there
will not be any data movement between partitions to execute narrow
transformations.
○ Wider transformations are the result of groupByKey() and
reduceByKey() functions and these compute data that live on many
partitions meaning there will be data movements between partitions
to execute wider transformations.
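● For example (a sketch), filter() below needs no shuffle, while reduceByKey() moves data between partitions:
scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
scala> pairs.filter(_._1 == "a")    // narrow: each partition is processed independently
scala> pairs.reduceByKey(_ + _)     // wide: values with the same key are shuffled together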
19
Spark RDD Operations
RDD Transformation
20
Spark RDD Operations
RDD Action
21
Spark RDD Operations
RDD Transformation – Word count example
● Create a .txt file and put it to HDFS
22
Spark RDD Operations
RDD Transformation – Word count example
● Read the file and create an RDD[String]
23
Spark RDD Operations
RDD Transformation – Word count example
● Apply flatMap() Transformation
● The flatMap() transformation applies a function to each element and flattens the result, returning a new
RDD. In the example below, it first splits each record in the RDD by space and then flattens the result. The
resulting RDD contains a single word in each record.
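● A sketch of these two steps (the file path and the RDD names testfile/testfile2 are assumptions):
scala> val testfile = sc.textFile("/user/cloudera/wordcount.txt")   // RDD[String], one line per record
scala> val testfile2 = testfile.flatMap(_.split(" "))               // RDD[String], one word per record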
24
Spark RDD Operations
RDD Transformation – Word count example
● Apply map() Transformation
● The map() transformation is used to apply operations such as adding a column,
updating a column, etc. The output of a map transformation always has the same
number of records as its input.
● In this word count example, we add a new column with value 1 for each word; the
result is a pair RDD that contains key-value pairs, with the word (of type String) as the key
and 1 (of type Int) as the value.
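● A sketch of this step, reusing the names above (testfile3 follows the slide):
scala> val testfile3 = testfile2.map(w => (w, 1))   // RDD[(String, Int)] of (word, 1) pairs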
Each record in the testfile3 RDD is a tuple (String, Int), not an array
25
Spark RDD Operations
RDD Transformation – Word count example
● Let's print out the content of these RDDs (each record is printed on its own line)
26
Spark RDD Operations
RDD Transformation – Word count example
● Apply reduceByKey() Transformation
● reduceByKey() merges the values for each key using the specified function. In our example, it
reduces the values for each word by applying the sum function. The resulting RDD
contains unique words and their counts.
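● A sketch of this step (the variable name count follows the slide):
scala> val count = testfile3.reduceByKey(_ + _)   // RDD[(String, Int)] of (word, total) pairs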
● Note: reduceByKey() is a transformation, so count is an RDD
27
Spark RDD Operations
RDD Transformation – Word count example
● Let’s print out the result
28
Spark RDD Operations
RDD Transformation – Word count example
● Another way to use reduceByKey
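● For instance (a sketch), the same reduction can be written with an explicit function instead of the underscore shorthand:
scala> val count2 = testfile3.reduceByKey((a, b) => a + b)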
29
Spark RDD Operations
RDD Transformation – Word count example
● Try sortByKey() Transformation
● The sortByKey() transformation sorts the elements of an RDD by key.
● In this example, we want to sort the RDD elements by the count of each word. Therefore, we
need to convert RDD[(String, Int)] to RDD[(Int, String)] using a map transformation and then apply
sortByKey, which sorts on the integer key.
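● A sketch of these two steps (the names swapped and sorted are illustrative):
scala> val swapped = count.map(a => (a._2, a._1))   // RDD[(Int, String)]: the count becomes the key
scala> val sorted = swapped.sortByKey()             // sorted by the integer count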
Notice how each element of a tuple is accessed. If a were an array, the second and first elements would be
accessed as a(1) and a(0). Here, a is a tuple, so the second and first elements are accessed as a._2 and a._1
30
Spark RDD Operations
RDD Transformation – Word count example
● Let's print out the result (and see that the count of each word has now become the key)
31
Spark RDD Operations
RDD Transformation – Word count example
● Apply sortByKey() transformation
If there is more than one partition, the records within each partition are sorted, but when
printed in parallel the sorted results of the different partitions may appear interleaved.
To avoid this, use sortByKey(numPartitions = 1)
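● For example (a sketch, reusing the swapped RDD from above):
scala> swapped.sortByKey(numPartitions = 1).foreach(println)   // one partition, so printing keeps the sorted order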
32
Spark RDD Operations
RDD Transformation – Word count example
● You can also apply a map transformation to switch the word back to being the key
33
Spark RDD Operations
RDD Transformation – Word count example
● You can do all these steps in one line of code
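● A sketch of such a one-liner (the file path is a placeholder):
scala> sc.textFile("/user/cloudera/wordcount.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).map(a => (a._2, a._1)).sortByKey().foreach(println)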
34
Spark RDD Operations
RDD Transformation – Word count example
● Or you can use sortBy
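● With sortBy, the key/value swap is not needed (a sketch):
scala> count.sortBy(_._2).foreach(println)   // sort the (word, count) pairs by their value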
35
Spark RDD Operations
RDD Transformation – Word count example
● Apply filter() Transformation
● The filter() transformation is used to filter the records in an RDD. In this example we keep
only the words that start with “t”.
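● A sketch of this step:
scala> count.filter(_._1.startsWith("t")).foreach(println)   // keep only words beginning with "t"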
36
Spark RDD Operations
RDD Transformation – Word count example
● Merge two RDDs
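● For example (a sketch, assuming two word-count style pair RDDs rddA and rddB):
scala> val merged = rddA.union(rddB)   // union keeps duplicates; apply distinct() afterwards to remove them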
37
Spark RDD Operations
RDD Action – Easiest Wordcount using countByValue
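● A sketch (the file path is a placeholder); countByValue() is an action that returns a Map, not an RDD:
scala> sc.textFile("/user/cloudera/wordcount.txt").flatMap(_.split(" ")).countByValue()   // Map(word -> count)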
38
Spark RDD Operations
RDD Action
● Let's create some RDDs
39
Spark RDD Operations
RDD Action
● reduce() - Reduces the elements of the dataset using the specified binary
operator.
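● For instance (a sketch):
scala> val listRdd = sc.parallelize(List(1, 2, 3, 4, 5))
scala> listRdd.reduce(_ + _)     // 15
scala> listRdd.reduce(_ min _)   // 1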
40
Spark RDD Operations
RDD Action
● collect() – Return the complete dataset as an Array.
41
Spark RDD Operations
RDD Action
● count() – Return the count of elements in the dataset.
● countApprox() – Return an approximate count of elements in the dataset; the result may be
incomplete if execution does not finish within the given timeout.
● countApproxDistinct() – Return an approximate number of distinct elements in the dataset.
42
Spark RDD Operations
RDD Action
● first() – Return the first element in the dataset.
● top() – Return top n elements from the dataset.
● min() – Return the minimum value from the dataset.
● max() – Return the maximum value from the dataset.
● take() – Return the first num elements of the dataset.
● takeOrdered() – Return the first num (smallest) elements from the dataset; this is
the opposite of the top() action.
● takeSample() – Return a random subset of the dataset as an Array.
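● A few of these actions applied to the listRdd sketched earlier (expected results shown as comments):
scala> listRdd.first()          // 1
scala> listRdd.top(2)           // Array(5, 4)
scala> listRdd.min()            // 1
scala> listRdd.max()            // 5
scala> listRdd.take(2)          // Array(1, 2)
scala> listRdd.takeOrdered(2)   // Array(1, 2)
scala> listRdd.takeSample(false, 3)   // 3 random elements, without replacement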
43
Spark Pair RDD Functions
Pair RDD Transformation
Spark defines the PairRDDFunctions class with several functions for working with pair RDDs, i.e. RDDs of key-value pairs
44
Spark Pair RDD Functions
Pair RDD Action
Spark defines the PairRDDFunctions class with several functions for working with pair RDDs, i.e. RDDs of key-value pairs
45
Spark Pair RDD Functions
Wordcount using reduceByKey
46
Spark Pair RDD Functions
Wordcount using reduceByKey
● Save the output in a text file
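● A sketch (the output path is a placeholder); saveAsTextFile() writes one part file per partition:
scala> count.saveAsTextFile("/user/cloudera/wordcount_output")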
47
Spark Pair RDD Functions
Wordcount using reduceByKey
● Check the result
48
Spark Pair RDD Functions
Another usage of reduceByKey
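● For instance (a sketch with made-up data), reduceByKey can sum any numeric values per key:
scala> val sales = sc.parallelize(Seq(("apple", 2), ("banana", 3), ("apple", 5)))
scala> sales.reduceByKey(_ + _).collect()   // (apple,7) and (banana,3), in some order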
49
Spark Pair RDD Functions
Wordcount using countByKey action
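● A sketch; countByKey() is an action that counts how many records exist for each key:
scala> testfile3.countByKey()   // Map(word -> number of occurrences)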
50
Spark Pair RDD Functions
Join two RDDs
● Create two RDDs from text files
51
Spark Pair RDD Functions
Join two RDDs
● Check the result
If there are duplicate keys in either RDD, the join will produce a Cartesian product of the matching values.
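● A sketch of the join (the RDD names and contents are made up):
scala> val emp = sc.parallelize(Seq((1, "An"), (2, "Binh")))
scala> val dept = sc.parallelize(Seq((1, "IT"), (2, "HR"), (2, "Sales")))
scala> emp.join(dept).collect()   // e.g. (1,(An,IT)), (2,(Binh,HR)), (2,(Binh,Sales)); key 2 appears twice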
52
Spark Pair RDD Functions
Join two RDDs
● Use cogroup or groupByKey to get unique keys in the join result
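● A sketch reusing the RDDs above; cogroup keeps one record per key, with the matching values grouped into Iterables:
scala> emp.cogroup(dept).collect()   // each key appears once, with two Iterables of grouped values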
53
Spark shell Web UI
Accessing the Web UI of Spark
● Open http://quickstart.cloudera:4040 in a web browser
54
Spark shell Web UI
Explore Spark shell Web UI
● Click DAG Visualization
Number of partitions
55
Spark shell Web UI
Explore Spark shell Web UI
● View the Directed Acyclic Graph (DAG) of the completed job
56
Spark shell Web UI
Partition and parallelism in RDDs
57
Spark shell Web UI
Partition and parallelism in RDDs
● Execute a parallel task in the shell
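● For example (a sketch), a job whose 5 partitions will show up as 5 parallel tasks in the Web UI:
scala> sc.parallelize(1 to 100, 5).count()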
58
Spark shell Web UI
Partition and parallelism in RDDs
● Open Spark shell Web UI
In total, 5 partitions have been processed
59
Spark shell Web UI
Partition and parallelism in RDDs
Parallel execution of 5 completed jobs
60
Q&A
Thank you for your attention
We hope to reach success together.
61