Chapter 6 Spark - An in-Memory Distributed Computing Engine
This course describes the basic concepts of Spark and the similarities and
differences between the Resilient Distributed Dataset (RDD), DataSet, and
DataFrame data structures in Spark. It also introduces the features of
Spark SQL, Spark Streaming, and Structured Streaming.
1 Huawei Confidential
Objectives
Contents
1. Spark Overview
Introduction to Spark
Apache Spark was developed in the UC Berkeley AMP lab in 2009.
It is a fast, versatile, and scalable memory-based big data computing engine.
As a one-stop solution, Apache Spark integrates batch processing, real-time
streaming, interactive query, graph programming, and machine learning.
Application Scenarios of Spark
Batch processing can be used for extracting, transforming, and loading (ETL).
Machine learning can be used to automatically determine whether eBay buyers'
comments are positive or negative.
Interactive analysis can be used to query the Hive warehouse.
Streaming processing can be used for real-time businesses such as page-click
stream analysis, recommendation systems, and public opinion analysis.
Highlights of Spark
Lightweight: the Spark core has about 30,000 lines of code.
Fast: the latency for small datasets reaches the sub-second level.
Flexible: Spark offers different levels of flexibility.
Smart: Spark smartly leverages existing big data components.
Spark and MapReduce
RDD - Core Concept of Spark
Resilient Distributed Datasets (RDDs) are resilient, read-only, and partitioned distributed datasets.
RDDs are stored in the memory by default and are written to disks when the memory is
insufficient.
RDD data is stored in clusters as partitions.
RDDs have a lineage mechanism, which allows for rapid data recovery when data loss occurs.
RDD Dependencies
(Figure: groupByKey produces a wide dependency; map and filter produce narrow dependencies.)
Differences Between Wide and Narrow Dependencies - Operator
Narrow dependency indicates that each partition of a parent RDD can be used
by at most one partition of a child RDD, for example, map, filter, and union.
Wide dependency indicates that multiple child RDD partitions depend on the
same parent RDD partition, for example, groupByKey, reduceByKey, and
sortByKey.
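To make the distinction concrete, the following Scala sketch shows which operators stay narrow and which trigger a shuffle. It assumes a live SparkContext `sc`; the data and names are illustrative.

```scala
// Narrow dependencies: each parent partition feeds at most one child partition.
val nums    = sc.parallelize(1 to 10, numSlices = 4)
val doubled = nums.map(_ * 2)             // map: narrow, no shuffle
val evens   = doubled.filter(_ % 4 == 0)  // filter: narrow, no shuffle

// Wide dependency: a child partition draws from many parent partitions (shuffle).
val pairs   = nums.map(n => (n % 3, n))
val grouped = pairs.groupByKey()          // groupByKey: wide, triggers a shuffle
```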
Differences Between Wide and Narrow Dependencies - Fault Tolerance
If a node is faulty:
Narrow dependency: Only the parent RDD partition corresponding to the child RDD partition
needs to be recalculated.
Wide dependency: In extreme cases, all parent RDD partitions need to be recalculated.
As shown in the following figure, if the b1 partition is lost, a1, a2, and a3 need to be
recalculated.
Differences Between Wide and Narrow Dependencies - Data Transmission
A wide dependency usually corresponds to a shuffle operation: during
execution, the data in one parent RDD partition must be transferred to
different child RDD partitions, which may involve data transmission between
multiple nodes.
With a narrow dependency, each parent RDD partition is transferred to only
one child RDD partition, so the transformation can generally be completed on
a single node.
Stage Division of RDD
(Figure: RDDs A to G form a DAG. A groupBy transforms A into B; a map transforms C into D; a union combines D and E into F; a join combines B and F into G. The DAG is divided into stages at wide dependencies: Stage 1 produces B, Stage 2 produces F, and Stage 3 performs the final join.)
RDD Operation Type
Spark operations can be classified into creation, transformation, control, and action operations.
Creation operation: used to create an RDD. An RDD is created from an in-memory collection or an
external storage system, or by a transformation operation.
Transformation operation: transforms an RDD into a new RDD through certain operations. RDD
transformations are lazy: they only define a new RDD but do not execute it immediately.
Control operation: RDD persistence is performed. An RDD can be stored in the disk or memory based on
different storage policies. For example, the cache API caches the RDD in the memory by default.
Action operation: An operation that can trigger Spark running. Action operations in Spark are classified
into two types. One is to output the calculation result, and the other is to save the RDD to an external file
system or database.
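The four operation types can be seen together in a short Scala sketch. It assumes a live SparkContext `sc`; the HDFS path is illustrative.

```scala
// Creation: build an RDD from external storage.
val lines  = sc.textFile("hdfs:///data/input.txt")

// Transformation: lazy, only defines a new RDD.
val errors = lines.filter(_.contains("ERROR"))

// Control: mark the RDD for in-memory caching.
errors.cache()

// Action: triggers the actual computation and returns a result.
val n = errors.count()
```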
Creation Operation
Currently, there are two types of basic RDDs:
Parallel collection: an existing in-memory collection is distributed and computed in parallel.
External storage: a function is executed on each record of a file. The file system must
be HDFS or any storage system supported by Hadoop.
Both types of RDDs can be operated on in the same way to derive child RDDs
and form a lineage graph.
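The two basic creation paths look like this in Scala (assuming a live SparkContext `sc`; the HDFS path is illustrative):

```scala
// Parallel collection: distribute an in-memory collection across the cluster.
val rdd1 = sc.parallelize(Seq(1, 2, 3, 4, 5))

// External storage: one RDD element per line of an HDFS (or other
// Hadoop-supported) file.
val rdd2 = sc.textFile("hdfs:///data/words.txt")
```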
Control Operation
Spark can persist an RDD in memory or in a disk file system. Keeping RDDs in
memory can greatly improve iterative computing and data sharing between
computing models. Generally, 60% of an execution node's memory is used
to cache data, and the remaining 40% is used to run tasks. In Spark, the
persist and cache operations are used for data persistence; cache is a
special case of persist().
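A minimal persistence sketch in Scala, assuming a live SparkContext `sc` and an illustrative log path:

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/app.log")
logs.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk when memory is insufficient
// logs.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)

val errors = logs.filter(_.contains("ERROR")).count() // first action materializes the cache
val warns  = logs.filter(_.contains("WARN")).count()  // reuses the cached data

logs.unpersist() // release the cached data when it is no longer needed
```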
Transformation Operation - Transformation Operator
Transformation: Description
map(func): applies func to each element of the source RDD and returns a new RDD of the results.
filter(func): applies func to each element of the source RDD and returns a new RDD containing the elements for which func returns true.
reduceByKey(func, [numTasks]): similar to groupByKey, except that the values of each key are aggregated using the given func to obtain a new value.
join(otherDataset, [numTasks]): for datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs. leftOuterJoin, rightOuterJoin, and fullOuterJoin are supported.
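The listed transformations can be sketched in Scala as follows (assuming a live SparkContext `sc`; the pair data is illustrative):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val scaled  = pairs.map { case (k, v) => (k, v * 10) } // transform each element
val bigOnes = pairs.filter { case (_, v) => v > 1 }    // keep elements where func is true
val summed  = pairs.reduceByKey(_ + _)                 // aggregate values per key

val other  = sc.parallelize(Seq(("a", "x"), ("b", "y")))
val joined = pairs.join(other)                         // (K, (V, W)) pairs per matching key
```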
Action Operation - Action Operator
Action: Description
reduce(func): aggregates the elements of the dataset using func.
collect(): returns all elements of the dataset as an array; suitable for results that are small enough, such as the output of a filter.
count(): returns the number of elements in the RDD.
first(): returns the first element of the dataset.
take(n): returns the first n elements of the dataset as an array.
saveAsTextFile(path): writes the dataset to a text file in a local or HDFS path; Spark converts each record into a line and writes it to the file.
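The listed actions can be sketched in Scala as follows (assuming a live SparkContext `sc`; the data and output path are illustrative):

```scala
val nums = sc.parallelize(Seq(3, 1, 4, 1, 5))

val total = nums.reduce(_ + _) // aggregate all elements with the given function
val all   = nums.collect()     // bring all elements back to the driver as an array
val size  = nums.count()       // number of elements in the RDD
val head  = nums.first()       // first element
val few   = nums.take(2)       // first n elements as an array

nums.saveAsTextFile("hdfs:///data/out") // one line per record
```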
DataFrame
Similar to an RDD, a DataFrame is an immutable, resilient, and distributed
dataset. In addition to the data, it records the structure of the data,
known as the schema, so a DataFrame resembles a two-dimensional table.
The query plan of a DataFrame can be optimized by the Spark Catalyst
optimizer, which makes DataFrames both easy to use and efficient: programs
written with the DataFrame API are converted into efficient forms for
execution.
DataSet
A DataFrame is a special case of a DataSet (DataFrame = Dataset[Row]), so
you can use the as method to convert a DataFrame to a DataSet. Row is a
generic type through which all table structure information is represented.
A DataSet is a strongly typed dataset, for example, Dataset[Car] or
Dataset[Person].
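The conversion via the as method can be sketched in Scala as follows. It assumes an existing SparkSession `spark`; the case class and JSON path are illustrative.

```scala
case class Person(id: Long, name: String, age: Int)

import spark.implicits._ // provides the Encoder needed by as[Person]

val df = spark.read.json("hdfs:///data/people.json") // untyped: Dataset[Row]
val ds = df.as[Person]                               // typed: Dataset[Person]
val names = ds.map(_.name)                           // fields checked at compile time
```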
Differences Between DataFrame, DataSet, and RDD
Assume that an RDD contains two records:
1, Allen, 23
2, Bobby, 35
(Figure: the same two records represented as an RDD, as a DataFrame with schema information, and as a typed DataSet.)
Spark System Architecture
Spark Core is the foundation; the upper-layer components are Spark SQL, Structured Streaming, Spark Streaming, MLlib, GraphX, and SparkR.
Typical Case - WordCount
WordCount
object WordCount
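The slide's code is truncated at `object WordCount`; a minimal runnable sketch in Scala follows, assuming the input and output paths are passed as command-line arguments.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc   = new SparkContext(conf)

    sc.textFile(args(0))          // input path, e.g. an HDFS directory
      .flatMap(_.split("\\s+"))   // split each line into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word (wide dependency)
      .saveAsTextFile(args(1))    // output path

    sc.stop()
  }
}
```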
Spark SQL Overview
Spark SQL is the module used in Spark for structured data processing. In Spark
applications, you can seamlessly use SQL statements or DataFrame APIs to
query structured data.
(Figure: the Catalyst optimization pipeline. A DataFrame or Dataset query is parsed into an unresolved logical plan, resolved into a logical plan using the Catalog, optimized into an optimized logical plan, translated into candidate physical plans, and the physical plan selected by a cost model is executed as RDDs.)
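The "SQL statements or DataFrame APIs" point can be sketched in Scala as follows, assuming an existing SparkSession `spark` and an illustrative JSON path:

```scala
val people = spark.read.json("hdfs:///data/people.json")
people.createOrReplaceTempView("people")

// The same query, expressed two ways; both go through the Catalyst pipeline.
val bySql = spark.sql("SELECT name, age FROM people WHERE age > 21")
val byApi = people.select("name", "age").where("age > 21")
```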
Spark SQL and Hive
Differences:
The execution engine of Spark SQL is Spark Core, and the default execution engine
of Hive is MapReduce.
The execution speed of Spark SQL is 10 to 100 times faster than that of Hive.
Spark SQL does not support buckets, but Hive does.
Relationships:
Spark SQL depends on the metadata of Hive.
Spark SQL is compatible with most syntax and functions of Hive.
Spark SQL can use the custom functions of Hive.
Structured Streaming Overview (1)
Structured Streaming is a streaming data processing engine built on the Spark
SQL engine. You can write a streaming computation the same way you would a
computation on static data. As streaming data is continuously generated,
Spark SQL processes it incrementally and continuously updates the result
table.
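A minimal Structured Streaming word count sketch in Scala, assuming an existing SparkSession `spark` and a socket source on localhost:9999 (host and port are illustrative):

```scala
import spark.implicits._

// Read a stream of lines and count words; the result table is updated
// incrementally as new data arrives.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

counts.writeStream
  .outputMode("complete") // emit the full, updated result table each trigger
  .format("console")
  .start()
  .awaitTermination()
```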
Structured Streaming Overview (2)
Example Programming Model of Structured Streaming
(Figure: words arrive on the stream over time. Time 1: cat, dog, dog, dog. Time 2: owl, cat. Time 3: dog, owl. The result table is updated after each trigger: after time 1, cat 1 and dog 3; after time 2, cat 2, dog 3, and owl 1; after time 3, cat 2, dog 4, and owl 2.)
Spark Streaming Overview
The basic principle of Spark Streaming is to split the real-time input data
stream into time slices (in seconds), and then use the Spark engine to
process the data of each time slice in a way similar to batch processing.
(Figure: the input data stream is split by Spark Streaming into batches of input data, which the Spark Engine turns into batches of processed data.)
Window Interval and Sliding Interval
(Figure: a window-based operation slides over the original DStream to produce a windowed DStream; windows are formed at times 1, 3, and 5.)
The window slides over the DStream. The RDDs that fall within the window are merged and operated on to
generate a window-based RDD.
Window length: the duration of a window.
Sliding interval: the interval at which the window operation is performed.
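A windowed word count sketch in Scala, assuming an existing SparkConf `conf` and a socket source on localhost:9999 (host, port, and the 30 s / 10 s window parameters are illustrative):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(conf, Seconds(1)) // 1-second batch interval
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

// Window length 30 s, sliding interval 10 s: every 10 s, count the words
// seen over the last 30 s.
val windowedCounts = words.map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

windowedCounts.print()
ssc.start()
ssc.awaitTermination()
```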
Comparison Between Spark Streaming and Storm
(Table excerpt) Dynamic adjustment of parallelism: supported by Spark Streaming, not supported by Storm.
Summary
Quiz
3. What are the differences between wide dependencies and narrow dependencies
of Spark?
Recommendations
Thank you.
Bring digital to every person, home, and organization for a fully connected,
intelligent world.