
Chapter 6 Spark - An In-Memory Distributed Computing Engine


Foreword

 This course describes the basic concepts of Spark and the similarities and
differences between the Resilient Distributed Dataset (RDD), DataSet, and
DataFrame data structures in Spark. It also introduces the features of Spark SQL,
Spark Streaming, and Structured Streaming.

1 Huawei Confidential
Objectives

 Upon completion of this course, you will be able to:


 Understand the application scenarios and master the highlights of Spark.
 Master Spark programming and its technical framework.

2 Huawei Confidential
Contents

1. Spark Overview

2. Spark Data Structure

3. Spark Principles and Architecture

3 Huawei Confidential
Introduction to Spark
 Apache Spark was developed in the UC Berkeley AMP lab in 2009.
 It is a fast, versatile, and scalable memory-based big data computing engine.
 As a one-stop solution, Apache Spark integrates batch processing, real-time
streaming, interactive query, graph programming, and machine learning.

4 Huawei Confidential
Application Scenarios of Spark
 Batch processing can be used for extracting, transforming, and loading (ETL).
 Machine learning can be used to automatically determine whether eBay buyers'
comments are positive or negative.
 Interactive analysis can be used to query the Hive warehouse.
 Streaming processing can be used for real-time businesses such as page-click
stream analysis, recommendation systems, and public opinion analysis.

5 Huawei Confidential
Highlights of Spark

 Lightweight: the Spark core code has only about 30,000 lines.
 Fast: latency for small datasets reaches the sub-second level.
 Flexible: Spark offers different levels of flexibility.
 Smart: Spark smartly reuses existing big data components.

6 Huawei Confidential
Spark and MapReduce

Item                  Hadoop        Spark         Spark
Data size             102.5 TB      102 TB        1,000 TB
Consumed time (min)   72            23            234
Nodes                 2,100         206           190
Cores                 50,400        6,592         6,080
Rate                  1.4 TB/min    4.27 TB/min   4.27 TB/min
Rate/node             0.67 GB/min   20.7 GB/min   22.5 GB/min
Daytona GraySort      Yes           Yes           Yes

7 Huawei Confidential
Contents

1. Spark Overview

2. Spark Data Structure

3. Spark Principles and Architecture

8 Huawei Confidential
RDD - Core Concept of Spark
 Resilient Distributed Datasets (RDDs) are elastic, read-only, and partitioned distributed datasets.
 RDDs are stored in the memory by default and are written to disks when the memory is
insufficient.
 RDD data is stored in clusters as partitions.
 RDDs have a lineage mechanism, which allows for rapid data recovery when data loss occurs.

Figure: RDD data flow. Text lines such as "Hello Spark", "Hello Hadoop", and "China Mobile" are read from HDFS into RDD1 in the Spark cluster, transformed into RDD2, and written to external storage.

9 Huawei Confidential
RDD Dependencies

Narrow dependencies: map, filter, union, and join with co-partitioned inputs.
Wide dependencies: groupByKey and join with inputs that are not co-partitioned.

10 Huawei Confidential
Differences Between Wide and Narrow Dependencies
- Operator
 Narrow dependency: each partition of a parent RDD is used by at most one
partition of the child RDD, for example, map, filter, and union.
 Wide dependency: multiple child RDD partitions depend on the same parent RDD
partition, for example, groupByKey, reduceByKey, and sortByKey.
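
A minimal sketch of the two dependency types (the data and variable names are illustrative; a SparkContext named sc is assumed, as in spark-shell):

// map is a narrow dependency: each output partition depends on exactly one input partition.
val pairs = sc.parallelize(Seq("spark", "hadoop", "spark")).map(word => (word, 1))
// reduceByKey is a wide dependency: a shuffle moves all values of a key to the same partition.
val counts = pairs.reduceByKey(_ + _)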

11 Huawei Confidential
Differences Between Wide and Narrow Dependencies
- Fault Tolerance
 If a node is faulty:
 Narrow dependency: Only the parent RDD partition corresponding to the child RDD partition
needs to be recalculated.
 Wide dependency: In extreme cases, all parent RDD partitions need to be recalculated.

 For example, if child partition b1 (produced through a wide dependency) is lost, all of its
parent partitions a1, a2, and a3 may need to be recalculated.

12 Huawei Confidential
Differences Between Wide and Narrow Dependencies
- Data Transmission
 A wide dependency usually corresponds to a shuffle operation. During execution,
a partition of the parent RDD must be transferred to different child RDD
partitions, which may involve data transmission between multiple nodes.
 With a narrow dependency, each parent RDD partition is transferred to only one
child RDD partition, so the transformation can usually be completed on a single
node.

13 Huawei Confidential
Stage Division of RDD

Figure: stage division of an RDD DAG. Stages are split at wide-dependency (shuffle) boundaries. Stage 1: A → B (groupBy). Stage 2: C → D (map), then D and E → F (union). Stage 3: B and F → G (join).

14 Huawei Confidential
RDD Operation Type
 Spark operations can be classified into creation, transformation, control, and action operations.
 Creation operation: creates an RDD. An RDD is created from an in-memory collection, from an
external storage system, or by a transformation operation.
 Transformation operation: transforms an RDD into a new RDD. RDD transformations are lazy:
they only define the new RDD but do not execute it immediately.
 Control operation: persists an RDD. An RDD can be stored in memory or on disk depending on
the storage policy; for example, the cache API caches the RDD in memory by default.
 Action operation: triggers Spark execution. Action operations fall into two types: those that
output the calculation result and those that save the RDD to an external file system or database.

15 Huawei Confidential
Creation Operation
 Currently, there are two types of basic RDDs:
 Parallelized collection: an existing collection in the driver program is parallelized so that it can
be computed in parallel.
 External storage: an RDD is created from the records of a file. The file system must be HDFS or
any storage system supported by Hadoop.
 Both types of RDDs can be operated on in the same way to derive child RDDs and
form a lineage graph.
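
A minimal sketch of the two creation paths (the file path is an assumption; a SparkContext named sc is assumed):

val listRdd = sc.parallelize(Seq(1, 2, 3, 4, 5))                   // parallelized collection
val fileRdd = sc.textFile("hdfs://namenode:8020/data/input.txt")   // external storage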

16 Huawei Confidential
Control Operation
 Spark can persist an RDD in memory or in the disk file system. Keeping RDDs in
memory greatly improves iterative computing and data sharing between
computations. Generally, 60% of an executor's memory is used to cache data and
the remaining 40% is used to run tasks. In Spark, the persist and cache operations
are used for data persistence; cache is a special case of persist().
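
A short sketch of the control operations (the RDD names are illustrative and assumed to exist from the previous sketch):

import org.apache.spark.storage.StorageLevel
fileRdd.cache()                                  // same as persist(StorageLevel.MEMORY_ONLY)
listRdd.persist(StorageLevel.MEMORY_AND_DISK)    // spill partitions to disk when memory is insufficient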

17 Huawei Confidential
Transformation Operation - Transformation Operator

Transformation: Description
map(func): applies func to each element of the RDD that invokes map and returns a new RDD.
filter(func): applies func to each element of the RDD that invokes filter and returns a new RDD containing only the elements for which func returns true.
reduceByKey(func, [numTasks]): similar to groupByKey, but the values of each key are aggregated with the provided func to obtain a new value.
join(otherDataset, [numTasks]): for a dataset of type (K, V) joined with a dataset of type (K, W), returns (K, (V, W)). leftOuterJoin, rightOuterJoin, and fullOuterJoin are also supported.
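
A minimal sketch of these transformations (the data is illustrative; a SparkContext named sc is assumed):

val lines  = sc.parallelize(Seq("an apple", "a pair of shoes"))
val words  = lines.flatMap(_.split(" "))
val pairs  = words.map(word => (word, 1))                            // map
val short  = words.filter(_.length <= 2)                             // filter
val counts = pairs.reduceByKey(_ + _)                                // reduceByKey
val joined = counts.join(sc.parallelize(Seq(("apple", "fruit"))))    // join: (K, (V, W))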

18 Huawei Confidential
Action Operation - Action Operator

Action: Description
reduce(func): aggregates the elements of the dataset using func.
collect(): returns the elements of the dataset (for example, a filtered result that is small enough) to the driver as an array.
count(): returns the number of elements in the RDD.
first(): returns the first element of the dataset.
take(n): returns the first n elements of the dataset as an array.
saveAsTextFile(path): writes the dataset to a text file in HDFS or another file system; Spark converts each record into a line and writes it to the file.
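
A short sketch of these actions, continuing from the counts RDD above (the output path is an assumption):

val numWords = counts.count()                      // number of distinct words
val top3     = counts.take(3)                      // first 3 (word, count) pairs as an array
val total    = counts.map(_._2).reduce(_ + _)      // sum of all counts
counts.saveAsTextFile("hdfs://namenode:8020/data/output")   // triggers job execution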

19 Huawei Confidential
DataFrame
 Similar to an RDD, a DataFrame is an immutable, resilient, distributed dataset.
In addition to the data, it records the structure of the data, known as the
schema, which makes it similar to a two-dimensional table.
 The query plan of a DataFrame can be optimized by the Spark Catalyst
optimizer. This makes DataFrames easy to use without sacrificing execution
efficiency: programs written with the DataFrame API are converted into efficient
physical plans for execution.
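
A minimal DataFrame sketch (the file path and column names are assumptions; a SparkSession named spark is assumed):

import org.apache.spark.sql.functions.col
val people = spark.read.json("hdfs://namenode:8020/data/people.json")
people.printSchema()                                   // the schema recorded with the data
people.filter(col("age") > 21).select("name").show()   // query optimized by Catalyst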

20 Huawei Confidential
DataSet
 A DataFrame is a special case of a Dataset (DataFrame = Dataset[Row]), so the
as method can be used to convert a DataFrame into a Dataset. Row is a generic
type; all table structure information is represented by Row.
 A Dataset is a strongly typed collection, for example, Dataset[Car] or
Dataset[Person].
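
A minimal sketch of the conversion (the Person case class and the people DataFrame from the previous sketch are assumptions):

case class Person(name: String, age: Long)
import spark.implicits._
val peopleDs = people.as[Person]       // DataFrame -> Dataset[Person] via as
peopleDs.map(p => p.name).show()       // field access is checked at compile time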

21 Huawei Confidential
Differences Between DataFrame, DataSet, and RDD
 Assume that an RDD contains two lines of data, displayed as raw strings:
1, Allen, 23
2, Bobby, 35

 In a DataFrame, the same data is displayed as a table with named columns, for example ID, Name, and Age:
ID  Name   Age
1   Allen  23
2   Bobby  35

 In a DataSet, each row is a strongly typed object carrying both the values and their structure, for example:
Person(1, Allen, 23)
Person(2, Bobby, 35)

22 Huawei Confidential
Contents

1. Spark Overview

2. Spark Data Structure

3. Spark Principles and Architecture


 Spark Core
 Spark SQL and DataSet
 Spark Structured Streaming
 Spark Streaming

24 Huawei Confidential
Spark System Architecture

Figure: Spark system architecture. Spark SQL, Structured Streaming, Spark Streaming, MLlib, GraphX, and SparkR run on top of Spark Core, which can be deployed on the Standalone, YARN, or Mesos resource managers. The figure distinguishes functions that already existed in Spark 1.0 from functions added in Spark 2.0.

25 Huawei Confidential
Typical Case - WordCount

Figure: WordCount data flow (textFile → flatMap → map → reduceByKey → saveAsTextFile). Lines such as "An apple", "A pair of shoes", and "Orange apple" are read from HDFS into an RDD; flatMap splits each line into words; map converts each word into a (word, 1) pair; reduceByKey adds up the counts of the same word, for example (apple, 2); saveAsTextFile writes the result back to HDFS.

26 Huawei Confidential
WordCount
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark application name.
    val conf = new SparkConf().setAppName("WordCount")
    // Create a SparkContext object; the application name is WordCount.
    val sc: SparkContext = new SparkContext(conf)
    // Load the text file on HDFS to obtain the RDD dataset.
    val textFile = sc.textFile("hdfs://...")
    // Invoke RDD transformations: split the text by spaces, set the occurrence
    // frequency of each word to 1, and then combine the counts by key.
    // These transformations are performed on the executors.
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // Invoke the saveAsTextFile action and save the result.
    // This line triggers the actual job execution.
    counts.saveAsTextFile("hdfs://...")
  }
}

27 Huawei Confidential
Contents

1. Spark Overview

2. Spark Data Structure

3. Spark Principles and Architecture


 Spark Core
 Spark SQL and DataSet
 Spark Structured Streaming
 Spark Streaming

28 Huawei Confidential
Spark SQL Overview
 Spark SQL is the module used in Spark for structured data processing. In Spark
applications, you can seamlessly use SQL statements or DataFrame APIs to
query structured data.

Figure: Catalyst optimization pipeline. A SQL AST, DataFrame, or Dataset is first turned into an unresolved logical plan; analysis against the catalog produces a logical plan; logical optimization produces an optimized logical plan; several physical plans are generated and a cost model selects one; code generation finally produces the RDD operations that are executed.
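
As a minimal sketch (the table, file path, and column names are assumptions; a SparkSession named spark is assumed), the same query can be written as a SQL statement or with the DataFrame API, and both go through the Catalyst pipeline above:

import org.apache.spark.sql.functions.col
val people = spark.read.json("hdfs://namenode:8020/data/people.json")
people.createOrReplaceTempView("people")
val bySql = spark.sql("SELECT name FROM people WHERE age > 21")    // SQL statement
val byApi = people.filter(col("age") > 21).select("name")          // DataFrame API
bySql.show()
byApi.show()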

29 Huawei Confidential
Spark SQL and Hive
 Differences:
 The execution engine of Spark SQL is Spark Core, and the default execution engine
of Hive is MapReduce.
 The execution speed of Spark SQL is 10 to 100 times faster than that of Hive.
 Spark SQL does not support buckets, but Hive does.
 Relationships:
 Spark SQL depends on the metadata of Hive.
 Spark SQL is compatible with most syntax and functions of Hive.
 Spark SQL can use the custom functions of Hive.
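
A hedged illustration of the relationship (the table name is an assumption): enabling Hive support lets Spark SQL reuse the Hive metastore and query existing Hive tables.

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("SparkSqlOnHive")
  .enableHiveSupport()        // reuse Hive metadata (metastore)
  .getOrCreate()
spark.sql("SELECT COUNT(*) FROM default.web_logs").show()   // hypothetical Hive table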

30 Huawei Confidential
Contents

1. Spark Overview

2. Spark Data Structure

3. Spark Principles and Architecture


 Spark Core
 Spark SQL and DataSet
 Spark Structured Streaming
 Spark Streaming

31 Huawei Confidential
Structured Streaming Overview (1)
 Structured Streaming is a stream-processing engine built on the Spark SQL
engine. You can write a streaming computation the same way you would write a
batch computation on static data. As streaming data keeps arriving, Spark SQL
processes it incrementally and continuously and updates the results in the
result set.

32 Huawei Confidential
Structured Streaming Overview (2)

Figure: a data stream viewed as an unbounded table. New data arriving in the stream corresponds to new rows being appended to the unbounded table.

33 Huawei Confidential
Example Programming Model of Structured Streaming
Figure: word count over a stream. Input arrives at T=1 ("cat dog", "dog dog"), T=2 ("owl cat"), and T=3 ("dog owl") and is appended to the unbounded input table. The result table is updated incrementally: at T=1 it contains cat 1, dog 3; at T=2 it contains cat 2, dog 3, owl 1; at T=3 it contains cat 2, dog 4, owl 2.
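
A minimal Structured Streaming sketch of this word count (the socket host and port are assumptions for the example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StructuredWordCount").getOrCreate()
import spark.implicits._
// Read lines from a socket source (host and port are illustrative).
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()
// Split the lines into words and count them incrementally.
val words  = lines.as[String].flatMap(_.split(" "))
val counts = words.groupBy("value").count()
// Write the continuously updated result table to the console.
val query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()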

34 Huawei Confidential
Contents

1. Spark Overview

2. Spark Data Structure

3. Spark Principles and Architecture


 Spark Core
 Spark SQL and DataSet
 Spark Structured Streaming
 Spark Streaming

35 Huawei Confidential
Spark Streaming Overview
 The basic principle of Spark Streaming is to split real-time input data streams
into time slices (measured in seconds) and then use the Spark engine to process
the data of each time slice in a way similar to batch processing.

Figure: an input data stream is divided by Spark Streaming into batches of input data, and the Spark engine processes them into batches of processed data.

36 Huawei Confidential
Window Interval and Sliding Interval

Figure: a window slides over the original DStream. The RDDs that fall within the window at times 1, 3, and 5 are merged by the window-based operation into the RDDs of the windowed DStream.

 The window slides over the DStream. The RDDs that fall within the window are merged and
operated on to generate a window-based RDD.
 Window length: the duration of a window.
 Sliding interval: the interval at which the window operation is performed.
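
A minimal sketch of a windowed word count with Spark Streaming (the batch interval, window length, sliding interval, host, and port are illustrative values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf  = new SparkConf().setAppName("WindowedWordCount")
val ssc   = new StreamingContext(conf, Seconds(1))             // 1-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
// Count words over a 30-second window that slides every 10 seconds.
val windowedCounts = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
windowedCounts.print()
ssc.start()
ssc.awaitTermination()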
37 Huawei Confidential
Comparison Between Spark Streaming and Storm

 Real-time computing model: Storm is real time (each piece of data is processed as soon as it
is received); Spark Streaming is quasi real time (data within a time range is collected and
processed as an RDD).
 Real-time computing delay: milliseconds for Storm; seconds for Spark Streaming.
 Throughput: low for Storm; high for Spark Streaming.
 Transaction mechanism: supported and mature in Storm; supported but not yet mature in
Spark Streaming.
 Fault tolerance: high in Storm (ZooKeeper and Acker); normal in Spark Streaming (checkpoint
and WAL).
 Dynamic adjustment of parallelism: supported by Storm; not supported by Spark Streaming.

38 Huawei Confidential
Summary

 This course describes the basic concepts and technical architecture of Spark,
including Spark SQL, Structured Streaming, and Spark Streaming.

39 Huawei Confidential
Quiz

1. What are the features of Spark?

2. What are the advantages of Spark in comparison with MapReduce?

3. What are the differences between wide dependencies and narrow dependencies
of Spark?

4. What are the application scenarios of Spark?

40 Huawei Confidential
Quiz

5. RDD operators are classified into: ________ and _________.

6. The ___________ module is the core module of Spark.

7. RDD dependency types include ___________ and ___________.

41 Huawei Confidential
Recommendations

 Huawei Cloud Official Web Link:


 https://www.huaweicloud.com/intl/en-us/
 Huawei MRS Documentation:
 https://www.huaweicloud.com/intl/en-us/product/mrs.html
 Huawei TALENT ONLINE:
 https://e.huawei.com/en/talent/#/

42 Huawei Confidential
Thank you.

Bring digital to every person, home, and
organization for a fully connected,
intelligent world.

Copyright © 2020 Huawei Technologies Co., Ltd. All Rights Reserved.

The information in this document may contain predictive


statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
