Big Data (Spark)
Instructor: Thanh Binh Nguyen
September 1st, 2019
S3Lab - Smart Software System Laboratory
“Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems
Introduction
● Apache Spark is an open-source cluster computing framework for real-time processing.
● Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
● Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming.
Applications
(figure)
Components
● Spark Core and Resilient Distributed Datasets (RDDs)
● Spark SQL
● Spark Streaming
● Machine Learning Library (MLlib)
● GraphX
Components
Spark Core
● The base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Further, additional libraries built atop the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for:
○ Memory management and fault recovery
○ Scheduling, distributing and monitoring jobs on a cluster
○ Interacting with storage systems
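A minimal sketch of the core RDD API from a spark-shell session (the shell predefines sc as the SparkContext); the numbers and partition count are invented for illustration:

    // spark-shell predefines sc: org.apache.spark.SparkContext
    val nums = sc.parallelize(1 to 1000000, numSlices = 8)  // distribute a local collection across 8 partitions
    val evens = nums.filter(_ % 2 == 0)                     // transformation: recorded lazily, not yet executed
    val total = evens.reduce(_ + _)                         // action: triggers distributed execution on the cluster
    println(total)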
Components
Spark Streaming
● Used to process real-time streaming data.
● It enables high-throughput, fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is basically a series of RDDs (Resilient Distributed Datasets) used to process the real-time data.
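As an illustrative sketch (not from the slides), a DStream word count over a socket source, assuming a spark-shell session and some process writing text lines to localhost:9999:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))        // micro-batches every 5 seconds
    val lines = ssc.socketTextStream("localhost", 9999)   // DStream of lines from a TCP source (hypothetical)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()                                        // output operation: print each batch's counts
    ssc.start()                                           // start receiving and processing
    ssc.awaitTermination()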
Components
Spark Streaming - Workflow
(figure)
Components
Spark Streaming - Fundamentals
● Streaming Context
● DStream (Discretized Stream)
● Caching
● Accumulators, Broadcast Variables and Checkpoints
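A brief sketch (assumed, not from the slides) of accumulators and broadcast variables, using spark-shell's predefined sc; the file path is the hypothetical one reused from the word-count demo slide:

    val stopWordHits = sc.longAccumulator("stopWordHits")  // accumulator: tasks add to it, the driver reads the total
    val stopWords = sc.broadcast(Set("a", "an", "the"))    // broadcast variable: read-only data shipped once to each executor

    val words = sc.textFile("hdfs:///user/cloudera/input/wordcount.txt").flatMap(_.split(" "))
    words.foreach { w => if (stopWords.value.contains(w)) stopWordHits.add(1) }  // action; accumulator updated on executors
    println(stopWordHits.value)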
Components
Spark Streaming - Fundamentals
(figures)
Components
Spark SQL
● Spark SQL integrates relational processing with Spark's functional programming. It provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool. The following are the four libraries of Spark SQL:
○ Data Source API
○ DataFrame API
○ Interpreter & Optimizer
○ SQL Service
Components
Spark SQL - Data Source API
● Universal API for loading and storing structured data.
○ Built-in support for Hive, JSON, Avro, JDBC, Parquet, etc.
○ Supports third-party integration through Spark packages
○ Support for smart sources
○ Data abstraction and a Domain Specific Language (DSL) applicable to structured and semi-structured data
○ Supports different data formats (Avro, CSV, Elasticsearch and Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.)
○ Can be easily integrated with all Big Data tools and frameworks via Spark Core
○ Processes data ranging from kilobytes to petabytes, on anything from a single-node cluster to multi-node clusters
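A hedged sketch of the unified load/save interface, again in spark-shell where spark is the predefined SparkSession; the paths and JDBC connection details are placeholders:

    val people = spark.read.json("hdfs:///data/people.json")          // hypothetical input path
    people.write.mode("overwrite").parquet("hdfs:///data/people_pq")  // same API, different output format

    val orders = spark.read.format("jdbc")                            // JDBC source; settings are illustrative only
      .option("url", "jdbc:mysql://dbhost:3306/sales")
      .option("dbtable", "orders")
      .option("user", "report").option("password", "secret")
      .load()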
Components
Spark SQL - DataFrame API
● A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a relational table in SQL (see the sketch below).
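A small sketch (assumed, not from the slides) of building a DataFrame from a local collection and querying it with the column DSL; the Employee schema and values are invented:

    import spark.implicits._
    case class Employee(name: String, dept: String, salary: Double)   // hypothetical schema

    val df = Seq(
      Employee("An", "IT", 1200.0),
      Employee("Binh", "HR", 900.0),
      Employee("Chi", "IT", 1500.0)
    ).toDF()

    df.groupBy("dept").avg("salary").show()   // named-column operations, much like a relational table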
Components
Spark SQL - SQL Interpreter and Optimizer
● Based on functional programming, constructed in Scala.
○ Provides a general framework for transforming trees, which is used to perform analysis/evaluation, optimization, planning, and runtime code generation.
○ Supports cost-based optimization (run time and resource utilization are termed the cost) and rule-based optimization, making queries run much faster than their RDD (Resilient Distributed Dataset) counterparts.
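One way to see the optimizer at work (a sketch, reusing the hypothetical df from the DataFrame example above):

    df.filter("salary > 1000").select("name").explain(true)
    // prints the parsed, analyzed and optimized logical plans plus the physical plan chosen by the optimizer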
Components
Spark SQL - SQL Service
● The SQL Service is the entry point for working with structured data in Spark. It allows the creation of DataFrame objects as well as the execution of SQL queries.
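A sketch of executing an SQL query over a registered view (names are illustrative, reusing the hypothetical df from above), with spark as the entry point:

    df.createOrReplaceTempView("employees")   // expose the DataFrame to SQL
    val byDept = spark.sql("SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept")
    byDept.show()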
Components
Spark SQL - Features
● Integration With Spark
● Uniform Data Access
● Hive Compatibility
● Standard Connectivity
● Performance And Scalability
● User Defined Functions
Components
GraphX
● GraphX is the Spark API for graphs and graph-parallel computation. It extends the Spark RDD with a Resilient Distributed Property Graph: a directed multigraph which can have multiple edges in parallel. Every edge and vertex has user-defined properties associated with it. The parallel edges allow multiple relationships between the same vertices.
Components
GraphX
● GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and mapReduceTriplets) as well as an optimized variant of the Pregel API.
● In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
● GraphX unifies the ETL (Extract, Transform & Load) process, exploratory analysis and iterative graph computation within a single system.
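A minimal property-graph sketch runnable in spark-shell; the vertex labels and edge weights are invented for illustration:

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq(
      (1L, "SGN"), (2L, "HAN"), (3L, "DAD")))                    // (vertexId, vertex property)
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 1160), Edge(1L, 3L, 600), Edge(3L, 2L, 630))) // edge property: made-up distance in km

    val graph = Graph(vertices, edges)
    val ranks = graph.pageRank(0.001).vertices                   // built-in algorithm from the GraphX library
    ranks.collect().foreach(println)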
Components
GraphX - Use Cases
● Disaster Detection System
● Page Rank
● Financial Fraud Detection
● Business Analysis
○ Machine Learning, understanding customer purchase trends
● Google Pregel
Components
GraphX - Graph and Examples
(figure)
Components
GraphX - Flight Data Analysis using Spark GraphX
(figures)
Components
MLlib
● MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.
Components
MLlib - Algorithms
● Basic Statistics: Summary, Correlation, Stratified Sampling, Hypothesis Testing, Random Data Generation
● Regression
● Classification
● Recommendation System: Collaborative, Content-Based
● Clustering
● Dimensionality Reduction: Feature Selection, Feature Extraction
● Feature Extraction
● Optimization
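As one illustrative sketch (not from the slides), k-means clustering with the DataFrame-based API in spark-shell; the feature values are invented:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors

    // toy two-dimensional points, invented for illustration
    val points = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
      Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.8)
    ).map(Tuple1.apply)
    val data = spark.createDataFrame(points).toDF("features")

    val model = new KMeans().setK(2).setSeed(1L).fit(data)   // expect two clusters in this toy data
    model.clusterCenters.foreach(println)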
Components
MLlib - Use Case: Earthquake Detection System
(figure)
Components
MLlib - Use Case: Movie Recommendation System
(figures)
Challenges of Distributed Computing
● How to divide the input data
● How to assign the divided data to machines in the cluster
● How to check and monitor whether a machine in the cluster is live and has the resources to perform its duty
● How to retry or reassign failed chunks to another machine or worker
Challenges of Distributed Computing
● If the computation involves an aggregation operation such as a sum, how to collate results from many workers and compute the aggregate
● Efficient use of memory, CPU and network
● Monitoring the tasks
● Overall job coordination
● Keeping a global time
Use Cases
● ETL
● Analytics
● Machine Learning
● Graph processing
● SQL queries on large data sets
● Batch processing
● Stream processing
Features
● Multiple language support, namely Java, R, Scala and Python, for building applications. Spark provides high-level APIs in Java, Scala, Python and R, and Spark code can be written in any of these four languages. It provides shells for Scala and Python: the Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark from the installation directory.
Features
● Fast data processing: roughly 10 times faster than MapReduce on disk and up to 100 times faster in memory. Spark is able to achieve this speed through controlled partitioning: it manages data using partitions that help parallelize distributed data processing with minimal network traffic.
● With the RDD abstraction, Spark provides fault tolerance with ensured zero data loss.
Features
● Increased system efficiency due to lazy evaluation of RDD transformations: Apache Spark delays evaluation until it is absolutely necessary, which is one of the key factors contributing to its speed. Transformations are added to a DAG (Directed Acyclic Graph) of computation, and only when the driver requests some data does this DAG actually get executed.
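A tiny sketch of this behaviour in spark-shell (the file path is hypothetical):

    val lines = sc.textFile("hdfs:///data/events.log")    // nothing is read yet
    val errors = lines.filter(_.contains("ERROR"))        // still nothing executed; the DAG only grows
    errors.count()                                        // action: the DAG is now scheduled and run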
Features
● In-memory processing, resulting in high computation speed and an acyclic data flow
● With 80+ high-level operators it is easy to develop dynamic and parallel applications
● Real-time data stream processing with Spark Streaming
Features
● Flexible: Spark can run independently and can also be integrated with the Hadoop YARN cluster manager. Apache Spark provides smooth compatibility with Hadoop, which is a boon for all the Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, and it has the ability to run on top of an existing Hadoop cluster using YARN for resource scheduling.
Features
● Cost efficient for Big Data, with minimal need for storage and data centers
● Futuristic analysis with built-in tools for machine learning (MLlib), interactive queries & data streaming
● Persistence and immutability by nature, with data-parallel processing over the cluster
● GraphX simplifies graph analytics by collecting algorithms and builders
● Code reuse for batch processing and for running ad-hoc queries
● A progressive and expanding Apache community, active for quick assistance
Features
● Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra, apart from the usual formats such as text files, CSV and RDBMS tables. The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.
Spark Built on Hadoop
(figure)
RDD - Resilient Distributed Dataset
● A Resilient Distributed Dataset is the fundamental data-structure abstraction of Spark.
● An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. For instance, you can create an RDD of integers, and these get partitioned, divided and assigned to various nodes in the cluster for parallel processing.
● RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
RDD - Resilient Distributed Dataset
Features
● Resilient - fault tolerant: able to detect and recompute missing or damaged partitions of an RDD caused by node or network failures.
● Distributed - data is partitioned and resides on multiple nodes, depending on the cluster size, type and configuration.
● In-Memory Data Structure - RDDs are mostly kept in memory, so iterative operations run faster and perform far better than traditional Hadoop programs when executing iterative algorithms.
RDD - Resilient Distributed Dataset
Features
● Dataset - it can represent any form of data, whether read from a CSV file, loaded from an RDBMS table, or taken from a text, JSON or XML file.
RDD - Resilient Distributed Dataset
Workflow (see the sketch after this list)
● Create RDDs from:
○ An existing collection in your driver program.
○ A dataset in an external storage system.
● Two types of operations:
○ Transformations: create a new RDD.
○ Actions: applied on an RDD to instruct Spark to run a computation and pass the result back to the driver.
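A brief sketch of both creation paths and the two operation types in spark-shell (the HDFS path is the hypothetical one from the word-count demo slide):

    val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))                     // from a collection in the driver program
    val fromStorage = sc.textFile("hdfs:///user/cloudera/input/wordcount.txt")  // from an external storage system

    val doubled = fromCollection.map(_ * 2)   // transformation: returns a new RDD, nothing runs yet
    doubled.collect()                         // action: computes on the cluster and returns Array(2, 4, 6, 8, 10) to the driver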
RDD - Resilient Distributed Dataset
Partition and Parallelism
● Partition - a logical chunk of a large distributed data set. By default, Spark tries to read data into an RDD from the nodes that are close to it.
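A quick sketch of inspecting and changing partitioning in spark-shell:

    val rdd = sc.parallelize(1 to 100, numSlices = 4)   // explicitly request 4 partitions
    println(rdd.getNumPartitions)                       // 4
    val wider = rdd.repartition(8)                      // reshuffle into 8 partitions for more parallelism
    println(wider.getNumPartitions)                     // 8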
Spark Program
Architecture
(figures)
Spark Program
Directed Acyclic Graph (DAG) Visualization
(figure)
Spark Program
Word count demo
(figure)
Spark Program
Word count demo
● The web UI port for Spark is 4040.
● The HDFS web UI port is 50070.
Spark Program
Word count demo
● Create a text file and store it in an HDFS directory.
● Create an RDD with the transformations flatMap() and map():
○ scala> var map = sc.textFile("/user/cloudera/input/wordcount.txt").flatMap(line => line.split(" ")).map(word => (word, 1));
● Apply reduceByKey() to the created RDD (the action that triggers the job is shown below):
○ scala> var counts = map.reduceByKey(_ + _);
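Note that reduceByKey() is itself a transformation; an action is still needed to run the job and see the counts. A hedged continuation of the demo (the output path is hypothetical):

    scala> counts.collect().foreach(println)                          // action: run the DAG and print (word, count) pairs
    scala> counts.saveAsTextFile("/user/cloudera/output/wordcount")   // or write the results back to HDFS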
Q&A
Thank you for your attention.
We hope to reach success together.