AIO2024 Module02 Extra SQL Big Data
All-in-One Course
(TA Session)
Big Data
Extra
Dinh-Thang Duong – TA
Truong-Binh Duong – STA
Year
AI VIETNAM
Outline
Introduction
Hadoop
Spark
Question
Getting Started
❖ Objectives
Introduction
❖ Getting Started
UNIT VALUE
bit 1 bit
❖ Big Data Characteristics
3V of Big Data: Volume, Velocity, Variety
❖ Big Data Applications
Healthcare, Education, Retail, E-commerce
❖ Tools for Big Data
Hadoop
❖ Introduction
Apache Hadoop: an open-source framework for efficiently storing and processing large datasets, ranging in size from gigabytes to petabytes.
❖ Hadoop Ecosystem
❖ Hadoop Ecosystem: Hadoop HDFS
❖ Hadoop Ecosystem: Hadoop YARN
❖ Hadoop Ecosystem: Hadoop MapReduce
❖ Hadoop Architecture
[Figure: Hadoop architecture with one master node and two slave nodes. MapReduce layer: Job Tracker and Task Trackers. HDFS layer: Name Node and Data Nodes.]
❖ Hadoop HDFS
https://www.javatpoint.com/hdfs
❖ Hadoop MapReduce
https://www.analyticsvidhya.com/blog/2022/07/learn-everything-about-mapreduce-architecture-and-its-components/
❖ Hadoop MapReduce: WordCount Example
https://www.guru99.com/introduction-to-mapreduce.html
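The WordCount idea can be sketched in plain Python to show the three MapReduce stages (map, shuffle, reduce) without a cluster. This is only an illustrative sketch: the function names and the sample lines are hypothetical and are not taken from the slides or from the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted counts by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the grouped counts of each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical input lines, chosen only for illustration.
sample_lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
counts = word_count(sample_lines)
print(counts)
```

In a real Hadoop job the map and reduce functions run on different nodes and the framework performs the shuffle; here everything runs in one process to keep the data flow visible.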
❖ Installation
Hadoop Single Node (Hadoop pseudo-distributed mode): a configuration option that lets us run Hadoop on a single machine.
❖ Installation: Requirements
1. Virtual Machine
2. CentOS
3. Hadoop
Note: The following installation demo runs on a Mac M1. Choose the versions that match your own computer.
❖ Virtual Machine
[Figure: VMware Fusion interface]
❖ Installation: CentOS
❖ Installation: Run CentOS in VMware
❖ Installation: Setup CentOS
❖ Installation: Hadoop Requirements
❖ Installation: Java
Install Java 8 on CentOS:
1. sudo yum -y update
2. sudo yum install -y java-1.8.0-openjdk
❖ Installation: Hadoop
❖ Installation: Config Hadoop
1. Check Java configuration
2. Open the ~/.bashrc script
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.362.b09-4.el9.aarch64
export HADOOP_HOME=/home/thangduong/hadoop
# HADOOP_INSTALL is referenced by the lines below, so define it from HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
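After reloading ~/.bashrc, it can help to confirm that the variables are actually visible to new processes. The sketch below is one possible check, not part of the original demo: the helper names are hypothetical, and the list of variables is simply copied from the exports above.

```python
import os

# Variable names copied from the ~/.bashrc exports in the slides.
REQUIRED_VARS = [
    "JAVA_HOME",
    "HADOOP_HOME",
    "HADOOP_MAPRED_HOME",
    "HADOOP_COMMON_HOME",
    "HADOOP_HDFS_HOME",
    "YARN_HOME",
]


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


def hadoop_bin_on_path(env):
    """Return True if $HADOOP_HOME/bin is one of the PATH entries."""
    hadoop_home = env.get("HADOOP_HOME", "")
    path_entries = env.get("PATH", "").split(os.pathsep)
    return bool(hadoop_home) and os.path.join(hadoop_home, "bin") in path_entries


if __name__ == "__main__":
    # Inspect the environment of the current shell session.
    print("Missing variables:", missing_vars(os.environ))
    print("hadoop/bin on PATH:", hadoop_bin_on_path(os.environ))
```

Both helpers take the environment as a plain mapping, so they can be tried on a hand-written dictionary before running them against os.environ.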
5. Config core-site.xml
Paste the code below:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
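Every Hadoop *-site.xml file follows the same shape: a configuration element containing property elements, each with a name and a value. As a rough sketch of that structure, the snippet below parses the core-site.xml content into a dictionary using Python's standard library. The function name parse_hadoop_config is hypothetical; it is not a Hadoop tool.

```python
import xml.etree.ElementTree as ET

# Same content as the core-site.xml shown in the slides.
CORE_SITE_XML = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>"""


def parse_hadoop_config(xml_text):
    """Return a {name: value} dict for every <property> in the XML text."""
    root = ET.fromstring(xml_text)
    settings = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            settings[name.strip()] = (value or "").strip()
    return settings


settings = parse_hadoop_config(CORE_SITE_XML)
print(settings)
```

A check like this can catch a mistyped tag or a missing closing element before the Hadoop daemons are started.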
6. Config yarn-site.xml
Paste the code below:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
7. Config mapred-site.xml
Copy and edit a new mapred-site.xml. Paste the code below:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
8.1. Create namenode and datanode folders
8.2. Config hdfs-site.xml
Paste the code below:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/thangduong/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/thangduong/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
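The folders from step 8.1 must exist at the paths that hdfs-site.xml points to. The slides create them by hand; as an alternative sketch, the snippet below creates the same hadoop_store/hdfs/namenode and hadoop_store/hdfs/datanode layout under a chosen base directory. The function names are hypothetical, and the demo runs in a temporary directory rather than in a real home folder.

```python
import tempfile
from pathlib import Path


def create_hdfs_dirs(base_dir):
    """Create hadoop_store/hdfs/{namenode,datanode} under base_dir."""
    created = []
    for name in ("namenode", "datanode"):
        path = Path(base_dir) / "hadoop_store" / "hdfs" / name
        # parents=True builds intermediate folders; exist_ok=True allows reruns.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


def to_file_uri(path):
    """Format a path the way the hdfs-site.xml values are written (file:/...)."""
    return f"file:{path}"


# Demo in a throwaway directory; pass your home directory for a real setup.
with tempfile.TemporaryDirectory() as tmp:
    dirs = create_hdfs_dirs(tmp)
    all_exist = all(d.is_dir() for d in dirs)
    uris = [to_file_uri(d) for d in dirs]
    print(uris)
```

The printed values show what would go into dfs.namenode.name.dir and dfs.datanode.data.dir for that base directory.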
❖ Installation: Check Hadoop
❖ Example: Search text
3. Create an input folder and move all .xml files into it
4. Perform grep with the string pattern "dfs[a-z.]+"
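To see what the pattern dfs[a-z.]+ picks up, the matching and counting can be imitated locally with Python's re module. This is only a single-process sketch of the idea behind the grep example (find every match, then count occurrences of each matched string); the sample lines are hypothetical and the function name grep_count is not part of Hadoop.

```python
import re
from collections import Counter

# "dfs" followed by one or more lowercase letters or dots.
PATTERN = re.compile(r"dfs[a-z.]+")


def grep_count(lines):
    """Count how often each string matching PATTERN occurs in the lines."""
    counts = Counter()
    for line in lines:
        counts.update(PATTERN.findall(line))
    return counts


# Hypothetical lines resembling the contents of the .xml config files.
sample_lines = [
    "<name>dfs.replication</name>",
    "<name>dfs.namenode.name.dir</name>",
    "<value>file:/home/user/hadoop_store/hdfs/namenode</value>",
    "<name>dfs.replication</name>",
]
counts = grep_count(sample_lines)
print(counts)
```

Note that the path containing hdfs/namenode does not match, because the character after "dfs" there is a slash, which is outside the [a-z.] class.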
❖ Example: Other default applications
QUIZ
Spark
❖ Introduction
Apache Spark: an open-source unified analytics engine for large-scale data processing. Spark is an alternative to Hadoop MapReduce.
❖ Why Spark?
❖ Spark Modules
❖ PySpark Introduction
PySpark: the Python API for Apache Spark. It lets you perform real-time, large-scale data processing in a distributed environment using Python.
❖ PySpark Installation
❖ PySpark Usage
❖ PySpark Usage: Create DataFrame
❖ PySpark Usage: Operations
select(): Choose the columns to show in the table
filter(): Extract the rows that satisfy a criterion
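A minimal sketch of these two operations on a small DataFrame is shown below. The column names and rows are hypothetical sample data, not taken from the slides, and the code assumes PySpark is installed (for example via pip install pyspark). The import sits inside the function so the file can be loaded even where PySpark is missing.

```python
# Hypothetical sample data, used only for illustration.
COLUMNS = ["name", "age", "city"]
ROWS = [
    ("An", 23, "Hanoi"),
    ("Binh", 31, "Hue"),
    ("Chi", 27, "Hanoi"),
]


def run_demo():
    """Create a DataFrame, then apply select() and filter() to it."""
    # Deferred import: requires the pyspark package at call time.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("select-filter-demo").getOrCreate()
    df = spark.createDataFrame(ROWS, schema=COLUMNS)

    # select(): keep only the listed columns.
    df.select("name", "age").show()

    # filter(): keep only the rows where the condition holds.
    df.filter(df.age > 25).show()

    # The two operations can be chained.
    df.filter(df.city == "Hanoi").select("name").show()

    spark.stop()
```

Calling run_demo() in an environment with PySpark prints the three resulting tables. Both select() and filter() return new DataFrames and leave df unchanged.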
❖ PySpark Example: Working with csv file
Description: Perform some basic Spark operations on the Flights.csv dataset.
Download link: here.
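As a starting point for this kind of exercise, the sketch below loads a CSV file into a DataFrame and prints a quick overview. It assumes PySpark is installed and that Flights.csv sits in the working directory with a header row. The function names are hypothetical, and no column names are assumed because the slides do not list them.

```python
def load_flights(spark, path="Flights.csv"):
    """Read the CSV file into a DataFrame.

    header=True uses the first row as column names;
    inferSchema=True asks Spark to guess column types from the data.
    """
    return spark.read.csv(path, header=True, inferSchema=True)


def explore(df):
    """Print the schema, the first rows, and the row count."""
    df.printSchema()
    df.show(5)
    print("Number of rows:", df.count())


def main(path="Flights.csv"):
    # Deferred import: requires the pyspark package at call time.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("flights-csv-demo").getOrCreate()
    explore(load_flights(spark, path))
    spark.stop()
```

Calling main() runs the whole flow. Once the schema is known, the select() and filter() operations can be applied to the real column names.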
Question
?