
AIO2024 Module02 Extra SQL Big Data


AI VIETNAM

All-in-One Course
(TA Session)

Big Data
Extra

Dinh-Thang Duong – TA
Truong-Binh Duong – STA

Outline
Introduction
Hadoop
Spark
Question

Getting Started
❖ Objectives

In this session, we will discuss:

- Introduction to big data.
- Introduction to Hadoop.
- How to install a virtual machine.
- How to install and use Hadoop on a virtual machine.
- Introduction to Spark and PySpark.


Introduction

❖ Getting Started
UNIT              VALUE
bit               1 bit
byte (B)          8 bits
kilobyte (KB)     1024 bytes
megabyte (MB)     1024 kilobytes
gigabyte (GB)     1024 megabytes
terabyte (TB)     1024 gigabytes
petabyte (PB)     1024 terabytes
exabyte (EB)      1024 petabytes
zettabyte (ZB)    1024 exabytes
yottabyte (YB)    1024 zettabytes
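
The unit table can be turned into a small conversion helper. This is only a sketch: the function names to_bytes and humanize are made up for illustration, and it assumes the 1024-based steps listed in the table.

```python
# Storage units from the table: each unit is 1024 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value: float, unit: str) -> int:
    """Convert a value in the given unit to bytes (1 KB = 1024 B)."""
    exponent = UNITS.index(unit.upper())
    return int(value * 1024 ** exponent)


def humanize(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1024."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


print(to_bytes(1, "GB"))       # 1073741824
print(humanize(5 * 1024**4))   # 5.00 TB
```

Keeping the units in an ordered list means the position of a unit doubles as its power of 1024, so no lookup table of multipliers is needed.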


❖ Big Data Definition

Big Data: A collection of datasets so large and complex that it is very difficult to process using traditional data processing applications.

❖ Big Data Characteristics

The 3 Vs of Big Data: Volume, Velocity, Variety.

❖ Big Data Characteristics

Volume: The amount of data.

[Figure: The amount of data generated on the Internet each year]

❖ Big Data Characteristics

Velocity: The speed at which data grows.

[Figure: Data generated on the Internet every minute]

❖ Big Data Characteristics

Variety: The diversity of data.

[Figure: The many types of data to be processed]

❖ Types of Big Data

- Structured Data
- Unstructured Data
- Semi-structured Data

❖ Big Data Applications

Big Data is applied in: Healthcare, Education, Finance, Telecom, Retail, E-commerce.

❖ Tools for Big Data


Hadoop

❖ Introduction

Apache Hadoop: An open-source framework used to efficiently store and process large datasets, ranging in size from gigabytes to petabytes.

❖ Hadoop Ecosystem

❖ Hadoop Ecosystem: Hadoop HDFS

Hadoop HDFS (Hadoop Distributed File System): A distributed file system that handles large data sets running on commodity hardware.

❖ Hadoop Ecosystem: Hadoop YARN

Hadoop YARN (Yet Another Resource Negotiator): The resource management and job scheduling technology of Hadoop.

❖ Hadoop Ecosystem: Hadoop MapReduce

Hadoop MapReduce: A highly efficient methodology for parallel processing of huge volumes of data.

❖ Hadoop Architecture
High-level architecture of Hadoop:
- MapReduce layer: Job Tracker on the master node; Task Trackers on the master and slave nodes.
- HDFS layer: Name Node on the master node; Data Nodes store the data.

❖ Hadoop HDFS

HDFS (Hadoop Distributed File System): A distributed file system designed to store large amounts of data across multiple machines in a Hadoop cluster, providing high fault tolerance and scalability.

❖ Hadoop HDFS

• NameNode: The master node in a Hadoop cluster; it manages the file system's metadata.
• DataNode: A worker node in a Hadoop cluster; it stores the data blocks.
• Block: A fixed-size chunk of data. The default block size is 128 MB in Hadoop 2.x and later (64 MB in Hadoop 1.x).
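
To make the block idea concrete, here is a minimal sketch of how a file could be cut into fixed-size blocks. The helper split_into_blocks and the 300 MB file are hypothetical; only the 128 MB default comes from the slide.

```python
BLOCK_SIZE_MB = 128  # default block size from the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of the given size occupies."""
    if file_size_mb <= 0:
        return []
    full_blocks = int(file_size_mb // block_size_mb)
    remainder = file_size_mb - full_blocks * block_size_mb
    blocks = [block_size_mb] * full_blocks
    if remainder > 0:
        blocks.append(remainder)  # the last block only holds the leftover data
    return blocks


blocks = split_into_blocks(300)  # hypothetical 300 MB file
print(len(blocks), blocks)       # 3 [128, 128, 44]
```

Each of these blocks would be stored on a DataNode, while the NameNode keeps track of which blocks belong to which file.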

❖ Hadoop HDFS

Source: https://www.javatpoint.com/hdfs

❖ Hadoop HDFS

Hadoop Cluster Architecture


Source: https://techvidvan.com/tutorials/hadoop-cluster/

❖ Hadoop HDFS

Hadoop Single Node Cluster Architecture


Source: https://techvidvan.com/tutorials/hadoop-cluster/

❖ Hadoop MapReduce

Hadoop MapReduce: A distributed programming model and framework that breaks down large-scale data processing tasks into smaller, parallelizable operations (map and reduce), allowing for efficient processing and analysis of big data across a cluster of machines.

Source: https://www.geeksforgeeks.org/hadoop-architecture/

❖ Hadoop MapReduce

InputReader: Reads the input data and splits it into data blocks.

Source: https://www.analyticsvidhya.com/blog/2022/07/learn-everything-about-mapreduce-architecture-and-its-components/

❖ Hadoop MapReduce

MapFunction: Processes the input key-value pairs and generates the corresponding output key-value pairs.

❖ Hadoop MapReduce

PartitionFunction: Assigns each output of the map phase to the appropriate reducer.
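
A partition function can be sketched as "hash the key, then take the remainder by the number of reducers". The code below is an illustration only: zlib.crc32 stands in for a stable hash and is not what Hadoop itself uses, and the sample pairs and the reducer count are made up.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index for a key; the same key always maps to the same reducer."""
    # zlib.crc32 is an assumed stand-in for a stable hash function
    # (Python's built-in hash() of str changes between runs).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Hypothetical map output and reducer count.
map_output = [("deer", 1), ("bear", 1), ("river", 1), ("deer", 1), ("car", 1)]
num_reducers = 3

buckets = {i: [] for i in range(num_reducers)}
for key, value in map_output:
    buckets[partition(key, num_reducers)].append((key, value))

for reducer_id, pairs in buckets.items():
    print(reducer_id, pairs)
```

Because the reducer index depends only on the key, every pair with the same key lands in the same bucket, which is what lets a single reducer see all values for that key.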

❖ Hadoop MapReduce

ShufflingAndSorting: Collects the results of the map phase, transfers them to the reducers, and sorts them by key.

❖ Hadoop MapReduce

ReduceFunction: Groups the intermediate key-value pairs produced by the map phase by key and processes each group.

❖ Hadoop MapReduce: WordCount Example

Source: https://www.guru99.com/introduction-to-mapreduce.html
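
The WordCount flow can be imitated on a single machine in plain Python. This is a sketch of the map, shuffle-and-sort, and reduce phases, not Hadoop code; the three input lines and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group the values by key, then order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


# Hypothetical input, one string per input split.
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))
print(counts)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

In a real cluster the map calls run in parallel on different nodes and the grouped pairs are sent over the network to the reducers; here everything happens in one process so the data flow is easy to follow.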
❖ Installation
Hadoop Single Node (Hadoop pseudo-distributed mode): A configuration option in Hadoop that allows us to run Hadoop on a single machine.

[Figure: Single-node clusters (a single master each) compared with a multi-node cluster (master and slave)]

Source: https://www.geeksforgeeks.org/hadoop-architecture/

❖ Installation: Requirements
1. Virtual Machine
2. CentOS
3. Hadoop

Tools shown: VirtualBox, VMware, Cloudera.

Note: The following installation demo is on a Mac M1. You need to choose the proper version for your computer.

❖ Virtual Machine

Virtual Machines (VMs): An emulation of a computer system, where these machines use computer architectures to provide the functionality of a physical computer.

Layers, from top to bottom:
- Guest VMs (VM1, VM2, VM3, VM4)
- Host applications and hypervisor
- Host operating system (Windows, Ubuntu, macOS, ...)
- Hardware layer (CPU, GPU, RAM, ...)

Example: A macOS computer running two Windows VMs.

❖ Installation: VMware

VMware: Software that provides machine virtualization.

- For Windows/Linux: Use VMware Workstation.
- For macOS: Use VMware Fusion.

❖ Installation: VMware

[Figure: VMware Fusion interface]

❖ Installation: CentOS

CentOS: A community-driven free software effort focused on providing a rich base platform for open source communities to build upon.

❖ Installation: CentOS
Choose a version:
• x86_64: For Windows/Ubuntu/Intel Mac
• ARM64 (aarch64): For Mac M1/M2

Download the .iso file of CentOS: Link

❖ Installation: Run CentOS in VMware

- Step 3: Choose the CentOS .iso file.
- Step 7: You can customize this configuration if needed.
- Choose where to save the new virtual machine.

❖ Installation: Setup CentOS

❖ Installation: Hadoop Requirements

1. Install Java
2. Install Hadoop
3. Set up Hadoop

❖ Installation: Java

Java: A high-level, class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible.

Install Java 8 on CentOS:
1. sudo yum -y update
2. sudo yum install -y java-1.8.0-openjdk

❖ Installation: Hadoop

• Open the Hadoop 2.7.3 download page.
• Download the Hadoop archive via the wget command.

❖ Installation: Hadoop

• Extract the downloaded file via the tar command:
tar -xf hadoop-2.7.3.tar.gz
• Rename the Hadoop folder:
mv hadoop-2.7.3 hadoop

❖ Installation: Config Hadoop
1. Check the Java configuration.
2. Open the ~/.bashrc script.


3. Set up the .bashrc script. Paste the code below, then run the script (for example, with source ~/.bashrc):

# Location of the Java 8 installation (adjust to the path on your machine)
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.362.b09-4.el9.aarch64
# Location of the extracted Hadoop folder
export HADOOP_HOME=/home/thangduong/hadoop
# HADOOP_INSTALL is referenced by the lines below, so define it from HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
# Make the hadoop commands and the start/stop scripts available on PATH
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

4. Configure hadoop-env.sh.

5. Configure core-site.xml. Paste the code below:

<configuration>
  <!-- fs.default.name is the older name of fs.defaultFS -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

6. Configure yarn-site.xml. Paste the code below:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

7. Configure mapred-site.xml. Copy mapred-site.xml.template to a new mapred-site.xml, then edit it. Paste the code below:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

8.1. Create the namenode and datanode folders.
8.2. Configure hdfs-site.xml. Paste the code below:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/thangduong/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/thangduong/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
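
One way to double-check a Hadoop configuration file is to read its properties back with a short script. The sketch below parses an inline copy of the hdfs-site.xml settings with Python's standard xml.etree module; the helper name read_properties is made up, and on a real machine the text would be read from the file in the Hadoop configuration folder instead.

```python
import xml.etree.ElementTree as ET

# Inline copy of the hdfs-site.xml settings from this step (assumed paths).
HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/thangduong/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/thangduong/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
"""


def read_properties(xml_text: str) -> dict:
    """Return a {name: value} mapping of the <property> entries in a Hadoop config."""
    root = ET.fromstring(xml_text.strip())
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


props = read_properties(HDFS_SITE)
print(props["dfs.replication"])  # 1
```

A parse error here usually points to an unclosed tag or a name that was broken across lines when the snippet was pasted.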
❖ Installation: Check Hadoop

Check the Hadoop installation, for example by running hadoop version.

❖ Example: Search text

Pattern to search: "dfs[a-z.]+"

1. Initialize the namenode.
2. Start the namenode and datanode daemons.
3. Create an input folder and copy all .xml files into it.
4. Perform grep with the string pattern "dfs[a-z.]+".
5. Get the output and print the results.
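
To see what the search pattern matches before running it on the cluster, the same regular expression can be tried locally. This is a plain-Python sketch, not the Hadoop grep job itself, and the sample XML text is a made-up stand-in for the files in the input folder.

```python
import re
from collections import Counter

PATTERN = re.compile(r"dfs[a-z.]+")

# Hypothetical stand-in for the contents of the .xml files in the input folder.
sample_xml = """
<property><name>dfs.replication</name><value>1</value></property>
<property><name>dfs.namenode.name.dir</name></property>
<property><name>dfs.datanode.data.dir</name></property>
<property><name>dfs.replication</name><value>3</value></property>
"""

# Count how often each matching string occurs.
counts = Counter(PATTERN.findall(sample_xml))
for match, count in counts.most_common():
    print(count, match)
```

The pattern needs at least one lowercase letter or dot after "dfs", so a bare "dfs" is not counted while names such as dfs.replication are.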

❖ Example: Other default applications


QUIZ


Spark

❖ Introduction

Apache Spark: An open-source unified analytics engine for large-scale data processing. Spark is an alternative to Hadoop MapReduce.

❖ Why Spark?

Hadoop MapReduce        Spark
Fast                    Up to 100x faster than MapReduce
Batch processing        Real-time processing
Stores data on disk     Stores data in memory
Written in Java         Written in Scala
Low cost                High cost

❖ Spark Modules

❖ PySpark Introduction

PySpark: The Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python.

❖ PySpark Installation

❖ PySpark Usage

SparkSession: Represents the connection to a Spark cluster and provides a way to interact with various Spark APIs. It lets you create DataFrames, execute SQL queries, and perform distributed data processing tasks.

❖ PySpark Usage: Create DataFrame

❖ PySpark Usage: Operations

select(): Chooses the columns to show in the table.
filter(): Extracts the rows that satisfy a criterion.

❖ PySpark Example: Working with CSV file

Description: Perform some basic Spark operations on the Flights.csv dataset. Download link: here.


- Assign the DataFrame to a database table.
- Call the data table.
- Perform filtering using a function or an SQL query.

Summarization

In this session, we discussed:

- Introduction to big data.
- Introduction to Hadoop.
- How to install a virtual machine.
- How to install and use Hadoop on a virtual machine.
- Introduction to Spark and PySpark.

Question