Data Storage and Data Processing: Hadoop Distributed File System (HDFS) and MapReduce


Introduction

• Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop’s contributors work for some of the world’s biggest technology companies. That diverse, motivated community has produced a collaborative platform for consolidating, combining and understanding data.
• Technically, Hadoop consists of two key services: data storage using the Hadoop Distributed File System (HDFS) and large-scale parallel data processing using a technique called MapReduce.
• HDFS (Hadoop Distributed File System) is the storage layer of a Hadoop cluster.
• It is mainly designed to run on commodity hardware (inexpensive, readily available devices), using a distributed file system design.
• HDFS is designed to store data in large blocks rather than in many small ones.
• HDFS in Hadoop provides fault tolerance and high availability at the storage layer by replicating each block across multiple nodes, so data survives the failure of any single machine.
• Example:
Suppose you have a DFS comprising 4 machines, each with 10TB of storage. In that case you can store, say, 30TB across this DFS, since it presents the 4 machines as one combined machine of size 40TB. The 30TB of data is distributed among these nodes in the form of blocks.
• HDFS is capable of handling large volumes of data arriving with high velocity and variety, which makes Hadoop work more efficiently and reliably, with easy access to all its components.
• HDFS stores data in the form of blocks, where each block is 128MB by default. This size is configurable, meaning you can change it according to your requirements in the hdfs-site.xml file in your Hadoop directory, as sketched below.
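For reference, a minimal sketch of the relevant hdfs-site.xml property. The value 268435456 (256MB) is only an illustrative choice; the shipped default is 134217728 (128MB):

<property>
  <!-- Default block size for new files, in bytes -->
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>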
• About this Lab
• After completing this hands-on lab, you’ll be able to:
· Use Hadoop commands to explore the HDFS on the Hadoop system
· Use Hadoop commands to run a sample MapReduce program on the Hadoop system (a sample launch command is sketched below)
· Explore Hive and Sqoop
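As a preview of the MapReduce exercise, a bundled example job can be launched from the shell. The jar path below is typical of CDH-style installs and may differ on your VM; /input and /output are placeholder HDFS paths (the output directory must not already exist):

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /input /output
$ hdfs dfs -cat /output/part-r-00000 | head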
Hadoop Setup
• Setting up a Hadoop development environment can be a tedious and time-consuming task.
• To make better use of time, we will be using a pre-configured Ubuntu Operating System (OS) running on a Virtual Machine (VM)
Hadoop Setup
• We use a virtualization product called VirtualBox
• It is feature-rich, high-performance, and free
• We will set up the Hadoop Training Virtual Machine (VM) in VirtualBox
– VirtualBox runs on most modern Operating Systems, so you need a laptop/desktop to host it
– The minimal specifications for your machine are 4GB RAM, a Core 2 Duo processor, and 20GB of empty hard drive space
Hadoop Setup
1. Download VirtualBox
Download the latest version of VirtualBox for your Operating System:
https://www.virtualbox.org/wiki/Downloads
2. Install VirtualBox
Double-click on the VirtualBox installer and follow the installation instructions. When the installation is complete, launch the application.
3. Download the Hadoop Training Ubuntu Image
Download the Hadoop Training Virtual Machine (VM). The file is ~1.5GB, so it may take a few minutes to download.
After all, you are downloading an Ubuntu Operating System with Hadoop pre-installed.
https://www.dropbox.com/s/sbyr0baonjwf62v/HadoopTraining_v1.0.ova
Logging in
• Username: hadoop
• Password: hadoop
Exploring Hadoop Distributed File System (HDFS)

• The Hadoop Distributed File System (HDFS) allows user data to be organized in the form of files and directories. It provides a command-line interface called the DFS shell that lets a user interact with the data in HDFS, which is also accessible to Hadoop MapReduce programs.

• You can use the command-line approach and invoke the FileSystem (fs) shell using the format: hdfs dfs <args>.

• Many HDFS commands are similar to UNIX commands, as illustrated below.
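For example, listing and viewing files mirror their UNIX counterparts (the paths here are illustrative):

$ ls /home/hadoop           # UNIX: list a local directory
$ hdfs dfs -ls /            # HDFS: list the root of the distributed file system
$ cat notes.txt             # UNIX: print a local file
$ hdfs dfs -cat /notes.txt  # HDFS: print a file stored in HDFS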


VM exercise
Perform
1. Start Virtual Machine
2. Open Command Line Terminal
3. Start ALL installed Hadoop products
4. Stop ALL installed Hadoop products
5. Verify there are no leftover Java processes
Solution
1. Open VirtualBox, select the training VM, and click Start
2. Within the VM, double-click on the Terminal icon
3. In the Command Line Terminal type startCDH.sh
4. In the Command Line Terminal type stopCDH.sh
5. In the Command Line Terminal type jps
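If the cluster started cleanly, jps lists the running Hadoop daemons. A typical listing looks like the following; the PIDs are illustrative and will differ on your machine, and the exact set of daemons depends on which products startCDH.sh launches:

2481 NameNode
2605 DataNode
2790 SecondaryNameNode
2954 ResourceManager
3077 NodeManager
3190 Jps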
Hadoop Shell Commands
Perform
1. Start HDFS and verify that it's running
2. Create a new directory /lab1_ex on HDFS
3. Create a new file in the directory /lab1_ex
4. Upload the jps.txt file into HDFS under the /lab1_ex directory
5. View the content of the /lab1_ex directory
6. Print the first 2 lines to the screen from jps.txt on HDFS
7. Copy a file in HDFS to another name in HDFS
8. Copy jps.txt to the local file system and name it a_copy.txt
9. Delete jps.txt from HDFS
10. Delete the /lab1_ex directory from HDFS
11. Take a second to look at the other available shell options
Hadoop Commands
1. Start HDFS and verify that it's running
$ startCDH.sh
2. Create a new directory /lab1_ex on HDFS
$ hdfs dfs -mkdir /lab1_ex
3. Create an empty file in /lab1_ex
$ hdfs dfs -touchz /lab1_ex/myfile.txt

4. Upload the jps.txt file into HDFS under the /lab1_ex directory
Copy a file from local to HDFS with copyFromLocal (or) put
$ hdfs dfs -put ./Desktop/jps.txt /lab1_ex/

View whether the file was copied or not:
hadoop@hadoop-laptop:~$ hdfs dfs -ls /lab1_ex/
Output:
Found 1 items
-rw-r--r-- 1 hadoop supergroup 201 2020-07-29 23:06 /lab1_ex/jps.txt
Hadoop Commands
5. View the content of the /lab1_ex directory
$ hdfs dfs -ls /lab1_ex/
6. Print the first 2 lines to the screen from jps.txt on HDFS
$ hdfs dfs -cat /lab1_ex/jps.txt | head -n 2
7. Copy a file in HDFS to another name in HDFS
$ hdfs dfs -cp /lab1_ex/jps.txt /lab1_ex/a1_hdfsCopy.txt
Hadoop Commands
8. Copy jps.txt to the local file system and name it a_copy.txt, using copyToLocal (or) get
$ hdfs dfs -get /lab1_ex/jps.txt a_copy.txt
9. Delete jps.txt from HDFS
$ hdfs dfs -rm /lab1_ex/jps.txt
10. Delete the /lab1_ex directory from HDFS
$ hdfs dfs -rm -r /lab1_ex
11. Take a second to look at the other available shell options
$ hdfs dfs -help
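You can also ask for help on a single command, for example:

$ hdfs dfs -help cp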
• moveFromLocal: This command moves a file from the local file system to HDFS; unlike put, the local copy is removed after the transfer.
• Syntax:
hdfs dfs -moveFromLocal <local src> <dest (on hdfs)>

$ hdfs dfs -moveFromLocal ./Desktop/jps.txt /lab1_ex


• setrep: This command is used to change the replication factor of a file/directory in HDFS. By default it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).

• Example 1: To change the replication factor to 6 for jps.txt stored in HDFS:
$ hdfs dfs -setrep -R -w 6 /lab1_ex/jps.txt
• Example 2: To change the replication factor to 4 for the directory /lab1_ex stored in HDFS:
$ hdfs dfs -setrep -R 4 /lab1_ex

The -w flag means wait until the replication is completed, and -R means apply the change recursively.
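To confirm the change, note that the second column of an hdfs dfs -ls listing is the file's replication factor (the output below is illustrative, following the same format as the listing earlier in this lab):

$ hdfs dfs -ls /lab1_ex/jps.txt
-rw-r--r-- 6 hadoop supergroup 201 2020-07-29 23:06 /lab1_ex/jps.txt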
To view the data in a browser
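HDFS can also be browsed through the NameNode's web interface. Depending on the Hadoop version installed on the VM, it is typically served at http://localhost:50070 (Hadoop 1.x/2.x) or http://localhost:9870 (Hadoop 3.x); from there, use the "Browse the file system" utility to inspect /lab1_ex.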
Hadoop Ecosystem
• The Hadoop Ecosystem is a platform, or a suite, which provides various services to solve big data problems.
• There are four major elements of Hadoop:
1. HDFS,
2. MapReduce,
3. YARN, and
4. Hadoop Common.
• All these tools work collectively to provide services such as the absorption, analysis, storage and maintenance of data.
• Following are the components that collectively form a Hadoop
ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• ZooKeeper: Managing the cluster
• Oozie: Job Scheduling
Thank You
