As an open-source project, Hadoop grants everyone the freedom to:
• use the content and benefit from using it,
• study the content and apply what is learned,
• make and distribute copies of the content,
• change and improve the content and distribute these derivative works.
Introduction
• Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop's contributors work for some of the world's biggest technology companies, and that diverse, motivated community has produced a collaborative platform for consolidating, combining, and understanding data.
• Technically, Hadoop consists of two key services: data storage using the Hadoop Distributed File System (HDFS) and large-scale parallel data processing using a technique called MapReduce.
• HDFS (Hadoop Distributed File System) provides the storage layer of a Hadoop cluster.
• It is mainly designed to run on commodity hardware (inexpensive devices), using a distributed file system design.
• HDFS is designed around storing data in large blocks rather than many small blocks.
• HDFS provides fault tolerance and high availability to Hadoop's storage layer.
• Example: suppose a DFS comprises 4 machines of 10 TB each. You can then store, say, 30 TB across the DFS, since it presents a combined machine of 40 TB; the 30 TB of data is distributed among these nodes in the form of blocks.
• HDFS is capable of handling large data with high volume, velocity, and variety, which makes Hadoop efficient and reliable, with easy access to all its components.
• HDFS stores data in blocks of 128 MB each by default. This is configurable, meaning you can change it according to your requirements in the hdfs-site.xml file in your Hadoop directory (a sample snippet appears just before the Exploring HDFS section below).

About this Lab
• After completing this hands-on lab, you'll be able to:
· Use Hadoop commands to explore the HDFS on the Hadoop system
· Use Hadoop commands to run a sample MapReduce program on the Hadoop system
· Explore Hive and Sqoop

Hadoop Setup
• Setting up a Hadoop development environment can be a tedious and time-consuming task.
• To make better use of time, we will use a pre-configured Ubuntu Operating System (OS) running on a Virtual Machine (VM).
• We use a virtualization product called VirtualBox. It is feature-rich, high-performance, and free.
• To set up the Hadoop Training Virtual Machine (VM) in VirtualBox:
– VirtualBox runs on most modern operating systems; you therefore need a laptop/desktop.
– The minimal specifications for your machine are 4 GB RAM, a Core 2 Duo processor, and 20 GB of empty hard drive space.
1. Download VirtualBox. Download the latest version of VirtualBox for your operating system: https://www.virtualbox.org/wiki/Downloads
2. Install VirtualBox. Double-click the VirtualBox installer and follow the installation instructions. When the installation is complete, launch the application.
3. Download the Hadoop Training Ubuntu image. Download the Hadoop Training Virtual Machine (VM). The file is ~1.5 GB, so it may take a few minutes to download; after all, you are downloading an Ubuntu Operating System with Hadoop installed. https://www.dropbox.com/s/sbyr0baonjwf62v/HadoopTraining_v1.0.ova
(The original deck shows several setup screenshots here.)

Logging in
• Username: hadoop
• Password: hadoop
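As noted in the Introduction, the default block size can be changed in hdfs-site.xml. A minimal sketch of the relevant property, assuming the standard Hadoop 2+ key dfs.blocksize (the value shown is simply the 128 MB default, expressed in bytes):

<!-- hdfs-site.xml: sets the default HDFS block size to 128 MB -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>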
Exploring Hadoop Distributed File System (HDFS)
• The Hadoop Distributed File System (HDFS) allows user data to be organized in the form of files and directories.
• It provides a command-line interface, called the DFS shell, that lets a user interact with the data in HDFS; the same data is accessible to Hadoop MapReduce programs.
• You can use the command-line approach and invoke the
FileSystem (fs) shell using the format: hdfs dfs <args>.
• Many HDFS commands are similar to their UNIX counterparts, as the quick comparison below shows.
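For illustration, a few side-by-side pairs (the paths and file names are examples only):

$ ls /tmp                    # UNIX: list a local directory
$ hdfs dfs -ls /tmp          # HDFS: list a directory in HDFS
$ cat notes.txt              # UNIX: print a local file
$ hdfs dfs -cat /notes.txt   # HDFS: print a file stored in HDFS
$ mkdir demo                 # UNIX: create a local directory
$ hdfs dfs -mkdir /demo      # HDFS: create a directory in HDFS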
VM Exercise
Perform:
1. Start the Virtual Machine
2. Open a Command Line Terminal
3. Start ALL installed Hadoop products
4. Stop ALL installed Hadoop products
5. Verify there are no leftover Java processes
Solution:
1. Open VirtualBox, select the training VM, and click Start
2. Within the VM, double-click the Terminal icon
3. In the Command Line Terminal, type startCDH.sh
4. In the Command Line Terminal, type stopCDH.sh
5. jps

Hadoop Shell Commands
Perform:
1. Start HDFS and verify that it's running
2. Create a new directory /lab1_ex on HDFS
3. Create a new file in the directory /lab1_ex
4. Upload the jps.txt file into HDFS under the /lab1_ex directory
5. View the content of the /lab1_ex directory
6. Print the first 2 lines of jps.txt on HDFS to the screen
7. Copy a file in HDFS to another name in HDFS
8. Copy jps.txt to the local file system and name it a_copy.txt
9. Delete jps.txt from HDFS
10. Delete the /lab1_ex directory from HDFS
11. Take a second to look at other available shell options

Hadoop Commands
1. Start HDFS and verify that it's running
$ startCDH.sh
2. Create a new directory /lab1_ex on HDFS
$ hdfs dfs -mkdir /lab1_ex
3. Create an empty file
$ hdfs dfs -touchz /lab1_ex/myfile.txt
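You can confirm that the empty file was created by listing the directory (the output below is illustrative; your timestamp will differ):

$ hdfs dfs -ls /lab1_ex
Found 1 items
-rw-r--r--   1 hadoop supergroup          0 2020-07-29 23:00 /lab1_ex/myfile.txt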
4. Upload the jps.txt file into HDFS under the /lab1_ex directory, using copyFromLocal (or) put to copy a file from the local file system to HDFS:
$ hdfs dfs -put ./Desktop/jps.txt /lab1_ex/
View whether the file was copied:
hadoop@hadoop-laptop:~$ hdfs dfs -ls /lab1_ex/
Output:
Found 1 items
-rw-r--r-- 1 hadoop supergroup 201 2020-07-29 23:06 /lab1_ex/jps.txt

Hadoop Commands
5. View the content of the /lab1_ex directory
$ hdfs dfs -ls /lab1_ex/
6. Print the first 2 lines of jps.txt on HDFS to the screen
$ hdfs dfs -cat /lab1_ex/jps.txt | head -n 2
7. Copy a file in HDFS to another name in HDFS
$ hdfs dfs -cp /lab1_ex/jps.txt /lab1_ex/a1_hdfsCopy.txt

Hadoop Commands
8. Copy jps.txt to the local file system and name it a_copy.txt, using copyToLocal (or) get:
$ hdfs dfs -get /lab1_ex/jps.txt a_copy.txt
9. Delete jps.txt from HDFS
$ hdfs dfs -rm /lab1_ex/jps.txt
10. Delete the /lab1_ex directory from HDFS
$ hdfs dfs -rm -r /lab1_ex
11. Take a second to look at other available shell options
$ hdfs dfs -help
• moveFromLocal: This command moves a file from the local file system to HDFS.
Syntax: hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
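A short example of moveFromLocal (the file name and destination are illustrative):

$ hdfs dfs -moveFromLocal ./jps.txt /lab1_ex/

Unlike put, this removes ./jps.txt from the local file system once the copy into HDFS succeeds.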
• setrep: This command is used to change the replication factor of a file/directory in HDFS. By default it is 3 for anything stored in HDFS (as set by the dfs.replication property in hdfs-site.xml).
• Example 1: change the replication factor of jps.txt stored in HDFS to 6.
$ hdfs dfs -setrep -R -w 6 /lab1_ex/jps.txt
• Example 2: change the replication factor of the directory /lab1_ex stored in HDFS to 4.
$ hdfs dfs -setrep -R 4 /lab1_ex
The -w flag means wait until the replication has completed, and -R means apply the change recursively.
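To verify that the change took effect, you can print a file's current replication factor with the stat subcommand (%r is the replication field of hdfs dfs -stat):

$ hdfs dfs -stat %r /lab1_ex/jps.txt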
To view the data in a browser, the original deck shows screenshots here (typically the NameNode web UI is used).

Hadoop Ecosystem
• The Hadoop ecosystem is a platform, or suite, that provides various services to solve big data problems.
• There are four major elements of Hadoop: 1. HDFS, 2. MapReduce, 3. YARN, and 4. Hadoop Common.
• All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
• The following components collectively form a Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
• Pig, Hive: query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• ZooKeeper: managing the cluster
• Oozie: job scheduling

Thank You