Hadoop Distributed File System (HDFS)
HDFS Overview
• A distributed file system
• Built on the architecture of the Google File System (GFS)
• Shares a similar architecture with many other common distributed storage engines, such as Amazon S3 and Microsoft Azure
• HDFS is a stand-alone storage engine and can be used independently of any query processing engine
Background on Disk Storage
• What are file systems, and why do we need them?
• A file is a logical sequence of bits/bytes
• A physical disk stores data in sectors, tracks, blocks, etc.
File System
• A file system is a method that provides a high-level abstraction (files and folders) on top of the physical disk to make it easier to store data
(Diagram: files and folders exposed by the file system over a physical disk)
Distributed File System
(Diagram: the same files-and-folders abstraction, provided over many machines)
Analogy to Unix FS
(Diagram: an example directory tree with /etc and /hadoop at the root, and /user containing the home directories mary and chu)
Analogy to Unix FS
(Diagram: in Unix, a file is an ordered list of blocks (Block 1, Block 2, Block 3, …) on a single disk; in HDFS, the blocks are spread across many data nodes)
HDFS Architecture
(Diagram: a single name node coordinating a set of data nodes; each data node stores a subset of the file blocks)
What is where?
Name node:
• File and directory names
• Block ordering and locations
• Capacity and status of the data nodes
Data nodes:
• Block data
• Location of the name node
Physical Layout
(Diagram: a cluster is made of racks; Rack #1 holds Nodes #1 through #32, Rack #2 starts at Node #33, and so on)
Analogy of Racks
(Diagram: each rack contains several nodes connected by a rack-level switch; the switches connect the racks to each other)
HDFS Shell
Managing files from the command line
HDFS Shell
• The easiest way to interact with HDFS is through its shell
• The commands are very similar to Linux shell commands
• General format:
hdfs dfs -<cmd> <arguments>
• So, instead of
mkdir -p myproject/mydir
• you would write
hdfs dfs -mkdir -p myproject/mydir
HDFS Shell
• In addition to the regular commands, there are special commands in HDFS:
▪ copyToLocal/get: copies a file from HDFS to the local file system
▪ copyFromLocal/put: copies a file from the local file system to HDFS
▪ setrep: changes the replication factor
• A list of shell commands with usage:
▪ https://hadoop.apache.org/docs/r3.2.2/hadoop-project-dist/hadoop-common/FileSystemShell.html
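The commands above combine into a typical round trip. A minimal sketch, assuming a running HDFS cluster is configured; the local file input.csv and the directory /user/alice are hypothetical:

```shell
# Create a directory in HDFS and upload a local file into it
hdfs dfs -mkdir -p /user/alice/data
hdfs dfs -copyFromLocal input.csv /user/alice/data/

# List the directory and change the replication factor of the file
hdfs dfs -ls /user/alice/data
hdfs dfs -setrep 3 /user/alice/data/input.csv

# Copy the file back to the local file system
hdfs dfs -copyToLocal /user/alice/data/input.csv copy.csv
```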
HDFS API
Managing the file system programmatically
FileSystem API
• HDFS provides a Java API that allows your programs to manage files much like the shell does, and it is even more powerful
• For interoperability, the FileSystem API covers not only HDFS but also the local file system and other common file systems, e.g., Amazon S3
• If you write your program against the Hadoop FileSystem API, it will generally work on any of those file systems
HDFS API Basic Classes
(Diagram: the three core classes are FileSystem, Path, and Configuration)
HDFS API Classes
• Configuration: holds system configuration, such as where the master node is running and the default system parameters
• Path: stores a path to a file or directory
• FileSystem: an abstract class for file system commands
Fully Qualified Path
hdfs://masternode:9000/path/to/file
• hdfs: the file system scheme; other possible values are file, ftp, s3, …
• masternode: the name or IP address of the node that hosts the master of the file system
• 9000: the port on which the master node is listening
• /path/to/file: the absolute path of the file
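Since a fully qualified path is a standard URI, its components can be pulled apart with java.net.URI. A small sketch in plain Java, no Hadoop needed:

```java
import java.net.URI;

public class PathParts {
    public static void main(String[] args) {
        // Parse a fully qualified HDFS path into its components
        URI uri = URI.create("hdfs://masternode:9000/path/to/file");
        System.out.println(uri.getScheme()); // hdfs
        System.out.println(uri.getHost());   // masternode
        System.out.println(uri.getPort());   // 9000
        System.out.println(uri.getPath());   // /path/to/file
    }
}
```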
Shorter Path Forms
• file: a relative path to the current working directory in the default file system
• /path/to/file: an absolute path to a file in the default* file system (as configured)
• hdfs:///path/to/file: uses the default* values for the master node and port
• hdfs://masternode/path/to/file: uses the given masternode name or IP and the default* port
*All the defaults are in the Configuration object
HDFS API
Creating the FileSystem object
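The first step in any FileSystem program looks roughly like this. A sketch, assuming the Hadoop client libraries are on the classpath and a cluster is reachable at the hypothetical address masternode:9000:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Load the configuration, then obtain a FileSystem handle.
// FileSystem.get() returns an implementation matching the URI scheme:
// an hdfs:// URI gives HDFS, a file:// URI gives the local file system.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("hdfs://masternode:9000/"), conf);
```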
HDFS API
• Create a new file:
FSDataOutputStream out = fs.create(path, …);
• Delete a file:
fs.delete(path, recursive);
• Rename/move a file:
fs.rename(oldPath, newPath);
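Put together, the calls above look roughly like the following. A sketch, assuming a FileSystem handle fs has already been obtained and the /user/alice paths are hypothetical:

```java
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

// Create a file and write some bytes to it; close() seals the last block
Path tmp = new Path("/user/alice/part.tmp");
try (FSDataOutputStream out = fs.create(tmp)) {
    out.writeBytes("hello hdfs\n");
}

// Rename the finished file, then delete it (non-recursively)
Path done = new Path("/user/alice/part-0");
fs.rename(tmp, done);
fs.delete(done, false);
```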
HDFS API
• Open a file for reading:
FSDataInputStream in = fs.open(path, …);
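A matching read-side sketch, again assuming an existing FileSystem handle fs and a hypothetical file path:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;

// Open the file and read it line by line; FSDataInputStream also
// supports seek(pos) for random access reads
try (FSDataInputStream in = fs.open(new Path("/user/alice/part-0"));
     BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}
```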
HDFS API
• Concatenate existing files into one:
fs.concat(destination, src[]);
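A sketch of concat, assuming fs is a FileSystem handle and the part files are hypothetical; all files must already exist in HDFS:

```java
import org.apache.hadoop.fs.Path;

// Merge two part files into the first one; only name node metadata changes
Path dst = new Path("/user/alice/part-0");
Path[] srcs = { new Path("/user/alice/part-1"), new Path("/user/alice/part-2") };
fs.concat(dst, srcs);
```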
HDFS Writing Process
(Diagram: the three parties involved in a write: the file creator, the name node, and the data nodes)
HDFS Writing Process
The creator process calls the create function, which translates to an RPC call at the name node
HDFS Writing Process
The name node creates an initial block with three replicas:
1. The first block replica is assigned to a random machine
2. The second block replica is assigned to another random machine on a different rack
3. The third block replica is assigned to a random machine on the second rack
HDFS Writing Process
(Diagram: the name node returns an OutputStream to the file creator)
HDFS Writing Process
(Diagram sequence: the creator calls OutputStream#write; the data is sent to the first replica and pipelined from there to the second and third replicas)
HDFS Writing Process
(Diagram: when the current block is full, the creator asks the name node for the next block and the process repeats)
Notes about Writing to HDFS
• Data transfers of replicas are pipelined
• The data does not go through the name node
• Random writing is not supported
• Appending to a file is supported, but it creates a new block
Writing from a Datanode
(Diagram: when the file creator itself runs on a data node, the first replica is placed on that local node)
Reading from HDFS
• Reading is relatively easier
• No replication is needed
• Replication can be exploited, e.g., to read from the closest replica
• Random reading is allowed
HDFS Reading Process
The reader process calls the open function, which translates to an RPC call at the name node
HDFS Reading Process
The name node locates the first block of that file and returns an InputStream, along with the address of one of the nodes that store that block
HDFS Reading Process
(Diagram: the reader fetches the block data directly from the chosen data node)
HDFS Reading Process
(Diagram: when a block is exhausted, the reader asks the name node for the location of the next block)
HDFS Reading Process
(Diagram: seek(pos) jumps to an arbitrary offset; the name node locates the block that contains that position)
Reading from a Datanode
(Diagram: a reader running on a data node can open and seek as usual, and reads locally stored replicas directly)
Notes About Reading
• The API is much richer than the simple open/seek/close API
▪ You can retrieve block locations
▪ You can choose a specific replica to read
• The same API generalizes to other file systems, including the local FS and S3
• Review question: compare random access reads in local file systems to HDFS
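The block-location call mentioned above can be sketched as follows, assuming the Hadoop libraries, an existing FileSystem handle fs, and a hypothetical file path:

```java
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

// Ask the name node which hosts store each block of the file
FileStatus status = fs.getFileStatus(new Path("/user/alice/part-0"));
BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
for (BlockLocation block : blocks) {
    // Each entry reports the byte range of the block and its replica hosts
    System.out.println(block.getOffset() + "+" + block.getLength()
            + " on " + String.join(",", block.getHosts()));
}
```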
HDFS Special Features
• Node decommission
• Load balancing
• Cheap concatenation
Node Decommission
(Diagram: before a node is removed from the cluster, the blocks it stores are re-replicated to the remaining data nodes)
Load Balancing
(Diagram sequence: the balancer moves block replicas from overloaded data nodes to underutilized ones until the load is even)
Cheap Concatenation
(Diagram: File 1, File 2, and File 3 are concatenated by linking their existing blocks at the name node; no block data is copied)
Conclusion
• HDFS is a general-purpose distributed file system
• Provides a similar abstraction to other file systems
• HDFS provides two interfaces:
▪ A shell, similar to those of Linux and macOS
▪ A Java API for programmatic access
• The FileSystem API applies to other file systems as well, including the local file system and Amazon S3
Further Readings
• HDFS Architecture
▪ https://hadoop.apache.org/docs/r3.2.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
• Shell commands
▪ https://hadoop.apache.org/docs/r3.2.2/hadoop-project-dist/hadoop-common/FileSystemShell.html
• FileSystem API
▪ https://hadoop.apache.org/docs/r3.2.2/api/org/apache/hadoop/fs/FileSystem.html