Introduction to Hadoop Distributed File System (HDFS)
As data velocity and volume grow, data easily outgrows the storage capacity of a
single machine. A solution is to store the data across a network of machines;
such filesystems are called distributed filesystems. Since data is stored
across a network, all the complications of a network come into play.
This is where Hadoop comes in. It provides one of the most reliable
filesystems. HDFS (Hadoop Distributed File System) is uniquely designed to
provide storage for extremely large files with a streaming data access pattern,
and it runs on commodity hardware. Let's elaborate on these terms:
Extremely large files: Here, we are talking about data in the range of
petabytes (1 PB = 1000 TB).
Streaming Data Access Pattern: HDFS is designed on the principle of write-once,
read-many-times. Once data is written, large portions of the dataset
can be processed any number of times.
Commodity hardware: Hardware that is inexpensive and easily available in
the market. This is one of the features that especially distinguishes HDFS
from other file systems.
Nodes: An HDFS cluster is typically formed of master and slave nodes.
1. NameNode (MasterNode):
Manages all the slave nodes and assigns work to them.
It executes filesystem namespace operations like opening, closing, and
renaming files and directories.
It should be deployed on reliable, high-end hardware, not on commodity
hardware.
2. DataNode (SlaveNode):
These are the actual worker nodes, which do the actual work like reading,
writing, processing, etc. (a short API sketch follows this list).
They also perform creation, deletion, and replication upon instruction
from the master.
They can be deployed on commodity hardware.
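To make this division of labour concrete, here is a minimal Java sketch using Hadoop's FileSystem API, assuming a hypothetical cluster address hdfs://namenode-host:9000. The namespace operations shown (mkdirs, rename) are handled by the NameNode; actual file data would be read from and written to DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsNamespaceDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with your cluster's fs.defaultFS
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        // Namespace operations: resolved by the NameNode (master)
        Path dir = new Path("/demo");
        fs.mkdirs(dir);
        fs.rename(dir, new Path("/demo-renamed"));

        fs.close();
    }
}
```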
HDFS daemons: Daemons are the processes running in the background.
NameNodes:
Run on the master node.
Store metadata (data about data) like the file path, the number of
blocks, block IDs, etc.
Require a high amount of RAM.
Store metadata in RAM for fast retrieval, i.e., to reduce seek time;
a persistent copy of it is kept on disk.
DataNodes:
Run on slave nodes.
Require a large amount of disk space, as the actual data is stored here
(see the sketch after this list).
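As an illustration of the metadata the NameNode keeps, the sketch below (using a hypothetical file path /data/bigfile.csv) asks for the block locations of a file; the returned hosts are the DataNodes holding each block's replicas.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/bigfile.csv"); // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers from its in-memory metadata:
        // which blocks make up the file, and which datanodes hold each replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```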
Data storage in HDFS: Now let's see how the data is stored in a distributed
manner.
Let's assume a 100 TB file is inserted. The file is first divided into blocks
(the default block size is 128 MB in Hadoop 2.x and above), and these blocks
are stored across different datanodes (slave nodes). The datanodes (slave
nodes) replicate the blocks among themselves, and the information about which
blocks they contain is sent to the master. The default replication factor is 3,
which means 3 replicas of each block are kept (including the original). We can
increase or decrease the replication factor, i.e., edit this configuration, in
hdfs-site.xml, as shown in the sketch below.
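For reference, a minimal hdfs-site.xml sketch setting the two values discussed above; both values shown are already the defaults in Hadoop 2.x and later.

```xml
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- block size in bytes: 128 MB -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- replicas kept per block -->
  </property>
</configuration>
```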
Note: The masternode has a record of everything; it knows the location and
metadata of each and every datanode and the blocks they contain, i.e., nothing
is done without the permission of the masternode.
Why divide the file into blocks?
Answer: Let's assume that we don't divide the file. It is very difficult to
store a 100 TB file on a single machine, and even if we could, each read and
write operation on that whole file would incur a very high seek time. But if we
have multiple blocks of 128 MB, it becomes easier to perform various read and
write operations on them compared to doing it on the whole file at once. So we
divide the file to get faster data access, i.e., to reduce seek time.
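The arithmetic behind that claim, as a small sketch:

```java
public class BlockMath {
    public static void main(String[] args) {
        long fileSize = 100L * 1024 * 1024 * 1024 * 1024; // 100 TB in bytes
        long blockSize = 128L * 1024 * 1024;              // 128 MB default block
        long blocks = (fileSize + blockSize - 1) / blockSize; // ceiling division
        // A 100 TB file becomes 819,200 independent 128 MB blocks,
        // each of which can be read or written in parallel on a different node
        System.out.println(blocks + " blocks");
    }
}
```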
Why replicate the blocks in data nodes while storing?
Answer: Let's assume we don't replicate, and a particular block is present only
on datanode D1. If datanode D1 crashes, we lose the block, which makes the
overall data inconsistent and faulty. So we replicate the blocks to achieve
fault tolerance.
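The replication factor can also be changed per file through the FileSystem API; a sketch with a hypothetical path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Raise this file's replication from the default 3 to 5;
        // the NameNode schedules the extra copies on other datanodes
        fs.setReplication(new Path("/data/critical.log"), (short) 5); // hypothetical file
        fs.close();
    }
}
```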
Terms related to HDFS:
HeartBeat: It is the signal that each datanode continuously sends to the
namenode. If the namenode doesn't receive a heartbeat from a datanode, it
considers that datanode dead (see the configuration sketch after this list).
Balancing: If a datanode crashes, the blocks present on it are gone too,
and those blocks become under-replicated compared to the remaining
blocks. The masternode (namenode) then signals the datanodes
containing replicas of the lost blocks to replicate them, so that the overall
distribution of blocks stays balanced.
Replication: It is performed by the datanodes, as instructed by the namenode.
Note: No two replicas of the same block are present on the same datanode.
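The heartbeat timing is configurable in hdfs-site.xml; a sketch with the usual defaults is below. A datanode is declared dead only after roughly 2 × recheck-interval + 10 × heartbeat-interval, i.e., about 10.5 minutes with these settings.

```xml
<configuration>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value> <!-- seconds between datanode heartbeats -->
  </property>
  <property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <value>300000</value> <!-- milliseconds; used to decide a datanode is dead -->
  </property>
</configuration>
```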
Features:
Distributed data storage.
Blocks reduce seek time.
The data is highly available as the same block is present at multiple
datanodes.
Even if multiple datanodes are down we can still do our work, thus making it
highly reliable.
High fault tolerance.
Limitations: Though HDFS provides many features, there are some areas
where it doesn't work well.
Low-latency data access: Applications that require low-latency access to
data, i.e., in the range of milliseconds, will not work well with HDFS, because
HDFS is designed for high throughput of data even at the cost of latency.
Small file problem: Having lots of small files results in lots of seeks and
lots of movement from one datanode to another to retrieve each small file,
which is a very inefficient data access pattern.
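A back-of-the-envelope sketch of why small files also strain the NameNode, using the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per file, directory, or block:

```java
public class SmallFileMath {
    public static void main(String[] args) {
        long files = 10_000_000L;       // ten million small files
        long objects = files * 2;       // each file entry plus ~1 block per file
        long heapBytes = objects * 150; // ~150 bytes of NameNode heap per object
        // ~2.8 GB of NameNode RAM consumed by metadata alone,
        // before a single byte of file data is read
        System.out.println(heapBytes / (1024 * 1024) + " MB of NameNode heap");
    }
}
```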