The Hadoop Distributed File System

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler


Yahoo!
Sunnyvale, California USA
{Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com

Abstract—The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.

Keywords: Hadoop, HDFS, distributed file system

978-1-4244-7153-9/10/$26.00 ©2010 IEEE

I. INTRODUCTION AND RELATED WORK

Hadoop [1][16][19] provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce [3] paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and IO bandwidth by simply adding commodity servers. Hadoop clusters at Yahoo! span 25 000 servers, and store 25 petabytes of application data, with the largest cluster being 3500 servers. One hundred other organizations worldwide report using Hadoop.

HDFS         Distributed file system. Subject of this paper!
MapReduce    Distributed computation framework
HBase        Column-oriented table service
Pig          Dataflow language and parallel execution framework
Hive         Data warehouse infrastructure
ZooKeeper    Distributed coordination service
Chukwa       System for collecting management data
Avro         Data serialization system

Table 1. Hadoop project components

Hadoop is an Apache project; all components are available via the Apache open source license. Yahoo! has developed and contributed to 80% of the core of Hadoop (HDFS and MapReduce). HBase was originally developed at Powerset, now a department at Microsoft. Hive [15] was originated and developed at Facebook. Pig [4], ZooKeeper [6], and Chukwa were originated and developed at Yahoo! Avro was originated at Yahoo! and is being co-developed with Cloudera.

HDFS is the file system component of Hadoop. While the interface to HDFS is patterned after the UNIX file system, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand.

HDFS stores file system metadata and application data separately. As in other distributed file systems, like PVFS [2][14], Lustre [7] and GFS [5][8], HDFS stores metadata on a dedicated server, called the NameNode. Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.

Unlike Lustre and PVFS, the DataNodes in HDFS do not use data protection mechanisms such as RAID to make the data durable. Instead, like GFS, the file content is replicated on multiple DataNodes for reliability. While ensuring data durability, this strategy has the added advantage that data transfer bandwidth is multiplied, and there are more opportunities for locating computation near the needed data.

Several distributed file systems have or are exploring truly distributed implementations of the namespace. Ceph [17] has a cluster of namespace servers (MDS) and uses a dynamic subtree partitioning algorithm in order to map the namespace tree to MDSs evenly. GFS is also evolving into a distributed namespace implementation [8]. The new GFS will have hundreds of namespace servers (masters) with 100 million files per master. Lustre [7] has an implementation of clustered namespace on its roadmap for the Lustre 2.2 release. The intent is to stripe a directory over multiple metadata servers (MDS), each of which contains a disjoint portion of the namespace. A file is assigned to a particular MDS using a hash function on the file name.

II. ARCHITECTURE

A. NameNode

The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, and namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file) and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file). The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes
(the physical location of file data). An HDFS client wanting to read a file first contacts the NameNode for the locations of data blocks comprising the file and then reads block contents from the DataNode closest to the client. When writing data, the client requests the NameNode to nominate a suite of three DataNodes to host the block replicas. The client then writes data to the DataNodes in a pipeline fashion. The current design has a single NameNode for each cluster. The cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently.

HDFS keeps the entire namespace in RAM. The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image. The persistent record of the image stored in the local host's native file system is called a checkpoint. The NameNode also stores the modification log of the image, called the journal, in the local host's native file system. For improved durability, redundant copies of the checkpoint and journal can be made at other servers. During restarts the NameNode restores the namespace by reading the checkpoint and replaying the journal. The locations of block replicas may change over time and are not part of the persistent checkpoint.

B. DataNodes

Each block replica on a DataNode is represented by two files in the local host's native file system. The first file contains the data itself and the second file is the block's metadata, including checksums for the block data and the block's generation stamp. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional file systems. Thus, if a block is half full it needs only half of the space of the full block on the local drive.

During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode, the DataNode automatically shuts down.

The namespace ID is assigned to the file system instance when it is formatted. The namespace ID is persistently stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join the cluster, thus preserving the integrity of the file system.

The consistency of software versions is important because incompatible versions may cause data corruption or loss, and on large clusters of thousands of machines it is easy to overlook nodes that did not shut down properly prior to the software upgrade or were not available during the upgrade.

A DataNode that is newly initialized and without any namespace ID is permitted to join the cluster and receive the cluster's namespace ID.

After the handshake the DataNode registers with the NameNode. DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode, which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.

A DataNode identifies block replicas in its possession to the NameNode by sending a block report. A block report contains the block id, the generation stamp and the length for each block replica the server hosts. The first block report is sent immediately after the DataNode registration. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the cluster.

During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode in ten minutes, the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules creation of new replicas of those blocks on other DataNodes.

Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode's space allocation and load balancing decisions.

The NameNode does not directly call DataNodes. It uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to:

• replicate blocks to other nodes;
• remove local block replicas;
• re-register or shut down the node;
• send an immediate block report.

These commands are important for maintaining the overall system integrity and therefore it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations.

C. HDFS Client

User applications access the file system using the HDFS client, a code library that exports the HDFS file system interface.

Similar to most conventional file systems, HDFS supports operations to read, write and delete files, and operations to create and delete directories. The user references files and directories by paths in the namespace. The user application generally does not need to know that file system metadata and storage are on different servers, or that blocks have multiple replicas.

When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file. It then contacts a DataNode directly and requests the transfer of the desired block. When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node to node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the
client sends the further bytes of the file. Each choice of DataNodes is likely to be different. The interactions among the client, the NameNode and the DataNodes are illustrated in Fig. 1.

Figure 1. An HDFS client creates a new file by giving its path to the NameNode. For each block of the file, the NameNode returns a list of DataNodes to host its replicas. The client then pipelines data to the chosen DataNodes, which eventually confirm the creation of the block replicas to the NameNode.

Unlike conventional file systems, HDFS provides an API that exposes the locations of a file's blocks. This allows applications like the MapReduce framework to schedule a task to where the data are located, thus improving the read performance. It also allows an application to set the replication factor of a file. By default a file's replication factor is three. For critical files, or files which are accessed very often, a higher replication factor improves their tolerance against faults and increases their read bandwidth.

D. Image and Journal

The namespace image is the file system metadata that describes the organization of application data as directories and files. A persistent record of the image written to disk is called a checkpoint. The journal is a write-ahead commit log for changes to the file system that must be persistent. For each client-initiated transaction, the change is recorded in the journal, and the journal file is flushed and synched before the change is committed to the HDFS client. The checkpoint file is never changed by the NameNode; it is replaced in its entirety when a new checkpoint is created during restart, when requested by the administrator, or by the CheckpointNode described in the next section. During startup the NameNode initializes the namespace image from the checkpoint, and then replays changes from the journal until the image is up-to-date with the last state of the file system. A new checkpoint and empty journal are written back to the storage directories before the NameNode starts serving clients.

If either the checkpoint or the journal is missing, or becomes corrupt, the namespace information will be lost partly or entirely. In order to preserve this critical information, HDFS can be configured to store the checkpoint and journal in multiple storage directories. Recommended practice is to place the directories on different volumes, and for one storage directory to be on a remote NFS server. The first choice prevents loss from single volume failures, and the second choice protects against failure of the entire node. If the NameNode encounters an error writing the journal to one of the storage directories, it automatically excludes that directory from the list of storage directories. The NameNode automatically shuts itself down if no storage directory is available.

The NameNode is a multithreaded system and processes requests simultaneously from multiple clients. Saving a transaction to disk becomes a bottleneck since all other threads need to wait until the synchronous flush-and-sync procedure initiated by one of them is complete. In order to optimize this process, the NameNode batches multiple transactions initiated by different clients. When one of the NameNode's threads initiates a flush-and-sync operation, all transactions batched at that time are committed together. Remaining threads only need to check that their transactions have been saved and do not need to initiate a flush-and-sync operation.

E. CheckpointNode

The NameNode in HDFS, in addition to its primary role serving client requests, can alternatively execute one of two other roles: a CheckpointNode or a BackupNode. The role is specified at node startup.

The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. The CheckpointNode usually runs on a different host from the NameNode since it has the same memory requirements as the NameNode. It downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint back to the NameNode.
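The checkpoint-plus-journal recovery described above can be sketched in a few lines. The dict-based image and the operation names below are illustrative stand-ins, not HDFS's actual on-disk formats:

```python
# Sketch of checkpoint + journal recovery. The dict-based image and
# the operation names are illustrative, not HDFS's on-disk formats.

def replay_journal(checkpoint, journal):
    """Rebuild the namespace image: start from the last persistent
    checkpoint, then re-apply journaled transactions in order."""
    image = dict(checkpoint)          # namespace image: path -> attributes
    for op, path, attrs in journal:
        if op == "create":
            image[path] = dict(attrs)
        elif op == "delete":
            image.pop(path, None)
        elif op == "set_attrs":
            image[path].update(attrs)
    return image

def make_checkpoint(checkpoint, journal):
    """What a CheckpointNode produces: a new checkpoint that folds in
    the journal, plus an empty journal."""
    return replay_journal(checkpoint, journal), []

ckpt = {"/user": {"type": "dir"}}
log = [
    ("create", "/user/data.txt", {"type": "file", "replication": 3}),
    ("set_attrs", "/user/data.txt", {"replication": 2}),
    ("delete", "/user", None),
]
new_ckpt, new_log = make_checkpoint(ckpt, log)
# new_ckpt == {"/user/data.txt": {"type": "file", "replication": 2}}; new_log == []
```

In these terms, the NameNode's restart path is `replay_journal` alone, while a CheckpointNode periodically runs the equivalent of `make_checkpoint` on the files it downloads from the NameNode.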

Creating periodic checkpoints is one way to protect the file system metadata. The system can start from the most recent checkpoint if all other persistent copies of the namespace image or journal are unavailable.

Creating a checkpoint also lets the NameNode truncate the tail of the journal when the new checkpoint is uploaded to the NameNode. HDFS clusters run for prolonged periods of time without restarts, during which the journal constantly grows. If the journal grows very large, the probability of loss or corruption of the journal file increases. Also, a very large journal extends the time required to restart the NameNode. For a large cluster, it takes an hour to process a week-long journal. Good practice is to create a daily checkpoint.

F. BackupNode

A recently introduced feature of HDFS is the BackupNode. Like a CheckpointNode, the BackupNode is capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.

The BackupNode accepts the journal stream of namespace transactions from the active NameNode, saves them to its own storage directories, and applies these transactions to its own namespace image in memory. The NameNode treats the BackupNode as a journal store, the same as it treats journal files in its storage directories. If the NameNode fails, the BackupNode's image in memory and the checkpoint on disk are a record of the latest namespace state.

The BackupNode can create a checkpoint without downloading checkpoint and journal files from the active NameNode, since it already has an up-to-date namespace image in its memory. This makes the checkpoint process on the BackupNode more efficient, as it only needs to save the namespace into its local storage directories.

The BackupNode can be viewed as a read-only NameNode. It contains all file system metadata information except for block locations. It can perform all operations of the regular NameNode that do not involve modification of the namespace or knowledge of block locations. Use of a BackupNode provides the option of running the NameNode without persistent storage, delegating responsibility for persisting the namespace state to the BackupNode.

G. Upgrades, File System Snapshots

During software upgrades the possibility of corrupting the system due to software bugs or human mistakes increases. The purpose of creating snapshots in HDFS is to minimize potential damage to the data stored in the system during upgrades.

The snapshot mechanism lets administrators persistently save the current state of the file system, so that if the upgrade results in data loss or corruption it is possible to roll back the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.

The snapshot (only one can exist) is created at the cluster administrator's option whenever the system is started. If a snapshot is requested, the NameNode first reads the checkpoint and journal files and merges them in memory. Then it writes the new checkpoint and the empty journal to a new location, so that the old checkpoint and journal remain unchanged.

During handshake the NameNode instructs DataNodes whether to create a local snapshot. The local snapshot on the DataNode cannot be created by replicating the data file directories, as this would require doubling the storage capacity of every DataNode on the cluster. Instead each DataNode creates a copy of the storage directory and hard links existing block files into it. When the DataNode removes a block it removes only the hard link, and block modifications during appends use the copy-on-write technique. Thus old block replicas remain untouched in their old directories.

The cluster administrator can choose to roll back HDFS to the snapshot state when restarting the system. The NameNode recovers the checkpoint saved when the snapshot was created. DataNodes restore the previously renamed directories and initiate a background process to delete block replicas created after the snapshot was made. Having chosen to roll back, there is no provision to roll forward. The cluster administrator can recover the storage occupied by the snapshot by commanding the system to abandon the snapshot, thus finalizing the software upgrade.

System evolution may lead to a change in the format of the NameNode's checkpoint and journal files, or in the data representation of block replica files on DataNodes. The layout version identifies the data representation formats, and is persistently stored in the NameNode's and the DataNodes' storage directories. During startup each node compares the layout version of the current software with the version stored in its storage directories and automatically converts data from older formats to the newer ones. The conversion requires the mandatory creation of a snapshot when the system restarts with the new software layout version.

HDFS does not separate layout versions for the NameNode and DataNodes because snapshot creation must be an all-cluster effort rather than a node-selective event. If an upgraded NameNode purges its image due to a software bug, then backing up only the namespace state still results in total data loss, as the NameNode will not recognize the blocks reported by DataNodes and will order their deletion. Rolling back in this case will recover the metadata, but the data itself will be lost. A coordinated snapshot is required to avoid such cataclysmic destruction.

III. FILE I/O OPERATIONS AND REPLICA MANAGEMENT

A. File Read and Write

An application adds data to HDFS by creating a new file and writing the data to it. After the file is closed, the bytes written cannot be altered or removed, except that new data can be added to the file by reopening the file for append. HDFS implements a single-writer, multiple-reader model.

The HDFS client that opens a file for writing is granted a lease for the file; no other client can write to the file. The writing client periodically renews the lease by sending a heartbeat to the NameNode. When the file is closed, the lease is revoked.

The lease duration is bound by a soft limit and a hard limit. Until the soft limit expires, the writer is certain of exclusive access to the file. If the soft limit expires and the client fails to close the file or renew the lease, another client can preempt the lease. If the hard limit expires (one hour) and the client has failed to renew the lease, HDFS assumes that the client has quit, automatically closes the file on behalf of the writer, and recovers the lease. The writer's lease does not prevent other clients from reading the file; a file may have many concurrent readers.

An HDFS file consists of blocks. When there is a need for a new block, the NameNode allocates a block with a unique block ID and determines a list of DataNodes to host replicas of the block. The DataNodes form a pipeline, the order of which minimizes the total network distance from the client to the last DataNode. Bytes are pushed to the pipeline as a sequence of packets. The bytes that an application writes first buffer at the client side. After a packet buffer is filled (typically 64 KB), the data are pushed to the pipeline. The next packet can be pushed to the pipeline before receiving the acknowledgement for the previous packets. The number of outstanding packets is limited by the outstanding packets window size of the client.

After data are written to an HDFS file, HDFS does not provide any guarantee that data are visible to a new reader until the file is closed. If a user application needs the visibility guarantee, it can explicitly call the hflush operation. Then the current packet is immediately pushed to the pipeline, and the hflush operation will wait until all DataNodes in the pipeline acknowledge the successful transmission of the packet. All data written before the hflush operation are then certain to be visible to readers.

Figure 2. Data pipeline during block construction

If no error occurs, block construction goes through three stages as shown in Fig. 2, which illustrates a pipeline of three DataNodes (DN) and a block of five packets. In the picture, bold lines represent data packets, dashed lines represent acknowledgment messages, and thin lines represent control messages to set up and close the pipeline. Vertical lines represent activity at the client and the three DataNodes, where time proceeds from top to bottom. From t0 to t1 is the pipeline setup stage. The interval t1 to t2 is the data streaming stage, where t1 is the time when the first data packet gets sent and t2 is the time that the acknowledgment to the last packet gets received. Here an hflush operation transmits the second packet. The hflush indication travels with the packet data and is not a separate operation. The final interval t2 to t3 is the pipeline close stage for this block.

In a cluster of thousands of nodes, failures of a node (most commonly storage faults) are daily occurrences. A replica stored on a DataNode may become corrupted because of faults in memory, disk, or network. HDFS generates and stores checksums for each data block of an HDFS file. Checksums are verified by the HDFS client while reading to help detect any corruption caused either by the client, DataNodes, or network. When a client creates an HDFS file, it computes the checksum sequence for each block and sends it to a DataNode along with the data. A DataNode stores checksums in a metadata file separate from the block's data file. When HDFS reads a file, each block's data and checksums are shipped to the client. The client computes the checksum for the received data and verifies that the newly computed checksums match the checksums it received. If not, the client notifies the NameNode of the corrupt replica and then fetches a different replica of the block from another DataNode.

When a client opens a file to read, it fetches the list of blocks and the locations of each block replica from the NameNode. The locations of each block are ordered by their distance from the reader. When reading the content of a block, the client tries the closest replica first. If the read attempt fails, the client tries the next replica in sequence. A read may fail if the target DataNode is unavailable, the node no longer hosts a replica of the block, or the replica is found to be corrupt when checksums are tested.

HDFS permits a client to read a file that is open for writing. When reading a file open for writing, the length of the last block still being written is unknown to the NameNode. In this case, the client asks one of the replicas for the latest length before starting to read its content.

The design of HDFS I/O is particularly optimized for batch processing systems, like MapReduce, which require high throughput for sequential reads and writes. However, much effort has been put into improving its read/write response time in order to support applications like Scribe, which provides real-time data streaming to HDFS, or HBase, which provides random, real-time access to large tables.

B. Block Placement

For a large cluster, it may not be practical to connect all nodes in a flat topology. A common practice is to spread the nodes across multiple racks. Nodes of a rack share a switch, and rack switches are connected by one or more core switches. Communication between two nodes in different racks has to go through multiple switches. In most cases, network bandwidth
between nodes in the same rack is greater than network bandwidth between nodes in different racks. Fig. 3 describes a cluster with two racks, each of which contains three nodes.

Figure 3. Cluster topology example: two racks, Rack 0 (DN00, DN01, DN02) and Rack 1 (DN10, DN11, DN12), joined under a common root switch.

HDFS estimates the network bandwidth between two nodes by their distance. The distance from a node to its parent node is assumed to be one. The distance between two nodes can be calculated by summing up their distances to their closest common ancestor. The shorter the distance between two nodes, the greater the bandwidth they can utilize to transfer data.

HDFS allows an administrator to configure a script that returns a node's rack identification given a node's address. The NameNode is the central place that resolves the rack location of each DataNode. When a DataNode registers with the NameNode, the NameNode runs the configured script to decide which rack the node belongs to. If no such script is configured, the NameNode assumes that all the nodes belong to a default single rack.

The placement of replicas is critical to HDFS data reliability and read/write performance. A good replica placement policy should improve data reliability, availability, and network bandwidth utilization. Currently HDFS provides a configurable block placement policy interface so that users and researchers can experiment and test any policy that is optimal for their applications.

The default HDFS block placement policy provides a tradeoff between minimizing the write cost and maximizing data reliability, availability, and aggregate read bandwidth. When a new block is created, HDFS places the first replica on the node where the writer is located, and the second and the third replicas on two different nodes in a different rack; the rest are placed on random nodes, with the restrictions that no more than one replica is placed at any one node and no more than two replicas are placed in the same rack when the number of replicas is less than twice the number of racks. The choice to place the second and third replicas on a different rack better distributes the block replicas for a single file across the cluster. If the first two replicas were placed on the same rack, then for any file, two-thirds of its block replicas would be on the same rack.

After all target nodes are selected, nodes are organized as a pipeline in the order of their proximity to the first replica. Data are pushed to nodes in this order. For reading, the NameNode first checks if the client's host is located in the cluster. If yes, block locations are returned to the client in the order of their closeness to the reader. The block is read from DataNodes in this preference order. (It is usual for MapReduce applications to run on cluster nodes, but as long as a host can connect to the NameNode and DataNodes, it can execute the HDFS client.)

This policy reduces the inter-rack and inter-node write traffic and generally improves write performance. Because the chance of a rack failure is far less than that of a node failure, this policy does not impact data reliability and availability guarantees. In the usual case of three replicas, it can also reduce the aggregate network bandwidth used when reading data, since a block is placed in only two unique racks rather than three.

The default HDFS replica placement policy can be summarized as follows:

1. No DataNode contains more than one replica of any block.
2. No rack contains more than two replicas of the same block, provided there are sufficient racks on the cluster.

C. Replication Management

The NameNode endeavors to ensure that each block always has the intended number of replicas. The NameNode detects that a block has become under- or over-replicated when a block report from a DataNode arrives. When a block becomes over-replicated, the NameNode chooses a replica to remove. The NameNode prefers not to reduce the number of racks that host replicas, and secondly prefers to remove a replica from the DataNode with the least amount of available disk space. The goal is to balance storage utilization across DataNodes without reducing the block's availability.

When a block becomes under-replicated, it is put in the replication priority queue. A block with only one replica has the highest priority, while a block with a number of replicas that is greater than two thirds of its replication factor has the lowest priority. A background thread periodically scans the head of the replication queue to decide where to place new replicas. Block replication follows a policy similar to that of new block placement. If the number of existing replicas is one, HDFS places the next replica on a different rack. In case the block has two existing replicas, if the two existing replicas are on the same rack, the third replica is placed on a different rack; otherwise, the third replica is placed on a different node in the same rack as an existing replica. Here the goal is to reduce the cost of creating new replicas.

The NameNode also makes sure that not all replicas of a block are located on one rack. If the NameNode detects that a block's replicas end up at one rack, the NameNode treats the block as under-replicated and replicates the block to a different rack using the same block placement policy described above. After the NameNode receives the notification that the replica is created, the block becomes over-replicated. The NameNode then decides to remove an old replica, because the over-replication policy prefers not to reduce the number of racks.

D. Balancer

The HDFS block placement strategy does not take into account DataNode disk space utilization. This is to avoid placing new (more likely to be referenced) data at a small subset of
6
the DataNodes. Therefore data might not always be placed uniformly across DataNodes. Imbalance also occurs when new nodes are added to the cluster.

The balancer is a tool that balances disk space usage on an HDFS cluster. It takes a threshold value as an input parameter, which is a fraction in the range (0, 1). A cluster is balanced if, for each DataNode, the utilization of the node (the ratio of used space at the node to the total capacity of the node) differs from the utilization of the whole cluster (the ratio of used space in the cluster to the total capacity of the cluster) by no more than the threshold value.

The tool is deployed as an application program that can be run by the cluster administrator. It iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization. One key requirement for the balancer is to maintain data availability. When choosing a replica to move and deciding its destination, the balancer guarantees that the decision does not reduce either the number of replicas or the number of racks.

The balancer optimizes the balancing process by minimizing the inter-rack data copying. If the balancer decides that a replica A needs to be moved to a different rack, and the destination rack happens to have a replica B of the same block, the data will be copied from replica B instead of replica A.

A second configuration parameter limits the bandwidth consumed by rebalancing operations. The higher the allowed bandwidth, the faster a cluster can reach the balanced state, but at the cost of greater competition with application processes.

E. Block Scanner

Each DataNode runs a block scanner that periodically scans its block replicas and verifies that stored checksums match the block data. In each scan period, the block scanner adjusts the read bandwidth in order to complete the verification in a configurable period. If a client reads a complete block and checksum verification succeeds, it informs the DataNode. The DataNode treats it as a verification of the replica.

The verification time of each block is stored in a human-readable log file. At any time there are up to two files in the top-level DataNode directory, the current and prev logs. New verification times are appended to the current file. Correspondingly, each DataNode has an in-memory scanning list ordered by the replica's verification time.

Whenever a read client or a block scanner detects a corrupt block, it notifies the NameNode. The NameNode marks the replica as corrupt, but does not schedule deletion of the replica immediately. Instead, it starts to replicate a good copy of the block. Only when the good replica count reaches the replication factor of the block is the corrupt replica scheduled to be removed. This policy aims to preserve data as long as possible, so even if all replicas of a block are corrupt, the policy allows the user to retrieve its data from the corrupt replicas.

F. Decommissioning

The cluster administrator specifies which nodes can join the cluster by listing the host addresses of nodes that are permitted to register and the host addresses of nodes that are not permitted to register. The administrator can command the system to re-evaluate these include and exclude lists. A present member of the cluster that becomes excluded is marked for decommissioning. Once a DataNode is marked as decommissioning, it will not be selected as the target of replica placement, but it will continue to serve read requests. The NameNode starts to schedule replication of its blocks to other DataNodes. Once the NameNode detects that all blocks on the decommissioning DataNode are replicated, the node enters the decommissioned state. Then it can be safely removed from the cluster without jeopardizing any data availability.

G. Inter-Cluster Data Copy

When working with large datasets, copying data into and out of an HDFS cluster is daunting. HDFS provides a tool called DistCp for large inter/intra-cluster parallel copying. It is a MapReduce job; each of the map tasks copies a portion of the source data into the destination file system. The MapReduce framework automatically handles parallel task scheduling, error detection, and recovery.

IV. PRACTICE AT YAHOO!

Large HDFS clusters at Yahoo! include about 3500 nodes. A typical cluster node has:

• 2 quad-core Xeon processors @ 2.5 GHz
• Red Hat Enterprise Linux Server Release 5.1
• Sun Java JDK 1.6.0_13-b03
• 4 directly attached SATA drives (one terabyte each)
• 16 GB RAM
• 1-gigabit Ethernet

Seventy percent of the disk space is allocated to HDFS. The remainder is reserved for the operating system (Red Hat Linux), logs, and space to spill the output of map tasks. (MapReduce intermediate data are not stored in HDFS.) Forty nodes in a single rack share an IP switch. The rack switches are connected to each of eight core switches. The core switches provide connectivity between racks and to out-of-cluster resources. For each cluster, the NameNode and the BackupNode hosts are specially provisioned with up to 64 GB RAM; application tasks are never assigned to those hosts. In total, a cluster of 3500 nodes has 9.8 PB of storage available as blocks that are replicated three times, yielding a net 3.3 PB of storage for user applications. As a convenient approximation, one thousand nodes represent one PB of application storage. Over the years that HDFS has been in use (and into the future), the hosts selected as cluster nodes benefit from improved technologies. New cluster nodes always have faster processors, bigger disks, and larger RAM. Slower, smaller nodes are retired or relegated to clusters reserved for development and testing of Hadoop. The choice of how to provision a cluster node is largely an issue of economically purchasing computation and storage. HDFS does not compel a particular ratio of computation to storage, or set a limit on the amount of storage attached to a cluster node.

On an example large cluster (3500 nodes), there are about 60 million files. Those files have 63 million blocks. As each
block typically is replicated three times, every data node hosts 54 000 block replicas. Each day user applications will create two million new files on the cluster. The 25 000 nodes in Hadoop clusters at Yahoo! provide 25 PB of on-line data storage. At the start of 2010, this is a modest but growing fraction of the data processing infrastructure at Yahoo!. Yahoo! began to investigate MapReduce programming with a distributed file system in 2004. The Apache Hadoop project was founded in 2006. By the end of that year, Yahoo! had adopted Hadoop for internal use and had a 300-node cluster for development. Since then HDFS has become integral to the back office at Yahoo!. The flagship application for HDFS has been the production of the Web Map, an index of the World Wide Web that is a critical component of search (75 hours elapsed time, 500 terabytes of MapReduce intermediate data, 300 terabytes total output). More applications are moving to Hadoop, especially those that analyze and model user behavior.

Becoming a key component of Yahoo!'s technology suite meant tackling technical problems that are the difference between being a research project and being the custodian of many petabytes of corporate data. Foremost are issues of robustness and durability of data. But also important are economical performance, provisions for resource sharing among members of the user community, and ease of administration by the system operators.

A. Durability of Data

Replication of data three times is a robust guard against loss of data due to uncorrelated node failures. It is unlikely Yahoo! has ever lost a block in this way; for a large cluster, the probability of losing a block during one year is less than 0.005. The key understanding is that about 0.8 percent of nodes fail each month. (Even if the node is eventually recovered, no effort is taken to recover data it may have hosted.) So for the sample large cluster as described above, a node or two is lost each day. That same cluster will re-create the 54 000 block replicas hosted on a failed node in about two minutes. (Re-replication is fast because it is a parallel problem that scales with the size of the cluster.) The probability of several nodes failing within two minutes such that all replicas of some block are lost is indeed small.

Correlated failure of nodes is a different threat. The most commonly observed fault in this regard is the failure of a rack or core switch. HDFS can tolerate losing a rack switch (each block has a replica on some other rack). Some failures of a core switch can effectively disconnect a slice of the cluster from multiple racks, in which case it is probable that some blocks will become unavailable. In either case, repairing the switch restores unavailable replicas to the cluster. Another kind of correlated failure is the accidental or deliberate loss of electrical power to the cluster. If the loss of power spans racks, it is likely that some blocks will become unavailable. But restoring power may not be a remedy, because one-half to one percent of the nodes will not survive a full power-on restart. Statistically, and in practice, a large cluster will lose a handful of blocks during a power-on restart. (The strategy of deliberately restarting one node at a time over a period of weeks to identify nodes that will not survive a restart has not been tested.)

In addition to total failures of nodes, stored data can be corrupted or lost. The block scanner scans all blocks in a large cluster each fortnight and finds about 20 bad replicas in the process.

B. Caring for the Commons

As the use of HDFS has grown, the file system itself has had to introduce means to share the resource within a large and diverse user community. The first such feature was a permissions framework closely modeled on the Unix permissions scheme for files and directories. In this framework, files and directories have separate access permissions for the owner, for other members of the user group associated with the file or directory, and for all other users. The principal differences between Unix (POSIX) and HDFS are that ordinary files in HDFS have neither "execute" permissions nor "sticky" bits.

In the present permissions framework, user identity is weak: you are who your host says you are. When accessing HDFS, the application client simply queries the local operating system for user identity and group membership. A stronger identity model is under development. In the new framework, the application client must present to the name system credentials obtained from a trusted source. Different credential administrations are possible; the initial implementation will use Kerberos. The user application can use the same framework to confirm that the name system also has a trustworthy identity. And the name system can also demand credentials from each of the data nodes participating in the cluster.

The total space available for data storage is set by the number of data nodes and the storage provisioned for each node. Early experience with HDFS demonstrated a need for some means to enforce the resource allocation policy across user communities. Not only must fairness of sharing be enforced, but when a user application might involve thousands of hosts writing data, protection against an application inadvertently exhausting resources is also important. For HDFS, because the system metadata are always in RAM, the size of the namespace (the number of files and directories) is also a finite resource. To manage storage and namespace resources, each directory may be assigned a quota for the total space occupied by files in the sub-tree of the namespace beginning at that directory. A separate quota may also be set for the total number of files and directories in the sub-tree.

While the architecture of HDFS presumes most applications will stream large data sets as input, the MapReduce programming framework can have a tendency to generate many small output files (one from each reduce task), further stressing the namespace resource. As a convenience, a directory sub-tree can be collapsed into a single Hadoop Archive file. A HAR file is similar to a familiar tar, JAR, or Zip file, but file system operations can address the individual files in the archive, and a HAR file can be used transparently as the input to a MapReduce job.

C. Benchmarks

A design goal of HDFS is to provide very high I/O bandwidth for large data sets. There are three kinds of measurements that test that goal.
• What is bandwidth observed from a contrived benchmark?
• What bandwidth is observed in a production cluster with a mix of user jobs?
• What bandwidth can be obtained by the most carefully constructed large-scale user application?

The statistics reported here were obtained from clusters of at least 3500 nodes. At this scale, total bandwidth is linear with the number of nodes, and so the interesting statistic is the bandwidth per node. These benchmarks are available as part of the Hadoop codebase.

The DFSIO benchmark measures average throughput for read, write, and append operations. DFSIO is an application available as part of the Hadoop distribution. This MapReduce program reads/writes/appends random data from/to large files. Each map task within the job executes the same operation on a distinct file, transfers the same amount of data, and reports its transfer rate to the single reduce task. The reduce task then summarizes the measurements. The test is run without contention from other applications, and the number of map tasks is chosen to be proportional to the cluster size. It is designed to measure performance only during data transfer, and excludes the overheads of task scheduling, startup, and the reduce task.

• DFSIO Read: 66 MB/s per node
• DFSIO Write: 40 MB/s per node

For a production cluster, the number of bytes read and written is reported to a metrics collection system. These averages are taken over a few weeks and represent the utilization of the cluster by jobs from hundreds of individual users. On average, each node was occupied by one or two application tasks at any moment (fewer than the number of processor cores available).

• Busy Cluster Read: 1.02 MB/s per node
• Busy Cluster Write: 1.09 MB/s per node

Bytes (TB)   Nodes   Maps     Reduces   Time       HDFS I/O Bytes/s
                                                   Aggregate (GB)   Per Node (MB)
1            1460    8000     2700      62 s       32               22.1
1000         3658    80 000   20 000    58 500 s   34.2             9.35

Table 2. Sort benchmark for one terabyte and one petabyte of data. Each data record is 100 bytes with a 10-byte key. The test program is a general sorting procedure that is not specialized for the record size. In the terabyte sort, the block replication factor was set to one, a modest advantage for a short test. In the petabyte sort, the replication factor was set to two so that the test would confidently complete in case of a (not unexpected) node failure.

At the beginning of 2009, Yahoo! participated in the Gray Sort competition [9]. The nature of this task stresses the system's ability to move data from and to the file system (it really isn't about sorting). The competitive aspect means that the results in Table 2 are about the best a user application can achieve with the current design and hardware. The I/O rate in the last column is the combination of reading the input and writing the output from and to HDFS. In the second row, while the rate for HDFS is reduced, the total I/O per node will be about double, because for the larger (petabyte!) data set the MapReduce intermediates must also be written to and read from disk. In the smaller test, there is no need to spill the MapReduce intermediates to disk; they are buffered in the memory of the tasks.

Large clusters require that the HDFS NameNode support the number of client operations expected in a large cluster. The NNThroughput benchmark is a single-node process which starts the NameNode application and runs a series of client threads on the same node. Each client thread performs the same NameNode operation repeatedly by directly calling the NameNode method implementing this operation. The benchmark measures the number of operations per second performed by the NameNode. The benchmark is designed to avoid communication overhead caused by RPC connections and serialization, and therefore runs clients locally rather than remotely from different nodes. This provides an upper bound on pure NameNode performance.

Operation                   Throughput (ops/s)
Open file for read          126 100
Create file                 5600
Rename file                 8300
Delete file                 20 700
DataNode Heartbeat          300 000
Blocks report (blocks/s)    639 700

Table 3. NNThroughput benchmark

V. FUTURE WORK

This section presents some of the future work that the Hadoop team at Yahoo! is considering; Hadoop being an open source project implies that new features and changes are decided by the Hadoop development community at large.

The Hadoop cluster is effectively unavailable when its NameNode is down. Given that Hadoop is used primarily as a batch system, restarting the NameNode has been a satisfactory recovery means. However, we have taken steps towards automated failover. Currently a BackupNode receives all transactions from the primary NameNode. This will allow a failover to a warm or even a hot BackupNode if we send block reports to both the primary NameNode and the BackupNode. A few Hadoop users outside Yahoo! have experimented with manual failover. Our plan is to use ZooKeeper, Yahoo!'s distributed consensus technology, to build an automated failover solution.

Scalability of the NameNode [13] has been a key struggle. Because the NameNode keeps all the namespace and block locations in memory, the size of the NameNode heap has limited the number of files and also the number of blocks addressable. The main challenge with the NameNode has been that when its memory usage is close to the maximum, the NameNode becomes unresponsive due to Java garbage collection and sometimes requires a restart. While we have encouraged our users to create larger files, this has not happened, since it would require changes in application behavior. We have added quotas to manage the usage and have provided an archive tool. However, these do not fundamentally address the scalability problem.

Our near-term solution to scalability is to allow multiple namespaces (and NameNodes) to share the physical storage within a cluster. We are extending our block IDs to be prefixed by block pool identifiers. Block pools are analogous to LUNs in a SAN storage system, and a namespace with its pool of blocks is analogous to a file system volume.

This approach is fairly simple and requires minimal changes to the system. It offers a number of advantages besides scalability: it isolates the namespaces of different sets of applications and improves the overall availability of the cluster. It also generalizes the block storage abstraction to allow other services to use the block storage service, perhaps with a different namespace structure. We plan to explore other approaches to scaling, such as storing only a partial namespace in memory and a truly distributed implementation of the NameNode, in the future. In particular, our assumption that applications will create a small number of large files was flawed. As noted earlier, changing application behavior is hard. Furthermore, we are seeing new classes of applications for HDFS that need to store a large number of smaller files.

The main drawback of multiple independent namespaces is the cost of managing them, especially if the number of namespaces is large. We are also planning to use application- or job-centric namespaces rather than cluster-centric namespaces; this is analogous to the per-process namespaces that were used to deal with remote execution in distributed systems in the late 80s and early 90s [10][11][12].

Currently our clusters are less than 4000 nodes. We believe we can scale to much larger clusters with the solutions outlined above. However, we believe it is prudent to have multiple clusters rather than a single large cluster (say, three 6000-node clusters rather than a single 18 000-node cluster), as this allows much improved availability and isolation. To that end, we are planning to provide greater cooperation between clusters: for example, caching remotely accessed files, or reducing the replication factor of blocks when file sets are replicated across clusters.

VI. ACKNOWLEDGMENT

We would like to thank all members of the HDFS team at Yahoo!, present and past, for their hard work building the file system. We would like to thank all Hadoop committers and collaborators for their valuable contributions. Corinne Chandel drew illustrations for this paper.

REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org/
[2] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, "PVFS: A parallel file system for Linux clusters," in Proc. of the 4th Annual Linux Showcase and Conference, 2000, pp. 317–327.
[3] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proc. of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, CA, Dec. 2004.
[4] A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, "Building a High-Level Dataflow System on top of MapReduce: The Pig Experience," in Proc. of Very Large Data Bases, vol. 2, no. 2, 2009, pp. 1414–1425.
[5] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in Proc. of the ACM Symposium on Operating Systems Principles, Lake George, NY, Oct. 2003, pp. 29–43.
[6] F. P. Junqueira and B. C. Reed, "The life and times of a zookeeper," in Proc. of the 28th ACM Symposium on Principles of Distributed Computing, Calgary, AB, Canada, Aug. 10–12, 2009.
[7] Lustre File System. http://www.lustre.org
[8] M. K. McKusick and S. Quinlan, "GFS: Evolution on Fast-forward," ACM Queue, vol. 7, no. 7, New York, NY, Aug. 2009.
[9] O. O'Malley and A. C. Murthy, "Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds," May 2009. http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
[10] R. Pike, D. Presotto, K. Thompson, H. Trickey, and P. Winterbottom, "Use of Name Spaces in Plan 9," Operating Systems Review, vol. 27, no. 2, Apr. 1993, pp. 72–76.
[11] S. Radia, "Naming Policies in the Spring System," in Proc. of the 1st IEEE Workshop on Services in Distributed and Networked Environments, June 1994, pp. 164–171.
[12] S. Radia and J. Pachl, "The Per-Process View of Naming and Remote Execution," IEEE Parallel and Distributed Technology, vol. 1, no. 3, Aug. 1993, pp. 71–80.
[13] K. V. Shvachko, "HDFS Scalability: The limits to growth," ;login:, Apr. 2010, pp. 6–16.
[14] W. Tantisiriroj, S. Patil, and G. Gibson, "Data-intensive file systems for Internet services: A rose by any other name ...," Technical Report CMU-PDL-08-114, Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA, Oct. 2008.
[15] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive – A Warehousing Solution Over a Map-Reduce Framework," in Proc. of Very Large Data Bases, vol. 2, no. 2, Aug. 2009, pp. 1626–1629.
[16] J. Venner, Pro Hadoop. Apress, June 22, 2009.
[17] S. Weil, S. Brandt, E. Miller, D. Long, and C. Maltzahn, "Ceph: A Scalable, High-Performance Distributed File System," in Proc. of the 7th Symposium on Operating Systems Design and Implementation, Seattle, WA, Nov. 2006.
[18] B. Welch, M. Unangst, Z. Abbasi, G. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou, "Scalable Performance of the Panasas Parallel File System," in Proc. of the 6th USENIX Conference on File and Storage Technologies, San Jose, CA, Feb. 2008.
[19] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Yahoo! Press, June 5, 2009.