Unit 2-HDFS SGS
Design of HDFS
HDFS
The Hadoop Distributed File System (HDFS) was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and is designed to run on low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, files are stored across multiple machines. They are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the name node and data nodes help users to easily check the status of the cluster (see the example below).
Streaming access to file system data.
HDFS provides file permissions and authentication.
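For instance, basic cluster status and the contents of the file system can be checked from the command line (the output depends on the installation; the path is illustrative):
hdfs dfsadmin -report
hadoop fs -ls /user/hadoop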
HDFS Architecture
HDFS follows a master-slave architecture and has the following elements.
Name node
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system hosting the namenode acts as the master server and performs the following tasks −
It manages the file system namespace.
It regulates clients' access to files.
It executes file system operations such as renaming, closing, and opening files and directories.
Data node
The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according
to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
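As a sketch (the property name dfs.blocksize and the paths below are assumptions for a typical Hadoop 2.x installation), the block size can be overridden per file at write time, and fsck shows how a file was split into blocks:
hadoop fs -D dfs.blocksize=134217728 -put largefile.dat /user/hadoop/largefile.dat
hdfs fsck /user/hadoop/largefile.dat -files -blocks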
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should scale to hundreds of nodes per cluster to manage applications having huge datasets.
Hardware at data − A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, and others. Below are the supported commands.
For complete documentation please refer to FileSystemShell.html.
appendToFile
cat
checksum
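Typical usages of appendToFile, cat, and checksum (the paths below are illustrative):
hadoop fs -appendToFile localfile /user/hadoop/hadoopfile
hadoop fs -cat /user/hadoop/hadoopfile
hadoop fs -checksum /user/hadoop/hadoopfile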
chgrp
Change group association of files. The user must be the owner of files, or else a super-
user. Additional information is in the Permissions Guide HdfsPermissionsGuide.html
Options
The -R option will make the change recursively through the directory structure.
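Example (the group and path are illustrative):
hadoop fs -chgrp -R hadoopgroup /user/hadoop/dir1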
chmod
Change the permissions of files. With -R, make the change recursively through the
directory structure. The user must be the owner of the file, or else a super-user.
Additional information is in the Permissions guide HdfsPermissionsGuide.html.
Options
The -R option will make the change recursively through the directory structure.
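Example (the mode and path are illustrative):
hadoop fs -chmod -R 755 /user/hadoop/dir1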
chown
Change the owner of files. The user must be a super-user. Additional information is in the
Permissions Guide HdfsPermissionsGuide.html
Options
The -R option will make the change recursively through the directory structure.
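Example (the owner, group, and path are illustrative):
hadoop fs -chown -R hadoopuser:hadoopgroup /user/hadoop/dir1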
copyFromLocal
Similar to the put command, except that the source is restricted to a local file reference. For example, it can copy all the files inside a test folder on the edge node to a test folder in HDFS.
copyToLocal
Similar to the get command, except that the destination is restricted to a local file reference. For example, it can copy all the files inside a test folder in HDFS to a test folder on the edge node.
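Examples (paths are illustrative):
hadoop fs -copyFromLocal test/* /user/hadoop/test
hadoop fs -copyToLocal /user/hadoop/test/* test/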
count
Count the number of directories, files and bytes under the paths that match the specified
file pattern. The output columns with -count are: DIR_COUNT, FILE_COUNT,
CONTENT_SIZE, PATHNAME
Example:
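The path below is illustrative:
hadoop fs -count /user/hadoop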
cp
Copy files from source to destination. This command allows multiple sources as well in
which case the destination must be a directory.
Options:
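The -f option overwrites the destination if it already exists. Example (paths are illustrative):
hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir1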
createSnapshot
HDFS Snapshots are read-only point-in-time copies of the file system. Snapshots can be
taken on a subtree of the file system or the entire file system. Some common use cases of
snapshots are data backup, protection against user errors and disaster recovery. For more
information refer the link HdfsSnapshots.html
deleteSnapshot
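A minimal sequence for creating and deleting a snapshot (the path and snapshot name are illustrative; the directory must first be made snapshottable by an administrator):
hdfs dfsadmin -allowSnapshot /user/hadoop/important
hdfs dfs -createSnapshot /user/hadoop/important snap1
hdfs dfs -deleteSnapshot /user/hadoop/important snap1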
df
Displays free space.
Options:
The -h option will format file sizes in a human-readable fashion.
du
Displays sizes of files and directories contained in the given directory, or the length of a file in case it is just a file.
Options:
The -s option will result in an aggregate summary of file lengths being displayed, rather
than the individual files.
The -h option will format file sizes in a “human-readable” fashion (e.g 64.0m instead of
67108864)
Example:
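The paths below are illustrative:
hadoop fs -du -s -h /user/hadoop
hadoop fs -df -h /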
expunge
hadoop fs -expunge
find
Finds all files that match the specified expression and applies selected actions to them. If
no path is specified then defaults to the current working directory. If no expression is
specified then defaults to -print.
hadoop fs -find / -name test -print
get
Copy files to the local file system. Files that fail the CRC check may be copied with the -
ignorecrc option. Files and CRCs may be copied using the -crc option.
Example:
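The paths below are illustrative:
hadoop fs -get /user/hadoop/file1 localfile
hadoop fs -get -crc /user/hadoop/file1 .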
getfacl
Displays the Access Control Lists (ACLs) of files and directories. If a directory has a
default ACL, then getfacl also displays the default ACL.
Options:
-R: List the ACLs of all files and directories recursively.
getfattr
Displays the extended attribute names and values (if any) for a file or directory.
Options:
-R: Recursively list the attributes for all files and directories.
-n name: Dump the named extended attribute value.
-d: Dump all extended attribute values associated with pathname.
-e encoding: Encode values after retrieving them. Valid encodings are “text”, “hex”, and
“base64”. Values encoded as text strings are enclosed in double quotes (“), and values
encoded as hexadecimal and base64 are prefixed with 0x and 0s, respectively.
path: The file or directory.
Examples:
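The paths and the attribute name user.myAttr are illustrative:
hadoop fs -getfacl -R /user/hadoop/dir1
hadoop fs -getfattr -d /user/hadoop/file1
hadoop fs -getfattr -n user.myAttr /user/hadoop/file1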
getmerge
Takes a source directory and a destination file as input and concatenates files in src into
the destination local file. Optionally -nl can be set to enable adding a newline character
(LF) at the end of each file.
Examples:
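The paths are illustrative; on recent releases the -nl flag is given before the source:
hadoop fs -getmerge -nl /user/hadoop/logs /tmp/merged-logs.txt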
hadoop fs -help
ls
Lists the contents of a directory; for a file, displays its status.
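Examples (paths are illustrative):
hadoop fs -ls /user/hadoop
hadoop fs -ls -R /user/hadoop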
lsr
Recursive version of ls. Deprecated; use hadoop fs -ls -R instead.
mkdir
Options:
The -p option behavior is much like Unix mkdir -p, creating parent directories along the
path.
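Example (the path is illustrative):
hadoop fs -mkdir -p /user/hadoop/dir1/dir2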
moveFromLocal
Similar to put command, except that the source localsrc is deleted after it’s copied.
moveToLocal
Displays a "Not implemented yet" message.
mv
Moves files from source to destination. This command allows multiple sources as well in
which case the destination needs to be a directory. Moving files across file systems is not
permitted.
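Example (paths are illustrative):
hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir1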
put
Copy single src, or multiple srcs from local file system to the destination file system.
Also reads input from stdin and writes to destination file system.
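Examples (paths are illustrative; a dash as the source reads from stdin):
hadoop fs -put localfile1 localfile2 /user/hadoop/dir1
echo "hello" | hadoop fs -put - /user/hadoop/fromstdin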
renameSnapshot
Renames a snapshot. Usage: hdfs dfs -renameSnapshot <path> <oldName> <newName>.
rm
Options:
The -f option will not display a diagnostic message or modify the exit status to reflect an
error if the file does not exist.
The -R option deletes the directory and any content under it recursively.
The -r option is equivalent to -R.
The -skipTrash option will bypass trash, if enabled, and delete the specified file(s)
immediately. This can be useful when it is necessary to delete files from an over-quota
directory.
Example:
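The paths are illustrative:
hadoop fs -rm /user/hadoop/file1
hadoop fs -rm -r -skipTrash /user/hadoop/dir1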
rmdir
Delete a directory.
rmr
Recursive version of delete. Deprecated; use hadoop fs -rm -r instead.
setfacl
hadoop fs -setfacl [-R] [-b |-k -m |-x <acl_spec> <path>] |[--set <acl_spec> <path>]
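Examples (the user name hive and the paths are illustrative):
hadoop fs -setfacl -m user:hive:rw- /user/hadoop/file1
hadoop fs -setfacl -x user:hive /user/hadoop/file1
hadoop fs -setfacl -b /user/hadoop/file1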
setfattr
Sets an extended attribute name and value for a file or directory.
setrep
Changes the replication factor of a file. If path is a directory then the command
recursively changes the replication factor of all files under the directory tree rooted at
path.
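Example (the -w option waits for the replication to complete; the path is illustrative):
hadoop fs -setrep -w 3 /user/hadoop/dir1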
stat
tail
test
text
Takes a source file and outputs the file in text format. The allowed formats are zip and
TextRecordInputStream.
touchz
truncate
Truncate all files that match the specified file pattern to the specified length.
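Illustrative usages of tail, test, text, touchz, and truncate (paths are illustrative; truncate is available only on newer releases):
hadoop fs -tail /user/hadoop/file1
hadoop fs -test -e /user/hadoop/file1
hadoop fs -text /user/hadoop/file1.gz
hadoop fs -touchz /user/hadoop/emptyfile
hadoop fs -truncate -w 100 /user/hadoop/file1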
usage
Other file systems supported by Hadoop include:
KFS (CloudStore) − URI scheme: kfs; Java implementation: fs.kfs.KosmosFileSystem. A distributed file system written in C++, very similar to HDFS and GFS (Google File System).
S3 (block-based) − URI scheme: s3. Backed by Amazon S3; stores files in blocks (similar to HDFS) to overcome S3's 5 GB file size limit.
1. Capture Big Data: The sources can be extensive: structured, semi-structured, and unstructured data, streaming and real-time data sources, sensors, devices, machine-captured data, and many other sources. For data capturing and storage, the Hadoop ecosystem provides different data integrators, such as Flume, Sqoop, and Storm, depending on the type of data.
3. Distribute Results: The processed data can be used by the BI and analytics
system or the big data analytics system for performing analysis or visualization.
4. Feedback and Retain: The analyzed data can be fed back to Hadoop and used for further improvements.
MapReduce Data Flow
Input reader
The input reader reads the incoming data and splits it into data blocks of the appropriate size (64 MB to 128 MB). Each data block is associated with a Map function.
Once the input reader has read the data, it generates the corresponding key-value pairs.
The input files reside in HDFS.
Map function
The Map function processes the incoming key-value pairs and generates the corresponding output key-value pairs. The map input and output types may differ from each other.
Partition function
The partition function assigns the output of each Map function to the appropriate reducer. It is given the key and value (and the number of reducers) and returns the index of the reducer.
A sorting operation is performed on the input data for the Reduce function. Here, the data is compared using a comparison function and arranged in sorted form.
Reduce function
The Reduce function is invoked once for each unique key. The keys are already arranged in sorted order. The Reduce function iterates over the values associated with each key and generates the corresponding output.
Output writer
Once the data has flowed through all the above phases, the output writer executes. The role of the output writer is to write the Reduce output to stable storage.
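As a concrete illustration of this pipeline, the sketch below first runs the word-count example bundled with Hadoop, and then a Hadoop Streaming job that uses ordinary shell utilities as the Map and Reduce functions. The jar locations and HDFS paths are assumptions for a typical Hadoop 2.x installation and may differ on your cluster.
# Bundled word-count example: input is split, mapped, shuffled/sorted by key, and reduced to per-word counts.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /user/hadoop/input /user/hadoop/wc-output
# Hadoop Streaming: any executable can play the role of the Map or Reduce function.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/input -output /user/hadoop/stream-output \
  -mapper /bin/cat -reducer /usr/bin/wc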
SQOOP (SQL to Hadoop)
Sqoop parallelizes data transfer for optimal system utilization and fast
performance.
Apache Sqoop provides direct input i.e. it can map relational databases and
import directly into HBase and Hive.
Coupons.com uses the Sqoop tool for data transfer between its IBM Netezza data warehouse and the Hadoop environment.
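A minimal import sketch (the JDBC URL, credentials, table, and target directory are all illustrative):
sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P \
  --table customers --target-dir /user/hadoop/customers --num-mappers 4
Sqoop runs the import as a MapReduce job, with the --num-mappers value controlling how many parallel tasks read from the database.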
Flume
Logs are often not present at the location where developers can view them easily, developers have a limited number of tools available for processing logs, and they have confined capabilities for intelligently managing the log lifecycle. Apache Flume is designed to address the difficulties of both the operations group and developers by providing them an easy-to-use tool that can push logs from a bunch of application servers to various repositories via a highly configurable agent.
Sources define where the data is coming from, for instance a message queue or a file.
Sinks define the destination of the data pipelined from various sources.
Channels are pipes which establish connections between sources and sinks.
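A minimal sketch of an agent configuration in the current (Flume NG) properties format, wiring a netcat source to a logger sink through a memory channel; the agent name a1, host, and port are illustrative:
# example.conf: one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
The agent can then be started with:
flume-ng agent --conf conf --conf-file example.conf --name a1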
The master acts like a reliable configuration service which is used by nodes for retrieving their
configuration.
If the configuration for a particular node changes on the master then it will dynamically be
updated by the master.
A node is generally an event pipe in Hadoop Flume which reads from the source and writes to the sink. The characteristics and role of a Flume node are determined by the behaviour of its sources and sinks. Apache Flume is built with several source and sink options, but if none of them fits your requirements then developers can write their own. A Flume node can also be configured with the help of a sink decorator, which can interpret the event and transform it as it passes through. With all these basic primitives, developers can create different topologies to collect data on any application server and direct it to any log repository.
Mozilla uses Flume with Hadoop for the BuildBot project, along with Elasticsearch.
Capillary Technologies uses Flume for aggregating logs from 25 machines in production.
Apache Sqoop and Apache Flume work with different kinds of data sources. Flume functions well with streaming data sources that are generated continuously in a Hadoop environment, such as log files from multiple servers, whereas Apache Sqoop is designed to work well with any kind of relational database system that has JDBC connectivity. Sqoop can also import data from NoSQL databases like MongoDB or Cassandra, and it also allows direct data transfer to Hive or HDFS. When transferring data to Hive using Apache Sqoop, a table is created for the data, with the schema taken from the source database itself (see the sketch below).
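A hedged sketch of such a Hive import (connection details and table name are illustrative):
sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P \
  --table customers --hive-import
With --hive-import, Sqoop generates the Hive table definition from the source table's schema and loads the imported data into it.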
In Apache Flume data loading is event driven whereas in Apache Sqoop data load is not driven
by events.
Flume is a better choice when moving bulk streaming data from sources like JMS or a spooling directory, whereas Sqoop is an ideal fit if the data sits in databases like Teradata, Oracle, MySQL, PostgreSQL, or any other JDBC-compatible database.
In Apache Flume, data flows to HDFS through multiple channels whereas in Apache Sqoop
HDFS is the destination for importing data.
Apache Flume has an agent-based architecture, i.e. the code written in Flume is known as an agent, which is responsible for fetching data, whereas the Apache Sqoop architecture is based on connectors. The connectors in Sqoop know how to connect with the various data sources and fetch data accordingly.
Lastly, Sqoop and Flume cannot be used to achieve the same tasks, as they are developed specifically to serve different purposes. Apache Flume agents are designed to fetch streaming data like tweets from Twitter or log files from a web server, whereas Sqoop connectors are designed to work only with structured data sources and fetch data from them.
Apache Sqoop is mainly used for parallel data transfers and data imports, as it copies data quickly, whereas Apache Flume is used for collecting and aggregating data because of its distributed, reliable nature and highly available backup routes.
Hadoop I/O
Hadoop comes with a set of primitives for data I/O. Some of these are techniques that are more
general than Hadoop, such as data integrity and compression, but deserve special consideration
when dealing with multi-terabyte datasets. Others are Hadoop tools or APIs that form the
building blocks for developing distributed systems, such as serialization frameworks and on-disk
data structures.
Like any I/O subsystem, Hadoop comes with a set of primitives. These primitive considerations, although generic in nature, apply to the Hadoop I/O system as well, with some special connotations. Hadoop deals with multi-terabyte datasets; a special consideration of these primitives gives an idea of how Hadoop handles data input and output. This section quickly skims over these primitives to give a perspective on the Hadoop input/output system.
Data Integrity
Data integrity means that data should remain accurate and consistent across its storage, processing, and retrieval operations. To ensure that no data is lost or corrupted during persistence and processing, Hadoop maintains stringent data integrity constraints. Every read/write operation, whether on disk or over the network, is prone to errors, and the volume of data that Hadoop handles only aggravates the situation. The usual way to detect corrupt data is through checksums. A checksum is computed when data first enters the system and is sent across the channel during the retrieval process. The retrieving end computes the checksum again and matches it with the received one. If they match exactly, the data is deemed error free; otherwise it contains an error. But what if the checksum sent is itself corrupt? This is highly unlikely because it is a small piece of data, but it is not impossible. Using the right kind of hardware, such as ECC memory, can alleviate the situation.
Hadoop takes this further and creates a distinct checksum for every 512 bytes of data (by default). Because a CRC-32 checksum is only 4 bytes, the storage overhead is not an issue. All data that enters the system is verified by the datanodes before being forwarded for storage or further processing. Data sent to the datanode pipeline is verified through checksums, and any corruption found is immediately reported to the client with a ChecksumException. A client read from a datanode also goes through the same drill. The datanodes maintain a log of checksum verification to keep track of verified blocks. The log is updated by the datanode upon receiving a block-verification success signal from the client. These statistics help in keeping bad disks at bay.
Apart from this, a periodic verification on the block store is made with
the help of DataBlockScanner running along with the datanode thread
in the background. This protects data from corruption in the physical
storage media.
For every file created in the Hadoop LocalFileSystem, a hidden file named .<filename>.crc is created in the same directory. This file maintains the checksum of each chunk of data (512 bytes) in the file. This metadata helps in detecting read errors before the LocalFileSystem throws a ChecksumException.
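The checksum machinery can be observed from the shell; the paths below are illustrative:
hadoop fs -checksum /user/hadoop/file1
hadoop fs -get -crc /user/hadoop/file1 .
The first command prints the stored checksum for the file; the second copies the file together with its CRC data, which then appears locally as the hidden file .file1.crc.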
Compression
Keeping in mind the volume of data Hadoop deals with, compression is not a luxury but a requirement. There are many obvious benefits of file compression, and Hadoop uses them: compression economizes storage requirements and speeds up data transmission over the network and to and from disk. Hadoop supports many tools, techniques, and algorithms, many of which are quite popular and have been used in file compression for ages; for example, gzip, bzip2, LZO, and zip are often used.
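Several of these codecs are transparent at the shell level. A small sketch (file names and paths are illustrative): a gzip-compressed file stored in HDFS can be read back as plain text with the text command, which detects the codec from the file extension.
gzip -c access.log > access.log.gz
hadoop fs -put access.log.gz /user/hadoop/logs/
hadoop fs -text /user/hadoop/logs/access.log.gz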
Serialization
The process that turns structured objects into a stream of bytes is called serialization. It is specifically required for data transmission over the network or for persisting raw data on disk. Deserialization is just the reverse process, where a stream of bytes is transformed back into a structured object; it is particularly required for reconstructing objects from the raw bytes. It is therefore not surprising that distributed computing uses this in a couple of distinct areas: inter-process communication and data persistence.
Hadoop has its own compact and fast serialization format, Writables, which MapReduce programs use for their key and value types.
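One place where this shows up from the shell is SequenceFiles, which store serialized Writable key-value pairs; the text command deserializes them and prints one key-value pair per line (the path is illustrative, and the output assumes keys and values with a readable text form, such as Text or IntWritable):
hadoop fs -text /user/hadoop/data.seq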
Conclusion
This is just a quick overview of the input/output system of Hadoop; many of the intricate details are left for later units. It is not very difficult to understand the Hadoop input/output system if one has a basic understanding of I/O systems in general. Hadoop simply adds some extra machinery on top to keep up with its distributed nature, which works at a massive scale of data.