Unit 3: Big Data
Design of HDFS:
The Hadoop Distributed File System (HDFS) was developed using a distributed file system design and runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and is designed using low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
● It is suitable for distributed storage and processing.
● Hadoop provides a command interface to interact with HDFS.
● The built-in servers of the namenode and datanode help users easily check the status of the cluster.
● It provides streaming access to file system data.
● HDFS provides file permissions and authentication.
● HDFS is a file system designed for storing very large files with streaming data
access patterns, running on clusters of commodity hardware.
● There are Hadoop clusters running today that store petabytes of data.
● HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern.
● A dataset is typically generated or copied from a source, then various analyses
are performed on that dataset over time.
● It’s designed to run on clusters of commodity hardware (commonly
available hardware from multiple vendors) for which the chance
of node failure across the cluster is high, at least for large clusters.
● HDFS is designed to carry on working without a noticeable interruption to the
user in the face of such failure.
● Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the
namenode.
● Files in HDFS may be written by a single writer.
● Writes are always made at the end of the file.
● There is no support for multiple writers, or for modifications at arbitrary offsets
in the file.
HDFS Architecture
HDFS follows the master-slave architecture and it has the following elements.
1. NameNode (Master Node):
The namenode is the commodity hardware that contains the GNU/Linux operating
system and the namenode software. It is software that can be run on commodity
hardware. The system having the namenode acts as the master server and it does
the following tasks: it manages the file system namespace, regulates clients' access
to files, and executes file system operations such as renaming, closing, and opening
files and directories.
2. DataNode (Slave Node):
The datanode is commodity hardware having the GNU/Linux operating system and
the datanode software. For every node in the cluster, there is a datanode. Datanodes
perform read-write operations on the file system as per client request, and they also
perform operations such as block creation, deletion, and replication according to the
instructions of the namenode.
3.Block
Generally, the user data is stored in the files of HDFS. A file in the file system is
divided into one or more segments and stored in individual data nodes. These
file segments are called blocks. In other words, the minimum amount of data
that HDFS can read or write is called a block. The default block size is 64 MB, but
it can be increased as needed by changing the HDFS configuration.
HDFS Concepts :
● Blocks :
● HDFS has the concept of a block, but it is a much larger unit—64 MB by
default.
● Files in HDFS are broken into block-sized chunks, which are stored as
independent units.
● Having a block abstraction for a distributed filesystem brings several
benefits:
1. A file can be larger than any single disk in the network. Nothing requires
the blocks from a file to be stored on the same disk, so they can take
advantage of any of the disks in the cluster.
2. Making the unit of abstraction a block rather than a file simplifies the
storage subsystem. It simplifies storage management (since blocks are
a fixed size, it is easy to calculate how many can be stored on a given disk)
and eliminates metadata concerns.
3. Blocks fit well with replication for providing fault tolerance and availability.
To insure against corrupted blocks and disk and machine failure, each
block is replicated to a small number of physically separate machines.
● HDFS blocks are large compared to disk blocks, and the reason is to
minimize the cost of seeks.
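As a concrete illustration, the following is a minimal, hedged sketch that uses the public FileSystem Java API to list the blocks of a file and the datanodes holding each replica. The class name and file path are assumptions, and it presumes the Hadoop client libraries are on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/path/to/file.ext");   // hypothetical path
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block: its offset, length, and the hosts of its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}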
● Namenodes and Datanodes :
● An HDFS cluster has two types of nodes operating in a master-worker pattern:
1. A Namenode (the master) and
2. A number of datanodes (workers).
● The namenode manages the filesystem namespace.
● It maintains the filesystem tree and the metadata for all the files and directories
in the tree.
● This information is stored persistently on the local disk in the form of two
files:
● The namespace image
● The edit log.
● The namenode also knows the datanodes on which all the blocks for a given file
are located.
Benefits of HDFS:
● HDFS can store a large amount of information.
● HDFS has a simple and robust coherency model.
● HDFS is scalable and has fast access to required information.
● HDFS can also serve a substantial number of clients by adding more machines
to the cluster.
● HDFS provides streaming read access.
● Data stored in HDFS can be read many times, but it is written to HDFS only once.
● Recovery techniques are applied very quickly.
● Portability across heterogeneous commodity hardware and operating systems.
● High Economy by distributing data and processing across clusters of commodity
personal computers.
● High Efficiency by distributing data and processing logic to parallel nodes, so
data is processed where it is located.
● High Reliability by automatically maintaining multiple copies of data and
automatically redeploying processing logic in the event of failures.
Data storage in HDFS: Now let’s see how the data is stored in a distributed
manner.
Features:
● Distributed data storage.
● Blocks reduce seek time.
● The data is highly available, as the same block is present at multiple datanodes.
● Even if multiple datanodes are down, we can still do our work, which makes the
system highly reliable.
● High fault tolerance.
Limitations: Though HDFS provides many features, there are some areas where it
doesn’t work well.
● Low-latency data access: Applications that require low-latency access to
data, i.e., in the range of milliseconds, will not work well with HDFS, because
HDFS is designed for high throughput of data, even at the cost of latency.
● Small file problem: Having lots of small files results in lots of seeks and
lots of movement from one datanode to another to retrieve each small file;
this whole process is a very inefficient data access pattern.
Data Replication :
Store Files :
● HDFS divides files into blocks and stores each block on a DataNode.
● Multiple DataNodes are linked to the master node in the cluster, the
NameNode.
● The master node distributes replicas of these data blocks across the
cluster.
● It also tells the client where to locate the requested information.
● Before the NameNode can help you store and manage the data, it first
needs to partition the file into smaller, manageable data blocks.
● This process is called data block splitting.
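Replication is configurable per file. The fragment below is a minimal sketch, assuming an already-open org.apache.hadoop.fs.FileSystem handle named fs and a hypothetical path, of requesting a different replication factor for one file through the Java API; cluster-wide defaults come from the dfs.replication property.
// Minimal sketch: request a different replication factor for one file.
// Assumes 'fs' is an open FileSystem; the path is hypothetical.
Path file = new Path("/path/to/file.ext");
short replication = 3;                      // HDFS default is 3 replicas per block
boolean scheduled = fs.setReplication(file, replication);
System.out.println("Replication change scheduled: " + scheduled);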
Java Interfaces to HDFS :
● Java code for writing a file in HDFS (assumes a Configuration object conf, a local
source path String source, and imports from java.io and org.apache.hadoop.fs) :
FileSystem fileSystem = FileSystem.get(conf);

// Check if the file already exists.
Path path = new Path("/path/to/file.ext");
if (fileSystem.exists(path)) {
    System.out.println("File " + path + " already exists");
    return;
}

// Create a new file and copy the local source into it.
FSDataOutputStream out = fileSystem.create(path);
InputStream in = new BufferedInputStream(new FileInputStream(new File(source)));

byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    out.write(b, 0, numBytes);
}

// Close all the file descriptors.
in.close();
out.close();
fileSystem.close();
● Java code for reading a file in HDFS (same assumptions as above) :
FileSystem fileSystem = FileSystem.get(conf);

Path path = new Path("/path/to/file.ext");
if (!fileSystem.exists(path)) {
    System.out.println("File does not exist");
    return;
}

FSDataInputStream in = fileSystem.open(path);
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    // Code to manipulate the data which is read; here it is echoed to stdout.
    System.out.write(b, 0, numBytes);
}

// Close the file descriptors.
in.close();
fileSystem.close();
HDFS Interfaces :
Features of HDFS interfaces are :
1. Create new file
2. Upload files/folder
3. Set Permission
4. Copy
5. Move
6. Rename
7. Delete
8. Drag and Drop
9. HDFS File viewer
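Most of these interface features also map onto methods of the FileSystem Java API. The fragment below is a hedged sketch, assuming an open FileSystem handle fs and hypothetical paths, of the calls corresponding to create, upload, set permission, move/rename, and delete.
// Assumes 'fs' is an open org.apache.hadoop.fs.FileSystem; paths are hypothetical.
// Uses org.apache.hadoop.fs.Path and org.apache.hadoop.fs.permission.FsPermission.
fs.mkdirs(new Path("/user/demo/newdir"));                                    // create new folder
fs.copyFromLocalFile(new Path("data.txt"), new Path("/user/demo/newdir/"));  // upload a local file
fs.setPermission(new Path("/user/demo/newdir"), new FsPermission((short) 0755)); // set permission
fs.rename(new Path("/user/demo/newdir/data.txt"),
          new Path("/user/demo/newdir/renamed.txt"));                        // move / rename
fs.delete(new Path("/user/demo/newdir/renamed.txt"), false);                 // delete (non-recursive)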
Data Flow :
● MapReduce is used to compute a huge amount of data.
● To handle the incoming data in a parallel and distributed form, the data has to
flow through various phases :
● Input Reader :
● The input reader reads the incoming data and splits it into data blocks of the
appropriate size (64 MB to 128 MB).
● Once the input reader reads the data, it generates the corresponding key-value pairs.
● The input files reside in HDFS.
● Map Function :
● The map function processes the incoming key-value pairs and generates the
corresponding output key-value pairs.
● The mapped input and output types may be different from each other.
● Partition Function :
● The partition function assigns the output of each Map function to the
appropriate reducer.
● Given a key and value, it returns the index of the reducer that should receive
that record.
● Shuffling and Sorting :
● The data is shuffled between nodes so that it moves out of the map phase
and is ready for processing by the reduce function.
● A sorting operation is performed on this intermediate data before it is passed
to the Reduce function.
● Reduce Function :
● The Reduce function is invoked once for each unique key.
● The keys arrive in sorted order.
● The Reduce function iterates over the values associated with each key and
generates the corresponding output.
● Output Writer :
● Once the data has flowed through all the above phases, the output writer executes.
● The role of the output writer is to write the Reduce output to stable storage
(a complete word-count sketch of these phases follows below).
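To make the phases concrete, the following is a minimal, hedged word-count sketch: the map function emits (word, 1) pairs, the framework partitions, shuffles, and sorts them by key, and the reduce function sums the values per key before the output writer persists the result. Class names and the input/output paths are assumptions.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map function: emits (word, 1) for every word in the input split.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
    // Reduce function: sums the counts for each key after shuffle and sort.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));    // hypothetical HDFS path
        FileOutputFormat.setOutputPath(job, new Path("/output")); // hypothetical HDFS path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}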
Data Ingestion :
● Hadoop Data ingestion is the beginning of your data pipeline in a datalake.
● It means taking data from various silo databases and files and putting it
into Hadoop.
● For many companies, it does turn out to be an intricate task.
● That is why some companies take more than a year to ingest all their data into the
Hadoop data lake.
● The reason is that, since Hadoop is open source, there are a variety of ways to
ingest data into Hadoop.
● This gives every developer the choice of using his or her favourite tool or
language to ingest data into Hadoop.
● When choosing a tool or technology, developers stress performance, but this
variety makes governance very complicated.
● Sqoop :
● Apache Sqoop (SQL-to-Hadoop) is a lifesaver for anyone who is
experiencing difficulties in moving data from the data warehouse into the Hadoop
environment.
● Apache Sqoop is an effective Hadoop tool used for importing data from
RDBMSs such as MySQL and Oracle into HBase, Hive or HDFS.
● Sqoop Hadoop can also be used for exporting data from HDFS into
RDBMS.
● Apache Sqoop is a command-line interpreter i.e. the Sqoop commands are
executed one at a time by the interpreter.
● Flume :
● Apache Flume is a service designed for streaming logs into the Hadoop
environment.
● Flume is a distributed and reliable service for collecting and aggregating huge
amounts of log data.
● It has a simple and easy-to-use architecture based on streaming data flows, with
tunable reliability mechanisms and several recovery and failover mechanisms.
Hadoop Archives :
● Hadoop Archive (HAR) is a facility that packs small files into larger, compact
HDFS files to avoid wasting namenode memory.
● The namenode stores the metadata information of the HDFS data.
● If a 1 GB file is broken into 1000 pieces, the namenode has to store metadata
about all those 1000 small files.
● In that manner, namenode memory is wasted in storing and managing a lot of
metadata.
● A HAR is created from a collection of files, and the archiving tool runs a
MapReduce job.
● This MapReduce job processes the input files in parallel to create the archive file.
● Hadoop is designed to deal with large files, so huge numbers of small files are
problematic and need to be handled efficiently.
● When a huge number of small input files is stored across the data nodes, a
metadata record for each of them must be kept in the namenode, which makes
the namenode inefficient.
● To handle this problem, Hadoop Archive has been created; it packs the HDFS
files into archives, and these archives can be used directly as input to MapReduce
jobs.
● An archive always comes with the *.har extension.
● HAR Syntax :
hadoop archive -archiveName NAME -p <parent path> <src>* <dest>
Example :
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
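Once created, the archive is exposed as a filesystem through the har:// URI scheme, so it can be listed or fed to MapReduce jobs like ordinary HDFS paths. A minimal sketch, assuming a Configuration object conf and the example archive above:
// List the contents of the archive created in the example above.
// Assumes 'conf' is an org.apache.hadoop.conf.Configuration.
Path archivedDir = new Path("har:///user/zoo/foo.har/dir1");
FileSystem harFs = archivedDir.getFileSystem(conf);   // resolves to the HAR filesystem
for (FileStatus status : harFs.listStatus(archivedDir)) {
    System.out.println(status.getPath());
}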
I/O Compression :
● In the Hadoop framework, where large data sets are stored and processed,
you will need storage for large files.
● These files are divided into blocks and those blocks are stored in different
nodes across the cluster so lots of I/O and network data transfer is also
involved.
● In order to reduce the storage requirements and the time spent in network
transfer, you can look at data compression in the Hadoop framework.
● Using data compression in Hadoop, you can compress files at various steps; at
each of these steps it helps to reduce the storage needed and the quantity of data
transferred.
● You can compress the input file itself.
● That will help you reduce storage space in HDFS.
● You can also configure the output of a MapReduce job to be compressed in
Hadoop.
● That helps in reducing storage space if you are archiving the output or sending it
to some other application for further processing.
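As one hedged example, assuming a MapReduce Job object named job, compressed job output and compressed intermediate map output can be requested as follows; GzipCodec is just one of the codecs Hadoop ships with.
// Uses org.apache.hadoop.io.compress.GzipCodec and
// org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.
// Compress the final job output written to HDFS.
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
// Also compress intermediate map output to cut shuffle/network traffic.
job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);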
I/O Serialization :
● Serialization refers to the conversion of structured objects into byte streams for
transmission over the network or permanent storage on a disk.
● Deserialization refers to the conversion of byte streams back to structured
objects.
● Serialization is mainly used in two areas of distributed data processing :
● Interprocess communication
● Permanent storage
● We require I/O serialization because :
● Records need to be processed faster (time-bound).
● Proper data formats need to be maintained and transmitted even when the
receiving end has no schema support.
● If unstructured or unformatted data has to be processed later, complex errors
may occur.
● Serialization offers data validation over transmission.
● To maintain the proper format of data serialization, the system must have
the following four properties -
● Compact - helps in the best use of network bandwidth
● Fast - reduces the performance overhead
● Extensible - can match new requirements
● Inter-operable - not language-specific.
Avro :
● Apache Avro is a language-neutral data serialization system.
● Since Hadoop writable classes lack language portability, Avro becomes
quite helpful, as it deals with data formats that can be processed by
multiple languages.
● Avro is a preferred tool to serialize data in Hadoop.
● Avro has a schema-based system.
● A language-independent schema is associated with its read and write
operations.
● Avro serializes the data together with its schema, so the schema travels with
the data.
● Avro serializes the data into a compact binary format, which can be
deserialized by any application.
● Avro uses JSON format to declare the data structures.
● Presently, it supports languages such as Java, C, C++, C#, Python, and
Ruby.
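The following is a minimal, hedged round-trip sketch with Avro's Java generic API: a schema declared in JSON, one record serialized to Avro's compact binary format, then deserialized back. The record and field names are assumptions.
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Language-independent schema declared in JSON (hypothetical record/fields).
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Serialize to a compact binary byte stream.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);
        writer.write(user, encoder);
        encoder.flush();

        // Deserialize back into a structured record.
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null);
        GenericRecord decoded = reader.read(null, decoder);
        System.out.println(decoded);
    }
}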
Security in Hadoop :
● Apache Hadoop achieves security by using Kerberos.
● At a high level, there are three steps that a client must take to access a
service when using Kerberos, each of which involves a message exchange with
a server.
● Authentication – The client authenticates itself to the authentication server.
Then, receives a timestamped Ticket-Granting Ticket (TGT).
● Authorization – The client uses the TGT to request a service ticket from the
Ticket Granting Server.
● Service Request – The client uses the service ticket to authenticate itself to the
server.
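On the client side, a hedged sketch of what this looks like with the Hadoop Java API; the principal and keytab path are assumptions, and the cluster itself must already be configured for Kerberos.
// Uses org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
// and org.apache.hadoop.security.UserGroupInformation.
Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "kerberos");
UserGroupInformation.setConfiguration(conf);
// Assumed principal and keytab path.
UserGroupInformation.loginUserFromKeytab("client@EXAMPLE.COM",
        "/etc/security/keytabs/client.keytab");
// Subsequent filesystem requests carry the Kerberos credentials obtained above.
FileSystem fs = FileSystem.get(conf);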
Administering Hadoop :
● The person who administers Hadoop is called a Hadoop administrator.
● Some of the common administering tasks in Hadoop are :
● Monitor health of a cluster
● Add new data nodes as needed
● Optionally turn on security
● Optionally turn on encryption
● Turn on high availability (recommended, but optional)
● Optionally turn on the MapReduce Job History Server
● Fix corrupt data blocks when necessary
● Tune performance
Hadoop Environment: