
Big Data Unit –II Notes

Introducing Hadoop

Traditional Approach

➢ In this approach, an enterprise has a single computer to store and process big
data.
➢ For storage, the programmers rely on the database vendor of their choice, such
as Oracle or IBM.
➢ The user interacts with the application, which in turn handles the data storage
and analysis.

Limitation
This approach works fine for applications that process modest volumes of data,
which standard database servers can accommodate, or up to the limit of the
processor handling the data. But when it comes to huge, ever-growing volumes of
data, pushing everything through a single database server becomes a bottleneck.

Google’s Solution

➢ Google solved this problem using an algorithm called MapReduce.


➢ This algorithm divides the task into small parts, assigns them to many
computers, and collects the results from them, which, when integrated, form
the result dataset.


What is Hadoop?
➢ Hadoop is an open-source software framework for storing large amounts of
data and processing it in a distributed computing environment.
➢ The framework is written mainly in Java, with some native code in C and
shell scripts.
➢ It is designed to handle big data and is based on the MapReduce
programming model, which allows for the parallel processing of large
datasets.
➢ Hadoop has two main components:
• HDFS (Hadoop Distributed File System): This is the storage
component of Hadoop, which allows for the storage of large amounts
of data across multiple machines. It is designed to work with
commodity hardware, which makes it cost-effective.
• YARN (Yet Another Resource Negotiator): This is the resource
management component of Hadoop, which manages the allocation of
resources (such as CPU and memory) for processing the data stored in
HDFS.

History of Hadoop

➢ Hadoop is an open-source software framework for storing and


processing big data.
➢ It was created in 2006 under the Apache Software Foundation, building on
papers published by Google describing the Google File System (GFS, 2003)
and the MapReduce programming model (2004).
➢ In April 2006, Hadoop 0.1.0 was released.
➢ The Hadoop framework allows for the distributed processing of large
data sets across clusters of computers using simple programming
models.
➢ It is designed to scale up from single servers to thousands of machines,
each offering local computation and storage.
➢ It is used by many organizations, including Yahoo, Facebook, and IBM,
for a variety of purposes such as data warehousing, log processing, and
research.
➢ Hadoop has been widely adopted in the industry and has become a key
technology for big data processing.
➢ Hadoop has several key features that make it well-suited for big data
processing. They are

Distributed Storage: Hadoop stores large data sets across multiple
machines, allowing for the storage and processing of extremely large
amounts of data.
Scalability: Hadoop can scale from a single server to thousands of
machines, making it easy to add more capacity as needed.


Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning
it can continue to operate even in the presence of hardware failures.
Data locality: Hadoop provides a data locality feature: computation is moved
to the node where the data is stored, which helps to reduce network traffic
and improve performance.
High Availability: Hadoop provides a High Availability feature, which helps
ensure that the data is always available and is not lost.
Flexible Data Processing: The Hadoop MapReduce programming model
allows data to be processed in a distributed fashion, making it easy to
implement a wide variety of data processing tasks.
Data Integrity: Hadoop provides a built-in checksum feature, which helps to
ensure that the stored data is consistent and correct.
Data Replication: Hadoop provides a data replication feature, which
replicates data across the cluster for fault tolerance.
Data Compression: Hadoop provides a built-in data compression feature,
which helps to reduce storage space and improve performance.
YARN: A resource management platform that allows multiple data
processing engines like real-time streaming, batch processing, and
interactive SQL, to run and process data stored in HDFS.

Some common frameworks of Hadoop

1. Hive- It uses HiveQL for structuring data and for writing complicated
MapReduce programs over data in HDFS.
2. Drill- It consists of user-defined functions and is used for data
exploration.
3. Storm- It allows real-time processing and streaming of data.
4. Spark- It contains a Machine Learning Library (MLlib) for providing
enhanced machine learning and is widely used for data processing. It also
supports Java, Python, and Scala.
5. Pig- It has Pig Latin, a SQL-Like language and performs data
transformation of unstructured data.
6. Tez- It reduces the complexity of Hive and Pig and helps their jobs run
faster.

Hadoop framework is made up of the following modules:


1. Hadoop MapReduce- a MapReduce programming model for handling
and processing large data.
2. Hadoop Distributed File System- distributed files in clusters among
nodes.
3. Hadoop YARN- a platform which manages computing resources.
4. Hadoop Common- it contains packages and libraries which are used for
other modules.


Advantages and Disadvantages of Hadoop

Advantages:
➢ Ability to store a large amount of data.
➢ High flexibility.
➢ Cost effective.
➢ High computational power.
➢ Tasks are independent.
➢ Linear scaling.
Disadvantages:
➢ Not very effective for small data.
➢ Hard cluster management.
➢ Has a stability issue.
➢ Security concerns.
➢ Complexity: Hadoop can be complex to set up and maintain, especially
for organizations without a dedicated team of experts.
➢ Latency: Hadoop is not well-suited for low-latency workloads and may
not be the best choice for real-time data processing.
➢ Limited Support for Real-time Processing: Hadoop batch-oriented
nature makes it less suited for real-time streaming or interactive data
processing use cases.
➢ Limited Support for Structured Data: Hadoop is designed to work with
unstructured and semi-structured data; it is not well-suited for
structured data processing.
➢ Data Security: Hadoop provides only limited built-in security features
such as data encryption or user authentication, which can make it
difficult to secure sensitive data.
➢ Limited Support for Ad-hoc Queries: Hadoop MapReduce
programming model is not well-suited for ad-hoc queries, making it
difficult to perform exploratory data analysis.
➢ Limited Support for Graph and Machine Learning: Hadoop's core
components, HDFS and MapReduce, are not well-suited for graph and
machine learning workloads; specialized components like Apache
Giraph and Mahout are available but have some limitations.
➢ Cost: Hadoop can be expensive to set up and maintain, especially for
organizations with large amounts of data.
➢ Data Loss: In the event of a hardware failure, the data stored in a
single node may be lost permanently.
➢ Data Governance: Data governance is a critical aspect of data
management; Hadoop does not provide built-in features to manage
data lineage, data quality, data cataloging, and data auditing.


Features of Hadoop
Apache Hadoop is the most popular and powerful big data tool; Hadoop
provides the world’s most reliable storage layer.

1. Hadoop is Open Source

Hadoop is an open-source project, which means its source code is available
free of cost for inspection, modification, and analysis, which allows
enterprises to modify the code as per their requirements.

2. Hadoop cluster is Highly Scalable

A Hadoop cluster is scalable: we can add any number of nodes
(horizontal scaling) or increase the hardware capacity of individual nodes
(vertical scaling) to achieve high computation power. This provides
horizontal as well as vertical scalability to the Hadoop framework.

3. Hadoop provides Fault Tolerance

Fault tolerance is the most important feature of Hadoop.


HDFS in Hadoop 2 uses a replication mechanism to provide fault
tolerance. It creates replicas of each block on different machines,
depending on the replication factor (3 by default). So, if any machine in a
cluster goes down, data can be accessed from the other machines
containing a replica of the same data.
Hadoop 3 can replace this replication mechanism with erasure coding,
which provides the same level of fault tolerance using less storage space.

4. Hadoop provides High Availability

This feature of Hadoop ensures the high availability of the data, even in
unfavorable conditions.

Due to the fault tolerance feature of Hadoop, if any of the Data Nodes goes
down, the data is available to the user from different Data Nodes containing
a copy of the same data.

Also, a high-availability Hadoop cluster consists of 2 or more running
Name Nodes (active and passive) in a hot standby configuration. The active
node is the Name Node that serves client requests; the passive node is the
standby node that reads the edit-log modifications of the active Name Node
and applies them to its own namespace.

If an active node fails, the passive node takes over the responsibility of the
active node. Thus, even if the Name Node goes down, files are available and
accessible to users.


5. Hadoop is very Cost-Effective

Since the Hadoop cluster consists of inexpensive commodity-hardware
nodes, it provides a cost-effective solution for storing and processing big
data. Being an open-source product, Hadoop doesn't need any license.

6. Hadoop is Faster in Data Processing

Hadoop stores data in a distributed fashion, which allows data to be
processed in parallel on a cluster of nodes. Thus, it provides lightning-fast
processing capability to the Hadoop framework.

7. Hadoop is based on Data Locality concept

Hadoop is popularly known for its data locality feature, which means moving
the computation logic to the data rather than moving the data to the
computation logic. This feature of Hadoop reduces bandwidth utilization in
the system.

8. Hadoop provides Feasibility

Unlike traditional systems, Hadoop can process unstructured data. Thus, it
gives users the flexibility to analyze data of any format and size.

9. Hadoop is Easy to use

Hadoop is easy to use as clients don't have to worry about distributed
computing; the processing is handled by the framework itself.

10. Hadoop ensures Data Reliability

In Hadoop, due to the replication of data in the cluster, data is stored
reliably on the cluster machines despite machine failures.

The framework itself provides mechanisms to ensure data reliability, such as
the Block Scanner, Volume Scanner, Disk Checker, and Directory Scanner. If
a machine goes down or data gets corrupted, the data is still stored reliably
in the cluster and is accessible from another machine containing a copy of
the data.


Hadoop Ecosystem
Overview: Apache Hadoop is an open-source framework intended to make
interaction with big data easier. Big data is a term given to data sets that
cannot be processed efficiently with traditional methodologies such as an
RDBMS. Hadoop has made its place in industries and companies that need
to work on large data sets which are sensitive and need efficient handling.
Hadoop is a framework that enables the processing of large data sets which
reside in the form of clusters. Being a framework, Hadoop is made up of
several modules that are supported by a large ecosystem of technologies.

Introduction: The Hadoop Ecosystem is a platform or suite which provides
various services to solve big data problems. It includes Apache projects and
various commercial tools and solutions. There are four major elements
of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common.
Most of the tools or solutions are used to supplement or support these major
elements. All these tools work collectively to provide services such as
ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop Ecosystem:

➢ HDFS : Hadoop Distributed File System


➢ YARN : Yet Another Resource Negotiator
➢ MapReduce : Programming based Data Processing
➢ Spark : In-Memory data processing
➢ PIG, HIVE : Query based processing of data services
➢ HBase : NoSQL Database
➢ Mahout, Spark MLLib: Machine Learning algorithm libraries
➢ Solr, Lucene : Searching and Indexing
➢ Zookeeper : Managing cluster
➢ Oozie : Job Scheduling


Note: Apart from the above-mentioned components, there are many other
components too that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one term, i.e. data. That's
the beauty of Hadoop: it revolves around data, which makes working with it
easier.
HDFS:
• HDFS is the primary or major component of Hadoop ecosystem and is
responsible for storing large data sets of structured or unstructured data
across various nodes and thereby maintaining the metadata in the form of
log files.
• HDFS consists of two core components i.e.
1. Name node
2. Data Node
• Name Node is the prime node that contains metadata (data about data),
requiring comparatively fewer resources than the data nodes that store
the actual data. These data nodes are commodity hardware in the
distributed environment, which undoubtedly makes Hadoop cost-effective.
• HDFS maintains all the coordination between the clusters and hardware,
thus working at the heart of the system.


YARN:

• Yet Another Resource Negotiator, as the name implies, YARN is the one
who helps to manage the resources across the clusters. In short, it
performs scheduling and resource allocation for the Hadoop System.
• Consists of three major components i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
• The Resource Manager has the privilege of allocating resources for the
applications in the system, whereas Node Managers handle the allocation of
resources such as CPU, memory, and bandwidth per machine and later
report back to the Resource Manager. The Application Manager works as an
interface between the Resource Manager and the Node Manager and
negotiates as per the requirements of the two.

MapReduce:

• By making use of distributed and parallel algorithms, MapReduce carries
the processing logic over to the data and helps to write applications that
transform big data sets into manageable ones.
• MapReduce makes use of two functions, i.e. Map() and Reduce(), whose
tasks are:
1. Map() performs sorting and filtering of data, thereby organizing it
in the form of groups. Map generates a key-value-pair-based result,
which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating
the mapped data. In simple terms, Reduce() takes the output generated by
Map() as input and combines those tuples into a smaller set of tuples (a
word-count sketch of this pattern is given below).
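To make the Map()/Reduce() pattern above concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API. The input and output paths are taken from the command line and the job name "word count" is only illustrative; this is a sketch of the pattern, not part of the original notes.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): emits (word, 1) for every word in a line of input.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(): sums the counts for each word emitted by the mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}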

PIG:
Pig was originally developed by Yahoo. It works on the Pig Latin language,
a query-based language similar to SQL.
• It is a platform for structuring the data flow, processing and analyzing
huge data sets.
• Pig does the work of executing commands and in the background, all the
activities of MapReduce are taken care of. After the processing, pig stores
the result in HDFS.
• Pig Latin language is specially designed for this framework which runs on
Pig Runtime. Just the way Java runs on the JVM.
• Pig helps to achieve ease of programming and optimization and hence is a
major segment of the Hadoop Ecosystem.


HIVE:

• With the help of an SQL-like methodology and interface, HIVE performs
reading and writing of large data sets. Its query language is called HQL
(Hive Query Language).
• It is highly scalable, as it allows both real-time processing and batch
processing. Also, all the SQL datatypes are supported by Hive, thus making
query processing easier.
• Similar to other query-processing frameworks, HIVE comes with two
components: JDBC drivers and the HIVE command line.
• JDBC, along with ODBC drivers, establishes the data storage
permissions and connection, whereas the HIVE command line helps in the
processing of queries.

Mahout:

• Mahout adds machine learning capability to a system or application.
Machine learning, as the name suggests, helps the system develop itself
based on patterns, user/environmental interaction, or on the basis of
algorithms.
• It provides various libraries or functionalities such as collaborative
filtering, clustering, and classification which are nothing but concepts of
Machine learning. It allows invoking algorithms as per our need with the
help of its own libraries.

Apache Spark:

• It's a platform that handles all process-intensive tasks like batch
processing, interactive or iterative real-time processing, graph
conversions, and visualization, etc.
• It uses in-memory resources, hence it is faster than the earlier
frameworks in terms of optimization.
• Spark is best suited for real-time data whereas Hadoop is best suited for
structured data or batch processing; hence both are used in most of the
companies interchangeably.

Apache HBase:

• It's a NoSQL database which supports all kinds of data and is thus capable
of handling anything in a Hadoop database. It provides the capabilities of
Google's Bigtable and is thus able to work on big data sets effectively.
• At times when we need to search or retrieve the occurrences of something
small in a huge database, the request must be processed within a short
span of time. At such times, HBase comes in handy, as it gives us a
fault-tolerant way of storing and accessing such limited data.


Other Components: Apart from all of these, there are some other
components too that carry out a huge task in order to make Hadoop capable
of processing large datasets. They are as follows:

• Solr, Lucene: These are two services that perform the task of
searching and indexing with the help of some Java libraries. Lucene in
particular is based on Java and also allows a spell-check mechanism.
Solr, in turn, is built on top of Lucene.
• Zookeeper: There was a huge issue with the management of coordination
and synchronization among the resources and components of Hadoop,
which often resulted in inconsistency. Zookeeper overcame these problems
by performing synchronization, inter-component communication,
grouping, and maintenance.
• Oozie: Oozie simply performs the task of a scheduler, scheduling
jobs and binding them together as a single unit. There are two kinds of
jobs, i.e. Oozie workflow and Oozie coordinator jobs. Oozie workflow jobs
are those that need to be executed in a sequentially ordered manner,
whereas Oozie coordinator jobs are triggered when some data or an
external stimulus is given to them.

Hadoop Distributed File System (HDFS)


➢ It is difficult to maintain huge volumes of data in a single machine.
Therefore, it becomes necessary to break down the data into smaller
chunks and store it on multiple machines.
➢ File systems that manage the storage across a network of machines
are called distributed file systems.
➢ Hadoop Distributed File System (HDFS) is the storage component of
Hadoop. All data stored on Hadoop is stored in a distributed manner
across a cluster of machines. But it has a few properties that define its
existence.
Huge volumes – Being a distributed file system, it is highly capable
of storing petabytes of data without any glitches.
Data access – It is based on the philosophy that “the most effective
data processing pattern is write-once, the read-many-times pattern”.
Cost-effective – HDFS runs on a cluster of commodity hardware.
These are inexpensive machines that can be bought from any vendor.

The components of the Hadoop Distributed File System (HDFS):

HDFS has two main components, broadly speaking, – Data blocks and
Nodes storing those data blocks.


Concept of Blocks in HDFS Architecture, Name Nodes and Data Nodes
HDFS Blocks
HDFS breaks down a file into smaller units. Each of these units is stored on
different machines in the cluster. This, however, is transparent to the user
working on HDFS. To them, it seems like storing all the data onto a single
machine.
These smaller units are the blocks in HDFS. The size of each of these
blocks is 128MB by default; you can easily change it according to
requirement. If you had a file of size 512MB, it would be divided into 4
blocks storing 128MB each.

If, however, you had a file of size 524MB, then, it would be divided into 5
blocks. 4 of these would store 128MB each, amounting to 512MB. And the
5th would store the remaining 12MB. That’s right! This last block won’t take
up the complete 128MB on the disk.
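As a quick sketch of that block arithmetic, the snippet below simply reproduces the 524 MB example above in plain Java (it uses the default 128 MB block size and is not an HDFS API call):

public class BlockMath {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;           // default HDFS block size: 128 MB
        long fileSize  = 524L * 1024 * 1024;           // example file: 524 MB

        long fullBlocks  = fileSize / blockSize;       // 4 full blocks of 128 MB
        long remainder   = fileSize % blockSize;       // 12 MB left over
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

        System.out.println("Blocks needed: " + totalBlocks);                          // 5
        System.out.println("Last block size (MB): " + remainder / (1024 * 1024));     // 12
    }
}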

But, you must be wondering, why such a huge amount in a single block?
Why not multiple blocks of 10KB each? Well, the amount of data we
generally deal with in Hadoop is usually of the order of petabytes or higher.
Therefore, if we create blocks of small size, we would end up with a colossal
number of blocks. This would mean we would have to deal with equally
large metadata regarding the location of the blocks which would just create
a lot of overhead. And we don’t really want that!
There are several perks to storing data in blocks rather than saving the
complete file.

• The file itself would be too large to store on any single disk alone.
Therefore, it is prudent to spread it across different machines on the
cluster.
• It would also enable a proper spread of the workload and prevent the choke
of a single machine by taking advantage of parallelism.
Now, you must be wondering, what about the machines in the cluster? How
do they store the blocks and where is the metadata stored? Let’s find out.

Namenode in HDFS

HDFS operates in a master-worker architecture; this means that there is
one master node and several worker nodes in the cluster. The master node
is the Namenode.
Namenode is the master node that runs on a separate node in the cluster.

➢ Manages the filesystem namespace, which is the filesystem tree or
hierarchy of the files and directories.
➢ Stores information like owners of files, file permissions, etc for all the
files.
➢ It is also aware of the locations of all the blocks of a file and their size.

All this information is maintained persistently over the local disk in the
form of two files: Fsimage and Edit Log.

• Fsimage stores the information about the files and directories in the
filesystem. For files, it stores the replication level, modification and access
times, access permissions, blocks the file is made up of, and their sizes. For
directories, it stores the modification time and permissions.
• Edit log on the other hand keeps track of all the write operations that the
client performs. This is regularly updated to the in-memory metadata to
serve the read requests.


Whenever a client wants to write information to HDFS or read information
from HDFS, it connects with the Namenode. The Namenode returns the
location of the blocks to the client and the operation is carried out.
Yes, that's right, the Namenode does not store the blocks. For that, we have
separate nodes.
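A minimal sketch of that client interaction using Hadoop's Java FileSystem API: behind the open() call, the client asks the Namenode for block locations and then streams the blocks from the Datanodes. The NameNode address hdfs://namenode:9000 and the file path are assumptions for illustration.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this usually comes from core-site.xml (fs.defaultFS).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // open() contacts the Namenode for block locations, then reads from the Datanodes.
        try (InputStream in = fs.open(new Path("/data/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // print the file contents to stdout
        }
    }
}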

Datanodes in HDFS
Datanodes are the worker nodes. They are inexpensive commodity
hardware that can be easily added to the cluster.
Datanodes are responsible for storing, retrieving, replicating, deletion,
etc. of blocks when asked by the Namenode.
They periodically send heartbeats to the Namenode so that it is aware of
their health. With that, a Data Node also sends a list of blocks that are
stored on it so that the Namenode can maintain the mapping of blocks to
Datanodes in its memory.
But in addition to these two types of nodes in the cluster, there is also
another node called the Secondary Namenode. Let’s look at what that is.
Secondary Namenode in HDFS
Suppose we need to restart the Namenode, which can happen in case of a
failure. This would mean that we have to copy the Fsimage from disk to
memory. We would also have to apply the latest copy of the Edit Log to the
Fsimage to keep track of all the transactions. But if we restart the node after
a long time, the Edit Log could have grown in size. This would mean
that it would take a lot of time to apply the transactions from the Edit Log,
and during this time the filesystem would be offline. Therefore, to solve
this problem, we bring in the Secondary Namenode.

Secondary Namenode is another node present in the cluster whose main
task is to regularly merge the Edit Log with the Fsimage and produce
checkpoints of the primary's in-memory file system metadata. This is also
referred to as Checkpointing.


But the checkpointing procedure is computationally very expensive and
requires a lot of memory, which is why the Secondary Namenode runs on a
separate node in the cluster.
However, despite its name, the Secondary Namenode does not act as a
Namenode. It is merely there for Checkpointing and keeping a copy of the
latest Fsimage.
Replication Management in HDFS
Now, one of the best features of HDFS is the replication of blocks which
makes it very reliable. But how does it replicate the blocks and where does
it store them? Let’s answer those questions now.
Replication of blocks
HDFS is a reliable storage component of Hadoop. This is because every
block stored in the filesystem is replicated on different Data Nodes in the
cluster. This makes HDFS fault-tolerant.
The default replication factor in HDFS is 3. This means that every block will
have two more copies of it, each stored on separate DataNodes in the
cluster. However, this number is configurable.
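As a rough sketch, the replication factor can be configured cluster-wide through the dfs.replication property or changed per file through the FileSystem API; the value 2 and the path below are only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2);   // default replication for files created by this client

        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing file (illustrative path).
        boolean changed = fs.setReplication(new Path("/data/sample.txt"), (short) 2);
        System.out.println("Replication changed: " + changed);
    }
}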


But you must be wondering, doesn't that mean that we are taking up too
much storage? For instance, if we have 5 blocks of 128MB each, that
amounts to 5*128*3 = 1920 MB. True. But then these nodes are commodity
hardware. We can easily scale the cluster to add more of these machines.
The cost of buying machines is much lower than the cost of losing the data!
Now, you must be wondering, how does Namenode decide which Datanode
to store the replicas on? Well, before answering that question, we need to
have a look at what is a Rack in Hadoop.
What is a Rack in Hadoop?
A Rack is a collection of machines (30-40 in Hadoop) that are stored in the
same physical location. There are multiple racks in a Hadoop cluster, all
connected through switches.


Rack awareness
Replica storage is a tradeoff between reliability and read/write bandwidth.
To increase reliability, we need to store block replicas on different racks
and Datanodes to increase fault tolerance, while the write bandwidth is
lowest when all replicas are stored on the same node. Therefore, Hadoop has
a default strategy to deal with this conundrum, also known as the Rack
Awareness algorithm.
For example, if the replication factor for a block is 3, then the first replica is
stored on the same Datanode on which the client writes. The second replica
is stored on a Datanode on a different rack, chosen randomly. The third
replica is stored on the same rack as the second but on a different Datanode,
again chosen randomly. If the replication factor were higher, the subsequent
replicas would be stored on random Data Nodes in the cluster.


Features of HDFS
The key features of HDFS are:

1. Cost-effective:
In HDFS architecture, the DataNodes, which store the actual data, are inexpensive
commodity hardware, which reduces storage costs.

2. Large Datasets/ Variety and volume of data


HDFS can store data of any size (ranging from megabytes to petabytes) and of any
formats (structured, unstructured).

3. Replication
Data Replication is one of the most important and unique features of HDFS. In
HDFS replication of data is done to solve the problem of data loss in unfavorable
conditions like crashing of a node, hardware failure, and so on.
The data is replicated across a number of machines in the cluster by creating replicas
of blocks. The process of replication is maintained at regular intervals of time by
HDFS and HDFS keeps creating replicas of user data on different machines present
in the cluster.
Hence, whenever any machine in the cluster crashes, the user can access their
data from other machines that contain the blocks of that data. Hence there is no
possibility of a loss of user data.

4. Fault Tolerance and reliability


HDFS is highly fault-tolerant and reliable. HDFS creates replicas of file blocks
depending on the replication factor and stores them on different machines.


If any of the machines containing data blocks fails, other DataNodes containing the
replicas of those data blocks are available. This ensures no loss of data and makes
the system reliable even in unfavorable conditions.
Hadoop 3 introduced Erasure Coding to provide Fault Tolerance. Erasure Coding in
HDFS improves storage efficiency while providing the same level of fault tolerance
and data durability as traditional replication-based HDFS deployment.

5. High Availability
The High availability feature of Hadoop ensures the availability of data even during
NameNode or DataNode failure.
Since HDFS creates replicas of data blocks, if any of the DataNodes goes down, the
user can access his data from the other DataNodes containing a copy of the same
data block.
Also, if the active NameNode goes down, the passive node takes the responsibility of
the active NameNode. Thus, data will be available and accessible to the user even
during a machine crash.

6. Scalability
As HDFS stores data on multiple nodes in the cluster, when requirements increase,
we can scale the cluster.
There are two scalability mechanisms available: vertical scalability – add more
resources (CPU, memory, disk) to the existing nodes of the cluster.
The other way is horizontal scalability – add more machines to the cluster. The
horizontal way is preferred, since we can scale the cluster from tens of nodes to
hundreds of nodes on the fly without any downtime.

7. Data Integrity
Data integrity refers to the correctness of data. HDFS ensures data integrity by
constantly checking the data against the checksum calculated during the write of the
file.
While reading a file, if the checksum does not match the original checksum, the
data is said to be corrupted. The client then opts to retrieve the data block from
another DataNode that has a replica of that block. The NameNode discards the
corrupted block and creates an additional new replica.
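The idea can be illustrated with a plain CRC32 checksum in Java. This is only a conceptual sketch, not HDFS's internal checksum code; HDFS computes checksums over small chunks of data and stores them alongside the blocks.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumIdea {
    // Compute a CRC32 checksum over a byte array.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] original = "block contents".getBytes(StandardCharsets.UTF_8);
        long storedChecksum = checksum(original);        // computed at write time

        byte[] readBack = "block contents".getBytes(StandardCharsets.UTF_8);
        // At read time, recompute and compare; a mismatch means the data is corrupted.
        boolean corrupted = checksum(readBack) != storedChecksum;
        System.out.println("Corrupted: " + corrupted);   // false
    }
}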

8. High Throughput
Hadoop HDFS stores data in a distributed fashion, which allows data to be processed
parallelly on a cluster of nodes. This decreases the processing time and thus provides
high throughput.


9. Data Locality
Data locality means moving computation logic to the data rather than moving data to
the computational unit.
In a traditional system, the data is brought to the application layer and then
processed.
But in the present scenario, due to the massive volume of data, bringing data to the
application layer degrades the network performance.
In HDFS, we bring the computation part to the Data Nodes where the data resides.
Hence, with Hadoop HDFS, we are moving the computation logic to the data rather
than moving the data to the computation logic. This feature reduces bandwidth
utilization in the system.

Importance of MapReduce
Apache Hadoop is a software framework that processes and stores big data across the
cluster of commodity hardware. Hadoop is based on the MapReduce model for
processing huge amounts of data in a distributed manner.

This section lists several features of MapReduce. After reading it, you will
clearly understand why MapReduce is the best fit for processing vast
amounts of data.

First, we will see a small introduction to the MapReduce framework. Then we will
explore various features of MapReduce.


1. Scalability

Apache Hadoop is a highly scalable framework. This is because of its ability to store
and distribute huge data across plenty of servers. All these servers are inexpensive
and can operate in parallel. We can easily scale the storage and computation power
by adding servers to the cluster.

Hadoop MapReduce programming enables organizations to run applications across
large sets of nodes, which can involve the use of thousands of terabytes of data.

2. Flexibility

MapReduce programming enables companies to access new sources of data. It


enables companies to operate on different types of data. It allows enterprises to
access structured as well as unstructured data, and derive significant value by
gaining insights from the multiple sources of data.

Additionally, the MapReduce framework also provides support for multiple
languages and for data from sources ranging from email and social media to
clickstreams.

MapReduce processes data as simple key-value pairs and thus supports data types
including metadata, images, and large files. Hence, MapReduce is more flexible in
dealing with data than a traditional DBMS.

3. Security and Authentication

The MapReduce programming model uses the HBase and HDFS security platforms,
which allow only authenticated users to operate on the data. Thus, it prevents
unauthorized access to system data and enhances system security.

4. Cost-effective solution

Hadoop's scalable architecture with the MapReduce programming framework allows
the storage and processing of large data sets in a very affordable manner.

5. Fast

Hadoop uses a distributed storage method called the Hadoop Distributed File
System, which basically implements a mapping system for locating data in a cluster.

The tools that are used for data processing, such as MapReduce programming, are
generally located on the very same servers, which allows for faster processing of data.

So, even if we are dealing with large volumes of unstructured data, Hadoop
MapReduce takes just minutes to process terabytes of data. It can process petabytes
of data in just an hour.


6. Simple model of programming

Amongst the various features of Hadoop MapReduce, one of the most important
features is that it is based on a simple programming model. Basically, this allows
programmers to develop the MapReduce programs which can handle tasks easily and
efficiently.

The MapReduce programs can be written in Java, which is not very hard to pick up
and is also used widely. So, anyone can easily learn and write MapReduce programs
and meet their data processing needs.

7. Parallel Programming

One of the major aspects of the working of MapReduce programming is its parallel
processing. It divides the tasks in a manner that allows their execution in parallel.
The parallel processing allows multiple processors to execute these divided tasks. So
the entire program is run in less time.

8. Availability and resilient nature

Whenever the data is sent to an individual node, the same set of data is forwarded to
some other nodes in a cluster. So, if any particular node suffers from a failure, then
there are always other copies present on other nodes that can still be accessed
whenever needed. This assures high availability of data.

One of the major features offered by Apache Hadoop is its fault tolerance. The
Hadoop MapReduce framework has the ability to quickly recognize faults that
occur.

It then applies a quick and automatic recovery solution. This feature makes it a
game-changer in the world of big data processing.


HDFS vs. MapReduce


What is HDFS?
HDFS stands for Hadoop Distributed File System. It is the distributed file system
of Hadoop, designed to run on large clusters reliably and efficiently. It is based on
the Google File System (GFS). Moreover, it also has a list of commands to interact
with the file system.
Furthermore, HDFS works according to a master-slave architecture. The
master node or name node manages the file system metadata, while the slave nodes
or data nodes store the actual data.

Figure 1: HDFS Architecture

Besides, a file in an HDFS namespace is split into several blocks. Data nodes store
these blocks. And the name node maps the blocks to the data nodes, which handle
the reading and writing operations with the file system. Furthermore, they perform
tasks such as block creation, deletion, etc. as instructed by the name node.

What is MapReduce
MapReduce is a software framework that allows writing applications to process big
data simultaneously on large clusters of commodity hardware. This framework
consists of a single master job tracker and one slave task tracker per cluster node.

The master performs resource management, scheduling jobs on slaves, monitoring,
and re-executing the failed tasks. On the other hand, the slave task tracker executes
the tasks instructed by the master and sends the task status information back to the
master constantly.

Figure 2: MapReduce Overview

Also, there are two tasks associated with MapReduce: the map task and the
reduce task. The map task takes input data and divides it into tuples of key-value
pairs, while the reduce task takes the output from a map task as input and combines
those data tuples into a smaller set of tuples. Furthermore, the map task is performed
before the reduce task.

Difference between HDFS and MapReduce

Definition
HDFS is a Distributed File System that reliably stores large files across machines in a
large cluster. In contrast, MapReduce is a software framework for easily writing
applications which process vast amounts of data in parallel on large clusters of
commodity hardware in a reliable, fault-tolerant manner. These definitions explain
the main difference between HDFS and MapReduce.

Main Functionality
Another difference between HDFS and MapReduce is that the HDFS provides high-
performance access to data across highly scalable Hadoop clusters while MapReduce
performs the processing of big data.

Introducing HBase
Since 1970, the RDBMS has been the solution for data storage and maintenance-
related problems. After the advent of big data, companies realized the benefit of
processing big data and started opting for solutions like Hadoop.
Hadoop uses a distributed file system for storing big data, and MapReduce to process
it. Hadoop excels at storing and processing huge volumes of data in various formats,
such as arbitrary, semi-structured, or even unstructured data.

Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a
sequential manner. That means one has to search the entire dataset even for the
simplest of jobs.
A huge dataset when processed results in another huge data set, which should also be
processed sequentially. At this point, a new solution is needed to access any point of
data in a single unit of time (random access).

Hadoop Random Access Databases


Applications such as HBase, Cassandra, couchDB, Dynamo, and MongoDB are some
of the databases that store huge amounts of data and access the data in a random
manner.

What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable.
HBase is a data model similar to Google's Bigtable, designed to provide quick
random access to huge amounts of structured data. It leverages the fault tolerance
provided by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write
access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the
Hadoop File System and provides read and write access.


HBase and HDFS


HDFS:
• HDFS is a distributed file system suitable for storing large files.
• HDFS does not support fast individual record lookups.
• It provides high-latency batch processing.
• It provides only sequential access to data.

HBase:
• HBase is a database built on top of HDFS.
• HBase provides fast lookups for larger tables.
• It provides low-latency access to single rows from billions of records (random
access).
• HBase internally uses hash tables and provides random access, and it stores the
data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase


HBase is a column-oriented database, and the tables in it are sorted by row. The
table schema defines only column families, which are the key-value pairs. A table
can have multiple column families, and each column family can have any number of
columns. Subsequent column values are stored contiguously on the disk. Each cell
value of the table has a timestamp. In short, in an HBase:

• Table is a collection of rows.


• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
Given below is an example schema of table in HBase.

Rowid | Column Family          | Column Family          | Column Family          | Column Family
      | col1   col2   col3     | col1   col2   col3     | col1   col2   col3     | col1   col2   col3
1     |                        |                        |                        |


Column Oriented and Row Oriented


Column-oriented databases are those that store data tables as sections of columns of
data, rather than as rows of data. In short, they have column families.

Row-Oriented Database:
• It is suitable for Online Transaction Processing (OLTP).
• Such databases are designed for a small number of rows and columns.

Column-Oriented Database:
• It is suitable for Online Analytical Processing (OLAP).
• Column-oriented databases are designed for huge tables.

The following image shows column families in a column-oriented database:

HBase and RDBMS


HBase:
• HBase is schema-less; it doesn't have the concept of a fixed-column schema and
defines only column families.
• It is built for wide tables. HBase is horizontally scalable.
• No transactions are there in HBase.
• It has de-normalized data.
• It is good for semi-structured as well as structured data.

RDBMS:
• An RDBMS is governed by its schema, which describes the whole structure of the
tables.
• It is thin and built for small tables. Hard to scale.
• An RDBMS is transactional.
• It will have normalized data.
• It is good for structured data.


Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent read and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy-to-use Java API for clients (see the sketch below).
• It provides data replication across clusters.
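A minimal sketch using that Java client API (HBase 1.x/2.x style). The table name "users" and column family "info" are assumptions for illustration; the table must already exist with that column family.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml / ZooKeeper quorum
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "row1", column family "info", column "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}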

Where to Use HBase


• Apache HBase is used to have random, real-time read/write access to Big Data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable.
Bigtable acts upon the Google File System; likewise, Apache HBase works on top of
Hadoop and HDFS.

Applications of HBase
• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available
data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase History
Year Event

Nov 2006 Google released the paper on BigTable.

Feb 2007 Initial HBase prototype was created as a Hadoop contribution.

Oct 2007 The first usable HBase along with Hadoop 0.15.0 was released.

Jan 2008 HBase became the sub project of Hadoop.

Oct 2008 HBase 0.18.1 was released.

Jan 2009 HBase 0.19.0 was released.

Sept 2009 HBase 0.20.0 was released.

May 2010 HBase became Apache top-level project.


HBase - Architecture
In HBase, tables are split into regions and are served by the region servers. Regions
are vertically divided by column families into “Stores”. Stores are saved as files in
HDFS. Shown below is the architecture of HBase.
Note: The term ‘store’ is used for regions to explain the storage structure.

HBase has three major components: the client library, a master server, and region
servers. Region servers can be added or removed as per requirement.

MasterServer
The master server -
• Assigns regions to the region servers and takes the help of Apache ZooKeeper
for this task.
• Handles load balancing of the regions across region servers. It unloads the
busy servers and shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations such as
creation of tables and column families.

Regions
Regions are nothing but tables that are split up and spread across the region servers.
Region server
The region servers have regions that -

• Communicate with the client and handle data-related operations.


• Handle read and write requests for all the regions under it.

• Decide the size of the region by following the region size thresholds.
When we take a deeper look into the region server, it contains regions and stores, as
shown below:

The store contains the memory store (MemStore) and HFiles. The MemStore is just
like a cache memory. Anything that is entered into HBase is stored here initially.
Later, the data is transferred and saved in HFiles as blocks, and the MemStore is
flushed.

Storing Big Data with HBase


HBase is highly configurable and gives great flexibility to address massive amounts
of data efficiently. Now let's understand how HBase can help address your significant
data challenges.
• HBase is a columnar database. Like relational database management systems
(RDBMSs), it stores all data in tables with columns and rows.
• The intersection of a column and row is called a cell. Each cell value contains a
"version" attribute that is no more than a timestamp uniquely identifying the
cell.
• Versioning tracks changes in the cell and makes it possible to retrieve any version
of the contents.
• HBase stores the data in cells in decreasing order (using the timestamp), so a
reader will always first choose the most current values.
• Columns in HBase belong to a column family. The column family name is used
to identify its family members.
• The rows in HBase tables also have a key associated with them. The structure of
the key is very flexible. It can be a computed value, a string, or another data
structure.
• The key is used to control access to the cells in the row, and the rows are stored in
order from low to high key value.
• These features together make up the schema. New tables and column families can
be added after the database is up and running.

Combining HBase and HDFS



Apache HIVE – Features and Limitations

Apache Hive is a data warehousing tool built on top of Hadoop and used for
extracting meaningful information from data. Data warehousing is all about
storing all kinds of data generated from different sources in the same
location. The data is mostly available in 3 forms, i.e. structured (SQL
database), semi-structured (XML or JSON) and unstructured (music or
video). To process the structured data available in tabular format, we use
Hive on top of Hadoop. Hive is so powerful that it can query petabytes
(PB) of data very efficiently.
As we know, MapReduce is the default model we use for programming on
Hadoop with Java or some other language, so Hive was mainly designed for
developers who are comfortable with SQL. With Hive, people who are not
very comfortable with Java can also process data over Hadoop. Using Hive
also makes it easy to query structured data, because writing code in Java is
difficult compared to Hive. HQL or HiveQL is the query language that we
use to work with Hive; its syntax is very similar to the SQL language, which
makes Hive very easy to use.
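A minimal sketch of querying Hive over JDBC from Java, assuming a HiveServer2 instance at localhost:10000 and a table named employees (both are assumptions for illustration); the HQL string itself is what a Hive user would normally type.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement();
             // HQL looks just like SQL; this runs as a batch job on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT department, COUNT(*) FROM employees GROUP BY department")) {

            while (rs.next()) {
                System.out.println(rs.getString(1) + " : " + rs.getLong(2));
            }
        }
    }
}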

Apache Hive Features

• Supported Computing Engines – Hive supports the MapReduce, Tez, and Spark
computing engines.
• Framework – Hive is a stable batch-processing framework built on top of the
Hadoop Distributed File System and can work as a data warehouse.
• Easy To Code – Hive uses the Hive Query Language (HQL) to query structured
data, which is easy to code. The 100 lines of Java code we would use to query
structured data can be reduced to about 4 lines of HQL.
• Declarative – HQL is a declarative language like SQL, meaning it is
non-procedural.
• Structure of Table – The table structure is similar to that of an RDBMS. Hive also
supports partitioning and bucketing.
• Supported data structures – Partitions, buckets, and tables are the 3 data
structures that Hive supports.
• Supports ETL – Apache Hive supports ETL, i.e. Extract, Transform and Load.
Before Hive, Python was used for ETL.
• Storage – Hive allows users to access files from HDFS, Apache HBase, Amazon S3,
etc.
• Capability – Hive is capable of processing very large datasets, petabytes in size.
• Helps in processing unstructured data – We can easily embed custom MapReduce
code with Hive to process unstructured data.
• Drivers – JDBC/ODBC drivers are also available for Hive.
• Fault Tolerance – Since Hive data is stored on HDFS, fault tolerance is provided by
Hadoop.
• Areas of use – We can use Hive for data mining, predictive modeling, and
document indexing.

Apache Hive Limitations

• Does not support OLTP – Apache Hive doesn't support Online Transaction
Processing (OLTP), but Online Analytical Processing (OLAP) is supported.
• Doesn't support subqueries – Subqueries are not supported.
• Latency – The latency of Apache Hive queries is very high.
• Only non-real-time or cold data is supported – Hive is not used for real-time data
querying since it takes a while to produce a result.
• Transaction processing is not supported – HQL does not support the transaction
processing feature.

Features of Pig
Apache Pig comes with the following features −
• Rich set of operators − It provides many operators to perform operations
like join, sort, filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a
Pig script if you are good at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their
execution automatically, so the programmers need to focus only on semantics
of the language.
• Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
• UDF’s − Pig provides the facility to create User-defined Functions in other
programming languages such as Java and invoke or embed them in Pig
Scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
Pig Latin is the language used to analyze data in Hadoop using Apache Pig.
In this section, we are going to discuss the basics of Pig Latin such as Pig
Latin statements, data types, general and relational operators, and Pig
Latin UDFs.

Pig Latin – Data Model

The data model of Pig is fully nested.
A Relation is the outermost structure of the Pig Latin data model. It is
a bag where −

• A bag is a collection of tuples.


• A tuple is an ordered set of fields.
• A field is a piece of data.


Pig Latin – Statements


While processing data using Pig Latin, statements are the basic constructs.

• These statements work with relations. They


include expressions and schemas.
• Every statement ends with a semicolon (;).
• We will perform various operations using operators provided by Pig
Latin, through statements.
• Except for LOAD and STORE, while performing all other operations, Pig
Latin statements take a relation as input and produce another
relation as output.
• As soon as you enter a Load statement in the Grunt shell, its semantic
checking will be carried out. To see the contents of the relation, you
need to use the Dump operator. Only after performing
the dump operation will the MapReduce job for loading the data into the
file system be carried out (a small embedded-Pig sketch follows this list).
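As mentioned above, here is a small sketch of Pig Latin statements embedded in Java through the PigServer class, running in local mode; the file name input.txt, its comma-separated schema, and the output directory adults_out are assumptions for illustration.

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigLatinExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);   // use ExecType.MAPREDUCE on a cluster

        // Each Pig Latin statement ends with a semicolon and produces a relation.
        pig.registerQuery("people = LOAD 'input.txt' USING PigStorage(',') "
                + "AS (name:chararray, age:int);");
        pig.registerQuery("adults = FILTER people BY age >= 18;");

        // Like DUMP: iterating the relation is what actually triggers the job.
        Iterator<Tuple> it = pig.openIterator("adults");
        while (it.hasNext()) {
            System.out.println(it.next());
        }

        // Like STORE: write the relation to an output directory.
        pig.store("adults", "adults_out");
    }
}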

What is Sqoop?

Apache Sqoop is a tool designed for efficiently transferring bulk data
between Apache Hadoop and external datastores such as relational
databases and enterprise data warehouses.

Sqoop is used to import data from external datastores into Hadoop


Distributed File System or related Hadoop eco-systems like Hive and
HBase. Similarly, Sqoop can also be used to extract data from Hadoop or its
eco-systems and export it to external datastores such as relational
databases, enterprise data warehouses. Sqoop works with relational
databases such as Teradata, Netezza, Oracle, MySQL, Postgres etc.


Why is Sqoop used?

For Hadoop developers, the interesting work starts after data is loaded into
HDFS. Developers play around with the data in order to find the magical
insights concealed in that Big Data. For this, the data residing in
relational database management systems needs to be transferred to HDFS,
worked on, and might need to be transferred back to relational database
management systems. In the real Big Data world, developers find the
transferring of data between relational database systems and HDFS
uninteresting and tedious, yet it is often required. Developers can always
write custom scripts to transfer data in and out of Hadoop, but Apache
Sqoop provides an alternative.

Sqoop automates most of the process and depends on the database to describe
the schema of the data to be imported. Sqoop uses the MapReduce framework
to import and export the data, which provides a parallel mechanism as well
as fault tolerance. Sqoop makes developers' lives easier by providing a
command-line interface. Developers just need to provide basic information
like the source, destination, and database authentication details in the sqoop
command, and Sqoop takes care of the remaining part.
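A rough sketch of such a command, shown here as the argument array passed to Sqoop's programmatic entry point (org.apache.sqoop.Sqoop.runTool, Sqoop 1.x); the JDBC URL, credentials, table name, and target directory are all assumptions for illustration.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to: sqoop import --connect ... --table orders --target-dir ...
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/sales",   // source RDBMS (assumed)
            "--username", "etl_user",
            "--password", "secret",
            "--table", "orders",                             // table to import (assumed)
            "--target-dir", "/user/hadoop/orders",           // destination in HDFS
            "--num-mappers", "4"                             // degree of parallelism
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}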

Sqoop provides many salient features like:

1. Full Load
2. Incremental Load
3. Parallel import/export
4. Import results of SQL query
5. Compression
6. Connectors for all major RDBMS Databases
7. Kerberos Security Integration
8. Load data directly into Hive/Hbase
9. Support for Accumulo


Sqoop is robust and has great community support and contributions. Sqoop is
widely used in most Big Data companies to transfer data between
relational databases and Hadoop.

Zookeeper
• Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization,
etc.
• Zookeeper has ephemeral nodes representing different region servers. Master
servers use these nodes to discover available servers (see the sketch below).
• In addition to availability, the nodes are also used to track server failures or
network partitions.
• Clients communicate with region servers via zookeeper.
• In pseudo and standalone modes, HBase itself will take care of zookeeper.
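A minimal sketch of that pattern using the ZooKeeper Java API: a region server registers itself as an ephemeral znode, and a master lists the children to discover live servers. The ensemble address localhost:2181 and the /servers path are assumptions; the parent path must already exist.

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDiscovery {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (assumed address), 3-second session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // A "region server" registers itself; the ephemeral node vanishes if its session dies.
        zk.create("/servers/rs1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // A "master" discovers the currently available servers.
        List<String> live = zk.getChildren("/servers", false);
        System.out.println("Live servers: " + live);

        zk.close();
    }
}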

Apache Zookeeper Features

So, here is the list of best Features of ZooKeeper:


i. Naming service
ZooKeeper attaches a unique identification to every node, quite similar to
DNA, which helps to identify it.
ii. Updating the node’s status
Also, it has the flexibility of updating every node. Hence, that feature
permits it to store updated information about each node across the cluster.
iii. Managing the cluster
Moreover, in Zookeeper, the status of each node is maintained in real-time.
That leaves lesser chances for errors as well as ambiguity, that’s how it
manages the cluster.
iv. Automatic failure recovery
While modifying, ZooKeeper locks the data, so, if a failure occurs in the
database, that helps the cluster to recover it automatically.
v. Simplicity
By using a shared hierarchical namespace, it coordinates.
vi. Reliability
When one or more nodes fail, the system keeps performing.
vii. Ordered
By stamping each update, it keeps track with a number denoting its order.


viii. Speed
ZooKeeper runs particularly well when reads are more common than writes, at
read-to-write ratios of around 10:1.
ix. Scalability
The performance of ZooKeeper can be enhanced by deploying more machines.
x. How is the order beneficial?
Ordering is required in order to implement higher-level abstractions.
xi. ZooKeeper is fast
ZooKeeper works very fast in "read-dominant" workloads.
xii. Ordered Messages
By stamping each update with a number denoting its order, ZooKeeper keeps
track of messages.
xiii. Serialization
Serialization provides assurance about the consistency of the running
application. This approach can be used in MapReduce to coordinate queues for
executing running threads.
xiv. Reliability
Once an update has been applied, it persists from that time forward until a
client overwrites it.
xv. Atomicity
Each data transfer either succeeds or fails completely.
xvi. Sequential Consistency
Updates from a client are applied in the same order in which they were sent.
xvii. Single System Image
A client sees the same view of the service regardless of which server it
connects to.
xviii. Timeliness
The client's view of the system is guaranteed to be up-to-date within a
definite amount of time.


Introduction to Apache Flume


Apache Flume is a distributed system for collecting, aggregating, and
transferring data from external sources such as Twitter, Facebook, and web
servers to a central repository such as HDFS.

It is mainly used for loading log data from different sources into Hadoop HDFS.
Apache Flume is a highly robust and available service. It is extensible,
fault-tolerant, and scalable.

Let us now study the different features of Apache Flume.

Features of Apache Flume

1. Open-source
Apache Flume is an open-source distributed system, so it is available free of
cost.

2. Data flow


Apache Flume allows its users to build multi-hop, fan-in, and fan-out flows.
It also allows for contextual routing as well as backup routes (fail-over) for
the failed hops.

3. Reliability
In Apache Flume, sources transfer events through channels: a Flume source puts
events into a channel, from which they are consumed by a sink. The sink
transfers the events to the next agent or to the terminal repository (such as
HDFS).

The events in a Flume channel are removed only when they are stored in the
next agent's channel or in the terminal repository.

In this way, the single-hop message delivery semantics of Apache Flume provide
end-to-end reliability of the flow. Flume uses a transactional approach to
guarantee reliable delivery of Flume events (a minimal source-channel-sink
configuration sketch follows this feature list).

4. Recoverability
The Flume events are staged in a channel on each Flume agent, which manages
recovery from failure. Apache Flume also supports a durable file channel,
which is backed by the local file system.

5. Steady flow
Apache Flume offers a steady flow of data between read and write operations.
When the rate at which data arrives exceeds the rate at which data can be
written to the destination, Apache Flume acts as a mediator between the data
producers and the centralized stores, thus offering a steady flow of data
between them.

6. Latency
Apache Flume caters to high throughput with lower latency.

7. Ease of use
With Flume, we can ingest stream data from multiple web servers and store it
in any of the centralized stores such as HBase or Hadoop HDFS.

8. Reliable message delivery


All the transactions in Apache Flume are channel-based. For each message,
there are two transactions: one for the sender and one for the receiver.
This ensures reliable message delivery.

9. Import of Huge volumes of data


Along with log files, Apache Flume can also be used to import huge volumes of
data produced by e-commerce sites like Flipkart and Amazon, and networking
sites like Twitter and Facebook.

10. Support for varieties of Sources and Sinks


Apache Flume supports a wide range of sources and sinks.

11. Streaming
Apache Flume provides a reliable solution for ingesting online streaming data
from different sources (such as email messages, network traffic, log files,
and social media) into HDFS.

12. Fault-tolerant and scalable


Flume is an extensible, reliable, highly available, and horizontally scalable
system. It is customizable for different types of sources and sinks.

13. Inexpensive
It is an inexpensive system that is less costly to install and operate, and
its maintenance is very economical.

14. Configuration
Apache Flume uses a simple, declarative configuration (a sample is shown after
this list).

15. Documentation
Flume provides complete documentation with many good examples and patterns
that help its users learn how Flume can be used and configured.
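To make the source-channel-sink terminology above concrete, here is a minimal
single-agent configuration sketch; the agent name, host, port, and HDFS path are
hypothetical placeholders, not values from these notes:

    # a1 is the agent name; it has one source, one channel, and one sink
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    # Source: listen for events on a local netcat port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Channel: a durable file channel, so staged events survive an agent restart
    a1.channels.c1.type = file

    # Sink: write the events to HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events

    # Wire the pieces together: source -> channel -> sink
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

Such a properties file is typically passed to the flume-ng agent command with its
--conf-file and --name options.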

Limitations of Apache Flume


Some of the limitations of Apache Flume are:

1. Weak ordering guarantee


Apache Flume offers weaker guarantees than other systems such as message
queues, in exchange for moving data more quickly and enabling cheaper fault
tolerance. In Apache Flume's end-to-end reliability mode, Flume events are
delivered at least once, but with zero ordering guarantees.

2. Duplication
Apache Flume does not guarantee that the messages it delivers are 100% unique;
in many scenarios, duplicate messages may appear.

3. Low scalability
Flume's scalability is often questioned because, for most businesses, sizing the
hardware of a typical Apache Flume deployment is tricky and in most cases a matter
of trial and error. Due to this, the scalability aspect of Flume is often under
scrutiny.

4. Reliability issue
The throughput that Apache Flume can handle depends heavily on the backing store of
the channel. If the backing store is not chosen wisely, there may be scalability and
reliability issues.

5. Complex topology
Flume topologies can become complex, and reconfiguration is challenging.

Despite these limitations, Flume's advantages outweigh its disadvantages.

Oozie
Oozie is a workflow scheduler system that manages Apache Hadoop jobs.

Oozie operates by running workflows of dependent jobs and permits users to create
Directed Acyclic Graphs (DAGs) of workflows. These DAGs can be run in parallel and
sequentially in Hadoop.

This workflow scheduler system consists of two parts:

• Workflow engine: the responsibility of the workflow engine is to store and run
workflows composed of Hadoop jobs, including MapReduce, Pig, and Hive jobs.
• Coordinator engine: it runs workflow jobs based on predefined schedules and the
availability of data.


Oozie runs as a service in a Hadoop cluster, with clients submitting workflow
definitions for immediate or delayed processing.

An Oozie workflow consists of action nodes and control-flow nodes.

An action node represents a workflow task, such as moving files into HDFS or running
a MapReduce, Pig, or Hive job. A control-flow node controls the workflow execution
between actions by allowing constructs such as conditional logic, so that different
actions can follow depending on the result of earlier action nodes.
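As a hedged illustration of these two node types (the application name, job tracker,
name node, and mapper class below are hypothetical placeholders, not taken from these
notes), a skeleton workflow definition might look like this:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
        <start to="mr-step"/>

        <!-- Action node: runs a MapReduce job -->
        <action name="mr-step">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <property>
                        <name>mapred.mapper.class</name>
                        <value>com.example.DemoMapper</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>       <!-- control flow on success -->
            <error to="fail"/>   <!-- control flow on failure -->
        </action>

        <!-- Control-flow node: terminates the workflow with an error message -->
        <kill name="fail">
            <message>MapReduce step failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>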

Oozie can be integrated with the rest of the Hadoop stack and supports several types
of Hadoop jobs out of the box.

Features of Oozie include:

• A client API and a command-line interface that can be used to launch, control,
and monitor jobs from Java applications (see the sketch after this list).
• Web service APIs that allow jobs to be controlled from anywhere.
• Provisions to execute jobs that are scheduled to run periodically.
• Provision to send email notifications upon completion of jobs.
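For the client API mentioned above, a minimal Java sketch is shown below; the Oozie
server URL, HDFS application path, and property values are hypothetical placeholders:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            // Point the client at a (hypothetical) Oozie server.
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            // Job configuration: where the workflow.xml lives in HDFS,
            // plus the properties that the workflow definition references.
            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/hadoop/demo-wf");
            conf.setProperty("jobTracker", "jt-host:8032");
            conf.setProperty("nameNode", "hdfs://namenode:8020");

            // Launch the workflow and check its status.
            String jobId = oozie.run(conf);
            WorkflowJob.Status status = oozie.getJobInfo(jobId).getStatus();
            System.out.println("Submitted " + jobId + ", status: " + status);
        }
    }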
