Big Data Unit II
Introducing Hadoop
Traditional Approach
➢ In this approach, an enterprise will have a computer to store and process big
data.
➢ For storage purpose, the programmers will take the help of their choice of
database vendors such as Oracle, IBM, etc.
➢ In this approach, the user interacts with the application, which in turn handles
the part of data storage and analysis.
Limitation
This approach works fine with those applications that process less voluminous data
that can be accommodated by standard database servers, or up to the limit of the
processor that is processing the data.
But when it comes to dealing with huge volumes of rapidly growing data, pushing
everything through a single database server becomes a bottleneck and makes
processing impractical.
Google’s Solution
Google solved this problem with the MapReduce algorithm: a large job is divided
into small tasks that are assigned to many computers, and their results are
collected and combined to form the final result dataset.
What is Hadoop?
➢ Hadoop is an open-source software framework for storing large amounts of
data and processing it in a distributed computing environment.
➢ Its framework is based on Java programming with some native code in
C and shell scripts.
➢ It is designed to handle big data and is based on the MapReduce
programming model, which allows for the parallel processing of large
datasets.
➢ Hadoop has two main components:
• HDFS (Hadoop Distributed File System): This is the storage
component of Hadoop, which allows for the storage of large amounts
of data across multiple machines. It is designed to work with
commodity hardware, which makes it cost-effective.
• YARN (Yet Another Resource Negotiator): This is the resource
management component of Hadoop, which manages the allocation of
resources (such as CPU and memory) for processing the data stored in
HDFS.
History of Hadoop
Hadoop was created by Doug Cutting and Mike Cafarella and became a top-level
Apache project in 2008. Over the years, a rich ecosystem of tools has grown up
around it, including:
1. Hive- It uses HiveQL, a SQL-like language, for structuring data and for
expressing complicated MapReduce jobs over data in HDFS.
2. Drill- It consists of user-defined functions and is used for data
exploration.
3. Storm- It allows real-time processing and streaming of data.
4. Spark- It contains a Machine Learning Library (MLlib) for providing
enhanced machine learning and is widely used for data processing. It also
supports Java, Python, and Scala.
5. Pig- It has Pig Latin, a SQL-like language, and performs data
transformation on unstructured data.
6. Tez- It reduces the complexities of Hive and Pig and helps their code run
faster.
Advantages:
➢ Ability to store a large amount of data.
➢ High flexibility.
➢ Cost effective.
➢ High computational power.
➢ Tasks are independent.
➢ Linear scaling.
Disadvantages:
➢ Not very effective for small data.
➢ Hard cluster management.
➢ Has a stability issue.
➢ Security concerns.
➢ Complexity: Hadoop can be complex to set up and maintain, especially
for organizations without a dedicated team of experts.
➢ Latency: Hadoop is not well-suited for low-latency workloads and may
not be the best choice for real-time data processing.
➢ Limited Support for Real-time Processing: Hadoop batch-oriented
nature makes it less suited for real-time streaming or interactive data
processing use cases.
➢ Limited Support for Structured Data: Hadoop is designed to work with
unstructured and semi-structured data, it is not well-suited for
structured data processing
➢ Data Security: Hadoop does not provide built-in security features such
as data encryption or user authentication, which can make it difficult
to secure sensitive data.
➢ Limited Support for Ad-hoc Queries: Hadoop MapReduce
programming model is not well-suited for ad-hoc queries, making it
difficult to perform exploratory data analysis.
➢ Limited Support for Graph and Machine Learning: Hadoop core
component HDFS and MapReduce are not well-suited for graph and
machine learning workloads, specialized components like Apache
Graph and Mahout are available but have some limitations.
➢ Cost: Hadoop can be expensive to set up and maintain, especially for
organizations with large amounts of data.
➢ Data Loss: In the event of a hardware failure, the data stored in a
single node may be lost permanently.
➢ Data Governance: Data Governance is a critical aspect of data
management; Hadoop does not provide a built-in feature to manage
data lineage, data quality, data cataloging, data lineage, and data audit.
Features of Hadoop
Apache Hadoop is the most popular and powerful big data tool; Hadoop
provides the world’s most reliable storage layer.
Hadoop’s fault tolerance and replication ensure high availability of the data,
even in unfavorable conditions.
Due to the fault tolerance feature of Hadoop, if any of the Data Nodes goes
down, the data is available to the user from different Data Nodes containing
a copy of the same data.
If an active node fails, the passive node takes over the responsibility of the
active node. Thus, even if the Name Node goes down, files are available and
accessible to users.
Since the Hadoop cluster consists of inexpensive commodity hardware nodes, it
provides a cost-effective solution for storing and processing big data. Being an
open-source product, Hadoop doesn’t need any license.
Hadoop is popularly known for its data locality feature, which means moving the
computation logic to the data rather than moving the data to the computation
logic. This feature of Hadoop reduces the bandwidth utilization in a system.
Unlike traditional systems, Hadoop can also process unstructured data. Thus, it
gives users the flexibility to analyze data of any format and size.
Hadoop is easy to use, as clients don’t have to worry about distributed
computing; the processing is handled by the framework itself.
Hadoop Ecosystem
Overview: Apache Hadoop is an open source framework intended to make
interaction with big data easier. Big data is a term given to data sets that
can’t be processed efficiently with traditional methodologies such as RDBMS.
Hadoop has made its place in industries and companies that need to work on
large, sensitive data sets that need efficient handling. Hadoop is a framework
that enables processing of large data sets which reside in the form of clusters.
Being a framework, Hadoop is made up of several modules that are supported
by a large ecosystem of technologies.
Note: Apart from the above-mentioned components, there are many other
components too that are part of the Hadoop ecosystem.
All these toolkits and components revolve around one thing: data. That is the
beauty of Hadoop: it revolves around data and thereby makes its analysis
easier.
HDFS:
• HDFS is the primary or major component of Hadoop ecosystem and is
responsible for storing large data sets of structured or unstructured data
across various nodes and thereby maintaining the metadata in the form of
log files.
• HDFS consists of two core components i.e.
1. Name node
2. Data Node
• Name Node is the prime node that contains the metadata (data about data),
requiring comparatively fewer resources than the data nodes that store
the actual data. These data nodes are commodity hardware in the
distributed environment, which undoubtedly makes Hadoop cost effective.
• HDFS maintains all the coordination between the clusters and hardware,
thus working at the heart of the system.
YARN:
• Yet Another Resource Negotiator, as the name implies, YARN helps manage
the resources across the clusters. In short, it performs scheduling and
resource allocation for the Hadoop system.
• It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
• The Resource Manager allocates resources to the applications in the system,
whereas the Node Managers manage the resources (such as CPU, memory, and
bandwidth) on each machine and report back to the Resource Manager. The
Application Manager works as an interface between the Resource Manager and
the Node Managers and negotiates resources as required by the two (a small
client sketch follows below).
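To make the Resource Manager’s role a little more concrete, here is a minimal,
hedged sketch that uses the standard YarnClient API to ask the Resource Manager
for the applications it is currently tracking. It assumes a yarn-site.xml on the
classpath points at the cluster; the class name is illustrative only.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Assumes a yarn-site.xml on the classpath points at the Resource Manager.
        Configuration conf = new Configuration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for the applications it currently knows about.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  state=" + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```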
MapReduce:
• MapReduce is the processing engine of the ecosystem: it processes large data
sets in parallel as map and reduce tasks (covered in detail later in these notes).
PIG:
Pig was originally developed by Yahoo. It works on Pig Latin, a query-based
language similar to SQL.
• It is a platform for structuring the data flow, processing and analyzing
huge data sets.
• Pig does the work of executing commands, and in the background all the
activities of MapReduce are taken care of (see the sketch after this list). After
the processing, Pig stores the result in HDFS.
• Pig Latin language is specially designed for this framework which runs on
Pig Runtime. Just the way Java runs on the JVM.
• Pig helps to achieve ease of programming and optimization and hence is a
major segment of the Hadoop Ecosystem.
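As a hedged illustration of how Pig hides the MapReduce details, the sketch below
embeds Pig in a Java program through the PigServer class. The input file
'users.csv', the field names, and the output directory 'adults_out' are made-up
placeholders.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE would submit jobs to a Hadoop cluster;
        // ExecType.LOCAL runs everything on the local machine for testing.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registerQuery() call adds one Pig Latin statement to the plan.
        pig.registerQuery("users = LOAD 'users.csv' USING PigStorage(',') "
                + "AS (name:chararray, age:int);");
        pig.registerQuery("adults = FILTER users BY age >= 18;");

        // store() triggers execution; Pig compiles the plan into MapReduce jobs
        // behind the scenes and writes the result to the given output path.
        pig.store("adults", "adults_out");

        pig.shutdown();
    }
}
```

Running the same two Pig Latin statements in the Grunt shell would produce the
same jobs; embedding is just one way to drive Pig.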
HIVE:
• With the help of an SQL-like methodology and interface, HIVE performs reading
and writing of large data sets. Its query language is called HQL
(Hive Query Language).
• It is highly scalable, as it supports both large batch workloads and
interactive querying. Also, all the SQL data types are supported by Hive,
making query processing easier.
• Similar to other query-processing frameworks, HIVE comes with two
components: JDBC drivers and the HIVE command line.
• The JDBC and ODBC drivers establish the connection and data-storage
permissions, whereas the HIVE command line helps in the processing of
queries.
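Since the notes mention Hive’s JDBC driver, here is a minimal, hedged sketch of
querying HiveServer2 over JDBC from Java. The URL, credentials, and the
"employees" table are placeholders, and the Hive JDBC driver jar is assumed to be
on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 URL; host, port, database, and credentials are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";

        // The Hive JDBC driver jar must be on the classpath; it registers itself
        // with DriverManager automatically.
        try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = con.createStatement()) {

            // HQL looks like SQL; Hive compiles it into distributed jobs.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}
```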
Mahout:
• Mahout provides a library of scalable machine-learning algorithms (such as
classification, clustering, and collaborative filtering) that can run on top of
Hadoop.
Apache Spark:
• It’s a platform that handles resource-intensive tasks such as batch
processing, interactive or iterative real-time processing, graph processing,
and visualization.
• It processes data in memory, which makes it faster than MapReduce-based
processing.
• Spark is best suited for real-time data, whereas Hadoop is best suited for
structured data or batch processing; hence both are used together in many
companies.
Apache HBase:
• It’s a NoSQL database that supports all kinds of data and is thus capable
of handling any kind of data in a Hadoop-based system. It provides the
capabilities of Google’s Bigtable, so it can work on big data sets effectively.
• When we need to search for or retrieve a small piece of data from a huge
database, the request must be processed within a very short span of time.
At such times, HBase comes in handy, as it gives us a fault-tolerant way of
storing and quickly accessing such data.
Other Components: Apart from all of these, there are some other
components too that carry out a huge task in order to make Hadoop capable
of processing large datasets. They are as follows:
• Solr, Lucene: These are the two services that perform the task of
searching and indexing. Lucene is a Java library that provides indexing,
searching, and spell-check capabilities, and Solr is built on top of Lucene.
• Zookeeper: There was a huge issue with managing coordination and
synchronization among the resources and components of Hadoop, which
often resulted in inconsistency. Zookeeper overcame these problems by
providing synchronization, inter-component communication, grouping,
and maintenance.
• Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and
binding them together as a single unit. There are two kinds of jobs, i.e.
Oozie workflow and Oozie coordinator jobs. Oozie workflow jobs are those
that need to be executed in a sequentially ordered manner, whereas Oozie
coordinator jobs are those that are triggered when some data or an
external stimulus is given to them.
HDFS has two main components, broadly speaking: data blocks and the nodes
storing those data blocks. By default, HDFS splits a file into blocks of 128 MB.
If, however, you had a file of size 524MB, then, it would be divided into 5
blocks. 4 of these would store 128MB each, amounting to 512MB. And the
5th would store the remaining 12MB. That’s right! This last block won’t take
up the complete 128MB on the disk.
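A quick, hedged sanity check of this arithmetic in Java (the 128 MB default block
size is assumed; the class name is illustrative):

```java
public class BlockCount {
    public static void main(String[] args) {
        long blockSizeMB = 128;   // default HDFS block size assumed in the notes
        long fileSizeMB  = 524;   // the example file above

        long fullBlocks  = fileSizeMB / blockSizeMB;            // 4 blocks of 128 MB
        long remainderMB = fileSizeMB % blockSizeMB;            // 12 MB left over
        long totalBlocks = fullBlocks + (remainderMB > 0 ? 1 : 0);

        // Prints: 5 blocks, last block holds 12 MB
        System.out.println(totalBlocks + " blocks, last block holds "
                + remainderMB + " MB");
    }
}
```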
But, you must be wondering, why such a huge amount in a single block?
Why not multiple blocks of 10KB each? Well, the amount of data we deal with in
Hadoop is usually enormous, and very small blocks would leave the Namenode
with a colossal amount of block metadata to track. Splitting a large file into
sizeable blocks also has other advantages:
• The file itself would be too large to store on any single disk alone.
Therefore, it is prudent to spread it across different machines on the
cluster.
• It would also enable a proper spread of the workload and prevent the choke
of a single machine by taking advantage of parallelism.
Now, you must be wondering, what about the machines in the cluster? How
do they store the blocks and where is the metadata stored? Let’s find out.
Namenode in HDFS
The Namenode is the master node that holds the metadata of the filesystem,
i.e., the namespace and the mapping of files to blocks. All this information is
maintained persistently on the local disk in the form of two files: Fsimage and
Edit Log.
• Fsimage stores the information about the files and directories in the
filesystem. For files, it stores the replication level, modification and access
times, access permissions, blocks the file is made up of, and their sizes. For
directories, it stores the modification time and permissions.
• Edit log on the other hand keeps track of all the write operations that the
client performs. This is regularly updated to the in-memory metadata to
serve the read requests.
Datanodes in HDFS
Datanodes are the worker nodes. They are inexpensive commodity
hardware that can be easily added to the cluster.
Datanodes are responsible for storing, retrieving, replicating, and deleting
blocks when asked by the Namenode.
They periodically send heartbeats to the Namenode so that it is aware of
their health. With that, a Data Node also sends a list of blocks that are
stored on it so that the Namenode can maintain the mapping of blocks to
Datanodes in its memory.
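As a hedged sketch of this block-to-Datanode mapping in action, the snippet below
uses the standard Hadoop FileSystem API to ask the Namenode where the blocks of
a file live. The path /user/demo/big.log is a made-up example, and the cluster
address is assumed to come from configuration files on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml/hdfs-site.xml on the classpath point at the Namenode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path; replace with a real file in the cluster.
        Path file = new Path("/user/demo/big.log");
        FileStatus status = fs.getFileStatus(file);

        // The Namenode answers this from its in-memory block-to-Datanode map.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```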
But in addition to these two types of nodes in the cluster, there is also
another node called the Secondary Namenode. Let’s look at what that is.
Secondary Namenode in HDFS
Suppose we need to restart the Namenode, which can happen in case of a
failure. This would mean that we have to load the Fsimage from disk into
memory. We would also have to apply the latest Edit Log entries to the
Fsimage to keep track of all the transactions. But if we restart the node after
a long time, the Edit Log could have grown in size, and it would take a lot of
time to apply its transactions. During this time, the filesystem would be
offline. Therefore, to solve this problem, we bring in the Secondary Namenode,
which periodically merges the Edit Log into the Fsimage (checkpointing) so
that the Edit Log stays small and restarts stay fast.
But you must be wondering: since each block is replicated (three times by
default), doesn’t that mean we are taking up too much storage? For instance, if
we have 5 blocks of 128MB each, replication amounts to 5*128*3 = 1920 MB.
True. But these nodes are commodity hardware. We can easily scale the
cluster to add more of these machines.
The cost of buying machines is much lower than the cost of losing the data!
Now, you must be wondering, how does Namenode decide which Datanode
to store the replicas on? Well, before answering that question, we need to
have a look at what is a Rack in Hadoop.
What is a Rack in Hadoop?
A Rack is a collection of machines (30-40 in Hadoop) that are stored in the
same physical location. There are multiple racks in a Hadoop cluster, all
connected through switches.
Rack awareness
Replica storage is a tradeoff between reliability and read/write bandwidth.
To increase reliability, we need to store block replicas on different racks
and Datanodes to increase fault tolerance, whereas write bandwidth is lowest
when all replicas are stored on the same node. Therefore, Hadoop has a
default strategy to deal with this conundrum, known as the Rack Awareness
algorithm.
For example, if the replication factor for a block is 3, then the first replica is
stored on the same Datanode on which the client writes. The second replica
is stored on a different rack, on a randomly chosen Datanode. The third
replica is stored on the same rack as the second but on a different Datanode,
again chosen randomly. If the replication factor were higher, the subsequent
replicas would be stored on random Datanodes in the cluster.
Features of HDFS
The key features of HDFS are:
1. Cost-effective:
In the HDFS architecture, the DataNodes, which store the actual data, are
inexpensive commodity hardware, which reduces storage costs.
3. Replication
Data Replication is one of the most important and unique features of HDFS. In
HDFS replication of data is done to solve the problem of data loss in unfavorable
conditions like crashing of a node, hardware failure, and so on.
The data is replicated across a number of machines in the cluster by creating replicas
of blocks. The process of replication is maintained at regular intervals of time by
HDFS and HDFS keeps creating replicas of user data on different machines present
in the cluster.
Hence, whenever any machine in the cluster crashes, the user can access their
data from other machines that contain the blocks of that data, so user data is
not lost.
4. Fault Tolerance
If any of the machines containing data blocks fail, other DataNodes containing
replicas of those data blocks are still available. This ensures no loss of data and
makes the system reliable even in unfavorable conditions.
Hadoop 3 introduced Erasure Coding to provide Fault Tolerance. Erasure Coding in
HDFS improves storage efficiency while providing the same level of fault tolerance
and data durability as traditional replication-based HDFS deployment.
To study the fault tolerance features in detail, refer to Fault Tolerance.
5. High Availability
The High availability feature of Hadoop ensures the availability of data even during
NameNode or DataNode failure.
Since HDFS creates replicas of data blocks, if any of the DataNodes goes down, the
user can access his data from the other DataNodes containing a copy of the same
data block.
Also, if the active NameNode goes down, the passive node takes the responsibility of
the active NameNode. Thus, data will be available and accessible to the user even
during a machine crash.
To study the high availability feature in detail, refer to the High Availability article.
6. Scalability
As HDFS stores data on multiple nodes in the cluster, when requirements increase,
we can scale the cluster.
There are two scalability mechanisms available. Vertical scalability: add more
resources (CPU, memory, disk) to the existing nodes of the cluster.
The other way is horizontal scalability: add more machines to the cluster. The
horizontal way is preferred, since we can scale the cluster from tens of nodes to
hundreds of nodes on the fly without any downtime.
7. Data Integrity
Data integrity refers to the correctness of data. HDFS ensures data integrity by
constantly checking the data against the checksum calculated during the write of the
file.
While file reading, if the checksum does not match with the original checksum, the
data is said to be corrupted. The client then opts to retrieve the data block from
another DataNode that has a replica of that block. The NameNode discards the
corrupted block and creates an additional new replica.
8. High Throughput
Hadoop HDFS stores data in a distributed fashion, which allows data to be processed
parallelly on a cluster of nodes. This decreases the processing time and thus provides
high throughput.
9. Data Locality
Data locality means moving computation logic to the data rather than moving data to
the computational unit.
In a traditional system, the data is brought to the application layer and then
processed.
But in the present scenario, due to the massive volume of data, bringing data to
the application layer degrades the network performance.
In HDFS, we bring the computation part to the DataNodes where the data
resides. Hence, with Hadoop HDFS, we move the computation logic to the data
rather than moving the data to the computation logic. This feature reduces
bandwidth utilization in the system.
Importance of MapReduce
Apache Hadoop is a software framework that processes and stores big data across the
cluster of commodity hardware. Hadoop is based on the MapReduce model for
processing huge amounts of data in a distributed manner.
This section lists several features of MapReduce. After reading it, you will
clearly understand why MapReduce is a good fit for processing vast amounts of
data.
First, we will see a small introduction to the MapReduce framework; then we
will explore its various features.
1. Scalability
Apache Hadoop is a highly scalable framework. This is because of its ability to
store and distribute huge data sets across plenty of servers. These servers are
inexpensive and can operate in parallel. We can easily scale the storage and
computation power by adding servers to the cluster.
2. Flexibility
The MapReduce framework also provides support for multiple languages and for
data from sources ranging from email and social media to clickstreams.
MapReduce processes data as simple key-value pairs and thus supports data
types including metadata, images, and large files. Hence, MapReduce is more
flexible in dealing with such data than a traditional DBMS.
3. Security and Authentication
The MapReduce programming model uses the HBase and HDFS security
mechanisms, which allow only authenticated users to operate on the data. Thus,
it protects against unauthorized access to system data and enhances system
security.
4. Cost-effective solution
Because Hadoop stores and processes data on clusters of inexpensive commodity
hardware, MapReduce offers a cost-effective solution for large-scale data
processing.
5. Fast
The tools used for data processing, such as MapReduce programs, are generally
located on the same servers where the data resides, which allows for faster
processing of data.
So, even when dealing with large volumes of unstructured data, Hadoop
MapReduce can process terabytes of data in minutes and petabytes within
hours.
6. Simple Model of Programming
Amongst the various features of Hadoop MapReduce, one of the most important
is that it is based on a simple programming model. This allows programmers to
develop MapReduce programs that can handle tasks easily and efficiently.
The MapReduce programs can be written in Java, which is not very hard to pick up
and is also used widely. So, anyone can easily learn and write MapReduce programs
and meet their data processing needs.
7. Parallel Programming
One of the major aspects of the working of MapReduce programming is its parallel
processing. It divides the tasks in a manner that allows their execution in parallel.
The parallel processing allows multiple processors to execute these divided tasks. So
the entire program is run in less time.
8. Availability and Fault Tolerance
Whenever data is sent to an individual node, the same data is also forwarded to
other nodes in the cluster. So, if any particular node suffers a failure, there are
always other copies present on other nodes that can still be accessed whenever
needed. This assures high availability of data.
One of the major features offered by Apache Hadoop is its fault tolerance. The
Hadoop MapReduce framework has the ability to quickly recognizing faults that
occur.
It then applies a quick and automatic recovery solution. This feature makes it a
game-changer in the world of big data processing.
A file in an HDFS namespace is split into several blocks, and data nodes store
these blocks. The name node maps the blocks to the data nodes, which handle
the read and write operations on the file system. Furthermore, the data nodes
perform tasks such as block creation and deletion as instructed by the name
node.
What is MapReduce
MapReduce is a software framework that allows writing applications to process big
data simultaneously on large clusters of commodity hardware. This framework
consists of a single master job tracker and one slave task tracker per cluster node.
Also, there are two tasks associated with MapReduce: the map task and the
reduce task. The map task takes input data and turns it into (key, value) pairs,
while the reduce task takes the output of the map task as input and combines
those tuples into a smaller set of aggregated tuples. The map task is always
performed before the reduce task.
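To make the map and reduce tasks concrete, here is a hedged word-count sketch
using the standard Hadoop MapReduce Java API. The mapper emits (word, 1) pairs
and the reducer sums them; the class names and the input/output paths passed on
the command line are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: turn each input line into (word, 1) key-value pairs.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce task: sum the counts emitted for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // local pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Such a jar would typically be submitted with something like `hadoop jar
wordcount.jar WordCount /input /output`; the framework then splits the input,
schedules the map and reduce tasks across the cluster, and re-runs failed tasks.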
Definition
HDFS is a Distributed File System that reliably stores large files across machines in a
large cluster. In contrast, MapReduce is a software framework for easily writing
applications which process vast amounts of data in parallel on large clusters of
commodity hardware in a reliable, fault-tolerant manner. These definitions explain
the main difference between HDFS and MapReduce.
Main Functionality
Another difference between HDFS and MapReduce is that the HDFS provides high-
performance access to data across highly scalable Hadoop clusters while MapReduce
performs the processing of big data.
Introducing HBase
Since the 1970s, the RDBMS has been the solution for data storage and
maintenance problems. After the advent of big data, companies realized the
benefit of processing big data and started opting for solutions like Hadoop.
Hadoop uses a distributed file system for storing big data and MapReduce to
process it. Hadoop excels at storing and processing huge volumes of data in
various forms: arbitrary, semi-structured, or even unstructured.
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a
sequential manner. That means one has to search the entire dataset even for the
simplest of jobs.
A huge dataset when processed results in another huge data set, which should also be
processed sequentially. At this point, a new solution is needed to access any point of
data in a single unit of time (random access).
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable.
HBase’s data model is similar to Google’s Bigtable, designed to provide quick
random access to huge amounts of structured data. It leverages the fault
tolerance provided by the Hadoop Distributed File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write
access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the
Hadoop File System and provides read and write access.
HDFS and HBase compared:
• HDFS is a distributed file system suitable for storing large files, whereas
HBase is a database built on top of HDFS.
• HDFS does not support fast individual record lookups, whereas HBase
provides fast lookups for large tables.
• HDFS provides high-latency batch processing, whereas HBase provides
low-latency access to single rows from billions of records (random access).
• HDFS provides only sequential access to data, whereas HBase internally uses
hash tables, provides random access, and stores the data in indexed HDFS
files for faster lookups.
HBase is column-oriented, whereas a traditional RDBMS is row-oriented:
• Row-oriented databases are designed for a small number of rows and
columns, whereas column-oriented databases are designed for huge tables.
• HBase is built for wide tables and is horizontally scalable, whereas an RDBMS
is thin, built for small tables, and hard to scale.
Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients (a minimal usage sketch follows this list).
• It provides data replication across clusters.
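As a hedged illustration of that Java client API, the sketch below writes one cell
and reads it back using the standard HBase client classes. The table name "users",
the column family "info", and the row key are made-up; the table is assumed to
already exist, and an hbase-site.xml is assumed to be on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml (ZooKeeper quorum etc.) from the classpath.
        Configuration conf = HBaseConfiguration.create();

        // Table "users" with column family "info" is assumed to already exist.
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write: one row with a single cell in family "info".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Asha"));
            table.put(put);

            // Random read: fetch that row back by its key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```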
Applications of HBase
• It is used whenever there is a need to support write-heavy applications.
• HBase is used whenever we need to provide fast random access to available
data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase History
Oct 2007 – The first usable HBase, bundled with Hadoop 0.15.0, was released.
HBase - Architecture
In HBase, tables are split into regions and are served by the region servers.
Regions are vertically divided by column families into “Stores”. Stores are saved
as files in HDFS.
Note: The term ‘store’ is used for regions to explain the storage structure.
HBase has three major components: the client library, a master server, and region
servers. Region servers can be added or removed as per requirement.
MasterServer
The master server -
• Assigns regions to the region servers and takes the help of Apache ZooKeeper
for this task.
• Handles load balancing of the regions across region servers. It unloads the
busy servers and shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations such as
creation of tables and column families.
Regions
Regions are nothing but tables that are split up and spread across the region servers.
Region server
The region servers have regions that -
• Decide the size of the region by following the region size thresholds.
When we take a deeper look into a region server, we see that it contains regions
and stores:
Each store contains a MemStore and HFiles. The MemStore acts like a cache:
anything that is written to HBase is stored here initially. Later, the data is
transferred and saved in HFiles as blocks, and the MemStore is flushed.
Apache Hive is a data warehousing tool built on top of Hadoop and used for
extracting meaningful information from data. Data warehousing is all about
storing all kinds of data generated from different sources at the same
location. The data is mostly available in three forms: structured (SQL
databases), semi-structured (XML or JSON), and unstructured (music or
video). To process the structured data available in tabular format, we use
Hive on top of Hadoop. Hive is powerful enough to query petabytes (PB) of
data efficiently.
MapReduce is the default model for programming on Hadoop, in Java or some
other language, so Hive was mainly designed for developers who are
comfortable with SQL. With Hive, people who are not very comfortable with
Java can also process data over Hadoop. Using Hive also makes it easier to
query structured data, because writing Java code is harder than writing HQL.
HQL, or HiveQL, is the query language used to work with Hive; its syntax is
very similar to SQL, which makes Hive very easy to use.
Features:
• Supported data structures – Partition, bucket, and table are the 3 data
structures that Hive supports.
• Helps in processing unstructured data – We can easily embed custom
MapReduce code with Hive to process unstructured data.
Limitations:
• Doesn’t support subqueries – Subqueries are not supported.
• Only non-real-time (cold) data – Hive is not used for real-time data querying,
since it is built for batch-style processing and takes time to produce results.
• Transaction processing is not supported – HQL does not support the
transaction processing feature.
Features of Pig
Apache Pig comes with the following features −
• Rich set of operators − It provides many operators to perform operations
like join, sort, filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a
Pig script if you are good at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their
execution automatically, so the programmers need to focus only on semantics
of the language.
• Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
• UDFs − Pig provides the facility to create User-Defined Functions in other
programming languages such as Java and invoke or embed them in Pig
Scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
Pig Latin is the language used to analyze data in Hadoop using Apache Pig.
In this chapter, we are going to discuss the basics of Pig Latin, such as Pig
Latin statements, data types, general and relational operators, and Pig
Latin UDFs.
As discussed in the previous chapters, the data model of Pig is fully nested.
A relation is the outermost structure of the Pig Latin data model. It is a bag,
where a bag is a collection of tuples, a tuple is an ordered set of fields, and a
field is a single piece of data.
What is Sqoop?
For Hadoop developers, the interesting work starts after data is loaded into
HDFS. Developers play around with the data in order to find the insights
concealed in that big data. For this, the data residing in relational database
management systems needs to be transferred to HDFS, worked on, and possibly
transferred back to the relational database management system. In the big data
world, developers find this transfer of data between relational database systems
and HDFS uninteresting and tedious, yet it is frequently required. Developers
can always write custom scripts to transfer data in and out of Hadoop, but
Apache Sqoop provides an alternative; its main features are listed below,
followed by a small launcher sketch.
1. Full Load
2. Incremental Load
3. Parallel import/export
4. Import results of SQL query
5. Compression
6. Connectors for all major RDBMS Databases
7. Kerberos Security Integration
8. Load data directly into Hive/HBase
9. Support for Accumulo
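As a hedged illustration of a parallel import (feature 3 above), the sketch below
simply shells out to the sqoop command-line tool from Java. The JDBC URL,
credentials file, table name, and target directory are placeholders; sqoop is
assumed to be on the PATH of a cluster edge node.

```java
import java.io.IOException;

public class SqoopImportLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        // JDBC URL, credentials file, table, and target directory are placeholders.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost/sales",
                "--username", "report",
                "--password-file", "/user/report/.sqoop_pwd",
                "--table", "orders",
                "--target-dir", "/user/report/orders",
                "--num-mappers", "4");   // parallel import with 4 map tasks

        pb.inheritIO();                  // stream Sqoop's console output
        int exitCode = pb.start().waitFor();
        System.out.println("sqoop exited with code " + exitCode);
    }
}
```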
Zookeeper
• Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization,
etc.
• Zookeeper has ephemeral nodes representing different region servers. Master
servers use these nodes to discover available servers (see the client sketch
after this list).
• In addition to availability, the nodes are also used to track server failures or
network partitions.
• Clients communicate with region servers via ZooKeeper.
• In pseudo-distributed and standalone modes, HBase itself takes care of
ZooKeeper.
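The following hedged sketch shows the ephemeral-node idea with the standard
ZooKeeper Java client: a process registers an ephemeral znode that disappears
when its session ends, which is the same mechanism HBase region servers use to
advertise themselves. The quorum address and the znode path are placeholders.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralNodeExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Placeholder quorum address; 5-second session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // An ephemeral znode disappears automatically when this session ends,
        // the same mechanism region servers use to advertise themselves.
        String path = zk.create("/worker-1", "alive".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        System.out.println("registered ephemeral node at " + path);

        byte[] data = zk.getData(path, false, null);
        System.out.println("data = " + new String(data));

        zk.close();   // the session ends and the ephemeral node is removed
    }
}
```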
viii. Speed
ZooKeeper is especially fast in read-dominant workloads, where reads outnumber
writes by a ratio of around 10:1.
ix. Scalability
The performance of ZooKeeper can be enhanced by deploying more machines.
x. How is the order beneficial?
Ordering is required to implement higher-level abstractions such as
synchronization primitives.
xi. ZooKeeper is fast
ZooKeeper works very fast in “read-dominant” workloads.
xii. Ordered Messages
ZooKeeper keeps track of updates by stamping each one with a number denoting
its order.
xiii. Serialization
Serialization ensures that updates are applied consistently; this can be used, for
example, in MapReduce to coordinate queues and running threads.
xiv. Reliability
Once an update has been applied, it persists until a client overwrites it.
xv. Atomicity
Either a data transfer succeeds or it fails completely.
xvi. Sequential Consistency
Updates from a client are applied in the same order in which they were sent.
xvii. Single System Image
Regardless of the server to which it connects, a client sees the same view of the
service.
xviii. Timeliness
The client’s view of the system is guaranteed to be up-to-date within a bounded
amount of time.
Apache Flume is mainly used for loading log data from different sources into
Hadoop HDFS. It is a highly robust and available service, and it is extensible,
fault-tolerant, and scalable.
1. Open-source
Apache Flume is an open-source distributed system. So it is available free of
cost.
2. Data flow
Apache Flume allows its users to build multi-hop, fan-in, and fan-out flows.
It also allows for contextual routing as well as backup routes (fail-over) for
the failed hops.
3. Reliability
In apache flume, the sources transfer events through the channel. The
flume source puts events in the channel which are then consumed by the
sink. The sink transfers the event to the next agent or to the terminal
repository (like HDFS).
The events in the flume channel are removed only when they are stored in
the next agent channel or in the terminal repository.
4. Recoverability
The flume events are staged in a flume channel on each flume agent. This
manages recovery from failure. Also, Apache Flume supports a durable File
channel. File channels can be backed by the local file system.
5. Steady flow
Apache Flume offers a steady flow of data between read and write operations.
When the rate at which data arrives exceeds the rate at which it can be written
to the destination, Apache Flume acts as a mediator between the data producers
and the centralized stores, thus offering a steady flow of data between them.
6. Latency
Apache Flume caters to high throughput with lower latency.
7. Ease of use
With Flume, we can ingest the stream data from multiple web servers and
store them to any of the centralized stores such as HBase, Hadoop HDFS,
etc.
Along with the log files, Apache Flume can also be used for importing huge
volumes of data produced by e-commerce sites like Flipkart, Amazon, and
networking sites like Twitter, Facebook.
11. Streaming
Apache Flume gives us a reliable solution that helps us ingest online streaming
data from different sources (such as email messages, network traffic, log files,
social media, etc.) into HDFS.
13. Inexpensive
It is an inexpensive system. It is less costly to install and operate. Its
maintenance is very economical.
14. Configuration
Apache Flume uses a simple, declarative configuration.
15. Documentation
Flume provides complete documentation with many good examples and
patterns, which help its users learn how Flume can be used and configured.
2. Duplicate events
Apache Flume does not guarantee that the messages it delivers are 100%
unique; in many scenarios, duplicate messages may appear.
3. Low scalability
Flume’s scalability is often questioned because sizing the hardware for a typical
Apache Flume deployment is tricky and, in most cases, a matter of trial and
error.
4. Reliability issue
The throughput that Apache Flume can handle depends highly upon the backing
store of the channel. So, if the backing store is not chosen wisely, then there may be
scalability and reliability issues.
5. Complex topology
It has complex topology and reconfiguration is challenging.
Oozie
Oozie is a workflow scheduler system that manages Apache Hadoop jobs.
The Oozie system operates by running workflows of dependent jobs and permits
users to create Directed Acyclic Graphs (DAGs) of workflows. These DAGs can be
run in parallel and sequentially in Hadoop.
An action node represents a workflow task, such as moving files into HDFS or
running a MapReduce job, while a control-flow node controls the workflow
execution between actions through constructs such as conditional logic, so that
which actions run next can depend on the result of earlier action nodes.
Oozie can be integrated with the rest of the Hadoop stack supporting several types of
Hadoop jobs out of the box
• Having a client API and command line interface which can be used to launch,
control and monitor jobs from Java applications (a minimal client sketch
follows this list).
• Using its Web Service APIs one can control jobs from anywhere.
• Having provisions to execute jobs which are scheduled to run periodically.
• Having provision to send email notifications upon completion of jobs.
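Building on that client API, here is a hedged sketch of submitting and checking a
workflow with the Oozie Java client. The Oozie URL, the HDFS application path,
and the nameNode/jobTracker properties are placeholders that would normally
match the workflow’s own configuration.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // The Oozie server URL is a placeholder for a real deployment.
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties; the HDFS application path must contain a workflow.xml.
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/apps/etl-workflow");
        conf.setProperty("nameNode", "hdfs://namenode:9000");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // run() submits and starts the workflow; the returned id is used for monitoring.
        String jobId = client.run(conf);
        System.out.println("submitted workflow " + jobId);

        // Poll once for the current status (RUNNING, SUCCEEDED, KILLED, ...).
        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("status: " + job.getStatus());
    }
}
```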