Big Data Notes - 2 Unit
Apache Hadoop
Hadoop is an open-source framework, overseen by the Apache Software Foundation and written in Java, for storing and processing huge datasets on clusters of commodity hardware. There are two main problems with big data: the first is storing such a huge amount of data, and the second is processing that stored data. Traditional approaches such as RDBMS are not sufficient because of the heterogeneity of the data. Hadoop therefore emerged as the solution to the problem of big data, i.e. storing and processing it, along with some extra capabilities.
History of Hadoop
Hadoop's story begins with Doug Cutting and Mike Cafarella in 2002, when they both started working on the Apache Nutch project. Apache Nutch aimed to build a search engine system that could index a billion pages. After a lot of research on Nutch, they concluded that such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which is very expensive. They realized that their project architecture would not be capable of working with billions of pages on the web, so they looked for a feasible solution that could reduce the implementation cost as well as solve the problem of storing and processing large datasets.
In 2003, they came across a paper published by Google that described the architecture of Google's distributed file system, called GFS (Google File System), for storing large datasets. They realized that this paper could solve their problem of storing the very large files being generated by the web crawling and indexing processes. But this paper was only half the solution to their problem.
In 2004, Google published another paper, on the MapReduce technique, which was the solution for processing those large datasets. This paper was the other half of the solution for Doug Cutting and Mike Cafarella's Nutch project. Both techniques (GFS and MapReduce) existed only as papers from Google; Google did not release open-source implementations of them. Doug Cutting knew from his work on Apache Lucene (a free and open-source information retrieval software library, originally written in Java by Doug Cutting in 1999) that open source is a great way to spread a technology to more people. So, together with Mike Cafarella, he started implementing Google's techniques (GFS and MapReduce) as open source within the Apache Nutch project.
In 2005, Cutting found that Nutch was limited to clusters of only 20 to 40 nodes. He soon realized two problems:
(a) Nutch would not achieve its potential until it ran reliably on larger clusters, and
(b) that looked impossible with just two people (Doug Cutting and Mike Cafarella).
The engineering task in the Nutch project was much bigger than he had realized, so he started looking for a company interested in investing in their efforts. He found Yahoo!, which had a large team of engineers eager to work on the project.
So in 2006, Doug Cutting joined Yahoo along with the Nutch project. He wanted to provide the world with an open-source, reliable, scalable computing framework, with the help of Yahoo. At Yahoo he first separated the distributed computing parts from Nutch and formed a new project, Hadoop (he gave it the name Hadoop after a yellow toy elephant that belonged to his son).
In brief, the early timeline was as follows:
In 2002, Doug Cutting and Mike Cafarella started to work on Apache Nutch, an open-source web crawler software project.
While working on Apache Nutch they were dealing with big data, and storing that data turned out to be very costly, which constrained the project. This problem became one of the important reasons for the emergence of Hadoop.
In 2003, Google introduced a file system known as GFS (Google File System), a proprietary distributed file system developed to provide efficient access to data.
In 2004, Google released a white paper on MapReduce, a technique that simplifies data processing on large clusters.
In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System), which also included MapReduce.
HDFS (Hadoop Distributed File System)
This is where Hadoop comes in: it provides one of the most reliable file systems. HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. It is designed to store extremely large files with a streaming data access pattern, and it runs on commodity hardware, working by rapidly transferring data between nodes. It is often used by companies that need to handle and store big data, and it is a key component of many Hadoop systems because it provides a means of managing big data as well as supporting big data analytics.
HDFS is built around a few assumptions:
Extremely large files: Here we are talking about data in the range of petabytes (1000 TB).
Streaming data access pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times.
Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.
HDFS is not a good fit in the following situations:
Low-latency data access: Applications that need very quick access to the first record should not use HDFS, because it gives importance to throughput over the whole dataset rather than the time to fetch the first record.
Small file problem: Having lots of small files results in lots of seeks and lots of movement from one data node to another to retrieve each small file, and this whole process is a very inefficient data access pattern.
Multiple writes: HDFS should not be used when a file has to be written to multiple times, since it follows the write-once model.
The NameNode is the "smart" node in the cluster. It knows exactly which data node contains
which blocks and where the data nodes are located within the machine cluster. The NameNode
also manages access to the files, including reads, writes, creates, deletes and replication of data
blocks across different data nodes.
The NameNode operates in a “loosely coupled” way with the data nodes. This means the
elements of the cluster can dynamically adapt to the real-time demand of server capacity by
adding or subtracting nodes as the system sees fit.
The data nodes constantly communicate with the NameNode to see whether they need to complete a certain task. This constant communication ensures that the NameNode is aware of each data
node’s status at all times. Since the NameNode assigns tasks to the individual datanodes, should
it realize that a datanode is not functioning properly it is able to immediately re-assign that
node’s task to a different node containing that same data block. Data nodes also communicate
with each other so they can cooperate during normal file operations. Clearly the NameNode is
critical to the whole system and should be replicated to prevent system failure.
Again, data blocks are replicated across multiple data nodes and access is managed by the
NameNode. This means that when a data node no longer sends a "life signal" (heartbeat) to the NameNode, the NameNode unmaps that data node from the cluster and keeps operating with the other data nodes
as if nothing had happened. When this data node comes back to life or a different (new) data
node is detected, that new data node is (re-)added to the system. That is what makes HDFS
resilient and self-healing. Since data blocks are replicated across several data nodes, the failure
of one server will not corrupt a file.
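To make the NameNode/DataNode split concrete, the sketch below uses the HDFS Java client to write a small file and then ask the NameNode where the blocks of that file are stored. This is a minimal, hedged sketch: the cluster address and file path are placeholders, not values from these notes, and it assumes the Hadoop client libraries are on the classpath.

// Minimal HDFS client sketch. The NameNode URI and path are illustrative assumptions.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; use your cluster's fs.defaultFS in practice.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/demo/notes.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs"); // the bytes themselves go to DataNodes
        }

        // The NameNode answers metadata questions: which blocks exist and on which hosts.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block hosts: " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

Each printed line corresponds to one block of the file; with a replication factor of 3, every block should be reported on three different DataNodes.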
Features of HDFS
There are several features that make HDFS particularly useful, including:
Data replication. This is used to ensure that the data is always available and prevents data loss. For example, when a node crashes or there is a hardware failure, replicated data can be pulled from elsewhere within the cluster, so processing continues while the data is recovered. (A small configuration sketch follows this feature list.)
Fault tolerance and reliability. HDFS' ability to replicate file blocks and store them
across nodes in a large cluster ensures fault tolerance and reliability.
High availability. As mentioned earlier, because of replication across nodes, data remains available even if the NameNode or a DataNode fails.
Scalability. Because HDFS stores data on various nodes in the cluster, as requirements
increase, a cluster can scale to hundreds of nodes.
High throughput. Because HDFS stores data in a distributed manner, the data can be processed in parallel on a cluster of nodes. This, plus data locality (see next bullet), cuts the processing time and enables high throughput.
Data locality. With HDFS, computation happens on the DataNodes where the data
resides, rather than having the data move to where the computational unit is. By
minimizing the distance between the data and the computing process, this approach
decreases network congestion and boosts a system's overall throughput.
Distributed data storage. This is one of the most important features of HDFS and makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
Portability. HDFS is designed in such a way that it can easily be ported from one platform to another.
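The replication factor and block size behind several of the features above are ordinary configuration properties. Below is a minimal, hedged sketch of setting them programmatically (property names as in Hadoop 2+); in practice they are usually set cluster-wide in hdfs-site.xml, and the values shown are only examples.

import org.apache.hadoop.conf.Configuration;

public class HdfsTuning {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Keep three copies of every block (the usual default).
        conf.setInt("dfs.replication", 3);
        // Use 128 MB blocks; larger blocks mean fewer blocks for the NameNode to track.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        return conf;
    }
}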
Goals of HDFS
Fault detection and recovery − Since HDFS consists of a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick, automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications
having huge datasets. HDFS accommodates applications that have data sets typically gigabytes
to terabytes in size.
Hardware at data − A requested task can be done efficiently, when the computation takes place
near the data. Especially where huge datasets are involved, it reduces the network traffic and
increases the throughput.
Access to streaming data - HDFS is intended more for batch processing versus interactive use,
so the emphasis in the design is for high data throughput rates, which accommodate streaming
access to data sets.
Coherence model − Applications that run on HDFS are required to follow the write-once-read-many approach, so a file, once created, need not be changed. However, it can be appended to and truncated.
Disadvantages of HDFS
1. Problem with small files :- Hadoop performs efficiently over a small number of large files. Hadoop stores files in the form of blocks, which are 128 MB in size by default (often configured up to 256 MB). Hadoop struggles when it needs to access a large number of small files, because so many small files overload the NameNode and make it difficult to work.
2. Vulnerability :- Hadoop is a framework written in Java, and Java is one of the most commonly used programming languages, which makes Hadoop a more familiar and attractive target that can be exploited by cyber-criminals.
3. Low performance in small data surroundings :- Hadoop is mainly designed for dealing with large datasets, so it can be efficiently utilized by organizations that generate a massive volume of data. Its efficiency decreases when it operates in small data surroundings.
4. Lack of security :- Data is everything for an organization, yet by default the security features in Hadoop are disabled. So the administrator needs to be careful about this security aspect and take appropriate action. Hadoop uses Kerberos for its security features, which is not easy to manage, and storage and network encryption are missing from Kerberos, which makes it even more of a concern.
5. High processing overhead :- Read/write operations in Hadoop are expensive, since we are dealing with data of large size, in terabytes or petabytes. In Hadoop, data is read from and written to disk, which makes it difficult to perform in-memory computation and leads to processing overhead.
Components of Hadoop
1. MapReduce :- MapReduce is the data processing layer of Hadoop. It is a software framework for easily writing applications that process vast amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). It processes huge amounts of data in parallel by dividing the submitted job into a set of independent tasks (sub-jobs).
In Hadoop, MapReduce works by breaking the processing into two phases: Map and Reduce. Map is the first phase of processing, where we specify all the complex logic/business rules/costly code. Reduce is the second phase of processing, where we specify lightweight processing such as aggregation or summation. A classic word-count example is sketched below.
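The standard word-count example makes the two phases concrete: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word. This is a minimal sketch using the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the input and output paths are supplied as command-line arguments and are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: lightweight aggregation -- sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

When such a job is submitted, the framework splits the input among parallel map tasks and routes all counts for the same word to a single reduce task.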
2. HDFS :- The Hadoop Distributed File System is the storage layer of Hadoop. It follows the master-slave pattern: the NameNode acts as the master and stores the metadata of the data nodes, while the DataNodes act as slaves and store the actual data on their local disks in parallel.
3. YARN :- YARN (Yet Another Resource Negotiator) is used for resource allocation. It is the resource-management layer of Hadoop, and it allows multiple data processing engines, such as real-time streaming, data science and batch processing, to handle data stored on a single platform.
4. Hadoop Common: Hadoop Common refers to the collection of common utilities and
libraries that support other Hadoop modules. It is an essential part or module of the Apache
Hadoop Framework, along with the Hadoop Distributed File System (HDFS), Hadoop YARN
and Hadoop MapReduce. Like all other modules, Hadoop Common assumes that hardware
failures are common and that these should be automatically handled in software by the
Hadoop Framework. These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
Hadoop File Formats
1. Text/CSV Files :- Text/CSV is the most commonly used data file format. It is the most human-readable and is ubiquitously easy to parse. It is the format of choice when exporting data from an RDBMS table. It can be read or written in any programming language and is mostly delimited by commas or tabs. Text and CSV files are quite common, and Hadoop developers and data scientists frequently receive text and CSV files to work on.
2. JSON Records :- JSON is a text format that stores metadata along with the data, so it fully supports schema evolution: you can easily add or remove attributes for each datum. JSON Records files contain one JSON datum per line, and because the metadata travels with the data the files are splittable. However, because JSON is a text format, it does not support block compression.
3. Avro Files :- The Avro file format offers efficient storage due to its optimized binary encoding. It is widely supported both inside and outside the Hadoop ecosystem, and it is ideal for long-term storage of important data. It can be read from and written to in many languages, such as Java, C, C++, C# and Python.
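As a rough illustration, the sketch below writes a small Avro data file with the Avro Java library; the record schema and field names are invented for this example, not taken from these notes.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteDemo {
    public static void main(String[] args) throws Exception {
        // The schema is stored with the data, which is what enables schema evolution.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro")); // binary container file
            writer.append(user);
        }
    }
}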
4. Sequence Files :- The SequenceFile format can be used to store binary data such as images. SequenceFiles store key-value pairs in a binary container format and are more efficient than text files. However, sequence files are not human-readable.
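The following is a minimal sketch of the key-value container idea, using the Hadoop 2+ SequenceFile.Writer options API; the output path and the dummy byte arrays standing in for image data are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("images.seq"); // illustrative output path

        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            // Keys could be file names, values the raw bytes of each small file/image.
            writer.append(new Text("cat.jpg"), new BytesWritable(new byte[]{1, 2, 3}));
            writer.append(new Text("dog.jpg"), new BytesWritable(new byte[]{4, 5, 6}));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}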
5. RC Files :- The RCFile (Record Columnar File) format was the first columnar format in Hadoop and has significant compression and query performance benefits. But it does not support schema evolution; if you want to add anything to an RC file, you have to rewrite the whole file. Writing RC files is also a slower process.
6. ORC Files :- The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data and was designed to overcome the limitations of other file formats. ORC is essentially a compressed, improved version of the RC file: it keeps all the benefits of RC files with some enhancements, compressing better than RC files and enabling faster queries. However, it still does not support schema evolution. It is worth noting that ORC is a format primarily backed by Hortonworks, and it is not supported by Cloudera Impala.
7. Parquet Files :- The Parquet file format is also a columnar format. Just like ORC, it compresses well and offers great query performance; it is especially efficient when querying data from specific columns. The Parquet format is computationally intensive on the write side, but it reduces a lot of I/O cost to deliver great read performance. It enjoys more freedom than ORC in schema evolution, in that new columns can be added at the end of the structure. It is also backed by Cloudera and optimized for Impala.
Big Data Tools
1. Apache Spark
Apache Spark is an open-source processing engine designed for ease of analytics operations. It is a cluster computing platform that is designed to be fast and built for general-purpose use. Spark covers various workloads: batch applications, machine learning, streaming data processing, and interactive queries. (A short code sketch follows the feature list below.)
Features of Spark:
In memory processing
Tight integration of components
Easy and inexpensive
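To make the in-memory idea concrete, here is a small, hedged sketch using Spark's Java API (Spark 2.x style); the application name, master URL and input path are placeholders, not values from these notes.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("log-scan").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("logs.txt"); // placeholder input file
        lines.cache(); // keep the RDD in memory so later actions reuse it

        long total = lines.count();                                   // first pass reads from disk
        long errors = lines.filter(s -> s.contains("ERROR")).count(); // second pass hits memory
        System.out.println(total + " lines, " + errors + " errors");

        sc.stop();
    }
}

Because the RDD is cached after the first action, the second pass over the data is served from memory instead of re-reading the file, which is where much of Spark's speed advantage over disk-based MapReduce comes from.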
2. Map Reduce
MapReduce is a programming model (you can think of it like an algorithm or a data structure) that runs on the YARN framework. Its primary feature is to perform distributed processing in parallel in a Hadoop cluster, which makes Hadoop work so fast, because when we are dealing with big data, serial processing is no longer practical.
Features of Map-Reduce:
Scalable
Fault Tolerance
Parallel Processing
Tunable Replication
Load Balancing
3. Apache Hive
Apache Hive is a data warehousing tool built on top of Hadoop; data warehousing here simply means storing, at a fixed location, data generated from various sources. Hive is one of the best tools for data analysis on Hadoop, and anyone with knowledge of SQL can comfortably use it. The query language of Hive is known as HQL or HiveQL. A small sketch of querying Hive from Java follows.
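Because HiveQL is so close to SQL, one common way to run it from Java is through Hive's JDBC driver (HiveServer2). The following is only a hedged sketch: the host, port, user, table name and query are invented placeholders, and it assumes the hive-jdbc dependency is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; with JDBC 4 auto-loading this line may be optional.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Host, port and database are assumptions for this sketch.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like ordinary SQL; "sales" is a hypothetical table.
            ResultSet rs = stmt.executeQuery(
                "SELECT region, SUM(amount) FROM sales GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
            }
        }
    }
}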
Features of Hive:
4. Apache Impala
Apache Impala is an open-source SQL engine designed for Hadoop. Impala overcomes the
speed-related issue in Apache Hive with its faster-processing speed. Apache Impala uses similar
kinds of SQL syntax, ODBC driver, and user interface as that of Apache Hive. Apache Impala
can easily be integrated with Hadoop for data analytics purposes.
Features of Impala:
Easy-Integration
Scalability
Security
In Memory data processing
5. Apache Mahout
The name Mahout is taken from the Hindi word mahavat, which means elephant rider. Apache Mahout runs its algorithms on top of Hadoop (the elephant), so it is named Mahout. Mahout is mainly used for implementing various machine learning algorithms on Hadoop, such as
classification, collaborative filtering and recommendation. Apache Mahout can also run its machine learning algorithms standalone, without integration with Hadoop.
Features of Mahout:
6. Apache Pig
Pig was initially developed by Yahoo to make programming easier. Apache Pig has the capability to process extensive datasets, as it works on top of Hadoop. Apache Pig is used for analysing massive datasets by representing them as data flows, and it raises the level of abstraction for processing enormous datasets. Pig Latin is the scripting language that developers use to work with the Pig framework; it runs on the Pig runtime.
Features of Pig:
Easy to program
Rich set of operators
Ability to handle various kinds of data
Extensibility
7. HBase
HBase is a non-relational, NoSQL, distributed, column-oriented database. HBase consists of various tables, and each table has multiple rows. Each row has a number of column families, and each column family holds columns that contain key-value pairs. HBase works on top of HDFS (Hadoop Distributed File System). We use HBase for looking up small amounts of data within more massive datasets; a small client sketch follows.
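To make the row / column-family / column / value model concrete, here is a hedged sketch using the HBase Java client (HBase 1.x/2.x API). The table name "users", column family "info" and connection settings are assumptions, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "u1", column family "info", column "city".
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Delhi"));
            table.put(put);

            // Point lookup by row key -- the kind of small read HBase is good at.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}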
Features of HBase:
8. Apache Sqoop
Sqoop is a command-line tool developed by Apache. The primary purpose of Apache Sqoop is to import structured data from an RDBMS (relational database management system) such as MySQL, SQL Server or Oracle into HDFS (Hadoop Distributed File System). Sqoop can also export data from HDFS back to an RDBMS.
Features of Sqoop:
9. Tableau
Tableau is data visualization software that can be used for data analytics and business intelligence. It provides a variety of interactive visualizations to showcase the insights in the data, can translate queries into visualizations, and can import data of all ranges and sizes. Tableau offers rapid analysis and processing, and it generates useful charts on interactive dashboards and worksheets.
Features of Tableau:
Tableau supports bar charts, histograms, pie charts, motion charts, bullet charts, Gantt charts and many more
Secure and Robust
Interactive Dashboard and worksheets
10. Apache Storm
Apache Storm is a free, open-source, distributed real-time computation system built using programming languages such as Clojure and Java, and it can be used with many programming languages. Apache Storm is used for stream processing, which it performs very fast. Storm relies on daemons such as Nimbus and Supervisor, coordinated through ZooKeeper. Apache Storm can be used for real-time processing, online machine learning, and much more. Companies such as Yahoo, Spotify and Twitter use Apache Storm.
Features of Storm:
Easy to operate
Each node can process millions of tuples per second
Scalable and Fault Tolerance
Scaling in Hadoop
Scaling :- Scaling is growing an infrastructure (compute, storage, and networking) larger so that the applications riding on that infrastructure can serve more people at a time. When architects talk about the ability of an infrastructure design to scale, they mean that the design as conceived is able to grow larger over time without a wholesale replacement. The terms "scale up" and "scale out" refer to the way in which the infrastructure is grown. The primary benefit of Hadoop is its scalability: one can easily scale the cluster by adding more nodes.
Scale out (horizontal scalability) :- Scaling out is basically the addition of more machines, i.e. growing the cluster. In horizontal scaling, instead of increasing the hardware capacity of an individual machine, you add more nodes to the existing cluster, and most importantly, you can add more machines without stopping the system. Therefore there is no downtime, or "green zone", or anything of that sort while scaling out. In the end, to meet your requirements, you have more machines working in parallel.
[While scaling out involves adding more discrete units to a system in order to add capacity, scaling up
involves building existing units by integrating resources into them.]
Scaling out is appropriate in cases such as the following:
When you need a long-term scaling strategy: The incremental nature of scaling out allows you to scale your infrastructure for expected, long-term data growth. Components can be added or removed depending on your goals.
When upgrades need to be flexible: Scaling out avoids the limitations of depreciating
technology, as well as vendor lock-in for specific hardware technologies.
When storage workloads need to be distributed: Scaling out is perfect for use cases that
require workloads to be distributed across several storage nodes.
Strengths
Embraces newer server technologies: Because the architecture is not constrained by older
hardware, scale-out infrastructure is not affected by capacity and performance issues as much as
scale-up infrastructure.
Adaptability to demand changes: Scale-out architecture makes it easier to adapt to changes in
demand since services and hardware can be removed or added to satisfy demand needs. This also
makes it easy to carry out resource scaling.
Cost management: Scaling out follows an incremental model, which makes costs more
predictable. Furthermore, such a model allows you to pay for the resources required as you need
them.
Weaknesses
Limited rack space: Scale-out infrastructure poses the risk of running out of rack space.
Theoretically, rack space can get to a point where it cannot support increasing demand, showing
that scaling out is not always the approach to handle greater demand.
Increased operational costs: The introduction of more server resources introduces additional
costs, such as licensing, cooling, and power.
Higher upfront costs: Setting up a scale-out system requires a sizable investment, as you’re not
just upgrading existing infrastructure.
Scale up (vertical scalability) :- In vertical scaling, you increase the hardware capacity of an individual machine. In other words, you add more RAM or CPU to your existing system to make it more robust and powerful.
[Alternately, scaling up may involve adding storage capacity to an existing server, whether that's a piece
of hardware or a logically partitioned virtual machine.]
Strengths
Relative speed: Replacing a resource, such as a single processor with a dual processor,
means that the throughput of the CPU is doubled. The same can be done to resources
such as dynamic random access memory (DRAM) to improve dynamic memory
performance.
Weaknesses
Latency: Introducing higher capacity machines may not guarantee that a workload runs
faster. Latency may be introduced in scale-up architecture for a use case such as video
processing, which in turn may lead to worse performance.
Labor and risks: Upgrading the system can be cumbersome, as you may, for instance,
have to copy data to a new server. Switchover to a new server may result in downtime
and poses a risk of data loss during the process.
Aging hardware: The constraints of aging equipment lead to diminishing effectiveness
and efficiency with time. Backup and recovery times are examples of functionality that is
negatively impacted by diminishing performance and capacity.
Hadoop streaming
Hadoop Streaming is a utility that comes with the Hadoop distribution and can be used to execute programs for big data analysis. Hadoop Streaming can be used with languages such as Python, Java, PHP, Scala, Perl, UNIX shell scripts, and many more. The utility allows us to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. It uses Unix streams as the interface between Hadoop and our MapReduce program, so we can use any language that can read standard input and write to standard output to write our MapReduce program. A small sketch follows the list of features below.
Some of the key features associated with Hadoop Streaming are as follows :
Hadoop Streaming is part of the Hadoop distribution.
It facilitates ease of writing Map Reduce programs and codes.
Hadoop Streaming supports almost all types of programming languages such as Python,
C++, Ruby, Perl etc.
The entire Hadoop Streaming framework runs on Java. However, the codes might be
written in different languages as mentioned in the above point.
The Hadoop Streaming process uses Unix Streams that act as an interface between
Hadoop and Map Reduce programs.
Hadoop Streaming uses various Streaming Command Options and the two mandatory
ones are – -input directoryname or filename and -output directoryname
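Hadoop Streaming is usually paired with scripting languages, but any executable that reads standard input and writes tab-separated key-value pairs to standard output will do. Staying with Java for consistency with these notes, here is a hedged sketch of a streaming word-count mapper; the job submission shown in the comments is illustrative only, since the streaming jar path, the HDFS paths and the reducer command vary by installation.

// A streaming mapper is just a program that reads lines from stdin and writes
// "key<TAB>value" lines to stdout. An illustrative submission (paths are assumptions):
//
//   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
//       -input /user/data/input -output /user/data/output \
//       -mapper <your mapper command> -reducer <your reducer command>
//
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1"); // key<TAB>value on stdout
                }
            }
        }
    }
}

A matching reducer would read the sorted "word<TAB>1" lines from standard input and sum the counts for each word before printing the totals.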
Hadoop pipes
Apache Hadoop provides an adapter layer called Pipes, which allows C++ application code to be used in MapReduce programs. Applications that require high numerical performance may see better throughput if written in C++ and used through Pipes. The primary approach is to split the C++ code into a separate process that runs the application-specific code. In many ways, the approach is similar to Hadoop Streaming, but instead of standard input and output, Pipes uses sockets as the channel over which the task communicates with the process running the C++ map or reduce function.
Hadoop Ecosystem
Apache Hadoop ecosystem refers to the various components of the Apache Hadoop software
library; it includes open source projects as well as a complete range of complementary tools.
Some of the most well-known tools of the Hadoop ecosystem include
HDFS
Hadoop Distributed File System is the core component or you can say, the backbone of
Hadoop Ecosystem.
HDFS is the one, which makes it possible to store different types of large data sets (i.e.
structured, unstructured and semi structured data).
HDFS creates a level of abstraction over the resources, from where we can see the whole
HDFS as a single unit.
YARN
Think of YARN as the brain of your Hadoop ecosystem. It performs all your processing activities by allocating resources and scheduling tasks.
MAPREDUCE
APACHE PIG
APACHE HIVE
Facebook created HIVE for people who are fluent with SQL. Thus, HIVE makes them
feel at home while working in a Hadoop Ecosystem.
Basically, HIVE is a data warehousing component which performs reading, writing and
managing large data sets in a distributed environment using SQL-like interface.
APACHE MAHOUT
Now, let us talk about Mahout which is renowned for machine learning. Mahout provides
an environment for creating machine learning applications which are scalable.
So, What is machine learning?
Machine learning algorithms allow us to build self-learning machines that evolve by themselves without being explicitly programmed. Based on user behavior, data patterns and past experiences, they make important future decisions. You can think of it as a branch of Artificial Intelligence (AI).
What does Mahout do?
It performs collaborative filtering, clustering and classification. Some people also consider frequent itemset mining to be one of Mahout's functions. Let us understand them individually:
1. Collaborative filtering: Mahout mines user behaviors, their patterns and their characteristics, and based on these it predicts and makes recommendations to users. The typical use case is an e-commerce website.
2. Clustering: It organizes similar groups of data together; for example, articles can be grouped into blogs, news, research papers, etc.
3. Classification: It learns from already categorized data and assigns new, uncategorized items to the best-fitting category, for example labelling mail as spam or not spam.
Mahout provides a command-line interface to invoke various algorithms. It comes with a predefined library that already contains different built-in algorithms for different use cases.
APACHE SPARK
Apache Spark is a framework for real time data analytics in a distributed computing
environment.
Spark is written in Scala and was originally developed at the University of California, Berkeley.
It executes in-memory computations to increase speed of data processing over Map-
Reduce.
It can be up to 100x faster than Hadoop for large-scale data processing by exploiting in-memory computation and other optimizations; as a trade-off, it requires more memory and processing power than MapReduce.
APACHE HBASE
APACHE DRILL
As the name suggests, Apache Drill is used to drill into any kind of data. It’s an open
source application which works with distributed environment to analyze large data sets.
It is a replica of Google Dremel.
It supports different kinds of NoSQL databases and file systems, which is a powerful feature of Drill. For example: Azure Blob Storage, Google Cloud Storage, HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Swift, NAS and local files.
So, basically the main aim behind Apache Drill is to provide scalability so that we can
process petabytes and exabytes of data efficiently (or you can say in minutes).
The main power of Apache Drill lies in combining a variety of data stores just by
using a single query.
Apache Drill basically follows the ANSI SQL.
It has a powerful scalability factor, supporting millions of users and serving their query requests over large-scale data.
APACHE ZOOKEEPER
Apache ZooKeeper is the coordinator of any Hadoop job that involves a combination of various services in a Hadoop ecosystem.
Apache Zookeeper coordinates with various services in a distributed environment.
Before Zookeeper, it was very difficult and time consuming to coordinate between
different services in the Hadoop ecosystem. The services earlier had many problems with interactions, such as maintaining a common configuration while synchronizing data. Even when the services are configured, changes in their configurations make them complex and difficult to handle. Grouping and naming were also time-consuming.
Due to the above problems, Zookeeper was introduced. It saves a lot of time by
performing synchronization, configuration maintenance, grouping and naming.
Although it’s a simple service, it can be used to build powerful solutions.
APACHE OOZIE
Consider Apache Oozie as a clock and alarm service inside the Hadoop ecosystem. For Hadoop jobs, Oozie acts as a scheduler: it schedules Hadoop jobs and binds them together as one logical unit of work.
There are two kinds of Oozie jobs:
1. Oozie workflow: These are sequential sets of actions to be executed. You can think of it as a relay race, where each athlete waits for the previous one to complete their part.
2. Oozie Coordinator: These are the Oozie jobs which are triggered when the data
is made available to it. Think of this as the response-stimuli system in our body. In the
same manner as we respond to an external stimulus, an Oozie coordinator responds to the
availability of data and it rests otherwise.
APACHE FLUME
APACHE SQOOP
Now, let us talk about another data ingesting service i.e. Sqoop. The major difference
between Flume and Sqoop is that:
Flume only ingests unstructured data or semi-structured data into HDFS.
While Sqoop can import as well as export structured data from RDBMS or Enterprise
data warehouses to HDFS or vice versa.
When we submit a Sqoop command, the main task gets divided into subtasks, each of which is handled internally by an individual map task. A map task is the subtask that imports part of the data into the Hadoop ecosystem; collectively, all the map tasks import the whole data.
For export, the job is similarly mapped into map tasks that bring chunks of data from HDFS. These chunks are exported to a structured data destination, and by combining all these exported chunks of data, we receive the whole data at the destination, which in most cases is an RDBMS (MySQL/Oracle/SQL Server).
APACHE AMBARI
Ambari is an Apache Software Foundation Project which aims at making Hadoop
ecosystem more manageable. It includes software for provisioning, managing and
monitoring Apache Hadoop clusters.
The Ambari provides:
1. Hadoop cluster provisioning:
It gives us step by step process for installing Hadoop services across a number of hosts.
It also handles configuration of Hadoop services over a cluster.
2. Hadoop cluster management:
It provides a central management service for starting, stopping and re-configuring
Hadoop services across the cluster.
3. Hadoop cluster monitoring:
For monitoring health and status, Ambari provides us a dashboard.
The Ambari Alert framework is an alerting service that notifies the user whenever attention is needed, for example if a node goes down or a node is running low on disk space.