Hadoop Tutorial
HADOOP ECOSYSTEM
In the previous blog of this Hadoop Tutorial series, we discussed Hadoop, its features and core components. The next step forward is to understand the Hadoop Ecosystem. This Hadoop Ecosystem blog will familiarize you with the industry-wide used Big Data frameworks. The Hadoop Ecosystem is neither a programming language nor a single service; it is a platform or framework which solves big data problems. You can consider it as a suite which encompasses a number of services inside it. Let us discuss and get a brief idea about how the services work individually and in collaboration.
Below are the Hadoop components that together form the Hadoop ecosystem; I will be covering each of them in this blog.
HDFS
• Hadoop Distributed File System is the core component or you can say, the
backbone of Hadoop Ecosystem.
• HDFS is the component which makes it possible to store different types of large data sets (i.e. structured, unstructured and semi-structured data).
• HDFS creates a level of abstraction over the resources, from where we can see
the whole HDFS as a single unit.
• It helps us in storing our data across various nodes and maintaining the log file
about the stored data (metadata).
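To make this abstraction concrete, here is a minimal sketch (not from the original blog) that writes and reads a file through Hadoop's FileSystem API in Java; the namenode address and the /user/demo path are assumptions for illustration.

```java
// A minimal sketch: writing and reading a file on HDFS through Hadoop's FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed namenode address; in practice this comes from core-site.xml (fs.defaultFS).
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and replicates them across DataNodes,
        // but to the client it looks like a single file in a single namespace.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read it back through the same unified view.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```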
YARN
Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing activities by allocating resources and scheduling tasks. It has two major components: the ResourceManager and the NodeManager.
MAPREDUCE
MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.
1. The Map function performs actions like filtering, grouping and sorting.
2. The Reduce function aggregates and summarizes the results produced by the Map function.
3. The result generated by the Map function is a key value pair (K, V) which acts as the input for the Reduce function.
Let us take an example to understand how a MapReduce program works. Suppose we have a sample of student data and we need to calculate the number of students in each department. Initially, the Map program will execute and calculate the students appearing in each department, producing key value pairs as mentioned above. These key value pairs are the input to the Reduce function. The Reduce function will then aggregate the records of each department, calculate the total number of students in each department and produce the final result.
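Below is a minimal Java sketch of this example, assuming (hypothetically) that each input line looks like "department,studentName". The Mapper emits (department, 1) pairs and the Reducer sums them per department.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StudentCount {

    // Map: emit (department, 1) for every student record.
    public static class DeptMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text dept = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            dept.set(fields[0].trim());
            context.write(dept, ONE);
        }
    }

    // Reduce: sum the 1s for each department.
    public static class DeptReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            context.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "students per department");
        job.setJarByClass(StudentCount.class);
        job.setMapperClass(DeptMapper.class);
        job.setReducerClass(DeptReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```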
APACHE PIG
• PIG has two parts: Pig Latin, the language, and the Pig runtime, the execution environment. You can think of the relationship as that between Java and the JVM.
But don’t be shocked when I say that at the back end of a Pig job, a MapReduce job executes.
• It gives you a platform for building data flow for ETL (Extract, Transform and
Load), processing and analyzing huge data sets.
In PIG, the load command first loads the data. Then we perform various functions on it like grouping, filtering, joining, sorting, etc. At last, either you can dump the data on the screen or you can store the result back in HDFS.
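As a rough illustration of this load, transform and store flow, here is a sketch that drives Pig Latin from Java through the PigServer API; the students.csv file and its two columns are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlowExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // LOAD the data, then GROUP and COUNT it -- each statement is compiled into
        // MapReduce jobs behind the scenes.
        pig.registerQuery("students = LOAD 'students.csv' USING PigStorage(',') AS (dept:chararray, name:chararray);");
        pig.registerQuery("by_dept = GROUP students BY dept;");
        pig.registerQuery("counts = FOREACH by_dept GENERATE group AS dept, COUNT(students) AS total;");

        // Either dump to the console or store the result back into the file system.
        pig.store("counts", "dept_counts");
        pig.shutdown();
    }
}
```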
APACHE HIVE
• The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
• Secondly, Hive is highly scalable, as it can serve both purposes, i.e. large data set processing (batch query processing) and real time processing (interactive query processing).
• You can use predefined functions, or write tailored user defined functions (UDFs), to accomplish your specific needs.
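For illustration, here is a small sketch (not from the blog) that runs an HQL query from Java over JDBC against HiveServer2; the host, port, table and column names are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Standard HiveServer2 JDBC URL; "default" is the database name.
        String url = "jdbc:hive2://localhost:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HQL looks almost exactly like SQL.
            ResultSet rs = stmt.executeQuery(
                "SELECT dept, COUNT(*) AS total FROM students GROUP BY dept");
            while (rs.next()) {
                System.out.println(rs.getString("dept") + " -> " + rs.getLong("total"));
            }
        }
    }
}
```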
APACHE MAHOUT
Mahout provides an environment for creating machine learning applications which are scalable.
Machine learning allows a system to learn and improve by itself without being explicitly programmed. Based on user behaviour, data patterns and past experiences it makes important future decisions. You can call it a descendant of Artificial Intelligence (AI).
Mahout performs collaborative filtering, clustering, classification and frequent itemset mining. Let us understand these functions individually:
1. Collaborative filtering: Mahout mines user behaviours, patterns and characteristics, and based on them it predicts and makes recommendations to users. A typical use case is an e-commerce website.
2. Clustering: It organizes similar groups of data together; for example, articles can be grouped into blogs, news, research papers, etc.
3. Classification: It classifies and categorizes data into predefined categories, for example marking an email as spam or not spam.
4. Frequent itemset mining: Here Mahout checks which objects are likely to appear together and makes suggestions if one of them is missing. For example, a cell phone and a cover are generally bought together, so if you search for a cell phone, it will also recommend the cover and cases.
Mahout provides a command line interface to invoke various algorithms. It ships with a predefined set of libraries which already contain inbuilt algorithms for different use cases.
APACHE SPARK
• It can be up to 100x faster than Hadoop MapReduce for large scale data processing, by exploiting in-memory computation and other optimizations. Therefore, it requires higher processing power and memory than MapReduce.
Spark comes packed with high-level libraries, including support for R, SQL, Python, Scala, Java, etc. These standard libraries make seamless integration into complex workflows possible. On top of this, it also allows various sets of services to integrate with it, like MLlib, GraphX, SQL + DataFrames, Streaming services, etc., to increase its capabilities.
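As a small, hedged example of these APIs, the sketch below uses Spark's DataFrame and SQL support from Java; the students.json input and its columns are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-demo")
                .master("local[*]")   // run locally; on a cluster this comes from spark-submit
                .getOrCreate();

        // Load data once; subsequent transformations operate on in-memory datasets.
        Dataset<Row> students = spark.read().json("students.json");
        students.createOrReplaceTempView("students");

        // Same engine, SQL-style query over the DataFrame.
        spark.sql("SELECT dept, COUNT(*) AS total FROM students GROUP BY dept").show();

        spark.stop();
    }
}
```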
A common question is: which one should you use, Spark or Hadoop? This is not an apples-to-apples comparison. Apache Spark is a best fit for real time processing, whereas Hadoop was designed to store unstructured data and execute batch processing over it. When we combine Apache Spark's abilities, i.e. high processing speed, advanced analytics and multiple integration support, with Hadoop's low cost operation on commodity hardware, it gives the best results. That is the reason why Spark and Hadoop are used together by many companies for processing and analyzing their Big Data stored in HDFS.
APACHE HBASE
• HBase supports all types of data, and that is why it is capable of handling anything and everything inside a Hadoop ecosystem.
• The HBase was designed to run on top of HDFS and provides BigTable like
capabilities.
• It gives us a fault tolerant way of storing sparse data, which is common in most
Big Data use cases.
• HBase itself is written in Java, whereas HBase applications can be written using its REST, Avro and Thrift APIs.
For a better understanding, let us take an example. You have billions of customer emails and you need to find out the number of customers who have used the word "complaint" in their emails. The request needs to be processed quickly (i.e. in real time). So, here we are handling a large data set while retrieving only a small amount of data. HBase was designed for solving exactly this kind of problem.
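A hedged sketch of how that lookup might look with the HBase Java client is shown below; the emails table, the "d" column family and the "body" qualifier are assumptions made for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseComplaintCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table emails = connection.getTable(TableName.valueOf("emails"))) {

            // Scan only the body column and push a substring filter down to the region servers.
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("body"));
            scan.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
                                           new SubstringComparator("complaint")));

            long count = 0;
            try (ResultScanner results = emails.getScanner(scan)) {
                for (Result r : results) {
                    count++;
                }
            }
            System.out.println("Customers mentioning 'complaint': " + count);
        }
    }
}
```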
APACHE DRILL
So, basically the main aim behind Apache Drill is to provide scalability so that we can
process petabytes and exabytes of data efficiently (or you can say in minutes).
• The main power of Apache Drill lies in combining a variety of data stores just
by using a single query.
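As an illustrative sketch (not from the blog), the snippet below queries a raw JSON file through Drill's JDBC driver in embedded mode; the file path and its columns are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.drill.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement()) {

            // Query a raw JSON file directly through the dfs storage plugin -- no schema,
            // no load step; other stores (Hive, HBase, MongoDB, S3) can be queried and
            // joined the same way.
            ResultSet rs = stmt.executeQuery(
                "SELECT t.name, t.dept FROM dfs.`/data/students.json` t LIMIT 5");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " / " + rs.getString(2));
            }
        }
    }
}
```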
APACHE ZOOKEEPER
Apache Zookeeper is the coordinator of any Hadoop job, which involves a combination of various services in a Hadoop Ecosystem. Before Zookeeper, it was very difficult and time consuming to coordinate between these services: they had many problems with interactions, like sharing common configuration while synchronizing data. Even once the services are configured, changes in the configurations of the services make it complex and difficult to handle. Grouping and naming was also a time-consuming factor. Zookeeper saves a lot of time by performing synchronization, configuration maintenance, grouping and naming.
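To make the configuration-sharing idea concrete, here is a minimal, assumed sketch using the ZooKeeper Java client: one service publishes a configuration value in a znode, and any other service can read that same value and stay in sync. The ensemble address, znode path and value are placeholders.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) ZooKeeper ensemble with a 3-second session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        String path = "/app-config";
        byte[] value = "jdbc:mysql://db:3306/app".getBytes(StandardCharsets.UTF_8);

        // Publish the shared configuration value (create it once, update it afterwards).
        if (zk.exists(path, false) == null) {
            zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            zk.setData(path, value, -1);
        }

        // Any other service can read the same value instead of keeping its own copy.
        byte[] current = zk.getData(path, false, null);
        System.out.println("Shared config: " + new String(current, StandardCharsets.UTF_8));
        zk.close();
    }
}
```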
APACHE OOZIE
Consider Apache Oozie as a clock and alarm service inside the Hadoop Ecosystem. For Apache jobs, Oozie acts just like a scheduler: it schedules Hadoop jobs and binds them together as one logical unit of work. There are two kinds of Oozie jobs:
1. Oozie workflow: These are a sequential set of actions to be executed. You can think of it as a relay race, where each athlete waits for the previous one to complete their leg.
2. Oozie Coordinator: These are the Oozie jobs which are triggered when the
data is made available to it. Think of this as the response-stimuli system in our
body. In the same manner as we respond to an external stimulus, an Oozie
coordinator responds to the availability of data and it rests otherwise.
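For illustration only, here is a hedged sketch of submitting a workflow job through the Oozie Java client; the Oozie URL, HDFS application path and the nameNode/jobTracker property values are placeholders, and the actual actions would be defined in a workflow.xml at that path.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/app");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow; Oozie chains its actions like the relay race above.
        String jobId = oozie.run(conf);
        System.out.println("Workflow job submitted: " + jobId);
        System.out.println("Status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```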
APACHE FLUME
• It helps us to ingest online streaming data from various sources, like network traffic, social media, email messages, log files, etc., into HDFS.
Now, let us understand the architecture of Flume. A Flume agent ingests the streaming data from various data sources into HDFS. The web server in this architecture represents the data source; Twitter is among the most famous sources of streaming data. The Flume agent has three components: source, channel and sink.
1. Source: It accepts the data from the incoming streamline and stores the data in the channel.
2. Channel: It acts as local storage, a temporary buffer between the source of the data and its persistent storage in HDFS.
3. Sink: Our last component, the sink, collects the data from the channel and commits or writes the data to HDFS permanently.
APACHE SQOOP
• The major difference between Flume and Sqoop is that Flume ingests unstructured or semi-structured data, while Sqoop can import as well as export structured data from an RDBMS or enterprise data warehouse to HDFS, and vice versa.
Let us understand how Sqoop works. When we submit a Sqoop command, our main task gets divided into subtasks, each of which is handled by an individual Map Task internally. A Map Task is the subtask which imports a part of the data into the Hadoop Ecosystem; collectively, all Map Tasks import the whole data.
Export works in a similar way: when we submit an export job, it is mapped into Map Tasks which bring chunks of data from HDFS. These chunks are exported to a structured data destination. Combining all these exported chunks of data, we receive the whole data at the destination, which in most cases is an RDBMS (MySQL/Oracle/SQL Server).
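As a rough, assumed sketch, the snippet below kicks off such an import from Java; the JDBC URL, credentials, table and target directory are placeholders, and the same arguments could equally be passed to the sqoop import command line.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/shop",
            "--username", "demo",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/user/demo/orders",
            "--num-mappers", "4"   // four parallel Map Tasks, as described above
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.out.println("Sqoop finished with exit code " + exitCode);
    }
}
```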
APACHE SOLR & LUCENE
Apache Solr and Apache Lucene are the two services which are used for searching and indexing data in the Hadoop Ecosystem.
• If Apache Lucene is the engine, Apache Solr is the car built around it. Solr is a
complete application built around Lucene.
• It uses the Lucene Java search library as a core for search and full indexing.
APACHE AMBARI
Ambari is an Apache Software Foundation project which aims at making the Hadoop ecosystem more manageable. It includes software for provisioning, managing and monitoring Apache Hadoop clusters.
To conclude, I would like to highlight a few points:
1. The Hadoop Ecosystem owes its success to the whole developer community; many big companies like Facebook, Google, Yahoo, the University of California (Berkeley), etc. have contributed their part to increase Hadoop's capabilities.
2. Based on the use cases, we can choose a set of services from the Hadoop Ecosystem and create a tailored solution for an organization.
I hope this blog was informative and added value to your knowledge. If you are interested in learning more, you can go through our case studies, which show how Big Data is used in real-world industry scenarios.
In the next blog of our Hadoop Tutorial Series, we introduce HDFS (Hadoop Distributed File System), the very first component discussed in this blog.
Now that you have understood the Hadoop Ecosystem, check out the Hadoop training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka Big Data Hadoop Certification Training course helps learners become experts in HDFS, Yarn, MapReduce, Pig, Hive, HBase, Oozie, Flume and Sqoop using real-time use cases.
Got a question for us? Please mention it in the comments section and we will get back
to you.