
Hadoop Ecosystem Explained: Learn the Fundamental Tools and Frameworks

Lesson 2 of 16 | By Simplilearn

Last updated on Jun 16, 2021


Table of Contents

HDFS
YARN (Yet Another Resource Negotiator)
MapReduce
Sqoop
Flume
Pig
Hive
Spark
Mahout
Ambari
Kafka
Storm
Ranger
Knox
Oozie
Conclusion

Did you know that we currently generate 2.5 quintillion bytes of data every day? That’s a lot of
generated data, and it needs to be stored, processed, and analyzed before anyone can derive
meaningful information from it. Fortunately, we have Hadoop to deal with the issue of big data
management.

Hadoop is a framework that manages big data storage by means of parallel and distributed
processing. It is composed of various tools and frameworks dedicated to different aspects of
data management, such as storing, processing, and analyzing data. The Hadoop ecosystem
covers Hadoop itself and various other related big data tools.

In this blog, we will talk about the Hadoop ecosystem and its various fundamental tools. Below
we see a diagram of the entire Hadoop ecosystem:

Let us start with the Hadoop Distributed File System (HDFS).

Looking forward to becoming a Hadoop Developer? Check out the Big Data Hadoop
Certification Training course and get certified today.

HDFS

In the traditional approach, all data was stored in a single central database. With the rise of big
data, a single database was not enough to handle the task. The solution was to use a distributed
approach to store the massive volume of information. Data was divided up and allocated to
many individual databases. HDFS is a specially designed file system for storing huge datasets on
commodity hardware, storing information in different formats on various machines.

There are two components in HDFS:

1. NameNode - NameNode is the master daemon. There is only one active NameNode. It
manages the DataNodes and stores all the metadata.

2. DataNode - DataNode is the slave daemon. There can be multiple DataNodes. It stores the
actual data.


So, we spoke of HDFS storing data in a distributed fashion, but did you know that the storage
system has certain specifications? HDFS splits the data into multiple blocks, each with a default
maximum size of 128 MB. The block size can be changed depending on the processing speed
and the data distribution. Let’s have a look at the example below:
As seen from the above image, we have 300 MB of data. This is broken down into 128 MB, 128
MB, and 44 MB. The final block handles the remaining needed storage space, so it doesn’t have
to be sized at 128 MB. This is how data gets stored in a distributed manner in HDFS.
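
To make the block layout concrete, here is a minimal Java sketch using Hadoop’s FileSystem API
that prints the default block size and the block locations of a file already stored in HDFS. The
file path argument is hypothetical, and the cluster’s configuration files are assumed to be on the
classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfoSketch {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml / hdfs-site.xml are on the classpath so the
        // Configuration points at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical argument: the path of a file already stored in HDFS,
        // e.g. the ~300 MB file from the example above.
        Path file = new Path(args[0]);
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file) + " bytes");

        // One BlockLocation per block; each lists the DataNodes holding its replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```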

Now that you have an overview of HDFS, it is also vital for you to understand what it sits on and
how the HDFS cluster is managed. That is done by YARN, and that’s what we’re looking at next.

YARN (Yet Another Resource Negotiator)

YARN is an acronym for Yet Another Resource Negotiator. It handles the cluster of nodes and
acts as Hadoop’s resource management unit. YARN allocates CPU, memory, and other resources
to different applications.

YARN has two components:

1. ResourceManager (Master) - This is the master daemon. It manages the assignment of
resources such as CPU, memory, and network bandwidth.

2. NodeManager (Slave) - This is the slave daemon, and it reports the resource usage to the
ResourceManager.

Let us move on to MapReduce, Hadoop’s processing unit.

MapReduce

Hadoop data processing is built on MapReduce, which processes large volumes of data in a
parallel, distributed manner. With the help of the figure below, we can understand how
MapReduce works:
As we see, we have our big data that needs to be processed, with the intent of eventually arriving
at an output. In the beginning, the input data is divided up to form the input splits. The first phase
is the Map phase, where the data in each split is passed to a mapper to produce intermediate
key-value pairs. In the shuffle and sort phase, the Map phase’s output is grouped by key. Finally,
in the Reduce phase, the grouped values are aggregated to produce the final output.
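
As a concrete illustration of these phases, here is the canonical word-count job written against
Hadoop’s Java MapReduce API; this is a minimal sketch, and the input and output paths are
hypothetical command-line arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each record of an input split is tokenized into (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: after shuffle and sort groups the pairs by word, the counts are summed.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```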

In summary, HDFS, MapReduce, and YARN are the three components of Hadoop. Let us now dive
deep into the data collection and ingestion tools, starting with Sqoop.

Sqoop

Sqoop is used to transfer data between Hadoop and external datastores such as relational
databases and enterprise data warehouses. It imports data from external datastores into HDFS,
Hive, and HBase.

As seen below, the client submits the Sqoop command, which Sqoop translates into map tasks.
These tasks connect to the enterprise data warehouse, document-based systems, and relational
databases, and transfer the data into Hadoop.

Flume
Flume is another data collection and ingestion tool, a distributed service for collecting,
aggregating, and moving large amounts of log data. It ingests online streaming data from sources
such as social media, log files, and web servers into HDFS.

As you can see below, data is taken from various sources, depending on your organization’s
needs. It then flows through a source, a channel, and a sink. The sink delivers the data to its
destination; finally, the data is dumped into HDFS.

Let us now have a look at Hadoop’s scripting languages and query languages.

Pig

Apache Pig was developed by Yahoo researchers and is targeted mainly at non-programmers. It
was designed to analyze and process large datasets without writing complex Java code. It
provides a high-level data processing language that can perform numerous operations
without getting bogged down in too many technical concepts.


It consists of:

1. Pig Latin - This is the language for scripting

2. Pig Latin Compiler - This converts Pig Latin code into executable code

Pig also provides Extract, Transform, and Load (ETL) capabilities and a platform for building data
flows. Did you know that ten lines of Pig Latin script are roughly equivalent to about 200 lines of
MapReduce code? Pig uses simple, time-efficient steps to analyze datasets. Let’s take a closer
look at Pig’s architecture.

Programmers write scripts in Pig Latin to analyze data using Pig. Grunt Shell is Pig’s interactive
shell, used to execute all Pig scripts. If the Pig script is written in a script file, the Pig Server
executes it. The parser checks the syntax of the Pig script, after which the output will be a DAG
(Directed Acyclic Graph). The DAG (logical plan) is passed to the logical optimizer. The compiler
converts the DAG into MapReduce jobs. The MapReduce jobs are then run by the Execution
Engine. The results are displayed using the “DUMP” statement and stored in HDFS using the
“STORE” statement.
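
Pig scripts can also be embedded in Java through the PigServer class. The sketch below is a
minimal example under that assumption; the input file, aliases, and output directory are
hypothetical:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigSketch {
    public static void main(String[] args) throws Exception {
        // Local mode keeps the example self-contained; on a cluster you would
        // typically use ExecType.MAPREDUCE instead.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical input: a plain-text file with one record per line.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("non_empty = FILTER lines BY line IS NOT NULL;");

        // store() triggers the parse -> logical plan -> optimized plan -> MapReduce jobs
        // pipeline described above (like the STORE statement in the Grunt shell).
        pig.store("non_empty", "pig-output");
    }
}
```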

Next up on the language list is Hive.

Hive
Hive uses a SQL-like query language (HiveQL) to facilitate the reading, writing, and management of
large datasets residing in distributed storage. Hive was developed with a vision of
incorporating the concepts of tables and columns with SQL, since users were already comfortable
with writing queries in SQL.

Apache Hive has two major components:

Hive Command Line

JDBC/ ODBC driver

The Java Database Connectivity (JDBC) application is connected through the JDBC driver, and the
Open Database Connectivity (ODBC) application is connected through the ODBC driver. Commands
are executed directly in the CLI. The Hive driver is responsible for all submitted queries,
performing the three steps of compilation, optimization, and execution internally. It then uses the
MapReduce framework to process the queries.
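
To sketch the JDBC path described above, here is a minimal Java example that connects to
HiveServer2 and runs a query. The host, port, credentials, and the products table are
hypothetical, and the Hive JDBC driver is assumed to be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Assumption: HiveServer2 is listening on localhost:10000 and the
        // Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // The Hive driver compiles, optimizes, and executes this query,
            // typically as MapReduce jobs over data in HDFS.
            // "products" is a hypothetical table used purely for illustration.
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) AS cnt FROM products GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```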

Hive’s architecture is shown below:

Spark

Spark is a huge framework in and of itself: an open-source distributed computing engine for
processing and analyzing vast volumes of real-time data. It can run up to 100 times faster than
MapReduce because it computes on data in memory, and it is used to process and analyze
real-time streaming data such as stock market and banking data, among other things.

As seen from the above image, the MasterNode has a driver program. The Spark code behaves
as a driver program and creates a SparkContext, which is a gateway to all of the Spark
functionalities. Spark applications run as independent sets of processes on a cluster. The driver
program and Spark context take care of the job execution within the cluster. A job is split into
multiple tasks that are distributed over the worker nodes. When an RDD is created in the Spark
context, it can be distributed across various nodes. Worker nodes are slaves that run different
tasks. The Executor is responsible for the execution of these tasks. Worker nodes execute the
tasks assigned by the Cluster Manager and return the results to the SparkContext.
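
To make the relationship between the driver program and the SparkContext concrete, here is a
minimal Java sketch that creates a session, parallelizes a small collection into an RDD, and runs
a distributed count. The local[*] master and the sample values are assumptions used purely for
illustration:

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class SparkDriverSketch {
    public static void main(String[] args) {
        // The SparkSession (wrapping the SparkContext) is the driver's gateway
        // to all Spark functionality, as described above.
        SparkSession spark = SparkSession.builder()
                .appName("RddSketch")
                .master("local[*]")   // assumption: local mode for demonstration
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // An RDD created through the context is partitioned across worker nodes;
        // each partition becomes a task executed by an executor.
        JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5), 2);
        long evens = numbers.filter(n -> n % 2 == 0).count();
        System.out.println("Even values: " + evens);

        spark.stop();
    }
}
```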

Let us now move on to machine learning in the Hadoop ecosystem.

Mahout

Mahout is used to create scalable and distributed machine learning algorithms such as
clustering, linear regression, classification, and so on. It has a library that contains built-in
algorithms for collaborative filtering, classification, and clustering.

Ambari

Next up, we have Apache Ambari. It is an open-source tool responsible for keeping track of
running applications and their statuses. Ambari manages, monitors, and provisions Hadoop
clusters. It also provides a central management service to start, stop, and configure
Hadoop services.

As seen in the following image, the Ambari web interface is connected to the Ambari server.
Apache Ambari follows a master/slave architecture. The master node is accountable for keeping
track of the state of the infrastructure. To do this, the master node uses a database server that
can be configured during setup. Most of the time, the Ambari server is located on the master
node and is connected to the database. Agents run on all the nodes that you want to manage
under Ambari, and each agent periodically sends heartbeats to the master node to show that it
is alive. Through the Ambari agents, the Ambari server is able to execute many tasks on the hosts.

We have two more data streaming services: Kafka and Apache Storm.

Kafka

Kafka is a distributed streaming platform designed to store and process streams of records. It is
written in Scala. Kafka is used to build real-time streaming data pipelines that reliably move data
between applications, as well as real-time applications that transform or react to streams of data.

Kafka uses a messaging system for transferring data from one application to another. As seen
below, we have the sender, the message queue, and the receiver involved in data transfer.
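
As an illustration of the sender side, here is a minimal Java producer written against Kafka’s
client API; the broker address (localhost:9092), the topic name "events", and the record contents
are assumptions for this sketch:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        // Assumption: a broker is reachable at localhost:9092 and the topic "events" exists.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The sender publishes a record to the message queue (topic);
            // consumers receive it on the other side.
            producer.send(new ProducerRecord<>("events", "user-42", "clicked checkout"));
        }
    }
}
```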

Storm

Storm is an engine that processes real-time streaming data at a very high speed. It is written
in Clojure. Storm can process over a million tuples per second per node, and it is
integrated with Hadoop to harness higher throughputs.

Now that we have looked at the various data ingestion tools and streaming services, let us take a
look at the security frameworks in the Hadoop ecosystem.

Ranger

Ranger is a framework designed to enable, monitor, and manage data security across the
Hadoop platform. It provides centralized administration for managing all security-related tasks.
Ranger standardizes authorization across all Hadoop components and provides enhanced
support for different authorization methods such as role-based access control and
attribute-based access control, to name a few.
Knox

Apache Knox is an application gateway used in conjunction with Hadoop deployments,
interacting with REST APIs and UIs. The gateway delivers three types of user-facing services:

1. Proxying Services - This provides access to Hadoop by proxying HTTP requests

2. Authentication Services - This provides authentication for REST API access and a WebSSO flow
for user interfaces

3. Client Services - This provides client development either via scripting through the DSL or via
the Knox Shell classes

Let us now take a look at the workflow system, Oozie.

Oozie

Oozie is a workflow scheduler system used to manage Hadoop jobs. It consists of two parts:

1. Workflow engine - This consists of Directed Acyclic Graphs (DAGs), which specify a sequence
of actions to be executed

2. Coordinator engine - This engine runs workflow jobs that are triggered by time and data
availability

As seen from the flowchart below, the process begins with the MapReduce jobs. This action can
either be successful, or it can end in an error. If it is successful, the client is notified by an email.
If the action is unsuccessful, the client is similarly notified, and the action is terminated.
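
As a sketch of how such a workflow can be submitted programmatically, here is a minimal example
using Oozie’s Java client API; the Oozie URL, the HDFS application path, and the job properties
are hypothetical placeholders:

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Assumption: an Oozie server is reachable at this URL.
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // The job configuration points at a workflow.xml (the DAG of actions) stored in HDFS.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/my-workflow");
        conf.setProperty("nameNode", "hdfs://namenode:8020");   // hypothetical property used by the workflow

        // Submit and start the workflow, then check its status.
        String jobId = oozie.run(conf);
        WorkflowJob.Status status = oozie.getJobInfo(jobId).getStatus();
        System.out.println("Workflow " + jobId + " is " + status);
    }
}
```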

Conclusion
We hope this has helped you gain a better understanding of the Hadoop ecosystem. If you’ve
read through this lesson, you have learned about HDFS, YARN, MapReduce, Sqoop, Flume, Pig,
Hive, Spark, Mahout, Ambari, Kafka, Storm, Ranger, Knox, and Oozie. Furthermore, you have an
idea of what each of these tools does.

If you want to learn more about Big Data and Hadoop, enroll in our Big Data Hadoop Certification
Course today! Refer to our Course Curriculum to learn more about the Hadoop Training.


