
Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text
search library. Hadoop has its origins in Apache Nutch, an open source web search engine,
itself a part of the Lucene project.

Hadoop is an open-source software platform for storing huge volumes of data and running
applications on clusters of commodity hardware. It gives us massive data storage capacity,
enormous computational power and the ability to handle virtually limitless concurrent jobs or
tasks. Its core purpose is to support growing big data technologies and thereby enable
advanced analytics such as predictive analytics, machine learning and data mining.
Hadoop has the capability to handle different modes of data such as structured, unstructured and
semi-structured data. It gives us the flexibility to collect, process, and analyze data that our old
data warehouses failed to handle.

Difference between Traditional Database System and Hadoop

Traditional Database System: Data is stored in a central location and sent to the processor at runtime.
Hadoop: The program goes to the data. Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located.

Traditional Database System: Cannot be used to process and store a significant amount of data (big data).
Hadoop: Works better when the data size is big. It can process and store a large amount of data efficiently and effectively.

Traditional Database System: A traditional RDBMS is used to manage only structured and semi-structured data; it cannot be used to manage unstructured data.
Hadoop: Can process and store a variety of data, whether it is structured or unstructured.

Hadoop Ecosystem Overview


The Hadoop ecosystem is a platform or framework which helps in solving big data problems. It
comprises different components and services (for ingesting, storing, analyzing, and maintaining data).
Most of the services available in the Hadoop ecosystem are there to supplement the four core
components of Hadoop: HDFS, YARN, MapReduce and Hadoop Common.
Most of the tools or solutions are used to supplement or support these major elements. All these
tools work collectively to provide services such as ingestion, analysis, storage and maintenance
of data.

Following are the components that collectively form a Hadoop ecosystem:


• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services


• HBase: NoSQL Database


• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling

Note: Apart from the above-mentioned components, there are many other components too that
are part of the Hadoop ecosystem.

HDFS:
• HDFS is the primary or major component of Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files.
• HDFS consists of two core components i.e.
1. Name node
2. Data Node
• Name Node is the prime node which contains the metadata (data about data), requiring
comparatively fewer resources than the Data Nodes that store the actual data. These Data
Nodes are commodity hardware in the distributed environment, which undoubtedly makes
Hadoop cost effective.
• HDFS maintains all the coordination between the clusters and hardware, thus working at
the heart of the system.
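
As an illustration of how a client interacts with HDFS, here is a minimal sketch using the HDFS Java FileSystem API. The Name Node address (hdfs://namenode:9000) and the /demo/hello.txt path are placeholder assumptions, not values from these notes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder Name Node address

            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits it into blocks and replicates them across Data Nodes.
            try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
                out.writeUTF("hello hdfs");
            }

            // Read it back; the Name Node is consulted only for metadata (block locations).
            try (FSDataInputStream in = fs.open(new Path("/demo/hello.txt"))) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }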


YARN:
• YARN (Yet Another Resource Negotiator), as the name implies, is the component that helps to
manage the resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop system.
• It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Master
• The Resource Manager has the privilege of allocating resources for the applications in the system,
whereas Node Managers manage the resources such as CPU, memory and bandwidth on each
machine and report back to the Resource Manager. The Application Master works as an interface
between the Resource Manager and the Node Managers and performs negotiations as per the
requirements of the application.
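
To make this division of labour concrete, the sketch below uses the YarnClient Java API to ask the Resource Manager for a report of the Node Managers in the cluster; it assumes a reachable cluster whose address is already configured (for example in yarn-site.xml).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnClusterInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new YarnConfiguration();   // picks up yarn-site.xml from the classpath
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(conf);
            yarn.start();

            // Each NodeReport describes one Node Manager and the resources it offers.
            for (NodeReport node : yarn.getNodeReports()) {
                System.out.println(node.getNodeId() + " -> " + node.getCapability());
            }
            yarn.stop();
        }
    }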

MapReduce:
• By making use of distributed and parallel algorithms, MapReduce makes it possible to
carry the processing logic over to the data and helps to write applications which transform big
data sets into manageable ones.
• MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and thereby organizes it into groups.
Map generates key-value pair based results which are later processed by the Reduce()
method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped
data. In simple terms, Reduce() takes the output generated by Map() as input and combines
those tuples into a smaller set of tuples.
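
The classic word-count example shows these two roles side by side. The sketch below gives only the Mapper and Reducer classes; the job setup and the input/output paths are omitted for brevity.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map(): split each input line into words and emit a (word, 1) key-value pair per word.
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce(): aggregate the mapped data, combining all (word, 1) tuples into one (word, total) tuple.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }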

PIG:
• Pig was originally developed by Yahoo. It works on Pig Latin, a query-based
language similar to SQL.
• It is a platform for structuring the data flow and for processing and analyzing huge data sets.
• Pig does the work of executing commands, and in the background all the activities of
MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
• The Pig Latin language is specially designed for this framework and runs on Pig Runtime,
just the way Java runs on the JVM.
• Pig helps to achieve ease of programming and optimization and hence is a major segment
of the Hadoop Ecosystem.
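
As a small illustration, the sketch below embeds two Pig Latin statements in a Java program through the PigServer API, running in local mode; the access.log input, its field layout and the error_logs output path are illustrative assumptions.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigErrorFilter {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Each registerQuery() call adds one Pig Latin statement to the data-flow plan;
            // Pig compiles the whole flow into MapReduce jobs behind the scenes.
            pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray, code:int);");
            pig.registerQuery("errors = FILTER logs BY code >= 500;");

            // store() triggers execution and writes the result (to HDFS when run on a cluster).
            pig.store("errors", "error_logs");
            pig.shutdown();
        }
    }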

HIVE:
• With the help of an SQL-like methodology and interface, Hive performs reading and writing of
large data sets. Its query language is called HQL (Hive Query Language).


• It is highly scalable, as it allows both real-time (interactive) and batch processing. Also, all
the SQL data types are supported by Hive, making query processing easier.
• Similar to other query-processing frameworks, Hive comes with two
components: JDBC drivers and the Hive command line.
• The JDBC and ODBC drivers take care of establishing connections and data-storage permissions,
whereas the Hive command line helps in the processing of queries.
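
A minimal sketch of querying Hive over its JDBC driver (HiveServer2) is shown below; the host name, database and the sales table are placeholder assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver and connect to HiveServer2 (default port 10000).
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default");
                 Statement stmt = conn.createStatement();
                 // HQL looks like SQL; Hive translates it into distributed jobs over data in HDFS.
                 ResultSet rs = stmt.executeQuery("SELECT category, COUNT(*) FROM sales GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " : " + rs.getLong(2));
                }
            }
        }
    }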

Mahout:
• Mahout provides machine learnability to a system or application. Machine learning, as the
name suggests, helps a system develop itself based on some patterns,
user/environmental interaction or on the basis of algorithms.
• It provides various libraries and functionalities such as collaborative filtering, clustering, and
classification, which are nothing but concepts of machine learning. It allows invoking
algorithms as per our need with the help of its own libraries.
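
For instance, collaborative filtering with Mahout's (legacy) Taste API can be sketched as below; the ratings.csv file (one userID,itemID,rating triple per line), the user ID 42 and the neighbourhood size are assumptions for illustration.

    import java.io.File;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class MahoutRecommend {
        public static void main(String[] args) throws Exception {
            // Load user-item ratings, compare users by Pearson correlation, keep the 10 nearest neighbours.
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Recommend three items for user 42 based on what similar users liked.
            for (RecommendedItem item : recommender.recommend(42, 3)) {
                System.out.println(item.getItemID() + " (score " + item.getValue() + ")");
            }
        }
    }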

Apache Spark:
• It’s a platform that handles all the process-intensive tasks like batch processing,
interactive or iterative real-time processing, graph processing, and visualization.
• It keeps intermediate data in memory, and hence is faster than MapReduce in terms of
optimization.
• Spark is best suited for real-time data whereas Hadoop MapReduce is best suited for structured
data or batch processing, hence both are used in most companies, each where it fits best.
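
The word count from the MapReduce section looks like this in Spark's Java API, with the intermediate data kept in memory; local[*] mode and the input path are placeholder assumptions.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("hdfs:///demo/hello.txt");   // placeholder input path

                // flatMap/mapToPair/reduceByKey mirror Map() and Reduce(),
                // but intermediate results stay in memory instead of being written to disk.
                JavaPairRDD<String, Integer> counts = lines
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                        .mapToPair(word -> new Tuple2<>(word, 1))
                        .reduceByKey(Integer::sum);

                counts.collect().forEach(pair -> System.out.println(pair._1 + " : " + pair._2));
            }
        }
    }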

Apache HBase:
• It’s a NoSQL database which supports all kinds of data and is thus capable of handling
anything within a Hadoop database. It provides capabilities similar to Google’s BigTable, and is
thus able to work on big data sets effectively.
• At times when we need to search for or retrieve the occurrences of something small in a huge
database, the request must be processed within a short span of time. At such times,
HBase comes in handy as it gives us a fault-tolerant way of storing and quickly looking up such data.
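
A minimal sketch of that kind of small, fast lookup with the HBase Java client is shown below; the users table, the info column family and the row key u42 are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write one cell: row "u42", column family "info", qualifier "city".
                Put put = new Put(Bytes.toBytes("u42"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Hyderabad"));
                table.put(put);

                // Random read by row key: the small, quick lookup HBase is designed for.
                Result result = table.get(new Get(Bytes.toBytes("u42")));
                System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
            }
        }
    }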

Sqoop:
• Sqoop is a tool designed to transfer data between Hadoop and relational database servers.
• It is used to import data from relational databases (such as Oracle and MySQL) to HDFS
and export data from HDFS to relational databases.
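
A typical import is a single command line, as sketched below; the connection string, credentials, table name and target directory are placeholders, not values from these notes.

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username report --password secret \
      --table orders \
      --target-dir /data/orders

The matching sqoop export command moves data from an HDFS directory back into a relational table.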

Flume:
• Flume is a distributed service that collects event data and transfers it to HDFS.
• It is ideally suited for event data from multiple systems.


After the data is transferred into HDFS, it is processed. One of the frameworks that can process
this data is Spark.

Avro:
Apache Avro is a part of the Hadoop ecosystem, and it works as a data serialization system. It is
an open-source project which helps Hadoop in data serialization and data exchange. Avro
enables the exchange of big data between programs written in different languages. It serializes data into
files or messages.
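
For example, writing one record with Avro's generic Java API looks roughly like the sketch below; the record schema and the users.avro file name are illustrative assumptions.

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroWrite {
        public static void main(String[] args) throws Exception {
            // Define a simple record schema (name: string, age: int) as JSON.
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                  + "{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Asha");
            user.put("age", 30);

            // The schema is embedded in the file, so a program in any language with an Avro library can read it back.
            try (DataFileWriter<GenericRecord> writer =
                         new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("users.avro"));
                writer.append(user);
            }
        }
    }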

Zookeeper:
There was a huge issue with the management of coordination and synchronization among the resources
or components of Hadoop, which often resulted in inconsistency. Zookeeper overcame all these
problems by performing synchronization, inter-component communication, grouping, and
maintenance.
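
As a small illustration of such coordination, the sketch below uses the ZooKeeper Java client to publish a shared flag as a znode that other components can read or watch; the server address and the /ingest-enabled path are assumptions.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkFlag {
        public static void main(String[] args) throws Exception {
            // Connect to a ZooKeeper server (placeholder address) with a 3-second session timeout.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

            // Create a persistent znode holding a small piece of shared state, if it does not already exist.
            if (zk.exists("/ingest-enabled", false) == null) {
                zk.create("/ingest-enabled", "true".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Any other component can read (or watch) the same znode to stay in sync.
            byte[] value = zk.getData("/ingest-enabled", false, null);
            System.out.println(new String(value));
            zk.close();
        }
    }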

Oozie:
Oozie simply performs the task of a scheduler, thus scheduling jobs and binding them together as
a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs. Oozie
workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie
coordinator jobs are those that are triggered when some data or an external stimulus is given to them.

Chukwa:
Chukwa is an open source data collection system for managing large distributed systems. It is a
Hadoop subproject devoted to large-scale log collection and analysis built on top of the Hadoop
Distributed File System (HDFS) and Map/Reduce framework.
Chukwa aims to provide a flexible and powerful platform for distributed data collection and
rapid data processing that is capable of being modified to use newer storage technologies (HDFS
appends, HBase, etc.) as they mature.
