
Activity

NAME: Chogle Saif Ali   ROLLNO.: 12CO27   CLASS: BE-CO


Summary: Components of Hadoop Ecosystem.

1. HDFS.
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

HDFS is a key part of many Hadoop ecosystem technologies, as it provides a reliable means for managing pools of big data and supporting related big data analytics applications.
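
The snippet below is a minimal sketch of how an application talks to HDFS through the Hadoop FileSystem Java API; the NameNode address and file path are placeholders, not details taken from this summary.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Placeholder NameNode address; normally taken from core-site.xml.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/hello.txt");

            // The NameNode records the file's metadata; its blocks are stored on DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello HDFS");
            }

            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }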

2. HBASE.
HBase is an open-source, column-oriented, distributed database system that runs in a Hadoop environment. Apache HBase is needed for real-time Big Data applications.

HBase can store massive amounts of data, from terabytes to petabytes. Tables in HBase can consist of billions of rows with millions of columns. HBase is built for low-latency operations and has some specific features that distinguish it from traditional relational models.
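
As a small, hedged sketch of such low-latency operations, the Java snippet below writes and reads one cell through the HBase client API; the table name, column family, and ZooKeeper address are invented for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // ZooKeeper quorum, table, and column family names are placeholders.
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "zk-host");

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Low-latency write: one cell in column family "info", column "name".
                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
                table.put(put);

                // Low-latency point read of the same row.
                Result result = table.get(new Get(Bytes.toBytes("row-1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }
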
3. MAPREDUCE
MapReduce is a framework with which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.

MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task then takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.
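
To make the Map and Reduce steps concrete, here is a sketch of the classic word-count job written against the Hadoop MapReduce Java API; the class name and input/output paths are illustrative only.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: break each input line into (word, 1) key/value pairs.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: combine the tuples for each word into a single count.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }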

4. YARN
YARN (Yet Another Resource Negotiator) is the resource manager that was introduced in Hadoop 2.x. The major job of YARN is to take a job submitted to Hadoop and distribute it among multiple slave nodes.

YARN relies on the following main components for all of its functionality. The first component is the ResourceManager (RM), which is the arbitrator of all cluster resources. It has two parts: a pluggable Scheduler and an ApplicationsManager that manages user jobs on the cluster.

The second component is the per-node NodeManager (NM), which manages users' jobs and workflow on a given node. The central ResourceManager and the collection of NodeManagers create the unified computational infrastructure of the cluster.
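
As an illustrative sketch (not taken from the summary above), the YarnClient Java API can be used to ask the ResourceManager about the cluster, for example to list the NodeManagers that are currently running:

    import java.util.List;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListNodes {
        public static void main(String[] args) throws Exception {
            // Connects to the ResourceManager configured in yarn-site.xml.
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();

            // Each NodeReport describes one NodeManager in the cluster.
            List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId() + " -> " + node.getCapability());
            }
            yarn.stop();
        }
    }
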
5. HIVE
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It
resides on top of Hadoop to summarize Big Data, and makes querying and analyzing
easy.

The three important functionalities for which Hive is deployed are data summarization, data analysis, and data query. The query language supported by Hive is HiveQL, which translates SQL-like queries into MapReduce jobs for execution on Hadoop. HiveQL also supports MapReduce scripts that can be plugged into the queries.
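
As a rough sketch of how an application might submit HiveQL, the snippet below uses the HiveServer2 JDBC driver; the connection URL, table name, and credentials are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // Placeholder HiveServer2 address and database.
            String url = "jdbc:hive2://hiveserver-host:10000/default";
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            try (Connection con = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = con.createStatement()) {
                // A SQL-like HiveQL query; Hive turns it into MapReduce work behind the scenes.
                ResultSet rs = stmt.executeQuery(
                        "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
                while (rs.next()) {
                    System.out.println(rs.getString("page") + " " + rs.getLong("hits"));
                }
            }
        }
    }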

6. PIG
Pig is a high-level programming language useful for analyzing large data sets. Apache Pig enables people to focus more on analyzing bulk data sets and to spend less time writing Map-Reduce programs. Just as pigs eat anything, the Pig language is designed to work on any kind of data.
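
As a hedged sketch, Pig Latin statements can also be run from Java through the PigServer API; the input file name and field layout below are invented for illustration.

    import java.util.Iterator;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    public class PigWordCount {
        public static void main(String[] args) throws Exception {
            // LOCAL mode runs the script on this machine; MAPREDUCE mode would use the cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Pig Latin: load lines, split them into words, and count each word.
            pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
            pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

            Iterator<Tuple> it = pig.openIterator("counts");
            while (it.hasNext()) {
                System.out.println(it.next());
            }
            pig.shutdown();
        }
    }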

7. MAHOUT
Apache Mahout is an open-source project that is used to construct scalable libraries of machine learning algorithms. Initially, it started as a sub-project of Apache Lucene. In 2010, it became a top-level Apache project.

8. SOLR
Apache Solr is an open-source, REST-API-based search server platform written in Java by the Apache Software Foundation. Solr is a highly scalable, ready-to-deploy search engine that can handle large volumes of text-centric data. The purpose of Apache Solr is to index and search large amounts of web content and return relevant content for a search query.
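
For illustration only, the SolrJ Java client below indexes one document and then queries it back; the Solr URL, collection name, and fields are assumed, not taken from the text.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrExample {
        public static void main(String[] args) throws Exception {
            // Placeholder Solr collection URL.
            SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/articles").build();

            // Index a single document.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "Components of the Hadoop Ecosystem");
            solr.add(doc);
            solr.commit();

            // Search it back with a simple query.
            QueryResponse response = solr.query(new SolrQuery("title:hadoop"));
            for (SolrDocument hit : response.getResults()) {
                System.out.println(hit.getFieldValue("id") + " " + hit.getFieldValue("title"));
            }
            solr.close();
        }
    }
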
9. SQOOP
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.
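
Sqoop is normally driven from the command line; as a hedged sketch, the same import tool can be invoked from Java via Sqoop.runTool(). The connection details, table, and target directory below are placeholders.

    import org.apache.sqoop.Sqoop;

    public class SqoopImportExample {
        public static void main(String[] args) {
            // Pulls the "orders" table from MySQL into HDFS using 4 parallel map tasks.
            String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://db-host:3306/sales",
                "--username", "report",
                "--password", "secret",
                "--table", "orders",
                "--target-dir", "/user/demo/orders",
                "--num-mappers", "4"
            };
            int exitCode = Sqoop.runTool(importArgs);
            System.exit(exitCode);
        }
    }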

10. KAFKA
Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one end-point to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.
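
As an illustrative sketch (broker address, topic name, and message contents assumed), a Java producer publishes a message that any subscriber of the topic can then consume:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaSendExample {
        public static void main(String[] args) {
            // Placeholder broker address; serializers turn keys/values into bytes.
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-broker:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish one message to the "clickstream" topic.
                producer.send(new ProducerRecord<>("clickstream", "user-42", "page=/home"));
                producer.flush();
            }
        }
    }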

11. SPARK
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to efficiently use it for more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
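
The snippet below is a rough Java sketch of Spark's RDD model (Spark 2.x+ style API); the input path is a placeholder, and local[*] simply runs the job on the local machine rather than a cluster.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("input.txt");  // placeholder path

                // Intermediate data stays in memory between steps, which is Spark's main speed-up.
                JavaPairRDD<String, Integer> counts = lines
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                        .mapToPair(word -> new Tuple2<>(word, 1))
                        .reduceByKey(Integer::sum);

                counts.collect().forEach(pair -> System.out.println(pair._1() + " " + pair._2()));
            }
        }
    }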

12. FLUME
Apache Flume is a system used for moving massive quantities of streaming data into HDFS. Collecting log data from web-server log files and aggregating it in HDFS for analysis is one common example use case of Flume.
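
A Flume agent is wired together with a properties-style configuration file; the sketch below (hostnames and paths invented) tails a web-server access log and delivers the events to HDFS.

    # Illustrative Flume agent configuration (names and paths are placeholders).
    agent.sources = weblog
    agent.channels = mem
    agent.sinks = tohdfs

    # Source: tail the web server's access log.
    agent.sources.weblog.type = exec
    agent.sources.weblog.command = tail -F /var/log/httpd/access_log
    agent.sources.weblog.channels = mem

    # Channel: buffer events in memory between source and sink.
    agent.channels.mem.type = memory
    agent.channels.mem.capacity = 10000

    # Sink: write the aggregated events into HDFS for later analysis.
    agent.sinks.tohdfs.type = hdfs
    agent.sinks.tohdfs.hdfs.path = hdfs://namenode-host:9000/flume/weblogs
    agent.sinks.tohdfs.hdfs.fileType = DataStream
    agent.sinks.tohdfs.channel = mem
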
13. AMBARI
Apache Ambari is a software project of the Apache
Software Foundation. Ambari enables system
administrators to provision, manage and monitor a
Hadoop cluster, and also to integrate Hadoop with
the existing enterprise infrastructure.

14. LUCENE
The Apache Lucene project develops open-source search software. The project releases a core search library, named Lucene Core, as well as the Solr search server. Lucene Core is a Java library providing powerful indexing and search features, as well as spellchecking, hit highlighting, and advanced analysis/tokenization capabilities. The PyLucene sub-project provides Python bindings for Lucene Core.
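
For illustration, the snippet below uses the Lucene Core Java library to index one document and search it; the index directory and field name are placeholders, and the exact classes vary slightly between Lucene versions.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneExample {
        public static void main(String[] args) throws Exception {
            // Placeholder index directory on the local file system.
            Directory dir = FSDirectory.open(Paths.get("lucene-index"));

            // Index one document with an analyzed "content" field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("content", "Lucene provides indexing and search", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Parse a query and search the index.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("content", new StandardAnalyzer()).parse("indexing");
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("content"));
                }
            }
        }
    }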

15. APACHE STORM

Apache Storm is a distributed real-time big data processing system. Storm is designed to process vast amounts of data in a fault-tolerant and horizontally scalable manner. It is a streaming data framework capable of very high ingestion rates. Apache Storm is a free and open-source distributed real-time computation system. It makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use.
