Activity: NAME: Chogle Saif Ali ROLLNO.: 12CO27 Class: Be-Co Summary: Components of Hadoop Ecosystem
1. HDFS
The Hadoop Distributed File System (HDFS) is the primary data storage system
used by Hadoop applications. It employs a NameNode and DataNode architecture
to implement a distributed file system that
provides high-performance access to data
across highly scalable Hadoop clusters.
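The NameNode/DataNode split described above can be sketched in miniature: the NameNode keeps only metadata (which blocks make up a file, and which nodes hold each replica), while DataNodes store the actual bytes. This is an illustrative toy, not the real HDFS API; the block size and replication factor here are shrunk for demonstration (real HDFS defaults to 128 MB blocks and 3 replicas).

```python
BLOCK_SIZE = 8    # tiny for illustration; real HDFS defaults to 128 MB
REPLICATION = 2   # real HDFS defaults to 3

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes actually stored here

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}        # filename -> list of (block_id, [replica DataNodes])

    def write(self, filename, data):
        chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        entries = []
        for n, chunk in enumerate(chunks):
            block_id = f"{filename}_blk{n}"
            # place each replica on a different DataNode (simple round-robin)
            targets = [self.datanodes[(n + r) % len(self.datanodes)]
                       for r in range(REPLICATION)]
            for dn in targets:
                dn.blocks[block_id] = chunk
            entries.append((block_id, targets))
        self.metadata[filename] = entries

    def read(self, filename):
        # read the first available replica of each block, in order
        return b"".join(replicas[0].blocks[block_id]
                        for block_id, replicas in self.metadata[filename])

nn = NameNode([DataNode("dn1"), DataNode("dn2"), DataNode("dn3")])
nn.write("/logs/a.txt", b"hello distributed world")
```

Note that the NameNode never touches file contents; clients would fetch block locations from it and then read bytes directly from DataNodes, which is what makes the design scale.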
2. HBASE
HBase is an open-source, column-oriented, distributed database that runs in a
Hadoop environment. Apache HBase is needed for real-time Big
Data applications.
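"Column-oriented" here refers to HBase's data model, where a value is addressed by a row key, a column family, and a column qualifier. The following is an illustrative sketch of that model in plain Python (not the HBase client API); the table and field names are made up for demonstration.

```python
from collections import defaultdict

class ToyHBaseTable:
    """Minimal stand-in for HBase's (row key, family, qualifier) -> value model."""

    def __init__(self, families):
        self.families = set(families)          # column families are fixed at table creation
        # rows[row_key][family][qualifier] = value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier):
        # missing cells simply return None, as sparse rows are normal in HBase
        return self.rows[row_key][family].get(qualifier)

users = ToyHBaseTable(families=["info", "stats"])
users.put("user1", "info", "name", "Saif")
users.put("user1", "stats", "logins", 3)
```

Rows are sparse by design: two rows in the same table can populate entirely different qualifiers, which is why the model suits variable, wide data.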
4. YARN
YARN (Yet Another Resource Negotiator) is the resource-management layer
introduced in Hadoop 2.x. YARN takes each job submitted to Hadoop and
distributes its work among multiple slave nodes.
Its first component is the central ResourceManager (RM); the second is the
per-node NodeManager (NM), which manages users' jobs and workflow on a given
node. Together, the ResourceManager and the collection of NodeManagers form
the unified computational infrastructure of the cluster.
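The two-level design above can be sketched as a toy scheduler: a central ResourceManager hands pieces of a submitted job to whichever NodeManagers have free capacity. This is a simplistic illustration of the division of responsibility, not YARN's actual scheduling logic or API.

```python
class NodeManager:
    def __init__(self, name, slots):
        self.name = name
        self.slots = slots        # free task slots on this node
        self.running = []

    def launch(self, task):
        self.slots -= 1
        self.running.append(task)

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def submit(self, job_tasks):
        """Place each task of a job on the node with the most free capacity."""
        placements = {}
        for task in job_tasks:
            node = max(self.nodes, key=lambda n: n.slots)
            if node.slots == 0:
                raise RuntimeError("cluster has no free capacity")
            node.launch(task)
            placements[task] = node.name
        return placements

rm = ResourceManager([NodeManager("node1", 2), NodeManager("node2", 2)])
placed = rm.submit(["map0", "map1", "map2"])
```

The essential point survives the simplification: the RM decides *where* work runs cluster-wide, while each NM only executes and tracks the tasks on its own node.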
5. HIVE
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It
resides on top of Hadoop to summarize Big Data, and makes querying and analyzing
easy.
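Hive's query language, HiveQL, is close to standard SQL. As a stand-in for a real Hive warehouse, this sketch runs the kind of aggregate query Hive would compile into MapReduce jobs, using Python's built-in sqlite3; the table and data are made up for illustration.

```python
import sqlite3

# An in-memory table standing in for a Hive table over files in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, user TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", "a"), ("home", "b"), ("about", "a")])

# In HiveQL this aggregate query would look essentially identical.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
```

The difference in practice is scale and execution: Hive translates the same declarative query into distributed jobs over data in HDFS rather than running it in-process.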
6. PIG
Pig is a high-level programming language useful
for analyzing large data sets.
Apache Pig enables people to focus more on
analyzing bulk data sets and to spend less time
writing MapReduce programs. Just as pigs
eat anything, the Pig programming
language is designed to work on any kind of
data.
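A Pig script is a short dataflow (load, group, aggregate) that Pig compiles into MapReduce jobs behind the scenes. The sketch below mimics that dataflow for a word count in plain Python; the input lines are made up for illustration, and the Pig Latin in the comment is an approximate equivalent, not a tested script.

```python
from collections import Counter

lines = ["big data", "big clusters", "data pipelines"]

# Roughly equivalent Pig Latin:
#   words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
#   grouped = GROUP words BY word;
#   counts  = FOREACH grouped GENERATE group, COUNT(words);
words = [w for line in lines for w in line.split()]   # tokenize and flatten
counts = Counter(words)                               # group and count
```

The appeal is that the three-line dataflow replaces the boilerplate of hand-written mapper and reducer classes for the same job.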
7. MAHOUT
Apache Mahout is an open-source project
used to build scalable machine-learning
libraries. It started as a subproject
of Apache Lucene and became a
top-level Apache project in 2010.
8. SOLR
Apache Solr is an open-source, REST-API-based search
server platform written in Java by the Apache Software
Foundation. Solr is a highly scalable, ready-to-deploy search
engine that can handle large volumes of text-centric data. The
purpose of Apache Solr is to index and search large amounts of
web content and return relevant results for a search query.
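The indexing that Solr (via Lucene) performs is, at its core, the construction of an inverted index: a map from each term to the documents containing it, so a query resolves by lookup rather than by scanning every document. This is an illustrative sketch of that idea, not the Solr API; the documents are made up.

```python
from collections import defaultdict

docs = {
    1: "hadoop stores big data",
    2: "solr searches big text collections",
    3: "spark processes data in memory",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return ids of documents containing every query term (simple AND query)."""
    result = set(docs)
    for term in query.lower().split():
        result &= index.get(term, set())
    return sorted(result)
```

Real Solr adds analysis (stemming, tokenization), relevance ranking, and distributed sharding on top, but the inverted index is the structure that makes search fast.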
9. SQOOP
Sqoop is a tool designed to transfer data
between Hadoop and relational database
servers. It is used to import data from relational
databases such as MySQL and Oracle into Hadoop's
HDFS, and to export data from the Hadoop file system back to relational databases.
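Conceptually, a Sqoop import turns relational rows into delimited text records in HDFS files, and an export does the reverse. The sketch below shows that round trip using sqlite3 and an in-memory list standing in for an HDFS file; the table and data are made up for illustration, and real Sqoop is driven from the command line, not this code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(1, "asha"), (2, "ravi")])

# "import": table rows -> delimited text records, as Sqoop writes into HDFS
hdfs_file = ["%d,%s" % row for row in conn.execute("SELECT id, name FROM employees")]

# "export": delimited records -> rows in a relational table
conn.execute("CREATE TABLE employees_copy (id INTEGER, name TEXT)")
for line in hdfs_file:
    ident, name = line.split(",")
    conn.execute("INSERT INTO employees_copy VALUES (?, ?)", (int(ident), name))
```

Real Sqoop parallelizes this by splitting the table across several mapper tasks, each importing a slice of the rows.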
10. KAFKA
Apache Kafka is a distributed publish-subscribe messaging system and a robust
queue that can handle a high volume of data and
enables you to pass messages from one endpoint to
another. Kafka is suitable for both offline and online
message consumption. Kafka messages are persisted
on the disk and replicated within the cluster to prevent
data loss. Kafka is built on top of the ZooKeeper
synchronization service. It integrates very well with Apache Storm and Spark for
real-time streaming data analysis.
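The properties above (offline and online consumption, persistence) follow from Kafka's core model: a broker appends each published message to a per-topic log, and every subscriber reads at its own offset, so a consumer can fall behind or be offline and catch up later. The toy below illustrates that model only; it is not the Kafka API, and it omits partitions, replication, and ZooKeeper coordination.

```python
class ToyBroker:
    def __init__(self):
        self.logs = {}      # topic -> append-only list of messages (the "log")
        self.offsets = {}   # (topic, consumer) -> next index that consumer reads

    def publish(self, topic, message):
        self.logs.setdefault(topic, []).append(message)

    def consume(self, topic, consumer):
        """Return all messages this consumer has not yet seen, advancing its offset."""
        log = self.logs.get(topic, [])
        start = self.offsets.get((topic, consumer), 0)
        self.offsets[(topic, consumer)] = len(log)
        return log[start:]

broker = ToyBroker()
broker.publish("clicks", "page=home")
broker.publish("clicks", "page=about")
first = broker.consume("clicks", "analytics")   # both messages so far
broker.publish("clicks", "page=cart")
second = broker.consume("clicks", "analytics")  # only the new one
```

Because consumption only moves a per-consumer offset and never deletes messages, many independent consumers can read the same topic at their own pace.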
11. SPARK
Apache Spark is a fast, general-purpose cluster-computing
technology, designed for fast computation. It builds on the
Hadoop MapReduce model and extends it to efficiently
support more types of computation, including
interactive queries and stream processing.
The main feature of Spark is its
in-memory cluster computing, which increases the
processing speed of an application.
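The speedup from in-memory computing comes from deriving an intermediate dataset once and reusing it across several computations, instead of re-reading it from disk each time as chained MapReduce jobs would. This sketch illustrates only that idea; it is not the Spark API.

```python
derivations = {"count": 0}       # tracks how often the base data is (re)derived

def load_and_clean():
    """Stands in for an expensive read-from-disk-and-transform step."""
    derivations["count"] += 1
    return [x * x for x in range(10)]

class CachedDataset:
    """Derive a dataset lazily once, then serve later uses from memory."""

    def __init__(self, derive):
        self.derive = derive
        self._data = None

    def collect(self):
        if self._data is None:    # first use: compute and cache
            self._data = self.derive()
        return self._data

ds = CachedDataset(load_and_clean)
total = sum(ds.collect())         # first action derives the data
biggest = max(ds.collect())       # second action reuses the cached copy
```

In Spark this pattern corresponds to caching an RDD or DataFrame so that multiple actions over it do not each repeat the upstream work.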
12. FLUME
Apache Flume is a system used for moving massive quantities of
streaming data into HDFS. A common use case is collecting log data
from web servers' log files and aggregating it in HDFS for
analysis.
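A Flume agent wires together a source (e.g. a web server's log), a buffering channel, and a sink (e.g. HDFS). The sketch below mimics that pipeline with plain Python lists; the log lines are made up for illustration, and this is not Flume's configuration or API.

```python
class ToyFlumeAgent:
    def __init__(self):
        self.channel = []         # buffers events between source and sink
        self.hdfs_sink = []       # stands in for files written to HDFS

    def source_event(self, line):
        # the source ingests one event (e.g. a log line) into the channel
        self.channel.append(line)

    def drain_to_sink(self):
        # the sink drains buffered events out of the channel, in order
        while self.channel:
            self.hdfs_sink.append(self.channel.pop(0))

agent = ToyFlumeAgent()
for line in ["GET /home 200", "GET /about 404"]:
    agent.source_event(line)
agent.drain_to_sink()
```

The channel is the key piece: it decouples how fast events arrive from how fast the sink can write them, so bursts from the source do not overwhelm HDFS.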
13. AMBARI
Apache Ambari is a software project of the Apache
Software Foundation. Ambari enables system
administrators to provision, manage and monitor a
Hadoop cluster, and also to integrate Hadoop with
the existing enterprise infrastructure.
14. LUCENE
The Apache Lucene project develops open-source search software. The project
releases a core search library, named Lucene Core,
as well as the Solr search server. Lucene Core is a Java library
providing powerful indexing and search features,
as well as spellchecking, hit highlighting, and advanced analysis/tokenization
capabilities. The PyLucene subproject provides Python bindings for Lucene Core.