Hadoop

Hadoop is an open-source framework designed for storing and processing large data sets across distributed clusters of computers. It uses a simple programming model and provides a scalable, fault-tolerant environment for big data processing.

1. HDFS (Hadoop Distributed File System)

Function: HDFS is the primary storage system of Hadoop. It splits large files into blocks and distributes them across nodes in a cluster, ensuring fault tolerance and high availability.

Components:
• NameNode: Manages the metadata and directory structure of HDFS. It keeps track of where each block is stored in the cluster.
• DataNodes: Store the actual data. They serve read and write requests from clients and perform block creation, deletion, and replication based on the NameNode's instructions.

Features:
• Fault Tolerance: Each block is replicated across multiple DataNodes (three replicas by default).
• Scalability: Easily scales by adding more DataNodes.
• High Throughput: Designed for large files and high aggregate data bandwidth.
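
A minimal sketch of this client model using the HDFS Java API is shown below; the NameNode address and file path are hypothetical placeholders, and in practice fs.defaultFS would come from core-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally read from core-site.xml
            conf.set("fs.defaultFS", "hdfs://namenode:9000");

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/hello.txt");

            // Write: the client streams block data to DataNodes;
            // the NameNode only records the file's metadata
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("Hello, HDFS");
            }

            // Read the file back from whichever DataNodes hold its blocks
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }
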
2. YARN (Yet Another Resource Negotiator)

Function: YARN is Hadoop's cluster resource management layer. It allocates compute resources (CPU and memory) to applications and schedules their execution across the cluster.

Components:
a) ResourceManager: Allocates system resources to running applications. It has two main components:
o Scheduler: Allocates resources to applications based on constraints such as capacity, queues, and user priorities.
o ApplicationsManager: Accepts job submissions and manages each application's lifecycle, negotiating the first container for its ApplicationMaster.
b) NodeManager: Manages resources and monitors their usage on each worker node. It reports resource usage to the ResourceManager and handles container lifecycle management.

Features:
• Resource Utilization: Efficiently manages cluster resources.
• Multi-Tenancy: Supports multiple applications and frameworks on a single cluster.
• Scalability: Can handle a large number of nodes and applications.
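
As a brief sketch of this layer in action, the YarnClient API can ask the ResourceManager for the state of each NodeManager; this assumes a yarn-site.xml with the ResourceManager address is on the classpath:

    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnStatus {
        public static void main(String[] args) throws Exception {
            // Connects to the ResourceManager configured in yarn-site.xml
            YarnClient client = YarnClient.createYarnClient();
            client.init(new YarnConfiguration());
            client.start();

            // Each NodeReport describes one NodeManager and its capacity
            for (NodeReport node : client.getNodeReports(NodeState.RUNNING)) {
                System.out.println(node.getNodeId() + " capacity: " + node.getCapability());
            }
            client.stop();
        }
    }
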
3. MapReduce
Function: MapReduce is a programming model and processing engine for large data sets. It
processes data in parallel across a Hadoop cluster.

Phases:
• Map Phase: The input data is split into independent chunks, each processed by the mapper function to produce intermediate key-value pairs.
• Shuffle and Sort Phase: The intermediate data is shuffled and sorted by key to prepare for the reduce phase.
• Reduce Phase: The reducer function processes the intermediate data, aggregates results, and produces the final output.

Features:
• Parallel Processing: Executes tasks in parallel across the cluster.
• Fault Tolerance: Automatically re-executes failed tasks.
• Data Locality: Processes data on the node where it is stored, reducing data movement.
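
The canonical example of these phases is word count: the map phase emits a (word, 1) pair for every word, shuffle-and-sort groups the pairs by word, and the reduce phase sums the counts. A condensed version of the standard Hadoop WordCount:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: split each line into words and emit (word, 1)
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word after shuffle-and-sort
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local map-side pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
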
4. HBase
Function: HBase is a NoSQL database that provides real-time read/write access to large
datasets. It is built on top of HDFS.
Features:
• Column-Oriented Storage: Stores data in column families, providing fast read/write access.
• Scalability: Scales horizontally by adding more nodes.
• Real-Time Processing: Supports real-time querying and updates.
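
A minimal sketch of the HBase Java client, writing and reading a single cell; the table name "users" and column family "info" are hypothetical and assumed to already exist:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Reads the ZooKeeper quorum and cluster settings from hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write one cell: row key "user1", column family "info", qualifier "name"
                Put put = new Put(Bytes.toBytes("user1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);

                // Read it back by row key in real time
                Result result = table.get(new Get(Bytes.toBytes("user1")));
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }
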
5. Hive
Function: Hive is a data warehousing solution on top of Hadoop that allows users to query
and manage large datasets using SQL-like syntax (HiveQL).
Features:
• Data Abstraction: Simplifies Hadoop data processing with familiar SQL-based querying.
• Compatibility: Integrates with traditional BI tools.
• Scalability: Handles large datasets efficiently.
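
Because HiveServer2 exposes a JDBC interface, HiveQL can be issued directly from Java. In this sketch the server address and the "sales" table are hypothetical, and the hive-jdbc driver is assumed to be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 address; "default" is the database name
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 // A HiveQL query that Hive compiles into distributed jobs
                 ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(amount) FROM sales GROUP BY region")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }
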
6. Pig
Function: Pig is a high-level scripting platform for creating MapReduce programs using a
language called Pig Latin.
Features:
• Ease of Use: Simplifies the development of data processing scripts.
• Extensibility: Supports user-defined functions for custom processing.
• Optimization: Automatically optimizes the execution of Pig Latin scripts.
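
Pig Latin statements can also be embedded in Java through the PigServer API. A brief sketch follows; the input path, field layout, and output directory are hypothetical:

    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            // "mapreduce" runs on the cluster; "local" runs in-process for testing
            PigServer pig = new PigServer("mapreduce");

            // Each registerQuery call adds one Pig Latin statement to the script
            pig.registerQuery("logs = LOAD '/data/logs.txt' AS (level:chararray, msg:chararray);");
            pig.registerQuery("errors = FILTER logs BY level == 'ERROR';");

            // store() triggers execution and writes the results to HDFS
            pig.store("errors", "/data/errors_out");
            pig.shutdown();
        }
    }
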
7. ZooKeeper
Function: ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Features:
• Coordination: Manages and coordinates distributed applications.
• Configuration Management: Keeps track of configuration changes.
• Leader Election: Helps in selecting the master node in distributed systems.
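
A short sketch of the ZooKeeper Java client storing and reading a piece of configuration at a znode; the ensemble address and znode path are hypothetical:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Hypothetical ensemble address; the watcher fires once the session is up
            ZooKeeper zk = new ZooKeeper("zk1:2181", 10000, event -> connected.countDown());
            connected.await();

            // Store a small piece of configuration at a znode
            zk.create("/config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Any client in the cluster can now read (and watch) the same znode
            byte[] data = zk.getData("/config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }
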
8. Oozie
Function: Oozie is a workflow scheduler system to manage Hadoop jobs.
Features:
• Workflow Management: Defines a Hadoop job as a sequence of actions (a workflow).
• Job Coordination: Manages the execution of complex workflows.
• Time and Data Triggers: Supports scheduling jobs based on time and data availability.
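
A brief sketch of submitting a workflow through the OozieClient Java API; the server URL, HDFS paths, and property values are hypothetical placeholders, and the workflow.xml is assumed to already exist on HDFS:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    public class OozieSubmit {
        public static void main(String[] args) throws Exception {
            // Hypothetical Oozie server URL
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            Properties conf = oozie.createConfiguration();
            // APP_PATH points at an HDFS directory containing workflow.xml
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/user/demo/workflow");
            conf.setProperty("nameNode", "hdfs://namenode:9000");
            conf.setProperty("jobTracker", "resourcemanager:8032");

            // run() submits and starts the workflow; the returned id can be polled
            String jobId = oozie.run(conf);
            System.out.println("Workflow job " + jobId + ": " + oozie.getJobInfo(jobId).getStatus());
        }
    }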

Hadoop's architecture is designed to handle big data efficiently with its distributed storage
(HDFS) and processing (MapReduce) capabilities. YARN enhances resource management,
while components like HBase, Hive, Pig, ZooKeeper, and Oozie provide additional
functionalities to build a robust big data ecosystem. This modular architecture allows
Hadoop to scale horizontally and handle large datasets, making it a powerful tool for big
data analytics.
