Unit 3 ETI (BDA)


The Big Data Technology:

Hadoop
Zikra Shaikh
3.1 Introduction to Hadoop:

Features of Hadoop, Key Advantages of Hadoop, Why Hadoop, RDBMS versus Hadoop
3.2 Hadoop Overview

3.3 Use Case of Hadoop

3.4 HDFS

3.5 Processing Data with Hadoop


Introduction to Hadoop:

● Hadoop is an open-source software framework that is used for storing


and processing large amounts of data in a distributed computing
environment.
● It is designed to handle big data and is based on the MapReduce
programming model, which allows for the parallel processing of large
datasets.
● Hadoop is an open-source software programming framework for storing
large amounts of data and performing computation on them.
● Its framework is based on Java programming with some native code in C
and shell scripts.
Evolution of Hadoop:

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin lies in the Google File System paper published
by Google.
Evolution of Hadoop:

The history of Hadoop can be traced through the following steps:

○ In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It
is an open source web crawler software project.
○ While working on Apache Nutch, they were dealing with big data. Storing that data was
very expensive, and this cost became a serious constraint on the project. The problem
became one of the important reasons for the emergence of Hadoop.
○ In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
○ In 2004, Google released a white paper on MapReduce. This technique simplifies
data processing on large clusters.
Evolution of Hadoop:

○ In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS
(Nutch Distributed File System). This file system also included MapReduce.
○ In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug
Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop
Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
○ Doug Cutting named his project Hadoop after his son's toy elephant.
○ In 2007, Yahoo was running two clusters of 1,000 machines.
○ In 2008, Hadoop became the fastest system to sort one terabyte of data, doing so on a 900-node cluster
in 209 seconds.
○ In 2013, Hadoop 2.2 was released.
○ In 2017, Hadoop 3.0 was released.
Hadoop Architecture

Hadoop framework is made up of the following modules:


1. Hadoop MapReduce - a MapReduce programming
model for handling and processing large data.
2. Hadoop Distributed File System (HDFS) - distributes and stores
files across the nodes of a cluster.
3. Hadoop YARN - a platform that manages
computing resources.
4. Hadoop Common - contains the packages and libraries
used by the other modules.
Features of Hadoop

● Suitable for Big Data Analysis


● Hadoop is Open Source
● Hadoop cluster is Highly Scalable
● Hadoop provides Fault Tolerance
● Hadoop provides High Availability
● Hadoop is very Cost-Effective
● Hadoop is Faster in Data Processing
● Hadoop is based on Data Locality concept
● Hadoop is Easy to use
● Hadoop ensures Data Reliability
Features of Hadoop

● Distributed Storage: Hadoop stores large data sets across multiple machines,
allowing for the storage and processing of extremely large amounts of data.
● Scalability: Hadoop can scale from a single server to thousands of machines,
making it easy to add more capacity as needed.
● Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can
continue to operate even in the presence of hardware failures.
● Data Locality: Hadoop stores data on the same node where it will be processed.
This reduces network traffic and improves performance.
Key Advantages of Hadoop

● Flexible Data Processing: Hadoop’s MapReduce programming model allows for


the processing of data in a distributed fashion, making it easy to implement a wide
variety of data processing tasks.
● Data Integrity: Hadoop provides a built-in checksum feature, which helps ensure
that the stored data is consistent and correct.
● Data Replication: Hadoop provides a data replication feature, which replicates
data across the cluster for fault tolerance.
● Data Compression: Hadoop provides a built-in data compression feature, which
helps reduce storage space and improve performance (a configuration sketch follows this list).
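The replication and compression features listed above are typically enabled through Hadoop's configuration and job APIs. The following is a minimal Java sketch (the class name and job setup are illustrative assumptions, not taken from the slides) showing how a client could request a replication factor of 3 and gzip-compressed job output.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReplicationAndCompressionDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask HDFS to keep three copies of each block written by this client
        conf.set("dfs.replication", "3");
        Job job = Job.getInstance(conf, "compressed output demo");
        // Compress the job's output files with gzip to reduce storage space
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}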
Features of Hadoop

● High Availability: Hadoop provides a High Availability feature, which helps ensure
that the data is always available and is not lost.
● YARN: A resource management platform that allows multiple data processing
engines like real-time streaming, batch processing, and interactive SQL, to run and
process data stored in HDFS.
Why Hadoop
RDBMS VS Hadoop
Hadoop overview

● HDFS: Hadoop Distributed File System


● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data services
● HBase: NoSQL Database
● Mahout, Spark MLlib: Machine Learning algorithm libraries
● Solr, Lucene: Searching and Indexing
● Zookeeper: Managing the cluster
● Oozie: Job Scheduling
Hadoop overview

All these toolkits or components revolve around one thing: data. That is the beauty of
Hadoop: everything is organized around the data, which makes processing and analysis easier.
HDFS:
● HDFS is the primary or major component of Hadoop ecosystem and is responsible
for storing large data sets of structured or unstructured data across various nodes
and thereby maintaining the metadata in the form of log files.
● HDFS consists of two core components i.e.
1. Name node
2. Data Node
Hadoop overview

● The Name Node is the prime node; it holds the metadata (data about data) and requires
comparatively fewer resources than the Data Nodes, which store the actual data. The
Data Nodes run on commodity hardware in the distributed environment, which undoubtedly
makes Hadoop cost-effective.
● HDFS maintains all the coordination between the clusters and the hardware, thus
working at the heart of the system.
Hadoop overview
YARN:
● Yet Another Resource Negotiator: as the name implies, YARN helps manage the resources across the
cluster. In short, it performs scheduling and resource allocation for the Hadoop
system.
● It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
● The Resource Manager allocates resources to the applications in the system, whereas
the Node Managers manage the resources (CPU, memory, bandwidth) on each machine and
report back to the Resource Manager. The Application Manager acts as an interface between the
Resource Manager and the Node Managers and negotiates between the two as required.
Hadoop overview
● MapReduce:
Using distributed and parallel algorithms, MapReduce carries the processing logic to the
data and helps developers write applications that transform big data sets into
manageable ones.
● MapReduce uses two functions, Map() and Reduce(), whose tasks are as follows (a small sketch follows this list):
1. Map() performs sorting and filtering of the data, organizing it into groups.
Map generates key-value pairs that are later processed by
the Reduce() method.
2. Reduce(), as the name suggests, performs summarization by aggregating the mapped
data. Put simply, Reduce() takes the output generated by Map() as input and combines
those tuples into a smaller set of tuples.
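To make the Map()/Reduce() division concrete, here is a minimal word-count sketch in Java using the Hadoop MapReduce API. The class names, and the choice of word counting as the example, are illustrative assumptions rather than something prescribed by the slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): split each input line into words and emit (word, 1) pairs
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce(): combine all the counts emitted for a word into a single total
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}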
Hadoop overview
PIG:
Pig was developed by Yahoo. It uses the Pig Latin language, a query-based
language similar to SQL.
● It is a platform for structuring the data flow, and for processing and analyzing huge data sets.
● Pig executes the commands, and in the background all the activities of
MapReduce are taken care of. After processing, Pig stores the result in HDFS.
● The Pig Latin language is specially designed for this framework and runs on Pig Runtime,
just the way Java runs on the JVM.
● Pig helps achieve ease of programming and optimization, and hence is a major
segment of the Hadoop Ecosystem.
Hadoop overview
HIVE:

● With the help of an SQL-like methodology and interface, HIVE performs reading and writing of
large data sets. Its query language is called HQL (Hive Query Language).
● It is highly scalable, as it allows both interactive (near-real-time) and batch processing. Also, all
the SQL data types are supported by Hive, making query processing easier.
● Like other query-processing frameworks, HIVE comes with two components:
JDBC drivers and the HIVE command line (a connection sketch follows this list).
● The JDBC and ODBC drivers are used to establish connections and data-access permissions,
whereas the HIVE command line helps in the processing of queries.
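As a hedged illustration of the JDBC driver mentioned above, the sketch below connects to a HiveServer2 instance and runs one HQL query. The host and port (localhost:10000), database (default), and table name (employees) are hypothetical placeholders, not values from the slides.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Hive's JDBC driver class, shipped with the Hive client libraries
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, salary FROM employees LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}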
Hadoop overview
Mahout:

● Mahout adds machine-learning capability to a system or application. Machine
Learning, as the name suggests, helps a system develop itself based on
patterns, user/environment interaction, or algorithms.
● It provides various libraries and functionalities such as collaborative filtering,
clustering, and classification, which are all concepts of Machine
Learning. It allows algorithms to be invoked as needed with the help of its own
libraries.
Hadoop overview
Apache Spark:
● It is a platform that handles processing-intensive tasks such as batch
processing, interactive or iterative real-time processing, graph computations,
and visualization.
● It uses in-memory resources and is therefore faster than the disk-based MapReduce engine
for many workloads (a small Java sketch follows this list).
● Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for
structured data and batch processing; hence both are used interchangeably in most
companies.
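The following is a minimal word-count sketch using Spark's Java API, intended only to show the in-memory, RDD-based style described above; the input/output paths and the local master setting are illustrative assumptions.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Load the input, split it into words, and count each word in memory
        JavaRDD<String> lines = sc.textFile("input.txt");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
        counts.saveAsTextFile("output");
        sc.stop();
    }
}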
Hadoop overview
Apache HBase:
● It is a NoSQL database that supports all kinds of data and is thus capable of
handling almost anything within a Hadoop ecosystem. It provides the capabilities of Google's
BigTable and is therefore able to work on big data sets effectively.
● At times we need to search for or retrieve a small number of records in a huge
database, and the request must be processed within a very short
span of time. At such times HBase comes in handy, as it provides a fault-tolerant way
of storing such data and reading it back quickly (a client sketch follows this list).
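A small Java client sketch follows, showing the kind of fast single-row write and read HBase is used for. The table name ("users"), column family ("info"), and row key are hypothetical, and a reachable HBase cluster with that table already created is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key "user1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            // Random read of that single row: the fast lookup described above
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}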
Hadoop overview
● Zookeeper: Coordinating and synchronizing the resources and components of Hadoop
was a major problem and often resulted in
inconsistency. Zookeeper overcame these problems by performing
synchronization, inter-component communication, grouping, and maintenance.
● Oozie: Oozie simply performs the task of a scheduler: it schedules jobs and
binds them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow
jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a
sequentially ordered manner, whereas Oozie coordinator jobs are those that are
triggered when some data or an external stimulus is given to them.
HDFS

● HDFS is the primary storage system used by Hadoop applications.


● It is based on the Google File System (GFS).
● It provides high-performance access to data across Hadoop clusters.
● It is a key tool for managing pools of big data and supporting big data analytics
applications.
● HDFS uses a master/slave architecture.
● It has two main components: the Name Node and the Data Nodes.
● The Name Node manages the metadata of the file system.
● The Data Nodes store the actual data.
● The file content is split into large blocks, and each block of a file is
independently replicated at multiple Data Nodes.
● The Name Node maps the blocks to Data Nodes (a client sketch follows this list).
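To make the Name Node / Data Node split concrete, here is a minimal sketch using the HDFS Java client. The Name Node address (hdfs://namenode:9000) and the file path are hypothetical; the client writes a small file and then asks the Name Node which Data Nodes hold its blocks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical Name Node address
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/sample.txt");
            // Write a small file; the data is streamed to Data Nodes chosen by the Name Node
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello hdfs");
            }
            // Ask the Name Node where the file's blocks (and their replicas) live
            FileStatus status = fs.getFileStatus(path);
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block hosts: " + String.join(",", loc.getHosts()));
            }
        }
    }
}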
Features Of HDFS

● Cost Effective
● Large datasets/ variety and volume of data
● Replication
● Fault tolerance and reliability
● High availability (Availability of data even during Namenode or Datanode
failure)
● Scalability
● High throughput (processing time decreases, thus throughput is high)
● Data locality: Data locality is the process of moving the computation close to
where the actual data resides on the node, instead of moving large data to
computation. This minimizes network congestion and increases the overall
throughput of the system.
Physical organization of Compute Nodes

Figure 2.1: Compute nodes are organized into racks, and racks are interconnected
by a switch
Physical organization of Compute Nodes

The new parallel-computing architecture, sometimes called cluster computing, is organized as


follows. Compute nodes are stored on racks, perhaps 8–64 on a rack. The nodes on a single
rack are connected by a network, typically gigabit Ethernet. There can be many racks of compute
nodes, and racks are connected by another level of network or a switch. The bandwidth of
inter-rack communication is somewhat greater than the intra-rack Ethernet, but given the number
of pairs of nodes that might need to communicate between racks, this bandwidth may be essential.
Figure 2.1 suggests this architecture, although there may be many more racks and many
more compute nodes per rack.

Some important calculations take minutes or even hours on thousands of compute nodes. If we
had to abort and restart the computation every time one component failed, then the computation
might never complete successfully. The solution to this problem takes two forms:
● Files must be stored redundantly.
● Computations must be divided into tasks, such that if any one task fails to
execute to completion, it can be restarted without affecting other tasks.
Processing Data with Hadoop
MapReduce is a style of computing that has been implemented in several
systems, including Google's internal implementation (simply called MapReduce)
and the popular open-source implementation Hadoop, which can be obtained,
along with the HDFS file system, from the Apache Foundation. You can use an
implementation of MapReduce to manage many large-scale computations in a
way that is tolerant of hardware faults. MapReduce is the processing layer in a
Hadoop environment. MapReduce works on tasks related to a job. The idea is to
tackle one large request by slicing it into smaller units.
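The slicing of a job into tasks is expressed through a driver program that configures and submits the job. The sketch below assumes the WordCountMapper and WordCountReducer classes from the earlier MapReduce sketch; the class name and command-line paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(WordCountMapper.class);   // map tasks process slices (splits) of the input
        job.setReducerClass(WordCountReducer.class); // reduce tasks aggregate the mapped key-value pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}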
JobTracker and TaskTracker

A JobTracker controlled the distribution of application requests to the compute


resources in a cluster. Since it monitored the execution and the status of
MapReduce, it resided on a master node.

A TaskTracker processed the requests that came from the JobTracker. All task
trackers were distributed across the slave nodes in a Hadoop cluster. (In Hadoop 2 and later,
YARN's Resource Manager and Node Managers took over these roles.)
