
INTRODUCTION TO HADOOP 2020
By Mr. Virendra

Introduction to Hadoop
What is Hadoop?
 Apache Hadoop is a framework that allows for the distributed processing of
large data sets across clusters of commodity computers using a simple
programming model.

 It is an open-source data management framework with scale-out storage and
distributed processing.
Hadoop Key Characteristics
Economical:
1. It is open source and freely available.

2. No license is required.
Reliable:
1. High availability of data.

2. Data lost due to a node failure can be recovered, because HDFS keeps
replicas on other nodes.



Flexible:
1. The number of nodes is not fixed; you can add any number of nodes to the
cluster.
Scalable:
1. You can process large data sets.

2. Your data may range from Kilobytes (KB) and Megabytes (MB) through
Gigabytes (GB), Terabytes (TB), Petabytes (PB), Exabytes (EB), Zettabytes
(ZB), and Yottabytes (YB).


Apache Hadoop Ecosystem

[Diagram: components of the Apache Hadoop ecosystem (HDFS, MapReduce, Pig,
Hive, HBase, Sqoop, Flume, Oozie, ZooKeeper, and Ambari), each described in
the sections that follow.]

HDFS (Hadoop distributed file system)


 The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware.

 It has many similarities with existing distributed file systems.


 HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.

 HDFS provides high-throughput access to application data and is suitable
for applications that have large data sets.

 The default HDFS block size is 64 MB (128 MB in newer Hadoop releases),
and it is configurable.

 HDFS was originally built as infrastructure for the Apache Nutch web
search engine project.

 HDFS is now an Apache Hadoop subproject.
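
As a concrete illustration (not part of the original notes), the sketch below
writes and reads a small file through Hadoop's Java FileSystem API. The file
path is hypothetical, and it assumes fs.defaultFS in the configuration points
at the cluster's NameNode.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at the cluster, e.g. hdfs://namenode:9000
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        // Splitting into blocks and replicating them is handled by HDFS,
        // not by the application.
        System.out.println("Block size: " + fs.getFileStatus(file).getBlockSize());
    }
}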

5
By Mr. Virendra
INTRODUCTION TO HADOOP 2020

Distributed Processing (MapReduce)


 Hadoop MapReduce is a software framework for easily writing applications
that process vast amounts of data (multi-terabyte data sets) in parallel on
large clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.

 A MapReduce job usually splits the input data-set into independent chunks
which are processed by the map tasks in a completely parallel manner.

 The framework sorts the outputs of the maps, which are then input to the
reduce tasks.

 Typically both the input and the output of the job are stored in a file-system.
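
The canonical example of this model is word count. The sketch below follows
the standard Hadoop WordCount tutorial: map tasks emit (word, 1) pairs from
their input split, the framework sorts and groups the pairs by word, and
reduce tasks sum the counts. The input and output HDFS paths are passed as
command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word (the framework has already
    // sorted and grouped the map output by key).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}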


Pig
 Apache Pig is a high-level data-flow platform for executing MapReduce
programs on Hadoop.

 The language for Pig is Pig Latin.

 Pig scripts are internally converted to MapReduce jobs and executed on
data stored in HDFS.

 Every task that can be achieved using Pig can also be achieved by writing
MapReduce programs directly in Java.

 Its key strengths are ease of programming, optimization opportunities, and
extensibility.
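
To illustrate how a few lines of Pig Latin replace hand-written MapReduce
code, here is a minimal word-count sketch using Pig's embedded PigServer API;
the input file and output directory names are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for demonstration; on a cluster, ExecType.MAPREDUCE
        // compiles the same data flow into MapReduce jobs.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Each registerQuery line is one Pig Latin statement.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        // Storing the result triggers compilation and execution of the flow.
        pig.store("counts", "wordcount_out");
    }
}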


Hive
 Hive is a data warehouse infrastructure tool to process structured data in
Hadoop.
 Hive was initially developed by Facebook; later, the Apache Software
Foundation took it up and developed it further as open source under the
name Apache Hive.

Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive

 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP (OnLine Analytical Processing).
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
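
As an illustrative sketch (not from the original notes), HiveQL can be
submitted from Java through the HiveServer2 JDBC driver; the host, port,
table name, and credentials below are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // Requires the hive-jdbc driver on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // HiveQL looks like SQL but compiles to batch (OLAP-style) jobs,
            // not row-level OLTP operations.
            stmt.execute("CREATE TABLE IF NOT EXISTS pageviews (url STRING, hits INT)");
            try (ResultSet rs = stmt.executeQuery(
                         "SELECT url, SUM(hits) AS total FROM pageviews GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + " -> " + rs.getLong("total"));
                }
            }
        }
    }
}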


HBase
 HBase is known as the Hadoop database.

 HBase is a column-oriented database management system that runs on top of
the Hadoop Distributed File System (HDFS).

 It is well suited for sparse data sets, which are common in many big data
use cases.

 HBase does not support a structured query language like SQL; data is
accessed by row key through a client API (see the sketch below).

 HBase supports writing applications in Apache Avro, REST, and Thrift.
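
A minimal sketch of that client API follows; it assumes an existing table
named "users" with a column family "info" (both hypothetical).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back by row key; no SQL is involved.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}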


Sqoop
Its name combines SQL + Hadoop.
 Sqoop is a tool designed to transfer data between Hadoop and relational
database servers.

 It is used to import data from relational databases such as MySQL or
Oracle into Hadoop HDFS, and to export data from the Hadoop file system
back to relational databases.

 Sqoop occupies a place in the Hadoop ecosystem by providing convenient
interaction between relational database servers and Hadoop's HDFS.
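
For illustration, a typical import/export round trip looks like the commands
below; the database URL, credentials, directories, and table names are
placeholders.

# Import a MySQL table into HDFS using 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders \
  --target-dir /user/etl/orders \
  --num-mappers 4

# Export an HDFS directory back into a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table order_summary \
  --export-dir /user/etl/summary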


Flume (Data streaming)


 Apache Flume is a system used for moving massive quantities of streaming
data into HDFS.

 Collecting log data from web-server log files and aggregating it in HDFS
for analysis is one common use case of Flume; a sample agent configuration
is sketched below.
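
Flume agents are wired together in a properties file. This minimal
single-agent sketch (the agent name, log path, and HDFS path are assumptions)
tails a web-server log and lands the events in HDFS; it would be started with
"flume-ng agent --conf-file flume.conf --name a1".

# flume.conf: one agent wiring an exec source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:9000/flume/weblogs
a1.sinks.k1.hdfs.fileType = DataStream

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1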

Oozie (scheduler system)


 Apache Oozie is a scheduler system to run and manage Hadoop jobs in a
distributed environment.

 It allows multiple complex jobs to be combined and run in sequential
order to achieve a bigger task.

 Within a sequence of tasks, two or more jobs can also be programmed to
run in parallel with each other, as the workflow sketch below shows.
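
Oozie workflows are defined in XML. The fork/join sketch below (node names
and paths are hypothetical) runs two filesystem actions in parallel before
continuing, matching the behavior described above.

<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
  <start to="parallel-step"/>
  <!-- fork starts both branches; join waits for both to finish -->
  <fork name="parallel-step">
    <path start="prep-a"/>
    <path start="prep-b"/>
  </fork>
  <action name="prep-a">
    <fs><mkdir path="${nameNode}/tmp/demo/a"/></fs>
    <ok to="merge"/>
    <error to="fail"/>
  </action>
  <action name="prep-b">
    <fs><mkdir path="${nameNode}/tmp/demo/b"/></fs>
    <ok to="merge"/>
    <error to="fail"/>
  </action>
  <join name="merge" to="end"/>
  <kill name="fail">
    <message>Workflow failed at ${wf:lastErrorNode()}</message>
  </kill>
  <end name="end"/>
</workflow-app>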


ZooKeeper (reliable cluster coordination service)

 The ZooKeeper framework was originally built at Yahoo! for accessing
their applications in an easy and robust manner.

 Later, Apache ZooKeeper became a standard coordination service used by
Hadoop, HBase, and other distributed frameworks.

 Apache ZooKeeper is an open-source project that deals with maintaining
configuration information, naming, and providing distributed
synchronization and group services for various distributed applications.
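
A minimal sketch of the configuration-maintenance use case through
ZooKeeper's Java client follows; the ensemble address and znode name are
placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble with a 5-second session timeout; the
        // watcher callback is left empty for brevity.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        String path = "/demo-config"; // hypothetical znode
        byte[] data = "jdbc:mysql://dbhost/sales".getBytes(StandardCharsets.UTF_8);

        // Create the znode once; afterwards any client in the cluster can read it.
        if (zk.exists(path, false) == null) {
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] stored = zk.getData(path, false, null);
        System.out.println(new String(stored, StandardCharsets.UTF_8));
        zk.close();
    }
}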

Ambari (Hadoop cluster manager)

 A completely open-source management platform for provisioning, managing,
monitoring, and securing Apache Hadoop clusters.

 Ambari enables system administrators to provision, manage, and monitor a
Hadoop cluster, and also to integrate Hadoop with the existing enterprise
infrastructure.