Hadoop Intro - Part1
Hadoop Introduction
OUTLINE
• Hadoop Introduction
• Features – Hadoop
• Advantages – Hadoop
• Versions – Hadoop
• Hadoop Ecosystem
• Hadoop Distribution
• Hadoop vs SQL
• Integrated Hadoop Systems
• Cloud-based Hadoop Solutions
Hadoop - Introduction
Hadoop - Features
Hadoop – Key Advantages
Hadoop – Versions
Hadoop – Versions – 1.0
Hadoop – Versions – 1.0 - limitations
Hadoop – Versions – 2.0
Hadoop – Ecosystem
Hadoop – Ecosystem - HDFS
Hadoop Ecosystem
HDFS - Hadoop Distributed File System
• The Hadoop Distributed File System is the core component, the backbone of the
Hadoop ecosystem.
• HDFS makes it possible to store different types of large data
sets (i.e. structured, semi-structured and unstructured data).
• HDFS creates a level of abstraction over the underlying storage resources, so
the whole cluster can be viewed as a single unit.
• It stores our data across various nodes and maintains metadata
about the stored data, much like a log file.
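The block-based storage described above can be sketched in a few lines of plain Python. This is an illustration only, not the real HDFS implementation: the block size, replication factor, and node names are invented for the demo (real HDFS uses 128 MB blocks and a replication factor of 3 by default).

```python
# Illustrative sketch: split a file into fixed-size blocks and assign
# each block to several DataNodes, the way HDFS replicates data.

BLOCK_SIZE = 8        # bytes, tiny for the demo; real HDFS uses 128 MB
REPLICATION = 3       # HDFS default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split raw bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for idx, _ in enumerate(blocks):
        placement[idx] = [datanodes[(idx + r) % len(datanodes)]
                          for r in range(replication)]
    return placement

data = b"hello hadoop distributed file system"
blocks = split_into_blocks(data)
placement = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), placement[0])   # 5 blocks; block 0 on dn1, dn2, dn3
```

Losing any single node still leaves two copies of every block, which is the point of replication.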
HDFS - Hadoop Distributed File System
• HDFS has two core components, i.e.
1) NameNode
2) DataNode
• The NameNode is the master node, and it doesn’t store the actual data. It
contains only metadata, much like a log file or a table of contents.
Therefore, it requires less storage but high computational resources.
• On the other hand, all your data is stored on the DataNodes, which therefore
require more storage resources. These DataNodes are commodity hardware
(like your laptops and desktops) in the distributed environment. That’s the
reason why Hadoop solutions are very cost-effective.
• When writing data, the client always contacts the NameNode first. The
NameNode responds with a list of target DataNodes, and the client then stores
and replicates the data on those DataNodes.
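The write path just described can be modelled as a toy sketch. All class and method names here are invented for illustration; the real HDFS client/NameNode protocol is far more involved (pipelined block writes, heartbeats, etc.).

```python
# Toy model of the HDFS write path: the client asks the NameNode where to
# write; the NameNode records metadata and returns DataNode locations;
# the client then writes the replicas to those DataNodes.

class DataNode:
    """Stores the actual file contents."""
    def __init__(self, name):
        self.name = name
        self.storage = {}

    def write(self, filename, data):
        self.storage[filename] = data

class NameNode:
    """Stores only metadata: which DataNodes hold which file."""
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.metadata = {}                  # filename -> list of node names

    def allocate(self, filename):
        targets = self.datanodes[:self.replication]
        self.metadata[filename] = [dn.name for dn in targets]
        return targets

def client_write(namenode, filename, data):
    # 1) ask the NameNode for target DataNodes, 2) write replicas directly
    for dn in namenode.allocate(filename):
        dn.write(filename, data)

datanodes = [DataNode(f"dn{i}") for i in range(1, 5)]
nn = NameNode(datanodes)
client_write(nn, "logs.txt", b"event data")
print(nn.metadata["logs.txt"])   # ['dn1', 'dn2', 'dn3']
```

Note that the NameNode never touches the payload: it only hands out locations and keeps the table of contents.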
HDFS - Hadoop Distributed File System - HBase
APACHE HBase – NoSQL Database
HDFS - Hadoop Distributed File System vs HBase
Hadoop – Ecosystem
APACHE Sqoop – Data Ingestion Service
Hadoop – Sqoop & Uses
APACHE Flume – Data Ingestion Service
Hadoop – Flume
Hadoop – Components for Data Processing
• MapReduce
• Spark
Hadoop – Components for Data Processing
MapReduce
• A programming model that allows distributed and parallel processing of huge datasets
• Based on Google’s MapReduce paper (2004)
• MAP:
• Converts the input into another set of data as (key, value) pairs.
• A new intermediate dataset is created, which is passed as the input to REDUCE.
• REDUCE:
• From that input, it combines (aggregates and consolidates) the pairs and reduces them to a smaller set of
tuples. The result is stored back in HDFS.
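The MAP and REDUCE phases above can be sketched as a classic word count in plain Python. This is a minimal stand-in, not a real Hadoop job: there is no cluster, and the sort step imitates Hadoop’s shuffle phase.

```python
# Word count: MAP emits (key, value) pairs, REDUCE aggregates them.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """MAP: convert each input line into (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """REDUCE: aggregate the shuffled pairs into a smaller set of tuples."""
    shuffled = sorted(pairs, key=itemgetter(0))   # stand-in for shuffle/sort
    return {key: sum(v for _, v in group)
            for key, group in groupby(shuffled, key=itemgetter(0))}

lines = ["big data big compute", "big data"]
counts = reduce_phase(map_phase(lines))
print(counts)   # {'big': 3, 'compute': 1, 'data': 2}
```

In a real job the mappers and reducers run in parallel on different nodes, and the framework handles the shuffle between them.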
Hadoop – Components for Data Processing
Spark
• Programming model
• Open-source big data processing framework
• Written in Scala.
• Provides in-memory computing, in contrast to Hadoop MapReduce’s disk-based processing.
APACHE Hive – Data Warehousing
• Basically, Hive is a data warehousing software project which runs on top of Hadoop and performs
summarization, querying and analysis of large data sets in a distributed environment using an SQL-like
interface.
• The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
• You can use predefined functions, or also write tailored user-defined functions (UDFs), to
accomplish your specific needs.
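To make the SQL-like interface concrete, here is a hypothetical HQL query (the `staff` table and its columns are invented for illustration) alongside plain Python that computes the same GROUP BY aggregation over an in-memory stand-in for the table.

```python
# Hypothetical HQL query (table and column names are invented):
#
#   SELECT department, COUNT(*) AS employees
#   FROM   staff
#   GROUP BY department;
#
# Equivalent aggregation in plain Python:
from collections import Counter

staff = [
    {"name": "asha",  "department": "sales"},
    {"name": "ravi",  "department": "engineering"},
    {"name": "meena", "department": "sales"},
]

employees_per_department = Counter(row["department"] for row in staff)
print(dict(employees_per_department))   # {'sales': 2, 'engineering': 1}
```

Hive compiles such a query into MapReduce (or Spark) jobs behind the scenes, so analysts can work in SQL without writing mappers and reducers by hand.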
Hadoop – Other Components
• Impala
• ZooKeeper
• Oozie
• Mahout
• Chukwa
• Ambari
Hadoop – Other Components
ZooKeeper – Coordinator
APACHE Mahout – Machine Learning
• What does Mahout do?
It performs collaborative filtering, clustering and classification. Some
people also consider frequent itemset mining a Mahout function. Let
us understand them individually:
1) Collaborative filtering: Mahout mines user behaviors, patterns and
characteristics, and based on these it predicts and makes recommendations to users.
The typical use case is an e-commerce website.
2) Clustering: It organizes similar groups of data together; for example, articles can include blogs,
news, research papers, etc.
3) Classification: It means classifying and categorizing data into various sub-
departments; for example, articles can be categorized into blogs, news, essays, research papers
and other categories.
4) Frequent itemset mining: Here Mahout checks which objects are likely to
appear together and makes suggestions if one is missing. For example, cell
phones and covers are generally bought together. So, if you search for a cell phone, it
will also recommend the cover and cases.
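The cell-phone-and-cover example can be sketched as a tiny co-occurrence recommender. This is only an illustration of the frequent-itemset idea; Mahout’s actual algorithms (and its data) are far more involved, and the baskets below are invented.

```python
# Count how often item pairs appear in the same basket, then recommend
# the most frequent companions of a queried item.
from collections import Counter
from itertools import combinations

baskets = [
    {"cell phone", "cover"},
    {"cell phone", "cover", "charger"},
    {"cell phone", "cover"},
    {"cell phone", "charger"},
    {"laptop", "mouse"},
]

# Count every unordered item pair that appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

def recommend(item, top_n=1):
    """Items most frequently co-occurring with `item`."""
    companions = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            companions[b] += n
        elif b == item:
            companions[a] += n
    return [other for other, _ in companions.most_common(top_n)]

print(recommend("cell phone"))   # ['cover']
```

Searching for a cell phone surfaces the cover, because that pair co-occurs most often in the purchase history.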
APACHE Oozie – Scheduling
APACHE Ambari – Cluster Manager
Hadoop – Distributions
Hadoop vs SQL
Hadoop – Integrated Hadoop Systems by Leading Market Vendors
Hadoop – Cloud-based Solutions