Hadoop Intro - Part1


BIG DATA ANALYTICS

Hadoop Introduction
OUTLINE
• Hadoop Introduction
• Features – Hadoop
• Advantages – Hadoop
• Versions – Hadoop
• Hadoop Eco System
• Hadoop Distribution
• Hadoop vs SQL
• Integrated Hadoop Systems
• Cloud based Hadoop Solutions
Hadoop - Introduction
Hadoop - Features
Hadoop – Key Advantages
Hadoop – Versions
Hadoop – Versions – 1.0
Hadoop – Versions – 1.0 - limitations
Hadoop – Versions – 2.0
Hadoop – Eco System
Hadoop – Eco System - HDFS
HDFS - Hadoop Distributed File System
• The Hadoop Distributed File System is the core component, the backbone of the Hadoop Ecosystem.

• HDFS is what makes it possible to store different types of large data sets (i.e. structured, semi-structured and unstructured data).

• HDFS creates a level of abstraction over the underlying resources, from where we can see the whole of HDFS as a single unit.

• It helps us store our data across various nodes while maintaining a log file about the stored data (the metadata).
HDFS - Hadoop Distributed File System
• HDFS has two core components, i.e.
1) NameNode
2) DataNode
• The NameNode is the main node and it doesn't store the actual data. It contains metadata, just like a log file, or you can think of it as a table of contents. Therefore, it requires less storage but high computational resources.
• On the other hand, all your data is stored on the DataNodes, which therefore require more storage resources. These DataNodes are commodity hardware (like your laptops and desktops) in the distributed environment. That's the reason why Hadoop solutions are very cost-effective.
• You always communicate with the NameNode when writing data. The NameNode then replies to the client with the DataNodes on which the data should be stored and replicated, and the client writes to those DataNodes directly (see the sketch below).
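To make that write path concrete, here is a minimal sketch using Hadoop's Java FileSystem API; the NameNode address and file path are placeholders, and the fs.defaultFS setting would normally come from core-site.xml rather than be hard-coded.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode where to place the blocks; the
        // returned stream then writes directly to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.writeUTF("Hello HDFS");
        }
        fs.close();
    }
}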
HDFS - Hadoop Distributed File System - HBase
APACHE HBase – NoSQL Database
HDFS - Hadoop Distributed File System vs HBase
Hadoop – Eco System
APACHE Sqoop – Data Ingestion Service
Hadoop – Sqoop & Uses
APACHE Flume – Data Ingestion Service
Hadoop – Flume
Hadoop – Components for Data Processing
• MapReduce
• Spark
Hadoop – Components for Data Processing
MapReduce
• A programming model that allows distributed and parallel processing of huge data sets
• Based on Google's MapReduce paper (2004)

• Gets its input from HDFS


• Two phases: Map & Reduce

• MAP:
• Converts the input into another set of data expressed as (key, value) pairs.
• A new intermediate data set is created, which is passed as the input to REDUCE.

• REDUCE:
• From its input, it combines (aggregates and consolidates) the pairs and reduces them to a smaller set of tuples. The result is stored back in HDFS (see the sketch below).
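To make the two phases concrete, here is a minimal word-count sketch against the standard Hadoop Java MapReduce API (the input and output paths are placeholders): the Mapper emits intermediate (word, 1) pairs, and the Reducer aggregates them into a smaller set of (word, count) tuples written back to HDFS.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // MAP phase: convert each input line into intermediate (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                ctx.write(word, ONE); // intermediate (key, value) pair
            }
        }
    }

    // REDUCE phase: aggregate the intermediate pairs into (word, count) tuples.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}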
Hadoop – Components for Data Processing
SPARK
• Programming model
• Open-source big data processing framework
• Written in Scala
• Provides in-memory computing for Hadoop

• Workloads execute in memory rather than on disk


• This makes execution up to 10 times faster
• If data sets are too large to fit in the available memory, it falls back to conventional disk-based processing.
• A potentially faster and more flexible alternative to MapReduce
• Can access data from HDFS directly, bypassing MapReduce processing
• As a programming model, it works with Scala, Python and R; a Java API is also available (see the sketch below)
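To keep the examples in one language, here is a minimal word-count sketch using Spark's Java API, assuming Spark 2.x; the HDFS paths and the local[*] master setting are placeholders. Note that all the intermediate transformations run in memory, and only the final result is written out.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Read straight from HDFS, bypassing MapReduce (placeholder path).
        JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/user/demo/input");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum); // aggregation happens in memory
        counts.saveAsTextFile("hdfs://namenode:8020/user/demo/output"); // placeholder
        sc.stop();
    }
}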
Hadoop – Components for Data Processing
SPARK - Libraries

• Hadoop and Spark work together in many organizations.


• Spark is mainly used for high-speed in-memory computing
• It also runs advanced real-time analytics
Hadoop – Components for Data Analysis
• PIG
• HIVE
Hadoop – Components for Data Analysis
PIG
• High-level scripting language
• An alternative to writing MapReduce directly
• PIG has two parts:
1) Pig Latin, the language, and
2) the Pig runtime, the execution environment. You can think of them as analogous to Java and the JVM.
• It supports the Pig Latin language, which has a SQL-like command structure.
• Not everyone comes from a programming background, so Apache PIG relieves them. You might be curious to know how?
• Well, here is an interesting fact:
• 10 lines of Pig Latin = approx. 200 lines of MapReduce Java code
• But don't be shocked when I say that at the back end of a Pig job, a MapReduce job executes.
Hadoop – Components for Data Analysis
PIG
• The compiler internally converts Pig Latin into MapReduce jobs.
• It produces a sequential set of MapReduce jobs, and that is the abstraction (it works like a black box).
• PIG was initially developed by Yahoo.
• It gives you a platform for building data flows for ETL (Extract, Transform and Load), and for processing and analyzing huge data sets.

• How does Pig work?


• In PIG, first the LOAD command loads the data from HDFS.
• Then we perform various operations on it, such as grouping, filtering, joining and sorting.
• At last, you can either DUMP the result to the screen or STORE it back in HDFS (see the sketch below).
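As a sketch of this flow, Pig Latin can also be embedded in Java through Pig's PigServer API; the file paths, schema and aliases below are hypothetical, and the steps mirror the load / transform / store flow described above.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlowExample {
    public static void main(String[] args) throws Exception {
        // Runs each registered query as MapReduce jobs on the cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // LOAD: read the data from HDFS (hypothetical file and schema).
        pig.registerQuery("raw = LOAD '/user/demo/clicks.tsv' AS (user:chararray, hits:int);");
        // Transform: filtering and grouping, as in the flow above.
        pig.registerQuery("busy = FILTER raw BY hits > 100;");
        pig.registerQuery("byUser = GROUP busy BY user;");
        // STORE: write the result back into HDFS (hypothetical path).
        pig.store("byUser", "/user/demo/busy_users");
        pig.shutdown();
    }
}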
Hadoop – Components for Data Analysis
HIVE
• Facebook created HIVE for people who are fluent in SQL. Thus, HIVE makes them feel at home while working in the Hadoop Ecosystem.

• Basically, HIVE is a data warehousing software project that runs on top of Hadoop and performs summarization, querying and analysis of large data sets in a distributed environment using an SQL-like interface.

• HIVE + SQL = HQL

• The query language of Hive is called Hive Query Language (HQL), and it is very similar to SQL.

• It has 2 basic components: the Hive command line and the JDBC/ODBC driver.


1) The Hive command line interface is used to execute HQL commands.
2) Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers are used to establish a connection from applications to the data storage (see the JDBC sketch below).
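A minimal sketch of the JDBC route, assuming the standard HiveServer2 driver (org.apache.hive.jdbc.HiveDriver); the host, credentials and the articles table are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 typically listens on port 10000; host is a placeholder.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = con.createStatement();
             // HQL query: very similar to SQL (table name is hypothetical).
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM articles GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}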
Hadoop – Components for Data Analysis
HIVE
• Secondly, Hive is highly scalable, as it can serve both purposes: large data set processing (i.e. batch query processing) and real-time processing (i.e. interactive query processing).

• It supports all primitive data types of SQL.

• You can use predefined functions, or write tailored user-defined functions (UDFs) to accomplish your specific needs (a UDF sketch follows).
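As a sketch of a tailored function, here is a minimal UDF written against Hive's classic UDF base class (org.apache.hadoop.hive.ql.exec.UDF; newer Hive releases favor GenericUDF). The class and function names are hypothetical.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial UDF that upper-cases a string column.
public class UpperUdf extends UDF {
    public Text evaluate(Text input) {
        if (input == null) return null;
        return new Text(input.toString().toUpperCase());
    }
}

Once packaged into a jar, it would be registered from the Hive CLI with ADD JAR and CREATE TEMPORARY FUNCTION my_upper AS 'UpperUdf'; before being used in an HQL query (the jar and function names here are hypothetical).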
Hadoop – Other Components
• Impala
• ZooKeeper
• Oozie
• Mahout
• Chukwa
• Ambari
Hadoop – Other Components
APACHE ZooKeeper – Coordinator
APACHE MAHOUT – Machine Learning
• What does Mahout do?
It performs collaborative filtering, clustering and classification. Some people also consider frequent item set mining to be a Mahout function. Let us understand them individually:
1) Collaborative filtering: Mahout mines user behaviors, their patterns and their characteristics, and based on these it predicts and makes recommendations to users. The typical use case is an e-commerce website (see the recommender sketch after this list).
2) Clustering: It organizes similar groups of data together; for example, a collection of articles can contain blogs, news, research papers etc.
3) Classification: It means classifying and categorizing data into various sub-categories; for example, articles can be categorized into blogs, news, essays, research papers and other categories.
4) Frequent item set mining: Here Mahout checks which objects are likely to appear together and makes suggestions if one of them is missing. For example, a cell phone and a cover are generally bought together, so if you search for a cell phone, it will also recommend the cover and cases.
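To make collaborative filtering concrete, here is a minimal sketch using Mahout's classic Taste recommender API; the ratings.csv file (rows of userID,itemID,preference), the user ID and the neighborhood size are all hypothetical.

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // Each line: userID,itemID,preference (hypothetical file).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Consider the 10 most similar users when predicting.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Recommend 3 items for user 42 based on similar users' behavior.
        for (RecommendedItem item : recommender.recommend(42L, 3)) {
            System.out.println(item.getItemID() + " @ " + item.getValue());
        }
    }
}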
APACHE OOZIE – Scheduling
APACHE Ambari – Cluster Manager
Hadoop – Distributions
Hadoop vs SQL
Hadoop – Integrated Hadoop Systems by Leading Market Vendors
Hadoop – Cloud based Solutions
