BDT_Unit04
UNIT III: Hadoop Ecosystem
Hadoop Ecosystem: Component
• Avro: Avro is an open source project that provides data serialization and data
exchange services for Hadoop. It is a serialization system for efficient,
cross-language RPC and persistent data storage.
• Features provided by Avro:
• Rich data structures.
• Remote procedure calls.
• A compact, fast, binary data format.
• A container file format for storing persistent data.
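For illustration, an Avro schema is just a JSON document; the record name and fields below are a hypothetical sketch, not taken from the slides:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "age",  "type": ["null", "int"], "default": null}
  ]
}

Records written against this schema go into Avro's compact binary container file and can be read back from any language with an Avro binding.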
Hadoop Ecosystem: Component
Pig:
• A data flow language and execution environment for exploring/processing very large datasets. Pig runs
on HDFS and MapReduce clusters.
• Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS.
• As a component of the Hadoop ecosystem, Pig uses the Pig Latin language, which is similar in spirit to SQL.
• It loads the data, applies the required filters, and dumps the data in the required format. For program
execution, Pig requires a Java runtime environment.
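A minimal Pig Latin sketch of that load/filter/dump flow (the file path and field names are hypothetical):

-- load a comma-separated file from HDFS (hypothetical path and schema)
sales = LOAD '/data/sales.csv' USING PigStorage(',')
        AS (country:chararray, product:chararray, amount:int);
-- apply the required filter
big = FILTER sales BY amount > 100;
-- dump to the console, or: STORE big INTO '/output/big_sales';
DUMP big;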
Hadoop Ecosystem: Component
Hive:
• A distributed data warehouse. Hive manages data stored in HDFS and
provides a query language based on SQL (which the runtime engine translates
into MapReduce jobs) for querying the data.
• Hive performs three main functions: data summarization, query, and analysis.
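A small HiveQL sketch (hypothetical table and columns); Hive's runtime engine compiles the GROUP BY below into a MapReduce job:

CREATE TABLE page_view (pageid INT, userid BIGINT, view_time STRING);

SELECT pageid, COUNT(*) AS views
FROM page_view
GROUP BY pageid;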
Hadoop Ecosystem: What Hive can provide
Hadoop Ecosystem: Component
• ZooKeeper: A distributed, highly available coordination service. ZooKeeper
provides primitives such as distributed locks that can be used for building
distributed applications, and it manages and coordinates a large cluster of
machines.
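A minimal Java sketch of one such primitive (the server address and lock path are assumptions): an ephemeral znode acts as a crude distributed lock, because ZooKeeper removes it automatically when the client session dies.

import org.apache.zookeeper.*;

public class ZkLockSketch {
    public static void main(String[] args) throws Exception {
        // connect to a (hypothetical) ZooKeeper server
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });
        // the ephemeral node vanishes with the session, releasing the "lock"
        zk.create("/app-lock", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        System.out.println("lock held; do critical work here");
        zk.close();  // session ends, znode is removed, lock released
    }
}

A second client calling create() on the same path gets a NodeExistsException, which is its signal to wait and retry.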
Hadoop Ecosystem: Component
• Sqoop: A tool for efficiently moving data between relational databases and
HDFS. Sqoop imports data from external sources into related Hadoop
ecosystem components like HDFS, HBase, or Hive. It also exports data from
Hadoop to other external sources. Sqoop works with relational databases such
as Teradata, Netezza, Oracle, and MySQL.
Hadoop Ecosystem: Component
• Capabilities of Sqoop include:
• Importing individual tables or entire databases to files in HDFS
(see the command sketch after this list)
• Generating Java classes to interact with your imported data
• Importing from SQL databases straight into your Hive data warehouse
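A hedged sketch of such an import from the command line (connection string, credentials, and paths are hypothetical):

# import one MySQL table into HDFS using 2 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username dbuser -P \
  --table Test \
  --target-dir /user/hadoop/Test \
  --num-mappers 2

Using --hive-import instead of --target-dir would load the table straight into the Hive warehouse.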
Hadoop Ecosystem: Component
• Sqoop: a source table in the database, and the comma-separated records it becomes in HDFS after an import:

CREATE TABLE Test(
id INT NOT NULL PRIMARY KEY,
msg VARCHAR(32),
bar INT);

0, this is a test,42
1, some more data,100
Hadoop Ecosystem: Sqoop Export
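A hedged sketch of the reverse direction, pushing HDFS files back into a relational table (the connection string and paths are hypothetical):

# export comma-separated HDFS files into an existing MySQL table
sqoop export \
  --connect jdbc:mysql://dbhost/shop \
  --username dbuser -P \
  --table Test \
  --export-dir /user/hadoop/Test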
Hadoop Ecosystem: Flume
• Flume:
• Flume efficiently collects, aggregates, and moves large amounts of data from their origin and
sends them back to HDFS. It is a fault-tolerant and reliable mechanism.
• This Hadoop ecosystem component allows data to flow from the source into the Hadoop
environment.
• It uses a simple, extensible data model that allows for online analytic applications. Using
Flume, we can get data from multiple servers into Hadoop immediately.
Hadoop Ecosystem: Flume Features
• Features of Flume:
• Flume has a flexible design based upon streaming data flows.
• It is fault tolerant and robust, with multiple failover and recovery
mechanisms. Flume offers different levels of reliability, including
'best-effort' and 'end-to-end' delivery.
• Flume carries data between sources and sinks. This gathering of data can
either be scheduled or event-driven. Flume has its own query processing
engine, which makes it easy to transform each new batch of data before it
is moved to the intended sink.
• Flume can also be used to transport event data including, but not limited
to, network traffic data, data generated by social media websites, and email
messages.
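A hedged sketch of a single-agent Flume configuration (the agent, source, channel, and sink names are hypothetical), tailing a log file into HDFS:

# source: follow a log file; channel: buffer in memory; sink: write to HDFS
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1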
Hadoop Ecosystem: Component
• Ambari:
• Ambari, another Hadoop ecosystem component, is a management platform for provisioning,
managing, monitoring, and securing Apache Hadoop clusters.
• Features of Ambari:
• Simplified installation, configuration, and management
• Centralized security setup
• Highly extensible and customizable
• Full visibility into cluster health
Hadoop Ecosystem: Component
Difference Between Apache Ambari and Apache ZooKeeper
• Ambari administers the cluster itself: a management platform for provisioning, managing,
monitoring, and securing a Hadoop cluster.
• ZooKeeper coordinates the distributed applications running on the cluster: a coordination
service providing primitives such as distributed locks.
Hadoop Ecosystem: Component
• Oozie
• It is a workflow scheduler system for managing Apache Hadoop jobs.
• Oozie combines multiple jobs sequentially into one logical unit of work.
• The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its
architectural center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.
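A hedged sketch of an Oozie workflow definition (the names and properties are hypothetical): a single MapReduce action with success/failure transitions, illustrating how jobs chain into one logical unit of work.

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="mr-step"/>
  <action name="mr-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- mapper/reducer configuration would go here -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce step failed</message>
  </kill>
  <end name="end"/>
</workflow-app>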
Hadoop Ecosystem: Component
• Working of Oozie
HIVE
Query Language for Hadoop
History of Hive

Hive QL – Group By
• SQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in Map Reduce

pv_users (pageid, age): (1, 25), (2, 25), (1, 32), (2, 25)

Map: emit key = <pageid, age>, value = 1
→ <1,25>:1, <2,25>:1, <1,32>:1, <2,25>:1

Shuffle and Sort: group the pairs by key
→ <1,25>:{1}, <1,32>:{1}, <2,25>:{1, 1}

Reduce: sum the values for each key
→ (1, 25, 1), (1, 32, 1), (2, 25, 2)
Hive QL – Group By with Distinct

page_view (pageid, userid, time):
(1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14), (2, 111, 9:08:20)

• SQL
• SELECT pageid, COUNT(DISTINCT userid)
• FROM page_view GROUP BY pageid

Result (pageid, count_distinct_userid): (1, 2), (2, 1)
Hive QL – Group By with Distinct in Map Reduce
[diagram: map/shuffle/reduce plan over the page_view table]
Drop
drop database office cascade; -- drops the tables in the database when it is not empty
Pig
Why Pig?
• pig --help lists the available options; running pig with no arguments starts the interactive grunt> shell.
Demo
For each tuple in 'GroupByCountry', generate the resulting string of the form
"Name of Country: No. of products sold" (see the Pig Latin sketch below).
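A hedged Pig Latin sketch of that step (it assumes 'GroupByCountry' came from a GROUP ... BY country, so $0 is the country and $1 is the bag of grouped tuples):

-- build the string "<Name of Country>:<No. of products sold>" per group
CountryCount = FOREACH GroupByCountry
               GENERATE CONCAT((chararray)$0,
                               CONCAT(':', (chararray)COUNT($1)));
DUMP CountryCount;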
Overview of the Apache Spark Ecosystem
Introduction to Apache Spark
• Apache Spark is a general-purpose cluster computing framework.
• It was introduced by UC Berkeley's AMP Lab in 2009 as a distributed computing system and
has been maintained by the Apache Software Foundation since 2013.
• Spark is a lightning-fast computing engine designed for faster processing of large volumes of data.
• Spark supports batch applications, iterative processing, interactive queries, and streaming
data, reducing the burden of managing separate tools for each workload.
• The main feature of Spark is its in-memory processing, which makes computation faster.
• It has its own cluster management system, and it uses Hadoop for storage purposes.
Introduction to Apache Spark
• One of the main features Spark offers for speed is the ability to run
computations in memory, but the system is also more efficient than
MapReduce for complex applications running on disk.
• Apache Spark achieves high performance for both batch and streaming
data, using a state-of-the-art DAG scheduler, a query optimizer, and a
physical execution engine.
Apache Spark Ecosystem (The Spark Stack)
Apache Spark Ecosystem: Component
1. Spark Core:
• Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems,
and more.
• Spark Core is also home to the API that defines resilient distributed datasets (RDDs),
which are Spark’s main programming abstraction.
• RDDs represent a collection of items distributed across many compute nodes that can
be manipulated in parallel.
• Spark Core provides many APIs for building and manipulating these collections.
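A minimal RDD sketch in Java (Spark 2.x API; the input and output paths are hypothetical): word count expressed as parallel transformations on an RDD of lines.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("WordCount"));
        JavaRDD<String> lines = sc.textFile("hdfs:///input/words.txt");
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(l -> Arrays.asList(l.split(" ")).iterator()) // split into words
            .mapToPair(w -> new Tuple2<>(w, 1))                   // (word, 1) pairs
            .reduceByKey(Integer::sum);                           // totals per word
        counts.saveAsTextFile("hdfs:///output/word-counts");
        sc.stop();
    }
}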
Apache Spark Ecosystem: Component
2. Spark SQL
• Spark SQL is Spark’s package for working with structured data. It allows querying data via
SQL as well as the Apache Hive variant of SQL—called the Hive Query Language
(HQL)—and it supports many sources of data, including Hive tables, Parquet, and JSON.
• Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL
queries with the programmatic data manipulations supported by RDDs in Python, Java, and
Scala, all within a single application, thus combining SQL with complex analytics.
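A hedged Spark SQL sketch in Java (Spark 2.x API; the JSON path and field names are hypothetical), mixing a programmatic load with a SQL query in one application:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
    public static void main(String[] args) {
        SparkSession spark =
            SparkSession.builder().appName("SqlSketch").getOrCreate();
        Dataset<Row> people = spark.read().json("hdfs:///data/people.json");
        people.createOrReplaceTempView("people");  // make it SQL-queryable
        spark.sql("SELECT name, age FROM people WHERE age > 21").show();
        spark.stop();
    }
}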
3. Spark Streaming
• Spark Streaming is a Spark component that enables processing of live streams of data.
Examples of data streams include logfiles generated by production web servers, or queues
of messages containing status updates posted by users of a web service.
• Streaming provides an API for manipulating data streams that closely matches the Spark
Core’s RDD API, making it easy for programmers to learn the project and move between
applications that manipulate data stored in memory, on disk, or arriving in real time.
• Underneath its API, Spark Streaming was designed to provide the same degree of fault
tolerance, throughput, and scalability as Spark Core.
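A hedged Spark Streaming sketch (the host and port are hypothetical): count the lines arriving on a socket in 10-second batches, using the DStream API that mirrors the RDD API above.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;

public class StreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        JavaStreamingContext ssc = new JavaStreamingContext(
            new SparkConf().setAppName("StreamingSketch"),
            Durations.seconds(10));                       // batch interval
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
        lines.count().print();      // per-batch line count
        ssc.start();                // begin receiving
        ssc.awaitTermination();
    }
}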
Features of Apache Spark
• Speed: Though Spark is based on the MapReduce model, it is 10 times faster than Hadoop when
it comes to big data processing.
• Usability: Spark supports multiple languages, making it easier to work with.
• Sophisticated Analytics: Spark provides complex algorithms for Big Data analytics
and machine learning.
• In-Memory Processing: Unlike Hadoop, Spark doesn't move data in and out of the
cluster.
• Lazy Evaluation: Spark waits until the code is complete and then processes the
instructions in the most efficient way possible.
• Fault Tolerance: Spark has better fault tolerance than Hadoop; both storage and
computation can tolerate failure by backing up to another node.
Comparison of different Frameworks
Case Study:
• Google Analytics:
• Cloud Dataflow: runs faster and scales better than pretty much any other system.
• Cloud Save: an API that enables an application to save an individual user's
data in the cloud or elsewhere and use it without requiring any
server-side coding.
• Cloud Debugging: makes it easier to sift through lines of code deployed across many
servers in the cloud to identify software bugs.
Case Study:
• Google Analytics:
• Cloud Tracing: provides latency statistics across different groups and provides
analysis reports.
• Cloud Monitoring: an intelligent monitoring system. The feature monitors cloud
infrastructure resources, such as disks and virtual machines, as well as service
levels for Google's services and for more than a dozen non-Google open source packages.
Case Study:
• Twitter Analytics: Capturing and Analyzing Tweets
https://blogs.ischool.berkeley.edu/i290-abdt-s12/
Demo
• https://www.dezyre.com/hadoop-tutorial/hadoop-sqoop-tutorial-data-aggregation#:~:text=So%2C%20just%20using%20simple%20Sqoop,the%20data%20between%20the%20mappers.
• https://www.edureka.co/blog/apache-sqoop-tutorial/
Active Learning
• What is Avro?
A - Avro is a Java serialization library.
B - Avro is a Java compression library.
C - Avro is a Java library that creates splittable files.
• ZooKeeper ensures that…
A - All the NameNodes are actively serving client requests.
B - Only one NameNode is actively serving client requests.
C - A failover is triggered when any of the DataNodes fails.
• Which of these provides a stream processing system used in the Hadoop
ecosystem?
• A - Hive
• B - Pig
• C - Spark
Introduction to MapReduce
• Traditional techniques for working with information simply don't scale to
Big Data: they're too costly, time-consuming, and complex, which is where
MapReduce comes in.
• MapReduce is:
• a programming framework created and successfully deployed by Google
• based on the divide-and-conquer method (and lots of commodity servers)
• a way to break complex Big Data problems down into small units of work
and process them in parallel.
• MapReduce is a programming framework that allows us to perform
distributed and parallel processing on large data sets in a distributed
environment.
• E.g., processing 100 TB of data:
• On 1 node → scanning @ 50 MB/s ≈ 23 days
• On a 1000-node cluster → scanning @ 50 MB/s ≈ 33 min
Introduction to MapReduce: Use cases
• At Google
• Index construction for Google Search (replaced in 2010 by Caffeine)
• Article clustering for Google News
• Statistical machine translation
• At Yahoo!
• “Web map” powering Yahoo! Search
• Spam detection for Yahoo! Mail
• At Facebook
• Data mining, Web log processing
• SearchBox (with Cassandra)
• Facebook Messages (with HBase)
• Ad optimization
• Spam detection
Introduction to MapReduce
• Problem: How to compute the PageRank for a crawled set of
websites on a cluster of machines?
• Main Challenges:
How to break up a large problem into smaller tasks that can be executed
in parallel?
• How to assign tasks to machines?
• How to partition and distribute data?
• How to share intermediate results?
• How to coordinate synchronization, scheduling, and fault tolerance?
• Solution: MapReduce
• Algorithms that can be expressed as (or mapped to) a sequence of Map()
and Reduce() functions are automatically parallelized by the framework.
MapReduce Workflow
• Map Phase
• Raw data is read and converted to key/value pairs
• The Map() function is applied to each pair
• Shuffle and Sort Phase
• All key/value pairs are sorted and grouped by their keys
• Reduce Phase
• All values with the same key are processed within the same reduce() function
MapReduce Programming Model
• Every MapReduce program must specify a Mapper and typically a Reducer
• The Mapper has a map() function that transforms input (key, value) pairs
into any number of intermediate (out_key, intermediate_value) pairs
• The Reducer has a reduce() function that transforms intermediate (out_key,
list(intermediate_value)) aggregates into any number of output (value’) pairs
MapReduce Execution Details
MapReduce Data flow
Word Count Problem using MapReduce
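A hedged Java sketch of the classic word count on the standard Hadoop MapReduce API (the input and output paths are taken from the command line):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: (offset, line) -> (word, 1) for every word in the line
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                if (tok.isEmpty()) continue;
                word.set(tok);
                ctx.write(word, ONE);          // intermediate (word, 1) pair
            }
        }
    }

    // Reduce phase: (word, [1, 1, ...]) -> (word, total)
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // local aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}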
Comparison of MapReduce and RDBMS
Apache Spark Ecosystem: Component
• The Spark project contains multiple closely integrated components. At its core, Spark is a
"computational engine" that is responsible for scheduling, distributing, and monitoring
applications consisting of many computational tasks across many worker machines, or a
computing cluster.
4. MLlib (Machine Learning Library):
• MLlib provides multiple types of machine
learning algorithms, including classification,
regression, clustering, and collaborative
filtering, as well as supporting functionality
such as model evaluation and data import.
• It also provides some lower-level ML
primitives, including a generic gradient descent
optimization algorithm.
• All of these methods are designed to scale out
across a cluster.
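A hedged MLlib sketch in Java (toy in-memory points; a real job would load vectors from HDFS): k-means clustering with the RDD-based API, which scales out because the points are an RDD.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("KMeansSketch"));
        JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
            Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
            Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)));
        // train with k = 2 clusters and 20 iterations
        KMeansModel model = KMeans.train(points.rdd(), 2, 20);
        System.out.println(Arrays.toString(model.clusterCenters()));
        sc.stop();
    }
}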
Hadoop High Level Architecture
Hadoop Cluster
• A small Hadoop cluster includes a single master and
multiple worker nodes
Master node:
Data Node
Job Tracker
Task Tracker
Name Node
Slave node:
Data Node
Task Tracker
Active Learning
Hadoop                                   SQL
Schema on Read                           Schema on Write
Data stored in compressed file form      Data stored in logical, interrelated tables