BDT_Unit04

CET4001B Big Data Technologies

5/15/2024 Big Data Analytics Lab 1


Unit-IV

Technologies and tools for Big Data:

• Importing relational data with Sqoop; ingesting stream data with Flume.
• Basic concepts of Pig; architecture of Pig; what is Hive; architecture of Hive; Hive commands.
• ZooKeeper; Apache Spark ecosystem.
• Overview of the Apache Spark ecosystem; Spark architecture.

UNIT III
Hadoop Ecosystem
Hadoop Ecosystem: Component
• Avro: An open source project that provides data serialization and data exchange
services for Hadoop. It is a serialization system for efficient,
cross-language RPC and persistent data storage.
• Features provided by Avro:
• Rich data structures.
• Remote procedure call.
• Compact, fast, binary data format.
• Container file, to store persistent data.

• MapReduce: A distributed data processing model and execution environment


that runs on large clusters of commodity machines.

• HDFS: A distributed file system that runs on large clusters of commodity


machines. HDFS supports write-once-read-many semantics on files.

UNIT III
5
Hadoop Ecosystem: Component
Pig:
• A data flow language and execution environment for exploring and processing very large datasets. Pig runs
on HDFS and MapReduce clusters.
• Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in
HDFS.
• As a component of the Hadoop ecosystem, Pig uses the Pig Latin language, which plays a role similar to SQL.
• It loads the data, applies the required filters, and dumps the data in the required format. To execute
programs, Pig requires a Java runtime environment.

UNIT III
6
Hadoop Ecosystem: Component
Hive:
• A distributed data warehouse. Hive manages data stored in HDFS and
provides a query language based on SQL (and which is translated by the
runtime engine to MapReduce jobs) for querying the data.
• Hive provides three main functions: data summarization, query, and analysis.

UNIT III
7
Hadoop Ecosystem: What Hive can provide

UNIT III
8
Hadoop Ecosystem: Component
• ZooKeeper: A distributed, highly available coordination service. ZooKeeper
provides primitives such as distributed locks that can be used for building
distributed applications. Zookeeper manages and coordinates a large cluster of
machines.
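• As a quick illustration, the ZooKeeper command-line client can create and inspect znodes, the primitives on which distributed locks and other coordination recipes are built. A minimal sketch, assuming a ZooKeeper server is running on localhost:2181:

$ zkCli.sh -server localhost:2181     # start the CLI shipped with the ZooKeeper distribution
# the commands below are typed inside the zkCli shell
create /app1 "config-v1"              # create a znode holding a small piece of data
get /app1                             # read the data back
ls /                                  # list children of the root znode
delete /app1                          # remove the znode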

UNIT III
9
Hadoop Ecosystem: Component
• Sqoop: A tool for efficiently moving data between relational databases and
HDFS. Sqoop imports data from external sources into Hadoop ecosystem
components such as HDFS, HBase, or Hive, and also exports data from
Hadoop back to external sources. Sqoop works with relational databases such
as Teradata, Netezza, Oracle, and MySQL.

UNIT III
10
Hadoop Ecosystem: Component
• Capabilities of Sqoop include:
• Importing individual tables or entire databases to files in HDFS
• Generating Java classes to interact with the imported data
• Importing from SQL databases straight into the Hive data warehouse
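• A minimal import sketch (the JDBC URL, credentials, and database name are illustrative assumptions):

$ sqoop import \
    --connect jdbc:mysql://dbserver/company \
    --username dbuser --password dbpass \
    --table emp \
    --target-dir /emp \
    --num-mappers 1
# The imported rows land in HDFS as delimited part files (e.g. /emp/part-m-00000),
# which the following slides inspect with hadoop fs -cat.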

UNIT III
11
Hadoop Ecosystem: Component
• Sqoop:
CREATE TABLE Test(
id INT NOT NULL PRIMARY KEY,
msg VARCHAR(32),
bar INT);

Consider also a dataset in HDFS containing records like these:

0, this is a test,42
1, some more data,100

Running sqoop-export --table Test --update-key id --export-dir /path/to/data --connect … will

run an export job that executes SQL statements based on the data like so:

UPDATE Test SET msg='this is a test', bar=42 WHERE id=0;

UPDATE Test SET msg='some more data', bar=100 WHERE id=1;


UNIT III
12
Hadoop Ecosystem: Sqoop Import

• $ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*

• E.g. emp table data will be stored in HDFS as:


1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP

UNIT III
13
Hadoop Ecosystem: Sqoop Export

• $ sqoop export (generic-args) (export-args)


• $ sqoop-export (generic-args) (export-args)
E.g. emp table data to be exported, stored in HDFS as:
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
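
A concrete export sketch for this data (the JDBC URL, credentials, and target database are illustrative assumptions; the target table must already exist in the database):

$ sqoop export \
    --connect jdbc:mysql://dbserver/company \
    --username dbuser --password dbpass \
    --table emp \
    --export-dir /emp
# Each HDFS record becomes an INSERT into the emp table;
# add --update-key id to turn the rows into UPDATE statements instead (as on the earlier slide).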

UNIT III
14
Hadoop Ecosystem: Flume
• Flume:
• Efficiently collects, aggregates, and moves large amounts of data from its origin and sends
it to HDFS. It is a fault-tolerant and reliable mechanism.
• This Hadoop ecosystem component allows data to flow from the source into the Hadoop
environment.
• It uses a simple, extensible data model that allows for online analytic applications.
• Using Flume, we can get data from multiple servers into Hadoop immediately.

UNIT III
16
Hadoop Ecosystem: Flume Features

• Features of Flume:
• Flume has a flexible design based upon streaming data flows.
• It is fault tolerant and robust, with multiple failover and recovery
mechanisms. Flume offers different levels of reliability, including
'best-effort' and 'end-to-end' delivery.
• Flume carries data between sources and sinks. This gathering of data can
either be scheduled or event-driven. Flume has its own query processing
engine which makes it easy to transform each new batch of data before it
is moved to the intended sink.
• Flume can also be used to transport event data including but not limited
to network traffic data, data generated by social media websites and email
messages.
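• A minimal single-agent sketch (agent name, netcat port, and HDFS path are illustrative assumptions): events arrive on a netcat source, are buffered in a memory channel, and are written to an HDFS sink.

$ cat > netcat-to-hdfs.conf <<'EOF'
# one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# netcat source listening on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# in-memory channel between source and sink
a1.channels.c1.type = memory
# HDFS sink writing events under /flume/events
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
# wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF
$ flume-ng agent --conf conf --conf-file netcat-to-hdfs.conf --name a1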

UNIT III
17
Hadoop Ecosystem: Component
• Ambari:
• Another Hadoop ecosystem component; a management platform for provisioning,
managing, monitoring, and securing Apache Hadoop clusters.
• Features of Ambari:
• Simplified installation, configuration, and management
• Centralized security setup
• Highly extensible and customizable
• Full visibility into cluster health

UNIT III
18
Hadoop Ecosystem: Component
Difference Between Apache Ambari and Apache Zookeeper

UNIT III
19
Hadoop Ecosystem: Component
• Oozie
• A workflow scheduler system for managing Apache Hadoop jobs.
• Oozie combines multiple jobs sequentially into one logical unit of work.
• The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its architectural
center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.

UNIT III
20
Hadoop Ecosystem: Component
• Working of Oozie

UNIT III
21
HIVE
Query language for Hadoop

Big Data Analytics Lab

5/15/2024

22
History Hive

5/15/2024 Big Data Analytics Lab 23


Why Hive?

5/15/2024 Big Data Analytics Lab 24


What is Hive

5/15/2024 Big Data Analytics Lab 25


Architecture of Hive

5/15/2024 Big Data Analytics Lab 26


Architecture of Hive

5/15/2024 Big Data Analytics Lab 27


Data Flow in Hive

5/15/2024 Big Data Analytics Lab 28


Hive Data Modeling

5/15/2024 Big Data Analytics Lab 29


Hive Data Types

5/15/2024 Big Data Analytics Lab 30


Different Modes of Hive

5/15/2024 Big Data Analytics Lab 31


Hive Vs RDBMS

5/15/2024 Big Data Analytics Lab 32


Hive QL – Join
page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14

user:
userid  age  gender
111     25   female
222     32   male

pv_users (page_view x user):
pageid  age
1       25
2       25
1       32

• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in Map Reduce
• Map: each page_view row (pageid, userid, time) is emitted as key = userid, value = <1, pageid>;
each user row (userid, age, gender) is emitted as key = userid, value = <2, age>.
page_view → (111, <1,1>), (111, <1,2>), (222, <1,1>)
user → (111, <2,25>), (222, <2,32>)
• Shuffle and Sort: all values with the same userid reach the same reducer.
111 → <1,1>, <1,2>, <2,25>
222 → <1,1>, <2,32>
• Reduce: for each userid, the pageids (tag 1) are joined with the age (tag 2) to produce the pv_users rows (1,25), (2,25), (1,32).
Hive QL – Group By
pv_users:
pageid  age
1       25
2       25
1       32
2       25

pageid_age_sum:
pageid  age  count
1       25   1
2       25   2
1       32   1

• SQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in Map Reduce
• Map: each pv_users row is emitted as key = <pageid, age>, value = 1.
(1,25) → <1,25>:1   (2,25) → <2,25>:1   (1,32) → <1,32>:1   (2,25) → <2,25>:1
• Shuffle and Sort: identical keys are grouped on the same reducer.
• Reduce: the values for each key are summed to give the count per (pageid, age).
<1,25> → 1   <1,32> → 1   <2,25> → 2
Hive QL – Group By with Distinct
page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
2       111     9:08:20

result:
pageid  count_distinct_userid
1       2
2       1

• SQL:
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid;
Hive QL – Group By with Distinct in Map Reduce
• Map: each page_view row is emitted with the composite key <pageid, userid>: <1,111>, <2,111>, <1,222>, <2,111>.
• Shuffle and Sort: rows are partitioned by pageid but sorted by <pageid, userid>, so repeated userids for a page arrive next to each other. The shuffle key is a prefix of the sort key.
• Reduce: distinct userids are counted per pageid, giving pageid 1 → count 2 and pageid 2 → count 1.
Hive QL: Order By
page_view rows: (2,111,9:08:13), (1,111,9:08:01), (2,111,9:08:20), (1,222,9:08:14)
• Map: each row is emitted with key = time (the ORDER BY column).
• Shuffle: rows are distributed randomly across reducers.
• Sort and Reduce: each reducer sorts its rows by the key and emits them in time order, e.g.
(1,111,9:08:01), (2,111,9:08:13) and (1,222,9:08:14), (2,111,9:08:20)
Features of Hive

5/15/2024 Big Data Analytics Lab 40


Hive Demo

5/15/2024 Big Data Analytics Lab 41


Hive Demo
create database office;          -- create a database called office
show databases;                  -- show the created databases
drop database office;            -- drop the office database (works here because it is empty)
drop database office cascade;    -- drop the database together with its tables when it is not empty
create database office;          -- recreate the database office
use office;                      -- set office as the default database
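
Continuing the demo, a table can be created in the office database, loaded from a local CSV file, and queried. A minimal sketch run from the operating-system shell (the table layout and file path are illustrative assumptions):

$ hive -e "
USE office;
CREATE TABLE employee (id INT, name STRING, salary DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/user/employee.csv' INTO TABLE employee;
SELECT name, salary FROM employee WHERE salary > 40000;"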

5/15/2024 Big Data Analytics Lab 42


5/15/2024 Big Data Analytics Lab 43
Hive Demo

5/15/2024 Big Data Analytics Lab 44


Hive Demo

5/15/2024 Big Data Analytics Lab 45


Hive Demo

5/15/2024 Big Data Analytics Lab 46


Hive Demo

5/15/2024 Big Data Analytics Lab 47


Hive Demo

5/15/2024 Big Data Analytics Lab 48


Big Data Analytics Lab

Pig

5/15/2024

49
Why Pig?

5/15/2024 Big Data Analytics Lab 50


Why Pig?

5/15/2024 Big Data Analytics Lab 51


Why Pig?

5/15/2024 Big Data Analytics Lab 52


What is Pig

5/15/2024 Big Data Analytics Lab 53


Map Reduce Vs Hive Vs Pig

5/15/2024 Big Data Analytics Lab 54


Map Reduce Vs Hive Vs Pig

5/15/2024 Big Data Analytics Lab 55


Components of Pig

5/15/2024 Big Data Analytics Lab 56


Pig Architecture

5/15/2024 Big Data Analytics Lab 57


Working of Pig

5/15/2024 Big Data Analytics Lab 58


Pig Latin data Model

5/15/2024 Big Data Analytics Lab 59


Pig Latin data Model

5/15/2024 Big Data Analytics Lab 60


Pig Execution Modes

5/15/2024 Big Data Analytics Lab 61


Pig Execution Modes

5/15/2024 Big Data Analytics Lab 62


Use Case-Twitter

5/15/2024 Big Data Analytics Lab 63


Use Case-Twitter

5/15/2024 Big Data Analytics Lab 64


Use Case-Twitter

5/15/2024 Big Data Analytics Lab 65


Features of Pig

5/15/2024 Big Data Analytics Lab 66


Demo
• hdfs dfs -mkdir /input
• hdfs dfs -ls /
• hdfs dfs -copyFromLocal SalesJan2009.csv /input/

5/15/2024 Big Data Analytics Lab 67


Pig Demo
salesTable = LOAD '/input/SalesJan2009.csv' USING PigStorage(',') AS
    (Transaction_date:chararray, Product:chararray, Price:chararray,
     Payment_Type:chararray, Name:chararray, City:chararray, State:chararray,
     Country:chararray, Account_Created:chararray, Last_Login:chararray,
     Latitude:chararray, Longitude:chararray);

pig -help
pig               # run in default (mapreduce) mode
pig -x local      # local mode
pig -x mapreduce  # mapreduce mode
pig               # launches the interactive Grunt shell (grunt>)
5/15/2024 Big Data Analytics Lab 68
Demo

For each tuple in 'GroupByCountry', generate a string of the form "Name of Country: No. of products sold", as sketched below.
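
A sketch of the full script behind this step (relation and output names follow the slide, paths follow the earlier demo commands, and the exact string formatting is an assumption), written to a file and run in MapReduce mode:

$ cat > sales_by_country.pig <<'EOF'
salesTable = LOAD '/input/SalesJan2009.csv' USING PigStorage(',') AS
    (Transaction_date:chararray, Product:chararray, Price:chararray,
     Payment_Type:chararray, Name:chararray, City:chararray, State:chararray,
     Country:chararray, Account_Created:chararray, Last_Login:chararray,
     Latitude:chararray, Longitude:chararray);
GroupByCountry = GROUP salesTable BY Country;
CountByCountry = FOREACH GroupByCountry
    GENERATE CONCAT(group, CONCAT(': ', (chararray) COUNT(salesTable)));
STORE CountByCountry INTO 'pig_output_sales' USING PigStorage('\t');
EOF
$ pig -x mapreduce sales_by_country.pig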

hdfs dfs -cat pig_output_sales/part-r-00000

5/15/2024 Big Data Analytics Lab 69


Example Pig Commands
-- Load movie ratings data
ratings = LOAD '/path/to/movie_ratings.csv' USING PigStorage(',') AS (user_id:int, movie_id:int, rating:int);

-- Group ratings by movie_id and calculate the average rating for each movie
avg_ratings = FOREACH (GROUP ratings BY movie_id)
    GENERATE group AS movie_id, AVG(ratings.rating) AS avg_rating;

-- Order movies by average rating
ordered_ratings = ORDER avg_ratings BY avg_rating DESC;

-- Store recommendations into HDFS
STORE ordered_ratings INTO '/path/to/movie_recommendations' USING PigStorage(',');

5/15/2024 Big Data Analytics Lab 70


Example Pig Commands
-- Load customer data
customer_data = LOAD '/path/to/customer_data.csv' USING PigStorage(',') AS (customer_id:int, age:int,
    income:double, spending_score:int);

-- Segment customers based on spending score
segmented_customers = FOREACH customer_data GENERATE
    customer_id, age, income, spending_score,
    (spending_score >= 80 ? 'High Spenders'
        : (spending_score >= 50 ? 'Medium Spenders' : 'Low Spenders')) AS segment;

-- Store segmented customers into HDFS
STORE segmented_customers INTO '/path/to/customer_segments' USING PigStorage(',');

5/15/2024 Big Data Analytics Lab 71


Overview of Apache Spark Ecosystem

UNIT III
Introduction to Apache Spark
• Apache Spark is a general-purpose cluster computing framework.

• It was introduced by UC Berkeley’s AMP Lab in 2009 as a distributed computing system and has
been maintained by the Apache Software Foundation since 2013.

• Spark is a lightning-fast computing engine designed for faster processing of large volumes of data.

• It is based on Hadoop’s MapReduce model.

• Spark supports batch applications, iterative processing, interactive queries, and streaming
data, reducing the burden of managing separate tools for each workload.

• The main feature of Spark is its in-memory processing, which makes computation faster.

• It has its own cluster management system and can use Hadoop (HDFS) for storage.

UNIT III
76
Introduction to Apache Spark

• Spark extends the popular MapReduce model to efficiently support


more types of computations, including interactive queries and stream
processing.

• One of the main features Spark offers for speed is the ability to run
computations in memory, but the system is also more efficient than
MapReduce for complex applications running on disk.

• Apache Spark achieves high performance for both batch and streaming
data, using a state-of-the-art DAG scheduler, a query optimizer, and a
physical execution engine.

• Spark is designed to be highly accessible, offering simple APIs in


Python, Java, Scala, and SQL, and rich built-in libraries.
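
• These APIs are reached through the interactive shells and the spark-submit launcher that ship with Spark; a minimal sketch (the application file name and master URL are illustrative assumptions):

$ spark-shell        # interactive Scala shell
$ pyspark            # interactive Python shell
$ spark-sql          # interactive SQL shell
$ spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 2g \
    my_app.py        # submit a packaged application to the cluster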
UNIT III
77
Apache Spark Ecosystem (The Spark Stack )

• Spark powers a stack of libraries including,


• SQL and DataFrames
• Spark Streaming
• MLlib for machine learning
• GraphX for graph computation

UNIT III
78
Apache Spark Ecosystem (The Spark Stack )

UNIT III
79
Apache Spark Ecosystem: Component

• The Spark project contains multiple closely integrated components.


• At its core, Spark is a “computational engine” that is responsible for scheduling,
distributing, and monitoring applications consisting of many computational tasks
across many worker machines, or a computing cluster.

1. Spark Core:
• Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems,
and more.
• Spark Core is also home to the API that defines resilient distributed datasets (RDDs),
which are Spark’s main programming abstraction.
• RDDs represent a collection of items distributed across many compute nodes that can
be manipulated in parallel.
• Spark Core provides many APIs for building and manipulating these collections.

UNIT III
80
Apache Spark Ecosystem: Component

2. Spark SQL
• Spark SQL is Spark’s package for working with structured data. It allows querying data via
SQL as well as the Apache Hive variant of SQL—called the Hive Query Language
(HQL)—and it supports many sources of data, including Hive tables, Parquet, and JSON.
• Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL
queries with the programmatic data manipulations supported by RDDs in Python, Java, and
Scala, all within a single application, thus combining SQL with complex analytics.

3. Spark Streaming
• Spark Streaming is a Spark component that enables processing of live streams of data.
Examples of data streams include logfiles generated by production web servers, or queues
of messages containing status updates posted by users of a web service.
• Streaming provides an API for manipulating data streams that closely matches the Spark
Core’s RDD API, making it easy for programmers to learn the project and move between
applications that manipulate data stored in memory, on disk, or arriving in real time.
• Underneath its API, Spark Streaming was designed to provide the same degree of fault
tolerance, throughput, and scalability as Spark Core.
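
• As a quick illustration, the network word count example bundled with the Spark distribution counts words arriving on a socket; a minimal sketch, assuming the commands are run from the Spark installation directory:

$ nc -lk 9999                       # terminal 1: type lines of text into a local socket
$ ./bin/spark-submit \
    examples/src/main/python/streaming/network_wordcount.py localhost 9999   # terminal 2: run the streaming job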

UNIT III
81
Apache Spark Ecosystem: Component

4. MLlib (Machine Learning Library):


• MLlib provides multiple types of machine learning algorithms, including
classification, regression, clustering, and collaborative filtering, as well as supporting
functionality such as model evaluation and data import.
• It also provides some lower-level ML primitives, including a generic gradient descent
optimization algorithm.
• All of these methods are designed to scale out across a cluster.

5. GraphX (Graph Computation):


• GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and
performing graph-parallel computations
• GraphX also provides various operators for manipulating graphs (e.g., subgraph and
mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle
counting).

UNIT III
82
Features of Apache Spark
• Speed: Although Spark follows the MapReduce model, it is typically about 10 times faster than
Hadoop MapReduce for big data processing on disk, and far faster in memory.

• Usability: Spark supports multiple languages, making it easier to work with.

• Sophisticated Analytics: Spark provides rich libraries and complex algorithms for Big Data
Analytics and Machine Learning.

• In-Memory Processing: Unlike Hadoop MapReduce, Spark keeps intermediate data in memory
rather than writing it to disk between stages.

• Lazy Evaluation: Spark delays executing transformations until an action needs a result, so it can
plan the most efficient execution.

• Fault Tolerance: Spark has improved fault tolerance compared to Hadoop. Both storage and
computation can tolerate failure by recovering on another node.
UNIT III
83
Comparison of different Framework

UNIT III
Case Study:
• Google Analytics:
• Cloud Dataflow
• Designed to run faster and scale better than most comparable systems
• Cloud Save:
• It is an API that enables an application to save an individual user’s
data in the cloud or elsewhere and use it without requiring any
server-side coding.
• Cloud Debugging
• makes it easier to sift through lines of code deployed across many
servers in the cloud to identify software bugs.

UNIT III
85
Case Study:
• Google Analytics:
• Cloud Tracing
• It provides latency statistics across different groups and provides
analysis reports.
• Cloud Monitoring:
• It is an intelligent monitoring system. The feature monitors cloud
infrastructure resources, such as disks and virtual machines, as well
as service levels for Google’s services as well as more than a dozen
non-Google open source packages.

UNIT III
86
Case Study:
• Twitter Analytics: Capturing and Analyzing Tweets

https://blogs.ischool.berkeley.edu/i290-abdt-s12/
UNIT III
87
References
1. Hadoop® FOR DUMMIES, SPECIAL EDITION, by Robert D. Schneider, John
Wiley & Sons, Inc.
2. Hadoop: The Definitive Guide, Tom White, 3rd edition, O’Reilly (Yahoo! Press)
3. Learning Spark-Lightning-Fast Big Data Analysis, Holden Karau, Andy Konwinski,
Patrick Wendell, and Matei Zaharia, O’Reilly, Copyright © 2015
4. https://spark.apache.org/
5. https://hadoop.apache.org/
6. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.pdf
7. https://www.knowledgehut.com/blog/big-data/5-best-data-processing-frameworks
8. https://opensourceforu.com/2018/03/a-quick-comparison-of-the-five-best-big-data-fra
meworks/
9. https://data-flair.training/blogs/hadoop-ecosystem-components/
10. http://dbis.informatik.uni-freiburg.de/content/courses/SS12/Praktikum/Datenbanken%
20und%20Cloud%20Computing/slides/MapReduce.pdf
11. https://www.edureka.co/blog/mapreduce-tutorial/

UNIT III
88
References
1. Introduction to Analytics and Big Data-Hadoop, Thomas Rivera, Hitachi Data
System, SNIA Education,2014
2. Hadoop: A Framework for Data-Intensive Distributed Computing, CS561-Spring
2012 WPI, Mohamed Y. Eltabakh
3. https://mapr.com/products/apache-spark/
4. https://www.educba.com/what-is-apache-spark/
5. https://i0.wp.com/opensourceforu.com/wp-content/uploads/2018/03/Table-1-Compari
son-of-teh-best-Big-Data-frameworks-2.jpg?ssl=1
6. https://www.dezyre.com/article/overview-of-hbase-architecture-and-its-components/2
95

UNIT III
89
Demo
• https://www.dezyre.com/hadoop-tutorial/hadoop-sqoop-tutorial-data-aggregation
• https://www.edureka.co/blog/apache-sqoop-tutorial/

5/15/2024 Big Data Analytics Lab 90


Hadoop Ecosystem: Sqoop Import

• $ sqoop import (generic-args) (import-args)


• $ sqoop-import (generic-args) (import-args)

UNIT III
95
Active Learning
• What is Avro?
A - Avro is a Java serialization library.
B - Avro is a Java compression library.
C - Avro is a Java library that creates splittable files.
• Zookeeper ensures that…
A - All the namenodes are actively serving the client requests.
B - Only one namenode is actively serving the client requests.
C - A failover is triggered when any of the datanodes fails.
• Which of these provides a stream processing system used in the Hadoop ecosystem?
A - Hive
B - Pig
C - Spark

UNIT III
103
Introduction to MapReduce
• Traditional techniques for working with information simply don’t scale to
Big Data: they’re too costly, time-consuming, and complex, which is where
MapReduce comes in.
• MapReduce is:
• new programming framework — created and successfully deployed by
Google
• uses the divide-and-conquer method (and lots of commodity servers)
• break down complex Big Data problems into small units of work, and
then process them in parallel.
• MapReduce is a programming framework that allows us to perform
distributed and parallel processing on large data sets in a distributed
environment.
• Eg: processing 100 TB data
• On 1 node🡪 scanning @50 MB/s =23 days
• On 1000 node cluster🡪 scanning @50 MB/s =33 min

UNIT III
104
Introduction to MapReduce: Use cases
• At Google
• Index construction for Google Search (replaced in 2010 by Caffeine)
• Article clustering for Google News
• Statistical machine translation
• At Yahoo!
• “Web map” powering Yahoo! Search
• Spam detection for Yahoo! Mail
• At Facebook
• Data mining, Web log processing
• SearchBox (with Cassandra)
• Facebook Messages (with HBase)
• Ad optimization
• Spam detection

UNIT III
105
Introduction to MapReduce
• Problem: How to compute the PageRank for a crawled set of
websites on a cluster of machines?
• Main Challenges:
• How to break up a large problem into smaller tasks, that can be executed
in parallel?
• How to assign tasks to machines?
• How to partition and distribute data?
• How to share intermediate results?
• How to coordinate synchronization, scheduling, fault-tolerance?
Solution is MapReduce:
• Algorithms that can be expressed as (or mapped to) a sequence of Map()
and Reduce() functions are automatically parallelized by the framework.

UNIT III
106
MapReduce Workflow
• Map Phase
• Raw data is read and converted to key/value pairs
• The map() function is applied to each pair
• Shuffle and Sort Phase
• All key/value pairs are sorted and grouped by their keys
• Reduce Phase
• All values with the same key are processed within the same reduce() function

UNIT III
107
MapReduce Programming Model
• Every MapReduce program must specify a Mapper and typically a Reducer
• The Mapper has a map() function that transforms input (key, value) pairs
into any number of intermediate (out_key, intermediate_value) pairs
• The Reducer has a reduce() function that transforms intermediate (out_key,
list(intermediate_value)) aggregates into any number of output (value’) pairs
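
• Hadoop Streaming exposes this contract to any executable: the mapper reads input records on stdin and writes key/value lines to stdout, and the reducer receives the sorted lines. A minimal sketch using standard Unix tools (the jar path assumes a standard Hadoop 2/3 layout and /input already exists in HDFS):

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /input \
    -output /streaming_out \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
# /bin/cat passes records through unchanged; /usr/bin/wc aggregates the lines it receives,
# so each reducer's output file holds line/word/character counts for its partition.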

UNIT III
108
MapReduce Programming Model

UNIT III
109
MapReduce Execution Details

• The complete execution process (execution of Map and Reduce tasks,


both) is controlled by two types of entities called a
• JobTracker: acts as the master (responsible for complete execution of the submitted job)
• Multiple TaskTrackers: act as slaves, each performing part of the job
• For every job submitted for execution in the system, there is
one Jobtracker that resides on NameNode and there are multiple task
trackers which reside on DataNode.

UNIT III
110
MapReduce Data flow

UNIT III
Word Count Problem using
MapReduce

Problem: Given a document, we want to count the occurrences of each word.

Input: a document with words.
Output: a list of words and their occurrences, e.g.
“Infrastructure” 12, “the” 259, …
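
• The examples jar that ships with Hadoop contains exactly this job; a minimal sketch of running it (the jar path assumes a standard Hadoop 2/3 layout, and document.txt is a placeholder input file):

$ hdfs dfs -mkdir -p /wordcount/input
$ hdfs dfs -put document.txt /wordcount/input
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /wordcount/input /wordcount/output
$ hdfs dfs -cat /wordcount/output/part-r-00000   # each line: word <tab> count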

UNIT III
112
MapReduce Execution Details

UNIT III
113
Word Count Problem using MapReduce

UNIT III
114
Word Count Problem using
MapReduce

Map Reduce Real Time Use Cases


1. Merging Small Files into AVRO Files
2. Merging Small Files into SEQUENCE Files
3. Visits Per Hour
4. Measuring the Page Rank
5. Word Search in Huge Log Files (Word Count also)

UNIT III
115
Word Count Problem using
MapReduce

Map Reduce Real Time Use Case


1. Take a bunch of data
2. Perform some kind of transformation that converts every
datum to another kind of datum
3. Combine those new data into yet simpler data

UNIT III
116
Comparison of MapReduce and
RDBMS

UNIT III
117
Hadoop High Level Architecture

UNIT III
131
Hadoop Cluster
• A small Hadoop cluster includes a single master and multiple worker nodes.
Master node:
DataNode
JobTracker
TaskTracker
NameNode
Slave node:
DataNode
TaskTracker

UNIT III
132
Active Learning

Answer the following


1. Hadoop Distributed File System (HDFS) is renamed from NDFS

2. All the core projects of Hadoop were hosted by Yahoo!

3. Large amount of data needs large hardware

4. IT organizations can handle Information growth irrespective of use of


commodity s/w and h/w
5. Apache Storm is a stream processing framework

6. YARN stands for _______________


UNIT III
133
Difference Between Hadoop and SQL

Hadoop                               SQL
Schema on Read                       Schema on Write
Data stored in compressed file form  Data stored in logically interrelated tables

UNIT III
134
References
1. Introduction to Analytics and Big Data-Hadoop, Thomas Rivera, Hitachi Data
System, SNIA Education,2014
2. Hadoop: A Framework for Data-Intensive Distributed Computing, CS561-Spring
2012 WPI, Mohamed Y. Eltabakh
3. https://mapr.com/products/apache-spark/
4. https://www.educba.com/what-is-apache-spark/
5. https://i0.wp.com/opensourceforu.com/wp-content/uploads/2018/03/Table-1-Compari
son-of-teh-best-Big-Data-frameworks-2.jpg?ssl=1
6. https://www.dezyre.com/article/overview-of-hbase-architecture-and-its-components/2
95
7. https://www.guru99.com/create-your-first-hadoop-program.html
8. https://www.simplilearn.com/tutorials/hadoop-tutorial/hdfs?source=sl_frs_n
av_playlist_video_clicked
9. https://www.projectpro.io/hadoop-tutorial/hadoop-hdfs-commands

UNIT III
136
