BDT_Unit04
UNIT III: Hadoop Ecosystem
Hadoop Ecosystem: Component
• Avro: Avro is an open source project that provides data serialization and data
exchange services for Hadoop. It is a serialization system for efficient,
cross-language RPC and persistent data storage.
• Features provided by Avro:
• Rich data structures.
• Remote procedure calls.
• A compact, fast, binary data format.
• A container file format for storing persistent data.
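For illustration, an Avro schema is just a JSON document; the record name and fields below are a hypothetical sketch, not taken from the slides:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "age",  "type": ["null", "int"], "default": null}
  ]
}

Records written against this schema go into Avro's compact binary container file and can be read back from any language with an Avro binding.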
Hadoop Ecosystem: Component
Pig:
• A data flow language and execution environment for exploring/processing very large datasets. Pig runs
on HDFS and MapReduce clusters.
• Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS.
• As a component of the Hadoop ecosystem, Pig uses the Pig Latin language, which is similar in spirit to SQL.
• It loads the data, applies the required filters, and dumps the data in the required format. For program
execution, Pig requires a Java runtime environment.
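A minimal Pig Latin sketch of that load/filter/dump flow (the file path and field names are hypothetical):

-- load a comma-separated file from HDFS (hypothetical path and schema)
sales = LOAD '/data/sales.csv' USING PigStorage(',')
        AS (country:chararray, product:chararray, amount:int);
-- apply the required filter
big = FILTER sales BY amount > 100;
-- dump to the console, or: STORE big INTO '/output/big_sales';
DUMP big;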
Hadoop Ecosystem: Component
Hive:
• A distributed data warehouse. Hive manages data stored in HDFS and
provides a query language based on SQL (which the runtime engine translates
into MapReduce jobs) for querying the data.
• Hive performs three main functions: data summarization, query, and analysis.
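A small HiveQL sketch (hypothetical table and columns); Hive's runtime engine compiles the GROUP BY below into a MapReduce job:

CREATE TABLE page_view (pageid INT, userid BIGINT, view_time STRING);

SELECT pageid, COUNT(*) AS views
FROM page_view
GROUP BY pageid;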
Hadoop Ecosystem: What Hive can provide
Hadoop Ecosystem: Component
• ZooKeeper: A distributed, highly available coordination service. ZooKeeper
provides primitives such as distributed locks that can be used for building
distributed applications, and it manages and coordinates a large cluster of
machines.
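A minimal Java sketch of one such primitive (the server address and lock path are assumptions): an ephemeral znode acts as a crude distributed lock, because ZooKeeper removes it automatically when the client session dies.

import org.apache.zookeeper.*;

public class ZkLockSketch {
    public static void main(String[] args) throws Exception {
        // connect to a (hypothetical) ZooKeeper server
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });
        // the ephemeral node vanishes with the session, releasing the "lock"
        zk.create("/app-lock", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        System.out.println("lock held; do critical work here");
        zk.close();  // session ends, znode is removed, lock released
    }
}

A second client calling create() on the same path gets a NodeExistsException, which is its signal to wait and retry.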
Hadoop Ecosystem: Component
• Sqoop: A tool for efficiently moving data between relational databases and
HDFS. Sqoop imports data from external sources into related Hadoop
ecosystem components like HDFS, HBase, or Hive. It also exports data from
Hadoop to other external sources. Sqoop works with relational databases such
as Teradata, Netezza, Oracle, and MySQL.
Hadoop Ecosystem: Component
• Capabilities of Sqoop include:
• Importing individual tables or entire databases to files in HDFS
(see the command sketch after this list)
• Generating Java classes to interact with your imported data
• Importing from SQL databases straight into your Hive data warehouse
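A hedged sketch of such an import from the command line (connection string, credentials, and paths are hypothetical):

# import one MySQL table into HDFS using 2 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username dbuser -P \
  --table Test \
  --target-dir /user/hadoop/Test \
  --num-mappers 2

Using --hive-import instead of --target-dir would load the table straight into the Hive warehouse.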
Hadoop Ecosystem: Component
• Sqoop: a source table in the database, and the comma-separated records it becomes in HDFS after an import:

CREATE TABLE Test(
id INT NOT NULL PRIMARY KEY,
msg VARCHAR(32),
bar INT);

0, this is a test,42
1, some more data,100
Hadoop Ecosystem: Sqoop Export
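A hedged sketch of the reverse direction, pushing HDFS files back into a relational table (the connection string and paths are hypothetical):

# export comma-separated HDFS files into an existing MySQL table
sqoop export \
  --connect jdbc:mysql://dbhost/shop \
  --username dbuser -P \
  --table Test \
  --export-dir /user/hadoop/Test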
Hadoop Ecosystem: Flume
• Flume:
• Flume efficiently collects, aggregates, and moves large amounts of data from their origin and
sends them back to HDFS. It is a fault-tolerant and reliable mechanism.
• This Hadoop ecosystem component allows data to flow from the source into the Hadoop
environment.
• It uses a simple, extensible data model that allows for online analytic applications. Using
Flume, we can get data from multiple servers into Hadoop immediately.
Hadoop Ecosystem: Flume Features
• Features of Flume:
• Flume has a flexible design based upon streaming data flows.
• It is fault tolerant and robust, with multiple failover and recovery
mechanisms. Flume offers different levels of reliability, including
'best-effort' and 'end-to-end' delivery.
• Flume carries data between sources and sinks. This gathering of data can
either be scheduled or event-driven. Flume has its own query processing
engine, which makes it easy to transform each new batch of data before it
is moved to the intended sink.
• Flume can also be used to transport event data including, but not limited
to, network traffic data, data generated by social media websites, and email
messages.
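A hedged sketch of a single-agent Flume configuration (the agent, source, channel, and sink names are hypothetical), tailing a log file into HDFS:

# source: follow a log file; channel: buffer in memory; sink: write to HDFS
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1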
Hadoop Ecosystem: Component
• Ambari:
• Ambari, another Hadoop ecosystem component, is a management platform for provisioning,
managing, monitoring, and securing Apache Hadoop clusters.
• Features of Ambari:
• Simplified installation, configuration, and management
• Centralized security setup
• Highly extensible and customizable
• Full visibility into cluster health
Hadoop Ecosystem: Component
Difference Between Apache Ambari and Apache ZooKeeper
• Ambari administers the cluster itself: a management platform for provisioning, managing,
monitoring, and securing a Hadoop cluster.
• ZooKeeper coordinates the distributed applications running on the cluster: a coordination
service providing primitives such as distributed locks.
Hadoop Ecosystem: Component
• Oozie
• It is a workflow scheduler system for managing Apache Hadoop jobs.
• Oozie combines multiple jobs sequentially into one logical unit of work.
• The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its
architectural center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.
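A hedged sketch of an Oozie workflow definition (the names and properties are hypothetical): a single MapReduce action with success/failure transitions, illustrating how jobs chain into one logical unit of work.

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="mr-step"/>
  <action name="mr-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- mapper/reducer configuration would go here -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce step failed</message>
  </kill>
  <end name="end"/>
</workflow-app>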
Hadoop Ecosystem: Component
• Working of Oozie
HIVE
Query Language for Hadoop
History of Hive

Hive QL – Group By
• SQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in Map Reduce

pv_users (pageid, age): (1, 25), (2, 25), (1, 32), (2, 25)

Map: emit key = <pageid, age>, value = 1
→ <1,25>:1, <2,25>:1, <1,32>:1, <2,25>:1

Shuffle and Sort: group the pairs by key
→ <1,25>:{1}, <1,32>:{1}, <2,25>:{1, 1}

Reduce: sum the values for each key
→ (1, 25, 1), (1, 32, 1), (2, 25, 2)
Hive QL – Group By with Distinct

page_view (pageid, userid, time):
(1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14), (2, 111, 9:08:20)

• SQL
• SELECT pageid, COUNT(DISTINCT userid)
• FROM page_view GROUP BY pageid

Result (pageid, count_distinct_userid): (1, 2), (2, 1)
Hive QL – Group By with Distinct in Map Reduce
[diagram: map/shuffle/reduce plan over the page_view table]
Drop
drop database office cascade; -- drops the tables in the database when it is not empty
Pig
Why Pig?
• pig --help lists the available options; running pig with no arguments starts the interactive grunt> shell.
Demo
For each tuple in 'GroupByCountry', generate the resulting string of the form
"Name of Country: No. of products sold" (see the Pig Latin sketch below).
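A hedged Pig Latin sketch of that step (it assumes 'GroupByCountry' came from a GROUP ... BY country, so $0 is the country and $1 is the bag of grouped tuples):

-- build the string "<Name of Country>:<No. of products sold>" per group
CountryCount = FOREACH GroupByCountry
               GENERATE CONCAT((chararray)$0,
                               CONCAT(':', (chararray)COUNT($1)));
DUMP CountryCount;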
Overview of the Apache Spark Ecosystem
Introduction to Apache Spark
• Apache Spark is a general-purpose cluster computing framework.
• It was introduced by UC Berkeley's AMP Lab in 2009 as a distributed computing system and
has been maintained by the Apache Software Foundation since 2013.
• Spark is a lightning-fast computing engine designed for faster processing of large volumes of data.
• Spark supports batch applications, iterative processing, interactive queries, and streaming
data, reducing the burden of managing separate tools for each workload.
• The main feature of Spark is its in-memory processing, which makes computation faster.
• It has its own cluster management system, and it uses Hadoop for storage purposes.
Introduction to Apache Spark
• One of the main features Spark offers for speed is the ability to run
computations in memory, but the system is also more efficient than
MapReduce for complex applications running on disk.
• Apache Spark achieves high performance for both batch and streaming
data, using a state-of-the-art DAG scheduler, a query optimizer, and a
physical execution engine.
Apache Spark Ecosystem (The Spark Stack)
Apache Spark Ecosystem: Component
1. Spark Core:
• Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems,
and more.
• Spark Core is also home to the API that defines resilient distributed datasets (RDDs),
which are Spark’s main programming abstraction.
• RDDs represent a collection of items distributed across many compute nodes that can
be manipulated in parallel.
• Spark Core provides many APIs for building and manipulating these collections.
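A minimal RDD sketch in Java (Spark 2.x API; the input and output paths are hypothetical): word count expressed as parallel transformations on an RDD of lines.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("WordCount"));
        JavaRDD<String> lines = sc.textFile("hdfs:///input/words.txt");
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(l -> Arrays.asList(l.split(" ")).iterator()) // split into words
            .mapToPair(w -> new Tuple2<>(w, 1))                   // (word, 1) pairs
            .reduceByKey(Integer::sum);                           // totals per word
        counts.saveAsTextFile("hdfs:///output/word-counts");
        sc.stop();
    }
}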
Apache Spark Ecosystem: Component
2. Spark SQL
• Spark SQL is Spark’s package for working with structured data. It allows querying data via
SQL as well as the Apache Hive variant of SQL—called the Hive Query Language
(HQL)—and it supports many sources of data, including Hive tables, Parquet, and JSON.
• Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL
queries with the programmatic data manipulations supported by RDDs in Python, Java, and
Scala, all within a single application, thus combining SQL with complex analytics.
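A hedged Spark SQL sketch in Java (Spark 2.x API; the JSON path and field names are hypothetical), mixing a programmatic load with a SQL query in one application:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
    public static void main(String[] args) {
        SparkSession spark =
            SparkSession.builder().appName("SqlSketch").getOrCreate();
        Dataset<Row> people = spark.read().json("hdfs:///data/people.json");
        people.createOrReplaceTempView("people");  // make it SQL-queryable
        spark.sql("SELECT name, age FROM people WHERE age > 21").show();
        spark.stop();
    }
}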
3. Spark Streaming
• Spark Streaming is a Spark component that enables processing of live streams of data.
Examples of data streams include logfiles generated by production web servers, or queues
of messages containing status updates posted by users of a web service.
• Streaming provides an API for manipulating data streams that closely matches the Spark
Core’s RDD API, making it easy for programmers to learn the project and move between
applications that manipulate data stored in memory, on disk, or arriving in real time.
• Underneath its API, Spark Streaming was designed to provide the same degree of fault
tolerance, throughput, and scalability as Spark Core.
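A hedged Spark Streaming sketch (the host and port are hypothetical): count the lines arriving on a socket in 10-second batches, using the DStream API that mirrors the RDD API above.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;

public class StreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        JavaStreamingContext ssc = new JavaStreamingContext(
            new SparkConf().setAppName("StreamingSketch"),
            Durations.seconds(10));                       // batch interval
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
        lines.count().print();      // per-batch line count
        ssc.start();                // begin receiving
        ssc.awaitTermination();
    }
}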
Features of Apache Spark
• Speed: Though Spark is based on the MapReduce model, it is 10 times faster than Hadoop when
it comes to big data processing.
• Usability: Spark supports multiple languages, making it easier to work with.
• Sophisticated Analytics: Spark provides complex algorithms for Big Data analytics
and machine learning.
• In-Memory Processing: Unlike Hadoop, Spark doesn't move data in and out of the
cluster.
• Lazy Evaluation: Spark waits until the code is complete and then processes the
instructions in the most efficient way possible.
• Fault Tolerance: Spark has better fault tolerance than Hadoop; both storage and
computation can tolerate failure by backing up to another node.
Comparison of different Frameworks
Case Study:
• Google Analytics:
• Cloud Dataflow: runs faster and scales better than pretty much any other system.
• Cloud Save: an API that enables an application to save an individual user's
data in the cloud or elsewhere and use it without requiring any
server-side coding.
• Cloud Debugging: makes it easier to sift through lines of code deployed across many
servers in the cloud to identify software bugs.
Case Study:
• Google Analytics:
• Cloud Tracing: provides latency statistics across different groups and provides
analysis reports.
• Cloud Monitoring: an intelligent monitoring system. The feature monitors cloud
infrastructure resources, such as disks and virtual machines, as well as service
levels for Google's services and for more than a dozen non-Google open source packages.
Case Study:
• Twitter Analytics: Capturing and Analyzing Tweets
https://blogs.ischool.berkeley.edu/i290-abdt-s12/
Demo
• https://www.dezyre.com/hadoop-tutorial/hadoop-sqoop-tutorial-data-aggregation#:~:text=So%2C%20just%20using%20simple%20Sqoop,the%20data%20between%20the%20mappers.
• https://www.edureka.co/blog/apache-sqoop-tutorial/
Active Learning
• What is Avro?
A - Avro is a Java serialization library.
B - Avro is a Java compression library.
C - Avro is a Java library that creates splittable files.
• ZooKeeper ensures that…
A - All the NameNodes are actively serving client requests.
B - Only one NameNode is actively serving client requests.
C - A failover is triggered when any of the DataNodes fails.
• Which of these provides a stream processing system used in the Hadoop
ecosystem?
• A - Hive
• B - Pig
• C - Spark
Introduction to MapReduce
• Traditional techniques for working with information simply don't scale to
Big Data: they're too costly, time-consuming, and complex, which is where
MapReduce comes in.
• MapReduce is:
• a programming framework created and successfully deployed by Google
• based on the divide-and-conquer method (and lots of commodity servers)
• a way to break complex Big Data problems down into small units of work
and process them in parallel.
• MapReduce is a programming framework that allows us to perform
distributed and parallel processing on large data sets in a distributed
environment.
• E.g., processing 100 TB of data:
• On 1 node → scanning @ 50 MB/s ≈ 23 days
• On a 1000-node cluster → scanning @ 50 MB/s ≈ 33 min
Introduction to MapReduce: Use cases
• At Google
• Index construction for Google Search (replaced in 2010 by Caffeine)
• Article clustering for Google News
• Statistical machine translation
• At Yahoo!
• “Web map” powering Yahoo! Search
• Spam detection for Yahoo! Mail
• At Facebook
• Data mining, Web log processing
• SearchBox (with Cassandra)
• Facebook Messages (with HBase)
• Ad optimization
• Spam detection
Introduction to MapReduce
• Problem: How to compute the PageRank for a crawled set of
websites on a cluster of machines?
• Main Challenges:
How to break up a large problem into smaller tasks that can be executed
in parallel?
• How to assign tasks to machines?
• How to partition and distribute data?
• How to share intermediate results?
• How to coordinate synchronization, scheduling, and fault tolerance?
• Solution: MapReduce
• Algorithms that can be expressed as (or mapped to) a sequence of Map()
and Reduce() functions are automatically parallelized by the framework.
MapReduce Workflow
• Map Phase
• Raw data is read and converted to key/value pairs
• The Map() function is applied to each pair
• Shuffle and Sort Phase
• All key/value pairs are sorted and grouped by their keys
• Reduce Phase
• All values with the same key are processed within the same reduce() function
MapReduce Programming Model
• Every MapReduce program must specify a Mapper and typically a Reducer
• The Mapper has a map() function that transforms input (key, value) pairs
into any number of intermediate (out_key, intermediate_value) pairs
• The Reducer has a reduce() function that transforms intermediate (out_key,
list(intermediate_value)) aggregates into any number of output (value’) pairs
MapReduce Execution Details
MapReduce Data flow
Word Count Problem using MapReduce
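A hedged Java sketch of the classic word count on the standard Hadoop MapReduce API (the input and output paths are taken from the command line):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: (offset, line) -> (word, 1) for every word in the line
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                if (tok.isEmpty()) continue;
                word.set(tok);
                ctx.write(word, ONE);          // intermediate (word, 1) pair
            }
        }
    }

    // Reduce phase: (word, [1, 1, ...]) -> (word, total)
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // local aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}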
Comparison of MapReduce and RDBMS
Apache Spark Ecosystem: Component
• The Spark project contains multiple closely integrated components. At its core, Spark is a
"computational engine" that is responsible for scheduling, distributing, and monitoring
applications consisting of many computational tasks across many worker machines, or a
computing cluster.
4. MLlib (Machine Learning Library):
• MLlib provides multiple types of machine
learning algorithms, including classification,
regression, clustering, and collaborative
filtering, as well as supporting functionality
such as model evaluation and data import.
• It also provides some lower-level ML
primitives, including a generic gradient descent
optimization algorithm.
• All of these methods are designed to scale out
across a cluster.
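A hedged MLlib sketch in Java (toy in-memory points; a real job would load vectors from HDFS): k-means clustering with the RDD-based API, which scales out because the points are an RDD.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("KMeansSketch"));
        JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
            Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
            Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)));
        // train with k = 2 clusters and 20 iterations
        KMeansModel model = KMeans.train(points.rdd(), 2, 20);
        System.out.println(Arrays.toString(model.clusterCenters()));
        sc.stop();
    }
}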
Hadoop High Level Architecture
Hadoop Cluster
• A small Hadoop cluster includes a single master and
multiple worker nodes
Master node:
Data Node
Job Tracker
Task Tracker
Name Node
Slave node:
Data Node
Task Tracker
Active Learning
Hadoop                                   SQL
Schema on Read                           Schema on Write
Data stored in compressed file form      Data stored in logical, interrelated tables