Agenda

Day 1
• Introduction to BigData
• BigData Roles and Responsibilities
• Introduction to Hadoop and Its EcoSystem

Day 2
• Hadoop Distributed Filesystem
• Ingesting Data into HDFS

Day 3
• MapReduce Framework
• Hive Programming

Day 4
• Advanced Hive Programming
• Introduction to YARN
• Spark SQL
• A database is a collection of data (records) organized in a structured way.
• Its main objective is to ensure that data can be stored and retrieved easily and effectively.
• OLAP technology has been defined as the ability to achieve “fast access to shared
multidimensional information.”
• Given OLAP technology’s ability to create very fast aggregations and calculations
of underlying data sets, one can understand its usefulness in helping business
leaders make better, quicker “informed” decisions.
• Manage and maintain the Hadoop clusters for uninterrupted job execution
• Routine check-ups, back-ups, and monitoring of the entire system
• Ensuring connectivity and that the network is always up and running
• Planning for capacity upgrades or downsizing as and when the need arises
• Managing HDFS and ensuring it is working optimally at all times
• Securing the Hadoop cluster in a foolproof manner is paramount
• Regulating administration rights depending on users' job profiles
• Adding new users over time and removing redundant users smoothly
• Proficiency in Linux scripting and also in Hive, Oozie, and HCatalog
• A part of the attraction lies in the fact that a Data Scientist wears multiple hats
over the course of a typical day at the office.
• He/She is part scientist, part artist and part magician!
1. Which role takes responsibility for constructing and deploying both positive and negative test cases?
Source: http://www.gartner.com/it-glossary/big-data/
Distributed Processing - MapReduce
[Diagram: map tasks 1 through 7 run as Java processes on the DataNode/NodeManager hosts; each map task reads its share of the input data and emits intermediate <key,value> pairs such as <key1, value>, <key2, value>, and so on.]
4. The <key,value> pairs go through a shuffle/sort phase, where records with the same key end up at the same reducer. The specific pairs sent to a reducer are sorted by key, and the values are aggregated into a collection.
Distributed Processing - MapReduce cont.
5. Reduce tasks run on a NodeManager as a Java process. Each Reducer processes its input and outputs <key,value> pairs that are typically written to a file in HDFS.
[Diagram: the output from the mappers, after being shuffled and sorted, becomes the input of the reducers. Each reducer (reduce1, reduce2) runs on a NodeManager, receives groups such as <key1, (value, value, value, value)> and <key9, (value, value, value, value, value)>, and writes its results to HDFS. A minimal code sketch follows the diagram.]
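To make the map, shuffle/sort, and reduce steps concrete, here is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API. It is illustrative only and not tied to the trucking example: the class names and the input and output paths are placeholders.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Steps 1-3: each map task reads a block of the input and emits <word, 1> pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // intermediate <key,value> pair
        }
      }
    }
  }

  // Steps 4-5: after the shuffle/sort, each reducer receives <word, (1,1,1,...)>
  // and writes the aggregated <word, count> pair to HDFS.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each map task emits <word, 1> pairs, the framework shuffles and sorts them by key, and every reducer then sees one word together with all of its counts, exactly as in the diagram above.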
Apache Hadoop Uses Commodity Hardware
► All of this occurs on commodity hardware, which reduces not only the original purchase price but also, potentially, the support costs.
1. Sentiment
How your customers feel
2. Clickstream
Website visitors’ data
3. Sensor/Machine
Data from remote sensors and machines
4. Geographic
Location-based data
5. Server Logs
6. Text
Millions of web pages, emails, and documents
• Questions to answer:
  o How did the public feel about the debut?
  o How might the sentiment data have been used to better promote the launch of the movie?
Getting Twitter Feeds into Hadoop
[Diagram: a Flume agent streams Twitter feeds into the Hadoop cluster; a minimal agent configuration sketch follows the example tweets below.]
• Iron Man 3 was awesome. I want to go see it again!
• Iron Man 3 = 7.7 stars
• Tony Stark has 42 different Iron Man suits in Iron Man 3
• Wow as good as or better than the first two
• Thor was way better than Iron Man 3
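The sketch below gives a rough idea of how such a Flume agent could be wired together in its properties file: a Twitter source feeding a memory channel that drains into an HDFS sink. The agent, channel, and sink names, the credentials, and the HDFS path are placeholders, and the exact source type and its parameters depend on the Flume build in use.

TwitterAgent.sources  = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks    = HdfsSink

# Source: pulls tweets from the Twitter API (credentials are placeholders)
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>

# Channel: buffers events in memory between the source and the sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

# Sink: writes the buffered tweets into HDFS
TwitterAgent.sinks.HdfsSink.type = hdfs
TwitterAgent.sinks.HdfsSink.channel = MemChannel
TwitterAgent.sinks.HdfsSink.hdfs.path = /user/flume/tweets
TwitterAgent.sinks.HdfsSink.hdfs.fileType = DataStream

The agent is then launched with the flume-ng command, naming the agent and pointing it at this properties file.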
Raw sensor data from the trucks is streamed into the Hadoop cluster by a Flume agent. Sample records:
A5   A5   unsafe following distance  41.526509  -124.038407  Klamath      California  33  1  0
A54  A54  normal                     35.373292  -119.018712  Bakersfield  California  19  0  0
A48  A48  overspeed                  38.752124  -121.288006  Roseville    California  77  1  0
...
A table in the RDBMS containing the info on the fleet of trucks is imported into the Hadoop cluster by a Sqoop job (a sample command follows).
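A Sqoop import of that fleet table might look roughly like the command below. The JDBC URL, database name, user, and table name are placeholders for whatever the source RDBMS actually uses.

sqoop import \
  --connect jdbc:mysql://dbhost/fleet \
  --username fleet_user -P \
  --table trucks \
  --hive-import \
  --hive-table trucks

The --hive-import option creates a matching Hive table, so the imported fleet data is immediately queryable alongside the sensor events.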
HCatalog Stores a Shared Schema
create table trucks (
  driverid string,
  truckid string,
  model string,
  monthyear_miles int,
  monthyear_gas int,
  total_miles int,
  total_gas double,
  mileage double
);

create table events (
  truckid string,
  driverid string,
  event string,
  latitude double,
  longitude double,
  city string,
  state string,
  velocity double,
  event_indicator boolean,
  idling_indicator boolean
);

create table riskfactor (
  driverid string,
  riskfactor float
);

[Diagram: all three table definitions are registered in the HCatalog metastore, so the schemas can be shared across Hive, Pig, and other tools.]
Data Analysis
We want to answer two questions:
• Which trucks are wasting fuel through unnecessary idling?
• Which drivers are most frequently involved in unsafe events
on the road?
-- Load the events table through HCatalog and keep only the non-normal events
a = LOAD 'events'
    using org.apache.hive.hcatalog.pig.HCatLoader();
b = filter a by event != 'Normal';

-- Count the number of risky events per driver
c = foreach b
    generate driverid, event, (int) '1' as occurance;
d = group c by driverid;
e = foreach d generate group as driverid,
    SUM(c.occurance) as t_occ;

-- Load the trucks table and compute the total miles driven per driver
f = LOAD 'trucks'
    using org.apache.hive.hcatalog.pig.HCatLoader();
g = foreach f generate driverid,
    ((int) apr09_miles + (int) apr10_miles) as t_miles;

-- Join the two relations and compute a risk factor: events per 1,000 miles
join_d = join e by (driverid), g by (driverid);
final_data = foreach join_d generate
    $0 as driverid, (float) $1/$3*1000 as riskfactor;

-- Store the result into the riskfactor table via HCatalog
store final_data into 'riskfactor'
    using org.apache.hive.hcatalog.pig.HCatStorer();
[Diagram: without YARN, MapReduce handles both cluster resource management and data processing directly on top of HDFS (redundant, reliable storage); with YARN, cluster resource management moves into YARN, and MapReduce and other data-processing frameworks run side by side on top of it, with HDFS still providing the storage layer.]
Source – https://hortonworks.com/products/data-center/hdp/
Hadoop distros - HDP
Source – https://hortonworks.com/products/data-center/hdp/
Data Management and Operations Frameworks

Framework | Description
Hadoop Distributed File System (HDFS) | A Java-based, distributed file system that provides scalable, reliable, high-throughput access to application data stored across commodity servers
Yet Another Resource Negotiator (YARN) | A framework for cluster resource management and job scheduling
Framework | Description
Ambari | A Web-based framework for provisioning, managing, and monitoring Hadoop clusters
ZooKeeper | A high-performance coordination service for distributed applications
Cloudbreak | A tool for provisioning and managing Hadoop clusters in the cloud
Oozie | A server-based workflow engine used to execute Hadoop jobs
Framework | Description
HDFS | A storage management service providing file and directory permissions, even more granular file and directory access control lists, and transparent data encryption
YARN | A resource management service with access control lists controlling access to compute resources and YARN administrative functions
Hive | A data warehouse infrastructure service providing granular access controls to table columns and rows
Falcon | A data governance tool providing access control lists that limit who may submit Hadoop jobs
Knox | A gateway providing perimeter security to a Hadoop cluster
Ranger | A centralized security framework offering fine-grained policy controls for HDFS, Hive, HBase, Knox, Storm, Kafka, and Solr
[Diagram: the life of data in the cluster. As data is ingested, it is examined and decisions are made: whether to ingest and replicate it, which storage tier to keep it in (including cloud storage), how long to retain it, whether to archive it, and when to discard results.]

Answers to questions = $$
Hidden gems = $$

3. Data analysts use Hive to query the structured data.
4. Data scientists use MapReduce, R, and Mahout to mine the data.
Hadoop Deployment Options
⬢ There are choices when deploying Hadoop:
► Deploy on-premise in your own data center
► Deploy in the cloud
► Deploy on Microsoft Windows
► Deploy on Linux
Deployment Choices

[Diagram: a standalone host (Linux or Windows) with CPU, memory, and a single JVM running YARN, HDFS, and MapReduce against the local file system.]
⬢ Single system installation
⬢ All Hadoop service daemons run in a single Java virtual machine (JVM)
⬢ Uses the file system on local disk
⬢ Suitable for test and development, or introductory training
Pseudo-Distributed Mode

[Diagram: a single standalone host where each Hadoop daemon, such as the NameNode and the ResourceManager, runs in its own JVM.]
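As a rough illustration, a pseudo-distributed setup typically just points the default file system at an HDFS daemon on the local host and drops the block replication factor to 1. The host name and port below are placeholders and vary by Hadoop version and distribution.

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>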
Ambari provides:
• Dashboard
• Cluster configuration
• Service management
• Cluster monitoring with alerts
Ambari Architecture
[Diagram: the Ambari Server exposes a REST API that is consumed by the Ambari Web UI and by View plugins. The server keeps cluster configuration and topology, users, and groups in the Ambari DB (user and group info can also come from LDAP/AD), orchestrates the cluster, and gathers monitoring data and alerts. An Ambari agent on each node executes python scripts, and a Metrics Monitor on each node reports to the Metrics Collector and its storage.]
The Ambari Web UI
[Screenshot of the Ambari Web UI. Callout: click to display the available Ambari Views.]
1. Sentiment is one of the six key types of big data. List the other five.
2. What technology might you use to stream Twitter feeds into Hadoop?
3. What technology might you use to define, store, and share the schemas of your big data stored in Hadoop?