
BigData – HADOOP, Hive, Spark Developer

Agenda

Day 1
• Introduction to BigData
• BigData Roles and Responsibilities
• Introduction to Hadoop and Its EcoSystem

Day 2
• Hadoop Distributed Filesystem
• Ingesting Data into HDFS

Day 3
• MapReduce Framework
• Hive Programming

Day 4
• Advanced Hive Programming
• Introduction to YARN
• Spark SQL


Introductions
• Your name
• Job responsibilities
• Previous Hadoop experience (if any)
• What brought you here



Class Logistics
• Schedule
10:00 a.m. – 5:00 p.m.
• Restrooms
• Lunch
• Computers and Wireless Access



Introduction To BigData
Topics Covered
• Evolution of Databases
• Operational Systems vs. Analytical Systems
• Operational Systems vs. DW
• About Data Warehouses
• Introduction to OLAP
• Advantages of OLAP
Evolution of Databases

• The main objective of a database is to ensure that data can be stored and
retrieved easily and effectively.
• A database is a compilation of data (records) organized in a structured way.


Evolution of Databases and Database Models
• The database evolution happened in four “waves”:
• The first wave consisted of network, hierarchical, inverted list, and (in the
1990s) object-oriented DBMSs; it took place from roughly 1960 to 1999.
• The relational wave introduced all of the SQL products (and a few non-SQL)
around 1990 and began to lose users around 2008.
• The decision support wave introduced Online Analytical Processing (OLAP) and
specialized DBMSs around 1990, and is still in full force today.
• The NoSQL wave includes big data, graphs, and much more; it began in 2008.



Evolution of Databases and Database Models
• There was a lot of ground to cover for the pioneers of Database Management
Systems.
• The first twenty to twenty-five years introduced and fine-tuned important
technological fundamentals.



Evolution of Databases and Database Models


Source - http://graphdatamodeling.com/GraphDataModeling/History.html
Relational Empire to BigData/NoSql

• Today, vendors unite under the NoSQL / Big Data brand.
• Around 2008, triggered by Facebook's open source releases of Hive and Cassandra,
the NoSQL counter-revolution started.
• This space gets all of the attention today.
• The modern development platforms use schema-free or semi-structured
approaches (also under the umbrella of NoSQL).
• "Model as you go" is a common theme.


Relational Empire to BigData/NoSql



Source - http://graphdatamodeling.com/GraphDataModeling/History.html
Operational Systems vs. Data Warehousing
• Traditional database systems are designed to support typical day-to-day
operations via individual user transactions (e.g., registering for a course or
entering a financial transaction).
• Such systems are generally called operational or transactional systems.
• An operational system is only one source of data; there are usually many other,
heterogeneous information sources, and these pose challenges:
• No integrated view of the data
• No uniform user interface
• No support for sharing
• These challenges in turn hamper decision-making processes across the
enterprise.
Operational Systems vs. Data Warehousing


Source – UC Berkeley EDW presentation October 2006
Operational Systems vs. Data Warehousing
• A data warehouse complements an existing operational system by providing
forecasting and decision-making processes across the enterprise
• A data warehouse acts as a centralized repository of an organization's data,
ultimately providing a comprehensive and homogenized view of the organization.



Operational Systems vs. Data Warehousing


Source – UC Berkeley EDW presentation October 2006
What Data Exists in a Data Warehouse?

• Large volumes of detailed data already exist in transactional database systems.


• A core subset of this data will be imported into the data warehouse, prioritized
by subject area (i.e. by business area), including finance, research, contracts and
grants, enrollment analysis, alumni, etc.
• A fundamental axiom of the data warehouse is that the imported data is
both read-only and non-volatile.
• As the amount of data within the data warehouse grows, the value of the data
increases, allowing a user to perform longer-term analyses of the data.



What Are Users Saying?



Source – UC Berkeley EDW presentation October 2006
What Does a Data Warehouse Do?



Source – UC Berkeley EDW presentation October 2006
Data Warehouse – Practitioner's Viewpoint
A data warehouse is simply a single, complete, and consistent store of data
obtained from a variety of sources and made available to end users in a way
they can understand and use it in a business context.



Data Warehouse – Practitioner's Viewpoint


Source – UC Berkeley EDW presentation October 2006
Data Warehouse vs. OLTP



Source – UC Berkeley EDW presentation October 2006
Data Warehouse – OLAP

• OLAP (Online Analytical Processing) is the technology behind many Business
Intelligence (BI) applications.
• OLAP is a powerful technology for data discovery, including capabilities for
limitless report viewing, complex analytical calculations, and predictive "what if"
scenario (budget, forecast) planning.


Advantages of OLAP

• OLAP technology has been defined as the ability to achieve “fast access to shared
multidimensional information.”
• Given OLAP technology’s ability to create very fast aggregations and calculations
of underlying data sets, one can understand its usefulness in helping business
leaders make better, quicker “informed” decisions.
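
As a hedged illustration of the kind of aggregation OLAP engines serve, the
HiveQL sketch below (the sales table and its columns are hypothetical)
precomputes totals at several levels in a single pass:

-- Minimal sketch of an OLAP-style aggregation, assuming a hypothetical
-- sales(region, product, amount) table.
SELECT region, product, SUM(amount) AS total_sales
FROM sales
GROUP BY region, product WITH ROLLUP;
-- WITH ROLLUP adds per-region subtotal rows and a grand-total row, the kind
-- of fast multidimensional aggregate described above.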



Lesson Review
1. State True or False - The decision support wave introduced Online
Analytical Processing (OLAP) and specialized DBMSs.
2. Which type of system acts as a centralized repository of an organization's
data, ultimately providing a comprehensive and homogenized view of the
organization?
3. A fundamental axiom of the data warehouse is that the imported data is
both __________ and ______________.
4. State True or False - A staging area is an intermediate storage area used
for data processing during the extract, transform and load (ETL) process.
5. State any advantage of OLAP technology.
BigData Hadoop Roles and Responsibilities
Topics Covered
• Primer on some general skills expected from Hadoop Professionals
• Various Job Roles under the Hadoop domain
• Data Engineering
• Data Science
• DevOps
Primer on some general skills expected from BigData Hadoop Professionals
• Ability to work with huge volumes of data so as to derive Business Intelligence
• Analyze data, uncover information, derive insights and propose data-driven
strategies
• A knowledge of OOP languages like Java, C++, Python, Scala is good to have
• Database theories, structures, categories, properties, and best practices
• A knowledge of installing, configuring, maintaining and securing Hadoop
• An analytical bent of mind and ability to learn-unlearn-relearn surely comes in
handy



Various Job Roles under the BigData Hadoop domain
• Hadoop Data Engineer
• Hadoop Architect
• Hadoop Administrator
• Hadoop DevOps Engineer
• Data Scientist



BigData Hadoop Data Engineer bird's eye view:
• The primary job of a Hadoop Data Engineer involves coding.
• They are basically software Engineers but working in the Big Data Hadoop domain.
• They are adept at coming up with the design concepts that are used for creating
extensive software applications.
• They are masters of computer programming languages.



BigData Hadoop Data Engineer Responsibilities:

• Knowledge of agile methodology for delivering software solutions
• Design, develop, document and architect Hadoop applications
• Work with Hadoop log files to manage and monitor them
• Develop MapReduce code that works seamlessly on Hadoop clusters
• Working knowledge of SQL, NoSQL, data warehousing and DBA practices
• Expertise in newer concepts like Apache Spark and Scala programming
• Complete knowledge of the Hadoop ecosystem and Hadoop Common
• Seamlessly convert hard-to-grasp technical requirements into outstanding designs
• Design web services for swift data tracking and querying data at high speeds
• Test software prototypes, propose standards and smoothly transfer them to operations


BigData Hadoop Architect bird's eye view:
• A Hadoop Architect, as the name suggests, is the one entrusted with the
tremendous responsibility of dictating where the organization will go in terms of
Big Data Hadoop deployment.
• He/She is involved in planning, designing and strategizing the roadmap and
decides how the organization moves forward.



BigData Hadoop Architect Responsibilities:

• Hands-on experience in working with Hadoop distribution platforms like
HortonWorks, Cloudera, MapR and others
• Take end-to-end responsibility of the Hadoop life cycle in the organization
• Be the bridge between data scientists, engineers and the organizational needs
• Do in-depth requirement analysis and exclusively choose the work platform
• Full knowledge of Hadoop architecture and HDFS is a must
• Working knowledge of MapReduce, HBase, Pig, Spark, Java and Hive
• Ensure the chosen Hadoop solution is being deployed without any hindrance


BigData Hadoop Administrator bird's eye view:
• The Hadoop Administrator is also a very prominent role as he/she is responsible
for ensuring there is no roadblock to the smooth functioning of Hadoop
framework.
• The roles and responsibilities resemble that of a System Administrator.
• A complete knowledge of the hardware ecosystem and Hadoop Architecture is
critical.



BigData Hadoop Administrator Responsibilities:

• Manage and maintain the Hadoop clusters for uninterrupted jobs
• Routine check-up, back-up and monitoring of the entire system
• Ensure the connectivity and network are always up and running
• Plan for capacity upgrading or downsizing as and when the need arises
• Manage HDFS and ensure it is working optimally at all times
• Secure the Hadoop cluster in a foolproof manner
• Regulate administration rights depending on the job profile of users
• Add new users over time and discard redundant users smoothly
• Proficiency in Linux scripting and also in Hive, Oozie and HCatalog


BigData Hadoop Tester bird's eye view:
• The job of the Hadoop Test Engineer has become extremely critical, since
Hadoop networks are getting bigger and more complex with each passing day.
• This poses new problems for viability, security and ensuring
everything works smoothly without any bugs or issues.
• The Hadoop Test Engineer is primarily responsible for troubleshooting Hadoop
applications and rectifying any problem he/she discovers as early as possible,
before it becomes seriously threatening.


BigData Hadoop Tester Responsibilities:

• Construct and deploy both positive and negative test cases
• Discover, document and report bugs and performance issues
• Ensure the MapReduce jobs are running at peak performance
• Ensure constituent Hadoop scripts such as HiveQL and Pig Latin are robust
• Expert knowledge of Java to efficiently do the MapReduce testing
• Understanding of the MRUnit and JUnit testing frameworks is essential
• Full proficiency in Apache Pig and Hive is required
• Able to work with the Selenium test automation tool
• Able to come up with contingency plans in case of breakdown


Data Scientist bird's eye view:

• A part of the attraction lies in the fact that a Data Scientist wears multiple hats
over the course of a typical day at the office.
• He/She is part scientist, part artist and part magician!



Data Scientist Responsibilities:

• Data Scientists are basically Data Analysts with wider responsibilities
• Complete mastery of different techniques of analyzing data
• Expect to solve real business issues backed by solid data
• Tailor the data analytics pursuit to suit the specific business needs
• A strong grip of mathematics and statistics is expected
• Keeping the big picture in mind at all times to know what needs to be done
• Develop data mining architecture, data modeling standards and more
• An advanced knowledge of SQL, Hive and Pig is a must
• Ability to work with R, SPSS and SAS is hugely beneficial
• Ability to reason, corroborate actions with data and insights
• Creative ability to do things that can work wonders for the business
• Top-notch communication skills to take everybody onboard in the organization.



Lesson Review
1. Which job profile takes the responsibility of application development
coding?
2. State True or False - Hadoop Data Engineers take end-to-end responsibility
of the Hadoop life cycle in the organization.
3. Which role takes the responsibility of cluster capacity management?
4. Which role takes the responsibility of constructing and deploying both
positive and negative test cases?
5. Which role requires a strong grip of mathematics and statistics?


Lab: Setting up HDP 2.6 Lab Environment
Introduction To Hadoop and Its EcoSystem
Topics Covered
• What makes data Big Data?
• The Three “V”s of Big Data
• Six Key Hadoop Data Types
• Use Cases
• About Hadoop
• RDBMS Vs Hadoop
• Hadoop Core
• Hadoop Ecosystem
• Hadoop Deployment modes
• Local Mode
• Pseudo-distributed mode
• Cluster mode
What Makes Data BIG DATA?
► The phrase Big Data comes from the computational
sciences
► Specifically, it is used to describe scenarios where the
volume and variety of data types overwhelm the
existing tools to store and process it
► In 2001, the industry analyst Doug Laney described Big
Data using the three V’s of volume, velocity, and
variety



The Three Vs of Big Data

• Variety: Unstructured and semi-structured data is becoming as strategic as
the traditional structured data.
• Volume: Data coming in from new sources as well as increased regulation in
multiple areas means storing more data for longer periods of time.
• Velocity: Machine data, as well as data coming from new sources, is being
ingested at speeds not even imagined a few years ago.


Variety
► Variety refers to the number of types of data being
generated
► Varieties of data include structured, semi-structured,
and unstructured data arriving from a myriad of sources
► Data can be gathered from databases, XML or JSON
files, text documents, email, video, audio, stock ticker
data, and financial transactions



Variety
► There are problems related to the variety of data. These include:
► How to gather, link, match, cleanse, and transform data across
systems
► How to connect and correlate data relationships and hierarchies
in order to extract business value from the data


Volume
► Volume refers to the amount of data being generated.
Think in terms of gigabytes, terabytes, and petabytes
► Many systems and applications are just not able to
store, let alone ingest or process, that much data



Volume
► Many factors contribute to the increase in data volume.
This includes
► Transaction-based data stored for years
► Unstructured data streaming in from social media
► Ever increasing amounts of sensor and machine data being
produced and collected



Volume
► There are problems related to the volume of data
► Storage cost is an obvious issue
► Another problem is filtering and finding relevant and valuable
information in large quantities of data that often also contains
much of little value


Volume
► You also need a solution to analyze data quickly enough
in order to maximize business value today and not just
next quarter or next year



Velocity
► Velocity refers to the rate at which new data is
created. Think in terms of megabytes per second and
gigabytes per second
► Data is streaming in at unprecedented speed and must
be dealt with in a timely manner in order to extract
maximum value from the data
► Sources of this data include logs, social media, RFID
tags, sensors, and many more



Velocity
► There are problems related to the velocity of data.
These include not reacting quickly enough to benefit
from the data
► For example, data could be used to create a dashboard
that could warn of imminent failure or a security
breach; failure to react in time could lead to service outages


Velocity
► Another problem related to the velocity of data is that
data flows tend to be highly inconsistent with periodic
peaks.
► Causes include daily or seasonal changes or event-triggered peak loads
► For example, a change in political leadership could
cause a peak in social media



Hadoop Was Designed for Big Data

"Big Data is high-volume, -velocity and -variety information assets that
demand cost-effective, innovative forms of information processing for
enhanced insight and decision making." – Gartner

Source: http://www.gartner.com/it-glossary/big-data/


Hadoop Was Designed for Big Data

► The Gartner quote makes a good point.
► It is not enough to understand what Big Data is and then collect it
► You must also have a means of processing it in order to extract
value from it
► The good news is that Hadoop was designed to collect,
store, and analyze Big Data
► And it does it all in a cost-effective way


What is Apache Hadoop?

► So what is Apache Hadoop?
► It is a scalable, fault tolerant, open source framework for the
distributed storing and processing of large sets of data on
commodity hardware
► But what does all that mean?


What is Apache Hadoop scalability?

► Well, first of all, it is scalable
► Hadoop clusters can range from one machine to
thousands of machines. That is scalability!


What is Apache Hadoop fault tolerant?

► It is also fault tolerant
► Hadoop services become fault tolerant through
redundancy
► For example, the Hadoop distributed file system, called
HDFS, automatically replicates data blocks to three
separate machines, assuming that your cluster has at
least three machines in it
► Many other Hadoop services are replicated too in order
to avoid any single points of failure


What is Apache Hadoop fault tolerant?

(Figure: a client, the NameNode, and DataNodes 1, 2, and 3)
1. The client sends a request to the NameNode to add a Big Data file to HDFS.
2. The NameNode tells the client how and where to distribute the blocks.
3. The client breaks the data into blocks and writes each block to a DataNode.
4. The DataNode replicates each block to two other DataNodes (as chosen by
the NameNode).
What is Apache Hadoop open source?

► Hadoop is also open source
► Hadoop development is a community effort governed
under the licensing of the Apache Software Foundation
► Anyone can help to improve Hadoop by adding features,
fixing software bugs, or improving performance and
scalability


What is Apache Hadoop distributed
storage and processing?
► Hadoop also uses distributed storage and processing
► Large datasets are automatically split into smaller
chunks, called blocks, and distributed across the
cluster machines
► Not only that, but each machine processes its local
block of data. This means that processing is distributed
too, potentially across hundreds of CPUs and hundreds
of gigabytes of memory



Distributed Processing - MapReduce

(Figure: map tasks 1 through 7 running on four DataNode/NodeManager machines)
1. Suppose a file is the input to a MapReduce job. That file is broken down
into blocks stored on DataNodes across the Hadoop cluster.
2. During the Map phase, map tasks process the input of the MapReduce job,
with a map task assigned to each Input Split. The map tasks are Java
processes that ideally run on the DataNodes where the blocks are stored.
Distributed Processing - MapReduce cont.

(Figure: each map task emitting <key, value> records)
3. Each map task processes its Input Split and outputs records of
<key, value> pairs.
4. The <key, value> pairs go through a shuffle/sort phase, where records with
the same key end up at the same reducer. The specific pairs sent to a reducer
are sorted by key, and the values are aggregated into a collection.
Distributed Processing - MapReduce cont.

(Figure: two reduce tasks running on NodeManagers and writing their output to
HDFS. The output from the mappers, after being shuffled and sorted, becomes
the input of the reducers; for example, <key1, (value,value,value,value)> and
<key2, (value,value,value,value,value)>.)
5. Reduce tasks run on a NodeManager as a Java process. Each reducer
processes its input and outputs <key, value> pairs that are typically written
to a file in HDFS.
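
Hive, covered later in this course, compiles queries down to this same
pattern. As a minimal sketch (the words table and word column are
hypothetical), a simple HiveQL aggregation maps directly onto the phases
above:

-- Hypothetical example: counting words stored one per row in a words table.
-- Map tasks emit <word, 1> records, the shuffle/sort phase groups records
-- by word, and reduce tasks sum the values for each word.
SELECT word, COUNT(*) AS word_count
FROM words
GROUP BY word;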
What is Apache Hadoop commodity hardware?
► All of this occurs on commodity hardware, which
reduces not only the original purchase price but also
potentially reduces support costs


Six Key Hadoop Data Types

1. Sentiment - How your customers feel
2. Clickstream - Website visitors' data
3. Sensor/Machine - Data from remote sensors and machines
4. Geographic - Location-based data
5. Server Logs
6. Text - Millions of web pages, emails, and documents


Sentiment Use Case

• Analyze customer sentiment on the days leading up to and following the
release of the movie Iron Man 3 (™ Marvel Comics).
• Questions to answer:
o How did the public feel about the debut?
o How might the sentiment data have been used to better promote the
launch of the movie?
Getting Twitter Feeds into Hadoop

Flume is a tool for streaming data into Hadoop. A Flume agent streams tweets
into the Hadoop cluster, for example:
• Iron Man 3 was awesome. I want to go see it again!
• Iron Man 3 = 7.7 stars
• Tony Stark has 42 different Iron Man suits in Iron Man 3
• Wow as good as or better than the first two
• Thor was way better than Iron Man 3
Use HCatalog to Define a Schema

CREATE EXTERNAL TABLE tweets_raw (
    id BIGINT,
    created_at STRING,
    source STRING,
    favorited BOOLEAN,
    retweet_count INT,
    text STRING
)

The schema is stored in the HCatalog metastore.
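
As shown on the slide, the statement stops at the column list; a runnable
version also needs storage details. A minimal sketch, assuming tab-delimited
text files and a hypothetical HDFS directory:

CREATE EXTERNAL TABLE tweets_raw (
    id BIGINT,
    created_at STRING,
    source STRING,
    favorited BOOLEAN,
    retweet_count INT,
    text STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'  -- assumption: tab-delimited
LOCATION '/user/flume/tweets_raw';              -- hypothetical HDFS path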
Use Hive to Determine Sentiment

CREATE TABLE tweetsbi
STORED AS RCFile
AS
SELECT
    t.*,
    case s.sentiment
        when 'positive' then 2
        when 'neutral' then 1
        when 'negative' then 0
    end as sentiment
FROM tweets_clean t LEFT OUTER JOIN
     tweets_sentiment s on t.id = s.id;
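
Once tweetsbi exists, a short follow-up query summarizes the results. A
minimal sketch using only the columns defined above:

-- Count tweets per sentiment bucket (2 = positive, 1 = neutral, 0 = negative).
SELECT sentiment, COUNT(*) AS tweet_count
FROM tweetsbi
GROUP BY sentiment;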
View Spikes in Tweet Volume

Notice a large spike in tweets around the Thursday midnight opening and
spikes around the Friday evening, Saturday afternoon, and Saturday evening
showings.
View Sentiment by Country

Viewing the tweets on a map shows the sentiment of the movie by country. For
example, Ireland had 50% positive tweets, while 67% of tweets from Mexico
were neutral.
Geolocation Use Case
• A trucking company has over 100 trucks.
• The geolocation data collected from the trucks contains events generated
while the truck drivers are driving.
• The company’s goal with Hadoop is to:
o reduce fuel costs
o improve driver safety



The Geolocation Data
Here is what the collected data from the trucks’ sensors looks like:
• truckid
• driverid
• event
• latitude
• longitude
• city
• state
• velocity
• event_indicator (0 or 1)
• idling_indicator (0 or 1)
For example:
• A5 A5 unsafe following distance 41.526509 -124.038407 Klamath California 33 1 0
• A54 A54 normal 35.373292 -119.018712 Bakersfield California 19 0 0
• A48 A48 overspeed 38.752124 -121.288006 Roseville California 77 1 0



Getting the Raw Data into Hadoop

Flume is a tool for streaming data into Hadoop. A Flume agent streams the raw
sensor data from the trucks into the Hadoop cluster:

A5 A5 unsafe following distance 41.526509 -124.038407 Klamath California 33 1 0
A54 A54 normal 35.373292 -119.018712 Bakersfield California 19 0 0
A48 A48 overspeed 38.752124 -121.288006 Roseville California 77 1 0
...
The Truck Data
The truck data is stored in a database and looks like:
• driverid
• truckid
• model
• monthyear_miles
• monthyear_gas
• total_miles
• total_gas
• mileage
The miles and gas figures go back to 2009.



Getting the Truck Data into Hadoop

Sqoop is a tool for transferring data between an RDBMS and Hadoop. A Sqoop
job imports the RDBMS table containing the info on the fleet of trucks into
the Hadoop cluster.
HCatalog Stores a Shared Schema

create table trucks (
    driverid string,
    truckid string,
    model string,
    monthyear_miles int,
    monthyear_gas int,
    total_miles int,
    total_gas double,
    mileage double
);

create table events (
    truckid string,
    driverid string,
    event string,
    latitude double,
    longitude double,
    city string,
    state string,
    velocity double,
    event_indicator boolean,
    idling_indicator boolean
);

create table riskfactor (
    driverid string,
    riskfactor float
);

The schemas are stored in the HCatalog metastore.
Data Analysis
We want to answer two questions:
• Which trucks are wasting fuel through unnecessary idling?
• Which drivers are most frequently involved in unsafe events
on the road?
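
As a hedged sketch of the first question, the HiveQL below ranks trucks by
idling events using the events table defined above (idling_indicator marks
idling):

-- Sketch: trucks with the most idling events.
SELECT truckid, COUNT(*) AS idle_events
FROM events
WHERE idling_indicator = true
GROUP BY truckid
ORDER BY idle_events DESC;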



Use Hive to Compute Truck Mileage

CREATE TABLE truck_mileage AS
SELECT truckid, rdate, miles, gas, miles/gas mpg
FROM trucks
LATERAL VIEW stack(54,
    'jun13',jun13_miles,jun13_gas,
    'may13',may13_miles,may13_gas,
    'apr13',apr13_miles,apr13_gas,...
) dummyalias AS rdate, miles, gas;
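
The stack() call unpivots the 54 month/miles/gas column triples into one row
per truck per month. From there, fleet fuel efficiency is a simple aggregate;
a minimal sketch:

-- Sketch: average miles per gallon for each truck.
SELECT truckid, AVG(mpg) AS avg_mpg
FROM truck_mileage
GROUP BY truckid;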
Use Pig to Compute a Risk Factor

a = LOAD 'events'
using org.apache.hive.hcatalog.pig.HCatLoader();
b = filter a by event != 'Normal';
c = foreach b
generate driverid, event, (int) '1' as occurance;
d = group c by driverid;
e = foreach d generate group as driverid,
SUM(c.occurance) as t_occ;
f = LOAD 'trucks'
using org.apache.hive.hcatalog.pig.HCatLoader();
g = foreach f generate driverid,
((int) apr09_miles + (int) apr10_miles) as t_miles;
join_d = join e by (driverid), g by (driverid);
final_data = foreach join_d generate
$0 as driverid, (float) $1/$3*1000 as riskfactor;
store final_data into 'riskfactor'
using org.apache.hive.hcatalog.pig.HCatStorer();
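
Because the result is stored through HCatStorer, the riskfactor table defined
earlier is immediately queryable from Hive. A minimal sketch:

-- Sketch: the ten highest-risk drivers.
SELECT driverid, riskfactor
FROM riskfactor
ORDER BY riskfactor DESC
LIMIT 10;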



Risk Factors Viewed in a Graph



Risk Factors Viewed on a Map



About Hadoop
• Framework for solving data-intensive processes
• Designed to scale massively
• Very fast for very large jobs
• Variety of processing engines
• Designed for hardware and software failures



Relational Databases vs. Hadoop

• Schema: required on write (relational) vs. required on read (Hadoop)
• Speed: reads are fast vs. writes are fast
• Governance: standards and structured vs. loosely structured
• Processing: limited, no data processing vs. processing coupled with data
• Data types: structured vs. multi- and unstructured
• Analytics: interactive OLAP vs. data discovery
• Best-fit use: complex ACID transactions and an operational data store vs.
processing unstructured data with massive storage/processing
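
Schema-on-read is the pivotal difference. As a hedged sketch (the file layout
and HDFS path are hypothetical), HiveQL can layer a schema over raw files
already sitting in HDFS without rewriting them:

-- Sketch of schema-on-read: the schema is applied when the raw files are
-- queried, not when they are written. Delimiter and path are assumptions.
CREATE EXTERNAL TABLE raw_events (
    event_time STRING,
    event_type STRING,
    payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw_events';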
About Hadoop 2.x
The Apache Hadoop 2.x project consists of the following modules:
• Hadoop Common: the utilities that provide support for the
other Hadoop modules
• HDFS: the Hadoop Distributed File System
• YARN: a framework for job scheduling and cluster resource
management
• MapReduce: for processing large data sets in a scalable and
parallel fashion



New in Hadoop 2.x

YARN is a re-architecture of Hadoop that allows multiple applications to run
on the same platform.

• Hadoop 1.x: MapReduce handles both cluster resource management and data
processing, on top of HDFS (redundant, reliable storage).
• Hadoop 2.x: YARN takes over cluster resource management, so MapReduce and
other data processing engines run side by side on top of HDFS (redundant,
reliable storage).


The Hadoop Ecosystem

(Figure: the ecosystem of projects surrounding core Hadoop)


Enterprise ready Hadoop Platforms


Source – https://hortonworks.com/products/data-center/hdp/
Hadoop distros - HDP


Source – https://hortonworks.com/products/data-center/hdp/
Data Management and Operations Frameworks
Framework Description
Hadoop Distributed File A Java-based, distributed file system that provides scalable, reliable, high-throughput
System (HDFS) access to application data stored across commodity servers
Yet Another Resource A framework for cluster resource management and job scheduling
Negotiator (YARN)

Framework Description
Ambari A Web-based framework for provisioning, managing, and monitoring Hadoop clusters
ZooKeeper A high-performance coordination service for distributed applications
Cloudbreak A tool for provisioning and managing Hadoop clusters in the cloud
Oozie A server-based workflow engine used to execute Hadoop jobs

These brief descriptions are provided for quick convenience. More detailed
descriptions are available online.
Data Access Frameworks
Framework Description
Pig A high-level platform for extracting, transforming, or analyzing large datasets
Hive A data warehouse infrastructure that supports ad hoc SQL queries
HCatalog A table information, schema, and metadata management layer supporting Hive, Pig,
MapReduce, and Tez processing
Cascading An application development framework for building data applications, abstracting
the details of complex MapReduce programming
HBase A scalable, distributed NoSQL database that supports structured data storage for
large tables
Phoenix A client-side SQL layer over HBase that provides low-latency access to HBase data
Accumulo A low-latency, large table data storage and retrieval system with cell-level security
Storm A distributed computation system for processing continuous streams of real-time
data
Solr A distributed search platform capable of indexing petabytes of data
Spark A fast, general purpose processing engine used to build and run
sophisticated SQL, streaming, machine learning, or graphics applications
Governance and Integration Frameworks
Framework Description
Falcon A data governance tool providing workflow orchestration, data lifecycle
management, and data replication services.
WebHDFS A REST API that uses the standard HTTP verbs to access, operate, and manage
HDFS
HDFS NFS A gateway that enables access to HDFS as an NFS mounted file system
Gateway
Flume A distributed, reliable, and highly-available service that efficiently collects,
aggregates, and moves streaming data
Sqoop A set of tools for importing and exporting data between Hadoop and
RDBMS systems
Kafka A fast, scalable, durable, and fault-tolerant publish-subscribe messaging system
Atlas A scalable and extensible set of core governance services enabling enterprises
to meet compliance and data integration requirements



Security Frameworks

Framework Description
HDFS A storage management service providing file and directory permissions, even more
granular file and directory access control lists, and transparent data encryption
YARN A resource management service with access control lists controlling access to
compute resources and YARN administrative functions

Hive A data warehouse infrastructure service providing granular access controls to table
columns and rows
Falcon A data governance tool providing access control lists that limit who may submit
Hadoop jobs
Knox A gateway providing perimeter security to a Hadoop cluster
Ranger A centralized security framework offering fine-grained policy controls for HDFS, Hive,
HBase, Knox, Storm, Kafka, and Solr



Hadoop and the Data Lifecycle

(Figure: as data is ingested and examined, lifecycle questions arise; whether
to transform it, replicate it, discard results, or archive it, and, for
archives, which storage tier to use (HDFS tiers or cloud storage) and how
long to keep the data (90 days, 1 year, 3 years, 7 years, or until deleted).)
The Path to ROI

1. Put the raw data into HDFS (the Hadoop Distributed File System) in its raw
format.
2. Use Pig to explore and transform it into structured data.
3. Data analysts use Hive to query the data; answers to questions = $$.
4. Data scientists use MapReduce, R, and Mahout to mine the data; hidden
gems = $$.
Hadoop Deployment Options
⬢ There are choices when deploying Hadoop:
► Deploy on-premise in your own data center
► Deploy in the cloud
► Deploy on Microsoft Windows
► Deploy on Linux

Deployment choices: Linux or Windows, each either on-premise or in the cloud.


Hadoop Deployment Modes
⬢ Hadoop may be deployed in three different modes:
► Standalone mode
► Pseudo-distributed mode
► Distributed mode
Standalone Mode
⬢ Single system installation
⬢ All Hadoop service daemons (YARN, HDFS, MapReduce, ...) run in a
single Java virtual machine (JVM)
⬢ Uses the file system on local disk
⬢ Suitable for test and development, or introductory training
Pseudo-Distributed Mode
⬢ Single system installation
⬢ Each Hadoop service daemon (NameNode, ResourceManager, DataNode,
NodeManager) runs in its own JVM
⬢ Uses HDFS on local disk(s)
⬢ Appropriate for quality assurance, test and development
⬢ Format used for the Hortonworks Sandbox
Distributed Mode
► Multi-system installation: a master node plus multiple worker nodes
► Each Hadoop service daemon runs in its own JVM
► Multiple JVMs per system is common
► Uses HDFS on local disk(s)
► HDFS is distributed across systems
► Best and typical for production environments


Ambari Cluster Management Features
► Ambari is the primary cluster management interface in today's
Hadoop cluster.
► Ambari provides a single control point with many important features:
• Interactive wizard-driven installation
• Non-interactive API-driven cluster installation
• Ambari Views custom Web plugin tools
• REST API for integration with other vendor tools
• Granular control of cluster services' start up and shut down
• Dashboard cluster monitoring with alerts
• Cluster service configuration management
Ambari Architecture

(Figure: the Ambari Server hosts a database, a REST API, the Ambari Web UI,
and View plugins; an Ambari Agent runs on each cluster node and communicates
with the server.)


Ambari Server Architecture

(Figure: the JavaScript Ambari Web UI calls the Java Ambari Server through
its REST API, for example GET /clusters/horton/services/HDFS/components/DATANODE.
The server handles user/group authentication against the Ambari DB or
LDAP/AD, orchestration, metrics, and monitoring and alerts. The Ambari DB
stores cluster configuration and topology plus users and groups. Python
scripts implement the agent side, and Metrics Monitors feed a Metrics
Collector for metrics storage.)

The Ambari Web UI


Ambari Views
⬢ Views are Web applications that are plugged into Ambari.
⬢ Views enable organizations to extend and customize Ambari Web.
⬢ Developers write View packages. A View package contains a Web UI, an
optional API, and a Provider, and may be developed by the community, by
partners, or by Hortonworks. The Views Framework itself is part of
Ambari core.
⬢ Administrators deploy View packages to the Ambari server. A package
includes server and client-side software, and possibly new APIs.
⬢ Ambari administrators create View instances.
⬢ Administrators entitle users to access specific Views.
Sample Ambari View

(Figure: the Ambari Web UI, with a control that displays the available
Ambari Views.)


Lesson Review
1. What are 1,024 petabytes known as?
2. What are 1,024 exabytes known as?
3. List the three Vs of big data.
4. Sentiment is one of the six key types of big data. List the other five.
5. What technology might you use to stream Twitter feeds into Hadoop?
6. What technology might you use to define, store, and share the schemas of
your big data stored in Hadoop?
7. What are the two main new components in Hadoop 2.x?
Lab: Start an HDP 2.6 Cluster
