Module 10: Oozie and Hadoop Project
Course Topics
» Module 1: Understanding Big Data and Hadoop
» Module 2: Hadoop Architecture and HDFS
» Module 3: Hadoop MapReduce Framework
» Module 4: Advance MapReduce
» Module 5: PIG
» Module 6: HIVE
» Module 7: Advance HIVE and HBase
» Module 8: Advance HBase
» Module 9: Processing Distributed Data with Apache Spark
» Module 10: Oozie and Hadoop Project
Objectives
At the end of this module, you will be able to:
» Implement Flume and Sqoop
» Understand Oozie
Flume and Sqoop
For detailed steps on importing data from MySQL to HDFS using Sqoop on the Edureka VM: Click Here
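As a quick refresher, a typical MySQL-to-HDFS import looks like the sketch below. The database, table, credentials, and target directory are illustrative placeholders, not values from the VM guide:

# Import a MySQL table into HDFS with Sqoop.
# testdb, employees, and the credentials are hypothetical.
$ sqoop import \
    --connect jdbc:mysql://localhost/testdb \
    --username root \
    --password hadoop \
    --table employees \
    --target-dir /user/edureka/employees \
    -m 1

The -m 1 flag runs a single mapper, which avoids the need for a --split-by column on tables without a primary key.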
Oozie
Oozie is a workflow/coordination system that you can use to manage Apache Hadoop jobs.
The Oozie server is a web application that runs in a Java servlet container (the standard Oozie distribution uses Tomcat).
The server supports reading and executing Workflow, Coordinator, and Bundle definitions.
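Once the server is up, the Oozie client can confirm it is reachable. A minimal check, assuming the server listens on the default port 11000:

$ oozie admin -oozie http://localhost:11000/oozie -status

This prints the system mode (NORMAL when the server is healthy).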
Oozie
Oozie Functional Components (diagram)
Oozie Overview
Main Features and Adoption (diagram)
Apache Oozie
Oozie – Workflow
Oozie Workflow
A workflow job can be in any of the following states:
PREP: When a workflow job is first created, it is in the PREP state: the job is defined but not yet running.
RUNNING: When a PREP workflow job is started, it goes into the RUNNING state. It remains RUNNING until it reaches an end state, ends in error, or is suspended.
SUSPENDED: A RUNNING workflow job can be suspended. It remains SUSPENDED until it is resumed or killed.
(State-transition diagram: PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED)
Oozie Workflow (Contd.)
A workflow job can be in any of the following states:
SUCCEEDED: When a RUNNING workflow job reaches its end node, it ends in the SUCCEEDED final state.
KILLED: When a PREP, RUNNING, or SUSPENDED workflow job is killed by an administrator or the owner via a request to Oozie, the job ends in the KILLED final state.
FAILED: When a RUNNING workflow job fails due to an unexpected error, it ends in the FAILED final state.
Scheduling with Oozie
(diagram: an Oozie Coordinator Job triggers a recurring MapReduce workflow)
Annie’s Question
Which Oozie component is responsible for scheduling recurring Workflow jobs?
Annie’s Answer
The Coordinator Engine.
Oozie - workflow.xml
The bare minimum workflow XML defines a name, a starting point, and an end point.

<workflow-app xmlns="uri:oozie:workflow:0.1" name="WorkflowRunnerTest">
    <start to="process1"/>
    <end name="end"/>
</workflow-app>

The workflow definition language is XML-based and is called hPDL (Hadoop Process Definition Language).

Start node (start): specifies the starting point of an Oozie Workflow.
End node (end): specifies the end point of an Oozie Workflow.
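Before deploying a definition, it can be checked against the hPDL schema with the Oozie client's validate command (a quick sanity check, assuming the Oozie client is installed and workflow.xml is in the current directory):

$ oozie validate workflow.xml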
Oozie – workflow.xml
To this we need to add an action, and within that we specify the map-reduce parameters.

<action name="process1">
    <map-reduce>
        <job-tracker>localhost:8032</job-tracker>
        <name-node>hdfs://localhost:9000</name-node>
        <prepare>
            <delete path="hdfs://localhost:9000/WordCountTest/out1"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.input.dir</name>
                <value>${inputDir}</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>${outputDir}</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>

Action nodes provide a way for a Workflow to initiate the execution of a computation/processing task. This action runs a Hadoop MapReduce job.

Remember: actions require <ok> and <error> tags to direct the next node on success or failure. The "fail" transition typically points to a kill node defined elsewhere in the workflow.
Oozie - job.properties and lib
job.properties files provide another place where job arguments can be specified.
All of the properties specified will be available in the job execution context, and consequently
can be used throughout the job.
nameNode=hdfs://localhost:9000
jobTracker=localhost:8032
oozie.wf.application.path=${nameNode}/WordCountTest
There is a lib directory which contains libraries used in the workflow (such as jar files (.jar)
or shared object files (.so)).
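With the application directory staged in HDFS and job.properties on the local machine, the job can be submitted and started in one step. A minimal sketch, assuming the Oozie server runs at its default URL:

$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run

The command prints the new workflow job ID (for example, 1-20090525161321-oozie-xyz-W), which is used in all later monitoring commands.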
Annie’s Question
True False
Annie’s Answer
True
DEMO ON OOZIE WORKFLOW
Running Oozie Application
Create Application
Step 2: Write the application and create the jar (for example, a MapReduce jar). Move this jar to the lib folder in the WordCountTest directory, as shown in the sketch below.
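The staging itself is plain HDFS shell work. A sketch following the WordCountTest layout above; wordcount.jar is an illustrative jar name:

$ hdfs dfs -mkdir -p /WordCountTest/lib               # application directory plus lib folder
$ hdfs dfs -put workflow.xml /WordCountTest/          # workflow definition at the app root
$ hdfs dfs -put wordcount.jar /WordCountTest/lib/     # MapReduce jar goes into lib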
Monitoring an Oozie Workflow Job
Workflow Job Status:
$ oozie job -info 1-20090525161321-oozie-xyz-W
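The same job ID works with the other lifecycle subcommands of the standard Oozie client (sketch assumes OOZIE_URL is exported, so the -oozie option can be omitted):

$ oozie job -log 1-20090525161321-oozie-xyz-W      # fetch the job log
$ oozie job -suspend 1-20090525161321-oozie-xyz-W  # pause a RUNNING job
$ oozie job -resume 1-20090525161321-oozie-xyz-W   # resume a SUSPENDED job
$ oozie job -kill 1-20090525161321-oozie-xyz-W     # end the job in the KILLED state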
Oozie – Coordinator
Oozie Coordinator
The Oozie Coordinator supports the automated starting of Oozie Workflow processes.
It is typically used for the design and execution of recurring invocations of Workflow
processes triggered by time and/or data availability.
Oozie Coordinator Properties and XML
We will start with a Coordinator that schedules the wordcount example every 60 minutes.
Moreover, Oozie coordinators can be parameterized using variables like ${inputDir}, ${startTime}, etc. within the coordinator definition.
When submitting a coordinator job, values for the parameters must be provided as input. As parameters are key-value pairs, they can be written in a coordinator.properties file or an XML file.
oozie.coord.application.path=${nameNode}/WordCountTest_TimeBased
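Submission mirrors the workflow case, with -config pointing at the coordinator properties. A sketch, reusing the default server URL:

$ oozie job -oozie http://localhost:11000/oozie -config coordinator.properties -run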
Oozie Application Lifecycle
(diagram: the Oozie Coordinator Engine materializes a Coordinator Job into a series of workflow (WF) actions, from start to end)
Use Case 1: Time Triggers
Example 1: Run Workflow every 15 mins
<coordinator-app name="coord1"
                 start="2009-01-08T00:00Z"
                 end="2010-01-01T00:00Z"
                 frequency="15"
                 xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>hdfs://localhost:9000/WordCountTest_TimeBased</app-path>
            <configuration>
                <property>
                    <name>key1</name>
                    <value>value1</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
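Note that a bare frequency value is interpreted in minutes, so frequency="15" materializes one workflow run every 15 minutes between the start and end times.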
Use Case 2: Time and Data Triggers
Materialize your workflow every hour, but only run it when the input data is ready.
(diagram: before each hourly run, the coordinator checks whether the input data exists in Hadoop)
Example 2: Data Triggers
<coordinator-app name="coord1" frequency="${1*HOURS}" …>
    <datasets>
        <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
            <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="inputLogs" dataset="logs">
            <instance>${current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs://localhost:9000/WordCountTest_TimeBased</app-path>
            <configuration>
                <property>
                    <name>inputData</name>
                    <value>${dataIn('inputLogs')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
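To make the EL functions concrete: for the coordinator action materialized at nominal time 2009-01-02T05:00Z, ${current(0)} refers to the dataset instance for that same hour, so ${dataIn('inputLogs')} would resolve to hdfs://bar:9000/app/logs/2009/01/02/05. This resolution is illustrative; actual values depend on the coordinator's start time.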
Use Case 3: Rolling Windows
Access 15-minute datasets and roll them up into hourly datasets.
(diagram: 15-minute instances aggregated into hourly datasets at 01:00, 02:00, …)
Monitoring an Oozie Coordinator Job
Coordinator Job Status:
$ oozie job -info 1-20090525161321-oozie-xyz-C
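Individual materialized actions of a coordinator can also be rerun by action number (a sketch; -action accepts single numbers or ranges):

$ oozie job -rerun 1-20090525161321-oozie-xyz-C -action 1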
Slide 31 www.edureka.co/big-data-and-hadoop
Some Oozie Commands
Checking the status of multiple Workflow Jobs:
$ oozie jobs -oozie http://localhost:11000/oozie -localtime -len 2 -filter status=RUNNING

Checking the Oozie server version:
$ oozie admin -oozie http://localhost:11000/oozie -version
Slide 32 www.edureka.co/big-data-and-hadoop
Oozie Web Console: List Jobs
Oozie Web Console: Job Details
Oozie Web Console: Failed Actions
Oozie Web Console: Error Messages
Project
Use Case – How do I find out the best rated links per category?
Use Case – The type of data we are dealing with!
Abstract Flow Diagram
Flow Diagram
(diagram: huge raw XML files with unstructured data like reviews are stored in HDFS, processed with PIG and HIVE, and exported via SQOOP to a user interface for searching the top rated links per category)
Revision: Map-Reduce Phase
(diagram: huge raw XML files with unstructured data like reviews are loaded from HDFS into a Map-Reduce job)
Map-Reduce to Pig Phase
(diagram: the Map-Reduce output is handed off to PIG for further processing)
Pig to Hive Phase
Hive to Sqoop Phase: Dumping Data to MySQL
(diagram: HIVE output is exported via SQOOP to MySQL, which backs the web interface)
In a Nutshell
(diagram: huge raw XML files with unstructured data like reviews flow through HDFS, Map-Reduce, PIG, HIVE, and SQOOP to produce structured output data)
Hadoop Ecosystem
(diagram: the Hadoop ecosystem built on HDFS)
Assignment
Execute the Oozie practicals.
Edureka Certification
To achieve the Edureka Certification, you need to complete a project, which helps you apply all the concepts you have learnt during your Hadoop classes.
What Next?
Big Data in 10 minutes
Learn Big Data not in months but in minutes!! Sounds too good? But it's true.
Why Talend?
Talend is the only graphical user interface tool capable of "translating" an ETL job into a MapReduce job. A Talend ETL job is thus executed as a MapReduce job on Hadoop, getting the big data work done in minutes.
This is a key innovation that helps reduce entry barriers to Big Data technology and allows ETL job developers (beginners and advanced) to carry out Data Warehouse offloading to a greater extent.
With its Eclipse-based graphical workspace, Talend Open Studio for Big Data enables developers and data scientists to leverage Hadoop loading and processing technologies like HDFS, HBase, Hive, and Pig without having to write Hadoop application code.
Hadoop applications, seamlessly integrated within minutes using Talend.
Why Talend? (Contd.)
By simply selecting graphical components from a palette, then arranging and configuring them, you can create Hadoop jobs.
For example:
Talend Hadoop Integration (Contd.)
For Hadoop applications to be truly accessible to your organization, they need to be smoothly integrated into your overall data flows.
Talend Open Studio for Big Data is the ideal tool for integrating Hadoop applications into your broader data architecture.
Talend provides more built-in connector components than any other data integration solution available, with more than 800 connectors that make it easy to read from or write to any major file format, database, or packaged enterprise application.
For example, in Talend Open Studio for Big Data, you can use drag-and-drop configurable components to create data integration flows that move data from delimited log files into Hadoop Hive, perform operations in Hive, and extract data from Hive into a MySQL database (or Oracle, Sybase, SQL Server, and so on).
Who can use "Talend for Big Data"?
Thank you for being with Edureka!!
We would like to remind you that your association with Edureka does not stop here!