Module 10: Oozie and Hadoop Project
Course Topics
» Module 1: Understanding Big Data and Hadoop
» Module 2: Hadoop Architecture and HDFS
» Module 3: Hadoop MapReduce Framework
» Module 4: Advance MapReduce
» Module 5: PIG
» Module 6: HIVE
» Module 7: Advance HIVE and HBase
» Module 8: Advance HBase
» Module 9: Processing Distributed Data with Apache Spark
» Module 10: Oozie and Hadoop Project
Objectives
At the end of this module, you will be able to:
» Implement Flume and Sqoop
» Understand Oozie
Flume and Sqoop
For detailed steps on importing data from MySQL to HDFS using Sqoop on the Edureka VM: Click Here
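As a quick refresher, a typical MySQL-to-HDFS import looks like the sketch below. The database, table, credentials, and target directory are illustrative placeholders, not values from the VM guide:

# Import a MySQL table into HDFS with Sqoop.
# testdb, employees, and the credentials are hypothetical.
$ sqoop import \
    --connect jdbc:mysql://localhost/testdb \
    --username root \
    --password hadoop \
    --table employees \
    --target-dir /user/edureka/employees \
    -m 1

The -m 1 flag runs a single mapper, which avoids the need for a --split-by column on tables without a primary key.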
Oozie
Oozie is a workflow/coordination system that you can use to manage Apache Hadoop jobs.
The Oozie server is a web application that runs in a Java servlet container (the standard Oozie distribution uses Tomcat).
The server supports reading and executing Workflow, Coordinator, and Bundle definitions.
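Once the server is up, the Oozie client can confirm it is reachable. A minimal check, assuming the server listens on the default port 11000:

$ oozie admin -oozie http://localhost:11000/oozie -status

This prints the system mode (NORMAL when the server is healthy).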
Oozie
Oozie Functional Components (diagram)
Oozie Overview
Main Features and Adoption (diagram)
Apache Oozie
Oozie – Workflow
Oozie Workflow
A workflow job can be in any of the following states:
PREP: When a workflow job is first created, it is in the PREP state: the job is defined but not yet running.
RUNNING: When a PREP workflow job is started, it goes into the RUNNING state. It remains RUNNING until it reaches an end state, ends in error, or is suspended.
SUSPENDED: A RUNNING workflow job can be suspended. It remains SUSPENDED until it is resumed or killed.
(State-transition diagram: PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED)
Oozie Workflow (Contd.)
A workflow job can be in any of the following states:
SUCCEEDED: When a RUNNING workflow job reaches its end node, it ends in the SUCCEEDED final state.
KILLED: When a PREP, RUNNING, or SUSPENDED workflow job is killed by an administrator or the owner via a request to Oozie, the job ends in the KILLED final state.
FAILED: When a RUNNING workflow job fails due to an unexpected error, it ends in the FAILED final state.
Scheduling with Oozie
(diagram: an Oozie Coordinator Job triggers a recurring MapReduce workflow)
Annie’s Question
Which Oozie component is responsible for scheduling recurring Workflow jobs?
Annie’s Answer
The Coordinator Engine.
Oozie - workflow.xml
The bare minimum workflow XML defines a name, a starting point, and an end point.

<workflow-app xmlns="uri:oozie:workflow:0.1" name="WorkflowRunnerTest">
    <start to="process1"/>
    <end name="end"/>
</workflow-app>

The workflow definition language is XML-based and is called hPDL (Hadoop Process Definition Language).

Start node (start): specifies the starting point of an Oozie Workflow.
End node (end): specifies the end point of an Oozie Workflow.
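Before deploying a definition, it can be checked against the hPDL schema with the Oozie client's validate command (a quick sanity check, assuming the Oozie client is installed and workflow.xml is in the current directory):

$ oozie validate workflow.xml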
Oozie – workflow.xml
To this we need to add an action, and within that we specify the map-reduce parameters.

<action name="process1">
    <map-reduce>
        <job-tracker>localhost:8032</job-tracker>
        <name-node>hdfs://localhost:9000</name-node>
        <prepare>
            <delete path="hdfs://localhost:9000/WordCountTest/out1"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.input.dir</name>
                <value>${inputDir}</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>${outputDir}</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>

Action nodes provide a way for a Workflow to initiate the execution of a computation/processing task. This action runs a Hadoop MapReduce job.

Remember: actions require <ok> and <error> tags to direct the next node on success or failure. The "fail" transition typically points to a kill node defined elsewhere in the workflow.
Oozie - job.properties and lib
job.properties files provide another place where job arguments can be specified.
All of the properties specified will be available in the job execution context, and consequently
can be used throughout the job.
nameNode=hdfs://localhost:9000
jobTracker=localhost:8032
oozie.wf.application.path=${nameNode}/WordCountTest
There is a lib directory which contains libraries used in the workflow (such as jar files (.jar)
or shared object files (.so)).
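With the application directory staged in HDFS and job.properties on the local machine, the job can be submitted and started in one step. A minimal sketch, assuming the Oozie server runs at its default URL:

$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run

The command prints the new workflow job ID (for example, 1-20090525161321-oozie-xyz-W), which is used in all later monitoring commands.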
Annie’s Question
True False
Annie’s Answer
True
DEMO ON OOZIE WORKFLOW
Running Oozie Application
Create Application
Step 2: Write the application and create the jar (for example, a MapReduce jar). Move this jar to the lib folder in the WordCountTest directory, as shown in the sketch below.
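The staging itself is plain HDFS shell work. A sketch following the WordCountTest layout above; wordcount.jar is an illustrative jar name:

$ hdfs dfs -mkdir -p /WordCountTest/lib               # application directory plus lib folder
$ hdfs dfs -put workflow.xml /WordCountTest/          # workflow definition at the app root
$ hdfs dfs -put wordcount.jar /WordCountTest/lib/     # MapReduce jar goes into lib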
Monitoring an Oozie Workflow Job
Workflow Job Status:
$ oozie job -info 1-20090525161321-oozie-xyz-W
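The same job ID works with the other lifecycle subcommands of the standard Oozie client (sketch assumes OOZIE_URL is exported, so the -oozie option can be omitted):

$ oozie job -log 1-20090525161321-oozie-xyz-W      # fetch the job log
$ oozie job -suspend 1-20090525161321-oozie-xyz-W  # pause a RUNNING job
$ oozie job -resume 1-20090525161321-oozie-xyz-W   # resume a SUSPENDED job
$ oozie job -kill 1-20090525161321-oozie-xyz-W     # end the job in the KILLED state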
Oozie – Coordinator
Oozie Coordinator
The Oozie Coordinator supports the automated starting of Oozie Workflow processes.
It is typically used for the design and execution of recurring invocations of Workflow
processes triggered by time and/or data availability.
Oozie Coordinator Properties and XML
We will start with a Coordinator that schedules the wordcount example every 60 minutes.
Moreover, Oozie coordinators can be parameterized using variables like ${inputDir}, ${startTime}, etc. within the coordinator definition.
When submitting a coordinator job, values for the parameters must be provided as input. As parameters are key-value pairs, they can be written in a coordinator.properties file or an XML file.
oozie.coord.application.path=${nameNode}/WordCountTest_TimeBased
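Submission mirrors the workflow case, with -config pointing at the coordinator properties. A sketch, reusing the default server URL:

$ oozie job -oozie http://localhost:11000/oozie -config coordinator.properties -run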
Oozie Application Lifecycle
(diagram: the Oozie Coordinator Engine materializes a Coordinator Job into a series of workflow (WF) actions, from start to end)
Use Case 1: Time Triggers
Example 1: Run Workflow every 15 mins
<coordinator-app name="coord1"
                 start="2009-01-08T00:00Z"
                 end="2010-01-01T00:00Z"
                 frequency="15"
                 xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>hdfs://localhost:9000/WordCountTest_TimeBased</app-path>
            <configuration>
                <property>
                    <name>key1</name>
                    <value>value1</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
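Note that a bare frequency value is interpreted in minutes, so frequency="15" materializes one workflow run every 15 minutes between the start and end times.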
Use Case 2: Time and Data Triggers
Materialize your workflow every hour, but only run it when the input data is ready.
(diagram: before each hourly run, the coordinator checks whether the input data exists in Hadoop)
Example 2: Data Triggers
<coordinator-app name="coord1" frequency="${1*HOURS}" …>
    <datasets>
        <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
            <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="inputLogs" dataset="logs">
            <instance>${current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs://localhost:9000/WordCountTest_TimeBased</app-path>
            <configuration>
                <property>
                    <name>inputData</name>
                    <value>${dataIn('inputLogs')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
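To make the EL functions concrete: for the coordinator action materialized at nominal time 2009-01-02T05:00Z, ${current(0)} refers to the dataset instance for that same hour, so ${dataIn('inputLogs')} would resolve to hdfs://bar:9000/app/logs/2009/01/02/05. This resolution is illustrative; actual values depend on the coordinator's start time.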
Use Case 3: Rolling Windows
Access 15-minute datasets and roll them up into hourly datasets.
(diagram: 15-minute instances aggregated into hourly datasets at 01:00, 02:00, …)
Monitoring an Oozie Coordinator Job
Coordinator Job Status:
$ oozie job -info 1-20090525161321-oozie-xyz-C
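Individual materialized actions of a coordinator can also be rerun by action number (a sketch; -action accepts single numbers or ranges):

$ oozie job -rerun 1-20090525161321-oozie-xyz-C -action 1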
Slide 31 www.edureka.co/big-data-and-hadoop
Some Oozie Commands
Checking the status of multiple Workflow Jobs:
$ oozie jobs -oozie http://localhost:11000/oozie -localtime -len 2 -filter status=RUNNING

Checking the Oozie server version:
$ oozie admin -oozie http://localhost:11000/oozie -version
Slide 32 www.edureka.co/big-data-and-hadoop
Oozie Web Console: List Jobs
Oozie Web Console: Job Details
Oozie Web Console: Failed Actions
Oozie Web Console: Error Messages
Project
Use Case – How do I find out the best rated links per category?
Use Case – The type of data we are dealing with!
Abstract Flow Diagram
Flow Diagram
(diagram: huge raw XML files with unstructured data like reviews are stored in HDFS, processed with PIG and HIVE, and exported via SQOOP to a user interface for searching the top rated links per category)
Revision: Map-Reduce Phase
(diagram: huge raw XML files with unstructured data like reviews are loaded from HDFS into a Map-Reduce job)
Map-Reduce to Pig Phase
(diagram: the Map-Reduce output is handed off to PIG for further processing)
Pig to Hive Phase
Hive to Sqoop Phase: Dumping Data to MySQL
(diagram: HIVE output is exported via SQOOP to MySQL, which backs the web interface)
In a Nutshell
(diagram: huge raw XML files with unstructured data like reviews flow through HDFS, Map-Reduce, PIG, HIVE, and SQOOP to produce structured output data)
Hadoop Ecosystem
(diagram: the Hadoop ecosystem built on HDFS)
Assignment
Execute the Oozie practicals.
Edureka Certification
To achieve the Edureka Certification, you need to complete a project, which helps you apply all the concepts you have learnt during your Hadoop classes.
What Next?
Big Data in 10 minutes
Learn Big Data not in months but in minutes!! Sounds too good? But it's true.
Why Talend?
Talend is the only graphical user interface tool capable of "translating" an ETL job into a MapReduce job. A Talend ETL job is thus executed as a MapReduce job on Hadoop, getting the big data work done in minutes.
This is a key innovation that helps reduce entry barriers to Big Data technology and allows ETL job developers (beginners and advanced) to carry out Data Warehouse offloading to a greater extent.
With its Eclipse-based graphical workspace, Talend Open Studio for Big Data enables developers and data scientists to leverage Hadoop loading and processing technologies like HDFS, HBase, Hive, and Pig without having to write Hadoop application code.
Hadoop applications, seamlessly integrated within minutes using Talend.
Why Talend? (Contd.)
By simply selecting graphical components from a palette, then arranging and configuring them, you can create Hadoop jobs.
For example:
Talend Hadoop Integration (Contd.)
For Hadoop applications to be truly accessible to your organization, they need to be smoothly integrated into your overall data flows.
Talend Open Studio for Big Data is the ideal tool for integrating Hadoop applications into your broader data architecture.
Talend provides more built-in connector components than any other data integration solution available, with more than 800 connectors that make it easy to read from or write to any major file format, database, or packaged enterprise application.
For example, in Talend Open Studio for Big Data, you can use drag-and-drop configurable components to create data integration flows that move data from delimited log files into Hadoop Hive, perform operations in Hive, and extract data from Hive into a MySQL database (or Oracle, Sybase, SQL Server, and so on).
Who can use "Talend for Big Data"?
Thank you for being with Edureka!!
We would like to remind you that your association with Edureka does not stop here!