Unit V-Apache Pig
Applications on Big Data Using Pig and Hive – Data processing operators in Pig – Hive services –
HiveQL – Querying Data in Hive - fundamentals of HBase and ZooKeeper - IBM InfoSphere
BigInsights and Streams. Visualizations - Visual data analysis techniques, interaction techniques;
Systems and applications
Apache Pig
• It is a tool/platform used to analyze large data sets by representing them as data
flows.
• Hadoop Pig is nothing but an abstraction over MapReduce.
• Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There are currently two
environments: local execution in a single JVM and distributed execution on a Hadoop
cluster.
• Pig is a scripting language for exploring large datasets.
Features of Pig
• Rich set of operators
◦ It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming
◦ Ability to process terabytes of data simply by issuing a half-dozen lines of Pig
Latin from the console.
• Optimization opportunities
◦ It allows users to focus on semantics rather than efficiency, to optimize their
execution automatically.
• Extensibility
◦ In order to do special-purpose processing, users can create their own functions.
◦ These functions are converted internally into Map and reduce tasks.
• User defined Functions
◦ Pig provides the facility to create User-defined Functions in other programming
languages such as Java and invoke or embed them in Pig Scripts.
• Handles all kinds of data
◦ Apache Pig analyzes all kinds of data, both structured as well as unstructured. It
stores the results in HDFS.
Why Pig?
• Using Pig Latin, programmers can perform MapReduce tasks easily without having
to type complex codes in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of code. For
example, an operation that would require around 200 lines of code (LoC) in Java
can often be done in as few as 10 LoC in Apache Pig. Ultimately, Apache Pig
reduces development time by almost 16 times.
• Pig Latin is SQL-like language and it is easy to learn Apache Pig when you are
familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins,
filters, ordering, etc. In addition, it also provides nested data types like tuples, bags,
and maps that are missing from MapReduce.
• Execution Engine
◦ The compiled MapReduce jobs are submitted to Hadoop in sorted order.
◦ Finally, these MapReduce jobs are executed on Hadoop, producing the desired
results.
Pig Vs MapReduce
• Pig is a data flow language; MapReduce is a data processing paradigm.
• Pig is a high-level language; MapReduce is low-level.
• For Pig, a programmer's SQL knowledge is enough; MapReduce requires exposure to Java.
• Pig code is short; the equivalent MapReduce code is about 20 times longer.
• Pig requires no compilation; the MapReduce compilation process is long.
Pig Vs SQL
• Pig Latin is a procedural language; SQL is a declarative language.
• In Pig, schema is optional; in SQL, schema declaration is essential.
• Pig uses a nested relational data model; SQL uses a flat relational model.
Applications of Pig
• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time-sensitive data loads.
• For ad-hoc analysis and quick prototyping on large datasets.
Pig Execution
• Prerequisites
◦ Hadoop and Java JDK installed.
• Download Pig
◦ Download a stable release from http://hadoop.apache.org/pig/releases.html
◦ Unpack the tarball in a suitable place on your workstation:
% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin
Execution Types
• Local Mode
◦ All the files are installed and run from your local host and local file system.
◦ There is no need of Hadoop or HDFS.
◦ This mode is generally used for testing purpose.
• Hadoop Mode
◦ Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster.
◦ The cluster may be a pseudo- or fully distributed cluster.
◦ This mode is mainly used to run Pig on large datasets.
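The execution type is selected with the -x option when starting the Grunt shell or running a script (script file name is illustrative):
% pig -x local              # local mode: local files, single JVM
% pig -x mapreduce          # Hadoop (MapReduce) mode — the default
% pig -x local script.pig   # run a Pig Latin script in local mode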
Grunt
• Grunt is an interactive shell for running Pig commands.
• Grunt is started when no file is specified for Pig to run.
• Grunt offers many shell and utility commands.
• The command sh is used to invoke any shell from Grunt shell.
• Any HDFS filesystem shell command can be invoked from the Grunt shell by using the fs command.
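For example, from the Grunt shell (the HDFS path is illustrative):
grunt> sh ls               -- invokes the local shell command 'ls'
grunt> fs -ls /pig_data    -- invokes the HDFS shell command 'ls' on /pig_data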
Pig Latin
• Pig Latin is the language used to analyze data in Hadoop.
• The data model of Pig is fully nested.
• A Relation is the outermost structure of the Pig Latin data model.
• A relation is a bag, where,
◦ A bag is a collection of tuples. Example − {(Raja, 30), (Mohammad, 45)}
◦ A tuple is an ordered set of fields. Example − (Raja, 30)
◦ A field is a piece of data. Example − ‘raja’ or ‘30’
◦ A map (or data map) is a set of key-value pairs. The key needs to be of type chararray
and should be unique. The value might be of any type. It is represented by ‘[]’
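These structures can be illustrated with sample values (the student data is hypothetical):
(Raja, 30)                         -- a tuple of two fields
{(Raja, 30), (Mohammad, 45)}       -- a bag containing two tuples
[name#Raja, age#30]                -- a map with chararray keys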
Given below is a Pig Latin statement, which loads data to Apache Pig.
The PigStorage() function loads and stores data as structured text files. It takes as a parameter the delimiter by which each entity of a tuple is separated. By default, the delimiter is ‘\t’ (tab).
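A typical load statement looks like the following sketch (the file name and schema are illustrative, assuming a comma-delimited file in HDFS):
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);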
Pig Latin Operators
Arithmetic Operators
+ Addition
- Subtraction
* Multiplication
/ Division
% Modulus
?: Conditional Operator
Comparison Operators
== Equality
!= Not Equal
< Less Than
> Greater Than
<= Less Than or equal to
>= Greater than or equal to
Filtering
FILTER
It removes unwanted rows from a relation.
DISTINCT
It removes duplicate rows from a relation.
FOREACH-GENERATE
It transforms the data based on columns of data.
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/
as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
Load this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray,age:int,
phone:chararray,city:chararray);
Get the id, age, and city values of each student from the relation student_details and store
them into another relation named foreach_data using the foreach operator:
grunt> foreach_data = FOREACH student_details GENERATE id, age, city;
Verify the relation foreach_data using the DUMP operator:
grunt> Dump foreach_data;
Output
It will produce the following output, displaying the contents of the relation foreach_data.
(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
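As a sketch, FILTER and DISTINCT can be applied to the same student_details relation (the city value and relation names are illustrative):
grunt> filter_data = FILTER student_details BY city == 'Chennai';  -- keep only Chennai rows
grunt> distinct_data = DISTINCT student_details;                    -- remove duplicate tuples
grunt> Dump filter_data;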
JOIN
It joins two or more relations based on common field values.
COGROUP
It groups the data from two or more relations.
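For illustration, assuming a second relation loaded from a hypothetical file orders.txt with schema (id, amount), JOIN and COGROUP on the student id could be written as:
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt'
   USING PigStorage(',') AS (id:int, amount:int);
grunt> join_data = JOIN student_details BY id, orders BY id;       -- inner join on id
grunt> cogroup_data = COGROUP student_details BY id, orders BY id; -- group both relations by id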
Sorting
ORDER
It arranges a relation in an order based on one or more fields.
LIMIT
We can get a particular number of tuples from a relation.
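Using the student_details relation loaded earlier, a sketch of sorting and limiting (relation names are illustrative):
grunt> order_by_data = ORDER student_details BY age DESC;  -- sort by age, descending
grunt> limit_data = LIMIT student_details 4;               -- take only four tuples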
Combining and Splitting
UNION
We can combine two or more relations into one relation.
SPLIT
To split a single relation into two or more relations.
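As a sketch using student_details (the second relation for UNION is assumed to have the same schema; all relation names are illustrative):
grunt> student_union = UNION student_details1, student_details2;
grunt> SPLIT student_details INTO student_young IF age < 23, student_old IF age >= 23;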
Diagnostic Operators
DUMP
It prints the content of a relation on the console.
DESCRIBE
It describes the schema of a relation.
EXPLAIN
We can view the logical and physical execution plans used to evaluate a relation.
ILLUSTRATE
It displays the step-by-step execution of a series of statements.
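Applied to the relations loaded above, the diagnostic operators are invoked as follows (a sketch):
grunt> Dump foreach_data;         -- print the relation's contents
grunt> Describe student_details;  -- show the schema of the relation
grunt> Explain foreach_data;      -- show the logical, physical, and MapReduce plans
grunt> Illustrate foreach_data;   -- show step-by-step execution on sample data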