Pig Hive
Unit 3:
NoSQL – Pig: Introduction to Pig, Execution Modes of Pig, Comparison
of Pig with Databases, Grunt, Pig Latin, User Defined Functions,
Data Processing Operators – Hive: Hive Shell, Hive Services, Hive
Metastore, Comparison with Traditional Databases, HiveQL, Tables,
Querying – MongoDB: Needs, Terms, Data Types, Query Language –
Cassandra: Introduction, Features, Querying Commands.
• Pig Overview
• Execution Modes
• Installation
• Pig Latin Basics
• Developing Pig Script
– Most Occurred Start Letter
• Resources
What Is Pig?
• Developed by Yahoo! and a top-level Apache project
• Immediately makes data on a cluster available to
non-Java programmers via Pig Latin – a dataflow
language
• Interprets Pig Latin and generates MapReduce jobs
that run on the cluster
• Enables easy data summarization, ad-hoc reporting
and querying, and analysis of large volumes of data
• Pig interpreter runs on a client machine – no
administrative overhead required
What Is Pig?
• Framework for analyzing large unstructured and semi-structured
data on top of Hadoop
Pig vs. MapReduce
• Code efficiency: Pig code is less efficient than hand-written MapReduce; MapReduce code efficiency is higher than Pig's.
• Nested data types: Pig allows nested data types such as map, tuple and bag; MapReduce does not allow nested data types.
Execution Modes
• Pig has two execution modes
– Local Mode – all files are installed and run using your local
host and file system. Pig runs in a single JVM and accesses the
local filesystem. This mode is suitable only for small datasets
and when trying out Pig.
– MapReduce Mode – the default mode; Pig translates statements into
MapReduce jobs and runs them on a Hadoop cluster, reading from and
writing to HDFS.
• Script -> Pig can run a script file that contains Pig commands
• Embedded -> users can run Pig programs from Java using the
PigServer class
Execution Modes
• Interactive
– By using the Grunt shell by invoking Pig on the
command line
$ pig
grunt>
• Batch
– Run Pig in batch mode using Pig Scripts and the
"pig" command
$ pig -f id.pig -p <param>=<value> ...
PIG LATIN DATA FLOW
A LOAD statement to read data from the file system
A series of "transformation" statements to
process the data
A DUMP statement to view results, or a STORE
statement to save the results
-- Extract words from each line and put them into a pig bag named ‘words’
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
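The line above is one step of the classic word-count flow. A minimal complete sketch, assuming a hypothetical input file named 'lines.txt':

```pig
-- load each line of text as a single chararray field (file name is an assumption)
input_lines = LOAD 'lines.txt' AS (line:chararray);

-- extract words from each line and put them into a pig bag named 'words'
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- group identical words together and count each group
word_groups = GROUP words BY word;
word_count = FOREACH word_groups GENERATE group AS word, COUNT(words) AS count;

DUMP word_count;
```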
Type – Description
Tuple – Ordered set of fields (a "row / record")
Bag – Collection of tuples (a "resultset / table")
Map – A set of key-value pairs; keys must be of type chararray
Pig Data Formats
• BinStorage
– Loads and stores data in machine-readable (binary) format
• PigStorage
– Loads and stores data as structured, field delimited text files
• TextLoader
– Loads unstructured data in UTF-8 format
• PigDump
– Stores data in UTF-8 format
• YourOwnFormat!
– via UDFs
Loading Data Into Pig
• Loads data from an HDFS file
var = LOAD 'employees.txt';
var = LOAD 'employees.txt' AS (id, name,
salary);
var = LOAD 'employees.txt' USING PigStorage()
AS (id, name, salary);
• Each LOAD statement defines a new bag
– Each bag can have multiple elements (atoms)
– Each element can be referenced by name or position ($n)
• A bag is immutable
• A bag can be aliased and referenced later
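For instance, fields of the employees bag loaded above can be referenced either by name or by position (a sketch reusing the hypothetical employees.txt file):

```pig
emp = LOAD 'employees.txt' USING PigStorage() AS (id, name, salary);

-- reference elements by name
names = FOREACH emp GENERATE name;

-- reference elements by position: $0 = id, $2 = salary
pay = FOREACH emp GENERATE $0, $2;
```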
Input And Output
• STORE
– Writes output to an HDFS file in a specified directory
grunt> STORE processed INTO 'processed_txt';
• Fails if directory exists
• Writes output files, part-[m|r]-xxxxx, to the directory
– PigStorage can be used to specify a field delimiter
• DUMP
– Write output to screen
grunt> DUMP processed;
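For example, a field delimiter can be passed to PigStorage when writing (a sketch; the relation and directory names are assumptions):

```pig
-- write 'processed' as comma-delimited text instead of the default tab
STORE processed INTO 'processed_csv' USING PigStorage(',');
```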
Relational Operators
• FOREACH
– Applies expressions to every record in a bag
• FILTER
– Filters by expression
• GROUP
– Collect records with the same key
• ORDER BY
– Sorting
• DISTINCT
– Removes duplicates
FOREACH . . .GENERATE
• Use the FOREACH …GENERATE operator to work
with rows of data, call functions, etc.
• Basic syntax:
alias2 = FOREACH alias1 GENERATE expression;
• Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FOREACH alias1 GENERATE col1, col2;
DUMP alias2;
(1,2) (4,2) (8,3) (4,3) (7,2) (8,4)
FILTER. . .BY
• Use the FILTER operator to restrict tuples or rows
of data
• Basic syntax:
alias2 = FILTER alias1 BY expression;
• Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FILTER alias1 BY (col1 == 8) OR (NOT
(col2+col3 > col1));
DUMP alias2;
(4,2,1) (8,3,4) (7,2,5) (8,4,3)
GROUP. . .ALL
• Use the GROUP…ALL operator to group data
– Use GROUP when only one relation is involved
– Use COGROUP when multiple relations are involved
• Basic syntax:
alias2 = GROUP alias1 ALL;
alias2 = GROUP alias1 BY field_alias;
• Example:
DUMP alias1;
(John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F)
(Joe,18,3.8F)
alias2 = GROUP alias1 BY col2;
DUMP alias2;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
ORDER. . .BY
• Use the ORDER…BY operator to sort a relation
based on one or more fields
• Basic syntax:
alias = ORDER alias BY field_alias [ASC|DESC];
• Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)
DISTINCT. . .
• Use the DISTINCT operator to remove
duplicate tuples in a relation.
• Basic syntax:
alias2 = DISTINCT alias1;
• Example:
DUMP alias1;
(8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3)
alias2= DISTINCT alias1;
DUMP alias2;
(8,3,4) (1,2,3) (4,3,3)
Relational Operators
• FLATTEN
– Used to un-nest tuples as well as bags
• INNER JOIN
– Used to perform an inner join of two or more relations based on
common field values
• OUTER JOIN
– Used to perform left, right or full outer joins
• SPLIT
– Used to partition the contents of a relation into two or more
relations
• SAMPLE
– Used to select a random data sample with the stated sample size
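A short sketch of SPLIT and SAMPLE on a hypothetical relation 'scores' of (id, score) tuples:

```pig
-- partition records into two relations by a condition
SPLIT scores INTO high IF score >= 50, low IF score < 50;

-- take a random sample of roughly 10% of the relation
subset = SAMPLE scores 0.1;
```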
INNER JOIN. . .
• Use the JOIN operator to perform an inner, equi-join
of two or more relations based on common
field values
• The JOIN operator always performs an inner join
• Inner joins ignore null keys
– Filter null keys before the join
• JOIN and COGROUP operators perform similar
functions
– JOIN creates a flat set of output records
– COGROUP creates a nested set of output records
INNER JOIN Example
DUMP Alias1;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
DUMP Alias2;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
Join Alias1 by Col1 to Alias2 by Col1:
Alias3 = JOIN Alias1 BY Col1, Alias2 BY Col1;
DUMP Alias3;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
OUTER JOIN. . .
• Use the OUTER JOIN operator to perform left, right, or full
outer joins
– Pig Latin syntax closely adheres to the SQL standard
• The keyword OUTER is optional
– keywords LEFT, RIGHT and FULL will imply left outer, right outer and
full outer joins respectively
• Outer joins will only work provided the relations which need
to produce nulls (in the case of non-matching keys) have
schemas
• Outer joins will only work for two-way joins
– To perform a multi-way outer join perform multiple two-way outer
join statements
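A left outer join sketch (the relation and field names are assumptions; the right-hand relation must have a schema so Pig can produce nulls for non-matching keys):

```pig
-- keep every record of 'orders'; unmatched customers become nulls
result = JOIN orders BY cust_id LEFT OUTER, customers BY id;
```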
User-Defined Functions
• Natively written in Java, packaged as a jar file
– Other languages include Jython, JavaScript, Ruby,
Groovy, and Python
• Register the jar with the REGISTER statement
• Optionally, alias it with the DEFINE statement
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
DEFINE
• DEFINE can be used to work with UDFs and also
streaming commands
– Useful when dealing with complex input/output
formats
/* read and write comma-delimited data */
DEFINE Y 'stream.pl' INPUT(stdin USING PigStreaming(','))
OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;
• Step 1: Download the Pig version 0.17.0 tar file from the
official Apache Pig site. Navigate to
https://downloads.apache.org/pig/latest/ and download the file
'pig-0.17.0.tar.gz'.
Apache Pig Installation
• Step 2: Add PIG_HOME and PIG_HOME\bin to the path
environment variables.
Find the file 'pig.cmd' in the bin folder of the Pig installation
(C:\pig-0.17.0\bin)
State_Name : chararray ;
District_Name : chararray ;
Crop_Year : int ;
Season : chararray ;
Crop : chararray ;
Area : int ;
Production : int
No of rows: 246092
No of columns: 7
Apache Pig Example
2. Enter Pig local mode from the command line using:
$ pig -x local
grunt>
OUTPUT: district names, average production of all crops in each district after the year 2000.
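A sketch of a script producing that output, using the schema listed above (the file name and comma delimiter are assumptions):

```pig
-- hypothetical file name; schema taken from the slide above
crops = LOAD 'crop_production.csv' USING PigStorage(',')
    AS (State_Name:chararray, District_Name:chararray, Crop_Year:int,
        Season:chararray, Crop:chararray, Area:int, Production:int);

-- keep only records after the year 2000
recent = FILTER crops BY Crop_Year > 2000;

-- group by district and average production over all crops
by_district = GROUP recent BY District_Name;
avg_prod = FOREACH by_district GENERATE
    group AS District_Name, AVG(recent.Production) AS Avg_Production;

DUMP avg_prod;
```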
Apache Hive
– If multiple tables are joined, put the biggest table last; the
reducer will stream the last table and buffer the others
– Use left semi-joins to take the place of IN/EXISTS
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON a.key = b.key;
Advanced Hive Operations
• JOIN
– Do not specify join conditions in the WHERE clause
• Hive does not know how to optimise such queries
• Will compute a full Cartesian product before filtering it
• Join Example
SELECT
a.ymd, a.price_close, b.price_close
FROM stocks a
JOIN stocks b ON a.ymd = b.ymd
WHERE a.symbol = 'AAPL' AND
b.symbol = 'IBM' AND
a.ymd > '2010-01-01';
Hive Stinger
• MPP-style execution of Hive queries
• Available since Hive 0.13
• Runs on Apache Tez instead of MapReduce
• We will talk about this more when we get to
SQL on Hadoop
References
• http://hive.apache.org