Unit 5
Apache PIG & HIVE
What is PIG?
• Pig scripts are internally converted to MapReduce jobs and executed on data stored in HDFS. Apart from that, Pig can also run its jobs on Apache Tez or Apache Spark.
• Extensibility: It supports user-defined functions (UDFs), so programmers can create their own functions for processing data (see the sketch below).
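A minimal Pig Latin sketch of this extensibility; the jar, class and file names below are hypothetical placeholders, not part of any real project:

    REGISTER myudfs.jar;                               -- make the (hypothetical) UDF jar available
    DEFINE ToUpper com.example.pig.ToUpper();          -- bind an alias to the UDF class
    names   = LOAD 'names.txt' AS (name:chararray);    -- hypothetical input file
    shouted = FOREACH names GENERATE ToUpper(name);    -- apply the user-defined function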
Parser: Any Pig script or command entered in the Grunt shell is handled by the parser. The parser checks the script's syntax, does type checking, and performs various other checks. The output of these checks is a Directed Acyclic Graph (DAG) that contains the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are the nodes and the data flows between them are the edges (see the example script below).
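For illustration, a small Pig Latin script of the kind the parser turns into a DAG; the file and field names are hypothetical. Each statement becomes a logical-operator node, and the data flowing between statements forms the edges:

    orders  = LOAD 'orders.csv' USING PigStorage(',') AS (id:int, region:chararray, amount:double);
    big     = FILTER orders BY amount > 100.0;         -- keep only large orders
    grouped = GROUP big BY region;                      -- group by region
    totals  = FOREACH grouped GENERATE group AS region, SUM(big.amount) AS total;
    STORE totals INTO 'region_totals';                  -- write results out (to HDFS in MR mode)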
Execution Engine: Finally, all the MapReduce jobs generated by the compiler are submitted to Hadoop in sorted order. In the end, the MapReduce jobs are executed on Hadoop to produce the desired output.
Execution Modes: Pig works in two execution modes, depending on where the script runs and on data availability:
• Local Mode: Local mode is best suited for small data sets. Pig runs in a single JVM and all files are installed and run on the localhost, so parallel mapper execution is not possible. While loading data, Pig always looks into the local file system.
• MapReduce Mode (MR Mode): In MapReduce mode, the programmer needs access to a Hadoop cluster and an HDFS installation. In this mode, the data to be processed exists in HDFS. When a Pig script is executed in MR mode, the Pig Latin statements are converted into MapReduce jobs in the back end to perform the operations on the data. Pig uses MapReduce mode by default, so it does not have to be specified with the -x flag (see the invocation sketch below).
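As a sketch of how the two modes are selected on the command line (the script name is a placeholder):

    pig -x local region_totals.pig        # local mode: single JVM, reads the local file system
    pig -x mapreduce region_totals.pig    # MapReduce mode: the default, so -x mapreduce may be omitted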
Apart from the execution modes, there are three different execution mechanisms in Apache Pig:
Interactive Mode or Grunt Shell: Enter Pig statements one at a time and get the output in the command shell (see the sketch after this list).
Batch Mode or Script Mode: This is the non-interactive way of executing, where the programmer writes a Pig Latin script with a '.pig' extension and executes it.
Embedded Mode: Use a programming language such as Java in the script with the help of UDFs.
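For example, statements can be run interactively in the Grunt shell (a hedged sketch with a hypothetical input file):

    grunt> names = LOAD 'names.txt' AS (name:chararray);
    grunt> DUMP names;    -- DUMP triggers execution and prints the relation to the console

Saving the same statements in a file such as names.pig and running it with the pig command is the batch/script mode described above.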
Pig Job Execution Flow
• The job execution flow in Pig is as follows:
• The programmer creates a Pig Latin script, which is stored in the local file system.
• Once the Pig script is submitted, it is passed to the compiler, which generates a series of MapReduce jobs.
• The result files are placed back in the Hadoop Distributed File System (HDFS) after the MapReduce jobs are completed.
Features of Apache Pig
• 1) Ease of programming
• Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes this process easy: in Pig, the queries are converted to MapReduce internally.
• 2) Optimization opportunities
• The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
• 3) Extensibility
• Users can write user-defined functions containing their own logic to execute over the data set.
• 4) Flexible
• It can process structured, semi-structured, and unstructured data.
• 5) In-built operators
• It provides many built-in operators such as JOIN, SORT, FILTER, etc.
• Easy to learn, read and write; especially for SQL programmers, Apache Pig is a boon.
• Apache Pig is extensible, so you can build your own processing logic with user-defined functions (UDFs) written in Python, Java, and other languages.
• By integrating with other components of the Apache Hadoop ecosystem, such as Apache Hive, Apache Spark, and Apache ZooKeeper, Apache Pig enables users to take advantage of these components' capabilities while transforming data.
• Less development time: In plain MapReduce, writing the reducer and mapper, compiling and packaging the code, submitting the job and retrieving the output is a time-consuming task. Apache Pig reduces development time by using the multi-query approach. Pig is also beneficial for programmers who do not come from a Java background: 200 lines of Java code can be written in only 10 lines of Pig Latin, and programmers who know SQL need less effort to learn Pig Latin.
• It uses a multi-query approach, which reduces the length of the code.
• It is used where analytical insights are needed over sampled data.
Types of Data Models in Apache Pig:
• Atom: An atom is an atomic data value, stored as a string. The main advantage of this model is that a value can be used both as a number and as a string.
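A small sketch of atoms in practice (hypothetical file and fields): every individual field value produced below, such as a student's name or age, is an atom.

    students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:double);
    DUMP students;    -- each field value in every tuple is a single atomic value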
Hive Server 2
• Hive Server 2 accepts incoming requests from users and applications, creates an execution plan, and auto-generates a YARN job to process SQL queries. The server also supports the Hive optimizer and the Hive compiler.
• By enabling the implementation of SQL-like code, Apache Hive negates the need for long-winded Java code to sort through unstructured data and allows users to make queries using built-in Hive Query Language (HQL) statements. These statements can be used to navigate large datasets, refine results, and share data in a cost-effective manner (see the HiveQL sketch below).
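A minimal HiveQL sketch of such statements; the table and column names are hypothetical:

    CREATE TABLE IF NOT EXISTS sales (id INT, region STRING, amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    SELECT region, SUM(amount) AS total_sales    -- navigate and refine a large dataset
    FROM sales
    GROUP BY region;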
The Hive Metastore
• The central repository of the Apache Hive infrastructure, the metastore is where all of Hive's metadata is stored. In the metastore, metadata can also be organized into Hive tables and partitions to compare data across relational databases. This includes table names, column names, data types, partition information, and the HDFS locations of the table data (see the sketch below).
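As an illustration, the metadata held in the metastore for a table can be inspected from HiveQL (using the hypothetical sales table from the sketch above):

    SHOW TABLES;                 -- tables the metastore knows about
    DESCRIBE FORMATTED sales;    -- column names, data types, HDFS location, partition info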
• In line with other database management systems (DBMS), Hive has its own built-in command-line interface where users can run HQL statements. The Hive shell also runs the Hive JDBC and ODBC drivers and so can conduct queries from an Open Database Connectivity or Java Database Connectivity application.
Hive Consists of Mainly 3 Core Parts
1. Hive Clients
2. Hive Services
3. Hive Storage and Computing
• Hive Clients:
• Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication.
• For Java-related applications, it provides JDBC drivers, and for other types of applications it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.
Hive Services:
• Client interactions with Hive are performed through Hive Services. If a client wants to perform any query-related operations in Hive, it has to communicate through Hive Services.
• The CLI is the command line interface that acts as the Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and with the main driver in the Hive services.
• The driver present in the Hive services is the main driver, and it communicates with all types of JDBC, ODBC, and other client-specific applications. The driver processes the requests from the different applications and forwards them to the metastore and file systems for further processing.
Hive Storage and Computing:
• Hive services such as the Metastore, File System, and Job Client in turn communicate with Hive storage and perform the following actions:
• Metadata information of the tables created in Hive is stored in the Hive metastore database.
• Query results and the data loaded into the tables are stored in the Hadoop cluster on HDFS.
• The architecture of Hive is as shown below. We start with the Hive client, who could be a programmer proficient in SQL, looking up the data that is needed.
• The Hive client supports different types of client applications in different languages to perform queries. Thrift
is a software framework. The Hive Server is based on Thrift, so it can serve requests from all of the
programming languages that support Thrift.
• Next, we have the JDBC (Java Database Connectivity) application and Hive JDBC Driver.
• The JDBC application is connected through the JDBC Driver. Then we have an ODBC (Open Database
Connectivity) application connected through the ODBC Driver. All these client requests are submitted to the
Hive server.
• In addition to the above, we also have the Hive web interface, or GUI, where programmers execute Hive queries; commands can also be executed directly in the CLI. Up next is the Hive driver, which is responsible for all the queries submitted. It performs three steps internally:
• Compiler - The Hive driver passes the query to the compiler, where it is checked and analyzed
• Optimizer - The optimized logical plan is obtained in the form of a graph of MapReduce and HDFS tasks
• Executor - The tasks are executed in the order of their dependencies (see the EXPLAIN sketch below)
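A hedged way to see what the compiler and optimizer produce is HiveQL's EXPLAIN statement (table from the earlier sketch; the exact plan output varies by Hive version):

    EXPLAIN
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region;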
• The Metastore is a repository for Hive metadata. It stores metadata for Hive tables, and you can think of this as your schema. By default it is located in the Apache Derby database. Hive uses the MapReduce framework to process queries. Finally, we have distributed storage, which is HDFS; it runs on commodity machines and is linearly scalable, which makes it very affordable.
2. The driver interacts with the compiler to get the plan (here, the plan refers to the query execution process and the gathering of its related metadata information).
3. The compiler creates the plan for the job to be executed and communicates with the metastore to request the metadata.
5. The compiler communicates the proposed plan for executing the query back to the driver.
• The Execution Engine (EE) fetches the desired records from the Data Nodes; the actual table data resides only on the data nodes, while from the Name Node it fetches only the metadata information for the query.
• The Execution Engine (EE) communicates bi-directionally with the metastore present in Hive to perform DDL (Data Definition Language) operations, such as CREATE, DROP and ALTER on tables and databases. The metastore stores only information such as database names, table names and column names; the EE fetches the data related to the query itself.
• The Execution Engine (EE) in turn communicates with Hadoop daemons such as the Name Node, Data Nodes, and Job Tracker to execute the query on top of the Hadoop file system.
9. Sending results to the Execution Engine: once the results are fetched from the data nodes to the EE, it sends the results back to the driver and to the UI (front end).
Hive Data Modeling
• That was how data flows in Hive. Let's now take a look at Hive data modeling, which consists of tables, partitions, and buckets (illustrated in the sketch after the data types below), along with the data types Hive supports:
• Primitive data types:
• Numeric data types - types like INT, FLOAT, DOUBLE and DECIMAL
• String data types - types like CHAR, VARCHAR and STRING
• Date/Time data types - types like TIMESTAMP and DATE
• Miscellaneous data types - types like BOOLEAN and BINARY
• Complex data types:
• Arrays - A collection of entities of the same type. The syntax is: array<data_type>
• Maps - A collection of key-value pairs. The syntax is: map<primitive_type, data_type>
• Structs - A collection of complex data with comments. Syntax: struct<col_name : data_type [COMMENT col_comment], ...>
• Unions - A collection of heterogeneous data types. Syntax: uniontype<data_type, data_type, ...>
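A hedged HiveQL sketch that ties the modeling concepts together: a table with complex-typed columns that is both partitioned and bucketed. All names are hypothetical.

    CREATE TABLE employees (
      name    STRING,
      skills  ARRAY<STRING>,                  -- array: same-type collection
      contact MAP<STRING, STRING>,            -- map: key-value pairs
      address STRUCT<city:STRING, zip:INT>    -- struct: grouped fields
    )
    PARTITIONED BY (dept STRING)              -- partition: one HDFS directory per dept value
    CLUSTERED BY (name) INTO 4 BUCKETS;       -- buckets: hash-split files within each partition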
Modes of Hive
• Local mode:
• If Hadoop is installed in pseudo mode with a single data node, we use Hive in this mode
• If the data size is small and limited to a single local machine, we can use this mode
• Processing will be very fast on smaller data sets present on the local machine
• MapReduce mode:
• If Hadoop has multiple data nodes and the data is distributed across the different nodes, we use Hive in this mode
• It performs on large amounts of data, and queries are executed in a parallel way (see the configuration sketch below)
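A small configuration sketch, assuming a standard Hive-on-YARN setup; both properties are set per session and the values shown are illustrative:

    SET hive.exec.mode.local.auto=true;    -- let Hive run small queries locally for speed
    SET mapreduce.framework.name=yarn;     -- run larger queries as distributed jobs on the cluster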
Difference Between Hive and RDBMS
Features of Hive
• The use of a SQL-like language called HiveQL in Hive is easier than writing long code
• In Hive, tables are used that are similar to those in an RDBMS, hence easier to understand
Limitations of Hive
Slow performance: Hive can be slower than traditional relational databases because it is built on top of Hadoop, which is optimized for large-scale batch processing rather than low-latency queries.
Steep learning curve: While Hive uses a SQL-like language, it still requires users to have knowledge of Hadoop and distributed computing, which can make it difficult for beginners to use.
Lack of support for transactions: Hive does not support transactions, which can make it difficult to maintain data consistency.
Limited flexibility: Hive is not as flexible as other data warehousing tools because it is designed to work specifically with Hadoop and HDFS.
Operations in Hive