
Unit 5

Apache PIG
&
HIVE
What is PIG?

• Apache Pig is a high-level data flow platform for executing MapReduce programs of Hadoop. The language used for Pig is Pig Latin.

• Pig scripts are internally converted into MapReduce jobs and executed on data stored in HDFS. Apart from that, Pig can also execute its jobs on Apache Tez or Apache Spark.

• Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and stores the corresponding results in the Hadoop Distributed File System. Every task that can be achieved using Pig can also be achieved using Java in MapReduce.
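
• As a quick illustration, below is a minimal Pig Latin word-count sketch (the input file and output directory names are placeholders, not part of this unit); Pig translates these few statements into MapReduce jobs automatically:

lines   = LOAD 'input.txt' AS (line:chararray);                    -- read each line as a single field
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;  -- split lines into words
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';                                 -- results written back to HDFS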
Pig Architecture With its Components
• Hadoop stores raw data coming from various sources such as IoT devices, websites and mobile phones, and preprocessing is done with MapReduce. The Pig framework converts any Pig job into MapReduce, so we can use Pig to perform the ETL (Extract, Transform and Load) process on raw data. Apache Pig can handle the large data stored in Hadoop to perform data analysis, and it supports file formats like text, CSV, Excel, RC, etc.

Apache Pig is used because of properties like:

• Ease of Programming: To make programs easy to write and understand, most of the complex tasks are encoded as data flow sequences that achieve parallel execution.

• Optimization Opportunities: Pig permits the system to optimize tasks automatically, so that the programmer can focus on semantics rather than efficiency.

• Extensibility: It provides UDFs (user-defined functions) so that programmers can create their own functions for processing.
Parser: Any Pig script or command in the Grunt shell is handled by the parser. The parser performs checks on the script, such as syntax checking and type checking, among others. The output of these checks is a Directed Acyclic Graph (DAG) containing the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are the nodes and the data flows are the edges.

Optimizer: As soon as parsing is completed and the DAG is generated, it is passed to the logical optimizer, which performs logical optimizations such as projection and pushdown. Projection and pushdown improve query performance by omitting unnecessary columns or data and pruning the loader so that it loads only the necessary columns.

Compiler: The optimized logical plan generated above is compiled by the compiler into a series of MapReduce jobs. Basically, the compiler converts the Pig job automatically into MapReduce jobs and exploits optimization opportunities in the script, so the programmer does not have to tune the program manually. Because Pig is a data-flow language, its compiler can reorder the execution sequence to optimize performance, as long as the plan produces the same result as the original program.

Execution Engine: Finally, all the MapReduce jobs generated by the compiler are submitted to Hadoop in sorted order. In the end, the MapReduce jobs are executed on Hadoop to produce the desired output.

Execution Mode: Pig works in two execution modes, depending on where the script is running and where the data resides:

• Local Mode: Local mode is best suited for small data sets. Pig runs in a single JVM, and all files are installed and run on localhost, so parallel mapper execution is not possible. Also, while loading data, Pig always looks in the local file system.

• MapReduce Mode (MR Mode): In MapReduce mode, the programmer needs access to a Hadoop cluster and an HDFS installation, and the data being processed exists in HDFS. When a Pig script is executed in MR mode, the Pig Latin statements are converted into MapReduce jobs in the back-end to perform the operations on the data. Pig uses MapReduce mode by default, so we do not need to specify it with the -x flag.

Apart from the execution modes, there are three different execution mechanisms in Apache Pig:

Interactive Mode or Grunt Shell: Enter Pig statements and get the output in the command shell.
Batch Mode or Script Mode: This is the non-interactive way of executing Pig, where the programmer writes a Pig Latin script with a '.pig' extension and executes it; see the command-line sketch below.
Embedded Mode: Use programming languages like Java in our script with the help of UDFs.
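
• For example, the mode and mechanism are selected on the command line (the script name wordcount.pig is a placeholder):

pig -x local                     # open the Grunt shell in local mode (single JVM, local file system)
pig -x mapreduce wordcount.pig   # run a script as a batch job on the Hadoop cluster
pig wordcount.pig                # MapReduce mode is the default, so -x mapreduce may be omitted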
Pig Job Execution Flow
• Below we explain the job execution flow in Pig:

• The programmer creates a Pig Latin script, which is stored in the local file system.

• Once the Pig script is submitted, it is passed to the compiler, which generates a series of MapReduce jobs.

• The Pig compiler gets the raw data from HDFS and performs the operations on it.

• The result files are placed back in the Hadoop Distributed File System (HDFS) after the MapReduce jobs are completed.
Features of Apache Pig
• 1) Ease of programming
• Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes this process easy; in Pig, the queries are converted to MapReduce internally.

• 2) Optimization opportunities
• The way tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

• 3) Extensibility
• Users can write user-defined functions containing their own logic to execute over the data set.

• 4) Flexible
• It can easily handle structured as well as unstructured data.

• 5) In-built operators
• It contains various types of operators such as sort, filter and join.


Features of Apache Pig
• For performing several operations, Apache Pig provides a rich set of operators for filtering, joining, sorting, aggregation, etc.

• Easy to learn, read and write. Especially for SQL programmers, Apache Pig is a boon.

• Apache Pig is extensible, so you can build your own processes and user-defined functions (UDFs) written in Python, Java or other programming languages.

• Join operations are easy in Apache Pig.

• Fewer lines of code.

• Apache Pig allows splits in the pipeline.

• By integrating with other components of the Apache Hadoop ecosystem, such as Apache Hive, Apache Spark and Apache ZooKeeper, Apache Pig enables users to take advantage of these components' capabilities while transforming data.

• The data structure is multivalued, nested and richer.


Need of Pig:
• One limitation of MapReduce is that the development cycle is very long. Writing the mapper and the reducer, compiling and packaging the code, submitting the job and retrieving the output is a time-consuming task. Apache Pig reduces development time by using the multi-query approach. Pig is also beneficial for programmers who do not come from a Java background: 200 lines of Java code can be written in only 10 lines of Pig Latin. Programmers who have SQL knowledge need less effort to learn Pig Latin.

• It uses a multi-query approach, which reduces the length of the code.

• Pig Latin is an SQL-like language.

• It provides many built-in operators.


Difference between Pig and
MapReduce
Applications of Apache Pig:

• For exploring large datasets, Pig scripting is used.

• Provides support for ad-hoc queries across large data sets.

• In the prototyping of processing algorithms for large data sets.

• For processing time-sensitive data loads.

• For collecting large amounts of data in the form of search logs and web crawls.

• Used where analytical insights are needed through sampling.
Types of Data Models in Apache Pig:

• It consists of the following 4 types of data models:

• Atom: An atomic data value, which is stored as a string. The main advantage of this model is that the value can be used both as a number and as a string.

• Tuple: An ordered set of fields.

• Bag: A collection of tuples.

• Map: A set of key/value pairs.
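
• A small Pig Latin sketch of these data models, assuming a hypothetical students file whose second field is a bag of score tuples and whose third field is a map:

students = LOAD 'students.txt' AS (name:chararray, scores:bag{t:(score:int)}, info:map[]);
-- each record is a tuple; 'scores' is a bag of tuples; 'info' holds key/value pairs
flat = FOREACH students GENERATE name, FLATTEN(scores) AS score, info#'city' AS city;
DUMP flat;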


Operations in Pig
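
• A hedged Pig Latin sketch of common operations (the files, schemas and field names below are hypothetical): LOAD, FILTER, JOIN, GROUP, FOREACH...GENERATE, ORDER and DUMP:

users   = LOAD 'users.csv'  USING PigStorage(',') AS (id:int, name:chararray, age:int);
orders  = LOAD 'orders.csv' USING PigStorage(',') AS (oid:int, user_id:int, amount:double);
adults  = FILTER users BY age >= 18;                       -- keep only adult users
joined  = JOIN adults BY id, orders BY user_id;            -- inner join on the user id
grouped = GROUP joined BY adults::name;                    -- one group per user name
totals  = FOREACH grouped GENERATE group AS name, SUM(joined.orders::amount) AS total;
ranked  = ORDER totals BY total DESC;                      -- sort by total spend, highest first
DUMP ranked;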
What is Apache Hive?

• Apache Hive is open-source data warehouse software designed to read, write, and manage large datasets extracted from the Apache Hadoop Distributed File System (HDFS), one aspect of the larger Hadoop ecosystem.

• With extensive Apache Hive documentation and continuous updates, Apache Hive continues to innovate in data processing in an accessible way.
Apache Hive architecture and key Apache Hive components
• The key components of the Apache Hive architecture are Hive Server 2, the Hive Query Language (HQL), the external Apache Hive Metastore, and the Hive Beeline shell.

Hive Server 2

• Hive Server 2 accepts incoming requests from users and applications, creates an execution plan, and auto-generates a YARN job to process SQL queries. The server also supports the Hive optimizer and Hive compiler to streamline data extraction and processing.

Hive Query Language

• By enabling the implementation of SQL-like code, Apache Hive removes the need for long-winded Java MapReduce programs to sort through unstructured data and allows users to make queries using built-in HQL statements. These statements can be used to navigate large datasets, refine results, and share data in a cost-effective and time-efficient manner.
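
• As a brief, hedged illustration (the web_logs table and its columns are hypothetical), an HQL statement of the kind Hive Server 2 turns into a YARN job:

SELECT country, COUNT(*) AS visits
FROM   web_logs
WHERE  year = 2023
GROUP BY country
ORDER BY visits DESC
LIMIT 10;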
The Hive Metastore

• The central repository of the Apache Hive infrastructure, the metastore is where all of Hive's metadata is stored. In the metastore, metadata can also be formatted into Hive tables and partitions to compare data across relational databases. It includes table names, column names, data types, partition information, and the location of the data on HDFS.

Hive Beeline Shell

• In line with other database management systems (DBMS), Hive has its own built-in command-line interface where users can run HQL statements. The Hive shell also runs the Hive JDBC and ODBC drivers, so it can conduct queries from an Open Database Connectivity (ODBC) or Java Database Connectivity (JDBC) application.
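
• A minimal Beeline sketch, assuming HiveServer2 is listening on its default port 10000 on localhost and using a placeholder user name:

beeline -u jdbc:hive2://localhost:10000 -n hiveuser            # open an interactive Beeline session
beeline -u jdbc:hive2://localhost:10000 -e "SHOW DATABASES;"   # run a single HQL statement and exit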
Hive Consists of Mainly 3 core parts

1. Hive Clients

2. Hive Services

3. Hive Storage and Computing

• Hive Clients:

• Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication.

• For Java-related applications, it provides JDBC drivers. For other types of applications, it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.
Hive Services:
• Client interactions with Hive are performed through the Hive services. If the client wants to perform any query-related operations in Hive, it has to communicate through the Hive services.

• The CLI (command-line interface) acts as a Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and with the main driver in the Hive services, as shown in the architecture diagram above.

• The driver present in the Hive services is the main driver, and it communicates with all types of JDBC, ODBC, and other client-specific applications. The driver passes the requests from the different applications to the metastore and the file system for further processing.
Hive Storage and Computing:

• Hive services such as the metastore, the file system, and the job client in turn communicate with Hive storage and perform the following actions:

• Metadata information for the tables created in Hive is stored in the Hive "metastore database".

• Query results and data loaded into the tables are stored in the Hadoop cluster on HDFS.
• The architecture of Hive is as shown below. We start with the Hive client, who could be a programmer proficient in SQL, looking up the data that is needed.

• The Hive client supports different types of client applications in different languages for performing queries. Thrift is a software framework; the Hive server is based on Thrift, so it can serve requests from all of the programming languages that support Thrift.

• Next, we have the JDBC (Java Database Connectivity) application and the Hive JDBC driver. The JDBC application is connected through the JDBC driver. Then we have an ODBC (Open Database Connectivity) application connected through the ODBC driver. All these client requests are submitted to the Hive server.

• In addition to the above, we also have the Hive web interface, or GUI, where programmers execute Hive queries; commands can also be executed directly in the CLI. Next is the Hive driver, which is responsible for all the queries submitted. It performs three steps internally:

• Compiler - The Hive driver passes the query to the compiler, where it is checked and analyzed.

• Optimizer - An optimized logical plan in the form of a graph of MapReduce and HDFS tasks is obtained.

• Executor - In the final step, the tasks are executed.

• The metastore is a repository for Hive metadata. It stores metadata for Hive tables, and you can think of this as your schema; by default it is kept in an Apache Derby database. Hive uses the MapReduce framework to process queries. Finally, we have distributed storage, which is HDFS; it runs on commodity machines and is linearly scalable, which makes it very affordable.

• Let's understand how data flows in Hive.

The data flows in the following sequence:
1. The query is executed from the UI (user interface).

2. The driver interacts with the compiler to get the plan (here, the plan refers to the query execution process and the gathering of its related metadata information).

3. The compiler creates the plan for the job to be executed. The compiler communicates with the metastore to request metadata.

4. The metastore sends the metadata information back to the compiler.

5. The compiler communicates the proposed plan for executing the query back to the driver.

6. The driver sends the execution plan to the execution engine.

7. The execution engine (EE) acts as a bridge between Hive and Hadoop to process the query. For DFS operations:
• The EE first contacts the Name Node and then the Data Nodes to get the values stored in the tables.

• The EE fetches the desired records from the Data Nodes. The actual data of the tables resides only in the Data Nodes, while the Name Node supplies only the metadata information for the query.

• It collects from the Data Nodes the actual data related to the query.

• The execution engine communicates bi-directionally with the metastore present in Hive to perform DDL (Data Definition Language) operations such as CREATE, DROP and ALTER on tables and databases. The metastore stores only information such as database names, table names and column names, and it returns the metadata related to the query.

• The execution engine in turn communicates with Hadoop daemons such as the Name Node, the Data Nodes, and the Job Tracker to execute the query on top of the Hadoop file system.

8. The driver fetches the results.

9. The results are sent back through the execution engine: once the results are fetched from the Data Nodes to the EE, it sends them back to the driver and on to the UI (front end).
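
• To inspect the plan that the compiler hands to the execution engine, HiveQL provides the EXPLAIN statement (the table name here is hypothetical):

EXPLAIN
SELECT country, COUNT(*) FROM web_logs GROUP BY country;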
Hive Data Modeling
• That was how data flows in Hive. Let's now take a look at Hive data modeling, which consists of tables, partitions, and buckets:

• Tables - Tables in Hive are created the same way they are in an RDBMS.

• Partitions - Tables are organized into partitions for grouping similar types of data based on the partition key.

• Buckets - Data present in partitions can be further divided into buckets for efficient querying; see the DDL sketch below.
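
• A hedged HiveQL sketch of a partitioned and bucketed table (the table name, columns, bucket count and file format are illustrative choices, not prescribed by this unit):

CREATE TABLE sales (
  id     INT,
  item   STRING,
  amount DOUBLE
)
PARTITIONED BY (sale_date STRING)        -- one partition per day
CLUSTERED BY (id) INTO 8 BUCKETS         -- hash the id into 8 buckets within each partition
STORED AS ORC;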
Hive Data Types
• Primitive Data Types:
• Numeric data types - integral types, float, decimal
• String data types - char, string
• Date/time data types - timestamp, date, interval
• Miscellaneous data types - boolean and binary

• Complex Data Types:
• Arrays - A collection of entities of the same type. Syntax: array<data_type>
• Maps - A collection of key-value pairs. Syntax: map<primitive_type, data_type>
• Structs - A collection of complex data with comments. Syntax: struct<col_name : data_type [COMMENT col_comment], ...>
• Unions - A collection of heterogeneous data types. Syntax: uniontype<data_type, data_type, ...>
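
• A hedged HiveQL sketch declaring columns with these complex types (the table name, columns and delimiters are illustrative):

CREATE TABLE employee (
  name     STRING,
  skills   ARRAY<STRING>,
  salaries MAP<STRING, DOUBLE>,
  address  STRUCT<city:STRING, zip:INT>,
  misc     UNIONTYPE<INT, STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':';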


Different modes of Hive
• Hive can operate in two modes depending on the size of the data and the number of data nodes in Hadoop.
• These modes are:
• Local mode
• MapReduce mode

When to use local mode:

• If Hadoop is installed in pseudo-distributed mode with a single data node, we use Hive in this mode.

• If the data size is small enough to be limited to a single local machine, we can use this mode.

• Processing will be very fast on the smaller data sets present on the local machine.

When to use MapReduce mode:

• If Hadoop has multiple data nodes and the data is distributed across the different nodes, we use Hive in this mode.

• It performs well on large data sets, and queries execute in a parallel way.
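
• In practice, the execution side can be influenced from the Hive shell with standard configuration properties; a small sketch (the properties exist in Hive/Hadoop, but the choice of values here is illustrative):

SET mapreduce.framework.name=local;    -- force jobs onto the local runner for small data sets
SET hive.exec.mode.local.auto=true;    -- let Hive pick local mode automatically when inputs are small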
Difference Between Hive and RDBMS
Features of Hive

• The use of an SQL-like language called HiveQL in Hive is easier than writing lengthy code

• Hive uses tables that are similar to those in an RDBMS, hence easier to understand

• By using HiveQL, multiple users can simultaneously query data

• Hive supports a variety of data formats


Advantages of Hive
Scalability: Apache Hive is designed to handle large volumes of data, making it a scalable solution for big
data processing.
Familiar SQL-like interface: Hive uses a SQL-like language called HiveQL, which makes it easy for SQL
users to learn and use.
Integration with Hadoop ecosystem: Hive integrates well with the Hadoop ecosystem, enabling users to
process data using other Hadoop tools like Pig, MapReduce, and Spark.
Supports partitioning and bucketing: Hive supports partitioning and bucketing, which can improve query
performance by limiting the amount of data scanned.
User-defined functions: Hive allows users to define their own functions, which can be used in HiveQL
queries.
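
For instance, a hedged sketch of registering and using a UDF from HiveQL (the jar path and Java class name are placeholders; writing the Java class itself is a separate step):

ADD JAR /path/to/my_udfs.jar;                                   -- make the UDF jar visible to Hive
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpper';
SELECT to_upper(name) FROM employee;                            -- use it like any built-in function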
Disadvantages of Hive
Limited real-time processing: Hive is designed for batch processing, which means it may not be the best tool for real-time data processing.

Slow performance: Hive can be slower than traditional relational databases because it is built on top of Hadoop, which is optimized for batch processing rather than interactive querying.

Steep learning curve: While Hive uses a SQL-like language, it still requires users to have knowledge of Hadoop and distributed computing, which can make it difficult for beginners to use.

Limited support for transactions: Hive has only limited support for transactions, which can make it difficult to maintain data consistency.

Limited flexibility: Hive is not as flexible as other data warehousing tools because it is designed to work primarily with data stored in Hadoop.
Operations in Hive

Intro to HiveQL, Basic HiveQL commands: DDL
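
• A short, hedged set of HiveQL DDL examples (database, table and column names are placeholders):

CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;

CREATE TABLE IF NOT EXISTS customers (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

ALTER TABLE customers ADD COLUMNS (email STRING);   -- DDL: change the table's schema
DROP TABLE IF EXISTS customers;
DROP DATABASE IF EXISTS sales_db;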
