Big Data Analytics
Types of Data
● Data is largely classified as Structured, Semi-Structured and Unstructured.
● If we know the fields as well as their datatype, then we call it structured. The data
in relational databases such as MySQL, Oracle or Microsoft SQL is an example of
structured data.
● If we know the fields or columns but not their data types, we call it semi-structured data. For example, data in a CSV (comma-separated values) file is semi-structured.
● If our data doesn't contain columns or fields, we call it unstructured data. The data
in the form of plain text files or logs generated on a server are examples of
unstructured data.
● The process of translating unstructured data into structured data is known as ETL -
Extract, Transform and Load.
Distributed System
● When networked computers are utilised to achieve a common goal, it is known as a
distributed system.
● The work gets distributed amongst many computers.
● The branch of computing that studies distributed systems is known as distributed
computing.
● The purpose of distributed computing is to get the work done faster by utilising
many computers.
● Most but not all the tasks can be performed using distributed computing.
What is Big Data?
● In very simple words, Big Data is data of such a large size that it cannot be processed with the usual tools.
● To process such data, we need a distributed architecture. This data could be
structured or unstructured.
Velocity: Velocity refers to the speed at which data is generated, collected, and processed.
In many Big Data applications, data is produced and updated in real-time, requiring rapid
processing and analysis to derive meaningful insights. Examples of high-velocity data
include streaming data from sensors, financial market data, and social media feeds.
Example: Sensor Data in Smart Cities
● In smart city implementations, sensors deployed throughout the city continuously
collect data on various parameters such as traffic flow, air quality, temperature, and
energy consumption. This data is streamed in real-time to city management
systems, where it is analysed to optimise traffic flow, improve public services, and
enhance overall urban planning.
Variety: Variety refers to the diverse types and sources of data that exist, including
structured, semi-structured, and unstructured data formats. Traditional relational databases
are well-suited for structured data, but Big Data platforms must be able to handle a wide
variety of data types, including text, images, videos, and sensor data.
Example: Healthcare Data
● Healthcare organisations collect vast amounts of data from diverse sources,
including electronic health records (structured data), medical images
(unstructured data), wearable devices (sensor data), and social media
discussions about health issues (text data). Integrating and analysing this
variety of data can lead to improved patient care, disease prevention, and
medical research.
Veracity: Veracity refers to the quality and reliability of the data. In Big Data environments,
where data is sourced from diverse and often unstructured sources, ensuring data accuracy
and reliability is crucial for making informed decisions. Veracity encompasses issues such as
data consistency, completeness, accuracy, and trustworthiness. Poor data quality can lead
to erroneous insights and decisions.
Example: Social Media Sentiment Analysis
● When analysing social media data for sentiment analysis, ensuring the veracity of
the data is essential. Social media platforms may contain spam, fake accounts, or
misleading information, which can skew the analysis results. Techniques such as
data cleansing, outlier detection, and sentiment validation are employed to enhance
data veracity and improve the accuracy of sentiment analysis.
Value: Value refers to the ultimate goal of Big Data analytics—to derive actionable insights
and create value for organisations. While processing large volumes of data at high velocity
from diverse sources is important, the true measure of success lies in the ability to extract
meaningful insights that drive business outcomes, improve decision-making, and create
competitive advantages. Value is realised when organisations can translate Big Data
analytics into tangible benefits, such as increased revenue, cost savings, enhanced
customer satisfaction, or innovation.
Example: Personalised Marketing Campaigns
● Retailers utilise Big Data analytics to analyse customer behaviour, preferences, and
purchase history to create personalised marketing campaigns. By understanding
individual customer preferences and behaviour patterns, retailers can tailor product
recommendations, promotions, and offers, ultimately increasing sales and customer
loyalty. The value lies in the ability to deliver targeted marketing messages that
resonate with customers, resulting in higher conversion rates and improved ROI.
A/B Testing
● The idea of A/B testing is to compare two or more versions of the same thing and analyse which one works better. It not only helps us understand which version works better, but also provides evidence of whether the difference between the two versions is statistically significant or not (a small sketch of such a significance test is shown after this list).
● Let us go back to the history of A/B testing. Earlier, in the 1950s, scientists and medical researchers started conducting A/B testing to run clinical trials to test drug efficacy. In the 1960s and 1970s, marketers adopted this strategy to evaluate direct response campaigns. For instance, would a postcard or a letter to target customers result in more sales? Later, with the advent of the World Wide Web, the concept of
A/B testing has also been adopted by the software industry to understand the user
preferences. So basically, the concept which was applied offline, is now being
applied online.
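A minimal sketch of the statistical side of an A/B test, in Python with SciPy. The visitor and conversion counts below are made-up illustration values, not data from this text.

```python
# Compare conversion rates of two page versions and check whether the
# difference is statistically significant (chi-square test on a 2x2 table).
from scipy.stats import chi2_contingency

# Hypothetical counts: version A -> 120/1000 conversions, version B -> 150/1000.
table = [
    [120, 1000 - 120],   # A: converted, not converted
    [150, 1000 - 150],   # B: converted, not converted
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"conversion A = {120/1000:.1%}, conversion B = {150/1000:.1%}")
print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
    print("The difference is statistically significant at the 5% level.")
else:
    print("The difference could easily be due to chance.")
```

If the p-value is below the chosen threshold (commonly 0.05), the observed difference between the two versions is unlikely to be due to chance alone.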
Apache Hadoop
● Hadoop was created by Doug Cutting in order to build his search engine called
Nutch. He was joined by Mike Cafarella. Hadoop was based on the three papers
published by Google: Google File System, Google MapReduce, and Google Big
Table.
● It is named after the toy elephant of Doug Cutting's son.
● Hadoop is under the Apache License, which means you can use it anywhere without having to worry about licensing.
● It is quite powerful, popular and well supported.
● It is a framework to handle Big Data.
● Started as a single project, Hadoop is now an umbrella of projects.
● All of the projects under the Apache Hadoop umbrella should have the following three characteristics:
1. Distributed - They should be able to utilise multiple machines in order to solve a problem.
2. Scalable - If needed, it should be very easy to add more machines.
3. Reliable - If some of the machines fail, it should still work fine.
● These are the three criteria for all the projects or components to be under Apache Hadoop.
● Hadoop is written in Java so that it can run on all kinds of devices.
HDFS
● HDFS or Hadoop Distributed File System is the most important component because
the entire ecosystem depends upon it. It is based on the Google File System.
● It is basically a file system which runs on many computers to provide humongous
storage. If you want to store your petabytes of data in the form of files, you can use
HDFS.
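As a quick illustration, files can be pushed into HDFS with the standard `hdfs dfs` shell commands; here they are driven from Python. This assumes a configured Hadoop client on the machine, and the paths are hypothetical.

```python
# Minimal sketch: copy a local log file into HDFS and list the directory.
# Requires a Hadoop client installed and configured; paths are examples only.
import subprocess

# Create a directory in HDFS and upload a local file into it.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/logs"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "server.log", "/data/logs/"], check=True)

# List the directory to confirm the file is now stored (and replicated) by HDFS.
subprocess.run(["hdfs", "dfs", "-ls", "/data/logs"], check=True)
```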
YARN
● YARN (Yet Another Resource Negotiator) keeps track of all the resources (CPU, memory) of the machines in the network and runs the applications. Any application which wants to run in a distributed fashion would interact with YARN.
Flume
● Flume makes it possible to continuously pump unstructured data from many sources into a central store such as HDFS.
● If you have many machines continuously generating data, such as web server logs, you can use Flume to aggregate the data at a central place such as HDFS.
Sqoop
● Sqoop is used to transport data between Hadoop and SQL Databases. Sqoop
utilises MapReduce to efficiently transport data using many machines in a network.
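A hedged sketch of what such a transfer might look like, launching the standard `sqoop import` command from Python. The JDBC URL, credentials, table name and target directory are hypothetical examples, not values from this text.

```python
# Minimal sketch of a Sqoop import from a MySQL table into HDFS,
# launched via the sqoop CLI. All connection details are hypothetical.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",   # source SQL database
    "--username", "report_user",
    "--password-file", "/user/report_user/.password",   # avoid passwords on the CLI
    "--table", "customers",                             # table to import
    "--target-dir", "/data/customers",                  # destination in HDFS
    "--num-mappers", "4",                               # parallel MapReduce mappers
], check=True)
```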
HBase
● HBase provides humongous storage in the form of a database table. So, to manage a huge number of records, you would use HBase.
● HBase is a NoSQL Datastore.
Map Reduce
● MapReduce is a framework for distributed computing. It utilises YARN to execute
programs and has a very good sorting engine.
● You write your programs in two parts: Map and Reduce. The map part transforms the raw data into key-value pairs, and the reduce part groups and combines the data based on the key. We will learn MapReduce in detail later; a tiny single-machine sketch of the model follows below.
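A minimal, single-machine sketch of the two-part programming model (word count). This is not the Hadoop Java API; it only illustrates the map, shuffle and reduce flow.

```python
# Toy word count in the MapReduce style: map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(line):
    """Transform raw input into (key, value) pairs."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    """Combine all values that share the same key."""
    return (key, sum(values))

lines = ["big data needs big clusters", "data drives decisions"]

# Shuffle: group intermediate values by key, as the framework would do.
grouped = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

results = [reduce_phase(key, values) for key, values in grouped.items()]
print(sorted(results))   # e.g. [('big', 2), ('clusters', 1), ('data', 2), ...]
```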
Spark
● Spark is another computational framework similar to MapReduce but faster and
more recent. It uses similar constructs as MapReduce to solve big data problems.
● Spark has its own huge stack. We will cover it in detail soon.
Hive
● Writing code in MapReduce is very time-consuming. So, Apache Hive makes it possible to write your logic in SQL, which it internally converts into MapReduce jobs. So, you can process humongous structured or semi-structured data with simple SQL using Hive.
Pig
● Pig Latin is a simplified, SQL-like language to express your ETL needs in a stepwise fashion.
● Pig is the engine that translates Pig Latin into Map Reduce and executes it on
Hadoop.
Mahout
● Mahout is a library of machine learning algorithms that run in a distributed fashion. Since machine learning algorithms are complex and time-consuming, Mahout breaks down the work so that it gets executed on MapReduce running on many machines.
Zookeeper
● Apache Zookeeper is an independent component which is used by various
distributed frameworks such as HDFS, HBase, Kafka, YARN. It is used for the
coordination between various components. It provides a distributed configuration
service, synchronisation service, and naming registry for large distributed systems.
Oozie
● Since a project might involve many components, there is a need for a workflow
engine to execute work in sequence.
● For example, a typical project might involve importing data from SQL Server, running some Hive queries, doing predictions with Mahout, and saving the data back to an SQL Server.
● This kind of workflow can be easily accomplished with Oozie.
User Interface
● A user can talk to the various components of Hadoop using the Command Line Interface, the web interface, APIs, or Oozie.
● We will cover each of these components in detail later.
Google File System
Design Considerations
1. Commodity Hardware
● When Google was still a startup, instead of buying expensive servers, they chose to buy off-the-shelf commodity hardware.
● The reason they chose commodity hardware was that it is cheap, and with a lot of such servers they could scale horizontally, given the right software layer built on top of it.
● Failures are common
➔ Disk / Network / Server
➔ OS Bugs
➔ Human Errors
2. Fault Tolerant
● Therefore, GFS needs to be designed in such a way that it can still perform reliably in the face of such failures.
Large Files
● GFS is optimised to store and read large files
● A typical file in GFS ranges from 100 MB to multiple GBs
1. File Operations
● GFS was optimised for only two kinds of file operations:
➔ Writes to a file are generally append-only; there are hardly any random writes within a file.
➔ Reads are sequential.
● So, there could be a crawler which keeps appending all the crawled documents (HTML pages) to a single file, and then a batch processing system which reads this entire large file and creates a search index out of it.
2. Chunks
● A single file is not stored on a single server.
● It is actually subdivided into multiple chunks.
● Each chunk is 64 MB in size.
● These chunks can be spread across multiple machines.
● These machines (commodity hardware) are also called chunk servers.
● Each chunk is identified by a globally unique 64-bit ID.
Replicas
● Since these files are stored on commodity hardware, which can go down at any time, if you store only one copy of a file it is quite possible that you might lose that file forever.
● Thus, GFS ensures that each chunk of your file has at least 3 replicas across 3 different servers.
● So even if 1 server goes down, you have the other 2 replicas to work with.
● The replica count is 3 by default, but it can be configured by the client writing the file.
● It is difficult for the client application to know which chunk of the file is residing on which server.
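A toy sketch combining the two ideas above: split a file into 64 MB chunks, give each chunk a unique 64-bit-style ID, and place 3 replicas on different chunk servers. The server names, file size and the random placement policy are illustrative only, not real GFS code.

```python
# Toy model of GFS-style chunking and replica placement (illustration only).
import math
import random
import uuid

CHUNK_SIZE = 64 * 1024 * 1024    # 64 MB chunks
REPLICATION_FACTOR = 3           # default replica count

chunk_servers = [f"chunkserver-{i}" for i in range(1, 6)]   # hypothetical servers

def split_into_chunks(file_size_bytes):
    """Return one (chunk_id, replica_servers) entry per 64 MB chunk."""
    num_chunks = math.ceil(file_size_bytes / CHUNK_SIZE)
    placement = {}
    for _ in range(num_chunks):
        chunk_id = uuid.uuid4().int >> 64                    # 64-bit-style unique ID
        replicas = random.sample(chunk_servers, REPLICATION_FACTOR)
        placement[chunk_id] = replicas
    return placement

# A hypothetical 300 MB file -> 5 chunks, each stored on 3 different servers.
for chunk_id, servers in split_into_chunks(300 * 1024 * 1024).items():
    print(hex(chunk_id), servers)
```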
GFS Master
● This is the GFS Master Server which has all the metadata
● The metadata includes:
➔ Names of the files
➔ Number of chunks for each file
➔ IDs of those chunks
➔ Which servers those chunks are residing on
● It also stores the access control details, i.e. which client is allowed to access which file.
GFS Read Architecture
Heartbeats
● Regular heartbeat messages between the master and the chunk servers ensure that the chunk servers are alive.
When a chunk server goes down
● The master ensures that all chunks that were on the dead chunk server are copied onto other servers.
● It ensures that the replica count remains the same.
Operation Log
● A log of all the operations is written to an append-only log called the operation log.
● Each file operation, with a corresponding timestamp and the details of the user who performed it, is stored in this operation log.
● This log is very important. Therefore, it is written directly to disk and replicated to a remote machine.
● If the master itself crashes, it can replay the operation log to recreate the entire filesystem namespace, along with the chunk IDs, and get back to the earlier state of the system.
● This operation log may become too long; therefore, there is a background thread that regularly checkpoints the log in the background.
Map Reduce
● MapReduce is a programming model and an associated implementation for
processing and generating large datasets that are parallelizable. The concept was
introduced by Google in a white paper titled "MapReduce: Simplified Data
Processing on Large Clusters," authored by Jeffrey Dean and Sanjay Ghemawat,
which was published in 2004.
● The white paper outlines the MapReduce framework, which is designed to
efficiently process vast amounts of data across distributed clusters of computers. It
abstracts the complexities of parallelization, fault tolerance, data distribution, and
load balancing, allowing programmers to focus on writing simple maps and reduce
functions.
No Data Movement
● Since we are dealing with very large datasets, we do not want to move them around.
● We let the data stay on the machines where it already exists.
● So how do we process the data? We let the MAP function operate on the data locally. Instead of aggregating the data and moving it somewhere else, we send the MAP programs to the data and avoid moving the large dataset.
Key Value Structure
● In MapReduce, the chunks come from the same large dataset.
● When we try to reduce these data chunks, we are very likely to be looking for some sort of commonality or pattern in them.
● Since the chunks are from the same dataset, we will see that a lot of them have the same keys.
● It then becomes easier to reduce them into one single value based on that key.
● We will discuss this in detail when we look at our example.
Machine Failure
● To handle failures, the MapReduce model simply re-runs the MAP or REDUCE operations where the failure occurred.
Idempotent
Such retries are only safe if your MAP and REDUCE functions are idempotent, i.e. running MAP and REDUCE multiple times on the same input does not change the outcome.
File systems
● Traditional file systems are hierarchical (e.g. NTFS, NFS).
● The same hierarchical model has been applied to HDFS.
NameNode vs DataNode

NameNode:
● Centralised metadata repository.
● Manages the file system namespace and metadata.
● Stores information about file locations, sizes, permissions, etc.
● Single point of failure (High Availability configurations mitigate this).

DataNode:
● Stores actual data blocks in the file system.
● Responsible for serving read and write requests from clients.
● Reports regularly to the NameNode about the list of blocks it stores.
● Multiple DataNodes ensure data replication and fault tolerance.
Chunks or Blocks
Rack Awareness
It is a physical collection of various nodes. Generally, 30-40 nodes come under one rack
Functions of Name node
Metadata Management
● Manages the file system namespace and metadata.
● Stores information about file locations, sizes, permissions, and directory structure.
FsImage
● Contains a snapshot of the file system metadata.
● Represents the directory structure and file metadata at a specific point in time.
● Loaded into memory when NameNode starts to reconstruct the file system
namespace.
Edit Logs
● Record every change made to the file system metadata.
● Store a sequential log of all modifications to the file system, such as file creations,
deletions, and modifications.
● Persistently stores metadata changes to ensure durability and recoverability.
Heartbeats
● Regular messages sent by DataNodes to the NameNode.
● Indicate that DataNodes are operational and report their current status.
● Helps NameNode monitor the health and status of DataNodes in the cluster.
● Enables the NameNode to detect and handle failures promptly.
Race Condition
● A race condition occurs when two processes compete for the same resource, causing data corruption.
● As shown in the diagram, two individuals are trying to deposit 1 dollar online into the same bank account. The initial amount is 17 dollars. Due to the race condition, the final amount in the bank is 18 dollars instead of 19.
● When two processes try to access the same resource at the same time, the results become unpredictable.
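A small Python sketch of the lost-update scenario described above: two threads each deposit 1 dollar, but one deposit overwrites the other, so the balance ends at 18 instead of 19. The sleep is only there to make the interleaving reliable.

```python
# Race condition demo: two concurrent deposits into the same "account".
import threading
import time

balance = 17

def deposit():
    global balance
    current = balance        # read the shared balance
    time.sleep(0.01)         # widen the window so the interleaving is visible
    balance = current + 1    # write back, possibly overwriting the other deposit

threads = [threading.Thread(target=deposit) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("final balance:", balance)   # prints 18, not the expected 19

# Fix: protect the read-modify-write with a lock so deposits are serialised.
lock = threading.Lock()

def safe_deposit():
    global balance
    with lock:
        balance += 1
```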
Deadlock
● When two processes are waiting for each other, directly or indirectly, it is called a deadlock.
● As you can see in the second diagram, process 1 is waiting for process 2 to finish, process 2 is waiting for process 3 to finish, and process 3 is waiting for process 1 to finish. All three processes would keep waiting and will never end. This is called a deadlock.
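A minimal Python illustration of a deadlock: two threads acquire the same two locks in opposite order, so each ends up waiting for a lock the other one holds.

```python
# Deadlock demo: lock ordering in opposite directions.
import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def process_1():
    with lock_a:
        time.sleep(0.1)      # give process_2 time to grab lock_b
        with lock_b:         # blocks forever: process_2 holds lock_b
            print("process 1 finished")

def process_2():
    with lock_b:
        time.sleep(0.1)
        with lock_a:         # blocks forever: process_1 holds lock_a
            print("process 2 finished")

t1 = threading.Thread(target=process_1, daemon=True)
t2 = threading.Thread(target=process_2, daemon=True)
t1.start(); t2.start()
t1.join(timeout=1); t2.join(timeout=1)
print("deadlocked:", t1.is_alive() and t2.is_alive())   # True: both are stuck

# The usual fix is to always acquire locks in the same global order.
```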
Process VS Thread
Data Model
● The way you store data in any store is called a data model. In the case of ZooKeeper, think of the data model as a highly available file system with a few differences.
● We store data in an entity called a znode. The data stored in a znode is typically small and is often kept in JSON (JavaScript Object Notation) format.
● A znode can only be overwritten; it does not support append operations. A read or write is an atomic operation, meaning it either completes fully or fails with an error. There is no intermediate state like a half-written znode.
● znode can have children. So, znodes inside znodes make a tree-like hierarchy. The
top level znode is "/".
● The znode "/zoo" is a child of "/", which is the top-level znode. The znode "duck" is a child of "/zoo" and is denoted as /zoo/duck.
● Unlike in a file system, though, "." and ".." are not valid znode names.
Zookeeper Data model
Types of Znode
1. Persistent Znode
● Such znodes remain in ZooKeeper until they are deleted. This is the default type of znode. To create such a node you can use the command: create /name_of_myznode "mydata"
2. Ephemeral Znode
● Ephemeral znodes get deleted when the session in which the node was created disconnects. Though an ephemeral znode is tied to the client's session, it is visible to other clients.
● An ephemeral node cannot have children, not even ephemeral children.
3. Sequential Znode
● Quite often, we need to create sequential numbers such as IDs. In such situations we use sequential znodes.
● Sequential znodes are created with a number appended to the provided name.
● You can create such a znode by using create -s. The following command would create a node named zoo followed by a number: create -s /zoo v
● This number increases monotonically on every node creation inside a particular parent node. The first sequential child node gets the suffix 0000000000.
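The same three znode types can be created programmatically. Below is a minimal sketch using the kazoo Python client, assuming a ZooKeeper server is reachable at 127.0.0.1:2181; the znode paths are examples.

```python
# Creating persistent, ephemeral and sequential znodes with kazoo.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# 1. Persistent znode: stays until explicitly deleted (the default type).
zk.create("/name_of_myznode", b"mydata", makepath=True)

# 2. Ephemeral znode: disappears when this client's session ends.
zk.create("/workers/worker", b"", ephemeral=True, makepath=True)

# 3. Sequential znode: ZooKeeper appends a monotonically increasing number.
path = zk.create("/zoo", b"v", sequence=True)
print(path)   # e.g. /zoo0000000000 for the first sequential node

zk.stop()
```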
Apache Pig VS Map Reduce
● Pig Latin is a high-level data flow language, whereas MapReduce is a low-level
data processing paradigm.
● Without writing complex Java implementations in MapReduce, programmers can
achieve the same implementations very easily using Pig Latin.
● Apache Pig uses a multi-query approach (i.e. using a single query of Pig Latin we
can accomplish multiple MapReduce tasks), which reduces the length of the code by
20 times. Hence, this reduces the development period by almost 16 times.
● Pig provides many built-in operators to support data operations like joins, filters,
ordering, sorting etc. Whereas to perform the same function in MapReduce is a
humongous task.
● Performing a Join operation in Apache Pig is simple. Whereas it is difficult in
MapReduce to perform a Join operation between the data sets, as it requires
multiple MapReduce tasks to be executed sequentially to fulfil the job.
● In addition, it also provides nested data types like tuples, bags, and maps that are
missing from MapReduce. I will explain to you these data types in a while.
Apache Pig
● Pig enables programmers to write complex data transformations without knowing
Java.
● Apache Pig has two main components — the Pig Latin language and the Pig
Run-time Environment, in which Pig Latin programs are executed.
● For Big Data Analytics, Pig gives a simple data flow language known as Pig Latin
which has functionalities like SQL like join, filter, limit etc.
● Developers who are working with scripting languages and SQL, leverage Pig Latin.
This gives developers ease of programming with Apache Pig. Pig Latin provides
various built-in operators like join, sort, filter, etc to read, write, and process large
data sets. Thus, it is evident, Pig has a rich set of operators.
● Programmers write scripts using Pig Latin to analyse data and these scripts are
internally converted to Map and Reduce tasks by Pig MapReduce Engine. Before Pig,
writing MapReduce tasks was the only way to process the data stored in HDFS.
● If a programmer wants to write custom functions which are unavailable in Pig, Pig
allows them to write User Defined Functions (UDF) in any language of their choice
like Java, Python, Ruby, Jython, JRuby etc. and embed them in Pig script. This
provides extensibility to Apache Pig.
● Pig can process any kind of data, i.e. structured, semi-structured or unstructured
data, coming from various sources. Apache Pig handles all kinds of data.
● Approximately, 10 lines of Pig code are equal to 200 lines of MapReduce code.
● It can handle inconsistent schema (in case of unstructured data).
● Apache Pig extracts the data, performs operations on that data and dumps the data
in the required format in HDFS i.e. ETL (Extract Transform Load).
● Apache Pig automatically optimises the tasks before execution, i.e. automatic
optimization.
● It allows programmers and developers to concentrate upon the whole operation
irrespective of creating mapper and reducer functions separately.
MapReduce
The MapReduce program first takes the rows of the tweet table as input and sends them to the mapper function. The mapper function then selects the user id and associates a unit value (i.e. 1) with every user id. The shuffle phase sorts the same user ids together. At last, the reduce function adds up all the tweets belonging to the same user. The output is the user id, combined with the username and the number of tweets per user.
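Twitter's actual jobs were written in Java (and later Pig), but purely as an illustration of the flow just described, here is a sketch of the mapper and reducer as Hadoop Streaming scripts in Python. The tab-separated input format and the position of the user id column are assumptions.

```python
# Tweets-per-user count in the Hadoop Streaming style. In practice the mapper
# and reducer live in separate files passed to the hadoop-streaming jar via
# -mapper and -reducer; here they share one file for brevity.
import sys
from itertools import groupby

def mapper(lines):
    """Emit (user_id, 1) for every tweet row (user id assumed in column 0)."""
    for line in lines:
        user_id = line.rstrip("\n").split("\t")[0]
        print(f"{user_id}\t1")

def reducer(lines):
    """Input arrives sorted by key (the shuffle phase); sum the 1s per user."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for user_id, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{user_id}\t{total}")

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "mapper"
    (mapper if role == "mapper" else reducer)(sys.stdin)
```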
Limitations
● Analysis needs to be typically done in Java.
● Joins that are performed, need to be written in Java, which makes it longer and more
error-prone.
● For projection and filters, the custom code needs to be written which makes the
whole process slower.
● The job is divided into many stages while using MapReduce, which makes it difficult
to manage.
So, Twitter moved to Apache Pig for analysis. Now, joining data sets, grouping them, sorting them and retrieving data became easier and simpler. You can see in the below image how Twitter used Apache Pig to analyse its large data sets.
The Data
● Twitter had both semi-structured data like Twitter Apache logs, Twitter search logs,
Twitter MySQL query logs, application logs and structured data like tweets, users,
block notifications, phones, favourites, saved searches, re-tweets, authentications,
SMS usage, user followings, etc. which can be easily processed by Apache Pig.
● Twitter dumps all its archived data on HDFS. It has two tables, i.e. user data and tweet data. User data contains information about the users like username, followers, followings, number of tweets etc., while tweet data contains the tweet, its owner, number of retweets, number of likes etc. Now, Twitter uses this data to analyse its customers' behaviour and improve their experience.
Question: Analysing how many tweets are stored per user, in the given tweet tables?
STEP 1– First of all, Twitter imports the Twitter tables (i.e. the user table and the tweet table) into HDFS.
STEP 2– Then Apache Pig loads (LOAD) the tables into the Apache Pig framework.
STEP 3– Then it joins and groups the tweet tables and user table using COGROUP
command as shown in the above image.
● This results in the inner Bag Data type, which we will discuss later.
● Example of Inner bags produced –
● (1,{(1,Jay,xyz),(1,Jay,pqr),(1,Jay,lmn)})
● (2,{(2,Ellie,abc),(2,Ellie,vxy)})
● (3, {(3,Sam,stu)})
STEP 4– Then the tweets are counted according to the users using the COUNT command, so that the total number of tweets per user can be easily calculated.
● Example of tuple produced as (id, tweet count) (refer to the image) –
● (1, 3)
● (2, 2)
● (3, 1)
STEP 5– At last, the result is joined with the user table to extract the username with the
produced result.
● Example of tuple produced as (id, name, tweet count) (refer to the image) –
● (1, Jay, 3)
● (2, Ellie, 2)
● (3, Sam, 1)
STEP 6– Finally, this result is stored back in the HDFS.
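The six steps above can be sketched as a small Pig Latin script, written out and launched through the pig CLI from Python. The HDFS paths, field names, comma delimiter and local-mode flag are illustrative assumptions, not Twitter's actual job.

```python
# Sketch of the tweet-count use case as a Pig Latin script, run via the pig CLI.
import subprocess

pig_script = """
users  = LOAD '/data/users'  USING PigStorage(',') AS (id:int, name:chararray);
tweets = LOAD '/data/tweets' USING PigStorage(',') AS (user_id:int, tweet:chararray);

-- STEP 3: group both relations on the user id (produces inner bags)
grouped = COGROUP tweets BY user_id, users BY id;

-- STEP 4: count tweets per user id
counts  = FOREACH grouped GENERATE group AS id, COUNT(tweets) AS tweet_count;

-- STEP 5: join back with the user table to attach the username
named   = JOIN counts BY id, users BY id;
result  = FOREACH named GENERATE counts::id, users::name, counts::tweet_count;

-- STEP 6: store the final (id, name, tweet count) tuples back into HDFS
STORE result INTO '/data/tweet_counts' USING PigStorage(',');
"""

with open("tweet_counts.pig", "w") as f:
    f.write(pig_script)

# Local mode for a quick test; drop "-x local" to run on the Hadoop cluster.
subprocess.run(["pig", "-x", "local", "-f", "tweet_counts.pig"], check=True)
```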
Conclusion
● Pig is not only limited to this operation. It can perform various other operations
which I mentioned earlier in this use case.
● These insights help Twitter to perform sentiment analysis and develop machine
learning algorithms based on user behaviours and patterns.
● Now, after knowing the Twitter case study, in this Apache Pig tutorial, let us take a
deep dive and understand the architecture of Apache Pig and Pig Latin’s data model.
This will help us understand how Pig works internally. Apache Pig draws its strength from its architecture.
Pig Architecture
For writing Pig scripts, we need the Pig Latin language, and to execute them, we need an execution environment. The architecture of Apache Pig is shown in the image below.
Pig Latin Scripts
● Initially, as illustrated in the previous image, we submit Pig scripts, written in Pig Latin using built-in operators, to the Apache Pig execution environment.
● There are three ways to execute the Pig script:
➔ Grunt Shell: This is Pig’s interactive shell provided to execute all Pig Scripts.
➔ Script File: Write all the Pig commands in a script file and execute the Pig
script file. This is executed by the Pig Server.
➔ Embedded Script: If some functions are unavailable as built-in operators, we can programmatically create User Defined Functions (UDFs) to bring that functionality using other languages like Java, Python, Ruby, etc., and embed them in the Pig Latin script file. Then, execute that script file.
Parser
● From the above image you can see, after passing through Grunt or Pig Server, Pig
Scripts are passed to the Parser.
● The Parser does type checking and checks the syntax of the script. The parser
outputs a DAG (directed acyclic graph).
● DAG represents the Pig Latin statements and logical operators. The logical
operators are represented as the nodes and the data flows are represented as
edges.
Optimizer
● Then the DAG is submitted to the optimizer. The Optimizer performs the
optimization activities like split, merge, transform, and reorder operators etc. This
optimizer provides the automatic optimization feature to Apache Pig. The optimizer
basically aims to reduce the amount of data in the pipeline at any instance of time
while processing the extracted data, and for that, it performs functions like:
● PushUpFilter: If there are multiple conditions in the filter and the filter can be split,
Pig splits the conditions and pushes up each condition separately. Selecting these
conditions earlier helps in reducing the number of records remaining in the pipeline.
● PushDownForEachFlatten: Applying flattens, which produces a cross product
between a complex type such as a tuple or a bag and the other fields in the record,
as late as possible in the plan. This keeps the number of records low in the pipeline.
● ColumnPruner: Omitting columns that are never used or no longer needed, reducing
the size of the record. This can be applied after each operator so that fields can be
pruned as aggressively as possible.
● MapKeyPruner: Omitting map keys that are never used, reducing the size of the
record.
● LimitOptimizer: If the limit operator is immediately applied after a load or sort
operator, Pig converts the load or sort operator into a limit-sensitive
implementation, which does not require processing the whole data set. Applying the
limit earlier reduces the number of records.
● This is just a flavour of the optimization process. Beyond these, it also optimises Join, Order By and Group By operations.
Compiler
● After the optimization process, the compiler compiles the optimised code into a series of MapReduce jobs.
● The compiler is the one who is responsible for converting Pig jobs automatically into
MapReduce jobs.
Execution Engine
● Finally, as shown in the figure, these MapReduce jobs are submitted for execution to
the execution engine. Then the MapReduce jobs are executed and give the required
result. The result can be displayed on the screen using a “DUMP” statement and can
be stored in the HDFS using “STORE” statement.
● After understanding the architecture, now in this Apache Pig tutorial, I will explain to you Pig Latin's data model.
Bag
● A bag is a collection of tuples, and these tuples can be a subset of the rows of a table or entire rows. A bag can contain duplicate tuples; it is not mandatory for them to be unique.
● The bag has a flexible schema i.e. tuples within the bag can have a different number
of fields. A bag can also have tuples with different data types.
● A bag is represented by the ‘{}’ symbol.
● Example of a bag − {(Linkin Park, 7, California), (Metallica, 8), (Mega Death, Los
Angeles)}
● But for Apache Pig to effectively process bags, the fields and their respective data
types need to be in the same sequence.
Map
● A map is a set of key-value pairs used to represent data elements. The key must be a chararray and should be unique, like a column name, so it can be indexed and the values associated with it can be accessed on the basis of the keys. The value can be of any data type.
● Maps are represented by ‘[]’ symbol and key-value are separated by ‘#’ symbol, as
you can see in the above image.
● Example of maps− [band#Linkin Park, members#7 ], [band#Metallica, members#8 ]
● Now that we have learned Pig Latin's data model, we will understand how Apache Pig handles schemas as well as how it works with schema-less data.
Schema
A schema assigns a name to each field and declares the data type of the field. The schema is optional in Pig Latin, but Pig encourages you to use one whenever possible, as error checking becomes efficient while parsing the script, which results in efficient execution of the program. The schema can be declared using both simple and complex data types. During the LOAD function, if the schema is declared, it is attached to the data.
Hive Features
● It is a Data Warehousing tool.
● It is used for enterprise data wrangling.
● It uses the SQL-like language HiveQL or HQL. HQL is a non-procedural and
declarative language.
● It is used for OLAP operations.
● It increases productivity by reducing 100 lines of Java code into 4 lines of HQL
queries.
● It supports Table, Partition, and Bucket data structures.
● It is built on top of Hadoop Distributed File System (HDFS)
● Hive supports Tez, Spark, and MapReduce.
Hive Architecture
● Shell/CLI: It is an interactive interface for writing queries.
● Driver: Handles sessions, and fetches and executes the operations.
● Compiler: Parses, plans and optimises the code.
● Execution: In this phase, MapReduce jobs are submitted to Hadoop and get executed.
● Metastore: The Metastore is a central repository that stores the metadata. It keeps all the details about tables, partitions, and buckets.
● Example: Suppose we have a large dataset containing information about sales
transactions for a retail company, stored in HDFS. We want to analyse this data
using Hive to gain insights into sales performance.
Shell / CLI
● The Shell or Command-Line Interface (CLI) is an interactive interface where users
can write and execute HiveQL queries to interact with the Hive system.
● Users can use the Hive shell to submit queries, manage tables, and perform other
operations
Driver
● The Driver is responsible for handling user sessions, interpreting queries submitted
through the shell/CLI, and coordinating the execution of these queries.
● It interacts with other components of the Hive architecture to process user queries.
Compiler
● The Compiler receives the HiveQL queries submitted by users and performs several
tasks:
➔ Parsing: It parses the query to understand its syntactic structure and extract
relevant information.
➔ Planning: It generates an execution plan based on the query's logical
structure, determining the sequence of operations needed to fulfil the query.
➔ Optimization: It optimises the execution plan to improve query performance
by considering factors such as data locality, join order, and filter pushdown.
Execution
● In this phase, the optimised execution plan is translated into one or more
MapReduce jobs, which are submitted to the Hadoop cluster for execution.
● These MapReduce jobs process the data stored in HDFS according to the query's
requirements and produce the desired result.
● Example: Suppose the query requires aggregating sales data by product ID. The
execution phase would involve MapReduce jobs that read and process the sales
data, performing the aggregation operation as specified in the query.
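As a hedged sketch, the aggregation described above might be submitted from Python with the PyHive client. The host, database, table and column names follow the hypothetical sales example, and a running HiveServer2 endpoint is assumed.

```python
# Sketch of the sales aggregation query submitted through PyHive.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, database="retail")
cursor = conn.cursor()

# HiveQL is parsed, planned and optimised by the compiler, then executed
# as MapReduce (or Tez/Spark) jobs over the data files in HDFS.
cursor.execute("""
    SELECT product_id,
           SUM(quantity_sold) AS total_units,
           SUM(revenue)       AS total_revenue
    FROM sales
    GROUP BY product_id
""")

for product_id, total_units, total_revenue in cursor.fetchall():
    print(product_id, total_units, total_revenue)

conn.close()
```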
MetaStore
● The Metastore is a central repository that stores metadata about Hive objects such
as databases, tables, partitions, columns, and storage properties. It keeps track of all
the details required to manage and query data stored in Hive.
● Example: The Metastore stores metadata about the sales table, including its schema
(columns: transaction_id, product_id, quantity_sold, revenue), data location in HDFS,
partitioning information (if any), and any associated storage properties.
HIVE vs SQL
● Traditional SQL databases do not use MapReduce, while Hive can execute its queries as MapReduce (or Tez/Spark) jobs.
● Hive does not support the UPDATE command due to the limitations and natural structure of HDFS; Hive only has INSERT OVERWRITE for update or insert functionality.
● In Hive, HiveQL queries are internally converted into MapReduce jobs that run on the Hadoop cluster, whereas SQL queries are executed directly by the database engine.
● SQL databases are designed for transactional workloads on comparatively small, frequently updated tables, while Hive works with large, mostly read-only datasets stored as files in HDFS.
Hive Limitations
● Hive is suitable for batch processing but not for real-time data handling.
● Update and delete are not allowed, but we can delete in bulk, i.e. we can delete an entire table but not individual records.
● Hive is not suitable for OLTP (Online Transaction Processing) operations.
HBase Introduction
● HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable.
● Apache HBase is a column-oriented key/value data store built to run on top of the
Hadoop Distributed File System (HDFS).
● Hadoop is a framework for handling large datasets in a distributed computing
environment.
1. Horizontal
● Horizontal scalability, also called scaling out, increases capacity by adding more machines to the cluster and spreading the data and load across them. This is the kind of scalability HBase relies on.
2. Vertical
● Vertical scalability, on the other hand, increases capacity by adding more resources, such as more memory or an additional CPU, to a single machine.
● Scaling vertically, which is also called scaling up, usually requires downtime while new resources are being added and has limits that are defined by the hardware.
HBase vs RDBMS
HBase
● Row key: the reference of the row; it is used to make the search for a record faster.
● Column Families: a combination of a set of columns. Data belonging to the same column family can be accessed together in a single seek, allowing faster processing.
● Column Qualifiers: each column's name is known as its column qualifier.
● Cell: the storage area of the data. Each cell is addressed by a row key and a column qualifier.
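A small sketch of this data model using the happybase Python client (which talks to HBase through the Thrift gateway). The table, row key, column family and qualifier names are hypothetical, and the table is assumed to already exist.

```python
# Writing and reading one HBase row via happybase: row key + column
# family:qualifier -> cell value.
import happybase

connection = happybase.Connection(host="hbase-thrift.example.com", port=9090)
table = connection.table("patients")

# Row key 'patient#1001'; 'info' and 'history' are column families,
# 'name' and '2024-01-15' are column qualifiers; each value lives in a cell.
table.put(b"patient#1001", {
    b"info:name": b"Jane Doe",
    b"info:age": b"42",
    b"history:2024-01-15": b"annual check-up",
})

# Reading the row back fetches cells by row key and column family:qualifier.
row = table.row(b"patient#1001", columns=[b"info"])
print(row[b"info:name"])

connection.close()
```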
Architecture
● HBase has three crucial components:
➔ ZooKeeper, used for coordination and monitoring.
➔ HMaster Server, which assigns regions and performs load balancing.
➔ Region Servers, which serve data for reads and writes; these are the different computers in the Hadoop cluster. Each Region Server hosts regions, an HLog (write-ahead log), and in-memory stores (MemStores).
Applications
● Medical: In the medical field, HBase is used for storing genome sequences and running MapReduce on them, and for storing the disease history of people or of an area.
● Sports: For storing match histories for better analytics and prediction.
● E-Commerce: For recording and storing logs about customer search history, as well as for performing analytics and then targeting advertisements for better business.