Big Data Analytics


Types of Data
● Data is largely classified as Structured, Semi-Structured and Unstructured.
● If we know the fields as well as their datatype, then we call it structured. The data
in relational databases such as MySQL, Oracle or Microsoft SQL is an example of
structured data.
● Data in which we know the fields or columns but not their data types is called semi-structured data. For example, CSV (comma-separated values) files are semi-structured.
● If our data doesn't contain columns or fields, we call it unstructured data. The data
in the form of plain text files or logs generated on a server are examples of
unstructured data.
● The process of translating unstructured data into structured data is known as ETL -
Extract, Transform and Load.
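
As a toy illustration of the Extract-Transform-Load idea just mentioned, the sketch below parses an unstructured web-server log line into a structured record. The log format and field names are assumptions made purely for this example.

```python
import re

# Hypothetical log line in a common Apache-style format (an assumption for this sketch).
raw_line = '10.0.0.7 - - [01/Mar/2024:10:15:32 +0000] "GET /cart HTTP/1.1" 200 512'

# Extract: pull the raw text apart with a regular expression.
pattern = r'(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) \S+" (\d{3}) (\d+)'
match = re.match(pattern, raw_line)

# Transform: convert the pieces into typed fields.
record = {
    "ip": match.group(1),
    "timestamp": match.group(2),
    "method": match.group(3),
    "path": match.group(4),
    "status": int(match.group(5)),
    "bytes": int(match.group(6)),
}

# Load: here we just print the structured record; a real pipeline would write it
# to a relational table or a data warehouse.
print(record)
```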

Distributed System
● When networked computers are utilised to achieve a common goal, it is known as a
distributed system.
● The work gets distributed amongst many computers.
● The branch of computing that studies distributed systems is known as distributed
computing.
● The purpose of distributed computing is to get the work done faster by utilising
many computers.
● Most but not all the tasks can be performed using distributed computing.
What is Big Data?
● In very simple words, Big Data is data of very big size which cannot be processed with the usual tools.
● To process such data, we need a distributed architecture. This data could be
structured or unstructured.

Characteristics of Big Data


Volume: Volume refers to the sheer amount of data generated and collected from various
sources. With the advent of digital technologies, the volume of data being produced has
grown exponentially. Examples of large volume data include social media posts, sensor
data from IoT devices, financial transactions, and web logs.
Example: Social Media Data
● Platforms like Facebook, Twitter, and Instagram generate enormous amounts of data
every second, including posts, comments, likes, shares, and multimedia content.
Analysing this massive volume of data can provide insights into user behaviour,
sentiment analysis, and trends.

Velocity: Velocity refers to the speed at which data is generated, collected, and processed.
In many Big Data applications, data is produced and updated in real-time, requiring rapid
processing and analysis to derive meaningful insights. Examples of high-velocity data
include streaming data from sensors, financial market data, and social media feeds.
Example: Sensor Data in Smart Cities
● In smart city implementations, sensors deployed throughout the city continuously
collect data on various parameters such as traffic flow, air quality, temperature, and
energy consumption. This data is streamed in real-time to city management
systems, where it is analysed to optimise traffic flow, improve public services, and
enhance overall urban planning.

Variety: Variety refers to the diverse types and sources of data that exist, including
structured, semi-structured, and unstructured data formats. Traditional relational databases
are well-suited for structured data, but Big Data platforms must be able to handle a wide
variety of data types, including text, images, videos, and sensor data.
Example: Healthcare Data
● Healthcare organisations collect vast amounts of data from diverse sources,
including electronic health records (structured data), medical images
(unstructured data), wearable devices (sensor data), and social media
discussions about health issues (text data). Integrating and analysing this
variety of data can lead to improved patient care, disease prevention, and
medical research.
Veracity: Veracity refers to the quality and reliability of the data. In Big Data environments,
where data is sourced from diverse and often unstructured sources, ensuring data accuracy
and reliability is crucial for making informed decisions. Veracity encompasses issues such as
data consistency, completeness, accuracy, and trustworthiness. Poor data quality can lead
to erroneous insights and decisions.
Example: Social Media Sentiment Analysis
● When analysing social media data for sentiment analysis, ensuring the veracity of
the data is essential. Social media platforms may contain spam, fake accounts, or
misleading information, which can skew the analysis results. Techniques such as
data cleansing, outlier detection, and sentiment validation are employed to enhance
data veracity and improve the accuracy of sentiment analysis.

Value: Value refers to the ultimate goal of Big Data analytics—to derive actionable insights
and create value for organisations. While processing large volumes of data at high velocity
from diverse sources is important, the true measure of success lies in the ability to extract
meaningful insights that drive business outcomes, improve decision-making, and create
competitive advantages. Value is realised when organisations can translate Big Data
analytics into tangible benefits, such as increased revenue, cost savings, enhanced
customer satisfaction, or innovation.
Example: Personalised Marketing Campaigns
● Retailers utilise Big Data analytics to analyse customer behaviour, preferences, and
purchase history to create personalised marketing campaigns. By understanding
individual customer preferences and behaviour patterns, retailers can tailor product
recommendations, promotions, and offers, ultimately increasing sales and customer
loyalty. The value lies in the ability to deliver targeted marketing messages that
resonate with customers, resulting in higher conversion rates and improved ROI.

Why do we need Big Data now?


● Digital storage started increasing exponentially after 2002, while analog storage remained practically the same.
● The year 2002 is called the beginning of the digital age. Why so? The answer is twofold: devices and connectivity. On one hand, devices became cheaper, faster and smaller; smartphones are a great example. On the other, connectivity improved: we have Wi-Fi, 4G, Bluetooth, NFC, etc.
● This led to a lot of very useful applications such as a very vibrant world wide web,
social networks, and Internet of things leading to huge data generation.
● Roughly, a computer is made of 4 components. 1. CPU, which executes instructions. The CPU is characterised by its speed: the more instructions it can execute per second, the faster it is considered.
● Then comes RAM, or Random Access Memory. While processing, we load data into RAM. If we can load more data into RAM, the CPU can perform better. So, RAM has two main attributes which matter: its size and its speed of reading and writing.
● To permanently store data, we need a hard disk drive or solid state drive. The SSD is
faster but smaller and costlier. The faster and bigger the disk, the faster we can
process data.
● Another component that we frequently forget while thinking about the speed of computation is the network. Why? Often our data is stored on different machines, and we need to read it over a network to process it.
● While processing Big Data at least one of these four components becomes the
bottleneck. That's where we need to move to multiple computers or distributed
computing architecture.

Big Data Applications


● So far we have tried to establish that while handling humongous data we would
need a new set of tools which can operate in a distributed fashion.
● But who would be generating such data or who would need to process such
humongous data? Quick answer is everyone.
● In the e-commerce industry, recommendations are a great example of Big Data processing. Recommendation, also known as collaborative filtering, is the process of suggesting a product to someone based on their preferences or behaviour.
● The e-commerce website would gather a lot of data about the customer's
behaviour. In a very simplistic algorithm, we would basically try to find similar users
and then cross-suggest them the products. So, the more the users, the better the
results.
● As per Amazon, a major chunk of their sales happen via recommendations on
websites and email. The other big example of Big Data processing was Netflix's 1
million dollar competition to generate movie recommendations.
● As of today, generating recommendations has become pretty simple. Engines such as MLlib or Mahout have made it very simple to generate recommendations on humongous data. All you have to do is format the data in the three-column format: user id, movie id, and rating.
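
As a sketch of how simple this has become, the snippet below trains a collaborative-filtering model with Spark MLlib's ALS on a tiny (user id, movie id, rating) dataset. The column names and toy ratings are assumptions for this example, and a local PySpark installation is assumed.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recs-sketch").getOrCreate()

# Ratings in the three-column format: user id, movie id, rating (made-up data).
ratings = spark.createDataFrame(
    [(1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0), (2, 30, 2.0), (3, 20, 4.5)],
    ["user_id", "movie_id", "rating"],
)

# Collaborative filtering via alternating least squares.
als = ALS(userCol="user_id", itemCol="movie_id", ratingCol="rating",
          rank=10, maxIter=5, coldStartStrategy="drop")
model = als.fit(ratings)

# Top-3 movie recommendations for every user.
model.recommendForAllUsers(3).show(truncate=False)
spark.stop()
```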

A/B Testing
● The idea of A/B testing is to compare two or more versions of the same thing and analyse which one works better. It not only tells us which version works better, but also provides evidence of whether the difference between the two versions is statistically significant or not (a minimal significance test is sketched after this list).
● Let us go back to the history of A/B testing. In the 1950s, scientists and medical researchers started conducting A/B tests to run clinical trials for testing drug efficacy. In the 1960s and 1970s, marketers adopted this strategy to evaluate direct response campaigns: for instance, would a postcard or a letter to target customers result in more sales? Later, with the advent of the world wide web, A/B testing was also adopted by the software industry to understand user preferences. So basically, the concept which was applied offline is now being applied online.
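
As a minimal sketch of the significance check mentioned above, the snippet below runs a chi-square test on hypothetical conversion counts for variations A and B using scipy; the visitor and conversion numbers are made up for illustration.

```python
from scipy.stats import chi2_contingency

# Hypothetical results of an A/B test (made-up numbers).
visitors_a, conversions_a = 10000, 310   # variation A
visitors_b, conversions_b = 10000, 370   # variation B

# 2x2 contingency table: converted vs not converted, per variation.
table = [
    [conversions_a, visitors_a - conversions_a],
    [conversions_b, visitors_b - conversions_b],
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value = {p_value:.4f}")

# A common convention: treat p < 0.05 as evidence that the difference
# between the two variations is statistically significant.
if p_value < 0.05:
    print("The difference between A and B is statistically significant.")
else:
    print("No significant difference detected; collect more data or keep A.")
```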

Why is A/B Testing Important?


● Let us now understand why A/B testing is so important. In a real-world scenario,
Companies might think that they know their customers well. For example, a
company anticipates that variation B of the website would be more effective in
making more sales compared to variation A. But in reality, users rarely behave as
expected, and Variation A might lead to more sales.
● So, to understand their users better, companies rely on data-driven approaches. They analyse and understand user behaviour based on the user data they have. Thus, the more the data, the fewer the errors. This in turn contributes to making reliable business decisions, which can lead to increased user engagement, improved user experience, higher revenue, staying ahead of competitors, and more.

Real life application of A/B Testing


● Let us look at how some A/B testing cases led to more effective decisions in the real world. In 2009, a team at Google couldn't decide between two shades of blue for advertisement links in Gmail, so they tested 41 different shades to decide which colour to use.
● The company showed each shade of blue to 1% of users. A slightly purple shade of blue got the maximum clicks, resulting in a $200 million boost in ad revenue. Likewise, Netflix, Amazon, Microsoft and many others use A/B testing. At Amazon, we have many such experiments running all the time; every feature is launched via A/B testing. It is first shown to, say, 1 percent of users, and if it performs well, we increase the percentage.

Big Data Solutions


● There are many Big Data solution stacks. The first and most powerful stack is Apache Hadoop and Spark together: Hadoop provides storage for structured and unstructured data, while Spark provides the computational capability on top of Hadoop.
● The second option could be to use Cassandra or MongoDB. A third could be to use Google Compute Engine or Microsoft Azure; in such cases, you would have to upload your data to Google or Microsoft, which may not always be acceptable to your organisation.

Apache Hadoop
● Hadoop was created by Doug Cutting in order to build his search engine called
Nutch. He was joined by Mike Cafarella. Hadoop was based on the three papers
published by Google: Google File System, Google MapReduce, and Google Big
Table.
● It is named after the toy elephant of Doug Cutting's son.
● Hadoop is under Apache licence which means you can use it anywhere without
having to worry about licensing.
● It is quite powerful, popular and well supported.
● It is a framework to handle Big Data.
● Started as a single project, Hadoop is now an umbrella of projects.
● All of the projects under the Apache Hadoop umbrella should have the following three characteristics:
● 1. Distributed - they should be able to utilise multiple machines in order to solve a problem.
● 2. Scalable - if needed, it should be very easy to add more machines.
● 3. Reliable - if some of the machines fail, it should still work fine.
● These are the three criteria for all the projects or components under the Apache Hadoop umbrella.
● Hadoop is written in Java so that it can run on all kinds of devices.

HDFS
● HDFS or Hadoop Distributed File System is the most important component because
the entire ecosystem depends upon it. It is based on the Google File System.
● It is basically a file system which runs on many computers to provide humongous
storage. If you want to store your petabytes of data in the form of files, you can use
HDFS.

YARN
● YARN, or Yet Another Resource Negotiator, keeps track of all the resources (CPU, memory) of the machines in the network and runs the applications. Any application which wants to run in a distributed fashion interacts with YARN.

Flume
● Flume makes it possible to continuously pump unstructured data from many sources to a central store such as HDFS.
● If you have many machines continuously generating data such as Web Server Logs,
you can use flume to aggregate data at a central place such as HDFS.

Sqoop
● Sqoop is used to transport data between Hadoop and SQL Databases. Sqoop
utilises MapReduce to efficiently transport data using many machines in a network.

HBase
● HBase provides humongous storage in the form of a database table. So, to manage
humongous records, you would like to use HBase.
● HBase is a NoSQL Datastore.

Map Reduce
● MapReduce is a framework for distributed computing. It utilises YARN to execute
programs and has a very good sorting engine.
● You write your programs in two parts: map and reduce. The map part transforms the raw data into key-value pairs, and the reduce part groups and combines data based on the key. We will learn MapReduce in detail later.

Spark
● Spark is another computational framework similar to MapReduce but faster and
more recent. It uses similar constructs as MapReduce to solve big data problems.
● Spark has its own huge stack. We will cover it in detail soon.

Hive
● Writing code in MapReduce is very time-consuming. So, Apache Hive makes it
possible to write your logic in SQL which internally converts it into MapReduce. So,
you can process humongous structured or semi-structured data with simple SQL
using Hive.

Pig
● Pig Latin is a simplified SQL-like language to express your ETL needs in a stepwise fashion.
● Pig is the engine that translates Pig Latin into Map Reduce and executes it on
Hadoop.

Mahout
● Mahout is a library of machine learning algorithms that run in a distributed fashion. Since machine learning algorithms are complex and time-consuming, Mahout breaks the work down so that it gets executed as MapReduce jobs running on many machines.

Zookeeper
● Apache Zookeeper is an independent component which is used by various
distributed frameworks such as HDFS, HBase, Kafka, YARN. It is used for the
coordination between various components. It provides a distributed configuration
service, synchronisation service, and naming registry for large distributed systems.

Oozie
● Since a project might involve many components, there is a need for a workflow
engine to execute work in sequence.
● For example, a typical project might involve importing data from SQL Server,
running some Hive Queries, doing predictions with Mahout, Saving data back to an
SQL Server.
● This kind of workflow can be easily accomplished with Oozie.

User Interface
● A user can either talk to the various components of Hadoop using Command Line
Interface, Web interface, API or using Oozie.
● We will cover each of these components in detail later.
Google File System

Design Considerations
1. Commodity Hardware
● When Google was still a startup, instead of buying expensive servers, they chose to buy off-the-shelf commodity hardware.
● The reason they chose commodity hardware was that it is cheap, and with a lot of such servers they could scale horizontally, given the right software layer created on top of it.
● Failures are common:
➔ Disk / Network / Server
➔ OS Bugs
➔ Human Errors

2. Fault Tolerant
● Therefore, GFS needs to be designed in such a way that it can still perform reliably in the face of such failures.

Large Files
● GFS is optimised to store and read large files
● A typical file in GFS ranges from 100MB to Multiple GBs

1. File Operations
● GFS was optimised for only 2 kinds of file operations:
➔ Writes to any file are generally append-only; there are hardly any random writes to a file.
➔ Reads are sequential reads.
● So, there could be a crawler which keeps appending all the crawled documents (HTML pages) to a single file, and then a batch processing system which reads this entire large file and creates a search index out of it.

2. Chunks
● A single file is not stored on a single server.
● It is actually subdivided into multiple chunks.
● Each chunk is 64 MB.
● These chunks can be spread across multiple machines.
● These machines (commodity hardware) are also called chunk servers.
● Each chunk is identified by a globally unique 64-bit ID.
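
A toy sketch of the chunking idea described above: it computes the byte ranges of 64 MB chunks for a file and assigns each a stand-in 64-bit ID. Real GFS chunk handles are assigned by the master; the helper below is purely illustrative.

```python
import uuid

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as described above

def split_into_chunks(file_size_bytes):
    """Return (chunk_index, chunk_id, byte_range) entries for a file of the given size."""
    chunks = []
    offset, index = 0, 0
    while offset < file_size_bytes:
        end = min(offset + CHUNK_SIZE, file_size_bytes)
        chunk_id = uuid.uuid4().int >> 64        # stand-in for a globally unique 64-bit ID
        chunks.append((index, chunk_id, (offset, end)))
        offset, index = end, index + 1
    return chunks

# A 200 MB file maps to four chunks: three full 64 MB chunks and one 8 MB tail.
print(len(split_into_chunks(200 * 1024 * 1024)))  # -> 4
```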

Replicas
● Since these files are stored on commodity hardware, which can go down at any time, if you store only one copy of a file it is highly possible that you lose that file forever.
● Thus, GFS ensures that each chunk of your file has at least 3 replicas across 3 different servers.
● So even if 1 server goes down, you have the other 2 replicas to work with.
● The replica count is 3 by default, but it can be configured by the client writing to the file.
● It is difficult for the client application to know which chunk of the file is residing on which server.
GFS Master
● The GFS master server holds all the metadata.
● The metadata includes:
● Names of the files
● Number of chunks for each file
● IDs of those chunks
● The servers on which those chunks reside
● It also stores the access control details, i.e. which client is allowed to access which file.
GFS Read Architecture

GFS Write Architecture

Heartbeats
● Chunk servers send regular heartbeats to the master to show that they are alive.

When a chunk server goes down
● The master ensures all chunks that were on the dead chunk server are copied onto other servers.
● It ensures that the replica count remains the same.

Single Master for the Entire Cluster
● There can be 100s of clients, 1000s of chunk servers and terabytes of data, yet there is only one master node in the entire cluster.
● The large 64 MB chunk size keeps the amount of metadata the master must hold small.
● The client interacts with the master only to get the metadata; the actual data is read from the chunk servers.
● The client caches chunk location data.

Operation Log
● A log of all operations is written to an append-only log called the operation log.
● Each file operation, with a corresponding timestamp and the details of the user who performed it, is stored in this operation log.
● This log is very important; therefore it is written directly to disk and replicated to some remote machine.
● If the master itself crashes, it can replay the operation log to recreate the entire filesystem namespace, along with the chunk IDs, and get back to the earlier state of the system.
● The operation log may become too long, so a background thread regularly checkpoints the log.
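
A minimal sketch of the append-and-replay idea: metadata mutations are appended to a log on disk and replayed to rebuild the namespace after a crash. The file location, JSON format and operation fields are assumptions made for this illustration.

```python
import json

LOG_PATH = "/tmp/gfs_operation_log.jsonl"   # hypothetical location for this sketch

def append_op(op):
    # Each metadata mutation is appended (with a timestamp/user in real GFS) and flushed to disk.
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(op) + "\n")

def replay():
    # After a master crash, replaying the log rebuilds the namespace -> chunk-ID mapping.
    namespace = {}
    with open(LOG_PATH) as log:
        for line in log:
            op = json.loads(line)
            if op["type"] == "create":
                namespace[op["path"]] = op["chunk_ids"]
            elif op["type"] == "delete":
                namespace.pop(op["path"], None)
    return namespace

append_op({"type": "create", "path": "/crawl/part-0", "chunk_ids": [101, 102]})
print(replay())   # {'/crawl/part-0': [101, 102]}
```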
Map Reduce
● MapReduce is a programming model and an associated implementation for
processing and generating large datasets that are parallelizable. The concept was
introduced by Google in a white paper titled "MapReduce: Simplified Data
Processing on Large Clusters," authored by Jeffrey Dean and Sanjay Ghemawat,
which was published in 2004.
● The white paper outlines the MapReduce framework, which is designed to
efficiently process vast amounts of data across distributed clusters of computers. It
abstracts the complexities of parallelization, fault tolerance, data distribution, and
load balancing, allowing programmers to focus on writing simple map and reduce functions.

Need of Map Reduce


● Scalability: Google's vast datasets required a framework that could process data in
parallel across large clusters of hardware.
● Fault Tolerance: MapReduce's design allowed for seamless handling of hardware
failures, ensuring uninterrupted data processing.
● Ease of Programming: Simplified programming model abstracted away
complexities, making it easier for engineers to develop distributed applications.
● Flexibility: MapReduce could adapt to a wide range of data processing tasks,
catering to Google's diverse needs.
● Performance: Efficient parallel processing across distributed clusters led to quicker
and more efficient data processing compared to traditional methods.

How does it work?


● Google engineers realised that most data processing tasks could be split into 2 steps: Map and Reduce.
● They created a library to allow engineers to process huge datasets, on the order of petabytes, spread across thousands of machines.
● Map functions transform the data into key-value pairs. The key-value pairs are then shuffled around and reorganised, and are then reduced to some final output.
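
A pure-Python simulation of the map-shuffle-reduce flow just described, using the classic word-count example (the input lines are made up):

```python
from collections import defaultdict

def map_fn(line):
    # MAP: emit (word, 1) key-value pairs for every word in an input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # SHUFFLE: group all values belonging to the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # REDUCE: combine the grouped values into a single output per key.
    return key, sum(values)

lines = ["big data is big", "data needs distributed processing"]
mapped = [pair for line in lines for pair in map_fn(line)]
reduced = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
print(reduced)   # {'big': 2, 'data': 2, 'is': 1, ...}
```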
Distributed File System
● When dealing with the MapReduce model, we assume that we have a distributed file system, i.e. we have some large datasets which have been split into chunks.

No Data Movement
● Since we are dealing with a very large dataset, we do not want to move the data around.
● We let it stay on the machines where it already exists.
● So how do we process the data? We let the MAP function operate on the data locally. Instead of aggregating the data and moving it somewhere else, we send the MAP programs to the data and avoid moving the large dataset.
Key Value Structure
● In MapReduce, the chunks come from the same large dataset.
● When we try to reduce these data chunks, we are very likely to be looking for some sort of commonality or pattern in them.
● Since the chunks are from the same dataset, we will see that a lot of them have the same keys.
● It then becomes easier to reduce them into one single value based on that key.
● We'll discuss this in detail when we look at our example.

Machine Failure
● To handle failures, the MapReduce model simply re-runs the MAP or REDUCE operations where the failure occurred.
Idempotent
This retry strategy works only if your MAP and REDUCE functions are idempotent, i.e. if you run MAP and REDUCE multiple times the outcome doesn't change.

How does Map Reduce work


When dealing with the MapReduce model, engineers need to make sure that they have a good understanding of the inputs and expected outputs:
● What is the output of the MAP function?
● What does the intermediary key-value pair look like?
● How can the pairs be reorganised in a meaningful way as input to the REDUCE function?
● How can it give optimal results?
Thus, the engineers only need to worry about the inputs and outputs at each step, rather than the intricacies of the underlying framework.
Hadoop Distributed File System
HDFS (Hadoop Distributed File System) is a key component of the Hadoop ecosystem,
designed for storing large-scale data across distributed clusters. Its role includes:
● Scalability: Scales horizontally to handle petabytes of data by distributing it across
multiple nodes.
● Fault Tolerance: Data replication ensures reliability and fault tolerance, allowing for
easy recovery from node failures.
● High Throughput: Optimised for streaming data access, ideal for batch processing
of big data with high throughput.
● Data Locality: Maximises performance by processing data on the same nodes
where it's stored, reducing network overhead.
● Ease of Use: Provides a simple interface for storing and retrieving data, integrated
with other Hadoop ecosystem tools like MapReduce and Spark.
Components of HDFS : Name node, Data node

Major Challenges in Big Data


There are 2 major challenges when it comes to Big Data:
● Storage of large amounts of data (HDFS)
● Processing of large amounts of data (MapReduce)

File systems
● Traditional file systems are hierarchical file systems (e.g. NTFS, NFS).
● The same hierarchical model has been applied to HDFS.

Name node VS Data node

Name node:
● Centralised metadata repository.
● Manages the file system namespace and metadata.
● Stores information about file locations, sizes, permissions, etc.
● Single point of failure (High Availability configurations mitigate this).

Data node:
● Stores actual data blocks in the file system.
● Responsible for serving read and write requests from clients.
● Reports regularly to the NameNode about the list of blocks it stores.
● Multiple DataNodes ensure data replication and fault tolerance.
Chunks or Blocks

Rack Awareness
A rack is a physical collection of nodes; generally, 30-40 nodes come under one rack. Rack awareness means the NameNode knows which rack each DataNode belongs to and uses this while placing replicas.
Functions of Name node
Metadata Management
● Manages the file system namespace and metadata.
● Stores information about file locations, sizes, permissions, and directory structure.
FsImage
● Contains a snapshot of the file system metadata.
● Represents the directory structure and file metadata at a specific point in time.
● Loaded into memory when NameNode starts to reconstruct the file system
namespace.
Edit Logs
● Record every change made to the file system metadata.
● Store a sequential log of all modifications to the file system, such as file creations,
deletions, and modifications.
● Persistently stores metadata changes to ensure durability and recoverability.
Heartbeats
● Regular messages sent by DataNodes to the NameNode.
● Indicate that DataNodes are operational and report their current status.
● Helps NameNode monitor the health and status of DataNodes in the cluster.
● Enables the NameNode to detect and handle failures promptly.

Functions of Data node


Data Storage
● Stores actual data blocks of files in the file system.
● Responsible for storing and retrieving data upon client requests.
Data Replication
● Replicates data blocks to ensure fault tolerance and data availability.
● Creates and maintains multiple copies of data blocks across different DataNodes.
Block Reports
● Periodically sends reports to the NameNode containing the list of blocks stored
locally.​
● Allows NameNode to keep track of data block locations and manage data
replication.
Heartbeats
● Sends regular heartbeats to the NameNode to report its status and availability.
● Enables NameNode to monitor the health and status of DataNodes in the cluster.
Zookeeper
In very simple words, Zookeeper is a central key-value store using which distributed systems can coordinate. Since it needs to be able to handle the load, Zookeeper itself runs on many machines.
● Zookeeper provides a simple set of primitives, and it is very easy to program to. It
uses a data model like a directory tree.
● It is used for synchronisation, locking, maintaining configuration and failover
management.
● It does not suffer from Race Conditions and DeadLocks.

Race Condition
● When two processes are competing, causing data corruption.
● As shown in the diagram, two individuals are trying to deposit 1 dollar online into the same bank account. The initial amount is 17 dollars. Due to the race condition, the final amount in the bank is 18 dollars instead of 19.
● In general, a race condition is when 2 processes try to access the same resource at the same time, causing unpredictable results.
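
The sketch below mirrors the deposit example in Python threads. The unprotected read-modify-write is exactly the interleaving the diagram describes; note that CPython's GIL often hides the race for such a tiny update, so the lock is shown here as the fix rather than as a guaranteed demonstration of corruption.

```python
import threading

balance = 17                       # initial amount, as in the example above
lock = threading.Lock()

def deposit():
    global balance
    # Without the lock, the read-modify-write below could interleave with the
    # other thread's update, so one of the two deposits could be lost (17 -> 18).
    with lock:
        current = balance          # read
        current = current + 1      # compute
        balance = current          # write

t1 = threading.Thread(target=deposit)
t2 = threading.Thread(target=deposit)
t1.start(); t2.start()
t1.join(); t2.join()
print(balance)                     # 19: the lock serialises the two deposits
```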

Deadlock
● When two or more processes are waiting for each other, directly or indirectly, it is called a deadlock.
● As you can see in the second diagram, process 1 is waiting for process 2, process 2 is waiting for process 3 to finish, and process 3 is waiting for process 1 to finish. All three processes keep waiting and will never end. This is called a deadlock.
Process VS Thread

Data Model
● The way you store data in any store is called a data model. In the case of Zookeeper, think of the data model as a highly available file system with a few differences.
● We store data in an entity called a znode. The data we store is typically kept in JSON (JavaScript Object Notation) format.
● A znode can only be updated as a whole; it does not support append operations. A read or write is an atomic operation, meaning it either completes fully or throws an error if it fails. There is no intermediate state like a half-written znode.
● A znode can have children, so znodes inside znodes make a tree-like hierarchy. The top-level znode is "/".
● The znode "/zoo" is a child of "/", which is the top-level znode. "duck" is a child znode of "/zoo" and is denoted as /zoo/duck.
● Unlike in a file system, "." and ".." are not valid names.
Zookeeper Data model

Types of Znode
1. Persistent Znode
● Such kinds of znodes remain in zookeeper until deleted. This is the default type of
znode. To create such node you can use the command: create /name_of_myznode
"mydata"
2. Ephemeral Znode
● Ephemeral znodes get deleted when the session in which they were created disconnects. Though an ephemeral znode is tied to the client's session, it is visible to other clients.
● An ephemeral znode cannot have children, not even ephemeral children.

3. Sequential Znode
● Quite often, we need to create sequential numbers such as IDs. In such situations we use sequential znodes.
● Sequential znodes are created with a number appended to the provided name.
● You can create one by using create -s. The following command would create a node named zoo followed by a number: create -s /zoo v
● This number increases monotonically for every node created under a particular parent node. The first sequential child node gets the suffix 0000000000.
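
The commands above use the ZooKeeper CLI; the equivalent operations from Python look roughly like the sketch below, assuming a ZooKeeper ensemble reachable at localhost:2181 and the kazoo client library.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Persistent znode: stays until explicitly deleted (the default type).
zk.create("/zoo", b'{"animal": "duck"}', makepath=True)

# Ephemeral znode: removed automatically when this client session ends.
zk.create("/zoo/lock", b"", ephemeral=True)

# Sequential znode: the server appends a monotonically increasing counter,
# e.g. /zoo/id0000000000 for the first child.
path = zk.create("/zoo/id", b"", sequence=True)
print(path)

# Reads are atomic: we get the full data plus a stat (version, timestamps, ...).
data, stat = zk.get("/zoo")
print(data, stat.version)

zk.stop()
```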
Apache Pig VS Map Reduce
● Pig Latin is a high-level data flow language, whereas MapReduce is a low-level
data processing paradigm.
● Without writing complex Java implementations in MapReduce, programmers can
achieve the same implementations very easily using Pig Latin.
● Apache Pig uses a multi-query approach (i.e. using a single query of Pig Latin we
can accomplish multiple MapReduce tasks), which reduces the length of the code by
20 times. Hence, this reduces the development period by almost 16 times.
● Pig provides many built-in operators to support data operations like joins, filters,
ordering, sorting etc. Whereas to perform the same function in MapReduce is a
humongous task.
● Performing a Join operation in Apache Pig is simple. Whereas it is difficult in
MapReduce to perform a Join operation between the data sets, as it requires
multiple MapReduce tasks to be executed sequentially to fulfil the job.
● In addition, it also provides nested data types like tuples, bags, and maps that are
missing from MapReduce. I will explain to you these data types in a while.

Apache Pig
● Pig enables programmers to write complex data transformations without knowing
Java.
● Apache Pig has two main components — the Pig Latin language and the Pig
Run-time Environment, in which Pig Latin programs are executed.
● For Big Data Analytics, Pig gives a simple data flow language known as Pig Latin
which has functionalities like SQL like join, filter, limit etc.
● Developers who are working with scripting languages and SQL, leverage Pig Latin.
This gives developers ease of programming with Apache Pig. Pig Latin provides
various built-in operators like join, sort, filter, etc to read, write, and process large
data sets. Thus, it is evident, Pig has a rich set of operators.
● Programmers write scripts using Pig Latin to analyse data and these scripts are
internally converted to Map and Reduce tasks by Pig MapReduce Engine. Before Pig,
writing MapReduce tasks was the only way to process the data stored in HDFS.
● If a programmer wants to write custom functions which are unavailable in Pig, Pig
allows them to write User Defined Functions (UDF) in any language of their choice
like Java, Python, Ruby, Jython, JRuby etc. and embed them in Pig script. This
provides extensibility to Apache Pig.
● Pig can process any kind of data, i.e. structured, semi-structured or unstructured
data, coming from various sources. Apache Pig handles all kinds of data.
● Approximately, 10 lines of pig code are equal to 200 lines of MapReduce code.
● It can handle inconsistent schema (in case of unstructured data).
● Apache Pig extracts the data, performs operations on that data and dumps the data
in the required format in HDFS i.e. ETL (Extract Transform Load).
● Apache Pig automatically optimises the tasks before execution, i.e. automatic
optimization.
● It allows programmers and developers to concentrate upon the whole operation
irrespective of creating mapper and reducer functions separately.

Where to use Pig?


● Where we need to process huge data sets, like web logs, streaming online data, etc.
● Where we need data processing for search platforms (different types of data need to be processed). For example, Yahoo uses Pig for 40% of its jobs, including news feeds and search engines.
● Where we need to process time-sensitive data loads, i.e. where data needs to be extracted and analysed quickly. For example, machine learning pipelines with time-sensitive data loads, such as Twitter's, need to quickly extract data about customer activities (i.e. tweets, re-tweets, and likes), analyse the data to find patterns in customer behaviour, and make recommendations immediately, like trending tweets.

Twitter Case study


● I will take you through a case study of Twitter where Twitter adopted Apache Pig.
● Twitter’s data was growing at an accelerating rate (i.e., 10 TB data/day). Thus,
Twitter decided to move the archived data to HDFS and adopt Hadoop for extracting
the business values out of it.
● Their major aim was to analyse data stored in Hadoop to produce the following
insights on a daily, weekly or monthly basis.
Counting Operation:
● How many requests does twitter serve in a day?
● What is the average latency of the requests?
● How many searches happen each day on Twitter?
● How many unique queries are received?
● How many unique users come to visit?
● What is the geographic distribution of the users?
Correlating Big Data:
● How does usage differ for mobile users?
● Cohort analysis: analysing data by categorising users, based on their behaviour.
● What goes wrong while a site problem occurs?
● Which features do users often use?
● Search correction and search suggestions.
Research on Big Data & produce better outcomes like:
● What can Twitter analyse about users from their tweets?
● Who follows whom and on what basis?
● What is the ratio of followers to following?
● What is the reputation of the user? Many more.....
So How Do You Proceed?
● So, for analysing data, Twitter used MapReduce initially, which is parallel computing
over HDFS (i.e. Hadoop Distributed File system).
● For example, they wanted to analyse how many tweets are stored per user, in the
given tweet table?
● Using MapReduce, this problem will be solved sequentially as shown in the below
image:

MapReduce
The MapReduce program first takes the rows of the tweet table as input and sends them to the mapper function. The mapper function then selects the user id and associates the unit value (i.e. 1) with every user id. The shuffle step sorts the same user ids together. At last, the reduce function adds up all the tweet counts belonging to the same user. The output is the user id, combined with the username, and the number of tweets per user.
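
A small Python simulation of that pipeline on the example users from this case study (Jay, Ellie, Sam); the tweet texts are placeholders:

```python
from collections import defaultdict

users = {1: "Jay", 2: "Ellie", 3: "Sam"}
tweets = [(1, "xyz"), (1, "pqr"), (1, "lmn"), (2, "abc"), (2, "vxy"), (3, "stu")]

# MAP: emit (user_id, 1) for every tweet row.
mapped = [(user_id, 1) for user_id, _text in tweets]

# SHUFFLE: group the unit values by user_id.
groups = defaultdict(list)
for user_id, one in mapped:
    groups[user_id].append(one)

# REDUCE: sum the grouped values, then attach the username.
for user_id, ones in groups.items():
    print(user_id, users[user_id], sum(ones))   # e.g. 1 Jay 3
```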
Limitations
● Analysis needs to be typically done in Java.
● Joins that are performed, need to be written in Java, which makes it longer and more
error-prone.
● For projection and filters, the custom code needs to be written which makes the
whole process slower.
● The job is divided into many stages while using MapReduce, which makes it difficult
to manage.
So, Twitter moved to Apache Pig for analysis. Now, joining data sets, grouping them,
sorting them and retrieving data becomes easier and simpler. You can see in the below
image how twitter used Apache Pig to analyse their large data set.

Refer ppt for video

The Data
● Twitter had both semi-structured data like Twitter Apache logs, Twitter search logs,
Twitter MySQL query logs, application logs and structured data like tweets, users,
block notifications, phones, favourites, saved searches, re-tweets, authentications,
SMS usage, user followings, etc. which can be easily processed by Apache Pig.
● Twitter dumps all its archived data on HDFS. It has two tables i.e. user data and
tweets data. User data contains information about the users like username,
followers, followings, number of tweets etc. While Tweet data contains tweets, its
owner, number of retweets, number of likes etc. Now, Twitter uses this data to
analyse their customer’s behaviours and improve their past experiences.

Question: How many tweets are stored per user in the given tweet table?
STEP 1– First of all, Twitter imports the Twitter tables (i.e. the user table and the tweet table) into HDFS.
STEP 2– Then Apache Pig loads (LOAD) the tables into the Apache Pig framework.
STEP 3– Then it joins and groups the tweet table and user table using the COGROUP command, as shown in the above image.
● This results in the inner Bag Data type, which we will discuss later.
● Example of Inner bags produced –
● (1,{(1,Jay,xyz),(1,Jay,pqr),(1,Jay,lmn)})
● (2,{(2,Ellie,abc),(2,Ellie,vxy)})
● (3, {(3,Sam,stu)})
STEP 4– Then the tweets are counted according to the users using the COUNT command, so that the total number of tweets per user can be easily calculated.
● Example of tuple produced as (id, tweet count) (refer to the image) –
● (1, 3)
● (2, 2)
● (3, 1)
STEP 5– At last, the result is joined with the user table to extract the username with the
produced result.
● Example of tuple produced as (id, name, tweet count) (refer to the image) –
● (1, Jay, 3)
● (2, Ellie, 2)
● (3, Sam, 1)
STEP 6– Finally, this result is stored back in the HDFS.
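
The same steps expressed as a small Python sketch, mimicking what COGROUP, COUNT and the final join produce for the example data above (the grouping here collects only the tweets, which is a simplification of Pig's inner bags):

```python
from collections import defaultdict

user_table = [(1, "Jay"), (2, "Ellie"), (3, "Sam")]
tweet_table = [(1, "xyz"), (1, "pqr"), (1, "lmn"), (2, "abc"), (2, "vxy"), (3, "stu")]

# STEP 3 (COGROUP): build an inner bag of tweet tuples per user id.
bags = defaultdict(list)
for user_id, tweet in tweet_table:
    bags[user_id].append((user_id, tweet))

# STEP 4 (COUNT): count the tuples in each bag -> (id, tweet count).
counts = {user_id: len(bag) for user_id, bag in bags.items()}

# STEP 5 (JOIN with the user table): -> (id, name, tweet count).
result = [(user_id, name, counts.get(user_id, 0)) for user_id, name in user_table]
print(result)   # [(1, 'Jay', 3), (2, 'Ellie', 2), (3, 'Sam', 1)]
```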

Conclusion
● Pig is not only limited to this operation. It can perform various other operations
which I mentioned earlier in this use case.
● These insights help Twitter to perform sentiment analysis and develop machine
learning algorithms based on user behaviours and patterns.
● Now, after knowing the Twitter case study, in this Apache Pig tutorial, let us take a
deep dive and understand the architecture of Apache Pig and Pig Latin’s data model.
This will help us understand how pig works internally. Apache Pig draws its
strength from its architecture.

Pig Architecture
For writing a Pig script, we need the Pig Latin language, and to execute the scripts we need an execution environment. The architecture of Apache Pig is shown in the below image.
Pig Latin Scripts
● Initially as illustrated in the previous image, we submit Pig scripts to the Apache Pig
execution environment which can be written in Pig Latin using built-in operators.
● There are three ways to execute the Pig script:
➔ Grunt Shell: This is Pig’s interactive shell provided to execute all Pig Scripts.
➔ Script File: Write all the Pig commands in a script file and execute the Pig
script file. This is executed by the Pig Server.
➔ Embedded Script: If some functions are unavailable as built-in operators, we can programmatically create User Defined Functions in other languages like Java, Python, Ruby, etc. to bring in that functionality, and embed them in the Pig Latin script file. Then, execute that script file.
Parser
● From the above image you can see, after passing through Grunt or Pig Server, Pig
Scripts are passed to the Parser.
● The Parser does type checking and checks the syntax of the script. The parser
outputs a DAG (directed acyclic graph).
● DAG represents the Pig Latin statements and logical operators. The logical
operators are represented as the nodes and the data flows are represented as
edges.

Optimizer
● Then the DAG is submitted to the optimizer. The Optimizer performs the
optimization activities like split, merge, transform, and reorder operators etc. This
optimizer provides the automatic optimization feature to Apache Pig. The optimizer
basically aims to reduce the amount of data in the pipeline at any instance of time
while processing the extracted data, and for that, it performs functions like:
● PushUpFilter: If there are multiple conditions in the filter and the filter can be split,
Pig splits the conditions and pushes up each condition separately. Selecting these
conditions earlier helps in reducing the number of records remaining in the pipeline.
● PushDownForEachFlatten: Applying flattens, which produces a cross product
between a complex type such as a tuple or a bag and the other fields in the record,
as late as possible in the plan. This keeps the number of records low in the pipeline.
● ColumnPruner: Omitting columns that are never used or no longer needed, reducing
the size of the record. This can be applied after each operator so that fields can be
pruned as aggressively as possible.
● MapKeyPruner: Omitting map keys that are never used, reducing the size of the
record.
● LimitOptimizer: If the limit operator is immediately applied after a load or sort
operator, Pig converts the load or sort operator into a limit-sensitive
implementation, which does not require processing the whole data set. Applying the
limit earlier reduces the number of records.
● This is just a flavour of the optimization process. Beyond these, the optimizer also handles Join, Order By and Group By operations.

Compiler
● After the optimization process, the compiler compiles the optimised code into a series of MapReduce jobs.
● The compiler is responsible for converting Pig jobs automatically into MapReduce jobs.

Execution Engine
● Finally, as shown in the figure, these MapReduce jobs are submitted for execution to
the execution engine. Then the MapReduce jobs are executed and give the required
result. The result can be displayed on the screen using a “DUMP” statement and can
be stored in the HDFS using “STORE” statement.
● After understanding the architecture, now in this Apache Pig tutorial, I will explain to you Pig Latin's Data Model.

Pig Latin Data Model


Atomic/ Scalar Data Type
● Atomic or scalar data types are the basic data types which are used in all the
languages like string, int, float, long, double, char[], byte[]. These are also called the
primitive data types. The value of each cell in a field (column) is an atomic data type
as shown in the below image.
● For fields, positional indexes are generated by the system automatically (this is also known as positional notation). They are represented by '$' and start from $0, then grow to $1, $2, and so on. In the below image, $0 = S.No., $1 = Bands, $2 = Members, $3 = Origin.
● Scalar data types are − ‘1’, ‘Linkin Park’, ‘7’, ‘California’ etc.
Tuple
● A tuple is an ordered set of fields which may contain different data types for each field. You can understand it as a record stored in a row of a relational database. A tuple is a set of cells from a single row, as shown in the above image. The elements inside a tuple do not necessarily need to have a schema attached to them.
● A tuple is represented by the '()' symbol.
● Example of a tuple − (1, Linkin Park, 7, California)
● Since tuples are ordered, we can access fields in each tuple using the indexes of the fields; $1 from the above tuple will return the value 'Linkin Park'. You can notice that the above tuple doesn't have any schema attached to it.

Bag
● A bag is a collection of a set of tuples and these tuples are a subset of rows or
entire rows of a table. A bag can contain duplicate tuples, and it is not mandatory
that they need to be unique.
● The bag has a flexible schema i.e. tuples within the bag can have a different number
of fields. A bag can also have tuples with different data types.
● A bag is represented by the ‘{}’ symbol.
● Example of a bag − {(Linkin Park, 7, California), (Metallica, 8), (Mega Death, Los
Angeles)}
● But for Apache Pig to effectively process bags, the fields and their respective data
types need to be in the same sequence.

Map
● A map is a set of key-value pairs used to represent data elements. The key must be a chararray and should be unique, like a column name, so it can be indexed and the values associated with it can be accessed on the basis of the key. The value can be of any data type.
● Maps are represented by ‘[]’ symbol and key-value are separated by ‘#’ symbol, as
you can see in the above image.
● Example of maps− [band#Linkin Park, members#7 ], [band#Metallica, members#8 ]
● Now as we learned Pig Latin’s Data Model. We will understand how Apache Pig
handles schema as well as works with schema-less data.
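
A rough mapping of these Pig Latin types onto Python structures, using the band example from the images above (purely an analogy; Pig's types are its own):

```python
# Atom / scalar values.
band, members = "Linkin Park", 7

# Tuple: an ordered set of fields, like one row.
row = (1, "Linkin Park", 7, "California")
print(row[1])                      # positional access, like $1 -> 'Linkin Park'

# Bag: a collection of tuples, possibly with different field counts.
bag = [("Linkin Park", 7, "California"), ("Metallica", 8), ("Mega Death", "Los Angeles")]

# Map: key-value pairs, keys unique like column names.
record = {"band": "Linkin Park", "members": 7}
print(record["band"])
```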
Schema
Schema assigns a name to the field and declares the data type of the field. The schema is
optional in Pig Latin but Pig encourages you to use them whenever possible, as the error
checking becomes efficient while parsing the script which results in efficient execution of
the program. The schema can be declared as both simple and complex data types. During
LOAD function, if the schema is declared it is also attached to the data.

Few Points on Schema in Pig:


● If the schema only includes the field name, the data type of the field is considered to be bytearray.
● If you assign a name to the field you can access the field by both, the field name and
the positional notation. Whereas if the field name is missing we can only access it
by the positional notation i.e. $ followed by the index number.
● If you perform any operation which is a combination of relations (like JOIN,
COGROUP, etc.) and if any of the relations is missing schema, the resulting relation
will have a null schema.
● If the schema is null, Pig will consider it as a byte array and the real data type of
field will be determined dynamically.
What is Hive?
● Hive is a component in Hadoop Stack.
● It is an open-source data warehouse tool that runs on top of Hadoop.
● It was developed by Facebook and later donated to the Apache foundation.
● It reads, writes, and manages big data tables stored in HDFS or other data sources.
● Hive doesn't offer row-level insert, delete and update operations, but it is used to perform analytics, mining, and report generation on large data warehouses.
● Hive uses Hive query language similar to SQL.
● Most of the syntax is similar to the MySQL database.
● It is used for OLAP (Online Analytical Processing) purposes.
● In Hive, OLAP (Online Analytical Processing) involves performing complex analysis
and querying of large datasets stored in Hadoop's distributed file system (HDFS)
using HiveQL, the query language similar to SQL. While Hive is not a traditional
OLAP system like some relational databases, it can be used to perform OLAP-like
tasks on big data.

Why do we need Hive?


● In 2006, Facebook was generating 10 GB of data per day, and in 2007 this increased to 1 TB per day.
● Within a few years, it was generating 15 TB of data per day.
● Initially, Facebook was using the Scribe server, Oracle databases, and Python scripts for processing large data sets. As its data kept growing, Facebook shifted to Hadoop as its key tool for data analysis and processing.
● Facebook used Hadoop for managing its big data but faced problems with ETL operations, because for each small operation they needed to write a Java program.
● They needed a lot of Java engineers, who are difficult to find, and Java is not easy to learn.
● So Facebook developed Hive, which uses SQL-like syntax that is easy to learn and write.
● Hive makes it easy for people who know SQL, just like other RDBMS tools.

Hive Features
● It is a Data Warehousing tool.
● It is used for enterprise data wrangling.
● It uses the SQL-like language HiveQL or HQL. HQL is a non-procedural and
declarative language.
● It is used for OLAP operations.
● It increases productivity by reducing 100 lines of Java code into 4 lines of HQL
queries.
● It supports Table, Partition, and Bucket data structures.
● It is built on top of Hadoop Distributed File System (HDFS)
● Hive supports Tez, Spark, and MapReduce.

What is Data warehousing?


● Data warehousing in Hive refers to the process of organising, storing, and managing
large volumes of structured and semi-structured data in a way that facilitates
efficient querying and analysis.
● It involves using Hive, a data warehousing infrastructure built on top of Hadoop, to
create and manage data warehouses where data can be stored, processed, and
analysed for business intelligence and decision-making purposes.

What is enterprise data wrangling?


● Enterprise data wrangling refers to the process of preparing and transforming raw
data from various sources into a usable format for analysis, reporting, and other
data-related tasks within an organisation.
● It involves cleaning, structuring, enriching, and integrating data from multiple
sources to make it accessible and actionable for business users, analysts, and data
scientists.

Non Procedural and Declarative Language


● Non-procedural languages focus on describing what should be done rather than
how it should be done. In other words, they emphasise the end result rather than
the step-by-step process to achieve it. These languages abstract away the control
flow and implementation details, allowing the system to determine the most
efficient way to execute the task.
● Declarative languages are a broader category that includes non-procedural
languages. They allow users to define the desired outcome or properties of a
solution without specifying the exact sequence of steps or algorithms to achieve it.

Hive Architecture
● Shell/CLI: An interactive interface for writing queries.
● Driver: Handles sessions, and fetches and executes operations.
● Compiler: Parses, plans and optimises the code.
● Execution: In this phase, MapReduce jobs are submitted to Hadoop and get executed.
● Metastore: A central repository that stores the metadata. It keeps all the details about tables, partitions, and buckets.
● Example: Suppose we have a large dataset containing information about sales
transactions for a retail company, stored in HDFS. We want to analyse this data
using Hive to gain insights into sales performance.

Shell / CLI
● The Shell or Command-Line Interface (CLI) is an interactive interface where users
can write and execute HiveQL queries to interact with the Hive system.
● Users can use the Hive shell to submit queries, manage tables, and perform other
operations

Driver
● The Driver is responsible for handling user sessions, interpreting queries submitted
through the shell/CLI, and coordinating the execution of these queries.
● It interacts with other components of the Hive architecture to process user queries.

Compiler
● The Compiler receives the HiveQL queries submitted by users and performs several
tasks:
➔ Parsing: It parses the query to understand its syntactic structure and extract
relevant information.
➔ Planning: It generates an execution plan based on the query's logical
structure, determining the sequence of operations needed to fulfil the query.
➔ Optimization: It optimises the execution plan to improve query performance
by considering factors such as data locality, join order, and filter pushdown.

Execution
● In this phase, the optimised execution plan is translated into one or more
MapReduce jobs, which are submitted to the Hadoop cluster for execution.
● These MapReduce jobs process the data stored in HDFS according to the query's
requirements and produce the desired result.
● Example: Suppose the query requires aggregating sales data by product ID. The
execution phase would involve MapReduce jobs that read and process the sales
data, performing the aggregation operation as specified in the query.
MetaStore
● The Metastore is a central repository that stores metadata about Hive objects such
as databases, tables, partitions, columns, and storage properties. It keeps track of all
the details required to manage and query data stored in Hive.
● Example: The Metastore stores metadata about the sales table, including its schema
(columns: transaction_id, product_id, quantity_sold, revenue), data location in HDFS,
partitioning information (if any), and any associated storage properties.

Order By, Sort By


● ORDER BY: It always assures global ordering. It is slower for large datasets
because it pushes all the data into a single reducer. In the final output, you will get a
single sorted output file.
● SORT BY: It orders the data at each reducer but the reducer may have overlapping
ranges of data. In the final output, you will get multiple sorted output files.
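
A hedged sketch of running both variants against the sales table from the earlier example, assuming a HiveServer2 endpoint on localhost:10000 and the PyHive package (the table and column names come from the example above; host, port and database are assumptions):

```python
from pyhive import hive

# Connect to HiveServer2 (host/port/database are assumptions for this sketch).
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# ORDER BY: one globally sorted result, funnelled through a single reducer.
cursor.execute(
    "SELECT product_id, SUM(revenue) AS total_revenue "
    "FROM sales GROUP BY product_id ORDER BY total_revenue DESC"
)
print(cursor.fetchall())

# SORT BY: each reducer sorts its own slice, so the output is only sorted per reducer.
cursor.execute("SELECT product_id, revenue FROM sales SORT BY revenue DESC")
print(cursor.fetchmany(10))

conn.close()
```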

HIVE vs SQL
● In traditional SQL databases, MapReduce is not supported, while in Hive, queries are executed as MapReduce jobs.
● Hive does not support the update command due to the limitations and natural structure of HDFS; Hive only has insert overwrite for update or insert functionality.
● In Hive, HiveQL queries are internally converted into MapReduce jobs that run over the data stored in HDFS, whereas SQL queries are executed directly by the database engine.
● Traditional SQL databases manage their own table storage, while Hive works with tables defined as a schema over files stored in HDFS.
Hive Limitations
● Hive is suitable for batch processing but not suitable for real-time data handling.
● Update and delete are not allowed, but we can delete in bulk, i.e. we can delete an entire table but not an individual observation.
● Hive is not suitable for OLTP(Online Transactional Processing) operations.
HBase Introduction
● ​HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable.
● Apache HBase is a column-oriented key/value data store built to run on top of the
Hadoop Distributed File System (HDFS).
● Hadoop is a framework for handling large datasets in a distributed computing
environment.

Column Oriented Database


● A column-oriented DBMS (or columnar database management system) is a
database management system (DBMS) that stores data tables by column rather
than by row. Practical use of a column store versus a row store differs little in the
relational DBMS world.
● Both columnar and row databases can use traditional database query languages
like SQL to load data and perform queries. Both row and columnar databases can
become the backbone in a system to serve data for common extract, transform, load
(ETL) and data visualisation tools.
● However, by storing data in columns rather than rows, the database can more
precisely access the data it needs to answer a query rather than scanning and
discarding unwanted data in rows. Query performance is increased for certain
workloads.
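
A toy illustration of the two layouts in Python: the same table stored row-wise and column-wise, and a query that only needs one column (the band data mirrors the Pig examples earlier; the numbers are illustrative):

```python
# Row-oriented layout: each record is stored together, good for fetching whole rows.
rows = [
    {"id": 1, "band": "Linkin Park", "members": 7},
    {"id": 2, "band": "Metallica",   "members": 8},
]

# Column-oriented layout: each column is stored together, good for scanning one column.
columns = {
    "id":      [1, 2],
    "band":    ["Linkin Park", "Metallica"],
    "members": [7, 8],
}

# A query like "average number of members" touches only one column in the columnar
# layout, instead of reading and discarding the other fields of every row.
print(sum(columns["members"]) / len(columns["members"]))
```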
Features
Scalability:
1. Horizontal
● Horizontal scalability is the ability to increase capacity by connecting multiple
hardware or software entities so that they work as a single logical unit.
● When servers are clustered, the original server is being scaled out horizontally.
● If a cluster requires more resources to improve performance and provide high
availability (HA), an administrator can scale out by adding more servers to the
cluster.
● An important advantage of horizontal scalability is that it can provide administrators
with the ability to increase capacity on the fly.
● Another advantage is that in theory, horizontal scalability is only limited by how
many entities can be connected successfully

2. Vertical
● Vertical scalability, on the other hand, increases capacity by adding more resources,
such as more memory or an additional CPU, to a machine.
● Scaling vertically, which is also called scaling up, usually requires downtime while
new resources are being added and has limits that are defined by hardware.

What you need to consider while choosing horizontal scalability over vertical scalability
● Scaling horizontally has both advantages and disadvantages.
● For example, adding inexpensive commodity computers to a cluster might seem to
be a cost-effective solution at first glance, but it’s important for the administrator to
know whether the licensing costs for those additional servers, the additional
operations cost of powering and cooling and the large footprint they will occupy in
the data centre truly makes scaling horizontally a better choice than scaling
vertically.
Consistency
● Consistency in database systems refers to the requirement that any given database
transaction must change affected data only in allowed ways. Any data written to the
database must be valid according to all defined rules, including constraints,
cascades, triggers, and any combination thereof.
● Write transactions are always performed in a strong consistency model in HBase
which guarantees that transactions are ordered and replayed in the same order by
all copies of the data. In timeline consistency, the get and scan requests can be
answered from data that may be stale.
Java API client
● The java client API for HBase is used to perform CRUD operations on HBase tables.
HBase is written in Java and has a Java Native API. Therefore it provides
programmatic access to Data Manipulation Language (DML).

HBase vs RDBMS

● Whereas an RDBMS stores table records in a sequence of rows (row-oriented databases), HBase is a column-oriented database, which stores table records in a sequence of columns.

Hbase column oriented storage

HBase
● Row key: the reference of a row; it is used to make the search for a record faster.
● Column Families: a combination of a set of columns. Data belonging to the same column family can be accessed together in a single seek, allowing faster processing.
● Column Qualifiers: each column's name is known as its column qualifier.
● Cell: the storage area of the data. Each cell is addressed by a row key and a column qualifier.
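
A hedged sketch of basic CRUD against such a table from Python using the happybase client, assuming an HBase Thrift server on localhost and a pre-created table 'users' with column family 'info' (all of these names are assumptions for this example):

```python
import happybase

# Connect via the HBase Thrift gateway (host is an assumption for this sketch).
connection = happybase.Connection("localhost")
table = connection.table("users")

# Create/Update: put values into column family 'info' under a row key.
table.put(b"user#1001", {b"info:name": b"Jay", b"info:followers": b"350"})

# Read: fetch a whole row by its row key, then a single cell.
row = table.row(b"user#1001")
print(row[b"info:name"])          # b'Jay'

# Scan: iterate over rows whose keys share a prefix.
for key, data in table.scan(row_prefix=b"user#"):
    print(key, data)

# Delete: remove the row (or specific columns) by row key.
table.delete(b"user#1001")
connection.close()
```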

Architecture
● HBase has three crucial components:
● Zookeeper, used for coordination and monitoring.
● HMaster Server, which assigns regions and performs load balancing.
● Region Servers, which serve data for reads and writes; they run on the different computers in the Hadoop cluster. Each Region Server holds a set of regions, an HLog (write-ahead log), and in-memory MemStores.

Applications
● Medical: In the medical field, HBase is used for the purpose of storing genome
sequences and running MapReduce on it, storing the disease history of people or an
area.
● Sports: For storing match histories for better analytics and prediction.
● E-Commerce: For the purpose of recording and storing logs about customer search
history, as well as to perform analytics and then target advertisements for the better
business.
