Big Data
Big Data Analytics is all about crunching massive amounts of information to uncover hidden trends,
patterns, and relationships. It's like sifting through a giant mountain of data to find the gold nuggets of
insight.
• Collecting Data: Data comes from many sources, such as social media, web traffic, sensors,
and customer reviews.
• Cleaning the Data: Imagine sorting through a pile of rocks with some gold pieces in it; you
would have to clear away the dirt and debris first. When data is cleaned, mistakes are fixed,
duplicates are removed, and the data is formatted properly.
• Analyzing the Data: It is here that the wizardry takes place. Data analysts employ powerful tools
and techniques to discover patterns and trends. It is the same thing as looking for a specific
pattern in all those rocks that you sorted through.
Big Data Analytics is a powerful approach for unlocking the potential of large and complex datasets. To
understand it better, let's break it down into key steps:
• Data Collection: Data is the core of Big Data Analytics. This step gathers data from different
sources such as customer comments, surveys, sensors, social media, and so on. The
primary aim of data collection is to compile as much accurate data as possible; the more data,
the more insights.
• Data Cleaning (Data Preprocessing): The next step is to process this information. It often
requires some cleaning. This entails the replacement of missing data, the correction of
inaccuracies, and the removal of duplicates. It is like sifting through a treasure trove, separating
the rocks and debris and leaving only the valuable gems behind.
• Data Processing: Next comes data processing. This stage involves organizing, structuring, and
formatting the data so that it is usable for analysis. It is like a chef gathering and preparing the
ingredients before cooking: data processing turns the raw data into a format that analytics tools
can work with.
• Data Analysis: Data analysis applies statistical, mathematical, and machine learning methods to
extract the most important findings from the processed data. For example, it can uncover
customer preferences, market trends, or patterns in healthcare data.
• Data Visualization: Analysis results are usually presented in visual form, such as charts,
graphs, and interactive dashboards. Visualizations simplify large amounts of data and allow
decision makers to quickly spot patterns and trends.
• Data Storage and Management: Properly storing and managing the analyzed data is of utmost
importance. It is like digital scrapbooking: you may want to return to those insights later, so how
you store them matters. Moreover, data protection and regulatory compliance are key issues
to address at this stage.
• Continuous Learning and Improvement: Big data analytics is a continuous process of collecting,
cleaning, and analyzing data to uncover hidden insights. It helps businesses make better
decisions and gain a competitive edge.
Big Data Analytics comes in many different types, each serving a different purpose:
1. Descriptive Analytics: This type helps us understand past events. In social media, it shows
performance metrics, like the number of likes on a post.
2. Diagnostic Analytics: Diagnostic analytics delves deeper to uncover the reasons behind past
events. In healthcare, it can identify the causes of high patient re-admission rates.
3. Predictive Analytics: Predictive analytics forecasts future events based on past data. Weather
forecasting, for example, predicts tomorrow's weather by analyzing historical patterns.
4. Prescriptive Analytics: This type not only predicts outcomes but also recommends actions to
achieve the best results. In e-commerce, it may suggest the best price for a product to
achieve the highest possible profit.
5. Real-time Analytics: Real-time analytics processes data as it arrives. In stock trading, for
example, it allows traders to make decisions based on market events as they happen.
6. Spatial Analytics: Spatial analytics works with location data. In urban management, it uses data
from sensors and cameras to optimize traffic flow and minimize congestion.
7. Text Analytics: Text analytics analyzes unstructured text data. In the hotel business, guest
reviews can be mined to enhance services and guest satisfaction.
Big Data Analytics relies on various technologies and tools that might sound complex, so let's simplify them:
• Hadoop: Imagine Hadoop as an enormous digital warehouse. It's used by companies like Amazon
to store tons of data efficiently. For instance, when Amazon suggests products you might like, it's
because Hadoop helps manage your shopping history.
• Spark: Think of Spark as the super-fast data chef. Netflix uses it to quickly analyze what you watch
and recommend your next binge-worthy show.
• NoSQL Databases: NoSQL databases, like MongoDB, are like digital filing cabinets that Airbnb
uses to store your booking details and user data. These databases are prized for their speed and
flexibility, allowing the platform to serve you the right information when you need it.
• Tableau: Tableau is like an artist that turns data into beautiful pictures. The World Bank uses it to
create interactive charts and graphs that help people understand complex economic data.
• Python and R: Python and R are like magic tools for data scientists, who use these languages to
solve tricky problems. For example, data scientists on Kaggle use them to predict things like
house prices from historical data.
• Machine Learning Frameworks (e.g., TensorFlow): Machine learning frameworks are the tools
used to build predictive models. Airbnb uses TensorFlow to predict which properties are most
likely to be booked in certain areas, helping hosts make smart decisions about pricing and availability.
These tools and technologies are the building blocks of Big Data Analytics and help organizations gather,
process, understand, and visualize data, making it easier for them to make decisions based on
information.
Big Data Analytics offers a host of real-world advantages; let's look at a few examples:
1. Informed Decisions: Imagine a store like Walmart. Big Data Analytics helps them make smart
choices about what products to stock. This not only reduces waste but also keeps customers
happy and profits high.
2. Enhanced Customer Experiences: Think about Amazon. Big Data Analytics is what makes those
product suggestions so accurate. It's like having a personal shopper who knows your taste and
helps you find what you want.
3. Fraud Detection: Credit card companies, like MasterCard, use Big Data Analytics to catch and
stop fraudulent transactions. It's like having a guardian that watches over your money and keeps it
safe.
4. Optimized Logistics: FedEx, for example, uses Big Data Analytics to deliver your packages faster
and with less impact on the environment. It's like taking the fastest route to your destination while
also being kind to the planet.
While Big Data Analytics offers incredible benefits, it also comes with its set of challenges:
• Data Overload: Consider Twitter, where approximately 6,000 tweets are posted every second. The
challenge is sifting through this avalanche of data to find valuable insights.
• Data Quality: If the input data is inaccurate or incomplete, the insights generated by Big Data
Analytics can be flawed. For example, incorrect sensor readings could lead to wrong conclusions
in weather forecasting.
• Privacy Concerns: With the vast amount of personal data used, like in Facebook's ad targeting,
there's a fine line between providing personalized experiences and infringing on privacy.
• Security Risks: With cyber threats increasing, safeguarding sensitive data becomes crucial. For
instance, banks use Big Data Analytics to detect fraudulent activities, but they must also protect
this information from breaches.
• Costs: Implementing and maintaining Big Data Analytics systems can be expensive. Airlines like
Delta use analytics to optimize flight schedules, but they need to ensure that the benefits
outweigh the costs.
Big Data Analytics is applied across many industries:
• Healthcare: It aids in precise diagnoses and disease prediction, elevating patient care.
• Retail: Amazon's use of Big Data Analytics offers personalized product recommendations based
on your shopping history, creating a more tailored and enjoyable shopping experience.
• Finance: Credit card companies such as Visa rely on Big Data Analytics to swiftly identify and
prevent fraudulent transactions, ensuring the safety of your financial assets.
• Transportation: Companies like Uber use Big Data Analytics to optimize drivers' routes and
predict demand, reducing wait times and improving overall transportation experiences.
• Agriculture: Farmers make informed decisions, boosting crop yields while conserving resources.
• Manufacturing: Companies like General Electric (GE) use Big Data Analytics to predict machinery
maintenance needs, reducing downtime and enhancing operational efficiency.
Limitations of Traditional Systems for Big Data
1. Scalability Issues
Traditional systems struggle to scale effectively. They typically rely on vertical scaling, which
involves adding more resources (CPU, memory, storage) to a single server. This approach can be
costly and time-consuming, often leading to performance bottlenecks as data volumes increase.
In contrast, big data environments require horizontal scaling, where additional servers are added
to distribute the load more efficiently.
2. Performance Limitations
As data variety and velocity increase, traditional systems face difficulties in maintaining high
performance. These systems often cannot process large amounts of diverse data quickly enough,
resulting in slower response times and a higher likelihood of errors. The batch processing model
common in traditional systems introduces latency that is incompatible with the real-time analysis
needs of big data applications.
3. Complexity Management
Traditional systems are primarily designed for structured data organized in rows and columns.
They struggle with the complexity and variability of unstructured or semi-structured data types
(e.g., text, images, sensor data) that are prevalent in big data scenarios. This complexity can lead
to challenges in ensuring the quality and consistency of data processing and analysis.
4. Cost Implications
The cost associated with scaling traditional systems can be prohibitive. High-end hardware
requirements and expensive software licenses for relational databases contribute to this issue. As
organizations seek to manage larger datasets, they often find that traditional storage solutions are
not cost-effective.
5. Integration Difficulties
Many organizations operate with disparate systems where data exists in silos. Integrating these
various sources is essential for comprehensive analysis but can be a complex and time-
consuming process. Traditional systems often lack the interoperability needed to seamlessly
connect different data sources.
6. Security Risks
The vast amount of sensitive information contained within big data makes traditional systems
vulnerable to cyberattacks. Protecting these large datasets requires robust security measures that
traditional architectures may not adequately support.
7. Compliance Challenges
With increasing regulations regarding data privacy and protection, traditional systems may
struggle to comply with legal requirements when handling sensitive information. Achieving
compliance can be complex and resource-intensive.
Sources of Big Data
Big data originates from a multitude of sources, reflecting the diverse and complex nature of the
information generated in today's digital landscape. These sources can be categorized into several key
types:
1. Social Media
Social platforms such as Facebook, Twitter, Instagram, and LinkedIn generate vast amounts of
data daily through user interactions, posts, comments, images, and videos. As more people
engage with these platforms, the volume of data produced continues to grow exponentially.
2. Internet of Things (IoT) Devices
IoT devices, including smart appliances, sensors, and connected vehicles, create significant
amounts of machine-generated data. These devices collect and transmit information about their
environment, usage patterns, and operational status. Examples include smart thermostats that
monitor temperature and smart meters that track energy consumption.
3. Transactional Data
Purchases, payments, invoices, and banking operations generate transactional records that
capture what happened, when, and for what amount.
4. Machine Data
Data generated by machines includes log files from servers and network devices, as well as
sensor data from industrial equipment. This type of data is essential for monitoring system
performance and ensuring operational efficiency.
5. Multimedia Content
Images, videos, and audio files are significant contributors to big data. Platforms like YouTube and
various streaming services produce massive volumes of multimedia content that require
substantial storage and processing capabilities.
6. Web and Clickstream Data
Websites generate data through user interactions such as clicks, page views, and browsing
patterns. This clickstream data provides insights into user behavior and preferences.
7. External Data
Organizations often incorporate external datasets for enhanced analysis. These can include
weather data, geographic information, financial market trends, and scientific research findings.
Such external data enriches the internal datasets organizations collect.
8. Customer Feedback
Feedback from customers via surveys, reviews, and support interactions generates valuable
insights into product performance and customer satisfaction levels. This feedback is increasingly
gathered through digital channels.
3 Vs of Big Data
The 3 V's of Big Data are foundational concepts that help define the characteristics and challenges
associated with managing large datasets. These dimensions were popularized by Gartner analyst Doug
Laney in 2001 and are essential for understanding how big data differs from traditional data
management. The three Vs are:
1. Volume
Volume refers to the sheer amount of data generated and stored. In today's digital landscape,
organizations can accumulate vast quantities of data, often measured in terabytes or petabytes.
For instance, it is estimated that approximately 2.5 quintillion bytes of data are created every day,
highlighting the exponential growth in data generation. This massive volume poses challenges in
terms of storage, processing, and analysis.
2. Velocity
Velocity describes the speed at which data is generated and processed. Data flows into
organizations from various sources in real-time or near-real-time, necessitating rapid processing
to derive timely insights. For example, social media platforms generate a continuous stream of
user interactions, while IoT devices transmit data instantaneously. The ability to handle high-
velocity data streams is crucial for organizations looking to make quick decisions based on
current information.
3. Variety
Variety refers to the different types of data that organizations encounter, including structured,
semi-structured, and unstructured formats. This diversity encompasses text, images, videos,
sensor data, and more. Managing this variety requires sophisticated analytical tools and
techniques to ensure that all data types can be effectively processed and analyzed.
Additional V's
In recent discussions around big data, additional V's have been introduced, such as veracity (the
quality and accuracy of the data), value (the usefulness of the data), and variability (the changing
nature of data). These concepts further enrich the understanding of big data by addressing
aspects like trustworthiness and the dynamic context in which data exists.
Types of Data
Big data can be classified into three primary types based on its structure and organization: structured
data, semi-structured data, and unstructured data. Each type has distinct characteristics and
applications, which are crucial for effective data management and analysis.
1. Structured Data
Structured data is highly organized and adheres to a predefined schema, making it easily
searchable and analyzable. This type of data is typically stored in relational databases and is
represented in tabular formats, such as rows and columns. Examples include customer records in
relational database tables, financial transaction tables, and spreadsheets with fixed columns.
Structured data is valuable because it allows for quick retrieval and analysis using standard query
languages like SQL.
2. Semi-Structured Data
Semi-structured data contains elements of both structured and unstructured data. While it does
not conform to a rigid schema, it includes tags or markers that separate different pieces of
information, allowing for some level of organization. Common examples include:
- JSON (JavaScript Object Notation): Often used in web applications for transmitting data.
- XML (eXtensible Markup Language): Used for storing and transporting data.
- CSV (Comma-Separated Values): A simple format for tabular data that lacks strict
structure.
This type of data is increasingly important for applications that require flexibility in how
information is stored and processed.
3. Unstructured Data
Unstructured data lacks a predefined format or structure, making it more challenging to process
and analyze. It accounts for the majority of big data generated today. Examples include:
- Multimedia Files: Images, videos, and audio recordings (e.g., photos on social media
platforms).
- Textual Data: Emails, social media posts, and documents that do not follow a specific
format.
- Sensor Data: Information collected from IoT devices that may not fit into traditional
database structures.
Unstructured data requires advanced analytics tools to extract meaningful insights due to its
complexity and variability.
Google File System (GFS)
A GFS deployment is made up of a group of computers called a cluster, which is simply a set of
connected machines. A cluster can contain hundreds or even thousands of computers. Any GFS
cluster includes three basic entities:
• GFS Clients: They can be computer programs or applications which may be used to
request files. Requests may be made to access and modify already-existing files or add
new files to the system.
• GFS Master Server: It serves as the cluster’s coordinator. It preserves a record of the
cluster’s actions in an operation log. Additionally, it keeps track of the data that describes
chunks, or metadata. The chunks’ place in the overall file and which files they belong to are
indicated by the metadata to the master server.
• GFS Chunk Servers: These are the GFS's workhorses. They store file chunks, 64 MB in size.
Chunks are never routed through the master server; instead, chunk servers deliver the
requested chunks directly to the client. To ensure reliability, GFS makes multiple copies of each
chunk and stores them on different chunk servers; the default is three copies. Each copy is
referred to as a replica.
• Features of GFS
• Fault tolerance.
• Reduced client and master interaction because of the large chunk size.
• High availability.
• Advantages of GFS
1. High availability: data is still accessible even if a few nodes fail, thanks to replication. As the
GFS designers observed, component failures are the norm rather than the exception.
2. Dependable storage: corrupted data can be detected and re-replicated from healthy copies.
• Disadvantages of GFS
1. Best suited for data that is written once and only read (or appended) later; it is not designed
for workloads with frequent in-place updates.
Hadoop
Hadoop is a framework used to store and process large datasets in parallel in a distributed fashion. The
datasets are stored in HDFS (Hadoop Distributed File System), and the information stored in HDFS is
processed and retrieved in parallel using MapReduce.
In HDFS, each file is stored as blocks; a block is the smallest unit of data that the file system stores.
From Hadoop 2.0 onwards the default size of these HDFS blocks is 128 MB (previously 64 MB). The
block size can be configured as required by changing the dfs.block.size property in hdfs-site.xml.
- Block Size: HDFS uses a default block size of 128 MB, compared to the 4 KB typical in traditional file
systems.
- Metadata Management: Larger blocks reduce the amount of metadata stored in the NameNode,
simplifying management and retrieval.
- Space Utilization: In HDFS, unused space in a block can be utilized for other files, preventing waste of
storage.
- Performance: Larger blocks enhance performance by reducing disk seeks and allowing more data to be
read at once.
- Big Data Suitability: HDFS is optimized for big data applications, focusing on efficiency and scalability.
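To make the block-size setting concrete, here is a minimal sketch using the HDFS Java API; the property value, path, and sizes are illustrative assumptions rather than values from these notes.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default block size, normally set in hdfs-site.xml
        // (dfs.block.size in older releases, dfs.blocksize in newer ones); 128 MB here.
        conf.set("dfs.blocksize", "134217728");

        FileSystem fs = FileSystem.get(conf);

        // A block size can also be chosen per file at creation time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        Path out = new Path("/tmp/example.txt");   // illustrative path
        FSDataOutputStream stream =
                fs.create(out, true, 4096, (short) 3, 256L * 1024 * 1024); // 256 MB blocks
        stream.writeUTF("hello HDFS");
        stream.close();

        System.out.println("Default block size: " + fs.getDefaultBlockSize(out));
        fs.close();
    }
}
```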
Hadoop has five main services:
• NameNode
• Secondary NameNode
• Job Tracker
• DataNode
• Task Tracker
The first three (1, 2, 3) are master services and the last two (4, 5) are slave services. Master services
can talk to each other, and slave services can talk to each other. NameNode, Secondary NameNode, and
DataNode (1, 2, 4) are related to HDFS, while Job Tracker and Task Tracker (3, 5) are related to MapReduce.
1. NameNode:
The NameNode is the master node of HDFS, managing metadata such as file directories and block
locations stored in DataNodes. It does not store actual data and is essential for reconstructing files. Its
failure leads to system inaccessibility, making it a single point of failure. Typically, it has substantial RAM
for quick metadata access, highlighting its critical role in HDFS reliability and efficiency.
2. DataNode:
The DataNode is a slave node in HDFS that stores actual data. It communicates with the NameNode
through heartbeat signals and reports the blocks it manages upon startup. If a DataNode fails, data
availability remains unaffected since the NameNode can replicate its blocks. DataNodes are generally
equipped with ample hard disk space for data storage.
3. Secondary NameNode
• Despite its name, the Secondary NameNode is not a backup node. To understand its role, recall
what the NameNode does.
• The NameNode holds the metadata for HDFS, such as block information and sizes. This
information is kept in main memory and also on disk for persistent storage.
• The information is stored in two different files:
• Edit log – keeps track of each and every change made to HDFS.
• Fsimage – stores a snapshot of the file system.
• Every change made to HDFS is recorded in the edit log, so its file size keeps growing, whereas
the size of the fsimage stays the same. This has no impact until the server is restarted: on restart,
the edit log is merged into the fsimage and loaded into main memory, which takes some time. If
the cluster is restarted after a long time, the edit log will have grown very large and the downtime
will be substantial. The Secondary NameNode solves this problem.
• The Secondary NameNode periodically fetches the edit log from the NameNode and merges it into
the fsimage. The new fsimage is copied back to the NameNode, which uses it for the next restart,
reducing startup time. It is a helper node to the NameNode; its whole purpose is to perform
checkpoints in HDFS, which helps the NameNode function effectively. Hence it is also called the
Checkpoint node. The default checkpoint period is 1 hour. Merging the edit log into the fsimage
(and handing the NameNode a fresh edit log) is called a checkpoint.
JobTracker and TaskTracker are two essential processes involved in MapReduce execution in MRv1 (Hadoop
version 1). Both processes are deprecated in MRv2 (Hadoop version 2) and replaced by the
ResourceManager, ApplicationMaster, and NodeManager daemons.
4. Job Tracker –
The JobTracker is the master service for MapReduce in Hadoop version 1. It accepts jobs from clients,
schedules map and reduce tasks on the TaskTrackers, monitors their progress, and reschedules tasks
that fail.
5. TaskTracker –
The TaskTracker is the slave service that runs alongside the DataNode on each worker machine. It
executes the map and reduce tasks assigned to it by the JobTracker and reports progress back through
periodic heartbeat messages.
Hadoop can run in three modes:
1. Standalone (Local) Mode
This is the default mode of Hadoop, where it runs as a single Java process on a local
machine. It does not utilize HDFS (Hadoop Distributed File System) and instead uses the
local file system for input and output operations.
- Use Case: Primarily used for debugging and development purposes, allowing developers
to test their MapReduce applications without the complexity of a distributed environment.
- Characteristics: No Hadoop daemons are started; everything runs in a single JVM against the
local file system.
2. Pseudo-Distributed Mode
In this mode, Hadoop runs as a single-node cluster where all the Hadoop daemons (NameNode,
DataNode, etc.) operate on one machine. Each daemon runs in its own Java Virtual Machine (JVM),
allowing them to communicate over network sockets.
- Use Case: Suitable for testing and development where a more realistic environment is
needed compared to standalone mode. It simulates a multi-node cluster while still being
limited to one physical machine.
- Characteristics: All daemons run on a single machine, each in its own JVM; HDFS is used for
storage, typically with a replication factor of 1.
3. Fully Distributed Mode
This mode represents the production environment of Hadoop, where multiple nodes are deployed
across a cluster. Master (NameNode, ResourceManager) and slave (DataNode, NodeManager)
services run on separate nodes.
- Use Case: Ideal for processing large datasets in production environments where data is
distributed across various nodes for efficiency and fault tolerance.
- Characteristics: Daemons are spread across many machines; data blocks are replicated across
DataNodes, providing scalability and fault tolerance.
To configure a Hadoop cluster in either Pseudo-Distributed or Fully Distributed mode, you need to modify
several XML configuration files:
1. `core-site.xml`: Contains core settings such as the default file system URI (`fs.defaultFS`)
that points to the NameNode.
2. `hdfs-site.xml`: Defines settings specific to HDFS, including replication factors, data storage
paths, and NameNode information.
3. `mapred-site.xml`: Configures MapReduce settings like job tracker addresses and resource
allocation parameters.
These files are located in the `etc/hadoop/` directory of your Hadoop installation. Proper configuration
ensures that all components of the Hadoop ecosystem function seamlessly together.
Unit 2
A Weather Dataset
Weather datasets are critical for analyzing meteorological data and understanding climate patterns. They
typically consist of large volumes of log data collected from weather stations around the globe. Key fields
in each record include:
- Station Identifier: A unique code representing the weather station (e.g., WMO/DATSAV3
number).
- Other Meteorological Data: This may include humidity, wind speed, precipitation levels, etc.
Data Sources
- The National Climatic Data Center (NCDC) offers datasets that are grouped by weather station and
available for different years.
An analysis project might focus on calculating maximum, minimum, and average temperature values
from a weather dataset for a specific year. For instance, one could analyze data from Fairbanks, Alaska to
determine hot and cold days based on temperature thresholds.
Old API
- Mapper and Reducer: Defined as interfaces; developers must implement all methods when
creating custom Mappers and Reducers.
- Job Control: Managed through the `JobClient` class, which handles job submission but lacks
some modern conveniences.
- Communication Method: Passes values to the Reducer using `java.util.Iterator`, which can be less flexible.
- Output Naming Convention: Output files are named simply as `part-nnnnn`, making it harder
to distinguish between sources.
New API
- Mapper and Reducer: Defined as abstract classes; this allows for easier extension without
breaking existing implementations.
- Job Control: Managed via the `Job` class, simplifying job submission and management.
- Job Configuration: Uses a `Configuration` object along with helper methods on the `Job` class
for streamlined configuration management.
- Output Naming Convention: Output files are prefixed with `part-m-nnnnn` (for Mapper
outputs) or `part-r-nnnnn` (for Reducer outputs), making it easier to identify their source.
Advantages of the New API include:
1. Ease of use with simplified job configuration and management compared to the Old API.
2. Flexibility in customization through abstract classes that allow developers to extend functionality
easily.
3. Improved communication model using Iterable for handling multiple values per key between Mappers
and Reducers.
Basic Programs of Hadoop MapReduce
Hadoop MapReduce is a programming model designed for processing large datasets in a distributed
environment. It consists of two primary tasks: Map and Reduce. Understanding these components and
their interactions is crucial for effectively utilizing the MapReduce framework.
1. Driver Code
The Driver Code serves as the main entry point for a MapReduce job. It is responsible for configuring and
managing the execution of the job. Key responsibilities include:
- Job Configuration: The Driver sets up various parameters required for the job, including:
- Input and output paths (where to read data from and where to write results).
- Specifying Mapper and Reducer classes that contain the logic for processing data.
- Job Submission: After configuration, the Driver submits the job to the Hadoop cluster for
execution. This involves interacting with the ResourceManager to allocate resources and schedule
tasks.
- Monitoring Job Execution: The Driver can track the progress of the job, handle errors, and
retrieve results once processing is complete.
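As an illustration of these responsibilities, the following is a minimal driver sketch for a word-count job; class names and paths are illustrative, and the Mapper and Reducer classes it references are sketched in the next two sections.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // job configuration
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);       // Mapper logic
        job.setCombinerClass(WordCountReducer.class);    // optional local aggregation
        job.setReducerClass(WordCountReducer.class);     // Reducer logic

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path in HDFS

        // Submit the job and wait for completion, printing progress while monitoring it.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```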
2. Mapper Code
The Mapper is a fundamental component of the MapReduce framework that processes input data. Its
primary function is to read input key-value pairs, transform them, and produce intermediate key-value
pairs. Key aspects include:
- Input Processing: The Mapper reads data from HDFS line by line or in chunks, depending on how
input splits are defined. Each record typically consists of a key (e.g., line number or file offset) and
a value (e.g., actual data).
- Data Transformation: The Mapper applies user-defined logic to process each input record. This
might involve filtering, parsing, or transforming data into a more useful format.
- Output Generation: The Mapper emits intermediate key-value pairs as output. These pairs are
then passed on to the Reducer for further processing.
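A minimal Mapper sketch for the assumed word-count example (the class name is illustrative):
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: (byte offset, line of text); output: (word, 1) for every word on the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit intermediate key-value pair
        }
    }
}
```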
3. Reducer Code
The Reducer processes intermediate key-value pairs produced by Mappers. Its role is to aggregate these
pairs based on keys and produce final output results. Key responsibilities include:
- Receiving Intermediate Data: The Reducer receives all values associated with a specific key
from multiple Mappers. This is done after the shuffle phase, which groups all values by their
corresponding keys.
- Aggregation Logic: The Reducer applies user-defined logic to aggregate values for each key. This
could involve summing numbers, calculating averages, or performing other types of
computations.
- Output Generation: Finally, the Reducer emits its output as key-value pairs, which are written
back to HDFS as part of the final result.
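A matching Reducer sketch for the same assumed word-count job:
```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (word, [1, 1, ...]) grouped by the shuffle phase; output: (word, total count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();       // aggregate all values for this key
        }
        result.set(sum);
        context.write(key, result);   // final output written back to HDFS
    }
}
```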
4. Record Reader
The Record Reader is responsible for reading input data from HDFS and converting it into key-value pairs
suitable for processing by Mappers. It plays a crucial role in how input data is ingested into the
MapReduce framework:
- Input Formats: The Record Reader works with various input formats (e.g., TextInputFormat,
KeyValueTextInputFormat) that define how data is split into records.
- Key-Value Pair Generation: As it reads records, it generates corresponding key-value pairs that
are passed to Mappers for processing.
5. Combiner
The Combiner functions as an optional optimization layer between Mappers and Reducers:
- Local Aggregation: The Combiner operates on the output of Mappers before sending it to
Reducers. It performs local aggregation to minimize data transfer over the network.
6. Partitioner
The Partitioner determines how intermediate key-value pairs produced by Mappers are distributed
among Reducers:
- Key Distribution: The Partitioner ensures that all values associated with a specific key go to the
same Reducer. This is essential for maintaining consistency in aggregation operations.
- Custom Partitioning Logic: While Hadoop provides a default hash-based partitioning strategy,
developers can implement custom partitioning logic if specific distribution requirements exist
(e.g., ensuring certain keys are processed together).
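A minimal custom Partitioner sketch; the key/value types and the routing rule are illustrative assumptions. It would typically be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).
```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys starting with the same letter to the same Reducer (illustrative rule).
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;                                   // single-partition edge case
        }
        String s = key.toString();
        char first = s.isEmpty() ? ' ' : s.charAt(0);
        return (Character.toLowerCase(first) & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```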
Unit 3
The Writable Interface
The Writable interface is a core component of the Hadoop framework, designed to facilitate efficient
serialization and deserialization of data as it is transmitted between nodes in a distributed computing
environment. Serialization is crucial for converting structured data into a byte stream that can be easily
transmitted and stored.
1. write(DataOutput out):
- This method is responsible for writing the object's state to a binary stream.
- It serializes the data so that it can be transmitted over the network or stored on disk.
- The implementation typically involves writing each field of the object in a specific order.
For example, if your writable has fields like `int id` and `String name`, you would write the
integer first followed by the string.
2. readFields(DataInput in):
- This method reconstructs the object's state by reading it back from the binary stream
(deserialization).
- The implementation should read fields in the same order they were written in the `write`
method to ensure consistency.
Purpose of Writable
- Data Transmission: Writables enable efficient data transfer between different nodes in a
Hadoop cluster by providing a compact binary representation.
- Serialization Protocol: The Writable interface defines a simple serialization protocol based on
Java's `DataInput` and `DataOutput` classes, allowing for fast and efficient read/write
operations.
- Custom Data Types: Users can implement custom classes that adhere to the Writable interface,
allowing for complex data types to be used in MapReduce jobs.
WritableComparable and Comparators
WritableComparable Interface
The WritableComparable interface extends both the Writable and Comparable interfaces. It is used
when a writable object needs to be compared with other writable objects, typically when used as keys in
sorting operations.
It requires implementing three methods:
1. write(DataOutput out):
2. readFields(DataInput in):
3. compareTo(WritableComparable o):
- This method compares the current object with another object of the same type.
- It returns a negative value, zero, or a positive value if the current object is less than, equal
to, or greater than the other object, respectively.
- Implementing this method allows Hadoop to sort keys during the shuffle phase of
MapReduce jobs.
Purpose of WritableComparable
- Sorting: Implementing this interface allows Hadoop to sort keys during the shuffle phase of
MapReduce jobs.
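As a hedged sketch, a composite key implementing WritableComparable might look like the following; the field names are illustrative and only loosely echo the weather example from Unit 2.
```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A composite key holding a year and a temperature, sorted by year and then temperature.
public class YearTemperatureKey implements WritableComparable<YearTemperatureKey> {

    private int year;
    private int temperature;

    public YearTemperatureKey() { }                      // required no-arg constructor

    public YearTemperatureKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                              // serialize fields in a fixed order
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();                             // deserialize in the same order
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTemperatureKey other) {
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
    }
}
```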
Writable Classes
Hadoop provides several built-in writable classes that serve as wrappers for Java primitive types and
other commonly used data structures.
Writable Wrappers for Java Primitives
Hadoop includes writable wrappers for most Java primitive types, allowing them to be serialized
efficiently:
- IntWritable: A wrapper for Java's `int`, useful for representing integer values in MapReduce
tasks. It provides methods for serialization and deserialization of integer values.
- LongWritable: A wrapper for Java's `long`, suitable for large integer values or timestamps. It
supports similar methods as IntWritable but handles long values.
- FloatWritable: A wrapper for Java's `float`, used for representing floating-point numbers. It
provides methods to serialize float values efficiently.
- BooleanWritable: A wrapper for Java's `boolean`, representing true/false values. This class
simplifies handling boolean flags in MapReduce applications.
Text
The Text class is a writable that represents a sequence of characters (strings). It is commonly used to
handle string data in Hadoop applications. Key features include:
- Supports UTF-8 encoding, making it suitable for internationalization and handling diverse
character sets.
BytesWritable
The BytesWritable class is used for raw byte arrays. It allows users to handle binary data efficiently by
providing methods for serialization and comparison. Key points include:
- It stores an array of bytes along with its length, making it suitable for handling binary data such as
images or files.
- Provides methods to compare byte arrays directly, which can be useful when working with raw
binary data formats.
NullWritable
The NullWritable class represents a writable type that has no value. It is often used as a placeholder
when no value is needed or when indicating that an operation should not produce any output. Key
aspects include:
- Helps reduce unnecessary memory usage since it does not hold any actual data.
- Useful in scenarios where only keys are needed without corresponding values (e.g., when
counting occurrences).
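A small sketch of this keys-only pattern (the class name is illustrative): a Mapper that emits each line as a key with a NullWritable value.
```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each input line as a key with no value attached.
public class DistinctLineMapper extends Mapper<Object, Text, Text, NullWritable> {

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // NullWritable.get() returns the single shared instance; no payload is serialized.
        context.write(value, NullWritable.get());
    }
}
```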
ObjectWritable
The ObjectWritable class is a general-purpose wrapper that can serialize Java primitives, Strings, arrays,
and other Writable types, writing the class name along with each value. This provides flexibility for
passing heterogeneous objects through Hadoop's MapReduce framework. Important points include:
- Useful when working with complex data structures that do not fit neatly into primitive types or
standard writables.
GenericWritable
The GenericWritable class enables users to create writable types that can hold different types of
writables. This provides a way to handle multiple writable types within a single structure by defining a set
of expected types. Key features include:
- Allows dynamic type handling where different types can be stored in one writable container.
- Requires implementing methods to determine which type is currently held and how to
serialize/deserialize it effectively.
Writable Collections
Hadoop also provides writable collections such as:
- ArrayWritable: Represents an array of writables of a single type.
- Useful when dealing with lists or arrays within MapReduce tasks, such as processing
multiple values associated with a single key.
- MapWritable: Represents a map where both keys and values are writables.
- Allows dynamic key-value pairs where both components can be any writable type. This is
particularly useful for scenarios where you want to pass variable-length arguments or
configurations between mappers and reducers.
- SortedMapWritable: A map writable whose keys are kept in sorted order.
- Useful when order matters in key-value pairs during processing, such as maintaining
sorted output based on certain criteria.
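A brief sketch showing how a MapWritable can hold mixed writable types (the field names are illustrative):
```java
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapWritableExample {
    public static void main(String[] args) {
        // A MapWritable can mix writable types for keys and values.
        MapWritable record = new MapWritable();
        record.put(new Text("name"), new Text("John"));
        record.put(new Text("age"), new IntWritable(25));

        for (Map.Entry<Writable, Writable> entry : record.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```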
Creating a Custom Writable
1. Implement the Writable interface by defining the required methods (`write` and `readFields`).
2. Define any additional fields that your custom writable will hold (e.g., strings, integers).
3. Ensure proper serialization logic in the `write` method (writing fields in order) and
deserialization logic in `readFields` (reading fields back correctly).
For example, if you want to represent a pair of strings (e.g., first name and last name), you would create a
class implementing Writable, defining how these strings are serialized (written) and deserialized (read).
This could involve converting each string into bytes using UTF-8 encoding during serialization and reading
those bytes back into strings during deserialization.
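A minimal sketch of the pair-of-strings writable described above; the class name is illustrative, and Text fields are used because Text handles the UTF-8 encoding internally.
```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// A pair of strings (e.g., first name and last name) serialized as two Text fields.
public class NamePairWritable implements Writable {

    private final Text firstName = new Text();
    private final Text lastName = new Text();

    public void set(String first, String last) {
        firstName.set(first);
        lastName.set(last);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        firstName.write(out);          // fields written in a fixed order
        lastName.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        firstName.readFields(in);      // read back in exactly the same order
        lastName.readFields(in);
    }

    @Override
    public String toString() {
        return firstName + " " + lastName;
    }
}
```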
Raw Comparators
1. Implement the RawComparator interface (or extend WritableComparator) so that records can be
compared in their serialized (byte) form.
2. Override the compare method to define how two objects should be compared without
converting them into their full representations (i.e., directly comparing byte arrays).
3. This approach minimizes overhead during sorting operations by avoiding unnecessary object
creation or transformation.
Example Scenario
In scenarios where large datasets are being sorted based on byte arrays (e.g., sorting records based on
binary representations), using raw comparators can significantly speed up processing times compared to
traditional comparators that require full object instantiation.
Custom Comparators
Custom comparators can be implemented by creating classes that implement Comparator<T> or
RawComparator<T>. These comparators define specific sorting logic based on application requirements:
1. Implement compare method to provide custom comparison logic (e.g., comparing based on
specific fields).
2. Use these comparators when configuring job parameters or within specific Mapper/Reducer
implementations where custom sorting behavior is needed.
If you have a custom writable representing employee records with fields like employee ID and salary, you
might want to sort employees by salary rather than ID during processing. You would implement a custom
comparator that compares employee salary fields specifically while ignoring other attributes.
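A hedged sketch of such a comparator; the EmployeeWritable key shown here is a hypothetical example type defined only for this illustration, not part of Hadoop. It could be plugged in via job.setSortComparatorClass(SalaryComparator.class).
```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// A minimal employee key plus a comparator that sorts by salary instead of ID.
public class SalaryComparator extends WritableComparator {

    // Hypothetical employee key used only for this sketch.
    public static class EmployeeWritable implements WritableComparable<EmployeeWritable> {
        private int id;
        private double salary;

        public EmployeeWritable() { }

        public double getSalary() { return salary; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(id);
            out.writeDouble(salary);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            id = in.readInt();
            salary = in.readDouble();
        }

        @Override
        public int compareTo(EmployeeWritable other) {
            return Integer.compare(id, other.id);   // natural order: by ID
        }
    }

    public SalaryComparator() {
        super(EmployeeWritable.class, true);        // true -> instantiate keys for comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        EmployeeWritable left = (EmployeeWritable) a;
        EmployeeWritable right = (EmployeeWritable) b;
        return Double.compare(left.getSalary(), right.getSalary());   // sort by salary
    }
}
```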
Unit 4
Pig: Hadoop Programming Made Easier
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. It simplifies the
process of writing complex data transformations and analyses through a dataflow language called Pig
Latin. This makes it easier for users to perform data analysis without needing extensive programming
knowledge or familiarity with the complexities of MapReduce.
- Dataflow Language: Pig Latin allows users to express data transformations as a series
of operations on datasets. This contrasts with traditional programming languages that rely
on control flow constructs (like loops and conditionals).
- Simplicity: The syntax is similar to SQL, making it accessible for users familiar with
database queries.
- Pig Engine: The core component that executes Pig Latin scripts. It includes:
- A compiler that transforms logical plans into physical plans (MapReduce jobs).
3. Execution Modes:
- Local Mode: Runs on a single machine and processes data stored locally.
- MapReduce Mode: Runs on a Hadoop cluster and processes data stored in HDFS
(Hadoop Distributed File System).
Execution Flow
The execution flow in Apache Pig consists of several stages:
1. Parsing:
- The Pig script is parsed to check for syntax errors and to create an initial logical plan.
- If there are any syntax errors, they will be reported at this stage.
2. Logical Plan Generation:
- The parser generates a logical plan that represents the sequence of operations defined
in the script.
- This plan abstracts the details of how the operations will be executed.
3. Optimization:
- The logical plan is optimized using rule-based transformations such as filter pushdown and
projection pushdown (see Optimization Techniques below).
4. Physical Plan Generation:
- The optimized logical plan is converted into a physical plan, which consists of one or
more MapReduce jobs.
- This physical plan specifies how to execute each operation in terms of MapReduce
tasks.
5. Execution:
- The physical plan is executed on the Hadoop cluster or locally, depending on the
execution mode chosen by the user.
Pig Latin is designed for processing large datasets in parallel. It simplifies the process of writing complex
data transformations by providing an easy-to-understand syntax.
Basic Structure of a Pig Latin Script
A typical Pig Latin script consists of several key components:
- Load Statement: Reads data into a relation. For example:
```pig
A = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int);
```
- Transformation Statements: Include various operations such as filtering, grouping, joining, and
aggregating data.
```pig
B = FILTER A BY age > 30; -- Filters records where age is greater than 30
```
- Store Statement: Writes the results back to the file system. For example:
```pig
STORE B INTO 'output' USING PigStorage(',');
```
Example Workflow
1. Loading Data:
```pig
A = LOAD 'input_data' USING PigStorage(',') AS (name:chararray, age:int);
```
2. Transforming Data:
```pig
B = FILTER A BY age >= 21; -- Filter out records where age < 21
```
3. Storing Results:
```pig
STORE B INTO 'output_folder';
```
Working through the ABCs of Pig Latin
Key Features of Pig Latin
- Ease of Use: The syntax is designed to be straightforward and human-readable, making it easier
for users without extensive programming backgrounds to work with large datasets.
- Extensibility: Users can define their own functions (User Defined Functions or UDFs) to extend
Pig's capabilities beyond built-in functions.
- Optimization: The Pig Latin compiler automatically optimizes execution plans for performance
improvements without requiring user intervention.
Pig supports several data types that can be used within scripts:
- Scalar Types: int, long, float, double, chararray (strings), boolean, and bytearray.
- Complex Types:
- Tuple: An ordered set of fields, defined using parentheses `()`. For example:
```pig
(John, 25)
```
- Bag: A collection of tuples (similar to a table). Bags can contain multiple tuples and are
defined using curly braces `{}`. For example:
```pig
{(John, 25), (Jane, 30)}
```
- Map: A set of key-value pairs where keys are strings and values can be any type (scalar or
complex). Maps are defined using square brackets `[]`. For example:
```pig
M = ['name'#'John', 'age'#25];
```
- Using tuples to represent records in datasets where each record contains multiple attributes
(e.g., student information).
- Using bags to group related records together for aggregation or analysis (e.g., all sales
transactions for a particular product).
- Using maps for dynamic attributes where keys may vary between records (e.g., user profiles with
varying numbers of attributes).
Pig can run in two execution modes:
1. Local Mode
In local mode, Apache Pig runs in a single JVM on your local machine:
Advantages
- Fast feedback loop during script development; ideal for testing small datasets quickly
without needing cluster resources.
Limitations
- Not suitable for processing large datasets due to limited resources on a single machine.
- Lacks the scalability and fault tolerance provided by Hadoop's distributed architecture.
2. Distributed Mode
In distributed mode, Apache Pig runs on a Hadoop cluster and leverages HDFS for storage:
Advantages
- Designed for processing large datasets across multiple nodes in parallel; utilizes
Hadoop’s MapReduce framework efficiently.
Limitations
- Requires setup and configuration of a Hadoop cluster; longer feedback loop compared to
local mode due to job submission overhead.
Choosing a Mode
- Use local mode for development and testing with small datasets where quick feedback is
needed.
- Switch to distributed mode when working with larger datasets that require scalability and fault
tolerance provided by Hadoop.
The Grunt shell is an interactive command-line interface for executing Pig commands:
Features
- Allows users to run individual statements interactively without needing complete scripts first.
- Immediate feedback on commands executed helps identify errors quickly; useful for testing snippets
before integrating them into larger scripts.
Running Pig Scripts
1. Creating Scripts:
- Scripts can be saved with a `.pig` extension but this is not mandatory. They consist of
multiple lines of Pig Latin commands defining the data processing workflow.
2. Executing Scripts:
- Scripts can be executed from the command line using the `pig` command followed by
the script name or directly from within Grunt shell.
3. Embedded Execution:
- You can embed Pig Latin statements within Java applications using the PigServer API, or
within Python scripts via Pig's embedding support, allowing integration with other
programming environments while leveraging Hadoop's processing capabilities.
4. Error Handling:
- When running scripts in Grunt or as standalone jobs, error messages provide feedback
that helps identify issues in your script logic or syntax.
UDFs allow users to write custom functions in Java or Python that can be called from within their Pig
scripts:
Benefits
1. Flexibility: UDFs provide flexibility to implement complex business logic that goes beyond built-
in functions provided by Pig.
2. Integration with Java/Python Libraries: Users can leverage existing libraries when writing
UDFs, making it easier to perform specialized computations or manipulations on data.
3. Deployment Process: UDFs need to be packaged as JAR files when written in Java and added
to the classpath when executing scripts. For Python UDFs, they need to be defined within the
script using appropriate imports.
For instance, if you want to calculate custom metrics like customer lifetime value based on transaction
history stored in HDFS, you could write a UDF that implements this calculation logic directly within your
Pig script.
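As an illustration, a simple Java UDF might look like the following sketch (the class name and logic are illustrative). After packaging it into a JAR and REGISTERing that JAR in the script, it could be called from Pig Latin like a built-in function.
```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A simple UDF that upper-cases a chararray field.
public class UpperCase extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                       // Pig treats null as missing data
        }
        return input.get(0).toString().toUpperCase();
    }
}
```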
Optimization Techniques
Apache Pig includes several optimization techniques:
1. Logical Optimizations:
- The optimizer rewrites logical plans based on rules such as projection pushdown (removing
unnecessary fields early) and filter pushdown (applying filters as soon as possible).
2. Physical Optimizations:
- The physical plan generated from logical plans can also be optimized by combining multiple
MapReduce jobs into fewer jobs when possible through techniques like combining joins or
aggregations into single steps.
3. Explain Plans:
- Users can visualize execution plans using commands like `EXPLAIN`, which helps understand
how their scripts will be executed and identify potential bottlenecks before running them on large
datasets.
4. Sampling Data During Development:
- Users can use sampling techniques (`SAMPLE`) within their scripts during development
phases to test logic without needing full dataset processing every time.
5. Parameterization:
- Users can parameterize their scripts using `SET` commands allowing dynamic configurations
based on runtime requirements without modifying the core logic of their scripts.
Unit 5
Applying Structure to Hadoop Data with Hive
Apache Hive is a data warehousing solution built on top of Hadoop. It provides an SQL-like interface
(HiveQL) for querying and managing large datasets stored in Hadoop's HDFS. Hive abstracts the
complexity of writing MapReduce programs by allowing users to write queries in a declarative manner.
- Built on Hadoop: Hive leverages the scalability and fault tolerance of the Hadoop ecosystem,
making it suitable for processing petabytes of data.
- Metastore: Hive maintains a central repository of metadata (the Hive Metastore) that stores
information about tables, partitions, and schemas.
- HiveQL: A SQL-like query language that enables users to perform data analysis without needing
to write complex MapReduce code.
- Extensibility: Users can create User Defined Functions (UDFs) to extend the capabilities of Hive
beyond built-in functions.
- Support for Various File Formats: Hive supports multiple file formats including Text, ORC
(Optimized Row Columnar), Parquet, Avro, etc., allowing flexibility in how data is stored.
- ACID Transactions: With the introduction of ACID properties in newer versions, Hive supports
insert, update, delete operations on tables.
1. Hive Clients:
- Interfaces through which users interact with Hive. This includes command-line interfaces
like Beeline and graphical interfaces like Hue.
2. Hive Services:
- Hive Server 2: Accepts queries from clients and creates execution plans. It supports
multi-user concurrency and provides a JDBC/ODBC interface for applications.
- Hive Metastore: A central repository for metadata storage. It contains information about
databases, tables, columns, and their types.
3. Processing Framework:
- The execution engine that runs the queries. It converts HiveQL into MapReduce jobs or
other execution frameworks like Apache Tez or Apache Spark.
4. Resource Management:
- Utilizes YARN (Yet Another Resource Negotiator) for resource management across the
cluster.
5. Distributed Storage:
- Data is stored in HDFS or other compatible storage systems like Amazon S3 or Azure Blob
Storage.
Workflow in Hive
1. User submits a query: The user interacts with the Hive client to submit a query written in
HiveQL.
2. Query Parsing: The query is parsed by the Hive driver which checks for syntax errors.
3. Logical Plan Generation: The compiler generates a logical plan based on the parsed query.
4. Optimization: The logical plan is optimized, for example by pushing filters closer to the data
and pruning unneeded columns.
5. Physical Plan Generation: The optimized logical plan is converted into a physical plan
consisting of one or more MapReduce jobs.
6. Execution: The execution engine runs the jobs on the Hadoop cluster and retrieves results.
Getting Started with Apache Hive
Installation and Setup
1. Prerequisites:
- A working Java installation (JDK) and a configured Hadoop setup (single-node or cluster).
2. Installing Hive:
- Download the latest version of Apache Hive from the official website.
3. Configuration Files:
- Hive is configured primarily through `hive-site.xml` (for example, Metastore and warehouse
locations) in Hive's `conf/` directory.
Starting Hive
- Start the Hive shell using the command:
```bash
hive
```
- Alternatively, connect through Beeline, the HiveServer2 command-line client:
```bash
beeline
```
- Beeline (backed by HiveServer2) supports better concurrency and security features compared to
the legacy CLI.
- Web interfaces such as Hue provide a user-friendly way to interact with Hive without needing
command-line knowledge.
- Java API: Allows developers to integrate Hive queries into Java applications using JDBC.
- Python API: Libraries like PyHive enable Python applications to interact with Hive.
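A minimal sketch of the Java (JDBC) route, assuming HiveServer2 is running and the hive-jdbc driver is on the classpath; the host, port, credentials, and query are illustrative.
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Ensure the Hive JDBC driver is loaded (provided by the hive-jdbc dependency).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC URL; host, port, and database are illustrative.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT name, age FROM my_table LIMIT 10")) {

            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
            }
        }
    }
}
```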
String Data Types
- `CHAR(n)`: Fixed-length character string with a maximum length of n; padded with spaces if
shorter.
- `VARCHAR(n)`: Variable-length character string with a maximum length of n; does not pad
spaces.
Date/Time Data Types
- `TIMESTAMP`: Represents a specific point in time (date + time) with nanosecond precision.
Complex Data Types
1. Tuple:
- An ordered list of fields; e.g., `(name: STRING, age: INT)` represents a tuple containing a name
and age.
2. Bag:
- A collection of tuples; e.g., `{(John, 25), (Jane, 30)}` represents a bag containing two tuples.
3. Map:
- A set of key-value pairs; e.g., `[key1#value1, key2#value2]` represents a map where keys are
strings and values can be any type.
Creating Databases
```sql
CREATE DATABASE my_database;
```
Using Databases
```sql
USE my_database;
```
Creating Tables
```sql
CREATE TABLE my_table (
    name STRING,
    age INT,
    salary FLOAT
)
STORED AS TEXTFILE;
```
Table Properties
You can specify additional properties such as compression settings or custom serialization formats when
creating tables.
Managing Tables
1. Show Tables:
```sql
SHOW TABLES;
```
2. Describe a Table:
```sql
DESCRIBE my_table;
```
3. Drop Table:
```sql
DROP TABLE my_table;
```
```sql
```
5. Partitioning Tables:
Partitioning allows you to divide large tables into smaller segments based on column values for efficient
querying.
```sql
CREATE TABLE sales (
    id INT,
    amount FLOAT,
    date STRING
)
PARTITIONED BY (region STRING)
STORED AS TEXTFILE;
```
This creates partitions based on regions which can improve query performance significantly.
Seeing How the Hive Data Manipulation Language Works
Hive provides several commands for manipulating data within tables:
Inserting Data
```sql
INSERT INTO TABLE my_table VALUES ('John', 25, 50000.0);
```
```sql
INSERT OVERWRITE TABLE my_table SELECT name, age, salary FROM staging_table;
```
```sql
LOAD DATA INPATH '/user/hive/input/data.csv' INTO TABLE my_table;
```
This command imports data from an HDFS location directly into your specified table.
Querying Data
```sql
SELECT * FROM my_table;
```
Aggregation Functions
Example:
```sql
SELECT COUNT(*), AVG(salary), MAX(age) FROM my_table;
```
Basic Queries
1. Selecting Data:
```sql
SELECT name, age, salary FROM my_table;
```
2. Filtering Results:
```sql
SELECT * FROM my_table WHERE age > 30;
```
3. Joining Tables:
Joining allows you to combine records from two or more tables based on related columns:
```sql
SELECT A.name, B.location      -- table and column names are illustrative
FROM employees A
JOIN departments B ON A.department = B.department;
```
5. Using Subqueries:
```sql
SELECT name
FROM my_table
WHERE age IN (SELECT age FROM my_table WHERE salary > 60000);
```
6. Grouping Results:
You can group results by specific columns while applying aggregate functions:
```sql
SELECT department, AVG(salary)
FROM employees
GROUP BY department;
```
7. Limit Results:
```sql
SELECT * FROM my_table LIMIT 10;
```