Big Data Analytics: UNIT-1
What is Big data?
● Wikipedia defines "Big Data" as a collection of data sets so large and complex that it becomes difficult
to process using on-hand database management tools or traditional data processing applications.
● "Big Data" consists of very large volumes of heterogeneous data that is being generated, often, at high
speeds.
What is Big Data?
“Big data is the data characterized by three attributes: volume, velocity and variety.” -- IBM
“Big data is the data characterized by four attributes: volume, velocity, variety and value.” -- Oracle
“Big Data is the frontier of a firm’s ability to store, process, and access all the data it needs to
operate effectively, make decisions, reduce risks, and serve customers.” --- Forrester
“Big Data in general is defined as high volume, velocity and variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight and
decision making.” -- Gartner
“Big data is data that exceeds the processing capacity of conventional database systems. The
data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To
gain value from this data, you must choose an alternative way to process it.” -- O’Reilly
Data storage capacity and evolution
Data storage examples
The zettabyte era
● In 2016, Cisco Systems stated that the Zettabyte Era had become a reality when
global IP traffic reached an estimated 1.2 zettabytes.
● Cisco also provided future predictions of global IP traffic in their
report The Zettabyte Era: Trends and Analysis.
● This report uses current and past global IP traffic statistics to
forecast future trends.
● The report predicts trends between 2016 and 2021.
Datasets
• In business-oriented environments, data analytics results can lower operational costs and
facilitate strategic decision-making.
• In the scientific domain, data analytics can help identify the cause of a phenomenon to improve
the accuracy of predictions.
• In service-based environments like public sector organizations, data analytics can help
strengthen the focus on delivering high-quality services by driving down costs.
Data Analytics
There are four general categories of analytics that are distinguished by the results they
produce:
• Descriptive analytics
• Diagnostic analytics
• Predictive analytics
• Prescriptive analytics
Descriptive Analytics
Descriptive analytics are carried out to answer questions about events that have already occurred.
• What is the number of support calls received as categorized by severity and geographic location?
In terms of value, descriptive analytics provide the least worth and require a relatively basic skillset.
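A minimal sketch of the kind of aggregation the support-call question above implies; the records and field values below are invented purely for illustration.

```python
# Descriptive-analytics style aggregation: count support calls by
# severity and geographic location (hypothetical sample data).
from collections import Counter

support_calls = [  # each record: (severity, region)
    ("high", "East"), ("low", "West"), ("high", "East"),
    ("medium", "West"), ("low", "East"),
]

counts = Counter(support_calls)          # group and count by (severity, region)
for (severity, region), n in sorted(counts.items()):
    print(f"{severity:<6} {region:<5} {n}")
```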
Diagnostic Analytics
● It aims to determine the cause of a phenomenon that occurred in the past using
questions that focus on the reason behind the event.
● The goal of this type of analytics is to determine what information is related to the
phenomenon in order to enable answering questions that seek to determine why
something has occurred.
● Such questions include:
● Why were Q2 sales less than Q1 sales?
● Why have there been more support calls originating from the Eastern region than from
the Western region?
● Why was there an increase in patient re-admission rates over the past three months?
Diagnostic Analytics
● Diagnostic analytics provide more value than descriptive analytics but require
a more advanced skillset.
● Diagnostic analytics usually require collecting data from multiple sources and
storing it in a structure that lends itself to performing drill-down and roll-up
analysis.
● Diagnostic analytics results are viewed via interactive visualization tools that
enable users to identify trends and patterns.
● The executed queries are more complex compared to those of descriptive
analytics and are performed on multidimensional data held in analytic
processing systems.
Drill-down refers to the process of viewing data at a level
of increased detail, while roll-up refers to the process of
viewing data with decreasing detail.
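A small sketch of what roll-up and drill-down mean in practice, using made-up monthly sales figures.

```python
# Illustrative roll-up vs drill-down on hypothetical monthly sales data.
monthly_sales = {            # detailed level: (quarter, month) -> sales
    ("Q1", "Jan"): 120, ("Q1", "Feb"): 95, ("Q1", "Mar"): 110,
    ("Q2", "Apr"): 80,  ("Q2", "May"): 70, ("Q2", "Jun"): 90,
}

# Roll-up: view the data with decreasing detail (aggregate months into quarters).
quarterly = {}
for (quarter, _month), sales in monthly_sales.items():
    quarterly[quarter] = quarterly.get(quarter, 0) + sales
print(quarterly)                         # {'Q1': 325, 'Q2': 240}

# Drill-down: view the data at increased detail (expand Q2 back into months).
q2_detail = {m: s for (q, m), s in monthly_sales.items() if q == "Q2"}
print(q2_detail)                         # {'Apr': 80, 'May': 70, 'Jun': 90}
```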
Predictive Analytics
● Predictive analytics are carried out in an attempt to determine the outcome of
an event that might occur in the future.
● Generate future predictions based upon past events.
● Questions are usually formulated using a what-if rationale, such as the
following:
● What are the chances that a customer will default on a loan if they have
missed a monthly payment?
● What will be the patient survival rate if Drug B is administered instead of Drug
A?
● If a customer has purchased Products A and B, what are the chances that
they will also purchase Product C?
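A hedged, toy sketch of the idea behind the loan-default question: a simple predictive model (logistic regression, chosen only for illustration) trained on invented data.

```python
# Toy predictive model: estimate the chance of loan default from the number
# of missed monthly payments. The data below is fabricated for illustration.
from sklearn.linear_model import LogisticRegression

# Feature: [missed_payments]; label: 1 = defaulted, 0 = repaid
X = [[0], [0], [0], [1], [1], [2], [2], [3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

# Predicted probability of default for a customer with one missed payment
print(model.predict_proba([[1]])[0][1])
```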
Data Analysis
Data analysis is a process of inspecting, cleansing, transforming,
and modelling (past) data with the goal of discovering useful information, informing
conclusions, and supporting decision-making.
● Through analysis, it is relatively simple to extract more valuable insights from the
available data by performing various types of data analysis, such as exploratory data
analysis, predictive analysis, and inferential analysis. These play a major role in
providing more insight into, and understanding of, the data.
Data Analysis vs Analytics
While analytics and analysis are more similar than different, their contrast is in the emphasis of each.
They both refer to an examination of information, but while analysis is the broader and more general
concept, analytics is a more specific reference to the systematic examination of data.
Data Analysis
● Analysis is the broader and more general concept.
● We do analysis to explain how and/or why something happened.
● Data analysis helps in understanding the data and provides the required insights from the past
to understand what has happened so far.
● The most common tools employed in data analysis are Tableau, Excel, SPARK, Google Fusion
Tables, NodeXL, etc.
Data Analytics
● Analytics is a more specific reference to the systematic examination of data.
● We use analytics to explore potential future events.
● Data analytics is the process of exploring data from the past in order to make appropriate
decisions in the future by using valuable insights.
● The most common tools employed in data analytics are R, Python, SAS, SPARK, Google
Analytics, Excel, etc.
Big Data Characteristics
● Most of these data characteristics were initially identified by Doug Laney in early 2001 when he
published an article describing the impact of the volume, velocity and variety of e-commerce data
on enterprise data warehouses.
● There are five Big Data characteristics that can be used to help differentiate data categorized as
“Big” from other forms of data: volume, velocity, variety, veracity and value.
Volume
● The anticipated volume of data that is processed by Big Data solutions is
substantial and ever-growing.
● High data volumes impose distinct data storage and processing demands, as well
as additional data preparation, curation and management processes.
● Typical data sources that are responsible for generating high data volumes can
include:
● Online transactions, such as point-of-sale and banking
● Scientific and research experiments
● Sensors, such as GPS sensors, RFIDs, smart meters and telematics
● Social media, such as Facebook and Twitter
Velocity
● In Big Data environments, data can arrive at fast speeds, and enormous datasets
can accumulate within very short periods of time.
● From an enterprise’s point of view, the velocity of data translates into the amount of
time it takes for the data to be processed once it enters the enterprise’s perimeter.
● Coping with the fast inflow of data requires the enterprise to design highly elastic and
available data processing solutions and corresponding data storage capabilities.
● Depending on the data source, velocity may not always be high.
● For example, MRI scan images are not generated as frequently as log entries from a
high-traffic webserver.
Variety
● Data variety refers to the multiple formats and types of data that need to be
supported by Big Data solutions.
● Data variety brings challenges for enterprises in terms of data integration,
transformation, processing, and storage.
Veracity
● Veracity refers to the quality of data.
● Data that enters Big Data environments needs to be assessed for quality, which can lead
to data processing activities to resolve invalid data and remove noise.
● Data can be part of the signal or noise of a dataset.
● Noise is data that cannot be converted into information and thus has no value, whereas
signals have value and lead to meaningful information.
● Data with a high signal-to-noise ratio has more veracity than data with a lower ratio.
● Data that is acquired in a controlled manner, for example via online customer
registrations, usually contains less noise than data acquired via uncontrolled sources,
such as blog postings.
● The signal-to-noise ratio of data is dependent upon the source of the data and its type.
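An illustrative sketch of the signal-versus-noise idea: a crude validity check over incoming records. Both the validation rule and the sample records are assumed for illustration only.

```python
# Treat a record as "signal" if it parses and passes a basic validity check;
# everything else counts as "noise".
raw_records = ["42,delivered", "17,pending", "oops", ",", "8,delivered"]

def is_signal(record: str) -> bool:
    parts = record.split(",")
    return len(parts) == 2 and parts[0].isdigit() and parts[1] != ""

signal = [r for r in raw_records if is_signal(r)]
noise = [r for r in raw_records if not is_signal(r)]
print(f"signal-to-noise ratio: {len(signal)}:{len(noise)}")   # 3:2
```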
Value
● Value refers to the usefulness of the data for an enterprise.
Types of Data
• Structured data
• Unstructured data
• Semi-structured data
Structured Data
Big Data Characteristics
● Big Data Characteristics
○ Volume – How much data?
○ Velocity – How fast the data is generated/processed?
○ Variety - The various types of data.
○ Veracity – Quality of data
○ Value -- The usefulness of the data
Data is everywhere in the universe; in practice, the data we can capture and work with is bounded
by existing technological capacity. As that capacity grows, there is effectively no boundary or
limitation to data.
What is Hadoop?
● Hadoop is an open source framework for writing and running distributed applications that
process large amounts of data.
● Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it
is
● Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing
services such as Amazon’s Elastic Compute Cloud (EC2).
● Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the
assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
● Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
● Simple—Hadoop allows users to quickly write efficient parallel code.
Hadoop’s accessibility and simplicity give it an edge over writing and running large
distributed programs. On the other hand, its robustness and scalability make it suitable for even
the most demanding jobs at Yahoo and Facebook. These features make Hadoop popular in both
academia and industry.
What is Hadoop?
A Hadoop cluster is a set of commodity machines networked together in one location.
● Hadoop was started by Doug Cutting and Mike Cafarella in 2002, when they both began
working on the Apache Nutch project.
● The Apache Nutch project aimed to build a search engine system that could index 1
billion pages.
● After a lot of research on Nutch, they estimated that such a system would cost around half a
million dollars in hardware, along with a monthly running cost of approximately $30,000,
which is very expensive.
● They realized that their project architecture would not be capable of handling
billions of pages on the web.
● They were looking for a feasible solution that could reduce the implementation cost as well
as solve the problem of storing and processing large datasets.
Hadoop | History or Evolution
● In 2003, they came across a paper published by Google that described the architecture of Google’s
distributed file system, GFS (Google File System), for storing large data sets.
● This paper was only half the solution to their problem.
● In 2004, Google published another paper, on the MapReduce technique, which was the solution for
processing those large datasets.
● This paper was the other half of the solution for Doug Cutting and Mike Cafarella’s Nutch project.
● Both techniques (GFS & MapReduce) existed only as papers from Google; no open-source
implementations were available.
● So Doug Cutting, together with Mike Cafarella, started implementing Google’s techniques (GFS & MapReduce) as
open source in the Apache Nutch project.
● In 2005, Cutting found that Nutch was limited to clusters of only 20 to 40 nodes. He soon realized two problems:
(a) Nutch wouldn’t achieve its potential until it ran reliably on larger clusters
(b) and that looked impossible with just two people (Doug Cutting & Mike Cafarella).
Hadoop | History or Evolution
● In 2009, Hadoop was successfully tested to sort a PB (PetaByte) of data in less than 17
hours for handling billions of searches and indexing millions of web pages.
● Doug Cutting left Yahoo and joined Cloudera to take up the challenge of spreading Hadoop
to other industries.
● In December of 2011, Apache Software Foundation released Apache Hadoop version 1.0.
● In Aug 2013, Version 2.0.6 was available.
● And currently, we have Apache Hadoop version 3.0, which was released in December 2017.
Comparing Hadoop with SQL
1. Scale-out instead of scale-up
● Scaling commercial relational databases is expensive.
● Their design is more friendly to scaling up.
● To run a bigger database you need to buy a bigger
machine.
● At some point there won’t be a big enough machine
available for the larger data sets.
● The high-end machines are not cost effective. A
machine with four times the power of a standard PC
costs a lot more than putting four such PCs in a cluster.
Comparing Hadoop with SQL
Hadoop
● For data-intensive workloads, a large number of commodity low-end servers (i.e., the
"scaling out" approach) is preferred over a small number of high-end servers (i.e., the
"scaling up" approach).
● Hadoop is designed as a scale-out architecture operating on a cluster of commodity
PC machines.
● Adding more resources means adding more machines to the Hadoop cluster.
● Hadoop clusters with tens to hundreds of machines are standard.
SQL (structured query language) is by design targeted at structured data. Many of
Hadoop’s initial applications deal with unstructured data such as text.
From this perspective, Hadoop provides a more general paradigm than SQL.
Comparing Hadoop with SQL
2. Key/value pairs instead of relational tables
SQL
● Data resides in tables having relational structure defined by a schema.
● Many modern applications deal with data types that don’t fit well into this model.
● Text documents, images, and XML files are popular examples. Also, large data sets are
often unstructured or semistructured.
Hadoop
● Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with
the less-structured data types.
● In Hadoop, data can originate in any form, but it eventually transforms into (key/value)
pairs for the processing functions to work on.
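A small sketch of the key/value idea: arbitrary input (here, lines of text) is turned into key/value pairs before processing, mirroring the common (offset, line) convention for text input. This is illustrative Python, not Hadoop API code, and the sample lines are invented.

```python
# Turn arbitrary text input into (key, value) pairs, where the key is the
# character offset of the line and the value is the line itself.
def text_to_key_value(lines):
    """Yield (offset, line) pairs from an iterable of text lines."""
    offset = 0
    for line in lines:
        yield (offset, line.rstrip("\n"))
        offset += len(line)

sample = ["the quick brown fox\n", "jumps over the lazy dog\n"]
for key, value in text_to_key_value(sample):
    print(key, value)
```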
Comparing Hadoop with SQL
3. Functional programming (MapReduce) instead of declarative queries (SQL)
SQL
● SQL is fundamentally a high-level declarative language: you query data by stating the result
you want and let the database engine figure out how to derive it.
Hadoop
● Under MapReduce you specify the actual steps in processing the data.
● Under MapReduce you work with scripts and code.
MapReduce allows you to process data in a more general fashion than SQL queries. For example,
you can build complex statistical models from your data or reformat your image data. SQL is not
well designed for such tasks.
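An illustrative side-by-side sketch of the two styles, using invented data: the same aggregation written declaratively in SQL (via sqlite3) and procedurally as explicit map/shuffle/reduce-style steps.

```python
# Same aggregation, two styles: declarative SQL vs explicit processing steps.
import sqlite3
from collections import defaultdict

orders = [("East", 10), ("West", 5), ("East", 7)]   # hypothetical data

# Declarative: state WHAT you want; the engine decides how to compute it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount INT)")
con.executemany("INSERT INTO orders VALUES (?, ?)", orders)
print(con.execute("SELECT region, SUM(amount) FROM orders GROUP BY region").fetchall())

# Procedural (MapReduce style): spell out the steps yourself.
mapped = [(region, amount) for region, amount in orders]      # map step: emit pairs
grouped = defaultdict(list)
for key, value in mapped:                                     # shuffle/group step
    grouped[key].append(value)
print({key: sum(values) for key, values in grouped.items()})  # reduce step
```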
Comparing Hadoop with SQL
4. Offline batch processing instead of online transactions
● Hadoop is designed for the offline processing and analysis of large-scale data; it is not meant
for the random reading and writing of a few records, which is the kind of workload handled by
online transaction processing systems.
● Checkpointing is the process of merging the content of the most recent fsimage with
all edits applied after that fsimage, to create a new fsimage.
Checkpointing is triggered automatically by configuration policies or manually by
HDFS administration commands.
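A toy sketch of the checkpoint idea: merge the last saved namespace image with the edit log recorded since then to produce a new fsimage. The structures and operation names are simplified assumptions, not HDFS internals.

```python
# Merge a saved namespace snapshot (fsimage) with the edits logged since.
fsimage = {"/data/a.txt": 3, "/data/b.txt": 1}     # path -> replication factor

edits = [                                          # edits applied after that fsimage
    ("create", "/data/c.txt", 3),
    ("delete", "/data/b.txt", None),
    ("set_replication", "/data/a.txt", 2),
]

def checkpoint(image, edit_log):
    """Apply the edit log to the old image and return a new fsimage."""
    new_image = dict(image)
    for op, path, arg in edit_log:
        if op == "create":
            new_image[path] = arg
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "set_replication":
            new_image[path] = arg
    return new_image

print(checkpoint(fsimage, edits))   # the new fsimage; the edit log can now be truncated
```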
YARN
YARN (Yet Another Resource Negotiator)
● YARN helps to manage the resources across the cluster. In short, it performs scheduling and
resource allocation for the Hadoop system.
● It consists of three major components: Resource Manager, Node Manager and Application Manager.
● The resource manager has the privilege of allocating resources for the applications in the system.
● The resource manager runs on a master daemon and manages the resource allocation in the cluster.
● Node managers work on the allocation of resources such as CPU, memory and bandwidth per machine and later
acknowledge the resource manager. They run on the slave daemons and are responsible for the execution of
a task on every single Data Node.
● The application manager works as an interface between the resource manager and node managers.
● YARN allocates resources in the cluster and manages the applications running over Hadoop. It allows data stored
in HDFS to be processed by various data processing engines, such as batch processing, stream processing,
interactive processing, graph processing, and many more. This increases efficiency with the use of YARN.
MapReduce
● By making use of distributed and parallel algorithms, MapReduce carries the processing
logic to the data and helps to write applications that transform big data sets into
manageable ones.
● MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of data, thereby organizing it in the form of
groups. Map() generates key-value pair based results which are later processed by the
Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped
data. In simple terms, Reduce() takes the output generated by Map() as input and combines
those tuples into a smaller set of tuples.
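A minimal in-memory sketch of the Map()/Reduce() logic using the classic word-count example; this mirrors the two-phase idea only and is not Hadoop API code.

```python
# Word count expressed as map -> shuffle/group -> reduce.
from collections import defaultdict

documents = ["big data needs big storage", "big clusters process data"]

# Map(): emit a (key, value) pair for every word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values belonging to the same key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce(): aggregate each group into a smaller set of tuples
word_counts = {key: sum(values) for key, values in groups.items()}
print(word_counts)    # e.g. {'big': 3, 'data': 2, ...}
```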
Five steps MapReduce programming model
Mark Distribution
• Mid 1 10 marks
• Mid 2 10 marks
• CLA1 5 marks
• CLA2 5 marks
• Lab experiments: 20 marks
• Project: 20 marks
• Final exam 30 marks
Big Data Characteristics
• Big Data Characteristics
– Volume – How much data?
– Velocity – How fast the data is generated/processed?
– Variety - The various types of data.
– Veracity – Quality of data
– Value -- The usefulness of the data
• Descriptive Analytics
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
Different Types of Data Analytics
• Descriptive Analytics
– are carried out to answer questions about events that
already have occurred
– Example questions
• Sales volume over past 12 months
• Monthly commission earned by sales agent
• Number of calls received based on severity and
geographic location.
Different Types of Data Analytics
• Diagnostic Analytics
– aims to determine the cause of a phenomenon that
occurred in the past using questions that focus on the
reason behind the event
– Example questions
– Item 2 sales less than item 3 – why?
– More service request calls from western region- why?
– Patient re-admission rates are increasing why?
Different Types of Data Analytics
• Predictive Analytics
– try to predict the outcome of the events and predictions are
made based on patterns, trends and exceptions found in
historical and current data.
– Example questions.
• Chances of a customer defaulting on a loan if one monthly
payment is missed
• Patient survival rate if Drug B is administered instead
of Drug A
• If a customer purchases products A and B, the chances of
also purchasing product C
Different Types of Data Analytics
• Prescriptive Analytics
– build upon the results of predictive analytics by
prescribing actions that should be taken
– incorporate internal data (current and historical data,
customer information, product data and business rules) and
external data (social media data, weather forecasts and
government-produced demographic data)
– Example questions:
• Among three drugs which one provides best results
why?
• When is the best time to trade a particular stock?
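A hedged sketch of the prescriptive step for the drug question: given predicted outcomes (which a predictive model would supply), prescribe the action with the best expected result. The drugs and survival rates below are made up for illustration.

```python
# Pick the action with the highest predicted outcome.
predicted_survival = {"Drug A": 0.72, "Drug B": 0.81, "Drug C": 0.78}

best_drug = max(predicted_survival, key=predicted_survival.get)
print(f"Prescribed action: administer {best_drug} "
      f"(predicted survival rate {predicted_survival[best_drug]:.0%})")
```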
Big Data
• How to store and process large data ?
– Hadoop supports storage and computing framework for solving
Big Data problems.
– Researchers are working to find more efficient solutions.
• Limitations of Centralized systems
– Storage is limited to terabytes
– Uni-processor systems: Multi programming, Time sharing,
Threads - improves throughput
– Multi processor systems: Parallel processing using multiple
CPUs - improves throughput.
• Disadvantage: Scalability – Resources cannot be added to handle
increase in load.
Big Data
• Centralized systems
– Data has to be transferred to the system where application
program is getting executed
– Do not have
• Enough storage
• Required computing power to process large data
HDFS Architecture
• Name node
– The name node manages the file system’s metadata (location
of the file, size of the file, etc.) and the name space
– Name space is a hierarchy of files and directories (name
space tree) and it is kept in the main memory.
– The mapping of blocks to data nodes is determined by the
name node
– Runs Job Tracker Program
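A simplified sketch of the metadata a name node keeps in memory: the namespace (file to blocks) and the block-to-data-node mapping. All IDs and host names are hypothetical.

```python
# Toy model of name node metadata: namespace plus block locations.
namespace = {                      # namespace tree flattened to paths -> block list
    "/logs/web.log": ["blk_1", "blk_2"],
    "/data/sales.csv": ["blk_3"],
}

block_locations = {                # block -> data nodes holding a replica
    "blk_1": ["dn01", "dn02", "dn07"],
    "blk_2": ["dn03", "dn02", "dn08"],
    "blk_3": ["dn01", "dn05", "dn09"],
}

def locate_file(path):
    """Return the data nodes to contact for each block of a file."""
    return {blk: block_locations[blk] for blk in namespace[path]}

print(locate_file("/logs/web.log"))
```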
HDFS Architecture - Continued
• Characteristics of HDFS
– The block replication factor is set by the user and is 3 by default
– HDFS stores one replica on the same node where the write operation
is requested and one on a different node in the same rack
(why?)
• And one replica on a node in a different rack
• Heartbeat Message (HM)
– Data node sends periodic HM to the name node
– Receipt of a HM indicates that the data node is functioning
properly
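A toy sketch of heartbeat handling: the name node records the time of each data node's last heartbeat and treats nodes that stay silent too long as failed. The timeout value and node names are assumptions for illustration.

```python
# Track last-heartbeat times and flag data nodes that have gone silent.
import time

HEARTBEAT_TIMEOUT = 10 * 60            # seconds of silence before a node is marked dead
last_heartbeat = {}                    # data node id -> time of last heartbeat

def receive_heartbeat(node_id):
    """Called whenever a data node's periodic heartbeat message arrives."""
    last_heartbeat[node_id] = time.time()

def dead_nodes(now=None):
    """Return the data nodes whose heartbeats have stopped arriving."""
    now = now if now is not None else time.time()
    return [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

receive_heartbeat("dn01")
receive_heartbeat("dn02")
print(dead_nodes())                    # [] -- both nodes reported recently
```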
HDFS - Contd