Data Analytics
QUANTUM SERIES
For
B.Tech Students of Third Year
of All Engineering Colleges Affiliated to
Dr. A.P.J. Abdul Kalam Technical University,
Uttar Pradesh, Lucknow
(Formerly Uttar Pradesh Technical University)
Data Analytics
By
Aditya Kumar
CONTENTS
KCS-051/KIT-601 : DATA ANALYTICS
UNIT-1 : INTRODUCTION TO DATA ANALYTICS (1–1 J to 1–20 J)
Introduction to Data Analytics: Sources and nature of data,
classification of data (structured, semi-structured, unstructured),
characteristics of data, introduction to Big Data platform, need of
data analytics, evolution of analytic scalability, analytic process
and tools, analysis vs reporting, modern data analytic tools,
applications of data analytics. Data Analytics Lifecycle: Need, key
roles for successful analytic projects, various phases of data analytics
lifecycle – discovery, data preparation, model planning, model
building, communicating results, operationalization
UNIT-2 : DATA ANALYSIS (2–1 J to 2–28 J)
Regression modeling, multivariate analysis, Bayesian modeling,
inference and Bayesian networks, support vector and kernel methods,
analysis of time series: linear systems analysis & nonlinear dynamics,
rule induction, neural networks: learning and generalisation,
competitive learning, principal component analysis and neural
networks, fuzzy logic: extracting fuzzy models from data, fuzzy
decision trees, stochastic search methods.
UNIT-3 : MINING DATA STREAMS (3–1 J to 3–20 J)
Introduction to streams concepts, stream data model and
architecture, stream computing, sampling data in a stream, filtering
streams, counting distinct elements in a stream, estimating moments,
counting oneness in a window, decaying window, Real-time
Analytics Platform (RTAP) applications, Case studies – real time
sentiment analysis, stock market predictions.
UNIT-4 : FREQUENT ITEMSETS & CLUSTERING (4–1 J to 4–28 J)
Mining frequent itemsets, market based modelling, Apriori algorithm,
handling large data sets in main memory, limited pass algorithm,
counting frequent itemsets in a stream, clustering techniques:
hierarchical, K-means, clustering high dimensional data, CLIQUE
and ProCLUS, frequent pattern based clustering methods, clustering
in non-euclidean space, clustering for streams & parallelism.
UNIT-5 : FRAME WORKS & VISUALIZATION (5–1 J to 5–30 J)
Frame Works and Visualization: MapReduce, Hadoop, Pig, Hive,
HBase, MapR, Sharding, NoSQL Databases, S3, Hadoop Distributed
File Systems, Visualization: visual data analysis techniques,
interaction techniques, systems and applications. Introduction to R
- R graphical user interfaces, data import and export, attribute and
data types, descriptive statistics, exploratory data analysis,
visualization before analysis, analytics for unstructured data.
CO 1 : Describe the life cycle phases of Data Analytics through discovery, planning and building. (K1, K2)
CO 2 : Understand and apply Data Analysis Techniques. (K2, K3)
CO 5 : Apply R tool for developing and evaluating real time applications. (K3, K5, K6)
1 Introduction to
Data Analytics
CONTENTS
Part-1 : Introduction to Data Analytics : ............... 1–2J to 1–5J
Sources and Nature of Data,
Classification of Data (Structured,
Semi-Structured, Unstructured),
Characteristics of Data
PART-1
Introduction To Data Analytics : Sources and Nature of Data,
Classification of Data (Structured, Semi-Structured, Unstructured),
Characteristics of Data.
Questions-Answers
Answer
1. Data analytics is the science of analyzing raw data in order to draw
conclusions about that information.
2. Any type of information can be subjected to data analytics techniques to
get insight that can be used to improve things.
3. Data analytics techniques can help in finding the trends and metrics
that would be used to optimize processes to increase the overall efficiency
of a business or system.
4. Many of the techniques and processes of data analytics have been
automated into mechanical processes and algorithms that work over
raw data for human consumption.
5. For example, manufacturing companies often record the runtime,
downtime, and work queue for various machines and then analyze the
data to better plan the workloads so the machines operate closer to peak
capacity.
Answer
Three primary sources of Big Data are :
1. Social data :
a. Social data comes from the likes, tweets & retweets, comments,
video uploads, and general media that are uploaded and shared via
social media platforms.
b. This kind of data provides invaluable insights into consumer
behaviour and sentiment and can be enormously influential in
marketing analytics.
c. The public web is another good source of social data, and tools like
Google trends can be used to good effect to increase the volume of
big data.
2. Machine data :
a. Machine data is defined as information which is generated by
industrial equipment, sensors that are installed in machinery, and
even web logs which track user behaviour.
b. This type of data is expected to grow exponentially as the internet
of things grows ever more pervasive and expands around the world.
c. Sensors such as medical devices, smart meters, road cameras,
satellites, games and the rapidly growing Internet of Things will
deliver high velocity, value, volume and variety of data in the very
near future.
3. Transactional data :
a. Transactional data is generated from all the daily transactions that
take place both online and offline.
b. Invoices, payment orders, storage records, delivery receipts are
characterized as transactional data.
Answer
1. Unstructured data :
a. Unstructured data is the rawest form of data.
b. Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
c. This data is often stored in a repository of files.
2. Structured data :
a. Structured data is tabular data (rows and columns) which are very
well defined.
b. Data containing a defined data type, format, and structure, which
may include transaction data, traditional RDBMS, CSV files, and
simple spreadsheets.
3. Semi-structured data :
a. Textual data files with a distinct pattern that enables parsing such
as Extensible Markup Language [XML] data files or JSON.
b. A consistent format is defined however the structure is not very
strict.
c. Semi-structured data are often stored as files.
Answer
Properties compared across structured, semi-structured and unstructured data :
1. Technology : Structured data is based on relational database tables; semi-structured data is based on XML/RDF; unstructured data is based on character and binary data.
2. Transaction management : Structured data has matured transaction management and various concurrency techniques; for semi-structured data, transaction management is adapted from the DBMS; unstructured data has no transaction management and no concurrency.
3. Flexibility : Structured data is schema dependent and less flexible; semi-structured data is more flexible than structured data but less flexible than unstructured data; unstructured data is very flexible, with an absence of schema.
4. Scalability : It is very difficult to scale a structured database schema; semi-structured data is more scalable than structured data; unstructured data is very scalable.
5. Query performance : Structured queries allow complex joining; for semi-structured data, queries over anonymous nodes are possible; for unstructured data, only textual queries are possible.
Answer
Big Data is characterized into four dimensions :
1. Volume :
a. Volume is concerned about scale of data i.e., the volume of the data
at which it is growing.
b. The volume of data is growing rapidly, due to several applications of
business, social, web and scientific explorations.
2. Velocity :
a. The speed at which data is increasing thus demanding analysis of
streaming data.
b. The velocity is due to growing speed of business intelligence
applications such as trading, transaction of telecom and banking
domain, growing number of internet connections with the increased
usage of internet etc.
PART-2
Introduction to Big Data Platform, Need of Data Analytics.
Questions-Answers
Answer
1. A big data platform is a type of IT solution that combines the features and
capabilities of several big data applications and utilities within a single
solution.
2. It is an enterprise-class IT platform that enables organizations in
developing, deploying, operating and managing a big data infrastructure/
environment.
3. A big data platform generally consists of big data storage, servers, databases,
big data management, business intelligence and other big data
management utilities.
4. It also supports custom development, querying and integration with
other systems.
5. The primary benefit of a big data platform is to reduce the complexity
of multiple vendors/solutions into one cohesive solution.
6. Big data platforms are also delivered through the cloud, where the provider
offers all-inclusive big data solutions and services.
Answer
Features of Big Data analytics platform :
1. A Big Data platform should be able to accommodate new platforms and
tools based on the business requirement.
2. It should support linear scale-out.
3. It should have capability for rapid deployment.
4. It should support a variety of data formats.
5. Platform should provide data analysis and reporting tools.
6. It should provide real-time data analysis software.
7. It should have tools for searching the data through large data sets.
Answer
Need of data analytics :
1. It optimizes the business performance.
2. It helps to make better decisions.
3. It helps to analyze customer trends and solutions.
PART-3
Evolution of Analytic Scalability, Analytic Process and Tools,
Analysis vs Reporting, Modern Data Analytic Tools, Applications
of Data Analytics.
Questions-Answers
Answer
Steps involved in data analysis are :
1. Determine the data :
a. The first step is to determine the data requirements or how the
data is grouped.
b. Data may be separated by age, demographic, income, or gender.
c. Data values may be numerical or be divided by category.
2. Collection of data :
a. The second step in data analytics is the process of collecting it.
b. This can be done through a variety of sources such as computers,
online sources, cameras, environmental sources, or through
personnel.
3. Organization of data :
a. Third step is to organize the data.
b. Once the data is collected, it must be organized so it can be analyzed.
c. Organization may take place on a spreadsheet or other form of
software that can take statistical data.
4. Cleaning of data :
a. In fourth step, the data is then cleaned up before analysis.
b. This means it is scrubbed and checked to ensure there is no
duplication or error, and that it is not incomplete.
c. This step helps correct any errors before it goes on to a data analyst
to be analyzed.
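For illustration, here is a minimal Python sketch of the organization and cleaning steps using pandas; the file name sales.csv and the column names age and income are hypothetical, not from the text :

    import pandas as pd

    # Steps 2-3 : collect and organize the data into a tabular structure
    df = pd.read_csv("sales.csv")                  # hypothetical input file

    # Step 4 : clean the data before analysis
    df = df.drop_duplicates()                      # remove duplicated records
    df = df.dropna(subset=["age", "income"])       # drop rows with missing key fields

    # basic sanity check before handing the data to an analyst
    print(df.describe())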
Answer
1. In analytic scalability, we have to pull the data together in a separate
analytics environment and then start performing analysis.
Answer
1. With an increased level of scalability, analytic processes need to be updated
to take advantage of it.
2. This can be achieved with the use of analytical sandboxes to provide
analytic professionals with a scalable environment to build advanced
analytics processes.
3. One of the uses of an MPP (massively parallel processing) database system is to facilitate the building and
deployment of advanced analytic processes.
4. An analytic sandbox is the mechanism to utilize an enterprise data
warehouse.
5. If used appropriately, an analytic sandbox can be one of the primary
drivers of value in the world of big data.
Analytical sandbox :
1. An analytic sandbox provides a set of resources with which in-depth
analysis can be done to answer critical business questions.
Answer
Modern data analytic tools :
1. Apache Hadoop :
a. Apache Hadoop is a big data analytics tool and a Java-based free
software framework.
b. It helps in effective storage of huge amount of data in a storage
place known as a cluster.
c. It runs in parallel on a cluster and also has ability to process huge
data across all nodes in it.
d. There is a storage system in Hadoop popularly known as the Hadoop
Distributed File System (HDFS), which helps to split large volumes
of data and distribute them across the many nodes present in a
cluster.
2. KNIME :
a. KNIME Analytics Platform is one of the leading open-source solutions for
data-driven innovation.
b. This tool helps in discovering the potential hidden in huge volumes
of data; it can also mine for fresh insights or predict new futures.
3. OpenRefine :
a. OpenRefine is one of the efficient tools for working on messy
and large volumes of data.
b. It includes cleansing data and transforming that data from one format
to another.
c. It helps to explore large data sets easily.
4. Orange :
a. Orange is a famous open-source data visualization and data analysis
tool suitable for beginners as well as experts.
Que 1.13. What are the benefits of analytic sandbox from the view
of an analytic professional ?
Answer
Benefits of analytic sandbox from the view of an analytic
professional :
1. Independence : Analytic professionals will be able to work
independently on the database system without needing to continually
go back and ask for permissions for specific projects.
2. Flexibility : Analytic professionals will have the flexibility to use
whatever business intelligence, statistical analysis, or visualization tools
that they need to use.
3. Efficiency : Analytic professionals will be able to leverage the existing
enterprise data warehouse or data mart, without having to move or
migrate data.
Que 1.14. What are the benefits of analytic sandbox from the
view of IT ?
Answer
Benefits of analytic sandbox from the view of IT :
1. Centralization : IT will be able to centrally manage a sandbox
environment just as every other database environment on the system is
managed.
2. Streamlining : A sandbox will greatly simplify the promotion of analytic
processes into production since there will be a consistent platform for
both development and deployment.
3. Simplicity : There will be no more processes built during development
that have to be totally rewritten to run in the production environment.
4. Control : IT will be able to control the sandbox environment, balancing
sandbox needs and the needs of other users. The production environment
is safe from an experiment gone wrong in the sandbox.
5. Costs : Big cost savings can be realized by consolidating many analytic
data marts into one central system.
Answer
Application of data analytics :
1. Security : Data analytics applications or, more specifically, predictive
analysis has also helped in dropping crime rates in certain areas.
2. Transportation :
a. Data analytics can be used to revolutionize transportation.
b. It can be used especially in areas where we need to transport a
large number of people to a specific area and require seamless
transportation.
3. Risk detection :
a. Many organizations were struggling under debt, and they wanted a
solution to the problem of fraud.
b. They already had enough customer data in their hands, and so,
they applied data analytics.
c. They used ‘divide and conquer’ policy with the data, analyzing recent
expenditure, profiles, and any other important information to
understand any probability of a customer defaulting.
4. Delivery :
a. Several top logistic companies are using data analysis to examine
collected data and improve their overall efficiency.
b. Using data analytics applications, the companies were able to find
the best shipping routes, delivery time, as well as the most cost-
efficient transport means.
5. Fast internet allocation :
a. While it might seem that allocating fast internet in every area
makes a city ‘Smart’, in reality, it is more important to engage in
smart allocation. This smart allocation would mean understanding
how bandwidth is being used in specific areas and for the right
cause.
b. It is also important to shift the data allocation based on timing and
priority. It is assumed that financial and commercial areas require
the most bandwidth during weekdays, while residential areas
require it during the weekends. But the situation is much more
complex. Data analytics can solve it.
c. For example, using applications of data analysis, a community can
draw the attention of high-tech industries and in such cases; higher
bandwidth will be required in such areas.
6. Internet searching :
a. When we use Google, we are using one of their many data analytics
applications employed by the company.
b. Most search engines like Google, Bing, Yahoo, AOL etc., use data
analytics. These search engines use different algorithms to deliver
the best result for a search query.
7. Digital advertisement :
a. Data analytics has revolutionized digital advertising.
b. Most advertisement sources nowadays, such as digital billboards in
cities and banners on websites, use data analytics and data algorithms.
Que 1.16. What are the different types of Big Data analytics ?
Answer
Different types of Big Data analytics :
1. Descriptive analytics :
a. It uses data aggregation and data mining to provide insight into the
past.
PART-4
Data Analytics Life Cycle : Need, Key Roles For Successful Analytic
Projects, Various Phases of Data Analytic Life Cycle : Discovery,
Data Preparations.
Questions-Answers
Que 1.17. Explain the key roles for a successful analytics projects.
Answer
Key roles for a successful analytics project :
1. Business user :
a. Business user is someone who understands the domain area and
usually benefits from the results.
b. This person can consult and advise the project team on the context
of the project, the value of the results, and how the outputs will be
operationalized.
Answer
Various phases of data analytic lifecycle are :
Phase 1 : Discovery :
2. In addition, the team may run a pilot project to implement the models in
a production environment.
Answer
Main activities that are performed while identifying potential data sources
during discovery phase are :
1. Identify data sources :
a. Make a list of candidate data sources the team may need to test the
initial hypotheses outlined in discovery phase.
b. Make an inventory of the datasets currently available and those
that can be purchased or otherwise acquired for the tests the team
wants to perform.
2. Capture aggregate data sources :
a. This is for previewing the data and providing high-level
understanding.
b. It enables the team to gain a quick overview of the data and perform
further exploration on specific areas.
c. It also points the team to possible areas of interest within the data.
3. Review the raw data :
a. Obtain preliminary data from initial data feeds.
b. Begin understanding the interdependencies among the data
attributes, and become familiar with the content of the data, its
quality, and its limitations.
4. Evaluate the data structures and tools needed :
a. The data type and structure dictate which tools the team can use to
analyze the data.
b. This evaluation gets the team thinking about which technologies
may be good candidates for the project and how to start getting
access to these tools.
5. Scope the sort of data infrastructure needed for this type of
problem : In addition to the tools needed, the data influences the kind
of infrastructure required, such as disk storage and network capacity.
Answer
Sub-phases of data preparation are :
1. Preparing an analytics sandbox :
PART-5
Model Planning, Model Building, Communicating Results, Operationalization.
Questions-Answers
Que 1.21. What are activities that are performed in model planning
phase ?
Answer
Activities that are performed in model planning phase are :
1. Assess the structure of the datasets :
a. The structure of the data sets is one factor that dictates the tools
and analytical techniques for the next phase.
b. Depending on whether the team plans to analyze textual data or
transactional data, different tools and approaches are required.
2. Ensure that the analytical techniques enable the team to meet the
business objectives and accept or reject the working hypotheses.
3. Determine if the situation allows a single model or a series of techniques
as part of a larger analytic workflow.
Que 1.22. What are the common tools for the model planning
phase ?
Answer
Common tools for the model planning phase :
1. R:
a. It has a complete set of modeling capabilities and provides a good
environment for building interpretive models with high-quality
code.
b. It has the ability to interface with databases via an ODBC
connection and execute statistical tests and analyses against Big
Data via an open source connection.
2. SQL Analysis Services : SQL Analysis Services can perform in-database
analytics of common data mining functions, involved aggregations, and
basic predictive models.
3. SAS/ACCESS :
a. SAS/ACCESS provides integration between SAS and the analytics
sandbox via multiple data connectors such as ODBC, JDBC, and
OLE DB.
b. SAS itself is generally used on file extracts, but with SAS/ACCESS,
users can connect to relational databases (such as Oracle) and
data warehouse appliances, files, and enterprise applications.
Answer
Commercial common tools for the model building phase :
1. SAS Enterprise Miner :
a. SAS Enterprise Miner allows users to run predictive and descriptive
models based on large volumes of data from across the enterprise.
b. It interoperates with other large data stores, has many partnerships,
and is built for enterprise-level computing and analytics.
2. SPSS Modeler provided by IBM : It offers methods to explore and
analyze data through a GUI.
3. Matlab : Matlab provides a high-level language for performing a variety
of data analytics, algorithms, and data exploration.
4. Alpine Miner : Alpine Miner provides a GUI frontend for users to
develop analytic workflows and interact with Big Data tools and platforms
on the backend.
5. STATISTICA and Mathematica are also popular and well-regarded data
mining and analytics tools.
Answer
Free or open source tools are :
1. R and PL/R :
a. R provides a good environment for building interpretive models
and PL/R is a procedural language for PostgreSQL with R.
b. Using this approach means that R commands can be executed in
database.
c. This technique provides higher performance and is more scalable
than running R in memory.
2. Octave :
a. It is a free software programming language for computational
modeling that has some of the functionality of Matlab.
b. Octave is used in major universities when teaching machine
learning.
3. WEKA : WEKA is a free data mining software package with an analytic
workbench. The functions created in WEKA can be executed within
Java code.
4. Python : Python is a programming language that provides toolkits for
machine learning and analysis, such as numpy, scipy, pandas, and related
data visualization using matplotlib.
2 Data Analysis
CONTENTS
Part-1 : Data Analysis : ............................................. 2–2J to 2–4J
Regression Modeling,
Multivariate Analysis
PART-1
Data Analysis : Regression Modeling, Multivariate Analysis.
Questions-Answers
Answer
1. Regression models are widely used in analytics, being in general among
the easiest types of analytics techniques to understand and interpret.
2. Regression techniques allow the identification and estimation of possible
relationships between a pattern or variable of interest, and factors that
influence that pattern.
3. For example, a company may be interested in understanding the
effectiveness of its marketing strategies.
4. A regression model can be used to understand and quantify which of its
marketing activities actually drive sales, and to what extent.
5. Regression models are built to understand historical data and relationships
to assess effectiveness, as in the marketing effectiveness models.
6. Regression techniques are used across a range of industries, including
financial services, retail, telecom, pharmaceuticals, and medicine.
Answer
Various types of regression analysis techniques :
1. Linear regression : Linear regression assumes that there is a linear
relationship between the predictors (or factors) and the target
variable.
2. Non-linear regression : Non-linear regression allows modeling of
non-linear relationships.
3. Logistic regression : Logistic regression is useful when our target
variable is binomial (accept or reject).
4. Time series regression : Time series regression is used to forecast
the future behaviour of variables based on historical, time-ordered data.
Answer
Linear regression model :
1. We consider the modelling of the relationship between the dependent variable and one independent
variable. When there is only one independent variable in the regression
model, the model is generally termed a simple linear regression model.
2. Consider a simple linear regression model
y = β0 + β1X + ε
where y is termed as the dependent or study variable and X is termed as the
independent or explanatory variable. The terms β0 and β1 are the parameters of the model. The parameter β0
is termed as the intercept term, and the parameter β1 is termed as the
slope parameter.
3. These parameters are usually called regression coefficients. The
unobservable error component ε accounts for the failure of the data to lie on
the straight line and represents the difference between the true and
observed realizations of y.
4. There can be several reasons for such a difference, such as the effect of
all the variables deleted from the model, variables that may be qualitative, inherent
randomness in the observations, etc.
5. We assume that ε is an independent and identically distributed
random variable with mean zero and constant variance σ², and we assume
that ε is normally distributed.
6. The independent variable is viewed as controlled by the experimenter,
so it is considered non-stochastic, whereas y is viewed as a random
variable with
E(y) = β0 + β1X and Var(y) = σ².
7. Sometimes X can also be a random variable. In such a case, instead of
the sample mean and sample variance of y, we consider the conditional
mean of y given X = x as
E(y|x) = β0 + β1x
and the conditional variance of y given X = x as
Var(y|x) = σ².
8. When the values of β0, β1 and σ² are known, the model is completely
described. The parameters β0, β1 and σ² are generally unknown in practice
and ε is unobserved. The determination of the statistical model
y = β0 + β1X + ε depends on the determination (i.e., estimation) of β0, β1
and σ². In order to know the values of these parameters, n pairs of
observations (xi, yi), i = 1, ..., n, on (X, y) are observed/collected and are
used to determine these unknown parameters.
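A small illustrative Python sketch (with made-up observations, not from the text) of estimating β0 and β1 for the simple linear regression model above by ordinary least squares :

    import numpy as np

    # illustrative observed pairs (xi, yi), i = 1, ..., n
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # least-squares estimates of the regression coefficients
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
    b0 = y.mean() - b1 * x.mean()                                               # intercept

    residuals = y - (b0 + b1 * x)                     # estimates of the error term
    sigma2 = np.sum(residuals ** 2) / (len(x) - 2)    # estimate of the error variance
    print(b0, b1, sigma2)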
Answer
1. Multivariate analysis (MVA) is based on the principles of multivariate
statistics, which involves observation and analysis of more than one
statistical outcome variable at a time.
2. These variables are nothing but prototypes of real time situations,
products and services or decision making involving more than one
variable.
3. MVA is used to address the situations where multiple measurements
are made on each experimental unit and the relations among these
measurements and their structures are important.
4. Multiple regression analysis refers to a set of techniques for studying
the straight-line relationships among two or more variables.
5. Multiple regression estimates the β's in the equation
yj = β0 + β1x1j + β2x2j + ... + βpxpj + εj
where the x's are the independent variables and y is the dependent variable.
The subscript j represents the observation (row) number. The β's are
the unknown regression coefficients. Their estimates are represented
by b's. Each β represents the original unknown (population) parameter,
while b is an estimate of this β. The εj is the error (residual) of observation
j.
6. The regression problem is solved by least squares. In the least squares method
of regression analysis, the b's are selected so as to minimize the sum of the
squared residuals. This set of b's is not necessarily the set we want, since
it may be distorted by outliers, i.e., points that are not representative of
the data. Robust regression, an alternative to least squares, seeks to
reduce the influence of outliers.
7. Multiple regression analysis studies the relationship between a
dependent (response) variable and p independent variables (predictors,
regressors).
8. The sample multiple regression equation is
ŷj = b0 + b1x1j + b2x2j + ... + bpxpj
10. If p = 1, the model is called simple linear regression. The intercept, b0, is
the point at which the regression plane intersects the Y axis. The bi are
the slopes of the regression plane in the direction of xi. These coefficients
are called the partial regression coefficients. Each partial regression
coefficient represents the net effect the ith variable has on the dependent
variable, holding the remaining x's in the equation constant.
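A minimal sketch, with made-up numbers, of estimating the b's of the sample multiple regression equation by least squares using numpy :

    import numpy as np

    # illustrative data : n = 5 observations, p = 2 predictors
    X = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])
    y = np.array([7.0, 6.5, 13.0, 12.0, 17.5])

    # add a column of ones so that the intercept b0 is estimated as well
    X1 = np.column_stack([np.ones(len(X)), X])

    # the least-squares solution minimizes the sum of squared residuals
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    print(b)        # [b0, b1, b2]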
PART-2
Bayesian Modeling, Inference and Bayesian Networks,
Support Vector and Kernel Methods.
Questions-Answers
Answer
1. Bayesian networks are a type of probabilistic graphical model that uses
Bayesian inference for probability computations.
2. A Bayesian network is a directed acyclic graph in which each edge
corresponds to a conditional dependency, and each node corresponds to
a unique random variable.
3. Bayesian networks aim to model conditional dependence by representing
edges in a directed graph.
Fig. 2.5.1 shows an example Bayesian network with nodes Cloudy, Sprinkler, Rain and WetGrass, with the following conditional probability tables :
P(C = T) = 0.5, P(C = F) = 0.5
Sprinkler : P(S = T | C = T) = 0.1 ; P(S = T | C = F) = 0.5
Rain : P(R = T | C = T) = 0.8 ; P(R = T | C = F) = 0.2
WetGrass : P(W = T | S = T, R = T) = 0.99 ; P(W = T | S = T, R = F) = 0.9 ; P(W = T | S = F, R = T) = 0.9 ; P(W = T | S = F, R = F) = 0.0
(the probability of the complementary value is 1 minus the listed value in each case).
Answer
Inference over a Bayesian network can come in two forms.
1. First form :
a. The first is simply evaluating the joint probability of a particular
assignment of values for each variable (or a subset) in the network.
b. For this, we already have a factorized form of the joint distribution,
so we simply evaluate that product using the provided conditional
probabilities.
c. If we only care about a subset of variables, we will need to marginalize
out the ones we are not interested in.
d. In many cases, this may result in underflow, so it is common to take
the logarithm of that product, which is equivalent to adding up the
individual logarithms of each term in the product.
2. Second form :
a. In this form, inference task is to find P (x|e) or to find the probability of
some assignment of a subset of the variables (x) given assignments of
other variables (our evidence, e).
b. In the example shown in Fig. 2.6.1, we have to find
P(Sprinkler, WetGrass | Cloudy),
where {Sprinkler, WetGrass} is our x, and {Cloudy} is our e.
c. In order to calculate this, we use the fact that P(x|e) = P(x, e) / P(e)
= α P(x, e), where α is a normalization constant that we will calculate at
the end such that P(x|e) + P(¬x|e) = 1.
For the network of Fig. 2.5.1, marginalizing Rain out of the joint distribution gives
P(Sprinkler, WetGrass | Cloudy) =
[ P(WetGrass | Sprinkler, Rain) P(Sprinkler | Cloudy) P(Rain | Cloudy) P(Cloudy)
+ P(WetGrass | Sprinkler, ¬Rain) P(Sprinkler | Cloudy) P(¬Rain | Cloudy) P(Cloudy) ] / P(Cloudy)
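A brief Python sketch of this inference by enumeration, marginalizing out Rain and then dividing by P(Cloudy), using the CPT values listed with Fig. 2.5.1 :

    # CPTs of Fig. 2.5.1 (probabilities of the True value; False = 1 - True)
    P_C = 0.5
    P_S_given_C = {True: 0.1, False: 0.5}          # P(Sprinkler=T | Cloudy)
    P_R_given_C = {True: 0.8, False: 0.2}          # P(Rain=T | Cloudy)
    P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                    (False, True): 0.9, (False, False): 0.0}

    def joint(c, s, r, w):
        # P(Cloudy=c, Sprinkler=s, Rain=r, WetGrass=w) from the factorized form
        pc = P_C if c else 1 - P_C
        ps = P_S_given_C[c] if s else 1 - P_S_given_C[c]
        pr = P_R_given_C[c] if r else 1 - P_R_given_C[c]
        pw = P_W_given_SR[(s, r)] if w else 1 - P_W_given_SR[(s, r)]
        return pc * ps * pr * pw

    # P(Sprinkler=T, WetGrass=T | Cloudy=T) : sum out Rain, then divide by P(Cloudy=T)
    numerator = sum(joint(True, True, r, True) for r in (True, False))
    print(numerator / P_C)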
PART-3
Analysis of Time Series : Linear Systems Analysis and
Nonlinear Dynamics, Rule Induction.
Questions-Answers
Answer
Applications of time series analysis :
1. Retail sales :
a. For various product lines, a clothing retailer is looking to forecast
future monthly sales.
b. These forecasts need to account for the seasonal aspects of the
customer's purchasing decisions.
c. An appropriate time series model needs to account for fluctuating
demand over the calendar year.
2. Spare parts planning :
a. Companies' service organizations have to forecast future spare part
demands to ensure an adequate supply of parts to repair customers'
products.
Answer
A time series can consist of the following components :
1. Trends :
a. The trend refers to the long-term movement in a time series.
b. It indicates whether the observation values are increasing or
decreasing over time.
c. Examples of trends are a steady increase in sales month over month
or an annual decline of fatalities due to car accidents.
2. Seasonality :
a. The seasonality component describes the fixed, periodic fluctuation
in the observations over time.
b. It is often related to the calendar.
c. For example, monthly retail sales can fluctuate over the year due
to the weather and holidays.
3. Cyclic :
a. A cyclic component also refers to a periodic fluctuation, which is not
as fixed.
b. For example, retail sales are influenced by the general state of the
economy.
Answer
1. Rule induction is a data mining process of deducing if-then rules from a
dataset.
2. These symbolic decision rules explain an inherent relationship between
the attributes and class labels in the dataset.
3. Many real-life experiences are based on intuitive rule induction.
4. Rule induction provides a powerful classification approach that can be
easily understood by the general users.
5. It is used in predictive analytics by classification of unknown data.
6. Rule induction is also used to describe the patterns in the data.
7. The easiest way to extract rules from a data set is from a decision tree
that is developed on the same data set.
Answer
1. Sequential covering is an iterative procedure of extracting rules from
the data sets.
2. The sequential covering approach attempts to find all the rules in the
data set class by class.
3. One specific implementation of the sequential covering approach is called
the RIPPER, which stands for Repeated Incremental Pruning to Produce
Error Reduction.
4. Following are the steps in sequential covering rules generation approach :
Step 1 : Class selection :
a. The algorithm starts with selection of class labels one by one.
b. The rule set is class-ordered where all the rules for a class are
developed before moving on to next class.
c. The first class is usually the least-frequent class label.
d. From Fig. 2.10.1, the least frequent class is “+” and the algorithm
focuses on generating all the rules for “+” class.
Fig. 2.10.1 : Data set with two classes ('+' and '–') in two dimensions; rules r1 and r2 are successively generated to cover the '+' class.
PART-4
Neural Networks : Learning and Generalization, Competitive
Learning, Principal Component Analysis and Neural Networks.
Questions-Answers
Answer
Supervised learning :
1. Supervised learning is also known as associative learning, in which the
network is trained by providing it with input and matching output
patterns.
2. Supervised training requires the pairing of each input vector with a
target vector representing the desired output.
3. The input vector together with the corresponding target vector is called
training pair.
4. To solve a problem of supervised learning following steps are considered :
a. Determine the type of training examples.
b. Gathering of a training set.
c. Determine the input feature representation of the learned function.
d. Determine the structure of the learned function and corresponding
learning algorithm.
e. Complete the design.
5. Supervised learning can be classified into two categories :
i. Classification ii. Regression
Unsupervised learning :
1. In unsupervised learning, an output unit is trained to respond to clusters
of patterns within the input.
2. In this method of training, the input vectors of similar type are grouped
without the use of training data to specify how a typical member of each
group looks or to which group a member belongs.
3. Unsupervised training does not require a teacher; it requires certain
guidelines to form groups.
4. Unsupervised learning can be classified into two categories :
i. Clustering ii. Association
Answer
Difference between supervised and unsupervised learning :
Answer
1. Multilayer perceptron is a class of feed forward artificial neural network.
2. The multilayer perceptron model has three layers : an input layer, an output
layer, and a layer in between that is not connected directly to the input or the
output and is hence called the hidden layer.
3. For the perceptrons in the input layer, we use linear transfer function,
and for the perceptrons in the hidden layer and the output layer, we use
sigmoidal or squashed-S function.
4. The input layer serves to distribute the values they receive to the next
layer and so, does not perform a weighted sum or threshold.
5. The input-output mapping of a multilayer perceptron is shown in
Fig. 2.13.1.
Fig. 2.13.1 : A multilayer perceptron with an input layer (inputs l1, ..., ln), a hidden layer, and an output layer (outputs O1, ..., Op).
Que 2.14. Draw and explain the multiple perceptron with its
learning algorithm.
Answer
1. The perceptrons which are arranged in layers are called multilayer
(multiple) perceptron.
2. This model has three layers : an input layer, an output layer and one or
more hidden layers.
3. For the perceptrons in the input layer, the linear transfer function is used,
and for the perceptrons in the hidden layer and output layer, the sigmoidal
or squashed-S function is used. The input signal propagates through the
network in a forward direction.
4. In the multilayer perceptron bias b(n) is treated as a synaptic weight
driven by fixed input equal to + 1.
x(n) = [+1, x1(n), x2(n), ..., xm(n)]T
where n denotes the iteration step in applying the algorithm.
5. Correspondingly, we define the weight vector as :
w(n) = [b(n), w1(n), w2(n), ..., wm(n)]T
6. Accordingly, the linear combiner output is written in the compact form
v(n) = Σ (i = 0 to m) wi(n) xi(n) = wT(n) x(n)
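A tiny illustrative sketch (with made-up numbers) of the compact combiner form v(n) = wT(n) x(n), treating the bias as the weight w0 driven by a fixed +1 input, followed by the sigmoidal squashing :

    import numpy as np

    def perceptron_output(w, x):
        # v = w^T x with x[0] = +1 (bias input), followed by the squashed-S (sigmoid) activation
        v = np.dot(w, x)                  # linear combiner output
        return 1.0 / (1.0 + np.exp(-v))   # sigmoidal activation

    x = np.array([1.0, 0.5, -0.2])        # [+1, x1, x2]
    w = np.array([0.1, 0.8, -0.4])        # [b, w1, w2]
    print(perceptron_output(w, x))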
Answer
Algorithms to optimize the network size are :
1. Growing algorithms :
a. This group of algorithms begins with training a relatively small
neural architecture and allows new units and connections to be
added during the training process, when necessary.
b. Three growing algorithms are commonly applied: the upstart
algorithm, the tiling algorithm, and the cascade correlation.
c. The first two apply to binary input/output variables and networks
with step activation function.
d. The third one, which is applicable to problems with continuous
input/output variables and with units with sigmoidal activation
function, keeps adding units into the hidden layer until a satisfying
error value is reached on the training set.
2. Pruning algorithms :
a. General pruning approach consists of training a relatively large
network and gradually removing either weights or complete units
that seem not to be necessary.
b. The large initial size allows the network to learn quickly and with a
lower sensitivity to initial conditions and local minima.
c. The reduced final size helps to improve generalization.
Que 2.16. Explain the approaches for knowledge extraction from
multilayer perceptrons.
Answer
Approach for knowledge extraction from multilayer perceptrons :
a. Global approach :
1. This approach extracts a set of rules characterizing the behaviour
of the whole network in terms of input/output mapping.
2. A tree of candidate rules is defined. The node at the top of the tree
represents the most general rule and the nodes at the bottom of
the tree represent the most specific rules.
3. Each candidate symbolic rule is tested against the network's
behaviour, to see whether such a rule can apply.
4. The process of rule verification continues until most of the training
set is covered.
5. One of the problems connected with this approach is that the
number of candidate rules can become huge when the rule space
becomes more detailed.
b. Local approach :
1. This approach decomposes the original multilayer network into a
collection of smaller, usually single-layered, sub-networks, whose
input/output mapping might be easier to model in terms of symbolic
rules.
2. Based on the assumption that hidden and output units, though
sigmoidal, can be approximated by threshold functions, individual
units inside each sub-network are modeled by interpreting the
incoming weights as the antecedent of a symbolic rule.
3. The resulting symbolic rules are gradually combined together to
define a more general set of rules that describes the network as a
whole.
4. The monotonicity of the activation function is required, to limit the
number of candidate symbolic rules for each unit.
5. Local rule-extraction methods usually employ a special error
function and/or a modified learning algorithm, to encourage hidden
and output units to stay in a range consistent with possible rules
and to achieve networks with the smallest number of units and
weights.
Answer
Selection of various parameters in BPN (Back Propagation Network) :
1. Number of hidden nodes :
2. Momentum coefficient :
The weight change with the momentum term added is
[ΔW]n+1 = – η (∂E/∂W) + α [ΔW]n
where – η (∂E/∂W) is the weight change without momentum and α [ΔW]n is the momentum term.
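A short illustrative sketch of this momentum update in code; the learning rate eta, momentum factor alpha, gradient and previous weight change are made-up placeholders :

    import numpy as np

    eta, alpha = 0.1, 0.9                      # learning rate and momentum factor
    dE_dW = np.array([0.4, -0.2])              # illustrative gradient of the error w.r.t. the weights
    delta_W_prev = np.array([0.05, 0.01])      # [Delta W]^n from the previous iteration

    # [Delta W]^(n+1) = -eta * dE/dW + alpha * [Delta W]^n
    delta_W = -eta * dE_dW + alpha * delta_W_prev
    W = np.array([0.3, 0.7]) + delta_W         # apply the correction to the weights
    print(delta_W, W)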
Answer
1. Learning rate is a constant used in the learning algorithm that defines the
speed and extent of the weight matrix corrections.
2. Setting a high learning rate tends to bring instability and the system is
difficult to converge even to a near optimum solution.
3. A low value will improve stability, but will slow down convergence.
Learning function :
1. In most applications the learning rate is a simple function of time for
example L.R. = 1/(1 + t).
2. These functions have the advantage of having high values during the
first epochs, making large corrections to the weight matrix and smaller
values later, when the corrections need to be more precise.
3. Using a fuzzy controller to adaptively tune the learning rate has the
added advantage of bringing all expert knowledge in use.
4. If it was possible to manually adapt the learning rate in every epoch, we
would surely follow rules of the kind listed below :
a. If the change in error is small, then increase the learning rate.
b. If there are a lot of sign changes in error, then largely decrease the
learning rate.
c. If the change in error is small and the speed of error change is
small, then make a large increase in the learning rate.
Answer
1. Competitive learning is a form of unsupervised learning in artificial
neural networks, in which nodes compete for the right to respond to a
subset of the input data.
2. A variant of Hebbian learning, competitive learning works by increasing
the specialization of each node in the network. It is well suited to finding
clusters within data.
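A minimal winner-take-all sketch of competitive learning (the node whose weight vector is closest to the input wins and only its weights are moved towards the input); the data, number of nodes and learning rate are made up :

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((100, 2))            # illustrative input data
    W = rng.random((3, 2))              # weight vectors of 3 competing nodes
    lr = 0.1                            # learning rate

    for x in X:
        winner = np.argmin(np.linalg.norm(W - x, axis=1))   # node closest to the input wins
        W[winner] += lr * (x - W[winner])                    # only the winner is updated
    print(W)                            # each row drifts towards a cluster centre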
Answer
1. PCA is a method used to reduce the number of variables in a dataset by
extracting the important ones from a large dataset.
2. It reduces the dimension of our data with the aim of retaining as much
information as possible.
3. In other words, this method combines highly correlated variables
together to form a smaller set of artificial variables, called principal
components (PCs), which account for most of the variance in the
data.
4. A principal component can be defined as a linear combination of
optimally-weighted observed variables.
5. The first principal component retains maximum variation that was
present in the original components.
6. The principal components are the eigenvectors of a covariance matrix,
and hence they are orthogonal.
7. The output of PCA are these principal components, the number of which
is less than or equal to the number of original variables.
8. The PCs possess some useful properties which are listed below :
a. The PCs are essentially the linear combinations of the original
variables and the weights vector.
b. The PCs are orthogonal.
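A compact sketch of PCA as described above : mean-centre the data, take the eigenvectors of the covariance matrix as the principal components, and project the data onto the components with the largest eigenvalues; the dataset is randomly generated for illustration :

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))            # illustrative dataset : 200 observations, 3 variables

    Xc = X - X.mean(axis=0)                  # mean-centre each variable
    cov = np.cov(Xc, rowvar=False)           # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvectors are the principal components

    order = np.argsort(eigvals)[::-1]        # sort by explained variance, largest first
    components = eigvecs[:, order[:2]]       # keep the first two principal components
    scores = Xc @ components                 # project the data onto the PCs
    print(eigvals[order], scores.shape)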
PART-5
Fuzzy Logic : Extracting Fuzzy Models From Data, Fuzzy
Decision Trees, Stochastic Search Methods.
Questions-Answers
Que 2.21. Define fuzzy logic and its importance in our daily life.
What is role of crisp sets in fuzzy logic ?
Answer
1. Fuzzy logic is an approach to computing based on “degrees of truth”
rather than “true or false” (1 or 0).
2. Fuzzy logic includes 0 and 1 as extreme cases of truth but also includes
the various states of truth in between.
3. Fuzzy logic allows inclusion of human assessments in computing
problems.
4. It provides an effective means for conflict resolution of multiple criteria
and better assessment of options.
Importance of fuzzy logic in daily life :
1. Fuzzy logic is essential for the development of human-like capabilities
for AI.
2. It is used in the development of intelligent systems for decision making,
identification, optimization, and control.
3. Fuzzy logic is extremely useful for many people involved in research
and development including engineers, mathematicians, computer
software developers and researchers.
4. Fuzzy logic has been used in numerous applications such as facial pattern
recognition, air conditioners, vacuum cleaners, weather forecasting
systems, medical diagnosis and stock trading.
Role of crisp sets in fuzzy logic :
1. It contains the precise location of the set boundaries.
2. It provides the membership value of the set.
Que 2.22. Define classical set and fuzzy sets. State the importance
of fuzzy sets.
Answer
Classical set :
1. Classical set is a collection of distinct objects.
2. Each individual entity in a set is called a member or an element of the
set.
3. The classical set is defined in such a way that the universe of discourse
is split into two groups : members and non-members.
Fuzzy set :
1. A fuzzy set is a set whose elements have degrees of membership between 0 and 1. It is written as
A = {(x, µA(x)) | x ∈ U}
Que 2.23. Compare and contrast classical logic and fuzzy logic.
Answer
Que 2.24. Define the membership function and state its importance
in fuzzy logic. Also discuss the features of membership functions.
Answer
Membership function :
1. A membership function for a fuzzy set A on the universe of discourse X
is defined as µA : X → [0, 1], where each element of X is mapped to a value
between 0 and 1.
2. This value, called membership value or degree of membership, quantifies
the grade of membership of the element in X to the fuzzy set A.
3. Membership functions characterize fuzziness (i.e., all the information in
fuzzy set), whether the elements in fuzzy sets are discrete or continuous.
4. Membership functions can be defined as a technique to solve practical
problems by experience rather than knowledge.
5. Membership functions are represented by graphical forms.
Importance of membership function in fuzzy logic :
1. It allows us to graphically represent a fuzzy set.
2. It helps in finding different fuzzy set operation.
Features of membership function :
1. Core :
a. The core of a membership function for some fuzzy set A is defined
as that region of the universe that is characterized by complete and
full membership in the set.
b. The core comprises those elements x of the universe such that
µA(x) = 1.
2. Support :
a. The support of a membership function for some fuzzy set A is
defined as that region of the universe that is characterized by
nonzero membership in the set A.
b. The support comprises those elements x of the universe such that
µA(x) > 0.
Fig. 2.24.1. Core, support, and boundaries of a fuzzy set.
3. Boundaries : The boundaries of a membership function comprise those elements x of the universe such that 0 < µA(x) < 1, i.e., elements with a nonzero but incomplete membership in the set.
Answer
Fuzzy Inference :
1. Inference is a technique in which a goal G is derived from a given set of
facts or premises F1, F2, ..., Fn.
2. Fuzzy inference is the process of formulating the mapping from a given
input to an output using fuzzy logic.
3. The mapping then provides a basis from which decisions can be made.
4. Fuzzy inference (approximate reasoning) refers to computational
procedures used for evaluating linguistic (IF-THEN) descriptions.
5. The two important inferring procedures are :
i. Generalized Modus Ponens (GMP) :
1. GMP is formally stated as
IF x is A THEN y is B
x is A′
————
y is B′
Here A, B, A′ and B′ are fuzzy terms.
2. Every fuzzy linguistic statement above the line is analytically known
and what is below the line is analytically unknown.
Here B′ = A′ o R(x, y), where 'o' denotes the max-min composition and R(x, y) is the IF-THEN relation.
3. The membership function is
µB′(y) = max over x of min (µA′(x), µR(x, y))
ii. Generalized Modus Tollens (GMT) :
1. GMT is formally stated as
IF x is A THEN y is B
y is B′
————
x is A′
2. The membership of A′ is computed as
A′ = B′ o R(x, y)
3. In terms of the membership function,
µA′(x) = max over y of min (µB′(y), µR(x, y))
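A small numeric sketch of the max-min composition B′ = A′ o R used in GMP, with made-up membership values :

    import numpy as np

    mu_A1 = np.array([0.2, 0.8, 1.0])            # membership of A' over three x values
    R = np.array([[0.3, 0.7],                    # fuzzy relation R(x, y) over 3 x and 2 y values
                  [0.6, 0.9],
                  [1.0, 0.4]])

    # mu_B'(y) = max over x of min( mu_A'(x), R(x, y) )
    mu_B1 = np.max(np.minimum(mu_A1[:, None], R), axis=0)
    print(mu_B1)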
Answer
1. Decision trees are one of the most popular methods for learning and
reasoning from instances.
2. Given a set of n input-output training patterns D = {(Xi, yi), i = 1, ..., n},
where each training pattern Xi has been described by a set of p conditional
(or input) attributes (x1, ..., xp) and one corresponding discrete class label
yi, where yi ∈ {1, ..., q} and q is the number of classes.
3. The decision attribute yi represents a posterior knowledge regarding
the class of each pattern.
4. An arbitrary class has been indexed by l (1 ≤ l ≤ q) and each class l has
been modeled as a crisp set.
5. The membership degree of the ith value of the decision attribute yi
concerning the ith class is defined as follows :
(Figure : the fuzzy ID3 workflow. Training data is fuzzy partitioned and passed to Fuzzy ID3, which generates fuzzy classification rules; test data is then classified using product-product-sum reasoning, and the classification accuracy is obtained by comparing the actual and estimated class labels.)
Answer
1. Grid-based rule sets model each input variable through a usually small
set of linguistic values.
2. The resulting rule base uses all or a subset of all possible combinations
of these linguistic values for each variable, resulting in a global
granulation of the feature space into “tiles”:
R1,...,1 : IF x1 IS A1,1 AND ... AND xn IS A1,n THEN ...
...
Rl1,...,ln : IF x1 IS Al1,1 AND ... AND xn IS Aln,n THEN ...
Fig. 2.27.1. A global granulation of the input space using three membership functions for x1 and two for x2, resulting in tiles R1,1, ..., R2,3.
CONTENTS
Part-1 : Mining Data Streams : Introduction ........ 3–2J to 3–7J
To Stream Concepts, Stream Data
Model and Architecture
PART-1
Mining Data Streams : Introduction To Stream Concepts,
Stream Data Model and Architecture.
Questions-Answers
Answer
Fig. 3.1.1. A data stream management system : streams entering the system, a limited working storage, an archival storage, and support for ad-hoc queries.
1. A Data Stream Management System (DSMS) is a computer software
system to manage continuous data streams.
2. A DSMS also offers a flexible query processing so that the information
needed can be expressed using queries.
3. In DSMS, queries are executed continuously over the data passed to
the system, also called continuous or standing queries. These queries
are registered in the system once.
4. Depending on the system, a query can be formulated mainly in two
ways: as a declarative expression, or as a sequence or graph of data
processing operators.
Answer
(Figure : architecture of a DSMS, in which incoming data streams enter the stream manager and flow through the components described below.)
1. Data stream :
a. DSMS gets data streams as input.
b. Data stream elements are represented as tuples, which adhere to
a relational schema with attributes and values.
2. Stream manager :
a. Wrappers are provided which can receive raw data from its source,
buffer and order it by timestamp.
b. The task of stream manager is to convert the data to the format of
the data stream management system.
3. Router :
a. It helps to add tuples or data stream to the queue of the next
operator according to the query execution plan.
4. Queue manager :
a. The management of queues and their corresponding buffers is
handled by a queue manager.
b. The queue manager can also be used to swap data from the queues
to a secondary storage, if main memory is full.
5. System catalog and storage manager :
a. To enable access to data stored on disk many systems employ a
storage manager which handles access to secondary storage.
b. This is used, when persistent data is combined with data from
stream sources.
c. Also it is required when loading meta-information about, queries,
query plans, streams, inputs, and outputs.
d. These are held in a system catalog in secondary storage.
6. Scheduler :
a. Scheduler determines which operator is executed next.
b. The scheduler interacts closely with the query processor.
7. Query processor : It helps to execute the operator by interacting
with scheduler.
8. QoS monitoring :
a. Many systems also include some kind of monitor which gathers
statistics about performance, operator output rate, or output delay.
b. These statistics can be used to optimize the system execution in
several ways.
9. Query optimizer :
a. The throughput of a system can be increased by a load shedder,
which discards stream elements selected by a sampling method.
b. The load shedder can be a part of a query optimizer, a single
component, or part of the query execution plan.
c. The statistics can be used to re-optimize the current query
execution plan and reorder the operators. For this purpose a query
optimizer can be included.
Answer
Sources of data streams are :
1. Sensor data :
a. Sensor data are the data produced by sensors placed at different
places.
b. Different sensors, such as temperature sensors, GPS sensors and
others, are installed at different places for capturing the
temperature, height and other information about that particular
place.
c. The data/information produced by a sensor is a stream of real
numbers.
d. This information or data given by the sensor is stored in main
memory. These sensors send a large amount of data every tenth of
a second.
2. Image data :
a. Satellites often send down to earth streams consisting of many
terabytes of images per day.
b. Surveillance cameras produce images with lower resolution than
satellites, but there can be many of them, each producing a stream
of images at intervals like one second.
3. Internet and web traffic :
a. A switching node in the middle of the Internet receives streams of
IP packets from many inputs and routes them to its outputs.
b. The job of the switch is to transmit data and not to retain it or
query it, although there is a tendency to provide more capability in the switch.
c. Websites receive streams of various types. For example, Google
receives several hundred million search queries per day. Yahoo
accepts billions of clicks per day on its various sites.
d. Many things can be learned or extracted from streams of data.
Answer
Steps in query processing :
1. Formulation of continuous queries :
a. The formulation of queries is mostly done using declarative
languages like SQL in DBMS. Since there are no standardized
query languages to express continuous queries, there are a lot of
languages and variations.
b. However, most of them are based on SQL, such as the Continuous
Query Language (CQL) and StreamSQL.
c. The language strongly depends on the processing model.
d. In StreamSQL, a query with a sliding window over the last 10
elements looks as follows :
SELECT AVG (price) FROM examplestream [SIZE 10 ADVANCE
1 TUPLES] WHERE value > 100.0
This stream continuously calculates the average value of price of
the last 10 tuples, but only considers those tuples whose prices are
greater than 100.0.
PART-2
Stream Computing, Sampling Data in a Stream, Filtering Streams.
Questions-Answers
Answer
1. Stream computing is a computing paradigm that reads data from
collections of software or hardware sensors in stream form and
computes continuous data streams.
2. Stream computing uses software programs that compute continuous
data streams.
3. Stream computing uses software algorithm that analyzes the data in
real time.
4. Stream computing is one effective way to support Big Data by providing
extremely low-latency velocities with massively parallel processing
architectures.
5. It is becoming the fastest and most efficient way to obtain useful
knowledge from Big Data.
Answer
Bernoulli sampling :
1. A Bernoulli sampling scheme with sampling rate q ∈ (0, 1) includes
each element in the sample with probability q and excludes the element
with probability 1 – q, independently of the other elements.
2. This type of sampling is also called "binomial" sampling because the
sample size is binomially distributed, so that the probability that the
sample contains exactly k elements is equal to nCk q^k (1 – q)^(n – k).
3. The expected size of the sample is nq. It follows from the central limit
theorem for independent and identically distributed random variables that,
when n is reasonably large and q is not vanishingly small,
the deviation from the expected size is within ±100ε % with probability
close to 98 %, where ε = 2 √((1 – q)/(nq)).
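A short sketch of Bernoulli sampling of a stream with rate q, keeping each element independently with probability q :

    import random

    def bernoulli_sample(stream, q):
        # keep each element with probability q, independently of the others
        return [x for x in stream if random.random() < q]

    stream = range(10000)
    sample = bernoulli_sample(stream, q=0.1)
    print(len(sample))        # close to n*q = 1000, with a binomially distributed size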
Answer
1. A Bloom filter consists of :
a. An array of n bits, initially all 0's.
b. A collection of k hash functions h1, h2, ..., hk, each of which maps a key value to one of the n bit positions of the array.
c. A set S of m key values.
Answer
1. If a key value is in S, then the element will surely pass through the
Bloom filter.
2. However, if the key value is not in S, it might still pass.
3. We need to understand how to calculate the probability of a false
positive, as a function of n, the bit-array length, m the number of
members of S, and k, the number of hash functions.
4. The model to use is throwing darts at targets. Suppose we have x targets and y darts. Any dart is equally likely to hit any target. After throwing the darts, how many targets can we expect to be hit at least once ?
5. The analysis is as follows :
a. The probability that a given dart will not hit a given target is
(x – 1)/x.
b. The probability that none of the y darts will hit a given target is
y y
x
x 1 .We can write this expression as 1 x .
1
x x
c. Using the approximation (1– )1/ = 1/e for small , we conclude
that the probability that none of the y darts hit a given target is
e– y/x.
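A hedged Python sketch of a Bloom filter along the lines of this analysis; the way the k hash functions are derived from salted SHA-256 digests, and the numbers n, m and k below, are illustrative assumptions. The last line evaluates the false-positive estimate (1 – e^(–km/n))^k that follows from the dart argument.

import hashlib, math

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n, self.k = n_bits, k
        self.bits = [0] * n_bits
    def _positions(self, key):
        # Derive k bucket numbers from salted digests (an illustrative choice).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n
    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1
    def might_contain(self, key):
        # True for every key in S; may also be true for keys not in S.
        return all(self.bits[pos] for pos in self._positions(key))

n, m, k = 8 * 10**6, 10**6, 5
print((1 - math.exp(-k * m / n)) ** k)   # approximate false-positive rate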
PART-3
Counting Distinct Elements in a Stream, Estimating Moments,
Counting Oneness in a Window, Decaying Window.
Questions-Answers
Answer
1. Flajolet-Martin algorithm approximates the number of unique objects
in a stream or a database in one pass.
2. If the stream contains n elements with m of them unique, this algorithm
runs in O(n) time and needs O(log(m)) memory. So the real innovation
here is the memory usage, in that an exact, brute-force algorithm
would need O(m) memory.
3. It gives an approximation for the number of unique objects, along with a standard deviation σ, which can then be used to determine bounds on the approximation with a desired maximum error ε, if needed.
The Flajolet-Martin algorithm :
1. Create a bit vector (bit array) of sufficient length L, such that 2^L > n, the number of elements in the stream. Usually a 64-bit vector is sufficient since 2^64 is quite large for most purposes.
2. The i-th bit in this vector/array represents whether we have seen a hash function value whose binary representation ends in 0^i. So initialize each bit to 0.
3. Generate a good, random hash function that maps input (usually
strings) to natural numbers.
4. Read input. For each word, hash it and determine the number of
trailing zeros. If the number of trailing zeros is k, set the k-th bit in the
bit vector to 1.
5. Once input is exhausted, get the index of the first 0 in the bit array (call
this R). By the way, this is just the number of consecutive 1s plus one.
6. Calculate the number of unique words as 2^R/φ, where φ is 0.77351.
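A minimal Python sketch of the Flajolet-Martin estimate using a single MD5-based hash function (an illustrative assumption); in practice many hash functions are combined, as the next answer describes.

import hashlib

def trailing_zeros(x):
    count = 0
    while x > 0 and x & 1 == 0:
        x >>= 1
        count += 1
    return count

def flajolet_martin(stream, phi=0.77351):
    bits = 0                           # bit i set => saw a hash value ending in i zeros
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        bits |= 1 << trailing_zeros(h)
    R = 0                              # index of the first 0 bit in the array
    while bits & (1 << R):
        R += 1
    return (2 ** R) / phi

print(flajolet_martin(["a", "b", "c", "a", "b", "d"]))   # rough estimate of 4 distinct items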
Answer
Problem 1 : A problem with the Flajolet-Martin algorithm is that the results
vary significantly.
Solution :
a. A common solution has been to run the algorithm multiple times with
k different hash functions and combine the results from the different
runs.
b. One idea is to take the mean of the k results together from each hash
function, obtaining a single estimate of the cardinality.
Problem 2 : The problem with this is that averaging is very susceptible to
outliers.
Solution : A different idea is to use the median, which is less prone to be influenced by outliers.
Problem 3 : The problem with this is that the results can only take the form 2^R/φ, where R is an integer.
Solution :
a. A common solution is to combine both the mean and the median : create k × l hash functions and split them into k distinct groups (each of size l).
b. Within each group use the median for aggregating together the l results,
and finally take the mean of the k group estimates as the final estimate.
Answer
1. Estimating moments is a generalization of the problem of counting
distinct elements in a stream. The problem, called computing "moments,"
involves the distribution of frequencies of different elements in the
stream.
2. Suppose a stream consists of elements chosen from a universal set.
Assume the universal set is ordered so we can speak of the ith element
for any i.
3. Let mi be the number of occurrences of the ith element for any i. Then the kth-order moment of the stream is the sum over all i of (mi)^k.
For example :
1. The 0th moment is the sum of 1 for each mi that is greater than 0, i.e., the 0th moment is a count of the number of distinct elements in the stream.
2. The 1st moment is the sum of the mi’s, which must be the length of the
stream. Thus, first moments are especially easy to compute i.e., just
count the length of the stream seen so far.
3. The second moment is the sum of the squares of the mi’s. It is sometimes
called the surprise number, since it measures how uneven the
distribution of elements in the stream is.
4. To see the distinction, suppose we have a stream of length 100, in
which eleven different elements appear. The most even distribution of
these eleven elements would have one appearing 10 times and the
other ten appearing 9 times each.
5. In this case, the surprise number is 10^2 + 10 × 9^2 = 910. At the other extreme, one of the eleven elements could appear 90 times and the other ten appear 1 time each. Then, the surprise number would be 90^2 + 10 × 1^2 = 8110.
Answer
Alon-Matias-Szegedy algorithm for second moments :
1. Let us assume that a stream has a particular length n.
2. Suppose we do not have enough space to count all the mi’s for all the
elements of the stream.
3. We can still estimate the second moment of the stream using a limited
amount of space; the more space we use, the more accurate the estimate
will be. We compute some number of variables.
4. For each variable X, we store :
a. A particular element of the universal set, which we refer to as
X.element.
b. An integer X.value, which is the value of the variable. To determine
the value of a variable X, we choose a position in the stream
between 1 and n, uniformly and at random. Set X.element to be
the element found there, and initialize X.value to 1. As we read
the stream, add 1 to X.value each time we encounter another
occurrence of X.element.
For example :
1. Suppose the stream is a, b, c, b, d, a, c, d, a, b, d, c, a, a, b. The length of
the stream is n = 15. Since a appears 5 times, b appears 4 times, and c
and d appear three times each, the second moment for the stream is
5^2 + 4^2 + 3^2 + 3^2 = 59.
2. Suppose we used three variables, X1, X2, and X3. Also, assume that at
random we pick the 3rd, 8th, and 13th positions to define these three
variables.
3. When we reach position 3, we find element c, so we set X1.element = c
and X1.value = 1. Position 4 holds b, so we do not change X1. Similarly,
nothing happens at positions 5 or 6. At position 7, we see c again, so we
set X1.value = 2.
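A hedged Python sketch of the AMS estimate of the second moment; the per-variable estimate n(2 · X.value – 1), the averaging of the variables, and the fixed random seed are the usual textbook choices (assumptions here, since this answer is only partly reproduced), and the stream below is the one from this example.

import random

def ams_second_moment(stream, num_vars, rng=random.Random(7)):
    n = len(stream)
    positions = set(rng.sample(range(n), num_vars))   # random positions defining the variables
    variables = []                                    # each variable is [element, value]
    for t, item in enumerate(stream):
        for var in variables:
            if var[0] == item:
                var[1] += 1                           # another occurrence of X.element
        if t in positions:
            variables.append([item, 1])               # start a new variable at this position
    # Each variable contributes the estimate n * (2 * value - 1); average the estimates.
    return sum(n * (2 * value - 1) for _, value in variables) / len(variables)

stream = list("abcbdacdabdcaab")
print(ams_second_moment(stream, 3))   # the true second moment of this stream is 59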
Answer
Datar-Gionis-Indyk-Motwani (DGIM) algorithm :
1. This version of the algorithm uses O(log^2 N) bits to represent a window of N bits, and allows us to estimate the number of 1's in the window with an error of no more than 50 %.
2. To begin, each bit of the stream has a timestamp, the position in which
it arrives. The first bit has timestamp 1, the second has timestamp 2,
and so on.
3. Since we only need to distinguish positions within the window of length N, we shall represent timestamps modulo N, so they can be represented by log_2 N bits.
4. If we also store the total number of bits ever seen in the stream (i.e.,
the most recent timestamp) modulo N, then we can determine from a
timestamp modulo N where in the current window the bit with that
timestamp is.
5. We divide the window into buckets, consisting of :
a. The timestamp of its right (most recent) end.
b. The number of 1’s in the bucket. This number must be a power of
2, and we refer to the number of 1’s as the size of the bucket.
6. To represent a bucket, we need log_2 N bits to represent the timestamp (modulo N) of its right end.
7. To represent the number of 1's we only need log_2 log_2 N bits. The reason is that we know this number i is a power of 2, say 2^j, so we can represent i by coding j in binary.
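A minimal Python sketch of DGIM bucket maintenance under the representation above. For simplicity it stores full timestamps rather than timestamps modulo N; the merge rule (combine the two oldest of any three equal-sized buckets) and the estimate (sum of bucket sizes, counting only half of the oldest bucket) follow the standard description of the algorithm and are not spelled out in the text reproduced here.

class DGIM:
    # Approximate count of 1's in the last N bits of a 0/1 stream.
    def __init__(self, N):
        self.N = N
        self.t = 0                  # current timestamp
        self.buckets = []           # (timestamp of right end, size), newest first

    def add(self, bit):
        self.t += 1
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()      # oldest bucket has left the window
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                # Three buckets of one size : merge the two oldest into one of double size,
                # keeping the more recent of their right-end timestamps.
                ts, size = self.buckets[i + 1][0], self.buckets[i + 1][1] * 2
                self.buckets[i + 1] = (ts, size)
                del self.buckets[i + 2]
            else:
                i += 1

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] // 2   # count only half of the oldest bucket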
PART-4
Real-Time Analytics Platform (RTAP), Applications,
Case Studies : Real Time Sentiment Analysis, Stock
Market Predictions.
Questions-Answers
Answer
1. A real-time analytics platform enables organizations to make the most
out of real-time data by helping them to extract the valuable information
and trends from it.
2. Such platforms help in measuring data from the business point of view
in real time, further making the best use of data.
Answer
Real-time analytics platform consists of the following steps :
1. Real-time stream sources :
a. For real-time analytics, the first major need is the sources from which real-time data is obtained.
b. There are many sources of streaming data :
i. Sensor data : The sensor is the output of the device that
measures a physical quantity and transforms it into a digital
signal.
ii. Social media stream : Social media streaming like a Twitter
feed, Facebook, Instagram, Youtube, Pinterest, Tumblr.
iii. Clickstream : The stream contains data about which pages a website visitor requests and in what order.
2. Real-time stream ingestion :
a. There is a need to ingest the streams which are coming from the
real-time stream sources.
Answer
1. Sentiment analysis is a type of natural language processing for tracking
the mood of the public about a particular product. Sentiment analysis
is also known as opinion mining.
2. It collects and examines opinions about the particular product made in
blog posts, comments, or tweets.
3. Sentiment analysis can track a particular topic; many companies use it to track or observe their products and their status.
4. A basic task in sentiment analysis is categorizing the polarity of a given text at the document, sentence, or feature level, i.e., whether the expressed opinion in a document, a sentence or an entity feature is positive, negative, or neutral.
5. Sentiment classification looks at emotional states such as “angry,” “sad,” and “happy”.
Answer
Following are the major applications of sentiment analysis in real
world scenarios :
1. Reputation monitoring : Twitter and Facebook are a central point
of many sentiment analysis applications. The most common application
is to maintain the reputation of a particular brand on Twitter and/or
Facebook.
Answer
1. Data collection :
a. Consumers express their sentiments on public forums like the
blogs, discussion boards or social network sites.
b. Feelings are expressed in different ways and writing contexts, with short forms and slang, which makes the data huge and varied.
c. Manual analysis of sentiment data is virtually impossible. Therefore, special programming languages like R are used to process and analyze the data.
CONTENTS
Part-1 : Frequent Itemsets and ............................... 4–2J to 4–9J
Clustering : Mining Frequent
Itemsets, Market Based
Modelling, Apriori Algorithm
PART-1
Frequent Itemsets and Clustering : Mining Frequent Itemsets,
Market Based Modelling, Apriori Algorithm.
Questions-Answers
Answer
1. Frequent patterns are patterns (such as itemsets, subsequences, or
substructures) that appear frequently in a dataset.
2. A substructure can refer to different structural forms, such as sub-
graphs, sub-trees, or sub-lattices, which may be combined with itemsets
or subsequences.
3. If a substructure occurs frequently, it is called a (frequent) structured
pattern.
4. Finding frequent patterns plays an essential role in mining associations,
correlations, and many other interesting relationships among data.
5. It helps in data classification, clustering, and other data mining tasks.
6. Frequent pattern mining searches for recurring relationships in a given
dataset.
7. For example, a set of items, such as milk and bread that appear
frequently together in a grocery transaction dataset is a frequent itemset.
8. A subsequence, such as buying first a PC, then a digital camera, and
then a memory card, if it occurs frequently in a shopping history
database, is a (frequent) sequential pattern.
Answer
1. Frequent itemset mining leads to the discovery of associations and
correlations among items in large transactional or relational datasets.
2. With massive amounts of data continuously being collected and stored,
many industries are becoming interested in mining such patterns from
their databases.
3. The discovery of interesting correlation relationships among huge
amounts of business transaction records can help in many business
Answer
1. The market-basket model of data is used to describe a common form of
many to many relationship between two kinds of objects.
2. On the one hand, we have items, and on the other we have baskets,
sometimes called “transactions.”
3. Each basket consists of a set of items (an itemset), and usually we
assume that the number of items in a basket is small or much smaller
than the total number of items.
4. The number of baskets is usually assumed to be very large, bigger
than what can fit in main memory.
5. The data is assumed to be represented in a file consisting of a sequence
of baskets.
Answer
1. The Apriori algorithm takes a bottom-up iterative approach to find the
frequent itemsets by first determining all the possible items and then
identifying which among them are frequent.
2. Let variable Ck be the set of candidate k-itemsets and variable Lk be the set of k-itemsets that satisfy the minimum support.
3. Given a transaction database D, a minimum support threshold δ, and an optional parameter N indicating the maximum length an itemset could reach, Apriori iteratively computes frequent itemsets Lk+1 based on Lk.
Apriori algorithm :
Apriori (D, δ, N) :
1. k ← 1
2. Lk ← {1-itemsets that satisfy minimum support δ}
3. while Lk ≠ ∅
4.    if ∄N ∨ (∃N ∧ k < N)
5.       Ck+1 ← candidate itemsets generated from Lk
6.       for each transaction t in database D do
7.          increment the counts of Ck+1 contained in t
8.       Lk+1 ← candidates in Ck+1 that satisfy minimum support δ
9.       k ← k + 1
10. return ∪k Lk
4. At each iteration, the algorithm checks whether the support criterion
can be met; if it can, the algorithm grows the item set, repeating the
process until it runs out of support or until the item sets reach a
predefined length.
5. The first step of the Apriori algorithm is to identify the frequent itemsets by starting with each item in the transactions that meets the predefined minimum support threshold δ.
6. These itemsets are 1-itemsets denoted as L1, as each 1-itemset contains only one item. Next, the algorithm grows the itemsets by joining L1 onto itself to form new, grown 2-itemsets denoted as L2 and determines the support of each 2-itemset in L2. Those itemsets that do not meet the minimum support threshold are pruned away.
7. The growing and pruning process is repeated until no itemsets meet
the minimum support threshold.
8. A threshold N can be set up to specify the maximum number of items
the item set can reach or the maximum number of iterations of the
algorithm. Once completed, output of the Apriori algorithm is the
collection of all the frequent k-itemsets.
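A compact Python sketch of the Apriori loop described above. The toy baskets are illustrative, and candidate generation here simply takes any union of two frequent (k – 1)-itemsets that has size k (a simpler variant of the lexicographic join described in the next answer), followed by the usual prune step.

from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1 : frequent 1-itemsets
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= min_support}
    all_frequent, k = set(freq), 1
    while freq:
        k += 1
        # Join step : unions of frequent (k-1)-itemsets that give k-itemsets.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # Prune step : every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Count support with one pass over the transactions.
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= min_support}
        all_frequent |= freq
    return all_frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "butter"}]
print(apriori(baskets, min_support=2))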
Answer
A two-step process is followed, consisting of join and prune actions.
1. The join step :
a. To find Lk, a set of candidate k-itemsets is generated by joining
Lk – 1 with itself. This set of candidates is denoted Ck.
b. Let l1 and l2 be itemsets in Lk – 1. The notation li[j] refers to the jth
item in li (e.g., l1[k – 2] refers to the second to the last item in l1).
c. For efficient implementation, Apriori assumes that items within a
transaction or itemset are sorted in lexicographic order. For the
(k – 1)-itemset, li, this means that the items are sorted such that
li[1] < li[2] < ··· < li[k – 1].
d. The join, Lk–1 ⋈ Lk–1, is performed, where members of Lk–1 are joinable if their first (k – 2) items are in common. That is, members l1 and l2 of Lk–1 are joined if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ··· ∧ (l1[k – 2] = l2[k – 2]) ∧ (l1[k – 1] < l2[k – 1]).
e. The condition l1[k – 1] < l2[k – 1] simply ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], ..., l1[k – 2], l1[k – 1], l2[k – 1]}.
2. The prune step :
a. Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck.
b. A database scan to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to Lk).
c. Ck , however, can be huge, and so this could involve heavy
computation. To reduce the size of Ck, the Apriori property is used.
d. According to the Apriori property, any (k – 1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k – 1)-subset of a candidate k-itemset is not in Lk–1, then the candidate cannot be frequent and can be removed from Ck.
Answer
1. Once the frequent itemsets from transactions in a database D have
been found, it is straightforward to generate strong association rules
from them (where strong association rules satisfy both minimum
support and minimum confidence).
2. This can be done using equation (4.6.1) for confidence, which is shown here for completeness :
confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)    ...(4.6.1)
3. The conditional probability is expressed in terms of itemset support count, where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A.
4. Based on equation (4.6.1), association rules can be generated as follows :
a. For each frequent itemset l, generate all non-empty subsets of l.
b. For every non-empty subset s of l, output the rule s ⇒ (l – s) if support_count(l)/support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
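A small Python sketch of this rule-generation step using equation (4.6.1); the support_counts dictionary and the toy numbers are illustrative assumptions.

from itertools import combinations

def generate_rules(frequent_itemset, support_counts, min_conf):
    # Emit rules A => B with confidence = support_count(A u B) / support_count(A).
    l = frozenset(frequent_itemset)
    rules = []
    for r in range(1, len(l)):
        for antecedent in combinations(l, r):
            A = frozenset(antecedent)
            conf = support_counts[l] / support_counts[A]      # equation (4.6.1)
            if conf >= min_conf:
                rules.append((set(A), set(l - A), conf))
    return rules

counts = {frozenset({"milk"}): 2, frozenset({"bread"}): 3,
          frozenset({"milk", "bread"}): 2}
print(generate_rules({"milk", "bread"}, counts, min_conf=0.6))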
Answer
Many variations of the Apriori algorithm have been proposed that focus on
improving the efficiency of the original algorithm. Several of these variations
are as follows :
1. Hash-based technique (hashing itemsets into corresponding
buckets) :
a. A hash-based technique can be used to reduce the size of the
candidate k-itemsets, Ck, for k > 1.
b. For example, when scanning each transaction in the database to
generate the frequent 1-itemsets, L1, we can generate all the
2-itemsets for each transaction, hash (i.e., map) them into the
different buckets of a hash table structure, and increase the
corresponding bucket counts (Fig. 4.7.1).
Table 4.7.1 : Transactional data for an AllElectronics branch.
Fig. 4.7.1 : Hash table, H2, for candidate 2-itemsets, built using the hash function h(x, y) = ((order of x) × 10 + (order of y)) mod 7 :
bucket address :   0        1        2        3        4        5        6
bucket count   :   2        2        4        2        2        4        4
bucket contents : {I1,I4}  {I1,I5}  {I2,I3}  {I2,I4}  {I2,I5}  {I1,I2}  {I1,I3}
                  {I3,I5}  {I1,I5}  {I2,I3}  {I2,I4}  {I2,I5}  {I1,I2}  {I1,I3}
                                    {I2,I3}                    {I1,I2}  {I1,I3}
                                    {I2,I3}                    {I1,I2}  {I1,I3}
3. Partitioning (partitioning the data to find candidate itemsets) : A partitioning technique requires just two database scans to mine the frequent itemsets; it consists of two phases.
Phase I :
i. In phase I, the algorithm divides the transactions of D into n
non-overlapping partitions. If the minimum relative support
threshold for transactions in D is min_sup, then
Minimum support count for a partition = min_sup × the number
of transactions in that partition
ii. For each partition, all the local frequent itemsets are found.
iii. A local frequent itemset may or may not be frequent with respect
to the entire database, D. However, any itemset that is potentially
frequent with respect to D must occur as a frequent itemset in at
least one of the partitions.
iv. Therefore, all local frequent itemsets are candidate itemsets with
respect to D. The collection of frequent itemsets from all partitions
forms the global candidate itemsets with respect to D.
Phase II :
i. In phase 2, a second scan of D is conducted in which the actual
support of each candidate is assessed to determine the global
frequent itemsets.
ii. Partition size and the number of partitions are set so that each
partition can fit into main memory and therefore be read only
once in each phase.
4. Sampling (mining on a subset of the given data) :
a. The basic idea of the sampling approach is to pick a random sample
S of the given data D, and then search for frequent itemsets in S
instead of D.
b. In this way, we trade off some degree of accuracy against efficiency.
c. The S sample size is such that the search for frequent itemsets in
S can be done in main memory, and so only one scan of the
transactions in S is required overall.
d. In this technique, it is possible that we will miss some of the global
frequent itemsets.
5. Dynamic itemset counting (adding candidate itemsets at
different points during a scan) :
a. A dynamic itemset counting technique was proposed in which the
database is partitioned into blocks marked by start points.
b. In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan.
Answer
Applications of frequent itemset analysis :
a. Related concepts :
1. Let items be words, and let baskets be documents (e.g., Web pages,
blogs, tweets).
2. A basket/document contains those items/words that are present
in the document.
3. If we look for sets of words that appear together in many
documents, the sets will be dominated by the most common words
(stop words).
4. If the documents contain many stop words such as “and” and “a”, these common words will dominate the frequent itemsets.
5. However, if we ignore all the most common words, then we would
hope to find among the frequent pairs some pairs of words that
represent a joint concept.
b. Plagiarism :
1. Let the items be documents and the baskets be sentences.
2. An item is in a basket if the sentence is in the document.
3. This arrangement appears backwards, and we should remember
that the relationship between items and baskets is an arbitrary
many-many relationship.
4. In this application, we look for pairs of items that appear together
in several baskets.
5. If we find such a pair, then we have two documents that share
several sentences in common.
c. Biomarkers :
1. Let the items be of two types such as genes or blood proteins, and
diseases.
2. Each basket is the set of data about a patient: their genome and
blood-chemistry analysis, as well as their medical history of disease.
3. A frequent itemset that consists of one disease and one or more biomarkers suggests a test for the disease.
PART-2
Handling Large Data Sets in Main Memory, Limited Pass
Algorithm, Counting Frequent Itemsets in a Stream.
Questions-Answers
Que 4.9. What are the different methods for storing itemset counts in main memory ?
Answer
Different methods for storing itemset counts in main memory :
1. The triangular-matrix method :
a. Even after coding items as integers, we still have the problem that
we must count a pair {i, j} in only one place.
b. For example, we could order the pair so that i < j, and only use the
entry a[i, j] in a two-dimensional array a. That strategy would
make half the array useless.
c. A more space-efficient way is to use a one-dimensional triangular
array.
d. We store in a[k] the count for the pair {i, j}, with 1 ≤ i < j ≤ n, where k = (i – 1)(n – i/2) + j – i (a small sketch of this layout follows this answer).
e. The result of this layout is that the pairs are stored in lexicographic order, that is first {1, 2}, {1, 3}, . . ., {1, n}, then {2, 3}, {2, 4}, . . ., {2, n}, and so on, down to {n – 2, n – 1}, {n – 2, n}, and finally {n – 1, n}.
2. The triples method :
a. This is more appropriate approach to store counts that depend on
the fraction of the possible pairs of items that actually appear in
some basket.
b. We can store counts as triples [i, j, c], meaning that the count of
pair {i, j}, with i < j, is c. A data structure, such as a hash table with
i and j as the search key, is used so we can tell if there is a triple for
a given i and j and, if so, to find it quickly.
c. We call this approach the triples method of storing counts.
d. The triples method does not require us to store anything if the
count for a pair is 0.
e. On the other hand, the triples method requires us to store three
integers, rather than one, for every pair that does appear in some
basket.
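A small Python sketch of the two layouts above; item numbers are assumed to be 1-indexed integers, as in the text, and n = 5 is an illustrative value.

def triangular_index(i, j, n):
    # Position of the count for pair {i, j}, with 1 <= i < j <= n.
    return int((i - 1) * (n - i / 2) + j - i)

n = 5
counts = [0] * (n * (n - 1) // 2 + 1)      # slots 1 .. n(n-1)/2; slot 0 unused
counts[triangular_index(2, 4, n)] += 1     # count an occurrence of the pair {2, 4}

# Triples method : store only the pairs that actually occur, as {(i, j): c}.
triples = {}
triples[(2, 4)] = triples.get((2, 4), 0) + 1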
Answer
1. In the first pass of the Apriori algorithm, there may be much unused space in main memory.
2. The PCY Algorithm uses the unused space for an array of integers
that generalizes the idea of a Bloom filter. The idea is shown
schematically in Fig. 4.10.1.
Fig. 4.10.1 : Organization of main memory for the two passes of the PCY algorithm (Pass 1 : a table translating item names to integers, the item counts, and a hash table for bucket counts; Pass 2 : the translation table, the frequent items, a bitmap summarizing the bucket counts, and a data structure for counts of pairs).
8. We can define the set of candidate pairs C2 to be those pairs {i, j} such
that:
a. i and j are frequent items.
b. {i, j} hashes to a frequent bucket.
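A hedged Python sketch of PCY pass 1 and the candidate-pair test used in pass 2; the number of buckets, the toy baskets and the use of Python's built-in hash (stable within one run, which is all this sketch needs) are illustrative assumptions.

from itertools import combinations

def pcy_pass1(baskets, num_buckets, support):
    item_counts, bucket_counts = {}, [0] * num_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % num_buckets] += 1     # hash pairs into buckets
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmap = [c >= support for c in bucket_counts]           # 1 for frequent buckets
    return frequent_items, bitmap

def is_candidate_pair(pair, frequent_items, bitmap):
    i, j = pair
    return (i in frequent_items and j in frequent_items
            and bitmap[hash(tuple(sorted(pair))) % len(bitmap)])

baskets = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}]
fi, bm = pcy_pass1(baskets, num_buckets=11, support=2)
print(is_candidate_pair(("a", "b"), fi, bm))   # True : both items and their bucket are frequent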
Answer
Simple and randomized algorithm :
1. In the simple and randomized algorithm, we pick a random subset of the baskets and pretend it is the entire dataset, instead of using the entire file of baskets.
2. We must adjust the support threshold to reflect the smaller number of
baskets.
3. For instance, if the support threshold for the full dataset is s, and we
choose a sample of 1% of the baskets, then we should examine the
sample for itemsets that appear in at least s/100 of the baskets.
4. The best way to pick the sample is to read the entire dataset, and for
each basket, select that basket for the sample with some fixed probability
p.
5. Suppose there are m baskets in the entire file. At the end, we shall
have a sample whose size is very close to pm baskets.
6. However, if the baskets appear in random order in the file already,
then we do not even have to read the entire file.
7. We can select the first pm baskets for our sample. Or, if the file is part
of a distributed file system, we can pick some chunks at random to
serve as the sample.
8. Having selected our sample of the baskets, we use part of main memory
to store these baskets.
9. Remaining main memory is used to execute one of the algorithms
such as A-Priori or PCY. However, the algorithm must run passes over
the main-memory sample for each itemset size, until we find a size
with no frequent items.
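A minimal Python sketch of this idea; the mining routine passed in as mine (for example the Apriori sketch earlier in this unit) and the fixed seed are assumptions, not part of the original description.

import random

def sample_and_mine(baskets, p, s, mine, rng=random.Random(0)):
    # Pick each basket with probability p, then mine the sample with threshold p*s.
    sample = [b for b in baskets if rng.random() < p]
    return mine(sample, max(1, int(p * s)))

# e.g., sample_and_mine(all_baskets, p=0.01, s=500, mine=apriori) mines a ~1 %
# sample of the baskets with a support threshold of about 5.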
Answer
SON Algorithm :
1. The idea is to divide the input file into chunks.
2. Treat each chunk as a sample, and run the simple and randomized algorithm on that chunk, using ps as the support threshold, where p is the fraction of the whole file in the chunk and s is the overall support threshold.
Answer
1. The SON algorithm works well in a parallel-computing environment.
2. Each of the chunks can be processed in parallel, and the frequent
itemsets from each chunk combined to form the candidates.
3. We can distribute the candidates to many processors, have each
processor count the support for each candidate in a subset of the
baskets, and finally sum those supports to get the support for each
candidate itemset in the whole dataset.
4. There is a natural way of expressing each of the two passes as a
MapReduce operation.
MapReduce-MapReduce sequence :
First Map function :
a. Take the assigned subset of the baskets and find the itemsets frequent
in the subset using the simple and randomized algorithm.
b. Lower the support threshold from s to ps if each Map task gets fraction
p of the total input file.
c. The output is a set of key-value pairs (F, 1), where F is a frequent
itemset from the sample.
First Reduce Function :
a. Each Reduce task is assigned a set of keys, which are itemsets.
b. The value is ignored, and the Reduce task simply produces those keys
(itemsets) that appear one or more times. Thus, the output of the first
Reduce function is the candidate itemsets.
Second Map function :
a. The Map tasks for the second Map function take all the output from
the first Reduce Function (the candidate itemsets) and a portion of the
input data file.
b. Each Map task counts the number of occurrences of each of the
candidate itemsets among the baskets in the portion of the dataset
that it was assigned.
c. The output is a set of key-value pairs (C, v), where C is one of the
candidate sets and v is the support for that itemset among the baskets
that were input to this Map task.
Second Reduce function :
a. The Reduce tasks take the itemsets they are given as keys and sum
the associated values.
b. The result is the total support for each of the itemsets that the Reduce
task was assigned to handle.
c. Those itemsets whose sum of values is at least s are frequent in the
whole dataset, so the Reduce task outputs these itemsets with their
counts.
d. Itemsets that do not have total support at least s are not transmitted to
the output of the Reduce task.
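A hedged Python sketch of these two MapReduce rounds written as plain functions; how the baskets are chunked and the mine routine (for example the Apriori sketch above) are assumptions, not part of the original description.

def first_map(chunk, p, s, mine):
    # Mine the chunk with the lowered threshold p*s and emit (itemset, 1) pairs.
    return [(itemset, 1) for itemset in mine(chunk, max(1, int(p * s)))]

def first_reduce(pairs):
    # Keys that appear at least once become the candidate itemsets.
    return {itemset for itemset, _ in pairs}

def second_map(chunk, candidates):
    chunk = [frozenset(b) for b in chunk]
    return [(c, sum(c <= b for b in chunk)) for c in candidates]

def second_reduce(pairs, s):
    totals = {}
    for itemset, v in pairs:
        totals[itemset] = totals.get(itemset, 0) + v
    return {itemset: v for itemset, v in totals.items() if v >= s}   # frequent in the whole dataset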
Answer
1. Toivonen’s algorithm is a heuristic algorithm for finding frequent
itemsets from a given set of data.
2. For many frequent itemset algorithms, main memory is considered a
critical resource.
3. This is typically because itemset counting over large data sets results
in very large data structures that quickly begin to strain the limits of
main memory.
4. Toivonen’s algorithm presents an interesting approach to discovering
frequent itemsets in large data sets. The algorithm's deceptive simplicity
allows us to discover all frequent itemsets through a sampling process.
5. Negative border : An itemset is in the negative border if it is not
frequent in the sample, but all its immediate subsets are frequent in
the sample.
Answer
1. We assume that stream elements are baskets of items.
2. The simplest approach to maintaining a current estimate of the frequent
itemsets in a stream is to collect some number of baskets and store it as
a file.
3. Run one of the frequent-itemset algorithms, meanwhile ignoring the
stream elements that arrive, or storing them as another file to be
analyzed later.
4. When the frequent-itemsets algorithm finishes, we have an estimate
of the frequent itemsets in the stream.
5. We can use this collection of frequent itemsets for the application, but
start running another iteration of the chosen frequent-itemset
algorithm immediately. This algorithm can either :
a. Use the file that was collected while the first iteration of the
algorithm was running. At the same time, collect yet another file
to be used at another iteration of the algorithm, when this current
iteration finishes.
b. Start collecting another file of baskets, and run the algorithm
until an adequate number of baskets has been collected.
PART-3
Clustering Techniques: Hierarchical, k-means.
Questions-Answers
Answer
1. Clustering is the process of grouping a set of data objects into multiple
groups or clusters so that objects within a cluster have high similarity,
but are very dissimilar to objects in other clusters.
2. Dissimilarities and similarities are assessed based on the attribute
values describing the objects and often involve distance measures.
3. The "quality" of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster.
4. Centroid distance is an alternative measure of cluster quality and is
defined as the average distance of each cluster object from the cluster
centroid.
5. Cluster analysis or simply clustering is the process of partitioning a set
of data objects (or observations) into subsets.
6. The set of clusters resulting from a cluster analysis can be referred to
as a clustering.
7. Clustering can lead to the discovery of previously unknown groups
within the data.
8. Cluster analysis has been widely used in many applications such as
business intelligence, image pattern recognition, Web search, biology,
and security.
Answer
Following are requirements of clustering in data mining :
1. Scalability :
a. Many clustering algorithms work well on small data sets containing
fewer than several hundred data objects.
b. Clustering on only a sample of a given large data set may lead to
biased results. Therefore, highly scalable clustering algorithms
are needed.
Answer
1. A hierarchical method creates a hierarchical decomposition of the given
set of data objects.
2. A hierarchical method can be classified as :
a. Agglomerative approach :
i. The agglomerative approach, also called the bottom-up
approach, starts with each object forming a separate group.
ii. It successively merges the objects or groups close to one
another, until all the groups are merged into one (the topmost
level of the hierarchy), or a termination condition holds.
b. Divisive approach :
i. The divisive approach, also called the top-down approach,
starts with all the objects in the same cluster.
ii. In each successive iteration, a cluster is split into smaller
clusters, until eventually each object is in one cluster, or a
termination condition holds.
3. Hierarchical clustering methods can be distance-based or density- and
continuity-based.
4. Various extensions of hierarchical methods consider clustering in
subspaces.
5. Hierarchical methods suffer from the fact that once a step (merge or
split) is done, it can never be undone. This rigidity is useful in that it
leads to smaller computation costs by not having to worry about a
combinatorial number of different choices.
Answer
1. Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it divides the data into k groups such that each group must contain at least one object.
2. In other words, partitioning methods conduct one-level partitioning on
data sets. The basic partitioning methods typically adopt exclusive cluster
separation i.e., each object must belong to exactly one group.
3. Most partitioning methods are distance-based. Given k, the number of
partitions to construct, a partitioning method creates an initial
partitioning.
4. It then uses an iterative relocation technique that attempts to improve
the partitioning by moving objects from one group to another.
Answer
1. Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, ..., Ck, that is, Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k and i ≠ j.
2. An objective function is used to assess the partitioning quality so that
objects within a cluster are similar to one another but dissimilar to
objects in other clusters.
3. A centroid-based partitioning technique uses the centroid of a cluster,
Ci, to represent that cluster. Conceptually, the centroid of a cluster is
its center point. The centroid can be defined in various ways such as by
the mean of the objects (or points) assigned to the cluster.
4. The difference between an object p ∈ Ci and ci, the representative of the cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between two points x and y.
5. The quality of cluster Ci can be measured by the within-cluster
variation, which is the sum of squared error between all objects in Ci
and the centroid ci, defined as :
E = Σ (i = 1 to k) Σ (p ∈ Ci) dist(p, ci)^2
Where, E is the sum of the squared error for all objects in the data set;
p is the point in space representing a given object
ci is the centroid of cluster Ci (both p and ci are multi-dimensional)
6. In other words, for each object in each cluster, the distance from the
object to its cluster center is squared, and the distances are summed.
This objective function tries to make the resulting k clusters as compact
and as separate as possible.
Que 4.21. How does the k-means algorithm work ? Write k-means
algorithm for partitioning.
Answer
1. First, it randomly selects k of the objects in D, each of which initially
represents a cluster mean or center.
2. For each of the remaining objects, an object is assigned to the cluster to
which it is the most similar, based on the Euclidean distance between
the object and the cluster mean.
3. The k-means algorithm then iteratively improves the within-cluster
variation.
4. For each cluster, it computes the new mean using the objects assigned
to the cluster in the previous iteration.
5. All the objects are then reassigned using the updated means as the
new cluster centers.
6. The iterations continue until the assignment is stable, that is, the
clusters formed in the current round are the same as those formed in
the previous round.
Algorithm :
Input :
k : the number of clusters,
D : a data set containing n objects.
Output : A set of k clusters.
Method :
1. Arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
4. update the cluster means, that is, calculate the mean value of the
objects for each cluster;
5. until no change;
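A minimal Python sketch of this method for points represented as coordinate tuples; the random initial centers, the fixed seed and the iteration cap are illustrative choices, not part of the algorithm statement above.

import math, random

def kmeans(points, k, rng=random.Random(0), max_iter=100):
    centers = rng.sample(points, k)                        # arbitrarily choose k initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # (re)assign each object to the nearest mean
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        new_centers = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]    # update the cluster means
        if new_centers == centers:                         # stop when there is no change
            break
        centers = new_centers
    return clusters, centers

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
print(kmeans(points, 2)[1])   # the two cluster means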
Answer
Characteristics of different clustering techniques/methods are :
Characteristics of partitioning methods :
1. Find mutually exclusive clusters of spherical shape
2. Distance-based
3. May use mean or medoid to represent cluster center
4. Effective for small to medium size data sets
PART-4
Clustering High Dimensional Data, CLIQUE
and ProCLUS, Frequent Pattern Based Clustering Methods.
Questions-Answers
Que 4.23. What are the approaches for high dimensional data
clustering ?
Answer
Approaches for high dimensional data clustering are :
1. Subspace clustering :
a. Subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, and possibly overlapping, subspaces.
b. This technique is an extension of feature selection that attempts
to find clusters in different subspaces of the same dataset.
c. Subspace clustering requires a search method and evaluation
criteria.
d. It limits the scope of the evaluation criteria so as to consider
different subspaces for each different cluster.
2. Projected clustering :
a. In high-dimensional spaces, even though a good partition cannot
be defined on all the dimensions because of the sparsity of the
data, some subset of the dimensions can always be obtained on
which some subsets of data form high quality and significant
clusters.
b. Projected clustering methods are aimed to find clusters specific to
a particular group of dimensions. Each cluster may refer to
different subsets of dimensions.
c. The output of a typical projected clustering algorithm, searching
for k clusters in subspaces of dimension l, is twofold :
i. A partition of data of k + 1 different clusters, where the first
k clusters are well shaped, while the (k + 1)th cluster elements
are outliers, which by definition do not cluster well.
ii. A possibly different set of l dimensions for each of the first k
clusters, such that the points in each of those clusters are
well clustered in the subspaces defined by these vectors.
3. Biclustering :
a. Biclustering (or two-way clustering) is a methodology allowing for
feature set and data points clustering simultaneously, i.e., to find
clusters of samples possessing similar characteristics together with
features creating these similarities.
b. The output of biclustering is not a partition or hierarchy of
partitions of either rows or columns, but a partition of the whole
matrix into sub-matrices or patches.
c. The goal of biclustering is to find as many patches as possible, and
to have them as large as possible, while maintaining strong
homogeneity within patches.
Answer
1. CLIQUE is a subspace clustering method.
2. CLIQUE (CLustering In QUEst) is a simple grid-based method for
finding density based clusters in subspaces.
3. CLIQUE partitions each dimension into non-overlapping intervals,
thereby partitioning the entire embedding space of the data objects
into cells. It uses a density threshold to identify dense cells and sparse
ones.
4. A cell is dense if the number of objects mapped to it exceeds the density
threshold.
5. The main strategy behind CLIQUE for identifying a candidate search
Answer
1. Projected clustering (PROCLUS) is a top-down subspace clustering
algorithm.
2. PROCLUS samples the data and then selects a set of k-medoids and
iteratively improves the clustering.
3. PROCLUS is actually faster than CLIQUE due to the sampling of large
data sets.
Answer
Basic subspace clustering approaches are :
1. Grid-based subspace clustering :
a. In this approach, data space is divided into axis-parallel cells. Then
the cells containing objects above a predefined threshold value
given as a parameter are merged to form subspace clusters. Number
of intervals is another input parameter which defines range of
values in each grid.
b. Apriori property is used to prune non-promising cells and to
improve efficiency.
c. If a unit is found to be dense in k – 1 dimensions, then it is considered for finding dense units in k dimensions.
d. If grid boundaries are strictly followed to separate objects, the accuracy of the clustering result is decreased, as it may miss neighbouring objects which get separated by a strict grid boundary. Clustering quality is highly dependent on the input parameters.
Answer
The major tasks of clustering evaluation include the following :
1. Assessing clustering tendency :
a. In this task, for a given data set, we assess whether a non-random
structure exists in the data.
b. Blindly applying a clustering method on a data set will return
clusters; however, the clusters mined may be misleading.
c. Clustering analysis on a data set is meaningful only when there is
a nonrandom structure in the data.
2. Determining the number of clusters in a data set :
a. A few algorithms, such as k-means, require the number of clusters
in a data set as the parameter.
b. Moreover, the number of clusters can be regarded as an interesting
and important summary statistic of a data set.
c. Therefore, it is desirable to estimate this number even before a
clustering algorithm is used to derive detailed clusters.
PART-5
Clustering in Non-Euclidean Space, Clustering For
Streams and Parallelism.
Questions-Answers
Answer
1. The representation of a cluster in main memory consists of several
features.
2. Before listing these features, if p is any point in a cluster, let
ROWSUM(p) be the sum of the squares of the distances from p to each
of the other points in the cluster.
3. The following features form the representation of a cluster.
a. N, the number of points in the cluster.
b. The clustroid of the cluster, which is defined specifically to be the
point in the cluster that minimizes the sum of the squares of the
distances to the other points; that is, the clustroid is the point in
the cluster with the smallest ROWSUM.
c. The rowsum of the clustroid of the cluster.
d. For some chosen constant k, the k points of the cluster that are
closest to the clustroid, and their rowsums. These points are part
of the representation in case the addition of points to the cluster
causes the clustroid to change. The assumption is made that the
new clustroid would be one of these k points near the old clustroid.
e. The k points of the cluster that are furthest from the clustroid and
their rowsums. These points are part of the representation so that
we can consider whether two clusters are close enough to merge.
The assumption is made that if two clusters are close, then a pair
of points distant from their respective clustroids would be close.
Answer
1. The clusters are organized into a tree, and the nodes of the tree may be
very large, perhaps disk blocks or pages, as in the case of a B-tree,
which the cluster-representing tree resembles.
2. Each leaf of the tree holds as many cluster representations as can fit.
3. A cluster representation has a size that does not depend on the number
of points in the cluster.
4. An interior node of the cluster tree holds a sample of the clustroids of
the clusters represented by each of its subtrees, along with pointers to
the roots of those subtrees.
5. The samples are of fixed size, so the number of children that an interior
node may have is independent of its level.
6. As we go up the tree, the probability that a given cluster's clustroid is
part of the sample diminishes.
7. We initialize the cluster tree by taking a main-memory sample of the
dataset and clustering it hierarchically.
8. The result of this clustering is a tree T, but T is not exactly the tree
used by the GRGPF Algorithm. Rather, we select from T certain of its
nodes that represent clusters of approximately some desired size n.
9. These are the initial clusters for the GRGPF Algorithm, and we place
their representations at the leaf of the cluster-representing tree. We
then group clusters with a common ancestor in T into interior nodes of
the cluster-representing tree. In some cases, rebalancing of the cluster-
representing tree will be necessary.
Answer
1. In the BDMO algorithm, the points of the stream are partitioned into, and summarized by, buckets whose sizes are a power of two. Here, the size of a bucket is the number of points it represents, rather than the number of stream elements that are 1.
2. The sizes of buckets obey the restriction that there are one or two of each size, up to some limit. They are required only to form a sequence where each size is twice the previous size, such as 3, 6, 12, 24, . . .
3. The contents of a bucket consist of :
a. The size of the bucket.
b. The timestamp of the bucket, that is, the most recent point that
contributes to the bucket.
c. A collection of records that represent the clusters into which the
points of that bucket have been partitioned. These records contain:
i. The number of points in the cluster.
ii. The centroid or clustroid of the cluster.
iii. Any other parameters necessary to enable us to merge clusters
and maintain approximations to the full set of parameters for
the merged cluster.
CONTENTS
Part-1 : Frame Works and Visualization : ............. 5–2J to 5–4J
MapReduce, Hadoop
PART-1
Frame Works and Visualization: MapReduce, Hadoop.
Questions-Answers
Que 5.1. Write a short note on Hadoop and also write its advantages.
Answer
1. Hadoop is an open-source software framework developed for creating
scalable, reliable and distributed applications that process huge amount
of data.
2. It is an open-source distributed, batch processing, fault tolerance system
which is capable of storing huge amount of data along with processing
on the same amount of data.
Advantages of Hadoop :
1. Fast :
a. In HDFS (Hadoop Distributed File System), the data is distributed over the cluster and mapped, which helps in faster retrieval.
b. Even the tools to process the data are often on the same servers,
thus reducing the processing time.
2. Scalable : Hadoop cluster can be extended by just adding nodes in the
cluster.
3. Cost effective : Hadoop is open source and uses commodity hardware to store data, so it is really cost effective as compared to a traditional relational database management system.
4. Resilient to failure : HDFS has the property that it can replicate data over the network, so if one node is down or some other network failure happens, then Hadoop takes the other copy of the data and uses it.
5. Flexible :
a. Hadoop enables businesses to easily access new data sources and
tap into different types of data to generate value from that data.
b. It helps to derive valuable business insights from data sources such as social media, email conversations, data warehousing, fraud detection and market campaign analysis.
Answer
1. MapReduce is a framework using which we can write applications to
process huge amounts of data, in parallel, on large clusters in a reliable
manner.
2. MapReduce is a processing technique and a program model for
distributed computing based on Java.
3. The MapReduce paradigm provides the means to break a large task
into smaller tasks, run the tasks in parallel, and consolidate the outputs
of the individual tasks into the final output.
4. MapReduce consists of two basic parts :
i. Map :
a. Applies an operation to a piece of data
b. Provides some intermediate output
ii. Reduce :
a. Consolidates the intermediate outputs from the map steps
b. Provides the final output
5. In a MapReduce program, Map() and Reduce() are two functions.
a. The Map function performs actions like filtering, grouping and
sorting.
b. While Reduce function aggregates and summarizes the result
produced by Map function.
c. The result generated by the Map function is a key-value pair
(K, V) which acts as the input for Reduce function.
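A hedged Python sketch of this Map/Reduce split using the classic word-count example; the document strings and helper names are illustrative assumptions.

from collections import defaultdict

def map_fn(document):
    # Map : emit an intermediate (key, value) pair for each word.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    # Reduce : consolidate the intermediate values for one key.
    return key, sum(values)

docs = ["big data needs MapReduce", "MapReduce splits big jobs"]
pairs = [kv for d in docs for kv in map_fn(d)]
print([reduce_fn(k, v) for k, v in shuffle(pairs).items()])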
Que 5.3. What are the activities that are required for executing
MapReduce job ?
Answer
Executing a MapReduce job requires the management and
coordination of several activities :
1. MapReduce jobs need to be scheduled based on the system's workload.
2. Jobs need to be monitored and managed to ensure that any encountered
errors are properly handled so that the job continues to execute if the
system partially fails.
3. Input data needs to be spread across the cluster.
4. Map step processing of the input needs to be conducted across the
distributed system, preferably on the same machines where the data
resides.
PART-2
Pig, Hive.
Questions-Answers
Answer
Data access component of Hadoop system are :
a. Pig (Apache Pig) :
1. Apache Pig is a high-level language platform for analyzing and querying huge datasets that are stored in HDFS.
2. Apache Pig uses Pig Latin language which is similar to SQL.
3. It loads the data, applies the required filters and dumps the required
format.
4. For program execution, Pig requires Java run time environment.
5. Apache Pig consists of a data flow language and an environment
to execute the Pig code.
6. The main benefit of using Pig is to utilize the power of MapReduce
in a distributed system, while simplifying the tasks of developing
and executing a MapReduce job.
7. Pig provides for the execution of several common data manipulations, such as inner and outer joins between two or more files (tables).
b. Hive :
1. HIVE is a data warehousing component which performs reading,
writing and managing large datasets in a distributed environment
using SQL-like interface.
HIVE + SQL = HQL
2. The query language of Hive is called Hive Query Language (HQL),
which is very similar like SQL.
3. It has two basic components :
Answer
Hive architecture : The following architecture explains the flow of query submission into Hive.
(Fig. : Hive services – the Hive web UI, Hive server and CLI – pass queries to the Hive driver, which uses the metastore and runs them as MapReduce jobs over data stored in HDFS.)
Answer
Hive is used when the following conditions exist :
1. Data easily fits into a table structure.
2. Data is already in HDFS.
3. Developers are comfortable with SQL programming and queries.
4. There is a desire to partition datasets based on time.
Answer
Following are some Hive use cases :
1. Exploratory or ad-hoc analysis of HDFS data : Data can be queried,
transformed, and exported to analytical tools, such as R.
2. Extracts or data feeds to reporting systems, dashboards, or
data repositories such as HBase : Hive queries can be scheduled to
provide such periodic feeds.
3. Combining external structured data to data already residing
in HDFS :
a. Hadoop is excellent for processing unstructured data, but often
there is structured data residing in an RDBMS, such as Oracle or
SQL Server, that needs to be joined with the data residing in HDFS.
b. The data from an RDBMS can be periodically added to Hive tables
for querying with existing data in HDFS.
Que 5.9. What are the advantages and features of Apache Pig (or
Pig).
Answer
Advantage of Apache Pig :
1. Pig Latin language is easy to program.
2. It decreases the development time.
Answer
Application of Apache Pig :
1. It is used to process huge data sources like web logs, streaming online
data etc.
2. It supports Ad Hoc queries across large dataset.
3. Used to perform data processing in search platforms.
4. It is also used to process time sensitive data loads.
5. Apache Pig is generally used by data scientists for performing tasks like
ad-hoc processing and quick prototyping.
PART-3
HBase, MapR, Sharding, NoSQL Databases.
Questions-Answers
Answer
1. HBase is an open source, distributed database written in Java.
2. HBase is an essential part of Hadoop ecosystem. It runs on top of HDFS
(Hadoop Distributed File System).
3. It can store massive amounts of data from terabytes to petabytes. It is
column oriented and horizontally scalable.
HBase architecture :
(Fig. : an HBase cluster consists of clients, an HMaster, region servers and Zookeeper, with the data stored in HDFS.)
2. Region server :
a. HBase tables are divided horizontally by row key range into regions.
b. Regions are the basic building elements of HBase cluster that
consists of the distribution of tables and are comprised of column
families.
c. Region server runs on HDFS data node which is present in Hadoop
cluster.
d. Regions of region server are responsible for several things, like
handling, managing, executing as well as reads and writes HBase
operations on that set of regions. The default size of a region is
256 MB.
3. Zookeeper :
a. It is like a coordinator in HBase.
b. It provides services like maintaining configuration information,
naming, providing distributed synchronization, server failure
notification etc.
c. Clients communicate with region servers via zookeeper.
Answer
Features of HBase :
1. It is linearly scalable across various nodes as well as modularly scalable, as it is divided across various nodes.
2. HBase provides consistent read and writes.
3. It provides atomic read and write means during one read or write process,
all other processes are prevented from performing any read or write
operations.
4. It provides easy to use Java API for client access.
5. It supports Thrift and REST API for non-Java front ends which supports
XML, Protobuf and binary data encoding options.
6. It supports a Block Cache and Bloom Filters for real-time queries and
for high volume query optimization.
7. HBase provides automatic failure support between region servers.
8. It supports exporting metrics with the Hadoop metrics subsystem to files.
9. It does not enforce relationship within data.
10. It is a platform for storing and retrieving data with random access.
Answer
1. Sharding is a type of database partitioning that splits a very large database into smaller, faster and more easily managed parts.
2. A database shard is a horizontal partition of data in a database or search
engine. Each individual partition is referred to as a shard or database
shard. Each shard is held on a separate database server instance, to
spread load.
Following are the various techniques to apply sharding :
1. Use a key value to shard our data :
a. In this method, the user may use different locations to store data with the help of key-value pairs.
b. All data can be easily accessed by the key of that data.
c. It makes it easy to store data irrespective of its storage location.
2. Use load balancing to shard our data :
a. The database can take individual decisions for storing data in different locations.
b. A large shard can be further split into smaller shards, a decision the database can revise by itself.
3. Hash the key :
a. In this approach, keys are arranged by hashing their values.
b. Each document is assigned to a shard based on the hash of its key.
c. Consistent hashing assigns documents with a particular key value
to one of the servers in a hash ring.
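A minimal Python sketch of consistent hashing on a hash ring; the shard names, the MD5-based hash and the example key are illustrative assumptions.

import bisect, hashlib

def h(value):
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

class HashRing:
    # Map each key to the first shard clockwise from the key's position on the ring.
    def __init__(self, shards):
        self.ring = sorted((h(s), s) for s in shards)
        self.points = [p for p, _ in self.ring]
    def shard_for(self, key):
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["shard-1", "shard-2", "shard-3"])
print(ring.shard_for("customer:42"))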
Answer
Benefits of sharding :
1. Horizontal scaling : Horizontal scaling means adding more processing
units or physical machines to our server or database to allow for more
traffic and faster processing.
2. Response time :
a. It speeds up query response times.
b. In the sharded database, queries have to go over fewer rows and
thus we get our result sets more quickly.
Answer
1. NoSQL databases are non tabular, and store data differently than
relational tables.
2. NoSQL databases come in a variety of types based on their data model.
3. The main types are document, key-value, wide-column, and graph.
4. They provide flexible schemas and scale easily with large amounts of
data and high user loads.
5. NoSQL data models allow related data to be nested within a single data
structure.
Advantage of NoSQL database :
1. Cheap and easy to implement.
2. Data are replicated to multiple nodes and can be partitioned.
3. Easy to distribute.
4. Do not require a schema.
Answer
Benefits of NoSQL database :
1. Data models : NoSQL databases often support different data models
and are purpose built. For example, key-value databases support simple
queries very efficiently.
2. Performance : NoSQL databases can often perform better than SQL/
relational databases. For example, if we are using a document database
and are storing all the information about an object in the same
document, the database only needs to go to one place for those queries.
3. Scalability : NoSQL databases are designed to scale-out horizontally,
making it much easier to maintain performance as our workload grows
beyond the limits of a single server.
Answer
1. Document based database :
a. Document databases store data in documents similar to JSON
(JavaScript Object Notation) objects.
b. Each document contains pairs of fields and values.
c. The values can typically be of a variety of types, including strings, numbers, Booleans, arrays, or nested objects. Because of this variety of field-value types and their powerful query languages, document databases can be used as general-purpose databases.
d. They can scale out horizontally to accommodate large data volumes.
e. MongoDB, CouchDB, and Couchbase are examples of document databases.
2. Key-value based database :
a. Key-value databases are a simpler type of database where each
item contains keys and values.
b. A value can only be retrieved by referencing its key.
c. Key-value databases are great for use cases where we need to
store large amounts of data but we do not need to perform complex
queries to retrieve it.
d. Redis and DynamoDB are examples of key-value databases.
3. Wide-column based database :
a. Wide-column databases store data in tables, rows, and dynamic columns.
b. They provide a lot of flexibility over relational databases because each row is not required to have the same columns.
c. They are commonly used for storing Internet of Things data and user profile data.
d. Cassandra and HBase are examples of wide-column databases.
4. Graph based database :
a. Graph databases store data in nodes and edges; nodes hold information about entities, while edges hold the relationships between them.
b. They are commonly used where relationships must be traversed quickly, for example in social networks, fraud detection, and recommendation engines.
c. Neo4j is an example of a graph database.
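Purely as an illustration (the record, the field names, and the use of the third-party jsonlite package are assumptions), the following R sketch shows the same customer record treated as a nested document and as an opaque value in a simple in-memory key-value store.
library(jsonlite)                        # assumed JSON serializer
# Document model : related data nested within a single structure.
customer_doc <- list(
  id     = "C1001",
  name   = "Asha",
  orders = list(
    list(order_id = "O1", amount = 250.0),
    list(order_id = "O2", amount = 799.5)
  )
)
cat(toJSON(customer_doc, auto_unbox = TRUE, pretty = TRUE))
# Key-value model : the whole record is stored and fetched only by its key.
store <- new.env()                       # acts as an in-memory key-value store
assign("C1001", customer_doc, envir = store)
retrieved <- get("C1001", envir = store)
print(retrieved$name)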
PART-4
S3, Hadoop Distributed File Systems
Questions-Answers
Answer
1. Hadoop Distributed File System is the core component or the backbone
of Hadoop Ecosystem.
2. HDFS is the one, which makes it possible to store different types of
large data sets (i.e., structured, unstructured and semi structured data).
3. HDFS creates a level of abstraction over the resources, from where
we can see the whole HDFS as a single unit.
4. It helps us in storing our data across various nodes and maintaining
the log file about the stored data (metadata).
[Fig. 5.19.1 : HDFS architecture – a client writes a file, the file is split into blocks, and the blocks are replicated across data nodes placed in Rack 1 and Rack 2.]
ii. A file in the file system is divided into one or more segments, which are stored on individual data nodes. These file segments are called blocks.
iii. In other words, the minimum amount of data that HDFS can read or write is called a block.
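As a small illustrative calculation (the file size is assumed; 128 MB is the common default block size in recent Hadoop versions), the number of blocks needed for a file can be computed as follows.
file_size_mb  <- 500      # assumed file size in MB
block_size_mb <- 128      # common default HDFS block size
ceiling(file_size_mb / block_size_mb)    # 4 blocks; the last block is only partly full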
Answer
1. Amazon S3 (Simple Storage Service) is a cloud IaaS (infrastructure as
a service) solution from Amazon Web Services for object storage via a
convenient web-based interface.
2. According to Amazon, the benefits of S3 include industry-leading
scalability, data availability, security, and performance.
3. The basic storage unit of Amazon S3 is the “object”, which consists of a
file with an associated ID number and metadata.
4. These objects are stored in buckets, which function similarly to folders
or directories and which reside within the AWS region of our choice.
5. The Amazon S3 object store is the standard mechanism to store,
retrieve, and share large quantities of data in AWS.
The features of Amazon S3 are :
1. Object store model for storing, listing, and retrieving data.
PART-5
Visualization: Visual Data Analysis Techniques, Interaction
Techniques, Systems and Applications.
Questions-Answers
Answer
Data visualization :
1. Data visualization is the process of putting data into a chart, graph, or
other visual format that helps inform analysis and interpretation.
2. Data visualization is a critical tool in the data analysis process.
3. Visualization tasks can range from generating fundamental distribution
plots to understanding the interplay of complex influential variables in
machine learning algorithms.
4. Data visualization and visual data analysis can help to deal with the
flood of information.
Visual data exploration :
1. In visual data exploration, the user is directly involved in the data analysis process.
2. Visual data analysis techniques have proven to be of high value in
exploratory data analysis.
3. Visual data exploration can be seen as a hypothesis generation process;
the visualizations of the data allow the user to gain insight into the
data and come up with new hypotheses.
Answer
Approaches to integrating the human in the data exploration process lead to different kinds of visual data mining :
1. Preceding Visualization (PV) :
a. Data is visualized in some visual form before running a data-mining
(DM) algorithm.
b. By interaction with the raw data, the data analyst has full control
over the analysis in the search space.
c. Interesting patterns are discovered by exploring the data.
2. Subsequent Visualization (SV) :
a. The data-mining algorithm is applied first, and visualization is used afterwards to present its results to the analyst.
3. Tightly Integrated Visualization (TIV) :
a. The data-mining algorithm and the visualization are tightly coupled : intermediate results of the algorithm are visualized, and the user interacts with them to steer the subsequent steps of the analysis.
[Fig. : The three approaches – (i) Data → Visualization of the data → DM-algorithm → Result → Knowledge; (ii) Data → DM-algorithm → Result → Visualization of the data → Knowledge; (iii) Data → Visualization + DM-algorithm with interaction → Result → Knowledge.]
Answer
Used for :
i. Data visualization : The goal of data visualization is to communicate information clearly and efficiently to users by presenting it visually.
ii. Data analytics : It helps the business to make more-informed business decisions by analyzing the data.
Answer
The visualization techniques used may be classified as :
1. Standard 2D/3D display techniques : In standard 2D/3D display techniques, we use familiar charts such as bar charts, line charts, and x-y (scatter) plots.
Answer
The data type to be visualized may be :
1. One-dimensional data :
a. One-dimensional data usually have one dense dimension.
b. A typical example of one-dimensional data is temporal data.
c. One or multiple data values may be associated with each point in
time.
d. Examples are time series of stock prices or time series of news data.
2. Two-dimensional data :
a. A typical example of two-dimensional data is geographical data, where the two distinct dimensions are longitude and latitude.
b. A standard method for visualizing two-dimensional data is the x-y plot; maps are a special type of x-y plot for presenting two-dimensional geographical data.
c. Geographical maps are thus a common example of visualized two-dimensional data.
3. Multi-dimensional data :
a. Many data sets consist of more than three dimensions and therefore
do not allow a simple visualization as 2-dimensional or 3-dimensional
plots.
b. Examples of multi-dimensional (or multivariate) data are tables
from relational databases, which often have tens to hundreds of
columns.
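As an illustrative sketch only, a scatter-plot matrix is one standard way of viewing such multi-dimensional data; here R's built-in iris data set (four numeric columns) is used purely as an example.
data(iris)                               # built-in data set with four numeric columns
pairs(iris[, 1:4], col = iris$Species,
      main = "Pairwise views of a multi-dimensional data set")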
Answer
Advantages of data visualization are :
1. Better understanding of the data and its pattern :
a. Users can understand the flow of the data, such as increasing sales.
b. A line-chart representation of the sales report will reveal the sales growth to the manager of the sales division of any organization (see the short R sketch after this list).
2. Relevance of the hidden data like trends :
a. The data may contain some unseen patterns which can be identified
with data visualization.
b. For example, the price of a stock in the share market may rise during a particular period of time; this period can be identified using data visualization.
3. Encapsulation and abstraction of data for users :
a. The data sets are very large and are not understandable by everyone, for example by a non-technical audience such as top management; visualization presents such data in an abstract, easily digestible form.
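A minimal R sketch of the line-chart idea from point 1, using made-up monthly sales figures purely for illustration :
sales_amount <- c(120, 135, 150, 170, 160, 190)   # assumed monthly sales
plot(sales_amount, type = "l", xlab = "Month", ylab = "Sales",
     main = "Monthly sales trend")                # the rising line reveals the growth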
Answer
1. Interaction techniques allow the data analyst to directly interact with
the visualizations and dynamically change the visualizations according
to the exploration objectives.
2. In addition, they also make it possible to relate and combine multiple
independent visualizations.
Different interaction techniques are :
a. Dynamic projection :
1. Dynamic projection is an automated navigation operation.
2. The basic idea is to dynamically change the projections in order to
explore a multi-dimensional data set.
3. A well-known example is the GrandTour system which tries to
show all interesting two-dimensional projections of a multi-
dimensional data set as a series of scatter plots.
4. The sequence of projections shown can be random, manual, pre-
computed, or data driven.
5. Examples of dynamic projection techniques include XGobi, XLispStat, and ExplorN.
b. Interactive filtering :
1. Interactive filtering is a combination of selection and view
enhancement.
2. In exploring large data sets, it is important to interactively partition
the data set into segments and focus on interesting subsets.
3. This can be done by a direct selection of the desired subset (browsing)
or by a specification of properties of the desired subset (querying).
4. An example of a tool that can be used for interactive filtering is the
Magic Lens.
5. The basic idea of Magic Lens is to use a tool similar to a magnifying
glass to filter the data directly in the visualization. The data under
the magnifying glass is processed by the filter and displayed in a
different way than the remaining data set.
6. The Magic Lens shows a modified view of the selected region, while the rest of the visualization remains unaffected.
7. Examples of interactive filtering techniques include InfoCrystal, Dynamic Queries, and Polaris.
c. Zooming :
1. Zooming is a well known view modification technique that is widely
used in a number of applications.
PART-6
Introduction to R: R Graphical User Interfaces, Data Import and
Export, Attribute and Data Types.
Questions-Answers
Answer
a. R is a programming language that comes bundled with packages.
b. It is used for data processing and visualization.
c. It is a multi-functional language that provides functions for data manipulation, computation, and visualization (see the short sketch after the feature list below).
d. It can store values and perform computations on them, with the objective of assembling them into a well-organized data set.
e. It has the following features to support operations on data :
1. R has built-in functions for data handling, such as declaration and definition, and it also supports in-memory storage of data.
2. It supports operations on collections of data, such as sets and matrices.
3. Many tools are available for data analysis using R.
4. Visual output produced using R can be displayed on the screen as well as printed.
5. R is based on the 'S' programming language, whose material is available online and supports the functions of R in a simplified manner.
6. A large number of packages are available in the repository (CRAN) for the various data-processing functionalities of the R language.
7. R supports graphical illustration functions for data analysis, and the resulting graphics can be exported to external files in various formats.
8. R can support end-to-end requirements of data analytics; it can be used to rapidly develop any analysis.
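A compact sketch of the three roles mentioned in point c (manipulation, computation, and visualization), using the built-in mtcars data set purely as an illustration :
data(mtcars)                                      # built-in example data set
heavy <- subset(mtcars, wt > 3)                   # data manipulation
mean_mpg <- mean(heavy$mpg)                       # computation
print(mean_mpg)
hist(mtcars$mpg, main = "Distribution of mpg")    # visualization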
Answer
1. R software uses a command-line interface (CLI) that is similar to the
BASH shell in Linux or the interactive versions of scripting languages
such as Python.
2. UNIX and Linux users can enter the command R at the terminal prompt to use the CLI.
3. For Windows installations, R comes with RGui.exe, which provides a
basic graphical user interface (GUI).
4. However, to improve the ease of writing, executing, and debugging R code, several additional GUIs have been written for R. Popular GUIs include R Commander, Rattle, and RStudio.
5. RStudio, for example, organizes its interface into four window panes, as follows :
a. Scripts : Serves as an area to write and save R code
b. Workspace : Lists the datasets and variables in the R environment
c. Plots : Displays the plots generated by the R code and provides a
straightforward mechanism to export the plots
d. Console : Provides a history of the executed R code and the output
Answer
1. The data set is imported into R using the read.csv() function, as in the following code.
sales <- read.csv("c:/data/file_name.csv")
2. R uses a forward slash (/) as the separator character in the directory and
file paths.
3. This convention makes script files somewhat more portable, at the expense of some initial confusion for Windows users, who may be accustomed to using a backslash (\) as a separator.
4. To simplify the import of multiple files with long path names, the setwd()
function can be used to set the working directory for the subsequent
import and export operations, as shown in the following R code.
setwd("c:/data/")
sales <- read.csv("file_name.csv")
5. Other import functions include read.table() and read.delim(), which are intended to import other common file types such as TXT.
6. These functions can also be used to import the file_name.csv file as
shown in the following code :
sales_table <- read.table("file_name.csv", header = TRUE, sep = ",")
sales_delim <- read.delim("file_name.csv", sep = ",")
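7. The part title also covers export; as a minimal sketch (the output file names are illustrative, and the sales data frame imported above is assumed), base R's write.csv() and write.table() functions can be used :
write.csv(sales, "file_name_copy.csv", row.names = FALSE)
write.table(sales, "file_name_copy.txt", sep = "\t", row.names = FALSE)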
Answer
Attributes can be categorized into four types :
a. Nominal :
1. The values represent labels that distinguish one from another, for example ZIP codes or colour names.
2. There is no meaningful ordering among the values.
b. Ordinal :
1. The values imply a ranked order, for example quality ratings such as low, medium, and high, but the differences between values are not meaningful.
c. Interval :
1. The values are ordered and the differences between them are meaningful, but there is no true zero point; temperature in degrees Celsius is an example.
d. Ratio :
1. The values are ordered, differences are meaningful, and there is a true zero point, so ratios are meaningful; weight, height, and income are examples.
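As an illustrative sketch (the values are made up), nominal and ordinal attributes can be represented in R with unordered and ordered factors respectively :
zip    <- factor(c("226001", "110001", "226001"))          # nominal labels
rating <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)                            # ordinal values
print(levels(zip))
print(rating < "high")    # order comparisons are meaningful only for ordinal data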
Answer
Various data types of R are :
1. Vectors :
a. Vectors are a basic building block for data in R. Simple R variables are
vectors.
b. A vector can only consist of values in the same class.
c. Tests for vectors can be conducted using the is.vector() function.
d. R provides functionality that enables the easy creation and manipulation
of vectors.
For example :
# Create a vector.
apple <- c("red", "green", "yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
Output :
[1] "red"    "green"  "yellow"
[1] "character"
2. Lists : A list is an R-object which can contain many different types of
elements inside it like vectors, functions and even another list inside it.
For example :
# Create a list.
list1 <- list(c(2,5,3),21.3)
# Print the list.
print(list1)
Output :
[[1]]
[1] 2 5 3

[[2]]
[1] 21.3
3. Matrices :
a. A matrix is a two-dimensional rectangular data set.
b. It can be created using a vector input to the matrix function.
For example :
# Create a matrix.
M = matrix(c("a", "a", "b", "c", "b", "a"), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
Output :
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
4. Arrays :
a. Arrays can be of any number of dimensions.
b. The array function takes a dim attribute which creates the required
number of dimension.
For example :
# Create an array.
a <- array(c("green", "yellow"), dim = c(3, 3, 2))
print(a)
Output :
, , 1
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
(The second slice, , , 2, is printed next in the same format.)
5. Factors :
a. Factors are the R-objects which are created using a vector.
b. It stores the vector along with the distinct values of the elements in the
vector as labels.
c. The labels are always of character type, irrespective of whether the input vector is numeric, character, or Boolean. Factors are useful in statistical modeling.
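For example (a short illustrative sketch; the values are made up) :
# Create a factor from a character vector.
colours <- c("red", "green", "red", "yellow", "green")
factor_colours <- factor(colours)
print(factor_colours)
print(nlevels(factor_colours))   # number of distinct labels (levels)
Output :
[1] red    green  red    yellow green
Levels: green red yellow
[1] 3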
1 Introduction to Data Analytics (2 Marks Questions)
2 Data Analysis (2 Marks Questions)
4 Frequent Itemsets and Clustering (2 Marks Questions)
4.3. What is the two-step process of association rule mining ?
Ans. Association rule mining can be viewed as a two-step process :
1. Find all frequent itemsets : By definition, each of these itemsets
will occur at least as frequently as a predetermined minimum support
count.
2. Generate strong association rules from the frequent
itemsets : By definition, these rules must satisfy minimum support
and minimum confidence.
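A hedged R sketch of these two steps, assuming the third-party arules package and its bundled Groceries transactions data set (the support and confidence thresholds are illustrative) :
library(arules)                     # assumed third-party package
data("Groceries")                   # transactions data bundled with arules
# apriori() performs both steps : it finds the frequent itemsets meeting the
# minimum support and then generates rules meeting the minimum confidence.
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5))
inspect(sort(rules, by = "confidence")[1:3])   # three strongest rules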
Ans. The basic idea of visual data mining is to present the data in some
visual form, allowing the user to gain insight into the data, draw
conclusions, and directly interact with the data.