Unit 2

What is data mining? What are its applications?


Ans: Data mining is the process of discovering useful knowledge from large data
stores.
Or

It is a technique used for extracting new and useful knowledge from large data
stores for decision-making purposes.
Applications of data mining:
Business:
Modern businesses use data mining to perform market analysis to identify new
product bundles, find the root causes of manufacturing problems, prevent
customer attrition (churn) and acquire new customers, cross-sell to existing
customers, determine sales trends, etc.
Data mining also contributes to customer relationship management. Rather
than randomly contacting a prospect or customer through a call center or by
mail, a company can concentrate its efforts on the customers predicted to be
most likely to respond to an offer.
Medicine, Science and engineering:
Many satellites gather huge amounts of information on weather conditions. Data
mining may be used to answer questions such as volcano detection, earthquake
detection, and the relationship between the ocean's surface temperature and
land temperature.
Data mining is also used in microbiology; for example, it aims to find out how
changes in an individual's DNA sequence affect the risk of developing
common diseases such as cancer, which is of great importance for improving
methods of diagnosing, preventing, and treating these diseases.
Another example of data mining in science and engineering is found in
educational research, where data mining has been used to study the factors
leading students to engage in behaviors that reduce their learning, and to
understand the factors influencing university student retention.
Data mining is used in many more applications such as banking, traffic analysis,
computer network analysis, etc.

Motivating challenges of data mining


Ans: The following are the specific challenges that motivated the development
of data mining.
1) Scalability: The amount of time required to extract knowledge is
proportional to the amount of data. Data mining algorithms need to handle
huge amounts of data, so data mining techniques should be scalable.
Data mining algorithms use new data structures to handle different kinds
of data, and in some cases use out-of-core (disk-based) techniques to
handle data that does not fit in memory.
2) High dimensionality: Traditional data analysis techniques cannot handle
high-dimensional data. (Dimensions may be thought of as the attributes or
properties of the data.) It is now common to deal with data sets having
hundreds of attributes, and computational cost increases rapidly with the
number of dimensions.

3) Heterogeneous and complex data: Traditional methods often deal with
data of a single type, but complex data are frequently encountered in
today's world. For example, web pages contain semi-structured
information, and climate data contain time series for various locations on
Earth. Data mining should be able to deal with heterogeneous and complex
data by considering relationships among the data, such as parent-child
relationships.

4) Data ownership and distribution: Data may be stored in different locations
and owned by different parties. Distributed data mining deals with the
following challenges:
A. Reducing the communication cost of performing distributed mining.
B. Consolidating data mining results from various sources.
C. Security issues.

5) Nontraditional analysis: In traditional statistics, a hypothesis is
proposed, an experiment is designed to gather the data, and the data is
then analyzed with respect to the hypothesis. In data mining, by contrast,
thousands of hypotheses may need to be evaluated, so data mining
techniques are used to automate the process of hypothesis generation
and evaluation.

Origin of Data Mining

Data mining draws ideas such as:

1) Sampling, estimation, and hypothesis testing from statistics.
2) Search algorithms, modeling techniques, and learning theories from
artificial intelligence, machine learning, and pattern recognition.
3) Ideas from evolutionary computing, information theory, signal
processing, visualization, and information retrieval.
4) Storage, query processing, and indexing from database systems (DBMS).
5) Parallel computing to address the massive size of data.
6) Distributed computing, which is essential when data cannot be gathered
in a single place.
Data mining tasks
There are two categories of data mining tasks:
Predictive tasks: Predicting the value of a particular attribute based on the
values of other attributes. The attribute to be predicted is known as the target
variable, and the attributes used for making predictions are known as
explanatory variables.
Descriptive tasks: These characterize the general properties of a target class of
data in the database.

1) Predictive modeling: The process of building a model for the target variable
based on the explanatory variables. There are two types of predictive modeling
tasks: 1) classification and 2) regression.
Classification: Classification is the process of finding a set of models
that are used to predict the class of an object whose class label is unknown.
It is used for discrete target variables. Example: predicting whether a web
user will buy a product online or not.
Regression: Regression is often used to predict missing values rather
than class labels. It is used for continuous target variables. Example:
predicting the price of gold or a stock.
Example: Suppose, as sales manager of AllElectronics, you would like to
classify a large set of items in the store, based on three kinds of responses
to a sales campaign: good response, mild response, and no response. You
would like to derive a model for each of these three classes based on the
descriptive features of the items, such as price, brand, place made, type,
and category. The resulting classification should maximally distinguish
each class from the others, presenting an organized picture of the data set.
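The following minimal sketch illustrates the two predictive modeling tasks, assuming scikit-learn is installed. The tiny data sets and feature names are invented for illustration and are not from the AllElectronics example.

# Classification vs. regression on invented toy data.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: discrete target (1 = buys online, 0 = does not).
X_cls = [[25, 1], [40, 0], [30, 1], [55, 0]]   # [age, visited_promo_page]
y_cls = [1, 0, 1, 0]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[28, 1]]))                  # predicted class label

# Regression: continuous target (e.g., a stock price over time).
X_reg = [[1], [2], [3], [4]]                   # time step
y_reg = [100.0, 102.5, 105.1, 107.4]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5]]))                      # predicted continuous value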

2) Association analysis: It is used to discover strongly associated features
in data. The discovered patterns are represented in the form of association
rules. Applications of association analysis include identifying web pages
accessed together, identifying items bought together in a supermarket,
etc.

Example: Consider a supermarket billing database. Association analysis
can be used to find the sets of items bought together in the supermarket.
For example, if coke and milk are frequently bought together, this is
represented as the rule (coke → milk).
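A minimal sketch of the underlying idea, using only the standard library: count how often item pairs co-occur across transactions. Real association-rule miners (e.g., Apriori) generalize this. The transactions below are invented for illustration.

from itertools import combinations
from collections import Counter

transactions = [
    {"coke", "milk", "bread"},
    {"coke", "milk"},
    {"milk", "bread"},
    {"coke", "milk", "chips"},
]

pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# Pairs appearing in at least 3 of the 4 transactions (support >= 0.75).
for pair, count in pair_counts.items():
    if count / len(transactions) >= 0.75:
        print(pair, count)   # ('coke', 'milk') 3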
3) Cluster analysis: It is the process of grouping data into classes or clusters
so that objects within a cluster have high similarity to one another but are
very dissimilar to objects in other clusters.
Cluster analysis can be performed on AllElectronics customer data to
identify homogeneous subpopulations of customers. These clusters may
represent individual target groups for marketing. (The accompanying figure,
not reproduced here, shows a 2-D plot of customers by location in a city,
in which three clusters of data points are evident.)
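A minimal clustering sketch, assuming scikit-learn is available. The 2-D "customer location" points below are invented for illustration.

from sklearn.cluster import KMeans

locations = [
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # group near (1, 1)
    [5.0, 5.2], [5.1, 4.8], [4.9, 5.0],   # group near (5, 5)
    [9.0, 1.0], [9.2, 1.1], [8.8, 0.9],   # group near (9, 1)
]
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(locations)
print(labels)   # e.g., [0 0 0 1 1 1 2 2 2] -- three clusters recovered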
4) Anomaly detection: The detection of observations that differ significantly
from the rest of the data is known as anomaly detection; such observations
are called anomalies or outliers.

Example: Anomaly detection may uncover fraudulent usage of credit cards


by detecting purchases of extremely large amounts for a given account
number in comparison to regular charges incurred by the same account.
Outlier values may also be detected with respect to the location and type
of purchase, or the purchase frequency.
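A minimal anomaly-detection sketch using z-scores, with invented charge amounts: flag charges far from the account's mean charge (here more than 2 standard deviations, since the sample is tiny).

from statistics import mean, stdev

charges = [42.0, 35.5, 50.0, 38.0, 45.0, 41.0, 980.0]  # one suspicious charge
mu, sigma = mean(charges), stdev(charges)

outliers = [x for x in charges if abs(x - mu) / sigma > 2]
print(outliers)   # [980.0]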

Types of data sets

Data sets can be grouped into three broad types:
1) Record data: transaction data, data matrix, document-term matrix
(sparse data matrix).
2) Graph-based data: data with relationships among objects, data with
objects that are graphs.
3) Ordered data: sequential data, sequence data, time series data,
spatial data.

Record data:
Record data is a collection of records each of which consists of a fixed set of
attributes. Record data is usually stored in flat files or relational databases.
1) Transaction data: A special type of record data, where each record
(transaction) involves a set of items. For example, consider a grocery
store: the set of items purchased by a customer during one shopping
trip constitutes a transaction. This type of data is also known as
market basket data.

2) Data matrix: If the data objects all have the same fixed set of numeric
attributes, then they can be thought of as points in a multidimensional
space. Such a data set can be represented by an m-by-n matrix, with m rows
(one for each object) and n columns (one for each attribute). This type of
matrix is known as a data matrix, and standard matrix operations may be
applied to it.

3) Sparse data matrix: A special case of a data matrix in which only the
non-zero values are important. A document-term matrix is a common example:
each document is represented as a term vector, with one attribute per term.
For instance, in document 1 the word (team) may be found 3 times and the
word (play) 5 times, while most other terms do not occur at all.
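A minimal document-term matrix sketch using only the standard library: each document becomes a vector of word counts, and most entries are zero (hence "sparse"). The documents are invented for illustration.

from collections import Counter

docs = [
    "team play team win team play play play play",
    "coach play ball game score",
]
vocab = sorted({w for d in docs for w in d.split()})

matrix = []
for d in docs:
    counts = Counter(d.split())
    matrix.append([counts.get(term, 0) for term in vocab])

print(vocab)
for row in matrix:
    print(row)   # mostly zeros for larger vocabularies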
Graph based data:
1) Data with relationships among data objects: Graphs are used to
represent relationships between data objects. Each data object is a node,
and the relationships between nodes are represented by links. Consider
the example of the WWW, where web pages are linked to other pages; these
links provide useful information for searching and ranking relevant web
pages.

2) Data with objects that are graphs: If objects contain sub-objects that
have relationships among them, then the objects themselves are represented
as graphs. For example, a water molecule (the object) can be represented as
a graph whose nodes are the atoms and whose links are the chemical bonds
between them.

Ordered data

When the attributes of a data set have relationships involving order in
time or space, the data is known as ordered data.

Sequential data (temporal data): It is an extension of record data in which
each record has an associated time.
Consider the customer data maintained by an online store. Interesting
patterns can be found in such data; for example, customers who buy a mobile
phone are likely to buy a mobile case and a screen guard in the near future.

Sequence data: It is a data set that consists of a sequence of individual
entities, such as a sequence of words or letters. For example, the DNA
sequences of two people may be analyzed to determine the relationship
between them; the human genetic code is written as a sequence of the four
nucleotides A, T, G, and C.

Time series data: It is used to store a series of measurements taken over
time, for example the stock value of a company, which changes with respect
to time.

Spatial data: It contains spatially related information such as geographic
map data, VLSI chip layout data, and medical and satellite image data. A
major application is weather data, for example the climate of mountain
areas located at various altitudes. An important aspect of spatial data is
spatial autocorrelation; that is, two points on Earth that are close to
each other usually have similar temperature and rainfall.

Data Quality
Ans: The quality of data plays an important role in data mining; quality
knowledge can only be obtained from good-quality data. Data quality is
rarely perfect, for reasons such as:
• Human errors during data entry
• Hardware limitations
• Flaws in the data collection process
• Missing values or entire missing objects
• Duplicate objects
• Inconsistency
Definitions:
Error: For a continuous attribute, the numerical difference between the
measured and true value is called the error.
Measurement error: Any problem resulting from the measurement process.
Data collection error: Errors such as omitting data objects or attribute
values.
VARIOUS ASPECTS OF DATA QUALITY RELATED TO DATA MEASUREMENT
AND COLLECTION:

Noise and artifacts:

Noise is a random error or variance in a measured variable. It may involve
the distortion of a value or the addition of spurious objects. Noise is
generally associated with temporal and spatial data, and signal- or
image-processing techniques are often used to reduce it (removing noise
completely is difficult). Data mining algorithms should therefore be
robust, i.e., able to tolerate noise.

Artifacts: Deterministic distortions of data are often referred to as artifacts.
Or
A substance or structure not naturally present in the data being observed
but introduced by the measurement process; for example, a streak appearing
in the same place on a set of photographs.

Precision, bias and accuracy:

Precision: The closeness of repeated measurements (of the same quantity)
to one another. Precision is generally measured by the standard deviation
of the measurements.

Bias: A systematic variation of the measurements from the quantity being
measured. Bias is measured by the difference between the mean of the
measurements and the known true value.

Accuracy: The closeness of measurements to the true value of the quantity
being measured.

Significant digits: The digits of a measurement that are reliable given
its precision. For example, a value recorded to one decimal place implies
a precision of ±0.05.

Outliers: A data set may contain data objects that do not comply with the
general behavior or model of the data. These data objects are outliers.
Most data mining methods discard outliers as noise or exceptions; however,
in some applications, such as fraud detection, the rare events are the
more interesting ones.

Missing values

Missing values are often encountered in data sets, for example because
information was collected incompletely.
Missing values are handled in the following ways:
Eliminate data objects or attributes: Discard the affected objects. This
method is not very effective unless a tuple contains several attributes
with missing values. A related strategy is to eliminate attributes that
have missing values.
Estimate missing values: In some cases the missing values can be
estimated by using the other values. For a continuous attribute, a missing
value can be estimated by the average of the remaining values; for a
categorical attribute, by the most common value.

Ignore missing values during analysis: Data mining algorithms should be
robust and able to deal with missing values. In large data sets, a few
missing values may not affect the data mining results.
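A minimal sketch of the first two strategies, assuming pandas is installed. The small table is invented for illustration.

import pandas as pd

df = pd.DataFrame({
    "age":  [25, None, 40, 35],
    "city": ["Tanuku", "Bhimavaram", None, "Tanuku"],
})

# 1) Eliminate objects (rows) that contain missing values.
dropped = df.dropna()

# 2) Estimate missing values: mean for continuous, mode for categorical.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(dropped)
print(filled)   # age 33.33 (mean), city "Tanuku" (most common)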

Inconsistent values (conflicting):

There may be inconsistencies in the data recorded for some transactions.
For example, in personal data the PIN code may not match the name of the
city, the height of a person may be negative, or a person's height may not
match their weight. Some inconsistencies can be detected with simple
validation checks, but correcting an inconsistency may require additional
or redundant information.
Duplicate data:
A data set may include data objects that are duplicates of one another.
There are two main issues. First, two distinct records may actually
represent a single object (e.g., two different e-mail addresses for the
same person). Second, two genuinely different objects may be accidentally
merged (e.g., two people with the same name). The term deduplication is
often used to refer to the process of dealing with these issues.
Data Preprocessing
What is data preprocessing?
It is the process of converting data into an appropriate and clean form
for data mining. The input data should be of good quality, because the
quality of the knowledge produced by data mining is proportional to the
quality of the input data.

Aggregation
Ans: The combining of two or more objects into a single object is known as
aggregation. In a large data set, quantitative attributes are typically
aggregated by taking either the sum or the mean of the attribute values.

Advantages of aggregation

• Aggregation makes data smaller, and smaller data sets require less
processing time.
• It provides a higher-level view of the data, i.e., the data can be viewed
at a higher level of abstraction (as summarized data).
• Attributes at a higher level of abstraction have less variability than
attributes at a lower level of abstraction.

Consider an example: the table below (Fig 1) records day-wise sales in the
branches of a supermarket.

TID    Item                   Store location   Date       Price
1001   Colgate tooth brush    Bhimavaram       1/3/2013    30
1002   Amul Butter            Tanuku           2/4/2013    20
1003   hp Computer mouse      Bhimavaram       2/4/2013   200
1004   Pepsodent toothpaste   Tanuku           3/4/2013    50
1005   Night lamp             Tanuku           3/3/2013   100
1006   Thums up               Bhimavaram       4/4/2013    65
Fig 1

Location     Month   Sales
Bhimavaram   March   98.3
Tanuku       April   56.6
Fig 2

Fig 1 represents the original data set and Fig 2 the aggregated data set:
TID and Item have been removed, and daily sales have been aggregated into
monthly sales per location.
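A minimal aggregation sketch in the spirit of Fig 1 → Fig 2, assuming pandas is installed. Fig 2's sales figures are illustrative, so the totals computed here follow the sum rule rather than reproducing those exact numbers.

import pandas as pd

df = pd.DataFrame({
    "location": ["Bhimavaram", "Tanuku", "Bhimavaram",
                 "Tanuku", "Tanuku", "Bhimavaram"],
    "date": pd.to_datetime(["2013-03-01", "2013-04-02", "2013-04-02",
                            "2013-04-03", "2013-03-03", "2013-04-04"]),
    "price": [30, 20, 200, 50, 100, 65],
})

# Drop TID/Item, convert daily prices into monthly sales per location.
df["month"] = df["date"].dt.month_name()
monthly = df.groupby(["location", "month"])["price"].sum().reset_index(name="sales")
print(monthly)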
Sampling Technique
Ans: This technique is inherited by data mining from statistics.

It is often too expensive and time consuming to process all the data.
Sampling can be used to consider only a subset of the data instead of the
whole data set for analysis.

A sample is said to be representative if it has approximately the same
properties (of interest) as the original data set.

Sampling Approaches

There are mainly three techniques for sampling:

• Sampling without replacement (SWOR)
• Sampling with replacement (SWR)
• Stratified sampling

Sampling without replacement (SWOR): Once an item is selected from the data
set, it is removed from the original population (data set), so no item can
be sampled more than once.

Sampling with replacement (SWR): Once an item is selected from the data
set, it is placed back into the population, so the same item may be sampled
more than once (e.g., an item T4 could be drawn twice).

Stratified sampling: The population is first split into groups (strata),
and the number of objects drawn from each group is proportional to the size
of that group in the original population. For example, if the age groups in
the original population are in the ratio 4:8:2, a stratified sample might
draw them in the proportional ratio 2:4:1. A sketch of all three approaches
follows.
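A minimal sketch of the three sampling approaches, using only the standard library. The population and strata sizes are invented for illustration.

import random

random.seed(0)
population = list(range(1, 15))          # 14 data objects: T1..T14

swor = random.sample(population, 5)      # without replacement: no repeats
swr = random.choices(population, k=5)    # with replacement: repeats possible

# Stratified sampling: draw from each age group in proportion to its size.
strata = {"young": list(range(4)), "middle": list(range(8)), "old": list(range(2))}
fraction = 0.5
stratified = {name: random.sample(group, int(len(group) * fraction))
              for name, group in strata.items()}   # sizes 2, 4, 1 -> ratio 2:4:1

print(swor, swr, stratified)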
Problems with sampling:

• If the sample size is too small, important patterns may be missed; if the
sample size is too large, the advantages of sampling (less processing time
and less storage space) are lost.

Progressive sampling
The progressive sampling method is used to determine a sufficient sample
size. This approach starts with a small sample and then increases the
sample size until a sample of sufficient size has been obtained.

The appropriate sample size can be found in the following way: the accuracy
of a predictive model increases with the size of the sample, but at some
point the accuracy stops increasing; this point is known as the leveling-off
point. By building models on samples of increasing size and measuring how
close the current accuracy is to the leveling-off point, we can stop
enlarging the sample once we are sufficiently close to it.
Dimensionality reduction
Ans: Dimensionality reduction refers to reducing the number of attributes in
the data, often by creating new attributes that are combinations of the old
attributes.

Advantages of dimensionality reduction

1) Dimensionality reduction eliminates irrelevant attributes and reduces
noise in the data.
2) Many data mining algorithms work better when the data has fewer
dimensions (attributes).
3) Reducing the number of dimensions leads to a more understandable data
model.
4) Reducing the number of dimensions allows the data to be visualized more
easily.
5) The time and memory required by a data mining algorithm are reduced with
the reduction in dimensionality.
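A minimal dimensionality-reduction sketch using PCA, assuming scikit-learn and NumPy are available. Four correlated attributes, generated synthetically for illustration, are projected onto two new attributes (principal components).

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Four attributes that are really combinations of two underlying factors.
X = np.hstack([base, base @ np.array([[1.0, 0.5], [0.2, 1.0]])])

X2 = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X2.shape)   # (100, 4) -> (100, 2)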

Curse of dimensionality (disadvantages of having more dimensions in the data)

1) Data analysis becomes difficult as the dimensionality of the data
increases.
2) When the data has many dimensions, the objects become sparse in the space
they occupy, so it is very difficult to build a reliable classification
model from the available objects.
3) In the case of clustering, the definitions of density and of distance
between objects, which are critical for clustering, become less meaningful.
This makes clustering difficult.

Feature subset selection


Ans: In feature subset selection, only a subset of all the dimensions
(features) is used. This is especially useful when there are large numbers
of redundant and irrelevant dimensions in the data set.

Note:

Redundant features: contain the same information (same values) in more than
one dimension.

Irrelevant features: dimensions that are irrelevant to the data mining task.

Techniques for eliminating irrelevant and redundant features:

1) Common sense: Some irrelevant and redundant dimensions can be removed
using common sense or sound domain knowledge.

2) Brute-force approach: Try all possible feature subsets as input to the
data mining algorithm and select the subset that produces the best results.
This method does not scale when the number of attributes (features) is
large, since n attributes have 2^n subsets.

3) Embedded approaches: Feature selection occurs naturally as part of the
data mining algorithm; the algorithm itself decides which features to use
and which to ignore.

4) Filter approaches: Features are selected before the data mining algorithm
is run, using some approach that is independent of the data mining task.

5) Wrapper approaches: The data mining algorithm itself is used as a black
box to evaluate candidate subsets and determine the best attribute subset.

Feature Creation
Ans: Feature creation is a process of creating new set of attributes that can
capture information more efficiently than the original attributes.

The three methodologies for creating new attributes are:


1) Feature extraction.
2) Mapping the data to a new space
3) Feature construction.

Feature extraction: The creation of new features from the original raw data
is known as feature extraction. This method is widely used in image
processing. Consider a set of images stored as pixels that we want to
classify according to whether or not they contain human faces. If new
attributes are created that capture information about certain edges and
colors, these attributes help us classify the images much better.

Mapping the data to a new space: This is the process of viewing the data
from a different perspective to reveal important and interesting features
in it. The method is mainly used for time series data. For example, given
two noisy time series, applying a Fourier transform maps the time-domain
data into frequency information, where the dominant frequencies stand out
clearly despite the noise.

Feature construction: Sometimes the attributes in the original data set
have the necessary information, but in a form unsuitable for the data
mining algorithm. In this situation, new attributes that are more suitable
are constructed from the original attributes.
Example: Consider an original data set containing the mass and volume of
various metals. In this case it is more meaningful to construct a density
attribute (density = mass/volume) than to use mass and volume directly.

Metal     Mass    Volume
Metal A   16.23   15.68
Metal B   17.89   14.34
Metal C   18.33   13.67
Fig A

Metal     Density
Metal A   1.0350
Metal B   1.2475
Metal C   1.3408
Fig B
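A minimal feature-construction sketch, assuming pandas is installed: derive the density attribute of Fig B from the mass and volume of Fig A.

import pandas as pd

df = pd.DataFrame({
    "metal": ["Metal A", "Metal B", "Metal C"],
    "mass": [16.23, 17.89, 18.33],
    "volume": [15.68, 14.34, 13.67],
})
df["density"] = (df["mass"] / df["volume"]).round(4)
print(df[["metal", "density"]])   # ~1.0351, 1.2476, 1.3409 (values of Fig B)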
Discretization and Binarization
Binarization: The process of converting continuous and discrete attributes
into binary attributes is known as binarization.
There are two techniques for binarization:
1) If there are m categorical values, uniquely assign each value to an
integer in the interval [0, m-1], then convert each of these integers
into binary digits; ceil(log2(m)) binary attributes are required.

In the example below there are 5 categorical values (awful, poor, OK,
good, great). Each value is uniquely assigned to an integer (awful = 0,
poor = 1, OK = 2, good = 3, great = 4), and these integers are then
converted into binary numbers using ceil(log2(5)) = 3 binary attributes
x1, x2, x3.

Categorical value   Integer value   x1   x2   x3
Awful               0               0    0    0
Poor                1               0    0    1
OK                  2               0    1    0
Good                3               0    1    1
Great               4               1    0    0

Note: If the categorical values are ordinal, the integer assignment must
preserve their order.

2) In this technique, one binary attribute is introduced per categorical
value, and only the presence of a value is encoded (one-hot encoding).
Here the number of binary attributes equals the number of categorical
values.

Categorical value   Integer value   x1   x2   x3   x4   x5
Awful               0               1    0    0    0    0
Poor                1               0    1    0    0    0
OK                  2               0    0    1    0    0
Good                3               0    0    0    1    0
Great               4               0    0    0    0    1
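A minimal sketch of both binarization techniques, using only the standard library, for the five ordinal values used above.

values = ["awful", "poor", "ok", "good", "great"]
to_int = {v: i for i, v in enumerate(values)}          # awful=0 ... great=4

# Technique 1: encode each integer with ceil(log2(5)) = 3 binary digits.
for v in values:
    print(v, format(to_int[v], "03b"))                 # e.g., great -> 100

# Technique 2: one binary attribute per value (one-hot encoding).
for v in values:
    one_hot = [1 if i == to_int[v] else 0 for i in range(len(values))]
    print(v, one_hot)                                  # e.g., ok -> [0, 0, 1, 0, 0]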

Discretization:
The process of converting continuous attributes into categorical attributes
is known as discretization. It is commonly applied to data used for
classification and association analysis. Discretization involves two
subtasks:
1) Deciding how many categorical values (intervals) to use.
2) Deciding how to map the continuous values to these categorical values.
For example, consider a student table sorted by percentage:

Student name   Percentage   Grade
A              45.8         Second class
B              47.9         Second class
C              66.7         First class
D              65.6         First class
E              62.5         First class
F              77.8         Distinction
G              80.6         Distinction

There are two types of discretization: 1) unsupervised discretization and
2) supervised discretization.

Unsupervised discretization: In this type of discretization, class
information is not used when converting continuous attributes into
categorical attributes. Instead, the values are discretized using
techniques such as 1) equal width, 2) equal depth (equal frequency), and
3) K-means.

Equal width: It divides the range of attribute values into intervals of
equal width.
Consider the set of attribute values (2, 3, 4, 9, 8, 15, 16, 23, 26, 28, 21, 22):
Bin 1 (1-10): 2, 3, 4, 9, 8
Bin 2 (11-20): 15, 16
Bin 3 (21-30): 23, 26, 28, 21, 22

Equal depth: It divides the attribute values into bins containing equal
numbers of values. First arrange the values in order
(2, 3, 4, 8, 9, 15, 16, 21, 22, 23, 26, 28):
Bin 1: 2, 3, 4, 8
Bin 2: 9, 15, 16, 21
Bin 3: 22, 23, 26, 28

(In the textbook's figure comparing these techniques on data containing
outliers, shown as dark dots, K-means performs best.)
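A minimal sketch of equal-width and equal-depth binning for the values used above, using only the standard library.

values = [2, 3, 4, 9, 8, 15, 16, 23, 26, 28, 21, 22]

# Equal width: three bins of width 10 over the range 1..30.
width_bins = {1: [], 2: [], 3: []}
for v in values:
    width_bins[(v - 1) // 10 + 1].append(v)
print(width_bins)   # {1: [2,3,4,9,8], 2: [15,16], 3: [23,26,28,21,22]}

# Equal depth: sort, then split into three bins of four values each.
s = sorted(values)
depth_bins = [s[i:i + 4] for i in range(0, len(s), 4)]
print(depth_bins)   # [[2,3,4,8], [9,15,16,21], [22,23,26,28]]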
Supervised discretization:

Discretization methods that use class labels are known as supervised
discretization. Supervised discretization generally produces better results
than unsupervised discretization.
Entropy is a commonly used measure for supervised discretization. The
entropy of the ith interval is

ei = - Σj=1..k pij log2 pij

where k is the number of class labels, mi is the number of values in the
ith interval, mij is the number of values of class j in the ith interval,
and pij = mij/mi is the fraction of values of class j in the ith interval.
The total entropy is the weighted average of the interval entropies:

e = Σi=1..n wi ei

where wi = mi/m is the fraction of the m values that fall in the ith
interval and n is the number of intervals.
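A minimal sketch of the entropy computation above: given the class labels falling in each candidate interval, compute ei per interval and the weighted total entropy e. The labels are invented for illustration.

from collections import Counter
from math import log2

intervals = [
    ["pass", "pass", "pass", "fail"],    # interval 1
    ["fail", "fail", "pass"],            # interval 2
]

m = sum(len(iv) for iv in intervals)
total = 0.0
for iv in intervals:
    counts = Counter(iv)
    e_i = -sum((c / len(iv)) * log2(c / len(iv)) for c in counts.values())
    total += (len(iv) / m) * e_i         # weight w_i = m_i / m
    print(round(e_i, 3))                 # 0.811, 0.918

print(round(total, 3))                   # total entropy ~ 0.857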

Variable transformation or attribute transformation

A transformation that is applied to all the values of an attribute is known
as a variable transformation. This technique is mainly used when the
attribute values are very large, or when only the magnitude of a value
matters.

Uses of variable transformation

1) It is used to transform larger values into smaller values.
2) It is used to reduce the dependency of the values on their units of
measurement.
3) It is also used when the magnitude of a value is more important than the
value itself.

Simple functions

Here a simple mathematical function is applied to each value of the variable
(attribute) individually. For example, a variable x may be transformed by
taking √x, e^x, 1/x, sin x, |x|, or log x.

Weight (x)   Transformed weight (√x)
1000         31.62
100          10
350          18.71
245          15.65

This technique is mainly used when the range of values is very large.
Note: In some cases a transformation changes the nature of the variable.
For example, if the values {1, 2, 3} are transformed into {1/1, 1/2, 1/3},
the order reverses: 1 < 2 < 3, but 1/1 > 1/2 > 1/3.

Normalization or standardization

Here statistics such as the mean and standard deviation are used. These
methods are mainly used to reduce the dependency of values on their units.
For example, people's weights can be measured in kilograms or in pounds;
to compare weights irrespective of the unit of measurement, these
techniques are helpful.
If x̄ is the mean of the attribute values and sx their standard deviation,
then

x' = (x - x̄) / sx

creates a new, standardized variable.
The mean and standard deviation are strongly affected by outliers; in that
case the median and the absolute standard deviation can be used instead:

absolute standard deviation = Σi=1..m |xi - μ|

where xi is a value of the variable, μ is the median, and m is the number
of values of the attribute.
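A minimal standardization sketch using only the standard library: z-scores make the weights comparable regardless of the original unit, since a linear change of units cancels out.

from statistics import mean, stdev

weights_kg = [60.0, 72.0, 85.0, 68.0]
mu, s = mean(weights_kg), stdev(weights_kg)
z = [(x - mu) / s for x in weights_kg]
print([round(v, 2) for v in z])   # identical z-scores would result from pounds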
Measuring Data Similarity and Dissimilarity
In data mining applications, such as clustering, outlier analysis, and nearest-
neighbor classification, we need ways to assess how alike or unalike objects
are in comparison to one another.
For example, a store may want to search for clusters of customer objects,
resulting in groups of customers with similar characteristics (e.g., similar
income, area of residence, and age).

A similarity measure for two objects, i and j, will typically return the
value 0 if the objects are unalike; the more alike the objects are, the
higher the value.

A dissimilarity measure works the opposite way. It returns a value of 0 if the


objects are the same (and therefore, far from being dissimilar). The higher the
dissimilarity value, the more dissimilar the two objects are.

Note: Proximity refers to a similarity or dissimilarity

1) Data Matrix versus Dissimilarity Matrix


Data matrix (or object-by-attribute structure): This structure stores the n
data objects in the form of a relational table, or n-by-p matrix (n objects
× p attributes).

𝑥11 ⋯ 𝑥1𝑝
[ ⋮ ⋱ ⋮ ]
𝑥𝑛1 ⋯ 𝑥𝑛𝑝

A data matrix is made up of two entities or "things," namely rows (for
objects) and columns (for attributes). Therefore, the data matrix is often
called a two-mode matrix.

Dissimilarity matrix (or object-by-object structure): This structure stores
a collection of proximities that are available for all pairs of the n
objects. It is often represented by an n-by-n table.

0
d(2,1)   0
d(3,1)   d(3,2)   0
  ...      ...    ...
d(n,1)   d(n,2)   ...   0

For example, d(2,1) represents the dissimilarity (distance) between the 2nd
object and the 1st object. The bigger the value of d(2,1), the larger the
difference between them.

The dissimilarity matrix contains one kind of entity (dissimilarities) and so is


called a one-mode matrix.
2) Proximity Measures for Nominal Attributes
A nominal attribute can take on two or more states. For example, car color is
a nominal attribute that may have, say, five states: red, yellow, green, pink,
and blue.

The proximity (dissimilarity) is calculated using the following formula:

d(i, j) = (p - m) / p

where m is the total number of attributes on which objects i and j match,
and p is the total number of attributes.

Example:

Roll No Marks Grades


1 96 A
2 87 B
3 83 B
4 96 A

Now we apply the above formula to find the proximity of the nominal
attributes.

Note: p is the total number of attributes; in our case p = 2 (Marks, Grades).

d(1,1) = (p-m)/p = (2-2)/2 = 0
d(2,1) = (p-m)/p = (2-0)/2 = 1
d(3,1) = (p-m)/p = (2-0)/2 = 1
d(4,1) = (p-m)/p = (2-2)/2 = 0
d(2,2) = (p-m)/p = (2-2)/2 = 0
d(3,2) = (p-m)/p = (2-1)/2 = 0.5   (one attribute, Grades, matches, so m = 1)
d(4,2) = (p-m)/p = (2-0)/2 = 1
d(3,3) = (p-m)/p = (2-2)/2 = 0
d(4,3) = (p-m)/p = (2-0)/2 = 1
d(4,4) = (p-m)/p = (2-2)/2 = 0

As seen from the calculation, the dissimilarity between an object and
itself is 0, which is intuitively correct.

The proximity matrix is

0
d(2,1)   0
d(3,1)   d(3,2)   0
d(4,1)   d(4,2)   d(4,3)   0

Substituting the values:

0
1     0
1     0.5   0
0     1     1     0
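A minimal sketch of the nominal dissimilarity d(i, j) = (p - m)/p for the Roll No table above, using only the standard library.

objects = [(96, "A"), (87, "B"), (83, "B"), (96, "A")]
p = 2  # number of attributes: Marks, Grades

def d(i, j):
    m = sum(a == b for a, b in zip(objects[i], objects[j]))  # matching attributes
    return (p - m) / p

for i in range(len(objects)):
    print([d(i, j) for j in range(i + 1)])
# [0.0], [1.0, 0.0], [1.0, 0.5, 0.0], [0.0, 1.0, 1.0, 0.0]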

3) Proximity Measures for Binary Attributes


Similarity Measures for Binary Data are called similarity coefficients and
typically have values between 0 and 1. The comparison between two binary
objects is done using the following four quantities:

Name Test 1 Test 2 Test 3 Test 4


Suresh 1 0 0 0
Ramesh 1 0 1 0
Rajesh 0 0 0 0

M11= is the number of attributes that equal 1 for both objects i and j,
M10= is the number of attributes that equal 1 for object i but equal 0 for object j
M01= is the number of attributes that equal 0 for object i but equal 1 for object j,
M00= is the number of attributes that equal 0 for both objects i and j.

a) Dissimilarity between i and j in the case of binary symmetric attributes:

d(i, j) = (M10 + M01) / (M11 + M10 + M01 + M00)

Let's calculate the dissimilarity between Ramesh, Suresh and Rajesh for
binary symmetric attributes:

d(Ramesh, Suresh) = (0+1) / (1+0+1+2) = 1/4 = 0.25
d(Ramesh, Rajesh) = (2+0) / (0+2+0+2) = 2/4 = 0.5
d(Suresh, Rajesh) = (1+0) / (0+1+0+3) = 1/4 = 0.25

The dissimilarity matrix for binary symmetric attributes is:

          Suresh   Ramesh   Rajesh
Suresh    0        -        -
Ramesh    0.25     0        -
Rajesh    0.25     0.5      0
b) Dissimilarity between i and j in the case of binary asymmetric attributes:

d(i, j) = (M10 + M01) / (M11 + M10 + M01)

Let's calculate the dissimilarity between Ramesh, Suresh and Rajesh for
binary asymmetric attributes:

d(Ramesh, Suresh) = (0+1) / (1+0+1) = 1/2 = 0.5
d(Ramesh, Rajesh) = (2+0) / (0+2+0) = 2/2 = 1
d(Suresh, Rajesh) = (1+0) / (0+1+0) = 1/1 = 1

The dissimilarity matrix for binary asymmetric attributes is:

          Suresh   Ramesh   Rajesh
Suresh    0        -        -
Ramesh    0.5      0        -
Rajesh    1        1        0

c) Symmetric binary similarity between i and j:

sim(i, j) = (M11 + M00) / (M11 + M10 + M01 + M00)

Let's calculate the similarity between Ramesh, Suresh and Rajesh for
symmetric binary attributes:

sim(Ramesh, Suresh) = (1+2) / (1+0+1+2) = 3/4 = 0.75
sim(Ramesh, Rajesh) = (0+2) / (0+2+0+2) = 2/4 = 0.5
sim(Suresh, Rajesh) = (0+3) / (0+1+0+3) = 3/4 = 0.75

The similarity matrix for symmetric binary attributes is (note that an
object's similarity with itself is 1, since sim = 1 - d):

          Suresh   Ramesh   Rajesh
Suresh    1        -        -
Ramesh    0.75     1        -
Rajesh    0.75     0.5      1

d) Asymmetric binary similarity, also known as the Jaccard coefficient,
between i and j:

sim(i, j) = M11 / (M11 + M10 + M01)

Let's calculate the similarity between Ramesh, Suresh and Rajesh for
asymmetric binary attributes:

sim(Ramesh, Suresh) = 1 / (1+0+1) = 1/2 = 0.5
sim(Ramesh, Rajesh) = 0 / (0+2+0) = 0
sim(Suresh, Rajesh) = 0 / (0+1+0) = 0

The Jaccard similarity matrix is:

          Suresh   Ramesh   Rajesh
Suresh    1        -        -
Ramesh    0.5      1        -
Rajesh    0        0        0

(Note: for Rajesh all attributes are 0, so M11 = 0 and the coefficient
0/0 is taken as 0 here.)
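A minimal sketch computing M11/M10/M01/M00 and the four binary measures defined above, for the Suresh/Ramesh/Rajesh table, using only the standard library.

def binary_counts(x, y):
    m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    m10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    m01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    m00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return m11, m10, m01, m00

suresh, ramesh, rajesh = [1, 0, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0]
m11, m10, m01, m00 = binary_counts(ramesh, suresh)

print((m10 + m01) / (m11 + m10 + m01 + m00))  # symmetric dissimilarity: 0.25
print((m10 + m01) / (m11 + m10 + m01))        # asymmetric dissimilarity: 0.5
print((m11 + m00) / (m11 + m10 + m01 + m00))  # symmetric similarity: 0.75
print(m11 / (m11 + m10 + m01))                # Jaccard coefficient: 0.5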

4) Dissimilarity of Numeric Data

All attributes may not be on the same scale, so the data may need to be
normalized before applying the distance formulas below. Normalization is
commonly performed using z-score normalization.

Let’s consider a dataset

Name Test 1(out of 5) Test 2 (Out of 5)

Rama 1 2

Mohan 3 5

a) Euclidean distance:

It is a distance measure best explained as the length of the straight-line
segment connecting two points. Euclidean distance is also called the L2
norm. For our two objects:

d = √((3-1)² + (5-2)²) = √13 ≈ 3.606
b) Manhattan distance

The Manhattan distance, often called taxicab distance or city block
distance, sums the absolute differences between real-valued vectors, like
moving along the streets of a uniform grid such as a chessboard. Manhattan
distance is also called the L1 norm.

d = |3-1| + |5-2| = 2 + 3 = 5

c) Minkowski distance

Minkowski distance is a generalization of the Euclidean and Manhattan
distances.

It is defined as

d(x, y) = (|x1-y1|^h + |x2-y2|^h + ... + |xp-yp|^h)^(1/h)

where h is a real number such that h >= 1. (Such a distance is also called
the Lp norm in some literature, where the symbol p refers to our notation
of h.)

It reduces to the Manhattan distance when h = 1 (the L1 norm) and to the
Euclidean distance when h = 2. For h = 2 and our two objects:

d = (|3-1|² + |5-2|²)^(1/2) = √13 ≈ 3.606

d) Supremum distance

The supremum distance, also known as the Chebyshev or L∞ distance, is
defined as the greatest difference between two vectors along any coordinate
dimension; in other words, it is simply the maximum difference along a
single axis.

For our two objects, the second attribute gives the greatest difference
between values, which is 5 - 2 = 3. This is the supremum distance between
the two objects.
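A minimal sketch of the distances above for Rama (1, 2) and Mohan (3, 5), using only the standard library; the Minkowski function covers Manhattan (h = 1) and Euclidean (h = 2) as special cases.

def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

rama, mohan = (1, 2), (3, 5)

print(minkowski(rama, mohan, 1))                      # Manhattan: 5.0
print(round(minkowski(rama, mohan, 2), 3))            # Euclidean: 3.606
print(max(abs(a - b) for a, b in zip(rama, mohan)))   # supremum: 3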

5) Proximity Measures for Ordinal Attributes


The values of an ordinal attribute have a meaningful order or ranking about
them, yet the magnitude between successive values is unknown. An ordinal
variable can be discrete or continuous. An example includes the sequence
small, medium, large for a size attribute. Ordinal attributes may also be
obtained from the discretization of numeric attributes by splitting the value
range into a finite number of categories. These categories are organized into
ranks.

Sl.No   Test (ordinal)
1       Excellent
2       Fair
3       Good
4       Excellent

Consider the above table; the proximity is calculated in 4 steps.

Step 1: Find the value of Mf, the number of ordered states. In our case
Mf = 3 (Excellent, Good, Fair).

Step 2: Replace each value xif by its corresponding rank rif ∈ {1, ..., Mf}.

Replace as: Excellent = 3, Good = 2, Fair = 1.

Sl.No   Test (rank)
1       3
2       1
3       2
4       3

Step 3: Normalize the ranks. Since each ordinal attribute can have a
different number of states, it is often necessary to map the range of each
attribute onto [0.0, 1.0] so that each attribute has equal weight. We
perform data normalization for this:

𝑟𝑖𝑓 − 1
𝑧𝑖𝑓 =
𝑀𝑓 − 1

where rif = rank of the ith object on the fth attribute, and
Mf = number of distinct states = 3.

Fair (1) = (1-1)/(3-1) = 0
Good (2) = (2-1)/(3-1) = 0.5
Excellent (3) = (3-1)/(3-1) = 1

The data is now transformed to:

Sl.No   Test (normalized)
1       1
2       0
3       0.5
4       1

Step 4: Dissimilarity can then be computed using any of the distance
measures.

Let's use the Manhattan distance:

d(2,1) = |0-1| = 1
d(2,3) = |0-0.5| = 0.5
d(2,4) = |0-1| = 1
d(3,4) = |0.5-1| = 0.5
d(1,3) = |1-0.5| = 0.5
d(1,4) = |1-1| = 0

  0    -    -    -
  1    0    -    -
  0.5  0.5  0    -
  0    1    0.5  0

Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4
(i.e., d(2,1) = 1.0 and d(4,2) = 1.0). This makes intuitive sense, since
objects 1 and 4 are both excellent while object 2 is fair.

Similarity values for ordinal attributes can be obtained from dissimilarity
as sim(i, j) = 1 - d(i, j).
6) Proximity Measures of Mixed Attributes
There are two approaches to compute the dissimilarity between objects of mixed
attribute types.

1) One approach is to group each type of attribute together, performing separate data
mining (e.g., clustering) analysis for each type. This is feasible if these analyses derive
compatible results. However, in real applications, it is unlikely that a separate
analysis per attribute type will generate compatible results.

2) A more preferable approach is to process all attribute types together, performing


a single analysis. One such technique combines the different attributes into a single
dissimilarity matrix, bringing all of the meaningful attributes onto a common scale
of the interval [0.0,1.0].

Suppose that the data set contains p attributes of mixed type. The
dissimilarity d(i, j) between objects i and j is defined as

d(i, j) = [ Σf=1..p δij(f) · dij(f) ] / [ Σf=1..p δij(f) ]

where the indicator δij(f) = 0 if:
i. xif or xjf is missing, or
ii. xif = xjf = 0 and attribute f is asymmetric binary;
otherwise, δij(f) = 1.

dij(f) depends on the type of attribute f and is calculated as follows:

(1) If f is numeric: dij(f) = |xif - xjf| / (max - min), where max and min
are taken over the values of attribute f.
(2) If f is nominal or binary: dij(f) = 0 if xif = xjf; otherwise dij(f) = 1.
(3) If f is ordinal: compute the normalized ranks zif = (rif - 1)/(Mf - 1)
and treat zif as numeric.

Example: Compute the dissimilarity between objects of mixed attribute types given
in Table
Object test-1 test-2 test-3
identifier (nominal ) (ordinal) (numeric )
1 code-A excellent 45
2 code-B fair 22
3 code-C good 64
4 code-A excellent 28

a) Dissimilarity Matrix for Attribute Test-1 (Nominal)

Find the dissimilarity matrix between objects for attribute test-1, which
is of nominal type, using formula no. 2: the dissimilarity is 1 if two
objects have different values for test-1, and 0 otherwise.

Dissimilarity matrix between objects for attribute test-1 (nominal):

Object   1      2      3      4
1        0
2        1      0
3        1      1      0
4        0      1      1      0

b) Dissimilarity Matrix for Attribute Test-2 (Ordinal)

Consider the above table; the proximity is calculated in 4 steps.

Step 1: Find the value of Mf, the number of ordered states. In our case
Mf = 3 (Excellent, Good, Fair).

Step 2: Replace each value xif by its corresponding rank rif ∈ {1, ..., Mf}.
Replace as: Excellent = 3, Good = 2, Fair = 1.

Sl.No   Test (ordinal)   Rank
1       Excellent        3
2       Fair             1
3       Good             2
4       Excellent        3

Step 3: Normalize the ranks: Since each ordinal attribute can have a
different number of states, it is often necessary to map the range of each
attribute onto [0.0, 1.0] so that each attribute has equal weight. We perform
data normalization for this.

𝑟𝑖𝑓 − 1
𝑧𝑖𝑓 =
𝑀𝑓 − 1

where rif = rank of the ith object on the fth attribute, and
Mf = number of distinct states = 3.

Fair (1) = (1-1)/(3-1) = 0
Good (2) = (2-1)/(3-1) = 0.5
Excellent (3) = (3-1)/(3-1) = 1

The data is now transformed to:

Sl.No   Test (normalized)
1       1
2       0
3       0.5
4       1

Step 4: Dissimilarity can then be computed using any of the distance
measures.

Let's use the Manhattan distance:

d(2,1) = |0-1| = 1
d(2,3) = |0-0.5| = 0.5
d(2,4) = |0-1| = 1
d(3,4) = |0.5-1| = 0.5
d(1,3) = |1-0.5| = 0.5
d(1,4) = |1-1| = 0

Dissimilarity matrix between objects for attribute test-2 (ordinal):

  0    -    -    -
  1    0    -    -
  0.5  0.5  0    -
  0    1    0.5  0
c) Dissimilarity Matrix for Attribute Test-3 (Numerical)

Find the dissimilarity matrix between objects for attribute test-3, which
is of numeric type, using formula no. 1; the values are normalized so that
they map into [0, 1] using

dij(f) = |xif - xjf| / (max - min)
Consider the data below.
Object test-3
identifier (numeric )

1 45
2 22
3 64
4 28

d(1,2) = |45-22| / (64-22) = 23/42 = 0.548
d(1,3) = |45-64| / 42 = 19/42 = 0.452
d(1,4) = |45-28| / 42 = 17/42 = 0.405
d(2,3) = |22-64| / 42 = 42/42 = 1
d(2,4) = |22-28| / 42 = 6/42 = 0.143
d(3,4) = |64-28| / 42 = 36/42 = 0.857
Dissimilarity Matrix between objects for attribute Test-3 (Numerical).

Object   1       2       3       4
1        0
2        0.548   0
3        0.452   1       0
4        0.405   0.143   0.857   0

Single dissimilarity matrix between all objects for all attributes

Now combine the different attributes into a single dissimilarity matrix.
The dissimilarity d(i, j) between objects i and j is defined as

d(i, j) = [ Σf=1..p δij(f) · dij(f) ] / [ Σf=1..p δij(f) ]

where δij(f) = 0 if xif or xjf is missing, or if xif = xjf = 0 and attribute
f is asymmetric binary; otherwise δij(f) = 1.

Since there are no missing values and no asymmetric binary attributes,
δij(f) = 1 for all pairs of objects, and p = 3 since there are 3 attributes.

Let's keep all three per-attribute dissimilarity matrices together:

Test-1 (nominal):
     1     2     3     4
1    0
2    1     0
3    1     1     0
4    0     1     1     0

Test-2 (ordinal):
     1     2     3     4
1    0
2    1     0
3    0.5   0.5   0
4    0     1     0.5   0

Test-3 (numeric):
     1      2      3      4
1    0
2    0.548  0
3    0.452  1      0
4    0.405  0.143  0.857  0

Dissimilarity between object 1 and object 2:

d(2,1) = (1·1 + 1·1 + 1·0.548) / 3 = 2.548/3 = 0.849

Similarly for the other pairs:
d(3,1) = (1 + 0.5 + 0.452)/3 = 0.651
d(4,1) = (0 + 0 + 0.405)/3 = 0.135
d(3,2) = (1 + 0.5 + 1)/3 = 0.833
d(4,2) = (1 + 1 + 0.143)/3 = 0.714
d(4,3) = (1 + 0.5 + 0.857)/3 = 0.786

Dissimilarity matrix between objects over all attributes (test-1, test-2,
test-3):

Object   1       2       3       4
1        0
2        0.849   0
3        0.651   0.833   0
4        0.135   0.714   0.786   0

Cosine similarity
Cosine similarity is a measure of similarity that can be used to compare
documents or, say, give a ranking of documents with respect to a given vector
of query words. Let x and y be two vectors for comparison. Using the cosine
measure as a similarity function, we have
sim(x, y) = (x · y) / (||x|| ||y||)

For example, in the below table, we see that Document1 contains five instances of
the word team, while hockey occurs three times. The word coach is absent from the
entire document, as indicated by a count value of 0. Such data can be highly
asymmetric.

Cosine similarity is mainly used in information retrieval, biological
taxonomy, gene feature mapping, etc.
Document 1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)

Document 2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

Document 1 · Document 2 = 5·3 + 0·0 + 3·2 + 0·0 + 2·1 + 0·1 + 0·0 + 2·1 + 0·0 + 0·1 = 25

||Document 1|| = (5·5 + 0·0 + 3·3 + 0·0 + 2·2 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0)^0.5 = (42)^0.5 = 6.481

||Document 2|| = (3·3 + 0·0 + 2·2 + 0·0 + 1·1 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1)^0.5 = (17)^0.5 = 4.123

cos(Document 1, Document 2) = 25 / (6.481 × 4.123) = 0.94
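A minimal cosine-similarity sketch reproducing the document example, using only the standard library.

from math import sqrt

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine(doc1, doc2), 2))   # 0.94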
