Unit 2
Data mining: It is a technique used for the extraction of new and useful knowledge from large data stores for decision-making purposes.
Applications of data mining:
Business:
Modern businesses use data mining to perform market analysis: to identify new product bundles, find the root causes of manufacturing problems, prevent customer attrition and acquire new customers, cross-sell to existing customers, determine sales trends, and so on.
Data mining also contributes to customer relationship management. Rather than randomly contacting prospects or customers through a call center or by mail, a company can concentrate its efforts on the customers predicted to be most likely interested in an offer.
Medicine, Science and engineering:
Many satellites gather huge amounts of data on weather conditions. Data mining may be used to answer questions such as volcano detection, earthquake detection, the relationship between the ocean's surface temperature and the land temperature, and so on.
Data mining is also used in microbiology; for example, it aims to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer, which is of great importance for improving methods of diagnosing, preventing, and treating these diseases.
Another example of data mining in science and engineering is found in
educational research, where data mining has been used to study the factors
leading students to choose to engage in behaviors which reduce their
learning, and to understand factors influencing university student retention.
Data mining is used in many more applications, such as banking, traffic analysis, and computer network analysis.
Data sets
Types of data sets:
• Record data: transaction data, data matrix
• Graph-based data: data with relationships among objects
• Ordered data: sequential data, sequence data
Record data:
Record data is a collection of records, each of which consists of a fixed set of attributes. Record data is usually stored in flat files or relational databases.
1) Transaction data: A special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store: the set of items purchased by a customer during one shopping trip constitutes a transaction. This type of data is known as market basket data.
2) Data matrix: If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space. Such a data set can be represented by an m-by-n matrix with m rows, one for each object, and n columns, one for each attribute. This type of matrix is known as a data matrix, and standard matrix operations may be applied to it.
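As a minimal sketch (the values are illustrative), such a data matrix can be held in a NumPy array, with one row per object and one column per attribute:

```python
# A data matrix of m = 3 objects (rows) and n = 2 numeric attributes (columns).
import numpy as np

X = np.array([[1.0, 2.0],   # object 1
              [3.0, 5.0],   # object 2
              [4.0, 1.0]])  # object 3

print(X.shape)         # (3, 2): m rows, n columns
print(X.mean(axis=0))  # column (attribute) means, a standard matrix operation
```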
3) Sparse data matrix: It is a special case of the data matrix in which only the non-zero values are important. Consider a collection of documents, each represented as a term vector whose attributes are the terms: in document 1 the word "team" is found 3 times, the word "play" is found 5 times, and so on, while the counts of most other terms are zero.
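A minimal sketch of such a matrix, using SciPy's sparse format so that only the non-zero counts are stored (documents, terms, and counts are illustrative):

```python
# Sparse document-term matrix: rows are documents, columns are terms.
from scipy.sparse import csr_matrix

rows   = [0, 0, 1]      # document indices (0 = document 1)
cols   = [0, 1, 2]      # term indices (0 = "team", 1 = "play", 2 = "coach")
counts = [3, 5, 2]      # "team" occurs 3 times and "play" 5 times in document 1

dtm = csr_matrix((counts, (rows, cols)), shape=(2, 3))
print(dtm.toarray())    # dense view; most entries would be zero in practice
```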
Graph-based data:
1) Data with relationships among data objects: Graphs are used to represent relationships between data objects. The data objects can be considered as nodes, and the relationships between them are represented by links. Consider the example of the WWW, where web pages are interlinked with other pages; the links provide good information for searching for and extracting relevant web pages.
2) Data with objects that are graphs: If objects contain sub-objects that have relationships among them, then the objects themselves are represented as graphs. For example, a water molecule (the object) can be represented as a graph whose nodes are the oxygen and hydrogen atoms and whose links are the chemical bonds between them.
Ordered data
Time series data: It is used to store a sequence of data values that change over time, for example the stock value of a company, which changes with respect to time.
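A minimal sketch (the values are illustrative) of a time series held as a pandas Series indexed by date:

```python
# Daily stock value of a company, indexed by date.
import pandas as pd

stock = pd.Series([102.5, 101.8, 104.2, 103.9],
                  index=pd.date_range("2013-03-01", periods=4, freq="D"))
print(stock)
```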
Data Quality
Ans: The quality of data plays an important role in data mining: quality knowledge can be obtained only from good-quality data. Data quality is rarely perfect, for reasons such as:
Human errors during data entry
Hardware limitations
Flaws in the data collection process
Missing values or entire missing objects
Duplicate objects
Inconsistency
Definitions:
Error: For a continuous attribute, the numerical difference between the measured and the true value is called the error.
Measurement error: Any problem resulting from the measurement process.
Data collection error: Errors such as omitting data objects or attribute values.
VARIOUS ASPECTS OF DATA QUALITY RELATED TO DATA MEASUREMENT AND COLLECTION:
Outliers: A data set may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions. However, in some applications, such as fraud detection, the rare event is the more important one.
Missing values
Missing values are often encountered in data sets. This may be due to incomplete information collection.
Missing values are handled in the following ways:
Eliminate data objects or attributes: Eliminating objects is not very effective unless a tuple contains several attributes with missing values. A related strategy is to eliminate attributes that have missing values.
Estimate missing values: In some cases the missing values can be estimated by using the other values. For a continuous attribute, a missing value can be estimated by the average of the remaining values; for a categorical attribute, by the most common value.
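A minimal sketch of both estimation strategies with pandas (the column names and values are illustrative):

```python
# Fill a continuous attribute with the mean of the remaining values and a
# categorical attribute with its most common value.
import pandas as pd

df = pd.DataFrame({"weight": [60.0, None, 70.0, 80.0],
                   "city": ["Tanuku", "Bhimavaram", None, "Tanuku"]})

df["weight"] = df["weight"].fillna(df["weight"].mean())  # average of the rest
df["city"] = df["city"].fillna(df["city"].mode()[0])     # most common value
print(df)
```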
Aggregation
Ans: Combining of 2 or more objects is known as aggregation. The large data set
Quantitative attributes may be aggregated by either taking sum or mean of
attribute values.
Advantages of aggregation: the aggregated data set requires less memory and processing time, and it provides a higher-level, more stable view of the data.
Consider an example:
TID  | Item                 | Store location | Date     | Price
1001 | Colgate tooth brush  | Bhimavaram     | 1/3/2013 | 30
1002 | Amul butter          | Tanuku         | 2/4/2013 | 20
1003 | HP computer mouse    | Bhimavaram     | 2/4/2013 | 200
1004 | Pepsodent toothpaste | Tanuku         | 3/4/2013 | 50
1005 | Night lamp           | Tanuku         | 3/3/2013 | 100
1006 | Thums Up             | Bhimavaram     | 4/4/2013 | 65
Fig 1
Fig 1 represents the original data set and Fig 2 the aggregated data set: TID and Item are removed, and daily sales are converted into monthly sales per store location, as sketched below.
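A minimal sketch of this aggregation with pandas, using the Fig 1 data (dates read as day/month/year):

```python
# Drop TID and Item, then sum daily prices into monthly sales per location.
import pandas as pd

sales = pd.DataFrame({
    "location": ["Bhimavaram", "Tanuku", "Bhimavaram",
                 "Tanuku", "Tanuku", "Bhimavaram"],
    "date": pd.to_datetime(["1/3/2013", "2/4/2013", "2/4/2013",
                            "3/4/2013", "3/3/2013", "4/4/2013"],
                           dayfirst=True),
    "price": [30, 20, 200, 50, 100, 65],
})

monthly = sales.groupby(["location", sales["date"].dt.to_period("M")])["price"].sum()
print(monthly)  # the aggregated (Fig 2 style) data set
```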
Sampling technique
Ans: This technique is inherited by data mining from statistics. It is very expensive and time consuming to process all the data, so a sampling technique can be used to consider only a subset of the data instead of the whole data set for analysis.
Sampling approaches
Sampling without replacement (SWOR): Once an item is selected from the data set, it is removed from the original population (data set), so no item is sampled more than once.
Sampling with replacement (SWR): Once an item is selected from the data set, it is placed back into the original population. This means the same item may be sampled more than once; for example, an item such as T4 might be drawn twice. Both approaches are sketched in code after the note below.
• If the sample size is too small, some important patterns may be missed; if it is too large, the advantages of sampling (less time and less storage space) are lost.
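A minimal sketch of both approaches using Python's standard library (the population of transactions is illustrative):

```python
# SWOR: no item can appear twice. SWR: the same item may be drawn repeatedly.
import random

population = ["T1", "T2", "T3", "T4", "T5", "T6"]

swor = random.sample(population, k=4)    # sampling without replacement
swr = random.choices(population, k=4)    # sampling with replacement
print(swor, swr)
```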
Progressive sampling
The progressive sampling method is used to determine a sufficient sample size. This approach starts with a small sample and then increases the sample size until a sample of sufficient size has been obtained.
The accuracy of a predictive model increases with the size of the sample up to a point beyond which it increases no further; this point is known as the leveling-off point. Another sample is then taken from the same original data and grown to the same size, and its closeness to the leveling-off point is measured.
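A minimal sketch of this loop, assuming a hypothetical train_and_score(sample) function that fits a predictive model on the sample and returns its accuracy:

```python
# Grow the sample until accuracy stops improving (the leveling-off point).
import random

def progressive_sample(data, train_and_score, start=100, factor=2, tol=0.01):
    size, prev_acc = start, 0.0
    while size <= len(data):
        sample = random.sample(data, size)   # sample without replacement
        acc = train_and_score(sample)
        if acc - prev_acc < tol:             # accuracy has leveled off
            return sample
        prev_acc, size = acc, size * factor
    return data                              # fall back to the full data set
```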
Dimensionality reduction
Ans: Dimensionality reduction refers to the creation of new attributes that are combinations of the old attributes (a sketch follows the note below).
Note:
Redundant features: features that carry the same information (the same values) in more than one dimension.
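One standard technique of this kind is principal component analysis (PCA), whose new attributes are linear combinations of the old ones. A minimal sketch with scikit-learn (random data for illustration):

```python
# Reduce 10 original attributes to 2 combined attributes with PCA.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)                  # 100 objects, 10 attributes
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)                       # (100, 2)
```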
Feature creation
Ans: Feature creation is the process of creating a new set of attributes that capture information more efficiently than the original attributes.
Feature extraction: The creation of new attributes from the original raw data is known as feature extraction. This method is mainly used in image processing. Consider a set of images stored in the form of pixels that we want to classify as containing human faces or not. If new attributes are created containing information about certain edges and colors, these attributes help us classify the images better.
Mapping the data to a new space: This is the process of viewing the data from different angles to reveal important and interesting features in it. This method is mainly used on time series data. Consider the example of Figs 1-3: Fig 1 contains two time series without noise, and Fig 2 contains the same two time series with noise; by applying the Fourier transform, the time series data of Fig 2 is converted into the frequency information presented in Fig 3.
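A minimal sketch of this mapping with NumPy (the signal is illustrative: a 7 Hz sine wave plus noise):

```python
# Map a noisy time series to the frequency domain with the Fourier transform.
import numpy as np

t = np.linspace(0, 1, 500)
signal = np.sin(2 * np.pi * 7 * t) + 0.5 * np.random.randn(500)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(500, d=t[1] - t[0])
print(freqs[spectrum.argmax()])  # the ~7 Hz component stands out despite noise
```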
Feature construction: Sometimes the attributes in the original data set contain the necessary information, but in a form not suitable for the data mining algorithm. In this situation, new attributes that are more suitable for the algorithm are constructed from the original attributes.
Example: Consider an original data set containing the mass and volume of various metals. Here it is more meaningful to construct a density attribute (density = mass/volume) than to use mass and volume directly (see the sketch after Fig B).
Metal   | Density
Metal A | 1.0350
Metal B | 1.2475
Metal C | 1.3408
Fig B
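A minimal sketch of this construction (mass and volume values are illustrative, chosen to reproduce the densities of Fig B):

```python
# Construct the density attribute from mass and volume.
import pandas as pd

metals = pd.DataFrame({"metal": ["A", "B", "C"],
                       "mass": [10.35, 24.95, 33.52],
                       "volume": [10.0, 20.0, 25.0]})
metals["density"] = metals["mass"] / metals["volume"]  # density = mass / volume
print(metals[["metal", "density"]])
```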
Discretization and Binarization
Binarization: The process of converting continuous and discrete attributes into binary attributes is known as binarization.
The two techniques for binarization:
1) If there are m categorical values, uniquely assign each value to an integer in the interval [0, m − 1], then convert these integers into binary numbers.
In the example below there are 5 categorical values (awful, poor, OK, good, great). We uniquely assign each value to an integer (awful = 0, poor = 1, OK = 2, good = 3, great = 4), then convert these integers into binary numbers.
Categorical value | Integer value | X1 | X2 | X3
Awful             | 0             | 0  | 0  | 0
Poor              | 1             | 0  | 0  | 1
OK                | 2             | 0  | 1  | 0
Good              | 3             | 0  | 1  | 1
Great             | 4             | 1  | 0  | 0
Note: If the categorical values are ordinal, the integer assignment must preserve their order.
2) Introduce one binary attribute per categorical value, so that exactly one attribute equals 1 for each record (one-hot encoding), as in the table below; a code sketch of both techniques follows it.
Categorical value | Integer value | X1 | X2 | X3 | X4 | X5
Awful             | 0             | 1  | 0  | 0  | 0  | 0
Poor              | 1             | 0  | 1  | 0  | 0  | 0
OK                | 2             | 0  | 0  | 1  | 0  | 0
Good              | 3             | 0  | 0  | 0  | 1  | 0
Great             | 4             | 0  | 0  | 0  | 0  | 1
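A minimal sketch reproducing both tables in plain Python:

```python
# Scheme 1: integer -> 3 binary digits. Scheme 2: one binary attribute per value.
order = ["Awful", "Poor", "OK", "Good", "Great"]

for i, value in enumerate(order):
    bits = format(i, "03b")                                      # scheme 1
    one_hot = "".join("1" if j == i else "0" for j in range(5))  # scheme 2
    print(f"{value:5s}  int={i}  x1x2x3={bits}  x1..x5={one_hot}")
```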
Discretization:
The process of converting continuous attributes into categorical attributes is known as discretization. This technique is used on data intended for classification and association analysis. Discretization involves two subtasks:
1) deciding how many categorical values to use;
2) deciding how to map the continuous attribute values to these categorical values.
For example, consider a student table grouped by grade:

Student name | Percentage | Grade
A            | 45.8       | Second class
B            | 47.9       | Second class
C            | 66.7       | First class
D            | 65.6       | First class
E            | 62.5       | First class
F            | 77.8       | Distinction
G            | 80.6       | Distinction
Equal depth (equal frequency): It divides the attribute values into bins that contain an equal number of values (a code sketch follows the bins). First arrange the values in order: (2, 3, 4, 8, 9, 15, 16, 21, 22, 23, 26, 28).
Bin 1: 2, 3, 4, 8
Bin 2: 9, 15, 16, 21
Bin 3: 22, 23, 26, 28
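A minimal sketch of equal-depth binning on these values:

```python
# Split the sorted values into 3 bins with an equal number of values each.
values = sorted([2, 3, 4, 8, 9, 15, 16, 21, 22, 23, 26, 28])

n_bins = 3
depth = len(values) // n_bins            # 4 values per bin
bins = [values[i:i + depth] for i in range(0, len(values), depth)]
print(bins)  # [[2, 3, 4, 8], [9, 15, 16, 21], [22, 23, 26, 28]]
```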
(Figure: comparison of discretization methods on data with outliers; the dark dots are outliers. K-means performs the best.)
Supervised discretization:
Supervised discretization uses the class labels. The entropy of interval i is
e_i = − Σ_{j=1}^{k} p_ij log2(p_ij)
where p_ij = m_ij / m_i is the fraction of the m_i values in interval i that belong to class j, and k is the number of classes.
Simple functions
A simple mathematical function, such as x^k, log x, e^x, |x| or √x, is applied to each value individually. For example, applying the square root:

x    | √x
1000 | 31.62
100  | 10
350  | 18.71
245  | 15.65

This technique is mainly used when the range of values is very large.
Note: In some cases the transformation changes the nature of the variable. For example, if the values {1, 2, 3} are transformed into {1/1, 1/2, 1/3}, then the order reverses: 1 < 2 < 3, but 1/1 > 1/2 > 1/3.
Normalization or standardization
Here statistics such as the mean and standard deviation are used. These methods mainly serve to remove the dependence of the values on their units of measurement. For example, the weight of people can be measured in kilograms or in pounds; to compare weights irrespective of the unit of measure, these techniques are helpful.
If x̄ is the mean of the attribute values and sx their standard deviation, then x′ = (x − x̄) / sx creates a new, standardized variable.
The mean and standard deviation are strongly affected by outliers; in that case the median and the absolute standard deviation can be used instead, as the sketch below shows.
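A minimal sketch of both variants (weights are illustrative; the last value is an outlier):

```python
# Standardize with mean/std, and with the outlier-robust median variant.
import numpy as np

x = np.array([55.0, 60.0, 62.0, 70.0, 150.0])

z = (x - x.mean()) / x.std()                   # affected by the outlier
med = np.median(x)
robust = (x - med) / np.mean(np.abs(x - med))  # median-based variant
print(z.round(2), robust.round(2))
```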
A similarity measure for two objects, i and j, will typically return the value 0 if the objects are unalike; the more alike the objects are, the higher the value.
[ x11 ⋯ x1p ]
[  ⋮   ⋱  ⋮  ]
[ xn1 ⋯ xnp ]
A data matrix is made up of two entities or "things," namely rows (for objects) and columns (for attributes). Therefore, the data matrix is often called a two-mode matrix.
The dissimilarities between all pairs of objects can likewise be stored in a lower-triangular dissimilarity matrix:

[ 0                       ]
[ d(2,1)  0               ]
[   ⋮       ⋮             ]
[ d(n,1)  d(n,2)  ⋯  0    ]

For example, d(2,1) represents the distance between the 2nd object and the 1st object; the bigger the value of d(2,1), the larger the difference between them.
For nominal attributes, the dissimilarity between two objects i and j is
d(i, j) = (p − m) / p
where p is the total number of attributes and m is the number of matches (the number of attributes for which i and j are in the same state).
Example:
Now we apply the formula described above for finding the proximity of nominal attributes; the result is the dissimilarity matrix below. As seen from the calculation, the dissimilarity of an object with itself is 0 (equivalently, its similarity with itself is 1), which seems intuitively correct.
0
1     0
1     0.5   0
0     1     1     0
M11 = the number of attributes that equal 1 for both objects i and j,
M10 = the number of attributes that equal 1 for object i but 0 for object j,
M01 = the number of attributes that equal 0 for object i but 1 for object j,
M00 = the number of attributes that equal 0 for both objects i and j.
For symmetric binary attributes, the dissimilarity is
d(i, j) = (M10 + M01) / (M11 + M10 + M01 + M00)
For asymmetric binary attributes, the matches on 0 are ignored:
d(i, j) = (M10 + M01) / (M11 + M10 + M01)
The corresponding similarity for symmetric binary attributes is the simple matching coefficient:
sim(i, j) = (M11 + M00) / (M11 + M10 + M01 + M00)
Let us calculate the dissimilarity between Ramesh, Suresh, and Rajesh in the case of symmetric binary attributes:
d(Ramesh, Suresh) = (1 + 2) / (1 + 0 + 1 + 2) = 3/4 = 0.75
d(Ramesh, Rajesh) = (0 + 2) / (0 + 2 + 0 + 2) = 2/4 = 0.5
d(Suresh, Rajesh) = (0 + 3) / (0 + 1 + 0 + 3) = 3/4 = 0.75
For asymmetric binary attributes, the similarity (the Jaccard coefficient) is
sim(i, j) = M11 / (M11 + M10 + M01)
Let us calculate the similarity between Ramesh, Suresh, and Rajesh in the case of asymmetric binary attributes:
s(Ramesh, Suresh) = 1 / (1 + 0 + 1) = 1/2 = 0.5
s(Ramesh, Rajesh) = 0 / (0 + 2 + 0) = 0
s(Suresh, Rajesh) = 0 / (0 + 1 + 0) = 0
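A minimal sketch computing the M counts and all three measures for two illustrative binary vectors:

```python
# Count M11, M10, M01, M00 and apply the formulas defined above.
def binary_measures(i, j):
    m11 = sum(a == 1 and b == 1 for a, b in zip(i, j))
    m10 = sum(a == 1 and b == 0 for a, b in zip(i, j))
    m01 = sum(a == 0 and b == 1 for a, b in zip(i, j))
    m00 = sum(a == 0 and b == 0 for a, b in zip(i, j))
    symmetric_d = (m10 + m01) / (m11 + m10 + m01 + m00)
    asymmetric_d = (m10 + m01) / (m11 + m10 + m01)
    jaccard_sim = m11 / (m11 + m10 + m01)
    return symmetric_d, asymmetric_d, jaccard_sim

print(binary_measures([1, 0, 1, 0], [1, 1, 0, 0]))  # (0.5, 0.666..., 0.333...)
```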
Distance measures. Consider two objects with two numeric attributes:

Object | x1 | x2
Rama   | 1  | 2
Mohan  | 3  | 5
a) Euclidean distance: the straight-line distance between two objects, also called the L2 norm:
d = √(|3 − 1|² + |5 − 2|²) = √13 = 3.605
b) Manhattan distance: often called taxicab distance or city-block distance, it sums the absolute differences of the attribute values of the two objects, as if moving along a uniform grid such as a chessboard. The Manhattan distance is also called the L1 norm.
d = |3 − 1| + |5 − 2| = 5
c) Minkowski distance: a generalization of both, defined as
d(i, j) = (Σf |xif − xjf|^h)^(1/h)
where h is a real number such that h ≥ 1. (Such a distance is also called the Lp norm in some literature, where the symbol p refers to our notation of h.) With h = 2 it reduces to the Euclidean distance calculated above.
d) Supremum distance: the maximum difference between the values of any attribute of the two objects (the L∞ norm). Here the second attribute gives the greatest difference, 5 − 2 = 3, so the supremum distance between both objects is 3.
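A minimal sketch computing these distances for the two objects above:

```python
# Euclidean, Manhattan, Minkowski, and supremum distances for (1,2) and (3,5).
import numpy as np

x, y = np.array([1, 2]), np.array([3, 5])

def minkowski(x, y, h):
    return (np.abs(x - y) ** h).sum() ** (1 / h)

print(minkowski(x, y, 2))    # Euclidean (L2): 3.605...
print(np.abs(x - y).sum())   # Manhattan (L1): 5
print(np.abs(x - y).max())   # Supremum (L-infinity): 3
```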
Dissimilarity for ordinal attributes. Consider the following table:

Sl.No | Test (Ordinal)
1     | Excellent
2     | Fair
3     | Good
4     | Excellent

Step 1: Find the value of Mf (count the number of states). In our case Mf = 3 (Fair, Good, Excellent).
Step 2: Replace each value by its corresponding rank rif:

Sl.No | Test (Ordinal) | Rank rif
1     | Excellent      | 3
2     | Fair           | 1
3     | Good           | 2
4     | Excellent      | 3

Step 3: Normalize the ranks onto [0, 1]:
zif = (rif − 1) / (Mf − 1)
Where,
Fair (rank 1): z = (1 − 1) / (3 − 1) = 0
Good (rank 2): z = (2 − 1) / (3 − 1) = 0.5
Excellent (rank 3): z = (3 − 1) / (3 − 1) = 1
Step 4: Compute the dissimilarities as for numeric values, d(i, j) = |zif − zjf|:
d(2,1) = |0 − 1| = 1
d(2,3) = |0 − 0.5| = 0.5
d(2,4) = |0 − 1| = 1
d(3,4) = |0.5 − 1| = 0.5
d(1,3) = |1 − 0.5| = 0.5
d(1,4) = |1 − 1| = 0
0
1     0
0.5   0.5   0
0     1     0.5   0
Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4 (i.e., d(2,1) = 1.0 and d(4,2) = 1.0). This makes intuitive sense, since objects 1 and 4 are both excellent while object 2 is fair.
1) One approach is to group each type of attribute together, performing a separate data mining (e.g., clustering) analysis for each type. This is feasible if the analyses derive compatible results. However, in real applications, it is unlikely that a separate analysis per attribute type will generate compatible results.
2) A preferable approach is to process all attribute types together. Suppose that the data set contains p attributes of mixed type. The dissimilarity d(i, j) between objects i and j is defined as

d(i, j) = Σf=1..p δij(f) · dij(f) / Σf=1..p δij(f)

where the indicator δij(f) = 0 if xif or xjf is missing, or if xif = xjf = 0 and attribute f is asymmetric binary; otherwise δij(f) = 1.
The contribution dij(f) depends on the type of attribute f and can be calculated with the help of the following formulas:
(1) if f is numeric: dij(f) = |xif − xjf| / (max − min), where max and min are taken over attribute f;
(2) if f is nominal or binary: dij(f) = 0 if xif = xjf, otherwise dij(f) = 1;
(3) if f is ordinal: compute the ranks rif and zif = (rif − 1) / (Mf − 1), and treat zif as numeric.
Example: Compute the dissimilarity between the objects of mixed attribute types given in the table below.

Object identifier | test-1 (nominal) | test-2 (ordinal) | test-3 (numeric)
1                 | code-A           | excellent        | 45
2                 | code-B           | fair             | 22
3                 | code-C           | good             | 64
4                 | code-A           | excellent        | 28
a) Dissimilarity matrix for attribute test-1 (nominal)
Find the dissimilarity matrix between objects for attribute test-1, which is of type nominal, using formula (2): the dissimilarity is 1 if two objects have different values for test-1, otherwise 0.
Object identifier | 1 | 2 | 3 | 4
1                 | 0 |   |   |
2                 | 1 | 0 |   |
3                 | 1 | 1 | 0 |
4                 | 0 | 1 | 1 | 0
b) Dissimilarity matrix for attribute test-2 (ordinal)
Step 1: Find the value of Mf (count the number of states). In our case Mf = 3 (Fair, Good, Excellent).
Step 2: Replace each value by its corresponding rank rif:

Object identifier | test-2 (Ordinal) | Rank rif
1                 | Excellent        | 3
2                 | Fair             | 1
3                 | Good             | 2
4                 | Excellent        | 3
Step 3: Normalize the ranks: Since each ordinal attribute can have a
different number of states, it is often necessary to map the range of each
attribute onto [0.0, 1.0] so that each attribute has equal weight. We perform
data normalization for this.
zif = (rif − 1) / (Mf − 1)
Where,
Fair (rank 1): z = (1 − 1) / (3 − 1) = 0
Good (rank 2): z = (2 − 1) / (3 − 1) = 0.5
Excellent (rank 3): z = (3 − 1) / (3 − 1) = 1
d(2,1) = |0 − 1| = 1
d(2,3) = |0 − 0.5| = 0.5
d(2,4) = |0 − 1| = 1
d(3,4) = |0.5 − 1| = 0.5
d(1,3) = |1 − 0.5| = 0.5
d(1,4) = |1 − 1| = 0
0
1     0
0.5   0.5   0
0     1     0.5   0
c) Dissimilarity matrix for attribute test-3 (numeric)
Find the dissimilarity matrix between objects for attribute test-3, which is of type numeric, using formula (1): normalize each absolute difference onto [0, 1] by dividing by max − min = 64 − 22 = 42.
Object identifier | test-3
1                 | 45
2                 | 22
3                 | 64
4                 | 28
d(1,2) = |45 − 22| / (64 − 22) = 23/42 = 0.548
d(1,3) = |45 − 64| / 42 = 0.452
d(1,4) = |45 − 28| / 42 = 0.405
d(2,3) = |22 − 64| / 42 = 1
d(2,4) = |22 − 28| / 42 = 0.143
d(3,4) = |64 − 28| / 42 = 0.857
Dissimilarity matrix between objects for attribute test-3 (numeric):

Object identifier | 1     | 2     | 3     | 4
1                 | 0     |       |       |
2                 | 0.548 | 0     |       |
3                 | 0.452 | 1     | 0     |
4                 | 0.405 | 0.143 | 0.857 | 0
Now combine the different attributes into a single dissimilarity matrix. The dissimilarity d(i, j) between objects i and j is defined as

d(i, j) = Σf=1..p δij(f) · dij(f) / Σf=1..p δij(f)

where δij(f) = 0 if xif or xjf is missing, or if xif = xjf = 0 and f is asymmetric binary; otherwise δij(f) = 1. As we can see, there are no missing values and no asymmetric binary attributes, so δij(f) = 1 for all pairs of objects, and p = 3 as there are 3 attributes. For example,
d(2,1) = (1 + 1 + 0.548) / 3 = 0.849
Similarly, the dissimilarity matrix between objects for all attributes (test-1, test-2, test-3), sketched in code below, is:

Object identifier | 1     | 2     | 3     | 4
1                 | 0     |       |       |
2                 | 0.849 | 0     |       |
3                 | 0.651 | 0.833 | 0     |
4                 | 0.135 | 0.714 | 0.786 | 0
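A minimal sketch of this combination for the four objects, assuming δij(f) = 1 throughout (no missing values, no asymmetric binary attributes):

```python
# Mixed-type dissimilarity: average the per-attribute contributions.
nominal = ["code-A", "code-B", "code-C", "code-A"]   # test-1
ordinal = [1.0, 0.0, 0.5, 1.0]                       # test-2 as normalized ranks
numeric = [45, 22, 64, 28]                           # test-3
rng = max(numeric) - min(numeric)                    # 64 - 22 = 42

def d(i, j):
    d1 = 0 if nominal[i] == nominal[j] else 1        # nominal attribute
    d2 = abs(ordinal[i] - ordinal[j])                # ordinal, via ranks
    d3 = abs(numeric[i] - numeric[j]) / rng          # numeric, normalized
    return (d1 + d2 + d3) / 3                        # p = 3 attributes

print(round(d(1, 0), 3))  # 0.849, matching d(2,1) in the matrix above
```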
Cosine similarity
Cosine similarity is a measure of similarity that can be used to compare
documents or, say, give a ranking of documents with respect to a given vector
of query words. Let x and y be two vectors for comparison. Using the cosine
measure as a similarity function, we have
sim(x, y) = (x · y) / (||x|| ||y||)
For example, in the document-term data described here, Document 1 contains five instances of the word team, while hockey occurs three times; the word coach is absent from the entire document, as indicated by a count value of 0. Such data can be highly asymmetric.
Document 2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
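A minimal sketch of the cosine measure. Document 2 is taken from the text; Document 1's full vector is an assumption for illustration, consistent with the counts described above (team = 5, hockey = 3, coach = 0):

```python
# Cosine similarity of two document term vectors.
import numpy as np

doc1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])  # assumed Document 1 vector
doc2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])  # Document 2 from the text

sim = doc1.dot(doc2) / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(round(sim, 2))  # ~0.94: the documents are quite similar
```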