CH5 Data Mining Classification Prepared by Dr. Maher Abuhamdeh
Data Mining
Classification
• Customer Attrition/Churn:
  – Goal: to predict whether a customer is likely to be lost to a competitor.
  – Approach:
    • Use the detailed record of transactions with each of the past and present customers to find attributes.
      – How often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.
    • Label the customers as loyal or disloyal.
    • Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 4
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Courtesy: http://aps.umn.edu
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Illustrating Classification Task
• A training set of records with known class labels is used to learn a model (induction); the model is then applied (deduction) to a test set of records whose class labels are unknown.

Test Set:
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?
KNN: K-Nearest Neighbor Method
• KNN approach
  – Given a new test query q, find the k closest training queries to it in terms of Euclidean distance (see the sketch below).
  – Train a local ranking model online using these neighboring training queries (Ranking SVM).
  – Rank the documents of the test query using the trained local model.
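A minimal sketch of the neighbour-lookup step described above, assuming the training queries are represented as numeric feature vectors (the local Ranking SVM step is not shown):

import numpy as np

def k_nearest(train_X, q, k=3):
    # Euclidean distance from the query q to every training vector
    dists = np.sqrt(((train_X - q) ** 2).sum(axis=1))
    # indices of the k closest training vectors
    return np.argsort(dists)[:k]

# the four 2-D training points used in the worked example below, and its query
train_X = np.array([[7, 7], [7, 4], [3, 4], [1, 4]], dtype=float)
q = np.array([3, 7], dtype=float)
print(k_nearest(train_X, q, k=3))   # [2 3 0]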
Example
1. Suppose we use K = 3.
2. Calculate the distance between the query instance and all the training samples.
Example (Cont.)
X1   X2   Squared distance to the query (3, 7)
7    7    (7-3)² + (7-7)² = 16
7    4    (7-3)² + (4-7)² = 25
3    4    (3-3)² + (4-7)² = 9
1    4    (1-3)² + (4-7)² = 13
Example (Cont.)
X1   X2   Distance to (3, 7)            Rank   Class
7    7    √((7-3)² + (7-7)²) = 4        3      Bad
7    4    √((7-3)² + (4-7)²) = 5        4      Bad
3    4    √((3-3)² + (4-7)²) = 3        1      Good
1    4    √((1-3)² + (4-7)²) ≈ 3.6      2      Good
Example (Cont.)
With K = 3, two of the three closest neighbours have Default = N and one has Default = Y, so the unknown case is predicted as Default = N.
• Standardized Age
  – Min age = 20, Max age = 60
  – X' = (x - min) / (max - min)
  – X' = (25 - 20) / (60 - 20) = 5/40 = 0.125
• Standardized Loan
  – Min = 18,000, Max = 212,000
  – X' = (40,000 - 18,000) / (212,000 - 18,000) ≈ 0.11 (see the sketch below)
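A small sketch of the min-max standardization used above; the age and loan bounds are the ones given in the example.

def min_max(x, lo, hi):
    # scale x into the range [0, 1]
    return (x - lo) / (hi - lo)

print(min_max(25, 20, 60))               # 0.125 (standardized age)
print(min_max(40_000, 18_000, 212_000))  # ~0.113 (standardized loan)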
Pseudocode for the basic k-NN
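The pseudocode is not reproduced on the slide; the following is a minimal Python sketch of the basic k-NN classifier (Euclidean distance plus majority vote), illustrated with the Good/Bad example data from the earlier slides.

import math
from collections import Counter

def knn_classify(train, query, k=3):
    # train: list of (feature_vector, label) pairs
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]   # majority class among the k neighbours

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
print(knn_classify(train, (3, 7), k=3))   # Good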
• Advantages
  • The KNN algorithm provides good accuracy on many domains
  • Easy to understand and implement
  • Training is very fast (no explicit model has to be built)
  • The KNN algorithm can approximate complex target concepts locally
Disadvantages
Naïve Bayesian Classifier: Training Dataset
Bayesian Theorem: Basics
• Let X be a data sample ("evidence"): the class label is unknown
• Let H be the hypothesis that X belongs to class C
• Classification is to determine P(H|X) (the posterior probability): the probability that the hypothesis holds given the observed data sample X
• P(H) (prior probability): the initial probability
  – E.g., X will buy a computer, regardless of age, income, …
• P(X): the probability that the sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  – E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Bayesian Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H | X) = P(X | H) · P(H) / P(X)
Towards Naïve Bayesian Classifier
• Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm
• Classification is to derive the maximum posterior, i.e., the class Ci with maximal P(Ci|X)
• This can be derived from Bayes' theorem:
  P(Ci | X) = P(X | Ci) · P(Ci) / P(X)
  Since P(X) is constant for all classes, only P(X | Ci) · P(Ci) needs to be maximized
Derivation of Naïve Bayes Classifier
• The naïve assumption (class-conditional independence): P(X | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)
• If the attribute is continuous-valued, P(xk|Ci) is usually estimated with a Gaussian density with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π) · σ)) · e^(−(x − μ)² / (2σ²))
  and P(xk | Ci) = g(xk, μ_Ci, σ_Ci)
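A short sketch of the Gaussian density g(x, μ, σ) used above to estimate P(xk|Ci) for a continuous attribute; the sample values are the temperature mean and standard deviation that appear later in the weather-data statistics.

import math

def gaussian(x, mu, sigma):
    # normal density g(x, mu, sigma)
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# e.g. P(temperature = 66 | play = yes) with mu = 73, sigma = 6.2
print(gaussian(66, 73, 6.2))   # ~0.034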
Probabilities for weather data (counts and the corresponding conditional probabilities):

Outlook:      Sunny     2 (Yes)  3 (No)   ->  2/9   3/5
              Overcast  4 (Yes)  0 (No)   ->  4/9   0/5
              Rainy     3 (Yes)  2 (No)   ->  3/9   2/5
Temperature:  Hot       2 (Yes)  2 (No)   ->  2/9   2/5
              Mild      4 (Yes)  2 (No)   ->  4/9   2/5
              Cool      3 (Yes)  1 (No)   ->  3/9   1/5
Humidity:     High      3 (Yes)  4 (No)   ->  3/9   4/5
              Normal    6 (Yes)  1 (No)   ->  6/9   1/5
Windy:        False     6 (Yes)  2 (No)   ->  6/9   2/5
              True      3 (Yes)  3 (No)   ->  3/9   3/5
Play:         Yes 9, No 5                 ->  9/14  5/14
From [Witten & Eibe] Data Mining: Practical Machine Learning Tools and Techniques
[Weather dataset: 14 days described by Outlook, Temp, Humidity, Windy and the class Play]
Age Income Has a car Buy/class
senior middle yes yes
youth low yes no
junior high yes yes
youth middle yes yes
senior high no yes
junior low no no
senior middle no no
Conditional probabilities P(value | Buy = yes), P(value | Buy = no):
Age:        senior 2/4, 1/3    junior 1/4, 1/3    youth 1/4, 1/3
Income:     middle 2/4, 1/3    low    0/4, 2/3    high  2/4, 0/3
Has a car:  yes    3/4, 1/3    no     1/4, 2/3
Priors:     P(Buy = yes) = 4/7, P(Buy = no) = 3/7
Example on Naïve Bayes (Cont.)
• Classify the new tuple:
Age     Income   Has a car   Buy
youth   middle   no          ?
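A small sketch that plugs the conditional probabilities from the table above into the naive Bayes product for the query (youth, middle, has a car = no); the fractions are copied directly from the counts.

# P(class) * P(age | class) * P(income | class) * P(car | class)
p_yes = (4/7) * (1/4) * (2/4) * (1/4)   # youth|yes, middle|yes, car=no|yes
p_no  = (3/7) * (1/3) * (1/3) * (2/3)   # youth|no,  middle|no,  car=no|no
print(p_yes, p_no)                       # ~0.0179 vs ~0.0317
print("yes" if p_yes > p_no else "no")   # predicted class: no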
Bayes's rule
• Probability of event H given evidence E:
  Pr[H | E] = Pr[E | H] · Pr[H] / Pr[E]
• A priori probability of H: Pr[H]
  – Probability of the event before evidence is seen
• A posteriori probability of H: Pr[H | E]
  – Probability of the event after evidence is seen
Weather data example
• Evidence E: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True
• Pr[yes | E] = Pr[Sunny | yes] × Pr[Cool | yes] × Pr[High | yes] × Pr[True | yes] × Pr[yes] / Pr[E]
             = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
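A short computation of the example above: multiply the conditional probabilities from the frequency table for each class and normalize, so that Pr[E] cancels out.

like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # Sunny, Cool, High, True | yes
like_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # Sunny, Cool, High, True | no
total = like_yes + like_no
print(like_yes / total, like_no / total)            # ~0.205 vs ~0.795 -> play = no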
Missing values
• Training: an instance with a missing value is simply not included in the frequency counts for that attribute value-class combination
• Classification: a missing attribute is omitted from the calculation
• Example: Outlook Temp. Humidity Windy Play
? Cool High True ?
The “zero-frequency problem”
Avoiding the 0-Probability Problem
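The zero-frequency problem: if an attribute value never occurs together with a class, its conditional probability is zero and it wipes out the whole product. A sketch of the standard remedy, the Laplace (add-one) correction:

def laplace_prob(count, class_total, n_values):
    # P(value | class) with add-one (Laplace) smoothing
    return (count + 1) / (class_total + n_values)

# e.g. Outlook = Overcast never occurs with Play = no (0 out of 5, 3 possible Outlook values)
print(laplace_prob(0, 5, 3))   # 0.125 instead of 0.0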
Numeric attributes
• Usual assumption: attributes have a normal or
Gaussian probability distribution (given the class)
• The probability density function for the normal
distribution is defined by two parameters:
– Sample mean:         μ = (1/n) · Σ_{i=1..n} x_i
– Standard deviation:  σ = sqrt( (1/(n-1)) · Σ_{i=1..n} (x_i − μ)² )
• The density function for the normal distribution:
  f(x) = (1/(√(2π) · σ)) · e^( −(x − μ)² / (2σ²) )
  (Karl Gauss, 1777-1855, great German mathematician)
Day Outlook Temp Humidity Windy Play
1 Sunny 85 85 False No
2 Sunny 80 90 True No
6 Rainy 65 70 True No
8 Sunny 72 95 False No
14 Rainy 71 91 True No
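A small sketch that recomputes the class-conditional mean and sample standard deviation for the numeric attributes of the Play = No days listed above; the results agree with the statistics table that follows.

from statistics import mean, stdev   # stdev divides by n-1, as in the formula above

temp_no = [85, 80, 65, 72, 71]
humidity_no = [85, 90, 70, 95, 91]
print(round(mean(temp_no), 1), round(stdev(temp_no), 1))          # 74.6 7.9
print(round(mean(humidity_no), 1), round(stdev(humidity_no), 1))  # 86.2 9.7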
Statistics for weather data

Outlook:      Sunny 2 (Yes) 3 (No), Overcast 4 (Yes) 0 (No), Rainy 3 (Yes) 2 (No)
              -> Sunny 2/9, 3/5   Overcast 4/9, 0/5   Rainy 3/9, 2/5
Temperature:  Yes values 64, 68, 69, 70, 72, …   No values 65, 71, 72, 80, 85, …
              -> μ = 73 (Yes), μ = 75 (No); σ = 6.2 (Yes), σ = 7.9 (No)
Humidity:     Yes values 65, 70, 70, 75, 80, …   No values 70, 85, 90, 91, 95, …
              -> μ = 79 (Yes), μ = 86 (No); σ = 10.2 (Yes), σ = 9.7 (No)
Windy:        False 6 (Yes) 2 (No), True 3 (Yes) 3 (No)
              -> False 6/9, 2/5   True 3/9, 3/5
Play:         9 (Yes), 5 (No)  ->  9/14, 5/14
Classifying a new day
Outlook   Temp.   Humidity   Windy   Play
Sunny     66      90         true    ?
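A sketch of the full calculation for this new day, combining the nominal fractions with Gaussian densities built from the μ and σ values in the statistics table; Pr[E] again cancels when the two scores are normalized.

import math

def gaussian(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Sunny, temperature = 66, humidity = 90, windy = true
like_yes = (2/9) * gaussian(66, 73, 6.2) * gaussian(90, 79, 10.2) * (3/9) * (9/14)
like_no  = (3/5) * gaussian(66, 75, 7.9) * gaussian(90, 86, 9.7) * (3/5) * (5/14)
total = like_yes + like_no
print(like_yes / total, like_no / total)   # roughly 0.2 vs 0.8 -> predict play = no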
Naïve Bayes Text Classification
• You have a set of reviews (documents), each with a classification:

DOC   TEXT                          CLASS
1     I loved the movie             +
2     I hated the movie             -
3     A great movie, good movie     +
4     Poor acting                   -
5     Great acting, a good movie    +
• First we need to extract the unique words:
  I, loved, the, movie, hated, a, great, poor, acting, good
• We have 10 unique words.
Doc   I   loved   the   movie   hated   a   great   poor   acting   good   Class
1     1   1       1     1       0       0   0       0      0        0      +
2     1   0       1     1       1       0   0       0      0        0      -
3     0   0       0     2       0       1   1       0      0        1      +
4     0   0       0     0       0       0   0       1      1        0      -
5     0   0       0     1       0       1   1       0      1        1      +
• Take documents with positive outcomes:

Doc   I   loved   the   movie   hated   a   great   poor   acting   good   Class
1     1   1       1     1       0       0   0       0      0        0      +
3     0   0       0     2       0       1   1       0      0        1      +
5     0   0       0     1       0       1   1       0      1        1      +
• And the documents with negative outcomes:

Doc   I   loved   the   movie   hated   a   great   poor   acting   good   Class
2     1   0       1     1       1       0   0       0      0        0      -
4     0   0       0     0       0       0   0       1      1        0      -
• If vj = +:
  = P(+) * P(I|+) * P(hated|+) * P(the|+) * P(poor|+) * P(acting|+)
  = 0.6 * 0.0833 * 0.0417 * 0.0833 * 0.0417 * 0.0833
  ≈ 6.03 * 10^-7
• If vj = -:
  = P(-) * P(I|-) * P(hated|-) * P(the|-) * P(poor|-) * P(acting|-)
  = 0.4 * 0.125 * 0.125 * 0.125 * 0.125 * 0.125
  ≈ 1.22 * 10^-5
Since 1.22 * 10^-5 > 6.03 * 10^-7, the sentence is classified as negative.
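A compact sketch of the whole text-classification example with add-one smoothing; the training documents are the five reviews listed earlier (punctuation dropped), and the test sentence is assumed to be "I hated the poor acting", since those are the words whose conditional probabilities appear in the calculation above.

from collections import Counter

docs = [("I loved the movie", "+"), ("I hated the movie", "-"),
        ("a great movie good movie", "+"), ("poor acting", "-"),
        ("great acting a good movie", "+")]

vocab = {w for text, _ in docs for w in text.lower().split()}
counts = {"+": Counter(), "-": Counter()}
for text, label in docs:
    counts[label].update(text.lower().split())

def score(sentence, label):
    # P(class) * product of P(word | class), with Laplace smoothing
    p = sum(1 for _, l in docs if l == label) / len(docs)
    total = sum(counts[label].values())
    for w in sentence.lower().split():
        p *= (counts[label][w] + 1) / (total + len(vocab))
    return p

test = "I hated the poor acting"
print(score(test, "+"))   # ~6.0e-07
print(score(test, "-"))   # ~1.2e-05 -> classified as negative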