Lecture 5
CLASSIFICATION
FSKTM, UTHM
nazri@uthm.edu.my
DATA MINING: A GENERAL VIEW
Predictive Data Mining
1. Classification (categorical prediction):
• predicts categorical labels of the dependent variable (class variable) using other available attributes
2. Numerical prediction:
• predicts continuous/numeric values of an outcome variable
Recall… Predictive Data Mining
• Unlike exploratory data analysis (association rules, cluster analysis), in predictive data mining we know exactly what we are looking for.
• Supervised learning
Some definitions
CLASSIFICATION AND PREDICTION
EXAMPLES OF CLASSIFICATION TASKS
• Predicting tumour cells as benign or malignant
PROBLEM ANALYSIS
OVERALL CLASSIFIER
DATASET
MULTIPLE FEATURES
Classification steps
SPLITTING THE DATASET
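The slide's figure is not reproduced here. As a minimal illustrative sketch, a labelled dataset is commonly split into training and test portions, e.g. with scikit-learn (the data values below are made up):

```python
# Minimal sketch: hold part of the labelled data back for testing.
# Assumes scikit-learn is installed; X and y are made-up toy data.
from sklearn.model_selection import train_test_split

X = [[125, 1], [100, 0], [70, 0], [120, 1], [95, 0], [60, 0]]  # feature rows
y = ["No", "No", "No", "No", "Yes", "No"]                      # class labels

# 70% of the rows train the model; the held-out 30% evaluate it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))   # 4 2
```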
ILLUSTRATING CLASSIFICATION TASK

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

A learning algorithm runs on the training set (Induction) to learn a Model.

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The learned model is then applied to the test set (Deduction) to predict the unknown class labels.
SOME CLASSIFICATION TECHNIQUES
• Decision Tree
• Rule-based Methods
• Neural Networks (NN)
• Naïve Bayes
• Support Vector Machines (SVM)
• K-nearest Neighbor (KNN)
• Regression
• Many more
K-NN (NEAREST NEIGHBOURS)
DO NOT CONFUSE WITH K-MEANS CLUSTERING!!
KNN (K Nearest Neighbour) is a supervised learning algorithm. It uses a set of input values to predict output values, and it is one of the simplest machine learning algorithms, used mostly for classification. It classifies a data point based on how its neighbours are classified: new data points are classified according to their similarity to earlier stored data points. For example, given a dataset of tomatoes and bananas, KNN stores similarity measures such as shape and colour; when a new object arrives, it checks the object's similarity in colour (red or yellow) and shape. The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near each other.
KNN tells us which class a new input (test value) belongs to: distances to the stored points are calculated and the k nearest neighbours are chosen. The K in KNN is the number of nearest neighbours used to classify new data points.
K-NN Procedure
• We can implement a K-NN model by following the steps below (a code sketch follows this list):
• Load the data
• To get the predicted class, iterate from 1 to the total number of training data points:
• Calculate the distance between the test data and each row of the training data
• Sort the calculated distances in ascending order by distance value
• Take the top k rows from the sorted array
• Find the most frequent class among these rows
• Return the predicted class
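A minimal sketch of these steps in pure Python (standard library only; the function and variable names are ours, not from the slides):

```python
import math
from collections import Counter

def knn_predict(train, test_point, k):
    """Classify test_point from labelled training rows of (features, label)."""
    # Step 2: compute the distance from the test point to every training row.
    distances = [(math.dist(features, test_point), label)
                 for features, label in train]
    # Step 3: sort the distances in ascending order.
    distances.sort(key=lambda pair: pair[0])
    # Step 4: take the top k rows.
    top_k = distances[:k]
    # Steps 5-6: return the most frequent class among them.
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

# Illustrative usage with made-up points:
train = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (1.5, 1.2), k=3))   # -> "A"
```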
How do I choose k?
• In real-life problems with many points, the question arises of how to select the value of k.
• Choosing the right value of k is called parameter tuning, and it is necessary for better results. A common rule of thumb is to take the square root of the total number of data points in the dataset (sketched below):
• k = sqrt(total number of data points)
• An odd value of k is usually selected to avoid ties between 2 classes.
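The rule of thumb as a sketch (the helper name is ours):

```python
import math

def heuristic_k(n_points):
    """k ~= sqrt(n), rounded up to an odd number to avoid two-class ties."""
    k = round(math.sqrt(n_points))
    return k if k % 2 == 1 else k + 1

print(heuristic_k(100))   # sqrt(100) = 10 -> 11
```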
Remember!!!
1. If K is too small, the prediction is sensitive to noise and the results can be skewed in one direction by a few nearby points.
2. If K is too big, processing will take a long time and consume more resources.
Always remember to calculate the Euclidean distance
We usually use Euclidean distance to find the nearest neighbours. If we have two points (x, y) and (a, b), the formula for the Euclidean distance d is:
d = sqrt((x-a)² + (y-b)²)
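For instance, in Python with illustrative values:

```python
import math

# Distance between (x, y) = (1, 2) and (a, b) = (4, 6):
d = math.sqrt((1 - 4) ** 2 + (2 - 6) ** 2)
print(d)   # sqrt(9 + 16) = 5.0
```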
Example 2: We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples.
2. Calculate the distance between the query instance and all the training samples.
The coordinates of the query instance are (3, 7). Instead of the distance itself, we compute the squared distance, which is faster to calculate (no square root needed).
4. Gather the categories of the nearest neighbours.
Notice in the second row, last column, that the category of one neighbour (Y) is not included because the rank of that data point is greater than 3 (= K).
5. Use a simple majority of the categories of the nearest neighbours as the prediction for the query instance.
We have 2 Good and 1 Bad; since 2 > 1, we conclude that a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 falls in the Good category. (A sketch of the whole calculation follows.)
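The complete calculation as a sketch. The slide's table of four training samples is not reproduced in these notes, so the values below are an assumption (the standard four-sample version of this example); they do reproduce the result quoted above:

```python
from collections import Counter

# Assumed training samples (X1 = acid durability, X2 = strength, category);
# the original table is missing here, so treat these values as illustrative.
train = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]
query = (3, 7)
k = 3

# Step 2: squared distance to the query (no square root needed for ranking).
scored = [((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2, cat)
          for x1, x2, cat in train]
# Step 3: rank by distance and keep the K nearest.
nearest = sorted(scored)[:k]
# Steps 4-5: majority vote among the K categories.
votes = Counter(cat for _, cat in nearest)
print(nearest)                      # [(9, 'Good'), (13, 'Good'), (16, 'Bad')]
print(votes.most_common(1)[0][0])   # 'Good' (2 Good vs 1 Bad)
```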
Another Example
1. Take a dataset with known categories
In this initial step, you are just collecting the unsorted, raw data. In this example, the data is clearly categorized into hares and tortoises.
6. Classify the new point
The new point is classified by a majority vote. If most of your neighbours are tortoises, odds are that you are also a tortoise. In this case, two out of three of the unknown point's neighbours are hares, so the new point is classified as a hare.
Benefits of using the KNN algorithm
• Easy to implement
• The KNN algorithm is widely used for different kinds of learning because of its uncomplicated, easy-to-apply nature.
• There are only two parameters to provide: the value of k and the distance metric.
• Works with any number of classes, not just binary classification.
• It is fairly easy to add new data to the algorithm.
DECISION TREES
• MANY ALGORITHMS:
• HUNT’S ALGORITHM
• CART
• ID3, C4.5
• SLIQ, SPRINT
• etc.
EXAMPLE (1) OF A DECISION TREE
[Decision tree figure from the slide not reproduced.]

EXAMPLE (2) OF A DECISION TREE

Training Data (Refund: categorical, Marital Status: categorical, Taxable Income: continuous, Cheat: class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

The splitting attributes form the internal nodes of the tree:

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO

Each time we receive an answer, a follow-up question is asked until we reach a conclusion about the class label of the record.
ANOTHER EXAMPLE OF A DECISION TREE

Using the same training data as above, a different tree also fits:

MarSt?
  Married -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!
DECISION TREE CLASSIFICATION TASK

The general classification framework, with a tree induction algorithm as the learner:

Training Set (the ten labelled records shown earlier)
  -> Tree Induction algorithm (Induction)
  -> Learn Model
  -> Model: a Decision Tree

Test Set (unlabelled records, e.g. Tid 11: No, Small, 55K, ? and Tid 12: Yes, Medium, 80K, ?)
  -> Apply Model (Deduction) to predict the class labels.
APPLY MODEL TO TEST DATA

Start from the root of the tree and follow the branch that matches the test record at each node.

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO

Refund = No takes us to the MarSt node; MarSt = Married leads directly to the leaf NO, so we assign Cheat to "No".
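The same traversal written as nested conditions, a minimal sketch (the function name is ours, and the handling of exactly 80K is our choice; the slide only shows < 80K and > 80K branches):

```python
def classify_cheat(refund, marital_status, taxable_income):
    """Walk the example tree: Refund -> MarSt -> TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: fall through to the income test.
    # We treat exactly 80K as the ">= 80K" side; the slide leaves this open.
    return "No" if taxable_income < 80_000 else "Yes"

# The test record from the slide: Refund = No, Married, income 80K.
print(classify_cheat("No", "Married", 80_000))   # -> 'No'
```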
TREE INDUCTION
Steps:
SPLITTING BASED ON NOMINAL ATTRIBUTES
• Binary split: divide the values into two subsets, e.g. for CarType:
  {Sports, Luxury} | {Family}   OR   {Family, Luxury} | {Sports}
(A sketch enumerating such splits follows.)
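As an illustrative sketch (not from the slides), the candidate two-way groupings of a nominal attribute can be enumerated like this:

```python
from itertools import combinations

values = ["Family", "Sports", "Luxury"]

# Each binary split puts a non-empty proper subset on one side and the
# rest on the other; fixing the first value on the left counts each
# unordered split exactly once.
for r in range(1, len(values)):
    for left in combinations(values, r):
        if left[0] == values[0]:
            right = [v for v in values if v not in left]
            print(list(left), "|", right)
# ['Family'] | ['Sports', 'Luxury']
# ['Family', 'Sports'] | ['Luxury']
# ['Family', 'Luxury'] | ['Sports']
```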
SPLITTING BASED ON ORDINAL ATTRIBUTES
• Multi-way split: use as many partitions as distinct values, e.g. Size:
  Small | Medium | Large

SPLITTING BASED ON CONTINUOUS ATTRIBUTES
• Binary split: Taxable Income > 80K? -> Yes / No
• Multi-way split: Taxable Income? -> < 10K, ..., > 80K
HOW TO DETERMINE THE BEST SPLIT
Before splitting: 10 records of class 0 and 10 records of class 1.
Compare two candidate child nodes:
  C0: 5, C1: 5 -> non-homogeneous, high degree of impurity
  C0: 9, C1: 1 -> homogeneous, low degree of impurity
Nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity.
Measures Of Node Impurity
• Entropy:                 Entropy(t) = - Σ_j p(j|t) log p(j|t)
• Misclassification Error: Error(t) = 1 - max_i P(i|t)
• Gini Index:              GINI(t) = 1 - Σ_j [p(j|t)]²

Example of node impurity calculations (entropy):
Entropy(t) = - Σ_j p(j|t) log p(j|t)
For a node with C1 = 0 and C2 = 6:
  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
  Entropy = - 0 log 0 - 1 log 1 = - 0 - 0 = 0
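These measures sketched in Python (the function names are ours), checked against the example nodes above:

```python
from math import log2

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t); 0 log 0 is treated as 0."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def misclassification_error(counts):
    """Error(t) = 1 - max_i P(i|t)."""
    return 1 - max(counts) / sum(counts)

print(entropy([0, 6]))   # -0.0  -> homogeneous node, zero impurity
print(entropy([5, 5]))   # 1.0   -> maximally impure 50/50 node
print(entropy([9, 1]))   # ~0.469 -> mostly one class, low impurity
print(gini([5, 5]), gini([9, 1]))        # 0.5, ~0.18
print(misclassification_error([9, 1]))   # 0.1
```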
• IF WIND = ‘WEAK’
THEN PLAYGOLF = YES
• IF WIND = ‘STRONG’ AND OUTLOOK = ‘RAIN’
THEN PLAYGOLF = NO
• IF WIND = ‘STRONG’ AND OUTLOOK = ‘OVERCAST’
THEN PLAYGOLF = NO
• IF WIND = ‘STRONG’ AND OUTLOOK = ‘SUNNY’
THEN PLAYGOLF = YES