
BIT 33603 : Data Mining

Lecture 5
CLASSIFICATION

Professor Dr. Nazri Mohd Nawi

FSKTM, UTHM
nazri@uthm.edu.my
DATA MINING: A GENERAL VIEW

Predictive Data Mining

Two broad types:

1. Classification (categorical prediction):
   • predicts categorical labels of the dependent variable (class variable) using the other available attributes

2. Numerical prediction:
   • predicts continuous/numeric values of an outcome variable
Recall… Predictive Data Mining
• Unlike exploratory data analysis (association rules, cluster analysis), in predictive analytics we know exactly what we are looking for.
• We are trying to predict a well-defined outcome using existing data.
• We have to identify a dependent variable (the outcome of interest) that is predicted from the independent variables (attributes/features).
• This is supervised learning.
Some definitions

CLASSIFICATION AND PREDICTION

Classification: finding models (functions) that describe and distinguish classes
  E.g., classify people as healthy or sick, or classify transactions as fraudulent or not
  Methods: decision trees, classification rules, neural networks

Prediction: predicting unknown (typically numeric) values

For classification, the class label (output attribute) must be a discrete attribute (categorical/nominal).
For prediction and regression, the output attribute (dependent variable) must be a continuous attribute.
CLASSIFICATION

• Given a collection of records (the training set):
  • each record contains a set of attributes; one of the attributes is the class.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
EXAMPLES OF CLASSIFICATION TASKS
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
PROBLEM ANALYSIS

OVERALL CLASSIFIER

DATASET

MULTIPLE FEATURES
Classification steps

• (1) Train the model (training step)
  • model construction based on training data
  • uses a training set: data objects whose class labels are known
• (2) Validate the model (validation step)
  • fine-tune/refine the classification model on validation data
• (3) Test the model (testing step)
  • on a test sample whose class labels are known but were not used for training the model
  • measure the accuracy of the final model using the test data
• (4) Use the model for classification
  • on new data whose class labels are unknown
(A minimal data-splitting sketch follows this list.)
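A minimal sketch of splitting data for steps (1)-(3), assuming scikit-learn is installed; X and y here simply mirror the Attrib3 and Class columns of the table shown on the next slide, purely for illustration:

# hold out a test set whose labels are known but never used for training,
# then carve a validation set out of the remainder for fine-tuning
from sklearn.model_selection import train_test_split

X = [[125], [100], [70], [120], [95], [60], [220], [85], [75], [90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)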
CLASSIFICATION STEPS

SPLITTING THE DATASET
ILLUSTRATING CLASSIFICATION TASK

Training Set (fed to the learning algorithm, which induces / learns a model):

Tid | Attrib1 | Attrib2 | Attrib3 | Class
 1  | Yes | Large  | 125K | No
 2  | No  | Medium | 100K | No
 3  | No  | Small  | 70K  | No
 4  | Yes | Medium | 120K | No
 5  | No  | Large  | 95K  | Yes
 6  | No  | Medium | 60K  | No
 7  | Yes | Large  | 220K | No
 8  | No  | Small  | 85K  | Yes
 9  | No  | Medium | 75K  | No
10  | No  | Small  | 90K  | Yes

Test Set (the model is then applied, by deduction, to records with unknown class):

Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No  | Small  | 55K  | ?
12  | Yes | Medium | 80K  | ?
13  | Yes | Large  | 110K | ?
14  | No  | Small  | 95K  | ?
15  | No  | Large  | 67K  | ?
SOME CLASSIFICATION TECHNIQUES
• Decision Tree
• Rule-based Methods
• Neural Networks (NN)
• Naïve Bayes
• Support Vector Machines (SVM)
• K-nearest Neighbor (KNN)
• Regression
• Many more

K-NN (NEAREST NEIGHBOURS)

DO NOT CONFUSE WITH K-MEANS CLUSTERING!!

KNN (K Nearest Neighbours) is a supervised learning algorithm. It uses a set of input values to predict output values. KNN is one of the simplest machine learning algorithms and is mostly used for classification. It classifies a data point based on how its neighbours are classified.

KNN classifies new data points based on a similarity measure to the previously stored data points. For example, suppose we have a dataset of tomatoes and bananas. KNN stores measurements such as shape and colour; when a new object arrives, it checks the object's similarity in colour (red or yellow) and shape. The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near each other.

KNN tells us which class a new input (test value) belongs to: the k nearest neighbours are chosen, the distances to them are calculated, and the majority class among them is taken. The K in KNN is the number of nearest neighbours used to classify new data points.
K-NN Procedure
• We can implement a K-NN model by following the steps below (a code sketch follows this list):
  • load the data
  • initialise the value of k
  • to get the predicted class, iterate over all the training data points:
    • calculate the distance between the test data and each row of the training data
    • sort the calculated distances in ascending order
    • take the top k rows from the sorted array
    • find the most frequent class among these rows
    • return the predicted class
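A minimal sketch of these steps in plain Python; the names euclidean and knn_predict, and the use of Counter for the majority vote, are illustrative choices rather than part of the slides:

import math
from collections import Counter

def euclidean(a, b):
    # distance between two points given as equal-length lists of numbers
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    # distance from the query to every training row
    distances = [(euclidean(row, query), label) for row, label in zip(train_X, train_y)]
    # sort ascending by distance and keep the top k rows
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # return the most frequent class among the k neighbours
    return Counter(label for _, label in nearest).most_common(1)[0][0]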

How do I choose k?
• In real-life problems with many points, the question is how to select the value of k.
• Choosing the right value of k is called parameter tuning, and it is necessary for good results. A common rule of thumb is to take the square root of the total number of data points in the dataset:
  • k = sqrt(total number of data points)
• An odd value of k is usually selected to avoid ties between 2 classes. (A small sketch of this rule follows.)
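A minimal sketch of this rule of thumb; the helper name choose_k is my own, not from the slides:

def choose_k(n_points):
    # square root of the number of data points, rounded, forced to be odd
    k = max(1, round(n_points ** 0.5))
    return k if k % 2 == 1 else k + 1

# e.g. choose_k(8) -> 3, matching the sqrt(8) = 2.82 ≈ 3 example later in these slides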


Remember!!!
1. If K is too small, the prediction is sensitive to noise and the results can be skewed by a few outlying points.
2. If K is too big, it will take a long time to process and use a lot of resources.
Always remember to calculate the Euclidean distance
We usually use the Euclidean distance to find the nearest neighbours. If we have two points (x, y) and (a, b), the formula for the Euclidean distance d is:
    d = sqrt((x-a)² + (y-b)²)

Example 2: We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), used to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Y = Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good
Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7.
Without another expensive survey, can we guess what the classification of this new tissue is?

1. Determine the parameter K = number of nearest neighbours. Suppose we use K = 3.

2. Calculate the distance between the query instance and all the training samples.
The coordinate of the query instance is (3, 7). Instead of calculating the distance we compute the squared distance, which is faster to calculate (no square root needed):

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Squared distance to query instance (3, 7)
7 | 7 | (7-3)² + (7-7)² = 16
7 | 4 | (7-3)² + (4-7)² = 25
3 | 4 | (3-3)² + (4-7)² = 9
1 | 4 | (1-3)² + (4-7)² = 13
3. Sort the distances and determine the nearest neighbours based on the K-th minimum distance:

X1 | X2 | Squared distance to (3, 7) | Rank (minimum distance) | Included in 3-nearest neighbours?
7 | 7 | 16 | 3 | Yes
7 | 4 | 25 | 4 | No
3 | 4 |  9 | 1 | Yes
1 | 4 | 13 | 2 | Yes
4. Gather the categories of the nearest neighbours.
Notice in the second row of the last column that the category of the nearest neighbour (Y) is not included, because the rank of this record is greater than 3 (= K).

X1 | X2 | Squared distance to (3, 7) | Rank | Included in 3-nearest neighbours? | Y = Category of nearest neighbour
7 | 7 | 16 | 3 | Yes | Bad
7 | 4 | 25 | 4 | No  | -
3 | 4 |  9 | 1 | Yes | Good
1 | 4 | 13 | 2 | Yes | Good

5. Use a simple majority vote over the categories of the nearest neighbours as the prediction for the query instance.

We have 2 Good and 1 Bad; since 2 > 1, we conclude that the new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 falls in the Good category. (A short code check of this example follows.)
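A minimal sketch that reproduces this worked example; the variable names are my own:

from collections import Counter

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
K = 3

# squared distances, as in step 2 (no square root needed for ranking)
dists = [((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2, label) for (x1, x2), label in train]
nearest = sorted(dists)[:K]                      # step 3: rank and keep the 3 closest
votes = Counter(label for _, label in nearest)   # step 4: gather their categories
print(nearest)                                   # [(9, 'Good'), (13, 'Good'), (16, 'Bad')]
print(votes.most_common(1)[0][0])                # step 5: majority vote -> 'Good'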
Another Example

1. Take a dataset with known categories
In this initial step, you're just collecting the unsorted, raw data. In this example, the data is clearly categorized into hares and tortoises.

2. Cluster the data
You've got a few choices in this step; how you cluster the data is up to you (e.g. with PCA or another clustering method).

3. Add a point (cell) with an unknown category

4. Find the "k"
Perhaps the most challenging step is finding a k that's "just right". The square root of n (the number of items in the data set) is an easy place to start:
• √(n) = √(8) = 2.82 ≈ 3
5. Locate the "k" nearest neighbours
For this example, the nearest neighbours are simply located visually.

6. Classify the new point
The new point is classified by a majority vote. If most of your neighbours are tortoises, odds are that you're also a tortoise. In this case, two out of three of the unknown point's neighbours are hares, so the new point is classified as a hare.
Benefits of using the KNN algorithm
• Easy to implement: the KNN algorithm is widely used for different kinds of learning because of its uncomplicated, easy-to-apply nature.
• There are only two things to specify in the algorithm: the value of k and the distance metric.
• Works with any number of classes, not just binary classification.
• It is fairly easy to add new data to the algorithm.

Disadvantages of the KNN algorithm
• The cost of predicting the k nearest neighbours is very high.
• Doesn't work as expected with a large number of features/parameters.
• Hard to work with categorical features.
• Bad for large datasets.
DECISION TREE
(also known as a CLASSIFICATION TREE)

A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions.
• A supervised learning program that generalizes a set of input instances by building a decision tree.
DECISION TREE INDUCTION

• MANY ALGORITHMS:
• HUNT’S ALGORITHM
• CART
• ID3, C4.5
• SLIQ, SPRINT
• etc.
EXAMPLE (1) OF A DECISION TREE

EXAMPLE (2) OF A DECISION TREE

Training Data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes | Single   | 125K | No
 2  | No  | Married  | 100K | No
 3  | No  | Single   | 70K  | No
 4  | Yes | Married  | 120K | No
 5  | No  | Divorced | 95K  | Yes
 6  | No  | Married  | 60K  | No
 7  | Yes | Divorced | 220K | No
 8  | No  | Single   | 85K  | Yes
 9  | No  | Married  | 75K  | No
10  | No  | Single   | 90K  | Yes

Decision tree built on the splitting attributes:
  Refund?
    Yes -> NO
    No  -> MarSt?
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES
             Married          -> NO

Each time we receive an answer, a follow-up question is asked until we reach a conclusion about the class label of the record.
ANOTHER EXAMPLE OF A DECISION TREE

Built on the same training data, but starting with a different attribute:
  MarSt?
    Married          -> NO
    Single, Divorced -> Refund?
                          Yes -> NO
                          No  -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

There could be more than one tree that fits the same data!
DECISION TREE CLASSIFICATION TASK

The same general framework as before, with the model being a decision tree:

  Training Set (Tid, Refund, Marital Status, Taxable Income, Cheat)
      -> Tree Induction algorithm -> Learn Model -> Decision Tree

  Test Set (records whose Cheat value is ?)
      -> Apply Model (Deduction) using the Decision Tree -> predicted class labels
APPLY MODEL TO TEST DATA

Start from the root of the tree and, at each node, follow the branch that matches the test record.

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

  Refund?
    Yes -> NO
    No  -> MarSt?
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES
             Married          -> NO

Refund = No, so follow the "No" branch to MarSt. Marital Status = Married, so follow the "Married" branch, which is a leaf: assign Cheat = "No".
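A minimal sketch, not from the slides, of applying this particular tree to a test record; treating an income of exactly 80K as the "> 80K" branch is my assumption, although the example record is resolved at the Married leaf before income is ever checked:

def classify(record):
    # traverse the Refund -> MarSt -> TaxInc tree from the root to a leaf
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: split on taxable income at 80K
    return "No" if record["TaxInc"] < 80 else "Yes"

print(classify({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))  # -> 'No'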
TREE INDUCTION

Steps:

1) Determine how to split the records
   a) How to specify the attribute test condition?
   b) How to determine the best split?

2) Determine when to stop splitting
HOW TO SPECIFY THE ATTRIBUTE TEST CONDITION?

• Depends on the attribute type:
  • nominal
  • ordinal
  • continuous

• Depends on the number of ways to split:
  • 2-way split (binary split)
  • multi-way split
SPLITTING BASED ON NOMINAL ATTRIBUTES

• MULTI-WAY SPLIT: use as many partitions as there are distinct values.
    CarType -> Family | Sports | Luxury

• BINARY SPLIT: divides the values into two subsets; need to find the optimal partitioning (a small sketch of enumerating the candidates follows).
    CarType -> {Sports, Luxury} | {Family}   OR   CarType -> {Family, Luxury} | {Sports}
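A small sketch, my own illustration rather than the slides', that enumerates the candidate binary partitions of a nominal attribute; a learner would then score each candidate with an impurity measure:

from itertools import combinations

def binary_partitions(values):
    # each non-empty proper subset (up to symmetry) defines one candidate binary split
    values = list(values)
    seen = []
    for r in range(1, len(values) // 2 + 1):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            if (right, left) not in seen:   # avoid listing {A}|{B,C} and {B,C}|{A} twice
                seen.append((left, right))
    return seen

print(binary_partitions(["Family", "Sports", "Luxury"]))
# [(('Family',), ('Sports', 'Luxury')), (('Sports',), ('Family', 'Luxury')), (('Luxury',), ('Family', 'Sports'))]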
SPLITTING BASED ON ORDINAL ATTRIBUTES

• MULTI-WAY SPLIT: use as many partitions as there are distinct values.
    Size -> Small | Medium | Large

• BINARY SPLIT: divides the values into two subsets; need to find the optimal partitioning, and the subsets should preserve the order of the values.
    Size -> {Small, Medium} | {Large}   OR   Size -> {Small} | {Medium, Large}
SPLITTING BASED ON CONTINUOUS ATTRIBUTES

(i) Binary split: compare against a threshold, e.g.
      Taxable Income > 80K?  ->  Yes / No

(ii) Multi-way split: discretize the range into intervals, e.g.
      Taxable Income?  ->  < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K
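A small illustrative sketch of the two options; the interval labels are assumptions based on the bins above:

income = 67  # taxable income in thousands (K)

# (i) binary split against a threshold
branch = "Yes" if income > 80 else "No"

# (ii) multi-way split by discretizing into intervals
bins = [(0, 10), (10, 25), (25, 50), (50, 80), (80, float("inf"))]
interval = next(f"[{lo}K, {hi}K)" for lo, hi in bins if lo <= income < hi)
print(branch, interval)  # No [50K, 80K)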


TREE INDUCTION

Steps:

1) Determine how to split the records
   a) How to specify the attribute test condition?
   b) How to determine the best split?

2) Determine when to stop splitting
Example data: 20 customer records, 10 of class 0 and 10 of class 1
(attributes include Customer ID, Own car, Car type, Salary, Marital status; class attribute: Cheat)

Customer ID | Own car | Car type | Cheat
231 | Y | Family | 0
237 | Y | Sport  | 0
349 | N | Sport  | 0
054 | N | Luxury | 1
569 | N | Sport  | 0
464 | N | Family | 1
745 | Y | Sport  | 0
552 | N | Luxury | 1
674 | N | Luxury | 0
562 | Y | Sport  | 0
344 | N | Luxury | 1
865 | Y | Sport  | 0
455 | N | Family | 1
234 | N | Sport  | 0
988 | Y | Sport  | 0
656 | N | Luxury | 1
856 | Y | Family | 1
667 | Y | Luxury | 1
454 | Y | Luxury | 1
334 | Y | Luxury | 1
HOW TO DETERMINE THE BEST SPLIT
Before splitting: 10 records of class 0, 10 records of class 1.
Which test condition is the best?

HOW TO DETERMINE THE BEST SPLIT
• Greedy approach:
  • nodes with a homogeneous class distribution are preferred
  • How do we know? We need to measure node impurity (the smaller, the better):

    C0: 5, C1: 5  -> non-homogeneous, high degree of impurity
    C0: 9, C1: 1  -> homogeneous, low degree of impurity
Measures of Node Impurity

• Entropy:                 Entropy(t) = - Σ_j p(j|t) log p(j|t)

• Gini index:              GINI(t) = 1 - Σ_j [p(j|t)]²

• Misclassification error: Error(t) = 1 - max_i P(i|t)

Example of node impurity calculations

• Gini index:              GINI(t) = 1 - Σ_j [p(j|t)]²
• Entropy:                 Entropy(t) = - Σ_j p(j|t) log p(j|t)
• Misclassification error: Error(t) = 1 - max_i P(i|t)
SPLITTING CRITERIA BASED ON ENTROPY

• Entropy at a given node t:

    Entropy(t) = - Σ_j p(j|t) log p(j|t)

  (Note: p(j|t) is the relative frequency of class j at node t.)

• Measures the homogeneity of a node:
  • maximum (log n_c, for n_c classes) when the records are equally distributed among all classes, implying the least information
  • minimum (0.0) when all records belong to one class, implying the most information
EXAMPLES FOR COMPUTING ENTROPY

    Entropy(t) = - Σ_j p(j|t) log2 p(j|t)

Node 1: C1 = 0, C2 = 6
    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
    Entropy = - 0 log 0 - 1 log 1 = - 0 - 0 = 0

Node 2: C1 = 1, C2 = 5
    P(C1) = 1/6,  P(C2) = 5/6
    Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

Node 3: C1 = 2, C2 = 4
    P(C1) = 2/6,  P(C2) = 4/6
    Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
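A minimal sketch, with illustrative helper names, that reproduces these values and the other two impurity measures defined earlier; note the convention 0 log 0 = 0:

import math

def entropy(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]     # 0 * log 0 treated as 0
    return -sum(p * math.log2(p) for p in probs)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def classification_error(counts):
    total = sum(counts)
    return 1 - max(counts) / total

print(round(entropy([0, 6]), 2), round(entropy([1, 5]), 2), round(entropy([2, 4]), 2))  # 0.0 0.65 0.92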
INFO GAIN...

    GAIN_split = impurity(Parent) - Σ_{i=1..k} (n_i / n) × impurity(i)

where n   = total number of records at the parent node
      k   = number of attribute values (partitions)
      n_i = number of records associated with child node i

• INFORMATION GAIN (using entropy as the impurity measure):

    GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) × Entropy(i)

  The second term is the weighted average impurity of the child nodes: the smaller, the better.
  The parent node p is split into k partitions; n_i is the number of records in partition i.

• Measures the reduction in entropy achieved because of the split.
• Gain: the higher, the better, because later we select the attribute to split on based on its gain value.

To get the impurity: calculate the entropy (the smaller, the better).
To know the node order in a tree: calculate the gain (the higher, the better). (A small gain sketch follows.)
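A minimal sketch of the gain computation; the parent and child counts are made-up numbers for demonstration, not taken from the slides:

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# parent node with 10 records of class 0 and 10 of class 1,
# split into two children with class counts (7, 3) and (3, 7)
print(round(information_gain([10, 10], [[7, 3], [3, 7]]), 3))  # ~0.119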
PRODUCTION RULES

Each root-to-leaf path of the tree becomes one rule (a small sketch follows this list):
• IF WIND = 'WEAK'
  THEN PLAYGOLF = YES
• IF WIND = 'STRONG' AND OUTLOOK = 'RAIN'
  THEN PLAYGOLF = NO
• IF WIND = 'STRONG' AND OUTLOOK = 'O'CAST'
  THEN PLAYGOLF = NO
• IF WIND = 'STRONG' AND OUTLOOK = 'SUNNY'
  THEN PLAYGOLF = YES
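A minimal sketch, not from the slides, expressing these four rules directly in code:

def playgolf(wind, outlook):
    # each branch corresponds to one production rule above
    if wind == "WEAK":
        return "YES"
    if outlook == "RAIN":
        return "NO"
    if outlook == "O'CAST":
        return "NO"
    return "YES"  # WIND = 'STRONG' AND OUTLOOK = 'SUNNY'

print(playgolf("STRONG", "SUNNY"))  # YES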
TREE INDUCTION

Steps:

1) Determine how to split the records
   a) How to specify the attribute test condition?
   b) How to determine the best split?

2) Determine when to stop splitting
STOPPING CRITERIA FOR TREE INDUCTION
(when to stop splitting and create a leaf node)

• Stop expanding a node when all of its records belong to the same class

• Stop expanding a node when all of its records have similar attribute values