ID3 Algorithm & ROC Analysis
Talha KABAKU talha.kabakus@ibu.edu.tr
Agenda
Where are we now? Decision Trees What is ID3? Entropy Information Gain Pros and Cons of ID3 An Example - The Simpsons What is ROC Analysis? ROC Space ROC Space Example over predictions
Where are we now?
Decision Trees
One of the most used classification approach because of its clear model and presentation Classification by using data attributes Aim is to reaching estimating destination field value using source fields Tree Induction Create tree Apply data into tree to classify Each branch node represents a choice between a number of alternatives Each leaf node represents a classification or decision Leaf Count = Rule Count
Decision Trees (Cont.)
Leafs are inserted through top to bottom A B C
Sample Decision Tree
Creating Tree Model by Training Data
Decision Tree Classification Task
Apply Model to Test Data
Apply Model to Test Data (Cont.)
Apply Model to Test Data (Cont.)
Apply Model to Test Data (Cont.)
Apply Model to Test Data (Cont.)
Apply Model to Test Data (Cont.)
Decision Tree Algorithms
Classification and Regression Algorithms Twoig Gini Entropy-based Algorithms ID3 C4.5 Memory-based (Sample-based) Classification Algorithms
Decision Trees by Variable Type
Single Variable Decision Trees Classifications are done with asking questions over only one variable Hybrid Decision Trees Classifications are done with asking questions over both single and multiple variables Multiple Variables Decision Trees Classifications are done with asking questions over multiple variables
ID3 Algorithm
Iterative Dichotomizer 3 Developed by J. Ross Quinlan in 1979 Based on Entropy Only works for discrete data Can not work with defective data Advantage over Hunt's algorithm is choosing the right attribute while classification. (Hunt's algorithm chooses randomly)
Entropy
A formula to calculate the homogeneity of a sample; gives idea about how much information gain provides each leaf A complete homogeneous sample entropy value is 0 An equally divided sample entropy value is 1 Formula:
Information Gain (IG)
Information Gain calculates effective change in entropy after making a decision based on the value of an attribute. Which attribute creates the most homogeneous branches? First the entropy of the total dataset is calculated. The dataset is then split on the different attributes.
Information Gain (Cont.)
The entropy for each branch is calculated. Then it is added proportionally, to get total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The result is the Information Gain, or decrease in entropy. The attribute that yields the largest IG is chosen for the decision node.
Information Gain (Cont.)
A branch set with entropy of 0 is a leaf node. Otherwise, the branch needs further splitting to classify its dataset. The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
ID3 Algorithm Steps
function ID3 (R: a set of non-categorical attributes, C: the categorical attribute, S: a training set) returns a decision tree; begin If S is empty, return a single node with value Failure; If S consists of records all with the same value for the categorical attribute, return a single node with that value; If R is empty, then return a single node with as value the most frequent of the values of the categorical attribute that are found in records of S; [note that then there will be errors, that is, records that will be improperly classified]; Let D be the attribute with largest Gain( D,S) among attributes in R; Let {dj| j=1,2, .., m} be the values of attribute D; Let {Sj| j=1,2, .., m} be the subsets of S consisting respectively of records with value dj for attribute D; Return a tree with root labeled D and arcs labeled d1, d2, .., dm going respectively to the trees ID3(R-{D}, C, S1), ID3(R-{D}, C, S2), .., ID3(R-{D}, C, Sm); end ID3;
Pros of ID3 Algorithm
Builds decision tree in min. steps The most important point while tree induction is collecting enough reliable associated data over specific properties. Asking right questions determines tree induction. Each level benefits from previous level choices Whole dataset is scanned to create tree
Cons of ID3 Algorithm
Tree can not be updated when new data is classified incorrectly, instead a new tree must be generated. Only one attribute at a time is tested for making a decision. Can not work with defective data Can not work with numerical attributes
An Example - The Simpsons
Person Homer Marge Bart Lisa Maggie Abe Selma Otto Krusty Hair Length 0'' 10'' 2'' 6'' 4'' 1'' 8'' 10'' 6'' Weight 250 150 90 78 20 170 160 180 200 Age 36 34 10 8 1 70 41 38 45 Class M F M M F F F M M
Information Gain over Hair Length
E(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911 ==> General Information Gain
Hair Length <= 5 Yes No
E(1F,3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.9710
E(3F,2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) =0.8113
Gain(Hair Length <= 5) = 0.9911 (4/9 * 0.9710 + 5/9 * 0.8113) = 0.0911
Information Gain over Weight
E(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911 ==> General Information Gain
Weight <= 160 Yes No
E(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
E(0F,4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0
Gain(Weight<= 160) = 0.9911 (5/9 * 0.7219 + 4/9 * 0 ) = 0.5900
Information Gain over Age
E(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911 ==> General Information Gain
Age <= 40 Yes No
E(3F,3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
E(1F,2M) = -(1/3)log2(1/3) - (2/3)log2(2/3)= 0.9188
Gain(Age z= 40) = 0.9911 (6/9 * 1 + 3/9 * 0.9183 ) = 0.0183
Results
Attribute Hair Length <= 5 Weight <= 160 Age <= 40 Information Gain (IG) 0.0911 0.5900 0.0183
As seen in the results, weight is the best attribute to classify these group.
Constructed Decision Tree
Weight <= 160 Yes No
Hair Length <= 5 Yes No
Female
Male
Entropy over Nominal Values
If an attribute has nominal values:
First calculate information gain for each attribute value Then calculate attribute information gain
Example II
IG= -(5/15)log2(5/15)-(10/15)log2(10/15) = ~0.918
Example II (Cont.)
Information Gain over Engine
Engine: 6 small, 5 medium, 4 large 3 values for attribute engine, so we need 3 entropy calculations small: 5 no, 1 yes IGsmall = -(5/6)log2(5/6)-(1/6)log2(1/6) = ~0.65 medium: 3 no, 2 yes IGmedium = -(3/5)log2(3/5)-(2/5)log2(2/5) = ~0.97 large: 2 no, 2 yes IGlarge = 1 (evenly distributed subset) => IGEngine = IE(S) [(6/15)*IGsmall + (5/15)*IGmedium + (4/15)*Ilarge] = IGEngine = 0.918 0.85 = 0.068
Example II (Cont.)
Information Gain over SC/Turbo
SC/Turbo: 4 yes, 11 no 2 values for attribute SC/Turbo, so we need 2 entropy calculations yes: 2 yes, 2 no IGturbo = 1 (evenly distributed subset) no: 3 yes, 8 no IGnoturbo = -(3/11)log2(3/11)-(8/11)log2(8/11) = ~0.84 IGturbo = IE(S) [(4/15)*IGturbo + (11/15)*IGnoturbo] IGturbo = 0.918 0.886 = 0.032
Example II (Cont.)
Information Gain over Weight
Weight: 6 Average, 4 Light, 5 Heavy 3 values for attribute weight, so we need 3 entropy calculations average: 3 no, 3 yes IGaverage = 1 (evenly distributed subset) light: 3 no, 1 yes IGlight = -(3/4)log2(3/4)-(1/4)log2(1/4) = ~0.81 heavy: 4 no, 1 yes IGheavy = -(4/5)log2(4/5)-(1/5)log2(1/5) = ~0.72
IGWeight = IE(S) [(6/15)*IGaverage + (4/15)*IGlight + (5/15)*IGheavy] IGWeight = 0.918 0.856 = 0.062
Example II (Cont.)
Information Gain over Full Eco
Fuel Economy: 2 good, 3 average, 10 bad 3 values for attribute Fuel Eco, so we need 3 entropy calculations good: 0 yes, 2 no IGgood = 0 (no variability) average: 0 yes, 3 no IGaverage = 0 (no variability) bad: 5 yes, 5 no IGbad = 1 (evenly distributed subset)
We can omit calculations for good and average since they always end up not fast. IGFuelEco = IE(S) [(10/15)*IGbad] IGFuelEco = 0.918 0.667 = 0.251
Example II (Cont.)
Results:
IGEngine IGturbo IGWeight IGFuelEco 0.068 0.032 0.062 0.251
Root of the tree
Example II (Cont.)
Since we selected the Fuel Eco attribute for our Root Node, it is removed from the table for future calculations.
General Information Gain = 1 (Evenly distributed set)
Example II (Cont.)
Information Gain over Engine
Engine: 1 small, 5 medium, 4 large 3 values for attribute engine, so we need 3 entropy calculations small: 1 yes, 0 no IGsmall = 0 (no variability) medium: 2 yes, 3 no IGmedium = -(2/5)log2(2/5)-(3/5)log2(3/5) = ~0.97 large: 2 no, 2 yes IGlarge = 1 (evenly distributed subset)
IGEngine = IE(SFuelEco) (5/10)*IGmedium + (4/10)*IGlarge] IGEngine = 1 0.885 = 0.115
Example II (Cont.)
Information Gain over SC/Turbo
SC/Turbo: 3 yes, 7 no 2 values for attribute SC/Turbo, so we need 2 entropy calculations yes: 2 yes, 1 no IGturbo = -(2/3)log2(2/3)-(1/3)log2(1/3) = ~0.84 no: 3 yes, 4 no IGnoturbo = -(3/7)log2(3/7)-(4/7)log2(4/7) = ~0.84 IGturbo = IE(SFuelEco) [(3/10)*IGturbo + (7/10)*IGnoturbo] IGturbo = 1 0.965 = 0.035
Example II (Cont.)
Information Gain over Weight
Weight: 3 average, 5 heavy, 2 light 3 values for attribute weight, so we need 3 entropy calculations average: 3 yes, 0 no IGaverage = 0 (no variability) heavy: 1 yes, 4 no IGheavy = -(1/5)log2(1/5)-(4/5)log2(4/5) = ~0.72 light: 1 yes, 1 no IlGight = 1 (evenly distributed subset) IGEngine = IE(SFuel Eco) [(5/10)*IGheavy+(2/10)*IGlight] IGEngine = 1 0.561 = 0.439
Example II (Cont.)
Results:
IGEngine IGturbo IGWeight 0.115 0.035 0.439
Weight has the highest gain, and is thus the best choice.
Example II (Cont.)
Since there are only two items for SC/Turbo where Weight = Light, and the result is consistent, we can simplify the weight = Light path.
Example II (Cont.)
Updated Table: (Weight = Heavy)
All cars with large engines in this table are not fast. Due to inconsistent patterns in the data, there is no way to proceed since medium size engines may lead to either fast or not fast.
ROC Analysis
Receiver Operating Characteristic The limitations of diagnostic accuracy as a measure of decision performance require introduction of the concepts of the sensitivity and specificity of a diagnostic test. These measures and the related indices, true positive rate and false positive rate, are more meaningful than accuracy. ROC curve is shown to be a complete description of this decision threshold effect, indicating all possible combinations of the relative frequencies of the various kinds of correct and incorrect decisions.
ROC Analysis (Cont.)
Combinations of correct & incorrect decisions:
Actual Value
p p n n
Prediction Outcome
p n p n
Description
True Positive Rate (TPR) False Negative Rate (FNR) False Positive Rate (FPR) True Negative Rate (TNR)
TPR is equivalent with sensitivity. FPR is equivalent with 1 - specificity. Best possible prediction would be 100% sensitivity and 100% specificity (which means FPR = 0%).
ROC Space
A ROC space is defined by FPR and TPR as x and y axes respectively, which depicts relative trade-offs between true positive (benefits) and false positive (costs). Since TPR is equivalent with sensitivity and FPR is equal to 1 specificity, the ROC graph is sometimes called the sensitivity vs (1 specificity) plot. Each prediction result one point in the ROC space.
Calculations
Sensitivity TPR = TP / P = TP / (TP + FN) Specificity FPR = FP / N = FP / (FP + TN) Accuracy ACC = (TP + TN) / (P + N)
A ROC Space Example
Let A, B, C, D to be predictions over 100 negative and 100 positive instance:
Prediction/ Combination A B C D TP 63 77 24 76 FP 28 77 88 12 FN 37 23 76 24 TN 72 23 12 88 TPR 0.63 0.77 0.24 0.76 FPR 0.28 0.77 0.88 0.12 ACC 0.68 0.50 0.18 0.82
A ROC Space Example (Cont.)
References
1. Data Mining Course Lectures, Ass. Prof. Nilfer Yurtay 2. Quinlan, J.R. 1986, Machine Learning, 1, 81 3. http://www.cse.unsw.edu. au/~billw/cs9414/notes/ml/06prop/id3/id3.html 4. J. Han, M. Kamber, J. Pie, Data Mining Concepts and Techniques, 3rd Edition, Elsevier, 2011. 5. http://www.cise.ufl.edu/~ddd/cap6635/Fall97/Short-papers/2.htm 6. C. E. Metz, Basic Principles of ROC Analysis, Seminars in Nuclear Medicine, Volume 8, Issue 4, P 283-298