Week 6: Machine Learning
Predictive Data Mining
Introduction
• An observation, or record, is the set of recorded values of variables
associated with a single entity.
• Supervised learning: Data mining methods for predicting an
outcome based on a set of input variables, or features.
• Supervised learning can be used for:
• Estimation of a continuous outcome.
• Classification of a categorical outcome.
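To illustrate the distinction between the two tasks, here is a minimal sketch that fits one model of each type with scikit-learn; the toy data and the choice of linear and logistic regression are illustrative assumptions, not examples from the text.

```python
# A minimal sketch of the two supervised learning tasks; the toy data
# and model choices are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one input feature

# Estimation: predict a continuous outcome.
y_continuous = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
print(LinearRegression().fit(X, y_continuous).predict([[6.0]]))

# Classification: predict a categorical outcome.
y_categorical = np.array([0, 0, 0, 1, 1])
print(LogisticRegression().fit(X, y_categorical).predict([[6.0]]))
```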
The data mining process comprises the following steps:
1. Data sampling. Extract a sample of data that is relevant to the business
problem under consideration.
2. Data preparation. Manipulate the data to put it in a form suitable for formal
modeling.
3. Data partitioning. Divide the sample data into training, validation, and test
sets for assessing the data mining algorithm's performance (a minimal
partitioning sketch follows this list).
4. Model construction. Apply the appropriate data mining technique to the
training data set to accomplish the desired data mining task (classification or
estimation).
5. Model assessment. Evaluate models by comparing performance on the
training and validation data sets.
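To make step 3 concrete, the sketch below partitions a data set with scikit-learn's train_test_split; the generated data and the 60/20/20 split proportions are illustrative assumptions.

```python
# A minimal data-partitioning sketch (step 3); the data set and the
# 60/20/20 split proportions are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First split off 20% of the observations as the test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# ...then split the remainder into training (60%) and validation (20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 of 80% = 20%
```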
Data Sampling, Preparation, and Partitioning
k-Fold Cross-Validation
• k-Fold Cross-Validation: A robust procedure for training and validating
models in which the observations are randomly divided into k subsets,
called folds. In each of k iterations, one fold is designated as the
validation set and the remaining k − 1 folds are designated as the
training set. The results of the k iterations are then combined and
evaluated.
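A minimal k-fold cross-validation sketch with scikit-learn follows; the generated data set, the logistic regression model, and the choice of k = 5 are illustrative assumptions.

```python
# A minimal k-fold cross-validation sketch; the data set, model, and
# k = 5 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

# cv=5 divides the observations into k = 5 folds; each fold serves once
# as the validation set while the remaining 4 folds train the model.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Combined (mean) accuracy:", scores.mean())
```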
Class Imbalanced Data
• There are two basic sampling approaches for modifying the class
distribution of the training set:
• Undersampling: Balances the number of Class 1 and Class 0
observations in a training set by removing majority class
observations from the training set.
• Oversampling: Balances the number of Class 1 and Class 0
observations in a training set by inserting copies of minority class
observations into the training set.
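As a hedged illustration of oversampling, the sketch below balances a toy training set with sklearn.utils.resample; the DataFrame, its column names, and the class counts are assumptions made for the example.

```python
# A minimal oversampling sketch; the toy training DataFrame and its
# column names are illustrative assumptions.
import pandas as pd
from sklearn.utils import resample

train = pd.DataFrame({
    "x": range(10),
    "class": [0] * 8 + [1] * 2,  # 8 majority (Class 0), 2 minority (Class 1)
})

majority = train[train["class"] == 0]
minority = train[train["class"] == 1]

# Oversampling: insert copies of minority-class observations until the
# training set has equal numbers of Class 0 and Class 1 observations.
minority_upsampled = resample(
    minority,
    replace=True,             # sample with replacement, i.e., copies
    n_samples=len(majority),  # match the majority-class count
    random_state=0,
)
train_balanced = pd.concat([majority, minority_upsampled])
print(train_balanced["class"].value_counts())

# Undersampling would instead draw len(minority) majority-class rows
# (replace=False) and keep all minority-class rows.
```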
Performance Measures
Evaluating the Classification of Categorical Outcomes
Evaluating the Estimation of Continuous Outcomes
Evaluating the Classification of Categorical Outcomes:
• Counting the classification errors on a sufficiently large validation and/or
test set that is representative of the population yields an accurate measure
of the model’s classification performance.
• Classification confusion matrix: Displays a model’s correct and incorrect
classifications.
o True Positive (TP)
o False Positive (FP)
o False Negative (FN)
o True Negative (TN)

                        Predicted Class
Actual Class       Positive      Negative
Positive           TP            FN
Negative           FP            TN
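The sketch below builds this matrix with scikit-learn; the actual and predicted labels are illustrative assumptions.

```python
# A minimal confusion-matrix sketch; y_actual and y_predicted are
# illustrative assumptions.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 1, 1, 0, 0, 0, 0, 1]
y_predicted = [1, 0, 1, 0, 0, 1, 0, 1]

# With labels=[1, 0], rows and columns are ordered Positive then
# Negative, matching the table above: [[TP, FN], [FP, TN]].
cm = confusion_matrix(y_actual, y_predicted, labels=[1, 0])
(tp, fn), (fp, tn) = cm
print(cm)
```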
• One minus the overall error rate is often referred to as the accuracy of the model:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
• The ability to correctly predict Class 1 (positive) observations is
commonly expressed as sensitivity, or recall, and is calculated as:
Sensitivity = TP / (TP + FN)
• Precision is a measure that corresponds to the proportion of
observations predicted to be Class 1 by a classifier that are actually in
Class 1:
Precision = TP / (TP + FP)
• The F1 Score combines precision and sensitivity into a single measure and
is defined as:
F1 Score = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)
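The sketch below computes all four measures with scikit-learn; the labels reuse the illustrative assumptions from the confusion-matrix example.

```python
# A minimal sketch computing the classification measures above;
# y_actual and y_predicted are illustrative assumptions.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_actual    = [1, 1, 1, 0, 0, 0, 0, 1]
y_predicted = [1, 0, 1, 0, 0, 1, 0, 1]

print("Accuracy:   ", accuracy_score(y_actual, y_predicted))
print("Sensitivity:", recall_score(y_actual, y_predicted))     # TP / (TP + FN)
print("Precision:  ", precision_score(y_actual, y_predicted))  # TP / (TP + FP)
print("F1 Score:   ", f1_score(y_actual, y_predicted))
```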
• The receiver operating characteristic (ROC) curve is an alternative
graphical approach for displaying the tradeoff between a classifier’s ability
to correctly identify Class 1 observations and its Class 0 error rate.
• In general, we can evaluate the quality of a classifier by computing the area
under the ROC curve, often referred to as the AUC.
• The greater the area under the ROC curve, i.e., the larger the AUC, the
better the classifier performs.
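A minimal ROC/AUC sketch with scikit-learn follows; the actual labels and the predicted Class 1 probabilities are illustrative assumptions.

```python
# A minimal ROC/AUC sketch; y_actual and the predicted Class 1
# probabilities in y_score are illustrative assumptions.
from sklearn.metrics import roc_auc_score, roc_curve

y_actual = [1, 1, 1, 0, 0, 0, 0, 1]
y_score  = [0.9, 0.4, 0.8, 0.2, 0.1, 0.7, 0.3, 0.6]  # P(Class 1)

# roc_curve returns the points of the ROC curve: false positive rate
# (the Class 0 error rate) versus true positive rate (sensitivity).
fpr, tpr, thresholds = roc_curve(y_actual, y_score)
print("AUC:", roc_auc_score(y_actual, y_score))  # larger is better
```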
Figure: Receiver Operating Characteristic (ROC) Curve