Research Trends in Machine Learning: Muhammad Kashif Hanif
Today’s Topic
• Review
• Cross-Validation
Training and Testing Data
• Supervised machine learning models
– Split the data into training and test sets
– Build the model on the training set using the fit method
– Evaluate the model on the test data
• We measure how well our model predicts on data that was not used to train it (see the sketch below).
• The question is: how well does our model fit the training data?
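A minimal sketch of this split/fit/evaluate workflow, assuming the iris dataset and LogisticRegression used in the later slides (max_iter is raised here only to avoid convergence warnings; it is not part of the original example):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
# split the data into training and test sets (default: 75% / 25%)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
# build the model on the training set using the fit method
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# evaluate the model on the test data
print("Test set score: {:.2f}".format(logreg.score(X_test, y_test)))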
Cross-Validation
• train_test_split performs a single random split of the data
– for example, 75% of the data for training and 25% for testing
• Cross-validation is a statistical method to evaluate generalization performance
• The data is split repeatedly and multiple models are trained, as sketched below
• Disadvantage: computational cost
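A minimal sketch of this repeated-splitting idea, done by hand with train_test_split (cross_val_score, shown on the next slides, automates this; the five seeds and max_iter are arbitrary choices, not part of the original example):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
scores = []
for seed in range(5):
    # one independent random split and one model per iteration
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
print("Scores per split:", scores)

Unlike k-fold cross-validation, introduced next, these random splits may overlap; k-fold partitions the data systematically.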
k-fold cross-validation
• k-fold cross-validation
– k is a user-specified number
– the default value of k in scikit-learn was 3 (5 since scikit-learn 0.22)
– normally, k is 5 or 10
– see the index sketch after this list
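A minimal sketch of how k folds partition the sample indices, on toy data with no model (each fold serves as the test set exactly once; the remaining folds form the training set):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # six toy samples
kfold = KFold(n_splits=3)
for i, (train_idx, test_idx) in enumerate(kfold.split(X)):
    print("Fold {}: train={} test={}".format(i, train_idx, test_idx))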
k-fold cross-validation (cont…)
• We compute the accuracy for each split
• Normally, we report the mean of all accuracy scores
Example
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
logreg = LogisticRegression()
# cross_val_score splits the data, fits one model per fold,
# and returns the score of each fold
scores = cross_val_score(logreg, iris.data, iris.target)
print("Cross-validation scores: {}".format(scores))
scores = cross_val_score(logreg, iris.data, iris.target, cv=5)
print("Cross-validation scores: {}".format(scores))
Benefits of cross-validation
• Using cross-validation, each example will be in the test set exactly once
– each example is in one of the folds, and
– each fold is the test set once
– the model must generalize well to all of the samples in the dataset for all of the cross-validation scores (and their mean) to be high
• Provides some information about how sensitive our model is to the selection of the training dataset
Acknowledgement
• Slide contents are based on Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido
• Examples are taken from scikit-learn
k-Fold Cross-Validation
• Iris labels are sorted by class: 50 samples of class 0, then 50 of class 1, then 50 of class 2
• For k = 3 without shuffling:
– the first fold contains only class 0
– the second fold contains only class 1
– the third fold contains only class 2
– cross-validation accuracy = 0 for every split
Stratified k-Fold Cross-Validation
• Stratified cross-validation splits the data such that the proportions between classes are the same in each fold as they are in the whole dataset
[Figure: comparison of standard and stratified k-fold splitting. Source: Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido]
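A minimal sketch using scikit-learn's StratifiedKFold (note that cross_val_score already uses stratified folds by default when the estimator is a classifier):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = load_iris()
logreg = LogisticRegression()
# each fold preserves the class proportions of the whole dataset
skfold = StratifiedKFold(n_splits=3)
print("Cross-validation scores:\n{}".format(
    cross_val_score(logreg, iris.data, iris.target, cv=skfold)))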
k-Fold Cross-Validation (Example)
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5)
print("Cross-validation scores:\n{}".format(
    cross_val_score(logreg, iris.data, iris.target, cv=kfold)))
kfold = KFold(n_splits=3)
print("Cross-validation scores:\n{}".format(
    cross_val_score(logreg, iris.data, iris.target, cv=kfold)))

Cross-validation scores: [0. 0. 0.]
Shuffle data
• Shuffling the data before splitting (shuffle=True) removes the ordering by class label, so each fold contains a mix of all classes

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
print("Cross-validation scores:\n{}".format(
    cross_val_score(logreg, iris.data, iris.target, cv=kfold)))
Leave-one-out cross-validation
• Each fold is a single sample
• For each split, a single data point is selected as the test set
• Time-consuming for large datasets
Leave-one-out cross-validation
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
print("Number of cv iterations: ", len(scores))
print("Mean accuracy: {:.2f}".format(scores.mean()))