Research Trends in Machine Learning: Muhammad Kashif Hanif
Today’s Topic
• Review
• Cross-Validation
Training and Testing Data
• Supervised machine learning models
– Split the data into training and test sets
– Build the model on the training set using the fit method
– Evaluate the model on the test data
• We measure how well our model predicts on data that was not used to train it (see the sketch below).
• The question is: how well does our model fit the training data?
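A minimal sketch of this split/fit/evaluate workflow, assuming the iris dataset and LogisticRegression used in the later slides (max_iter is raised here only to avoid convergence warnings; it is not part of the original example):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
# split the data into training and test sets (default: 75% / 25%)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
# build the model on the training set using the fit method
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# evaluate the model on the test data
print("Test set score: {:.2f}".format(logreg.score(X_test, y_test)))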
Cross-Validation
• train_test_split performs a single random split of the data
– for example, 75% of the data for training and 25% for testing
• Cross-validation is a statistical method to evaluate generalization performance
• The data is split repeatedly and multiple models are trained, as sketched below
• Disadvantage: computational cost
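A minimal sketch of this repeated-splitting idea, done by hand with train_test_split (cross_val_score, shown on the next slides, automates this; the five seeds and max_iter are arbitrary choices, not part of the original example):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
scores = []
for seed in range(5):
    # one independent random split and one model per iteration
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
print("Scores per split:", scores)

Unlike k-fold cross-validation, introduced next, these random splits may overlap; k-fold partitions the data systematically.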
k-fold cross-validation
• k-fold cross-validation
– k is a user-specified number
– the default value of k in scikit-learn was 3 (5 since scikit-learn 0.22)
– normally, k is 5 or 10
– see the index sketch after this list
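A minimal sketch of how k folds partition the sample indices, on toy data with no model (each fold serves as the test set exactly once; the remaining folds form the training set):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # six toy samples
kfold = KFold(n_splits=3)
for i, (train_idx, test_idx) in enumerate(kfold.split(X)):
    print("Fold {}: train={} test={}".format(i, train_idx, test_idx))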
k-fold cross-validation (cont…)
• We compute the accuracy for each split
• Normally, we report the mean of all accuracy scores
Example
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
logreg = LogisticRegression()
# cross_val_score splits the data, fits one model per fold,
# and returns the score of each fold
scores = cross_val_score(logreg, iris.data, iris.target)
print("Cross-validation scores: {}".format(scores))
scores = cross_val_score(logreg, iris.data, iris.target, cv=5)
print("Cross-validation scores: {}".format(scores))
Benefits of cross-validation
• Using cross-validation, each example will be in the test set exactly once
– each example is in one of the folds, and
– each fold is the test set once
– the model must generalize well to all of the samples in the dataset for all of the cross-validation scores (and their mean) to be high
• Provides some information about how sensitive our model is to the selection of the training dataset
Acknowledgement
• Slide contents are based on Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido
• Examples are taken from scikit-learn
k-Fold Cross-Validation
• Iris labels are sorted by class: 50 samples of class 0, then 50 of class 1, then 50 of class 2
• For k = 3 without shuffling:
– the first fold contains only class 0
– the second fold contains only class 1
– the third fold contains only class 2
– cross-validation accuracy = 0 for every split
Stratified k-Fold Cross-Validation
• Stratified cross-validation splits the data such that the proportions between classes are the same in each fold as they are in the whole dataset
[Figure: comparison of standard and stratified k-fold splitting. Source: Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido]
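A minimal sketch using scikit-learn's StratifiedKFold (note that cross_val_score already uses stratified folds by default when the estimator is a classifier):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = load_iris()
logreg = LogisticRegression()
# each fold preserves the class proportions of the whole dataset
skfold = StratifiedKFold(n_splits=3)
print("Cross-validation scores:\n{}".format(
    cross_val_score(logreg, iris.data, iris.target, cv=skfold)))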
k-Fold Cross-Validation (Example)
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5)
print("Cross-validation scores:\n{}".format(
    cross_val_score(logreg, iris.data, iris.target, cv=kfold)))
kfold = KFold(n_splits=3)
print("Cross-validation scores:\n{}".format(
    cross_val_score(logreg, iris.data, iris.target, cv=kfold)))

Cross-validation scores: [0. 0. 0.]
Shuffle data
• Shuffling the data before splitting (shuffle=True) removes the ordering by class label, so each fold contains a mix of all classes

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
print("Cross-validation scores:\n{}".format(
    cross_val_score(logreg, iris.data, iris.target, cv=kfold)))
Leave-one-out cross-validation
• Each fold is a single sample
• For each split, a single data point is selected as the test set
• Time-consuming for large datasets
Leave-one-out cross-validation
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
print("Number of cv iterations: ", len(scores))
print("Mean accuracy: {:.2f}".format(scores.mean()))