13 Cross-Validation
Cross-validation is a technique for evaluating ML models by training several models on subsets of the
available input data and evaluating them on the complementary subsets. Use cross-validation to
detect overfitting, i.e., failing to generalize a pattern.
The three steps involved in cross-validation are as follows:
1. Split the dataset into k equally sized folds (partitions).
2. For each fold, train the model on the remaining k-1 folds and evaluate it on the held-out fold.
3. Average the k evaluation scores to estimate how well the model generalizes.
Demonstration
In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, auc
from sklearn.model_selection import StratifiedKFold
import warnings
warnings.simplefilter('ignore')
In [2]:
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = pd.Series(cancer.target)
df.head()
Out[2]:
[first 5 rows of the DataFrame: mean radius, mean texture, mean perimeter, mean area,
mean smoothness, mean compactness, mean concavity, mean concave points, mean symmetry, ..., target]
5 rows × 31 columns
In [3]:
X = df.drop('target', axis=1)
y = df['target'].astype('category')
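The cell that creates X_train, X_test, y_train, and y_test (In [4]) is not shown; a minimal sketch of a standard train/test split, where test_size and random_state are assumptions rather than the original values:
In [4]:
from sklearn.model_selection import train_test_split

# Hold out a test set for the single-split evaluation below.
# test_size and random_state are assumed values.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=45)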
In [5]:
lr_manual = LogisticRegression()
lr_manual.fit(X_train, y_train)
Out[5]:
LogisticRegression()
In [6]:
confusion_matrix(y_test, lr_manual.predict(X_test))
Out[6]:
array([[39, 5],
[ 0, 70]], dtype=int64)
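In scikit-learn's convention the rows of this matrix are the true classes (0 = malignant, 1 = benign in this dataset) and the columns the predicted classes, so 5 malignant cases were misclassified as benign and no benign cases were misclassified. To turn the matrix into per-class summary metrics, a short sketch reusing the fitted lr_manual:
from sklearn.metrics import accuracy_score, classification_report

# Per-class precision, recall, and F1 for the single held-out split.
y_pred = lr_manual.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=cancer.target_names))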
Use StratifiedKFold (it preserves the class ratio of y in every fold, which matters here because benign cases outnumber malignant ones)
In [7]:
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)
pred_test_full = 0
cv_score = []
i = 1
for train_index, test_index in kf.split(X, y):
    print('{} of KFold {}'.format(i, kf.n_splits))
    # split rows into the training and validation folds for this iteration
    xtr, xvl = X.loc[train_index], X.loc[test_index]
    ytr, yvl = y.loc[train_index], y.loc[test_index]
    # model
    lr = LogisticRegression(C=2)
    lr.fit(xtr, ytr)
    score = roc_auc_score(yvl, lr.predict(xvl))
    print('ROC AUC score:', score)
    cv_score.append(score)
    # pred_test = lr.predict_proba(x_test)[:, 1]
    # pred_test_full += pred_test
    i += 1
1 of KFold 5
ROC AUC score: 0.9415329184408779
2 of KFold 5
ROC AUC score: 0.932361611529643
3 of KFold 5
ROC AUC score: 0.9523809523809523
4 of KFold 5
ROC AUC score: 0.9692460317460316
5 of KFold 5
ROC AUC score: 0.9312541918175722
In [8]:
# note: yvl and lr still hold the last fold, so this matrix covers fold 5 only
print('Confusion matrix\n', confusion_matrix(yvl, lr.predict(xvl)))
print('Cv', cv_score, '\nMean cv Score', np.mean(cv_score))
Confusion matrix
[[38 4]
[ 3 68]]
Cv [0.9415329184408779, 0.932361611529643, 0.9523809523809523, 0.9692460317460316, 0.9312541918175722]
Mean cv Score 0.9453551411830154
Here I use logistic regression to demonstrate k-fold cross-validation; you can use any algorithm.
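For comparison, scikit-learn can run the same stratified 5-fold evaluation in a single call with cross_val_score; a minimal sketch, reusing the X, y, and kf defined above (note that the 'roc_auc' scorer uses decision scores rather than hard label predictions, so its per-fold numbers can differ slightly from the manual loop):
from sklearn.model_selection import cross_val_score

# Same splitter and model as above; score each fold by ROC AUC.
scores = cross_val_score(LogisticRegression(C=2), X, y, cv=kf, scoring='roc_auc')
print('Cv', list(scores), '\nMean cv Score', scores.mean())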