Closed
Description
Describe the bug
I trained several classifiers on my dataset and then created an ensemble classifier (VotingClassifier) from them. Each of the estimators stored in .estimators_ has been fit, and each works both on its own and within the ensemble. However, even after being extracted from the ensemble, they fail a check_is_fitted test, so I cannot use them on their own in a context that checks for fit, or in another ensemble classifier.
Steps/Code to Reproduce
# Handle imports
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.validation import check_is_fitted
from copy import deepcopy
import numpy as np
# Generate a dummy dataset
y = np.random.choice([0, 1], size=50)
X = np.zeros((len(y), 100))
for idx, _y in enumerate(y):
    X[idx, :] = 10*(np.random.random((100)) - 0.5) + int(_y)*0.75 + 20 * (np.random.random((100)) - 0.2)
yval = np.random.choice([0, 1], size=5)
Xval = np.zeros((len(yval), 100))
# Create and train classifiers across some folds
clf = Pipeline([('pca', PCA()), ('svm', SVC())])
cv = KFold(n_splits=5)
clfs = []
for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    tmpclf = deepcopy(clf)
    tmpclf.fit(X_train, y_train)
    clfs += [('fold{0}'.format(idx), tmpclf)]
    print(tmpclf.score(X_test, y_test))
print(clfs)
# Create and initialize VotingClassifier
vclf = VotingClassifier(clfs)
vclf.estimators_ = [c[1] for c in clfs] # pass pre-fit estimators
vclf.le_ = LabelEncoder().fit(yval)
vclf.classes_ = vclf.le_.classes_
print(vclf.score(Xval, yval))
# Finally, and this is where the error occurs, extract original classifiers
orig_clf = vclf.estimators_[0]
print(orig_clf.score(Xval, yval))
check_is_fitted(orig_clf)
Expected Results
No error is thrown.
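Put differently, I would expect a fitted Pipeline to be reported as fitted. Below is a minimal sketch of a check with that behaviour. The helper name is_fitted is hypothetical (it is not a scikit-learn function), and it rests on the assumption that a Pipeline can be treated as fitted whenever its final step is fitted. It does not handle a final step set to 'passthrough' or None.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.exceptions import NotFittedError
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.utils.validation import check_is_fitted


def is_fitted(estimator):
    """Hypothetical helper, not part of scikit-learn.

    Returns True if the estimator looks fitted, False otherwise.
    """
    if isinstance(estimator, Pipeline):
        # Assumption: a Pipeline is considered fitted when its last step is.
        return is_fitted(estimator.steps[-1][1])
    try:
        check_is_fitted(estimator)
    except NotFittedError:
        return False
    return True


# Small synthetic dataset, only for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 5))
y = np.array([0, 1] * 10)

unfitted = Pipeline([("pca", PCA()), ("svm", SVC())])
fitted = Pipeline([("pca", PCA()), ("svm", SVC())]).fit(X, y)

print(is_fitted(unfitted))
print(is_fitted(fitted))
```

Looking only at the final step is a simplification: it says nothing about whether the earlier transformers were fit, so it is a stopgap for my use case rather than a proposed fix.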
Actual Results
---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
<ipython-input-77-d38acec5e641> in <module>
      2
      3 print(orig_clf.score(Xval, yval))
----> 4 check_is_fitted(orig_clf)

~/code/env/agg/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74

~/code/env/agg/lib/python3.7/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
   1017
   1018     if not attrs:
-> 1019         raise NotFittedError(msg % {'name': type(estimator).__name__})
   1020
   1021

NotFittedError: This Pipeline instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
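The error names the Pipeline rather than the VotingClassifier, and the `if not attrs:` line in the traceback suggests that, when no attributes argument is passed, the check looks for instance attributes on the estimator. My reading is that it collects attributes whose names end in an underscore, but that is an assumption I have not confirmed against the source. The sketch below lists such attributes for a fitted SVC and a fitted Pipeline so the two can be compared. The helper trailing_underscore_attrs is hypothetical and only approximates what I believe the check does.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def trailing_underscore_attrs(estimator):
    """Hypothetical helper: list instance attributes named like fitted attributes.

    Assumption: fitted attributes end in a single trailing underscore and
    do not start with a double underscore.
    """
    return sorted(
        name
        for name in vars(estimator)
        if name.endswith("_") and not name.startswith("__")
    )


# Small synthetic dataset, only for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 5))
y = np.array([0, 1] * 10)

svc = SVC().fit(X, y)
pipe = Pipeline([("pca", PCA()), ("svm", SVC())]).fit(X, y)

print("SVC:", trailing_underscore_attrs(svc))
print("Pipeline:", trailing_underscore_attrs(pipe))
```

If the Pipeline list comes back empty while the SVC list does not, that would be consistent with the error above and would mean the VotingClassifier is not involved at all.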
Versions
System:
python: 3.7.3 (default, Dec 13 2019, 19:58:14) [Clang 11.0.0 (clang-1100.0.33.17)]
executable: /Users/greg/code/env/agg/bin/python3
machine: Darwin-19.6.0-x86_64-i386-64bit
Python dependencies:
pip: 20.2.3
setuptools: 50.3.0
sklearn: 0.23.2
numpy: 1.19.2
scipy: 1.5.2
Cython: None
pandas: 1.1.3
matplotlib: 3.3.2
joblib: 0.17.0
threadpoolctl: 2.1.0
Built with OpenMP: True