check_is_fitted gives false positive when extracted from ensemble classifier #18648

Closed
@gkiar

Describe the bug

I trained several classifiers on my dataset and then created an ensemble classifier (a VotingClassifier) from them. Each of the estimators stored in .estimators_ has been fit and works independently, within the ensemble, and even after being extracted from it, yet each one fails a check_is_fitted test. As a result, I cannot use them on their own in a context that checks for fit, or in another ensemble classifier.

Steps/Code to Reproduce

# Handle imports
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.validation import check_is_fitted
from copy import deepcopy
import numpy as np

# Generate a dummy dataset
y = np.random.choice([0, 1], size=50)
X = np.zeros((len(y), 100))
for idx, _y in enumerate(y):
    X[idx, :] = 10*(np.random.random((100)) - 0.5) + int(_y)*0.75 + 20 * (np.random.random((100)) - 0.2)

yval = np.random.choice([0, 1], size=5)
Xval = np.zeros((len(yval), 100))

# Create and train classifiers across some folds
clf = Pipeline([('pca', PCA()), ('svm', SVC())])
cv = KFold(n_splits=5)

clfs = []
for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    tmpclf = deepcopy(clf)
    tmpclf.fit(X_train, y_train)
    clfs += [('fold{0}'.format(idx), tmpclf)]
    
    print(tmpclf.score(X_test, y_test))
print(clfs)

# Create and initialize VotingClassifier
vclf = VotingClassifier(clfs)

vclf.estimators_ = [c[1] for c in clfs]  # pass pre-fit estimators
vclf.le_ = LabelEncoder().fit(yval)
vclf.classes_ = vclf.le_.classes_

print(vclf.score(Xval, yval))

# Finally, and this is where the error occurs, extract original classifiers
orig_clf = vclf.estimators_[0]

print(orig_clf.score(Xval, yval))
check_is_fitted(orig_clf)

Expected Results

No error is thrown.

Actual Results

---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
<ipython-input-77-d38acec5e641> in <module>
      2 
      3 print(orig_clf.score(Xval, yval))
----> 4 check_is_fitted(orig_clf)

~/code/env/agg/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/code/env/agg/lib/python3.7/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
   1017 
   1018     if not attrs:
-> 1019         raise NotFittedError(msg % {'name': type(estimator).__name__})
   1020 
   1021 

NotFittedError: This Pipeline instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
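
A possible reading of the traceback: the `if not attrs:` line suggests that check_is_fitted, when called without an explicit attribute list, decides by looking for instance attributes that end in an underscore. The sketch below is a simplified stand-in for that heuristic, not the scikit-learn source. `looks_fitted`, `check_is_fitted_sketch`, `ToyModel`, and `ToyPipeline` are hypothetical names made up for illustration. Under that assumption, a wrapper that fits its steps but stores nothing on itself would be reported as unfitted whether or not it ever passed through a VotingClassifier.

```python
class NotFittedError(ValueError):
    """Stand-in for sklearn.exceptions.NotFittedError."""


def looks_fitted(estimator):
    """Assumed heuristic: any instance attribute ending in '_' (but not a dunder)."""
    return any(
        name.endswith("_") and not name.startswith("__")
        for name in vars(estimator)
    )


def check_is_fitted_sketch(estimator):
    """Raise if the assumed heuristic finds no fitted attribute."""
    if not looks_fitted(estimator):
        raise NotFittedError(
            "This %s instance is not fitted yet." % type(estimator).__name__
        )


class ToyModel:
    """Hypothetical estimator that stores a learned attribute on itself."""

    def fit(self, X, y):
        self.coef_ = sum(y) / len(y)
        return self


class ToyPipeline:
    """Hypothetical wrapper: fit() mutates the steps, not the wrapper."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps:
            step.fit(X, y)
        return self


pipe = ToyPipeline([("model", ToyModel())]).fit([[0], [1]], [0, 1])
final_step = pipe.steps[-1][1]

wrapper_ok = looks_fitted(pipe)           # wrapper only has 'steps'
final_step_ok = looks_fitted(final_step)  # final step has 'coef_'
```

If this reading is right, running the check on the pipeline's final step (for example `orig_clf.steps[-1][1]`) rather than on the Pipeline object itself might serve as a workaround, though that is untested against 0.23.2 here.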

Versions

System:
    python: 3.7.3 (default, Dec 13 2019, 19:58:14)  [Clang 11.0.0 (clang-1100.0.33.17)]
executable: /Users/greg/code/env/agg/bin/python3
   machine: Darwin-19.6.0-x86_64-i386-64bit

Python dependencies:
          pip: 20.2.3
   setuptools: 50.3.0
      sklearn: 0.23.2
        numpy: 1.19.2
        scipy: 1.5.2
       Cython: None
       pandas: 1.1.3
   matplotlib: 3.3.2
       joblib: 0.17.0
threadpoolctl: 2.1.0

Built with OpenMP: True
