
MAINT Raise NotFittedError when using DictVectorizer without prior fitting #24838


Merged: 11 commits merged into scikit-learn:main on Dec 7, 2023

Conversation

@LoHertel (Contributor) commented Nov 5, 2022

Reference Issues/PRs

Fixes #24816

What does this implement/fix? Explain your changes.

DictVectorizer didn't raise a NotFittedError when calling methods that expect prior fitting.
fit() sets the attributes feature_names_ and vocabulary_.
The following methods depend on these attributes and now raise a NotFittedError when they are absent:

  • inverse_transform()
  • restrict()
  • transform()

The existence of the internal attributes is checked by check_is_fitted() from the utils package.
A test verifying that NotFittedError is raised as expected was added as well.
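
For illustration, a minimal sketch of the intended behavior (the exact message text is quoted later in this thread):

from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()
try:
    dv.transform([{"foo": 1}])  # no prior call to fit()
except NotFittedError as exc:
    print(exc)  # "This DictVectorizer instance is not fitted yet. ..."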

@adrinjalali (Member) left a comment:

Don't we have common tests for transform and inverse_transform?

@@ -384,6 +389,8 @@ def get_feature_names_out(self, input_features=None):
        feature_names_out : ndarray of str objects
            Transformed feature names.
        """
        check_is_fitted(self)
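
For context: called without an attribute list, check_is_fitted(self) considers an estimator fitted if it has any attribute that ends with an underscore and does not start with a double underscore (e.g. feature_names_, vocabulary_), and raises NotFittedError otherwise. A minimal sketch:

from sklearn.utils.validation import check_is_fitted
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()
check_is_fitted(dv)  # raises NotFittedError: no fitted attributes set yet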
Member:

I think get_feature_names_out raises ValueError if the right attributes are not set:

if n_features_in_ is None:
    raise ValueError("Unable to generate feature names without n_features_in_")

And this I think is the case (or should be) for all estimators.

cc @thomasjpfan

@LoHertel (Contributor, Author):

To my understanding, the attribute self.feature_names_, which is accessed in get_feature_names_out() two lines below (diff line 394), will only exist after fit() has been called.

Calling get_feature_names_out() before fit() currently raises AttributeError: 'DictVectorizer' object has no attribute 'feature_names_' when executing this code example:

from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()
dv.get_feature_names_out()

This PR proposes to raise a NotFittedError instead. Running the snippet above against this PR's code results in:
NotFittedError: This DictVectorizer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

I guess it's a matter of taste which option is the more favorable API. I created this PR based on @glemaitre's sentiment in #24816 that an exception of type NotFittedError would be preferable.

What is your impression so far?

Member:

It's not a matter of taste really. It's a matter of consistency. For example, StandardScaler():

In [4]: StandardScaler().get_feature_names_out()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [4], line 1
----> 1 StandardScaler().get_feature_names_out()

File ~/Projects/sklearn/scikit-learn/sklearn/base.py:893, in OneToOneFeatureMixin.get_feature_names_out(self, input_features)
    873 def get_feature_names_out(self, input_features=None):
    874     """Get output feature names for transformation.
    875 
    876     Parameters
   (...)
    891         Same as input features.
    892     """
--> 893     return _check_feature_names_in(self, input_features)

File ~/Projects/sklearn/scikit-learn/sklearn/utils/validation.py:1950, in _check_feature_names_in(estimator, input_features, generate_names)
   1948 # Generates feature names if `n_features_in_` is defined
   1949 if n_features_in_ is None:
-> 1950     raise ValueError("Unable to generate feature names without n_features_in_")
   1952 return np.asarray([f"x{i}" for i in range(n_features_in_)], dtype=object)

ValueError: Unable to generate feature names without n_features_in_

I don't mind raising NotFittedError for get_feature_names_out when the estimator is not fitted, but then we should do that everywhere.

cc @thomasjpfan @glemaitre

Member:

I was looking at this matter as well. It seems that transformers using ClassNamePrefixFeaturesOutMixin will raise a NotFittedError because we check `check_is_fitted(self, "_n_features_out")`.

NotFittedError seems to be a more informative error than stating that n_features_in_ is missing. I would rather use check_is_fitted everywhere.

Member:

That works for me.

@LoHertel (Contributor, Author) commented Nov 8, 2022

> Don't we have common tests for transform and inverse_transform?

As far as I understand, the input for inverse_transform should be array-like and is already being checked:

# COO matrix is not subscriptable
X = check_array(X, accept_sparse=["csr", "csc"])

DictVectorizer.transform, on the other hand, expects a Mapping or an iterable of Mappings (e.g. dict) as input. See example:

X = [{'col1': 'a', 'col2': 'x'}, {'col1': 'b', 'col2': 'y'}]
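
For context, a quick sketch of how such input flows through a fitted DictVectorizer (outputs as produced by a recent scikit-learn; details may vary by version):

from sklearn.feature_extraction import DictVectorizer

X = [{'col1': 'a', 'col2': 'x'}, {'col1': 'b', 'col2': 'y'}]
dv = DictVectorizer(sparse=False)
dv.fit_transform(X)         # [[1., 0., 1., 0.], [0., 1., 0., 1.]]
dv.get_feature_names_out()  # ['col1=a', 'col1=b', 'col2=x', 'col2=y']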

I don't know the utils package deeply. Is there a common check for this kind of input?

Otherwise, does something like this stub go in the right direction?

from collections.abc import Mapping
from itertools import chain


def check_is_mapping_or_iterable_over_mappings(X, input_name=""):

    if isinstance(input_name, str) and not len(input_name):
        input_name = "X"

    error_message = (
        f"A Mapping or an Iterable over Mappings is expected for {input_name}"
    )

    # a single Mapping is treated as a one-element iterable
    X = [X] if isinstance(X, Mapping) else X

    # check that X is iterable at all
    try:
        iterator = iter(X)
    except TypeError:
        raise TypeError(f"{error_message}, but {type(X)} was given.")

    # check that the Iterable is not empty
    try:
        first = next(iterator)
    except StopIteration:
        raise TypeError(f"{error_message}, but an empty Iterable was given.")

    # check that every element is a Mapping (re-attaching the element
    # consumed by the emptiness check); note this assumes X is
    # re-iterable (e.g. a list), since a generator would be consumed here
    for x in chain([first], iterator):
        if not isinstance(x, Mapping):
            raise TypeError(
                f"{error_message}, but an Iterable containing {type(x)} was given."
            )

    return X
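
For illustration, how the stub above would behave (a hypothetical helper, not part of scikit-learn):

check_is_mapping_or_iterable_over_mappings({"a": 1})    # -> [{'a': 1}]
check_is_mapping_or_iterable_over_mappings([{"a": 1}])  # -> [{'a': 1}]
check_is_mapping_or_iterable_over_mappings(42)          # TypeError: ... but <class 'int'> was given.
check_is_mapping_or_iterable_over_mappings([])          # TypeError: ... an empty Iterable was given.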

In my understanding, input validity checks in transform and inverse_transform will not address the AttributeError mentioned above and in #24816.

@glemaitre (Member) commented:
> Don't we have common tests for transform and inverse_transform?

@adrinjalali The DictVectorizer has an X_types tag that is not 2darray, therefore almost no common tests are run on this class. I assume this is one reason why we did not catch the NotFittedError inconsistency.
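
For reference, the tag can be inspected through the private tags API (subject to change between versions); a sketch:

from sklearn.feature_extraction import DictVectorizer

DictVectorizer()._get_tags()["X_types"]  # ['dict'], not '2darray', so most common tests skip it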

@glemaitre (Member) commented:

I will try to make an overview of the errors raised by estimators when calling get_feature_names_out before fit. And I should really do the maintenance work on the common tests to make sure that we run minimal tests for all estimators.

@@ -203,6 +203,7 @@ def _transform(self, X, fitting):
            feature_names = []
            vocab = {}
        else:
            check_is_fitted(self, ["feature_names_", "vocabulary_"])
@LoHertel (Contributor, Author):

I have inserted the check into _transform() instead of transform(), because transform(X) is only a proxy for _transform(X, fitting=False).
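
For reference, the proxy relationship described above, roughly (simplified from the actual source):

def transform(self, X):
    # no work happens here; _transform is shared with fit_transform,
    # which calls it with fitting=True
    return self._transform(X, fitting=False)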

Member:

In terms of readability, I think it makes more sense to have this in transform rather than here.

@LoHertel (Contributor, Author) commented Apr 23, 2023

After DictVectorizer was added to the common test test_estimators_get_feature_names_out_error in #25402, I have updated my PR.

@LoHertel LoHertel requested a review from glemaitre April 23, 2023 13:47
@github-actions bot commented Sep 7, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: af4e573.

@glemaitre glemaitre changed the title from "[MRG] Raise NotFittedError when using DictVectorizer without prior fitting" to "MAINT Raise NotFittedError when using DictVectorizer without prior fitting" on Sep 9, 2023
@glemaitre (Member) left a comment:

LGTM. Putting a milestone for 1.4

@glemaitre glemaitre added this to the 1.4 milestone Sep 9, 2023
@adrinjalali (Member) left a comment:

Otherwise LGTM.

@glemaitre glemaitre self-requested a review December 7, 2023 15:41
@glemaitre (Member) left a comment:

I addressed the comment of @adrinjalali.
Thanks @LoHertel. Auto-merging this PR.

@glemaitre glemaitre enabled auto-merge (squash) December 7, 2023 15:50
@glemaitre glemaitre merged commit c913853 into scikit-learn:main Dec 7, 2023
@LoHertel LoHertel deleted the 24816_dictvectorizer_check_if_fitted branch January 4, 2024 16:18
Development

Successfully merging this pull request may close these issues:

DictVectorizer doesn't raise NotFittedError when using transform without prior fitting