
MAINT Raise NotFittedError when using DictVectorizer without prior fitting #24838


Merged: 11 commits merged into scikit-learn:main on Dec 7, 2023

Conversation

@LoHertel (Contributor) commented Nov 5, 2022

Reference Issues/PRs

Fixes #24816

What does this implement/fix? Explain your changes.

DictVectorizer didn't raise a NotFittedError when calling methods that expect prior fitting.
fit() sets the attributes feature_names_ and vocabulary_.
The following methods depend on these attributes and now raise a NotFittedError when they are absent:

  • inverse_transform()
  • restrict()
  • transform()

The existence of the internal attributes is checked by check_is_fitted() from the utils package.
A test verifying that NotFittedError is raised as expected was added as well.
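
For illustration, a minimal sketch of the intended behavior (the exact message text is quoted later in this thread):

from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()
try:
    dv.transform([{"foo": 1}])  # no prior call to fit()
except NotFittedError as exc:
    print(exc)  # "This DictVectorizer instance is not fitted yet. ..."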

@adrinjalali (Member) left a comment:

Don't we have common tests for transform and inverse_transform?

@@ -384,6 +389,8 @@ def get_feature_names_out(self, input_features=None):
        feature_names_out : ndarray of str objects
            Transformed feature names.
        """
        check_is_fitted(self)
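
For context: called without an attribute list, check_is_fitted(self) considers an estimator fitted if it has any attribute that ends with an underscore and does not start with a double underscore (e.g. feature_names_, vocabulary_), and raises NotFittedError otherwise. A minimal sketch:

from sklearn.utils.validation import check_is_fitted
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()
check_is_fitted(dv)  # raises NotFittedError: no fitted attributes set yet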
Member:

I think get_feature_names_out raises ValueError if the right attributes are not set:

if n_features_in_ is None:
    raise ValueError("Unable to generate feature names without n_features_in_")

And this I think is the case (or should be) for all estimators.

cc @thomasjpfan

@LoHertel (Contributor, Author):

To my understanding, the attribute self.feature_names_, which is accessed in get_feature_names_out() two lines below (diff line 394), will only exist after fit() has been called.

Calling get_feature_names_out() before fit() currently raises AttributeError: 'DictVectorizer' object has no attribute 'feature_names_' when executing this code example:

from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()
dv.get_feature_names_out()

This PR proposes to raise a NotFittedError instead. Running the snippet above against this PR's code results in:
NotFittedError: This DictVectorizer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

I guess it's a matter of taste which option is the more favorable API. I created this PR based on @glemaitre's sentiment in #24816 that an exception of type NotFittedError would be preferable.

What is your impression so far?

Member:

It's not a matter of taste really. It's a matter of consistency. For example, StandardScaler():

In [4]: StandardScaler().get_feature_names_out()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [4], line 1
----> 1 StandardScaler().get_feature_names_out()

File ~/Projects/sklearn/scikit-learn/sklearn/base.py:893, in OneToOneFeatureMixin.get_feature_names_out(self, input_features)
    873 def get_feature_names_out(self, input_features=None):
    874     """Get output feature names for transformation.
    875 
    876     Parameters
   (...)
    891         Same as input features.
    892     """
--> 893     return _check_feature_names_in(self, input_features)

File ~/Projects/sklearn/scikit-learn/sklearn/utils/validation.py:1950, in _check_feature_names_in(estimator, input_features, generate_names)
   1948 # Generates feature names if `n_features_in_` is defined
   1949 if n_features_in_ is None:
-> 1950     raise ValueError("Unable to generate feature names without n_features_in_")
   1952 return np.asarray([f"x{i}" for i in range(n_features_in_)], dtype=object)

ValueError: Unable to generate feature names without n_features_in_

I don't mind raising NotFittedError for get_feature_names_out when the estimator is not fitted, but then we should do that everywhere.

cc @thomasjpfan @glemaitre

Member:

I was looking at this matter as well. It seems that transformers using ClassNamePrefixFeaturesOutMixin will raise a NotFittedError because we check `check_is_fitted(self, "_n_features_out")`.

NotFittedError seems to be a more informative error than stating that n_features_in_ is missing. I would rather use check_is_fitted everywhere.

Member:

That works for me.

@LoHertel (Contributor, Author) commented Nov 8, 2022

> Don't we have common tests for transform and inverse_transform?

As far as I understand, the input for inverse_transform should be array-like and is already being checked:

# COO matrix is not subscriptable
X = check_array(X, accept_sparse=["csr", "csc"])

DictVectorizer.transform, on the other hand, expects a Mapping or an iterable of Mappings (e.g. dict) as input. See example:

X = [{'col1': 'a', 'col2': 'x'}, {'col1': 'b', 'col2': 'y'}]
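
For context, a quick sketch of how such input flows through a fitted DictVectorizer (outputs as produced by a recent scikit-learn; details may vary by version):

from sklearn.feature_extraction import DictVectorizer

X = [{'col1': 'a', 'col2': 'x'}, {'col1': 'b', 'col2': 'y'}]
dv = DictVectorizer(sparse=False)
dv.fit_transform(X)         # [[1., 0., 1., 0.], [0., 1., 0., 1.]]
dv.get_feature_names_out()  # ['col1=a', 'col1=b', 'col2=x', 'col2=y']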

I don't know the utils package deeply. Is there a common check for this kind of input?

Otherwise, does something like this stub go in the right direction?

from collections.abc import Mapping
from itertools import chain


def check_is_mapping_or_iterable_over_mappings(X, input_name=""):

    if isinstance(input_name, str) and not len(input_name):
        input_name = "X"

    error_message = (
        f"A Mapping or an Iterable over Mappings is expected for {input_name}"
    )

    # a single Mapping is treated as a one-element iterable
    X = [X] if isinstance(X, Mapping) else X

    # check that X is iterable at all
    try:
        iterator = iter(X)
    except TypeError:
        raise TypeError(f"{error_message}, but {type(X)} was given.")

    # check that the Iterable is not empty
    try:
        first = next(iterator)
    except StopIteration:
        raise TypeError(f"{error_message}, but an empty Iterable was given.")

    # check that every element is a Mapping (re-attaching the element
    # consumed by the emptiness check); note this assumes X is
    # re-iterable (e.g. a list), since a generator would be consumed here
    for x in chain([first], iterator):
        if not isinstance(x, Mapping):
            raise TypeError(
                f"{error_message}, but an Iterable containing {type(x)} was given."
            )

    return X
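
For illustration, how the stub above would behave (a hypothetical helper, not part of scikit-learn):

check_is_mapping_or_iterable_over_mappings({"a": 1})    # -> [{'a': 1}]
check_is_mapping_or_iterable_over_mappings([{"a": 1}])  # -> [{'a': 1}]
check_is_mapping_or_iterable_over_mappings(42)          # TypeError: ... but <class 'int'> was given.
check_is_mapping_or_iterable_over_mappings([])          # TypeError: ... an empty Iterable was given.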

In my understanding, input validity checks in transform and inverse_transform will not address the AttributeError mentioned above and in #24816.

@glemaitre (Member) commented:
> Don't we have common tests for transform and inverse_transform?

@adrinjalali The DictVectorizer has an X_types tag that is not 2darray, therefore almost no common tests are run on this class. I assume this is one reason why we did not catch the NotFittedError inconsistency.
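
For reference, the tag can be inspected through the private tags API (subject to change between versions); a sketch:

from sklearn.feature_extraction import DictVectorizer

DictVectorizer()._get_tags()["X_types"]  # ['dict'], not '2darray', so most common tests skip it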

@glemaitre (Member) commented:

I will try to make an overview of the errors raised by estimators when calling get_feature_names_out before fit. And I should really do the maintenance work on the common tests to make sure that we run minimal tests for all estimators.

@@ -203,6 +203,7 @@ def _transform(self, X, fitting):
            feature_names = []
            vocab = {}
        else:
            check_is_fitted(self, ["feature_names_", "vocabulary_"])
@LoHertel (Contributor, Author):

I have inserted the check into _transform() instead of transform(), because transform(X) is only a proxy for _transform(X, fitting=False).
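
For reference, the proxy relationship described above, roughly (simplified from the actual source):

def transform(self, X):
    # no work happens here; _transform is shared with fit_transform,
    # which calls it with fitting=True
    return self._transform(X, fitting=False)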

Member:

In terms of readability, I think it makes more sense to have this in transform rather than here.

@LoHertel (Contributor, Author) commented Apr 23, 2023

After DictVectorizer was added to the common test test_estimators_get_feature_names_out_error in #25402, I have updated my PR.

@LoHertel LoHertel requested a review from glemaitre April 23, 2023 13:47
@github-actions bot commented Sep 7, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: af4e573.

@glemaitre glemaitre changed the title from "[MRG] Raise NotFittedError when using DictVectorizer without prior fitting" to "MAINT Raise NotFittedError when using DictVectorizer without prior fitting" on Sep 9, 2023
@glemaitre (Member) left a comment:

LGTM. Putting a milestone for 1.4

@glemaitre glemaitre added this to the 1.4 milestone Sep 9, 2023
@adrinjalali (Member) left a comment:

Otherwise LGTM.

@glemaitre glemaitre self-requested a review December 7, 2023 15:41
@glemaitre (Member) left a comment:

I addressed the comment of @adrinjalali.
Thanks @LoHertel. Auto-merging this PR.

@glemaitre glemaitre enabled auto-merge (squash) December 7, 2023 15:50
@glemaitre glemaitre merged commit c913853 into scikit-learn:main Dec 7, 2023
@LoHertel LoHertel deleted the 24816_dictvectorizer_check_if_fitted branch January 4, 2024 16:18
Development

Successfully merging this pull request may close these issues:

DictVectorizer doesn't raise NotFittedError when using transform without prior fitting