MAINT Raise NotFittedError when using DictVectorizer without prior fitting #24838
Conversation
Don't we have common tests for transform and inverse_transform?
@@ -384,6 +389,8 @@ def get_feature_names_out(self, input_features=None):
        feature_names_out : ndarray of str objects
            Transformed feature names.
        """
        check_is_fitted(self)
I think get_feature_names_out raises ValueError if the right attributes are not set:
scikit-learn/sklearn/utils/validation.py
Lines 1949 to 1950 in 353b05b
if n_features_in_ is None:
    raise ValueError("Unable to generate feature names without n_features_in_")
And this I think is the case (or should be) for all estimators.
cc @thomasjpfan
To my understanding, the attribute self.feature_names_, which is accessed in get_feature_names_out() two lines below in line 394, will only exist after calling fit(). Calling get_feature_names_out() before fit() currently raises AttributeError: 'DictVectorizer' object has no attribute 'feature_names_' when executing this code example:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer()
dv.get_feature_names_out()
This PR proposes to raise a NotFittedError instead. Running the snippet from above on the code of this PR results in:
NotFittedError: This DictVectorizer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
I guess it's a matter of taste which option is the more favorable API. I've created this PR based on @glemaitre's sentiment in #24816 that an exception of type NotFittedError would be favorable.
What is your impression so far?
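For illustration, the proposed behavior can be sketched with a minimal stand-in (toy classes that only mirror the shape of sklearn's check_is_fitted and NotFittedError; this is not the actual scikit-learn code):

```python
class NotFittedError(ValueError, AttributeError):
    """Stand-in mirroring sklearn.exceptions.NotFittedError."""

def check_is_fitted(estimator, attributes):
    # Minimal stand-in for sklearn.utils.validation.check_is_fitted:
    # raise NotFittedError when any required fitted attribute is missing.
    if not all(hasattr(estimator, attr) for attr in attributes):
        raise NotFittedError(
            f"This {type(estimator).__name__} instance is not fitted yet. "
            "Call 'fit' with appropriate arguments before using this estimator."
        )

class ToyDictVectorizer:
    def fit(self, X):
        # fit() creates the fitted attributes, as in the real DictVectorizer.
        self.feature_names_ = sorted({key for row in X for key in row})
        self.vocabulary_ = {name: i for i, name in enumerate(self.feature_names_)}
        return self

    def get_feature_names_out(self):
        # The proposed guard: fail fast with NotFittedError before fit().
        check_is_fitted(self, ["feature_names_", "vocabulary_"])
        return list(self.feature_names_)
```

With this sketch, an unfitted call raises NotFittedError rather than an AttributeError about a missing internal attribute.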
It's not a matter of taste, really. It's a matter of consistency. For example, StandardScaler():
In [4]: StandardScaler().get_feature_names_out()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [4], line 1
----> 1 StandardScaler().get_feature_names_out()
File ~/Projects/sklearn/scikit-learn/sklearn/base.py:893, in OneToOneFeatureMixin.get_feature_names_out(self, input_features)
873 def get_feature_names_out(self, input_features=None):
874 """Get output feature names for transformation.
875
876 Parameters
(...)
891 Same as input features.
892 """
--> 893 return _check_feature_names_in(self, input_features)
File ~/Projects/sklearn/scikit-learn/sklearn/utils/validation.py:1950, in _check_feature_names_in(estimator, input_features, generate_names)
1948 # Generates feature names if `n_features_in_` is defined
1949 if n_features_in_ is None:
-> 1950 raise ValueError("Unable to generate feature names without n_features_in_")
1952 return np.asarray([f"x{i}" for i in range(n_features_in_)], dtype=object)
ValueError: Unable to generate feature names without n_features_in_
I don't mind raising NotFittedError for get_feature_names_out when the estimator is not fitted, but then we should do that everywhere.
I was looking at this matter as well. It seems that transformers using ClassNamePrefixFeaturesOutMixin will raise a NotFittedError because we check check_is_fitted(self, "_n_features_out"). NotFittedError seems to be a more informative error than stating that n_features_in_ is missing. I would rather make the check_is_fitted everywhere.
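One point in favor of switching everywhere: sklearn's NotFittedError subclasses both ValueError and AttributeError, so callers that currently catch one of those base classes keep working. A self-contained sketch with a stand-in exception class (not sklearn code) illustrating that compatibility argument:

```python
class NotFittedError(ValueError, AttributeError):
    """Stand-in mirroring the inheritance of sklearn.exceptions.NotFittedError."""

def unfitted_get_feature_names_out():
    # What the proposed check would raise on an unfitted estimator.
    raise NotFittedError("This estimator instance is not fitted yet.")

def catches(exc_type):
    # True if the raised NotFittedError is caught as the given base class.
    try:
        unfitted_get_feature_names_out()
    except exc_type:
        return True
    except Exception:
        return False
```

Because of the multiple inheritance, `catches(ValueError)`, `catches(AttributeError)`, and `catches(NotFittedError)` all hold, so existing handlers for either legacy exception type would not break.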
That works for me.
As far as I understand, the input for
scikit-learn/sklearn/feature_extraction/_dict_vectorizer.py, lines 337 to 338 in 2e481f1
is a Mapping or an Iterable over Mappings, e.g.:
X = [{'col1': 'a', 'col2': 'x'}, {'col1': 'b', 'col2': 'y'}]
I don't know the utils package deeply. Is there a common check for this kind of input? Otherwise, does something like this stub go in the right direction?
from collections.abc import Mapping

def check_is_mapping_or_iterable_over_mappings(X, input_name=""):
    if not input_name:
        input_name = "X"
    error_message = (
        f"A Mapping or an Iterable over Mappings is expected for {input_name}"
    )
    # a single Mapping is wrapped into a one-element list
    X = [X] if isinstance(X, Mapping) else X
    # check that X is an Iterable
    try:
        iterator = iter(X)
    except TypeError:
        raise TypeError(f"{error_message}, but {type(X)} was given.")
    # check that the Iterable is not empty
    try:
        next(iterator)
    except StopIteration:
        raise TypeError(f"{error_message}, but an empty Iterable was given.")
    # check that every element of the Iterable is a Mapping
    for x in iter(X):
        if not isinstance(x, Mapping):
            raise TypeError(
                f"{error_message}, but an Iterable containing {type(x)} was given."
            )
    return X
In my understanding, checks for a valid input in
I will try to make an overview of the error raised by the estimator when calling
@@ -203,6 +203,7 @@ def _transform(self, X, fitting):
            feature_names = []
            vocab = {}
        else:
            check_is_fitted(self, ["feature_names_", "vocabulary_"])
I have inserted the check into _transform() instead of transform(), because transform(X) is only a proxy for _transform(X, fitting=False).
In terms of readability, I think it makes more sense to have this in transform rather than here.
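The suggestion can be sketched with a toy class (hypothetical names, not the real DictVectorizer): the fitted-state check sits at the public transform() entry point, while _transform() remains a shared worker for both the fitting and non-fitting paths:

```python
class NotFittedError(ValueError, AttributeError):
    """Stand-in for sklearn.exceptions.NotFittedError."""

class ToyVectorizer:
    def fit_transform(self, X):
        # Fitting path: builds vocabulary_, so no fitted-check is needed.
        return self._transform(X, fitting=True)

    def transform(self, X):
        # Readability suggestion: validate the fitted state here, at the
        # public entry point, instead of inside the shared worker.
        if not hasattr(self, "vocabulary_"):
            raise NotFittedError("This ToyVectorizer instance is not fitted yet.")
        return self._transform(X, fitting=False)

    def _transform(self, X, fitting):
        if fitting:
            self.vocabulary_ = {}
            for row in X:
                for key in row:
                    self.vocabulary_.setdefault(key, len(self.vocabulary_))
        # Encode each mapping as a dense row in vocabulary order.
        return [[row.get(key, 0) for key in self.vocabulary_] for row in X]
```

Either placement raises the same error for unfitted use; keeping it in transform() just makes the precondition visible where callers look first.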
LGTM. Putting a milestone for 1.4
Otherwise LGTM.
I addressed @adrinjalali's comment.
Thanks @LoHertel. Auto-merging this PR.
Reference Issues/PRs
Fixes #24816
What does this implement/fix? Explain your changes.
DictVectorizer didn't raise a NotFittedError when using methods which expect prior fitting. fit() sets the attributes feature_names_ and vocabulary_. The following methods depend on these attributes and now raise a NotFittedError in case of absence:
inverse_transform()
restrict()
transform()
The check for the existence of the internal attributes is made by check_is_fitted() from the utils package. A test checking whether NotFittedError is raised as expected was added as well.
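The shape of such a test can be sketched with a minimal stand-in class (hypothetical StandInVectorizer, not the real DictVectorizer; the loop over unfitted methods is the pattern the added test exercises):

```python
class NotFittedError(ValueError, AttributeError):
    """Stand-in for sklearn.exceptions.NotFittedError."""

class StandInVectorizer:
    """Raises NotFittedError from every method that needs fitted attributes."""

    def _check_is_fitted(self):
        if not hasattr(self, "vocabulary_"):
            raise NotFittedError(
                "This StandInVectorizer instance is not fitted yet."
            )

    def transform(self, X):
        self._check_is_fitted()

    def inverse_transform(self, X):
        self._check_is_fitted()

    def restrict(self, support):
        self._check_is_fitted()

def collect_unfitted_errors():
    # Mirror the test's intent: every listed method must raise
    # NotFittedError when called on a never-fitted estimator.
    errors = {}
    sample = [{"foo": 1}]
    for method_name, arg in [
        ("transform", sample),
        ("inverse_transform", sample),
        ("restrict", [True]),
    ]:
        est = StandInVectorizer()
        try:
            getattr(est, method_name)(arg)
        except NotFittedError:
            errors[method_name] = True
    return errors
```

In a real test suite this loop would typically be a pytest.mark.parametrize over the method names, asserting with pytest.raises(NotFittedError).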