Skip to content

DEP Adding a warning in the SimpleImputer when strategy mode is constant and keep_empty_features is False #29950

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ArthurCourselle
Copy link
Contributor

Reference Issues/PRs

Fixes #29827

What does this implement/fix? Explain your changes.

Adding a warning in the SimpleImputer to prevent the user about an upcoming deprecation. This shows when the user set the keep_empty_features to False and the strategy mode is constant and the user have a column full of np.nan because it is not removed.

Copy link

github-actions bot commented Sep 27, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ed415fd. Link to the linter CI: here

@glemaitre glemaitre self-requested a review October 1, 2024 08:00
@glemaitre
Copy link
Member

I'll start to make a review with a couple of changes to do.

@glemaitre
Copy link
Member

The current docstring of keep_empty_features should be amended to document the deprecation:

    keep_empty_features : bool, default=False
        If True, features that consist exclusively of missing values when
        `fit` is called are returned in results when `transform` is called.
        The imputed value is always `0` except when `strategy="constant"`
        in which case `fill_value` will be used instead.

        .. versionadded:: 1.2

        .. versionchanged:: 1.6
           Currently, when `keep_empty_feature=False` and `strategy="constant",
           empty features are not dropped. This behaviour will change in version
           1.8. Set `keep_empty_feature=True` to preserve this behaviour.

@glemaitre
Copy link
Member

So right now, the CI is failing because we raise the warning in some tests:

FAILED impute/tests/test_impute.py::test_imputation_constant_integer - FutureWarning: When `keep_empty_features` is False and `constant` is set th...
FAILED impute/tests/test_impute.py::test_imputation_constant_float[csr_matrix] - FutureWarning: When `keep_empty_features` is False and `constant` is set th...
FAILED impute/tests/test_impute.py::test_imputation_constant_float[csr_array] - FutureWarning: When `keep_empty_features` is False and `constant` is set th...
FAILED impute/tests/test_impute.py::test_imputation_constant_float[asarray] - FutureWarning: When `keep_empty_features` is False and `constant` is set th...
FAILED impute/tests/test_impute.py::test_imputation_constant_object[None] - FutureWarning: When `keep_empty_features` is False and `constant` is set th...
FAILED impute/tests/test_impute.py::test_imputation_constant_object[nan] - FutureWarning: When `keep_empty_features` is False and `constant` is set th...
FAILED impute/tests/test_impute.py::test_imputation_constant_object[NAN] - FutureWarning: When `keep_empty_features` is False and `constant` is set th...
FAILED impute/tests/test_impute.py::test_imputation_constant_object[] - FutureWarning: When `keep_empty_features` is False and `constant` is set th...
FAILED impute/tests/test_impute.py::test_imputation_constant_object[0] - FutureWarning: When `keep_empty_features` is False and `constant` is set th...
FAILED impute/tests/test_impute.py::test_imputation_constant_pandas[object] - FutureWarning: When `keep_empty_features` is False and `constant` is set th...
FAILED impute/tests/test_impute.py::test_imputation_constant_pandas[category] - FutureWarning: When `keep_empty_features` is False and `constant` is set th...
FAILED impute/tests/test_impute.py::test_iterative_imputer_constant_fill_value - FutureWarning: When `keep_empty_features` is False and `constant` is set th...

So we should look at each case, and check if it makes sense and catch the warning / change the test or if this is a false positive and a bug in the code. I'll review the code and propose the changes.

@glemaitre
Copy link
Member

So luckily, for the tests test_imputation_constant_integer, test_imputation_constant_float, test_imputation_constant_object, test_imputation_constant_pandas, we intend to test the same behaviour: check that having a column full of missing values is preserved. So we can add keep_empty_features=True in the test because this is something that we want to test right now.

However, could you add the following on the top of each test:

# TODO (1.8): check that `keep_empty_features=False` drop the
# empty features due to the behaviour change.

For the remaining test test_iterative_imputer_constant_fill_value, we can also set keep_empty_features=True.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK here is a round of review.

Please add an entry to the change log at doc/whats_new/v1.6.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

This will resolve the remaining CI failure related to checking the changelog.

@ArthurCourselle
Copy link
Contributor Author

Thank you, I will start working on all those reviews.

@ArthurCourselle
Copy link
Contributor Author

I've made the requested modifications based on your reviews. Sorry that I didn't have much time to focus on it earlier. Everything should be up to date now, and please let me know if there's anything else I need to adjust. I'm waiting for the CI results.

@glemaitre
Copy link
Member

Sorry that I didn't have much time to focus on it earlier.

No issue. Apparently the CI is not happy. I'll have a look at what is going on.

@glemaitre glemaitre self-requested a review October 16, 2024 14:46
@glemaitre
Copy link
Member

The remaining failure is in another estimator called IterativeImputer. It internally uses the SimpleImputer and is going to raise a warning. But this is a false positive because we have already some specific code that make sure that remove the empty feature (it was considered as a bug in the past).

So here, you would need to catch the warning. The diff will look like:

diff --git a/sklearn/impute/_iterative.py b/sklearn/impute/_iterative.py
index d2d47bae3f..a90a659980 100644
--- a/sklearn/impute/_iterative.py
+++ b/sklearn/impute/_iterative.py
@@ -635,6 +635,13 @@ class IterativeImputer(_BaseImputer):
 
         X_missing_mask = _get_mask(X, self.missing_values)
         mask_missing_values = X_missing_mask.copy()
+
+        # TODO (1.8): remove this once the deprecation is removed. In the meantime,
+        # we need to catch the warning to avoid false positives.
+        catch_warning = (
+            self.initial_strategy == "constant" and not self.keep_empty_features
+        )
+
         if self.initial_imputer_ is None:
             self.initial_imputer_ = SimpleImputer(
                 missing_values=self.missing_values,
@@ -642,9 +649,19 @@ class IterativeImputer(_BaseImputer):
                 fill_value=self.fill_value,
                 keep_empty_features=self.keep_empty_features,
             ).set_output(transform="default")
-            X_filled = self.initial_imputer_.fit_transform(X)
+            if catch_warning:
+                with warnings.catch_warnings():
+                    warnings.simplefilter("ignore", FutureWarning)
+                    X_filled = self.initial_imputer_.fit_transform(X)
+            else:
+                X_filled = self.initial_imputer_.fit_transform(X)
         else:
-            X_filled = self.initial_imputer_.transform(X)
+            if catch_warning:
+                with warnings.catch_warnings():
+                    warnings.simplefilter("ignore", FutureWarning)
+                    X_filled = self.initial_imputer_.transform(X)
+            else:
+                X_filled = self.initial_imputer_.transform(X)
 
         if in_fit:
             self._is_empty_feature = np.all(mask_missing_values, axis=0)
@@ -658,7 +675,7 @@ class IterativeImputer(_BaseImputer):
                 # The constant strategy has a specific behavior and preserve empty
                 # features even with ``keep_empty_features=False``. We need to drop
                 # the column for consistency.
-                # TODO: remove this `if` branch once the following issue is addressed:
+                # TODO (1.8): remove this `if` branch once the following issue is addressed:
                 # https://github.com/scikit-learn/scikit-learn/issues/29827
                 X_filled = X_filled[:, ~self._is_empty_feature]
 

Applying it will make the test pass.

Comment on lines 228 to 231
.. versionchanged:: 1.6
Currently, when `keep_empty_feature=False` and `strategy="constant"`,
empty features are not dropped. This behaviour will change in version
1.8. Set `keep_empty_feature=True` to preserve this behaviour.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should solve the documentation failure, sphinx is quite stubborn about those :).

Suggested change
.. versionchanged:: 1.6
Currently, when `keep_empty_feature=False` and `strategy="constant"`,
empty features are not dropped. This behaviour will change in version
1.8. Set `keep_empty_feature=True` to preserve this behaviour.
.. versionchanged:: 1.6
Currently, when `keep_empty_feature=False` and `strategy="constant"`,
empty features are not dropped. This behaviour will change in version
1.8. Set `keep_empty_feature=True` to preserve this behaviour.

@ArthurCourselle
Copy link
Contributor Author

Thank you ! I have corrected what you told me, however I saw when fetching the main upstream that many lines have been deleted in v.1.6.rst file. I think it is the reason why the check changelog ci is failing. Do I need to modify another file for the changelog ?

@glemaitre glemaitre self-requested a review October 28, 2024 19:29
@glemaitre
Copy link
Member

Indeed, we change the way we create changelog entry to have a more automated way. Now, we need to create a single file in the module folder. I push a commit to create the file in the last commit.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @ArthurCourselle

We will need a second reviewer.
Maybe @adrinjalali can have a look such that we get this one in to not delay the deprecation.

@glemaitre glemaitre added this to the 1.6 milestone Oct 28, 2024
@ArthurCourselle
Copy link
Contributor Author

Ok thank you ! I'm looking forward for the other review then. I will look for other issues that I could work on in the future. Thanks @glemaitre for your help :)

@adrinjalali adrinjalali merged commit c1b1f87 into scikit-learn:main Oct 29, 2024
30 checks passed
@glemaitre
Copy link
Member

Nice. Thanks @ArthurCourselle for you contribution ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SimpleImputer does not drop a column full of np.nan even when keep_empty_feature=False
3 participants