ENH Add support for np.nan values in SplineTransformer #28043

Open · wants to merge 68 commits into base: main

Conversation

@StefanieSenger (Contributor) commented Jan 2, 2024

Reference Issues/PRs

Closes #26793

What does this implement/fix? Explain your changes.

Adds support for np.nan values in SplineTransformer.

  • adds a param handle_missing : {'error', 'zeros'} to __init__, where 'error' preserves the previous behaviour and 'zeros' handles nan values by setting their spline values to all 0s (see the sketch after this list)
  • adds new tests
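
For illustration, a minimal sketch of the proposed usage (handle_missing is the option added by this PR and is not part of any released scikit-learn; the exact spline values depend on the fitted knots):

    import numpy as np
    from sklearn.preprocessing import SplineTransformer

    X = np.array([[1.0], [2.0], [np.nan], [4.0]])

    # Default: keep the previous behaviour and raise on nan values.
    # SplineTransformer(handle_missing="error").fit_transform(X)  # ValueError

    # New option: rows containing nan get all-zero spline values.
    splt = SplineTransformer(n_knots=3, degree=2, handle_missing="zeros")
    XBS = splt.fit_transform(X)
    print(XBS[2])  # all zeros for the row that contained the nan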

(very outdated, should have put it in a separate comment:)
Yet to solve:

  1. I believe that in _get_base_knot_positions I have to prepare _weighted_percentile to exclude nan values, similarly to how np.nanpercentile excludes nan values for the calculation of the base knots. I tried, but it was quite tricky. Edit: Just found that np.nanpercentile will soon get a sample_weight option: PR 24254 in numpy

  2. Should an error also be raised if the SplineTransformer was instantiated with handle_missing="error", then fitted without missing values, and the X passed to transform contains missing values?

github-actions bot commented Jan 2, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit 2b37259.

@StefanieSenger changed the title from "ENH Adds support for np.nan values in SplineTransformer" to "ENH Add support for np.nan values in SplineTransformer" Jan 8, 2024
@ogrisel (Member) left a comment

The PR looks very good but it needs to be merged with main (there are conflicts in the changelog).

Also, I think the get_feature_names_out() method needs to be updated. The tests should be expanded accordingly, maybe to also include a test with .set_output(transform="pandas") (this is how I found out that there was a problem with the output feature names).

@ogrisel (Member) commented Feb 29, 2024

So, should sparse_output=True and handle_missing="indicator" be prevented from being used together (and show an explicit error message)?

I think we should add support for using those two options together.

@ogrisel (Member) left a comment

Here is a more in-depth review pass. There is indeed a fundamental problem with the current code: the missingness indicators from the training set (when calling .fit or .fit_transform) should not be stored as an estimator attribute and reapplied to the test set (when calling .transform). Instead, the missingness pattern should be extracted from the test set itself.

See more details below:

    if self.include_bias:
        return XBS
    return self._concatenate_indicator(XBS)
Member

The missingness indicators computed from the X passed to .transform (which can be a test set) should be passed as argument to _concatenate_indicator instead of reusing the mask extracted from the training set.
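
For illustration, a hedged sketch of the suggested fix (the helper structure here is hypothetical and only mirrors the diff context above; the actual PR code may differ):

    def transform(self, X):
        ...
        # Extract the missingness pattern from the data being transformed,
        # not from state stored at fit time.
        missing_mask = np.isnan(X)
        XBS = ...  # compute the spline basis features for X
        if self.include_bias:
            return XBS
        return self._concatenate_indicator(XBS, missing_mask)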

@StefanieSenger (Contributor Author) left a comment

Hey @ogrisel, thanks for reviewing and for your help.
I went through your comments and could resolve most of the issues.

I've named the new option handle_missing="constant", but that's just an idea. I found that indicator no longer fits so well if we don't add an indicator column to X. Though with constant as well as with zeros, I feel it's not quite clear from the naming where in the process the nans become something else (before or after calculating the splines). Maybe we can find a name that conveys that info.

There are quite a few things I am a bit confused about:
Generally, I don't know whether we want SplineTransformer to change or to keep its behaviour if nan values are present.

If we want it to keep behaviour, instead of having this test data for comparing equality:

    X_nan = np.array([[1, 1], [2, 2], [3, 3], [np.nan, 4], [4, 4]])
    X = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])

it should maybe rather be

    X_nan = np.array([[1, 1], [2, 2], [3, 3], [np.nan, 4], [4, 4]])
    X = np.array([[1, 1], [2, 2], [3, 3], [99, 4], [4, 4]])

and in this case, the current implementation is wrong. Maybe you can shed some light on this so that I know how to go on.

I will check the issue with the feature names next.

@StefanieSenger (Contributor Author) commented

I was trying to track down the problem with the feature names that you mentioned here, @ogrisel, but I cannot recreate it. Maybe it was resolved while I worked on the other issues?

This is what I tried (using the code from the existing feature_name_out test):

def test_spline_transformer_feature_names_with_nans():
    """Test that SplineTransformer generates correct feature names if nan values are present."""
    X_nan = np.array([[1, 1], [2, 2], [3, 3], [np.nan, 4], [4, 4]])
    splt = SplineTransformer(
        degree=3,
        n_knots=3,
        handle_missing="constant",
        include_bias=True).fit(X_nan)
    feature_names = splt.get_feature_names_out()
    assert_array_equal(
        feature_names,
        [
            "x0_sp_0",
            "x0_sp_1",
            "x0_sp_2",
            "x0_sp_3",
            "x0_sp_4",
            "x1_sp_0",
            "x1_sp_1",
            "x1_sp_2",
            "x1_sp_3",
            "x1_sp_4",
        ],
    )

    splt = SplineTransformer(
        degree=3,
        n_knots=3,
        handle_missing="constant",
        include_bias=False).fit(X_nan)
    feature_names = splt.get_feature_names_out(["a", "b"])
    assert_array_equal(
        feature_names,
        [
            "a_sp_0",
            "a_sp_1",
            "a_sp_2",
            "a_sp_3",
            "b_sp_0",
            "b_sp_1",
            "b_sp_2",
            "b_sp_3",
        ],
    )

    splt.set_output(transform="pandas")
    X_transformed = splt.transform(X_nan)
    feature_names = splt.get_feature_names_out(["a", "b"])
    assert_array_equal(
        feature_names,
        [
            "a_sp_0",
            "a_sp_1",
            "a_sp_2",
            "a_sp_3",
            "b_sp_0",
            "b_sp_1",
            "b_sp_2",
            "b_sp_3",
        ],
    )

Everything behaves as it should, I believe. But maybe I didn't understand what exactly you ran into.

@ogrisel (Member) commented Mar 15, 2024

The get_feature_names problem only happens when appending extra output features for the missing indicators (they would then need to be named in that case).

EDIT: I will try to answer your other questions/comments early enough next week.

@ogrisel (Member) commented Mar 15, 2024

About handle_missing="constant": I really prefer handle_missing="zero", handle_missing="zeros" or handle_missing="ignore", to convey either that missing values are encoded as zero spline feature values, or that they are encoded in a value such that a simple downstream linear model would basically "ignore" them.

@StefanieSenger marked this pull request as ready for review April 18, 2024 13:51
@StefanieSenger (Contributor Author) commented Apr 18, 2024

Hey @ogrisel, can you give me some feedback?

My current understanding is that if we introduce new 0-values in X_transformed (due to nan values in X), then we also expect different stats for the transformer compared to when no nan values are present.

This would mean that we expect (and test for)

  • a different output format in case of nans (I have already written the tests like this)

  • that the B-splines won't sum up to 1 * n_features per row anymore (something I still need to define for the test); see the sketch after this list
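
For illustration, a hedged sketch of that second expectation, assuming the PR branch (handle_missing is not in released scikit-learn):

    import numpy as np
    from sklearn.preprocessing import SplineTransformer

    X = np.array([[1.0], [2.0], [np.nan], [4.0]])
    splt = SplineTransformer(n_knots=3, degree=2, handle_missing="zeros")
    XBS = splt.fit_transform(X)

    row_sums = XBS.sum(axis=1)
    # Complete rows keep the partition-of-unity property (their splines
    # sum to 1 per input feature), while the nan row sums to 0 because
    # all of its spline values were zeroed.
    assert np.allclose(row_sums[[0, 1, 3]], 1.0)
    assert row_sums[2] == 0.0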

@ogrisel added this to the 1.7 milestone Apr 18, 2025
@ogrisel (Member) commented Apr 18, 2025

I have the feeling that this PR is nearly ready to get in, so I milestoned it for 1.7, but no problem if it's not ready in time.

@StefanieSenger (Contributor Author) commented

Hi, @ogrisel. Thanks for revisiting this PR and for the milestone assignment for version 1.7.

For context, I had kept this PR updated, made it pass all tests and ensured it was merge-ready, but after several months without much engagement from the team, I shifted my attention to other tasks, as the priority wasn’t clear.

I'm currently on vacation and will take this back up on my return. I look forward to finalizing it together.

@StefanieSenger (Contributor Author) commented

Hi @ogrisel, I have worked on what you had proposed and the tests now all pass.
Do you want to have a look?


    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.input_tags.allow_nan = self.handle_missing == "zeros"
Contributor Author

This way of defining the nan-support tag is per instance, which I think is most user-friendly and is also needed to pass all the common tests. With an unconditional tags.input_tags.allow_nan = True (or False), one of the common tests fails either way, so we cannot do that.

However, doc/sphinxext/allow_nan_estimators.py, which automatically generates a list of nan-accepting estimators, does so with the default parameter settings (almost as if the tags were defined at class level).
I think I would like to work on a follow-up PR to fix the generation and take cases like this into account. The question would be how broadly to scope that. What do you think, @adrinjalali?
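
For illustration, a hedged sketch of what the per-instance tag implies (get_tags is public since scikit-learn 1.6; handle_missing requires this PR's branch):

    from sklearn.preprocessing import SplineTransformer
    from sklearn.utils import get_tags

    # The same class reports different nan support depending on its
    # constructor arguments:
    assert get_tags(SplineTransformer(handle_missing="zeros")).input_tags.allow_nan
    assert not get_tags(SplineTransformer(handle_missing="error")).input_tags.allow_nan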

Member

It's a bit sad indeed. But I think we can live with the fact that this list only reflects the default behavior, and that some estimators can be made to accept nans with specific hyper-parameters.

Member

I think we should make sure _construct_instances creates at least one instance that supports nan, and, instead of checking only the first instance returned by the generator, check them all here:

    # Here we generate the text only for one instance. This directive
    # should not be used for meta-estimators where tags depend on the
    # sub-estimator.
    est = next(_construct_instances(est_class))

But certainly for a separate PR.
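
For illustration, a hedged sketch of that suggestion, reusing the names from the snippet above (the real directive code may differ):

    # Consider the estimator nan-capable if any constructed instance
    # advertises nan support, not only the first one.
    supports_nan = any(
        get_tags(est).input_tags.allow_nan
        for est in _construct_instances(est_class)
    )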

@ogrisel modified the milestones: 1.7, 1.8 May 16, 2025
@ogrisel (Member) left a comment

LGTM. Thank you very much for the final push @StefanieSenger!

@ogrisel (Member) commented May 16, 2025

cc @lorentzenchr for a second review.

@lorentzenchr (Member) left a comment

Overall looks good.
Feedback: the review was a bit harder than necessary due to the renaming of variables.

Specifies the way missing values are handled.

- 'error' : Raise an error if `np.nan` values are present during :meth:`fit`.
- 'zeros' : Encode missing values as splines with value `0`.
Member

It would be nice to add a short note that this is different from first setting feature values to zero and then creating the spline basis, i.e. pipeline(SimpleImputer(strategy="constant", fill_value=0), SplineTransformer(...)).
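
For illustration, a hedged sketch of the distinction (SimpleImputer and Pipeline are released API; handle_missing is the option proposed in this PR):

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import SplineTransformer

    X = np.array([[np.nan], [1.0], [2.0], [3.0]])

    # Imputing first turns the nan into the value 0, which is then splined
    # like any ordinary point, giving generally non-zero basis values.
    imputed = make_pipeline(
        SimpleImputer(strategy="constant", fill_value=0),
        SplineTransformer(n_knots=3, degree=2),
    ).fit_transform(X)

    # handle_missing="zeros" instead zeroes the spline outputs of the nan
    # row after the basis is computed, so that row is all zeros.
    direct = SplineTransformer(
        n_knots=3, degree=2, handle_missing="zeros"
    ).fit_transform(X)

    assert not np.allclose(imputed[0], 0.0)
    assert np.allclose(direct[0], 0.0)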

Contributor Author

I agree it's important. I've added a note.

Similarly, "Encode missing values as splines with value 0." was not so clear. I've corrected it to "Encode splines of missing values with values 0."

Comment on lines +862 to +865
"`X` contains invalid values (NaNs) and `SplineTransformer` is "
"configured with `handle_missing='error'`. Set "
"`handle_missing='zeros'` to encode missing values as splines with "
"value `0` or ensure no missing values in `X`."
Member

Suggested change
"`X` contains invalid values (NaNs) and `SplineTransformer` is "
"configured with `handle_missing='error'`. Set "
"`handle_missing='zeros'` to encode missing values as splines with "
"value `0` or ensure no missing values in `X`."
"Input X contains NaN values and `SplineTransformer` is configured "
"to error in this case (handle_missing='error'). To avoid this error, you "
"could set handle_missing='zeros' to encode missing values as splines with "
"value 0 or ensure no missing values in X."

Please tell our fellow scikit-learn (core) contributors to stop using backticks in printed output, error and warning messages as well as code comments.
Disclaimer: I might have introduced some of those in the past.

Contributor Author

Why is that? My impression was that maintainers encourage the use of backticks in error messages, or did I get this wrong?

Member

I don’t like it. Whether others agree should be settled elsewhere. For this PR, let’s not introduce new backticks.

Member

We're certainly not consistent with backticks, and some people use them more than others. I personally don't bother asking people to add or remove them.

sparse_output=sparse_output,
)

# Check `fit_transform` does the same as `fit` and then `transform`:
Member

Suggested change
# Check `fit_transform` does the same as `fit` and then `transform`:
# Check fit_transform does the same as fit and then transform:

While I'm sorry for the noise, I don't like backticks in comments. They don't help readability for me.

Member

I personally like them as they help disambiguate the named API entities from the English verbs, especially in sentences that would otherwise be grammatically incorrect, as is the case in this particular snippet.

Contributor Author

Backticks for named entities also help me read comments more intuitively. Would it be okay with you to keep them here, @lorentzenchr?

Comment on lines +529 to +530
# Check that `transform` works as expected when the passed data has a different
# shape than the training data passed to `fit`:
Member

Suggested change
# Check that `transform` works as expected when the passed data has a different
# shape than the training data passed to `fit`:
# Check that transform works as expected when the passed data has a different
# shape than the training data passed to fit:

StefanieSenger and others added 8 commits May 20, 2025 15:39
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@StefanieSenger (Contributor Author) left a comment

Thank you @lorentzenchr. I have implemented your suggestions and left comments on two that I would rather leave as they are (the copy() and the backticks). See my comments for the reasoning.


    # later.
    x[mask_inv] = spl.t[self.degree]
    outside_range_mask = ~inside_range_mask
    x = X[:, feature_idx].copy()
Contributor Author

I think we do need the copy. That was actually pretty hard for me to figure out, because I was not familiar with views on arrays back then and had naively implemented everything without a copy. I have also checked again: without a copy, we change X via x[outside_range_mask] = xmin, and it matters because we rely on X earlier.

(I think it is in

    x = spl.t[spl.k] + (X[:, feature_idx] - spl.t[spl.k]) % (
        spl.t[n] - spl.t[spl.k]
    )

above.)

We get a lot of errors too without the copy.
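
For illustration, a minimal standalone demonstration of the view-vs-copy behaviour in numpy (independent of the PR code):

    import numpy as np

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    x = X[:, 0]          # a view: writing through x mutates X
    x[0] = 99.0
    print(X[0, 0])       # 99.0 -- the caller's array was changed

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    x = X[:, 0].copy()   # an independent copy
    x[0] = 99.0
    print(X[0, 0])       # 1.0 -- X is untouched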

Comment on lines 1035 to 1036
# Note for extrapolation:
# 'continue' is already returned as is by scipy BSplines
Contributor Author

That's right. I have put this comment back in. I had originally deleted it because I thought it was wrong, but now I can clearly see that the splines are returned as is from scipy's BSpline.

Comment on lines 1101 to 1107
    # Replace any output that corresponds to a missing input value
    # by zero.
    for spline_idx in range(n_splines):
        output_feature_idx = n_splines * feature_idx + spline_idx
        XBS[
            nan_row_indices, output_feature_idx : output_feature_idx + 1
        ] = 0
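
As an aside, a hedged sketch of a vectorized equivalent of this loop (assuming XBS is a dense ndarray and nan_row_indices is an integer index array; the sparse path would need separate handling):

    # Zero every spline column of this feature for the nan rows at once.
    start = n_splines * feature_idx
    XBS[nan_row_indices, start : start + n_splines] = 0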
@StefanieSenger (Contributor Author) commented Jun 11, 2025

I have tried to do this in this commit, the tests seem to pass, but I'm not very happy about the result. It looks even more confusing.

Ideally, for me, this transform method would be structured clearly even at the cost of some repeated code lines - something either like:

if use_sparse:
    if extrapolation == "continue":
        ...
    if extrapolation == "constant":
        ...
    if extrapolation == "error":
        ...
    ...
else:
    if extrapolation == "continue":
        ...
    if extrapolation  ...

or like:

if extrapolation == "continue":
    if use_sparse:
        ...
    else:
        ...
if extrapolation == "constant":
    if use_sparse:
        ...
    else:
        ...
if extrapolation == "error":
        ...
...

(So it would have two main branches, either on use_sparse first or on extrapolation first, and be fully explicit inside each.)

I didn't manage to do this refactoring in this same PR, which was already challenging enough as it was - probably because I hadn't discovered all the spots where the code is zipped together across control-flow branches (like I did in the last commit, if that makes sense). Next time I touch something like that, I will make sure to try some refactoring before starting to add new functionality.

Please let me know what you think of this commit, @lorentzenchr, as I am not sure it is easier to read.

Comment on lines 1049 to 1050
    if use_sparse:
        # Copy the current column to avoid mutation of the user provided
Contributor Author

I am not 100% sure whether this comment refers to the same handling of x as this comment? It probably doesn't, right? If so, I don't understand it and didn't address it. In that case, could you please elaborate, @lorentzenchr?

Successfully merging this pull request may close these issues.

Handle np.nan / missing values in SplineTransformer