ENH Add support for np.nan values in SplineTransformer #28043

Open · wants to merge 68 commits into base: main

Conversation

@StefanieSenger (Contributor) commented Jan 2, 2024

Reference Issues/PRs

Closes #26793

What does this implement/fix? Explain your changes.

Adds support for np.nan values in SplineTransformer.

  • adds a param handle_missing : {'error', 'zeros'} to __init__, where 'error' preserves the previous behaviour and 'zeros' handles nan values by setting their spline values to all 0s (see the sketch after this list)
  • adds new tests
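
For illustration, a minimal sketch of the proposed usage (handle_missing is the option added by this PR and is not part of any released scikit-learn; the exact spline values depend on the fitted knots):

    import numpy as np
    from sklearn.preprocessing import SplineTransformer

    X = np.array([[1.0], [2.0], [np.nan], [4.0]])

    # Default: keep the previous behaviour and raise on nan values.
    # SplineTransformer(handle_missing="error").fit_transform(X)  # ValueError

    # New option: rows containing nan get all-zero spline values.
    splt = SplineTransformer(n_knots=3, degree=2, handle_missing="zeros")
    XBS = splt.fit_transform(X)
    print(XBS[2])  # all zeros for the row that contained the nan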

(very outdated, should have put it in a separate comment:)
Yet to solve:

  1. I believe that in _get_base_knot_positions I have to prepare _weighted_percentile to exclude nan values, similarly to how np.nanpercentile excludes nan values for the calculation of the base knots. I tried, but it was quite tricky. Edit: Just found that np.nanpercentile will soon get a sample_weight option: PR 24254 in numpy

  2. Should an error also be raised if the SplineTransformer was instantiated with handle_missing="error", then fitted without missing values, and the X passed to transform contains missing values?

github-actions bot commented Jan 2, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit 2b37259.

@StefanieSenger changed the title from "ENH Adds support for np.nan values in SplineTransformer" to "ENH Add support for np.nan values in SplineTransformer" Jan 8, 2024
@ogrisel (Member) left a comment

The PR looks very good but it needs to be merged with main (there are conflicts in the changelog).

Also, I think the get_feature_names_out() method needs to be updated. The tests should be expanded accordingly, maybe to also include a test with .set_output(transform="pandas") (this is how I found out that there was a problem with the output feature names).

@ogrisel (Member) commented Feb 29, 2024

So, should sparse_output=True and handle_missing="indicator" be prevented from being used together (and show an explicit error message)?

I think we should add support for using those two options together.

@ogrisel (Member) left a comment

Here is a more in-depth review pass. There is indeed a fundamental problem with the current code: the missingness indicators from the training set (when calling .fit or .fit_transform) should not be stored as an estimator attribute and reapplied to the test set (when calling .transform). Instead, the missingness pattern should be extracted from the test set itself.

See more details below:

    if self.include_bias:
        return XBS
    return self._concatenate_indicator(XBS)
Member

The missingness indicators computed from the X passed to .transform (which can be a test set) should be passed as argument to _concatenate_indicator instead of reusing the mask extracted from the training set.
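
For illustration, a hedged sketch of the suggested fix (the helper structure here is hypothetical and only mirrors the diff context above; the actual PR code may differ):

    def transform(self, X):
        ...
        # Extract the missingness pattern from the data being transformed,
        # not from state stored at fit time.
        missing_mask = np.isnan(X)
        XBS = ...  # compute the spline basis features for X
        if self.include_bias:
            return XBS
        return self._concatenate_indicator(XBS, missing_mask)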

@StefanieSenger (Contributor Author) left a comment

Hey @ogrisel, thanks for reviewing and for your help.
I went through your comments and could resolve most of the issues.

I've named the new option handle_missing="constant", but that's just an idea. I found that indicator no longer fits so well if we don't add an indicator column to X. Though with constant as well as with zeros, I feel it's not quite clear from the naming where in the process the nans become something else (before or after calculating the splines). Maybe we can find a name that conveys that info.

There are quite a few things I am a bit confused about:
Generally, I don't know whether we want SplineTransformer to change or to keep its behaviour if nan values are present.

If we want it to keep behaviour, instead of having this test data for comparing equality:

    X_nan = np.array([[1, 1], [2, 2], [3, 3], [np.nan, 4], [4, 4]])
    X = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])

it should maybe rather be

    X_nan = np.array([[1, 1], [2, 2], [3, 3], [np.nan, 4], [4, 4]])
    X = np.array([[1, 1], [2, 2], [3, 3], [99, 4], [4, 4]])

and in this case, the current implementation is wrong. Maybe you can shed some light on this so that I know how to go on.

I will check the issue with the feature names next.

@StefanieSenger (Contributor Author) commented

I was trying to track down the problem with the feature names that you mentioned here, @ogrisel, but I cannot recreate it. Maybe it was resolved while I worked on the other issues?

This is what I tried (using the code from the existing feature_name_out test):

def test_spline_transformer_feature_names_with_nans():
    """Test that SplineTransformer generates correct feature names if nan values are present."""
    X_nan = np.array([[1, 1], [2, 2], [3, 3], [np.nan, 4], [4, 4]])
    splt = SplineTransformer(
        degree=3,
        n_knots=3,
        handle_missing="constant",
        include_bias=True).fit(X_nan)
    feature_names = splt.get_feature_names_out()
    assert_array_equal(
        feature_names,
        [
            "x0_sp_0",
            "x0_sp_1",
            "x0_sp_2",
            "x0_sp_3",
            "x0_sp_4",
            "x1_sp_0",
            "x1_sp_1",
            "x1_sp_2",
            "x1_sp_3",
            "x1_sp_4",
        ],
    )

    splt = SplineTransformer(
        degree=3,
        n_knots=3,
        handle_missing="constant",
        include_bias=False).fit(X_nan)
    feature_names = splt.get_feature_names_out(["a", "b"])
    assert_array_equal(
        feature_names,
        [
            "a_sp_0",
            "a_sp_1",
            "a_sp_2",
            "a_sp_3",
            "b_sp_0",
            "b_sp_1",
            "b_sp_2",
            "b_sp_3",
        ],
    )

    splt.set_output(transform="pandas")
    X_transformed = splt.transform(X_nan)
    feature_names = splt.get_feature_names_out(["a", "b"])
    assert_array_equal(
        feature_names,
        [
            "a_sp_0",
            "a_sp_1",
            "a_sp_2",
            "a_sp_3",
            "b_sp_0",
            "b_sp_1",
            "b_sp_2",
            "b_sp_3",
        ],
    )

Everything behaves as it should, I believe. But maybe I didn't understand what exactly you ran into.

@ogrisel (Member) commented Mar 15, 2024

The get_feature_names problem only happens when appending extra output features for the missing indicators (they would then need to be named in that case).

EDIT: I will try to answer your other questions/comments early enough next week.

@ogrisel (Member) commented Mar 15, 2024

About handle_missing="constant": I really prefer handle_missing="zero", handle_missing="zeros" or handle_missing="ignore", to convey either that missing values are encoded as zero spline feature values, or that they are encoded in a value such that a simple downstream linear model would basically "ignore" them.

@StefanieSenger marked this pull request as ready for review April 18, 2024 13:51
@StefanieSenger (Contributor Author) commented Apr 18, 2024

Hey @ogrisel, can you give me some feedback?

My current understanding is that if we introduce new 0-values in X_transformed (due to nan values in X), then we also expect different stats for the transformer compared to when no nan values are present.

This would mean that we expect (and test for)

  • a different output format in case of nans (I have already written the tests like this)

  • that the B-splines won't sum up to 1 * n_features per row anymore (something I still need to define for the test); see the sketch after this list
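
For illustration, a hedged sketch of that second expectation, assuming the PR branch (handle_missing is not in released scikit-learn):

    import numpy as np
    from sklearn.preprocessing import SplineTransformer

    X = np.array([[1.0], [2.0], [np.nan], [4.0]])
    splt = SplineTransformer(n_knots=3, degree=2, handle_missing="zeros")
    XBS = splt.fit_transform(X)

    row_sums = XBS.sum(axis=1)
    # Complete rows keep the partition-of-unity property (their splines
    # sum to 1 per input feature), while the nan row sums to 0 because
    # all of its spline values were zeroed.
    assert np.allclose(row_sums[[0, 1, 3]], 1.0)
    assert row_sums[2] == 0.0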

@ogrisel added this to the 1.7 milestone Apr 18, 2025
@ogrisel (Member) commented Apr 18, 2025

I have the feeling that this PR is nearly ready to get in, so I milestoned it for 1.7, but no problem if it's not ready in time.

@StefanieSenger (Contributor Author) commented

Hi, @ogrisel. Thanks for revisiting this PR and for the milestone assignment for version 1.7.

For context, I had kept this PR updated, made it pass all tests and ensured it was merge-ready, but after several months without much engagement from the team, I shifted my attention to other tasks, as the priority wasn’t clear.

I'm currently on vacation and will take this back up on my return. I look forward to finalizing it together.

@StefanieSenger (Contributor Author) commented

Hi @ogrisel, I have worked on what you had proposed and the tests now all pass.
Do you want to have a look?


    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.input_tags.allow_nan = self.handle_missing == "zeros"
Contributor Author

This way of defining the nan-support tag is per instance, which I think is most user-friendly and is also needed to pass all the common tests. With an unconditional tags.input_tags.allow_nan = True (or False), one of the common tests fails either way, so we cannot do that.

However, doc/sphinxext/allow_nan_estimators.py, which automatically generates a list of nan-accepting estimators, does so with the default parameter settings (almost as if the tags were defined at class level).
I think I would like to work on a follow-up PR to fix the generation and take cases like this into account. The question would be how broadly to scope that. What do you think, @adrinjalali?
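
For illustration, a hedged sketch of what the per-instance tag implies (get_tags is public since scikit-learn 1.6; handle_missing requires this PR's branch):

    from sklearn.preprocessing import SplineTransformer
    from sklearn.utils import get_tags

    # The same class reports different nan support depending on its
    # constructor arguments:
    assert get_tags(SplineTransformer(handle_missing="zeros")).input_tags.allow_nan
    assert not get_tags(SplineTransformer(handle_missing="error")).input_tags.allow_nan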

Member

It's a bit sad indeed. But I think we can live with the fact that this list only reflects the default behavior, and that some estimators can be made to accept nans with specific hyper-parameters.

Member

I think we should make sure _construct_instances creates at least one instance that supports nan, and, instead of checking only the first instance returned by the generator, check them all here:

    # Here we generate the text only for one instance. This directive
    # should not be used for meta-estimators where tags depend on the
    # sub-estimator.
    est = next(_construct_instances(est_class))

But certainly for a separate PR.
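
For illustration, a hedged sketch of that suggestion, reusing the names from the snippet above (the real directive code may differ):

    # Consider the estimator nan-capable if any constructed instance
    # advertises nan support, not only the first one.
    supports_nan = any(
        get_tags(est).input_tags.allow_nan
        for est in _construct_instances(est_class)
    )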

@ogrisel modified the milestones: 1.7, 1.8 May 16, 2025
@ogrisel (Member) left a comment

LGTM. Thank you very much for the final push @StefanieSenger!

@ogrisel (Member) commented May 16, 2025

cc @lorentzenchr for a second review.

@lorentzenchr (Member) left a comment

Overall looks good.
Feedback: the review was a bit harder than necessary due to the renaming of variables.

Specifies the way missing values are handled.

- 'error' : Raise an error if `np.nan` values are present during :meth:`fit`.
- 'zeros' : Encode missing values as splines with value `0`.
Member

It would be nice to add a short note that this is different from first setting feature values to zero and then creating the spline basis, i.e. pipeline(SimpleImputer(strategy="constant", fill_value=0), SplineTransformer(...)).
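
For illustration, a hedged sketch of the distinction (SimpleImputer and Pipeline are released API; handle_missing is the option proposed in this PR):

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import SplineTransformer

    X = np.array([[np.nan], [1.0], [2.0], [3.0]])

    # Imputing first turns the nan into the value 0, which is then splined
    # like any ordinary point, giving generally non-zero basis values.
    imputed = make_pipeline(
        SimpleImputer(strategy="constant", fill_value=0),
        SplineTransformer(n_knots=3, degree=2),
    ).fit_transform(X)

    # handle_missing="zeros" instead zeroes the spline outputs of the nan
    # row after the basis is computed, so that row is all zeros.
    direct = SplineTransformer(
        n_knots=3, degree=2, handle_missing="zeros"
    ).fit_transform(X)

    assert not np.allclose(imputed[0], 0.0)
    assert np.allclose(direct[0], 0.0)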

Contributor Author

I agree it's important. I've added a note.

Similarly, "Encode missing values as splines with value 0." was not so clear. I've corrected it to "Encode splines of missing values with values 0."

Comment on lines +862 to +865
"`X` contains invalid values (NaNs) and `SplineTransformer` is "
"configured with `handle_missing='error'`. Set "
"`handle_missing='zeros'` to encode missing values as splines with "
"value `0` or ensure no missing values in `X`."
Member

Suggested change
"`X` contains invalid values (NaNs) and `SplineTransformer` is "
"configured with `handle_missing='error'`. Set "
"`handle_missing='zeros'` to encode missing values as splines with "
"value `0` or ensure no missing values in `X`."
"Input X contains NaN values and `SplineTransformer` is configured "
"to error in this case (handle_missing='error'). To avoid this error, you "
"could set handle_missing='zeros' to encode missing values as splines with "
"value 0 or ensure no missing values in X."

Please tell our fellow scikit-learn (core) contributors to stop using backticks in printed output, error and warning messages as well as code comments.
Disclaimer: I might have introduced some of those in the past.

Contributor Author

Why is that? My impression was that maintainers encourage the use of backticks in error messages, or did I get this wrong?

Member

I don’t like it. Whether others agree should be settled elsewhere. For this PR, let’s not introduce new backticks.

Member

We're certainly not consistent with backticks, and some people use them more than others. I personally don't bother asking people to add or remove them.

sparse_output=sparse_output,
)

# Check `fit_transform` does the same as `fit` and then `transform`:
Member

Suggested change
# Check `fit_transform` does the same as `fit` and then `transform`:
# Check fit_transform does the same as fit and then transform:

While I'm sorry for the noise, I don't like backticks in comments. They don't help readability for me.

Member

I personally like them as they help disambiguate the named API entities from the English verbs, especially in sentences that would otherwise be grammatically incorrect, as is the case in this particular snippet.

Contributor Author

Backticks for named entities also help me read comments more intuitively. Would it be okay with you to keep them here, @lorentzenchr?

Comment on lines +529 to +530
# Check that `transform` works as expected when the passed data has a different
# shape than the training data passed to `fit`:
Member

Suggested change
# Check that `transform` works as expected when the passed data has a different
# shape than the training data passed to `fit`:
# Check that transform works as expected when the passed data has a different
# shape than the training data passed to fit:

StefanieSenger and others added 8 commits May 20, 2025 15:39
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@StefanieSenger (Contributor Author) left a comment

Thank you @lorentzenchr. I have implemented your suggestions and left comments on two that I would rather leave as they are (the copy() and the backticks). See my comments for the reasoning.


    # later.
    x[mask_inv] = spl.t[self.degree]
    outside_range_mask = ~inside_range_mask
    x = X[:, feature_idx].copy()
Contributor Author

I think we do need the copy. That was actually pretty hard for me to figure out, because I was not familiar with views on arrays back then and had naively implemented everything without a copy. I have also checked again: without a copy, we change X via x[outside_range_mask] = xmin, and it matters because we rely on X earlier.

(I think it is in

    x = spl.t[spl.k] + (X[:, feature_idx] - spl.t[spl.k]) % (
        spl.t[n] - spl.t[spl.k]
    )

above.)

We get a lot of errors too without the copy.
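
For illustration, a minimal standalone demonstration of the view-vs-copy behaviour in numpy (independent of the PR code):

    import numpy as np

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    x = X[:, 0]          # a view: writing through x mutates X
    x[0] = 99.0
    print(X[0, 0])       # 99.0 -- the caller's array was changed

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    x = X[:, 0].copy()   # an independent copy
    x[0] = 99.0
    print(X[0, 0])       # 1.0 -- X is untouched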

Comment on lines 1035 to 1036
# Note for extrapolation:
# 'continue' is already returned as is by scipy BSplines
Contributor Author

That's right. I have put this comment back in. I had originally deleted it because I thought it was wrong, but now I can clearly see that the splines are returned as is from scipy's BSpline.

Comment on lines 1101 to 1107
    # Replace any output that corresponds to a missing input value
    # by zero.
    for spline_idx in range(n_splines):
        output_feature_idx = n_splines * feature_idx + spline_idx
        XBS[
            nan_row_indices, output_feature_idx : output_feature_idx + 1
        ] = 0
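
As an aside, a hedged sketch of a vectorized equivalent of this loop (assuming XBS is a dense ndarray and nan_row_indices is an integer index array; the sparse path would need separate handling):

    # Zero every spline column of this feature for the nan rows at once.
    start = n_splines * feature_idx
    XBS[nan_row_indices, start : start + n_splines] = 0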
@StefanieSenger (Contributor Author) commented Jun 11, 2025

I have tried to do this in this commit, the tests seem to pass, but I'm not very happy about the result. It looks even more confusing.

Ideally, for me, this transform method would be structured clearly even at the cost of some repeated code lines - something either like:

if use_sparse:
    if extrapolation == "continue":
        ...
    if extrapolation == "constant":
        ...
    if extrapolation == "error":
        ...
    ...
else:
    if extrapolation == "continue":
        ...
    if extrapolation  ...

or like:

if extrapolation == "continue":
    if use_sparse:
        ...
    else:
        ...
if extrapolation == "constant":
    if use_sparse:
        ...
    else:
        ...
if extrapolation == "error":
        ...
...

(So it would have two main branches, either on use_sparse first or on extrapolation first, and be fully explicit inside each.)

I didn't manage to do this refactoring in this same PR, which was already challenging enough as it was - probably because I hadn't discovered all the spots where the code is zipped together across control-flow branches (like I did in the last commit, if that makes sense). Next time I touch something like that, I will make sure to try some refactoring before starting to add new functionality.

Please let me know what you think of this commit, @lorentzenchr, as I am not sure it is easier to read.

Comment on lines 1049 to 1050
    if use_sparse:
        # Copy the current column to avoid mutation of the user provided
Contributor Author

I am not 100% sure whether this comment refers to the same handling of x as this comment? It probably doesn't, right? If so, I don't understand it and didn't address it. In that case, could you please elaborate, @lorentzenchr?

Successfully merging this pull request may close these issues.

Handle np.nan / missing values in SplineTransformer