
Prevent scalers from scaling near-constant features to very large values #19527


Merged
merged 6 commits on Feb 25, 2021

Conversation

ogrisel
Member

@ogrisel ogrisel commented Feb 22, 2021

Fixes #19450 (an edge case of a test initially written for #19426).

The original problem happens when fitting StandardScaler(with_mean=False) with sample_weight on constant (but non-zero) columns. Because of the sample weight, the variance is not exactly zero but a very small value.

This PR tries to solve the problem by using a dtype-dependent eps threshold to detect constant features instead of a hard var == 0 test. I added more tests and also adapted sklearn.linear_model._base._preprocess_data so that linear models with normalize=True behave consistently.
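The idea of a dtype-dependent threshold can be sketched as follows. This is a hypothetical helper, not the PR's exact criterion: the tolerance is derived from the machine epsilon of the data's dtype and scaled by the squared mean, so that large constant features whose computed variance carries rounding noise are still detected.

```python
import numpy as np

def is_near_constant(X):
    # Hypothetical helper, NOT the PR's exact criterion: flag a feature as
    # constant when its variance falls below a tolerance derived from the
    # machine epsilon of the data's dtype, scaled by the squared mean.
    var = X.var(axis=0)
    mean = X.mean(axis=0)
    eps = np.finfo(X.dtype).eps
    return var <= X.shape[0] * eps * np.maximum(mean ** 2, 1.0)

X = np.array([[100.0, 1.0],
              [100.0, 2.0],
              [100.0, 3.0]])
print(is_near_constant(X))  # [ True False]
```

With a float32 input, `np.finfo(X.dtype).eps` is about 1.2e-7 instead of 2.2e-16, so the tolerance adapts to the precision actually available.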

Remaining issues:

  • sklearn.linear_model._base._preprocess_data is equivalent to StandardScaler up to a factor of np.sqrt(X.shape[0]) for all features except the ones that are constant and non-zero. Those features are typically useless and get absorbed into the intercept, but because of the interaction with regularization, the matching coef and intercept values will differ even when the predictions are the same. Maybe this is not a problem. /cc @agramfort @maikia.

  • There is a numerical stability issue when computing the variance of large but constant features on sparse data: see the tests marked XFAIL. For instance, with a column with a constant value of 100., we get a variance that is far from zero (it depends on n_samples). This should probably be tackled in a separate PR and is tracked by a dedicated bug report: Weighted variance computation for sparse data is not numerically stable #19546.

  • Lastly, constant features would stay large constant features after StandardScaler. This might feel surprising to some users. Shall we warn when we detect them? This would risk spurious warnings when cross-validating on small-ish datasets, which could be annoying in practice. Update: we discussed this point at the last dev meeting and the shared opinion was to let those features pass through. Maybe we should add a warning, but this can be tackled in a dedicated PR. I opened an issue to track the discussion on this point: [RFC] Should scalers or other estimators warn when fit on constant features? #19547.
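The passthrough behavior discussed in the last point can be sketched as follows (a simplified model, not the PR's actual code): features detected as constant get a scale of 1, so dividing by the scale leaves them unchanged instead of blowing them up by dividing by a near-zero standard deviation.

```python
import numpy as np

X = np.array([[100.0, 1.0],
              [100.0, 2.0],
              [100.0, 3.0]])
var = X.var(axis=0)
eps = np.finfo(X.dtype).eps
# dtype-dependent detection of (near-)constant features
constant_mask = var <= X.shape[0] * eps * np.maximum(X.mean(axis=0) ** 2, 1.0)

scale = np.sqrt(var)
scale[constant_mask] = 1.0  # passthrough: never divide by a near-zero std
X_scaled = X / scale
print(X_scaled[:, 0])  # [100. 100. 100.] -- the constant column is untouched
```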

@thomasjpfan
Member

I am guessing we can not use ptp, like we do in VarianceThreshold, because of the sample weights?

if hasattr(X, "toarray"):   # sparse matrix
    _, self.variances_ = mean_variance_axis(X, axis=0)
    if self.threshold == 0:
        mins, maxes = min_max_axis(X, axis=0)
        peak_to_peaks = maxes - mins
else:
    self.variances_ = np.nanvar(X, axis=0)
    if self.threshold == 0:
        peak_to_peaks = np.ptp(X, axis=0)

if self.threshold == 0:
    # Use peak-to-peak to avoid numeric precision issues
    # for constant features
    compare_arr = np.array([self.variances_, peak_to_peaks])
    self.variances_ = np.nanmin(compare_arr, axis=0)
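To make the quoted snippet concrete, here is a self-contained illustration of the peak-to-peak trick on dense data (illustrative values): for a constant column, np.ptp (max minus min) is exactly 0 even when the computed variance carries floating-point noise, so taking the elementwise minimum recovers an exact zero.

```python
import numpy as np

# min(variance, peak-to-peak) yields an exact 0.0 for constant features
X = np.array([[100.0, 1.0],
              [100.0, 2.0],
              [100.0, 3.0]])
variances = np.nanvar(X, axis=0)
peak_to_peaks = np.ptp(X, axis=0)
variances = np.nanmin(np.array([variances, peak_to_peaks]), axis=0)
print(variances)  # the first entry is exactly 0.0
```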

@ogrisel
Member Author

ogrisel commented Feb 22, 2021

I am guessing we can not use ptp, like we do in VarianceThreshold, because of the sample weights?

That's interesting, but indeed I am not sure how ptp should account for sample weights, and the problem is most often triggered when using sample weights in practice.
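To illustrate why sample weights matter here, the following sketch (with made-up weights) computes the weighted variance of a constant column two ways; note that np.ptp takes no weights at all.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.full(1000, 100.0)                # constant (non-zero) feature
w = rng.uniform(0.5, 1.5, size=1000)    # non-trivial sample weights

mean = np.average(x, weights=w)
# One-pass formula (common for sparse inputs): prone to catastrophic
# cancellation, typically a tiny nonzero value rather than exactly 0.0.
var_naive = np.average(x ** 2, weights=w) - mean ** 2
# Two-pass formula: much smaller error, but still not guaranteed to be 0.0,
# so a hard `var == 0` check can miss the constant feature.
var_two_pass = np.average((x - mean) ** 2, weights=w)
print(var_naive, var_two_pass)

# np.ptp is exactly 0 for a constant column, but it ignores the weights
# entirely (samples with zero weight would still be counted).
print(np.ptp(x))
```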

Contributor

@maikia maikia left a comment


LGTM thanks for taking care of this issue @ogrisel

@ogrisel
Member Author

ogrisel commented Feb 24, 2021

@thomasjpfan @maikia @agramfort I addressed the comments that I think are directly related to the PR and opened dedicated issues for follow-up work: #19546 and #19547.

I think this PR could be merged as it is without waiting for either of the follow-up issues as it's already a bug fix as it stands.

Member

@agramfort agramfort left a comment


LGTM

thx @ogrisel and @maikia

Contributor

@maikia maikia left a comment


LGTM too. Thanks @ogrisel

Member

@rth rth left a comment


Thanks for fixing this!


Successfully merging this pull request may close these issues.

linear_model with normalize and StandardScaler lead to faulty results with weighted constant features
5 participants