[MRG] ENH: Ignore NaNs in StandardScaler and scale #11206
Conversation
Force-pushed from 0dcc9a8 to 591c18f
Yay! Why WIP? A todo list?
The things missing as far as I can tell are tests:

- partial_fit / _incremental_mean_and_var with NaN
- equivalence of scale and StandardScaler in the presence of NaNs

Perhaps also a quick benchmark of _incremental_mean_and_var where there are no NaNs. (I wonder, for instance, whether it's worth rewriting _incremental_mean_and_var in Cython with nogil.)
sklearn/preprocessing/data.py (outdated)
@@ -656,7 +658,7 @@ def partial_fit(self, X, y=None):
         # First pass
         if not hasattr(self, 'n_samples_seen_'):
             self.mean_ = .0
-            self.n_samples_seen_ = 0
+            self.n_samples_seen_ = np.zeros(X.shape[1], dtype=np.int32)
Am I being overly cautious if I worry that this changes public API? We could squeeze this back to a single value if np.ptp(n_samples_seen_) == 0 after each partial_fit.
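A minimal sketch of the np.ptp squeeze suggested here, assuming n_samples_seen_ has already been made per-feature; the array values are illustrative, not from the PR:

```python
import numpy as np

# Per-feature counts after partial_fit; all equal means no NaNs were seen.
n_samples_seen_ = np.array([500, 500, 500], dtype=np.int32)

# np.ptp (max - min) is 0 exactly when every feature saw the same number
# of samples, so the count can collapse back to the old scalar form.
if np.ptp(n_samples_seen_) == 0:
    n_samples_seen_ = int(n_samples_seen_[0])  # backward-compatible scalar

print(n_samples_seen_)  # 500
```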
I agree it's better to try to preserve backward compat, but maybe instead of using a data-dependent np.ptp check we could do the squeeze only when self.force_all_finite != 'allow-nan'. This way the user is more in control and it's more explicit.
In other feature-wise preprocessing, we're allowing NaNs through by default, under the assumption that extra processing cost is negligible and that downstream estimators will deal with or complain about the presence of NaN. Thus self.force_all_finite does not exist.
Indeed, I misread the code snippet. We don't have much choice but to use the np.ptp trick then.
@@ -25,6 +26,7 @@ def _get_valid_samples_by_column(X, col):
 @pytest.mark.parametrize(
     "est, support_sparse",
     [(MinMaxScaler(), False),
+     (StandardScaler(), False),
      (QuantileTransformer(n_quantiles=10, random_state=42), True)]
 )
 def test_missing_value_handling(est, support_sparse):
Perhaps this should be extended for the partial_fit case?
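A hedged sketch of what such a partial_fit check could look like — the chunking, sizes, and NaN rate here are illustrative, not the actual test the PR ends up adding:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X = rng.rand(100, 4)
X[rng.rand(100, 4) < 0.1] = np.nan  # sprinkle some missing values

# One-shot fit on the full data (NaNs are ignored per feature).
full = StandardScaler().fit(X)

# Incremental fit over chunks should reach the same statistics.
incremental = StandardScaler()
for chunk in np.array_split(X, 5):
    incremental.partial_fit(chunk)

assert np.allclose(full.mean_, incremental.mean_)
assert np.allclose(full.var_, incremental.var_)
```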
     X = [[np.inf, 5, 6, 7, 8]]
     assert_raises_regex(ValueError,
-                        "Input contains NaN, infinity or a value too large",
+                        "Input contains infinity or a value too large",
This is actually a weird message: what is a "value too large" if it's not infinity?
If the values vary too much and computing the scale is not numerically possible (overflow)? If so we should add a specific test for this.
This is not the reason:
>>> from sklearn.preprocessing import scale
>>> import numpy as np
>>> data = np.array([np.finfo(np.float32).max, np.finfo(np.float32).min])
>>> scale(data)
/home/ogrisel/.virtualenvs/py36/lib/python3.6/site-packages/numpy/core/_methods.py:116: RuntimeWarning: overflow encountered in multiply
x = um.multiply(x, x, out=x)
array([ 0., -0.], dtype=float32)
So I would just change the message to "Input contains infinity".
Hmm, this is actually the error message of check_array. Changing this part would require touching quite a lot of unrelated tests. Would it be wiser to do that in another PR?
Tests failing. Ping when you want reviews!
Yep, I have to find the segfault, which I don't get locally. Sent from my phone - sorry for being brief and for potential misspellings.
Tests pass on your system at 4d60bfe? I'm getting 4 failures in test_data.py (but no segfault).
sklearn/preprocessing/data.py (outdated)
                 * X.shape[0])
             print(self.n_samples_seen_)
             counts_nan = sparse.csr_matrix(
                 (np.isnan(X.data), X.indices, X.indptr)).sum(
this needs a shape kwarg to correctly infer the number of columns. That's the source of the errors for me.
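A small illustration of why the shape kwarg matters (toy 2x4 matrix whose last two columns are empty): without it, SciPy infers the column count from max(indices) + 1 and silently drops the trailing empty columns.

```python
import numpy as np
from scipy import sparse

# Two trailing all-zero columns: their indices never appear in X.indices.
X = sparse.csr_matrix(np.array([[1., np.nan, 0., 0.],
                                [0., 2.,     0., 0.]]))

# Rebuild a matrix over the NaN mask of the stored entries; shape=X.shape
# is what keeps the result at 4 columns instead of an inferred 2.
counts_nan = sparse.csr_matrix(
    (np.isnan(X.data), X.indices, X.indptr),
    shape=X.shape).sum(axis=0)

print(counts_nan)  # [[0 1 0 0]]
```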
It would be good to see basic benchmarks of this. Should we be worried about runtime on scaling?
sklearn/preprocessing/data.py
Outdated
self.n_samples_seen_ = X.shape[0] | ||
self.n_samples_seen_ = (np.ones(X.shape[1], dtype=np.int32) | ||
* X.shape[0]) | ||
sparse_constr = (sparse.csr_matrix if X.format == 'csr' |
I wish we could just use X._with_data
Yep, it could be nice to have something like that publicly exposed.
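A sketch of what such a public helper might look like, mirroring the sparse_constr pattern in the diff; with_data is a hypothetical name, not an existing scipy or sklearn function:

```python
import numpy as np
from scipy import sparse

def with_data(X, new_data):
    """Return a matrix with X's sparsity structure but new stored values.

    Hypothetical public stand-in for the private scipy X._with_data.
    """
    constr = sparse.csr_matrix if X.format == 'csr' else sparse.csc_matrix
    return constr((new_data, X.indices.copy(), X.indptr.copy()),
                  shape=X.shape)

X = sparse.csr_matrix(np.array([[1., 0., np.nan],
                                [0., 4., 0.]]))
# Same structure as X, but each stored entry replaced by its NaN flag.
mask = with_data(X, np.isnan(X.data))
print(mask.toarray())
```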
sklearn/utils/extmath.py (outdated)
-    new_sample_count = X.shape[0]
+    new_sample_count = np.nansum(~np.isnan(X), axis=0)
nansum of a bool array?
I will do one now since both sparse and dense are supported. It will give us the big picture.
Dense matrices: ratio of fit time, master / 0.19 [benchmark plot]
Sparse matrices: ratio of fit time, master / 0.19 [benchmark plot]
@jnothman Do you think that the regression for a low number of samples/features is a problem?
Why is fit faster? Does that gain disappear with 1000 features? Why is transform slower? Isn't it unchanged?
Actually I'm confused about your comparing 0.19 to master. Which is this PR?
In [6]: X = np.random.random(10000000)
In [7]: %timeit np.mean(X)
5.54 ms ± 67.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: %timeit np.nanmean(X)
62.2 ms ± 258 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The fluctuation could be due to the sample size.
Sorry, I did not see my mistake.
@jnothman I touched the Cython code slightly so that we support 64-bit indices, as done in #9663. I still have to write a test for partial_fit, but I think this is almost ready for a full review.
An update on the benchmark:
Dense matrices: fit time ratio, PR / 0.19.1 [benchmark plot]
Sparse matrices: fit time ratio, PR / 0.19.1 [benchmark plot]
Benchmarks seem reasonable. Especially as absolute differences in fit time are small.
Please also make sure to test the new n_samples_seen_ shape directly. Otherwise LGTM
sklearn/preprocessing/data.py
Outdated
new calls to fit, but increments across ``partial_fit`` calls. | ||
n_samples_seen_ : int or array, shape (n_features,) | ||
The number of samples processed by the estimator for each feature. | ||
If there is not missing samples, the ``n_samples_seen`` will be an |
is not -> are no
@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014
Not very familiar with this code, but I went through the diff and, apart from a few minor comments, it looks good to me.
@@ -800,6 +836,9 @@ class MaxAbsScaler(BaseEstimator, TransformerMixin):

     Notes
     -----
     NaNs are treated as missing values: disregarded in fit, and maintained in
     transform.
I don't think you updated MaxAbsScaler in this PR?
sklearn/utils/extmath.py (outdated)
         # avoid division by 0
         non_zero_idx = last_sample_count > 0
         updated_unnormalized_variance[~non_zero_idx] = \
             new_unnormalized_variance[~non_zero_idx]
I think it would be more readable to just do the calculation on the full array, and then in the end do updated_unnormalized_variance[non_zero_idx] = 0
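A toy sketch of that compute-then-overwrite pattern, with made-up values; the formula mirrors the update in this diff, and an errstate guard silences the 0/0 at zero-count positions before they are overwritten:

```python
import numpy as np

# Illustrative per-feature state; feature 0 has seen no samples yet.
last_sample_count = np.array([0, 10, 5])
last_unnormalized_variance = np.array([0., 4., 2.])
last_sum = np.array([0., 20., 10.])
new_unnormalized_variance = np.array([3., 5., 1.])
new_sum = np.array([6., 12., 4.])
new_sample_count = np.array([2, 2, 2])
updated_sample_count = last_sample_count + new_sample_count

# Run the combined-variance formula on the full array; positions with
# zero last count produce NaN (0/0) and are fixed up afterwards.
with np.errstate(divide='ignore', invalid='ignore'):
    last_over_new_count = last_sample_count / new_sample_count
    updated = (last_unnormalized_variance + new_unnormalized_variance +
               last_over_new_count / updated_sample_count *
               (last_sum / last_over_new_count - new_sum) ** 2)

zero_idx = last_sample_count == 0
updated[zero_idx] = new_unnormalized_variance[zero_idx]
print(updated)
```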
sklearn/utils/extmath.py
Outdated
last_over_new_count / updated_sample_count * | ||
(last_sum / last_over_new_count - new_sum) ** 2) | ||
new_unnormalized_variance = np.nanvar(X, axis=0) * new_sample_count | ||
last_over_new_count = last_sample_count / new_sample_count |
this can also already divide by zero, and numpy will then give a RuntimeWarning; do we need to ignore this with an errstate?
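For instance, the ratio hits a division by zero wherever new_sample_count is zero (a column with no new non-NaN samples); wrapping the division in np.errstate keeps numpy quiet (toy values):

```python
import numpy as np

last_sample_count = np.array([3., 0.])
new_sample_count = np.array([0., 2.])  # feature 0 saw no new samples

# Without the errstate context, 3./0. raises a RuntimeWarning.
with np.errstate(divide='ignore', invalid='ignore'):
    ratio = last_sample_count / new_sample_count

print(ratio)  # [inf  0.]
```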
sklearn/utils/extmath.py
@@ -664,7 +664,7 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,

     last_variance : array-like, shape: (n_features,)

-    last_sample_count : int
+    last_sample_count : array-like, shape (n_features,)
The default is still 0, which will actually never work. So I would simply remove all the defaults and make the arguments required (they are always passed explicitly anyway, so that shouldn't change anything for the rest).
LGTM once @jorisvandenbossche comments and the following are addressed.
sklearn/preprocessing/data.py
Outdated
else: | ||
self.mean_, self.var_, self.n_samples_seen_ = \ | ||
_incremental_mean_and_var(X, self.mean_, self.var_, | ||
self.n_samples_seen_) | ||
|
||
# for back-compatibility, reduce n_samples_seen_ to an integer if the |
typo: backward-compatibility
@ogrisel This is ready to be merged. The previous failure was PEP8.
Merged! Thanks @glemaitre 🍻
Reference Issues/PRs

Towards #10404.
Supersedes and closes #10457. Supersedes and closes #10618.

What does this implement/fix? Explain your changes.

Any other comments?

TODO:
- sparsefuncs
- n_samples_seen_ -> [MRG+1] FIX: enforce consistency between dense and sparse cases in StandardScaler #11235