[MRG] Ignore and pass-through NaN values in MaxAbsScaler and maxabs_scale #11011
Conversation
Travis is reporting test failures. You might need to use `.A` or, equivalently, `.toarray()` to assert things about sparse matrices in your tests.
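For instance, a minimal sketch of that pattern (the data here is hypothetical, not from the PR's tests):

```python
import numpy as np
from scipy import sparse

# Hypothetical data: a CSR matrix with an explicitly stored NaN entry.
X_sp = sparse.csr_matrix(np.array([[1.0, np.nan], [0.0, 2.0]]))

# Assertions and comparison helpers expect dense arrays, so densify
# first with .toarray() (or the equivalent .A attribute).
assert np.isnan(X_sp.toarray()).any()
assert np.isnan(X_sp.A).any()
```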
assert np.any(np.isnan(X_test), axis=0).all()

X = sparse_format_func(X)

X_test[:, 0] = np.nan  # make sure this boundary case is tested
You should do this before converting to sparse
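In other words (a sketch using the names from the hunk above, where `sparse_format_func` is e.g. `scipy.sparse.csr_matrix`), the boundary case should be set while the array is still dense:

```python
X_test[:, 0] = np.nan                 # set the all-NaN column on the dense array
X_test = sparse_format_func(X_test)   # only then convert to sparse
```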
assert np.any(np.isnan(X_train), axis=0).all()
assert np.any(np.isnan(X_test), axis=0).all()

X = sparse_format_func(X)
You're only converting X to sparse, not X_train or X_test
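That is, the conversion presumably needs to be applied to each array (sketch only, same names as in the hunks above):

```python
X = sparse_format_func(X)
X_train = sparse_format_func(X_train)
X_test = sparse_format_func(X_test)
```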
Thanks for this PR @LucijaGregov! @ reviewers: there is an updated version of the
@LucijaGregov yes, I think that would be simpler. Also, please add a link to the general issue about this under the "Reference Issues/PRs" section of the issue description.
@rth I am not sure if I understood you correctly. This is not mentioned in the 'Issues' list. Do you mean that I should add this as an issue?
I meant to link to #10404 in the description to provide some context for reviewers, as was done in your previous PR. No worries, I added it here.
@rth Right, I missed that bit. Thank you.
#11012 was merged; could you please click on the "Resolve conflicts" button to merge with master and also remove any redundant changes from this PR for
@rth Done. I will continue working on this.
I was checking how we implement the min/max detection for sparse matrices, and it does not seem that easy. We rely on the scipy implementation, which uses the … The problem with those …
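For illustration, a minimal sketch of what a NaN-ignoring column-wise max-abs could look like for a CSC matrix; this is not the scipy code path discussed above, and the helper name `nan_max_abs_by_column` is made up:

```python
import numpy as np
from scipy import sparse

def nan_max_abs_by_column(X):
    """Column-wise max of |X| for a CSC matrix, skipping stored NaNs.

    Implicit zeros contribute 0, mimicking np.nanmax(np.abs(X.toarray()),
    axis=0), except that an all-NaN dense column yields 0 here instead
    of NaN.
    """
    X = sparse.csc_matrix(X)
    maxabs = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        col = X.data[X.indptr[j]:X.indptr[j + 1]]  # stored entries of column j
        col = col[~np.isnan(col)]                  # ignore stored NaNs
        if col.size:
            maxabs[j] = np.abs(col).max()
    return maxabs
```

Only the explicitly stored values need to be scanned, since implicit zeros can never exceed the running maximum of absolute values.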
@jnothman did I miss a functionality in scikit-learn which already provides such functions?
@glemaitre @jnothman Is it safe to continue working on this as it is, or do you have something else in mind to do first?
Why would it be unsafe?
@jnothman I meant whether I need to wait for further things to be merged because of the comments above, but I guess it is good to go.
Basically, #11011 (comment) is something that you might need to go through to make things work :) Then, whenever you think you have a potential solution, you can implement it or we can discuss it here.
pep8 is failing
@amueller I know, it is work in progress.
Note that you will need to change
Why do we have
I think it makes sense to deprecate it. Sent from my phone - sorry to be brief and for potential misspellings.
I opened the issue #11300 to reach a consensus on what to do. I just want to be sure that
@LucijaGregov I made a quick push of what was missing to make the PR work.
LGTM
@@ -27,7 +29,8 @@ def _get_valid_samples_by_column(X, col):

 @pytest.mark.parametrize(
     "est, func, support_sparse",
-    [(MinMaxScaler(), minmax_scale, False),
+    [(MaxAbsScaler(), maxabs_scale, True),
+     (MinMaxScaler(), minmax_scale, False),
      (QuantileTransformer(n_quantiles=10), quantile_transform, True)]
 )
 def test_missing_value_handling(est, func, support_sparse):
I've realised we should have assert_no_warnings in here when fitting and transforming
It is a good point. I just caught that the QuantileTransformer was still raising a warning at inverse_transform. Since it is a single line, I would include it here.
@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014
LGTM
@@ -44,12 +47,17 @@ def test_missing_value_handling(est, func, support_sparse):
     assert np.any(np.isnan(X_test), axis=0).all()
     X_test[:, 0] = np.nan  # make sure this boundary case is tested

-    Xt = est.fit(X_train).transform(X_test)
+    with pytest.warns(None) as records:
Why not use `sklearn.utils.testing.assert_no_warnings`?
I wanted to stick to pytest while awaiting this feature: pytest-dev/pytest#1830
The second point is that I find

with pytest.warns(None):
    X_t = est.whatever(X)

more readable than

X_t = assert_no_warnings(est.whatever, X)

@ogrisel are you also in favor of `assert_no_warnings`? If yes, it's 2 vs 1 and I will make the change :)
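For reference, a runnable sketch of the two styles under discussion (hypothetical data and estimator; `assert_no_warnings` takes the callable followed by its arguments, and `pytest.warns(None)` was the accepted idiom in pytest versions of that era):

```python
import numpy as np
import pytest
from sklearn.preprocessing import MaxAbsScaler
from sklearn.utils.testing import assert_no_warnings

X = np.array([[1.0], [-2.0], [4.0]])
est = MaxAbsScaler().fit(X)

# Context-manager style: record all warnings, then assert none were raised.
with pytest.warns(None) as records:
    Xt = est.transform(X)
assert len(records) == 0

# Helper style: the callable and its arguments are passed separately.
Xt = assert_no_warnings(est.transform, X)
```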
OK, this is fine as it is.
@glemaitre This PR needs a merge from the current master.
Nice syntactic magic would be `if not pytest.warns(...)`.
@glemaitre can you please merge master to check that the tests still pass?
Actually, wrong mention. I meant @LucijaGregov instead of @glemaitre, sorry.
I think Guillaume will not be online for most of the day, so if you want to merge this, it might be easier to quickly do the merge of master yourself.
Actually, let me push the merge commit myself.
I'll let you do the merging. I will not be at the keyboard this morning. Sent from my phone - sorry to be brief and for potential misspellings.
-    Xt_col = est.transform(X_test[:, [i]])
+    with pytest.warns(None) as records:
+        Xt_col = est.transform(X_test[:, [i]])
+    assert len(records) == 0
This new test_common.py assertion breaks with the PowerTransformer, which complains about all-NaN columns.
I am not sure if we should raise this warning or not (maybe not at transform time), but this should be consistent across all the transformers.
I made a push. The cause was the np.nanmin(X) call used to check that the matrix is strictly positive. This case will return a full-NaN matrix as well, so everything will be fine (or at least the problem is forwarded to the next step in the pipeline).
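For context, a small sketch of the behavior in question: `np.nanmin` returns NaN for an all-NaN slice and emits a RuntimeWarning along the way:

```python
import warnings
import numpy as np

X = np.array([[np.nan, 1.0],
              [np.nan, 2.0]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    col_min = np.nanmin(X, axis=0)  # first column is all-NaN

assert np.isnan(col_min[0])                              # NaN is propagated
assert any("All-NaN" in str(w.message) for w in caught)  # and a warning fired
```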
Up to now, none of the transformers raise a warning. We should use an np.errstate. Is it a warning or an error? Sent from my phone - sorry to be brief and for potential misspellings.
Reference Issues/PRs
Related to #10404
What does this implement/fix? Explain your changes.
MaxAbsScaler and maxabs_scale now ignore and pass through NaN values.
Any other comments?