[MRG] Ignore and pass-through NaN values in MaxAbsScaler and maxabs_scale #11011
Conversation
Travis is reporting test failures. You might need to use `.A` or, equivalently, `.toarray()` to assert things about sparse matrices in your tests.
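For instance, a minimal sketch of that pattern (the data here is hypothetical, not from the PR's tests):

```python
import numpy as np
from scipy import sparse

# Hypothetical data: a CSR matrix with an explicitly stored NaN entry.
X_sp = sparse.csr_matrix(np.array([[1.0, np.nan], [0.0, 2.0]]))

# Assertions and comparison helpers expect dense arrays, so densify
# first with .toarray() (or the equivalent .A attribute).
assert np.isnan(X_sp.toarray()).any()
assert np.isnan(X_sp.A).any()
```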
assert np.any(np.isnan(X_test), axis=0).all()

X = sparse_format_func(X)

X_test[:, 0] = np.nan  # make sure this boundary case is tested
You should do this before converting to sparse
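In other words (a sketch using the names from the hunk above, where `sparse_format_func` is e.g. `scipy.sparse.csr_matrix`), the boundary case should be set while the array is still dense:

```python
X_test[:, 0] = np.nan                 # set the all-NaN column on the dense array
X_test = sparse_format_func(X_test)   # only then convert to sparse
```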
assert np.any(np.isnan(X_train), axis=0).all()
assert np.any(np.isnan(X_test), axis=0).all()

X = sparse_format_func(X)
You're only converting X to sparse, not X_train or X_test
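That is, the conversion presumably needs to be applied to each array (sketch only, same names as in the hunks above):

```python
X = sparse_format_func(X)
X_train = sparse_format_func(X_train)
X_test = sparse_format_func(X_test)
```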
Thanks for this PR @LucijaGregov! @ reviewers: there is an updated version of the
@LucijaGregov yes, I think that would be simpler. Also, please add a link to the general issue about this under the "Reference Issues/PRs" section of the issue description.
@rth I am not sure if I understood you correctly. This is not mentioned in the 'Issues' list. Do you mean that I should add this as an issue?
I meant to link to #10404 in the description to provide some context for reviewers, as was done in your previous PR. No worries, I added it here.
@rth Right, I missed that bit. Thank you.
#11012 was merged; could you please click on the "Resolve conflicts" button to merge with master and also remove any redundant changes from this PR for
@rth Done. I will continue working on this.
I was checking how we implement the min/max detection for sparse matrices, and it does not seem that easy. We rely on the scipy implementation, which uses the … The problem with those …
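For illustration, a minimal sketch of what a NaN-ignoring column-wise max-abs could look like for a CSC matrix; this is not the scipy code path discussed above, and the helper name `nan_max_abs_by_column` is made up:

```python
import numpy as np
from scipy import sparse

def nan_max_abs_by_column(X):
    """Column-wise max of |X| for a CSC matrix, skipping stored NaNs.

    Implicit zeros contribute 0, mimicking np.nanmax(np.abs(X.toarray()),
    axis=0), except that an all-NaN dense column yields 0 here instead
    of NaN.
    """
    X = sparse.csc_matrix(X)
    maxabs = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        col = X.data[X.indptr[j]:X.indptr[j + 1]]  # stored entries of column j
        col = col[~np.isnan(col)]                  # ignore stored NaNs
        if col.size:
            maxabs[j] = np.abs(col).max()
    return maxabs
```

Only the explicitly stored values need to be scanned, since implicit zeros can never exceed the running maximum of absolute values.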
@jnothman did I miss a functionality in scikit-learn which already provides such functions?
@glemaitre @jnothman Is it safe to continue working on this as it is, or do you have something else in mind to do first?
Why would it be unsafe?
@jnothman I meant whether I need to wait for further things to be merged because of the comments above, but I guess it is good to go.
Basically, #11011 (comment) is something that you might need to go through to make things work :) Then, whenever you think you have a potential solution, you can implement it or we can discuss it here.
pep8 is failing
@amueller I know, it is work in progress.
Note that you will need to change
Why do we have
I think it makes sense to deprecate it. Sent from my phone - sorry to be brief and for potential misspellings.
I opened the issue #11300 to reach a consensus on what to do. I just want to be sure that
@LucijaGregov I made a quick push of what was missing to make the PR work.
LGTM
@@ -27,7 +29,8 @@ def _get_valid_samples_by_column(X, col):

 @pytest.mark.parametrize(
     "est, func, support_sparse",
-    [(MinMaxScaler(), minmax_scale, False),
+    [(MaxAbsScaler(), maxabs_scale, True),
+     (MinMaxScaler(), minmax_scale, False),
      (QuantileTransformer(n_quantiles=10), quantile_transform, True)]
 )
 def test_missing_value_handling(est, func, support_sparse):
I've realised we should have assert_no_warnings in here when fitting and transforming
It is a good point. I just caught that the QuantileTransformer was still raising a warning at inverse_transform. Since it is a single line, I would include it here.
@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014
LGTM
@@ -44,12 +47,17 @@ def test_missing_value_handling(est, func, support_sparse):
     assert np.any(np.isnan(X_test), axis=0).all()
     X_test[:, 0] = np.nan  # make sure this boundary case is tested

-    Xt = est.fit(X_train).transform(X_test)
+    with pytest.warns(None) as records:
Why not use `sklearn.utils.testing.assert_no_warnings`?
I wanted to stick to pytest while awaiting this feature: pytest-dev/pytest#1830
The second point is that I find

with pytest.warns(None):
    X_t = est.whatever(X)

more readable than

X_t = assert_no_warnings(est.whatever, X)

@ogrisel are you also in favor of `assert_no_warnings`? If yes, it's 2 vs 1 and I will make the change :)
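For reference, a runnable sketch of the two styles under discussion (hypothetical data and estimator; `assert_no_warnings` takes the callable followed by its arguments, and `pytest.warns(None)` was the accepted idiom in pytest versions of that era):

```python
import numpy as np
import pytest
from sklearn.preprocessing import MaxAbsScaler
from sklearn.utils.testing import assert_no_warnings

X = np.array([[1.0], [-2.0], [4.0]])
est = MaxAbsScaler().fit(X)

# Context-manager style: record all warnings, then assert none were raised.
with pytest.warns(None) as records:
    Xt = est.transform(X)
assert len(records) == 0

# Helper style: the callable and its arguments are passed separately.
Xt = assert_no_warnings(est.transform, X)
```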
OK, this is fine as it is.
@glemaitre This PR needs a merge from the current master.
Nice syntactic magic would be `if not pytest.warns(...)`.
@glemaitre can you please merge master to check that the tests still pass?
Actually, wrong mention. I meant @LucijaGregov instead of @glemaitre, sorry.
I think Guillaume will not be online for most of the day, so if you want to merge this, it might be easier to quickly do the merge of master yourself.
Actually, let me push the merge commit myself.
I'll let you do the merging. I will not be at the keyboard this morning. Sent from my phone - sorry to be brief and for potential misspellings.
-    Xt_col = est.transform(X_test[:, [i]])
+    with pytest.warns(None) as records:
+        Xt_col = est.transform(X_test[:, [i]])
+    assert len(records) == 0
This new test_common.py assertion breaks with the PowerTransformer, which complains about all-NaN columns.
I am not sure if we should raise this warning or not (maybe not at transform time), but this should be consistent across all the transformers.
I made a push. The cause was the np.nanmin(X) call used to check that the matrix is strictly positive. This case will return a full-NaN matrix as well, so everything will be fine (or at least the problem is forwarded to the next step in the pipeline).
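For context, a small sketch of the behavior in question: `np.nanmin` returns NaN for an all-NaN slice and emits a RuntimeWarning along the way:

```python
import warnings
import numpy as np

X = np.array([[np.nan, 1.0],
              [np.nan, 2.0]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    col_min = np.nanmin(X, axis=0)  # first column is all-NaN

assert np.isnan(col_min[0])                              # NaN is propagated
assert any("All-NaN" in str(w.message) for w in caught)  # and a warning fired
```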
Up to now, none of the transformers raise a warning. We should use an np.errstate. Is it a warning or an error? Sent from my phone - sorry to be brief and for potential misspellings.
Reference Issues/PRs
Related to #10404
What does this implement/fix? Explain your changes.
MaxAbsScaler and maxabs_scale now ignore and pass through NaN values.
Any other comments?