[MRG] Ignore and pass-through NaN values in MaxAbsScaler and maxabs_scale #11011


Merged: 10 commits merged into scikit-learn:master on Jun 23, 2018

Conversation

Contributor

@LucijaGregov commented Apr 22, 2018

Reference Issues/PRs

Related to #10404

What does this implement/fix? Explain your changes.

Any other comments?

@LucijaGregov
Contributor Author

@glemaitre

Member

@jnothman left a comment


Travis is reporting test failures. You might need to use .A or, equivalently, .toarray() to assert things about sparse matrices in your tests.
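For illustration, a minimal sketch (not code from this PR) of the point being made: NumPy assertions generally need a dense array, so densify a scipy.sparse matrix with .toarray() (or the equivalent .A attribute) before asserting on its values.

```python
import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix(np.array([[1.0, 0.0, np.nan],
                            [0.0, 2.0, 3.0]]))
dense = X.toarray()                  # X.A gives the same dense ndarray
assert dense.shape == (2, 3)
assert np.isnan(dense).sum() == 1    # the NaN survives the round-trip
```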

assert np.any(np.isnan(X_test), axis=0).all()

X = sparse_format_func(X)

X_test[:, 0] = np.nan # make sure this boundary case is tested
Member

You should do this before converting to sparse

assert np.any(np.isnan(X_train), axis=0).all()
assert np.any(np.isnan(X_test), axis=0).all()

X = sparse_format_func(X)
Member

You're only converting X to sparse, not X_train or X_test
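A minimal sketch of the fix the reviewer is pointing at, using scipy's csr_matrix as a hypothetical stand-in for the test's parametrized sparse_format_func: every split the assertions touch must be converted, not just one matrix.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical stand-in for the test's parametrized sparse converter.
sparse_format_func = sp.csr_matrix

X_train = np.array([[1.0, np.nan], [2.0, 3.0]])
X_test = np.array([[np.nan, 1.0], [4.0, 5.0]])

# Convert X_train and X_test, not only a single matrix X.
X_train = sparse_format_func(X_train)
X_test = sparse_format_func(X_test)

assert sp.issparse(X_train) and sp.issparse(X_test)
```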

@rth
Member

rth commented Apr 23, 2018

Thanks for this PR @LucijaGregov !

@ reviewers: there is an updated version of the test_missing_value_handling_sparse test in #11012, so it would probably make sense to merge that first and then sync with master here.

@LucijaGregov
Contributor Author

@rth I will then wait for the merge of the test_missing_value_handling_sparse test in #11012 before continuing work on this.

@rth
Member

rth commented Apr 23, 2018

@LucijaGregov yes, I think that would be simpler.

Also please add a link to the general issue about this under "Reference Issues/PRs" section of the issue description.

@LucijaGregov
Contributor Author

@rth I am not sure if I understood you correctly. This is not mentioned on the 'Issues' list. Do you mean that I should add this as an issue?

@rth
Member

rth commented Apr 23, 2018

I meant to link to #10404 in the description to provide some context for reviewers as was done in your previous PR. No worries, I added it here.

@LucijaGregov
Contributor Author

@rth Right, I missed that bit. Thank you.

@rth
Member

rth commented Apr 27, 2018

#11012 was merged; could you please click on the "Resolve conflicts" button to merge with master and also remove any redundant changes from this PR for sklearn/preprocessing/tests/test_common.py as that should be already addressed? Thanks.

@LucijaGregov
Contributor Author

@rth Done. I will continue working on this.

@glemaitre
Member

I was checking how we implement the min/max detection for sparse matrices, and it does not seem that easy. We rely on the scipy implementation, which uses the maximum and minimum ufuncs. We have a backport in fixes here.

The problem with those ufuncs is that they let NaN pass through, and we would like to skip them. Somehow, I think that we need to implement our own sparse_nanmin_nanmax which will use nanmin and nanmax instead of np.minimum.reduce and np.maximum.reduce.
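A rough sketch of the kind of helper described (the name sparse_nanmin_nanmax is hypothetical, not scikit-learn API): iterate over CSC columns, drop NaN from the stored values, and account for implicit zeros.

```python
import numpy as np
import scipy.sparse as sp

def sparse_nanmin_nanmax(X):
    """Per-column (min, max) of a sparse matrix, ignoring NaN values.

    Implicit zeros count toward the result whenever a column has any.
    """
    X = sp.csc_matrix(X)
    n_rows, n_cols = X.shape
    mins = np.empty(n_cols)
    maxs = np.empty(n_cols)
    for j in range(n_cols):
        stored = X.data[X.indptr[j]:X.indptr[j + 1]]
        vals = stored[~np.isnan(stored)]
        if len(stored) < n_rows:            # column holds implicit zeros
            vals = np.append(vals, 0.0)
        if len(vals):
            mins[j], maxs[j] = vals.min(), vals.max()
        else:                               # every entry was NaN
            mins[j] = maxs[j] = np.nan
    return mins, maxs

mins, maxs = sparse_nanmin_nanmax(
    sp.csc_matrix(np.array([[1.0, np.nan],
                            [-3.0, 2.0],
                            [0.0, np.nan]])))
```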

@glemaitre
Member

@jnothman did I miss a functionality in scikit-learn which already provides such functions?

@LucijaGregov
Contributor Author

@glemaitre @jnothman Is it safe to continue working on this as it is, or do you have something else in mind to do first?

@jnothman
Member

Why would it be unsafe?

@LucijaGregov
Contributor Author

@jnothman I meant whether I need to wait further for things to be merged because of the comments above, but I guess it is good to go.

@glemaitre
Member

I meant whether I need to wait further for things to be merged because of the comments above, but I guess it is good to go.

Basically, #11011 (comment) is something that you might need to go through to make things work :) Then, whenever you think you have a potential solution, you can implement it or we can discuss it here.

@amueller
Member

pep8 failing

@LucijaGregov
Contributor Author

@amueller I know, it is work in progress.

@glemaitre
Member

Note that you will need to change n_samples_seen_ as in #11206 to have a consistent API.
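As context for that consistency change, a hedged sketch (illustrative only, not the PR's code) of what a NaN-aware, per-feature n_samples_seen_ can look like: count, per column, the samples that are actually non-NaN.

```python
import numpy as np

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0],
              [5.0, 6.0]])

# Per-feature count of the samples actually "seen" (non-NaN entries).
n_samples_seen_ = np.count_nonzero(~np.isnan(X), axis=0)
assert n_samples_seen_.tolist() == [3, 3]
```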

@jnothman
Member

Why do we have n_samples_seen_ in scalers other than StandardScaler anyway? They're not used in the statistics as far as I can tell, and RobustScaler lacks it. Should we consider deprecating it?

@glemaitre
Member

glemaitre commented Jun 16, 2018 via email

@glemaitre glemaitre changed the title [WIP] Passing NaN values through MaxAbsScaler [MRG] Ignore and pass-through NaN values in MaxAbsScaler and maxabs_scale Jun 16, 2018
@glemaitre
Member

Why do we have n_samples_seen_ in scalers other than StandardScaler anyway? They're not used in the statistics as far as I can tell, and RobustScaler lacks it. Should we consider deprecating it?

I opened issue #11300 to reach a consensus on what to do. I just want to be sure that partial_fit does not imply having this attribute (even if we don't really use it).

@glemaitre
Member

@LucijaGregov I made a quick push of what was missing to make the PR work.

@glemaitre
Member

@rth @jnothman You can have a look at this one.

Member

@jnothman left a comment


LGTM

@@ -27,7 +29,8 @@ def _get_valid_samples_by_column(X, col):
 
 @pytest.mark.parametrize(
     "est, func, support_sparse",
-    [(MinMaxScaler(), minmax_scale, False),
+    [(MaxAbsScaler(), maxabs_scale, True),
+     (MinMaxScaler(), minmax_scale, False),
      (QuantileTransformer(n_quantiles=10), quantile_transform, True)]
 )
 def test_missing_value_handling(est, func, support_sparse):
Member

I've realised we should have assert_no_warnings in here when fitting and transforming
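One way to express that kind of check with the standard library alone (a stand-in for the pytest.warns(None) and assert_no_warnings styles discussed in this thread; np.nanmax here is just a placeholder for a call expected to be warning-free):

```python
import warnings
import numpy as np

# Record every warning raised during the call and assert none occurred.
with warnings.catch_warnings(record=True) as records:
    warnings.simplefilter("always")
    result = np.nanmax(np.array([1.0, 2.0, np.nan]))  # expected warning-free
assert len(records) == 0
assert result == 2.0
```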

Member

It is a good point. I just caught that the QuantileTransformer was still raising a warning at inverse_transform. Since it is a single line, I would include it here.

@glemaitre
Member

@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014
I would appreciate reviews.

Member

@TomDLT left a comment


LGTM

@@ -44,12 +47,17 @@ def test_missing_value_handling(est, func, support_sparse):
     assert np.any(np.isnan(X_test), axis=0).all()
     X_test[:, 0] = np.nan  # make sure this boundary case is tested
 
-    Xt = est.fit(X_train).transform(X_test)
+    with pytest.warns(None) as records:
Member

Why not use sklearn.utils.testing.assert_no_warnings?

Member

I wanted to stick to pytest while awaiting this feature: pytest-dev/pytest#1830
The second point is that I find it more readable:

with pytest.warns(None):
    X_t = est.whatever(X)

than

X_t = assert_no_warnings(est.whatever, X)

@ogrisel are you also in favor of assert_no_warnings? If yes, it's 2 vs 1 and I will make the change :)

Member

No ok, this is fine as it is.

@ogrisel
Member

ogrisel commented Jun 21, 2018

@glemaitre This PR needs a merge from the current master.

@jnothman
Member

jnothman commented Jun 21, 2018 via email

@ogrisel
Member

ogrisel commented Jun 22, 2018

@glemaitre can you please merge master to check that the tests still pass?

@ogrisel
Member

ogrisel commented Jun 22, 2018

@glemaitre can you please merge master to check that the tests still pass?

Actually wrong mention. I meant @LucijaGregov instead of @glemaitre, sorry.

@jorisvandenbossche
Member

@glemaitre can you please merge master to check that the tests still pass?

I think Guillaume will not be online most of the day, so if you want to merge this, it might be easier to quickly do the merge of master yourself.

@ogrisel
Member

ogrisel commented Jun 22, 2018

Actually, let me push the merge commit myself.

@glemaitre
Member

glemaitre commented Jun 22, 2018 via email

-        Xt_col = est.transform(X_test[:, [i]])
+        with pytest.warns(None) as records:
+            Xt_col = est.transform(X_test[:, [i]])
+        assert len(records) == 0
Member

This new test_common.py assertion breaks with the PowerTransformer, which complains about all-NaN columns.

I am not sure if we should raise this warning or not (maybe not at transform time), but it should be consistent across all the transformers.

Member

I made a push. The cause was the np.nanmin(X) check that the matrix is strictly positive. This case will return a full-NaN matrix as well, so everything will be fine (or at least the problem is forwarded to the next step in the pipeline).
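For illustration, a small sketch of the behavior being described (assumed semantics, not the exact PR code): np.nanmin ignores NaN entries, so a strict-positivity check on a partially-NaN matrix is unaffected, while an all-NaN input just propagates NaN onward (NumPy emits a RuntimeWarning for an all-NaN slice).

```python
import warnings
import numpy as np

X = np.array([[np.nan, 2.0], [np.nan, 0.5]])
assert np.nanmin(X) > 0            # NaNs are ignored by the check

X_all_nan = np.full((2, 2), np.nan)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    m = np.nanmin(X_all_nan)       # all-NaN input: the result is NaN
assert np.isnan(m)
```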

@glemaitre
Member

glemaitre commented Jun 22, 2018 via email

@jnothman merged commit f43dd0e into scikit-learn:master Jun 23, 2018