[MRG] ENH handle NaN value in QuantileTransformer #10437
Conversation
@jnothman I have 2 questions:

what's the harm in silencing the warning?
This is nice :)
I don't see any. I would go for that solution. However,
Sure. Backport and inform other contributors at #10404
Backport is a headache, in fact. I tried, and we would have to bring too much code over from numpy to support it. Speaking IRL with @ogrisel, we propose to raise a NotImplementedError when there are NaNs and the nanfunctions are needed (they require numpy >= 1.9), and to ask for an upgrade.
@lesteve @jnothman I wanted to mock the numpy version to be sure that the error in the test is raised properly. I used ...
Is pytest-mock really needed? Can we not use either:

+1 on that one. It seems good enough to me. Thanks
Do you think we should support other missing_values indicators?
sklearn/preprocessing/data.py
Outdated
@@ -2357,7 +2361,10 @@ def _transform_col(self, X_col, quantiles, inverse):
    X_col[lower_bounds_idx] = lower_bound_y
    # for forward transform, match the output PDF
    if not inverse:
        X_col = output_distribution.ppf(X_col)
        # comparison with NaN will raise a warning which we make silent
If a numpy error, can use np.errstate
this is a scipy warning
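For illustration, a minimal sketch of how np.errstate scopes the suppression of numpy floating-point warnings; note it only affects numpy's FP-error machinery, not Python-level warnings raised from scipy code (those need warnings.catch_warnings):

```python
import numpy as np

# np.errstate silences numpy's FP-error reporting only inside the block;
# an invalid operation (sqrt of a negative) yields NaN without a warning.
with np.errstate(invalid="ignore"):
    y = np.sqrt(np.array([-1.0, 4.0]))
```

Outside the `with` block, numpy's usual error state is restored.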
I gave it some thought and tried a couple of things. I have the following questions:
Having implemented both approaches (mid-way), I think that only replacing missing_value by NaN is sufficient.
sklearn/preprocessing/data.py
Outdated
def _check_inputs(self, X, accept_sparse_negative=False):
    """Check inputs before fit and transform"""
    if sparse.issparse(X):
Why before check_array?
check_array will convert the matrix into a float dtype. I wanted to do the comparison while the data can still be int. However, I could also use np.isclose, which can handle NaN as well.
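As a small sketch of the point above: np.isclose compares element-wise and simply returns False for NaN entries rather than erroring, so it works on data that may contain missing values:

```python
import numpy as np

x = np.array([0.0, 1.0, np.nan])
# NaN never compares close to anything (unless equal_nan=True is passed)
m = np.isclose(x, 0.0)
```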
sklearn/preprocessing/data.py
Outdated
and not np.isfinite(X[~np.isnan(X)]).all()):
    raise ValueError("Input contains infinity"
                     " or a value too large for %r." % X.dtype)
if np.count_nonzero(self._mask_missing_values):
It is strange to have this in a function whose argument is X, not _mask_missing_values.
I suspect you should avoid storing this mask as an attribute.
This function will go away once #10455 is addressed.
sklearn/preprocessing/data.py
Outdated
self._percentile_func = np.nanpercentile
else:
    raise NotImplementedError(
        'QuantileTransformer does not handle NaN value with'
Is it easy enough to just implement in sklearn.utils.fixes:

def nanpercentile(a, q):
    return np.percentile(np.compress(~np.isnan(a), a), q)

seeing as we don't use the other features of nanpercentile?
It is annoying to have parts of the library with different minimum numpy requirements. It means that code is not portable across supported platforms.
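A quick sanity check of the proposed fallback against numpy's own np.nanpercentile for the 1-D case used here (assuming a numpy recent enough to have nanpercentile is available for the comparison):

```python
import numpy as np


def nanpercentile(a, q):
    # drop the NaNs, then take the plain percentile
    # (note that np.compress takes the boolean condition first)
    return np.percentile(np.compress(~np.isnan(a), a), q)


a = np.array([1.0, np.nan, 3.0, 5.0])
result = nanpercentile(a, 50)  # median of [1, 3, 5]
```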
Right
True ... we will always get float as output. So it should definitely be in a separate PR.
@jnothman I added a test_common file. Could you check that the creation of the instance is OK, or would you see another way to create the estimator instance on the fly (a dict, for instance)?
Nice test. It essentially also tests that the estimators are feature-wise... So we could in theory remove some existing tests
@pytest.mark.parametrize(
    "est, X, n_missing",
    _generate_tuple_transformer_missing_value()
I don't get why this is better than just parameterizing est directly
We could even consider a list of all feature-wise preprocessors then xfail some...
We could even consider a list of all feature-wise preprocessors then xfail some...
I agree. We could switch to this behaviour when a majority of those preprocessors support this feature.
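The "parametrize everything, then xfail" idea could look roughly like this hypothetical sketch (the estimator names and xfail reason are placeholders, not the PR's actual test):

```python
import pytest

# hypothetical: list all feature-wise preprocessors and mark the ones
# that do not support NaN yet as expected failures
FEATUREWISE = [
    "QuantileTransformer",
    pytest.param("StandardScaler",
                 marks=pytest.mark.xfail(reason="NaN not supported yet")),
]


@pytest.mark.parametrize("name", FEATUREWISE)
def test_nan_support(name):
    pass  # the real test would fit/transform data containing NaN
```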
rng.randint(X.shape[1], size=n_missing)] = np.nan
X_train, X_test = train_test_split(X)
# sanity check
assert not np.all(np.isnan(X_train), axis=0).any()
Should probably also check that there are NaNs in both train and test
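A minimal sketch of the stricter sanity check (no column entirely NaN, but every column containing at least one NaN); here the NaNs are placed deterministically so both conditions hold by construction:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(30, 4)
# place one NaN per column deterministically
for j in range(X.shape[1]):
    X[j, j] = np.nan

# no column is entirely NaN
no_all_nan_col = not np.all(np.isnan(X), axis=0).any()
# every column contains at least one NaN
every_col_has_nan = np.any(np.isnan(X), axis=0).all()
```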
Please add a docstring note along the lines of "NaNs are treated as missing values: disregarded in fit, and maintained in transform". Perhaps that's too terse.
X_col = .5 * (np.interp(X_col, quantiles, self.references_)
              - np.interp(-X_col, -quantiles[::-1],
                          -self.references_[::-1]))
X_col[isfinite_mask] = .5 * (
FWIW, it's possible that np.ma would handle the non-missing case more efficiently than using an ad-hoc mask. I've not checked.
Playing around, I think that it will trigger the same number of copies.
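For reference, the ad-hoc mask approach being discussed boils down to something like this sketch (the quantiles/references values are made up for illustration):

```python
import numpy as np

X_col = np.array([0.1, np.nan, 0.7])
quantiles = np.array([0.0, 0.5, 1.0])
references = np.array([0.0, 0.5, 1.0])

# interpolate only the finite entries; NaNs pass through untouched
mask = np.isfinite(X_col)
out = np.full_like(X_col, np.nan)
out[mask] = np.interp(X_col[mask], quantiles, references)
```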
X_train, X_test = train_test_split(X)
# sanity check
assert not np.all(np.isnan(X_train), axis=0).any()
assert np.any(X_train, axis=0).all()
Is this the right condition?
We check that there is some NaN in each column.
Oops, yes, I forgot to check for NaN there :)
from sklearn.datasets import load_iris
Why so much vertical space?
@lesteve We have 2 approvals here. Do you want to make a quick review before we merge this, hopefully :)
sklearn/preprocessing/data.py
Outdated
"""Force the output of nanpercentile to be finite."""
percentile = nanpercentile(column_data, percentiles)
with np.errstate(invalid='ignore'):  # hide NaN comparison warnings
    if np.all(np.isclose(percentile, np.nan, equal_nan=True)):
Why not:
if np.all(np.isnan(percentile))
If you use that I think you can remove the with np.errstate(...)
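The suggested simplification, as a sketch: np.isnan is an element-wise predicate that does not go through numpy's FP-error machinery, so no errstate guard is needed around it:

```python
import numpy as np

percentile = np.array([np.nan, np.nan, np.nan])
# no np.errstate needed around np.isnan
all_nan = np.all(np.isnan(percentile))
```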
True. No idea how I came up with something so complex.
sklearn/preprocessing/data.py
Outdated
percentile = nanpercentile(column_data, percentiles)
with np.errstate(invalid='ignore'):  # hide NaN comparison warnings
    if np.all(np.isclose(percentile, np.nan, equal_nan=True)):
        warnings.warn("All samples in a column of X are NaN.")
Maybe you can mention in the warning that you are returning 0 for all the quantiles?
It would be nice if you could test that you get the warning when expected.
Bonus points if you check that you do not get any warning when you don't expect one.
I am not sure anymore why I force percentile to be finite.
I added a test to check that the inverse transform behaves properly in the common test.
I suppose the problem with making quantiles NaN is that for finite data
passed to transform, you'd get NaN when transformed. That sort of makes
sense... I suppose if NaN is in the data we assume it will be handled
downstream.
That's true. But it seems a more logical thing to map finite values to NaN if during training we did not learn anything (due to a full-NaN column). So I think the way it is right now is OK.
@lesteve you can have a second look at it and tell us if it makes sense to you.
ping @lesteve @qinhanmin2014
LGTM. Given that there are already two +1s and Loic's comments were addressed, as far as I can tell, I will merge when CI is green.
Reference Issues/PRs
partially addresses #10404
What does this implement/fix? Explain your changes.
NaN values are handled and ignored during processing in the QuantileTransformer.
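Assuming a scikit-learn version where this PR landed (>= 0.20), the behaviour can be illustrated as: NaNs are disregarded when computing the quantiles during fit and preserved in the output of transform:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.array([[1.0], [2.0], [3.0], [np.nan]])
qt = QuantileTransformer(n_quantiles=3)
Xt = qt.fit_transform(X)
# the NaN is ignored when fitting the quantiles and kept in the output
```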
Any other comments?
TODO:
- use check_array after addressing [RFC] Dissociate NaN and Inf when considering force_all_finite in check_array #10455