Allow NaNs in the matrix #157


Closed
hnykda opened this issue Sep 29, 2016 · 17 comments
Labels
Type: Enhancement Indicates new feature requests

Comments

@hnykda

hnykda commented Sep 29, 2016

Hi,

this is a bit connected to accepting sparse matrices - it is basically about the sklearn validations.

I understand that rejecting NaN is necessary for some algorithms, but it doesn't make sense for e.g. the *RandomSamplers: the NaNs could simply be copied along with the rest.

As a workaround, I replace all NaNs with some other valid value and then convert them back afterwards, but that is of course a bit awkward...
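The round-trip workaround can be sketched like this (the sentinel value is arbitrary and assumed absent from the real data; the sampler call is elided):

```python
import numpy as np

# Hypothetical sentinel; assumed not to occur in the real data.
SENTINEL = -123.0

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])

# Mask the NaNs so the sampler's validation accepts the matrix.
X_masked = np.where(np.isnan(X), SENTINEL, X)

# ... run the sampler here; a pure random sampler copies rows
# verbatim, so the sentinel survives untouched in the output ...
X_resampled = X_masked.copy()  # stand-in for the sampler output

# Convert the sentinel back to NaN afterwards.
X_restored = np.where(X_resampled == SENTINEL, np.nan, X_resampled)
```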

@glemaitre
Member

I think that accepting NaN is weirder than making the replacement manually.

From my point of view, NaN is something that should be handled by the user, who should be aware of the chosen solution. Let me give more insight into my thoughts:

  • Imagine that you create a feature space in which you accidentally created some NaN (typical example: the square root of a negative number, etc.). This error would be propagated through the sampling as well. I would rather find out beforehand.
  • Most of the methods that generate or remove samples rely on the values in this feature space. I have trouble seeing how to handle and generate new samples containing NaN with SMOTE or ADASYN. Similarly, does the selection criterion of the under-sampling methods still make sense with NaN?

That's why I would still not be permissive regarding these values.

@chkoar @dvro @fmfn Any thoughts?

@hnykda
Author

hnykda commented Sep 29, 2016

Sure, it has its drawbacks, and I agree with that.

I disagree with the first problem you mention, though. There you are trying to incorporate some sort of validation that has nothing to do with oversampling. I work daily with datasets where NaNs are totally valid cases, and if these tools (including scikit-learn) adopted an "ignore NaNs because they might be wrong values" rule, I couldn't use them at all (many ML methods support sparse datasets). By the way, pandas has NaN handling as one of its main features.

@chkoar
Member

chkoar commented Sep 29, 2016

I think that he explicitly refers to the Random*Samplers, where the feature values don't matter.

@hnykda
Author

hnykda commented Sep 29, 2016

@chkoar I wouldn't say they don't matter; they are just valid values.

@chkoar
Member

chkoar commented Sep 29, 2016

@hnykda I was referring to the sampling procedure itself.

@glemaitre
Member

@chkoar @hnykda Sorry, my bad. If this is specifically about the random samplers, I agree that we can be less rigid, since the selection is not based on the feature space itself.

So in this case, what I would propose is to let NaN pass for these random methods while raising a user warning that the input contains NaN.

What do you think?

@glemaitre glemaitre added the Type: Enhancement Indicates new feature requests label Sep 29, 2016
@glemaitre
Member

The sparse problem is a different one. We should be able to support it, since distances and kNN allow it. However, the support should be as homogeneous as possible across the toolbox.

@hnykda
Author

hnykda commented Sep 29, 2016

Cool. Thanks.

IMHO, just adding some argument like `accept_nan=False` in the constructor/fit method would suffice. That should be easy to implement. But I am leaving that to you...

@glemaitre
Member

IMHO, just adding some argument like `accept_nan=False` in the constructor/fit method would suffice. That should be easy to implement. But I am leaving that to you...

This is not that simple, since we chose to stay close to the scikit-learn API design. We have to check how the scikit-learn estimators handle this in their own cases. We are also cautious about adding new parameters to methods, because any future change requires a deprecation cycle. But we will look into this feature.

Thanks for your inputs.

@hnykda
Author

hnykda commented Sep 29, 2016

Sure. Thanks for your work.

As far as I know, scikit-learn basically just supports sparse matrices; then it doesn't have to deal with NaNs at all - a sparse matrix cannot hold them, period. Here is an example: https://github.com/scikit-learn/scikit-learn/blob/38030a0/sklearn/decomposition/truncated_svd.py#L154 . So if you want to be as close to them as possible, add support for sparse matrices, and then it is up to the user.
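For reference, this is roughly how the scikit-learn validation behaves: a small sketch assuming scikit-learn and scipy are installed. Dense input with NaN is rejected by `check_array`'s default finiteness check, while a sparse matrix (which only stores explicit entries) passes once the estimator opts in via `accept_sparse`:

```python
import numpy as np
from scipy import sparse
from sklearn.utils import check_array

X = np.array([[1.0, np.nan],
              [0.0, 2.0]])

# Dense input containing NaN is rejected by the default validation.
rejected = False
try:
    check_array(X)
except ValueError:
    rejected = True

# A sparse matrix stores explicit entries only; once the NaN is
# replaced, the sparse input passes when the estimator accepts it.
X_sp = sparse.csr_matrix(np.nan_to_num(X))
checked = check_array(X_sp, accept_sparse="csr")
```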

@glemaitre
Member

I would really separate the NaN problem from the sparse one.

Regarding NaN, only the Imputer seems to accept them, since its purpose is to kick them out.

So we have to check whether this feature is worthwhile when you put everything in a Pipeline or another sklearn object.

@massich
Contributor

massich commented Sep 29, 2016

Just to recap, because I might get lost: here are some examples to pin down what we are talking about.

  1. Full matrix with full representation

    [1, 2, 3; 4, 5, 6; 7, 8, 9]

  2. Full matrix with sparse representation

    {(1,1),(1,2),(1,3),(2,1),(2,2),(2,3),(3,1),(3,2),(3,3)} [1 2 3 4 5 6 7 8 9]

  3. Sparse matrix with full representation

    [1, 0, 0; 0, 1, 0; 0, 0, 1]
    [1, NaN, NaN; NaN, 1, NaN; NaN, NaN, 1]
    [0, NaN, NaN; NaN, 1, NaN; NaN, NaN, NaN]

  4. Sparse matrix with sparse representation

    {(1,1),(2,2),(3,3)} [1 1 1]
    {(10,10)} [NaN]

Whether NaN is used to encode infinity, "not visited", missing data, or whatever else has nothing to do with the balancing itself, but it can break the internal computations of the balancing, since some of them are based on distances. That said, I'm not in favor of allowing NaN as a valid sample value for a particular balancing strategy, because it changes the domain of x in F(x, y='balancing_method'), which breaks the API.
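The cases above can be made concrete with scipy.sparse (a sketch; note the coordinates here are 0-indexed, unlike the 1-indexed notation above):

```python
import numpy as np
from scipy import sparse

# Case 2: a full matrix held in a sparse (COO) container - every
# entry is stored explicitly.
full_in_sparse = sparse.coo_matrix(np.arange(1, 10).reshape(3, 3))

# Case 4: a genuinely sparse matrix - unstored entries are implicit
# zeros, so a NaN can only exist as an explicitly stored value.
row, col, data = [9], [9], [np.nan]
one_nan = sparse.coo_matrix((data, (row, col)), shape=(10, 10))
```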

@chkoar
Member

chkoar commented Sep 29, 2016

I don't like the idea of accept_nan=False in the constructor. In a pipeline chain we do not expect NaN values. For the samplers that do not use the input space to resample, we could have an additional parameter, check_input, in the fit and sample methods, like in DecisionTreeClassifier. The default value would be True, so the user could explicitly disable the validation and sample without it. Thoughts?
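A minimal sketch of that idea; the class and its validation are hypothetical, and only the check_input switch mirrors DecisionTreeClassifier.fit:

```python
import numpy as np

class RandomSamplerSketch:
    """Hypothetical sampler with a DecisionTreeClassifier-style
    check_input switch; not imbalanced-learn's actual API."""

    def fit(self, X, y, check_input=True):
        X = np.asarray(X, dtype=float)
        if check_input and not np.isfinite(X).all():
            raise ValueError("Input contains NaN or infinity.")
        # A random sampler only needs row indices, never the feature
        # values, so skipping validation is harmless here.
        self.sample_indices_ = np.arange(X.shape[0])
        return self
```

With `check_input=False` a NaN-bearing matrix goes through untouched; with the default `True` it is rejected as today.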

@dvro
Member

dvro commented Nov 11, 2016

@glemaitre @chkoar I believe a parameter `allow_nan=True` is a quick fix but not a reasonable workaround, because it will soon need replacement and might also break some tests. If scikit-learn does not support it, we shouldn't be messing with it inside the estimator. Using what was mentioned by @hnykda - X[np.isnan(X)] = -123 before resampling and X_resampled[X_resampled == -123] = np.nan afterwards - is probably the best solution.

What we could do instead is provide a simple ImputerLike that transforms X before applying the sampling and transforms it back afterwards. It would live in the preprocessing module, be a quick fix for the user, and be easy to remove from the API once scikit-learn defines how to handle such cases when testing estimators.
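The ImputerLike idea could look roughly like this (the class name and placeholder value are hypothetical):

```python
import numpy as np

class NaNPlaceholder:
    """Hypothetical round-trip helper: mask NaNs before sampling,
    restore them afterwards."""

    def __init__(self, placeholder=-123.0):
        self.placeholder = placeholder

    def transform(self, X):
        # Replace NaNs with the placeholder so validation passes.
        return np.where(np.isnan(X), self.placeholder, X)

    def inverse_transform(self, X):
        # Map the placeholder back to NaN after sampling.
        return np.where(X == self.placeholder, np.nan, X)
```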

@chkoar
Member

chkoar commented Nov 11, 2016

@dvro read this #157 (comment)

@glemaitre
Member

NaN will not be included in sklearn for now. The user needs to apply the Imputer, and we will let the user manage this case manually. We can reopen if sklearn goes toward another proposal.

@lrq3000

lrq3000 commented Aug 18, 2019

I would argue for reopening this issue, as I was also bitten by it.

Let's take an example: I have a mix of categorical and numerical values. Thus, I would like to use SMOTE-NC, as I would like to generate new interpolated values rather than just randomly picking already existing items. Apart from the fact that using the Imputer would be quite impractical, as I would have to manage different cases depending on the type, the more serious issue is that interpolating algorithms such as SMOTE-NC interpolate the floats based on whatever values are set.

Thus, if I set a placeholder value for the floats, say -99999, then this WILL get used for the interpolation, which I don't want! A more correct approach would be to compute the SMOTE interpolation on only the non-NaN values and, weighted by the number of NaN occurrences, assign a probability that NaN is chosen instead of a non-NaN interpolated value.
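A tiny numeric illustration of that leak, using the standard SMOTE interpolation formula (the placeholder mirrors the example above; the gap of 0.5 is a deterministic stand-in for the random draw):

```python
# SMOTE builds a synthetic sample between a point and one of its
# nearest neighbors: new = x + gap * (neighbor - x), gap in (0, 1).
x = 1.0
neighbor = -99999.0  # this neighbor's value is really "missing"
gap = 0.5            # stand-in for the random gap

synthetic = x + gap * (neighbor - x)
# The placeholder dominates the synthetic value, so the "missing"
# marker silently corrupts the generated samples.
```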

Furthermore, I would argue that it should be possible to add specific keywords to the library without breaking compatibility with sklearn; if not in fit/transform, it could be in the model initialization call, where there is more freedom, I think - like how the sampling_strategy argument is used. Or am I wrong?

Lastly, I'd like to mention that sklearn has been delaying proper support of categorical (unordered) variables for years, not to mention missing values, so waiting for them might not be the greatest idea.


No branches or pull requests

6 participants