Allow NaNs in the matrix #157


Closed
hnykda opened this issue Sep 29, 2016 · 17 comments
Labels
Type: Enhancement Indicates new feature requests

Comments

@hnykda

hnykda commented Sep 29, 2016

Hi,

this is a bit connected to accepting sparse matrices - it is basically about the sklearn validations.

I understand that rejecting NaN is necessary for some algorithms, but it doesn't make sense for e.g. the *RandomSamplers: the NaNs could simply be copied along with the rest.

As a workaround, I replace all NaNs with some other valid value and then convert them back afterwards, but that is of course a bit awkward...
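The round-trip workaround can be sketched like this (the sentinel value is arbitrary and assumed absent from the real data; the sampler call is elided):

```python
import numpy as np

# Hypothetical sentinel; assumed not to occur in the real data.
SENTINEL = -123.0

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])

# Mask the NaNs so the sampler's validation accepts the matrix.
X_masked = np.where(np.isnan(X), SENTINEL, X)

# ... run the sampler here; a pure random sampler copies rows
# verbatim, so the sentinel survives untouched in the output ...
X_resampled = X_masked.copy()  # stand-in for the sampler output

# Convert the sentinel back to NaN afterwards.
X_restored = np.where(X_resampled == SENTINEL, np.nan, X_resampled)
```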

@glemaitre
Member

I think that accepting NaN is weirder than making the replacement manually.

From my point of view, NaN is something that should be handled by the user, who should be aware of the chosen solution. Let me give more insight into my thoughts:

  • Imagine that you create a feature space in which you accidentally created some NaN (typical example: the square root of a negative number, etc.). This error would be propagated through the sampling as well. I would rather find out beforehand.
  • Most of the methods that generate or remove samples rely on the values in this feature space. I have trouble seeing how to handle and generate new samples containing NaN with SMOTE or ADASYN. Similarly, does the selection criterion of the under-sampling methods still make sense with NaN?

That's why I would still not be permissive regarding these values.

@chkoar @dvro @fmfn Any thoughts?

@hnykda
Author

hnykda commented Sep 29, 2016

Sure, it has its drawbacks, and I agree with that.

I disagree with the first problem you mention, though. There you are trying to incorporate some sort of validation that has nothing to do with oversampling. I work daily with datasets where NaNs are totally valid cases, and if these tools (including scikit-learn) adopted an "ignore NaNs because they might be wrong values" rule, I couldn't use them at all (many ML methods support sparse datasets). By the way, pandas has NaN handling as one of its main features.

@chkoar
Member

chkoar commented Sep 29, 2016

I think that he explicitly refers to the Random*Samplers, where the feature values don't matter.

@hnykda
Author

hnykda commented Sep 29, 2016

@chkoar I wouldn't say they don't matter; they are just valid values.

@chkoar
Member

chkoar commented Sep 29, 2016

@hnykda I was referring to the sampling procedure itself.

@glemaitre
Member

@chkoar @hnykda Sorry, my bad. If this is specifically about the random samplers, I agree that we can be less rigid, since the selection is not based on the feature space itself.

So in this case, what I would propose is to let NaN pass for these random methods while raising a user warning that the input contains NaN.

What do you think?

@glemaitre glemaitre added the Type: Enhancement Indicates new feature requests label Sep 29, 2016
@glemaitre
Member

The sparse problem is a different one. We should be able to support it, since distances and kNN allow it. However, the support should be as homogeneous as possible across the toolbox.

@hnykda
Author

hnykda commented Sep 29, 2016

Cool. Thanks.

IMHO, just adding some argument like `accept_nan=False` in the constructor/fit method would suffice. That should be easy to implement. But I am leaving that to you...

@glemaitre
Member

IMHO, just adding some argument like `accept_nan=False` in the constructor/fit method would suffice. That should be easy to implement. But I am leaving that to you...

This is not that simple, since we chose to stay close to the scikit-learn API design. We have to check how the scikit-learn estimators handle this in their own cases. We are also cautious about adding new parameters to methods, because any future change requires a deprecation cycle. But we will look into this feature.

Thanks for your inputs.

@hnykda
Author

hnykda commented Sep 29, 2016

Sure. Thanks for your work.

As far as I know, scikit-learn basically just supports sparse matrices; then it doesn't have to deal with NaNs at all - a sparse matrix cannot hold them, period. Here is an example: https://github.com/scikit-learn/scikit-learn/blob/38030a0/sklearn/decomposition/truncated_svd.py#L154 . So if you want to be as close to them as possible, add support for sparse matrices, and then it is up to the user.
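For reference, this is roughly how the scikit-learn validation behaves: a small sketch assuming scikit-learn and scipy are installed. Dense input with NaN is rejected by `check_array`'s default finiteness check, while a sparse matrix (which only stores explicit entries) passes once the estimator opts in via `accept_sparse`:

```python
import numpy as np
from scipy import sparse
from sklearn.utils import check_array

X = np.array([[1.0, np.nan],
              [0.0, 2.0]])

# Dense input containing NaN is rejected by the default validation.
rejected = False
try:
    check_array(X)
except ValueError:
    rejected = True

# A sparse matrix stores explicit entries only; once the NaN is
# replaced, the sparse input passes when the estimator accepts it.
X_sp = sparse.csr_matrix(np.nan_to_num(X))
checked = check_array(X_sp, accept_sparse="csr")
```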

@glemaitre
Member

I would really separate the NaN problem from the sparse one.

Regarding NaN, only the Imputer seems to accept them, since its purpose is to kick them out.

So we have to check whether this feature is worthwhile when you put everything in a Pipeline or another sklearn object.

@massich
Contributor

massich commented Sep 29, 2016

Just to recap, because I might get lost: here are some examples to pin down what we are talking about.

  1. Full matrix with full representation

    [1, 2, 3; 4, 5, 6; 7, 8, 9]

  2. Full matrix with sparse representation

    {(1,1),(1,2),(1,3),(2,1),(2,2),(2,3),(3,1),(3,2),(3,3)} [1 2 3 4 5 6 7 8 9]

  3. Sparse matrix with full representation

    [1, 0, 0; 0, 1, 0; 0, 0, 1]
    [1, NaN, NaN; NaN, 1, NaN; NaN, NaN, 1]
    [0, NaN, NaN; NaN, 1, NaN; NaN, NaN, NaN]

  4. Sparse matrix with sparse representation

    {(1,1),(2,2),(3,3)} [1 1 1]
    {(10,10)} [NaN]

Whether NaN is used to encode infinity, "not visited", missing data, or whatever else has nothing to do with the balancing itself, but it can break the internal computations of the balancing, since some of them are based on distances. That said, I'm not in favor of allowing NaN as a valid sample value for a particular balancing strategy, because it changes the domain of x in F(x, y='balancing_method'), which breaks the API.
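The cases above can be made concrete with scipy.sparse (a sketch; note the coordinates here are 0-indexed, unlike the 1-indexed notation above):

```python
import numpy as np
from scipy import sparse

# Case 2: a full matrix held in a sparse (COO) container - every
# entry is stored explicitly.
full_in_sparse = sparse.coo_matrix(np.arange(1, 10).reshape(3, 3))

# Case 4: a genuinely sparse matrix - unstored entries are implicit
# zeros, so a NaN can only exist as an explicitly stored value.
row, col, data = [9], [9], [np.nan]
one_nan = sparse.coo_matrix((data, (row, col)), shape=(10, 10))
```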

@chkoar
Member

chkoar commented Sep 29, 2016

I don't like the idea of accept_nan=False in the constructor. In a pipeline chain we do not expect NaN values. For the samplers that do not use the input space to resample, we could have an additional parameter, check_input, in the fit and sample methods, like in DecisionTreeClassifier. The default value would be True, so the user could explicitly disable the validation and sample without it. Thoughts?
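A minimal sketch of that idea; the class and its validation are hypothetical, and only the check_input switch mirrors DecisionTreeClassifier.fit:

```python
import numpy as np

class RandomSamplerSketch:
    """Hypothetical sampler with a DecisionTreeClassifier-style
    check_input switch; not imbalanced-learn's actual API."""

    def fit(self, X, y, check_input=True):
        X = np.asarray(X, dtype=float)
        if check_input and not np.isfinite(X).all():
            raise ValueError("Input contains NaN or infinity.")
        # A random sampler only needs row indices, never the feature
        # values, so skipping validation is harmless here.
        self.sample_indices_ = np.arange(X.shape[0])
        return self
```

With `check_input=False` a NaN-bearing matrix goes through untouched; with the default `True` it is rejected as today.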

@dvro
Member

dvro commented Nov 11, 2016

@glemaitre @chkoar I believe a parameter `allow_nan=True` is a quick fix but not a reasonable workaround, because it will soon need replacement and might also break some tests. If scikit-learn does not support it, we shouldn't be messing with it inside the estimator. Using what was mentioned by @hnykda - X[np.isnan(X)] = -123 before resampling and X_resampled[X_resampled == -123] = np.nan afterwards - is probably the best solution.

What we could do instead is provide a simple ImputerLike that transforms X before applying the sampling and transforms it back afterwards. It would live in the preprocessing module, be a quick fix for the user, and be easy to remove from the API once scikit-learn defines how to handle such cases when testing estimators.
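The ImputerLike idea could look roughly like this (the class name and placeholder value are hypothetical):

```python
import numpy as np

class NaNPlaceholder:
    """Hypothetical round-trip helper: mask NaNs before sampling,
    restore them afterwards."""

    def __init__(self, placeholder=-123.0):
        self.placeholder = placeholder

    def transform(self, X):
        # Replace NaNs with the placeholder so validation passes.
        return np.where(np.isnan(X), self.placeholder, X)

    def inverse_transform(self, X):
        # Map the placeholder back to NaN after sampling.
        return np.where(X == self.placeholder, np.nan, X)
```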

@chkoar
Member

chkoar commented Nov 11, 2016

@dvro read this #157 (comment)

@glemaitre
Member

NaN will not be included in sklearn for now. The user needs to apply the Imputer, and we will let the user manage this case manually. We can reopen if sklearn goes toward another proposal.

@lrq3000

lrq3000 commented Aug 18, 2019

I would argue for reopening this issue, as I was also bitten by it.

Let's take an example: I have a mix of categorical and numerical values. Thus, I would like to use SMOTE-NC, as I would like to generate new interpolated values rather than just randomly picking already existing items. Apart from the fact that using the Imputer would be quite impractical, as I would have to manage different cases depending on the type, the more serious issue is that interpolating algorithms such as SMOTE-NC interpolate the floats based on whatever values are set.

Thus, if I set a placeholder value for the floats, say -99999, then this WILL get used for the interpolation, which I don't want! A more correct approach would be to compute the SMOTE interpolation on only the non-NaN values and, weighted by the number of NaN occurrences, assign a probability that NaN is chosen instead of a non-NaN interpolated value.
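A tiny numeric illustration of that leak, using the standard SMOTE interpolation formula (the placeholder mirrors the example above; the gap of 0.5 is a deterministic stand-in for the random draw):

```python
# SMOTE builds a synthetic sample between a point and one of its
# nearest neighbors: new = x + gap * (neighbor - x), gap in (0, 1).
x = 1.0
neighbor = -99999.0  # this neighbor's value is really "missing"
gap = 0.5            # stand-in for the random gap

synthetic = x + gap * (neighbor - x)
# The placeholder dominates the synthetic value, so the "missing"
# marker silently corrupts the generated samples.
```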

Furthermore, I would argue that it should be possible to add specific keywords to the library without breaking compatibility with sklearn; if not in fit/transform, it could be in the model initialization call, where there is more freedom, I think - like how the sampling_strategy argument is used. Or am I wrong?

Lastly, I'd like to mention that sklearn has been delaying proper support of categorical (unordered) variables for years, not to mention missing values, so waiting for them might not be the greatest idea.


No branches or pull requests

6 participants