Allow NaNs in the matrix #157
I think that accepting NaN is weirder than making the manual replacement. From my point of view, NaN is something that should be handled by the user, and one should be aware of the solution chosen. Let me give more insight into my thoughts:
So that's why I would still not be permissive regarding these values.
Sure, it has its drawbacks and I agree with that. I disagree with the first problem you mention, though... Here you are trying to incorporate some sort of validation which has nothing to do with oversampling. I work daily with datasets where NaNs are totally valid cases, and if those tools (including scikit-learn) incorporated an "ignore NaNs because they might be wrong values" rule, I couldn't use them at all (many ML methods support sparse datasets). By the way, Pandas has …
I think that he explicitly refers to the …
@chkoar I wouldn't say they don't matter; they are just valid values.
@hnykda I was referring to the sampling procedure itself.
@chkoar @hnykda Sorry, my bad, if this is specifically about the random samplers. I agree that we can be less rigid, since the selection is not based on the feature space itself. So in this case, what I would see is letting these values pass for the random methods and raising a user warning when the data contain NaN. What do you think?
The sparse problem is still different. We should be able to support it, since distances and kNN allow it. However, it should be as homogeneous as possible across the toolbox.
Cool, thanks. IMHO just adding an argument like `accept_nan=False` or something in the constructor/fit method would suffice. That should be easily implementable. But I am leaving that to you...
This is not as simple, since we chose to stay close to the scikit-learn API design. We have to check how the estimators handle this in their case. We are also careful about adding new parameters to methods because of the deprecation cycle required for any future change. But we will have to look into this feature. Thanks for your input.
Sure. Thanks for your work. As far as I know, scikit-learn basically supports only sparse matrices. Then it doesn't have to deal with …
I would really separate the problem of NaN from the problem of sparse matrices. Regarding NaN, only the … So we have to check whether this is worthwhile in case you put everything in the …
Just to recap, because I might get lost: here are some examples to clarify what we are talking about.
That NaN is used to encode something like infinity, "not visited", missing data, or whatever, has nothing to do with the balancing itself, but it can break the internal computations of the balancing, since some of them are based on distances. That said, I'm not in favor of allowing NaN as a valid sample value for a particular balancing strategy, because it changes the domain of x in …
I don't like the idea of …
@glemaitre @chkoar I believe a parameter `allow_nan=True` is a quick fix but not a reasonable workaround, because it will soon need replacement and also might break some tests. If scikit-learn does not support it, we shouldn't be messing with it within the estimator. Building on what was mentioned by @hnykda, what we could do instead would be to provide a simple ImputerLike that would transform X before applying the sampling, and transform X back after the sampling. This would be in the …
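The ImputerLike idea above can be sketched roughly like this; the class name `NaNShield`, the sentinel value, and the method names are all hypothetical illustrations, not part of any existing API:

```python
import numpy as np

class NaNShield:
    """Hypothetical helper: hide NaNs behind a sentinel before resampling
    and restore them afterwards."""

    def __init__(self, sentinel=-9999.0):
        # The sentinel is assumed to lie outside the valid value range.
        self.sentinel = sentinel

    def mask(self, X):
        # Replace NaNs with the sentinel so finiteness validation passes.
        return np.where(np.isnan(X), self.sentinel, X)

    def unmask(self, X):
        # Turn sentinel entries back into NaNs after resampling.
        return np.where(X == self.sentinel, np.nan, X)
```

Note that this round-trips safely only for samplers that copy rows verbatim (the random samplers); distance-based methods would still compute on the sentinel value.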
@dvro read this #157 (comment)
I would argue to reopen this case, as I was also bitten by it. Let's take an example: I have a mix of categorical and numerical values. Thus, I would like to use SMOTE-NC, as I would like to generate new interpolated values rather than just randomly picking already existing items. Apart from the fact that using … Thus, if I set a placeholder value for floats, say … Furthermore, I would argue that it should be possible to add specific keywords to the library without breaking compatibility with … Lastly, I'd like to mention that …
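The interpolation concern above can be made concrete: a SMOTE-style synthetic sample is a convex combination of two neighbours, so any float placeholder leaks into the generated values instead of acting as "missing". A minimal sketch, where the placeholder value `-9999.0` is purely illustrative:

```python
import numpy as np

# SMOTE-style interpolation: synthetic = a + gap * (b - a), gap in [0, 1].
a = np.array([2.0])      # a real observed value
b = np.array([-9999.0])  # placeholder standing in for a missing value (NaN)
gap = 0.5

synthetic = a + gap * (b - a)
# The placeholder bleeds into the synthetic sample, producing a
# meaningless in-between value rather than a "missing" marker.
```

This is why a placeholder that works fine for the random samplers (which only copy rows) silently corrupts samplers that interpolate in feature space.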
Hi,
this is a bit connected to accepting sparse matrices; it is basically about the sklearn validations. I understand why not accepting NaN is sometimes necessary for some algorithms, but it doesn't make sense for e.g. *RandomSamplers: NaNs could simply be copied as well. As a workaround, I replace all NaNs with some other valid value and then convert it back. But it is of course a bit weird...
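The workaround described above can be sketched with plain NumPy; the sentinel value and the naive duplicate-the-minority oversampling step are illustrative stand-ins, not the library's actual implementation:

```python
import numpy as np

rng = np.random.RandomState(0)

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [4.0, np.nan],
              [6.0, 7.0]])
y = np.array([0, 0, 0, 1])

# 1. Replace NaNs with a sentinel assumed to be outside the valid range.
SENTINEL = -9999.0
X_filled = np.where(np.isnan(X), SENTINEL, X)

# 2. Naive random oversampling: duplicate minority rows until classes balance.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
X_res = np.vstack([X_filled, X_filled[extra]])
y_res = np.concatenate([y, y[extra]])

# 3. Restore the NaNs in the resampled matrix.
X_res = np.where(X_res == SENTINEL, np.nan, X_res)
```

Because the sampling step only copies rows, the NaNs survive the round trip intact, which is exactly why this restriction feels unnecessary for the random samplers.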