sklearn.neighbors.NearestNeighbors allow processing nan values #29085

Closed
fedorkobak opened this issue May 22, 2024 · 8 comments

fedorkobak commented May 22, 2024

Describe the workflow you want to enable

In some cases (for example, memory-based collaborative filtering), empty values are an important part of the algorithm. However, sklearn.neighbors.NearestNeighbors does not accept input containing empty values, even when the metric can handle them.

The following code snippet shows a simple implementation of collaborative filtering built on sklearn.neighbors.NearestNeighbors.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import correlation as orig_correlation

def correlation(a, b):
    '''
    Modified correlation distance that handles
    NaN values. If the distance cannot be
    computed, it returns 2, the maximum
    possible value.
    '''
    # Keep only positions observed in both vectors.
    cond = ~(np.isnan(a) | np.isnan(b))
    # With fewer than two shared observations
    # the correlation coefficient is undefined,
    # so return the largest possible distance.
    if cond.sum() <= 1:
        return 2

    a_std = a[cond].std()
    b_std = b[cond].std()

    # The Pearson coefficient has the standard
    # deviations in its denominator, so if either
    # is zero, return the largest possible distance.
    if a_std == 0 or b_std == 0:
        return 2
    return orig_correlation(a[cond], b[cond])

# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
example_array = np.array(
    [[ 5.,  2.,  4.,  6.,  5.,  4.,  6.,  6.,  7.,  6.],
    [ 6., np.nan,  4.,  7.,  5.,  4.,  8.,  5.,  7.,  6.],
    [ 7., 10.,  1., np.nan,  9.,  7., np.nan,  3., np.nan,  8.],
    [np.nan,  2.,  4., np.nan,  4.,  4.,  7.,  7., np.nan,  6.],
    [ 8.,  1., np.nan,  7.,  6.,  2.,  2.,  8.,  2.,  1.],
    [ 8., np.nan, np.nan, np.nan,  8.,  7.,  7.,  4., 10.,  9.],
    [ 8.,  1.,  6.,  8.,  5.,  2.,  2., np.nan,  3.,  1.],
    [ 6., np.nan,  0.,  5.,  9.,  7.,  7.,  3.,  9.,  6.],
    [np.nan,  1.,  7.,  8.,  5.,  2., np.nan, np.nan, np.nan,  1.],
    [ 8.,  1.,  7.,  8.,  5., np.nan,  2.,  8.,  3.,  1.]]
)

NearestNeighbors(
    metric=correlation
).fit(example_array)

It raises "ValueError: Input X contains NaN." The traceback shows that the error occurs during the data validation phase, in the function _assert_all_finite in sklearn/utils/validation.py.

It would be great if you could use data with empty values in some cases.
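Until NaN input is accepted directly, one possible workaround is to compute the distance matrix outside the estimator and pass metric="precomputed", which NearestNeighbors supports. The sketch below assumes a small dense dataset; nan_correlation and pairwise_nan_correlation are illustrative names, not scikit-learn functions, and the quadratic Python loop is only practical for small inputs.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def nan_correlation(a, b):
    """Correlation distance over positions observed in both vectors.

    Returns 2.0 (the maximum correlation distance) when fewer than two
    positions overlap or when either vector is constant on the overlap.
    """
    mask = ~(np.isnan(a) | np.isnan(b))
    if mask.sum() < 2:
        return 2.0
    a_c = a[mask] - a[mask].mean()
    b_c = b[mask] - b[mask].mean()
    denom = np.sqrt((a_c @ a_c) * (b_c @ b_c))
    if denom == 0:
        return 2.0
    # Clip to guard against tiny out-of-range values caused by rounding.
    return float(np.clip(1.0 - (a_c @ b_c) / denom, 0.0, 2.0))


def pairwise_nan_correlation(X):
    """Dense symmetric distance matrix with a zero diagonal."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = nan_correlation(X[i], X[j])
    return D


X = np.array([
    [5.0, 2.0, 4.0, 6.0, 5.0],
    [6.0, np.nan, 4.0, 7.0, 5.0],
    [7.0, 10.0, 1.0, np.nan, 9.0],
    [np.nan, 2.0, 4.0, np.nan, 4.0],
])

# The distance matrix contains no NaN, so input validation passes.
D = pairwise_nan_correlation(X)
nn = NearestNeighbors(n_neighbors=2, metric="precomputed").fit(D)

# Calling kneighbors() without arguments queries the training points
# and leaves each point out of its own neighbor list.
distances, indices = nn.kneighbors()
print(indices)
```

The trade-off is memory and time: the full matrix has to be built up front, so this does not scale the way a tree-based or chunked search would.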

Describe your proposed solution

This could be solved by adding a parameter to sklearn.neighbors.NearestNeighbors that allows empty values in the input. If this parameter is set to True, the validation functions won't raise an error when an array containing NaN is passed.
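To make the proposal concrete, here is a minimal sketch of what such a switch could look like at the validation step. Both validate_input and allow_nan are hypothetical names chosen for illustration; this is not scikit-learn's validation code or an existing parameter.

```python
import numpy as np


def validate_input(X, allow_nan=False):
    """Hypothetical validation helper (not scikit-learn's implementation).

    Infinite values are always rejected; NaN is rejected unless
    allow_nan is True.
    """
    X = np.asarray(X, dtype=float)
    if np.isinf(X).any():
        raise ValueError("Input X contains infinity.")
    if not allow_nan and np.isnan(X).any():
        raise ValueError("Input X contains NaN.")
    return X


X = np.array([[1.0, np.nan], [3.0, 4.0]])

# Default behaviour: NaN is rejected.
try:
    validate_input(X)
except ValueError as exc:
    print(exc)

# Opt-in behaviour: NaN passes through to the metric untouched.
X_checked = validate_input(X, allow_nan=True)
```

With a switch like this, the responsibility for interpreting NaN moves entirely to the metric, so a metric that cannot handle NaN would silently produce NaN distances unless it is checked separately.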

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@fedorkobak added the Needs Triage and New Feature labels on May 22, 2024
@fedorkobak changed the title from "sklearn.neighbors.NearestNeighborsallow processing nan values" to "sklearn.neighbors.NearestNeighbors allow processing nan values" on May 22, 2024
@adrinjalali
Member

We're happy to allow NaNs in our estimators, and some work is ongoing in this regard, e.g. #28043.

Feel free to open a PR for this.

@adrinjalali removed the Needs Triage label on Jun 5, 2024
@durga0201

Could we preprocess the data to replace empty values with a placeholder that won't affect the distance calculation?
For example, since the cosine metric ignores the magnitude of the vectors and considers only their direction, using -1 as a placeholder shouldn't affect the cosine similarity calculation. For the Minkowski distance, we might use a placeholder far outside the typical range of the data: if the data contains only positive values, a negative value much smaller than the minimum, such as -999?

@adrinjalali
Member

Not really, that sounds too hacky, and I'm not sure there's any theoretical backing for how we'd handle it.
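A small numeric check illustrates the concern. The sketch below uses made-up vectors and compares the cosine distance computed only on positions observed in both vectors with the distance obtained after filling NaN with -1 or -999; cosine_distance and fill are helper names introduced here for illustration.

```python
import numpy as np


def cosine_distance(u, v):
    """Plain cosine distance for vectors without NaN."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def fill(x, value):
    """Replace NaN entries of x with a fixed placeholder value."""
    return np.where(np.isnan(x), value, x)


a = np.array([5.0, 2.0, np.nan, 6.0])
b = np.array([6.0, np.nan, 4.0, 7.0])

# Distance restricted to positions observed in both vectors.
observed = ~(np.isnan(a) | np.isnan(b))
d_observed = cosine_distance(a[observed], b[observed])

# Distances after replacing NaN with the suggested placeholders.
d_minus_one = cosine_distance(fill(a, -1.0), fill(b, -1.0))
d_sentinel = cosine_distance(fill(a, -999.0), fill(b, -999.0))

print(d_observed, d_minus_one, d_sentinel)
```

For these vectors the three values differ substantially, because each placeholder enters the dot product and the norms like any other coordinate and so changes the direction of the vector.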

@durga0201

@adrinjalali Other than imputation, there isn't much work on handling missing values in KNN, is there? If you have any thoughts, let me know.

@adrinjalali
Member

I haven't thought about the details. In terms of API, if "supporting missing values" means imputing them one way or another, then we shouldn't be doing it and the user should be doing that as a step before KNN in the pipeline.
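As a sketch of the "impute first, then KNN" route described above, the example below fills missing values with the column mean using SimpleImputer before fitting NearestNeighbors. The mean strategy and the toy data are assumptions made for illustration; which imputation is appropriate depends on the application.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import NearestNeighbors

X = np.array([
    [5.0, 2.0, 4.0],
    [6.0, np.nan, 4.0],
    [7.0, 10.0, 1.0],
    [np.nan, 2.0, 4.0],
])

# Step 1: the user chooses and applies an imputation strategy explicitly.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Step 2: the neighbor search only ever sees finite values.
nn = NearestNeighbors(n_neighbors=2).fit(X_imputed)
distances, indices = nn.kneighbors(X_imputed)
print(indices)
```

This keeps the imputation choice visible in user code, but it also means the metric can no longer tell which positions were originally empty.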

cc @ogrisel and @lorentzenchr since I think they've had opinions on this in the past.

@fedorkobak
Author

fedorkobak commented Jun 18, 2024

@durga0201 I don't support your idea. It would be better to be able to define a custom metric that handles empty positions in a specific way. For that, the function passed as the metric parameter needs to be able to tell which positions were empty and which were not. Currently, passing arrays with empty values just raises an error that stops the programme.

@Lucas-Armand

Hello everyone,

I am interested in working on this issue and would like to contribute a solution.

Objective: Modify "NearestNeighbors" to allow processing of NaN values when using metrics that can handle them, such as nan_euclidean.
Approach:

  • Add a parameter (e.g., handle_missing='allow') to control the behavior of NearestNeighbors when encountering NaN values (as in #28043).
  • Update validation checks to permit NaN values in X when a compatible metric is used.
  • Ensure compatibility: If a metric that does not support NaN is used, the estimator should continue to raise an appropriate error.
  • Write tests to cover the new functionality and update documentation to reflect the changes...
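For reference, the nan_euclidean metric mentioned in the objective can be sketched in plain NumPy as follows. The function name here is illustrative; the formula follows the documented definition used by sklearn.metrics.pairwise.nan_euclidean_distances, where the squared distance over coordinates present in both vectors is scaled up by the ratio of total to present coordinates.

```python
import numpy as np


def nan_euclidean(x, y):
    """Euclidean distance that skips coordinates missing in either vector.

    The squared distance over present coordinates is multiplied by
    (total coordinates / present coordinates). NaN is returned when the
    two vectors share no present coordinate.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    present = ~(np.isnan(x) | np.isnan(y))
    if not present.any():
        return float("nan")
    squared = np.sum((x[present] - y[present]) ** 2)
    weight = x.size / present.sum()
    return float(np.sqrt(weight * squared))


d = nan_euclidean([0.0, 1.0], [1.0, np.nan])
print(d)
```

Only the first coordinate is present in both vectors here, so the squared difference of 1 is weighted by 2 / 1 before taking the square root.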

Please let me know if this plan sounds acceptable or if there are any suggestions or concerns before I proceed.

Thank you!

@adrinjalali
Member

I'm not sure if the same solution as in SplineTransformer applies here.

#25330 already allows supporting nan in these classes (for euclidean distance). I'm happy to add more metrics which can handle nan, but I think that PR closes this one as well.
