`sklearn.neighbors.NearestNeighbors` allow processing nan values #29085

fedorkobak · 2024-05-22T21:38:18Z

Describe the workflow you want to enable

In some cases (for example memory-based collaborative filtering) empty values is important part of the algorithm. But sklearn.neighbors.NearestNeighbors doesn't allow to handle empty values, even if metrics can handle such cases.

The following code snippet shows a simple realisation of building collaboration with sklearn.neighbors.NearestNeighbors.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import correlation as orig_correlation

def correlation(a,b):
    '''
    Modified correlation function that handles
    nan values. If it's impossible to
    to calculate the distance, it returns 2 - 
    the maximum possible value.
    '''
    cond = ~(np.isnan(a) | np.isnan(b))
    # in case if there are only two
    # observations it's impossible
    # to compute coorrelation coeficient
    # it's invalid case - so we return 
    # the biggest possible distance
    if sum(cond) <=1:
        return 2

    a_std = a[cond].std()
    b_std = b[cond].std()

    # Pearson coefficient uses standard 
    # deviations in the denominator, so 
    # if any of them is equal to zero, 
    # we have to return the biggest
    # possible distance.
    if a_std==0 or b_std==0:
        return 2
    return orig_correlation(a[cond],b[cond])

example_array = np.array(
    [[ 5.,  2.,  4.,  6.,  5.,  4.,  6.,  6.,  7.,  6.],
    [ 6., np.NaN,  4.,  7.,  5.,  4.,  8.,  5.,  7.,  6.],
    [ 7., 10.,  1., np.NaN,  9.,  7., np.NaN,  3., np.NaN,  8.],
    [np.NaN,  2.,  4., np.NaN,  4.,  4.,  7.,  7., np.NaN,  6.],
    [ 8.,  1., np.NaN,  7.,  6.,  2.,  2.,  8.,  2.,  1.],
    [ 8., np.NaN, np.NaN, np.NaN,  8.,  7.,  7.,  4., 10.,  9.],
    [ 8.,  1.,  6.,  8.,  5.,  2.,  2., np.NaN,  3.,  1.],
    [ 6., np.NaN,  0.,  5.,  9.,  7.,  7.,  3.,  9.,  6.],
    [np.NaN,  1.,  7.,  8.,  5.,  2., np.NaN, np.NaN, np.NaN,  1.],
    [ 8.,  1.,  7.,  8.,  5., np.NaN,  2.,  8.,  3.,  1.]]
)

NearestNeighbors(
    metric=correlation
).fit(example_array)

It raises ValueError: Input X contains NaN.. You can see from the trace that it just happens during the data validation phase - sklearn/utils/validation.py, function _assert_all_finite.

It would be great if you could use data with empty values in some cases.

Describe your proposed solution

This can be solved by adding a parameter to sklearn.neighbors.NearestNeighbors that allows to use empty values in the algorithm. If this parameter is set to True, validation functions won't throw errors if array with nan is passed.

Describe alternatives you've considered, if relevant

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

adrinjalali · 2024-06-05T11:37:50Z

We're happy to allow nans in our estimators, and some work is ongoing in this regard. E.g. #28043

Feel free to open PR for this.

durga0201 · 2024-06-13T22:29:08Z

We can replace process data to replace empty values with a placeholder that won't affect the distance calculation ?
For example, cosine metric ignores the magnitude of the vectors and only considers the direction, using -1 as a placeholder won't affect the cosine similarity calculations or for minkowski distance, we might consider using a placeholder that is far from the typical range of values in your data. For example, if our data contains only positive values, you could use a negative value that is much smaller than the minimum value in your data, such as -999?

adrinjalali · 2024-06-14T11:07:46Z

Not really, that sounds too hackie and I'm not sure there's any theoretical backing for how we'd handle it.

durga0201 · 2024-06-14T11:14:42Z

@adrinjalali other than imputation there is not much on missing value in KNN? If you have any thoughts let me know

adrinjalali · 2024-06-14T14:56:13Z

I haven't thought about the details. In terms of API, if "supporting missing values" means imputing them one way or another, then we shouldn't be doing it and the user should be doing that as a step before KNN in the pipeline.

cc @ogrisel and @lorentzenchr since I think they've had opinions on this in the past.

fedorkobak · 2024-06-18T22:17:00Z

@durga0201 not support your idea. It would be advantageous to have an option to define a custom metric that handles empty positions in a specific way. To achieve this, it is necessary to provide the ability to identify which positions were empty and which were not within the function passed as the metric attribute. Now passing arrays with empty values just causes an error that stops the programme.

Lucas-Armand · 2024-11-06T01:09:22Z

Hello everyone,

I am interested in working on this issue and would like to contribute a solution.

Objective: Modify "NearestNeighbors" to allow processing of NaN values when using metrics that can handle them, such as nan_euclidean.
Approach:

Add a parameter (e.g., handle_missing='allow') to control the behavior of NearestNeighbors when encountering NaN values. (like in 28043)
Update validation checks to permit NaN values in X when a compatible metric is used.
Ensure compatibility: If a metric that does not support NaN is used, the estimator should continue to raise an appropriate error.
Write tests to cover the new functionality and update documentation to reflect the changes...

Please let me know if this plan sounds acceptable or if there are any suggestions or concerns before I proceed.

Thank you!

adrinjalali · 2024-11-06T08:04:56Z

I'm not sure if the same solution as in SplineTransformer applies here.

#25330 already allows supporting nan in these classes (for euclidean distance). I'm happy to add more metrics which can handle nan, but I think that PR closes this one as well.

fedorkobak added Needs Triage Issue requires triage New Feature labels May 22, 2024

fedorkobak changed the title ~~sklearn.neighbors.NearestNeighborsallow processing nan values~~ sklearn.neighbors.NearestNeighbors allow processing nan values May 22, 2024

adrinjalali removed the Needs Triage Issue requires triage label Jun 5, 2024

adrinjalali added this to Missing value and nan support Jun 5, 2024

adrinjalali closed this as completed Nov 6, 2024

github-project-automation bot moved this to Done in Missing value and nan support Nov 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`sklearn.neighbors.NearestNeighbors` allow processing nan values #29085

`sklearn.neighbors.NearestNeighbors` allow processing nan values #29085

fedorkobak commented May 22, 2024 •

edited

Loading

adrinjalali commented Jun 5, 2024

durga0201 commented Jun 13, 2024

adrinjalali commented Jun 14, 2024

durga0201 commented Jun 14, 2024

adrinjalali commented Jun 14, 2024

fedorkobak commented Jun 18, 2024 •

edited

Loading

Lucas-Armand commented Nov 6, 2024

adrinjalali commented Nov 6, 2024

sklearn.neighbors.NearestNeighbors allow processing nan values #29085

sklearn.neighbors.NearestNeighbors allow processing nan values #29085

Comments

fedorkobak commented May 22, 2024 • edited Loading

Describe the workflow you want to enable

Describe your proposed solution

Describe alternatives you've considered, if relevant

Additional context

adrinjalali commented Jun 5, 2024

durga0201 commented Jun 13, 2024

adrinjalali commented Jun 14, 2024

durga0201 commented Jun 14, 2024

adrinjalali commented Jun 14, 2024

fedorkobak commented Jun 18, 2024 • edited Loading

Lucas-Armand commented Nov 6, 2024

adrinjalali commented Nov 6, 2024

`sklearn.neighbors.NearestNeighbors` allow processing nan values #29085

`sklearn.neighbors.NearestNeighbors` allow processing nan values #29085

fedorkobak commented May 22, 2024 •

edited

Loading

fedorkobak commented Jun 18, 2024 •

edited

Loading