-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
sklearn.neighbors.NearestNeighbors
allow processing nan values
#29085
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
sklearn.neighbors.NearestNeighbors
allow processing nan valuessklearn.neighbors.NearestNeighbors
allow processing nan values
We're happy to allow nans in our estimators, and some work is ongoing in this regard. E.g. #28043 Feel free to open PR for this. |
We can replace process data to replace empty values with a placeholder that won't affect the distance calculation ? |
Not really, that sounds too hackie and I'm not sure there's any theoretical backing for how we'd handle it. |
@adrinjalali other than imputation there is not much on missing value in KNN? If you have any thoughts let me know |
I haven't thought about the details. In terms of API, if "supporting missing values" means imputing them one way or another, then we shouldn't be doing it and the user should be doing that as a step before KNN in the pipeline. cc @ogrisel and @lorentzenchr since I think they've had opinions on this in the past. |
@durga0201 not support your idea. It would be advantageous to have an option to define a custom metric that handles empty positions in a specific way. To achieve this, it is necessary to provide the ability to identify which positions were empty and which were not within the function passed as the |
Hello everyone, I am interested in working on this issue and would like to contribute a solution. Objective: Modify "NearestNeighbors" to allow processing of NaN values when using metrics that can handle them, such as nan_euclidean.
Please let me know if this plan sounds acceptable or if there are any suggestions or concerns before I proceed. Thank you! |
I'm not sure if the same solution as in #25330 already allows supporting |
Describe the workflow you want to enable
In some cases (for example memory-based collaborative filtering) empty values is important part of the algorithm. But
sklearn.neighbors.NearestNeighbors
doesn't allow to handle empty values, even if metrics can handle such cases.The following code snippet shows a simple realisation of building collaboration with
sklearn.neighbors.NearestNeighbors
.It raises
ValueError: Input X contains NaN.
. You can see from the trace that it just happens during the data validation phase -sklearn/utils/validation.py
, function_assert_all_finite
.It would be great if you could use data with empty values in some cases.
Describe your proposed solution
This can be solved by adding a parameter to
sklearn.neighbors.NearestNeighbors
that allows to use empty values in the algorithm. If this parameter is set toTrue
, validation functions won't throw errors if array with nan is passed.Describe alternatives you've considered, if relevant
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: