-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
#4475 : Add a safe_pairwise_distances function, dealing with zero varian... #4495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ith zero variance samples when using correlation metric. The best fix would be to have the metric not returning NaN values, but as the correlation metric is actually computed by spicy, we can't modify it directly. So, in case of metric=='correlation', we replace rows and cols corresponding to zero variance samples by the maximum distance (here 1.0).
Why do you define a new function and not fix |
As there are several returns statements in pairwise_distances, and the check I added must be done for several of them, it seemed to me it would be clearer to do it after applying the original function. I pushed a new commit with the fix directly in pairwise_distances |
|
||
return _parallel_pairwise(X, Y, func, n_jobs, **kwds) | ||
if distances is None : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not an else here, so you don't need to initialize distances?
The implementation of correlation is 7 lines: https://github.com/scipy/scipy/blob/master/scipy/spatial/distance.py#L328 |
To explain my reasoning: Adding another general function to the public interface for a fix seemed overkill for a single metric fix. Adding metric-specific code in a very general function seemed like a bad idea (this function could become very long if every metric needed a fix). |
I agree that it would cleaner to fix directly the metric, but there are several points I'd like to clarify:
|
It would be fine to add a metric that only supports dense matrices. |
@Nzeuwik, do you intend to follow up @amueller's suggestion of implementing a metric? |
I am closing this PR since the original issue was closed: #4475. Thank you for working on this PR! |
#4475 : Add a safe_pairwise_distances function, dealing with zero variance samples when using correlation metric.
The best fix would be to have the metric not returning NaN values, but as the correlation metric is actually computed by scipy, we can't modify it directly.
So, when metric=='correlation', we replace rows and cols corresponding to zero variance samples by the maximum distance (here 1.0).
This change is