
#4475 : Add a safe_pairwise_distances function, dealing with zero varian... #4495


Closed
LowikC wants to merge 2 commits into scikit-learn:master from Nzeuwik:issue_4475

Conversation

LowikC

@LowikC LowikC commented Apr 2, 2015

#4475 : Add a safe_pairwise_distances function, dealing with zero variance samples when using correlation metric.

The best fix would be to have the metric not return NaN values, but since the correlation metric is actually computed by scipy, we can't modify it directly.
So, when metric=='correlation', we replace the rows and columns corresponding to zero-variance samples with the maximum distance (here 1.0).
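A minimal numpy sketch of the post-processing step described above (the function name and signature here are illustrative, not the PR's actual code):

```python
import numpy as np

def fix_correlation_distances(D, X, Y=None, max_dist=1.0):
    # Illustrative sketch of the described fix: given a correlation-distance
    # matrix D computed from X (and optionally Y), overwrite the rows and
    # columns belonging to zero-variance samples (where scipy yields NaN)
    # with the maximum distance max_dist.
    D = D.copy()
    zero_var_rows = np.var(X, axis=1) == 0
    D[zero_var_rows, :] = max_dist
    Yv = X if Y is None else Y
    zero_var_cols = np.var(Yv, axis=1) == 0
    D[:, zero_var_cols] = max_dist
    return D
```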


@landscape-bot

Code Health
Code quality remained the same when pulling 359fd1a on Nzeuwik:issue_4475 into 1c33a6f on scikit-learn:master.

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 95.11% when pulling 359fd1a on Nzeuwik:issue_4475 into 1c33a6f on scikit-learn:master.

@amueller
Member

amueller commented Apr 2, 2015

Why do you define a new function and not fix pairwise_distances?

@LowikC
Author

LowikC commented Apr 2, 2015

As there are several return statements in pairwise_distances, and the check I added must be applied for several of them, it seemed clearer to do it after calling the original function.

I pushed a new commit with the fix directly in pairwise_distances.

@landscape-bot

Code Health
Code quality remained the same when pulling b444c58 on Nzeuwik:issue_4475 into 1c33a6f on scikit-learn:master.


return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
if distances is None :
Member


why not an else here, so you don't need to initialize distances?

@amueller
Member

amueller commented Apr 2, 2015

The implementation of correlation is 7 lines: https://github.com/scipy/scipy/blob/master/scipy/spatial/distance.py#L328
We don't even need to validate again. Shouldn't we rather just reimplement a "stable" version?
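A hedged sketch of what such a "stable" dense-only reimplementation could look like, mirroring scipy's definition but mapping the zero-variance case to 1.0 as proposed in this PR (the function name is hypothetical):

```python
import numpy as np

def stable_correlation_distance(u, v):
    # Sketch of a "stable" correlation distance, following scipy's formula
    #   1 - dot(u - mean(u), v - mean(v)) / (||u - mean(u)|| * ||v - mean(v)||),
    # but returning 1.0 instead of NaN when either vector has zero variance.
    um = u - u.mean()
    vm = v - v.mean()
    nu = np.linalg.norm(um)
    nv = np.linalg.norm(vm)
    if nu == 0.0 or nv == 0.0:
        return 1.0  # convention proposed in this PR for zero-variance inputs
    return 1.0 - (um @ vm) / (nu * nv)
```

Note that with this convention, stable_correlation_distance(x, x) = 1.0 for a zero-variance x, which is exactly the ambiguity discussed below.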

@amueller
Member

amueller commented Apr 2, 2015

To explain my reasoning: adding another general function to the public interface seemed like overkill for a single-metric fix, and adding metric-specific code to a very general function seemed like a bad idea (that function could become very long if every metric needed a fix).

@LowikC
Author

LowikC commented Apr 12, 2015

I agree that it would be cleaner to fix the metric directly, but there are several points I'd like to clarify:

  • all metrics from sklearn support sparse matrices, whereas metrics from scipy don't. So if we include correlation in sklearn, it should also support sparse matrices, right? And in that case, the implementation will be tougher than the one in scipy.
  • The correlation distance between a zero-variance vector and any other vector should be 1 (instead of NaN). But what should the correlation distance be between two equal zero-variance vectors? 0 or 1? (In general, correlation_distance(x, x) = 0.)

@amueller
Member

It would be fine to add a metric that only supports dense matrices.
For the second, I'm not sure. I'd say 0, but checking x1 == x2 is not robust. If there is a robust way to write it such that it is 0, go for it.

@amueller amueller added the Bug label May 6, 2015
@amueller amueller added this to the 0.16.2 milestone May 6, 2015
@amueller amueller modified the milestones: 0.16.2, 0.17 Sep 8, 2015
@lesteve lesteve modified the milestones: 0.17, 0.18 Jul 27, 2016
@amueller amueller modified the milestones: 0.18, 0.19 Sep 22, 2016
@jnothman jnothman modified the milestones: 0.20, 0.19 Jun 14, 2017
@jnothman
Member

@Nzeuwik, do you intend to follow up @amueller's suggestion of implementing a metric?

@glemaitre glemaitre modified the milestones: 0.20, 0.21 Jun 13, 2018
@jnothman jnothman modified the milestones: 0.21, 0.22 Apr 15, 2019
@jnothman jnothman modified the milestones: 0.22, 0.23 Oct 31, 2019
@thomasjpfan thomasjpfan modified the milestones: 0.23, 0.24 Apr 20, 2020
@rth rth added the Stalled label Jul 25, 2020
@cmarmo cmarmo removed this from the 0.24 milestone Oct 15, 2020
Base automatically changed from master to main January 22, 2021 10:48
@thomasjpfan
Member

I am closing this PR since the original issue was closed: #4475. Thank you for working on this PR!
