MAINT introduce kulczynski1 in place of kulsinski #25212
Looking at the failure, it seems that this is not only a renaming. I need to investigate what the difference between the two functions is in the SciPy documentation.
So apparently the definition of the metric is not the same as shown in the small example snippet in the documentation:
sklearn/metrics/pairwise.py
```diff
@@ -1639,7 +1647,7 @@ def _pairwise_callable(X, Y, metric, force_all_finite=True, **kwds):
     "dice",
     "hamming",
     "jaccard",
-    "kulsinski",
+    "kulsinski" if sp_version < parse_version("1.8") else "kulczynski1",
```
Isn't it rather this?
"kulsinski" if sp_version < parse_version("1.8") else "kulczynski1", | |
( | |
*("kulsinski", "kulczynski1") | |
if sp_version < parse_version("1.8") | |
else "kulczynski1" | |
), |
This suggestion also applies in other places.
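For what it's worth, a hedged sketch of the version-gated pattern under discussion (list and variable names here are illustrative; scikit-learn keeps its own `sp_version`/`parse_version` helpers in `sklearn.utils.fixes`). The idea is to expose whichever spelling(s) the installed SciPy actually provides, given that `kulczynski1` appeared in 1.8 and `kulsinski` was removed in 1.11:

```python
from scipy import __version__ as scipy_version
from packaging.version import parse as parse_version

sp_version = parse_version(scipy_version)

VALID_SCIPY_METRICS = [
    "dice",
    "hamming",
    "jaccard",
    # Only list the name(s) the installed SciPy actually supports.
    *(
        ("kulsinski",)
        if sp_version < parse_version("1.8")
        else ("kulsinski", "kulczynski1")
        if sp_version < parse_version("1.11")
        else ("kulczynski1",)
    ),
]
```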
"kulczynski1" was introduced in 1.8. Eventually, we could have a transition for 1.8 and 1.9 where both metrics are available which is closer to the deprecation cycle.
That would be better, since the two metrics do not lead to the same results, which is something I did not expect at first.
```diff
@@ -81,7 +81,7 @@ class OPTICS(ClusterMixin, BaseEstimator):
       'manhattan']

     - from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
-      'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski',
+      'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'kulczynski1',
```
Should we also deprecate 'kulsinski' in scikit-learn as well?
Certainly: scipy/scipy#2009
I even think that scikit-learn could be the reason for having this metric in SciPy.
For this PR, I would just like to add support for this metric and deal with the SciPy part. Then, we can see how the deprecation would go on the scikit-learn side and whether we need to backport something at some point.
I am a bit blocked by the remaining error: it is a test checking that the ball-tree query provides the same results as the brute force. It seems that this is not the case with the newly implemented metric. I am wondering if there is not something fishy with the bound.
Uhm, no luck if the …
I found the reason for the failure: the ball-tree centroids. The Kulczynski distance works on data that are binarized. However, centroids are not defined with binary values. The metric itself considers that whatever value is not zero should be cast to one. We therefore end up in a situation where the distance between the centroid and the furthest point is generally not the one the bound check expects.

Just to confirm this intuition, I slightly modified the way centroids are computed in the ball tree:

```python
for j in range(n_features):
    centroid[j] /= n_points
    if centroid[j] > 0.5:
        centroid[j] = 1.0
    else:
        centroid[j] = 0.0
```

In this case, the centroids take the binary values expected by the metric and the test passes.
Given how the centroids are computed, I'm surprised that BallTree does something reasonable with any boolean metric.

As for kulczynski1, it is not really a metric:

```python
from scipy.spatial import distance

x = [0, 1, 1]
distance.kulczynski1(x, x)
# inf
```

A metric would require the above to return 0.
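For reference, SciPy defines kulczynski1(u, v) as c_TT / (c_TF + c_FT), the ratio of agreeing-True counts to disagreement counts: for identical vectors both disagreement counts are zero, so the result diverges to inf instead of being 0.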
Does it make sense to have kulczynski1?
It is my original question :) We should investigate closely what happens with the Hamming distance and what the reason is for the test passing. I assume that the distance does not output inf.
Based on the context given by #20582, I agree we need to take some time to assess:
What do you think?
I think that the Ball-Tree should work for any proper metric, that is, one such that:

- d(x, y) >= 0, with d(x, y) == 0 if and only if x == y;
- d(x, y) == d(y, x) (symmetry);
- d(x, z) <= d(x, y) + d(y, z) (triangle inequality).
As noted in #25212 (review), kulczynski1 is not a proper metric (it does not return 0 for identical points). So let's not include it as part of the list of valid metrics for the Ball-Tree method.

As for the suitability of the ball tree for boolean metrics in general, it's indeed likely that the centroid init implemented in the ball tree is questionable for boolean data. To keep the PR focused and not delay the fix, let's not change this as part of this PR and rather open a dedicated follow-up issue to investigate whether the current ball-tree implementation is correct for boolean metrics and data. At least the test suite should be updated to test the boolean metrics (e.g. the equivalence of ball-tree vs brute) on boolean data.
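As a hedged illustration of that last point, such an equivalence test could look like the following (the estimator setup and metric list are assumptions, not the actual test from the PR):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(50, 10) > 0.5  # boolean data, as the metrics expect
X[:, 0] = True  # avoid all-False rows, which are degenerate for dice/jaccard

for metric in ["hamming", "jaccard", "dice"]:
    brute = NearestNeighbors(n_neighbors=5, algorithm="brute", metric=metric).fit(X)
    tree = NearestNeighbors(n_neighbors=5, algorithm="ball_tree", metric=metric).fit(X)
    dist_brute, _ = brute.kneighbors(X)
    dist_tree, _ = tree.kneighbors(X)
    # Compare distances only: neighbor indices may legitimately differ on ties.
    np.testing.assert_allclose(dist_brute, dist_tree)
```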
To address #25202, I think the best would just be to conditionally resolve the metric name depending on the installed SciPy version.
#25285 has been opened in this regard.
But I can see the following items to go forward:
Can we remove support for kulsinski?
Why not? Apparently they do not compute the same thing.
The plan proposed in #25212 (comment) sounds good to me. We can probably split it into 2 or more PRs to ease review.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
We are losing coverage because we are testing the future (SciPy 1.11), for which we don't collect coverage.
@glemaitre: in your opinion, which parts of the plan proposed in #25212 (comment) should we address in this PR? I think this PR can focus on:
What do you think?
By splitting the PR, I have no hope of introducing …
I prefer to disallow kulczynski1.
I am fine with that as well. We can always introduce it in a later version if users ask for it.
closes #25202
Working around the deprecation and removal of the "kulsinski" metric in SciPy.
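For illustration, a hedged sketch of what such a workaround can look like on SciPy 1.8–1.10, where the old name still works but emits a DeprecationWarning (the PR's actual handling may differ):

```python
import warnings

from scipy.spatial import distance

X = [[0, 1, 1], [1, 1, 0]]
with warnings.catch_warnings():
    # SciPy 1.8-1.10 emits a DeprecationWarning when the old name is used.
    warnings.simplefilter("ignore", DeprecationWarning)
    d = distance.pdist(X, metric="kulsinski")
```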