
MAINT introduce kulczynski1 in place of kulsinski #25212


Closed
wants to merge 16 commits

Conversation

glemaitre
Member

closes #25202

Working around the deprecation and removal of the "kulsinski" metric in SciPy.

@glemaitre
Member Author

Looking at the failure, it seems that this is not only a renaming. I need to investigate what the difference between the two functions is in the SciPy documentation.

@glemaitre
Member Author

So apparently the definition of the metric is not the same, as shown in the small example snippet in the documentation:

kulczynski1 vs. kulsinski
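
For reference, here is a minimal comparison of the two definitions as given in the SciPy reference documentation; the helper function and sample vectors below are an ad hoc sketch, not code from this PR:

import numpy as np

def bool_counts(u, v):
    # Count the boolean agreement/disagreement patterns used by both formulas.
    u, v = np.asarray(u, dtype=bool), np.asarray(v, dtype=bool)
    c_tt = np.sum(u & v)
    c_tf = np.sum(u & ~v)
    c_ft = np.sum(~u & v)
    return c_tt, c_tf, c_ft

u, v = [1, 0, 1, 1], [1, 1, 0, 1]
c_tt, c_tf, c_ft = bool_counts(u, v)
n = len(u)

# kulsinski (deprecated, then removed from SciPy): (c_TF + c_FT - c_TT + n) / (c_FT + c_TF + n)
kulsinski = (c_tf + c_ft - c_tt + n) / (c_ft + c_tf + n)
# kulczynski1 (added in SciPy 1.8): c_TT / (c_TF + c_FT)
kulczynski1 = c_tt / (c_tf + c_ft)

print(kulsinski, kulczynski1)  # ~0.67 vs 1.0: clearly not the same quantity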

@@ -1639,7 +1647,7 @@ def _pairwise_callable(X, Y, metric, force_all_finite=True, **kwds):
     "dice",
     "hamming",
     "jaccard",
-    "kulsinski",
+    "kulsinski" if sp_version < parse_version("1.8") else "kulczynski1",
Member

Isn't it rather this?

Suggested change
"kulsinski" if sp_version < parse_version("1.8") else "kulczynski1",
*(
    ("kulsinski", "kulczynski1")
    if sp_version < parse_version("1.8")
    else ("kulczynski1",)
),

This suggestion also applies in other places.

Member Author

"kulczynski1" was introduced in 1.8. Eventually, we could have a transition for 1.8 and 1.9 where both metrics are available which is closer to the deprecation cycle.

Member Author

It would be better, since the two metrics do not lead to the same results, which is something I did not expect at first.

@@ -81,7 +81,7 @@ class OPTICS(ClusterMixin, BaseEstimator):
       'manhattan']

     - from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
-      'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski',
+      'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'kulczynski1',
Member

Should we also deprecate 'kulsinski' in scikit-learn?

Member Author

Certainly: scipy/scipy#2009
I even think that scikit-learn could be the reason for having this metric in SciPy.

For this PR, I would just like to add support for this metric and deal with the SciPy part. Then, we can see how the deprecation would go for the scikit-learn part and whether we need to backport something for some time.

@glemaitre
Member Author

I am a bit blocked by the remaining error: a test checks that the ball tree query provides the same results as the brute force approach, and this is not the case with the newly implemented metric. I am wondering whether there is something fishy with the inf and nan bounds when building the tree, but I don't know much about the BallTree code.

@glemaitre
Member Author

Uhm, no luck with the nan and inf case, so it is certainly something else in the BallTree.

@glemaitre
Member Author

I found the reason for the failure in the BallTree, and it is indeed linked to an inf upper bound.

In the BallTree, each node defines a centroid and a radius. This radius is based on the distance between the centroid and the furthest point.

The Kulczynski distance works on binarized data. However, the centroids are not defined with binary values. The metric itself considers that any non-zero value should be cast to one. We therefore end up in a situation where the distance between the centroid and the furthest point is generally inf. This impacts the search for the nearest centroids since the balls are ill-defined.
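
To make this failure mode concrete, here is a small ad hoc illustration using SciPy's kulczynski1 (SciPy >= 1.8) and applying by hand the "any non-zero value counts as one" cast described above; the numbers are made up for the example:

import numpy as np
from scipy.spatial import distance

# Fractional centroid of a ball (the mean of its binary samples) and one of its points.
centroid = np.array([0.5, 0.25, 0.75])
point = np.array([True, True, True])

# Once cast to binary, the centroid is an all-ones vector, so c_TF == c_FT == 0
# and kulczynski1 = c_TT / (c_TF + c_FT) divides by zero.
print(distance.kulczynski1(centroid != 0, point))  # inf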

Just to confirm this intuition, I slightly modified the way centroids are computed in the ball tree:

        for j in range(n_features):
            # mean of the node's points along feature j
            centroid[j] /= n_points
            # threshold the mean so that the centroid is itself a binary sample
            if centroid[j] > 0.5:
                centroid[j] = 1.0
            else:
                centroid[j] = 0.0

In this case, the centroids take the values expected by the metric and the test passes.
However, I don't know if it makes sense to threshold the centroids this way.

Member

@thomasjpfan left a comment

Given how the centroids are computed, I'm surprised that BallTree does something reasonable with any boolean metric.

As for kulczynski1, it is not really a metric:

from scipy.spatial import distance
x = [0, 1, 1]
distance.kulczynski1(x, x)
# inf

A metric would require the above to return 0.

@jjerphan
Member

jjerphan commented Jan 3, 2023

Does it make sense to have BallTree support "kulsinski" or even boolean distance metrics?

@glemaitre
Member Author

Does it make sense to have BallTree support "kulsinski" or even boolean distance metrics?

It is my original question :)

We should investigate closely what happens with the Hamming distance and why the test passes there. I assume that the distance does not output inf, which makes the test pass, but the results are rubbish because the centroid of each ball is not a binary sample and we end up always comparing against vectors of all 1s.

@jjerphan
Member

jjerphan commented Jan 3, 2023

Based on the context given by #20582, BallTree might have been designed for the Euclidean Distance metric case and then extended to support other distance metrics.

I agree we need to take some time to assess:

  • if using BallTree with other (boolean) distance metrics makes sense.
  • if kulczynski1 can be used as a proxy/replacement for kulsinski.

What do you think?

@ogrisel
Member

ogrisel commented Jan 3, 2023

I think that Ball-Tree should work for any proper metric, that is, one satisfying:

  • d(x, x) = 0,
  • d(x, y) > 0 if x != y,
  • d(x, y) = d(y, x)
  • d(x, y) <= d(x, z) + d(z, y)

As noted in #25212 (review), the kulczynski1 dissimilarity is certainly not a valid metric.

So let's not include it as part of the list of valid metrics for the Ball Tree method.

As for the suitability of ball-tree for boolean metrics in general, it's indeed likely that the centroid init implemented in init_node is not well adapted to them, but maybe it still works thanks to a silent rounding operation to 0.0 or 1.0 in the distance computation itself for those metrics?

To keep the PR focused and not delay the fix, let's not change this as part of this PR and rather open a dedicated follow-up issue to investigate if the current ball-tree implementation is correct for boolean metrics and data. At least the test suite should be updated to test the boolean metrics (e.g. the equivalence of ball-tree vs brute) on boolean data.
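
A rough sketch of the kind of equivalence test suggested above; the parametrized metrics, dataset shape, and the choice to compare only distances are assumptions, not the actual scikit-learn test:

import numpy as np
import pytest
from sklearn.neighbors import NearestNeighbors

@pytest.mark.parametrize("metric", ["hamming", "jaccard", "dice"])
def test_ball_tree_matches_brute_on_boolean_data(metric):
    rng = np.random.RandomState(0)
    X = rng.randint(0, 2, size=(50, 10)).astype(bool)

    dist_tree, _ = (
        NearestNeighbors(n_neighbors=5, algorithm="ball_tree", metric=metric)
        .fit(X)
        .kneighbors(X)
    )
    dist_brute, _ = (
        NearestNeighbors(n_neighbors=5, algorithm="brute", metric=metric)
        .fit(X)
        .kneighbors(X)
    )

    # Compare distances only: indices may legitimately differ when several
    # neighbors are tied at the same boolean distance.
    np.testing.assert_allclose(dist_tree, dist_brute)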

@jjerphan
Member

jjerphan commented Jan 4, 2023

To address #25202, I think the best option would be to conditionally resolve metric="kulsinski" to KulsinskiDistance{,32}, which is already present in the code base (we do not need to introduce kulczynski1 for this).

@jjerphan
Member

jjerphan commented Jan 4, 2023

#25285 has been opened in this regard.

@glemaitre
Member Author

To address #25202, I think the best option would be to conditionally resolve metric="kulsinski" to KulsinskiDistance{,32}, which is already present in the code base (we do not need to introduce kulczynski1 for this).

But kulczynski1 is different from kulsinski. We should stop providing support for kulsinski, so I don't think conditionally resolving to our implementation is a wise choice here.

I can see the following items to go forward:

  • stop exposing kulsinski if SciPy does not. We would simply follow their deprecation cycle in this manner (see the sketch after this list).
  • allow kulczynski1 in pairwise distances computation by using SciPy.
  • remove KulsinskiDistance because it does not implement the right "metric"
  • stop exposing KulsinskiDistance as a valid metric in the BallTree
  • decide whether or not to implement Kulczynski1Distance in DistanceMetric.
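
For the first item, a rough sketch of what conditionally exposing the metric could look like; the list name and its content are hypothetical, and the 1.11 version gate assumes that is the release in which SciPy drops "kulsinski" (the nightly failure mentioned later in the thread involves SciPy 1.11):

from sklearn.utils.fixes import parse_version, sp_version

# Hypothetical module-level list of boolean metrics forwarded to SciPy;
# the actual variable names in scikit-learn differ.
SCIPY_BOOL_METRICS = [
    "dice",
    "hamming",
    "jaccard",
    "rogerstanimoto",
    "russellrao",
    "sokalmichener",
    "sokalsneath",
]

# Only advertise "kulsinski" while SciPy still ships it, so that scikit-learn
# simply follows SciPy's own deprecation cycle.
if sp_version < parse_version("1.11"):
    SCIPY_BOOL_METRICS.append("kulsinski")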

@jjerphan
Member

jjerphan commented Jan 4, 2023

Can we remove support for "kulsinski" and not bring explicit support for "kulczynski1"?

@ogrisel
Member

ogrisel commented Jan 4, 2023

Can we remove support for "kulsinski" and not bring explicit support for "kulczynski1"?

Why not? Apparently they do not compute the same thing.

@ogrisel
Member

ogrisel commented Jan 4, 2023

The plan proposed in #25212 (comment) sounds good to me.

We can probably split it into 2 or more PRs to ease review.

@glemaitre
Member Author

We are losing coverage because we are testing against the future SciPy (1.11), for which we don't collect coverage.

@jjerphan
Member

jjerphan commented Jan 17, 2023

@glemaitre: in your view, which parts of the plan proposed in #25212 (comment) should we address in this PR?

I think this PR can focus on:

  • stop exposing kulsinski if SciPy does not. We use their deprecation cycle in this manner.
  • stop exposing KulsinskiDistance as a valid metric in the BallTree

What do you think?

@glemaitre
Member Author

glemaitre commented Jan 17, 2023

By splitting the PR, I have no hope of introducing kulczynski1, given the lack of reviewers.

@thomasjpfan
Member

allow kulczynski1 in pairwise distances computation by using SciPy.

I prefer to disallow kulczynski1 even if SciPy supports it. Given that kulczynski1 is a dissimilarity and gives d(x, x) == inf (as shown in #25212 (review)), the diagonal of the pairwise distance matrix would be inf.
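
A quick check of that diagonal behaviour directly with SciPy >= 1.8, on a made-up binary array:

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0, 1, 1], [1, 0, 1]], dtype=bool)

# kulczynski1(x, x) has c_TF == c_FT == 0, so every diagonal entry is inf.
print(np.diag(cdist(X, X, metric="kulczynski1")))  # [inf inf]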

@ogrisel
Member

ogrisel commented Jan 18, 2023

I prefer to disallow kulczynski1 even if SciPy supports it.

I am fine with that as well. We can always introduce it in a later version if users ask for it.

Successfully merging this pull request may close these issues: ⚠️ CI failed on Linux_Nightly.pylatest_pip_scipy_dev ⚠️