
MAINT introduce kulczynski1 in place of kulsinski #25212


Closed
wants to merge 16 commits

Conversation

glemaitre
Member

closes #25202

Working around the deprecation and removal of the "kulsinski" metric in SciPy.

@glemaitre
Member Author

Looking at the failure, it seems that this is not only a renaming. I need to investigate what the difference between the two functions is in the SciPy documentation.

@glemaitre
Member Author

So apparently the definition of the metric is not the same, as shown in the small example snippet in the documentation:

kulczynski1 vs. kulsinski
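
For reference, here is a minimal comparison of the two definitions as given in the SciPy reference documentation; the helper function and sample vectors below are an ad hoc sketch, not code from this PR:

import numpy as np

def bool_counts(u, v):
    # Count the boolean agreement/disagreement patterns used by both formulas.
    u, v = np.asarray(u, dtype=bool), np.asarray(v, dtype=bool)
    c_tt = np.sum(u & v)
    c_tf = np.sum(u & ~v)
    c_ft = np.sum(~u & v)
    return c_tt, c_tf, c_ft

u, v = [1, 0, 1, 1], [1, 1, 0, 1]
c_tt, c_tf, c_ft = bool_counts(u, v)
n = len(u)

# kulsinski (deprecated, then removed from SciPy): (c_TF + c_FT - c_TT + n) / (c_FT + c_TF + n)
kulsinski = (c_tf + c_ft - c_tt + n) / (c_ft + c_tf + n)
# kulczynski1 (added in SciPy 1.8): c_TT / (c_TF + c_FT)
kulczynski1 = c_tt / (c_tf + c_ft)

print(kulsinski, kulczynski1)  # ~0.67 vs 1.0: clearly not the same quantity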

@@ -1639,7 +1647,7 @@ def _pairwise_callable(X, Y, metric, force_all_finite=True, **kwds):
     "dice",
     "hamming",
     "jaccard",
-    "kulsinski",
+    "kulsinski" if sp_version < parse_version("1.8") else "kulczynski1",
Member

Isn't it rather this?

Suggested change
"kulsinski" if sp_version < parse_version("1.8") else "kulczynski1",
*(
    ("kulsinski", "kulczynski1")
    if sp_version < parse_version("1.8")
    else ("kulczynski1",)
),

This suggestion also applies in other places.

Member Author

"kulczynski1" was introduced in 1.8. Eventually, we could have a transition for 1.8 and 1.9 where both metrics are available which is closer to the deprecation cycle.

Member Author

It would be better, since the two metrics do not lead to the same results, which is something I did not expect at first.

@@ -81,7 +81,7 @@ class OPTICS(ClusterMixin, BaseEstimator):
       'manhattan']

     - from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
-      'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski',
+      'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'kulczynski1',
Member

Should we also deprecate 'kulsinski' in scikit-learn?

Member Author

Certainly: scipy/scipy#2009
I even think that scikit-learn could be the reason for having this metric in SciPy.

For this PR, I would just like to add support for this metric and deal with the SciPy part. Then, we can see how the deprecation would go for the scikit-learn part and whether we need to backport something for some time.

@glemaitre
Member Author

I am a bit blocked by the remaining error: a test checks that the ball tree query provides the same results as the brute force approach, and this is not the case with the newly implemented metric. I am wondering whether there is something fishy with the inf and nan bounds when building the tree, but I don't know much about the BallTree code.

@glemaitre
Member Author

Uhm, no luck with the nan and inf case, so it is certainly something else in the BallTree.

@glemaitre
Member Author

I found the reason for the failure in the BallTree, and it is indeed linked to an inf upper bound.

In the BallTree, each node defines a centroid and a radius. This radius is based on the distance between the centroid and the furthest point.

The Kulczynski distance works on binarized data. However, the centroids are not defined with binary values. The metric itself considers that any non-zero value should be cast to one. We therefore end up in a situation where the distance between the centroid and the furthest point is generally inf. This impacts the search for the nearest centroids since the balls are ill-defined.
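
To make this failure mode concrete, here is a small ad hoc illustration using SciPy's kulczynski1 (SciPy >= 1.8) and applying by hand the "any non-zero value counts as one" cast described above; the numbers are made up for the example:

import numpy as np
from scipy.spatial import distance

# Fractional centroid of a ball (the mean of its binary samples) and one of its points.
centroid = np.array([0.5, 0.25, 0.75])
point = np.array([True, True, True])

# Once cast to binary, the centroid is an all-ones vector, so c_TF == c_FT == 0
# and kulczynski1 = c_TT / (c_TF + c_FT) divides by zero.
print(distance.kulczynski1(centroid != 0, point))  # inf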

Just to confirm this intuition, I slightly modified the way centroids are computed in the ball tree:

        for j in range(n_features):
            # mean of the node's points along feature j
            centroid[j] /= n_points
            # threshold the mean so that the centroid is itself a binary sample
            if centroid[j] > 0.5:
                centroid[j] = 1.0
            else:
                centroid[j] = 0.0

In this case, the centroids take the values expected by the metric and the test passes.
However, I don't know if it makes sense to threshold the centroids this way.

Member

@thomasjpfan left a comment

Given how the centroids are computed, I'm surprised that BallTree does something reasonable with any boolean metric.

As for kulczynski1, it is not really a metric:

from scipy.spatial import distance
x = [0, 1, 1]
distance.kulczynski1(x, x)
# inf

A metric would require the above to return 0.

@jjerphan
Member

jjerphan commented Jan 3, 2023

Does it make sense to have BallTree support "kulsinski" or even boolean distance metrics?

@glemaitre
Member Author

Does it make sense to have BallTree support "kulsinski" or even boolean distance metrics?

It is my original question :)

We should investigate closely what happens with the Hamming distance and why the test passes there. I assume that the distance does not output inf, which makes the test pass, but the results are rubbish because the centroid of each ball is not a binary sample and we end up always comparing against vectors of all 1s.

@jjerphan
Member

jjerphan commented Jan 3, 2023

Based on the context given by #20582, BallTree might have been designed for the Euclidean Distance metric case and then extended to support other distance metrics.

I agree we need to take some time to assess:

  • if using BallTree with other (boolean) distance metrics makes sense.
  • if kulczynski1 can be used as a proxy/replacement for kulsinski.

What do you think?

@ogrisel
Member

ogrisel commented Jan 3, 2023

I think that Ball-Tree should work for any proper metric, that is, one satisfying:

  • d(x, x) = 0,
  • d(x, y) > 0 if x != y,
  • d(x, y) = d(y, x)
  • d(x, y) <= d(x, z) + d(z, y)

As noted in #25212 (review), the kulczynski1 dissimilarity is certainly not a valid metric.

So let's not include it as part of the list of valid metrics for the Ball Tree method.

As for the suitability of ball-tree for boolean metrics in general, it's indeed likely that the centroid init implemented in init_node is not well adapted to them, but maybe it still works thanks to a silent rounding operation to 0.0 or 1.0 in the distance computation itself for those metrics?

To keep the PR focused and not delay the fix, let's not change this as part of this PR and rather open a dedicated follow-up issue to investigate if the current ball-tree implementation is correct for boolean metrics and data. At least the test suite should be updated to test the boolean metrics (e.g. the equivalence of ball-tree vs brute) on boolean data.
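
A rough sketch of the kind of equivalence test suggested above; the parametrized metrics, dataset shape, and the choice to compare only distances are assumptions, not the actual scikit-learn test:

import numpy as np
import pytest
from sklearn.neighbors import NearestNeighbors

@pytest.mark.parametrize("metric", ["hamming", "jaccard", "dice"])
def test_ball_tree_matches_brute_on_boolean_data(metric):
    rng = np.random.RandomState(0)
    X = rng.randint(0, 2, size=(50, 10)).astype(bool)

    dist_tree, _ = (
        NearestNeighbors(n_neighbors=5, algorithm="ball_tree", metric=metric)
        .fit(X)
        .kneighbors(X)
    )
    dist_brute, _ = (
        NearestNeighbors(n_neighbors=5, algorithm="brute", metric=metric)
        .fit(X)
        .kneighbors(X)
    )

    # Compare distances only: indices may legitimately differ when several
    # neighbors are tied at the same boolean distance.
    np.testing.assert_allclose(dist_tree, dist_brute)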

@jjerphan
Member

jjerphan commented Jan 4, 2023

To address #25202, I think the best option would be to conditionally resolve metric="kulsinski" to KulsinskiDistance{,32}, which is already present in the code base (we do not need to introduce kulczynski1 for this).

@jjerphan
Member

jjerphan commented Jan 4, 2023

#25285 has been opened in this regard.

@glemaitre
Member Author

To address #25202, I think the best option would be to conditionally resolve metric="kulsinski" to KulsinskiDistance{,32}, which is already present in the code base (we do not need to introduce kulczynski1 for this).

But kulczynski1 is different from kulsinski. We should stop providing support for kulsinski, so I don't think conditionally resolving to our implementation is a wise choice here.

I can see the following items to go forward:

  • stop exposing kulsinski if SciPy does not. We would simply follow their deprecation cycle in this manner (see the sketch after this list).
  • allow kulczynski1 in pairwise distances computation by using SciPy.
  • remove KulsinskiDistance because it does not implement the right "metric"
  • stop exposing KulsinskiDistance as a valid metric in the BallTree
  • decide whether or not to implement Kulczynski1Distance in DistanceMetric.
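
For the first item, a rough sketch of what conditionally exposing the metric could look like; the list name and its content are hypothetical, and the 1.11 version gate assumes that is the release in which SciPy drops "kulsinski" (the nightly failure mentioned later in the thread involves SciPy 1.11):

from sklearn.utils.fixes import parse_version, sp_version

# Hypothetical module-level list of boolean metrics forwarded to SciPy;
# the actual variable names in scikit-learn differ.
SCIPY_BOOL_METRICS = [
    "dice",
    "hamming",
    "jaccard",
    "rogerstanimoto",
    "russellrao",
    "sokalmichener",
    "sokalsneath",
]

# Only advertise "kulsinski" while SciPy still ships it, so that scikit-learn
# simply follows SciPy's own deprecation cycle.
if sp_version < parse_version("1.11"):
    SCIPY_BOOL_METRICS.append("kulsinski")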

@jjerphan
Member

jjerphan commented Jan 4, 2023

Can we remove support for "kulsinski" and not bring explicit support for "kulczynski1"?

@ogrisel
Member

ogrisel commented Jan 4, 2023

Can we remove support for "kulsinski" and not bring explicit support for "kulczynski1"?

Why not? Apparently they do not compute the same thing.

@ogrisel
Member

ogrisel commented Jan 4, 2023

The plan proposed in #25212 (comment) sounds good to me.

We can probably split it into 2 or more PRs to ease review.

@glemaitre
Member Author

We are losing coverage because we are testing against the future SciPy (1.11), for which we don't collect coverage.

@jjerphan
Member

jjerphan commented Jan 17, 2023

@glemaitre: in your view, which parts of the plan proposed in #25212 (comment) should we address in this PR?

I think this PR can focus on:

  • stop exposing kulsinski if SciPy does not. We use their deprecation cycle in this manner.
  • stop exposing KulsinskiDistance as a valid metric in the BallTree

What do you think?

@glemaitre
Member Author

glemaitre commented Jan 17, 2023

By splitting the PR, I have no hope of introducing kulczynski1, given the lack of reviewers.

@thomasjpfan
Member

allow kulczynski1 in pairwise distances computation by using SciPy.

I prefer to disallow kulczynski1 even if SciPy supports it. Given that kulczynski1 is a dissimilarity and gives d(x, x) == inf (as shown in #25212 (review)), the diagonal of the pairwise distance matrix would be inf.
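
A quick check of that diagonal behaviour directly with SciPy >= 1.8, on a made-up binary array:

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0, 1, 1], [1, 0, 1]], dtype=bool)

# kulczynski1(x, x) has c_TF == c_FT == 0, so every diagonal entry is inf.
print(np.diag(cdist(X, X, metric="kulczynski1")))  # [inf inf]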

@ogrisel
Member

ogrisel commented Jan 18, 2023

I prefer to disallow kulczynski1 even if SciPy supports it.

I am fine with that as well. We can always introduce it in a later version if users ask for it.

Successfully merging this pull request may close these issues: ⚠️ CI failed on Linux_Nightly.pylatest_pip_scipy_dev ⚠️