[MRG] Fix DBSCAN is missing _pairwise property #11453

gokart23 · 2018-07-07T12:31:23Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Adds the _pairwise attribute to cluster/dbscan_.py, returning True if a precomputed distance metric is indicated.

Also adds a common heuristic test that checks whether an estimator should have the _pairwise property but does not. The test is based on the idea that if an estimator has a metric, affinity or kernel parameter that supports 'precomputed', and the estimator can fit on pairwise data but raises an error on non-square training data, then a _pairwise attribute should exist.
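A rough sketch of the heuristic described above, using two hypothetical toy estimators (ToyPairwise, ToyMissingTag) instead of real scikit-learn classes. The helper name should_have_pairwise is an assumption made for illustration and is not the check added by this PR.

```python
class ToyBase:
    """Hypothetical estimator that accepts metric='precomputed'."""

    def __init__(self, metric="euclidean"):
        self.metric = metric

    def get_params(self, deep=True):
        return {"metric": self.metric}

    def fit(self, X, y=None):
        # A precomputed distance matrix must be square.
        if self.metric == "precomputed" and any(len(row) != len(X) for row in X):
            raise ValueError("Precomputed matrix must be square")
        return self


class ToyPairwise(ToyBase):
    @property
    def _pairwise(self):
        return self.metric == "precomputed"


class ToyMissingTag(ToyBase):
    """Same behaviour, but the _pairwise attribute was forgotten."""


def should_have_pairwise(estimator_cls, square):
    """Return True if the heuristic says a _pairwise attribute is expected."""
    if "metric" not in estimator_cls().get_params(deep=False):
        return False
    try:
        estimator_cls(metric="precomputed").fit(square)
    except Exception:
        return False  # 'precomputed' is not supported at all
    non_square = [row[:-1] for row in square]
    try:
        estimator_cls(metric="precomputed").fit(non_square)
    except ValueError:
        return True  # square-only input implies pairwise data
    return False


square = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.5], [2.0, 1.5, 0.0]]
for cls in (ToyPairwise, ToyMissingTag):
    expected = should_have_pairwise(cls, square)
    has_tag = hasattr(cls(metric="precomputed"), "_pairwise")
    print(cls.__name__, "expected:", expected, "has tag:", has_tag)
```

In this toy setup the heuristic expects the attribute on both classes, and only ToyMissingTag would be flagged because it lacks it.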

jnothman · 2018-07-07T13:44:53Z

This should probably have some kind of test, but I'm not sure what...

(The code is being run by tests at least...)

Otherwise LGTM.

jnothman

I've not double checked everything, but I quite like this!

Perhaps we should test the check in sklearn/utils/tests/test_estimator_checks.py

jnothman · 2018-07-19T20:54:10Z

sklearn/utils/estimator_checks.py

+            continue
+        try:
+            # Construct new object of estimator with desired attribute value
+            modified_estimator = estimator_orig.__class__(


Usually we would just use clone and set_params.

Are you able to fix these issues, @gokart23?

@jnothman sorry for the delay. Is there an expected ETA? I'm a little busy rn, but should be able to address the review this week

jnothman · 2018-07-19T20:54:31Z

sklearn/utils/estimator_checks.py

+
+    for attribute, attribute_value in attributes_to_check:
+        # Check to see if attribute value is supported by estimator
+        if getattr(estimator_orig, attribute, None) is not None:


more conventionally, we would check estimator_orig.get_params().

jnothman · 2018-07-19T20:56:02Z

sklearn/utils/estimator_checks.py

+                    modified_estimator.predict(X=X)
+        except ValueError:
+            # Check if estimator defines _pairwise attribute
+            assert_not_equal(getattr(modified_estimator, '_pairwise', None),


Can we not check as far as "_pairwise should be True iff 'precomputed'"?

@jnothman It's possible to check both ways, but I'm not sure if this test is a necessary condition for _pairwise (though it is sufficient). I implemented the necessary check (you can find it here: gokart23@04209a4) but there turned out to be a couple of failing estimators:

sklearn.preprocessing.data.KernelCenterer : _pairwise attribute added in commit 2242b1b

sklearn.neighbors.unsupervised.NearestNeighbors: _pairwise attribute added in commit f4fc275

I'm not sure if these are valid failures, or how to go about fixing these failures, which is why I haven't included the commit in the PR as yet. What do you think?

jnothman · 2018-07-19T20:56:49Z

sklearn/utils/estimator_checks.py

+                                        attribute: attribute_value
+                                      })
+            # Not all estimators validate parameters, so check fit()
+            modified_estimator.fit(X=distance_matrix, y=y_)


We usually pass X and y as positional arguments (some estimators have Y not y, for instance)

jnothman · 2018-07-25T01:49:38Z

Basically, we're waiting on an initial version of #11419 to be merged before releasing an RC (and perhaps we should already release a "beta", but we've rarely got much feedback from betas). We should be able to include this after RC in any case, so a week is fine.


jnothman

I get why KernelCenterer would fail that other test. Do you have any idea why NearestNeighbors would?

jnothman · 2018-08-02T00:12:44Z

sklearn/utils/estimator_checks.py

+
+    # Check if _pairwise attribute is present - will be used later
+    has_pairwise_tag = (False
+                        if getattr(estimator_orig, '_pairwise', None) is None


Why getattr and not hasattr?

I think it's preferred to use getattr in place of hasattr since the latter simply tests the former for an AttributeError[1]. There's no benefit in terms of speed and maybe a little benefit in terms of code clarity, but there's the danger of a misrepresented AttributeError. In the interest of future-proofing (and going by general community consensus[2][3]) I think it may be better to adopt getattr.

References:

https://docs.python.org/3/library/functions.html#hasattr

https://stackoverflow.com/a/24971134/9366691

https://hynek.me/articles/hasattr/

This can be simplified to,

has_pairwise_tag = hasattr(estimator_orig, '_pairwise')

stackoverflow.com/a/24971134/9366691
hynek.me/articles/hasattr

As far as I can tell the latter blog post is only relevant for Python 2, which we no longer support.

We also sometimes explicitly raise AttributeError in a property to indicate that the method is unavailable, which would result in hasattr(estimator, method) being False, as intended.
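To make the getattr versus hasattr trade-off concrete, here is a small sketch with three hypothetical classes (none of them are scikit-learn estimators). It assumes Python 3, where hasattr only swallows AttributeError.

```python
class WithProperty:
    """Hypothetical estimator whose property signals 'unavailable'."""

    @property
    def _pairwise(self):
        raise AttributeError("_pairwise is not available for this configuration")


class WithNone:
    """Hypothetical estimator that defines the attribute but sets it to None."""

    _pairwise = None


class Missing:
    """Hypothetical estimator without the attribute."""


def has_tag_getattr(est):
    # Form used in the diff: treats a None value the same as a missing attribute.
    return getattr(est, "_pairwise", None) is not None


def has_tag_hasattr(est):
    # Suggested simplification: only asks whether the lookup succeeds.
    return hasattr(est, "_pairwise")


for est in (WithProperty(), WithNone(), Missing()):
    print(type(est).__name__, has_tag_getattr(est), has_tag_hasattr(est))
```

The two forms agree for WithProperty and Missing. They differ only for WithNone, where the attribute exists but holds None.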

gokart23 · 2018-08-02T18:54:22Z

@jnothman I looked into this a little more, and it turns out NearestNeighbors failing the test is a limitation of the test's heuristic for an estimator's 'output'. NearestNeighbors does have a fit function but doesn't appear to validate metric while fitting. Moreover, it doesn't have a predict or a transform function (iiuc, it uses kneighbors[_graph] as an 'output'), which is what the test relies on in order to identify errors with non-square distance matrices.

I'm not sure how we could strengthen the test, without adding this (and KernelCenterer) as edge cases. Do you have any ideas?

jnothman · 2018-08-05T03:27:46Z

NearestNeighbors should probably validate metric in fit. I don't get what lacking predict has to do with this

gokart23 · 2018-08-07T16:36:34Z

NearestNeighbors should probably validate metric in fit

Agreed.

I don't get what lacking predict has to do with this

Some estimators which require the _pairwise attribute don't validate inputs in the fit method (though I suspect they are largely based off of NearestNeighbors), but do validate predict or transform. For instance:

<...>
>>> from sklearn.neighbors import KNeighborsClassifier
>>> # Calling fit doesn't raise an error, though it should according to the docs
>>> # http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.fit
>>> KNeighborsClassifier(metric='precomputed').fit(distance_matrix[:,:-1], y_)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='precomputed',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
>>> # But calling predict does - note that invoking with a square matrix here doesn't raise an error
>>> KNeighborsClassifier(metric='precomputed').fit(distance_matrix[:,:-1], y_).predict(distance_matrix[:,:-1])
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    KNeighborsClassifier(metric='precomputed').fit(distance_matrix[:,:-1], y_).predict(distance_matrix[:,:-1])
<....>
ValueError: Precomputed metric requires shape (n_queries, n_indexed). Got (150, 149) for 150 indexed.

I may be wrong in my understanding of the problem, but I think that the actual solution would be to have the estimator validate the input in its fit method. However, this test can still serve as a sufficiency check - in order to account for these in the test, I invoke predict/transform (if available) after fitting a non-square distance matrix, to check for errors.

This is relevant for the necessity check though - NearestNeighbors doesn't raise errors on fit, and can't be checked by calling predict/transform.

gokart23 · 2018-09-02T02:39:49Z

@jnothman ping!

jnothman · 2018-09-03T02:53:21Z

Yes, the input should be validated in fit, but I agree that does not need to change here.

jnothman

Otherwise LGTM.

jnothman · 2018-09-03T02:54:19Z

sklearn/utils/estimator_checks.py

+                    modified_estimator.predict(X)
+        except ValueError:
+            # Check if estimator defines _pairwise attribute
+            assert_true(has_pairwise_tag,


With the adoption of pytest, we are phasing out use of test helpers assert_equal, assert_true, etc. Please use bare assert statements, e.g. assert x == y, assert not x, etc.

jnothman · 2018-09-03T02:55:33Z

sklearn/utils/estimator_checks.py

@@ -1191,6 +1192,58 @@ def check_estimators_pickle(name, estimator_orig):
        assert_allclose_dense_sparse(result[method], unpickled_result)


+def check_pairwise_estimator_tag(name, estimator_orig):
+    attributes_to_check = [('metric', 'precomputed'),


I think the code would be clearer if you just put 'precomputed' inline below

I think it would be unfortunate if we found ourselves introducing a different placeholder!

Good idea :) resolved!

jnothman · 2018-09-03T02:56:37Z

sklearn/utils/estimator_checks.py

+            # Construct new object of estimator with desired attribute value
+            modified_estimator = clone(estimator_orig).set_params(
+                                                **{attribute: attribute_value})
+            # Not all estimators validate parameters, so check fit()


Most estimators do not. Rather comment that "Estimators may validate parameters in fit if not in set_params"

jnothman · 2018-09-03T02:57:26Z

sklearn/utils/estimator_checks.py

+
+    for attribute, attribute_value in attributes_to_check:
+        # Check to see if attribute value is supported by estimator
+        if attribute not in estimator_orig.get_params():


use get_params(deep=False) to keep it fast

jnothman · 2018-09-03T02:58:26Z

sklearn/utils/estimator_checks.py

+                                                **{attribute: attribute_value})
+            # Not all estimators validate parameters, so check fit()
+            modified_estimator.fit(distance_matrix, y_)
+        except (TypeError, ValueError, KeyError):


I don't see the relevance of KeyError.

The Nystroem approximator looks up the kernel parameter value in sklearn/metrics/pairwise.py:KERNEL_PARAMS, which raises a KeyError.

self = Nystroem(coef0=None, degree=None, gamma=None, kernel='precomputed', kernel_params=None, n_components=100, random_state=None)

    def _get_kernel_params(self):
        params = self.kernel_params
        if params is None:
            params = {}
        if not callable(self.kernel):
>           for param in (KERNEL_PARAMS[self.kernel]):
E           KeyError: 'precomputed'

So Nystroem must not currently support precomputed... We could:

add support there

Just catch Exception here

I’m not sure if I’m understanding this correctly, but the exception handling includes KeyError in order to handle Nystroem (since Nystroem raises a KeyError instead of a TypeError/ValueError)

If you just catch Exception here, rather than ValueError, TypeError, KeyError I think that would be better.

Ah, ok - sorry about that. Resolved!
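The exception-type discussion above can be illustrated with a minimal sketch. TOY_KERNEL_PARAMS, toy_fit and supports are hypothetical names standing in for the dictionary lookup that Nystroem performs; this is not the library code.

```python
# Hypothetical stand-in for a kernel-parameter lookup table.
TOY_KERNEL_PARAMS = {"linear": (), "rbf": ("gamma",)}


def toy_fit(kernel):
    """Mimic an estimator that looks up its kernel in a dict during fit."""
    return TOY_KERNEL_PARAMS[kernel]  # raises KeyError for unknown kernels


def supports(kernel, caught):
    """Return False if fitting raises one of the `caught` exception types."""
    try:
        toy_fit(kernel)
    except caught:
        return False
    return True


# A broad catch treats the unsupported value as "not supported".
print(supports("precomputed", Exception))

# A narrow catch lets the KeyError propagate to the caller.
try:
    supports("precomputed", (TypeError, ValueError))
except KeyError as exc:
    print("KeyError escaped the narrow except clause:", exc)
```

Catching Exception keeps the filtering step independent of which error type each estimator happens to raise, at the cost of also hiding unrelated failures.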

gokart23 · 2018-09-17T14:59:56Z

@jnothman do you think there are further changes needed? If not, do you think this can be closed now?

sklearn/utils/estimator_checks.py

jnothman

Please add entries to the change log at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:. The entries should note the changes to these specific estimators, but there should also be an entry under "changes to estimator checks" (you might need to copy this heading from v0.20.rst).

amueller · 2018-10-28T20:02:04Z

sklearn/utils/estimator_checks.py

+            continue
+
+        # Also check to see if non-square distance matrix raises an error
+        try:


This try seems odd to me. We want to check all of the methods, right? Right now only the first one that exists is checked. And shouldn't we ensure that an error is raised? Why is else a skip?
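One possible reading of this suggestion, sketched with a hypothetical ToyClusterer and a hypothetical helper methods_rejecting_non_square: call every available fitting method instead of only the first one found, and require each of them to raise.

```python
class ToyClusterer:
    """Hypothetical estimator; every method validates square input."""

    def _check(self, X):
        if any(len(row) != len(X) for row in X):
            raise ValueError("expected a square precomputed matrix")

    def fit(self, X, y=None):
        self._check(X)
        return self

    def fit_predict(self, X, y=None):
        self._check(X)
        return [0] * len(X)


def methods_rejecting_non_square(estimator, non_square, y=None):
    """Call every available fitting method and record which ones raise."""
    results = {}
    for name in ("fit", "fit_predict", "fit_transform"):
        if not hasattr(estimator, name):
            continue  # method not offered by this estimator
        try:
            getattr(estimator, name)(non_square, y)
        except ValueError:
            results[name] = True
        else:
            results[name] = False
    return results


non_square = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.5]]
outcome = methods_rejecting_non_square(ToyClusterer(), non_square)
print(outcome)
assert all(outcome.values()), "every available method should raise"
```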

rth · 2019-07-18T11:45:59Z

sklearn/utils/estimator_checks.py

+
+    # Check if _pairwise attribute is present - will be used later
+    has_pairwise_tag = (False
+                        if getattr(estimator_orig, '_pairwise', None) is None


This can be simplified to,

has_pairwise_tag = hasattr(estimator_orig, '_pairwise')

stackoverflow.com/a/24971134/9366691
hynek.me/articles/hasattr

As far as I can tell the latter blog post is only relevant for Python 2, which we no longer support.

We also sometimes explicitly raise AttributeError in a property to indicate that the method is unavailable, which would result in hasattr(estimator, method) being False, as intended.

rth · 2019-07-18T11:48:46Z

sklearn/utils/estimator_checks.py

+            modified_estimator.fit(distance_matrix, y_)
+        except Exception:
+            # Estimator does not support given attribute value
+            continue


In the current state, this whole code block might as well not be there as the try: except: block will silently ignore exceptions.

Instead we should check that the estimator does support the given attribute value and run that code only when it's true..

This current block is just filtering out estimators that do not support 'precomputed' value for attribute. I have no Python skill to think of doing this some other way. @rth, would you have something that I could study to apply in this specific case?

I was trying with assert_raises although not all estimators raise the same type of error when 'precomputed' is given as value and they do not accept that.

rth · 2019-07-18T11:50:01Z

sklearn/utils/estimator_checks.py

+            non_square_distance = distance_matrix[:, :-1]
+            if getattr(modified_estimator, 'fit_predict', None) is not None:
+                modified_estimator.fit_predict(non_square_distance, y_)
+            elif (getattr(modified_estimator, 'fit_transform', None)


Better to use hasattr(modified_estimator, 'fit_transform') I don't really see the point of getattr here.

rth · 2019-07-18T12:00:04Z

sklearn/utils/estimator_checks.py

+            elif getattr(modified_estimator, 'fit', None) is not None:
+                modified_estimator.fit(non_square_distance, y_)
+                if getattr(modified_estimator, 'predict', None) is not None:
+                    modified_estimator.predict(X)


modified_estimator everywhere is cumbersome. Better to use estimator2 or something similar.

jnothman · 2019-07-25T13:25:57Z

@gokart23 are you able to finish this off?

ricoms · 2019-09-06T12:34:29Z

Hi. I'm at the EuroSciPy 2019 sprint and I hope I can help with this pull request, as it looks stale.

I will soon make a new pull request working on the conflicts raised by @rth, @amueller and @jnothman

rth · 2019-09-06T12:39:23Z

Thanks @ricoms !

ricoms

This code is passing all tests; the suggested adjustments are related to simplifications and cleaner code. Although the original requester is not responding, I suggest accepting this pull request and creating a new one afterwards to adjust the code and comments.

ricoms · 2019-09-06T16:01:34Z

sklearn/utils/estimator_checks.py

+            modified_estimator.fit(distance_matrix, y_)
+        except Exception:
+            # Estimator does not support given attribute value
+            continue


This current block is just filtering out estimators that do not support 'precomputed' value for attribute. I have no Python skill to think of doing this some other way. @rth, would you have something that I could study to apply in this specific case?

I was trying with assert_raises although not all estimators raise the same type of error when 'precomputed' is given as value and they do not accept that.

jeremiedbb · 2022-03-12T21:46:57Z

_pairwise has been deprecated in favor of the pairwise estimator tag (#18143). This is not relevant anymore.

Fixes DBSCAN is missing _pairwise property

d8abde3

gokart23 added 2 commits July 10, 2018 01:26

Added test for verifying pairwise tag; added tag to t-SNE

53348bd

Fix inevitable flake8 errors

d207a6f

jnothman reviewed Jul 19, 2018

Address review comments

f290ea9

jnothman reviewed Aug 2, 2018

jnothman reviewed Sep 3, 2018

gokart23 added 2 commits September 11, 2018 01:11

Minor fixes; address review comments

d8183eb

Attribute support check proceeds only if no Exception raised

90f2d6b

gokart23 force-pushed the feature/dbscan-pairwise branch from 73aabee to 90f2d6b Compare September 13, 2018 22:26

jnothman reviewed Sep 17, 2018

Fixed reference to 'precomputed' attribute

6eb3db3

jnothman approved these changes Oct 28, 2018

amueller reviewed Oct 28, 2018

eyes-robson mentioned this pull request May 22, 2019

Nested Cross Validation for precomputed KNN #13920

Closed

rth reviewed Jul 18, 2019

rth added the Needs work label Jul 25, 2019

ricoms reviewed Sep 6, 2019

ricoms mentioned this pull request Sep 6, 2019

Feature/dbscan pairwise #14919

Closed

github-actions bot added module:cluster module:manifold module:utils labels Mar 2, 2020

cmarmo added help wanted Stalled labels Jul 17, 2020

Base automatically changed from master to main January 22, 2021 10:50

jeremiedbb closed this Mar 12, 2022
