Skip to content

[MRG] Fix DBSCAN is missing _pairwise property #11453

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

gokart23
Copy link

@gokart23 gokart23 commented Jul 7, 2018

Reference Issues/PRs

Fixes #11432

What does this implement/fix? Explain your changes.

Adds the _pairwise attribute to cluster/dbscan_.py, returning True if a precomputed distance metric is indicated.

Also adds a common heuristic test for checking if an estimator should have the _pairwise property, but isn't. The test is based on the idea that if an estimator had a metric, affinity or kernel parameter which supports 'precomputed', and if the estimator was then able to fit on pairwise data, but raised an error on non-square training data, a _pairwise attribute should exist.

@jnothman
Copy link
Member

jnothman commented Jul 7, 2018

This should probably have some kind of test, but I'm not sure what...

(The code is being run by tests at least...)

Otherwise LGTM.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not double checked everything, but I quite like this!

Perhaps we should test the check in sklearn/utils/tests/test_estimator_checks.py

continue
try:
# Construct new object of estimator with desired attribute value
modified_estimator = estimator_orig.__class__(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we would just use clone and set_params.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you able to fix these issues, @gokart23?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jnothman sorry for the delay. Is there an expected ETA? I'm a little busy rn, but should be able to address the review this week

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved


for attribute, attribute_value in attributes_to_check:
# Check to see if attribute value is supported by estimator
if getattr(estimator_orig, attribute, None) is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more conventionally, we would check estimator_orig.get_params().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved

modified_estimator.predict(X=X)
except ValueError:
# Check if estimator defines _pairwise attribute
assert_not_equal(getattr(modified_estimator, '_pairwise', None),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not check as far as "_pairwise should be True iff 'precomputed'"?

Copy link
Author

@gokart23 gokart23 Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jnothman It's possible to check both ways, but I'm not sure if this test is a necesary condition for _pairwise (though it is sufficient). I implemented the necessary check (you can find it here: gokart23@04209a4) but there turned out to be a couple of failing estimators:

  1. sklearn.preprocessing.data.KernelCenterer : _pairwise attribute added in commit 2242b1b
  2. sklearn.neighbors.unsupervised.NearestNeighbors: _pairwise attribute added in commit f4fc275

I'm not sure if these are valid failures, or how to go about fixing these failures, which is why I haven't included the commit in the PR as yet. What do you think?

attribute: attribute_value
})
# Not all estimators validate parameters, so check fit()
modified_estimator.fit(X=distance_matrix, y=y_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually pass X and y as positional attributes (some estimators have Y not y, for instance)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved

@jnothman
Copy link
Member

jnothman commented Jul 25, 2018 via email

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get why KernelCenterer would fail that other test. Do you have any idea why NearestNeighbors would?


# Check if _pairwise attribute is present - will be used later
has_pairwise_tag = (False
if getattr(estimator_orig, '_pairwise', None) is None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why getattr and not hasattr?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's preferred to use getattr in place of hasattr since the latter simply tests the former for an AttributeError[1]. There's no benefit in terms of speed and maybe a little benefit in terms of code clarity, but there's the danger of a misrepresented AttributeError. In the interest of future-proofing (and going by general community consensus[2][3]) I think it may be better to adopt getattr.

References:

  1. https://docs.python.org/3/library/functions.html#hasattr
  2. https://stackoverflow.com/a/24971134/9366691
  3. https://hynek.me/articles/hasattr/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified to,

has_pairwise_tag = hasattr(estimator_orig, '_pairwise')

stackoverflow.com/a/24971134/9366691
hynek.me/articles/hasattr

As far as I can tell the later blog post is only relevant for Python 2 that we no longer support.

We also sometimes explicitly raise AttributeEror in a property to indicate that the method is unavailable, which would result in hasattr(estimator, method) is False as intended.

@gokart23
Copy link
Author

gokart23 commented Aug 2, 2018

@jnothman I looked into this a little more, and it turns out NearestNeighbor failing the test is a limitation of the test's heuristic for an estimator's 'output'. NearestNeighbor does have a fit function but doesn't appear to validate metric while fitting. Moreover, it doesn't have a predict or a transform function (iiuc, it uses kneighbors[_graph] as an 'output'), which is what the test relies on in order to identify errors with non-square distance matrices.

I'm not sure how we could strengthen the test, without adding this (and KernelCenterer) as edge cases. Do you have any ideas?

@jnothman
Copy link
Member

jnothman commented Aug 5, 2018 via email

@gokart23
Copy link
Author

gokart23 commented Aug 7, 2018

NearestNeighbors should probably validate metric in fit

Agreed.

I don't get what lacking predict has to do with this

Some estimators which require the _pairwise attribute don't validate inputs in the fit method (though I suspect they are largely based off of NearestNeighbors), but do validate predict or transform. For instance:

<...>
>>> from sklearn.neighbors import KNeighborsClassifier
>>> # Calling fit doesn't raise an error, though it should according to the docs
>>> # http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.fit
>>> KNeighborsClassifier(metric='precomputed').fit(distance_matrix[:,:-1], y_)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='precomputed',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
>> # But calling predict does - note that invoking with a square matrix here doesn't raise an error
>>> KNeighborsClassifier(metric='precomputed').fit(distance_matrix[:,:-1], y_).predict(distance_matrix[:,:-1])
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    KNeighborsClassifier(metric='precomputed').fit(distance_matrix[:,:-1], y_).predict(distance_matrix[:,:-1])
<....>
ValueError: Precomputed metric requires shape (n_queries, n_indexed). Got (150, 149) for 150 indexed.

I may be wrong in my understanding of the problem, but I think that the actual solution would be to have the estimator validate the input in its fit method. However, this test can still serve as a sufficiency check - in order to account for these in the test, I invoke predict/transform (if available) after fitting a non-square distance matrix, to check for errors.

This is relevant for the necessity check though - NearestNeighbors doesn't raise errors on fit, and can't be checked by calling predict/transform.

@gokart23
Copy link
Author

gokart23 commented Sep 2, 2018

@jnothman ping!

@jnothman
Copy link
Member

jnothman commented Sep 3, 2018

Yes, the input should be validated in fit, but I agree that does not need to change here.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM.

modified_estimator.predict(X)
except ValueError:
# Check if estimator defines _pairwise attribute
assert_true(has_pairwise_tag,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the adoption of pytest, we are phasing out use of test helpers assert_equal, assert_true, etc. Please use bare assert statements, e.g. assert x == y, assert not x, etc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved

@@ -1191,6 +1192,58 @@ def check_estimators_pickle(name, estimator_orig):
assert_allclose_dense_sparse(result[method], unpickled_result)


def check_pairwise_estimator_tag(name, estimator_orig):
attributes_to_check = [('metric', 'precomputed'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code would be clearer if you just put 'precomputed' inline below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be unfortunate if we found ourselves introducing a different placeholder!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea :) resolved!

# Construct new object of estimator with desired attribute value
modified_estimator = clone(estimator_orig).set_params(
**{attribute: attribute_value})
# Not all estimators validate parameters, so check fit()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most estimators do not. Rather comment that "Estimators may validate parameters in fit if not in set_params"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved


for attribute, attribute_value in attributes_to_check:
# Check to see if attribute value is supported by estimator
if attribute not in estimator_orig.get_params():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use get_params(deep=False) to keep it fast

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved

**{attribute: attribute_value})
# Not all estimators validate parameters, so check fit()
modified_estimator.fit(distance_matrix, y_)
except (TypeError, ValueError, KeyError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the relevance of KeyError.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Nystroem approximator looks up the kernel parameter value insklearn/metrics/pairwise.py:KERNEL_PARAMS, which raises a KeyError.

self = Nystroem(coef0=None, degree=None, gamma=None, kernel='precomputed',                                                             
     kernel_params=None, n_components=100, random_state=None)                                                                          
                                                                                                                                       
    def _get_kernel_params(self):                                                                                                     
        params = self.kernel_params                                                                                                    
        if params is None:                                                 
            params = {}                                                                                                                
        if not callable(self.kernel):                                             
>           for param in (KERNEL_PARAMS[self.kernel]):                          
E           KeyError: 'precomputed'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So Nystroem must not currently support precomputed... We could:

  • add support there
  • Just catch Exception here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure if I’m understanding this correctly, but the exception handling includes KeyError in order to handle Nystroem (since Nystroem raises a KeyError instead of a TypeError/ValueError)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you just catch Exception here, rather than ValueError, TypeError, KeyError I think that would be better.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok - sorry about that. Resolved!

@gokart23 gokart23 force-pushed the feature/dbscan-pairwise branch from 73aabee to 90f2d6b Compare September 13, 2018 22:26
@gokart23
Copy link
Author

gokart23 commented Sep 17, 2018

@jnothman do you think there's further changes needed? If not, do you think this can be closed now?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add entries to the change log at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:. The entries should note the changes to these specific estimators, but there should also be an entry under "changes to estimator checks" (you might need to copy this heading from v0.20.rst).

continue

# Also check to see if non-square distance matrix raises an error
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This try seems odd to me. We want to check all of the methods, right? Right now only the first one that exists is checked. And shouldn't we ensure that an error is raised? Why is else a skip?


# Check if _pairwise attribute is present - will be used later
has_pairwise_tag = (False
if getattr(estimator_orig, '_pairwise', None) is None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified to,

has_pairwise_tag = hasattr(estimator_orig, '_pairwise')

stackoverflow.com/a/24971134/9366691
hynek.me/articles/hasattr

As far as I can tell the later blog post is only relevant for Python 2 that we no longer support.

We also sometimes explicitly raise AttributeEror in a property to indicate that the method is unavailable, which would result in hasattr(estimator, method) is False as intended.

modified_estimator.fit(distance_matrix, y_)
except Exception:
# Estimator does not support given attribute value
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current state, this whole code block might as well not be there as the try: except: block will silently ignore exceptions.

Instead we should check that the estimator does support the given attribute value and run that code only when it's true..

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This current block is just filtering out estimators that do not support 'precomputed' value for attribute. I have no Python skill to think of doing this some other way. @rth, would you have something that I could study to apply in this specific case?

I was trying with assert_raises although not all estimators raise the same type of error when 'precomputed' is given as value and they do not accept that.

non_square_distance = distance_matrix[:, :-1]
if getattr(modified_estimator, 'fit_predict', None) is not None:
modified_estimator.fit_predict(non_square_distance, y_)
elif (getattr(modified_estimator, 'fit_transform', None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use hasattr(modified_estimator, 'fit_transform') I don't really see the point of getattr here.

elif getattr(modified_estimator, 'fit', None) is not None:
modified_estimator.fit(non_square_distance, y_)
if getattr(modified_estimator, 'predict', None) is not None:
modified_estimator.predict(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified_estimator everywhere is cumbersome. Better to use estimator2 or something similar.

@jnothman
Copy link
Member

@gokart23 are you able to finish this off?

@rth rth added the Needs work label Jul 25, 2019
@ricoms
Copy link

ricoms commented Sep 6, 2019

Hi. I'm at EurosCipy 2019 sprint and I hope I can help with this pull_request, as it looks stale.

I will soon make a new pull request working on the conflicts raised by @rth, @amueller and @jnothman

@rth
Copy link
Member

rth commented Sep 6, 2019

Thanks @ricoms !

Copy link

@ricoms ricoms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is passing all tests, the suggested adjustments are related to simplifications and cleaner code. Although the original requester is not responding, I suggest accepting this pull request and create a new one after it adjusting code and comments.

modified_estimator.fit(distance_matrix, y_)
except Exception:
# Estimator does not support given attribute value
continue
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This current block is just filtering out estimators that do not support 'precomputed' value for attribute. I have no Python skill to think of doing this some other way. @rth, would you have something that I could study to apply in this specific case?

I was trying with assert_raises although not all estimators raise the same type of error when 'precomputed' is given as value and they do not accept that.

@ricoms ricoms mentioned this pull request Sep 6, 2019
@jeremiedbb
Copy link
Member

_pairwise has been deprecated in favor of the pairwise estimator tag (#18143). This is not relevant anymore.

@jeremiedbb jeremiedbb closed this Mar 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DBSCAN is missing _pairwise property
7 participants