-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG] Fix DBSCAN is missing _pairwise property #11453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This should probably have some kind of test, but I'm not sure what... (The code is being run by tests at least...) Otherwise LGTM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not double checked everything, but I quite like this!
Perhaps we should test the check in sklearn/utils/tests/test_estimator_checks.py
sklearn/utils/estimator_checks.py
Outdated
continue | ||
try: | ||
# Construct new object of estimator with desired attribute value | ||
modified_estimator = estimator_orig.__class__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we would just use clone
and set_params
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you able to fix these issues, @gokart23?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jnothman sorry for the delay. Is there an expected ETA? I'm a little busy rn, but should be able to address the review this week
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved
sklearn/utils/estimator_checks.py
Outdated
|
||
for attribute, attribute_value in attributes_to_check: | ||
# Check to see if attribute value is supported by estimator | ||
if getattr(estimator_orig, attribute, None) is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more conventionally, we would check estimator_orig.get_params()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved
sklearn/utils/estimator_checks.py
Outdated
modified_estimator.predict(X=X) | ||
except ValueError: | ||
# Check if estimator defines _pairwise attribute | ||
assert_not_equal(getattr(modified_estimator, '_pairwise', None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not check as far as "_pairwise should be True iff 'precomputed'"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jnothman It's possible to check both ways, but I'm not sure if this test is a necesary condition for _pairwise
(though it is sufficient). I implemented the necessary check (you can find it here: gokart23@04209a4) but there turned out to be a couple of failing estimators:
sklearn.preprocessing.data.KernelCenterer
:_pairwise
attribute added in commit 2242b1bsklearn.neighbors.unsupervised.NearestNeighbors
:_pairwise
attribute added in commit f4fc275
I'm not sure if these are valid failures, or how to go about fixing these failures, which is why I haven't included the commit in the PR as yet. What do you think?
sklearn/utils/estimator_checks.py
Outdated
attribute: attribute_value | ||
}) | ||
# Not all estimators validate parameters, so check fit() | ||
modified_estimator.fit(X=distance_matrix, y=y_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually pass X and y as positional attributes (some estimators have Y
not y
, for instance)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved
Basically, we're waiting on an initial version of #11419 to be merged
before release an RC (and perhaps we should already release a "beta" but
we've rarely got much feedback from betas). We should be able to include
this after RC in any case, so a week is fine.
…On 25 July 2018 at 01:24, Karthik Duddu ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In sklearn/utils/estimator_checks.py
<#11453 (comment)>
:
> + attributes_to_check = [('metric', 'precomputed'),
+ ('affinity', 'precomputed'),
+ ('kernel', 'precomputed')]
+
+ # Using iris as sample data
+ iris = load_iris()
+ X, y_ = iris.data, iris.target
+ distance_matrix = pairwise_distances(X)
+
+ for attribute, attribute_value in attributes_to_check:
+ # Check to see if attribute value is supported by estimator
+ if getattr(estimator_orig, attribute, None) is not None:
+ continue
+ try:
+ # Construct new object of estimator with desired attribute value
+ modified_estimator = estimator_orig.__class__(
@jnothman <https://github.com/jnothman> sorry for the delay. Is there an
expected ETA? I'm a little busy rn, but should be able to address the
review this week
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#11453 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6zns8yEcOWXx2Cf0Kbf0L2sqeSiwks5uJzwUgaJpZM4VGV_U>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get why KernelCenterer would fail that other test. Do you have any idea why NearestNeighbors would?
|
||
# Check if _pairwise attribute is present - will be used later | ||
has_pairwise_tag = (False | ||
if getattr(estimator_orig, '_pairwise', None) is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why getattr and not hasattr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's preferred to use getattr
in place of hasattr
since the latter simply tests the former for an AttributeError
[1]. There's no benefit in terms of speed and maybe a little benefit in terms of code clarity, but there's the danger of a misrepresented AttributeError
. In the interest of future-proofing (and going by general community consensus[2][3]) I think it may be better to adopt getattr
.
References:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified to,
has_pairwise_tag = hasattr(estimator_orig, '_pairwise')
stackoverflow.com/a/24971134/9366691
hynek.me/articles/hasattr
As far as I can tell the later blog post is only relevant for Python 2 that we no longer support.
We also sometimes explicitly raise AttributeEror
in a property to indicate that the method is unavailable, which would result in hasattr(estimator, method) is False
as intended.
@jnothman I looked into this a little more, and it turns out I'm not sure how we could strengthen the test, without adding this (and |
NearestNeighbors should probably validate metric in fit. I don't get what
lacking predict has to do with this
|
Agreed.
Some estimators which require the
I may be wrong in my understanding of the problem, but I think that the actual solution would be to have the estimator validate the input in its This is relevant for the necessity check though - |
@jnothman ping! |
Yes, the input should be validated in fit, but I agree that does not need to change here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM.
sklearn/utils/estimator_checks.py
Outdated
modified_estimator.predict(X) | ||
except ValueError: | ||
# Check if estimator defines _pairwise attribute | ||
assert_true(has_pairwise_tag, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the adoption of pytest, we are phasing out use of test helpers assert_equal
, assert_true
, etc. Please use bare assert
statements, e.g. assert x == y
, assert not x
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved
sklearn/utils/estimator_checks.py
Outdated
@@ -1191,6 +1192,58 @@ def check_estimators_pickle(name, estimator_orig): | |||
assert_allclose_dense_sparse(result[method], unpickled_result) | |||
|
|||
|
|||
def check_pairwise_estimator_tag(name, estimator_orig): | |||
attributes_to_check = [('metric', 'precomputed'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code would be clearer if you just put 'precomputed' inline below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be unfortunate if we found ourselves introducing a different placeholder!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea :) resolved!
sklearn/utils/estimator_checks.py
Outdated
# Construct new object of estimator with desired attribute value | ||
modified_estimator = clone(estimator_orig).set_params( | ||
**{attribute: attribute_value}) | ||
# Not all estimators validate parameters, so check fit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most estimators do not. Rather comment that "Estimators may validate parameters in fit if not in set_params"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved
sklearn/utils/estimator_checks.py
Outdated
|
||
for attribute, attribute_value in attributes_to_check: | ||
# Check to see if attribute value is supported by estimator | ||
if attribute not in estimator_orig.get_params(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use get_params(deep=False) to keep it fast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved
sklearn/utils/estimator_checks.py
Outdated
**{attribute: attribute_value}) | ||
# Not all estimators validate parameters, so check fit() | ||
modified_estimator.fit(distance_matrix, y_) | ||
except (TypeError, ValueError, KeyError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the relevance of KeyError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Nystroem approximator looks up the kernel
parameter value insklearn/metrics/pairwise.py:KERNEL_PARAMS
, which raises a KeyError
.
self = Nystroem(coef0=None, degree=None, gamma=None, kernel='precomputed',
kernel_params=None, n_components=100, random_state=None)
def _get_kernel_params(self):
params = self.kernel_params
if params is None:
params = {}
if not callable(self.kernel):
> for param in (KERNEL_PARAMS[self.kernel]):
E KeyError: 'precomputed'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So Nystroem must not currently support precomputed... We could:
- add support there
- Just catch Exception here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not sure if I’m understanding this correctly, but the exception handling includes KeyError in order to handle Nystroem (since Nystroem raises a KeyError instead of a TypeError/ValueError)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you just catch Exception here, rather than ValueError, TypeError, KeyError
I think that would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok - sorry about that. Resolved!
73aabee
to
90f2d6b
Compare
@jnothman do you think there's further changes needed? If not, do you think this can be closed now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add entries to the change log at doc/whats_new/v0.21.rst
. Like the other entries there, please reference this pull request with :issue:
and credit yourself (and other contributors if applicable) with :user:
. The entries should note the changes to these specific estimators, but there should also be an entry under "changes to estimator checks" (you might need to copy this heading from v0.20.rst).
continue | ||
|
||
# Also check to see if non-square distance matrix raises an error | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This try seems odd to me. We want to check all of the methods, right? Right now only the first one that exists is checked. And shouldn't we ensure that an error is raised? Why is else a skip?
|
||
# Check if _pairwise attribute is present - will be used later | ||
has_pairwise_tag = (False | ||
if getattr(estimator_orig, '_pairwise', None) is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified to,
has_pairwise_tag = hasattr(estimator_orig, '_pairwise')
stackoverflow.com/a/24971134/9366691
hynek.me/articles/hasattr
As far as I can tell the later blog post is only relevant for Python 2 that we no longer support.
We also sometimes explicitly raise AttributeEror
in a property to indicate that the method is unavailable, which would result in hasattr(estimator, method) is False
as intended.
modified_estimator.fit(distance_matrix, y_) | ||
except Exception: | ||
# Estimator does not support given attribute value | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current state, this whole code block might as well not be there as the try: except:
block will silently ignore exceptions.
Instead we should check that the estimator does support the given attribute value and run that code only when it's true..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This current block is just filtering out estimators that do not support 'precomputed' value for attribute
. I have no Python skill to think of doing this some other way. @rth, would you have something that I could study to apply in this specific case?
I was trying with assert_raises
although not all estimators raise the same type of error when 'precomputed' is given as value and they do not accept that.
non_square_distance = distance_matrix[:, :-1] | ||
if getattr(modified_estimator, 'fit_predict', None) is not None: | ||
modified_estimator.fit_predict(non_square_distance, y_) | ||
elif (getattr(modified_estimator, 'fit_transform', None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use hasattr(modified_estimator, 'fit_transform')
I don't really see the point of getattr
here.
elif getattr(modified_estimator, 'fit', None) is not None: | ||
modified_estimator.fit(non_square_distance, y_) | ||
if getattr(modified_estimator, 'predict', None) is not None: | ||
modified_estimator.predict(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modified_estimator
everywhere is cumbersome. Better to use estimator2
or something similar.
@gokart23 are you able to finish this off? |
Thanks @ricoms ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is passing all tests, the suggested adjustments are related to simplifications and cleaner code. Although the original requester is not responding, I suggest accepting this pull request and create a new one after it adjusting code and comments.
modified_estimator.fit(distance_matrix, y_) | ||
except Exception: | ||
# Estimator does not support given attribute value | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This current block is just filtering out estimators that do not support 'precomputed' value for attribute
. I have no Python skill to think of doing this some other way. @rth, would you have something that I could study to apply in this specific case?
I was trying with assert_raises
although not all estimators raise the same type of error when 'precomputed' is given as value and they do not accept that.
|
Reference Issues/PRs
Fixes #11432
What does this implement/fix? Explain your changes.
Adds the
_pairwise
attribute tocluster/dbscan_.py
, returningTrue
if aprecomputed
distance metric is indicated.Also adds a common heuristic test for checking if an estimator should have the
_pairwise
property, but isn't. The test is based on the idea that if an estimator had ametric
,affinity
orkernel
parameter which supports 'precomputed', and if the estimator was then able to fit on pairwise data, but raised an error on non-square training data, a_pairwise
attribute should exist.