MAINT Plug PairwiseDistancesArgKmin as a back-end #22214
Conversation
Here is a quick first pass
sklearn/metrics/pairwise.py (Outdated)

        )
        indices = indices.flatten()
    else:
        # TODO: once PairwiseDistancesArgKmin supports sparse input matrices and 32 bit,
Are there options to get 32-bit to work besides `Tempita`?
Apart from duplicating the code based on dtype in Cython, I can hardly see any other option. C++ template classes would come in handy here.
I guess we can start with Tempita first and see how much more complex it makes the code. What do the others think?
If it works with fused types in Cython, that would be great. If not, C++ and Tempita are both fine for me.
Fused types can't be used for attributes unfortunately (hence the need to use Tempita, similarly to what exists for WeightVectors).
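For illustration only (not code from this PR), here is a tiny Python sketch of what the Tempita step does at build time: it expands one template into a dtype-specialised class per floating-point type, which is needed precisely because Cython fused types cannot be used for cdef class attributes. The class name ArgKmin and its attribute are placeholders.

CYTHON_TEMPLATE = """
cdef class ArgKmin{suffix}:
    # The attribute needs a concrete C type (fused types are not allowed
    # here), hence one generated class per dtype.
    cdef const {c_type}[:, ::1] X
"""

# Expand the template once per supported dtype, as Tempita does at build time.
for suffix, c_type in [("64", "double"), ("32", "float")]:
    print(CYTHON_TEMPLATE.format(suffix=suffix, c_type=c_type))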
sklearn/neighbors/_base.py (Outdated)

             raise ValueError("p must be greater or equal to one for minkowski metric")

     def _fit(self, X, y=None):
         if self._get_tags()["requires_y"]:
             if not isinstance(X, (KDTree, BallTree, NeighborsBase)):
-                X, y = self._validate_data(X, y, accept_sparse="csr", multi_output=True)
+                X, y = self._validate_data(
+                    X, y, accept_sparse="csr", multi_output=True, order="C"
Note to other reviewers: this is required by the new Cython implementation. It triggers a potential memory copy in the rare cases where X is not already C-contiguous, even in cases where we cannot use the new Cython code (e.g. for float32 data, which is not supported yet).
But since the plan is to also write the Cython code for 32-bit floats in the future, let's keep the input data validation logic simpler by always enforcing contiguity from the start.
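For context (a sketch, not part of the PR), check_array is used below as a stand-in for _validate_data to show that requesting C order only copies inputs that are not already C-contiguous:

import numpy as np
from sklearn.utils import check_array

X_c = np.random.rand(5, 3)        # already C-contiguous
X_f = np.asfortranarray(X_c)      # Fortran-ordered copy of the same data

X_c_checked = check_array(X_c, order="C")   # no copy needed
X_f_checked = check_array(X_f, order="C")   # copied into a C-contiguous array

print(np.shares_memory(X_c, X_c_checked))   # True: passed through
print(np.shares_memory(X_f, X_f_checked))   # False: a C-ordered copy was made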
#22241 has been opened to fix a test failure.
Force-pushed from dce919f to 3448b01.
Looking over the tests, a bunch of the changes look like refactors to use pytest's parametrize. I think they can be broken out into their own PRs and merged directly into main.
The implementation looks good to me. The major change is having to set order="C" everywhere, which I think is okay.
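As an illustration of the kind of refactor mentioned above (a hypothetical example, not a test from this PR), a loop over metrics inside a single test body becomes a pytest parametrization:

import numpy as np
import pytest
from sklearn.neighbors import NearestNeighbors

# Each metric becomes its own test case instead of an iteration
# inside one test function.
@pytest.mark.parametrize("metric", ["euclidean", "manhattan", "chebyshev"])
def test_kneighbors_shapes(metric):
    X = np.random.RandomState(0).rand(20, 3)
    nn = NearestNeighbors(n_neighbors=3, metric=metric).fit(X)
    distances, indices = nn.kneighbors(X)
    assert distances.shape == indices.shape == (20, 3)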
sklearn/multioutput.py (Outdated)

@@ -399,7 +399,7 @@ class MultiOutputClassifier(ClassifierMixin, _MultiOutputEstimator):
     >>> X, y = make_multilabel_classification(n_classes=3, random_state=0)
     >>> clf = MultiOutputClassifier(KNeighborsClassifier()).fit(X, y)
     >>> clf.predict(X[-2:])
-    array([[1, 1, 0], [1, 1, 1]])
+    array([[1, 1, 1], [1, 1, 1]])
Does this mean the new implementation gives different results than the previous one?
test_neighbors_metrics normally tests the consistency of the three algorithms' results. Yet, in the specific case of this doc-test, the result is different.
I think the best approach is to test the new 'brute' back-end against the current implementation. I will come up with a test for this.
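A minimal sketch of such a consistency test (assumed name and data, not the test that was eventually added), comparing pairwise_distances_argmin against an exhaustive computation over the full distance matrix:

import numpy as np
from sklearn.metrics import pairwise_distances, pairwise_distances_argmin

def test_pairwise_distances_argmin_matches_exhaustive():
    # The fast back-end must return the same indices as a plain argmin
    # over the full distance matrix.
    rng = np.random.RandomState(0)
    X = rng.rand(50, 10)
    Y = rng.rand(30, 10)

    expected = pairwise_distances(X, Y, metric="euclidean").argmin(axis=1)
    result = pairwise_distances_argmin(X, Y, metric="euclidean")

    assert np.array_equal(result, expected)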
@thomasjpfan: To address your comments, some changes made in this PR have been extracted into #22280 and #22281.
Force-pushed from cb27fdf to 92a23ba.
Superseded by #22288.
Reference Issues/PRs
Part of #21462 (comment).
What does this implement/fix? Explain your changes.
This plugs PairwiseDistancesArgKmin as a back-end so that it is used by several APIs (a usage sketch follows the list), namely:
sklearn.metrics.pairwise_distances_argmin
sklearn.metrics.pairwise_distances_argmin_min
sklearn.cluster.AffinityPropagation
sklearn.cluster.Birch
sklearn.cluster.MeanShift
sklearn.cluster.OPTICS
sklearn.cluster.SpectralClustering
sklearn.feature_selection.mutual_info_regression
sklearn.neighbors.KNeighborsClassifier
sklearn.neighbors.KNeighborsRegressor
sklearn.neighbors.LocalOutlierFactor
sklearn.neighbors.NearestNeighbors
sklearn.manifold.Isomap
sklearn.manifold.LocallyLinearEmbedding
sklearn.manifold.TSNE
sklearn.manifold.trustworthiness
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
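For instance (an illustrative sketch, not taken from this PR), the public functions in the list keep their signature and behaviour; the new back-end is only used internally when the inputs are supported (e.g. dense float64 data):

import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
Y = rng.rand(20, 5)

# For each row of X: the index of the closest row of Y and the distance
# to it. PairwiseDistancesArgKmin is used under the hood; the public API
# is unchanged.
indices, distances = pairwise_distances_argmin_min(X, Y)
print(indices.shape, distances.shape)   # (100,) (100,)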
Existing tests are adapted and extended accordingly.
Any other comments?
Currently checking whether any original change has been left behind.