Skip to content

MAINT Plug PairwiseDistancesArgKmin as a back-end #22214

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented Jan 14, 2022

Reference Issues/PRs

Part of #21462 (comment).

What does this implement/fix? Explain your changes.

This plugs PairwiseDistancesArgKmin as a back-end so that it is used for some API, namely:

  • sklearn.metrics.pairwise_distances_argmin
  • sklearn.metrics.pairwise_distances_argmin_min
  • sklearn.cluster.AffinityPropagation
  • sklearn.cluster.Birch
  • sklearn.cluster.MeanShift
  • sklearn.cluster.OPTICS
  • sklearn.cluster.SpectralClustering
  • sklearn.feature_selection.mutual_info_regression
  • sklearn.neighbors.KNeighborsClassifier
  • sklearn.neighbors.KNeighborsRegressor
  • sklearn.neighbors.LocalOutlierFactor
  • sklearn.neighbors.NearestNeighbors
  • sklearn.manifold.Isomap
  • sklearn.manifold.LocallyLinearEmbedding
  • sklearn.manifold.TSNE
  • sklearn.manifold.trustworthiness
  • sklearn.semi_supervised.LabelPropagation
  • sklearn.semi_supervised.LabelSpreading

Existing tests are adapted and completed accordingly.

Any other comments?

Currently checking if any original change has been left behind.

@jjerphan jjerphan changed the title Forward pairwise_dist_chunk_size in the configuration MAINT Forward pairwise_dist_chunk_size in the configuration Jan 14, 2022
@jjerphan jjerphan changed the title MAINT Forward pairwise_dist_chunk_size in the configuration MAINT Plug PairwiseDistancesArgKmin as a back-end Jan 14, 2022
@jjerphan jjerphan marked this pull request as ready for review January 14, 2022 18:12
@jjerphan jjerphan marked this pull request as draft January 14, 2022 18:55
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a quick first pass

)
indices = indices.flatten()
else:
# TODO: once PairwiseDistancesArgKmin supports sparse input matrices and 32 bit,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there options to get 32bit to work besides `Tempita?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from duplication based on dtype in Cython, I hardly can see something else.

C++ template classes would come in handy here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, we can start with Tempita first and see how this makes the code more complex.

What do the others think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it works with fused types in Cython, that would be great. If not, C++ and Tempita a both fine for me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fused types can't be used for attributes unfortunately (hence the need to use Tempita similarly to what exists for WeightVectors).

raise ValueError("p must be greater or equal to one for minkowski metric")

def _fit(self, X, y=None):
if self._get_tags()["requires_y"]:
if not isinstance(X, (KDTree, BallTree, NeighborsBase)):
X, y = self._validate_data(X, y, accept_sparse="csr", multi_output=True)
X, y = self._validate_data(
X, y, accept_sparse="csr", multi_output=True, order="C"
Copy link
Member

@ogrisel ogrisel Jan 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to other reviewers: this is required by the new Cython implementation. It triggers a potential memory copy in the rare cases where X is not already contiguous (even in cases when we cannot use the new Cython code (e.g. for float32 data that are not supported yet).

But as the plan will be to also write the Cython code for the 32 bit float in the future, let's keep the input data validation logic simpler by always enforcing contiguity from the start.

@jjerphan
Copy link
Member Author

#22241 has been opened to fix a test failure.

@jjerphan jjerphan marked this pull request as ready for review January 19, 2022 10:00
@jjerphan jjerphan force-pushed the pairwise-distances-argkmin-plug branch from dce919f to 3448b01 Compare January 20, 2022 15:49
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking over the tests, a bunch of the changes look like refactors for using pytest's parametrize. I think they can be broken out into their own PRs and merged directly into main.

Implementation looks good to me. The major change is having to set order="C" everywhere, which I think that is okay.

@@ -399,7 +399,7 @@ class MultiOutputClassifier(ClassifierMixin, _MultiOutputEstimator):
>>> X, y = make_multilabel_classification(n_classes=3, random_state=0)
>>> clf = MultiOutputClassifier(KNeighborsClassifier()).fit(X, y)
>>> clf.predict(X[-2:])
array([[1, 1, 0], [1, 1, 1]])
array([[1, 1, 1], [1, 1, 1]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the new implementation gives different results than the previous one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_neighbors_metrics normally test consistency of the three algorithms' results.
Yet, in the specific case of this doc-test, the result is different.

I think the best is to test the new 'brute' back-end against the current implementation.
I will come up with test for this.

@jjerphan
Copy link
Member Author

@thomasjpfan: To address your comments, some changes made in this PR have been extracted in #22280 and #22281.

@jjerphan
Copy link
Member Author

Superseded by #22288.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants