API Allow users to pass DistanceMetric objects to metric keyword argument in neighbors.KNeighborsRegressor #26267
Conversation
Thank you, @Micky774.
As discussed directly with you, I think having efficient vectorized distance metric implementations is of value for scikit-learn generally. In this regard I am +5 regarding merging this PR.
As you mention, the use of third party packages (such as yours, https://github.com/Micky774/distmetric-xsimd/) is limited by the absence of the plugins systems.
Yet, I am not sure if we want to relax our API, change scikit-learn UX, and mention it publicly if this is meant to be experimental or temporary: what this PR allows (i.e. passing DistanceMetric{32,} instances to the metric keyword) relaxes the UX. I think there are notable UX benefits in constraining users to a single, package-uniform specification (mainly clarity and uniformity across all APIs using metric). In this regard, I am -1 regarding merging this PR.
Still, we probably might want to have a way for advanced users (or developers of third party package like you) to use those packages in the meantime. Another alternative temporary approach would be to have third party packages monkey patch scikit-learn internals when they are imported at runtime. This is the approach https://github.com/intel/scikit-learn-intelex uses and it has the benefit of keeping scikit-learn UX unchanged. Using this approach, you could change those mappings to point to your implementations instead:
scikit-learn/sklearn/metrics/_dist_metrics.pyx.tp
Lines 82 to 111 in 42ca718

######################################################################
# metric mappings
#  These map from metric id strings to class names
METRIC_MAPPING{{name_suffix}} = {
    'euclidean': EuclideanDistance{{name_suffix}},
    'l2': EuclideanDistance{{name_suffix}},
    'minkowski': MinkowskiDistance{{name_suffix}},
    'p': MinkowskiDistance{{name_suffix}},
    'manhattan': ManhattanDistance{{name_suffix}},
    'cityblock': ManhattanDistance{{name_suffix}},
    'l1': ManhattanDistance{{name_suffix}},
    'chebyshev': ChebyshevDistance{{name_suffix}},
    'infinity': ChebyshevDistance{{name_suffix}},
    'seuclidean': SEuclideanDistance{{name_suffix}},
    'mahalanobis': MahalanobisDistance{{name_suffix}},
    'wminkowski': WMinkowskiDistance{{name_suffix}},
    'hamming': HammingDistance{{name_suffix}},
    'canberra': CanberraDistance{{name_suffix}},
    'braycurtis': BrayCurtisDistance{{name_suffix}},
    'matching': MatchingDistance{{name_suffix}},
    'jaccard': JaccardDistance{{name_suffix}},
    'dice': DiceDistance{{name_suffix}},
    'kulsinski': KulsinskiDistance{{name_suffix}},
    'rogerstanimoto': RogersTanimotoDistance{{name_suffix}},
    'russellrao': RussellRaoDistance{{name_suffix}},
    'sokalmichener': SokalMichenerDistance{{name_suffix}},
    'sokalsneath': SokalSneathDistance{{name_suffix}},
    'haversine': HaversineDistance{{name_suffix}},
    'pyfunc': PyFuncDistance{{name_suffix}},
}
This also has the benefit of making other methods of scikit-learn use your implementations.
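For concreteness, a minimal sketch of what such an import-time patch might look like (the module internals, the empty name_suffix expansion, and the XsimdManhattan class are assumptions for illustration, not a supported public API):

# Hypothetical sketch: a third-party package re-points scikit-learn's
# internal metric mapping at its own accelerated class when imported.
from sklearn.metrics import _dist_metrics

class XsimdManhattan(_dist_metrics.ManhattanDistance):
    # Stand-in for an accelerated third-party implementation; the real
    # package would override the distance kernels here.
    pass

# Re-point the string identifiers so every scikit-learn code path that
# resolves "manhattan" picks up the accelerated class instead.
for alias in ("manhattan", "cityblock", "l1"):
    _dist_metrics.METRIC_MAPPING[alias] = XsimdManhattan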
What do you think?
We actually discussed this a bit during the most recent monthly meeting, and from my conversation with @ogrisel my current understanding is that the plugin system is intended to expose estimator-specific computational routines rather than widely-used computational backends (e.g. DistanceMetric).
This is definitely an approach worth considering, but truth be told I also think it would be good for scikit-learn to more publicly expose opportunities for third-party acceleration of computational backends even if not through the plugin system.

We have recently been discussing what the role of scikit-learn in the community is, and what our obligations are. One of the more critical questions, I think, is whether or not maximizing performance optimization is our obligation. It's a bit tricky, imo. We should aim for performance wherever possible -- and indeed work like your distmetric-xsimd package demonstrates the headroom that is available. In this way, I see the relaxation of the API for the metric keyword as an invitation for that kind of work.

I will admit my inexperience with this though. If other core-devs believe that the cost of relaxing the API outweighs the benefits of offering a more obvious avenue of contribution for third-party acceleration, then I am okay with the monkey-patching solution. I am not too attached to the specific path taken, just to the results of a faster scikit-learn and -- if at all possible -- an invitation to the community to try out their own acceleration strategies on it.

cc: @ogrisel @adrinjalali since we had actively discussed the role of the plugins system, and the burden of this proposed API change, during the monthly meeting
Could you post an example snippet of what user code with this change would look like?

I also looked at https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html where you can already pass a DistanceMetric object.

I'm not sure I have a firm opinion yet, but my first reaction is that we should preferably use one way to allow users to extend/modify scikit-learn. I'm not sure this is the case here, but imagine we'd end up with a world where there were "estimator plugins" and "distance metric plugins" (using "plugin" as a generic word for "extending scikit-learn"). I think that would be quite confusing to explain to users. I think having one system would be better.
Sure! For the sake of example I'll assume the user has installed my sample accelerated implementations alongside an installation of scikit-learn from this branch. Then it would look like:

from sklearn.neighbors import KNeighborsRegressor
import numpy as np
# Step 1: import the third-party DistanceMetric factory
from distance_metrics import get_distance_metric # package name is a WIP and will soon be slsdm :)
random_state = 0
n_samples_X = 100
n_features = 10
n_classes = 3
rng = np.random.RandomState(random_state)
X = rng.randn(n_samples_X, n_features)
y = rng.randint(n_classes, size=n_samples_X)
# Step 2: construct an accelerated DistanceMetric suited to the data
dst = get_distance_metric(X, "manhattan")
# Step 3: pass the DistanceMetric object directly via the metric keyword
neigh = KNeighborsRegressor(n_neighbors=2, metric=dst)
neigh.fit(X, y)
print(neigh.predict(X[:10]))
Indeed! This actually makes the implementation very straightforward for many estimators, since the functionality is largely already available (we just need to relax the estimators' own validation).
I definitely agree that having one system would be better. I personally would be okay with multiple paths if it made sense and reduced barriers to entry. I don't know if this API change I propose actually accomplishes that or not, but that is the underlying hope. I've attached a list of potential paths forward. What do you think of option four?

Some Potential Paths Forward

I just want to go ahead and summarize some of the paths forward we have discussed.

1. Accept the API change and do not extend the
I'm in favor of accepting this API enhancement no matter what we decide on the engine API side (I slightly prefer this to the engine API acceleration path I think, but no strong feelings there). It would be nice to unify metrics around the code base, rather than having different sets of accepted metrics for different classes.
It would be nice to unify metrics around the code base, rather than having different sets of accepted metrics for different classes.

I also agree. Since some classes like KDTree and BallTree support DistanceMetric for metric, as indicated by @betatim in #26267 (comment), the natural solution would be to have metric support DistanceMetric for all other interfaces exposing it.
Hence I am in favor of this PR, especially given that @Micky774 has come up with compelling performance improvement in the meantime. 🌟
I think some preliminary refactoring might be needed before approving this PR.
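For reference, the existing support looks like this (a minimal sketch; data shapes and metric chosen arbitrarily):

import numpy as np
from sklearn.metrics import DistanceMetric
from sklearn.neighbors import KDTree

X = np.random.RandomState(0).randn(50, 3)

# KDTree already accepts a DistanceMetric instance for `metric`.
tree = KDTree(X, metric=DistanceMetric.get_metric("manhattan"))
dist, ind = tree.query(X[:5], k=2)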
As an update regarding the motivation of this API: it turns out that for several metrics, the accelerated third-party implementations provide substantial speedups. Note that the benchmarks (figures not reproduced here) were generated with this gist.
With this PR merged I'd be able to use a third-party DistanceMetric implementation with KNeighborsRegressor, right?

If yes, can we spell this out in the docs somewhere and maybe even link to an alternative source of DistanceMetric implementations?
Precisely 😄
Fully agreed. I originally intended to include it in documentation once the functionality was exposed through the major pairwise distance reduction backends mentioned in the issue, but I suppose there's not really any reason to wait that long. We can have a maintained list of estimators for which the functionality is available, and then remove the list once the use of the metric keyword is made consistent across the codebase. I'll start on that, and can provide my library as an example implementation (this'll kick me into gear to actually improve the library's documentation as well 😅).
For those interested in trying this, there are now published wheels for slsdm. If you want to quickly sanity check the speedups, you can try something like the sketch below.
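This is a minimal sketch, assuming slsdm exposes the get_distance_metric factory used in the earlier example:

import numpy as np
from timeit import timeit
from sklearn.neighbors import KNeighborsRegressor
from slsdm import get_distance_metric  # assumed import path

rng = np.random.RandomState(0)
X = rng.randn(5_000, 30)
y = rng.randn(5_000)

def fit_predict(metric):
    # Same estimator either way; only the metric specification changes.
    neigh = KNeighborsRegressor(n_neighbors=5, metric=metric)
    neigh.fit(X, y)
    return neigh.predict(X[:100])

# Compare the built-in string spec against the accelerated DistanceMetric.
print("builtin:", timeit(lambda: fit_predict("manhattan"), number=3))
print("slsdm  :", timeit(lambda: fit_predict(get_distance_metric(X, "manhattan")), number=3))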
With that aside, I wanted to go ahead and ping @scikit-learn/core-devs to see if anyone has the capacity/interest in reviewing this PR. I think the API proposed here is a reasonable extension of what we already have for KDTree and BallTree.
LGTM.
Could we treat other estimators in other dedicated PRs?
@thomasjpfan @ogrisel In case either of you were interested in reviewing
elif (
    callable(self.metric)
    or self.metric in VALID_METRICS["ball_tree"]
    or isinstance(self.metric, DistanceMetric)
API wise, self.algorithm is selecting ball_tree all the time if self.metric is a DistanceMetric. Is this the preferred default for self.algorithm="auto"?
This preserves the standard behavior, e.g. metric="euclidean" and metric=DistanceMetric.get_metric("euclidean") will both result in a BallTree with an EuclideanDistance object.
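A quick way to check which backend gets resolved (this peeks at the private _fit_method attribute, which records the algorithm chosen after fitting; not a public API):

import numpy as np
from sklearn.metrics import DistanceMetric
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X, y = rng.randn(100, 10), rng.randn(100)

for metric in ("euclidean", DistanceMetric.get_metric("euclidean")):
    neigh = KNeighborsRegressor(n_neighbors=2, metric=metric).fit(X, y)
    print(metric, "->", neigh._fit_method)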
Reference Issues/PRs
Addresses #26010
Addresses #26329
What does this implement/fix? Explain your changes.
Allow users to pass DistanceMetric objects to the metric keyword argument in neighbors.KNeighborsRegressor.

Any other comments?
I have discussed with @betatim the idea of developing an Engine for DistanceMetric, however that is probably not easily done in the near future without adding to the complexity of the already-open work on the plugin system. That is probably the best strategy once it is a bit more mature, but until then, this option of directly passing DistanceMetric objects does provide a fast and simple route for the development/use of third-party DistanceMetric implementations.