ENH Add array api support for laplacian_kernel and manhattan_distances #29881
Conversation
I don't understand why https://github.com/scikit-learn/scikit-learn/actions/runs/10922498253/job/30316942426?pr=29881 was marked as "skipped". I relabeled the PR to check if it happens again.

EDIT: the new run was not skipped: https://github.com/scikit-learn/scikit-learn/actions/runs/10922892103/job/30318488070?pr=29881

Maybe this is caused by a permissions problem with your account @OmarManzoor? Could you successfully trigger the CUDA GPU CI (without getting a skipped run) in other PRs in the past?
General comment: I see limited value in adding array API support to manhattan_distances if it silently causes huge cross-device data transfers in order to compute with NumPy on the host CPU.

Should we try to implement a naive Manhattan distance without relying on scipy's cdist for non-NumPy inputs? That shouldn't be too hard to implement and maintain, and could bring good speed-ups by naturally leveraging GPU parallelism, but this needs to be confirmed by experiments (e.g. on Google Colab or Kaggle Kernels).

EDIT: alternatively, one might argue that the automatic conversion back and forth to / from NumPy might be useful as a convenience in cases where no array API specific speed-ups are expected.
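The naive non-SciPy implementation suggested above could look roughly like the following. This is only a sketch, not scikit-learn code: the function name `naive_manhattan_distances` and the chunk-free broadcasting strategy are my assumptions.

```python
import numpy as np


def naive_manhattan_distances(X, Y, xp=np):
    # Hypothetical sketch of pairwise cityblock distances using only
    # operations available in the array API standard, so it could run
    # natively on GPU arrays without a NumPy round-trip.
    # Broadcasting builds an (n_x, n_y, n_features) intermediate, so
    # chunking over rows would be needed for large inputs.
    diff = xp.abs(X[:, None, :] - Y[None, :, :])
    return xp.sum(diff, axis=-1)
```

With `xp` set to a GPU-capable namespace (e.g. a torch-backed array API namespace), the absolute differences and the reduction would both run on the device, which is the speed-up hypothesis discussed above.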
It could be related to permissions. This was the first time I tried using the CUDA CI label.

You are welcome to try again to see if this is a reproducible problem.
I like the idea of implementing a specific case for handling Manhattan distances in the array API case when we are outside NumPy. I think that should produce speed-ups when using GPUs, and maybe even on CPU when using PyTorch. However, if scipy adds array API support for this metric, then I think that would automatically enable getting the performance gains. Do we have any idea whether they plan to support it?

Maybe we can start by converting to NumPy in this particular PR and add a TODO noting that this is a temporary workaround as long as scipy doesn't support it or we don't add our own specific implementation to deal with other arrays. That would at least get this code into a working state using the array API. What do you think?
We can probably ask on one of the array API (meta-)issues of the scipy repo.
Sounds like a plan.

It seems that array API support for the distance functions is not expected currently. Reference: scipy/scipy#21090 (comment)

The CUDA tests are passing now 🟢 I haven't tested with MPS though.
Indeed, it fails on MPS because scipy always returns float64:

```
>       return xp.asarray(
            distance.cdist(
                _convert_to_numpy(X, xp=xp), _convert_to_numpy(Y, xp=xp), "cityblock"
            ),
            device=device_,
        )
E       TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
```

So we need to pass a dtype explicitly. EDIT: Alternatively, we could just use …
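A minimal sketch of the workaround under discussion (the name `cityblock_via_numpy` is illustrative, not the actual PR code): compute with SciPy on host NumPy arrays, then convert back with an explicit dtype so devices without float64 support do not fail.

```python
import numpy as np
from scipy.spatial import distance


def cityblock_via_numpy(X_np, Y_np, xp, dtype=None):
    # scipy.spatial.distance.cdist computes in float64 on the host;
    # passing an explicit dtype on the way back avoids asking the
    # target device (e.g. MPS) to hold a float64 array.
    result = distance.cdist(X_np, Y_np, "cityblock")
    return xp.asarray(result, dtype=dtype)
```

For an MPS tensor input, `dtype` would be set to a float32 dtype of the input's namespace, which is what the `_max_precision_float_dtype` change further down accomplishes.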
doc/whats_new/v1.6.rst (outdated)

```rst
- :func:`sklearn.metrics.pairwise.linear_kernel` :pr:`29475` by :user:`Omar Salman <OmarManzoor>`;
- :func:`sklearn.metrics.pairwise.manhattan_distances` :pr:`29881` by :user:`Omar Salman <OmarManzoor>`;
```
I think we should have a dedicated list of functions and classes that accept array API inputs only for convenience but actually convert to NumPy to perform the computation on the host CPU without delegating to the input's namespace.
@betatim @adrinjalali do you have an opinion about this PR (silently converting to NumPy to run on the CPU for cases where there is no simple way to delegate the computation to the library namespace of the input arrays)? One could argue it can be a convenience (e.g. when running a grid search with different kernel functions with CUDA array inputs), but at the same time it can be surprising / misleading the users into believing that the GPU is used when it is not.
CC: @betatim, @glemaitre
```rst
Certain functions within scikit-learn only support the array API for convenience
purposes but internally the arrays are converted to numpy arrays for performing
the required underlying computations and then converted back to the array
namespace under consideration (e.g., :func:`metrics.pairwise.manhattan_distances`).
```
Maybe we can document the list here.
+1 for maintaining a list of estimators that accept array API inputs by internally converting to NumPy/CPU.
@ogrisel do you think that we could raise a specific warning in this case? Basically, if the warning is specific it can be easily silenced.
Indeed, we could either introduce a new dedicated warning class or reuse / repurpose an existing one.

However, do we always want to warn on all namespace/device transfers? For instance, when calling … The reason is that our pipeline does not allow transforming …:

```python
pipeline = make_pipeline(
    dataframe_feature_extractor,
    function_transformer_to_move_X_to_pytorch_gpu,
    array_api_capable_model,
).fit(X_as_a_pandas_dataframe, y_as_a_pandas_series)
```

The automated conversion of … Maybe we could have a global flag to control a minimum data size (in bytes) under which no warning is raised?
As for the question of converting to numpy and back, I don't think we should do that with `X`. I think it's fine to do that with smaller data like `y`, `labels`, `sample_weight`, etc., but moving around `X` can be a bit catastrophic.
```python
        _convert_to_numpy(X, xp=xp), _convert_to_numpy(Y, xp=xp), "cityblock"
    ),
    device=device_,
    dtype=_max_precision_float_dtype(xp=xp, device=device_),
```
I feel like if `distance.cdist(...).dtype == float32`, then we shouldn't be converting it to `float64` here.
I am also not sure whether we should do it or not. The motivation would be to support things like:

```python
GridSearchCV(
    make_pipeline(Nystroem(), RidgeCV()),
    param_grid={
        "nystroem__kernel": ["linear", "poly", "rbf", "laplacian"],
        ...
    },
)
```

Where …

Maybe we can pause this PR and make this decision later though, for instance, once …
I think using array API is gonna be painful for a while anyway, since we don't support it everywhere. So we probably shouldn't make decisions based on convenience yet. Maybe later when there's a lot more coverage, we can then make these decisions better.

I think that the …

Closing for now until we are in a position to make a better decision.
Reference Issues/PRs
Towards #26024
What does this implement/fix? Explain your changes.
Any other comments?
CC: @ogrisel