
ENH Add array api support for laplacian_kernel and manhattan_distances #29881


Closed

Conversation

@OmarManzoor (Contributor)

Reference Issues/PRs

Towards #26024

What does this implement/fix? Explain your changes.

  • Adds array API support for laplacian_kernel and manhattan_distances

Any other comments?

CC: @ogrisel

@github-actions bot commented Sep 18, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 03f5856.

@OmarManzoor changed the title from "Fix linting" to "ENH Add array api support for laplacian_kernel and manhattan_distances" Sep 18, 2024
@OmarManzoor added the Quick Review and CUDA CI labels Sep 18, 2024
@github-actions bot removed the CUDA CI label Sep 18, 2024
@ogrisel (Member) commented Sep 18, 2024

I don't understand why https://github.com/scikit-learn/scikit-learn/actions/runs/10922498253/job/30316942426?pr=29881 was marked as "skipped". I relabeled the PR to check if it happens again.

EDIT: the new run was not skipped:

https://github.com/scikit-learn/scikit-learn/actions/runs/10922892103/job/30318488070?pr=29881

Maybe this is caused by a permissions problem with your account, @OmarManzoor? Could you successfully trigger the CUDA GPU CI (without getting a skipped run) in other PRs in the past?

@ogrisel (Member) left a comment

General comment: I see limited value in adding array API support to manhattan_distances if it silently causes huge data cross-device transfers in order to compute with NumPy on the host CPU.

Should we try to implement a naive Manhattan distance without relying on scipy's cdist for non-NumPy inputs? That shouldn't be too hard to implement and maintain and could bring good speed-ups by naturally leveraging GPU parallelism, but this needs to be confirmed by experiments (e.g. on Google Colab or Kaggle Kernels).

EDIT: alternatively, one might argue that the automatic conversion back and forth to / from NumPy might be useful as a convenience in cases where no array API specific speed-ups are expected.
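
For illustration, a minimal sketch of what such a naive implementation could look like, assuming the namespace is resolved via array_api_compat (the function name is hypothetical and not code from this PR):

import array_api_compat

def manhattan_distances_naive(X, Y):
    # Resolve the array API namespace shared by the inputs.
    xp = array_api_compat.array_namespace(X, Y)
    # Broadcast to shape (n_samples_X, n_samples_Y, n_features) and
    # reduce over the feature axis. Memory scales with the product of
    # all three dimensions, so a production version would likely need
    # to chunk over rows.
    return xp.sum(xp.abs(X[:, None, :] - Y[None, :, :]), axis=-1)

On GPU inputs this keeps the whole computation on the device, which is where the speed-up mentioned above would come from.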

@OmarManzoor (Contributor, Author)

> I don't understand why https://github.com/scikit-learn/scikit-learn/actions/runs/10922498253/job/30316942426?pr=29881 was marked as "skipped".
> Maybe this is caused by a permissions problem with your account, @OmarManzoor? Could you successfully trigger the CUDA GPU CI (without getting a skipped run) in other PRs in the past?

It could be related to permissions. This was the first time I tried using the CUDA CI label.

@ogrisel (Member) commented Sep 18, 2024

> It could be related to permissions. This was the first time I tried using the CUDA CI label.

You are welcome to try again to see if this is a reproducible problem.

@OmarManzoor (Contributor, Author)

> Should we try to implement a naive Manhattan distance without relying on scipy's cdist for non-NumPy inputs? That shouldn't be too hard to implement and maintain and could bring good speed-ups by naturally leveraging GPU parallelism, but this needs to be confirmed by experiments (e.g. on Google Colab or Kaggle Kernels).

> EDIT: alternatively, one might argue that the automatic conversion back and forth to / from NumPy might be useful as a convenience in cases where no array API specific speed-ups are expected.

I like the idea of implementing a dedicated code path for Manhattan distances in the array API case when we are outside NumPy. That should produce speed-ups when using GPUs, and maybe even on CPU when using PyTorch.

However, if SciPy adds array API support for this metric, that would automatically bring the performance gains. Do we have any idea whether they plan to support it?

Maybe we can start with converting to NumPy in this particular PR and add a TODO noting that this is a temporary workaround until SciPy supports the array API or we add our own dedicated implementation for other array types. That would at least get this code into a working state with the array API.

What do you think?

@ogrisel (Member) commented Sep 20, 2024

> Do we have any idea whether they plan to support it?

We can probably ask on one of the array API (meta-)issues of the scipy repo.

> Maybe we can start with converting to NumPy in this particular PR and add a TODO noting that this is a temporary workaround until SciPy supports the array API or we add our own dedicated implementation for other array types. That would at least get this code into a working state with the array API.

Sounds like a plan.

@OmarManzoor (Contributor, Author)

It seems that array API support for the distance functions is not currently planned.

Reference: scipy/scipy#21090 (comment)

@OmarManzoor (Contributor, Author)

The CUDA tests are passing now 🟢

I haven't tested with MPS though.

@ogrisel (Member) commented Sep 27, 2024

Indeed, it fails on MPS because scipy always returns np.float64 dtype output even if the input is np.float32:

>       return xp.asarray(
            distance.cdist(
                _convert_to_numpy(X, xp=xp), _convert_to_numpy(Y, xp=xp), "cityblock"
            ),
            device=device_,
        )
E       TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
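
For context, the upcast can be reproduced with a minimal NumPy-only snippet (illustrative, independent of this PR):

import numpy as np
from scipy.spatial import distance

# scipy.spatial.distance.cdist returns float64 even for float32 inputs,
# and MPS tensors cannot hold float64.
X32 = np.ones((3, 2), dtype=np.float32)
print(distance.cdist(X32, X32, "cityblock").dtype)  # float64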

So we need to pass dtype=_max_precision_float_dtype(xp, device_) to this xp.asarray call.

EDIT: Alternatively, we could just use dtype=X.dtype, but that would be a behavioral change relative to scikit-learn 1.5 (would this be considered a fix?). Not sure which I like best.

- :func:`sklearn.metrics.pairwise.linear_kernel` :pr:`29475` by :user:`Omar Salman <OmarManzoor>`;
- :func:`sklearn.metrics.pairwise.manhattan_distances` :pr:`29881` by :user:`Omar Salman <OmarManzoor>`;
A member left a comment:

I think we should have a dedicated list of functions and classes that accept array API inputs only for convenience but actually convert to NumPy to perform the computation on the host CPU without delegating to the input's namespace.

@ogrisel (Member) commented Sep 27, 2024

@betatim @adrinjalali do you have an opinion about this PR (silently converting to NumPy to run on the CPU for cases where there is no simple way to delegate the computation to the library namespace of the input arrays)?

One could argue it can be a convenience (e.g. when running a grid search with different kernel functions with CUDA array inputs), but at the same time it can be surprising and mislead users into believing that the GPU is used when it is not.

@OmarManzoor (Contributor, Author)

CC: @betatim, @glemaitre

@glemaitre self-requested a review October 17, 2024 21:20
Certain functions within scikit-learn only support the array API for convenience
purposes, but internally the arrays are converted to NumPy arrays to perform
the required underlying computations, and then converted back to the array
namespace under consideration (e.g., :func:`metrics.pairwise.manhattan_distances`).
A member left a comment:

Maybe we can document the list here.

Another member left a comment:

+1 for maintaining a list of estimators that accept array API inputs by internally converting to NumPy/CPU.

@glemaitre (Member)

@ogrisel do you think that we could raise a specific warning in this case? If the warning is specific, it can easily be silenced.

@ogrisel (Member) commented Oct 21, 2024

Indeed, we could either introduce a new dedicated warning class or reuse / repurpose sklearn.exceptions.DataConversionWarning. But maybe it's better to introduce a dedicated warning (e.g. sklearn.exceptions.ArrayAPIDeviceTransferWarning).

However, do we always want to warn on all namespace/device transfers? For instance, when calling est.fit(X_gpu, y_cpu), we want to implement a "y follows X" or "sample_weight follows X" policy for convenience:

The reason is that our pipeline does not allow transforming y. If we have a pipeline such as:

pipeline = make_pipeline(
    dataframe_feature_extractor,
    function_transformer_to_move_X_to_pytorch_gpu,
    array_api_capable_model,
).fit(X_as_a_pandas_dataframe, y_as_a_pandas_series)

The automated conversion of y is always needed, and the user would have no natural way to change the code to avoid raising a warning (apart from silencing it). Most of the time, y is at least one order of magnitude smaller than X.

Maybe we could have a global flag to control a minimum data size (in bytes) under which no warning is raised?
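
A minimal sketch of how a dedicated warning plus such a size threshold could fit together (all names are hypothetical; none of this exists in scikit-learn):

import warnings

class ArrayAPIDeviceTransferWarning(UserWarning):
    """Hypothetical warning for silent array API to NumPy/CPU transfers."""

# Illustrative global threshold: transfers below this size stay silent,
# so small arrays like y or sample_weight do not trigger warnings.
MIN_TRANSFER_WARN_BYTES = 1_000_000

def _warn_on_device_transfer(n_bytes, name):
    # Hypothetical helper: warn only when the transferred data is large
    # enough for the cross-device copy to plausibly matter.
    if n_bytes >= MIN_TRANSFER_WARN_BYTES:
        warnings.warn(
            f"{name} is being converted to NumPy on the host CPU; "
            "no array API acceleration is used for this computation.",
            ArrayAPIDeviceTransferWarning,
        )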

@adrinjalali (Member) left a comment

As for the question of converting to numpy and back, I don't think we should do that with X. I think it's fine to do that with smaller data like y, labels, sample_weight, etc, but moving around X can be a bit catastrophic.

return xp.asarray(
    distance.cdist(
        _convert_to_numpy(X, xp=xp), _convert_to_numpy(Y, xp=xp), "cityblock"
    ),
    device=device_,
    dtype=_max_precision_float_dtype(xp=xp, device=device_),
)
A member left a comment on this diff:

I feel like if distance.cdist(...).dtype == float32, then we shouldn't be converting it to float64 here.

@ogrisel (Member) commented Oct 21, 2024

> As for the question of converting to numpy and back, I don't think we should do that with X. I think it's fine to do that with smaller data like y, labels, sample_weight, etc, but moving around X can be a bit catastrophic.

I am also not sure whether we should do it or not.

The motivation would be to support things like:

GridSearchCV(
    make_pipeline(Nystroem(), RidgeCV()),
    param_grid={
        # Pipeline parameters need the step-name prefix.
        "nystroem__kernel": ["linear", "poly", "rbf", "laplacian"],
        ...
    },
)

Where "linear" and "rbf" kernels can benefit from GPU computation via the array API but "poly" and "laplacian" would convert to NumPy for convenience: otherwise, you have to split the parameter search into two independent calls and merge the results manually, which would be quite painful.

Maybe we can pause this PR and make this decision later, though; for instance, once the Nystroem and Ridge array API support PRs have been finalized, we will be able to run some concrete benchmarks.

@adrinjalali (Member)

I think using array API is gonna be painful for a while anyway, since we don't support it everywhere. So we probably shouldn't make decisions based on convenience yet. Maybe later when there's a lot more coverage, we can then make these decisions better.

@OmarManzoor (Contributor, Author)

I think that the polynomial kernel is currently fine with respect to the array API. The only issue is with the laplacian kernel, because of the scipy dependency in manhattan_distances.
So should we close this PR for now?

@OmarManzoor (Contributor, Author)

Closing for now until we are in a position to make a better decision.
Thank you for the discussion @ogrisel @glemaitre @adrinjalali

@OmarManzoor deleted the laplacain_kernel_array_api branch October 21, 2024 10:46
Labels: Array API, module:metrics, Quick Review