ENH Add array api support for laplacian_kernel and manhattan_distances #29881
Conversation
I don't understand why https://github.com/scikit-learn/scikit-learn/actions/runs/10922498253/job/30316942426?pr=29881 was marked as "skipped". I relabeled the PR to check if it happens again.

EDIT: the new run was not skipped: https://github.com/scikit-learn/scikit-learn/actions/runs/10922892103/job/30318488070?pr=29881

Maybe this is caused by a permissions problem with your account @OmarManzoor? Could you successfully trigger the CUDA GPU CI (without getting a skipped run) in other PRs in the past?
General comment: I see limited value in adding array API support to manhattan_distances if it silently causes huge cross-device data transfers in order to compute with NumPy on the host CPU.

Should we try to implement a naive Manhattan distance without relying on scipy's cdist for non-NumPy inputs? That shouldn't be too hard to implement and maintain, and could bring good speed-ups by naturally leveraging GPU parallelism, but this needs to be confirmed by experiments (e.g. on Google Colab or Kaggle Kernels).

EDIT: alternatively, one might argue that the automatic conversion back and forth to / from NumPy might be useful as a convenience in cases where no array API specific speed-ups are expected.
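The naive non-SciPy implementation suggested above could look roughly like the following. This is only a sketch, not scikit-learn code: the function name `naive_manhattan_distances` and the chunk-free broadcasting strategy are my assumptions.

```python
import numpy as np


def naive_manhattan_distances(X, Y, xp=np):
    # Hypothetical sketch of pairwise cityblock distances using only
    # operations available in the array API standard, so it could run
    # natively on GPU arrays without a NumPy round-trip.
    # Broadcasting builds an (n_x, n_y, n_features) intermediate, so
    # chunking over rows would be needed for large inputs.
    diff = xp.abs(X[:, None, :] - Y[None, :, :])
    return xp.sum(diff, axis=-1)
```

With `xp` set to a GPU-capable namespace (e.g. a torch-backed array API namespace), the absolute differences and the reduction would both run on the device, which is the speed-up hypothesis discussed above.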
It could be related to permissions. This was the first time I tried using the CUDA CI label.

You are welcome to try again to see if this is a reproducible problem.
I like the idea of implementing a specific case for handling Manhattan distances in the array API case when we are outside NumPy. I think that should produce speed-ups when using GPUs, and maybe even on CPU when using PyTorch. However, if scipy adds array API support for this metric, then I think that would automatically enable getting the performance gains. Do we have any idea whether they plan to support it?

Maybe we can start by converting to NumPy in this particular PR and add a TODO noting that this is a temporary workaround as long as scipy doesn't support it or we don't add our own specific implementation to deal with other arrays. That would at least get this code into a working state using the array API. What do you think?
We can probably ask on one of the array API (meta-)issues of the scipy repo.
Sounds like a plan.

It seems that array API support for the distance functions is not expected currently. Reference: scipy/scipy#21090 (comment)

The CUDA tests are passing now 🟢 I haven't tested with MPS though.
Indeed, it fails on MPS because scipy always returns float64:

```
>       return xp.asarray(
            distance.cdist(
                _convert_to_numpy(X, xp=xp), _convert_to_numpy(Y, xp=xp), "cityblock"
            ),
            device=device_,
        )
E       TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
```

So we need to pass a dtype explicitly. EDIT: Alternatively, we could just use …
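A minimal sketch of the workaround under discussion (the name `cityblock_via_numpy` is illustrative, not the actual PR code): compute with SciPy on host NumPy arrays, then convert back with an explicit dtype so devices without float64 support do not fail.

```python
import numpy as np
from scipy.spatial import distance


def cityblock_via_numpy(X_np, Y_np, xp, dtype=None):
    # scipy.spatial.distance.cdist computes in float64 on the host;
    # passing an explicit dtype on the way back avoids asking the
    # target device (e.g. MPS) to hold a float64 array.
    result = distance.cdist(X_np, Y_np, "cityblock")
    return xp.asarray(result, dtype=dtype)
```

For an MPS tensor input, `dtype` would be set to a float32 dtype of the input's namespace, which is what the `_max_precision_float_dtype` change further down accomplishes.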
doc/whats_new/v1.6.rst (outdated)

```rst
- :func:`sklearn.metrics.pairwise.linear_kernel` :pr:`29475` by :user:`Omar Salman <OmarManzoor>`;
- :func:`sklearn.metrics.pairwise.manhattan_distances` :pr:`29881` by :user:`Omar Salman <OmarManzoor>`;
```
I think we should have a dedicated list of functions and classes that accept array API inputs only for convenience but actually convert to NumPy to perform the computation on the host CPU without delegating to the input's namespace.
@betatim @adrinjalali do you have an opinion about this PR (silently converting to NumPy to run on the CPU for cases where there is no simple way to delegate the computation to the library namespace of the input arrays)? One could argue it can be a convenience (e.g. when running a grid search with different kernel functions with CUDA array inputs), but at the same time it can be surprising / misleading the users into believing that the GPU is used when it is not.
CC: @betatim, @glemaitre
```rst
Certain functions within scikit-learn only support the array API for convenience
purposes but internally the arrays are converted to numpy arrays for performing
the required underlying computations and then converted back to the array
namespace under consideration (e.g., :func:`metrics.pairwise.manhattan_distances`).
```
Maybe we can document the list here.
+1 for maintaining a list of estimators that accept array API inputs by internally converting to NumPy/CPU.
@ogrisel do you think that we could raise a specific warning in this case? Basically, if the warning is specific it can be easily silenced.
Indeed, we could either introduce a new dedicated warning class or reuse / repurpose an existing one.

However, do we always want to warn on all namespace/device transfers? For instance, when calling … The reason is that our pipeline does not allow transforming …:

```python
pipeline = make_pipeline(
    dataframe_feature_extractor,
    function_transformer_to_move_X_to_pytorch_gpu,
    array_api_capable_model,
).fit(X_as_a_pandas_dataframe, y_as_a_pandas_series)
```

The automated conversion of … Maybe we could have a global flag to control a minimum data size (in bytes) under which no warning is raised?
As for the question of converting to numpy and back, I don't think we should do that with `X`. I think it's fine to do that with smaller data like `y`, `labels`, `sample_weight`, etc., but moving around `X` can be a bit catastrophic.
```python
        _convert_to_numpy(X, xp=xp), _convert_to_numpy(Y, xp=xp), "cityblock"
    ),
    device=device_,
    dtype=_max_precision_float_dtype(xp=xp, device=device_),
```
I feel like if `distance.cdist(...).dtype == float32`, then we shouldn't be converting it to `float64` here.
I am also not sure whether we should do it or not. The motivation would be to support things like:

```python
GridSearchCV(
    make_pipeline(Nystroem(), RidgeCV()),
    param_grid={
        "nystroem__kernel": ["linear", "poly", "rbf", "laplacian"],
        ...
    },
)
```

Where …

Maybe we can pause this PR and make this decision later though, for instance, once …
I think using array API is gonna be painful for a while anyway, since we don't support it everywhere. So we probably shouldn't make decisions based on convenience yet. Maybe later when there's a lot more coverage, we can then make these decisions better.

I think that the …

Closing for now until we are in a position to make a better decision.
Reference Issues/PRs
Towards #26024
What does this implement/fix? Explain your changes.
Any other comments?
CC: @ogrisel