Skip to content

Conversation

otizonaizit
Copy link
Contributor

@otizonaizit otizonaizit commented Aug 22, 2025

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This PR changes kmeans++ to use np.cumsum instead of our own stable_cumsum. For context please refer to the discussion in #31533 .

Still a draft, missing a little benchmark to verify that the change is indeed making the code faster.

Any other comments?

Copy link

github-actions bot commented Aug 22, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9885de1. Link to the linter CI: here

@otizonaizit
Copy link
Contributor Author

Together with @ogrisel we wrote a little benchmark to measure the impact of the change.

Benchmark code:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import kmeans_plusplus

x, blobs = make_blobs(n_samples=1_000_000, n_features=10, random_state=0)

y = x.astype(np.float32)

centers, indices = kmeans_plusplus(y, n_clusters=100, random_state=0)

Profiling with pyinstrument on main:

  _     ._   __/__   _ _  _  _ _/_   Recorded: 13:51:49  Samples:  5125
 /_//_/// /_\ / //_// / //_'/ //     Duration: 6.118     CPU time: 11.733
/   _/                      v5.1.1

Program: bench_cumsum.py

6.118 <module>  bench_cumsum.py:1
├─ 4.979 wrapper  sklearn/utils/_param_validation.py:187
│  ├─ 4.701 kmeans_plusplus  sklearn/cluster/_kmeans.py:63
│  │  └─ 4.692 _kmeans_plusplus  sklearn/cluster/_kmeans.py:180
│  │     ├─ 4.086 _euclidean_distances  sklearn/metrics/pairwise.py:397
│  │     │  ├─ 3.840 _euclidean_distances_upcast  sklearn/metrics/pairwise.py:588
│  │     │  │  ├─ 1.061 row_norms  sklearn/utils/extmath.py:54
│  │     │  │  │  ├─ 0.606 einsum  numpy/_core/einsumfunc.py:1057
│  │     │  │  │  │  └─ 0.593 c_einsum  <built-in>
│  │     │  │  │  ├─ 0.157 [self]  sklearn/utils/extmath.py
│  │     │  │  │  ├─ 0.138 asarray  sklearn/externals/array_api_compat/numpy/_aliases.py:89
│  │     │  │  │  └─ 0.086 get_namespace  sklearn/utils/_array_api.py:334
│  │     │  │  │     └─ 0.069 get_config  sklearn/_config.py:35
│  │     │  │  ├─ 0.995 [self]  sklearn/metrics/pairwise.py
│  │     │  │  ├─ 0.981 safe_sparse_dot  sklearn/utils/extmath.py:160
│  │     │  │  │  ├─ 0.858 [self]  sklearn/utils/extmath.py
│  │     │  │  │  └─ 0.083 get_namespace  sklearn/utils/_array_api.py:334
│  │     │  │  └─ 0.784 astype  sklearn/externals/array_api_compat/numpy/_aliases.py:116
│  │     │  │     └─ 0.693 ndarray.astype  <built-in>
│  │     │  └─ 0.245 _modify_in_place_if_numpy  sklearn/utils/_array_api.py:992
│  │     ├─ 0.304 stable_cumsum  sklearn/utils/extmath.py:1230
│  │     │  └─ 0.302 cumsum  numpy/_core/fromnumeric.py:2879
│  │     │        [2 frames hidden]  numpy, <built-in>
│  │     └─ 0.301 [self]  sklearn/cluster/_kmeans.py
│  └─ 0.277 make_blobs  sklearn/datasets/_samples_generator.py:981
├─ 0.682 <module>  sklearn/__init__.py:1
│  └─ 0.671 <module>  sklearn/base.py:1
│     └─ 0.668 <module>  sklearn/utils/__init__.py:1
│        └─ 0.650 <module>  sklearn/utils/_chunking.py:1
│           └─ 0.646 <module>  sklearn/utils/_param_validation.py:1
│              ├─ 0.514 <module>  sklearn/utils/validation.py:1
│              │  └─ 0.467 <module>  sklearn/utils/_array_api.py:1
│              │     └─ 0.433 <module>  sklearn/utils/fixes.py:1
│              │        ├─ 0.248 <module>  scipy/stats/__init__.py:1
│              │        │     [4 frames hidden]  scipy
│              │        └─ 0.177 <module>  pandas/__init__.py:1
│              │           └─ 0.129 <module>  pandas/core/api.py:1
│              └─ 0.116 <module>  scipy/sparse/__init__.py:1
│                    [6 frames hidden]  scipy, numpy
├─ 0.256 <module>  sklearn/cluster/__init__.py:1
│  └─ 0.156 <module>  sklearn/cluster/_dbscan.py:1
│     └─ 0.156 <module>  sklearn/neighbors/__init__.py:1
│        └─ 0.127 <module>  sklearn/neighbors/_nca.py:1
│           └─ 0.127 <module>  sklearn/decomposition/__init__.py:1
│              └─ 0.101 <module>  sklearn/decomposition/_dict_learning.py:1
│                 └─ 0.101 <module>  sklearn/linear_model/__init__.py:1
├─ 0.093 <module>  sklearn/datasets/__init__.py:1
└─ 0.065 <module>  numpy/__init__.py:1

The same but with this PR applied:

  _     ._   __/__   _ _  _  _ _/_   Recorded: 13:52:43  Samples:  5271
 /_//_/// /_\ / //_// / //_'/ //     Duration: 6.192     CPU time: 11.853
/   _/                      v5.1.1

Program: bench_cumsum.py

6.191 <module>  bench_cumsum.py:1
├─ 5.020 wrapper  sklearn/utils/_param_validation.py:187
│  ├─ 4.724 kmeans_plusplus  sklearn/cluster/_kmeans.py:63
│  │  └─ 4.713 _kmeans_plusplus  sklearn/cluster/_kmeans.py:180
│  │     ├─ 4.181 _euclidean_distances  sklearn/metrics/pairwise.py:397
│  │     │  ├─ 3.956 _euclidean_distances_upcast  sklearn/metrics/pairwise.py:588
│  │     │  │  ├─ 1.302 [self]  sklearn/metrics/pairwise.py
│  │     │  │  ├─ 0.961 safe_sparse_dot  sklearn/utils/extmath.py:160
│  │     │  │  ├─ 0.875 astype  sklearn/externals/array_api_compat/numpy/_aliases.py:116
│  │     │  │  │  ├─ 0.746 ndarray.astype  <built-in>
│  │     │  │  │  ├─ 0.065 [self]  sklearn/externals/array_api_compat/numpy/_aliases.py
│  │     │  │  │  └─ 0.064 _check_device  sklearn/externals/array_api_compat/common/_helpers.py:660
│  │     │  │  └─ 0.805 row_norms  sklearn/utils/extmath.py:54
│  │     │  │     ├─ 0.456 einsum  numpy/_core/einsumfunc.py:1057
│  │     │  │     │  └─ 0.443 c_einsum  <built-in>
│  │     │  │     ├─ 0.132 [self]  sklearn/utils/extmath.py
│  │     │  │     ├─ 0.106 asarray  sklearn/externals/array_api_compat/numpy/_aliases.py:89
│  │     │  │     └─ 0.066 get_namespace  sklearn/utils/_array_api.py:334
│  │     │  └─ 0.223 _modify_in_place_if_numpy  sklearn/utils/_array_api.py:992
│  │     ├─ 0.296 [self]  sklearn/cluster/_kmeans.py
│  │     └─ 0.234 cumsum  numpy/_core/fromnumeric.py:2879
│  │           [2 frames hidden]  numpy, <built-in>
│  └─ 0.294 make_blobs  sklearn/datasets/_samples_generator.py:981
├─ 0.707 <module>  sklearn/__init__.py:1
│  └─ 0.698 <module>  sklearn/base.py:1
│     └─ 0.695 <module>  sklearn/utils/__init__.py:1
│        └─ 0.676 <module>  sklearn/utils/_chunking.py:1
│           └─ 0.672 <module>  sklearn/utils/_param_validation.py:1
│              ├─ 0.535 <module>  sklearn/utils/validation.py:1
│              │  └─ 0.486 <module>  sklearn/utils/_array_api.py:1
│              │     └─ 0.451 <module>  sklearn/utils/fixes.py:1
│              │        ├─ 0.256 <module>  scipy/stats/__init__.py:1
│              │        │     [4 frames hidden]  scipy
│              │        └─ 0.186 <module>  pandas/__init__.py:1
│              │           └─ 0.137 <module>  pandas/core/api.py:1
│              └─ 0.122 <module>  scipy/sparse/__init__.py:1
│                    [6 frames hidden]  scipy, numpy
├─ 0.256 <module>  sklearn/cluster/__init__.py:1
│  └─ 0.154 <module>  sklearn/cluster/_dbscan.py:1
│     └─ 0.154 <module>  sklearn/neighbors/__init__.py:1
│        └─ 0.124 <module>  sklearn/neighbors/_nca.py:1
│           └─ 0.124 <module>  sklearn/decomposition/__init__.py:1
│              └─ 0.098 <module>  sklearn/decomposition/_dict_learning.py:1
│                 └─ 0.098 <module>  sklearn/linear_model/__init__.py:1
├─ 0.096 <module>  sklearn/datasets/__init__.py:1
└─ 0.065 <module>  numpy/__init__.py:1

The relevant lines to compare are those referring to stable_cumsum and np.cumsum. For main:

│  │     ├─ 0.304 stable_cumsum  sklearn/utils/extmath.py:1230
│  │     │  └─ 0.302 cumsum  numpy/_core/fromnumeric.py:2879

With the PR applied:

│  │     └─ 0.234 cumsum  numpy/_core/fromnumeric.py:2879

The speedup in the call to np.cumsum comes from the fact that we are not upcasting to np.float64 anymore. But on a total running time of ~6 seconds this speed up is not really relevant.

Still, the change opens the possibility of running on systems that only support np.float32 (some graphic cards driver on MacOS, apparently?). This becomes even more relevant once the migration to the new array protocol is completed.

@otizonaizit otizonaizit marked this pull request as ready for review August 22, 2025 12:00
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If tests pass, LGTM.

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the PR @otizonaizit

@OmarManzoor OmarManzoor merged commit d5715fb into scikit-learn:main Aug 22, 2025
53 checks passed
@jeremiedbb jeremiedbb mentioned this pull request Sep 3, 2025
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants