-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
ENH use np.cumsum instead of stable_cumsum in kmeans++ #31991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Together with @ogrisel we wrote a little benchmark to measure the impact of the change. Benchmark code: import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import kmeans_plusplus
x, blobs = make_blobs(n_samples=1_000_000, n_features=10, random_state=0)
y = x.astype(np.float32)
centers, indices = kmeans_plusplus(y, n_clusters=100, random_state=0) Profiling with pyinstrument on _ ._ __/__ _ _ _ _ _/_ Recorded: 13:51:49 Samples: 5125
/_//_/// /_\ / //_// / //_'/ // Duration: 6.118 CPU time: 11.733
/ _/ v5.1.1
Program: bench_cumsum.py
6.118 <module> bench_cumsum.py:1
├─ 4.979 wrapper sklearn/utils/_param_validation.py:187
│ ├─ 4.701 kmeans_plusplus sklearn/cluster/_kmeans.py:63
│ │ └─ 4.692 _kmeans_plusplus sklearn/cluster/_kmeans.py:180
│ │ ├─ 4.086 _euclidean_distances sklearn/metrics/pairwise.py:397
│ │ │ ├─ 3.840 _euclidean_distances_upcast sklearn/metrics/pairwise.py:588
│ │ │ │ ├─ 1.061 row_norms sklearn/utils/extmath.py:54
│ │ │ │ │ ├─ 0.606 einsum numpy/_core/einsumfunc.py:1057
│ │ │ │ │ │ └─ 0.593 c_einsum <built-in>
│ │ │ │ │ ├─ 0.157 [self] sklearn/utils/extmath.py
│ │ │ │ │ ├─ 0.138 asarray sklearn/externals/array_api_compat/numpy/_aliases.py:89
│ │ │ │ │ └─ 0.086 get_namespace sklearn/utils/_array_api.py:334
│ │ │ │ │ └─ 0.069 get_config sklearn/_config.py:35
│ │ │ │ ├─ 0.995 [self] sklearn/metrics/pairwise.py
│ │ │ │ ├─ 0.981 safe_sparse_dot sklearn/utils/extmath.py:160
│ │ │ │ │ ├─ 0.858 [self] sklearn/utils/extmath.py
│ │ │ │ │ └─ 0.083 get_namespace sklearn/utils/_array_api.py:334
│ │ │ │ └─ 0.784 astype sklearn/externals/array_api_compat/numpy/_aliases.py:116
│ │ │ │ └─ 0.693 ndarray.astype <built-in>
│ │ │ └─ 0.245 _modify_in_place_if_numpy sklearn/utils/_array_api.py:992
│ │ ├─ 0.304 stable_cumsum sklearn/utils/extmath.py:1230
│ │ │ └─ 0.302 cumsum numpy/_core/fromnumeric.py:2879
│ │ │ [2 frames hidden] numpy, <built-in>
│ │ └─ 0.301 [self] sklearn/cluster/_kmeans.py
│ └─ 0.277 make_blobs sklearn/datasets/_samples_generator.py:981
├─ 0.682 <module> sklearn/__init__.py:1
│ └─ 0.671 <module> sklearn/base.py:1
│ └─ 0.668 <module> sklearn/utils/__init__.py:1
│ └─ 0.650 <module> sklearn/utils/_chunking.py:1
│ └─ 0.646 <module> sklearn/utils/_param_validation.py:1
│ ├─ 0.514 <module> sklearn/utils/validation.py:1
│ │ └─ 0.467 <module> sklearn/utils/_array_api.py:1
│ │ └─ 0.433 <module> sklearn/utils/fixes.py:1
│ │ ├─ 0.248 <module> scipy/stats/__init__.py:1
│ │ │ [4 frames hidden] scipy
│ │ └─ 0.177 <module> pandas/__init__.py:1
│ │ └─ 0.129 <module> pandas/core/api.py:1
│ └─ 0.116 <module> scipy/sparse/__init__.py:1
│ [6 frames hidden] scipy, numpy
├─ 0.256 <module> sklearn/cluster/__init__.py:1
│ └─ 0.156 <module> sklearn/cluster/_dbscan.py:1
│ └─ 0.156 <module> sklearn/neighbors/__init__.py:1
│ └─ 0.127 <module> sklearn/neighbors/_nca.py:1
│ └─ 0.127 <module> sklearn/decomposition/__init__.py:1
│ └─ 0.101 <module> sklearn/decomposition/_dict_learning.py:1
│ └─ 0.101 <module> sklearn/linear_model/__init__.py:1
├─ 0.093 <module> sklearn/datasets/__init__.py:1
└─ 0.065 <module> numpy/__init__.py:1 The same but with this PR applied: _ ._ __/__ _ _ _ _ _/_ Recorded: 13:52:43 Samples: 5271
/_//_/// /_\ / //_// / //_'/ // Duration: 6.192 CPU time: 11.853
/ _/ v5.1.1
Program: bench_cumsum.py
6.191 <module> bench_cumsum.py:1
├─ 5.020 wrapper sklearn/utils/_param_validation.py:187
│ ├─ 4.724 kmeans_plusplus sklearn/cluster/_kmeans.py:63
│ │ └─ 4.713 _kmeans_plusplus sklearn/cluster/_kmeans.py:180
│ │ ├─ 4.181 _euclidean_distances sklearn/metrics/pairwise.py:397
│ │ │ ├─ 3.956 _euclidean_distances_upcast sklearn/metrics/pairwise.py:588
│ │ │ │ ├─ 1.302 [self] sklearn/metrics/pairwise.py
│ │ │ │ ├─ 0.961 safe_sparse_dot sklearn/utils/extmath.py:160
│ │ │ │ ├─ 0.875 astype sklearn/externals/array_api_compat/numpy/_aliases.py:116
│ │ │ │ │ ├─ 0.746 ndarray.astype <built-in>
│ │ │ │ │ ├─ 0.065 [self] sklearn/externals/array_api_compat/numpy/_aliases.py
│ │ │ │ │ └─ 0.064 _check_device sklearn/externals/array_api_compat/common/_helpers.py:660
│ │ │ │ └─ 0.805 row_norms sklearn/utils/extmath.py:54
│ │ │ │ ├─ 0.456 einsum numpy/_core/einsumfunc.py:1057
│ │ │ │ │ └─ 0.443 c_einsum <built-in>
│ │ │ │ ├─ 0.132 [self] sklearn/utils/extmath.py
│ │ │ │ ├─ 0.106 asarray sklearn/externals/array_api_compat/numpy/_aliases.py:89
│ │ │ │ └─ 0.066 get_namespace sklearn/utils/_array_api.py:334
│ │ │ └─ 0.223 _modify_in_place_if_numpy sklearn/utils/_array_api.py:992
│ │ ├─ 0.296 [self] sklearn/cluster/_kmeans.py
│ │ └─ 0.234 cumsum numpy/_core/fromnumeric.py:2879
│ │ [2 frames hidden] numpy, <built-in>
│ └─ 0.294 make_blobs sklearn/datasets/_samples_generator.py:981
├─ 0.707 <module> sklearn/__init__.py:1
│ └─ 0.698 <module> sklearn/base.py:1
│ └─ 0.695 <module> sklearn/utils/__init__.py:1
│ └─ 0.676 <module> sklearn/utils/_chunking.py:1
│ └─ 0.672 <module> sklearn/utils/_param_validation.py:1
│ ├─ 0.535 <module> sklearn/utils/validation.py:1
│ │ └─ 0.486 <module> sklearn/utils/_array_api.py:1
│ │ └─ 0.451 <module> sklearn/utils/fixes.py:1
│ │ ├─ 0.256 <module> scipy/stats/__init__.py:1
│ │ │ [4 frames hidden] scipy
│ │ └─ 0.186 <module> pandas/__init__.py:1
│ │ └─ 0.137 <module> pandas/core/api.py:1
│ └─ 0.122 <module> scipy/sparse/__init__.py:1
│ [6 frames hidden] scipy, numpy
├─ 0.256 <module> sklearn/cluster/__init__.py:1
│ └─ 0.154 <module> sklearn/cluster/_dbscan.py:1
│ └─ 0.154 <module> sklearn/neighbors/__init__.py:1
│ └─ 0.124 <module> sklearn/neighbors/_nca.py:1
│ └─ 0.124 <module> sklearn/decomposition/__init__.py:1
│ └─ 0.098 <module> sklearn/decomposition/_dict_learning.py:1
│ └─ 0.098 <module> sklearn/linear_model/__init__.py:1
├─ 0.096 <module> sklearn/datasets/__init__.py:1
└─ 0.065 <module> numpy/__init__.py:1 The relevant lines to compare are those referring to │ │ ├─ 0.304 stable_cumsum sklearn/utils/extmath.py:1230
│ │ │ └─ 0.302 cumsum numpy/_core/fromnumeric.py:2879 With the PR applied:
The speedup in the call to Still, the change opens the possibility of running on systems that only support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If tests pass, LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the PR @otizonaizit
Reference Issues/PRs
What does this implement/fix? Explain your changes.
This PR changes kmeans++ to use
np.cumsum
instead of our ownstable_cumsum
. For context please refer to the discussion in #31533 .Still a draft, missing a little benchmark to verify that the change is indeed making the code faster.
Any other comments?