PERF Avoid repetitively allocating large temporary arrays when fitting `GaussianMixture` #30614

ogrisel · 2025-01-08T21:43:17Z

While profiling the memory usage of GaussianMixture as part of #30415 (comment) I realized that there were many other possible improvements, independently of the use of float32 data.

So here is a WIP PR with a snapshot of the things I found with the help of scalene and memray.

On float64 data, chunking and more liberal use of in-place operations + the a-posteriori covariance matrix centering trick make it possible to reduce fit time by ~40% and trim peak memory usage by 60% on a 400 MB dataset.
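A minimal NumPy sketch of the centering idea, assuming "a-posteriori centering" means accumulating the uncentered weighted second moment of `X` and subtracting the outer product of the mean afterwards, instead of materializing `X - mean` for every component. The function names are hypothetical and this is not the code from the PR:

```python
import numpy as np


def cov_full_centered(X, resp, nk, means):
    """Reference: center X first, which allocates an X-sized array per component."""
    n_components, n_features = means.shape
    cov = np.empty((n_components, n_features, n_features))
    for k in range(n_components):
        diff = X - means[k]  # temporary with the same shape as X
        cov[k] = (resp[:, k] * diff.T) @ diff / nk[k]
    return cov


def cov_full_post_centered(X, resp, nk, means):
    """Assumed trick: uncentered second moment, then subtract mean mean^T."""
    n_components, n_features = means.shape
    cov = np.empty((n_components, n_features, n_features))
    for k in range(n_components):
        cov[k] = (resp[:, k] * X.T) @ X / nk[k]
        cov[k] -= np.outer(means[k], means[k])  # in place on a small matrix
    return cov


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
resp = rng.random(size=(1000, 2))
resp /= resp.sum(axis=1, keepdims=True)  # responsibilities sum to 1 per sample
nk = resp.sum(axis=0)
means = resp.T @ X / nk[:, None]  # weighted means, as in the M-step
```

Both functions agree when `means` are the responsibility-weighted means. The uncentered form can lose precision to cancellation when the means are large relative to the spread of the data, which is one of the trade-offs to assess.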

TODO:

- find a way to fix the remaining broken tests: `ValueError: output array is not acceptable (must have the right datatype, number of dimensions, and be a C-Array)`;
- add more details about benchmarking / profiling results;
- changelog entry;
- maybe split this PR into sub-PRs if we want to better assess the trade-offs of each family of changes.
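For context on the `ValueError` quoted in the TODO list: the thread does not say which call raises it, but NumPy's `np.dot` rejects an `out=` buffer with the wrong dtype or memory layout in this way. A small sketch, assuming that is the failure mode here (the helper name is hypothetical):

```python
import numpy as np


def dot_out_raises(a, b, out):
    """Return True if np.dot(a, b, out=out) rejects the output buffer."""
    try:
        np.dot(a, b, out=out)
    except ValueError:
        return True
    return False


a = np.ones((4, 3), dtype=np.float32)
b = np.ones((3, 2), dtype=np.float32)

ok_out = np.empty((4, 2), dtype=np.float32)            # matching dtype, C-ordered
wrong_dtype_out = np.empty((4, 2), dtype=np.float64)   # float32 inputs, float64 buffer
fortran_out = np.empty((2, 4), dtype=np.float32).T     # right shape, not C-contiguous

results = {
    "ok": dot_out_raises(a, b, ok_out),
    "wrong dtype": dot_out_raises(a, b, wrong_dtype_out),
    "not C-contiguous": dot_out_raises(a, b, fortran_out),
}
```

If this is indeed the cause, preallocated buffers need the dtype of the product they receive, which is relevant when float32 and float64 inputs are mixed.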


github-actions · 2025-01-08T21:44:30Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_Generated for commit: 0ce475a. Link to the linter CI: here_

OmarManzoor

@ogrisel This looks nice. Would it make sense to separate the splitting logic into its own PR and keep all the other changes that you have made in this one? I think the error comes only from the splitting part, is that correct?

OmarManzoor · 2025-04-08T13:20:50Z

@ogrisel Do you think we should update this PR? Currently there are conflicts, and the build is too old, so I can't see the errors that occurred. As far as the tests in the mixture module are concerned, all of them passed on my local Windows system.

ogrisel · 2025-04-08T18:06:46Z

I don't have a plan to work on it soon. Feel free to take over or extract the easy-to-merge parts into a new PR.

OmarManzoor · 2025-04-09T13:15:35Z

sklearn/mixture/_gaussian_mixture.py

+    # to convert it to bytes
+    bytes_per_sample = max(X.dtype.itemsize * X.shape[1], 1)
+    batch_size = max(int(get_config()["working_memory"] * 1e6) // bytes_per_sample, 1)
+    float_dtype = precisions_chol.dtype


Note: For now, we need to extract the dtype from `precisions_chol`, as it is used below in the in-place computation of `squared_diff`, which requires matching dtypes. This is necessary because `BayesianGaussianMixture` does not currently support float32, so using `X.dtype` directly causes issues. In addition, the common tests check cases where `X` has an integer dtype.
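To make the batching and dtype choice above concrete, here is a self-contained sketch under stated assumptions: the memory budget is passed as a plain `working_memory_mb` argument instead of being read from `get_config()`, there is a single component, and the function names are hypothetical. It is not the PR's implementation, only an illustration of sizing batches from a byte budget and doing the squared-difference step in place with the dtype of `prec_chol`:

```python
import numpy as np


def batch_slices(n_samples, bytes_per_sample, working_memory_mb):
    """Yield row slices so that each batch fits roughly in the memory budget."""
    batch_size = max(int(working_memory_mb * 1e6) // bytes_per_sample, 1)
    for start in range(0, n_samples, batch_size):
        yield slice(start, min(start + batch_size, n_samples))


def squared_distances_batched(X, mean, prec_chol, working_memory_mb=1.0):
    """Per-sample sum of squares of (X - mean) @ prec_chol, computed in batches."""
    float_dtype = prec_chol.dtype  # not X.dtype: X may hold integers
    bytes_per_sample = max(X.dtype.itemsize * X.shape[1], 1)
    shift = (mean @ prec_chol).astype(float_dtype, copy=False)
    out = np.empty(X.shape[0], dtype=float_dtype)
    for sl in batch_slices(X.shape[0], bytes_per_sample, working_memory_mb):
        # Only a batch-sized temporary is allocated here, never an X-sized one.
        y = X[sl].astype(float_dtype, copy=False) @ prec_chol
        y -= shift            # in place: dtypes match by construction
        np.square(y, out=y)   # in place
        out[sl] = y.sum(axis=1)
    return out


rng = np.random.default_rng(0)
X_int = rng.integers(0, 10, size=(1000, 3))  # integer input, as in the common tests
mean = X_int.mean(axis=0)
prec_chol = np.triu(rng.random((3, 3))) + np.eye(3)

expected = (((X_int - mean) @ prec_chol) ** 2).sum(axis=1)
result = squared_distances_batched(X_int, mean, prec_chol, working_memory_mb=0.001)
```

With the tiny budget used in the example the data is processed in many small batches, and the result matches the unbatched computation.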

OmarManzoor · 2025-04-11T07:03:44Z

CC: @lesteve @betatim @antoinebaker @jeremiedbb for reviews

OmarManzoor · 2025-06-12T07:30:43Z

@ogrisel Do you think we should close this PR, now that the PR incorporating the array API is close to being finalized?

ogrisel · 2025-06-13T09:23:35Z

Once the array API support is merged, this PR will have to be adapted to optimize either the NumPy case or the generic array API case when possible. I think the individual optimizations will have to be reviewed separately, maybe by splitting the PR into sub-PRs.

OmarManzoor · 2025-06-13T10:58:40Z

I agree. I think it would be better to open new PRs though instead of trying to adjust this one. What do you think?

Avoid Repetitivelyallocating large temporary arrays when fitting Gaus…

6211876

…sianMixture

ogrisel added the Performance label Jan 8, 2025

github-actions bot added the module:mixture label Jan 8, 2025

ogrisel changed the title ~~PERF Avoid Repetitivelyallocating large temporary arrays when fitting GaussianMixture~~ PERF Avoid Repetitively allocating large temporary arrays when fitting GaussianMixture Jan 9, 2025

ogrisel changed the title ~~PERF Avoid Repetitively allocating large temporary arrays when fitting GaussianMixture~~ PERF Avoid repetitively allocating large temporary arrays when fitting GaussianMixture Jan 9, 2025

OmarManzoor reviewed Jan 9, 2025


OmarManzoor added 3 commits April 9, 2025 15:20

Merge branch 'main' into gmm-optim-memalloc

f2a1bf9

Adjust for covariance estimation for float32

81ebca5

Use prec_chol dtype for now in _estimate_log_gaussian_prob

472abcb

OmarManzoor reviewed Apr 9, 2025


OmarManzoor added 2 commits April 11, 2025 10:16

Merge branch 'main' into gmm-optim-memalloc

f18f0bc

Trigger CI [float32]

0ce475a

OmarManzoor marked this pull request as ready for review April 11, 2025 07:02
