[FIX] Free memory in DBSCAN.fit() after clustering (fixes #31407) #31526


Closed

Conversation

Tahseen23

What does this PR do?

Fixes a memory overuse issue in DBSCAN.fit() by explicitly deleting large temporary variables (neighborhoods, neighbors_model) and invoking gc.collect() after clustering is done.

These variables are not used after fitting, but they consume significant memory in large datasets. Releasing them manually helps prevent memory buildup in long-running environments (e.g., notebooks, servers).
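
As a self-contained illustration of the pattern (not the library code itself; the function name, data, and threshold below are made up for the example), the idea is to keep only the small derived results and drop the large intermediate before returning:

import gc

import numpy as np


def fit_like(X, min_samples=5):
    # Stand-in for the per-sample neighborhood arrays DBSCAN builds during
    # fit(); in the real estimator this intermediate has worst-case O(n^2)
    # memory.
    neighborhoods = [np.arange(i % 50) for i in range(len(X))]
    n_neighbors = np.array([len(n) for n in neighborhoods])
    labels = (n_neighbors >= min_samples).astype(np.intp)
    # Release the large temporary explicitly and trigger a collection now,
    # instead of waiting for the next automatic GC pass.
    del neighborhoods
    gc.collect()
    return labels


labels = fit_like(np.zeros((1_000, 2)))

The actual change only adds del neighborhoods, del neighbors_model, and gc.collect() at the end of DBSCAN.fit(), as visible in the profile below.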

--

Related Issue

Closes: #31407


Why is this important?

  • DBSCAN can retain large temporary arrays in memory, especially on datasets with millions of rows.
  • This fix reclaims roughly 10–15 MiB of memory immediately after .fit() returns (about 12 MiB in the 100,000-sample profile below).
  • No change to clustering logic or public API.
  • Helps in memory-critical environments, or when clustering is repeated in a loop.

Memory Profile Insights

I've run a memory profile on the DBSCAN implementation to analyze its memory footprint.
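
To reproduce a similar line-by-line profile, a sketch along these lines using the memory_profiler package should work (it profiles a small wrapper rather than decorating DBSCAN.fit() inside the library, which is how the output below was produced; the dataset and DBSCAN parameters are placeholders):

import numpy as np
from memory_profiler import profile  # third-party package: memory_profiler
from sklearn.cluster import DBSCAN


@profile
def run_fit(n_samples=100_000):
    # Uniform 2-D points; 100,000 samples matches the profile below.
    rng = np.random.RandomState(0)
    X = rng.uniform(size=(n_samples, 2))
    DBSCAN(eps=0.01, min_samples=5).fit(X)


if __name__ == "__main__":
    run_fit()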

Memory Profiler Output (100,000 samples)
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   366    132.4 MiB    132.4 MiB           1       @_fit_context(
   367                                                 # DBSCAN.metric is not validated yet
   368                                                 prefer_skip_nested_validation=False
   369                                             )
   370                                             @profile()
   371                                             def fit(self, X, y=None, sample_weight=None):
   372                                                 """Perform DBSCAN clustering from features, or distance matrix.
   373
   374                                                 Parameters
   375                                                 ----------
   376                                                 X : {array-like, sparse matrix} of shape (n_samples, n_features), or \
   377                                                     (n_samples, n_samples)
   378                                                     Training instances to cluster, or distances between instances if
   379                                                     `metric='precomputed'. If a sparse matrix is provided, it will
   380                                                     be converted into a sparse `csr_matrix.
   381
   382                                                 y : Ignored
   383                                                     Not used, present here for API consistency by convention.
   384
   385                                                 sample_weight : array-like of shape (n_samples,), default=None
   386                                                     Weight of each sample, such that a sample with a weight of at least
   387                                                     `min_samples is by itself a core sample; a sample with a
   388                                                     negative weight may inhibit its eps-neighbor from being core.
   389                                                     Note that weights are absolute, and default to 1.
   390
   391                                                 Returns
   392                                                 -------
   393                                                 self : object
   394                                                     Returns a fitted instance of self.
   395                                                 """
   396    132.4 MiB      0.0 MiB           1           X = validate_data(self, X, accept_sparse="csr")
   397
   398    132.4 MiB      0.0 MiB           1           if sample_weight is not None:
   399                                                     sample_weight = _check_sample_weight(sample_weight, X)
   400
   401                                                 # Calculate neighborhood for all samples. This leaves the original
   402                                                 # point in, which needs to be considered later (i.e. point i is in the
   403                                                 # neighborhood of point i. While True, its useless information)
   404    132.4 MiB      0.0 MiB           1           if self.metric == "precomputed" and sparse.issparse(X):
   405                                                     # set the diagonal to explicit values, as a point is its own
   406                                                     # neighbor
   407                                                     X = X.copy()  # copy to avoid in-place modification
   408                                                     with warnings.catch_warnings():
   409                                                         warnings.simplefilter("ignore", sparse.SparseEfficiencyWarning)
   410                                                         X.setdiag(X.diagonal())
   411
   412    132.4 MiB      0.0 MiB           2           neighbors_model = NearestNeighbors(
   413    132.4 MiB      0.0 MiB           1               radius=self.eps,
   414    132.4 MiB      0.0 MiB           1               algorithm=self.algorithm,
   415    132.4 MiB      0.0 MiB           1               leaf_size=self.leaf_size,
   416    132.4 MiB      0.0 MiB           1               metric=self.metric,
   417    132.4 MiB      0.0 MiB           1               metric_params=self.metric_params,
   418    132.4 MiB      0.0 MiB           1               p=self.p,
   419    132.4 MiB      0.0 MiB           1               n_jobs=self.n_jobs,
   420                                                 )
   421    132.6 MiB      0.2 MiB           1           neighbors_model.fit(X)
   422                                                 # This has worst case O(n^2) memory complexity
   423    148.8 MiB     16.2 MiB           1           neighborhoods = neighbors_model.radius_neighbors(X, return_distance=False)
   424    148.8 MiB      0.0 MiB           1           if sample_weight is None:
   425    150.6 MiB      1.8 MiB      100001               n_neighbors = np.array([len(neighbors) for neighbors in neighborhoods])
   426                                                 else:
   427                                                     n_neighbors = np.array(
   428                                                         [np.sum(sample_weight[neighbors]) for neighbors in neighborhoods]
   429                                                     )
   430
   431                                                 # Initially, all samples are noise.
   432    150.6 MiB      0.0 MiB           1           labels = np.full(X.shape[0], -1, dtype=np.intp)
   433
   434                                                 # A list of all core samples found.
   435    150.6 MiB      0.0 MiB           1           core_samples = np.asarray(n_neighbors >= self.min_samples, dtype=np.uint8)
   436    150.6 MiB      0.0 MiB           1           dbscan_inner(core_samples, neighborhoods, labels)
   437
   438    150.6 MiB      0.0 MiB           1           self.core_sample_indices_ = np.where(core_samples)[0]
   439    150.6 MiB      0.0 MiB           1           self.labels_ = labels
   440
   441    150.6 MiB      0.0 MiB           1           if len(self.core_sample_indices_):
   442                                                     # fix for scipy sparse indexing issue
   443                                                     self.components_ = X[self.core_sample_indices_].copy()
   444                                                 else:
   445                                                     # no core samples
   446    150.6 MiB      0.0 MiB           1               self.components_ = np.empty((0, X.shape[1]))
   447    141.8 MiB     -8.8 MiB           1           del neighborhoods
   448    138.6 MiB     -3.2 MiB           1           del neighbors_model
   449    138.6 MiB      0.0 MiB           1           gc.collect()
   450    138.6 MiB      0.0 MiB           1           return self


github-actions bot commented Jun 11, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 52905fd.

Member

@adrinjalali left a comment

You need to fix the linting issues. You probably want to enable pre-commit hooks on your clone.

Comment on lines +448 to +450
del neighborhoods
del neighbors_model
gc.collect()
Member

Does this actually help? gc should be collecting these variables as soon as the process blows up in memory, which it seems to be doing nicely as shown here: #31407 (comment)

I don't see how this can help with the issue.

Author

Thanks for the review! You're right that gc eventually collects these objects. However, in practice (especially in notebook and long-running environments), Python doesn't immediately reclaim large memory blocks tied to deeply nested objects like neighborhoods, even after they go out of scope.

The issue being discussed in #31407 appears when .fit() is run on large datasets (e.g., 100K+ samples), and memory doesn't drop until the entire Python session is restarted — which is not expected behavior for a transient fit operation.

I observed a ~12 MiB drop immediately after .fit() completes.

  • Without del, memory holds steady around 150.6 MiB.
  • With del, it drops to 138.6 MiB right away — which is especially important when fitting multiple times in a loop.

So while it's not essential for correctness, this cleanup helps ensure memory is released deterministically — which aligns with the spirit of #31407 and improves memory behavior in production-like settings.

Member

The user can always either call gc.collect() in a loop, or call gc.set_threshold() to make automatic collections run more frequently.
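
Roughly, either option looks like this in user code (a sketch; the data and DBSCAN parameters are placeholders, and the threshold values are only illustrative):

import gc

import numpy as np
from sklearn.cluster import DBSCAN

# Option 1: lower the generation-0 threshold (default 700) so automatic
# collections run more frequently.
gc.set_threshold(100, 10, 10)

X = np.random.RandomState(0).uniform(size=(50_000, 2))

# Option 2: collect explicitly after each fit when looping in user code.
for _ in range(3):
    DBSCAN(eps=0.02, min_samples=5).fit(X)
    gc.collect()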

Another issue arises when running DBSCAN in a multithreaded environment: the behavior of calling gc.collect() while a collection is already running is undefined (https://docs.python.org/3/library/gc.html#gc.collect).

Maybe @jeremiedbb or @lesteve have more information here, but I'd be in favor of closing this PR.

As a solution to the original issue, we need to fix the low-level memory allocation in DBSCAN.

Author

Thanks for the detailed feedback and clarification — really appreciate you taking the time.

You're absolutely right that this doesn't fix the root cause of memory pressure in DBSCAN and may not be ideal in multi-threaded environments.

My intention with this PR was to offer a lightweight, short-term mitigation that helps in notebook or loop-heavy usage patterns — where repeated .fit() calls on large datasets (e.g., 100k+ samples) caused memory growth unless the process was restarted.

I understand now that the long-term fix needs to address lower-level memory behavior (possibly in how neighborhoods or radius neighbors are stored). I’ll read more on that and would love to help explore it if I can.

Happy to close this PR based on your recommendation — I learned a lot in the process and look forward to contributing more!

Member

@Tahseen23 if you can add a stand-alone snippet to reproduce the growing memory usage with multiple fits in #31407, that would already be useful.
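
Something along these lines could serve as a starting point (a sketch; it measures process RSS with psutil, and the dataset size and DBSCAN parameters are placeholders):

import gc
import os

import numpy as np
import psutil
from sklearn.cluster import DBSCAN

proc = psutil.Process(os.getpid())


def rss_mib():
    # Resident set size of the current process, in MiB.
    return proc.memory_info().rss / 2**20


X = np.random.RandomState(0).uniform(size=(100_000, 2))

print(f"baseline: {rss_mib():.1f} MiB")
for i in range(5):
    DBSCAN(eps=0.01, min_samples=5).fit(X)
    gc.collect()
    print(f"after fit {i + 1}: {rss_mib():.1f} MiB")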

Author

I ran two clean A/B tests comparing the original scikit-learn DBSCAN and a modified version with del neighborhoods, del neighbors_model, and gc.collect() at the end of .fit().

Tested with 100K samples and measured process memory using psutil.

Results:

Original DBSCAN:

  • Initial Memory: ~125.1 MiB
  • Peak Memory: ~128.6 MiB
  • Memory stable across iterations

Modified DBSCAN:

  • Initial Memory: ~131.6 MiB
  • Peak Memory: ~134.7 MiB
  • Manual cleanup (del, gc.collect()) did not reduce memory further

✅ This suggests that Python's GC is already releasing neighborhoods after .fit() ends, or that the extra memory cost in the modified version offsets the cleanup. I'll hold off on further changes until deeper structural fixes are discussed.
