[MRG+1] ENH Add working_memory global config for chunked operations #10280
Conversation
Commits:
- reverted the comment
- Resolved merge conflicts and add comments
- Also use threading for parallelism
I see. It's fine, it was just a side comment, not very critical to this PR... There is a minor conflict in what's new. LGTM. Now this would need a second review... maybe @lesteve or @qinhanmin2014? :)
LGTM
- A new configuration parameter, ``working_memory``, was added to control memory
  consumption limits in chunked operations, such as the new
  :func:`metrics.pairwise_distances_chunked`. See :ref:`working_memory`.
You seem to have forgotten the glossary entry.
I think a reference to the User Guide is most relevant in what's new. I can add a glossary entry, though I'm not sure how it will help beyond the user guide and the `config_context` docstring.
Fair enough, I just wonder what the goal of the `:ref:` syntax is, since it does not render a link. Did you mean `:func:set_config`? Or maybe you need a label in doc/modules/computational_performance.rst?
The latter. A glossary reference would be `:term:`, not `:ref:`, which references sections.
sklearn/metrics/pairwise.py
Outdated
    ``reduce_func``.

    Examples
    -------
You need one more dash to have proper rendering.
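The rule the reviewer is applying here is a reST/numpydoc convention, not code from this PR: a section underline must be at least as long as its heading to render. A minimal check (with a hypothetical helper name, `underline_ok`) illustrating it:

```python
# Sketch of the numpydoc underline rule: "Examples" has 8 characters,
# so its dashed underline needs at least 8 dashes to render properly.
def underline_ok(heading, underline):
    """True if the dashed underline is long enough for the heading."""
    return set(underline) == {"-"} and len(underline) >= len(heading)

print(underline_ok("Examples", "-------"))   # 7 dashes: too short
print(underline_ok("Examples", "--------"))  # 8 dashes: renders fine
```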
    assert isinstance(S_chunks, GeneratorType)
    S_chunks = list(S_chunks)
    assert len(S_chunks) > 1
    # atol is for diagonal where S is explcitly zeroed on the diagonal
*explicitly
    min_block_mib = np.array(X).shape[0] * 8 * 2 ** -20

    for block in blockwise_distances:
        memory_used = len(block) * 8
You should use `memory_used = block.size * 8` to get the correct memory used by the block.
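The distinction the reviewer is pointing at: for a 2-D NumPy array, `len()` counts rows only, while `.size` counts every element, so only `block.size * 8` reflects the bytes held by a float64 block.

```python
import numpy as np

# len() of a 2-D array is the length of its first axis (rows),
# while .size is the total number of elements.
block = np.zeros((3, 5))
print(len(block))                    # 3  (rows only)
print(block.size)                    # 15 (all elements)
print(block.size * block.itemsize)   # 120 bytes for float64
```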
Hmm... indeed!
    for block in blockwise_distances:
        memory_used = len(block) * 8
        assert memory_used <= min(working_memory, min_block_mib) * 2 ** 20
And the `min` should be a `max`, shouldn't it?
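Why `max` is the right bound: a chunk can never be smaller than one row, so when `working_memory` is below the single-row cost, the single row is emitted anyway and the bound must relax to the one-row size rather than tighten. A hypothetical helper (`chunk_rows` is an illustrative name, not the PR's actual code) sketches this:

```python
def chunk_rows(row_bytes, working_memory_mib):
    # Number of rows that fit in the budget, but never fewer than one:
    # when the budget is smaller than a single row, one row is emitted
    # anyway, so the true memory bound is max(budget, one-row cost).
    return max(int(working_memory_mib * 2 ** 20 // row_bytes), 1)

# With 8-byte float64 entries and 1000 columns, one row costs 8000 bytes.
print(chunk_rows(8000, 1))         # 131 rows fit in 1 MiB
print(chunk_rows(8000, 2 ** -16))  # budget < one row, still 1
```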
        metric='euclidean')
    # Test small amounts of memory
    for power in range(-16, 0):
        check_pairwise_distances_chunked(X, None, working_memory=2 ** power,
This line raises a lot of warnings:
`UserWarning: Could not adhere to working_memory config. Currently 0MiB, 1MiB required.`
We should silence them, as they are expected. Great that this is happening!
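One standard-library way to silence an expected warning inside a test (scikit-learn also has its own test helpers for this; the `noisy` function below is a stand-in, not code from the PR):

```python
import warnings

def noisy():
    # Stand-in for a call that emits the expected UserWarning.
    warnings.warn("Could not adhere to working_memory config.", UserWarning)
    return 42

with warnings.catch_warnings():
    # The UserWarning is expected here, so suppress it for this block only;
    # the filter is restored when the context manager exits.
    warnings.simplefilter("ignore", UserWarning)
    result = noisy()
print(result)  # 42, with no warning printed
```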
Thanks @TomDLT for the review and the approval! I've addressed all your comments except for the glossary one, which I'm not sure is necessary at this point.
I suppose we could include a new Global Configuration section of the glossary. But I'm not sure how that goes beyond the user guide and the `config_context` docstring.
Is there a plan to also use this in `pairwise_distances` in the future?
`pairwise_distances_chunked` is only useful if it can be reduced, so no, there's no point in including this in `pairwise_distances`. (But there might be particular distance functions that can be calculated in a chunked way to avoid n_samples * n_samples * n_features arrays, which may be what you are thinking of.)

If you would like to merge this, I'll happily post follow-ups to close more issues!
py3.6 fails ;)
Merging to enable downstream PRs. Let me know if there are further quibbles! Thanks for the reviews, Roman and Tom!
.. and also #11135 for chunking silhouette_score calculations
We often get issues related to memory consumption and don't deal with them particularly well. Indeed, Scikit-learn should be at home on commodity hardware like developer/researcher laptops.
Some operations can be performed chunked, so that the result is computed in constant (or O(n)) memory relative to some current O(n) (or O(n^2)) consumption. Examples include: getting the argmin and min of all pairwise distances (currently done with an ad-hoc parameter to `pairwise_distances_argmin_min`), calculating silhouette score (#1976), getting nearest neighbors with brute force (#7287), calculating the standard deviation of every feature (#5651).

It's not very helpful to provide this "how much constant memory" parameter in each function (because they're often called within nested code), so this PR instead makes a global config parameter of it. The optimisation is then transparent to the user, but still configurable.
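The idea above can be sketched without scikit-learn at all: compute pairwise-distance argmins row-chunk by row-chunk, sizing each chunk from a memory budget so that only an n_chunk x n_samples block is ever held, never the full matrix. This is plain NumPy under stated assumptions (float64, squared euclidean), not the PR's implementation:

```python
import numpy as np

def pairwise_argmin_chunked(X, Y, working_memory_mib=1.0):
    """Argmin over squared euclidean distances, one row-chunk at a time."""
    row_bytes = Y.shape[0] * 8  # one float64 distance row against all of Y
    n_rows = max(int(working_memory_mib * 2 ** 20 // row_bytes), 1)
    yy = (Y ** 2).sum(axis=1)
    out = np.empty(X.shape[0], dtype=np.intp)
    for start in range(0, X.shape[0], n_rows):
        chunk = X[start:start + n_rows]
        # (a - b)^2 = a^2 + b^2 - 2ab, computed blockwise so only an
        # (n_rows, len(Y)) distance block exists at any one time.
        d2 = (chunk ** 2).sum(axis=1)[:, None] + yy[None, :] - 2 * chunk @ Y.T
        out[start:start + n_rows] = d2.argmin(axis=1)
    return out

rng = np.random.RandomState(0)
X, Y = rng.rand(50, 4), rng.rand(30, 4)
idx = pairwise_argmin_chunked(X, Y, working_memory_mib=2 ** -12)
```

A global config would then let callers set `working_memory_mib` once instead of threading it through every function.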
At @rth's request, this PR has been cut back. The proposed changes to silhouette and neighbors can be seen here.
This PR (building upon my work with @dalmia) will therefore:

- add `set_config(working_memory=n_mib)`
- add `pairwise_distances_chunked`
- use it in nearest neighbors, silhouette and `pairwise_distances_argmin_min`
- deprecate `batch_size` in `pairwise_distances_argmin_min`

and thus:
TODO:

- `get_chunk_n_rows`, `_check_chunk_size` and any others for `pairwise_distances_chunked`
- check uses of `gen_batches` to see if they can use this parameter to be more principled in the choice of batch size: in most cases the batch size affects the model, so we can't just change it. And in other cases, it may make little difference. We could change mean_shift's use from fixed `batch_size=500` to something sensitive to `working_memory` if we wish.
- a `pairwise_distances_chunked` that uses `start` and a tuple return
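For reference, the batch-slicing helper mentioned above can be sketched from its documented behaviour (this mirrors `sklearn.utils.gen_batches` but is written here as an illustration, not copied from the library):

```python
def gen_batches(n, batch_size):
    """Yield slices covering range(n) in groups of at most batch_size."""
    for start in range(0, n, batch_size):
        yield slice(start, min(start + batch_size, n))

# Seven items in batches of three: (0, 3), (3, 6), (6, 7).
print([(s.start, s.stop) for s in gen_batches(7, 3)])
```

A working_memory-aware caller would derive `batch_size` from the memory budget and row size instead of hard-coding it.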