[MRG+1] ENH Add working_memory global config for chunked operations #10280

Merged
merged 105 commits into from
May 25, 2018
498ccce
Reverted the change, added regression test
Sentient07 Dec 26, 2015
3a4dd68
ENH block_size for memory efficiency in silhouette
jnothman Aug 11, 2016
d45bb78
Merge remote-tracking branch 'upstream/pr/6089' into silhouette-chunks
jnothman Aug 11, 2016
c6edfbb
DOC add versionadded to new parameter
jnothman Aug 11, 2016
3b726aa
FIX use bincount instead of np.add.at for old numpy
jnothman Aug 11, 2016
2301fde
ENH use unary pairwise_distances where possible
jnothman Aug 11, 2016
85d5971
DOC explicit block_size parameter in silhouette_score
jnothman Aug 11, 2016
53fa8d9
TST test silhouette_samples explicitly
jnothman Aug 11, 2016
6828646
ENH support n_jobs in silhouette_score
jnothman Aug 12, 2016
969eab3
DOC update block_size description given n_jobs
jnothman Aug 12, 2016
bfbde51
DOC docstring formatting
jnothman Aug 13, 2016
03a73ab
ENH specify silhouette block size in bytes
jnothman Aug 13, 2016
51640c0
block_size specified in MiB
jnothman Aug 14, 2016
eb4619d
document parameters to silhouette helper
jnothman Aug 16, 2016
71ac994
FIX pass n_jobs from silhouette_score
jnothman Aug 17, 2016
7cfcd43
ENH: Added template for pairwise_distances_blockwise with docstring c…
dalmia Dec 5, 2016
dfb99fc
ENH: added generator of blocks based on block_size
dalmia Dec 7, 2016
1e687d1
FIX: removed errors and extra value for metric
dalmia Dec 7, 2016
172e7f5
FIX: remove redundant variables
dalmia Dec 7, 2016
686d0d2
FIX: remove flake8 errors
dalmia Dec 8, 2016
c7de820
BUG: added fix for Y=None
dalmia Dec 8, 2016
0fb992f
FIX: remove whitespace
dalmia Dec 8, 2016
9b80491
FIX: fix typo
dalmia Dec 8, 2016
8e900e3
TST: added tests for pairwise_distances_blockwise
dalmia Dec 8, 2016
6be6ea2
FIX: removed errors and modified pairwise_distances_blockwise with tests
dalmia Dec 12, 2016
6d79bdd
FIX: fix typo
dalmia Dec 12, 2016
54ff9f5
WIP: support for nearest neighbors
dalmia Jan 5, 2017
8986965
ENH: passing arguments to reduce_func via partial
dalmia Jan 7, 2017
a072f11
FIX: remove true_distances as parameter
dalmia Jan 7, 2017
e0fb4c5
FIX: revert unintended change
dalmia Jan 7, 2017
508afae
FIX: convert float indices to int
dalmia Jan 7, 2017
b515105
FIX: removed debug lines
dalmia Jan 7, 2017
f1f7348
ENH: added pairwise_distances_reduce for radius_neighbors
dalmia Jan 8, 2017
d54def1
FIX: changed order of reduce_func
dalmia Feb 4, 2017
3e8adfc
FIX: get pairwise_distances_reduce to work correctly
dalmia Feb 6, 2017
4b8c7b2
FIX: remove flake8 errors
dalmia Feb 6, 2017
97486f6
FIX: rename reduce_func
dalmia Feb 7, 2017
722a9eb
FIX: return stacked distances from pairwise_distances_reduce
dalmia Feb 18, 2017
5ae169b
TST: added tests for pairwise_distances_reduce
dalmia Feb 18, 2017
1cd56a8
FIX: correct doctests
dalmia Feb 19, 2017
855ea0a
FIX: remove conflicting doctests for Python2 and Python3
dalmia Feb 19, 2017
6dd1d36
FEAT: add new file for flexible_vstack
dalmia Feb 20, 2017
d3f607d
FIX: resolve conflicts on tests
dalmia Feb 20, 2017
c373cee
FIX: remove block_size placeholders from neighbors
dalmia Feb 20, 2017
21ad2b1
FIX: use generator expressions
dalmia Feb 20, 2017
676c272
FIX: replace error on invalid block_size with warning
dalmia Feb 20, 2017
a3074be
FIX: replace error on invalid block_size with warning
dalmia Feb 20, 2017
3756a58
TST: check each components meets specified memory requirement
dalmia Feb 20, 2017
2ab293b
FIX: remove PEP8 errors
dalmia Feb 20, 2017
84d34e3
ENH: move flexible_vstack to __init__
dalmia Feb 20, 2017
6d49c12
TST: add tests for flexible_vstack
dalmia Feb 20, 2017
b3fb795
DOC: improve docstring for
dalmia Feb 20, 2017
f3d3a1a
FIX: correct X, y for Python3
dalmia Feb 20, 2017
f901d7e
ENH: rewrote pairwise_distances_argmin_min using pairwise_distances_r…
dalmia Feb 24, 2017
76b1a75
Merge branch 'master' into blockwise
jnothman Dec 10, 2017
8089e98
Merge branch 'silhouette-chunks' into blockwise
jnothman Dec 10, 2017
c48e9a1
[WIP] ENH Add working_memory global config for chunked operations
jnothman Dec 10, 2017
82bc06a
Add to classes.rst
jnothman Dec 10, 2017
ec31fad
Add missing module
jnothman Dec 10, 2017
a56e002
Remove obsolete TODOs
jnothman Dec 10, 2017
cb35271
Pass final_len to flexible_vstack
jnothman Dec 10, 2017
dc1f544
Renaming and removing obsolete code
jnothman Dec 10, 2017
da5a6c7
Update test_config
jnothman Dec 10, 2017
7073032
Fix test import
jnothman Dec 10, 2017
b29ff76
Tweaks
jnothman Dec 10, 2017
e71a461
Remove debug print
jnothman Dec 10, 2017
055a9ef
Remove unused import
jnothman Dec 10, 2017
d391505
Remove flexible_vstack to reduce magic
jnothman Dec 10, 2017
86f0321
Improve reduce_func description
jnothman Dec 10, 2017
149eb8f
fix flake
jnothman Dec 11, 2017
c8afdb8
Block -> chunks; generate_chunks helper
jnothman Dec 11, 2017
a174313
DOC
jnothman Dec 11, 2017
16aabd5
Remove redundant code
jnothman Dec 11, 2017
3143018
Document working_memory config
jnothman Dec 11, 2017
a21794d
Use existing gen_batches
jnothman Dec 11, 2017
60e01e3
TST test_get_chunk_n_rows
jnothman Dec 11, 2017
0a79c39
DOC What's new
jnothman Dec 11, 2017
df94645
TST improve pairwise_distances_chunked testing
jnothman Dec 11, 2017
a2b2b0a
Remove unused import
jnothman Dec 11, 2017
e97ef8b
Try fix tests for Python 2
jnothman Dec 11, 2017
26cf342
Fix appveyor failure
jnothman Dec 11, 2017
af1923c
Merge branch 'master' into working_memory
jnothman Dec 11, 2017
011021d
Respond to Roman
jnothman Dec 13, 2017
87d578f
Merge branch 'working_memory' of github.com:jnothman/scikit-learn int…
jnothman Dec 13, 2017
ca536f6
See also
jnothman Dec 13, 2017
be8678b
Merge branch 'master' into working_memory
jnothman Jan 8, 2018
57875cf
use dict for global config
jnothman Jan 8, 2018
06e8413
Minor responses to Roman
jnothman Jan 8, 2018
7d45f17
Add pairwise_distances_chunked examples
jnothman Jan 9, 2018
214784d
Remove junk in docs
jnothman Jan 9, 2018
75a2eab
Illustrate actual chunking
jnothman Jan 9, 2018
f8badad
Fix up pairwise_distances_argmin arg ordering
jnothman Jan 9, 2018
29a4c64
More nuanced comment on memory-speed tradeoffs
jnothman Jan 9, 2018
73e3b33
Increase default working memory to 1GiB
jnothman Jan 14, 2018
b70e087
Undo changes to neighbors and silhouette
jnothman Feb 11, 2018
977bb17
Merge branch 'master' into working_memory
jnothman Feb 11, 2018
4c1dc1e
Merge branch 'master' into working_memory
jnothman Feb 13, 2018
ac3b422
In response to Roman's comments
jnothman Feb 27, 2018
6b9de27
Merge branch 'master' into working_memory
jnothman May 23, 2018
cf284d7
Typo fixes
jnothman May 23, 2018
8597181
Correct the tests
jnothman May 23, 2018
3240e0f
Add missing ref target
jnothman May 23, 2018
d7c04af
Update array formatting in doctest
jnothman May 23, 2018
24d1801
Remove more whitespace in doctest
jnothman May 24, 2018
6252941
Remove more whitespace in doctest
jnothman May 24, 2018
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -955,6 +955,7 @@ See the :ref:`metrics` section of the user guide for further details.
   metrics.pairwise_distances
   metrics.pairwise_distances_argmin
   metrics.pairwise_distances_argmin_min
   metrics.pairwise_distances_chunked


.. _mixture_ref:
21 changes: 21 additions & 0 deletions doc/modules/computational_performance.rst
@@ -308,6 +308,27 @@ Debian / Ubuntu.
or upgrade to Python 3.4 which has a new version of ``multiprocessing``
that should be immune to this problem.

.. _working_memory:

Limiting Working Memory
-----------------------

Some calculations, when implemented using standard NumPy vectorized operations,
use a large amount of temporary memory, which may exhaust system memory. Where
computations can be performed in fixed-memory chunks, we attempt to do so, and
allow the user to hint at the maximum size of this working memory (defaulting
to 1 GiB, i.e. 1024 MiB) using :func:`sklearn.set_config` or
:func:`sklearn.config_context`. The following suggests limiting temporary
working memory to 128 MiB::

  >>> import sklearn
  >>> with sklearn.config_context(working_memory=128):
  ...     pass  # do chunked work here

An example of a chunked operation adhering to this setting is
:func:`metrics.pairwise_distances_chunked`, which facilitates computing
row-wise reductions of a pairwise distance matrix.
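
The paragraph above can be illustrated with a minimal sketch. It assumes the ``pairwise_distances_chunked(X, reduce_func=...)`` call added by this PR, where ``reduce_func`` receives a chunk of the distance matrix and the index of its first row. The data, the helper name ``nearest_other_row`` and the 1 MiB hint are illustrative choices, not part of the PR.

```python
import numpy as np
import sklearn
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(2000, 5)  # the full distance matrix would need 2000 * 2000 * 8 bytes


def nearest_other_row(D_chunk, start):
    """Reduce one chunk of rows to the index of each row's nearest other row.

    D_chunk holds the distances from rows start, start + 1, ... of X
    to every row of X. The name of this helper is hypothetical.
    """
    D_chunk = D_chunk.copy()  # do not modify the array we were handed
    rows = np.arange(D_chunk.shape[0])
    D_chunk[rows, start + rows] = np.inf  # mask each row's distance to itself
    return D_chunk.argmin(axis=1)


# The generator is lazy, so it is consumed inside the context manager
# to make sure the 1 MiB hint is the setting in effect while it runs.
with sklearn.config_context(working_memory=1):
    reduced_chunks = list(
        pairwise_distances_chunked(X, reduce_func=nearest_other_row)
    )

nearest = np.concatenate(reduced_chunks)
print(len(reduced_chunks), "chunks,", nearest.shape[0], "rows reduced")
```

Only the per-chunk reductions are kept, so peak temporary memory is governed by the chunk size rather than by the full 2000 x 2000 matrix.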

Model Compression
-----------------

16 changes: 16 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -56,6 +56,9 @@ Classifiers and regressors
- :class:`dummy.DummyRegressor` now has a ``return_std`` option in its
  ``predict`` method. The returned standard deviations will be zeros.

- Added :class:`multioutput.RegressorChain` for multi-target
  regression. :issue:`9257` by :user:`Kumar Ashutosh <thechargedneutron>`.

- Added :class:`naive_bayes.ComplementNB`, which implements the Complement
  Naive Bayes classifier described in Rennie et al. (2003).
  :issue:`8190` by :user:`Michael A. Alcorn <airalcorn2>`.
@@ -115,6 +118,13 @@ Metrics
  :func:`metrics.roc_auc_score`. :issue:`3273` by
  :user:`Alexander Niederbühl <Alexander-N>`.

Misc

- A new configuration parameter, ``working_memory``, was added to control memory
  consumption limits in chunked operations, such as the new
  :func:`metrics.pairwise_distances_chunked`. See :ref:`working_memory`.
  :issue:`10280` by `Joel Nothman`_ and :user:`Aman Dalmia <dalmia>`.

Enhancements
............

@@ -521,6 +531,12 @@ Metrics
  due to floating point error in the input.
  :issue:`9851` by :user:`Hanmin Qin <qinhanmin2014>`.

- The ``batch_size`` parameter to :func:`metrics.pairwise_distances_argmin_min`
  and :func:`metrics.pairwise_distances_argmin` is deprecated and will be
  removed in v0.22. It no longer has any effect, as batch size is determined by
  the global ``working_memory`` config. See :ref:`working_memory`.
  :issue:`10280` by `Joel Nothman`_ and :user:`Aman Dalmia <dalmia>`.
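
To make "batch size is determined by ``working_memory``" concrete, here is a rough arithmetic sketch. The helper ``rows_per_chunk`` is hypothetical and is not the library's own helper, which may round or warn differently. It assumes a dense float64 distance chunk (8 bytes per entry) and clamps the result to at least one row.

```python
MIB = 2 ** 20  # bytes in one MiB


def rows_per_chunk(n_columns, working_memory_mib=1024, itemsize=8):
    """Hypothetical helper: how many rows of a dense chunk fit in the budget.

    n_columns is the number of entries per row (for pairwise distances,
    the number of samples in Y); itemsize=8 assumes float64.
    """
    row_bytes = n_columns * itemsize
    n_rows = (working_memory_mib * MIB) // row_bytes
    return max(1, n_rows)  # assumption of this sketch: never fewer than one row


# Distances to 100,000 samples: each row takes 800,000 bytes.
print(rows_per_chunk(100_000))                          # 1342 with the 1024 MiB default
print(rows_per_chunk(100_000, working_memory_mib=128))  # 167 with a 128 MiB hint
```

Because the chunk height now follows from the row width and the memory budget, a fixed ``batch_size`` has nothing left to control.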

Cluster

- Deprecate ``pooling_func`` unused parameter in
17 changes: 16 additions & 1 deletion sklearn/_config.py
@@ -5,6 +5,7 @@

_global_config = {
    'assume_finite': bool(os.environ.get('SKLEARN_ASSUME_FINITE', False)),
    'working_memory': int(os.environ.get('SKLEARN_WORKING_MEMORY', 1024))
}
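
The defaults above are read from the environment when the module is imported. A small standalone sketch of the same ``int(os.environ.get(...))`` pattern follows. The function ``read_working_memory`` is hypothetical and takes the environment as an argument so it can be tried without changing real environment variables.

```python
import os


def read_working_memory(environ=os.environ):
    """Hypothetical stand-in for the module-level default in the diff."""
    # Environment values are strings, so int() converts e.g. "256" to 256;
    # when the variable is unset, the integer fallback 1024 is used as-is.
    return int(environ.get('SKLEARN_WORKING_MEMORY', 1024))


print(read_working_memory({}))                                 # 1024
print(read_working_memory({'SKLEARN_WORKING_MEMORY': '256'}))  # 256

# Caveat of the bool(...) pattern used for assume_finite: any non-empty
# string is truthy, so a value of "0" would still evaluate to True.
print(bool('0'))  # True
```

Since the value is captured at import time in this pattern, changing the variable afterwards would not alter the stored default; ``set_config`` or ``config_context`` is the route for that.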


@@ -19,7 +20,7 @@ def get_config():
    return _global_config.copy()


def set_config(assume_finite=None):
def set_config(assume_finite=None, working_memory=None):
    """Set global scikit-learn configuration

    Parameters
@@ -29,9 +30,17 @@ def set_config(assume_finite=None):
        saving time, but leading to potential crashes. If
        False, validation for finiteness will be performed,
        avoiding error. Global default: False.

    working_memory : int, optional
        If set, scikit-learn will attempt to limit the size of temporary arrays
        to this number of MiB (per job when parallelised), often saving both
        computation time and memory on expensive operations that can be
        performed in chunks. Global default: 1024.
    """
    if assume_finite is not None:
        _global_config['assume_finite'] = assume_finite
    if working_memory is not None:
        _global_config['working_memory'] = working_memory
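
A short usage sketch of the setter above, together with ``get_config`` and ``config_context`` from the same module. The specific values 256 and 64 are arbitrary; the sketch restores the previous setting at the end so it leaves the global state as it found it.

```python
import sklearn

previous = sklearn.get_config()['working_memory']

# Set the global hint to 256 MiB.
sklearn.set_config(working_memory=256)

# Passing None (the default) leaves the current value untouched.
sklearn.set_config(working_memory=None)
unchanged = sklearn.get_config()['working_memory']

# config_context overrides the value only for the duration of the block.
with sklearn.config_context(working_memory=64):
    inside = sklearn.get_config()['working_memory']
after = sklearn.get_config()['working_memory']

print(unchanged, inside, after)  # 256 64 256

# Put the original value back.
sklearn.set_config(working_memory=previous)
```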


@contextmanager
@@ -46,6 +55,12 @@ def config_context(**new_config):
        False, validation for finiteness will be performed,
        avoiding error. Global default: False.

    working_memory : int, optional
        If set, scikit-learn will attempt to limit the size of temporary arrays
        to this number of MiB (per job when parallelised), often saving both
        computation time and memory on expensive operations that can be
        performed in chunks. Global default: 1024.

    Notes
    -----
    All settings, not just those presently modified, will be returned to
2 changes: 2 additions & 0 deletions sklearn/metrics/__init__.py
@@ -51,6 +51,7 @@
from .pairwise import pairwise_distances_argmin
from .pairwise import pairwise_distances_argmin_min
from .pairwise import pairwise_kernels
from .pairwise import pairwise_distances_chunked

from .regression import explained_variance_score
from .regression import mean_absolute_error
@@ -106,6 +107,7 @@
    'pairwise_distances_argmin',
    'pairwise_distances_argmin_min',
    'pairwise_distances_argmin_min',
    'pairwise_distances_chunked',
    'pairwise_kernels',
    'precision_recall_curve',
    'precision_recall_fscore_support',