[MRG+1] ENH Add working_memory global config for chunked operations #10280
Conversation
Commits:
- reverted the comment
- Resolved merge conflicts and add comments
- Also use threading for parallelism
I see. It's fine, it was just a side comment, not very critical to this PR... There is a minor conflict in what's new. LGTM. Now this would need a second review... maybe @lesteve or @qinhanmin2014? :)
LGTM
- A new configuration parameter, ``working_memory``, was added to control memory
  consumption limits in chunked operations, such as the new
  :func:`metrics.pairwise_distances_chunked`. See :ref:`working_memory`.
You seem to have forgotten the glossary entry.
I think a reference to the User Guide is most relevant in what's new. I can add a glossary entry, though I'm not sure how it will help beyond the user guide and the `config_context` docstring.
Fair enough, I just wonder what the goal of the `:ref:` syntax is, since it does not render a link. Did you mean `:func:set_config`? Or maybe you need a label in doc/modules/computational_performance.rst?
The latter. A glossary reference would be `:term:`, not `:ref:`, which references sections.
sklearn/metrics/pairwise.py
Outdated
    ``reduce_func``.

    Examples
    -------
You need one more dash to have proper rendering.
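The rule the reviewer is applying here is a reST/numpydoc convention, not code from this PR: a section underline must be at least as long as its heading to render. A minimal check (with a hypothetical helper name, `underline_ok`) illustrating it:

```python
# Sketch of the numpydoc underline rule: "Examples" has 8 characters,
# so its dashed underline needs at least 8 dashes to render properly.
def underline_ok(heading, underline):
    """True if the dashed underline is long enough for the heading."""
    return set(underline) == {"-"} and len(underline) >= len(heading)

print(underline_ok("Examples", "-------"))   # 7 dashes: too short
print(underline_ok("Examples", "--------"))  # 8 dashes: renders fine
```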
    assert isinstance(S_chunks, GeneratorType)
    S_chunks = list(S_chunks)
    assert len(S_chunks) > 1
    # atol is for diagonal where S is explcitly zeroed on the diagonal
*explicitly
    min_block_mib = np.array(X).shape[0] * 8 * 2 ** -20

    for block in blockwise_distances:
        memory_used = len(block) * 8
You should use `memory_used = block.size * 8` to get the correct memory used by the block.
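The distinction the reviewer is pointing at: for a 2-D NumPy array, `len()` counts rows only, while `.size` counts every element, so only `block.size * 8` reflects the bytes held by a float64 block.

```python
import numpy as np

# len() of a 2-D array is the length of its first axis (rows),
# while .size is the total number of elements.
block = np.zeros((3, 5))
print(len(block))                    # 3  (rows only)
print(block.size)                    # 15 (all elements)
print(block.size * block.itemsize)   # 120 bytes for float64
```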
Hmm... indeed!
    for block in blockwise_distances:
        memory_used = len(block) * 8
        assert memory_used <= min(working_memory, min_block_mib) * 2 ** 20
And the `min` should be a `max`, shouldn't it?
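Why `max` is the right bound: a chunk can never be smaller than one row, so when `working_memory` is below the single-row cost, the single row is emitted anyway and the bound must relax to the one-row size rather than tighten. A hypothetical helper (`chunk_rows` is an illustrative name, not the PR's actual code) sketches this:

```python
def chunk_rows(row_bytes, working_memory_mib):
    # Number of rows that fit in the budget, but never fewer than one:
    # when the budget is smaller than a single row, one row is emitted
    # anyway, so the true memory bound is max(budget, one-row cost).
    return max(int(working_memory_mib * 2 ** 20 // row_bytes), 1)

# With 8-byte float64 entries and 1000 columns, one row costs 8000 bytes.
print(chunk_rows(8000, 1))         # 131 rows fit in 1 MiB
print(chunk_rows(8000, 2 ** -16))  # budget < one row, still 1
```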
        metric='euclidean')
    # Test small amounts of memory
    for power in range(-16, 0):
        check_pairwise_distances_chunked(X, None, working_memory=2 ** power,
This line raises a lot of warnings:
`UserWarning: Could not adhere to working_memory config. Currently 0MiB, 1MiB required.`
We should silence them, as they are expected. Great that this is happening!
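One standard-library way to silence an expected warning inside a test (scikit-learn also has its own test helpers for this; the `noisy` function below is a stand-in, not code from the PR):

```python
import warnings

def noisy():
    # Stand-in for a call that emits the expected UserWarning.
    warnings.warn("Could not adhere to working_memory config.", UserWarning)
    return 42

with warnings.catch_warnings():
    # The UserWarning is expected here, so suppress it for this block only;
    # the filter is restored when the context manager exits.
    warnings.simplefilter("ignore", UserWarning)
    result = noisy()
print(result)  # 42, with no warning printed
```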
Thanks @TomDLT for the review and the approval! I've addressed all your comments except for the glossary one, which I'm not sure is necessary at this point.
I suppose we could include a new Global Configuration section of the glossary. But I'm not sure how that goes beyond the user guide and the `config_context` docstring.
Is there a plan to also use this in `pairwise_distances` in the future?
`pairwise_distances_chunked` is only useful if it can be reduced, so no, there's no point in including this in `pairwise_distances`. (But there might be particular distance functions that can be calculated in a chunked way to avoid n_samples * n_samples * n_features arrays, which may be what you are thinking of.)

If you would like to merge this, I'll happily post follow-ups to close more issues!
py3.6 fails ;)
Merging to enable downstream PRs. Let me know if there are further quibbles! Thanks for the reviews, Roman and Tom!
.. and also #11135 for chunking silhouette_score calculations
We often get issues related to memory consumption and don't deal with them particularly well. Indeed, Scikit-learn should be at home on commodity hardware like developer/researcher laptops.
Some operations can be performed chunked, so that the result is computed in constant (or O(n)) memory relative to some current O(n) (or O(n^2)) consumption. Examples include: getting the argmin and min of all pairwise distances (currently done with an ad-hoc parameter to `pairwise_distances_argmin_min`), calculating silhouette score (#1976), getting nearest neighbors with brute force (#7287), calculating the standard deviation of every feature (#5651).

It's not very helpful to provide this "how much constant memory" parameter in each function (because they're often called within nested code), so this PR instead makes a global config parameter of it. The optimisation is then transparent to the user, but still configurable.
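The idea above can be sketched without scikit-learn at all: compute pairwise-distance argmins row-chunk by row-chunk, sizing each chunk from a memory budget so that only an n_chunk x n_samples block is ever held, never the full matrix. This is plain NumPy under stated assumptions (float64, squared euclidean), not the PR's implementation:

```python
import numpy as np

def pairwise_argmin_chunked(X, Y, working_memory_mib=1.0):
    """Argmin over squared euclidean distances, one row-chunk at a time."""
    row_bytes = Y.shape[0] * 8  # one float64 distance row against all of Y
    n_rows = max(int(working_memory_mib * 2 ** 20 // row_bytes), 1)
    yy = (Y ** 2).sum(axis=1)
    out = np.empty(X.shape[0], dtype=np.intp)
    for start in range(0, X.shape[0], n_rows):
        chunk = X[start:start + n_rows]
        # (a - b)^2 = a^2 + b^2 - 2ab, computed blockwise so only an
        # (n_rows, len(Y)) distance block exists at any one time.
        d2 = (chunk ** 2).sum(axis=1)[:, None] + yy[None, :] - 2 * chunk @ Y.T
        out[start:start + n_rows] = d2.argmin(axis=1)
    return out

rng = np.random.RandomState(0)
X, Y = rng.rand(50, 4), rng.rand(30, 4)
idx = pairwise_argmin_chunked(X, Y, working_memory_mib=2 ** -12)
```

A global config would then let callers set `working_memory_mib` once instead of threading it through every function.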
At @rth's request, this PR has been cut back. The proposed changes to silhouette and neighbors can be seen here.
This PR (building upon my work with @dalmia) will therefore:

- add `set_config(working_memory=n_mib)`
- add `pairwise_distances_chunked`
- use it in nearest neighbors, silhouette and `pairwise_distances_argmin_min`
- deprecate `batch_size` in `pairwise_distances_argmin_min`

and thus:
TODO:

- `get_chunk_n_rows`, `_check_chunk_size` and any others for `pairwise_distances_chunked`
- check uses of `gen_batches` to see if they can use this parameter to be more principled in the choice of batch size: in most cases the batch size affects the model, so we can't just change it. And in other cases, it may make little difference. We could change mean_shift's use from fixed `batch_size=500` to something sensitive to `working_memory` if we wish.
- a `pairwise_distances_chunked` that uses `start` and a tuple return
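For reference, the batch-slicing helper mentioned above can be sketched from its documented behaviour (this mirrors `sklearn.utils.gen_batches` but is written here as an illustration, not copied from the library):

```python
def gen_batches(n, batch_size):
    """Yield slices covering range(n) in groups of at most batch_size."""
    for start in range(0, n, batch_size):
        yield slice(start, min(start + batch_size, n))

# Seven items in batches of three: (0, 3), (3, 6), (6, 7).
print([(s.start, s.stop) for s in gen_batches(7, 3)])
```

A working_memory-aware caller would derive `batch_size` from the memory budget and row size instead of hard-coding it.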