[MRG] Block-wise silhouette calculation to avoid memory consumption #7177
Conversation
reverted the comment
Resolved merge conflicts
Benchmarks with:

    import numpy as np
    from sklearn.metrics.cluster import silhouette_score

    n_features = 100
    for n_samples in [10000, 1000, 100]:
        for n_labels in [4, 6, 100]:
            print('n_features', n_features, 'n_samples', n_samples,
                  'n_labels', n_labels)
            X = np.random.rand(n_samples, n_features)
            y = np.random.randint(n_labels, size=n_samples)
            %timeit silhouette_score(X, y)  # IPython timing magic

defaulting to
and add comments
@@ -56,6 +60,18 @@ def silhouette_score(X, labels, metric='euclidean', sample_size=None,
        <sklearn.metrics.pairwise.pairwise_distances>`. If X is the distance
        array itself, use ``metric="precomputed"``.

    block_size : int, optional
        The number of rows to process at a time to limit memory usage to
        O(block_size * n_samples). Default is n_samples.
nitpick: Double backticks
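As an aside for readers: a minimal sketch of the block-wise pattern the block_size parameter describes, assuming block_size counts rows as documented in the diff above. The helper name blockwise_row_means and the row-means computation are illustrative only, not the PR's code.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def blockwise_row_means(X, block_size=500, metric='euclidean'):
    """Mean distance from each sample to all others, block_size rows at a time."""
    n_samples = X.shape[0]
    means = np.empty(n_samples)
    for start in range(0, n_samples, block_size):
        stop = min(start + block_size, n_samples)
        # Only a (block_size, n_samples) slice of the distance matrix is
        # ever materialised, i.e. O(block_size * n_samples) memory.
        D = pairwise_distances(X[start:stop], X, metric=metric)
        means[start:stop] = D.sum(axis=1) / (n_samples - 1)
    return means
```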
Instead of allowing the user to specify size as a unit of memory usage, I am wondering if
I've now tried to just have the user specify the pairwise distance memory consumption in bytes...
If it needs to be in memory units, could we have it in MB, as users will then need to enter a smaller number?
It doesn't need to be, but I think it is a more practical, data-invariant measure. Yes, we can have it in MiB.
@raghavrv now in MiB
@jnothman I'm a little bit late, but +1 for having it as memory usage. I think it's an easier measure to specify. And also +1 for MiB.
You're earlier than everyone but @raghavrv. Thanks for the input. That's where it's converged anyway.
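To make the MiB convention concrete, here is a rough sketch of the conversion under discussion. The helper name rows_per_block, BYTES_PER_FLOAT, and the exact rounding are assumptions for illustration, not the PR's code.

```python
BYTES_PER_FLOAT = 8  # float64 pairwise distances

def rows_per_block(block_size_mib, n_samples):
    # One row of the pairwise distance matrix costs n_samples * 8 bytes,
    # so a block_size_mib budget holds this many whole rows.
    n_rows = int(block_size_mib * 2 ** 20 // (n_samples * BYTES_PER_FLOAT))
    if n_rows < 1:
        raise ValueError('block_size should be at least n_samples * 8 bytes')
    return n_rows

print(rows_per_block(64, 10000))  # a 64 MiB budget for 10000 samples: 838 rows
```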
@jnothman I'll do a review in a few moments.
def _process_block(X, labels, start, block_n_rows, block_range, add_at,
                   label_freqs, metric, kwds):
Just a nitpick: _process_block -> _silhouette_process_block.
I know it's a private function, but it might be easier to understand if you added a little docstring describing the arguments.
Fair enough. Will do
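Something along these lines might do; the parameter descriptions below are guesses based on the names in the signature above, not taken from the PR.

```python
def _silhouette_process_block(X, labels, start, block_n_rows, block_range,
                              add_at, label_freqs, metric, kwds):
    """Accumulate per-cluster distance sums for one block of rows.

    Parameter descriptions are guesses from the argument names:
    X : data, or precomputed distance matrix
    labels : integer cluster label of each sample
    start : index of the first row in this block
    block_n_rows : number of rows handled per block
    block_range : preallocated arange(block_n_rows) used for fancy indexing
    add_at : accumulation helper (np.add.at or a fallback)
    label_freqs : number of samples in each cluster
    metric, kwds : passed through to pairwise_distances
    """
```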
The performance is similar on my computer.
y = [1, 1, 2]
assert_raise_message(ValueError, 'block_size should be at least n_samples '
                     '* 8 bytes = 1 MiB, got 0',
                     silhouette_score, X, y, block_size=0)
Should we instead have block_size=0 denote the "use all the memory" setting?
Or -1, like n_jobs?
For what benefit? The default of 64MB will run a 2896-sample problem or smaller in a single block. For problems much larger than that, you're likely to benefit from splitting the problem up, as suggested by our benchmark, which shows a 2x speedup over "use all memory" for a dataset less than 4x that size (and >9x the number of pairwise calculations). Yes, this is only my machine, but it's hard to imagine why we should suggest to the user to use all the memory possible.
Not dissimilar to n_jobs=-2 often being a better choice than n_jobs=-1.
Why not automatically select block_size as min(64MB, np.ceil(n_samples * BYTES_PER_FLOAT * 2 ** -20)) and not let the user choose a specific value?
That would be identical to just setting it to 64, no? That's tempting, especially because this isn't a learning routine. I don't expect my benchmarks to be optimal, certainly not for any particular platform, or with n_jobs != 1 (using the default or some other parallel backend); hence my inclination to keep it public. I'll have to chew on this.

I'd appreciate another opinion on this: should we make a memory efficiency parameter a constant, given that we know the constant is sensible on some modern platforms (and hard to explain to the user), or directly configurable by the user? In the context of a learning algorithm, configurability seems more valuable, but for an evaluation metric, is it sensible to just say that 64 MiB is the max pairwise distance evaluation available? Also, should this be

Not sure who to ask: @GaelVaroquaux, @MechCoder, @amueller?
Also use threading for parallelism
Fixed a problem where
Sorry for piggybacking on this conversation... How do you usually decide definitively between multiprocessing and threading? As joblib memmaps the data, I don't see why multiprocessing is any worse than threading, given that it is usually a bit faster than threading (for larger datasets) and much faster when the GIL is not released... Or am I wrong?
I don't have any good answers for you, @raghavrv, but for the scale of problem and processor I've tried, multiprocessing always results in slower times. AFAIK, threading is a good option when there's a lot of work in non-GIL operations, and when copying the inputs and outputs competes with the calculation for cost. I'm considering just dropping
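For reference, a sketch of how the two joblib backends could be compared on this kind of block-wise work; the data shapes and the block_row_sums helper are made up for illustration, and the timing conclusions will of course depend on the platform.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances

X = np.random.rand(5000, 100)
blocks = [slice(i, i + 500) for i in range(0, 5000, 500)]

def block_row_sums(X, sl):
    # pairwise_distances can spend much of its time outside the GIL,
    # which is what makes the threading backend viable here.
    return pairwise_distances(X[sl], X).sum(axis=1)

# threading: workers share X; no pickling of inputs or outputs.
res_threads = Parallel(n_jobs=2, backend='threading')(
    delayed(block_row_sums)(X, sl) for sl in blocks)

# multiprocessing: each task ships X (pickled or memmapped) to a worker,
# which is the copying overhead mentioned above. On platforms that spawn
# rather than fork, wrap this in `if __name__ == '__main__':`.
res_procs = Parallel(n_jobs=2, backend='multiprocessing')(
    delayed(block_row_sums)(X, sl) for sl in blocks)
```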
@jnothman I think for blocked computations, we set "reasonable" defaults and allow configuration, IIRC. Do you want a review on this? It doesn't seem high priority to me. [OT: why are you using this score?]
I'd like a review on this, but no, I'm not using silhouette personally. I worked on it because an issue came in, and because I wanted to ascertain that #1976 could be improved upon. I also think we should be reviewing the potential blockwise computations everywhere that
To be sure, it's not high priority, except that it closes a few issues and avoids more appearing over the next release.
Also, #5988 should be fixed.
Could you refactor #7438 out of this PR and rebase? I can try and take a deeper look at this before the weekend :)
@jnothman Ping :)
Superseding #1976, which breaks the problem down by cluster, here we simply break the input down by fixed-size blocks of rows. Reports of memory issues when calculating silhouette include #7175 and the mailing list. Also incorporates #6089's fix for #5988, and adds a test for silhouette_samples self-plagiarised from #4087.

Potential enhancements:

- np.ufunc.at in numpy<1.8
- silhouette_samples (none exists!); could perhaps be a separate PR...
- block_size by target memory usage, e.g. block_size='1G'. Probably out of scope of this PR.
- block_size=100 or block_size='1G' or some other sensible constant as the default to avoid memory errors with default parameters, and to improve default speed.
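For readers, a condensed sketch of the block-wise silhouette computation described above, using np.add.at for the per-cluster accumulation (the first enhancement listed concerns numpy<1.8, where np.ufunc.at is unavailable). Names and details here are illustrative assumptions rather than the PR's code, and singleton-cluster handling in particular is simplified.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import LabelEncoder

def silhouette_samples_blockwise(X, labels, block_n_rows=500, metric='euclidean'):
    labels = LabelEncoder().fit_transform(labels)
    n_samples = len(labels)
    label_freqs = np.bincount(labels)
    sil = np.empty(n_samples)
    for start in range(0, n_samples, block_n_rows):
        stop = min(start + block_n_rows, n_samples)
        # O(block_n_rows * n_samples) distances in memory at a time.
        D = pairwise_distances(X[start:stop], X, metric=metric)
        # clust_dists[r, l] = total distance from block row r to cluster l,
        # accumulated with unbuffered fancy indexing.
        clust_dists = np.zeros((stop - start, len(label_freqs)))
        np.add.at(clust_dists.T, labels, D.T)
        block_labels = labels[start:stop]
        rows = np.arange(stop - start)
        # a: mean distance to the point's own cluster (self-distance is 0) ...
        a = clust_dists[rows, block_labels] / np.maximum(
            label_freqs[block_labels] - 1, 1)
        # ... b: mean distance to the nearest other cluster.
        clust_dists[rows, block_labels] = np.inf
        b = (clust_dists / label_freqs).min(axis=1)
        sil[start:stop] = (b - a) / np.maximum(a, b)
    return np.nan_to_num(sil)

X = np.random.rand(2000, 10)
y = np.random.randint(4, size=2000)
print(silhouette_samples_blockwise(X, y).mean())
```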