
[MRG] Block-wise silhouette calculation to avoid memory consumption #7177


Closed
jnothman wants to merge 15 commits

Conversation

jnothman
Member

@jnothman jnothman commented Aug 11, 2016

Superseding #1976, which breaks the problem down by cluster: here we simply break the input down into fixed-size blocks of rows. Reports of memory issues when calculating silhouette include #7175 and the mailing list. Also incorporates #6089's fix for #5988, and adds a test for silhouette_samples self-plagiarised from #4087.

potential enhancements:

  • work out what to do to avoid using np.ufunc.at in numpy<1.8
  • the minor TODOs mentioned in the code
  • benchmark to ensure this is not much slower than the incumbent implementation; see the benchmarks below.
  • a test for silhouette_samples (none exists!); could perhaps be a separate PR...
  • support parallelised computation over blocks?; could be a separate PR
  • allow users to specify block_size by target memory usage, e.g. block_size='1G'. Probably out of scope of this PR.
  • use block_size=100 or block_size='1G' or some other sensible constant as the default to avoid memory errors with default parameters, and to improve default speed.
  • add more inline comments
  • add to what's new
  • squash carefully to ensure attribution of [MRG] Fix regression in silhouette_score for clusters of size 1 #6089
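
For orientation, here is a minimal, hypothetical sketch of the block-wise idea described above. It is not the PR's actual code: the function name and most details are invented, and the real implementation also handles precomputed distances and the other TODOs listed here.

import numpy as np
from sklearn.metrics import pairwise_distances


def blockwise_silhouette_samples(X, labels, block_n_rows=100,
                                 metric='euclidean'):
    labels = np.asarray(labels)
    n_samples = X.shape[0]
    unique_labels, label_freqs = np.unique(labels, return_counts=True)
    label_idx = np.searchsorted(unique_labels, labels)
    # running sums of distances from each sample to each cluster
    cluster_dist_sums = np.zeros((n_samples, len(unique_labels)))
    for start in range(0, n_samples, block_n_rows):
        stop = min(start + block_n_rows, n_samples)
        # only a (block_n_rows, n_samples) distance block lives in memory
        D = pairwise_distances(X[start:stop], X, metric=metric)
        # np.add.at (numpy >= 1.8; see the TODO above) accumulates columns
        # belonging to the same cluster, even when indices repeat
        np.add.at(cluster_dist_sums[start:stop].T, label_idx, D.T)
    # a: mean distance to the sample's own cluster, excluding the sample itself
    own = cluster_dist_sums[np.arange(n_samples), label_idx]
    a = own / np.maximum(label_freqs.take(label_idx) - 1, 1)
    # b: mean distance to the nearest other cluster
    means = cluster_dist_sums / label_freqs
    means[np.arange(n_samples), label_idx] = np.inf
    b = means.min(axis=1)
    sil = np.nan_to_num((b - a) / np.maximum(a, b))
    # samples in clusters of size 1 get silhouette 0 (the #5988 / #6089 fix)
    sil[label_freqs.take(label_idx) == 1] = 0
    return sil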

@jnothman
Member Author

Benchmarks with block_size=None, attempting to follow #6151:

import numpy as np
from sklearn.metrics.cluster import silhouette_score
n_features = 100
for n_samples in [10000, 1000, 100]:
    for n_labels in [4, 6, 100]:
        print('n_features', n_features, 'n_samples', n_samples, 'n_labels', n_labels)
        X = np.random.rand(n_samples, n_features)
        y = np.random.randint(n_labels, size=n_samples)
        %timeit silhouette_score(X, y)
                                              master     block_size=None   block_size=100
n_features 100 n_samples 10000 n_labels 4     3.28 s     4.03 s            1.61 s
n_features 100 n_samples 10000 n_labels 6     3.16 s     4.12 s            1.64 s
n_features 100 n_samples 10000 n_labels 100   3.48 s     4.05 s            1.61 s
n_features 100 n_samples 1000 n_labels 4      16.9 ms    30.2 ms           15.8 ms
n_features 100 n_samples 1000 n_labels 6      18.3 ms    30 ms             15.7 ms
n_features 100 n_samples 1000 n_labels 100    226 ms     30.8 ms           16.7 ms
n_features 100 n_samples 100 n_labels 4       688 µs     570 µs            608 µs
n_features 100 n_samples 100 n_labels 6       1.01 ms    572 µs            600 µs
n_features 100 n_samples 100 n_labels 100     57.9 ms    591 µs            587 µs

(times are %timeit best of 3, per loop)

defaulting to block_size=100 is looking pretty good too...
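
For anyone not working in IPython, here is a roughly equivalent standalone version of the benchmark script using the timeit module; it reports a mean per call rather than %timeit's best-of-3, so the numbers will not match exactly.

import timeit

import numpy as np
from sklearn.metrics.cluster import silhouette_score

n_features = 100
for n_samples in [10000, 1000, 100]:
    for n_labels in [4, 6, 100]:
        X = np.random.rand(n_samples, n_features)
        y = np.random.randint(n_labels, size=n_samples)
        # average of 3 calls; closures capture the current X and y
        t = timeit.timeit(lambda: silhouette_score(X, y), number=3) / 3
        print('n_features', n_features, 'n_samples', n_samples,
              'n_labels', n_labels, '-> %.4g s per call' % t)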

@jnothman jnothman changed the title [WIP] Block-wise silhouette calculation to avoid memory consumption [MRG] Block-wise silhouette calculation to avoid memory consumption Aug 11, 2016
@@ -56,6 +60,18 @@ def silhouette_score(X, labels, metric='euclidean', sample_size=None,
<sklearn.metrics.pairwise.pairwise_distances>`. If X is the distance
array itself, use ``metric="precomputed"``.

block_size : int, optional
The number of rows to process at a time to limit memory usage to
O(block_size * n_samples). Default is n_samples.
Member

nitpick: Double backticks

@jnothman
Member Author

Instead of allowing the user to specify size as a unit of memory usage, I am wondering if block_size should specify maximum number of cells created by the pairwise_distances operation, rather than number of rows. I.e. the latter is the former divided by n_samples.
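
A hypothetical illustration of that conversion (the numbers here are invented for the example):

n_samples = 50000
block_size_cells = 10000000                            # hypothetical cap on cells of the distance block
block_n_rows = max(1, block_size_cells // n_samples)   # -> 200 rows per block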

@jnothman
Member Author

I've now tried to just have the user specify the pairwise distance memory consumption in bytes...

@raghavrv
Member

raghavrv commented Aug 13, 2016

I've now tried to just have the user specify the pairwise distance memory consumption in bytes...

If it needs to be in memory units, could we have it in MBs as users will need to enter a smaller number?

@jnothman
Member Author

If it needs to be in memory units, could we have it in MBs as users will need to enter a smaller number?

It doesn't need to be, but I think it is a more practical, data-invariant measure. Yes, we can have it in MiB.

@jnothman
Member Author

@raghavrv now in MiB
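
For concreteness, a MiB budget maps back to rows per block roughly like this (an illustrative sketch, not the PR's exact code; BYTES_PER_FLOAT also appears in the review discussion below):

BYTES_PER_FLOAT = 8                                # float64 distances
block_size = 64                                    # MiB
n_samples = 10000
block_n_rows = int(block_size * 2 ** 20 // (BYTES_PER_FLOAT * n_samples))
# -> 838 rows of the pairwise distance block fit within 64 MiB for 10000 samples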

@tguillemot
Contributor

Instead of allowing the user to specify size as a unit of memory usage, I am wondering if block_size should specify maximum number of cells created by the pairwise_distances operation, rather than number of rows. I.e. the latter is the former divided by n_samples.

@jnothman I'm a little bit late, but +1 for specifying memory usage; I think it's an easier measure to specify. And also +1 for MiB.

@jnothman
Member Author

You're earlier than everyone but @raghavrv. Thanks for the input. That's where it's converged anyway.

@tguillemot
Contributor

@jnothman I'll do a review in a moment.



def _process_block(X, labels, start, block_n_rows, block_range, add_at,
label_freqs, metric, kwds):
Contributor

Just a nitpick: _process_block -> _silhouette_process_block.
I know it's a private function, but it might be easier to understand if you added a little docstring describing the arguments.

Member Author

Fair enough. Will do

@tguillemot
Copy link
Contributor

tguillemot commented Aug 16, 2016

Performance is similar on my computer.

y = [1, 1, 2]
assert_raise_message(ValueError, 'block_size should be at least n_samples '
'* 8 bytes = 1 MiB, got 0',
silhouette_score, X, y, block_size=0)
Member

Should we instead allow block_size=0 and let it denote master's behaviour of using all the memory?

Member

Or -1 like n_jobs?

Member Author

For what benefit? The default 64MB will run a 2896 sample problem or smaller in a single block. For problems much larger than that, you're likely to benefit from splitting the problem up as suggested by our benchmark which shows 2x speedup from "use all memory" for a dataset less than 4x that size (and >9x the number of pairwise calculations). Yes, this is only my machine, but it's hard to imagine why we should suggest using all memory possible to the user.
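
(As an aside, the 2896 figure follows directly from the 64 MiB default: the full n x n float64 distance matrix fits when n <= sqrt(64 MiB / 8 bytes).)

import numpy as np
print(int(np.sqrt(64 * 2 ** 20 / 8)))  # 2896: largest n_samples whose full distance matrix fits in 64 MiB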

Member Author

Not dissimilar to n_jobs=-2 often being a better choice than n_jobs=-1

Contributor

Why not automatically select block_size as min(64MB, np.ceil(n_samples * BYTES_PER_FLOAT * 2 ** -20)) rather than letting the user choose a specific value?

Member Author

That would be identical to just setting it to 64, no? That's tempting, especially because this isn't a learning routine. I don't expect my benchmarks to be optimal, certainly not for any particular platform, or with n_jobs != 1 (using the default or some other parallel backend); hence my inclination to keep it public. I'll have to chew on this.

@jnothman
Member Author

I'd appreciate another opinion on this: should we make the memory-efficiency parameter a constant, given that we know the constant is sensible on some modern platforms (and hard to explain to the user), or leave it directly configurable by the user? In the context of a learning algorithm, configurability seems more valuable, but for an evaluation metric, is it sensible to just say that 64 MiB is the maximum available for pairwise distance evaluation? Also, should this be 64 // n_jobs? (Perhaps I need to benchmark the interaction with n_jobs.)

Not sure who to ask: @GaelVaroquaux, @MechCoder, @amueller?

Also use threading for parallelism
@jnothman
Member Author

jnothman commented Aug 17, 2016

Fixed a problem where n_jobs wasn't being used, and changed to use threading backend for parallelism (only marginal gains for euclidean, but I've not seen multiprocessing be worth the overhead here).
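
Roughly, the dispatch looks like the following toy example (a sketch only: process_one_block is a stand-in for the PR's _process_block, whose real signature appears in the diff above, and in 2016-era scikit-learn Parallel/delayed would come from sklearn.externals.joblib).

import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances

def process_one_block(X, start, block_n_rows):
    # stand-in worker: compute one block of pairwise distances and reduce it
    stop = min(start + block_n_rows, X.shape[0])
    return pairwise_distances(X[start:stop], X).sum()

X = np.random.rand(2000, 20)
block_n_rows = 500
block_sums = Parallel(n_jobs=2, backend='threading')(
    delayed(process_one_block)(X, start, block_n_rows)
    for start in range(0, X.shape[0], block_n_rows))
# threads share X without copying; gains depend on how much of the distance
# computation releases the GIL (hence only marginal gains for euclidean)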

@raghavrv
Member

raghavrv commented Aug 17, 2016

Sorry for piggybacking on this conversation... How do you usually decide definitively between multiprocessing and threading? As joblib memmaps the data, I don't see why multiprocessing is any worse than threading, given that it is usually a bit faster than threading (for larger datasets) and much faster when the GIL is not released... Or am I wrong?

@jnothman
Member Author

I don't have any good answers for you, @raghavrv, but for the scale of problem and processor I've tried, multiprocessing always results in slower times. AFAIK, threading is a good option when there's a lot of work in non-GIL operations, and when copying the inputs and outputs competes with the calculation for cost. I'm considering just dropping n_jobs support as ineffectual until someone can show me a benchmark with a more-than-marginal benefit.

@amueller
Member

@jnothman I think for blocked computations, we set "reasonable" defaults and allow configuration, IIRC. Do you want a review on this? It doesn't seem high priority to me. [OT: why are you using this score?]

@jnothman
Member Author

I'd like a review on this, but no, I'm not using silhouette personally. I worked on it because an issue came in, and because I wanted to ascertain that #1976 could be improved upon. I also think we should be reviewing potential blockwise computations everywhere that pairwise_distances is used, such as KNN. Perhaps I should open an issue for that.

@jnothman
Member Author

To be sure, it's not high priority, except that it closes a few issues and avoids more appearing over the next release.

@jnothman
Member Author

Also, #5988 should be fixed.

@raghavrv
Member

raghavrv commented Nov 1, 2016

Could you refactor #7438 out of this PR and rebase? I can try to take a deeper look at this before the weekend :)

@raghavrv
Member

@jnothman Ping :)
