MemoryError from sklearn.metrics.silhouette_samples #10279
I think we need to fix this properly. We've had patches for years, and I think we have shown that using less memory is almost always worthwhile. |
By "properly" do you mean to imply that there is a potential implementation which would keep the memory below O(N) but be as fast as the current implementation? |
well, we found that using less memory speeds things up overall for any problem in which speed is an issue, basically. |
In principle, for small N the method with fewer operations should be faster. I was curious where a more memory-conservative algorithm would become faster, so I did a comparison against the Stack Overflow implementation. It turns out the low-memory version gets faster pretty much right away.
|
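For concreteness, a minimal sketch of how such a timing comparison might be run (the dataset shapes here are illustrative assumptions, and the memory-saving implementation being compared against is not shown):

```python
# Sketch of a scaling benchmark: time silhouette_samples as n grows.
# In the implementation under discussion this builds the full n x n
# pairwise distance matrix, so both time and memory grow quadratically.
import time
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

for n in (1000, 5000, 10000, 20000):
    X, labels = make_blobs(n_samples=n, centers=10, n_features=24,
                           random_state=0)
    t0 = time.perf_counter()
    silhouette_samples(X, labels)
    print("n=%d: %.2fs" % (n, time.perf_counter() - t0))
    # A memory-saving variant would be timed the same way for comparison.
```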
As far as I can tell, the SO implementation still has some fairly high memory costs, depending on the sizes of your clusters. But we had similar results when benchmarking #7177, though I was too lazy to plot it. Basically, the memory allocation takes a lot of time. #7177, and my most recent attempt, #10280, should (usually) limit the memory to constant use (rather than an asymptotic function of n). I very much hope we can see #10280, which will provide memory usage improvements for a few routines, merged in the next release. |
Yes, you are correct that the memory issues are not totally mitigated by the SO implementation. I hope this is not too much of a tangent, but it may be worth looking into at the same time. Once a clusterer has been trained, I use predict() to label a new data set. The silhouettes are later used as features for a classification algorithm, which means that for my new data set I need the silhouettes computed in the space in which the clustering algorithm was trained. I wrote a new function silhouette_samples_predict(X_train, labels_train, X_test, labels_test) which does this in a similar way to the SO answer. I know that sklearn is developed to be as parsimonious as possible, but this is a useful feature for fraud detection and is very similar to the methods under discussion here. Thank you for your consideration, and sorry for the off-topic comment. |
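For illustration, a minimal sketch of what such a function could look like. The name silhouette_samples_predict is the proposal above, not a scikit-learn API; this version assumes a Euclidean metric, that every test label occurs in the training labels, and that each (n_test x cluster_size) distance block fits in memory:

```python
# Hypothetical helper in the spirit of the proposed silhouette_samples_predict:
# score new points against the clusters found on the training data.
# A sketch only -- not a scikit-learn API.
import numpy as np
from sklearn.metrics import pairwise_distances

def silhouette_samples_predict(X_train, labels_train, X_test, labels_test):
    unique_labels = np.unique(labels_train)
    n_test = X_test.shape[0]
    # Mean distance from each test point to every training cluster.
    mean_dist = np.empty((n_test, unique_labels.size))
    for j, k in enumerate(unique_labels):
        d = pairwise_distances(X_test, X_train[labels_train == k])
        mean_dist[:, j] = d.mean(axis=1)
    # a: mean distance to the point's own cluster (no self-exclusion is
    # needed, since test points are not part of the training clusters);
    # b: smallest mean distance to any other cluster.
    own = np.searchsorted(unique_labels, labels_test)
    a = mean_dist[np.arange(n_test), own]
    other = mean_dist.copy()
    other[np.arange(n_test), own] = np.inf
    b = other.min(axis=1)
    return (b - a) / np.maximum(a, b)
```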
I'm not sure what you're proposing, Keith. Are you saying that we should consider silhouette_samples_predict for inclusion in scikit-learn? Why not make it available in a separate library? Also, see #10280 for how I hope we will do this calculation in the future with essentially constant memory (unless there are > 64M samples).
|
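For reference, the chunked strategy of #10280 appears to have shipped in scikit-learn 0.20 as pairwise_distances_chunked, which yields the distance matrix one block of rows at a time. A minimal sketch of the idea (the per-row mean reduction shown here is an illustrative example, not the silhouette code itself):

```python
# Sketch of the chunked strategy: process the pairwise distance matrix
# one block of rows at a time, keeping peak memory roughly constant.
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

def row_mean_distance(X):
    means = []
    # Each chunk is a block of rows of the full distance matrix; sklearn
    # sizes the block to fit in roughly the given working memory (MiB).
    for chunk in pairwise_distances_chunked(X, working_memory=64):
        means.append(chunk.mean(axis=1))
    return np.concatenate(means)
```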
Yes, I was proposing such a function be considered while this work is being done. Is there an alternative library which would be appropriate? I looked at #10280 and it seems reasonable to me. Anything beyond a few million samples is unlikely to be clustered on a single machine in the first place. |
I have no idea if there's an appropriate alternative library, perhaps specific to that applied domain, but you can always start one, or make a library just for that code. It sounds a bit like you're using silhouette for outlier detection or similar. Is that right?
|
Yep, I am trying to expand on existing work for the identification of insurance fraud. The silhouette shows promise as a feature in the classification algorithm. As for the silhouette_samples_predict function: if there is no public library where it would make sense to add it, then it is just one more to add to my private collection. |
I ran into this issue and ended up developing a somewhat ad-hoc solution using numba. Posting because it might be helpful for someone as a quick fix. Code here: It works with roughly constant memory and allowed me to work with a 300000 sample, 24 dimension dataset (which would have required about 720GB of RAM to hold the distance matrix otherwise). |
nice! at the moment we avoid numba... perhaps it is worth considering. |
Hi everyone. I had the same problem, and I work with Python, so my answer is based on that. This problem is very common when working with "big" data - "big" in quotes because even a simple 700x600 image would already cause it. The best solution, as others have said, is to avoid working on all of the data at once. In Python there is a function that helps with this: silhouette_score(), which can be imported from sklearn.metrics. All the parameters are the usual ones, but the last one is a little different: sample_size, which is None by default. When you set it, the score is computed on a random subsample of that size instead of on the full dataset; for example, I computed mine on a sample of 1000 points. If you have this problem in another programming language, look at the implementation and adapt it to your needs. |
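A short illustration of the subsampling parameter mentioned above (the dataset here is synthetic and only for demonstration):

```python
# silhouette_score evaluated on a random sample of 1000 points instead of
# the full 50000-point dataset, avoiding the full pairwise matrix.
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, labels = make_blobs(n_samples=50000, centers=5, random_state=0)
score = silhouette_score(X, labels, sample_size=1000, random_state=0)
print(score)
```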
@luansouzasilva31 are you using scikit-learn version 0.20? This was not fixed before that release. |
@jnothman yeah bro, my version is 0.19. I'm sorry, I didn't know about this issue. But it would be nice to see the implementation, even if it's from an old version, haha, then you could implement your own program. Thank you for your answer!! |
Please note that this is a follow-up issue to #4701 and #4197, not a duplicate.
Calling sklearn.metrics.silhouette_samples with a large data set will cause a MemoryError in numpy.dot. This is not an issue with numpy.dot but with the implementation of sklearn.metrics.silhouette_samples: for a data set of length N it computes the full NxN pairwise distance matrix, which can clearly take a lot of memory. Mathematically, all that is needed in memory is a list of the same length as the largest cluster; the rest can be done by appropriate looping.
This issue proposes a new feature: an additional parameter for the sklearn.metrics.silhouette_samples method. This would be a boolean indicating whether to use the faster but more memory-hungry existing method, or an alternative that is lighter on memory but likely slower.
A working version of the new method was posted in an answer on Stack Overflow.
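For illustration, a sketch of the memory-saving scheme described above, in the spirit of the Stack Overflow answer rather than a copy of it. It loops cluster by cluster so that only a (cluster_size x n_samples) block of distances exists at any time; metric choice and input validation are omitted:

```python
# Sketch of a low-memory silhouette_samples: compute values cluster by
# cluster, so only a (cluster_size x n_samples) block of distances is in
# memory at once, never the full n x n matrix. Euclidean metric assumed.
import numpy as np
from sklearn.metrics import pairwise_distances

def silhouette_samples_low_memory(X, labels):
    n = X.shape[0]
    unique_labels, counts = np.unique(labels, return_counts=True)
    sil = np.empty(n)
    for k, size in zip(unique_labels, counts):
        mask = labels == k
        # Distances from this cluster's points to all points: O(size * n).
        d = pairwise_distances(X[mask], X)
        # a: mean distance to other members of the same cluster (the zero
        # self-distance drops out of the sum; divide by size - 1).
        a = d[:, mask].sum(axis=1) / max(size - 1, 1)
        # b: smallest mean distance to any other cluster.
        b = np.full(size, np.inf)
        for k2 in unique_labels:
            if k2 == k:
                continue
            b = np.minimum(b, d[:, labels == k2].mean(axis=1))
        sil[mask] = (b - a) / np.maximum(a, b)
        # Singleton clusters get silhouette 0 by convention.
        sil[mask] = np.where(size > 1, sil[mask], 0.0)
    return sil
```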