
MemoryError from sklearn.metrics.silhouette_samples #10279

Closed

Gitman-code opened this issue Dec 9, 2017 · 15 comments

@Gitman-code

Please note that this is a follow-up issue to #4701 and #4197, not a duplicate.

Calling sklearn.metrics.silhouette_samples with a large data set will cause a MemoryError in numpy.dot. This is not an issue with numpy.dot but with the implementation of sklearn.metrics.silhouette_samples. For a data set of length N it computes the full NxN pairwise distance matrix, which can clearly take a lot of memory. Mathematically, all that needs to be held in memory at once is a list the length of the largest cluster; the rest can be done with appropriate looping.

This issue proposes a new feature: add a boolean parameter to sklearn.metrics.silhouette_samples to choose between the existing method, which is faster but more memory-hungry, and an alternative method that uses far less memory but is likely slower.

A working version of the new method was posted in an answer on Stack Overflow.
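For concreteness, here is a minimal sketch of that looping idea (my paraphrase, not the Stack Overflow code itself), assuming the default euclidean metric; the silhouette_samples_chunked name and chunk_size are illustrative:

import numpy as np
from sklearn.metrics import pairwise_distances

def silhouette_samples_chunked(X, labels, chunk_size=1000):
    X = np.asarray(X)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    counts = np.array([(labels == k).sum() for k in classes])
    n = X.shape[0]
    sil = np.zeros(n)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # one (chunk_size x n) block of distances at a time, never the full n x n matrix
        dist = pairwise_distances(X[start:stop], X)
        # mean distance from each chunk sample to every cluster
        mean_dist = np.column_stack(
            [dist[:, labels == k].mean(axis=1) for k in classes])
        rows = np.arange(stop - start)
        own = np.searchsorted(classes, labels[start:stop])
        own_counts = counts[own]
        # a(i): mean distance to the sample's own cluster, excluding itself
        a = mean_dist[rows, own] * own_counts / np.maximum(own_counts - 1, 1)
        # b(i): mean distance to the nearest other cluster
        mean_dist[rows, own] = np.inf
        b = mean_dist.min(axis=1)
        s = (b - a) / np.maximum(a, b)
        s[own_counts == 1] = 0.0  # silhouette is 0 by convention for singleton clusters
        sil[start:stop] = s
    return sil

Peak memory here is O(chunk_size * N) for the distance block rather than O(N^2).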

@jnothman
Member

I think we need to fix this properly. We've had patches for years, and I think we have shown that using less memory is almost always worthwhile.

@Gitman-code
Author

Gitman-code commented Dec 10, 2017

By "properly" do you mean to imply that there is a potential implementation which would keep the memory below O(N) but be as fast as the current implementation?

@jnothman
Member

jnothman commented Dec 10, 2017 via email

@Gitman-code
Author

In principle, for small N the method with fewer operations should be faster. I was curious where the more memory-conservative algorithm would become faster, so I compared against the Stack Overflow implementation. It turns out it gets faster pretty much right away.

N        t_original    t_new
100      0.000636      0.001687
500      0.003464      0.003987
1000     0.016305      0.010772
5000     0.765970      0.283393
10000    3.594396      1.215139

[figure_1: plot of the timings above]
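For reference, a rough harness along these lines can reproduce such a comparison (the data shape and cluster count are made up, silhouette_samples_chunked refers to the sketch earlier in the thread, and absolute numbers will vary by machine):

import time
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.RandomState(0)
for n in (100, 500, 1000, 5000, 10000):
    X = rng.rand(n, 10)
    labels = rng.randint(0, 5, size=n)
    t0 = time.time()
    silhouette_samples(X, labels)          # current scikit-learn implementation
    t1 = time.time()
    silhouette_samples_chunked(X, labels)  # chunked sketch from above
    t2 = time.time()
    print(n, t1 - t0, t2 - t1)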

@jnothman
Member

As far as I can tell, the SO implementation still has some fairly high memory costs, depending on the sizes of your clusters. But we had similar results when benchmarking #7177, though I was too lazy to plot it. Basically, the memory allocation takes a lot of time.

#7177, and my most recent attempt, #10280, should (usually) limit the memory to constant use (rather than an asymptotic function of n). I very much hope we can see #10280, which will provide memory usage improvements for a few routines, merged in the next release.
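A chunk-and-reduce pattern of this kind is what later scikit-learn releases expose as sklearn.metrics.pairwise_distances_chunked; a minimal sketch with an illustrative reduce function and random data:

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

def row_means(D_chunk, start):
    # reduce each (chunk x n) distance block to per-row means right away,
    # so only one block is ever held in memory
    return D_chunk.mean(axis=1)

X = np.random.RandomState(0).rand(10000, 24)
mean_dists = np.concatenate(
    list(pairwise_distances_chunked(X, reduce_func=row_means)))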

@Gitman-code
Author

Yes, you are correct that the memory issues are not totally mitigated by the SO implementation.

I hope this is not too much of a tangent, but it may be worth looking into at the same time. Once a clusterer has been trained, I use predict() to label a new data set. The silhouettes are used later as features for a classification algorithm. This means that for my new data set I need the silhouettes computed in the space in which the clustering algorithm was trained. I wrote a new function, silhouette_samples_predict(X_train, labels_train, X_test, labels_test), which does this in a similar way to the SO answer. I know that sklearn is developed to be as parsimonious as possible, but this is a useful feature for fraud detection and is very similar to the methods under discussion here. Thank you for your consideration, and sorry for the off-topic comment.
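A hedged sketch of how such a function could look (the signature follows the comment above; the internals are my assumptions, reusing the chunked approach from earlier):

import numpy as np
from sklearn.metrics import pairwise_distances

def silhouette_samples_predict(X_train, labels_train, X_test, labels_test,
                               chunk_size=1000):
    # Silhouette of new samples measured against the training clusters. Since
    # the test samples are not part of the training set, no self-exclusion is
    # needed when computing a(i).
    X_train, X_test = np.asarray(X_train), np.asarray(X_test)
    labels_train = np.asarray(labels_train)
    labels_test = np.asarray(labels_test)
    classes = np.unique(labels_train)
    n = X_test.shape[0]
    sil = np.zeros(n)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        dist = pairwise_distances(X_test[start:stop], X_train)
        mean_dist = np.column_stack(
            [dist[:, labels_train == k].mean(axis=1) for k in classes])
        rows = np.arange(stop - start)
        own = np.searchsorted(classes, labels_test[start:stop])
        a = mean_dist[rows, own]   # mean distance to the assigned training cluster
        mean_dist[rows, own] = np.inf
        b = mean_dist.min(axis=1)  # nearest other training cluster
        sil[start:stop] = (b - a) / np.maximum(a, b)
    return sil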

@jnothman
Member

jnothman commented Dec 11, 2017 via email

@Gitman-code
Author

Yes, I was proposing such a function be considered while this work is being done. Is there an alternative library which would be appropriate?

I looked at #10280 and it seems reasonable to me. Anything beyond a few million samples is unlikely to be clustered on a single machine in the first place.

@jnothman
Member

jnothman commented Dec 12, 2017 via email

@Gitman-code
Author

Yep, I am trying to expand on existing work on the identification of insurance fraud. The silhouette shows promise as a feature in the classification algorithm.

As for the silhouette_samples_predict function: if there is no public library where it would make sense to add it, then it is just one more for my private collection.

@mangecoeur

I ran into this issue and ended up developing a somewhat ad-hoc solution using numba that completely avoids computing the NxN distance matrix by directly computing the sum and mean distances required by the silhouette_samples algorithm. It currently only implements euclidean distance and drops all sorts of checks in order to release the GIL in numba and work across all cores (which it does magnificently).

Posting because it might be helpful for someone as a quick fix. Code here:
https://gist.github.com/mangecoeur/f7c419506bb009d69c20565e2b6f7887

It works with roughly constant memory and allowed me to work with a 30000-sample, 24-dimension dataset (which would otherwise have required on the order of 7 GB of RAM just to hold the distance matrix).
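A minimal sketch of the same idea (not the gist's code), assuming integer labels 0..k-1 and euclidean distance; the helper names are illustrative:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def _cluster_distance_sums(X, labels, n_clusters):
    # Summed euclidean distance from every sample to every cluster, computed
    # pairwise on the fly so the n x n distance matrix is never materialised.
    n, d = X.shape
    sums = np.zeros((n, n_clusters))
    for i in prange(n):
        for j in range(n):
            acc = 0.0
            for f in range(d):
                diff = X[i, f] - X[j, f]
                acc += diff * diff
            sums[i, labels[j]] += np.sqrt(acc)  # each thread writes only its own row i
    return sums

def silhouette_samples_numba(X, labels):
    X = np.ascontiguousarray(X, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    counts = np.bincount(labels)
    sums = _cluster_distance_sums(X, labels, counts.size)
    rows = np.arange(X.shape[0])
    # a(i): mean distance to the sample's own cluster, excluding the zero self-distance
    a = sums[rows, labels] / np.maximum(counts[labels] - 1, 1)
    means = sums / counts              # mean distance to every cluster
    means[rows, labels] = np.inf
    b = means.min(axis=1)              # nearest other cluster
    s = (b - a) / np.maximum(a, b)
    s[counts[labels] == 1] = 0.0       # singleton clusters get 0 by convention
    return s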

@jnothman
Member

jnothman commented Feb 17, 2018 via email

@luansouzasilva31

Hi everyone. I ran into the same problem. I work with Python, so my answer is based on that. The problem is very common when working with "big" data; "big" in quotes because even a simple 700x600 image is already enough to cause it. One workaround is to compute the score on only part of the data. In Python, silhouette_score(), importable from sklearn.metrics, takes the usual parameters plus one that matters here: sample_size, which is None by default. When set, the score is computed on a random subsample of that many points, so the distance matrix stays small. Look:
from sklearn import metrics

metrics.silhouette_score(imgcopy,                 # flattened image data
                         kmeans_cluster.labels_,  # labels from a fitted KMeans
                         metric='euclidean',
                         sample_size=1000)  # <------------------

Here the score is computed on a random subsample of 1000 points. If you have this problem in another programming language, look up the implementation and adapt it to your needs.

@jnothman
Member

jnothman commented Jan 8, 2019

@luansouzasilva31 are you using scikit-learn version 0.20? This was not fixed before that release.
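A minimal sketch of using the fixed path in 0.20 and later (random data here, just to show the call); the size of the temporary distance blocks can be capped with the global working_memory option, in MiB:

import numpy as np
from sklearn import config_context, metrics

X = np.random.RandomState(0).rand(30000, 24)            # illustrative data
labels = np.random.RandomState(1).randint(0, 8, 30000)

with config_context(working_memory=512):  # cap temporary distance blocks at ~512 MiB
    s = metrics.silhouette_samples(X, labels)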

@luansouzasilva31

@jnothman yeah, my version is 0.19. I'm sorry, I didn't know about this issue. But it would be nice to see the implementation, even if it's an old version, haha; then you could implement your own program. Thank you for your answer!!
