MemoryError from sklearn.metrics.silhouette_samples #10279
I think we need to fix this properly. We've had patches for years, and I think we have shown that using less memory is almost always worthwhile. |
By "properly" do you mean to imply that there is a potential implementation which would keep the memory below O(N) but be as fast as the current implementation? |
well, we found that using less memory speeds things up overall for any problem in which speed is an issue, basically. |
In principle, for small N the method with fewer operations should be faster. I was curious where a more memory-conservative algorithm would become faster, so I did a comparison against the Stack Overflow implementation. It turns out the low-memory version gets faster pretty much right away.
|
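For concreteness, a minimal sketch of how such a timing comparison might be run (the dataset shapes here are illustrative assumptions, and the memory-saving implementation being compared against is not shown):

```python
# Sketch of a scaling benchmark: time silhouette_samples as n grows.
# In the implementation under discussion this builds the full n x n
# pairwise distance matrix, so both time and memory grow quadratically.
import time
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

for n in (1000, 5000, 10000, 20000):
    X, labels = make_blobs(n_samples=n, centers=10, n_features=24,
                           random_state=0)
    t0 = time.perf_counter()
    silhouette_samples(X, labels)
    print("n=%d: %.2fs" % (n, time.perf_counter() - t0))
    # A memory-saving variant would be timed the same way for comparison.
```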
As far as I can tell, the SO implementation still has some fairly high memory costs, depending on the sizes of your clusters. But we had similar results when benchmarking #7177, though I was too lazy to plot it. Basically, the memory allocation takes a lot of time. #7177, and my most recent attempt, #10280, should (usually) limit the memory to constant use (rather than an asymptotic function of n). I very much hope we can see #10280, which will provide memory usage improvements for a few routines, merged in the next release. |
Yes, you are correct that the memory issues are not totally mitigated by the SO implementation. I hope this is not too much of a tangent, but it may be worth looking into at the same time. Once a clusterer has been trained, I use predict() to label a new data set. The silhouettes are later used as features for a classification algorithm, which means that for my new data set I need the silhouettes computed in the space in which the clustering algorithm was trained. I wrote a new function silhouette_samples_predict(X_train, labels_train, X_test, labels_test) which does this in a similar way to the SO answer. I know that sklearn is developed to be as parsimonious as possible, but this is a useful feature for fraud detection and is very similar to the methods under discussion here. Thank you for your consideration, and sorry for the off-topic comment. |
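For illustration, a minimal sketch of what such a function could look like. The name silhouette_samples_predict is the proposal above, not a scikit-learn API; this version assumes a Euclidean metric, that every test label occurs in the training labels, and that each (n_test x cluster_size) distance block fits in memory:

```python
# Hypothetical helper in the spirit of the proposed silhouette_samples_predict:
# score new points against the clusters found on the training data.
# A sketch only -- not a scikit-learn API.
import numpy as np
from sklearn.metrics import pairwise_distances

def silhouette_samples_predict(X_train, labels_train, X_test, labels_test):
    unique_labels = np.unique(labels_train)
    n_test = X_test.shape[0]
    # Mean distance from each test point to every training cluster.
    mean_dist = np.empty((n_test, unique_labels.size))
    for j, k in enumerate(unique_labels):
        d = pairwise_distances(X_test, X_train[labels_train == k])
        mean_dist[:, j] = d.mean(axis=1)
    # a: mean distance to the point's own cluster (no self-exclusion is
    # needed, since test points are not part of the training clusters);
    # b: smallest mean distance to any other cluster.
    own = np.searchsorted(unique_labels, labels_test)
    a = mean_dist[np.arange(n_test), own]
    other = mean_dist.copy()
    other[np.arange(n_test), own] = np.inf
    b = other.min(axis=1)
    return (b - a) / np.maximum(a, b)
```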
I'm not sure what you're proposing, Keith. Are you saying that we should consider silhouette_samples_predict for inclusion in scikit-learn? Why not make it available in a separate library? Also, see #10280 for how I hope we will do this calculation in the future with essentially constant memory (unless there are > 64M samples).
|
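For reference, the chunked strategy of #10280 appears to have shipped in scikit-learn 0.20 as pairwise_distances_chunked, which yields the distance matrix one block of rows at a time. A minimal sketch of the idea (the per-row mean reduction shown here is an illustrative example, not the silhouette code itself):

```python
# Sketch of the chunked strategy: process the pairwise distance matrix
# one block of rows at a time, keeping peak memory roughly constant.
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

def row_mean_distance(X):
    means = []
    # Each chunk is a block of rows of the full distance matrix; sklearn
    # sizes the block to fit in roughly the given working memory (MiB).
    for chunk in pairwise_distances_chunked(X, working_memory=64):
        means.append(chunk.mean(axis=1))
    return np.concatenate(means)
```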
Yes, I was proposing such a function be considered while this work is being done. Is there an alternative library which would be appropriate? I looked at #10280 and it seems reasonable to me. Anything beyond a few million samples is unlikely to be clustered on a single machine in the first place. |
I have no idea if there's an appropriate alternative library, perhaps specific to that applied domain, but you can always start one, or make a library just for that code. It sounds a bit like you're using silhouette for outlier detection or similar. Is that right?
|
Yep, I am trying to expand on existing work for the identification of insurance fraud. The silhouette shows promise as a feature in the classification algorithm. As for the silhouette_samples_predict function: if there is no public library where it would make sense to add it, then it is just one more to add to my private collection. |
I ran into this issue and ended up developing a somewhat ad-hoc solution using numba. Posting because it might be helpful for someone as a quick fix. Code here: It works with roughly constant memory and allowed me to work with a 300000 sample, 24 dimension dataset (which would have required about 720GB of RAM to hold the distance matrix otherwise). |
nice! at the moment we avoid numba... perhaps it is worth considering. |
Hi everyone. I had the same problem, and I work with Python, so my answer is based on that. This problem is very common when working with "big" data - "big" in quotes because even a simple 700x600 image would already cause it. The best solution, as others have said, is to avoid working on all of the data at once. In Python there is a function that helps with this: silhouette_score(), which can be imported from sklearn.metrics. All the parameters are the usual ones, but the last one is a little different: sample_size, which is None by default. When you set it, the score is computed on a random subsample of that size instead of on the full dataset; for example, I computed mine on a sample of 1000 points. If you have this problem in another programming language, look at the implementation and adapt it to your needs. |
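A short illustration of the subsampling parameter mentioned above (the dataset here is synthetic and only for demonstration):

```python
# silhouette_score evaluated on a random sample of 1000 points instead of
# the full 50000-point dataset, avoiding the full pairwise matrix.
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, labels = make_blobs(n_samples=50000, centers=5, random_state=0)
score = silhouette_score(X, labels, sample_size=1000, random_state=0)
print(score)
```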
@luansouzasilva31 are you using scikit-learn version 0.20? This was not fixed before that release. |
@jnothman yeah bro, my version is 0.19. I'm sorry, I didn't know about this issue. But it would be nice to see the implementation, even if it's from an old version, haha, then you could implement your own program. Thank you for your answer!! |
Please note that this is a follow-up issue to #4701 and #4197, not a duplicate.
Calling sklearn.metrics.silhouette_samples with a large data set will cause a MemoryError in numpy.dot. This is not an issue with numpy.dot but with the implementation of sklearn.metrics.silhouette_samples: for a data set of length N it computes the full NxN pairwise distance matrix, which can clearly take a lot of memory. Mathematically, all that is needed in memory is a list of the same length as the largest cluster; the rest can be done by appropriate looping.
This issue proposes a new feature: an additional parameter for the sklearn.metrics.silhouette_samples method. This would be a boolean indicating whether to use the faster but more memory-hungry existing method, or an alternative that is lighter on memory but likely slower.
A working version of the new method was posted in an answer on Stack Overflow.
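For illustration, a sketch of the memory-saving scheme described above, in the spirit of the Stack Overflow answer rather than a copy of it. It loops cluster by cluster so that only a (cluster_size x n_samples) block of distances exists at any time; metric choice and input validation are omitted:

```python
# Sketch of a low-memory silhouette_samples: compute values cluster by
# cluster, so only a (cluster_size x n_samples) block of distances is in
# memory at once, never the full n x n matrix. Euclidean metric assumed.
import numpy as np
from sklearn.metrics import pairwise_distances

def silhouette_samples_low_memory(X, labels):
    n = X.shape[0]
    unique_labels, counts = np.unique(labels, return_counts=True)
    sil = np.empty(n)
    for k, size in zip(unique_labels, counts):
        mask = labels == k
        # Distances from this cluster's points to all points: O(size * n).
        d = pairwise_distances(X[mask], X)
        # a: mean distance to other members of the same cluster (the zero
        # self-distance drops out of the sum; divide by size - 1).
        a = d[:, mask].sum(axis=1) / max(size - 1, 1)
        # b: smallest mean distance to any other cluster.
        b = np.full(size, np.inf)
        for k2 in unique_labels:
            if k2 == k:
                continue
            b = np.minimum(b, d[:, labels == k2].mean(axis=1))
        sil[mask] = (b - a) / np.maximum(a, b)
        # Singleton clusters get silhouette 0 by convention.
        sil[mask] = np.where(size > 1, sil[mask], 0.0)
    return sil
```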