If not Stack Overflow, the appropriate venue for such questions is the
scikit-learn-general mailing list.
The current dbscan implementation is by default not memory efficient:
it constructs a full pairwise distance matrix in the case where
kd/ball-trees cannot be used (e.g. with sparse matrices). This matrix will
consume n^2 floats, roughly 40GB at 4 bytes per entry or 80GB as float64
in your case.
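As a rough check of that estimate, the arithmetic can be sketched as below. The row count of 100,000 is taken from your message; the helper name dense_matrix_gb is only illustrative and is not part of scikit-learn.

```python
import numpy as np


def dense_matrix_gb(n_samples, dtype=np.float64):
    """Size in GB (10**9 bytes) of a dense n_samples x n_samples array."""
    return n_samples ** 2 * np.dtype(dtype).itemsize / 1e9


n = 100000  # number of rows reported in the original message
print(dense_matrix_gb(n, np.float32))  # 4-byte floats
print(dense_matrix_gb(n, np.float64))  # 8-byte floats, the np.zeros default
```

Either way the dense matrix is far larger than the sparse input itself.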
We provide a couple of mechanisms for getting around this:
- You can precompute a sparse radius neighborhood graph (where missing
entries are presumed to be out of eps) in a memory-efficient way, and run
dbscan over this with metric='precomputed'.
- You can compress the dataset, either by removing exact duplicates if
these occur in your data, or by using BIRCH. Then you only have a
relatively small number of representatives for a large number of points.
You can then provide a sample_weight when fitting DBSCAN.
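The first option might look roughly like the sketch below. It is not a tested recipe for your data: the helper name sparse_radius_graph, the chunk_size value, the small stand-in matrix and the eps and min_samples values are all assumptions made for illustration. It also uses the sort_results argument of radius_neighbors_graph, which needs a more recent scikit-learn release than 0.17.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def sparse_radius_graph(X, eps, metric="manhattan", chunk_size=1000):
    """Build a sparse (n, n) graph that stores only distances <= eps.

    Rows are queried in chunks, so the caller never asks for all n x n
    distances in a single call.
    """
    nn = NearestNeighbors(radius=eps, metric=metric, algorithm="brute").fit(X)
    blocks = [
        nn.radius_neighbors_graph(
            X[start:start + chunk_size], mode="distance", sort_results=True
        )
        for start in range(0, X.shape[0], chunk_size)
    ]
    return sp.vstack(blocks, format="csr")


# Small stand-in for the real data (the original is 100,000 x 400).
X = sp.random(2000, 400, density=0.01, format="csr", random_state=0)
eps = 3.0  # illustrative value only

graph = sparse_radius_graph(X, eps)
# Entries missing from the graph are treated as being farther than eps.
labels = DBSCAN(eps=eps, min_samples=5, metric="precomputed").fit(graph).labels_
```

The radius used to build the graph must be at least the eps given to DBSCAN. Note that the graph itself can still grow large if eps is big enough that most pairs fall within it.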
I suspect this could be clearer in the documentation, and a pull request is
welcome.
Perhaps the default implementation of radius_neighbors and kneighbors in the
brute force case should be more memory-sensitive; or dbscan should return
to (or have an option for) searching for nearest neighbors when needed rather
than in advance, which is the source of the high memory consumption.
Cheers; but please don't email developers personally, and continue
correspondence through the mailing list.
Joel
On 19 February 2016 at 05:53, Lefevre, Augustin <alefe...@ykems.com> wrote:
> Dear Joel and Robert,
>
> Sorry for contacting you directly; there may be a more formal way of
> contacting you about this. Anyway, here is my question.
>
> I tried using dbscan on scikit-learn v0.17 today and got a MemoryError.
> After reading about it on Stack Overflow, I am still puzzled, since I am
> using a compressed sparse row matrix as input, of size 100,000 x 400,
> with density 0.01, which is far from huge (300 MB on disk).
> Apparently, the reason is that I am using the l1 distance as a metric.
> Please find below a sample of code to reproduce the error, and my
> traceback. If you have any suggestions on working around this problem, I
> would be very thankful.
>
> You can reproduce the MemoryError without having to download my own data,
> with the following code:
>
> import scipy.sparse
> from sklearn.cluster import dbscan
>
> Y = scipy.sparse.rand(100000, 400, density=.01)
> dbscan(Y, eps=10, min_samples=10000, metric='l1')
>
> Also, here is the traceback I obtain after running the code. It seems that
> initializing a dense matrix of zeros of size O(n^2) is not such a good idea.
>
> Traceback (most recent call last):
>   File "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py", line 2885, in run_code
>     exec(code_obj, self.user_global_ns, self.user_ns)
>   File "<ipython-input-94-0e23204d7925>", line 1, in <module>
>     sklearn.cluster.dbscan(scipy.sparse.rand(100000,400,density=.01),metric='manhattan')
>   File "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\cluster\dbscan_.py", line 146, in dbscan
>     return_distance=False)
>   File "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\neighbors\base.py", line 609, in radius_neighbors
>     **self.effective_metric_params_)
>   File "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 1207, in pairwise_distances
>     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
>   File "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 1054, in _parallel_pairwise
>     return func(X, Y, **kwds)
>   File "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 516, in manhattan_distances
>     D = np.zeros((X.shape[0], Y.shape[0]))
> MemoryError
>
> Augustin LEFEVRE | Consultant Senior | Ykems
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general