[MRG] Cemoody/bhtsne Barnes-Hut t-SNE #4025
Conversation
# Implementation by Chris Moody & Nick Travers
# original code by Laurens van der Maaten
# for reference implementations and papers describing the technique:
# http://homepage.tudelft.nl/19j49/t-SNE.html
When you say "original code", do you mean that this implementation is a direct translation into Cython of the original C++ code?
If so, this is a violation of the legal statement of the original implementation, which says in the README:
You are free to use, modify, or redistribute this software in any way you want, but only for
non-commercial purposes. The use of the software is at your own risk; the authors are not
responsible for any damage as a result from errors in the software.
scikit-learn is a BSD licensed project, so it should only include compatibly licensed code. In particular, the non-commercial exclusion is not compatible.
If the implementation in this PR is not a derivative work, there is no problem. If it was based on the original implementation instead of being reimplemented from the paper, then we need to ask the original author for authorization before we can consider including this contribution in scikit-learn.
Initially I wrote it from scratch, following the paper. I used to do astrocomputing, where I'd written oct-tree codes before (which are similar to this Barnes-Hut approximation). I then compared results with LvdM's code and used those answers to debug my own. Where his code uses Vantage-Point trees we use BallTree, and there are other major structural differences. But I also changed some of the names of my methods and variables to mirror his (for example compute_non_edge_forces or sum_Q) for easier comparison & debugging. This PR is not a 'direct translation': I did not take a chunk of C++ code and try to rewrite it in Cython. But at the same time, I certainly read & understood LvdM's code and it helped speed up progress on this one.
So I'm not sure if that means it's derivative or not. To be safe, I'll contact LvdM and ask for authorization.
chris
On Mon, Dec 29, 2014 at 12:51 AM, Olivier Grisel notifications@github.com wrote:
In sklearn/manifold/bhtsne.pyx (#4025 diff):
+# distutils: extra_compile_args = -fopenmp
+# distutils: extra_link_args = -fopenmp
+# cython: boundscheck=False
+# cython: wraparound=False
+# cython: cdivision=True
+import numpy as np
+from cython.parallel cimport prange, parallel
+from libc.stdio cimport printf
+cimport numpy as np
+cimport cython
+cimport openmp
+
+# Implementation by Chris Moody & Nick Travers
+# original code by Laurens van der Maaten
+# for reference implementations and papers describing the technique:
+# http://homepage.tudelft.nl/19j49/t-SNE.html
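For concreteness, a minimal sketch of the kind of BallTree neighbor lookup described above — illustrative only, using sklearn's public BallTree API rather than the PR's actual code:

import numpy as np
from sklearn.neighbors import BallTree

X = np.random.RandomState(0).randn(200, 10)
tree = BallTree(X)
# Ask for k+1 neighbors because each point's nearest neighbor is itself.
dist, ind = tree.query(X, k=31)
dist, ind = dist[:, 1:], ind[:, 1:]  # drop the self-neighbor column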
I contacted LvdM and he's reminded everyone that all of his software actually is released under the FreeBSD license (as stated here)! It's not explicitly referenced in the source code, so I've asked him to make it a bit more clear, but I think this means we're OK to include this code from a legal standpoint.
all of his software actually is released under the FreeBSD license
Indeed this is not clear for the Barnes-Hut t-SNE as the README of the distribution has a non-commercial clause that directly contradicts the BSD license.
But great then. Thanks for having checked with LvdM.
I have changed the header to
The narrative documentation will also need an update: http://scikit-learn.org/dev/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne . The source file is in
The openmp flag cannot be enabled by default as it will not be buildable with clang (which is the default compiler under OSX) and probably not under the old versions of Microsoft Visual C++ required to build Python extensions for Python 2.7 under Windows. So if we really want to use openmp we should at least have something in setup.py to check that the compiler supports it and turn it off if it is not supported. Also, the openmp implementation in GCC (libgomp) will crash whenever it is used in conjunction with Python multiprocessing (which does fork without exec under posix by default). There is a patch to make libgomp robust to forks but it has not been reviewed yet. So unless openmp brings a significant speedup over serial execution, I would just remove parallel support altogether. If we decide to keep it, it should be turned off by default and only enabled with a specific parameter (such as
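Something in setup.py along these lines could work — a minimal sketch (the helper name and wiring are hypothetical, not scikit-learn's actual build code) that probes the compiler for OpenMP support and falls back to a serial build:

import os
import tempfile
from distutils.ccompiler import new_compiler
from distutils.errors import CompileError, LinkError

def compiler_supports_openmp():
    """Try to compile and link a trivial OpenMP program; False on failure."""
    code = "#include <omp.h>\nint main(void) { return omp_get_max_threads(); }\n"
    tmpdir = tempfile.mkdtemp()
    src = os.path.join(tmpdir, "check_openmp.c")
    with open(src, "w") as f:
        f.write(code)
    cc = new_compiler()
    try:
        objs = cc.compile([src], output_dir=tmpdir, extra_postargs=["-fopenmp"])
        cc.link_executable(objs, os.path.join(tmpdir, "check_openmp"),
                           extra_postargs=["-fopenmp"])
    except (CompileError, LinkError):
        return False
    return True

openmp_args = ["-fopenmp"] if compiler_supports_openmp() else []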
Do the tests pass on your box? On travis there is a bunch of direct errors (e.g. missing attributes) that make me think that you have not pushed all the code you run locally (maybe the generated C files are not up to date?).
I would also rename
grad_output = np.array([[5.81128448e-05, -7.78033454e-06],
                        [-5.81526851e-05, 7.80976444e-06],
                        [4.24275173e-08, -3.69569698e-08],
                        [-2.58720939e-09, 7.52706374e-09]])
How were those numerical values generated? Using the reference implementation? In any case it should be made explicit in a comment.
Could you add a test on some toy data that compares the perplexity of the two methods and checks that they are approximately similar?
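For example, such a toy-data check could look roughly like this (a sketch using the method/angle parameter names proposed in this PR and the estimator's kl_divergence_ attribute; the tolerance is purely illustrative):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(50, 5)
exact = TSNE(method='exact', random_state=0).fit(X)
approx = TSNE(method='barnes_hut', angle=0.5, random_state=0).fit(X)
# Both runs should converge to KL divergences of comparable magnitude.
assert abs(exact.kl_divergence_ - approx.kl_divergence_) < 0.5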
Hey Christopher, Thanks a lot for contributing this. I haven't had time to review it, but
Thanks @GaelVaroquaux and @ogrisel for your comments! @nicktrav and I are slowly working through our TODO list and all of @ogrisel's wonderful suggestions (thank you!) -- we're hoping to be ready for a fuller review in the coming week or two.
Uses Barnes-Hut tree methods to calculate the gradient that
runs in O(N log N) instead of O(N^2)

Parameters
The documentation of neighbors and verbose is missing here, too.
(force-pushed from b66db51 to 2bd85e7)
Hi @GaelVaroquaux, @ogrisel & @AlexanderFabisch, thanks for the comments! 11 hours ago I thought this was going to take a week or two, but I believe we're now ready for review :) I'll try to respond to everyone's comments all at once.
The rest of the changes we've incorporated immediately. Tests should now pass too, which I think means this is ready for merge review, so I've changed the title to [MRG]. cc @nicktrav
@@ -0,0 +1,3 @@
#define DEBUGFLAG 0 |
Is it really necessary to introduce a new file just for this? As _debugvars.h is only included in _barnes_hut_tsne.pyx, I would rather put an inline definition of that constant. AFAIK there is no Cython keyword to declare a const variable, but maybe a cdef enum would do. It needs to be benchmarked to make sure that it does not cause a performance regression, but I think it should be fine.
Yeah, including an extra file just for compile-time debug flags is gnarly. Didn't think about enums before now, but using an enum is a much more elegant solution, and performance-wise it seems to make no (significant) impact. Thanks -- I've made this change.
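For reference, the trick looks roughly like this (a Cython sketch with illustrative names, not the exact PR code); the anonymous cdef enum yields a compile-time integer constant, so the debug branch folds away:

from libc.stdio cimport printf

cdef enum:
    DEBUGFLAG = 0   # flip to 1 locally when debugging

cdef void report_error(float err) nogil:
    if DEBUGFLAG:   # constant condition, eliminated at compile time
        printf("error: %f\n", err)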
Could you add the performance comparison to
@@ -502,6 +502,7 @@ three parameters that control the optimization of t-SNE:
* early exaggeration factor
* learning rate
* maximum number of iterations
* angle
The paragraph above needs an update: "there are three" => "there are four".
Actually, angle does not really influence the optimization. It controls the approximation of the gradient. I would remove it completely and explain it in another paragraph.
It is still an approximation of the true cost: while the default value of 0.5 or lower should not impact the results much, setting it to a high value should impact the quality of the embedding.
Maybe "optimization of t-SNE" should be rephrased as "optimization of t-SNE and therefore possibly the quality of the resulting embedding".
Yep, agreed that high values of angle should negatively impact the quality of the embedding, and so it is as relevant as the learning rate or the number of iterations.
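At the usage level the trade-off looks like this (a sketch; parameter names as proposed in this PR):

from sklearn.manifold import TSNE

coarse = TSNE(method='barnes_hut', angle=0.8)  # faster, rougher gradient approximation
fine = TSNE(method='barnes_hut', angle=0.2)    # slower, closer to the exact gradient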
I can reproduce your compile error with gcc 4.9.2-10ubuntu13 from ubuntu 15.04. It looks like a Cython bug: when calling
The following hack seems to work:
There are other occurrences of the same issue in the file. The above patch does not fix them all.
Alas, I was overly optimistic in how much time I'd have to work on this. Can we make a fresh TODO list of what this PR needs?
I added @ogrisel's fix to #4887. However, the
Fun API issue: calling
Also, calling transform changes the embedding of the training data. I thought that was supposed to be avoided using
Is this project still in progress? I'd really love to be able to use this new algorithm.
@zachmayer the algorithm is in master and will be in the imminent release. It uses dense data structures, though, so the memory requirements will be the same as before. There is no-one working on using sparse data structures atm afaik.
I can live with that! Awesome!
Oh, I didn't realize it was in master! That's super awesome & I'm incredibly excited!!! Thanks @amueller for all of your hard work :)
@amueller Can you link to the PR/commit that puts it in master? Thanks!
@zachmayer I think it's at this commit: fe09a07
This one: #4887. I left your PR open, as I removed the "transform" method. It changed the training data, which is not really allowed. I think that might not have been what you intended, was it?
I assume this should be closed?
It can be reopened if that's appropriate.
@jamiis AFAIK nobody is working on it at the moment. Do you want to give it a try?
In principle, shouldn't there be a way to avoid calculating the full distance matrix? The original author of t-SNE says "Barnes-Hut algorithms are intended to operate on large datasets for which you cannot keep such distance matrices in memory" (see lvdmaaten/bhtsne#7), so I'm wondering if the current approach in the sklearn version (starting by calculating the full distance matrix) is inherently flawed. Though honestly I'm not sure what the correct alternative is....
I think the best way to implement it is a sparse distance matrix where only the relevant entries are included. However, it is a little bit complicated to handle all special cases correctly, e.g. precomputed distance matrices.
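To make the sparse-neighborhood idea concrete, here is a sketch using sklearn's public kneighbors_graph (how this would be wired into t-SNE, including the precomputed-distance case, is the open question):

import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).randn(1000, 10)
# CSR matrix storing ~N*k distances instead of the full N**2 matrix.
D_sparse = kneighbors_graph(X, n_neighbors=30, mode='distance')
print(D_sparse.nnz)  # 30000 stored distances vs 1,000,000 dense entries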
Any progress? I'm unable to find an implementation that lets you both work on large datasets and vary the number of iterations.
Moving the discussion to #7089. PR welcome ;)
Introduction
This PR presents the Barnes-Hut implementation of t-SNE. t-SNE is used to visualize high-dimensional data in a low-dimensional embedding that attempts to preserve the pairwise high-dimensional similarities. The Barnes-Hut algorithm, which is used by astrophysicists to perform N-body simulations, allows the calculation of the t-SNE embedding in O(N log N) time instead of O(N^2).
This effectively allows us to learn embeddings of data sets with millions of elements instead of tens of thousands.
Example
An IPython Notebook is available where we've put together simple install instructions & a demo of both methods:
http://nbviewer.ipython.org/urls/gist.githubusercontent.com/cemoody/01135ef2f26837548360/raw/76ce7f0bac916a516501ead719b513b22430cad0/Barnes-Hut%20t-SNE%20Demo.ipynb
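For a quick inline taste, usage looks like this (a sketch with the parameter names proposed in this PR):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(42).randn(5000, 50)
embedding = TSNE(method='barnes_hut', angle=0.5).fit_transform(X)
print(embedding.shape)  # (5000, 2)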
Performance
This compares the timings for the Barnes-Hut approximation against the current method (note the log y-axis).
The following diagram shows the relative speedup between both versions:

TODO
Learn
http://lvdmaaten.github.io/tsne/
http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf