
[MRG] Cemoody/bhtsne Barnes-Hut t-SNE #4025


Closed
wants to merge 10 commits into from

Conversation

cemoody
Contributor

@cemoody cemoody commented Dec 29, 2014

Introduction

This PR presents the Barnes-Hut implementation of t-SNE. t-SNE visualizes high-dimensional data by finding a low-dimensional embedding that attempts to preserve the pairwise similarities of the high-dimensional data. The Barnes-Hut algorithm, which astrophysicists use to perform N-body simulations, allows the t-SNE embedding to be computed in O(NlogN) time instead of O(N^2).

This effectively allows us to learn embeddings of data sets with millions of elements instead of tens of thousands.
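For context, a minimal usage sketch against the estimator interface this PR targets; the method and angle parameters shown below are the ones exposed by scikit-learn's TSNE, and the values are illustrative:

# Minimal sketch: embed a larger dataset with the Barnes-Hut approximation.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(10000, 50).astype(np.float32)

# method='barnes_hut' uses the O(NlogN) gradient; angle (theta) trades
# accuracy for speed, with 0.5 as a reasonable default.
tsne = TSNE(n_components=2, method='barnes_hut', angle=0.5,
            perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)  # shape (10000, 2)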

Example

An IPython Notebook is available where we've put together simple install instructions & a demo of both methods:
http://nbviewer.ipython.org/urls/gist.githubusercontent.com/cemoody/01135ef2f26837548360/raw/76ce7f0bac916a516501ead719b513b22430cad0/Barnes-Hut%20t-SNE%20Demo.ipynb

[image: bhtsne]

Performance

This compares the timings for the Barnes-Hut approximation against the current method (note the log y-axis).

[image: timing]

The following diagram shows the relative speedup between both versions:
[image: speedup]

TODO

  • Create fit & transform methods to update using new data
  • Add test for transform called before fit
  • Add test for skip_num_points
  • Raise error if method='barnes_hut' and using sparse data
  • Raise error if method='barnes_hut' and embedding dim > 3
  • Transform 64bit input floats to 32bit
  • Update the narrative documentation in doc/modules/manifold.rst
  • Add toy data to test perplexity
  • Fix failing tests
  • Expose tree consistency checks as unit tests
  • Fix memory leaks
  • Segfaults on data larger than 50k elements
  • PEP8 the code
  • Include usage documentation
  • Remove extra imports, print statements, pdb statements
  • Ensure python 3 works
  • Ensure output dimensionality works for 2D or 3D
  • Am I using memviews or returning full arrays appropriately?
  • Incorporate into SKLearn
  • Remove GIL as much as possible in Cython code
  • Add answer tests
  • Ensure old t-SNE tests work
  • Changed perplexity calculation to nearest neighbors
  • Changed positive gradient calc to nearest neighbors

Learn

  • You can read more about the technique in general here:

http://lvdmaaten.github.io/tsne/

  • The Barnes-Hut approximation is here:

http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf

@cemoody cemoody changed the title Cemoody/bhtsne Cemoody/bhtsne Barnes-Hut t-SNE Dec 29, 2014
# Implementation by Chris Moody & Nick Travers
# original code by Laurens van der Maaten
# for reference implementations and papers describing the technique:
# http://homepage.tudelft.nl/19j49/t-SNE.html
Member

When you say "original code", do you mean that this implementation is a direct translation in Cython of the original C++ code?

If so, this is a violation of the legal statement of the original implementation, which says in its README:

You are free to use, modify, or redistribute this software in any way you want, but only for
non-commercial purposes. The use of the software is at your own risk; the authors are not
responsible for any damage as a result from errors in the software. 

scikit-learn is a BSD-licensed project; it should only include compatibly licensed code. In particular, the non-commercial clause is not compatible.

If the implementation in this PR is not a derivative work there is no problem. If it was based on the original implementation instead of reimplemented from the paper, we need to ask the original author for authorization before being able to consider the inclusion of this contribution in scikit-learn.

Contributor Author

Initially I wrote it from scratch, following the paper. I used to do astrocomputing, where I'd written oct-tree codes before (which are similar to this Barnes-Hut approximation). I then compared results with LvdM's code and used those answers to debug my own. Where his code uses Vantage-Point trees we use BallTree, and there are other major structural differences. But I also changed some of the names of my methods and variables to mirror his (for example compute_non_edge_forces or sum_Q) for easier comparison & debugging. This PR is not a 'direct translation': I did not take a chunk of C++ code and try to rewrite it in Cython. But at the same time, I certainly read & understood LvdM's code and it helped speed up progress on this one.

So I'm not sure if that means it's derivative or not. To be safe, I'll
contact LvdM and ask for authorization.

chris

Contributor Author

I contacted LvdM and he's reminded everyone that all of his software actually is released under the FreeBSD license (as stated here)! It's not explicitly referenced in the source code, so I've asked him to make it a bit more clear, but I think this means we're OK to include this code from a legal standpoint.

Member

all of his software actually is released under the FreeBSD license

Indeed this is not clear for the Barnes-Hut t-SNE as the README of the distribution has a non-commercial clause that directly contradicts the BSD license.

But great then. Thanks for having checked with LvdM.

@ogrisel ogrisel changed the title Cemoody/bhtsne Barnes-Hut t-SNE [WIP] Cemoody/bhtsne Barnes-Hut t-SNE Dec 30, 2014
@ogrisel
Member

ogrisel commented Dec 30, 2014

I have changed the header to [WIP] (work in progress). Please don't forget to change it to [MRG] (meaning ready for merge review) to attract the attention of more reviewers once you are done with your to-do list.

@ogrisel
Member

ogrisel commented Dec 30, 2014

The narrative documentation will also need an update: http://scikit-learn.org/dev/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne . The source file is in doc/modules/manifold.rst. I will add a new todo item.

@ogrisel
Member

ogrisel commented Dec 30, 2014

The openmp flag cannot be enabled by default as it will not be buildable with clang (which is the default compiler under OSX) and probably not under the old versions of Microsoft Visual C++ required to build Python extensions for Python 2.7 under Windows. So if we really want to use openmp we should at least have something in setup.py to check that the compiler supports it and turn it off if it is not supported.

Also the openmp implementation in GCC (libgomp) will crash whenever it is used in conjunction with Python multiprocessing (which does fork without exec under POSIX by default). There is a patch to make libgomp robust to forks but it has not been reviewed yet. So unless openmp brings a significant speedup over serial execution, I would just remove parallel support altogether. If we decide to keep it, it should be turned off by default and only enabled with a specific parameter (such as n_jobs, commonly used in the rest of the library, despite the bad name).
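For what it's worth, a sketch of what such a setup.py probe could look like; the helper name and the test program are illustrative, not part of this PR:

# Sketch: probe whether the C compiler accepts -fopenmp before enabling it.
import os
import shutil
import tempfile
from distutils.ccompiler import new_compiler
from distutils.errors import CompileError, LinkError

OPENMP_TEST_C = """
#include <omp.h>
int main(void) {
#pragma omp parallel
    { }
    return omp_get_max_threads() > 0 ? 0 : 1;
}
"""

def compiler_supports_openmp():
    tmp = tempfile.mkdtemp()
    try:
        src = os.path.join(tmp, "test_openmp.c")
        with open(src, "w") as f:
            f.write(OPENMP_TEST_C)
        compiler = new_compiler()
        try:
            objects = compiler.compile([src], output_dir=tmp,
                                       extra_postargs=["-fopenmp"])
            compiler.link_executable(objects,
                                     os.path.join(tmp, "test_openmp"),
                                     extra_postargs=["-fopenmp"])
        except (CompileError, LinkError):
            return False
        return True
    finally:
        shutil.rmtree(tmp)

setup.py would then append -fopenmp to extra_compile_args/extra_link_args only when this returns True.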

@ogrisel
Member

ogrisel commented Dec 30, 2014

Do the tests pass on your box? On Travis there are a bunch of direct errors (e.g. missing attributes) that make me think that you have not pushed all the code you run locally (maybe the generated C files are not up to date?).

@ogrisel
Member

ogrisel commented Dec 30, 2014

I would also rename bhtsne.pyx to _barnes_hut_tsne.pyx for discoverability.

grad_output = np.array([[5.81128448e-05, -7.78033454e-06],
                        [-5.81526851e-05, 7.80976444e-06],
                        [4.24275173e-08, -3.69569698e-08],
                        [-2.58720939e-09, 7.52706374e-09]])
Member

How were those numerical values generated? Using the reference implementation? In any case, it should be made explicit in a comment.

@ogrisel
Member

ogrisel commented Dec 30, 2014

Could you add a test on some toy data that compares the perplexity of the 2 methods and checks that they are approximately similar?

@GaelVaroquaux
Member

Hey Christopher,

Thanks a lot for contributing this. I haven't had time to review it, but
it is exciting! I think that the Barnes-Hut t-SNE algorithm is very
important to us to scale up.

@cemoody
Contributor Author

cemoody commented Jan 5, 2015

Thanks @GaelVaroquaux and @ogrisel for your comments! @nicktrav and I are slowly working through our TODO list and all of @ogrisel's wonderful suggestions (thank you!) -- we're hoping to be ready for a fuller review in the coming week or two.

Uses Barnes-Hut tree methods to calculate the gradient that
runs in O(NlogN) instead of O(N^2)

Parameters
Member

The documentation of neighbors and verbose is missing here, too.

@cemoody
Contributor Author

cemoody commented Jan 6, 2015

Hi @GaelVaroquaux, @ogrisel & @AlexanderFabisch thanks for the comments! 11 hours ago I thought this was going to take a week or two, but I believe we're now ready for review :) I'll try to respond to everyone's comments all at once.

  • We've removed all OpenMP requirements -- the overhead of launching threads was bigger than the benefit, even for large datasets with 10k elements. Once we've figured out building scikit-learn with OpenMP we can revisit this question, but let's table the issue for the time being.
  • We've modified the narrative docs to reflect the new performance and the extra parameters. The change is minimal since the algorithm performs more or less the same, just faster.
  • Added toy data tests comparing the perplexity of the 2 methods (a sketch of the idea follows below).
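For reference, a minimal sketch of the kind of toy-data comparison meant here; the dataset, the tolerance, and the use of the kl_divergence_ attribute are illustrative assumptions, not the actual test code in this PR:

# Sketch: both methods should reach a similar final KL divergence on
# well-separated blobs (the tolerance below is an illustrative guess).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=200, n_features=10, centers=4, random_state=0)

kl = {}
for method in ("exact", "barnes_hut"):
    tsne = TSNE(n_components=2, perplexity=30, method=method, random_state=0)
    tsne.fit_transform(X)
    kl[method] = tsne.kl_divergence_

assert abs(kl["exact"] - kl["barnes_hut"]) < 0.2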

The rest of the changes we've incorporated immediately. Tests should now pass too, which I think means this is ready for merge review, so I've changed the title to [MRG].

cc @nicktrav

@cemoody cemoody changed the title [WIP] Cemoody/bhtsne Barnes-Hut t-SNE [MRG] Cemoody/bhtsne Barnes-Hut t-SNE Jan 6, 2015
@@ -0,0 +1,3 @@
#define DEBUGFLAG 0
Member

Is it really necessary to introduce a new file just for this? As _debugvars.h is only included in _barnes_hut_tsne.pyx I would rather put an inline definition of that constant. AFAIK there is no Cython keyword to define a const variable, but maybe a cdef enum would do. It needs to be benchmarked to make sure that it does not cause a performance regression, but I think it should be fine.

Contributor Author

Yeah, including an extra file just for compile-time debug flags is gnarly. Didn't think about enums before now, but using an enum is a much more elegant solution, and performance-wise seems to make no (significant) impact. Thanks -- I've made this change.

@AlexanderFabisch
Member

Could you add the performance comparison to examples/manifold and make sure that it does not take much longer than 10^1 seconds (n_samples=1000-2000)?
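A rough sketch of what such an example script could look like; the dataset and sizes are illustrative:

# Sketch for an examples/manifold-style benchmark comparing the run time
# of the exact and Barnes-Hut gradients on a modest dataset.
from time import time

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=2000, n_features=30, centers=10, random_state=0)

for method in ("exact", "barnes_hut"):
    start = time()
    TSNE(n_components=2, method=method, random_state=0).fit_transform(X)
    print("%-10s %6.1f s" % (method, time() - start))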

@@ -502,6 +502,7 @@ three parameters that control the optimization of t-SNE:
* early exaggeration factor
* learning rate
* maximum number of iterations
* angle
Member

The paragraph above needs an update: "there are three" => "there are four".

Member

Actually, angle does not really influence the optimization. It controls the approximation of the gradient. I would remove it completely and explain it in another paragraph.

Member

It is still an approximation of the true cost: while the default value of 0.5 or lower should not impact the results much, setting it to a high value should impact the quality of the embedding.

Member

Maybe "optimization of t-SNE" should be rephrased "optimization of t-SNE and therefore possibly the quality of the resulting embedding".

Contributor Author

Yep, agreed that high values of angle should negatively impact the quality of the embedding, and so it is as relevant as the learning rate or the number of iterations.

@ogrisel
Member

ogrisel commented Jun 24, 2015

I can reproduce your compile error with gcc 4.9.2-10ubuntu13 from Ubuntu 15.04. It looks like a Cython bug: when calling printf("some string\n") in Cython, it creates a local char*-typed variable that is not const, which causes -Werror=format-security to trigger a compiler error with gcc 4.9.

When calling printf("some string with a formatted value: %d\n", i), the generated C code has a const char* as the first argument and everything is fine.

@ogrisel
Member

ogrisel commented Jun 24, 2015

The following hack seems to work:

diff --git a/sklearn/manifold/_barnes_hut_tsne.pyx b/sklearn/manifold/_barnes_hut_tsne.pyx
index a5579b5..3978c80 100644
--- a/sklearn/manifold/_barnes_hut_tsne.pyx
+++ b/sklearn/manifold/_barnes_hut_tsne.pyx
@@ -14,6 +14,8 @@ from libc.math cimport sqrt, log
 cimport numpy as np
 import numpy as np

+cdef char* EMPTY_STRING = ""
+
 cdef extern from "math.h":
     float fabsf(float x) nogil

@@ -751,13 +753,19 @@ def gradient(float[:,:] pij_input,
     err = insert_many(qt, pos_output)
     assert err == 0, "[t-SNE] Insertion failed"
     if verbose > 10:
-        printf("[t-SNE] Computing gradient\n")
+        # XXX: format hack to workaround lack of `const char *` type
+        # in the generated C code that triggers error with gcc 4.9
+        # and -Werror=format-security
+        printf("[t-SNE] Computing gradient\n%s", EMPTY_STRING)
     sum_Q = compute_gradient(pij_input, pos_output, neighbors, forces,
                              qt.root_node, theta, dof, skip_num_points, -1)
     C = compute_error(pij_input, pos_output, neighbors, sum_Q, n_dimensions,
                       verbose)
     if verbose > 10:
-        printf("[t-SNE] Checking tree consistency \n")
+        # XXX: format hack to workaround lack of `const char *` type
+        # in the generated C code
+        # and -Werror=format-security
+        printf("[t-SNE] Checking tree consistency\n%s", EMPTY_STRING)
     cdef long count = count_points(qt.root_node, 0)
     m = ("Tree consistency failed: unexpected number of points=%i "
          "at root node=%i" % (count, qt.root_node.cumulative_size))

@ogrisel
Member

ogrisel commented Jun 24, 2015

There are other occurrences of the same issue in the file. The above patch does not fix them all.

@cemoody
Contributor Author

cemoody commented Jun 26, 2015

Alas, I was overly optimistic in how much time I'd have to work on this. Can we make a fresh TODO list of what this PR needs?

@amueller
Member

amueller commented Jul 1, 2015

@cemoody welcome to the club ;)
Maybe reset it to #4887 and fix the errors.

@ogrisel is the cython bug fixed in newer cython? Should we report?

I think having a sparse matrix variant would be nice but I would be happy to merge without it.

@amueller
Member

amueller commented Aug 3, 2015

I added @ogrisel's fix to #4887. However, the transform method can't work the way it is implemented now. I haven't really looked at it, but it currently assumes that the input dimension is the same as the output dimension somehow. It concatenates the embedding with the new data. I think it should just concatenate the training data with the new data, right? But then the tests should be tighter.

@amueller
Member

amueller commented Aug 3, 2015

Fun API issue: calling transform will change _X_fit, so calling transform again afterwards will change the results...

@amueller
Member

amueller commented Aug 3, 2015

Also, calling transform changes the embedding of the training data. I thought that was supposed to be avoided using skip_num_points.
transform also returns the transform of the training data plus the additional data, which breaks the API.
Maybe we should try to merge this without the transform? Or does someone have time to look into it?

@zachmayer
Contributor

Is this project still in progress? I'd really love to be able to use this new algorithm.

@amueller
Member

@zachmayer the algorithm is in master and will be in the imminent release. It uses dense data structures, though, so the memory requirements will be the same as before. There is no-one working on using sparse data structures atm afaik.

@zachmayer
Contributor

I can live with that! Awesome!

@cemoody
Contributor Author

cemoody commented Nov 5, 2015

Oh, I didn't realize it was in master! That's super awesome & I'm incredibly excited!!! Thanks @amueller for all of your hard work :)

@zachmayer
Contributor

@amueller Can you link to the PR/commit that puts it in master? Thanks!

@cemoody
Contributor Author

cemoody commented Nov 5, 2015

@zachmayer I think it's at this commit: fe09a07

@amueller
Member

amueller commented Nov 5, 2015

This one: #4887

I left your PR open, as I removed the "transform" method. It changed the training data, which is not really allowed. I think that might not have been what you intended, was it?

@jnothman
Member

I assume this should be closed?

@jnothman
Member

It can be reopened if that's appropriate.

@jnothman jnothman closed this Mar 11, 2016
@jamiis

jamiis commented May 26, 2016

@cemoody @amueller is there still ongoing work to optimize the memory consumption of calculating the pairwise distances? I'm getting a MemoryError on a 75,000x50 dataset.

@ogrisel
Member

ogrisel commented May 27, 2016

@jamiis AFAIK nobody is working on it at the moment. Do you want to give it a try?

@jlorince

jlorince commented Jun 1, 2016

In principle, shouldn't there be a way to avoid calculating the full distance matrix? The original author of t-SNE says "Barnes-Hut algorithms are intended to operate on large datasets for which you cannot keep such distance matrices in memory" (see lvdmaaten/bhtsne#7), so I'm wondering if the current approach in the sklearn version (starting by calculating the full distance matrix) is inherently flawed. Though honestly I'm not sure what the correct alternative is....

@AlexanderFabisch
Member

I think the best way to implement it is a sparse distance matrix where only the relevant entries are included. However, it is a little bit complicated to handle all special cases correctly, e.g. precomputed distance matrices.
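For illustration, a minimal sketch of that idea using existing scikit-learn pieces (not code from this PR): only the k nearest neighbors per point are needed as input to the Barnes-Hut gradient, so a sparse neighbors graph can replace the dense pairwise matrix.

# Sketch: store only k-nearest-neighbor distances as a sparse CSR matrix,
# i.e. O(N*k) memory, instead of the full O(N^2) pairwise distance matrix.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).randn(10000, 50).astype(np.float32)
k = 90  # e.g. ~3 * perplexity for perplexity=30

nn = NearestNeighbors(n_neighbors=k).fit(X)
distances_csr = nn.kneighbors_graph(X, mode='distance')  # (N, N) with ~N*k stored entries

The tricky part, as noted above, is making the rest of the t-SNE code (conditional probabilities, precomputed metrics, etc.) accept such a sparse matrix consistently.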

@ghost

ghost commented Jul 26, 2016

Any progress? I'm unable to find an implementation that lets you both work on large datasets and vary the number of iterations.

@amueller
Member

Moving the discussion to #7089. PR welcome ;)

@tomMoral tomMoral mentioned this pull request Jun 7, 2017