
[MRG] Cemoody/bhtsne Barnes-Hut t-SNE #4025


Closed
wants to merge 10 commits into from

Conversation

cemoody
Contributor

@cemoody cemoody commented Dec 29, 2014

Introduction

This PR presents the Barnes-Hut implementation of t-SNE. t-SNE visualizes high-dimensional data by finding a low-dimensional embedding that attempts to preserve the pairwise similarities of the high-dimensional data. The Barnes-Hut algorithm, which astrophysicists use to perform N-body simulations, allows the t-SNE embedding to be computed in O(NlogN) time instead of O(N^2).

This effectively allows us to learn embeddings of data sets with millions of elements instead of tens of thousands.
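For context, a minimal usage sketch against the estimator interface this PR targets; the method and angle parameters shown below are the ones exposed by scikit-learn's TSNE, and the values are illustrative:

# Minimal sketch: embed a larger dataset with the Barnes-Hut approximation.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(10000, 50).astype(np.float32)

# method='barnes_hut' uses the O(NlogN) gradient; angle (theta) trades
# accuracy for speed, with 0.5 as a reasonable default.
tsne = TSNE(n_components=2, method='barnes_hut', angle=0.5,
            perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)  # shape (10000, 2)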

Example

An IPython Notebook is available where we've put together simple install instructions & a demo of both methods:
http://nbviewer.ipython.org/urls/gist.githubusercontent.com/cemoody/01135ef2f26837548360/raw/76ce7f0bac916a516501ead719b513b22430cad0/Barnes-Hut%20t-SNE%20Demo.ipynb

[image: bhtsne]

Performance

This compares the timings for the Barnes-Hut approximation against the current method (note the log y-axis).

[image: timing]

The following diagram shows the relative speedup between both versions:
[image: speedup]

TODO

  • Create fit & transform methods to update using new data
  • Add test for transform called before fit
  • Add test for skip_num_points
  • Raise error if method='barnes_hut' and using sparse data
  • Raise error if method='barnes_hut' and embedding dim > 3
  • Transform 64bit input floats to 32bit
  • Update the narrative documentation in doc/modules/manifold.rst
  • Add toy data to test perplexity
  • Fix failing tests
  • Expose tree consistency checks as unit tests
  • Fix memory leaks
  • Segfaults on data larger than 50k elements
  • PEP8 the code
  • Include usage documentation
  • Remove extra imports, print statements, pdb statements
  • Ensure python 3 works
  • Ensure output dimensionality works for 2D or 3D
  • Am I using memviews or returning full arrays appropriately?
  • Incorporate into SKLearn
  • Remove GIL as much as possible in Cython code
  • Add answer tests
  • Ensure old t-SNE tests work
  • Changed perplexity calculation to nearest neighbors
  • Changed positive gradient calc to nearest neighbors

Learn

  • You can read more about the technique in general here:

http://lvdmaaten.github.io/tsne/

  • The Barnes-Hut approximation is here:

http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf

@cemoody cemoody changed the title Cemoody/bhtsne Cemoody/bhtsne Barnes-Hut t-SNE Dec 29, 2014
# Implementation by Chris Moody & Nick Travers
# original code by Laurens van der Maaten
# for reference implementations and papers describing the technique:
# http://homepage.tudelft.nl/19j49/t-SNE.html
Member

When you say "original code", do you mean that this implementation is a direct translation in Cython of the original C++ code?

If so, this is a violation of the legal statement of the original implementation, which says in its README:

You are free to use, modify, or redistribute this software in any way you want, but only for
non-commercial purposes. The use of the software is at your own risk; the authors are not
responsible for any damage as a result from errors in the software. 

scikit-learn is a BSD-licensed project; it should only include compatibly licensed code. In particular, the non-commercial clause is not compatible.

If the implementation in this PR is not a derivative work there is no problem. If it was based on the original implementation instead of reimplemented from the paper, we need to ask the original author for authorization before being able to consider the inclusion of this contribution in scikit-learn.

Contributor Author

Initially I wrote it from scratch, following the paper. I used to do astrocomputing, where I'd written oct-tree codes before (which are similar to this Barnes-Hut approximation). I then compared results with LvdM's code and used those answers to debug my own. Where his code uses Vantage-Point trees we use BallTree, and there are other major structural differences. But I also changed some of the names of my methods and variables to mirror his (for example compute_non_edge_forces or sum_Q) for easier comparison & debugging. This PR is not a 'direct translation': I did not take a chunk of C++ code and try to rewrite it in Cython. But at the same time, I certainly read & understood LvdM's code and it helped speed up progress on this one.

So I'm not sure if that means it's derivative or not. To be safe, I'll
contact LvdM and ask for authorization.

chris

Contributor Author

I contacted LvdM and he's reminded everyone that all of his software actually is released under the FreeBSD license (as stated here)! It's not explicitly referenced in the source code, so I've asked him to make it a bit more clear, but I think this means we're OK to include this code from a legal standpoint.

Member

all of his software actually is released under the FreeBSD license

Indeed this is not clear for the Barnes-Hut t-SNE as the README of the distribution has a non-commercial clause that directly contradicts the BSD license.

But great then. Thanks for having checked with LvdM.

@ogrisel ogrisel changed the title Cemoody/bhtsne Barnes-Hut t-SNE [WIP] Cemoody/bhtsne Barnes-Hut t-SNE Dec 30, 2014
@ogrisel
Member

ogrisel commented Dec 30, 2014

I have changed the header to [WIP] (work in progress). Please don't forget to change it to [MRG] (meaning ready for merge review) to attract the attention of more reviewers once you are done with your to-do list.

@ogrisel
Member

ogrisel commented Dec 30, 2014

The narrative documentation will also need an update: http://scikit-learn.org/dev/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne . The source file is in doc/modules/manifold.rst. I will add a new todo item.

@ogrisel
Member

ogrisel commented Dec 30, 2014

The openmp flag cannot be enabled by default as it will not be buildable with clang (which is the default compiler under OSX) and probably not under the old versions of Microsoft Visual C++ required to build Python extensions for Python 2.7 under Windows. So if we really want to use openmp we should at least have something in setup.py to check that the compiler supports it and turn it off if it is not supported.

Also the openmp implementation in GCC (libgomp) will crash whenever it is used in conjunction with Python multiprocessing (which does fork without exec under POSIX by default). There is a patch to make libgomp robust to forks but it has not been reviewed yet. So unless openmp brings a significant speedup over serial execution, I would just remove parallel support altogether. If we decide to keep it, it should be turned off by default and only enabled with a specific parameter (such as n_jobs, commonly used in the rest of the library, despite the bad name).
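For what it's worth, a sketch of what such a setup.py probe could look like; the helper name and the test program are illustrative, not part of this PR:

# Sketch: probe whether the C compiler accepts -fopenmp before enabling it.
import os
import shutil
import tempfile
from distutils.ccompiler import new_compiler
from distutils.errors import CompileError, LinkError

OPENMP_TEST_C = """
#include <omp.h>
int main(void) {
#pragma omp parallel
    { }
    return omp_get_max_threads() > 0 ? 0 : 1;
}
"""

def compiler_supports_openmp():
    tmp = tempfile.mkdtemp()
    try:
        src = os.path.join(tmp, "test_openmp.c")
        with open(src, "w") as f:
            f.write(OPENMP_TEST_C)
        compiler = new_compiler()
        try:
            objects = compiler.compile([src], output_dir=tmp,
                                       extra_postargs=["-fopenmp"])
            compiler.link_executable(objects,
                                     os.path.join(tmp, "test_openmp"),
                                     extra_postargs=["-fopenmp"])
        except (CompileError, LinkError):
            return False
        return True
    finally:
        shutil.rmtree(tmp)

setup.py would then append -fopenmp to extra_compile_args/extra_link_args only when this returns True.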

@ogrisel
Member

ogrisel commented Dec 30, 2014

Do the tests pass on your box? On Travis there are a bunch of direct errors (e.g. missing attributes) that make me think that you have not pushed all the code you run locally (maybe the generated C files are not up to date?).

@ogrisel
Member

ogrisel commented Dec 30, 2014

I would also rename bhtsne.pyx to _barnes_hut_tsne.pyx for discoverability.

grad_output = np.array([[5.81128448e-05, -7.78033454e-06],
                        [-5.81526851e-05, 7.80976444e-06],
                        [4.24275173e-08, -3.69569698e-08],
                        [-2.58720939e-09, 7.52706374e-09]])
Member

How were those numerical values generated? Using the reference implementation? In any case, it should be made explicit in a comment.

@ogrisel
Member

ogrisel commented Dec 30, 2014

Could you add a test on some toy data that compares the perplexity of the 2 methods and checks that they are approximately similar?

@GaelVaroquaux
Member

Hey Christopher,

Thanks a lot for contributing this. I haven't had time to review it, but
it is exciting! I think that the Barnes-Hut t-SNE algorithm is very
important to us to scale up.

@cemoody
Contributor Author

cemoody commented Jan 5, 2015

Thanks @GaelVaroquaux and @ogrisel for your comments! @nicktrav and I are slowly working through our TODO list and all of @ogrisel's wonderful suggestions (thank you!) -- we're hoping to be ready for a fuller review in the coming week or two.

Uses Barnes-Hut tree methods to calculate the gradient that
runs in O(NlogN) instead of O(N^2)

Parameters
Member

The documentation of neighbors and verbose is missing here, too.

@cemoody
Contributor Author

cemoody commented Jan 6, 2015

Hi @GaelVaroquaux, @ogrisel & @AlexanderFabisch thanks for the comments! 11 hours ago I thought this was going to take a week or two, but I believe we're now ready for review :) I'll try to respond to everyone's comments all at once.

  • We've removed all OpenMP requirements -- the overhead of launching threads was bigger than the benefit, even for large datasets with 10k elements. Once we've figured out building scikit-learn with OpenMP we can revisit this question, but let's table the issue for the time being.
  • We've modified the narrative docs to reflect the new performance and the extra parameters. The change is minimal since the algorithm performs more or less the same, just faster.
  • Added toy data tests comparing the perplexity of the 2 methods (a sketch of the idea follows below).
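For reference, a minimal sketch of the kind of toy-data comparison meant here; the dataset, the tolerance, and the use of the kl_divergence_ attribute are illustrative assumptions, not the actual test code in this PR:

# Sketch: both methods should reach a similar final KL divergence on
# well-separated blobs (the tolerance below is an illustrative guess).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=200, n_features=10, centers=4, random_state=0)

kl = {}
for method in ("exact", "barnes_hut"):
    tsne = TSNE(n_components=2, perplexity=30, method=method, random_state=0)
    tsne.fit_transform(X)
    kl[method] = tsne.kl_divergence_

assert abs(kl["exact"] - kl["barnes_hut"]) < 0.2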

The rest of the changes we've incorporated immediately. Tests should now pass too, which I think means this is ready for merge review, so I've changed the title to [MRG].

cc @nicktrav

@cemoody cemoody changed the title [WIP] Cemoody/bhtsne Barnes-Hut t-SNE [MRG] Cemoody/bhtsne Barnes-Hut t-SNE Jan 6, 2015
@@ -0,0 +1,3 @@
#define DEBUGFLAG 0
Member

Is it really necessary to introduce a new file just for this? As _debugvars.h is only included in _barnes_hut_tsne.pyx I would rather put an inline definition of that constant. AFAIK there is no Cython keyword to define a const variable, but maybe a cdef enum would do. It needs to be benchmarked to make sure that it does not cause a performance regression, but I think it should be fine.

Contributor Author

Yeah, including an extra file just for compile-time debug flags is gnarly. Didn't think about enums before now, but using an enum is a much more elegant solution, and performance-wise seems to make no (significant) impact. Thanks -- I've made this change.

@AlexanderFabisch
Member

Could you add the performance comparison to examples/manifold and make sure that it does not take much longer than 10^1 seconds (n_samples=1000-2000)?
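A rough sketch of what such an example script could look like; the dataset and sizes are illustrative:

# Sketch for an examples/manifold-style benchmark comparing the run time
# of the exact and Barnes-Hut gradients on a modest dataset.
from time import time

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=2000, n_features=30, centers=10, random_state=0)

for method in ("exact", "barnes_hut"):
    start = time()
    TSNE(n_components=2, method=method, random_state=0).fit_transform(X)
    print("%-10s %6.1f s" % (method, time() - start))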

@@ -502,6 +502,7 @@ three parameters that control the optimization of t-SNE:
* early exaggeration factor
* learning rate
* maximum number of iterations
* angle
Member

The paragraph above needs an update: "there are three" => "there are four".

Member

Actually, angle does not really influence the optimization. It controls the approximation of the gradient. I would remove it completely and explain it in another paragraph.

Member

It is still an approximation of the true cost: while the default value of 0.5 or lower should not impact the results much, setting it to a high value should impact the quality of the embedding.

Member

Maybe "optimization of t-SNE" should be rephrased "optimization of t-SNE and therefore possibly the quality of the resulting embedding".

Contributor Author

Yep, agreed that high values of angle should negatively impact the quality of the embedding, and so it is as relevant as the learning rate or the number of iterations.

@ogrisel
Member

ogrisel commented Jun 24, 2015

I can reproduce your compile error with gcc 4.9.2-10ubuntu13 from Ubuntu 15.04. It looks like a Cython bug: when calling printf("some string\n") in Cython, it creates a local char*-typed variable that is not const, which causes -Werror=format-security to trigger a compiler error with gcc 4.9.

When calling printf("some string with a formatted value: %d\n", i), the generated C code has a const char* as the first argument and everything is fine.

@ogrisel
Member

ogrisel commented Jun 24, 2015

The following hack seems to work:

diff --git a/sklearn/manifold/_barnes_hut_tsne.pyx b/sklearn/manifold/_barnes_hut_tsne.pyx
index a5579b5..3978c80 100644
--- a/sklearn/manifold/_barnes_hut_tsne.pyx
+++ b/sklearn/manifold/_barnes_hut_tsne.pyx
@@ -14,6 +14,8 @@ from libc.math cimport sqrt, log
 cimport numpy as np
 import numpy as np

+cdef char* EMPTY_STRING = ""
+
 cdef extern from "math.h":
     float fabsf(float x) nogil

@@ -751,13 +753,19 @@ def gradient(float[:,:] pij_input,
     err = insert_many(qt, pos_output)
     assert err == 0, "[t-SNE] Insertion failed"
     if verbose > 10:
-        printf("[t-SNE] Computing gradient\n")
+        # XXX: format hack to workaround lack of `const char *` type
+        # in the generated C code that triggers error with gcc 4.9
+        # and -Werror=format-security
+        printf("[t-SNE] Computing gradient\n%s", EMPTY_STRING)
     sum_Q = compute_gradient(pij_input, pos_output, neighbors, forces,
                              qt.root_node, theta, dof, skip_num_points, -1)
     C = compute_error(pij_input, pos_output, neighbors, sum_Q, n_dimensions,
                       verbose)
     if verbose > 10:
-        printf("[t-SNE] Checking tree consistency \n")
+        # XXX: format hack to workaround lack of `const char *` type
+        # in the generated C code
+        # and -Werror=format-security
+        printf("[t-SNE] Checking tree consistency\n%s", EMPTY_STRING)
     cdef long count = count_points(qt.root_node, 0)
     m = ("Tree consistency failed: unexpected number of points=%i "
          "at root node=%i" % (count, qt.root_node.cumulative_size))

@ogrisel
Member

ogrisel commented Jun 24, 2015

There are other occurrences of the same issue in the file. The above patch does not fix them all.

@cemoody
Contributor Author

cemoody commented Jun 26, 2015

Alas, I was overly optimistic in how much time I'd have to work on this. Can we make a fresh TODO list of what this PR needs?

@amueller
Member

amueller commented Jul 1, 2015

@cemoody welcome to the club ;)
Maybe reset it to #4887 and fix the errors.

@ogrisel is the cython bug fixed in newer cython? Should we report?

I think having a sparse matrix variant would be nice but I would be happy to merge without it.

@amueller
Member

amueller commented Aug 3, 2015

I added @ogrisel's fix to #4887. However, the transform method can't work the way it is implemented now. I haven't really looked at it, but it currently assumes that the input dimension is the same as the output dimension somehow. It concatenates the embedding with the new data. I think it should just concatenate the training data with the new data, right? But then the tests should be tighter.

@amueller
Member

amueller commented Aug 3, 2015

Fun API issue: calling transform will change _X_fit, so calling transform again afterwards will change the results...

@amueller
Member

amueller commented Aug 3, 2015

Also, calling transform changes the embedding of the training data. I thought that was supposed to be avoided using skip_num_points.
transform also returns the transform of the training data plus the additional data, which breaks the API.
Maybe we should try to merge this without the transform? Or does someone have time to look into it?

@zachmayer
Contributor

Is this project still in progress? I'd really love to be able to use this new algorithm.

@amueller
Member

@zachmayer the algorithm is in master and will be in the imminent release. It uses dense data structures, though, so the memory requirements will be the same as before. There is no-one working on using sparse data structures atm afaik.

@zachmayer
Contributor

I can live with that! Awesome!

@cemoody
Contributor Author

cemoody commented Nov 5, 2015

Oh, I didn't realize it was in master! That's super awesome & I'm incredibly excited!!! Thanks @amueller for all of your hard work :)

@zachmayer
Contributor

@amueller Can you link to the PR/commit that puts it in master? Thanks!

@cemoody
Contributor Author

cemoody commented Nov 5, 2015

@zachmayer I think it's at this commit: fe09a07

@amueller
Member

amueller commented Nov 5, 2015

This one: #4887

I left your PR open, as I removed the "transform" method. It changed the training data, which is not really allowed. I think that might not have been what you intended, was it?

@jnothman
Member

I assume this should be closed?

@jnothman
Member

It can be reopened if that's appropriate.

@jnothman jnothman closed this Mar 11, 2016
@jamiis

jamiis commented May 26, 2016

@cemoody @amueller is there still ongoing work to optimize the memory consumption of calculating the pairwise distances? I'm getting a MemoryError on a 75,000x50 dataset.

@ogrisel
Member

ogrisel commented May 27, 2016

@jamiis AFAIK nobody is working on it at the moment. Do you want to give it a try?

@jlorince

jlorince commented Jun 1, 2016

In principle, shouldn't there be a way to avoid calculating the full distance matrix? The original author of t-SNE says "Barnes-Hut algorithms are intended to operate on large datasets for which you cannot keep such distance matrices in memory" (see lvdmaaten/bhtsne#7), so I'm wondering if the current approach in the sklearn version (starting by calculating the full distance matrix) is inherently flawed. Though honestly I'm not sure what the correct alternative is....

@AlexanderFabisch
Member

I think the best way to implement it is a sparse distance matrix where only the relevant entries are included. However, it is a little bit complicated to handle all special cases correctly, e.g. precomputed distance matrices.
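For illustration, a minimal sketch of that idea using existing scikit-learn pieces (not code from this PR): only the k nearest neighbors per point are needed as input to the Barnes-Hut gradient, so a sparse neighbors graph can replace the dense pairwise matrix.

# Sketch: store only k-nearest-neighbor distances as a sparse CSR matrix,
# i.e. O(N*k) memory, instead of the full O(N^2) pairwise distance matrix.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).randn(10000, 50).astype(np.float32)
k = 90  # e.g. ~3 * perplexity for perplexity=30

nn = NearestNeighbors(n_neighbors=k).fit(X)
distances_csr = nn.kneighbors_graph(X, mode='distance')  # (N, N) with ~N*k stored entries

The tricky part, as noted above, is making the rest of the t-SNE code (conditional probabilities, precomputed metrics, etc.) accept such a sparse matrix consistently.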

@ghost

ghost commented Jul 26, 2016

Any progress? I'm unable to find an implementation that lets you both work on large datasets and vary the number of iterations.

@amueller
Member

Moving the discussion to #7089. PR welcome ;)

@tomMoral tomMoral mentioned this pull request Jun 7, 2017