[MRG] Cemoody/bhtsne Barnes-Hut t-SNE #4025

Closed
wants to merge 10 commits into from
4 changes: 2 additions & 2 deletions doc/conf.py
@@ -72,9 +72,9 @@
# built documents.
#
# The short X.Y version.
version = '0.16.dev0'
# The full version, including alpha/beta/rc tags.
import sklearn
version = sklearn.__version__
# The full version, including alpha/beta/rc tags.
release = sklearn.__version__

# The language for content autogenerated by Sphinx. Refer to documentation
4 changes: 2 additions & 2 deletions doc/documentation.rst
@@ -2,7 +2,7 @@

<div class="container-index">

Documentation of scikit-learn 0.16.dev0
Documentation of scikit-learn 0.17.dev0
=======================================

.. raw:: html
@@ -29,7 +29,7 @@ Documentation of scikit-learn 0.16.dev0
<h2>Other Versions</h2>
<ul>
<li><a href="http://scikit-learn.org/stable/user_guide.html">scikit-learn 0.15 (stable)</a></li>
<li>scikit-learn 0.16 (development)</li>
<li>scikit-learn 0.17 (development)</li>
<li><a href="http://scikit-learn.org/0.14/user_guide.html">scikit-learn 0.14</a></li>
<li><a href="http://scikit-learn.org/0.13/user_guide.html">scikit-learn 0.13</a></li>
<li><a href="http://scikit-learn.org/0.12/user_guide.html">scikit-learn 0.12</a></li>
91 changes: 74 additions & 17 deletions doc/modules/manifold.rst
@@ -479,30 +479,65 @@ t-distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE (:class:`TSNE`) converts affinities of data points to probabilities.
The affinities in the original space are represented by Gaussian joint
probabilities and the affinities in the embedded space are represented by
Student's t-distributions. The Kullback-Leibler (KL) divergence of the joint
Student's t-distributions. This allows t-SNE to be particularly sensitive
to local structure and has a few other advantages over existing techniques:

* Revealing the structure at many scales on a single map
* Revealing data that lie in multiple, different manifolds or clusters
* Reducing the tendency to crowd points together at the center

While Isomap, LLE and variants are best suited to unfold a single continuous

Member:
I didn't check but that sounds quite verbatim from the paper. If it is, we should probably make that explicit.

low dimensional manifold, t-SNE will focus on the local structure of the data
and will tend to extract clustered local groups of samples as highlighted on
the S-curve example. This ability to group samples based on the local structure
might be beneficial to visually disentangle a dataset that comprises several
manifolds at once as is the case in the digits dataset.

The Kullback-Leibler (KL) divergence of the joint
probabilities in the original space and the embedded space will be minimized
by gradient descent. Note that the KL divergence is not convex, i.e.
multiple restarts with different initializations will end up in local minima
of the KL divergence. Hence, it is sometimes useful to try different seeds
and select the embedding with the lowest KL divergence.
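
For reference, a sketch of the quantities involved (the standard formulation of
van der Maaten & Hinton, 2008): here :math:`x_i` denote the input points,
:math:`y_i` their embedded counterparts, :math:`\sigma_i` the per-point
bandwidth chosen to match the perplexity, and :math:`N` the number of samples.

.. math::

  p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}
                 {\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)},
  \qquad
  p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

.. math::

  q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}
                {\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}},
  \qquad
  \mathrm{KL}(P\|Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}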

The disadvantages to using t-SNE are roughly:

* t-SNE is computationally expensive, and can take several hours on million-sample
  datasets where PCA will finish in seconds or minutes.
* The Barnes-Hut t-SNE method is limited to two or three dimensional embeddings.
* The algorithm is stochastic, and multiple restarts with different seeds can
  yield different embeddings. However, it is perfectly legitimate to pick the
  embedding with the least error.
* Global structure is not explicitly preserved. This problem is mitigated by
  initializing points with PCA (using `init='pca'`), as sketched below.
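
A minimal sketch of the last two points, assuming the digits dataset and that
the fitted estimator exposes a ``kl_divergence_`` attribute (if it does not,
compare the error reported with ``verbose`` output instead)::

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X = load_digits().data

    # Try a few seeds and keep the embedding with the lowest KL divergence.
    best_embedding, best_kl = None, np.inf
    for seed in (0, 1, 2):
        tsne = TSNE(n_components=2, init='pca', random_state=seed)
        embedding = tsne.fit_transform(X)
        kl = getattr(tsne, 'kl_divergence_', np.inf)
        if best_embedding is None or kl < best_kl:
            best_embedding, best_kl = embedding, kl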


.. figure:: ../auto_examples/manifold/images/plot_lle_digits_013.png
:target: ../auto_examples/manifold/plot_lle_digits.html
:align: center
:scale: 50


Optimizing t-SNE
----------------
The main purpose of t-SNE is visualization of high-dimensional data. Hence,
it works best when the data will be embedded in two or three dimensions.

Optimizing the KL divergence can be a little bit tricky sometimes. There are
three parameters that control the optimization of t-SNE:
five parameters that control the optimization of t-SNE and therefore possibly
the quality of the resulting embedding:

* perplexity
* early exaggeration factor
* learning rate
* maximum number of iterations
* angle (not used in the exact method)

The perplexity is defined as :math:`k=2^S` where :math:`S` is the Shannon
entropy of the conditional probability distribution. The perplexity of a
:math:`k`-sided die is :math:`k`, so that :math:`k` is effectively the number of
nearest neighbors t-SNE considers when generating the conditional probabilities.
Larger perplexities lead to more nearest neighbors and less sensitivity to small
structure. Larger datasets tend to require larger perplexities.
The maximum number of iterations is usually high enough and does not need
any tuning. The optimization consists of two phases: the early exaggeration
phase and the final optimization. During early exaggeration the joint
@@ -513,19 +548,37 @@ divergence could increase during this phase. Usually it does not have to be
tuned. A critical parameter is the learning rate. If it is too low gradient
descent will get stuck in a bad local minimum. If it is too high the KL
divergence will increase during optimization. More tips can be found in
Laurens van der Maaten's FAQ (see references).
Laurens van der Maaten's FAQ (see references). The last parameter, angle,
is a tradeoff between performance and accuracy. Larger angles imply that we
can approximate larger regions by a single point, leading to better speed
but less accurate results.
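
As a concrete illustration, these knobs map to constructor arguments of
:class:`TSNE`; the values below are placeholders rather than recommendations::

    from sklearn.manifold import TSNE

    tsne = TSNE(n_components=2,
                perplexity=30.0,         # roughly the effective number of neighbors
                early_exaggeration=4.0,  # spreads clusters during the first phase
                learning_rate=1000.0,    # too low: bad local minimum; too high: KL increases
                n_iter=1000,             # maximum number of iterations
                angle=0.5)               # Barnes-Hut speed/accuracy trade-off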

Standard t-SNE that has been implemented here is usually much slower than
other manifold learning algorithms. The optimization is quite difficult
and the computation of the gradient is on :math:`O[d N^2]`, where :math:`d`
is the number of output dimensions and :math:`N` is the number of samples.
Barnes-Hut t-SNE
----------------

While Isomap, LLE and variants are best suited to unfold a single continuous
low dimensional manifold, t-SNE will focus on the local structure of the data
and will tend to extract clustered local groups of samples as highlighted on
the S-curve example. This ability to group samples based on the local structure
might be beneficial to visually disentangle a dataset that comprises several
manifolds at once as is the case in the digits dataset.
The Barnes-Hut t-SNE that has been implemented here is usually much slower than

Member:
Is it really still slower than other manifold learning algorithms?

Contributor Author:
Yeah, at least for the toy datasets: the speed up is only ~2-3x on the 2000 digits examples, which still makes it the slowest. Perhaps it's competitive with the next-slowest, MDS?

Member:
On my box, TSNE (with the default Barnes Hut approximation) is always slightly faster than our implementation of MDS in all the examples/manifold scripts. Still it's slower when compared to other manifold methods.

Member:
I would replace "is usually much slower than" by "can be significantly slower than".

other manifold learning algorithms. The optimization is quite difficult
and the computation of the gradient is :math:`O[d N \log(N)]`, where :math:`d`
is the number of output dimensions and :math:`N` is the number of samples. The
Barnes-Hut method improves on the exact method where t-SNE complexity is
:math:`O[d N^2]`, but has several other notable differences:

* The Barnes-Hut implementation only works when the target dimensionality is 3
  or less. The 2D case is typical when building visualizations.
* Barnes-Hut only works with dense input data. Sparse data matrices can only be
  embedded with the exact method, or can be approximated by a dense low rank
  projection, for instance using :class:`sklearn.decomposition.TruncatedSVD`.
* Barnes-Hut is an approximation of the exact method. The approximation is
  parameterized with the angle parameter; therefore the angle parameter is
  unused when method="exact".
* Barnes-Hut is significantly more scalable. Barnes-Hut can be used to embed
  hundreds of thousands of data points while the exact method can handle
  thousands of samples before becoming computationally intractable.

For visualization purposes (which is the main use case of t-SNE), using the
Barnes-Hut method is strongly recommended, as sketched below. The exact t-SNE
method is useful for checking the theoretical properties of the embedding,
possibly in higher dimensional space, but is limited to small datasets due to
computational constraints.
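
A minimal sketch of this recommendation for sparse input, using a synthetic
sparse matrix as a stand-in for e.g. TF-IDF features::

    import scipy.sparse as sp
    from sklearn.decomposition import TruncatedSVD
    from sklearn.manifold import TSNE
    from sklearn.pipeline import make_pipeline

    # Barnes-Hut t-SNE needs dense input, so project the sparse matrix to a
    # dense, low-dimensional space first.
    X_sparse = sp.rand(1000, 5000, density=0.01, format='csr', random_state=0)

    embedding = make_pipeline(
        TruncatedSVD(n_components=50, random_state=0),
        TSNE(n_components=2, method='barnes_hut', random_state=0),
    ).fit_transform(X_sparse)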

Also note that the digits labels roughly match the natural grouping found by
t-SNE while the linear 2D projection of the PCA model yields a representation
@@ -546,9 +599,13 @@ the internal structure of the data.
(2008)

* `"t-Distributed Stochastic Neighbor Embedding"
<http://homepage.tudelft.nl/19j49/t-SNE.html>`_
<http://lvdmaaten.github.io/tsne/>`_
van der Maaten, L.J.P.

* `"Accelerating t-SNE using Tree-Based Algorithms."
<http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf>`_
L.J.P. van der Maaten. Journal of Machine Learning Research 15(Oct):3221-3245, 2014.

Tips on practical use
=====================

148 changes: 148 additions & 0 deletions examples/manifold/plot_tsne_faces.py
@@ -0,0 +1,148 @@
"""
===================================

Member:
fix underline / overline length

Comparison of t-SNE methods and options
===================================

A comparison of t-SNE applied to the Olivetti faces data. The 'exact' and
'barnes_hut' methods are visualized, as well as init='pca' and init='random'.
In addition to varying method and init, the script also compares preprocessing
the input data by reducing the dimensionality via PCA, and varying the
perplexity.

Member:
I think this example is too crowded / long and we don't get the intended message of the example. I would rather simplify it by just demonstrating the cases with the Barnes-Hut method and the 3 following configurations:

  • random init on raw features
  • PCA init on raw features
  • PCA init on PCA-reduced features (via a pipeline).

Then in the text description of the example I would tell that:

  • all three configurations yield similar arrangements of the data,
  • PCA-reduced features are interesting to speed up the analysis while preserving most of the pairwise distances of the original feature space. In particular the BallTree data structure used to compute the neighborhood in the original feature space is known to work faster than brute force only for lower dimensional spaces (e.g. less than a couple hundred dimensions).
  • PCA init is interesting to get a global arrangement of the groups that is slightly more consistent with the long range dependencies between groups (although I am not 100% sure about this so maybe we should not mention that at all).

Contributor Author:
Great! Will do.

I think it's fine to say that initializing with PCA preserves more of the global structure -- LvdM mentions this in his papers.

Member:
I would also put all plots in a single figure so they are side by side.

For a discussion and comparison of these algorithms, see the
:ref:`manifold module page <manifold>`

"""
# Authors: Christopher Erick Moody <chrisemoody@gmail.com>
# License: BSD 3 clause

from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import RandomizedPCA
from sklearn import manifold
from matplotlib import offsetbox
import matplotlib.pyplot as plt
import numpy as np
from time import time


faces = fetch_olivetti_faces()

verbose = 5

#----------------------------------------------------------------------
# Scale and visualize the embedding vectors
def plot_embedding(dataset, X, title=None, ax=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)
    y = dataset.target
    if ax is None:
        plt.figure(figsize=(18, 15))
        ax = plt.gca()
    for i in range(X.shape[0]):
        ax.text(X[i, 0], X[i, 1], str(dataset.target[i]),
                color=plt.cm.Set1(y[i] / 10.),
                fontdict={'weight': 'bold', 'size': 30})

    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(dataset.data.shape[0]):
            shown_images = np.r_[shown_images, [X[i]]]
            r, g, b, a = plt.cm.Paired(i * 1.0 / dataset.data.shape[0])
            gray = dataset.images[i]
            rgb_image = np.zeros((64, 64, 3))
            rgb_image[:, :, 0] = gray * r
            rgb_image[:, :, 1] = gray * g
            rgb_image[:, :, 2] = gray * b
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(rgb_image),
                X[i], frameon=False)
            ax.add_artist(imagebox)
    ax.set_xticks([])
    ax.set_yticks([])
    if title is not None:
        ax.set_title(title, loc='left')


#----------------------------------------------------------------------
# Fit t-SNE on 50 PCA-reduced dimensions with random initialization.
# We will frequently want to reduce the dimensionality when we have many
# dimensions since the t-SNE complexity grows linearly with the dimensionality.
tsne = manifold.TSNE(n_components=2, init='random', random_state=3,
                     n_iter=1000, verbose=verbose, early_exaggeration=50.0,
                     learning_rate=100, perplexity=50)
rpca = RandomizedPCA(n_components=50)
time_start = time()
reduced = rpca.fit_transform(faces.data)
embedded = tsne.fit_transform(reduced)
time_end = time()
dt = time_end - time_start


title = ("Barnes-Hut t-SNE Visualization of Olivetti Faces\n" +
"in %1.1f seconds on 50-dimensional PCA-reduced data ")
title = title % dt
plot_embedding(faces, embedded, title=title)


#-------------------------------------------------------------------------
# Fit t-SNE on all 4096 dimensions, but use PCA to initialize the embedding.
# Initializing the embedding with PCA allows the final embedding to preserve
# global structure, leaving t-SNE to optimize the local structure.
tsne = manifold.TSNE(n_components=2, init='pca', random_state=3,
                     n_iter=1000, verbose=verbose, early_exaggeration=50.0,
                     learning_rate=100, perplexity=5)
time_start = time()
embedded = tsne.fit_transform(faces.data)
time_end = time()
dt = time_end - time_start


title = ("Barnes-Hut t-SNE Visualization of Olivetti Faces\n" +
"in %1.1f seconds & initialized embedding with PCA ")
title = title % dt
plot_embedding(faces, embedded, title=title)


#----------------------------------------------------------------------
# Fit t-SNE on 50 PCA-reduced dimensions with random initialization, but
# increase the perplexity from 5 to 50. Note that a larger perplexity roughly
# increases the number of neighbors, which effectively decreases the number of
# clusters. For the example here, there are 10 images per person, so a
# perplexity of 50 forces multiple face-clusters to come together.
tsne = manifold.TSNE(n_components=2, init='random', random_state=3,
                     n_iter=1000, verbose=verbose, early_exaggeration=50.0,
                     learning_rate=100, perplexity=50)
rpca = RandomizedPCA(n_components=50)
time_start = time()
reduced = rpca.fit_transform(faces.data)
embedded = tsne.fit_transform(reduced)
time_end = time()
dt = time_end - time_start


title = ("Barnes-Hut t-SNE Visualization of Olivetti Faces\n" +
"in %1.1f seconds on 50-dimensional PCA-reduced data\n" +
"with perplexity increased from 5 to 50")
title = title % dt
plot_embedding(faces, embedded, title=title)


#----------------------------------------------------------------------
# Fit t-SNE with random initialization and the exact method. The exact method
# is similar to Barnes-Hut, but this visualization reinforces the idea
# that the two methods yield similar results.
tsne = manifold.TSNE(n_components=2, init='random', random_state=3,
                     method='exact', n_iter=10000, verbose=verbose,
                     learning_rate=100, perplexity=5)
time_start = time()
embedded = tsne.fit_transform(faces.data)
time_end = time()
dt = time_end - time_start


title = ("Exact t-SNE Visualization of Olivetti Faces\n" +
"in %1.1f seconds with random initialization")
title = title % dt
plot_embedding(faces, embedded, title=title)

plt.show()
4 changes: 2 additions & 2 deletions sklearn/__init__.py
@@ -37,7 +37,7 @@
# Dev branch marker is: 'X.Y.dev' or 'X.Y.devN' where N is an integer.
# 'X.Y.dev0' is the canonical version of 'X.Y.dev'
#
__version__ = '0.16.dev0'
__version__ = '0.17.dev0'


try:
@@ -61,7 +61,7 @@
'cross_validation', 'datasets', 'decomposition', 'dummy',
'ensemble', 'externals', 'feature_extraction',
'feature_selection', 'gaussian_process', 'grid_search', 'hmm',
'isotonic', 'kernel_approximation', 'kernel_ridge',
'lda', 'learning_curve',
'linear_model', 'manifold', 'metrics', 'mixture', 'multiclass',
'naive_bayes', 'neighbors', 'neural_network', 'pipeline',
12 changes: 8 additions & 4 deletions sklearn/isotonic.py
@@ -252,8 +252,6 @@ def _build_y(self, X, y, sample_weight):
"""Build the y_ IsotonicRegression."""
check_consistent_length(X, y, sample_weight)
X, y = [check_array(x, ensure_2d=False) for x in [X, y]]
if sample_weight is not None:
sample_weight = check_array(sample_weight, ensure_2d=False)

y = as_float_array(y)
self._check_fit_data(X, y, sample_weight)
@@ -264,10 +262,16 @@ def _build_y(self, X, y, sample_weight):
        else:
            self.increasing_ = self.increasing

        # If sample_weight is passed, remove zero-weight values and clean the order
        if sample_weight is not None:
            sample_weight = check_array(sample_weight, ensure_2d=False)
            mask = sample_weight > 0
            X, y, sample_weight = X[mask], y[mask], sample_weight[mask]
        else:
            sample_weight = np.ones(len(y))

        order = np.lexsort((y, X))
        order_inv = np.argsort(order)
        if sample_weight is None:
            sample_weight = np.ones(len(y))
        X, y, sample_weight = [astype(array[order], np.float64, copy=False)
                               for array in [X, y, sample_weight]]
        unique_X, unique_y, unique_sample_weight = _make_unique(X, y, sample_weight)
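
For context, a small usage sketch of the behaviour this change targets
(hypothetical values; the zero-weighted outlier is dropped before fitting and
therefore cannot perturb the fit):

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    X = np.array([1., 2., 3., 4.])
    y = np.array([1., 2., 10., 4.])   # y at X=3 is an outlier
    w = np.array([1., 1., 0., 1.])    # ...but it carries zero weight

    ir = IsotonicRegression()
    ir.fit(X, y, sample_weight=w)     # the zero-weight sample is ignored
    print(ir.predict(X))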