[MRG+1] Reducing t-SNE memory usage #9032
Conversation
yes!
|
not that I've looked at it yet.
|
(answered irl, my bad for rushing into this pr) |
sklearn/manifold/t_sne.py
Outdated
# set the neighbors to n - 1
distances_nn, neighbors_nn = knn.kneighbors(
    X, n_neighbors=k + 1)
distances_nn = distances_nn[:, 1:]
This isn't quite doing the right thing. If two instances are identical, you can't be certain which will be output first. Use kneighbors(None) instead.
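A minimal sketch of the suggested alternative (toy data, not the PR's actual code): calling `kneighbors(None)` on the fitted estimator queries the training points themselves and excludes each point from its own neighborhood, so there is no need to ask for `k + 1` neighbors and slice off the first column.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0., 0.], [0., 0.], [1., 1.]])  # two identical points
k = 1
knn = NearestNeighbors(n_neighbors=k).fit(X)

# Fragile: with duplicates, the "self" point is not guaranteed to be returned
# first, so slicing off column 0 may drop a genuine neighbor instead of self.
dist_a, ind_a = knn.kneighbors(X, n_neighbors=k + 1)
dist_a, ind_a = dist_a[:, 1:], ind_a[:, 1:]

# Robust: passing None queries the training set and excludes each point itself.
dist_b, ind_b = knn.kneighbors(None, n_neighbors=k)
```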
sklearn/manifold/t_sne.py
Outdated
if self.metric == 'precomputed':
    # Use the precomputed distances to find
    # the k nearest neighbors and their distances
    neighbors_nn = np.argsort(distances, axis=1)[:, :k]
You can just use NearestNeighbors(metric='precomputed') for this.
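A small sketch of that alternative (illustrative data and variable names, not the PR's code): `NearestNeighbors` can be fitted directly on a square precomputed distance matrix, which avoids the full `np.argsort` over every row.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(20, 5)
distances = pairwise_distances(X)   # square (n_samples, n_samples) matrix
k = 3

knn = NearestNeighbors(n_neighbors=k, metric='precomputed').fit(distances)
# kneighbors(None) excludes each sample from its own neighborhood.
distances_nn, neighbors_nn = knn.kneighbors(None)
```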
sklearn/manifold/t_sne.py
Outdated
else:
    # Find the nearest neighbors for every point
    # TODO: argument for class knn_estimator=None
    # TODO: assert that the knn metric is euclidean
Can't we use another metric?
printf("[t-SNE] [d=%i] Inserting pos %i [%f, %f] duplicate_count=%i " | ||
"into child %p\n", depth, point_index, pos[0], pos[1], | ||
printf("[t-SNE] [d=%li] Inserting pos %li [%f, %f] duplicate_count=%li" | ||
" into child %p\n", depth, point_index, pos[0], pos[1], |
Using %li is required when formatting long integers, to avoid compiler warnings.
float[:,:] pos_reference,
np.int64_t[:,:] neighbors,
np.int64_t[:] neighbors,
np.int64_t[:] indptr,
If we really do have n_neighbors non-zero values in each row, I think the previous approach with a 2d array of (samples, neighbors) was better than having an indptr design. Why did you change it?
This makes it easy to efficiently symmetrize the conditional_P matrix using scipy operations for sparse matrices.
Symmetrization is done in the reference implementation, so I tried to stay as close as possible to it, although I haven't reviewed it all yet.
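For illustration, here is what such a CSR layout looks like when every sample keeps exactly K neighbors (toy numbers, not the PR's actual arrays): a flat `neighbors` array of column indices plus an `indptr` of the form `range(0, n_samples * K + 1, K)` carry the same information as an `(n_samples, K)` 2d array, but plug directly into scipy's sparse machinery.

```python
import numpy as np
from scipy.sparse import csr_matrix

n_samples, K = 4, 2
# Row i owns the entries neighbors[indptr[i]:indptr[i + 1]].
neighbors = np.array([1, 2,  0, 3,  0, 1,  1, 2])       # flat column indices
data = np.ones_like(neighbors, dtype=np.float64)        # e.g. conditional P values
indptr = np.arange(0, n_samples * K + 1, K)              # [0, 2, 4, 6, 8]

P = csr_matrix((data, neighbors, indptr), shape=(n_samples, n_samples))
# With a fixed K per row, the 2d view is still recoverable:
neighbors_2d = neighbors.reshape(n_samples, K)
```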
sklearn/manifold/t_sne.py
Outdated
range(0, n_samples * K + 1, K)),
shape=(n_samples, n_samples))

P = P + P.T
Sparse symmetrization is done here.
Indicate it as a comment in the code, rather than a comment in the PR :)
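A hedged sketch of that style of sparse symmetrization (toy values; the real code also renormalizes the joint probabilities according to whatever convention it uses):

```python
import numpy as np
from scipy.sparse import csr_matrix

# conditional_P as a sparse matrix (toy values)
conditional_P = csr_matrix(np.array([[0., .8, .2],
                                     [.5, 0., .5],
                                     [.9, .1, 0.]]))

# One sparse operation symmetrizes the matrix; dividing by the total
# turns it into joint probabilities that sum to 1.
P = conditional_P + conditional_P.T
P = P / P.sum()
```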
Some benchmark results on MNIST against the reference implementation. The script to reproduce is included in the benchmark folder. There is probably room for improvement, but maybe in the next PR? :)
Here is the memory usage reported during the run. The memory grows most when NN builds its ball tree, then it drops a bit. In any case, it stays around 300MB for 10000 samples. EDIT: a run of this bench with master gives:
So this PR is a bit faster. |
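The comment does not say which tool produced the memory trace; purely as an illustration of how such a measurement could be reproduced, here is a sketch using the third-party memory_profiler package (assumed to be installed; the data and TSNE settings are placeholders):

```python
import numpy as np
from memory_profiler import memory_usage
from sklearn.manifold import TSNE

X = np.random.RandomState(0).rand(10000, 50).astype(np.float32)

def fit():
    TSNE(method='barnes_hut', verbose=2).fit_transform(X)

# Sample the process memory every 0.5s while fitting and report the peak in MiB.
peak = max(memory_usage((fit, (), {}), interval=0.5))
print("peak memory: %.1f MiB" % peak)
```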
@@ -79,34 +81,20 @@ cpdef np.ndarray[np.float32_t, ndim=2] _binary_search_perplexity(
# Compute current entropy and corresponding probabilities
# computed just over the nearest neighbors or over all data
# if we're not using neighbors
comment seems outdated: what does "if we're not using neighbors" mean now? Sorry, I misread.
FYI, I am currently running a memory benchmark on the full (70000 samples) MNIST dataset. The memory usage of my python process is constant at 1.2GB. Update: running MNIST on 70000 samples took ~52 minutes. Here is the memory profile: Most of the time is spent in the ball tree. Update 2: running MNIST on 70000 samples took ~53 minutes with the reference implementation and 1.3GB of RAM (basically the same behavior as our Cython impl in this PR). |
Yes, I was going to ask how much of the time was in KNN. |
If ball tree is the slowest part:
- we should consider offering n_jobs (see the sketch below)
- if multithreaded balltree queries help reduce runtime, we might check how the current implementation compares to other single-processor implementations, to check our timings are in the ballpark of state of the art.

I just realised you did that above.
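A minimal sketch of what offering `n_jobs` could look like for the neighbors query (hypothetical parameter values; at the time of this PR, `TSNE` itself did not expose `n_jobs`):

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=5000, n_features=50, random_state=0)

# Parallelize the ball-tree queries across cores; the tree build itself
# stays single-threaded.
knn = NearestNeighbors(n_neighbors=90, n_jobs=-1).fit(X)   # 3 * perplexity=30
distances_nn, neighbors_nn = knn.kneighbors(None)
```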
sklearn/manifold/t_sne.py
Outdated
metric=self.metric)
knn.fit(X)
# LvdM uses 3 * perplexity as the number of neighbors
# And we add one to not count the data point itself
we do not add 1; you could explain why we pass None, or more explicitly use neighbors.kneighbors_graph(X, include_self=False).
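For reference, a small sketch of the more explicit spelling (illustrative data; the result is a CSR sparse matrix):

```python
import numpy as np
from sklearn import neighbors

X = np.random.RandomState(0).rand(100, 10)
k = 5

# mode='distance' stores the actual distances; include_self=False leaves
# each point out of its own neighborhood.
graph = neighbors.kneighbors_graph(X, n_neighbors=k, mode='distance',
                                   include_self=False)
print(graph.shape, graph.nnz)   # (100, 100) with 100 * 5 stored entries
```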
sklearn/manifold/t_sne.py
Outdated
# And we add one to not count the data point itself
# In the event that we have very small # of points
# set the neighbors to n - 1
distances_nn, neighbors_nn = knn.kneighbors(
we could use kneighbors_graph which will already return a sparse matrix representation. This will allow us to rely on kneighbors_graph being as memory efficient as possible, rather than putting it into a sparse matrix in _joint_probabilities_nn.
We do not put the distances_nn in a sparse matrix in _joint_probabilities_nn, but we put the conditional_P as a sparse matrix for symmetrization purposes.
Using kneighbors_graph would increase the memory to store range(0, n_samples * K + 1, K). I don't see a need to use it except if it is more efficient internally.
What do you think?
We don't need any of the features of the scipy sparse matrix API for the neighbors info itself (we call into Cython code directly after its computation).
Only the conditional probability matrix benefits from being represented as a scipy sparse matrix, which makes the symmetrization a one-liner Python snippet.
I ran the benchmark with
There is a non-negligible speedup as it took a bit less than 30min to fit the full dataset. |
I am currently running the original bhtsne code on 70000 MNIST and the memory usage of the python process is 1.3GB, so I think we can officially declare that our implementation is as memory efficient as it should be 🎉 Update: the original bhtsne code on 70000 MNIST took 53min, which is the same as our Cython code: 🎉² |
🍻 although it's a bit early :)
|
Here is a snippet to plot the resulting embedding:

from sklearn.datasets import fetch_mldata
from sklearn.utils import check_array
import matplotlib.pyplot as plt
import numpy as np

X_embedded = np.load('mnist_tsne_70000.npy')

def load_data(dtype=np.float32, order='C'):
    """Load the data, then cache and memmap the train/test split"""
    print("Loading dataset...")
    data = fetch_mldata('MNIST original')
    X = check_array(data['data'], dtype=dtype, order=order)
    y = data["target"]
    # Normalize features
    X /= 255
    return X, y

_, y = load_data()
plt.figure(figsize=(12, 12))
for c in np.unique(y):
    X_c = X_embedded[y == c]
    plt.scatter(X_c[:, 0], X_c[:, 1], alpha=0.1, label=int(c))
plt.legend(loc='best')

With my 52min run with the code of this PR, this yields: This might have converged to a local minimum, but it seems to work well enough.

Update: here is the output of the reference implementation on the same 70000-sample MNIST dataset: We probably have discrepancies in the hyperparameters that are worth investigating more thoroughly before merging this PR. |
I ran another session with PCA preprocessing of the MNIST 70000 dataset with 50 principal components:

Our PR (reported as "T-SNE" in the benchmark script output):

Reference implementation (bhtsne):

I have not changed anything in the way we set the default hyperparameters: this is still to investigate. However it shows that we should probably do a

It's interesting to note that in that lower-dimensional regime we are significantly faster than the reference implementation. Also I think we can probably do an MNIST example with 5000 samples by default in the scikit-learn doc (with 50-dim PCA preprocessing, as sketched below). |
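A hedged sketch of that kind of pipeline (parameter values are illustrative, not the benchmark's exact settings): reduce the data to 50 principal components first, then run Barnes-Hut t-SNE on the reduced representation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.rand(5000, 784).astype(np.float32)   # stand-in for MNIST pixels

# 50-dim PCA preprocessing makes the neighbors search much cheaper
# before the Barnes-Hut optimization.
X_pca = PCA(n_components=50, random_state=0).fit_transform(X)
X_embedded = TSNE(n_components=2, method='barnes_hut', perplexity=30,
                  verbose=2, random_state=0).fit_transform(X_pca)
```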
sklearn/manifold/t_sne.py
Outdated
@@ -803,6 +803,8 @@ def _tsne(self, P, degrees_of_freedom, n_samples, random_state, X_embedded,
        if self.method == 'barnes_hut':
            obj_func = _kl_divergence_bh
            opt_args['kwargs']['angle'] = self.angle
            # Repeat verbose argument for _kl_divergence_bh
            opt_args['kwargs']['verbose'] = self.verbose
Good catch.
@tomMoral I am investigating why the errors are often constant: I think that's because we truncate with too large of an EPSILON in the |
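The comment is truncated, but the effect it describes can be illustrated with a toy computation (purely illustrative, not the PR's code): if the probabilities entering the KL divergence are clipped at a floor that is larger than most of their true values, the reported error barely moves between iterations and looks constant.

```python
import numpy as np

def clipped_kl(P, Q, eps):
    # Truncate both distributions at eps before computing the divergence.
    P, Q = np.maximum(P, eps), np.maximum(Q, eps)
    return np.sum(P * np.log(P / Q))

P = np.array([0.4, 0.3, 0.2, 0.1])
Q_before = np.array([0.25, 0.25, 0.25, 0.25])
Q_after = np.array([0.38, 0.31, 0.19, 0.12])   # a clearly better fit

print(clipped_kl(P, Q_before, eps=1e-12), clipped_kl(P, Q_after, eps=1e-12))  # decreases
print(clipped_kl(P, Q_before, eps=0.5), clipped_kl(P, Q_after, eps=0.5))      # constant
```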
Alright the bug on the error computation is fixed, all tests pass (the circle ci failure is the stock market stuff), examples look good and the MNIST benchmark is both accurate, reasonably fast and memory efficient. Merging! |
Merging!
Congratulations! Challenging and beautiful piece of work. Very useful.
|
There are still plenty of things to improve in that code base, but I think this is a great first step. Thanks @tomMoral for your patience :) |
Thanks @ogrisel for fixing the EPSILON bug! Hopefully, I will get some time to work on some acceleration of our t-SNE soon! |
…9032) Use a sparse matrix representation of the neighbors. Re-factored the QuadTree implementation to avoid insertion errors. Various fixes in the gradient descent schedule to get the Barnes Hut and exact solvers to behave more robustly and consistently.
The barnes-hut algorithm for t-SNE currently has O(N^2) memory complexity; it could use O(uN). This PR intends to improve the memory usage (see Issue #7089).

Steps to proceed:
- Only compute the nearest neighbors distances
- Validate _barnes_hut2 with the _barnes_hut functions
- Check memory usage
- Raise a ValueError when n_components > 3 or n_components < 2 with the BH solver enabled.

Related:
- Initial PR #4025
- Some ideas #8582
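A minimal sketch of the kind of validation described in the last step (the helper name is hypothetical and the message wording is illustrative, not necessarily the exact check added by the PR):

```python
def check_bh_n_components(method, n_components):
    # The Barnes-Hut solver relies on a quad-tree (2D) or oct-tree (3D),
    # so the embedding dimensionality must be 2 or 3.
    if method == 'barnes_hut' and not 2 <= n_components <= 3:
        raise ValueError("n_components should be 2 or 3 for the barnes_hut "
                         "solver, got %d" % n_components)

check_bh_n_components('barnes_hut', 2)    # OK
check_bh_n_components('exact', 10)        # OK: the exact solver has no such limit
# check_bh_n_components('barnes_hut', 4)  # would raise ValueError
```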