[MRG+1] t-SNE #2822
Conversation
""" | ||
return trustworthiness(self, X, n_neighbors=n_neighbors) | ||
|
||
def transform(self, X): |
Just implement fit_transform and remove transform.
That would make it impossible to use grid search.
Why not? Does grid search need `transform`?
`transform` must be able to generalize to new data. Transductive transformers should implement `fit` and `fit_transform` only.
Your `score` method seems to be able to generalize to new data, so this line should be fine: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L1199
No, `score` doesn't generalise. There should really be a `fit_score` method to parallel `fit_predict` and `fit_transform`.
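For context, here is a minimal sketch of how a transductive embedder is used with `fit`/`fit_transform` only, in the style `TSNE` eventually shipped with; the toy data and parameter values are arbitrary.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.randn(100, 20)  # toy data, only to show the calling convention

# fit_transform both fits the estimator and returns the embedding of the data
# it was fitted on; there is intentionally no transform() for unseen samples.
X_embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_embedded.shape)  # (100, 2)
```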
@dwf might be interested in reviewing this :)
        return p, error


def trustworthiness(estimator, X, n_neighbors=5):
I think this makes more sense as a function of `X` and `X_embedded`.
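As a point of reference, the function-style signature with `X` and `X_embedded` is what `sklearn.manifold.trustworthiness` provides today; the dataset and settings below are only illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness

X, _ = load_digits(return_X_y=True)
X = X[:300]  # keep the example fast

X_embedded = TSNE(n_components=2, random_state=0).fit_transform(X)

# Fraction of local neighborhood structure preserved by the embedding,
# a value in [0, 1]; higher is better.
print(trustworthiness(X, X_embedded, n_neighbors=5))
```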
Nitpick: I'd prefer the file to be named t_sne.py
And thanks for this contribution!
Thanks for the tips
I'm not sure that this should be a
I think that it would be useful to have both the function and the transformer. The transformer is a standard in scikit-learn, but the function is also useful.
I agree with Gael. Grid search might be very useful since t-SNE has at least one hyperparameter.
OK, I agree that it shouldn't be a transformer. Just an estimator.
Ha :P I didn't mean for the message to send like that. Using GridSearchCV doesn't make sense either, because we can't do CV. So to parallel the current API, this should really have fit() and fit_transform().
@dwf Do you confirm that t-SNE cannot generalize to new data? (including heuristics)
I think @AlexanderFabisch wants to use CV to obtain a trustworthiness score on unseen data (the unseen data doesn't need to be transformed to obtain that score). What I don't understand is why
For unsupervised probabilistic models, I think it is sometimes possible to do model selection by maximizing the likelihood of the training data, but we don't have mechanisms for that in scikit-learn.
I think there is currently no other method available than parametric t-SNE.
T-SNE cannot generalize at all in this implementation. We can only optimize the embedding of the training data.
It might be worth trying.
I think we used the term "transformer" pretty loosely in the context of the manifold module. Spectral embedding doesn't have a transform method either. I don't think we should implement non-standard hacks, and I don't think we should tie the decision whether this is a class or not to whether we want to be able to GridSearch it. I never use any sklearn function, because the estimators have such a nice interface.
Also: woohoo, t-SNE!
from scipy.optimize import fmin_l_bfgs_b
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
import binary_search
use relative import
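A small illustration of that suggestion, assuming `binary_search` is a sibling module in the same scikit-learn subpackage as the t-SNE code:

```python
# Instead of the implicit absolute form used above:
#     import binary_search
# use an explicit relative import within the package:
from . import binary_search
```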
It seems like there is a huge interest in having this in sklearn. ;) I think this will take a while, but it is worth it.
I think you could reuse the same approach for
Then, how are you planning to use grid search?!
Could use `cv=[(arange(X.shape[0]), arange(X.shape[0]))]`?
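A hedged sketch of that trick with the present-day API: a single "fold" whose train and test indices coincide, plus a custom trustworthiness scorer. The scorer, the perplexity grid and the dataset are illustrative choices, not something this PR prescribes.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness
from sklearn.model_selection import GridSearchCV

X, _ = load_digits(return_X_y=True)
X = X[:300]  # keep the sketch fast
n_samples = X.shape[0]

# Train and test indices are identical, so the transductive estimator is
# fitted and scored on the same data and never has to generalize.
cv = [(np.arange(n_samples), np.arange(n_samples))]

def trustworthiness_scorer(estimator, X_test, y_test=None):
    # Score the embedding produced during fit against the (identical) test data.
    return trustworthiness(X_test, estimator.embedding_, n_neighbors=5)

search = GridSearchCV(
    TSNE(n_components=2, random_state=0),
    param_grid={"perplexity": [5, 30, 50]},
    scoring=trustworthiness_scorer,
    cv=cv,
)
search.fit(X)
print(search.best_params_)
```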
        if self.distances == "precomputed" and X.shape[0] != X.shape[1]:
            raise ValueError("X should be a square distance matrix")

        self.Y_ = self._tsne(X)
Would this attribute be better called `embedding_` as it is in `SpectralEmbedding`?
Actually, it has been called `embedding_` previously. :) I will change that again.
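For comparison, both estimators expose the fitted coordinates through the same attribute in current scikit-learn; the toy data below is only illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE, SpectralEmbedding

rng = np.random.RandomState(0)
X = rng.randn(100, 10)

# Both store the low-dimensional coordinates of the fitted data in embedding_.
tsne = TSNE(n_components=2, perplexity=10, random_state=0).fit(X)
spectral = SpectralEmbedding(n_components=2, random_state=0).fit(X)
print(tsne.embedding_.shape, spectral.embedding_.shape)
```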
Indeed that would work in the current state of this PR but
Rename affinity to metric and affinities to distances
@ogrisel The branch has been rebased on master.
This looks good to me. Shall we merge?
👍!
+1 on my side too
+1 Thanks for putting up with our nitpicking @AlexanderFabisch :b
I have to thank you and all the other reviewers, in particular @ogrisel and @GaelVaroquaux. Your comments really improved the quality of the code! ;)
Hurray! 🍻
Was there no whatsnew for this or am I blind? Wasn't this one of the highlights of 0.15?
No, I can't find it either.
Do you want to add it? I feel whatsnew is a good way to check when something was added.
OK, should I open a pull request for that or is it possible to commit that directly to master?
I think that you can commit this directly to master. Thanks!
I added an entry to the list of highlights and one to the list of new features with bdcea50
@GaelVaroquaux Travis complains about a failing unit test. I can't see how this is related to my commit. Do you have any idea?
Heisenbug, maybe? I restarted the Travis job. We'll see what it gives.
Yup. Heisenbug...
Thanks for checking.
Well, thanks for mentioning that something broke. That's important!
This is an implementation of (non-parametric) t-SNE for visualization.

See Laurens van der Maaten's paper or his website about t-SNE for details. In comparison to other implementations and the original paper this version has these features:

- only `learning_rate`, `n_iter` and `early_exaggeration` need to be set; the momentum etc. are fixed and work well for most datasets

TODO

- `transform` (and maybe even `inverse_transform`)
- parameters (`n_iter`, `learning_rate`, `early_exaggeration`)
- attributes `embedding_`, `nbrs_`, `training_data_`, `embedding_nbrs_` (similar to `Isomap`)
- documentation in `doc/modules/manifold.rst`
- rename the file to `t_sne.py`
Learning Schedules
In the literature:
My experiences:
This implementation uses the following schedule:
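The exact constants of this PR's schedule live in the diff and are not reproduced here; as a rough sketch of the classic two-phase schedule from van der Maaten & Hinton (2008) (early exaggeration of the input affinities for the first iterations, momentum switched from 0.5 to 0.8 later on, gradient descent with a fixed learning rate), with a simplified fixed-bandwidth affinity standing in for the per-point perplexity binary search:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def _affinities(X, sigma=1.0):
    # Simplified symmetric input affinities with a single fixed bandwidth,
    # standing in for the per-point binary search that targets a perplexity.
    D = squareform(pdist(X, "sqeuclidean"))
    P = np.exp(-D / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return np.maximum(P / P.sum(), 1e-12)

def _kl_gradient(P, Y):
    # Gradient of KL(P || Q) with Student-t low-dimensional affinities Q.
    dist = squareform(pdist(Y, "sqeuclidean"))
    inv = 1.0 / (1.0 + dist)
    np.fill_diagonal(inv, 0.0)
    Q = np.maximum(inv / inv.sum(), 1e-12)
    W = (P - Q) * inv
    return 4.0 * ((np.diag(W.sum(axis=1)) - W) @ Y)

def tsne_classic_schedule(X, n_components=2, n_iter=1000, learning_rate=100.0,
                          early_exaggeration=4.0, random_state=0):
    rng = np.random.RandomState(random_state)
    P = _affinities(X)
    Y = 1e-4 * rng.randn(X.shape[0], n_components)
    update = np.zeros_like(Y)
    for it in range(n_iter):
        # Phase 1: exaggerate P and use a low momentum so clusters can separate;
        # Phase 2: plain objective with a higher momentum.
        exaggeration = early_exaggeration if it < 100 else 1.0
        momentum = 0.5 if it < 250 else 0.8
        update = momentum * update - learning_rate * _kl_gradient(exaggeration * P, Y)
        Y += update
    return Y
```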
Observations
Tips
Examples
Visualizations of some datasets can be found here.
Work for other pull requests