[MRG+1] Fixes bugs in t-SNE argument passing, and a bug with running using n dfs #7885
Conversation
Thanks. Please give this PR a more informative title
# degrees_of_freedom = n_components - 1 comes from
# "Learning a Parametric Embedding by Preserving Local Structure"
# Laurens van der Maaten, 2009.
degrees_of_freedom = max(self.n_components - 1.0, 1)
We need to retain this behaviour by default. I.e. make degrees_of_freedom=None
by default, and handle that case with this logic.
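A minimal sketch of that suggested default handling, reusing the `max(n_components - 1, 1)` logic from the snippet above (the helper name here is hypothetical, not from the PR):

```python
def resolve_degrees_of_freedom(degrees_of_freedom, n_components):
    """Hypothetical helper: keep the old behaviour when the user
    passes degrees_of_freedom=None, otherwise honour their value."""
    if degrees_of_freedom is None:
        # degrees_of_freedom = n_components - 1 comes from
        # "Learning a Parametric Embedding by Preserving Local Structure",
        # Laurens van der Maaten, 2009.
        degrees_of_freedom = max(n_components - 1.0, 1)
    return degrees_of_freedom
```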
I have followed your suggestion!
@@ -645,10 +645,10 @@ class TSNE(BaseEstimator):
     """

     def __init__(self, n_components=2, perplexity=30.0,
-                 early_exaggeration=4.0, learning_rate=1000.0, n_iter=1000,
+                 early_exaggeration=4.0, learning_rate=200.0, n_iter=2000,
We will not change default values without a deprecation cycle. I think we're best not changing these.
I have put the defaults back as you suggested, although I do believe the default learning rate is too high at the moment!
Maybe not best to do in this PR, but these lower learning rates may be a better default choice (https://www.reddit.com/r/MachineLearning/comments/47kf7w/scikitlearn_tsne_implementation/)
Yes, maybe we can set the default to a lower value. But we should at least compare the result on the first 10,000 images from the MNIST dataset with the result from the original paper. For example, in this case it works with learning_rate=1000, however, the gaps between classes in the original t-SNE paper are a little bit larger.
By the way, in the original pull request I wrote down some notes on the learning rate and compared several implementations (link):
In the literature:
- original paper: initialization with standard deviation 1e-4, 1000 episodes, learning rate 100, momentum 0.5 for 250 episodes, 0.8 for the rest, early exaggeration with 4 for 50 episodes
- matlab implementation: learning rate 500, early exaggeration for 100 episodes
- python implementation: initialization with standard deviation 1, learning rate 500, early exaggeration for 100 episodes, momentum 0.5 for 20 episodes
- divvy: initialization with standard deviation 1e-4, 1000 episodes, learning rate 1000, momentum 0.5 for 100 episodes, 0.8 for the rest, early exaggeration with 4 for 100 episodes
- parametric t-sne (not comparable): conjugate gradient
- barnes-hut t-sne: initialization with standard deviation 1e-4, 1000 episodes, learning rate 200, momentum 0.5 for 250 episodes, 0.8 for the rest, early exaggeration with 12 for 250 episodes
My experiences:
- the learning rate has to be set manually for optimal performance, something between 100 and 1000
- a high momentum (0.8) during early exaggeration improves the result
This implementation uses the following schedule:
- initialization with standard deviation 1e-4, 1000 episodes, learning rate 1000, momentum 0.5 for 50 episodes, 0.8 for the rest, early exaggeration with 4 for 100 episodes
I wonder why we decided to use 1000 which is at the edge of the recommended range [100, 1000]?
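The schedule pattern those implementations share (early momentum, late momentum, early exaggeration) can be sketched like this. This is illustrative only: the function name and signature are made up, and real t-SNE exaggerates the P distribution rather than scaling the whole gradient as done here.

```python
import numpy as np

def descend(grad, p0, n_iter=500, learning_rate=0.1,
            momentum_switch=250, exaggeration_iter=100, exaggeration=4.0):
    """Sketch of the schedule style described above: momentum 0.5 for the
    first momentum_switch iterations, 0.8 afterwards, and the gradient
    amplified during the early-exaggeration phase."""
    p = p0.astype(float).copy()
    update = np.zeros_like(p)
    for i in range(n_iter):
        momentum = 0.5 if i < momentum_switch else 0.8
        scale = exaggeration if i < exaggeration_iter else 1.0
        # momentum gradient descent step
        update = momentum * update - learning_rate * scale * grad(p)
        p += update
    return p
```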
Should we open this as an issue perhaps so we can check for deprecation etc. and discuss further?
@@ -142,10 +142,12 @@ def _kl_divergence(params, P, degrees_of_freedom, n_samples, n_components,

     # Q is a heavy-tailed distribution: Student's t-distribution
     n = pdist(X_embedded, "sqeuclidean")
     n += 1.
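For context, the heavy-tailed Q distribution this function builds can be sketched for a general number of degrees of freedom as follows (a standalone illustration, not the sklearn code; at dof = 1 the kernel reduces to the familiar (1 + d²)⁻¹ form):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def student_t_q(X_embedded, dof=1.0):
    """Student-t similarity matrix Q over an embedding (illustrative)."""
    d2 = pdist(X_embedded, "sqeuclidean")
    # q_ij is proportional to (1 + d_ij^2 / dof) ** -((dof + 1) / 2)
    n = (1.0 + d2 / dof) ** (-(dof + 1.0) / 2.0)
    # Normalise over all ordered pairs (each condensed entry counts twice).
    return squareform(n / (2.0 * n.sum()))
```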
I may look at this closer later, but would appreciate someone else's comments on these changes to KL-divergence. @ssaeger, @AlexanderFabisch, @cemoody, @lesteve?
Let me know if you would like me to post an explanation of KL -- correction, I have done the calculus to confirm -- although you can tell that my code works simply by running the program if you wish! (make sure to pass method='exact')
KL is common enough, I'm just trying to run through a long backlog of issues in my commutes to work, and don't have time to check some details.
@jnothman on your phone? I haven't really found a good way to do that and would be very interested.
No, this isn't NYC. We sit on our public transport, most of the time. I use a laptop without an internet connection. On the other hand, I have about 60 issue pages open...
It doesn't look like you have a test for n_components > 2 yet...?
Sorry for the slow reply @jnothman
The test case works by checking degrees_of_freedom > 1; "alpha" in the test case refers to the number of degrees of freedom. This is because the calculation of KL divergence doesn't depend on n_components (that just changes the dimensions of the matrix).
However, I could also create a test case for n_components > 2 if you like? It might be a sensible idea!
Well, @AlexanderFabisch did say:
- Probably nobody used n_components > 2 before. :) We need some kind of unit test for that
Yes, that is because by default degrees_of_freedom = n_components - 1.
However, you can also run t-SNE with other degrees_of_freedom on a 2D embedding.
adding an additional test for larger n_components wouldn't hurt though ;)
                 n_iter_without_progress=30, min_grad_norm=1e-7,
                 metric="euclidean", init="random", verbose=0,
-                random_state=None, method='barnes_hut', angle=0.5):
+                random_state=None, method='barnes_hut', angle=0.5, degrees_of_freedom=1, min_error_diff=1e-7):
PEP8 line length
I have fixed this PEP8 violation in my latest commit
@@ -1351,6 +1351,42 @@ def make_s_curve(n_samples=100, noise=0.0, random_state=None):

     return X, t

 def make_trefoil_knot(n_samples=100, noise=0.0, random_state=None):
Please split this into a separate pull request.
I have deleted this function from this pull request
def make_trefoil_knot(n_samples=100, noise=0.0, random_state=None):
    """Generate a uniform trefoil knot.

    Read more in the :ref:`User Guide <sample_generators>`.
Only true if you modify the user guide (which you should). Sample generators are clearest with an example, too.
I have deleted this function from this pull request
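For reference, a generator along the lines of the removed one could look like this. This is a sketch using the standard trefoil parametrisation, not the PR's actual code:

```python
import numpy as np

def make_trefoil_knot(n_samples=100, noise=0.0, random_state=None):
    """Sample points uniformly in parameter along a trefoil knot in 3D.

    Returns (X, t) like the other sample generators: X is the point
    cloud, t the parameter value of each point.
    """
    rng = np.random.RandomState(random_state)
    t = rng.uniform(0, 2 * np.pi, size=n_samples)
    X = np.empty((n_samples, 3))
    # Standard trefoil parametrisation.
    X[:, 0] = np.sin(t) + 2 * np.sin(2 * t)
    X[:, 1] = np.cos(t) - 2 * np.cos(2 * t)
    X[:, 2] = -np.sin(3 * t)
    X += noise * rng.standard_normal(X.shape)
    return X, t
```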
Thanks for the comments Joel, I'll make the changes you mentioned this afternoon!
I've noticed there is also a bug in barnes_hut t-SNE with n degrees of freedom -- I will fix this then mark my pull request as MRG again.
is there a test for the barnes-hut change?
Yep the gradient is checked numerically
I mean there is no regression test for that, is there?
I'm afraid I'm unfamiliar with software regression testing! I will look into it.
I don’t think there should be any software regression (from what I understand of the term through reading Wikipedia): the original code had a bug that didn’t make any difference when n=1 (which encompasses the vast majority of run cases), so the fix doesn’t make any difference for n=1.
Regression testing just means making a test so that if someone rewrote the
code in the future, there would be a test to ensure this behaviour is
correct.
Ok I’ve added the n_components != 2 test
RE: The Barnes-Hut test,
There is a test wherein the gradient and KL are checked against the exact method, the function is called 'test_barnes_hut_angle()' in /manifold/tests/
@artcg thanks, but that test only covers the exact t-SNE code, right? Or does it do both? Maybe I'm misunderstanding something. You fixed a bug in both algorithms, the exact one and the approximate one, right? If you fixed two bugs, you should introduce a test for each of them.
@amueller I see what you mean!
I believe the tests currently work as follows:
- test_gradient() checks that the exact method is numerically correct
- test_barnes_hut_angle() checks that the barnes_hut method agrees with the exact method when angle is zero (thereby indirectly checking that the barnes_hut method is numerically correct)
Originally these tests did not cover n_components != 2, which is why the bug slid through. I have adapted both tests in this pull request so that the n_components != 2 case is checked, which should ensure the bug is present in neither the exact nor the Barnes-Hut implementation. Please let me know if there’s anything else I should look into or explain!
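The style of numerical check described above, comparing an analytic gradient against finite differences, can be sketched as follows (a hypothetical helper in the spirit of the tests discussed, not the actual sklearn test code):

```python
import numpy as np

def check_gradient(f, grad, x, eps=1e-5, tol=1e-4):
    """Compare grad(x) against central finite differences of f at x."""
    num = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        # Central difference along coordinate i.
        num[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return np.allclose(num, grad(x), atol=tol)
```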
@artcg ok, thanks :)
Hi all, hope you have had a good holiday,
Relying on some of the technical review from @AlexanderFabisch (thanks!), this LGTM now, apart from the minor nitpicks.
Please add a whats_new.rst entry.
 for j in range(l[0]):
-    qijZ = ((1.0 + dist2s[j]) / dof) ** exponent
+    oijZ = ((1.0 + (dist2s[j]) / dof)) ** -1
(dist2s[j]) -> dist2s[j]
@@ -528,8 +528,9 @@ cdef float compute_gradient_positive(float[:,:] val_P,
     for ax in range(n_dimensions):
         buff[ax] = pos_reference[i, ax] - pos_reference[j, ax]
         D += buff[ax] ** 2.0
-    Q = (((1.0 + D) / dof) ** exponent)
+    Q = ((((D) / dof) + 1.0) ** -1)
((D) / dof) -> D / dof
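As a quick, purely illustrative sanity check: the corrected dof = 1 expression `(D / dof + 1.0) ** -1` agrees with the general Student-t kernel `(1 + D/dof) ** ((dof + 1) / -2)` at dof = 1, which is why the original bug was invisible in the default 2D case:

```python
def kernel(D, dof):
    # General Student-t kernel used by t-SNE (illustrative sketch).
    return (1.0 + D / dof) ** ((dof + 1.0) / -2.0)

# At dof = 1 the general kernel reduces to the (D + 1) ** -1 form.
D, dof = 0.7, 1.0
assert abs(kernel(D, dof) - (D / dof + 1.0) ** -1) < 1e-12
```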
…strating properties of Manifold Learning algorithms. An example has been added to /examples
…hod), and set up the input parameter to allow for user-selected number of degrees of freedom
n_iter 1000 -> 2000: t-SNE has built-in functionality to stop if it isn't learning (n_iter_without_progress and min_error_diff); with these in place, it seems reasonable to allow it to iterate a little longer if it is still learning. learning_rate 1000 -> 200: a learning rate of 100 is used in the 2008 paper introducing t-SNE, and Van der Maaten uses a value of 200 in his implementation of bh_tsne; with that in mind, it seems sensible to use a value of 200.
Fixed bug in barnes hut t-SNE
Thanks for the review @jnothman, I made those changes and rebased the pull request so it is synced with master.
@@ -121,14 +121,15 @@ Enhancements

 Bug fixes
 .........

- Fixed a bug where :class:`sklearn.manifold.t_sne` behaved incorrectly with degrees of freedom other than 1 (default), by :user:`Arthur Goldberg <artcg>`.
Can we say n_components other than 2? Also, please keep to 80 chars per line.
Sure, thanks for pointing that out, changed.
Should be fixed in #9032
Thanks @artcg for your contributions.
And thanks for the help @jnothman! I learned a lot about good code practice from your reviews.
What does this implement/fix? Explain your changes.
There was a bug in the implementation of t-SNE with n degrees of freedom for the 'exact' method.
There was a bug where some arguments were not being passed in t-SNE
I added support for a trefoil knot generator
I changed a couple of default arguments for t-SNE
(Details given in the commit message)
Any other comments?
Please let me know if you have any questions about the changes!