[MRG] Mlp finishing touches #3939
Conversation
ping @ogrisel
Can we do a plot example with MNIST? I think that would be kind of cool. It should be fast enough.
I just realized that we do have plotting examples that fetch data (the species distribution example), so I'll try to do a nice MNIST example that visualizes some weights.
@ogrisel do you have a good idea on how to test momentum?
```python
self.layers_intercept_[i] -= (self.learning_rate_ *
                              ((1 - self.momentum) * intercept_grads[i]
                               + self.momentum * intercept_grads_prev[i]))
coef_grads_prev = list(coef_grads)
```
not making any sense here... that should be the update I'm storing, not the gradients...
I guess it could be made this way?
```python
coef_update_prev[i] = (self.learning_rate_ *
                       ((1 - self.momentum) * coef_grads[i]
                        + self.momentum * coef_update_prev[i]))
self.layers_coef_[i] -= coef_update_prev[i]
```
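For reference, a small self-contained sketch of the velocity-style bookkeeping being proposed, applied to both weights and intercepts. All names here are illustrative, not the PR's actual variables, and the formula simply mirrors the snippet above:

```python
def momentum_update(coefs, intercepts, coef_grads, intercept_grads,
                    coef_updates, intercept_updates, lr, momentum):
    """One SGD step that stores the previous *update* (velocity), not the gradient.

    coef_updates / intercept_updates are assumed to start as zero arrays of the
    same shapes as the parameters.
    """
    for i in range(len(coefs)):
        # blend the fresh gradient with the previous update, as suggested above
        coef_updates[i] = lr * ((1 - momentum) * coef_grads[i]
                                + momentum * coef_updates[i])
        coefs[i] -= coef_updates[i]

        intercept_updates[i] = lr * ((1 - momentum) * intercept_grads[i]
                                     + momentum * intercept_updates[i])
        intercepts[i] -= intercept_updates[i]
    return coefs, intercepts
```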
I once tested this by showing that momentum makes SGD converge in significantly fewer iterations than otherwise on the classic XOR problem.
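A sketch of that kind of test against the present-day scikit-learn API (assuming `MLPClassifier` with `solver='sgd'`; the class and attribute names in this PR may differ):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # classic XOR labels

def n_iter(momentum):
    clf = MLPClassifier(hidden_layer_sizes=(5,), solver='sgd',
                        learning_rate_init=0.2, momentum=momentum,
                        nesterovs_momentum=False, max_iter=2000,
                        tol=1e-4, random_state=0)
    clf.fit(X, y)
    return clf.n_iter_

# Momentum should let SGD reach the tolerance in noticeably fewer iterations.
assert n_iter(momentum=0.9) < n_iter(momentum=0.0)
```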
Why did you remove momentum again?
@amueller I didn't implement it for this
(force-pushed from bc67720 to 4938993)
FYI there is a PR to this PR here: amueller#22. @IssamLaradji you might want to have a look.
In addition to your todo list I would add:
MLPs are very sensitive to scaling, so I think it would be user friendly to have built-in scaling of each feature by an online estimate of its maximum absolute value. This is the strategy Vowpal Wabbit uses by default, and I think it's the best option here.
I am not sure about built-in scaling. We don't do that anywhere else. I agree that it is very useful for SGD, but we don't do it in SGDClassifier for example. We could add it there, too, though.
I would be +1 to have it in SGDClassifier as well, and in the SAG PR too. You just have to update your online estimate of the per-feature scale:

```python
np.maximum(X_batch.max(axis=0), self.scale_, out=self.scale_)
X_batch /= self.scale_
```

This does not take care of division by zero, but you get the idea.
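A self-contained version of that idea with the division-by-zero case handled; the helper name and the explicit `np.abs` are my additions, not part of the snippet above:

```python
import numpy as np

def scale_batch(X_batch, scale=None):
    """Update a running per-feature max-abs estimate and return the scaled batch."""
    if scale is None:
        scale = np.zeros(X_batch.shape[1])
    # keep the largest absolute value seen so far for each feature
    np.maximum(np.abs(X_batch).max(axis=0), scale, out=scale)
    # features that are still all-zero keep a divisor of 1 to avoid dividing by zero
    safe = np.where(scale == 0.0, 1.0, scale)
    return X_batch / safe, scale
```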
+1. These are really awesome changes. Cheers!
@ogrisel but don't you have to rescale your weights then?
You could (for SGDClassifier it would be easy), but for MLP it's too complicated. At some point the likelihood of the max-abs values changing much becomes very small, so it can be ignored.
Actually the impact on the intercept is not trivial.
It looks like the current
Hmm, partial_fit only does one iteration, but it is not equivalent to doing multiple iterations... weird
I think momentum doesn't work for partial_fit currently... meh
I think that with learning_rate=constant, tol currently doesn't do anything sensible.
'softmax' and 'identity' are both listed as options for the activation parameter.
Ah, sorry, I should be more specific. Just had a look at the code, and it makes a bit more sense now. If you put in a spelling mistake in the activation name, the error you get is not very informative.
However, identity and softmax aren't available in either the regressor or the classifier class, so if you try them, the failure is also unhelpful.
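A hypothetical sketch of the kind of explicit validation that would give a clearer error in both cases (the `ACTIVATIONS` dict and function names here are placeholders, not the module's actual contents):

```python
import numpy as np

# mapping of supported names to the actual activation functions
ACTIVATIONS = {'logistic': lambda x: 1.0 / (1.0 + np.exp(-x)),
               'tanh': np.tanh,
               'relu': lambda x: np.maximum(x, 0)}

def resolve_activation(name):
    if name not in ACTIVATIONS:
        raise ValueError("activation %r is not supported; expected one of %s"
                         % (name, sorted(ACTIVATIONS)))
    return ACTIVATIONS[name]
```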
@naught101 I agree.
The question is: which one is easier to set / the most invariant across settings?
Can you add your learning rate to the comparison example on the toy datasets? And maybe try it on covertype and MNIST, too? Btw, I usually ran full MNIST.
Seeking to finalize MLP
…Ctrl+C stop option for SGD
But I wasn't able to make it work because I am confused about line 5 in the pseudo-code below.
Can someone shed some light on this? :) Thanks! More details are in the paper below. [1] http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf
You should have a look at the implementations in Lasagne and Keras.
I see, now it's working; there is an epsilon term that prevents it from being zero. However, the results seem quite sensitive to the choice of that epsilon value.
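For reference, a sketch of the Adadelta update from the paper for a single parameter array; `rho` is the decay constant and `eps` is the epsilon term discussed above (the function and variable names are mine):

```python
import numpy as np

def adadelta_step(param, grad, acc_grad, acc_update, rho=0.95, eps=1e-6):
    # decaying average of squared gradients
    acc_grad = rho * acc_grad + (1 - rho) * grad ** 2
    # the RMS of past updates starts at zero, so eps is what keeps the ratio nonzero
    update = -np.sqrt(acc_update + eps) / np.sqrt(acc_grad + eps) * grad
    # decaying average of squared updates
    acc_update = rho * acc_update + (1 - rho) * update ** 2
    return param + update, acc_grad, acc_update
```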
@kastnerkyle just told me on gitter that he didn't have a lot of success with adagrad and adadelta. Can you do tests on multiple datasets, including digits and full MNIST? In particular it would be good to know how well they do with a default learning rate.
Got dropout?
I don't think dropout is necessary. It is highly unlikely that a single-CPU implementation would be used to train networks large enough to really need it. I would also probably stick with stock SGD and Nesterov momentum.
I'm not 100% sure about leaving out dropout, but yeah, we can always add it in, and I'd rather merge this sooner rather than later.
I wanted to have a look at @IssamLaradji's refactoring though.
@amueller nice, I did the refactoring in my PR - I moved the learning rate algorithms into the
Any updates on this? It would be great to get a minimal working implementation in; algorithmic improvements could come later. On a slight tangent: is it standard across scikit-learn to pass things like MLP's activation and weight optimisation functions as function names? It seems like it would make more sense to also accept actual functions. That would mean that new estimator PRs wouldn't have to spend so much time getting the details exactly right, as long as there were working defaults, since users could provide their own improved alternatives.
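A sketch of what accepting either a string or a callable could look like (a hypothetical helper, not scikit-learn's actual API):

```python
import numpy as np

# built-in activations available by name (placeholder set)
BUILTIN_ACTIVATIONS = {'tanh': np.tanh, 'relu': lambda x: np.maximum(x, 0)}

def resolve(activation):
    """Return an activation function, whether given by name or as a callable."""
    if callable(activation):
        return activation
    try:
        return BUILTIN_ACTIVATIONS[activation]
    except KeyError:
        raise ValueError("unknown activation %r; pass a callable or one of %s"
                         % (activation, sorted(BUILTIN_ACTIVATIONS)))

softplus = resolve(lambda x: np.log1p(np.exp(x)))  # a user-supplied function works too
```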
@IssamLaradji what is the "Bordes" learning rate?
@naught101 hopefully progress soon ;)
+1. Though it would be great to make it easy to implement your own custom optimizer by deriving the estimator class and overriding a private method, for instance.
@amueller the Bordes learning rate [1] approximates the diagonal of the Hessian to be used for updating each parameter, so it's like Newton's method but using only the diagonal approximation of the Hessian. This might not be straightforward to implement, though. On another note, an implementation-friendly approach is to use the line search defined in section 4.6 of [2] for stochastic gradient descent. It is very efficient for automatically determining the learning rate in the mini-batch case. However, although its theory is founded on convex functions, it works reasonably well for non-convex NN objectives, as shown by my experiments. [1] http://www.jmlr.org/papers/volume10/bordes09a/bordes09a.pdf
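A toy sketch of the per-parameter scaling being described: a secant estimate of the Hessian diagonal from consecutive gradients, used to rescale the SGD step. All names, the damping term `lam`, and the clipping bounds are illustrative choices, not taken from either paper:

```python
import numpy as np

def diag_scaled_step(w, w_prev, g, g_prev, lam=1.0, lo=1e-3, hi=1e3):
    """One update where each parameter gets a step size ~ 1 / (estimated Hessian diagonal)."""
    # coordinate-wise secant: h_i ~= (g_i - g_prev_i) / (w_i - w_prev_i)
    h_diag = (g - g_prev) / np.where(w == w_prev, 1.0, w - w_prev)
    # keep the estimate positive and bounded so the step stays well behaved
    h_diag = np.clip(np.abs(h_diag), lo, hi)
    return w - g / (h_diag + lam)
```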
Superseded by #5214.
Slight update to #3204, putting on the finishing touches.
- [ ] add tests for momentum (no idea how to do that)
- [ ] add monitor callback

I'm not sure about the scaling at the moment. Waiting for the MaxAbsScaler would not be great, but adding a default `scaling=True` later also seems like a bad idea :-/

I'm also not 100% sure of the correct "gain" in the initialization for the different nonlinearities.
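For context, a sketch of the gain-scaled initialization being referred to (a Glorot-style fan-in/fan-out bound times a nonlinearity-dependent gain; which gain goes with which activation is exactly the open question above):

```python
import numpy as np

def scaled_glorot_uniform(rng, fan_in, fan_out, gain=1.0):
    """Glorot/Xavier uniform init, with an extra gain factor for the nonlinearity."""
    bound = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

rng = np.random.RandomState(0)
W_hidden = scaled_glorot_uniform(rng, fan_in=64, fan_out=32, gain=1.0)
```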
For the current "constant" learning rate schedule, maybe it should be called "adaptive" instead? And have an actual constant one?