[MRG] Mlp finishing touches #3939
Conversation
ping @ogrisel
Can we do a plot example with MNIST? I think that would be kind of cool. It should be fast enough.
I just realized that we do have plotting examples that fetch data (the species distribution example), so I'll try to do a nice MNIST example that visualizes some weights.
@ogrisel do you have a good idea on how to test momentum?
```python
self.layers_intercept_[i] -= (self.learning_rate_ *
                              ((1 - self.momentum) * intercept_grads[i]
                               + self.momentum * intercept_grads_prev[i]))
coef_grads_prev = list(coef_grads)
```
not making any sense here... that should be the update I'm storing, not the gradients...
I guess it could be made this way?
```python
coef_update_prev[i] = (self.learning_rate_ *
                       ((1 - self.momentum) * coef_grads[i]
                        + self.momentum * coef_update_prev[i]))
self.layers_coef_[i] -= coef_update_prev[i]
```
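For reference, a small self-contained sketch of the velocity-style bookkeeping being proposed, applied to both weights and intercepts. All names here are illustrative, not the PR's actual variables, and the formula simply mirrors the snippet above:

```python
def momentum_update(coefs, intercepts, coef_grads, intercept_grads,
                    coef_updates, intercept_updates, lr, momentum):
    """One SGD step that stores the previous *update* (velocity), not the gradient.

    coef_updates / intercept_updates are assumed to start as zero arrays of the
    same shapes as the parameters.
    """
    for i in range(len(coefs)):
        # blend the fresh gradient with the previous update, as suggested above
        coef_updates[i] = lr * ((1 - momentum) * coef_grads[i]
                                + momentum * coef_updates[i])
        coefs[i] -= coef_updates[i]

        intercept_updates[i] = lr * ((1 - momentum) * intercept_grads[i]
                                     + momentum * intercept_updates[i])
        intercepts[i] -= intercept_updates[i]
    return coefs, intercepts
```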
I once tested this by showing that momentum makes SGD converge in significantly fewer iterations than otherwise on the classic XOR problem.
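A sketch of that kind of test against the present-day scikit-learn API (assuming `MLPClassifier` with `solver='sgd'`; the class and attribute names in this PR may differ):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # classic XOR labels

def n_iter(momentum):
    clf = MLPClassifier(hidden_layer_sizes=(5,), solver='sgd',
                        learning_rate_init=0.2, momentum=momentum,
                        nesterovs_momentum=False, max_iter=2000,
                        tol=1e-4, random_state=0)
    clf.fit(X, y)
    return clf.n_iter_

# Momentum should let SGD reach the tolerance in noticeably fewer iterations.
assert n_iter(momentum=0.9) < n_iter(momentum=0.0)
```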
Why did you remove momentum again?
@amueller I didn't implement it for this
(force-pushed from bc67720 to 4938993)
FYI there is a PR to this PR here: amueller#22. @IssamLaradji you might want to have a look.
In addition to your todo list I would add:
MLPs are very sensitive to scaling, so I think it would be user friendly to have built-in scaling of each feature by an online estimate of its maximum absolute value. This is the strategy Vowpal Wabbit uses by default, and I think it's the best option here.
I am not sure about built-in scaling. We don't do that anywhere else. I agree that it is very useful for SGD, but we don't do it in SGDClassifier for example. We could add it there, too, though.
I would be +1 to have it in SGDClassifier as well, and in the SAG PR too. You just have to update your online estimate of the per-feature scale:

```python
np.maximum(X_batch.max(axis=0), self.scale_, out=self.scale_)
X_batch /= self.scale_
```

This does not take care of division by zero, but you get the idea.
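A self-contained version of that idea with the division-by-zero case handled; the helper name and the explicit `np.abs` are my additions, not part of the snippet above:

```python
import numpy as np

def scale_batch(X_batch, scale=None):
    """Update a running per-feature max-abs estimate and return the scaled batch."""
    if scale is None:
        scale = np.zeros(X_batch.shape[1])
    # keep the largest absolute value seen so far for each feature
    np.maximum(np.abs(X_batch).max(axis=0), scale, out=scale)
    # features that are still all-zero keep a divisor of 1 to avoid dividing by zero
    safe = np.where(scale == 0.0, 1.0, scale)
    return X_batch / safe, scale
```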
+1. These are really awesome changes. Cheers!
@ogrisel but don't you have to rescale your weights then?
You could (for SGDClassifier it would be easy), but for MLP it's too complicated. At some point the likelihood of the max-abs values changing much becomes very small, so it can be ignored.
Actually the impact on the intercept is not trivial.
It looks like the current
Hmm, partial_fit only does one iteration, but it is not equivalent to doing multiple iterations... weird
I think momentum doesn't work for partial_fit currently... meh
I think that with learning_rate=constant, tol currently doesn't do anything sensible.
'softmax' and 'identity' are both listed as options for the activation parameter.
Ah, sorry, I should be more specific. Just had a look at the code, and it makes a bit more sense now. If you put in a spelling mistake in the activation name, the error you get is not very informative.
However, identity and softmax aren't available in either the regressor or the classifier class, so if you try them, the failure is also unhelpful.
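A hypothetical sketch of the kind of explicit validation that would give a clearer error in both cases (the `ACTIVATIONS` dict and function names here are placeholders, not the module's actual contents):

```python
import numpy as np

# mapping of supported names to the actual activation functions
ACTIVATIONS = {'logistic': lambda x: 1.0 / (1.0 + np.exp(-x)),
               'tanh': np.tanh,
               'relu': lambda x: np.maximum(x, 0)}

def resolve_activation(name):
    if name not in ACTIVATIONS:
        raise ValueError("activation %r is not supported; expected one of %s"
                         % (name, sorted(ACTIVATIONS)))
    return ACTIVATIONS[name]
```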
@naught101 I agree.
The question is: which one is easier to set / the most invariant across settings?
Can you add your learning rate to the comparison example on the toy datasets? And maybe try it on covertype and MNIST, too? Btw, I usually ran full MNIST.
Seeking to finalize MLP
…Ctrl+C stop option for SGD
But I wasn't able to make it work because I am confused about line 5 in the pseudo-code below.
Can someone shed some light on this? :) Thanks! More details are in the paper below. [1] http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf
You should have a look at the implementations in Lasagne and Keras.
I see, now it's working; there is an epsilon term that prevents it from being zero. However, the results seem quite sensitive to the choice of that epsilon value.
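For reference, a sketch of the Adadelta update from the paper for a single parameter array; `rho` is the decay constant and `eps` is the epsilon term discussed above (the function and variable names are mine):

```python
import numpy as np

def adadelta_step(param, grad, acc_grad, acc_update, rho=0.95, eps=1e-6):
    # decaying average of squared gradients
    acc_grad = rho * acc_grad + (1 - rho) * grad ** 2
    # the RMS of past updates starts at zero, so eps is what keeps the ratio nonzero
    update = -np.sqrt(acc_update + eps) / np.sqrt(acc_grad + eps) * grad
    # decaying average of squared updates
    acc_update = rho * acc_update + (1 - rho) * update ** 2
    return param + update, acc_grad, acc_update
```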
@kastnerkyle just told me on gitter that he didn't have a lot of success with adagrad and adadelta. Can you do tests on multiple datasets, including digits and full MNIST? In particular it would be good to know how well they do with a default learning rate.
Got dropout?
I don't think dropout is necessary. It is highly unlikely that a single-CPU implementation would be used to train networks large enough to really need it. I would also probably stick with stock SGD and Nesterov momentum.
I'm not 100% sure about leaving out dropout, but yeah, we can always add it in, and I'd rather merge this sooner rather than later.
I wanted to have a look at @IssamLaradji's refactoring though.
@amueller nice, I did the refactoring in my PR - I moved the learning rate algorithms into the
Any updates on this? It would be great to get a minimal working implementation in; algorithmic improvements could come later. On a slight tangent: is it standard across scikit-learn to pass things like MLP's activation and weight optimisation functions as function names? It seems like it would make more sense to also accept actual functions. That would mean that new estimator PRs wouldn't have to spend so much time getting the details exactly right, as long as there were working defaults, since users could provide their own improved alternatives.
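A sketch of what accepting either a string or a callable could look like (a hypothetical helper, not scikit-learn's actual API):

```python
import numpy as np

# built-in activations available by name (placeholder set)
BUILTIN_ACTIVATIONS = {'tanh': np.tanh, 'relu': lambda x: np.maximum(x, 0)}

def resolve(activation):
    """Return an activation function, whether given by name or as a callable."""
    if callable(activation):
        return activation
    try:
        return BUILTIN_ACTIVATIONS[activation]
    except KeyError:
        raise ValueError("unknown activation %r; pass a callable or one of %s"
                         % (activation, sorted(BUILTIN_ACTIVATIONS)))

softplus = resolve(lambda x: np.log1p(np.exp(x)))  # a user-supplied function works too
```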
@IssamLaradji what is the "Bordes" learning rate?
@naught101 hopefully progress soon ;)
+1. Though it would be great to make it easy to implement your own custom optimizer by deriving the estimator class and overriding a private method, for instance.
@amueller the Bordes learning rate [1] approximates the diagonal of the Hessian to be used for updating each parameter, so it's like Newton's method but using only the diagonal approximation of the Hessian. This might not be straightforward to implement, though. On another note, an implementation-friendly approach is to use the line search defined in section 4.6 of [2] for stochastic gradient descent. It is very efficient for automatically determining the learning rate in the mini-batch case. However, although its theory is founded on convex functions, it works reasonably well for non-convex NN objectives, as shown by my experiments. [1] http://www.jmlr.org/papers/volume10/bordes09a/bordes09a.pdf
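A toy sketch of the per-parameter scaling being described: a secant estimate of the Hessian diagonal from consecutive gradients, used to rescale the SGD step. All names, the damping term `lam`, and the clipping bounds are illustrative choices, not taken from either paper:

```python
import numpy as np

def diag_scaled_step(w, w_prev, g, g_prev, lam=1.0, lo=1e-3, hi=1e3):
    """One update where each parameter gets a step size ~ 1 / (estimated Hessian diagonal)."""
    # coordinate-wise secant: h_i ~= (g_i - g_prev_i) / (w_i - w_prev_i)
    h_diag = (g - g_prev) / np.where(w == w_prev, 1.0, w - w_prev)
    # keep the estimate positive and bounded so the step stays well behaved
    h_diag = np.clip(np.abs(h_diag), lo, hi)
    return w - g / (h_diag + lam)
```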
Superseded by #5214.
Slight update to #3204, putting on the finishing touches.
- [ ] add tests for momentum (no idea how to do that)
- [ ] add monitor callback

I'm not sure about the scaling at the moment. Waiting for the MaxAbsScaler would not be great, but adding a default `scaling=True` later also seems like a bad idea :-/

I'm also not 100% sure of the correct "gain" in the initialization for the different nonlinearities.
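For context, a sketch of the gain-scaled initialization being referred to (a Glorot-style fan-in/fan-out bound times a nonlinearity-dependent gain; which gain goes with which activation is exactly the open question above):

```python
import numpy as np

def scaled_glorot_uniform(rng, fan_in, fan_out, gain=1.0):
    """Glorot/Xavier uniform init, with an extra gain factor for the nonlinearity."""
    bound = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

rng = np.random.RandomState(0)
W_hidden = scaled_glorot_uniform(rng, fan_in=64, fan_out=32, gain=1.0)
```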
For the current "constant" learning rate schedule, maybe it should be called "adaptive" instead? And have an actual constant one?