[MRG] Mlp finishing touches #3939

Closed · wants to merge 6 commits

Conversation

@amueller (Member) commented Dec 5, 2014

Slight update to #3204, putting on the finishing touches.

  • early stopping with validation set - I haven't seen a case where that helped!
  • add tests for momentum (no idea how to do that)
  • use SGD in benchmark
  • add weight visualization example (maybe)
  • tune benchmark, add SGDClassifier
  • add MLP to the classifier comparison
  • add monitor callback
  • write an SGD example that plots training score / validation score vs epochs with SGD and LBFGS on the digits dataset (MNIST is too long).
  • add a builtin max-abs scaler. This would need [MRG+1-1] Refactoring and expanding sklearn.preprocessing scaling #2514 and/or [WIP] Refactor scaler code #3639.
  • Nesterov's momentum
  • rename to MLPClassifier?
  • Currently, if y in MultilayerPerceptronRegressor.fit() is a vector of shape (n,), .predict() returns a 2d array of shape (n, 1). Other regressors return a vector in the same format as y.
  • learning_rate='constant' and tol have weird interactions.
  • fix listing of nonlinearities as mentioned here: [MRG] Mlp finishing touches #3939 (comment)
  • make sure the initialization depends on the nonlinearity, as mentioned here: https://twitter.com/syhw/status/590555287138004992 -- not sure about that one yet.
  • using partial_fit / incremental learning doesn't yield the same results as batch learning.
  • multi-label partial_fit classes argument.
  • Python 3 doctests?
  • rename the constant learning rate?
  • warm_start and partial_fit don't work with random_state=None
  • invscaling might work better when using the epoch count for t, not n_samples * epochs (see the sketch after this list).
  • test early stopping
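For the invscaling item above, a minimal sketch of the two choices of t (eta0 and power_t follow the usual SGD naming; the helper itself is hypothetical):

import numpy as np

def invscaling_lr(eta0, power_t, epoch, n_samples, per_sample=True):
    # per_sample=True: t = n_samples * epochs (current behaviour)
    # per_sample=False: t = epochs (the alternative suggested above)
    t = n_samples * epoch if per_sample else epoch
    return eta0 / np.power(max(t, 1), power_t)

# with eta0=0.1 and power_t=0.5 on digits (n_samples=1797), the per-sample
# schedule decays far faster per epoch than the per-epoch variant
print(invscaling_lr(0.1, 0.5, epoch=10, n_samples=1797))
print(invscaling_lr(0.1, 0.5, epoch=10, n_samples=1797, per_sample=False))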

I'm not sure about the scaling at the moment. Waiting for the MaxAbsScaler would not be great, but adding a default scaling=True later also seems like a bad idea :-/

I'm also not 100% sure of the correct "gain" in the initialization for the different nonlinearities.
For the current "constant" learning rate schedule, maybe it should be called "adaptive" instead? And have an actual constant one?

@amueller (Member Author) commented Dec 5, 2014

ping @ogrisel

@amueller (Member Author) commented Dec 5, 2014

Can we do a plot example with MNIST? I think that would be kind of cool. It should be fast enough.

@amueller (Member Author) commented Dec 5, 2014

I just realized that we do have plotting examples that fetch data (the species distribution example), so I'll try to do a nice mnist example that visualizes some weights.

@amueller (Member Author) commented Dec 5, 2014

@ogrisel do you have a good idea on how to test momentum?

self.layers_intercept_[i] -= (self.learning_rate_ *
                              ((1 - self.momentum) * intercept_grads[i]
                               + self.momentum * intercept_grads_prev[i]))
coef_grads_prev = list(coef_grads)
@amueller (Member Author) commented on this diff:

not making any sense here... that should be the update I'm storing, not the gradients...

A contributor replied:

I guess it could be made this way?

coef_update_prev[i] = (self.learning_rate_ *
                       ((1 - self.momentum) * coef_grads[i]
                        + self.momentum * coef_update_prev[i]))
self.layers_coef_[i] -= coef_update_prev[i]
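For reference, a minimal standalone sketch of the "velocity" form of classical momentum, where the buffer stores the previous update rather than the previous gradient (the distinction discussed above). The names are illustrative, not the PR's attributes:

import numpy as np

def momentum_step(param, grad, velocity, lr=0.01, momentum=0.9):
    # the velocity buffer holds the previous *update*, not the previous gradient
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

# usage: keep one velocity array per parameter array, initialised to zeros
w = np.zeros(5)
v = np.zeros_like(w)
g = np.ones(5)    # stand-in for a computed gradient
w, v = momentum_step(w, g, v)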

@IssamLaradji (Contributor):

I once tested this by showing that momentum makes SGD converge in significantly fewer iterations than otherwise on the classic XOR problem.

@amueller (Member Author) commented Dec 6, 2014

Why did you remove momentum again?

@IssamLaradji (Contributor):

@amueller I didn't implement it for this MLP in the first place.
It was part of a Java course project to implement and test momentum. :)

@amueller force-pushed the mlp_refactoring branch 2 times, most recently from bc67720 to 4938993 on December 16, 2014 at 21:37
@ogrisel (Member) commented Dec 18, 2014

FYI, there is a PR to this PR here: amueller#22. @IssamLaradji, you might want to have a look.

@ogrisel (Member) commented Dec 18, 2014

In addition to your todo list I would add:

  • add monitor callback (but maybe as a constructor param? not sure.)
  • write an SGD example that plots training score / validation score vs epochs with SGD and LBFGS on the digits dataset (MNIST is too long). This could be combined with the …
  • add builtin max abs scaler

MLPs are very sensitive to scaling, so I think it would be user friendly to have scale=True by default in the constructor; when it is enabled we would implement per-feature scaling by max absolute value to bring all features into the range [-1, 1] without centering, to avoid breaking sparsity.

This is the strategy Vowpal Wabbit uses by default. I think it's the best because:

  • it's simple to understand
  • it has consistent behavior with dense or sparse data without breaking sparsity
  • it's trivial to implement, even for incremental fitting in partial_fit.

@amueller (Member Author):

I am not sure about built-in scaling. We don't do that anywhere else. I agree that it is very useful for SGD, but we don't do it in SGDClassifier for example. We could add it there, too, though.
How is it trivial to implement for partial_fit?

@ogrisel (Member) commented Dec 19, 2014

I would be +1 to having it in SGDClassifier as well, and in the SAG PR too.

How is it trivial to implement for partial_fit?

You just have to update your online estimate of self.scale_:

np.maximum(X_batch.max(axis=0), self.scale_, out=self.scale_)
X_batch /= self.scale_

This does not take care of division by zero, but you get the idea.
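A self-contained sketch of that idea (illustrative, not this PR's code), extended to use absolute values so negative features are handled and to guard against division by zero:

import numpy as np

class IncrementalMaxAbsScaler:
    # online per-feature scaling by max absolute value; preserves sparsity
    def partial_fit(self, X):
        batch_max = np.abs(X).max(axis=0)
        if not hasattr(self, "scale_"):
            self.scale_ = batch_max
        else:
            # the running estimate only ever grows, so earlier batches stay valid
            np.maximum(batch_max, self.scale_, out=self.scale_)
        return self

    def transform(self, X):
        # map all-zero features through unchanged instead of dividing by zero
        scale = np.where(self.scale_ == 0, 1.0, self.scale_)
        return X / scale

X_batch = np.array([[1.0, -2.0, 0.0], [0.5, 4.0, 0.0]])
print(IncrementalMaxAbsScaler().partial_fit(X_batch).transform(X_batch))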

@IssamLaradji (Contributor):

+1. These are really awesome changes, namely:

  1. builtin max abs scaler: I remember I had to implement this for a reinforcement learning problem, as SGD didn't work otherwise;
  2. monitor callback: extremely useful for theoretical and empirical studies; and
  3. Nesterov's momentum: a nice addition that is easy to implement, but it might be too much for this PR, as more testing and examples (comparing Nesterov and standard momentum) would be required (see the sketch below).

cheers!
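For reference, a minimal sketch contrasting standard and Nesterov momentum (purely illustrative; sgd_momentum_step and grad_fn are made-up names, not the PR's API):

import numpy as np

def sgd_momentum_step(w, v, grad_fn, lr=0.01, mu=0.9, nesterov=False):
    # Nesterov evaluates the gradient at the look-ahead point w + mu * v,
    # standard momentum evaluates it at w
    g = grad_fn(w + mu * v) if nesterov else grad_fn(w)
    v = mu * v - lr * g
    return w + v, v

# toy quadratic loss 0.5 * ||w||^2, whose gradient is simply w
grad_fn = lambda w: w
w, v = np.ones(3), np.zeros(3)
for _ in range(100):
    w, v = sgd_momentum_step(w, v, grad_fn, nesterov=True)
print(w)    # close to zero after 100 steps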

@amueller (Member Author):

@ogrisel but don't you have to rescale your weights then?

@ogrisel (Member) commented Dec 19, 2014

@ogrisel but don't you have to rescale your weights then?

You could (for SGDClassifier it would be easy), but for MLP it's too complicated. At some point the likelihood of the max-abs values changing much becomes very small, so it can be ignored.

@ogrisel (Member) commented Dec 19, 2014

You could (for SGDClassifier it would be easy),

Actually the impact on the intercept is not trivial.

@amueller (Member Author) commented Apr 4, 2015

It looks like the current partial_fit goes over the data multiple times, which is not what it should be doing.
Also, to summarize a discussion with @ogrisel: we probably want to do a train_test_split(test_size=.1) in fit and get a hold-out set to monitor validation error and know when to halve the learning rate.
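A rough sketch of that scheme (illustrative only; the partial_fit loop, the halving rule, and the learning_rate_ attribute are assumptions, not this PR's code):

import numpy as np
from sklearn.model_selection import train_test_split

def fit_with_validation(clf, X, y, max_epochs=100, patience=2):
    # hold out 10% of the data; halve the learning rate whenever the
    # validation score has not improved for `patience` epochs
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1,
                                                random_state=0)
    classes = np.unique(y)
    best, stalled = -np.inf, 0
    for epoch in range(max_epochs):
        clf.partial_fit(X_tr, y_tr, classes=classes)
        score = clf.score(X_val, y_val)
        if score > best:
            best, stalled = score, 0
        else:
            stalled += 1
            if stalled >= patience:
                clf.learning_rate_ /= 2.0    # hypothetical attribute
                stalled = 0
    return clf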

@amueller (Member Author) commented Apr 4, 2015

Hmm, partial_fit only does one iteration, but it is not equivalent to doing multiple iterations... weird.

@amueller (Member Author) commented Apr 4, 2015

I think momentum doesn't work for partial_fit currently... meh

@amueller (Member Author) commented Apr 7, 2015

I think that with learning_rate='constant', tol currently doesn't do anything sensible.

@naught101:

'softmax' and 'identity' are both listed as options for the activation= argument, but both result in KeyErrors.

@naught101:

Ah, sorry, I should be more specific. Just had a look at the code, and it makes a bit more sense now. If you put in a spelling mistake, say MultilayerPerceptronRegressor(activation='blah'), and try to fit, then you get an error from the base class:

/home/naught101/miniconda3/envs/science/lib/python3.4/site-packages/sklearn/neural_network/multilayer_perceptron.py in _fit(self, X, y, incremental)
    267             raise ValueError("The activation %s is not supported. Supported "
    268                              "activations are %s." % (self.activation,
--> 269                                                       ACTIVATIONS))
    270         if self.learning_rate not in ["constant", "invscaling"]:
    271             raise ValueError("learning rate %s is not supported. " %

ValueError: The activation blah is not supported. Supported activations are {'logistic': <function logistic at 0x7f0f90e0cb70>, 'relu': <function relu at 0x7f0f90e0cc80>, 'tanh': <function tanh at 0x7f0f90e0cbf8>, 'identity': <function identity at 0x7f0f90e0cae8>, 'softmax': <function softmax at 0x7f0f90e0cd08>}.

However, identity and softmax aren't available in either the regressor or the classifier class, so if you try them, the _fit() validation isn't tripped, and a KeyError occurs. Maybe the validation should be in the __init__(), and it should check the activation list for the child class, and not the base class?
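A minimal sketch of validating against a per-class list of supported activations, along the lines suggested above (class and attribute names are made up for illustration, and the check is shown in a helper method rather than __init__):

class BaseMLPSketch:
    # each subclass lists only the activations it actually supports
    _supported_activations = ()

    def __init__(self, activation="relu"):
        self.activation = activation

    def _validate_activation(self):
        if self.activation not in self._supported_activations:
            raise ValueError("The activation %r is not supported. Supported "
                             "activations are %s."
                             % (self.activation,
                                sorted(self._supported_activations)))

class MLPRegressorSketch(BaseMLPSketch):
    _supported_activations = ("logistic", "tanh", "relu")

try:
    MLPRegressorSketch(activation="softmax")._validate_activation()
except ValueError as exc:
    print(exc)    # lists only the regressor's activations instead of a KeyError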

@amueller (Member Author) commented May 5, 2015

@naught101 I agree.

@IssamLaradji (Contributor):

These are plots for several learning rates. Too bad it's trickier to capture the loss values for LBFGS; I guess we have to recompute the loss value after every iteration for LBFGS.
[figure_1: loss curves for several learning rates]

@IssamLaradji (Contributor):

In the experiment above, I used a momentum of 0.9 with Nesterov, and ran the algorithms on 8000 samples of MNIST.

But after carefully adjusting the learning rate, a constant learning rate with Nesterov momentum beats AdaGrad.
[figure_1: constant learning rate + Nesterov momentum vs AdaGrad]

@GaelVaroquaux (Member):

The question is: which one is easier to set / the most invariant across settings?

@amueller (Member Author) commented Jun 5, 2015

Can you add your learning rate to the comparison example on the toy datasets? And maybe try covertype and MNIST, too? Btw, I usually ran full MNIST.

@IssamLaradji (Contributor):

AdaGrad is sensitive to the initial rate, since the rate can never increase and has to be set manually [1]. However, AdaDelta addresses that and, according to the literature [1], is very robust even across different hyperparameters.

But I wasn't able to make it work, because I am confused about line 5 in the pseudocode below.

[figure: AdaDelta pseudocode from [1]]

RMS[delta_x]_{t-1} in line 5 starts at zero, so the steps accumulate zeros in line 6 and the weight parameters are never updated, because E[delta_x^2]_t in line 6 stays zero throughout the algorithm.

Can someone shed some light on this? :)

Thanks!

More details are in the paper below.

[1] http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf

@amueller (Member Author):

You should have a look at the implementations in lasagne and keras.

@IssamLaradji (Contributor):

I see, now it's working: there is an epsilon term that prevents it from being zero. However, changing that epsilon value from, say, 1e-5 to 1e-6 can make a huge difference in training performance.
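For reference, a minimal sketch of the AdaDelta update from [1], showing where the epsilon term enters (illustrative, not the PR's code):

import numpy as np

def adadelta_step(param, grad, acc_grad, acc_update, rho=0.95, eps=1e-6):
    # running average of squared gradients
    acc_grad = rho * acc_grad + (1 - rho) * grad ** 2
    # eps keeps the numerator non-zero on the first steps; without it the
    # accumulated updates would stay zero forever (the issue raised above)
    update = -np.sqrt(acc_update + eps) / np.sqrt(acc_grad + eps) * grad
    # running average of squared updates
    acc_update = rho * acc_update + (1 - rho) * update ** 2
    return param + update, acc_grad, acc_update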

@amueller (Member Author):

@kastnerkyle just told me on Gitter that he didn't have a lot of success with AdaGrad and AdaDelta. Can you run tests on multiple datasets, including digits and full MNIST? In particular, it would be good to know how well they do with a default learning rate.

@StevenLOL:

Got dropout?

@kastnerkyle (Member):

I don't think dropout is necessary. It is highly unlikely that a single-CPU net will get big or deep enough to need it, and we can always add it later if necessary. L2 and L1 regularization should be fine.

I would also probably stick with stock SGD and Nesterov momentum. Though Adam is nice because it doesn't introduce more hyperparameters, it is pretty new.

@amueller (Member Author):

I'm not 100% sure about leaving out dropout, but yeah, we can always add it in, and I'd rather merge this sooner rather than later.

@amueller (Member Author):

I wanted to have a look at @IssamLaradji's refactoring though.

@IssamLaradji (Contributor):

@amueller nice, I did the refactoring in my PR: I moved the learning rate algorithms into the learning_rate_class in the base file [1].

[1] https://github.com/IssamLaradji/scikit-learn/blob/generic-multi-layer-perceptron/sklearn/neural_network/base.py

@naught101:

Any updates on this? It would be great to get a minimal working implementation in; algorithmic improvements could come later.

On a slight tangent: is it standard across scikit-learn to pass things like MLP's activation and weight optimisation functions as function names? It seems like it would make more sense to also accept actual functions. That would mean that new estimator PRs wouldn't have to spend so much time getting the details exactly right, as long as there were working defaults, since users could provide their own improved alternatives.
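A minimal sketch of accepting either a name or a callable (purely illustrative; the estimators in this PR accept string names only):

import numpy as np

ACTIVATIONS = {"relu": lambda x: np.maximum(x, 0), "tanh": np.tanh}

def resolve_activation(activation):
    # accept either a known name or a user-supplied callable
    if callable(activation):
        return activation
    if activation not in ACTIVATIONS:
        raise ValueError("activation must be a callable or one of %s, got %r"
                         % (sorted(ACTIVATIONS), activation))
    return ACTIVATIONS[activation]

softplus = lambda x: np.log1p(np.exp(x))    # user-defined alternative
print(resolve_activation("relu")(np.array([-1.0, 2.0])))
print(resolve_activation(softplus)(np.array([0.0])))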

@amueller (Member Author):

@IssamLaradji what is the "Bordes" learning rate?

@amueller (Member Author):

@naught101 hopefully progress soon ;)

@ogrisel (Member) commented Aug 31, 2015

I would also probably stick with stock sgd and nesterov momentum.

+1. Though it would be great to make it easy to implement your own custom optimizer, for instance by deriving from the estimator class and overriding a private method.

@IssamLaradji (Contributor):

@amueller the Bordes learning rate [1] approximates the diagonal of the Hessian, which is then used to update each parameter; so it's like Newton's method, but using only the diagonal approximation of the Hessian. This might not be straightforward to implement, though.

On another note, an implementation-friendly approach is to use the line search defined in section 4.6 of [2] for stochastic gradient descent. It is very efficient for automatically determining the learning rate in the mini-batch case.

Although its theory is founded on convex functions, it works reasonably well for non-convex NN objectives, as shown by my experiments.

[1] http://www.jmlr.org/papers/volume10/bordes09a/bordes09a.pdf
[2] https://hal.inria.fr/file/index/docid/860051/filename/sag_journal.pdf

@amueller (Member Author):

Superseded by #5214.

@amueller closed this on Oct 20, 2015