
Possible issue in MLP code #6780


Closed · mblondel opened this issue May 14, 2016 · 21 comments · Fixed by #22151

Comments

@mblondel (Member) commented May 14, 2016

I was surprised to see that the regularization term is divided by n_samples. This is not standard.

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/multilayer_perceptron.py#L125

This doesn't seem to correspond to the objective documented in
http://scikit-learn.org/dev/modules/neural_networks_supervised.html
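For reference, a minimal sketch of the two objectives under discussion (illustrative only; the function names are hypothetical, not the actual scikit-learn internals):

import numpy as np

def documented_objective(data_loss, coefs, alpha):
    # penalty as the docs write it: alpha * (sum of squared weights)
    return data_loss + alpha * sum(np.sum(W ** 2) for W in coefs)

def implemented_objective(data_loss, coefs, alpha, n_samples):
    # penalty as the linked line computes it: additionally halved and
    # divided by n_samples (the mini-batch size during SGD)
    return data_loss + 0.5 * alpha * sum(np.sum(W ** 2) for W in coefs) / n_samples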

@maniteja123 (Contributor) commented May 14, 2016

I am not an expert, but looking at this line shows that it happens in the L2 regularisation term of the loss in the backpropagation algorithm. Also, the code uses 0.5 * alpha * (sum of squared weights) whereas the documentation writes alpha * (sum of squared weights). Though I suppose the 0.5 factor is common, the documentation should be corrected in case this is a mismatch. I hope these are valid pointers; please do let me know if I am understanding it wrong.

@mblondel (Member Author)

Yes, the objective value and the gradient both divide the regularization term by n_samples, but this is not correct: only the loss term should be divided by n_samples. By the way, here n_samples corresponds to the mini-batch size. Basically, this means there is a mismatch between the documentation and the implementation (alpha is effectively scaled down by the mini-batch size).

@rasbt (Contributor) commented Jul 1, 2016

Also, the code uses 0.5 * alpha * (sum of squared weights) whereas the documentation writes alpha * (sum of squared weights)

Yeah, typically we add a 0.5 to the cost function so that it cancels out when taking the derivative, which makes the "update" look a bit "nicer". E.g., below I wrote an example for a simple SSE:

[screenshot: hand-written derivation of the SSE cost and its gradient]
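(The screenshot itself is not preserved in this dump; the standard derivation it showed is presumably along these lines, reconstructed here for readability:)

J(w) = \frac{1}{2} \sum_i (y_i - \hat{y}_i)^2, \quad \frac{\partial J}{\partial w_j} = - \sum_i (y_i - \hat{y}_i) x_{ij}

i.e., the factor 2 from differentiating the square cancels the leading 1/2.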

But in practice, it really doesn't matter for training, since it's just a constant factor.

I was surprised to see that the regularization term is divided by n_samples. This is not standard.

I saw that the loss is also scaled by n_samples, though. So maybe that's done to be consistent with the derivative of the loss? Under the hood it really doesn't matter, however, and it's additional computational effort that is not required. I would simply remove it (but maybe leave it in the loss function if desired for interpretability, e.g., when plotting learning curves).

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/multilayer_perceptron.py#L125

@mblondel (Member Author) commented Jul 1, 2016

It is still important that the code implements what the doc says.

@rasbt (Contributor) commented Jul 1, 2016

I agree; I would drop the n_samples scaling then. I haven't read all the code in detail, but I think there's another inconsistency in the docs. E.g., the docs say

For regression, MLP uses the Square Error loss function; written as, ...

with the SSE following in the equation. However, in _base.py it's the MSE, if I see it correctly:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/_base.py#L180

return ((y_true - y_pred) ** 2).mean() / 2
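A quick numeric check of the factor-of-N gap between the two conventions (plain NumPy, illustrative only):

import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 2.5, 4.5])

sse_half = ((y_true - y_pred) ** 2).sum() / 2   # SSE/2, roughly what the docs suggested
mse_half = ((y_true - y_pred) ** 2).mean() / 2  # MSE/2, what _base.py computes

print(sse_half, mse_half, sse_half / mse_half)  # 0.5  0.125  4.0 (= n_samples)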

@maniteja123 (Contributor)

Yeah, typically we add a 0.5 to the cost function...

Sorry for my ignorance, but I was thinking the 0.5 is not standard in the regularisation term, as in 0.5 * alpha * (sum of squared weights), which is what is done here. Please do let me know if I am understanding it wrong. Thanks.

@rasbt (Contributor) commented Jul 1, 2016

I was thinking the 0.5 is not standard in the regularisation term, as in 0.5 * alpha * (sum of squared weights)

Oh, but it typically is the more common form (in the literature, at least). For example, if you update the weights as follows:

# the gradient of the penalty 0.5 * alpha * ||w||^2 is simply alpha * w
reg_term = alpha * self.w_
self.w_ += learning_rate_eta * (negative_grad - reg_term)

then you need the 0.5 in the cost function (like in the code snippet in the MLP) so that it cancels when you take the derivative, e.g., here for the logistic cost:

(I wrote it with lambda here; I assume "alpha" in the code is the regularization strength?)

[screenshot: the regularized logistic cost, written with lambda]
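(Again the screenshot is not preserved; a typical regularized logistic cost of the kind described, reconstructed for readability, is:)

J(w) = - \sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] + \frac{\lambda}{2} ||w||_2^2

whose penalty term differentiates to \lambda w, the 1/2 canceling against the exponent.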

@maniteja123 (Contributor)

Thanks for the clear explanation! Yeah, alpha is the regularisation strength lambda. I understand it now. It seems the code and the documentation should indeed be made consistent.

@rasbt (Contributor) commented Jul 1, 2016

Yes, it should be "alpha/2" in the doc.

@amueller added the Easy (Well-defined and straightforward way to resolve), Documentation, and Need Contributor labels on Oct 13, 2016
@aferritto (Contributor)

I'll take a look at this. It's the alpha in the square error loss function that needs to be changed to alpha/2, right? Does the first term in the loss function need a factor of 1/N to reflect the mean call here?

@amueller (Member) commented Dec 2, 2016

One of the inconsistencies is addressed in #7962. The 1/N still needs to be fixed.
Is that really non-standard? I find the docs write it in a weird way: I would write the objective as a function of the whole training set. What does an objective w.r.t. a single sample mean?
If we talk about loss functions, that's another thing, but the loss function shouldn't include the regularization, imho.

I would write

\sum_i ||y_i - \hat{y}_i||^2 + \frac{\alpha}{2}||W||_2^2

That's what's implemented, right?
Actually, that does look a bit strange....
hum....

@aferritto (Contributor)

The square error loss function here does not include the regularization and looks to be implemented as
\frac{1}{N} \sum_{i=1}^N ||y_i - \hat{y}_i||_2^2

which gets added to the regularization term during backpropagation here of
\frac{\alpha}{2N} ||W||_2^2
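Putting those two pieces together, the per-batch objective effectively being minimized would be

\frac{1}{N} \sum_{i=1}^N ||y_i - \hat{y}_i||_2^2 + \frac{\alpha}{2N} ||W||_2^2

with N the mini-batch size, which is exactly where the batch-size dependence of the penalty comes from.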

@rpmcruz commented Jan 15, 2018

It's very weird that the L2 (alpha) regularization is divided by the sample size.

It's true that there is an indirect relationship between sample size and the amount of regularization needed, but it cannot be computed linearly like this; it's highly domain-specific. This is very atypical behavior: nobody expects it, and it's undocumented. I would let the user decide how they want to specify alpha. As a reference, Keras averages the cross-entropy over samples but applies the L2 penalty without averaging over the sample size. This is also how things are described in Goodfellow, Bengio, and Courville's book. I believe the scikit-learn SGDClassifier implementation also does not divide alpha by the sample size (but I am not sure).

@rasbt (Contributor) commented Jan 15, 2018

I think that's because the loss & gradients are scaled by the sample size (e.g., mean squared error instead of sum squared error). But yeah, it should be documented.

@rpmcruz commented Jan 15, 2018

But it's weird that other scikit-learn models like Lasso or ridge regression do not divide the regularization by the sample size; only the MLP does. In fact, I believe SGDClassifier (which, like the MLP, also does SGD) does not divide the L2 term by the sample size either, but please check this for me. I have never heard of an MLP implementation dividing regularization by the sample size.

@rasbt (Contributor) commented Jan 15, 2018

Hm, I think it depends on how it's implemented, but it should actually be scaled if the loss/cost is scaled by the number of samples.

E.g., if we have the MSE as follows:

J(\theta) = \frac{1}{2 n_{samples}} \sum_{i=1}^{n_{samples}} (x_i - y_i)^2

Then adding the L2 norm for regularization would modify it as follows:

J(\theta) = \frac{1}{2 n_{samples}} \left[ \sum_{i=1}^{n_{samples}} (x_i - y_i)^2 + \lambda \sum_{j=1}^{n_{weights}} \theta_j^2 \right]

Hence,

J(\theta) = \frac{1}{2 n_{samples}} \sum_{i=1}^{n_{samples}} (x_i - y_i)^2 + \frac{\lambda}{2 n_{samples}} \sum_{j=1}^{n_{weights}} \theta_j^2

@rpmcruz commented Jan 16, 2018

Sebastian, I feel this discussion would be more productive if we cited sources.

In their book, Ian Goodfellow, Yoshua Bengio, and Aaron Courville define the mean squared error as the average squared difference between y and yhat [1]. They then define regularization as something you add to the loss that is independent of the number of observations you have [2].

This is also how every other toolkit formulates and implements regularization. In fact, it is also how regularization is formulated and applied in scikit-learn's linear models; see for instance the documentation for Elastic Net [3].

[1] http://www.deeplearningbook.org/contents/mlp.html
[2] http://www.deeplearningbook.org/contents/regularization.html
[3] http://scikit-learn.org/stable/modules/linear_model.html#elastic-net

@rpmcruz commented Jan 16, 2018

I had a look at the code again. Currently, we are dividing by the batch size, not the sample size (because what _backprop calls n_samples is actually the mini-batch size). Therefore, if I increase the mini-batch size from 8 to 256, I inadvertently change the regularization by a factor of 32!
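To make the batch-size dependence concrete, here is a toy back-of-the-envelope calculation (illustrative numbers only, mirroring the division discussed above):

alpha = 1e-4
penalty_numerator = 0.5 * alpha * 100.0  # pretend the summed squared weights equal 100

for batch_size in (8, 256):
    # the penalty added per batch, as currently implemented
    print(batch_size, penalty_numerator / batch_size)

# batch_size=8   -> 0.000625
# batch_size=256 -> 1.953125e-05  (32x smaller, although alpha never changed)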

@rasbt (Contributor) commented Jan 16, 2018

Ultimately, I would say it doesn't really matter whether the regularization term is scaled or not, as it is just a scaling factor. The way it's described by Goodfellow et al. in the Deep Learning book looks correct to me for an arbitrary loss function; however, if you scale that loss by the number of training samples, you arrive at the equation I wrote above (assuming the loss is the SSE; the MSE is maybe an odd example here).

The equation I wrote down is just an explanation of why the regularization term was scaled in the code: it is sometimes common to scale the loss by the number of training samples (e.g., to keep values on a decent scale when tracking convergence via the loss over mini-batches or epochs). If you scale the loss function, which includes the regularization term, you hence scale the regularization term as well, and the same scaling shows up in the gradients as a consequence (as in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/multilayer_perceptron.py#L130).

@rpmcruz commented Jan 16, 2018

I understand what you are saying. The entire loss function is currently being averaged, not only the mean squared error or cross-entropy component. I think we can agree on that, right? That is how things are implemented right now.

What I am saying is that this is not how people generally formulate the loss function, nor what they expect: nobody expects regularization to depend on the size of the mini-batch. That's not how it's done anywhere else, not even in other scikit-learn models. So either (a) this behavior should be changed, or (b) the loss function that is actually being minimized should be documented, so that people can scale the regularization accordingly.

@rasbt (Contributor) commented Jan 16, 2018

I understand what you are saying. The entire loss function is currently being averaged, not only the mean squared error or cross-entropy component. I think we can agree on that, right? That is how things are implemented right now.

Yeah, if the loss (e.g., the SSE term; using it as a placeholder here for loss functions in general) is averaged over the number of samples (total samples or mini-batch size), then the regularization term should also be averaged, as it's part of the (regularized) loss function. If your loss is the MSE, though, you get a double average.

Nobody expects regularization to depend on the size of the mini-batch. That's not how it's done anywhere else, not even in other scikit-learn models.

Yeah, I agree, and if it's not documented, that's confusing.

I think there are at least 3 solutions:

  • leave it as is and document it accordingly
  • leave it, document it, and add an optional "loss averaging" parameter defaulting to True
  • take out the loss averaging (this may affect users' existing hyperparameter settings, though; see the sketch below)
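For the third option, a user who wanted to preserve today's effective penalty after such a change would roughly rescale alpha by the batch size. A hypothetical back-of-the-envelope sketch (not actual scikit-learn API):

# current objective:  data_loss + 0.5 * alpha_old * ||W||^2 / batch_size
# proposed objective: data_loss + 0.5 * alpha_new * ||W||^2
# the penalties match when alpha_new = alpha_old / batch_size
alpha_old = 1e-4
batch_size = 200
alpha_new = alpha_old / batch_size
print(alpha_new)  # 5e-07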

@cmarmo added the module:neural_network label and removed the Easy (Well-defined and straightforward way to resolve) label on Dec 10, 2021