Possible issue in MLP code #6780
I am not an expert, but looking at this line shows that it happens in the L2 regularisation term in the loss of the backpropagation algorithm. Also, it seems to be
Yes, the objective value and the gradient both divide the regularization term by n_samples, but this is not correct: only the loss term should be divided by n_samples. By the way, n_samples here corresponds to the mini-batch size. Basically this means that there is a mismatch between the documentation and the implementation (alpha is effectively scaled down by the mini-batch size).
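To make the mismatch concrete, here is a sketch in my own notation (not taken from the code or the docs), with L_i the per-sample loss, N the mini-batch size, W the weights, and alpha the penalty strength (using the alpha/2 convention discussed further down):

```latex
% Intended objective: only the data term is averaged over the mini-batch
J_{\mathrm{intended}}(W) = \frac{1}{N}\sum_{i=1}^{N} L_i(W) + \frac{\alpha}{2}\lVert W \rVert_2^2

% Implemented objective: the penalty is averaged as well,
% so the effective regularization strength becomes alpha / N
J_{\mathrm{impl}}(W) = \frac{1}{N}\sum_{i=1}^{N} L_i(W) + \frac{\alpha}{2N}\lVert W \rVert_2^2
```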
Yeah, typically we add a 0.5 to the cost function so that it cancels when taking the derivatives, which makes the "update" look a bit "nicer". E.g., below I wrote an example for a simple SSE. But in practice it really doesn't matter for training, since it's just a constant factor.
I saw that the loss is also scaled by n_samples, though. So maybe that's done to be consistent with the derivative of the loss? Under the hood it really doesn't matter, and it's additional computational effort that is not required. I would simply remove it (but maybe leave it in the loss function if desired for interpretability, e.g., when plotting learning curves).
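A sketch of the kind of SSE example meant above (my notation; y_i is the target, yhat_i the model output, w_j a weight):

```latex
% Sum of squared errors with the conventional 1/2 factor
J(w) = \frac{1}{2}\sum_{i} \left(y_i - \hat{y}_i\right)^2

% The 2 from the chain rule cancels the 1/2, so the gradient looks "nicer"
\frac{\partial J}{\partial w_j} = -\sum_{i} \left(y_i - \hat{y}_i\right)\frac{\partial \hat{y}_i}{\partial w_j}
```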
It is still important that the code implements what the doc says.
I agree; I would drop the n_samples scaling then. I haven't read all the code in detail, but I think there's another inconsistency in the docs. E.g., in the docs it says
with the SSE following in the equation. However, in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/_base.py#L180
Sorry for my ignorance, but I was thinking the 0.5 is not standard in the regularisation term, as in
Oh, but it typically is the more common form (in the literature at least).
Then you need the 0.5 in the cost function (like in the code snippet in the MLP) so that it cancels when you take the derivative, e.g., here for the logistic cost (I wrote it with lambda here; I assume "alpha" in the code is the regularization strength?):
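A sketch of the regularized logistic cost presumably meant here (my notation, with lambda as the regularization strength and the 0.5 written as lambda/2):

```latex
% L2-regularized logistic (cross-entropy) cost
J(w) = -\sum_{i}\left[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right] + \frac{\lambda}{2}\lVert w \rVert_2^2

% Gradient contribution of the penalty: the 1/2 cancels, leaving lambda * w
\nabla_w \left(\frac{\lambda}{2}\lVert w \rVert_2^2\right) = \lambda w
```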
Thanks for the clear explanation! Yeah,
Yes, it should be "alpha/2" in the doc.
I'll take a look at this. It's the alpha in the square error loss function that needs to be changed to alpha/2, right? Does the first term in the loss function need a factor of 1/N to reflect the mean call here?
One of the inconsistencies is addressed in #7962. The 1/N still needs to be fixed. I would write
That's what's implemented, right?
It's very weird that the L2 (alpha) regularization is divided by the sample size. It's true there exists an indirect relationship between sample size and how much regularization is needed, but it cannot be computed linearly like this; it's highly domain-specific. This is very atypical behavior: nobody expects it, and it's undocumented. I would let the user decide how they want to specify alpha. As a reference, Keras averages the cross-entropy over samples but applies L2 without averaging over the sample size. This is also how things are described in Goodfellow, Bengio, and Courville's book. I believe the scikit-learn SGDClassifier implementation also does not divide alpha by the sample size (but I am not sure).
I think that's because the loss & gradients are scaled by the sample size (e.g., mean squared error instead of sum of squared errors). But yeah, it should be documented.
But it's weird that other models from scikit-learn like LASSO or Ridge regression do not divide the regularization by the sample size; only the MLP does. In fact, I believe SGDClassifier (which also does SGD like the MLP) also does not divide the L2 term by the sample size, but please check this for me. I have never heard of an MLP implementation dividing regularization by the sample size.
Hm, I think it depends on how it's implemented, but it should actually be scaled if the loss/cost is scaled by the number of samples. E.g., if we have the MSE as follows:

J(theta) = 1/(2 n_samples) * sum_{i=1..n_samples} (x_i - y_i)^2

then adding the L2 norm for regularization would modify it as follows:

J(theta) = 1/(2 n_samples) * [ sum_{i=1..n_samples} (x_i - y_i)^2 + lambda * sum_{j=1..n_weights} theta_j^2 ]

Hence,

J(theta) = 1/(2 n_samples) * sum_{i=1..n_samples} (x_i - y_i)^2 + lambda/(2 n_samples) * sum_{j=1..n_weights} theta_j^2

EDIT: Ahh, the above text looks horrible, let me clean this up:
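A possible cleaned-up version of the three equations above, in LaTeX (n = n_samples, m = n_weights):

```latex
J(\theta) = \frac{1}{2n}\sum_{i=1}^{n} (x_i - y_i)^2

J(\theta) = \frac{1}{2n}\left[\sum_{i=1}^{n} (x_i - y_i)^2 + \lambda \sum_{j=1}^{m} \theta_j^2\right]

J(\theta) = \frac{1}{2n}\sum_{i=1}^{n} (x_i - y_i)^2 + \frac{\lambda}{2n}\sum_{j=1}^{m} \theta_j^2
```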
Sebastian, I feel this discussion would be more productive if we cited sources. In their book, Ian Goodfellow, Yoshua Bengio, and Aaron Courville define the mean squared error as the average squared difference between y and yhat [1]. They then define regularization as something you add to the loss that is independent of the number of observations you have [2]. This is also how every other toolkit formulates and implements regularization. In fact, this is also how regularization is formulated and applied in scikit-learn's linear regression models; see for instance the documentation for Elastic Net [3].
[1] http://www.deeplearningbook.org/contents/mlp.html
I had a look at the code again. Currently, we are dividing by the batch size, not the sample size (because what _backprop calls samples is actually a mini-batch). Therefore, if I increase the size of the mini-batches from 8 to 256, I will inadvertently change the regularization by a factor of 32!
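A minimal NumPy sketch of that effect; the weight matrix and helper function are made up for illustration, but the penalty follows the alpha * ||W||^2 / (2 * batch_size) form discussed in this thread:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 10))   # toy weight matrix standing in for the MLP coefficients
alpha = 1e-4                    # nominal regularization strength

def penalty_as_implemented(W, alpha, batch_size):
    # L2 penalty divided by the mini-batch size, mirroring the behavior discussed above
    return alpha * np.sum(W ** 2) / (2 * batch_size)

p_small = penalty_as_implemented(W, alpha, batch_size=8)
p_large = penalty_as_implemented(W, alpha, batch_size=256)

# Same alpha, same weights, but the effective penalty differs by 256 / 8 = 32x
print(p_small / p_large)  # -> 32.0
```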
Ultimately, I would say it doesn't really matter whether the regularization term is scaled or not, as it is just a scaling factor. And the way it's described, e.g., by Goodfellow et al. in the Deep Learning book looks correct to me for an arbitrary loss function. However, if you scale that loss by the number of training samples, you would arrive at the equation I mentioned above (if the loss is the SSE; MSE is maybe an odd example here). The previous equation that I wrote down (considering that it is sometimes common to scale the loss by the number of training samples, e.g., to have values on a decent scale when tracking convergence via the loss over mini-batches or epochs) would just be an explanation of why the regularization term was scaled in the code. If you scale the loss function, which includes the regularization term, you hence scale the regularization term as well; the gradients are scaled accordingly as a consequence, as in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/multilayer_perceptron.py#L130.
I understand what you are saying. The entire loss function is currently being averaged, not only the mean squared error component or the cross-entropy component. I think we can agree on that, right? This is how things are implemented right now. What I am saying is that this is not how people generally formulate the loss function, nor what they expect. Nobody expects regularization to depend on the size of the mini-batch; that's not how it's done anywhere else, not even in other scikit-learn models. What I am saying is that either (a) this behavior should be changed, or (b) the loss function that is being minimized should be documented, so that people can scale the regularization accordingly.
Yeah, if the loss (e.g., the SSE term; using it as a placeholder here for loss functions in general) is averaged over the number of samples (total samples or minibatch size), then the regularization term should also be averaged as it's part of the (regularized) loss function. If your loss is MSE though, you have a double-average.
Yeah, I agree, and if it's not documented, that's confusing. I think there are at least 3 solutions:
I was surprised to see that the regularization term is divided by n_samples. This is not standard.
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/multilayer_perceptron.py#L125
This doesn't seem to correspond to the objective documented in
http://scikit-learn.org/dev/modules/neural_networks_supervised.html