Possible issue in MLP code #6780
I am not an expert, but looking at this line shows that it happens in the L2 regularisation term in the loss of the backpropagation algorithm. Also, it seems to be
Yes, the objective value and the gradient both divide the regularization term by n_samples, but this is not correct: only the loss term should be divided by n_samples. By the way, n_samples here corresponds to the mini-batch size. Basically this means that there is a mismatch between the documentation and the implementation (alpha is effectively scaled down by the mini-batch size).
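To make the mismatch concrete, here is a sketch in my own notation (not taken from the code or the docs), with L_i the per-sample loss, N the mini-batch size, W the weights, and alpha the penalty strength (using the alpha/2 convention discussed further down):

```latex
% Intended objective: only the data term is averaged over the mini-batch
J_{\mathrm{intended}}(W) = \frac{1}{N}\sum_{i=1}^{N} L_i(W) + \frac{\alpha}{2}\lVert W \rVert_2^2

% Implemented objective: the penalty is averaged as well,
% so the effective regularization strength becomes alpha / N
J_{\mathrm{impl}}(W) = \frac{1}{N}\sum_{i=1}^{N} L_i(W) + \frac{\alpha}{2N}\lVert W \rVert_2^2
```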
Yeah, typically we add a 0.5 to the cost function so that it cancels when taking the derivatives, which makes the "update" look a bit "nicer". E.g., below I wrote an example for a simple SSE. But in practice it really doesn't matter for training, since it's just a constant factor.
I saw that the loss is also scaled by n_samples, though. So maybe that's done to be consistent with the derivative of the loss? Under the hood it really doesn't matter, and it's additional computational effort that is not required. I would simply remove it (but maybe leave it in the loss function if desired for interpretability, e.g., when plotting learning curves).
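A sketch of the kind of SSE example meant above (my notation; y_i is the target, yhat_i the model output, w_j a weight):

```latex
% Sum of squared errors with the conventional 1/2 factor
J(w) = \frac{1}{2}\sum_{i} \left(y_i - \hat{y}_i\right)^2

% The 2 from the chain rule cancels the 1/2, so the gradient looks "nicer"
\frac{\partial J}{\partial w_j} = -\sum_{i} \left(y_i - \hat{y}_i\right)\frac{\partial \hat{y}_i}{\partial w_j}
```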
It is still important that the code implements what the doc says.
I agree; I would drop the n_samples scaling then. I haven't read all the code in detail, but I think there's another inconsistency in the docs. E.g., in the docs it says
with the SSE following in the equation. However, in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/_base.py#L180
Sorry for my ignorance, but I was thinking the 0.5 is not standard in the regularisation term, as in
Oh, but it typically is the more common form (in the literature at least).
Then you need the 0.5 in the cost function (like in the code snippet in the MLP) so that it cancels when you take the derivative, e.g., here for the logistic cost (I wrote it with lambda here; I assume "alpha" in the code is the regularization strength?):
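A sketch of the regularized logistic cost presumably meant here (my notation, with lambda as the regularization strength and the 0.5 written as lambda/2):

```latex
% L2-regularized logistic (cross-entropy) cost
J(w) = -\sum_{i}\left[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right] + \frac{\lambda}{2}\lVert w \rVert_2^2

% Gradient contribution of the penalty: the 1/2 cancels, leaving lambda * w
\nabla_w \left(\frac{\lambda}{2}\lVert w \rVert_2^2\right) = \lambda w
```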
Thanks for the clear explanation! Yeah,
Yes, it should be "alpha/2" in the doc.
I'll take a look at this. It's the alpha in the square error loss function that needs to be changed to alpha/2, right? Does the first term in the loss function need a factor of 1/N to reflect the mean call here?
One of the inconsistencies is addressed in #7962. The 1/N still needs to be fixed. I would write
That's what's implemented, right?
It's very weird that the L2 (alpha) regularization is divided by the sample size. It's true there exists an indirect relationship between sample size and how much regularization is needed, but it cannot be computed linearly like this; it's highly domain-specific. This is very atypical behavior: nobody expects it, and it's undocumented. I would let the user decide how they want to specify alpha. As a reference, Keras averages the cross-entropy over samples but applies L2 without averaging over the sample size. This is also how things are described in Goodfellow, Bengio, and Courville's book. I believe the scikit-learn SGDClassifier implementation also does not divide alpha by the sample size (but I am not sure).
I think that's because the loss & gradients are scaled by the sample size (e.g., mean squared error instead of sum of squared errors). But yeah, it should be documented.
But it's weird that other models from scikit-learn like LASSO or Ridge regression do not divide the regularization by the sample size; only the MLP does. In fact, I believe SGDClassifier (which also does SGD like the MLP) also does not divide the L2 term by the sample size, but please check this for me. I have never heard of an MLP implementation dividing regularization by the sample size.
Hm, I think it depends on how it's implemented, but it should actually be scaled if the loss/cost is scaled by the number of samples. E.g., if we have the MSE as follows:

J(theta) = 1/(2 n_samples) * sum_{i=1..n_samples} (x_i - y_i)^2

then adding the L2 norm for regularization would modify it as follows:

J(theta) = 1/(2 n_samples) * [ sum_{i=1..n_samples} (x_i - y_i)^2 + lambda * sum_{j=1..n_weights} theta_j^2 ]

Hence,

J(theta) = 1/(2 n_samples) * sum_{i=1..n_samples} (x_i - y_i)^2 + lambda/(2 n_samples) * sum_{j=1..n_weights} theta_j^2

EDIT: Ahh, the above text looks horrible, let me clean this up:
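A possible cleaned-up version of the three equations above, in LaTeX (n = n_samples, m = n_weights):

```latex
J(\theta) = \frac{1}{2n}\sum_{i=1}^{n} (x_i - y_i)^2

J(\theta) = \frac{1}{2n}\left[\sum_{i=1}^{n} (x_i - y_i)^2 + \lambda \sum_{j=1}^{m} \theta_j^2\right]

J(\theta) = \frac{1}{2n}\sum_{i=1}^{n} (x_i - y_i)^2 + \frac{\lambda}{2n}\sum_{j=1}^{m} \theta_j^2
```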
Sebastian, I feel this discussion would be more productive if we cited sources. In their book, Ian Goodfellow, Yoshua Bengio, and Aaron Courville define the mean squared error as the average squared difference between y and yhat [1]. They then define regularization as something you add to the loss that is independent of the number of observations you have [2]. This is also how every other toolkit formulates and implements regularization. In fact, this is also how regularization is formulated and applied in scikit-learn's linear regression models; see for instance the documentation for Elastic Net [3].
[1] http://www.deeplearningbook.org/contents/mlp.html
I had a look at the code again. Currently, we are dividing by the batch size, not the sample size (because what _backprop calls samples is actually a mini-batch). Therefore, if I increase the size of the mini-batches from 8 to 256, I will inadvertently change the regularization by a factor of 32!
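A minimal NumPy sketch of that effect; the weight matrix and helper function are made up for illustration, but the penalty follows the alpha * ||W||^2 / (2 * batch_size) form discussed in this thread:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 10))   # toy weight matrix standing in for the MLP coefficients
alpha = 1e-4                    # nominal regularization strength

def penalty_as_implemented(W, alpha, batch_size):
    # L2 penalty divided by the mini-batch size, mirroring the behavior discussed above
    return alpha * np.sum(W ** 2) / (2 * batch_size)

p_small = penalty_as_implemented(W, alpha, batch_size=8)
p_large = penalty_as_implemented(W, alpha, batch_size=256)

# Same alpha, same weights, but the effective penalty differs by 256 / 8 = 32x
print(p_small / p_large)  # -> 32.0
```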
Ultimately, I would say it doesn't really matter whether the regularization term is scaled or not, as it is just a scaling factor. And the way it's described, e.g., by Goodfellow et al. in the Deep Learning book looks correct to me for an arbitrary loss function. However, if you scale that loss by the number of training samples, you would arrive at the equation I mentioned above (if the loss is the SSE; MSE is maybe an odd example here). The previous equation that I wrote down (considering that it is sometimes common to scale the loss by the number of training samples, e.g., to have values on a decent scale when tracking convergence via the loss over mini-batches or epochs) would just be an explanation of why the regularization term was scaled in the code. If you scale the loss function, which includes the regularization term, you hence scale the regularization term as well; the gradients are scaled accordingly as a consequence, as in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/multilayer_perceptron.py#L130.
I understand what you are saying. The entire loss function is currently being averaged, not only the mean squared error component or the cross-entropy component. I think we can agree on that, right? This is how things are implemented right now. What I am saying is that this is not how people generally formulate the loss function, nor what they expect. Nobody expects regularization to depend on the size of the mini-batch; that's not how it's done anywhere else, not even in other scikit-learn models. What I am saying is that either (a) this behavior should be changed, or (b) the loss function that is being minimized should be documented, so that people can scale the regularization accordingly.
Yeah, if the loss (e.g., the SSE term; using it as a placeholder here for loss functions in general) is averaged over the number of samples (total samples or minibatch size), then the regularization term should also be averaged as it's part of the (regularized) loss function. If your loss is MSE though, you have a double-average.
Yeah, I agree, and if it's not documented, that's confusing. I think there are at least 3 solutions:
I was surprised to see that the regularization term is divided by n_samples. This is not standard.
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/multilayer_perceptron.py#L125
This doesn't seem to correspond to the objective documented in
http://scikit-learn.org/dev/modules/neural_networks_supervised.html