Feature alpha dropout #855
base: master
Conversation
Due to the similarities and shared code with DropoutLayer, should it extend DropoutLayer and override `get_output_for()`?
Welcome, and thank you for the PR! It seems a good strategy to attract contributors is to wait for an interesting paper to come out and not implement it right away ;)

I've got some detailed comments, and a general note on the implementation. You're right that it shares a lot of code with the DropoutLayer. Making AlphaDropoutLayer inherit from DropoutLayer will not immediately reduce the duplication, though; we'd need some refactoring. Your current implementation requires the mask to be known, so we cannot simply call the super `get_output_for()`, we'd still have to implement it. It's possible to formulate alpha dropout in terms of the original dropout by shifting the input, calling the super `get_output_for()`, and shifting and scaling the output. Theano will not be able to optimize this, though.
So I'd suggest the following (a rough sketch of how the pieces could fit together follows below):

- Create a read-only property `self.q` that encapsulates the conversion from `p`.
- In `DropoutLayer`, create a method `apply_dropout(self, input, const=0)` that rescales the input, creates the mask, multiplies the input with it, and adds `const * (1 - mask)` if `const != 0`. This can be called from the `DropoutLayer` (with `const=0`) and the `AlphaDropoutLayer` (with `const=self.alpha`).
- Make `AlphaDropoutLayer` inherit from `DropoutLayer` and overwrite `get_output_for()`. It will have minimal code duplication.
A nuisance is that a separate `AlphaDropoutLayer` means the `spatial_dropout` and `dropout_locations` convenience functions would have to be duplicated. But merging everything into a single `DropoutLayer` implementation would be ugly as well, as `AlphaDropoutLayer` with `alpha=0` is still not the same as `DropoutLayer`, so we'd need at least two additional parameters for the `DropoutLayer` if we were to unify the implementations.
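A rough sketch of how that layout could look (simplified: `shared_axes` handling, docstrings, and the convenience functions are omitted, and passing `rescale=False` for alpha dropout is an assumption based on the paper's formulation, not something decided in this thread):

```python
import theano.tensor as T
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams

from lasagne.layers import Layer
from lasagne.random import get_rng


class DropoutLayer(Layer):
    def __init__(self, incoming, p=0.5, rescale=True, **kwargs):
        super(DropoutLayer, self).__init__(incoming, **kwargs)
        self._srng = RandomStreams(get_rng().randint(1, 2147462579))
        self.p = p
        self.rescale = rescale

    @property
    def q(self):
        # retain probability as a symbolic expression, so p may be a
        # float or a scalar tensor without causing an upcast
        return T.constant(1) - self.p

    def apply_dropout(self, input, const=0):
        # rescale the kept units, draw the mask, and fill dropped
        # positions with `const` instead of zero when requested
        if self.rescale:
            input /= self.q
        mask = self._srng.binomial(input.shape, p=self.q, dtype=input.dtype)
        if const != 0:
            return input * mask + const * (T.constant(1) - mask)
        return input * mask

    def get_output_for(self, input, deterministic=False, **kwargs):
        if deterministic or self.p == 0:
            return input
        return self.apply_dropout(input)


class AlphaDropoutLayer(DropoutLayer):
    def __init__(self, incoming, p=0.5, alpha=None, **kwargs):
        # assumption: alpha dropout relies on the affine correction below
        # instead of the 1/q rescaling, following the paper
        super(AlphaDropoutLayer, self).__init__(incoming, p=p, rescale=False,
                                                **kwargs)
        if alpha is None:
            # alpha' = -lambda * alpha from the SELU paper
            alpha = (-1.0507009873554804934193349852946 *
                     1.6732632423543772848170429916717)
        self.alpha = alpha

    def get_output_for(self, input, deterministic=False, **kwargs):
        if deterministic or self.p == 0:
            return input
        dropped = self.apply_dropout(input, const=self.alpha)
        # affine transform restoring zero mean and unit variance
        a = T.pow(self.q + T.square(self.alpha) * self.q * self.p, -0.5)
        b = -a * self.p * self.alpha
        return a * dropped + b
```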
lasagne/layers/noise.py (outdated):

    Sets values to zero with probability p. Will also converge the remaining
    neurons to mean 0 and a variance of 1 unit. See notes for disabling dropout
    during testing.
    Parameters
Missing blank line before the `Parameters` heading.
lasagne/layers/noise.py (outdated):

    ``False`` or not specified, dropout (and possibly scaling) is enabled.
    Usually, you would use ``deterministic=False`` at train time and
    ``deterministic=True`` at test time.
    References
Missing blank line.
lasagne/layers/noise.py (outdated):

    class AlphaDropoutLayer(Layer):
        """Dropout layer.

        Sets values to zero with probability p. Will also converge the remaining
Actually it sets values to `alpha`.
lasagne/layers/noise.py (outdated):

    super(AlphaDropoutLayer, self).__init__(incoming, **kwargs)
    self._srng = RandomStreams()
    self.p = p
    self.alpha = -1.7580993408473766
I think alpha should be a parameter for the layer.
Oh, and we should use all available digits, because we can. And maybe call it differently, since it's called alpha prime in the paper, computed as minus lambda times alpha? Not sure.
The default value would be `-1.0507009873554804934193349852946 * 1.6732632423543772848170429916717`.
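For reference, a quick check of that product (the two constants are the SELU lambda and alpha from the paper; the result matches the hard-coded value above up to floating-point precision):

```python
selu_scale = 1.0507009873554804934193349852946   # lambda from the SELU paper
selu_alpha = 1.6732632423543772848170429916717   # alpha from the SELU paper
alpha_prime = -selu_scale * selu_alpha            # ~ -1.7580993408473766
```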
Good idea, I'll default it to that and create an additional parameter. The only difference between the calculated alpha and the value I have is floating-point precision.
lasagne/layers/noise.py (outdated):

    mask_shape = tuple(1 if a in shared_axes else s
                       for a, s in enumerate(mask_shape))

    mask = self._srng.uniform(mask_shape, dtype=input.dtype) < q
This will result in a binary mask. Why not use `binomial` as in the original dropout layer?
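For reference, the mask could be drawn directly with `binomial` instead of thresholding a uniform sample, along the lines of the existing dropout layer (assuming `mask_shape`, `q`, and `input` from the quoted code):

```python
# draw a binary mask directly, as the original DropoutLayer does
mask = self._srng.binomial(mask_shape, p=q, dtype=input.dtype)
```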
docs/modules/layers/noise.rst (outdated):

    @@ -21,3 +21,6 @@ Noise layers
    .. autoclass:: GaussianNoiseLayer
        :members:

    .. autoclass:: AlphaDropoutLayer
        :members:
I'd place it after the dropout layers.
lasagne/layers/noise.py (outdated):

    @@ -14,6 +14,7 @@
        "spatial_dropout",
        "dropout_locations",
        "GaussianNoiseLayer",
        "AlphaDropoutLayer",
I'd place it after the dropout layers.
lasagne/layers/noise.py
Outdated
bcast = tuple(bool(s == 1) for s in mask_shape) | ||
mask = T.patternbroadcast(mask, bcast) | ||
|
||
a = T.pow(q + self.alpha ** 2 * q * (1 - q), -0.5) |
Should use `p`, not `1 - q`, otherwise it might result in an unwanted upcast again. Should also use `T.square` instead of `** 2` to make the graph smaller.
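Applied to the quoted line, that would look roughly like:

```python
# use p directly and T.square to keep the graph small and avoid an upcast
a = T.pow(q + T.square(self.alpha) * q * self.p, -0.5)
```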
lasagne/layers/noise.py
Outdated
|
||
a = T.pow(q + self.alpha ** 2 * q * (1 - q), -0.5) | ||
|
||
b = -a * (1 - q) * self.alpha |
Again, should use `p`, not `1 - q`.
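I.e., roughly (matching the change above):

```python
b = -a * self.p * self.alpha
```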
I apologize for all of the commits, @f0k; I'm having some trouble squashing them together. Any help with this would be very much appreciated. Thanks!
Thank you for continuing, and sorry for the delay!
I guess the easiest way to get rid of all the merge commits in between will be to squash everything into a single commit for now. In the end I'd like two commits -- one refactoring the dropout layer and one introducing alpha dropout -- but we can take care of that later. To squash, do:

    git reset --soft edbbc12      # set pointer to the first commit of the PR, leave other changes staged
    git commit --amend --no-edit  # tack all other changes onto the first commit
    git push --force              # overwrite the current PR with the squashed version

I have made a local backup of the current state of your PR in case anything goes wrong. I will review the code afterwards, as I don't know if the reviews will stick after squashing.
Force-pushed from 2010d19 to f92c309.
Thanks for the assist! I think everything is still working as I left it. Should be ready for another round of reviews, @f0k.
@f0k Have you had a chance to look at the changes?
lasagne/layers/noise.py (outdated):

    **kwargs)
    if alpha is None:
        self.alpha = -1.0507009873554804934193349852946 * \
            1.6732632423543772848170429916717
Would it make sense to mention in a comment that these two values are fixed-point solutions to equations (4) and (5) in [1] for zero mean and unit variance input, and that the analytic expressions for them are given in equation (14), also in [1]?

Furthermore, this value for alpha should be consistent with a `SELU` input layer. Would a possible option be to pass a `SELU` instance for the parameter `alpha` and take the value from there, e.g.:

    elif isinstance(alpha, SELU):
        self.alpha = -alpha.scale * alpha.scale_neg

?
You bring up good points. I will update the documentation accordingly. Regarding the option to pass a SELU instance and calculate the alpha based on that, I think that since AlphaDropout is meant to work with SELU, it only makes sense for it to be able to use SELU. I will make this change.
Sorry, I had to finish my thesis first. You're including the commits of the SELU PR in your PR; can you please rebase onto master instead? It's more difficult to review otherwise. Cheers!

    git fetch upstream master
    git rebase upstream/master
    git push --force  # assuming the rebase went through without conflicts, otherwise `git rebase --abort` and let me know
Force-pushed from 73f220a to 4304373.
No worries, and congrats! I rebased onto master, resolved a few conflicts, and pushed. Everything is as it was.
Oops, sorry, I introduced a conflict by merging #871. I will review and can then rebase your PR onto the new master myself, resolving the conflict. Thank you for rebasing onto yesterday's master!
Thank you for rebasing, much easier to review now! There are still a couple of issues with the code and the documentation. The latter will need some thinking to explain well what the layer does.
lasagne/tests/layers/test_noise.py (outdated):

    def test_alpha(self, input_layer):
        input = theano.shared(numpy.ones((100, 100)))
        from lasagne.layers.noise import AlphaDropoutLayer
        layer = AlphaDropoutLayer(input_layer, alpha=-1.7580993408473766)
But that's just the default value for `alpha`. It should test another value, otherwise you never know whether that value was actually respected or it took the default anyway :)
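For instance, within the existing fixture (the particular value is arbitrary, anything other than the default works):

```python
layer = AlphaDropoutLayer(input_layer, alpha=-2.0)  # deliberately not the default
```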
lasagne/tests/layers/test_noise.py
Outdated
input = theano.shared(numpy.ones((100, 100))) | ||
from lasagne.layers.noise import AlphaDropoutLayer | ||
from lasagne.nonlinearities import selu # SELU instance | ||
layer = AlphaDropoutLayer(input_layer, alpha=selu) |
Same here, it should be a `SELU` instance with nonstandard scales. You can then test whether `self.alpha` was set correctly, in addition to checking the output.
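A sketch of what that could look like, assuming `SELU` from `lasagne.nonlinearities` takes `scale` and `scale_neg` arguments and reusing the existing `input_layer` fixture (the particular scale values are arbitrary):

```python
from lasagne.nonlinearities import SELU
from lasagne.layers.noise import AlphaDropoutLayer

custom_selu = SELU(scale=2.0, scale_neg=3.0)  # nonstandard scales
layer = AlphaDropoutLayer(input_layer, alpha=custom_selu)
# alpha should be derived from the SELU instance, not the default constant
assert layer.alpha == -custom_selu.scale * custom_selu.scale_neg
```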
lasagne/layers/noise.py (outdated):

    self.q = one - self.p
    if self.rescale:
        input /= self.q
`self.q` should not be set here; it should become a property of the layer (i.e., not an attribute):

    @property
    def q(self):
        return T.constant(1) - self.p

Then you never need to set `self.q` and can use it in `apply_dropout()` and in `get_output_for()`.
lasagne/layers/noise.py (outdated):

    mask = T.patternbroadcast(mask, bcast)

    if const != 0:
        return (input * mask) + (const * (1 - mask))
This has to use `T.constant(1)` instead of `1` again.
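I.e., something like:

```python
# use a symbolic constant instead of the Python int 1 to avoid the unwanted upcast
return (input * mask) + (const * (T.constant(1) - mask))
```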
lasagne/layers/noise.py (outdated):

    """Dropout layer.

    Sets values to alpha. Will also converge the remaining
    neurons to mean 0 and a variance of 1 unit. See notes for disabling dropout
Sorry, this explanation is a bit imprecise. Can you be more explicit what the layer does? (If you can't find anything, I'll try that myself.)
lasagne/layers/noise.py (outdated):

    incoming : a :class:`Layer` instance or a tuple
        the layer feeding into this layer, or the expected input shape
    p : float or scalar tensor
        The probability of setting a value to zero
Only that it's not set to zero...
lasagne/layers/noise.py
Outdated
alpha : float or SELU instance | ||
The default values are fixed point solutions to equations (4) and (5) | ||
in [1] for zero mean and unit variance input. The analytic expressions | ||
for them are given in equation (14) also in [1]. |
Should probably first state what role `alpha` plays for the operation of the layer. Also the reference must be given as `[1]_`, not `[1]`.
lasagne/layers/noise.py
Outdated
----- | ||
The alpha dropout layer is a regularizer that randomly sets input values to | ||
zero and also applies a normalization function to bring the mean to 0 | ||
and the variance to 1; see [1]_for why this might improve |
That used to reference the original dropout paper. Either the reference or the text should be changed.
lasagne/layers/noise.py (outdated):

    a = T.pow(self.q + T.square(self.alpha) * self.q * self.p, -0.5)
    b = -a * self.p * self.alpha

    return T.cast(a * mask + b, input.dtype)
Why do you need this cast? Can you trace back why it is necessary? If possible, it would be better to cast earlier, otherwise Theano might execute part of the graph on CPU and then only cast and copy the results back to GPU afterwards.
I believe it returns a float64 so I just want to make sure the dtype of the output is the same as the dtype of the input.
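One possible way to cast earlier, as suggested above (a sketch only; whether this actually removes the float64 intermediate depends on where the upcast originates):

```python
# cast the affine coefficients up front so the final expression never
# leaves input.dtype, instead of casting the combined result at the end
a = T.cast(T.pow(self.q + T.square(self.alpha) * self.q * self.p, -0.5),
           input.dtype)
b = T.cast(-self.p * self.alpha, input.dtype) * a
return a * mask + b
```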
@f0k I have made the changes that address all your comments. Let me know what you think.
Thank you, the code looks good now! For the docstring I'll need to have a closer look; unfortunately, I'll be abroad without web access for the next two weeks -- sorry to keep you waiting, I'll get back to this when I return!
Have you had a chance to review the docstring in more detail, @f0k? Please let me know if anything is inaccurate. Thanks!
This is my first PR to Lasagne so I'm definitely looking for suggestions. To complement SELU activation, I've made a new AlphaDropoutLayer as specified in the paper: https://arxiv.org/abs/1706.02515