Fix lr_scheduler unexpectedly calls step() when init argument last_epoch is larger than -1 #149312
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149312
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 0208b8f with merge base a264af8.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 424ac56 to 7c5e79a (Compare).
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Successfully rebased and force-pushed from 7c5e79a to 538d5e0 (Compare).
This does not look like the right approach. If the discrepancy is for ExponentialLR between get_lr and _get_closed_form_lr, I'd expect the fix to be local there. Could you explain your approach a little bit?
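For context, the mismatch being discussed here can be reproduced with a sketch along these lines (illustrative only, not the verbatim repro from #102261; whether the two printed values agree depends on the PyTorch version under test):

```python
import torch
from torch.optim.lr_scheduler import ExponentialLR

model = torch.nn.Linear(2, 2)
optim = torch.optim.AdamW(model.parameters(), lr=0.1)
sched = ExponentialLR(optim, gamma=0.9)

# Run one step so the checkpoint is no longer at the initial lr.
optim.step()
sched.step()

# "Resume": rebuild the optimizer from its state dict and attach a new
# scheduler, passing the epoch training stopped at via last_epoch.
optim2 = torch.optim.AdamW(model.parameters(), lr=0.1)
optim2.load_state_dict(optim.state_dict())
sched2 = ExponentialLR(optim2, gamma=0.9, last_epoch=sched.last_epoch)

# The constructor implicitly calls step(); before this PR that could leave
# the recursively computed param-group lr out of sync with the closed form.
print(optim2.param_groups[0]["lr"])
print(sched2._get_closed_form_lr()[0])
```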
test/optim/test_lrscheduler.py (Outdated)
optim2 = torch.optim.AdamW(model.parameters())
optim2.load_state_dict(optim.state_dict())
sch2 = LRClass(optim2, last_epoch=1)
self.assertEqual(optim.param_groups[0]["lr"], optim2.param_groups[0]["lr"])
This is not the same comparison as the repro; we should be comparing that the closed form lr is the same as the param group lr?
Changed, thanks!
Oh actually, I see what you're doing now. Sorry I was confused yesterday. I'm willing to accept this fix if you update the test case.
It would also be good to include a comment about why we prefer the _is_initial check.
@@ -134,7 +135,8 @@ def wrapper(*args, **kwargs):
    def _initial_step(self):
        """Initialize step counts and perform a step."""
As someone who has looked into LRScheduler more than I've been able to, have you seen a good reason why we need to call .step() from the constructor?
I think one of the key effects is to initialize the optimizer's initial lr to be the same as the scheduler's lr when it is created, and to reuse this part of the code:
pytorch/torch/optim/lr_scheduler.py
Lines 200 to 220 in 4c5cf18
with _enable_get_lr_call(self):
    if epoch is None:
        self.last_epoch += 1
        values = self.get_lr()
    else:
        warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
        self.last_epoch = epoch
        if hasattr(self, "_get_closed_form_lr"):
            values = cast(list[float], self._get_closed_form_lr())
        else:
            values = self.get_lr()

for param_group, lr in zip(self.optimizer.param_groups, values):
    if isinstance(param_group["lr"], Tensor):
        param_group["lr"].fill_(_to_scalar(lr))
    else:
        param_group["lr"] = lr

self._last_lr: list[float] = [
    group["lr"] for group in self.optimizer.param_groups
]
One improvement that could be made is extracting the internal lr-update logic out of the public step method; please check this PR: #149392 and the issue it fixed. Thanks!
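The separation being suggested could look roughly like the toy sketch below. This is only an illustration of the idea, not the actual proposal in #149392; ToyExponentialLR and _update_lr are made-up names.

```python
import torch

class ToyExponentialLR:
    """Toy sketch: the public step() only advances the epoch counter, while
    writing the lr into the optimizer lives in a private helper that the
    constructor can call without pretending an extra epoch has passed."""

    def __init__(self, optimizer, gamma, last_epoch=-1):
        self.optimizer = optimizer
        self.gamma = gamma
        self.last_epoch = last_epoch
        for group in optimizer.param_groups:
            group.setdefault("initial_lr", group["lr"])
        self.base_lrs = [g["initial_lr"] for g in optimizer.param_groups]
        self._update_lr()  # initialize lrs without going through step()

    def step(self):
        self.last_epoch += 1
        self._update_lr()

    def _update_lr(self):
        # Closed-form value, so fresh construction and resume stay in sync.
        epoch = max(self.last_epoch, 0)
        for group, base in zip(self.optimizer.param_groups, self.base_lrs):
            group["lr"] = base * self.gamma ** epoch

model = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ToyExponentialLR(opt, gamma=0.9, last_epoch=3)
print(opt.param_groups[0]["lr"])  # 0.1 * 0.9**3; no extra step applied
```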
I'd love to see this expanded to ensure this works for all LRSchedulers! I have confirmed that I see the same issue when testing with StepLR (when I try to resume training and set up a new LRScheduler, it is always one step off because of the initial step that is taken in the init of LRSchedulers).
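The StepLR symptom described above can be sketched like this (illustrative; whether the two printouts agree depends on whether the fix in this PR is present):

```python
import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(2, 2)
optim = torch.optim.AdamW(model.parameters(), lr=0.1)
sched = StepLR(optim, step_size=2, gamma=0.5)
for _ in range(3):
    optim.step()
    sched.step()

# Resume from the checkpointed optimizer state with a fresh scheduler.
optim2 = torch.optim.AdamW(model.parameters(), lr=0.1)
optim2.load_state_dict(optim.state_dict())
sched2 = StepLR(optim2, step_size=2, gamma=0.5, last_epoch=sched.last_epoch)

# Without the fix, the constructor's extra step() leaves the resumed
# scheduler one step ahead of where training actually stopped.
print(sched.last_epoch, optim.param_groups[0]["lr"])
print(sched2.last_epoch, optim2.param_groups[0]["lr"])
```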
@zeshengzong lmk if you can bring this PR over the finish line by expanding it to all LRSchedulers!
Hi @janeyx99, sorry for the late reply, I was busy with something else. I would like to fix all of them and hope I can clean up all the issues related to
Yes, adding a context to better distinguish the initial lr from the calculated lr,
[
    partial(ExponentialLR, gamma=0.999),
],
)
It'd be great to expand this to more than ExponentialLR!
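A broader parametrization could look something like this (a hypothetical list; the constructor arguments are placeholders, not necessarily the ones the test ended up using):

```python
from functools import partial

from torch.optim.lr_scheduler import (
    CosineAnnealingLR,
    ExponentialLR,
    LinearLR,
    MultiStepLR,
    StepLR,
)

# Hypothetical parametrize list covering more scheduler families.
SCHEDULER_FACTORIES = [
    partial(StepLR, step_size=3),
    partial(MultiStepLR, milestones=[2, 4]),
    partial(ExponentialLR, gamma=0.999),
    partial(CosineAnnealingLR, T_max=10),
    partial(LinearLR, start_factor=0.5, total_iters=4),
]
```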
Participating in a PyTorch meetup, will do it next week, thanks! :D
Hi @janeyx99, I've added more schedulers here. ReduceLROnPlateau has a different pattern, so I separated it into another test.
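For reference, a minimal sketch of why ReduceLROnPlateau needs separate handling: its step() takes a validation metric, and there is no _get_closed_form_lr to compare against (the numbers below are only illustrative):

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(2, 2)
optim = torch.optim.AdamW(model.parameters(), lr=0.1)
sched = ReduceLROnPlateau(optim, mode="min", factor=0.5, patience=1)

# The lr only changes after the metric stops improving for `patience` epochs,
# so the generic last_epoch-based checkpoint test does not apply directly.
for val_loss in [1.0, 0.9, 0.95, 0.97]:
    optim.step()
    sched.step(val_loss)
print(optim.param_groups[0]["lr"])
```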
Force-pushed from 7ad1b75 to 54dab5a (Compare).
test/optim/test_lrscheduler.py (Outdated)
optim2 = torch.optim.AdamW(model.parameters())
optim2.load_state_dict(optim.state_dict())
sch2 = LRClass(optim2, last_epoch=0)
self.assertEqual(sch2.get_last_lr()[0], optim.param_groups[0]["lr"])
Replaced with get_last_lr since some schedulers do not implement the _get_closed_form_lr method.
Can we use _get_closed_form_lr whenever it is available (using hasattr)?
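In case it helps, the suggested pattern could be factored as a small helper along these lines (the name _reference_lr is made up for this sketch):

```python
from torch.optim.lr_scheduler import LRScheduler

def _reference_lr(scheduler: LRScheduler) -> float:
    """Prefer the closed-form lr when the scheduler implements it; fall back
    to get_last_lr() for schedulers that only define the recursive form."""
    if hasattr(scheduler, "_get_closed_form_lr"):
        return scheduler._get_closed_form_lr()[0]
    return scheduler.get_last_lr()[0]
```

The test could then assert this reference value against the param-group lr for whichever scheduler class is parametrized.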
Changed, thanks!
@pytorchbot merge
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few of them are: Apply lint suggestions. (Details for Dev Infra team: raised by workflow job.)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few of them are: Apply lint suggestions. (Details for Dev Infra team: raised by workflow job.)
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Successfully rebased and force-pushed from fc64233 to 0208b8f (Compare).
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #102261

Changes
- Add _is_initial to replace the self.last_epoch == 0 condition for judging whether lr should be the initial value
- Add a test for the ExponentialLR checkpoint use case

Test Result