Fix lr_scheduler unexpectedly calls step() when init argument last_epoch is larger than -1 #149312


Closed
zeshengzong wants to merge 6 commits into main from fix/optim/step

Conversation


@zeshengzong zeshengzong commented Mar 17, 2025

Fixes #102261

Changes

  • Use a new flag _is_initial, instead of the self.last_epoch == 0 condition, to decide whether the lr should keep its initial value
  • Add a test for the ExponentialLR checkpoint use case
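To illustrate the idea, here is a minimal pure-Python sketch (not the actual PyTorch implementation; the class and names are illustrative) of why self.last_epoch == 0 is ambiguous and how a dedicated flag resolves it:

```python
# Sketch: after loading a checkpoint, a scheduler constructed with
# last_epoch > -1 already starts at a later epoch, so "is this the
# constructor's initial step?" cannot be answered from last_epoch alone.
# A dedicated _is_initial flag answers it directly.

class TinyExponentialLR:
    def __init__(self, base_lr, gamma, last_epoch=-1):
        self.base_lr = base_lr
        self.gamma = gamma
        self.last_epoch = last_epoch
        self._is_initial = True  # the flag this PR introduces (sketched)
        self.step()              # mirrors LRScheduler calling step() in __init__

    def get_lr(self):
        if self._is_initial:
            # Initial step: report the lr for the current epoch
            # without applying another decay factor.
            return self.base_lr * self.gamma ** self.last_epoch
        return self.lr * self.gamma

    def step(self):
        if self._is_initial:
            self.last_epoch = max(self.last_epoch, 0)
        else:
            self.last_epoch += 1
        self.lr = self.get_lr()
        self._is_initial = False


# Fresh run: lr starts at base_lr.
fresh = TinyExponentialLR(base_lr=1.0, gamma=0.5)
# Resumed run at epoch 2: lr is base_lr * gamma**2, with no extra decay.
resumed = TinyExponentialLR(base_lr=1.0, gamma=0.5, last_epoch=2)
```

With the old last_epoch == 0 check, the resumed scheduler would take the "calculate" branch in its constructor and apply one extra factor of gamma.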

Test Result

pytest -s test/optim/test_lrscheduler.py  -vv

[screenshot: pytest output]


pytorch-bot bot commented Mar 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149312

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0208b8f with merge base a264af8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@zeshengzong zeshengzong marked this pull request as ready for review March 18, 2025 08:06
@zeshengzong (Contributor Author)

Hello @albanD @janeyx99, please check whether this fix is feasible. If it works, I would like to continue fixing other schedulers that have the same problem, such as MultiplicativeLR and LinearLR. Thanks!

@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Mar 20, 2025
@albanD albanD removed their request for review April 9, 2025 19:37
@zeshengzong (Contributor Author)

@pytorchbot rebase -b main

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased fix/optim/step onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout fix/optim/step && git pull --rebase)


@janeyx99 janeyx99 left a comment


This does not look like the right approach. If the discrepancy is for ExponentialLR between get_lr and _get_closed_form_lr, I'd expect the fix to be local there. Could you explain your approach a little bit?

optim2 = torch.optim.AdamW(model.parameters())
optim2.load_state_dict(optim.state_dict())
sch2 = LRClass(optim2, last_epoch=1)
self.assertEqual(optim.param_groups[0]["lr"], optim2.param_groups[0]["lr"])
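The discrepancy under discussion can be sketched with plain arithmetic (assumed numbers, not the real test values): ExponentialLR's closed form is base_lr * gamma**epoch, while its recursive form multiplies the previous lr by gamma on every step. Before this fix, constructing a scheduler with last_epoch > -1 ran the recursive form once in __init__, producing one extra factor of gamma.

```python
# Illustrative numbers only: compare the closed-form lr at a resume
# epoch with the value the pre-fix constructor would have produced.
base_lr, gamma, last_epoch = 0.1, 0.999, 1

closed_form = base_lr * gamma ** last_epoch  # what resume should give
buggy_initial = closed_form * gamma          # extra decay applied pre-fix

assert abs(closed_form - 0.1 * 0.999) < 1e-12
assert buggy_initial < closed_form
```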
Contributor

This is not the same comparison as the repro: shouldn't we be comparing that the closed-form lr is the same as the param group's lr?

Contributor Author

Changed, thanks!


@janeyx99 janeyx99 left a comment


Oh actually, I see what you're doing now. Sorry I was confused yesterday. I'm willing to accept this fix if you update the test case.

It would also be good to include a comment about why we prefer the _is_initial.

@janeyx99 janeyx99 added the topic: bug fixes topic category label May 6, 2025
@janeyx99 janeyx99 dismissed their stale review May 6, 2025 17:46

left newer review

@@ -134,7 +135,8 @@ def wrapper(*args, **kwargs):
def _initial_step(self):
"""Initialize step counts and perform a step."""
Contributor

As someone who has looked into LRScheduler more than I've been able to, have you seen a good reason why we need to call .step() from the constructor?

Contributor Author

I think one of its key effects is initializing the optimizer's lr to match the scheduler's lr at construction time, reusing this part of the code:

with _enable_get_lr_call(self):
    if epoch is None:
        self.last_epoch += 1
        values = self.get_lr()
    else:
        warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
        self.last_epoch = epoch
        if hasattr(self, "_get_closed_form_lr"):
            values = cast(list[float], self._get_closed_form_lr())
        else:
            values = self.get_lr()

for param_group, lr in zip(self.optimizer.param_groups, values):
    if isinstance(param_group["lr"], Tensor):
        param_group["lr"].fill_(_to_scalar(lr))
    else:
        param_group["lr"] = lr

self._last_lr: list[float] = [
    group["lr"] for group in self.optimizer.param_groups
]

One improvement would be to extract the internal lr-update logic out of the public step method; please check PR #149392 and the issue it fixes. Thanks!

@joecummings (Member)

I'd love to see this expanded to ensure it works for all LRSchedulers! I have confirmed the same issue when testing with StepLR (when I try to resume training and set up a new LRScheduler, it is always one step off because of the initial step taken in the LRScheduler's __init__).
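The StepLR off-by-one described above can be sketched in plain Python (assumed numbers; this simplification also assumes the resume epoch falls on a decay boundary, where StepLR's recursive form applies gamma):

```python
# StepLR's closed form is base_lr * gamma ** (epoch // step_size).
# If the constructor's initial step applies the recursive decay when
# last_epoch > -1, a resumed scheduler starts one decay ahead.

def steplr_closed_form(base_lr, gamma, step_size, epoch):
    return base_lr * gamma ** (epoch // step_size)

base_lr, gamma, step_size = 0.1, 0.5, 2

# lr after 4 epochs of the original run (epoch 4 is a decay boundary):
original = steplr_closed_form(base_lr, gamma, step_size, 4)

# a buggy resume at last_epoch=4 applied one extra decay on construction:
buggy_resume = original * gamma

assert abs(original - 0.025) < 1e-12
assert abs(buggy_resume - 0.0125) < 1e-12
```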

@janeyx99 (Contributor)

@zeshengzong lmk if you can bring this PR over the finish line with expanding it to all LRSchedulers!

@zeshengzong (Contributor Author)

zeshengzong commented May 14, 2025

@zeshengzong lmk if you can bring this PR over the finish line with expanding it to all LRSchedulers!

Hi @janeyx99, sorry for the late reply; I was busy with something else. I'd like to fix all of them and hope to clean up all the issues related to lr_scheduler. Thanks for the help!

@zeshengzong (Contributor Author)

Oh actually, I see what you're doing now. Sorry I was confused yesterday. I'm willing to accept this fix if you update the test case.

It would also be good to include a comment about why we prefer the _is_initial.

Yes, this adds context to better distinguish the initial lr from a calculated lr; self.last_epoch == 0 is not sufficient in this case.

[
partial(ExponentialLR, gamma=0.999),
],
)
Contributor

It'd be great to expand this to more than ExponentialLR!

Contributor Author

I'm attending a PyTorch meetup; I'll do it next week, thanks! :D

Contributor Author

Hi @janeyx99, I've added more schedulers here. ReduceLROnPlateau has a different pattern, so I separated it into another test.

optim2 = torch.optim.AdamW(model.parameters())
optim2.load_state_dict(optim.state_dict())
sch2 = LRClass(optim2, last_epoch=0)
self.assertEqual(sch2.get_last_lr()[0], optim.param_groups[0]["lr"])
Contributor Author

Replaced with get_last_lr, since some schedulers don't implement the _get_closed_form_lr method.

Contributor

Can we use _get_closed_form_lr whenever it is available (using hasattr)?

Contributor Author

Changed, thanks!

@zeshengzong (Contributor Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 20, 2025
@zeshengzong (Contributor Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 job has failed; the first few: Apply lint suggestions

Details for Dev Infra team. Raised by workflow job.

@janeyx99 (Contributor)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 job has failed; the first few: Apply lint suggestions

Details for Dev Infra team. Raised by workflow job.

@zeshengzong (Contributor Author)

@pytorchbot rebase -b main

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

zeshengzong and others added 6 commits May 22, 2025 01:22
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
@pytorchmergebot (Collaborator)

Successfully rebased fix/optim/step onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout fix/optim/step && git pull --rebase)

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label May 22, 2025
@zeshengzong (Contributor Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 22, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

Labels
ciflow/trunk Trigger trunk jobs on your pull request · Merged · open source · release notes: optim · topic: bug fixes topic category · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Development

Successfully merging this pull request may close these issues.

ExponentialLR unexpectedly calls step() when init argument last_epoch is larger than -1
5 participants