remove unnecessary sync point in AveragedModel update #158017
Conversation
This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @gl3lan, please do step 2 of the internal wiki to get write access so you do not need CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158017. Note: links to docs will display an error until the docs builds have completed. ❌ 10 New Failures as of commit 091aeb5 with merge base db78943. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D78074709
Summary: The check `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that runs on every update, yet it only serves to tell whether the AveragedModel copy has been initialized. This diff introduces a CPU-side boolean for that purpose; when loading from a checkpoint, we also make sure the flag is refreshed. After this fix, each `update_parameter` call drops from 333ms to 6ms (a 98% reduction).

Test Plan: contbuild & OSS CI. Test plan from GitHub: CI.

Rollback Plan:

Differential Revision: D78074709
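The idea behind the fix can be sketched as follows. This is an illustrative sketch, not the PR's actual diff: the class, the `_initialized` attribute, and the `update` method are assumed names.

```python
import torch

class AveragedSketch:
    """Minimal sketch of the fix (illustrative names, not the PR's code)."""

    def __init__(self):
        # The counter stays a tensor (e.g. for checkpoint compatibility);
        # in real use it may live on the GPU.
        self.n_averaged = torch.tensor(0, dtype=torch.long)
        # CPU-side flag: reading it never triggers a device synchronization.
        self._initialized = False

    def update(self, avg_param, new_param):
        if not self._initialized:        # was: bool(self.n_averaged == 0) -> host sync
            avg_param.copy_(new_param)   # first call just copies the parameters
            self._initialized = True
        else:
            # Running-mean update; the arithmetic stays on the tensors' device.
            avg_param.add_((new_param - avg_param) / (self.n_averaged + 1))
        self.n_averaged += 1
```

The key point is that the per-update branch now reads a plain Python attribute, so the GPU is never asked for a value on the hot path.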
Huh, why is `n_averaged` even a tensor? It could just be a Python number and this sync would go away, right?
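For context, a minimal illustration of the difference being discussed (variable names are hypothetical; the sync cost only materializes when the tensor lives on an accelerator):

```python
import torch

n_avg_tensor = torch.tensor(0)   # in AveragedModel this buffer can live on the GPU
# bool() on a tensor comparison must materialize the value on the host;
# with a CUDA tensor that means waiting for all queued kernels to finish.
needs_init = bool(n_avg_tensor == 0)

# With a plain Python int, the same check is a host-only operation:
n_avg_int = 0
needs_init_fast = n_avg_int == 0
```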
Right, but we need it to be saved/reloaded upon job resuming, or else the EMA behavior will change after resuming.
I've just learned from @mikaylagawarecki that there are `get_extra_state` and `set_extra_state` hooks on modules that we can use to store a plain Python number for `n_averaged`: https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.get_extra_state. This would make the code simpler and more understandable, with the same perf win, if we had `n_averaged` as a Python number instead of a buffer.
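A minimal sketch of that suggestion, using the documented `get_extra_state`/`set_extra_state` overrides. The class and the dict key are illustrative, not the PR's final code:

```python
import torch

class AveragedModelSketch(torch.nn.Module):
    """Sketch: keep n_averaged as a Python int and round-trip it
    through the module's extra state (illustrative, not the PR's code)."""

    def __init__(self):
        super().__init__()
        self.n_averaged = 0  # plain int: checking it never syncs with the GPU

    def get_extra_state(self):
        # Saved in the state_dict under this module's "_extra_state" key.
        return {"n_averaged": self.n_averaged}

    def set_extra_state(self, state):
        # Called by load_state_dict with whatever get_extra_state returned.
        self.n_averaged = state["n_averaged"]
```

The counter then survives checkpointing without ever being a tensor, which is exactly what the EMA-resume concern above requires.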
@janeyx99 it looks cleaner now, but it breaks loading of any pre-existing checkpoint because of the missing parameter. Any suggestion? Override `load_state_dict`?
@gl3lan yeah, we should be able to register a hook (https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_load_state_dict_pre_hook) so that if the state dict has a tensor `n_averaged`, we just convert it to the number. How does that sound to you? I'm also okay with the overriding-`load_state_dict` option, whichever is simpler. Can you also add a test case to ensure we don't break existing users?