FSDP example #1019
✅ Deploy Preview for pytorch-examples-preview canceled.
FSDP/README.md
Outdated
@@ -0,0 +1,25 @@
## FSDP T5
Can we put FSDP in the distributed/ folder? Also, please link to this example from the main README.md as well.
FSDP/T5_training.py
Outdated
#init_end_event.record()

#if rank == 0:
#print(f"Cuda event elapsed time: {init_start_event.elapsed_time(init_end_event) / 1000}sec")
Please remove the commented-out code.
FSDP/T5_training.py
Outdated
currEpoch = (
"-" + str(epoch) + "-" + str(round(curr_val_loss.item(), 4)) + ".pt"
)
print(f"--> attempting to save model prefix {currEpoch}")
nit: saving could be its own function
Agree - it's done this way in a different FSDP repo I have.
updated
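For readers following this thread, here is a minimal sketch of what pulling the save logic into its own function might look like. It is not the code that was merged. The names `checkpoint_suffix`, `save_checkpoint`, `save_dir`, `prefix`, and `saver` are hypothetical, and the sketch assumes the state passed in has already been gathered on rank 0 (gathering a full FSDP state dict is a separate step that this example does not cover).

```python
from pathlib import Path
from typing import Any, Callable, Optional


def checkpoint_suffix(epoch: int, val_loss: float) -> str:
    """Build the same "-<epoch>-<loss>.pt" suffix as the snippet under review."""
    return f"-{epoch}-{round(val_loss, 4)}.pt"


def save_checkpoint(
    state: Any,
    save_dir: str,
    prefix: str,
    epoch: int,
    val_loss: float,
    rank: int,
    saver: Callable[[Any, Path], None],
) -> Optional[Path]:
    """Write a checkpoint from rank 0 only and return its path.

    `saver` is injected so the function stays testable without a GPU.
    Assumption: in a real training script it could be `torch.save`,
    which accepts (obj, f) in that order.
    """
    if rank != 0:
        # Other ranks skip the write; `state` is assumed to be gathered already.
        return None
    path = Path(save_dir) / (prefix + checkpoint_suffix(epoch, val_loss))
    print(f"--> attempting to save model prefix {path.name}")
    saver(state, path)
    return path
```

In the training loop, the inline block would then reduce to a single call such as `save_checkpoint(state, "checkpoints", "t5", epoch, curr_val_loss.item(), rank, torch.save)`, where the directory and prefix are placeholders.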
@HamidShojanazeri I think a good FSDP example would be very useful for users doing large scale training. Thanks for your contribution! Requesting to resolve the comments.
@rohan-varma @lessw2020 @HamidShojanazeri please let @hudeven and me know once you'd like to merge the PR. This has been open for a while. Feel free to close any feedback you don't believe is relevant.
Let me review - I was not even aware this PR existed until today, so thanks for the direct link.
General comment - this example does not use activation checkpointing due to the timing of this PR (it wasn't added in FSDP until after this PR).
@msaroufim, @hudeven sorry for the delay. I addressed the comments and made the code more modular; it would be great if we could merge this.
Added AC (activation checkpointing) and the rate limiter, as well as model checkpointing.
@svekars any idea if the doc build is flaking for any reason? @HamidShojanazeri do you mind rebasing on main to see if the error goes away?
* Adding FSDP example
* adding slurm cluster setup instruction
* adding setup model func
* added missing features
* sumamrizatioon_dataset
* Updates training and remove unnecessary imports
* updtaing the wrapping policy
* Added Zero2 sharding
* updates from testing on clean machine
* updates from clean machine, add requirements.txt
* updates from clean machine
* added SentencePiece
* removed activation checkpointing and added check for bf16
* clean up
* removing cluster setup
* fix progress bars, update readme
* update progress bars, readme
* correct ordering for curr_val_loss evaluation and model save
* clean up the dataset links
* fixing the dataset links
* updates from clean machine
* reverting lastest unnecesary changes
* moving to a new folder
* adding FSDP to dist folder
* updates to address comments
* adding utils and configs to make the code modular
* clean up

---------

Co-authored-by: lessw2020 <lessw@etrillium.com>
This example shows how to train a Hugging Face (HF) T5 model with FSDP and is meant to be used alongside the accompanying tutorial.