Skip to content

FSDP example #1019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
May 24, 2023
Merged

FSDP example #1019

merged 29 commits into from
May 24, 2023

Conversation

HamidShojanazeri
Copy link
Contributor

This example shows training a HF T5 model with FSDP to be used with its tutorial

@netlify
Copy link

netlify bot commented Jul 7, 2022

Deploy Preview for pytorch-examples-preview canceled.

Name Link
🔨 Latest commit c15c689
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-examples-preview/deploys/62c741fbf9c2cc00089990df

FSDP/README.md Outdated
@@ -0,0 +1,25 @@
## FSDP T5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put FSDP in the distributed/ folder? Also please link to this example from the main README.md as well

#init_end_event.record()

#if rank == 0:
#print(f"Cuda event elapsed time: {init_start_event.elapsed_time(init_end_event) / 1000}sec")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove the commented out code

currEpoch = (
"-" + str(epoch) + "-" + str(round(curr_val_loss.item(), 4)) + ".pt"
)
print(f"--> attempting to save model prefix {currEpoch}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: saving could be its own function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree - it's done this way in a different fsdp repo I have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Contributor

@hudeven hudeven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HamidShojanazeri I think a good FSDP example would be very useful for users doing large scale training. Thanks for your contribution! Requesting to resolve the comments.

@msaroufim
Copy link
Member

msaroufim commented Sep 22, 2022

@rohan-varma @lessw2020 @HamidShojanazeri once you tell @hudeven and I that you'd like to merge the PR let us know. This has been open for a while. Feel free to close any feedback you don't believe is relevant

@lessw2020
Copy link
Contributor

Let me review - I was not even aware this PR existed until today, so thanks for the direct link.

@lessw2020
Copy link
Contributor

General comment - this example does not use activation checkpointing due to the timing of this PR (it wasn't added in FSDP until after this PR).
But I think it would be good to update this example with it, to make sure it's present as activation checkpointing is one of our biggest throughput boosters.

@netlify
Copy link

netlify bot commented May 24, 2023

Deploy Preview for pytorch-examples-preview canceled.

Name Link
🔨 Latest commit f62b4ae
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-examples-preview/deploys/646e47182d49400008c6a694

@HamidShojanazeri
Copy link
Contributor Author

@msaroufim , @hudeven sorry for the delay I addressed the comments and made the code more modular, would be great if we could merge this.

@HamidShojanazeri
Copy link
Contributor Author

General comment - this example does not use activation checkpointing due to the timing of this PR (it wasn't added in FSDP until after this PR). But I think it would be good to update this example with it, to make sure it's present as activation checkpointing is one of our biggest throughput boosters.

Added the AC and rate_lmiter as well+ model checkpointings.

@msaroufim msaroufim self-requested a review May 24, 2023 16:10
@msaroufim
Copy link
Member

@svekars any idea if the doc build is flaking for any reason?

@HamidShojanazeri do you mind rebasing on main to see if the error goes away

@msaroufim msaroufim merged commit 7b7c708 into pytorch:main May 24, 2023
YinZhengxun pushed a commit to YinZhengxun/mt-exercise-02 that referenced this pull request Mar 30, 2025
* Adding FSDP example

* adding slurm cluster setup instruction

* adding setup model func

* added missing features

* sumamrizatioon_dataset

* Updates training and remove unnecessary imports

* updtaing the wrapping policy

* Added Zero2 sharding

* updates from testing on clean machine

* updates from clean machine, add requirements.txt

* updates from clean machine

* added SentencePiece

* removed activation checkpointing and added check for bf16

* clean up

* removing cluster setup

* fix progress bars, update readme

* update progress bars, readme

* correct ordering for curr_val_loss evaluation and model save

* clean up the dataset links

* fixing the dataset links

* updates from clean machine

* reverting lastest unnecesary changes

* moving to a new folder

* adding FSDP to dist folder

* updates to address comments

* adding utils and configs to make the code modular

* clean up

---------

Co-authored-by: lessw2020 <lessw@etrillium.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants