FSDP example #1019

HamidShojanazeri · 2022-07-07T20:28:42Z

This example shows how to train a Hugging Face (HF) T5 model with FSDP. It is meant to be used alongside the FSDP tutorial.

netlify · 2022-07-07T20:28:50Z

✅ Deploy Preview for pytorch-examples-preview canceled.

Name	Link
🔨 Latest commit	`c15c689`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-examples-preview/deploys/62c741fbf9c2cc00089990df

msaroufim · 2022-07-08T16:56:07Z

FSDP/README.md

@@ -0,0 +1,25 @@
+## FSDP T5


Can we put FSDP in the distributed/ folder? Also, please link to this example from the main README.md.

msaroufim · 2022-07-08T17:10:57Z

FSDP/T5_training.py

+    #init_end_event.record()
+
+    #if rank == 0:
+        #print(f"Cuda event elapsed time: {init_start_event.elapsed_time(init_end_event) / 1000}sec")


Please remove the commented-out code.

msaroufim · 2022-07-08T17:12:36Z

FSDP/T5_training.py

+                currEpoch = (
+                    "-" + str(epoch) + "-" + str(round(curr_val_loss.item(), 4)) + ".pt"
+                )
+                print(f"--> attempting to save model prefix {currEpoch}")


nit: saving could be its own function

Agreed. It's done this way in a different FSDP repo I have.
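As a rough illustration of the suggestion above, the saving logic from the diff could be pulled into a helper along these lines. This is only a sketch: `checkpoint_suffix`, `save_checkpoint`, the `prefix` argument, and the injected `save_fn` are hypothetical names that do not appear in the PR. In a real run `save_fn` would typically be `torch.save`, and with FSDP the state dict would have to be gathered on all ranks before this helper is called.

```python
from typing import Any, Callable, Optional


def checkpoint_suffix(epoch: int, val_loss: float) -> str:
    """Build the same "-<epoch>-<val_loss>.pt" suffix as the inline code in the diff."""
    return "-" + str(epoch) + "-" + str(round(val_loss, 4)) + ".pt"


def save_checkpoint(
    state: Any,
    prefix: str,
    epoch: int,
    val_loss: float,
    rank: int,
    save_fn: Callable[[Any, str], None],
) -> Optional[str]:
    """Write `state` from rank 0 only and return the path that was used.

    `save_fn` is injected (an assumption of this sketch) so the helper does
    not depend on a particular serialization backend.
    """
    if rank != 0:
        # Other ranks skip the write so only one file is produced.
        return None
    path = prefix + checkpoint_suffix(epoch, val_loss)
    print(f"--> attempting to save model prefix {path}")
    save_fn(state, path)
    return path
```

In the training loop this would be called as something like `save_checkpoint(state, "t5", epoch, curr_val_loss.item(), rank, torch.save)`, where the `"t5"` prefix is again only a placeholder.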

hudeven

@HamidShojanazeri I think a good FSDP example would be very useful for users doing large scale training. Thanks for your contribution! Requesting to resolve the comments.

msaroufim · 2022-09-22T15:50:08Z

@rohan-varma @lessw2020 @HamidShojanazeri please let @hudeven and me know once you'd like to merge the PR. This has been open for a while. Feel free to close any feedback you don't believe is relevant.

lessw2020 · 2022-09-22T21:36:13Z

Let me review - I was not even aware this PR existed until today, so thanks for the direct link.

lessw2020 · 2022-09-22T22:08:34Z

General comment - this example does not use activation checkpointing due to the timing of this PR (it wasn't added in FSDP until after this PR).
But I think it would be good to update this example with it, to make sure it's present as activation checkpointing is one of our biggest throughput boosters.
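For readers unfamiliar with the feature, a minimal sketch of applying activation checkpointing to selected submodules follows. It assumes a recent PyTorch 2.x release where `apply_activation_checkpointing` is available in the (private) `checkpoint_wrapper` module. `Block` and `apply_ac` are hypothetical names; `Block` is a small stand-in for a transformer block, not the code from this PR. In the FSDP example the same call would presumably target the T5 block class on the FSDP-wrapped model.

```python
import functools

import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)


class Block(nn.Module):
    """Hypothetical stand-in for a transformer block such as T5Block."""

    def __init__(self, dim: int = 16) -> None:
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(x)


def apply_ac(model: nn.Module, block_cls: type) -> None:
    """Wrap every `block_cls` submodule in place so its activations are
    recomputed during backward instead of being stored."""
    wrapper = functools.partial(
        checkpoint_wrapper,
        checkpoint_impl=CheckpointImpl.NO_REENTRANT,
    )
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=wrapper,
        check_fn=lambda module: isinstance(module, block_cls),
    )


model = nn.Sequential(Block(), Block())
apply_ac(model, Block)
```

Checkpointing trades extra forward recomputation for lower activation memory, which is what usually allows a larger batch size per GPU.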

netlify · 2023-05-24T06:51:28Z

✅ Deploy Preview for pytorch-examples-preview canceled.

Name	Link
🔨 Latest commit	`f62b4ae`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-examples-preview/deploys/646e47182d49400008c6a694

HamidShojanazeri · 2023-05-24T06:53:46Z

@msaroufim, @hudeven sorry for the delay. I addressed the comments and made the code more modular; it would be great if we could merge this.

HamidShojanazeri · 2023-05-24T06:56:06Z

> General comment - this example does not use activation checkpointing due to the timing of this PR (it wasn't added in FSDP until after this PR). But I think it would be good to update this example with it, to make sure it's present as activation checkpointing is one of our biggest throughput boosters.

Added activation checkpointing (AC) and the rate limiter, as well as model checkpointing.

msaroufim · 2023-05-24T16:11:39Z

@svekars any idea if the doc build is flaking for any reason?

@HamidShojanazeri do you mind rebasing on main to see if the error goes away?

* Adding FSDP example
* adding slurm cluster setup instruction
* adding setup model func
* added missing features
* sumamrizatioon_dataset
* Updates training and remove unnecessary imports
* updtaing the wrapping policy
* Added Zero2 sharding
* updates from testing on clean machine
* updates from clean machine, add requirements.txt
* updates from clean machine
* added SentencePiece
* removed activation checkpointing and added check for bf16
* clean up
* removing cluster setup
* fix progress bars, update readme
* update progress bars, readme
* correct ordering for curr_val_loss evaluation and model save
* clean up the dataset links
* fixing the dataset links
* updates from clean machine
* reverting lastest unnecesary changes
* moving to a new folder
* adding FSDP to dist folder
* updates to address comments
* adding utils and configs to make the code modular
* clean up

---------

Co-authored-by: lessw2020 <lessw@etrillium.com>

HamidShojanazeri and others added 23 commits June 7, 2022 19:46

Adding FSDP example

75cf812

adding slurm cluster setup instruction

6ee63a5

adding setup model func

81fc1c1

added missing features

5e7c4f9

sumamrizatioon_dataset

c6d06a5

Updates training and remove unnecessary imports

dfa3c1a

updtaing the wrapping policy

52d1134

Added Zero2 sharding

3898e3c

updates from testing on clean machine

1361cca

updates from clean machine, add requirements.txt

a028145

updates from clean machine

12c2609

added SentencePiece

b2e50d6

removed activation checkpointing and added check for bf16

47d5a9d

clean up

a6a4696

removing cluster setup

f301c53

fix progress bars, update readme

ca9fa7d

update progress bars, readme

ce86dfd

correct ordering for curr_val_loss evaluation and model save

9485e8d

clean up the dataset links

6f6bb74

fixing the dataset links

0a904d8

Merge branch 'pytorch:main' into FSDP_example

6ddf6d9

updates from clean machine

3b898c1

reverting lastest unnecesary changes

c15c689

facebook-github-bot added the cla signed label Jul 7, 2022

msaroufim requested changes Jul 8, 2022

hudeven reviewed Aug 9, 2022

hudeven suggested changes Aug 9, 2022

hudeven added distributed new example labels Aug 17, 2022

hudeven added the training label Aug 17, 2022

msaroufim requested review from rohan-varma and lessw2020 September 22, 2022 15:49

HamidShojanazeri added 4 commits May 24, 2023 05:07

moving to a new folder

8984030

adding FSDP to dist folder

6178a0c

updates to address comments

506f090

adding utils and configs to make the code modular

c5374dd

clean up

44669ae

msaroufim self-requested a review May 24, 2023 16:10

msaroufim approved these changes May 24, 2023

Merge branch 'pytorch:main' into FSDP_example

f62b4ae

msaroufim merged commit 7b7c708 into pytorch:main May 24, 2023
