Methods and tools for efficient training on a single GPU
This guide demonstrates practical techniques you can use to increase the efficiency of your model's training by optimizing memory utilization, speeding up the training, or both. If you'd like to understand how the GPU is utilized during training, refer to the Model training anatomy conceptual guide first. This guide focuses on practical techniques.
If you have access to a machine with multiple GPUs, these approaches are still valid, and you can additionally leverage the methods outlined in the multi-GPU section.
When training large models, there are two aspects that should be considered at the same time:
- Data throughput/training time
- Model performance
Maximizing the throughput (samples/second) leads to lower training cost. This is generally achieved by utilizing the GPU as effectively as possible, filling its memory to the limit. If the desired batch size exceeds the limits of GPU memory, memory optimization techniques such as gradient accumulation can help.
However, if your preferred batch size fits into memory, there is no reason to apply memory-optimizing techniques, as they can slow down training. Just because you can use a large batch size does not necessarily mean you should. As part of hyperparameter tuning, determine which batch size yields the best results, then optimize resources accordingly.
The methods and tools covered in this guide can be classified by the effect they have on the training process:
Method/tool | Improves training speed | Optimizes memory utilization
---|---|---
Batch size choice | Yes | Yes |
Gradient accumulation | No | Yes |
Gradient checkpointing | No | Yes |
Mixed precision training | Yes | (No) |
Optimizer choice | Yes | Yes |
Data preloading | Yes | No |
DeepSpeed Zero | No | Yes |
torch.compile | Yes | No |
Note: when using mixed precision with a small model and a large batch size, there will be some memory saved, but with a large model and a small batch size, the memory use will be larger.
You can use these techniques not only when training a model with the Trainer, but also when writing a pure PyTorch loop, in which case you can configure these optimizations with 🀗 Accelerate.
If these methods do not result in sufficient gains, you can explore the following options:
- Look into building your own custom Docker container with efficient software prebuilds
- Consider a model that uses Mixture of Experts (MoE)
- Convert your model to BetterTransformer to leverage PyTorch native attention
Finally, if all of the above is still not enough, even after switching to a server-grade GPU such as an A100, consider moving to a multi-GPU setup. These approaches remain valid in a multi-GPU setup, where you can additionally leverage the parallelism techniques outlined in the multi-GPU section.
Batch size choice
Start by identifying the appropriate batch size to achieve optimal performance. It is recommended to use batch sizes and input/output neuron counts of size 2^N. Often this is a multiple of 8, but it can be higher depending on the hardware being used and the model's dtype.
For reference, check out NVIDIA's recommendations for input/output neuron counts and batch size for fully connected layers (which are involved in GEMMs, General Matrix Multiplications).
Tensor Core Requirements define the multiplier based on the dtype and the hardware. For instance, for the fp16 data type a multiple of 8 is recommended, unless you are on an A100 GPU, in which case use multiples of 64.
For small parameters, also consider Dimension Quantization Effects; this is where tiling happens, and the right multiplier can yield a significant speedup.
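For example, using the same default_args convention as the snippets throughout this guide, a power-of-2 batch size keeps GEMM dimensions aligned with the Tensor Core tile sizes described above (a sketch, not a universal recommendation):

# 32 is a power of 2 and a multiple of 8; on an A100 with fp16, prefer multiples of 64
training_args = TrainingArguments(per_device_train_batch_size=32, **default_args)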
Gradient Accumulation
The gradient accumulation method aims to compute gradients in smaller increments instead of calculating them for the entire batch at once. In this approach, gradients are calculated iteratively on smaller batches by performing forward and backward passes through the model, accumulating the gradients along the way. Once a sufficient number of gradients has been accumulated, the model's optimization step is executed. Gradient accumulation thus makes it possible to increase the effective batch size beyond the limit imposed by the GPU's memory capacity. Note, however, that the additional forward and backward passes introduced by gradient accumulation can slow down the training process.
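The core idea can be sketched in a few lines of plain PyTorch (a minimal illustration; model, optimizer, and dataloader are assumed to already exist):

accumulation_steps = 4

for step, batch in enumerate(dataloader, start=1):
    loss = model(**batch).loss
    # scale the loss so the accumulated gradient matches that of one large batch
    (loss / accumulation_steps).backward()
    if step % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()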
You can enable gradient accumulation in the Trainer by adding the gradient_accumulation_steps argument to TrainingArguments:
training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args)
In the above example, the effective batch size becomes 4.
Alternatively, you can use 🀗 Accelerate to gain full control over the training loop. The 🀗 Accelerate example can be found further down in this guide.
While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can result in a more pronounced training slowdown. Consider the following example: suppose per_device_train_batch_size=4 already hits the GPU's limit without gradient accumulation. If you would like to train with batches of size 64, do not set per_device_train_batch_size to 1 and gradient_accumulation_steps to 64. Instead, keep per_device_train_batch_size=4 and set gradient_accumulation_steps=16. This results in the same effective batch size while making better use of the available GPU resources.
For additional information, refer to the batch size and gradient accumulation benchmarks for RTX-3090 and A100.
Gradient Checkpointing
Some large models may face memory issues even when the batch size is set to 1 and gradient accumulation is used. This is because there are other components that also require memory storage.
Saving all activations from the forward pass in order to compute the gradients during the backward pass can result in significant memory overhead. The alternative approach of discarding the activations and recomputing them when needed during the backward pass would introduce considerable computational overhead and slow down the training process.
Gradient checkpointing offers a compromise between these two approaches: it saves strategically selected activations throughout the computational graph, so only a fraction of the activations needs to be recomputed for the gradients. For an in-depth explanation of gradient checkpointing, refer to this great article.
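The same trade-off can be illustrated in plain PyTorch with torch.utils.checkpoint, which discards the wrapped block's intermediate activations and recomputes them during the backward pass (a minimal sketch; the two small blocks here are hypothetical stand-ins for transformer layers):

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
block2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())

x = torch.randn(8, 512, requires_grad=True)
# block1's internal activations are not stored; they are recomputed in backward
hidden = checkpoint(block1, x, use_reentrant=False)
out = block2(hidden)
out.sum().backward()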
To enable gradient checkpointing in the Trainer, pass the corresponding flag to TrainingArguments:
training_args = TrainingArguments(
    per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, **default_args
)
Alternatively, use 🀗 Accelerate - the 🀗 Accelerate example can be found further down in this guide.
While gradient checkpointing may improve memory efficiency, note that it slows down training by approximately 20%.
Mixed precision training
Mixed precision training is a technique that optimizes the computational efficiency of model training by utilizing lower-precision numerical formats for certain variables. Traditionally, most models use 32-bit floating point precision (fp32 or float32) to represent and process variables. However, not all variables require this level of precision to achieve accurate results. Reducing the precision of certain variables to a lower numerical format, such as 16-bit floating point (fp16 or float16), can speed up the computations. Because in this approach some computations are performed in half precision while some are still in full precision, it is called mixed precision training.
Most commonly, mixed precision training is achieved by using the fp16 (float16) data type; however, some GPU architectures (such as the Ampere architecture) also offer the bf16 and tf32 (a CUDA-internal data type) data types. Check out the NVIDIA Blog to learn more about the differences between these data types.
fp16
The main advantage of mixed precision training comes from saving the activations in half precision (fp16). Although the gradients are also computed in half precision, they are converted back to full precision for the optimization step, so no memory is saved there. While mixed precision training results in faster computations, it can also lead to more GPU memory being used, especially for small batch sizes. This is because the model is present on the GPU in both 16-bit and 32-bit precision (1.5x the original model on the GPU).
To enable mixed precision training, set the fp16 flag to True:
training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)
If you prefer to use 🀗 Accelerate, the 🀗 Accelerate example can be found further down in this guide.
BF16
If you have access to Ampere or newer hardware, you can use bf16 for mixed precision training and evaluation. While bf16 offers worse precision than fp16, it has a much larger dynamic range. In fp16, the largest representable number is 65504, and any number beyond that results in an overflow. A bf16 number can be as large as 3.39e+38, which is about the same as fp32, because both use 8 bits for the numerical range.
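You can verify these ranges directly in PyTorch:

import torch

print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # 3.3895313892515355e+38
print(torch.finfo(torch.float32).max)   # 3.4028234663852886e+38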
You can enable BF16 in the 🀗 Trainer as follows:
training_args = TrainingArguments(bf16=True, **default_args)
TF32
Ampere hardware uses a magical data type called tf32. It has the same numerical range as fp32 (8 bits), but instead of 23 bits of precision it has only 10 bits (the same as fp16), using only 19 bits in total. It is "magical" in the sense that you can use your normal fp32 training and/or inference code and, by enabling tf32 support, get up to a 3x throughput improvement. All you need to do is add the following code:
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
CUDA will automatically switch to using tf32 instead of fp32 where possible, assuming the GPU in use is from the Ampere series.
According to NVIDIA research, the majority of machine learning training workloads show the same perplexity and convergence with tf32 training as with fp32. If you are already using fp16 or bf16 mixed precision, it may help with throughput as well.
You can enable this mode in the 🀗 Trainer:
TrainingArguments(tf32=True, **default_args)
tf32 cannot be accessed directly via tensor.to(dtype=torch.tf32), because it is an internal CUDA data type. You need torch>=1.7 to use tf32 data types.
For additional information about tf32 and other precisions, refer to the following benchmarks: RTX-3090 and A100.
Flash Attention 2
You can speed up training throughput by using the Flash Attention 2 integration in transformers. Check out the appropriate section of the single GPU section to learn more about how to load a model with Flash Attention 2 modules.
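As a quick sketch, in recent versions of transformers loading a model with Flash Attention 2 typically looks like the following (the checkpoint name is just an example; it assumes a supported GPU and the flash-attn package installed):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",  # example checkpoint; any model with Flash Attention 2 support works
    torch_dtype=torch.bfloat16,  # Flash Attention 2 requires fp16 or bf16
    attn_implementation="flash_attention_2",
)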
Optimizer choice
The most common optimizer used to train transformer models is Adam or AdamW (Adam with weight decay). Adam achieves good convergence by storing rolling averages of the previous gradients; however, this adds an additional memory footprint on the order of the number of model parameters. To remedy this, you can use an alternative optimizer. For example, if you have NVIDIA/apex installed, adamw_apex_fused will give you the fastest training experience among all supported AdamW optimizers.

The Trainer integrates a variety of optimizers that can be used out of the box: adamw_hf, adamw_torch, adamw_torch_fused, adamw_apex_fused, adamw_anyprecision, adafactor, or adamw_bnb_8bit. More optimizers can be plugged in via a third-party implementation.
Let's take a closer look at two alternatives to the AdamW optimizer:
- adafactor, available in the Trainer
- adamw_bnb_8bit, also available in the Trainer; a third-party integration is provided below for demonstration
For comparison, for a 3B-parameter model such as "google-t5/t5-3b":
- A standard AdamW optimizer needs 24GB of GPU memory because it uses 8 bytes for each parameter (8 * 3 => 24GB).
- The Adafactor optimizer needs more than 12GB. It uses slightly more than 4 bytes for each parameter, so 4 * 3 and then some extra.
- The 8-bit BNB quantized optimizer uses only 6GB (2 * 3) if all optimizer states are quantized.
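The arithmetic behind these estimates is easy to check (a rough back-of-the-envelope sketch; real usage varies with implementation details):

num_params = 3e9  # a 3B-parameter model

# approximate bytes of optimizer state per parameter
state_bytes = {
    "AdamW": 8,       # two fp32 moments, 4 bytes each
    "Adafactor": 4,   # slightly more in practice, due to factored second moments
    "8-bit Adam": 2,  # two quantized int8 moments, 1 byte each
}

for name, b in state_bytes.items():
    print(f"{name}: ~{num_params * b / 1e9:.0f} GB of optimizer state")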
Adafactor
Adafactor doesn't store rolling averages for each element in the weight matrices. Instead, it keeps aggregated information (row- and column-wise sums of the rolling averages), significantly reducing its footprint. However, compared to Adam, Adafactor may converge more slowly in certain cases. You can switch to Adafactor by setting optim="adafactor" in TrainingArguments:
training_args = TrainingArguments(per_device_train_batch_size=4, optim="adafactor", **default_args)
Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training), you can notice up to a 3x improvement while maintaining throughput! However, as mentioned before, Adafactor's convergence can be worse than Adam's.
8-bit Adam
Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means storing the state in lower precision and dequantizing it only for the optimization step. This is similar to the idea behind mixed precision training.
To use adamw_bnb_8bit, simply set optim="adamw_bnb_8bit" in TrainingArguments:
training_args = TrainingArguments(per_device_train_batch_size=4, optim="adamw_bnb_8bit", **default_args)
However, for demonstration purposes, we can also use a third-party implementation of the 8-bit optimizer to see how it can be integrated.
First, follow the installation guide in the GitHub repo to install the bitsandbytes library that implements the 8-bit Adam optimizer.
Next, you need to initialize the optimizer. This involves two steps:
- First, group the model's parameters into two groups: one where weight decay should be applied and one where it should not. Usually, biases and layer norm parameters are not weight decayed.
- Then do some argument housekeeping to use the same parameters as the previously used AdamW optimizer.
import bitsandbytes as bnb
from torch import nn
from transformers.trainer_pt_utils import get_parameter_names

training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)

# parameters that should be weight-decayed: everything except biases and norm layers
decay_parameters = get_parameter_names(model, [nn.LayerNorm], ["bias", "layernorm", "rmsnorm"])
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if n in decay_parameters],
        "weight_decay": training_args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if n not in decay_parameters],
        "weight_decay": 0.0,
    },
]

# reuse the same hyperparameters as the default AdamW optimizer
optimizer_kwargs = {
    "betas": (training_args.adam_beta1, training_args.adam_beta2),
    "eps": training_args.adam_epsilon,
    "lr": training_args.learning_rate,
}

adam_bnb_optim = bnb.optim.Adam8bit(optimizer_grouped_parameters, **optimizer_kwargs)
Finally, pass the custom optimizer as an argument to the Trainer:
trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None))
Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training), you can expect about a 3x memory improvement, on par with using Adafactor, with even slightly higher throughput.
multi_tensor
pytorch-nightly introduced torch.optim._multi_tensor, which should significantly speed up optimizers in situations with lots of small feature tensors. It should eventually become the default, but if you want to experiment with it sooner, take a look at this GitHub issue.
Data preloading
One of the important requirements for reaching great training speed is the ability to feed the GPU at the maximum speed it can handle. By default, everything happens in the main process, and it might not be able to read the data from disk fast enough, creating a bottleneck that leads to GPU under-utilization. Configure the following arguments to reduce the bottleneck:
- DataLoader(pin_memory=True, ...) - preloads the data into pinned memory on the CPU, which typically leads to much faster transfers from CPU to GPU memory.
- DataLoader(num_workers=4, ...) - spawns several workers to preload the data faster. During training, watch the GPU utilization stats; if it is far from 100%, experiment with increasing the number of workers. Of course, the problem could be elsewhere, so more workers won't necessarily lead to better performance.
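In a pure PyTorch loop, these settings look like this (a minimal sketch; ds is your dataset, as in the other examples in this guide):

from torch.utils.data import DataLoader

dataloader = DataLoader(
    ds,
    batch_size=4,
    pin_memory=True,  # page-locked CPU memory for faster host-to-GPU copies
    num_workers=4,    # subprocesses that prefetch batches in parallel
)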
When using the Trainer, the corresponding TrainingArguments are dataloader_pin_memory (True by default) and dataloader_num_workers (defaults to 0).
DeepSpeed ZeRO
DeepSpeed is an open-source deep learning optimization library that is integrated with 🀗 Transformers and 🀗 Accelerate. It provides a wide range of features and optimizations designed to improve the efficiency and scalability of large-scale deep learning training.
If your model fits onto a single GPU and you have enough space for a small batch size, you don't need to use DeepSpeed, as it will only slow things down. However, if the model doesn't fit onto a single GPU, or you can't fit a small batch, you can leverage DeepSpeed ZeRO + CPU Offload or NVMe Offload. In this case, you need to install the library separately, then follow one of the guides to create a configuration file and launch DeepSpeed:
- For an in-depth guide on the DeepSpeed integration with the Trainer, review the corresponding documentation, especially the section on deployment with a single GPU. Some adjustments are required to use DeepSpeed in a notebook, so take a look at the corresponding guide as well.
- If you prefer to use 🀗 Accelerate, refer to the 🀗 Accelerate DeepSpeed guide.
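With the Trainer, hooking up ZeRO then comes down to pointing TrainingArguments at the DeepSpeed configuration file you created (a sketch; "ds_config.json" is a placeholder path for your own config):

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    deepspeed="ds_config.json",  # path to your DeepSpeed ZeRO configuration file
    **default_args,
)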
Using torch.compile
PyTorch 2.0 introduced a new compile function that doesn't require any modification to your existing PyTorch code; it can optimize your code by adding a single line: model = torch.compile(model).
If you are using the Trainer, you only need to pass the torch_compile option in TrainingArguments:
training_args = TrainingArguments(torch_compile=True, **default_args)
torch.compile uses Python's frame evaluation API to automatically create a graph from existing PyTorch programs. After capturing the graph, different backends can be deployed to lower the graph to an optimized engine. You can find more details and benchmarks in the PyTorch documentation.
torch.compile has a growing list of backends, each with its optional dependencies, which can be found by calling torchdynamo.list_backends(). Some of the most commonly used backends are:
Debugging backends:
- dynamo.optimize("eager") - uses PyTorch to run the extracted GraphModule. This is quite useful for debugging TorchDynamo issues.
- dynamo.optimize("aot_eager") - uses AotAutograd with no compiler, i.e., just PyTorch eager for AotAutograd's extracted forward and backward graphs. This is useful for debugging; do not expect speedups.
Training & inference backends:
- dynamo.optimize("inductor") - uses the TorchInductor backend with AotAutograd and cudagraphs by leveraging codegened Triton kernels. Read more
- dynamo.optimize("nvfuser") - nvFuser with TorchScript. Read more
- dynamo.optimize("aot_nvfuser") - nvFuser with AotAutograd. Read more
- dynamo.optimize("aot_cudagraphs") - cudagraphs with AotAutograd. Read more
Inference-only backends:
- dynamo.optimize("ofi") - uses TorchScript's optimize_for_inference. Read more
- dynamo.optimize("fx2trt") - uses Nvidia TensorRT for inference optimizations. Read more
- dynamo.optimize("onnxrt") - uses ONNX Runtime for inference on CPU/GPU. Read more
- dynamo.optimize("ipex") - uses IPEX for inference on CPU. Read more
For an example of using torch.compile with 🀗 Transformers, check out this blog post.
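Outside of the Trainer, the one-line usage looks like this (a minimal sketch; the checkpoint is just an example, and the first forward pass is slow because that is when compilation happens):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model = torch.compile(model)  # compilation is lazy: triggered by the first forward pass

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)  # first call compiles; subsequent calls run the optimized graph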
Using 🀗 Accelerate
With 🀗 Accelerate you can use the above methods while gaining full control over the training loop, essentially writing the loop in pure PyTorch with some minor modifications.
Suppose you have combined the methods in TrainingArguments like so:
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    fp16=True,
    **default_args,
)
The full example training loop with 🀗 Accelerate is only a handful of lines of code long:
from accelerate import Accelerator
from torch.utils.data.dataloader import DataLoader

dataloader = DataLoader(ds, batch_size=training_args.per_device_train_batch_size)

if training_args.gradient_checkpointing:
    model.gradient_checkpointing_enable()

# recent versions of Accelerate take `mixed_precision` instead of the removed `fp16` kwarg
accelerator = Accelerator(mixed_precision="fp16" if training_args.fp16 else "no")
model, optimizer, dataloader = accelerator.prepare(model, adam_bnb_optim, dataloader)

model.train()
for step, batch in enumerate(dataloader, start=1):
    loss = model(**batch).loss
    loss = loss / training_args.gradient_accumulation_steps
    accelerator.backward(loss)
    if step % training_args.gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
First, we wrap the dataset in a DataLoader. Then we enable gradient checkpointing by calling the model's gradient_checkpointing_enable() method. When we initialize the Accelerator, we specify whether we want to use mixed precision training, and it takes care of it for us in the prepare call. During the prepare call, the dataloader will also be distributed across workers should we use multiple GPUs. We use the same 8-bit optimizer from the earlier example.

Finally, we add the main training loop. Note that the backward call is handled by 🀗 Accelerate. We can also see how gradient accumulation works here: we normalize the loss, so we get the average at the end of the accumulation, and once we have enough steps, we run the optimization.
Implementing these optimization techniques with 🀗 Accelerate takes only a handful of lines of code and comes with the benefit of more flexibility in the training loop. For full documentation of all features, have a look at the Accelerate documentation.
Efficient Software Prebuilds
PyTorch's pip and conda builds come prebuilt with the CUDA toolkit, which is enough to run PyTorch, but it is insufficient if you need to build CUDA extensions.
At times, additional effort may be required to prebuild some components, for instance, if you are using libraries like apex that don't come precompiled. In other situations, figuring out how to install the right CUDA toolkit system-wide can be complicated.
To address these scenarios, PyTorch and NVIDIA released a new version of the NGC docker container which already comes with everything prebuilt. You just need to install your programs on it, and it will run out of the box.
This approach is also useful if you want to tweak the PyTorch source and/or make a new customized build. To find the docker image version you want, start with the PyTorch release notes and choose one of the latest monthly releases. Go into the release notes for the desired release, check that the environment's components match your needs (including the NVIDIA Driver requirements!), then, at the very top of that document, go to the corresponding NGC page. If for some reason you get lost, here is the index of all PyTorch NGC images.
Next, follow the instructions to download and deploy the docker image.
Mixture of Experts
Some recent papers report a 4-5x training speedup and faster inference from integrating Mixture of Experts (MoE) into Transformer models.
Since it has been discovered that more parameters lead to better performance, this technique makes it possible to increase the number of parameters by an order of magnitude without increasing training costs.
In this approach, every other FFN layer is replaced with a MoE layer consisting of many experts, with a gated function that trains each expert in a balanced way depending on the input token's position in the sequence.
(source: GLAM)
The main drawback of this approach is that it requires staggering amounts of GPU memory - almost an order of magnitude more than its dense equivalent. Various distillation and other approaches have been proposed to overcome the much higher memory requirements.
There is a direct trade-off, though: instead of dozens or hundreds of experts, you can use just a few experts with a 2-3x smaller base model, resulting in a 5x smaller model, which moderately increases the training speed while moderately increasing the memory requirements as well.
Most related papers and implementations are built around Tensorflow/TPUs:
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- GLaM: Generalist Language Model (GLaM)
For PyTorch, DeepSpeed has built one as well: DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, and the Mixture of Experts blog posts: 1, 2. For a concrete deployment with large Transformer-based natural language generation models, see the blog post and the Megatron-Deepspeed branch.
Using PyTorch native attention and Flash Attention
PyTorch 2.0 released the native torch.nn.functional.scaled_dot_product_attention (SDPA), which enables the use of fused GPU kernels such as memory-efficient attention and flash attention.
After installing the optimum package, the relevant internal modules can be replaced to use PyTorch's native attention as follows:
model = model.to_bettertransformer()
Once converted, train the model as usual.
The PyTorch-native scaled_dot_product_attention operator can only dispatch to Flash Attention if no attention_mask is provided.
By default, in training mode, the BetterTransformer integration drops mask support and can only be used for training that does not require a padding mask for batched training. This is the case, for example, for masked language modeling or causal language modeling. BetterTransformer is not suited for fine-tuning models on tasks that require a padding mask.
Check out this blog post to learn more about acceleration and memory savings with SDPA.
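For reference, the operator can also be called directly. This minimal sketch (with arbitrary tensor shapes, assuming a CUDA device) shows a call with no attention mask, which leaves SDPA free to dispatch to the fused kernels:

import torch
import torch.nn.functional as F

# batch of 2, 8 heads, sequence length 128, head dimension 64
query = key = value = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# no attn_mask, so SDPA may dispatch to Flash Attention
out = F.scaled_dot_product_attention(query, key, value, attn_mask=None, is_causal=True)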