🐛 Describe the bug
I am using torch.compile for Llama-2-7B from Hugging Face. On the first run of the program, peak memory usage is higher than on all subsequent fresh runs (by "fresh" I mean running the program again after it has exited). From the second run onwards, the peak memory usage is stable.
I see this behaviour every time, so I wanted to check: (1) Is this expected? I am not entirely sure whether torch.compile reuses a cache across runs. (2) Could this be happening at the Inductor/Dynamo level?
I have some data points:
- For seq len = 2048: first run peak memory = 7.13 GB, second run peak memory = 7.06 GB
- For seq len = 8196: first run peak memory = 16.84 GB, second run peak memory = 16.086 GB
I am training meta-llama/Llama-2-7b-hf on 2 A100 GPUs for a few iterations. If the above behaviour is not expected, I can help with a repro.
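For reference, here is a rough sketch of how I measure peak memory and how I plan to rule out Inductor's on-disk cache as the cause. Everything in it is illustrative: `build_model_and_batch()` is a placeholder for my actual Hugging Face Llama-2-7B plus training-batch setup, and pointing `TORCHINDUCTOR_CACHE_DIR` at a fresh temporary directory is my assumption about how to force a cold cache on every run.

```python
# Illustrative sketch only. Assumptions: `build_model_and_batch()` stands in for my
# actual HF Llama-2-7B + tokenized batch setup, and redirecting
# TORCHINDUCTOR_CACHE_DIR to a fresh temp dir forces a cold Inductor cache.
import os
import tempfile

# Must be set before Inductor first computes its cache directory.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = tempfile.mkdtemp()

import torch

model, batch = build_model_and_batch()   # placeholder: HF Llama-2-7B + inputs with labels
compiled = torch.compile(model)

torch.cuda.reset_peak_memory_stats()
out = compiled(**batch)                  # forward pass (labels included, so .loss is set)
out.loss.backward()                      # backward pass, as in my training loop
torch.cuda.synchronize()

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```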
Error logs
I am looking at the logs generated by Dynamo. For the first run, the logs show the guards being created:
TRACED GRAPH
[__graph_code] ===== __compiled_fn_29 =====
[rank1]:V0422 22:32:34.757000 659971 site-packages/torch/_dynamo/output_graph.py:1340] [11/0] GraphModule(torch.nn.Module):
[__graph_code] def forward(self, L_cos_: "bf16[1, 8192, 128][1048576, 128, 1]cuda:1", L_sin_: "bf16[1, 8192, 128][1048576, 128, 1]cuda:1", L_q_: "bf16[1, 32, 8192, 128][33554432, 128, 4096, 1]cuda:1", L_k_: "bf16[1, 32, 8192, 128][33554432, 128, 4096, 1]cuda:1"):
[__graph_code] l_cos_ = L_cos_
[__graph_code] l_sin_ = L_sin_
[__graph_code] l_q_ = L_q_
[__graph_code] l_k_ = L_k_
For the second fresh run, the logs look slightly different (the stride and device annotations are missing). Could this imply that the guards are being validated rather than created?
TRACED GRAPH
===== pre insert_deferred_runtime_asserts __compiled_fn_29 =====
[__graph_code] <eval_with_key>.10 class GraphModule(torch.nn.Module):
[__graph_code] def forward(self, L_cos_: "bf16[1, 8192, 128]", L_sin_: "bf16[1, 8192, 128]", L_q_: "bf16[1, 32, 8192, 128]", L_k_: "bf16[1, 32, 8192, 128]"):
[__graph_code] l_cos_ = L_cos_
[__graph_code] l_sin_ = L_sin_
[__graph_code] l_q_ = L_q_
[__graph_code] l_k_ = L_k_
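For completeness, this is roughly how I enable these traces; `graph_code`, `guards`, and `recompiles` are standard TORCH_LOGS artifacts, though I am only assuming they cover everything relevant to guard creation vs. validation:

```python
# One way to enable the logs shown above (sketch; the same artifacts can also be
# selected via the TORCH_LOGS environment variable, e.g.
#   TORCH_LOGS="graph_code,guards,recompiles" python train.py
import torch

torch._logging.set_logs(graph_code=True, guards=True, recompiles=True)
```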
Versions
PyTorch version: 2.5.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Red Hat Enterprise Linux release 8.8 (Ootpa) (x86_64)
GCC version: (Spack GCC) 11.4.0
Clang version: Could not collect
CMake version: version 3.20.2
Libc version: glibc-2.28
Python version: 3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-477.86.1.el8_8.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 12.3.52
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40
Nvidia driver version: 550.144.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
CPU MHz: 2445.490
BogoMIPS: 4890.98
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
NUMA node2 CPU(s): 32-47
NUMA node3 CPU(s): 48-63