🐛 Describe the bug
I am using torch.compile for Llama-2-7B from Hugging Face. On the first run of the program, peak memory usage is higher than on all subsequent fresh runs (by "fresh" I mean running the program again after it has exited). From the second run onwards, the peak memory usage is stable.
I see this behaviour every time, so I wanted to check: (1) Is this expected? I am not entirely sure whether torch.compile reuses a cache across runs. (2) Could this be happening at the Inductor/Dynamo level?
I have some data points:
- For seq len = 2048: first run peak memory = 7.13 GB, second run peak memory = 7.06 GB
- For seq len = 8196: first run peak memory = 16.84 GB, second run peak memory = 16.086 GB
I am training meta-llama/Llama-2-7b-hf on 2 A100 GPUs for a few iterations. If the above behaviour is not expected, I can help with a repro.
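For reference, here is a rough sketch of how I measure peak memory and how I plan to rule out Inductor's on-disk cache as the cause. Everything in it is illustrative: `build_model_and_batch()` is a placeholder for my actual Hugging Face Llama-2-7B plus training-batch setup, and pointing `TORCHINDUCTOR_CACHE_DIR` at a fresh temporary directory is my assumption about how to force a cold cache on every run.

```python
# Illustrative sketch only. Assumptions: `build_model_and_batch()` stands in for my
# actual HF Llama-2-7B + tokenized batch setup, and redirecting
# TORCHINDUCTOR_CACHE_DIR to a fresh temp dir forces a cold Inductor cache.
import os
import tempfile

# Must be set before Inductor first computes its cache directory.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = tempfile.mkdtemp()

import torch

model, batch = build_model_and_batch()   # placeholder: HF Llama-2-7B + inputs with labels
compiled = torch.compile(model)

torch.cuda.reset_peak_memory_stats()
out = compiled(**batch)                  # forward pass (labels included, so .loss is set)
out.loss.backward()                      # backward pass, as in my training loop
torch.cuda.synchronize()

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```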
Error logs
I am looking at the logs generated by Dynamo. For the first run, the logs show the guards being created:
TRACED GRAPH
[__graph_code] ===== __compiled_fn_29 =====
[rank1]:V0422 22:32:34.757000 659971 site-packages/torch/_dynamo/output_graph.py:1340] [11/0] GraphModule(torch.nn.Module):
[__graph_code] def forward(self, L_cos_: "bf16[1, 8192, 128][1048576, 128, 1]cuda:1", L_sin_: "bf16[1, 8192, 128][1048576, 128, 1]cuda:1", L_q_: "bf16[1, 32, 8192, 128][33554432, 128, 4096, 1]cuda:1", L_k_: "bf16[1, 32, 8192, 128][33554432, 128, 4096, 1]cuda:1"):
[__graph_code] l_cos_ = L_cos_
[__graph_code] l_sin_ = L_sin_
[__graph_code] l_q_ = L_q_
[__graph_code] l_k_ = L_k_
For the second fresh run, the logs look slightly different (the stride and device annotations are missing). Could this imply that the guards are being validated rather than created?
TRACED GRAPH
===== pre insert_deferred_runtime_asserts __compiled_fn_29 =====
[__graph_code] <eval_with_key>.10 class GraphModule(torch.nn.Module):
[__graph_code] def forward(self, L_cos_: "bf16[1, 8192, 128]", L_sin_: "bf16[1, 8192, 128]", L_q_: "bf16[1, 32, 8192, 128]", L_k_: "bf16[1, 32, 8192, 128]"):
[__graph_code] l_cos_ = L_cos_
[__graph_code] l_sin_ = L_sin_
[__graph_code] l_q_ = L_q_
[__graph_code] l_k_ = L_k_
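For completeness, this is roughly how I enable these traces; `graph_code`, `guards`, and `recompiles` are standard TORCH_LOGS artifacts, though I am only assuming they cover everything relevant to guard creation vs. validation:

```python
# One way to enable the logs shown above (sketch; the same artifacts can also be
# selected via the TORCH_LOGS environment variable, e.g.
#   TORCH_LOGS="graph_code,guards,recompiles" python train.py
import torch

torch._logging.set_logs(graph_code=True, guards=True, recompiles=True)
```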
Versions
PyTorch version: 2.5.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Red Hat Enterprise Linux release 8.8 (Ootpa) (x86_64)
GCC version: (Spack GCC) 11.4.0
Clang version: Could not collect
CMake version: version 3.20.2
Libc version: glibc-2.28
Python version: 3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-477.86.1.el8_8.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 12.3.52
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40
Nvidia driver version: 550.144.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
CPU MHz: 2445.490
BogoMIPS: 4890.98
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
NUMA node2 CPU(s): 32-47
NUMA node3 CPU(s): 48-63