[inductor] add lookup table recorder #158987


Open · wants to merge 10 commits into base: gh/coconutruben/25/base

Conversation

@coconutruben (Contributor) commented Jul 23, 2025

Stack from ghstack (oldest at bottom):

Why

Make it easier for users to generate lookup tables and use them.

What

  • infrastructure to record lookup tables from autotuning (a rough sketch of the idea follows below)
  • a sample implementation that records to a directory
  • a sample implementation that emits to log.debug (key/value)
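
To make the bullets above concrete, here is a rough sketch of the recording concept. This is not the API this PR adds; `Recorder`, `DirectoryRecorder`, and `record` are assumed names for illustration. Only `dump()` (from `torch._inductor.lookup_table_recorder`), the `TORCH_INDUCTOR_LOOKUP_TABLE_RECORD_DIR` environment variable, and the JSON format shown under Testing are taken from the PR itself.

```
# Hypothetical sketch of the recorder concept. Class and method names here
# (Recorder, DirectoryRecorder, record) are illustrative assumptions; only
# dump() and TORCH_INDUCTOR_LOOKUP_TABLE_RECORD_DIR appear in this PR's
# testing below.
import json
import os
import time
from typing import Any


class Recorder:
    """Accumulates autotuning choices keyed by (device, op, input signature)."""

    def __init__(self) -> None:
        self._table: dict[str, list[dict[str, Any]]] = {}

    def record(self, key: str, entry: dict[str, Any]) -> None:
        # Each entry is a template config dict like the ones shown under Testing.
        self._table.setdefault(key, []).append(entry)


class DirectoryRecorder(Recorder):
    """Writes the accumulated table to a timestamped JSON file in a directory."""

    def __init__(self, directory: str) -> None:
        super().__init__()
        self.directory = directory

    def dump(self) -> str:
        ts = time.strftime("%Y%m%d_%H%M%S")
        path = os.path.join(self.directory, f"inductor_lut_{ts}.json")
        with open(path, "w") as f:
            json.dump(self._table, f, indent=2)
        return path
```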

Caveats

  • Right now this records mm templates and everything that Inductor
    considers. Some architectural changes are still needed before it can
    record e.g. only a top-k after autotuning. Once that is ready, this is
    modular enough to adjust to recording only the top-k. There is still
    value now in being able to record a simple table, see the format, and
    manually edit it down to the top-k entries using the autotuning logs;
    a sketch of that trimming step follows.
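
Until top-k recording lands, the table can be trimmed by hand. A minimal sketch of that step, assuming the JSON format shown under Testing (each key maps to a list of config dicts); `trim_table` and `keep` are hypothetical names, and which entries count as the top-k has to be read from the autotuning logs, since the order of entries in the dump implies no ranking:

```
# Minimal sketch of manually trimming a dumped table to top-k entries.
# trim_table and its keep argument are hypothetical helpers; the choice
# of which indices are "top-k" must come from the autotuning logs.
import json


def trim_table(in_path: str, out_path: str, keep: dict[str, list[int]]) -> None:
    """keep maps a table key to the indices of the entries to retain."""
    with open(in_path) as f:
        table = json.load(f)
    trimmed = {
        key: [entries[i] for i in keep[key]]
        for key, entries in table.items()
        if key in keep
    }
    with open(out_path, "w") as f:
        json.dump(trimmed, f, indent=2)
```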

Testing

Using the script below:

```
#!/bin/bash

# Create a temporary directory for the lookup table dumps
TEMP_DIR=$(mktemp -d)
echo "Created temporary directory for lookup table dumps: $TEMP_DIR"

# Set environment variables to enable verbose output and recording
export TORCH_LOGS="+inductor"
export TORCH_INDUCTOR_LOOKUP_TABLE_RECORD_DIR=$TEMP_DIR
export PYTORCH_DEBUG=1

# Run the Python script
python3 -c "
import os
import torch
import logging
from torch._inductor import config as inductor_config
from torch._inductor.lookup_table_recorder import dump

# Configure logging to see emit messages
logging.basicConfig(level=logging.DEBUG)

# Enable TMA for matmul
inductor_config.triton.enable_persistent_tma_matmul = True

# Create large tensors with bfloat16 dtype
print('Creating 1024x1024 bfloat16 tensors for matrix multiplication...')
a = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16)
b = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16)

# Compile and run the matrix multiplication
print('Compiling and running torch.mm with TMA...')
compiled_mm = torch.compile(torch.mm, mode='max-autotune')
result = compiled_mm(a, b)

# Force synchronization to ensure compilation is complete
torch.cuda.synchronize()

# Dump the lookup table
print('Dumping lookup table...')
dump()

print('\\nMatrix multiplication completed successfully!')
" 2>&1 | tee /tmp/recorder_output.log

# Check if emit logic works by grepping for LookupTable entries
echo -e "\n\n=== CHECKING EMIT FUNCTIONALITY ==="
if grep -q "LookupTable:" /tmp/recorder_output.log; then
    echo "✅ Emit functionality is working! Found LookupTable entries in the log."
else
    echo "❌ Emit functionality not detected. No LookupTable entries found in the log."
fi

# Display the dumped lookup table
echo -e "\n\n=== DUMPED LOOKUP TABLE CONTENTS ==="
LATEST_JSON=$(ls -t "$TEMP_DIR"/inductor_lut_*.json | head -1)
if [ -f "$LATEST_JSON" ]; then
    echo "Found lookup table file: $LATEST_JSON"
    echo "File size: $(du -h "$LATEST_JSON" | cut -f1)"
    echo -e "\nFirst 20 lines of the lookup table:"
    head -n 20 "$LATEST_JSON"

    # Check for TMA entries
    echo -e "\n=== CHECKING FOR TMA ENTRIES ==="
    if grep -q "tma\|TMA_SIZE\|NUM_SMS" "$LATEST_JSON"; then
        echo "✅ TMA entries found in the lookup table!"
        echo -e "\nSample TMA entry:"
        grep -m 1 -A 10 -B 2 "tma\|TMA_SIZE\|NUM_SMS" "$LATEST_JSON"
    else
        echo "❌ No TMA entries found in the lookup table."
    fi
else
    echo "❌ No lookup table JSON file found in $TEMP_DIR"
fi

echo -e "\n\nLookup table files are available in: $TEMP_DIR"
echo "Log file is available at: /tmp/recorder_output.log"
```

```
=== CHECKING EMIT FUNCTIONALITY ===
✅ Emit functionality is working! Found LookupTable entries in the log.

=== DUMPED LOOKUP TABLE CONTENTS ===
Found lookup table file: /tmp/tmp.L9pydR3sH4/inductor_lut_20250723_221836_641.json
File size: 12K

First 20 lines of the lookup table:
{
  "NVIDIA H100+mm+((torch.bfloat16, [1024, 1024], [1024, 1]), (torch.bfloat16, [1024, 1024], [1024, 1]))": [
    {
      "template_id": "mm",
      "EVEN_K": true,
      "ALLOW_TF32": false,
      "USE_FAST_ACCUM": false,
      "ACC_TYPE": "tl.float32",
      "num_stages": 1,
      "num_warps": 2,
      "BLOCK_M": 32,
      "BLOCK_N": 32,
      "BLOCK_K": 16,
      "hint_override": null,
      "GROUP_M": 8,
      "template_hash": "0717af5834e39dcca7ea817f896b8d85b4886422da7a3ab5f6911b4cfe568896"
    },
    {
      "template_id": "mm",
      "EVEN_K": true,

=== CHECKING FOR TMA ENTRIES ===
✅ TMA entries found in the lookup table!

Sample TMA entry:
    },
    {
      "template_id": "mm_persistent_tma",
      "EVEN_K": true,
      "ALLOW_TF32": false,
      "USE_FAST_ACCUM": false,
      "ACC_TYPE": "tl.float32",
      "num_stages": 3,
      "num_warps": 8,
      "BLOCK_M": 128,
      "BLOCK_N": 256,
      "BLOCK_K": 64,
      "hint_override": null,

Lookup table files are available in: /tmp/tmp.L9pydR3sH4
Log file is available at: /tmp/recorder_output.log
```
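
As an alternative to the grep checks in the script above, the dump can also be inspected structurally. A small sketch that assumes only the `inductor_lut_*.json` file naming and the `template_id` field visible in the output above; a nonzero `mm_persistent_tma` count is the same TMA check done on parsed JSON:

```
# Structural check on the most recent dump: count entries per template_id.
# Assumes only the file naming and JSON shape shown in the output above.
import collections
import glob
import json
import os
import sys

record_dir = os.environ["TORCH_INDUCTOR_LOOKUP_TABLE_RECORD_DIR"]
files = glob.glob(os.path.join(record_dir, "inductor_lut_*.json"))
if not files:
    sys.exit(f"no lookup table dumps found in {record_dir}")
latest = max(files, key=os.path.getmtime)  # most recently written dump

with open(latest) as f:
    table = json.load(f)

for key, entries in table.items():
    counts = collections.Counter(e.get("template_id", "<unknown>") for e in entries)
    print(key)
    for template_id, n in sorted(counts.items()):
        print(f"  {template_id}: {n}")
```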

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

Differential Revision: D78852354


pytorch-bot bot commented Jul 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158987

Note: Links to docs will display an error until the docs builds have been completed.

❌ 9 New Failures

As of commit 763b683 with merge base ecea811:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

coconutruben added a commit that referenced this pull request Jul 23, 2025
ghstack-source-id: f378305
Pull Request resolved: #158987
coconutruben added a commit that referenced this pull request Jul 23, 2025
ghstack-source-id: 35bbfeb
Pull Request resolved: #158987

@coconutruben has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Jul 23, 2025
@coconutruben coconutruben requested a review from jansel July 23, 2025 23:08
@coconutruben coconutruben added the topic: not user facing label (topic category) Jul 23, 2025
@coconutruben (Contributor, Author) commented:

@jansel this is the follow-up to aid with table generation

coconutruben added a commit that referenced this pull request Jul 24, 2025
ghstack-source-id: 75ce06d
Pull Request resolved: #158987

@coconutruben has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

coconutruben added a commit that referenced this pull request Jul 25, 2025
ghstack-source-id: 52a5e02
Pull Request resolved: #158987

@coconutruben has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

coconutruben added a commit that referenced this pull request Jul 25, 2025
ghstack-source-id: da0b7c4
Pull Request resolved: #158987

@coconutruben has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

Differential Revision: [D78852354](https://our.internmc.facebook.com/intern/diff/D78852354)

coconutruben added a commit that referenced this pull request Jul 28, 2025
ghstack-source-id: 4c0a071
Pull Request resolved: #158987
coconutruben added a commit that referenced this pull request Jul 29, 2025
ghstack-source-id: f67c028
Pull Request resolved: #158987
coconutruben added a commit that referenced this pull request Jul 29, 2025
ghstack-source-id: 5463726
Pull Request resolved: #158987

@coconutruben requested a review from eellison July 30, 2025 04:14
coconutruben added a commit that referenced this pull request Aug 2, 2025
ghstack-source-id: aae6cb6
Pull Request resolved: #158987
coconutruben added a commit that referenced this pull request Aug 11, 2025
ghstack-source-id: df6559b
Pull Request resolved: #158987

Labels: ciflow/inductor, ciflow/trunk, module: inductor, topic: not user facing