[inductor] add lookup table recorder #158987
base: gh/coconutruben/25/base
Conversation
# Why

Make it easier for users to generate lookup tables and use them.

# What

- infrastructure to record lookup tables from autotuning
- a sample implementation that records to a directory
- a sample implementation that emits to log.debug (key/value)

## Caveats

Right now it only records mm templates, and it records every choice that Inductor considers. Some architectural changes are needed before it can record, for example, just the top-k choices after autotuning. Once that is ready, this design is modular enough to adjust to recording only the top-k. Even so, there is value now in being able to record a simple table, see the format, and manually edit it down to the top-k entries using the autotuning logs (see the pruning sketch after the test output below).

# Testing

Using the script below:

```bash
#!/bin/bash
# Create a temporary directory for the lookup table dumps
TEMP_DIR=$(mktemp -d)
echo "Created temporary directory for lookup table dumps: $TEMP_DIR"

# Set environment variables to enable verbose output and recording
export TORCH_LOGS="+inductor"
export TORCH_INDUCTOR_LOOKUP_TABLE_RECORD_DIR=$TEMP_DIR
export PYTORCH_DEBUG=1

# Run the Python script
python3 -c "
import os
import torch
import logging
from torch._inductor import config as inductor_config
from torch._inductor.lookup_table_recorder import dump

# Configure logging to see emit messages
logging.basicConfig(level=logging.DEBUG)

# Enable TMA for matmul
inductor_config.triton.enable_persistent_tma_matmul = True

# Create large tensors with bfloat16 dtype
print('Creating 1024x1024 bfloat16 tensors for matrix multiplication...')
a = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16)
b = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16)

# Compile and run the matrix multiplication
print('Compiling and running torch.mm with TMA...')
compiled_mm = torch.compile(torch.mm, mode='max-autotune')
result = compiled_mm(a, b)

# Force synchronization to ensure compilation is complete
torch.cuda.synchronize()

# Dump the lookup table
print('Dumping lookup table...')
dump()
print('\\nMatrix multiplication completed successfully!')
" 2>&1 | tee /tmp/recorder_output.log

# Check if emit logic works by grepping for LookupTable entries
echo -e "\n\n=== CHECKING EMIT FUNCTIONALITY ==="
if grep -q "LookupTable:" /tmp/recorder_output.log; then
    echo "✅ Emit functionality is working! Found LookupTable entries in the log."
else
    echo "❌ Emit functionality not detected. No LookupTable entries found in the log."
fi

# Display the dumped lookup table
echo -e "\n\n=== DUMPED LOOKUP TABLE CONTENTS ==="
LATEST_JSON=$(ls -t $TEMP_DIR/inductor_lut_*.json | head -1)
if [ -f "$LATEST_JSON" ]; then
    echo "Found lookup table file: $LATEST_JSON"
    echo "File size: $(du -h $LATEST_JSON | cut -f1)"
    echo -e "\nFirst 20 lines of the lookup table:"
    head -n 20 $LATEST_JSON

    # Check for TMA entries
    echo -e "\n=== CHECKING FOR TMA ENTRIES ==="
    if grep -q "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON; then
        echo "✅ TMA entries found in the lookup table!"
        echo -e "\nSample TMA entry:"
        grep -m 1 -A 10 -B 2 "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON
    else
        echo "❌ No TMA entries found in the lookup table."
    fi
else
    echo "❌ No lookup table JSON file found in $TEMP_DIR"
fi

echo -e "\n\nLookup table files are available in: $TEMP_DIR"
echo "Log file is available at: /tmp/recorder_output.log"
```
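The emit path logs key/value pairs through log.debug, and the script above only greps for the `LookupTable:` prefix. As a minimal sketch of pulling those pairs back out of the captured log, assuming each emitted line has the shape `LookupTable: <key>=<value>` (an assumption inferred from the grep and the "key/value" description, not a confirmed format):

```python
import re

# Assumed line shape: "LookupTable: <key>=<value>". This format is an
# assumption based on the grep above and the key/value description in the
# summary; adjust the regex to the real emit output if it differs.
EMIT_RE = re.compile(r"LookupTable:\s*(?P<key>[^=]+)=(?P<value>.*)")

def parse_emitted_entries(log_path):
    """Collect key/value pairs from lines matching the assumed emit format."""
    entries = {}
    with open(log_path) as f:
        for line in f:
            match = EMIT_RE.search(line)
            if match:
                entries[match.group("key").strip()] = match.group("value").strip()
    return entries

# Usage against the log captured by the script above:
# entries = parse_emitted_entries("/tmp/recorder_output.log")
```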
This produced:

```
=== CHECKING EMIT FUNCTIONALITY ===
✅ Emit functionality is working! Found LookupTable entries in the log.

=== DUMPED LOOKUP TABLE CONTENTS ===
Found lookup table file: /tmp/tmp.L9pydR3sH4/inductor_lut_20250723_221836_641.json
File size: 12K

First 20 lines of the lookup table:
{
  "NVIDIA H100+mm+((torch.bfloat16, [1024, 1024], [1024, 1]), (torch.bfloat16, [1024, 1024], [1024, 1]))": [
    {
      "template_id": "mm",
      "EVEN_K": true,
      "ALLOW_TF32": false,
      "USE_FAST_ACCUM": false,
      "ACC_TYPE": "tl.float32",
      "num_stages": 1,
      "num_warps": 2,
      "BLOCK_M": 32,
      "BLOCK_N": 32,
      "BLOCK_K": 16,
      "hint_override": null,
      "GROUP_M": 8,
      "template_hash": "0717af5834e39dcca7ea817f896b8d85b4886422da7a3ab5f6911b4cfe568896"
    },
    {
      "template_id": "mm",
      "EVEN_K": true,

=== CHECKING FOR TMA ENTRIES ===
✅ TMA entries found in the lookup table!

Sample TMA entry:
    },
    {
      "template_id": "mm_persistent_tma",
      "EVEN_K": true,
      "ALLOW_TF32": false,
      "USE_FAST_ACCUM": false,
      "ACC_TYPE": "tl.float32",
      "num_stages": 3,
      "num_warps": 8,
      "BLOCK_M": 128,
      "BLOCK_N": 256,
      "BLOCK_K": 64,
      "hint_override": null,

Lookup table files are available in: /tmp/tmp.L9pydR3sH4
Log file is available at: /tmp/recorder_output.log
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

Differential Revision: [D78852354](https://our.internmc.facebook.com/intern/diff/D78852354)
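Until top-k recording is supported natively, the manual trimming described in the caveats can be scripted. A minimal sketch, assuming the dumped JSON keeps the format shown above (a dict mapping a hardware+op+signature key to a list of template-config dicts) and that the configs have already been reordered by the autotuning timings from the logs:

```python
import json

def prune_to_topk(src_path, dst_path, k=3):
    """Keep only the first k template configs per lookup key.

    Assumes the dumped table maps each key to a list of config dicts, as in
    the sample output above. Reordering each list by measured autotuning time
    (taken from the logs) is left to the caller before pruning.
    """
    with open(src_path) as f:
        table = json.load(f)
    pruned = {key: configs[:k] for key, configs in table.items()}
    with open(dst_path, "w") as f:
        json.dump(pruned, f, indent=2)

# Hypothetical paths for illustration:
# prune_to_topk("/tmp/tmp.L9pydR3sH4/inductor_lut_20250723_221836_641.json",
#               "/tmp/inductor_lut_topk.json", k=3)
```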
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158987
Note: Links to docs will display an error until the docs builds have been completed. ❌ 9 New Failures as of commit 763b683 with merge base ecea811. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: f378305 / Pull Request resolved: #158987
ghstack-source-id: 35bbfeb / Pull Request resolved: #158987
@coconutruben has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@jansel this is the follow-up to aid with table generation.
ghstack-source-id: 75ce06d / Pull Request resolved: #158987
ghstack-source-id: 52a5e02 / Pull Request resolved: #158987
ghstack-source-id: da0b7c4 / Pull Request resolved: #158987
ghstack-source-id: 4c0a071 / Pull Request resolved: #158987
\# Why make it easier for users to generate lookup tables and use them \# What - infrastructure to record lookup tables from autotuning - sample implementation for recording to directory - sample implementation for emitting to log.debug (key/value) \## caveats - right now it just records mm_templates and everything that inductor considers. There are some architectural changes needed to make it record e.g. a topk after autotuning. once that is ready, this is modular enough to adjust to recording only topk, however there is value now in being able to record a simple table, see the format, and manually edit it down to the topk entries using the autotuning logs \# Testing using ``` \#!/bin/bash \# Create a temporary directory for the lookup table dumps TEMP_DIR=$(mktemp -d) echo "Created temporary directory for lookup table dumps: $TEMP_DIR" \# Set environment variables to enable verbose output and recording export TORCH_LOGS="+inductor" export TORCH_INDUCTOR_LOOKUP_TABLE_RECORD_DIR=$TEMP_DIR export PYTORCH_DEBUG=1 \# Run the Python script python3 -c " import os import torch import logging from torch._inductor import config as inductor_config from torch._inductor.lookup_table_recorder import dump \# Configure logging to see emit messages logging.basicConfig(level=logging.DEBUG) \# Enable TMA for matmul inductor_config.triton.enable_persistent_tma_matmul = True \# Create large tensors with bfloat16 dtype print('Creating 1024x1024 bfloat16 tensors for matrix multiplication...') a = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) b = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) \# Compile and run the matrix multiplication print('Compiling and running torch.mm with TMA...') compiled_mm = torch.compile(torch.mm, mode='max-autotune') result = compiled_mm(a, b) \# Force synchronization to ensure compilation is complete torch.cuda.synchronize() \# Dump the lookup table print('Dumping lookup table...') dump() print('\\nMatrix multiplication completed successfully!') " 2>&1 | tee /tmp/recorder_output.log \# Check if emit logic works by grepping for LookupTable entries echo -e "\n\n=== CHECKING EMIT FUNCTIONALITY ===" if grep -q "LookupTable:" /tmp/recorder_output.log; then echo "✅ Emit functionality is working! Found LookupTable entries in the log." else echo "❌ Emit functionality not detected. No LookupTable entries found in the log." fi \# Display the dumped lookup table echo -e "\n\n=== DUMPED LOOKUP TABLE CONTENTS ===" LATEST_JSON=$(ls -t $TEMP_DIR/inductor_lut_*.json | head -1) if [ -f "$LATEST_JSON" ]; then echo "Found lookup table file: $LATEST_JSON" echo "File size: $(du -h $LATEST_JSON | cut -f1)" echo -e "\nFirst 20 lines of the lookup table:" head -n 20 $LATEST_JSON # Check for TMA entries echo -e "\n=== CHECKING FOR TMA ENTRIES ===" if grep -q "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON; then echo "✅ TMA entries found in the lookup table!" echo -e "\nSample TMA entry:" grep -m 1 -A 10 -B 2 "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON else echo "❌ No TMA entries found in the lookup table." fi else echo "❌ No lookup table JSON file found in $TEMP_DIR" fi echo -e "\n\nLookup table files are available in: $TEMP_DIR" echo "Log file is available at: /tmp/recorder_output.log" ``` ``` === CHECKING EMIT FUNCTIONALITY === ✅ Emit functionality is working! Found LookupTable entries in the log. 
=== DUMPED LOOKUP TABLE CONTENTS === Found lookup table file: /tmp/tmp.L9pydR3sH4/inductor_lut_20250723_221836_641.json File size: 12K First 20 lines of the lookup table: { "NVIDIA H100+mm+((torch.bfloat16, [1024, 1024], [1024, 1]), (torch.bfloat16, [1024, 1024], [1024, 1]))": [ { "template_id": "mm", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 1, "num_warps": 2, "BLOCK_M": 32, "BLOCK_N": 32, "BLOCK_K": 16, "hint_override": null, "GROUP_M": 8, "template_hash": "0717af5834e39dcca7ea817f896b8d85b4886422da7a3ab5f6911b4cfe568896" }, { "template_id": "mm", "EVEN_K": true, === CHECKING FOR TMA ENTRIES === ✅ TMA entries found in the lookup table! Sample TMA entry: }, { "template_id": "mm_persistent_tma", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 3, "num_warps": 8, "BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64, "hint_override": null, Lookup table files are available in: /tmp/tmp.L9pydR3sH4 Log file is available at: /tmp/recorder_output.log ``` ghstack-source-id: f67c028 Pull Request resolved: #158987
# Why make it easier for users to generate lookup tables and use them # What - infrastructure to record lookup tables from autotuning - sample implementation for recording to directory - sample implementation for emitting to log.debug (key/value) ## caveats - right now it just records mm_templates and everything that inductor considers. There are some architectural changes needed to make it record e.g. a topk after autotuning. once that is ready, this is modular enough to adjust to recording only topk, however there is value now in being able to record a simple table, see the format, and manually edit it down to the topk entries using the autotuning logs # Testing using ``` #!/bin/bash # Create a temporary directory for the lookup table dumps TEMP_DIR=$(mktemp -d) echo "Created temporary directory for lookup table dumps: $TEMP_DIR" # Set environment variables to enable verbose output and recording export TORCH_LOGS="+inductor" export TORCH_INDUCTOR_LOOKUP_TABLE_RECORD_DIR=$TEMP_DIR export PYTORCH_DEBUG=1 # Run the Python script python3 -c " import os import torch import logging from torch._inductor import config as inductor_config from torch._inductor.lookup_table_recorder import dump # Configure logging to see emit messages logging.basicConfig(level=logging.DEBUG) # Enable TMA for matmul inductor_config.triton.enable_persistent_tma_matmul = True # Create large tensors with bfloat16 dtype print('Creating 1024x1024 bfloat16 tensors for matrix multiplication...') a = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) b = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) # Compile and run the matrix multiplication print('Compiling and running torch.mm with TMA...') compiled_mm = torch.compile(torch.mm, mode='max-autotune') result = compiled_mm(a, b) # Force synchronization to ensure compilation is complete torch.cuda.synchronize() # Dump the lookup table print('Dumping lookup table...') dump() print('\\nMatrix multiplication completed successfully!') " 2>&1 | tee /tmp/recorder_output.log # Check if emit logic works by grepping for LookupTable entries echo -e "\n\n=== CHECKING EMIT FUNCTIONALITY ===" if grep -q "LookupTable:" /tmp/recorder_output.log; then echo "✅ Emit functionality is working! Found LookupTable entries in the log." else echo "❌ Emit functionality not detected. No LookupTable entries found in the log." fi # Display the dumped lookup table echo -e "\n\n=== DUMPED LOOKUP TABLE CONTENTS ===" LATEST_JSON=$(ls -t $TEMP_DIR/inductor_lut_*.json | head -1) if [ -f "$LATEST_JSON" ]; then echo "Found lookup table file: $LATEST_JSON" echo "File size: $(du -h $LATEST_JSON | cut -f1)" echo -e "\nFirst 20 lines of the lookup table:" head -n 20 $LATEST_JSON # Check for TMA entries echo -e "\n=== CHECKING FOR TMA ENTRIES ===" if grep -q "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON; then echo "✅ TMA entries found in the lookup table!" echo -e "\nSample TMA entry:" grep -m 1 -A 10 -B 2 "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON else echo "❌ No TMA entries found in the lookup table." fi else echo "❌ No lookup table JSON file found in $TEMP_DIR" fi echo -e "\n\nLookup table files are available in: $TEMP_DIR" echo "Log file is available at: /tmp/recorder_output.log" ``` ``` === CHECKING EMIT FUNCTIONALITY === ✅ Emit functionality is working! Found LookupTable entries in the log. 
=== DUMPED LOOKUP TABLE CONTENTS === Found lookup table file: /tmp/tmp.L9pydR3sH4/inductor_lut_20250723_221836_641.json File size: 12K First 20 lines of the lookup table: { "NVIDIA H100+mm+((torch.bfloat16, [1024, 1024], [1024, 1]), (torch.bfloat16, [1024, 1024], [1024, 1]))": [ { "template_id": "mm", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 1, "num_warps": 2, "BLOCK_M": 32, "BLOCK_N": 32, "BLOCK_K": 16, "hint_override": null, "GROUP_M": 8, "template_hash": "0717af5834e39dcca7ea817f896b8d85b4886422da7a3ab5f6911b4cfe568896" }, { "template_id": "mm", "EVEN_K": true, === CHECKING FOR TMA ENTRIES === ✅ TMA entries found in the lookup table! Sample TMA entry: }, { "template_id": "mm_persistent_tma", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 3, "num_warps": 8, "BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64, "hint_override": null, Lookup table files are available in: /tmp/tmp.L9pydR3sH4 Log file is available at: /tmp/recorder_output.log ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov Differential Revision: [D78852354](https://our.internmc.facebook.com/intern/diff/D78852354) [ghstack-poisoned]
\# Why make it easier for users to generate lookup tables and use them \# What - infrastructure to record lookup tables from autotuning - sample implementation for recording to directory - sample implementation for emitting to log.debug (key/value) \## caveats - right now it just records mm_templates and everything that inductor considers. There are some architectural changes needed to make it record e.g. a topk after autotuning. once that is ready, this is modular enough to adjust to recording only topk, however there is value now in being able to record a simple table, see the format, and manually edit it down to the topk entries using the autotuning logs \# Testing using ``` \#!/bin/bash \# Create a temporary directory for the lookup table dumps TEMP_DIR=$(mktemp -d) echo "Created temporary directory for lookup table dumps: $TEMP_DIR" \# Set environment variables to enable verbose output and recording export TORCH_LOGS="+inductor" export TORCH_INDUCTOR_LOOKUP_TABLE_RECORD_DIR=$TEMP_DIR export PYTORCH_DEBUG=1 \# Run the Python script python3 -c " import os import torch import logging from torch._inductor import config as inductor_config from torch._inductor.lookup_table_recorder import dump \# Configure logging to see emit messages logging.basicConfig(level=logging.DEBUG) \# Enable TMA for matmul inductor_config.triton.enable_persistent_tma_matmul = True \# Create large tensors with bfloat16 dtype print('Creating 1024x1024 bfloat16 tensors for matrix multiplication...') a = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) b = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) \# Compile and run the matrix multiplication print('Compiling and running torch.mm with TMA...') compiled_mm = torch.compile(torch.mm, mode='max-autotune') result = compiled_mm(a, b) \# Force synchronization to ensure compilation is complete torch.cuda.synchronize() \# Dump the lookup table print('Dumping lookup table...') dump() print('\\nMatrix multiplication completed successfully!') " 2>&1 | tee /tmp/recorder_output.log \# Check if emit logic works by grepping for LookupTable entries echo -e "\n\n=== CHECKING EMIT FUNCTIONALITY ===" if grep -q "LookupTable:" /tmp/recorder_output.log; then echo "✅ Emit functionality is working! Found LookupTable entries in the log." else echo "❌ Emit functionality not detected. No LookupTable entries found in the log." fi \# Display the dumped lookup table echo -e "\n\n=== DUMPED LOOKUP TABLE CONTENTS ===" LATEST_JSON=$(ls -t $TEMP_DIR/inductor_lut_*.json | head -1) if [ -f "$LATEST_JSON" ]; then echo "Found lookup table file: $LATEST_JSON" echo "File size: $(du -h $LATEST_JSON | cut -f1)" echo -e "\nFirst 20 lines of the lookup table:" head -n 20 $LATEST_JSON # Check for TMA entries echo -e "\n=== CHECKING FOR TMA ENTRIES ===" if grep -q "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON; then echo "✅ TMA entries found in the lookup table!" echo -e "\nSample TMA entry:" grep -m 1 -A 10 -B 2 "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON else echo "❌ No TMA entries found in the lookup table." fi else echo "❌ No lookup table JSON file found in $TEMP_DIR" fi echo -e "\n\nLookup table files are available in: $TEMP_DIR" echo "Log file is available at: /tmp/recorder_output.log" ``` ``` === CHECKING EMIT FUNCTIONALITY === ✅ Emit functionality is working! Found LookupTable entries in the log. 
=== DUMPED LOOKUP TABLE CONTENTS === Found lookup table file: /tmp/tmp.L9pydR3sH4/inductor_lut_20250723_221836_641.json File size: 12K First 20 lines of the lookup table: { "NVIDIA H100+mm+((torch.bfloat16, [1024, 1024], [1024, 1]), (torch.bfloat16, [1024, 1024], [1024, 1]))": [ { "template_id": "mm", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 1, "num_warps": 2, "BLOCK_M": 32, "BLOCK_N": 32, "BLOCK_K": 16, "hint_override": null, "GROUP_M": 8, "template_hash": "0717af5834e39dcca7ea817f896b8d85b4886422da7a3ab5f6911b4cfe568896" }, { "template_id": "mm", "EVEN_K": true, === CHECKING FOR TMA ENTRIES === ✅ TMA entries found in the lookup table! Sample TMA entry: }, { "template_id": "mm_persistent_tma", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 3, "num_warps": 8, "BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64, "hint_override": null, Lookup table files are available in: /tmp/tmp.L9pydR3sH4 Log file is available at: /tmp/recorder_output.log ``` ghstack-source-id: 5463726 Pull Request resolved: #158987
@coconutruben has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
# Why make it easier for users to generate lookup tables and use them # What - infrastructure to record lookup tables from autotuning - sample implementation for recording to directory - sample implementation for emitting to log.debug (key/value) ## caveats - right now it just records mm_templates and everything that inductor considers. There are some architectural changes needed to make it record e.g. a topk after autotuning. once that is ready, this is modular enough to adjust to recording only topk, however there is value now in being able to record a simple table, see the format, and manually edit it down to the topk entries using the autotuning logs # Testing using ``` #!/bin/bash # Create a temporary directory for the lookup table dumps TEMP_DIR=$(mktemp -d) echo "Created temporary directory for lookup table dumps: $TEMP_DIR" # Set environment variables to enable verbose output and recording export TORCH_LOGS="+inductor" export TORCH_INDUCTOR_LOOKUP_TABLE_RECORD_DIR=$TEMP_DIR export PYTORCH_DEBUG=1 # Run the Python script python3 -c " import os import torch import logging from torch._inductor import config as inductor_config from torch._inductor.lookup_table_recorder import dump # Configure logging to see emit messages logging.basicConfig(level=logging.DEBUG) # Enable TMA for matmul inductor_config.triton.enable_persistent_tma_matmul = True # Create large tensors with bfloat16 dtype print('Creating 1024x1024 bfloat16 tensors for matrix multiplication...') a = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) b = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) # Compile and run the matrix multiplication print('Compiling and running torch.mm with TMA...') compiled_mm = torch.compile(torch.mm, mode='max-autotune') result = compiled_mm(a, b) # Force synchronization to ensure compilation is complete torch.cuda.synchronize() # Dump the lookup table print('Dumping lookup table...') dump() print('\\nMatrix multiplication completed successfully!') " 2>&1 | tee /tmp/recorder_output.log # Check if emit logic works by grepping for LookupTable entries echo -e "\n\n=== CHECKING EMIT FUNCTIONALITY ===" if grep -q "LookupTable:" /tmp/recorder_output.log; then echo "✅ Emit functionality is working! Found LookupTable entries in the log." else echo "❌ Emit functionality not detected. No LookupTable entries found in the log." fi # Display the dumped lookup table echo -e "\n\n=== DUMPED LOOKUP TABLE CONTENTS ===" LATEST_JSON=$(ls -t $TEMP_DIR/inductor_lut_*.json | head -1) if [ -f "$LATEST_JSON" ]; then echo "Found lookup table file: $LATEST_JSON" echo "File size: $(du -h $LATEST_JSON | cut -f1)" echo -e "\nFirst 20 lines of the lookup table:" head -n 20 $LATEST_JSON # Check for TMA entries echo -e "\n=== CHECKING FOR TMA ENTRIES ===" if grep -q "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON; then echo "✅ TMA entries found in the lookup table!" echo -e "\nSample TMA entry:" grep -m 1 -A 10 -B 2 "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON else echo "❌ No TMA entries found in the lookup table." fi else echo "❌ No lookup table JSON file found in $TEMP_DIR" fi echo -e "\n\nLookup table files are available in: $TEMP_DIR" echo "Log file is available at: /tmp/recorder_output.log" ``` ``` === CHECKING EMIT FUNCTIONALITY === ✅ Emit functionality is working! Found LookupTable entries in the log. 
=== DUMPED LOOKUP TABLE CONTENTS === Found lookup table file: /tmp/tmp.L9pydR3sH4/inductor_lut_20250723_221836_641.json File size: 12K First 20 lines of the lookup table: { "NVIDIA H100+mm+((torch.bfloat16, [1024, 1024], [1024, 1]), (torch.bfloat16, [1024, 1024], [1024, 1]))": [ { "template_id": "mm", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 1, "num_warps": 2, "BLOCK_M": 32, "BLOCK_N": 32, "BLOCK_K": 16, "hint_override": null, "GROUP_M": 8, "template_hash": "0717af5834e39dcca7ea817f896b8d85b4886422da7a3ab5f6911b4cfe568896" }, { "template_id": "mm", "EVEN_K": true, === CHECKING FOR TMA ENTRIES === ✅ TMA entries found in the lookup table! Sample TMA entry: }, { "template_id": "mm_persistent_tma", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 3, "num_warps": 8, "BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64, "hint_override": null, Lookup table files are available in: /tmp/tmp.L9pydR3sH4 Log file is available at: /tmp/recorder_output.log ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov Differential Revision: [D78852354](https://our.internmc.facebook.com/intern/diff/D78852354) [ghstack-poisoned]
\# Why make it easier for users to generate lookup tables and use them \# What - infrastructure to record lookup tables from autotuning - sample implementation for recording to directory - sample implementation for emitting to log.debug (key/value) \## caveats - right now it just records mm_templates and everything that inductor considers. There are some architectural changes needed to make it record e.g. a topk after autotuning. once that is ready, this is modular enough to adjust to recording only topk, however there is value now in being able to record a simple table, see the format, and manually edit it down to the topk entries using the autotuning logs \# Testing using ``` \#!/bin/bash \# Create a temporary directory for the lookup table dumps TEMP_DIR=$(mktemp -d) echo "Created temporary directory for lookup table dumps: $TEMP_DIR" \# Set environment variables to enable verbose output and recording export TORCH_LOGS="+inductor" export TORCH_INDUCTOR_LOOKUP_TABLE_RECORD_DIR=$TEMP_DIR export PYTORCH_DEBUG=1 \# Run the Python script python3 -c " import os import torch import logging from torch._inductor import config as inductor_config from torch._inductor.lookup_table_recorder import dump \# Configure logging to see emit messages logging.basicConfig(level=logging.DEBUG) \# Enable TMA for matmul inductor_config.triton.enable_persistent_tma_matmul = True \# Create large tensors with bfloat16 dtype print('Creating 1024x1024 bfloat16 tensors for matrix multiplication...') a = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) b = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) \# Compile and run the matrix multiplication print('Compiling and running torch.mm with TMA...') compiled_mm = torch.compile(torch.mm, mode='max-autotune') result = compiled_mm(a, b) \# Force synchronization to ensure compilation is complete torch.cuda.synchronize() \# Dump the lookup table print('Dumping lookup table...') dump() print('\\nMatrix multiplication completed successfully!') " 2>&1 | tee /tmp/recorder_output.log \# Check if emit logic works by grepping for LookupTable entries echo -e "\n\n=== CHECKING EMIT FUNCTIONALITY ===" if grep -q "LookupTable:" /tmp/recorder_output.log; then echo "✅ Emit functionality is working! Found LookupTable entries in the log." else echo "❌ Emit functionality not detected. No LookupTable entries found in the log." fi \# Display the dumped lookup table echo -e "\n\n=== DUMPED LOOKUP TABLE CONTENTS ===" LATEST_JSON=$(ls -t $TEMP_DIR/inductor_lut_*.json | head -1) if [ -f "$LATEST_JSON" ]; then echo "Found lookup table file: $LATEST_JSON" echo "File size: $(du -h $LATEST_JSON | cut -f1)" echo -e "\nFirst 20 lines of the lookup table:" head -n 20 $LATEST_JSON # Check for TMA entries echo -e "\n=== CHECKING FOR TMA ENTRIES ===" if grep -q "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON; then echo "✅ TMA entries found in the lookup table!" echo -e "\nSample TMA entry:" grep -m 1 -A 10 -B 2 "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON else echo "❌ No TMA entries found in the lookup table." fi else echo "❌ No lookup table JSON file found in $TEMP_DIR" fi echo -e "\n\nLookup table files are available in: $TEMP_DIR" echo "Log file is available at: /tmp/recorder_output.log" ``` ``` === CHECKING EMIT FUNCTIONALITY === ✅ Emit functionality is working! Found LookupTable entries in the log. 
=== DUMPED LOOKUP TABLE CONTENTS === Found lookup table file: /tmp/tmp.L9pydR3sH4/inductor_lut_20250723_221836_641.json File size: 12K First 20 lines of the lookup table: { "NVIDIA H100+mm+((torch.bfloat16, [1024, 1024], [1024, 1]), (torch.bfloat16, [1024, 1024], [1024, 1]))": [ { "template_id": "mm", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 1, "num_warps": 2, "BLOCK_M": 32, "BLOCK_N": 32, "BLOCK_K": 16, "hint_override": null, "GROUP_M": 8, "template_hash": "0717af5834e39dcca7ea817f896b8d85b4886422da7a3ab5f6911b4cfe568896" }, { "template_id": "mm", "EVEN_K": true, === CHECKING FOR TMA ENTRIES === ✅ TMA entries found in the lookup table! Sample TMA entry: }, { "template_id": "mm_persistent_tma", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 3, "num_warps": 8, "BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64, "hint_override": null, Lookup table files are available in: /tmp/tmp.L9pydR3sH4 Log file is available at: /tmp/recorder_output.log ``` ghstack-source-id: aae6cb6 Pull Request resolved: #158987
# Why make it easier for users to generate lookup tables and use them # What - infrastructure to record lookup tables from autotuning - sample implementation for recording to directory - sample implementation for emitting to log.debug (key/value) ## caveats - right now it just records mm_templates and everything that inductor considers. There are some architectural changes needed to make it record e.g. a topk after autotuning. once that is ready, this is modular enough to adjust to recording only topk, however there is value now in being able to record a simple table, see the format, and manually edit it down to the topk entries using the autotuning logs # Testing using ``` #!/bin/bash # Create a temporary directory for the lookup table dumps TEMP_DIR=$(mktemp -d) echo "Created temporary directory for lookup table dumps: $TEMP_DIR" # Set environment variables to enable verbose output and recording export TORCH_LOGS="+inductor" export TORCH_INDUCTOR_LOOKUP_TABLE_RECORD_DIR=$TEMP_DIR export PYTORCH_DEBUG=1 # Run the Python script python3 -c " import os import torch import logging from torch._inductor import config as inductor_config from torch._inductor.lookup_table_recorder import dump # Configure logging to see emit messages logging.basicConfig(level=logging.DEBUG) # Enable TMA for matmul inductor_config.triton.enable_persistent_tma_matmul = True # Create large tensors with bfloat16 dtype print('Creating 1024x1024 bfloat16 tensors for matrix multiplication...') a = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) b = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) # Compile and run the matrix multiplication print('Compiling and running torch.mm with TMA...') compiled_mm = torch.compile(torch.mm, mode='max-autotune') result = compiled_mm(a, b) # Force synchronization to ensure compilation is complete torch.cuda.synchronize() # Dump the lookup table print('Dumping lookup table...') dump() print('\\nMatrix multiplication completed successfully!') " 2>&1 | tee /tmp/recorder_output.log # Check if emit logic works by grepping for LookupTable entries echo -e "\n\n=== CHECKING EMIT FUNCTIONALITY ===" if grep -q "LookupTable:" /tmp/recorder_output.log; then echo "✅ Emit functionality is working! Found LookupTable entries in the log." else echo "❌ Emit functionality not detected. No LookupTable entries found in the log." fi # Display the dumped lookup table echo -e "\n\n=== DUMPED LOOKUP TABLE CONTENTS ===" LATEST_JSON=$(ls -t $TEMP_DIR/inductor_lut_*.json | head -1) if [ -f "$LATEST_JSON" ]; then echo "Found lookup table file: $LATEST_JSON" echo "File size: $(du -h $LATEST_JSON | cut -f1)" echo -e "\nFirst 20 lines of the lookup table:" head -n 20 $LATEST_JSON # Check for TMA entries echo -e "\n=== CHECKING FOR TMA ENTRIES ===" if grep -q "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON; then echo "✅ TMA entries found in the lookup table!" echo -e "\nSample TMA entry:" grep -m 1 -A 10 -B 2 "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON else echo "❌ No TMA entries found in the lookup table." fi else echo "❌ No lookup table JSON file found in $TEMP_DIR" fi echo -e "\n\nLookup table files are available in: $TEMP_DIR" echo "Log file is available at: /tmp/recorder_output.log" ``` ``` === CHECKING EMIT FUNCTIONALITY === ✅ Emit functionality is working! Found LookupTable entries in the log. 
=== DUMPED LOOKUP TABLE CONTENTS === Found lookup table file: /tmp/tmp.L9pydR3sH4/inductor_lut_20250723_221836_641.json File size: 12K First 20 lines of the lookup table: { "NVIDIA H100+mm+((torch.bfloat16, [1024, 1024], [1024, 1]), (torch.bfloat16, [1024, 1024], [1024, 1]))": [ { "template_id": "mm", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 1, "num_warps": 2, "BLOCK_M": 32, "BLOCK_N": 32, "BLOCK_K": 16, "hint_override": null, "GROUP_M": 8, "template_hash": "0717af5834e39dcca7ea817f896b8d85b4886422da7a3ab5f6911b4cfe568896" }, { "template_id": "mm", "EVEN_K": true, === CHECKING FOR TMA ENTRIES === ✅ TMA entries found in the lookup table! Sample TMA entry: }, { "template_id": "mm_persistent_tma", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 3, "num_warps": 8, "BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64, "hint_override": null, Lookup table files are available in: /tmp/tmp.L9pydR3sH4 Log file is available at: /tmp/recorder_output.log ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov Differential Revision: [D78852354](https://our.internmc.facebook.com/intern/diff/D78852354) [ghstack-poisoned]
\# Why make it easier for users to generate lookup tables and use them \# What - infrastructure to record lookup tables from autotuning - sample implementation for recording to directory - sample implementation for emitting to log.debug (key/value) \## caveats - right now it just records mm_templates and everything that inductor considers. There are some architectural changes needed to make it record e.g. a topk after autotuning. once that is ready, this is modular enough to adjust to recording only topk, however there is value now in being able to record a simple table, see the format, and manually edit it down to the topk entries using the autotuning logs \# Testing using ``` \#!/bin/bash \# Create a temporary directory for the lookup table dumps TEMP_DIR=$(mktemp -d) echo "Created temporary directory for lookup table dumps: $TEMP_DIR" \# Set environment variables to enable verbose output and recording export TORCH_LOGS="+inductor" export TORCH_INDUCTOR_LOOKUP_TABLE_RECORD_DIR=$TEMP_DIR export PYTORCH_DEBUG=1 \# Run the Python script python3 -c " import os import torch import logging from torch._inductor import config as inductor_config from torch._inductor.lookup_table_recorder import dump \# Configure logging to see emit messages logging.basicConfig(level=logging.DEBUG) \# Enable TMA for matmul inductor_config.triton.enable_persistent_tma_matmul = True \# Create large tensors with bfloat16 dtype print('Creating 1024x1024 bfloat16 tensors for matrix multiplication...') a = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) b = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16) \# Compile and run the matrix multiplication print('Compiling and running torch.mm with TMA...') compiled_mm = torch.compile(torch.mm, mode='max-autotune') result = compiled_mm(a, b) \# Force synchronization to ensure compilation is complete torch.cuda.synchronize() \# Dump the lookup table print('Dumping lookup table...') dump() print('\\nMatrix multiplication completed successfully!') " 2>&1 | tee /tmp/recorder_output.log \# Check if emit logic works by grepping for LookupTable entries echo -e "\n\n=== CHECKING EMIT FUNCTIONALITY ===" if grep -q "LookupTable:" /tmp/recorder_output.log; then echo "✅ Emit functionality is working! Found LookupTable entries in the log." else echo "❌ Emit functionality not detected. No LookupTable entries found in the log." fi \# Display the dumped lookup table echo -e "\n\n=== DUMPED LOOKUP TABLE CONTENTS ===" LATEST_JSON=$(ls -t $TEMP_DIR/inductor_lut_*.json | head -1) if [ -f "$LATEST_JSON" ]; then echo "Found lookup table file: $LATEST_JSON" echo "File size: $(du -h $LATEST_JSON | cut -f1)" echo -e "\nFirst 20 lines of the lookup table:" head -n 20 $LATEST_JSON # Check for TMA entries echo -e "\n=== CHECKING FOR TMA ENTRIES ===" if grep -q "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON; then echo "✅ TMA entries found in the lookup table!" echo -e "\nSample TMA entry:" grep -m 1 -A 10 -B 2 "tma\|TMA_SIZE\|NUM_SMS" $LATEST_JSON else echo "❌ No TMA entries found in the lookup table." fi else echo "❌ No lookup table JSON file found in $TEMP_DIR" fi echo -e "\n\nLookup table files are available in: $TEMP_DIR" echo "Log file is available at: /tmp/recorder_output.log" ``` ``` === CHECKING EMIT FUNCTIONALITY === ✅ Emit functionality is working! Found LookupTable entries in the log. 
=== DUMPED LOOKUP TABLE CONTENTS === Found lookup table file: /tmp/tmp.L9pydR3sH4/inductor_lut_20250723_221836_641.json File size: 12K First 20 lines of the lookup table: { "NVIDIA H100+mm+((torch.bfloat16, [1024, 1024], [1024, 1]), (torch.bfloat16, [1024, 1024], [1024, 1]))": [ { "template_id": "mm", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 1, "num_warps": 2, "BLOCK_M": 32, "BLOCK_N": 32, "BLOCK_K": 16, "hint_override": null, "GROUP_M": 8, "template_hash": "0717af5834e39dcca7ea817f896b8d85b4886422da7a3ab5f6911b4cfe568896" }, { "template_id": "mm", "EVEN_K": true, === CHECKING FOR TMA ENTRIES === ✅ TMA entries found in the lookup table! Sample TMA entry: }, { "template_id": "mm_persistent_tma", "EVEN_K": true, "ALLOW_TF32": false, "USE_FAST_ACCUM": false, "ACC_TYPE": "tl.float32", "num_stages": 3, "num_warps": 8, "BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64, "hint_override": null, Lookup table files are available in: /tmp/tmp.L9pydR3sH4 Log file is available at: /tmp/recorder_output.log ``` ghstack-source-id: df6559b Pull Request resolved: #158987
@coconutruben has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Stack from ghstack (oldest at bottom):
# Why

make it easier for users to generate lookup tables and use them

# What

- infrastructure to record lookup tables from autotuning
- sample implementation for recording to a directory
- sample implementation for emitting to log.debug (key/value)
## caveats

- right now it just records mm_templates and everything that inductor considers. There are some architectural changes needed to make it record e.g. a topk after autotuning. Once that is ready, this is modular enough to adjust to recording only the topk; in the meantime, there is value in being able to record a full table, see the format, and manually edit it down to the topk entries using the autotuning logs (a minimal trim sketch follows below)
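Until topk recording lands, the manual trim described above can be done with a short script. A minimal sketch, assuming the configs you want to keep have already been moved to the front of each list based on the autotuning logs; the file names are hypothetical, and the recorder itself does not rank entries.

```python
import json

def trim_to_topk(src: str, dst: str, k: int = 3) -> None:
    """Keep only the first k configs per lookup key.

    Assumes the entries worth keeping are already at the front of
    each list (e.g. after reordering them by hand using the
    autotuning logs); the recorder itself does not rank entries.
    """
    with open(src) as f:
        table = json.load(f)
    trimmed = {key: configs[:k] for key, configs in table.items()}
    with open(dst, "w") as f:
        json.dump(trimmed, f, indent=2)

# Hypothetical file names for illustration.
trim_to_topk("inductor_lut_full.json", "inductor_lut_topk.json", k=3)
```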
# Testing

using the bash test script shown earlier in this description
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov
Differential Revision: D78852354