Skip to content

Conversation

jurahul
Copy link
Contributor

@jurahul jurahul commented Aug 22, 2025

This change adds an option to specialize decoders per bitwidth, which can help reduce the (compiled) code size of the decoder code.

Current state:
Currently, the code generated by the decoder emitter consists of two key functions: decodeInstruction which is the entry point into the generated code and decodeToMCInst which is invoked when a decode op is reached while traversing through the decoder table. Both functions are templated on InsnType which is the raw instruction bits that are supplied to decodeInstruction.

Several backends call decodeInstruction with different InsnType types, leading to several template instantiations of these functions in the final code. As an example, AMDGPU instantiates this function with type DecoderUInt128 type for decoding 96/128-bit instructions, uint64_t for decoding 64-bit instructions, and uint32_t for decoding 32-bit instructions. Since there is just one decodeToMCInst in the generated code, it has code that handles decoding for all instruction sizes. However, the decoders emitted for different instructions sizes rarely have any intersection with each other. That means, in the AMDGPU case, the instantiation with InsnType == DecoderUInt128 has decoder code for 32/64-bit instructions that is never exercised. Conversely, the instantiation with InsnType == uint64_t has decoder code for 128/96/32-bit instructions that is never exercised. This leads to unnecessary dead code in the generated disassembler binary (that the compiler cannot eliminate by itself).

New state:
With this change, we introduce an option specialize-decoders-per-bitwidth. Under this mode, the DecoderEmitter will generate several versions of decodeToMCInst function, one for each bitwidth. The code is still templated, but will require backends to specify, for each InsnType used, the bitwidth of the instruction that the type is used to represent using a type-trait InsnBitWidth. This will enable the templated code to choose the right variant of decodeToMCInst. Under this mode, a particular instantiation will only end up instantiating a single variant of decodeToMCInst generated and that will include only those decoders that are applicable to a single bitwidth, resulting in elimination of the code duplication through instantiation and a reduction in code size.

Additionally, under this mode, decoders are uniqued only within a given bitwidth (as opposed to across all bitwidths without this option), so the decoder index values assigned are smaller, and consume less bytes in their ULEB128 encoding. As a result, the generated decoder tables can also reduce in size.

Adopt this feature for the AMDGPU and RISCV backend. In a release build, this results in a net 55% reduction in the .text size of libLLVMAMDGPUDisassembler.so and a 5% reduction in the .rodata size. For RISCV, which today uses a single uint64_t type, this results in a 3.7% increase in code size (expected as we instantiate the code 3 times now).

Actual measured sizes are as follows:

Baseline commit: 72c04bb882ad70230bce309c3013d9cc2c99e9a7
Configuration: Ubuntu clang version 18.1.3, release build with asserts disabled.
 
AMDGPU        Before       After      Change
======================================================
.text         612327       275607     55% reduction
.rodata       369728       351336      5% reduction          

RISCV:
======================================================
.text          47407       49187      3.7% increase   
.rodata        35768       35839      0.1% increase

@jurahul jurahul changed the title Cull decoders [LLVM][MC] Add support to cull inactive decoders in decoder emitter Aug 22, 2025
@jurahul jurahul force-pushed the decoder_cull branch 3 times, most recently from 46bbe78 to 3297671 Compare August 25, 2025 19:26
@jurahul
Copy link
Contributor Author

jurahul commented Aug 25, 2025

Hi @s-barannikov and @topperc, this is the new simplified version of the earlier change to compile out decoders that are unused for a particular bitwidth. This approach of using a type-trait seems simpler and is much less code changes than before. Note that it includes both the per-bitwidth decoder specialization as well as numbering decoders for each bitwidth, yielding some savings on the size of the decoder tables as well.

I'll fill in the description soon to explain the motivation and changes.

@jurahul
Copy link
Contributor Author

jurahul commented Aug 26, 2025

For AMDGPU, for example, the generated code looks like:

// Handling 1223 cases.
template <typename InsnType>
static std::enable_if_t<InsnBitWidth<InsnType> == 64, DecodeStatus>
decodeToMCInst(unsigned Idx, DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete) {
...
}

// Handling 369 cases.
template <typename InsnType>
static std::enable_if_t<InsnBitWidth<InsnType> == 96, DecodeStatus>
decodeToMCInst(unsigned Idx, DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete) {
...
}

// Handling 94 cases.
template <typename InsnType>
static std::enable_if_t<InsnBitWidth<InsnType> == 32, DecodeStatus>
decodeToMCInst(unsigned Idx, DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete) {
...
}

@jurahul jurahul requested a review from s-barannikov August 26, 2025 05:34
@jurahul jurahul force-pushed the decoder_cull branch 2 times, most recently from 69c8e70 to 0930ffe Compare August 26, 2025 05:47
@jurahul
Copy link
Contributor Author

jurahul commented Aug 26, 2025

Just fixing some comments.

@s-barannikov
Copy link
Contributor

Looks good to me overall.

One thing I'm not sure of is using std::bitset where a larger integer type would do (e.g., using std::bitset<48> instead of uint64_t). Probably not a big deal, but I thought it is worth pointing out.

Secondly, what will happen if I use a "wrong" type of instruction when calling decodeInstruction? Will it silently use the wrong decodeToMCInst? If so, this is not great, and it should be possible to add a runtime check that calling decodeInstruction with a particular table resolves to the correct decodeToMCInst.

@jurahul
Copy link
Contributor Author

jurahul commented Aug 26, 2025

Looks good to me overall.

One thing I'm not sure of is using std::bitset where a larger integer type would do (e.g., using std::bitset<48> instead of uint64_t). Probably not a big deal, but I thought it is worth pointing out.

Right, for the bitwidth specialization to work though, we need different types, so using the same type for different bitwidth (like uint64_t for both 48 and 64 bit instructions) is not supported. I guess for RISCV, if 64-bit instructions are not supported, there is a choice is just using 64-bits. I'd hope though that std::bitset<48> would internally just use a uint64_t. One thing I thought of is generalizing this to allow > 1 bitwidth for the same type. The trait can be a per-type function like:

template <typename T>
constexpr bool isSupportedInsnBitWidth(uint32_t N) { return false; }

This will allow using the same C++ type for > 1 bitwidths. Then we would need to remove the current specializations in MCDecoder.h as they assume a specific usage and then backends can choose. Not sure if that generalization is needed.

Secondly, what will happen if I use a "wrong" type of instruction when calling decodeInstruction? Will it silently use the wrong decodeToMCInst? If so, this is not great, and it should be possible to add a runtime check that calling decodeInstruction with a particular table resolves to the correct decodeToMCInst.

Yeah, @topperc had suggested that in the earlier version. I plan to do that in a follow-on MR. The first byte of the generated decoder table will be the supported bitwidth and we can do an initial runtime check.

@jurahul jurahul requested a review from s-barannikov August 26, 2025 18:36
@jurahul
Copy link
Contributor Author

jurahul commented Aug 26, 2025

template <typename T>
constexpr bool isSupportedInsnBitWidth(uint32_t N) { return false; }

This will allow using the same C++ type for > 1 bitwidths. Then we would need to remove the current specializations in MCDecoder.h as they assume a specific usage and then backends can choose. Not sure if that generalization is needed.

LMK if this is something to pursue. I suspect it may not be too much effort to do this and offers additional flexibility for backends.

@jurahul
Copy link
Contributor Author

jurahul commented Aug 26, 2025

template <typename T>
constexpr bool isSupportedInsnBitWidth(uint32_t N) { return false; }

This will allow using the same C++ type for > 1 bitwidths. Then we would need to remove the current specializations in MCDecoder.h as they assume a specific usage and then backends can choose. Not sure if that generalization is needed.

LMK if this is something to pursue. I suspect it may not be too much effort to do this and offers additional flexibility for backends.

I did and now RISCV can continue using uint64_t for 48-bit insts. Will update the PR soon

@jurahul
Copy link
Contributor Author

jurahul commented Aug 26, 2025

The windows build failure seems to be this : https://stackoverflow.com/questions/29502052/template-specialization-and-enable-if-problems

@jurahul
Copy link
Contributor Author

jurahul commented Aug 26, 2025

The windows build failure seems to be this : https://stackoverflow.com/questions/29502052/template-specialization-and-enable-if-problems

The specific form that seems to work with MSVC (and clang and gcc) was derived here: https://godbolt.org/z/Eqdf4Tehz

And this was the solution: https://stackoverflow.com/questions/75686697/how-to-work-around-function-template-has-already-been-defined-when-using-std

@jurahul
Copy link
Contributor Author

jurahul commented Aug 27, 2025

Looks like MSVC builds with compiler version 14.x do not support either form, so maybe not possible to pursue this at the moment.

@jurahul jurahul changed the title [LLVM][MC] Add support to specialize decoder per bitwidth [LLVM][MC][DecoderEmitter] Add support to specialize decoder per bitwidth Aug 27, 2025
@jurahul
Copy link
Contributor Author

jurahul commented Aug 27, 2025

@s-barannikov this is ready for another round of review. The delta is that there are no longer default implementations of InsnBitWidth in MCDecoder.h, so individual targets need to provide them according to their usage. This enables, for example, for RISCV to use uint64_t for 48 bit instructions.

Copy link
Contributor

@s-barannikov s-barannikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please for @topperc review

Meanwhile, a couple of tests wouldn't hurt.

@s-barannikov
Copy link
Contributor

The description needs to be updated

@jurahul
Copy link
Contributor Author

jurahul commented Aug 27, 2025

Yes thanks. Will add tests and update the description with one final set of numbers.

@jurahul
Copy link
Contributor Author

jurahul commented Aug 27, 2025

Updated description and added a unit test (which found that this option did not work with use-fn-table-in-decode-to-mcinst, so added additional fixes for that, and verified that check-llvm-mc-disassembler-{amdgpu,riscv} pass locally with both options enabled).

@jurahul jurahul marked this pull request as ready for review August 27, 2025 17:47
@llvmbot
Copy link
Member

llvmbot commented Aug 27, 2025

@llvm/pr-subscribers-tablegen

Author: Rahul Joshi (jurahul)

Changes

This change adds an option to specialize decoders per bitwidth, which can help reduce the (compiled) code size of the decoder code.

Current state:
Currently, the code generated by the decoder emitter consists of two key functions: decodeInstruction which is the entry point into the generated code and decodeToMCInst which is invoked when a decode op is reached while traversing through the decoder table. Both functions are templated on InsnType which is the raw instruction bits that are supplied to decodeInstruction.

Several backends call decodeInstruction with different InsnType types, leading to several template instantiations of these functions in the final code. As an example, AMDGPU instantiates this function with type DecoderUInt128 type for decoding 96/128-bit instructions, uint64_t for decoding 64-bit instructions, and uint32_t for decoding 32-bit instructions. Since there is just one decodeToMCInst in the generated code, it has code that handles decoding for all instruction sizes. However, the decoders emitted for different instructions sizes rarely have any intersection with each other. That means, in the AMDGPU case, the instantiation with InsnType == DecoderUInt128 has decoder code for 32/64-bit instructions that is never exercised. Conversely, the instantiation with InsnType == uint64_t has decoder code for 128/96/32-bit instructions that is never exercised. This leads to unnecessary dead code in the generated disassembler binary.

New state:
With this change, we introduce an option specialize-decoders-per-bitwidth. Under this mode, the DecoderEmitter will generate several versions of decodeToMCInst function, one for each bitwidth. The code is still templated, but will require backends to specify, for each InstType used, the bitwidth of the instruction that the type is used to represent using a type-trait InsnBitWidth. This will enable the templated code to choose the right variant of decodeToMCInst. Under this mode, a particular instantiation will only end up instantiating a single variant of decodeToMCInst generated and that will include only those decoders that are applicable to a single bitwidth, resulting in elimination of the code duplication through instantiation and a reduction in code size.

Additionally, under this mode, decoders are uniqued only within a given bitwidth (as opposed to across all bitwidths without this option), so the decoder index values assigned are smaller, and consume less bytes in their ULEB128 encoding. As a result, the generated decoder tables can also reduce in size.

Adopt this feature for the AMDGPU and RISCV backend. In a release build, this results in a net 55% reduction in the .text size of libLLVMAMDGPUDisassembler.so and a 4.5% reduction in the .rodata size. For RISCV, which today uses a single uint64_t type, this results in a 3.7% increase in code size (expected as instantiate the code 3 times now).

Actual measured sizes are as follows:

Baseline commit: 72c04bb882ad70230bce309c3013d9cc2c99e9a7
Configuration: Ubuntu clang version 18.1.3, release build with asserts disabled.
 
AMDGPU        Before       After      Change
======================================================
.text         612327       275607     55% reduction
.rodata       369728       351336      5% reduction          

RISCV:
======================================================
.text          47407       49187      3.75% increase   
.rodata        35768       35839      0.1% increase

Patch is 35.73 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154865.diff

14 Files Affected:

  • (modified) llvm/include/llvm/MC/MCDecoder.h (+17)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+2-1)
  • (modified) llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp (+22-15)
  • (modified) llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h (-38)
  • (modified) llvm/lib/Target/DirectX/CMakeLists.txt (+1)
  • (modified) llvm/lib/Target/RISCV/CMakeLists.txt (+2-1)
  • (modified) llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp (+9-7)
  • (added) llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td (+169)
  • (modified) llvm/test/TableGen/DecoderEmitterFnTable.td (+8-8)
  • (modified) llvm/test/TableGen/HwModeEncodeDecode3.td (+7-7)
  • (modified) llvm/test/TableGen/VarLenDecoder.td (+6-5)
  • (modified) llvm/utils/TableGen/DecoderEmitter.cpp (+103-39)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/AMDGPU/Disassembler/BUILD.gn (+6-1)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/RISCV/Disassembler/BUILD.gn (+4-1)
diff --git a/llvm/include/llvm/MC/MCDecoder.h b/llvm/include/llvm/MC/MCDecoder.h
index 70762a4a5ebae..459c8a6a5ea34 100644
--- a/llvm/include/llvm/MC/MCDecoder.h
+++ b/llvm/include/llvm/MC/MCDecoder.h
@@ -12,6 +12,7 @@
 
 #include "llvm/MC/MCDisassembler/MCDisassembler.h"
 #include "llvm/Support/MathExtras.h"
+#include <bitset>
 #include <cassert>
 
 namespace llvm::MCD {
@@ -48,6 +49,15 @@ fieldFromInstruction(const InsnType &Insn, unsigned StartBit,
   return Insn.extractBitsAsZExtValue(NumBits, StartBit);
 }
 
+template <size_t N>
+uint64_t fieldFromInstruction(const std::bitset<N> &Insn, unsigned StartBit,
+                              unsigned NumBits) {
+  assert(StartBit + NumBits <= N && "Instruction field out of bounds!");
+  assert(NumBits <= 64 && "Cannot support >64-bit extractions!");
+  const std::bitset<N> Mask(maskTrailingOnes<uint64_t>(NumBits));
+  return ((Insn >> StartBit) & Mask).to_ullong();
+}
+
 // Helper function for inserting bits extracted from an encoded instruction into
 // an integer-typed field.
 template <typename IntType>
@@ -62,6 +72,13 @@ insertBits(IntType &field, IntType bits, unsigned startBit, unsigned numBits) {
   field |= bits << startBit;
 }
 
+// InsnBitWidth is essentially a type trait used by the decoder emitter to query
+// the supported bitwidth for a given type. But default, the value is 0, making
+// it an invalid type for use as `InsnType` when instantiating the decoder.
+// Individual targets are expected to provide specializations for these based
+// on their usage.
+template <typename T> static constexpr uint32_t InsnBitWidth = 0;
+
 } // namespace llvm::MCD
 
 #endif // LLVM_MC_MCDECODER_H
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index dc9dd220130ea..56dc6f9e8e6e9 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -6,7 +6,8 @@ tablegen(LLVM AMDGPUGenAsmMatcher.inc -gen-asm-matcher)
 tablegen(LLVM AMDGPUGenAsmWriter.inc -gen-asm-writer)
 tablegen(LLVM AMDGPUGenCallingConv.inc -gen-callingconv)
 tablegen(LLVM AMDGPUGenDAGISel.inc -gen-dag-isel)
-tablegen(LLVM AMDGPUGenDisassemblerTables.inc -gen-disassembler)
+tablegen(LLVM AMDGPUGenDisassemblerTables.inc -gen-disassembler
+              --specialize-decoders-per-bitwidth)
 tablegen(LLVM AMDGPUGenInstrInfo.inc -gen-instr-info)
 tablegen(LLVM AMDGPUGenMCCodeEmitter.inc -gen-emitter)
 tablegen(LLVM AMDGPUGenMCPseudoLowering.inc -gen-pseudo-lowering)
diff --git a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
index 6a2beeed41dfd..80d194afa926b 100644
--- a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
+++ b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
@@ -38,6 +38,7 @@
 #include "llvm/Support/Compiler.h"
 
 using namespace llvm;
+using namespace llvm::MCD;
 
 #define DEBUG_TYPE "amdgpu-disassembler"
 
@@ -446,6 +447,14 @@ static DecodeStatus decodeVersionImm(MCInst &Inst, unsigned Imm,
 
 #include "AMDGPUGenDisassemblerTables.inc"
 
+// Define bitwidths for various types used to instantiate the decoder.
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint32_t> = 32;
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint64_t> = 64;
+template <>
+static constexpr uint32_t llvm::MCD::InsnBitWidth<std::bitset<96>> = 96;
+template <>
+static constexpr uint32_t llvm::MCD::InsnBitWidth<std::bitset<128>> = 128;
+
 //===----------------------------------------------------------------------===//
 //
 //===----------------------------------------------------------------------===//
@@ -498,26 +507,24 @@ template <typename T> static inline T eatBytes(ArrayRef<uint8_t>& Bytes) {
   return Res;
 }
 
-static inline DecoderUInt128 eat12Bytes(ArrayRef<uint8_t> &Bytes) {
+static inline std::bitset<96> eat12Bytes(ArrayRef<uint8_t> &Bytes) {
+  using namespace llvm::support::endian;
   assert(Bytes.size() >= 12);
-  uint64_t Lo =
-      support::endian::read<uint64_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<96> Lo(read<uint64_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(8);
-  uint64_t Hi =
-      support::endian::read<uint32_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<96> Hi(read<uint32_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(4);
-  return DecoderUInt128(Lo, Hi);
+  return (Hi << 64) | Lo;
 }
 
-static inline DecoderUInt128 eat16Bytes(ArrayRef<uint8_t> &Bytes) {
+static inline std::bitset<128> eat16Bytes(ArrayRef<uint8_t> &Bytes) {
+  using namespace llvm::support::endian;
   assert(Bytes.size() >= 16);
-  uint64_t Lo =
-      support::endian::read<uint64_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<128> Lo(read<uint64_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(8);
-  uint64_t Hi =
-      support::endian::read<uint64_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<128> Hi(read<uint64_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(8);
-  return DecoderUInt128(Lo, Hi);
+  return (Hi << 64) | Lo;
 }
 
 void AMDGPUDisassembler::decodeImmOperands(MCInst &MI,
@@ -600,14 +607,14 @@ DecodeStatus AMDGPUDisassembler::getInstruction(MCInst &MI, uint64_t &Size,
     // Try to decode DPP and SDWA first to solve conflict with VOP1 and VOP2
     // encodings
     if (isGFX1250() && Bytes.size() >= 16) {
-      DecoderUInt128 DecW = eat16Bytes(Bytes);
+      std::bitset<128> DecW = eat16Bytes(Bytes);
       if (tryDecodeInst(DecoderTableGFX1250128, MI, DecW, Address, CS))
         break;
       Bytes = Bytes_.slice(0, MaxInstBytesNum);
     }
 
     if (isGFX11Plus() && Bytes.size() >= 12) {
-      DecoderUInt128 DecW = eat12Bytes(Bytes);
+      std::bitset<96> DecW = eat12Bytes(Bytes);
 
       if (isGFX11() &&
           tryDecodeInst(DecoderTableGFX1196, DecoderTableGFX11_FAKE1696, MI,
@@ -642,7 +649,7 @@ DecodeStatus AMDGPUDisassembler::getInstruction(MCInst &MI, uint64_t &Size,
 
     } else if (Bytes.size() >= 16 &&
                STI.hasFeature(AMDGPU::FeatureGFX950Insts)) {
-      DecoderUInt128 DecW = eat16Bytes(Bytes);
+      std::bitset<128> DecW = eat16Bytes(Bytes);
       if (tryDecodeInst(DecoderTableGFX940128, MI, DecW, Address, CS))
         break;
 
diff --git a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h
index f4d164bf10c3c..ded447b6f8d5a 100644
--- a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h
+++ b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h
@@ -32,44 +32,6 @@ class MCOperand;
 class MCSubtargetInfo;
 class Twine;
 
-// Exposes an interface expected by autogenerated code in
-// FixedLenDecoderEmitter
-class DecoderUInt128 {
-private:
-  uint64_t Lo = 0;
-  uint64_t Hi = 0;
-
-public:
-  DecoderUInt128() = default;
-  DecoderUInt128(uint64_t Lo, uint64_t Hi = 0) : Lo(Lo), Hi(Hi) {}
-  operator bool() const { return Lo || Hi; }
-  uint64_t extractBitsAsZExtValue(unsigned NumBits,
-                                  unsigned BitPosition) const {
-    assert(NumBits && NumBits <= 64);
-    assert(BitPosition < 128);
-    uint64_t Val;
-    if (BitPosition < 64)
-      Val = Lo >> BitPosition | Hi << 1 << (63 - BitPosition);
-    else
-      Val = Hi >> (BitPosition - 64);
-    return Val & ((uint64_t(2) << (NumBits - 1)) - 1);
-  }
-  DecoderUInt128 operator&(const DecoderUInt128 &RHS) const {
-    return DecoderUInt128(Lo & RHS.Lo, Hi & RHS.Hi);
-  }
-  DecoderUInt128 operator&(const uint64_t &RHS) const {
-    return *this & DecoderUInt128(RHS);
-  }
-  DecoderUInt128 operator~() const { return DecoderUInt128(~Lo, ~Hi); }
-  bool operator==(const DecoderUInt128 &RHS) {
-    return Lo == RHS.Lo && Hi == RHS.Hi;
-  }
-  bool operator!=(const DecoderUInt128 &RHS) {
-    return Lo != RHS.Lo || Hi != RHS.Hi;
-  }
-  bool operator!=(const int &RHS) { return *this != DecoderUInt128(RHS); }
-};
-
 //===----------------------------------------------------------------------===//
 // AMDGPUDisassembler
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/DirectX/CMakeLists.txt b/llvm/lib/Target/DirectX/CMakeLists.txt
index 8100f941c8d94..6c079517e22d6 100644
--- a/llvm/lib/Target/DirectX/CMakeLists.txt
+++ b/llvm/lib/Target/DirectX/CMakeLists.txt
@@ -41,6 +41,7 @@ add_llvm_target(DirectXCodeGen
   LINK_COMPONENTS
   Analysis
   AsmPrinter
+  BinaryFormat
   CodeGen
   CodeGenTypes
   Core
diff --git a/llvm/lib/Target/RISCV/CMakeLists.txt b/llvm/lib/Target/RISCV/CMakeLists.txt
index 47329b2c2f4d2..531238ae85029 100644
--- a/llvm/lib/Target/RISCV/CMakeLists.txt
+++ b/llvm/lib/Target/RISCV/CMakeLists.txt
@@ -7,7 +7,8 @@ tablegen(LLVM RISCVGenAsmWriter.inc -gen-asm-writer)
 tablegen(LLVM RISCVGenCompressInstEmitter.inc -gen-compress-inst-emitter)
 tablegen(LLVM RISCVGenMacroFusion.inc -gen-macro-fusion-pred)
 tablegen(LLVM RISCVGenDAGISel.inc -gen-dag-isel)
-tablegen(LLVM RISCVGenDisassemblerTables.inc -gen-disassembler)
+tablegen(LLVM RISCVGenDisassemblerTables.inc -gen-disassembler
+              --specialize-decoders-per-bitwidth)
 tablegen(LLVM RISCVGenInstrInfo.inc -gen-instr-info)
 tablegen(LLVM RISCVGenMCCodeEmitter.inc -gen-emitter)
 tablegen(LLVM RISCVGenMCPseudoLowering.inc -gen-pseudo-lowering)
diff --git a/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp b/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp
index dbb16fce8390a..4becd3bbb25f2 100644
--- a/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp
+++ b/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp
@@ -558,7 +558,7 @@ static DecodeStatus decodeXqccmpRlistS0(MCInst &Inst, uint32_t Imm,
   return decodeZcmpRlist(Inst, Imm, Address, Decoder);
 }
 
-static DecodeStatus decodeCSSPushPopchk(MCInst &Inst, uint32_t Insn,
+static DecodeStatus decodeCSSPushPopchk(MCInst &Inst, uint16_t Insn,
                                         uint64_t Address,
                                         const MCDisassembler *Decoder) {
   uint32_t Rs1 = fieldFromInstruction(Insn, 7, 5);
@@ -700,6 +700,12 @@ static constexpr DecoderListEntry DecoderList32[]{
     {DecoderTableZdinxRV32Only32, {}, "RV32-only Zdinx (Double in Integer)"},
 };
 
+// Define bitwidths for various types used to instantiate the decoder.
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint16_t> = 16;
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint32_t> = 32;
+// Use uint64_t to represent 48 bit instructions.
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint64_t> = 48;
+
 DecodeStatus RISCVDisassembler::getInstruction32(MCInst &MI, uint64_t &Size,
                                                  ArrayRef<uint8_t> Bytes,
                                                  uint64_t Address,
@@ -710,9 +716,7 @@ DecodeStatus RISCVDisassembler::getInstruction32(MCInst &MI, uint64_t &Size,
   }
   Size = 4;
 
-  // Use uint64_t to match getInstruction48. decodeInstruction is templated
-  // on the Insn type.
-  uint64_t Insn = support::endian::read32le(Bytes.data());
+  uint32_t Insn = support::endian::read32le(Bytes.data());
 
   for (const DecoderListEntry &Entry : DecoderList32) {
     if (!Entry.haveContainedFeatures(STI.getFeatureBits()))
@@ -758,9 +762,7 @@ DecodeStatus RISCVDisassembler::getInstruction16(MCInst &MI, uint64_t &Size,
   }
   Size = 2;
 
-  // Use uint64_t to match getInstruction48. decodeInstruction is templated
-  // on the Insn type.
-  uint64_t Insn = support::endian::read16le(Bytes.data());
+  uint16_t Insn = support::endian::read16le(Bytes.data());
 
   for (const DecoderListEntry &Entry : DecoderList16) {
     if (!Entry.haveContainedFeatures(STI.getFeatureBits()))
diff --git a/llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td b/llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td
new file mode 100644
index 0000000000000..6fd9676939bde
--- /dev/null
+++ b/llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td
@@ -0,0 +1,169 @@
+// RUN: llvm-tblgen -gen-disassembler -I %p/../../include %s | FileCheck %s --check-prefix=CHECK-DEFAULT
+// RUN: llvm-tblgen -gen-disassembler -specialize-decoders-per-bitwidth -I %p/../../include %s | FileCheck %s --check-prefix=CHECK-SPECIALIZE-NO-TABLE
+// RUN: llvm-tblgen -gen-disassembler -specialize-decoders-per-bitwidth -use-fn-table-in-decode-to-mcinst -I %p/../../include %s | FileCheck %s --check-prefix=CHECK-SPECIALIZE-TABLE
+
+
+include "llvm/Target/Target.td"
+
+def archInstrInfo : InstrInfo { }
+
+def arch : Target {
+  let InstructionSet = archInstrInfo;
+}
+
+let Namespace = "arch" in {
+  def R0 : Register<"r0">;
+  def R1 : Register<"r1">;
+  def R2 : Register<"r2">;
+  def R3 : Register<"r3">;
+}
+def Regs : RegisterClass<"Regs", [i32], 32, (add R0, R1, R2, R3)>;
+
+// Bit 0 of the encoding determines the size (8 or 16 bits).
+// Bits {3..1} define the number define the number of operands encoded.
+class Instruction8Bit<int NumOps> : Instruction {
+  let Size = 1;
+  let OutOperandList = (outs);
+  field bits<8> Inst;
+  let Inst{0} = 0;
+  let Inst{3-1} = NumOps;
+}
+
+class Instruction16Bit<int NumOps> : Instruction {
+  let Size = 2;
+  let OutOperandList = (outs);
+  field bits<16> Inst;
+  let Inst{0} = 1;
+  let Inst{3-1} = NumOps;
+}
+
+// Define instructions to generate 4 cases in decodeToMCInst.
+// Each register operand needs 2 bits to encode.
+
+// An instruction with no inputs.
+def Inst0 : Instruction8Bit<0> {
+  let Inst{7-4} = 0;
+  let InOperandList = (ins);
+  let AsmString = "Inst0";
+}
+
+// An instruction with a single input.
+def Inst1 : Instruction8Bit<1> {
+  bits<2> r0;
+  let Inst{5-4} = r0;
+  let Inst{7-6} = 0;
+  let InOperandList = (ins Regs:$r0);
+  let AsmString = "Inst1";
+}
+
+// An instruction with two inputs. 
+def Inst2 : Instruction16Bit<2> {
+  bits<2> r0;
+  bits<2> r1;
+  let Inst{5-4} = r0;
+  let Inst{7-6} = r1;
+  let Inst{15-8} = 0;
+  let InOperandList = (ins Regs:$r0, Regs:$r1);
+  let AsmString = "Inst2";
+}
+
+// An instruction with three inputs. .
+def Inst3 : Instruction16Bit<3> {
+  bits<2> r0;
+  bits<2> r1;
+  bits<2> r2;
+  let Inst{5-4} = r0;
+  let Inst{7-6} = r1;
+  let Inst{9-8} = r2;
+  let Inst{15-10} = 0;
+  let InOperandList = (ins Regs:$r0, Regs:$r1, Regs:$r2);
+  let AsmString = "Inst3";
+}
+
+// -----------------------------------------------------------------------------
+// In the default case, we emit a single decodeToMCinst function and DecodeIdx
+// is shared across all bitwidths.
+
+// CHECK-DEFAULT-LABEL: DecoderTable8[25]
+// CHECK-DEFAULT:         DecodeIdx: 0
+// CHECK-DEFAULT:         DecodeIdx: 1
+// CHECK-DEFAULT:       };
+
+// CHECK-DEFAULT-LABEL: DecoderTable16[25]
+// CHECK-DEFAULT:         DecodeIdx: 2
+// CHECK-DEFAULT:         DecodeIdx: 3
+// CHECK-DEFAULT:       };
+
+// CHECK-DEFAULT-LABEL: template <typename InsnType>
+// CHECK-DEFAULT-NEXT:  static DecodeStatus decodeToMCInst
+// CHECK-DEFAULT:         case 0
+// CHECK-DEFAULT:         case 1
+// CHECK-DEFAULT:         case 2
+// CHECK-DEFAULT:         case 3
+
+// -----------------------------------------------------------------------------
+// When we specialize per bitwidth, we emit 2 decodeToMCInst functions and
+// DecodeIdx is assigned per bit width.
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   DecoderTable8[25]
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 0
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 1
+// CHECK-SPECIALIZE-NO-TABLE:         };
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   template <typename InsnType>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    decodeToMCInst
+// CHECK-SPECIALIZE-NO-TABLE:           case 0
+// CHECK-SPECIALIZE-NO-TABLE:           case 1
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   DecoderTable16[25]
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 0
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 1
+// CHECK-SPECIALIZE-NO-TABLE:         };
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   template <typename InsnType>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    decodeToMCInst
+// CHECK-SPECIALIZE-NO-TABLE:           case 0
+// CHECK-SPECIALIZE-NO-TABLE:           case 1
+
+// -----------------------------------------------------------------------------
+// Per bitwidth specialization with function table.
+// CHECK-SPECIALIZE-TABLE-LABEL:    DecoderTable8[25]
+// CHECK-SPECIALIZE-TABLE:            DecodeIdx: 0
+// CHECK-SPECIALIZE-TABLE:            DecodeIdx: 1
+// CHECK-SPECIALIZE-TABLE:          };
+
+// 8 bit functions and function table.
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_8bit_0
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_8bit_1
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeToMCInst
+// CHECK-SPECIALIZE-TABLE-LABEL:      static constexpr DecodeFnTy decodeFnTable[] = {
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_8bit_0,
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_8bit_1,
+// CHECK-SPECIALIZE-TABLE-NEXT:     };
+
+// 16 bit functions and function table.
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_16bit_0
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_16bit_1
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeToMCInst
+// CHECK-SPECIALIZE-TABLE-LABEL:      static constexpr DecodeFnTy decodeFnTable[] = {
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_16bit_0,
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_16bit_1,
+// CHECK-SPECIALIZE-TABLE-NEXT:     };
diff --git a/llvm/test/TableGen/DecoderEmitterFnTable.td b/llvm/test/TableGen/DecoderEmitterFnTable.td
index 7bed18c19a513..8929e6da716e6 100644
--- a/llvm/test/TableGen/DecoderEmitterFnTable.td
+++ b/llvm/test/TableGen/DecoderEmitterFnTable.td
@@ -71,14 +71,14 @@ def Inst3 : TestInstruction {
   let AsmString = "Inst3";
 }
 
-// CHECK-LABEL: DecodeStatus decodeFn0(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
-// CHECK-LABEL: DecodeStatus decodeFn1(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
-// CHECK-LABEL: DecodeStatus decodeFn2(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
-// CHECK-LABEL: DecodeStatus decodeFn3(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_0(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_1(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_2(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_3(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
 // CHECK-LABEL: decodeToMCInst(unsigned Idx, DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
 // CHECK: static constexpr DecodeFnTy decodeFnTable[]
-// CHECK-NEXT: decodeFn0,
-// CHECK-NEX...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Aug 27, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Rahul Joshi (jurahul)

Changes

This change adds an option to specialize decoders per bitwidth, which can help reduce the (compiled) code size of the decoder code.

Current state:
Currently, the code generated by the decoder emitter consists of two key functions: decodeInstruction which is the entry point into the generated code and decodeToMCInst which is invoked when a decode op is reached while traversing through the decoder table. Both functions are templated on InsnType which is the raw instruction bits that are supplied to decodeInstruction.

Several backends call decodeInstruction with different InsnType types, leading to several template instantiations of these functions in the final code. As an example, AMDGPU instantiates this function with type DecoderUInt128 type for decoding 96/128-bit instructions, uint64_t for decoding 64-bit instructions, and uint32_t for decoding 32-bit instructions. Since there is just one decodeToMCInst in the generated code, it has code that handles decoding for all instruction sizes. However, the decoders emitted for different instructions sizes rarely have any intersection with each other. That means, in the AMDGPU case, the instantiation with InsnType == DecoderUInt128 has decoder code for 32/64-bit instructions that is never exercised. Conversely, the instantiation with InsnType == uint64_t has decoder code for 128/96/32-bit instructions that is never exercised. This leads to unnecessary dead code in the generated disassembler binary.

New state:
With this change, we introduce an option specialize-decoders-per-bitwidth. Under this mode, the DecoderEmitter will generate several versions of decodeToMCInst function, one for each bitwidth. The code is still templated, but will require backends to specify, for each InstType used, the bitwidth of the instruction that the type is used to represent using a type-trait InsnBitWidth. This will enable the templated code to choose the right variant of decodeToMCInst. Under this mode, a particular instantiation will only end up instantiating a single variant of decodeToMCInst generated and that will include only those decoders that are applicable to a single bitwidth, resulting in elimination of the code duplication through instantiation and a reduction in code size.

Additionally, under this mode, decoders are uniqued only within a given bitwidth (as opposed to across all bitwidths without this option), so the decoder index values assigned are smaller, and consume less bytes in their ULEB128 encoding. As a result, the generated decoder tables can also reduce in size.

Adopt this feature for the AMDGPU and RISCV backend. In a release build, this results in a net 55% reduction in the .text size of libLLVMAMDGPUDisassembler.so and a 4.5% reduction in the .rodata size. For RISCV, which today uses a single uint64_t type, this results in a 3.7% increase in code size (expected as instantiate the code 3 times now).

Actual measured sizes are as follows:

Baseline commit: 72c04bb882ad70230bce309c3013d9cc2c99e9a7
Configuration: Ubuntu clang version 18.1.3, release build with asserts disabled.
 
AMDGPU        Before       After      Change
======================================================
.text         612327       275607     55% reduction
.rodata       369728       351336      5% reduction          

RISCV:
======================================================
.text          47407       49187      3.75% increase   
.rodata        35768       35839      0.1% increase

Patch is 35.73 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154865.diff

14 Files Affected:

  • (modified) llvm/include/llvm/MC/MCDecoder.h (+17)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+2-1)
  • (modified) llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp (+22-15)
  • (modified) llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h (-38)
  • (modified) llvm/lib/Target/DirectX/CMakeLists.txt (+1)
  • (modified) llvm/lib/Target/RISCV/CMakeLists.txt (+2-1)
  • (modified) llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp (+9-7)
  • (added) llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td (+169)
  • (modified) llvm/test/TableGen/DecoderEmitterFnTable.td (+8-8)
  • (modified) llvm/test/TableGen/HwModeEncodeDecode3.td (+7-7)
  • (modified) llvm/test/TableGen/VarLenDecoder.td (+6-5)
  • (modified) llvm/utils/TableGen/DecoderEmitter.cpp (+103-39)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/AMDGPU/Disassembler/BUILD.gn (+6-1)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/RISCV/Disassembler/BUILD.gn (+4-1)
diff --git a/llvm/include/llvm/MC/MCDecoder.h b/llvm/include/llvm/MC/MCDecoder.h
index 70762a4a5ebae..459c8a6a5ea34 100644
--- a/llvm/include/llvm/MC/MCDecoder.h
+++ b/llvm/include/llvm/MC/MCDecoder.h
@@ -12,6 +12,7 @@
 
 #include "llvm/MC/MCDisassembler/MCDisassembler.h"
 #include "llvm/Support/MathExtras.h"
+#include <bitset>
 #include <cassert>
 
 namespace llvm::MCD {
@@ -48,6 +49,15 @@ fieldFromInstruction(const InsnType &Insn, unsigned StartBit,
   return Insn.extractBitsAsZExtValue(NumBits, StartBit);
 }
 
+template <size_t N>
+uint64_t fieldFromInstruction(const std::bitset<N> &Insn, unsigned StartBit,
+                              unsigned NumBits) {
+  assert(StartBit + NumBits <= N && "Instruction field out of bounds!");
+  assert(NumBits <= 64 && "Cannot support >64-bit extractions!");
+  const std::bitset<N> Mask(maskTrailingOnes<uint64_t>(NumBits));
+  return ((Insn >> StartBit) & Mask).to_ullong();
+}
+
 // Helper function for inserting bits extracted from an encoded instruction into
 // an integer-typed field.
 template <typename IntType>
@@ -62,6 +72,13 @@ insertBits(IntType &field, IntType bits, unsigned startBit, unsigned numBits) {
   field |= bits << startBit;
 }
 
+// InsnBitWidth is essentially a type trait used by the decoder emitter to query
+// the supported bitwidth for a given type. But default, the value is 0, making
+// it an invalid type for use as `InsnType` when instantiating the decoder.
+// Individual targets are expected to provide specializations for these based
+// on their usage.
+template <typename T> static constexpr uint32_t InsnBitWidth = 0;
+
 } // namespace llvm::MCD
 
 #endif // LLVM_MC_MCDECODER_H
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index dc9dd220130ea..56dc6f9e8e6e9 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -6,7 +6,8 @@ tablegen(LLVM AMDGPUGenAsmMatcher.inc -gen-asm-matcher)
 tablegen(LLVM AMDGPUGenAsmWriter.inc -gen-asm-writer)
 tablegen(LLVM AMDGPUGenCallingConv.inc -gen-callingconv)
 tablegen(LLVM AMDGPUGenDAGISel.inc -gen-dag-isel)
-tablegen(LLVM AMDGPUGenDisassemblerTables.inc -gen-disassembler)
+tablegen(LLVM AMDGPUGenDisassemblerTables.inc -gen-disassembler
+              --specialize-decoders-per-bitwidth)
 tablegen(LLVM AMDGPUGenInstrInfo.inc -gen-instr-info)
 tablegen(LLVM AMDGPUGenMCCodeEmitter.inc -gen-emitter)
 tablegen(LLVM AMDGPUGenMCPseudoLowering.inc -gen-pseudo-lowering)
diff --git a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
index 6a2beeed41dfd..80d194afa926b 100644
--- a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
+++ b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
@@ -38,6 +38,7 @@
 #include "llvm/Support/Compiler.h"
 
 using namespace llvm;
+using namespace llvm::MCD;
 
 #define DEBUG_TYPE "amdgpu-disassembler"
 
@@ -446,6 +447,14 @@ static DecodeStatus decodeVersionImm(MCInst &Inst, unsigned Imm,
 
 #include "AMDGPUGenDisassemblerTables.inc"
 
+// Define bitwidths for various types used to instantiate the decoder.
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint32_t> = 32;
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint64_t> = 64;
+template <>
+static constexpr uint32_t llvm::MCD::InsnBitWidth<std::bitset<96>> = 96;
+template <>
+static constexpr uint32_t llvm::MCD::InsnBitWidth<std::bitset<128>> = 128;
+
 //===----------------------------------------------------------------------===//
 //
 //===----------------------------------------------------------------------===//
@@ -498,26 +507,24 @@ template <typename T> static inline T eatBytes(ArrayRef<uint8_t>& Bytes) {
   return Res;
 }
 
-static inline DecoderUInt128 eat12Bytes(ArrayRef<uint8_t> &Bytes) {
+static inline std::bitset<96> eat12Bytes(ArrayRef<uint8_t> &Bytes) {
+  using namespace llvm::support::endian;
   assert(Bytes.size() >= 12);
-  uint64_t Lo =
-      support::endian::read<uint64_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<96> Lo(read<uint64_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(8);
-  uint64_t Hi =
-      support::endian::read<uint32_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<96> Hi(read<uint32_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(4);
-  return DecoderUInt128(Lo, Hi);
+  return (Hi << 64) | Lo;
 }
 
-static inline DecoderUInt128 eat16Bytes(ArrayRef<uint8_t> &Bytes) {
+static inline std::bitset<128> eat16Bytes(ArrayRef<uint8_t> &Bytes) {
+  using namespace llvm::support::endian;
   assert(Bytes.size() >= 16);
-  uint64_t Lo =
-      support::endian::read<uint64_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<128> Lo(read<uint64_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(8);
-  uint64_t Hi =
-      support::endian::read<uint64_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<128> Hi(read<uint64_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(8);
-  return DecoderUInt128(Lo, Hi);
+  return (Hi << 64) | Lo;
 }
 
 void AMDGPUDisassembler::decodeImmOperands(MCInst &MI,
@@ -600,14 +607,14 @@ DecodeStatus AMDGPUDisassembler::getInstruction(MCInst &MI, uint64_t &Size,
     // Try to decode DPP and SDWA first to solve conflict with VOP1 and VOP2
     // encodings
     if (isGFX1250() && Bytes.size() >= 16) {
-      DecoderUInt128 DecW = eat16Bytes(Bytes);
+      std::bitset<128> DecW = eat16Bytes(Bytes);
       if (tryDecodeInst(DecoderTableGFX1250128, MI, DecW, Address, CS))
         break;
       Bytes = Bytes_.slice(0, MaxInstBytesNum);
     }
 
     if (isGFX11Plus() && Bytes.size() >= 12) {
-      DecoderUInt128 DecW = eat12Bytes(Bytes);
+      std::bitset<96> DecW = eat12Bytes(Bytes);
 
       if (isGFX11() &&
           tryDecodeInst(DecoderTableGFX1196, DecoderTableGFX11_FAKE1696, MI,
@@ -642,7 +649,7 @@ DecodeStatus AMDGPUDisassembler::getInstruction(MCInst &MI, uint64_t &Size,
 
     } else if (Bytes.size() >= 16 &&
                STI.hasFeature(AMDGPU::FeatureGFX950Insts)) {
-      DecoderUInt128 DecW = eat16Bytes(Bytes);
+      std::bitset<128> DecW = eat16Bytes(Bytes);
       if (tryDecodeInst(DecoderTableGFX940128, MI, DecW, Address, CS))
         break;
 
diff --git a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h
index f4d164bf10c3c..ded447b6f8d5a 100644
--- a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h
+++ b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h
@@ -32,44 +32,6 @@ class MCOperand;
 class MCSubtargetInfo;
 class Twine;
 
-// Exposes an interface expected by autogenerated code in
-// FixedLenDecoderEmitter
-class DecoderUInt128 {
-private:
-  uint64_t Lo = 0;
-  uint64_t Hi = 0;
-
-public:
-  DecoderUInt128() = default;
-  DecoderUInt128(uint64_t Lo, uint64_t Hi = 0) : Lo(Lo), Hi(Hi) {}
-  operator bool() const { return Lo || Hi; }
-  uint64_t extractBitsAsZExtValue(unsigned NumBits,
-                                  unsigned BitPosition) const {
-    assert(NumBits && NumBits <= 64);
-    assert(BitPosition < 128);
-    uint64_t Val;
-    if (BitPosition < 64)
-      Val = Lo >> BitPosition | Hi << 1 << (63 - BitPosition);
-    else
-      Val = Hi >> (BitPosition - 64);
-    return Val & ((uint64_t(2) << (NumBits - 1)) - 1);
-  }
-  DecoderUInt128 operator&(const DecoderUInt128 &RHS) const {
-    return DecoderUInt128(Lo & RHS.Lo, Hi & RHS.Hi);
-  }
-  DecoderUInt128 operator&(const uint64_t &RHS) const {
-    return *this & DecoderUInt128(RHS);
-  }
-  DecoderUInt128 operator~() const { return DecoderUInt128(~Lo, ~Hi); }
-  bool operator==(const DecoderUInt128 &RHS) {
-    return Lo == RHS.Lo && Hi == RHS.Hi;
-  }
-  bool operator!=(const DecoderUInt128 &RHS) {
-    return Lo != RHS.Lo || Hi != RHS.Hi;
-  }
-  bool operator!=(const int &RHS) { return *this != DecoderUInt128(RHS); }
-};
-
 //===----------------------------------------------------------------------===//
 // AMDGPUDisassembler
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/DirectX/CMakeLists.txt b/llvm/lib/Target/DirectX/CMakeLists.txt
index 8100f941c8d94..6c079517e22d6 100644
--- a/llvm/lib/Target/DirectX/CMakeLists.txt
+++ b/llvm/lib/Target/DirectX/CMakeLists.txt
@@ -41,6 +41,7 @@ add_llvm_target(DirectXCodeGen
   LINK_COMPONENTS
   Analysis
   AsmPrinter
+  BinaryFormat
   CodeGen
   CodeGenTypes
   Core
diff --git a/llvm/lib/Target/RISCV/CMakeLists.txt b/llvm/lib/Target/RISCV/CMakeLists.txt
index 47329b2c2f4d2..531238ae85029 100644
--- a/llvm/lib/Target/RISCV/CMakeLists.txt
+++ b/llvm/lib/Target/RISCV/CMakeLists.txt
@@ -7,7 +7,8 @@ tablegen(LLVM RISCVGenAsmWriter.inc -gen-asm-writer)
 tablegen(LLVM RISCVGenCompressInstEmitter.inc -gen-compress-inst-emitter)
 tablegen(LLVM RISCVGenMacroFusion.inc -gen-macro-fusion-pred)
 tablegen(LLVM RISCVGenDAGISel.inc -gen-dag-isel)
-tablegen(LLVM RISCVGenDisassemblerTables.inc -gen-disassembler)
+tablegen(LLVM RISCVGenDisassemblerTables.inc -gen-disassembler
+              --specialize-decoders-per-bitwidth)
 tablegen(LLVM RISCVGenInstrInfo.inc -gen-instr-info)
 tablegen(LLVM RISCVGenMCCodeEmitter.inc -gen-emitter)
 tablegen(LLVM RISCVGenMCPseudoLowering.inc -gen-pseudo-lowering)
diff --git a/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp b/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp
index dbb16fce8390a..4becd3bbb25f2 100644
--- a/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp
+++ b/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp
@@ -558,7 +558,7 @@ static DecodeStatus decodeXqccmpRlistS0(MCInst &Inst, uint32_t Imm,
   return decodeZcmpRlist(Inst, Imm, Address, Decoder);
 }
 
-static DecodeStatus decodeCSSPushPopchk(MCInst &Inst, uint32_t Insn,
+static DecodeStatus decodeCSSPushPopchk(MCInst &Inst, uint16_t Insn,
                                         uint64_t Address,
                                         const MCDisassembler *Decoder) {
   uint32_t Rs1 = fieldFromInstruction(Insn, 7, 5);
@@ -700,6 +700,12 @@ static constexpr DecoderListEntry DecoderList32[]{
     {DecoderTableZdinxRV32Only32, {}, "RV32-only Zdinx (Double in Integer)"},
 };
 
+// Define bitwidths for various types used to instantiate the decoder.
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint16_t> = 16;
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint32_t> = 32;
+// Use uint64_t to represent 48 bit instructions.
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint64_t> = 48;
+
 DecodeStatus RISCVDisassembler::getInstruction32(MCInst &MI, uint64_t &Size,
                                                  ArrayRef<uint8_t> Bytes,
                                                  uint64_t Address,
@@ -710,9 +716,7 @@ DecodeStatus RISCVDisassembler::getInstruction32(MCInst &MI, uint64_t &Size,
   }
   Size = 4;
 
-  // Use uint64_t to match getInstruction48. decodeInstruction is templated
-  // on the Insn type.
-  uint64_t Insn = support::endian::read32le(Bytes.data());
+  uint32_t Insn = support::endian::read32le(Bytes.data());
 
   for (const DecoderListEntry &Entry : DecoderList32) {
     if (!Entry.haveContainedFeatures(STI.getFeatureBits()))
@@ -758,9 +762,7 @@ DecodeStatus RISCVDisassembler::getInstruction16(MCInst &MI, uint64_t &Size,
   }
   Size = 2;
 
-  // Use uint64_t to match getInstruction48. decodeInstruction is templated
-  // on the Insn type.
-  uint64_t Insn = support::endian::read16le(Bytes.data());
+  uint16_t Insn = support::endian::read16le(Bytes.data());
 
   for (const DecoderListEntry &Entry : DecoderList16) {
     if (!Entry.haveContainedFeatures(STI.getFeatureBits()))
diff --git a/llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td b/llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td
new file mode 100644
index 0000000000000..6fd9676939bde
--- /dev/null
+++ b/llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td
@@ -0,0 +1,169 @@
+// RUN: llvm-tblgen -gen-disassembler -I %p/../../include %s | FileCheck %s --check-prefix=CHECK-DEFAULT
+// RUN: llvm-tblgen -gen-disassembler -specialize-decoders-per-bitwidth -I %p/../../include %s | FileCheck %s --check-prefix=CHECK-SPECIALIZE-NO-TABLE
+// RUN: llvm-tblgen -gen-disassembler -specialize-decoders-per-bitwidth -use-fn-table-in-decode-to-mcinst -I %p/../../include %s | FileCheck %s --check-prefix=CHECK-SPECIALIZE-TABLE
+
+
+include "llvm/Target/Target.td"
+
+def archInstrInfo : InstrInfo { }
+
+def arch : Target {
+  let InstructionSet = archInstrInfo;
+}
+
+let Namespace = "arch" in {
+  def R0 : Register<"r0">;
+  def R1 : Register<"r1">;
+  def R2 : Register<"r2">;
+  def R3 : Register<"r3">;
+}
+def Regs : RegisterClass<"Regs", [i32], 32, (add R0, R1, R2, R3)>;
+
+// Bit 0 of the encoding determines the size (8 or 16 bits).
+// Bits {3..1} define the number define the number of operands encoded.
+class Instruction8Bit<int NumOps> : Instruction {
+  let Size = 1;
+  let OutOperandList = (outs);
+  field bits<8> Inst;
+  let Inst{0} = 0;
+  let Inst{3-1} = NumOps;
+}
+
+class Instruction16Bit<int NumOps> : Instruction {
+  let Size = 2;
+  let OutOperandList = (outs);
+  field bits<16> Inst;
+  let Inst{0} = 1;
+  let Inst{3-1} = NumOps;
+}
+
+// Define instructions to generate 4 cases in decodeToMCInst.
+// Each register operand needs 2 bits to encode.
+
+// An instruction with no inputs.
+def Inst0 : Instruction8Bit<0> {
+  let Inst{7-4} = 0;
+  let InOperandList = (ins);
+  let AsmString = "Inst0";
+}
+
+// An instruction with a single input.
+def Inst1 : Instruction8Bit<1> {
+  bits<2> r0;
+  let Inst{5-4} = r0;
+  let Inst{7-6} = 0;
+  let InOperandList = (ins Regs:$r0);
+  let AsmString = "Inst1";
+}
+
+// An instruction with two inputs. 
+def Inst2 : Instruction16Bit<2> {
+  bits<2> r0;
+  bits<2> r1;
+  let Inst{5-4} = r0;
+  let Inst{7-6} = r1;
+  let Inst{15-8} = 0;
+  let InOperandList = (ins Regs:$r0, Regs:$r1);
+  let AsmString = "Inst2";
+}
+
+// An instruction with three inputs. .
+def Inst3 : Instruction16Bit<3> {
+  bits<2> r0;
+  bits<2> r1;
+  bits<2> r2;
+  let Inst{5-4} = r0;
+  let Inst{7-6} = r1;
+  let Inst{9-8} = r2;
+  let Inst{15-10} = 0;
+  let InOperandList = (ins Regs:$r0, Regs:$r1, Regs:$r2);
+  let AsmString = "Inst3";
+}
+
+// -----------------------------------------------------------------------------
+// In the default case, we emit a single decodeToMCinst function and DecodeIdx
+// is shared across all bitwidths.
+
+// CHECK-DEFAULT-LABEL: DecoderTable8[25]
+// CHECK-DEFAULT:         DecodeIdx: 0
+// CHECK-DEFAULT:         DecodeIdx: 1
+// CHECK-DEFAULT:       };
+
+// CHECK-DEFAULT-LABEL: DecoderTable16[25]
+// CHECK-DEFAULT:         DecodeIdx: 2
+// CHECK-DEFAULT:         DecodeIdx: 3
+// CHECK-DEFAULT:       };
+
+// CHECK-DEFAULT-LABEL: template <typename InsnType>
+// CHECK-DEFAULT-NEXT:  static DecodeStatus decodeToMCInst
+// CHECK-DEFAULT:         case 0
+// CHECK-DEFAULT:         case 1
+// CHECK-DEFAULT:         case 2
+// CHECK-DEFAULT:         case 3
+
+// -----------------------------------------------------------------------------
+// When we specialize per bitwidth, we emit 2 decodeToMCInst functions and
+// DecodeIdx is assigned per bit width.
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   DecoderTable8[25]
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 0
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 1
+// CHECK-SPECIALIZE-NO-TABLE:         };
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   template <typename InsnType>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    decodeToMCInst
+// CHECK-SPECIALIZE-NO-TABLE:           case 0
+// CHECK-SPECIALIZE-NO-TABLE:           case 1
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   DecoderTable16[25]
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 0
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 1
+// CHECK-SPECIALIZE-NO-TABLE:         };
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   template <typename InsnType>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    decodeToMCInst
+// CHECK-SPECIALIZE-NO-TABLE:           case 0
+// CHECK-SPECIALIZE-NO-TABLE:           case 1
+
+// -----------------------------------------------------------------------------
+// Per bitwidth specialization with function table.
+// CHECK-SPECIALIZE-TABLE-LABEL:    DecoderTable8[25]
+// CHECK-SPECIALIZE-TABLE:            DecodeIdx: 0
+// CHECK-SPECIALIZE-TABLE:            DecodeIdx: 1
+// CHECK-SPECIALIZE-TABLE:          };
+
+// 8 bit functions and function table.
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_8bit_0
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_8bit_1
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeToMCInst
+// CHECK-SPECIALIZE-TABLE-LABEL:      static constexpr DecodeFnTy decodeFnTable[] = {
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_8bit_0,
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_8bit_1,
+// CHECK-SPECIALIZE-TABLE-NEXT:     };
+
+// 16 bit functions and function table.
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_16bit_0
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_16bit_1
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeToMCInst
+// CHECK-SPECIALIZE-TABLE-LABEL:      static constexpr DecodeFnTy decodeFnTable[] = {
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_16bit_0,
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_16bit_1,
+// CHECK-SPECIALIZE-TABLE-NEXT:     };
diff --git a/llvm/test/TableGen/DecoderEmitterFnTable.td b/llvm/test/TableGen/DecoderEmitterFnTable.td
index 7bed18c19a513..8929e6da716e6 100644
--- a/llvm/test/TableGen/DecoderEmitterFnTable.td
+++ b/llvm/test/TableGen/DecoderEmitterFnTable.td
@@ -71,14 +71,14 @@ def Inst3 : TestInstruction {
   let AsmString = "Inst3";
 }
 
-// CHECK-LABEL: DecodeStatus decodeFn0(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
-// CHECK-LABEL: DecodeStatus decodeFn1(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
-// CHECK-LABEL: DecodeStatus decodeFn2(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
-// CHECK-LABEL: DecodeStatus decodeFn3(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_0(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_1(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_2(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_3(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
 // CHECK-LABEL: decodeToMCInst(unsigned Idx, DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
 // CHECK: static constexpr DecodeFnTy decodeFnTable[]
-// CHECK-NEXT: decodeFn0,
-// CHECK-NEX...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Aug 27, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Rahul Joshi (jurahul)

Changes

This change adds an option to specialize decoders per bitwidth, which can help reduce the (compiled) code size of the decoder code.

Current state:
Currently, the code generated by the decoder emitter consists of two key functions: decodeInstruction which is the entry point into the generated code and decodeToMCInst which is invoked when a decode op is reached while traversing through the decoder table. Both functions are templated on InsnType which is the raw instruction bits that are supplied to decodeInstruction.

Several backends call decodeInstruction with different InsnType types, leading to several template instantiations of these functions in the final code. As an example, AMDGPU instantiates this function with type DecoderUInt128 type for decoding 96/128-bit instructions, uint64_t for decoding 64-bit instructions, and uint32_t for decoding 32-bit instructions. Since there is just one decodeToMCInst in the generated code, it has code that handles decoding for all instruction sizes. However, the decoders emitted for different instructions sizes rarely have any intersection with each other. That means, in the AMDGPU case, the instantiation with InsnType == DecoderUInt128 has decoder code for 32/64-bit instructions that is never exercised. Conversely, the instantiation with InsnType == uint64_t has decoder code for 128/96/32-bit instructions that is never exercised. This leads to unnecessary dead code in the generated disassembler binary.

New state:
With this change, we introduce an option specialize-decoders-per-bitwidth. Under this mode, the DecoderEmitter will generate several versions of decodeToMCInst function, one for each bitwidth. The code is still templated, but will require backends to specify, for each InstType used, the bitwidth of the instruction that the type is used to represent using a type-trait InsnBitWidth. This will enable the templated code to choose the right variant of decodeToMCInst. Under this mode, a particular instantiation will only end up instantiating a single variant of decodeToMCInst generated and that will include only those decoders that are applicable to a single bitwidth, resulting in elimination of the code duplication through instantiation and a reduction in code size.

Additionally, under this mode, decoders are uniqued only within a given bitwidth (as opposed to across all bitwidths without this option), so the decoder index values assigned are smaller, and consume less bytes in their ULEB128 encoding. As a result, the generated decoder tables can also reduce in size.

Adopt this feature for the AMDGPU and RISCV backend. In a release build, this results in a net 55% reduction in the .text size of libLLVMAMDGPUDisassembler.so and a 4.5% reduction in the .rodata size. For RISCV, which today uses a single uint64_t type, this results in a 3.7% increase in code size (expected as instantiate the code 3 times now).

Actual measured sizes are as follows:

Baseline commit: 72c04bb882ad70230bce309c3013d9cc2c99e9a7
Configuration: Ubuntu clang version 18.1.3, release build with asserts disabled.
 
AMDGPU        Before       After      Change
======================================================
.text         612327       275607     55% reduction
.rodata       369728       351336      5% reduction          

RISCV:
======================================================
.text          47407       49187      3.75% increase   
.rodata        35768       35839      0.1% increase

Patch is 35.73 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154865.diff

14 Files Affected:

  • (modified) llvm/include/llvm/MC/MCDecoder.h (+17)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+2-1)
  • (modified) llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp (+22-15)
  • (modified) llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h (-38)
  • (modified) llvm/lib/Target/DirectX/CMakeLists.txt (+1)
  • (modified) llvm/lib/Target/RISCV/CMakeLists.txt (+2-1)
  • (modified) llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp (+9-7)
  • (added) llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td (+169)
  • (modified) llvm/test/TableGen/DecoderEmitterFnTable.td (+8-8)
  • (modified) llvm/test/TableGen/HwModeEncodeDecode3.td (+7-7)
  • (modified) llvm/test/TableGen/VarLenDecoder.td (+6-5)
  • (modified) llvm/utils/TableGen/DecoderEmitter.cpp (+103-39)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/AMDGPU/Disassembler/BUILD.gn (+6-1)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/RISCV/Disassembler/BUILD.gn (+4-1)
diff --git a/llvm/include/llvm/MC/MCDecoder.h b/llvm/include/llvm/MC/MCDecoder.h
index 70762a4a5ebae..459c8a6a5ea34 100644
--- a/llvm/include/llvm/MC/MCDecoder.h
+++ b/llvm/include/llvm/MC/MCDecoder.h
@@ -12,6 +12,7 @@
 
 #include "llvm/MC/MCDisassembler/MCDisassembler.h"
 #include "llvm/Support/MathExtras.h"
+#include <bitset>
 #include <cassert>
 
 namespace llvm::MCD {
@@ -48,6 +49,15 @@ fieldFromInstruction(const InsnType &Insn, unsigned StartBit,
   return Insn.extractBitsAsZExtValue(NumBits, StartBit);
 }
 
+template <size_t N>
+uint64_t fieldFromInstruction(const std::bitset<N> &Insn, unsigned StartBit,
+                              unsigned NumBits) {
+  assert(StartBit + NumBits <= N && "Instruction field out of bounds!");
+  assert(NumBits <= 64 && "Cannot support >64-bit extractions!");
+  const std::bitset<N> Mask(maskTrailingOnes<uint64_t>(NumBits));
+  return ((Insn >> StartBit) & Mask).to_ullong();
+}
+
 // Helper function for inserting bits extracted from an encoded instruction into
 // an integer-typed field.
 template <typename IntType>
@@ -62,6 +72,13 @@ insertBits(IntType &field, IntType bits, unsigned startBit, unsigned numBits) {
   field |= bits << startBit;
 }
 
+// InsnBitWidth is essentially a type trait used by the decoder emitter to query
+// the supported bitwidth for a given type. But default, the value is 0, making
+// it an invalid type for use as `InsnType` when instantiating the decoder.
+// Individual targets are expected to provide specializations for these based
+// on their usage.
+template <typename T> static constexpr uint32_t InsnBitWidth = 0;
+
 } // namespace llvm::MCD
 
 #endif // LLVM_MC_MCDECODER_H
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index dc9dd220130ea..56dc6f9e8e6e9 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -6,7 +6,8 @@ tablegen(LLVM AMDGPUGenAsmMatcher.inc -gen-asm-matcher)
 tablegen(LLVM AMDGPUGenAsmWriter.inc -gen-asm-writer)
 tablegen(LLVM AMDGPUGenCallingConv.inc -gen-callingconv)
 tablegen(LLVM AMDGPUGenDAGISel.inc -gen-dag-isel)
-tablegen(LLVM AMDGPUGenDisassemblerTables.inc -gen-disassembler)
+tablegen(LLVM AMDGPUGenDisassemblerTables.inc -gen-disassembler
+              --specialize-decoders-per-bitwidth)
 tablegen(LLVM AMDGPUGenInstrInfo.inc -gen-instr-info)
 tablegen(LLVM AMDGPUGenMCCodeEmitter.inc -gen-emitter)
 tablegen(LLVM AMDGPUGenMCPseudoLowering.inc -gen-pseudo-lowering)
diff --git a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
index 6a2beeed41dfd..80d194afa926b 100644
--- a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
+++ b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
@@ -38,6 +38,7 @@
 #include "llvm/Support/Compiler.h"
 
 using namespace llvm;
+using namespace llvm::MCD;
 
 #define DEBUG_TYPE "amdgpu-disassembler"
 
@@ -446,6 +447,14 @@ static DecodeStatus decodeVersionImm(MCInst &Inst, unsigned Imm,
 
 #include "AMDGPUGenDisassemblerTables.inc"
 
+// Define bitwidths for various types used to instantiate the decoder.
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint32_t> = 32;
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint64_t> = 64;
+template <>
+static constexpr uint32_t llvm::MCD::InsnBitWidth<std::bitset<96>> = 96;
+template <>
+static constexpr uint32_t llvm::MCD::InsnBitWidth<std::bitset<128>> = 128;
+
 //===----------------------------------------------------------------------===//
 //
 //===----------------------------------------------------------------------===//
@@ -498,26 +507,24 @@ template <typename T> static inline T eatBytes(ArrayRef<uint8_t>& Bytes) {
   return Res;
 }
 
-static inline DecoderUInt128 eat12Bytes(ArrayRef<uint8_t> &Bytes) {
+static inline std::bitset<96> eat12Bytes(ArrayRef<uint8_t> &Bytes) {
+  using namespace llvm::support::endian;
   assert(Bytes.size() >= 12);
-  uint64_t Lo =
-      support::endian::read<uint64_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<96> Lo(read<uint64_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(8);
-  uint64_t Hi =
-      support::endian::read<uint32_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<96> Hi(read<uint32_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(4);
-  return DecoderUInt128(Lo, Hi);
+  return (Hi << 64) | Lo;
 }
 
-static inline DecoderUInt128 eat16Bytes(ArrayRef<uint8_t> &Bytes) {
+static inline std::bitset<128> eat16Bytes(ArrayRef<uint8_t> &Bytes) {
+  using namespace llvm::support::endian;
   assert(Bytes.size() >= 16);
-  uint64_t Lo =
-      support::endian::read<uint64_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<128> Lo(read<uint64_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(8);
-  uint64_t Hi =
-      support::endian::read<uint64_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<128> Hi(read<uint64_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(8);
-  return DecoderUInt128(Lo, Hi);
+  return (Hi << 64) | Lo;
 }
 
 void AMDGPUDisassembler::decodeImmOperands(MCInst &MI,
@@ -600,14 +607,14 @@ DecodeStatus AMDGPUDisassembler::getInstruction(MCInst &MI, uint64_t &Size,
     // Try to decode DPP and SDWA first to solve conflict with VOP1 and VOP2
     // encodings
     if (isGFX1250() && Bytes.size() >= 16) {
-      DecoderUInt128 DecW = eat16Bytes(Bytes);
+      std::bitset<128> DecW = eat16Bytes(Bytes);
       if (tryDecodeInst(DecoderTableGFX1250128, MI, DecW, Address, CS))
         break;
       Bytes = Bytes_.slice(0, MaxInstBytesNum);
     }
 
     if (isGFX11Plus() && Bytes.size() >= 12) {
-      DecoderUInt128 DecW = eat12Bytes(Bytes);
+      std::bitset<96> DecW = eat12Bytes(Bytes);
 
       if (isGFX11() &&
           tryDecodeInst(DecoderTableGFX1196, DecoderTableGFX11_FAKE1696, MI,
@@ -642,7 +649,7 @@ DecodeStatus AMDGPUDisassembler::getInstruction(MCInst &MI, uint64_t &Size,
 
     } else if (Bytes.size() >= 16 &&
                STI.hasFeature(AMDGPU::FeatureGFX950Insts)) {
-      DecoderUInt128 DecW = eat16Bytes(Bytes);
+      std::bitset<128> DecW = eat16Bytes(Bytes);
       if (tryDecodeInst(DecoderTableGFX940128, MI, DecW, Address, CS))
         break;
 
diff --git a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h
index f4d164bf10c3c..ded447b6f8d5a 100644
--- a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h
+++ b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h
@@ -32,44 +32,6 @@ class MCOperand;
 class MCSubtargetInfo;
 class Twine;
 
-// Exposes an interface expected by autogenerated code in
-// FixedLenDecoderEmitter
-class DecoderUInt128 {
-private:
-  uint64_t Lo = 0;
-  uint64_t Hi = 0;
-
-public:
-  DecoderUInt128() = default;
-  DecoderUInt128(uint64_t Lo, uint64_t Hi = 0) : Lo(Lo), Hi(Hi) {}
-  operator bool() const { return Lo || Hi; }
-  uint64_t extractBitsAsZExtValue(unsigned NumBits,
-                                  unsigned BitPosition) const {
-    assert(NumBits && NumBits <= 64);
-    assert(BitPosition < 128);
-    uint64_t Val;
-    if (BitPosition < 64)
-      Val = Lo >> BitPosition | Hi << 1 << (63 - BitPosition);
-    else
-      Val = Hi >> (BitPosition - 64);
-    return Val & ((uint64_t(2) << (NumBits - 1)) - 1);
-  }
-  DecoderUInt128 operator&(const DecoderUInt128 &RHS) const {
-    return DecoderUInt128(Lo & RHS.Lo, Hi & RHS.Hi);
-  }
-  DecoderUInt128 operator&(const uint64_t &RHS) const {
-    return *this & DecoderUInt128(RHS);
-  }
-  DecoderUInt128 operator~() const { return DecoderUInt128(~Lo, ~Hi); }
-  bool operator==(const DecoderUInt128 &RHS) {
-    return Lo == RHS.Lo && Hi == RHS.Hi;
-  }
-  bool operator!=(const DecoderUInt128 &RHS) {
-    return Lo != RHS.Lo || Hi != RHS.Hi;
-  }
-  bool operator!=(const int &RHS) { return *this != DecoderUInt128(RHS); }
-};
-
 //===----------------------------------------------------------------------===//
 // AMDGPUDisassembler
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/DirectX/CMakeLists.txt b/llvm/lib/Target/DirectX/CMakeLists.txt
index 8100f941c8d94..6c079517e22d6 100644
--- a/llvm/lib/Target/DirectX/CMakeLists.txt
+++ b/llvm/lib/Target/DirectX/CMakeLists.txt
@@ -41,6 +41,7 @@ add_llvm_target(DirectXCodeGen
   LINK_COMPONENTS
   Analysis
   AsmPrinter
+  BinaryFormat
   CodeGen
   CodeGenTypes
   Core
diff --git a/llvm/lib/Target/RISCV/CMakeLists.txt b/llvm/lib/Target/RISCV/CMakeLists.txt
index 47329b2c2f4d2..531238ae85029 100644
--- a/llvm/lib/Target/RISCV/CMakeLists.txt
+++ b/llvm/lib/Target/RISCV/CMakeLists.txt
@@ -7,7 +7,8 @@ tablegen(LLVM RISCVGenAsmWriter.inc -gen-asm-writer)
 tablegen(LLVM RISCVGenCompressInstEmitter.inc -gen-compress-inst-emitter)
 tablegen(LLVM RISCVGenMacroFusion.inc -gen-macro-fusion-pred)
 tablegen(LLVM RISCVGenDAGISel.inc -gen-dag-isel)
-tablegen(LLVM RISCVGenDisassemblerTables.inc -gen-disassembler)
+tablegen(LLVM RISCVGenDisassemblerTables.inc -gen-disassembler
+              --specialize-decoders-per-bitwidth)
 tablegen(LLVM RISCVGenInstrInfo.inc -gen-instr-info)
 tablegen(LLVM RISCVGenMCCodeEmitter.inc -gen-emitter)
 tablegen(LLVM RISCVGenMCPseudoLowering.inc -gen-pseudo-lowering)
diff --git a/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp b/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp
index dbb16fce8390a..4becd3bbb25f2 100644
--- a/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp
+++ b/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp
@@ -558,7 +558,7 @@ static DecodeStatus decodeXqccmpRlistS0(MCInst &Inst, uint32_t Imm,
   return decodeZcmpRlist(Inst, Imm, Address, Decoder);
 }
 
-static DecodeStatus decodeCSSPushPopchk(MCInst &Inst, uint32_t Insn,
+static DecodeStatus decodeCSSPushPopchk(MCInst &Inst, uint16_t Insn,
                                         uint64_t Address,
                                         const MCDisassembler *Decoder) {
   uint32_t Rs1 = fieldFromInstruction(Insn, 7, 5);
@@ -700,6 +700,12 @@ static constexpr DecoderListEntry DecoderList32[]{
     {DecoderTableZdinxRV32Only32, {}, "RV32-only Zdinx (Double in Integer)"},
 };
 
+// Define bitwidths for various types used to instantiate the decoder.
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint16_t> = 16;
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint32_t> = 32;
+// Use uint64_t to represent 48 bit instructions.
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint64_t> = 48;
+
 DecodeStatus RISCVDisassembler::getInstruction32(MCInst &MI, uint64_t &Size,
                                                  ArrayRef<uint8_t> Bytes,
                                                  uint64_t Address,
@@ -710,9 +716,7 @@ DecodeStatus RISCVDisassembler::getInstruction32(MCInst &MI, uint64_t &Size,
   }
   Size = 4;
 
-  // Use uint64_t to match getInstruction48. decodeInstruction is templated
-  // on the Insn type.
-  uint64_t Insn = support::endian::read32le(Bytes.data());
+  uint32_t Insn = support::endian::read32le(Bytes.data());
 
   for (const DecoderListEntry &Entry : DecoderList32) {
     if (!Entry.haveContainedFeatures(STI.getFeatureBits()))
@@ -758,9 +762,7 @@ DecodeStatus RISCVDisassembler::getInstruction16(MCInst &MI, uint64_t &Size,
   }
   Size = 2;
 
-  // Use uint64_t to match getInstruction48. decodeInstruction is templated
-  // on the Insn type.
-  uint64_t Insn = support::endian::read16le(Bytes.data());
+  uint16_t Insn = support::endian::read16le(Bytes.data());
 
   for (const DecoderListEntry &Entry : DecoderList16) {
     if (!Entry.haveContainedFeatures(STI.getFeatureBits()))
diff --git a/llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td b/llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td
new file mode 100644
index 0000000000000..6fd9676939bde
--- /dev/null
+++ b/llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td
@@ -0,0 +1,169 @@
+// RUN: llvm-tblgen -gen-disassembler -I %p/../../include %s | FileCheck %s --check-prefix=CHECK-DEFAULT
+// RUN: llvm-tblgen -gen-disassembler -specialize-decoders-per-bitwidth -I %p/../../include %s | FileCheck %s --check-prefix=CHECK-SPECIALIZE-NO-TABLE
+// RUN: llvm-tblgen -gen-disassembler -specialize-decoders-per-bitwidth -use-fn-table-in-decode-to-mcinst -I %p/../../include %s | FileCheck %s --check-prefix=CHECK-SPECIALIZE-TABLE
+
+
+include "llvm/Target/Target.td"
+
+def archInstrInfo : InstrInfo { }
+
+def arch : Target {
+  let InstructionSet = archInstrInfo;
+}
+
+let Namespace = "arch" in {
+  def R0 : Register<"r0">;
+  def R1 : Register<"r1">;
+  def R2 : Register<"r2">;
+  def R3 : Register<"r3">;
+}
+def Regs : RegisterClass<"Regs", [i32], 32, (add R0, R1, R2, R3)>;
+
+// Bit 0 of the encoding determines the size (8 or 16 bits).
+// Bits {3..1} define the number define the number of operands encoded.
+class Instruction8Bit<int NumOps> : Instruction {
+  let Size = 1;
+  let OutOperandList = (outs);
+  field bits<8> Inst;
+  let Inst{0} = 0;
+  let Inst{3-1} = NumOps;
+}
+
+class Instruction16Bit<int NumOps> : Instruction {
+  let Size = 2;
+  let OutOperandList = (outs);
+  field bits<16> Inst;
+  let Inst{0} = 1;
+  let Inst{3-1} = NumOps;
+}
+
+// Define instructions to generate 4 cases in decodeToMCInst.
+// Each register operand needs 2 bits to encode.
+
+// An instruction with no inputs.
+def Inst0 : Instruction8Bit<0> {
+  let Inst{7-4} = 0;
+  let InOperandList = (ins);
+  let AsmString = "Inst0";
+}
+
+// An instruction with a single input.
+def Inst1 : Instruction8Bit<1> {
+  bits<2> r0;
+  let Inst{5-4} = r0;
+  let Inst{7-6} = 0;
+  let InOperandList = (ins Regs:$r0);
+  let AsmString = "Inst1";
+}
+
+// An instruction with two inputs. 
+def Inst2 : Instruction16Bit<2> {
+  bits<2> r0;
+  bits<2> r1;
+  let Inst{5-4} = r0;
+  let Inst{7-6} = r1;
+  let Inst{15-8} = 0;
+  let InOperandList = (ins Regs:$r0, Regs:$r1);
+  let AsmString = "Inst2";
+}
+
+// An instruction with three inputs. .
+def Inst3 : Instruction16Bit<3> {
+  bits<2> r0;
+  bits<2> r1;
+  bits<2> r2;
+  let Inst{5-4} = r0;
+  let Inst{7-6} = r1;
+  let Inst{9-8} = r2;
+  let Inst{15-10} = 0;
+  let InOperandList = (ins Regs:$r0, Regs:$r1, Regs:$r2);
+  let AsmString = "Inst3";
+}
+
+// -----------------------------------------------------------------------------
+// In the default case, we emit a single decodeToMCinst function and DecodeIdx
+// is shared across all bitwidths.
+
+// CHECK-DEFAULT-LABEL: DecoderTable8[25]
+// CHECK-DEFAULT:         DecodeIdx: 0
+// CHECK-DEFAULT:         DecodeIdx: 1
+// CHECK-DEFAULT:       };
+
+// CHECK-DEFAULT-LABEL: DecoderTable16[25]
+// CHECK-DEFAULT:         DecodeIdx: 2
+// CHECK-DEFAULT:         DecodeIdx: 3
+// CHECK-DEFAULT:       };
+
+// CHECK-DEFAULT-LABEL: template <typename InsnType>
+// CHECK-DEFAULT-NEXT:  static DecodeStatus decodeToMCInst
+// CHECK-DEFAULT:         case 0
+// CHECK-DEFAULT:         case 1
+// CHECK-DEFAULT:         case 2
+// CHECK-DEFAULT:         case 3
+
+// -----------------------------------------------------------------------------
+// When we specialize per bitwidth, we emit 2 decodeToMCInst functions and
+// DecodeIdx is assigned per bit width.
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   DecoderTable8[25]
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 0
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 1
+// CHECK-SPECIALIZE-NO-TABLE:         };
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   template <typename InsnType>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    decodeToMCInst
+// CHECK-SPECIALIZE-NO-TABLE:           case 0
+// CHECK-SPECIALIZE-NO-TABLE:           case 1
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   DecoderTable16[25]
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 0
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 1
+// CHECK-SPECIALIZE-NO-TABLE:         };
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   template <typename InsnType>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    decodeToMCInst
+// CHECK-SPECIALIZE-NO-TABLE:           case 0
+// CHECK-SPECIALIZE-NO-TABLE:           case 1
+
+// -----------------------------------------------------------------------------
+// Per bitwidth specialization with function table.
+// CHECK-SPECIALIZE-TABLE-LABEL:    DecoderTable8[25]
+// CHECK-SPECIALIZE-TABLE:            DecodeIdx: 0
+// CHECK-SPECIALIZE-TABLE:            DecodeIdx: 1
+// CHECK-SPECIALIZE-TABLE:          };
+
+// 8 bit functions and function table.
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_8bit_0
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_8bit_1
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeToMCInst
+// CHECK-SPECIALIZE-TABLE-LABEL:      static constexpr DecodeFnTy decodeFnTable[] = {
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_8bit_0,
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_8bit_1,
+// CHECK-SPECIALIZE-TABLE-NEXT:     };
+
+// 16 bit functions and function table.
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_16bit_0
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_16bit_1
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeToMCInst
+// CHECK-SPECIALIZE-TABLE-LABEL:      static constexpr DecodeFnTy decodeFnTable[] = {
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_16bit_0,
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_16bit_1,
+// CHECK-SPECIALIZE-TABLE-NEXT:     };
diff --git a/llvm/test/TableGen/DecoderEmitterFnTable.td b/llvm/test/TableGen/DecoderEmitterFnTable.td
index 7bed18c19a513..8929e6da716e6 100644
--- a/llvm/test/TableGen/DecoderEmitterFnTable.td
+++ b/llvm/test/TableGen/DecoderEmitterFnTable.td
@@ -71,14 +71,14 @@ def Inst3 : TestInstruction {
   let AsmString = "Inst3";
 }
 
-// CHECK-LABEL: DecodeStatus decodeFn0(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
-// CHECK-LABEL: DecodeStatus decodeFn1(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
-// CHECK-LABEL: DecodeStatus decodeFn2(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
-// CHECK-LABEL: DecodeStatus decodeFn3(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_0(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_1(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_2(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_3(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
 // CHECK-LABEL: decodeToMCInst(unsigned Idx, DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
 // CHECK: static constexpr DecodeFnTy decodeFnTable[]
-// CHECK-NEXT: decodeFn0,
-// CHECK-NEX...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Aug 27, 2025

@llvm/pr-subscribers-backend-directx

Author: Rahul Joshi (jurahul)

Changes

This change adds an option to specialize decoders per bitwidth, which can help reduce the (compiled) code size of the decoder code.

Current state:
Currently, the code generated by the decoder emitter consists of two key functions: decodeInstruction which is the entry point into the generated code and decodeToMCInst which is invoked when a decode op is reached while traversing through the decoder table. Both functions are templated on InsnType which is the raw instruction bits that are supplied to decodeInstruction.

Several backends call decodeInstruction with different InsnType types, leading to several template instantiations of these functions in the final code. As an example, AMDGPU instantiates this function with type DecoderUInt128 type for decoding 96/128-bit instructions, uint64_t for decoding 64-bit instructions, and uint32_t for decoding 32-bit instructions. Since there is just one decodeToMCInst in the generated code, it has code that handles decoding for all instruction sizes. However, the decoders emitted for different instructions sizes rarely have any intersection with each other. That means, in the AMDGPU case, the instantiation with InsnType == DecoderUInt128 has decoder code for 32/64-bit instructions that is never exercised. Conversely, the instantiation with InsnType == uint64_t has decoder code for 128/96/32-bit instructions that is never exercised. This leads to unnecessary dead code in the generated disassembler binary.

New state:
With this change, we introduce an option specialize-decoders-per-bitwidth. Under this mode, the DecoderEmitter will generate several versions of decodeToMCInst function, one for each bitwidth. The code is still templated, but will require backends to specify, for each InstType used, the bitwidth of the instruction that the type is used to represent using a type-trait InsnBitWidth. This will enable the templated code to choose the right variant of decodeToMCInst. Under this mode, a particular instantiation will only end up instantiating a single variant of decodeToMCInst generated and that will include only those decoders that are applicable to a single bitwidth, resulting in elimination of the code duplication through instantiation and a reduction in code size.

Additionally, under this mode, decoders are uniqued only within a given bitwidth (as opposed to across all bitwidths without this option), so the decoder index values assigned are smaller, and consume less bytes in their ULEB128 encoding. As a result, the generated decoder tables can also reduce in size.

Adopt this feature for the AMDGPU and RISCV backend. In a release build, this results in a net 55% reduction in the .text size of libLLVMAMDGPUDisassembler.so and a 4.5% reduction in the .rodata size. For RISCV, which today uses a single uint64_t type, this results in a 3.7% increase in code size (expected as instantiate the code 3 times now).

Actual measured sizes are as follows:

Baseline commit: 72c04bb882ad70230bce309c3013d9cc2c99e9a7
Configuration: Ubuntu clang version 18.1.3, release build with asserts disabled.
 
AMDGPU        Before       After      Change
======================================================
.text         612327       275607     55% reduction
.rodata       369728       351336      5% reduction          

RISCV:
======================================================
.text          47407       49187      3.75% increase   
.rodata        35768       35839      0.1% increase

Patch is 35.73 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154865.diff

14 Files Affected:

  • (modified) llvm/include/llvm/MC/MCDecoder.h (+17)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+2-1)
  • (modified) llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp (+22-15)
  • (modified) llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h (-38)
  • (modified) llvm/lib/Target/DirectX/CMakeLists.txt (+1)
  • (modified) llvm/lib/Target/RISCV/CMakeLists.txt (+2-1)
  • (modified) llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp (+9-7)
  • (added) llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td (+169)
  • (modified) llvm/test/TableGen/DecoderEmitterFnTable.td (+8-8)
  • (modified) llvm/test/TableGen/HwModeEncodeDecode3.td (+7-7)
  • (modified) llvm/test/TableGen/VarLenDecoder.td (+6-5)
  • (modified) llvm/utils/TableGen/DecoderEmitter.cpp (+103-39)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/AMDGPU/Disassembler/BUILD.gn (+6-1)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/RISCV/Disassembler/BUILD.gn (+4-1)
diff --git a/llvm/include/llvm/MC/MCDecoder.h b/llvm/include/llvm/MC/MCDecoder.h
index 70762a4a5ebae..459c8a6a5ea34 100644
--- a/llvm/include/llvm/MC/MCDecoder.h
+++ b/llvm/include/llvm/MC/MCDecoder.h
@@ -12,6 +12,7 @@
 
 #include "llvm/MC/MCDisassembler/MCDisassembler.h"
 #include "llvm/Support/MathExtras.h"
+#include <bitset>
 #include <cassert>
 
 namespace llvm::MCD {
@@ -48,6 +49,15 @@ fieldFromInstruction(const InsnType &Insn, unsigned StartBit,
   return Insn.extractBitsAsZExtValue(NumBits, StartBit);
 }
 
+template <size_t N>
+uint64_t fieldFromInstruction(const std::bitset<N> &Insn, unsigned StartBit,
+                              unsigned NumBits) {
+  assert(StartBit + NumBits <= N && "Instruction field out of bounds!");
+  assert(NumBits <= 64 && "Cannot support >64-bit extractions!");
+  const std::bitset<N> Mask(maskTrailingOnes<uint64_t>(NumBits));
+  return ((Insn >> StartBit) & Mask).to_ullong();
+}
+
 // Helper function for inserting bits extracted from an encoded instruction into
 // an integer-typed field.
 template <typename IntType>
@@ -62,6 +72,13 @@ insertBits(IntType &field, IntType bits, unsigned startBit, unsigned numBits) {
   field |= bits << startBit;
 }
 
+// InsnBitWidth is essentially a type trait used by the decoder emitter to query
+// the supported bitwidth for a given type. But default, the value is 0, making
+// it an invalid type for use as `InsnType` when instantiating the decoder.
+// Individual targets are expected to provide specializations for these based
+// on their usage.
+template <typename T> static constexpr uint32_t InsnBitWidth = 0;
+
 } // namespace llvm::MCD
 
 #endif // LLVM_MC_MCDECODER_H
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index dc9dd220130ea..56dc6f9e8e6e9 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -6,7 +6,8 @@ tablegen(LLVM AMDGPUGenAsmMatcher.inc -gen-asm-matcher)
 tablegen(LLVM AMDGPUGenAsmWriter.inc -gen-asm-writer)
 tablegen(LLVM AMDGPUGenCallingConv.inc -gen-callingconv)
 tablegen(LLVM AMDGPUGenDAGISel.inc -gen-dag-isel)
-tablegen(LLVM AMDGPUGenDisassemblerTables.inc -gen-disassembler)
+tablegen(LLVM AMDGPUGenDisassemblerTables.inc -gen-disassembler
+              --specialize-decoders-per-bitwidth)
 tablegen(LLVM AMDGPUGenInstrInfo.inc -gen-instr-info)
 tablegen(LLVM AMDGPUGenMCCodeEmitter.inc -gen-emitter)
 tablegen(LLVM AMDGPUGenMCPseudoLowering.inc -gen-pseudo-lowering)
diff --git a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
index 6a2beeed41dfd..80d194afa926b 100644
--- a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
+++ b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
@@ -38,6 +38,7 @@
 #include "llvm/Support/Compiler.h"
 
 using namespace llvm;
+using namespace llvm::MCD;
 
 #define DEBUG_TYPE "amdgpu-disassembler"
 
@@ -446,6 +447,14 @@ static DecodeStatus decodeVersionImm(MCInst &Inst, unsigned Imm,
 
 #include "AMDGPUGenDisassemblerTables.inc"
 
+// Define bitwidths for various types used to instantiate the decoder.
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint32_t> = 32;
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint64_t> = 64;
+template <>
+static constexpr uint32_t llvm::MCD::InsnBitWidth<std::bitset<96>> = 96;
+template <>
+static constexpr uint32_t llvm::MCD::InsnBitWidth<std::bitset<128>> = 128;
+
 //===----------------------------------------------------------------------===//
 //
 //===----------------------------------------------------------------------===//
@@ -498,26 +507,24 @@ template <typename T> static inline T eatBytes(ArrayRef<uint8_t>& Bytes) {
   return Res;
 }
 
-static inline DecoderUInt128 eat12Bytes(ArrayRef<uint8_t> &Bytes) {
+static inline std::bitset<96> eat12Bytes(ArrayRef<uint8_t> &Bytes) {
+  using namespace llvm::support::endian;
   assert(Bytes.size() >= 12);
-  uint64_t Lo =
-      support::endian::read<uint64_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<96> Lo(read<uint64_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(8);
-  uint64_t Hi =
-      support::endian::read<uint32_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<96> Hi(read<uint32_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(4);
-  return DecoderUInt128(Lo, Hi);
+  return (Hi << 64) | Lo;
 }
 
-static inline DecoderUInt128 eat16Bytes(ArrayRef<uint8_t> &Bytes) {
+static inline std::bitset<128> eat16Bytes(ArrayRef<uint8_t> &Bytes) {
+  using namespace llvm::support::endian;
   assert(Bytes.size() >= 16);
-  uint64_t Lo =
-      support::endian::read<uint64_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<128> Lo(read<uint64_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(8);
-  uint64_t Hi =
-      support::endian::read<uint64_t, llvm::endianness::little>(Bytes.data());
+  std::bitset<128> Hi(read<uint64_t, endianness::little>(Bytes.data()));
   Bytes = Bytes.slice(8);
-  return DecoderUInt128(Lo, Hi);
+  return (Hi << 64) | Lo;
 }
 
 void AMDGPUDisassembler::decodeImmOperands(MCInst &MI,
@@ -600,14 +607,14 @@ DecodeStatus AMDGPUDisassembler::getInstruction(MCInst &MI, uint64_t &Size,
     // Try to decode DPP and SDWA first to solve conflict with VOP1 and VOP2
     // encodings
     if (isGFX1250() && Bytes.size() >= 16) {
-      DecoderUInt128 DecW = eat16Bytes(Bytes);
+      std::bitset<128> DecW = eat16Bytes(Bytes);
       if (tryDecodeInst(DecoderTableGFX1250128, MI, DecW, Address, CS))
         break;
       Bytes = Bytes_.slice(0, MaxInstBytesNum);
     }
 
     if (isGFX11Plus() && Bytes.size() >= 12) {
-      DecoderUInt128 DecW = eat12Bytes(Bytes);
+      std::bitset<96> DecW = eat12Bytes(Bytes);
 
       if (isGFX11() &&
           tryDecodeInst(DecoderTableGFX1196, DecoderTableGFX11_FAKE1696, MI,
@@ -642,7 +649,7 @@ DecodeStatus AMDGPUDisassembler::getInstruction(MCInst &MI, uint64_t &Size,
 
     } else if (Bytes.size() >= 16 &&
                STI.hasFeature(AMDGPU::FeatureGFX950Insts)) {
-      DecoderUInt128 DecW = eat16Bytes(Bytes);
+      std::bitset<128> DecW = eat16Bytes(Bytes);
       if (tryDecodeInst(DecoderTableGFX940128, MI, DecW, Address, CS))
         break;
 
diff --git a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h
index f4d164bf10c3c..ded447b6f8d5a 100644
--- a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h
+++ b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h
@@ -32,44 +32,6 @@ class MCOperand;
 class MCSubtargetInfo;
 class Twine;
 
-// Exposes an interface expected by autogenerated code in
-// FixedLenDecoderEmitter
-class DecoderUInt128 {
-private:
-  uint64_t Lo = 0;
-  uint64_t Hi = 0;
-
-public:
-  DecoderUInt128() = default;
-  DecoderUInt128(uint64_t Lo, uint64_t Hi = 0) : Lo(Lo), Hi(Hi) {}
-  operator bool() const { return Lo || Hi; }
-  uint64_t extractBitsAsZExtValue(unsigned NumBits,
-                                  unsigned BitPosition) const {
-    assert(NumBits && NumBits <= 64);
-    assert(BitPosition < 128);
-    uint64_t Val;
-    if (BitPosition < 64)
-      Val = Lo >> BitPosition | Hi << 1 << (63 - BitPosition);
-    else
-      Val = Hi >> (BitPosition - 64);
-    return Val & ((uint64_t(2) << (NumBits - 1)) - 1);
-  }
-  DecoderUInt128 operator&(const DecoderUInt128 &RHS) const {
-    return DecoderUInt128(Lo & RHS.Lo, Hi & RHS.Hi);
-  }
-  DecoderUInt128 operator&(const uint64_t &RHS) const {
-    return *this & DecoderUInt128(RHS);
-  }
-  DecoderUInt128 operator~() const { return DecoderUInt128(~Lo, ~Hi); }
-  bool operator==(const DecoderUInt128 &RHS) {
-    return Lo == RHS.Lo && Hi == RHS.Hi;
-  }
-  bool operator!=(const DecoderUInt128 &RHS) {
-    return Lo != RHS.Lo || Hi != RHS.Hi;
-  }
-  bool operator!=(const int &RHS) { return *this != DecoderUInt128(RHS); }
-};
-
 //===----------------------------------------------------------------------===//
 // AMDGPUDisassembler
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/DirectX/CMakeLists.txt b/llvm/lib/Target/DirectX/CMakeLists.txt
index 8100f941c8d94..6c079517e22d6 100644
--- a/llvm/lib/Target/DirectX/CMakeLists.txt
+++ b/llvm/lib/Target/DirectX/CMakeLists.txt
@@ -41,6 +41,7 @@ add_llvm_target(DirectXCodeGen
   LINK_COMPONENTS
   Analysis
   AsmPrinter
+  BinaryFormat
   CodeGen
   CodeGenTypes
   Core
diff --git a/llvm/lib/Target/RISCV/CMakeLists.txt b/llvm/lib/Target/RISCV/CMakeLists.txt
index 47329b2c2f4d2..531238ae85029 100644
--- a/llvm/lib/Target/RISCV/CMakeLists.txt
+++ b/llvm/lib/Target/RISCV/CMakeLists.txt
@@ -7,7 +7,8 @@ tablegen(LLVM RISCVGenAsmWriter.inc -gen-asm-writer)
 tablegen(LLVM RISCVGenCompressInstEmitter.inc -gen-compress-inst-emitter)
 tablegen(LLVM RISCVGenMacroFusion.inc -gen-macro-fusion-pred)
 tablegen(LLVM RISCVGenDAGISel.inc -gen-dag-isel)
-tablegen(LLVM RISCVGenDisassemblerTables.inc -gen-disassembler)
+tablegen(LLVM RISCVGenDisassemblerTables.inc -gen-disassembler
+              --specialize-decoders-per-bitwidth)
 tablegen(LLVM RISCVGenInstrInfo.inc -gen-instr-info)
 tablegen(LLVM RISCVGenMCCodeEmitter.inc -gen-emitter)
 tablegen(LLVM RISCVGenMCPseudoLowering.inc -gen-pseudo-lowering)
diff --git a/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp b/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp
index dbb16fce8390a..4becd3bbb25f2 100644
--- a/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp
+++ b/llvm/lib/Target/RISCV/Disassembler/RISCVDisassembler.cpp
@@ -558,7 +558,7 @@ static DecodeStatus decodeXqccmpRlistS0(MCInst &Inst, uint32_t Imm,
   return decodeZcmpRlist(Inst, Imm, Address, Decoder);
 }
 
-static DecodeStatus decodeCSSPushPopchk(MCInst &Inst, uint32_t Insn,
+static DecodeStatus decodeCSSPushPopchk(MCInst &Inst, uint16_t Insn,
                                         uint64_t Address,
                                         const MCDisassembler *Decoder) {
   uint32_t Rs1 = fieldFromInstruction(Insn, 7, 5);
@@ -700,6 +700,12 @@ static constexpr DecoderListEntry DecoderList32[]{
     {DecoderTableZdinxRV32Only32, {}, "RV32-only Zdinx (Double in Integer)"},
 };
 
+// Define bitwidths for various types used to instantiate the decoder.
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint16_t> = 16;
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint32_t> = 32;
+// Use uint64_t to represent 48 bit instructions.
+template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint64_t> = 48;
+
 DecodeStatus RISCVDisassembler::getInstruction32(MCInst &MI, uint64_t &Size,
                                                  ArrayRef<uint8_t> Bytes,
                                                  uint64_t Address,
@@ -710,9 +716,7 @@ DecodeStatus RISCVDisassembler::getInstruction32(MCInst &MI, uint64_t &Size,
   }
   Size = 4;
 
-  // Use uint64_t to match getInstruction48. decodeInstruction is templated
-  // on the Insn type.
-  uint64_t Insn = support::endian::read32le(Bytes.data());
+  uint32_t Insn = support::endian::read32le(Bytes.data());
 
   for (const DecoderListEntry &Entry : DecoderList32) {
     if (!Entry.haveContainedFeatures(STI.getFeatureBits()))
@@ -758,9 +762,7 @@ DecodeStatus RISCVDisassembler::getInstruction16(MCInst &MI, uint64_t &Size,
   }
   Size = 2;
 
-  // Use uint64_t to match getInstruction48. decodeInstruction is templated
-  // on the Insn type.
-  uint64_t Insn = support::endian::read16le(Bytes.data());
+  uint16_t Insn = support::endian::read16le(Bytes.data());
 
   for (const DecoderListEntry &Entry : DecoderList16) {
     if (!Entry.haveContainedFeatures(STI.getFeatureBits()))
diff --git a/llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td b/llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td
new file mode 100644
index 0000000000000..6fd9676939bde
--- /dev/null
+++ b/llvm/test/TableGen/DecoderEmitterBitwidthSpecialization.td
@@ -0,0 +1,169 @@
+// RUN: llvm-tblgen -gen-disassembler -I %p/../../include %s | FileCheck %s --check-prefix=CHECK-DEFAULT
+// RUN: llvm-tblgen -gen-disassembler -specialize-decoders-per-bitwidth -I %p/../../include %s | FileCheck %s --check-prefix=CHECK-SPECIALIZE-NO-TABLE
+// RUN: llvm-tblgen -gen-disassembler -specialize-decoders-per-bitwidth -use-fn-table-in-decode-to-mcinst -I %p/../../include %s | FileCheck %s --check-prefix=CHECK-SPECIALIZE-TABLE
+
+
+include "llvm/Target/Target.td"
+
+def archInstrInfo : InstrInfo { }
+
+def arch : Target {
+  let InstructionSet = archInstrInfo;
+}
+
+let Namespace = "arch" in {
+  def R0 : Register<"r0">;
+  def R1 : Register<"r1">;
+  def R2 : Register<"r2">;
+  def R3 : Register<"r3">;
+}
+def Regs : RegisterClass<"Regs", [i32], 32, (add R0, R1, R2, R3)>;
+
+// Bit 0 of the encoding determines the size (8 or 16 bits).
+// Bits {3..1} define the number define the number of operands encoded.
+class Instruction8Bit<int NumOps> : Instruction {
+  let Size = 1;
+  let OutOperandList = (outs);
+  field bits<8> Inst;
+  let Inst{0} = 0;
+  let Inst{3-1} = NumOps;
+}
+
+class Instruction16Bit<int NumOps> : Instruction {
+  let Size = 2;
+  let OutOperandList = (outs);
+  field bits<16> Inst;
+  let Inst{0} = 1;
+  let Inst{3-1} = NumOps;
+}
+
+// Define instructions to generate 4 cases in decodeToMCInst.
+// Each register operand needs 2 bits to encode.
+
+// An instruction with no inputs.
+def Inst0 : Instruction8Bit<0> {
+  let Inst{7-4} = 0;
+  let InOperandList = (ins);
+  let AsmString = "Inst0";
+}
+
+// An instruction with a single input.
+def Inst1 : Instruction8Bit<1> {
+  bits<2> r0;
+  let Inst{5-4} = r0;
+  let Inst{7-6} = 0;
+  let InOperandList = (ins Regs:$r0);
+  let AsmString = "Inst1";
+}
+
+// An instruction with two inputs. 
+def Inst2 : Instruction16Bit<2> {
+  bits<2> r0;
+  bits<2> r1;
+  let Inst{5-4} = r0;
+  let Inst{7-6} = r1;
+  let Inst{15-8} = 0;
+  let InOperandList = (ins Regs:$r0, Regs:$r1);
+  let AsmString = "Inst2";
+}
+
+// An instruction with three inputs. .
+def Inst3 : Instruction16Bit<3> {
+  bits<2> r0;
+  bits<2> r1;
+  bits<2> r2;
+  let Inst{5-4} = r0;
+  let Inst{7-6} = r1;
+  let Inst{9-8} = r2;
+  let Inst{15-10} = 0;
+  let InOperandList = (ins Regs:$r0, Regs:$r1, Regs:$r2);
+  let AsmString = "Inst3";
+}
+
+// -----------------------------------------------------------------------------
+// In the default case, we emit a single decodeToMCinst function and DecodeIdx
+// is shared across all bitwidths.
+
+// CHECK-DEFAULT-LABEL: DecoderTable8[25]
+// CHECK-DEFAULT:         DecodeIdx: 0
+// CHECK-DEFAULT:         DecodeIdx: 1
+// CHECK-DEFAULT:       };
+
+// CHECK-DEFAULT-LABEL: DecoderTable16[25]
+// CHECK-DEFAULT:         DecodeIdx: 2
+// CHECK-DEFAULT:         DecodeIdx: 3
+// CHECK-DEFAULT:       };
+
+// CHECK-DEFAULT-LABEL: template <typename InsnType>
+// CHECK-DEFAULT-NEXT:  static DecodeStatus decodeToMCInst
+// CHECK-DEFAULT:         case 0
+// CHECK-DEFAULT:         case 1
+// CHECK-DEFAULT:         case 2
+// CHECK-DEFAULT:         case 3
+
+// -----------------------------------------------------------------------------
+// When we specialize per bitwidth, we emit 2 decodeToMCInst functions and
+// DecodeIdx is assigned per bit width.
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   DecoderTable8[25]
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 0
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 1
+// CHECK-SPECIALIZE-NO-TABLE:         };
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   template <typename InsnType>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    decodeToMCInst
+// CHECK-SPECIALIZE-NO-TABLE:           case 0
+// CHECK-SPECIALIZE-NO-TABLE:           case 1
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   DecoderTable16[25]
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 0
+// CHECK-SPECIALIZE-NO-TABLE:           DecodeIdx: 1
+// CHECK-SPECIALIZE-NO-TABLE:         };
+
+// CHECK-SPECIALIZE-NO-TABLE-LABEL:   template <typename InsnType>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-NO-TABLE-NEXT:    decodeToMCInst
+// CHECK-SPECIALIZE-NO-TABLE:           case 0
+// CHECK-SPECIALIZE-NO-TABLE:           case 1
+
+// -----------------------------------------------------------------------------
+// Per bitwidth specialization with function table.
+// CHECK-SPECIALIZE-TABLE-LABEL:    DecoderTable8[25]
+// CHECK-SPECIALIZE-TABLE:            DecodeIdx: 0
+// CHECK-SPECIALIZE-TABLE:            DecodeIdx: 1
+// CHECK-SPECIALIZE-TABLE:          };
+
+// 8 bit functions and function table.
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_8bit_0
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_8bit_1
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 8, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeToMCInst
+// CHECK-SPECIALIZE-TABLE-LABEL:      static constexpr DecodeFnTy decodeFnTable[] = {
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_8bit_0,
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_8bit_1,
+// CHECK-SPECIALIZE-TABLE-NEXT:     };
+
+// 16 bit functions and function table.
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_16bit_0
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeFn_16bit_1
+
+// CHECK-SPECIALIZE-TABLE-LABEL:    template <typename InsnType>
+// CHECK-SPECIALIZE-TABLE-NEXT:     static std::enable_if_t<InsnBitWidth<InsnType> == 16, DecodeStatus>
+// CHECK-SPECIALIZE-TABLE-NEXT:     decodeToMCInst
+// CHECK-SPECIALIZE-TABLE-LABEL:      static constexpr DecodeFnTy decodeFnTable[] = {
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_16bit_0,
+// CHECK-SPECIALIZE-TABLE-NEXT:       decodeFn_16bit_1,
+// CHECK-SPECIALIZE-TABLE-NEXT:     };
diff --git a/llvm/test/TableGen/DecoderEmitterFnTable.td b/llvm/test/TableGen/DecoderEmitterFnTable.td
index 7bed18c19a513..8929e6da716e6 100644
--- a/llvm/test/TableGen/DecoderEmitterFnTable.td
+++ b/llvm/test/TableGen/DecoderEmitterFnTable.td
@@ -71,14 +71,14 @@ def Inst3 : TestInstruction {
   let AsmString = "Inst3";
 }
 
-// CHECK-LABEL: DecodeStatus decodeFn0(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
-// CHECK-LABEL: DecodeStatus decodeFn1(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
-// CHECK-LABEL: DecodeStatus decodeFn2(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
-// CHECK-LABEL: DecodeStatus decodeFn3(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_0(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_1(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_2(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
+// CHECK-LABEL: DecodeStatus decodeFn_3(DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
 // CHECK-LABEL: decodeToMCInst(unsigned Idx, DecodeStatus S, InsnType insn, MCInst &MI, uint64_t Address, const MCDisassembler *Decoder, bool &DecodeComplete)
 // CHECK: static constexpr DecodeFnTy decodeFnTable[]
-// CHECK-NEXT: decodeFn0,
-// CHECK-NEX...
[truncated]

@jurahul
Copy link
Contributor Author

jurahul commented Aug 27, 2025

@topperc can you please review as well?

@jurahul
Copy link
Contributor Author

jurahul commented Aug 29, 2025

Gentle ping @topperc, would like to get this in early next week (assuming review is ok).

template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint16_t> = 16;
template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint32_t> = 32;
// Use uint64_t to represent 48 bit instructions.
template <> static constexpr uint32_t llvm::MCD::InsnBitWidth<uint64_t> = 48;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need to add 64-bit RISC-V instructions in the future does this break? Will we need to use std::bitset<48> to make 48 unique?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this seems backwards. Shouldn't the decoder know what bit width it's for rather than trying to reinfer it from the type that ended up being used to represent that bit width?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@topperc yeah, if we have true 64-bit insts in future, then 48-bit insts need to use a different type. It will not break silently though, since InsnBitWidth specializations need to be defined at that time and if we have > 1 type with the same InsnBitWidth value, we will get compiler warnings about ambiguous templating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrtc27 The decoder know the number of bits, but not the actual C++ type to be used for that bitwidth (it generates templated code and backends are free to use whatever compatible types they want to use). InsnBitWidth is a way for the particular target to communicate this intent to the decoder generated code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A previous iteration of this was trying to enforce the actual type as well, but that turned out to be much larger change (see #146593).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So why not flip this as a template on uint32_t that gives the type?

Copy link
Contributor Author

@jurahul jurahul Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So something like?

template <int BitWidth>
struct TypeForBitWidth {
  using T = void;
};

template <>
struct TypeForBitWidth<32> {
  using T = uint32_t;
};
template <>
struct TypeForBitWidth<64> {
  using T = uint64_t;
};

int decodeToMCInst(TypeForBitWidth<32>::T Insn) {
  return 32;
}

int decodeToMCInst(TypeForBitWidth<64>::T Insn) {
  return 64;
}

int main() {
  uint32_t x = 0;
  std::cout << decodeToMCInst(x) << '\n';
  uint64_t y = 0;
  std::cout << decodeToMCInst(y) << '\n';
}

@topperc and @s-barannikov WDYT? It conveys the same information, but I agree its more directly expressing things. Note that with this though it won't directly work if/when we want to support > 1 bitwidth for a given type (which I did attempt in this PR but gave up because MSVC broke).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure overloading on type is a good idea. For this reason and calling the function with the wrong type issue raised here #154865 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I propose we keep the current scheme for now,

Copy link
Collaborator

@topperc topperc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I guess we'll find out if there are any out of tree RISC-V vendors with 64 bit instructions.

using InnerKeyTy = std::pair<StringRef, unsigned>;
using InnerMapTy = std::map<InnerKeyTy, std::vector<unsigned>>;
std::map<unsigned, InnerMapTy> EncMap;

for (const auto &[HwModeID, EncodingIDs] : EncodingIDsByHwMode) {
for (unsigned EncodingID : EncodingIDs) {
const InstructionEncoding &Encoding = Encodings[EncodingID];
const Record *EncodingDef = Encoding.getRecord();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead variable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants