MC: X86 intel syntax: Support data32 and data16 better #156287

daym · 2025-09-01T07:23:29Z

This PR makes the data16 and data32 prefixes take effect when they precede an instruction on the same line in Intel syntax mode.

Previously, this happened:

$ cat a.asm
.code16
data32 push 8
data32
push 9
$ llvm-mc --show-encoding --x86-asm-syntax=intel a.asm
.code16
pushw $8 # encoding: [0x6a,0x08]
data32 # encoding: [0x66]
pushw $9 # encoding: [0x6a,0x09]

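The first result is incorrect: the data32 prefix is dropped, so the 0x66 operand-size prefix is missing from the encoding. The second form, with data32 on its own line, only works as a workaround because the prefix byte is emitted as a separate statement.

This happens only in Intel syntax mode.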
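
For context, the 0x66 operand-size prefix switches an instruction to the non-default operand size of the current mode, which is why data32 is meaningful under .code16 and data16 is meaningful in 32-bit code. A minimal C++ sketch of that rule follows. The function name `effectiveOperandSize` and its parameters are hypothetical and exist only for illustration; they are not part of LLVM, and the sketch covers 16-bit and 32-bit modes only.

```cpp
#include <cassert>

// Hypothetical helper for illustration only; not an LLVM API.
// ModeBits is the default operand size of the current mode
// (16 under .code16, 32 in 32-bit code).
unsigned effectiveOperandSize(unsigned ModeBits, bool Has66Prefix) {
  assert((ModeBits == 16 || ModeBits == 32) && "sketch covers 16/32-bit only");
  if (!Has66Prefix)
    return ModeBits;
  // The prefix selects the other size of the 16/32 pair.
  return ModeBits == 16 ? 32u : 16u;
}
```

Under this rule, `data32 push 8` in .code16 should behave like a 32-bit push, which is what the dropped prefix in the output above fails to express.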

See also #156286

Full diff: https://github.com/llvm/llvm-project/pull/156287.diff

4 Files Affected:

(modified) llvm/lib/Target/X86/AsmParser/X86AsmParser.cpp (+32-6)
(added) llvm/test/MC/X86/x86-16-intel.s (+13)
(modified) llvm/test/MC/X86/x86-16.s (+12)
(added) llvm/test/MC/X86/x86-32-intel.s (+28)

diff --git a/llvm/lib/Target/X86/AsmParser/X86AsmParser.cpp b/llvm/lib/Target/X86/AsmParser/X86AsmParser.cpp
index d7671ed19589b..211aa47e48fe3 100644
--- a/llvm/lib/Target/X86/AsmParser/X86AsmParser.cpp
+++ b/llvm/lib/Target/X86/AsmParser/X86AsmParser.cpp
@@ -3523,8 +3523,26 @@ bool X86AsmParser::parseInstruction(ParseInstructionInfo &Info, StringRef Name,
     PatchedName = Name;
 
   // Hacks to handle 'data16' and 'data32'
-  if (PatchedName == "data16" && is16BitMode()) {
-    return Error(NameLoc, "redundant data16 prefix");
+  if (PatchedName == "data16") {
+    if (is16BitMode())
+        return Error(NameLoc, "redundant data16 prefix");
+    if (is64BitMode())
+      return Error(NameLoc, "'data16' is not supported in 64-bit mode");
+    if (getLexer().isNot(AsmToken::EndOfStatement)) {
+      StringRef Next = Parser.getTok().getString();
+      getLexer().Lex();
+      // data16 effectively changes the instruction suffix.
+      // TODO Generalize.
+      if (Next == "call")
+        Next = "callw";
+      if (Next == "ljmp")
+        Next = "ljmpw";
+
+      Name = Next;
+      PatchedName = Name;
+      ForcedDataPrefix = X86::Is16Bit;
+      IsPrefix = false;
+    }
   }
   if (PatchedName == "data32") {
     if (is32BitMode())
@@ -4538,14 +4556,22 @@ bool X86AsmParser::matchAndEmitIntelInstruction(
     if (X86Op->isImm()) {
       // If it's not a constant fall through and let remainder take care of it.
       const auto *CE = dyn_cast<MCConstantExpr>(X86Op->getImm());
-      unsigned Size = getPointerWidth();
+      // Determine the size. Prioritize the ForcedDataPrefix flag if it was set
+      // by a 'data32' prefix. Otherwise, fall back to the pointer width of the
+      // current mode.
+      unsigned Size = (ForcedDataPrefix == X86::Is32Bit) ? 32
+                    : (ForcedDataPrefix == X86::Is16Bit) ? 16
+                    : getPointerWidth();
+      ForcedDataPrefix = 0;
       if (CE &&
           (isIntN(Size, CE->getValue()) || isUIntN(Size, CE->getValue()))) {
         SmallString<16> Tmp;
         Tmp += Base;
-        Tmp += (is64BitMode())
-                   ? "q"
-                   : (is32BitMode()) ? "l" : (is16BitMode()) ? "w" : " ";
+        // Append the suffix corresponding to the determined size.
+        if (Size == 64) Tmp += "q";
+        else if (Size == 32) Tmp += "l";
+        else if (Size == 16) Tmp += "w";
+        else Tmp += " ";
         Op.setTokenValue(Tmp);
         // Do match in ATT mode to allow explicit suffix usage.
         Match.push_back(MatchInstruction(Operands, Inst, ErrorInfo,
diff --git a/llvm/test/MC/X86/x86-16-intel.s b/llvm/test/MC/X86/x86-16-intel.s
new file mode 100644
index 0000000000000..77ae4ae217218
--- /dev/null
+++ b/llvm/test/MC/X86/x86-16-intel.s
@@ -0,0 +1,13 @@
+// RUN: llvm-mc -triple i386-unknown-unknown-code16 --x86-asm-syntax=intel --show-encoding %s | FileCheck %s
+
+// CHECK: pushl $8
+// CHECK: encoding: [0x66,0x6a,0x08]
+          data32 push 8
+
+// CHECK: pushw $8
+// CHECK: encoding: [0x6a,0x08]
+          push 8
+
+// CHECK: lretl
+// CHECK: encoding: [0x66,0xcb]
+          data32 retf
diff --git a/llvm/test/MC/X86/x86-16.s b/llvm/test/MC/X86/x86-16.s
index b0a4bda56fcbf..b4e116ab1a0fb 100644
--- a/llvm/test/MC/X86/x86-16.s
+++ b/llvm/test/MC/X86/x86-16.s
@@ -1060,3 +1060,15 @@ xresldtrk
 // CHECK:  encoding: [0x66,0x8b,0x1e,A,A]
 // CHECK:  fixup A - offset: 3, value: nearer, kind: FK_Data_2
 movl    nearer, %ebx
+
+// CHECK: pushl $8
+// CHECK:  encoding: [0x66,0x6a,0x08]
+data32 push $8
+
+// CHECK: pushl $8
+// CHECK:  encoding: [0x66,0x6a,0x08]
+pushl $8
+
+// CHECK: pushw $8
+// CHECK:  encoding: [0x6a,0x08]
+push $8
diff --git a/llvm/test/MC/X86/x86-32-intel.s b/llvm/test/MC/X86/x86-32-intel.s
new file mode 100644
index 0000000000000..44fc5104653c2
--- /dev/null
+++ b/llvm/test/MC/X86/x86-32-intel.s
@@ -0,0 +1,28 @@
+// RUN: llvm-mc -triple i386-unknown-unknown --x86-asm-syntax=intel --show-encoding %s | FileCheck %s
+
+// CHECK: encoding: [0x66,0x6a,0x08]
+          data16 push 8
+
+// CHECK: encoding: [0x6a,0x08]
+          push 8
+
+// CxHECK: encoding: [0x66,0xcb]
+     //     data16 retf
+
+// CHECK: encoding: [0xcb]
+          retf
+
+// CHECK: encoding: [0x66,0x9a,0xcd,0xab,0xce,0x7a]
+          data16 call 0x7ace, 0xabcd
+
+// CHECK: encoding: [0x9a,0xcd,0xab,0x00,0x00,0xce,0x7a]
+          call 0x7ace, 0xabcd
+
+// CHECK: encoding: [0xe8,A,A,A,A]
+          call a
+
+// CxHECK: encoding: [0x66,0xea,0xcd,0xab,0xce,0x7a]
+     //     data16 ljmp 0x7ace, 0xabcd
+
+// CHECK: encoding: [0xea,0xcd,0xab,0x00,0x00,0xce,0x7a]
+          ljmp 0x7ace, 0xabcd
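
The second hunk of the X86AsmParser.cpp change picks the immediate size from the forced data prefix when one is set, falls back to the pointer width otherwise, and then appends the matching size suffix to the mnemonic. The standalone C++ sketch below restates that logic outside the parser. The names `ForcedData`, `immediateSize`, and `withSizeSuffix` are hypothetical stand-ins; the patch itself compares `ForcedDataPrefix` against `X86::Is16Bit` and `X86::Is32Bit` and calls `getPointerWidth()`. Unlike the patch, which appends a space for any other size, this sketch simply asserts on unexpected sizes.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the parser's ForcedDataPrefix state.
enum class ForcedData { None, Bits16, Bits32 };

// A forced data prefix overrides the pointer width of the current mode.
unsigned immediateSize(ForcedData Forced, unsigned PointerWidth) {
  switch (Forced) {
  case ForcedData::Bits16:
    return 16;
  case ForcedData::Bits32:
    return 32;
  case ForcedData::None:
    break;
  }
  return PointerWidth;
}

// Appends the AT&T-style size suffix that matches the chosen size.
std::string withSizeSuffix(const std::string &Base, unsigned Size) {
  assert((Size == 16 || Size == 32 || Size == 64) && "unexpected size");
  const char Suffix = Size == 64 ? 'q' : Size == 32 ? 'l' : 'w';
  return Base + Suffix;
}
```

With a 16-bit pointer width and a forced 32-bit prefix, `push` becomes `pushl`, which corresponds to the `data32 push 8` expectation in x86-16-intel.s.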

brad0 · 2025-09-01T09:44:05Z

cc @MaskRay

github-actions · 2025-09-01T09:46:19Z

✅ With the latest revision this PR passed the C/C++ code formatter.

RKSimon · 2025-09-01T09:47:03Z

Please update the PR title and summary to briefly and properly describe your patch.

phoebewang · 2025-09-01T10:34:34Z

And format the changed code.

MaskRay · 2025-09-01T21:18:42Z

llvm/test/MC/X86/x86-32-intel.s

+// RUN: llvm-mc -triple i386-unknown-unknown --x86-asm-syntax=intel --show-encoding %s | FileCheck %s
+
+// CHECK: encoding: [0x66,0x6a,0x08]
+          data16 push 8


Is this intended?

GNU Assembler rejects data16 push 8 outside of .code16 region.

daym · 2025-09-02

Does it?

$ cat boot.asm
.intel_syntax noprefix
test_section:
.code32
data16 push 8
$ as boot.asm
$ objdump -S a.out

a.out: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <test_section>:
0: 66 6a 08 pushw $0x8
$ as --version
GNU assembler (GNU Binutils) 2.44
Copyright (C) 2025 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-unknown-linux-gnu'.
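
The bytes in the objdump output above (66 6a 08) are the same ones the new x86-32-intel.s test expects for `data16 push 8`. As a rough illustration of how that byte sequence is assembled, the C++ sketch below encodes `push imm8` (opcode 0x6A) and prepends 0x66 when the requested operand size differs from the mode default. `encodePushImm8` is a hypothetical helper written for this example; it is not part of LLVM or binutils and handles only 16-bit and 32-bit modes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration; not part of LLVM or binutils.
// ModeBits is the default operand size of the current mode, OperandBits
// the operand size requested for this instruction.
std::vector<std::uint8_t> encodePushImm8(unsigned ModeBits,
                                         unsigned OperandBits,
                                         std::int8_t Imm) {
  assert((ModeBits == 16 || ModeBits == 32) && "sketch covers 16/32-bit only");
  assert((OperandBits == 16 || OperandBits == 32) && "unexpected operand size");
  std::vector<std::uint8_t> Bytes;
  if (OperandBits != ModeBits)
    Bytes.push_back(0x66); // operand-size prefix
  Bytes.push_back(0x6A);   // push imm8
  Bytes.push_back(static_cast<std::uint8_t>(Imm));
  return Bytes;
}
```

Calling `encodePushImm8(32, 16, 8)` models `data16 push 8` in 32-bit code, and `encodePushImm8(16, 32, 8)` models `data32 push 8` under .code16.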

MaskRay · 2025-09-05T07:04:06Z

llvm/test/MC/X86/x86-32-intel.s

@@ -0,0 +1,28 @@
+// RUN: llvm-mc -triple i386-unknown-unknown --x86-asm-syntax=intel --show-encoding %s | FileCheck %s


Move this to intel-syntax-32.s instead. Possibly change the test to use --show-encoding.

llvmbot added the backend:X86 label Sep 1, 2025

RKSimon requested review from MaskRay, RKSimon and phoebewang September 1, 2025 09:46

daym changed the title ~~X86 data v2~~ X86 intel syntax: Support data32 and data16 better Sep 1, 2025

daym changed the title ~~X86 intel syntax: Support data32 and data16 better~~ MC: X86 intel syntax: Support data32 and data16 better Sep 1, 2025

daym force-pushed the x86-data-v2 branch 2 times, most recently from 4aa3936 to 7dcecbf Compare September 1, 2025 18:47

daym added 2 commits September 1, 2025 20:58

[x86][MC] Fix data32 push.

dd6ec68

[x86][MC]: Fix data16.

9efea28

daym force-pushed the x86-data-v2 branch from 7dcecbf to 9efea28 Compare September 1, 2025 18:59

MaskRay reviewed Sep 1, 2025

daym requested a review from MaskRay September 2, 2025 10:24

MaskRay approved these changes Sep 5, 2025
