
Conversation

@QuantumSegfault commented Aug 28, 2025

Implements custom EXTLOAD lowering logic to enable correct lowering of extending loads from WASM (address space 1) globals, similarly to the existing LOAD logic.

These fixes also ensure that the global.get is performed with the type of the underlying global (accounting for globals declared smaller than i32), and that the loaded value is extended or truncated to the desired result width (preventing invalid instruction sequences).

Loads from integral or floating-point globals of widths other than 8, 16, 32, or 64 bits are explicitly disallowed.

Fixes: #155880


@llvmbot (Member) commented Aug 28, 2025

@llvm/pr-subscribers-backend-webassembly

Author: None (QuantumSegfault)

Full diff: https://github.com/llvm/llvm-project/pull/155937.diff

2 Files Affected:

  • (modified) llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp (+115-4)
  • (added) llvm/test/CodeGen/WebAssembly/lower-load-wasm-global.ll (+185)
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
index 5a45134692865..343f85e475016 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
@@ -19,6 +19,7 @@
 #include "WebAssemblyTargetMachine.h"
 #include "WebAssemblyUtilities.h"
 #include "llvm/CodeGen/CallingConvLower.h"
+#include "llvm/CodeGen/ISDOpcodes.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineInstrBuilder.h"
 #include "llvm/CodeGen/MachineJumpTableInfo.h"
@@ -111,6 +112,17 @@ WebAssemblyTargetLowering::WebAssemblyTargetLowering(
     }
   }
 
+  // Likewise, transform extending loads for address space 1
+  for (auto T : {MVT::i32, MVT::i64}) {
+    for (auto S : {MVT::i8, MVT::i16, MVT::i32}) {
+      if (T != S) {
+        setLoadExtAction(ISD::EXTLOAD, T, S, Custom);
+        setLoadExtAction(ISD::ZEXTLOAD, T, S, Custom);
+        setLoadExtAction(ISD::SEXTLOAD, T, S, Custom);
+      }
+    }
+  }
+
   setOperationAction(ISD::GlobalAddress, MVTPtr, Custom);
   setOperationAction(ISD::GlobalTLSAddress, MVTPtr, Custom);
   setOperationAction(ISD::ExternalSymbol, MVTPtr, Custom);
@@ -1707,6 +1719,11 @@ static bool IsWebAssemblyGlobal(SDValue Op) {
   if (const GlobalAddressSDNode *GA = dyn_cast<GlobalAddressSDNode>(Op))
     return WebAssembly::isWasmVarAddressSpace(GA->getAddressSpace());
 
+  if (Op->getOpcode() == WebAssemblyISD::Wrapper)
+    if (const GlobalAddressSDNode *GA =
+            dyn_cast<GlobalAddressSDNode>(Op->getOperand(0)))
+      return WebAssembly::isWasmVarAddressSpace(GA->getAddressSpace());
+
   return false;
 }
 
@@ -1764,16 +1781,110 @@ SDValue WebAssemblyTargetLowering::LowerLoad(SDValue Op,
   LoadSDNode *LN = cast<LoadSDNode>(Op.getNode());
   const SDValue &Base = LN->getBasePtr();
   const SDValue &Offset = LN->getOffset();
+  ISD::LoadExtType ExtType = LN->getExtensionType();
+  EVT ResultType = LN->getValueType(0);
 
   if (IsWebAssemblyGlobal(Base)) {
     if (!Offset->isUndef())
       report_fatal_error(
           "unexpected offset when loading from webassembly global", false);
 
-    SDVTList Tys = DAG.getVTList(LN->getValueType(0), MVT::Other);
-    SDValue Ops[] = {LN->getChain(), Base};
-    return DAG.getMemIntrinsicNode(WebAssemblyISD::GLOBAL_GET, DL, Tys, Ops,
-                                   LN->getMemoryVT(), LN->getMemOperand());
+    EVT GT = MVT::INVALID_SIMPLE_VALUE_TYPE;
+
+    if (const GlobalAddressSDNode *GA = dyn_cast<GlobalAddressSDNode>(Base))
+      GT = EVT::getEVT(GA->getGlobal()->getValueType());
+    if (Base->getOpcode() == WebAssemblyISD::Wrapper)
+      if (const GlobalAddressSDNode *GA =
+              dyn_cast<GlobalAddressSDNode>(Base->getOperand(0)))
+        GT = EVT::getEVT(GA->getGlobal()->getValueType());
+
+    if (GT != MVT::i8 && GT != MVT::i16 && GT != MVT::i32 && GT != MVT::i64 &&
+        GT != MVT::f32 && GT != MVT::f64)
+      report_fatal_error("encountered unexpected global type for Base when "
+                         "loading from webassembly global",
+                         false);
+
+    EVT PromotedGT = (GT == MVT::i8 || GT == MVT::i16) ? MVT::i32 : GT;
+
+    if (ExtType == ISD::NON_EXTLOAD) {
+      // A normal, non-extending load may try to load more or less than the
+      // underlying global, which is invalid. We lower this to a load of the
+      // global (i32 or i64) then truncate or extend as needed
+
+      // Modify the MMO to load the full global
+      MachineMemOperand *OldMMO = LN->getMemOperand();
+      MachineMemOperand *NewMMO = DAG.getMachineFunction().getMachineMemOperand(
+          OldMMO->getPointerInfo(), OldMMO->getFlags(),
+          LLT(PromotedGT.getSimpleVT()), OldMMO->getBaseAlign(),
+          OldMMO->getAAInfo(), OldMMO->getRanges(), OldMMO->getSyncScopeID(),
+          OldMMO->getSuccessOrdering(), OldMMO->getFailureOrdering());
+
+      SDVTList Tys = DAG.getVTList(PromotedGT, MVT::Other);
+      SDValue Ops[] = {LN->getChain(), Base};
+      SDValue GlobalGetNode = DAG.getMemIntrinsicNode(
+          WebAssemblyISD::GLOBAL_GET, DL, Tys, Ops, PromotedGT, NewMMO);
+
+      if (ResultType.bitsEq(PromotedGT)) {
+        return GlobalGetNode;
+      }
+
+      SDValue ValRes;
+      if (ResultType.isFloatingPoint())
+        ValRes = DAG.getFPExtendOrRound(GlobalGetNode, DL, ResultType);
+      else
+        ValRes = DAG.getAnyExtOrTrunc(GlobalGetNode, DL, ResultType);
+
+      return DAG.getMergeValues({ValRes, LN->getChain()}, DL);
+    }
+
+    if (ExtType == ISD::ZEXTLOAD || ExtType == ISD::SEXTLOAD) {
+      // Turn the unsupported load into an EXTLOAD followed by an
+      // explicit zero/sign extend inreg. Same as Expand
+
+      SDValue Result =
+          DAG.getExtLoad(ISD::EXTLOAD, DL, ResultType, LN->getChain(), Base,
+                         LN->getMemoryVT(), LN->getMemOperand());
+      SDValue ValRes;
+      if (ExtType == ISD::SEXTLOAD)
+        ValRes = DAG.getNode(ISD::SIGN_EXTEND_INREG, DL, Result.getValueType(),
+                             Result, DAG.getValueType(LN->getMemoryVT()));
+      else
+        ValRes = DAG.getZeroExtendInReg(Result, DL, LN->getMemoryVT());
+
+      return DAG.getMergeValues({ValRes, LN->getChain()}, DL);
+    }
+
+    if (ExtType == ISD::EXTLOAD) {
+      // Expand the EXTLOAD into a regular LOAD of the global, and if
+      // needed, a zero-extension
+
+      EVT OldLoadType = LN->getMemoryVT();
+      EVT NewLoadType = (OldLoadType == MVT::i8 || OldLoadType == MVT::i16)
+                            ? MVT::i32
+                            : OldLoadType;
+
+      // Modify the MMO to load a whole WASM "register"'s worth
+      MachineMemOperand *OldMMO = LN->getMemOperand();
+      MachineMemOperand *NewMMO = DAG.getMachineFunction().getMachineMemOperand(
+          OldMMO->getPointerInfo(), OldMMO->getFlags(),
+          LLT(NewLoadType.getSimpleVT()), OldMMO->getBaseAlign(),
+          OldMMO->getAAInfo(), OldMMO->getRanges(), OldMMO->getSyncScopeID(),
+          OldMMO->getSuccessOrdering(), OldMMO->getFailureOrdering());
+
+      SDValue Result =
+          DAG.getLoad(NewLoadType, DL, LN->getChain(), Base, NewMMO);
+
+      if (NewLoadType != ResultType) {
+        SDValue ValRes = DAG.getNode(ISD::ANY_EXTEND, DL, ResultType, Result);
+        return DAG.getMergeValues({ValRes, LN->getChain()}, DL);
+      }
+
+      return Result;
+    }
+
+    report_fatal_error(
+        "encountered unexpected ExtType when loading from webassembly global",
+        false);
   }
 
   if (std::optional<unsigned> Local = IsWebAssemblyLocal(Base, DAG)) {
diff --git a/llvm/test/CodeGen/WebAssembly/lower-load-wasm-global.ll b/llvm/test/CodeGen/WebAssembly/lower-load-wasm-global.ll
new file mode 100644
index 0000000000000..0112296df1aa8
--- /dev/null
+++ b/llvm/test/CodeGen/WebAssembly/lower-load-wasm-global.ll
@@ -0,0 +1,185 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s | FileCheck %s
+
+; Test various loads from WASM (address space 1) globals lower as intended
+
+target triple = "wasm32-unknown-unknown"
+
+
+@globalI8 = local_unnamed_addr addrspace(1) global i8 undef
+@globalI32 = local_unnamed_addr addrspace(1) global i32 undef
+@globalI64 = local_unnamed_addr addrspace(1) global i64 undef
+
+
+define i32 @zext_i8_i32() {
+; CHECK-LABEL: zext_i8_i32:
+; CHECK:         .functype zext_i8_i32 () -> (i32)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI32
+; CHECK-NEXT:    i32.const 255
+; CHECK-NEXT:    i32.and
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i8, ptr addrspace(1) @globalI32
+  %e = zext i8 %v to i32
+  ret i32 %e
+}
+
+define i32 @sext_i8_i32() {
+; CHECK-LABEL: sext_i8_i32:
+; CHECK:         .functype sext_i8_i32 () -> (i32)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI32
+; CHECK-NEXT:    i32.extend8_s
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i8, ptr addrspace(1) @globalI32
+  %e = sext i8 %v to i32
+  ret i32 %e
+}
+
+define i32 @zext_i16_i32() {
+; CHECK-LABEL: zext_i16_i32:
+; CHECK:         .functype zext_i16_i32 () -> (i32)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI32
+; CHECK-NEXT:    i32.const 65535
+; CHECK-NEXT:    i32.and
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i16, ptr addrspace(1) @globalI32
+  %e = zext i16 %v to i32
+  ret i32 %e
+}
+
+define i32 @sext_i16_i32() {
+; CHECK-LABEL: sext_i16_i32:
+; CHECK:         .functype sext_i16_i32 () -> (i32)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI32
+; CHECK-NEXT:    i32.extend16_s
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i16, ptr addrspace(1) @globalI32
+  %e = sext i16 %v to i32
+  ret i32 %e
+}
+
+
+define i64 @zext_i8_i64() {
+; CHECK-LABEL: zext_i8_i64:
+; CHECK:         .functype zext_i8_i64 () -> (i64)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI64
+; CHECK-NEXT:    i64.const 255
+; CHECK-NEXT:    i64.and
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i8, ptr addrspace(1) @globalI64
+  %e = zext i8 %v to i64
+  ret i64 %e
+}
+
+define i64 @sext_i8_i64() {
+; CHECK-LABEL: sext_i8_i64:
+; CHECK:         .functype sext_i8_i64 () -> (i64)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI64
+; CHECK-NEXT:    i64.extend8_s
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i8, ptr addrspace(1) @globalI64
+  %e = sext i8 %v to i64
+  ret i64 %e
+}
+
+define i64 @zext_i16_i64() {
+; CHECK-LABEL: zext_i16_i64:
+; CHECK:         .functype zext_i16_i64 () -> (i64)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI64
+; CHECK-NEXT:    i64.const 65535
+; CHECK-NEXT:    i64.and
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i16, ptr addrspace(1) @globalI64
+  %e = zext i16 %v to i64
+  ret i64 %e
+}
+
+define i64 @sext_i16_i64() {
+; CHECK-LABEL: sext_i16_i64:
+; CHECK:         .functype sext_i16_i64 () -> (i64)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI64
+; CHECK-NEXT:    i64.extend16_s
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i16, ptr addrspace(1) @globalI64
+  %e = sext i16 %v to i64
+  ret i64 %e
+}
+
+define i64 @zext_i32_i64() {
+; CHECK-LABEL: zext_i32_i64:
+; CHECK:         .functype zext_i32_i64 () -> (i64)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI64
+; CHECK-NEXT:    i64.const 4294967295
+; CHECK-NEXT:    i64.and
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i32, ptr addrspace(1) @globalI64
+  %e = zext i32 %v to i64
+  ret i64 %e
+}
+
+define i64 @sext_i32_i64() {
+; CHECK-LABEL: sext_i32_i64:
+; CHECK:         .functype sext_i32_i64 () -> (i64)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI64
+; CHECK-NEXT:    i64.extend32_s
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i32, ptr addrspace(1) @globalI64
+  %e = sext i32 %v to i64
+  ret i64 %e
+}
+
+
+define i64 @load_i64_from_i32() {
+; CHECK-LABEL: load_i64_from_i32:
+; CHECK:         .functype load_i64_from_i32 () -> (i64)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI32
+; CHECK-NEXT:    i64.extend_i32_u
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i64, ptr addrspace(1) @globalI32
+  ret i64 %v
+}
+
+define i32 @load_i32_from_i64() {
+; CHECK-LABEL: load_i32_from_i64:
+; CHECK:         .functype load_i32_from_i64 () -> (i32)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI64
+; CHECK-NEXT:    i32.wrap_i64
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i32, ptr addrspace(1) @globalI64
+  ret i32 %v
+}
+
+define i8 @load_i8() {
+; CHECK-LABEL: load_i8:
+; CHECK:         .functype load_i8 () -> (i32)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI8
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i8, ptr addrspace(1) @globalI8
+  ret i8 %v
+}
+
+define i64 @load_i16_from_i8_zext_to_i64() {
+; CHECK-LABEL: load_i16_from_i8_zext_to_i64:
+; CHECK:         .functype load_i16_from_i8_zext_to_i64 () -> (i64)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    global.get globalI8
+; CHECK-NEXT:    i64.extend_i32_u
+; CHECK-NEXT:    i64.const 65535
+; CHECK-NEXT:    i64.and
+; CHECK-NEXT:    # fallthrough-return
+  %v = load i16, ptr addrspace(1) @globalI8
+  %e = zext i16 %v to i64
+  ret i64 %e
+}

@QuantumSegfault (Author)

I've never worked with the LLVM code before, so please let me know if there was a simpler or cleaner way to do any of that.

@QuantumSegfault (Author) commented Aug 28, 2025

I ran the test-suite, and my changes introduced regressions in three tests.

  • CodeGen/WebAssembly/simd.ll
  • Transforms/LoopVectorize/WebAssembly/int-mac-reduction-costs.ll
  • Transforms/LoopVectorize/WebAssembly/memory-interleave.ll

The SIMD test fails now because various ANDs are no longer being folded into the zext loads. I'm guessing that's because I marked extending loads "Custom", so the combiner can no longer reason about them? The other two tests seem to be failing for similar reasons.

What should I do about it? Is my code flawed, or do the tests need to be regenerated?

QuantumSegfault force-pushed the wasm-global-load-lowering branch 2 times, most recently from 6e3d317 to ce78c24 on August 29, 2025 at 18:49
@badumbatish (Contributor) left a comment

Hey! Good work on your first PR to LLVM. I've only contributed a few patches to WebAssembly, but I figured you'd want some review on this PR. I just have a few comments. Also, if you could add some description of what the PR is doing, that'd be great; I'm a bit hazy on it. I'll approve the CI to run. It seems you have a few bugs with SIMD; I'll check it out later and see if I can come up with anything.

// global (i32 or i64) then truncate or extend as needed

// Modify the MMO to load the full global
MachineMemOperand *OldMMO = LN->getMemOperand();
@badumbatish (Contributor)

Seems like this could be made into a local helper, since you're reusing it at the bottom as well?
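
A minimal sketch of what such a helper might look like, assembled from the calls already in the diff (the name is illustrative, not from the PR):

static MachineMemOperand *cloneMMOWithType(SelectionDAG &DAG,
                                           const MachineMemOperand *OldMMO,
                                           EVT NewVT) {
  // Clone the MMO rather than mutating it, in case it is shared;
  // only the memory type (LLT) changes.
  return DAG.getMachineFunction().getMachineMemOperand(
      OldMMO->getPointerInfo(), OldMMO->getFlags(), LLT(NewVT.getSimpleVT()),
      OldMMO->getBaseAlign(), OldMMO->getAAInfo(), OldMMO->getRanges(),
      OldMMO->getSyncScopeID(), OldMMO->getSuccessOrdering(),
      OldMMO->getFailureOrdering());
}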

@QuantumSegfault (Author)

So is this the correct approach (and worth making into a helper)? I wasn't sure whether it was safe to modify the existing MMO, so I figured I'd better clone it.

Otherwise, I could call MachineMemOperand::setType on the existing MMO.

@badumbatish (Contributor) commented Sep 3, 2025

Looks safe enough; the LLT in each MMO is per SDNode, with no references.

@QuantumSegfault (Author)

But could the MMO as a whole ever be shared? Or is that also created per node at first?

@badumbatish (Contributor)

Hmm, this I actually don't know. It is possible; I'll tag someone in after finishing the review.

QuantumSegfault force-pushed the wasm-global-load-lowering branch from ce78c24 to 37e324d on September 3, 2025 at 03:16
QuantumSegfault force-pushed the wasm-global-load-lowering branch 3 times, most recently from 7701611 to 24b14e4 on September 3, 2025 at 04:54
@QuantumSegfault (Author)

Also, if you could add some description of what the PR is doing, that'd be great; I'm a bit hazy on it.

The issue (#155880) I opened explains the problems I'm solving with this PR. Is there something in particular you need explained further? Should I edit the PR description with some details from the issue?

@badumbatish (Contributor)

Is there something in particular you need explained further?

I think I grasp the general idea now after taking some time to review. Ty!

Looks good to me. One thing I noticed locally: yes, the combine of i32.load8_u + i32.extend8_s into i32.load8_s and the like is being left out. The default DAG combiner must have run before LowerLoad; you can probably add a combine to PerformDAGCombine (same file) to help reduce code size. Maybe llvm-project/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:15728 for inspiration?

You can run the update-test-checks script again on simd.ll to show that CI actually passes all checks for now, and redo it after you have added the PerformDAGCombine change.
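
Roughly the kind of combine I mean, as a sketch only: the helper name is made up, the guard reuses this file's IsWebAssemblyGlobal, and the one-use/mask checks would need to mirror what DAGCombiner actually does.

static SDValue performAndLoadCombine(SDNode *N,
                                     TargetLowering::DAGCombinerInfo &DCI) {
  SelectionDAG &DAG = DCI.DAG;
  auto *LN = dyn_cast<LoadSDNode>(N->getOperand(0));
  auto *Mask = dyn_cast<ConstantSDNode>(N->getOperand(1));
  if (!LN || !Mask || !LN->hasOneUse() ||
      LN->getExtensionType() != ISD::EXTLOAD)
    return SDValue();

  // Leave address space 1 loads alone; those are the ones the custom
  // lowering must turn into global.get.
  if (IsWebAssemblyGlobal(LN->getBasePtr()))
    return SDValue();

  // (and (anyext-load x), mask) -> (zext-load x) when the mask exactly
  // covers the loaded width.
  EVT VT = N->getValueType(0);
  EVT MemVT = LN->getMemoryVT();
  if (Mask->getAPIntValue() !=
      APInt::getLowBitsSet(VT.getSizeInBits(), MemVT.getSizeInBits()))
    return SDValue();

  SDValue ZExtLoad =
      DAG.getExtLoad(ISD::ZEXTLOAD, SDLoc(N), VT, LN->getChain(),
                     LN->getBasePtr(), MemVT, LN->getMemOperand());
  // Rewire the old load's chain users to the new load.
  DAG.ReplaceAllUsesOfValueWith(SDValue(LN, 1), ZExtLoad.getValue(1));
  return ZExtLoad;
}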

Tagging @sparker-arm @dschuff for some reviews

@dschuff (Member) commented Sep 3, 2025

Thanks for the ping. I did actually take a look at it, in particular the issue with the simd tests (since, as you both observed, the load+extend is not getting turned into load_s). It's not quite obvious to me yet what is happening, since I would have thought that if this custom lowering declines to actually lower the instruction, it would fall back on the default lowering path, including all of the DAG combines. But obviously that's not what's happening. I didn't get as far as trying to figure out exactly which code is running in each case and why.

@QuantumSegfault (Author)

The odd thing: this problem only triggers when +simd128 is enabled.

define i32 @extract_var_v16i8_s(<16 x i8> %v, i32 %i) {
  %elem = extractelement <16 x i8> %v, i32 %i
  %a = sext i8 %elem to i32
  ret i32 %a
}

Without any flags (no SIMD):

	.globl	extract_var_v16i8_s             # -- Begin function extract_var_v16i8_s
	.type	extract_var_v16i8_s,@function
extract_var_v16i8_s:                    # @extract_var_v16i8_s
	.functype	extract_var_v16i8_s (i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32) -> (i32)
	.local  	i32
# %bb.0:
	global.get	__stack_pointer
	i32.const	16
	i32.sub 
	local.tee	17
	local.get	15
	i32.store8	15
	local.get	17
	local.get	14
	i32.store8	14
	
# ... some more stores, omitted for brevity

	local.get	17
	local.get	1
	i32.store8	1
	local.get	17
	local.get	0
	i32.store8	0
	local.get	17
	local.get	16
	i32.const	15
	i32.and 
	i32.or  
	i32.load8_s	0
                                        # fallthrough-return
	end_function

With SIMD128 enabled:

	.globl	extract_var_v16i8_s             # -- Begin function extract_var_v16i8_s
	.type	extract_var_v16i8_s,@function
extract_var_v16i8_s:                    # @extract_var_v16i8_s
	.functype	extract_var_v16i8_s (v128, i32) -> (i32)
	.local  	i32
# %bb.0:
	global.get	__stack_pointer
	i32.const	16
	i32.sub 
	local.tee	2
	local.get	0
	v128.store	0
	local.get	2
	local.get	1
	i32.const	15
	i32.and 
	i32.or  
	i32.load8_u	0
	i32.extend8_s
                                        # fallthrough-return
	end_function

Note the difference at the end: i32.load8_u + i32.extend8_s vs. the intended i32.load8_s.

Does that imply a bug with WASM's SIMD128 handling?

@QuantumSegfault (Author)

Okay, so examining the DAGs, it seems to be a problem with when our instructions are lowered. Without SIMD, the type-legalized DAG already has a load anyext + and that the optimization can combine into a zext load.

When SIMD128 is enabled, the load is not introduced until the full legalization phase (as the lowering of extract_vector_elt). The combiner can't assume that ZEXTLOAD is legal (due to it being marked as Custom) and can't safely combine at this point.

So how can I tell it that it's allowed to recombine after legalization, as long as the load is not from addrspace(1)?


No SIMD

Type-legalized DAG

        t64: i32 = and t34, Constant:i32<15>
        t66: i32 = mul t64, Constant:i32<1>
    t67: i32 = add FrameIndex:i32<0>, t66
    t68: i32,ch = load<(load (s8)), anyext from i8> t101, t67, undef:i32
t59: i32 = and t68, Constant:i32<255>

Optimized type-legalized DAG

            t34: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<16>
    t64: i32 = and t34, Constant:i32<15>
    t241: i32 = or disjoint FrameIndex:i32<0>, t64
t243: i32,ch = load<(load (s8)), zext from i8> t239, t241, undef:i32

Legalized DAG

        t34: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<16>
    t64: i32 = and t34, Constant:i32<15>
    t241: i32 = or disjoint TargetFrameIndex:i32<0>, t64
t243: i32,ch = load<(load (s8)), zext from i8> t239, t241, undef:i32

+SIMD128

Type-legalized DAG

    t8: i32 = extract_vector_elt t2, t4
t10: i32 = and t8, Constant:i32<255>

Optimized type-legalized DAG

    t8: i32 = extract_vector_elt t2, t4
t10: i32 = and t8, Constant:i32<255>

Legalized DAG

            t4: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<1>
        t15: i32 = and t4, Constant:i32<15>
        t17: i32 = mul t15, Constant:i32<1>
    t18: i32 = add TargetFrameIndex:i32<0>, t17
    t19: i32,ch = load<(load (s8), align 4), anyext from i8> t13, t18, undef:i32
t10: i32 = and t19, Constant:i32<255>

Optimized legalized DAG

        t15: i32 = and t4, Constant:i32<15>
    t21: i32 = or disjoint TargetFrameIndex:i32<0>, t15
    t19: i32,ch = load<(load (s8), align 4), anyext from i8> t13, t21, undef:i32
t10: i32 = and t19, Constant:i32<255>

@dschuff (Member) commented Sep 3, 2025

I think a more direct comparison is with and without your patch. I did this yesterday for extract_var_v16i8_s

Without patch:

Type-legalized selection DAG: %bb.0 'extract_var_v16i8_s:'
      t8: i32 = extract_vector_elt t2, t4
    t10: i32 = sign_extend_inreg t8, ValueType:ch:i8

Optimized type-legalized selection DAG: %bb.0 'extract_var_v16i8_s:'
      t8: i32 = extract_vector_elt t2, t4
    t10: i32 = sign_extend_inreg t8, ValueType:ch:i8

Legalized selection DAG: %bb.0 'extract_var_v16i8_s:'
      t19: i32,ch = load<(load (s8), align 4), anyext from i8> t13, t18, undef:i32
    t10: i32 = sign_extend_inreg t19, ValueType:ch:i8

Optimized legalized selection DAG: %bb.0 'extract_var_v16i8_s:'
    t23: i32,ch = load<(load (s8), align 4), sext from i8> t13, t21, undef:i32

With patch:

Type-legalized selection DAG: %bb.0 'extract_var_v16i8_s:'
      t8: i32 = extract_vector_elt t2, t4
    t10: i32 = sign_extend_inreg t8, ValueType:ch:i8

Optimized type-legalized selection DAG: %bb.0 'extract_var_v16i8_s:'
      t8: i32 = extract_vector_elt t2, t4
    t10: i32 = sign_extend_inreg t8, ValueType:ch:i8

Legalized selection DAG: %bb.0 'extract_var_v16i8_s:'
      t19: i32,ch = load<(load (s8), align 4), anyext from i8> t13, t18, undef:i32
    t10: i32 = sign_extend_inreg t19, ValueType:ch:i8

Optimized legalized selection DAG: %bb.0 'extract_var_v16i8_s:'
      t19: i32,ch = load<(load (s8), align 4), anyext from i8> t13, t21, undef:i32
    t10: i32 = sign_extend_inreg t19, ValueType:ch:i8

Here everything is the same until the last step, where it seems the DAG combiner is turning load anyext + sign_extend_inreg into a load sext, but that doesn't happen when we go through the custom lowering but don't actually lower.

@QuantumSegfault (Author)

Okay, I'm not quite finished digging, but I think I roughly know why it's failing now.

Before my patch, the combination of the anyext load + and was handled by this code:
https://github.com/llvm/llvm-project/blob/27114e4f302cb8724eec758077a430dd7fb300ef/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L7576-L7622

But now that ZEXTLOAD is handled as Custom, the CanZextLoadProfitably check fails.

With the patch applied, and SIMD disabled, the combination falls through and gets handled instead by
https://github.com/llvm/llvm-project/blob/27114e4f302cb8724eec758077a430dd7fb300ef/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L7674-L7678

But for some reason (despite appearing otherwise the same), reduceLoadWidth isn't working with SIMD128 enabled (perhaps because this optimization is triggered at a later stage).

@QuantumSegfault (Author)

Okay: reduceLoadWidth doesn't work because isLegalNarrowLdSt on line 15539 returns false, because of this, starting at line 6915:

if (LegalOperations &&
    !TLI.isLoadExtLegal(ExtType, Load->getValueType(0), MemVT))
  return false;

It all comes down to the same: !isLoadExtLegal :/

Not sure what to do about it.

QuantumSegfault force-pushed the wasm-global-load-lowering branch from 24b14e4 to 8a03ea5 on September 4, 2025 at 06:37
@@ -3637,6 +3755,184 @@ static SDValue performMulCombine(SDNode *N,
}
}

static SDValue performANDCombine(SDNode *N,
@QuantumSegfault (Author)

Well, I found one solution, but I don't like it: duplicating the combiner logic here.

It works for simd.ll, but the instruction cost calculation for the LoopVectorize-related tests is still off.

@badumbatish (Contributor)

Yes, I agree... this shouldn't be the way to go.

@sparker-arm (Contributor)

DAGCombiner::reduceLoadWidth doesn't work with vector types. Certainly when I was working in this area, it was all about optimisations for CoreMark :)

Does this need to be fixed during DAGCombine / Lowering? Could some tablegen patterns not just be added?

@QuantumSegfault (Author)

The load in question is not a vector load. It's a scalar load of one of the elements of the vector from the stack, created by the lowering of extract_vector_elt.

Folding sext_in_reg and and back into the various loads could probably be done with tablegen, but is that much better? Defining a pattern for each combination of (ext)load + (sext_in_reg | and)?

It feels like we're missing something obvious here. Maybe the addrspace(1) load handling would be better suited somewhere other than ISelLowering? Somewhere that won't interfere with these other optimizations?

@QuantumSegfault (Author)

Out of curiosity, would the new GlobalISel system have handled this situation better?

@sparker-arm (Contributor)

Maybe the addrspace(1) load handling would be better suited somewhere other than ISelLowering? Somewhere that won't interfere with these other optimizations?

How about making TargetLowering consider the address space value for setLoadExtAction, isLoadExtLegal, etc? This feels like a sensible option...
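
Something along these lines, purely hypothetically; no address-space parameter exists on these hooks today, and I'm using WebAssembly::WASM_ADDRESS_SPACE_GLOBAL as the backend's name for address space 1:

// Hypothetical API sketch, not current TargetLowering:
setLoadExtAction(ISD::ZEXTLOAD, MVT::i32, MVT::i8, Legal,
                 /*AddrSpace=*/0);
setLoadExtAction(ISD::ZEXTLOAD, MVT::i32, MVT::i8, Custom,
                 /*AddrSpace=*/WebAssembly::WASM_ADDRESS_SPACE_GLOBAL);

// ...and DAGCombiner's legality queries would pass the address space through:
// TLI.isLoadExtLegal(ExtType, VT, MemVT, LN->getAddressSpace());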

@sparker-arm (Contributor)

Out of curiosity, would the new GlobalISel system have handled this situation better?

Maybe, it might have better handling of address spaces due to its popularity with GPUs. But I'd expect the backend would still have some implementation work to do.

@QuantumSegfault (Author)

How about making TargetLowering consider the address space value for setLoadExtAction, isLoadExtLegal, etc? This feels like a sensible option...

Yeah, that'd be perfect! But I was under the impression that making such changes to a long-standing system, just to solve a minor issue arising from an unusual use case, would be undesirable.

Maybe, it might have better handling of address spaces due to its popularity with GPUs. But I'd expect the backend would still have some implementation work to do.

It looks like it natively supports other address spaces (you must define the pointer's address space in the LLT you declare as legal for the op), and in general it has better support for expressing complex legality rules without needing custom, optimization-breaking code.

Like, address space 0 could be marked "legal" for SEXT/ZEXT, but must be "lowered" for address space 1 (no custom logic required). Then for the LOAD itself: legal for addrspace 0, custom for addrspace 1.
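
A sketch of how those rules might read in a hypothetical wasm32 GlobalISel legalizer (illustrative only: the memory descriptors are made up, and this assumes the usual "using namespace TargetOpcode;" of legalizer files):

const LLT S8 = LLT::scalar(8);
const LLT S16 = LLT::scalar(16);
const LLT S32 = LLT::scalar(32);
const LLT P0 = LLT::pointer(0, 32); // linear memory
const LLT P1 = LLT::pointer(1, 32); // wasm globals (addrspace 1)

// Plain loads: legal from linear memory, custom (-> global.get) for
// address space 1.
getActionDefinitionsBuilder(G_LOAD)
    .legalForTypesWithMemDesc({{S32, P0, S32, 8}})
    .customFor({{S32, P1}});

// Extending loads: legal from linear memory, lowered (split into a plain
// load plus an explicit extend) for address space 1, with no custom logic.
getActionDefinitionsBuilder({G_SEXTLOAD, G_ZEXTLOAD})
    .legalForTypesWithMemDesc({{S32, P0, S8, 8}, {S32, P0, S16, 16}})
    .lowerFor({{S32, P1}});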


Are there any plans to migrate WASM to GlobalISel any time soon? If there's interest but not enough manpower, I wouldn't mind trying to set up the foundation myself (but that will still take up time from reviewers down the line). Or is it simply not worth it at all for this backend?

It seems like the plan is to eventually phase out SelectionDAG and FastISel in favor of it...or am I misunderstanding its purpose?

@dschuff (Member) commented Sep 5, 2025

Are there any plans to migrate WASM to GlobalISel any time soon? If there's interest but not enough manpower, I wouldn't mind trying to set up the foundation myself (but that will still take up time from reviewers down the line). Or is it simply not worth it at all for this backend?

It seems like the plan is to eventually phase out SelectionDAG and FastISel in favor of it...or am I misunderstanding its purpose?

Not currently. It is indeed more a matter of lack of engineering time (and it being unclear how much the benefit would be) than anything else. I wouldn't mind having someone work on this, but I expect it would be a long project to get it up to parity with SelectionDAG (although I would hope not as large as it's been for other architectures like ARM64, as the wasm backend is simpler... I'm still not sure what the status there is; probably @sparker-arm knows). I would hope we could do some investigation up front to have some confidence that there would be some benefit in the end (i.e. code quality, maintainability, etc.). None of us on the wasm side are experts on GlobalISel, but if you're serious about this, then I wouldn't mind putting in the time it would probably take for competent review :)

@dschuff (Member) commented Sep 5, 2025

Oh, and regarding an eventual full migration: IIUC, the purpose of GlobalISel is indeed to be a full replacement for SDAG, but it's a huge undertaking to migrate everything. Hard to say whether it will ever fully happen across all the backends.

@QuantumSegfault (Author) commented Sep 5, 2025

My goal would be to do it somewhat quick and dirty at first, getting a PoC going before cleaning it up.

I started on this yesterday out of curiosity, and it doesn't seem so bad yet. Got lowerReturn and lowerFormalArguments for CallLowering (first steps) almost working...though this Swift ABI is annoying me.

In some ways, WASM is simpler than other architectures due to the lack of physical registers and stack spilling.

Once I get the swifterror kink worked out, I'll open a draft PR (worst case it gets scrapped *shrug*).

@QuantumSegfault (Author)

Alright, #157161 is underway.

But that doesn't solve our issues here. I liked @sparker-arm's idea, but I'm not sure that's feasible.


I had no idea this would end up being such a rabbit hole.
