[lldb] Correctly parse Wasm segments #154727

JDevlieghere · 2025-08-21T10:41:37Z

My original implementation for parsing Wasm segments was wrong in two related ways: I had a bug in calculating the file VM address, and I didn't fully understand the difference between active and passive segments or how that difference affects their file VM address.

With this PR, we now support parsing init expressions for active segments, rather than just skipping over them. This is necessary to determine where they get loaded.

Similar to llvm-objdump, we currently only support simple opcodes (i.e. constants). We also currently do not support active segments that use a non-zero memory index. However, this covers all segments for a non-trivial Swift binary compiled to Wasm.
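
As a rough illustration of what parsing a simple init expression involves, the sketch below decodes `i32.const <SLEB128> end` (or the `i64.const` equivalent) from a raw byte buffer without any LLDB or LLVM helpers. The names `DecodeSLEB128` and `OffsetFromInitExpr` are hypothetical and not part of this patch; the opcode values (0x41, 0x42, 0x0b) are taken from the WebAssembly binary format. It is a simplified sketch, not the implementation in the diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Opcode values from the WebAssembly binary format.
constexpr uint8_t kI32Const = 0x41;
constexpr uint8_t kI64Const = 0x42;
constexpr uint8_t kEnd = 0x0b;

// Hypothetical helper: decode a signed LEB128 value starting at `pos` and
// advance `pos` past it. Returns std::nullopt if the buffer is truncated.
std::optional<int64_t> DecodeSLEB128(const std::vector<uint8_t> &bytes,
                                     size_t &pos) {
  uint64_t result = 0;
  unsigned shift = 0;
  while (pos < bytes.size()) {
    const uint8_t byte = bytes[pos++];
    if (shift < 64)
      result |= static_cast<uint64_t>(byte & 0x7f) << shift;
    shift += 7;
    if ((byte & 0x80) == 0) {
      // Last byte: sign-extend if bit 6 of the final group is set.
      if (shift < 64 && (byte & 0x40))
        result |= ~uint64_t{0} << shift;
      return static_cast<int64_t>(result);
    }
  }
  return std::nullopt;
}

// Hypothetical helper: return the constant encoded by a simple init
// expression (`i32.const N end` or `i64.const N end`). Anything else,
// including truncated input, yields std::nullopt.
std::optional<int64_t> OffsetFromInitExpr(const std::vector<uint8_t> &bytes) {
  if (bytes.empty())
    return std::nullopt;
  size_t pos = 0;
  const uint8_t opcode = bytes[pos++];
  if (opcode != kI32Const && opcode != kI64Const)
    return std::nullopt;
  std::optional<int64_t> value = DecodeSLEB128(bytes, pos);
  if (!value || pos >= bytes.size() || bytes[pos] != kEnd)
    return std::nullopt;
  return value;
}
```

For instance, the bytes `41 80 08 0b` decode to 0x400. Unlike the patch, this sketch rejects unsupported expressions outright rather than skipping past them to reach the next segment.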

llvmbot · 2025-08-21T10:42:12Z

@llvm/pr-subscribers-lldb

Author: Jonas Devlieghere (JDevlieghere)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/154727.diff

2 Files Affected:

(modified) lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp (+90-25)
(modified) lldb/test/Shell/Symtab/symtab-wasm.test (+14-12)

diff --git a/lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp b/lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
index 777b20e9bb0f6..816b39377f053 100644
--- a/lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
+++ b/lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
@@ -71,6 +71,47 @@ GetWasmString(llvm::DataExtractor &data, llvm::DataExtractor::Cursor &c) {
   return std::string(toStringRef(llvm::ArrayRef(str_storage)));
 }
 
+/// An "init expr" refers to a constant expression used to determine the initial
+/// value of certain elements within a module during instantiation. These
+/// expressions are restricted to operations that can be evaluated at module
+/// instantiation time. Currently we only support simple constant opcodes.
+static lldb::offset_t GetWasmOffsetFromInitExpr(DataExtractor &data,
+                                                lldb::offset_t &offset) {
+  lldb::offset_t init_expr_offset = LLDB_INVALID_OFFSET;
+
+  uint8_t opcode = data.GetU8(&offset);
+  switch (opcode) {
+  case llvm::wasm::WASM_OPCODE_I32_CONST:
+  case llvm::wasm::WASM_OPCODE_I64_CONST:
+    init_expr_offset = data.GetSLEB128(&offset);
+    break;
+  case llvm::wasm::WASM_OPCODE_GLOBAL_GET:
+    init_expr_offset = data.GetULEB128(&offset);
+    break;
+  case llvm::wasm::WASM_OPCODE_F32_CONST:
+  case llvm::wasm::WASM_OPCODE_F64_CONST:
+    // Not a meaningful offset.
+    data.GetFloat(&offset);
+    break;
+  case llvm::wasm::WASM_OPCODE_REF_NULL:
+    // Not a meaningful offset.
+    data.GetULEB128(&offset);
+    break;
+  }
+
+  // Make sure the opcodes we read aren't part of an extended init expr.
+  opcode = data.GetU8(&offset);
+  if (opcode == llvm::wasm::WASM_OPCODE_END)
+    return init_expr_offset;
+
+  // Extended init expressions are not supported, but we still have to parse
+  // them to skip over them and read the next segment.
+  do {
+    opcode = data.GetU8(&offset);
+  } while (opcode != llvm::wasm::WASM_OPCODE_END);
+  return LLDB_INVALID_OFFSET;
+}
+
 /// Checks whether the data buffer starts with a valid Wasm module header.
 static bool ValidateModuleHeader(const DataBufferSP &data_sp) {
   if (!data_sp || data_sp->GetByteSize() < kWasmHeaderSize)
@@ -294,10 +335,17 @@ ParseFunctions(SectionSP code_section_sp) {
 }
 
 struct WasmSegment {
-  WasmSegment(SectionSP section_sp, lldb::offset_t offset, uint32_t size)
-      : address_range(section_sp, offset, size) {};
+  enum SegmentType {
+    Active,
+    Passive,
+  };
+
   std::string name;
-  AddressRange address_range;
+  SegmentType type = Passive;
+  lldb::offset_t section_offset = LLDB_INVALID_OFFSET;
+  uint32_t size = 0;
+  uint32_t memory_index = 0;
+  lldb::offset_t init_expr_offset = 0;
 };
 
 static llvm::Expected<std::vector<WasmSegment>>
@@ -319,27 +367,34 @@ ParseData(SectionSP data_section_sp) {
     if (!flags)
       return flags.takeError();
 
+    WasmSegment segment;
+
     // Data segments have a mode that identifies them as either passive or
     // active. An active data segment copies its contents into a memory during
     // instantiation, as specified by a memory index and a constant expression
     // defining an offset into that memory.
+    segment.type = (*flags & llvm::wasm::WASM_DATA_SEGMENT_IS_PASSIVE)
+                       ? WasmSegment::Passive
+                       : WasmSegment::Active;
+
     if (*flags & llvm::wasm::WASM_DATA_SEGMENT_HAS_MEMINDEX) {
+      assert(segment.type == WasmSegment::Active);
       llvm::Expected<uint32_t> memidx = GetULEB32(data, offset);
       if (!memidx)
         return memidx.takeError();
+      segment.memory_index = *memidx;
     }
 
-    if ((*flags & llvm::wasm::WASM_DATA_SEGMENT_IS_PASSIVE) == 0) {
-      // Skip over the constant expression.
-      for (uint8_t b = 0; b != llvm::wasm::WASM_OPCODE_END;)
-        b = data.GetU8(&offset);
-    }
+    if (segment.type == WasmSegment::Active)
+      segment.init_expr_offset = GetWasmOffsetFromInitExpr(data, offset);
 
     llvm::Expected<uint32_t> segment_size = GetULEB32(data, offset);
     if (!segment_size)
       return segment_size.takeError();
 
-    segments.emplace_back(data_section_sp, offset, *segment_size);
+    segment.section_offset = offset;
+    segment.size = *segment_size;
+    segments.push_back(segment);
 
     std::optional<lldb::offset_t> next_offset =
         llvm::checkedAddUnsigned<lldb::offset_t>(offset, *segment_size);
@@ -352,7 +407,7 @@ ParseData(SectionSP data_section_sp) {
 }
 
 static llvm::Expected<std::vector<Symbol>>
-ParseNames(SectionSP name_section_sp,
+ParseNames(SectionSP name_section_sp, SectionSP data_section_sp,
            const std::vector<AddressRange> &function_ranges,
            std::vector<WasmSegment> &segments) {
   DataExtractor name_section_data;
@@ -405,12 +460,6 @@ ParseNames(SectionSP name_section_sp,
           continue;
         // Update the segment name.
         segments[i].name = *name;
-        symbols.emplace_back(
-            symbols.size(), Mangled(*name), lldb::eSymbolTypeData,
-            /*external=*/false, /*is_debug=*/false, /*is_trampoline=*/false,
-            /*is_artificial=*/false, segments[i].address_range,
-            /*size_is_valid=*/true, /*contains_linker_annotations=*/false,
-            /*flags=*/0);
       }
 
     } break;
@@ -476,7 +525,7 @@ void ObjectFileWasm::ParseSymtab(Symtab &symtab) {
   }
 
   llvm::Expected<std::vector<Symbol>> symbols =
-      ParseNames(name_section_sp, functions, segments);
+      ParseNames(name_section_sp, data_section_sp, functions, segments);
   if (!symbols) {
     LLDB_LOG_ERROR(log, symbols.takeError(), "Failed to parse Wasm names: {0}");
     return;
@@ -487,19 +536,35 @@ void ObjectFileWasm::ParseSymtab(Symtab &symtab) {
 
   lldb::user_id_t segment_id = 0;
   for (const WasmSegment &segment : segments) {
-    const lldb::addr_t segment_addr =
-        segment.address_range.GetBaseAddress().GetFileAddress();
-    const size_t segment_size = segment.address_range.GetByteSize();
+    if (segment.type == WasmSegment::Active) {
+      // FIXME: Support segments with a memory index.
+      if (segment.memory_index != 0) {
+        LLDB_LOG(log, "Skipping segment {0}: non-zero memory index is "
+                      "currently unsupported", segment.name);
+        continue;
+      }
+
+      if (segment.init_expr_offset == LLDB_INVALID_OFFSET) {
+        LLDB_LOG(log, "Skipping segment {0}: unsupported init expression", segment.name);
+        continue;
+      }
+    }
+
+    lldb::addr_t file_vm_addr =
+        segment.type == WasmSegment::Active
+            ? segment.init_expr_offset
+            : data_section_sp->GetFileOffset() + segment.section_offset;
     SectionSP segment_sp = std::make_shared<Section>(
-        /*parent_section_sp=*/data_section_sp, GetModule(),
+        GetModule(),
         /*obj_file=*/this,
         ++segment_id << 8, // 1-based segment index, shifted by 8 bits to avoid
                            // collision with section IDs.
         ConstString(segment.name), eSectionTypeData,
-        /*file_vm_addr=*/segment_addr,
-        /*vm_size=*/segment_size,
-        /*file_offset=*/segment_addr,
-        /*file_size=*/segment_size,
+        /*file_vm_addr=*/file_vm_addr,
+        /*vm_size=*/segment.size,
+        /*file_offset=*/data_section_sp->GetFileOffset() +
+            segment.section_offset,
+        /*file_size=*/segment.size,
         /*log2align=*/0, /*flags=*/0);
     m_sections_up->AddSection(segment_sp);
     GetModule()->GetSectionList()->AddSection(segment_sp);
diff --git a/lldb/test/Shell/Symtab/symtab-wasm.test b/lldb/test/Shell/Symtab/symtab-wasm.test
index 4170d9aba9eea..a3a77dd1d79aa 100644
--- a/lldb/test/Shell/Symtab/symtab-wasm.test
+++ b/lldb/test/Shell/Symtab/symtab-wasm.test
@@ -1,15 +1,17 @@
 # RUN: yaml2obj %S/Inputs/simple.wasm.yaml -o %t.wasm
-# RUN: %lldb %t.wasm -o 'image dump symtab' -o 'image dump sections' | FileCheck %s
 
-CHECK: Code 0x0000000000000002 0x0000000000000002 0x00000000 __wasm_call_ctors
-CHECK: Code 0x0000000000000005 0x0000000000000029 0x00000000 add
-CHECK: Code 0x000000000000002f 0x000000000000004c 0x00000000 __original_main
-CHECK: Code 0x000000000000007c 0x0000000000000009 0x00000000 main
-CHECK: Data 0x0000000000000233 0x0000000000000009 0x00000000 .rodata
-CHECK: Data 0x0000000000000242 0x0000000000000004 0x00000000 .data
+# RUN: %lldb %t.wasm -o 'image dump symtab' | FileCheck %s --check-prefix SYMTAB
+SYMTAB: Code 0x0000000000000002 0x0000000000000002 0x00000000 __wasm_call_ctors
+SYMTAB: Code 0x0000000000000005 0x0000000000000029 0x00000000 add
+SYMTAB: Code 0x000000000000002f 0x000000000000004c 0x00000000 __original_main
+SYMTAB: Code 0x000000000000007c 0x0000000000000009 0x00000000 main
 
-CHECK: 0x0000000000000001 code       {{.*}} 0x000001a1 0x00000085 0x00000000 symtab-wasm.test.tmp.wasm.code
-CHECK: 0x0000000000000003 data       {{.*}} 0x0000022c 0x0000001a 0x00000000 symtab-wasm.test.tmp.wasm.data
-CHECK: 0x0000000000000040 wasm-name  {{.*}} 0x00000251 0x00000059 0x00000000 symtab-wasm.test.tmp.wasm.name
-CHECK: 0x0000000000000100 data       {{.*}} 0x00000233 0x00000009 0x00000000 symtab-wasm.test.tmp.wasm.data..rodata
-CHECK: 0x0000000000000200 data       {{.*}} 0x00000242 0x00000004 0x00000000 symtab-wasm.test.tmp.wasm.data..data
+# RUN: %lldb %t.wasm -o 'image dump sections' | FileCheck %s --check-prefix SECTIONS
+SECTIONS: 0x0000000000000001 code                   [0x0000000000000000-0x0000000000000085)  ---  0x000001a1 0x00000085 0x00000000 symtab-wasm.test.tmp.wasm.code
+SECTIONS: 0x0000000000000003 data                   [0x000000000000022c-0x0000000000000246)  ---  0x0000022c 0x0000001a 0x00000000 symtab-wasm.test.tmp.wasm.data
+SECTIONS: 0x0000000000000040 wasm-name                                                       ---  0x00000251 0x00000059 0x00000000 symtab-wasm.test.tmp.wasm.name
+SECTIONS: 0x0000000000000100 data                   [0x0000000000000400-0x0000000000000409)  ---  0x00000233 0x00000009 0x00000000 symtab-wasm.test.tmp.wasm..rodata
+SECTIONS: 0x0000000000000200 data                   [0x000000000000040c-0x0000000000000410)  ---  0x00000242 0x00000004 0x00000000 symtab-wasm.test.tmp.wasm..data
+
+# RUN: %lldb %t.wasm -o 'x/s 0x0000000000000400' | FileCheck %s --check-prefix STR
+STR: "data str"
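
To see how the section addresses in the updated test output follow from the segment type, here is a minimal standalone sketch of the address selection done in `ParseSymtab`. The `Segment` struct and the `FileVMAddr`, `FileOffset`, and `SegmentSectionID` functions are hypothetical simplifications of `WasmSegment` and the expressions in the patch, not LLDB API. The sample values in the usage note (data section file offset 0x22c, payload offset 0x7, init expression value 0x400) are inferred from the `SECTIONS` lines and should be read as assumptions.

```cpp
#include <cstdint>

// Simplified, hypothetical stand-in for the patch's WasmSegment.
struct Segment {
  enum Type { Active, Passive };
  Type type = Passive;
  // Offset of the segment payload within the data section.
  uint64_t section_offset = 0;
  // Load address taken from the init expression (active segments only).
  uint64_t init_expr_offset = 0;
};

// Active segments are placed at the address given by their init expression.
// Passive segments have no load address, so fall back to the payload's
// offset in the file.
uint64_t FileVMAddr(const Segment &segment,
                    uint64_t data_section_file_offset) {
  return segment.type == Segment::Active
             ? segment.init_expr_offset
             : data_section_file_offset + segment.section_offset;
}

// The payload's file offset does not depend on the segment type.
uint64_t FileOffset(const Segment &segment,
                    uint64_t data_section_file_offset) {
  return data_section_file_offset + segment.section_offset;
}

// Segment sections use a 1-based index shifted left by 8 bits so their IDs
// do not collide with regular section IDs.
uint64_t SegmentSectionID(uint64_t one_based_index) {
  return one_based_index << 8;
}
```

With those assumed values, an active segment gets a file VM address of 0x400 and a file offset of 0x233, and the first two segments get the IDs 0x100 and 0x200, which lines up with the `.rodata` and `.data` rows above.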

My original implementation for parsing Wasm segments was wrong in two related ways: I had a bug in calculating the file VM address, and I didn't fully understand the difference between active and passive segments or how that difference affects their file VM address. With this PR, we now support parsing init expressions for active segments, rather than just skipping over them. This is necessary to determine where they get loaded. Similar to llvm-objdump, we currently only support simple opcodes (i.e. constants). We also currently do not support active segments that use a non-zero memory index. However, this covers all segments for a non-trivial Swift binary compiled to Wasm. (cherry picked from commit aadc708)

JDevlieghere requested a review from adrian-prantl August 21, 2025 10:41

llvmbot added the lldb label Aug 21, 2025

JDevlieghere force-pushed the wasm-segments-fix branch from 703e755 to defb8e0 Compare August 21, 2025 20:10

adrian-prantl approved these changes Aug 26, 2025

JDevlieghere merged commit aadc708 into llvm:main Aug 26, 2025
9 checks passed

JDevlieghere deleted the wasm-segments-fix branch August 26, 2025 21:56
