@jasonmolenda jasonmolenda commented Aug 16, 2025

I've wanted a utility for a while that creates a corefile for test purposes from a bit of memory and some register values. I've written a few API tests over the years that needed exactly this capability -- we have several one-off Mach-O corefile creator utilities in the API testsuite to do this. But that is a lot of boilerplate when all you want is to specify some register contents and memory contents for an API test.

This adds yaml2macho-core, a tool that should build on any system. It takes a yaml description of register values for one or more threads and, optionally, memory values for one or more memory regions. It can also take a list of UUIDs that will be added as LC_NOTE "load binary" metadata to the corefile, so binaries can be loaded into virtual address space in a test scenario.
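
As a rough illustration of how a test might drive the tool, here is a minimal Python sketch. The `-i`, `-o`, and `-u` flags are the ones the `yaml2macho_core` helper in this PR's `lldbtest.py` change passes; the function names and the way the tool path is supplied are assumptions made for the example, not part of the patch.

```python
import subprocess


def yaml2macho_core_command(tool, yaml_path, core_path, uuids=None):
    """Build the argument list for one yaml2macho-core run.

    `tool` is the path to the yaml2macho-core binary. `uuids` is passed
    through as a single argument, the same way the test-suite helper in
    this PR does; its exact format is not specified here.
    """
    command = [tool, "-i", yaml_path, "-o", core_path]
    if uuids is not None:
        command += ["-u", uuids]
    return command


def make_corefile(tool, yaml_path, core_path, uuids=None):
    """Run the tool; raises subprocess.CalledProcessError on failure."""
    command = yaml2macho_core_command(tool, yaml_path, core_path, uuids)
    subprocess.run(command, check=True)
```

Keeping the command construction separate from the `subprocess.run` call makes the argument list easy to check without having the binary built.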

The format of the yaml file looks like

cpu: armv7m
endian: little
threads:
  - regsets:
      - flavor: gpr
        registers: [{name: sp, value: 0x2000fe70}, {name: r7, value: 0x2000fe80},
                    {name: pc, value: 0x0020392c}, {name: lr, value: 0x0020392d}]

memory-regions:
  # stack memory
  - addr: 0x2000fe70 
    UInt32: [ 0x0000002a, 0x20010e58, 0x00203923, 
              0x00000001, 0x2000fe88, 0x00203911, 
              0x2000ffdc, 0xfffffff9 ]
  # instructions of a function
  - addr: 0x203910
    UInt8: [ 0xf8, 0xb5, 0x04, 0xaf, 0x06, 0x4c, 0x07, 0x49,
             0x74, 0xf0, 0x2e, 0xf8, 0x01, 0xac, 0x74, 0xf0 ]

and that's all that is needed to specify a corefile where four register values are specified (the others will be set to 0), and two memory regions will be emitted.

The memory can be specified as an array of UInt8, UInt32, or UInt64. I anticipate that some of these corefiles may have stack values constructed manually, and it may be simpler for a human to write them in a particular grouping of values.
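
To make the grouping point concrete, here is a small Python sketch showing that the same little-endian memory contents can be written either as two UInt32 values or as eight UInt8 values. It assumes each value is emitted in the target byte order with no padding between values; `pack_region` is a hypothetical helper for illustration, not the tool's code.

```python
import struct


def pack_region(values, width, endian="little"):
    """Pack integers of `width` bytes each into one contiguous byte string."""
    code = {1: "B", 4: "I", 8: "Q"}[width]  # UInt8, UInt32, UInt64
    prefix = "<" if endian == "little" else ">"
    return struct.pack(f"{prefix}{len(values)}{code}", *values)


# The first two stack words from the example above, written two ways.
as_words = pack_region([0x0000002A, 0x20010E58], width=4)
as_bytes = pack_region([0x2A, 0x00, 0x00, 0x00, 0x58, 0x0E, 0x01, 0x20], width=1)
assert as_words == as_bytes
```

The UInt32 form reads like a stack dump, while the UInt8 form suits instruction bytes copied from a disassembly.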

I needed this utility for an upcoming patch for ARM Cortex-M processors, to create a test for the change. I took the opportunity to remove two of the "trivial mach-o corefile" creator utilities I've written in the past, which also restricted the tests to only run on Darwin systems because I was using the system headers for Mach-O constant values.

rdar://110663219

I've wanted a utility for a while that creates a corefile for test
purposes from a bit of memory and some register values.  I've written
a few API tests over the years that needed exactly this capability --
we have several one-off Mach-O corefile creator utilities in the API
testsuite to do this.  But that is a lot of boilerplate when all you
want is to specify some register contents and memory contents for an
API test.

This adds yaml2macho-core, a tool that should build on any system.
It takes a yaml description of register values for one or more
threads and, optionally, memory values for one or more memory
regions.  It can also take a list of UUIDs that will be added as
LC_NOTE "load binary" metadata to the corefile, so binaries can be
loaded into virtual address space in a test scenario.

The format of the yaml file looks like

cpu: armv7m
endian: little
threads:
  - regsets:
      - flavor: gpr
        registers: [{name: sp, value: 0x2000fe70}, {name: r7, value: 0x2000fe80},
                    {name: pc, value: 0x0020392c}, {name: lr, value: 0x0020392d}]

memory-regions:
  - addr: 0x2000fe70
    UInt32: [
      0x0000002a, 0x20010e58, 0x00203923, 0x00000001,
      0x2000fe88, 0x00203911, 0x2000ffdc, 0xfffffff9
    ]
  - addr: 0x203910
    UInt8: [
      0xf8, 0xb5, 0x04, 0xaf, 0x06, 0x4c, 0x07, 0x49,
      0x74, 0xf0, 0x2e, 0xf8, 0x01, 0xac, 0x74, 0xf0
    ]

and that's all that is needed to specify a corefile where four register
values are specified (the others will be set to 0), and two memory
regions will be emitted.

The memory can be specified as an array of UInt8, UInt32, or UInt64.
I anticipate that some of these corefiles may have stack values
constructed manually, and it may be simpler for a human to write
them in a particular grouping of values.

Accepting "endian" is probably a boondoggle that won't ever come
to any use, and honestly I don't 100% know what the correct byte
layout would be for a big endian Mach-O file any more.  In a RISC-V
discussion a month ago, it was noted that register byte layout will
be little endian even when there is a big endian defined format for
RV, so memory would be byteswapped but registers would not.  It may
have been better not to pretend to support this, but on the other
hand it might be neat to be able to generate a big endian test case
simply.
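
For reference, here is a short Python sketch of what the byte order choice changes for a single 32-bit memory word. It only shows the two encodings of one value. It does not settle what a big endian Mach-O corefile should look like, or whether register contents should be swapped. `encode_u32` is a hypothetical helper.

```python
import struct


def encode_u32(value, endian):
    """Encode one 32-bit value in the requested byte order."""
    prefix = "<" if endian == "little" else ">"
    return struct.pack(prefix + "I", value)


word = 0x2000FE88
print(encode_u32(word, "little").hex())  # 88fe0020
print(encode_u32(word, "big").hex())     # 2000fe88
```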

I needed this utility for an upcoming patch for ARM Cortex-M processors,
to create a test for the change.  I took the opportunity to remove two
of the "trivial mach-o corefile" creator utilities I've written in the
past, which also restricted the tests to only run on Darwin systems
because I was using the system headers for Mach-O constant values.

rdar://110663219

llvmbot commented Aug 16, 2025

@llvm/pr-subscribers-lldb

@llvm/pr-subscribers-backend-risc-v

Author: Jason Molenda (jasonmolenda)

Changes


Patch is 64.83 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/153911.diff

28 Files Affected:

  • (modified) lldb/packages/Python/lldbsuite/test/configuration.py (+10)
  • (modified) lldb/packages/Python/lldbsuite/test/dotest.py (+1)
  • (modified) lldb/packages/Python/lldbsuite/test/lldbtest.py (+15)
  • (modified) lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.cpp (+36)
  • (modified) lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.h (+3)
  • (removed) lldb/test/API/macosx/arm-corefile-regctx/Makefile (-6)
  • (modified) lldb/test/API/macosx/arm-corefile-regctx/TestArmMachoCorefileRegctx.py (+6-11)
  • (added) lldb/test/API/macosx/arm-corefile-regctx/arm64.yaml (+31)
  • (added) lldb/test/API/macosx/arm-corefile-regctx/armv7m.yaml (+37)
  • (removed) lldb/test/API/macosx/arm-corefile-regctx/create-arm-corefiles.cpp (-266)
  • (removed) lldb/test/API/macosx/riscv32-corefile/Makefile (-7)
  • (modified) lldb/test/API/macosx/riscv32-corefile/TestRV32MachOCorefile.py (+13-5)
  • (removed) lldb/test/API/macosx/riscv32-corefile/create-empty-riscv-corefile.cpp (-116)
  • (added) lldb/test/API/macosx/riscv32-corefile/riscv32-registers.yaml (+47)
  • (modified) lldb/tools/CMakeLists.txt (+2)
  • (added) lldb/tools/yaml2macho-core/CMakeLists.txt (+13)
  • (added) lldb/tools/yaml2macho-core/CoreSpec.h (+56)
  • (added) lldb/tools/yaml2macho-core/LCNoteWriter.cpp (+68)
  • (added) lldb/tools/yaml2macho-core/LCNoteWriter.h (+23)
  • (added) lldb/tools/yaml2macho-core/MemoryWriter.cpp (+57)
  • (added) lldb/tools/yaml2macho-core/MemoryWriter.h (+22)
  • (added) lldb/tools/yaml2macho-core/ThreadWriter.cpp (+190)
  • (added) lldb/tools/yaml2macho-core/ThreadWriter.h (+19)
  • (added) lldb/tools/yaml2macho-core/Utility.cpp (+57)
  • (added) lldb/tools/yaml2macho-core/Utility.h (+23)
  • (added) lldb/tools/yaml2macho-core/main.cpp (+223)
  • (added) lldb/tools/yaml2macho-core/yaml2corespec.cpp (+131)
  • (added) lldb/tools/yaml2macho-core/yaml2corespec.h (+16)
diff --git a/lldb/packages/Python/lldbsuite/test/configuration.py b/lldb/packages/Python/lldbsuite/test/configuration.py
index 5e3810992d172..1a9f25d66843a 100644
--- a/lldb/packages/Python/lldbsuite/test/configuration.py
+++ b/lldb/packages/Python/lldbsuite/test/configuration.py
@@ -64,6 +64,9 @@
 # Path to the yaml2obj tool. Not optional.
 yaml2obj = None
 
+# Path to the yaml2macho-core tool. Not optional.
+yaml2macho_core = None
+
 # The arch might dictate some specific CFLAGS to be passed to the toolchain to build
 # the inferior programs.  The global variable cflags_extras provides a hook to do
 # just that.
@@ -174,3 +177,10 @@ def get_yaml2obj_path():
     """
     if yaml2obj and os.path.lexists(yaml2obj):
         return yaml2obj
+
+def get_yaml2macho_core_path():
+    """
+    Get the path to the yaml2macho-core tool.
+    """
+    if yaml2macho_core and os.path.lexists(yaml2macho_core):
+        return yaml2macho_core
diff --git a/lldb/packages/Python/lldbsuite/test/dotest.py b/lldb/packages/Python/lldbsuite/test/dotest.py
index 47a3c2ed2fc9d..89b6807b41075 100644
--- a/lldb/packages/Python/lldbsuite/test/dotest.py
+++ b/lldb/packages/Python/lldbsuite/test/dotest.py
@@ -280,6 +280,7 @@ def parseOptionsAndInitTestdirs():
         configuration.llvm_tools_dir = args.llvm_tools_dir
         configuration.filecheck = shutil.which("FileCheck", path=args.llvm_tools_dir)
         configuration.yaml2obj = shutil.which("yaml2obj", path=args.llvm_tools_dir)
+        configuration.yaml2macho_core = shutil.which("yaml2macho-core", path=args.llvm_tools_dir)
 
     if not configuration.get_filecheck_path():
         logging.warning("No valid FileCheck executable; some tests may fail...")
diff --git a/lldb/packages/Python/lldbsuite/test/lldbtest.py b/lldb/packages/Python/lldbsuite/test/lldbtest.py
index 0fc85fcc4d2d6..599b019f0df8c 100644
--- a/lldb/packages/Python/lldbsuite/test/lldbtest.py
+++ b/lldb/packages/Python/lldbsuite/test/lldbtest.py
@@ -1702,6 +1702,21 @@ def yaml2obj(self, yaml_path, obj_path, max_size=None):
             command += ["--max-size=%d" % max_size]
         self.runBuildCommand(command)
 
+    def yaml2macho_core(self, yaml_path, obj_path, uuids=None):
+        """
+        Create a Mach-O corefile at the given path from a yaml file.
+
+        Throws subprocess.CalledProcessError if the object could not be created.
+        """
+        yaml2macho_core_bin = configuration.get_yaml2macho_core_path()
+        if not yaml2macho_core_bin:
+            self.assertTrue(False, "No valid yaml2macho-core executable specified")
+        if uuids != None:
+          command = [yaml2macho_core_bin, "-i", yaml_path, "-o", obj_path, "-u", uuids]
+        else:
+          command = [yaml2macho_core_bin, "-i", yaml_path, "-o", obj_path]
+        self.runBuildCommand(command)
+
     def cleanup(self, dictionary=None):
         """Platform specific way to do cleanup after build."""
         module = builder_module()
diff --git a/lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.cpp b/lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.cpp
index cb8ba05d461d4..0aff98078120e 100644
--- a/lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.cpp
+++ b/lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.cpp
@@ -12,6 +12,7 @@
 #include "lldb/Core/PluginManager.h"
 #include "lldb/Core/Section.h"
 #include "lldb/Symbol/Symbol.h"
+#include "lldb/Target/Target.h"
 #include "lldb/Utility/LLDBLog.h"
 #include "lldb/Utility/Log.h"
 #include "llvm/ADT/DenseSet.h"
@@ -233,6 +234,41 @@ void ObjectFileJSON::CreateSections(SectionList &unified_section_list) {
   }
 }
 
+bool ObjectFileJSON::SetLoadAddress(Target &target, lldb::addr_t value,
+                                    bool value_is_offset) {
+  Log *log(GetLog(LLDBLog::DynamicLoader));
+  if (!m_sections_up)
+    return true;
+
+  const bool warn_multiple = true;
+
+  addr_t slide = value;
+  if (!value_is_offset) {
+    addr_t lowest_addr = LLDB_INVALID_ADDRESS;
+    for (const SectionSP &section_sp : *m_sections_up) {
+      addr_t section_load_addr = section_sp->GetFileAddress();
+      lowest_addr = std::min(lowest_addr, section_load_addr);
+    }
+    if (lowest_addr == LLDB_INVALID_ADDRESS)
+      return false;
+    slide = value - lowest_addr;
+  }
+
+  // Apply slide to each section's file address.
+  for (const SectionSP &section_sp : *m_sections_up) {
+    addr_t section_load_addr = section_sp->GetFileAddress();
+    if (section_load_addr != LLDB_INVALID_ADDRESS) {
+      LLDB_LOGF(
+          log,
+          "ObjectFileJSON::SetLoadAddress section %s to load addr 0x%" PRIx64,
+          section_sp->GetName().AsCString(), section_load_addr + slide);
+      target.SetSectionLoadAddress(section_sp, section_load_addr + slide,
+                                   warn_multiple);
+    }
+  }
+  return true;
+}
+
 bool ObjectFileJSON::MagicBytesMatch(DataBufferSP data_sp,
                                      lldb::addr_t data_offset,
                                      lldb::addr_t data_length) {
diff --git a/lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.h b/lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.h
index b72565f468862..029c8ff188934 100644
--- a/lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.h
+++ b/lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.h
@@ -86,6 +86,9 @@ class ObjectFileJSON : public ObjectFile {
 
   Strata CalculateStrata() override { return eStrataUser; }
 
+  bool SetLoadAddress(Target &target, lldb::addr_t value,
+                      bool value_is_offset) override;
+
   static bool MagicBytesMatch(lldb::DataBufferSP data_sp, lldb::addr_t offset,
                               lldb::addr_t length);
 
diff --git a/lldb/test/API/macosx/arm-corefile-regctx/Makefile b/lldb/test/API/macosx/arm-corefile-regctx/Makefile
deleted file mode 100644
index e1d0354441cd4..0000000000000
--- a/lldb/test/API/macosx/arm-corefile-regctx/Makefile
+++ /dev/null
@@ -1,6 +0,0 @@
-MAKE_DSYM := NO
-
-CXX_SOURCES := create-arm-corefiles.cpp
-
-include Makefile.rules
-
diff --git a/lldb/test/API/macosx/arm-corefile-regctx/TestArmMachoCorefileRegctx.py b/lldb/test/API/macosx/arm-corefile-regctx/TestArmMachoCorefileRegctx.py
index 6754288a65e1a..a2890cdfeaa44 100644
--- a/lldb/test/API/macosx/arm-corefile-regctx/TestArmMachoCorefileRegctx.py
+++ b/lldb/test/API/macosx/arm-corefile-regctx/TestArmMachoCorefileRegctx.py
@@ -13,20 +13,14 @@
 class TestArmMachoCorefileRegctx(TestBase):
     NO_DEBUG_INFO_TESTCASE = True
 
-    @skipUnlessDarwin
-    def setUp(self):
-        TestBase.setUp(self)
-        self.build()
-        self.create_corefile = self.getBuildArtifact("a.out")
-        self.corefile = self.getBuildArtifact("core")
-
     def test_armv7_corefile(self):
         ### Create corefile
-        retcode = call(self.create_corefile + " armv7 " + self.corefile, shell=True)
+        corefile = self.getBuildArtifact("core")
+        self.yaml2macho_core("armv7m.yaml", corefile)
 
         target = self.dbg.CreateTarget("")
         err = lldb.SBError()
-        process = target.LoadCore(self.corefile)
+        process = target.LoadCore(corefile)
         self.assertTrue(process.IsValid())
         thread = process.GetSelectedThread()
         frame = thread.GetSelectedFrame()
@@ -50,11 +44,12 @@ def test_armv7_corefile(self):
 
     def test_arm64_corefile(self):
         ### Create corefile
-        retcode = call(self.create_corefile + " arm64 " + self.corefile, shell=True)
+        corefile = self.getBuildArtifact("core")
+        self.yaml2macho_core("arm64.yaml", corefile)
 
         target = self.dbg.CreateTarget("")
         err = lldb.SBError()
-        process = target.LoadCore(self.corefile)
+        process = target.LoadCore(corefile)
         self.assertTrue(process.IsValid())
         thread = process.GetSelectedThread()
         frame = thread.GetSelectedFrame()
diff --git a/lldb/test/API/macosx/arm-corefile-regctx/arm64.yaml b/lldb/test/API/macosx/arm-corefile-regctx/arm64.yaml
new file mode 100644
index 0000000000000..4c23b69302a02
--- /dev/null
+++ b/lldb/test/API/macosx/arm-corefile-regctx/arm64.yaml
@@ -0,0 +1,31 @@
+cpu: arm64
+endian: little
+threads:
+  # (lldb) reg read
+  # % pbpaste | grep = | sed 's, ,,g' | awk -F= '{print "{name: " $1 ", value: " $2 "},"}'
+  - regsets:
+      - flavor: gpr
+        registers: [
+           {name: x0, value: 0x0000000000000001}, {name: x1, value: 0x000000016fdff3c0},
+           {name: x2, value: 0x000000016fdff3d0}, {name: x3, value: 0x000000016fdff510},
+           {name: x4, value: 0x0000000000000000}, {name: x5, value: 0x0000000000000000},
+           {name: x6, value: 0x0000000000000000}, {name: x7, value: 0x0000000000000000},
+           {name: x8, value: 0x000000010000d910}, {name: x9, value: 0x0000000000000001},
+           {name: x10, value: 0xe1e88de000000000}, {name: x11, value: 0x0000000000000003},
+           {name: x12, value: 0x0000000000000148}, {name: x13, value: 0x0000000000004000},
+           {name: x14, value: 0x0000000000000008}, {name: x15, value: 0x0000000000000000},
+           {name: x16, value: 0x0000000000000000}, {name: x17, value: 0x0000000100003f5c},
+           {name: x18, value: 0x0000000000000000}, {name: x19, value: 0x0000000100003f5c},
+           {name: x20, value: 0x000000010000c000}, {name: x21, value: 0x000000010000d910},
+           {name: x22, value: 0x000000016fdff250}, {name: x23, value: 0x000000018ce12366},
+           {name: x24, value: 0x000000016fdff1d0}, {name: x25, value: 0x0000000000000001},
+           {name: x26, value: 0x0000000000000000}, {name: x27, value: 0x0000000000000000},
+           {name: x28, value: 0x0000000000000000}, {name: fp, value: 0x000000016fdff3a0},
+           {name: lr, value: 0x000000018cd97f28}, {name: sp, value: 0x000000016fdff140},
+           {name: pc, value: 0x0000000100003f5c}, {name: cpsr, value: 0x80001000}
+        ]
+      - flavor: exc
+        registers: [ {name: far, value: 0x0000000100003f5c}, 
+                     {name: esr, value: 0xf2000000}, 
+                     {name: exception, value: 0x00000000}
+                   ]
diff --git a/lldb/test/API/macosx/arm-corefile-regctx/armv7m.yaml b/lldb/test/API/macosx/arm-corefile-regctx/armv7m.yaml
new file mode 100644
index 0000000000000..1351056ed0999
--- /dev/null
+++ b/lldb/test/API/macosx/arm-corefile-regctx/armv7m.yaml
@@ -0,0 +1,37 @@
+cpu: armv7m
+endian: little
+threads:
+  # (lldb) reg read
+  # % pbpaste | grep = | sed 's, ,,g' | awk -F= '{print "{name: " $1 ", value: " $2 "},"}'
+  - regsets:
+      - flavor: gpr
+        registers: [
+          {name: r0, value: 0x00010000}, {name: r1, value: 0x00020000},
+          {name: r2, value: 0x00030000}, {name: r3, value: 0x00040000},
+          {name: r4, value: 0x00050000}, {name: r5, value: 0x00060000},
+          {name: r6, value: 0x00070000}, {name: r7, value: 0x00080000},
+          {name: r8, value: 0x00090000}, {name: r9, value: 0x000a0000},
+          {name: r10, value: 0x000b0000}, {name: r11, value: 0x000c0000},
+          {name: r12, value: 0x000d0000}, {name: sp, value: 0x000e0000},
+          {name: lr, value: 0x000f0000}, {name: pc, value: 0x00100000},
+          {name: cpsr, value: 0x00110000}
+        ]
+      - flavor: exc
+        registers: [ {name: far, value: 0x00003f5c},
+                     {name: esr, value: 0xf2000000},
+                     {name: exception, value: 0x00000000}
+                   ]
+
+memory-regions:
+  # $sp is 0x000e0000, have bytes surrounding that address
+  - addr: 0x000dffe0
+    UInt8: [
+            0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08,
+            0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11,
+            0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a,
+            0x1b, 0x1c, 0x1d, 0x1e, 0x1f, 0x20, 0x21, 0x22, 0x23,
+            0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c,
+            0x2d, 0x2e, 0x2f, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35,
+            0x36, 0x37, 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 
+            0x3f
+           ]
diff --git a/lldb/test/API/macosx/arm-corefile-regctx/create-arm-corefiles.cpp b/lldb/test/API/macosx/arm-corefile-regctx/create-arm-corefiles.cpp
deleted file mode 100644
index db39f12ecfb7e..0000000000000
--- a/lldb/test/API/macosx/arm-corefile-regctx/create-arm-corefiles.cpp
+++ /dev/null
@@ -1,266 +0,0 @@
-#include <mach-o/loader.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <string>
-#include <vector>
-
-
-// Normally these are picked up by including <mach/thread_status.h>
-// but that does a compile time check for the build host arch and
-// only defines the ARM register context constants when building on
-// an arm system.  We're creating fake corefiles, and might be
-// creating them on an intel system.
-#ifndef ARM_THREAD_STATE
-#define ARM_THREAD_STATE 1
-#endif
-#ifndef ARM_THREAD_STATE_COUNT
-#define ARM_THREAD_STATE_COUNT 17
-#endif
-#ifndef ARM_EXCEPTION_STATE
-#define ARM_EXCEPTION_STATE 3
-#endif
-#ifndef ARM_EXCEPTION_STATE_COUNT
-#define ARM_EXCEPTION_STATE_COUNT 3
-#endif
-#ifndef ARM_THREAD_STATE64
-#define ARM_THREAD_STATE64 6
-#endif
-#ifndef ARM_THREAD_STATE64_COUNT
-#define ARM_THREAD_STATE64_COUNT 68
-#endif
-#ifndef ARM_EXCEPTION_STATE64
-#define ARM_EXCEPTION_STATE64 7
-#endif
-#ifndef ARM_EXCEPTION_STATE64_COUNT
-#define ARM_EXCEPTION_STATE64_COUNT 4
-#endif
-
-union uint32_buf {
-  uint8_t bytebuf[4];
-  uint32_t val;
-};
-
-union uint64_buf {
-  uint8_t bytebuf[8];
-  uint64_t val;
-};
-
-void add_uint64(std::vector<uint8_t> &buf, uint64_t val) {
-  uint64_buf conv;
-  conv.val = val;
-  for (int i = 0; i < 8; i++)
-    buf.push_back(conv.bytebuf[i]);
-}
-
-void add_uint32(std::vector<uint8_t> &buf, uint32_t val) {
-  uint32_buf conv;
-  conv.val = val;
-  for (int i = 0; i < 4; i++)
-    buf.push_back(conv.bytebuf[i]);
-}
-
-std::vector<uint8_t> armv7_lc_thread_load_command() {
-  std::vector<uint8_t> data;
-  add_uint32(data, LC_THREAD);              // thread_command.cmd
-  add_uint32(data, 104);                    // thread_command.cmdsize
-  add_uint32(data, ARM_THREAD_STATE);       // thread_command.flavor
-  add_uint32(data, ARM_THREAD_STATE_COUNT); // thread_command.count
-  add_uint32(data, 0x00010000);             // r0
-  add_uint32(data, 0x00020000);             // r1
-  add_uint32(data, 0x00030000);             // r2
-  add_uint32(data, 0x00040000);             // r3
-  add_uint32(data, 0x00050000);             // r4
-  add_uint32(data, 0x00060000);             // r5
-  add_uint32(data, 0x00070000);             // r6
-  add_uint32(data, 0x00080000);             // r7
-  add_uint32(data, 0x00090000);             // r8
-  add_uint32(data, 0x000a0000);             // r9
-  add_uint32(data, 0x000b0000);             // r10
-  add_uint32(data, 0x000c0000);             // r11
-  add_uint32(data, 0x000d0000);             // r12
-  add_uint32(data, 0x000e0000);             // sp
-  add_uint32(data, 0x000f0000);             // lr
-  add_uint32(data, 0x00100000);             // pc
-  add_uint32(data, 0x00110000);             // cpsr
-
-  add_uint32(data, ARM_EXCEPTION_STATE);       // thread_command.flavor
-  add_uint32(data, ARM_EXCEPTION_STATE_COUNT); // thread_command.count
-  add_uint32(data, 0x00003f5c);                // far
-  add_uint32(data, 0xf2000000);                // esr
-  add_uint32(data, 0x00000000);                // exception
-
-  return data;
-}
-
-std::vector<uint8_t> arm64_lc_thread_load_command() {
-  std::vector<uint8_t> data;
-  add_uint32(data, LC_THREAD);                // thread_command.cmd
-  add_uint32(data, 312);                      // thread_command.cmdsize
-  add_uint32(data, ARM_THREAD_STATE64);       // thread_command.flavor
-  add_uint32(data, ARM_THREAD_STATE64_COUNT); // thread_command.count
-  add_uint64(data, 0x0000000000000001);       // x0
-  add_uint64(data, 0x000000016fdff3c0);       // x1
-  add_uint64(data, 0x000000016fdff3d0);       // x2
-  add_uint64(data, 0x000000016fdff510);       // x3
-  add_uint64(data, 0x0000000000000000);       // x4
-  add_uint64(data, 0x0000000000000000);       // x5
-  add_uint64(data, 0x0000000000000000);       // x6
-  add_uint64(data, 0x0000000000000000);       // x7
-  add_uint64(data, 0x000000010000d910);       // x8
-  add_uint64(data, 0x0000000000000001);       // x9
-  add_uint64(data, 0xe1e88de000000000);       // x10
-  add_uint64(data, 0x0000000000000003);       // x11
-  add_uint64(data, 0x0000000000000148);       // x12
-  add_uint64(data, 0x0000000000004000);       // x13
-  add_uint64(data, 0x0000000000000008);       // x14
-  add_uint64(data, 0x0000000000000000);       // x15
-  add_uint64(data, 0x0000000000000000);       // x16
-  add_uint64(data, 0x0000000100003f5c);       // x17
-  add_uint64(data, 0x0000000000000000);       // x18
-  add_uint64(data, 0x0000000100003f5c);       // x19
-  add_uint64(data, 0x000000010000c000);       // x20
-  add_uint64(data, 0x000000010000d910);       // x21
-  add_uint64(data, 0x000000016fdff250);       // x22
-  add_uint64(data, 0x000000018ce12366);       // x23
-  add_uint64(data, 0x000000016fdff1d0);       // x24
-  add_uint64(data, 0x0000000000000001);       // x25
-  add_uint64(data, 0x0000000000000000);       // x26
-  add_uint64(data, 0x0000000000000000);       // x27
-  add_uint64(data, 0x0000000000000000);       // x28
-  add_uint64(data, 0x000000016fdff3a0);       // fp
-  add_uint64(data, 0x000000018cd97f28);       // lr
-  add_uint64(data, 0x000000016fdff140);       // sp
-  add_uint64(data, 0x0000000100003f5c);       // pc
-  add_uint32(data, 0x80001000);               // cpsr
-
-  add_uint32(data, 0x00000000); // padding
-
-  add_uint32(data, ARM_EXCEPTION_STATE64);       // thread_command.flavor
-  add_uint32(data, ARM_EXCEPTION_STATE64_COUNT); // thread_command.count
-  add_uint64(data, 0x0000000100003f5c);          // far
-  add_uint32(data, 0xf2000000);                  // esr
-  add_uint32(data, 0x00000000);                  // exception
-
-  return data;
-}
-
-std::vector<uint8_t> lc_segment(uint32_t fileoff,
-                                uint32_t lc_segment_data_size) {
-  std::vector<uint8_t> data;
-  // 0x000e0000 is the value of $sp in the armv7 LC_THREAD
-  uint32_t start_vmaddr = 0x000e0000 - (lc_segment_data_size / 2);
-  add_uint32(data, LC_SEGMENT);                     // segment_command.cmd
-  add_uint32(data, sizeof(struct segment_command)); // segment_command.cmdsize
-  for (int i = 0; i < 16; i++)
-    data.push_back(0);                    // segment_command.segname[16]
-  add_uint32(data, start_vmaddr);         // segment_command.vmaddr
-  add_uint32(data, lc_segment_data_size); // segment_command.vmsize
-  add_uint32(data, fileoff);              // segment_command.fileoff
-  add_uint32(data, lc_segment_data_size); // segment_command.filesize
-  add_uint32(data, 3);                    // segment_command.maxprot
-  add_uint32(data, 3);                    // segment_command.initprot
-  add_uint32(data, 0);                    // segment_command.nsects
-  add_uint32(data, 0);                    // segment_command.flags
-
-  return data;
-}
-
-enum arch { unspecified, armv7, arm64 };
-
-int main(int argc, char **argv) {
-  if (argc != 3) {
-    fprintf(stderr,
-            "usage: create-arm-corefiles [armv7|arm64] <output-core-name>\n");
-    exit(1);
-  }
-
-  arch arch = unspecified;
-
-  if (strcmp(argv[1], "armv7") == 0)
-    arch = armv7;
-  else if (strcmp(argv[1], "arm64") == 0)
-    arch = arm64;
-  else {
-    fprintf(stderr, "unrecognized architecture %s\n", argv[1]);
-    exit(1);
-  }
-
-  // An array of load commands (in the form of byte arrays)
-  std::vector<std::vector<uint8_t>> load_commands;
-
-  // An array of corefile contents (page data, lc_note data, etc)
-  std::vector<uint8_t> payload;
-
-  // First add all the load commands / payload so we can figure out how large
-  // the load commands will actually be.
-  if (arch == armv7) {
-    load_commands.push_back(armv7_lc_thread_load_command());
-    load_commands.push_back(lc_segment(0, 0));
-  } else if (arch == arm64) {
-    load_commands.push_back(arm64_lc_thread_load_command());
-  }
-
-  int size_of_load_commands = 0;
-  for (const auto &lc : load_commands)
-    size_of_load_commands += lc.size();
-
-  int header_and_load_cmd_room =
-      sizeof(struct mach_header_64) + size_of_load_commands;
-
-  // Erase the load commands / payload now that we know how much space...
[truncated]

github-actions bot commented Aug 16, 2025

✅ With the latest revision this PR passed the Python code formatter.

jasonmolenda commented

One thing I wasn't thrilled about with llvm's yaml MappingTraits parser is that I need to define register values like

        registers: [
           {name: x0, value: 0x0000000000000001}, {name: x1, value: 0x000000016fdff3c0},
           {name: x2, value: 0x000000016fdff3d0}, {name: x3, value: 0x000000016fdff510},

instead of a more natural style of registers = { "x0": 0x1, "x1": 0x16fdff3c0, "x2": 0x16fdff3d0} or so.
At least I couldn't figure out how to do this. It makes the yaml descriptions noisier than they really need to be.
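
Until the parser accepts a terser form, one workaround is to generate the verbose list from a plain mapping. The following Python sketch assumes only the `{name: ..., value: ...}` entry format shown above; `registers_to_yaml` and its parameters are hypothetical and not part of the patch.

```python
def registers_to_yaml(registers, width_bytes=8, indent=8):
    """Render {"x0": 0x1, ...} as the verbose registers list the tool reads.

    `width_bytes` controls zero-padding of the hex values (8 for arm64,
    4 for armv7m); `indent` is the column of the `registers:` key.
    """
    digits = width_bytes * 2
    pad = " " * indent
    entries = [
        f"{{name: {name}, value: 0x{value:0{digits}x}}}"
        for name, value in registers.items()
    ]
    body = (",\n" + pad + "  ").join(entries)
    return f"{pad}registers: [\n{pad}  {body}\n{pad}]"


print(registers_to_yaml({"x0": 0x1, "x1": 0x16FDFF3C0, "x2": 0x16FDFF3D0}))
```

The mapping keeps insertion order, so the registers come out in the order they were written.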

jasonmolenda commented

The Linux PR pre-merge testing is failing because lldb/tools/yaml2macho-core is not being built. I think I need to add a dependency, maybe in test/CMakeLists.txt. I was doing all of my development with a plain ninja invocation that builds everything, but I probably need to have this added to the lldb or check-lldb-api targets.

jasonmolenda added a commit to jasonmolenda/llvm-project that referenced this pull request Aug 16, 2025
When a processor faults/is interrupted/gets an exception, it will
stop running code and jump to an exception catcher routine.  Most
processors will store the pc that was executing in a system register,
and the catcher functions have special instructions to retrieve
that & possibly other registers.  It may then save those values to
stack, and the author can add .cfi directives to tell lldb's unwinder
where to find those saved values.

ARM Cortex-M (microcontroller) processors have a simpler mechanism
where a fixed set of registers are saved to the stack on an exception,
and a unique value is put in the link register to indicate to the
caller that this has taken place.  No special handling needs to be
written into the exception catcher, unless it wants to inspect these
preserved values.  And it is possible for a general stack walker to
walk the stack with no special knowledge about what the catch function
does.
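
To make the mechanism above concrete, here is a minimal sketch of how a stack walker might recognize the marker value in the link register and read back the hardware-saved registers. It is not code from this patch. The names `ExceptionFrame`, `IsExceptionReturn`, and `ReadExceptionFrame` are hypothetical, the top-byte test for the marker is a simplifying assumption, and the layout shown is the basic eight-word frame with no floating-point state or alignment padding.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Registers the hardware pushes on exception entry (basic frame, no FP
// state), in ascending address order starting at the stacked SP.
struct ExceptionFrame {
  uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

// Assumption: any LR whose top byte is 0xFF (e.g. 0xfffffff9) is treated as
// an exception-return marker rather than a real return address.
inline bool IsExceptionReturn(uint32_t lr) {
  return (lr & 0xff000000u) == 0xff000000u;
}

// Hypothetical helper: given the LR value and the eight words found at the
// stacked SP, return the interrupted register state, or nothing when LR is
// an ordinary return address.
inline std::optional<ExceptionFrame>
ReadExceptionFrame(uint32_t lr, const std::array<uint32_t, 8> &stack) {
  if (!IsExceptionReturn(lr))
    return std::nullopt;
  return ExceptionFrame{stack[0], stack[1], stack[2], stack[3],
                        stack[4], stack[5], stack[6], stack[7]};
}
```

With a marker such as 0xfffffff9 in LR, the interrupted function's pc comes from the seventh stacked word, and no knowledge of the handler's own code is needed.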

This patch adds an Architecture plugin method to allow an Architecture
to override/augment the UnwindPlan that lldb would use for a stack
frame, given the contents of the return address register.  It
resembles a feature where the LanguageRuntime can replace/augment
the unwind plan for a function, but it is doing it offset by one
level.  The LanguageRuntime is looking at the local register context
and/or symbol name to decide if it will override the unwind rules.
For the Cortex-M exception unwinds, we need to modify THIS frame's
unwind plan if the CALLER's LR had a specific value.  RegisterContextUnwind
has to retrieve the caller's LR value before it has completely
decided on the UnwindPlan it will use for THIS stack frame.

This does mean that we will need one more read of stack memory
than we currently use when unwinding.  The unwinder walks the stack
lazily, as stack frames are requested, and so now if you ask for 2
stack frames, we will read enough stack to walk 2 frames, plus we
will read one extra word of memory, the spilled RA value from the
stack (see RegisterContextUnwind::AdoptArchitectureUnwindPlan()).
In practice, with 512-byte memory cache reads, this is unlikely to be
a problem, but I'm wondering if I should add an Architecture method
that answers "does this Architecture implement `GetArchitectureUnwindPlan`"
-- and only do the memory read if it does.  So the performance
impact would be limited to armv7/Cortex-M debug sessions.

This PR includes a test with a yaml corefile description and a JSON
ObjectFile, incorporating all of the necessary stack memory and
symbol names from a real debug session I worked on.  The architectural
default unwind plans are used for all stack frames except the 0th
because there are no instructions for the functions, and no unwind
info.  I may need to add an encoding of unwind rules to ObjectFileJSON
in the future as we create more test cases like this.

This PR depends on the yaml2macho-core utility from
llvm#153911

rdar://110663219
@jasonmolenda
Collaborator Author

I think @labath would point out that I'm doing an end-run around making a sufficient Mock Process capability, with memory and threads and symbols, to write unit tests. @medismailben would point out that we could write a Scripted Process python script that would ingest this same information and vend a Process, just as well as using the corefile container for the information. For that matter, a little gdb remote serial protocol stub written in python could present this same information as if it were a live process with threads, registers, and memory.

Because I already had several mach-o corefile creator tools (and needed a new one each time I needed to test another part of the mach-o corefile reader part of lldb), it seemed most natural to go that route, to me.

The most important part for me is the simplicity of taking a real-world debug problem, live or corefile, which may involve giant binaries/corefiles and cannot be used in a test for size or confidentiality reasons, and extracting the core bits of registers and memory that are sufficient to show the issue being fixed. We can't test issues dealing with debug info with this mechanism -- say something specific to firmware debugging that can't be replicated in a userland process -- but for a lot of memory-and-stack-and-register type bugs, I think this could be a handy tool.

new yaml2macho-core tool.  Thanks to Felipe for the guidance.
uint32_t et al, for some reason I thought
those were built-in with C++.
Member

@bulbazord bulbazord left a comment

This is a really cool utility! I'm no expert, but I made a few suggestions inline. Thanks for sharing. :)


enum Endian { Big = 0, Little = 1 };

enum MemoryType { UInt8 = 0, UInt32, UInt64 };
Member

suggestion: MemoryType -> WordSize? MemoryType seems a little vague.

I'm not convinced that it was emitting big-endian Mach-O files
correctly, and until this is actually needed, there's no point
in carrying around dubious code.
Member

@JDevlieghere JDevlieghere left a comment

I know this is a simple tool, but it seems like adding a bit of organization could go a long way. Essentially, the tool consists of 3 parts:

  1. A reader that takes YAML and creates an in-memory/intermediate representation (CoreSpec).
  2. A writer that takes a CoreSpec and emits a binary.
  3. The glue that holds (1) and (2) together as well as command-line parsing and I/O.

If it were up to me, that's how I would structure this tool. I think that will make it a lot easier to understand and extend in the future.
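
The three-part split described above could be sketched roughly as follows. This is an illustration only, not the tool's code: the `CoreSpec` fields, the `Reader` and `Writer` aliases, and `Convert` are hypothetical names, and the real reader would be built on LLVM's YAML support rather than a plain callable.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical intermediate representation; field names are illustrative.
struct MemoryRegion {
  uint64_t addr;
  std::vector<uint8_t> bytes;
};
struct CoreSpec {
  std::string cpu;
  std::vector<MemoryRegion> memory_regions;
};

// (1) Reader: input text in, CoreSpec out.
using Reader = std::function<CoreSpec(const std::string &)>;
// (2) Writer: CoreSpec in, corefile bytes out.
using Writer = std::function<std::vector<uint8_t>(const CoreSpec &)>;

// (3) Glue: knows nothing about YAML or Mach-O, it only connects the two.
inline std::vector<uint8_t> Convert(const std::string &input,
                                    const Reader &read, const Writer &write) {
  CoreSpec spec = read(input);
  return write(spec);
}
```

Because the glue only sees `CoreSpec`, a different output format would mean supplying another `Writer` without touching the reader.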

Comment on lines 2 to 3
main.cpp
yaml2corespec.cpp
Member

Nit: filename suggestions.

Suggested change
-main.cpp
-yaml2corespec.cpp
+yaml2macho.cpp
+CoreSpec.cpp

Comment on lines +39 to +41
std::vector<uint8_t> bytes;
std::vector<uint32_t> words;
std::vector<uint64_t> doublewords;
Member

If they're mutually exclusive, you could do:

Suggested change
-std::vector<uint8_t> bytes;
-std::vector<uint32_t> words;
-std::vector<uint64_t> doublewords;
+using Bytes = std::vector<uint8_t>;
+using Words = std::vector<uint32_t>;
+using Doublewords = std::vector<uint64_t>;
+std::variant<Bytes, Words, Doublewords> data;

but this might complicate the YAML traits. We do this for protocol messages in DAP and MCP.
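
As a rough sketch of what consuming such a variant might look like, the following flattens whichever alternative is active into bytes with `std::visit`. The alias `MemoryData` and the function `ToLittleEndianBytes` are hypothetical names, and the sketch assumes little-endian output only. It says nothing about how the YAML traits would populate the variant.

```cpp
#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

using Bytes = std::vector<uint8_t>;
using Words = std::vector<uint32_t>;
using Doublewords = std::vector<uint64_t>;
using MemoryData = std::variant<Bytes, Words, Doublewords>;

// Flatten whichever alternative is active into little-endian bytes.
inline std::vector<uint8_t> ToLittleEndianBytes(const MemoryData &data) {
  std::vector<uint8_t> out;
  std::visit(
      [&out](const auto &values) {
        for (auto v : values) {
          // sizeof(v) is 1, 4, or 8 depending on the active alternative.
          for (std::size_t i = 0; i < sizeof(v); ++i)
            out.push_back(static_cast<uint8_t>(
                (static_cast<uint64_t>(v) >> (8 * i)) & 0xff));
        }
      },
      data);
  return out;
}
```

A single generic lambda covers all three element widths, so the writer does not need to check which member is in use.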

Collaborator Author

yeah, I wasn't super thrilled with having these three, but wanted to maintain the input formatting for endian switching (which I then later abandoned lol). At one point it was an anonymous union. I ended up bailing on all of that and just having three members, only one of which is active; they're not accessed directly in many places.

//===----------------------------------------------------------------------===//

#include "lldb/Utility/UUID.h"

Member

Nit: no newline between header includes, the order is handled by clang-format.

Suggested change

int main(int argc, char **argv) {

const char *const short_opts = "i:o:u:h";
const option long_opts[] = {{"input", required_argument, nullptr, 'i'},
Member

This should use cl::opt (https://llvm.org/docs/CommandLine.html). Even if the current implementation is slightly simpler, it makes it harder to extend in the future. It's used by every llvm tool (that doesn't use tablegen).

Collaborator Author

Hm, I took a stab at this but it's a little confusing how it works. I tried looking at how lldb-test uses this but it's also not very straightforward. Maybe I'll let that sit for now, getopt is very simple...


#include "llvm/BinaryFormat/MachO.h"

void create_lc_note_binary_load_cmd(const CoreSpec &spec,
Member

Surprised this lives in its own file? Why not merge this with utility?

Collaborator Author

It's a tiny one-method file today, but I can see myself adding additional LC_NOTEs in the future. So I wanted the "thread writer" file which emits LC_THREAD load commands, the "memory writer" file which emits LC_SEGMENTs and the blocks of memory, and the "LC_NOTE writer" which emits LC_NOTE load commands and bytes of the note command payloads. It makes sense in my head for these three to be in their own files.

#include "Utility.h"
#include "CoreSpec.h"

void add_uint64(const CoreSpec &spec, std::vector<uint8_t> &buf, uint64_t val) {
Member

This takes a CoreSpec but doesn't actually use it? Same for add_uint32 below.

Collaborator Author

Originally you could specify the endianness in the YAML, and I needed to pass that to the add_uint methods to fix byte ordering in the output file. I eventually lost confidence that I was actually creating correctly formatted big-endian Mach-O files, so I removed that feature and forgot to remove this argument.
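
For illustration, with endian selection gone the helpers no longer need a `CoreSpec` at all. The following is a sketch of what they might reduce to, assuming little-endian output only. The two-argument signatures are an assumption, not the final code in this PR.

```cpp
#include <cstdint>
#include <vector>

// Append `val` to `buf` least-significant byte first, independent of the
// byte order of the host running the tool.
inline void add_uint32(std::vector<uint8_t> &buf, uint32_t val) {
  for (int i = 0; i < 4; ++i)
    buf.push_back(static_cast<uint8_t>(val >> (8 * i)));
}

inline void add_uint64(std::vector<uint8_t> &buf, uint64_t val) {
  for (int i = 0; i < 8; ++i)
    buf.push_back(static_cast<uint8_t>(val >> (8 * i)));
}
```

Shifting rather than copying raw memory keeps the output identical on big-endian and little-endian hosts.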

@jasonmolenda
Collaborator Author

I know this is a simple tool, but it seems like adding a bit of organization could go a long way. Essentially, the tool consists of 3 parts:

  1. A reader that takes YAML and creates an in-memory/intermediate representation (CoreSpec).
  2. A writer that takes a CoreSpec and emits a binary.
  3. The glue that holds (1) and (2) together as well as command-line parsing and I/O.

If it were up to me, that's how I would structure this tool. I think that will make it a lot easier to understand and extend in the future.

Thanks for all the comments, addressing them now.

Yeah, before I wrote this I didn't have a clear idea of what it would look like when finished (for some reason, it seems obvious now, but in the beginning there were some poor choices made and fixed along the way). You could imagine someone making an ELF corefile output capability, for instance. The YAML files I'm using as input are just a way of specifying an architecture, some registers for threads, and some memory.

I don't know if I want to structure it more generally yet, with subdirectories, or whatever, for the YAML to intermediate representation and for the Mach-O corefile writing. If anyone does want to restructure it for additional input/output methods, I think it will be easy to restructure it at that point. It may end up never growing beyond this simple set of features (I know, unlikely).

github-actions bot commented Aug 26, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

I think it's a little easier to follow the register context
writing methods when it's more explicit what is being written.
And the ability to specify the number of bits used in addressing
on this cpu.

4 participants