Elide suspension points via [[clang::coro_await_suspend_destroy]] #152623

snarkmaster · 2025-08-08T01:52:18Z

Start by reading the detailed user-facing docs in AttrDocs.td.

My immediate motivation was that I noticed that short-circuiting coroutines failed to optimize well. Interact with the demo program here: https://godbolt.org/z/WEY1Yo866

If Clang on Compiler Explorer supported [[clang::coro_await_suspend_destroy]], the assembly for simple_coro would be drastically shorter, and would not contain a call to operator new.

Here are a few high-level thoughts that don't belong in the docs:

This has lit tests, but what gives me real confidence in its correctness is the integration test (see [libcxx] Test await_suspend control flow & coro_await_suspend_destroy attr #152820). This caught all the interesting bugs that I had in earlier revs, and covers equivalence to the standard code path in far more scenarios.
I considered a variety of other designs. Here are some key design points:
- I considered optimizing unmodified await_suspend() methods, as long as they unconditionally end with an h.destroy() call on the current handle, or an exception. However, this would (a) force dynamic dispatch for destroy -- bloating IR & reducing optimization opportunities, (b) require far more complex, delicate, and fragile analysis, (c) retain more of the frame setup, so that e.g. h.done() works properly. The current solution shortcuts all these concerns.
- I want await_suspend_destroy to take Promise&, rather than std::coroutine_handle -- this is safer, simpler, and more efficient. Short-circuiting corotuines should not touch the handle.
- AttrDocs.td shows how to write portable code with a stub await_suspend(). This portability / compatibility solution avoids dire issues that would arise if users relied on __has_cpp_attribute() and the declaration and definition happened to use different toolchains. In particular, it will even be safe for a future compiler release to killswitch this attribute by removing its implementation and setting its version to 0.

let Spellings = [Clang<"coro_destroy_after_suspend", /*allowInC*/ 0,
                 /*Version*/ 0>];

In the docs, I mention the HasCoroSuspend path in CoroEarly.cpp as a further optimization opportunity. But, I'm sure there are higher-leverage ways of making these non-suspending coros compile better, I just don't know the coro optimization pipeline well enough to flag them.
IIUC the only interaction of this with coro_only_destroy_when_complete would be that the compiler expends fewer cycles if both are set.
I ran some benchmarks on folly::result. It's definitely faster, heap allocs are definitely elided, the compiled code looks like a function, not a coroutine, but there's still an optimization gap. On the plus side, this results in a 4x speedup (!) in optimized ASAN builds (numbers not shown for brevity).

// Simple result coroutine that adds 1 to the input
result<int> result_coro(result<int>&& r) {
  co_return co_await std::move(r) + 1;
}

// Non-coroutine equivalent using value_or_throw()
result<int> catching_result_func(result<int>&& r) {
  return result_catch_all([&]() -> result<int> {
    if (r.has_value()) {
      return r.value_or_throw() + 1;
    }
    return std::move(r).non_value();
  });
}

// Not QUITE equivalent to the coro -- lacks the exception boundary
result<int> non_catching_result_func(result<int>&& r) {
  if (r.has_value()) {
    return r.value_or_throw() + 1;
  }
  return std::move(r).non_value();
}

============================================================================
[...]lly/result/test/result_coro_bench.cpp     relative  time/iter   iters/s
============================================================================
result_coro_success                                        13.61ns    73.49M
non_catching_result_func_success                            3.39ns   295.00M
catching_result_func_success                                4.41ns   226.88M
result_coro_error                                          19.55ns    51.16M
non_catching_result_func_error                              9.15ns   109.26M
catching_result_func_error                                 10.19ns    98.10M

============================================================================
[...]lly/result/test/result_coro_bench.cpp     relative  time/iter   iters/s
============================================================================
result_coro_success                                        10.59ns    94.39M
non_catching_result_func_success                            3.39ns   295.00M
catching_result_func_success                                4.07ns   245.81M
result_coro_error                                          13.66ns    73.18M
non_catching_result_func_error                              9.00ns   111.11M
catching_result_func_error                                 10.04ns    99.63M

Demo program from the Compiler Explorer link above:

#include <coroutine>
#include <optional>

// Read this LATER -- this implementation detail isn't required to understand
// the value of [[clang::coro_await_suspend_destroy]].
//
// `optional_wrapper` exists since `get_return_object()` can't return
// `std::optional` directly. C++ coroutines have a fundamental timing mismatch
// between when the return object is created and when the value is available:
//
// 1) Early (coroutine startup): `get_return_object()` is called and must return
//    something immediately.
// 2) Later (when `co_return` executes): `return_value(T)` is called with the
//    actual value.
// 3) Issue: If `get_return_object()` returns the storage, it's empty when
//    returned, and writing to it later cannot affect the already-returned copy.
template <typename T>
struct optional_wrapper {
  std::optional<T> storage_;
  std::optional<T>*& pointer_;
  optional_wrapper(std::optional<T>*& p) : pointer_(p) {
    pointer_ = &storage_;
  }
  operator std::optional<T>() { return std::move(storage_); }
  ~optional_wrapper() {}
};

// Make `std::optional` a coroutine
template <typename T, typename... Args>
struct std::coroutine_traits<std::optional<T>, Args...> {
  struct promise_type {
    std::optional<T>* storagePtr_ = nullptr;
    promise_type() = default;
    ::optional_wrapper<T> get_return_object() {
      return ::optional_wrapper<T>(storagePtr_);
    }
    std::suspend_never initial_suspend() const noexcept { return {}; }
    std::suspend_never final_suspend() const noexcept { return {}; }
    void return_value(T&& value) { *storagePtr_ = std::move(value); }
    void unhandled_exception() {
      // Leave storage_ empty to represent error
    }
  };
};

template <typename T>
struct optional_awaitable {
  std::optional<T> opt_;
  bool await_ready() const noexcept { return opt_.has_value(); }
  T await_resume() { return std::move(opt_).value(); }
  // Adding `noexcept` here makes the early IR much smaller, but the
  // optimizer is able to discard the cruft for simpler cases.
  void await_suspend_destroy(auto& promise) noexcept {
    // Assume the return object defaults to "empty"
  }
  [[clang::coro_await_suspend_destroy]] void await_suspend(auto handle) {
    await_suspend_destroy(handle.promise());
    handle.destroy();
  }
};

template <typename T>
optional_awaitable<T> operator co_await(std::optional<T> opt) {
  return {std::move(opt)};
}

// Non-coroutine baseline -- matches the logic of `simple_coro`.
std::optional<int> simple_func(const std::optional<int>& r) {
  try {
    if (r.has_value()) {
        return r.value() + 1;
    }
  } catch (...) {}
  return std::nullopt; // return empty on empty input or error
}

// Without `coro_await_suspend_destroy`, allocates its frame on-heap.
std::optional<int> simple_coro(const std::optional<int>& r) {
  co_return co_await std::move(r) + 4;
}

// Without `co_await`, this optimizes much like `simple_func`.
// Bugs:
//  - Doesn't short-circuit when `r` is empty, but throws
//  - Lacks an exception boundary
std::optional<int> wrong_simple_coro(const std::optional<int>& r) {
  co_return r.value() + 2;
}

int main() {
  return
      simple_func(std::optional<int>{32}).value() +
      simple_coro(std::optional<int>{8}).value() +
      wrong_simple_coro(std::optional<int>{16}).value();
}

Test Plan:

I used ninja check-clang to run all the IR regression tests, and ran each of the 3 new IR tests separately, along these lines.

Start by reading the detailed user-facing docs in `AttrDocs.td`. My immediate motivation was that I noticed that short-circuiting coroutines failed to optimize well. Interact with the demo program here: https://godbolt.org/z/E3YK5c45a If Clang on Compiler Explorer supported [[clang::coro_await_suspend_destroy]], the assembly for `simple_coro` would be drastically shorter, and would not contain a call to `operator new`. Here are a few high-level thoughts that don't belong in the docs: - This has `lit` tests, but what gives me real confidence in its correctness is the integration test in `coro_await_suspend_destroy_test.cpp`. This caught all the interesting bugs that I had in earlier revs, and covers equivalence to the standard code path in far more scenarios. - I considered a variety of other designs. Here are some key design points: * I considered optimizing unmodified `await_suspend()` methods, as long as they unconditionally end with an `h.destroy()` call on the current handle, or an exception. However, this would (a) force dynamic dispatch for `destroy` -- bloating IR & reducing optimization opportunities, (b) require far more complex, delicate, and fragile analysis, (c) retain more of the frame setup, so that e.g. `h.done()` works properly. The current solution shortcuts all these concerns. * I want to `Promise&`, rather than `std::coroutine_handle` to `await_suspend_destroy` -- this is safer, simpler, and more efficient. Short-circuiting corotuines should not touch the handle. This decision forces the attribue to go on the class. Resolving a method attribute would have required looking up overloads for both types, and choosing one, which is costly and a bad UX to boot. * `AttrDocs.td` tells portable code to provide a stub `await_suspend()`. This portability / compatibility solution avoids dire issues that would arise if users relied on `__has_cpp_attribute` and the declaration and definition happened to use different toolchains. In particular, it will even be safe for a future compiler release to killswitch this attribute by removing its implementation and setting its version to 0. ``` let Spellings = [Clang<"coro_destroy_after_suspend", /*allowInC*/ 0, /*Version*/ 0>]; ``` - In the docs, I mention the `HasCoroSuspend` path in `CoroEarly.cpp` as a further optimization opportunity. But, I'm sure there are higher-leverage ways of making these non-suspending coros compile better, I just don't know the coro optimization pipeline well enough to flag them. - IIUC the only interaction of this with `coro_only_destroy_when_complete` would be that the compiler expends fewer cycles. - I ran some benchmarks on [folly::result]( https://github.com/facebook/folly/blob/main/folly/result/docs/result.md). Heap allocs are definitely elided, the compiled code looks like a function, not a coroutine, but there's still an optimization gap. On the plus side, this results in a 4x speedup (!) in optimized ASAN builds (numbers not shown for brevity. ``` // Simple result coroutine that adds 1 to the input result<int> result_coro(result<int>&& r) { co_return co_await std::move(r) + 1; } // Non-coroutine equivalent using value_or_throw() result<int> catching_result_func(result<int>&& r) { return result_catch_all([&]() -> result<int> { if (r.has_value()) { return r.value_or_throw() + 1; } return std::move(r).non_value(); }); } // Not QUITE equivalent to the coro -- lacks the exception boundary result<int> non_catching_result_func(result<int>&& r) { if (r.has_value()) { return r.value_or_throw() + 1; } return std::move(r).non_value(); } ============================================================================ [...]lly/result/test/result_coro_bench.cpp relative time/iter iters/s ============================================================================ result_coro_success 13.61ns 73.49M non_catching_result_func_success 3.39ns 295.00M catching_result_func_success 4.41ns 226.88M result_coro_error 19.55ns 51.16M non_catching_result_func_error 9.15ns 109.26M catching_result_func_error 10.19ns 98.10M ============================================================================ [...]lly/result/test/result_coro_bench.cpp relative time/iter iters/s ============================================================================ result_coro_success 10.59ns 94.39M non_catching_result_func_success 3.39ns 295.00M catching_result_func_success 4.07ns 245.81M result_coro_error 13.66ns 73.18M non_catching_result_func_error 9.00ns 111.11M catching_result_func_error 10.04ns 99.63M ``` Demo program from the Compiler Explorer link above: ```cpp #include <coroutine> #include <optional> // Read this LATER -- this implementation detail isn't required to understand // the value of [[clang::coro_await_suspend_destroy]]. // // `optional_wrapper` exists since `get_return_object()` can't return // `std::optional` directly. C++ coroutines have a fundamental timing mismatch // between when the return object is created and when the value is available: // // 1) Early (coroutine startup): `get_return_object()` is called and must return // something immediately. // 2) Later (when `co_return` executes): `return_value(T)` is called with the // actual value. // 3) Issue: If `get_return_object()` returns the storage, it's empty when // returned, and writing to it later cannot affect the already-returned copy. template <typename T> struct optional_wrapper { std::optional<T> storage_; std::optional<T>*& pointer_; optional_wrapper(std::optional<T>*& p) : pointer_(p) { pointer_ = &storage_; } operator std::optional<T>() { return std::move(storage_); } ~optional_wrapper() {} }; // Make `std::optional` a coroutine template <typename T, typename... Args> struct std::coroutine_traits<std::optional<T>, Args...> { struct promise_type { std::optional<T>* storagePtr_ = nullptr; promise_type() = default; ::optional_wrapper<T> get_return_object() { return ::optional_wrapper<T>(storagePtr_); } std::suspend_never initial_suspend() const noexcept { return {}; } std::suspend_never final_suspend() const noexcept { return {}; } void return_value(T&& value) { *storagePtr_ = std::move(value); } void unhandled_exception() { // Leave storage_ empty to represent error } }; }; template <typename T> struct [[clang::coro_await_suspend_destroy]] optional_awaitable { std::optional<T> opt_; bool await_ready() const noexcept { return opt_.has_value(); } T await_resume() { return std::move(opt_).value(); } // Adding `noexcept` here makes the early IR much smaller, but the // optimizer is able to discard the cruft for simpler cases. void await_suspend_destroy(auto& promise) noexcept { // Assume the return object defaults to "empty" } void await_suspend(auto handle) { await_suspend_destroy(handle.promise()); handle.destroy(); } }; template <typename T> optional_awaitable<T> operator co_await(std::optional<T> opt) { return {std::move(opt)}; } // Non-coroutine baseline -- matches the logic of `simple_coro`. std::optional<int> simple_func(const std::optional<int>& r) { try { if (r.has_value()) { return r.value() + 1; } } catch (...) {} return std::nullopt; // return empty on empty input or error } // Without `coro_await_suspend_destroy`, allocates its frame on-heap. std::optional<int> simple_coro(const std::optional<int>& r) { co_return co_await std::move(r) + 4; } // Without `co_await`, this optimizes much like `simple_func`. // Bugs: // - Doesn't short-circuit when `r` is empty, but throws // - Lacks an exception boundary std::optional<int> wrong_simple_coro(const std::optional<int>& r) { co_return r.value() + 2; } int main() { return simple_func(std::optional<int>{32}).value() + simple_coro(std::optional<int>{8}).value() + wrong_simple_coro(std::optional<int>{16}).value(); } ``` Test Plan: For the all-important E2E test, I used this terrible cargo-culted script to run the new end-to-end test with the new compiler. (Yes, I realize I should only need 10% of those `-D` settings for a successful build.) To make sure the test covered what I meant it to do: - I also added an `#error` in the "no attribute" branch to make sure the compiler indeed supports the attribute. - I ran it with a compiler not supporting the attribute, and that also passed. - I also tried `return 1;` from `main()` and saw the logs of the 7 successful tests running. ```sh #!/bin/bash -uex set -o pipefail LLVMBASE=/path/to/source/of/llvm-project SYSCLANG=/path/to/origianl/bin/clang # NB Can add `--debug-output` to debug cmake... # Bootstrap clang -- Use `RelWithDebInfo` or the next phase is too slow! mkdir -p bootstrap cd bootstrap cmake "$LLVMBASE/llvm" \ -G Ninja \ -DBUILD_SHARED_LIBS=true \ -DCMAKE_ASM_COMPILER="$SYSCLANG" \ -DCMAKE_ASM_COMPILER_ID=Clang \ -DCMAKE_BUILD_TYPE=RelWithDebInfo \ -DCMAKE_CXX_COMPILER="$SYSCLANG"++ \ -DCMAKE_C_COMPILER="$SYSCLANG" \ -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \ -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \ -DLLVM_ENABLE_ASSERTIONS=ON \ -DLLVM_ENABLE_BINDINGS=OFF \ -DLLVM_ENABLE_LLD=ON \ -DLLVM_ENABLE_PROJECTS="clang;lld" \ -DLLVM_OPTIMIZED_TABLEGEN=true \ -DLLVM_FORCE_ENABLE_STATS=ON \ -DLLVM_ENABLE_DUMP=ON \ -DCLANG_DEFAULT_PIE_ON_LINUX=OFF ninja clang lld ninja check-clang-codegencoroutines # Includes the new IR regression tests cd .. NEWCLANG="$PWD"/bootstrap/bin/clang NEWLLD="$PWD"/bootstrap/bin/lld # LIBCXX_INCLUDE_BENCHMARKS=OFF because google-benchmark bugs out cmake "$LLVMBASE/runtimes" \ -G Ninja \ -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \ -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \ -DBUILD_SHARED_LIBS=true \ -DCMAKE_ASM_COMPILER="$NEWCLANG" \ -DCMAKE_ASM_COMPILER_ID=Clang \ -DCMAKE_C_COMPILER="$NEWCLANG" \ -DCMAKE_CXX_COMPILER="$NEWCLANG"++ \ -DLLVM_FORCE_ENABLE_STATS=ON \ -DLLVM_ENABLE_ASSERTIONS=ON \ -DLLVM_ENABLE_LLD=ON \ -DLIBCXX_INCLUDE_TESTS=ON \ -DLIBCXX_INCLUDE_BENCHMARKS=OFF \ -DLLVM_INCLUDE_TESTS=ON \ -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi;libunwind" \ -DCMAKE_BUILD_TYPE=RelWithDebInfo \ -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ninja cxx-test-depends LIBCXXBUILD=$PWD cd "$LLVMBASE" libcxx/utils/libcxx-lit "$LIBCXXBUILD" -v \ libcxx/test/std/language.support/support.coroutines/end.to.end/coro_await_suspend_destroy.pass.cpp ```

github-actions · 2025-08-08T01:52:38Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

llvmbot · 2025-08-08T01:53:10Z

@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-coroutines

Author: None (snarkmaster)

Changes

Start by reading the detailed user-facing docs in AttrDocs.td.

My immediate motivation was that I noticed that short-circuiting coroutines failed to optimize well. Interact with the demo program here: https://godbolt.org/z/E3YK5c45a

If Clang on Compiler Explorer supported [[clang::coro_await_suspend_destroy]], the assembly for simple_coro would be drastically shorter, and would not contain a call to operator new.

Here are a few high-level thoughts that don't belong in the docs:

This has lit tests, but what gives me real confidence in its correctness is the integration test in coro_await_suspend_destroy_test.cpp. This caught all the interesting bugs that I had in earlier revs, and covers equivalence to the standard code path in far more scenarios.
I considered a variety of other designs. Here are some key design points:
- I considered optimizing unmodified await_suspend() methods, as long as they unconditionally end with an h.destroy() call on the current handle, or an exception. However, this would (a) force dynamic dispatch for destroy -- bloating IR & reducing optimization opportunities, (b) require far more complex, delicate, and fragile analysis, (c) retain more of the frame setup, so that e.g. h.done() works properly. The current solution shortcuts all these concerns.
- I want to Promise&, rather than std::coroutine_handle to await_suspend_destroy -- this is safer, simpler, and more efficient. Short-circuiting corotuines should not touch the handle. This decision forces the attribue to go on the class. Resolving a method attribute would have required looking up overloads for both types, and choosing one, which is costly and a bad UX to boot.
- AttrDocs.td tells portable code to provide a stub await_suspend(). This portability / compatibility solution avoids dire issues that would arise if users relied on __has_cpp_attribute and the declaration and definition happened to use different toolchains. In particular, it will even be safe for a future compiler release to killswitch this attribute by removing its implementation and setting its version to 0.

let Spellings = [Clang&lt;"coro_destroy_after_suspend", /*allowInC*/ 0,
                 /*Version*/ 0&gt;];

In the docs, I mention the HasCoroSuspend path in CoroEarly.cpp as a further optimization opportunity. But, I'm sure there are higher-leverage ways of making these non-suspending coros compile better, I just don't know the coro optimization pipeline well enough to flag them.
IIUC the only interaction of this with coro_only_destroy_when_complete would be that the compiler expends fewer cycles.
I ran some benchmarks on folly::result. Heap allocs are definitely elided, the compiled code looks like a function, not a coroutine, but there's still an optimization gap. On the plus side, this results in a 4x speedup (!) in optimized ASAN builds (numbers not shown for brevity.

// Simple result coroutine that adds 1 to the input
result&lt;int&gt; result_coro(result&lt;int&gt;&amp;&amp; r) {
  co_return co_await std::move(r) + 1;
}

// Non-coroutine equivalent using value_or_throw()
result&lt;int&gt; catching_result_func(result&lt;int&gt;&amp;&amp; r) {
  return result_catch_all([&amp;]() -&gt; result&lt;int&gt; {
    if (r.has_value()) {
      return r.value_or_throw() + 1;
    }
    return std::move(r).non_value();
  });
}

// Not QUITE equivalent to the coro -- lacks the exception boundary
result&lt;int&gt; non_catching_result_func(result&lt;int&gt;&amp;&amp; r) {
  if (r.has_value()) {
    return r.value_or_throw() + 1;
  }
  return std::move(r).non_value();
}

============================================================================
[...]lly/result/test/result_coro_bench.cpp     relative  time/iter   iters/s
============================================================================
result_coro_success                                        13.61ns    73.49M
non_catching_result_func_success                            3.39ns   295.00M
catching_result_func_success                                4.41ns   226.88M
result_coro_error                                          19.55ns    51.16M
non_catching_result_func_error                              9.15ns   109.26M
catching_result_func_error                                 10.19ns    98.10M

============================================================================
[...]lly/result/test/result_coro_bench.cpp     relative  time/iter   iters/s
============================================================================
result_coro_success                                        10.59ns    94.39M
non_catching_result_func_success                            3.39ns   295.00M
catching_result_func_success                                4.07ns   245.81M
result_coro_error                                          13.66ns    73.18M
non_catching_result_func_error                              9.00ns   111.11M
catching_result_func_error                                 10.04ns    99.63M

Demo program from the Compiler Explorer link above:

#include &lt;coroutine&gt;
#include &lt;optional&gt;

// Read this LATER -- this implementation detail isn't required to understand
// the value of [[clang::coro_await_suspend_destroy]].
//
// `optional_wrapper` exists since `get_return_object()` can't return
// `std::optional` directly. C++ coroutines have a fundamental timing mismatch
// between when the return object is created and when the value is available:
//
// 1) Early (coroutine startup): `get_return_object()` is called and must return
//    something immediately.
// 2) Later (when `co_return` executes): `return_value(T)` is called with the
//    actual value.
// 3) Issue: If `get_return_object()` returns the storage, it's empty when
//    returned, and writing to it later cannot affect the already-returned copy.
template &lt;typename T&gt;
struct optional_wrapper {
  std::optional&lt;T&gt; storage_;
  std::optional&lt;T&gt;*&amp; pointer_;
  optional_wrapper(std::optional&lt;T&gt;*&amp; p) : pointer_(p) {
    pointer_ = &amp;storage_;
  }
  operator std::optional&lt;T&gt;() { return std::move(storage_); }
  ~optional_wrapper() {}
};

// Make `std::optional` a coroutine
template &lt;typename T, typename... Args&gt;
struct std::coroutine_traits&lt;std::optional&lt;T&gt;, Args...&gt; {
  struct promise_type {
    std::optional&lt;T&gt;* storagePtr_ = nullptr;
    promise_type() = default;
    ::optional_wrapper&lt;T&gt; get_return_object() {
      return ::optional_wrapper&lt;T&gt;(storagePtr_);
    }
    std::suspend_never initial_suspend() const noexcept { return {}; }
    std::suspend_never final_suspend() const noexcept { return {}; }
    void return_value(T&amp;&amp; value) { *storagePtr_ = std::move(value); }
    void unhandled_exception() {
      // Leave storage_ empty to represent error
    }
  };
};

template &lt;typename T&gt;
struct [[clang::coro_await_suspend_destroy]] optional_awaitable {
  std::optional&lt;T&gt; opt_;
  bool await_ready() const noexcept { return opt_.has_value(); }
  T await_resume() { return std::move(opt_).value(); }
  // Adding `noexcept` here makes the early IR much smaller, but the
  // optimizer is able to discard the cruft for simpler cases.
  void await_suspend_destroy(auto&amp; promise) noexcept {
    // Assume the return object defaults to "empty"
  }
  void await_suspend(auto handle) {
    await_suspend_destroy(handle.promise());
    handle.destroy();
  }
};

template &lt;typename T&gt;
optional_awaitable&lt;T&gt; operator co_await(std::optional&lt;T&gt; opt) {
  return {std::move(opt)};
}

// Non-coroutine baseline -- matches the logic of `simple_coro`.
std::optional&lt;int&gt; simple_func(const std::optional&lt;int&gt;&amp; r) {
  try {
    if (r.has_value()) {
        return r.value() + 1;
    }
  } catch (...) {}
  return std::nullopt; // return empty on empty input or error
}

// Without `coro_await_suspend_destroy`, allocates its frame on-heap.
std::optional&lt;int&gt; simple_coro(const std::optional&lt;int&gt;&amp; r) {
  co_return co_await std::move(r) + 4;
}

// Without `co_await`, this optimizes much like `simple_func`.
// Bugs:
//  - Doesn't short-circuit when `r` is empty, but throws
//  - Lacks an exception boundary
std::optional&lt;int&gt; wrong_simple_coro(const std::optional&lt;int&gt;&amp; r) {
  co_return r.value() + 2;
}

int main() {
  return
      simple_func(std::optional&lt;int&gt;{32}).value() +
      simple_coro(std::optional&lt;int&gt;{8}).value() +
      wrong_simple_coro(std::optional&lt;int&gt;{16}).value();
}

Test Plan:

For the all-important E2E test, I used this terrible cargo-culted script to run the new end-to-end test with the new compiler. (Yes, I realize I should only need 10% of those -D settings for a successful build.)

To make sure the test covered what I meant it to do:

I also added an #error in the "no attribute" branch to make sure the compiler indeed supports the attribute.
I ran it with a compiler not supporting the attribute, and that also passed.
I also tried return 1; from main() and saw the logs of the 7 successful tests running.

#!/bin/bash -uex
set -o pipefail
LLVMBASE=/path/to/source/of/llvm-project
SYSCLANG=/path/to/origianl/bin/clang

# NB Can add `--debug-output` to debug cmake...

# Bootstrap clang -- Use `RelWithDebInfo` or the next phase is too slow!
mkdir -p bootstrap
cd bootstrap
cmake "$LLVMBASE/llvm" \
    -G Ninja \
    -DBUILD_SHARED_LIBS=true \
    -DCMAKE_ASM_COMPILER="$SYSCLANG" \
    -DCMAKE_ASM_COMPILER_ID=Clang \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_CXX_COMPILER="$SYSCLANG"++ \
    -DCMAKE_C_COMPILER="$SYSCLANG" \
    -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \
    -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_BINDINGS=OFF \
    -DLLVM_ENABLE_LLD=ON \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_OPTIMIZED_TABLEGEN=true \
    -DLLVM_FORCE_ENABLE_STATS=ON \
    -DLLVM_ENABLE_DUMP=ON \
    -DCLANG_DEFAULT_PIE_ON_LINUX=OFF
ninja clang lld
ninja check-clang-codegencoroutines # Includes the new IR regression tests
cd ..

NEWCLANG="$PWD"/bootstrap/bin/clang
NEWLLD="$PWD"/bootstrap/bin/lld
# LIBCXX_INCLUDE_BENCHMARKS=OFF because google-benchmark bugs out
cmake "$LLVMBASE/runtimes" \
    -G Ninja \
    -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \
    -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \
    -DBUILD_SHARED_LIBS=true \
    -DCMAKE_ASM_COMPILER="$NEWCLANG" \
    -DCMAKE_ASM_COMPILER_ID=Clang \
    -DCMAKE_C_COMPILER="$NEWCLANG" \
    -DCMAKE_CXX_COMPILER="$NEWCLANG"++ \
    -DLLVM_FORCE_ENABLE_STATS=ON \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_LLD=ON \
    -DLIBCXX_INCLUDE_TESTS=ON \
    -DLIBCXX_INCLUDE_BENCHMARKS=OFF \
    -DLLVM_INCLUDE_TESTS=ON \
    -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi;libunwind" \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

ninja cxx-test-depends

LIBCXXBUILD=$PWD
cd "$LLVMBASE"

libcxx/utils/libcxx-lit "$LIBCXXBUILD" -v \
    libcxx/test/std/language.support/support.coroutines/end.to.end/coro_await_suspend_destroy.pass.cpp

Patch is 46.90 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/152623.diff

10 Files Affected:

(modified) clang/docs/ReleaseNotes.rst (+6)
(modified) clang/include/clang/Basic/Attr.td (+8)
(modified) clang/include/clang/Basic/AttrDocs.td (+87)
(modified) clang/include/clang/Basic/DiagnosticSemaKinds.td (+3)
(modified) clang/lib/CodeGen/CGCoroutine.cpp (+160-72)
(modified) clang/lib/Sema/SemaCoroutine.cpp (+84-18)
(added) clang/test/CodeGenCoroutines/coro-await-suspend-destroy-errors.cpp (+55)
(added) clang/test/CodeGenCoroutines/coro-await-suspend-destroy.cpp (+129)
(modified) clang/test/Misc/pragma-attribute-supported-attributes-list.test (+1)
(added) libcxx/test/std/language.support/support.coroutines/end.to.end/coro_await_suspend_destroy.pass.cpp (+409)

diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index 0e9fcaa5fac6a..41c412730b033 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -136,6 +136,12 @@ Removed Compiler Flags
 Attribute Changes in Clang
 --------------------------
 
+- Introduced a new attribute ``[[clang::coro_await_suspend_destroy]]``.  When
+  applied to a coroutine awaiter class, it causes suspensions into this awaiter
+  to use a new `await_suspend_destroy(Promise&)` method instead of the standard
+  `await_suspend(std::coroutine_handle<...>)`.  The coroutine is then destroyed.
+  This improves code speed & size for "short-circuiting" coroutines.
+
 Improvements to Clang's diagnostics
 -----------------------------------
 - Added a separate diagnostic group ``-Wfunction-effect-redeclarations``, for the more pedantic
diff --git a/clang/include/clang/Basic/Attr.td b/clang/include/clang/Basic/Attr.td
index 30efb9f39e4f4..341848be00e7d 100644
--- a/clang/include/clang/Basic/Attr.td
+++ b/clang/include/clang/Basic/Attr.td
@@ -1352,6 +1352,14 @@ def CoroAwaitElidableArgument : InheritableAttr {
   let SimpleHandler = 1;
 }
 
+def CoroAwaitSuspendDestroy: InheritableAttr {
+  let Spellings = [Clang<"coro_await_suspend_destroy">];
+  let Subjects = SubjectList<[CXXRecord]>;
+  let LangOpts = [CPlusPlus];
+  let Documentation = [CoroAwaitSuspendDestroyDoc];
+  let SimpleHandler = 1;
+}
+
 // OSObject-based attributes.
 def OSConsumed : InheritableParamAttr {
   let Spellings = [Clang<"os_consumed">];
diff --git a/clang/include/clang/Basic/AttrDocs.td b/clang/include/clang/Basic/AttrDocs.td
index 2b095ab975202..d2224d86b3900 100644
--- a/clang/include/clang/Basic/AttrDocs.td
+++ b/clang/include/clang/Basic/AttrDocs.td
@@ -9270,6 +9270,93 @@ Example:
 }];
 }
 
+def CoroAwaitSuspendDestroyDoc : Documentation {
+  let Category = DocCatDecl;
+  let Content = [{
+
+The ``[[clang::coro_await_suspend_destroy]]`` attribute may be applied to a C++
+coroutine awaiter type.  When this attribute is present, the awaiter must
+implement ``void await_suspend_destroy(Promise&)``.  If ``await_ready()``
+returns ``false`` at a suspension point, ``await_suspend_destroy`` will be
+called directly, bypassing the ``await_suspend(std::coroutine_handle<...>)``
+method.  The coroutine being suspended will then be immediately destroyed.
+
+Logically, the new behavior is equivalent to this standard code:
+
+.. code-block:: c++
+
+  void await_suspend_destroy(YourPromise&) { ... }
+  void await_suspend(auto handle) {
+    await_suspend_destroy(handle.promise());
+    handle.destroy();
+  }
+
+This enables `await_suspend_destroy()` usage in portable awaiters — just add a
+stub ``await_suspend()`` as above.  Without ``coro_await_suspend_destroy``
+support, the awaiter will behave nearly identically, with the only difference
+being heap allocation instead of stack allocation for the coroutine frame.
+
+This attribute exists to optimize short-circuiting coroutines—coroutines whose
+suspend points are either (i) trivial (like ``std::suspend_never``), or (ii)
+short-circuiting (like a ``co_await`` that can be expressed in regular control
+flow as):
+
+.. code-block:: c++
+
+  T val;
+  if (awaiter.await_ready()) {
+    val = awaiter.await_resume();
+  } else {
+    awaiter.await_suspend();
+    return /* value representing the "execution short-circuited" outcome */;
+  }
+
+The benefits of this attribute are:
+  - **Avoid heap allocations for coro frames**: Allocating short-circuiting
+    coros on the stack makes code more predictable under memory pressure.
+    Without this attribute, LLVM cannot elide heap allocation even when all
+    awaiters are short-circuiting.
+  - **Performance**: Significantly faster execution and smaller code size.
+  - **Build time**: Faster compilation due to less IR being generated.
+
+Marking your ``await_suspend_destroy`` method as ``noexcept`` can sometimes
+further improve optimization.
+
+Here is a toy example of a portable short-circuiting awaiter:
+
+.. code-block:: c++
+
+  template <typename T>
+  struct [[clang::coro_await_suspend_destroy]] optional_awaitable {
+    std::optional<T> opt_;
+    bool await_ready() const noexcept { return opt_.has_value(); }
+    T await_resume() { return std::move(opt_).value(); }
+    void await_suspend_destroy(auto& promise) {
+      // Assume the return object of the outer coro defaults to "empty".
+    }
+    // Fallback for when `coro_await_suspend_destroy` is unavailable.
+    void await_suspend(auto handle) {
+      await_suspend_destroy(handle.promise());
+      handle.destroy();
+    }
+  };
+
+If all suspension points use (i) trivial or (ii) short-circuiting awaiters,
+then the coroutine optimizes more like a plain function, with 2 caveats:
+  - **Behavior:** The coroutine promise provides an implicit exception boundary
+    (as if wrapping the function in ``try {} catch { unhandled_exception(); }``).
+    This exception handling behavior is usually desirable in robust,
+    return-value-oriented programs that need short-circuiting coroutines.
+    Otherwise, the promise can always re-throw.
+  - **Speed:** As of 2025, there is still an optimization gap between a
+    realistic short-circuiting coro, and the equivalent (but much more verbose)
+    function.  For a guesstimate, expect 4-5ns per call on x86.  One idea for
+    improvement is to also elide trivial suspends like `std::suspend_never`, in
+    order to hit the `HasCoroSuspend` path in `CoroEarly.cpp`.
+
+}];
+}
+
 def CountedByDocs : Documentation {
   let Category = DocCatField;
   let Content = [{
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index 116341f4b66d5..58e7dd7db86d1 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -12504,6 +12504,9 @@ def note_coroutine_promise_call_implicitly_required : Note<
 def err_await_suspend_invalid_return_type : Error<
   "return type of 'await_suspend' is required to be 'void' or 'bool' (have %0)"
 >;
+def err_await_suspend_destroy_invalid_return_type : Error<
+  "return type of 'await_suspend_destroy' is required to be 'void' (have %0)"
+>;
 def note_await_ready_no_bool_conversion : Note<
   "return type of 'await_ready' is required to be contextually convertible to 'bool'"
 >;
diff --git a/clang/lib/CodeGen/CGCoroutine.cpp b/clang/lib/CodeGen/CGCoroutine.cpp
index 827385f9c1a1f..d74bef592aa9c 100644
--- a/clang/lib/CodeGen/CGCoroutine.cpp
+++ b/clang/lib/CodeGen/CGCoroutine.cpp
@@ -174,6 +174,66 @@ static bool StmtCanThrow(const Stmt *S) {
   return false;
 }
 
+// Check if this suspend should be calling `await_suspend_destroy`
+static bool useCoroAwaitSuspendDestroy(const CoroutineSuspendExpr &S) {
+  // This can only be an `await_suspend_destroy` suspend expression if it
+  // returns void -- `buildCoawaitCalls` in `SemaCoroutine.cpp` asserts this.
+  // Moreover, when `await_suspend` returns a handle, the outermost method call
+  // is `.address()` -- making it harder to get the actual class or method.
+  if (S.getSuspendReturnType() !=
+      CoroutineSuspendExpr::SuspendReturnType::SuspendVoid) {
+    return false;
+  }
+
+  // `CGCoroutine.cpp` & `SemaCoroutine.cpp` must agree on whether this suspend
+  // expression uses `[[clang::coro_await_suspend_destroy]]`.
+  //
+  // Any mismatch is a serious bug -- we would either double-free, or fail to
+  // destroy the promise type. For this reason, we make our decision based on
+  // the method name, and fatal outside of the happy path -- including on
+  // failure to find a method name.
+  //
+  // As a debug-only check we also try to detect the `AwaiterClass`. This is
+  // secondary, because  detection of the awaiter type can be silently broken by
+  // small `buildCoawaitCalls` AST changes.
+  StringRef SuspendMethodName;           // Primary
+  CXXRecordDecl *AwaiterClass = nullptr; // Debug-only, best-effort
+  if (auto *SuspendCall =
+          dyn_cast<CallExpr>(S.getSuspendExpr()->IgnoreImplicit())) {
+    if (auto *SuspendMember = dyn_cast<MemberExpr>(SuspendCall->getCallee())) {
+      if (auto *BaseExpr = SuspendMember->getBase()) {
+        // `IgnoreImplicitAsWritten` is critical since `await_suspend...` can be
+        // invoked on the base of the actual awaiter, and the base need not have
+        // the attribute. In such cases, the AST will show the true awaiter
+        // being upcast to the base.
+        AwaiterClass = BaseExpr->IgnoreImplicitAsWritten()
+                           ->getType()
+                           ->getAsCXXRecordDecl();
+      }
+      if (auto *SuspendMethod =
+              dyn_cast<CXXMethodDecl>(SuspendMember->getMemberDecl())) {
+        SuspendMethodName = SuspendMethod->getName();
+      }
+    }
+  }
+  if (SuspendMethodName == "await_suspend_destroy") {
+    assert(!AwaiterClass ||
+           AwaiterClass->hasAttr<CoroAwaitSuspendDestroyAttr>());
+    return true;
+  } else if (SuspendMethodName == "await_suspend") {
+    assert(!AwaiterClass ||
+           !AwaiterClass->hasAttr<CoroAwaitSuspendDestroyAttr>());
+    return false;
+  } else {
+    llvm::report_fatal_error(
+        "Wrong method in [[clang::coro_await_suspend_destroy]] check: "
+        "expected 'await_suspend' or 'await_suspend_destroy', but got '" +
+        SuspendMethodName + "'");
+  }
+
+  return false;
+}
+
 // Emit suspend expression which roughly looks like:
 //
 //   auto && x = CommonExpr();
@@ -220,6 +280,25 @@ namespace {
     RValue RV;
   };
 }
+
+// The simplified `await_suspend_destroy` path avoids suspend intrinsics.
+static void emitAwaitSuspendDestroy(CodeGenFunction &CGF, CGCoroData &Coro,
+                                    llvm::Function *SuspendWrapper,
+                                    llvm::Value *Awaiter, llvm::Value *Frame,
+                                    bool AwaitSuspendCanThrow) {
+  SmallVector<llvm::Value *, 2> DirectCallArgs;
+  DirectCallArgs.push_back(Awaiter);
+  DirectCallArgs.push_back(Frame);
+
+  if (AwaitSuspendCanThrow) {
+    CGF.EmitCallOrInvoke(SuspendWrapper, DirectCallArgs);
+  } else {
+    CGF.EmitNounwindRuntimeCall(SuspendWrapper, DirectCallArgs);
+  }
+
+  CGF.EmitBranchThroughCleanup(Coro.CleanupJD);
+}
+
 static LValueOrRValue emitSuspendExpression(CodeGenFunction &CGF, CGCoroData &Coro,
                                     CoroutineSuspendExpr const &S,
                                     AwaitKind Kind, AggValueSlot aggSlot,
@@ -234,7 +313,6 @@ static LValueOrRValue emitSuspendExpression(CodeGenFunction &CGF, CGCoroData &Co
   auto Prefix = buildSuspendPrefixStr(Coro, Kind);
   BasicBlock *ReadyBlock = CGF.createBasicBlock(Prefix + Twine(".ready"));
   BasicBlock *SuspendBlock = CGF.createBasicBlock(Prefix + Twine(".suspend"));
-  BasicBlock *CleanupBlock = CGF.createBasicBlock(Prefix + Twine(".cleanup"));
 
   // If expression is ready, no need to suspend.
   CGF.EmitBranchOnBoolExpr(S.getReadyExpr(), ReadyBlock, SuspendBlock, 0);
@@ -243,95 +321,105 @@ static LValueOrRValue emitSuspendExpression(CodeGenFunction &CGF, CGCoroData &Co
   CGF.EmitBlock(SuspendBlock);
 
   auto &Builder = CGF.Builder;
-  llvm::Function *CoroSave = CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_save);
-  auto *NullPtr = llvm::ConstantPointerNull::get(CGF.CGM.Int8PtrTy);
-  auto *SaveCall = Builder.CreateCall(CoroSave, {NullPtr});
 
   auto SuspendWrapper = CodeGenFunction(CGF.CGM).generateAwaitSuspendWrapper(
       CGF.CurFn->getName(), Prefix, S);
 
-  CGF.CurCoro.InSuspendBlock = true;
-
   assert(CGF.CurCoro.Data && CGF.CurCoro.Data->CoroBegin &&
          "expected to be called in coroutine context");
 
-  SmallVector<llvm::Value *, 3> SuspendIntrinsicCallArgs;
-  SuspendIntrinsicCallArgs.push_back(
-      CGF.getOrCreateOpaqueLValueMapping(S.getOpaqueValue()).getPointer(CGF));
-
-  SuspendIntrinsicCallArgs.push_back(CGF.CurCoro.Data->CoroBegin);
-  SuspendIntrinsicCallArgs.push_back(SuspendWrapper);
-
-  const auto SuspendReturnType = S.getSuspendReturnType();
-  llvm::Intrinsic::ID AwaitSuspendIID;
-
-  switch (SuspendReturnType) {
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendVoid:
-    AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_void;
-    break;
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendBool:
-    AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_bool;
-    break;
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle:
-    AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_handle;
-    break;
-  }
-
-  llvm::Function *AwaitSuspendIntrinsic = CGF.CGM.getIntrinsic(AwaitSuspendIID);
-
   // SuspendHandle might throw since it also resumes the returned handle.
+  const auto SuspendReturnType = S.getSuspendReturnType();
   const bool AwaitSuspendCanThrow =
       SuspendReturnType ==
           CoroutineSuspendExpr::SuspendReturnType::SuspendHandle ||
       StmtCanThrow(S.getSuspendExpr());
 
-  llvm::CallBase *SuspendRet = nullptr;
-  // FIXME: add call attributes?
-  if (AwaitSuspendCanThrow)
-    SuspendRet =
-        CGF.EmitCallOrInvoke(AwaitSuspendIntrinsic, SuspendIntrinsicCallArgs);
-  else
-    SuspendRet = CGF.EmitNounwindRuntimeCall(AwaitSuspendIntrinsic,
-                                             SuspendIntrinsicCallArgs);
+  llvm::Value *Awaiter =
+      CGF.getOrCreateOpaqueLValueMapping(S.getOpaqueValue()).getPointer(CGF);
+  llvm::Value *Frame = CGF.CurCoro.Data->CoroBegin;
 
-  assert(SuspendRet);
-  CGF.CurCoro.InSuspendBlock = false;
+  if (useCoroAwaitSuspendDestroy(S)) { // Call `await_suspend_destroy` & cleanup
+    emitAwaitSuspendDestroy(CGF, Coro, SuspendWrapper, Awaiter, Frame,
+                            AwaitSuspendCanThrow);
+  } else { // Normal suspend path -- can actually suspend, uses intrinsics
+    CGF.CurCoro.InSuspendBlock = true;
 
-  switch (SuspendReturnType) {
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendVoid:
-    assert(SuspendRet->getType()->isVoidTy());
-    break;
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendBool: {
-    assert(SuspendRet->getType()->isIntegerTy());
-
-    // Veto suspension if requested by bool returning await_suspend.
-    BasicBlock *RealSuspendBlock =
-        CGF.createBasicBlock(Prefix + Twine(".suspend.bool"));
-    CGF.Builder.CreateCondBr(SuspendRet, RealSuspendBlock, ReadyBlock);
-    CGF.EmitBlock(RealSuspendBlock);
-    break;
-  }
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle: {
-    assert(SuspendRet->getType()->isVoidTy());
-    break;
-  }
-  }
+    SmallVector<llvm::Value *, 3> SuspendIntrinsicCallArgs;
+    SuspendIntrinsicCallArgs.push_back(Awaiter);
+    SuspendIntrinsicCallArgs.push_back(Frame);
+    SuspendIntrinsicCallArgs.push_back(SuspendWrapper);
+    BasicBlock *CleanupBlock = CGF.createBasicBlock(Prefix + Twine(".cleanup"));
 
-  // Emit the suspend point.
-  const bool IsFinalSuspend = (Kind == AwaitKind::Final);
-  llvm::Function *CoroSuspend =
-      CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_suspend);
-  auto *SuspendResult = Builder.CreateCall(
-      CoroSuspend, {SaveCall, Builder.getInt1(IsFinalSuspend)});
+    llvm::Function *CoroSave = CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_save);
+    auto *NullPtr = llvm::ConstantPointerNull::get(CGF.CGM.Int8PtrTy);
+    auto *SaveCall = Builder.CreateCall(CoroSave, {NullPtr});
 
-  // Create a switch capturing three possible continuations.
-  auto *Switch = Builder.CreateSwitch(SuspendResult, Coro.SuspendBB, 2);
-  Switch->addCase(Builder.getInt8(0), ReadyBlock);
-  Switch->addCase(Builder.getInt8(1), CleanupBlock);
+    llvm::Intrinsic::ID AwaitSuspendIID;
 
-  // Emit cleanup for this suspend point.
-  CGF.EmitBlock(CleanupBlock);
-  CGF.EmitBranchThroughCleanup(Coro.CleanupJD);
+    switch (SuspendReturnType) {
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendVoid:
+      AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_void;
+      break;
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendBool:
+      AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_bool;
+      break;
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle:
+      AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_handle;
+      break;
+    }
+
+    llvm::Function *AwaitSuspendIntrinsic =
+        CGF.CGM.getIntrinsic(AwaitSuspendIID);
+
+    llvm::CallBase *SuspendRet = nullptr;
+    // FIXME: add call attributes?
+    if (AwaitSuspendCanThrow)
+      SuspendRet =
+          CGF.EmitCallOrInvoke(AwaitSuspendIntrinsic, SuspendIntrinsicCallArgs);
+    else
+      SuspendRet = CGF.EmitNounwindRuntimeCall(AwaitSuspendIntrinsic,
+                                               SuspendIntrinsicCallArgs);
+
+    assert(SuspendRet);
+    CGF.CurCoro.InSuspendBlock = false;
+
+    switch (SuspendReturnType) {
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendVoid:
+      assert(SuspendRet->getType()->isVoidTy());
+      break;
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendBool: {
+      assert(SuspendRet->getType()->isIntegerTy());
+
+      // Veto suspension if requested by bool returning await_suspend.
+      BasicBlock *RealSuspendBlock =
+          CGF.createBasicBlock(Prefix + Twine(".suspend.bool"));
+      CGF.Builder.CreateCondBr(SuspendRet, RealSuspendBlock, ReadyBlock);
+      CGF.EmitBlock(RealSuspendBlock);
+      break;
+    }
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle: {
+      assert(SuspendRet->getType()->isVoidTy());
+      break;
+    }
+    }
+
+    // Emit the suspend point.
+    const bool IsFinalSuspend = (Kind == AwaitKind::Final);
+    llvm::Function *CoroSuspend =
+        CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_suspend);
+    auto *SuspendResult = Builder.CreateCall(
+        CoroSuspend, {SaveCall, Builder.getInt1(IsFinalSuspend)});
+
+    // Create a switch capturing three possible continuations.
+    auto *Switch = Builder.CreateSwitch(SuspendResult, Coro.SuspendBB, 2);
+    Switch->addCase(Builder.getInt8(0), ReadyBlock);
+    Switch->addCase(Builder.getInt8(1), CleanupBlock);
+
+    // Emit cleanup for this suspend point.
+    CGF.EmitBlock(CleanupBlock);
+    CGF.EmitBranchThroughCleanup(Coro.CleanupJD);
+  }
 
   // Emit await_resume expression.
   CGF.EmitBlock(ReadyBlock);
diff --git a/clang/lib/Sema/SemaCoroutine.cpp b/clang/lib/Sema/SemaCoroutine.cpp
index d193a33f22393..83fe7219c9997 100644
--- a/clang/lib/Sema/SemaCoroutine.cpp
+++ b/clang/lib/Sema/SemaCoroutine.cpp
@@ -289,6 +289,45 @@ static ExprResult buildCoroutineHandle(Sema &S, QualType PromiseType,
   return S.BuildCallExpr(nullptr, FromAddr.get(), Loc, FramePtr, Loc);
 }
 
+// To support [[clang::coro_await_suspend_destroy]], this builds
+//   *static_cast<Promise*>(
+//       __builtin_coro_promise(handle, alignof(Promise), false))
+static ExprResult buildPromiseRef(Sema &S, QualType PromiseType,
+                                  SourceLocation Loc) {
+  uint64_t Align =
+      S.Context.getTypeAlign(PromiseType) / S.Context.getCharWidth();
+
+  // Build the call to __builtin_coro_promise()
+  SmallVector<Expr *, 3> Args = {
+      S.BuildBuiltinCallExpr(Loc, Builtin::BI__builtin_coro_frame, {}),
+      S.ActOnIntegerConstant(Loc, Align).get(),         // alignof(Promise)
+      S.ActOnCXXBoolLiteral(Loc, tok::kw_false).get()}; // false
+  ExprResult CoroPromiseCall =
+      S.BuildBuiltinCallExpr(Loc, Builtin::BI__builtin_coro_promise, Args);
+
+  if (CoroPromiseCall.isInvalid())
+    return ExprError();
+
+  // Cast to Promise*
+  ExprResult CastExpr = S.ImpCastExprToType(
+      CoroPromiseCall.get(), S.Context.getPointerType(PromiseType), CK_BitCast);
+  if (CastExpr.isInvalid())
+    return ExprError();
+
+  // Dereference to get Promise&
+  return S.CreateBuiltinUnaryOp(Loc, UO_Deref, CastExpr.get());
+}
+
+static bool hasCoroAwaitSuspendDestroyAttr(Expr *Awaiter) {
+  QualType AwaiterType = Awaiter->getType();
+  if (auto *RD = AwaiterType->getAsCXXRecordDecl()) {
+    if (RD->hasAttr<CoroAwaitSuspendDestroyAttr>()) {
+      return true;
+    }
+  }
+  return false;
+}
+
 struct ReadySuspendResumeResult {
   enum AwaitCallType { ACT_Ready, ACT_Suspend, ACT_Resume };
   Expr *Results[3];
@@ -399,15 +438,30 @@ static ReadySuspendResumeResult buildCoawaitCalls(Sema &S, VarDecl *CoroPromise,
       Calls.Results[ACT::ACT_Ready] = S.MaybeCreateExprWithCleanups(Conv.get());
   }
 
-  ExprRes...
[truncated]

NewSigma · 2025-08-08T02:22:17Z

Given your code contains destroy in await_suspend, will the solution of #148380 work for you?

snarkmaster · 2025-08-08T02:52:43Z

@NewSigma, thanks!

I quickly ran my folly::result benchmark from the description under clang-17 rather than clang-19, and ... its performance is even worse on 17 than 19. So, I doubt that resolving the inlining-of-await_suspend issue will directly address my problem.

The fundamental problem I'm addressing isn't lack of devirtualization, but lack of heap alloc elision.

That said, if you have a PoC patch, I'm happy to try it.

ChuanqiXu9 · 2025-08-08T06:37:44Z

@NewSigma, thanks!

I quickly ran my folly::result benchmark from the description under clang-17 rather than clang-19, and ... its performance is even worse on 17 than 19. So, I doubt that resolving the inlining-of-await_suspend issue will directly address my problem.

The fundamental problem I'm addressing isn't lack of devirtualization, but lack of heap alloc elision.

That said, if you have a PoC patch, I'm happy to try it.

I didn't have time to look into details yet. But could your patch cover #148380?

And also, it is better to split changes between clang and libc++ whenever possible.

snarkmaster · 2025-08-08T07:18:47Z

@ChuanqiXu9,

(1) I'm not actually changing libc++. I just added an end-to-end test for this compiler feature. It's backwards- and forwards-compatible, so I don't think there's any risk to bundling it with this PR.

(2) My patch isn't a fully general solution to #148380. Rather, it's a (probably) better solution to the more narrow class of problems of short-circuiting suspends.

To use my thing, two things have to be true:

You actually have to opt into the new attribute and define await_suspend_destroy to get the perf benefits.
You can only do so if your await_suspend would destroy the coro unconditionally.

When both of those are true (opt-in & unconditional deestroy), it results in a much simpler compilation setup -- HALO, less IR, less optimized code, better perf.

ChuanqiXu9 · 2025-08-08T07:22:00Z

@ChuanqiXu9,

(1) I'm not actually changing libc++. I just added an end-to-end test for this compiler feature. It's backwards- and forwards-compatible, so I don't think there's any risk to bundling it with this PR.

This is not about risks. It is simply about developing processes. In another word, it will be faster to merge this if you split it.

(2) My patch isn't a fully general solution to #148380. Rather, it's a (probably) better solution to the more narrow class of problems of short-circuiting suspends.

To use my thing, two things have to be true:

You actually have to opt into the new attribute and define await_suspend_destroy to get the perf benefits.

You can only do so if your await_suspend would destroy the coro unconditionally.

When both of those are true (opt-in & unconditional deestroy), it results in a much simpler compilation setup -- HALO, less IR, less optimized code, better perf.

In another word, if users want, this patch can cover #148380, right? Could you have a test for this?

ChuanqiXu9

Did a pretty quick scanning.

It is slightly surprising to me that you didn't touch LLVM. The optimization somehow comes from skipping AwaitSuspend intrinsic, which tries to solve such patterns:

void await_suspend(...) {
     ...
     call resume conditionally directly or indirectly
     ...
     some other codes
}

Then if the coroutine is destroyed in other threads, the finally generated code is unsafe even if the code wrote by the user is safe since the middle end may perform some optimizations.

But in your case, it is fine. Since it said it will be destroyed after await suspend destroy. A valid program can't destroy a coroutine instance twice. So I like the idea.

But similarly, if you want, may be you can implement similar attributes to not invoke AwaitSuspend intrinsic in different conditions.

clang/include/clang/Basic/AttrDocs.td

clang/lib/CodeGen/CGCoroutine.cpp

snarkmaster · 2025-08-09T00:25:23Z

Thanks for the initial review! I'm glad you like the idea. I'll finish polishing the e2e test, and address your inline comments 1-by-1 above.

split out libcxx change

Okay, I'll do that for the next update, no problem.

... can cover #148380, right?

For the trivial example of void await_suspend(auto handle) { handle.destroy(); }, yes, the user could use the new attribute.

But, I think it's possible to have coros that conditionally destroy the passed-in handle. So, if the user had something more complicated (e.g. coro that returns a handle, and sometimes destroys the current coro), then my thing wouldn't help.

Could you have a test for this?

My current IR test verifies that there's no heap alloc across a few examples. Were you thinking of a different kind of test? Can you be very specific since I'm totally new to LLVM?

Response requested ^^^

if you want, may be you can implement similar attributes to not invoke AwaitSuspend intrinsic in different conditions.

I don't know when I'll next be able to make time to work on this, but if I do, my next target would be std::suspend_never. I hope this one could be done without an attribute, though. Do you have other use-cases in mind?

snarkmaster · 2025-08-09T01:08:39Z

@ChuanqiXu9, I marked a few of the inline comments "Response requested", please take a look.

@ChuanqiXu9

…d test for [[clang::coro_await_suspend_destroy]] When reviewing llvm#152623, @@ChuanqiXu9 suggested that it's better to separate `clang` and `libcxx` changes, so I did that here. This event-replay test is compatible with compilers lacking the new attribute. The two PRs can be landed in any order. Furthermore, the test does not just verify the attribute's correct behavior. By storing "gold" event traces, it also provides some protection against future breakage in the complex control flow of the C++ coroutine implementation. Since suspension remains an area of active optimization work, such bugs are not ruled out. As a new contributor to LLVM, I myself made (and fixed) 2 control-flow bugs while working on llvm#152623. This test would've caught both.

snarkmaster · 2025-08-09T02:30:29Z

I improved the libcxx-hosted test and put it up as a separate PR: #152820

…lvm-project into coro_await_suspend_destroy

ChuanqiXu9 · 2025-08-13T01:54:48Z

Great, thanks for clearing up both my questions!

address and merge

A couple of new things came up at work, but I'll do my best to come back to it this week.

Don‘t be hurry. The process of the PR is quick enough.

#56301 and new attributes

I did a quick read-through, and it sounds like the problem is that within await_suspend, it is unsafe to reorder instructions across any opaque calls into which the handle escapes. If the call's code is visible, then the safety could theoretically be analyzed, but for things like Register() in that example, one has to assume that the handle might have been destroyed on the other side.

Unsurprisingly, folly/coro has the same basic design for scheduling tasks, and also extensively runs under sanitizers, so it's probably just sheer luck we didn't get bitten by the same reordering bug.

Here is await_suspend passing the handle onto a queue:

https://github.com/facebook/folly/blob/main/folly/coro/Task.h#L582

And here is the enqueue (which is also called "register"!):

https://github.com/facebook/folly/blob/main/folly/executors/CPUThreadPoolExecutor.cpp#L261

So, IMO if we want the "attribute to inline await_suspend" idea to do the most good, we need to think about how to signal where the "DO NOT REORDER" barrier lies within the function. Right now, this is a comment, which is lame. It would be far better if the barrier were compiler-visible, and compiler-checked. Here's a really dumb strawman:
// visible definition 
[[clang::coro_may_inline_suspend_with_barrier]] 
coroutine_handle<> register_wrapper(coroutine_handle<> h) {
  // ... some stuff we may want to inline
  return __builtin_suspend_barrier(register_foreign(h));
}

[[clang::coro_may_inline_suspend_with_barrier]] 
coroutine_handle<> await_suspend(coroutine_handle<> h) {
  suspended = true;
  auto new_h = __builtin_suspend_barrier(register_wrapper(h));
  // ... as in 56301
}
In the above design,

The attribute allows us to inline an await_suspend() call -- or recursively, any function call from suspend that was wrapped with __builtin_suspend_barrier().

The attribute fails to compile if

the annotated function's body isn't visible,

(not good, needs to be refined) the function fails to have exactly 1 __builtin_suspend_barrier call on every code path.
EDIT: I want to refine the "exactly 1" condition because it seems entirely reasonable to conditionally omit the barrier on certain paths.

The recursive design allows code to be naturally split across multiple helper functions, while still permitting inlining.

WDYT?

I feel confusing for [[clang::coro_may_inline_suspend_with_barrier]]. It makes things complex. In my mind, we only need:

A class attribute to hint the compiler to not emit await.suspend intrinsics.
As you said, the __builtin_suspend_barrier is helpful for users to give finer grained control.

Start by reading the detailed user-facing docs in `AttrDocs.td`. My immediate motivation was that I noticed that short-circuiting coroutines failed to optimize well. Interact with the demo program here: https://godbolt.org/z/E3YK5c45a If Clang on Compiler Explorer supported [[clang::coro_await_suspend_destroy]], the assembly for `simple_coro` would be drastically shorter, and would not contain a call to `operator new`. Here are a few high-level thoughts that don't belong in the docs: - This has `lit` tests, but what gives me real confidence in its correctness is the integration test in `coro_await_suspend_destroy_test.cpp`. This caught all the interesting bugs that I had in earlier revs, and covers equivalence to the standard code path in far more scenarios. - I considered a variety of other designs. Here are some key design points: * I considered optimizing unmodified `await_suspend()` methods, as long as they unconditionally end with an `h.destroy()` call on the current handle, or an exception. However, this would (a) force dynamic dispatch for `destroy` -- bloating IR & reducing optimization opportunities, (b) require far more complex, delicate, and fragile analysis, (c) retain more of the frame setup, so that e.g. `h.done()` works properly. The current solution shortcuts all these concerns. * I want to `Promise&`, rather than `std::coroutine_handle` to `await_suspend_destroy` -- this is safer, simpler, and more efficient. Short-circuiting corotuines should not touch the handle. This decision forces the attribue to go on the class. Resolving a method attribute would have required looking up overloads for both types, and choosing one, which is costly and a bad UX to boot. * `AttrDocs.td` tells portable code to provide a stub `await_suspend()`. This portability / compatibility solution avoids dire issues that would arise if users relied on `__has_cpp_attribute` and the declaration and definition happened to use different toolchains. In particular, it will even be safe for a future compiler release to killswitch this attribute by removing its implementation and setting its version to 0. ``` let Spellings = [Clang<"coro_destroy_after_suspend", /*allowInC*/ 0, /*Version*/ 0>]; ``` - In the docs, I mention the `HasCoroSuspend` path in `CoroEarly.cpp` as a further optimization opportunity. But, I'm sure there are higher-leverage ways of making these non-suspending coros compile better, I just don't know the coro optimization pipeline well enough to flag them. - IIUC the only interaction of this with `coro_only_destroy_when_complete` would be that the compiler expends fewer cycles. - I ran some benchmarks on [folly::result]( https://github.com/facebook/folly/blob/main/folly/result/docs/result.md). Heap allocs are definitely elided, the compiled code looks like a function, not a coroutine, but there's still an optimization gap. On the plus side, this results in a 4x speedup (!) in optimized ASAN builds (numbers not shown for brevity. ``` // Simple result coroutine that adds 1 to the input result<int> result_coro(result<int>&& r) { co_return co_await std::move(r) + 1; } // Non-coroutine equivalent using value_or_throw() result<int> catching_result_func(result<int>&& r) { return result_catch_all([&]() -> result<int> { if (r.has_value()) { return r.value_or_throw() + 1; } return std::move(r).non_value(); }); } // Not QUITE equivalent to the coro -- lacks the exception boundary result<int> non_catching_result_func(result<int>&& r) { if (r.has_value()) { return r.value_or_throw() + 1; } return std::move(r).non_value(); } ============================================================================ [...]lly/result/test/result_coro_bench.cpp relative time/iter iters/s ============================================================================ result_coro_success 13.61ns 73.49M non_catching_result_func_success 3.39ns 295.00M catching_result_func_success 4.41ns 226.88M result_coro_error 19.55ns 51.16M non_catching_result_func_error 9.15ns 109.26M catching_result_func_error 10.19ns 98.10M ============================================================================ [...]lly/result/test/result_coro_bench.cpp relative time/iter iters/s ============================================================================ result_coro_success 10.59ns 94.39M non_catching_result_func_success 3.39ns 295.00M catching_result_func_success 4.07ns 245.81M result_coro_error 13.66ns 73.18M non_catching_result_func_error 9.00ns 111.11M catching_result_func_error 10.04ns 99.63M ``` Demo program from the Compiler Explorer link above: ```cpp #include <coroutine> #include <optional> // Read this LATER -- this implementation detail isn't required to understand // the value of [[clang::coro_await_suspend_destroy]]. // // `optional_wrapper` exists since `get_return_object()` can't return // `std::optional` directly. C++ coroutines have a fundamental timing mismatch // between when the return object is created and when the value is available: // // 1) Early (coroutine startup): `get_return_object()` is called and must return // something immediately. // 2) Later (when `co_return` executes): `return_value(T)` is called with the // actual value. // 3) Issue: If `get_return_object()` returns the storage, it's empty when // returned, and writing to it later cannot affect the already-returned copy. template <typename T> struct optional_wrapper { std::optional<T> storage_; std::optional<T>*& pointer_; optional_wrapper(std::optional<T>*& p) : pointer_(p) { pointer_ = &storage_; } operator std::optional<T>() { return std::move(storage_); } ~optional_wrapper() {} }; // Make `std::optional` a coroutine template <typename T, typename... Args> struct std::coroutine_traits<std::optional<T>, Args...> { struct promise_type { std::optional<T>* storagePtr_ = nullptr; promise_type() = default; ::optional_wrapper<T> get_return_object() { return ::optional_wrapper<T>(storagePtr_); } std::suspend_never initial_suspend() const noexcept { return {}; } std::suspend_never final_suspend() const noexcept { return {}; } void return_value(T&& value) { *storagePtr_ = std::move(value); } void unhandled_exception() { // Leave storage_ empty to represent error } }; }; template <typename T> struct [[clang::coro_await_suspend_destroy]] optional_awaitable { std::optional<T> opt_; bool await_ready() const noexcept { return opt_.has_value(); } T await_resume() { return std::move(opt_).value(); } // Adding `noexcept` here makes the early IR much smaller, but the // optimizer is able to discard the cruft for simpler cases. void await_suspend_destroy(auto& promise) noexcept { // Assume the return object defaults to "empty" } void await_suspend(auto handle) { await_suspend_destroy(handle.promise()); handle.destroy(); } }; template <typename T> optional_awaitable<T> operator co_await(std::optional<T> opt) { return {std::move(opt)}; } // Non-coroutine baseline -- matches the logic of `simple_coro`. std::optional<int> simple_func(const std::optional<int>& r) { try { if (r.has_value()) { return r.value() + 1; } } catch (...) {} return std::nullopt; // return empty on empty input or error } // Without `coro_await_suspend_destroy`, allocates its frame on-heap. std::optional<int> simple_coro(const std::optional<int>& r) { co_return co_await std::move(r) + 4; } // Without `co_await`, this optimizes much like `simple_func`. // Bugs: // - Doesn't short-circuit when `r` is empty, but throws // - Lacks an exception boundary std::optional<int> wrong_simple_coro(const std::optional<int>& r) { co_return r.value() + 2; } int main() { return simple_func(std::optional<int>{32}).value() + simple_coro(std::optional<int>{8}).value() + wrong_simple_coro(std::optional<int>{16}).value(); } ``` Test Plan: For the all-important E2E test, I used this terrible cargo-culted script to run the new end-to-end test with the new compiler. (Yes, I realize I should only need 10% of those `-D` settings for a successful build.) To make sure the test covered what I meant it to do: - I also added an `#error` in the "no attribute" branch to make sure the compiler indeed supports the attribute. - I ran it with a compiler not supporting the attribute, and that also passed. - I also tried `return 1;` from `main()` and saw the logs of the 7 successful tests running. ```sh #!/bin/bash -uex set -o pipefail LLVMBASE=/path/to/source/of/llvm-project SYSCLANG=/path/to/origianl/bin/clang # NB Can add `--debug-output` to debug cmake... # Bootstrap clang -- Use `RelWithDebInfo` or the next phase is too slow! mkdir -p bootstrap cd bootstrap cmake "$LLVMBASE/llvm" \ -G Ninja \ -DBUILD_SHARED_LIBS=true \ -DCMAKE_ASM_COMPILER="$SYSCLANG" \ -DCMAKE_ASM_COMPILER_ID=Clang \ -DCMAKE_BUILD_TYPE=RelWithDebInfo \ -DCMAKE_CXX_COMPILER="$SYSCLANG"++ \ -DCMAKE_C_COMPILER="$SYSCLANG" \ -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \ -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \ -DLLVM_ENABLE_ASSERTIONS=ON \ -DLLVM_ENABLE_BINDINGS=OFF \ -DLLVM_ENABLE_LLD=ON \ -DLLVM_ENABLE_PROJECTS="clang;lld" \ -DLLVM_OPTIMIZED_TABLEGEN=true \ -DLLVM_FORCE_ENABLE_STATS=ON \ -DLLVM_ENABLE_DUMP=ON \ -DCLANG_DEFAULT_PIE_ON_LINUX=OFF ninja clang lld ninja check-clang-codegencoroutines # Includes the new IR regression tests cd .. NEWCLANG="$PWD"/bootstrap/bin/clang NEWLLD="$PWD"/bootstrap/bin/lld # LIBCXX_INCLUDE_BENCHMARKS=OFF because google-benchmark bugs out cmake "$LLVMBASE/runtimes" \ -G Ninja \ -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \ -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \ -DBUILD_SHARED_LIBS=true \ -DCMAKE_ASM_COMPILER="$NEWCLANG" \ -DCMAKE_ASM_COMPILER_ID=Clang \ -DCMAKE_C_COMPILER="$NEWCLANG" \ -DCMAKE_CXX_COMPILER="$NEWCLANG"++ \ -DLLVM_FORCE_ENABLE_STATS=ON \ -DLLVM_ENABLE_ASSERTIONS=ON \ -DLLVM_ENABLE_LLD=ON \ -DLIBCXX_INCLUDE_TESTS=ON \ -DLIBCXX_INCLUDE_BENCHMARKS=OFF \ -DLLVM_INCLUDE_TESTS=ON \ -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi;libunwind" \ -DCMAKE_BUILD_TYPE=RelWithDebInfo \ -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ninja cxx-test-depends LIBCXXBUILD=$PWD cd "$LLVMBASE" libcxx/utils/libcxx-lit "$LIBCXXBUILD" -v \ libcxx/test/std/language.support/support.coroutines/end.to.end/coro_await_suspend_destroy.pass.cpp ```

- Switch back to a member function attribute so that each overload can independently decide whether it uses `await_suspend_destroy()` - Emit a diagnostic when `await_suspend` and `await_suspend_destroy` have mismatched types - Add UseAwaitSuspendDestroy bit to CoSuspendExpr - Set UseAwaitSuspendDestroy bit from SemaCoroutine - Use UseAwaitSuspendDestroy in CGCoroutine - Move `lit` tests for diagnostics for the new attribute into `test/SemaCXX/` - Add new IR test for issue llvm#148380 - Improve main IR test per feedback. - Improve main IR test to cover "multiple overloads in one awaiter" - Improve AttrDocs per feedback - clang-format

…lvm-project into coro_await_suspend_destroy

snarkmaster · 2025-08-19T06:08:21Z

@ChuanqiXu9, please take a look at the update. I incorporated all the feedback from you and @yuxuanchen1997, as collected in my prior comment / checklist.

I additionally made a change that Yuxuan mentioned, and that I initially thought I wouldn't be able to make. Specifically, I went back to a member function attribute, from a class attribute. The big upside of this is that every overload of await_suspend can independently decide whether it uses await_suspend_destroy(). I am actually going to end up using this in a near-term revision of folly::result, since I am migrating to a co_await or_unwind() syntax for extracting the value both in short-circuiting and in suspending coros.

This commit message has the bullet list of changes: f1e885c -- but, things look different enough it's probably worth a fresh read-through.

Regarding the "no intrinsics" attribute discussion --

I feel confused about [[clang::coro_may_inline_suspend_with_barrier]]. It makes things complex. In my mind, we only need:

A class attribute to hint the compiler to not emit await.suspend intrinsics.

As you said, the __builtin_suspend_barrier is helpful for users to give finer grained control.

More likely, I am confused -- you know the code much better.

The reason I proposed this very specific attribute is that I was worried about end users. Without the compiler enforcing barrier usage, people might apply the attribute to a production coro task, without understanding the subtle miscompilations you might see by dropping the suspend intrinsics. This sort of thing can show up in production as a random bug (heisenbug), which can make it extremely expensive to track down.

So, my goal in coupling "remove intrinsics" with "force a reordering barrier" was to give the end users some margin of safety. I think the high-level goal of safety is a good one.

On the other hand, I don't feel like I understand the internals of LLVM well enough to have a good opinion on the implementability of that specific "safer" idea, or whether there's a different way to make it safer.

It might also be fine to provide them as separate features, but then the docs for the attribute must strongly encourage the users to use the barrier.

clang/include/clang/AST/ExprCXX.h

clang/include/clang/Basic/AttrDocs.td

ChuanqiXu9 · 2025-08-19T07:35:03Z

Regarding the "no intrinsics" attribute discussion --

Let's discuss this later in other thread. It may be confusing to discuss two things in one thread especially for other readers.

snarkmaster · 2025-08-19T17:09:59Z

Addressed nit, and reworked docs explaining the portability stub.

snarkmaster · 2025-08-19T19:04:40Z

Hm, CI clang-format is still bugging out, and producing bad formatting.

When I do this locally, a commend I got from @yuxuanchen1997, this produces no changes. Here, 0732693d8197 is my trunk base. I also tried HEAD^ but that's one of my merge commits, so it's not that interesting.

git diff -U0 --no-color 0732693d8197 | python3 clang/tools/clang-format/clang-format-diff.py -i -p1

OTOH when I run

clang/tools/clang-format/git-clang-format 0732693d8197

Then I repro the bad reformat that indents the closing brace of the anon namespace.

Two questions:

(a) Do we care about keeping git-clang-format happy?
(b) If yes, what workaround strategies are there?

efriedma-quic · 2025-08-19T21:52:29Z

git-clang-format uses the "--lines" option to clang-format restrict formatting updates. If that isn't working correctly, please file a bug. You can ignore the bot after that.

snarkmaster · 2025-08-20T00:02:23Z

@efriedma-quic , thanks, filed #154443

ChuanqiXu9 · 2025-08-20T01:50:42Z

clang/include/clang/Basic/AttrDocs.td

+Instead of calling the annotated ``await_suspend()``, the coroutine calls
+``await_suspend_destroy(Promise&)`` and immediately destroys the coroutine.
+
+Although it is not called, it is strongly recommended that `await_suspend()`
+contain the following portability stub.  The stub ensures the awaiter behaves
+equivalently without `coro_await_suspend_destroy` support, and makes the
+control flow clear to readers unfamiliar with the attribute:
+
+.. code-block:: c++
+
+  void await_suspend_destroy(Promise&) { /* actual implementation*/ }
+  [[clang::coro_await_suspend_destroy]]
+  void await_suspend(std::coroutine_handle<Promise> handle) {
+    // Stub to preserve behavior when the attribute is not supported
+    await_suspend_destroy(handle.promise());
+    handle.destroy();
+  }


This is still confusing. The behavior in my mind may be, the coroutine is immediately destroyed after the annotated await_suspend finished. In another word, the user can think we'll insert an unconditionally .destroy after the annotated await_suspend. The existence of await_suspend_destroy here is super confusing.

Okay, sounds like I need to make it very explicit. Let me put my lawyer hat on! Let me know if the following is good enough, to replace the lines you highlighted?

Here is the formal contract for this attribute.

The attribute is considered active when both of these are true:

The compiler supports it -- i.e. the macro __has_cpp_attribute(clang::coro_await_suspend_destroy) expands to a nonzero integer.

The await_suspend overload applicable to the current coroutine's promise type is annotated with [[clang::coro_await_suspend_destroy]].

If the attribute is not active, then the compiler will follow the C++ standard suspension behavior. When await_ready() returns false:

First, the coroutine is suspended -- the compiler saves the coroutine state and creates a handle.

Then, await_suspend is invoked with the handle.

Note: Without an active attribute, await_suspend_destroy(Promise&) may be defined, but is not part of the compiler's protocol.

If the attribute is active, the compiler will follow this non-standard protocol whenever await_ready() returns false:

First, await_suspend_destroy is invoked with a mutable reference to the awaiting coroutine's promise.

Then, the coroutine is immediately destroyed, as if on co_return ...; but without invoking either return_void() or return_value().

Notes:

The coroutine is not suspended, and a handle is not created.

The applicable await_suspend is not called. It must still be declared, since the compiler looks for the attribute on this special member, but a definition is optional. NB: Before providing a definition, read the note on portability below.

Portability note: It is strongly recommended to write your code in a way that does not rely on support for this attribute. Fortunately, the attribute's contract is designed so that portability does not require conditional compilation.

Suppose you have the following standard await_suspend:

void await_suspend(std::coroutine_handle<MyPromise>& h) { record_suspension_via_promise(h.promise()); h.destroy(); }

Without loss of portability, you can replace it by await_suspend_destroy, plus a fallback await_suspend. Depending on the compiler, either one may be the entry point, but the behavior will be the same -- except for the speed, size, and allocation-elision benefits of the attribute.

// Entry point when `clang::coro_await_suspend_destroy` is supported void await_suspend_destroy(MyPromise& p) { record_suspension_via_promise(p); } // Entry point when `clang::coro_await_suspend_destroy` is not supported. // Emits no code when `clang::coro_await_suspend_destroy` is supported. [[clang::coro_await_suspend_destroy]] void await_suspend(std::coroutine_handle<MyPromise>& h) { await_suspend_destroy(h.promise()); h.destroy(); }

The "standard" and "replacement" forms are equivalent because the fallback await_suspend replicates the attribute's contract whenever the attribute is not supported by the compiler.

Warning: Even if you only use Clang, do not neglect to add the portability stub -- LLVM reserves the right to remove support for this attribute in a later major release.

It is clear now but may be too verbose. But I get your point.

ChuanqiXu9 · 2025-08-20T05:50:03Z

clang/lib/Sema/SemaCoroutine.cpp

+// To support [[clang::coro_await_suspend_destroy]], this builds
+//   *static_cast<Promise*>(
+//       __builtin_coro_promise(handle, alignof(Promise), false))


Suggested change

// To support [[clang::coro_await_suspend_destroy]], this builds

// *static_cast<Promise*>(

// __builtin_coro_promise(handle, alignof(Promise), false))

The comment is not helpful

ChuanqiXu9 · 2025-08-20T05:54:31Z

clang/lib/Sema/SemaCoroutine.cpp

  CallExpr *AwaitSuspend = cast_or_null<CallExpr>(
      BuildSubExpr(ACT::ACT_Suspend, "await_suspend", CoroHandle));
  if (!AwaitSuspend)
    return Calls;
+
+  // When this `await_suspend()` overload is annotated with
+  // `[[clang::coro_await_suspend_destroy]]`, do NOT call `await_suspend()` --
+  // instead call `await_suspend_destroy(Promise&)`.  This assumes that the
+  // `await_suspend()` is just a compatibility stub consisting of:
+  //     await_suspend_destroy(handle.promise());
+  //     handle.destroy();
+  // Users of the attribute must follow this contract.  Then, diagnostics from
+  // both `await_suspend` and `await_suspend_destroy` will get exposed.
+  CallExpr *PlainAwaitSuspend = nullptr;
+  if (FunctionDecl *AwaitSuspendCallee = AwaitSuspend->getDirectCallee()) {
+    if (AwaitSuspendCallee->hasAttr<CoroAwaitSuspendDestroyAttr>()) {
+      Calls.UseAwaitSuspendDestroy = true;
+      ExprResult PromiseRefRes =
+          buildPromiseRef(S, CoroPromise->getType(), Loc);
+      if (PromiseRefRes.isInvalid()) {
+        Calls.IsInvalid = true;
+        return Calls;
+      }
+      Expr *PromiseRef = PromiseRefRes.get();
+      PlainAwaitSuspend = AwaitSuspend;
+      AwaitSuspend = cast_or_null<CallExpr>(
+          BuildSubExpr(ACT::ACT_Suspend, "await_suspend_destroy", PromiseRef));
+      if (!AwaitSuspend)
+        return Calls;
+    }
+  }


This is not good. We don't have GC or any standard mechanism to remove an AST node. I mean, it is not good to generate an AST Node (the await_suspend call) here first but then replace it with await_suspend_destroy. This is not optimal.

If a suspend expression, if you need to generate a call to await_suspend_destroy, you shouldn't generate a call to await_suspend first.

Do you have a pointer for the best way to do overload resolution so I can test for whether the attribute is present, without constructing the AST for the handle or for the call?

I don't have a good method now. And I am confused why do you have to do overload resolution? Is it bad to lookup await_suspend_destroy directly?

why do overload resolution

At a high level, because the behavior of the awaiter can depend on the coroutine type it's suspending. Very concretely, with folly::result, I would like co_await or_unwind(someResultFn()) to unwrap the result , or propagate the error, in both folly::result coroutines (short-circuiting / non-suspending) and folly::coro::Task coroutines (asynchronous, sometimes-suspending).

Due to it asynchronous usage, Task is implemented with a nontrivial final awaiter, which means that or_unwind() cannot immediately destroy the suspending coro if it is a Task.

Thus, or_unwind awaiter can only apply [[clang::coro_await_suspend_destroy]] to await_suspend(std::coroutine_handle<folly::result_promise<T>>), but not to the overload for the Task promise type.

In the current version of this PR, the main lit IR test does cover this promise-dependent overload behavior. So, you can look at that for minimal example code.

good method to resolve overloads

I hacked something up using LookupQualifiedName and OverloadCandidateSet::AddMethodCandidate.

Corner case: this calls DeduceReturnType when await_suspend both has the attribute & returns auto. I have to run body analysis in order to issue the "return types don't match" diagnostic. If that's too expensive, then

It would not be a huge problem to drop the DeduceReturnType call and just make it an error, forcing the user to make the return type known from the declaration.

The "return types must match" diagnostic is not THAT high-value, so we could even drop the diagnostic, and let users who don't read the docs shoot themselves in the foot.

Does the following lookup strategy seem like an improvement to you? If yes, I'll use it to only create the one call expression we actually need, and add some more tests for the diagnostic.

/// Return the type of `await_suspend` if it has `CoroAwaitSuspendDestroyAttr`. /// /// @returns A null type on failure to look up the `await_suspend` overload. /// @returns A null type the `await_suspend` overload lacks the attribute. /// @returns The return type of the selected `await_suspend` overload if it has /// the attribute, which is also the expected return type of the /// `await_suspend_destroy` method that will be called instead. static QualType lookupReturnTypeIfAwaitSuspendDestroy(Sema &S, Expr *Operand, QualType ArgumentType, SourceLocation Loc) { QualType ObjectType = Operand->getType(); LookupResult Found(S, S.PP.getIdentifierInfo("await_suspend"), Loc, Sema::LookupOrdinaryName); { CXXRecordDecl *ObjectRD = ObjectType->getAsCXXRecordDecl(); if (!ObjectRD) return QualType(); S.LookupQualifiedName(Found, ObjectRD); if (Found.empty()) return QualType(); } // Attempt overload resolution using a dummy argument OverloadCandidateSet Candidates(Loc, OverloadCandidateSet::CSK_Normal); OpaqueValueExpr HandleArg(Loc, ArgumentType, VK_PRValue); SmallVector<Expr *, 1> Args = {&HandleArg}; for (LookupResult::iterator I = Found.begin(), E = Found.end(); I != E; ++I) { S.AddMethodCandidate(I.getPair(), ObjectType, Operand->Classify(S.Context), Args, Candidates); } OverloadCandidateSet::iterator Best; if (Candidates.BestViableFunction(S, Loc, Best) == OR_Success) { // Test the best candidate for the attribute if (Best->Function && Best->Function->hasAttr<CoroAwaitSuspendDestroyAttr>()) { QualType RetType = Best->Function->getReturnType(); // If the return type is undeduced, try to deduce it from an available // function body or template pattern. If we fail to deduce the type, the // user will see a diagnostic about a mismatch between the return type of // `await_suspend` and `await_suspend_destroy`. if (RetType->isUndeducedType() && // Returns `StillUndeduced == false` if deduction succeeded. !S.DeduceReturnType(Best->Function, Loc, /*Diagnose=*/false)) { return Best->Function->getReturnType(); } return RetType; } } return QualType(); }

On the one hand, I don't feel it bad. On the other hand, I don't want to touch overloading logics...

I had an idea that, in CGCoroutine, when we find the await_suspend is marked with the attribute (maybe with another name), we can emit it directly instead via coro.await_suspend intrinsics. It should work. As it is just what you did right now. This is what we said as, attributes to not emitting coro.await_suspend intrinsics.

emit await_suspend directly, instead of via coro.await_suspend intrinsics

I agree it's a nice, clean idea. This is sort of what I tried in the very first version of this (#152029), which I quickly put into "draft" mode because I realized my implementation was wrong.

Here's my old thinking, which led to wrong / broken code:

await_suspend() must contain an h.destroy() call since we want portability to compilers without the attribute.

I wanted to make the await_suspend wrapper omit coro.suspend -- my goal here is to make it look like a non-suspending coroutine so it optimizes like a plain function.

The h.destroy() ultimately invokes a coro.destroy intrinsic. The implementation of that intrinsic does an indirect call through a destroy pointer that is written to the coro frame via one of the intrinsics that I elided. That frame setup would also write resume pointer which is required by h.done().

So, now the h.destroy() call will fail with a null pointer dereference.

I tried a few permutations, but at the time I didn't understand the code well enough to try the right one, perhaps.

Are you saying that specifically the following version will both maintain the internal invariants, and optimize well?

Keep coro.save and coro.suspend

Replace coro.awaits.suspend.XXX by a direct await_suspend-wrapper call

I am happy to try this out, though I'll be pleasantly surprised all the indirection optimizes away and it emits good code.

@ChuanqiXu9, can you confirm I have the correct idea now? Are there any gotchas?

I think you get my points.

To make this more clear: the reason why your patch can optimize some cases is, your patch skips coro.await.suspend intrinsics if the await_suspend function is marked as the attribute your proposed.

We can generalize this to a part:

A semantical part. For example, the coro_await_suspend_destroy attribute and its semantics you proposed.

Skip emitting coro.await.suspend. This is why we can optimize it.

cor3ntin · 2025-08-23T08:52:26Z

Hey folks,
Thanks for the work.

Please note that adding a new attribute should go through an RFC.
I know @yuxuanchen1997 proposed something in the same design space a while ago - but we never seem to have reached consensus https://discourse.llvm.org/t/language-extension-for-better-more-deterministic-halo-for-c-coroutines/80044

Given that there have been multiple attempts at similar direction, we should make sure everyone interested in this work has a chance to see it! Thanks

cor3ntin

Let's make sure we don't merge that accidentally

snarkmaster · 2025-08-23T15:54:42Z

Thanks! I'm happy to distill the conversation above into an RFC. I should note this is quite far from, and orthogonal to Yuxuan's RFC, but I'm happy to raise awareness of the short-circuiting coro use-case.

But, first I'd want to ask @ChuanqiXu9 if the lookupReturnTypeIfAwaitSuspendDestroy() implementation (and corresponding discussion of overload resolution) is satisfactory.

The rationale for settling this is that we might as well propose the most implementation-friendly attribute placement.

@ChuanqiXu9, let me know your reaction to the linked comment. If you need to see the full PR first, LMK, and I'll spend the hour or two to clean it up.

ChuanqiXu9 · 2025-08-25T02:16:42Z

Hey folks, Thanks for the work.

Please note that adding a new attribute should go through an RFC. I know @yuxuanchen1997 proposed something in the same design space a while ago - but we never seem to have reached consensus https://discourse.llvm.org/t/language-extension-for-better-more-deterministic-halo-for-c-coroutines/80044

Given that there have been multiple attempts at similar direction, we should make sure everyone interested in this work has a chance to see it! Thanks

Just to note that I don't feel the PR and @yuxuanchen1997 's PR are in the same design space. I think they are for different problems.

snarkmaster requested a review from a team as a code owner August 8, 2025 01:52

snarkmaster mentioned this pull request Aug 8, 2025

Elide suspension points via [[clang::coro_destroyed_on_suspend]] #152029

Closed

yuxuanchen1997 requested a review from ChuanqiXu9 August 8, 2025 01:59

snarkmaster and others added 3 commits August 8, 2025 00:06

Fix CI

eb5557a

Improve doc formatting

5d6a06d

Merge branch 'main' into coro_await_suspend_destroy

9fe0b17

ChuanqiXu9 reviewed Aug 8, 2025

View reviewed changes

snarkmaster mentioned this pull request Aug 9, 2025

[libcxx] Test await_suspend control flow & coro_await_suspend_destroy attr #152820

Open

snarkmaster added 5 commits August 9, 2025 00:27

Rework the AttrDocs.td addition based on feedback

1dabfe6

Split out the libcxx/test change into PR llvm#152820

62789ef

Lift standard suspend flow to emitStandardAwaitSuspend; tweak comment.

4835f37

Merge branch 'llvm:main' into coro_await_suspend_destroy

3e84df1

Merge branch 'coro_await_suspend_destroy' of github.com:snarkmaster/l…

b359f5f

…lvm-project into coro_await_suspend_destroy

Merge branch 'llvm:main' into coro_await_suspend_destroy

4b7707d

snarkmaster mentioned this pull request Aug 18, 2025

[clang] Missed optimization regression: coro.destroy not devirtualized even when possible #148380

Open

snarkmaster and others added 10 commits August 18, 2025 16:46

Fix CI

8bf453e

Improve doc formatting

811501d

Rework the AttrDocs.td addition based on feedback

63cf306

Split out the libcxx/test change into PR llvm#152820

2b2748c

Lift standard suspend flow to emitStandardAwaitSuspend; tweak comment.

cf26f6b

Merge branch 'llvm:main' into coro_await_suspend_destroy

4f4e815

Merge branch 'coro_await_suspend_destroy' of github.com:snarkmaster/l…

2b3a9b7

…lvm-project into coro_await_suspend_destroy

Fix bad merge & some doc backticks

72274d2

Remove another leftover file from bad merge

543bf07

ChuanqiXu9 reviewed Aug 19, 2025

View reviewed changes

clang/include/clang/AST/ExprCXX.h Outdated Show resolved Hide resolved

clang/include/clang/Basic/AttrDocs.td Outdated Show resolved Hide resolved

clang/include/clang/Basic/AttrDocs.td Show resolved Hide resolved

snarkmaster added 2 commits August 19, 2025 10:01

Merge branch 'llvm:main' into coro_await_suspend_destroy

a09b1f8

Address 2 more comments

e6b6367

snarkmaster mentioned this pull request Aug 20, 2025

git-clang-format misformats CGCoroutine.cpp, adding spurious indentation across a large block #154443

Open

ChuanqiXu9 reviewed Aug 20, 2025

View reviewed changes

cor3ntin requested changes Aug 23, 2025

View reviewed changes

	// To support [[clang::coro_await_suspend_destroy]], this builds
	// static_cast<Promise>(
	// __builtin_coro_promise(handle, alignof(Promise), false))

Elide suspension points via [[clang::coro_await_suspend_destroy]] #152623

Are you sure you want to change the base?

Elide suspension points via [[clang::coro_await_suspend_destroy]] #152623

Uh oh!

Conversation

snarkmaster commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Aug 8, 2025

Uh oh!

llvmbot commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NewSigma commented Aug 8, 2025

Uh oh!

snarkmaster commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ChuanqiXu9 commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

snarkmaster commented Aug 8, 2025

Uh oh!

ChuanqiXu9 commented Aug 8, 2025

Uh oh!

ChuanqiXu9 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

snarkmaster commented Aug 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

snarkmaster commented Aug 9, 2025

Uh oh!

snarkmaster commented Aug 9, 2025

Uh oh!

ChuanqiXu9 commented Aug 13, 2025

Uh oh!

snarkmaster commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ChuanqiXu9 commented Aug 19, 2025

Uh oh!

snarkmaster commented Aug 19, 2025

Uh oh!

snarkmaster commented Aug 19, 2025

Uh oh!

efriedma-quic commented Aug 19, 2025

Uh oh!

snarkmaster commented Aug 20, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

snarkmaster Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

snarkmaster Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

snarkmaster commented Aug 8, 2025 •

edited

Loading

llvmbot commented Aug 8, 2025 •

edited

Loading

snarkmaster commented Aug 8, 2025 •

edited

Loading

ChuanqiXu9 commented Aug 8, 2025 •

edited

Loading

snarkmaster commented Aug 9, 2025 •

edited

Loading

snarkmaster commented Aug 19, 2025 •

edited

Loading

snarkmaster Aug 20, 2025 •

edited

Loading

snarkmaster Aug 20, 2025 •

edited

Loading

snarkmaster commented Aug 23, 2025 •

edited

Loading