
WIP: divup op #152144

Open: wants to merge 14 commits into main

Conversation

@msaroufim (Member) commented Apr 24, 2025

Please don't bother reviewing; this code was not written by a human, and it's mostly for me to understand all the requirements for a new PyTorch operator.

################################################################################
# MEGA-PROMPT: How to Teach an LLM to Add **ANY** New Element-wise PyTorch Op
#
# Copy this entire file verbatim into `llm.txt` (including these borders) and
# feed it to the model you are evaluating.  Everything is plain text; there are
# no Markdown tables or rendered HTML — just one big, unformatted box.
#
# Your evaluation harness should:
#   • present ONLY this text to the model
#   • capture its stdout
#   • treat that stdout as a `patch -p1` diff to be applied to a clean PyTorch
#     checkout (main branch)
#   • build & test (you decide which tests)
#
# The model “passes” when the patched checkout builds and the new tests succeed.
#
# This prompt teaches the model:
#   1.  The **exact** files that typically need edits for a new operator
#   2.  Why each edit is necessary
#   3.  How to generalise the pattern for other ops
#   4.  What the final diff must look like (no chatter, no logging)
#
# It uses `ceiling_divide`/`divup` as a fully worked example.  The annotated
# reference diff is NOT meant to be applied by the harness; it’s a tutorial
# for the LLM.  When you build your own suite, swap in a different example or
# trim as you like — but **keep the structure** so the model can learn.
################################################################################

──────────────────────────────── SECTION 1 — HIGH-LEVEL OVERVIEW ────────────────────────────────
Implementing a new element-wise op in PyTorch touches four layers:

  (A) **Native C++ core (ATen):**
      • Add dispatcher stub, CPU/CUDA kernels, and glue functions.
  (B) **Public dispatcher metadata:**
      • Register schemas in `native_functions.yaml`.
  (C) **Python reference / decomposition layer (torch/_refs):**
      • Provide a mathematically-correct reference so Autograd & PrimTorch work.
  (D) **Python surface & tests:**
      • Expose dunder overloads (`Tensor.__op__`) and add tests.

If you forget ANY layer, either the build fails or the runtime raises
“operator not found” errors.

──────────────────────────────── SECTION 2 — STEP-BY-STEP CHECKLIST ─────────────────────────────
For *every* new element-wise binary operator **OP** (here `ceiling_divide`
alias `divup`) follow this template:

1. **Dispatcher symbol**
   • In the cpp file that already hosts similar ops, add
     `DEFINE_DISPATCH(<op_stub>);`
   • Add a matching `DECLARE_DISPATCH` to the header.

2. **Kernels**
   • Write at least one device kernel (CPU is mandatory, CUDA optional).
   • Register each with `REGISTER_DISPATCH(<op_stub>, &<kernel_fn>);`.
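   • For `ceiling_divide`, the per-element result is `ceil(a / b)`.  A hedged
     Python sketch of the math the kernel has to reproduce (the real kernel is
     C++; the helper name below is illustrative only):

         def div_ceil(a, b):
             # ceil(a / b) via the identity -(-a // b); assumes b != 0.
             # Handles negative operands and avoids the overflow-prone
             # (a + b - 1) // b form, which only works for positive values.
             return -(-a // b)

         div_ceil(7, 2)    # ->  4   (7 / 2 = 3.5, rounded up)
         div_ceil(-7, 2)   # -> -3   (-3.5, rounded toward +inf)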

3. **Backend-agnostic wrappers**
   • Add `*_impl` helpers that create a `TensorIterator`, call the stub, and
     return the result.
   • Provide `op`, `op_`, and `op_out` overloads for
     - Tensor × Tensor
     - Tensor × Scalar (wrap scalar via `wrapped_scalar_tensor`)
   • Keep scalar wrappers in the same file so Autograd tracing works.

4. **`native_functions.yaml` entries**
   • One line per overload; point CPU/MPS (and CUDA if ready) to
     the C++ symbols from step 3.
   • If a variant is still Composite, mark it `CompositeExplicitAutograd`.

5. **Reference implementation**
   • In `torch/_refs/__init__.py` create a decorated
     `_make_elementwise_binary_reference` function that calls existing ops.
   • For aliases, write a thin function that forwards to the canonical op.
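   • A hedged sketch of the reference body (copy the decorator and
     type-promotion arguments from a neighbouring op such as `floor_divide`;
     assumes signed-integer or floating inputs):

         # torch/_refs/__init__.py (sketch, not the exact decorator signature)
         def ceiling_divide(a, b):
             # ceil(a / b) == -floor(-a / b); floor rounding already exists
             return torch.neg(torch.div(torch.neg(a), b, rounding_mode="floor"))

         def divup(a, b):
             # alias: forward to the canonical op
             return ceiling_divide(a, b)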

6. **Python dunder support**
   • In `torch/_tensor.py` add `__op__` and `__rop__` methods that dispatch to
     `torch.OP`.
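   • A hedged sketch (assumes the `torch.ceiling_divide` schema from step 4;
     the reflected variant wraps the scalar so the Tensor-first overload
     applies):

         # torch/_tensor.py (sketch)
         def __divup__(self, other):
             return torch.ceiling_divide(self, other)

         def __rdivup__(self, other):
             other = torch.as_tensor(other, dtype=self.dtype, device=self.device)
             return torch.ceiling_divide(other, self)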

7. **Call-sites inside PyTorch**
   • Replace any ad-hoc math (e.g. `divup(x,y)`) in core helpers with the new
     public API to keep the codebase consistent.

8. **Tests**
   • Add a dedicated `test/test_<op>.py` covering
       – Tensor × Tensor, Tensor × Scalar, in-place, alias
       – Integers, floats, corner cases (zero, inf, sign combinations)
   • Use `onlyCPU` for evaluation harness simplicity (GPU optional).
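   • A hedged sketch of the test file (device/dtype decorators should mirror
     neighbouring test files; the op names assume the schemas from step 4):

         # test/test_divup.py (sketch)
         import torch
         from torch.testing._internal.common_utils import TestCase, run_tests

         class TestDivup(TestCase):
             def test_tensor_tensor_int(self):
                 a = torch.tensor([7, -7, 6, 0])
                 b = torch.tensor([2, 2, 3, 5])
                 expected = torch.tensor([4, -3, 2, 0])  # ceil(a / b)
                 self.assertEqual(torch.ceiling_divide(a, b), expected)
                 self.assertEqual(torch.divup(a, b), expected)             # alias
                 self.assertEqual(a.clone().ceiling_divide_(b), expected)  # in-place

             def test_tensor_scalar(self):
                 a = torch.tensor([5, 9])
                 self.assertEqual(torch.ceiling_divide(a, 4), torch.tensor([2, 3]))

         if __name__ == "__main__":
             run_tests()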

9. **Final diff hygiene**
   • Produce **one** unified diff rooted at repo top.
   • No commentary outside the diff; the harness pipes it straight into `patch`.
   • Lines outside the diff (like this prompt) are never printed by the model.

───────────────────────────── SECTION 3 — ANNOTATED REFERENCE DIFF ──────────────────────────────
The block below is a *teaching* diff.  Comments start with `//!` so they are
ignored by `patch`.  Study why each hunk exists — you’ll copy the pattern when
generating a diff for a *different* operator.

----8<----------------------------------------------------------------------
diff --git a/aten/src/ATen/native/BinaryOps.cpp b/aten/src/ATen/native/BinaryOps.cpp
index f5d5edb6439..8380296da25 100644
--- a/aten/src/ATen/native/BinaryOps.cpp
+++ b/aten/src/ATen/native/BinaryOps.cpp
@@
 DEFINE_DISPATCH(div_trunc_stub);
+DEFINE_DISPATCH(div_ceil_stub);                  //! 1A – new dispatcher symbol
 DEFINE_DISPATCH(remainder_stub);

@@
+// Ceiling division implementation
+Tensor& ceiling_divide_out_impl( … ) {           //! 3 – backend-agnostic glue
+  auto iter = TensorIterator::binary_op(…);
+  div_ceil_stub(iter.device_type(), iter);       //! 3 – call the stub
+  …
+}
+…                                             //! 3 – provide *_impl, _out, _
+
+// Alias for ceiling_divide
+Tensor& divup_out( … ) { return ceiling_divide_out_impl(…); } //! 3 – alias

@@
 Tensor mul( … );                                //! pre-existing code
@@
+Tensor ceiling_divide(const Tensor& self, const Scalar& other) { … } //! 3 – Scalar wrapper

diff --git a/aten/src/ATen/native/BinaryOps.h b/aten/src/ATen/native/BinaryOps.h
@@
 DECLARE_DISPATCH(structured_binary_fn, div_trunc_stub)
+DECLARE_DISPATCH(structured_binary_fn, div_ceil_stub)          //! 1B – header
@@
+// Forward declarations so other C++ can call the op
+Tensor& ceiling_divide_out(…);
+Tensor ceiling_divide(…);
+Tensor& divup_out(…);
+Tensor divup(…);

diff --git a/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp b/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp
@@
+void div_ceil_kernel(TensorIteratorBase& iter) {               //! 2A – CPU kernel
+  …
+}
@@
+REGISTER_DISPATCH(div_ceil_stub, &div_ceil_kernel)            //! 2B – hook kernel

diff --git a/aten/src/ATen/native/cpu/utils.h b/aten/src/ATen/native/cpu/utils.h
@@
-int64_t thread_averge_payload = std::max((int64_t)1, divup(nnz, num_threads));
+int64_t thread_averge_payload = std::max((int64_t)1, at::divup(nnz, num_threads)); //! 7

diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
@@
+- func: ceiling_divide(Tensor self, Tensor other) -> Tensor    //! 4 – schema
+  dispatch:
+    CPU, MPS: ceiling_divide
+…

diff --git a/test/test_divup.py b/test/test_divup.py
+  …                                                        //! 8 – new tests

diff --git a/torch/_refs/__init__.py b/torch/_refs/__init__.py
+@_make_elementwise_binary_reference                               //! 5 – reference
+def ceiling_divide(a, b): …

diff --git a/torch/_tensor.py b/torch/_tensor.py
+    def __divup__(self, other): return torch.ceiling_divide(self, other)  //! 6
----8<----------------------------------------------------------------------

──────────────────────────────── SECTION 4 — GENERALISATION RULES ───────────────────────────────
When you implement **another** op (say `logical_xor`):

* Replace every `ceiling_divide` with `logical_xor`, `div_ceil_stub` with
  `logical_xor_stub`, etc.
* Kernel math changes (see the sketch after this list), but the scaffolding
  (DEFINE_DISPATCH → kernel → REGISTER_DISPATCH → yaml → refs → dunder →
  tests) stays identical.
* Always provide Tensor × Scalar wrappers even if mathematically trivial; many
  internal utilities rely on them.
* If Autograd is needed, mark CompositeExplicitAutograd in YAML OR write a
  derivative in `tools/autograd`.  (For pure integer/floating-point ops a
  composite is usually fine.)
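A concrete illustration of "only the math changes" (hedged Python sketch;
everything else in the reference diff keeps its shape):

    def ceiling_divide_math(a, b):
        return -(-a // b)               # running example

    def logical_xor_math(a, b):
        return (a != 0) ^ (b != 0)      # new op: swap only this expression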

──────────────────────────────── SECTION 5 — WHAT YOUR OUTPUT MUST BE ───────────────────────────
**The model’s entire stdout** must be a **single unified diff** with NO extra
commentary.  Think of it as running `git diff` and pasting the result.
Anything else (print statements, JSON, progress bars) will break `patch`.

Use spaces, not tabs, in diff context lines.  Do not truncate large files; the
patch must be self-contained and apply cleanly.

──────────────────────────────── SECTION 6 — END-OF-FILE ────────────────────────────────────────
# Nothing below this line is part of the prompt.
################################################################################

pytorch-bot (bot) commented Apr 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152144

Note: Links to docs will display an error until the docs builds have been completed.

❌ 46 New Failures, 1 Unrelated Failure

As of commit 6b6427a with merge base 0eb554e:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot (bot) added the module: cpu (CPU specific problem, e.g., perf, algorithm) label on Apr 24, 2025

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.


Caused by:

@msaroufim added the topic: new features label on Apr 25, 2025
@msaroufim added the release notes: cuda and topic: not user facing labels and removed the release notes: cuda label on Apr 25, 2025
@msaroufim removed the review request for eqy and syed-ahmed on April 25, 2025 03:43