
WIP: divup op #152144

Open: wants to merge 14 commits into main

Conversation

@msaroufim (Member) commented Apr 24, 2025

Please don't bother reviewing; this code was not written by a human, and it's mostly for me to understand all the requirements for a new PyTorch operator.

################################################################################
# MEGA-PROMPT: How to Teach an LLM to Add **ANY** New Element-wise PyTorch Op
#
# Copy this entire file verbatim into `llm.txt` (including these borders) and
# feed it to the model you are evaluating.  Everything is plain text; there are
# no Markdown tables or rendered HTML — just one big, unformatted box.
#
# Your evaluation harness should:
#   • present ONLY this text to the model
#   • capture its stdout
#   • treat that stdout as a `patch -p1` diff to be applied to a clean PyTorch
#     checkout (main branch)
#   • build & test (you decide which tests)
#
# The model “passes” when the patched checkout builds and the new tests succeed.
#
# This prompt teaches the model:
#   1.  The **exact** files that typically need edits for a new operator
#   2.  Why each edit is necessary
#   3.  How to generalise the pattern for other ops
#   4.  What the final diff must look like (no chatter, no logging)
#
# It uses `ceiling_divide`/`divup` as a fully worked example.  The annotated
# reference diff is NOT meant to be applied by the harness; it’s a tutorial
# for the LLM.  When you build your own suite, swap in a different example or
# trim as you like — but **keep the structure** so the model can learn.
################################################################################

──────────────────────────────── SECTION 1 — HIGH-LEVEL OVERVIEW ────────────────────────────────
Implementing a new element-wise op in PyTorch touches four layers:

  (A) **Native C++ core (ATen):**
      • Add dispatcher stub, CPU/CUDA kernels, and glue functions.
  (B) **Public dispatcher metadata:**
      • Register schemas in `native_functions.yaml`.
  (C) **Python reference / decomposition layer (torch/_refs):**
      • Provide a mathematically-correct reference so Autograd & PrimTorch work.
  (D) **Python surface & tests:**
      • Expose dunder overloads (`Tensor.__op__`) and add tests.

If you forget ANY layer, either the build fails or the runtime raises
“operator not found” errors.

──────────────────────────────── SECTION 2 — STEP-BY-STEP CHECKLIST ─────────────────────────────
For *every* new element-wise binary operator **OP** (here `ceiling_divide`
alias `divup`) follow this template:

1. **Dispatcher symbol**
   • In the cpp file that already hosts similar ops, add
     `DEFINE_DISPATCH(<op_stub>);`
   • Add a matching `DECLARE_DISPATCH` to the header.

2. **Kernels**
   • Write at least one device kernel (CPU is mandatory, CUDA optional).
   • Register each with `REGISTER_DISPATCH(<op_stub>, &<kernel_fn>);`.
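   • For `ceiling_divide`, the per-element result is `ceil(a / b)`.  A hedged
     Python sketch of the math the kernel has to reproduce (the real kernel is
     C++; the helper name below is illustrative only):

         def div_ceil(a, b):
             # ceil(a / b) via the identity -(-a // b); assumes b != 0.
             # Handles negative operands and avoids the overflow-prone
             # (a + b - 1) // b form, which only works for positive values.
             return -(-a // b)

         div_ceil(7, 2)    # ->  4   (7 / 2 = 3.5, rounded up)
         div_ceil(-7, 2)   # -> -3   (-3.5, rounded toward +inf)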

3. **Backend-agnostic wrappers**
   • Add `*_impl` helpers that create a `TensorIterator`, call the stub, and
     return the result.
   • Provide `op`, `op_`, and `op_out` overloads for
     - Tensor × Tensor
     - Tensor × Scalar (wrap scalar via `wrapped_scalar_tensor`)
   • Keep scalar wrappers in the same file so Autograd tracing works.

4. **`native_functions.yaml` entries**
   • One line per overload; point CPU/MPS (and CUDA if ready) to
     the C++ symbols from step 3.
   • If a variant is still Composite, mark it `CompositeExplicitAutograd`.

5. **Reference implementation**
   • In `torch/_refs/__init__.py` create a decorated
     `_make_elementwise_binary_reference` function that calls existing ops.
   • For aliases, write a thin function that forwards to the canonical op.
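   • A hedged sketch of the reference body (copy the decorator and
     type-promotion arguments from a neighbouring op such as `floor_divide`;
     assumes signed-integer or floating inputs):

         # torch/_refs/__init__.py (sketch, not the exact decorator signature)
         def ceiling_divide(a, b):
             # ceil(a / b) == -floor(-a / b); floor rounding already exists
             return torch.neg(torch.div(torch.neg(a), b, rounding_mode="floor"))

         def divup(a, b):
             # alias: forward to the canonical op
             return ceiling_divide(a, b)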

6. **Python dunder support**
   • In `torch/_tensor.py` add `__op__` and `__rop__` methods that dispatch to
     `torch.OP`.
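   • A hedged sketch (assumes the `torch.ceiling_divide` schema from step 4;
     the reflected variant wraps the scalar so the Tensor-first overload
     applies):

         # torch/_tensor.py (sketch)
         def __divup__(self, other):
             return torch.ceiling_divide(self, other)

         def __rdivup__(self, other):
             other = torch.as_tensor(other, dtype=self.dtype, device=self.device)
             return torch.ceiling_divide(other, self)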

7. **Call-sites inside PyTorch**
   • Replace any ad-hoc math (e.g. `divup(x,y)`) in core helpers with the new
     public API to keep the codebase consistent.

8. **Tests**
   • Add a dedicated `test/test_<op>.py` covering
       – Tensor × Tensor, Tensor × Scalar, in-place, alias
       – Integers, floats, corner cases (zero, inf, sign combinations)
   • Use `onlyCPU` for evaluation harness simplicity (GPU optional).
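   • A hedged sketch of the test file (device/dtype decorators should mirror
     neighbouring test files; the op names assume the schemas from step 4):

         # test/test_divup.py (sketch)
         import torch
         from torch.testing._internal.common_utils import TestCase, run_tests

         class TestDivup(TestCase):
             def test_tensor_tensor_int(self):
                 a = torch.tensor([7, -7, 6, 0])
                 b = torch.tensor([2, 2, 3, 5])
                 expected = torch.tensor([4, -3, 2, 0])  # ceil(a / b)
                 self.assertEqual(torch.ceiling_divide(a, b), expected)
                 self.assertEqual(torch.divup(a, b), expected)             # alias
                 self.assertEqual(a.clone().ceiling_divide_(b), expected)  # in-place

             def test_tensor_scalar(self):
                 a = torch.tensor([5, 9])
                 self.assertEqual(torch.ceiling_divide(a, 4), torch.tensor([2, 3]))

         if __name__ == "__main__":
             run_tests()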

9. **Final diff hygiene**
   • Produce **one** unified diff rooted at repo top.
   • No commentary outside the diff; the harness pipes it straight into `patch`.
   • Lines outside the diff (like this prompt) are never printed by the model.

───────────────────────────── SECTION 3 — ANNOTATED REFERENCE DIFF ──────────────────────────────
The block below is a *teaching* diff.  Comments start with `//!` so they are
ignored by `patch`.  Study why each hunk exists — you’ll copy the pattern when
generating a diff for a *different* operator.

----8<----------------------------------------------------------------------
diff --git a/aten/src/ATen/native/BinaryOps.cpp b/aten/src/ATen/native/BinaryOps.cpp
index f5d5edb6439..8380296da25 100644
--- a/aten/src/ATen/native/BinaryOps.cpp
+++ b/aten/src/ATen/native/BinaryOps.cpp
@@
 DEFINE_DISPATCH(div_trunc_stub);
+DEFINE_DISPATCH(div_ceil_stub);                  //! 1A – new dispatcher symbol
 DEFINE_DISPATCH(remainder_stub);

@@
+// Ceiling division implementation
+Tensor& ceiling_divide_out_impl( … ) {           //! 3 – backend-agnostic glue
+  auto iter = TensorIterator::binary_op(…);
+  div_ceil_stub(iter.device_type(), iter);       //! 3 – call the stub
+  …
+}
+…                                             //! 3 – provide *_impl, _out, _
+
+// Alias for ceiling_divide
+Tensor& divup_out( … ) { return ceiling_divide_out_impl(…); } //! 3 – alias

@@
 Tensor mul( … );                                //! pre-existing code
@@
+Tensor ceiling_divide(const Tensor& self, const Scalar& other) { … } //! 3 – Scalar wrapper

diff --git a/aten/src/ATen/native/BinaryOps.h b/aten/src/ATen/native/BinaryOps.h
@@
 DECLARE_DISPATCH(structured_binary_fn, div_trunc_stub)
+DECLARE_DISPATCH(structured_binary_fn, div_ceil_stub)          //! 1B – header
@@
+// Forward declarations so other C++ can call the op
+Tensor& ceiling_divide_out(…);
+Tensor ceiling_divide(…);
+Tensor& divup_out(…);
+Tensor divup(…);

diff --git a/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp b/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp
@@
+void div_ceil_kernel(TensorIteratorBase& iter) {               //! 2A – CPU kernel
+  …
+}
@@
+REGISTER_DISPATCH(div_ceil_stub, &div_ceil_kernel)            //! 2B – hook kernel

diff --git a/aten/src/ATen/native/cpu/utils.h b/aten/src/ATen/native/cpu/utils.h
@@
-int64_t thread_averge_payload = std::max((int64_t)1, divup(nnz, num_threads));
+int64_t thread_averge_payload = std::max((int64_t)1, at::divup(nnz, num_threads)); //! 7

diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
@@
+- func: ceiling_divide(Tensor self, Tensor other) -> Tensor    //! 4 – schema
+  dispatch:
+    CPU, MPS: ceiling_divide
+…

diff --git a/test/test_divup.py b/test/test_divup.py
+  …                                                        //! 8 – new tests

diff --git a/torch/_refs/__init__.py b/torch/_refs/__init__.py
+@_make_elementwise_binary_reference                               //! 5 – reference
+def ceiling_divide(a, b): …

diff --git a/torch/_tensor.py b/torch/_tensor.py
+    def __divup__(self, other): return torch.ceiling_divide(self, other)  //! 6
----8<----------------------------------------------------------------------

──────────────────────────────── SECTION 4 — GENERALISATION RULES ───────────────────────────────
When you implement **another** op (say `logical_xor`):

* Replace every `ceiling_divide` with `logical_xor`, `div_ceil_stub` with
  `logical_xor_stub`, etc.
* Kernel math changes (see the sketch after this list), but the scaffolding
  (DEFINE_DISPATCH → kernel → REGISTER_DISPATCH → yaml → refs → dunder →
  tests) stays identical.
* Always provide Tensor × Scalar wrappers even if mathematically trivial; many
  internal utilities rely on them.
* If Autograd is needed, mark CompositeExplicitAutograd in YAML OR write a
  derivative in `tools/autograd`.  (For pure integer/floating-point ops a
  composite is usually fine.)
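A concrete illustration of "only the math changes" (hedged Python sketch;
everything else in the reference diff keeps its shape):

    def ceiling_divide_math(a, b):
        return -(-a // b)               # running example

    def logical_xor_math(a, b):
        return (a != 0) ^ (b != 0)      # new op: swap only this expression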

──────────────────────────────── SECTION 5 — WHAT YOUR OUTPUT MUST BE ───────────────────────────
**The model’s entire stdout** must be a **single unified diff** with NO extra
commentary.  Think of it as running `git diff` and pasting the result.
Anything else (print statements, JSON, progress bars) will break `patch`.

Use spaces, not tabs, in diff context lines.  Do not truncate large files; the
patch must be self-contained and apply cleanly.

──────────────────────────────── SECTION 6 — END-OF-FILE ────────────────────────────────────────
# Nothing below this line is part of the prompt.
################################################################################

pytorch-bot (bot) commented Apr 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152144

Note: Links to docs will display an error until the docs builds have been completed.

❌ 46 New Failures, 1 Unrelated Failure

As of commit 6b6427a with merge base 0eb554e:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot (bot) added the module: cpu (CPU specific problem, e.g., perf, algorithm) label on Apr 24, 2025

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.


Caused by:

@msaroufim added the topic: new features label on Apr 25, 2025
@msaroufim added the release notes: cuda and topic: not user facing labels and removed the release notes: cuda label on Apr 25, 2025
@msaroufim removed the review request for eqy and syed-ahmed on April 25, 2025 03:43