[muon] Introduce Muon optimizer to PyTorch #160213
base: main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160213
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 654f754 with merge base 8d3d1c8.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Wooo the approach is much simpler indeed now--thank you for the speedy turnaround on the PR. The one main API question I have is how we handle the NS config (whether it should be in the constructor) which I've commented below. Everything else looks super solid.
I know you've put in amazing work for the benchmarks and correctness compared to the original Muon, and I trust that you have verified this PR is still correct and appropriately fast locally. I will look out for the separate PR with those scripts!
test/test_optim.py
Outdated
params = [weight, bias]
if optim_cls.__name__ == "Muon":
    params = [weight]
nit to not reassign
Suggested change:
params = [weight, bias] if optim_cls.__name__ != "Muon" else [weight]
test/test_optim.py
Outdated
model = torch.nn.Sequential(
    torch.nn.Linear(10, 4, bias=False),
This can just be one Linear then, right? Or maybe it'd be more indicative to add another Linear in there?
Can you add a comment for why we branch here?
test/test_optim.py
Outdated
@@ -1577,14 +1629,26 @@ def test_can_load_from_to_named_state_dict(
    all_optim_inputs = _get_optim_inputs_including_global_cliquey_kwargs(
        device, dtype, optim_info, skip=("differentiable",)
    )

def _get_model_and_input(device, dtype, optim_cls):
let's only have one version of this helper; it looks the same as the one above
@@ -2219,7 +2285,7 @@ def test_defaults_changed_to_foreach(self, device, dtype, optim_info):
    def test_non_empty_state(self, device, dtype, optim_info):
        # There are internal tests that check that the state is not empty
        optim_cls = optim_info.optim_cls
-       model = torch.nn.Linear(5, 5)
+       model = torch.nn.Linear(5, 5, bias=False)
please add a comment that we set bias=False here so the test can run generically with Muon
@@ -1969,7 +2035,7 @@ def pre_hook(opt: Optimizer, args: tuple[Any], kwargs: dict[Any, Any]):
        nonlocal data
        data += 2

-   params = [torch.tensor([1, 1], device=device, dtype=dtype)]
+   params = [torch.tensor([[1, 1]], device=device, dtype=dtype)]
How come these hook changes are necessary?
    nesterov: bool = True,
    *,
    msign_fn: MsignFn = zeropower_via_newtonschulz,
    msign_fn_config: BaseMsignFnConfig = NewtonSchulzConfig(),
What is the benefit of having this config live in the constructor as a struct vs. separate values? Is it because these values are only used if msign_fn is zeropower_via_newtonschulz? If so, should this config not live in the Muon constructor at all, but instead be customizable via the user-provided msign_fn? What are your thoughts?
I also wonder if this config could be a regular dict, accepted in the constructor as Muon(..., msign_fn_config={'eps': 1e-5}) and then just passed as self.msign_fn(..., **self.msign_fn_config) - that way, it could more easily be saved into a state_dict()
...
I think it’s better to encapsulate the configs in a dedicated class, so the function signature stays clean and manageable. Just a preference carried over from my C++ days :)
I see Vadim's point, but I'm not sure it's feasible (or necessary) to store the callable in the state_dict in the first place.
Another option is to have config set as simple args to the function, and then have the user override them via calling functools.partial
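For illustration, a hedged sketch of that functools.partial approach (the keyword names here mirror the assert messages quoted later in this thread, but the actual zeropower_via_newtonschulz signature may differ):

```python
# Hypothetical sketch: override orthogonalization settings via functools.partial
# instead of a config object. Argument names are assumptions, not the PR's API.
from functools import partial

custom_msign = partial(
    zeropower_via_newtonschulz,
    steps=7,
    coefficients=(3.4445, -4.7750, 2.0315),
)
opt = Muon(model.parameters(), lr=0.02, msign_fn=custom_msign)
```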
cc @albanD regarding API design for best practices
Ho that's interesting.
I do agree that this doesn't match how we do APIs in PyTorch in general.
For value config, I would expect they're all passed in as an argument each (see other optimizers).
If you need to override some specific methods and behavior, you can either have a set of pre-defined implementations that a flag toggles between or you can subclass the optimizer to override the particular method you care about.
Also I guess I'm missing some context on why we want to do it this way if there is only one option for each right now?
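For context, a hedged sketch of the two constructor shapes being compared (signatures and field names are illustrative, not the PR's actual API):

```python
# Option A: config object in the constructor (the shape currently in the PR).
# NewtonSchulzConfig field names are assumptions here.
opt_a = Muon(
    params,
    lr=0.02,
    msign_fn=zeropower_via_newtonschulz,
    msign_fn_config=NewtonSchulzConfig(steps=5),
)

# Option B: flat per-value arguments, as in other torch.optim optimizers.
# ns_steps / ns_coefficients are hypothetical names.
opt_b = Muon(params, lr=0.02, ns_steps=5, ns_coefficients=(3.4445, -4.7750, 2.0315))
```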
torch/optim/_muon.py
Outdated
buf = state.get("momentum_buffer")
if buf is None:
    buf = torch.zeros_like(p.grad, memory_format=torch.preserve_format)
    state["momentum_buffer"] = buf
muon_momentum_bufs.append(buf)
Suggested change:
if state.get("momentum_buffer") is None:
    state.get("momentum_buffer") = torch.zeros_like(p.grad, memory_format=torch.preserve_format)
muon_momentum_bufs.append(state.get("momentum_buffer"))

no need for buf intermediate, right
just trying to reduce the number of times we mention "momentum_buffer". Also, state.get("momentum_buffer") = ... seems not right?
Nonetheless, updated the code to:
if "momentum_buffer" not in state:
    state["momentum_buffer"] = torch.zeros_like(
        p.grad, memory_format=torch.preserve_format
    )
muon_momentum_bufs.append(state["momentum_buffer"])
Force-pushed from 5083654 to 0f0df7b
The current CI failures are cuz you (probably accidentally) committed the third_party differences--pls remove those!
Force-pushed from 0f0df7b to 2e6bf8c
uh, they must have come from the rebase. removed
Force-pushed from 2e6bf8c to 1c82fc8
Force-pushed from 1c82fc8 to 654f754
buf = muon_momentum_bufs[i]
buf.mul_(momentum).add_(grad)
if nesterov:
    grad = grad.add(buf, alpha=momentum)
could use lerp_ probably and save some memory
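For reference, a hedged sketch of what the lerp_ form might look like. Note it rescales the gradient contribution by (1 - momentum) relative to mul_/add_, so the effective step changes unless the learning rate is adjusted accordingly:

```python
# EMA-style momentum via lerp_: buf <- momentum * buf + (1 - momentum) * grad.
# Differs from buf.mul_(momentum).add_(grad) by a (1 - momentum) factor on grad.
buf = muon_momentum_bufs[i]
buf.lerp_(grad, 1 - momentum)
if nesterov:
    grad = grad.lerp(buf, momentum)  # grad + momentum * (buf - grad)
else:
    grad = buf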
@dataclass
class BaseMsignFnConfig:
This can be removed right?
__all__ = ["Muon"]

# Constants from Keller Jordan's Muon post: https://kellerjordan.github.io/posts/muon/
nit: link to GitHub + specific line + specific commit to make sure this link will stay up to date
assert steps < 100, (
    "Number of steps must be less than 100 for computational efficiency"
)
assert len(grad.shape) == 2, "Input tensor gradient must be a 2D matrix"
assert len(coefficients) == 3, "Coefficients must be a tuple of exactly 3 values"
No plain asserts, please raise appropriate Runtime/Value/Type errors
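A minimal sketch of the requested change, assuming ValueError is the appropriate type here (names mirror the snippet above):

```python
# Replace plain asserts with explicit exceptions so they survive python -O
# and give users a proper error type.
if steps >= 100:
    raise ValueError("Number of steps must be less than 100 for computational efficiency")
if grad.dim() != 2:
    raise ValueError("Input tensor gradient must be a 2D matrix")
if len(coefficients) != 3:
    raise ValueError("Coefficients must be a tuple of exactly 3 values")
```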
assert len(grad.shape) == 2, "Input tensor gradient must be a 2D matrix"
assert len(coefficients) == 3, "Coefficients must be a tuple of exactly 3 values"
a, b, c = coefficients[0], coefficients[1], coefficients[2]
X = grad.bfloat16()
I would be very surprised if this is the way to go unless you have a hard assert that the param dtype is fixed?
__all__ = ["Muon"]

# Constants from Keller Jordan's Muon post: https://kellerjordan.github.io/posts/muon/
EPS = 1e-7
All epsilons should be dtype dependent to avoid too large noise or flooring to 0.
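For example, a minimal sketch of a dtype-dependent epsilon, assuming EPS is used to guard a norm-based normalization as in Keller Jordan's post (the exact usage in the PR may differ):

```python
# Scale the guard value with the tensor's dtype instead of a fixed 1e-7,
# which is far below bfloat16's resolution.
eps = torch.finfo(X.dtype).eps  # ~1.19e-7 for float32, ~7.8e-3 for bfloat16
X = X / (X.norm() + eps)
```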
A = X @ X.T
B = b * A + c * A @ A
X = a * X + B @ X
nit: would be good to have descriptive names for these variables.
Also there is quite a bit of extra memory usage due to the extra variables, but I guess that can be handled in a follow up.
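For illustration, a hedged sketch of the same Newton-Schulz step with more descriptive names and fused addmm calls that drop one temporary (the names are suggestions, not the PR's code):

```python
# X is the current Newton-Schulz iterate; a, b, c are the quintic coefficients.
gram = X @ X.mT                                              # X @ X^T
gram_poly = torch.addmm(gram, gram, gram, beta=b, alpha=c)   # b*gram + c*(gram @ gram)
X = torch.addmm(X, gram_poly, X, beta=a, alpha=1.0)          # a*X + gram_poly @ X
```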
This optimizer performs momentum SGD followed by an optional orthogonalization
step computed via a user-provided callable.
This should follow the same style as we have in other optimizers like https://docs.pytorch.org/docs/stable/generated/torch.optim.Adam.html#adam to describe un-ambiguoustly the math being performed.
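For illustration only, a rough sketch of the kind of Adam-style algorithm box that could go in the docstring, inferred from the momentum/nesterov snippet above and Keller Jordan's post; the exact placement of weight decay and the lr adjustment would need to match the implementation:

```latex
\begin{aligned}
&\textbf{input}: \gamma \text{ (lr)},\ \mu \text{ (momentum)},\ \lambda \text{ (weight decay)},\ \theta_0,\ f(\theta) \\
&\textbf{for}\ t = 1, 2, \ldots\ \textbf{do} \\
&\quad g_t \leftarrow \nabla_\theta f_t(\theta_{t-1}) \\
&\quad b_t \leftarrow \mu\, b_{t-1} + g_t \\
&\quad \tilde g_t \leftarrow g_t + \mu\, b_t \ \text{ if nesterov, else } \ b_t \\
&\quad O_t \leftarrow \operatorname{msign}(\tilde g_t) \quad \text{(e.g. Newton-Schulz orthogonalization)} \\
&\quad \theta_t \leftarrow (1 - \gamma\lambda)\,\theta_{t-1} - \gamma\, \alpha_t\, O_t \quad \text{($\alpha_t$: lr adjustment)}
\end{aligned}
```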
    lr: float = 1e-3,
    weight_decay: float = 0.1,
    momentum: float = 0.95,
    nesterov: bool = True,
@chuanhaozhuge where was this doc added?
    has_complex: bool,
) -> None:
    lr = _to_scalar(lr)
    assert has_complex is False, "Complex parameters are not supported"
Let's remove plain asserts here as well
Given that we had agreed to land the simplest single-device Muon into torch/optim as our first step, it'd be clearest to land what people accept as the original implementation as defined in Keller Jordan's blog (https://kellerjordan.github.io/posts/muon/). As that implementation chooses Newton-Schulz as the algorithm, we should take the same stance. This means we can simplify the constructor API greatly (I will get to extensibility right after):
I'm realizing that you have an interest in extending the algorithm to be distributed (vs. another orthogonalization algorithm for single-device). We are strict on keeping …
A single-device version of Muon. The algorithm follows the Moonshot implementation.
This PR is an update to #159465 that incorporates suggestions and recommendations on UX and API. In particular, the PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the PyTorch examples directory.
Usage
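(The PR's original usage snippet is not reproduced here; the following is a hypothetical sketch of the parameter-grouping pattern described above. The Muon import path and the AdamW hyperparameters are assumptions; the Muon keyword arguments mirror the constructor defaults shown in the diff.)

```python
import torch
from torch.optim import AdamW
# from torch.optim import Muon  # added by this PR; import path assumed

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512, bias=False),
    torch.nn.Linear(512, 10),
)

# 2D weight matrices go to Muon; everything else (biases, norms, embeddings) to AdamW.
muon_params = [p for p in model.parameters() if p.ndim == 2]
other_params = [p for p in model.parameters() if p.ndim != 2]

opt_muon = Muon(muon_params, lr=0.02, weight_decay=0.1, momentum=0.95)
opt_adamw = AdamW(other_params, lr=3e-4)

x = torch.randn(32, 512)
loss = model(x).square().mean()
loss.backward()
opt_muon.step()
opt_adamw.step()
opt_muon.zero_grad()
opt_adamw.zero_grad()
```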
Additional usage
Users are also able to pass in a self-defined msign function for orthogonalization, as well as a learning-rate adjustment function; the interface is defined below. By default, we use 5-step Newton-Schulz with the coefficients proposed by Keller, and the LR adjustment proposed by Moonshot, which grafts the learning rate from AdamW.
Testing
We updated the test cases in test/test_optim.py to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.
We also trained for one epoch on the openwebtext-100k dataset and compared the resulting loss curve against the Moonshot implementation to confirm behavioral consistency.
Performance
Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP:
Muon runs ~20s slower compared to AdamW. Assuming no other changes, Muon is 2.5% slower than AdamW.
AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms

Muon: Optimizer.step() takes ~54 ms, step time ~960 ms

Next Steps
MuP
cc: @toothacher17, @vinaysrao, @jcui2, @haocizhang