Introduce Muon optimizer to PyTorch #159465
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159465
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (1 Unrelated Failure) As of commit 2750453 with merge base 799303f, the following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@janeyx99 related on passing names to the optimizer:
Thanks for taking this on and adding test cases + a benchmark! I briefly did a high-level review and left some comments. One more thing: have you checked that the correctness of this implementation matches the Moonshot one?
How come you put these in a new file?
Do we expect more tests to be added here? It might be more centralized to have these tests live in test_optim.py anyway, for easier searching and to share the optimizer-info machinery when we add more configs for Muon.
This test is very specific to Muon, so I put it in a new file. There might be new test cases added, for example to test numerical equivalency of alternative msign functions.
My rationale is that test cases general to all optimizers go into test_optim.py, and we create separate files to cover specific functionalities. This way test_optim.py doesn't grow too large, IMO.
Ah, I agree with the logical separation. The reason we have lots of tests in test_optim.py is to be able to run all optim-related tests in one go. Can you update this file + test_optim.py similar to test_lrscheduler.py so we get both the pros of logical separation and being able to run all tests in one go?
torch/optim/_muon.py (Outdated)

__all__ = ["Muon", "muon"]

logger = logging.getLogger(__name__)
Optimizers have generally not output any logs--we tend to let the trainer code or higher level libs handle logs. Is there a reason Muon should have a logger in particular?
Got it, yeah. I just think it's important to surface some specific behavior to users to avoid a potential footgun. Maybe we can control verbosity.
torch/optim/_muon.py (Outdated)

# If no fqns are provided, use Muon for all parameters with 2D shape.
# Note: this may not be the expected behavior since some 2D
# parameters may not be intended to be optimized with Muon, for example Embedding.
logger.warning(
Let's just use normal warning here
Sounds good, I'll use warnings.warn.
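For reference, a minimal sketch of what the warnings.warn version could look like (the message text here is illustrative, not the final wording):

```python
import warnings

# Illustrative: surface the fallback behavior with a standard warning instead
# of a logger call, so users see it without configuring logging.
warnings.warn(
    "No muon_param_fqns provided; Muon will be applied to all 2D parameters. "
    "Pass an explicit FQN list if some 2D parameters (e.g. embeddings) "
    "should be optimized with AdamW instead."
)
```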
    muon_grads.append(p.grad)
    muon_momentum_bufs.append(buf)
else:
    # for the rest of the parameters, we use AdamW to optimize.
Should we delegate this to the AdamW optimizer instead of duping code here?
I think it's a good idea and was considering this as well. I re-implemented for now because we'd have to update the state_dict handling logic if we want to introduce a nested optimizer (I tried that in DiLoCo, where we have inner and outer optimizers). Also, we may want to explore other update rules, so I want to keep the flexibility here.
Let's not try to solve that problem in this PR, let's keep this PR as straightforward as possible.
Doesn't cuBLAS already have a symmetric matmul kernel we can exploit, as opposed to writing our own Triton one?
benchmarks/muon_examples/train.py (Outdated)

    """
    assert len(G.shape) == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
Why bfloat16 cast here?
Mostly for GPU efficiency, I believe, to use tensor-core bf16 FLOPs. This is adopted from Keller/Jingyuan's original implementation.
Why bfloat16 cast here?

The bf16 choice is discussed here: https://kellerjordan.github.io/posts/muon/
Basically, there are several methods to approximate the SVD, and with this method bf16 is good enough; that's why it was chosen by Keller in the first place. With bf16, the communication cost is also halved.
benchmarks/muon_examples/train.py (Outdated)

def zeropower_via_newtonschulz5(G, steps):
    """
    Newton-Schulz iteration to compute the zeroth power / orthogonalization of G. We opt to use a
    quintic iteration whose coefficients are selected to maximize the slope at zero. For the purpose
    of minimizing steps, it turns out to be empirically effective to keep increasing the slope at
    zero even beyond the point where the iteration no longer converges all the way to one everywhere
    on the interval. This iteration therefore does not produce UV^T but rather something like US'V^T
    where S' is diagonal with S_{ii}' ~ Uniform(0.5, 1.5), which turns out not to hurt model
    performance at all relative to UV^T, where USV^T = G is the SVD.
    """
    assert len(G.shape) == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    if G.size(0) > G.size(1):
        X = X.T
    # Ensure spectral norm is at most 1
    X = X / (X.norm() + 1e-7)
    # Perform the NS iterations
    for _ in range(steps):
        A = X @ X.T
        B = (
            b * A + c * A @ A
        )  # adapted from suggestion by @jxbz, @leloykun, and @YouJiacheng
        X = a * X + B @ X

    if G.size(0) > G.size(1):
        X = X.T
    return X
Suggested change:

def zeropower_via_newtonschulz_optimized(
    G: Tensor,
    ns_config: BaseMsignFnConfig,
) -> Tensor:
    # unpack config
    ns_config = cast(NewtonSchulzConfig, ns_config)
    steps = ns_config.ns_steps
    a, b, c = ns_config.coefficients
    assert 1 < steps < 100, "ns_steps must be <100"
    assert G.dim() == 2, "G must be 2D"
    assert len((a, b, c)) == 3
    # cast & maybe transpose so n ≤ k
    X = G.to(torch.bfloat16)
    transposed = False
    if X.size(0) > X.size(1):
        X = X.t().contiguous()
        transposed = True
    # normalize (Frobenius norm)
    X.div_(X.norm() + 1e-7)
    # shapes
    n, k = X.shape
    # pre-allocate buffers
    A = torch.empty((n, n), dtype=X.dtype, device=X.device)
    A2 = torch.empty_like(A)
    B = torch.empty_like(A)
    Y = torch.empty_like(X)
    # Newton–Schulz loop
    for _ in range(steps):
        # A = X @ X^T
        torch.mm(X, X.t(), out=A)
        # A2 = A @ A
        torch.mm(A, A, out=A2)
        # B = b*A + c*A2
        B.copy_(A).mul_(b).add_(A2, alpha=c)
        # Y = B @ X
        torch.mm(B, X, out=Y)
        # X = a * X + Y, in-place
        X.mul_(a).add_(Y)
    # undo transpose
    if transposed:
        X = X.t()
    return X
You are going to support eager without compile, right?
I tested using compile but didn't seem to observe much speedup.
LMAO: I swore I worked on this at some point, and remembered I tried to optimize a CUTLASS implementation of this here: nil0x9/flash-muon#1. The repo switched over to Triton anyway though: https://github.com/nil0x9/flash-muon/blob/80ac87fb49afc792b84eccb393d051b1ed8eee32/flash_muon/matmul_transpose_triton.py#L20 @chuanhaozhuge I assume this is the kernel you were referencing.
The above suggestion should be marginally faster than the current code because we reuse the matrix buffers and use in-place ops.
This is the first impl, used for the purpose of providing a golden answer for future speedup or parallelism work. So we might not need to add a perfect impl or kernel in this PR yet?
In future, when continuing to optimize the impl and integrating into FSDP, we can revisit each specific part and see how to optimize it.
Thanks @Skylion007 for the pointer! What I mentioned is an internal kernel implementation that hasn't been open-sourced. We will find time to benchmark the kernel you referred to above.
Thanks @toothacher17, right, this is the purpose. As we have the interface, I think users can easily plug in an optimized implementation. PyTorch can host some optimized solutions as well, but we need to figure out the support and maintenance model.
benchmarks/muon_examples/train.py (Outdated)

@@ -0,0 +1,395 @@
import math
This file seems to be similar to https://github.com/MoonshotAI/Moonlight/blob/master/examples/toy_train.py
Might be nice to add a link for reference purposes.
Yes! I already included the link in the README. Let me highlight that in the code files as well.
benchmarks/muon_examples/train.py (Outdated)

############################

params = [p for p in group["params"] if self.state[p]["use_muon"]]
# import pdb; pdb.set_trace()
nit: remove the pdb here?
# Note: Muon doesn't support multiple param groups for now.
if muon_param_fqns is not None:
    muon_param_fqns_set = set(muon_param_fqns)
It might be nice to just provide some other simple logic to filter out Muon params. The default logic is actually straightforward:
- word embeddings and the LM head do not go into Muon (can be controlled by name)
- RMSNorm gamma does not go into Muon
Other params go to Muon. An example can be found here:
Discussed offline; having a pre-determined name check may be a footgun. Since we are not using a training library like Megatron, where we know the FQNs of parameters, it's hard to filter by name correctly, and it may create confusion for users.
I improved the warning message to nudge users to include an FQN list when using Muon.
Maybe we should allow the user to pass in a custom function that decides whether a param is a Muon param or an Adam param? Then the user can choose either to provide a list of Muon param names or to provide a function similar to the Megatron example @toothacher17 linked. A sketch of what that could look like is below.
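A minimal sketch of such a user-supplied predicate, under the assumption that the caller (not the optimizer) does the routing; the function name and the substring checks are hypothetical, mirroring the Megatron-style defaults mentioned above:

```python
import torch

# Hypothetical predicate: route 2D weights to Muon, but keep embeddings,
# the LM head, and norm gains on AdamW (the name checks are illustrative).
def is_muon_param(name: str, param: torch.Tensor) -> bool:
    if param.ndim != 2:
        return False
    lowered = name.lower()
    if "embed" in lowered or "lm_head" in lowered or "norm" in lowered:
        return False
    return True

# The caller could then build the FQN list expected by this PR's API:
# muon_fqns = [n for n, p in model.named_parameters() if is_muon_param(n, p)]
```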
Let's move all this logic into a followup PR as there may be more design decisions to be made, and in this PR, we should assume that Muon only handles Muon updates.
load_tests = load_tests


class MoonshotReferenceMuon(torch.optim.Optimizer):
nit: the Moonshot Muon is used multiple times. Do we want to implement it once as a standalone class and reuse it in all tests?
I created a common file under benchmarks/muon_examples so that common code is reused in train.py and train_ddp.py. I'm not updating this file yet, since it's a bit weird to reference code under benchmarks/ from test/optim.
torch/optim/_muon.py (Outdated)

if G.size(0) > G.size(1):
    X = X.T
# Ensure spectral norm is at most 1
X = X / (X.norm() + 1e-7)
nit: define 1e-7 as a constant?
Done; also updated for the coefficients.
torch/optim/_muon.py (Outdated)

@@ -61,7 +66,7 @@ def zeropower_via_newtonschulz(G: Tensor, ns_config: BaseMsignFnConfig) -> Tensor:
     if G.size(0) > G.size(1):
         X = X.T
     # Ensure spectral norm is at most 1
-    X = X / (X.norm() + 1e-7)
+    X = X / (X.norm() + EPS)
It might also be good to add it explicitly to the function arguments, like ..., eps=EPS). This would allow someone to call the function with a different eps without modifying the global vars (and would make these functions purely functional: not depending on global state, only on their explicit arguments).
Good idea, I added eps to NewtonSchulzConfig.
Hi, thanks for everyone's reviews and @chuanhaozhuge's work here! I do think we should greatly simplify this PR to represent the Muon algorithm. We should design a dispatching API for branching params into Muon vs AdamW in the next PR, but let's keep Muon consistent and simple here.
Thanks for showing the code for the benchmarks + matching with Moonshot, but let's move that code to separate gists with results in the PR body.
I've left comments with more details below!
Thanks for showing the code for the benchmarks/correctness comparisons! Let's not land any of the benchmark code (as it can live separately in a different gist or locally).
if optim_cls is Muon:
    atol = 3e-4
    rtol = 5e-5
hmmmm these are big...do you know why this would be?
@@ -617,9 +622,11 @@ def test_correctness(self, device, dtype, optim_info, use_closure):
     param.grad = param.grad.to_sparse()

 opt_compiled = optim_cls(
-    model_compiled.parameters(), **deepcopy(kwargs)
+    model_compiled.named_parameters(), **deepcopy(kwargs)
Let's keep this as parameters(). Let's have Muon assume that all its parameters should receive the Muon update (all the logic for dispatching FQNs should live above the optimizer).
    model_compiled.named_parameters(), **deepcopy(kwargs)
)
opt_eager = optim_cls(
    model_eager.named_parameters(), **deepcopy(kwargs)
same here
bias_correction1 = 1 - beta1 ** step.item()
bias_correction2 = 1 - beta2 ** step.item()
.item() calls are expensive--why not:
Suggested change:

bias_correction1 = 1 - beta1 ** step
bias_correction2 = 1 - beta2 ** step
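A quick illustration of why the suggestion works as-is: exponentiation broadcasts over a tensor step, so the bias corrections stay tensors and avoid the host sync that .item() forces (a sketch with made-up values):

```python
import torch

beta1, beta2 = 0.9, 0.999
step = torch.tensor(10.0)  # step counters are kept as tensors in optimizer state

# No .item(): the result is a tensor and no host-device sync is triggered.
bias_correction1 = 1 - beta1 ** step
bias_correction2 = 1 - beta2 ** step
print(bias_correction1, bias_correction2)
```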
self,
params: ParamsT,
lr: float = 1e-3,
wd: float = 0.1,
For consistency, let's call this weight_decay.
"adjust_lr_fn": lambda lr, param_shape: lr, | ||
}, | ||
desc="passing alternative adjust_lr_fn", | ||
), |
let's have configs for every arg in the constructor (weight decay, momentum, etc)
error_type=RuntimeError,
# note other optimizers raise TypeError in the base
# optimizer class. Muon raises the error earlier.
error_regex="Expected params to be named parameters",
let's not force this restriction
Force-pushed from 1c82fc8 to 654f754.
A single-device version of Muon. The algorithm follows the Moonshot implementation.

Usage
This implementation requires users to pass in the model's named_parameters - a list of (name, param) tuples. Users should also specify the fully-qualified names (FQNs) of the parameters to be optimized by Muon. Parameters not included in the FQN list fall back to AdamW optimization. If no FQN list is provided, Muon will by default optimize all 2D parameters, which may not be the expected behavior; a warning is issued in this case.

Additional usage
Users are also able to pass in a self-defined msign function for orthogonalization, and a learning-rate adjustment function (interface defined below). By default, we use 5-step Newton-Schulz with coefficients proposed by Keller, and the LR adjustment proposed by Moonshot, which grafts the learning rate from AdamW.
Testing
- test/test_optim.py: updated the test cases to pass named parameters to the optimizer under test, and introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.
- Trained a model for one epoch on the openwebtext-100k dataset; the resulting loss curve is compared against the Moonshot implementation to confirm behavioral consistency.

Performance
Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP, Muon runs ~20s slower than AdamW; assuming no other changes, Muon is 2.5% slower than AdamW.
AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms

Muon: Optimizer.step() takes ~54 ms, step time ~960 ms

Next Steps
MuP
cc: @toothacher17, @vinaysrao, @jcui2, @haocizhang
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben