
[Scaled MM] Support TN, NT, NN, and TT layouts on B200 #152150

@drisspg

Description


Summary

On SM100 with CUDA 12.8, cuBLAS supports all four layout variants (TN, NT, NN, TT). We should update our per-tensor scaling kernel to allow these layouts.

We can also update our recipes in TorchAO so they no longer require this data transposition. Since the MMA atom supports TN, NT, NN, and TT, we should also update our rowwise scaling kernel to not require this layout.

cc @msaroufim @jerryzh168 @ptrblck @eqy @yanbing-j @vkuzo @albanD @kadeng @penguinwu

Metadata


    Labels

    module: cuda - Related to torch.cuda, and CUDA support in general
    module: floatx (formerly float8) - For torch.float8_e5m2 and torch.float8_e4m3 and other sub 8-bit float types
    module: performance - Issues related to performance, either of kernel code or framework glue
    triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
