GEMM-template Horizontal #151780

Open. sunjiweiswift wants to merge 16 commits into base: main


@sunjiweiswift sunjiweiswift commented Apr 21, 2025

Summary
Currently, the CPP GEMM template uses a vertical transverse strategy for cache blocking and loop traversal, which assumes that matrix B stays in the L1 cache and matrix A in the L2 cache. However, we found that when matrix A is much larger than matrix B, horizontal transverse can give better performance. In this PR:

We implement the horizontal transverse strategy, which is off by default and can be turned on via the inductor config: config.cpp.cpp_gemm_transverse_strategy = "HORIZONTAL"
We also implement a heuristic that chooses between vertical and horizontal transverse when the user sets this config to: config.cpp.cpp_gemm_transverse_strategy = "VERTICAL,HORIZONTAL"
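
To make the idea of two traversal orders concrete, here is a minimal pure-Python sketch of a blocked GEMM that visits the same output blocks in two different orders. It is only an illustration, not the template code from this PR: the function name `blocked_gemm`, the order labels `m_outer` and `n_outer`, and the block sizes are all hypothetical, and the sketch deliberately does not assume which of the two orders the PR calls "vertical" and which "horizontal".

```python
def blocked_gemm(a, b, m_blk, n_blk, order):
    """Compute a (M x K) @ b (K x N) one output block at a time.

    order="m_outer": for each row block of A, sweep all column blocks of B.
    order="n_outer": for each column block of B, sweep all row blocks of A.
    Returns (c, visits), where visits lists the (m0, n0) block origins
    in the order they were processed.
    """
    M, K, N = len(a), len(b), len(b[0])
    c = [[0.0] * N for _ in range(M)]
    m_starts = range(0, M, m_blk)
    n_starts = range(0, N, n_blk)
    if order == "m_outer":
        visits = [(m0, n0) for m0 in m_starts for n0 in n_starts]
    elif order == "n_outer":
        visits = [(m0, n0) for n0 in n_starts for m0 in m_starts]
    else:
        raise ValueError(f"unknown order: {order}")
    for m0, n0 in visits:
        # Each output block is computed the same way regardless of the order;
        # only the sequence of blocks (and hence cache reuse) differs.
        for i in range(m0, min(m0 + m_blk, M)):
            for j in range(n0, min(n0 + n_blk, N)):
                c[i][j] = sum(a[i][k] * b[k][j] for k in range(K))
    return c, visits


# Small 4x3 @ 3x4 example with 2x2 output blocks.
a = [[float(i + k) for k in range(3)] for i in range(4)]
b = [[float(k * j + 1) for j in range(4)] for k in range(3)]
c_m, visits_m = blocked_gemm(a, b, 2, 2, "m_outer")
c_n, visits_n = blocked_gemm(a, b, 2, 2, "n_outer")
```

Both orders produce the same result matrix. What changes is which operand's block is reused across consecutive iterations, which is why the better order can depend on the relative sizes of A and B.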
Test Plan

python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_horizontal_transverse

Performance

When M > 256, a significant improvement is achieved: runtime drops by up to about 76% for some shapes in the results below.

// Performance close to VERTICAL
Lines where the time in the first file is slower:
GEMM(M=1,N=2,K=256) runtime: 0.0092 ms (0.00 TOPS, 0.16 GB/s) vs. 0.0092 ms (0.00 %)
GEMM(M=1,N=2,K=1024) runtime: 0.0098 ms (0.00 TOPS, 0.60 GB/s) vs. 0.0094 ms (4.26 %)
GEMM(M=1,N=50,K=39200) runtime: 0.0831 ms (0.05 TOPS, 45.91 GB/s) vs. 0.0821 ms (1.22 %)
GEMM(M=1,N=768,K=768) runtime: 0.0135 ms (0.09 TOPS, 83.77 GB/s) vs. 0.0135 ms (0.00 %)
GEMM(M=1,N=1024,K=50) runtime: 0.0117 ms (0.01 TOPS, 8.50 GB/s) vs. 0.0114 ms (2.63 %)
GEMM(M=1,N=1024,K=1024) runtime: 0.0144 ms (0.15 TOPS, 139.23 GB/s) vs. 0.0145 ms (-0.69 %)
GEMM(M=4,N=768,K=768) runtime: 0.0133 ms (0.36 TOPS, 85.76 GB/s) vs. 0.0138 ms (-3.62 %)
GEMM(M=4,N=768,K=3072) runtime: 0.0268 ms (0.70 TOPS, 168.91 GB/s) vs. 0.0269 ms (-0.37 %)
GEMM(M=4,N=1000,K=768) runtime: 0.0136 ms (0.45 TOPS, 108.54 GB/s) vs. 0.0135 ms (0.74 %)
GEMM(M=4,N=1000,K=4096) runtime: 0.0527 ms (0.62 TOPS, 149.10 GB/s) vs. 0.0526 ms (0.19 %)
GEMM(M=4,N=3072,K=768) runtime: 0.0228 ms (0.83 TOPS, 198.33 GB/s) vs. 0.0233 ms (-2.15 %)
GEMM(M=4,N=4096,K=4096) runtime: 0.1918 ms (0.70 TOPS, 167.21 GB/s) vs. 0.1926 ms (-0.42 %)
GEMM(M=4,N=4096,K=25088) runtime: 1.1247 ms (0.73 TOPS, 174.47 GB/s) vs. 1.1382 ms (-1.19 %)
GEMM(M=5,N=5,K=64) runtime: 0.0094 ms (0.00 TOPS, 0.13 GB/s) vs. 0.0088 ms (6.82 %)
GEMM(M=8,N=1000,K=512) runtime: 0.0147 ms (0.56 TOPS, 68.16 GB/s) vs. 0.015 ms (-2.00 %)
GEMM(M=8,N=1000,K=2048) runtime: 0.0226 ms (1.45 TOPS, 174.71 GB/s) vs. 0.023 ms (-1.74 %)
GEMM(M=16,N=2,K=768) runtime: 0.0105 ms (0.00 TOPS, 2.51 GB/s) vs. 0.0107 ms (-1.87 %)
GEMM(M=16,N=768,K=768) runtime: 0.0158 ms (1.20 TOPS, 74.20 GB/s) vs. 0.0159 ms (-0.63 %)
GEMM(M=16,N=768,K=3072) runtime: 0.0328 ms (2.30 TOPS, 140.84 GB/s) vs. 0.0326 ms (0.61 %)
GEMM(M=16,N=1000,K=768) runtime: 0.0170 ms (1.44 TOPS, 89.12 GB/s) vs. 0.0172 ms (-1.16 %)
GEMM(M=16,N=1000,K=1280) runtime: 0.0211 ms (1.95 TOPS, 119.25 GB/s) vs. 0.0206 ms (2.43 %)
GEMM(M=16,N=1000,K=4320) runtime: 0.0553 ms (2.50 TOPS, 152.02 GB/s) vs. 0.0561 ms (-1.43 %)
GEMM(M=16,N=3072,K=768) runtime: 0.0250 ms (3.02 TOPS, 184.88 GB/s) vs. 0.0249 ms (0.40 %)
GEMM(M=32,N=512,K=512) runtime: 0.0145 ms (1.16 TOPS, 38.82 GB/s) vs. 0.0145 ms (0.00 %)
GEMM(M=32,N=512,K=768) runtime: 0.0160 ms (1.57 TOPS, 51.63 GB/s) vs. 0.0162 ms (-1.23 %)
GEMM(M=32,N=1000,K=384) runtime: 0.0159 ms (1.55 TOPS, 51.49 GB/s) vs. 0.0152 ms (4.61 %)
GEMM(M=32,N=1000,K=512) runtime: 0.0169 ms (1.94 TOPS, 63.33 GB/s) vs. 0.0167 ms (1.20 %)
GEMM(M=32,N=1000,K=768) runtime: 0.0196 ms (2.51 TOPS, 80.42 GB/s) vs. 0.0188 ms (4.26 %)
GEMM(M=32,N=1000,K=1024) runtime: 0.0221 ms (2.96 TOPS, 93.86 GB/s) vs. 0.0222 ms (-0.45 %)
GEMM(M=32,N=1000,K=1280) runtime: 0.0261 ms (3.14 TOPS, 98.77 GB/s) vs. 0.0253 ms (3.16 %)
GEMM(M=32,N=1000,K=1408) runtime: 0.0279 ms (3.23 TOPS, 101.53 GB/s) vs. 0.0273 ms (2.20 %)
GEMM(M=32,N=1000,K=2048) runtime: 0.0397 ms (3.30 TOPS, 102.98 GB/s) vs. 0.0393 ms (1.02 %)
GEMM(M=32,N=1000,K=2240) runtime: 0.0473 ms (3.03 TOPS, 94.60 GB/s) vs. 0.0477 ms (-0.84 %)
GEMM(M=32,N=1280,K=960) runtime: 0.0196 ms (4.01 TOPS, 126.58 GB/s) vs. 0.0194 ms (1.03 %)
GEMM(M=32,N=32000,K=512) runtime: 1.6477 ms (0.64 TOPS, 20.17 GB/s) vs. 1.6474 ms (0.02 %)
GEMM(M=64,N=10,K=512) runtime: 0.0118 ms (0.06 TOPS, 6.21 GB/s) vs. 0.0115 ms (2.61 %)
GEMM(M=64,N=384,K=384) runtime: 0.0155 ms (1.22 TOPS, 24.18 GB/s) vs. 0.0155 ms (0.00 %)
GEMM(M=64,N=384,K=1152) runtime: 0.0190 ms (2.98 TOPS, 54.31 GB/s) vs. 0.0191 ms (-0.52 %)
GEMM(M=64,N=512,K=256) runtime: 0.0146 ms (1.15 TOPS, 23.51 GB/s) vs. 0.0143 ms (2.10 %)
GEMM(M=64,N=1000,K=384) runtime: 0.0179 ms (2.75 TOPS, 50.38 GB/s) vs. 0.0168 ms (6.55 %)
GEMM(M=64,N=1000,K=512) runtime: 0.0190 ms (3.46 TOPS, 61.27 GB/s) vs. 0.0186 ms (2.15 %)
GEMM(M=64,N=1000,K=640) runtime: 0.0224 ms (3.66 TOPS, 63.41 GB/s) vs. 0.0208 ms (7.69 %)
GEMM(M=64,N=1000,K=768) runtime: 0.0238 ms (4.13 TOPS, 70.57 GB/s) vs. 0.022 ms (8.18 %)
GEMM(M=64,N=1000,K=1024) runtime: 0.0244 ms (5.37 TOPS, 90.11 GB/s) vs. 0.0245 ms (-0.41 %)
GEMM(M=64,N=1000,K=1280) runtime: 0.0281 ms (5.83 TOPS, 96.80 GB/s) vs. 0.0279 ms (0.72 %)
GEMM(M=64,N=1000,K=2048) runtime: 0.0459 ms (5.72 TOPS, 93.31 GB/s) vs. 0.0484 ms (-5.17 %)
GEMM(M=64,N=1152,K=384) runtime: 0.0169 ms (3.35 TOPS, 60.94 GB/s) vs. 0.0163 ms (3.68 %)
GEMM(M=96,N=65,K=512) runtime: 0.0146 ms (0.44 TOPS, 11.61 GB/s) vs. 0.0171 ms (-14.62 %)
GEMM(M=128,N=2,K=4096) runtime: 0.0327 ms (0.06 TOPS, 31.12 GB/s) vs. 0.0326 ms (0.31 %)
GEMM(M=128,N=10,K=64) runtime: 0.0092 ms (0.02 TOPS, 2.09 GB/s) vs. 0.0095 ms (-3.16 %)
GEMM(M=128,N=10,K=184) runtime: 0.0103 ms (0.05 TOPS, 4.92 GB/s) vs. 0.0102 ms (0.98 %)
GEMM(M=128,N=1000,K=256) runtime: 0.0178 ms (3.69 TOPS, 44.75 GB/s) vs. 0.019 ms (-6.32 %)
GEMM(M=128,N=1000,K=384) runtime: 0.0204 ms (4.82 TOPS, 52.43 GB/s) vs. 0.0215 ms (-5.12 %)
GEMM(M=128,N=1000,K=512) runtime: 0.0262 ms (5.00 TOPS, 51.33 GB/s) vs. 0.0249 ms (5.22 %)
GEMM(M=128,N=1000,K=768) runtime: 0.0330 ms (5.96 TOPS, 57.50 GB/s) vs. 0.033 ms (0.00 %)
GEMM(M=128,N=1000,K=1024) runtime: 0.0407 ms (6.44 TOPS, 60.15 GB/s) vs. 0.0419 ms (-2.86 %)
GEMM(M=128,N=1000,K=1280) runtime: 0.0494 ms (6.64 TOPS, 60.71 GB/s) vs. 0.05 ms (-1.20 %)
GEMM(M=128,N=1000,K=1408) runtime: 0.0520 ms (6.93 TOPS, 62.92 GB/s) vs. 0.0534 ms (-2.62 %)
GEMM(M=128,N=1000,K=1536) runtime: 0.0566 ms (6.94 TOPS, 62.67 GB/s) vs. 0.0572 ms (-1.05 %)
GEMM(M=128,N=1000,K=2048) runtime: 0.0812 ms (6.46 TOPS, 57.29 GB/s) vs. 0.0775 ms (4.77 %)
GEMM(M=128,N=1000,K=2304) runtime: 0.1034 ms (5.70 TOPS, 50.29 GB/s) vs. 0.0945 ms (9.42 %)
GEMM(M=128,N=1000,K=2560) runtime: 0.1304 ms (5.03 TOPS, 44.12 GB/s) vs. 0.1176 ms (10.88 %)
GEMM(M=128,N=1000,K=3072) runtime: 0.1939 ms (4.06 TOPS, 35.35 GB/s) vs. 0.1713 ms (13.19 %)
GEMM(M=128,N=1000,K=4096) runtime: 0.3207 ms (3.27 TOPS, 28.24 GB/s) vs. 0.2748 ms (16.70 %)

GEMM(M=128,N=4096,K=4096) runtime: 1.3166 ms (3.26 TOPS, 25.82 GB/s) vs. 3.5636 ms (-63.05 %)
GEMM(M=128,N=4096,K=9216) runtime: 4.7093 ms (2.05 TOPS, 15.98 GB/s) vs. 4.6948 ms (0.31 %)
GEMM(M=128,N=4096,K=16384) runtime: 5.2330 ms (3.28 TOPS, 25.42 GB/s) vs. 5.1444 ms (1.72 %)
GEMM(M=128,N=16384,K=4096) runtime: 2.2769 ms (7.55 TOPS, 58.41 GB/s) vs. 6.9179 ms (-67.09 %)
GEMM(M=128,N=50400,K=4096) runtime: 6.1589 ms (8.58 TOPS, 66.09 GB/s) vs. 6.0398 ms (1.97 %)

GEMM(M=220,N=512,K=512) runtime: 0.0264 ms (4.36 TOPS, 35.17 GB/s) vs. 0.0245 ms (7.76 %)
GEMM(M=220,N=512,K=2048) runtime: 0.0815 ms (5.66 TOPS, 37.70 GB/s) vs. 0.0779 ms (4.62 %)

GEMM(M=220,N=1014,K=512) runtime: 0.0302 ms (7.56 TOPS, 53.97 GB/s) vs. 0.0349 ms (-13.47 %)
GEMM(M=220,N=2048,K=512) runtime: 0.8694 ms (0.53 TOPS, 3.54 GB/s) vs. 2.3897 ms (-63.62 %)
// Performance Improvements
GEMM(M=256,N=2,K=1024) runtime: 0.0145 ms (0.07 TOPS, 34.79 GB/s) vs. 0.014 ms (3.57 %)
GEMM(M=256,N=128,K=128) runtime: 0.0131 ms (0.64 TOPS, 11.94 GB/s) vs. 0.0125 ms (4.80 %)
GEMM(M=256,N=128,K=256) runtime: 0.0155 ms (1.08 TOPS, 16.13 GB/s) vs. 0.0145 ms (6.90 %)
GEMM(M=256,N=256,K=128) runtime: 0.0152 ms (1.10 TOPS, 16.45 GB/s) vs. 0.0155 ms (-1.94 %)
GEMM(M=256,N=256,K=256) runtime: 0.0157 ms (2.14 TOPS, 23.89 GB/s) vs. 0.0169 ms (-7.10 %)
GEMM(M=256,N=512,K=512) runtime: 0.0262 ms (5.12 TOPS, 38.12 GB/s) vs. 0.0246 ms (6.50 %)
GEMM(M=256,N=512,K=1024) runtime: 0.0396 ms (6.78 TOPS, 44.20 GB/s) vs. 0.0391 ms (1.28 %)
GEMM(M=256,N=512,K=197951) runtime: 16.2101 ms (3.20 TOPS, 17.90 GB/s) vs. 16.2295 ms (-0.12 %)
GEMM(M=256,N=768,K=768) runtime: 0.0426 ms (7.09 TOPS, 44.04 GB/s) vs. 0.0419 ms (1.67 %)
GEMM(M=256,N=768,K=3072) runtime: 0.4397 ms (2.75 TOPS, 14.50 GB/s) vs. 0.4131 ms (6.44 %)
GEMM(M=256,N=1000,K=128) runtime: 0.2579 ms (0.25 TOPS, 3.08 GB/s) vs. 0.4302 ms (-40.05 %)
GEMM(M=256,N=1000,K=256) runtime: 0.0307 ms (4.27 TOPS, 35.88 GB/s) vs. 0.0295 ms (4.07 %)
GEMM(M=256,N=1000,K=1024) runtime: 0.0671 ms (7.81 TOPS, 43.81 GB/s) vs. 0.0675 ms (-0.59 %)
GEMM(M=256,N=1000,K=1280) runtime: 0.3644 ms (1.80 TOPS, 9.75 GB/s) vs. 0.5249 ms (-30.58 %)
GEMM(M=256,N=1000,K=1984) runtime: 0.4563 ms (2.23 TOPS, 11.49 GB/s) vs. 1.0955 ms (-58.35 %)
GEMM(M=256,N=1000,K=2048) runtime: 0.3607 ms (2.91 TOPS, 14.96 GB/s) vs. 1.0971 ms (-67.12 %)
GEMM(M=256,N=1024,K=3) runtime: 0.3473 ms (0.00 TOPS, 1.46 GB/s) vs. 1.0332 ms (-66.39 %)
GEMM(M=256,N=1024,K=512) runtime: 0.2878 ms (0.93 TOPS, 6.08 GB/s) vs. 0.4721 ms (-39.04 %)
GEMM(M=256,N=1024,K=1024) runtime: 0.0668 ms (8.03 TOPS, 44.88 GB/s) vs. 0.0646 ms (3.41 %)
GEMM(M=256,N=2304,K=768) runtime: 0.9959 ms (0.91 TOPS, 4.90 GB/s) vs. 2.2502 ms (-55.74 %)
GEMM(M=256,N=3072,K=768) runtime: 1.1671 ms (1.04 TOPS, 5.46 GB/s) vs. 3.0596 ms (-61.85 %)
GEMM(M=256,N=50257,K=768) runtime: 5.7232 ms (3.45 TOPS, 17.22 GB/s) vs. 7.2058 ms (-20.58 %)
GEMM(M=256,N=197951,K=512) runtime: 17.7657 ms (2.92 TOPS, 16.34 GB/s) vs. 18.9209 ms (-6.11 %)

GEMM(M=473,N=2,K=768) runtime: 0.0139 ms (0.10 TOPS, 50.31 GB/s) vs. 0.0147 ms (-5.44 %)
GEMM(M=475,N=768,K=768) runtime: 0.4938 ms (1.13 TOPS, 5.10 GB/s) vs. 0.8638 ms (-42.83 %)
GEMM(M=475,N=768,K=3072) runtime: 0.9406 ms (2.38 TOPS, 8.48 GB/s) vs. 1.5037 ms (-37.45 %)
GEMM(M=475,N=3072,K=768) runtime: 1.8788 ms (1.19 TOPS, 4.25 GB/s) vs. 6.8102 ms (-72.41 %)

GEMM(M=512,N=2,K=1536) runtime: 0.0201 ms (0.16 TOPS, 75.18 GB/s) vs. 0.0201 ms (0.00 %)
GEMM(M=512,N=128,K=768) runtime: 0.0210 ms (4.78 TOPS, 50.50 GB/s) vs. 0.0209 ms (0.48 %)
GEMM(M=512,N=512,K=512) runtime: 0.0385 ms (6.96 TOPS, 38.91 GB/s) vs. 0.041 ms (-6.10 %)
GEMM(M=512,N=512,K=2048) runtime: 0.2664 ms (4.03 TOPS, 16.89 GB/s) vs. 0.2298 ms (15.93 %)

GEMM(M=512,N=768,K=128) runtime: 0.2887 ms (0.35 TOPS, 3.68 GB/s) vs. 0.8198 ms (-64.78 %)
GEMM(M=512,N=768,K=768) runtime: 0.3599 ms (1.68 TOPS, 7.29 GB/s) vs. 0.8904 ms (-59.58 %)
GEMM(M=512,N=768,K=3072) runtime: 0.8159 ms (2.96 TOPS, 10.11 GB/s) vs. 1.7363 ms (-53.01 %)
GEMM(M=512,N=1000,K=1280) runtime: 0.9001 ms (1.46 TOPS, 5.19 GB/s) vs. 2.2128 ms (-59.32 %)
GEMM(M=512,N=1000,K=1984) runtime: 0.8326 ms (2.44 TOPS, 8.05 GB/s) vs. 2.4197 ms (-65.59 %)
GEMM(M=512,N=1024,K=1024) runtime: 0.5257 ms (2.04 TOPS, 7.61 GB/s) vs. 1.7488 ms (-69.94 %)
GEMM(M=512,N=1024,K=3072) runtime: 0.9542 ms (3.38 TOPS, 10.48 GB/s) vs. 1.6973 ms (-43.78 %)
GEMM(M=512,N=1024,K=4096) runtime: 1.2711 ms (3.38 TOPS, 10.23 GB/s) vs. 2.8759 ms (-55.80 %)
GEMM(M=512,N=1536,K=1536) runtime: 1.5235 ms (1.59 TOPS, 4.92 GB/s) vs. 3.4486 ms (-55.82 %)
GEMM(M=512,N=1536,K=6144) runtime: 1.9637 ms (4.92 TOPS, 12.99 GB/s) vs. 2.4115 ms (-18.57 %)
GEMM(M=512,N=2048,K=512) runtime: 1.1441 ms (0.94 TOPS, 3.93 GB/s) vs. 4.0707 ms (-71.89 %)
GEMM(M=512,N=2048,K=2048) runtime: 1.7600 ms (2.44 TOPS, 6.82 GB/s) vs. 5.0442 ms (-65.11 %)
GEMM(M=512,N=2048,K=8192) runtime: 2.2562 ms (7.61 TOPS, 18.62 GB/s) vs. 2.2703 ms (-0.62 %)
GEMM(M=512,N=2560,K=2560) runtime: 1.8007 ms (3.73 TOPS, 9.72 GB/s) vs. 4.1385 ms (-56.49 %)
GEMM(M=512,N=2560,K=10240) runtime: 4.3323 ms (6.20 TOPS, 14.43 GB/s) vs. 4.7134 ms (-8.09 %)
GEMM(M=512,N=3072,K=768) runtime: 1.8491 ms (1.31 TOPS, 4.46 GB/s) vs. 7.6777 ms (-75.92 %)
GEMM(M=512,N=3072,K=1024) runtime: 1.9930 ms (1.62 TOPS, 5.02 GB/s) vs. 7.8267 ms (-74.54 %)
GEMM(M=512,N=4096,K=1024) runtime: 1.8094 ms (2.37 TOPS, 7.18 GB/s) vs. 6.811 ms (-73.43 %)
GEMM(M=512,N=6144,K=1536) runtime: 2.7222 ms (3.55 TOPS, 9.37 GB/s) vs. 7.7562 ms (-64.90 %)
GEMM(M=512,N=8008,K=2560) runtime: 3.0235 ms (6.94 TOPS, 16.35 GB/s) vs. 5.1597 ms (-41.40 %)
GEMM(M=512,N=8192,K=2048) runtime: 3.5080 ms (4.90 TOPS, 11.97 GB/s) vs. 8.9103 ms (-60.63 %)
GEMM(M=512,N=10240,K=2560) runtime: 4.0632 ms (6.61 TOPS, 15.38 GB/s) vs. 4.8217 ms (-15.73 %)
GEMM(M=512,N=30000,K=128) runtime: 5.5240 ms (0.71 TOPS, 6.65 GB/s) vs. 20.1042 ms (-72.52 %)
GEMM(M=512,N=30522,K=768) runtime: 6.7020 ms (3.58 TOPS, 11.23 GB/s) vs. 18.1121 ms (-63.00 %)
GEMM(M=512,N=30522,K=1024) runtime: 6.9931 ms (4.58 TOPS, 12.93 GB/s) vs. 18.3904 ms (-61.97 %)
GEMM(M=512,N=32128,K=512) runtime: 6.4790 ms (2.60 TOPS, 9.76 GB/s) vs. 17.9019 ms (-63.81 %)
GEMM(M=512,N=32128,K=1024) runtime: 7.4393 ms (4.53 TOPS, 12.79 GB/s) vs. 18.5621 ms (-59.92 %)
GEMM(M=512,N=50265,K=768) runtime: 10.5373 ms (3.75 TOPS, 11.72 GB/s) vs. 12.5688 ms (-16.16 %)
GEMM(M=512,N=51200,K=2048) runtime: 13.6468 ms (7.87 TOPS, 18.47 GB/s) vs. 12.3374 ms (10.61 %)
GEMM(M=819,N=768,K=768) runtime: 0.7904 ms (1.22 TOPS, 4.46 GB/s) vs. 2.4961 ms (-68.33 %)
GEMM(M=819,N=50358,K=768) runtime: 16.4101 ms (3.86 TOPS, 9.36 GB/s) vs. 15.2728 ms (7.45 %)
GEMM(M=832,N=768,K=768) runtime: 0.6074 ms (1.62 TOPS, 5.87 GB/s) vs. 2.0416 ms (-70.25 %)
GEMM(M=832,N=768,K=3072) runtime: 1.0136 ms (3.87 TOPS, 10.45 GB/s) vs. 1.8642 ms (-45.63 %)
GEMM(M=832,N=3072,K=768) runtime: 2.1773 ms (1.80 TOPS, 4.87 GB/s) vs. 5.4305 ms (-59.91 %)
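
For readers who want to post-process these lines, the sketch below parses one log line and recomputes two of its columns. It rests on assumptions inferred from the numbers rather than stated in the PR: the percentage appears to be `(t_first - t_second) / t_second`, so a negative value means the first timing is faster, and the TOPS figure appears to be `2*M*N*K` operations divided by the first runtime. The names `LINE_RE` and `parse_line` are hypothetical helpers, not part of any benchmark script in this PR.

```python
import re

# Matches lines such as:
# GEMM(M=128,N=4096,K=4096) runtime: 1.3166 ms (...) vs. 3.5636 ms (-63.05 %)
LINE_RE = re.compile(
    r"GEMM\(M=(\d+),N=(\d+),K=(\d+)\) runtime: ([\d.]+) ms \(.*?\) vs\. ([\d.]+) ms"
)


def parse_line(line):
    """Return a dict of parsed and derived fields, or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    m, n, k = (int(g) for g in match.group(1, 2, 3))
    t_first_ms = float(match.group(4))
    t_second_ms = float(match.group(5))
    # Assumed definition: relative difference of the first timing vs. the second.
    rel_diff_pct = (t_first_ms - t_second_ms) / t_second_ms * 100.0
    # Assumed definition: 2*M*N*K operations over the first runtime, in tera-ops/s.
    tops = 2 * m * n * k / (t_first_ms * 1e-3) / 1e12
    return {
        "M": m, "N": n, "K": k,
        "t_first_ms": t_first_ms,
        "t_second_ms": t_second_ms,
        "rel_diff_pct": rel_diff_pct,
        "tops": tops,
    }


sample = (
    "GEMM(M=128,N=4096,K=4096) runtime: 1.3166 ms "
    "(3.26 TOPS, 25.82 GB/s) vs. 3.5636 ms (-63.05 %)"
)
row = parse_line(sample)
```

Rounded to two decimals, `row["rel_diff_pct"]` and `row["tops"]` match the values printed in the sample line, which supports the assumed formulas for this entry. The GB/s column is not recomputed because the PR does not state its data types or byte accounting.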

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @mlazos

pytorch-bot bot commented Apr 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151780

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit 87277be with merge base 83e2ea8 (image):

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@bdhirsh bdhirsh requested a review from jgong5 April 22, 2025 23:35
@bdhirsh bdhirsh added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Apr 22, 2025
Collaborator

@jgong5 jgong5 left a comment

Some early comments:

  1. Please revise the PR title to be more specific - "Horizontal" sounds too general.
  2. Please add more detail to the PR description - what the PR changes and why.
  3. Please share performance numbers to demonstrate the value of the changes.

@sunjiweiswift
Author

@jgong5 hi jiong, please review~

@CaoE CaoE added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 1, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Aug 1, 2025
@sunjiweiswift sunjiweiswift force-pushed the horizontal branch 2 times, most recently from d8b14db to ddf4460 Compare August 1, 2025 05:59
@mingfeima mingfeima added ciflow/xpu Run XPU CI tasks ciflow/trunk Trigger trunk jobs on your pull request and removed ciflow/xpu Run XPU CI tasks labels Aug 1, 2025
pytorch-bot bot commented Aug 1, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Aug 1, 2025
@sunjiweiswift sunjiweiswift changed the title Horizontal GEMM-template Horizontal Aug 1, 2025
@CaoE CaoE added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 6, 2025
@sunjiweiswift
Author

@jgong5 hi jiong, please review~

Labels
ciflow/trunk Trigger trunk jobs on your pull request module: inductor open source release notes: inductor triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module