GEMM-template Horizontal #151780
Some early comments:
- Please revise the PR title to be more specific; "Horizontal" sounds too general.
- Please add more details to the PR description: why the change is needed and what it does.
- Please share performance numbers to demonstrate the value of the changes.
@jgong5 hi jiong, please review~
ghstack-source-id: 07a179c Pull Request resolved: pytorch#143796
ghstack-source-id: baaead6 Pull Request resolved: pytorch#143897
ghstack-source-id: f3c3ee0 Pull Request resolved: pytorch#143904
ghstack-source-id: 47b76b9 Pull Request resolved: pytorch#140921
ghstack-source-id: c33660b Pull Request resolved: pytorch#140786
@jgong5 hi jiong, please review~
Summary
Currently, the CPP GEMM template uses a vertical transverse strategy for cache blocking and loop traversal, which assumes that Matrix B resides in L1 cache and Matrix A in L2 cache. However, we found that when Matrix A is much larger than Matrix B, the horizontal transverse strategy can give better performance. In this PR:
- We implement the horizontal transverse strategy, which is off by default and can be turned on with the inductor config: config.cpp.cpp_gemm_transverse_strategy = "HORIZONTAL"
- We also implement a heuristic that chooses between the vertical and horizontal transverse strategies when the user sets this config to: config.cpp.cpp_gemm_transverse_strategy = "VERTICAL,HORIZONTAL"
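To make the difference between the two strategies concrete, here is a minimal sketch in plain Python. It is not the template code from this PR. It assumes one plausible reading of the terms: "vertical" keeps a column block of B fixed and walks down the row blocks of A, while "horizontal" keeps a row block of A fixed and walks across the column blocks of B. The helper names `parse_strategies` and `block_order` are hypothetical and exist only for this illustration.

```python
from typing import Iterator, List, Tuple

# Strategy names taken from the config values quoted in the summary.
VALID_STRATEGIES = ("VERTICAL", "HORIZONTAL")


def parse_strategies(value: str) -> List[str]:
    """Split a config value such as "VERTICAL,HORIZONTAL" into strategy names.

    Hypothetical helper: the real config parsing in the PR may differ.
    """
    names = [part.strip().upper() for part in value.split(",") if part.strip()]
    for name in names:
        if name not in VALID_STRATEGIES:
            raise ValueError(f"unknown strategy: {name}")
    return names


def block_order(m_blocks: int, n_blocks: int, strategy: str) -> Iterator[Tuple[int, int]]:
    """Yield (m_block, n_block) output-tile indices in visiting order.

    Assumed interpretation (not confirmed by the PR text):
    - VERTICAL: fix a column block of B, walk down the row blocks of A.
    - HORIZONTAL: fix a row block of A, walk across the column blocks of B.
    """
    if strategy == "VERTICAL":
        for n in range(n_blocks):
            for m in range(m_blocks):
                yield (m, n)
    elif strategy == "HORIZONTAL":
        for m in range(m_blocks):
            for n in range(n_blocks):
                yield (m, n)
    else:
        raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for name in parse_strategies("VERTICAL,HORIZONTAL"):
        print(name, list(block_order(2, 3, name)))
```

Under this reading, the vertical order reuses the same B block across consecutive tiles, and the horizontal order reuses the same A block, which is why the relative sizes of A and B matter for the choice.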
Test Plan
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_horizontal_transverse
Performance
When M > 256, a significant improvement is achieved
// Performance close to VERTICAL
Lines where the time in the first file is slower:
GEMM(M=1,N=2,K=256) runtime: 0.0092 ms (0.00 TOPS, 0.16 GB/s) vs. 0.0092 ms (0.00 %)
GEMM(M=1,N=2,K=1024) runtime: 0.0098 ms (0.00 TOPS, 0.60 GB/s) vs. 0.0094 ms (4.26 %)
GEMM(M=1,N=50,K=39200) runtime: 0.0831 ms (0.05 TOPS, 45.91 GB/s) vs. 0.0821 ms (1.22 %)
GEMM(M=1,N=768,K=768) runtime: 0.0135 ms (0.09 TOPS, 83.77 GB/s) vs. 0.0135 ms (0.00 %)
GEMM(M=1,N=1024,K=50) runtime: 0.0117 ms (0.01 TOPS, 8.50 GB/s) vs. 0.0114 ms (2.63 %)
GEMM(M=1,N=1024,K=1024) runtime: 0.0144 ms (0.15 TOPS, 139.23 GB/s) vs. 0.0145 ms (-0.69 %)
GEMM(M=4,N=768,K=768) runtime: 0.0133 ms (0.36 TOPS, 85.76 GB/s) vs. 0.0138 ms (-3.62 %)
GEMM(M=4,N=768,K=3072) runtime: 0.0268 ms (0.70 TOPS, 168.91 GB/s) vs. 0.0269 ms (-0.37 %)
GEMM(M=4,N=1000,K=768) runtime: 0.0136 ms (0.45 TOPS, 108.54 GB/s) vs. 0.0135 ms (0.74 %)
GEMM(M=4,N=1000,K=4096) runtime: 0.0527 ms (0.62 TOPS, 149.10 GB/s) vs. 0.0526 ms (0.19 %)
GEMM(M=4,N=3072,K=768) runtime: 0.0228 ms (0.83 TOPS, 198.33 GB/s) vs. 0.0233 ms (-2.15 %)
GEMM(M=4,N=4096,K=4096) runtime: 0.1918 ms (0.70 TOPS, 167.21 GB/s) vs. 0.1926 ms (-0.42 %)
GEMM(M=4,N=4096,K=25088) runtime: 1.1247 ms (0.73 TOPS, 174.47 GB/s) vs. 1.1382 ms (-1.19 %)
GEMM(M=5,N=5,K=64) runtime: 0.0094 ms (0.00 TOPS, 0.13 GB/s) vs. 0.0088 ms (6.82 %)
GEMM(M=8,N=1000,K=512) runtime: 0.0147 ms (0.56 TOPS, 68.16 GB/s) vs. 0.015 ms (-2.00 %)
GEMM(M=8,N=1000,K=2048) runtime: 0.0226 ms (1.45 TOPS, 174.71 GB/s) vs. 0.023 ms (-1.74 %)
GEMM(M=16,N=2,K=768) runtime: 0.0105 ms (0.00 TOPS, 2.51 GB/s) vs. 0.0107 ms (-1.87 %)
GEMM(M=16,N=768,K=768) runtime: 0.0158 ms (1.20 TOPS, 74.20 GB/s) vs. 0.0159 ms (-0.63 %)
GEMM(M=16,N=768,K=3072) runtime: 0.0328 ms (2.30 TOPS, 140.84 GB/s) vs. 0.0326 ms (0.61 %)
GEMM(M=16,N=1000,K=768) runtime: 0.0170 ms (1.44 TOPS, 89.12 GB/s) vs. 0.0172 ms (-1.16 %)
GEMM(M=16,N=1000,K=1280) runtime: 0.0211 ms (1.95 TOPS, 119.25 GB/s) vs. 0.0206 ms (2.43 %)
GEMM(M=16,N=1000,K=4320) runtime: 0.0553 ms (2.50 TOPS, 152.02 GB/s) vs. 0.0561 ms (-1.43 %)
GEMM(M=16,N=3072,K=768) runtime: 0.0250 ms (3.02 TOPS, 184.88 GB/s) vs. 0.0249 ms (0.40 %)
GEMM(M=32,N=512,K=512) runtime: 0.0145 ms (1.16 TOPS, 38.82 GB/s) vs. 0.0145 ms (0.00 %)
GEMM(M=32,N=512,K=768) runtime: 0.0160 ms (1.57 TOPS, 51.63 GB/s) vs. 0.0162 ms (-1.23 %)
GEMM(M=32,N=1000,K=384) runtime: 0.0159 ms (1.55 TOPS, 51.49 GB/s) vs. 0.0152 ms (4.61 %)
GEMM(M=32,N=1000,K=512) runtime: 0.0169 ms (1.94 TOPS, 63.33 GB/s) vs. 0.0167 ms (1.20 %)
GEMM(M=32,N=1000,K=768) runtime: 0.0196 ms (2.51 TOPS, 80.42 GB/s) vs. 0.0188 ms (4.26 %)
GEMM(M=32,N=1000,K=1024) runtime: 0.0221 ms (2.96 TOPS, 93.86 GB/s) vs. 0.0222 ms (-0.45 %)
GEMM(M=32,N=1000,K=1280) runtime: 0.0261 ms (3.14 TOPS, 98.77 GB/s) vs. 0.0253 ms (3.16 %)
GEMM(M=32,N=1000,K=1408) runtime: 0.0279 ms (3.23 TOPS, 101.53 GB/s) vs. 0.0273 ms (2.20 %)
GEMM(M=32,N=1000,K=2048) runtime: 0.0397 ms (3.30 TOPS, 102.98 GB/s) vs. 0.0393 ms (1.02 %)
GEMM(M=32,N=1000,K=2240) runtime: 0.0473 ms (3.03 TOPS, 94.60 GB/s) vs. 0.0477 ms (-0.84 %)
GEMM(M=32,N=1280,K=960) runtime: 0.0196 ms (4.01 TOPS, 126.58 GB/s) vs. 0.0194 ms (1.03 %)
GEMM(M=32,N=32000,K=512) runtime: 1.6477 ms (0.64 TOPS, 20.17 GB/s) vs. 1.6474 ms (0.02 %)
GEMM(M=64,N=10,K=512) runtime: 0.0118 ms (0.06 TOPS, 6.21 GB/s) vs. 0.0115 ms (2.61 %)
GEMM(M=64,N=384,K=384) runtime: 0.0155 ms (1.22 TOPS, 24.18 GB/s) vs. 0.0155 ms (0.00 %)
GEMM(M=64,N=384,K=1152) runtime: 0.0190 ms (2.98 TOPS, 54.31 GB/s) vs. 0.0191 ms (-0.52 %)
GEMM(M=64,N=512,K=256) runtime: 0.0146 ms (1.15 TOPS, 23.51 GB/s) vs. 0.0143 ms (2.10 %)
GEMM(M=64,N=1000,K=384) runtime: 0.0179 ms (2.75 TOPS, 50.38 GB/s) vs. 0.0168 ms (6.55 %)
GEMM(M=64,N=1000,K=512) runtime: 0.0190 ms (3.46 TOPS, 61.27 GB/s) vs. 0.0186 ms (2.15 %)
GEMM(M=64,N=1000,K=640) runtime: 0.0224 ms (3.66 TOPS, 63.41 GB/s) vs. 0.0208 ms (7.69 %)
GEMM(M=64,N=1000,K=768) runtime: 0.0238 ms (4.13 TOPS, 70.57 GB/s) vs. 0.022 ms (8.18 %)
GEMM(M=64,N=1000,K=1024) runtime: 0.0244 ms (5.37 TOPS, 90.11 GB/s) vs. 0.0245 ms (-0.41 %)
GEMM(M=64,N=1000,K=1280) runtime: 0.0281 ms (5.83 TOPS, 96.80 GB/s) vs. 0.0279 ms (0.72 %)
GEMM(M=64,N=1000,K=2048) runtime: 0.0459 ms (5.72 TOPS, 93.31 GB/s) vs. 0.0484 ms (-5.17 %)
GEMM(M=64,N=1152,K=384) runtime: 0.0169 ms (3.35 TOPS, 60.94 GB/s) vs. 0.0163 ms (3.68 %)
GEMM(M=96,N=65,K=512) runtime: 0.0146 ms (0.44 TOPS, 11.61 GB/s) vs. 0.0171 ms (-14.62 %)
GEMM(M=128,N=2,K=4096) runtime: 0.0327 ms (0.06 TOPS, 31.12 GB/s) vs. 0.0326 ms (0.31 %)
GEMM(M=128,N=10,K=64) runtime: 0.0092 ms (0.02 TOPS, 2.09 GB/s) vs. 0.0095 ms (-3.16 %)
GEMM(M=128,N=10,K=184) runtime: 0.0103 ms (0.05 TOPS, 4.92 GB/s) vs. 0.0102 ms (0.98 %)
GEMM(M=128,N=1000,K=256) runtime: 0.0178 ms (3.69 TOPS, 44.75 GB/s) vs. 0.019 ms (-6.32 %)
GEMM(M=128,N=1000,K=384) runtime: 0.0204 ms (4.82 TOPS, 52.43 GB/s) vs. 0.0215 ms (-5.12 %)
GEMM(M=128,N=1000,K=512) runtime: 0.0262 ms (5.00 TOPS, 51.33 GB/s) vs. 0.0249 ms (5.22 %)
GEMM(M=128,N=1000,K=768) runtime: 0.0330 ms (5.96 TOPS, 57.50 GB/s) vs. 0.033 ms (0.00 %)
GEMM(M=128,N=1000,K=1024) runtime: 0.0407 ms (6.44 TOPS, 60.15 GB/s) vs. 0.0419 ms (-2.86 %)
GEMM(M=128,N=1000,K=1280) runtime: 0.0494 ms (6.64 TOPS, 60.71 GB/s) vs. 0.05 ms (-1.20 %)
GEMM(M=128,N=1000,K=1408) runtime: 0.0520 ms (6.93 TOPS, 62.92 GB/s) vs. 0.0534 ms (-2.62 %)
GEMM(M=128,N=1000,K=1536) runtime: 0.0566 ms (6.94 TOPS, 62.67 GB/s) vs. 0.0572 ms (-1.05 %)
GEMM(M=128,N=1000,K=2048) runtime: 0.0812 ms (6.46 TOPS, 57.29 GB/s) vs. 0.0775 ms (4.77 %)
GEMM(M=128,N=1000,K=2304) runtime: 0.1034 ms (5.70 TOPS, 50.29 GB/s) vs. 0.0945 ms (9.42 %)
GEMM(M=128,N=1000,K=2560) runtime: 0.1304 ms (5.03 TOPS, 44.12 GB/s) vs. 0.1176 ms (10.88 %)
GEMM(M=128,N=1000,K=3072) runtime: 0.1939 ms (4.06 TOPS, 35.35 GB/s) vs. 0.1713 ms (13.19 %)
GEMM(M=128,N=1000,K=4096) runtime: 0.3207 ms (3.27 TOPS, 28.24 GB/s) vs. 0.2748 ms (16.70 %)
GEMM(M=128,N=4096,K=4096) runtime: 1.3166 ms (3.26 TOPS, 25.82 GB/s) vs. 3.5636 ms (-63.05 %)
GEMM(M=128,N=4096,K=9216) runtime: 4.7093 ms (2.05 TOPS, 15.98 GB/s) vs. 4.6948 ms (0.31 %)
GEMM(M=128,N=4096,K=16384) runtime: 5.2330 ms (3.28 TOPS, 25.42 GB/s) vs. 5.1444 ms (1.72 %)
GEMM(M=128,N=16384,K=4096) runtime: 2.2769 ms (7.55 TOPS, 58.41 GB/s) vs. 6.9179 ms (-67.09 %)
GEMM(M=128,N=50400,K=4096) runtime: 6.1589 ms (8.58 TOPS, 66.09 GB/s) vs. 6.0398 ms (1.97 %)
GEMM(M=220,N=512,K=512) runtime: 0.0264 ms (4.36 TOPS, 35.17 GB/s) vs. 0.0245 ms (7.76 %)
GEMM(M=220,N=512,K=2048) runtime: 0.0815 ms (5.66 TOPS, 37.70 GB/s) vs. 0.0779 ms (4.62 %)
GEMM(M=220,N=1014,K=512) runtime: 0.0302 ms (7.56 TOPS, 53.97 GB/s) vs. 0.0349 ms (-13.47 %)
GEMM(M=220,N=2048,K=512) runtime: 0.8694 ms (0.53 TOPS, 3.54 GB/s) vs. 2.3897 ms (-63.62 %)
// Performance Improvements
GEMM(M=256,N=2,K=1024) runtime: 0.0145 ms (0.07 TOPS, 34.79 GB/s) vs. 0.014 ms (3.57 %)
GEMM(M=256,N=128,K=128) runtime: 0.0131 ms (0.64 TOPS, 11.94 GB/s) vs. 0.0125 ms (4.80 %)
GEMM(M=256,N=128,K=256) runtime: 0.0155 ms (1.08 TOPS, 16.13 GB/s) vs. 0.0145 ms (6.90 %)
GEMM(M=256,N=256,K=128) runtime: 0.0152 ms (1.10 TOPS, 16.45 GB/s) vs. 0.0155 ms (-1.94 %)
GEMM(M=256,N=256,K=256) runtime: 0.0157 ms (2.14 TOPS, 23.89 GB/s) vs. 0.0169 ms (-7.10 %)
GEMM(M=256,N=512,K=512) runtime: 0.0262 ms (5.12 TOPS, 38.12 GB/s) vs. 0.0246 ms (6.50 %)
GEMM(M=256,N=512,K=1024) runtime: 0.0396 ms (6.78 TOPS, 44.20 GB/s) vs. 0.0391 ms (1.28 %)
GEMM(M=256,N=512,K=197951) runtime: 16.2101 ms (3.20 TOPS, 17.90 GB/s) vs. 16.2295 ms (-0.12 %)
GEMM(M=256,N=768,K=768) runtime: 0.0426 ms (7.09 TOPS, 44.04 GB/s) vs. 0.0419 ms (1.67 %)
GEMM(M=256,N=768,K=3072) runtime: 0.4397 ms (2.75 TOPS, 14.50 GB/s) vs. 0.4131 ms (6.44 %)
GEMM(M=256,N=1000,K=128) runtime: 0.2579 ms (0.25 TOPS, 3.08 GB/s) vs. 0.4302 ms (-40.05 %)
GEMM(M=256,N=1000,K=256) runtime: 0.0307 ms (4.27 TOPS, 35.88 GB/s) vs. 0.0295 ms (4.07 %)
GEMM(M=256,N=1000,K=1024) runtime: 0.0671 ms (7.81 TOPS, 43.81 GB/s) vs. 0.0675 ms (-0.59 %)
GEMM(M=256,N=1000,K=1280) runtime: 0.3644 ms (1.80 TOPS, 9.75 GB/s) vs. 0.5249 ms (-30.58 %)
GEMM(M=256,N=1000,K=1984) runtime: 0.4563 ms (2.23 TOPS, 11.49 GB/s) vs. 1.0955 ms (-58.35 %)
GEMM(M=256,N=1000,K=2048) runtime: 0.3607 ms (2.91 TOPS, 14.96 GB/s) vs. 1.0971 ms (-67.12 %)
GEMM(M=256,N=1024,K=3) runtime: 0.3473 ms (0.00 TOPS, 1.46 GB/s) vs. 1.0332 ms (-66.39 %)
GEMM(M=256,N=1024,K=512) runtime: 0.2878 ms (0.93 TOPS, 6.08 GB/s) vs. 0.4721 ms (-39.04 %)
GEMM(M=256,N=1024,K=1024) runtime: 0.0668 ms (8.03 TOPS, 44.88 GB/s) vs. 0.0646 ms (3.41 %)
GEMM(M=256,N=2304,K=768) runtime: 0.9959 ms (0.91 TOPS, 4.90 GB/s) vs. 2.2502 ms (-55.74 %)
GEMM(M=256,N=3072,K=768) runtime: 1.1671 ms (1.04 TOPS, 5.46 GB/s) vs. 3.0596 ms (-61.85 %)
GEMM(M=256,N=50257,K=768) runtime: 5.7232 ms (3.45 TOPS, 17.22 GB/s) vs. 7.2058 ms (-20.58 %)
GEMM(M=256,N=197951,K=512) runtime: 17.7657 ms (2.92 TOPS, 16.34 GB/s) vs. 18.9209 ms (-6.11 %)
GEMM(M=473,N=2,K=768) runtime: 0.0139 ms (0.10 TOPS, 50.31 GB/s) vs. 0.0147 ms (-5.44 %)
GEMM(M=475,N=768,K=768) runtime: 0.4938 ms (1.13 TOPS, 5.10 GB/s) vs. 0.8638 ms (-42.83 %)
GEMM(M=475,N=768,K=3072) runtime: 0.9406 ms (2.38 TOPS, 8.48 GB/s) vs. 1.5037 ms (-37.45 %)
GEMM(M=475,N=3072,K=768) runtime: 1.8788 ms (1.19 TOPS, 4.25 GB/s) vs. 6.8102 ms (-72.41 %)
GEMM(M=512,N=2,K=1536) runtime: 0.0201 ms (0.16 TOPS, 75.18 GB/s) vs. 0.0201 ms (0.00 %)
GEMM(M=512,N=128,K=768) runtime: 0.0210 ms (4.78 TOPS, 50.50 GB/s) vs. 0.0209 ms (0.48 %)
GEMM(M=512,N=512,K=512) runtime: 0.0385 ms (6.96 TOPS, 38.91 GB/s) vs. 0.041 ms (-6.10 %)
GEMM(M=512,N=512,K=2048) runtime: 0.2664 ms (4.03 TOPS, 16.89 GB/s) vs. 0.2298 ms (15.93 %)
GEMM(M=512,N=768,K=128) runtime: 0.2887 ms (0.35 TOPS, 3.68 GB/s) vs. 0.8198 ms (-64.78 %)
GEMM(M=512,N=768,K=768) runtime: 0.3599 ms (1.68 TOPS, 7.29 GB/s) vs. 0.8904 ms (-59.58 %)
GEMM(M=512,N=768,K=3072) runtime: 0.8159 ms (2.96 TOPS, 10.11 GB/s) vs. 1.7363 ms (-53.01 %)
GEMM(M=512,N=1000,K=1280) runtime: 0.9001 ms (1.46 TOPS, 5.19 GB/s) vs. 2.2128 ms (-59.32 %)
GEMM(M=512,N=1000,K=1984) runtime: 0.8326 ms (2.44 TOPS, 8.05 GB/s) vs. 2.4197 ms (-65.59 %)
GEMM(M=512,N=1024,K=1024) runtime: 0.5257 ms (2.04 TOPS, 7.61 GB/s) vs. 1.7488 ms (-69.94 %)
GEMM(M=512,N=1024,K=3072) runtime: 0.9542 ms (3.38 TOPS, 10.48 GB/s) vs. 1.6973 ms (-43.78 %)
GEMM(M=512,N=1024,K=4096) runtime: 1.2711 ms (3.38 TOPS, 10.23 GB/s) vs. 2.8759 ms (-55.80 %)
GEMM(M=512,N=1536,K=1536) runtime: 1.5235 ms (1.59 TOPS, 4.92 GB/s) vs. 3.4486 ms (-55.82 %)
GEMM(M=512,N=1536,K=6144) runtime: 1.9637 ms (4.92 TOPS, 12.99 GB/s) vs. 2.4115 ms (-18.57 %)
GEMM(M=512,N=2048,K=512) runtime: 1.1441 ms (0.94 TOPS, 3.93 GB/s) vs. 4.0707 ms (-71.89 %)
GEMM(M=512,N=2048,K=2048) runtime: 1.7600 ms (2.44 TOPS, 6.82 GB/s) vs. 5.0442 ms (-65.11 %)
GEMM(M=512,N=2048,K=8192) runtime: 2.2562 ms (7.61 TOPS, 18.62 GB/s) vs. 2.2703 ms (-0.62 %)
GEMM(M=512,N=2560,K=2560) runtime: 1.8007 ms (3.73 TOPS, 9.72 GB/s) vs. 4.1385 ms (-56.49 %)
GEMM(M=512,N=2560,K=10240) runtime: 4.3323 ms (6.20 TOPS, 14.43 GB/s) vs. 4.7134 ms (-8.09 %)
GEMM(M=512,N=3072,K=768) runtime: 1.8491 ms (1.31 TOPS, 4.46 GB/s) vs. 7.6777 ms (-75.92 %)
GEMM(M=512,N=3072,K=1024) runtime: 1.9930 ms (1.62 TOPS, 5.02 GB/s) vs. 7.8267 ms (-74.54 %)
GEMM(M=512,N=4096,K=1024) runtime: 1.8094 ms (2.37 TOPS, 7.18 GB/s) vs. 6.811 ms (-73.43 %)
GEMM(M=512,N=6144,K=1536) runtime: 2.7222 ms (3.55 TOPS, 9.37 GB/s) vs. 7.7562 ms (-64.90 %)
GEMM(M=512,N=8008,K=2560) runtime: 3.0235 ms (6.94 TOPS, 16.35 GB/s) vs. 5.1597 ms (-41.40 %)
GEMM(M=512,N=8192,K=2048) runtime: 3.5080 ms (4.90 TOPS, 11.97 GB/s) vs. 8.9103 ms (-60.63 %)
GEMM(M=512,N=10240,K=2560) runtime: 4.0632 ms (6.61 TOPS, 15.38 GB/s) vs. 4.8217 ms (-15.73 %)
GEMM(M=512,N=30000,K=128) runtime: 5.5240 ms (0.71 TOPS, 6.65 GB/s) vs. 20.1042 ms (-72.52 %)
GEMM(M=512,N=30522,K=768) runtime: 6.7020 ms (3.58 TOPS, 11.23 GB/s) vs. 18.1121 ms (-63.00 %)
GEMM(M=512,N=30522,K=1024) runtime: 6.9931 ms (4.58 TOPS, 12.93 GB/s) vs. 18.3904 ms (-61.97 %)
GEMM(M=512,N=32128,K=512) runtime: 6.4790 ms (2.60 TOPS, 9.76 GB/s) vs. 17.9019 ms (-63.81 %)
GEMM(M=512,N=32128,K=1024) runtime: 7.4393 ms (4.53 TOPS, 12.79 GB/s) vs. 18.5621 ms (-59.92 %)
GEMM(M=512,N=50265,K=768) runtime: 10.5373 ms (3.75 TOPS, 11.72 GB/s) vs. 12.5688 ms (-16.16 %)
GEMM(M=512,N=51200,K=2048) runtime: 13.6468 ms (7.87 TOPS, 18.47 GB/s) vs. 12.3374 ms (10.61 %)
GEMM(M=819,N=768,K=768) runtime: 0.7904 ms (1.22 TOPS, 4.46 GB/s) vs. 2.4961 ms (-68.33 %)
GEMM(M=819,N=50358,K=768) runtime: 16.4101 ms (3.86 TOPS, 9.36 GB/s) vs. 15.2728 ms (7.45 %)
GEMM(M=832,N=768,K=768) runtime: 0.6074 ms (1.62 TOPS, 5.87 GB/s) vs. 2.0416 ms (-70.25 %)
GEMM(M=832,N=768,K=3072) runtime: 1.0136 ms (3.87 TOPS, 10.45 GB/s) vs. 1.8642 ms (-45.63 %)
GEMM(M=832,N=3072,K=768) runtime: 2.1773 ms (1.80 TOPS, 4.87 GB/s) vs. 5.4305 ms (-59.91 %)
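For readers checking the numbers above: the percentage at the end of each line appears to be the relative difference of the first runtime against the second, (first - second) / second, so a negative value means the first run was faster. The TOPS figure appears consistent with 2*M*N*K operations divided by the first runtime. The sketch below recomputes both from a log line. The function name `analyze_line` and the regular expression are assumptions made for this illustration and are not part of the benchmark script used in the PR.

```python
import re

# Matches lines of the form shown in the log above.
LINE_RE = re.compile(
    r"GEMM\(M=(\d+),N=(\d+),K=(\d+)\) runtime: ([\d.]+) ms .*? vs\. ([\d.]+) ms"
)


def analyze_line(line: str) -> dict:
    """Recompute the relative difference and TOPS for one benchmark line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    m, n, k = (int(match.group(i)) for i in (1, 2, 3))
    first_ms = float(match.group(4))
    second_ms = float(match.group(5))
    # Relative difference of the first runtime against the second, in percent.
    rel_diff_pct = (first_ms - second_ms) / second_ms * 100.0
    # Assumed definition: 2*M*N*K operations over the first runtime, in tera-ops/s.
    tops = 2 * m * n * k / (first_ms * 1e-3) / 1e12
    return {"shape": (m, n, k), "rel_diff_pct": rel_diff_pct, "tops": tops}


if __name__ == "__main__":
    sample = (
        "GEMM(M=128,N=4096,K=4096) runtime: 1.3166 ms "
        "(3.26 TOPS, 25.82 GB/s) vs. 3.5636 ms (-63.05 %)"
    )
    result = analyze_line(sample)
    print(result["shape"], round(result["rel_diff_pct"], 2), round(result["tops"], 2))
```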
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @mlazos