|
135 | 135 | <div class="headertitle"><div class="title">Performance</div></div>
|
136 | 136 | </div><!--header-->
|
137 | 137 | <div class="contents">
|
138 |
| -<div class="textblock"><p><a class="anchor" id="autotoc_md81"></a></p> |
| 138 | +<div class="textblock"><p><a class="anchor" id="autotoc_md79"></a></p> |
139 | 139 | <p>MFC has been benchmarked on several CPUs and GPU devices. This page is a summary of these results.</p>
|
140 |
| -<h1><a class="anchor" id="autotoc_md82"></a> |
| 140 | +<h1><a class="anchor" id="autotoc_md80"></a> |
141 | 141 | Figure of merit: Grind time performance</h1>
|
142 | 142 | <p>The following table outlines observed performance as nanoseconds per grid point (ns/gp) per equation (eq) per right-hand side (rhs) evaluation (lower is better), also known as the grind time. We solve an example 3D, inviscid, 5-equation model problem with two advected species (8 PDEs) and 8M grid points (158-cubed uniform grid). The numerics are WENO5 finite-volume reconstruction and an HLLC approximate Riemann solver. This case is located in <code>examples/3D_performance_test</code>. On CPUs, you can run it immediately after building MFC via <code>./mfc.sh run -n <num_processors> -j $(nproc) ./examples/3D_performance_test/case.py -t pre_process simulation --case-optimization</code>, which builds a version of the code optimized for this case and then executes it. For benchmarking GPU devices, use <code>-n <num_gpus></code>, where <code><num_gpus></code> is typically <code>1</code>. If the above does not work on your machine, see the rest of this documentation for other ways to use the <code>./mfc.sh run</code> command.</p>
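<p>For concreteness, the sketch below shows how a grind time can be computed from a measured wall-clock time per right-hand-side evaluation. The wall-clock figure is a placeholder chosen only for illustration, not a measured MFC result; the grid-point and equation counts are those of the case above.</p>
<pre class="fragment"># Hedged sketch: convert a measured wall-clock time per right-hand-side (rhs)
# evaluation into a grind time in ns/gp/eq/rhs. The wall time is a placeholder,
# not a measured MFC result.

n_gridpoints = 8_000_000   # nominal 8M grid points of examples/3D_performance_test
n_equations  = 8           # 5-equation model with two advected species (8 PDEs)
wall_time_per_rhs_s = 0.1  # hypothetical: 0.1 s of wall time per rhs evaluation

grind_time_ns = wall_time_per_rhs_s * 1e9 / (n_gridpoints * n_equations)
print(f"grind time: {grind_time_ns:.2f} ns/gp/eq/rhs")  # about 1.56 for this placeholder
</pre>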
|
143 | 143 | <p>Results are for MFC v4.9.3 (July 2024 release), though numbers have not changed meaningfully since then. Similar performance is also seen for other problem configurations, such as the Euler equations (4 PDEs). All results are for the compiler that gave the best performance. Note:</p><ul>
|
@@ -249,34 +249,34 @@ <h1><a class="anchor" id="autotoc_md82"></a>
|
249 | 249 | <td class="markdownTableBodyRight">Fujitsu A64FX </td><td class="markdownTableBodyRight">Arm </td><td class="markdownTableBodyRight">CPU </td><td class="markdownTableBodyRight">48 cores </td><td class="markdownTableBodyRight">63 </td><td class="markdownTableBodyLeft">GNU 13.2.0 </td><td class="markdownTableBodyLeft">SBU Ookami </td></tr>
|
250 | 250 | </table>
|
251 | 251 | <p><b>All grind times are in nanoseconds (ns) per grid point (gp) per equation (eq) per right-hand side (rhs) evaluation, so X ns/gp/eq/rhs. Lower is better.</b></p>
|
252 |
| -<h1><a class="anchor" id="autotoc_md83"></a> |
| 252 | +<h1><a class="anchor" id="autotoc_md81"></a> |
253 | 253 | Weak scaling</h1>
|
254 | 254 | <p>Weak scaling results are obtained by increasing the problem size with the number of processes so that work per process remains constant.</p>
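<p>The efficiencies quoted below can be read as in the following sketch, which assumes efficiency is reported as the ratio of base-case time to scaled-case time at constant work per process; the timings are placeholders, not measured MFC results.</p>
<pre class="fragment"># Hedged sketch of a weak-scaling efficiency calculation: with work per process
# held constant, efficiency is taken here as t_base / t_scaled. The timings are
# placeholders, not measured MFC results.

t_base_s   = 10.0   # hypothetical time per step on the base device count
t_scaled_s = 10.4   # hypothetical time per step on the full device count

efficiency = t_base_s / t_scaled_s
print(f"weak-scaling efficiency: {efficiency:.0%}")  # about 96% for these placeholders
</pre>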
|
255 |
| -<h2><a class="anchor" id="autotoc_md84"></a> |
| 255 | +<h2><a class="anchor" id="autotoc_md82"></a> |
256 | 256 | AMD MI250X GPU</h2>
|
257 | 257 | <p>MFC weak scales to (at least) 65,536 AMD MI250X GPUs on OLCF Frontier with 96% efficiency. This corresponds to 87% of the entire machine.</p>
|
258 | 258 | <p><img src="../res/weakScaling/frontier.svg" alt="" style="height: 50%; width: 50%; border-radius: 10pt; pointer-events: none;" class="inline"/></p>
|
259 |
| -<h2><a class="anchor" id="autotoc_md85"></a> |
| 259 | +<h2><a class="anchor" id="autotoc_md83"></a> |
260 | 260 | NVIDIA V100 GPU</h2>
|
261 | 261 | <p>MFC weak scales to (at least) 13,824 NVIDIA V100 GPUs on OLCF Summit with 97% efficiency. This corresponds to 50% of the entire machine.</p>
|
262 | 262 | <p><img src="../res/weakScaling/summit.svg" alt="" style="height: 50%; width: 50%; border-radius: 10pt; pointer-events: none;" class="inline"/></p>
|
263 |
| -<h2><a class="anchor" id="autotoc_md86"></a> |
| 263 | +<h2><a class="anchor" id="autotoc_md84"></a> |
264 | 264 | IBM Power9 CPU</h2>
|
265 | 265 | <p>MFC weak scales to 13,824 IBM Power9 CPU cores on OLCF Summit, staying within 1% of ideal scaling.</p>
|
266 | 266 | <p><img src="../res/weakScaling/cpuScaling.svg" alt="" style="height: 50%; width: 50%; border-radius: 10pt; pointer-events: none;" class="inline"/></p>
|
267 |
| -<h1><a class="anchor" id="autotoc_md87"></a> |
| 267 | +<h1><a class="anchor" id="autotoc_md85"></a> |
268 | 268 | Strong scaling</h1>
|
269 | 269 | <p>Strong scaling results are obtained by keeping the problem size constant and increasing the number of processes so that work per process decreases.</p>
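<p>Speedup and efficiency for these tests can be interpreted as in the sketch below, which uses the common definitions of strong-scaling speedup and parallel efficiency; the process counts and timings are placeholders, not measured MFC results.</p>
<pre class="fragment"># Hedged sketch of strong-scaling speedup and efficiency relative to a base
# case, using S = t_base / t_n and E = S / (n_procs / n_base).
# The process counts and timings are placeholders, not measured MFC results.

n_base, t_base_s = 8, 20.0   # hypothetical base case: 8 processes
n_procs, t_n_s   = 64, 3.1   # hypothetical scaled case: 64 processes

speedup    = t_base_s / t_n_s
efficiency = speedup / (n_procs / n_base)
print(f"speedup: {speedup:.1f}x, efficiency: {efficiency:.0%}")  # about 6.5x and 81%
</pre>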
|
270 |
| -<h2><a class="anchor" id="autotoc_md88"></a> |
| 270 | +<h2><a class="anchor" id="autotoc_md86"></a> |
271 | 271 | NVIDIA V100 GPU</h2>
|
272 | 272 | <p>These tests use a base case of 8 GPUs with one MPI process per GPU. Performance is analyzed at two problem sizes, 16M and 64M grid points, so the base case has 2M and 8M grid points per process, respectively.</p>
|
273 |
| -<h3><a class="anchor" id="autotoc_md89"></a> |
| 273 | +<h3><a class="anchor" id="autotoc_md87"></a> |
274 | 274 | 16M Grid Points</h3>
|
275 | 275 | <p><img src="../res/strongScaling/strongScaling16.svg" alt="" style="width: 50%; border-radius: 10pt; pointer-events: none;" class="inline"/></p>
|
276 |
| -<h3><a class="anchor" id="autotoc_md90"></a> |
| 276 | +<h3><a class="anchor" id="autotoc_md88"></a> |
277 | 277 | 64M Grid Points</h3>
|
278 | 278 | <p><img src="../res/strongScaling/strongScaling64.svg" alt="" style="width: 50%; border-radius: 10pt; pointer-events: none;" class="inline"/></p>
|
279 |
| -<h2><a class="anchor" id="autotoc_md91"></a> |
| 279 | +<h2><a class="anchor" id="autotoc_md89"></a> |
280 | 280 | IBM Power9 CPU</h2>
|
281 | 281 | <p>CPU strong scaling tests are run at problem sizes of 16M, 32M, and 64M grid points, with the base case using 2M, 4M, and 8M grid points per process, respectively.</p>
|
282 | 282 | <p><img src="../res/strongScaling/cpuStrongScaling.svg" alt="" style="width: 50%; border-radius: 10pt; pointer-events: none;" class="inline"/></p>
|
|