[Diagram: hybrid execution model. The global solution is distributed across nodes N1–N4 (distributed parallel); within each node, work is shared-memory parallel using OpenMP; execution runs on CPU + GPU (GPUs G1–G4).]
CAE Priority for ISV Software Development on GPUs
LSTC / LS-DYNA
SIMULIA / Abaqus/Explicit
Altair / RADIOSS
ESI / PAM-CRASH
ANSYS / ANSYS Mechanical
ANSYS and NVIDIA Collaboration Roadmap
[Table: GPU feature roadmap by release for ANSYS Mechanical, ANSYS Fluent, and ANSYS EM.]
[Chart: ANSYS Mechanical simulation productivity in jobs/day, comparing CPU-only runs (2 or 8 CPU cores) against runs adding a Tesla K20 or Tesla K40 (2 CPU cores + GPU with an HPC license, 7 CPU cores + GPU with an HPC Pack). Values shown: 275, 275, 324, and 363 jobs/day; speedups of 2.1X (K20) and 2.2X (K40).]
Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPU; Tesla K20 GPU and Tesla K40 GPU with boost clocks.
Considerations for ANSYS Mechanical on GPUs
Problems with high solver workloads benefit the most from GPU
Characterized by both high DOF and high factorization requirements
Models with solid elements and more than 500K DOF see good speedups
Abaqus/Standard
SIMULIA and Abaqus GPU Release Progression
Abaqus 6.11, June 2011
Direct sparse solver is accelerated on the GPU
Single GPU support; Fermi GPUs (Tesla 20-series, Quadro 6000)
[Chart: Abaqus/Standard elapsed time in seconds for 8c, 8c + 1g, 8c + 2g, 16c, and 16c + 2g configurations; GPU speedups of 2.11x and 2.42x over the respective CPU-only runs.]
Server with 2x E5-2670, 2.6GHz CPUs, 128GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2
Rolls Royce: Abaqus Speedups on an HPC Cluster
• 4.71M DOF (equations); ~77 TFLOPs
• Nonlinear static (6 steps)
• Direct sparse solver, 100GB memory
Sandy Bridge + Tesla K20X, for 4x servers
[Chart: elapsed time in seconds for 24c, 24c + 4g, 36c, 36c + 6g, 48c, and 48c + 8g runs; GPU speedups of roughly 1.8x to 2.2x over the respective CPU-only runs.]
Servers with 2x E5-2670, 2.6GHz CPUs, 128GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2
Abaqus/Standard ~15% Gain from K20X to K40
2.1x – 4.8x
1.9x – 4.1x
15% av 15% av
1.7x – 2.9x
1.5x – 2.5x
Abaqus 6.13-DEV Scaling on Tesla GPU Cluster
PSG Cluster: Sandy Bridge CPUs with 2x E5-2670 (8-core), 2.6 GHz, 128GB memory, 2x Tesla K20X, Linux RHEL 6.2, QDR IB, CUDA 5
Abaqus Licensing in a node and across a cluster
Cores         Tokens   +1 GPU: Tokens   +2 GPUs: Tokens
1             5        6                7
2             6        7                8
3             7        8                9
4             8        9                10
5             9        10               11
6             10       11               12
7             11       12               12
8 (1 CPU)     12       12               13
9             12       13               13
10            13       13               14
11            13       14               14
12            14       14               15
13            14       15               15
14            15       15               16
15            15       16               16
16 (2 CPUs)   16       16               16

Across a cluster:
2 nodes (2x 16 cores + 2x 2 GPUs): 32 cores = 21 tokens; 32 cores + 4 GPUs = 22 tokens
3 nodes (3x 16 cores + 3x 2 GPUs): 48 cores = 25 tokens; 48 cores + 6 GPUs = 26 tokens
(A sketch of the token formula that reproduces these values follows below.)
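The token counts above are consistent with the commonly cited Abaqus analysis-token rule, tokens = floor(5 * N^0.422). A minimal Python sketch, assuming N counts CPU cores plus GPUs as compute devices (check your own license agreement for the authoritative rule):

```python
import math

def abaqus_tokens(cores: int, gpus: int = 0) -> int:
    """Commonly cited Abaqus analysis-token rule: floor(5 * N**0.422).

    Assumption: N counts CPU cores plus GPUs as compute devices, which is
    what reproduces the table above; consult your license terms for the
    authoritative rule.
    """
    devices = cores + gpus
    return math.floor(5 * devices ** 0.422)

# Spot checks against the table and the cluster examples above:
assert abaqus_tokens(1) == 5
assert abaqus_tokens(8) == 12          # 1 CPU (8 cores)
assert abaqus_tokens(16, 2) == 16      # 1 node: 16 cores + 2 GPUs
assert abaqus_tokens(32, 4) == 22      # 2 nodes: 32 cores + 4 GPUs
assert abaqus_tokens(48, 6) == 26      # 3 nodes: 48 cores + 6 GPUs
```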
Abaqus 6.12 Power consumption in a node
Computational Structural Mechanics
MSC Nastran
MSC Nastran 2013
Nastran direct equation solver is GPU accelerated
Sparse direct factorization (MSCLDL, MSCLU)
Real, complex, symmetric, and unsymmetric matrices
Handles very large fronts with minimal use of pinned host memory
Lowest-granularity GPU implementation of a sparse direct solver; handles sparse matrices of unlimited size
Impacts several solution sequences:
High impact (SOL101, SOL108), Mid (SOL103), Low (SOL111, SOL400)
[Chart: MSC Nastran 2013 speedups with one Tesla K20X versus the CPU-only baseline (1X) for SOL101 (2.4M rows, 42K front) and SOL103 (2.6M rows, 18K front); annotated speedups of 1.9X, 2.7X, and 2.8X.]
Lanczos solver (SOL 103) steps: sparse matrix factorization; iterate on a block of vectors (solve); orthogonalization of vectors. (A sketch of this pattern follows after the hardware note below.)
Server node: Sandy Bridge E5-2670 (2.6GHz), Tesla K20X GPU, 128 GB memory
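To illustrate the factorize / solve / orthogonalize pattern listed for the Lanczos solver above, here is a hedged Python sketch using SciPy's shift-invert Lanczos (ARPACK) on a toy stiffness-like matrix. This is not MSC Nastran's block Lanczos implementation; it only shows where the same three kernels appear.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Toy SPD "stiffness" matrix (1D Laplacian) standing in for K; in a real
# SOL 103 run this would be the shifted operator that gets factorized once.
n = 2000
K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")

# Shift-invert Lanczos: (K - sigma*I) is factorized once (sparse direct
# factorization), then every Lanczos iteration performs a solve with that
# factor plus re-orthogonalization of the vectors, i.e. the same three
# kernels called out in the chart above.
eigenvalues, _ = eigsh(K, k=6, sigma=0.0, which="LM")
print(np.sort(eigenvalues))   # smallest eigenvalues of K
```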
MSC Nastran 2013
Coupled Structural-Acoustics simulation with SOL108
[Chart: elapsed time for serial, 1c + 1g, 4c (smp), 4c + 1g, 8c (dmp=2), and 8c + 2g (dmp=2) runs; up to 2.7X speedup with GPUs.]
Server node: Sandy Bridge 2.6GHz, 2x 8 core, Tesla 2x K20X GPU, 128GB memory
Computational Structural Mechanics
MSC MARC
MARC 2013
Computational Fluid Dynamics
ANSYS Fluent
ANSYS and NVIDIA Collaboration Roadmap
[Table: GPU feature roadmap by release for ANSYS Mechanical, ANSYS Fluent, and ANSYS EM.]
Cluster specification:
nprocs = Total number of fluent processes
M = Number of machines
ngpgpus = Number of GPUs per machine
Requirement 1
nprocs mod M = 0
Same number of solver processes on each machine
Requirement 2
(nprocs / M) mod ngpgpus = 0
The number of solver processes per machine must be an integer multiple of the number of GPUs per machine. (A sketch of a layout check follows below.)
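A minimal Python sketch of these two rules; the helper name and the example layouts are illustrative and not part of ANSYS Fluent:

```python
def check_fluent_gpu_layout(nprocs: int, machines: int, gpus_per_machine: int):
    """Check the two layout requirements above for a GPU-accelerated run.

    Requirement 1: nprocs mod M == 0
    Requirement 2: (nprocs / M) mod ngpgpus == 0
    Assumes gpus_per_machine >= 1.
    """
    if nprocs % machines != 0:
        return False, "nprocs must be an integer multiple of the number of machines"
    per_machine = nprocs // machines
    if per_machine % gpus_per_machine != 0:
        return False, ("solver processes per machine must be an integer "
                       "multiple of the GPUs per machine")
    return True, f"{per_machine} solver processes and {gpus_per_machine} GPUs per machine"

# Examples: a valid single-node layout and a multi-node one that breaks rule 2.
print(check_fluent_gpu_layout(nprocs=16, machines=1, gpus_per_machine=4))  # OK
print(check_fluent_gpu_layout(nprocs=30, machines=2, gpus_per_machine=4))  # fails
```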
Cluster Specification Examples
Single-node configurations:
[Diagram: example single-node mappings of MPI solver processes to GPUs (e.g. 16 MPI processes, 8 MPI processes, or 5 MPI processes per GPU) that satisfy the rules above.]
Multi-node configurations:
Note: The problem must fit in the GPU memory for the solution to proceed
Considerations for ANSYS Fluent on GPUs
GPUs accelerate the AMG solver of the CFD analysis
Fine meshes and low-dissipation problems have high %AMG
Coupled solution scheme spends 65% on average in AMG
In many cases, pressure-based coupled solvers offer faster convergence
compared to segregated solvers (problem-dependent)
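The AMG solver referred to above is part of Fluent itself; purely to illustrate the kind of work it does, here is a minimal Python sketch of a two-grid V-cycle on a 1D Poisson matrix (geometric coarsening stands in for the algebraic coarsening a real AMG performs).

```python
import numpy as np

def poisson_1d(n):
    """1D Poisson matrix with Dirichlet boundaries (n interior points)."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def weighted_jacobi(A, x, b, sweeps, omega=2.0 / 3.0):
    d = np.diag(A)
    for _ in range(sweeps):
        x = x + omega * (b - A @ x) / d
    return x

def interpolation(n_fine):
    """Linear interpolation from the coarse grid to the fine grid (n_fine odd)."""
    n_coarse = (n_fine - 1) // 2
    P = np.zeros((n_fine, n_coarse))
    for j in range(n_coarse):
        i = 2 * j + 1              # fine index of coarse point j
        P[i, j] = 1.0
        P[i - 1, j] = 0.5
        P[i + 1, j] = 0.5
    return P

def two_grid_vcycle(A, b, x, P):
    x = weighted_jacobi(A, x, b, sweeps=3)           # pre-smoothing
    r_coarse = P.T @ (b - A @ x)                     # restrict the residual
    A_coarse = P.T @ A @ P                           # Galerkin coarse operator
    e_coarse = np.linalg.solve(A_coarse, r_coarse)   # coarse-grid solve
    x = x + P @ e_coarse                             # prolong the correction
    return weighted_jacobi(A, x, b, sweeps=3)        # post-smoothing

n = 63
A, b = poisson_1d(n), np.ones(n)
x, P = np.zeros(n), interpolation(n)
for cycle in range(10):
    x = two_grid_vcycle(A, b, x, P)
    print(cycle, np.linalg.norm(b - A @ x))          # residual shrinks each cycle
```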
[Decision flow: pressure-based segregated solver in use? Is it a steady-state analysis? If yes, consider switching to the pressure-based coupled solver: the best fit for GPUs, giving better performance (faster convergence) and further speedups with GPUs.]
ANSYS Fluent GPU Performance for Large Cases
Better speed-ups on larger and harder-to-solve problems
[Chart: Truck Body Model, ANSYS Fluent time in seconds for CPU-only versus CPU + GPU runs at 36 CPU cores (+ 12 GPUs) and 144 CPU cores (+ 48 GPUs); roughly 2X speedup. Times cover the AMG solver time for 20 time steps and the overall solution time; reported values include 6391 and 775 seconds.]
ANSYS Fluent GPU Study on Productivity Gains
ANSYS Fluent 15.0 Preview 3 Performance – Results by NVIDIA, Sep 2013
Case: 14M mixed cells, steady, k-epsilon turbulence, coupled PBNS, double precision; CPU runs use the AMG F-cycle, GPU runs use FGMRES with an AMG preconditioner. Total solution times compared for 64 CPU cores versus 32 CPU cores + 8 GPUs.
• Same solution times
• Frees up 32 CPU cores and HPC licenses for additional job(s)
• Approximately 56% increase in overall productivity
OpenFOAM
NVIDIA GPU Strategy for OpenFOAM
Provide technical support for GPU solver developments
FluiDyna (implementation of NVIDIA’s AMG), Vratis, and PARALUTION
AMG development by ISP of the Russian Academy of Sciences (A. Monakov)
Cufflink development by WUSTL, now Engys North America (D. Combest)
GPU Acceleration of CFD in Industrial Applications using Culises and aeroFluidX GTC 2014
Summary: hybrid approach (GPU linear solver coupled to a CPU simulation tool of choice, e.g. OpenFOAM®)
Advantages:
• Universally applicable (couples to the simulation tool of choice)
• Full availability of existing flow models
• Easy/no validation needed
• Unsteady approaches suit the hybrid scheme better due to large linear solver times
Disadvantages:
• Hybrid CPU-GPU execution produces overhead
• If the solution of the linear system is not dominant, application speedup can be limited
aeroFluidX
An extension of the hybrid approach
• Porting the discretization of the equations to the GPU: aeroFluidX is a GPU implementation of the Finite Volume discretization module, running on the GPU alongside the CPU flow solver (e.g. OpenFOAM®)
• Possibility of direct coupling to Culises: zero overhead from CPU-GPU-CPU memory transfer and matrix format conversion; solving the momentum equations on the GPU is also beneficial
• OpenFOAM® environment supported: enables a plug-in solution for OpenFOAM® customers, but communication with other input/output file formats is possible
[Diagram: CPU flow solver pipeline (preprocessing, FV discretization module, linear solver, postprocessing) with the aeroFluidX FV module and the Culises linear solver running on the GPU.]
aeroFluidX
Cavity flow benchmark
• CPU: Intel E5-2650 (all 8 cores); GPU: NVIDIA K40
• 4M grid cells (unstructured)
• Running 100 SIMPLE steps with:
  – OpenFOAM® (OF): pressure GAMG, velocity Gauss-Seidel
  – OpenFOAM® + Culises (OFC): pressure Culises AMGPCG (2.4x), velocity Gauss-Seidel
  – aeroFluidX + Culises (AFXC): pressure Culises AMGPCG, velocity Culises Jacobi
• Total speedup: OF 1x, OFC 1.62x, AFXC 2.20x
[Chart: normalized computing time for OpenFOAM, OpenFOAM + Culises, and aeroFluidX + Culises, split into "all assembly" (assembly of all linear systems, pressure and velocity) and "all linear solve" (solution of all linear systems, pressure and velocity); annotated speedups of 2.1x and 2.22x for the linear solve and 1.96x for assembly.]
PARALUTION
C++ library providing a range of sparse iterative solvers and preconditioners. (A minimal sketch of such a solver follows below.)
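As a rough illustration of what solver libraries such as PARALUTION, Culises, or Cufflink provide, here is a minimal Python sketch of a preconditioned conjugate gradient with a simple Jacobi preconditioner; the real libraries offer much stronger preconditioners (e.g. AMG) and run the kernels on the GPU.

```python
import numpy as np
import scipy.sparse as sp

def pcg(A, b, apply_precond, tol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradient for a sparse SPD matrix A."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_precond(r)
    p = z.copy()
    rz = r @ z
    for it in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, it
        z = apply_precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Toy SPD system: 1D Laplacian with a Jacobi (diagonal) preconditioner.
n = 200
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
inv_diag = 1.0 / A.diagonal()
x, iters = pcg(A, b, lambda r: inv_diag * r)
print(iters, np.linalg.norm(b - A @ x))   # iteration count and final residual
```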
CST STUDIO SUITE is an integrated solution for 3D EM simulations; it includes a parametric modeler, more than 20 solvers, and integrated post-processing. Currently, three solvers support GPU computing.
- The Quadro K6000/Tesla K40 card is about 30–35% faster than the K20 card.
- 12 GB of onboard RAM allows for larger model sizes.
[Chart: speedup versus number of GPUs (1 to 4 Tesla K40s) relative to the CPU-only run, for CST STUDIO SUITE 2013 and CST STUDIO SUITE 2014; speedup grows with the number of GPUs.]
Benchmark performed on system equipped with dual Xeon E5-2630 v2 (Ivy Bridge EP) processors, and four Tesla K40 cards. Model has 80 million mesh cells.
[Chart: speedup versus number of cluster nodes (1 to 4) for the GPU hardware under weak scaling.]
Benchmark model features: open boundaries; dispersive and lossy material.
Note: a GPU-accelerated cluster system requires a high-speed network in order to perform well!
Base model size is 80 million cells. Problem size is scaled up linearly with the number of cluster nodes (i.e., weak scaling). Hardware: dual Xeon E5-2650 processors, 128GB RAM per node (1600MHz), InfiniBand QDR interconnect (40Gb/s).
Axel Koehler
akoehler@nvidia.com
NVIDIA, the NVIDIA logo, GeForce, Quadro, Tegra, Tesla, GeForce Experience, GRID, GTX, Kepler, ShadowPlay, GameStream, SHIELD, and The Way It’s Meant To Be Played are trademarks and/or
registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.