AMD's CDNA 3 Compute Architecture - Chips and Cheese
AMD’s CDNA 3 Compute Architecture
December 17, 2023 · clamchowder, Cheese
AMD has a long history of vying for GPU compute market share. Ever since Nvidia got first dibs with their Tesla architecture, AMD has been playing catch up. Terascale 3 moved from VLIW5 to VLIW4 to improve execution unit utilization in compute workloads. GCN replaced Terascale and emphasized consistent performance for both GPGPU and graphics applications. Then, AMD diverged their GPU lineup, splitting it into the graphics-focused RDNA line and the compute-focused CDNA line.
https://chipsandcheese.com/2023/12/17/amds-cdna-3-compute-architecture/
CDNA 2 finally brought AMD notable success. MI250X and MI210 GPUs won several supercomputer contracts including ORNL’s Frontier, which holds first place on November 2023’s TOP500 list. But while CDNA 2 delivered solid and cost efficient FP64 compute, H100 had better AI performance and offered a larger unified GPU.

CDNA 3 looks to close those gaps by bringing forward everything AMD has to offer. The company’s experience in advanced packaging technology is on full show, with MI300X getting a sophisticated chiplet setup. Together with Infinity Fabric components, advanced packaging lets MI300X scale to compete with Nvidia’s largest GPUs. On the memory side, Infinity Cache from the RDNA line gets pulled into the CDNA world to mitigate bandwidth issues. But that doesn’t mean MI300X is light on memory bandwidth. It still gets a massive HBM setup, giving it the best of both worlds. Finally, CDNA 3’s compute architecture gets significant generational improvements to boost throughput and utilization.

AMD has a tradition of using chiplets to cheaply scale core counts in their Ryzen and Epyc CPUs. MI300X uses a similar strategy at a high level, with compute split off onto Accelerator Complex Dies, or XCDs. XCDs are analogous to CDNA 2 or RDNA 3’s Graphics Compute Dies (GCDs), or Ryzen’s Core Complex Dies (CCDs). AMD likely changed the naming because CDNA products lack the dedicated graphics hardware present in the RDNA line.
Cross-XCD Coherency
Even though the Infinity Cache doesn’t have to worry about
coherency, the L2 caches do. Ordinary GPU memory accesses
follow a relaxed coherency model, but programmers can use
atomics to enforce ordering between threads. Memory
accesses on AMD GPUs can also be marked with a GLC bit
(Global Level Coherent). Those mechanisms still have to work if AMD wants to expose MI300X as a single big GPU, rather than as a multi-GPU configuration the way MI250X did.
From AMD’s Zen PPR, showing error reporting available at the Coherent
Slave (CS).
However, CDNA 3 within the XCD still works a lot like prior GPUs. Evidently, normal memory writes will not automatically invalidate the written lines in peer caches the way they would on CPUs. Instead, code must explicitly tell the L2 to write back dirty lines, and have peer L2 caches invalidate non-local lines.
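To make that software-managed scheme concrete, here’s a toy Python model of two L2 caches in front of a shared backing store. The class and method names are purely illustrative, not AMD’s actual hardware interface; the point is that a write on one XCD leaves a stale copy in the peer L2 until software forces a write-back and invalidation at a synchronization point.

```python
# Toy model of software-managed cross-XCD coherency: writes do not snoop
# peer caches, so stale copies persist until an explicit writeback/invalidate.
class L2Cache:
    def __init__(self, backing):
        self.backing = backing   # shared backing store (Infinity Cache / HBM)
        self.lines = {}          # addr -> (value, dirty)

    def read(self, addr):
        if addr not in self.lines:
            # Miss: fill from the backing store
            self.lines[addr] = (self.backing.get(addr, 0), False)
        return self.lines[addr][0]

    def write(self, addr, value):
        # Note: no snoop of peer L2 caches happens here, unlike on a CPU
        self.lines[addr] = (value, True)

    def writeback_dirty(self):
        # Flush dirty lines so peers can observe them after invalidating
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.backing[addr] = value
                self.lines[addr] = (value, False)

    def invalidate(self):
        # Drop all cached lines, stale or not
        self.lines.clear()

backing = {}
xcd0, xcd1 = L2Cache(backing), L2Cache(backing)

xcd1.read(0x100)          # XCD1 caches the old value (0)
xcd0.write(0x100, 42)     # XCD0 writes; XCD1's copy is now stale
stale = xcd1.read(0x100)  # still 0: no automatic invalidation

# Software-managed synchronization point:
xcd0.writeback_dirty()    # producer flushes dirty lines
xcd1.invalidate()         # consumer drops potentially stale lines
fresh = xcd1.read(0x100)  # now observes 42
```

The model deliberately omits atomics and the GLC bit; it only shows why ordinary writes alone are not enough for cross-XCD communication.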
L2 Cache
Closer to the Compute Units, each MI300X XCD packs a 4 MB
L2 cache. The L2 is a more traditional GPU cache, and is built
from 16 slices. Each 256 KB slice can provide 128 bytes per
cycle of bandwidth. At 2.1 GHz, that’s good for 4.3 TB/s. As
the last level of cache on the same die as the Compute Units,
the L2 plays an important role in acting as a backstop for L1
misses.
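The quoted figures are easy to sanity-check with a little arithmetic, using the slice count, slice size, per-slice bandwidth, and clock mentioned above:

```python
# Per-XCD L2: 16 slices x 256 KB, each providing 128 B/cycle at 2.1 GHz
slices = 16
slice_size_kb = 256
bytes_per_slice_per_cycle = 128
clock_ghz = 2.1

capacity_mb = slices * slice_size_kb / 1024
bandwidth_tb_s = slices * bytes_per_slice_per_cycle * clock_ghz * 1e9 / 1e12

print(capacity_mb)      # 4.0 MB per XCD
print(bandwidth_tb_s)   # ~4.3 TB/s per XCD
```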
L1 Cache
CDNA 3’s focus on high cache bandwidth continues to the L1.
In a move that matches RDNA, CDNA 3 sees its L1 throughput
increased from 64 to 128 bytes per cycle. CDNA 2 increased
per-CU vector throughput to 4096 bits per cycle compared to
2048 in GCN, so CDNA 3’s doubled L1 throughput helps
maintain the same compute to L1 bandwidth ratio as GCN.
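Put in numbers, the ratio argument works out like this (per-CU vector throughput in bits per cycle against L1 bytes per cycle):

```python
# Compute-to-L1 bandwidth ratio, per CU per cycle
gcn   = {"vector_bits": 2048, "l1_bytes": 64}   # GCN
cdna3 = {"vector_bits": 4096, "l1_bytes": 128}  # CDNA 2/3 vector, CDNA 3 L1

def compute_to_l1_ratio(gpu):
    vector_bytes = gpu["vector_bits"] / 8  # convert bits to bytes
    return vector_bytes / gpu["l1_bytes"]

print(compute_to_l1_ratio(gcn))    # 4.0 vector bytes per L1 byte
print(compute_to_l1_ratio(cdna3))  # 4.0 - doubling L1 restores GCN's ratio
```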
Matrix Operations
Instruction Cache
Besides handling memory accesses requested by
instructions, a Compute Unit has to fetch the instructions
themselves from memory. GPUs traditionally had an easier
time with instruction delivery because GPU code tends to be
simple and not occupy a lot of memory. In the DirectX 9 era,
Shader Model 3.0 even imposed limits on code size. As GPUs
evolved to take on compute, AMD rolled out their GCN
architecture with 32 KB instruction caches. Today, CDNA 2
and RDNA GPUs continue to use 32 KB instruction caches.
Final Words
CDNA 3’s whitepaper says that “the greatest generational
changes in the AMD CDNA 3 architecture lie in the memory
hierarchy” and I would have to agree. While AMD improved
the Compute Unit’s low precision math capabilities
compared to CDNA 2, the real improvement was the addition
of the Infinity Cache.
For this image, I am considering North-South as the vertical axis and East-
West as the horizontal axis
4.0 TB/s of total ingress bandwidth for one IO die may not seem like enough if both XCDs need all available memory bandwidth. However, both XCDs combined can only pull up to 4.2 TB/s from the IO die, so realistically the 4.0 TB/s of ingress bandwidth is a non-issue. What the 4.0 TB/s ingress cap does mean is that a single IO die can’t take advantage of all 5.3 TB/s of memory bandwidth.
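Laying the three figures above side by side makes the comparison easy to eyeball (numbers as quoted in the text; the breakdown is mine):

```python
# Bandwidth figures for one MI300X IO die, in TB/s
ingress_tb_s  = 4.0  # max link ingress into one IO die (N-S plus E-W)
xcd_pull_tb_s = 4.2  # max the die's two XCDs can read from the IO die
hbm_total     = 5.3  # MI300X's aggregate HBM bandwidth

# Effective ceiling on remote traffic is the smaller of the two limits
remote_ceiling = min(ingress_tb_s, xcd_pull_tb_s)
print(remote_ceiling)              # 4.0 TB/s
print(ingress_tb_s / hbm_total)    # ~0.755: one die can't ingest all of HBM
```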
But MI300 isn’t just a GPGPU part; it also comes in an APU variant, which is, in my opinion, the more interesting of the two MI300 products. AMD’s first ever APU, Llano, was released in 2011 and paired AMD’s K10.5 CPU cores with a Terascale 3 GPU. Fast forward to 2023, and for their first “big iron” APU, the MI300A, AMD paired six of their CDNA 3 XCDs with 24 Zen 4 cores, all while reusing the same base die. This allows the CPU and GPU to share the same memory address space, which removes the need to copy data over an external bus to keep the CPU and GPU coherent with each other.
References
1. CDNA 3 Whitepaper
2. CDNA 2 Whitepaper
3. CDNA Whitepaper
4. Volta Whitepaper
5. Nvidia A100 Whitepaper
6. Nvidia H100 Whitepaper
7. Intel Data Center GPU Max Series Technical Overview
Authors
clamchowder
Cheese