Aphcore SimonKnowles v04
Aphcore SimonKnowles v04
Aphcore SimonKnowles v04
• We put more features into our “MVP” than we managed to light up with software during its
lifetime, eg. sparse tensor arithmetic.
• Not unexpectedly, we spent a lot of time tuning code and workspace to fit 256kB tile memory.
• Efficiently mapping big models across many chips requires computer expertise – most AI
programmers need rich automation.
• Whole-graph compilation was initially simplest, but inevitably slow as models grew.
• Bulk synchrony makes it harder to tune out Vdd margin for supply transients, to minimize
power consumption. Nevertheless, Mk1 demonstrated good power efficiency.
• PCIe cards severely constrain power density and chip cluster connectivity.
IPU-POD512™
POPLAR®
Uncore
Exchange
Tile
Memory Exchange
Tile
Logic
• ld/st instructions support sparse gather in parallel with arithmetic at full speed,
via compact pointer lists:
f16 • f16
products
memory memory
cast?
Instructions:
• Generate a vector of random numbers, uniform or Gaussian.
• Randomly puncture a vector with specified probability.
• Stochastically round down-casts at full speed – vital for fast
and easy training of f16 models.
Sync Compute
cycles
1472
tiles
nop
Accumulate loop
Multiply pJ/flop
Datapath Memory
f32 f16 1.3 memory
f16 float
f32 f32 1.75 + datapath
transport
f32 f32 f32 3.3
Applications typically sustain max ~50% of burst Tflop/s on all platforms. Vendors choose
TDP to envelope such applications at full speed; a power virus will slow the clock.
1.5x net efficiency advantage of IPU over GPU implies ~3x transport energy advantage.
10
8Gb die:
(to scale)
• 40GB HBM ~triples the cost of a packaged
HBM2 DDR4 DDR4
reticle-sized processor.
20nm 20nm 18nm
• DDR-based systems like IPU can spend
8H KGD ECC-RDIMM the saved $ on more processors.
50% memory 50% memory
vendor margin vendor margin
> $2x $1x
CoWoS
Total silicon required for an 8cm2 processor
> $4x with 32GB DRAM:
60% processor HBM2: 53cm2
vendor margin
DDR4: 23cm2
> $10x $1x $/GB in system
pipeline
Processors:
DRAM:
W weights
in DRAM 10
IPU zone
1
1 10 100 1000
Compute rate (flop/s)
DRAM bandwidth (Bytes/s) = 4F • SRAM Capacity MiB
SRAM Capacity (Bytes)
Simple mechanisms allow rapid software evolution: SDK tuning over last 7 months
(relative application performance)
• Native graph abstraction.
2
Jul-21
• Codelet-level parallelism.
1.5
• Pipeline-oblivious threads.
May-21
• BSP eliminates concurrency hazards. 1
Dec-20
• Stateless all-to-all Exchange.
0.5
• Cacheless, uniform, near/far memory.
0
BERT.L RN50 EN.B4
All trademarks used in this presentation are the property of their respective owners.
Information presented is believed to be accurate at the time of presentation; however, subsequent events may impact their
accuracy. Graphcore undertakes no responsibility to update any information. AI is a rapidly evolving global technology and as such, all
information presented herein is subject to change without notice.