HC2021.C1.4 Intel Arijit


Sapphire Rapids

Arijit Biswas
Intel Senior Principal Engineer
New Standard for Data Center Architecture

Next-Gen Intel Xeon Scalable Processor

Designed for Microservices & AI Workloads

Pioneering Advanced Memory & IO Transitions
Node Performance

Scalar Performance
- New Performance Core microarchitecture
- Increased core counts

Data Parallel Performance
- Multiple integrated Acceleration Engines

Cache & Memory Sub-System Architecture
- Larger private & shared caches
- DDR5
- Next Gen Optane support
- PCIe 5.0

Intra/Inter Socket Scaling
- Modular SoC w/ modular die fabric
- Wider & faster UPI
- Embedded Multi-die Interconnect Bridge (EMIB)

Data Center Performance

Consolidation & Orchestration
- Next gen Quality of Service capabilities
- Fast VM migration
- Better telemetry
- IO virtualization

Performance Consistency
- Low jitter architecture
- Consistent caching & memory latency
- Inter-processor interrupt virtualization

Elasticity & Efficient Data Center Utilization
- Broad workload/usage support and optimizations
- Next gen Optane support
- Improved security & RAS

Infrastructure & Framework Overhead
- Integrated workload accelerators
- CXL 1.1


From Single Monolithic Die to Multi-Tile Design for Increased Scalability

Delivers a scalable, balanced architecture that leverages existing software paradigms for monolithic CPUs via a modular architecture.

Sapphire Rapids: Multiple Tiles, Single CPU
- Every thread has full access to all resources on all tiles: cache, memory, IO...
- Provides consistent low latency and high cross-section bandwidth across the entire SoC
Sapphire Rapids
Key Building Blocks

Seamless integration of key building blocks:

- Compute IP: Cores, Acceleration Engines
- I/O IP: PCIe Gen 5, CXL 1.1, UPI 2.0
- Memory IP: DDR5, Optane, HBM
Performance Core
Built for Data Center

- Major microarchitecture and IPC improvements
- Improved support for large code/data footprints
- Consistent performance for multi-tenant usages
- Autonomous, fast power management for high frequency at low jitter

Performance Core
Architecture Improvements for DC Workloads & Usages

AI
- Intel® Advanced Matrix Extensions (AMX): tiled matrix operations for inference & training acceleration (see the sketch below)

Attached Device
- Accelerator interfacing Architecture (AiA): efficient dispatch, signaling & synchronization from user level

Half-Precision Float New Instructions (HFNI)
- Support for FP16: higher throughput at lower precision

Cache Management
- CLDEMOTE: proactive placement of cache contents
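To make the AMX programming model concrete, here is a minimal sketch of one int8 tile multiply using the public AMX intrinsics. It is illustrative, not Intel's reference code: the tile-config layout and the Linux permission request follow the documented ISA/kernel interface, but the buffer shapes and values are arbitrary, and B is assumed to be pre-packed in the VNNI layout TMUL expects. Build with gcc -O2 -mamx-tile -mamx-int8.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_REQ_XCOMP_PERM 0x1023   /* Linux arch_prctl: request extended state */
#define XFEATURE_XTILEDATA  18       /* AMX tile data component */

/* 64-byte tile configuration, palette 1: three 16-row x 64-byte tiles. */
struct tileconfig {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];              /* bytes per row, per tile register */
    uint8_t  rows[16];               /* rows, per tile register */
};

int main(void)
{
    /* The kernel hands out AMX state on request; without this, tile
     * instructions fault. */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
        perror("AMX unavailable");
        return 1;
    }

    static int8_t  A[16][64], B[16][64];   /* B assumed VNNI pre-packed */
    static int32_t C[16][16];
    memset(A, 1, sizeof A);
    memset(B, 2, sizeof B);

    struct tileconfig cfg = { .palette_id = 1 };
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   /* tmm0: C, 16x16 int32 */
    cfg.rows[1] = 16; cfg.colsb[1] = 64;   /* tmm1: A, 16x64 int8  */
    cfg.rows[2] = 16; cfg.colsb[2] = 64;   /* tmm2: B, 16x64 int8  */
    _tile_loadconfig(&cfg);

    _tile_loadd(1, A, 64);                 /* load A and B tiles */
    _tile_loadd(2, B, 64);
    _tile_zero(0);                         /* accumulator starts at zero */
    _tile_dpbssd(0, 1, 2);                 /* tmm0 += tmm1 x tmm2, int8 -> int32 */
    _tile_stored(0, C, 64);
    _tile_release();

    printf("C[0][0] = %d\n", C[0][0]);     /* 64 products of 1*2 -> 128 */
    return 0;
}
```

A single TDPBSSD here performs 16x16x64 int8 multiply-accumulates, which is where the Ops/Cycle chart later in the deck gets its order-of-magnitude jump.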
Sapphire Rapids
Acceleration Engines

Increasing the effectiveness of cores by enabling offload of common-mode tasks via seamlessly integrated acceleration engines.

- Native dispatch, signaling & synchronization from user space via the Accelerator interfacing Architecture (see the sketch below)
- Coherent, shared memory space between cores & acceleration engines
- Concurrently shareable across processes, containers and VMs

[Figure: core utilization without vs. with acceleration - offloading common-mode tasks to acceleration engines frees core cycles for critical workloads and additional workload capacity]
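A hedged sketch of what AiA-style user-space dispatch looks like in code: the ENQCMD instruction (the _enqcmd intrinsic, built with -menqcmd) writes a 64-byte descriptor directly to a device portal mapped into the process, with no syscall on the submission path. The portal address and descriptor fields below are hypothetical placeholders; a real engine such as DSA defines its own descriptor format.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical 64-byte command descriptor; a real device defines the
 * actual layout. Plain virtual addresses can be used because SVM gives
 * the device access to the process address space. */
struct __attribute__((aligned(64))) cmd_desc {
    uint32_t opcode;
    uint32_t flags;
    uint64_t src_addr;
    uint64_t dst_addr;
    uint64_t xfer_size;
    uint8_t  reserved[32];
};

/* Submit one descriptor to an mmap'd device work-queue portal.
 * ENQCMD is non-posted: its flag result (returned by the intrinsic)
 * reports whether the shared queue accepted the descriptor, so user
 * code can retry on a full queue without entering the kernel. */
static void submit(volatile void *portal, const struct cmd_desc *desc)
{
    while (_enqcmd((void *)portal, desc))  /* assumed: nonzero == retry */
        _mm_pause();                       /* brief backoff */
}
```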
Intel® Data Streaming Acceleration Engine (DSA)
Optimizing streaming data movement and transformation operations

- Up to 4 instances per socket
- Low latency invocation
- No memory pinning overhead (see the sketch below)

Up to 39% additional CPU core cycles after DSA offload, measured on core utilization of Open vSwitch @ 1512B packet size:
- Without offload: 53% CPU compute / 47% CPU data movement
- With DSA offload: 86% CPU compute / 14% CPU data movement

Results have been estimated or simulated based on testing on pre-production hardware and software. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
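As a sketch of how "low latency invocation" and "no memory pinning" surface in software, the following submits one memory-move descriptor to a DSA dedicated work queue through the Linux idxd driver. The work-queue device path is an example, the queue is assumed to be already configured (e.g. with accel-config), and error handling is minimal; shared work queues would use ENQCMD as in the earlier sketch. Build with -mmovdir64b.

```c
#include <immintrin.h>
#include <linux/idxd.h>        /* struct dsa_hw_desc, DSA_OPCODE_MEMMOVE */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int dsa_memmove(void *dst, const void *src, uint32_t len)
{
    int fd = open("/dev/dsa/wq0.0", O_RDWR);   /* example wq device path */
    if (fd < 0)
        return -1;
    void *portal = mmap(NULL, 0x1000, PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, fd, 0);
    if (portal == MAP_FAILED) {
        close(fd);
        return -1;
    }

    struct dsa_completion_record comp __attribute__((aligned(32))) = { 0 };
    struct dsa_hw_desc desc = { 0 };
    desc.opcode          = DSA_OPCODE_MEMMOVE;
    desc.flags           = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
    desc.src_addr        = (uintptr_t)src;     /* plain virtual addresses: */
    desc.dst_addr        = (uintptr_t)dst;     /* SVM, no pinning/copying  */
    desc.xfer_size       = len;
    desc.completion_addr = (uintptr_t)&comp;

    _movdir64b(portal, &desc);     /* 64B posted write to a dedicated WQ */
    while (comp.status == 0)       /* poll the completion record */
        _mm_pause();

    munmap(portal, 0x1000);
    close(fd);
    return comp.status == DSA_COMP_SUCCESS ? 0 : -1;
}
```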
Intel® QuickAssist Technology (QAT) Acceleration Engine
Accelerating Cryptography and Data De/Compression

- Up to 400 Gb/s symmetric crypto
- Up to 160 Gb/s compression + 160 Gb/s decompression
- Fused operations

Expected core utilization on SPR for Zlib L9: 100% without offload; with QAT offload, up to 98% additional workload capacity (see the sketch below).

Results have been estimated or simulated. Sapphire Rapids estimation based on architecture models and baseline testing with Ice Lake and Intel QAT. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
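For the compression path, a minimal user-level sketch using Intel's QATzip library (one of several ways to reach QAT; DPDK and the OpenSSL engine are others). Session parameters are left at library defaults and error handling is simplified; the sw_backup flag asks qzInit to fall back to software when no QAT device is present. Link with -lqatzip.

```c
#include <qatzip.h>

/* Compress one buffer through QAT; returns 0 on success.
 * dst_len carries the output capacity in and the compressed size out. */
int qat_compress(const unsigned char *src, unsigned int src_len,
                 unsigned char *dst, unsigned int *dst_len)
{
    QzSession_T sess = { 0 };
    int rc = qzInit(&sess, 1);                 /* 1: allow SW fallback */
    if (rc != QZ_OK && rc != QZ_DUPLICATE)
        return -1;
    if (qzSetupSession(&sess, NULL) != QZ_OK)  /* NULL: library defaults */
        return -1;

    unsigned int in_len = src_len;
    rc = qzCompress(&sess, src, &in_len, dst, dst_len, 1 /* last chunk */);

    qzTeardownSession(&sess);
    qzClose(&sess);
    return rc == QZ_OK ? 0 : -1;
}
```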
Intel® Dynamic Load Balancer Acceleration Engine
Efficient Load Balancing across CPU Cores

- 400M load balancing decisions per second
- Offloads software queue management (see the baseline sketch below)
- Dynamic, flow-aware load balancing & reordering
- Priority queuing (up to 8 levels)
- Dynamic, power-aware sizing of applications

[Figure: packet flows distributed across cores, without vs. with DLB]
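To ground "offloads software queue management": the baseline DLB replaces looks like the sketch below, where worker cores serialize on a shared lock (or contended atomics) and spend cycles on enqueue/dequeue, priority arbitration and reordering. DLB moves that arbitration into hardware, delivering events to consumer cores directly. Illustrative C, not a DLB API.

```c
#include <pthread.h>
#include <stdint.h>

#define QDEPTH 1024                /* power of two for cheap wraparound */

/* The software distribution point every worker contends on. */
struct sw_queue {
    pthread_mutex_t lock;          /* the serialization DLB removes */
    void *items[QDEPTH];
    uint32_t head, tail;
};

/* Each worker core polls this; under load, lock contention and cache
 * line bouncing on head/tail dominate, which is exactly the "queue
 * management" cost the slide describes offloading. */
static void *dequeue(struct sw_queue *q)
{
    void *it = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->head != q->tail)
        it = q->items[q->head++ % QDEPTH];
    pthread_mutex_unlock(&q->lock);
    return it;
}
```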


Sapphire Rapids
I/O Advancements

- Introducing Compute Express Link (CXL) 1.1: accelerator and memory expansion in the data center
- Expanded device performance and connectivity via PCIe 5.0
- Improved DDIO & QoS capabilities
- Improved multi-socket scaling via Intel Ultra Path Interconnect (UPI) 2.0:
  - Up to 4 x24 UPI links operating @ 16 GT/s
  - New 8S-4UPI performance-optimized topology
Sapphire Rapids
IO Virtualization

Intel® Shared Virtual Memory (SVM)
- Enables devices and IA cores to access shared data in the CPU virtual address space
- Consistent across host application and offloaded tasks
- Avoids memory pinning and copying overheads
- Integrated & discrete devices, bare-metal & VM instances

Intel® Scalable IO Virtualization (S-IOV)
- Hardware acceleration for communication between VMs/containers and PCIe attached devices
- Scalable sharing and direct access to accelerators across 1000s of VMs/containers
- Higher performance than SW-only device scaling; more scalable than SR-IOV
- Supports integrated & discrete devices
Sapphire Rapids
Memory and Last Level Cache

- Increased shared Last Level Cache (LLC): up to >100 MB shared across ALL cores
- Increased bandwidth, security & reliability via DDR5 memory:
  - 4 memory controllers supporting 8 channels
  - Integrated memory encryption engine
  - Improved RAS
- Intel Optane™ Persistent Memory 300 Series

Sapphire Rapids
High Bandwidth Memory
- Significantly higher memory bandwidth vs. baseline Xeon-SP with 8 channels of DDR5
- Increased capacity and bandwidth; some usages can eliminate the need for DDR entirely
- 2 modes:
  - HBM Flat Mode: enables flat memory regions with HBM & DRAM (see the libnuma sketch below)
  - HBM Caching Mode: allows HBM to act as a DRAM-backed cache
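In Flat Mode, HBM is typically exposed to the OS as additional memory-only NUMA nodes, so ordinary NUMA tooling can place hot data. A minimal libnuma sketch follows; the node id 2 is an assumption, so check the real topology first (e.g. numactl -H). Link with -lnuma.

```c
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported\n");
        return 1;
    }

    const int hbm_node = 2;          /* assumed HBM node id; verify first */
    size_t bytes = 1ull << 30;       /* 1 GiB of bandwidth-hungry data */

    /* Bind the allocation to the HBM node; pages fault in on first touch. */
    double *buf = numa_alloc_onnode(bytes, hbm_node);
    if (!buf) {
        fprintf(stderr, "HBM allocation failed\n");
        return 1;
    }
    buf[0] = 1.0;

    numa_free(buf, bytes);
    return 0;
}
```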
Sapphire Rapids - Architected for AI
AI has become ubiquitous across usages - AI performance is required in all tiers of computing.

Goal
Enable efficient usage of AI across all services deployed on the elastic general-purpose tier by delivering many times more AI performance at lower CPU utilization.

Datatypes for Deep Learning
- int8 with int32 accumulation
- bfloat16 with IEEE SP accumulation

Acceleration at the ISA Level
- Full Intel Architecture programmability
- Low latency
- Available and integrated with industry-relevant frameworks & libraries

Ops/Cycle per core @ 100% utilization:
- AVX-512 (2xFMA) FP32: 64
- AVX-512 (2xFMA) INT8: 256
- AMX (TMUL) BF16: 1024
- AMX (TMUL) INT8: 2048

Results have been simulated. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
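As a sanity check on those figures: counting one multiply-accumulate as two ops, and assuming two 512-bit FMA ports for the AVX-512 cases and the published TMUL MAC rates for AMX, the chart's four values fall out directly:

```latex
\begin{align*}
\text{AVX-512 FP32 (2}\times\text{FMA)} &: 2 \times 16\ \text{lanes} \times 2 = 64 \\
\text{AVX-512 INT8 (VNNI)}              &: 2 \times 16\ \text{dwords} \times 4 \times 2 = 256 \\
\text{AMX BF16 (TMUL)}                  &: 512\ \text{MACs/cycle} \times 2 = 1024 \\
\text{AMX INT8 (TMUL)}                  &: 1024\ \text{MACs/cycle} \times 2 = 2048
\end{align*}
```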
Sapphire Rapids - Built for Elastic Computing Models - Microservices
>80% of new cloud-native and SaaS applications are expected to be built as microservices.

Goal
Enable higher throughput while meeting latency requirements, and reduce infrastructure overhead for execution, monitoring and orchestration of thousands of microservices.

- Improved performance and quality of service
- Reduced infrastructure overhead
- Better distributed communication

Microservices performance - throughput per core under a latency SLA of p99 < 30 ms:
- Cascade Lake: 1.0 (baseline)
- Ice Lake Server: +24%
- Sapphire Rapids: +69%

Results have been simulated. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
New Standard in Data Center Architecture

Sapphire Rapids: Biggest Leap in Data Center Capabilities in over a Decade
- Multi-Tile SoC for Scalability: physically tiled, logically monolithic
- General purpose & dedicated Acceleration Engines

Designed for Microservices and AI Workloads
- Performance Core Architecture
- Workload Specialized Acceleration

Pioneering Advanced Memory & IO Transitions
- DDR5 & HBM
- PCIe 5.0
- Enhanced Virtualization Capabilities
