HC2021.C1.4 Intel Arijit


Sapphire Rapids

Arijit Biswas
Intel Senior Principal Engineer
New Standard for Data Center Architecture

Next-Gen Intel Xeon Scalable Processor

Designed for Microservices & AI Workloads

Pioneering Advanced Memory & IO Transitions
Node Performance

Scalar Performance
- New Performance Core microarchitecture
- Increased core counts

Data Parallel Performance
- Multiple integrated Acceleration Engines

Cache & Memory Sub-System Architecture
- Larger private & shared caches
- DDR5
- Next Gen Optane support
- PCIe 5.0

Intra/Inter Socket Scaling
- Modular SoC w/ modular die fabric
- Wider & faster UPI
- Embedded Multi-die Interconnect Bridge (EMIB)

Data Center Performance

Consolidation & Orchestration
- Next gen Quality of Service capabilities
- Fast VM migration
- Better telemetry
- IO virtualization

Performance Consistency
- Low jitter architecture
- Consistent caching & memory latency
- Inter-processor interrupt virtualization

Elasticity & Efficient Data Center Utilization
- Broad workload/usage support and optimizations
- Next gen Optane support
- Improved security & RAS

Infrastructure & Framework Overhead
- Integrated workload accelerators
- CXL 1.1


From Single Monolithic Die to Multi-Tile Design for Increased Scalability

Delivers a scalable, balanced architecture that leverages existing software paradigms for monolithic CPUs via a modular architecture.

Sapphire Rapids: Multiple Tiles, Single CPU
- Every thread has full access to all resources on all tiles: cache, memory, IO...
- Provides consistent low latency and high cross-section bandwidth across the entire SoC
Sapphire Rapids
Key Building Blocks

Seamless integration of key building blocks:

- Compute IP: Cores, Acceleration Engines
- I/O IP: PCIe Gen 5, CXL 1.1, UPI 2.0
- Memory IP: DDR5, Optane, HBM
Performance Core
Built for Data Center

- Major microarchitecture and IPC improvements
- Improved support for large code/data footprints
- Consistent performance for multi-tenant usages
- Autonomous, fast power management for high frequency at low jitter

Performance Core
Architecture Improvements for DC Workloads & Usages

AI
- Intel® Advanced Matrix Extensions (AMX): tiled matrix operations for inference & training acceleration (see the sketch below)

Attached Device
- Accelerator interfacing Architecture (AiA): efficient dispatch, signaling & synchronization from user level

Half-Precision Float New Instructions (HFNI)
- Support for FP16: higher throughput at lower precision

Cache Management
- CLDEMOTE: proactive placement of cache contents
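To make the AMX programming model concrete, here is a minimal sketch of one int8 tile multiply using the public AMX intrinsics. It is illustrative, not Intel's reference code: the tile-config layout and the Linux permission request follow the documented ISA/kernel interface, but the buffer shapes and values are arbitrary, and B is assumed to be pre-packed in the VNNI layout TMUL expects. Build with gcc -O2 -mamx-tile -mamx-int8.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_REQ_XCOMP_PERM 0x1023   /* Linux arch_prctl: request extended state */
#define XFEATURE_XTILEDATA  18       /* AMX tile data component */

/* 64-byte tile configuration, palette 1: three 16-row x 64-byte tiles. */
struct tileconfig {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];              /* bytes per row, per tile register */
    uint8_t  rows[16];               /* rows, per tile register */
};

int main(void)
{
    /* The kernel hands out AMX state on request; without this, tile
     * instructions fault. */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
        perror("AMX unavailable");
        return 1;
    }

    static int8_t  A[16][64], B[16][64];   /* B assumed VNNI pre-packed */
    static int32_t C[16][16];
    memset(A, 1, sizeof A);
    memset(B, 2, sizeof B);

    struct tileconfig cfg = { .palette_id = 1 };
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   /* tmm0: C, 16x16 int32 */
    cfg.rows[1] = 16; cfg.colsb[1] = 64;   /* tmm1: A, 16x64 int8  */
    cfg.rows[2] = 16; cfg.colsb[2] = 64;   /* tmm2: B, 16x64 int8  */
    _tile_loadconfig(&cfg);

    _tile_loadd(1, A, 64);                 /* load A and B tiles */
    _tile_loadd(2, B, 64);
    _tile_zero(0);                         /* accumulator starts at zero */
    _tile_dpbssd(0, 1, 2);                 /* tmm0 += tmm1 x tmm2, int8 -> int32 */
    _tile_stored(0, C, 64);
    _tile_release();

    printf("C[0][0] = %d\n", C[0][0]);     /* 64 products of 1*2 -> 128 */
    return 0;
}
```

A single TDPBSSD here performs 16x16x64 int8 multiply-accumulates, which is where the Ops/Cycle chart later in the deck gets its order-of-magnitude jump.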
Sapphire Rapids
Acceleration Engines

Increasing the effectiveness of cores by enabling offload of common-mode tasks via seamlessly integrated acceleration engines.

- Native dispatch, signaling & synchronization from user space via the Accelerator interfacing Architecture (see the sketch below)
- Coherent, shared memory space between cores & acceleration engines
- Concurrently shareable across processes, containers and VMs

[Figure: core utilization without vs. with acceleration - offloading common-mode tasks to acceleration engines frees core cycles for critical workloads and additional workload capacity]
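A hedged sketch of what AiA-style user-space dispatch looks like in code: the ENQCMD instruction (the _enqcmd intrinsic, built with -menqcmd) writes a 64-byte descriptor directly to a device portal mapped into the process, with no syscall on the submission path. The portal address and descriptor fields below are hypothetical placeholders; a real engine such as DSA defines its own descriptor format.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical 64-byte command descriptor; a real device defines the
 * actual layout. Plain virtual addresses can be used because SVM gives
 * the device access to the process address space. */
struct __attribute__((aligned(64))) cmd_desc {
    uint32_t opcode;
    uint32_t flags;
    uint64_t src_addr;
    uint64_t dst_addr;
    uint64_t xfer_size;
    uint8_t  reserved[32];
};

/* Submit one descriptor to an mmap'd device work-queue portal.
 * ENQCMD is non-posted: its flag result (returned by the intrinsic)
 * reports whether the shared queue accepted the descriptor, so user
 * code can retry on a full queue without entering the kernel. */
static void submit(volatile void *portal, const struct cmd_desc *desc)
{
    while (_enqcmd((void *)portal, desc))  /* assumed: nonzero == retry */
        _mm_pause();                       /* brief backoff */
}
```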
Intel® Data Streaming Acceleration Engine (DSA)
Optimizing streaming data movement and transformation operations

- Up to 4 instances per socket
- Low latency invocation
- No memory pinning overhead (see the sketch below)

Up to 39% additional CPU core cycles after DSA offload, measured on core utilization of Open vSwitch @ 1512B packet size:
- Without offload: 53% CPU compute / 47% CPU data movement
- With DSA offload: 86% CPU compute / 14% CPU data movement

Results have been estimated or simulated based on testing on pre-production hardware and software. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
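As a sketch of how "low latency invocation" and "no memory pinning" surface in software, the following submits one memory-move descriptor to a DSA dedicated work queue through the Linux idxd driver. The work-queue device path is an example, the queue is assumed to be already configured (e.g. with accel-config), and error handling is minimal; shared work queues would use ENQCMD as in the earlier sketch. Build with -mmovdir64b.

```c
#include <immintrin.h>
#include <linux/idxd.h>        /* struct dsa_hw_desc, DSA_OPCODE_MEMMOVE */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int dsa_memmove(void *dst, const void *src, uint32_t len)
{
    int fd = open("/dev/dsa/wq0.0", O_RDWR);   /* example wq device path */
    if (fd < 0)
        return -1;
    void *portal = mmap(NULL, 0x1000, PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, fd, 0);
    if (portal == MAP_FAILED) {
        close(fd);
        return -1;
    }

    struct dsa_completion_record comp __attribute__((aligned(32))) = { 0 };
    struct dsa_hw_desc desc = { 0 };
    desc.opcode          = DSA_OPCODE_MEMMOVE;
    desc.flags           = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
    desc.src_addr        = (uintptr_t)src;     /* plain virtual addresses: */
    desc.dst_addr        = (uintptr_t)dst;     /* SVM, no pinning/copying  */
    desc.xfer_size       = len;
    desc.completion_addr = (uintptr_t)&comp;

    _movdir64b(portal, &desc);     /* 64B posted write to a dedicated WQ */
    while (comp.status == 0)       /* poll the completion record */
        _mm_pause();

    munmap(portal, 0x1000);
    close(fd);
    return comp.status == DSA_COMP_SUCCESS ? 0 : -1;
}
```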
Intel® QuickAssist Technology (QAT) Acceleration Engine
Accelerating Cryptography and Data De/Compression

- Up to 400 Gb/s symmetric crypto
- Up to 160 Gb/s compression + 160 Gb/s decompression
- Fused operations

Expected core utilization on SPR for Zlib L9: 100% without offload; with QAT offload, up to 98% additional workload capacity (see the sketch below).

Results have been estimated or simulated. Sapphire Rapids estimation based on architecture models and baseline testing with Ice Lake and Intel QAT. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
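For the compression path, a minimal user-level sketch using Intel's QATzip library (one of several ways to reach QAT; DPDK and the OpenSSL engine are others). Session parameters are left at library defaults and error handling is simplified; the sw_backup flag asks qzInit to fall back to software when no QAT device is present. Link with -lqatzip.

```c
#include <qatzip.h>

/* Compress one buffer through QAT; returns 0 on success.
 * dst_len carries the output capacity in and the compressed size out. */
int qat_compress(const unsigned char *src, unsigned int src_len,
                 unsigned char *dst, unsigned int *dst_len)
{
    QzSession_T sess = { 0 };
    int rc = qzInit(&sess, 1);                 /* 1: allow SW fallback */
    if (rc != QZ_OK && rc != QZ_DUPLICATE)
        return -1;
    if (qzSetupSession(&sess, NULL) != QZ_OK)  /* NULL: library defaults */
        return -1;

    unsigned int in_len = src_len;
    rc = qzCompress(&sess, src, &in_len, dst, dst_len, 1 /* last chunk */);

    qzTeardownSession(&sess);
    qzClose(&sess);
    return rc == QZ_OK ? 0 : -1;
}
```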
Intel® Dynamic Load Balancer Acceleration Engine
Efficient Load Balancing across CPU Cores

- 400M load balancing decisions per second
- Offloads software queue management (see the baseline sketch below)
- Dynamic, flow-aware load balancing & reordering
- Priority queuing (up to 8 levels)
- Dynamic, power-aware sizing of applications

[Figure: packet flows distributed across cores, without vs. with DLB]
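To ground "offloads software queue management": the baseline DLB replaces looks like the sketch below, where worker cores serialize on a shared lock (or contended atomics) and spend cycles on enqueue/dequeue, priority arbitration and reordering. DLB moves that arbitration into hardware, delivering events to consumer cores directly. Illustrative C, not a DLB API.

```c
#include <pthread.h>
#include <stdint.h>

#define QDEPTH 1024                /* power of two for cheap wraparound */

/* The software distribution point every worker contends on. */
struct sw_queue {
    pthread_mutex_t lock;          /* the serialization DLB removes */
    void *items[QDEPTH];
    uint32_t head, tail;
};

/* Each worker core polls this; under load, lock contention and cache
 * line bouncing on head/tail dominate, which is exactly the "queue
 * management" cost the slide describes offloading. */
static void *dequeue(struct sw_queue *q)
{
    void *it = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->head != q->tail)
        it = q->items[q->head++ % QDEPTH];
    pthread_mutex_unlock(&q->lock);
    return it;
}
```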


Sapphire Rapids
I/O Advancements

- Introducing Compute Express Link (CXL) 1.1: accelerator and memory expansion in the data center
- Expanded device performance and connectivity via PCIe 5.0
- Improved DDIO & QoS capabilities
- Improved multi-socket scaling via Intel Ultra Path Interconnect (UPI) 2.0:
  - Up to 4 x24 UPI links operating @ 16 GT/s
  - New 8S-4UPI performance-optimized topology
Sapphire Rapids
IO Virtualization

Intel® Shared Virtual Memory (SVM)
- Enables devices and IA cores to access shared data in the CPU virtual address space
- Consistent across host application and offloaded tasks
- Avoids memory pinning and copying overheads
- Integrated & discrete devices, bare-metal & VM instances

Intel® Scalable IO Virtualization (S-IOV)
- Hardware acceleration for communication between VMs/containers and PCIe attached devices
- Scalable sharing and direct access to accelerators across 1000s of VMs/containers
- Higher performance than SW-only device scaling; more scalable than SR-IOV
- Supports integrated & discrete devices
Sapphire Rapids
Memory and Last Level Cache

- Increased shared Last Level Cache (LLC): up to >100 MB shared across ALL cores
- Increased bandwidth, security & reliability via DDR5 memory:
  - 4 memory controllers supporting 8 channels
  - Integrated memory encryption engine
  - Improved RAS
- Intel Optane™ Persistent Memory 300 Series

Sapphire Rapids
High Bandwidth Memory
- Significantly higher memory bandwidth vs. baseline Xeon-SP with 8 channels of DDR5
- Increased capacity and bandwidth; some usages can eliminate the need for DDR entirely
- 2 modes:
  - HBM Flat Mode: enables flat memory regions with HBM & DRAM (see the libnuma sketch below)
  - HBM Caching Mode: allows HBM to act as a DRAM-backed cache
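In Flat Mode, HBM is typically exposed to the OS as additional memory-only NUMA nodes, so ordinary NUMA tooling can place hot data. A minimal libnuma sketch follows; the node id 2 is an assumption, so check the real topology first (e.g. numactl -H). Link with -lnuma.

```c
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported\n");
        return 1;
    }

    const int hbm_node = 2;          /* assumed HBM node id; verify first */
    size_t bytes = 1ull << 30;       /* 1 GiB of bandwidth-hungry data */

    /* Bind the allocation to the HBM node; pages fault in on first touch. */
    double *buf = numa_alloc_onnode(bytes, hbm_node);
    if (!buf) {
        fprintf(stderr, "HBM allocation failed\n");
        return 1;
    }
    buf[0] = 1.0;

    numa_free(buf, bytes);
    return 0;
}
```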
Sapphire Rapids - Architected for AI
AI has become ubiquitous across usages - AI performance is required in all tiers of computing.

Goal
Enable efficient usage of AI across all services deployed on the elastic general-purpose tier by delivering many times more AI performance at lower CPU utilization.

Datatypes for Deep Learning
- int8 with int32 accumulation
- bfloat16 with IEEE SP accumulation

Acceleration at the ISA Level
- Full Intel Architecture programmability
- Low latency
- Available and integrated with industry-relevant frameworks & libraries

Ops/Cycle per core @ 100% utilization:
- AVX-512 (2xFMA) FP32: 64
- AVX-512 (2xFMA) INT8: 256
- AMX (TMUL) BF16: 1024
- AMX (TMUL) INT8: 2048

Results have been simulated. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
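As a sanity check on those figures: counting one multiply-accumulate as two ops, and assuming two 512-bit FMA ports for the AVX-512 cases and the published TMUL MAC rates for AMX, the chart's four values fall out directly:

```latex
\begin{align*}
\text{AVX-512 FP32 (2}\times\text{FMA)} &: 2 \times 16\ \text{lanes} \times 2 = 64 \\
\text{AVX-512 INT8 (VNNI)}              &: 2 \times 16\ \text{dwords} \times 4 \times 2 = 256 \\
\text{AMX BF16 (TMUL)}                  &: 512\ \text{MACs/cycle} \times 2 = 1024 \\
\text{AMX INT8 (TMUL)}                  &: 1024\ \text{MACs/cycle} \times 2 = 2048
\end{align*}
```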
Sapphire Rapids - Built for Elastic Computing Models - Microservices
>80% of new cloud-native and SaaS applications are expected to be built as microservices.

Goal
Enable higher throughput while meeting latency requirements, and reduce infrastructure overhead for execution, monitoring and orchestration of thousands of microservices.

- Improved performance and quality of service
- Reduced infrastructure overhead
- Better distributed communication

Microservices performance - throughput per core under a latency SLA of p99 < 30 ms:
- Cascade Lake: 1.0 (baseline)
- Ice Lake Server: +24%
- Sapphire Rapids: +69%

Results have been simulated. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
New Standard in Data Center Architecture

Sapphire Rapids: Biggest Leap in Data Center Capabilities in over a Decade
- Multi-Tile SoC for Scalability: physically tiled, logically monolithic
- General purpose & dedicated Acceleration Engines

Designed for Microservices and AI Workloads
- Performance Core Architecture
- Workload Specialized Acceleration

Pioneering Advanced Memory & IO Transitions
- DDR5 & HBM
- PCIe 5.0
- Enhanced Virtualization Capabilities
