MindShare PCIe3.0 b.1.5


PCI Express Technology

Based on Spec Rev 1.x, 2.x, 3.x

Moki Anji (moki@ synopsys.com)


Presentation rev b.1.5
Do Not Distribute MindShare.com © 2013
Legal Notice 2

 This presentation is copyrighted material and is intended solely for the use of students who have attended this MindShare course taught by a MindShare instructor.
 Do not copy or distribute this material without written permission from MindShare.
 Any unauthorized distribution is illegal.

training@mindshare.com
1-800-633-1440
Materials Download 3

www.mindshare.com/materials/

1. Login or create account
2. Enter Offering ID

Offering ID is only valid for a few days after the start of the class
Materials Download, continued 4

3. Click on My eBooks /
Presentations tab

Useful Links

Download the materials to your


machine and then open to view

Create your PDF if necessary

Materials Download, continued 5

If Arbor IS included with your class:

4. Click on My eLearning /
Software tab

Download the Arbor application,


install on your machine and
activate with your key

Materials Download 6

If Arbor is NOT included with your class:

4. Click on
Software
menu item

Download the
Arbor application,
install on your
machine and
select Trial period
activation

MindShare Courses 7

Intel Architecture
 Intel Sandy / Ivy Bridge Processor
 Intel Atom Processor
 Intel 32/64-bit x86 Architecture
 Intel QuickPath Interconnect (QPI)
 Computer Architecture with Intel Chipsets

IO Buses
 PCI Express 3.0
 USB 3.0
 USB 2.0
 xHCI for USB
 PCI / PCI-X

AMD Architecture
 AMD Opteron Processor (Bulldozer)

Memory Technology
 Modern DRAM (DDRx / LPDDRx)

Firmware Technology
 UEFI Architecture
 BIOS Essentials

Virtualization Technology
 PC Virtualization
 IO Virtualization (IOV)

ARM Architecture
 ARM 32/64-bit Architecture (w/ x86 comparisons)

Storage Technology
 Serial ATA 3.0
 SAS 2.0

Programming
 OpenCL Programming
 x86 Architecture Programming
 x86 Assembly Language Basics
MindShare Training Options 8

 In-House classroom
 Live, on-site Instructor Led Training (ILT)
 Virtual classroom
 Live ILT via WebEx (or similar tool)
 eLearning Courses
 Pre-recorded courses available online, 24/7 with unlimited
access for 90 days

 Check our website for Public course offerings:


 PCI Express
 USB 3.0
 Modern DRAM Technology
 x86 Architecture

 We can customize any of our classes to meet your


budget and content needs
MindShare Books / eBooks 9

Do Not www.mindshare.com
Moki Anji (moki@ synopsys.com)
Distribute MindShare.com © 2013
MindShare Arbor 10

www.mindshare.com/arbor
 A software tool to view,
edit and verify the
configuration settings of a
computer
 Decode data from live and
saved systems
 Apply standard and custom
rule checks
 Directly edit Config,
Memory and IO space
 Everything driven from
open-format XML
PCI Express Topics vii 11

Part One: The Big Picture
1: Background
2: PCIe Architecture Overview
3: Configuration Overview
4: Address Space and Transaction Routing
Part Two: Transaction Layer
5: TLP Elements
6: Flow Control
7: Quality of Service
8: Transaction Ordering
Part Three: Data Link Layer
9: DLLP Elements
10: Ack/Nak Protocol
Part Four: Physical Layer
11: Physical Layer Logical (Gen1&2)
12: Physical Layer Logical (Gen 3)
13: Physical Layer Electrical (Gen1, 2, & 3)
14: Link Initialization & Training
Part Five: Additional System Topics
15: Error Detection and Handling
16: Power Management
17: Interrupt Support
18: System Reset
19: Hot Plug and Power Budgeting
20: Overview of 2.1 Spec Changes
21: Overview of 3.1 Spec Changes
Part Six: Appendices
A: Details of Spec 2.1 Changes
B: Details of Spec 3.1 Changes
C: IO Virtualization Support
D: Add-In Cards and Connectors
E: Arbor Exercise Solutions
Part One: The Big Picture

Background

PCI Background 9 14

 PCIe was developed from PCI and PCI-X architectures and inherits many of their features
 PCIe software is backward compatible with PCI to ease migration
 Address space, configuration registers, and transaction types are the same
Early Buses Compared 11 15

Bus Type          Clock Frequency   Peak Bandwidth (32-bit / 64-bit)   Number of Card Slots
PCI               33 MHz            133 / 266 MB/s                     4-5
PCI               66 MHz            266 / 533 MB/s                     1-2
PCI-X             66 MHz            266 / 533 MB/s                     4
PCI-X             133 MHz           533 / 1066 MB/s                    1-2
PCI-X 2.0 (DDR)   133 MHz           1066-2132 MB/s                     1 (point-to-point)
PCI-X 2.0 (QDR)   133 MHz           2132-4262 MB/s                     1 (point-to-point)
PCI Reflected-Wave Signaling 17 16

PCI Basics 12 17

PCI Bus Arbitration 13 18

Simple PCI Read Transaction 15 19

PCI-to-PCI Bridge 18 20

PCI problems: Retries, Disconnects and, for bridges, the Delayed Transaction Model.
PCI Transaction Model 19 21

Address Map of PCI-Based System 26 22

PCI Configuration Header Type 0 29 23

Header type 0 defines a PCI function that is not a bridge to another PCI bus
PCI Configuration Header Type 1 28 24

Header type 1 defines a PCI Bridge, which means there will be at least one more bus below it
Prefetchable Memory Space 25

Prefetchable space is safe to read ahead: a memory location can be read, the contents discarded and the same content read again without problems.

Non-prefetchable memory has side effects associated with reading from a location. Once read, that data is lost and cannot be read again.

PCI transactions don’t define a transfer size. Bridges have to guess the size on reads, and benefit from knowing whether the address range is prefetchable.
Always Need More Bandwidth 30 26

Original PCI bus used 33 MHz; 66 MHz version followed shortly after
Good: doubled the bandwidth
Bad: fewer devices could be connected on the shared bus.
Limitation imposed by Reflected-Wave signaling
Higher speed reduced the timing budget
Result: only 4 or 5 electrical loads per bus were supported (half as many as before).
Introduction of PCI-X 1.0 31 27

Improved PCI protocol
Registered inputs, allowing more loading or higher clock speed
Eliminated Wait States
Used a Split-Transaction model
Transferred data in blocks (making buffer management in bridges more efficient)
Specified data transfer size
Backward compatible with PCI in both hardware and software
More system-centric than PCI
PCI-X Features 33 28

 Interrupt handling more efficient - MSI (message signaled interrupt) support mandatory
 Snoops can be eliminated
 No Snoop (NS) attribute: this transaction will never
need to wait for a snoop result from the processor.
 Improved traffic flow
 Relaxed Ordering (RO) attribute: software knows
whether a transfer has ordering dependencies;
normal ordering rules can be ignored to improve
efficiency in some cases.

133 MHz PCI-X System 32 29

PCI-X Burst Memory Read 33 30

[Timing diagram: PCI-X burst memory read over clocks 1 to 12. Phases in order: Bus Idle, Address Phase, Attribute Phase, Response Phase, Data Phases 1 to 4, Turnaround Cycle. Signals: CLK, FRAME#, AD[31:0] (Address, ATTR, Data-0 to Data-3), C/BE#[3:0] (Cmd, ATTR), IRDY#, TRDY#, DEVSEL# (Decode A).]
PCI-X Split Transaction Model 34 31

Replaces Delayed-Transaction model of PCI

PCI-X 2.0 Features 37 32

 Second generation supported DDR and QDR data


transfer rates by using Source-Synchronous clocking.
 2128 MB/s peak for 64-bit DDR system
 4256 MB/s peak for 64-bit QDR system
 Remained hardware and software backward
compatible with PCI-X 1.0
Note: Some pins that were reserved for 1.0 are used for ECC
and strobe signals in 2.0
 ECC support built in, giving automatic single-bit error
correction & multi-bit error detection
 Signal voltage reduced to 1.5 V to avoid excessive
power consumption at faster rates

Problems with Any Parallel Design 36 33

 Timing issues limit parallel designs and make board routing difficult at higher frequencies
 Signal skew – outputs experience different delays
 Clock skew – CLK edge occurs at different times at each device; Tx and Rx clocks not perfectly in sync
 Flight time – signal delay through the transmission path limits the minimum clock period or highest CLK frequency
Source-Synchronous Model of PCI-X 2.0 38 34

 Transmitter drives separate strobes in addition to the data; timing relationship at Rx is similar to Tx.
 To maintain signal integrity, bus becomes point-to-point, requiring bridges to connect multiple devices.
 Bridges replicate a wide bus, making this an
expensive solution due to its large pin count.
 Signal timing still very tight; challenging to route.

PCIe Architecture Overview

Introduction to PCI Express 40 36

 Serial point-to-point communication bus


 Scaleable: x1, x2, x4, x8, x12, x16, x32 Links
 Symmetric: same number of lanes in each direction
 Dual-Simplex connection
 2.5, 5.0 or 8.0 GT/s transfer rate in each direction
 Packet-based transaction protocol
 Software backward compatible with PCI & PCI-X

Link Width and Lanes 40 37

 Performance is scalable, based on the number of signal Lanes implemented
 Lane consists of a serial send/receive path made up
of four wires (2 each for differential Tx and Rx).
 Link can have a minimum of 1 Lane or as many as
32 Lanes, and number of Lanes in a Link is called
the Link width. A Link connects two devices.

[Figure: One Lane. The Transmitter of each device drives the Receiver of the other device, one differential pair per direction.]
Differential Signaling 44 38

 Differential signaling
 Better noise immunity
 Lower voltages allow smaller, faster circuitry:
Tx Differential Peak-to-peak voltage = 0.8 - 1.2 V

VDIFF = VD+ - VD-

For symmetric differential swing:
VDIFFp = max |VD+ - VD-|
VDIFFp-p = 2 * max |VD+ - VD-|

[Figure: D+ and D- waveforms swinging about the common-mode voltage Vcm.]
PCIe Gen1, Gen2, Gen3 Throughput 43 39

Aggregate bandwidth (GB/s) by Link width:

        x1    x2    x4    x8    x12   x16   x32
Gen 1   0.5   1     2     4     6     8     16
Gen 2   1     2     4     8     12    16    32
Gen 3   2     4     8     16    24    32    64

Derivation of Gen1 numbers:
 Bandwidth described as “aggregate”, implying simultaneous traffic in both directions
 Per direction: (2.5 GT/s) * 1 Lane = 2.5 Gb/s
 Bandwidth loss: 20% at Rx due to 8b/10b, so 2.5 Gb/s * (1 Byte / 10 bits) = 250 MB/s
 Bidirectional: 250 MB/s x 2 = 500 MB/s = 0.5 GB/s (aggregate)

GT = Giga-Transfers, Gb = Gigabits, GB = Gigabytes
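The Gen1 derivation generalizes to the other table entries. The following is a minimal sketch, not a definitive calculator: the function name and the `GENERATIONS` table are illustrative, and the Gen3 entry assumes the 128b/130b encoding listed among the Rev 3.0 changes. It shows that the table's Gen3 figures are rounded up slightly.

```python
# Transfer rate (GT/s) and encoding efficiency per generation.
# Gen1/Gen2 use 8b/10b; Gen3 is assumed to use 128b/130b.
GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
}


def link_bandwidth_gbytes(gen, lanes, aggregate=True):
    """Return usable Link bandwidth in GB/s after encoding overhead."""
    rate_gt, efficiency = GENERATIONS[gen]
    per_direction = rate_gt * efficiency / 8 * lanes  # bits -> bytes
    return per_direction * (2 if aggregate else 1)


if __name__ == "__main__":
    for gen in (1, 2, 3):
        row = [round(link_bandwidth_gbytes(gen, w), 2) for w in (1, 4, 16)]
        print(f"Gen {gen}: x1, x4, x16 ->", row)
```

Packet overhead (headers, CRCs, DLLPs) is not modeled here, so delivered payload bandwidth will be lower than these numbers.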
Common Clock Not Necessary 45 40

Phase-Locked Loop (PLL) at receiver is able to recover a clock from the incoming data stream
Example PCIe Topology 47 41

Root Complex 48 42

 Root Complex connects the CPU to the PCIe topology (e.g.: chipset)
 Root Complex generates PCIe transaction
requests on behalf of the CPU, creating 4 different
types of requests:
 Configuration
 Memory
 I/O
 Message

Switches and Bridges 48 43

 Switches provide fan-out and aggregation


 Connecting more than two ports requires a Switch
 Switches act as packet routers
 Peer-to-peer support is mandatory
 Bridges connect different busses
 Forward bridge example: PCIe to PCI, etc.
 Reverse bridge example: PCI to PCIe

Endpoints 49 44

 Endpoints are Functions in a PCIe topology that are not Switches or the Root Complex
 They only have an upstream port and always reside at
the bottom of a PCIe topology “tree structure”
 They can act as requester or completer for transactions

Legacy/Native Endpoints 49 45

 Legacy Endpoints use older PCI bus operations to support backward compatibility
 Legacy Endpoints are allowed to support things
that Native PCIe Endpoints are not, such as:
 I/O transactions
 Locked transactions
 32-bit-only memory addressing
 Native PCIe Endpoints must support 64-bit
addressing for prefetchable address ranges

PCI Express Topology – Root Complex 51 46

To software, Root Complex looks like a hierarchy of virtual PCI-to-PCI Bridges
Switch Internals 52 47

Switches also appear to software as a hierarchy of bridges
PCI Configuration Headers 50 48

Root Complex 52 49

 The Root Complex is really the combination of logic acting as the interface between the CPU and PCIe Ports
 External Switches and bridges may be needed if the chipset does not provide enough Ports for a system that uses a large number of devices.
Root Complex Example 1 54 50

Root Complex Example 2 51

[Figure: Dual-socket Xeon E5 2600 system. Each Xeon E5 (2-8 cores, each core with L1 D, L1 I and L2 Cache, one L3 Slice per Core on a Ring Interconnect, No Processor Graphics) has a System Agent with PCU, an IMC to DDR3 DIMMs, QPI links to the other socket, PCIe3 ports (EN: 24 lanes, EP: 40 lanes) and DMI2 / PCIe x4. The Root Complex also includes the C600 PCH (Patsburg): HD Audio, SMBus, USB 2.0 (x14), GLAN, PCIe Gen2 (8 Lanes), SATA (x6), SAS (x6), PCI, SPI Flash, and LPC (FWH, SIO, TPM 1.2).]
PCI Express Device Layers 56 52

Layers in PCIe Devices 57 53

PCIe Device Layer Details 58 54

Device Core/ Software Layer 59 55

Transaction Layer 58 56

Transaction Layer - Transaction Types 59 57

 Backward compatibility with PCI is maintained by using the same memory, I/O and configuration address space
 A new transaction type is added for PCIe: Messages
 PCIe transaction types are shown in the following table:

Transaction Type                      Non-Posted or Posted
Memory Read                           Non-posted
Memory Write                          Posted
Memory Read Lock                      Non-posted
IO Read                               Non-posted
IO Write                              Non-posted
Configuration Read (Type 0 and 1)     Non-posted
Configuration Write (Type 0 and 1)    Non-posted
Message                               Posted
AtomicOp                              Non-posted
Transaction Layer - TLP Types 61 58

TLP Type                                      Abbreviated Name
Memory Read Request                           MRd
Memory Write Request                          MWr
Memory Read Request – Locked Access           MRdLk
IO Read Request                               IORd
IO Write Request                              IOWr
Configuration Read Request (Type 0 and 1)     CfgRd0, CfgRd1
Configuration Write Request (Type 0 and 1)    CfgWr0, CfgWr1
Message Request with Data                     MsgD
Message Request without Data                  Msg
Completion with Data                          CplD
Completion without Data                       Cpl
Completion Lock with Data                     CplDLk
Completion Lock without Data                  CplLk
AtomicOps                                     FetchAdd, Swap, CAS
TLP Origin and Destination 62 59

TLP Assembly/Disassembly 173 60

TLP Core Structure 80 61

Transaction Layer Packet (TLP)

[ Header | Data Payload | ECRC ]

Header (3 or 4 DWs in size) may include:
 Address, transaction type, transfer size, requester ID / completer ID, tag, byte enables, no snoop bit, relaxed ordering bit, traffic class bits, etc.
Data Payload (if present)
 Present for write request packets and completions with data
 Contains between 1 and 1024 DWs of data
TLP Core Structure 62

Transaction Layer Packet (TLP)

[ Header | Data Payload | ECRC ]

ECRC or Digest (Optional)
 If enabled, contains a 32-bit CRC used for end-to-end error checking (ECRC)
Example: Memory Read (Non-Posted) 65 63

Example: I/O Write (Non-Posted) 68 64

Example: Memory Write (Posted) 70 65

Quality of Service (QoS) 70 66

 Quality of Service (QoS) describes the ability of the network to manage Transmission rate, Effective bandwidth, and Latency
 Features that make QoS possible:
 Traffic Class – 8 TCs available
 Virtual Channels – Up to 8 VCs available
 Arbitration
 Two Classes of Transactions Supported
 Isochronous transactions
 Asynchronous transactions

QoS Definitions 67

Traffic Class
 A TLP header field that remains unchanged as a packet
flows from its source to its ultimate destination
Examined at each “service point” (e.g.: Switch port)
TC value is assigned by software as an indication of
preferred priority
 Every PCIe device supports TC0 at a minimum
Virtual Channel
 Implemented in hardware with separate buffers for each
VC in each port
 VCs enable multiple logical data flows over a single
physical Link
 Every PCIe device supports VC0 at a minimum

Prioritized Traffic Example 71 68

TC/VC Mapping 69

 Priority policy implemented with Virtual Channel buffers and Traffic Class tags on packets

[Figure: Transmitter Device A and Receiver Device B share one physical Link carrying multiple virtual channels. On both ends, TC/VC Mapping sends TC0 - TC2 to the VC0 buffers and TC3 - TC7 to the VC1 buffers; Arbitration in the transmitter selects between VC0 and VC1 for the Link.]
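The mapping on this slide can be written down as a small lookup table. This is an illustrative sketch only: `build_tc_to_vc` is a hypothetical helper, not a register layout, and the check that TC0 lands on VC0 reflects the slides' statement that every device supports TC0 and VC0 at a minimum.

```python
def build_tc_to_vc(ranges):
    """Build a TC -> VC table from (first_tc, last_tc, vc) tuples.

    Every one of the 8 Traffic Classes must be mapped exactly once,
    and TC0 is expected to use VC0 (assumed default pairing).
    """
    mapping = {}
    for first, last, vc in ranges:
        for tc in range(first, last + 1):
            if tc in mapping:
                raise ValueError(f"TC{tc} mapped twice")
            mapping[tc] = vc
    if sorted(mapping) != list(range(8)):
        raise ValueError("all of TC0..TC7 must be mapped")
    if mapping[0] != 0:
        raise ValueError("TC0 must map to VC0")
    return mapping


# Mapping from the slide: TC0-TC2 -> VC0, TC3-TC7 -> VC1.
DEVICE_A = build_tc_to_vc([(0, 2, 0), (3, 7, 1)])
DEVICE_B = build_tc_to_vc([(0, 2, 0), (3, 7, 1)])

# Both ends of a Link are expected to use the same mapping.
LINK_IS_CONSISTENT = DEVICE_A == DEVICE_B
```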
QoS Arbitration 70

 Port Arbitration and VC Arbitration


 TC/VC mapping must be the same on both ends of a
Link, but doesn’t have to match between different
Links
[Figure: A Switch with two ingress Links and one egress Link. One ingress Link maps TC0 thru TC2 to VC0 and TC3 thru TC7 to VC5; the other maps TC0 thru TC2 to VC0 and TC3 thru TC7 to VC7. At the egress port, Port Arbitration selects among ingress traffic for VC0 and for VC1, then VC Arbitration selects between VC0 and VC1 for the Link.]
Transaction Ordering 71 71

 Packets of the same Traffic Class are ordered according to transaction ordering rules
 Packet type dictates the ordering rules
 Packets of different TCs have no ordering
relationship

Flow Control 72 72

 Eliminates inefficiencies of PCI (Retries and Disconnects)
 Receiver periodically updates transmitter on
available buffer space
 Flow control on each Link (not end-to-end)
 Separate mechanism for each Virtual Channel
 Transmitter won’t send packet unless receiver
has enough buffer space to take it
 Flow Control applies only to TLPs
 DLLPs and Ordered Sets have no Flow Control

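The credit idea behind these bullets can be sketched in a few lines. This is a deliberately simplified model under stated assumptions: a single credit pool for one Virtual Channel on one Link, with a hypothetical class and method names. Real PCIe hardware tracks several credit types separately, which this sketch does not attempt.

```python
class FlowControlledLink:
    """Toy credit tracker for one Virtual Channel on one Link."""

    def __init__(self, receiver_credits):
        # Buffer space the receiver advertised, in abstract credit units.
        self.credits = receiver_credits

    def try_send(self, tlp_cost):
        """Send a TLP only if the receiver has room; return True if sent."""
        if tlp_cost > self.credits:
            return False  # transmitter holds the packet, no Retry needed
        self.credits -= tlp_cost
        return True

    def flow_control_update(self, freed):
        """Model a Flow Control Update DLLP reporting freed buffer space."""
        self.credits += freed


link = FlowControlledLink(receiver_credits=4)
first_sent = link.try_send(3)    # fits: 1 credit left
second_sent = link.try_send(2)   # blocked: only 1 credit available
link.flow_control_update(3)      # receiver drained its buffer
retry_sent = link.try_send(2)    # now fits
```

The point of the design is that a packet is never put on the Link unless it is known to fit, which is what removes the PCI-style Retry and Disconnect traffic.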
Flow Control - Example 72 73

Receiver periodically sends Flow Control Update DLLPs
Data Link Layer 72 74

DLLP Origin and Destination 74 75

TLP / DLLP Structure at Data Link Layer 75 76

Transaction Layer Packet (TLP)

[ Sequence ID | Header | Data Payload | ECRC | LCRC ]

DLLP

[ Type | Misc. | 16-Bit CRC ]

 Sequence ID field is 16 bits: a 12-bit value padded with 4 leading zeroes, used to associate an Ack/Nak DLLP with a TLP in the Retry Buffer
 32-bit LCRC used for error checking in receiver
 DLLPs are transaction overhead, so they need to be small
 Type field is 8 bits
 CRC is 16 bits
Data Link Layer Replay Mechanism 74 77

 A copy of each outgoing packet is stored in the Retry Buffer until the neighbor acknowledges receipt
 Incoming packets are checked and Ack or Nak is
generated to acknowledge them

[Figure: Data Link Layer of Device A. Tx side: TLPs from the Transaction Layer are wrapped as Link Packets (Sequence, TLP, LCRC), a copy is held in the Retry Buffer, and a Mux merges them with outgoing Ack/Nak DLLPs onto the Link. Rx side: a De-mux separates incoming Ack/Nak DLLPs from Link Packets, which go through an Error Check before the TLP is passed to the Transaction Layer.]
Ack/Nak Protocol, Non-Posted 76 78

 Receiver performs Link-level data integrity check on every TLP transmission
 Returns Ack if no error, or Nak if error is detected

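The Retry Buffer behavior can be illustrated with a small model. This is a sketch under assumptions, not the hardware algorithm: the class and method names are invented for illustration, sequence numbers use the 12-bit space mentioned earlier, and a Nak is assumed to carry the sequence number of the last TLP received correctly, so everything after it is replayed.

```python
from collections import deque

SEQ_MODULO = 4096  # 12-bit sequence number space


class RetryBuffer:
    """Toy transmitter-side replay buffer."""

    def __init__(self):
        self.next_seq = 0
        self.pending = deque()  # (seq, tlp) pairs in transmit order

    def send(self, tlp):
        """Assign a sequence number and keep a copy until acknowledged."""
        seq = self.next_seq
        self.pending.append((seq, tlp))
        self.next_seq = (seq + 1) % SEQ_MODULO
        return seq

    def _purge_through(self, seq):
        # Drop every stored TLP up to and including 'seq', if it is present.
        if not any(s == seq for s, _ in self.pending):
            return
        while self.pending:
            s, _ = self.pending.popleft()
            if s == seq:
                break

    def ack(self, seq):
        """Ack: the neighbor received everything through 'seq'."""
        self._purge_through(seq)

    def nak(self, seq):
        """Nak: purge through 'seq', return the TLPs that must be replayed."""
        self._purge_through(seq)
        return [tlp for _, tlp in self.pending]


buf = RetryBuffer()
for name in ("A", "B", "C"):
    buf.send(name)
buf.ack(0)              # A is released
replayed = buf.nak(1)   # B was good, C must be sent again
```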
Ack/Nak Protocol, Posted 79

Other Data Link Layer Functions 80

Flow Control Logic Initialization
Initialize flow control for default VC0 automatically
Initialize flow control for other channels as they are enabled
DLLPs used in initialization process:
InitFC1-P, InitFC1-NP, InitFC1-Cpl, InitFC2-P, InitFC2-NP, InitFC2-Cpl
Physical Layer 77 81

TLP/DLLP at Physical Layer 78 82

TLP framing:
Field:  Start | Sequence | Header  | Data Payload | ECRC | LCRC | End
Size:   1B    | 2B       | 3-4 DW  | 0-1024 DW    | 1DW  | 1DW  | 1B

DLLP framing:
Field:  Start | DLLP (Type, Misc.) | CRC | End
Size:   1B    | 1DW                | 2B  | 1B

 Start Symbols
 STP (start TLP)
 SDP (start DLLP)
 End Symbols
 END (end good for TLPs and end for DLLPs)
 EDB (end bad for TLPs only)
Electrical Physical Layer 80 83

[Figure: One Lane in one direction. The Transmitter, with a Detect circuit, drives D+ and D- through coupling capacitors CTX and transmission lines ZTX-LINE to the Receiver. Terminations ZTX at the Transmitter and ZRX at the Receiver reference the common-mode voltage VCM. Each side has its own Clock Source.]

VTX-CM = 0 - 3.6 V
ZTX = ZRX = 50 Ohms
CTX = 75 – 200 nF

Transmitter and receiver are AC coupled
Receiver common-mode voltage is set to 0V
Transmitter common-mode between 0V and 3.6V
Ordered Set Origin and Destination 81 84

Ordered Set Structure 81 85

COM Identifier Identifier Identifier

Link Initialization Tasks 79 86

 Detect receiver
 Bit lock per Lane
 Symbol lock per Lane
 Polarity inversion
 Link numbering
 Link width and Lane numbering
 Lane reversal (optional)
 Lane-to-Lane de-skew on multi-Lane Links
 Link data rate determination and negotiation

Additional Physical Layer Features 87

 Link power management logic


 Power state: L0, L0s, L1, L2, L3
 Reset logic
 Cold / warm reset
 Hot reset
 Function-level Reset
 Hot-plug control and status logic
 Support not mandatory, but must comply with PCI
hot-plug usage model if used

Example: Memory Read Request 97 88

[Figure: A Memory Read Request travels from Requester to Completer through each layer. Software layer: Send / Receive Memory Read Request. Transaction layer: TLP (Header, ECRC), Flow Control, Virtual Channel Management, Ordering, and Transmit / Receive Buffers per VC. Data Link layer: Link Packet (Sequence, TLP, LCRC), Retry Buffer at the Requester, Error Check at the Completer, Ack/Nak DLLP with CRC. Physical layer: Physical Packet (Start, Link Packet, End); Encode, Parallel-to-Serial and Differential Driver at the Requester; Differential Receiver, Serial-to-Parallel and Decode at the Completer. The MRd TLP crosses the Link from Port to Port and an Ack or Nak returns.]
Associated Completion 99 89

[Figure: The associated Completion travels from Completer back to Requester through each layer. Software layer: Send / Receive Completion with Data. Transaction layer: TLP (Header, Data Payload, ECRC), Flow Control, Virtual Channel Management, Ordering, and Transmit / Receive Buffers per VC. Data Link layer: Link Packet (Sequence, TLP, LCRC), Retry Buffer at the Completer, Error Check at the Requester, Ack/Nak DLLP with CRC. Physical layer: Physical Packet (Start, Link Packet, End); Encode, Parallel-to-Serial and Differential Driver at the Completer; Differential Receiver, Serial-to-Parallel and Decode at the Requester. The CplD TLP crosses the Link from Port to Port and an Ack or Nak returns.]
PCI Express Fabric Efficiency 90

 Factors that reduce efficiency:


 8b/10b overhead for Gen1 and Gen2 (20%)
 Header (3-4 DW) on all TLPs plus sequence
number, ECRC, LCRC, Start and End Symbols
 Split transaction protocol overhead
 DLLPs for Ack/Nak and Flow Control
 Factors that improve efficiency:
 Back-to-back packet transmission
 No arbitration overhead
 Switch cut-through mode

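The per-packet part of this overhead can be estimated from the field sizes given on the "TLP/DLLP at Physical Layer" slide. The sketch below is a rough estimate under assumptions: the function name is illustrative, it counts only framing (Start and End Symbols, as used with 8b/10b), sequence number, header, optional ECRC and LCRC, and it ignores encoding loss, DLLP traffic and completions.

```python
DW = 4  # bytes per doubleword


def tlp_efficiency(payload_bytes, header_dw=3, ecrc=False):
    """Fraction of bytes on the Link that are payload for one TLP."""
    if header_dw not in (3, 4):
        raise ValueError("header is 3 or 4 DWs")
    if not 0 <= payload_bytes <= 1024 * DW:
        raise ValueError("payload is 0 to 1024 DWs")
    overhead = (
        1                      # Start Symbol
        + 2                    # Sequence number
        + header_dw * DW       # Header
        + (DW if ecrc else 0)  # optional ECRC
        + DW                   # LCRC
        + 1                    # End Symbol
    )
    return payload_bytes / (payload_bytes + overhead)


if __name__ == "__main__":
    for size in (4, 64, 256, 4096):
        print(f"{size:5d} B payload: {tlp_efficiency(size):.1%}")
```

Larger payloads amortize the fixed overhead, which is why small transfers use the Link far less efficiently than large ones.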
Summary of Changes for Rev 2.0 91

 Higher speed (5.0 GT/s), supported by selectable de-emphasis levels
 Dynamic speed and Link width changes
 Power savings, flexible bandwidth, reliability
 Virtualization support
 Access Control Services
 Other New Features
 Completion timeout control
 Function-Level Reset (optional, strongly
recommended)
 Modified Compliance Pattern for testing

Summary of Changes for Rev 3.0 92

 Higher speed (8.0 GT/s), supported by Specialized Equalization
Mandatory Tx Equalizer
Optional Rx Equalization choices
Equalization training sequence
128b/130b encoding to reduce overhead
Block encoding
Data streams

Configuration Overview

PCIe Configuration 94

 CPUs can directly access Memory and IO space, but Configuration must be indirectly addressed, requiring logic to translate CPU commands into Configuration commands.
 Legacy access: indirect through IO addresses
 Enhanced access: indirect through Memory addresses
[Figure: The CPU issues a Memory command to the Host Bridge, which converts it into a Config command targeting the Configuration registers local within devices.]
Legacy PCI Method 95

Configuration Address Port (I/O Address CF8h-CFBh):

Bits    Field
31      Enable Configuration Space Mapping (1 = enabled)
30:24   Reserved
23:16   Bus Number
15:11   Device Number
10:8    Function Number
7:2     Doubleword
1:0     00 (should always be zeros)

Configuration Data Port (I/O Address CFCh-CFFh)

 Processor reads and writes to I/O addresses 0CF8h and 0CFCh are converted to configuration reads and writes by the Root Complex.
 Advantage: Uses very little address space.
 Disadvantage: Requires 2 address steps; allows multiple threads to interfere with each other.
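The value written to the Configuration Address Port follows directly from the bit fields on this slide. The function below is a minimal sketch with an illustrative name; it only builds the 32-bit value and does not perform any port I/O, which would require privileged access on a real system.

```python
def config_address(bus, device, function, register):
    """Build the 32-bit value for the Configuration Address Port (0CF8h).

    'register' is a byte offset within the first 256 bytes of config
    space and must be doubleword aligned (bits 1:0 are always zero).
    """
    if not 0 <= bus <= 0xFF:
        raise ValueError("bus is 8 bits")
    if not 0 <= device <= 0x1F:
        raise ValueError("device is 5 bits")
    if not 0 <= function <= 0x7:
        raise ValueError("function is 3 bits")
    if not 0 <= register <= 0xFF or register & 0x3:
        raise ValueError("register must be a dword-aligned offset below 100h")
    enable = 1 << 31  # Enable Configuration Space Mapping
    return enable | (bus << 16) | (device << 11) | (function << 8) | register


if __name__ == "__main__":
    # Step 1 would write this value to 0CF8h; step 2 would access 0CFCh.
    print(hex(config_address(0, 31, 3, 0x04)))
```

The two separate steps are the source of the interference risk noted above: another thread can overwrite the address port between the write to 0CF8h and the access to 0CFCh.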
Enhanced (Memory Mapped) Method 96

 Memory access within a programmed range is translated into configuration cycle by the chipset
 Advantage: one-step access, no possible interference between tasks
 Disadvantage: large memory range dedicated for this purpose
 28-bit address mapped into system memory
 Bits A[63:28] are defined by a Base Address register – Memory cycles whose
upper bits match this base will generate configuration cycles

Memory address layout:

Memory Address    Configuration Space
A[63:28]          Base Address Register
A[27:20]          Bus Number [7:0]
A[19:15]          Device [4:0]
A[14:12]          Function [2:0]
A[11:8]           Extended Register [3:0]
A[7:0]            Register [7:0]

The Bus Number field occupies A[20+(n-1):20], where “n” is between 1 & 8, allowing up to 256MB to support 256 buses.
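With the enhanced method, locating a register is plain address arithmetic on the fields above. The sketch below assumes a full 256-bus window (n = 8); the function name and the example base address `0xE0000000` are hypothetical, since the real base comes from the Host Bridge's Base Address register.

```python
def ecam_address(base, bus, device, function, offset=0):
    """Memory address of a config register in the memory-mapped window."""
    if base & 0x0FFFFFFF:
        raise ValueError("base must be 256MB aligned for a 256-bus window")
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus/device/function out of range")
    if not 0 <= offset <= 0xFFF:
        raise ValueError("offset must fall within the 4KB per Function")
    return base | (bus << 20) | (device << 15) | (function << 12) | offset


# 256 buses * 32 devices * 8 functions * 4KB each = 256MB
WINDOW_BYTES = 256 * 32 * 8 * 4096

if __name__ == "__main__":
    example_base = 0xE0000000  # hypothetical value for illustration
    print(hex(ecam_address(example_base, 2, 0, 0, 0x100)))
```

Because the whole address is formed in one memory access, there is no shared address port for two tasks to fight over.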
Host Bridge Programmed With Base Address 97

[Figure: Core i5 system (two cores, L3 slices / LLC, Processor Graphics, System Agent with Host Bridge and IMC, PCH attached via DMI2 and FDI) next to its Memory Map. PCIEXBAR in the Host Bridge holds the Base Address of a 256MB Memory Mapped PCI Configuration Space block. Each Function occupies a 4KB region, from Bus 0, Device 0, Function 0 (0,0,0) at the base up to 255,31,7 at the top. Functions labeled in the figure include 0,0,0, 0,2,0, 0,4,0, 0,20,0, 0,22,0, 0,26,0, 0,28,0, 0,28,1, 0,29,0, 0,31,0, 0,31,2 and 0,31,3 on Bus 0, and 2,0,0 (Wi-Fi/Bluetooth) on Bus 2 below a PCIe Port.]
PCI Configuration Headers 98

Extended Configuration Space 99

Header Type 0 Registers 100

Identifying Registers 101

Image captured from MindShare’s Arbor program

Several registers identify the Function, including Vendor ID, Device ID, and
the Class Code information, shown here.

Capability Registers 102

Status bit 4 indicates whether capabilities are implemented. If so, Capabilities Pointer gives
location of first register block in the linked list.
Capability Structure IDs
00h = Reserved
01h = Power Management
02h = AGP
03h = VPD
04h = Slot Identification
05h = MSI
06h = CompactPCI Hot Swap
07h = PCI-X Device
08h = HyperTransport
09h = Vendor Specific
0Ah = Debug Port
0Bh = CompactPCI Central Resource Control
0Ch = PCI Hot Plug
10h = PCI Express
11h = MSI-X
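The linked-list walk described above can be sketched as follows. This is a sketch under stated assumptions: it operates on a byte array standing in for the first 256 bytes of a Function's configuration space, and the capability offsets (40h, 50h, 70h) in the sample data are invented for illustration. The Status register at offset 06h, the Capabilities Pointer at 34h, and the ID/next-pointer byte layout follow the standard header.

```python
CAP_NAMES = {0x01: "Power Management", 0x05: "MSI", 0x10: "PCI Express", 0x11: "MSI-X"}


def walk_capabilities(cfg: bytes) -> list:
    """Return (offset, capability ID) pairs found in the capability list."""
    status = cfg[0x06] | (cfg[0x07] << 8)   # Status register, offset 06h
    if not status & (1 << 4):               # bit 4: capabilities implemented
        return []
    found = []
    seen = set()                            # guards against a looping list
    ptr = cfg[0x34] & 0xFC                  # Capabilities Pointer, DW-aligned
    while ptr and ptr not in seen:
        seen.add(ptr)
        found.append((ptr, cfg[ptr]))       # byte 0 of each block: Capability ID
        ptr = cfg[ptr + 1] & 0xFC           # byte 1: pointer to next block (0 = end)
    return found


# Hypothetical configuration space with three capability blocks
cfg = bytearray(256)
cfg[0x06] = 0x10                            # Status bit 4 set
cfg[0x34] = 0x40
cfg[0x40:0x42] = bytes([0x01, 0x50])        # Power Management -> next at 50h
cfg[0x50:0x52] = bytes([0x05, 0x70])        # MSI -> next at 70h
cfg[0x70:0x72] = bytes([0x10, 0x00])        # PCI Express -> end of list
for offset, cap_id in walk_capabilities(cfg):
    print(f"{offset:#04x}: {CAP_NAMES.get(cap_id, 'other')}")
```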

Capability Structures 103

First linked list of capability registers

Indicates a PCIe device and means extended registers may be present.

Optional linked list of extended capability registers

Endpoint Checks Address 104

Endpoints compare the address in received packets against the
addresses programmed in their Base Address Registers (BARs) to verify
they’re the intended target.

Base Address Registers 105

Device designer sets BAR values based on device requirements:
Memory BARs: 32- or 64-bit decode
I/O BARs: 16- or 32-bit decode

Designer fixes read-only portion of Base Address field to indicate
type and size of memory to be mapped.

Memory BAR Fields 106

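32-bit Base Address

Bits     Field           Meaning
31:4     Base Address    Base address of the mapped range
3        P               Prefetchable Indicator
2:1      Type            00b = 4GB Range
0        0               Memory Indicator

64-bit Base Address

Bits     Field           Meaning
63:32    Base Address    Upper 32 bits
31:4     Base Address    Lower 32 bits
3        P               Prefetchable Indicator
2:1      Type            10b = 16EB Range
0        0               Memory Indicator
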
Note: it is required that PCIe endpoints other than Legacy endpoints
support 64-bit addresses for any prefetchable memory. And it is strongly
encouraged that memory be designated as prefetchable whenever possible.
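As a rough illustration of the Memory BAR fields, the sketch below decodes a BAR value into base address, width, and prefetchable flag. The function name and sample register values are hypothetical; Type encodings other than the two shown on the slide are treated as reserved here.

```python
def decode_memory_bar(low: int, high: int = 0) -> dict:
    """Decode a Memory BAR; `high` is the next DW, used only for 64-bit BARs."""
    if low & 0x1:
        raise ValueError("bit 0 = 1 marks an I/O BAR, not a Memory BAR")
    bar_type = (low >> 1) & 0x3             # bits 2:1
    if bar_type not in (0b00, 0b10):
        raise ValueError("Type encoding not covered by this sketch")
    is_64bit = bar_type == 0b10
    base = low & 0xFFFF_FFF0                # bits 31:4
    if is_64bit:
        base |= high << 32                  # upper 32 bits come from the next BAR
    return {"base": base, "is_64bit": is_64bit, "prefetchable": bool(low & 0x8)}


# Hypothetical 64-bit prefetchable BAR located above 4GB
print(decode_memory_bar(0xF000_000C, 0x0000_0001))
```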

Example: 64-bit BAR 107

Image captured with MindShare’s Arbor program

Topology View at Startup 105 108

PCI Express Enumeration Example 109

[Figure: example topology before enumeration. The CPU connects to a Root Complex whose Host/PCI Bridge creates Bus 0. Bus 0 holds three Root Ports (Dev 0 Fun 0, Dev 0 Fun 1, Dev 1 Fun 0), each a virtual P2P bridge whose Secondary (Sec) and Subordinate (Sub) bus numbers are not yet assigned. Below the Root Ports are: a Switch (one upstream virtual P2P bridge and three downstream virtual P2P bridges, Dev 0 through Dev 2) leading to three PCIe Endpoints, one of which has Fun 0 and Fun 1; a PCIe Endpoint; and a PCIe-to-PCI bridge leading to a PCI(-X) bus with IDSEL0 and IDSEL1.]

Timing of Initial Access 110

 One-second Trhfa (Time from reset high to first access)


 Configuration software can only conclude error if
first configuration access does not return
successful completion after 1s from reset exit
 System must wait 100 ms after reset exit
before CPU can initiate first configuration
access
 CRS (Configuration Request Retry Status) may be returned by a PCIe
device that is not yet ready to respond

PCI Express Enumeration Example 111

Downstream ports (from a switch or root) will only allow access to
device zero on their secondary bus. The upstream port of the device
isn’t allowed to assume that, however, and must capture the device
number from an incoming type 0 config write cycle.

[Figure: enumeration in progress. Root Port Dev 0 Fun 0: Sec 1, Sub 255. Switch upstream virtual P2P: Sec 2, Sub 255. First Switch downstream port (Dev 0 Fun 0): Sec 3, Sub 255. A multi-function device (Dev 0, Fun 0 and Fun 1) is discovered on Bus 3.]

PCI Express Enumeration Example 112

[Figure: first branch enumerated. Root Port Dev 0 Fun 0: Sec 1, Sub 5. Switch upstream virtual P2P: Sec 2, Sub 5. Switch downstream ports Dev 0, Dev 1, Dev 2: Sec 3, 4, 5 and Sub 3, 4, 5. Each downstream port leads to a PCIe Endpoint at Dev 0; the Endpoint on Bus 3 has Fun 0 and Fun 1.]

PCI Express Enumeration Example 113

[Figure: enumeration complete. Root Ports on Bus 0: Dev 0 Fun 0 (Sec 1, Sub 5), Dev 0 Fun 1 (Sec 6, Sub 6), Dev 1 Fun 0 (Sec 7, Sub 8). Switch upstream virtual P2P: Sec 2, Sub 5; Switch downstream ports Dev 0, Dev 1, Dev 2: Sec/Sub 3/3, 4/4, 5/5. PCIe-to-PCI bridge: Sec 8, Sub 8, leading to a PCI(-X) bus with Dev0 and Dev1 (IDSEL0, IDSEL1).]

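The bus numbers in this enumeration example are consistent with a depth-first walk. The sketch below is a simplified model, not the actual firmware algorithm: the `Bridge` class and the topology literals are invented for illustration, and discovery of devices through configuration reads is replaced by a pre-built tree.

```python
from dataclasses import dataclass, field


@dataclass
class Bridge:
    name: str
    children: list = field(default_factory=list)  # bridges on the secondary bus
    secondary: int = 0
    subordinate: int = 0


def enumerate_bridges(bridges: list, next_bus: int) -> int:
    """Assign Sec/Sub numbers depth-first; return the next unused bus number."""
    for bridge in bridges:
        bridge.secondary = next_bus
        bridge.subordinate = 255            # open the range while scanning below
        next_bus = enumerate_bridges(bridge.children, next_bus + 1)
        bridge.subordinate = next_bus - 1   # highest bus found below this bridge
    return next_bus


# Topology modeled on the example slides
switch_ports = [Bridge("switch dn 0"), Bridge("switch dn 1"), Bridge("switch dn 2")]
switch_up = Bridge("switch up", switch_ports)
pci_bridge = Bridge("PCIe-to-PCI")
root_ports = [
    Bridge("root 0,0,0", [switch_up]),
    Bridge("root 0,0,1"),
    Bridge("root 0,1,0", [pci_bridge]),
]
enumerate_bridges(root_ports, next_bus=1)   # Bus 0 belongs to the Host/PCI Bridge
for b in root_ports + [switch_up] + switch_ports + [pci_bridge]:
    print(f"{b.name}: Sec {b.secondary}, Sub {b.subordinate}")
```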
PCI Express Enumeration Example 114

Software scans each bus to find additional devices that may be
attached. If no device is present:

Transactions time out in PCI, resulting in a Master Abort. Upon
detecting the abort, the source bridge returns data of all ones to the
CPU.

If a transaction targets an Endpoint using a device number other than
zero, a Root Complex or Switch port returns a UR completion with data
of all ones.

These actions are based on default settings of the Bridge Control
register within the configuration header.

[Figure: the fully enumerated topology. Root Ports Sec/Sub 1/5, 6/6, 7/8; Switch upstream 2/5 with downstream ports 3/3, 4/4, 5/5; PCIe-to-PCI bridge 8/8 with Dev0 and Dev1 on the PCI(-X) bus.]

PCI Express Enumeration Example 115

If software attempts an access to Device 2 on the PCI bus, what
actions will be taken by the PCIe-to-PCI bridge?

[Figure: the fully enumerated topology. Root Ports Sec/Sub 1/5, 6/6, 7/8; Switch upstream 2/5 with downstream ports 3/3, 4/4, 5/5; PCIe-to-PCI bridge 8/8 with Dev0 and Dev1 on the PCI(-X) bus.]

PCI Express Enumeration Example 116

An access to bus 5 will cause Type 1 config cycles until it finally
reaches the target bus and changes to Type 0.

[Figure: the fully enumerated topology. The request is marked Type 1 on Bus 0, Type 1 below the Root Port (Sec 1, Sub 5), Type 1 inside the Switch (upstream Sec 2, Sub 5), and Type 0 when the downstream port with Sec 5 delivers it to the Target Device.]

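The Type 1 to Type 0 conversion can be sketched as a check each bridge makes against its Secondary and Subordinate bus numbers. This is a simplified model that follows a single path of bridges rather than a full tree; the function name and tuple layout are assumptions for illustration.

```python
def route_config(path: list, target_bus: int) -> list:
    """path: (name, secondary, subordinate) for each bridge, listed top-down.

    Returns what each bridge sends out on its secondary side.
    """
    hops = []
    for name, secondary, subordinate in path:
        if target_bus == secondary:
            hops.append((name, "Type 0"))       # target bus is directly below
            return hops
        if secondary < target_bus <= subordinate:
            hops.append((name, "Type 1"))       # deeper in range: pass along
            continue
        hops.append((name, "not forwarded"))    # outside this bridge's range
        return hops
    return hops


# Path toward Bus 5 in the example topology
path_to_bus5 = [
    ("root port 0,0,0", 1, 5),
    ("switch upstream", 2, 5),
    ("switch downstream Dev 2", 5, 5),
]
for name, action in route_config(path_to_bus5, target_bus=5):
    print(f"{name}: {action}")
```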
Type 0 Config Read/Write Request 100 117

Type 1 Config Read/Write Request 101 118

Enumeration and Hot-Plug Considerations 119

 After Hot-Plug event


 Quiesce Functions by turning off their Bus Master Enable
configuration bit
 Ensure all outstanding transactions have completed by
checking Transactions Pending bit
 Initialize configuration space of changed device
 Re-enable quiesced devices
 If a bridge device was added, bus numbers might
need to be changed.
 Difficult: there are many places where old numbers might be
stored and all will need to be updated.
 Simple solution: leave bus number gaps between
assignments made to different bridges so buses won’t ever
have to be re-numbered.

Single-Root System Enumeration 113 120

Multi-Root System Enumeration 116 121

MindShare Arbor Lab: Scan Your System 122

Click Tab: Local System

 Demo MindShare Arbor features


 Click on Local System tab
 Draw and analyze the topology of your system

Address Space and Transaction Routing

Memory and IO Space Address Maps 125 124

BARs in Configuration Header Space 127 125

Use of Type 0 and Type 1 Header 128 126

32-Bit Non-Prefetch Mem BAR Setup 130 127

64-Bit Prefetch Mem BAR Setup 132 128

IO BAR Setup 134 129

Resizable BARs – Motivation 135 130

 A problem arises when system DRAM and Function memory resources
require more addressable space than the platform can provide.
This may result in:
 Reduced space for system memory
 Function memory not being allocated, or allocated
with a sub-optimal size
 Solution: new registers report several possible
memory sizes that will work
 Only devices requesting large memory resources are
likely to use this

Resizable BARs – Mechanism 135 131

 Software learns which BAR sizes are available by reading the new
extended capability register
 Software chooses optimal memory size for
current platform conditions and programs the
BAR size
 Hardware will then report the programmed
BAR size when enumeration software queries
the configuration header

Example Topology For Base/Limit Setup 137 132

Example Prefetch Mem Base/Limit Setup 138 133

Example Non-Prefetch Mem Base/Limit Setup 140 134

Example IO Base/Limit Setup 142 135

Example Final Register Setup 145 136

Switch Routing 146 137

Non-Posted (Split) Transaction Routing 149 138

Link Traffic Routed Through Fabric 139

Traffic types on each Link:


Transaction Layer Packets (TLPs)
Data Link Layer Packets (DLLPs)
Ordered Sets

 Note: only TLPs are routed. DLLPs and Ordered Sets are never
routed to another Link because they’re only used to manage the
local Link.

Four Transaction Types 140

 Four basic transaction types:


 Memory
 I/O
 Configuration
 Message
 Switches are not permitted to split a large
packet into multiple smaller ones, but a Root
Complex that supports peer-to-peer traffic is
allowed to split them according to packet
format rules.

Three Methods of Packet Routing 147 141

Every TLP is routed based on one of these:


 Address Routing
 ID Routing
 Implicit Routing

Packet Type                        Routing Method

MRd, MRdLk, MWr                    Address Routing
IORd, IOWr                         Address Routing
CfgRd0, CfgWr0, CfgRd1, CfgWr1     ID Routing
Cpl, CplD, CplLk                   ID Routing
Msg, MsgD                          Implicit, Address or ID
FetchAdd, Swap, CAS                Address Routing
Generic TLP Header Field (3 DW /4 DW) 152 142

Method 1: ID Routing 156 143

Method 1: ID Routing; Switch Check 158 144

Method 2: Address Routing 159 145

Method 2: Address Routing; Switch Check 162 146

Method 2: Address Routing; Endpoint Check 161 147

Method 3: Implicit Routing 164 148

Message TLP

Routing Information:
000b = Implicit: Route to Root Complex
001b = Route by Address (Uses Address fields)
010b = Route by ID (Uses destination Bus, Device, and Function ID field)
011b = Implicit: Broadcast by Root Complex
100b = Implicit: Local—Terminate at Receiver
101b = Implicit: Gather and route to Root Complex
All others = Reserved
Method 3: Implicit Routing 149

 In implicit routing, a device verifies that it is the intended
recipient based on the routing type
 Examples: Root Complex sees message with routing code
000b, or Endpoint sees message routing code 011b.

Routing
011b

MindShare Arbor Lab: Address Map 150

Open File: address_map_lab.arbsys

 Software has incorrectly configured an address map setting (either
BAR or Base/Limit pair) in one of the devices downstream of Root
Port 0:28:0. Follow the instructions below:
1. Debugging this type of error is always easier with a picture
of the topology. In the space below (or on back), draw the
topology of devices downstream from 0:28:0.

2. Using the drawn topology, label the address ranges that have
been assigned to each device and check to make sure that
the bridges above those devices have been set up correctly.

3. Once you find the problem, indicate what the correct setting
should be and confirm your answer with the instructor.

Part Two:
Transaction Layer

TLP Elements

Packet-Based Transactions 169 155 153

 Two packet types are defined:


 High-level TLPs that comprise transactions
associated with data transport (focus of this
chapter)
 Low-level DLLPs and Ordered-Sets for Link
management services (discussed later)

TLPs and DLLPs 170 154

Why Use Packet-Based Protocol? 171 155

 Unlike parallel buses, whose control signals make it easy to see
what’s happening on each clock edge, a serial bus may consist of
only one differential signal in each direction
 The string of 1’s and 0’s sent over this path only makes sense at
the receiver if it is seen in the proper context.
 Packets supply this context by giving a
predictable structure, including start of
packet, header, data, error checking, and end
of packet
 Parts of the packet are constructed in each of
the three PCIe layers
Posted and Non-Posted Transactions 156

 Posted transactions consist of a single TLP sent to the completer
 Non-Posted transactions are split: a Request TLP will
be answered later by one or more Completion TLPs

Transaction Type Non-Posted or Posted


Memory Read Non-posted
Memory Write Posted
Memory Read Lock Non-posted
IO Read Non-posted
IO Write Non-posted
Configuration Read (Type 0 and 1) Non-posted
Configuration Write (Type 0 and 1) Non-posted
Message Posted
AtomicOp Non-posted
TLP Assembly and Disassembly 173 157

Generic TLP And Its Header Format 175 158

Format Field 176 159

 Fmt[2:0] Field (3 bits)

Bit      Meaning         Values
Bit 7    TLP Prefix      0 = No TLP Prefix, 1 = TLP Prefix
Bit 6    Data Present    0 = No Data (Read), 1 = Data (Write)
Bit 5    Header Size     0 = 3 DWs, 1 = 4 DWs

 Format and Type fields together define the transaction type. E.g.
 Memory Request with Data payload is Memory Write Request (MWr)
 Memory Request without data payload is Memory Read Request (MRd)

Type Field 176 160

 Type[4:0] Field (5 bits)


TLP Type TYPE[4:0]
Byte 0, Bit 4:0
Memory Request 00000
Memory Read Lock Request 00001
IO Request 00010
Configuration Type 0 Request 00100
Configuration Type 1 Request 00101
Message Request 10rrr
Completion 01010
Completion Lock 01011
Fetch and Add AtomicOp Request 01100
Unconditional Swap AtomicOp Request 01101
Compare and Swap AtomicOp Request 01110
Local TLP Prefix (Fmt[2:0] = 100) 0L3L2L1L0
End-to-End TLP Prefix (Fmt[2:0] = 100) 1E3E2E1E0
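To show how the Fmt and Type fields combine, here is a rough decoder for Byte 0 of a TLP header, built from the two tables in these slides. The function and dictionary names are invented for the example; TLP Prefixes are only detected, not decoded further.

```python
TYPE_NAMES = {
    0b00000: "Memory Request",
    0b00001: "Memory Read Lock Request",
    0b00010: "IO Request",
    0b00100: "Configuration Type 0 Request",
    0b00101: "Configuration Type 1 Request",
    0b01010: "Completion",
    0b01011: "Completion Lock",
    0b01100: "Fetch and Add AtomicOp Request",
    0b01101: "Unconditional Swap AtomicOp Request",
    0b01110: "Compare and Swap AtomicOp Request",
}


def decode_byte0(byte0: int) -> dict:
    """Split Byte 0 into Fmt[2:0] (bits 7:5) and Type[4:0] (bits 4:0)."""
    fmt = (byte0 >> 5) & 0x7
    tlp_type = byte0 & 0x1F
    if fmt & 0b100:                          # bit 7 set: this DW is a TLP Prefix
        return {"name": "TLP Prefix", "has_data": False, "header_dws": None}
    if tlp_type >> 3 == 0b10:                # 10rrr: Message, rrr = routing
        name = "Message Request"
    else:
        name = TYPE_NAMES.get(tlp_type, "Reserved")
    return {
        "name": name,
        "has_data": bool(fmt & 0b010),       # bit 6: data payload present
        "header_dws": 4 if fmt & 0b001 else 3,  # bit 5: header size
    }


print(decode_byte0(0x40))   # Fmt 010, Type 00000: memory write, 3DW header
print(decode_byte0(0x04))   # Fmt 000, Type 00100: CfgRd0
```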

Traffic Class Field 176 161

 TC[2:0] Field (3 bits)

000b = Traffic Class 0 (default)
001b = Traffic Class 1
...
111b = Traffic Class 7

Only Memory Read/Write Requests and their associated Completions can
use the non-default TCs (001b through 111b)

TLP Processing Hint Field 176 162

 TH Field TLP Processing Hints


 If = 1, TLP hints are included to give the system an idea of how
best to handle this TLP
 If = 0, No TLP hint included

TLP Digest Field 177 163

 TD Field TLP Digest (ECRC) present


 If = 1, Digest is included in TLP
 If = 0, No Digest in TLP

Poisoned TLP Data Field 177 164

 EP Field Poisoned TLP data


 If = 1, TLP data is poisoned or known to be corrupt
 If = 0, TLP data is valid.
 Examples:
 ECC/Parity error on memory or internal buffer
 Data may become corrupted while passing through a Switch
 This feature is referred to as Error Forwarding.
 Support for this feature is optional.

ECRC Generation 180 165

 ECRC generation is based on TLP Header and Data Payload
 EP bit and bit 0 of Type field can legally be changed by a Switch
while the packet is en route
 Switch may change EP because packet is
poisoned en route
 Switch may change bit 0 of Type field to convert
Type 1 config. cycle to Type 0
 To account for this, EP bit and bit 0 of Type
field are assumed = 1 in ECRC generation
and checking

Attributes Field 177 166

 Attr[2:0] Field (3 bits)


 Byte 1, Bit 2 (ID-based Ordering)
 If = 1, ID-based Ordering is to be used when routing this TLP
 If = 0, No ID-based Ordering to be used when routing this TLP
 Byte 2, Bit 5 (Relaxed Ordering)
 If = 1, PCI-X Relaxed Ordering Model
 If = 0, PCI Strong Ordering Model
 Byte 2, Bit 4 (No Snoop)
 If = 1, No snoop required
 If = 0, Snoop required

Address Type Header 178 167

 AT[1:0] Field (2 bits) - Indicates condition of memory address for
Memory and AtomicOp Requests:
00 - Default
01 - Address Translation Request
10 - Address is Translated
11 - Reserved

Memory Request Header 178 168

Length[9:0] Field (10 bits) - Transfer length in DWs

00 0000 0001b = 1DW

11 1111 1111b = 1023 DW


00 0000 0000b = 1024 DW (4096 bytes)

The address and length together must not cross a 4KB address boundary
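The Length encoding and the 4KB rule can be sketched in a few lines. The helper names are made up for this example, and the boundary check assumes a DW-aligned byte address and ignores byte enables.

```python
def encode_length(dws: int) -> int:
    """Encode a transfer length in DWs into the 10-bit Length field."""
    if not 1 <= dws <= 1024:
        raise ValueError("length must be 1..1024 DWs")
    return dws & 0x3FF                      # 1024 DWs wraps to all zeros


def decode_length(length_field: int) -> int:
    """Decode the 10-bit Length field back into a DW count."""
    return (length_field & 0x3FF) or 1024   # all zeros means 1024 DWs


def crosses_4kb(address: int, dws: int) -> bool:
    """True if the transfer starting at `address` would cross a 4KB boundary."""
    last_byte = address + dws * 4 - 1
    return (address >> 12) != (last_byte >> 12)


print(encode_length(1024), decode_length(0))
print(crosses_4kb(0x1000, 1024), crosses_4kb(0x1FFC, 2))
```

A full 4096-byte transfer is therefore legal only when it starts exactly on a 4KB boundary.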

Byte Enables Field 181 169

 Last DW BE[3:0] Field


 These four bits qualify bytes 0-3 in the Last DW transferred. If
all bytes in the last DW are valid, all four bits of this field would
be = 1.
 1st DW BE[3:0] Field
 These four bits qualify bytes 0-3 in the First DW transferred. If
all bytes in the First DW are valid, all four bits of this field would
be = 1.

First and Last DW BE Fields 182 170

Transaction Descriptor Fields 183 171

These highlighted fields combined are called the Transaction Descriptor

Requester ID Field 183 172

 Requester ID Field: consists of the Bus, Device, and Function
numbers of the Requesting device
 Used for routing the Completion TLP back to the
Requester

Tag Field 183 173

Tag Field: Assigned by Requester to uniquely identify each
outstanding request, and associate returned completions with their
original request.
 5-bit Tag used by default (32 Tag values)
 8-bit Extended Tag (optional, 256 Tag values)
 11-bit Tag with Phantom Functions (optional, up to 2048 Tag
values) -- redefines Function bits of the Requester ID (3 LSB’s) to
extend the Tag in devices that don’t use all the function numbers

Extending the Tag Field 174

Default Tag Size 175

 In previous versions of the spec, the default size of the Tag field
has been 5 bits, allowing 32 split transactions in progress at once
per Function.
 Software could change the tag size to 8 bits by
enabling the Extended Tag Field.
 Beginning with rev 2.1, the default value of the
Extended Tag Field Enable bit is implementation
specific, rather than fixed at zero, so some
devices may default to an 8-bit tag field.

Specific TLP Format: IO Request TLPs 185 176

Specific TLP Format: Memory Request TLPs 188 177

 Memory Read (Non-Posted)


 Memory Read Request, followed by Completion with
data or Completion without data (error status) or No
completion (error)
 Memory Read Lock (Optional, not covered here)
 Memory Write (Posted)
 Memory Write Request. No Completion.

Specific TLP Format: Memory Request TLPs 188 178

Specific TLP Format: Memory Write Request 179

 Use either a 3DW (32-bit address) or 4DW (64-bit address) Request
header.
 Contain data payload immediately following the
Request header. Payload sizes:
 4 bytes (min)
 128 bytes (default)
 4KB (max)
 Use the strongly-ordered write model
 Routed to target device based on memory address.

Notes on Payload Size 180

[Figure: topology whose Links support different Max Payload sizes: 4KB, 4KB, 1KB, and 128B.]

 Given Max Payloads as shown, what payload size can be used?


 In 1.0a spec, Links that could communicate with one another were limited by the
smallest common payload, meaning all of the Links shown would have to use a 128B
max payload.
 Beginning with the 1.1 spec, the RC was allowed to break up large packets into
smaller ones, allowing different payload sizes between RC ports. Switches are not
allowed to do this, however, so all switch ports would still be limited to 128B in
this example.
Memory Read Request TLPs 181

 Use a Read Request size of:


 4 bytes (min)
 512 bytes (default)
 4096 bytes (max)
 Routed to the target device based on memory
address

Max Payload & Read Request Size 182

Max size ranges from 128 bytes to 4KB (binary-weighted values)

Configuration Request TLPs 193 183

Name      Fmt    Type      Description

CfgRd0    000    0 0100    Configuration Read Type 0 Request
CfgWr0    010    0 0100    Configuration Write Type 0 Request
CfgRd1    000    0 0101    Configuration Read Type 1 Request
CfgWr1    010    0 0101    Configuration Write Type 1 Request

Configuration Address Map 184

 PCI Express extends the 256-byte configuration space allocated to
each function to 4096 bytes.

First 256 bytes are compatible with PCI 2.3 configuration space and
contain the header plus any compatible capability structures.

Use of extended portion is implementation-specific and may contain
PCIe extended capability structures.

[Figure: 256 MB Configuration Map holding one 4096-byte block per Function, from Bus 0, Dev 0, Fn 0 through Bus 255, Dev 31, Fn 7. Within a block, the Header starts at byte 0, the PCI-compatible region ends at byte 255, and the extended region ends at byte 4095.]
Completion TLPs 197 185

Completions for Memory Read Requests 186

Completer returns requested data:


 A completion with data (CplD) follows a successful
memory read request.
 A completion without data (Cpl) follows an
unsuccessful read request
 All completions use a 3DW header and are routed
based on the Requester ID field in the header.

Multiple Completions 187

 Completer may need to send one or multiple Completion TLPs to
fulfill a non-posted request.
 For example, a request using the default Read Request
Size (512 bytes) and default max payload size (128 bytes)
would need to return at least four completions to fulfill the
request.
 Read Completion Boundary (RCB)
 Naturally-aligned address boundary for completions
 Root Complex software can select 128- or 64-byte RCB (Link
Control register).
 RCB for all other elements is 128 bytes. For them, data returned in
completions must end on 128-byte boundaries, except for the last
one. Software may set the Link Control register RCB value in these
elements so they know what the Root Complex is doing and can
optimize their behavior.
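The splitting rules above can be illustrated with a small model. This sketch picks the largest chunk allowed each time, which is only one of the legal ways a completer may split a read; the function name and tuple layout are assumptions. Each tuple also carries the Byte Count, meaning the bytes still owed including the current payload.

```python
def split_completions(address: int, length: int,
                      max_payload: int = 128, rcb: int = 128) -> list:
    """Return (start address, payload bytes, byte count) for each completion."""
    if max_payload < rcb:
        raise ValueError("this sketch assumes max_payload >= rcb")
    completions = []
    remaining = length
    while remaining:
        end = min(address + max_payload, address + remaining)
        if end < address + remaining:
            end -= end % rcb                # not the last one: end on an RCB boundary
        payload = end - address
        completions.append((address, payload, remaining))
        address += payload
        remaining -= payload
    return completions


# Default Read Request Size (512 bytes) with default Max Payload (128 bytes)
for start, payload, byte_count in split_completions(0x1000, 512):
    print(f"{start:#06x}: {payload} bytes, Byte Count {byte_count}")
```

With an unaligned start address, the first completion is shortened so that later ones end on RCB boundaries.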

Completion Characteristics 188

More Completion TLP Characteristics:


 A single transaction data transfer must not cross a
naturally-aligned 4KB boundary
 Some fields received in the request are returned
unchanged in the completion
 Completion TLPs are routed using ID (bus number)
 Completion Timeout values are defined

Completion Timeouts 189

 Prior to rev 2.0, PCIe defined a completion timeout interval from
50µs to 50ms, but times were not programmable.
 PCIe 2.0 defines new configuration register fields that
put the timeout values under software control. The
timeout ranges are separated into a default value and 4
time bins:
Default = 50µs to 50ms (same as earlier spec versions)
Range A = 50µs to 10ms
Range B = 10ms to 250ms
Range C = 250ms to 4s
Range D = 4s to 64s
 Timeout can be selected based on detected topology;
must be long enough to avoid mistakenly reporting an
error
 The timeout interval may also be completely disabled by
software
Completion Timeout Mechanism 190

 PCIe Functions that issue Requests requiring Completions must
implement Completion Timeouts, including the Root Complex,
Endpoints & Bridges
 Completion timeouts are set up and enabled via the Device
Control 2 register

Range      Encoding    Timeout

Default    0000b       50µs - 50ms
A          0001b       50µs - 100µs
A          0010b       1ms - 10ms
B          0101b       16ms - 55ms
B          0110b       65ms - 210ms
C          1001b       260ms - 900ms
C          1010b       1s - 3.5s
D          1101b       4s - 13s
D          1110b       17s - 64s

High-order bits select range
Completion TLP Fields 198 191

 Length[9:0] Field (10 bits)

Specifies data payload size associated with this completion TLP in DWs

Completion TLP Fields 198 192

 Completer ID Field (16 bits)

This is the ID of the device returning the completion (Bus, Device,
Function number)

Completion TLP Fields 200 193

Completion Status Code Field (3 bits)


 000b = Successful Completion (SC)
 001b = Unsupported Request (UR) – Master Abort in PCI
 010b = Configuration Request Retry Status (CRS)
 100b = Completer Abort (CA) – Target Abort in PCI
 Others = Reserved
 Status Code other than SC terminates any read transaction, regardless
of the amount of data returned

Configuration Request Retry Status (CRS) 200 194

 Completer is not yet ready to respond to configuration cycle
 Support for CRS is Root Complex specific
 Devices behind PCIe-to-PCI (-X) bridges do
not support CRS, so bridges:
 By default return a value of all 1s rather than CRS
for requests that time out behind the bridge
 May be enabled by software to send CRS instead
(by setting Bridge Configuration Retry Enable bit
in the Device Control register)

Completions – Byte Count 199 195

 Byte Count Field (12 bits)


 This is the byte count remaining that will satisfy the requested data
transfer (includes current payload)
For memory reads that are satisfied with a single completion, this field
reflects the size of the original request
For memory reads that require multiple completions, the Byte Count is
decremented with each completion to reflect the number of bytes left to
transfer (including the current data payload)

Completions – BCM 201 196

 BCM (Byte Count Modified) Field


 Set by PCI-X completers or bridges only, and allowed only during the
first completion in a multiple completion sequence.
If = 1, the Byte Count field has been modified and contains the count for
this completion only, not the total remaining
If = 0, the Byte Count field has not been modified and reflects the total
remaining number of bytes to transfer

Completions – ID 197

 Requester ID & Tag Field (24-bit Transaction ID)


 Bus number (Byte 0, bits 7:0)
 Device number (Byte 1, bit 7:3)
 Function number (Byte 1, bits 2:0)
 Tag (Byte 2, bits 7:0) – By default only lower 5 bits are
used, but this can be extended to 11 bits as we saw
earlier

Completions – Lower Address 200 198

Lower Address Field (7 bits)


 Contains the least significant 7 bits of the target address
specified in the read request
 The completer forms the content of this field from the
Request Header’s 5 least significant bits of the Address
field and 1st DW Byte Enable field
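A minimal sketch of that derivation follows. It assumes the usual rule that bits 1:0 are the offset of the first enabled byte in the 1st DW Byte Enable field (and 00b when no bytes are enabled); the function name is hypothetical.

```python
def lower_address(request_addr, first_dw_be):
    """Derive the 7-bit Lower Address for the first completion of a memory read.

    request_addr: byte address from the request (bits 1:0 are ignored)
    first_dw_be:  4-bit 1st DW Byte Enable field from the request
    """
    if first_dw_be == 0:
        low2 = 0  # no bytes enabled (zero-length read)
    else:
        # Index of the lowest set bit = offset of the first enabled byte
        low2 = (first_dw_be & -first_dw_be).bit_length() - 1
    return (request_addr & 0x7C) | low2  # address bits 6:2 plus derived bits 1:0


print(hex(lower_address(0x10000044, 0b1100)))  # 0x46
```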

Message TLPs 203 199

 Message transactions are Posted and consist of a Request only
 Message transactions may or may not have a data payload

Messages – Routing 204
203 200

Type field definition

 Byte 0, bits 4:3: 10b (Message TLP)
 Byte 0, bits 2:0: (Message Routing Sub-Field)
   000b = Routed to Root Complex (Implicit)
   001b = Routed by Address (Uses Address fields)
   010b = Routed by ID (Uses the target Bus/Device/Function ID in the header)
   011b = Broadcast by Root Complex (Implicit)
   100b = Local - Terminate at Receiver (Implicit)
   101b = Gather and route to Root Complex (Implicit)
   Others = Reserved
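The encoding above can be sketched as a small decoder. The function and table names are hypothetical; the input is assumed to be the 5-bit Type field (Byte 0, bits 4:0).

```python
ROUTING = {
    0b000: "Routed to Root Complex",
    0b001: "Routed by Address",
    0b010: "Routed by ID",
    0b011: "Broadcast by Root Complex",
    0b100: "Local - Terminate at Receiver",
    0b101: "Gather and route to Root Complex",
}


def decode_message_routing(type_field):
    """Return the routing description, or None if this is not a Message TLP."""
    if (type_field >> 3) & 0b11 != 0b10:
        return None  # bits 4:3 must be 10b for a Message
    return ROUTING.get(type_field & 0b111, "Reserved")


print(decode_message_routing(0b10011))  # Broadcast by Root Complex
```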

Messages – Codes 205 201

 Message Code Field (8 bits)


0000 0000b = Unlock Message
0001 0000b = Latency Tolerance Reporting Message
0001 0010b = Optimized Buffer Flush/Fill Message
0001 xxxxb = Power Management Message
0010 0xxxb = INTx Message
0011 00xxb = Error Message
0100 xxxxb = Ignored Message
0101 0000b = Set Slot Power Limit Message
0111 111xb = Vendor-defined Messages

Message Example 1: Vendor-Specific 211 202

Message Example 2: LTR 212 203

Message Example 3: OBFF 213 204

Flow Control

Purpose of Flow Control 215 206

 Devices implement credit-based Flow Control (FC) for each virtual
  channel on each port.
 Goals:
   Guarantee that a transmitter will never send a TLP if the receiver
    at the other end doesn’t have buffer space to take it.
   Prevent buffer overruns and eliminate inefficient transaction
    retries on the Link

Location of Flow Control Logic 217 207

Flow Control Buffer Organization 218 208

Credit Units 219 209

 One FC credit for data buffers = 4 DWs
 One FC credit for Headers = 1 max-sized header plus the optional digest
   5 DW for request headers (posted and non-posted)
   4 DW for completion headers
 Without sufficient credits a TLP type can’t be sent, though other
  types may be sent if they have enough credits
 For TLPs that include data (writes and completions with data), the
  transmitter must check credits for both header and data
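The credit arithmetic can be sketched as follows, using the 4 DW (16 byte) data credit unit stated above. The function name is hypothetical.

```python
DATA_CREDIT_BYTES = 16  # one data credit = 4 DW


def credits_required(payload_bytes):
    """Return (header_credits, data_credits) needed to send one TLP."""
    header_credits = 1  # every TLP consumes one header credit
    data_credits = -(-payload_bytes // DATA_CREDIT_BYTES)  # round up
    return header_credits, data_credits


print(credits_required(1024))  # (1, 64)
print(credits_required(0))     # (1, 0), e.g. a read request
```

A partially filled credit still counts as a whole credit, which is why the division rounds up.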

Flow Control Elements in DLL 228 210

 The Credits Allocated counter is hardware-initialized with the
  largest credit value supported by the receive buffer.

CL = Credit Limit
CR = Credits Required

INIT FC DLLPs Used During Initialization 224
228 211

FC Initialization in DLL 212

 The Credits Allocated counter value is sent across the Link as a
  FC DLLP and updates the Credit Limit counter
 FC_Init DLLPs are used during FC initialization, while FC_Update
  DLLPs are used later.

Phase 1: FC_Init1 DLLP Exchange 225
228 213

Phase 2: FC_Init2 DLLP Exchange 226
228 214

Flow Control Initialization Sequence 215

 Flow Control credits for VC0 are automatically initialized after
  Link Training because VC0 can’t be disabled.
   Other virtual channels may be enabled later by software,
    triggering Flow Control initialization for those channels at
    that time.
 PCI Express defines two flow control initialization states,
  FC_INIT1 and FC_INIT2
   FC_INIT1: credits are exchanged
   FC_INIT2: credit exchange is confirmed

Preparing Link for TLP Transmission 226 216

Following Reset, two state machines interact to prepare the PCIe
interface to send and receive TLPs:
 LTSSM trains and initializes the Link
 DLCMSM tracks the state of the Link and initializes Flow Control
  for VC0

DLCMSM: Data Link Control & Management State Machine
LTSSM: Link Training and Status State Machine

DLCMSM Diagram 223
228 217

Minimum Flow Control Advertisement 219 218

Credit Type                    Minimum Advertisement
Posted Request Header (PH)     1 credit (4 DW Hdr + 1 DW Digest = 5 DW)
Posted Request Data (PD)       Enough credits to accommodate the largest possible
                               Max_Payload_Size of all Functions in the Device. For a
                               value of 1024 bytes, 1024/16 = 64d credits are needed.
Non-Posted Req. Header (NPH)   1 credit (4 DW Hdr + 1 DW Digest = 5 DW)
Non-Posted Req. Data (NPD)     2 credits if AtomicOps are supported, 1 credit otherwise
Completion Header (CPLH)       1 credit (3 DW Hdr + 1 DW Digest = 4 DW) for Switch Ports
                               or Root Ports that support peer-to-peer transfers.
                               For Endpoints or Root Ports that don’t support peer-to-peer,
                               infinite credits must be advertised (indicated by a value
                               of 0 during initialization).
Completion Data (CPLD)         Enough credits to accommodate the largest possible
                               Max_Payload_Size of all Functions in a Switch Port or Root
                               Port that supports peer-to-peer transfers.
                               For Endpoints or Root Ports that don’t support peer-to-peer,
                               infinite credits must be advertised.

Infinite Buffer Advertisement 221 219

 Allows the transmitter to send any number of TLPs without checking
  credits
 Indicated at initialization by advertising a credit value of zero
  in the InitFC1 and InitFC2 DLLPs
 No UpdateFC DLLPs are needed if a receiver advertised infinite
  header and data buffers

Flow Control Init Capture 220

Once LTSSM gets to L0 (TS2s stop), Flow Control Initialization
begins for VC0 with InitFC1s.
Completion credits are not infinite because these are Switch Ports.

Trace captures courtesy of LeCroy
Flow Control Init Capture – 2 221

Eventually, the USP is ready for the next step and begins to send
InitFC2s.

Trace captures courtesy of LeCroy
Flow Control Init Capture – 3 222

When satisfied with the information, they stop initializing and now
TLPs can be sent. In this case, the first one sent is a Slot Power
Message.

During runtime operation, UpdateFC DLLPs are delivered periodically
to update transmitters about available space.

Trace captures courtesy of LeCroy
Flow Control After Initialization 228 223

 Prior to sending a TLP, the transmitter checks Credits Required (CR)
  against the Credit Limit (CL) to verify buffer space for the next TLP.
 CR is the sum of Credits Consumed (CC) plus the credits required
  to send the Pending TLP (PTLP)

Flow Control Update DLLP Format 229 224

Two’s Complement Check Before Tx 225

 For Header Credit Check
    [CL – (CC + PTLP)] mod 256 ≤ 128
 For Data Credit Check
    [CL – (CC + PTLP)] mod 4096 ≤ 2048

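The two checks differ only in counter width, so they can be sketched with one helper. This is an illustrative model, not hardware logic; the function name and the `field_bits` parameter are hypothetical.

```python
def can_send(cl, cc, ptlp, field_bits):
    """Credit gate: [CL - (CC + PTLP)] mod 2^bits <= 2^bits / 2.

    field_bits is 8 for header credits and 12 for data credits.
    """
    modulus = 1 << field_bits
    return (cl - (cc + ptlp)) % modulus <= modulus // 2


# Header credits: buffer full (CC == CL), one more header is blocked
print(can_send(cl=0x66, cc=0x66, ptlp=1, field_bits=8))  # False
# After the limit advances to 69h the same TLP passes
print(can_send(cl=0x69, cc=0x66, ptlp=1, field_bits=8))  # True
# The check still works after CL has rolled over past FFh
print(can_send(cl=0x02, cc=0xFE, ptlp=1, field_bits=8))  # True
```

Python's `%` always returns a non-negative result for a positive modulus, which mirrors the unsigned wraparound of the hardware counters.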
Example Stage 1: Initialization Complete 231 226

Example Stage 2: FC First TLP Sent 232 227

Example Stage 3: Rx FC Buffer Fills Up 234 228

 In Device A, CC = CL = 66h; Transmitter has no credits
 In Device B, CrRcv = CrAl = 66h; Receiver buffer full

FC Counter Rollover 235 229

FC Counter Rollover Problem 235 230

 Counters have a potential problem: they can’t tell whether CL stayed
  ahead of CR as it should, or if CR passed CL by mistake
 If such a mistake happened, the subtraction result would be a large
  value. To protect against this, it is considered an error if the
  difference between the counters ever exceeds half the counter range
 Consequently, the biggest buffer allowed can only use half the
  counter’s maximum value

[Figure: CL = F8h (248d) and CR = E8h (232d), with the Credits Remaining between them]
Did Credit Limit stay ahead as it should, or did Credits Required pass
it by mistake? The unsigned subtraction result won’t tell us unless we
restrict the options.

CL = Credit Limit
CR = Credits Required

Max Receiver Buffer Size 220 231

 8-bit counter tracks header credits
 12-bit counter tracks data buffer credits
 Using half the counter space means the maximum credits that can be
  issued by a receiver are:
   127 credits for the Header Buffer
   2047 credits for the Data Buffer

Example Stage 4: Buffer Overflow Check 236 232

Example Stage 5: FC Update 238 233

 Now let 3 headers be consumed by Device B and removed from its FC
  Rx buffer
 The Credits Allocated counter increments from 66h to 69h
 An FC Update DLLP is delivered from Device B to A and the CL counter
  is updated
 The next check for sending a TLP will succeed

Update FC DLLP Content 239 234

Update FC Frequency 240 235

 An Update FC DLLP for each packet type (P, NP, Cpl) must normally be
  scheduled at least once every 30 µs (-0%/+50%). Exceptions:
   If the Link is in a state other than L0 or L0s, no updates are sent
   If the Extended Sync bit (in the Link Control register) is set, the
    limit becomes 120 µs (-0%/+50%)
 The PCIe specification recommends that a receiver tune the Update FC
  latency using the following formula:

    ((Max_Payload_Size + TLP Overhead) * UpdateFactor) / LinkWidth + Internal Delay
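A minimal sketch of the formula follows. The TLP Overhead (28) and Internal Delay (19) defaults are assumptions: they are not stated on the slide, but they reproduce entries of the Gen1 latency table when the result is rounded down. The function name is hypothetical.

```python
import math


def update_fc_latency(max_payload, update_factor, link_width,
                      tlp_overhead=28, internal_delay=19):
    """Evaluate the UpdateFC latency guideline, rounded down.

    tlp_overhead and internal_delay are assumed Gen1 values (see lead-in).
    """
    raw = (max_payload + tlp_overhead) * update_factor / link_width
    return math.floor(raw + internal_delay)


print(update_fc_latency(128, 1.4, 1))  # 237
print(update_fc_latency(128, 2.5, 8))  # 67
print(update_fc_latency(512, 1.0, 1))  # 559
```

The Gen2 and Gen3 tables appear to differ from Gen1 only by a larger constant, which would correspond to a larger Internal Delay at the higher data rates.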

Update FC Latency – Gen 1 241 236

The spec tables summarize the UpdateFC transmission latencies for
each data rate. They represent the time allowed between receiving a
TLP and sending an UpdateFC DLLP.

Gen1:
Max_Payload_Size  x1 Link   x2 Link   x4 Link   x8 Link   x12 Link  x16 Link  x32 Link
128 Bytes         237       128       73        67        58        48        33
                  (UF=1.4)  (UF=1.4)  (UF=1.4)  (UF=2.5)  (UF=3.0)  (UF=3.0)  (UF=3.0)
256 Bytes         416       217       118       107       90        72        45
                  (UF=1.4)  (UF=1.4)  (UF=1.4)  (UF=2.5)  (UF=3.0)  (UF=3.0)  (UF=3.0)
512 Bytes         559       289       154       86        109       86        52
                  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=2.0)  (UF=2.0)  (UF=2.0)
1024 Bytes        1071      545       282       150       194       150       84
                  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=2.0)  (UF=2.0)  (UF=2.0)
2048 Bytes        2095      1057      538       278       365       278       148
                  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=2.0)  (UF=2.0)  (UF=2.0)
4096 Bytes        4143      2081      1050      534       706       534       276
                  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=2.0)  (UF=2.0)  (UF=2.0)

Update FC Latency – Gen 2 241 237

Gen2:
Max_Payload_Size  x1 Link   x2 Link   x4 Link   x8 Link   x12 Link  x16 Link  x32 Link
128 Bytes         288       179       124       118       109       99        84
                  (UF=1.4)  (UF=1.4)  (UF=1.4)  (UF=2.5)  (UF=3.0)  (UF=3.0)  (UF=3.0)
256 Bytes         467       268       169       158       141       123       96
                  (UF=1.4)  (UF=1.4)  (UF=1.4)  (UF=2.5)  (UF=3.0)  (UF=3.0)  (UF=3.0)
512 Bytes         610       340       205       137       160       137       103
                  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=2.0)  (UF=2.0)  (UF=2.0)
1024 Bytes        1122      596       333       201       245       201       135
                  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=2.0)  (UF=2.0)  (UF=2.0)
2048 Bytes        2146      1108      589       329       416       329       199
                  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=2.0)  (UF=2.0)  (UF=2.0)
4096 Bytes        4194      2132      1101      585       757       585       327
                  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=2.0)  (UF=2.0)  (UF=2.0)

Update FC Latency – Gen 3 242 238

Gen3:
Max_Payload_Size  x1 Link   x2 Link   x4 Link   x8 Link   x12 Link  x16 Link  x32 Link
128 Bytes         333       224       169       163       154       144       129
                  (UF=1.4)  (UF=1.4)  (UF=1.4)  (UF=2.5)  (UF=3.0)  (UF=3.0)  (UF=3.0)
256 Bytes         512       313       214       203       186       168       141
                  (UF=1.4)  (UF=1.4)  (UF=1.4)  (UF=2.5)  (UF=3.0)  (UF=3.0)  (UF=3.0)
512 Bytes         655       385       250       182       205       182       148
                  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=2.0)  (UF=2.0)  (UF=2.0)
1024 Bytes        1167      641       378       246       290       246       180
                  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=2.0)  (UF=2.0)  (UF=2.0)
2048 Bytes        2191      1153      634       374       461       374       244
                  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=2.0)  (UF=2.0)  (UF=2.0)
4096 Bytes        4239      2177      1146      630       802       630       372
                  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=1.0)  (UF=2.0)  (UF=2.0)  (UF=2.0)

Receiver Rules to Reduce Stalls 239

 If an NPH, NPD, PH or CPLH buffer is full and one or more credits
  become available, an FC Update DLLP must be sent immediately
 If a PD or CPLD buffer cannot hold a packet of Max_Payload_Size and
  one or more credits become available, an FC Update DLLP must be
  sent immediately

Transmitter Checks for FC Updates 240 240

 Transmitter may optionally check that FC updates are received at a
  minimum frequency (at least every 30 μs)
   Only check when the Link is in L0 or L0s
   Use a timer with a limit of 200 μs (-0%/+50%)
   Timer is reset with receipt of an FC DLLP, or with receipt of any
    DLLP
 Timer expiration causes the PHY Layer to retrain the Link via the
  LTSSM Recovery state
 The flow control timeout mechanism is disabled if infinite credits
  were advertised

Posted Request Acceptance Rule 241

 A Posted Request can’t be delayed for more than 10 µs (Posted
  Request Acceptance Limit).
 Consequently, the device must either:
  (a) be able to process received Posted Requests and return FC
      credits within 10 µs, or
  (b) depend on a restricted programming model to ensure that a
      Posted Request is never sent to the device when it’s unable to
      service the request within 10 µs.
 The 10 µs limit does not apply under certain conditions (e.g., just
  after a reset)

Quality of Service and Arbitration

Motivation for Quality of Service (QoS) 245 243

 Some traffic, like streaming video, needs guaranteed latency and
  bandwidth to achieve acceptable performance
   Example: a video capture device that records data at a fixed rate
    and cannot be throttled.
 Guaranteeing “quality of service” for traffic over a general-purpose
  bus with other requests in progress requires support mechanisms

Differentiated Services 246 244

 Providing QoS means managing:
   Effective bandwidth
   Latency
   Error rate
   Other parameters that affect performance
 PCIe features that make QoS possible:
   Traffic Classes (TC)
   Virtual Channels (VC)
   Port Arbitration
   Virtual Channel Arbitration

QoS Support 247 245

 Traffic Class (TC): packet header field carried unchanged from
  source to destination
   At each “service point” (e.g., a Switch port), the TC label
    indicates which VC should be used
   There’s no guaranteed ordering relationship for packets with
    different TCs because they could get mapped into different VCs.
    Software must be careful about dependencies.
 Virtual Channel (VC): set of buffers and support logic to carry a
  traffic flow across a Link
   Only VC0 is required; the other 7 are optional
   Packets are placed into a VC based on their TC
   Transaction ordering rules apply to packets within the same TC

TC Field in TLP 247 246

 TC Field values:
 000b = TC0 (default) All devices support this Traffic
Class, which gives “best effort” service.
 001b to 111b = TC1 to TC7, optional differentiated
service classes

Transaction Types 247

Two general types of transactions are supported:
 Asynchronous: timing is not critical and “best effort” service is
  good enough
 Isochronous: timing is critical and data may be lost if not
  delivered in a timely fashion

Isochronous Example 274 248

VC Capability Structure 246 249

Different Number VCs Per Device 250 250

Extended VCs Supported Register 251 251

VC Registers 251 252

Table

TC-to-VC Mapping Concept 248 253

 Once the TC/VC mapping is programmed on both ends of a Link,
  packets carrying TC tags that are mapped to an enabled VC may be
  sent.
 Packets with TCs that do not map to a VC at the receiver are
  considered malformed (a fatal error)

TC to VC Mapping Example 249 254

Example: Set up a Link so that TC0-1 map to VC0 while TC2-4 map to
VC3.
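The mapping in this example can be sketched as per-VC bitmasks. The sketch assumes an 8-bit TC/VC Map value in which bit n set means TCn is mapped to that VC; the helper names are hypothetical, and no register access is modeled.

```python
def tc_vc_map(traffic_classes):
    """Build an 8-bit TC/VC Map value: bit n set means TCn uses this VC."""
    mask = 0
    for tc in traffic_classes:
        if not 0 <= tc <= 7:
            raise ValueError("TC must be 0..7")
        mask |= 1 << tc
    return mask


def vc_for_tc(tc, vc_maps):
    """Return the VC a TC is mapped to, or None if the TC is unmapped."""
    for vc, mask in vc_maps.items():
        if mask & (1 << tc):
            return vc
    return None


# TC0-1 -> VC0, TC2-4 -> VC3
vc_maps = {0: tc_vc_map([0, 1]), 3: tc_vc_map([2, 3, 4])}
print({vc: bin(mask) for vc, mask in vc_maps.items()})  # {0: '0b11', 3: '0b11100'}
print(vc_for_tc(3, vc_maps))  # 3
print(vc_for_tc(5, vc_maps))  # None (such a packet would be malformed)
```

Both ends of the Link would need the same masks for the mapping to be usable.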

Port Arbitration & VC Arbitration 255

 Port Arbitration: Switch and Root Ports arbitrate between packets
  from different ingress ports that need to use the same virtual
  channel of the same egress port.
 Virtual Channel Arbitration: Egress Ports arbitrate between packets
  in different VCs.
 Both offer several options, allowing software to select an
  appropriate QoS policy.

VC Arbitration Example 253 256

Strict Priority VC Arbitration 254 257

Low Priority Extended VCs 255 258

All values are read only

VC Arbitration 257 259

Default for all VCs

VC Arbitration Capability 256 260

Arbitration Schemes for the Low-Priority Group

WRR VC Arbitration Table 258 261

VC Arbitration Table And VC Arb. Table 259 262

VC Arbitration Table Entries 260 263

Port Arbitration Concept 262 264

Port Arbitration Tables for Each VC 263 265

Port Arbitration Buffering 264 266

Port Arbitration Registers 265 267

Time-Based WRR 266 268

 Used for isochronous channel
 Each phase is a time slot of 100 ns
 One TLP routed during each phase or time slot
 Amount of data that can be sent in one time slot depends on Link
  speed and width.
 If a TLP needs more time slots, set up the first phase or time slot
  with a port number and subsequent phases with a null phase (egress
  port number). Clearly, the packet size must be known ahead of time
  to facilitate proper timing.
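To get a feel for how much data fits in one 100 ns time slot, here is a rough estimate. It assumes the raw per-lane rates of 2.5, 5.0 and 8.0 GT/s with 8b/10b encoding for Gen1/Gen2 and 128b/130b for Gen3, and it ignores all packet framing and DLLP overhead. The names are hypothetical.

```python
# (transfers per second, encoding efficiency) per generation: assumed values
RATES = {1: (2.5e9, 8 / 10), 2: (5.0e9, 8 / 10), 3: (8.0e9, 128 / 130)}


def bytes_per_time_slot(gen, link_width, slot_ns=100):
    """Approximate raw bytes a Link can carry in one time slot."""
    rate, efficiency = RATES[gen]
    bytes_per_second = rate * efficiency * link_width / 8
    return bytes_per_second * slot_ns * 1e-9


for gen, width in [(1, 1), (2, 4), (3, 8)]:
    print(f"Gen{gen} x{width}: {bytes_per_time_slot(gen, width):.1f} bytes per slot")
```

A Gen1 x1 Link moves only about 25 bytes per slot under these assumptions, so even a small TLP can span several phases there.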

Maximum Time Slots Register 267 269

If a VC supports TBWRR, this field indicates how many of the total time slots
are allocated to it. This reports the isochronous bandwidth capability of a
Completer, for example, and software must take that into account when
setting up isochronous service.

Port Arbitration Table 268 270

Switch Arbitration Example 270 271

 Set up the following arbitration schemes:
   VC arbitration: 32-entry WRR for all VCs, preferring VC1 about
    2:1 over VC0
   Port arbitration for VC1: 32-entry WRR, preferring Port1 about
    3:1 over Port0

Arbitration between packets going to the same egress port is only
described in the spec as being “fair”.

Example part 2 – VAT 272

VC Arbitration Table
Entry    7    6    5    4    3    2    1    0
       VC1  VC0  VC1  VC1  VC0  VC1  VC1  VC0
Entry   15   14   13   12   11   10    9    8
       VC0  VC1  VC1  VC0  VC1  VC1  VC0  VC1
Entry   23   22   21   20   19   18   17   16
       VC1  VC1  VC0  VC1  VC1  VC0  VC1  VC1
Entry   31   30   29   28   27   26   25   24
       VC1  VC0  VC1  VC1  VC0  VC1  VC1  VC0

 Select arbitration as WRR with 32 entries
 Program the Table similar to this example

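The VC Arbitration Table in this example follows a repeating VC0, VC1, VC1 pattern starting at entry 0. A small sketch can rebuild it and confirm the resulting weighting; the helper name is hypothetical and no register programming is modeled.

```python
from collections import Counter


def build_wrr_table(pattern, entries=32):
    """Fill a WRR arbitration table by repeating a short pattern."""
    return [pattern[i % len(pattern)] for i in range(entries)]


vat = build_wrr_table(["VC0", "VC1", "VC1"])
weights = Counter(vat)
print(weights["VC1"], weights["VC0"])  # 21 11
```

With 32 entries the split is 21 to 11, which is the "about 2:1" preference for VC1 that the example asks for. The same helper with the pattern Port1, Port1, Port1, Port0 gives the 24 to 8 split of the Port Arbitration Table example.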
Example part 3 – PAT 273

Port Arbitration Table for VC1
Entry      7      6      5      4      3      2      1      0
       Port0  Port1  Port1  Port1  Port0  Port1  Port1  Port1
Entry     15     14     13     12     11     10      9      8
       Port0  Port1  Port1  Port1  Port0  Port1  Port1  Port1
Entry     23     22     21     20     19     18     17     16
       Port0  Port1  Port1  Port1  Port0  Port1  Port1  Port1
Entry     31     30     29     28     27     26     25     24
       Port0  Port1  Port1  Port1  Port0  Port1  Port1  Port1

 Select arbitration as WRR with 32 entries
 Program the Table similar to this example

Multi-Function VC Capability 271 274

 Only needed in multi-function devices that support multiple Virtual
  Channels.
 Present in the Extended Configuration Space of Function 0, these
  registers control Function and VC arbitration for the device
  interface, while each function controls its own internal VC
  assignments.
 Registers look almost identical to the VC capability structure and
  do much the same job

Multi-Function VC Capability Structure 272 275

Isochronous Path 276

What hardware would need to be included on each port of the indicated
path to implement Isochronous service? What arbitration schemes would
need to be selected?

Transaction Ordering

Reasons for Ordering Rules 285 278

 Like PCI, PCI Express imposes ordering rules on transactions moving
  through the fabric at the same time.
 Reasons for this include:
   Making the completion of transactions deterministic and in the
    sequence intended by the programmer
   Avoiding deadlock conditions
   Maintaining compatibility with ordering already used on legacy
    buses (e.g., PCI, PCI-X, AGP)
   Maximizing performance and throughput
     Minimizing read latencies
     Managing read/write ordering
   Supporting the PCI Producer-Consumer Model (see next slide)

Producer/Consumer Model PCI Example 279

[Figure labels: Memory, Consumer, PCI Bus, PCI-to-PCI Bridge, Posted Write Buffer, PCI Bus, Producer, Flag]

Memory is the intended destination, but data is posted and delayed in
getting to the upper bus.

Producer/Consumer Model PCI Example 280

[Figure labels: Memory, Consumer, PCI Bus, PCI-to-PCI Bridge, Read Request, Posted Write Buffer, PCI Bus, Producer, Flag]

Read result must not be allowed to reach the consumer until data has
safely reached memory, or a race condition may occur.

Strong vs. Relaxed Ordering 286 281

 PCI uses a Strongly-Ordered model, with some exceptions to avoid
  potential deadlock conditions between bridges
 PCI-X added another exception: strong ordering could be relaxed if
  software could guarantee that no dependencies existed. If so, the
  RO Attribute bit in the header can be used to allow improved
  performance.
   RO = 0b; Normal PCI ordering
   RO = 1b; Relaxed Ordering permitted

Interpreting the Table 288 282

 The PCI spec includes a table that summarizes the ordering rules.
 Within the table:
   Columns represent the first transaction issued
   Rows represent a subsequent transaction.
   At the intersection there is an implicit question: Should the row
    packet be allowed to pass the column packet?
     Yes: The second transaction must be allowed to pass the first
      to avoid a deadlock.
     Y/N: There are no requirements. A device may allow the second
      transaction to pass the first.
     No: The second transaction must not be allowed to pass the
      first (strong Producer-Consumer ordering model enforced)

Simplified Ordering Table 289 283

 Older PCI table had more entries. New version reduces the entries by
  not mentioning specific requests, resulting in fewer cases that must
  be tested for spec compliance.

Row pass Column?           Posted Request   Read Request   NPR with data   Completion
(Col 1)                    (Col 2)          (Col 3)        (Col 4)         (Col 5)
Posted Request (Row A)     a) No            Yes            Yes             a) Y/N
                           b) Y/N                                          b) Yes
Non-Posted Requests:
  Read Request (Row B)     a) No            Y/N            Y/N             Y/N
                           b) Y/N
  NPR with data (Row C)    a) No            Y/N            Y/N             Y/N
                           b) Y/N
Completion (Row D)         a) No            Yes            Yes             a) Y/N
                           b) Y/N                                          b) No

NPR with data: Non-Posted Write Request, such as a configuration write or I/O write
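For quick lookups, the table can be captured as a nested dictionary keyed first by the later (row) transaction and then by the earlier (column) transaction. This is only a transcription of the entries above; the names are hypothetical.

```python
ORDERING = {
    "Posted": {
        "Posted": "a) No, b) Y/N", "Read": "Yes",
        "NPR with data": "Yes", "Completion": "a) Y/N, b) Yes",
    },
    "Read": {
        "Posted": "a) No, b) Y/N", "Read": "Y/N",
        "NPR with data": "Y/N", "Completion": "Y/N",
    },
    "NPR with data": {
        "Posted": "a) No, b) Y/N", "Read": "Y/N",
        "NPR with data": "Y/N", "Completion": "Y/N",
    },
    "Completion": {
        "Posted": "a) No, b) Y/N", "Read": "Yes",
        "NPR with data": "Yes", "Completion": "a) Y/N, b) No",
    },
}


def may_pass(later, earlier):
    """Answer: may the later (row) packet pass the earlier (column) packet?"""
    return ORDERING[later][earlier]


print(may_pass("Posted", "Read"))        # Yes
print(may_pass("Completion", "Posted"))  # a) No, b) Y/N
```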
Strongly-Ordered Problem: Blocking 300 284

Producer/Consumer Model PCIe Topology 291 285

Producer/Consumer Sequence Ex. – Part 1 293 286

Producer/Consumer Sequence Ex. – Part 2 294 287

Producer/Consumer Sequence with Error 296 288

Relaxed Ordering (RO) 297 289

RO is bit 5 of byte 2 in the TLP Header

Effect of Relaxed Ordering Bit Set 297 290

 Devices are permitted to reorder Memory Writes or Messages ahead of
  previously posted Memory Writes or Messages if their RO bit = 1.
 Memory Read Requests are NOT allowed to pass previously posted
  Memory Writes or Messages, even with RO bit = 1
 Completions with RO bit = 1 are permitted to pass previously posted
  Memory Writes or Messages

Constraints on PCIe Ordering 291

 There can be no ordering guarantee between transactions traveling
  with different Traffic Class attributes.
 No ordering relationship is maintained between transactions in
  different Virtual Channels.
 Consequently, if two TLPs are required to stay in order, then
  software must ensure that they both use the same TC.

ID-Based Ordering (IDO) – Motivation 301 292

 Conventional PCIe ordering can limit performance
   Transactions from unrelated threads are unlikely to have
    dependencies but can still get stuck behind each other (see next
    slide)
   Address translation may worsen this effect, since transactions
    can now be stalled waiting on translations as well as FC credits
   ID-Based Ordering (added with the 2.1 spec) allows transactions
    from different Endpoints to be reordered, while Relaxed Ordering
    allows transactions from the same Endpoint to be reordered.

ID-Based Ordering Example 302 293

[Figure labels: Write Buffer Full, Memory Read, Posted Write]

An earlier posted write that stalls will also block egress of
subsequent transactions due to the ordering rules.
However, when subsequent requests come from other devices, the
likelihood of a dependency between them is very low and ID-Based
Ordering would improve performance.

IDO Attribute Controlled by Software 303 294

 ID-Based Ordering is an optional capability, and uses a previously
  reserved attribute bit in the header to indicate when it’s being
  used

Part Three:
Data Link Layer

DLLP Elements

Data Link Layer Creates and Processes DLLPs 308 297

Generic DLLP Format 310 298

DLLP Types 311 299

DLLP Type         Type Field Encoding   Purpose
Ack               0000 0000b            TLP transmission integrity
Nak               0001 0000b            TLP transmission integrity
PM_XXX            0010 0XXXb            Power Management
Vendor Specific   0011 0XXXb            Vendor Defined
InitFC1-X         01XX 0XXXb            TLP Flow Control Initialization
InitFC2-X         11XX 0XXXb            TLP Flow Control Initialization
UpdateFC-X        10XX 0XXXb            TLP Flow Control
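A decoder for the Type field can be sketched from the table above. Two details are assumptions that the table does not spell out: bits 5:4 of the Flow Control encodings select P, NP or Cpl (00b, 01b, 10b), and bits 2:0 carry the VC ID. The function name is hypothetical.

```python
FC_KIND = {0b01: "InitFC1", 0b11: "InitFC2", 0b10: "UpdateFC"}
FC_CLASS = {0b00: "P", 0b01: "NP", 0b10: "Cpl"}  # assumed sub-encoding


def decode_dllp_type(type_byte):
    """Classify a DLLP by its 8-bit Type field."""
    if type_byte == 0x00:
        return "Ack"
    if type_byte == 0x10:
        return "Nak"
    if (type_byte & 0xF8) == 0x20:
        return "PM"
    if (type_byte & 0xF8) == 0x30:
        return "Vendor Specific"
    kind = FC_KIND.get(type_byte >> 6)
    fc_class = FC_CLASS.get((type_byte >> 4) & 0b11)
    if kind and fc_class and not (type_byte & 0x08):
        return f"{kind}-{fc_class} VC{type_byte & 0b111}"
    return "Reserved"


print(decode_dllp_type(0x40))  # InitFC1-P VC0
print(decode_dllp_type(0x91))  # UpdateFC-NP VC1
```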

DLLP Formats 312 300

Ack/Nak Protocol

Power Management
DLLP Formats 315 301

Flow Control

Vendor-Specific
Ack/Nak Protocol

Motivation for Ack/Nak Protocol 317 303

 In serial environments a single bit error ruins a packet, so
  transmission reliability is important
 Ack/Nak protocol describes hardware-based Link-level TLP error
  detection and correction.
 Spec doesn’t give design details, but provides a general description
  of Ack/Nak elements and required behavior.

Link Layer Responsibility 318 304

[Figure: transmit and receive paths through the layers of a PCIe Port]
 Software layer: Memory, I/O, Configuration R/W Requests or Message
  Requests or Completions (sends / receives address, transaction
  type, data, message index)
 Transaction layer: Transaction Layer Packet (TLP) = Header, Data
  Payload, ECRC; Flow Control, Virtual Channel Management, Ordering;
  Transmit and Receive Buffers per VC
 Data Link layer: Link Packet = Sequence, TLP, LCRC; DLLPs (e.g.
  Ack/Nak with CRC); TLP Retry Buffer; Mux and De-mux; TLP Error Check
 Physical layer: Physical Packet = Start, Link Packet, End; Encode
  and Decode; Parallel-to-Serial and Serial-to-Parallel; Differential
  Driver and Receiver; Link Training

Basic Operation 319 305

First, a TLP arrives from the Transaction Layer.
A Sequence Number and LCRC are added to it.

[Figure: Data Link Layer transmit and receive paths (Retry Buffer, Mux, De-mux, TLP Error Check)]

Basic Operation 319 306

Second, a copy of the TLP is stored in the Retry (Replay) Buffer until its safe arrival is acknowledged by the receiver.

Basic Operation 319 307

Third, the TLP is forwarded to the Physical Layer for transmission.

Basic Operation 319 308

Later, an Ack or Nak is received with a sequence number equal to or higher than this one, meaning this TLP can be removed from the buffer. If a Nak was received, the TLP might need to be sent again instead of being cleared out.

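The four Basic Operation steps can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class and method names are hypothetical, `zlib.crc32` merely stands in for the real LCRC computation, and the Retry Buffer is modeled as an ordered dictionary keyed by sequence number.

```python
import zlib
from collections import OrderedDict

SEQ_MOD = 4096  # sequence numbers are 12 bits wide


class LinkTransmitter:
    """Toy model of the transmit side of the Data Link Layer (hypothetical)."""

    def __init__(self):
        self.next_transmit_seq = 0         # NTS counter
        self.retry_buffer = OrderedDict()  # seq -> link packet copy, oldest first

    def send_tlp(self, tlp: bytes) -> bytes:
        seq = self.next_transmit_seq
        body = seq.to_bytes(2, "big") + tlp
        # zlib.crc32 is only a stand-in for the real 32-bit LCRC.
        link_packet = body + zlib.crc32(body).to_bytes(4, "big")
        self.retry_buffer[seq] = link_packet   # keep a copy until acknowledged
        self.next_transmit_seq = (seq + 1) % SEQ_MOD
        return link_packet                     # handed to the Physical Layer

    def on_ack(self, acked_seq: int) -> None:
        # Retire every stored TLP up to and including acked_seq.
        while self.retry_buffer:
            oldest = next(iter(self.retry_buffer))
            if (acked_seq - oldest) % SEQ_MOD >= SEQ_MOD // 2:
                break  # oldest entry is newer than the acknowledged one
            del self.retry_buffer[oldest]


tx = LinkTransmitter()
packets = [tx.send_tlp(b"tlp%d" % i) for i in range(3)]
tx.on_ack(1)  # one Ack retires sequence numbers 0 and 1
```

After the Ack for sequence number 1, only the copy of sequence number 2 remains in the buffer.
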
Elements of Ack/Nak Protocol 320 309

[Figure: Elements of the Ack/Nak protocol.
Transmitter side: NEXT_TRANSMIT_SEQ (NTS) counter that assigns the Sequence Number, LCRC Generator, Retry Buffer (TLP copies), AckD_SEQ (AS), REPLAY_TIMER, REPLAY_NUM, and the CRC check on incoming Ack/Nak DLLPs. If (NTS-AS) ≥ 2048, TLPs are blocked and a DLL protocol error is reported. TLPs are also blocked during Replay.
Receiver side: LCRC check, NEXT_RCV_SEQ (NRS) counter with Seq Num comparison (=, <, > NRS), NAK_SCHEDULED flag, Ack/Nak Generator (AckNak_Seq_Num[11:0] = NRS-1), and AckNak Latency Timer.]

TLPs Arrive at Link Layer 322 310

 TLPs arrive from Transaction Layer
 Link Layer may block incoming TLPs if Retry Buffer is full or a replay is in progress.

Sequence Number Assigned 322 311

 Sequence Number is added to keep track of TLPs in progress

Link CRC Added 322 312

A 32-bit LCRC is generated from TLP, Sequence Number, and ECRC, then appended to the packet.

LCRC Calculation Fields: Seq Num, Header, Data, ECRC
Generated: LCRC

Retry Buffer 322 313

Copy of entire packet is stored in the Retry Buffer and the TLP is sent out on the Link.

What’s the largest Retry Buffer storage a single TLP could use?
Sequence Number: 2 Bytes
Header: 16 Bytes
Data Payload: 4096 Bytes
ECRC: 4 Bytes
LCRC: 4 Bytes
Max TLP entry size: 4122 Bytes

[Figure: packet as sent on the Link: STP, Seq Num, Header, Data, ECRC, LCRC, End]

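The worst-case entry size above is a plain sum of the field sizes. A minimal check, with hypothetical dictionary keys:

```python
# Field sizes in bytes, as listed on the slide.
FIELD_BYTES = {
    "sequence_number": 2,
    "header": 16,
    "data_payload": 4096,
    "ecrc": 4,
    "lcrc": 4,
}

max_entry_bytes = sum(FIELD_BYTES.values())
print(max_entry_bytes)  # 4122
```
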
Receiver LCRC Check 325 314

 Receiver verifies integrity of TLP content and ordering
 The first check is LCRC, which is calculated and compared with the incoming LCRC

CRC Calculation Fields: Seq Num, Header, Data, ECRC (result is checked against the received LCRC)

LCRC Check Fails 325 315

If LCRC check fails:
 Discard the bad TLP
 Set NAK_SCHEDULED flag
 Send Nak with expected sequence number minus 1

LCRC OK, Check Sequence Number 325 316

 The 12-bit Next Receive Sequence (NRS) counter tracks expected sequence number for next TLP.
 Cleared to zero at reset, it increments by one for each good TLP, and rolls over from 4095d back to zero.

Sequence Number Check 325 317

When the TLP sequence number (Seq Num) is compared to NRS, there are three possible results:
Seq Num = NRS (Good TLP)
Seq Num < NRS (Duplicate TLP)
Seq Num > NRS (Lost TLP)

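Because the sequence number is a 12-bit value that rolls over, "less than" and "greater than" have to be evaluated modulo 4096. The sketch below assumes the usual half-range convention (a Seq Num that trails NRS by up to 2048 counts as a duplicate); the slides do not spell that rule out, and the function name is hypothetical.

```python
SEQ_MOD = 4096  # 12-bit sequence number space


def classify_seq(seq_num: int, nrs: int) -> str:
    """Compare a received Seq Num with NRS, allowing for rollover."""
    if seq_num == nrs:
        return "good"
    # Assumption: trailing NRS by at most half the range means "older".
    if (nrs - seq_num) % SEQ_MOD <= SEQ_MOD // 2:
        return "duplicate"
    return "lost"


print(classify_seq(5, 5))      # good
print(classify_seq(4095, 0))   # duplicate (4095 is just behind 0 after rollover)
print(classify_seq(0, 4095))   # lost (TLP 4095 never arrived)
```
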
Sequence Number = NRS 325 318

 Packet is good. TLP core (Header, Data, ECRC) sent to Transaction Layer
 NRS value is incremented
 Nak flag is cleared
 Ack DLLP is scheduled

Sequence Number < NRS 325 319

 Transmitter must have replayed on its own. This is a duplicate TLP.
 TLP is discarded
 NRS is not incremented
 Ack is scheduled with sequence number of last valid TLP (NRS-1)

Sequence Number > NRS 325 320

 An earlier TLP was lost for some reason at the Physical Layer.
 In response:
 TLP is discarded
 NRS is not incremented
 NAK_SCHEDULED flag is set.
 Ack/Nak generator sends Nak DLLP (sequence number is NRS-1)

AckNak Latency Timer 325 321

 Latency value depends on Link width, max payload size, etc.
 When timer expires, Ack/Nak Generator sends Ack DLLP to the transmitter.
 When Ack DLLP is sent, timer is reloaded.

Ack/Nak Generator 325 322

 Ack/Nak Generator uses sequence number of last good TLP received (NRS-1).
 Multiple TLPs may be retired with one Ack/Nak DLLP.

Notes: NAK_SCHEDULED Flag 325 323

 When NAK_SCHEDULED flag is set, a Nak DLLP is triggered.
 Once a Nak has been sent, receiver discards TLPs until it sees the expected sequence number.
 If other problems occur, additional Naks are not sent. Instead, the transmitter Replay Timer will timeout and cause another replay.

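The receiver rules from the preceding slides (LCRC check, the three sequence number outcomes, and the NAK_SCHEDULED flag) can be pulled together in one small model. This is a sketch, not a hardware description: the class is hypothetical, the Ack is returned immediately rather than being scheduled against the AckNak Latency Timer, and the duplicate test assumes half-range modulo-4096 comparison.

```python
SEQ_MOD = 4096


class LinkReceiver:
    """Toy model of the receive side of the Ack/Nak protocol (hypothetical)."""

    def __init__(self):
        self.nrs = 0                # NEXT_RCV_SEQ
        self.nak_scheduled = False  # NAK_SCHEDULED flag
        self.delivered = []         # TLPs passed up to the Transaction Layer

    def _last_good(self) -> int:
        return (self.nrs - 1) % SEQ_MOD

    def _nak(self):
        # Only the first problem produces a Nak; later bad TLPs are dropped silently.
        if self.nak_scheduled:
            return None
        self.nak_scheduled = True
        return ("Nak", self._last_good())

    def receive(self, seq_num: int, tlp: bytes, lcrc_ok: bool):
        if not lcrc_ok:
            return self._nak()                      # bad TLP is discarded
        if seq_num == self.nrs:                     # good TLP
            self.delivered.append(tlp)
            self.nrs = (self.nrs + 1) % SEQ_MOD
            self.nak_scheduled = False
            return ("Ack", self._last_good())
        if (self.nrs - seq_num) % SEQ_MOD <= SEQ_MOD // 2:
            return ("Ack", self._last_good())       # duplicate: discard, Ack NRS-1
        return self._nak()                          # lost TLP: discard, Nak NRS-1


rx = LinkReceiver()
print(rx.receive(0, b"a", True))   # ('Ack', 0)
print(rx.receive(2, b"c", True))   # ('Nak', 0)  TLP 1 was lost
print(rx.receive(3, b"d", True))   # None        Nak already outstanding
print(rx.receive(1, b"b", True))   # ('Ack', 1)  replay arrives, flag cleared
```
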
Transmitter Checks Incoming DLLPs 322 324

 Data Link Layer performs CRC check on all DLLPs.
 The calculated CRC is checked against the 16-bit CRC sent with the DLLP.

If DLLP CRC Check Fails 322 325

 Any DLLP that fails Physical Layer checks or the DLLP CRC check is discarded.

How does the Link recover from a lost Ack/Nak?
The next Ack or Nak DLLP for subsequent TLPs will supply the latest sequence number.

Acknowledged Sequence Numbers 322 326

 AS is updated each time a higher TLP sequence number is Ack’d (indicating forward progress)
 Transmitter purges (retires) Retry Buffer TLP entries that have equal or lower sequence numbers than the Ack or Nak reported.

Transmitter Replay Timer Basics 322 327

 REPLAY_TIMER starts when the last symbol of a TLP has been sent out, if it wasn’t already running.
 Ack/Nak DLLPs that update the AS value (forward progress) reload the REPLAY_TIMER and prevent a timeout.

When Transmitter Replay Timer Expires 322 328

 If an inbound Ack/Nak DLLP is late or discarded (e.g. bad CRC), the REPLAY_TIMER may expire. In that case:
 All TLPs still in the Retry Buffer are replayed.
 Note that no new TLPs are allowed during replay.

Replay Caused by Nak DLLP 322 329

 If Nak is received, check forward progress
 If no progress:
 Replay all TLPs in Retry Buffer
 Increment REPLAY_NUM
 If progress:
 Purge older TLPs, then replay remaining TLPs in Retry Buffer
 Reset REPLAY_TIMER
 Reset REPLAY_NUM then increment count to 1

Replay Number Counter 322 330

 REPLAY_NUM rules:
 Cleared to 0 at reset and whenever forward progress is seen
 On replay, increment if no forward progress
 If count = 3 and another failure occurs (4 attempts to transfer same TLP), error is logged, Link training is forced

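The transmitter rules for Ack, Nak, Replay Timer expiry, and REPLAY_NUM fit together as sketched below. The class and its fields are hypothetical, the Retry Buffer is reduced to a list of sequence numbers, and what the hardware does after retraining is left out.

```python
SEQ_MOD = 4096


class ReplayState:
    """Toy transmitter bookkeeping for Ack/Nak handling (hypothetical)."""

    def __init__(self, unacked):
        # unacked: non-empty list of sequence numbers in flight, oldest first
        self.retry_buffer = list(unacked)
        self.acked_seq = (unacked[0] - 1) % SEQ_MOD  # AS, assumed just behind oldest
        self.replay_num = 0                          # REPLAY_NUM
        self.retrain_requested = False

    def _purge(self, seq: int) -> bool:
        """Retire entries up to and including seq; True means forward progress."""
        progress = False
        while self.retry_buffer and (seq - self.retry_buffer[0]) % SEQ_MOD < SEQ_MOD // 2:
            self.retry_buffer.pop(0)
            progress = True
        if progress:
            self.acked_seq = seq
            self.replay_num = 0       # forward progress clears the count
        return progress

    def _replay(self):
        if not self.retry_buffer:
            return []
        if self.replay_num == 3:      # a fourth attempt for the same TLP failed
            self.retrain_requested = True   # log error, force Link training
            return []
        self.replay_num += 1
        return list(self.retry_buffer)      # TLPs to resend, oldest first

    def on_ack(self, seq: int) -> None:
        self._purge(seq)

    def on_nak(self, seq: int):
        self._purge(seq)              # purge older TLPs first, if any
        return self._replay()

    def on_replay_timer_expired(self):
        return self._replay()


tx = ReplayState([5, 6, 7])
tx.on_ack(5)                  # retires TLP 5
print(tx.on_nak(5))           # [6, 7], no progress, REPLAY_NUM becomes 1
print(tx.on_nak(6))           # [7], progress: count reset, then incremented to 1
```
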
One More Transmitter Ack/Nak Check 322 331

 The absolute difference between NTS and AS is limited to 2048d (half the 12-bit NTS range)
 Failure of this check is a Data Link Layer protocol error, but this is very unlikely. Only a few TLPs are usually in progress at one time

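The (NTS-AS) ≥ 2048 test from the flow diagram is a one-line modulo comparison. A minimal sketch, with a hypothetical function name:

```python
SEQ_MOD = 4096


def must_block_new_tlps(nts: int, acked_seq: int) -> bool:
    """True when (NTS - AS) mod 4096 has reached half the sequence range."""
    return (nts - acked_seq) % SEQ_MOD >= SEQ_MOD // 2


print(must_block_new_tlps(10, 5))     # False, only a few TLPs outstanding
print(must_block_new_tlps(2053, 5))   # True, 2048 TLPs outstanding
print(must_block_new_tlps(3, 4090))   # False, difference is 9 after rollover
```
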
Ack/Nak Impact 332

 Hardware costs:
 Transmitter Retry Buffer storage, logic for Replay
Timer, NTS and AS counters, CRC generation, etc.
 Receiver TLP buffers, Ack/Nak Latency Timer, NRS
counter, CRC checking logic, etc.
 Performance costs:
 Fixed overhead of 32-bit LCRC appended to TLPs
 When replay is needed:
Bandwidth consumed
Replay latency penalty

Limiting the Impact 333

Designers balance cost and performance by:
 Minimizing Retry Buffer to save memory cost. This does limit the number of TLPs that can be in progress.
 Adjusting transmitter Replay Timer and receiver Ack/Nak Latency Timer values based on actual Link width and Max Payload size to improve Ack/Nak response time.

For a detailed discussion of Retry Buffer sizing and timer values, refer to the whitepaper Sizing Of The Replay Buffer In PCI Express Devices on MindShare’s website.

Replay Timer Values (Gen1) 339 334

Values are in symbol times.

Max_Payload Size  x1 Link  x2 Link  x4 Link  x8 Link  x12 Link  x16 Link  x32 Link
128 Bytes         711      384      219      201      174       144       99
256 Bytes         1248     651      354      321      270       216       135
512 Bytes         1677     867      462      258      327       258       156
1024 Bytes        3213     1635     846      450      582       450       252
2048 Bytes        6285     3171     1614     834      1095      834       444
4096 Bytes        12429    6243     3150     1602     2118      1602      828

Replay Timer value is simply 3 times the ACK Latency Timer. The L0s adjustment is set to 0 and the table values are called ‘unadjusted.’

$$\text{Replay Timer} = \left(\frac{(\text{Max\_Payload\_Size} + \text{TLP Overhead}) \times \text{AckFactor}}{\text{LinkWidth}} + \text{Internal Delay}\right) \times 3 + \text{Rx\_L0s\_Adjust}$$

Note: the Rx_L0s_Adjust argument was assumed to be zero for 1.0 and 1.1 versions, and was finally dropped starting with the 2.0 spec.

Example: Assume a 2-Lane Link with a Max_Payload of 2048 bytes.

$$\left(\frac{(2048 + 28) \times 1.0}{2} + 19\right) \times 3 + 0 = 3171$$

(about a 12.7µs timeout period for Gen1)

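The Gen1 example can be reproduced numerically. The sketch below takes the TLP overhead (28) and internal delay (19) from the worked example, assumes a 4 ns Gen1 symbol time, and assumes the fractional part is truncated before multiplying by 3, which is what matches the tabulated values. Function names are hypothetical; AckFactor is passed as a string so that `Fraction` keeps it exact.

```python
import math
from fractions import Fraction

TLP_OVERHEAD = 28     # bytes, from the worked example
INTERNAL_DELAY = 19   # symbol times, Gen1 value from the worked example
GEN1_SYMBOL_NS = 4    # assumption: one 10-bit symbol at 2.5 GT/s


def ack_latency_limit(max_payload: int, link_width: int, ack_factor: str) -> int:
    """Unadjusted Ack latency limit in symbol times (L0s adjustment = 0)."""
    transfer = Fraction(max_payload + TLP_OVERHEAD) * Fraction(ack_factor) / link_width
    return math.floor(transfer + INTERNAL_DELAY)   # assumption: truncate


def replay_timer_limit(max_payload: int, link_width: int, ack_factor: str) -> int:
    """Unadjusted Replay Timer limit: 3 times the Ack latency limit."""
    return 3 * ack_latency_limit(max_payload, link_width, ack_factor)


limit = replay_timer_limit(2048, 2, "1.0")
print(limit)                          # 3171
print(limit * GEN1_SYMBOL_NS / 1000)  # 12.684 (microseconds)
```
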
Replay Timer Values (Gen2) 354 335

Max_Payload Size  x1 Link  x2 Link  x4 Link  x8 Link  x12 Link  x16 Link  x32 Link
128 Bytes         864      537      372      354      327       297       252
256 Bytes         1401     804      507      474      423       369       288
512 Bytes         1830     1020     615      411      480       411       309
1024 Bytes        3366     1788     999      603      735       603       405
2048 Bytes        6438     3324     1767     987      1248      987       597
4096 Bytes        12582    6396     3303     1755     2271      1755      981

Replay Timer Values (Gen3) 354 336

Max_Payload Size  x1 Link  x2 Link  x4 Link  x8 Link  x12 Link  x16 Link  x32 Link
128 Bytes         999      672      507      489      462       432       387
256 Bytes         1536     939      642      609      558       504       423
512 Bytes         1965     1155     750      546      615       546       444
1024 Bytes        3501     1923     1134     738      870       738       540
2048 Bytes        6573     3459     1902     1122     1383      1122      732
4096 Bytes        12717    6531     3438     1890     2406      1890      1116

Replay Timer Rules 337 337

Counts time since an Ack or Nak DLLP was received
 If not started, then start on the last symbol of a TLP transmission or re-transmission
 For replay, restart timer on the last symbol of a TLP re-transmission
 Restart timer for each Ack DLLP received for a TLP in the Retry Buffer and only if there are unacknowledged TLPs outstanding
 Reset and Hold if there are no unacknowledged TLPs outstanding
 Hold timer reset while Link is re-training (LTSSM in recovery)

ACK Latency Timer 344 338

 The receiver’s AckNak_LATENCY_TIMER is loaded with a value which reflects the worst-case transmission latency in sending an Ack or Nak in response to a received TLP
 This time depends on anticipated payload size and the width of the Link over which TLPs and DLLP Ack/Naks must travel.
 The equation to calculate the AckNak_LATENCY_TIMER value required is:

$$\frac{(\text{Max\_Payload\_Size} + \text{TLP Overhead}) \times \text{AckFactor}}{\text{LinkWidth}} + \text{Internal Delay} + \text{Tx\_L0s\_Adj}$$

Note: the Tx_L0s_Adj argument was assumed to be zero for 1.0 and 1.1 versions, and was finally dropped starting with the 2.0 spec.

ACK Transmission Latency Table (Gen1) 345 339

Max_Payload Size  x1 Link         x2 Link         x4 Link         x8 Link        x12 Link       x16 Link       x32 Link
128 Bytes         237 (AF=1.4)    128 (AF=1.4)    73 (AF=1.4)     67 (AF=2.5)    58 (AF=3.0)    48 (AF=3.0)    33 (AF=3.0)
256 Bytes         416 (AF=1.4)    217 (AF=1.4)    118 (AF=1.4)    107 (AF=2.5)   90 (AF=3.0)    72 (AF=3.0)    45 (AF=3.0)
512 Bytes         559 (AF=1.0)    289 (AF=1.0)    154 (AF=1.0)    86 (AF=1.0)    109 (AF=2.0)   86 (AF=2.0)    52 (AF=2.0)
1024 Bytes        1071 (AF=1.0)   545 (AF=1.0)    282 (AF=1.0)    150 (AF=1.0)   194 (AF=2.0)   150 (AF=2.0)   84 (AF=2.0)
2048 Bytes        2095 (AF=1.0)   1057 (AF=1.0)   538 (AF=1.0)    278 (AF=1.0)   365 (AF=2.0)   278 (AF=2.0)   148 (AF=2.0)
4096 Bytes        4143 (AF=1.0)   2081 (AF=1.0)   1050 (AF=1.0)   534 (AF=1.0)   706 (AF=2.0)   534 (AF=2.0)   276 (AF=2.0)

ACK Transmission Latency Table (Gen2) 352 340

Max_Payload Size  x1 Link  x2 Link  x4 Link  x8 Link  x12 Link  x16 Link  x32 Link
128 Bytes         288      179      124      118      109       99        84
256 Bytes         467      268      169      158      141       123       96
512 Bytes         610      340      205      137      160       137       103
1024 Bytes        1122     596      333      201      245       201       135
2048 Bytes        2146     1108     589      329      416       329       199
4096 Bytes        4194     2132     1101     585      757       585       327

ACK Transmission Latency Table (Gen3) 352 341

Max_Payload Size  x1 Link  x2 Link  x4 Link  x8 Link  x12 Link  x16 Link  x32 Link
128 Bytes         333      224      169      163      154       144       129
256 Bytes         512      313      214      203      186       168       141
512 Bytes         655      385      250      182      205       182       148
1024 Bytes        1167     641      378      246      290       246       180
2048 Bytes        2191     1153     634      374      461       374       244
4096 Bytes        4239     2177     1146     630      802       630       372

ACK Latency Timer Rules 343 342

Counts time since an Ack or Nak DLLP was scheduled for transmission
 Resets to 0 each time an Ack or Nak DLLP is scheduled for transmission
 Timer starts from 0 when the first good TLP received has been sent to the Transaction Layer

Ack Example 332 343

Ack Capture 344

Trace capture courtesy of LeCroy

Requester sends MemRd, receives Ack for that TLP.
Completion is returned, and Ack is sent in response.
This Request also happens to ask for an address translation. Data returned is the address in the Request after translation by the Translation Agent.

Ack with Sequence Number Rollover 333 345

Nak Example 335 346

Lost or Corrupt TLP 346 347

Lost or Corrupt Nak 349 348

Transmission Priority 350 349

1. Finish sending TLP, DLLP or Ordered Set already in progress (highest priority)
2. Ordered Set
3. Nak
4. Ack
5. Flow Control DLLP
6. Replay Buffer re-transmission of TLPs
7. New TLP in transaction layer waiting to be sent
8. Other DLLPs waiting to be sent (lowest priority)
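
The priority list above amounts to a fixed-priority arbiter. A minimal sketch, assuming hypothetical labels for each kind of pending traffic:

```python
from typing import Optional, Set

# Highest priority first, mirroring the numbered list.
PRIORITY = [
    "in_progress",    # 1. finish the TLP, DLLP or Ordered Set already started
    "ordered_set",    # 2.
    "nak",            # 3.
    "ack",            # 4.
    "flow_control",   # 5.
    "replay_tlp",     # 6. Replay Buffer re-transmission
    "new_tlp",        # 7.
    "other_dllp",     # 8.
]


def next_to_send(pending: Set[str]) -> Optional[str]:
    """Return the highest-priority pending item, or None if nothing is waiting."""
    for kind in PRIORITY:
        if kind in pending:
            return kind
    return None


print(next_to_send({"new_tlp", "ack", "flow_control"}))  # ack
```
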

Switch Cut-Through Mode 357 350

[Figure: Switch and Endpoint. A TLP (STP ... END) arrives at the Switch; the header indicates the egress port.]

 Cut-through mode can improve latency, especially if packets are large

Switch Cut-Through Mode 357 351

[Figure: the TLP is forwarded through the Switch toward the Endpoint as it arrives.]

Switch Cut-Through Mode 357 352

[Figure: error detected. LCRC indicates an error after the packet finishes arriving at the Switch.]

Switch Cut-Through Mode 357 353

[Figure: on the forwarded TLP the LCRC is inverted and END is replaced with EDB. The Switch also returns a Nak to the sender.]

Nak indicates incoming packet was bad and should be replayed.

Switch Cut-Through Mode 357 354

[Figure: Rx sees inverted LCRC and EDB, discards packet.]

No Ack or Nak is sent. The packet was nullified so it’s as though it never happened.

 If a received TLP ends with EDB, but the TLP does not have an inverted LCRC, TLP is discarded and a Nak is sent
 Why do we need EDB? Why not just truncate the packet? The spec says a packet may not be interrupted once started, so truncating would cause an uncorrectable error.

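The receiver's handling of a nullified TLP can be summarized as a small decision function. This is a sketch with hypothetical names; the return strings are labels for the actions described on the slides, and "inverted" is taken to mean the bitwise complement of the 32-bit LCRC.

```python
def classify_received_tlp(end_symbol: str, received_lcrc: int, computed_lcrc: int) -> str:
    """Decide what the receiver does with a TLP once its end framing is seen."""
    inverted = received_lcrc == (~computed_lcrc & 0xFFFFFFFF)
    if end_symbol == "EDB":
        # Nullified TLP: dropped with no Ack or Nak, as though it never happened.
        # EDB without an inverted LCRC is treated as a bad TLP instead.
        return "discard_silently" if inverted else "discard_and_nak"
    if received_lcrc == computed_lcrc:
        return "check_sequence_number"   # normal path: LCRC OK
    return "discard_and_nak"             # ordinary LCRC failure


good = 0x12345678
print(classify_received_tlp("EDB", good ^ 0xFFFFFFFF, good))  # discard_silently
print(classify_received_tlp("EDB", good, good))               # discard_and_nak
```
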
Part Four:
Physical Layer

Physical Layer Logical
(Gen1 and Gen2)

Physical Layer 362 357

[Figure: PCI Express device layers (Software, Transaction, Data Link, Physical), transmit and receive sides of a Port. At the Physical layer the Link Packet is framed (Start, Link Packet, End), encoded, converted parallel-to-serial, and driven onto the Link by the differential driver; the receive side reverses these steps. Link Training is also a Physical layer function.]

Logical and Electrical Physical Layer 363 358

Transmit Physical Layer 365 359

Receive Physical Layer 367 360

Transmit Physical Layer Details – Gen1&2 369 361

Buffer and Multiplexer Control 370 362

Ordered Sets:
TS1, TS2
SKIP, FTS
Electrical Idle
Electrical Idle Exit

TLP/DLLP at Physical Layer 371 363

TLP:   STP (‘K’ character) | Sequence, Header, Data Payload, ECRC, LCRC (‘D’ characters) | END (‘K’ character)

DLLP:  SDP (‘K’ character) | DLLP Type, Misc., CRC (‘D’ characters) | END (‘K’ character)

Byte Striping 371 364

Example: x1 Byte Striping 372 365

Example: x4 Byte Striping 372 366

Example: x8 Byte Striping 373 367

General Packet Format Rules 373 368

• Packets are always multiples of 4 characters
• STP and SDP characters must be placed on Lane 0 when starting
  transmission from Logical Idle
  - STP/SDP characters are always in Lane 0, 4, 8, etc.
  - END/EDB characters are in Lane 1 for a x2 Link, but otherwise
    always in Lane 3, 7, 11, etc.
• DLLPs are always 8 characters long (SDP + 6 characters + END)
• When no packets are ready to send, Logical Idle characters (00h) are
  sent, and must be on all Lanes
• Ordered Sets are also sent on all Lanes at once
• If a packet does not end on the last Lane and no other packets are
  ready, PAD characters are added to fill the Link out to the last Lane
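
A rough sketch of how characters land on Lanes under these rules is shown below. It is a simplification, not the spec's algorithm: `stripe` and `unstripe` are hypothetical helper names, characters are plain strings, and PAD fill is applied whenever the last row is incomplete.

```python
PAD = "PAD"

def stripe(chars, lanes):
    """Distribute characters round-robin across Lanes; returns one list per Lane."""
    rows = -(-len(chars) // lanes)  # ceiling division
    padded = list(chars) + [PAD] * (rows * lanes - len(chars))
    return [padded[lane::lanes] for lane in range(lanes)]

def unstripe(per_lane):
    """Inverse of stripe: interleave the Lanes into one stream and drop PADs."""
    stream = [c for row in zip(*per_lane) for c in row]
    return [c for c in stream if c != PAD]

# An 8-character DLLP (SDP + 6 characters + END) on a x4 Link.
dllp = ["SDP", "b0", "b1", "b2", "b3", "b4", "b5", "END"]
lanes = stripe(dllp, 4)
```

Here `lanes[0]` starts with `SDP` and `lanes[3]` ends with `END`, which agrees with the Lane 0 and Lane 3 placement rules above.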
x1 Packet Format Example 374 369

x4 Packet Format Example 375 370

x8 Packet Format 377 371

PAD characters used to maintain packet framing alignment


Scrambler 377 372

Scrambler Operation 378 373

Scrambling polynomial:
G(x) = X^16 + X^5 + X^4 + X^3 + 1

• Scrambler uses an LFSR
• ‘D’ characters are scrambled
• COM character causes scrambler to reinitialize
• Runs at bit rate (8 times byte rate)
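
The LFSR can be sketched in a few lines. The sketch assumes a seed of FFFFh (the value a COM would restore; the slide does not state it) and that data bits are processed least significant bit first. It handles ‘D’ characters only, with no special treatment of ‘K’ characters or SKP.

```python
def scramble(data: bytes, lfsr: int = 0xFFFF) -> bytes:
    """XOR each data bit with the LFSR output for G(x) = X^16 + X^5 + X^4 + X^3 + 1."""
    out = bytearray()
    for byte in data:
        scrambled = 0
        for bit in range(8):              # least significant bit first
            msb = (lfsr >> 15) & 1        # LFSR output for this data bit
            scrambled |= (((byte >> bit) & 1) ^ msb) << bit
            lfsr = (lfsr << 1) & 0xFFFF   # advance one bit time
            if msb:
                lfsr ^= 0x0039            # feedback taps X^5 + X^4 + X^3 + 1
        out.append(scrambled)
    return bytes(out)

# Scrambling is its own inverse when both sides start from the same seed.
assert scramble(scramble(b"PCIe")) == b"PCIe"
```

The LFSR advances eight times per character, which is the "runs at bit rate" point above.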

8b/10b Encoder 380 374

8b/10b Tutorial 380 375

• Standard encoding method invented by IBM and used in several
  standards:
  - Ethernet, Fibre Channel, ServerNet, FICON, IBA
• Reasons for encoding
  - Creates sufficient transition density to limit “Run Length”
  - Balances the number of 1’s and 0’s to maintain “DC Balance”
  - Facilitates detection of most transmission errors
• Cost: transmission performance is degraded by 20% (10 bits are sent
  for every 8 bits of data)

8b/10b Encoding Example 381 376

8b/10b Nomenclature 382 377

Notation of 8b Character in 8b/10b Tables

Properties of 10b Symbols 381 378

• Generally equal number of 1's and 0's over 2 consecutive symbols
• No more than five consecutive 1’s or 0’s
• Each 10b symbol has:
  - Four 0’s and six 1’s or
  - Six 0’s and four 1’s or
  - Five 0’s and five 1’s
• 6-bit sub-block of 10b symbol contains no more than four 1's or
  four 0's
• 4-bit sub-block of 10b symbol contains no more than three 1's or
  three 0's
• A symbol that violates any of the above five properties is
  considered invalid
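
The single-symbol properties can be checked mechanically. The sketch below is a necessary-condition filter only: `plausible_10b` is a hypothetical name, symbols are written as bit strings in transmission order, and the two-symbol balance property is not checked because it needs the neighboring symbol.

```python
def longest_run(bits: str) -> int:
    """Length of the longest run of identical bits."""
    best = run = 1
    for prev, cur in zip(bits, bits[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def plausible_10b(symbol: str) -> bool:
    """True if the symbol satisfies the per-symbol properties listed above."""
    six, four = symbol[:6], symbol[6:]
    return (
        len(symbol) == 10
        and symbol.count("1") in (4, 5, 6)   # 4/6, 6/4 or 5/5 split
        and longest_run(symbol) <= 5         # no more than five in a row
        and 2 <= six.count("1") <= 4         # at most four 1's or four 0's
        and 1 <= four.count("1") <= 3        # at most three 1's or three 0's
    )
```

A symbol that passes is not necessarily a defined code point; the real decoder still needs the code tables.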

“Disparity” in 8b/10b Encoding 383 379

• Disparity is the inequality between the number of 1's and 0's in a
  10b symbol
• Symbols with:
  - Four 0's and six 1's have positive (+) disparity
  - Six 0's and four 1's have negative (-) disparity
  - Five 0's and five 1's have neutral disparity
• Characters encode into two possible 10b symbols based on “Current
  Running Disparity” (CRD):
  - One encoding contains four 0's and six 1's while the second
    encoding contains six 0's and four 1's, or
  - Both encodings contain five 0's and five 1's (neutral disparity)
• CRD is always positive or negative, and its initial value is a
  don’t care
8b/10b Encoder 384 380

Some Example 8b/10b Encoding 385 381

Example 8b/10b Transmission 386 382

Use these two characters in the example below:

D/K#         Hex Byte  Binary Bits (HGF EDCBA)  Byte Name  CRD - (abcdei fghj)  CRD + (abcdei fghj)
Control (K)  BC        101 11100                K28.5      001111 1010          110000 0101
Data (D)     6A        011 01010                D10.3      010101 1100          010101 0011

Example Transmission

CRD before  Character to be transmitted  Bit stream transmitted  Symbol disparity  CRD after
-           K28.5 (BCh)                  001111 1010             +                 +
+           K28.5 (BCh)                  110000 0101             -                 -
-           D10.3 (6Ah)                  010101 1100             neutral           -

Initialized value of CRD is don’t care. Receiver can determine from incoming bit stream
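
The example transmission can be reproduced with a two-entry lookup table. This is only a sketch: `TABLE` holds just the two characters from the slide, and `encode` is a hypothetical helper that picks the encoding from the CRD and then updates the CRD from the symbol's disparity.

```python
TABLE = {  # character: (encoding when CRD is -, encoding when CRD is +)
    "K28.5": ("0011111010", "1100000101"),
    "D10.3": ("0101011100", "0101010011"),
}

def encode(chars, crd="-"):
    """Return the 10b symbols for chars and the CRD after the last symbol."""
    symbols = []
    for ch in chars:
        sym = TABLE[ch][0 if crd == "-" else 1]
        ones = sym.count("1")
        if ones > 5:
            crd = "+"      # positive-disparity symbol flips CRD to +
        elif ones < 5:
            crd = "-"      # negative-disparity symbol flips CRD to -
        symbols.append(sym)  # a neutral symbol leaves CRD unchanged
    return symbols, crd

symbols, final_crd = encode(["K28.5", "K28.5", "D10.3"], crd="-")
```

Starting from CRD -, the two COMs take opposite encodings and the neutral D10.3 leaves the CRD at -, as in the example above.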

Control Character Encodings 386 383

Character  8b Name      10b (CRD-)   10b (CRD+)   Description
COM        K28.5 (BCh)  001111 1010  110000 0101  Comma used as a character boundary alignment symbol
PAD        K23.7 (F7h)  111010 1000  000101 0111  Packet Padding Symbol
SKP        K28.0 (1Ch)  001111 0100  110000 1011  Used in SKP Ordered Set (SOS)
STP        K27.7 (FBh)  110110 1000  001001 0111  Start of TLP Symbol
SDP        K28.2 (5Ch)  001111 0101  110000 1010  Start of DLLP Symbol
END        K29.7 (FDh)  101110 1000  010001 0111  End of Good Packet Symbol
EDB        K30.7 (FEh)  011110 1000  100001 0111  End of Bad Packet Symbol. Used by a Switch which detects a bad packet
FTS        K28.1 (3Ch)  001111 1001  110000 0110  Used in Ordered Set to exit L0s to L0 power state
IDL        K28.3 (7Ch)  001111 0011  110000 1100  Used in Electrical Idle Ordered Set
EIE        K28.7 (FCh)  001111 1000  110000 0111  Used in the Electrical Idle Exit Ordered Set (EIEOS) and sent prior to FTS at speeds other than 2.5 GT/s (reserved at 2.5 GT/s)

Comma Character Description 387 384

• COM is the first symbol in an Ordered Set
• 10b encoding of COM (K28.5) character contains 2 bits of one
  polarity followed by 5 bits of the opposite polarity
  (001111 1010 or 110000 0101)
• Only two other symbols have this property: FTS and EIE
• This property makes these symbols easily detectable

Logical Idle Sequence 388 385

 Is not an Ordered Set (no ‘K’ characters)


 Driven on all Lanes when no packets are
transmitted while in L0 power state
 It’s simply the data 00h character that gets
scrambled and encoded
 Distinguished from other valid characters
because it occurs outside the framing context
(after an END but before the next STP/SDP)
 SOS (SKP Ordered Set) still sent periodically

Ordered Sets 388 386

• A grouping of multiples of 4 characters that begins with the COM
  character
 Used for:
 Link training
 Hot Reset
 Clock tolerance compensation
 Transitioning Link to low power Electrical Idle
 Recovering from Electrical Idle
 Transmitted on all Lanes

6 Types of Ordered Sets 388 387

 Skip (SOS)
 4-character set: 1 COM followed by 3 SKPs
 Training Sequence One (TS1)
 16-character set: 1 COM, 15 additional characters
 Training Sequence Two (TS2)
 16-character set: 1 COM, 15 additional characters
 Electrical Idle (EIOS)
 4 characters at 2.5 GT/s: 1 COM followed by 3 IDLs
 8 characters at higher speeds: 1 COM followed by 3 IDLs, sent
twice
 Fast Training Sequence (FTS)
 4-character set: 1 COM followed by 3 FTSs
 Electrical Idle Exit (EIEOS) [only for data rates above
2.5 GT/s]
 16-character set: 1 COM, 14 K28.7, TS1 Identifier
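
The fixed-content Ordered Sets can be written out as symbol lists, as sketched below. `ordered_set` and its `above_2_5_gts` flag are hypothetical names; TS1 and TS2 are omitted because their fields vary.

```python
def ordered_set(kind: str, above_2_5_gts: bool = False):
    """Return the character sequence for a Gen1/Gen2 Ordered Set."""
    if kind == "SOS":
        return ["COM"] + ["SKP"] * 3
    if kind == "FTS":
        return ["COM"] + ["FTS"] * 3
    if kind == "EIOS":
        one = ["COM"] + ["IDL"] * 3
        return one * 2 if above_2_5_gts else one   # sent twice at higher speeds
    if kind == "EIEOS":
        return ["COM"] + ["EIE"] * 14 + ["D10.2"]  # D10.2 is the TS1 Identifier
    raise ValueError(f"unknown Ordered Set: {kind}")

# Every Ordered Set here is a multiple of 4 characters and starts with COM.
for kind in ("SOS", "FTS", "EIOS", "EIEOS"):
    chars = ordered_set(kind)
    assert len(chars) % 4 == 0 and chars[0] == "COM"
```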

TS1 and TS2 Ordered Sets 388 388

 Used during Link training process


 16 characters
 D10.2 and D5.2 identifiers facilitate clock recovery
 D10.2 10b pattern is 0101010101 regardless of CRD, while
 D5.2 10b pattern is 1010010101 regardless of CRD

FTS Ordered Sets 388 389

• FTS (Fast Training Sequence) Ordered Set is sent on all Lanes when
  exiting L0s
 Number of FTSs to be sent was specified by the
“# FTS” field in TS1, TS2 during Link Training.
 At speeds higher than 2.5 GT/s, EIEOS is sent
beforehand to help Receiver detect idle exit.

Encoding
COM K28.5
FTS K28.1
FTS K28.1
FTS K28.1

EIOS 388 390

• Electrical Idle Ordered Set, sent on all Lanes to put a receiver and
  Link into a low-power state. Sent just once when using 2.5 GT/s,
  twice at higher data rates.
• Tells receiver that the voltage is going below the 65 mV threshold.
  Receiver can use a squelch detect circuit to detect Electrical Idle
  entry, or timeouts to infer that the Link must be Electrically Idle
  (for example: no FC updates or SKP Ordered Sets within 128 µs)

SOS 389 391

• Skip Ordered Set: used to facilitate clock tolerance compensation in
  the receiver circuit
• Scheduled to be sent once every 1180 to 1538 symbol times on a packet
  boundary

Encoding
COM K28.5
SKP K28.0
SKP K28.0
SKP K28.0

EIEOS 389 392

• EIEOS (Electrical Idle Exit Ordered Set) is sent only for speeds
  higher than 2.5 GT/s; Rx uses it to exit from Electrical Idle.
• Generates low-frequency components to help the receiver recognize
  Electrical Idle exit.
  - Repeated K28.7 results in a pattern of 5 ones and 5 zeros
  - Pattern is easier to detect than a small voltage change
• Sequence sent when first exiting Electrical Idle and after every
  32 TS1/TS2
• Exit from L0s uses fewer K28.7 (just 4 to 8)

Encoding
Symbol  Name   Character
0       COM    K28.5
1-14    EIE    K28.7
15      TS ID  D10.2 for TS1 Identifier
Inference of Electrical Idle 393

Motivation: electrical threshold difficult to detect


 In L0 state, absence of FC updates or SOS within 128
µs can be inferred to mean the Link is in Electrical Idle
 In Recovery.RcvrCfg as well as Recovery.Speed,
absence of a TS1 or TS2 over 128 Symbol times must
be treated as Electrical Idle
 For speeds other than 2.5 GT/s, Electrical Idle exit is
guaranteed only on receipt of an EIEOS.
 Window is set to 16000 UI for detecting an exit from
Electrical Idle in 5.0 GT/s speeds.
 At 2.5 GT/s speed, Electrical Idle exit must be detected
with every Symbol received.
 Absence of Electrical Idle exit in a 2000 UI window
constitutes an Electrical Idle condition.

Serializer and Tx Clk 389 394

Serializer and Tx Clock 390 395

• The serial output of each Lane is clocked out using the Tx Clock
  running at the current speed
  - Accurate to +/- 300 ppm
  - Resulting maximum separation between the Tx and Rx clocks is thus
    600 ppm
  - At max separation, the clock can gain or lose 1 clock period every
    1666 clocks
• Max allowed signal skew between Tx Lanes is
  - Gen 1: 500 ps + 2 UIs = 1300 ps
  - Gen 2: 500 ps + 4 UIs = 1300 ps
  - Gen 3: 500 ps + 6 UIs = 1250 ps
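
The numbers on this slide follow from the Unit Interval at each data rate. The short calculation below is a worked check, assuming data rates of 2.5, 5.0 and 8.0 GT/s for Gen 1 to Gen 3; the function names are illustrative only.

```python
RATE_GT_PER_S = {"Gen 1": 2.5, "Gen 2": 5.0, "Gen 3": 8.0}
SKEW_UIS = {"Gen 1": 2, "Gen 2": 4, "Gen 3": 6}

def ui_ps(rate_gt_per_s: float) -> float:
    """Unit Interval (one bit time) in picoseconds."""
    return 1000.0 / rate_gt_per_s

def max_tx_skew_ps(gen: str) -> float:
    """Max Tx Lane-to-Lane skew: 500 ps plus the per-generation number of UIs."""
    return 500 + SKEW_UIS[gen] * ui_ps(RATE_GT_PER_S[gen])

def clocks_per_slip(ppm: int) -> int:
    """Whole clocks that elapse before two clocks drift apart by one period."""
    return 1_000_000 // ppm

for gen in RATE_GT_PER_S:
    print(gen, ui_ps(RATE_GT_PER_S[gen]), "ps UI,", max_tx_skew_ps(gen), "ps skew")
print(clocks_per_slip(600), "clocks per slipped period at 600 ppm")
```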

Receive Physical Layer Details – Gen 1 & 2 393 396

CDR (Clock and Data Recovery) 394 397

÷ 10

Achieving Bit Lock 395 398

• Using a PLL or other method, the receiver generates an Rx Clock and
  aligns the phase based on input data transitions
• Resulting Rx Clock has the same frequency and phase as the Tx Clock:
  2.5 GHz or 5.0 GHz
• Rx Clock latches incoming bits into the de-serializer and elastic
  buffer
• “Rx Clock” is different from the “Rx Local Clock” used to clock data
  out of the elastic buffer (+/- 300 ppm difference)
Achieving Symbol Lock 396 399

• Once the PLL is locked, the receiver can reliably sample incoming
  bits
• To recognize what’s being sent on the Link, though, the next step is
  to find the 10-bit symbol boundary (Symbol Lock)
• This is done by searching for the pattern of 2 bits of one polarity
  followed by 5 bits of the other polarity. During training this will
  be the COM character, and finding it gives the symbol position in the
  bit stream.

[Figure: incoming bit stream with the COM symbol marked]
Incoming Bits: 01111 000111010101011100110011111010
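
The search for the comma pattern can be sketched as a string search over sampled bits. This is an illustration only: the function names are hypothetical, bits are a Python string in arrival order, and a real receiver does this in hardware and confirms the boundary over repeated COMs.

```python
COMMAS = ("0011111", "1100000")  # 2 bits of one polarity, then 5 of the other

def find_symbol_boundary(bits: str) -> int:
    """Index of the first comma pattern in the stream, or -1 if none is found."""
    hits = [i for i in (bits.find(c) for c in COMMAS) if i >= 0]
    return min(hits) if hits else -1

def split_symbols(bits: str):
    """Cut the stream into 10-bit symbols starting at the first comma."""
    start = find_symbol_boundary(bits)
    if start < 0:
        return []
    usable = bits[start:]
    return [usable[i:i + 10] for i in range(0, len(usable) - 9, 10)]

# Four stray bits, then COM (CRD- encoding), then D10.2.
stream = "0101" + "0011111010" + "0101010101"
print(find_symbol_boundary(stream), split_symbols(stream))
```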

Elastic Buffer 397 400

÷ 10

Clock Tolerance Compensation 397 401

• SKP Ordered Set compensates for frequency difference between
  transmitter and receiver clocks
• Max frequency difference of 600 ppm
• SKP Ordered Set scheduled for insertion every 1180 to 1538 symbol
  times, but must be delivered on a packet boundary
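
A quick calculation shows why this scheduling window works: at 600 ppm the two clocks drift apart by less than one symbol between scheduled SKP Ordered Sets. The function name below is illustrative, and the extra delay from waiting for a packet boundary is deliberately ignored.

```python
PPM_TX_RX = 600  # worst-case Tx/Rx frequency difference from the slide

def drift_symbols(symbol_times: int, ppm: int = PPM_TX_RX) -> float:
    """Symbols gained or lost over an interval at the given clock difference."""
    return symbol_times * ppm / 1_000_000

for interval in (1180, 1538):
    print(interval, "symbol times ->", round(drift_symbols(interval), 3), "symbols of drift")
```

Because an SOS must wait for a packet boundary, a long TLP in progress stretches the real interval, which is why the elastic buffer needs some depth beyond a single symbol.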

Local Clock Faster than Recovered Clock 397 402

Goal is to allow worst-case timing in either direction. In this
design, working to keep the buffer half full accomplishes that.

Approaching Underflow Condition 397 403

Adding SKP’s 397 404

Adding More SKP’s 397 405

Local Clock Slower than Recovered Clock 397 406

Approaching Overflow Condition 397 407

Compensate by Removing SKP’s 397 408

Removing More SKP’s 397 409

Receiver Link De-Skew 399 410

Worst-case Rx skew to compensate for:
• Gen 1: 20 ns = 5 symbol time delay
• Gen 2: 8 ns = 4 symbol time delay
• Gen 3: 6 ns = 6 symbol time delay

This example illustrates a digital de-skew after the elastic buffer.
Alternatively, delay lines could have been used prior to the elastic
buffer.
8b/10b Decoder 400 411

• The 8b/10b decoder converts the 10b symbol stream into 8b data (D)
  or 8b control (K) characters plus the D/K# signal
• D/K# indicates whether the character is a data (D) or control (K)
  character

8b/10b Decoding 401 412

Disparity Errors 401 413

(Time runs left to right)

                                        CRD  Character    CRD  Character    CRD  Character    CRD
Transmitter: character stream           -    D21.1        -    D10.2        -    D23.5        +
Transmitter: bit stream                 -    101010 1001  -    010101 0101  -    111010 1010  +
Bit stream after error                  -    101010 1011  +    010101 0101  +    111010 1010  +
Receiver: decoded character stream      -    D21.0        +    D10.2        +    Invalid      +

Error occurs in the first symbol; error is detected at the third symbol.

Physical Layer Error Handling 404 414

 Physical Layer Detectable Errors


 Required checking:
 8b/10b disparity error or code violation
 Optional checking:
 Elastic Buffer overflow or underflow error
 Packet inconsistent with packet format rules
 Lane de-skew error
 Loss of symbol lock error
 Errors reported as ‘Receiver Errors’ to
Transaction Layer via the status bit in the
correctable error register (if Advanced Error
Reporting registers available)
De-Scrambler 402 415

• De-scrambler uses the same scrambling procedure to recover the
  original data
• XORing the scrambled data with the same pattern a second time
  recovers the original data
• Every COM reinitializes it, so it stays in sync with the
  transmitter’s scrambler

Byte Un-Striping 402 416

• The 8b characters from all Lanes are un-striped into one stream

Example: x8 Byte Un-Striping 403 417

[Figure: x8 byte un-striping. The De-Scrambler on each Lane feeds the
Byte Un-Striping logic: Lane 0 supplies Characters 0, 8, 16, ...;
Lane 1 supplies Characters 1, 9, 17, ...; Lane 7 supplies Characters
7, 15, 23, ... The output is a single packet byte stream (Character 0,
Character 1, ... Character 7, ...) plus the D/K# signal.]

Filter/Packet Alignment Check 403 418

• SKP, TS1, TS2, Electrical Idle, FTS Ordered Sets and PAD characters
  are detected here and filtered out (‘K’ characters are never seen in
  higher layers)
• All valid packets start with an STP or SDP character and end with an
  END character; anything else is reported to the Data Link Layer as
  an error
• Start and End characters are removed and the resulting TLPs and
  DLLPs are forwarded to the Rx Buffer

Receive Data Buffer 403 419

• Rx Buffer holds all TLPs / DLLPs
• Buffer contents are clocked to the Data Link Layer
• Neither the frequency of the clock used to clock contents out of the
  Rx Buffer nor the width of the buffer is specified
• Control signal indicates start and end of each TLP or DLLP

Physical Layer Logical (Gen3)

Design Goals of Gen3 407
403 421

1. Increase bandwidth
2. Maintain backward compatibility
a) Hardware – speed compatibility
 Work with trace lengths and connectors used with Gen2
 Max voltage is a little higher at 1300 mVpp vs. 1200 mVpp
 Common starting place: Initialize to Gen1 speed, change if
higher speeds supported
 Upper layers don’t change with speed differences. TLP and
DLLP still have the same parts.
b) Software – configuration compatibility
 Old registers and access mechanisms must remain
accessible to minimize software effort

Training Sequence 407 422

• Link automatically trains to 2.5 GT/s
• If both Link neighbors report higher speeds, change to Recovery
  state and train to the highest common speed.
• Speeds are reported in the TS1s and TS2s exchanged during training,
  and this is also where a desire to change the speed is indicated.

TS1 or TS2
Symbol  Contents
0       COM
1       Link #
2       Lane #
3       # FTS
4       Rate ID
5       Train Ctl
6       TS ID or EQ Info
...     TS ID
15      TS ID
Link Training 407 423

• To change speed, device enters Recovery state with a flag set in the
  TS1s
  - Link context is maintained
  - TS1s communicate desire to change and available speeds
• Highest common speed is attempted
  - If successful, neighbor responds and both return to L0
  - If not, go back to 2.5 GT/s
  - OK to try again
• Repeated failures handled by removing failing speed from the
  advertised list before trying again (implementation specific method)

[Figure: LTSSM state diagram showing Detect, Polling, Configuration,
Recovery, L0, L0s, L1, and L2]
Motivation for New Encoding Model 407 424

Why not simply go from 5.0 to 10.0 GT/s?


1. Cost: 10.0 GHz is considered a ceiling of
sorts for ordinary FR4 circuit boards. For
example, lower-loss PCB materials would be
needed, along with more engineering effort.
2. Power: Higher frequency uses substantially
more power.
3. Compatibility: That frequency would not be
compatible with previous transmission
channel designs and lengths (client: one
connector & 14”; server: two connectors &
20”).
Earlier Throughput 408 425

Aggregate BW   Link Width
(GB/s)         x1    x2   x4   x8   x12   x16   x32
Gen 1          0.5   1    2    4    6     8     16
Gen 2          1     2    4    8    12    16    32

Derivation of these numbers:
• Bandwidth described as “aggregate”, implying simultaneous traffic in
  both directions
• Per direction: (2.5 GT/s) × 1 Lane = 2.5 Gb/s
• Bandwidth loss of 20% at Rx due to 8b/10b:
  2.5 Gb/s × (1 Byte / 10 bits) = 250 MB/s
• Bidirectional: 250 MB/s × 2 = 500 MB/s = 0.5 GB/s (aggregate)

GT = Giga-Transfers, Gb = Gigabits, GB = Gigabytes
Gen3 Throughput 408 426

Aggregate BW   Link Width
(GB/s)         x1    x2   x4   x8   x12   x16   x32
Gen 1          0.5   1    2    4    6     8     16
Gen 2          1     2    4    8    12    16    32
Gen 3          2     4    8    16   24    32    64

Derivation of Gen3 numbers:
• Per direction: (8.0 GT/s) × 1 Lane = 8.0 Gb/s
• 8.0 Gb/s × (1 Byte / 8 bits) = 1.0 GB/s
• Bandwidth loss of 1.5% at Rx using 128b/130b: 1.0 GB/s × 128/130 =
  0.985 GB/s (it’s actually 0.985 GB/s, but 1.0 GB/s sounds better and
  is pretty close)
• Bidirectional: 1.0 GB/s × 2 = 2.0 GB/s (aggregate)

GT = Giga-Transfers, Gb = Gigabits, GB = Gigabytes
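
The table entries for all three generations come from the same formula, differing only in the encoding overhead. The sketch below works them through; the function name and parameter names are illustrative.

```python
def aggregate_gb_per_s(rate_gt_per_s, lanes, payload_bits, coded_bits):
    """Aggregate (both directions) bandwidth in GB/s after encoding overhead."""
    per_direction_gbit = rate_gt_per_s * lanes * payload_bits / coded_bits
    return 2 * per_direction_gbit / 8   # two directions, 8 bits per byte

gen1 = aggregate_gb_per_s(2.5, 1, 8, 10)      # 8b/10b
gen2 = aggregate_gb_per_s(5.0, 1, 8, 10)      # 8b/10b
gen3 = aggregate_gb_per_s(8.0, 1, 128, 130)   # 128b/130b

print(gen1, gen2, round(gen3, 3))
```

The Gen3 x1 figure comes out just under 2 GB/s aggregate, which the table rounds up.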
New Speed Model 409 427

• New speed of 8.0 GT/s supported by:
  - Increasing internal clock frequency to 8.0 GHz
  - Dropping 8b/10b encoding. Requires receivers to tolerate what
    8b/10b prevented:
    - DC wander: to mitigate this, the series capacitor value is
      changed from 75-200 nF to 180-265 nF
    - Long run lengths: modified scrambling and TS1s help, and new CDR
      designs are also less sensitive to absence of edges
• Other optimizations are included, such as dropping the END symbol
  from DLLPs and TLPs

Encapsulation 409 428

Two levels of encoding are defined:

• Physical layer encapsulation
  - Rules defined for placing packets across Lanes and Blocks
  - Five Data Stream packet tokens are defined: IDL, SDP, STP, EDB,
    EDS.
  - Several Ordered Sets are defined
• Lane level encoding (Blocks)
  - Blocks are normally 130 bits each and there are only two types,
    indicated by the first 2 bits of the block, called “Sync” bits
    - Data Block: 128-bit payload
    - Ordered Set Block: byte patterns replace the 8b/10b control
      characters
128b/130b Lane Encoding 410 429

• Each Lane uses 130 bits at a time (called a Block) to track what is
  being sent. Blocks consist of 16 ordinary bytes plus a 2-bit Sync
  Field at the beginning
• Packets are still striped across all available Lanes as before, but
  now using Blocks
• Symbol and packet alignment can’t be done based on “K” characters
  now because there’s no longer a distinction. Another method is
  needed.

[Figure: one Block on a Lane. The 2-bit Sync Field (H0, H1) is
followed by Symbol 0 through Symbol 15, each sent as bits 0 through 7.]

Two Types of Blocks 410 430

• Each Lane encodes a block
• Blocks are always 130 bits long and deliver 2 bits to indicate the
  block type followed by 128 bits of payload (2 + 128)
• 2-bit Sync Field at the start of a block encodes for two types of
  blocks
  - Ordered Set block; Sync bits = 01b (H0=1, H1=0)
  - Data block; Sync bits = 10b (H0=0, H1=1)

Block Encoding 410 431

[Figure: Data Block. The Sync Field (H0, H1) is followed by the
128-bit payload, Symbol 0 through Symbol 15, each sent as bits 0
through 7.]

The bit ordering in the spec can be a little confusing: 10b for the Sync value
means 0 is the LSB and will be the first bit going out, so the transmission
sequence will be 01 on the Lane for a Data Block.
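
The LSB-first ordering is easy to see by spelling out the bits in wire order. The helper below is a sketch with a hypothetical name (`block_bits`); it ignores scrambling and just serializes the Sync Field followed by the 16 symbols.

```python
DATA_BLOCK = 0b10         # Sync value as written in the spec
ORDERED_SET_BLOCK = 0b01

def block_bits(sync: int, payload: bytes) -> str:
    """Return the 130 bits of a Block in the order they appear on the Lane."""
    if len(payload) != 16:
        raise ValueError("a Block carries exactly 16 symbols")
    bits = [(sync >> i) & 1 for i in range(2)]         # H0 first, then H1
    for symbol in payload:
        bits += [(symbol >> i) & 1 for i in range(8)]  # bit 0 of each symbol first
    return "".join(map(str, bits))

wire = block_bits(DATA_BLOCK, bytes(16))
print(len(wire), wire[:2])   # Sync bits appear on the Lane as H0 then H1
```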

Block Alignment or Synchronization 411 432

• Question: If Blocks are identified by the 2 Sync bits at the
  beginning of the 130 bits, how does the receiver know when to look
  for those 2 bits?
  - Said another way, how do we establish “Block Lock” instead of the
    Symbol Lock of 8b/10b?
• Answer: look for the Electrical Idle Exit Ordered Set (EIEOS).
  - This block consists of alternating bytes of FFh and 00h, making it
    easy to recognize.
  - Whenever these alternating bytes are seen during training, an
    EIEOS is in progress and indicates a block boundary (see next
    slide).
EIEOS 411 433

• Gen3 pattern: Sync Header 01b followed by 16 alternating symbols of
  00h and FFh
• Provides a low-frequency, recognizable pattern to help identify
  block boundaries
• Bypasses scrambling

EIEOS
Symbol       Value
Sync Header  01
0            00000000
1            11111111
2            00000000
3            11111111
4            00000000
...
13           11111111
14           00000000
15           11111111
Achieving Block Lock 411 434

• EIEOS defines block alignment in the Configuration and Recovery
  LTSSM states
• Three phases involved:
  1. Unaligned: Entered after a period of Electrical Idle. Receivers
     monitor for EIEOS, adjust alignment to it, and move to next phase.
  2. Aligned: Receivers monitor for SDS (Start of Data Stream) Ordered
     Set while adjusting to any new EIEOSs that arrive. Once SDS is
     seen, go to next phase.
  3. Locked: Receivers do not adjust Block alignment in this phase. A
     Data Stream is expected now and problems are handled by backing
     up to an earlier phase when the Data Stream is stopped.
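
The first two phases can be sketched as a pattern search plus a tiny state machine. Everything here is illustrative: `to_wire`, `find_block_boundary` and `next_phase` are hypothetical names, the bit stream is a string in arrival order, and the fall-back from Locked to an earlier phase is not modeled.

```python
def to_wire(sync: int, payload: bytes) -> str:
    """Serialize a Block LSB-first: Sync bits H0, H1, then bit 0..7 of each symbol."""
    bits = [(sync >> i) & 1 for i in range(2)]
    for symbol in payload:
        bits += [(symbol >> i) & 1 for i in range(8)]
    return "".join(map(str, bits))

# Gen3 EIEOS: Sync Header 01b, then alternating 00h / FFh symbols.
EIEOS = to_wire(0b01, bytes([0x00, 0xFF] * 8))

def find_block_boundary(bits: str) -> int:
    """Index where an EIEOS starts (a Block boundary), or -1 if not seen."""
    return bits.find(EIEOS)

def next_phase(phase: str, event: str) -> str:
    """Unaligned -> Aligned on EIEOS; Aligned -> Locked on SDS."""
    if phase == "Unaligned" and event == "EIEOS":
        return "Aligned"
    if phase == "Aligned" and event == "SDS":
        return "Locked"
    return phase

stream = "1101" + EIEOS + EIEOS   # four stray bits before the first EIEOS
print(find_block_boundary(stream))
```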
Ordered Set Block 412 435

Gen3 Ordered Sets are:


 TS1 & TS2 – some bytes are scrambled
 FTS – bypasses scrambling
 SDS – bypasses scrambling
 EIOS – bypasses scrambling
 SOS – bypasses scrambling
 EIEOS – bypasses scrambling

Ordered Set Block Encoding 412 436

 Ordered Set Blocks are also 16 symbols long, except for SKP
Ordered Sets which can be 8, 12, 16, 20, or 24 symbols as
SKPs are added or deleted.
 The transmitter always sends a full 16-byte SOS, and the only case
where it can be larger or smaller is when the packets are going
through a repeater (device that receives and forwards a packet)
 An ordered set is completely contained within a block

[Figure: Ordered Set Block. The Sync Field (H0, H1) = (1 0) is
followed by the 128-bit payload, Symbol 0 through Symbol 15, each sent
as bits 0 through 7.]

Example Ordered Set: FTS 413
412 437

• Used to recover from L0s
• Bypasses scrambling
• EIEOS must be sent after every 32 FTSs and once more after the last
  one

TS1 and TS2 Ordered Sets 438

• Ordered Sets look different from Gen1 and Gen2 because control
  characters are unavailable
• Scrambling is modified to make up for the loss of 8b/10b encoding
  and is used now to improve transition density:
  - Byte 0 bypasses scrambling, making it easier to recognize which
    Ordered Set is being delivered
  - Bytes 1-13 are scrambled
  - Bytes 14-15 bypass scrambling if their values are used to
    facilitate DC Balance, but are scrambled if DC Balance action is
    not required.

TS1, TS2 at 8 GT/s
Symbol  Field      Contents
0                  1Eh for TS1, 2Dh for TS2
1       Link #     0-31d (PAD encoded as F7h)
2       Lane #     0-31d (PAD encoded as F7h)
3       # FTS      # of FTS Ordered Sets required by receiver
4       Rate ID    Bit 3 indicates 8 GT/s support; 5.0 and 2.5 must also be supported
5       Train Ctl
6-9     EQ Info
10-13   TS ID      4Ah for TS1, 45h for TS2
14      DC Bal
15      DC Bal
Data Stream and Data Blocks 413 439

• For 128b/130b encoding, a “Data Stream” is basically in effect when
  the Link is in the L0 state
  - Data Blocks are transmitted in a Data Stream
  - The only Ordered Set that can be sent while in a Data Stream is
    the SOS for clock compensation
• Entering L0 from other LTSSM states requires sending a Start of Data
  Stream (SDS) Ordered Set
• Exiting L0 for a different state requires sending an End of Data
  Stream (EDS) token at the end of the last Data Block before sending
  Ordered Sets

Framing Tokens in a Data Stream 414 440

• Symbol patterns now serve as Data Block framing symbols called
  “Framing Tokens”, and five are defined:

Framing Token             Number of Symbols  Symbol Encodings            Comments
Logical Idle (IDL)        1                  0000 0000b
Start DLLP (SDP)          2                  Symbol 0: 1111 0000b
                                             Symbol 1: 1010 1100b
Start TLP (STP)           4                  Symbol 0: xxxx 1111b        First 4 LSBs of all 1’s indicate STP;
                                             Symbols 1-3: based on TLP   remaining bits are length, parity, and frame CRC
End Bad (EDB)             4                  All 4 symbols are the       Confirms that previous TLP has been nullified
                                             same: 1100 0000b
End of Data Stream (EDS)  4                  Symbol 0: 0001 1111b        Same encoding as a TLP of length = 1 DW
                                             Symbol 1: 1000 0000b
                                             Symbol 2: 1001 0000b
                                             Symbol 3: 0000 0000b
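
The table maps directly onto a small classifier keyed on the first symbols of a token. This is a sketch: `classify_token` is a hypothetical name, and the STP length, parity, and Frame CRC fields are not validated.

```python
EDB = bytes([0xC0] * 4)
EDS = bytes([0x1F, 0x80, 0x90, 0x00])
SDP = bytes([0xF0, 0xAC])

def classify_token(symbols: bytes) -> str:
    """Name the Framing Token that starts at symbols[0]."""
    first = symbols[0]
    if first == 0x00:
        return "IDL"
    if symbols[:2] == SDP:
        return "SDP"
    if symbols[:4] == EDB:
        return "EDB"
    if symbols[:4] == EDS:       # checked before STP: it shares the 1111b low nibble
        return "EDS"
    if first & 0x0F == 0x0F:     # four LSBs of all 1's indicate STP
        return "STP"
    return "invalid"
```

EDS has to be tested before the generic STP check because its first symbol, 0001 1111b, also has all 1's in the four LSBs.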
Gen3 Frame Token Examples 415 441

Layout of Gen3 TLP and DLLP 414 442

TLP Layout
  Symbols 0-3 (bits 7-0 each) | Header and Data Payload (same as 2.0) | LCRC (4 bytes, same as 2.0)

DLLP Layout
  Symbol 0 | Symbol 1 (ACh) | Symbols 2-5: Payload (same as 2.0) | Symbols 6, 7: CRC (2 bytes, same as 2.0)

Framing Rules - TLP 417 443

• TLP length varies based on payload
• Length field includes all parts of the TLP: framing, header, data,
  LCRC, and ECRC
• Length field is protected by a 4-bit Frame CRC (guaranteed
  double-bit-flip detection)
• An even parity bit covering both Length and Frame CRC guarantees
  detection of an odd number of bit flips. Combined with CRC,
  triple-bit-flip detection is guaranteed.
• 12-bit sequence number is still extended to 16 bits for calculating
  LCRC, but the extra 4 bits are not sent and are assumed to be zero
  at the receiver.

Framing Rules - DLLP 417 444

• In a multi-Lane Link, if Logical Idle is sent on one Lane, all
  subsequent Lanes must use either IDL or EDS for that symbol time
• DLLP is 8 symbols long
  - First two symbols must be F0h and ACh
  - Symbols 2-5 are DLLP payload
  - Symbols 6 and 7 are the 16-bit CRC

Moki Anji (moki@ synopsys.com)


Do Not Distribute MindShare.com © 2013
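The 8-symbol DLLP layout can be sketched as a framing helper. `frame_dllp` is a hypothetical name, the CRC value is supplied by the caller rather than computed, and the byte order used for the CRC here is an assumption for illustration only.

```python
SDP_TOKEN = bytes([0xF0, 0xAC])   # Symbols 0 and 1

def frame_dllp(payload: bytes, crc16: int) -> bytes:
    """Assemble SDP token + 4 payload symbols + 2 CRC symbols (8 symbols total)."""
    if len(payload) != 4:
        raise ValueError("DLLP payload is 4 symbols")
    # CRC byte order is assumed (big-endian) purely for this sketch.
    return SDP_TOKEN + payload + crc16.to_bytes(2, "big")
```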
Transmit Physical Layer Details – Gen3 422 445

Buffer and Multiplexer Control 421 446

Byte Striping 423 447

Packet Transmission in x1 Link 448

Lane 0
Sync: 0, 1
Symbols 0-3: STP Token (Symbol 0 = xxxx 1111b; Length, CRC, Parity, Sequence Number)
Symbols 4-15: 4 DW TLP Header (straddles Block boundary)
Sync: 0, 1
Symbols 0-3: remainder of the 4 DW TLP Header
Packet Transmission in x4 Link 424 449

Packet Transmission in x8 Link 425 450

Lanes 0-7 (one Symbol per Lane per Symbol time):
Sync: 0, 1 on every Lane
Symbol 0: STP Token: Length=7, CRC, Parity, Seq Num (Lanes 0-3); TLP (Lanes 4-7)
Symbols 1-2: TLP
Symbol 3: LCRC (Lanes 0-3); SDP Token and start of DLLP (Lanes 4-7)
Symbol 4: DLLP (Lanes 0-3); Logical Idle: IDL (Lanes 4-7)
Symbol 5: IDL on all Lanes
Symbol 6: STP: Length=23, CRC, Parity, Seq Num (Lanes 0-3); DW 2 (Lanes 4-7)
Symbols 7-15: TLP, ending with DW 19 and DW 20 in Symbol 15 (TLP straddles Block boundary)
Sync: 0, 1 on every Lane
Symbol 0: DW 21, DW 22
Symbol 1: LCRC (Lanes 0-3); IDL (Lanes 4-7)

Since a Unit Interval (bit time) is 0.125 ns at 8 GT/s, a Symbol Time (8 bits) is 1 ns.
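The timing arithmetic in the note above can be checked with a few lines. The function names are illustrative; the Block size of 2 Sync bits plus 16 Symbols is taken from the Block diagrams in this section.

```python
def unit_interval_ns(rate_gt_per_s: float) -> float:
    """One bit time in nanoseconds."""
    return 1.0 / rate_gt_per_s

def symbol_time_ns(rate_gt_per_s: float) -> float:
    """One 8-bit Symbol time in nanoseconds."""
    return 8 * unit_interval_ns(rate_gt_per_s)

def block_time_ns(rate_gt_per_s: float) -> float:
    """One Block: 2 Sync bits plus 16 Symbols of 8 bits."""
    return (2 + 16 * 8) * unit_interval_ns(rate_gt_per_s)
```

At 8 GT/s these give 0.125 ns, 1 ns, and 16.25 ns respectively.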
Nullified Packet 425 451

• The old End Good symbol is dropped, and TLPs will be assumed good unless an EDB is seen immediately after them.
• Appending the EDB pattern at the end of the TLP nullifies it. These symbols are part of the TLP but aren’t included in the Length field.
• LCRC is still inverted, as it always has been, to help identify this case.
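A receiver-side check for the nullified case might look like the sketch below. `is_nullified` and its parameters are hypothetical names, and the 32-bit LCRC value is assumed to have been computed elsewhere.

```python
EDB_TOKEN = bytes([0xC0] * 4)   # four identical EDB symbols

def is_nullified(received_lcrc: int, computed_lcrc: int, following: bytes) -> bool:
    """True when the LCRC arrives inverted and an EDB token follows the TLP."""
    lcrc_inverted = received_lcrc == (~computed_lcrc & 0xFFFFFFFF)
    return lcrc_inverted and following[:4] == EDB_TOKEN
```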
Nullified Packet in x8 Link 426 452

Lanes 0-7:
Sync: 0, 1 on every Lane
Symbol 0: STP Token: Length=7, CRC, Parity, Seq Num (Lanes 0-3); TLP (Lanes 4-7)
Symbols 1-2: TLP
Symbol 3: LCRC (Lanes 0-3); SDP Token and start of DLLP (Lanes 4-7)
Symbol 4: DLLP (Lanes 0-3); IDL (Lanes 4-7)
Symbol 5: IDL on all Lanes
Symbol 6: STP: Length=23, CRC, Parity, Seq Num (Lanes 0-3); DW 2 (Lanes 4-7)
Symbols 7-15: TLP, ending with DW 19 and DW 20 in Symbol 15
Sync: 0, 1 on every Lane
Symbol 0: DW 21, DW 22
Symbol 1: LCRC (inverted) (Lanes 0-3); EDB Token (Lanes 4-7): nullified packet
Rules for Insertion of Ordered Sets 426 453

• Ordered Sets appear simultaneously on every Lane
• SKP Ordered Set (SOS)
  • Previous Data Block must finish with EDS. This won’t actually end the Data Stream if a Data Block follows the SOS.
  • SOS always originates with 16 bytes (12 SKP symbols + 1 SKP_END symbol + 3 bytes of LFSR output), but SKP symbols may be added or removed in groups of 4 before it reaches the neighbor if the SOS passes through a repeater.
  • SOS must finish with SKP_END symbol followed by LFSR output as needed to complete the block. Spec: LFSR bytes help trace tools acquire lock and do Lane deskew, since they can’t force a transition to Recovery.
  • Block that follows SOS must be a Data Block if Data Stream is to continue; consecutive SOS’s are not allowed without a Data Block in between them.
  • Timing: must be scheduled every 370 to 375 Blocks (375 x 16 = 6000 bytes)
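The SOS sizes implied by these rules can be enumerated with a short sketch. The 4 to 20 SKP range is an assumption inferred from the "shortest possible SOS" of 8 bytes and the five Block-size variations mentioned on the repeater slide; the names are illustrative.

```python
SKP_GROUP = 4          # repeaters add or remove SKP symbols in groups of 4
TRAILER = 1 + 3        # SKP_END symbol plus 3 bytes of LFSR output

def sos_length(skp_symbols: int) -> int:
    """Total SOS length in bytes for a given number of SKP symbols."""
    if skp_symbols % SKP_GROUP or not 4 <= skp_symbols <= 20:   # assumed range
        raise ValueError("SKP count must be 4, 8, 12, 16 or 20")
    return skp_symbols + TRAILER

VALID_SOS_LENGTHS = [sos_length(n) for n in range(4, 21, SKP_GROUP)]
MAX_INTERVAL_BYTES = 375 * 16   # longest scheduling interval per Lane
```

An originated SOS has 12 SKP symbols, so `sos_length(12)` is the 16-byte case.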
Example: x1 Ordered Set Construction 427 454

SOS Sent by Transmitter 428 455

Lanes 0-7:
Sync: 0, 1 on every Lane (Data Block begins)
Symbol 0: STP Token: Length = 7, CRC, Parity, Seq Num (Lanes 0-3); TLP (Lanes 4-7)
Symbols 1-2: TLP
Symbol 3: LCRC (Lanes 0-3); SDP Token and start of DLLP (Lanes 4-7)
Symbol 4: DLLP (Lanes 0-3); IDL (Lanes 4-7)
Symbols 5-14: IDL on all Lanes
Symbol 15: IDL (Lanes 0-3); EDS Token (Lanes 4-7): End of Data Stream Token
Sync: 1, 0 on every Lane (Ordered Set begins)
Symbols 0-11: SKP on all Lanes
Symbol 12: SKP_END on all Lanes
Symbols 13-15: LFSR on all Lanes (output of LFSR with no data)
Sync: 0, 1 on every Lane (Data Block begins, continuing the Data Stream)
SOS Modified by Repeater For De-Skewing 456

Lanes 0-7:
Sync: 0, 1 on every Lane (Data Block)
Symbol 0: STP Token: Length = 7, CRC, Parity, Seq Num (Lanes 0-3); TLP (Lanes 4-7)
Symbols 1-2: TLP
Symbol 3: LCRC (Lanes 0-3); SDP Token and start of DLLP (Lanes 4-7)
Symbol 4: DLLP (Lanes 0-3); IDL (Lanes 4-7)
Symbols 5-14: IDL on all Lanes
Symbol 15: IDL (Lanes 0-3); EDS Token (Lanes 4-7)
Sync: 1, 0 on every Lane (Ordered Set: shortest possible SOS)
Symbols 0-3: SKP on all Lanes
Symbol 4: SKP_END on all Lanes
Symbols 5-7: LFSR on all Lanes
Sync: 0, 1 on every Lane (new Block starts after just 8 bytes instead of 16)

Rx must recognize and tolerate the 5 variations in Block size for SOS.
“Consecutive” SOSs in a Data Stream 457

Lanes 0-7:
Sync: 1, 0 on every Lane (Ordered Set)
Symbols 0-11: SKP on all Lanes
Symbol 12: SKP_END on all Lanes
Symbols 13-15: LFSR on all Lanes
Sync: 0, 1 on every Lane (Data Block continues the Data Stream)
Symbols 0-14: IDL on all Lanes
Symbol 15: IDL (Lanes 0-3); EDS Token (Lanes 4-7)
Sync: 1, 0 on every Lane (Ordered Set)
Symbols 0-11: SKP on all Lanes
Symbol 12: SKP_END on all Lanes
Symbols 13-15: LFSR on all Lanes
Sync: 0, 1 on every Lane (Data Block)
Error Handling 458

• Receiver Errors include framing violations, Sync violations, CRC and parity errors, etc.
• All Receiver errors are handled by directing the LTSSM to Recovery from L0
  • TLPs and DLLPs received while in Recovery are discarded
  • Since all physical layer errors will be handled this way, round-trip time to pass through Recovery state should be short (less than 1 µs), so bit errors won’t adversely affect performance
• Expected error rate is about once an hour, and this method of handling them is considered reasonable.
Scrambling 430 459

Data Scrambling 431 460

• Scrambling polynomial for 8 GT/s:
  G(x) = X^23 + X^21 + X^16 + X^8 + X^5 + X^2 + 1
[Figure: 23-stage LFSR (X0 ... X22) with XOR taps; Data In is XORed with the LFSR output to produce Data Out]
• Scrambling provides, over time, the transition density that 8b/10b used to provide per symbol
• Disabling scrambling is not allowed for Gen3
  • Disabling was useful for specifying bit patterns; it is still supported at the 2.5 and 5.0 GT/s rates
  • But scrambling does more in Gen3 and the Link isn’t expected to work reliably without it, so it cannot be disabled at the Gen3 rate.
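A minimal sketch of an LFSR built on the polynomial above is shown here. The Galois arrangement, the choice of bit 22 as the keystream output, and the LSB-first bit order within each byte are assumptions made for illustration and should be checked against the spec before use; `lfsr_step` and `scramble` are hypothetical names.

```python
MASK = (1 << 23) - 1
# Lower-order terms of G(x): x^21 + x^16 + x^8 + x^5 + x^2 + 1
POLY_TAPS = (1 << 21) | (1 << 16) | (1 << 8) | (1 << 5) | (1 << 2) | 1

def lfsr_step(state: int) -> int:
    """Advance a 23-bit Galois LFSR by one bit."""
    feedback = (state >> 22) & 1
    state = (state << 1) & MASK
    if feedback:
        state ^= POLY_TAPS
    return state

def scramble(data: bytes, seed: int) -> bytes:
    """XOR each data bit with one LFSR output bit (assumed: bit 22, LSB first)."""
    state = seed
    out = bytearray()
    for byte in data:
        scrambled = 0
        for bit in range(8):
            key = (state >> 22) & 1
            scrambled |= (((byte >> bit) & 1) ^ key) << bit
            state = lfsr_step(state)
        out.append(scrambled)
    return bytes(out)
```

Because scrambling is a plain XOR with the keystream, running `scramble` again with the same seed restores the original bytes, which is how a de-scrambler with a synchronized LFSR recovers the data.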
Initializing the Scrambler 431 461

• To reduce cross-talk, the initial value of the LFSR (seed value) depends on the Lane assignment:
  • 0: 1EFEDCh, 1: 6EF030h, 2: 0371BCh, 3: 6D818Ch
  • 4: 247840h, 5: 49F9CCh, 6: 39F720h, 7: 700EECh
  • For higher-numbered Lanes, the pattern repeats; Lane 8 has the same initial value as Lane 0
• Reinitialized on receipt of EIEOS or FTS (which are not scrambled, so they’re always recognizable).
• Sync bits are not scrambled and don’t advance the LFSR
• LFSRs advance for Ordered Set symbols, even though those bytes bypass scrambling, except for any SOS symbols.
  • The number of SKP symbols that leave the Tx may be different from the number that reaches the Rx if a repeater is used between them, so advancing on SOS symbols could put the scramblers out of sync.
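The per-Lane seed selection can be sketched as a table lookup. The seed values are copied as printed on the slide; their bit ordering relative to any particular LFSR implementation is not stated here and should be confirmed against the spec. `scrambler_seed` is a hypothetical name.

```python
# Seed values as listed on the slide, indexed by Lane number modulo 8.
LANE_SEEDS = (
    0x1EFEDC, 0x6EF030, 0x0371BC, 0x6D818C,
    0x247840, 0x49F9CC, 0x39F720, 0x700EEC,
)

def scrambler_seed(lane: int) -> int:
    """Initial LFSR value for a Lane; the pattern repeats every 8 Lanes."""
    if lane < 0:
        raise ValueError("Lane number must be non-negative")
    return LANE_SEEDS[lane % 8]
```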
Serializer, Sync Bits Mux and Tx Clk 434 462

Receive Physical Layer Details – Gen3 436 463

Clock and Data Recovery (CDR) Logic 437 464

Clock and Data Recovery (CDR) Logic 437 465

Achieving Block Alignment with EIEOS 438 466

Elastic Buffer and Rx Clock Compensation 440 467

Lane-to-Lane Deskew 444 468

De-Scrambler 444 469

Byte Un-Striping 445 470

Packet Filtering and Rx Buffer 446 471

Physical Layer Electrical
(Gen1, Gen2 and Gen3)

Electrical Physical Layer 450 473

[Figure: two devices, each with a Physical Layer split into Logical and Electrical sub-blocks (Tx and Rx in each). Each Electrical Tx drives the Link's D+/D- pair to the other device's Rx through AC-coupling capacitors CTX.]
Backward Compatibility 448 474

• New versions must be able to operate with older versions
• Compliance at one speed doesn’t mean lower speeds will work automatically
  • This is due to differences in clock constraints. Important that all speeds are tested.
• Link always starts up and trains at 2.5 GT/s and negotiates and trains to higher speed later
Differential Transmitter/Receiver 451 475

• Transmitter and receiver are AC coupled.
• Tx DC common-mode voltage is between 0 V and 3.6 V.
Receiver DC Common Mode Voltage 467 476

Common Mode Noise Rejection 452 477

High Speed Signaling 454 478

• Spread Spectrum Clocking is optional
  • Data rate can be modulated (spread) from +0% to -0.5% with respect to nominal frequency
  • Modulation rate not to exceed 33 kHz
  • Max clock variation of +/- 300 ppm (or 600 ppm difference between two ports) must still be maintained, usually requiring a common clock.
• Impedance matching, termination, and other signal-integrity considerations all become more important at higher speeds
• Transmitters and receivers must be ESD and short-circuit tolerant to facilitate hot-plug
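A quick calculation shows why SSC usually calls for a common clock: the 0.5% down-spread is much larger than the 600 ppm allowed between two ports. This sketch assumes a 100 MHz reference clock, which is not stated on the slide, and the function names are illustrative.

```python
def ssc_range_mhz(nominal_mhz: float = 100.0, down_spread_percent: float = 0.5):
    """Lowest and highest clock frequency under down-spreading (+0% to -0.5%)."""
    lowest = nominal_mhz * (1 - down_spread_percent / 100)
    return lowest, nominal_mhz

def percent_to_ppm(percent: float) -> float:
    """Convert a percentage to parts per million."""
    return percent * 10_000
```

With these defaults the clock sweeps between about 99.5 MHz and 100 MHz, a 5000 ppm excursion, so two ports with independent spread clocks could not stay within 600 ppm of each other.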
Spread Spectrum Clocking (SSC) 454 479

Signal Freq. Less than Half of Tx Clock 454 480

Shared RefClk Architecture 456 481

Independent RefClk Architecture 457 482

Data Clock Recovery at Receiver 457 483

Receiver Detection at Transmitter 461 484

Differential Signaling 463 485

• Gen1/Gen2 Differential Peak Voltage = 400-600 mV
• Gen3 Differential Peak Voltage = 400-650 mV
• Electrical Idle Diff Voltage = 0-20 mV

[Figure: the two single-ended signals, D+ and D-]

VDIFFp = max |VD+ - VD-|          Example: 200 mV - (-200 mV) = 400 mV
VDIFFp-p = 2 * max |VD+ - VD-|    Example: 2 * (400 mV) = 800 mV
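The two definitions above translate directly into code. The sample format (pairs of D+ and D- voltages in mV, measured relative to the common-mode voltage) and the function names are assumptions for illustration.

```python
def vdiff_peak(samples):
    """VDIFFp = max |VD+ - VD-| over (d_plus_mv, d_minus_mv) pairs."""
    return max(abs(d_plus - d_minus) for d_plus, d_minus in samples)

def vdiff_peak_to_peak(samples):
    """VDIFFp-p = 2 * max |VD+ - VD-|."""
    return 2 * vdiff_peak(samples)
```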
Differential Peak vs. Peak-to-Peak Voltage 464 486

Voltage Margining For Testing 465 487

De-emphasis 469 488

• Higher frequency means greater sensitivity to jitter
• De-emphasis helps by reducing transmitter power during repeated bits
• Goals: Reduce data-dependent jitter and alleviate ISI (Inter-Symbol Interference)

[Figure: de-emphasized waveform with labels marking the first bit after each value change]
ISI (Inter-Symbol Interference) at Rx 471 489

[Figure: received waveform after 5 bits in a row of the same value. Without De-Emphasis: problem area. With De-Emphasis: problem area dramatically decreased.]
Effects of ISI on Receiver 472 490

[Figure: positive and negative polarity signals overlaid. Without De-Emphasis: eye opening reduced. With De-Emphasis: good eye opening.]
Selectable De-Emphasis for Gen2 Spec. 473 491

• Gen1 de-emphasis was always -3.5 dB
  • Designed to match a common bus model
• Gen2 de-emphasis is selectable
  • At 2.5 GT/s: -3.5 dB
  • At 5.0 GT/s: -6.0 dB (-3.5 dB optional)
• Gen3 needs more sophisticated wave shaping; de-emphasis not sufficient for that speed. Better equalization is needed.

[Figures: Gen1, 2.5 GT/s with -3.5 dB de-emphasis; Gen2, 5.0 GT/s with -6.0 dB de-emphasis]
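To get a feel for what the two de-emphasis settings mean in voltage terms, the standard decibel relation for voltage ratios, ratio = 10^(dB/20), can be applied. The function name is illustrative.

```python
def db_to_voltage_ratio(db: float) -> float:
    """Convert a voltage gain in dB to a linear voltage ratio."""
    return 10 ** (db / 20)
```

By this relation, -3.5 dB leaves roughly two thirds of the full swing on repeated bits, and -6.0 dB leaves roughly half.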
Link Control and Status 2 492

[Figure: Link Control 2 and Link Status 2 registers. Callout on the de-emphasis level bit: read only, initialized by hardware; 0 = -6.0 dB, 1 = -3.5 dB.]
Adjustable De-emphasis 493

• Informing neighboring device of requested de-emphasis level at higher speeds

TS1 Symbols
0       COM
1       Link #
2       Lane #
3       # FTS
4       Rate ID
5       Train Ctl
6-15    TS ID

Rate Identifier (Symbol 4)
Bit 0     Reserved, = 0
Bit 1     Indicates 2.5 GT/s support
Bit 2     Indicates 5.0 GT/s support
Bit 3:5   Reserved, = 0
Bit 6     Autonomous Change / Selectable De-emphasis
Bit 7     Speed Change

Notes
In Recovery, bit 6 indicates the de-emphasis preference when running at speeds higher than 2.5 GT/s. In Polling, it specifies the de-emphasis level for the other device (1 = -3.5 dB, 0 = -6 dB).
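Decoding the Rate Identifier symbol is a matter of masking the bits listed above. In this sketch the function name and dictionary keys are hypothetical, and bit 6 is interpreted only as a de-emphasis selection; its Autonomous Change meaning in other LTSSM states is not modeled.

```python
def decode_rate_id(rate_id: int) -> dict:
    """Split the TS1 Rate Identifier (Symbol 4) into its defined bits."""
    bit6 = bool(rate_id & 0x40)
    return {
        "supports_2_5_gt": bool(rate_id & 0x02),
        "supports_5_0_gt": bool(rate_id & 0x04),
        "selectable_deemphasis_bit": bit6,
        # Per the slide notes: 1 = -3.5 dB, 0 = -6 dB
        "deemphasis_db": -3.5 if bit6 else -6.0,
        "speed_change": bool(rate_id & 0x80),
    }
```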
Low Voltage Option 474 494

• Two transmitter voltage swings are defined
  • Full swing (required) uses de-emphasis
  • Low/Half swing (optional) uses no de-emphasis
• Margins at receiver are always the same; both selection and selection method are design specific
Gen3 Electricals 474 495

• Beginning with 8.0 GT/s, signal compensation becomes an active process
  • Timing budget is tighter, signal environment needs careful, individual compensation per Lane
  • A sequence is used to actively test signal quality, suggest equalization parameters to transmitter, evaluate the results and repeat that until signal quality is acceptable
• Transmitter: Equalization still required
  • But it’s now described by a 3-tap equalizer whose coefficients can be adjusted by the receiving device
  • Presets define coarse control before going to 8.0 GT/s
  • Coefficients give fine-grain control
• Receiver: Optional equalization is described
  • Single-tap DFE (Decision Feedback Equalizer) recommended for a long channel
3-Tap Equalizer 475 496

• Sophisticated equalization allows more signal shaping than de-emphasis alone.
Equalization Effect 497

[Figure: scope trace (courtesy of PLX), highlighted for clarity, with the voltage levels defined in the spec (Vd, Va, Vc, Vb) marked. One area of the trace is reproduced in a later drawing.]
Effect of Transmitter Equalization 477 498

• Transmitted signal has 4 voltage levels:
  • Vd: when previous bit, current bit, next bit all toggle in value.
  • Va: when current bit toggles from previous bit, next bit same as current bit.
  • Vc: when current bit same as previous bit, next bit toggles.
  • Vb: when previous bit, current bit, next bit are all the same value

[Figure: differential version of the trace on the previous page]
Tx Equalizer Example 499

• There are 11 preset settings that can be selected. One of those is P7, which defines
  • c-1 = -0.1, c+1 = -0.2
  • And, since the sum of the absolute values of all coefficients must be unity, we can calculate that c0 must = 0.7
• Resulting voltage multipliers:
  • Vd = (0.7 + (0.1) + (0.2)) = 1.0 (full-height signal)
  • Va = (0.7 + (-0.1) + (0.2)) = 0.8
  • Vc = (0.7 + (0.1) + (-0.2)) = 0.6
  • Vb = (0.7 + (-0.1) + (-0.2)) = 0.4

(illustrated on the next slide)
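The P7 arithmetic above generalizes to any pair of pre-cursor and post-cursor coefficients. The sketch below derives c0 from the unity-sum rule and then the four voltage multipliers; the function names are illustrative.

```python
def cursor_coefficient(c_pre: float, c_post: float) -> float:
    """c0 such that |c-1| + |c0| + |c+1| = 1."""
    return 1.0 - abs(c_pre) - abs(c_post)

def voltage_levels(c_pre: float, c_post: float) -> dict:
    """Voltage multipliers Vd, Va, Vc, Vb for the given tap coefficients."""
    c0 = cursor_coefficient(c_pre, c_post)
    pre, post = abs(c_pre), abs(c_post)
    return {
        "Vd": c0 + pre + post,   # previous, current and next bits all toggle
        "Va": c0 - pre + post,   # current toggles from previous, next is the same
        "Vc": c0 + pre - post,   # current same as previous, next toggles
        "Vb": c0 - pre - post,   # all three bits the same
    }
```

Calling `voltage_levels(-0.1, -0.2)` reproduces the P7 values of 1.0, 0.8, 0.6 and 0.4, up to floating-point rounding.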
Summing the 3 Taps 482 500

c-1 = -0.1: Original signal inverted, weighted by 0.1, and shifted one clock earlier
c0 = 0.7: Original signal weighted by 0.7
c+1 = -0.2: Original signal inverted, weighted by 0.2, and shifted one clock later
Summed output (recreates the highlighted area of the trace capture shown earlier)
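Summing the three weighted, shifted copies is a 3-tap FIR filter applied to the bit stream. The sketch below assumes bits are mapped to +1 and -1 and skips the first and last bit, which lack a neighbor; `equalize` is a hypothetical name and the default coefficients are the P7 values from the example.

```python
def equalize(bits, c_pre=-0.1, c0=0.7, c_post=-0.2):
    """Return the equalized level for each bit that has both neighbors."""
    levels = [1 if b else -1 for b in bits]
    out = []
    for n in range(1, len(levels) - 1):
        out.append(
            c_pre * levels[n + 1]     # copy shifted one clock earlier (next bit)
            + c0 * levels[n]          # current bit
            + c_post * levels[n - 1]  # copy shifted one clock later (previous bit)
        )
    return out
```

An isolated bit such as the middle of `[0, 1, 0]` comes out at full height, while the middle of a run such as `[1, 1, 1]` is reduced to the Vb level.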
Gen3 Scope Capture at Receiver 501

• Received signal in backplane environment

[Figures: received signal without Tx Equalization and with Tx Equalization. Pictures courtesy of PLX.]
Rx Equalization Optional for Gen3 493 502

• Filters out low-frequency components
  • Good: improves eye opening and signal integrity
  • Bad: adds complexity, cost and power to Receiver
• Two types are mentioned in the spec:
  • CTLE (Continuous Time Linear Equalization)
  • DFE (Decision Feedback Equalization)
• Spec states:
  • CTLE alone should be enough for short and medium length channels
  • Single-order CTLE feeding into a single-tap DFE is recommended for long channels
Linear Equalization 494 503

• CTLE characteristics:
  • Improves signal for channels without many discontinuities (cable or shorter traces) and is simple and low power
  • But doesn’t handle channel discontinuities very well
  • Spec recommends first-order CTLE to accommodate most trace lengths
  • Adaptive version could have a mechanism to change component values based on feedback
  • Amplifying the signal before filtering is another option but also amplifies noise, which is less desirable

[Figure: simple CTLE, a high-pass filter built from C and R between Tx and Rx]
Example of CTLE 495 504

• Received signal before and after CTLE: low frequencies clearly reduced.

[Figures: pictures courtesy of PLX]
Decision Feedback Equalization (DFE) 495 505

• DFE characteristics
  • Good:
    • Better handling of signal discontinuities
    • No magnification of noise
    • Independent control of each tap
  • Bad:
    • Higher power and cost for SERDES
    • Adjustments limited to number of taps
  • Spec describes a 1-tap DFE to accommodate max trace length, but some vendors use more taps.
2-Tap Rx DFE 495 506

Electrical Idle 507

• Transmitter electrical idle exists when voltage is less than 20 mV peak
• Transmitter sends EIOS to inform receiver that it’s about to go idle

Gen1/Gen2 EIOS (Electrical Idle Ordered Set)
Symbol  Encoding
COM     K28.5
IDL     K28.3
IDL     K28.3
IDL     K28.3

• Note that the Retry Buffer should be empty before going to electrical idle for cases like L1, since recovery may involve going back through the Polling state, which would reset the Link Layer (Linkup = 0), and clear the buffer
Electrical Idle Entry 508

• Receiver detects that link is actually electrically idle in one of two ways:
  • Old method: Squelch detect circuit to check voltage threshold
  • New method: Infer idle by watching for timeout conditions. Examples:
    • L0: no SKP Ordered Sets in 128 µs.
    • Recovery: no exit after 16,000 UI at 5.0 GT/s, or 2,000 UI at 2.5 GT/s
    • Polling.Compliance: no TS1 or TS2 after 128 µs
• Benefits of inferring idle:
  • Easier: detection circuit had trouble at 2.5 GT/s and it’s tougher at 5.0 GT/s
  • Simpler design saves power; not necessary to keep a squelch-detect circuit on
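The UI-based timeouts in the Recovery example can be converted to time for comparison with the 128 µs windows. The helper name is illustrative.

```python
def ui_to_us(ui_count: int, rate_gt_per_s: float) -> float:
    """Convert a number of Unit Intervals to microseconds at a given data rate."""
    ui_ns = 1.0 / rate_gt_per_s          # one UI in nanoseconds
    return ui_count * ui_ns / 1000.0
```

By this conversion, 16,000 UI at 5.0 GT/s is about 3.2 µs and 2,000 UI at 2.5 GT/s is about 0.8 µs.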
Electrical Idle Exit 509

• New Ordered Set (EIEOS) required for speeds above 2.5 GT/s
  • Consists of 16 symbols: one K28.5 followed by 14 K28.7 symbols (EIE symbol) and one TS1 identifier symbol (D10.2).
  • Encoded but not scrambled
  • Result is low-frequency pattern of 5 ones followed by 5 zeros. This pattern is easier to detect than a voltage change (since voltage difference at Rx is very small)
• One EIEOS is sent prior to starting TS1’s and after every 32 consecutive TS1’s or TS2’s at speeds above 2.5 GT/s
• Exit from L0s is a special case and sends 4 to 8 EIE symbols (K28.7’s) prior to starting the FTS sequence (goal is to ensure Rx sees 4 of them)
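The low-frequency pattern can be seen by looking at the 10-bit code group for K28.7. The bit string below is the standard 8b/10b encoding for one running disparity (an assumption not stated on the slide); since the code group has equal numbers of ones and zeros, the same group repeats for every consecutive K28.7.

```python
from itertools import groupby

K28_7 = "0011111000"   # 10-bit code group, one running-disparity variant (assumed)

def run_lengths(bits: str):
    """Lengths of consecutive runs of identical bits."""
    return [len(list(group)) for _, group in groupby(bits)]

eie_stream = K28_7 * 14            # the 14 EIE symbols of an EIEOS
interior_runs = run_lengths(eie_stream)[1:-1]
```

Every run between the first and last is 5 bits long, which is the alternating 5-ones, 5-zeros pattern described above.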
Full-On L0 State 500 510

[Figure: one direction of one Lane. Transmitter (ON, clock source ON) drives D+ and D- through the CTX capacitors to the Receiver (ON, clock source ON). ZTX and ZRX terminations, VCM at the transmitter, VRX-CM = 0 V, low impedance termination, receiver-detect circuit at the transmitter.]

• Transmission and reception in progress
• Recommended Power Budget about 80 mW per Lane
• One direction of the Link can be in L0 while the other side is in L0s
• Transmitter and Receiver clock PLL are ON
• Transmitter is ON, Receiver is ON
• Low impedance termination at transmitter
L0s State 501 511

[Figure: one direction of one Lane. Transmitter (ON, clock source ON) held at 0 - 3.6 V DC common mode voltage; Receiver (ON, clock source ON). High or low impedance termination at the transmitter, low impedance termination at the receiver, VRX-CM = 0 V.]

• Transmitter holds Electrical Idle voltage (VTX-DIFFp < 20 mV) and DC common mode voltage (VTX-CM-DC 0 - 3.6 V)
• Recommended Power Budget <= 20 mW per Lane
• Recommended exit latency < 50 ns, however designers indicate that a more realistic number appears to be 1-2 µs
• One direction of the Link can be in L0s while the other is in L0
• Transmitter and Receiver clock PLL are ON but Rx Clock loses sync
• Transmitter is ON, Receiver is ON
• High or Low impedance termination at transmitter
L1 State 502 512

[Figure: one direction of one Lane. Transmitter (ON) held at 0 - 3.6 V DC common mode voltage; Receiver (ON); clock sources may be OFF. High or low impedance termination at the transmitter, low impedance termination at the receiver, VRX-CM = 0 V.]

• Transmitter holds Electrical Idle voltage and DC common mode voltage
• Recommended Power Budget <= 5 mW per Lane
• Recommended exit latency < 10 microseconds (may be greater)
• Both directions of the Link must be in L1 at the same time
• Transmitter and Receiver clock PLL may be OFF, but clock to device ON
• Transmitter is ON, Receiver is ON
• High or Low impedance termination at transmitter
L2 State 503 513

[Figure: one direction of one Lane. Transmitter most likely OFF, no DC value maintained; Receiver OFF; clock sources OFF; low frequency clock for Beacon ON. High or low impedance termination at the transmitter, high impedance termination at the receiver, VRX-CM = 0 V.]

• Transmitter holds Electrical Idle voltage, but not required to hold DC common mode voltage. Most likely OFF.
• Recommended Power Budget <= 1 mW per Lane
• Recommended exit latency < 12 - 50 milliseconds
• Both directions of the Link in L2
• Transmitter and Receiver clock PLL OFF, and clock to device OFF
• Low frequency clock for Beacon in transmitter ON
• Main power to device OFF, but Vaux ON
• Transmitter is OFF, Receiver is OFF
• High or Low impedance termination at transmitter, high impedance at receiver
L3 State 504 514

[Figure: one direction of one Lane. DC common mode voltage OFF; Transmitter OFF; Receiver OFF; clock sources OFF; low frequency clock for Beacon OFF. High impedance termination at both transmitter and receiver, VRX-CM = 0 V.]

• Transmitter does not hold DC common mode voltage
• Recommended Power Budget: zero
• Recommended L3 -> L0 exit latency < 12 - 50 milliseconds after power turned ON
• Both directions of the Link in L3
• Transmitter and Receiver clock PLL OFF, and clock to device OFF
• Low frequency clock for Beacon in transmitter OFF
• Main power to device OFF, Vaux OFF
• Transmitter and Receiver OFF
• High impedance termination at transmitter and receiver
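The recommended per-Lane power budgets from the L0 through L3 slides can be collected into a small lookup for estimating a whole Link. These are the slides' recommended figures (an approximate value for L0 and upper bounds for the others), not measured values, and the names are illustrative.

```python
# Recommended power budget per Lane in mW, as listed on the L-state slides.
RECOMMENDED_MW_PER_LANE = {"L0": 80, "L0s": 20, "L1": 5, "L2": 1, "L3": 0}

def link_power_budget_mw(state: str, lanes: int) -> int:
    """Recommended budget for a Link of the given width in the given state."""
    return RECOMMENDED_MW_PER_LANE[state] * lanes
```

For instance, a x8 Link in L0 comes to about 640 mW by these figures, against at most 40 mW in L1.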
Link Initialization and Training

Link Initialization and Training 506 516

[Figure: PCIe layered architecture of a Port (Software layer, Transaction layer, Data Link layer, Physical layer; Transmit and Receive paths). Link Training (LTSSM) resides in the Physical layer.]

Purpose of Link Training
• Receiver detection
• Bit locking
• Symbol locking
• Polarity inversion
• Lane-to-Lane de-skewing
• Lane reversal
• Link numbering (multi-Link devices)
• Lane numbering (multi-Lane Links)
• Link width management
• Link data rate management
Ordered Sets Used During Link Training 510 517

Gen1/Gen2 TS1 or TS2

Ordered Sets Used During Link Training 511 518

Gen3 TS1 or TS2

Link Training Status State Machine (LTSSM) 522 519

Link Training Status State Machine (LTSSM) 522 520

• Link Power Mgmt
• Error Recovery
• Link Width Change
• Link Speed Change
• Equalization training
Link Training Status State Machine (LTSSM) 522 521

• Hot Reset
• Ext Loop Back
• Link Enable/Disable
Detect Sub-State Machine 523 522

[Figure: Detect sub-state machine with states Detect.Quiet and Detect.Active. Entry from Reset; also from Disabled, Loopback, L2, Polling, Configuration or Recovery. Labels in the figure: "No Electrical Idle on Link or 12 ms timeout", "Receiver Detected", "No Detect", "12 ms", "Charge or DC common mode voltage stable", "Exit to Polling".]

• Detect.Quiet: transmitter in high impedance
• Detect.Active: Start at DC common mode voltage at or between VDD and GND
• Drive a different Vcm than present
• Device knows charge time to change voltage based on assumed line impedance and Tx impedance without receiver termination
Receiver Detect: Absent 523

[Figure: Receiver Absent. One direction of one Lane: Transmitter with receiver-detect circuit, ZTX terminations and VCM; CTX capacitors; line impedance ZTX-LINE with Cpad and Cinterconnect on D+ and D-; no Receiver termination (ZRX) connected.]

Charge Time Constant = ZTX * (Cpad + Cinterconnect) → Small Time
Receiver Detect: Present 524

[Figure: Receiver Present. One direction of one Lane: Transmitter with receiver-detect circuit, ZTX terminations and VCM; CTX capacitors; line impedance ZTX-LINE with Cpad and Cinterconnect on D+ and D-; Receiver with ZRX terminations.]

Charge Time Constant = ZTX * (CTX + Cpad + Cinterconnect) → Large Time

CTX = 75 - 200 nF >> Cpad + Cinterconnect
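Plugging numbers into the two time-constant formulas shows how large the difference is. In this sketch ZTX = 50 ohms and the pad and interconnect capacitances (a few pF) are hypothetical values chosen only for illustration; the CTX value is the low end of the 75 - 200 nF range given above.

```python
def charge_time_constant_s(z_tx_ohm, c_pad_f, c_interconnect_f, c_tx_f=0.0):
    """Time constant ZTX * (CTX + Cpad + Cinterconnect), with CTX = 0 when no receiver."""
    return z_tx_ohm * (c_tx_f + c_pad_f + c_interconnect_f)

Z_TX = 50.0              # ohms, assumed
C_PAD = 2e-12            # farads, hypothetical
C_INTERCONNECT = 3e-12   # farads, hypothetical

tau_absent = charge_time_constant_s(Z_TX, C_PAD, C_INTERCONNECT)
tau_present = charge_time_constant_s(Z_TX, C_PAD, C_INTERCONNECT, c_tx_f=75e-9)
```

With these assumed values the time constant grows from a fraction of a nanosecond to several microseconds when a receiver is present, which is the difference the detect circuit looks for.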
Polling 525

Polling Sub-State Machine 525 526

• Polarity Inversion: Receiver is required to implement this
• Polling.Compliance: Transmit compliance pattern for testing EMI, cross talk, bit error rate, etc. Data rate and de-emphasis level will be cycled (unless device only supports 2.5 GT/s) through 2.5 GT/s at -3.5 dB, 5.0 GT/s at -3.5 dB, and 5.0 GT/s at -6.0 dB.
Configuration 527

Configuration Sub-State Machine 553 528

In this state, Link and Lane numbering are negotiated
Training Sequence Ordered Sets 540 529

Previously, the Link# and Lane# fields have contained PAD symbols. Now they’re assigned values to test the Link.
Multi-Lane Links 541 530

• Designer decides how many Lanes to implement on a given Link based on performance requirements
• Required for a multi-Lane Link to be able to operate as a single Lane (x1) Link
• Optional for a multi-Lane Link to operate as multiple independent Links with fewer Lanes, or to combine multiple Links to form a wider Link.
Link and Lane Numbering 531

Upstream Device (Root Complex or Switch)

Example capabilities:
One Link with 1, 2, or 4 Lanes
2 Links with 1 or 2 Lanes
Lane reversal supported

Downstream Device (Switch or EP)


Example capabilities:
One x2 Link or one x1 Link
Lane reversal supported

Link and Lane Numbering 532

Upstream Device (Root Complex or Switch)

Example capabilities:
One Link with 1, 2, or 4 Lanes
2 Links with 1 or 2 Lanes
Lane reversal supported

Downstream Device (Switch or EP) Downstream Device (Switch or EP)


Example capabilities: Example capabilities:
One x2 Link or one x1 Link One x2 Link or one x1 Link
Lane reversal supported Lane reversal supported

Link and Lane Numbering 533

Upstream Device (Root Complex or Switch)

Example capabilities:
One Link with 1, 2, or 4 Lanes
2 Links with 1 or 2 Lanes
Lane reversal supported

Downstream Device (Switch or EP)


Example capabilities:
One x4 Link or one x1 Link
Lane reversal supported

Partitioning Lanes into Multiple Links 541 534

A device may be able to define a group of Lanes as multiple Links with different widths.

[Figure: a Switch with a x8 upstream Port (Virtual PCI Bridge 0) whose downstream Lanes are partitioned either as four x2 Links (Virtual PCI Bridges 1-4) or as two x4 Links (Virtual PCI Bridges 1-2)]
Link Configuration Example 1 543 535

Configuration begins with the upstream device sending TS1 Ordered Sets with the Link field containing a value (N) other than PAD.
Example 1 continued 544 536

The upstream device sees the Link number match on all Lanes and recognizes a 4-Lane Link. Next, it proposes Lane numbers using TS1 Ordered Sets.
Example 1 continued 545 537

The upstream device sees matching Lane numbers arrive, and sends TS2s using the same Link and Lane numbers to confirm the configuration.
Link Configuration Example 2 546 538

Since 4 Links are possible in this example, the LTSSM begins by sending 4 different Link numbers.
Example 2 continued 547 539

Downstream devices detect non-PAD Link number values. One device returns N for the Link number and the other device returns N+2.
Example 2 continued 548 540

Upstream device sees both N and N+2 and recognizes that two devices are attached.
A second Link interface and LTSSM must now be
initialized.
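
As a rough illustration of how the upstream device in Example 2 could infer how many devices are attached, the sketch below groups Lanes by the Link number each one returned. The function name, the use of `None` to stand for PAD, and the concrete value picked for N are assumptions made for illustration; they are not taken from the specification.

```python
from collections import defaultdict

PAD = None  # stand-in for the PAD symbol (an assumption of this sketch)


def group_lanes_by_link(returned_link_numbers):
    """Map each non-PAD Link number to the Lane positions that returned it."""
    groups = defaultdict(list)
    for lane, link in enumerate(returned_link_numbers):
        if link is not PAD:
            groups[link].append(lane)
    return dict(groups)


N = 5  # arbitrary Link number chosen by the upstream device
# One device returns N on its two Lanes, the other returns N+2.
links = group_lanes_by_link([N, N, N + 2, N + 2])
print(f"{len(links)} Links detected: {links}")
```

Each distinct Link number in the result corresponds to one attached device, which is why a second Link interface and LTSSM are needed in this example.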

Lane Reversal Example 3 541

Since two Links are supported, two Link numbers are tested.

One x4
One x1
Two x2
Two x1

TS1s Lane #: PAD PAD PAD PAD


Link #: N N N+1 N+1

PAD PAD PAD PAD Link #:


PAD PAD PAD PAD Lane#: TS1s

One x4
One x2
One x1
Example 3 continued 542

The downstream LTSSM detects non-PAD Link numbers in the TS1 Ordered Set and returns the same value (N) on all of its Lanes.

One x4
One x1
Two x2
Two x1

TS1s Lane #: PAD PAD PAD PAD


Link #: N N N+1 N+1

N N N N Link #:
PAD PAD PAD PAD Lane #: TS1s

One x4
One x2
One x1
Example 3 continued 543

The upstream LTSSM detects the same value (N) on every Lane and recognizes that a single device is attached.

One x4
One x1
Two x2
Two x1

TS1s Lane #: PAD PAD PAD PAD


Link #: N N N+1 N+1

N N N N Link #:
PAD PAD PAD PAD Lane #: TS1s

One x4
One x2
One x1
Example 3 continued 544

The upstream LTSSM now proposes Lane numbering to the downstream LTSSM.

One x4
One x1
Two x2
Two x1

TS1s Lane #: 0 1 2 3
Link #: N N N N

N N N N Link #:
PAD PAD PAD PAD Lane #: TS1s

One x4
One x2
One x1
Example 3 continued 545

The downstream LTSSM detects Lane numbers and reports its native Lane numbering to the upstream LTSSM.

One x4
One x1
Two x2
Two x1

TS1s Lane #: 0 1 2 3
Link #: N N N N

N N N N Link #:
3 2 1 0 Lane #: TS1s

One x4
One x2
One x1
Example 3 continued 546

The upstream LTSSM recognizes that the incoming Lane numbers don’t match its native numbering, and reverses the Lanes. It then asks for confirmation by sending TS2s.

One x4
One x1
Two x2
Two x1

TS2s Lane #: 3 2 1 0
Link #: N N N N

N N N N Link #:
3 2 1 0 Lane #: TS1s

One x4
One x2
One x1
Example 3 continued 547

The Lane numbers returned now match and normal initialization proceeds.

One x4
One x1
Two x2
Two x1

TS2s Lane #: 3 2 1 0
Link #: N N N N

N N N N Link #:
3 2 1 0 Lane #: TS2s

One x4
One x2
One x1
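
The Lane reversal decision walked through in Example 3 can be sketched as below. This is a simplified illustration, not the LTSSM's actual logic: the function name is hypothetical, and only the two outcomes shown on the slides (numbering accepted as proposed, or fully reversed) are handled.

```python
def resolve_lane_numbers(proposed, returned):
    """Return the Lane numbers the upstream port would confirm in its TS2s.

    proposed: Lane numbers sent by the upstream port, e.g. [0, 1, 2, 3]
    returned: Lane numbers reported back by the downstream port
    """
    if returned == proposed:
        return list(proposed)   # numbering accepted as proposed
    if returned == list(reversed(proposed)):
        return list(returned)   # upstream port reverses its Lanes to match
    raise ValueError("Lane numbering mismatch that reversal cannot resolve")


confirmed = resolve_lane_numbers([0, 1, 2, 3], [3, 2, 1, 0])
print("TS2 Lane numbers:", confirmed)
```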
Example 4 550 548

This example deals with a failing Lane

Example 4 continued 551 549

Example 4 continued 552 550

L0 568 551

Bits Related to Speed and Width Change 569 552

Link Control Register


Link Control 2 Register

Recovery 572 553

Recovery Sub-State Machine 573 554

• Entered to recover bit/symbol lock due to detected error


• Exit from L1 or unsuccessful exit from L0s
• Speed change
• Width change
• Software initiated Link retrain
• Redo equalization (Gen3)

Dynamic Speed Change 622 555

Dynamic Speed Change: Motivation/Method 619 556

 Motivations:
 Reduce speed to save power
 Return to full speed when
full bandwidth needed
 Reduce speed to improve
unreliable operation
 Methods:
 Hardware events -
 Autonomous Bandwidth: design-specific choice for power
management reasons
 Bandwidth Management: design-specific choice resulting
from something like a reliability issue (not power mgt.)
 Software - using configuration bits

TS1 Rate Identifier 621 557

TS1 symbols:
0 COM
1 Link #
2 Lane #
3 # FTS
4 Rate ID
5 Train Ctl
6 - 13 TS ID
14 TS ID
15 TS ID

Rate Identifier (symbol 4):
Bit 0 Reserved, = 0
Bit 1 Indicates 2.5 GT/s support
Bit 2 Indicates 5.0 GT/s support
Bit 3:5 Reserved, = 0
Bit 6 Autonomous Change / Selectable De-emphasis
Bit 7 Speed Change

Notes
Bit 6 meaning is context sensitive:
• In Configuration state, it means the device’s request for speed or width change was not caused by a reliability issue.
• In Recovery, it indicates the de-emphasis preference. (1 = -3.5dB, 0 = -6dB)
• In Polling, it specifies the de-emphasis level for the other device.
Bit 7: Indicates a request to change the speed; can only be set in Recovery state
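
To make the bit layout concrete, here is a minimal sketch that decodes a TS1 Rate Identifier byte using the bit positions listed on this slide. The function and key names are illustrative only, and bit 6 is reported as a raw flag because its meaning depends on the LTSSM state.

```python
def decode_rate_id(symbol4):
    """Decode TS1 symbol 4 (Rate Identifier) into named fields."""
    return {
        "supports_2_5_gt": bool(symbol4 & 0x02),  # bit 1
        "supports_5_0_gt": bool(symbol4 & 0x04),  # bit 2
        "bit6": bool(symbol4 & 0x40),             # context-sensitive flag
        "speed_change": bool(symbol4 & 0x80),     # bit 7
    }


# 0x86 = 1000_0110b: Speed Change set, 5.0 and 2.5 GT/s both advertised.
fields = decode_rate_id(0x86)
print(fields)
```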
TS2 Rate Identifier 621 558

TS2 symbols:
0 COM
1 Link #
2 Lane #
3 # FTS
4 Rate ID
5 Train Ctl
6 - 13 TS ID
14 TS ID
15 TS ID

Rate Identifier (symbol 4):
Bit 0 Reserved, = 0
Bit 1 Indicates 2.5 GT/s support
Bit 2 Indicates 5.0 GT/s support
Bit 3:5 Reserved, = 0
Bit 6 Autonomous Change / Link Upconfigure Capability / Selectable De-emphasis
Bit 7 Speed Change

Notes
Bit 6 meaning is context sensitive:
• In Configuration.Complete, it indicates the device’s ability to upconfigure the Link to a previously negotiated Link width.
• In Recovery, a downstream component indicates that the speed or Link width change was autonomous (not caused by a reliability issue), while an upstream component specifies the de-emphasis level for the other device (1 = -3.5dB, 0 = -6dB).
• In Polling, it indicates the de-emphasis value used by both devices when going into Loopback mode.
Speed Change Example 623 559

Directed Speed Change = 0


1 Directed Speed Change = 1

Entry Entry
Speed Speed

RcvrLock RcvrCfg RcvrLock RcvrCfg

Speed_Change = 1

TS1 TS1 TS1 TS1

Root Link Speed = 2.5 GT/s


PCIe
Complex Endpoint
TS1 TS1 TS1 TS1

Speed_Change = 1
Speed Change Example 623 560

Directed Speed Change = 1 Directed Speed Change = 1

Entry Entry
Speed Speed

RcvrLock RcvrCfg RcvrLock RcvrCfg

Speed_Change = 1

TS2 TS2 TS2 TS2

Root Link Speed = 2.5 GT/s


PCIe
Complex Endpoint
TS2 TS2 TS2 TS2

Speed_Change = 1
Autonomous Change = 1
Speed Change Example 623 561

Directed Speed Change = 0 Directed Speed Change = 0

Entry Entry
Speed Speed

RcvrLock RcvrCfg RcvrLock RcvrCfg

TS2 TS2 TS2 EIOS


TS2

Root Link Speed = 5.0


2.5 GT/s
PCIe
Complex Endpoint
TS2
EIOS TS2 TS2 TS2
Autonomous Change = 1
Root Complex Config Space: Link Autonomous Bandwidth Status bit = 1
Speed Change Example 623 562

If the higher speed doesn’t work, go directly back to the Speed state and Electrical Idle, then change back to 2.5 GT/s.
Directed Speed Change = 0 Directed Speed Change = 0

Entry Entry
Speed Speed

Exit to L0 Exit to L0
RcvrLock RcvrCfg RcvrLock RcvrCfg

Speed_Change = 0

TS2
TS1 TS2
TS1 TS2
TS1 TS2
TS1

Root Link Speed = 5.0 GT/s


PCIe
Complex Endpoint
TS2
TS1 TS2
TS1 TS2
TS1 TS2
TS1
Speed_Change = 0
Speed Change Capture – part 1 563

Link has trained to Gen1 rate, USP requests Autonomous change, both support higher rates.

Trace captures courtesy of LeCroy
Speed Change Capture – part 2 564

DSP responds with TS2s; USP then sends EIOS and goes to Electrical Idle.

[Trace callouts: Sends EIOS; Enters Elec Idle]
Speed Change Capture – part 3 565

DSP responds with Electrical Idle. After a timeout, DSP attempts new rate with TS1s.

Since this is Gen3, an equalization process is needed. For Gen2 that isn’t necessary.

[Trace callouts: Sends EIOS; Enters Elec Idle; Attempts new rate of 8.0 GT/s]
Reporting Dynamic Changes 626 566

Software Notification 625 567

Disabling Dynamic Speed Changes 628 568

Software Can Force Speed Change 629 569

Gen3 Equalization Parameter Training 577 570

New Recovery.Equalization sub-state entered if:
Speed has been changed to 8.0 GT/s and
Equalization needs to be done (or redone)
EQ TS1 / EQ TS2 Before Switching to Gen3 571

TS1 or TS2 at 2.5 or 5.0 GT/s:
0 COM
1 Link #
2 Lane #
3 # FTS
4 Rate ID
5 Train Ctl
6 EQ info
7 - 15 TS ID

Symbol 6 (bits 7..0): Equalization Command, Tx Preset, Rx Preset Hint
Equalization Command only has meaning at speeds less than 8.0 GT/s
Symbol 7 – 9 are TS1 or TS2 Identifiers

 When first entering Recovery state (before speed changes to 8.0 GT/s) bit 7 of symbol
6 indicates “EQ TS1” or “EQ TS2” Ordered Sets and delivers the preset values
 Downstream Ports (DSPs) supply the
preset value to be used for EQ on each
Lane for 8.0 GT/s, while Upstream Ports
(USPs) record this information
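
A minimal sketch of how symbol 6 of an EQ TS1/EQ TS2 might be packed and unpacked follows. Bit 7 as the EQ indicator comes from the slide; placing Tx Preset in bits 6:3 and Rx Preset Hint in bits 2:0 is an assumption based on the field order shown, and the function names are hypothetical.

```python
def pack_eq_ts_symbol6(tx_preset, rx_hint):
    """Build symbol 6 of an EQ TS1/TS2 sent at 2.5 or 5.0 GT/s."""
    if not 0 <= tx_preset <= 0xF:
        raise ValueError("Tx Preset must fit in 4 bits")
    if not 0 <= rx_hint <= 0x7:
        raise ValueError("Rx Preset Hint must fit in 3 bits")
    # Bit 7 marks the Ordered Set as an EQ TS; field positions are assumed.
    return 0x80 | (tx_preset << 3) | rx_hint


def unpack_eq_ts_symbol6(symbol6):
    """Split symbol 6 back into its three fields."""
    return {
        "is_eq_ts": bool(symbol6 & 0x80),
        "tx_preset": (symbol6 >> 3) & 0xF,
        "rx_hint": symbol6 & 0x7,
    }


symbol6 = pack_eq_ts_symbol6(tx_preset=7, rx_hint=0)
print(hex(symbol6), unpack_eq_ts_symbol6(symbol6))
```

The values Tx Preset = 7 and Rx Hint = 0 match the ones used in the preset example later in this section.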
Secondary Capability Register Details 579 572

Secondary PCIe Extended Capability Registers

These registers (HWInit & RO) are required for DSPs. If implemented in a USP, the EQ Control Registers are ignored.

Fields highlighted in the figure: Equalization Request Interrupt Enable; Perform Equalization

Note that each Lane can have independent values

Presets sent to USP via EQ TS1 / EQ TS2 Symbol 6
4 Equalization Phases 578 573

 During Recovery.Equalization and at Gen3 speed, Equalization Control (EC) bits indicate the
equalization phase (0-3)

TS1 at 8GT/s:
Symbol    Field      Description
0                    1Eh for TS1, 2Dh for TS2
1         Link #     0-31d, (PAD encoded as F7h)
2         Lane #     0-31d, (PAD encoded as F7h)
3         # FTS      # of FTS Ordered Sets required by receiver
4         Rate ID    Bit 3 indicates 8GT/s support - 5.0GT/s and 2.5GT/s must also be supported
5         Train Ctl
6 - 9     EQ info    Equalization presets and coefficients
10 - 13   TS ID      4Ah for TS1, 45h for TS2
14 - 15   TS ID      4Ah for TS1, 45h for TS2, or DC balance symbols

Symbol 6 (bits 7..0): Use Preset, Tx Preset, Reset EIEOS Interval Count, EC
Procedure to Change Speed to 8GT/s 574

EQ Example 575

Link always trains to 2.5 GT/s after a reset.
After reaching L0, if higher speeds are available, go to Recovery and change to the highest
mutually-supported rate. Assume 8GT/s is supported by DSP and USP.
EQ TS1s are sent by DSP to give preset values to USP.

Entry Entry
Speed Speed

EQ EQ
RcvrLock RcvrCfg RcvrLock RcvrCfg

Speed_Change = 1, Presets

TS1 TS1 TS1 TS1

Root Link Speed = 2.5 GT/s


PCIe
Complex Endpoint
TS1 TS1 TS1 TS1

Speed_Change = 1
Initializing Preset Values 576

1. In Recovery but before changing to 8 GT/s, DSP sends Tx Preset and Rx Hint info to USP. Values for each Lane are independent and come from Secondary PCIe Extended Capability registers. In this example, Tx preset is 7, while Rx hint is given as 0.

To help clarify the terminology, recall that a Downstream Port has that name because it faces downstream. Similarly, an Upstream Port faces upstream.

[Figure: Root Port (Downstream Port, DSP) sends Tx Preset = 7, Rx Hint = 0 (step 1) to the Endpoint (Upstream Port, USP)]
Preset Encodings 580 577

These spec tables interpret the Tx Preset and Rx Hint. Recall that the sum of all
coefficient absolute values must be unity, so c0 can be derived from the other two and
doesn’t need to be included in the table.
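
The derivation of c0 mentioned above can be sketched as follows. The c-1 and c+1 values are transcribed from the preset table on this slide (P10 is omitted because its post-cursor value is variable); the dictionary and function names are illustrative only.

```python
# (c-1, c+1) pairs per Tx preset, transcribed from the slide's table.
PRESETS = {
    "P4": (0.000, 0.000),
    "P1": (0.000, -0.167),
    "P0": (0.000, -0.250),
    "P9": (-0.166, 0.000),
    "P8": (-0.125, -0.125),
    "P7": (-0.100, -0.200),
    "P5": (-0.100, 0.000),
    "P6": (-0.125, 0.000),
    "P3": (0.000, -0.125),
    "P2": (0.000, -0.200),
}


def cursor_coefficient(c_minus1, c_plus1):
    """Derive c0 from the rule |c-1| + c0 + |c+1| = 1."""
    return 1.0 - abs(c_minus1) - abs(c_plus1)


for name, (pre, post) in PRESETS.items():
    print(f"{name}: c0 = {cursor_coefficient(pre, post):.3f}")
```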

Tx Preset Encoding   c-1       c+1
P4                   0.000     0.000
P1                   0.000    -0.167
P0                   0.000    -0.250
P9                  -0.166     0.000
P8                  -0.125    -0.125
P7                  -0.100    -0.200
P5                  -0.100     0.000
P6                  -0.125     0.000
P3                   0.000    -0.125
P2                   0.000    -0.200
P10                  0.000     See note
All others           Reserved  Reserved

Note: Variable; used for testing.

Rx Hint Encoding   Rx Preset (dB)
000                -6
001                -7
010                -8
011                -9
100                -10
101                -11
110                -12
111                Reserved
EQ Example 578

LTSSM moves through speed change states as before


EQ TS2s deliver USP Tx Preset and Rx Hint information
Since Speed Change is requested, next state will be Recovery.Speed

Entry Entry
Speed Speed

EQ EQ
RcvrLock RcvrCfg RcvrLock RcvrCfg

Speed_Change = 1 , USP Presets delivered

TS2 TS2 TS2 TS2

Root Link Speed = 2.5 GT/s


PCIe
Complex Endpoint
TS2 TS2 TS2 TS2

Speed_Change = 1
EQ Example 579

In Speed, transmitters are assigned the highest mutually-supported rate, whatever that
is. If already using the highest rate, it won’t change and will just go to RcvrLock.
To drop down to a lower rate, transmitter must remove higher speed from its list of
supported rates, causing a lower one to be mutually supported.
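
The rate selection rule described above can be sketched like this. It is a simplified illustration with hypothetical names: each side's advertised rates are modelled as a set of GT/s values, and the Link ends up at the highest rate found in both sets.

```python
def negotiated_rate(local_rates, partner_rates):
    """Return the highest data rate (GT/s) advertised by both Link partners."""
    common = set(local_rates) & set(partner_rates)
    if not common:
        raise ValueError("no mutually supported rate")
    return max(common)


partner = {2.5, 5.0, 8.0}

# Both sides advertise 8.0 GT/s, so the Link moves to 8.0 GT/s.
print(negotiated_rate({2.5, 5.0, 8.0}, partner))

# To drop to a lower rate, one side stops advertising 8.0 GT/s.
print(negotiated_rate({2.5, 5.0}, partner))
```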

Entry Entry
Speed Speed

EQ EQ
RcvrLock RcvrCfg RcvrLock RcvrCfg

TS2 TS2 TS2 EIOS


TS2

Root Link Speed = 8.0


2.5 GT/s
PCIe
Complex Endpoint
TS2
EIOS TS2 TS2 TS2

EQ Example 580

Equalization must be done when going to 8 GT/s for the first time and it can be done
again if one Link partner requests it. After EQ, return to RcvrLock. If bit and Block Lock
are successful, continue on to L0. If not, go back to Speed and revert to the previous
working rate.

Entry Entry
Speed Speed

EQ EQ
RcvrLock RcvrCfg RcvrLock RcvrCfg

TS2
TS1 TS2
TS1 TS2
TS1 TS2
TS1

Root Link Speed = 8.0 GT/s


PCIe
Complex Endpoint
TS2
TS1 TS2
TS1 TS2
TS1 TS2
TS1

Equalization Phase Summary 578 581

Recovery.Equalization

Phase 0: Now USP reports the presets it was given, verifies reliable Lane operation (BER no worse than 10^-4), and then indicates phase 1 in its own TS1s.

Phase 1: DSP then verifies reliable Lane operation (BER no worse than 10^-4) and indicates phase 2 in its TS1s to continue the EQ sequence. If signal quality is already good enough, finish by indicating phase 0.

Phase 2: When USP sees phase 2, it sends coefficient values to adjust DSP Tx EQ (and optionally adjusts its own Rx EQ values). This continues until it obtains a BER of 10^-12 or better; then it indicates phase 3 in its TS1s.

Phase 3: When DSP sees phase 3, it sends coefficients to adjust USP Tx EQ (and optionally adjusts its own Rx EQ values) to get the desired BER. Then it sets phase 0 in its TS1s and the process is finished.

On completion: Recovery.RcvrLock

Note: No DLLPs allowed until EQ process is completed; prevents any TLP timeouts during the 50+ms EQ might take.
Equalization Example Start 581 582

1. DSP sends TS1s with EC=01b and reports its FS and LF values (FS=60, LF=24 chosen for this case).
 FS (Full Swing) represents the range of possible transmitter coefficient values.
 LF (Low Frequency) corresponds to the minimum voltage value.
2. Simultaneously, USP sends TS1s with EC=00b, using and reporting the Tx presets it received earlier (optionally using the Rx preset hints, too).
3. USP waits for logic to stabilize (up to 500ns), then tests incoming signal.
4. When 2 TS1s in a row are recognized by USP (BER better than 10^-4), it changes to phase 1 and reports its own FS & LF values of 30 and 12.

Allowed value for FS is from 24 to 63 in full-swing mode; 12 to 63 in half-swing mode

[Figure: Root Port (DSP, FS=60) sends EC=1 with FS=60, LF=24 (step 1); Endpoint (USP, FS=30) sends EC=0 with Tx preset=7 (step 2), tests the incoming signal (step 3), then sends EC=1 with FS=30, LF=12 (step 4)]
EQ Example Starting Info 584 583

TS1 at 8GT/s:
0
1 Link #
2 Lane #
3 # FTS
4 Rate ID
5 Train Ctl
6 - 9 EQ info
10 - 13 TS ID
14 - 15 TS ID

Symbol 6 (bits 7..0): Use Preset, Tx Preset, Reset EIEOS Interval Count, EC. Tx Preset is reporting the most recent preset value.
Symbol 7 (bits 7..0): Rsvd; FS value when EC = 01b, otherwise Pre-Cursor Coefficient
Symbol 8 (bits 7..0): Rsvd; LF value when EC = 01b, otherwise Cursor Coefficient
Symbol 9 (bits 7..0): P, RCV, Post-Cursor Coefficient. P is parity for the other 31 bits of EQ info in these 4 symbols.

FS (Full Swing) indicates range of possible transmitter coefficients, from 24 to 63 in full-swing mode, and 12 to 63 in half-swing mode
LF (Low Frequency) corresponds to a min voltage value
EQ Example Phase 1 583 584

1. DSP recognizes TS1s with EC=01b coming from USP
2. DSP evaluates the incoming signal and verifies a BER better than 10^-4
3. If that works, DSP requests phase 2. (Optionally, it may decide the signal is good enough and send EC=00b to finish EQ.)
Note: Both Ports must get the Link working well enough to continue, otherwise we’ll drop back to the previous rate and quit the EQ process.

[Figure: Endpoint (USP, FS=30) sends EC=1 (step 1); Root Port (DSP, FS=60) evaluates the signal (step 2) and sends EC=2 (step 3)]
EQ Example Phase 2, part 1 585 585

1. USP evaluates the incoming signal (method not specified)
2. USP proposes new Tx coefficients, that will be divided by the FS of the DSP to get the actual coefficients as shown here:
c-1 = 12/60 => -0.2 (pre- and post-cursor values must be negative or zero)
c0 = 36/60 = 0.6 (cursor value is positive)
c+1 = 12/60 => -0.2
 To help distinguish what is meant, spec uses upper-case “C” to mean the integers reported, and lower-case “c” to mean the resulting fractional values used.
3. After round-trip delay, DSP echoes requested coefficients, but in this case sets Reject Coefficient Values (RCV) = 1, meaning it won’t use them (see next slide).

[Figure: Endpoint (USP, FS=30) evaluates the signal (step 1) and sends EC=2 with C-1=12, C0=36, C+1=12 (step 2); Root Port (DSP, FS=60) echoes EC=2 with C-1=12, C0=36, C+1=12, RCV=1 (step 3). Coefficient values are echoed, but rejected because they weren’t supported in this example.]
TS1 Equalization Information 586

TS1 at 8GT/s:
0
1 Link #
2 Lane #
3 # FTS
4 Rate ID
5 Train Ctl
6 - 9 EQ info
10 - 13 TS ID
14 - 15 TS ID

Symbol 6 (bits 7..0): Use Preset, Tx Preset, Reset EIEOS Interval Count, EC. If new preset is proposed, it must be used unless it’s not supported.
Symbol 7 (bits 7..0): Rsvd; FS value when EC = 01b, otherwise Pre-Cursor Coefficient
Symbol 8 (bits 7..0): Rsvd; LF value when EC = 01b, otherwise Cursor Coefficient
Symbol 9 (bits 7..0): P, RCV (Reject Coefficient Values), Post-Cursor Coefficient

Using proposed coefficients is optional. Tx might reject them because they aren’t supported, or for any other reason.
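
As a rough sketch of how the four EQ info symbols of an 8 GT/s TS1 could be assembled, the function below packs symbols 6 to 9 and fills in the parity bit. The exact bit positions (Use Preset in bit 7, Tx Preset in bits 6:3, Reset EIEOS Interval Count in bit 2, EC in bits 1:0, six-bit coefficient fields, RCV in bit 6 and P in bit 7 of symbol 9) are assumptions read off the field order in the slides, and even parity is also an assumption; the slides only say P covers the other 31 bits. All names are hypothetical.

```python
def pack_eq_info(ec, tx_preset, sym7_value, sym8_value, post_cursor,
                 use_preset=False, reset_eieos=False, rcv=False):
    """Build symbols 6-9 of an 8 GT/s TS1 and fill in the parity bit."""
    sym6 = ((ec & 0x3)
            | (int(reset_eieos) << 2)
            | ((tx_preset & 0xF) << 3)
            | (int(use_preset) << 7))
    sym7 = sym7_value & 0x3F  # FS when EC = 01b, otherwise pre-cursor
    sym8 = sym8_value & 0x3F  # LF when EC = 01b, otherwise cursor
    sym9 = (post_cursor & 0x3F) | (int(rcv) << 6)
    # Count the 1 bits in the 31 non-parity bits, then set P so the
    # total across all 32 bits is even (assumed even parity).
    ones = sum(bin(s).count("1") for s in (sym6, sym7, sym8, sym9))
    sym9 |= (ones & 1) << 7
    return bytes([sym6, sym7, sym8, sym9])


# Phase 1 style report: EC=01b carrying FS=60 and LF=24.
# Tx Preset 0 here is a placeholder value, not taken from the slides.
eq_info = pack_eq_info(ec=1, tx_preset=0, sym7_value=60, sym8_value=24,
                       post_cursor=0)
print(eq_info.hex())
```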
EQ Example Phase 2, part 2 585 587

1. Previous coefficients were rejected, so USP proposes new ones:
c-1 = 9/60 => -0.15
c0 = 45/60 = 0.75
c+1 = 6/60 => -0.1
2. After round-trip delay, DSP echoes requested coefficients, and RCV=0 shows that these are supported.
3. USP evaluates incoming signal, decides whether to try a different setting (Rx can be adjusted at any time; no reporting is needed)

This round-trip “propose-and-evaluate” cycle may be repeated many times to test a number of combinations for DSP transmitter. Each cycle must take less than 2ms, and max time allowed for Phase 2 is 24ms.

[Figure: Endpoint (USP, FS=30) sends EC=2 with C-1=9, C0=45, C+1=6 (step 1); Root Port (DSP, FS=60) echoes EC=2 with C-1=9, C0=45, C+1=6, RCV=0 (step 2); USP evaluates the signal (step 3)]
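
The two Phase 2 proposals can be checked with a short sketch. The three acceptance rules used here are not stated on the slides; they are assumed from my reading of the PCIe 3.0 base specification's transmitter coefficient rules (|C-1| <= floor(FS/4), |C-1| + C0 + |C+1| = FS, and C0 - |C-1| - |C+1| >= LF). Function names are hypothetical, and coefficients are passed as non-negative magnitudes.

```python
def coefficients_ok(c_pre, c_cur, c_post, fs, lf):
    """Check integer coefficients against the three assumed spec rules."""
    return (c_pre <= fs // 4                    # pre-cursor limit
            and c_pre + c_cur + c_post == fs    # magnitudes sum to FS
            and c_cur - c_pre - c_post >= lf)   # minimum swing stays >= LF


def to_fractions(c_pre, c_cur, c_post, fs):
    """Convert reported integers (C) to fractional coefficients (c)."""
    return (-c_pre / fs, c_cur / fs, -c_post / fs)


# DSP transmitter from the example: FS=60, LF=24.
for proposal in [(12, 36, 12), (9, 45, 6)]:
    accepted = coefficients_ok(*proposal, fs=60, lf=24)
    print(proposal, "accepted" if accepted else "rejected",
          to_fractions(*proposal, fs=60))
```

Under these assumed rules the first proposal fails the third check (36 - 12 - 12 = 12, below LF = 24), which is one plausible reason for the RCV=1 response; the slides themselves only say the values were not supported.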
EQ Example Phase 3 586 588

1. Eventually, USP is satisfied with incoming signal and changes to phase 3.
2. DSP now goes to Phase 3, takes ownership and proposes new coefficients for USP transmitter.
3. After round-trip delay, USP echoes coefficients, shows acceptance of them by RCV=0
4. DSP evaluates incoming signal, decides whether to try another Tx setting (Rx can be adjusted at any time)

As before, this “propose-and-evaluate” cycle is repeated now as many times as needed to tune the USP transmitter. Each cycle must take less than 2ms, and max time allowed for Phase 3 is 24ms.

[Figure: Endpoint (USP, FS=30) and Root Port (DSP, FS=60) exchange EC=3; DSP proposes C-1=3, C0=21, C+1=6 and USP echoes the same values with RCV=0]
EQ Example – Finish 586 589

1. Eventually, DSP is satisfied with incoming signal, switches to phase 0 to finish EQ sequence.
2. USP echoes EC=0 and both drop out of EQ sequence back to Recovery.RcvrLock

[Figure: Root Port (DSP, FS=60) sends EC=0 (step 1); Endpoint (USP, FS=30) answers with EC=0 (step 2)]
Requesting EQ be Redone 590

TS2 at 8GT/s:
0
1 Link #
2 Lane #
3 # FTS
4 Rate ID
5 Train Ctl
6 - 9 EQ info
10 - 13 TS ID
14 - 15 TS ID

Symbol 6 (bits 7..0): Request Equalization, Quiesce Guarantee, Reserved
Symbol 7 - 9 (Encoded as 45h)

These bits are only defined for use in Recovery.RcvrCfg state. LTSSM will go back through the Speed state but since it’s already using 8.0 GT/s the speed doesn’t change.

Quiesce Guarantee, when set:
 USP guarantees no side effects if DSP initiates EQ within 1ms of going back to L0
 DSP asks USP to request EQ and make this guarantee if it can do so

These bits provide a way to request equalization be redone. USP sets both bits to request it, or DSP sets both bits to ask USP to request it.
Autonomous EQ Initialization 586 591

 EQ can be initiated:
 Autonomously (strongly recommended)
 By software writing registers in DSP but must guarantee no
side effects – like Completion Timeouts
 Perform equalization bit
 Target Link Speed
 Retrain Link
 After initial training completes and 8.0 GT/s speed is
available, DSP initiates by going to Recovery
 EQ can be started again by DSP based on its own
needs or because of a request from USP
 Components must use Tx values selected during this
process, but can adjust their Rx any time, as long as it
doesn’t cause the Link to become unreliable.
Software EQ Initialization 592

 Software can initiate EQ by setting Perform Equalization bit (see next slide), writing Target Link Speed value to 8.0 GT/s, and setting Retrain Link bit in the Downstream Port.
 Software must guarantee no side effects from doing EQ, such as timeouts on previously issued requests.
 Autonomous Link width change must be disabled beforehand, ensuring that all Lanes will participate in equalization.

[Figure: register diagram with the Target Speed field highlighted]
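
The software sequence above can be sketched against a mock register file. This is only an illustration: the dictionary keys stand in for the Downstream Port's Link Control, Link Control 2 and Link Control 3 registers, and the bit positions and the 8.0 GT/s encoding are assumptions recalled from the PCIe specification rather than values given on these slides. No real configuration-space access is performed.

```python
HW_AUTONOMOUS_WIDTH_DISABLE = 1 << 9  # Link Control (assumed bit position)
RETRAIN_LINK = 1 << 5                 # Link Control (assumed bit position)
TARGET_SPEED_MASK = 0xF               # Link Control 2, bits 3:0 (assumed)
TARGET_SPEED_8GT = 0x3                # assumed encoding for 8.0 GT/s
PERFORM_EQ = 1 << 0                   # Bit 0 = Perform Equalization


def request_equalization(regs):
    """Apply the software-initiated EQ sequence to a mock register dict."""
    # Disable autonomous width changes first so all Lanes participate.
    regs["link_control"] |= HW_AUTONOMOUS_WIDTH_DISABLE
    regs["link_control_3"] |= PERFORM_EQ
    regs["link_control_2"] = ((regs["link_control_2"] & ~TARGET_SPEED_MASK)
                              | TARGET_SPEED_8GT)
    # Retrain last, so the new settings are in place when training starts.
    regs["link_control"] |= RETRAIN_LINK
    return regs


regs = request_equalization(
    {"link_control": 0x0, "link_control_2": 0x1, "link_control_3": 0x0})
print({name: hex(value) for name, value in regs.items()})
```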
Secondary Capability Register Details 588 593

Secondary PCIe Extended Capability Registers

Note: These registers are required for all ports that support 8.0 GT/s, but many fields only apply to Downstream Ports. For Upstream Ports they usually only have meaning if crosslinks are supported.

Bit 1 = Equalization Request Interrupt Enable
Bit 0 = Perform Equalization
Register Definitions 588 594

 Perform Eq. – tells DSP to begin EQ process when


speed is set to 8.0 GT/s and Link is retrained
 Eq. Request Interrupt Enable – allows an interrupt to
indicate that Link Eq. Request bit has been set
 Set by H/W in Link Status 2 register if EQ problem is detected
 Detecting Port goes to Recovery state and sends TS2s with
Request EQ bit set
 Lane Error – indicates that a Lane-based error was
seen on the Lane corresponding to the bit number
 These fields are HWInit and sent from DSP to USP:
 DSP Tx Preset – used by DSP; ignored by USP
 DSP Rx Hint – may be used by DSP; ignored by USP
 USP Tx Preset – USP captures & must use
 USP Rx Hint – USP captures & may use
Equalization Time 595

 The equalization process can take time,


especially if a device decides that it should be
repeated.
 To allow for this, the time after reset before
an 8.0 GT/s device can be sent a Request is
changed to 100ms after Link Training
completes instead of 100ms after reset.

Notes on Equalization 586 596

 Phases 2 and 3 are each limited to 24ms, so


staying under 100ms shouldn’t be too hard.
 Algorithm used by components to adjust their
parameters will be implementation specific
 EQ settings are per Lane; Lanes in a wide
Link can have different values
 EQ can be requested by either Link partner
again if needed

Dynamic Link Width Changes 629 597

 Goals:
 Improve power savings while allowing high
performance
 Provide fall-back reliability
 Provide notification of changes to software
 Note: PCIe does not support asymmetric Links

TS2 Rate Identifier 630 598

TS2 symbols:
0 COM
1 Link #
2 Lane #
3 # FTS
4 Rate ID
5 Train Ctl
6 - 13 TS ID
14 TS ID
15 TS ID

Rate Identifier (symbol 4):
Bit 0 Reserved, = 0
Bit 1 Indicates 2.5 GT/s support
Bit 2 Indicates 5.0 GT/s support
Bit 3:5 Reserved, = 0
Bit 6 Autonomous Change / Link Upconfigure Capability / Selectable De-emphasis
Bit 7 Speed Change
LTSSM State Changes 631 599

Detect

Polling

Configuration

L2 Recovery

L1 L0 L0s

Configuration Sub-States 632 600

[Figure: Configuration sub-state machine]
Entry from Polling or Recovery
Sub-states: Config.Linkwidth.Start, Config.Linkwidth.Accept, Config.Lanenum.Wait, Config.Lanenum.Accept, Config.Complete, Config.Idle
Exits:
• Directed: Exit to Loopback
• Directed: Exit to Disable
• 2ms timeout & haven’t yet been to Recovery.RcvrLock: Exit to Recovery
• 2ms timeout, already been to Recovery.RcvrLock: Exit to Detect
• 8 Idle Rx, Tx 16 Idle: Exit to L0 (Full-On Power State; Packet transmission/reception begins)
Dynamic Link Width Example 633 601

1. Ethernet device initiates autonomous change to Link width, goes to Recovery and sends TS1’s with Speed Change bit cleared
2. Root responds by entering Recovery, sending same TS1’s
3. State changes to Recovery.RcvrCfg

[Figure: Recovery sub-state machine (entry from L0; Recovery.RcvrLock, Recovery.RcvrCfg, Recovery.Speed, Recovery.Idle; exits to Configuration, Detect, Loopback, Disabled, Hot Reset, L0). On each of Lanes 0-3, the Root Complex and the Gigabit Ethernet Device exchange TS1 (Link:0, Lane:n) with Speed Change = 0.]

Dynamic Link Width Example 633 602

4. TS2’s with configured Link and Lane numbers are exchanged with Speed Change bit cleared. Autonomous Change bit is recognized in TS2’s.
5. Since this isn’t a speed change, state transitions to Recovery.Idle

[Figure: Recovery sub-state machine (entry from L0; Recovery.RcvrLock, Recovery.RcvrCfg, Recovery.Speed, Recovery.Idle; exits to Configuration, Detect, Loopback, Disabled, Hot Reset, L0). On each of Lanes 0-3, the Root Complex and the Gigabit Ethernet Device exchange TS2 (Link:0, Lane:n) with Speed Change = 0 and Autonomous Change = 1.]
Dynamic Link Width Example 634 603

6. Root sends Idle data (a bunch of data 0 symbols), but device sends TS1’s with PAD,PAD
7. Root sees that a previously configured Lane now shows PAD,PAD and changes state to Config.Linkwidth.Start

[Figure: Recovery sub-state machine (entry from L0; Recovery.RcvrLock, Recovery.RcvrCfg, Recovery.Speed, Recovery.Idle; exits to Configuration, Detect, Loopback, Disabled, Hot Reset, L0). On each of Lanes 0-3, the Root Complex sends Idle Data while the Gigabit Ethernet Device sends TS1 (Link:PAD, Lane:PAD) with Speed Change = 0.]

Dynamic Link Width Example 635 604

8. Now the upstream device initiates: Root sends TS1s with the original Link number and PAD for the Lane number on all Lanes.
9. Device responds with matching TS1s on the Lanes it wants “active”, but with PAD,PAD on the other Lanes; the Autonomous Change bit is set.
10. State changes to Config.Linkwidth.Accept.

[Figure: Configuration sub-state machine (entry from Recovery; Config.Linkwidth.Start, Config.Linkwidth.Accept, Config.Lanenum.Wait, Config.Lanenum.Accept, Config.Complete, Config.Idle; exit to L0). Root Complex sends TS1 (Link:0, Lane:PAD) on Lanes 0-3. Gigabit Ethernet Device sends TS1 (Link:0, Lane:PAD) on Lane 0 (desired state: Active) and TS1 (Link:PAD, Lane:PAD) on Lanes 1-3 (desired state: Inactive), with Autonomous Change = 1.]


Dynamic Link Width Example 636 605

11. Root sends TS1s with Link number and Lane number on the Lanes perceived to be active and PAD,PAD on the inactive Lanes.
12. State changes to Config.Lanenum.Wait.
13. Device responds with the same TS1s.
14. State changes to Config.Lanenum.Accept.

[Figure: Configuration sub-state machine. Root Complex and Gigabit Ethernet Device exchange TS1 (Link:0, Lane:0) on Lane 0 (Active) and TS1 (Link:PAD, Lane:PAD) on Lanes 1-3 (Inactive), with Autonomous Change = 1.]
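The Lane-number exchange in steps 11 to 14 can be sketched as a small model. This is only an illustration: the `PAD` sentinel, the `(link, lane)` tuple layout, and the function name `active_lanes` are assumptions made for the sketch, not structures defined by the spec or the slides.

```python
PAD = None  # hypothetical stand-in for the PAD symbol carried in a TS1

def active_lanes(root_ts1, device_ts1):
    """Return the Lanes on which both sides sent the same non-PAD (link, lane) pair."""
    active = []
    for lane, (sent, echoed) in enumerate(zip(root_ts1, device_ts1)):
        if sent == echoed and PAD not in sent:
            active.append(lane)
    return active

# Slide example: Lane 0 carries (Link 0, Lane 0); Lanes 1-3 carry PAD,PAD.
root = [(0, 0), (PAD, PAD), (PAD, PAD), (PAD, PAD)]
device = [(0, 0), (PAD, PAD), (PAD, PAD), (PAD, PAD)]
print(active_lanes(root, device))  # [0], i.e. an x1 Link
```

Only Lane 0 survives, which matches the x1 result the example is heading toward.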


Dynamic Link Width Example 636 606

15. Root updates the configuration status bit to show that the bandwidth change was autonomously initiated by the downstream device.
16. State changes to Config.Complete.

[Figure: Configuration sub-state machine. Root Complex and Gigabit Ethernet Device exchange TS1 (Link:0, Lane:0) on Lane 0 (Active) and TS1 (Link:PAD, Lane:PAD) on Lanes 1-3 (Inactive), with Autonomous Change = 1. Root Complex’s Config Space: Link Autonomous Bandwidth Status bit = 1.]


Dynamic Link Width Example 637 607

17. Root sends TS2s on the active Lane and goes to electrical idle on the inactive Lanes.
    - TS2s advertise that the Link is “upconfigure capable”
18. Device responds with the same TS2s on the active Lane and goes to electrical idle on the inactive Lanes.
19. State changes to Config.Idle.

[Figure: Configuration sub-state machine. Root Complex and Gigabit Ethernet Device exchange TS2 (Link:0, Lane:0) with Upconfigure Capability = 1 on Lane 0 (Active); Lanes 1-3 (Inactive) are in Electrical Idle.]
Dynamic Link Width Example 637 608

20. Root and device exchange idle data (just zeros) for a while.
21. State changes to L0 and regular packets can be exchanged once again.

[Figure: Configuration sub-state machine. Root Complex and Gigabit Ethernet Device exchange Idle data on Lane 0 (Active); Lanes 1-3 (Inactive) are in Electrical Idle.]
Disabling Dynamic Width Changes 638 609

Link Width Change Capture – 1 610

Trace capture courtesy of LeCroy


DSP

USP

USP goes to Recovery and sends TS1s. Speed is 5.0 GT/s and width is x16.
Link Width Change Capture – 2 611

In Recovery.Idle, USP sends TS1s with PAD value for Lane numbers, causing a transition to the Configuration state because the numbers don’t match previously negotiated values.
Link Width Change Capture – 3 612

In response, DSP goes to Configuration and sends PAD on all Link and
Lane numbers.
Next, recognizing that a width change is in progress, it proposes that only
Lane 0 be used by setting the Link number to PAD for all the other
Lanes. Eventually, USP echoes this back and one Lane is active.

Link Width Change Capture – 4 613

Eventually, both recognize that only Lane 0 will be used and they send TS2s to confirm it (*** means Lanes aren’t all the same).
After a time of Idles, the TS2s resume with the same 5.0 GT/s speed, but now with a x1 Link width.
Asterisk means numbers are mixed
L0s 603 614

L0s Transmitter Sub-State Machine 603 615

[Figure: L0s transmitter sub-state machine]
• Entry from L0 into Tx_L0s.Entry: Transmitter sends Electrical Idle Ordered Set
• Tx_L0s.Entry to Tx_L0s.Idle (Tx in Electrical Idle, low power): after TTX-IDLE-MIN = 50 UI
• Tx_L0s.Idle to Tx_L0s.FTS: Directed
• Tx_L0s.FTS: Transmitter sends N_FTSs on all Lanes
• Tx_L0s.FTS exit to L0: Transmitter sends one SKP Ordered Set
L0s Receiver Sub-State Machine 605 616

[Figure: L0s receiver sub-state machine]
• Entry from L0 into Rx_L0s.Entry: Receiver detects Electrical Idle Ordered Set
• Rx_L0s.Entry to Rx_L0s.Idle (Rx in Electrical Idle, low power): after TTX-IDLE-MIN = 50 UI
• Rx_L0s.Idle to Rx_L0s.FTS: Electrical Idle Exit
• Rx_L0s.FTS exit to L0: Skip Ordered Set
• Rx_L0s.FTS exit to Recovery: N_FTS timeout
L1 607 617

L1 Sub-State Machine 608 618

[Figure: L1 sub-state machine]
• Entry from L0 into L1.Entry (Tx in Electrical Idle): Directed and Electrical Idle Ordered Set received and transmitted
• L1.Entry to L1.Idle (Electrical Idle, low power): after TTX-IDLE-MIN = 50 UI
• L1.Idle: remain in Electrical Idle
• L1.Idle exit to Recovery: Directed or Electrical Idle Exit
L2 609 619

L2 Sub-State Machine 611 620

[Figure: L2 sub-state machine]
• Entry from L0 into L2.Idle: Directed and Electrical Idle Ordered Set received and transmitted
• L2.Idle: Electrical Idle, low power, no DC CMV; Rx terminations enabled; Rx looking for exit
• L2.Idle to L2.TransmitWake: Beacon detected (Downstream Switch ports) or Directed to send Beacon (Upstream ports)
• L2.TransmitWake: Send Beacon (Upstream ports only)
• L2.TransmitWake exit to Detect: Exit from Electrical Idle detected
• L2.Idle exit to Detect: Directed or Beacon detected (Downstream Root ports), or Exit from Electrical Idle (Upstream Lanes)
Hot Reset, Disable and Loopback State 612 621

Hot Reset Sub-State Machine 612 622

[Figure: Hot Reset sub-state machine]
• Entry from Recovery into Hot Reset: Directed (Tx for 2 ms TS1s with the ‘Hot Reset’ bit, bit 0 of Symbol 5, set), or Rx two TS1s with the Hot Reset bit set
• Hot Reset exit to Detect: timeout of 2 ms
Disabled Sub-State Machine 613 623

[Figure: Disabled sub-state machine]
• Entry from Configuration or Recovery: Tx 16 TS1s with the ‘Disable’ bit, bit 1 of Symbol 5, set, and Tx Electrical Idle Ordered Set
• Disabled (Electrical Idle)
• Disabled exit to Detect: Directed, or Electrical Idle Exit, or no Electrical Idle Ordered Set after 2 ms
Loopback Sub-State Machine 614 624

[Figure: Loopback sub-state machine]
• Entry from Configuration or Recovery into Loopback.Entry: Master is Directed and transmits TS1s with the Loopback bit set
• Loopback.Entry to Loopback.Active: Master receives identical TS1s; Slave has entered Loopback
• Loopback.Active: Master sends valid 8b/10b data; Slave is required to retransmit it exactly (it won’t use 8b/10b for Gen3)
• Loopback.Active to Loopback.Exit: Master: Directed. Slave: Electrical Idle detected or Electrical Idle Ordered Set received for 1 ms
• Loopback.Exit: Master: Tx an Electrical Idle Ordered Set and enter Electrical Idle for 2 ms. Slave: enter Electrical Idle for 2 ms
• Timeout less than 100 ms
• Exit to Detect
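The Hot Reset, Disabled, and Loopback entries above are all requested through bits in Symbol 5 of a TS1. A minimal decoding sketch follows; the Hot Reset and Disable bit positions come from the slides, the Loopback position (bit 2) is taken from the PCIe spec rather than the slides, and the function and constant names are hypothetical.

```python
HOT_RESET = 1 << 0     # bit 0 of Symbol 5 (stated on the Hot Reset slide)
DISABLE_LINK = 1 << 1  # bit 1 of Symbol 5 (stated on the Disabled slide)
LOOPBACK = 1 << 2      # bit 2 per the PCIe spec; not stated on the slides

def training_control_flags(symbol5):
    """List the state-change requests encoded in a TS1 Symbol 5 value."""
    names = []
    if symbol5 & HOT_RESET:
        names.append("Hot Reset")
    if symbol5 & DISABLE_LINK:
        names.append("Disable Link")
    if symbol5 & LOOPBACK:
        names.append("Loopback")
    return names

print(training_control_flags(0b001))  # ['Hot Reset']
```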
Link Capabilities 639 625

Link Capabilities 640 626

Link Status 642 627

Link Control 644 628

Part Five:
Additional System Topics

Error Detection and Handling

PCI Error Handling 649 631

PCI Express Error Handling 650 632

• PCIe devices must support:
  1. Existing software written for generic PCI error handling. Several PCIe error conditions are mapped to existing PCI error mechanisms for this purpose.
  2. Additional PCIe-specific reporting mechanisms. There are two error reporting levels:
     a) Baseline Capability. These reporting capabilities are a minimum set, and are required of all PCI Express devices.
     b) Advanced Error Reporting Capability. Allows more sophisticated error reporting using an extended capability register block (requires extended configuration support).
Error Classes 651 633

Errors are classified as correctable and uncorrectable.
• Correctable means hardware has a way of automatically handling the error
• Uncorrectable errors have no hardware method to fix them and are characterized as:
  • Fatal – Link is not functioning correctly
  • Non-fatal – problem is not related to Link operation
Correctable Errors 651 634

• Correctable errors degrade system performance, but recovery occurs with no loss of information.
• Hardware fixes the error automatically. Reporting it may be useful so software can monitor the frequency of these errors.
• Example: detection of a Link CRC (LCRC) error within a TLP. Fix: Data Link Layer replay event (optional error message to the Root).
Uncorrectable Errors 652 635

• Uncorrectable errors impair the function of the interface; there is no specified mechanism to fix them
• Two subgroups:
  1. Fatal Errors: render the Link unreliable
     – First-level strategy for recovery may involve a Link reset
     – Handling of fatal errors is platform-specific
  2. Non-Fatal Errors: associated with a particular transaction; Link itself remains reliable
     – Software may attempt recovery
     – Transactions between other devices are not affected
Scope of PCI Express Error Handling 653 636

Advisory Non-fatal Errors 670 637

• It’s desirable in some cases for a device to withhold reporting an uncorrectable error. The device may:
  • Be an intermediary and not the target of the transaction, meaning it isn’t the best agent to decide whether a recovery method exists.
  • Have some other means of recovery that allows it to postpone reporting the error.
  • Need to support legacy software. A configuration read attempt to a device that isn’t present returns a completion with UR status. But an error message could call a system error handler and that would break the PCI enumeration model.
Advisory Non-fatal Support 670 638

• Hardware support is required beginning with the 1.1 spec. The cases where it applies depend on the role of the detecting agent (Requester, Completer, Receiver) and the type of error.
• When a 1.1 or later agent detects an Advisory Non-fatal case it can send ERR_COR, if enabled. Otherwise, it sends no error message.
• If desired, software can override this by escalating the severity of an error to fatal. In that case, the detecting agent will report ERR_FATAL, if enabled.
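The message choice described on this slide can be written as a small decision function. This is a sketch under simplifying assumptions: the function name and its boolean parameters are hypothetical, and it models only the three outcomes named above (ERR_COR, ERR_FATAL, or no message).

```python
def advisory_message(cor_enabled, escalated_to_fatal=False, fatal_enabled=True):
    """Pick the message an agent sends for an Advisory Non-fatal case.

    Returns "ERR_COR", "ERR_FATAL", or None (no error message sent).
    """
    if escalated_to_fatal:
        # Software raised the severity, so the case is no longer advisory.
        return "ERR_FATAL" if fatal_enabled else None
    return "ERR_COR" if cor_enabled else None

print(advisory_message(cor_enabled=True))   # ERR_COR
print(advisory_message(cor_enabled=False))  # None
```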
Advisory Non-fatal Cases 671 639

Five cases described in the spec:
1. Completer sends Completion with UR/CA status
   Completer decides how to handle e.g. Advisory Non-Fatal
2. Intermediate agent detects error
   Destination decides how to handle this
3. Destination receives poisoned TLP
   If destination can fix or work around the problem, it must be treated as Advisory Non-fatal
4. Completion Timeout
   If Requester can work around this (retry the request), it must be treated as Advisory Non-fatal
5. Unexpected Completion
   It’s expected that the original Requester will timeout and handle this (probably by retrying the request); unintended recipient treats it as Advisory Non-fatal
Advisory Non-fatal Examples 672 640

Two examples:
1. Switch sees poisoned TLP but isn’t the target, so doesn’t send uncorrectable error. If enabled, it can report ERR_COR to help software learn what happened (the switch may have poisoned the TLP).
2. Completer sending a completion with UR or CA status can also report ERR_COR. Requester chooses how to handle this.

[Figure: topology showing (1) a poisoned packet passing through a switch, which reports ERR_COR, and (2) a completion with UR status, whose Completer reports ERR_COR.]
Transaction Layer Errors 656 641

Name | Default Severity | Action Taken By Agent Detecting Error
Poisoned TLP Rec’d | Uncorrectable (Non-Fatal) | Receiver (if data poisoning supported): Send advisory ERR_COR, or ERR_NONFATAL to Root Complex. If AER registers are present, may* log TLP Header.
ECRC Check Failure | Uncorrectable (Non-Fatal) | Receiver: Send ERR_NONFATAL to Root Complex. If AER registers are present, may* log TLP Header.
Unsupported Request (UR) | Uncorrectable (Non-Fatal) | Request Receiver: Send ERR_NONFATAL to Root Complex. If AER registers are present, may* log TLP Header.
Completion Timeout | Uncorrectable (Non-Fatal) | Requester: Send advisory ERR_COR to Root and retry this failed request as many times as desired. Signal ERR_NONFATAL only when no further recovery attempts will be made.
Completer Abort | Uncorrectable (Non-Fatal) | Completer: Send advisory ERR_COR to Root Complex. If AER registers are present, may* log Request Header.

Descriptions assume error messages have been enabled
* Header is logged if software will be notified of the error
Transaction Layer Errors 656 642

Name | Default Severity | Action Taken By Agent Detecting Error
Unexpected Completion | Uncorrectable (Non-Fatal) | Receiver: Send advisory ERR_COR to Root Complex. If AER registers are present, may* log TLP Header. Note: if Unexpected completion resulted from improper routing, a Completion Timeout message will be sent by the original requester.
ACS Violation | Uncorrectable (Non-Fatal) | Receiver (if checking): Send ERR_NONFATAL to Root Complex. If AER registers are present, may* log TLP Header.
Receiver Overflow | Uncorrectable (Fatal) | Receiver (if checking): Send ERR_FATAL to Root Complex.
Flow Control Protocol Error | Uncorrectable (Fatal) | Receiver (if checking): Send ERR_FATAL to Root Complex.
Malformed TLP | Uncorrectable (Fatal) | Receiver: Send ERR_FATAL to Root Complex. If AER registers are present, may* log TLP Header.

Descriptions assume error messages have been enabled
* Header is logged if software will be notified of the error
Data Link Layer Errors 655 643

Name | Default Severity | Action Taken By Agent Detecting Error
Bad TLP | Correctable | Receiver: Send ERR_COR message to Root Complex
Bad DLLP | Correctable | Receiver: Send ERR_COR message to Root Complex
Replay Timeout | Correctable | Transmitter: Send ERR_COR message to Root Complex
REPLAY NUM Rollover | Correctable | Transmitter: Send ERR_COR message to Root Complex
Surprise Down | Uncorrectable (Fatal) | Optional Capability. If checked: Send ERR_FATAL to Root Complex
Data Link Layer Protocol Error | Uncorrectable (Fatal) | If checked: Send ERR_FATAL to Root Complex

Descriptions assume error messages have been enabled
Physical Layer Errors 655 644

Name | Default Severity | Action Taken By Agent Detecting Error
Receiver Error | Correctable | Receiver errors such as 8b/10b errors, Framing error, Loss of symbol lock, Lane de-skew error, Elastic buffer overflow/underflow error. Only 8b/10b error checking required. Send ERR_COR message to Root Complex

Descriptions assume error messages have been enabled
PCI Express 4 KB Config. Space 658 645

Basic error handling using existing PCI Command/Status register bits. Accessible with PCI-compatible software.

PCIe-specific error handling in this required capability structure: Device Control and Status registers.

Optional Advanced Error Reporting Capability Structure contains registers that allow greater resolution in identifying specific errors and determining the error handling strategy.
PCI Express 4 KB Config. Space 646

Legacy software expects to enable errors with Command Register and read status in the Status Register
PCI-Compatible Errors 677 647

• Transaction poisoning and error forwarding (EP bit)
  • Analogous to PCI Data Parity Error
• Completion Status:
  • Completer Abort: analogous to PCI Target Abort
  • Unsupported Request: analogous to PCI Master Abort
• Uncorrectable Error message sent:
  • Analogous to PCI System Error
PCI Status Register 676 648

PCI Command Register 675 649

Enables sending uncorrectable error messages

Enables reporting poisoned packet
PCI Express 4 KB Config. Space 658 650

PCI Express-specific error handling is available in Device Control and Status registers.
PCIe Baseline Error Reporting 677 651

• Error notification takes three forms:
  1. Messages sent to Root Complex
  2. Completion status errors
  3. Error Forwarding
Error Messages 669 652

• Error message notifies host software of:
  1. Error type
  2. Device that detected it (Requester ID)
Completion Status Errors 662 653

• Completion Status returned to requester
  • If Requester needs to report this, it must use a design-specific method and not an error message
  • Completer can send Advisory Non-Fatal error message

• Successful Completion (SC) = 000b
• Unsupported Request (UR) = 001b
• Configuration Request Retry Status (CRS) = 010b
• Completer Abort (CA) = 100b
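As an illustration, the 3-bit Completion Status field can be decoded with a lookup table. The function name is hypothetical; the encodings are those defined by the PCIe spec, in which Completer Abort is 100b and all other values are reserved.

```python
COMPLETION_STATUS = {
    0b000: "Successful Completion (SC)",
    0b001: "Unsupported Request (UR)",
    0b010: "Configuration Request Retry Status (CRS)",
    0b100: "Completer Abort (CA)",  # CA is 100b in the PCIe spec
}

def decode_completion_status(field):
    """Map the 3-bit Completion Status field to its name."""
    return COMPLETION_STATUS.get(field & 0b111, "Reserved")

print(decode_completion_status(0b001))  # Unsupported Request (UR)
```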
Error Forwarding / Data Poisoning 660 654

• Error/Poisoned bit set when data payload is known to have errors.
  • Sender knows data is bad when packet is sent, or
  • Intermediate device detects this condition and changes the bit
Avoiding Error Pollution 656 655

• To avoid potential confusion, it’s important to isolate errors to the most significant occurrence for a packet
• If a packet error is seen at a lower level, don’t report other errors from higher layers
• If multiple Transaction Layer errors, report only one.
  • Priority sequence (high to low): Rx Overflow, Flow Control Protocol, ECRC Failure, Malformed TLP, AtomicOp Egress Blocked, TLP Prefix Blocked, ACS Violation, MC Blocked, UR, CA, Unexpected Completion, Poisoned TLP
  • Example: ECRC violation could also result in a Malformed TLP, but the second condition was caused by the first so there’s no need to report both
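The priority rule for Transaction Layer errors lends itself to a short sketch: given every error detected for one packet, report only the highest-priority one. The list order is the one given on the slide; the function name and the use of plain strings as error identifiers are assumptions for illustration.

```python
# High to low priority, in the order listed on the slide.
PRIORITY = [
    "Rx Overflow", "Flow Control Protocol", "ECRC Failure", "Malformed TLP",
    "AtomicOp Egress Blocked", "TLP Prefix Blocked", "ACS Violation",
    "MC Blocked", "UR", "CA", "Unexpected Completion", "Poisoned TLP",
]

def error_to_report(detected):
    """Return the single highest-priority error among those detected."""
    return min(detected, key=PRIORITY.index)

# The slide's example: an ECRC failure that also looks like a Malformed TLP.
print(error_to_report({"Malformed TLP", "ECRC Failure"}))  # ECRC Failure
```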
PCI Express Capability Structure 678 656

Device Capabilities Register 657

Intermediate device (e.g. Switch) detecting an error may report an uncorrectable error with ERR_COR. This helps software resolve the issue while ensuring that only the intended target reports uncorrectable errors.
Device Control Register 681 658

Device Status Register 682 659

Note: highlighted bits are RW1CS (Readable, Write 1 to Clear, and Sticky)
Sticky: not initialized by reset
RsvdZ: reserved but must write zero
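Write-1-to-clear behavior is easy to get wrong in software, so a tiny model may help. This sketch shows only the RW1C part (not stickiness), and the function name and 16-bit width are assumptions chosen for the example.

```python
def write_rw1c(current, written, width=16):
    """Model a write to a write-1-to-clear register.

    Bits written as 1 are cleared; bits written as 0 are left unchanged.
    """
    mask = (1 << width) - 1
    return current & ~written & mask

status = 0b0101                      # two status bits currently set
status = write_rw1c(status, 0b0001)  # clear only bit 0
print(bin(status))                   # 0b100
```

Because writing 0 has no effect, software can clear one status bit without a read-modify-write of the others.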
Link Control Register 684 660

RsvdP: reserved but must be preserved (restore any value read when writing to it)
Link Status Register 685 661

Root Control Register 683 662

Advanced Error Reporting 685 663

• Gives more granularity in defining error type
• Ability to define severity of uncorrectable errors
  • Choosing whether to send ERR_FATAL or ERR_NONFATAL message for a given error
• Support for storing a copy of the TLP header when an incoming packet resulted in an error
• Ability to mask reporting of errors
• Enable/disable root reporting of errors
• Allows root to identify source of errors
Advanced Error Reporting Registers 686 664

ECRC Generation and Checking 687 665

Advanced Error Capability and Control Register

Header Log 695 666

Advanced Uncorrectable Status 691 667

Note: Status bits RW1CS


RsvdZ: reserved but must write zero

Uncorrectable Severity Register 694 668

If set, the corresponding error is reported as fatal.


If clear, the error is reported as non-fatal.
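A minimal sketch of how the Uncorrectable Status, Mask, and Severity registers combine to select a message follows. It assumes message reporting is enabled and ignores the Advisory Non-fatal cases; the function name is hypothetical, and bit 18 (Malformed TLP) is used only as an example.

```python
def uncorrectable_message(bit, status, mask, severity):
    """Return the message reported for one Uncorrectable Status bit, or None."""
    flag = 1 << bit
    if not status & flag or mask & flag:
        return None  # error not logged, or its reporting is masked
    return "ERR_FATAL" if severity & flag else "ERR_NONFATAL"

MALFORMED_TLP = 18  # bit position in the Uncorrectable Error Status register
print(uncorrectable_message(MALFORMED_TLP, 1 << 18, 0, 1 << 18))  # ERR_FATAL
print(uncorrectable_message(MALFORMED_TLP, 1 << 18, 0, 0))        # ERR_NONFATAL
```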

Advanced Uncorrectable Mask 694 669

Note: Mask bits RWS


RsvdP: reserved but must be preserved
Advanced Correctable Error Status 689 670

Note: Status bits RW1CS


RsvdZ: reserved but must write zero

Advanced Correctable Mask 690 671

Note: Mask bits RWS


RsvdP: reserved but must be preserved

Root Error Tracking and Reporting 696 672

• Root is the target for all error messages
• Status registers track errors
• Source ID logs message sender’s ID
• Root can be enabled to report received error messages to the system with an interrupt using:
  • INTx pin emulation
  • MSI (using the vector number hard-coded in the Root Error Status register)
Advanced Root Error Status 697 673

Note: Status bits RW1CS


RsvdZ: reserved but must write zero

Advanced Source ID Register 698 674

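The Source ID register holds the Requester IDs of the agents that sent error messages. A hedged sketch of decoding it into bus, device, and function numbers follows; the field split (ERR_COR source in bits 15:0, ERR_FATAL/NONFATAL source in bits 31:16) follows the PCIe spec, while the function names are hypothetical.

```python
def decode_bdf(requester_id):
    """Split a 16-bit Requester ID into (bus, device, function)."""
    bus = (requester_id >> 8) & 0xFF
    device = (requester_id >> 3) & 0x1F
    function = requester_id & 0x7
    return bus, device, function

def decode_source_id(register):
    """Return the BDFs of the ERR_COR and ERR_FATAL/NONFATAL senders."""
    return {
        "err_cor": decode_bdf(register & 0xFFFF),
        "err_fatal_nonfatal": decode_bdf(register >> 16),
    }

print(decode_bdf(0x00E6))  # (0, 28, 6)
```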
Advanced Root Error Command 698 675

RsvdP: reserved, but must be preserved

Internal Error Reporting (2.1) 667 676

1. Make internal logic errors visible to software in an industry-standard way.
   • In high-end systems it’s important to be able to detect and contain errors
   • Endpoints have device drivers that can obtain internal information, but switches are controlled by the OS instead.
   • As a result, switch vendors have developed proprietary and incompatible error-reporting methods.
2. Allow multiple error headers to be recorded
   • Current AER model only saves info on the first uncorrectable error
3. Detect the occurrence of multiple errors of the same type
New Errors Reported 677

• Internal error causes will be implementation specific
• Three new errors reported
  • Corrected Internal – masked or worked around by h/w with no loss of info or operation (e.g. memory error corrected by ECC). Optionally, send ERR_COR.
  • Header Log Overflow – Optionally, send ERR_COR.
  • Uncorrectable Internal – needs a reset or h/w replacement. Optionally, send ERR_FATAL.
• As with other AER status bits, they can be masked, and severity is programmable
AER Correctable Status Register 689 678

• New Correctable Status bits

Correctable Error Status Register

Bits | Field
31:16 | RsvdZ
15 | Header Log Overflow Status
14 | Corrected Internal Error Status
13 | Advisory Non-Fatal Error Status
12 | Replay Timer Timeout Status
11:9 | RsvdZ
8 | Replay Num Rollover Status
7 | Bad DLLP Status
6 | Bad TLP Status
5:1 | RsvdZ
0 | Receiver Error Status
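When debugging, it can help to turn a raw Correctable Error Status value into names. The sketch below uses the bit positions of the register layout above; the function name and the sample value are made up for illustration.

```python
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "Replay Num Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

def decode_correctable_status(value):
    """List the names of the status bits set in a raw register value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if (value >> bit) & 1]

print(decode_correctable_status(0x2040))  # ['Bad TLP', 'Advisory Non-Fatal Error']
```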
AER Uncorrectable Status Register 691 679

• New Uncorrectable Status bit

Uncorrectable Error Status Register

Bits | Field
31:26 | RsvdZ
25 | TLP Prefix Blocked Error Status
24 | AtomicOp Egress Blocked Status
23 | MC Blocked TLP Status
22 | Uncorrectable Internal Error Status
21 | ACS Violation Status
20 | Unsupported Request Error Status
19 | ECRC Error Status
18 | Malformed TLP Status
17 | Receiver Overflow Status
16 | Unexpected Completion Status
15 | Completer Abort Status
14 | Completion Timeout Status
13 | Flow Control Protocol Error Status
12 | Poisoned TLP Status
11:6 | RsvdZ
5 | Surprise Down Error Status
4 | Data Link Protocol Error Status
3:1 | RsvdZ
0 | Undefined
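The same decoding approach works for the Uncorrectable Error Status register. This is a sketch using the bit positions of the register layout above; the function name and sample value are hypothetical.

```python
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error", 5: "Surprise Down Error",
    12: "Poisoned TLP", 13: "Flow Control Protocol Error",
    14: "Completion Timeout", 15: "Completer Abort",
    16: "Unexpected Completion", 17: "Receiver Overflow",
    18: "Malformed TLP", 19: "ECRC Error", 20: "Unsupported Request Error",
    21: "ACS Violation", 22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP", 24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked Error",
}

def decode_uncorrectable_status(value):
    """List the names of the status bits set in a raw register value."""
    return [name for bit, name in sorted(UNCORRECTABLE_BITS.items())
            if (value >> bit) & 1]

print(decode_uncorrectable_status(0x00041000))  # ['Poisoned TLP', 'Malformed TLP']
```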
Multiple Header Registers 687 680

Advanced Error Capabilities and Control register

Bits | Field
31:12 | RsvdP
10 | Multiple Header Recording Enable
9 | Multiple Header Recording Capable
4:0 | First Error Pointer

Only a finite number of headers can be recorded, so it’s important that software clear errors as soon as possible.
If too many errors arrive, a Header Log Overflow error is reported.
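A small sketch of pulling the fields discussed here out of a raw Advanced Error Capabilities and Control value. The bit positions (First Error Pointer in bits 4:0, Multiple Header Recording Capable in bit 9, Enable in bit 10) follow the PCIe spec; the function name and dictionary keys are hypothetical.

```python
def decode_aer_cap_control(value):
    """Extract the header-recording fields from the register value."""
    return {
        "first_error_pointer": value & 0x1F,
        "multi_header_capable": bool((value >> 9) & 1),
        "multi_header_enable": bool((value >> 10) & 1),
    }

# Example: both multiple-header bits set, FEP pointing at status bit 18.
print(decode_aer_cap_control(0x600 | 18))
```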
Tracking Multiple Error Headers 681

• AER records header for received TLPs that cause errors.
• The First Error Pointer (FEP) always points to the uncorrectable error whose header is visible in the header log.
• Writing a 1 to the corresponding bit in the Uncorrectable Status clears that instance and, if multiple header recording is enabled, that also updates the FEP to point to the next error and recorded header.
• When the FEP points to an invalid status bit, there are no more headers to report.
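The FEP behavior with multiple header recording enabled can be modeled as a queue. This is a simplified sketch, not the hardware mechanism: the class and method names are hypothetical, and an empty queue is represented by `None` rather than by an FEP value that points to an invalid status bit.

```python
from collections import deque

class HeaderLogModel:
    """Toy model of multiple header recording: a FIFO of (status bit, header)."""

    def __init__(self):
        self.recorded = deque()

    def record(self, status_bit, header):
        self.recorded.append((status_bit, header))

    @property
    def first_error_pointer(self):
        # Points at the error whose header is currently visible in the log.
        return self.recorded[0][0] if self.recorded else None

    def clear_first(self):
        # Models software writing 1 to the status bit the FEP points to.
        if self.recorded:
            self.recorded.popleft()

log = HeaderLogModel()
log.record(18, "header A")  # Malformed TLP
log.record(20, "header B")  # Unsupported Request
log.clear_first()
print(log.first_error_pointer)  # 20
```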
Error Handling Flowchart 699 682

[Figure: error handling flowchart built from AND and OR gates.]
MindShare Lab: Error Debugging 683

Open File: error_lab.arbsys

• Part 1: Software received an interrupt from PCIe Root Port 0:28:6 that was generated because of a received error message. Answer the following questions:
  1. What type of error message triggered the interrupt (ERR_FATAL, ERR_NONFATAL, ERR_COR)?
  2. From which BDF did the error message originate?
  3. What was the specific error condition that caused the first error message?
  4. Were there any other errors detected on that BDF? If so, what are they?
  5. Is any other information about the first error provided? If so, provide it (decoded if possible).
  6. Bonus Question: What was the vector of the interrupt generated to software because of the error?

• Part 2: Follow the same steps above for an interrupt generated from Root Port 0:28:0 because of a received error message. On question 5 for this part of the lab, also try to figure out why this was an error.
Power Management

Elements of Power Management 705 685

 Hardware/software elements of PC PM
 OS
 ACPI Driver
 WDM Device Driver
 Miniport Driver
 PCI Express Bus Driver
 PCI Express PM registers in each function
 System board power plane control and bus clock
control logic

ACPI Overview 707 686

 Provides OS control of power management.


 Previously, power events were typically handled by
SMM as platform-specific events, but that approach
had problems:
 Event-handling code didn’t have full system visibility
 Limited code couldn’t support complex PM policies
 SMM code written by platform designers could be buggy or
cause problems for OS
 ACPI supports a multi-tiered view of system power
management, starting with system Global States, G0-
G3.
 Within the Global States, a number of substates are
defined that trade off power savings against the
latency involved to return to normal performance.
Platform Power Management Scope 687

 The ACPI spec standardizes system-level power
management, and defines power states for the system,
CPU, and IO interfaces
 Note that PCI, PCIe, USB and other compliant buses
cover power management in their own specs and add to
it in some cases.

[Figure: platform block diagram. Two CPU cores (Core 0 and Core 1), each with a Register Set, Execution Units, Instruction Pipeline, L1 Code Cache, L1 Data Cache, and Local APIC, share an L2 Unified Cache and FSB/PSB Unit. The FSB connects to the MCH (Memory/DRAM, PCIe GFX); DMI connects the MCH to the ICH (PCIe, SMBus, IDE/SATA, PCI, HD, USB).]
Global Power States G0/S0 708 688

G0/S0 (Working/Not Sleeping) – system is working
normally and responds to user in real-time.
CPU performance/power adjustments are allowed
within this state:
 C0: CPU working; running at full power
 C1: low-latency state with CPU running at reduced
power (generally, lower performance too)
 Cn: Additional CPU low power (and performance) steps,
each with increasing return latency
(example: Core2 defines C0 – C6)

Global Power States G1/S1-S4 708 689

G1/S1-S4 – system is sleeping and not executing
threads. To the user it appears to be off. Several
levels of sleep are defined (full reboot not required):
 S1 – caches flushed, CPU halted
 S2 – CPU is powered off – this is not commonly
implemented
 S3 – (Suspend to RAM or Standby) CPU context is
copied into DRAM and most of the system is powered
down. DRAM does self-refresh to maintain data.
 S4 – (Suspend to Disk or Hibernate) Context is saved to
a non-volatile memory like the disk and almost all of the
system is powered down

Global Power States G2/S5 and G3 708 690

G2/S5 (Soft Off) – system is consuming minimum
power.
All context (except real-time clock) is lost and the
system will have to be restarted (cold boot).
Some power is still distributed to keep things like
the power button active.

G3 (Mechanical Off) – system is consuming no
power because a mechanical power switch has been
turned off or AC power has been removed.
Power will have to be restored before machine is
activated, and full reboot will be required.
Platforms May Restrict States 708 691

Laptops are usually very aggressive about power
consumption, especially when running on battery,
and would likely use the full range of power states.
Servers have other concerns, so power states
aren’t always used:
 Latencies involved can cause problems
 Devices may occasionally have trouble recovering,
hurting reliability

Legacy Software Responsibilities 709 692

 Device power states


 D0 (mandatory) full on state
 D1 (optional support)
 D2 (optional support)
 D3 (mandatory) lowest device power state
 Device Context preservation
 PME Context preservation using VAUX

PCIe PM vs. ACPI 711 693

 PCI Express bus driver controls PCIe
configuration and PM registers
 ACPI driver controls non-standard system
board devices such as chipset, clock, power
controls

PM Relationships 712 694

Device Power Management 713 695

PCI PM Capability Register Set
Required configuration registers (doubleword number in decimal, byte 3 leftmost)

DW  Byte 3        Byte 2        Byte 1         Byte 0
00  Device ID                   Vendor ID
01  Status Register             Command Register
02  Class Code                                 Revision ID
03  BIST          Header Type   Latency Timer  Cache Line Size
04  Base Address 0
05  Base Address 1
06  Base Address 2
07  Base Address 3
08  Base Address 4
09  Base Address 5
10  CardBus CIS Pointer
11  Subsystem ID                Subsystem Vendor ID
12  Expansion ROM Base Address
13  Reserved                                   Capabilities Pointer
14  Reserved
15  Max_Lat       Min_Gnt       Interrupt Pin  Interrupt Line

PCI PM Device States 722 696

1. D0 Un-initialized and D0 Initialized
• Active state; support required
2. D1 Light sleep
• Optional
3. D2 Deep sleep
• Optional
4. D3 Hot and D3 Cold
• Support required

[Figure: device state diagram with the states D0 Un-initialized (entered on Power On Reset), D0 Active, D1, D2, D3 Hot, and D3 Cold (entered when Vcc is removed)]

PM Capabilities Register 724 697

PM Control/Status Register 724 698

7 Link States 742 699

1. L0: Fully Active, support required


2. L0s: Low power standby; low resume latency
3. L1: Lower power than L0s; longer resume time
4. L2/L3 Ready: Staging state in preparation for power removal
5. L2: Aux. powered Link; Deep power savings
6. L3: Link powered off state: no power consumed
7. LDn: Link state after power and clocks are re-applied after being removed.
Device PLL may still not be running. Vaux may be off.

While training, Link is in LDn. When LTSSM finishes, Link state goes to L0.

Cold, Warm, or Hot Reset, Link Disable, or Vcc/Clock re-application

Relationship of D- and L-States 734 700

 Link state is affected by D-state of downstream
component
 Upstream component state cannot be more
aggressive than downstream component

Downstream Component   Permissible Upstream   Permissible Interconnect
D-State                Component D-State      State
D0                     D0                     L0, L0s, ASPM L1
D1                     D0-D1                  L1
D2                     D0-D2                  L1
D3 hot                 D0-D3 hot              L1, L2/L3_Ready
D3 cold                D0-D3 cold             L2 (with Vaux)(3), L3
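
The table above can be expressed as a small lookup. The following Python sketch only restates the table rows; the names `D_STATES`, `LINK_STATES`, and `is_permitted` are hypothetical, and the Vaux condition on L2 is not modeled.

```python
# D-states ordered from shallowest to deepest.
D_STATES = ["D0", "D1", "D2", "D3hot", "D3cold"]

# Permissible interconnect states, keyed by downstream component D-state.
LINK_STATES = {
    "D0": {"L0", "L0s", "ASPM L1"},
    "D1": {"L1"},
    "D2": {"L1"},
    "D3hot": {"L1", "L2/L3_Ready"},
    "D3cold": {"L2", "L3"},
}


def is_permitted(downstream_d, upstream_d, link_state):
    """True if the combination appears in the D-state/L-state table."""
    # The upstream component may not be in a deeper state than the downstream one.
    upstream_ok = D_STATES.index(upstream_d) <= D_STATES.index(downstream_d)
    return upstream_ok and link_state in LINK_STATES[downstream_d]
```

For example, `is_permitted("D2", "D0", "L1")` is true, while `is_permitted("D0", "D1", "L0")` is false because the upstream component would be in a deeper state than the downstream one.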

Active State Power Management (ASPM) 742 701

Link Capabilities Register 743 702

Active State PM Support field

Link Control Register 744 703

Active State PM Control

00b = Both Disabled
01b = L0s Enabled, ASPM L1 Disabled
10b = ASPM L1 Enabled, L0s Disabled
11b = Both Enabled
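
The two-bit encoding above treats each bit as an independent enable. The Python sketch below decodes and updates the field, assuming it occupies bits 1:0 of a 16-bit Link Control register value; the function and constant names are hypothetical.

```python
ASPM_L0S = 0b01   # bit 0: L0s entry enabled
ASPM_L1 = 0b10    # bit 1: ASPM L1 entry enabled


def decode_aspm_control(link_control):
    """Return which ASPM states are enabled in a Link Control value."""
    field = link_control & 0b11
    return {"l0s": bool(field & ASPM_L0S), "l1": bool(field & ASPM_L1)}


def set_aspm_control(link_control, l0s, l1):
    """Return a new Link Control value with only the ASPM field changed."""
    field = (ASPM_L0S if l0s else 0) | (ASPM_L1 if l1 else 0)
    return (link_control & ~0b11 & 0xFFFF) | field
```

A read-modify-write like `set_aspm_control` leaves the other Link Control bits untouched, which matters because the register carries unrelated controls as well.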

Entry Into L0s State 744 704

 Managed separately for each
direction of the Link
 A device simply initiates entry
on its transmitting Lanes
 If transmitter is disabled from
using L0s, its receiver must still
tolerate L0s from the other
device
 Transmit side may be in L0 while
receive side is in L0s

Definition of Idle Time 745 705

 Endpoint or Root Complex Root
Port
 No TLPs or DLLPs pending transmit
or no FC credits available to transmit
anything
 Switch Upstream Port
 All Switch downstream port’s receive
Lanes are L0s, and
 No TLPs or DLLPs to transmit or no
FC credits available to transmit
 Switch Downstream Port
 All Switch upstream port’s receive
Lanes are L0s, and
 No TLPs or DLLPs to transmit on
this Link or no FC credits available
to transmit
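
The idle conditions listed on this slide reduce to a few boolean checks. The following Python sketch is one way to write them down; the function names and parameters are hypothetical, and any timing requirement on how long the idle condition must persist is not modeled.

```python
def endpoint_or_root_port_idle(pending_packets, fc_credits_available):
    """Idle: nothing to send, or nothing can be sent for lack of FC credits."""
    return pending_packets == 0 or not fc_credits_available


def switch_upstream_port_idle(downstream_rx_in_l0s, pending_packets,
                              fc_credits_available):
    """Idle only if every downstream port's receive Lanes are in L0s as well."""
    return all(downstream_rx_in_l0s) and (
        pending_packets == 0 or not fc_credits_available
    )


def switch_downstream_port_idle(upstream_rx_in_l0s, pending_packets,
                                fc_credits_available):
    """Idle only if the upstream port's receive Lanes are in L0s as well."""
    return upstream_rx_in_l0s and (
        pending_packets == 0 or not fc_credits_available
    )
```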
Exit from L0s 746 706

 Components initiate exit if they
need to communicate
 Downstream Initiated Exit
 Switch must initiate a transition on
upstream port’s transmit Lanes as
soon as exit on any downstream
ports is detected (if upstream Link is
in a low power state)
 Upstream Initiated Exit
 Switch must initiate transition on all
downstream port transmit Lanes as
soon as exit on upstream port is
detected (if downstream Links are in
a low power state)
 To exit:
 Transmitter sends FTS Ordered Sets,
followed by one SKP Ordered Set;
Receiver recovers bit and symbol
lock
ASPM L1 State 747 707

Ports that may initiate entry into L1 ASPM

Downstream Device Requests L1 750 708

Upstream Device May Reject Request 752 709

L1 Exit From Downstream 754 710

Root Complex

6. Switch F signals L1 ASPM State


L1 Exit to Switch C
PM State D0 5. Within 1µs of 
step 4, Switch F 
Switch signals L1 Exit to RC

(F)
L1 ASPM State L1 State
4. Switch F signals
L1 Exit to Switch C L1 ASPM
State
3. Within 1µs of step 2,
PM State D0 Switch C signals  PM State D1
PM State L1 Exit to Switch F
PCIe D0 PCI-XP
Endpoint Switch Endpoint
(D) (C) (E)
L1 ASPM State
L1 State 1. EP B signals 
L1 Exit to Switch C
2. Switch C signals
L1 Exit to EP B
PM State D2 PM State D0
PCIe PCIe
Endpoint Endpoint
(A) (B)
L1 Exit From Upstream 755 711

Root Complex

1. RC signals L1 Exit L1 ASPM State


to Switch F 2. Switch F signals
PM State D0 L1 Exit to RC

3. Within 1µs of  Switch


step 2, Switch F  (F)
signals L1 Exit to
EP D & Switch C
L1 State
L1 ASPM State
L1 ASPM
State
4b. EP D signals 4a. Switch C signals
L1 Exit to Switch F L1 Exit to Switch F
PM State PM State D1
PM State D0 PCIe D0 PCIe
Endpoint Switch Endpoint
(D) (C) (E)
L1 ASPM State
L1 State
6. EP B signals 
5. Within 1µs of step  L1 Exit to Switch C
4a, Switch C signals 
L1 Exit to EP B
PM State D3 PM State D0
PCIe PCIe
Endpoint Endpoint
(A) (B)
ASPM Exit Latency Registers 757 712

Example ASPM L1 Exit Latency 759 713

Root Complex

RC L1 latency (8µs)
5. Exit to L0 also takes 8µs
L1 State

PM State D0 4. Within 1µs of detected L1 exit


from Switch C, Switch F signals
Switch L1 Exit to RC

Switch F, L1 latency (8µs) (F)


3. Exit to L0 takes 16µs L1 State

L1 State
2. Within 1µs of detecting,
PM State D0 L1 Exit from EP B, Switch
PM State C signals Exit to Switch F
PCIe D0 PCIe
PM State D1
Endpoint Switch Endpoint
(D) (C) (E)
Switch C, L1 latency (16µs)

1. Exit to L0 takes 16µs


L1 State L1 State because the switch takes
longer than the endpoint

PM State D2 PM State D0
PCIe PCIe EP B, L1 latency (8µs)
Endpoint Endpoint
(A) (B)
Timeline (µs):
Link B/C starts L1 exit at T and takes 16 µs (T to T+16)
Link C/F starts L1 exit at T+1 and takes 16 µs (T+1 to T+17)
Link F/RC starts L1 exit at T+2 and takes 8 µs (T+2 to T+10)
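
The arithmetic behind this example can be sketched in a few lines of Python. It assumes, as the figure does, one L1 exit latency per component, a Link latency equal to the slower of its two ends, and a worst-case 1 µs delay before each Switch forwards the exit upstream. The function names are hypothetical.

```python
def link_exit_latency(port_a_us, port_b_us):
    """A Link is back in L0 only when the slower of its two ends is ready."""
    return max(port_a_us, port_b_us)


def l1_exit_timeline(link_latencies_us, forward_delay_us=1):
    """Return (start, end) times in µs for each Link, endpoint side first."""
    timeline = []
    for hop, latency in enumerate(link_latencies_us):
        start = hop * forward_delay_us   # each Switch forwards within 1 µs
        timeline.append((start, start + latency))
    return timeline


def total_exit_latency(link_latencies_us, forward_delay_us=1):
    """Time until every Link on the path to the Root Complex is in L0."""
    return max(end for _, end in
               l1_exit_timeline(link_latencies_us, forward_delay_us))


links = [
    link_exit_latency(8, 16),   # EP B / Switch C
    link_exit_latency(16, 8),   # Switch C / Switch F
    link_exit_latency(8, 8),    # Switch F / RC
]
print(l1_exit_timeline(links))    # [(0, 16), (1, 17), (2, 10)]
print(total_exit_latency(links))  # 17
```

The path is ready at T+17 rather than T+40 because the Links exit L1 in parallel; only the forwarding delays add up.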


Software-Initiated Link Power Management 714

[Figure: PCI PM Capability Register Set and the required configuration registers (doublewords 00-15), as on the Device Power Management slide]

D1, D2, D3Hot and the L1 States 760 715

Software Puts Device in D2 760 716

Root Complex
1. Software generates a
Configuration Write TLP to
place EP A into the D2 state
L0 State

PM State D0
Switch
(F)
L1 ASPM State
L0s L0 State

L0 State
PM State D0 PM State PM State D0
PCIe D0 PCIe
Endpoint Switch Endpoint
(D) (C) (E)

L0 State L1 ASPM State


2. EP A Receives Config
Write that places it
into the D2 state L1 State
3. EP A signals a link 
PM State D0 transition to the L1 state PM State D0
PCIe PCIe
Endpoint Endpoint
PM State D2
(A) (B)
Software Puts Device in D2 762 717

L1 Exit Protocol 762 718

 Triggered by upstream device


 Software can cause a device to return to D0 by
performing config. write to PM registers
 Triggered by downstream device
 Device could trigger an exit based on external
events
 Whichever device initiates the exit sends
TS1. Receiver returns TS1 and both go
through re-training

L2/L3 Ready – Link Power Removal 764 719

Negotiation to Enter L2/L3 Ready 766 720

Negotiation to Enter L2/L3 Ready 766 721

In case a device doesn’t
respond, software is allowed
to timeout and move ahead
after 1-10ms

Exit L2/L3 Ready- Power Removal 767 722

(Vaux) (No Pwr)

Link Wakeup and PME Generation 768 723

 PME Message Generation


 PME Message delivered when Link in L0
 If Link in non-communicating state, it will first have
to be reinitialized
 From L2 or L3, power must be reapplied before training
 From L1, only training is needed
 Once Link is operational, PME can inform
software which device signaled a wakeup
event

PME Message 769 724

PME Sequence 770 725

 If device has lost power, it must first request that
power be restored by sending a Beacon or WAKE# to
the Root. Eventually, power is restored and the Link
trains to L0.
 Device issues PME message implicitly routed to Root
 Root informs PM controller, which may trigger an
interrupt to notify software.
 Software
 Sends configuration read to query PME status, using the
Requester ID of the PME message
 Configures device to D0 power state
 Restores device context as needed

PME Message Deadlock 770 726

[Figure labels: Read request, PME, PME, Completion]

Two PMEs arrive; Root accepts 1st but can’t take another,
so 2nd one waits. Root reads status of first requester.
Completion is returned, but 2nd PME is blocking its
progress, causing a deadlock.

Solution: Root discards extra PME messages.
Functions must resend PME if their PME_Status bit
isn’t cleared within a timeout of 100ms
PME Context 771 727

 Device uses Vaux to keep some logic active
to wake the Link and retain some PME
context even in D3cold, such as:
 PME_Status bit
 PME_Enable bit
 Device-specific status bits such as those that
indicate cause of wakeup event
 Application-specific information like modem Caller
ID

WAKE# Pin Implementation 774 728

Beacon Signal Implementation 774 729

Beacon Signaling 484 730

 Used by USP to signal wake-up event over a Link in L2 state


 Main power is off, Beacon is powered by Vaux
 Low-frequency, DC-balanced, differential signal consisting of
periodic pulse between 2ns – 16ns
 Sent on Lane 0, optional for other Lanes

Power Restored 731

Auxiliary Power Enable 775 732

Dynamic Power Allocation (DPA) (2.1) 714 733

 PCIe 2.0 added dynamic power support at
the Link level, but not at the device level
 Power and thermal budgets are becoming
increasingly important in system design, and
changes in other areas mean that PCIe
devices are becoming a bigger part of the
power budget and need more options.
 Goal: Lower platform cost by reducing device
power requirements

Dynamic Power Allocation (DPA) Method 715 734

 New registers provide devices with up to 32
power management sub-states while in the
D0 device power state
 Gives software visibility and control of device
power states, even if a device doesn’t have a
driver that handles PM
 Power state is read or changed using
configuration cycles
 Requires use of DPA extended registers
 Multiple Functions in a device can have their
own DPA registers, and the overall device
power will be the sum of them all
DPA Capability Structure 715 735

PCIe Enhanced Capability Header (offset 000h)
Bits   Field
31:20  Next Extended Capability Offset
19:16  Version (1h)
15:0   PCIe Extended Capability ID (0016h for DPA)

Offset        Register (bits 31:0)
000h          PCIe Enhanced Capability Header
004h          DPA Capability Register
008h          DPA Latency Indicator Register
00Ch          DPA Control Register (upper 16 bits), DPA Status Register (lower 16 bits)
010h to 02Ch  DPA Power Allocation Array (sized by number of substates)

DPA Capability Register 716 736

Bits   Field
31:24  Transition Latency Value 1 (Xlcy1)
23:16  Transition Latency Value 0 (Xlcy0)
15:14  RsvdZ
13:12  Power Allocation Scale (PAS)
11:10  RsvdZ
9:8    Transition Latency Unit (Tlunit)
7:5    RsvdZ
4:0    Substate_Max
All fields not reserved are (RO)

 Transition Latency Value 0 & 1 – these are
multiplied by the Tlunit to give two max
transition times for going into this substate
from any other. Actual latency can not be
more than this.
Capability Register Fields 737

 Power Allocation Scale – multiplier for substate
power allocation (value in watts):
00 – 10.0
01 – 1.0
10 – 0.1
11 – 0.01
 Transition Latency Unit – multiplier for max substate
change latency
00 – 1ms
01 – 10ms
10 – 100ms
11 – Reserved
 Substate_Max – number of supported substates - 1
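
As a rough illustration of how these fields combine, the Python sketch below decodes a 32-bit DPA Capability Register value. The bit positions follow the register layout on the DPA Capability Register slide (Substate_Max in 4:0, Tlunit in 9:8, PAS in 13:12, Xlcy0 in 23:16, Xlcy1 in 31:24); the function name and the dictionary keys are hypothetical.

```python
PAS_MULTIPLIER = {0b00: 10.0, 0b01: 1.0, 0b10: 0.1, 0b11: 0.01}
TLUNIT_MS = {0b00: 1, 0b01: 10, 0b10: 100}   # 0b11 is reserved


def decode_dpa_capability(reg):
    """Decode the read-only fields of a DPA Capability Register value."""
    unit_ms = TLUNIT_MS.get((reg >> 8) & 0x3)   # None for the reserved code
    xlcy0 = (reg >> 16) & 0xFF
    xlcy1 = (reg >> 24) & 0xFF
    return {
        "substate_count": (reg & 0x1F) + 1,     # Substate_Max is count - 1
        "pas_multiplier": PAS_MULTIPLIER[(reg >> 12) & 0x3],
        "tlunit_ms": unit_ms,
        "max_latency0_ms": None if unit_ms is None else xlcy0 * unit_ms,
        "max_latency1_ms": None if unit_ms is None else xlcy1 * unit_ms,
    }
```

For example, a register with Xlcy1 = 5, Xlcy0 = 2, PAS = 01b, Tlunit = 01b, and Substate_Max = 3 describes four substates, a 1.0 multiplier, and maximum transition latencies of 20 ms and 50 ms.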
Latency Indicator Register 738

 Each bit indicates which of the two latency
values applies to that substate.
 Examples:
 If bit 17 = 0, then substate 17 uses latency value 0
 If bit 10 = 1, then substate 10 uses latency value 1

Bits 31:0  DPA Latency Indicator Register, all bits (RO)

DPA Status Register 716 739

 Status gives current substate setting


 Control is enabled by default after a reset, but
can be disabled by writing a one to bit 8

Bits  Field
15:9  RsvdZ
8     Substate Control Enabled (RW1C)
7:5   RsvdZ
4:0   Substate status (RO)

DPA Control Register 740

 Software writes the desired substate value
here. If Substate Control is Enabled, that
determines the Function’s substate.
 Default is substate 0.

Bits  Field
15:5  RsvdP
4:0   Substate Control (RW)

Power Allocation Array 715 741

[Figure: DPA capability structure, with the DPA Power Allocation Array at offsets 010h up to 02Ch]

One 8-bit register for each substate gives the power for that substate,
which is multiplied by the Power Allocation Scale field to arrive at the
wattage used. All values in the array are RO.
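
A short Python sketch of the per-substate arithmetic follows. It assumes the array bytes and the scale multiplier have already been read from configuration space; the function names are hypothetical, and the sample values are invented for illustration.

```python
def substate_power_watts(allocation_array, pas_multiplier):
    """Scale each 8-bit array entry into watts for its substate."""
    return [value * pas_multiplier for value in allocation_array]


def latency_value_index(latency_indicator, substate):
    """Return 0 or 1: which Transition Latency Value applies to a substate."""
    return (latency_indicator >> substate) & 1


def device_power_watts(per_function_watts):
    """Overall device power is the sum over its Functions."""
    return sum(per_function_watts)


# Invented sample: three substates with a Power Allocation Scale of 0.1.
watts = substate_power_watts([120, 25, 5], 0.1)
```

Here substate 0 would draw about 12 W, substate 1 about 2.5 W, and substate 2 about 0.5 W.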

Optimized Buffer Flush/Fill (OBFF) (2.1) 776 742

 Problem: bus-master capable devices are not
aware of system power states and may initiate
routine DMA or interrupt transactions at times
when the system would otherwise be able to go
to a lower power state.
 Solution: Allow RC to communicate system
power status to endpoints, which can then
recognize optimal time windows for initiating
traffic.

Coordinating Idle Time 777 743

 Without coordination of events, system is
rarely able to go to lowest power state

[Figure: timeline of System Events and Endpoint A, B, and C Events, leaving two System Idle Windows]

Improved Idle Windows 777 744

 OBFF informs devices about best times to stay idle


 Same work is done, but bigger Idle windows improve
power conservation
[Figure: timeline of System Events and Endpoint A, B, and C Events, leaving three System Idle Windows]

LTR could also be used to inform system software of acceptable latency for
the endpoints between accesses, suggesting a limit on this idle time.
OBFF offers a Hint 778 745

 The OBFF information is an optional hint for
improving system power savings.
 Devices can still initiate whenever they like but
overall power consumption will be negatively
affected if they do, so that should be avoided as
much as possible.
 Information is communicated in 1 of 2 ways:
 Toggling the WAKE# pin – this method is much
preferred because it avoids needlessly waking up
a Link and burning power to inform a device about
the system power state, or
 Sending Messages

OBFF Signaling Example 778 746

 WAKE# is preferred, but using a message as an
intermediate step may be necessary, as shown:

[Figure: hierarchy of a Root Complex, Switches, and Endpoints in which some connections use WAKE# and one uses an OBFF Message]

WAKE# Signaling 779 747

Transition Event            OBFF Message Code
Idle to OBFF                OBFF
Idle to CPU Active          CPU Active
OBFF or CPU Active to Idle  Idle
OBFF to CPU Active          CPU Active
CPU Active to OBFF          OBFF

 CPU Active – all transactions OK. This is a Function’s initial state.
 OBFF – transfers to and from memory OK
 Idle – wait for higher state before initiating

Notes:
- ECN points out that there is one negative edge for signaling OBFF, and 2 negative
edges for signaling CPU Active
- Min pulse width = 300ns, time between falling edges = 700ns min to 1000ns max
- If pattern is unrecognized, default is CPU Active
WAKE# Rules 780 748

 System is not required to enable an endpoint
to detect whether WAKE# was asserted by
another endpoint.
 Signaling can only be initiated by the RC
when the system is in an operational state
(S0 for an ACPI-compliant system)
 Functions must be in the D0 power state to
respond to it.
 Reserved codes received will be treated as
CPU Active

Possible WAKE# Confusion 780 749

 Since the WAKE# pin may also be used by
some Functions to signal a wakeup event, it’s
possible that other Functions might
misinterpret that as an OBFF change.
 This might cause undesirable power
management, but should be recoverable
 Spec recommends that endpoints go to the
CPU Active state whenever they detect
WAKE# activity as being initiated by the host,
but doesn’t specify how they would know.

OBFF Message Header 781 750

Message Rules 780 751

 Strongly recommended that s/w only use
OBFF messages if WAKE# is not available.
 Switches are strongly encouraged to
propagate all OBFF indications, but are
allowed to discard or collapse them.
 Downstream ports have two options, called
Variation A and B, if we want to send a
message but the Link is not in L0 state.
 A: Don’t change the Link state, drop the message
 B: Return the Link to L0 and forward the message

OBFF Support 782 752

Device Capabilities 2 Register

Bits   Field
31:24  RsvdP
23:22  Max End-End TLP Prefixes
21     End-End TLP Prefix Supported
20     Extended Fmt Field Supported
19:18  OBFF Support
17:14  RsvdP
13:12  TPH Completer Supported
11     LTR Mechanism Supported
10     No RO-enabled PR-PR Passing
9      128-bit CAS Completer Supported
8      64-bit AtomicOp Completer Supported
7      32-bit AtomicOp Completer Supported
6      AtomicOp Routing Supported
5      ARI Forwarding Supported
4      Completion Timeout Disable Supported
3:0    Completion Timeout Ranges Supported

OBFF Support
00 – Not supported
01 – Message only
10 – WAKE# only
11 – Both
OBFF Enable 783 753

Device Control 2 Register

Bits   Field
15     End-End TLP Prefix Blocking
14:13  OBFF Enable
12:11  RsvdP
10     LTR Mechanism Enable
9      IDO Completion Enable
8      IDO Request Enable
7      AtomicOp Egress Blocking
6      AtomicOp Requester Enable
5      ARI Forwarding Enable
4      Completion Timeout Disable
3:0    Completion Timeout Value

OBFF Enable
00 – Disabled
01 – Enabled with Message signaling Variation A
10 – Enabled with Message signaling Variation B
11 – Enabled using WAKE# signaling
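
The OBFF Support and OBFF Enable encodings can be decoded with two small lookups. The Python sketch below assumes OBFF Support sits in bits 19:18 of Device Capabilities 2 and OBFF Enable in bits 14:13 of Device Control 2, as the bit rulers on these slides indicate; the function names and the description strings are hypothetical.

```python
OBFF_SUPPORT = {
    0b00: "not supported",
    0b01: "message only",
    0b10: "WAKE# only",
    0b11: "message and WAKE#",
}

OBFF_ENABLE = {
    0b00: "disabled",
    0b01: "message signaling, Variation A",
    0b10: "message signaling, Variation B",
    0b11: "WAKE# signaling",
}


def decode_obff_support(device_capabilities_2):
    """Describe the OBFF Support field of a Device Capabilities 2 value."""
    return OBFF_SUPPORT[(device_capabilities_2 >> 18) & 0x3]


def decode_obff_enable(device_control_2):
    """Describe the OBFF Enable field of a Device Control 2 value."""
    return OBFF_ENABLE[(device_control_2 >> 13) & 0x3]
```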
Latency Tolerance Reporting (LTR) (2.1) 784 754

 Goal: improve system power management


 At present, software has to guess how much latency
is acceptable for devices. Consequently, PM for
platform resources is often too cautious or even
disabled to avoid performance problems.
 LTR informs software of known latency limits so PM
policies can take into consideration how much latency
the endpoints can tolerate.
 LTR result: devices get system performance when
they need it, and system can use lower power when
devices don’t need a fast response
 Method: Provide optional registers for Functions
to report service latency requirements on
memory reads and writes
LTR Registers 785 755

 LTR is discovered and enabled using new PCIe
capability register bits
 Software must not enable LTR in an endpoint unless
the RC and all intermediate switches also support it
 When enabling LTR in a hierarchy, those nearest the
RC must be enabled first, working down to the bottom
of the tree

Device Capabilities 2 Register: bit 11 = LTR Mechanism Supported
Device Control 2 Register: bit 10 = LTR Mechanism Enable
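
The enabling order described above amounts to a top-down walk of the hierarchy that stops at any component lacking LTR support. The Python sketch below is one way to compute that order; the topology dictionary, the component names, and the function name are hypothetical, and the actual register writes are left out.

```python
from collections import deque


def ltr_enable_order(children, root, supports_ltr):
    """Return components in the order LTR may be enabled, nearest the RC first."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if not supports_ltr.get(node, False):
            continue   # nothing below an unsupporting component may be enabled
        order.append(node)
        queue.extend(children.get(node, []))
    return order


# Hypothetical topology: Switch C does not support LTR.
children = {"RC": ["SwF"], "SwF": ["EP_D", "SwC"], "SwC": ["EP_A", "EP_B"]}
supports = {"RC": True, "SwF": True, "EP_D": True,
            "SwC": False, "EP_A": True, "EP_B": True}
print(ltr_enable_order(children, "RC", supports))   # ['RC', 'SwF', 'EP_D']
```

Endpoints A and B are skipped even though they support LTR, because an intermediate Switch on their path does not.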
LTR Message 788 756

LTR Example 789 757

LTR Example – Change But No Update 790 758

LTR Example – Change with Update 791 759

LTR Example – Link Down 791 760

ASPM Option (2.1): Link Capabilities Register 761
 ASPM Support (2-bits) is filled in for 3.0
 00b – No ASPM Support (was formerly Reserved)
 01b – L0s Supported
 10b – L1 Supported (was formerly Reserved)
 11b – Both L0s and L1 Supported
 L0s support no longer required
 New bit indicates this support, and it must be set to one for 3.0-compliant devices

Note: Some fields are not shown to simplify the diagram.

Interrupt Support

Interrupt Model: Two Methods 794 763

PCI Express supports two Interrupt mechanisms:


1. Legacy interrupt pin emulation
 Required for Endpoints
 In-band messages that emulate the four physical interrupt
pins (INTA-INTD) sent to the system interrupt controller
 Forwarding support required by Switches
2. Message Signaled Interrupts (MSI or MSI-X)
 MSI/MSI-X Interrupts transmitted as Memory Writes
 Legacy Endpoints must support either MSI or MSI-X with
32- or 64-bit addresses. Native Endpoints must support 64-
bit addresses
 Message Signaled Interrupts eXtensions (MSI-X) provide
larger number of vectors and associated delivery
addresses per function

PCIe Interrupt Delivery Options 796 764

Legacy Interrupt Delivery 797 765

In early machines, interrupts were delivered via pins
and required an Interrupt Acknowledge processor
bus cycle to get service

[Figure: numbered steps 1-5 along the path Device (INTA#) on the PCI Bus, Interrupt Controller (PIC) in the South Bridge, North Bridge, CPU (INTR, Interrupt Acknowledge, Interrupt Vector), and Memory (Interrupt Table of ISR starting addresses, Interrupt Service Routine)]
Memory Synchronization 797 766

A race condition can develop when data is sent and delivery is
signaled with an interrupt. If data gets stuck in an intermediate
posted-write buffer, CPU could fetch stale data.

[Figure: numbered steps 1-5 showing the Device (INTA#), the Bridge Write Buffer, the Interrupt Controller (PIC) in the South Bridge, the North Bridge, the CPU (INTR), and Memory (Memory Buffer, Interrupt Table of ISR starting addresses, Interrupt Service Routine)]
Memory Synchronization-2 797 767

To avoid this, CPU sends a “dummy read” to the device.
Ordering rules guarantee any previous write data will
be pushed out before read result is allowed to return.

[Figure: numbered steps 1-6 showing the Device (INTA#), the Bridge Write Buffer, the Interrupt Controller (PIC) in the South Bridge, the North Bridge, the CPU (INTR), and Memory (Memory Buffer, Interrupt Table of ISR starting addresses, Interrupt Service Routine)]
Legacy INTx Information 768

 In many legacy systems, interrupt requests are
routed from peripheral devices on buses like PCI to
an interrupt controller in the system chipset
 Interrupts may also originate from embedded legacy
hardware within the chipset itself, such as:
 DMA controller
 Timers
 ATA controller
 USB controller
 There are separate interrupt lines for each interrupt,
and each line is mapped to an interrupt request
number
 IRQ0, IRQ1, etc.
Determining Interrupt Usage 801 769

INTD#
Legacy Interrupt Routing 803 770

[Figure: Programmable Interrupt Router (Input 0# to Input 3#) feeding two ISA 8259A Interrupt Controllers. The Slave takes IRQ8 through IRQ15, with IRQ9 labeled (IRQ2); the Master takes IRQ0, IRQ1, and IRQ3 through IRQ7 and drives Interrupt to CPU.]

Moki Anji (moki@ synopsys.com)


Do Not Distribute MindShare.com © 2013
Command Register-Interrupt Disable 804 771

Status Register-Interrupt Status 805 772

Switch May Collapse Messages 806 773

INTx Message Header Format 807 774

Mapping and Collapsing INTx 809 775

Device Number of Virtual    INTx Message Type at      INTx Message Type at
Bridge Receiving INTx       Input to Virtual Bridge   Output of Virtual Bridge
Message
0, 4, 8, 12, etc.           INTA                      INTA
                            INTB                      INTB
                            INTC                      INTC
                            INTD                      INTD
1, 5, 9, 13, etc.           INTA                      INTB
                            INTB                      INTC
                            INTC                      INTD
                            INTD                      INTA
2, 6, 10, 14, etc.          INTA                      INTC
                            INTB                      INTD
                            INTC                      INTA
                            INTD                      INTB
3, 7, 11, 15, etc.          INTA                      INTD
                            INTB                      INTA
                            INTC                      INTB
                            INTD                      INTC
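The table follows a simple rotation: the output message is the input message advanced by the device number, modulo 4. The sketch below reproduces the table under that reading; the function name `map_intx` is a hypothetical helper, not something defined by the spec.

```python
INTX = ("INTA", "INTB", "INTC", "INTD")


def map_intx(device_number, intx_in):
    """Return the INTx message type at the output of the virtual bridge.

    device_number: device number of the virtual bridge that received the message.
    intx_in: one of "INTA", "INTB", "INTC", "INTD".
    """
    return INTX[(INTX.index(intx_in) + device_number) % 4]


# Rebuild the rows of the table for device numbers 0 to 3.
table = {dev: [map_intx(dev, name) for name in INTX] for dev in range(4)}
```

Device numbers 4, 8, 12 and so on behave like device 0 because only the device number modulo 4 matters.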
Mapping INTx 810 776

Collapsing INTx 811 777

Example Legacy System Interrupts 831 778

Message Signaled Interrupts 812 779

MSI advantages over pins:
 32 interrupt vectors available per Function
 Sharing of vectors is eliminated, simplifying
interrupt servicing
 Per-vector interrupt masking (optional)
 Memory synchronization is automatic;
interrupt handler doesn’t need to take steps
to guarantee this the way it does for pins
Message Signaled Interrupt (MSI) 780

 Message Signaled Interrupts are nothing more than memory
writes targeting a special address.
 Special address: FEEx_xxxxh
 This address identifies it as an interrupt as opposed to a
regular memory transaction.
 MSIs can be sourced from an IO APIC on behalf of a device,
or from the device directly.
 Each MSI must contain information about the interrupt
being delivered:
 Vector
 Destination ID
 Destination Mode (Physical or Logical)
 Redirectable or not
 Edge or Level
 Type of Interrupt (Fixed, NMI, SMI, ExtINT, etc.)

[Figure: two Core i7 CPUs (Core 0 to Core n, threads TH0 and TH1 each with an APIC) connected by QPI to an IOH that contains an IOAPIC. Below the IOH, the ICH contains an IOAPIC, interrupt routing logic, and Master and Slave 8259s (INTR) fed by PCIe A-D, PCI A-D, IDE, SERIRQ and internal sources (USB, SATA, SMBus, RTC, DMA, other). PCIe and PCI MSIs travel upstream and reach the CPUs as QPI INT MSG.]
MSI Carries Vector And Delivery Info 816 781

MSI Format when Interrupt Remapping is Disabled

Interrupt Message Address Field
Bits 31:20   FEEh (indicates MSI traffic targeting Core Local APICs)
Bits 19:12   Destination ID
Bits 11:4    Extended Destination ID
Bit 3        Redirection Hint
Bit 2        Destination Mode
Bits 1:0     00

Interrupt Message Data
Bits 31:16   0000h
Bit 15       Trigger Mode
Bit 14       Delivery Status
Bits 13:12   00
Bit 11       Destination Mode
Bits 10:8    Delivery Mode
Bits 7:0     Vector
MSI Address Encoding 816 782

MSI Format when Interrupt Remapping is Disabled

Bit(s) Description
63:32 If the device adapter implements a 64-bit MSI Address register, these address bits are typically
programmed as all zeros.
31:20 FEEh targets the Local APICs. FECh targets the IO APICs. (Targeting IO APICs with MSI
transactions is no longer common.)
19:12 Destination ID: If originating from an IO APIC, this field holds bits [63:56] of the Redirection Table
Entry.
11:4 Extended Destination ID: If originating from an IO APIC, this field holds bits [55:48] of the
Redirection Table Entry.
3 Redirection Hint bit: The message’s delivery mode is delivered as part of the write data and is not
present in the message address. If the message’s Delivery Mode is the Lowest-Priority Delivery
Mode, this Redirection Hint bit can be set to alert the Host Bridge. The message address also
contains a Destination Mode bit. There are three possible combinations of these two bits:
(hint, dest mode)
(0,x) – Interrupt delivered to APIC identified in bits 19:4 as interpreted by Destination Mode
(1,0) – Lowest-Priority Delivery Mode and Physical Destination Mode. All of the processors in the
cluster are considered for redirection of that interrupt.
(1,1) – Lowest-Priority Delivery Mode and Logical Destination Mode. The redirection is limited to only
those processors that are part of the logical group of processors specified in the Destination ID field.
2 Destination Mode: This bit only has meaning if the Redirection Hint bit is set to 1.
0 – Physical Destination Mode
1 – Logical Destination Mode
1:0 Typically set to 00b
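The address layout in the table above can be packed and unpacked with a few shifts. This is a minimal sketch for the remapping-disabled format only; the function names `encode_msi_address` and `decode_msi_address` are hypothetical helpers introduced here for illustration.

```python
def encode_msi_address(dest_id, ext_dest_id=0, redirection_hint=0, dest_mode=0):
    """Build the lower 32 bits of an MSI address that targets the Local APICs."""
    if not 0 <= dest_id <= 0xFF or not 0 <= ext_dest_id <= 0xFF:
        raise ValueError("destination fields are 8 bits each")
    return (
        (0xFEE << 20)                      # bits 31:20, FEEh
        | (dest_id << 12)                  # bits 19:12, Destination ID
        | (ext_dest_id << 4)               # bits 11:4, Extended Destination ID
        | ((redirection_hint & 1) << 3)    # bit 3, Redirection Hint
        | ((dest_mode & 1) << 2)           # bit 2, Destination Mode
    )                                      # bits 1:0 stay 00b


def decode_msi_address(addr):
    """Split a 32-bit MSI address into the fields listed in the table."""
    return {
        "targets_local_apics": ((addr >> 20) & 0xFFF) == 0xFEE,
        "destination_id": (addr >> 12) & 0xFF,
        "extended_destination_id": (addr >> 4) & 0xFF,
        "redirection_hint": (addr >> 3) & 1,
        "destination_mode": "logical" if (addr >> 2) & 1 else "physical",
    }


fixed_addr = encode_msi_address(0x01)
lowest_prio_addr = encode_msi_address(0x01, redirection_hint=1, dest_mode=1)
```

With Destination ID 01h and no hint the address is FEE01000h; setting the hint and logical mode only changes bits 3 and 2.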
MSI Data Encoding 817 783

MSI Format when Interrupt Remapping is Disabled

Bit(s) Description
31:16 Typically programmed to 0000h
15 Trigger Mode.
0 – Edge Triggered
1 – Level Triggered
14 Delivery Status. If this is an Edge Triggered interrupt as indicated by the Trigger Mode field, this bit
is set to 1. If this is a Level Triggered interrupt, this bit indicates the state of the interrupt input:
0 – Deasserted
1 – Asserted
13:12 Typically programmed to 00b
11 Destination Mode:
0 – Physical
1 – Logical

MSI Data Encoding 817 784

MSI Format when Interrupt Remapping is Disabled

Bit(s) Description
10:8 Delivery Mode: This is the same as the corresponding bits in the Redirection Table for that interrupt.
000b – Fixed. Delivers the interrupt to all of the Local APICs listed in the Destination field. The
Trigger Mode can be either edge-triggered or level-triggered.
001b – Lowest-Priority. Delivers the interrupt to the processor that is executing the lowest-priority
program of all the processors listed in the Destination field. The Trigger Mode can be either
edge-triggered or level-triggered.
010b – SMI. The Trigger Mode must be edge-triggered. The Vector is ignored but must be
programmed to all zeroes for future compatibility.
011b – Reserved.
100b – NMI. Delivers the interrupts to all of the Local APICs listed in the Destination field. The Vector
is ignored. Regardless of the Trigger Mode setting, NMI is an edge-triggered interrupt.
101b – INIT. Delivers the interrupt to all of the Local APICs listed in the Destination field. The Vector
is ignored. Regardless of the Trigger Mode setting, INIT is an edge-triggered interrupt.
110b – Reserved.
111b – ExtINT. The interrupt is delivered to the Local APIC specified in the message’s Destination
field. That processor then issues an Interrupt Acknowledge transaction to request the vector from the
8259A compatible interrupt controller. ExtINT is an edge-triggered interrupt.
7:0 Vector. Specifies which of the user-defined interrupt vectors is being triggered (i.e. 10h – FEh).
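The data encoding from the two tables above can be decoded the same way. A minimal sketch, assuming the remapping-disabled format; `decode_msi_data` and the `DELIVERY_MODES` table are hypothetical names used only for this example.

```python
DELIVERY_MODES = {
    0b000: "Fixed",
    0b001: "Lowest-Priority",
    0b010: "SMI",
    0b011: "Reserved",
    0b100: "NMI",
    0b101: "INIT",
    0b110: "Reserved",
    0b111: "ExtINT",
}


def decode_msi_data(data):
    """Split the MSI data value into the fields listed in the tables."""
    return {
        "trigger_mode": "level" if (data >> 15) & 1 else "edge",    # bit 15
        "delivery_status": (data >> 14) & 1,                        # bit 14
        "destination_mode": "logical" if (data >> 11) & 1 else "physical",
        "delivery_mode": DELIVERY_MODES[(data >> 8) & 0b111],       # bits 10:8
        "vector": data & 0xFF,                                      # bits 7:0
    }


example = decode_msi_data(0x4041)
```

For the value 4041h this yields an edge-triggered, Fixed delivery of vector 41h.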

MSI Capability Register Set 813 785

[Figure: Type 0 configuration header, doublewords 00 to 15 (Device ID / Vendor ID, Status / Command, Class Code / Revision ID, BIST / Header Type / Latency Timer / Cache Line Size, Base Address 0 to 5, CardBus CIS Pointer, Subsystem ID / Subsystem Vendor ID, Expansion ROM Base Address, Capabilities Pointer, Reserved, Max_Lat / Min_Gnt / Interrupt Pin / Interrupt Line), with the required configuration registers highlighted. The Capabilities Pointer in doubleword 13 is labeled "MSI Capability Register Pointer".]
MSI Message Control Register 814 786

MSI Memory Write Request 821 787

MSI-X Features 821 788

 Motivation for MSI-X is more vectors:
2K vectors per Function
 Unlike MSI, vectors can target different
processors by using different addresses
 Per-vector and per-Function masking
 Support is optional. If MSI-X is supported, MSI is not
required, though it can be helpful to build in
both MSI and MSI-X and let software choose
which one to enable.
MSI-X Capability Register 822 789

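Offset   Bits 31:16                 Bits 15:8            Bits 7:0
00h      Message Control Register   Pointer to Next ID   Capability ID = 11h
04h      MSI-X Table Offset (bits 31:3)                  Table BIR (bits 2:0)
08h      Pending Bit Array (PBA) Offset (bits 31:3)      PBA BIR (bits 2:0)

(BIR = BAR Indicator Register)

Message Control Register (bits 31:16 of the doubleword at 00h)
Bit 31       MSI-X Enable (R/W)
Bit 30       Function Mask (R/W)
Bits 29:27   Rsvd
Bits 26:16   Table Size in N-1 (RO)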
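The three doublewords of the capability can be decoded as follows. This is a sketch under the layout shown above; `decode_msix_capability` is a hypothetical helper and the sample register values are made up for illustration.

```python
def decode_msix_capability(dw0, dw1, dw2):
    """Decode the MSI-X capability doublewords at offsets 00h, 04h and 08h."""
    message_control = dw0 >> 16
    return {
        "capability_id": dw0 & 0xFF,
        "next_pointer": (dw0 >> 8) & 0xFF,
        # Table Size is stored as N-1 in the low 11 bits of Message Control.
        "table_size": (message_control & 0x7FF) + 1,
        "function_mask": bool(message_control & 0x4000),   # bit 30 of dw0
        "msix_enable": bool(message_control & 0x8000),     # bit 31 of dw0
        "table_bir": dw1 & 0x7,
        "table_offset": dw1 & 0xFFFFFFF8,
        "pba_bir": dw2 & 0x7,
        "pba_offset": dw2 & 0xFFFFFFF8,
    }


# Made-up example: 8 vectors, enabled, table and PBA both behind BAR 2.
cap = decode_msix_capability(0x80075011, 0x00002002, 0x00003002)
```

The low three bits of the offset doublewords select the BAR, so masking them off leaves an 8-byte-aligned offset into that BAR's memory region.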

Location of MSI-X Table 790

[Figure: Type 0 configuration header (doublewords 00 to 15, required configuration registers highlighted) shown next to the Function's memory. With Table BIR = 2, Base Address 2 (doubleword 06) locates the memory region, and the MSI-X Table starts at the MSI-X Table Offset within that region.]
MSI-X Table Structure 825 791

DW3              DW2            DW1             DW0
Vector Control   Message Data   Upper Address   Lower Address   Entry 0
Vector Control   Message Data   Upper Address   Lower Address   Entry 1
Vector Control   Message Data   Upper Address   Lower Address   Entry 2
….               ….             ….              ….
….               ….             ….              ….
Vector Control   Message Data   Upper Address   Lower Address   Entry N-1

Bit 0 of Vector Control is the vector Mask Bit (R/W)
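Each table entry is four doublewords (16 bytes), so entry n starts at byte offset 16 × n. The sketch below parses entries from a raw byte image of the table; `read_msix_entry` is a hypothetical helper, little-endian byte order is assumed, and the sample contents are made up.

```python
import struct

ENTRY_SIZE = 16  # four doublewords per entry


def read_msix_entry(table_bytes, index):
    """Parse entry `index` from a little-endian byte image of the MSI-X Table."""
    lower, upper, data, vector_control = struct.unpack_from(
        "<IIII", table_bytes, index * ENTRY_SIZE
    )
    return {
        "address": (upper << 32) | lower,      # DW1:DW0
        "data": data,                          # DW2
        "masked": bool(vector_control & 1),    # DW3 bit 0
    }


# Made-up table image with two entries; entry 0 is masked.
table_image = struct.pack("<IIII", 0xFEE01000, 0, 0x4041, 1) + struct.pack(
    "<IIII", 0xFEE02000, 0, 0x4042, 0
)
entry1 = read_msix_entry(table_image, 1)
```

Because every entry carries its own address, different vectors can target different processors, which is the difference from MSI noted earlier.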

MSI-X Pending Bit Array (PBA) Structure 826 792

DW1 DW0
Pending Bits 0 - 63       QW 0
Pending Bits 64 - 127     QW 1
Pending Bits 128 - 191    QW 2
….
….
Pending Bits              QW (N-1)/64
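Locating the pending bit for a given vector is integer division and remainder by 64. A minimal sketch; `pba_location` and `is_pending` are hypothetical helper names, and the PBA is modeled as a list of 64-bit integers.

```python
def pba_location(vector):
    """Return (quadword index, bit position) of a vector's pending bit."""
    return vector // 64, vector % 64


def is_pending(pba_qwords, vector):
    """Check the pending bit for `vector` in a list of 64-bit PBA quadwords."""
    qword, bit = pba_location(vector)
    return bool((pba_qwords[qword] >> bit) & 1)


# Made-up PBA contents: only vector 130 (QW 2, bit 2) is pending.
pba = [0, 0, 1 << 2]
```

For a table of N entries the last quadword index is (N-1) // 64, matching the final row of the layout above.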

MSI Additional Topics 826 793

 Message Signaled Interrupts look just like
memory writes and are indistinguishable from
them in terms of flow control, ordering, data
integrity, etc.
 Memory synchronization is automatic if MSI is
sent by device; no need for handler to take
steps to guarantee it
 Interrupt Latency is no worse than the legacy
method or the APIC bus
MindShare Lab: Interrupt Investigation 794
Open File: interrupt_lab.arbsys

 Interrupts can be signaled using INTx, MSI and MSI-X
mechanisms. The following exercise reviews these
implementations.
1. What methods of interrupt generation are supported in BDF 0:27:0, and
which mechanism is being used?

2. What configuration register and field verify the selection in item 1 above?
When an interrupt is generated, what interrupt will be signaled by the
device?

3. What methods of interrupt generation are supported by the Root Port at
location 0:28:6, and which is being used?

4. For 0:28:6, how many interrupt vectors are requested and how many are
enabled?

5. For 0:28:6, what are the specific address and data values allowed in the
interrupts (memory writes) that can be signaled by the Root Port? Bonus
question: On this x86-based system, what are the interrupt vectors of these
interrupts (assuming interrupt remapping is not enabled in the system)?
MindShare Lab: Interrupt Investigation 795
Open File: interrupt_lab.arbsys

6. What methods of interrupt generation are supported by the device attached
to Root Port 0:28:2, and which mechanism is enabled?

7. For the device attached to Root Port 0:28:2, what are the specific address
and data values allowed in the interrupts (memory writes) that can be
signaled by the device? Bonus question: On this x86-based system, what
are the interrupt vectors of these interrupts (assuming interrupt remapping is
not enabled in the system)?

8. Are any of the interrupts of BDF 9:0:0 masked from being generated? If so,
have any of those masked events occurred at the BDF? Which one(s)?

9. Are any of the interrupts of BDF 8:0:0 masked from being generated? If so,
have any of those masked events occurred at the BDF?
System Resets

Two Classes of Resets 833 797

1. Conventional Reset (required): Resets
carried forward from earlier spec versions –
cold, warm, and hot reset. From a device
perspective, all of these completely reset the
device with the possible exception of sticky
bits.
2. Function Level Reset (optional, but strongly
recommended): A reset initiated by software
to reset just one function within a device.
Conventional Resets 834 798

Three types of conventional reset:
1. Cold Reset: Reset following application of power;
called a “Fundamental Reset” because it resets
everything in the device
2. Warm Reset (optional): A Fundamental Reset
without cycling power
 Means for generating warm reset not specified
 Relationship between PCI Express reset and component or
platform reset is design specific
 Central resource may assert PERST# sideband signal to
devices; assertion causes Fundamental Reset to devices
that use this input.
3. Hot Reset: A reset sent across a Link using TS1
Ordered Sets. Controlled by software.
PERST# 836 799

[Figure: platform reset distribution. Labels: Processor, FSB, Root Complex, GFX, DDR SDRAM, PCI Express links, I/O Controller Hub (ICH) with POWERGOOD input, PERST# to the PCI Express add-in slots and Switch, RST# to PCI (IEEE 1394), and a PCIe-to-PCI-X bridge driving RST# to PCI-X (SCSI).]
Hot Reset 837 800

• Tx sends TS1s for 2ms with Hot Reset bit set
• Reception of at least 2 consecutive TS1s or TS2s
with this bit set is detected as a Hot Reset
• Hot Reset propagates downstream
Software Generation of Hot Reset 840 801

 Software commands a downstream port
to generate a Hot Reset by setting and
clearing the Secondary Bus Reset bit in
the Bridge Control register

[Figure: Type 1 (bridge) configuration header, doublewords 00 to 15, required configuration registers highlighted. The Bridge Control register in doubleword 15 is expanded: bits 15:12 Reserved, bit 11 Discard Timer SERR# Enable, bit 10 Discard Timer Status, bit 9 Secondary Discard Timeout, bit 8 Primary Discard Timeout (bits 11:8 marked 2.2), bit 7 Fast Back-to-Back Enable, bit 6 Secondary Bus Reset, bit 5 Master Abort Mode, bit 3 VGA Enable, bit 2 ISA Enable, bit 1 SERR# Enable, bit 0 Parity Error Response.]
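The set-then-clear sequence can be sketched against a fake configuration space. Everything named here is illustrative: `FakeConfigSpace` and `hot_reset` are hypothetical, the 2 ms hold time is an assumption for the sketch rather than a value taken from this slide, and a real implementation would go through the platform's configuration access mechanism. The Bridge Control register sits at offset 3Eh of the Type 1 header, and Secondary Bus Reset is bit 6.

```python
import time

BRIDGE_CONTROL_OFFSET = 0x3E      # upper half of doubleword 15 in a Type 1 header
SECONDARY_BUS_RESET = 1 << 6      # bit 6 of Bridge Control


class FakeConfigSpace:
    """Stand-in for a downstream port's configuration space (hypothetical)."""

    def __init__(self):
        self.regs = bytearray(256)
        self.log = []

    def read16(self, offset):
        return int.from_bytes(self.regs[offset:offset + 2], "little")

    def write16(self, offset, value):
        self.regs[offset:offset + 2] = value.to_bytes(2, "little")
        self.log.append((offset, value))


def hot_reset(cfg, hold_s=0.002, sleep=time.sleep):
    """Set Secondary Bus Reset, hold it (assumed 2 ms), then clear it."""
    control = cfg.read16(BRIDGE_CONTROL_OFFSET)
    cfg.write16(BRIDGE_CONTROL_OFFSET, control | SECONDARY_BUS_RESET)
    sleep(hold_s)
    cfg.write16(BRIDGE_CONTROL_OFFSET, control & ~SECONDARY_BUS_RESET)


cfg = FakeConfigSpace()
cfg.write16(BRIDGE_CONTROL_OFFSET, 0x0003)   # made-up starting value
cfg.log.clear()
hot_reset(cfg, sleep=lambda seconds: None)   # skip the real delay in this demo
```

The read-modify-write keeps the other Bridge Control bits unchanged while the reset bit is toggled.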
Hot Reset from Switch Port 838 802

[Figure: example topology with two Processors on the FSB, Root Complex (PCI Express GFX, DDR SDRAM), Switches A, B and C, 10 Gb Ethernet, Gb Ethernet, SCSI, Add-In card, PCI Express-to-PCI bridge with PCI Slots, S I/O (COM1, COM2) and IEEE 1394. Step 1 marks Secondary Bus Reset being set at a Switch port.]
Reset from Switch Upstream Port 839 803

The higher in the topology the reset takes place, the more devices will be reset.

[Figure: example topology with two Processors on the FSB, Root Complex (PCI Express GFX, DDR SDRAM), Switches A, B and C, 10 Gb Ethernet, Gb Ethernet, SCSI, Add-In card, PCI Express-to-PCI bridge with PCI Slots (RST#), S I/O (COM1, COM2) and IEEE 1394. Step 1 marks Secondary Bus Reset being set above a Switch upstream port.]
Link Disable Via Link Control Register 841 804

Link Disable 842 805

Function-Level Reset 842 806

FLR allows software to reset just one function within a
device without affecting the shared PCIe Link.
Motivations:
 Remove all data from a previous application before
allowing another to use the hardware (virtualization
support)
 Guarantee that all external transactions are stopped
in cases where the software is not working correctly
 Return hardware to un-initialized state before
rebuilding the software stack
Device Capability Register 843 807

Device Control Register 843 808

Completions and FLR 844 809

 If function has outstanding completions
(Transactions Pending = 1), then software
must wait long enough for them to complete
or be sure they never will. Recommended:
 If completion timeouts enabled, wait for the
timeout
 If not, wait at least 100ms
 Transactions Pending bit must be cleared at
the end of FLR
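The recommended wait can be expressed as a small decision function. This is only a sketch of the rule on this slide; `flr_wait_seconds` and its parameters are hypothetical names, and the completion timeout value is whatever the Function has been configured with.

```python
def flr_wait_seconds(transactions_pending, completion_timeout_enabled,
                     completion_timeout_s):
    """How long software should wait for outstanding completions around FLR.

    transactions_pending: value of the Transactions Pending status bit.
    completion_timeout_enabled: whether completion timeouts are enabled.
    completion_timeout_s: the configured completion timeout, in seconds.
    """
    if not transactions_pending:
        return 0.0                      # nothing outstanding, no wait needed
    if completion_timeout_enabled:
        return completion_timeout_s     # wait for the timeout to expire
    return 0.100                        # otherwise wait at least 100 ms


wait_with_timeout = flr_wait_seconds(True, True, 0.050)
wait_without_timeout = flr_wait_seconds(True, False, 0.050)
```

A caller would sleep for the returned time, which is a lower bound in the no-timeout case since the slide says "at least" 100 ms.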

FLR Effects 845 810

 Link state is not affected by FLR, since other
functions within the device may still be using
the Link
 Most function registers and state machines
must be initialized, but there are several
exceptions: sticky bits, HWinit bits, Link
control and status bits
FLR (Function-Level Reset) 845 811

[Figure: Multi-Function Device containing Function 5 and Function 0. FLR resets Function 5 and clears its transactions; traffic from other functions is unaffected.]
Reset Exit 846 812

 After exiting reset, Link training and
initialization must begin within 20ms
 System software must wait at least 100ms
from the end of a reset before issuing
Configuration Requests
 For 8.0 GT/s capable devices this spec is changed
to 100ms after Link Training completes instead of
100ms after reset
 Functions are allowed 1 second after reset
before they must be ready for configuration
access. Root Complex or software cannot
conclude that a non-responsive device is
broken until after that time
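The timing rules above reduce to two small checks. A minimal sketch using only the numbers on this slide; the function names and the millisecond timestamps are hypothetical, introduced for illustration.

```python
CONFIG_WAIT_MS = 100       # minimum wait before issuing Configuration Requests
READY_DEADLINE_MS = 1000   # Functions get 1 second before they must respond


def earliest_config_request_ms(reset_exit_ms, link_trained_ms, supports_8gt):
    """Earliest time software may issue a Configuration Request.

    For 8.0 GT/s capable devices the 100 ms is counted from the end of
    Link Training instead of from the end of reset.
    """
    start = link_trained_ms if supports_8gt else reset_exit_ms
    return start + CONFIG_WAIT_MS


def may_conclude_broken(now_ms, reset_exit_ms):
    """A non-responsive device may only be declared broken after 1 second."""
    return now_ms - reset_exit_ms > READY_DEADLINE_MS


gen2_ready = earliest_config_request_ms(0, 30, supports_8gt=False)
gen3_ready = earliest_config_request_ms(0, 30, supports_8gt=True)
```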
Hot Plug

PCI Hot-Plug Elements 850 814

PCIe Hot-Plug Elements 851 815

Software Elements 852 816

 User Interface via OS
 Hot-Plug services via OS
 Standard Hot-plug System Driver from OS or
system board vendor
 Hot-Plug Device Driver from device vendor
Hardware Elements 853 817

Element                                          Purpose
Hot-Plug Controller per HP slot                  Interface between software and Hot-Plug control
Card slot power switching via Power Controller   Control of slot power
Card reset logic                                 PERST# control
Power and Attention Indicators                   Show the power and attention states of the slot
Manually-operated Retention Latch (MRL)          Hold add-in cards in place
MRL Sensor                                       Allow the port and system SW to detect the MRL being opened
Electromechanical Interlock                      Prevent removal of add-in cards while slot is powered
Attention Button                                 Allow user to request hot plug operations
PRSNT1# and PRSNT2#                              Short pins that indicate whether card is physically present in the slot
Switch Hot-Plug Control Functions 864 818

Power Controller 854 819

 Power Controller switches power for a slot
and monitors power fault conditions
 If MRL implemented:
 Whenever sensor reports latch open, power to
slots must be turned off
 When sensor reports latch closed, power to slots
must be restored
Hot-Plug Indicators 859 820

 Two indicators defined: Power & Attention
 Implemented on the chassis only
 Spec v1.0 allowed indicators on the module or card,
but spec v1.1 removed that support
 Controlled by hot-plug software using
command register bits
 Three states for each indicator
 On, Off, or Blinking (1 to 2 Hz, 50% duty cycle)
Indicators & Button Location 677 821

Indicators 859 822

 Must be in close proximity to their associated
slot
 Hardware doesn’t change their state except for
a stuck-on power fault, in which case hardware
forces power indicator “on”
 Card should not be removed while power light is on
 Power is green, attention is yellow
 Blinking power light also provides visual feedback
to operator when Attention Button is pressed
Yellow Attention Indicator 859 823

 Off: Normal Operation
 On: Hot Plug Operation Failure
 Blinking: Slot being identified at operator’s
request
Green Power Indicator 860 824

 Off: Power Off
 On: Power On
 Blinking: Power Transition. Card removal or
insertion not allowed
Manual Retention Latch (MRL) 861 825

 Rigidly holds add-in card in the slot
 For example, to allow cables to be connected or
disconnected without causing intermittent
contacts.
 Another reason might be to discourage an
operator from removing the card unexpectedly
 Sensor detects status and, if open, main
power, Vaux, and SMBus must be turned off
 If no MRL Sensor, presence detect pins may
be used to switch the signals
Attention Button 862 826

 Momentary-contact push button near the slot,
pressed by the user prior to hot insertion or hot
removal at that slot.
 Power indicator blinks to provide visual feedback
 5 second abort interval after power indicator starts blinking,
during which a second button press will cancel the request
 If operation succeeds, light goes out and operator continues
 If operation fails, power light remains on and software may
present a console message and add a message to a system
log. Operator should not make changes to the slot.
Event Behavior 870 827

Hot Plug Event Register Bit Set Results when detected


When Detected
Presence Presence Detect
Detect Change Event Status

Attention Attention Button


Button Pressed Event System Attention
Pressed Requested
(MSI, SMI, SCI, or PME)
MRL Sensor MRL Sensor
Changed Change Detected
Event

Power Fault Power Fault


Detected Event

Hot-Plug Related Registers 865 828

Link Capability and Status Register 829

[Figure annotations on the Link Capability and Status registers:]
Indicates FC Init has completed; one indication that a hot-add must have taken place.
Required for downstream port if slot is hot-plug capable. Not valid for upstream port.
Slot Capability Register 866 830

(HwInit)

Slot Control Register 868 831

[Figure annotation on the Slot Control register:]
Enables an interrupt when Link Active status bit changes
Slot Status Register 870 832

Device Capability 873 833

Card Removal Procedure 856 834

 Initial state
 Attention Indicator (Yellow): Off
 Power Indicator (Green): On
 Procedure
 Operator presses Attention Button or indicates via
software GUI the physical slot number of interest
 Software causes Power Indicator to blink
 Hot plug software validates request via status register
 Device driver commanded to quiesce the card
 Software commands port to disable link
 Software commands Hot Plug Controller to turn slot off
 Power Indicator commanded to turn off
 Operator releases MRL
 OS de-allocates resources
Card Insertion Procedure 857 835

 Initial state:
 Attention Indicator (Yellow): Off
 Power Indicator (Green): Off
 Procedure
 Operator installs card and secures MRL
 Operator presses Attention Button or uses software
GUI to indicate hot-insertion. Hot-plug services notified
via interrupt
 Hot plug software validates request via status register
 Software commands Power Indicator to blink
 Software commands Hot Plug Controller to turn slot on
 Power Indicator commanded to turn on
 OS initializes card and allocates resources to card
 OS calls driver to complete card initialization
Power Budgeting 836
Power Budgeting Capability 876 837

 Motivation: facilitate hot-plug by allowing
system to verify power for new devices
 Optional capability (though some form factors
using hot plug may require it)
 Devices must remain under power limit
specified for their form factor until configured
 System guarantees power has been properly
budgeted before enabling a card to use its full
power
Elements of Power Budgeting 877 838

 System firmware
 Used during boot time, it contains the system
power budget and consumption of devices that are
known to be present.
 Power budget manager
 Used during run time
 Expansion ports
 Ports to which cards are attached
 Add-in devices
 Those that are power budget capable

Elements of Power Budgeting 880 839

Slot Power Limit Sequence 882 840

Power Budget Capability 884 841

Power Budget Data Register 885 842

Overview of Spec 2.1 Changes 843
Overview of Spec 2.1 Changes 844

 13 features were introduced with the 2.1
spec revision, which can be loosely grouped
into these categories:
1. Communication
 Multicasting
2. Performance
 TPH, IDO, ARI, Extended Tag Default
3. Power Management
 DPA, LTR, OBFF, ASPM Option
4. Software model
 Atomic Ops
5. Configuration
 Internal Error Reporting, Resizable BARs, Simplified
Ordering Table
1: Communication: Multicasting 845

 Sending posted memory writes to multiple
destinations at once can improve
performance for certain applications, such as:
 Sending data to multiple storage devices at the
same time (RAID, Mirroring)
 Multi-headed graphics

[Figure: Root Complex with GFX and SDRAM above a Switch; NIC and SCSI Endpoints attach to the Root Complex and to the Switch.]
2: Performance: TLP Processing Hints (TPH) 846

 System performance can be improved if hardware
knows where cacheable data will be needed again
soon.
 This is most interesting in complex systems that have multiple
hops between components and distributed processing
 Adding hints to TLPs helps take advantage of the
system cache hierarchy to:
1. Facilitate data residency
 Data such as device control information or certain payload data
that will be used again soon could be loaded into a cache when
device writes to main memory
2. Minimize access latencies
 Avoid longer access paths by using caches
3. Reduce bandwidth and power in affected buses
 Reduce congestion and improve efficiency to memory
 Prefix can also extend APIC ID for MSI-X
(FEEx_xxxx only allows 16 bits for ID)
2: Performance: Example Without TPH 847

1. Device writes to memory
2. RC snoops CPU cache, possibly resulting in
a write-back cycle (2a)
3. Next, data is written into memory

[Figure: steps 2 and 2a marked between the Root Complex and the CPU cache.]
2: Performance: Example Without TPH 848

4. Device notifies host of data delivery
5. CPU reads from memory
(3 memory accesses were needed)

[Figure: steps 4 and 5 marked on the path between the device, the CPU and memory.]
2: Performance: Example With TPH 849

1. Device sends DMA write toward memory, TPH hints indicate
data should be cached near the CPU, possibly even biasing
the LRU information to give it preferred retention in the cache
2. Processor cache is snooped
3. Device notifies host
4. CPU reads from cache, avoids memory access

[Figure: steps 1 to 4 marked on the path between the device, the Cache and the CPU.]
2: Performance: ID-Based Ordering 850

[Figure: a Posted Write stalls because the Write Buffer is full, and a Memory Read queued behind it is blocked.]

Transaction ordering rules mean a posted write that stalls
will block egress of any subsequent transactions.
If requests come from other endpoints, though, the
likelihood of a dependency between them is very low.
2: Performance: ID-Based Ordering 851

 ID-Based Ordering is an optional capability,
and uses a previously reserved attribute bit in
the header to indicate when it’s being used

Request Header for 64-bit Memory Address
DW 0: Fmt (0x1) | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
DW 1: Requester ID | Tag | Last DW BE | First DW BE
DW 2: Address [63:32]
DW 3: Address [31:0] | R
2: Performance: ID-Based Ordering 852

 Performance is most affected when multiple
Requests share a common Link and the Link
becomes congested
 Examples:
 Multi-function device, since all the Functions share one Link
 Switch, since streams from different endpoints get mixed
upstream
 Note that IDO and Relaxed Ordering overlap but
aren’t related
 IDO allows out of order packets from different streams
 RO allows out of order packets in the same stream
 If both attributes are set, the result is the logical OR
 If neither is set, result is PCI strongly-ordered model
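The way the two attributes combine can be written as a small predicate. This is a deliberately simplified sketch of the bullets above, not the full ordering table: it only asks whether a later packet may pass an earlier posted write, and `may_pass_posted_write` with its parameters is a hypothetical name introduced for illustration.

```python
def may_pass_posted_write(relaxed_ordering, id_based_ordering, same_requester):
    """Simplified model: may a later packet pass an earlier posted write?

    relaxed_ordering: RO attribute bit of the later packet.
    id_based_ordering: IDO attribute bit of the later packet.
    same_requester: True if both packets belong to the same stream.
    """
    ro_permits = relaxed_ordering                          # reordering within a stream
    ido_permits = id_based_ordering and not same_requester  # only across streams
    return ro_permits or ido_permits                       # both set: logical OR


strongly_ordered = may_pass_posted_write(False, False, same_requester=False)
ido_other_stream = may_pass_posted_write(False, True, same_requester=False)
ido_same_stream = may_pass_posted_write(False, True, same_requester=True)
```

With neither bit set the predicate never permits passing, which corresponds to the PCI strongly-ordered model.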
2: Performance: Alternative Routing ID (ARI) 853

 Motivation: New implementations can benefit
from having a larger number of Functions
available to them, especially virtualized devices.
 Device number is less interesting in endpoints
and is usually just zero.
 Method: For endpoints that support it, device
and Function number are merged into a single
8-bit Function number.
 To avoid confusion about device numbers and
configuration, downstream ports must be
enabled to recognize ARI before software can
use it.
Moki Anji (moki@ synopsys.com)
Do Not Distribute MindShare.com © 2013
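The merged Function number can be illustrated with a small decoder. This is a minimal sketch, assuming the usual 16-bit Routing ID layout (bus in bits 15:8, device in bits 7:3, Function in bits 2:0); the function name `decode_routing_id` and the `ari` flag are hypothetical names chosen for illustration.

```python
def decode_routing_id(routing_id: int, ari: bool = False) -> dict:
    """Split a 16-bit Routing ID into its fields (illustrative helper)."""
    if not 0 <= routing_id <= 0xFFFF:
        raise ValueError("Routing ID must fit in 16 bits")
    bus = (routing_id >> 8) & 0xFF
    if ari:
        # With ARI, device and Function merge into one 8-bit Function number.
        return {"bus": bus, "function": routing_id & 0xFF}
    # Without ARI: 5-bit device number and 3-bit Function number.
    return {
        "bus": bus,
        "device": (routing_id >> 3) & 0x1F,
        "function": routing_id & 0x7,
    }
```

The same 16 bits are interpreted two ways, which is why the downstream port has to be told that ARI is in use before software relies on the 8-bit Function number.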
2: Performance: Extended Tag Enable Default 854

 In previous versions of the spec, the default size
of the Tag field was 5 bits, allowing 32 split
transactions to be in progress at the same time
per Function.
 If a Function supported it, software could change
the tag size to 8 bits by enabling the Extended
Tag Field.
 This change simply allows the default value of
the Extended Tag Field Enable bit to be
implementation specific, so the default tag size
can be either 5 or 8 bits
3: Power Management 855

 All of the 2.1 PM changes (DPA, OBFF, LTR,
ASPM Option) were covered earlier in the
chapter on Power Management
4: Software Model: AtomicOps 856

 Many processors have single commands for atomic
operations, and that has been supported in PCIe by
the legacy bus-locking protocol.
 Three new optional commands allow those semantics to
be carried in a single PCIe request now.
 Software must verify support before attempting to use them.
 AtomicOps ease porting of host-based apps to
computational accelerators by using PCIe as the
interconnect but leaving the algorithms unchanged.
They provide non-blocking synchronization for:
 multiple producers or consumers using semaphores in
memory
 counters that are atomically incremented by hardware and
atomically read and cleared by software
4: Software Model: AtomicOps 857

 Optional AtomicOps perform multiple operations
internally
 Guaranteed atomic within the device (execute internally
without interruption)
 Path between devices does not need to be locked
 Non-posted; a completion is expected
 Ordering with respect to other TLPs is the same as any
non-posted write. If ordering is important, s/w must enforce it.
 New instructions:
 Fetch and Add – fetch from address, add value and replace,
return original value
 Unconditional Swap – fetch from address, replace with swap
value, return original value
 Compare and Swap – fetch from address, compare with
“compare” value, if equal, replace original with swap value,
return original
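The three operations can be modeled on a plain dictionary standing in for completer memory. This is only a sketch of the read-modify-write semantics listed above, not of the TLP formats; the function names and the 32-bit default operand width are assumptions made for illustration.

```python
def fetch_add(mem: dict, addr: int, add_value: int, width_bits: int = 32) -> int:
    """Fetch and Add: add to the target, return the original value."""
    mask = (1 << width_bits) - 1
    original = mem[addr]
    mem[addr] = (original + add_value) & mask  # wraps at the operand width
    return original


def swap(mem: dict, addr: int, swap_value: int) -> int:
    """Unconditional Swap: replace the target, return the original value."""
    original = mem[addr]
    mem[addr] = swap_value
    return original


def compare_and_swap(mem: dict, addr: int, compare_value: int, swap_value: int) -> int:
    """Compare and Swap: replace only if the target equals compare_value."""
    original = mem[addr]
    if original == compare_value:
        mem[addr] = swap_value
    return original
```

In every case the requester gets the original value back in the completion, which is how it can tell whether a Compare and Swap actually took effect.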
5: Configuration: Internal Error Reporting 858

1. Make internal logic errors visible to software in
an industry-standard way.
 In high-end systems it’s important to be able to
detect and contain errors
 Endpoints have device drivers that can obtain internal
information, but switches usually don’t and are controlled by
the OS instead.
 As a result, switch vendors have developed proprietary and
incompatible error-reporting methods.
2. Allow multiple error headers to be recorded
 Current AER model only saves info on the first
uncorrectable error
3. Detect the occurrence of multiple errors of the
same type
5: Configuration: Resizable BARs 859

 A problem arises when requested memory
resources are greater than system addressable
space. Possible results:
 Available system memory is reduced
 Function memory is not allocated
 Function memory is allocated with a sub-optimal size
 Solution: Functions report several possible
usable memory sizes, software selects one
based on system constraints
 Expected that devices requesting large memory
resources will be most likely to use this

5: Configuration: Simplified Ordering Table 860

 This wasn’t an ECR to the 2.0 spec, but the
ordering table was already affected in 2.1 by the
addition of IDO
5: Configuration: Simplified Ordering Table 861

 New version reduces the entries and simplifies them by not mentioning
specific requests. Both make it easier for new devices to be compliant.

Row pass Column?          Posted Request    Read Request   NPR with data   Completion
                          (Col 2)           (Col 3)        (Col 4)         (Col 5)
Posted Request (Row A)    a) No, b) Y/N     Yes            Yes             a) Y/N, b) Yes
Read Request (Row B)      a) No, b) Y/N     Y/N            Y/N             Y/N
NPR with data (Row C)     a) No, b) Y/N     Y/N            Y/N             Y/N
Completion (Row D)        a) No, b) Y/N     Yes            Yes             a) Y/N, b) No

Cols 3-4 and Rows B-C are the Non-Posted Requests.
NPR with data = Non-Posted Request with data, such as a configuration write or I/O write
862

Overview of Spec 3.1 Changes

Summary of Spec 3.1 Changes 863

1. Downstream Port Containment (DPC), later
updated with Enhanced DPC (eDPC)
2. L1 Substates
3. Lightweight Notification
4. 8.0 GT/s Receiver Impedance
5. Process Address Space ID (PASID), later
updated with Address Translation
6. End-End TLP Prefix Changes for RCs
7. Precision Time Measurement (PTM)
8. Protocol Multiplexing (PMUX): [2.x ECN,
Appendix G in 3.0 spec]
1. Downstream Port Containment (DPC) 864

 Main feature: Automatically disables the Link below a
Downstream Port when triggered by an uncorrectable
error.
 Prevents possible spread of data corruption; subsequent
TLPs are blocked in both directions for that Port
 System is notified of the event and error recovery is possible
if supported by software
 New event, “Async Removal”, occurs when a device
goes offline without notification to the OS. This is
going to happen when DPC is triggered.
 DPC support is optional.

2. L1 Substates 865

 Reducing active Link power was accomplished with
LTR and OBFF
 Reducing standby power is now accomplished by
adding optional low-power substates for L1.
 Example: laptop battery can be drained even with Link in L1
because Electrical Idle detector and common-mode voltage
driver continue to draw power (up to 25mW/Lane).
 Substates apply to ASPM L1 and software PM L1
 Optional sideband CLKREQ# signal becomes
bidirectional and is used to turn off the RefClk and
indicate transition to substates.
 Currently only PCIe Mini Card form factor implements this
pin. Form factors that don’t have CLKREQ# can use an in-
band version of it that will be defined in a future ECN.
2. L1 Substates State Diagram 866

[Figure: state diagram with L1.0 and its substates L1.1 (“Snooze”) and L1.2 (“Off”)]

 L1.0: both Ports must detect electrical idle exit (EIE), and maintain
Common Mode Voltage (CMV) [Power/Lane = 20 mW]
CLKREQ# deassertion signals entry to a lower state. Next state will be
L1.2 if software enabled, or if ASPM enabled and conditions are met.
Otherwise, next state will be L1.1 if software or ASPM enabled.
Assertion of CLKREQ# exits back to L1.0.
Neither substate detects EIE, but:
 L1.1: CMV maintained [Power/Lane = 500 µW]
 L1.2: CMV not maintained [Power/Lane = 10 µW]
3: Lightweight Notification (LN) 867

 Improve performance by using local caches
in Endpoints while avoiding coherency
overhead
 Reduces traffic for Endpoints and host memory
 Improves latency
 “Lightweight” means host indicates cache
lines have changed but no details, avoiding
synchronization and flow-control issues.
 Dynamic device association
 Virtual Machine (VM) guest drivers accessing
device strictly via host memory instead of PIO
make guest migration easier
3: Lightweight Notification (LN) 868

 Optional protocol for Endpoints to register
interest in cache lines in host memory and
get notified if those lines are changed.
 LN Reads, LN Completions, and LN Writes
are defined
 Cache Line Sizes – 64 & 128 bytes supported
 Endpoints use a new capability register block;
host has new field in the Device Capabilities
2 register to act as LN Completer.

4: 8 GT/s Receiver Impedance 869

 Problem: Deadlock cases can arise when an
8.0 GT/s receiver is not detected in LTSSM
Detect state because impedance is not in
proper range (40-60 ohms) at that rate.
 Two cases are modified in which detecting
EIE results in LTSSM state transition:
 Exiting from Rx_L0s.Idle to Rx_L0s.FTS
 Exiting from L1.Idle to Recovery
 Solution: If impedance at 8.0 GT/s or higher
doesn’t match the value for 2.5 GT/s, a
timeout of 100ms can be used to cause this
transition.
5: Process Address Space ID (PASID) 870

 Goal: Adding a Process ID Prefix gives extra
attributes for a Memory Request
 Method: New prefix adds a 20-bit value to indicate
the address space of an Untranslated Address
 PASIDs enable
 Sharing an Endpoint across multiple processes
while maintaining a separate 64-bit address space
for each one.
 Hierarchical management of address spaces.
 Without PASID, Untranslated Addresses are seen as Guest
Physical and are translated to System Physical by the
Hypervisor.
 With PASID, Untranslated Addresses are seen as Guest Virtual
and are translated to Guest Physical by the Guest OS.
5: PASID Translation 871

 The base PASID ECN described the use of
PASID Prefix with Memory Requests that had
untranslated addresses
 PASID Translation allows its use for other
Requests:
 Address Translation Requests
 Page Request Messages
 ATS Invalidation Requests
 PRG Response Messages

6: End-End TLP Prefix Changes 872

 Allows different Root Ports to report different values
of Max End-End TLP Prefixes, and clarifies handling
of TLPs with more prefixes than the Port can support
(only applies to Root Ports).
7: Precision Time Measurement (PTM) 873

 PTM Definition: Ability to send base timing
information between components.
 Goal: Simplify scheduling of time-sensitive
media and server applications in a platform
by coordinating precise timing across
devices.
 Method: PTM Requesters send Messages to
request base time info and PTM Responders
send Messages in response.

8: Protocol Multiplexing (PMUX) 874

 Purpose: allow multiple protocols to share a
PCIe Link
 Terms:
 PMUX Channel – set of logic to generate and
receive packets using a specific protocol, which is
multiplexed into the general PCIe traffic
 PMUX Link – PCIe Link over which protocol
multiplexing has been enabled. A mix of TLPs and
PMUX packets are transferred.
 PMUX Packet – Specially-modified packet that
identifies itself as the PMUX type.

8: PMUX Basics 875

 Does not consume or interfere with PCIe
resources; uses distinct protocol-specific
resources.
 PMUX packets have no impact on TLPs or
DLLPs and are ignored by devices that don’t
support them.
 PMUX must be enabled by s/w before
packets can be sent. PMUX packets received
by a target that isn’t enabled are ignored.
 PMUX Link can support up to 4 active PMUX
Channels simultaneously
Thank you!

Part 6: Appendices

Appendix A:
Details of Spec 2.1 Changes

Overview of Spec 2.1 Changes 879

 13 features were introduced with the 2.1
spec revision, which can be loosely grouped
into these categories:
1. Communication
 Multicasting
2. Performance
 TPH, IDO, ARI, Extended Tag Default
3. Power Management
 DPA, LTR, OBFF, ASPM Option
4. Software Model
 Atomic Ops
5. Configuration
 Internal Error Reporting, Resizable BARs, Simplified
Ordering Table
880

1. Communication:
Multicasting

Motivation 888 881

 Sending posted memory writes to multiple
destinations at once can improve
performance for certain applications, such as
 Multi-headed graphics
 Storage options (RAID, Mirroring)
 Streaming to video and storage simultaneously

[Figure: example topology with Root Complex, SDRAM, GFX, Switch, NIC, Endpoints, and SCSI Disks]
Mechanism 888 882

 Only supported for posted, address-routed
requests (such as memory writes)
 New Configuration Registers
 Multicast BAR
 Defines an address range for multicast memory and
segments that range into equal-size Multicast Windows,
each associated with a Multicast Group (MCG)
 TLPs that will be in the multicast range contain an MCG
number (up to 6 bits) within their address
 Multicast Capability structure configures routing
elements and endpoints for MC routing and
decode
 Multicast Overlay mechanism in egress ports allows
endpoints without MC Capability registers to still be
treated as MC targets
MC Extended Capability Structure 889 883

Capability and Control Registers 890 884

Multicast Capability Register

Bits    Field                             Description
15      MC_ECRC_Regeneration_Supported
14      RsvdP
13:8    MC_Window_Size_Requested          Exponent for MC window size in endpoints;
                                          RsvdP in Switches and RC
7:6     RsvdP
5:0     MC_Max_Group                      Max number of MCGs Supported minus 1

Multicast Control Register

Bits    Field                             Description
15      MC_Enable
14:6    RsvdP
5:0     MC_Num_Group                      Number of MCGs Configured minus 1
Base Address Register 891 885

 Starting address of the MC range and
location of the MCG within the address

DW    Bits    Field
0     31:12   MC_Base_Address [31:12]
0     11:6    RsvdP
0     5:0     MC_Index_Position
1     31:0    MC_Base_Address [63:32]

MC_Index_Position: location of the LSB of the MCG number within
the address. Since the lowest 12 bits here are
not available, the behavior will be undefined if
this number is < 12 and MC is enabled.
Multicast TLP 892 886

 Spec describes the process of finding the
MCG in the TLP address by this formula:
MCG = ((TLP address – Base address) >> MC_Index_Position) AND 3Fh

If Index Position = 12, then the MCG (which
can be up to 6 bits) would be found within
the address starting at bit 12.
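The formula translates directly into code. The sketch below is a straightforward transcription under the assumption that the address has already been found to lie inside the multicast range; the function name `extract_mcg` is hypothetical.

```python
def extract_mcg(tlp_address: int, mc_base_address: int, mc_index_position: int) -> int:
    """Return the Multicast Group number encoded in a TLP address."""
    if mc_index_position < 12:
        # The slides note that behavior is undefined below 12 when MC is enabled.
        raise ValueError("MC_Index_Position must be at least 12")
    # MCG = ((TLP address - Base address) >> MC_Index_Position) AND 3Fh
    return ((tlp_address - mc_base_address) >> mc_index_position) & 0x3F
```

With a base of 2GB and an index position of 12, an address 5 * 4KB above the base falls in the window for MCG 5.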
MC Windows Example 894 887

 MC_Base_Address = 2GB
 MC_Max_Group = 7
 MC_Window_Size_Requested = 10
 MC_Index_Position = 12
 MC_Num_Group = 5
System Memory Map

MC Address Range
= 2GB to 2GB + 2^12 * 6
= 2GB to 2GB + 24KB

Windows, from the top of the range to the bottom:
2GB + 24KB
MC Group 5
MC Group 4
MC Group 3
MC Group 2
MC Group 1
MC Group 0
2GB (MC_Base_Address)

8 MC windows are available in hardware, each at least 2^10
in size (technically, 2^12 is the min. address granularity).
Only 6 MC windows are configured for use.
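The arithmetic in this example can be checked with a few lines of code. This is a sketch that assumes, as the example does, that each window is 2^MC_Index_Position bytes and that MC_Num_Group holds the number of configured groups minus 1; the helper name `mc_address_range` is hypothetical.

```python
def mc_address_range(mc_base_address: int, mc_index_position: int, mc_num_group: int):
    """Return (start, end) of the configured multicast range, end exclusive."""
    window_size = 1 << mc_index_position   # 2^12 = 4KB in the example
    num_windows = mc_num_group + 1         # register holds the count minus 1
    return mc_base_address, mc_base_address + window_size * num_windows


GB = 1 << 30
KB = 1 << 10
start, end = mc_address_range(2 * GB, 12, 5)
print(hex(start), hex(end), (end - start) // KB, "KB")
```

For the values on the slide this gives six 4KB windows, so the range runs from 2GB to 2GB + 24KB.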
Blocked TLP Error 888

 If a TLP is blocked for either reason, it is
dropped and considered an uncorrectable,
non-fatal error
 Completer sets Signaled Target Abort bit
 A new AER bit reports this case
 If enabled, a non-fatal error message can be sent
to notify system software.

Motivation for Overlay 894 889

 Downstream: Allow an endpoint that doesn’t have the
MC structure to still act as a target for MC TLPs
 This could also be done by simply placing the endpoint BAR
address range within an MC range. But using the Overlay
allows the endpoint address to be in a different area and
accessed normally, while also accessible through an MC
window.
 Upstream: Allow part of an MC window address
range to be mapped onto host memory space
 It’s not expected that host memory would normally be an MC
target but, if it is, the use of Overlay avoids the need for
address translation within the Root.

Overlay BAR 895 890

 Map MC transactions to another address that
isn’t MC capable

DW    Bits    Field
0     31:6    MC_Overlay_BAR [31:6]
0     5:0     MC_Overlay_Size
1     31:0    MC_Overlay_BAR [63:32]

MC Overlay Size:
If 6 or greater, this specifies the size in bytes of the overlay aperture
as a power of 2.
If less than 6, the overlay mechanism is disabled, since this BAR
can’t use the 6 low-order bits
Overlay Method 896 891

 When enabled, upper bits of the MC address
are replaced with the overlay address,
effectively remapping it.
 Example:
 Original Write hits MC range with address: ABCD_BEEFh
 Overlay is enabled, and overlay BAR is: FEED_0000h
 MC_Overlay_Size = 01_0000b = 16 (since ≥ 6, overlay is enabled)
 Result keeps 16 lower bits of original address and replaces
all the upper bits, so address to target is:
(FEED_BEEFh)
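The remapping in this example is a simple mask-and-merge. The sketch below assumes MC_Overlay_Size is the number of low-order address bits preserved (the aperture size as a power of 2); the function name `apply_overlay` is hypothetical.

```python
def apply_overlay(address: int, overlay_bar: int, overlay_size: int) -> int:
    """Replace the upper address bits with the overlay BAR, keeping the low bits."""
    if overlay_size < 6:
        return address  # overlay mechanism is disabled
    low_mask = (1 << overlay_size) - 1
    return (overlay_bar & ~low_mask) | (address & low_mask)


print(hex(apply_overlay(0xABCD_BEEF, 0xFEED_0000, 16)))
```

With the values from the slide, the low 16 bits (BEEFh) are kept and the upper bits come from the overlay BAR, giving FEED_BEEFh.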
Overlay Example 896 892

[Figure: System Memory Map. Original Address ABCD_BEEFh in the Multicast Address Range
is overlaid onto FEED_BEEFh in the PCIe BAR Range FEED_0000 to FEED_FFFF]

Overlay remaps posted accesses to MC window into
memory space defined by an endpoint BAR.
ECRC Regeneration 893

 If original TLP included ECRC, changing the
address with overlay will render it incorrect.
Solution:
 Switch and RC ports optionally support ECRC
Regeneration. If so, compute a new ECRC using
the new address and replace the old ECRC
 If not supported, remove the old ECRC and clear
the TD bit in the TLP header. That still lets the
target check ECRC on other TLPs without being
confused by this one.
 If ECRC had error before TLP was modified, add
new ECRC but invert it to ensure the error isn’t
accidentally masked by regeneration
Routing MC Transactions 896 894

 When an MC hit is detected, normal routing is
suspended
 The MCG number is extracted and the
MC_Receive register of all egress ports is
checked to see whether they get a copy
 Unless a port that has enabled receive has
also blocked that MCG, forward the TLP
 If no ports forward the TLP or no Functions
consume it, it is silently dropped
 To prevent loops, a TLP is never forwarded
back out its ingress port, with the possible
exception of the ACS case
Multicast Example 896 895

 From the sender
 Verify MCG isn’t blocked and send TLP
 At switch or RC port
 If ingress address is within the multicast range:
 Extract MCG from the address, verify MCG isn’t blocked
 Compare MCG against each egress port’s Receive register to
determine if that port should receive a copy
 If so, verify MCG isn’t blocked and send out on that egress port
 At the endpoint
 If address is within multicast range, extract MCG from
address, compare with Receive register. If no match, silently
drop it
 If address is not within the multicast range, or MC is not
supported, treat TLP as having an ordinary address
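The switch-side steps can be sketched as a small forwarding decision. This is an illustrative model only: the `Port` class, its `mc_receive` and `mc_block_all` bit vectors (bit n corresponds to MCG n), and the function name `egress_ports` are assumptions introduced here, and details such as untranslated-address blocking, overlay, and the ACS exception are left out.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    mc_receive: int = 0    # bit n set: this port wants copies for MCG n
    mc_block_all: int = 0  # bit n set: MCG n is blocked on this port


def egress_ports(mcg: int, ingress: Port, ports: list) -> list:
    """Return the ports that should send a copy of a multicast TLP."""
    bit = 1 << mcg
    if ingress.mc_block_all & bit:
        return []  # blocked at the ingress port, so nothing is forwarded
    return [
        p for p in ports
        if p is not ingress                 # never back out the ingress port
        and (p.mc_receive & bit)            # port has enabled receive for this MCG
        and not (p.mc_block_all & bit)      # and has not blocked it
    ]
```

An empty result corresponds to the case where no port forwards the TLP and it is dropped.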
Congestion Avoidance 897 896

 MC increases switch traffic proportional to the
percentage of MC traffic, leading to a risk of
congestion.
 To avoid this:
 MC targets should be designed to accept MC
traffic “at speed” – minimal delays
 MC senders should limit injection rate
 System designer should use switches with big
enough buffers to handle expected traffic, and
targets that can process MC traffic quickly.

897

2. Performance:
TLP Processing Hints

Motivation 899 898

 System performance can be improved if
hardware knows where cacheable data will be
needed again soon.
 Adding TLP hints helps take advantage of the
system cache hierarchy:
1. Facilitate data residency
 Data (device control info or payload data) to be used again
soon can be put in a cache when device writes to memory
2. Minimize access latencies
 Avoid longer access paths by using caches
3. Reduce bandwidth and power in affected buses
 Reduce congestion and improve efficiency to memory

Example: Device Write to Host 901 899

1. Device writes to memory
2. RC snoops CPU cache, possibly resulting in
a write-back cycle (2a)
3. Next, data is written into memory
Device Write to Host 901 900

4. Device notifies host of data delivery
5. CPU reads from memory
(3 memory accesses were needed)
TPH Improves Performance 902 901

1. Device sends DMA write toward memory; TPH hints indicate
data should be cached instead, possibly even biasing the LRU
information to give it preferred retention in the cache
2. Processor cache is snooped
3. Device notifies host
4. CPU reads from cache, avoids memory access
Host Write to Device 903 902

1. Host writes toward device; TPH indicates
data should be cached near the target.
2. Device reads from cache repeatedly, avoids
both Link access and memory access
Device to Device 904 903

Two devices both write toward device memory; TPH
indicates data should be cached.
Both devices use system cache for “read mostly”
structures and avoid accessing memory or each
other’s Links
Usage Models 904

 Processing network IO traffic
 Database system clusters, Computational
accelerators
 Exchange of lock information is often a bottleneck
in such systems
 Overhead of operations limits efficiency in some
operations
Method 905 905

[Figure: Request Header for 64-bit Memory Address, showing the TH bit
in the first dword and the PH field below Address [31:2]]

 If TH is set, this TLP includes hints about where
its data should be cached. Functions that don’t
support it are required to ignore TH.
 Two levels of hints are defined:
 Baseline
 Optional prefix
Baseline 905 906

[Figure: Request Header for 64-bit Memory Address, showing the TH bit
in the first dword and the PH field below Address [31:2]]

 Baseline includes Processing Hints (PH)
 These used to be the 2 LSBs of the address and were reserved
 Now they show the location of frequent access and provide a hint
as to the best place for caching the data
 00 Bidirectional – shared data structure
 01 Requester – device read and write
 10 Target – device write & host read or vice-versa (one direction)
 11 Target with priority – same as Target but with high temporal locality
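The four PH encodings map naturally onto an enumeration. The sketch below assumes the last header dword carries Address [31:2] in its upper 30 bits and PH in its 2 LSBs, and that the caller has already checked that TH is set; the names `ProcessingHint` and `split_address_dword` are hypothetical.

```python
from enum import IntEnum


class ProcessingHint(IntEnum):
    BIDIRECTIONAL = 0b00          # shared data structure
    REQUESTER = 0b01              # device read and write
    TARGET = 0b10                 # device write and host read, or vice versa
    TARGET_WITH_PRIORITY = 0b11   # Target, with high temporal locality


def split_address_dword(dword: int):
    """Split the last header dword into (dword-aligned address, PH)."""
    address = dword & 0xFFFF_FFFC   # Address [31:2], low 2 bits cleared
    return address, ProcessingHint(dword & 0x3)
```

When TH is clear these two bits are not hints, so a real decoder would only call something like this after testing TH.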
Baseline: Steering Tags 906 907

 Purpose: suggest caching strategy
 Meaning of the bits is implementation specific
 The steering tag bits are sent in the header
and are obtained from a table, as we’ll see
later.
Steering Tags for Reads 906 908

[Figure: memory read request header; the Last DW BE and First DW BE
fields carry the Steering Tag]

 For Read requests with TH, byte enable fields are
repurposed to serve as the lower 8 ST bits
 BE’s aren’t needed for prefetchable reads. If the request
targets non-prefetchable space, care must be taken to avoid
undesirable side effects.
 For AtomicOp requests
 If TH is set, BE’s are repurposed, otherwise they’re reserved
Steering Tags for Writes 906 909

[Figure: posted write request header; the Tag field carries the Steering Tag]

 For Posted requests with TH, the tag field is
repurposed as the lower 8 ST bits
 Tag isn’t needed since there’s no completion for
posted writes
Optional Prefixes 910

 If more hints are needed, TLP prefixes can be added
at the beginning of a TLP
 Size of a prefix is 1 dword (4 bytes)
 Scope can be
 Local – limited to the Link
 End-End – between the requester and the target
 Flow Control header credits will be different for
receivers that support End-End Prefixes. One unit =
space for a max-size header + Digest + the max
number of End-End prefixes.
 Routing elements may or may not use prefixes, but
they’ll at least need to know they aren’t errors

TLP with Prefixes 911

 Prefixes can be added in front of a memory
request TLP
 Note that prefixes must be followed by a
header. If not, the packet is Malformed.

[Figure: TLP layout, in order:
Prefixes: one dword each, first byte 100 x followed by the prefix code
Header: Format, Type, TC, Attr, TH, TD, EP, Attr, AT, Length;
Requester ID, Tag, Last DW BE, First DW BE; Address [31:2], PH
Optional Data
Optional ECRC]
Prefix Contents 912

+0 +1 +2 +3
7 5 4 3 0 7 0 7 0 7 0
100 x (Defined by prefix contents)

 Format bit 2 was previously reserved but now
must be recognized to detect a prefix.
 The Extended Fmt Field Supported bit indicates
whether a Function recognizes a 3-bit format field
(see next slide), and the spec strongly recommends it
 Note that if Fmt [2] is set but the device doesn’t
support it, handling of the TLP will be undefined.
 Format value of 100b indicates the presence of a
prefix, and the next bit indicates:
 0 = Local Prefix
 1 = End-End Prefix
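Decoding the first byte of a prefix dword follows directly from the layout above. This is a minimal sketch, assuming bits 7:5 hold the Fmt field, bit 4 selects Local or End-End, and bits 3:0 hold the prefix code (the L [3:0] or E [3:0] field on the following slides); the function name `decode_prefix_byte0` is hypothetical.

```python
def decode_prefix_byte0(byte0: int):
    """Return (scope, code) for a TLP prefix, or None if byte0 is not a prefix."""
    if not 0 <= byte0 <= 0xFF:
        raise ValueError("byte0 must fit in 8 bits")
    if (byte0 >> 5) != 0b100:
        return None  # Fmt is not 100b, so this dword is not a prefix
    scope = "end-end" if byte0 & 0x10 else "local"
    return scope, byte0 & 0x0F
```

A receiver would apply this to each leading dword until it returns None, at which point the dword is expected to be the TLP header.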
Device Capabilities 2 Register 913

Device Capabilities 2 Register

Bits    Field
31:24   RsvdP
23:22   Max End-End TLP Prefixes
21      End-End TLP Prefix Supported
20      Extended Fmt Field Supported
19:14   RsvdP
13:12   TPH Completer Supported
11      LTR Mechanism Supported
10      No RO-enabled PR-PR Passing
9       128-bit CAS Completer Supported
8       64-bit AtomicOp Completer Supported
7       32-bit AtomicOp Completer Supported
6       AtomicOp Routing Supported
5       ARI Forwarding Supported
4       Completion Timeout Disable Supported
3:0     Completion Timeout Ranges Supported

Note: bits 23:20 and 13:12 are the fields related to TPH and prefixes. A
new set of registers is needed for Requesters, and that’s covered later.
Local Prefix Rules 914

+0 +1 +2 +3
7 5 4 3 0 7 0 7 0 7 0
1 0 0 0 L [3:0] (Defined by prefix contents)

 L [3:0] encodings:
0000 – MR-IOV (Multi-Root IO Virtualized environment):
Supports packet routing, error detection, and congestion
management (see MR-IOV spec for details).
1110 – Vendor-defined local prefix 0
1111 – Vendor-defined local prefix 1
 Local prefixes are not protected by ECRC
 If both Local and End-End prefixes are used, the
local ones must appear first. An error in this regard is
a Malformed TLP.

End-to-End Prefix Rules 915

+0 +1 +2 +3
7 5 4 3 0 7 0 7 0 7 0
1 0 0 1 E [3:0] (Defined by prefix contents)

 E [3:0] encodings:
0000 – Extended TPH
1110 – Vendor-defined end-end prefix 0
1111 – Vendor-defined end-end prefix 1
 All end-end prefixes are protected by the optional ECRC
 Max end-end prefixes in a TLP = 4 (max number
supported by a Function is reported in Device
Capabilities 2 register). Rx must check this and an error
will be a Malformed TLP.
 End-End TLP Prefix Supported bit indicates whether a
Function can receive these prefixes
More End-to-End Prefix Rules 916

 Receipt of prefixes by a Function that doesn’t
support them will be a Malformed TLP.
 However, if one Function in an MFD does support
them, receipt of an unsupported prefix by other
Functions will be treated as an Unsupported Request
instead.
 Prefixes are replicated in multicast TLPs

Device Control Register 2 917

Device Control 2 Register

Bits    Field
15      End-End TLP Prefix Blocking
14:11   RsvdP
10      LTR Mechanism Enable
9       IDO Completion Enable
8       IDO Request Enable
7       AtomicOp Egress Blocking
6       AtomicOp Requester Enable
5       ARI Forwarding Enable
4       Completion Timeout Disable
3:0     Completion Timeout Value

 Routing elements can use End-End TLP Prefix Egress
blocking for endpoints that won’t understand them. Such
a TLP is dropped and an error is reported (see next
slide). If the TLP was a request, a completion with UR is
returned.
AER Uncorrectable Status Register 918

 New Uncorrectable Status bit

Uncorrectable Error Status Register

Bits    Field
31:26   RsvdZ
25      TLP Prefix Blocked Error Status
24      AtomicOp Egress Blocked Status
23      MC Blocked TLP Status
22      Uncorrectable Internal Error Status
21      ACS Violation Status
20      Unsupported Request Error Status
19      ECRC Error Status
18      Malformed TLP Status
17      Receiver Overflow Status
16      Unexpected Completion Status
15      Completer Abort Status
14      Completion Timeout Status
13      Flow Control Protocol Error Status
12      Poisoned TLP Status
11:6    RsvdZ
5       Surprise Down Error Status
4       Data Link Protocol Error Status
3:1     RsvdZ
0       Undefined
End-to-End Prefix Example 919

+0 +1 +2 +3
7 5 4 3 0 7 0 7 0 7 0
100 1 0000 ST [15:8] Reserved

 Code = Extended TPH
 Byte 1 contains the upper 8 Steering Tag bits, ST [15:8]
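Combining the two halves of a 16-bit Steering Tag can be sketched as follows. The code assumes the prefix dword is held as a 32-bit integer with byte 0 (the +0 column) in the most significant position, and that ST [7:0] has already been taken from the header; the function name `full_steering_tag` is hypothetical.

```python
EXTENDED_TPH_BYTE0 = 0b1001_0000  # Fmt = 100b, End-End, code 0000 (Extended TPH)


def full_steering_tag(st_low: int, prefix_dword=None) -> int:
    """Merge ST[7:0] from the header with ST[15:8] from an Extended TPH prefix."""
    st = st_low & 0xFF
    if prefix_dword is not None:
        byte0 = (prefix_dword >> 24) & 0xFF
        if byte0 != EXTENDED_TPH_BYTE0:
            raise ValueError("not an Extended TPH prefix")
        st |= ((prefix_dword >> 16) & 0xFF) << 8  # byte 1 carries ST[15:8]
    return st
```

Without the prefix the tag is limited to the 8 bits carried in the header.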
Stacked Prefix Example 920

 Prefixes may be stacked or repeated
 Local prefixes must appear first

Packet contents, in order:
STP
Sequence Number
1000 Local Prefix
1001 End-End Prefix
1001 End-End Prefix
TLP Header
Optional Payload Data
Optional ECRC
LCRC
END

Protected by LCRC: Sequence Number through Optional ECRC
Protected by ECRC: End-End Prefixes, TLP Header, and Optional Payload Data
TPH Requester Capability Structure 906 921

 Required for requesters that will use TPH
 Completers don’t use this, but indicate TPH
Completer support in Device Capabilities 2
TPH Requester Capability & Control 907 922

TPH Requester Capability Register (Read Only)

ST Table Location:
00b – ST Table not present
01b – ST Table is in Requester Cap Register Structure
10b – ST Table is in MSI-X table
11b – Reserved

Supported ST modes:
Device-Specific Mode: device-specific index into ST Table
Interrupt Vector Mode: MSI-X Vector indexes into ST Table
No ST Mode (required): uses zeroes for steering tags

TPH Requester Control Register

TPH Requester Enable:
00b – Requester not permitted to use TPH or Extended TPH
01b – TPH permitted, but not Extended TPH
10b – Reserved
11b – Both TPH and Extended TPH permitted

ST Mode Select:
000b – No ST Mode
001b – Interrupt Vector Mode
010b – Device-Specific Mode (recommended)
Other encodings – Reserved
TPH ST Table 908 923

Prefix Error Logging 924

 AER registers may include a TLP Prefix Log register
to record End-End prefixes (not local ones)
associated with the First Error Pointer.
 Local Prefixes may have another mechanism
available for logging them (for example, MR-IOV
prefixes are logged in MR-IOV structures)
 For a Malformed TLP, a prefix may be recorded in
the Header Log. For example, if more than 4 End-
End prefixes are seen, the first overflow prefix is
logged in the first dword of the Header Log, with the
rest of the Header Log undefined.
 If a Function doesn’t support all 4 prefixes, the extra
ones in the log must be hardwired to zero.
Prefix Log Register 925

Offset 38h: TLP Prefix Log Register (in Functions that support TLP Prefixes)
926

ID-Based Ordering

Stalled Path 302 927

Write Buffer
Full

Memory Read
Posted Write

Transaction ordering rules mean a posted write that stalls
will block egress of any subsequent transactions.
If requests come from other endpoints, though, the
likelihood of a dependency between them is very low.
IDO Attribute 303 928

 ID-Based Ordering is an optional capability,
and uses a previously reserved attribute bit in
the header to indicate when it’s being used

[Figure: Request Header for 64-bit Memory Address, showing the added Attr bit (bit 2 of byte 1) alongside the existing Attr field in byte 2]


IDO Performance 302 929

 Performance is most affected when multiple
Requests share a common Link and the Link
becomes congested
 Examples:
 Multi-function device, since all the Functions share one Link
 Switch, since streams from different endpoints get mixed
upstream
 Note that IDO and Relaxed Ordering (RO) are independent
attributes whose effects can overlap
 IDO allows out of order packets from different streams
 RO allows out of order packets in the same stream
 If both attributes are set, passing is allowed whenever
either rule permits it (logical OR)
 If neither is set, result is PCI strongly-ordered model
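The way the two attributes combine can be sketched as a small decision function. This is a simplified model covering only one posted write passing an earlier posted write; the class and function names are hypothetical, and a "stream" is assumed to be identified by Requester ID.

```python
from dataclasses import dataclass


@dataclass
class PostedWrite:
    requester_id: int   # identifies the stream the write belongs to
    ro: bool = False    # Relaxed Ordering attribute
    ido: bool = False   # ID-Based Ordering attribute


def may_pass(later: PostedWrite, earlier: PostedWrite) -> bool:
    """Return True if `later` is allowed to pass the stalled `earlier` write."""
    # RO permits passing even within the same stream.
    ro_allows = later.ro
    # IDO permits passing only when the two writes are from different streams.
    ido_allows = later.ido and later.requester_id != earlier.requester_id
    # With both set the result is the logical OR; with neither, strong ordering.
    return ro_allows or ido_allows
```

With neither bit set the function always returns False, which corresponds to the strongly-ordered model.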
When to Use IDO 302 930

 For endpoints
 If communicating with just one other entity, it’s safe to
use IDO for all TLPs.
 If working with more than one entity, some TLPs could
use IDO and others not, but race conditions can result
unless synchronization techniques are used.
 For root ports, use of IDO doesn’t make sense:
 They usually communicate with multiple entities
 Some RC designs use a different requester ID and
completer ID for the same port, which could make
TLPs appear to be in different streams even when
they aren’t.
 IDO is not permitted for configuration or IO requests,
and the IDO bit is reserved in those headers.
IDO Policy for Endpoints 303 931

 Software driver for the device must know when
it’s safe to enable IDO
 Simple Policy:
 If IDO Request Enable is set, set IDO bit in all
applicable TLPs it originates
 If IDO Completion Enable is set, use IDO in all its
completion TLPs (unlike RO, this doesn’t depend on
setting of IDO bit in the original request)
 Complex Policy
 Implementation-specific mix of appropriate times to
use IDO.

Device Control Register 2 303 932

Device Control 2 Register


Bit 15      End-End TLP Prefix Blocking
Bits 14:11  RsvdP
Bit 10      LTR Mechanism Enable
Bit 9       IDO Completion Enable
Bit 8       IDO Request Enable
Bit 7       AtomicOp Egress Blocking
Bit 6       AtomicOp Requester Enable
Bit 5       ARI Forwarding Enable
Bit 4       Completion Timeout Disable
Bits 3:0    Completion Timeout Value

 Functions can hardwire these bits to zero if they
won’t use IDO
 As with Relaxed Ordering, there is no capability bit,
just an enable bit. Software would have to know the
Function could support it.
933

Alternative Routing-ID Interpretation

ARI Basics 909 934

 Motivation: New implementations can benefit
from having a larger number of Functions
available to them, especially virtualized devices.
 Device number in endpoints is usually just zero
anyway.
 Method: For endpoints that support it, device
number is assumed to be zero and Function
number grows to 8 bits.
 To avoid confusion, downstream ports must be
enabled to recognize ARI before software can
use it.
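The two interpretations of the 16-bit Routing ID can be compared side by side in a short sketch. The function name is hypothetical; the traditional split is 8 bits of Bus, 5 bits of Device, and 3 bits of Function, while ARI treats the low 8 bits as the Function number with Device implied to be zero.

```python
def decode_routing_id(routing_id: int, ari: bool = False) -> dict:
    """Decode a 16-bit Routing ID with or without ARI interpretation."""
    if not 0 <= routing_id <= 0xFFFF:
        raise ValueError("Routing ID must fit in 16 bits")
    bus = routing_id >> 8
    if ari:
        # ARI: Device number is implied to be zero, Function grows to 8 bits.
        return {"bus": bus, "device": 0, "function": routing_id & 0xFF}
    # Traditional: 5-bit Device number and 3-bit Function number.
    return {
        "bus": bus,
        "device": (routing_id >> 3) & 0x1F,
        "function": routing_id & 0x07,
    }
```

For example, the value 0329h reads as Bus 3, Device 5, Function 1 in the traditional form, and as Bus 3, Function 41 under ARI.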

ARI Capability Structure 935

Offset  Register
000h    PCIe Enhanced Capability Header
          Bits 31:20  Next Capability Offset
          Bits 19:16  Version (1h)
          Bits 15:0   PCIe Extended Capability ID (000Eh for ARI)
004h    ARI Control Register (bits 31:16) and ARI Capability Register (bits 15:0)

 All Functions that support ARI must implement
this structure, so software can see it
 Support for ARI does not impact enumeration
algorithms in use today

Capability Register 936

ARI Capability Register (RO)


Bits 15:8  Next Function Number
Bits 7:2   RsvdP
Bit 1      A
Bit 0      M

 Next Function number – next higher
numbered Function in this device, or zero if
this is the last one. First Function must be
Function zero.
 A and M both apply only to Function 0 and
must be cleared for all other Functions.
 A – ACS Function Groups Capability – supports
Function Group granularity for ACS P2P Egress
Control.
 M – MFVC Function Groups Capability – indicates
Function Group level arbitration support for MFVC.
Next Function Number 937

 Motivation: improve enumeration
performance by avoiding the need to go
through the entire list of 256 possible
Functions.
 Hardware supplies a linked list of Function
numbers and software can optionally walk
through the list instead.
 Function 0 will be the head of the list
 The list can be sparse and non-sequential
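Walking the linked list might look like the sketch below. The callback `read_next_function` is hypothetical: it stands in for a configuration read of the Next Function Number field in a given Function's ARI Capability Register. The iteration cap is a defensive assumption, not something the slides require.

```python
def walk_ari_functions(read_next_function, max_functions: int = 256) -> list:
    """Follow Next Function Number links starting at Function 0."""
    found = []
    fn = 0                                  # Function 0 is the head of the list
    for _ in range(max_functions):
        found.append(fn)
        nxt = read_next_function(fn)
        if nxt == 0:                        # zero marks the last Function
            return found
        fn = nxt
    raise RuntimeError("Next Function list did not terminate")
```

A device that implements Functions 0, 2, 5, and 64 would be enumerated with four reads instead of 256 probes.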

ARI Forwarding 938

 Routing elements are allowed to use ARI for
downstream ports if
 The downstream device supports ARI, and
 ARI forwarding is supported and enabled for that port
 Once enabled, logic that changes a Type 1
Configuration request to Type 0 will no longer enforce
the requirement that the downstream device number
must be zero (it will now be implied to be zero).
 Extended Functions (beyond Function 7) are always
enabled in an ARI device, but software must enable
ARI support in the downstream port just above it
before they’ll be accessible.

Device Capabilities 2 Register 939

Device Capability 2 Register


Bits 31:24  RsvdP
Bits 23:22  Max End-End TLP Prefixes
Bit 21      End-End TLP Prefix Supported
Bit 20      Extended Fmt Field Supported
Bits 19:14  RsvdP
Bits 13:12  TPH Completer Supported
Bit 11      LTR Mechanism Supported
Bit 10      No RO-enabled PR-PR Passing
Bit 9       128-bit CAS Completer Supported
Bit 8       64-bit AtomicOp Completer Supported
Bit 7       32-bit AtomicOp Completer Supported
Bit 6       AtomicOp Routing Supported
Bit 5       ARI Forwarding Supported
Bit 4       Completion Timeout Disable Supported
Bits 3:0    Completion Timeout Ranges Supported
Device Control Register 2 940

Device Control 2 Register


Bit 15      End-End TLP Prefix Blocking
Bits 14:11  RsvdP
Bit 10      LTR Mechanism Enable
Bit 9       IDO Completion Enable
Bit 8       IDO Request Enable
Bit 7       AtomicOp Egress Blocking
Bit 6       AtomicOp Requester Enable
Bit 5       ARI Forwarding Enable
Bit 4       Completion Timeout Disable
Bits 3:0    Completion Timeout Value

Example ARI Topology 941

 The ports to be enabled for ARI forwarding
are those just above an ARI device
 Intermediate ports won’t care about device
vs. Function numbers

Root Complex

ARI Device A
Switch NIC

Enabled for
ARI forwarding ARI Device B

942

Extended Tag Enable Default

Motivation 943

 In previous versions of the spec, the default size
of the Tag field was 5 bits, allowing 32 split
transactions to be in progress at the same time
per Function.
 If a Function supported it, software could change
the tag size to 8 bits by enabling the Extended
Tag Field.
 This change simply allows the default value of
the Extended Tag Field Enable bit to select either
5-bit or 8-bit tags (implementation specific)

944

3. Power Management:
Dynamic Power Allocation

Motivation 714 945

 PCIe 2.0 added dynamic power support at
the Link level, but not at the device level
 As power and thermal budgets become
increasingly important in system design, PCIe
devices need more options.
 Goals
 Reduce average power
 Lower platform cost
 Improve battery life in mobile devices

DPA Method 714 946

 New registers provide devices with up to 32
power management sub-states under D0
state
 Gives software visibility and control of device
power states, even when a device doesn’t
have a driver that handles PM
 Power state is read or changed using
configuration cycles
 Requires use of DPA extended registers
 Multiple Functions in a device can have their
own DPA registers, and the overall device
power will be the sum of them all
DPA Capability Structure 715 947

Offset           Register
000h             PCIe Enhanced Capability Header
                   Bits 31:20  Next Capability Offset
                   Bits 19:16  Version (1h)
                   Bits 15:0   PCIe Extended Capability ID (0016h for DPA)
004h             DPA Capability Register
008h             DPA Latency Indicator Register
00Ch             DPA Control Register (bits 31:16) and DPA Status Register (bits 15:0)
010h up to 02Ch  DPA Power Allocation Array (sized by number of substates)

DPA Capability Register 716 948

Bits 31:24  Xlcy1 (Transition Latency Value 1)
Bits 23:16  Xlcy0 (Transition Latency Value 0)
Bits 15:14  RsvdZ
Bits 13:12  PAS (Power Allocation Scale)
Bits 11:10  RsvdZ
Bits 9:8    Tlunit (Transition Latency Unit)
Bits 7:5    RsvdZ
Bits 4:0    Substate_Max
All fields not reserved are (RO)

 Transition Latency Value 0 & 1 – these are
multiplied by the Tlunit to give two max
transition times for going into a substate
from any other. Actual latency cannot be
more than this.
Capability Register Fields 949

 Power Allocation Scale – multiplier for substate


power allocation (value in watts):
00 – 10.0
01 – 1.0
10 – 0.1
11 – 0.01
 Transition Latency Unit – multiplier for max substate
change latency
00 – 1ms
01 – 10ms
10 – 100ms
11 – Reserved
 Substate_Max – number of supported substates - 1
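As a rough illustration of how these fields combine, the sketch below decodes a raw DPA Capability Register value. The bit positions are taken from the bit ruler on the register slide (Substate_Max in bits 4:0, Tlunit in 9:8, PAS in 13:12, Xlcy0 in 23:16, Xlcy1 in 31:24); the function and dictionary names are hypothetical.

```python
# Multipliers from the encodings listed above.
PAS_SCALE = {0b00: 10.0, 0b01: 1.0, 0b10: 0.1, 0b11: 0.01}
TLUNIT_MS = {0b00: 1, 0b01: 10, 0b10: 100}   # 11b is Reserved


def decode_dpa_capability(reg: int) -> dict:
    """Decode a 32-bit DPA Capability Register value into usable quantities."""
    substate_max = reg & 0x1F
    tlunit = (reg >> 8) & 0x3
    pas = (reg >> 12) & 0x3
    xlcy0 = (reg >> 16) & 0xFF
    xlcy1 = (reg >> 24) & 0xFF
    if tlunit not in TLUNIT_MS:
        raise ValueError("Transition Latency Unit encoding 11b is Reserved")
    unit_ms = TLUNIT_MS[tlunit]
    return {
        "num_substates": substate_max + 1,   # field holds (count - 1)
        "power_scale": PAS_SCALE[pas],
        "latency0_ms": xlcy0 * unit_ms,
        "latency1_ms": xlcy1 * unit_ms,
    }
```

A register with Substate_Max = 3, Tlunit = 01b, PAS = 10b, Xlcy0 = 5, and Xlcy1 = 20 describes four substates, a 0.1 scale, and maximum transition times of 50 ms and 200 ms.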
Latency Indicator Register 950

 Each bit indicates which of the two latency
values applies to that substate.
 Examples:
 If bit 17 = 0, then substate 17 uses latency value 0
 If bit 10 = 1, then substate 10 uses latency value 1

31 0

DPA Latency Indicator Register

All bits (RO)

DPA Status Register 716 951

 Status gives current substate setting


 Control is enabled by default after a reset, but
can be disabled by writing a one to bit 8

Bits 15:9  RsvdZ
Bit 8      Substate Control Enabled (RW1C)
Bits 7:5   RsvdZ
Bits 4:0   Substate Status (RO)

DPA Control Register 952

 Software writes the desired substate value
here. If Substate Control is Enabled, that
determines the Function’s substate.
 Default is substate 0.

Bits 15:5  RsvdP
Bits 4:0   Substate Control (RW)

Power Allocation Array 715 953

31 0 Offset

PCIe Enhanced Capability Header 000h

DPA Capability Register 004h

DPA Latency Indicator Register 008h

DPA Control Register DPA Status Register 00Ch

010h
DPA Power Allocation Array
(Sized by number of substates)
Up to
02Ch

One 8-bit register for each substate gives the power for that substate,
which is multiplied by the Power Allocation Scale register to arrive at the
wattage used. All values in the array are RO.
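Putting the array together with the Power Allocation Scale and the Latency Indicator Register, a per-substate summary could be built as in the sketch below. The function name and argument layout are hypothetical, and the inputs are assumed to have been read from the registers already.

```python
def substate_table(alloc, power_scale, latency_indicator,
                   latency0_ms, latency1_ms):
    """Return (substate, watts, max_latency_ms) for each array entry.

    alloc             -- list of 8-bit Power Allocation Array values
    power_scale       -- multiplier selected by Power Allocation Scale
    latency_indicator -- 32-bit DPA Latency Indicator Register value
    """
    rows = []
    for substate, raw in enumerate(alloc):
        watts = raw * power_scale
        # Bit n of the indicator selects latency value 1 (set) or 0 (clear).
        uses_value1 = (latency_indicator >> substate) & 1
        rows.append((substate, watts,
                     latency1_ms if uses_value1 else latency0_ms))
    return rows
```

Because each Function has its own array, overall device power would be the sum of the wattage of every Function's current substate.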

954

Latency Tolerance Reporting

Motivation 784 955

 Goal: improve system power management


 At present, software has to guess how much latency
is acceptable for devices. PM is often cautious or
disabled to avoid performance problems.
 LTR informs software of known latency requirements
so PM policies can take into consideration how much
latency the endpoints can tolerate.
 Result: devices get good performance when they
need it, and system can use lower power when
devices don’t need a fast response
 Method: Provide optional registers for Functions
to report service latency requirements on
memory reads and writes
LTR Registers 785 956

 LTR is discovered and enabled using
new PCIe capability register bits
 Software must not enable LTR in an endpoint
unless the RC and all intermediate switches
also support it
 When enabling LTR in a hierarchy, those
nearest the RC must be enabled first, working
down to the bottom of the tree
Device Capabilities 2 Register: bit 11 is LTR Mechanism Supported
Device Control 2 Register: bit 10 is LTR Mechanism Enable
LTR Message 786 957

 LTR values are sent from endpoints to the RC
using messages
 The message TLP contains no data payload and must
use TC0 or it will be considered malformed.
 If RC doesn’t support LTR, the message is treated as
an unsupported request
 Messages can be sent whenever conditions change
 Gives endpoints a way to communicate when they’re idle.
 For example, a longer latency would be acceptable if the
device finds itself in an idle condition, while a shorter one
would be reported in anticipation of a time when sustained
data transfer will be needed.

LTR Message Header 788 958

[Figure: LTR Message header. Fmt = 001b, Type = 10100b, TC = 000b, Attr = 00b, AT = 00b, Length is reserved, Message Code = 0001 0000b. The third dword is Reserved; the last dword carries the No-Snoop Latency field (upper 16 bits) and the Snoop Latency field (lower 16 bits).]

Each 16-bit latency field:
Bit 15      Requirement
Bits 14:13  Reserved
Bits 12:10  Latency Scale
Bits 9:0    Latency Value

Latency Scale:
000 – x 1 ns
001 – x 32 ns
010 – x 1K ns
011 – x 32K ns
100 – x 1M ns
101 – x 32M ns
110-111 – not permitted
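A 16-bit latency field from the LTR message could be converted to nanoseconds as sketched below. The field layout assumed here follows the bit ruler on the header slide (Requirement in bit 15, Latency Scale in bits 12:10, Latency Value in bits 9:0), "K" and "M" are read as 1024 and 1024 x 1024, and the function name is hypothetical.

```python
# Scale multipliers in nanoseconds; encodings 110b and 111b are not permitted.
LTR_SCALE_NS = {
    0b000: 1,
    0b001: 32,
    0b010: 1024,
    0b011: 32 * 1024,
    0b100: 1024 * 1024,
    0b101: 32 * 1024 * 1024,
}


def decode_ltr_latency(field: int):
    """Return the tolerated latency in ns, or None if no requirement is set."""
    requirement = (field >> 15) & 1
    scale = (field >> 10) & 0x7
    value = field & 0x3FF
    if not requirement:
        return None                      # field carries no requirement
    if scale not in LTR_SCALE_NS:
        raise ValueError("Latency Scale encodings 110b-111b are not permitted")
    return value * LTR_SCALE_NS[scale]
```

For instance, a Latency Value of 37 with scale 001b corresponds to 37 x 32 = 1184 ns.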
Rules for Multi-Function Devices (MFDs) 787 959

 Acceptable latency values for the message
sent upstream represent the lowest values of
all Functions (“conglomerated” value)
 Snoop and no-snoop latencies can be
associated with different Functions
 MFD must send again when any Function’s
latency values change

Rules for Switches 787 960

 Switches send a “conglomerated” message
upstream according to these rules:
1. Message sent upstream must reflect lowest
values received from any downstream port
 If no ports have requirements for snoop or no-snoop,
the message from the switch must not set that
requirement
 Any additional latency in the switch must be accounted
for in the conglomerated message. Switch design must
ensure this won’t cause conglomerated latency to be
reduced by more than 20%
2. If LTR is supported, it must be supported on all
ports

Rules for Switches – 2 787 961

3. To send LTR upstream, LTR must be enabled or
have been recently disabled (like endpoints)
4. If a downstream port goes to DL_Down status or
that port has LTR disabled, the latencies for it are
treated as invalid. If the conglomerated values
change as a result, a new message must be sent
upstream
5. If no downstream ports receive LTR messages,
the switch must not send one upstream

Rules for Root Complex 788 962

 RC is allowed to delay processing a request
as long as it still satisfies the device’s service
requirements, to allow batching together
several packets to the same device.
 When the latency requirement is updated, RC
must comprehend it no later than either:
 The previous latency time, or
 The time to service a previous request
 RC is not required to honor the requested
latencies, but is “strongly encouraged” to
provide worst-case service latency that
doesn’t exceed the values indicated by LTR
LTR Example 789 963

[Figure: endpoint LTR value of 1200 ns; conglomerate value of 1150 ns]

 Only endpoint reports LTR – switch forwards that value


 Switch internal latency = 50ns

LTR Example-2 790 964

[Figure: endpoint LTR values of 1200 ns and 5000 ns; conglomerate value of 1150 ns]

 Only endpoint reports LTR – switch forwards that value


 Second endpoint reports larger value, no change needed upstream

LTR Example-3 791 965

[Figure: endpoint LTR values of 1200 ns and 700 ns; conglomerate values of 1150 ns and 650 ns]

 Only endpoint reports LTR – switch forwards that value


 Second endpoint reports larger value, no change needed upstream
 Third endpoint reports smaller value – new LTR message needed
upstream

LTR Example-4 791 966

[Figure: endpoint LTR values of 1200 ns and 700 ns; conglomerate values of 650 ns and 1150 ns]

 Only endpoint reports LTR – switch forwards that value


 Second endpoint reports larger value, no change needed upstream
 Third endpoint reports smaller value – new LTR message needed
upstream
 Link to 3rd endpoint fails, and so its LTR value is now considered invalid
– a new LTR message is needed upstream to report the smallest
current value
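The four example slides can be replayed with a small model of the switch's conglomeration rule. This is a sketch under simplifying assumptions: one latency number per downstream port rather than separate snoop and no-snoop values, `None` standing for a port with no valid report, and a hypothetical function name. The 20% limit on the switch's own contribution is not modeled.

```python
def conglomerate(downstream_ns, switch_latency_ns):
    """Return the value a switch reports upstream, or None if nothing to report.

    downstream_ns     -- latest LTR value per downstream port, None if invalid
    switch_latency_ns -- the switch's own internal latency
    """
    valid = [v for v in downstream_ns if v is not None]
    if not valid:
        return None                       # no reports, so no message upstream
    # Lowest downstream requirement, reduced by the switch's internal latency.
    return max(0, min(valid) - switch_latency_ns)


def needs_new_message(previous, current):
    """A new upstream message is needed only when the conglomerated value changes."""
    return previous != current
```

With a 50 ns switch latency, reports of 1200 ns give 1150 ns; adding 5000 ns changes nothing; adding 700 ns lowers the value to 650 ns; and invalidating the 700 ns port restores 1150 ns.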
967

Optimized Buffer Flush Fill (OBFF)

Optimized Buffer Flush/Fill (OBFF): Motivation 776 968

 Problem: bus-master capable devices are not
aware of system power states and may initiate
routine DMA or interrupt transactions at times
when the system would otherwise be able to go
to a lower power state.
 Solution: Allow RC to communicate system
power status to endpoints, which can then
recognize optimal time windows for initiating
traffic.

Coordinating Idle Time 777 969

 Without coordination of events, system is
rarely able to go to lowest power state

[Figure: timeline of System Events and Endpoint A, B, and C Events, marking two System Idle Windows]

Improved Idle Windows 777 970

 OBFF informs devices about best times to stay idle


 Same work is done, but bigger Idle windows improve
power conservation
[Figure: timeline of System Events and Endpoint A, B, and C Events, marking three System Idle Windows]

LTR could also be used to inform system software of acceptable latency for
the endpoints between accesses, suggesting a limit on this idle time.
OBFF offers a Hint 778 971

 The OBFF information is an optional hint for
improving system power savings.
 Devices can still initiate whenever they like but
overall power consumption will be negatively
affected if they do, so that should be avoided as
much as possible.
 Information is communicated in 1 of 2 ways:
 Toggling the WAKE# pin – this method is much
preferred because it avoids needlessly waking up
a Link and burning power to inform a device about
the system power state, or
 Sending Messages

OBFF Signaling Example 778 972

 WAKE# is preferred, but using a message as an
intermediate step may be necessary, as shown:

Root Complex

WAKE#

Endpoint
Switch Endpoint

OBFF
Message
Endpoint

WAKE# Switch

Endpoint Endpoint

WAKE# Signaling 779 973

Transition Event              OBFF Message Code
Idle to OBFF                  OBFF
Idle to CPU Active            CPU Active
OBFF or CPU Active to Idle    Idle
OBFF to CPU Active            CPU Active
CPU Active to OBFF            OBFF

Notes:
- ECN points out that there is one negative edge for signaling OBFF, and 2 negative
edges for signaling CPU Active
- Min pulse width = 300ns, time between falling edges = 700ns min to 1000ns max
- If pattern is unrecognized, default is CPU Active
WAKE# Rules 780 974

 System is not required to enable an endpoint
to detect whether WAKE# was asserted by
another endpoint.
 Signaling can only be initiated by the RC
when the system is in an operational state
(S0 for an ACPI-compliant system)
 Functions must be in the D0 power state to
respond to it.
 Reserved codes received will be treated as
CPU Active

Possible WAKE# Confusion 780 975

 Since the WAKE# pin may also be used by
some Functions to signal a wakeup event, it’s
possible that other Functions might
misinterpret that as an OBFF change.
 This might cause undesirable power
management, but should be recoverable
 Spec recommends that endpoints go to the
CPU Active state whenever they detect
WAKE# activity as being initiated by the host,
but doesn’t specify how they would know.

OBFF Message Header 781 976

OBFF States 780 977

 From device perspective, the codes mean:


 CPU Active – all transactions OK. This will be a
Function’s initial state.
 OBFF – transfers to and from memory OK
 Idle – wait for higher state before initiating

Message Rules 780 978

 Strongly recommended that s/w only use
OBFF messages if WAKE# is not available.
 Switches are strongly encouraged to
propagate all OBFF indications, but are
allowed to discard or collapse them.
 Downstream ports have two options, called
Variation A and B, for when a message must be
sent but the Link is not in L0 state.
 A: Don’t change the Link state, simply drop the
message
 B: Return the Link to L0 and forward the message

OBFF Support 782 979

Device Capability 2 Register


Bits 31:24  RsvdP
Bits 23:22  Max End-End TLP Prefixes
Bit 21      End-End TLP Prefix Supported
Bit 20      Extended Fmt Field Supported
Bits 19:18  OBFF Support
Bits 17:14  RsvdP
Bits 13:12  TPH Completer Supported
Bit 11      LTR Mechanism Supported
Bit 10      No RO-enabled PR-PR Passing
Bit 9       128-bit CAS Completer Supported
Bit 8       64-bit AtomicOp Completer Supported
Bit 7       32-bit AtomicOp Completer Supported
Bit 6       AtomicOp Routing Supported
Bit 5       ARI Forwarding Supported
Bit 4       Completion Timeout Disable Supported
Bits 3:0    Completion Timeout Ranges Supported

OBFF Support
00 – Not supported
01 – Message only
10 – WAKE# only
11 – Both
OBFF Enable 783 980

Device Control 2 Register


Bit 15      End-End TLP Prefix Blocking
Bits 14:13  OBFF Enable
Bits 12:11  RsvdP
Bit 10      LTR Mechanism Enable
Bit 9       IDO Completion Enable
Bit 8       IDO Request Enable
Bit 7       AtomicOp Egress Blocking
Bit 6       AtomicOp Requester Enable
Bit 5       ARI Forwarding Enable
Bit 4       Completion Timeout Disable
Bits 3:0    Completion Timeout Value

OBFF Enable
00 – Disabled
01 – Enabled with Message signaling Variation A
10 – Enabled with Message signaling Variation B
11 – Enabled using WAKE# signaling
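The OBFF Support and OBFF Enable encodings can be read and written with simple masking, as in this sketch. It assumes the field positions implied by the bit rulers on the two register slides (bits 19:18 of Device Capabilities 2 and bits 14:13 of Device Control 2); the function names are hypothetical.

```python
OBFF_SUPPORT = {0b00: "not supported", 0b01: "message only",
                0b10: "WAKE# only", 0b11: "message and WAKE#"}
OBFF_ENABLE = {0b00: "disabled",
               0b01: "message signaling, Variation A",
               0b10: "message signaling, Variation B",
               0b11: "WAKE# signaling"}


def obff_support(devcap2: int) -> str:
    """Describe the OBFF Support field of a Device Capabilities 2 value."""
    return OBFF_SUPPORT[(devcap2 >> 18) & 0x3]


def obff_enable(devctl2: int) -> str:
    """Describe the OBFF Enable field of a Device Control 2 value."""
    return OBFF_ENABLE[(devctl2 >> 13) & 0x3]


def with_obff_enable(devctl2: int, mode: int) -> int:
    """Return a 16-bit Device Control 2 value with OBFF Enable set to `mode`."""
    cleared = devctl2 & ~(0x3 << 13) & 0xFFFF   # clear bits 14:13 only
    return cleared | ((mode & 0x3) << 13)
```

Software would normally check the support field before choosing an enable encoding, preferring WAKE# signaling where both methods are available.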
981

ASPM Option

3: Power Mgmt: ASPM Option 910 982

 All four encodings of ASPM Support (2 bits) are defined for 3.0
 00b – No ASPM Support (was formerly Reserved)
 01b – L0s Supported
 10b – L1 Supported (was formerly Reserved)
 11b – Both L0s and L1 Supported
 L0s support no longer required
 A new bit indicates that L0s support is optional, and it must be set to one for 3.0-compliant devices

Note: Some fields are not shown to simplify the diagram.

983

4. Software Model:
Atomic Operations

Motivation for AtomicOps 897 984

 Many processors have single commands for
atomic operations (uninterrupted read-modify-write),
such as the XCHG command in x86 processors.
 Old method of locking the bus over several
transactions worked but was slow and limited
traffic. New optional TLPs provide those semantics
in a single PCIe request.
 Provides non-blocking synchronization for:
 multiple producers or consumers using semaphores in
memory
 counters that are atomically incremented by hardware
and atomically read and cleared by software
Locked Example 985

 When locked read goes through a switch, the egress
port blocks other traffic on VC0
 When the locked completion is returned, the
upstream port is blocked for VC0, too
 Requests from other initiators can be stalled

Competition 986

 When multiple atomic accesses target the same
resource, as can happen with co-processing devices,
handling their competition gracefully will improve
performance
 Root becomes Atomic Completer Engine to do this

New Atomic Model 898 987

 AtomicOps integrate multiple steps


 Operations are guaranteed atomic within the
device (execute internally without interruption)
 Path between devices is not locked
 Three new requests and one completion
defined:
 Fetch and Add
 Unconditional Swap
 Compare and Swap

Three Atomic Requests 898 988

 Fetch and Add


 Read target value, add the “add” value to it, write the sum to
the target, return the original target value.
 Useful for atomically updating counters
 Unconditional Swap
 Read the target value, write the “swap” value, return the
original target value
 Useful for atomically reading and clearing counters
 Compare and Swap
 Read target value, compare it to the “compare” value in the
command. If equal, write the “swap” value to the target,
return the original target value
 Useful as a “test and set” operation for managing a
semaphore
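The semantics of the three requests can be modeled in software as below. This is only a behavioral sketch of what a completer does with each operation: the class name, the dictionary used as memory, and the 32-bit default operand width are assumptions, and no TLP formatting is shown.

```python
class AtomicCompleterModel:
    """Toy model of a completer's memory handling the three AtomicOps."""

    def __init__(self):
        self.mem = {}   # address -> value; unwritten locations read as 0

    def fetch_add(self, addr, add_value, width_bits=32):
        original = self.mem.get(addr, 0)
        # Sum wraps at the operand width.
        self.mem[addr] = (original + add_value) & ((1 << width_bits) - 1)
        return original                   # completion carries the old value

    def swap(self, addr, swap_value):
        original = self.mem.get(addr, 0)
        self.mem[addr] = swap_value       # written unconditionally
        return original

    def compare_swap(self, addr, compare_value, swap_value):
        original = self.mem.get(addr, 0)
        if original == compare_value:     # write only when the compare matches
            self.mem[addr] = swap_value
        return original
```

A requester can tell whether its Compare and Swap took effect by checking that the returned value equals the compare value it sent, which is what makes the operation usable as a test-and-set on a semaphore.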

AtomicOp Capabilities 899 989

AtomicOp Error Handling 990

 Egress blocked
 Software can set Atomic Egress Blocking bit in individual
ports of a routing element to block AtomicOps from being
forwarded to devices that don’t recognize them and would
otherwise generate a Fatal Error.
 Send Completion with UR status, but don’t set Unsupported
Request status bit
 For the Port, this defaults to Advisory Non-fatal case. A new
AER status bit was added to make this case visible to
system software (next slide).
 Completer internal uncorrectable error: completion
with CA status
 Requests with type or operand size that isn’t
supported: completion with UR status
 Length doesn’t match an architected value:
Uncorrectable Fatal (Malformed)
AER Uncorrectable Status Register 691 991

 New Uncorrectable Status bit

Uncorrectable Error Status Register


Bits 31:26  RsvdZ
Bit 25      TLP Prefix Blocked Error Status
Bit 24      AtomicOp Egress Blocked Status
Bit 23      MC Blocked TLP Status
Bit 22      Uncorrectable Internal Error Status
Bit 21      ACS Violation Status
Bit 20      Unsupported Request Error Status
Bit 19      ECRC Error Status
Bit 18      Malformed TLP Status
Bit 17      Receiver Overflow Status
Bit 16      Unexpected Completion Status
Bit 15      Completer Abort Status
Bit 14      Completion Timeout Status
Bit 13      Flow Control Protocol Error Status
Bit 12      Poisoned TLP Status
Bits 11:6   RsvdZ
Bit 5       Surprise Down Error Status
Bit 4       Data Link Protocol Error Status
Bits 3:1    RsvdZ
Bit 0       Undefined
992

5. Configuration:
Internal Error Reporting

Motivations 667 993

1. Make internal logic errors visible to software in
an industry-standard way.
 In high-end systems it’s important to be able to
detect and contain errors
 Endpoints have device drivers that can obtain internal
information, but switches usually don’t and are controlled by
the OS instead.
 As a result, switch vendors have developed proprietary and
incompatible error-reporting methods.
2. Allow multiple error headers to be recorded
 Current AER model only saves info on the first
uncorrectable error
3. Detect the occurrence of multiple errors of the
same type
New Errors Reported 994

 Internal error definition will be implementation
specific
 Three new errors reported
 Corrected Internal – masked or worked around by
h/w with no loss of info or operation (e.g.: memory
error corrected by ECC). Optionally, send
ERR_COR.
 Header Log Overflow – Optionally, send
ERR_COR.
 Uncorrectable Internal – needs a reset or h/w
replacement. Optionally, send ERR_FATAL.
 As with other AER status bits, they can be
masked, and severity is programmable
AER Correctable Status Register 689 995

 New Correctable Status bits

Correctable Error Status Register


Bits 31:16  RsvdZ
Bit 15      Header Log Overflow Status
Bit 14      Corrected Internal Error Status
Bit 13      Advisory Non-Fatal Error Status
Bit 12      Replay Timer Timeout Status
Bits 11:9   RsvdZ
Bit 8       Replay Num Rollover Status
Bit 7       Bad DLLP Status
Bit 6       Bad TLP Status
Bits 5:1    RsvdZ
Bit 0       Receiver Error Status

AER Uncorrectable Status Register 691 996

 New Uncorrectable Status bit

Uncorrectable Error Status Register


Bits 31:26  RsvdZ
Bit 25      TLP Prefix Blocked Error Status
Bit 24      AtomicOp Egress Blocked Status
Bit 23      MC Blocked TLP Status
Bit 22      Uncorrectable Internal Error Status
Bit 21      ACS Violation Status
Bit 20      Unsupported Request Error Status
Bit 19      ECRC Error Status
Bit 18      Malformed TLP Status
Bit 17      Receiver Overflow Status
Bit 16      Unexpected Completion Status
Bit 15      Completer Abort Status
Bit 14      Completion Timeout Status
Bit 13      Flow Control Protocol Error Status
Bit 12      Poisoned TLP Status
Bits 11:6   RsvdZ
Bit 5       Surprise Down Error Status
Bit 4       Data Link Protocol Error Status
Bits 3:1    RsvdZ
Bit 0       Undefined
Multiple Header Registers 687 997

Advanced Error Capabilities and Control register


Bits 31:12  RsvdP
Bit 10      Multiple Header Recording Enable
Bit 9       Multiple Header Recording Capable
Bits 4:0    First Error Pointer
(Other fields are not shown.)

Only a finite number of headers can be recorded, so it’s
important that software clear errors as soon as possible.
If too many errors arrive, a Header Log Overflow error is
reported.

Tracking Multiple Error Headers 998

 AER records the header of received TLPs that cause errors.
 The First Error Pointer (FEP) always points to
the uncorrectable error whose header is
visible in the header log.
 Writing a 1 to the corresponding bit in the
Uncorrectable Status clears that instance
and, if multiple header recording is enabled,
that also updates the FEP to point to the next
error and recorded header.
 When the FEP points to an invalid status bit,
there are no more headers to report.
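
The FEP behavior described above can be illustrated with a toy model. This is a sketch under stated assumptions, not a register-accurate implementation: the class name, the `depth` parameter (how many headers the hardware can hold) and the choice of bit 31 as the "invalid" FEP value are all hypothetical.

```python
from collections import deque


class AerHeaderLogModel:
    """Toy model of multiple header recording (illustrative only)."""

    INVALID_FEP = 31  # assumption: a reserved status bit stands for "nothing to report"

    def __init__(self, depth):
        self.depth = depth                # assumed number of headers that can be recorded
        self.records = deque()            # (status_bit, header) in arrival order
        self.header_log_overflow = False  # models the Header Log Overflow status

    def record_error(self, status_bit, header):
        """Record an error header, or flag an overflow if no room is left."""
        if len(self.records) >= self.depth:
            self.header_log_overflow = True
            return
        self.records.append((status_bit, header))

    @property
    def first_error_pointer(self):
        """FEP points at the error whose header is visible in the header log."""
        return self.records[0][0] if self.records else self.INVALID_FEP

    @property
    def header_log(self):
        return self.records[0][1] if self.records else None

    def write_one_to_clear(self, status_bit):
        """Clearing the status bit the FEP points to exposes the next recorded header."""
        if self.records and self.records[0][0] == status_bit:
            self.records.popleft()
```

Software would loop: read the FEP, read the header log, write 1 to the indicated status bit, and stop once the FEP no longer points to a valid status bit.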
999

Resizable BARs

Motivation 911 1000

 Problem arises when requested memory
resources are greater than system addressable
space. Possible results:
 Available system memory is reduced
 Function memory is not allocated
 Function memory is allocated with a sub-optimal size
 Solution: Functions report several possible
usable memory sizes, software selects one
based on system constraints
 Expected that devices requesting large memory
resources will be most likely to use this

Mechanism 912 1001

 Software learns which BAR sizes are available by reading the
new extended capability register created for this purpose
PCIe Enhanced Capability Header (offset 000h)

Bits   Field
31:20  Next Extended Capability Offset
19:16  Version (1h)
15:0   PCIe Extended Capability ID (0015h for Resizable BAR)

Resizable BAR capability structure (one register pair for each supported BAR)

Offset   Register
000h     PCIe Enhanced Capability Header
004h     Resizable BAR Capability Register (0)
008h     Reserved | Resizable BAR Control Register (0)
…
n*8 + 4  Resizable BAR Capability Register (n)
n*8 + 8  Reserved | Resizable BAR Control Register (n)

 Software determines optimal memory size for current platform
conditions and programs the BAR sizes
 Hardware will then report that BAR size when enumeration
software queries the normal configuration header
Rules 911 1002

 To avoid confusion, BAR size should only be
changed when the Memory Enable bit is
cleared in the Command register
 Strongly recommended that Functions not
advertise BARs bigger than they can
effectively use
 For optimal performance, software should
allocate the biggest size that will work for the
system

Capability Register 912 1003

 Bits indicate available BAR sizes (RO)

Resizable BAR Capability Register

Bits   Field
31:24  RsvdP
23:4   Supported BAR sizes, one bit per size
3:0    RsvdP

 Bit 4 – 1MB BAR size will work for this Function
 Bit 5 – 2MB
 Bit 6 – 4MB
 …
 Bit 23 – 512GB BAR size will work

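
The size bits of the Resizable BAR Capability register follow a simple pattern: bit 4 means 1MB and each higher bit doubles the size, up to 512GB at bit 23. A minimal sketch of turning the register value into a list of sizes, and of picking the largest one that fits a platform limit, might look like this (the function names and the `limit_bytes` parameter are illustrative assumptions):

```python
def supported_bar_sizes(cap_reg):
    """Return the BAR sizes in bytes advertised in bits 23:4 of the register value."""
    sizes = []
    for bit in range(4, 24):
        if cap_reg & (1 << bit):
            sizes.append(1 << (bit + 16))  # bit 4 -> 2**20 bytes (1MB)
    return sizes


def largest_size_that_fits(cap_reg, limit_bytes):
    """Pick the biggest advertised size not exceeding limit_bytes, or None."""
    fitting = [size for size in supported_bar_sizes(cap_reg) if size <= limit_bytes]
    return max(fitting) if fitting else None
```

This mirrors the rule that software should allocate the biggest size that will work for the system.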
Control Register 913 1004

 BAR Index indicates which BAR is being
described here (0 to 5). For a 64-bit address,
this indicates the lower dword.
 Number of resizable BARs: from 1 to 6.
This field is only valid in Control register (0) and
is RsvdP for all the others.
 Encoded BAR size:
 0 = 1MB
 1 = 2MB
 2 = 4MB
 …
 19 = 512GB

Resizable BAR Control Register

Bits   Field
31:13  RsvdP
12:8   BAR Size (RW)
7:5    Number of Resizable BARs (RO)
4:3    RsvdP
2:0    BAR Index (RO)
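
The Control register fields and the register-pair offsets described earlier can be sketched as follows. The field positions (BAR Index in bits 2:0, Number of Resizable BARs in bits 7:5, BAR Size in bits 12:8) and the n*8+4 / n*8+8 offsets come from the slides; the function names are illustrative only.

```python
def decode_rbar_control(value):
    """Split a 32-bit Resizable BAR Control register value into its fields."""
    encoded_size = (value >> 8) & 0x1F                # bits 12:8, BAR Size (RW)
    return {
        "bar_index": value & 0x7,                     # bits 2:0 (RO)
        "num_resizable_bars": (value >> 5) & 0x7,     # bits 7:5 (RO), Control register (0) only
        "bar_size_bytes": 1 << (encoded_size + 20),   # 0 = 1MB ... 19 = 512GB
    }


def with_bar_size(value, size_bytes):
    """Return the register value with the BAR Size field set to size_bytes."""
    encoded = size_bytes.bit_length() - 21            # 1MB (2**20) encodes as 0
    if size_bytes & (size_bytes - 1) or not 0 <= encoded <= 19:
        raise ValueError("size must be a power of two from 1MB to 512GB")
    return (value & ~(0x1F << 8)) | (encoded << 8)


def rbar_pair_offsets(n):
    """Offsets of Capability/Control register pair n from the start of the capability."""
    return n * 8 + 4, n * 8 + 8
```

Per the rules above, software would clear the Memory Enable bit before writing the new size and would reprogram the affected BAR afterwards.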
Configuration Header BARs 914 1005

 Actual BARs report the
selected size when
enumeration software
checks them
 If a size value is
changed, the content of
the affected BAR will
be lost and must be
restored
1006

Simplified Ordering Table

Motivation 1007

 This wasn’t an ECR to the 2.0 spec, but the ordering table was
already affected in 2.1 by the addition of ID-based Ordering (IDO)

Simplified Ordering Table 289 1008

 New version reduces the entries and simplifies them by not mentioning
specific requests. Both make it easier for new devices to be compliant.

Row pass Column?            Posted Request    Read Request   NPR with data   Completion
                            (Col 2)           (Col 3)        (Col 4)         (Col 5)
Posted Request (Row A)      a) No  b) Y/N     Yes            Yes             a) Y/N  b) Yes
Non-Posted Requests:
  Read Request (Row B)      a) No  b) Y/N     Y/N            Y/N             Y/N
  NPR with data (Row C)     a) No  b) Y/N     Y/N            Y/N             Y/N
Completion (Row D)          a) No  b) Y/N     Yes            Yes             a) Y/N  b) No

NPR with data = Non-Posted Request, such as configuration write or I/O write
Appendix B:
Details of Spec 3.1 Changes

Summary of Spec 3.1 Changes 1010

1. Downstream Port Containment (DPC), later
updated with Enhanced DPC (eDPC)
2. L1 Substates
3. Lightweight Notification
4. 8.0 GT/s Receiver Impedance
5. Process Address Space ID (PASID), later
updated with Address Translation
6. End-End TLP Prefix Changes for RCs
7. Precision Time Measurement (PTM)
8. Protocol Multiplexing (PMUX): [2.x ECN,
Appendix G in 3.0 spec]
1. Downstream Port Containment (DPC)

DPC Basics 1012

 Main feature: Automatically disables the Link below a
Downstream Port when triggered by an uncorrectable
error.
 Prevents possible spread of data corruption; subsequent
TLPs are blocked in both directions for that Port
 System is notified of the event and error recovery is possible
if supported by software
 New event, “Async Removal”, occurs when a device
goes offline without notification to the OS. This is
going to happen when DPC is triggered.
 DPC support is optional.

Help for Handling DPC 1013

 New Completion Timeout prefix/header log
 Requesters may log their Request header and
prefixes when timeout detected.
 Timeouts may result from improper configuration,
system failure, or Async Removal. To distinguish
whether normal operation is still possible after this
error, Requesters are strongly encouraged to log
their requests.
 Timeouts may also now be caused when DPC is
triggered.

Completion Timeout Header Capability 1014

AER: Capability and Control Register

DPC Triggering 1015

 New capability registers are defined for DPC
 DPC is triggered when enabled and the Port detects an
uncorrectable error or receives an error message (see
DPC Control register)
 When triggered by ERR_FATAL or ERR_NONFATAL, the
message is discarded but the source ID is recorded
 DPC can be reported with an interrupt or ERR_COR
message
 No error message is sent to report the uncorrectable error
that caused the trigger, even if it was otherwise enabled.

DPC Triggering 1016

 When triggered, Port immediately:
 Sets DPC Status register Trigger Status bit and
Trigger Reason field
 Disables the Link (using LTSSM) until Status bit
is cleared.
 Note: To ensure the Link has time to go to the Disabled
state, software must wait until the Link Layer Active bit
in the Link Status register has been cleared to 0b
(verifying that the Link is down) before clearing this bit.
 Once that requirement is met, the status can be cleared
regardless of other status bit states associated with the
DPC event.

DPC Extended Capability Registers 1017

Extended Capability ID: 001Dh

DPC Error Source ID: Source ID of sending agent when
ERR_FATAL or ERR_NONFATAL is received.

DPC Control Register 1018

Bit Location  Description  Attr
1:0 DPC Trigger Enable RW
00b – DPC disabled
01b – DPC enabled and triggered with detection of uncorrectable
error or receipt of ERR_FATAL
10b – DPC enabled and triggered with detection of uncorrectable
error or receipt of ERR_FATAL or ERR_NONFATAL
11b – Reserved
2 DPC Completion Control – completion status returned RW
0b – CA status
1b – UR status
3 DPC Interrupt Enable – send interrupt when DPC triggered RW
4 DPC ERR_COR Enable – send ERR_COR when DPC triggered RW
5 Poisoned TLP Egress Blocking Enable – if the corresponding RW/RO
support bit is set, this bit is valid and enables software to block
poisoned TLPs from being sent. If not supported, hardwired to 0b

DPC Control Register - 2 1019

Bit Location  Description  Attr
6 DPC Software Trigger – if DPC is enabled, and software RW/RO
triggering is supported, and DPC Status is cleared, setting this bit
triggers a DPC. Interestingly, it’s permissible to set this bit and
also enable DPC with the same write. If this feature isn’t
supported, this bit is hardwired to zero.
7 DL_Active ERR_COR Enable – when s/w releases a RW/RO
downstream Port from DPC, the Link will normally retrain. The
system can be notified that the Link Layer is again ready either
with a Link Layer State Changed interrupt or with an ERR_COR
Message, or both. If support for the latter is indicated in the DPC
Capability register, this bit will enable it. If this feature isn’t
supported, this bit is hardwired to zero.
15:8 RsvdP – Reserved and preserved RsvdP
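
As a rough illustration of the DPC Control register fields listed above, the following sketch decodes the low byte of a register value. Bit positions are taken from the two register tables; the function name and dictionary keys are illustrative assumptions.

```python
# Meaning of the 2-bit DPC Trigger Enable field (bits 1:0).
TRIGGER_ENABLE = {
    0b00: "disabled",
    0b01: "uncorrectable error or ERR_FATAL",
    0b10: "uncorrectable error, ERR_FATAL or ERR_NONFATAL",
    0b11: "reserved",
}


def decode_dpc_control(value):
    """Decode bits 7:0 of the DPC Control register into named fields."""
    return {
        "trigger_enable": TRIGGER_ENABLE[value & 0x3],
        # DPC Completion Control: 0b = CA status, 1b = UR status
        "completion_status": "UR" if value & (1 << 2) else "CA",
        "interrupt_enable": bool(value & (1 << 3)),
        "err_cor_enable": bool(value & (1 << 4)),
        "poisoned_tlp_egress_blocking_enable": bool(value & (1 << 5)),
        "software_trigger": bool(value & (1 << 6)),
        "dl_active_err_cor_enable": bool(value & (1 << 7)),
    }
```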

Root Port (RP) Extensions 1020

 Root Ports may include an extra capability to
detect and report a new type of error called
Root Port Programmed I/O (RP PIO) errors.
 Whether this is supported is reported in DPC
Capability register bit 5.
 RP PIO error reporting is very similar to AER
and involves extra registers that are
described later.

When DPC Triggered by Upstream TLP 1021

 Transaction Layer DPC Behavior:
 No more upstream TLPs are accepted from the
Link layer
 Any TLPs subsequent to the error that were
already accepted from the Link Layer must be
silently discarded

When DPC Triggered by Downstream TLP 1022

 Previous TLPs:
 Posted requests or completions: sent or silently dropped.
 For NP request, if RP extensions supported and DPC
triggered and tracking that request, RP can create a
Completion, with status based on Completion Control bit:
If set, UR status, If cleared, CA status.
Which requests are tracked is design specific, but spec
strongly recommends that those generated by host CPU
instructions, or by Atomic Ops, should be tracked.
 Otherwise, NP requests may receive Completion Timeouts,
and software must account for this.
 Subsequent TLPs:
 Posted requests or completions must be silently discarded
 NP requests will generate a Completion using Completer
ID of the Downstream Port, with status based on DPC
Completion Control bit:
If set, UR status, If cleared, CA status
Root Port Function-Level Containment (FLC) 1023

RC may have proprietary Function-Level Containment (FLC) mechanism,
enabling it to contain traffic from a Function when a Non-Fatal error is
detected. If so:
1. Switch Port receiving ERR_FATAL message, or detecting an
uncorrectable error, should trigger DPC itself.
2. However, ERR_NONFATAL messages would be passed through to
the Root Port, and can trigger FLC or DPC there.

Figure: (1) ERR_FATAL triggers DPC at the Switch Port; (2) ERR_NONFATAL
is passed through and triggers DPC at the Root Port.

Root-Specific Handling 1024

If the Root Port doesn’t support FLC, the Switch Port
should trigger DPC for all uncorrectable error
messages (ERR_NONFATAL or ERR_FATAL).

Reporting DPC 1025

 Two required DPC reporting mechanisms:
1. ERR_COR message
 Required even if AER is not supported
 Send message whenever DPC Trigger Status changes
to 1b if both DPC ERR_COR bit and Correctable Error
Reporting bit in Device Control register are set.
 Sending DPC ERR_COR doesn’t set Corrected Error
Detected bit in Device Status register because this
event isn’t considered an error.
 Both mechanisms can be enabled at the same time. If
so, the ERR_COR message must go first before the
MSI/MSI-X message. For INTx there’s no such ordering
constraint because they may not stay in order anyway.

Reporting DPC to Software 1026

 Two required DPC reporting mechanisms:
2. Interrupt
 Spec recommends that interrupt signaling be used,
since DPC ERR_COR is primarily intended to allow
platform firmware to do its own event logs or do
“firmware first” services.
 For INTx, the virtual wire is asserted as long as
Command Register Interrupt Disable is 0b and both
DPC Interrupt Enable and DPC Interrupt Status are 1b.
 For MSI/MSI-X, the message is sent whenever the
associated vector is unmasked and both DPC Interrupt
Enable and DPC Interrupt Status are 1b.
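
The enable conditions for the two reporting mechanisms can be written as simple predicates. This is only a sketch of the conditions as stated on these slides; the function and parameter names are illustrative, and the register reads that would supply the inputs are left out.

```python
def sends_dpc_err_cor(trigger_status_became_set, dpc_err_cor_enable,
                      correctable_error_reporting_enable):
    """ERR_COR is sent when Trigger Status changes to 1b and both enables are set."""
    return (trigger_status_became_set
            and dpc_err_cor_enable
            and correctable_error_reporting_enable)


def intx_asserted(interrupt_disable, dpc_interrupt_enable, dpc_interrupt_status):
    """INTx virtual wire stays asserted while Interrupt Disable is 0b and both DPC bits are 1b."""
    return (not interrupt_disable) and dpc_interrupt_enable and dpc_interrupt_status


def msi_condition_met(vector_masked, dpc_interrupt_enable, dpc_interrupt_status):
    """MSI/MSI-X condition: vector unmasked and both DPC interrupt bits are 1b."""
    return (not vector_masked) and dpc_interrupt_enable and dpc_interrupt_status
```

When both mechanisms are enabled, the ERR_COR message must be sent before the MSI/MSI-X message.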

DPC Role in Error Reporting 1027

DPC prevents
forwarding error
messages.

Resolving Whether DPC was Triggered 1028

 A completion returned with error status could
mean that a device has been removed, or
that a DPC was triggered.
 Many platforms return a data value of all 1’s when
an error is seen on a config, I/O, or memory read
request.
 Spec recommends finishing a series of MMIO or
Configuration requests with a read to an address
whose data is known to be other than all 1’s to test
this.
 If RP Extensions are supported, the RP response
is selectable to help resolve this (see RP PIO
registers later).
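
The recommended "finish with a known read" check might be sketched like this. `read32` is a hypothetical callback that performs a 32-bit MMIO or configuration read, and `sentinel_address` is assumed to be a location whose contents are known never to be all 1's; neither name comes from the spec.

```python
ALL_ONES = 0xFFFFFFFF


def read_series_checked(read32, addresses, sentinel_address):
    """Read a series of addresses, then read a sentinel to test for containment.

    Returns (values, suspect), where suspect is True if the sentinel read came
    back as all 1's, meaning the earlier data may not be trustworthy.
    """
    values = [read32(address) for address in addresses]
    suspect = read32(sentinel_address) == ALL_ONES
    return values, suspect
```

A suspect result only says that something went wrong; software still has to determine whether the device was removed or DPC was triggered.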
Software Handling of DPC 1029

 If software doesn’t have an uncorrectable error
strategy, there’s no benefit to using DPC. But if it
does, should a Port that saw DPC use UR or CA
status for the completions it creates for NP requests?
 If the strategy detects containment by looking for all 1’s on
PIO reads, then a UR status may be best, since many Root
Ports will synthesize a value of all 1’s in response to UR
(e.g.: for enumeration compatibility). The Root Port would
need to generate all 1’s for Config, Memory, and maybe
even I/O space.
 If the strategy handles UR and CA differently for PIO reads,
then a CA response might be better, because that usually
means a programming model violation, which might trigger
Root FLC and error handling.

Software Handling (contd.) 1030

 If DPC was triggered in a Root Port that supports RP
extensions, the Port may need some time to quiesce
and clean up activities.
 DPC RP Busy bit indicates that the Port is not ready and
software must not clear the DPC Status bit yet.
 Spec says that this should normally take only a few µs, but
that internal errors in big systems might take several
seconds to resolve. Software may need to schedule polling
of this bit.
 Later, when the RP Busy bit is off and software clears the
DPC Status bit, releasing DPC, the Link will normally attempt
to retrain. Software can tell when training is done by looking
for a Link Layer State Changed interrupt, or an ERR_COR
used to indicate DL_Active.
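
Putting the release rules together (wait for the Link to go down, wait for RP Busy to clear, then clear the DPC Trigger Status bit), a polling loop could look roughly like the sketch below. The three callbacks are hypothetical stand-ins for register accesses, and the polling interval and timeout are arbitrary assumptions rather than values from the spec.

```python
import time


def release_dpc(link_active, rp_busy, clear_trigger_status,
                poll_interval_s=0.001, timeout_s=5.0,
                sleep=time.sleep, now=time.monotonic):
    """Clear DPC Trigger Status once it is safe to do so.

    link_active():          True while the Link Layer Active bit is still set
    rp_busy():              True while DPC RP Busy is set (always False without RP Extensions)
    clear_trigger_status(): writes 1 to the DPC Trigger Status bit
    Returns True if DPC was released, False if the timeout expired first.
    """
    deadline = now() + timeout_s
    while link_active() or rp_busy():
        if now() >= deadline:
            return False
        sleep(poll_interval_s)
    clear_trigger_status()
    return True
```

After a successful release, software would wait for the Link Layer State Changed interrupt or the DL_Active ERR_COR before using the Link again.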

Software Handling (contd.) 1031

 To avoid possible conflicts over which
software controls DPC, it’s recommended that
both firmware and OS link the control of DPC
to the control of AER, since they’re closely
related.

Detecting Device Removal 1032

 New Out-of-Band Presence Detect is added
to the list of hot-plug elements; it indicates the
physical presence of a card in a slot. No
details are given except that it doesn’t use the
Physical Layer.
 Spec says this can be used in form factors
that don’t have an MRL sensor so they’ll still
have a way to switch the signal group.
 Presumably, this is also another way to tell
whether a device has quit responding
because a DPC was triggered or because it
was removed.
Avoiding Confusion with DPC 1033

 Triggering DPC does not notify the OS
 It’s an Async Removal event
 This doesn’t fit the Hot-Plug model, in which the
operator requests a change and waits for software
confirmation before proceeding.
 Consequently, side effects may result and
software needs to comprehend them to avoid
confusion

Possible Async Removal Side Effects 1034

1. Physical or Link Layers may generate Correctable
Errors.
2. Requesters may experience Completion Timeouts
for requests that were accepted but will never be
serviced.
3. Link Layer may transition from DL_Up to DL_Down,
generating a Surprise_Down error.
4. Surprise_Down error may trigger DPC, disabling the
Link until software handles the problem and clears
the status bit.

New Feature for Root Ports 1035

 Root Port Programmed IO (RP PIO) Errors
 New config registers appended to DPC registers
enable control of what should happen when NP
requests tracked by the Root Port encounter
uncorrectable or Advisory Non-Fatal errors.
 Example: UR completion error for Memory Read
could trigger DPC, while the same error for Config
requests could be used to return all 1s instead, to
support normal enumeration.

DPC Capability Register 1036

Bit Location  Description  Attr
4:0 DPC Interrupt Message Number – the MSI vector offset from RO
the base data value to use or the MSI-X entry to use (must be
one of the first 32 entries)
5 RP Extensions for DPC – Root Port (RP) supports a defined RO
set of DPC extensions and more registers are appended to the
DPC registers. Switch Ports cannot set this bit.
6 Poisoned TLP Egress Blocking Supported – Root or Switch RO
Downstream Port supports blocking poisoned TLPs.
RP that supports RP Extensions must set this bit.
7 DPC Software Triggering Supported –Root or Switch RO
Downstream Port supports software-triggered DPC.
RP that supports RP Extensions must set this bit.
11:8 RP PIO Log Size – Dwords allocated for the group of RP PIO RO
log registers. If RP Extensions are used, value must be 4 or
greater. Otherwise, the value must be 0.
12 DL_Active ERR_COR Signaling Supported – RO
RP that supports RP Extensions must set this bit.
15:13 Reserved and preserved RsvdP
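
The capability fields above, together with the "must set" rules for Root Ports that support RP Extensions, can be sketched as a decoder plus a consistency check. Bit positions come from the table; the function names and the wording of the returned messages are illustrative.

```python
def decode_dpc_capability(value):
    """Decode the 16-bit DPC Capability register value into named fields."""
    return {
        "interrupt_message_number": value & 0x1F,            # bits 4:0
        "rp_extensions": bool(value & (1 << 5)),
        "poisoned_tlp_egress_blocking": bool(value & (1 << 6)),
        "software_triggering": bool(value & (1 << 7)),
        "rp_pio_log_size": (value >> 8) & 0xF,               # bits 11:8, in Dwords
        "dl_active_err_cor_signaling": bool(value & (1 << 12)),
    }


def rp_extension_rule_violations(caps):
    """Return a list of rule violations for a decoded capability value."""
    problems = []
    if caps["rp_extensions"]:
        for field in ("poisoned_tlp_egress_blocking", "software_triggering",
                      "dl_active_err_cor_signaling"):
            if not caps[field]:
                problems.append(field + " must be set with RP Extensions")
        if caps["rp_pio_log_size"] < 4:
            problems.append("RP PIO Log Size must be 4 or greater")
    elif caps["rp_pio_log_size"] != 0:
        problems.append("RP PIO Log Size must be 0 without RP Extensions")
    return problems
```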
Reporting Completion Errors 1037

 RP PIO errors are similar to those reported by
AER but apply to different cases and are
managed by different controls.
 UR or CA error in AER results from the Port acting
as a Completer and getting an error.
 UR or CA error in RP PIO results from a Port
acting as a Requester and getting an error.
 CTO (Completion Timeout) error results from a
Root Port acting as a Requester, and is recorded
in both RP PIO and AER. Since it could be
reported in both places at the same time, spec
recommends that one of them be masked to avoid
potential conflict.
RP PIO Registers 1038

RP PIO Log
Registers

RP PIO Status Register 1039

Bit Location  Description  Attr  Default
0 Cfg UR Cpl – Cfg Request received UR completion RW1CS 0b
1 Cfg CA Cpl – Cfg Request received CA completion RW1CS 0b
2 Cfg CTO Cpl – Cfg Request received completion timeout RW1CS 0b
8 I/O UR Cpl – I/O Request received UR completion RW1CS 0b
9 I/O CA Cpl – I/O Request received CA completion RW1CS 0b
10 I/O CTO Cpl – I/O Request received completion timeout RW1CS 0b
16 Mem UR Cpl – Mem Request received UR completion RW1CS 0b
17 Mem CA Cpl – Mem Request received CA completion RW1CS 0b
18 Mem CTO Cpl – Mem Req received completion timeout RW1CS 0b
31 Permanently reserved – default RP PIO First Error Pointer RsvdZ 0b
points to this value when nothing to report

RP PIO Mask Register 1040

Bit Location  Description  Attr  Default
0 Cfg UR Cpl – Cfg Request received UR completion RWS 1b
1 Cfg CA Cpl – Cfg Request received CA completion RWS 1b
2 Cfg CTO Cpl – Cfg Request received completion timeout RWS 1b
8 I/O UR Cpl – I/O Request received UR completion RWS 1b
9 I/O CA Cpl – I/O Request received CA completion RWS 1b
10 I/O CTO Cpl – I/O Request received completion timeout RWS 1b
16 Mem UR Cpl – Mem Request received UR completion RWS 1b
17 Mem CA Cpl – Mem Request received CA completion RWS 1b
18 Mem CTO Cpl – Mem Req received completion timeout RWS 1b

If an RP PIO error occurs while the event is masked, the
corresponding Status bit is set but doesn’t trigger a DPC and
isn’t recorded in the Log registers.

RP PIO Severity Register 1041

Bit Location  Description  Attr  Default
0 Cfg UR Cpl – Cfg Request received UR completion RWS 0b
1 Cfg CA Cpl – Cfg Request received CA completion RWS 0b
2 Cfg CTO Cpl – Cfg Request received completion timeout RWS 0b
8 I/O UR Cpl – I/O Request received UR completion RWS 0b
9 I/O CA Cpl – I/O Request received CA completion RWS 0b
10 I/O CTO Cpl – I/O Request received completion timeout RWS 0b
16 Mem UR Cpl – Mem Request received UR completion RWS 0b
17 Mem CA Cpl – Mem Request received CA completion RWS 0b
18 Mem CTO Cpl – Mem Req received completion timeout RWS 0b

If the associated severity bit is set, an error will be handled as an
uncorrectable error and will log the error, trigger DPC (if DPC is
enabled), and signal an interrupt or ERR_COR or both.

If the bit is not set, it will be handled as an Advisory Non-Fatal
error instead and will log the error and signal an ERR_COR
(won’t trigger DPC).
RP PIO SysError Register 1042

Bit Location  Description  Attr  Default
0 Cfg UR Cpl – Cfg Request received UR completion RWS 0b
1 Cfg CA Cpl – Cfg Request received CA completion RWS 0b
2 Cfg CTO Cpl – Cfg Request received completion timeout RWS 0b
8 I/O UR Cpl – I/O Request received UR completion RWS 0b
9 I/O CA Cpl – I/O Request received CA completion RWS 0b
10 I/O CTO Cpl – I/O Request received completion timeout RWS 0b
16 Mem UR Cpl – Mem Request received UR completion RWS 0b
17 Mem CA Cpl – Mem Request received CA completion RWS 0b
18 Mem CTO Cpl – Mem Req received completion timeout RWS 0b

If the associated bit for an error is set when the unmasked RP
PIO error occurs, a System Error is generated.

RP PIO Exception Register 1043

Bit Location  Description  Attr  Default
0 Cfg UR Cpl – Cfg Request received UR completion RWS 0b
1 Cfg CA Cpl – Cfg Request received CA completion RWS 0b
2 Cfg CTO Cpl – Cfg Request received completion timeout RWS 0b
8 I/O UR Cpl – I/O Request received UR completion RWS 0b
9 I/O CA Cpl – I/O Request received CA completion RWS 0b
10 I/O CTO Cpl – I/O Request received completion timeout RWS 0b
16 Mem UR Cpl – Mem Request received UR completion RWS 0b
17 Mem CA Cpl – Mem Request received CA completion RWS 0b
18 Mem CTO Cpl – Mem Req received completion timeout RWS 0b

If the associated bit for an error is set when the RP PIO error
occurs, a synchronous processor exception is generated,
regardless of whether the error was masked or not.

Exceptions will be CPU and platform specific. Spec mentions
having the Root deliver poisoned data, which would help on
reads, or a “hard fail” notification for the request, which would
help for both reads and writes.
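
The interaction of the RP PIO Mask, Severity, SysError and Exception registers can be summarized in a small decision function. This is a sketch of the behavior as described on these slides only; the error names, the action strings and the function itself are illustrative, not spec terminology.

```python
# Bit position of each RP PIO error in the Status/Mask/Severity/SysError/Exception registers.
RP_PIO_BITS = {
    "cfg_ur": 0, "cfg_ca": 1, "cfg_cto": 2,
    "io_ur": 8, "io_ca": 9, "io_cto": 10,
    "mem_ur": 16, "mem_ca": 17, "mem_cto": 18,
}


def rp_pio_actions(error, mask, severity, syserror, exception, dpc_enabled):
    """Return the list of actions taken for one RP PIO error."""
    bit = 1 << RP_PIO_BITS[error]
    actions = ["set_status"]                        # Status bit is set even when masked
    if exception & bit:
        actions.append("processor_exception")       # generated regardless of the mask
    if mask & bit:
        return actions                               # masked: no DPC, nothing logged
    actions.append("log_error")
    if severity & bit:                               # handled as uncorrectable
        if dpc_enabled:
            actions.append("trigger_dpc")
        actions.append("signal_interrupt_or_err_cor")
    else:                                            # handled as Advisory Non-Fatal
        actions.append("signal_err_cor")
    if syserror & bit:
        actions.append("system_error")
    return actions
```

Since the Mask register defaults to all 1b, none of the unmasked actions occur until software clears the relevant mask bits.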
RP PIO Log Registers 1044

 DPC Capability register field “RP PIO Log Size”
reports the combined size of all 3 Log registers
 Header Log – 4 DW, formatted like AER header log
 ImpSpec Log – If RP Extensions supported, and log
size is 5 or greater, then space is allocated for this
Implementation-Specific data (e.g.: source of
Request TLP). If not implemented, it should read 0s.
 TLP Prefix Log – contains any End-End TLP Prefixes
from the TLP corresponding to the RP PIO error. If
the RP supports End-End prefixes, this register must
exist and must be large enough to hold the max
number of prefixes for any TLP. Formatted like the
AER prefix log.
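
A rough sketch of how the RP PIO Log Size value might be split across the three logs follows. It assumes the ImpSpec Log occupies one Dword and that any Dwords beyond the first five belong to the TLP Prefix Log; the slides state only the 4-Dword Header Log and the "5 or greater" condition, so treat the rest as an assumption.

```python
def rp_pio_log_layout(log_size_dwords):
    """Split RP PIO Log Size (in Dwords) into header, ImpSpec and prefix log sizes."""
    if log_size_dwords == 0:
        return {"header": 0, "impspec": 0, "tlp_prefix": 0}   # RP Extensions not used
    if log_size_dwords < 4:
        raise ValueError("RP PIO Log Size must be 0, or 4 or greater")
    header = 4                                    # formatted like the AER header log
    impspec = 1 if log_size_dwords >= 5 else 0    # assumed to be a single Dword
    tlp_prefix = log_size_dwords - header - impspec
    return {"header": header, "impspec": impspec, "tlp_prefix": tlp_prefix}
```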
Handling Poisoned TLPs 1045

 If the Port supports Poisoned TLP Blocking, the
Blocking Enable bit is set, and a poisoned TLP
presents itself to go out, this is a new uncorrectable,
non-fatal error called Poisoned TLP Egress Blocked.
 In this event, the Port:
 Must not send the TLP
 If the Poisoned TLP Egress Blocked error is unmasked and
DPC is enabled, DPC is triggered
 For a poisoned NP request, if DPC is not triggered, a
completion must still be sent as if it had been.
 This error is not detected by an intermediate
Receiver and won’t cause an Advisory Non-fatal error
there.

Blocking Poisoned TLP Status 1046

Avoiding Power Surges 1047

 Added a new config bit to allow software to
limit power surge current by staggering
transition of Endpoints to higher power state.
 Set Slot Power Limit message is sent by
Downstream Port when:
 Slot Capabilities register is written, or
 Link layer status changes to DL_Up and the new
Auto Slot Power Limit Disable bit in the Slot
Configuration registers is cleared. If the bit is set,
no message will be sent.
 As a result, power message won’t be sent until
software triggers it.
Auto Slot Power Message Bit 1048

2: L1 Substates

Introduction 1050

 Reducing active Link power was accomplished with
LTR and OBFF
 Reducing standby power is now accomplished by
adding optional low-power substates for L1.
 Example: laptop battery can be drained even with Link in L1
because Electrical Idle detector and common-mode voltage
driver continue to draw power (up to 25mW/Lane).
 Substates apply to ASPM L1 and software PM L1
 Optional sideband CLKREQ# signal becomes
bidirectional and is used to turn off the RefClk and
indicate transition to substates.
 Currently only PCIe Mini Card form factor implements this
pin. Form factors that don’t have CLKREQ# can use an in-
band version of it that will be defined in a future ECN.
State Diagram 1051

State diagram: L1.0 connects to L1.1 (“Snooze”) and to L1.2 (“Off”).

 L1.0: Conventional state – up and down Ports must detect
electrical idle exit (EIE), and maintain common mode voltage
(CMV) [Power/Lane = approx. 20 mW]
CLKREQ# signals entry to and exit from next states
 L1.1: CMV kept, don’t detect EIE [Power/Lane = 500 µW]
 L1.2: CMV not kept, don’t detect EIE [Power/Lane = 10 µW]
L1 Substate Entry 1052

 From L1.0, deassertion of CLKREQ# will
cause transition to one of the two substates,
but which one?
 L1.1 if:
 PCI-PM L1.1 is enabled and PCI-PM L1.2 is not, or
 ASPM L1.1 and ASPM L1.2 are both enabled but LTR
conditions for L1.2 are not met
 L1.2 if:
 PCI-PM L1.2 is enabled
 ASPM L1.2 is enabled and LTR conditions are met
 LTR conditions for L1.2 :
 Reported snooped LTR and non-snooped LTR are
both >= LTR_L1.2_THRESHOLD value
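
The selection rules above can be sketched as a function. The parameter names and the use of nanoseconds for the LTR values are assumptions; the ASPM case with only L1.1 enabled is not listed on the slide and is treated here as entering L1.1.

```python
def select_l1_substate(entry, l1_1_enabled, l1_2_enabled,
                       snoop_ltr_ns=0, no_snoop_ltr_ns=0, l1_2_threshold_ns=0):
    """Choose the substate entered from L1.0 when CLKREQ# is deasserted.

    entry is "pci-pm" or "aspm"; the enable flags are the ones for that entry type.
    """
    if entry == "pci-pm":
        if l1_2_enabled:
            return "L1.2"
        return "L1.1" if l1_1_enabled else "L1.0"
    if entry != "aspm":
        raise ValueError("entry must be 'pci-pm' or 'aspm'")
    # LTR conditions for L1.2: both reported values are at or above the threshold.
    ltr_ok = (snoop_ltr_ns >= l1_2_threshold_ns
              and no_snoop_ltr_ns >= l1_2_threshold_ns)
    if l1_2_enabled and ltr_ok:
        return "L1.2"
    return "L1.1" if l1_1_enabled else "L1.0"
```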
L1.x Entry Rules 1053

 Both up- and downstream Ports must monitor
CLKREQ#
 Upstream Port must not de-assert CLKREQ#
until the Link enters L1.0
 Either Port can prevent entry by keeping
CLKREQ# asserted
 Downstream Port intending to block entry to
L1.2 must assert CLKREQ# before the Link
enters L1

L1.1 Requirements 1054

 Both Ports are allowed to deactivate
mechanisms for EIE detection and RefClk
activity detection, but both must maintain
CMV.
 To initiate exit from L1.1, either Port asserts
CLKREQ#
 Next state is L1.0
 RefClk will be turned back on, but may be delayed
by the LTR advertised by the Upstream Port.

L1.2 Requirements 1055

 All Link and Phy state information must be
maintained, or restored when leaving L1.2
 Three substates are defined:

L1.2 Substates: L1.0 → L1.2.Entry → L1.2.Idle → L1.2.Exit → L1.0
L1.2.Entry 1056

 Transitional state to allow time for RefClk to
turn off and ensure both Ports have seen
CLKREQ# off.
 Both Ports:
 Must maintain CMV
 May turn off EIE detection logic
 Must not assert CLKREQ#
 RefClk must be turned off within 100ns
 If CLKREQ# asserted, next state is L1.0,
otherwise, next state is L1.2.Idle after no
more than 2µs.
L1.2.Idle 1057

 Both Ports:
 May power off active logic and cease to maintain
CMV
 May have PHY power removed
 Must monitor the state of CLKREQ#
 After 4 µs in L1.2, may exit by asserting
CLKREQ#
 Downstream: assert until Link exits Recovery
 Upstream: assert on entry to Recovery and keep active
until next state that allows de-asserting it, like L1

L1.2.Exit 1058

 Transitional state to allow time for both Ports
to power up.
 Both Ports:
 PHYs must be powered
 Must not assume that CMV was maintained

Exiting Out of L1.2 1059

 Both Ports:
 Must power up any circuits required for L1.0, including those
needed to maintain CMV
 Must not change the state of CLKREQ#
 Refclk must be turned on after TPOWER_ON (L1 PM Substates
Control 2 register), and before the time advertised by LTR.
Goal: ensure that we’re never actively driving into an
unpowered component.
 Next state is L1.0
 Common mode can be established passively during L1.0
and actively during Recovery. To ensure it has been
established, Downstream Port must wait for TCOMMONMODE
(L1 PM Substates Control 1 register) after it has begun
sending and receiving TS1s before sending TS2s.

CLKREQ# Example 1 1060

 CLKREQ# is a shared bidirectional, open-drain, active-low
signal (with a pull-up) that’s driven low when the reference clock
is desired. The Port that tri-states the signal last or asserts it first
controls the timing of CLKREQ#.

Figure: Ports A and B, each with a PLL, the shared CLKREQ# signal, and a
Clock Generator supplying RefClk A and RefClk B.

Timing of L1.2 State 1061

Diagram courtesy PCI Express specification

CLKREQ# Example 2 1062

 Multiple downstream Ports share one PLL, so
RefClk A is only disabled if neither is using it

Clk Clk RefClk A


Req Req PLL
CLKREQA#

A Clock
CLKREQB#
B Generator
CLKREQC# C

PLL PLL RefClk C

Moki Anji (moki@ synopsys.com) RefClk B


Do Not Distribute MindShare.com © 2013
L1 Substates Extended Registers 1063

[Figure: L1 PM Substates Extended Capability structure; Extended Capability ID = 001Eh]
L1 Substates Capability Register 1064

Bit Location | Description | Attr
0 | PCI-PM L1.2 Supported – supported if this bit is set | HwInit
1 | PCI-PM L1.1 Supported – supported if this bit is set | HwInit
2 | ASPM L1.2 Supported – supported if this bit is set | HwInit
3 | ASPM L1.1 Supported – supported if this bit is set | HwInit
4 | L1 PM Substates Supported – supported if this bit is set | HwInit
7:5 | Reserved | RsvdP
15:8 | Port Common Mode Restore Time – Time (in µs) needed for this Port to re-establish common mode. Required if PCI-PM L1.2 or ASPM L1.2 is supported, otherwise reserved. | HwInit/RsvdP

NOTE: Spec warns that deeper power savings can result in longer recovery
latencies that may potentially cause unintentional conflicts like LTSSM timeouts.
L1 Substates Capability Register – part 2 1065

Bit Location | Description | Attr
17:16 | Port T_Power_On Scale – Scale for the Port T_Power_On Value field below. Encoded as: 00b – 2 µs (default); 01b – 10 µs; 10b – 100 µs; 11b – Reserved. Required if PCI-PM L1.2 or ASPM L1.2 is supported, otherwise reserved. | HwInit/RsvdP
18 | Reserved | RsvdP
23:19 | Port T_Power_On Value – Combined with the Scale field above, this defines the time (in µs) that the Port on the opposite side of the Link is required to wait in L1.2 Exit, after sampling CLKREQ# asserted, before it can actively drive the interface. Default value is 00101b. Required if PCI-PM L1.2 or ASPM L1.2 is supported, otherwise reserved. | HwInit/RsvdP
31:24 | Reserved | RsvdP
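
To make the field layout concrete, here is a minimal sketch that decodes a 32-bit L1 Substates Capability register value using the bit positions listed in the two tables above. The names `L1SubstatesCap` and `decode_l1ss_cap` are hypothetical, and the sketch assumes T_Power_On is simply the Value field multiplied by the Scale encoding.

```python
from dataclasses import dataclass
from typing import Optional

# Scale encodings in microseconds; 11b is Reserved and has no entry.
T_POWER_ON_SCALE_US = {0b00: 2, 0b01: 10, 0b10: 100}


@dataclass
class L1SubstatesCap:
    pci_pm_l1_2: bool
    pci_pm_l1_1: bool
    aspm_l1_2: bool
    aspm_l1_1: bool
    l1_pm_substates: bool
    common_mode_restore_us: int
    t_power_on_us: Optional[int]  # None when Scale holds the Reserved encoding


def decode_l1ss_cap(reg: int) -> L1SubstatesCap:
    """Split a raw register value into the fields named in the tables."""
    scale_us = T_POWER_ON_SCALE_US.get((reg >> 16) & 0x3)
    value = (reg >> 19) & 0x1F
    return L1SubstatesCap(
        pci_pm_l1_2=bool(reg & (1 << 0)),
        pci_pm_l1_1=bool(reg & (1 << 1)),
        aspm_l1_2=bool(reg & (1 << 2)),
        aspm_l1_1=bool(reg & (1 << 3)),
        l1_pm_substates=bool(reg & (1 << 4)),
        common_mode_restore_us=(reg >> 8) & 0xFF,
        t_power_on_us=None if scale_us is None else value * scale_us,
    )


# Hypothetical register value: all substates supported, 70 µs restore time,
# Scale = 00b (2 µs) and Value = 00101b (the defaults).
example = decode_l1ss_cap(0x1F | (70 << 8) | (0b00 << 16) | (0b00101 << 19))
print(example.t_power_on_us)
```

With the default Scale and Value encodings the decoded T_Power_On works out to 5 × 2 µs = 10 µs.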
L1 Substates Control 1 Register 1066

Bit Location | Description | Attr
0 | PCI-PM L1.2 Enable – For compatibility with possible future extensions, these enable bits must be hardwired to 0b unless the L1 PM Substates Capability bit is set. Default value is 0b. | RW
1 | PCI-PM L1.1 Enable | RW
2 | ASPM L1.2 Enable | RW
3 | ASPM L1.1 Enable | RW
7:4 | Reserved | RsvdP
15:8 | Common Mode Restore Time – Time (in µs) that must be used by the Downstream Port for timing the re-establishment of common mode. This field can only be changed when both ASPM L1.2 Enable and PCI-PM L1.2 Enable are cleared – otherwise the resulting behavior will be undefined. Required for Downstream Ports if PCI-PM L1.2 or ASPM L1.2 is supported, otherwise reserved. Reserved for Upstream Ports. | RW/RsvdP
L1 Substates Control 1 Register – part 2 1067

Bit Location | Description | Attr
25:16 | LTR L1.2 THRESHOLD Value – Used with the threshold scale below to determine whether entry into L1 results in L1.1 (if enabled) or L1.2 (if enabled). Default value is all 0s. This field must only be changed when the ASPM L1.2 Enable bit is cleared, otherwise undefined Port behavior will result. Required for all Ports that have the ASPM L1.2 Supported bit set; otherwise reserved. | RW/RsvdP
28:26 | Reserved | RsvdP
31:29 | LTR L1.2 THRESHOLD Scale – Provides a scale for the threshold value above. Encoding is the same as the LatencyScale fields in the LTR Messages. Default value is all 0s. Hardware operation is undefined if a Not-Permitted value is written to it. This field must only be changed when the ASPM L1.2 Enable bit is cleared, otherwise undefined Port behavior will result. Required for all Ports that have the ASPM L1.2 Supported bit set; otherwise reserved. | RW/RsvdP
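
The threshold described above can be worked out from the Value and Scale fields. The following sketch assumes the Scale field occupies bits 31:29 (directly above the reserved bits 28:26) and that it uses the LTR LatencyScale encoding, where each step multiplies by 32 (1 ns, 32 ns, 1,024 ns, and so on up to encoding 101b) and 110b/111b are not permitted. The function name is hypothetical.

```python
def ltr_l1_2_threshold_ns(ctl1: int) -> int:
    """Return the LTR L1.2 threshold in nanoseconds from a Control 1 register value."""
    value = (ctl1 >> 16) & 0x3FF   # bits 25:16, THRESHOLD Value
    scale = (ctl1 >> 29) & 0x7     # bits 31:29, THRESHOLD Scale (assumed position)
    if scale > 0b101:
        raise ValueError("Not-Permitted LTR L1.2 THRESHOLD Scale encoding")
    return value << (5 * scale)    # value * 32**scale nanoseconds


# Hypothetical setting: Value = 55, Scale = 010b (1,024 ns units).
print(ltr_l1_2_threshold_ns((55 << 16) | (0b010 << 29)))
```

The comparison against the current LTR value, which selects between L1.1 and L1.2, is left out here because the slide does not spell out its exact form.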
L1 Substates Control 2 Register 1068

Bit Location | Description | Attr
1:0 | T_Power_On Scale – Scale for the T_Power_On Value field below. Encoded as: 00b – 2 µs (default); 01b – 10 µs; 10b – 100 µs; 11b – Reserved. Required for all Ports that support L1.2, otherwise reserved. Can only be changed when enables for both ASPM L1.2 and PCI-PM L1.2 are cleared – otherwise behavior is undefined. | RW/RsvdP
2 | Reserved | RsvdP
7:3 | T_Power_On Value – Combined with the Scale field above, this defines the time (in µs) that the Port must wait in L1.2 Exit, after sampling CLKREQ# asserted, before actively driving the interface. Default value is 00101b. Required for all Ports that support L1.2, otherwise reserved. Can only be changed when both ASPM L1.2 Enable and PCI-PM L1.2 Enable are cleared – otherwise the behavior will be undefined. | RW/RsvdP
31:8 | Reserved | RsvdP
3: Lightweight Notification (LN)

Purpose 1070

 Improve performance by using local caches in Endpoints while avoiding coherency overhead
 Reduces traffic for Endpoints and host memory
 Improves latency
 Lightweight allows host s/w to indicate
changes to cache lines, avoiding
synchronization and flow-control issues.
 Dynamic device association
 Virtual Machine (VM) guest drivers accessing
device strictly via host memory instead of PIO
make guest migration easier
Basics 1071

 Optional protocol for Endpoints to register interest in cache lines in host memory and get notified if those lines are changed.
 LN Reads, LN Completions, and LN Writes
are defined
 Cache Line Sizes (CLSs) of 64 and 128 bytes
are supported
 Endpoints use a new capability register block;
host has new field in the Device Capabilities
2 register to act as LN Completer.

Terms 1072

 Lightweight Notification (LN) – h/w mechanism to notify Endpoints when cache lines of interest are updated
 LN Completer (LNC) – host that receives LN Read/Write
Requests and sends LN Messages when registered cache
lines are updated
 LN Requester (LNR) – Endpoint that sends LN Read/Write
Requests and receives LN Messages
 LN Messages – notification of cache line updates
 LN Read – Memory Read Request with LN header bit set
 LN Write – Memory Write Request with LN header bit set
 LN Completion – completion with LN header bit set

LN Bit in TLP Header 1073

 In LN Read/Write Requests, indicates LNR’s desire to be notified when the cacheline at this address is changed or evicted
 In LN Completions, indicates that LNC
supports registration and that the LN Read
was successful

TLP Rules 1074

 Zero-length LN Read/Write permitted by setting length = 1h and BE’s all 0’s
 LN bit doesn’t apply to I/O, Configuration, or
Message Requests and is reserved for them
 LN Messages are a special type called a SIG-
Defined VDM (Vendor-Defined Message), as
shown on next slide
 LN Messages can be directed to a single
Endpoint (routing bits = 010b) or broadcast
from the Root Port (routing bits = 011b)

LN Messages 1075

 They’re Vendor-Defined Type 1 Messages that use the PCI-SIG Vendor ID = 0001h, Message Code 7Fh
 If Completer doesn’t support them, they’re silently discarded
 Only one Subtype currently defined: 00H – LN Message

[Figure: LN Message layout with a 4 DW Header followed by a 2 DW Data Payload.]
LN Message Header Rules 1076

 Format = 011b (4 DW header, with data)


 Type = MsgD
 Length = 2
 Traffic Class = 0
 No Snoop bit is reserved; the IDO and RO bits are not
 LN bit is reserved in all messages
 Tag field is reserved
 If this is a broadcast message, Destination ID
is reserved

LN Message Payload 1077

 Cacheline address:
 64-bit address of a cacheline that has been modified or
evicted (same format is used for 32-bit address)
 For 128-byte cachelines, bit 6 of address must be clear (LNR
may not check this)
 NR (Notification Reason) – why Message was sent:
00b – Cacheline update (line was changed)
01b – Single cacheline was evicted (LNC will no longer track it)
10b – All cachelines registered to this Function were evicted
(Cacheline Address is reserved in that case).
11b – Reserved
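
A receiver's handling of these payload fields might look like the sketch below. It takes the NR code and cacheline address as already-extracted values, because the exact payload bit positions are not given in the text above. The function name, the return shape, and the optional alignment check are assumptions made for illustration.

```python
NR_MEANINGS = {
    0b00: "cacheline update",
    0b01: "single cacheline evicted",
    0b10: "all cachelines evicted",
}


def describe_ln_message(nr, cacheline_addr, cls_bytes=64, check_alignment=False):
    """Return (reason, address or None) for an LN Message payload."""
    if nr not in NR_MEANINGS:
        raise ValueError("reserved NR encoding")
    reason = NR_MEANINGS[nr]
    if nr == 0b10:
        # Cacheline Address is reserved when every registered line was evicted.
        return reason, None
    # An LNR is allowed to skip this check, so it is off by default.
    if check_alignment and cls_bytes == 128 and cacheline_addr & (1 << 6):
        raise ValueError("bit 6 must be clear for 128-byte cachelines")
    return reason, cacheline_addr


print(describe_ln_message(0b01, 0x1_0000_0040))
```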

Example of LN Read 1078

[Figure: an LN Read is sent to the LNC, which returns an LN Completion.]
LNC Details 1079

 LNC registers requested addresses and the LNRs that want to track them.
 Limited number of addresses can be registered. If
resources are unavailable to handle a request, an
LN Message can evict either an old address or
the new address.
 Limited number of LNRs for the same address
can be tracked. Up to that number, individual
messages are sent, but beyond that number LNC
is allowed to send broadcast messages instead.
The number is allowed to be zero: broadcast
messages always used.

Notification is a One-Shot Event 1080

 If a line is updated or evicted, the registered LNRs are notified, but what can they do?
Without coherency information, the only
choice is to invalidate their local copies.
 Knowing this, when LNC updates or evicts a
cacheline and sends LN Messages to notify
registered requesters, no more LN Messages
for that line will be sent.
 If a new LN Read/Write for that same line is
received, then LN Messages for it are armed
again. Multiple LN requests for the same
address just re-arm the messages.
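
The one-shot behavior can be pictured with a toy model of the LNC's registration table. This is only a sketch: the class `ToyLNC`, its method names, and the use of integer LNR IDs are hypothetical, and resource limits and broadcast fallback are left out.

```python
from collections import defaultdict


class ToyLNC:
    """Tracks which LNRs are armed for notification on each cacheline address."""

    def __init__(self):
        self._armed = defaultdict(set)  # cacheline address -> set of LNR IDs

    def ln_request(self, lnr_id, addr):
        # An LN Read/Write arms notification; repeats for the same line just re-arm.
        self._armed[addr].add(lnr_id)

    def line_changed(self, addr):
        # Notify every armed LNR once, then forget the registration.
        return sorted(self._armed.pop(addr, set()))


lnc = ToyLNC()
lnc.ln_request(1, 0x1000)
lnc.ln_request(1, 0x1000)        # duplicate request: still one registration
lnc.ln_request(2, 0x1000)
first = lnc.line_changed(0x1000)   # both LNRs are notified
second = lnc.line_changed(0x1000)  # nothing armed any more, so no messages
lnc.ln_request(1, 0x1000)          # a new LN request re-arms the line
third = lnc.line_changed(0x1000)
print(first, second, third)
```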
Caution About Broadcast Messages 1081

 Broadcast messages can cause performance issues
 Consume bandwidth on multiple Links
 May go to Endpoints that aren’t LNR capable,
which may handle them as exception cases that
require firmware to resolve. Each Message might
thus take microseconds to service, potentially
causing backpressure on Posted Requests in the
topology and severe performance problems.
 Directed Messages only go to LNRs and
won’t have this problem.

LN Capability Registers 1082

Device Capabilities 2 register, bits 15:14: LN System CLS (HwInit) – Only applies to Root Ports or RCRBs. Indicates LNC support and cacheline size.
00b – LNC not supported or not in effect
01b – LNC in effect with 64-byte line size
10b – LNC in effect with 128-byte line size
11b – Reserved

LNR Extended Registers 1083

 Optional, but Endpoints that support LN as a Requester must implement this register.

[Figure: LNR Extended Capability structure; Extended Capability ID = 001Ch]
LNR Capability Register 1084

Bit Location | Description | Attr
0 | LNR-64 Supported – if set, LNR for 64-byte cachelines is supported | RO
1 | LNR-128 Supported – if set, LNR for 128-byte cachelines is supported | RO
12:8 | LNR Registration Max – indicates, as a power of 2, the number of cachelines this Requester can register at the same time. For example, a value of 00101b indicates support for as many as 2^5 = 32 cachelines concurrently. | RO
LNR Control Register 1085

Bit Location | Description | Attr
0 | LNR Enable – if set, indicates Endpoint is enabled to act as LNR. Software can clear this bit at any time. | RW
1 | LNR CLS – controls and indicates cache line size in use (0 = 64 bytes, 1 = 128 bytes). Default value is 0b, but can be hardwired if device only supports one line size. | RW
12:8 | LNR Registration Limit – limits, as a power of 2, the number of cachelines this Requester can register at the same time. For example, a value of 00100b indicates that only 2^4 = 16 cachelines are allowed to be registered concurrently. | RW
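
Both the Registration Max and Registration Limit fields hold an exponent rather than a count. A small sketch of the conversion, with a hypothetical function name and the field taken from bits 12:8 of the register as listed in the tables:

```python
def lnr_registration_count(reg: int) -> int:
    """Return the cacheline count encoded as a power of 2 in bits 12:8."""
    exponent = (reg >> 8) & 0x1F
    return 1 << exponent


print(lnr_registration_count(0b00101 << 8))  # Registration Max example from the table
print(lnr_registration_count(0b00100 << 8))  # Registration Limit example from the table
```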

4: 8.0 GT/s Receiver Impedance

Summary 1087

 Problem: Deadlock cases can arise when an 8.0 GT/s receiver is not detected in LTSSM Detect state because impedance is not in proper range (40-60 ohms) at that rate.
 Two cases are modified in which detecting
EIE results in LTSSM state transition:
 Exiting from Rx_L0s.Idle to Rx_L0s.FTS
 Exiting from L1.Idle to Recovery
 Solution: If impedance at 8.0 GT/s or higher
doesn’t match the value for 2.5 GT/s, a
timeout of 100ms can also be used to cause
this transition.
5: Process Address Space ID (PASID)

PASID Summary 1089

 Goal: Adding a Process ID Prefix gives extra attributes for a Memory Request
 Method: New prefix adds a 20-bit value to indicate
the address space of an Untranslated Address
 PASIDs enable
 Sharing an Endpoint across multiple processes
while maintaining a separate 64-bit address space
for each one.
 Hierarchical management of address spaces.
 Without PASID, Untranslated Addresses are seen as Guest
Physical and are translated to System Physical by the
Hypervisor.
 With PASID, Untranslated Addresses are seen as Guest Virtual
and are translated to Guest Physical by the Guest OS.
PASID Extended Registers 1090

 Register block defined for Endpoint PASID Requesters. Registers are not defined for Endpoint PASID Completers or for Root Port PASID Requesters or Completers.

[Figure: PASID Extended Capability structure; Extended Capability ID = 001Bh]
PASID Capability Register 1091

Bit Location | Description | Attr
0 | Reserved | RsvdP
1 | Execute Permission Supported – if set, Endpoint supports sending TLPs with Execute Requested bit set, which labels an address range as code that may be executed by the Endpoint. | RO
2 | Privileged Mode Supported – if set, Endpoint supports Privileged and Non-Privileged modes and can send TLPs with Privileged Mode Requested bit set | RO
7:3 | Reserved | RsvdP
12:8 | Max PASID Width – indicates the width of the PASID field supported by the Endpoint (between 0 and 20 bits). If a Request arrives with PASID wider than supported, optionally send UR. If a completion does so, optionally report Unexpected Completion. | RO
PASID Control Register 1092

Bit Location | Description | Attr
0 | PASID Enable – Endpoints can’t send or receive PASID Prefixes unless this bit is set. If they receive a TLP with a PASID Prefix and this bit is clear, they must return a UR completion. | RW
1 | Execute Permission Enable – if set, Endpoint is enabled to send and receive Requests with the Execute Requested bit set. If Execute Permission Supported capability bit is cleared, this bit is reserved. Default = 0b. | RW/RsvdP
2 | Privileged Mode Enable – if set, Endpoint is enabled to send and receive Requests with the Privileged Mode Requested bit set. If Privileged Mode Supported capability bit is cleared, this bit is reserved. Default = 0b. | RW/RsvdP
Ordering 1093

 Transaction Ordering – a Request with IDO set following another Request is allowed to pass if the two Requester IDs are different or if the two Requests use different PASID Prefixes.
TLP Prefix Review 1094

 Prefixes can be added in front of a memory request TLP, and can be Local or End-End

[Figure: TLP layout. One or more 1 DW Prefixes (first byte 100 x, then prefix contents) precede the Header (Format, Type, R, TC, R, Attr, R, TH, TD, EP, Attr, AT, Length; Requester ID, Tag, Last DW BE, First DW BE; Address [31:2], PH), followed by Optional Data and Optional ECRC.]
Prefix Contents 1095

[Figure: prefix DW layout; byte +0 bits 7:5 = 100b, bit 4 = x, remainder defined by prefix contents.]

 Format bit 2 was previously reserved but now must be recognized to detect a prefix.
 Format value of 100b indicates the presence of a prefix, and the next bit indicates:
 0 = Local Prefix
 1 = End-End Prefix
Device Capabilities 2 Register 1096

Device Capabilities 2 Register fields:
31:24 RsvdP
23:22 Max End-End TLP Prefixes
21 End-End TLP Prefix Supported
20 Extended Fmt Field Supported
19:14 RsvdP
13:12 TPH Completer Supported
11 LTR Mechanism Supported
10 No RO-enabled PR-PR Passing
9 128-bit CAS Completer Supported
8 64-bit AtomicOp Completer Supported
7 32-bit AtomicOp Completer Supported
6 AtomicOp Routing Supported
5 ARI Forwarding Supported
4 Completion Timeout Disable Supported
3:0 Completion Timeout Ranges Supported

Callout: Fields related to TPH and prefixes. A new set of registers is needed for Requesters and that’s covered later.
Local Prefix Rules 1097

+0 +1 +2 +3
7 5 4 3 0 7 0 7 0 7 0
1 0 0 0 L [3:0] (Defined by prefix contents)

 L [3:0] encodings:
0000 – MR-IOV (Multi-Root IO Virtualized environment):
Supports packet routing, error detection, and congestion
management (see MR-IOV spec for details).
1110 – Vendor-defined local prefix 0
1111 – Vendor-defined local prefix 1

End-to-End Prefix Rules 1098

+0 +1 +2 +3
7 5 4 3 0 7 0 7 0 7 0
1 0 0 1 E [3:0] (Defined by prefix contents)

 E [3:0] encodings:
0000 – Extended TPH
1110 – Vendor-defined end-end prefix 0
1111 – Vendor-defined end-end prefix 1
 All end-end prefixes are protected by the optional ECRC
 Max end-end prefixes in a TLP = 4 (max number
reported in Device Capabilities 2 register).
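
The first byte of a prefix DW is enough to classify it. The sketch below applies the encodings listed on the Local and End-End prefix slides; the function name and the "unlisted" placeholder for encodings not shown on these slides are assumptions.

```python
LOCAL_TYPES = {
    0b0000: "MR-IOV",
    0b1110: "Vendor-defined local prefix 0",
    0b1111: "Vendor-defined local prefix 1",
}
END_END_TYPES = {
    0b0000: "Extended TPH",
    0b1110: "Vendor-defined end-end prefix 0",
    0b1111: "Vendor-defined end-end prefix 1",
}


def classify_prefix(byte0: int):
    """Return (class, type name) for a prefix's first byte, or None if it is not a prefix."""
    if (byte0 >> 5) & 0x7 != 0b100:
        return None                      # Format is not 100b, so this DW is not a prefix
    end_end = bool(byte0 & 0x10)         # bit 4: 0 = Local, 1 = End-End
    code = byte0 & 0x0F                  # L[3:0] or E[3:0]
    table = END_END_TYPES if end_end else LOCAL_TYPES
    return ("End-End" if end_end else "Local", table.get(code, "unlisted"))


print(classify_prefix(0b1001_0000))
print(classify_prefix(0b1000_1110))
```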

Device Control Register 2 1099

Device Control 2 Register fields:
15 End-End TLP Prefix Blocking
14:11 RsvdP
10 LTR Mechanism Enable
9 IDO Completion Enable
8 IDO Request Enable
7 AtomicOp Egress Blocking
6 AtomicOp Requester Enable
5 ARI Forwarding Enable
4 Completion Timeout Disable
3:0 Completion Timeout Value

 Routing elements can use End-End TLP Prefix Egress blocking for endpoints that won’t understand them. Such a TLP is dropped and an error is reported (see next slide). If TLP was a Request, send Completion with UR.
AER Uncorrectable Status Register 1100

 New Uncorrectable Status bit

Uncorrectable Error Status Register fields:
31:26 RsvdZ
25 TLP Prefix Blocked Error Status
24 AtomicOp Egress Blocked Status
23 MC Blocked TLP Status
22 Uncorrectable Internal Error Status
21 ACS Violation Status
20 Unsupported Request Error Status
19 ECRC Error Status
18 Malformed TLP Status
17 Receiver Overflow Status
16 Unexpected Completion Status
15 Completer Abort Status
14 Completion Timeout Status
13 Flow Control Protocol Error Status
12 Poisoned TLP Status
11:6 RsvdZ
5 Surprise Down Error Status
4 Data Link Protocol Error Status
3:1 RsvdZ
0 Undefined
Stacked Prefix Example 1101

 Prefixes may be stacked or repeated


 Local prefixes must appear first
[Figure: packet order is STP, Sequence Number, Local Prefix (1000), End-End Prefix (1001), End-End Prefix (1001), TLP Header, Optional Payload Data, Optional ECRC, LCRC, END, with spans labeled "Protected by LCRC" and "Protected by ECRC".]
TPH Requester Capability Structure 1102

 Required for requesters that will use TPH


 Completers don’t use this, but indicate TPH
Completer support in Device Capabilities 2

TPH Prefix Example 1103

+0 +1 +2 +3
7 5 4 3 0 7 0 7 0 7 0
100 1 0000 ST [15:8] Reserved

 Code = Extended TPH
 Byte 1 contains the upper 8 bits of the Steering Tag, ST[15:8]
PASID Prefix 1104

Bits Description

31:29 100b – indicates TLP Prefix

28 1b – indicates End-End Prefix

27:24 0001b – indicates PASID Prefix

23 Privileged Mode Requested

22 Execute Requested

21:20 Reserved

19:0 Process Address Space ID field
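
The bit assignments in the table can be exercised with a short encode/decode pair. This is a minimal sketch that treats the prefix as a single 32-bit integer with bit 31 as the most significant bit; the function names and the dictionary return shape are hypothetical.

```python
PASID_PREFIX_BYTE0 = 0b1001_0001  # 100b (TLP Prefix), 1b (End-End), 0001b (PASID)


def encode_pasid_prefix(pasid: int, privileged: bool = False, execute: bool = False) -> int:
    """Build the 32-bit PASID Prefix DW from its fields."""
    if not 0 <= pasid < (1 << 20):
        raise ValueError("PASID must fit in 20 bits")
    return (
        (PASID_PREFIX_BYTE0 << 24)
        | (int(privileged) << 23)   # Privileged Mode Requested
        | (int(execute) << 22)      # Execute Requested
        | pasid                     # bits 19:0; bits 21:20 stay reserved (zero)
    )


def decode_pasid_prefix(dw: int):
    """Return the fields of a PASID Prefix DW, or None if the DW is another kind of prefix."""
    if (dw >> 24) & 0xFF != PASID_PREFIX_BYTE0:
        return None
    return {
        "pasid": dw & 0xFFFFF,
        "privileged": bool(dw & (1 << 23)),
        "execute": bool(dw & (1 << 22)),
    }


dw = encode_pasid_prefix(0x12345, privileged=True)
print(hex(dw), decode_pasid_prefix(dw))
```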


PASID Field 1105

 Twenty-bit field, but devices may support fewer than 20 bits. The Max PASID Width field indicates the width of the PASID field an Endpoint supports (between 0 and 20 bits). Root Ports report this in a design-specific way, since they don’t implement PASID registers.
 Both Endpoints and Root Ports are allowed to
signal an error if they receive a TLP with a
PASID field bigger than they support: UR for
Requests, and Unexpected Completion for
Completions.

Execute and Privileged Request Bits 1106

 If Execute Requested bit is set, Endpoint desires permission to execute instructions in the memory range associated with this request. Beyond that, the meaning is outside the scope of the spec
 Meaning of Privileged Mode depends on the
protection model of the system and is outside
the scope of the spec. However, Completers
can use it as part of a protection check for
access to the memory range.

PASID Translation

Use of PASID Prefix 1108

 The base PASID ECN described the use of PASID Prefix with Memory Requests that had untranslated addresses
 PASID Translation allows its use for other
Requests:
 Address Translation Requests
 Page Request Messages
 ATS Invalidation Requests
 PRG Response Messages

Adding Translation Support 1109

 3 new bits are added to the TA Completion data in response to a translation request

Bits | Description
Global | Global Mapping – if set, ATC is allowed to cache this address in all PASIDs. If clear, ATC is only allowed to cache this in the PASID associated with the Request
Exe | Execute Permitted – Requester is permitted to execute code in the memory range (only set if the Request asked for it).
Priv | Privileged Mode Access – Combines with Execute Permitted to associate permission level for this address range.
Use of PASID Prefix 1110

 When translating an address with a PASID, the TA translates Guest Virtual to Guest Physical Address (GPA), and then from GPA to the System Physical Address. [The intermediate GPA is not made visible.]
 For more details on PASID and Address
Translation, refer to our class on IOV.

6: End-End TLP Prefix Changes

Summary 1112

 Allows different Root Ports (only applies to Root Ports) to report different values of Max End-End TLP Prefixes, and clarifies handling of TLPs that have more prefixes than the Port can support.

Changes 1113

 Root Ports are allowed to report support for fewer prefixes than hardware actually implements
 Errors are still based on exceeding the value
actually implemented
 Request with too many E-E prefixes handled as
 UR (recommended) or
 Malformed TLP
 Completion with too many E-E prefixes handled as
 Unexpected Completion (recommended) or
 Malformed TLP
 Functions other than Root Ports will always report
too many E-E prefixes as Malformed TLP
7: Precision Time Measurement (PTM)

Introduction 1115

 PTM Definition: Ability to send base timing information between components.
 Goal: Simplify scheduling of time-sensitive
media and server applications in a platform
by coordinating precise timing across
devices.
 Method: PTM Requesters send Messages to
request base time info and PTM Responders
send Messages in response.

Terms 1116

 PTM Requester – Function capable of using PTM as a consumer
 PTM Responder – Function capable of
supplying PTM Master Time associated with
a Port or RCRB
 Time Source – a local clock associated with a
PTM Responder
 PTM Root – source of PTM Master Time for a
PTM hierarchy. Must also be a Time Source
and usually will be a PTM Responder.
 PTM Hierarchy – set of Requesters associated with a single PTM Root
Single PTM Hierarchy 1117

[Figure: a single PTM Hierarchy. The Root Complex is the PTM Root for this hierarchy.]
Independent PTM Hierarchies 1118

[Figure: two PTM Hierarchies. Each Switch acts as a PTM Root for its own hierarchy. The 2 hierarchies have independent PTM Master Times.]
Request and Response Messages 1119

Format Field bit 6: 0b = no data, 1b = data
Message Code LSB: 0 = Request, 1 = Response
ResponseD Message 1120

 PTM Master Time: PTM Root supplies this for a hierarchy
 Propagation Delay: round-trip delay within Downstream Port
PTM Dialog 1121

[Figure: PTM dialogs between an Upstream Port (Requester) and a Downstream Port (Responder). Timestamps per dialog: 1st PTM Dialog t1, t2, t3, t4; 2nd PTM Dialog t1’, t2’, t3’, t4’; 3rd PTM Dialog t1’’, t2’’, t3’’, t4’’.]

The points t1, t2, etc., are timestamps captured locally by each Port as they send and receive PTM Messages.

Components store timestamps from the 1st dialog to use with the 2nd one, and so on for later dialogs. Once a previous dialog establishes the history, the PTM Master Time can be calculated.

Downstream Ports have access to the PTM Master Time, and report it with their own turnaround time based on previous timestamp values.

Upstream Ports calculate PTM Master Time based on data received and their own timestamps.
PTM Example 1122

[Figure: example topology. Root Complex acts as the PTM Root for this system.]

Goal: Coordinate accurate timing between Endpoints with different local time bases.
PTM Example – 2 1123

[Figure: Switch Upstream Port uses PTM dialog to fetch PTM Master Time.]
First PTM Dialog 1124

[Figure: Endpoint and Switch Port with step markers 1, 2, 3 and timestamps t1, t2, t3, t4.]

First PTM Dialog Sequence
1. PTM enabled in Endpoint
2. PTM Request sent upstream at recorded local time t1.
3. Switch Port returns PTM Response w/o data (received by the Endpoint at local time t4). Endpoint waits for a timeout and repeats the request.
PTM Requester Operation 1125

[Figure: PTM Requester state diagram. States: Invalid PTM Context, Issue PTM Request, Wait >= 1µs, Valid PTM Context. Transition labels: PTM Enabled, Trigger Event, PTM Response, PTM ResponseD, Local Time Invalidation Event.]
Second PTM Dialog 1126

[Figure: Endpoint and Switch Port with step markers 1, 2, 3 and timestamps t1’, t2’, t3’, t4’.]

Second PTM Dialog Sequence
1. New PTM Request sent upstream at time t1’.
2. Switch Port returns PTM Response w/ data of t2’ and value of (t3-t2). The t2’ value includes awareness of the PTM Master Time.
3. Endpoint uses data to calculate local value of PTM Master Time by adjusting for its own Link delay.
Timing Values 1127

 t4 – t1 = turnaround time at Upstream Port


 t3 – t2 = turnaround time at Downstream Port
 ((t4-t1) – (t3-t2)) = roundtrip transmission time
between the components
 Link transmit times are likely to be symmetric
(according to the spec), so dividing the
roundtrip time by 2 gives the one-way Link
delay = ((t4-t1) – (t3-t2))/2

Timing Calculation Example 1128

 After 1st dialog, Endpoint knows t1 and t4.


 After 2nd dialog, Endpoint knows t1’, t4’, and
receives t2’ and (t3 – t2) as data
 Master Time can now be calculated as:
PTM Master Time at t1’ = t2’ – ((t4-t1) – (t3-t2))/2
[Upstream Port’s Launch time = Downstream Port’s
reported value minus the one-way Link delay]
 Components are allowed to make timing
context and calculation visible to software in a
design-specific manner.
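
The calculation above can be checked numerically. The sketch below implements the two formulas as written; the function names and the sample timestamps (in nanoseconds) are made up for illustration.

```python
def one_way_link_delay(t1, t4, t3_minus_t2):
    """((t4 - t1) - (t3 - t2)) / 2, assuming symmetric Link transmit times."""
    return ((t4 - t1) - t3_minus_t2) / 2


def ptm_master_time_at_t1_prime(t1, t4, t3_minus_t2, t2_prime):
    """PTM Master Time at t1' = t2' minus the one-way Link delay."""
    return t2_prime - one_way_link_delay(t1, t4, t3_minus_t2)


# Hypothetical values: the 1st dialog gives t1 and t4 (Endpoint local time);
# the 2nd dialog's ResponseD carries t2' (master time) and (t3 - t2).
t1, t4 = 1_000, 1_300
t3_minus_t2 = 100
t2_prime = 50_000

delay = one_way_link_delay(t1, t4, t3_minus_t2)
master = ptm_master_time_at_t1_prime(t1, t4, t3_minus_t2, t2_prime)
print(delay, master)
```

Here the round trip is 300 ns at the Upstream Port, 100 ns of which was turnaround inside the Downstream Port, leaving 200 ns on the wire and a one-way delay of 100 ns.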

Handling PTM Errors 1129

 PTM Request seen by a Port that doesn’t support it or hasn’t been enabled results in Unsupported Request.
 PTM Response seen by a Port that isn’t
enabled is silently dropped.

PTM Extended Capability Registers 1130

[Figure: PTM Extended Capability structure; Extended Capability ID = 001Fh]

 Permits software discovery and control of a PTM hierarchy.
 Root Ports, RCRBs, and RCIEs must have
this if they support PTM
 Endpoints and Switch Upstream Ports need it
in one Function if they support PTM
PTM Capability Register 1131

Bit Location | Description | Attr
0 | PTM Requester Capable – switches supporting PTM must set this bit while Endpoints and Root Ports may set it. | HwInit
1 | PTM Responder Capable – switches supporting PTM must set this bit while Endpoints and Root Ports may set it. If PTM Root Capable is set, this bit must also be set. | HwInit
2 | PTM Root Capable – Root Ports, RCRBs, and switches can set this bit if they implement a PTM Time Source Role and are capable of serving as the PTM Root. | HwInit
7:3 | Reserved | RsvdP
15:8 | Local Clock Granularity – 00h – no local clock; timing info when responding to PTM Requests is propagated from further upstream. 01h–FEh – indicates local clock period in ns. FFh – indicates local clock period is >254ns. Reserved if Function doesn’t have PTM Time Source Role. | HwInit/RsvdP
31:16 | Reserved | RsvdP
PTM Control Register – part 1 1132

Bit Location | Description | Attr
0 | PTM Enable – enables Function to participate in PTM according to its selected role. Default value is 0b. | RW
1 | Root Select – if this bit and PTM Enable are set, this Time Source is the PTM Root. (Recommended that software choose the furthest Upstream Time Source to be the PTM Root.) Default value is 0b. May be hardwired to 0b if PTM Root Capable bit is not set. | RW/RO
7:2 | Reserved | RsvdP
PTM Control Register – part 2 1133

Bit Location | Description | Attr
15:8 | Effective Granularity – for Functions acting as PTM Requesters, this gives info about expected accuracy of the PTM clock. For Endpoints, s/w programs it to the max value reported by the PTM Root and all intervening PTM Time Sources. 00h – unknown granularity; a Switch between this Function and the PTM Root reported granularity of 00h (default value). 01h–FEh – effective PTM granularity in ns. FFh – effective PTM granularity is >254ns. Reserved if Function doesn’t have PTM Time Source Role. | RW/RO
31:16 | Reserved | RsvdP
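
Software's choice of Effective Granularity for an Endpoint can be sketched as below. The function name and the input list (the Local Clock Granularity values read from the PTM Root and each intervening Time Source) are hypothetical; treating any 00h entry as forcing an unknown result follows the 00h description in the table.

```python
def effective_granularity(reported):
    """Return the Effective Granularity encoding for a path of reported granularities."""
    if not reported:
        raise ValueError("need at least the PTM Root's granularity")
    if any(g == 0x00 for g in reported):
        return 0x00              # some component on the path reported unknown
    return max(reported)         # worst (largest) clock period on the path, FFh means >254 ns


print(effective_granularity([4, 10, 8]))
print(effective_granularity([4, 0x00, 8]))
```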

8: Protocol Multiplexing (PMUX)
(Gen 3.0 Appendix G)

Introduction 1135

 Purpose: allow multiple protocols to share a PCIe Link
 Terms:
 PMUX Channel – set of logic to generate and
receive packets using a specific protocol;
multiplexed into the general PCIe traffic
 PMUX Link – PCIe Link over which protocol
multiplexing has been enabled. A mix of TLPs and
PMUX packets are transferred.
 PMUX Packet – Specially-modified packet that
identifies itself as the PMUX type.

PMUX Basics – 1

• Does not consume or interfere with PCIe resources; uses distinct resources associated with another protocol.
• PMUX packets have no impact on TLPs or DLLPs and are ignored by devices that don’t support them.
• PMUX must be enabled by s/w before packets can be sent. PMUX packets received by a target that isn’t enabled are ignored.
• PMUX Link can support up to 4 active PMUX Channels simultaneously

PMUX Basics – 2

• PMUX Packets don’t use ACK/NAK. Instead, they can use a protocol-specific means of reliable transport. Consequently:
  • 12-bit TLP sequence number is replaced with PMUX Metadata
  • Replay Timer may not advance when receiving PMUX Packets
• Transmitters will use a design-specific arbitration to schedule PMUX Packets with TLPs and DLLPs (not defined by spec)
• No address or routing mechanism defined for PMUX Packets.

PMUX Basics – 3

• Operation may be affected by Link width and speed (for example, some protocols may not work with smaller widths).
• PMUX logic is allowed to influence the choice of things like width, speed, or Link power state (for example, requesting change to L0).
• PMUX errors may be reported as internal errors
• PMUX Packets with LCRC errors are discarded and the Protocol Layer is notified. Protocol-specific error recovery or reporting may be invoked.

PMUX Packets

• Transaction Layer is not used; PMUX Packets use their own PMUX Protocol Layer instead
• Data Link Layer is simplified. PMUX Packets don’t use ACK/NAK protocol but do use LCRC. Errors don’t cause replays or affect TLP traffic.
• Physical Layer must be slightly modified to recognize the PMUX Packets
• Data can be from 0 to 125 Dwords, specific to the protocol being used
• Contents of Metadata and Data depend on the protocol in use; not defined in this spec

PMUX Packet Path

• PMUX Packets are created and received over a different path.

[Figure: the TLP path (Transaction Layer, Data Link Layer, Physical Layer) shown beside the PMUX path (Protocol Layer, Simplified Link Layer, Modified Physical Layer), with Scheduling / Multiplexing Logic]

Gen2 Comparison: TLP and PMUX Packet

[Figure: Gen2 framing, Symbol 0 through Symbol n. The TLP carries TLP and LCRC fields; the PMUX Packet carries Data and LCRC fields.]

Gen3 Comparison: TLP and PMUX Packet

[Figure: Gen3 framing, Symbol 0 through Symbol 3. The TLP begins with an STP Token followed by TLP and LCRC; the PMUX Packet begins with a Modified STP Token followed by Data and LCRC.]

PMUX Extended Registers

ID: 001Ah

PMUX support is indicated by the presence of these registers.


PMUX Capability Register

Bit Location    Description    Attr
5:0     PMUX Protocol Array Size – Number of protocols supported; 00h indicates none are supported.    RO
7:6     Reserved    RsvdP
15:8    PMUX Supported Link Speeds – speeds at which PMUX is supported. At least one bit must be set, and any combination is legal, but a speed cannot be supported here if the Link itself doesn’t support it.    RO/RsvdP
        Bit 8 – 2.5 GT/s
        Bit 9 – 5.0 GT/s
        Bit 10 – 8.0 GT/s
        Bits 15:11 – reserved
31:16   Reserved    RsvdP
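
As a rough illustration of the layout above, the following Python sketch decodes a raw PMUX Capability value. The function and constant names are hypothetical; only the bit positions come from the table.

```python
# Speed bits as listed in the table above (bit number -> speed).
PMUX_SPEED_BITS = {8: "2.5 GT/s", 9: "5.0 GT/s", 10: "8.0 GT/s"}


def decode_pmux_capability(value: int) -> dict:
    """Decode a 32-bit PMUX Capability register value."""
    speeds = [name for bit, name in PMUX_SPEED_BITS.items() if value & (1 << bit)]
    return {
        "protocol_array_size": value & 0x3F,  # bits 5:0
        "supported_speeds": speeds,           # bits 10:8
    }
```

A value of 0503h would decode as three protocol array entries with PMUX supported at 2.5 GT/s and 8.0 GT/s.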

PMUX Control Register

Bit Location    Description    Attr
5:0     PMUX Channel 0 Assignment – Indicates which entry in the protocol array has been assigned to Channel 0. Value of 00h (default) indicates no protocol is assigned (entry 0 in the array doesn’t exist).    RW
7:6     Reserved    RsvdP
13:8    PMUX Channel 1 Assignment    RW
15:14   Reserved    RsvdP
21:16   PMUX Channel 2 Assignment    RW
23:22   Reserved    RsvdP
29:24   PMUX Channel 3 Assignment    RW
31:30   Reserved    RsvdP
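
Each channel's 6-bit Assignment field sits at the bottom of its own byte, so reading or updating one channel is a shift and mask. A minimal Python sketch, with hypothetical helper names:

```python
def channel_assignment(control: int, channel: int) -> int:
    """Return the protocol array entry assigned to a channel (0 = none)."""
    if not 0 <= channel <= 3:
        raise ValueError("PMUX defines channels 0 through 3")
    return (control >> (8 * channel)) & 0x3F


def assign_channel(control: int, channel: int, entry: int) -> int:
    """Return a new PMUX Control value with one channel's assignment replaced."""
    if not 0 <= channel <= 3:
        raise ValueError("PMUX defines channels 0 through 3")
    if not 0 <= entry <= 0x3F:
        raise ValueError("assignment is a 6-bit field")
    shift = 8 * channel
    cleared = control & ~(0x3F << shift)   # leave the other fields untouched
    return (cleared | (entry << shift)) & 0xFFFFFFFF
```

The reserved bits are passed through unchanged, which matches the usual handling of RsvdP fields.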

PMUX Status Register – part 1

Bit Location    Description    Attr
0       PMUX Channel 0 Disabled: Link Speed – Indicates current Link speed is not supported by PMUX, so channel disabled.    RO
1       PMUX Channel 0 Disabled: Link Width – Indicates current Link width is not supported by PMUX, so channel disabled.    RO
2       PMUX Channel 0 Disabled: Protocol Specific – PMUX channel disabled for protocol-specific reasons.    RO
7:3     Reserved    RsvdZ
8       PMUX Channel 1 Disabled: Link Speed    RO
9       PMUX Channel 1 Disabled: Link Width    RO
10      PMUX Channel 1 Disabled: Protocol Specific    RO
15:11   Reserved    RsvdZ

PMUX Status Register – part 2

Bit Location    Description    Attr
16      PMUX Channel 2 Disabled: Link Speed    RO
17      PMUX Channel 2 Disabled: Link Width    RO
18      PMUX Channel 2 Disabled: Protocol Specific    RO
23:19   Reserved    RsvdZ
24      PMUX Channel 3 Disabled: Link Speed    RO
25      PMUX Channel 3 Disabled: Link Width    RO
26      PMUX Channel 3 Disabled: Protocol Specific    RO
31:27   Reserved    RsvdZ
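
The Status register repeats the same three Disabled bits once per byte, one byte per channel. A small Python sketch of reading them back (the function and reason names are made up for illustration):

```python
# Order matches bits 0, 1, 2 of each channel's byte in the tables above.
DISABLED_REASONS = ("link_speed", "link_width", "protocol_specific")


def channel_disabled_reasons(status: int, channel: int) -> list[str]:
    """List the reasons a PMUX channel is reported as disabled."""
    if not 0 <= channel <= 3:
        raise ValueError("PMUX defines channels 0 through 3")
    bits = (status >> (8 * channel)) & 0x7
    return [name for i, name in enumerate(DISABLED_REASONS) if bits & (1 << i)]
```

An empty list means none of the channel's Disabled bits are set.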

PMUX Protocol Array Entry [up to 63]

Bit Location    Description    Attr
15:0    Protocol ID – designates a specific protocol and how it is mapped into PMUX    RO
31:16   Authority ID – designates the authority controlling the values used in the Protocol ID field. It’s a PCI-SIG assigned Vendor ID, and ID 0001h indicates the PCI-SIG itself.    RO

• Value of all 0’s indicates entry not implemented
• Duplicate entries OK
• Different channels must be assigned different entries, but if entries are duplicated the channels can still use the same protocol

Enabling a PMUX Channel

• PMUX Channel is enabled when both:
  • Channel Assignment field is valid:
    • Value is non-zero, and
    • Within the range defined for the Protocol Array, and
    • Pointing to an implemented entry in the Protocol Array
  • And PMUX Channel Disabled bits are all cleared
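
The enable conditions above can be expressed as a single check. The sketch below is illustrative only: it assumes the Protocol Array is held as a Python list whose index 0 corresponds to entry 1 (entry 0 doesn't exist), and the function name is hypothetical.

```python
def pmux_channel_enabled(assignment: int, protocol_array: list[int], disabled_bits: int) -> bool:
    """Apply the enable rules: valid Channel Assignment and no Disabled bits set.

    protocol_array holds the 32-bit array entries, entry 1 first (assumption).
    disabled_bits holds the channel's three Disabled status bits.
    """
    if assignment == 0:                         # 00h means no protocol assigned
        return False
    if assignment > len(protocol_array):        # outside the Protocol Array range
        return False
    if protocol_array[assignment - 1] == 0:     # all 0's: entry not implemented
        return False
    return (disabled_bits & 0x7) == 0           # speed, width, protocol-specific
```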

Appendix C:
I/O Virtualization Support

Objectives: IOV Support

Upon completion, student should be able to explain:
• The motivation for virtualization support in I/O devices
• The purpose of P2P Redirection
• The motivation for FLR in IOV

IOV Topics

• Virtualization is a big topic in itself. To learn more about it, including the supplemental PCIe specs that support IOV, visit our website at www.mindshare.com
• Features added to the PCIe base spec in support of I/O device virtualization:
  • ACS (Access Control Services)
  • FLR (Function Level Reset)
• Separate specs for IOV
  • ATS (Address Translation Services)
  • SR-IOV (Single-Root IOV) and MR-IOV (Multi-Root IOV) [these are not covered]

Definitions

Guest = OS + applications. OS perceives itself as sole owner of the system, and that still needs to be true in a virtualized system so it won’t need to be modified.

Hypervisor allows these OS’s to work in a virtualized system by intercepting accesses from Guests. It prevents conflicts by translating addresses, blocking sensitive accesses until appropriate times, etc.

[Figure: a Non-virtual System (Single OS), with one Guest (App 1, App 2, OS) on a Processor Complex, PCIe Root, and PCIe Switch, compared with a Virtualized System (Multi-OS), with Guest 1, Guest 2, and Guest 3 on a Hypervisor above a Processor Complex, PCIe Root, and PCIe Switch]

Motivation for Virtualization

Migration – move Guests between machines for
• Load balancing
• Servicing hardware
Consolidation – combine Guests that normally require sole access to a platform. They still perceive sole access but cost and maintenance requirements are reduced.

[Figure: two platforms, each with a Hypervisor, Processor Complex, PCIe Root, and PCIe Switch; Guests 2, 3, and 4 run on one platform and Guest 1 on the other]


Motivation for IOV

1. Speed – having Hypervisor intercept all accesses can be slow if the system is heavily loaded. Hardware support can speed up the operations.
2. Simplify Software – offload some Hypervisor tasks to hardware.

IOV-Capable Devices serve this hardware-accelerator process by adding
• New configuration registers, making the one Physical Function appear as several Virtual Functions, allowing it to be shared by multiple Guests
• Routing logic to send accesses to correct Virtual Function

[Figure: Guests 1 through 3 on a Hypervisor, Processor Complex, PCIe Root, and PCIe Switch, above an IOV-capable device whose PCIe Port feeds Config Management and Routing logic, Virtual Functions 1 through N, each with its own Physical Resources, and a Shared Hardware Implementation]


ATS (Address Translation Services)

Background

• In a virtualized system, Guest physical addresses are usually not true physical addresses
  • Each guest can establish a unique address range with which it accesses shared functions.
  • Guests are unaware of each other and can request conflicting address spaces.
• Guest addresses have to be translated to physical addresses to provide unique address domains

Translation Agent Can Do It All

Translation Agent (TA) translates Guest addresses into real physical addresses using ATPT, and verifies that Requester is permitted to access that range.

[Figure: Root Complex containing the Translation Agent, ATPT (Address Translation & Protection Tables), and Memory Controller; Root Ports connect to PCIe Endpoints directly and through a Switch. An Endpoint issues a DMA transaction with a virtual address.]

ATC – Address Translation Cache

• Translation often involves several memory accesses to “walk through” the tables
• To save time, an ATC is used to store earlier translation results, much like a processor TLB
  • Problem: a central ATC may have to translate for multiple devices and guests, making cache management difficult.
• Distributing the ATC helps
  • Endpoints cache their own addresses and participate in the translation process
    • Relieves pressure on a single system-level ATC and reduces thrashing in the cache
    • Improves access latency by sending a pre-translated address from the Endpoint
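
The benefit of a per-Endpoint ATC can be illustrated with a toy model. This is only a conceptual sketch: the class names are invented, the translation table is a plain dictionary keyed by guest page number, and real ATS adds invalidation, permissions, and page sizes that are ignored here.

```python
class TranslationAgent:
    """Stands in for the TA; counts how often its tables have to be walked."""

    def __init__(self, table: dict[int, int]):
        self.table = dict(table)
        self.walks = 0

    def translate(self, guest_page: int) -> int:
        self.walks += 1                  # each call models a table walk
        return self.table[guest_page]


class EndpointATC:
    """Per-Endpoint cache of earlier translation results."""

    def __init__(self, agent: TranslationAgent):
        self.agent = agent
        self.cache: dict[int, int] = {}

    def lookup(self, guest_page: int) -> int:
        if guest_page not in self.cache:             # miss: ask the TA once
            self.cache[guest_page] = self.agent.translate(guest_page)
        return self.cache[guest_page]                # hit: no table walk
```

Repeated lookups of the same page reach the Translation Agent only once, which is the pressure relief the bullets above describe.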

ATC Example

Device knows what addresses it will use and can predict future activity better than the TA. For example, isochronous transfers would need to have all the addresses ready before starting.

Device ATC can be an application-specific design.

[Figure: Root Complex with Translation Agent and Memory Controller; each PCIe Endpoint, whether attached to a Root Port or behind the Switch, has its own ATC. An Endpoint issues a DMA transaction with a translated address.]

Gathering Translated Addresses

Devices request and store translated addresses in their ATC whenever they decide doing so would be beneficial.

Different devices can share address space or have dedicated spaces.

[Figure: an Endpoint sends a Translation Request up to the Translation Agent (Address Translation and Protection Table) in the Root Complex and receives a Translation Completion; every PCIe Endpoint has its own ATC.]

Address Type Field

• Requests have a new field to indicate whether the address is virtual and needs translation or has already been translated

AT (Address Type)
  00 : default
  01 : translation request
  10 : translated
  11 : reserved

Note: Spec sometimes refers to this field as Address Type and sometimes as Address Translation

PCIe Memory Request Header, 64-bit address (bytes +0 to +3 per Dword):
  DW0: r | Fmt (x1) | Type (0 0000) | r | TC | reserved | TD | EP | Attr | AT | Length
  DW1: Requester ID | Tag | Last DW BE | 1st DW BE
  DW2: Address 63:32
  DW3: Address 31:02 | r
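
As an illustration of where the AT field lives, the sketch below pulls it out of the first header Dword. It assumes the Dword is held as a 32-bit integer with byte +0 in the most significant position, which places AT at bits 11:10, directly above the 10-bit Length field; the function name is hypothetical.

```python
# Encodings from the AT table above.
AT_NAMES = {
    0b00: "default",
    0b01: "translation request",
    0b10: "translated",
    0b11: "reserved",
}


def address_type(dw0: int) -> str:
    """Return the meaning of the AT field in header Dword 0."""
    return AT_NAMES[(dw0 >> 10) & 0x3]   # byte +2, bits 3:2
```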


ACS (Access Control Services)

Background

• ACS is an optional extended capability that can be used to consider address translation when deciding how certain transactions are to be handled

Source Validation & Translation Blocking

[Figure: CPU, Root Complex, Switch, PCIe Endpoints, and a PCIe Bridge to PCI or PCI-X. Callouts: “Within the Bus range?” and “Packets that participate in translation can be blocked”.]

Motivation for Redirection Capability

• Prevent silent data corruption
  • Example: request corrupted within a switch could get mistakenly routed to a peer device, causing several problems
• Validate that peer-to-peer transactions are allowed to prevent unintended peer access
• Facilitate the use of translated addresses

P2P Redirection & Direct Translated

[Figure: CPU, Root Complex with Request Validation Logic and Memory, Switch, PCIe Endpoints, and a PCIe Bridge to PCI or PCI-X. Labels: Normal path, Redirected path, and “Translated address overrides Redirection”.]

Upstream Forwarding

[Figure: CPU, Root Complex with Request Validation Logic and Memory, two levels of Switch, PCIe Endpoints, and a PCIe Bridge to PCI or PCI-X]

Egress Control

This is the only ACS feature described as optional for downstream ports.

[Figure: CPU, Root Complex with Memory, a Switch with ports numbered 4 3 2 1 0, PCIe Endpoints, and a PCIe Bridge to PCI or PCI-X]

ACS Capability Registers

Register block layout:
  Enhanced Capability Header Register
  ACS Control Register | ACS Capability Register
  Egress Control Vector
  Additional Egress Control Vector Dwords (if required)

ACS Capability Register fields (low bits shown as T E U C R B V):
  Egress Control Vector Size – number of ports to which this port could send peer-to-peer traffic
  RsvdP
  T – Direct Translated P2P
  E – Egress Control
  U – Upstream Forwarding
  C – P2P Completion Redirect
  R – P2P Request Redirect
  B – Translation Blocking
  V – Source Validation

ACS Control Register fields (low bits shown as T E U C R B V):
  RsvdP
  T – Direct Translated P2P Enable
  E – Egress Control Enable
  U – Upstream Forwarding Enable
  C – P2P Completion Redirect Enable
  R – P2P Request Redirect Enable
  B – Translation Blocking Enable
  V – Source Validation Enable
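
A possible decode of the ACS Capability Register is sketched below. The slide gives only the left-to-right letter order; the numeric positions used here (V at bit 0 up to T at bit 6, Egress Control Vector Size in bits 15:8) follow the PCIe base spec layout and should be checked against the spec before use. The function name is hypothetical.

```python
# Feature names indexed by bit number (bit 0 first), per the base spec layout.
ACS_FEATURES = [
    "Source Validation",        # V
    "Translation Blocking",     # B
    "P2P Request Redirect",     # R
    "P2P Completion Redirect",  # C
    "Upstream Forwarding",      # U
    "Egress Control",           # E
    "Direct Translated P2P",    # T
]


def decode_acs_capability(value: int) -> dict:
    """Decode a 16-bit ACS Capability Register value."""
    return {
        "features": [name for bit, name in enumerate(ACS_FEATURES) if value & (1 << bit)],
        "egress_control_vector_size": (value >> 8) & 0xFF,
    }
```

The ACS Control Register uses the same bit order for its Enable bits, so the same table could be reused for it.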

Egress Control

Egress Control Vector
  Bit:    4 3 2 1 0
  Values: 0 0 0 0 1

[Figure: CPU, Root Complex with Memory, a Switch with ports numbered 4 3 2 1 0, PCIe Endpoints, and a PCIe Bridge to PCI or PCI-X]
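
The vector in the example can be read with a one-line test per port. The slide does not state what a set bit means; the sketch below assumes the base-spec convention that a set bit marks a target port to which peer-to-peer requests are blocked or redirected. The function name is hypothetical.

```python
def p2p_egress_permitted(egress_vector: int, target_port: int) -> bool:
    """True if the vector does not block/redirect P2P traffic to target_port.

    Assumption: a set bit means peer-to-peer requests to that port are
    blocked or redirected (base-spec convention, not stated on the slide).
    """
    return ((egress_vector >> target_port) & 0x1) == 0
```

With the example values (only bit 0 set), traffic toward port 0 is restricted while ports 1 through 4 remain reachable.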

Upstream Ports

ACS only applies to upstream ports if the device has multiple functions, since that’s the only way to get peer-to-peer traffic within a device. Redirection, egress control, and direct-translated options apply.

Blocking peer-to-peer between functions can prevent “data leakage”, or unintended transfer, accidental or malicious, between unrelated functions.

The use of ARI would permit a device to have a large number of functions, especially useful for virtual functions.

[Figure: Multi-Function Device containing Function 5 and Function 0]

Appendix D:
Add-In Cards and Connectors

Objectives: Add-In Cards & Connectors

Upon completion, student should be able to explain:
• Examples of form factors defined for mobile, desktop, and server applications
• Characteristics of the presence-detect pins
• The auxiliary signals required on the CEM add-in card connector
• The purpose of the power pins in the external cable form factor

PCI Express x1 Connector

Desktop & Server Form Factor

• PC motherboard design
• Supports I/O & graphics

[Figure: x16, x8, x4, and x1 slots]

Connector on System Board

PCI Express Connector Pinout

• See Book for details
• Auxiliary Signals
  • REFCLK+ (Required)
  • REFCLK- (Required)
  • PERST# (Required)
  • WAKE# (Required if wakeup function supported)
  • SMBCLK (optional)
  • SMBDAT (optional)
  • JTAG Signal Group (optional)
  • PRSNT1# (Required)
  • PRSNT2# (Required)

WAKE# Circuit Protection

Presence Detect

[Figure: PCI Express Add-in Card; labels: Pull-up, PRSNT1#, Hot Plug Control Logic]

Electrical Requirements

• Power Rails
  • +3.3V
  • +12V
  • +3.3Vaux
• Power Dissipation
  • Standard Height Cards
    • x1 - 10W desktop (half length), 25W server (7” – full length, limited to 10W until configured as a high-power device)
    • x4/x8 - 25W
    • x16 - 25W server, 75W graphics application (limited to 25W until configured as a high-power device)
  • Low Profile Cards (all are limited to half length)
    • x1 - 10W
    • x4/x8 - 25W
    • x16 - 25W

Add-In Card Interoperability

            x1 Slot     x4 Slot     x8 Slot     x16 Slot
x1 Card     Required    Required    Required    Required
x4 Card     No          Required    Allowed     Allowed
x8 Card     No          No          Required    Allowed
x16 Card    No          No          No          Required
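
The table follows a simple pattern, which the sketch below captures: a card never fits a narrower slot, support is Required when the widths match or the card is x1, and otherwise it is Allowed. The function name is hypothetical and the rule is inferred from the table rows, not quoted from the CEM spec.

```python
def slot_support(card_width: int, slot_width: int) -> str:
    """Reproduce the interoperability table for x1, x4, x8, and x16 widths."""
    if card_width > slot_width:
        return "No"
    if card_width == slot_width or card_width == 1:
        return "Required"
    return "Allowed"
```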

Riser Card

Other Form Factors

• Backplane
• ExpressModule
• Riser Card
• PCI Express Mini Card
• ExpressCard (PCMCIA group)
• External Cable

PICMG®-Based Backplane

• PICMG for PCI Express
• Based on CompactPCI
• Passive backplane design
• Modular system and peripheral cards

CompactPCI Express Form-Factor

• 3U and 6U Form Factors
• High-speed connector for backplane
• High-speed connector for mezzanine slot

[Figure: board with RTM, PCI Device, PCIe Bridge, and Native PCIe Endpoint]

ExpressModule Form Factor

• Targets enterprise servers
• Cartridge approach improves density, power, cooling, & reliability
• Hot pluggable
• Secure mechanical latch
• Cartridge holds equivalent of 1 or 2 add-in cards

PCI Express Mini Card

PCI Express Mini Card Photos

ExpressCard™

• Recent version of the PCMCIA standard
• Maintained by PCMCIA group http://www.expresscard.org
• Interface defined as:
  • Single x1 PCI Express connector, including REFCLK and PERST#
  • Single USB 2.0 connector
  • Note: Both may be active at same time, but care must be taken when an Eject event occurs to account for any dependencies between them. ACPI entry in the BIOS defines this dependency.

ExpressCard™ Form Factor

Mechanically similar to earlier PCMCIA card form factor.

ExpressCard Slot can be designed to support both card sizes in the same slot or just the 34mm version.

[Figure: Current CardBus compared with New ExpressCard Types]


ExpressCard™ Auxiliary Signals

• SMBus - optional 2-wire connection
• WAKE# - provided for support of PCI Express wakeup events, USB does not use it
• CLKREQ# - asserted to enable REFCLK for use by PCI Express module
• CPPE# - grounded on modules that use the PCI Express interface
• CPUSB# - grounded on modules that use the USB interface

PCI Express Cable

• Rev 1.0 – Cable length not specified. Instead, cable vendors optimize for cost or length by design choices like wire gauge and shielding. Some vendors specify a 1m version and a 7m version.
• Widths of x1, x4, x8, x16 currently defined
• New C-Link spec (rev 0.3) defines copper and optical cables, with active and passive versions of each. Optional: 10W for peripheral power.

[Figure: x16 Graphics card with external cable. Drawing courtesy of Molex]
[Figure: x1, x4, x8, x16 cables. Photo courtesy of Molex]

Example Cable Applications

Drawing courtesy of Molex

Appendix E:
Arbor Exercise Solutions

Arbor Exercise Answers: Address Map

File: address_map_lab.arbsys

[Figure: device tree with address assignments]
  0:28:0 (above Bus 2): BAR – NA; P-MMIO – 1_FB80_0000h – 1_FBAF_FFFFh
  2:0:0 (above Bus 3): BAR 0/1 – 1_FBA0_0000h – 1_FBA1_FFFFh; P-MMIO – 1_FB80_0000h – 1_FBAF_FFFFh (Problem Here)
  3:0:0, 3:5:0, 3:7:0, 3:9:0 (above Bus 4, Bus 5, Bus 6, Bus 7); two of these are labeled:
    BAR – NA; P-MMIO – 1_FB80_0000h – 1_FB8F_FFFFh
    BAR – NA; P-MMIO – 1_FB90_0000h – 1_FB9F_FFFFh
  4:0:0: BAR 0/1 – 1_FB90_0000h – 1_FB93_FFFFh
  5:0:0: BAR 0/1 – 1_FB80_0000h – 1_FB83_FFFFh

Problem Here
A bridge’s Base/Limit range should NOT include its own BAR. Base/Limit should only indicate what addresses live beneath the bridge, not what the bridge itself owns.
The correct configuration should be: P-MMIO – 1_FB80_0000h – 1_FB9F_FFFFh
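
The rule in this answer (a bridge's Base/Limit window must not cover the bridge's own BAR) is an interval-overlap test. A minimal Python sketch, using the values from the exercise and a hypothetical function name:

```python
def window_includes_own_bar(bar: tuple[int, int], window: tuple[int, int]) -> bool:
    """True if a bridge's own BAR range overlaps its Base/Limit window.

    Both arguments are (first_address, last_address) pairs, inclusive.
    """
    bar_lo, bar_hi = bar
    win_lo, win_hi = window
    return bar_lo <= win_hi and win_lo <= bar_hi


# Values from the exercise: the bridge BAR and the two candidate windows.
BRIDGE_BAR = (0x1_FBA0_0000, 0x1_FBA1_FFFF)
CONFIGURED_WINDOW = (0x1_FB80_0000, 0x1_FBAF_FFFF)   # flagged as the problem
CORRECTED_WINDOW = (0x1_FB80_0000, 0x1_FB9F_FFFF)    # the stated fix
```

The configured window overlaps the BAR, while the corrected window ends just below it.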

Arbor Exercise Answers: Error Debugging

File: error_lab.arbsys

• Part 1: Software received an interrupt from PCIe Root Port 0:28:6 that was generated because of a received error message. Answer the following questions:
1. What type of error message triggered the interrupt (ERR_FATAL, ERR_NONFATAL, ERR_CORR)?
   ERR_NONFATAL
2. From which BDF did the error message originate?
   12:0:0 (or C:0:0 in hex)
3. What was the specific error condition that caused the first error message?
   Completer Abort
4. Were there any other errors detected on that BDF? If so, what are they?
   Yes, Unsupported Request and Bad TLP
5. Is any other information about the first error provided? If so, provide it (decoded if possible).
   Header Log:
     0000_0080h – 1st Dword
     0A00_0CFFh – 2nd Dword
     FB80_1000h – 3rd Dword
     0000_0000h – 4th Dword
   Decoded: Memory Read, 3DW header, TC=0, TD=0, EP=0, Attr=0, AT=0, Length=80h (128 DWs or 512 bytes), ReqID=10:0:0 (A:0:0 hex), Tag=Ch, ByteEnables=FFh, Addr=FB80_1000h
• Bonus Question: What was the vector of the interrupt generated to software because of the error?
   Vector A9h

Arbor Exercise Answers: Error Debugging

File: error_lab.arbsys

• Part 2: Software received an interrupt from PCIe Root Port 0:28:0 that was generated because of a received error message. Answer the following questions:
1. What type of error message triggered the interrupt (ERR_FATAL, ERR_NONFATAL, ERR_CORR)?
   ERR_FATAL
2. From which BDF did the error message originate?
   5:0:0
3. What was the specific error condition that caused the first error message?
   Malformed TLP (be careful with hex vs decimal)
4. Were there any other errors detected on that BDF? If so, what are they?
   Yes, Poisoned TLP and Receiver Error
5. Is any other information about the first error provided? If so, provide it (decoded if possible) and try to determine why this was an error.
   Header Log:
     6000_8080h – 1st Dword
     0000_04FFh – 2nd Dword
     0000_0001h – 3rd Dword
     FB80_1000h – 4th Dword
   Decoded: Memory Write, 4DW header, TC=0, TD=1, EP=0, Attr=0, AT=0, Length=80h (128 DWs or 512 bytes), ReqID=0:0:0, Tag=4, ByteEnables=FFh, Addr=1_FB80_1000h
   (The length for this write is 512 bytes, but the Max Payload Size enabled for this BDF is 256 bytes, and that is why this request was treated as a Malformed TLP.)
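
The Header Log decodes in these answers can be reproduced mechanically. The sketch below is a simplified decoder for memory requests only: it assumes each Dword is a 32-bit integer with header byte +0 in the most significant position, decodes only Attr[1:0], and ignores every other TLP type. The function name is hypothetical.

```python
def decode_header_log(dw0: int, dw1: int, dw2: int, dw3: int) -> dict:
    """Decode the four Header Log Dwords of a memory request TLP."""
    fmt = (dw0 >> 29) & 0x7          # byte +0, bits 7:5
    tlp_type = (dw0 >> 24) & 0x1F    # byte +0, bits 4:0
    four_dw = bool(fmt & 0x1)        # Fmt bit 0: 4DW header (64-bit address)
    has_data = bool(fmt & 0x2)       # Fmt bit 1: request carries a payload
    if tlp_type == 0:
        kind = "Memory Write" if has_data else "Memory Read"
    else:
        kind = "not a memory request"
    requester = dw1 >> 16
    address = ((dw2 << 32) | dw3) if four_dw else dw2
    return {
        "kind": kind,
        "header_dwords": 4 if four_dw else 3,
        "tc": (dw0 >> 20) & 0x7,
        "td": (dw0 >> 15) & 0x1,
        "ep": (dw0 >> 14) & 0x1,
        "attr": (dw0 >> 12) & 0x3,
        "at": (dw0 >> 10) & 0x3,
        "length_dw": dw0 & 0x3FF,    # a field value of 0 would mean 1024 DWs
        "requester": ((requester >> 8) & 0xFF, (requester >> 3) & 0x1F, requester & 0x7),
        "tag": (dw1 >> 8) & 0xFF,
        "byte_enables": dw1 & 0xFF,
        "address": address & ~0x3,   # low two bits are reserved
    }
```

Feeding it the Part 2 log gives a 4DW Memory Write of 128 DWs (512 bytes), which exceeds the 256-byte Max Payload Size cited in the answer.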

Arbor Exercise Answers: Interrupt Investigation

File: interrupt_lab.arbsys

• Interrupts can be signaled using INTx, MSI and MSI-X mechanisms. The following exercise reviews these implementations.
1. What methods of interrupt generation are supported in BDF 0:27:0, and which mechanism is being used?
   Legacy INTx and MSI are supported; Legacy interrupts are being used
2. What configuration register and field verifies the selection in item 1 above? When an interrupt is generated what interrupt will be signaled by the device?
   In Command register in Header, Interrupt Disable (bit [10]) is not set AND MSI Enable bit is not set.
   INTA (based on IntPin register in Header)
3. What methods of interrupt generation are supported by the Root Port at location 0:28:6, and which is being used?
   Legacy INTx and MSI are supported; MSIs are being used
4. For 0:28:6, how many interrupt vectors are requested and how many are enabled?
   2 interrupts are requested and 2 are enabled

Arbor Exercise Answers: Interrupt Investigation

File: interrupt_lab.arbsys

5. For 0:28:6, what are the specific address and data values allowed in the interrupts (memory writes) that can be signaled by the Root Port? Bonus question: On this x86-based system, what are interrupt vectors of these interrupts (assuming interrupt remapping is not enabled in the system)?
   FEE0_F00Ch : 49A8h (bonus: vector A8h)
   FEE0_F00Ch : 49A9h (bonus: vector A9h)
6. What methods of interrupt generation are supported by the device attached to Root Port 0:28:2, and which mechanism is enabled?
   Legacy INTx, MSI and MSI-X are supported; MSI-X is enabled
7. For the device attached to Root Port 0:28:2, what are the specific address and data values allowed in the interrupts (memory writes) that can be signaled by the device? Bonus question: On this x86-based system, what are interrupt vectors of these interrupts (assuming interrupt remapping is not enabled in the system)?
   FEE0_F00Ch : 49A0h (bonus: vector A0h)
   FEE0_C00Ch : 4990h (bonus: vector 90h)
   FEE0_C00Ch : 4980h (bonus: vector 80h)
   FEE0_C00Ch : 4970h (bonus: vector 70h)
   FEE0_0000h : 4960h (bonus: vector 60h)
   FEE0_300Ch : 4982h (bonus: vector 82h)
   FEE0_300Ch : 4972h (bonus: vector 72h)
   FEE0_300Ch : 4962h (bonus: vector 62h)
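
The bonus answers follow from the x86 MSI message format, in which the vector is the low byte of the data value. The sketch below decodes that and a few neighboring fields; the field positions follow the Intel architecture's MSI address and data layout without interrupt remapping, not anything stated on the slide, and the function name is hypothetical.

```python
def decode_msi(address: int, data: int) -> dict:
    """Decode an x86 MSI address/data pair (interrupt remapping not enabled)."""
    return {
        "is_msi_window": (address >> 20) == 0xFEE,   # FEEx_xxxxh region
        "destination_id": (address >> 12) & 0xFF,    # address bits 19:12
        "redirection_hint": (address >> 3) & 0x1,    # address bit 3
        "destination_mode": (address >> 2) & 0x1,    # address bit 2
        "vector": data & 0xFF,                       # data bits 7:0
        "delivery_mode": (data >> 8) & 0x7,          # data bits 10:8
    }
```

For FEE0_F00Ch : 49A8h this yields vector A8h, matching the first bonus answer.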

Arbor Exercise Answers: Interrupt Investigation

File: pcie_lab2.arbsys

8. Are any of the interrupts of BDF 9:0:0 masked from being generated? If so, have any of those masked events occurred at the BDF? Which one(s)?
   Yes, 4982h and 4972h are masked; the event for 4972h has occurred as indicated by the Pending Bit Array
9. Are any of the interrupts of BDF 8:0:0 masked from being generated? If so, have any of those masked events occurred at the BDF?
   Yes, all interrupts are masked because the Function Mask bit in the MSI-X capability structure is set. None of the pending bits are set.

Thank you!
