OpenSPARC T1 Microarchitecture Specification
Copyright © 2008 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved.
Sun Microsystems, Inc. holds intellectual property rights relating to the technology embodied in the product described in this document. In particular, and without limitation, these intellectual property rights may include one or more of the U.S. patents listed at http://www.sun.com/patents and one or more additional patents or pending patent applications in the United States and in other countries.
Use is subject to license terms.
This distribution may include components developed by third parties.
Sun, Sun Microsystems, the Sun logo, Solaris, OpenSPARC T1, and UltraSPARC are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
UNIX is a registered trademark in the United States and other countries, exclusively licensed through X/Open Company, Ltd.
The Adobe logo is a registered trademark of Adobe Systems, Incorporated.
The products covered by this manual and the information it contains are controlled by U.S. export control laws and may be subject to the export or import laws of other countries. Nuclear, missile, chemical or biological weapons, or nuclear maritime end uses or end users, whether direct or indirect, are strictly prohibited. Export or reexport to countries subject to U.S. embargo, or to entities identified on U.S. export exclusion lists, including, but not limited to, the denied persons list and the specially designated nationals list, is strictly prohibited.
DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.
Contents
Preface xxiii
1.3.9 Electronic Fuse 1–12
2.8.3 Modular Arithmetic Memory (MA Memory) 2–41
2.8.4 Modular Arithmetic Operations 2–42
2.9 Memory Management Unit 2–44
2.9.1 The Role of MMU in Virtualization 2–45
2.9.2 Data Flow in MMU 2–46
2.9.3 Structure of Translation Lookaside Buffer 2–46
2.9.4 MMU ASI Operations 2–48
2.9.5 Specifics on TLB Write Access 2–49
2.9.6 Specifics on TLB Read Access 2–49
2.9.7 Translation Lookaside Buffer Demap 2–49
2.9.8 TLB Auto-Demap Specifics 2–50
2.9.9 TLB Entry Replacement Algorithm 2–50
2.9.10 TSB Pointer Construction 2–50
2.10 Trap Logic Unit 2–51
2.10.1 Architecture Registers in the Trap Logic Unit 2–53
2.10.2 Trap Types 2–54
2.10.3 Trap Flow 2–56
2.10.4 Trap Program Counter Construction 2–58
2.10.5 Interrupts 2–58
2.10.6 Interrupt Flow 2–59
2.10.7 Interrupt Behavior and Interrupt Masking 2–62
2.10.8 Privilege Levels and States of a Thread 2–62
2.10.9 Trap Modes Transition 2–63
2.10.10 Thread States Transition 2–64
2.10.11 Content Construction for Processor State Registers 2–65
2.10.12 Trap Stack 2–66
2.10.13 Trap (Tcc) Instructions 2–67
2.10.14 Trap Level 0 Trap for Hypervisor 2–67
3.3.3 L2 Miss 3–13
3.3.4 ERR 3–13
3.3.5 Non-Cacheable Bit 3–13
3.3.6 Thread ID 3–14
3.3.7 Way and Way Valid 3–14
3.3.8 Four-byte Fill 3–14
3.3.9 Atomic 3–14
3.3.10 Prefetch 3–14
3.3.11 Data 3–15
3.4 Processing of PCX Transactions 3–16
3.4.1 Load 3–16
3.4.2 Prefetch 3–16
3.4.3 D-cache Invalidate 3–17
3.4.4 Instruction Fill 3–17
3.4.5 I-cache Invalidate 3–17
3.4.6 Store 3–18
3.4.7 Block Store 3–18
3.4.8 Block Init Store 3–18
3.4.9 CAS (Compare and Swap) 3–19
3.4.10 Swap/Ldstub 3–19
3.4.11 Stream Load 3–19
3.4.12 Stream Store 3–19
3.4.13 External Floating-Point Operations 3–20
3.4.14 Interrupt Requests 3–20
3.4.15 L2 Evictions 3–21
3.4.16 L2 Errors 3–21
3.4.17 Forwarded Requests 3–21
3.4.18 Writes to the INT_VEC_DIS Register 3–21
4.1.3.1 L2-Cache Transaction Types 4–9
4.1.3.2 L2-Cache Pipeline Stages 4–10
4.1.4 L2-Cache Instruction Descriptions 4–12
4.1.4.1 Loads 4–12
4.1.4.2 Ifetch 4–12
4.1.4.3 Stores 4–13
4.1.4.4 Atomics 4–13
4.1.4.5 J-Bus Interface Instructions 4–14
4.1.4.6 Eviction 4–16
4.1.4.7 Fill 4–16
4.1.4.8 Other Instructions 4–16
4.1.5 L2-Cache Memory Coherency and Instruction Ordering 4–17
4.2 L2-Cache I/O LIST 4–18
8. DRAM Controller 8–1
8.1 Functional Description 8–1
8.1.1 Arbitration Priority 8–3
8.1.2 DRAM Controller State Diagrams 8–4
8.1.3 Programmable Features 8–5
8.1.4 Errors 8–6
8.1.5 Repeatability and Visibility 8–6
8.1.6 DDR-II Addressing 8–7
8.1.7 DDR-II Supported Features 8–8
8.2 I/O Signal List 8–9
Figures
FIGURE 2-2 Physical Location of Functional Units on an OpenSPARC T1 SPARC Core 2–3
FIGURE 2-18 IDIV Block Diagram 2–36
FIGURE 2-24 Multiply Function Result Generation Sequence Pipeline Diagram 2–44
FIGURE 2-28 TLU Role With Respect to All Other Blocks in a SPARC Core 2–52
FIGURE 2-30 Trap Flow With Respect to the Hardware Blocks 2–57
FIGURE 3-4 PCX Packet Transfer Timing – One Packet Request 3–27
FIGURE 3-6 CPX Packet Transfer Timing Diagram – One Packet Request 3–29
FIGURE 3-7 CPX Packet Transfer Timing Diagram – Two Packet Request 3–30
FIGURE 3-8 Timing Diagram - Third Speculative request is accepted by CCX 3–31
FIGURE 3-9 Timing Diagram - Third Speculative request is rejected and resent later 3–34
FIGURE 4-1 Flow Diagram and Interfaces for an L2-Cache Bank 4–3
FIGURE 5-2 IOB UCB Interface to and From the Cluster 5–4
Tables
TABLE 3-11 Floating-Point Return Data Field 3–16
TABLE 5-7 UCB No Payload Over an 8-Bit Interface Without Stalls 5–6
TABLE 5-8 UCB No Payload Over an 8-Bit Interface With Stalls 5–7
TABLE 7-2 SPARC V9 Single and Double Precision FPop Instruction Set 7–4
Preface
Chapter 3 describes the CPU-cache crossbar (CCX) unit and includes detailed CCX
block and timing diagrams.
Chapter 10 gives a functional description of the processor’s clock and test unit
(CTU).
Using UNIX Commands
This document might not contain information about basic UNIX® commands and
procedures such as shutting down the system, booting the system, and configuring
devices. Refer to the following for this information:
■ Software documentation that you received with your system
■ Solaris™ Operating System documentation, which is at:
http://docs.sun.com
Shell Prompts

Shell                                    Prompt
C shell                                  machine-name%
C shell superuser                        machine-name#
Bourne shell and Korn shell              $
Bourne shell and Korn shell superuser    #
OpenSPARC T1: http://www.opensparc.net/
Documentation: http://www.sun.com/documentation/
Support: http://www.sun.com/support/
Training: http://www.sun.com/training/
Third-Party Web Sites
Sun is not responsible for the availability of third-party web sites mentioned in this
document. Sun does not endorse and is not responsible or liable for any content,
advertising, products, or other materials that are available on or through such sites
or resources. Sun will not be responsible or liable for any actual or alleged damage
or loss caused by or in connection with the use of or reliance on any such content,
goods, or services that are available on or through such sites or resources.
OpenSPARC T1 Overview
The OpenSPARC T1 processor contains eight SPARC® processor cores, each of which
has full hardware support for four threads. Each SPARC core has an instruction
cache, a data cache, and fully associative instruction and data translation lookaside
buffers (TLBs). The eight SPARC cores are connected through a crossbar to an on-chip
unified level 2 cache (L2-cache).
The four on-chip dynamic random access memory (DRAM) controllers directly
interface to the double data rate-synchronous DRAM (DDR2 SDRAM). Additionally,
there is an on-chip J-Bus controller that provides an interconnect between the
OpenSPARC T1 processor and the I/O subsystem.
1.2 Functional Description
The features of the OpenSPARC T1 processor include:
■ 8 SPARC V9 CPU cores, with 4 threads per core, for a total of 32 threads
■ 132 Gbytes/sec crossbar interconnect for on-chip communication
■ 16 Kbytes of primary (Level 1) instruction cache per CPU core
■ 8 Kbytes of primary (Level 1) data cache per CPU core
■ 3 Mbytes of secondary (Level 2) cache – 4 way banked, 12 way associative shared
by all CPU cores
■ 4 DDR-II DRAM controllers – 144-bit interface per channel, 25 GBytes/sec peak
total bandwidth
■ IEEE 754 compliant floating-point unit (FPU), shared by all CPU cores
■ External interfaces:
■ J-Bus interface (JBI) for I/O – 2.56 Gbytes/sec peak bandwidth, 128-bit
multiplexed address/data bus
■ Serial system interface (SSI) for boot PROM
FIGURE 1-1 shows a block diagram of the OpenSPARC T1 processor illustrating the
various interfaces and integrated components of the chip.
Each SPARC core has a single-issue, six-stage pipeline. These six stages are:
1. Fetch
2. Thread Selection
3. Decode
4. Execute
5. Memory
6. Write Back
FIGURE 1-2 shows the SPARC core pipeline used in the OpenSPARC T1 Processor.
[FIGURE 1-2: SPARC core pipeline block diagram – thread-select and PC logic feeding fetch (I-cache, ITLB), thread select (instruction buffers x4), decode, execute (register files x4, ALU, shifter, multiplier, divider, crypto coprocessor), memory (D-cache, DTLB, store buffers x4), and the crossbar interface, with thread selection driven by instruction type, misses, traps and interrupts, and resource conflicts]
1. Instruction fetch unit (IFU) includes the following pipeline stages – fetch, thread
selection, and decode. The IFU also includes an instruction cache complex.
2. Execution unit (EXU) includes the execute stage of the pipeline.
3. Load/store unit (LSU) includes memory and writeback stages, and a data cache
complex.
4. Trap logic unit (TLU) includes trap logic and trap program counters.
5. Stream processing unit (SPU) is used for modular arithmetic functions for crypto.
The instruction cache complex has 16 Kbytes of data, is 4-way set associative with a
32-byte line size, and has a single-ported instruction tag. It also has a dual-ported
(1R/1W) valid-bit array that holds the valid/invalid state of each cache line.
Invalidates access the V-bit array, not the instruction tag. A pseudo-random
replacement algorithm is used to replace cache lines.
There is a fully associative instruction TLB with 64 entries. The buffer supports the
following page sizes: 8 Kbytes, 64 Kbytes, 4 Mbytes, and 256 Mbytes. The TLB uses
a pseudo least recently used (LRU) algorithm for replacement. Multiple hits in the
TLB are prevented by doing an autodemap on a fill.
Two instructions are fetched each cycle, though only one instruction is issued per
clock, which reduces the instruction cache activity and allows for an opportunistic
line fill. There is only one outstanding miss per thread, and only four per core.
Duplicate misses do not issue requests to the L2-cache.
The integer register file (IRF) of the SPARC core has 5 Kbytes with 3 read/2 write/1
transport ports. There are 640 64-bit registers with error correction code (ECC). Only
32 registers from the current window are visible to the thread. A window change
occurs in the background during a thread switch, while the other threads continue to
access the IRF (the IRF provides single-cycle read/write access).
The load/store unit (LSU) has an 8-entry store buffer per thread, unified into a
single 32-entry array, with RAW bypassing. Only a single outstanding load per
thread is allowed. Duplicate requests for the same line are not sent to the L2-cache.
The LSU has logic to interface to the CPU-cache crossbar (CCX). This interface
performs the following operations:
■ Prioritizes the requests to the crossbar for floating-point operation (Fpops),
streaming operations, I$ and D$ misses, stores and interrupts, and so on.
■ Request priority: imiss>ldmiss>stores,{fpu,stream,interrupt}.
■ Assembles packets for the processor-cache crossbar (PCX).
The LSU handles returns from the CPX crossbar and maintains the order for cache
updates and invalidates.
[Figure: CPU-cache crossbar connecting cores 0 through 7 to the L2-cache banks and the FPU/CRI, with return paths from the L2-cache, FPU, and CRI to the cores]
The L2-cache has a 64-byte line size, with 64 bytes interleaved between banks. Pipeline
latency in the L2-cache is 8 clocks for a load and 9 clocks for an I-miss, with the critical
chunk returned first. Sixteen outstanding misses per bank are supported, for a total of
64 misses. Coherence is maintained by shadowing the L1 tags in an L2-cache directory
structure (the L2-cache is a point of global visibility). DMA from the I/O is serialized
with respect to the traffic from the cores in the L2-cache.
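As a rough illustration of the 64-byte bank interleaving described above, the following sketch computes which of the four L2-cache banks a physical address maps to. The choice of address bits [7:6] as the bank select (the two bits just above the 64-byte line offset) is an assumption made for this example, not a detail given in this text.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: pick the L2-cache bank for a physical address,
 * assuming a 64-byte line (offset bits [5:0]) and four banks selected
 * by the next two address bits [7:6]. */
static unsigned l2_bank_of(uint64_t pa)
{
    return (unsigned)((pa >> 6) & 0x3u);
}

int main(void)
{
    /* Consecutive 64-byte lines rotate through the four banks. */
    for (uint64_t pa = 0; pa < 4 * 64; pa += 64)
        printf("PA 0x%03llx -> L2 bank %u\n",
               (unsigned long long)pa, l2_bank_of(pa));
    return 0;
}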
The L2-cache directory shadows the L1 tags. The L1 set index and the L2-cache bank
interleaving are such that one fourth of the L1 entries come from each L2-cache bank.
On an L1 miss, the L1 replacement way and set index identify the physical location of
the tag that will be updated by the miss address. On a store, the directory will be
CAM'd. The directory entries are collated by set, so only 64 entries need to be
CAM'd. This scheme is quite power efficient. Invalidates are a pointer to the
physical location in the L1-cache, eliminating the need for a tag lookup in the L1-cache.
The OpenSPARC T1 processor uses DDR2 DIMMs and can support one or two ranks
of stacked or unstacked DIMMs. Each DRAM bank/port is two-DIMMs wide (128-
bit + 16-bit ECC). All installed DIMMs must be identical, and the same number of
DIMMs (that is, ranks) must be installed on each DRAM controller port. The DRAM
controller frequency is an exact ratio of the core frequency, where the core frequency
must be at least three times the DRAM controller frequency. The double data rate
(DDR) data buses transfer data at twice the frequency of the DRAM controller
frequency.
The OpenSPARC T1 processor can support memory sizes of up to 128 Gbytes with a
25 Gbytes/sec peak bandwidth limit. Memory access is scheduled across 8 reads
plus 8 writes, and the processor can be programmed into a two-channel mode for a
reduced configuration. Each DRAM channel has a 128-bit data and 16-bit ECC
interface, with chipkill support, nibble error correction, and byte error detection.
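The ~25 Gbytes/sec peak figure follows from the channel width and the double data rate; the sketch below reproduces that arithmetic, assuming a 200 MHz DRAM controller clock (DDR2-400). The clock frequency is an assumed operating point for illustration and is not stated in this section.

#include <stdio.h>

int main(void)
{
    const double ctrl_mhz  = 200.0;  /* assumed DRAM controller clock (DDR2-400) */
    const int    channels  = 4;      /* four DDR-II DRAM controllers             */
    const int    data_bits = 128;    /* per-channel data width, excluding ECC    */

    double bytes_per_transfer = data_bits / 8.0;          /* 16 bytes              */
    double transfers_per_sec  = ctrl_mhz * 1e6 * 2.0;     /* DDR: both clock edges */
    double peak = channels * bytes_per_transfer * transfers_per_sec / 1e9;

    /* 4 channels * 16 bytes * 400M transfers/s = 25.6 Gbytes/sec */
    printf("peak memory bandwidth ~= %.1f Gbytes/sec\n", peak);
    return 0;
}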
The J-Bus interface is the functional block that interfaces to the J-Bus, receiving and
responding to DMA requests, routing them to the appropriate L2 banks, and also
issuing PIO transactions on behalf of the processor threads and forwarding
responses back.
SPARC Core
An OpenSPARC T1 processor contains eight SPARC cores, and each SPARC core has
several functional units. These SPARC core units are described in the following
sections:
■ Section 2.1, “SPARC Core Overview and Terminology” on page 2-2
■ Section 2.2, “SPARC Core I/O Signal List” on page 2-5
■ Section 2.3, “Instruction Fetch Unit” on page 2-6
■ Section 2.4, “Load Store Unit” on page 2-22
■ Section 2.5, “Execution Unit” on page 2-34
■ Section 2.6, “Floating-Point Frontend Unit” on page 2-36
■ Section 2.7, “Multiplier Unit” on page 2-38
■ Section 2.8, “Stream Processing Unit” on page 2-39
■ Section 2.9, “Memory Management Unit” on page 2-44
■ Section 2.10, “Trap Logic Unit” on page 2-51
■ Section 2.11, “Core Debug Features” on page 2-69
2.1 SPARC Core Overview and Terminology
FIGURE 2-1 presents a high-level block diagram of a SPARC core, and FIGURE 2-2
shows the general physical location of these units on an example core.
[FIGURE 2-1 and FIGURE 2-2: SPARC core block diagram and physical layout – strand instruction registers and scheduler, I-cache, and decode (IFU); register files and ALU (EXU); store buffers and D-cache (LSU); MMU, trap unit, and external interface]
Thread – A thread is a hardware strand (thread and strand are used interchangeably
in this chapter). Each thread, or strand, has a unique set of resources in support of
its execution, while multiple threads, or strands, within the same SPARC core share
a set of common resources in support of their execution. The per-thread resources
include registers, a portion of the I-fetch datapath, the store buffer, and the miss
buffer. The shared resources include the pipeline registers and datapath, caches,
translation lookaside buffers (TLBs), and the execution unit of the SPARC core
pipeline.

ST – Single threaded.

MT – Multi-threaded.

Hypervisor (HV) – The hypervisor is the layer of system software that interfaces
with the hardware.

Supervisor (SV) – The supervisor is the layer of system software, such as the
operating system (OS), that executes with privilege.

Long latency instruction (LLI) – An LLI is an instruction that takes more than one
SPARC core clock cycle to make its results visible to the next instruction.
FIGURE 2-3 shows the view from virtualization, which illustrates the relative
privileges of the various software layers.

[FIGURE 2-3: virtualization layers – applications run on OS instance 1 and OS instance 2, which run over the hypervisor on the OpenSPARC T1 hardware]
The I-cache access and the ITLB access take place in the fetch stage. A thread
(hardware strand) is picked in the thread selection stage. The instruction decoding
and register file access occur in the decode stage. The branch evaluation takes place
in the execution stage. The access to memory and the actual writeback are done in
the memory and writeback stages. FIGURE 2-4 illustrates the SPARC core pipeline
and support structures.
[FIGURE 2-4: SPARC core pipeline and support structures – the same pipeline block diagram as FIGURE 1-2, showing thread-select and PC logic, I-cache/ITLB, instruction buffers x4, decode, register files x4, ALU/shifter/multiplier/divider, D-cache/DTLB, store buffers x4, and the crossbar interface]
The instruction fill queue (IFQ) feeds into the I-cache. The missed instruction list
(MIL) stores the addresses that missed the I-cache and the ITLB, and the MIL feeds
into the load store unit (LSU) for further processing. The instruction buffer is two
levels deep, and it includes the thread instruction register (TIR) and the next
instruction register (NIR). The thread selection and scheduler stage (S-stage) resolves
the arbitration among the TIR, NIR, branch-PC, and trap-PC to pick one thread and
send it to the decode stage (D-stage). FIGURE 2-5 shows the support structure for this
portion of the thread pipeline.
FIGURE 2-5 Frontend of the SPARC Core Pipeline
There is one program counter (PC) register per thread. The next-program counter
(NPC) could come from one of these sources:
1. Branch
2. TrapPC
3. Trap NPC
5. PC + 4
The IFU tracks the PC and NPC through the W-stage. The last retired PC will be saved
in the trap logic unit (TLU), and, if a trap occurs, it will also be saved in the trap
stack.
There is a separate array for valid bit (V-bit). This V-bit array holds the cache line
state of either valid or invalid, and the array has one read port and one write port
(1R1W). The cache line invalidation only accesses the V-bit array, and the cache line
replacement policy is pseudo-random.
The read access to the I-cache has a higher priority than the write access. The ASI
read and write accesses to the I-cache are set to lower priorities. The completion of
the ASI accesses is opportunistic, and there is a fairness mechanism built in to
prevent the starvation of service to ASI accesses.

The maximum wait period for a write access to the I-cache is 25 SPARC core clock
cycles. A wait longer than 25 clock cycles will stall the SPARC core pipeline in order
to allow the I-cache write access to complete.
FIGURE 2-6 I-Cache Fill Path
The I-cache line size is 32 bytes, and a normal I-cache fill takes two CPX packets of
16 bytes each. The instruction fill queue (IFQ) has a depth of two. An I-cache line
will be invalidated when the first CPX packet is delivered and filled in the I-cache.
That cache line will be marked as valid when the second CPX packet is delivered
and filled. I-cache control guarantees the atomicity of the I-cache line fill action
between the two halves of the cache line being filled.
An instruction fetch from the boot PROM, by way of the system serial interface (SSI),
is a very slow transaction. The boot PROM is a part of the I/O address space. All
instruction fetches from the I/O space are non-cacheable. The boot PROM fetches
only one 4-byte instruction at a time. This 4-byte instruction is replicated four times
during the formation of the CPX packet. Only one CPX packet of non-cacheable data
is returned for such a fetch.
The load store unit (LSU) initiates all ASI accesses. The LSU serializes all ASI
accesses so that the second access will not be launched until the first access has been
acknowledged. ASI accesses tend to be slow, and data for an ASI read will be sent
back later.
Level 2 cache invalidations will always undergo a CPU-ID check in order to ensure
that the invalidation packet is indeed meant for the specified SPARC core. In the
following cases, an invalidation could be addressed to any SPARC core:
■ A single I-cache line invalidation due to a store acknowledgement, or due to a
load exclusivity requiring the invalidation of the other level 1 I-caches (as results
from self-modifying code).
■ Invalidating two I-cache lines because of a cache-line eviction in the level 2 cache
(L2-cache).
■ Invalidating all ways in a given set due to error conditions, such as encountering
a tag ECC error in a level 2 cache line.
PA
Cmp MIL
RR arb
pcxpkt
to LSU
FIGURE 2-7 I-Cache Miss Path
The MIL keeps track of the physical address (PA) of an instruction that missed the
I-cache. A second PA that matches the PA of an already pending I-cache miss will
cause the second request to be put on hold and marked as a child of the pending
I-cache miss request. The child request will be serviced when the pending I-cache
miss receives its response. The MIL uses a linked list to track and service the
duplicated I-cache miss request. The depth for such a linked list is four.
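To make the parent/child bookkeeping concrete, here is a minimal data-structure sketch of the missed-instruction-list tracking described above. The entry layout, field names, and allocation function are illustrative assumptions; only the four-deep chaining and the rule that duplicate misses do not issue a second L2 request come from the text.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MIL_DEPTH 4   /* the linked list of duplicate misses is four deep */

/* Hypothetical MIL entry: a pending I-cache miss plus a link to a
 * duplicate ("child") request waiting on the same physical address. */
struct mil_entry {
    bool     valid;
    uint64_t pa;      /* physical address that missed the I-cache              */
    int      child;   /* index of the dependent duplicate request, -1 if none  */
};

/* Record a miss. If a matching PA is already pending, chain the new request
 * behind it (no second L2 request is issued); otherwise allocate a fresh
 * entry. Returns the slot used, or -1 if the MIL is full. */
static int mil_record(struct mil_entry mil[MIL_DEPTH], uint64_t pa)
{
    int parent = -1, slot = -1;
    for (int i = 0; i < MIL_DEPTH; i++) {
        if (mil[i].valid && mil[i].pa == pa)
            parent = i;
        else if (!mil[i].valid && slot < 0)
            slot = i;
    }
    if (slot < 0)
        return -1;                              /* MIL full */
    mil[slot] = (struct mil_entry){ .valid = true, .pa = pa, .child = -1 };
    if (parent >= 0) {
        while (mil[parent].child >= 0)          /* append to the chain tail */
            parent = mil[parent].child;
        mil[parent].child = slot;               /* serviced when the parent fills */
    }
    return slot;
}

int main(void)
{
    struct mil_entry mil[MIL_DEPTH] = {0};
    printf("parent slot %d\n", mil_record(mil, 0x1000));  /* new miss            */
    printf("child slot  %d\n", mil_record(mil, 0x1000));  /* duplicate: chained  */
    return 0;
}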
1. Make request.
3. Fill the first 16 bytes of data. The MIL sends a speculative completion notification
to the thread scheduler at the completion of filling the first 16 bytes.
4. Fill the second 16 bytes of data. The MIL sends a completion notification to the
thread scheduler at the completion of filling the second 16 bytes.
5. Done.
An I-cache miss request could be canceled because of, for example, a trap. The MIL
still goes through the motions of filling a cache line but it does not bypass it to the
thread instruction register (TIR). A pending child request must be serviced even if
the original parent I-cache miss request was cancelled.
When a child I-cache miss request crosses with a parent I-cache miss request, the
child request might not be serviced before the I-cache fill for the parent request
occurs. The child instruction fetch is then retried (rolled back) to the F-stage to
allow it to access the I-cache. This kind of case is referred to as a miss-fill crossover.
FIGURE 2-8 illustrates the structure of an integer architectural register file (IARF) and
an integer working register file (IWRF).
[FIGURE 2-8: integer register file windows – adjacent windows w(n+1), w(n), and w(n-1) with their ins[0-7], locals[0-7], and outs[0-7] registers, the transfer port between the architectural and working register files on call/return, and read/write access from the pipeline]
Each thread requires 128 registers for the eight windows (with 16 registers per
window), and four sets of global registers with eight global registers per set. There
are 160 registers per thread, and there are four threads per SPARC core. There are a
total of 640 registers per SPARC core.
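The totals above follow from simple arithmetic; this minimal check reproduces the 160-registers-per-thread and 640-registers-per-core figures from the window and global-set counts given in the text.

#include <stdio.h>

int main(void)
{
    const int windows     = 8;   /* register windows per thread           */
    const int per_window  = 16;  /* registers per window, as stated above */
    const int global_sets = 4;   /* sets of global registers per thread   */
    const int per_set     = 8;   /* global registers per set              */
    const int threads     = 4;   /* threads per SPARC core                */

    int per_thread = windows * per_window + global_sets * per_set;  /* 160 */
    int per_core   = per_thread * threads;                          /* 640 */

    printf("%d registers per thread, %d per SPARC core\n", per_thread, per_core);
    return 0;
}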
Only 32 registers from the current window are visible to the thread. A window
change occurs in the background under thread switching while the other threads
continue to access the integer register file.
The ITLB contains 64 entries. The replacement policy is a pseudo least recently used
(pseudo-LRU) policy, which is the same policy as that for the I-cache.
The ITLB supports page sizes of 8 Kbytes, 64 Kbytes, 4 Mbytes, and 256 Mbytes.
Multiple hits in the ITLB are prevented by the autodemap feature in an ITLB fill.
1. The thread is executing one of the long latency instructions, such as load, branch,
multiplication, division, and so on.
2. The SPARC core pipeline has been stalled due to one of the long latency
operations, such as encountering a cache miss, taking a trap, or experiencing a
resource conflict.
[Figure: thread Idle/Active/Halt state transitions – reset and idle events, resume, any interrupt, and the halt instruction move a thread among the idle, active, and halt states]
A thread in the idle state should not receive the resume command without a previous
reset. When this rule is violated, the integrity of the hardware behavior cannot be
guaranteed.
[Figure: thread Wait/Run state transitions – a running thread is switched out on a trap, load miss, or long latency/resource conflict, and returns to the run state when the operation completes and the thread is scheduled]
An active thread could be placed in the wait state because of any of the following
reasons:
3. Wait due to long latency, or a resource conflict where all resource conflicts arise
because of long latency.
FIGURE 2-11 illustrates the state transition for a thread in speculative states.
[FIGURE 2-11: speculative thread state transitions – Rdy, Run, SpecRdy, SpecRun, and Wait states, with schedule, switched out, really done, wrong speculation, and long latency/resource conflict events]
The fairness scheme for threads in the Run state or the SpecRun state is a round-
robin algorithm with the least recently executed thread winning the selection.
1. All of the stall conditions, or switch conditions, were not known at the time of the
scheduling.
Rolled back instructions must be restarted from the S-stage or F-stage of the SPARC
core pipeline. FIGURE 2-12 illustrates the pipeline graph for the rollback mechanism.
[FIGURE 2-12: pipeline stages F, S, D, E, M, and W with the rollback paths listed below]
1. E to S and D to F
2. D to S and S to F
3. W to F
The privilege is checked in D-stage of the SPARC core pipeline. Some instructions
can only be executed with hypervisor privilege or with supervisor privilege.
The branch condition is also evaluated in the D-stage, and the decision for annulling
a delay slot is made in this stage as well.
When executing in the hypervisor (HV) state, an interrupt with a supervisor (SV)
privilege will not be serviced at all. Execution in the hypervisor state shall not be
blocked by anything with supervisor privilege.

Some interrupts are asserted by a level while others are asserted by a pulse. The IFU
remembers the form in which the interrupts originated in order to preserve the
integrity of the scheduling.
The instruction translation lookaside buffer (ITLB) array is parity protected without
an error-correction mechanism, so all errors are fatal.
All on-core errors, and some of the off-core errors, are logged in the per-thread error
registers. Refer to the Programmer’s Reference Manual for details.
The instruction fetch unit (IFU) maintains the error injection and the error enabling
registers, which are accessible by way of ASI operations.
Critical states (such as program counter (PC), thread state, missed instruction list
(MIL), and so on) can be snapped and scanned out on-line. This process is referred
to as a shadow scan.
The threaded architecture of the LSU can process four loads, four stores, one fetch,
one FP operation, one stream operation, one interrupt, and one forward packet.
Therefore, thirteen sources supply data to the LSU.
The LSU implements the ordering for memory references, whether locally or not.
The LSU also enforces the ordering for all the outbound and inbound packets.
Stage   Activity
E       Cache setup, TLB setup
M       Cache/tag read, TLB read
W       Store buffer lookup, traps, bypass
W2      PCX request generation, writeback to the cache
The cache access set-up and the translation lookaside buffer (TLB) access set-up are
done during the pipeline’s E-stage (execution). The cache/tag/TLB read operations
are done in the M-stage (memory access). The W-stage (writeback) supports the
look-up of the store buffer, the detection of traps, and the execution of the data
bypass. The W2-stage (writeback-2) is for generating PCX requests and writebacks to
the cache.
Load misses are kept in the load miss queue (LMQ), which is shared by other
opcodes such as atomics and prefetch. The LMQ supports one outstanding
load miss per thread. Load misses with duplicated physical addresses (PA) will not
be sent to the level 2 (L2) cache.
Inbound packets from the CCX are queued and ordered for distribution to other
units through the data fill queue (DFQ).
The DTLB is fully associative, and it is responsible for the address translations. All
CAM/RAM translations are single-cycle operations.
The ASI operations are serialized through the LSU. They are sequenced through the
ASI queue to the destination units on the chip.
A cacheable load miss will allocate a line, and stores follow a write-through policy.
Stores do not allocate; a local store will update the L1 D-cache only if the line is
present in the L1 D-cache, as determined by the level 2 (L2) cache directory. If the
line is deemed not present in the L1 D-cache, the local store will cause the line to be
invalidated. The line replacement policy is pseudo-random, based on a linear
feedback shift register. The data from the bypass queues is multiplexed into the L1
D-cache in order to be steered to the intended destination. The D-cache supports up
to four simultaneous invalidates from the data evictions.
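Pseudo-random replacement of this kind is commonly derived from a small linear feedback shift register; the sketch below shows one plausible way to pick a victim way from an LFSR. The 8-bit width and tap polynomial are illustrative choices, not the values used in the actual design.

#include <stdint.h>
#include <stdio.h>

/* Illustrative 8-bit Fibonacci LFSR using taps 8,6,5,4 (a maximal-length
 * polynomial); the real hardware's width and taps are not given here. */
static uint8_t lfsr_step(uint8_t s)
{
    uint8_t bit = (uint8_t)(((s >> 7) ^ (s >> 5) ^ (s >> 4) ^ (s >> 3)) & 1u);
    return (uint8_t)((s << 1) | bit);
}

int main(void)
{
    uint8_t lfsr = 0xA5u;                       /* any non-zero seed           */
    for (int fill = 0; fill < 8; fill++) {
        lfsr = lfsr_step(lfsr);
        unsigned way = lfsr & 0x3u;             /* 4-way D-cache: low two bits */
        printf("fill %d replaces way %u\n", fill, way);
    }
    return 0;
}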
Each line in the L1 D-cache is parity protected. A parity error will cause a miss in the
L1 D-cache which, in turn, will cause the correct data to be brought back from the
L2-cache.
In addition to the pipeline reads, the L1 D-cache can also be accessed by way of
diagnostic ASI operations, BIST operations, and RAMtest operations through the test
access port (TAP).
The TTE tag and the TTE data are both parity protected and errors are uncorrectable.
TTE access parity errors for load instructions will cause a precise trap. TTE access
parity errors for store instructions will cause a deferred trap (that is, the generation
of the trap will be deferred to the instruction following the store instruction).
However, the trap PC delivered to the system software still points to the store
instruction that encountered the parity error in the TTE access. Therefore, the
deferred action of the trap generation will still cause a precise trap from the system
software perspective.
All stores reside in the store buffer until they are ordered following a total store
ordering (TSO) model and have updated the L1D (level 1 D-cache). The lifecycle of a
TSO-compliant store follows these four stages:
1. Valid
Non-TSO-compliant stores, such as blk-init and other flavors of bst (block store), will
not follow the preceding life-cycle. A response from the L2-cache is not required
before releasing the non-TSO-compliant stores from the store buffer.
Atomic instructions such as CAS, LDSTUB, and SWAP, as well as flush instructions,
can share the store buffer.
The store buffer implements partial and full read after write (RAW) checking. Full-
RAW data will be returned to the register files from the pipe. Partial RAW hits will
force the load to access the L2-cache while interlocked with the store issued to the
CCX. Multiple hits in the store buffer will always force access to the L2-cache in
order to enforce data consistency.
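As a rough behavioral model of the RAW checking described above, the sketch below classifies a load against the buffered stores of its thread. The byte-mask representation, entry layout, and function name are illustrative assumptions; loads that cross an 8-byte boundary are ignored for simplicity.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

enum raw_result { RAW_NONE, RAW_FULL, RAW_PARTIAL, RAW_MULTI };

/* Hypothetical store-buffer entry: an 8-byte-aligned address plus a
 * per-byte mask of the bytes the store wrote. */
struct stb_entry {
    bool     valid;
    uint64_t dword_addr;   /* byte address >> 3 */
    uint8_t  byte_mask;
};

/* Classify a load against the store buffer:
 *   RAW_FULL    - one store covers every load byte: bypass data from the STB
 *   RAW_PARTIAL - some but not all load bytes hit:  load must go to the L2-cache
 *   RAW_MULTI   - more than one store hits:         load must go to the L2-cache */
static enum raw_result stb_raw_check(const struct stb_entry *stb, int entries,
                                     uint64_t load_dword, uint8_t load_mask)
{
    int hits = 0;
    uint8_t covered = 0;
    for (int i = 0; i < entries; i++) {
        if (stb[i].valid && stb[i].dword_addr == load_dword &&
            (stb[i].byte_mask & load_mask) != 0) {
            hits++;
            covered |= (uint8_t)(stb[i].byte_mask & load_mask);
        }
    }
    if (hits == 0) return RAW_NONE;
    if (hits > 1)  return RAW_MULTI;
    return (covered == load_mask) ? RAW_FULL : RAW_PARTIAL;
}

int main(void)
{
    struct stb_entry stb[2] = {
        { .valid = true, .dword_addr = 0x100, .byte_mask = 0x0F },  /* 4-byte store */
    };
    printf("%d\n", stb_raw_check(stb, 2, 0x100, 0x0F));  /* RAW_FULL (1)    */
    printf("%d\n", stb_raw_check(stb, 2, 0x100, 0xFF));  /* RAW_PARTIAL (2) */
    return 0;
}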
If a store hits any part of a quad-load (16-byte access), the quad-load checking will
force the serialization of the issue to the CCX. This forced serialization ensures that
there will be no bypass operation.
Instructions such as a blk-load (64-byte access) will not detect the potential store
buffer hit on the 64-byte boundary. The software must guarantee the data
consistency using membar instructions.
Load requests to the L2-cache from different addresses can alias to the same L2-
cache line. Primary versus secondary checking will be performed in order to prevent
potential duplication in the L2-cache tags.
The latencies for completing different load instructions may differ (for example, a
quad-load fill will have to access integer register file (IRF) twice).
The LMQ is also leveraged by other instructions. For example, the first packet of a
CAS instruction will be issued out of the store buffer while the second packet will be
issued out of the LMQ.
The 13 sources are further divided into four categories of different priorities. The
I-cache miss handling is one category. The load instructions (one outstanding per
thread) are one category. The store instructions (one outstanding per thread) are
another category. The rest of the accesses are lumped into one category, which
includes the FPU access, SPU access, interrupts, and forwarded packets.
The arbitration is done within the category first and then among the categories. An
I-cache fill is at the highest priority, while all other categories have an equal
priority. The priorities can be illustrated in this order (a simplified model is
sketched after this list):
1. I-cache miss
2. Load miss
3. Stores
4. Others (FPU access, SPU access, interrupts, and forwarded packets)
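A simplified behavioral model of that two-level pick is sketched below: a round-robin choice of thread within each category, then a fixed-priority choice across the categories in the order listed above. The structure and names are illustrative; the real arbiter is not specified here.

#include <stdbool.h>
#include <stdio.h>

#define THREADS 4

/* Request categories, in the priority order listed above. */
enum pcx_cat { CAT_IMISS, CAT_LOAD, CAT_STORE, CAT_OTHER, CAT_COUNT };

/* Hypothetical arbiter state: a pending flag per category per thread,
 * plus a round-robin pointer per category. */
struct pcx_arb {
    bool pending[CAT_COUNT][THREADS];
    int  rr[CAT_COUNT];
};

/* Pick a winner: round-robin among threads within a category, fixed
 * priority across categories. Returns the category (or -1 if idle) and
 * writes the winning thread to *thread. */
static int pcx_pick(struct pcx_arb *a, int *thread)
{
    for (int c = 0; c < CAT_COUNT; c++) {
        for (int i = 0; i < THREADS; i++) {
            int t = (a->rr[c] + i) % THREADS;    /* least recently granted first */
            if (a->pending[c][t]) {
                a->pending[c][t] = false;
                a->rr[c] = (t + 1) % THREADS;
                *thread = t;
                return c;
            }
        }
    }
    return -1;
}

int main(void)
{
    struct pcx_arb arb = {0};
    arb.pending[CAT_STORE][2] = true;
    arb.pending[CAT_IMISS][1] = true;

    int thread;
    int cat = pcx_pick(&arb, &thread);
    printf("granted category %d, thread %d\n", cat, thread); /* I-miss, thread 1 */
    return 0;
}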
There are five possible targets, which include four L2-cache banks and one I/O
buffer (IOB). The FPU access shares the path through the IOB.
Speculation on the PCX availability does occur, and a history will be established
once the speculation is known to be correct.
A store to the D-cache is not allowed to bypass another store to the D-cache. Store
operations to different caches can bypass each other without violating the total store
ordering (TSO) model.
Interrupts are allowed to be delivered to the TLU only after all the prior invalidates
have become visible in their respective caches. An acknowledgement to a local I-flush
is treated the same way as an interrupt.
The bypass queue handles all of the load reference data, other than that received
from the L2-cache, that must be asynchronously written to the integer register file
(IRF). This kind of read data includes full-RAW data from the store buffer, ldxa to the
internal ASI data, store data for casa, a forward packet for the ASI transactions, as
well as the pending precise traps.
SWAP and LDSTUB are single packet requests to the PCX, and they reside in the
store buffer.
A parity error on a store to the DTLB will cause a deferred trap. It will be reported
on the follow-up membar #sync. The trap PC in this case will point to the store
instruction encountering the parity error when storing to the DTLB. The deferred
trap will look like a precise trap to the system software because of the way the
hardware supports the recording of the precise trap PC.
An interrupt is treated similarly to a membar. It will be sent to the PCX once the store
buffer of the corresponding thread has been drained. This interrupt will then
immediately be acknowledged to the TLU.
After the interrupt packet has been dispatched by way of the CPU-cache crossbar
(CCX), the packet will be executed on the destination thread of a SPARC core. It can
be delivered after all prior invalidates have completed and their results have arrived
at the L1 D-cache (L1D).
The flush is issued as an interrupt with the flush bit set, which causes the L2-cache
to broadcast the packet to all SPARC cores.
For the SPARC cores that did not issue the flush, the DFQ will serialize the flushes
so that the order of the issuing thread's actions, relative to the flushes, will be
preserved.
The LSU supports a total of eight outstanding prefetch instructions across all four
threads. The LSU keeps track of the number of outstanding prefetches per thread,
which limits the number of outstanding prefetches.
The LSU breaks up the 64-byte access of a blk-ld instruction into four 16-byte load
packets so that they can access the processor-to-cache interface (PCX). The level 2
cache returns four 16-byte packets, which in turn cause eight 8-byte data transfers to
the floating-point register file (FRF). Errors are reported on the last packet. A blk-ld
instruction could cause a partial update to the FRF; software must be written to
retry the instruction later.
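To make the packet accounting explicit, the short sketch below walks one 64-byte blk-ld through its four 16-byte PCX load packets and the eight 8-byte register-file transfers they produce; the loop structure and the base address are purely illustrative.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t base = 0x10000;   /* assumed 64-byte-aligned blk-ld address */

    for (int pkt = 0; pkt < 4; pkt++) {                      /* four 16-byte loads    */
        uint64_t pcx_addr = base + 16u * (unsigned)pkt;
        printf("PCX load : 16 bytes at 0x%llx\n", (unsigned long long)pcx_addr);
        for (int half = 0; half < 2; half++)                 /* two 8-byte FRF writes */
            printf("  FRF write: 8 bytes at 0x%llx\n",
                   (unsigned long long)(pcx_addr + 8u * (unsigned)half));
    }
    /* Total: four packets returned by the L2-cache and eight FRF transfers;
     * any error is reported on the last packet. */
    return 0;
}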
A blk-st instruction will be unrolled into eight helper instructions by the floating-
point frontend unit (FFU) for a total of a 64-byte data transfer. Each 8-byte datum
gets an entry in the corresponding thread's store buffer. The blk-st instructions are
non-TSO compliant, so the software must do the ordering.
The blk-init load instructions must be quad-word accesses, and violating this rule
will cause a trap. Like quad-load instructions, blk-init loads also send double-pump
writes (8-byte access) to the integer register file (IRF) when a blk-init load packet
reaches the head of the data fill queue (DFQ).
The blk-init stores are also non-TSO compliant, which allows for greater write
throughput and higher-performance yields for the block-copy routine.
The store buffer is not looked up by strm-ld instructions, and the store buffer does
not buffer strm-st data. Software must be written to enforce the ordering and
maintain data coherency.
The acknowledgements for strm-st instructions will be ordered through the data fill
queue (DFQ) upon the return to the stream processing unit (SPU). The
corresponding store acknowledgement (st ack) will be sent to the SPU once the
level 1 D-cache (L1D) invalidation, if any, has been completed.
A forward reply will be sent back to the I/O bridge (IOB) once the data is read or
written. A SPARC core might further forward the request to the L2-cache for an
access to the control status register (CSR). The I/O bridge only supports one
outstanding forward access at any time.
The trap logic unit (TLU) gathers traps from all functional units except the LSU, and
it then sends them to the LSU. The LSU ORs all of them (plus its own) together and
then broadcasts the result across the entire chip.
The LSU can also send a truncated flush for the internal ASI ld/st to the TLU, the
MMU, and the SPU.
[Figure: execution unit datapath – the ECL and SHFT blocks, with load and MUL data and the opcode as inputs, and the read data returned to the register file (RF)]
The execution control logic (ECL) block generates the necessary select signals that
control the multiplexors, keeps track of the thread and reads of each instruction, and
implements the bypass logic. The ECL also generates the write-enables for the
integer register file (IRF). The bypass logic block does the operand bypass from the
E, M, and W stages to the D stage. Results of long latency operations such as load,
mul, and div, are forwarded from the W stage to the D stage. The condition codes
are bypassed similar to the operands, and bypassing of the FP results and writes to
the status registers are not allowed.
The shifter block (SHFT) implements 0- to 63-bit shifts, and FIGURE 2-16 illustrates
the top-level block diagram of the shifter.

[FIGURE 2-16: shifter block diagram – 32-bit select, mask/sign extend, and left-shift stages]
The arithmetic and logic unit (ALU) consists of an adder and logic operations such
as – ADD, SUB, AND, NAND, OR, NOR, XOR, XNOR, and NOT. The ALU is also
reused when calculating the branch address or a virtual address. FIGURE 2-17
illustrates the top level block diagram of the ALU.
[FIGURE 2-17: ALU block diagram – adder with sum prediction, logic block, shift result, branch PC output (exu_ifu_brpc_e), PR/SR output, and zero detection for the condition code (cc.Z) and register (regz)]
[FIGURE 2-18: IDIV block diagram – dividend/quotient and divisor registers, input queue, XOR, and adder]
When either IMUL or IDIV is occupied, a thread issuing a MUL or DIV instruction
will be rolled back and switched out.
[Figure: floating-point frontend unit and multiplier – FFU_CTL, FFU_DP, and FFU_VIS blocks around the 78-bit-wide FRF (din/dout), and the shared multiplier (sparc_mul_top) with data-in/control and data-out connections to the EXU and SPU]
The SPU shares the integer multiplier with the execution unit (EXU) for the modular
arithmetic (MA) operations. The SPU itself supports full modular exponentiation.
While the SPU facility is shared among all threads of a SPARC core, only one thread
can use the SPU at a time. An SPU operation is set up by a thread storing to a
control register and then returning to normal processing. The SPU will initiate
streaming load or streaming store operations to the level 2 cache (L2) and compute
operations to the integer multiplier. Once the operation is launched, it can operate in
parallel with SPARC core instruction execution. The completion of the operation is
detected by polling (synchronous fashion) or by an interrupt (asynchronous fashion).
[Table: operand usage for the modular arithmetic operations MA_EXP, MA_MUL, and MA_RED (fields A, B, E, M, N, R, X, RSVD)]
[Figure: SPU block diagram – SPU_CTL, SPU_MADP, and SPU_MAMEM (bw_r_idct), with interfaces to the IFU, EXU, LSU, TLU, the shared multiplier (MUL), and the CPX]
Write accesses to the MA memory can be on either the 16-byte boundary or the
8-byte boundary. Read accesses to the MA memory must be on the 8-byte boundary.
An ldxa to the MA registers is blocking. All except an ldxa to the MA_Sync register
will respond immediately. An ldxa to the MA_Sync register will return a 0 to the
destination register upon the operation's completion. The thread ID of this ldxa
should be equal to that stored in the thread ID field of the MA_CTL register.
Otherwise, the SPU will respond immediately and signal the LSU not to update the
register file. When an MA operation is aborted, the pending ldxa to MA_Sync is
unblocked, and the SPU signals the LSU not to update the register file.
[Figure: SPU operation state machine – Idle, Wait, and Abort states, entered on an ma_op]
An MA_ST operation is started with a stxa to the MA_CTL register with the opcode
field equal to MA_ST; the length field specifies the number of words to send to the
level 2 cache (L2-cache). The SPU sends a processor-to-cache interface (PCX) request
to the LSU and waits for an acknowledgement from the LSU prior to sending another
request. If needed, store acknowledgements, which are returned from the L2-cache
on the cache-to-processor interface (CPX), will go to the LSU in order to invalidate
the level 1 D-cache (L1D). The LSU will then send the SPU an acknowledgement. The
SPU then decrements a local counter, waits for all the stores sent out to be
acknowledged, and transitions to the done state.
On a read from the MA Memory, the operation will be halted if a parity error is
encountered. The SPU waits for all posted stores to be acknowledged. If the Int bit is
cleared (Int = 0), the SPU will signal the LSU and the IFU on all ldxa to the MA
registers.
Any data returned with an uncorrectable error will halt the operation. If the Int bit is
cleared (Int = 0), the SPU will send a signal to the LSU and the IFU on any ldxa to
MA register.
Any data returned with a correctable error will cause the error address to be sent to
IFU and be logged, while the operation will continue until completion.
0 0 - error_log
0 1 - error_log
1 0 precise trap error_log
1 1 - error_log
The MA_RED operates on A and N operands and the result will be stored in the R
operand.
A parity error encountered on an operand read will cause the operation to be
halted. The LSU and the IFU will be signaled.
FIGURE 2-24 shows a pipeline diagram that illustrates the sequence of the result
generation of the multiply function.
The hypervisor (HV) layer uses physical addresses (PA) while the supervisor (SV)
layer views real addresses (RA) where the RAs represent a different abstraction of
the underlying PAs. All applications use virtual addresses (VA) to access memory.
(The VA will be translated to RA and then to PA by TLBs and the MMU.)
The access to the MMU is through hypervisor-managed ASI operations such as
ldxa and stxa. These ASI operations can be asynchronous or in-pipe, depending on
the latency requirements. The asynchronous ASI reads and writes will be queued
up in the LSU. Some of the ASI operations can be updated through faults or by a
data access exception. Fault data for the status registers will be sent by the trap logic
unit (TLU) and the load/store unit (LSU).
1. CAM
2. Read
3. Write
4. Bypass
5. Demap
7. Hard-reset
The CAM consists of the following fields of bits – partition ID (PID), real (identifies
an RA-to-PA translation versus a VA-to-PA translation), context ID (CTXT), and
virtual address (VA). The VA field is further broken down into page-size-based fields
with individual enables. The CTXT field also has its own enable in order to allow
flexibility in implementation. The CAM portion of the fields is for comparison
purposes. The RAM consists of the following fields of bits, namely, physical address
(PA) and attributes. The RAM portion of the fields is for read purposes, where a read
could be caused by a software read or a CAM-based 1-hot read.
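For orientation, the sketch below lays out the CAM and RAM portions of one TLB entry as plain structs holding the fields just listed. The field widths and names are illustrative guesses rather than the actual hardware encoding.

#include <stdint.h>
#include <stdbool.h>

/* CAM (match) portion of a TLB entry -- compared on every translation. */
struct tlb_cam {
    uint8_t  pid;          /* partition ID                                    */
    bool     real;         /* 1: RA-to-PA translation, 0: VA-to-PA            */
    uint16_t ctxt;         /* context ID                                      */
    bool     ctxt_enable;  /* whether the context field participates in match */
    uint64_t va;           /* virtual address tag, split by page size         */
    uint8_t  page_size;    /* selects which VA bits are compared (8K..256M)   */
};

/* RAM (data) portion of a TLB entry -- read out on a hit or software read. */
struct tlb_ram {
    uint64_t pa;           /* physical address of the page                    */
    uint32_t attributes;   /* cacheability, privilege, writable, and so on    */
};

struct tlb_entry {
    bool           valid;
    bool           locked;  /* locked pages are never chosen for replacement  */
    bool           used;    /* replacement state, described below             */
    struct tlb_cam cam;
    struct tlb_ram ram;
};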
Write access to the data-in algorithmically places the translation table entry (TTE) in
the TLB. Writes occur to the least significant unused entry. In contrast, write access
to the data-access places the TTE in the specified entry in the TLB. For diagnostics
purposes, a single bit parity error can be injected on writes.
A page may be specified as real on a write, and a page will have a partition assigned
to it on a write.
A used bit can be set on a write, on a CAM hit, or when the entry is locked. A locked
page will have its used bit always set. An invalid entry has its used bit always
cleared. All used bits will be cleared when the TLB reaches a saturation point (that
is, when all entries have their used bit set while a new entry needs to be put in the
TLB). If the TLB remains saturated because all of the entries have been locked, the
default replacement candidate (entry 63) will be chosen and an error condition will
be reported.
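The used-bit policy just described fits in a few lines; the following behavioral sketch picks a replacement entry for a 64-entry TLB. It is an approximation of the text above (lowest-numbered unused entry first, clear used bits on saturation, fall back to entry 63 with an error when everything is locked), not the actual implementation.

#include <stdbool.h>
#include <stdio.h>

#define TLB_ENTRIES 64

/* Minimal per-entry replacement state. */
struct tlb_repl { bool used; bool locked; };

/* Pick the entry to replace. Sets *all_locked_error when every entry is
 * locked and the default candidate (entry 63) has to be used. */
static int tlb_pick_victim(struct tlb_repl e[TLB_ENTRIES], bool *all_locked_error)
{
    *all_locked_error = false;
    for (int pass = 0; pass < 2; pass++) {
        /* The least significant entry whose used bit is clear wins. */
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (!e[i].used)
                return i;
        /* Saturated: clear every used bit except those of locked entries
         * (a locked page keeps its used bit set). */
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (!e[i].locked)
                e[i].used = false;
    }
    /* Still saturated: every entry is locked. */
    *all_locked_error = true;
    return TLB_ENTRIES - 1;
}

int main(void)
{
    struct tlb_repl tlb[TLB_ENTRIES] = {0};
    bool err;
    tlb[0].used = true;                            /* entry 0 recently used */
    printf("victim = %d, error = %d\n", tlb_pick_victim(tlb, &err), err); /* 1, 0 */
    return 0;
}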
3. Access tag.
Software will then generate a pointer into the TSB based on the VA, the TSB base
address, the TSB size, and the Tag.
Software interrupts are delivered to each of the virtual cores using the
interrupt_level_n trap through the SOFTINT_REG register. I/O and CPU cross-call
interrupts are delivered to each virtual core using the interrupt_vector trap. Up to 64
outstanding interrupts can be queued up per thread – one for each interrupt vector.
Interrupt vectors are implicitly prioritized, with vector 63 being at the highest
priority and vector 0 at the lowest priority. Each I/O interrupt source has a
hardwired interrupt number that is used as the interrupt vector by the I/O bridge
block.
The TLU is in a logically central position to collect all of the traps and interrupts and
forward them. FIGURE 2-28 illustrates the TLU's role with respect to all other blocks
in a SPARC core.
[FIGURE 2-28 TLU Role With Respect to All Other Blocks in a SPARC Core – the tcl, tpd, intctl, and intdp blocks collect synchronous, asynchronous, and deferred traps and interrupt packets from the IFU, EXU, and LSU, manage the trap stack (hyperv tsa), PC/NPC, and trap PC, and exchange CWP_CCR_REG, ASI_REG, load/store addresses (ASI registers), and read/load data with the EXU and LSU]
There are three defined categories of traps – precise trap, deferred trap, and
disrupting trap. The following paragraphs briefly describe the nature of each
category of trap.
1. Precise trap
A precise trap is induced by a particular instruction and occurs before any
program-visible state has been changed by the trap-inducing instruction. When a
precise trap occurs, several conditions must be true:
■ The PC saved in TPC[TL] points to the instruction that induced the trap, and the
NPC saved in TNPC[TL] points to the instruction that was to be executed next.
■ All instructions issued before the one that induced the trap must have
completed their execution.
■ Any instructions issued after the one that induced the trap remain unexecuted.
2. Deferred trap
A deferred trap is induced by a particular instruction. However, the trap may
occur after the program-visible state has been changed by the execution of either
the trap inducing instruction itself, or one or more other instructions.
If an instruction induces a deferred trap, and a precise trap occurs simultaneously,
the deferred trap may not be deferred past the precise trap.
TABLE 2-5 illustrates the types of traps supported by the OpenSPARC T1 processor.

Asynchronous traps are taken opportunistically. They will be pending until the TLU
can find a trap bubble in the SPARC core pipeline. A maximum of one asynchronous
trap per thread can be pending at a time. When the other three threads are taking
traps back-to-back, an asynchronous trap may wait a maximum of three SPARC core
clock cycles before the trap is taken.
Disrupting traps are associated with certain particular conditions. The TLU collects
them and forwards them to the IFU. The IFU sends them down the pipeline as
interrupts instead of sending instructions down. A trap bubble is thus guaranteed at
the W-stage, and the trap will be taken.
[Figure: trap handling across pipeline stages D, E, M, W, and W2 – register read/write and DONE/RETRY instructions and alternate loads/stores from the IFU, VAs from the EXU, asynchronous traps, synchronous traps and interrupts from the IFU, EXU, SPU, and TLU-internal sources, and synchronous and deferred traps from the LSU; the TLU resolves priority, saves state in the trap stack, sends TrapPC_vld to the IFU, then updates state and sends the trap PC to the IFU]
FIGURE 2-30 illustrates the trap flow with respect to the hardware blocks.
[FIGURE 2-30: trap flow with respect to the hardware blocks – IFU traps and interrupts, FP traps, spill traps, and LSU-detected traps enter the TLU, which resolves priority, selects the reset vector, HTBA, or TBA, updates the processor state registers (HPSTATE, PSTATE, TL, and so on), and sends a flush to the LSU]
2.10.5 Interrupts
The software interrupts are delivered to each virtual core using the interrupt_level_n
traps (0x41-0x4f) through the SOFTINT_REG register. I/O and CPU cross-call
interrupts are delivered to each virtual core using the interrupt_vector trap (0x60).
I/O devices and CPU cross-call interrupts contain a 6-bit identifier, which
determines which interrupt vector (level) in the ASI_SWVR_INTR_RECEIVE register
the interrupt will target. Each strand’s ASI_SWVR_INTR_RECEIVE register can
queue up to 64 outstanding interrupts, one for each interrupt vector. Interrupt
vectors are implicitly prioritized with vector 63 being the highest priority and vector
0 being the lowest priority.
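Because the vectors are implicitly prioritized, interrupt dispatch reduces to finding the highest set bit of the receive register; a minimal sketch follows, modeling ASI_SWVR_INTR_RECEIVE as a plain 64-bit word, which is an abstraction rather than the real ASI access.

#include <stdint.h>
#include <stdio.h>

/* Return the highest-priority pending vector (63 is highest priority),
 * or -1 if no interrupt is pending. */
static int highest_pending_vector(uint64_t intr_receive)
{
    for (int v = 63; v >= 0; v--)
        if (intr_receive & (1ULL << v))
            return v;
    return -1;
}

int main(void)
{
    uint64_t receive = (1ULL << 5) | (1ULL << 41);   /* vectors 5 and 41 pending */
    printf("service vector %d first\n", highest_pending_vector(receive)); /* 41 */
    return 0;
}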
Each I/O interrupt source has a hard-wired interrupt number, which is used to
index a table of interrupt vector information (INT_MAN) in the I/O bridge unit.
Generally, each I/O interrupt source will be assigned a unique virtual core target
and vector level. This association is defined by the software programming of the
INT_MAN table in the I/O bridge unit.
[Figure: CPU interrupt receive and dispatch – incoming interrupt (CPX) packets from the LSU are decoded into the 64-entry (63..0) incoming vector register; a hypervisor software write to the vector dispatch register generates a PCX interrupt packet to the LSU/CCX; pending interrupts are signaled to the IFU as a level]
[Figure: trap PC generation – the trap stack, state, and control registers are updated on an interrupt command in the M-stage (flushed to a nop from the IFU), using the reset type (vector)]
[Figure: SOFTINT_REG interrupt sources (bits 16, 15..1, 0; bits 0 and 16 also map to Level_14) – set by TICK==TICK_CMPR, TICK==STICK_CMPR, PIC overflow, and software writes; supervisor interrupts to the IFU (level) are also raised when the CPU_mondo, Dev_mondo, or resumable-error queue head and tail pointers differ; TICK==HSTICK_CMPR and software writes raise hypervisor interrupts]
1. Hypervisor interrupts cannot be masked by the supervisor or the user; they can
only be masked by the hypervisor by way of the PSTATE.IE bit. Such interrupts
include hardware interrupts, HINTP, and so on.
HPSTATE.enb    X    1    1    1    0     0
HPSTATE.red    1    0    0    0    0     0
HPSTATE.priv   1    1    0    0    X(1)  0
PSTATE.priv    1    X    1    0    1     0
[Figure: privilege-mode transition diagrams – User, Supervisor, Hypervisor, and RED state; a supervisor trap at TL<2 or a hypervisor trap at TL<5 raises the privilege level, a trap at TL>=2 enters the hypervisor, and a reset or a trap at TL>=5 enters RED state; the numbered transitions (1-13, 1a) are keyed to a legend, of which entries 7 and 8 follow]
7: (Done/Retry @ HTSTATE[TL].{red,priv}=00 & TSTATE[TL].priv=1) | (HPSTATE.{red,priv}->00 @ PSTATE.priv=1)
8: (Done/Retry @ HTSTATE[TL].{red,priv}=00 & TSTATE[TL].priv=0) | (HPSTATE.{red,priv}->00 @ PSTATE.priv=0)
1. On traps or interrupts – save states in the trap stack and update them:
b. PC => TPC[TL]
2. On done or retry instructions – restore states from the trap stack:
a. Update the trap level (TL) and the global level (GL):
TL <= TL - 1
GL <= restore from the trap stack @TL and apply the cap
b. Restore all the registers, including PC, NPC, HPSTATE, and PSTATE, from the trap
stack @[TL]
c. Send CWP and CCR register updates to the execution unit (EXU)
f. Decrement TL
Synchronization based on the HTSTATE.priv bit and the TSTATE.priv bit for the
non-split mode is not enforced on software writes, but is synchronized while
restoring state on done and retry instructions.

Software writes in supervisor mode to the TSTATE.gl bit do not cap at two. The cap
is applied while restoring state on done and retry instructions.
Trap numbers 0x80 to 0xff can only be used by privileged software. These traps are
always delivered to the hypervisor. User software using trap numbers 0x80 to 0xff
will result in an illegal_instruction trap if the condition code evaluates to true.
Otherwise, it is just a NOP.
The instruction decoding and condition code evaluation of Tcc instructions are done
by the instruction fetch unit (IFU) and the seventh bit of the Trap# is checked by the
TLU.
The trap level can be changed by the done or retry instructions or by a WRPR
instruction to TL. The trap is taken on the instruction immediately following these
instructions. The change could be stepping down the trap level, or changing the TL
from >0 to 0. The HPSTATE.tlz bit will not be cleared by the hardware when a trap
is taken, so the trap-level-zero (TLZ) trap handler has to clear this bit before
returning in order to avoid an infinite tlz-trap loop.
Each thread has a performance instrumentation counter (PIC) register. The access
privilege is controlled by setting the PERF_CONTROL_REG.PRIV bit. When
PERF_CONTROL_REG.PRIV=1, non-privileged accesses to this register cause a
privileged_action trap.
If the PCR.OVFH bit is set to 1, the PIC.H has overflowed and the next event will
cause a disrupting trap that appears to be precise to the instruction following the
event.
If the PCR.OVFL bit is set to 1, the PIC.L has overflowed and next event will cause a
disrupting trap that appears to be precise to the instruction following the event.
If the PCR.UT bit is set to 1, it counts events in user mode. Otherwise, it will ignore
user mode events.
If the PCR.ST bit is set to 1 and HPSTATE.ENB is also set to 1, it counts events in
supervisor mode. Otherwise, it will ignore supervisor mode events.
If the PCR.ST bit is set to 1 and HPSTATE.ENB is also set to 0, it counts events in
hypervisor mode. Otherwise, it will ignore hypervisor mode events.
If the PCR.PRIV bit is set to 1, it prevents user code access to the PIC counter.
Otherwise, it allows the user code to access the PIC counter.
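Taken together, the PCR.UT and PCR.ST descriptions above amount to a small predicate that decides whether an event in a given privilege mode is counted; the sketch below is one reading of that text, with illustrative names, not a register-accurate model.

#include <stdbool.h>
#include <stdio.h>

enum priv_mode { MODE_USER, MODE_SUPERVISOR, MODE_HYPERVISOR };

/* Decide whether a PIC event is counted, per the PCR.UT / PCR.ST rules
 * described above; hpstate_enb models the HPSTATE.ENB bit. */
static bool pic_counts_event(enum priv_mode mode, bool pcr_ut, bool pcr_st,
                             bool hpstate_enb)
{
    switch (mode) {
    case MODE_USER:       return pcr_ut;
    case MODE_SUPERVISOR: return pcr_st && hpstate_enb;
    case MODE_HYPERVISOR: return pcr_st && !hpstate_enb;
    }
    return false;
}

int main(void)
{
    /* UT=0, ST=1, HPSTATE.ENB=0: only hypervisor-mode events are counted. */
    printf("user: %d  supervisor: %d  hypervisor: %d\n",
           pic_counts_event(MODE_USER,       false, true, false),
           pic_counts_event(MODE_SUPERVISOR, false, true, false),
           pic_counts_event(MODE_HYPERVISOR, false, true, false));
    return 0;
}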
The PIC.H bits form the instruction counter. Trapped or canceled instructions will
not be counted. The Tcc instructions will be counted even if some other trap is taken
on them.
Software writes to the PCR that set one of the overflow bits (OVFH, OVFL) will also
cause a disrupting but precise trap on the instruction following the next
incrementing event.
Note that all outstanding memory operations must be complete in order for the reset
to take effect. If any outstanding loads or stores are not completed, the core will
continue waiting for the completion of these operations before taking the reset.
The thread can be resumed by sending a second interrupt packet with the interrupt
type set to resume.
The capture and scan process does not affect the state of the core, and may be
performed while the core is running.
The shadow scan snap block is located within the instruction fetch unit. FIGURE 2-37
shows the implementation of this block.
The shadow scan signals are usually connected to a JTAG TAP Controller. The thread
ID, ctu_sscan_tid[3:0], is decoded from the JTAG instruction. It is valid for as long as
the instruction is held. The ctu_sscan_snap and ctu_sscan_en signals are decoded
from the Capture-DR and Shift-DR states of the TAP controller. The timing
relationship is shown in FIGURE 2-38.
The core shadow scan chain is 94 bits long. It captures the following information:
There is a single shadow scan chain of 94 bits per physical core. When a shadow
scan sample is triggered, the shadow scan block muxes 94x4 bits down to 94 bits to
be shifted out. The shadow scan chains for each physical core are placed on separate
chains. Bit 0 of the chain is the first bit shifted out, but each field is arranged in the
shadow scan chain such that the MSB is shifted out first. For example, bit 0 of the
shadow scan chain is mil_state[3].
CPU-Cache Crossbar
Each SPARC CPU core can send a packet to any one of the L2-cache banks, the I/O
bridge, or the FPU. Conversely, packets can also be sent in the reverse direction,
where any of the four L2-cache banks, the I/O bridge, or the FPU can send a packet
to any one of the eight CPU cores.
FIGURE 3-1 shows that each of the eight SPARC CPU cores can communicate with
each of the four L2-cache banks, the I/O bridge, and the FPU. The cache-processor
crossbar (CPX) and the processor-cache crossbar (PCX) packet formats are described
in Section 3.1.5, “CPX and PCX Packet Formats” on page 3-5.
CCX
When multiple sources send a packet to the same destination, the CCX buffers each
packet and arbitrates its delivery to the destination. The CCX does not modify or
process any packet.
In one cycle, only one packet can be delivered to a particular destination. The CCX
handles two types of communication requests. The first type of request contains one
packet, which is delivered in one cycle. The second type of request contains two
packets, which are delivered in two cycles.
The total number of cycles required for a packet to travel from the source to the
destination may be more than the number of cycles required to deliver a packet. This
occurs when the PCX (or the CPX) uses more than one cycle to deliver the packet,
which can happen because multiple sources can send packets to the same destination.
The PCX connects to each destination by way of a separate bus. However, the FPU
and I/O bridge share the same bus, so there are five buses that connect the PCX to
the six destinations. The PCX does not perform any packet processing, so the bus
from the PCX to each destination is 124 bits wide, which is identical to the PCX
packet width. FIGURE 3-2 illustrates this PCX interface.
Since both the FPU and the I/O bridge share a destination ID, the packets intended
for each get routed to both. The FPU and I/O bridge each decode the packet to
decide whether to consume or discard the packet.
A source can send at most two single-packet requests or one two-packet request to a
particular destination. There is a two-entry-deep queue inside the PCX for each
source-destination pair that holds the packets. The PCX sends a grant to the source
after dispatching a packet to its destination. Each source uses this handshake signal
to monitor the queue-full condition.
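As a rough illustration of this handshake, the following sketch models how a source could track the two-entry queue for one destination by counting packets sent and grants received. The structure and names are assumptions, not the actual RTL.

#include <stdbool.h>

/* Per source-destination pair: packets currently occupying the 2-deep PCX
 * queue, as seen by the source. */
typedef struct { int used; } pcx_credit_t;

static bool pcx_can_send(const pcx_credit_t *c, int packets /* 1 or 2 */)
{
    return c->used + packets <= 2;          /* queue-full check before sending */
}

static void pcx_on_send(pcx_credit_t *c, int packets) { c->used += packets; }
static void pcx_on_grant(pcx_credit_t *c)             { c->used -= 1; }  /* one grant per dispatched packet */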
The L2-caches and the I/O bridge can process a limited number of packets. When a
destination reaches its limit, it sends a stall signal to the PCX. This stall signal
prevents the PCX from sending the grant to a source (CPU core). The FPU, however,
cannot stall the PCX.
A source sends a packet and a destination ID to the CPX. The packets are sent on a
145-bit wide bus. Of the 145 bits, 128 bits are used for data and the rest are used for
control.
The destination ID is sent on a separate 8-bit bus. Each source connects with the CPX
on its own separate bus. Therefore, there are six buses that connect from the four
L2-caches, the I/O bridge, and the FPU to the CPX. The CPX connects by way of a
separate bus to each destination, so there are eight buses from the CPX that connect
it to the eight destinations (the CPU cores). The CPX does not perform any packet
processing, so the bus from the CPX to each destination is 145 bits wide, which is
identical to the bus width from the source to the CPX. FIGURE 3-3 illustrates the CPX
interface.
A source can send at most two single-packet requests, or one two-packet request, to
a particular destination. There is a two-entry-deep queue inside the CPX for each
source-destination pair that holds the packets. The CPX sends a grant to the source
after dispatching a packet to its destination. Each source uses this handshake signal
to monitor the queue-full condition.
Unlike the PCX, the CPX does not receive a stall from any of its destinations, as each
CPU has an efficient mechanism to drain the buffer that stores the incoming packets.
Note – For the next four packet format tables, the table entries are defined as
follows:
■ x – Not used or don’t care
■ V – Valid
■ rs – Source register
■ rd – Destination register
■ T – Thread ID
■ FD – Forwarded data
■ src – Source
■ tar – Target
I$fill (1)
Pkt bits No. Load L2,IOB I$fill (2) L2 Strm Load Evict Inv
Valid 144 1 V V V V V
Rtntyp 143:140 4 0000 0001 0001 0010 0011
L2 miss 139 1 V V 0 V x
ERR 138:137 2 V V V V x
NC 136 1 V V V V V
Shared bit 135 1 T T T T x
Shared bit 134 1 T T T T x
Shared bit 133 1 WV WV,0 WV WV x
Shared bit 132 1 W W,x W W x
Shared bit 131 1 W W,x W W x
Shared bit 130 1 0 0, F4B 0 A x
Shared bit 129 1 atomic 0 1 B x
Reserved 128 1 PFL 0 0 0 0
Data 127:0 128 V V V V {INV1
+6(pa)
+112(inv)}
Valid 144 1 V V V V V V V
Rtntyp 143:140 4 0100 0101 0110 0111 1000 1001 1010 1011 1100
L2 miss 139 1 x x x x x x x
ERR 138:137 2 x x x x x V V
NC 136 1 V V flush V R/!W R/!W x
Shared bit 135 1 T T T T x x 0
Shared bit 134 1 T T T T x x 0
Shared bit 133 1 x x x x src tar x
Shared bit 132 1 x x x x src tar x
Shared bit 131 1 x x x x src tar x
Shared bit 130 1 x/R A x x SASI x x
Shared bit 129 1 atomic x x x x x x
Reserved 128 1 x/R 0 0 0 0 0 0 0
Data 127:0 128 {INV2 {INV3 V! V* FD {64(x) x
+3(cpu) +3(cpu) + Data}
+6(pa) +6pa)
+112(inv)} +112(inv)}
Valid 123 1 V V V V V
Cpu_id 116:114 3 V V V V V
Thread_id 113:112 2 V V V V V
Invalidate 111 1 V V 0 0 0
Rep_L1_way 108:107 2 V V P V x
Size 106:104 3 V x V V V
Address 103:64 40 V V# V V V
Valid 123 1 V V V V V V V V
Rqtyp 122:118 5 00110 00100 00101 01001 01010 01011 01100 01101 01110
Invalidate 111 1 0 0 0 0 x x 0 0
Prefetch 110 1 0 0 0 0 x x 0 0
3.2.4 Invalidate
The one-bit invalidate field indicates an invalidation request. This notifies L2 to
update its directories.
3.2.5 Prefetch
This field in a load packet indicates that the load is a prefetch. In a store packet, this
bit indicates that the store is a block store.
Encoding  Size
000       Byte
001       Half-word (2-byte)
010       Word (4-byte)
011       Extended word (8-byte)
111       Cache line (16/32-byte)
In floating-point request packets, the address field is used to send operation data.
The format of this field is shown in TABLE 3-6.
3.2.10 Data
The data field for a PCX packet is 64 bits long. This field contains 64 bits of data
for a store. It is invalid for a load or I-fill packet. An interrupt packet carries an
18-bit interrupt vector in this field. If the data in a store packet is less than 64 bits,
the field is filled with copies of the data as shown in TABLE 3-7.
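The replication rule can be pictured with the following sketch, which fills the 64-bit data field with copies of a narrower store datum. It is illustrative only; the precise lane placement is the one given in TABLE 3-7.

#include <stdint.h>

/* Fill the 64-bit PCX data field with copies of store data narrower than
 * 64 bits (size_bytes is 1, 2, 4, or 8). */
static uint64_t replicate_store_data(uint64_t data, unsigned size_bytes)
{
    unsigned bits = size_bytes * 8;
    uint64_t mask = (bits == 64) ? ~0ull : ((1ull << bits) - 1);
    uint64_t field = 0;
    for (unsigned off = 0; off < 64; off += bits)
        field |= (data & mask) << off;
    return field;   /* e.g. a 1-byte store appears in all eight byte lanes */
}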
3.3.1 Valid
The valid bit indicates that the CPX packet is valid. The packet will be dropped if
this signal is not one.
3.3.3 L2 Miss
This bit indicates that the transaction missed in the cache. This bit is valid only for
load returns, I-fill returns, and Stream Load Returns. For other transactions, the field
should be set to zero. On an I-fill return transaction, the L2 miss bit is set only for the
first of the two packets. The second I-fill packet for the transaction always has this
bit set to zero.
3.3.4 ERR
This two-bit field indicates that the transaction had an error. Bit 138 indicates an
uncorrectable error, while bit 137 indicates a correctable error.
3.3.9 Atomic
Bit 129 indicates an atomic transaction. It is used in load return and store ACK
packets.
3.3.10 Prefetch
Bit 128 is used only in the load packet to indicate a prefetch load. For all other
transactions, this bit should be set to zero.
3.4.1 Load
A load transaction transfers data from L2 or I/O to the core. Cacheable accesses to
L2 are always 16B in size; therefore, the size field in the PCX packet should be
ignored. Sixteen bytes of load data are returned on the CPX bus. For L1 cacheable
accesses, the l1way field of the PCX packet will indicate to which L1 way the data
will be allocated. The L2 uses this information to update its directory. Non-cacheable
accesses are indicated by NC=1. These will not allocate in the L1 cache or in the
directory.
Loads to I/O can be of variable size as indicated by the three LSBs of the size field:
000=1B, 001=2B, 010=4B, 011=8B, 100=16B. For sizes of less than 16B, the NCU will
replicate the data across the 16B return data field. ECC errors are reported in the err
field: 00=no error, 01=correctable error, 10=uncorrectable error.
On a cacheable (NC=0) load, the L2 must also check the I$ directory to see if the
requested line is present in the requesting core's Icache. If it is, the L2 will assert the
WV bit in the CPX response and indicate in the way field which L1 way the line was
found in. The Icache will invalidate its line upon seeing the load return packet.
3.4.2 Prefetch
Prefetch can only be issued to the L2. From the L2 perspective, a prefetch is simply a
load that is non-cacheable in the L1. As such, the NC bit will always be asserted for
prefetch requests. The L2 will assert the PFL bit in the CPX return packet so that the
core knows not to update the register files as would happen for a load. While the L2
The L2 will act by invalidating all ways of the indicated index in its directory. It will
then respond to the core with a dcache invalidate ACK CPX packet. This packet has
the same format as a store ACK packet except that bit 123 (D$ inval all) will be set
high.
For stores to the L2, all D$ and I$ directories are checked for the presence of the line.
Any hit is indicated in the invalidation vector portion of the CPX response. If a hit is
detected in the D$ directory of the core that issued the store, that directory is left
unchanged. However, if a hit is detected in the D$ directory of a core which did not
issue the store request or in any I$ directory, that entry is subsequently invalidated.
For stores to the L2, if the inv bit of the PCX packet is high, the L2 will write
NotData values into the L2. This occurs in the case where the store buffer in the
SPARC core detects an uncorrectable error and the true store data is unknown.
The BIS bit (109) in the PCX packet is always reflected to the BIS bit (125) of the store
ACK packet.
For performance reasons, the replacement algorithm may also be modified for block
store cases.
First, like the block store, all directory hits cause invalidations.
Second, if the address is 64B aligned, and if the store causes an L2 miss, the L2 will
not fetch data from memory to fill the line but will instead initialize the line with
zeros. The store will then take place as usual.
If the address is not 64B aligned or if the store hits in the L2, the store proceeds just
like a normal store except for the first exception described above.
The BIS bit (109) in the PCX packet is always reflected to the BIS bit (125) of the store
ACK packet.
The core sends two packets for a CAS operation; the first contains the compare data,
the second contains the swap data. The L2 will first load the line and return a CAS
return packet (identical to a load return) on the CPX. It will then compare the data
and make a second pass through the L2 pipe. If the compare was true, the data from
the second packet will be stored. If the compare was false, memory will be
unchanged. Regardless of the compare result, the L2 will send a CAS ACK (identical
to a store ACK) on the CPX. If the compare was true and the store occurred, the
directories must be checked as on a block init store (i.e., any hit causes invalidation).
It is implementation dependent whether the directories are checked and invalidated
in the case where the compare was false.
The atomic bit (129) of the CPX packet must be asserted for both response packets.
3.4.10 Swap/Ldstub
Swap/Ldstub (ldstub is simply a byte sized swap where the new data is always 0xff)
work similarly to the CAS operation except that the memory write is unconditional.
The atomic bit (129) of the CPX packet must be asserted for both response packets.
Swap requests are never issued to I/O.
The modular arithmetic unit ID bit (A), bit (108) in the PCX packet, is always
reflected to bit 130 of the CPX stream store ACK packet.
A flush is indicated by a one in the Broadcast bit (bit 117) of the PCX packet. The L2
responds to this request with a CPX INT packet. The broadcast bit is copied to the
flush bit (bit 136). The core ID from the PCX packet is returned to the core in
bits 120:118 of the CPX packet (the same location they would occupy in a store ACK
packet). Data may be sent in the lower 32 bits of the data field, but this is not valid
data, and it is ignored. The purpose of the INT packet is to guarantee that all
preceding stores to L2 have been committed before the given thread continues.
The cross-CPU interrupt is indicated by a zero in bit 117 of the PCX packet. This
packet may be sent to an L2 bank or to the I/O bridge (IOB) block. The L2 bank will
forward the interrupt vector to the target CPU by sending a CPX INT packet. The 18-
bit interrupt vector in the lower 18 bits of the PCX data field is copied to the CPX
data field in two places: bits [17:0] and bits [81:64]. The format of the interrupt vector
is shown in TABLE 3-10.
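A minimal sketch of that copy, assuming the 128-bit CPX data field is modeled as two 64-bit words, is shown below; the function name and representation are illustrative.

#include <stdint.h>

/* Place the 18-bit interrupt vector from the PCX data field at bits [17:0]
 * and [81:64] of the 128-bit CPX data field. */
static void cpx_int_vector(uint64_t pcx_data, uint64_t cpx_data[2] /* [0]=63:0, [1]=127:64 */)
{
    uint64_t vec = pcx_data & 0x3FFFFull;   /* lower 18 bits of the PCX data field */
    cpx_data[0] = vec;                      /* CPX bits [17:0]  */
    cpx_data[1] = vec;                      /* CPX bits [81:64] */
}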
Interrupts may be sent directly from the I/O bridge to the core. The first case where
this happens is at reset. The I/O bridge wakes up the lowest numbered core and
thread by sending a CPX INT packet with the trap type set to power-on reset (POR).
Interrupts from devices are also sent by the I/O bridge to the core responsible for
handling the device.
3.4.16 L2 Errors
When the L2 encounters a fatal error, it will respond back to the core with an error
packet. The error packet only reports the type of error in bits 138 and 137 of the CPX
packet. For non-fatal errors, the L2 may not return an error packet. It will instead
simply report the type of error in the ERR field of the normal CPX response packet.
FIGURE: PCX packet transfer timing – one-packet request (cycles PQ, PA, PX, PX2;
signals spc0_pcx_req_vld_pq[0], spc0_pcx_data_pa[123:0], pcx_spc0_grant_px,
pcx_sctag0_data_rdy_px1, pcx_sctag0_data_px2[123:0])
CPU0 signals the PCX that it is sending a packet in cycle PQ. CPU0 then sends a
packet in cycle PA. ARB0 looks at all pending requests and issues a grant to CPU0 in
cycle PX. ARB0 sends a data ready signal to the L2-cache Bank0 in cycle PX. ARB0
sends the packet to the L2-cache Bank0 in cycle PX2.
FIGURE: PCX packet transfer timing – two-packet request (cycles PQ, PA, PX, PX2,
PX3; signals spc0_pcx_req_vld_pq[0], spc0_pcx_atom_pq, spc0_pcx_data_pa[123:0],
pcx_spc0_grant_px, pcx_sctag0_data_rdy_px1, pcx_sctag0_atm_px1)
CPU0 signals the PCX that it is sending a packet in cycle PQ. CPU0 also asserts
spc0_pcx_atom_pq, which tells the PCX that CPU0 is sending a two-packet request.
The PCX handles all two-packet requests atomically. CPU0 sends the first packet in
cycle PA and the second packet in cycle PX. ARB0 looks at all pending requests and
issues a grant to CPU0 in cycle PX. The grant is asserted for two cycles. The PCX
also asserts pcx_sctag0_atm_px1 in cycle PX, which tells the L2-cache Bank0 that the
PCX is sending a two-packet request. ARB0 sends a data ready signal to the L2-cache
Bank0 in cycle PX. ARB0 sends the two packets to the L2-cache Bank0 in cycles PX2
and PX3.
The timing for CPX transfers is similar to PCX transfers with the following
difference – the data ready signal from the CPX is delayed by one cycle before
sending the packet to its destination. FIGURE 3-6 and FIGURE 3-7 show the CPX
packet transfer timing diagrams.
FIGURE 3-6 CPX Packet Transfer Timing Diagram – One Packet Request (cycles CQ,
CA, CX, CX2; signals sctag0_cpx_req_cq[0], sctag0_cpx_data_ca[144:0],
cpx_sctag0_grant_px, cpx_spc0_data_rdy_cx2, cpx_spc0_data_cx2[144:0])
FIGURE 3-7 CPX Packet Transfer Timing Diagram – Two Packet Request (signals
sctag0_cpx_req_cq[0], sctag0_cpx_atom_cq, sctag0_cpx_data_ca[144:0],
cpx_sctag0_grant_px, cpx_spc0_data_rdy_cx2)
To optimize bandwidth, the SPARC core may send speculative requests when both
transaction slots are occupied, assuming that a grant will come back in time. In the
best case, a grant will come back two cycles after a request. If a speculative request
ARB0 can receive packets from any of the eight CPUs for the L2-cache Bank0, and it
stores packets from each CPU in a separate queue. Therefore, ARB0 contains eight
queues. Each queue is a two entry deep FIFO, and each entry can hold one packet. A
packet is 124-bits wide and it contains the address, the data, and the control bits.
ARB0 delivers packets to the L2-cache Bank0 on a 124-bit wide bus. FIGURE 3-11
shows this data flow.
FIGURE 3-11 ARB0 data queues – a two-entry FIFO (Q0, Q1) per CPU (C0 to C7)
feeding a 124-bit bus to the L2-cache Bank0
ARB1, ARB2, and ARB3 receive packets for the L2-cache Bank1, Bank2, and Bank3
respectively. ARB4 receives packets for both the FPU and the I/O bridge.
ARB0 dispatches packets to the destination in the order it receives each packet.
Therefore, a packet received in cycle 4 will be dispatched before a packet received in
cycle 5. When multiple sources dispatch a packet in the same cycle, ARB0 follows a
round-robin policy to arbitrate among packets from multiple sources.
A 5-bit bus originates from each CPU, and the bit corresponding to the destination is
high while all other bits are low. Each arbiter receives one bit from the 5-bit bus from
each CPU.
FIGURE: Request valid (checkerboard) queues – a sixteen-entry FIFO per CPU (C0 to
C7) with a direction bit, producing 8-bit data-select outputs to arbiters 0 through 4
The checkerboard consists of eight FIFOs. Each FIFO is sixteen entries deep, and
each entry holds a single valid bit received from its corresponding CPU. Each valid
FIFO entry represents a valid packet from a source for the L2-cache Bank0. Since
each source can send at most two requests for the L2-cache Bank0, there can be at
most two valid entries in each FIFO at any time.
There can be only one entry for each request, even if a request contains two packets.
Such requests occupy one valid entry in the checkerboard and two FIFO entries in
the data queue. A separate bit identifies a two-packet request.
The direction for the round-robin selection depends on the direction bit. Round-
robin selection is left-to-right (C0 - C7) if the direction bit is high, or right-to-left (C7
- C0) if the direction bit is low. The direction bit toggles every cycle.
The direction bit is low for all arbiters at reset, and it toggles for all arbiters every
cycle. This behavior is required to maintain the TSO ordering for invalidates sent by
an L2-cache bank.
ARB0 picks the first valid entry from the last row of the checkerboard every cycle.
ARB0 then sends an 8-bit signal to the multiplexer at the output of the FIFOs storing
the data (as shown in FIGURE 3-11). The 8-bit signal is 1-hot, and the index of the high
bit is the same as the index of the entry picked in the last row. If there are multiple
valid entries, ARB0 picks them in a round-robin fashion. ARB0 decides the direction
for the round-robin based on the direction bit.
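The pick can be summarized with the following behavioral sketch, which scans the eight last-row valid bits in the order set by the direction bit and returns a 1-hot select. This models only the behavior described above, not the arbiter logic itself.

#include <stdint.h>

/* valid: bit i is set if CPU i has a valid entry in the last checkerboard row.
 * direction: 1 scans C0 to C7, 0 scans C7 to C0 (toggles every cycle). */
static uint8_t arb_pick(uint8_t valid, int direction)
{
    if (direction) {
        for (int i = 0; i < 8; i++)
            if (valid & (1u << i)) return (uint8_t)(1u << i);
    } else {
        for (int i = 7; i >= 0; i--)
            if (valid & (1u << i)) return (uint8_t)(1u << i);
    }
    return 0;   /* no valid entry this cycle */
}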
Level 2 Cache
The L2-cache accepts requests from the SPARC CPU cores on the processor-to-cache
crossbar (PCX) and responds on the cache-to-processor crossbar (CPX). The L2-cache
is also responsible for maintaining the on-chip coherency across all L1-caches on the
chip by keeping a copy of all L1 tags in a directory structure. Since the OpenSPARC
T1 processor implements a system on a chip, with a single memory interface and no
L3 cache, there is no off-chip coherency requirement for the OpenSPARC T1 L2-cache
other than that it needs to be coherent with the main memory.
Each L2-cache bank has a 128-bit fill interface and a 64-bit write interface with the
DRAM controller. Each bank has a dedicated DRAM channel, and each 32-bit word
is protected by 7 bits of single error correction, double error detection (SEC/DED)
ECC code.
Each L2-cache bank interfaces with the eight SPARC CPU cores through the processor
-cache crossbar (PCX). The PCX routes the L2-cache requests (loads, ifetches, stores,
atomics, ASI accesses) from all of the eight CPUs to the appropriate L2-cache bank.
The CPX accepts read return data, invalidation packets, and store ACK packets
from each L2-cache bank and forwards them to the appropriate CPU(s).
Each L2-cache bank interfaces with one DRAM controller in order to issue reads and
evictions to the DRAM on misses in the L2-cache. A writeback gets issued 64-bits at
a time to the DRAM controller. A fill happens 128-bits at a time from the DRAM
controller to the L2-cache.
The L2-cache interfaces with the J-Bus interface (JBI) by way of the snoop input
queue and the RDMA write buffer.
FIGURE 4-1 shows the various L2-cache blocks and their interfaces. The following
paragraphs provide additional details about each functional block.
FIGURE 4-1 L2-Cache Bank Block Diagram (CnplQ(16), IQ(16), and OQ(16) queues;
arbiter; L2 tag + VUAD + directory; L2 data array; miss buffer MB(16); write-back
buffer WB(8); RDMA write buffer WB(8); DRAM and JBI interfaces)
4.1.2.2 L2 Tag
The L2 tag block contains the sctag array and the associated control logic. Each 22-bit
tag is protected by 6 bits of SEC ECC (the L2 tag does not support double-bit error
detection). sctag is a single-ported array, and it supports inline false hit detection. In
the C1 stage of the pipeline, the access address bits, as well as the check bits, are
compared. Therefore, there is never a false hit.
The state of each line is maintained using valid (V), used (U), allocated (A), and
dirty (D) bits. These bits are stored in the L2 VUAD array.
A valid bit indicates that the line is valid. The valid bit (per way) gets set when a
new line is installed in that way. It gets reset when that line gets invalidated.
The used bit is a reference bit used in the replacement algorithm. The L2-cache uses
a pseudo LRU algorithm for selecting a way to be replaced. There are 12 used bits
per set in the L2-cache. The used bit gets set when there are any store/load hits
(1 per way). Used bits get cleared (all 12 at a time) when there are no unused or
unallocated entries for that set.
The allocate bit indicates that the marked line has been allocated to a miss. This bit
is also used in the processing of some special instructions, such as atomics and
partial stores. (Because these stores do read-modify-writes, which involve two passes
through the pipe, the line needs to be locked until the second pass completes;
otherwise, the line may get replaced before the second pass happens). The allocate
The dirty bit indicates that the L2-cache contains the only valid copy of the line. The
dirty bit (per way) gets set when a store modifies the line. It gets cleared when the
line is invalidated.
The pseudo least recently used (LRU) algorithm examines all the ways starting from
a certain point in a round-robin fashion. The first unused, unallocated way is
selected for replacement. If no unused, unallocated way is found, then the first
unallocated way is selected.
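A behavioral sketch of this selection follows, under the assumption that the used and allocate bits are available per way and that the scan starts from a rotating pointer; it models only the rule stated above.

#define L2_WAYS 12

/* Pick a victim way: the first way that is neither used nor allocated, scanning
 * from the rotating start point; otherwise the first unallocated way. */
static int l2_pick_victim(const unsigned char used[L2_WAYS],
                          const unsigned char alloc[L2_WAYS], int start)
{
    for (int i = 0; i < L2_WAYS; i++) {
        int w = (start + i) % L2_WAYS;
        if (!used[w] && !alloc[w])
            return w;
    }
    for (int i = 0; i < L2_WAYS; i++) {
        int w = (start + i) % L2_WAYS;
        if (!alloc[w])
            return w;          /* allocated (locked) ways are never victimized */
    }
    return -1;                 /* every way is allocated; no replacement this pass */
}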
Each scdata array bank is further subdivided into four columns. Each column
consists of six 32-Kbyte sub-arrays.
Any L2-cache data array access takes two cycles to complete, so the same column
cannot be accessed in consecutive cycles. All accesses can be pipelined except
back-to-back accesses to the same column. The scdata array has a throughput of one
access per cycle.
Each 32-bit word is protected by seven bits of SEC/DED ECC. (Each line is 32 x [32
+ 7 ECC] = 1248 bits). All sub-word accesses require a read modify write operation
to be performed, and they are referred to in this chapter as partial stores.
The FIFO is implemented with a dual-ported array. The write port is used for
writing into the IQ from the PCX interface. The read port is for reading contents for
issue into the L2-cache pipeline. If the IQ is empty when a packet arrives from the
PCX, the packet can bypass the IQ if it is selected for issue to the L2-cache pipe. The
IQ asserts a stall to the PCX when eleven of the FIFO entries are used. This stall
allows space for the packets already in flight.
Multicast requests are dequeued from the FIFO only if all of the CPX destination
queues can accept the response packet. When the OQ reaches its high-water mark,
the L2-cache pipe stops accepting inputs from the miss buffer or the PCX. Fills can
happen while the OQ is full since they do not generate CPX traffic.
The miss buffer is divided into a non-tag portion which holds the store data, and a
tag portion which contains the address. The non-tag portion of the buffer is a RAM
with 1 read and 1 write port. The tag portion is a CAM with 1 read, 1 write, and 1
cam port.
A read request is issued to the DRAM and the requesting instruction is replayed
when the critical quad-word of data arrives from the DRAM.
All entries in the miss buffer that share the same cache line address are linked in the
order of insertion in order to preserve the coherency. Instructions to the same
address are processed in age order, whereas instructions to different addresses are
not ordered and exist as a free list.
When an MB entry gets picked for issue to the DRAM (such as a load, store, or ifetch
miss), the MB entry gets copied into the fill buffer and a valid bit gets set. There can
be up to 8 reads outstanding from the L2-cache to the DRAM at any point of time.
In most cases, when a data return happens, the replayed load from the MB makes it
through the pipe before the fill request can. Therefore, the valid bit of the MB entry
gets cleared (after the replayed MB instruction execution is complete in the pipe)
before the fill buffer valid bit. However, if there are other prior MB instructions, like
partial stores that get picked instead of the MB instruction of concern, the fill request
can enter the pipe before the MB instruction. In these cases, the valid bit in the fill
buffer gets cleared prior to the MB valid bit. Therefore, the MB valid bit and FB valid
bits always get set in the order of MB valid bit first, and FB valid bit second. (These
bits can get cleared in any order, however.)
The fill buffer is an 8 entry buffer used to temporarily store data arriving from the
DRAM on an L2-cache miss request. Data arrives from the DRAM in four 16-byte
blocks starting with the critical quad-word. A load instruction waiting in the miss
buffer can enter the pipeline after the critical quad-word arrives from the DRAM
(the critical 16 bytes will arrive first from the DRAM). In this case, the data is
bypassed. After all four quad-words arrive, the fill instruction enters the pipeline
and fills the cache (and the fill buffer entry gets invalidated).
When data comes back in the FB, the instruction in the MB gets readied for reissue
and the cache line gets written into the data array. These two events are independent
and can happen in any order.
For a non-allocating read (for example, an I/O read), the data gets drained from the
fill buffer directly to the I/O interface when the data arrives (and the fill buffer entry
gets invalidated). When the FB is full, the miss buffer cannot make requests to the
DRAM.
The fill buffer is divided into a RAM portion, which stores the data returned from
the DRAM waiting for a fill to the cache, and a CAM portion, which contains the
address. The fill buffer has a read interface with the DRAM controller.
The WBB is divided into a RAM portion, which stores the evicted data until it can be
written to the DRAM, and a CAM portion, which contains the address.
The WBB has a 64-byte read interface with the scdata array and a 64-bit write
interface with the DRAM controller. The WBB reads from the scdata array faster than
it can flush data out to the DRAM controller.
The L2-cache directory also ensures that the same line is not resident in both the
icache and the dcache (across all CPUs). The L2-cache directory is written in the C5
cycle of a load or an I-miss that hits the L2-cache, and is cammed in the C5 cycle of
a store/streaming store operation that hits the L2-cache. The lookup operation is
performed in order to invalidate all the SPARC L1-caches that own the line other
than the SPARC core that performed the store.
The L2-cache directory is split into an icache directory (icdir) and a dcache directory
(dcdir), which are both similar in size and functionality.
The L2-cache directory is written only when a load is performed. On certain data
accesses (loads, stores and evictions), the directory is cammed to determine whether
the data is resident in the L1-caches. The result of this CAM operation is a set of
The dcache directory is organized as sixteen panels with sixty-four entries in each
panel. Each entry number is formed using the cpu ID, way number, and bit 8 from
the physical address. Each panel is organized in four rows and four columns. The
icache directory is organized similarly. For an eviction, all four rows are cammed.
The requests from a CPU include the following instructions – load, streaming load,
Ifetch, prefetch, store, streaming store, block store, block init store, atomics,
interrupt, and flush.
The requests from the I/O include the following instructions – block read (RD64),
write invalidate (WRI), and partial line write (WR8).
The requests from the I/O buffer include the following instructions – forward
request load and forward request store (these instructions are used for diagnostics).
The test access port (TAP) device cannot talk to the L2-cache directly. The TAP
C1
■ All buffers (WBB, WB and MB) are cammed. The instruction is a dependent
instruction if the instruction address is found in any of the buffers.
■ Generate ECC for store data.
■ Access VUAD and TAG array to establish a miss or a hit.
C2
■ Pipeline stall conditions are evaluated. The following conditions require that the
pipeline be stalled:
■ 32-byte access requires two cycles in the pipeline.
■ An I-miss instruction stalls the pipeline for one cycle. When an I-miss
instruction is encountered in the C2 stage, it stalls the instruction in the C1
stage so that it stays there for two cycles. The instruction in the C1 stage is
replayed.
■ For instructions that hit the cache, the way-select generation is completed.
■ Pseudo least recently used (LRU) is used for selecting a way for replacement in
case of a miss.
■ VUAD is updated in the C5 stage. However, VUAD is accessed in the C1 stage.
The bypass logic for VUAD generation is completed in the C2 stage. This process
ensures that the correct data is available to the current instruction from the
previous instructions because the C2 stage of the current instruction completes
before the C5 stage of the last instruction.
■ The miss buffer is cammed in the C1 stage. However, the MB is written in the C3
stage. The bypass logic for a miss buffer entry generation is completed in the C2
stage. This ensures that the correct data is available to the current instruction from
previous instructions, because the C2 stage of the current instruction starts before
the C3 stage of the last instruction completes.
C4
■ The first cycle of read or write to the scdata array for load/store instructions that
hit the cache.
C5
■ The second cycle of read or write to the scdata array for load/store instructions
that hit the cache.
■ Write into the L2-cache directory for loads, and CAM the L2-cache directory for
stores.
■ Write the new state of line into the VUAD array (by now the new state of line has
been computed).
■ Fill buffer bypass – If the data to service the load that missed the cache is
available in the FB, then do not wait for the data to be available in the data array.
The FB provides the data directly to the pipeline.
C6
■ 128-bits of data and 28-bits of ECC are transmitted from the scdata (data array) to
the sctag (tag array).
C7
■ Error correction is done by the sctag.
■ The sctag sends the request packet to the CPX, and the sctag is the only interface
the L2-cache has with the CPX.
C8
■ A data packet is sent to the CPX. This stage corresponds with the CQ stage of the
CPX pipeline.
Cache miss instructions are reissued from the miss buffer after the data returns from
the DRAM controller. These reissued instructions follow the preceding pipeline.
4.1.4.1 Loads
A load instruction to the L2-cache is caused by any one of the following conditions:
■ A miss in the L1-cache (the primary cache) by a load, prefetch, block load, or a
quad load instruction.
■ A streaming load issued by the stream processing unit (SPU)
■ A forward request read issued by the IOB
The output of the scdata array, returned by the load, is 16 bytes in size. This size is
the same as the size of the L1 data cache line. An entry is created in the dcache
directory, and any existing icache directory entry is invalidated, for every CPU
whose L1 icache holds the line.
From an L2-cache perspective, a block load is the same as eight load requests. A
quad load is the same as four load requests.
A prefetch instruction is issued by a CPU and is identical to a load, except for this
one difference – the results of a prefetch are not written into the L1-cache and
therefore the tags are not copied into the L2-cache directory.
A forward request read returns 39-bits (32 + 7 ECC) of data. The data is returned
without an ECC check. Since the forward request load is not installed in the L1-
cache, there is no L2-cache directory access.
4.1.4.2 Ifetch
An ifetch is issued to the L2-cache in response to an instruction missing the L1
icache. The icache line size is 256 bits. The L2-cache returns the 256 bits of data in two
packets over two cycles to the requesting CPU over the CPX. The two packets are
returned atomically. The L2-cache then creates an entry in the icache directory and
invalidates any existing entry in the dcache directory.
The store instruction writes 32 bits of data (the write granularity) into the scdata
array. An acknowledgment packet is sent to the CPU that issued the request, and an
invalidate packet is sent to all other CPUs. The icache directory entry for every CPU
is cammed and invalidated. The dcache directory entry of every CPU, except the
requesting CPU, is cammed and invalidated.
A block store is the same as eight stores from an L2-cache perspective. A block init
store is the same as a block store except for one difference – in the case of a miss for a
block init store, a dummy read request is issued to the DRAM controller. The DRAM
controller returns a line filled with all zeroes. Essentially, this line return saves
DRAM read bandwidth.
The LSU treats every store as a total store order (TSO) store. The LSU waits for an
acknowledgement to arrive before processing the next store. However, block init
stores can be processed without waiting for acknowledgements.
A forward request write stores 64 bits of data in the scdata. The icache and the
dcache directory entries are not cammed afterwards.
The forward request write and the streaming store may straddle word boundaries
and therefore may require partial stores.
Partial stores (PST) perform sub-32-bit writes into the scdata array. As mentioned
earlier, the granularity of writes into the scdata is 32 bits. A partial store is executed
as a read-modify-write operation. In the first step, the cache line is read and merged
with the write data, and the result is saved in the miss buffer. The cache line is
written into the scdata array in the second pass of the instruction through the pipe.
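The read-modify-write merge can be sketched as follows for one 32-bit word, with the byte enables selecting which store bytes replace the bytes read from the line; the function and its arguments are illustrative only.

#include <stdint.h>

/* Merge store bytes into the 32-bit word read from the cache line.
 * Bit b of byte_en selects byte b of store_word. */
static uint32_t pst_merge(uint32_t line_word, uint32_t store_word, unsigned byte_en)
{
    uint32_t merged = line_word;
    for (int b = 0; b < 4; b++) {
        if (byte_en & (1u << b)) {
            uint32_t mask = 0xFFu << (8 * b);
            merged = (merged & ~mask) | (store_word & mask);
        }
    }
    return merged;   /* ECC is regenerated for the merged word before the write pass */
}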
4.1.4.4 Atomics
The L2-cache processes three types of atomic instructions – load store unsigned byte
(LDSTUB), SWAP, and compare and swap (CAS). These instructions require two
passes down the L2-cache pipeline.
The first pass reads the addressed cache line and returns 128 bits of data to the
requesting CPU. The read data is also merged with the unsigned-byte/swap data,
and this merged data is written into the miss buffer.
In the second pass of the instruction, the new data is stored in the scdata array. An
acknowledgement is sent to the issuing CPU and invalidations are sent to all other
CPUs as appropriate. The icache and the dcache directories are cammed and the
entries are invalidated. In the case of atomics, the directory entry of even the issuing
CPU is invalidated.
CAS/CAS(X)
CAS{X} instructions are handled as two packets on the PCX. The first packet
(CAS(1)) contains the address and the data (against which the read data will be
compared).
The first pass reads the addressed cache line and sends 128-bits of data read back to
the requesting CPU. (The comparison is performed in the first pass.)
The second packet (CAS(2)) contains the store data. The store data is inserted into
the miss buffer as a store at the address contained in the first packet. If the
comparison result is true, the second pass proceeds like a normal store. If the result
was false, the second pass proceeds to generate the store acknowledgment only. The
scdata array is not written.
Write Invalidate
For a 64-byte write (the write invalidate (WRI) from the JBI), the JBI issues a 64-byte
write request to the L2-cache.
When the write progresses through the pipe, it looks up the tags. If a tag hit occurs,
it invalidates the entry and all primary cache entries that match. If a tag miss occurs,
it does nothing (it just continues down the pipe) to maintain the order.
Data is not written into the scdata cache on a miss. However, the scdata entry, and all
primary cache lines, are invalidated on a hit.
The CTAG (the instruction identifier) is returned to the JBI when the processor sends
an acknowledgement to the cache line invalidation request sent over the CPX.
After the instruction is retired from the pipe, 64 bytes of data is written to the
DRAM.
When the JBI issues 8-byte writes to the L2-cache with arbitrary byte enables, the L2-
cache treats them just like 8-byte stores from the CPU. (That is, it does a two-pass
partial store if an odd number of byte enables are active or if the access is
misaligned. Otherwise, it does a regular store.)
The CTAG (the instruction identifier) is returned to the JBI when the processor sends
an acknowledgement to the cache line invalidation request sent over the CPX.
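A small sketch of the WR8 classification above, assuming the misalignment test simply checks 8-byte address alignment, is given below; both that test and the names are illustrative.

#include <stdbool.h>
#include <stdint.h>

static int popcount8(uint8_t x) { int n = 0; while (x) { n += x & 1u; x >>= 1; } return n; }

/* Decide whether a JBI WR8 must be handled as a two-pass partial store. */
static bool wr8_needs_partial_store(uint8_t byte_en, uint64_t addr)
{
    bool odd_enables = (popcount8(byte_en) & 1) != 0;   /* odd number of enabled bytes */
    bool misaligned  = (addr & 0x7) != 0;               /* assumed alignment test */
    return odd_enables || misaligned;                   /* otherwise: regular store */
}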
The L2-cache (scdata) includes all valid L1-cache lines. In order to preserve this
inclusion, the L2-cache directory (both icache and dcache) is cammed with the
evicted tag, and the corresponding entry is invalidated. The invalidation packets are
sent to the appropriate CPUs.
If the evicted line is dirty, it is written into the write back buffer (WBB). The WBB
opportunistically streams out the cache line to the DRAM controller over a 64-bit
bus.
4.1.4.7 Fill
A fill is issued following an eviction after an L2-cache store or load miss. The 64-byte
data arrives from the DRAM controller and is stored in the fill buffer. Data is read
from the fill buffer and written into the L2-cache scdata array.
L1-Cache Invalidation
The instruction invalidates the four primary cache entries as well as the four L2-
cache directory entries corresponding to each primary cache tag entry. The
invalidation is issued whenever the CPU detects a parity error in the tags of I-cache
or dcache.
Interrupts
When a thread wants to send an interrupt to another thread, it sends it through the
L2-cache. The L2-cache treats the interrupt like a bypass. After a decode, the L2-cache
sends the packet on to the destination CPU if it is an interrupt.
A flush stays in the output queue until all eight receiving queues are available. This
is a total store order (TSO) requirement.
The L2-cache directory maintains the cache coherency in all primary caches. The L2-
cache directory preserves the inclusion property – all valid entries in the primary
cache should reside in the L2-cache as well. It also keeps the icache and the dcache
exclusive for each CPU.
The read after write (RAW) dependency to the DRAM controller is resolved by
camming the write back buffer on a load miss.
Multicast requests (for example, a flush request) are sent to the CPX only if all of the
receiving queues are available. This is a requirement for maintaining the total store
order (TSO).
Signal Name    I/O    Source/Destination    Description
sctag_scbuf_word_vld_c7 In SCTAG
scdata_scbuf_decc_out_c7[623:0] In SCDATA
dram_scbuf_data_r2[127:0] In DRAM
dram_scbuf_ecc_r2[27:0] In DRAM
cmp_gclk In CTU Clock
arst_l In CTU Asynchronous reset
grst_l In CTU Synchronous reset
global_shift_enable, In CTU
cluster_cken In CTU
ctu_tst_pre_grst_l In CTU
ctu_tst_scanmode In CTU
ctu_tst_scan_disable In CTU
ctu_tst_macrotest In CTU
ctu_tst_short_chain In CTU
scbuf_sctag_ev_uerr_r5 Out SCTAG
scbuf_sctag_ev_cerr_r5 Out SCTAG
scbuf_jbi_ctag_vld Out JBI
scbuf_jbi_data[31:0] Out JBI
scbuf_jbi_ue_err Out JBI
scbuf_sctag_rdma_uerr_c10 Out SCTAG
scbuf_sctag_rdma_cerr_c10 Out SCTAG
scbuf_scdata_fbdecc_c4[623:0] Out SCDATA
scbuf_dram_data_mecc_r5 Out DRAM
scbuf_dram_wr_data_r5[63:0] Out DRAM
scbuf_dram_data_vld_r5 Out DRAM
so Out DFT Scan out
Input/Output Bridge
5.1.1 IOB Interfaces
FIGURE 5-1 shows the interfaces to and from the IOB to the rest of the blocks and
clusters.
FIGURE 5-1 IOB Interfaces (CCX, L2, EFC, and the UCB clusters)
■ In most of the UCB interfaces, the IOB is master and the cluster/block is a
slave, with the exception of the TAP. The TAP interface is unique – it is both
master and slave.
■ All UCB interfaces are visible through the debug ports.
■ J-Bus Mondo Interrupt Interface:
■ 16-bit request interface and a valid bit.
■ Header with 5-bit source and target (thread) IDs.
■ 8 cycles of data - 128 bits (J-Bus Mondo Data 0 & 1).
■ 2-bit acknowledge interface - ACK / NACK.
■ Efuse Controller (EFC) – Serial Interface:
■ Shifted in at power-on reset (POR) to make the contents visible to software (read-only).
■ CORE_AVAIL, PROC_SER_NUM.
■ Debug Ports:
■ Internal visibility port on each UCB interface.
■ L2-cache visibility port input from the L2-cache (2 x 40-bits @ CMP clock).
■ Debug port A output to the debug pads (40-bits @ J-Bus clock).
■ Debug port B output to the JBI (2 x 48-bits @ J-Bus clock).
FIGURE: UCB interface between the IOB and a cluster (IOB to cluster: iob_ucb_vld,
iob_ucb_data, ucb_iob_stall; cluster to IOB: ucb_iob_vld, ucb_iob_data[M-1:0],
iob_ucb_stall; UCB packets carry addr[39:0], data[63:0], and control)
UCB_READ_NACK 0000
UCB_READ_ACK 0001
UCB_WRITE_ACK 0010
UCB_IFILL_ACK 0011
UCB_READ_REQ 0100
UCB_WRITE_REQ 0101
UCB_IFILL_REQ 0110
UCB_IFILL_NACK 0111
There is no write NACK as writes to invalid addresses are dropped. Some packet
types have data (payload) while others are without data (no payload).
UCB_SIZE_1B 000
UCB_SIZE_2B 001
UCB_SIZE_4B 010
UCB_SIZE_8B 011
UCB_SIZE_16B 111
The buffer ID is 00 when the master is CPU and the ID is 01 when the master is
TAP. The thread ID has two parts – CPU ID (3-bits) and Thread ID within CPU (2-
bits).
UCB_INT 1000
UCB_INT_VEC 1100 IOB Internal Use Only
UCB_RESET_VEC 1101 IOB Internal Use Only
UCB_IDLE_VEC 1110 IOB Internal Use Only
UCB_RESUME_VEC 1111 IOB Internal Use Only
TABLE 5-7 shows a UCB no-payload packet (64 bits) transferred over an 8-bit
interface without stalls, followed by the same transfer with stalls.

Without stalls:
iob_ucb_vld        0  1  1  1  1  1  1  1  1  0
iob_ucb_data[7:0]  X  D0 D1 D2 D3 D4 D5 D6 D7 X
ucb_iob_stall      0  0  0  0  0  0  0  0  0  0

With stalls:
iob_ucb_vld        0  1  1  1  1  1  1  1  1  1  1  1  0
iob_ucb_data[7:0]  X  D0 D1 D2 D2 D2 D3 D3 D4 D5 D6 D7 X
ucb_iob_stall      0  0  1  1  1  0  0  0  0  0  0  0  0
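The byte-serial transfer above can be modeled with the following sketch, in which the sender drives one byte per cycle and simply holds the current byte while the stall input is asserted. The immediate stall reaction here is a simplification of the pipelined behavior shown in the tables, and the names are illustrative.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t pkt;        /* 64-bit UCB no-payload packet */
    int      byte_idx;   /* next byte (D0..D7) to drive */
} ucb_tx_t;

/* One cycle of the 8-bit UCB interface: returns the valid signal and drives
 * one data byte; the byte index advances only when stall is deasserted. */
static bool ucb_tx_cycle(ucb_tx_t *tx, bool stall, uint8_t *data_out)
{
    if (tx->byte_idx >= 8)
        return false;                                    /* packet complete: vld = 0 */
    *data_out = (uint8_t)(tx->pkt >> (8 * tx->byte_idx));
    if (!stall)
        tx->byte_idx++;
    return true;                                         /* vld = 1 */
}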
FIGURE: IOB block diagram (c2i and i2c datapaths between the PCX/CPX interfaces
and the UCB interfaces to the CTU, DRAM 0/2, DRAM 1/3, JBI PIO, JBI SPI, and TAP;
JBI Mondo interrupt data and ACK/NACK queues; CSRs; EFC serial interface; debug
port A and debug port B)
iv. Generates a CPX interrupt packet to the target thread using the J_INT_VEC CSR
and sends it
■ If J_INT_BUSY[target] CSR BUSY = 1
i. J_INT_VEC – specifies the interrupt vector for the CPX interrupt to the target
thread
ii. J_INT_BUSY (count 32) – source and BUSY for each target thread
iii. J_INT_DATA0 (count 32) – mondo data 0 for each target thread
iv. J_INT_DATA1 (count 32) – mondo data 1 for each target thread
v. J_INT_ABUSY, J_INT_ADATA0, J_INT_ADATA1 – aliases to J_INT_BUSY,
J_INT_DATA0, J_INT_DATA1 for the current thread
■ The interrupt handler must clear the BUSY bit in J_INT_BUSY[target] to allow
future mondo interrupts to that thread
The output debug ports have separate mux select and filtering on each port. There
are two debug ports:
■ Debug port A - dedicated debug pins (40-bits @ J-Bus clock)
■ Debug port B - J-Bus port (2 x 48-bits @ J-Bus clock)
■ 16-bytes data return to a non-existent module (AID 2)
clk_iob_cmp_cken In CTU
clk_iob_data[3:0] In CTU
clk_iob_jbus_cken In CTU
clk_iob_stall In CTU
clk_iob_vld In CTU
clspine_iob_resetstat[3:0] In
clspine_iob_resetstat_wr In
clspine_jbus_rx_sync In RX synchronous
clspine_jbus_tx_sync In TX synchronous
cmp_adbginit_l In CTU Asynchronous reset
cmp_arst_l In CTU Asynchronous reset
cmp_gclk In CTU Clock
cmp_gdbginit_l In CTU Synchronous reset
J-Bus Interface
This chapter contains the following topics about the J-Bus interface (JBI) functional
block:
■ Section 6.1, “Functional Description” on page 6-1
■ Section 6.2, “I/O Signal list” on page 6-8
■ There are only two sub-blocks in the JBI (the J-Bus parser and J-Bus transaction
issue) that are specific to the J-Bus. All of the other blocks are J-Bus independent
and can be used for any other external bus interface implementation.
FIGURE 6-1 JBI Functional Block Diagram (per-bank request queues and return
queues, 16 x 138b each, between the JBI and the L2 SCTAG/SCBUF banks 0 to 3;
write decomposition queue, 16 x 156b; interrupt queue, 16 x 138b; interrupt
ACK/NACK queue, 16 x 10b; debug FIFOs, 32 x 64b; IOB interface; J-Bus parser,
J-Bus transaction issue, and SSI)
The following sub-sections describe the various JBI transactions and interfaces from
the JBI to the other functional blocks.
3. WriteMerge (WRM):
■ WRM is similar to WRI but with 64-bit byte enables, supporting 0-byte to 64-byte
writes.
■ Multiple 8-byte write requests (WR8) to the L2-cache
■ Write decomposition (see the sketch following this list)
■ WRM is broken into 8-byte write requests (WR8) and sent to the L2-cache at
the head of the write decomposition queue (WDQ)
■ Number of requests is dependent on the WRM byte enable pattern
■ Each WR8 request writes 1 to 8 contiguous bytes
■ If a run of contiguous bytes crosses an 8-byte address boundary, two WR8s are
generated
■ A WRM transaction can generate up to 32 WR8s to the L2-cache
■ Writes to the L2-cache may observe strict ordering with respect to the other writes
to the L2-cache (software programmable)
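The decomposition rule above can be sketched as follows: walk the 64-bit WRM byte-enable mask, emit one WR8 per run of contiguous enabled bytes, and split any run at an 8-byte address boundary. The wr8_issue callback is a hypothetical stand-in for issuing a WR8 to the L2-cache; this is a behavioral model, not the WDQ logic.

#include <stdint.h>

static int wrm_decompose(uint64_t byte_en,
                         void (*wr8_issue)(unsigned offset, unsigned nbytes))
{
    int count = 0;
    unsigned b = 0;
    while (b < 64) {
        if (!((byte_en >> b) & 1)) { b++; continue; }
        unsigned start = b;
        /* extend the run of enabled bytes, never across an 8-byte boundary */
        while (b < 64 && ((byte_en >> b) & 1) && (b / 8 == start / 8))
            b++;
        wr8_issue(start, b - start);     /* one WR8 writes 1 to 8 contiguous bytes */
        count++;
    }
    return count;                        /* at most 32 WR8s for a single WRM */
}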
A read request comes from the IOB, is stored in the PIO request queue, and then goes
out on the J-Bus. The data read from the J-Bus is parsed by the J-Bus parser and
stored in the PIO return queue, which is sent to the IOB.
Read transactions (NCRD) can be 1-, 2-, 4-, 8-, or 16-byte reads and are aligned to
their size. A maximum of 1 to 4 pending reads to the J-Bus is supported (software
programmable). Read returns to the IOB may observe strict ordering with respect to
the writes to the L2-cache (software programmable).
Floating-Point Unit
■ The FPU includes three independent execution pipelines:
■ Floating-point adder (FPA) – adds, subtracts, compares, conversions
■ Floating-point multiplier (FPM) – multiplies
■ Floating-point divider (FPD) – divides
■ One instruction per cycle may be issued from the FPU input FIFO queue to one of
the three execution pipelines.
■ One instruction per cycle may complete and exit the FPU.
■ Support for all IEEE 754 floating-point data types (normalized, denormalized,
NaN, zero, infinity). A denormalized operand or result will never generate an
unfinished_FPop trap to the software. The hardware provides full support for
denormalized operands and results.
■ IEEE non-standard mode (FSR.ns) is ignored by the FPU.
■ The following instruction types are fully pipelined and have a fixed latency,
independent of operand values – add, subtract, compare, convert between
floating-point formats, convert floating-point to integer, and convert integer to
floating-point.
■ The following instruction types are not fully pipelined – multiply (fixed latency,
independent of operand values) and divide (variable latency, dependent on
operand values).
■ Divide instructions execute in a dedicated datapath and are non-blocking.
■ Underflow tininess is detected before rounding. Loss of accuracy is detected
when the delivered result value differs from what would have been computed
were both the exponent range and precision unbounded (inexact condition).
■ A precise exception model is maintained. The OpenSPARC T1 implementation
does not require early exception detection/prediction. A given thread stalls
(switches out) while waiting for an FPU result.
■ The FPU includes three parallel pipelines and these pipelines can simultaneously
have instructions at various stages of completion. FIGURE 7-1 displays an FPU
block diagram that shows these parallel pipelines.
ISA: SPARC V9
VIS: Not available
Issue: 1
Register file: In FFU
FDIV blocking: No
Full hardware denorm support: Yes
Hardware quad support: No
TABLE 7-2 SPARC V9 Single and Double Precision FPop Instruction Set
If an FPA or FPM execution pipeline is waiting for its result to exit the FPU, the
pipeline will stall at the final execution stage. If the final execution stage is not
occupied by a valid instruction, instructions within the pipeline will advance, and
the input FIFO queue may issue to the pipeline. If the final execution stage is
occupied by a valid instruction then each pipeline stage is held.
The input FIFO queue will not advance if the instruction at the head of the FIFO
must issue to a pipeline whose stages are held because a result from that pipeline
has not exited the FPU.
The FPU has independent clock control for each of the three execution pipelines
(FPA, FPM, and FPD). Clocks are gated for a given pipeline when it is not in use, so
a pipeline will have its clocks enabled only under one of the following conditions:
■ The pipeline is executing a valid instruction
■ A valid instruction is issuing to the pipeline
■ The reset is active
■ The test mode is active
The input FIFO queue and output arbitration blocks receive free running clocks. This
eliminates potential timing issues, simplifies the design, and has only a small impact
on the overall FPU power savings.
The FPU power management feature automatically powers up and powers down
each of the three FPU execution pipelines, based on the contents of the instruction
stream. Also, the pipelines are clocked only when required. For example, when no
divide instructions are executing, the FPD execution pipeline automatically powers
down. Power management is provided without affecting functionality or
performance, and it is transparent to the software.
The underflow exception condition is defined separately for the trap-enabled and
trap-disabled states.
■ FSR.UFM = 1 – underflow occurs when the intermediate result is tiny
■ FSR.UFM = 0 – underflow occurs when the intermediate result is tiny and there is
a loss of accuracy
A tiny result is detected before rounding, when a non-zero result value is computed
as though the exponent range were unbounded and would be less in magnitude
than the smallest normalized number.
Loss of accuracy is detected when the delivered result value differs from what
would have been computed had both the exponent range and the precision been
unbounded (an inexact condition).
The FPA, FPM, and FPD will signal an underflow to the SPARC core FFU for all tiny
results. The FFU must clear the FSR.ufc flag if the result is exact (the FSR.nxc is not
set) and the FSR.UFM mask is not set. This case represents an exact denormalized
result.
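Putting the two cases together, the reported underflow condition reduces to the following sketch; tininess is evaluated before rounding, as described above, and the names are illustrative.

#include <stdbool.h>

/* FSR.UFM = 1: underflow when the intermediate result is tiny.
 * FSR.UFM = 0: underflow when the result is tiny and accuracy is lost (inexact). */
static bool fp_underflow(bool tiny_before_rounding, bool inexact, bool ufm)
{
    return ufm ? tiny_before_rounding
               : (tiny_before_rounding && inexact);
}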
Note – The FPU does not receive the trap enable mask (FSR.TEM). The FSR.TEM
bits are used within the FFU. If an instruction generates an IEEE exception when the
corresponding trap enable is set, then an fp_exception_ieee_754 trap is generated
and the results are inhibited by the FFU.
Underflow or
Instruction Invalid Divide by zero Overflow Denormalized Inexact
FiTOs result=IEEE6
FSR.nxc=1
FiTOd Cannot generate IEEE exceptions
FMOV(s,d) Executed in SPARC core FFU (cannot generate IEEE exceptions)
FMOV(s,d)cc Executed in SPARC core FFU (cannot generate IEEE exceptions)
FMOV(s,d)r Executed in SPARC core FFU (cannot generate IEEE exceptions)
FMUL(s,d) • SNaN result=±max result=±0 or result=IEEE6
• ∞×0 or ±∞ ±min or FSR.nxc=17
result=NaN1, 2 FSR.ofc=14 ±denorm
FSR.nvc=1 FSR.ufc=15, 4
FNEG(s,d) Executed in SPARC core FFU (cannot generate IEEE exceptions)
Underflow or
Instruction Invalid Divide by zero Overflow Denormalized Inexact
FsMULd • SNaN
• ∞×0
result=NaN1, 2
FSR.nvc=1
FSQRT(s,d) Unimplemented
F(s,d)TOi • NaN result=IEEE6
• • FSR.nxc=1
• large
result=max
±integer3
FSR.nvc=1
FsTOd • SNaN
result=NaN2
FSR.nvc=1
FdTOs • SNaN result=±max or result=±0 or result=IEEE6
result=NaN2 ±∞ ±min or FSR.nxc=17
FSR.nvc=1 FSR.ofc=14 ±denorm
FSR.ufc=15, 4
F(s,d)TOx • NaN result=IEEE6
• • FSR.nxc=1
• large
result=max
±integer3
FSR.nvc=1
FSUB(s,d) • SNaN result=±max result=±0 or result=IEEE6
• ∞–∞ or ±∞ ±min or FSR.nxc=17
result=NaN1, 2 FSR.ofc=14 ±denorm
FSR.nvc=1 FSR.ufc=15, 4
FxTO(s,d) result=IEEE6
FSR.nxc=1
1 Default response QNaN = x’7ff...fff’
2 SNaN input propagated and transformed to QNaN result
3 Maximum signed integer (x’7ff...fff’ or x’800...000’)
4 FFU will clear FSR.ofc (FSR.ufc) if overflow (underflow) exception traps and FSR.OFM (FSR.UFM) is not set and FSR.NXM is set. FFU
will set FSR.nxc.
5 FFU will clear FSR.ufc if the result is exact (FSR.nxc is not set) and FSR.UFM is not set. This case represents an exact denormalized result.
6 Rounded or overflow (underflow) result.
7 FFU will clear FSR.nxc if an overflow (underflow) exception does trap because FSR.OFM (FSR.UFM) is set, regardless of whether
FSR.NXM is set. FFU will set FSR.ofc (FSR.ufc).
DRAM Controller
This chapter describes the following topics for the double data rate two (DDR-II)
dynamic random access memory (DRAM) controller:
■ Section 8.1, “Functional Description” on page 8-1
■ Section 8.2, “I/O Signal List” on page 8-9
■ The DRAM controller performs L2-cache writebacks to the DIMMs
■ Out-of-bound write addresses are silently dropped
■ Uncorrectable L2-cache data is stored by poisoning the data
■ The DRAM controller performs DRAM data scrubbing
■ DRAM controller issues periodic refreshes to the DIMMs
■ Supports DRAM power throttling by reducing the number of DIMM activations
■ To program the DRAM controller control and status registers (CSRs), the
controller uses the UCB bus as an interface to the I/O bridge (IOB)
FIGURE: DRAM controller block diagram (L2 interface with 128+28-bit fill and 64-bit
write paths, control and error detection/correction logic, and a 256-bit dram_clk
datapath to the DIMMs)
1. Refresh request.
4. Write pending RAS requests, which have matching addresses, as read requests
that are picked for RAS.
5. Read RAS requests from read queues, or write RAS requests from write queues
when the write starvation counter reaches its limit (round-robin).
6. Write RAS requests from write queues, or read RAS requests from read queues if
the write starvation counter reaches its limit.
FIGURE: DRAM controller scheduling state machine (Idle after init done; Refresh,
issuing refresh commands until all banks are closed; RAS Pick and CAS Pick entered
when a RAS or CAS request is present and timing is met; Wait otherwise)
8.1.4 Errors
The DRAM controller error mechanism has the following characteristics:
■ Error injection can be done through software programming
■ Error registers are accessible by way of the IOB interface
■ Error counter registers can send an interrupt when reaching a programmed count
■ All correctable and uncorrectable errors are logged and sent to the L2-cache along
with the data
■ DRAM scrub errors are also forwarded to L2-cache independently
■ Error location register logs the error nibble position on correctable errors
■ The scrub error address is also logged in the error address register
TABLE 8-3 lists the subset of DDR-II SDRAM commands used by the OpenSPARC T1
processor.
Function                         CKE (prev)  CKE (curr)  CS_L  RAS_L  CAS_L  WE_L  Bank  Address
Mode/extended mode register set  H           H           L     L      L      L     BA    Op-code
Auto refresh                     H           H           L     L      L      H     X     X
Self refresh entry               H           L           L     L      L      H     X     X
Self refresh exit                L           H           H     X      X      X     X     X
                                 L           H           L     H      H      H     X     X
Precharge all banks              H           H           L     L      H      L     X     A10=H
Bank activate                    H           H           L     L      H      H     BA    Row address
Write with auto precharge        H           H           L     H      L      L     BA    Column address, A10=H
Read with auto precharge         H           H           L     H      L      H     BA    Column address, A10=H
No operation                     H           X           L     H      H      H     X     X
Device deselect                  H           X           H     X      X      X     X     X
Signal Name    I/O    Source/Destination    Description
dram_other_pt_max_banks_open_v In
alid
dram_other_pt_max_time_valid In
dram_other_pt_ucb_data[16:0] In
dram_other_pt0_opened_bank In
dram_other_pt1_opened_bank In
io_dram0_data_in[255:0] In PADS I/O data in
io_dram0_data_valid In PADS I/O data valid
io_dram0_ecc_in[31:0] In PADS I/O ECC in
io_dram1_data_in[255:0] In PADS I/O data in
io_dram1_data_valid In PADS I/O data valid
io_dram1_ecc_in[31:0] In PADS I/O ECC in
iob_ucb_data[3:0] In IOB UCB data
iob_ucb_stall In IOB UCB stall
iob_ucb_vld In IOB UCB valid
scbuf0_dram_data_mecc_r5 In SCBUF0
scbuf0_dram_data_vld_r5 In SCBUF0
scbuf0_dram_wr_data_r5[63:0] In SCBUF0 To dramctl0 of dramctl.v
scbuf1_dram_data_mecc_r5 In SCBUF1
scbuf1_dram_data_vld_r5 In SCBUF1
scbuf1_dram_wr_data_r5[63:0] In SCBUF1 To dramctl1 of dramctl.v
sctag0_dram_addr[39:5] In SCTAG0 To dramctl0 of dramctl.v
sctag0_dram_rd_dummy_req In SCTAG0
sctag0_dram_rd_req In SCTAG0 To dramctl0 of dramctl.v
sctag0_dram_rd_req_id[2:0] In SCTAG0 To dramctl0 of dramctl.v
ctu_tst_short_chain In CTU
dram_io_addr0[14:0] Out PADS DRAM address 0
dram_io_addr1[14:0] Out PADS DRAM address 1
dram_io_bank0[2:0] Out PADS DRAM bank 0
dram_io_bank1[2:0] Out PADS DRAM bank 1
dram_io_cas0_l Out PADS DRAM CAS 0
dram_io_cas1_l Out PADS DRAM CAS 1
dram_io_channel_disabled0 Out PADS DRAM channel disable 0
dram_io_channel_disabled1 Out PADS DRAM channel disable 1
dram_io_cke0 Out PADS DRAM CKE 0
dram_io_cke1 Out PADS DRAM CKE 1
dram_io_clk_enable0 Out PADS DRAM clock enable 0
dram_io_clk_enable1 Out PADS DRAM clock enable 1
dram_io_cs0_l[3:0] Out PADS DRAM CS 0
dram_io_cs1_l[3:0] Out PADS DRAM CS 1
dram_io_data0_out[287:0] Out PADS DRAM data 0
dram_io_data1_out[287:0] Out PADS DRAM data 1
dram_io_drive_data0 Out PADS From dramctl0 of dramctl.v
dram_io_drive_data1 Out PADS From dramctl1 of dramctl.v
dram_io_drive_enable0 Out PADS From dramctl0 of dramctl.v
dram_io_drive_enable1 Out PADS From dramctl1 of dramctl.v
dram_io_pad_clk_inv0 Out PADS
dram_io_pad_clk_inv1 Out PADS
dram_io_pad_enable0 Out PADS
dram_io_pad_enable1 Out PADS
dram_io_ptr_clk_inv0[4:0] Out PADS
dram_io_ptr_clk_inv1[4:0] Out PADS
dram_io_ras0_l Out PADS DRAM RAS 0
dram_io_ras1_l Out PADS DRAM RAS 1
dram_io_write_en0_l Out PADS DRAM write enable 0
dram_pt_ucb_data[16:0] Out
dram_clk_tr Out CTU Debug trigger @ J-Bus freq
dram_so Out DFT Scan out
Error Handling
9.1.1 Error Reporting and Logging
■ SPARC core errors are logged in program order, and only after the instruction
has exited the pipe (W-stage). Rolled-back and flushed instructions do not log
errors immediately. Errors are logged in the L2-cache and DRAM error registers
in the order in which they occur.
■ Errors are reported hierarchically in the following order – DRAM, L2-cache, and
SPARC core. For diagnostic reasons, the L2-cache can be configured to not report
errors to the SPARC core.
■ The SPARC, L2-cache, and DRAM error registers log error details for a single
error only.
■ Fatal and uncorrectable errors will overwrite earlier correctable error information.
■ The error registers have bits to indicate whether multiple errors occurred (see the sketch following this list).
■ Refer to the UltraSPARC T1 Supplement to UltraSPARC Architecture 2005 for
detailed information about error control and status register (CSR) definitions,
including addresses, bit fields, and so on.
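The single-error logging rules in the list above can be sketched as follows. The C code is illustrative only (the field names are assumptions): a later fatal or uncorrectable error overwrites logged correctable-error details, and any additional error sets a multiple-error bit.

#include <stdbool.h>
#include <stdint.h>

struct err_reg {
    bool     valid;          /* an error has been logged                  */
    bool     uncorrectable;  /* the logged error was fatal/uncorrectable  */
    bool     multi_err;      /* more than one error has occurred          */
    uint64_t detail;         /* address/syndrome of the logged error      */
};

static void log_error(struct err_reg *r, bool ue, uint64_t detail)
{
    if (r->valid) {
        r->multi_err = true;             /* a second (or later) error           */
        if (!ue || r->uncorrectable)
            return;                      /* correctable errors never overwrite,
                                            and the first UE is kept (assumed)  */
    }
    r->valid         = true;             /* first error, or a UE replacing a CE */
    r->uncorrectable = ue;
    r->detail        = detail;
}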
The following subsections describe the errors in the SPARC core, the L2-cache, and
the DRAM. Errors in other blocks, such as the IOB and the JBI, are described in
their respective chapters.
Clock and Test Unit (CTU)
■ The CTU has the following sub-blocks – PLL (clock PLL), random number
generator (RNG), design for testability (DFT), clock spine (CLSP), and the
temperature sensor (TSR).
■ The CTU generates the following signals for each cluster – clock, clock enable,
reset (synchronous and asynchronous), init (debug init), sync pulses for clock
domain crossing, and built-in self test (BIST) signals for blocks with memory
BIST.
■ For debugging purposes, the CTU receives a trigger signal from the cluster.
■ The CTU and PADS themselves are clock and reset recipients.
FIGURE 10-1 displays a high-level block diagram of the CTU clock and reset signals
and CTU sub-blocks.
[FIGURE 10-1: CTU block diagram — J_CLK[1:0], TRST_L, PWRON_RST_L, and PLL_CHAR_IN enter from the pads; the PLL, CLSP, DFT, and RNG sub-blocks generate the clk, cken, rst/init, sync, and bist signals to the OpenSPARC T1 clusters.]
FIGURE 10-2 shows the PLL block diagram including the VCO and the feedback path.
[FIGURE 10-2: PLL block diagram — the BW_PLL block in the CTU takes J_CLK[1:0], PLL_CHAR_IN, pll_bypass, and pll_arst_l; the VCO output pll_raw_clk_out drives pll_clk_out, the jdup_div divider (jbus_gclk_dup / jbus_gclk_dup_out), and the CLKOBS[1:0] observation outputs.]
Each clock domain (C, D, and J) is generated by dividing the PLL clock, and each
domain uses its own divide ratio and positive/negative edge pairs. In PLL bypass
mode, the divide ratios are fixed – the C clock is divided by 1, and the D and J
clocks are divided by 4. Refer to the UltraSPARC T1 Supplement to UltraSPARC
Architecture 2005 for the complete definitions of these clock divider ratios.
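As a behavioral illustration of these divide ratios (not the CLSP divider logic; the edge-counting model and names are assumptions), the C sketch below toggles a divided clock every ratio PLL clock edges, so a ratio of 1 reproduces the PLL frequency for the C domain and a ratio of 4 gives the bypass-mode D and J clocks.

#include <stdbool.h>

struct clk_div {
    unsigned ratio;   /* divide ratio: 1 for C, 4 for D and J in bypass mode */
    unsigned count;   /* PLL edges seen since the last output toggle         */
    bool     out;     /* divided clock level                                 */
};

/* Call once on every rising and falling edge of the PLL output clock.
 * The output then runs at 1/ratio of the PLL frequency. */
static void clk_div_edge(struct clk_div *d)
{
    if (++d->count >= d->ratio) {
        d->count = 0;
        d->out = !d->out;
    }
}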
The clock divider block diagram and waveforms are shown in FIGURE 10-3.
[FIGURE 10-3: Clock divider — the PLL output (pll_clk_out) is divided per domain (dom_div) under control of div_vec[14:0], init, and align, producing pos/neg pulse pairs and the divided output clock.]
The clock divider and other parameters are stored in shadowed control registers
(CREGs). A cold reset (or a power-on reset) sets the default values in each CREG and
its shadow. A warm reset with a frequency change copies each CREG to its shadow.
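The shadowing behavior can be sketched as follows; the C code is illustrative, with assumed field names (the actual CREG definitions are in the UltraSPARC T1 Supplement). Presumably the clock logic runs from the shadow copy, so a software update to a CREG takes effect only when it is copied into the shadow.

#include <stdint.h>

struct creg {
    uint64_t value;    /* software-visible register, written over the UCB interface */
    uint64_t shadow;   /* copy actually used by the clock logic                     */
};

static void cold_reset(struct creg *r, uint64_t dflt)
{
    r->value  = dflt;       /* cold/power-on reset: defaults loaded into both copies     */
    r->shadow = dflt;
}

static void warm_reset_fchg(struct creg *r)
{
    r->shadow = r->value;   /* warm reset with frequency change: new ratios take effect */
}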
FIGURE 10-4 shows waveforms for the clock domain crossing Rx and Tx sync pulses.
[FIGURE 10-4: Domain-crossing waveforms — dram_rx path: signal Yd in the dram_clk domain is sampled into the cmp_clk domain (Y, gated by the rx sync enable) to produce Yc; dram_tx path: signal Xc in the cmp_clk domain is sampled into the dram_clk domain (X, gated by the tx sync enable) to produce Xd. Both domains are derived from the same PLL.]
[FIGURE: Cluster clock and reset header — the CTU distributes gclk, cken, arst_l, grst_l, dbginit_l, adbginit_l (derived in part from PWRON_RST_L), rx_sync, and tx_sync to each cluster's sync header, which generates the cluster-local rclk, rst_l, and dbginit_l.]
1. Assert resets
c. For cold resets, assertion of the PWRON_RST_L reset asserts all resets.
Deassertion of the PWRON_RST_L reset deasserts only asynchronous ones,
while the synchronous ones remain asserted.
ii. For fchg and warm resets, CREG_CLK_CTL.SRARM defines whether the rfsh
attribute is on or off
iii. If the rfsh is not on, the D domain reset is asserted at the same time
iv. If rfsh is on, the self_refresh signal to DRAM is asserted, and the D reset is
asserted about 1600 ref cycles after C and J resets
v. The gap for the D and J domain clock enables is subject to Tx_sync.
i. The J-div may be turned off, but then the J-tree is fed from j-dup
ii. PLL lock mode – reset count = 128, reset + lock = 32000 (for a cold reset)
d. For other warm resets, a fake sequence is used, where the PLL reset is not
asserted and counters are shorter
a. The C, D, and J domain dividers start in sync with J-dup, and the result is a
common rising (coincident) edge. (For cycle-deterministic operation,
tester/diagnostics tests must keep track of coincident edges.)
b. If the JBUS_GCLK was running from J-dup, it switches to J-div (in PLL bypass
mode, JBUS_GCLK is not the same frequency as J_CLK)
b. The starting cluster is 0 (for sparc0), and the enables progress in CREG bit
order
a. For cold resets, the ARST_L signals are already deasserted at the deassertion of
the PWRON_RST_L reset
b. The GRST_L signals are deasserted at the same time in all domains
c. The DLL reset is deasserted a few cycles before the GRST_L deassertion
d. There is no handshake to indicate the end of the operation, and the CTU just
waits a fixed number of cycles
9. Do BIST
a. At the J_RST_L reset deassert time, the DO_BIST pin is sampled for eight cycles
to determine the msg (see the sketch after this step), which determines:
c. The CTU starts the BIST engines (enabled by EFC), and then the CTU waits for
a response from the engines
d. The status from each BIST engine is recorded, but does not affect reset
sequence
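Step 9a can be illustrated with a small sketch: the DO_BIST pin is sampled for eight cycles to build the msg value. The bit order, helper function, and interpretation of msg are assumptions, not the documented encoding.

#include <stdbool.h>
#include <stdint.h>

/* pin_sample() returns the level of the DO_BIST pin for the current cycle. */
static uint8_t sample_do_bist(bool (*pin_sample)(void))
{
    uint8_t msg = 0;
    for (int cycle = 0; cycle < 8; cycle++)
        msg = (uint8_t)((msg << 1) | (pin_sample() ? 1u : 0u));
    return msg;   /* the msg then selects how BIST is run */
}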
3. Assertion of the J_RST_L reset, which performs steps 5 and 6 of the generic reset
sequence.
4. Deassertion of the J_RST_L reset, which performs the steps 7, 8, 9, and 10 of the
generic reset sequence.
There are two types of cold resets: normal and deterministic. The timing of
the J_RST_L reset assertion determines the reset type. On the tester, the
deterministic type is used.
1. Assertion of the J_RST_L reset, which performs steps 1 through 6 of the preceding
generic reset sequence described in Section 10.1.2.3, “Reset Sequence” on
page 10-11.
2. Deassertion of the J_RST_L reset, which performs steps 7 and 10 of the generic
reset sequence (skipping steps 8 and 9).
The SPARC core initiates a warm reset by writing to the I/O bridge (IOB) in
order to toggle the J_RST_L reset signal. A warm reset can be used for:
■ Recovering from hangs
■ Creating a deterministic diagnostics start
■ Changing frequency
Signal Name                          I/O   Source/Destination   Description
rclk In
enable_chk In
ctu_tst_pre_grst_l Out
global_shift_enable Out From ctu_dft of ctu_dft.v
ctu_tst_scanmode Out From ctu_dft of ctu_dft.v
ctu_tst_macrotest Out From ctu_dft of ctu_dft.v
ctu_tst_short_chain Out From ctu_dft of ctu_dft.v
ctu_efc_read_start Out EFC
ctu_jbi_ssiclk Out JBI
ctu_dram_rx_sync_out Out DRAM From ctu_clsp of ctu_clsp.v
ctu_dram_tx_sync_out Out DRAM From ctu_clsp of ctu_clsp.v
ctu_jbus_rx_sync_out Out JBI From ctu_clsp of ctu_clsp.v
ctu_jbus_tx_sync_out Out JBI From ctu_clsp of ctu_clsp.v
cmp_grst_out_l Out From ctu_clsp of ctu_clsp.v
afo_rng_clk Out From u_rng of bw_rng.v
afo_rng_data Out From u_rng of bw_rng.v
afo_rt_ack Out From ctu_dft of ctu_dft.v
afo_rt_data_out[31:0] Out From ctu_dft of ctu_dft.v
afo_tsr_dout[7:0] Out From u_tsr of bw_tsr.v
clsp_iob_data[3:0] Out From ctu_clsp of ctu_clsp.v
clsp_iob_stall Out IOB From ctu_clsp of ctu_clsp.v
clsp_iob_vld Out IOB From ctu_clsp of ctu_clsp.v
cmp_adbginit_l Out From ctu_clsp of ctu_clsp.v
cmp_arst_l Out From ctu_clsp of ctu_clsp.v
cmp_gclk_out Out From ctu_clsp of ctu_clsp.v
cmp_gdbginit_out_l Out From ctu_clsp of ctu_clsp.v
ctu_ccx_cmp_cken Out From ctu_clsp of ctu_clsp.v
ctu_dbg_jbus_cken Out From ctu_clsp of ctu_clsp.v
ctu_ddr0_clock_dr Out PADS From ctu_dft of ctu_dft.v
ctu_ddr0_dll_delayctr[2:0] Out PADS From ctu_clsp of ctu_clsp.v