DIGITAL INDUSTRIES SOFTWARE
Streaming Scan Network

An Efficient Packetized Data Network for Testing of Complex SoCs
Executive summary
Originally presented at the 2020 International Test Conference by Siemens and Intel
authors, this paper describes the Tessent Streaming Scan Network and demonstrates
how this packetized data network optimizes test time and implementation
productivity for today’s complex SoCs.
The IEEE paper is reprinted here in full with permission.
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating
new collective works, for resale or redistribution to servers or lists, or reuse of any
copyrighted component of this work in other works.
Authors: Jean-François Côté, Mark Kassab, Wojciech Janiszewski, Ricardo Rodrigues,
Reinhard Meier, Bartosz Kaczmarek, Peter Orlando, Geir Eide, Janusz Rajski,
Glenn Colon-Bonet, Naveen Mysore, Ya Yin and Pankaj Pant
siemens.com/software
White Paper – Streaming Scan Network
Contents
Streaming Scan Network (SSN): An Efficient Packetized Data Network for Testing of Complex SoCs
Abstract
I. INTRODUCTION
II. PRIOR WORK
III. SSN TECHNOLOGY FUNDAMENTALS
IV. MANAGING CLOCK SKEW & BUS WIDTH
VI. TESTING OF MULTIPLE IDENTICAL CORES
VII. ALTERNATE INTERFACES
VIII. PRACTICAL EXPERIENCE USING SSN
IX. CONCLUSION
ACKNOWLEDGMENT
References
Streaming Scan Network (SSN): An Efficient Packetized Data Network for Testing of Complex SoCs

Jean-François Côté, Mark Kassab, Wojciech Janiszewski, Ricardo Rodrigues, Reinhard Meier, Bartosz Kaczmarek, Peter Orlando, Geir Eide, Janusz Rajski, Glenn Colon-Bonet, Naveen Mysore, Ya Yin, Pankaj Pant

Mentor, A Siemens Business, 8005 SW Boeckman Road, Wilsonville, OR 97070
Intel Corporation, 4701 Technology Parkway, Fort Collins, CO 80528
Intel Corporation, 75 Reed Road, Hudson, MA 01749

Abstract
System-on-Chip (SoC) designs are increasingly difficult to test using traditional scan access methods without incurring inefficient test time, high planning effort, and physical design/timing closure challenges. The number of cores keeps growing while chip pin counts available for scan remain constant or decline, limiting the ability to drive cores concurrently. With increasingly commonplace tiling and abutment, the scan distribution hardware must be placed inside the cores, making balanced pipelining when broadcasting to identical cores difficult. Optimizing test time requires analyzing all the cores and subsequently changing the test hardware in the cores. Internal shift speed constraints may limit the ability to shift data in and out of the chip at high rates. Differences in pattern counts or scan chain lengths between cores tested in parallel can result in padding and increased test time. SSN is a bus-based scan data distribution architecture designed to address all these challenges. It enables simultaneous testing of any number of cores even with few chip I/Os. It facilitates short test time by enabling high-speed data distribution, by efficiently handling imbalances between cores, and by supporting testing of any number of identical cores with a constant cost. It provides a plug-and-play interface in each core that is well suited for abutted tiles, and simplifies scan timing closure. This paper also compares the test cost and implementation productivity of SSN with those of Intel’s Structural Test Fabric.

Keywords: Design For Test, DFT, SoC Test, Hierarchical Test, Multiple Identical Cores, Known-Good-Die Testing, Test Time Reduction, Low Pin Count Test, Scan Distribution Architecture, Scan Fabric

I. INTRODUCTION
With some Integrated Circuits (ICs) growing to billions of transistors, it is virtually impossible to design, implement, and test them flat. A System-on-a-Chip (SoC) is an IC that is composed of multiple components, referred to as cores. Each core is typically designed, implemented, and validated independently before being integrated with others. As design complexity has grown, so have the levels of core hierarchy. It is not uncommon to have lower-level cores integrated into subsystems, which are integrated into chiplets that are then assembled into a chip.

As design is done hierarchically to manage complexity, so is DFT. In hierarchical test methodologies [1][2][3], scan chains and compression logic [4][5][6] are inserted into every core. The cores are wrapped with scan and interface control logic. Test patterns targeting most faults in a core are generated and validated at the core level. Subsequently, the patterns from multiple wrapped cores are retargeted or mapped to the top level. They are often merged with patterns retargeted from other cores that are tested at the same time if scan access and design constraints permit. In addition to retargeting patterns generated for testing the wrapped logic within each core, test pattern generation is also run at the next level up to test peripheral logic outside wrapper chains as well as logic at that higher level of hierarchy. If this parent level is not the chip level, then those patterns will also have to be retargeted to the chip level. The same test pattern generation and retargeting methodology is applied recursively regardless of the levels of hierarchy, but the planning and implementation of DFT get more complex with additional levels of hierarchy, especially when using conventional scan access methods.

The following subsections explain key SoC test challenges inherent with pin-mux scan access, which is commonly used in the industry and explained in the referenced papers.

A. SoC Test Challenges: Planning and Layout
Traditionally, for a group of cores to be tested concurrently, one of the requirements is that their channel inputs and outputs must be directly connected to chip-level pins. As the number of cores in SoCs grows and the number of chip-level pins available for scan test remains the same or is reduced, additional groups of cores and scan access configurations must be created. This has negative implications on DFT implementation effort, silicon area, pattern retargeting complexity, and test time.

Part of hierarchical test planning is to identify early in the design flow the number of scan channels used in every core, and the groups of cores which will be tested concurrently in every scan access configuration. This can result in sub-optimal results since it creates fixed core groupings and forces premature decisions on channel counts per core before the cores are completed and before their compression configurations can be optimized and their pattern counts estimated. Chip-level design decisions depend on the cores. The cores are finalized too late in the design cycle, and their compression configurations are influenced by the chip-level core groupings and pin availability. This mutual dependency makes it virtually impossible to optimize compression for the SoC. As the number of levels of core hierarchy increases, the planning complexity and test inefficiency also grow.

Connecting chip pins to the cores can have physical design implications. Connecting each pin to different cores in different test configurations can lead to routing congestion. The pads may be embedded inside cores in some packaging technologies such that the connections for one core impact the design of other cores to which the signals have to be routed, or through which the scan connections flow. Those connections are also often pipelined, so timing between those pipeline stages and compression logic must be carefully designed to achieve high shift speeds and avoid timing violations.

Tile-based layout is a relatively recent trend in SoC design that is adding further complexity and constraints to DFT architectures. In pure tiling layouts, virtually all logic and routing is within the cores and not at the top level. The cores are designed to abut one another when integrated into the chip such that connections flow from one core to the next. Any connectivity between cores has to flow through cores that are between them. Logic that is at the top level has to be pushed into the cores and designed as part of the cores.

B. SoC Test Challenges: Limited Chip-Level Pins
When retargeting core-level patterns, limited chip-level pin counts can be dealt with by increasing the number of core groups and test sessions, as long as there are enough chip pins to drive at least each core individually. However, there are cases where simultaneous access to multiple or all cores is necessary, and grouping cores into smaller groups is not an option. One example is Iddq test, where scan data is loaded across the entire chip before a relatively lengthy current measurement is taken. When using scan compression such as Embedded Deterministic Test (EDT) [4], this means there must be enough pins available to drive all the EDT channels of the cores concurrently.

C. SoC Test Challenges: Identical Core Instances
Pattern retargeting in the presence of identical core instances can benefit from generating patterns once, and from the ability to broadcast the scan inputs from the same top-level pins, reducing both ATPG runtime and pin requirements. There are, however, still multiple challenges to be resolved.

Although broadcast of scan inputs keeps the number of input pins constant for any number of identical cores, the outputs are often observed independently to guarantee the same test coverage achieved at the core level and to ensure enough observability for diagnosing failing cores. Since at least 1 output channel is needed per core instance, this can limit the number of identical core instances that can be tested concurrently just as there are similar limitations on heterogeneous core instances.

The second issue is that after scan loading, the capture clocking is usually applied concurrently to all core instances. Combined with the broadcast of input scan data, the number of pipeline stages must be equal between a scan input pin and all the identical core instances it drives. This can be difficult to achieve in the presence of tiling where no routing or logic may exist outside the cores. Signals, including scan inputs, may propagate across multiple instances of the same core, accumulating pipelining delay. Routing of individual output channels from each core instance through the other core instances can also be complicated due to the fact that all cores are copies of each other. A solution exists where every core instance is programmed with a different number of pipeline stages and different routing for scan output paths, but this introduces complexity and limits the reuse of cores. Designing a new chip with more core instances requires redesigning the cores to account for differences in pipelining and routing channels.
II. PRIOR WORK
To address some of the challenges explained, a few companies have developed and published scan access technologies beyond the traditional pin-mux topologies. They vary in the scope of the challenges they address and the tradeoffs they make.

A packetized bus-based architecture specifically tailored to providing a scalable solution for testing of multiple identical core instances was introduced in [7]. It is not a general scan access mechanism that can simultaneously test heterogeneous cores. It supports shifting in the expected data, in addition to input stimuli, such that on-chip comparison can be done and pass/fail data accumulated and observed. It also allows some trade-offs between efficiency and diagnostic information. Getting full failure data for diagnosis may require the application of a different pattern set, one that uses a different configuration than the full-rate mode used for high-volume manufacturing. This architecture also has data overhead because every parallel word includes a command opcode in addition to the scan data payload. The fact that each parallel word has to include both payload and a command imposes limits on how narrow the bus may be, and imposes additional constraints on the bus width and its relation to the core scan channel counts.

The authors subsequently introduced a new architecture [8] that has a different focus: while it maintains a solution for testing of multiple identical cores, its primary new design objective is to enable better bin packing for retargeted core-level patterns. It does so by providing flexibility in mapping chip-level pins to core-level scan pins such that there is flexibility in controlling which cores are tested concurrently. Instead of a bus architecture as in [7], it uses a flexible mux-based switching network. The architecture succeeds in enabling effective dynamic bandwidth management [9] and late-binding core grouping to minimize padding caused by test length differences across cores. However, this architecture incurs some costs. Given the network provides flexibility in connecting any top-level pin to any core-level scan channel (although there are restrictions on combinations of connections), the network can result in significant routing cost, especially in the presence of a large number of cores. Using a mux-based star network is also less amenable to connection-by-abutment in tile-based designs compared to bus-based architectures.

The Structural Test Fabric (STF) solution [10][11], published by co-authors of this paper, provides a general packet-based core access mechanism that works for heterogeneous cores, and has a scalable solution for multiple identical cores. It is flexible in that every parallel word is self-contained, but incurs overhead per parallel bus word. A detailed comparison of this architecture to SSN is presented in Section VIII.

To allow simultaneously driving more internal scan channels than the number of chip-level scan pins, some architectures such as [12] employ serializers/deserializers. This additionally allows running chip-level scan pins at higher frequencies than internal scan chains support, improving overall bandwidth. A subsequent version of this technology [13] added flexibility to allow varying the number of scan pins per core. The number of external scan pins per core and the related serialization/deserialization ratio are programmable. The purpose is to enable reuse of the test data for a given core across SoCs with different scan pin configurations. It also enables varying shift frequencies in different cores within the SoC. Those methods facilitate IP reuse and access to cores in the presence of limited chip-level scan pins. However, they do not address routing challenges in tile-based designs nor provide an efficient and scalable solution for multiple identical cores.

Some scan compression methods have extensions to facilitate test across an SoC. For example, the architecture in [14] can distribute test data to compression logic in cores, and uses serializers/deserializers to manage pin count limitations. However, as with the preceding method, it is not an abutment-friendly architecture nor does it efficiently test many identical cores as SSN will be shown to do.

In the next sections, we describe how SSN aims to solve the challenges presented in Section I, while improving on efficiency, flexibility, and capabilities of previously published access mechanisms.
III. SSN TECHNOLOGY FUNDAMENTALS
A. Architecture Overview
Fig. 1 shows a simplified example of a 6-core design that uses SSN. Each core typically contains one Streaming Scan Host (SSH) node (yellow box). The SSH drives local scan resources to load/unload scan chains/channels with data delivered on the SSN bus. In the figure, an EDT scan compression controller is shown for simplicity as a representative of the scan logic within the core. In reality, the SSH node can interface with EDT controller(s), uncompressed/legacy scan chains, or a combination of the two.

Fig. 1: SSN Architecture

Each SSH has two external interfaces: an IEEE 1687 [15] IJTAG interface predominantly used for setup, and a parallel data bus that subsequently transports the payload scan data and connects one SSH node to the next. The IJTAG network, shown as a 1-bit bus, is used to configure all nodes in the SSN network prior to the application of a test pattern set. Each node is loaded with information related to the protocol such as the active bus width, its location in the series of nodes driven, the number of shift cycles per scan pattern, scan_enable transition timing information, etc. Following this setup, the entire test pattern set is applied as packetized scan data that is streamed on the parallel bus shown as an N-bit bus. Because the protocol of alternating shift/capture operations is very regular and repeatable, each SSH is pre-loaded with the information needed for its counters and finite state machine to track the streaming operation. There is no need to send opcode or address information with each packet. Only the scan payload is streamed, as shown in the next section. As data streams through the SSH nodes, each node can identify when it needs to read scan_in data from the bus, when it needs to place scan_out data on the bus, and when it needs to pass along data that is destined for other nodes. Each SSH controls the local scan operations for the core, including transitions between load/unload and capture stages, as well as performing individual shift operations. All scan signals and EDT controls are generated by the SSN local to the core and the only test signals that cross core boundaries are the SSN parallel bus (N-bit data bus + clock) and the IJTAG signals. This allows scan timing closure to be completed at the core level.

SSN supports the abutment of cores in tile-based designs with no routing outside the cores. The outputs of one core connect to the inputs of the next adjacent core. A chip with SSN usually has a single datapath (parallel bus) that goes through all cores. Depending on the floorplan and pad locations, it may be preferable for physical design to implement multiple, physically independent datapaths (for example, one datapath per chiplet [16][17]). Each datapath is also configurable and can include muxes that can be programmed to include or exclude segments of the network similar to the Segment Insertion Bit (SIB) in IJTAG networks.

As will be demonstrated in the upcoming sections, the SSN bus width is selected based on chip-level pin availability and is independent of the number and logic size of the scanned cores, and the number of channels needed by the EDT controller(s) in each core. This enables each core to have the same plug-and-play interface and bus width for scan test, allowing SSN to scale efficiently as the design floorplan, number of cores, or the content of the cores change.

The ability to route the bus carrying the data from one core to the next while dynamically controlling which cores are active/inactive/bypassed means one has flexibility in accessing any combination of cores without changing the hardware. Unlike pin-mux architectures, this flexibility does not come at the expense of routing congestion. Additionally, there is no need to try and predict at design time how to group cores that are to be tested concurrently. Whether performing ATPG on groups of cores or retargeting patterns from different cores, the same SSN network can provide access to one core at a time, all cores simultaneously, or anything in-between.
Fig. 2: Streaming scan packets
B. Packets
In SSN terminology, a “packet” usually consists of all the scan data needed for all the active SSH nodes to perform a single internal scan shift operation. A packet should not be confused with the actual SSN physical bus width which could be narrower or wider than a packet. The SSN payload delivered from the tester may be viewed as a continuous stream of packets that may wrap across SSN bus boundaries. To illustrate this concept, consider the example shown in Fig. 2 where two blocks are being tested concurrently. Block A loads/unloads 5 bits per shift cycle of the block (has 5 EDT channels). Block B has 4 channels. For both blocks to perform one shift cycle, 9 bits have to be loaded/unloaded. In conventional scan access methods, this would have required 9 chip-level scan input pins and 9 scan output pins. With SSN, the packet size in this example gets set to 9 bits independent of the SSN 8-bit bus width. 9 bits have to be delivered for each of the 2 blocks to shift once. The first 5 bits of every 9-bit packet are programmed to belong to block A, and the next 4 bits of every packet are programmed to belong to block B. This is all determined and programmed at pattern generation time; it is not hard-coded in the SSN logic. After programming all the SSN nodes using IJTAG, SSN delivers a continuous, repeating stream of 9-bit packets. The allocation of packet bit positions to SSH nodes is the same for all packets and is programmed at setup. As soon as block A extracts 5 bits from the bus, it performs one internal shift operation. Likewise for block B, every time it accumulates 4 bits. The SSH is programmed with the shift count per scan load, so it can identify when to perform shift, and when to perform capture. Capture involves events generated by the SSH such as de-asserting scan_enable, applying capture clocks through an On-chip Clock Controller (OCC) [18], and re-asserting scan_enable in preparation for the next scan operation.

In this example, we have decided to use 9-bit packets although the bus width is 8 bits. The stream of 9-bit packets is simply folded into the 8-bit bus with no bits wasted. The first 9-bit packet occupies the first 8-bit parallel word of the bus, and the first bit of the second word (second tester cycle). The second packet starts immediately after that, occupying the remaining 7 bits of the second parallel word, and the first 2 bits of the following parallel word. While the allocation of bits within a packet to an SSH is invariant, there is no static mapping between a bit of the bus and an EDT channel input/output. The locations of the 9-bit packets within each 8-bit bus word rotate with each packet. Each SSH node keeps track of the location of its data in each packet, including accounting for rotation of the data. The size of each packet must be equal to or greater than the bus width. In exceptional cases where the packet size is less than the physical bus width, the bus is re-programmed to reduce its active width such that it does not exceed the number of bits in a packet.

Typically, the same time slots of the packet that carry scan-in data to an SSH node also carry scan-out data from that node. (Multiple identical cores may be handled differently as explained later.) As block A reads the first 5 bits of every packet, it replaces them with 5 bits scanned out (with slight latency).

Any number of internal cores and their channels can be controlled with an SSN bus that is as narrow as one bit. This is because the packets can be as wide as they need to be, and can occupy as many bus words as needed. The internal channel requirements (9 bits in this example) are decoupled from the available scan pins at the chip level (8 × 2 pins for scan in this case). If the packet is wider than the bus and occupies multiple bus words, the cores shift less often than once every bus shift cycle but it will be possible to drive all the cores needed. In this example with 9-bit packets and an 8-bit bus, the blocks shift approximately every bus/tester clock cycle. Occasionally, a block may omit shifting in a given cycle because it has to wait to acquire all the bits it needs for one shift cycle. If the bus is 1 bit wide instead of 8 bits wide, it takes 9 tester cycles to scan in each packet. So the internal shift rate is 1/9th of the external shift rate, but it is still possible to drive all 9 internal channels from the 1-bit bus. In fact, the bus width can be scaled down dynamically at pattern generation time. When driving multiple cores concurrently such that the packet spans multiple bus widths, and the internal shift frequency is slower than the external frequency as a result, this presents an opportunity to deliver the data more quickly without exceeding the constraints on the internal core shift frequencies. It is common in SSN implementations to cap the core-internal shift frequency at 100 MHz yet run a faster/narrow bus at 400 MHz.

C. The Streaming Scan Host (SSH) Node
Fig. 3 shows a high-level view of the SSH. In addition to its aforementioned functionality, other characteristics to highlight are:

1. If a core with an SSH is not under test in a given mode, the SSH may have to continue passing data through, being part of the network, but does not have to deliver scan data to its EDT. In this case, the SSH is said to be disabled. The data passes from the bus input register directly to the bus output register, such that the SSH acts as two pipeline stages within the network.
2. If a core is to be powered off when not under test such that the data cannot flow through the SSN segment within it, the datapath can be designed such that the segment going through the powered-off region is muxed out.
3. Because the packet data may rotate within the bus and span multiple parallel words, the SSH has shifters and registers to re-align and collect the data.
4. To test the SSH and the rest of the SSN network before they are used for scan test, the SSH can be placed into loopback mode. In this mode, the scan data normally going to EDT is directly fed back to the scan data normally unloaded from EDT, as shown in the figure.
5. The node is small in size. It is usually smaller than an EDT controller.


Fig. 3: Streaming Scan Host (SSH) node
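
The folding of packets onto the bus described in Section III.B can be illustrated with a small model. The sketch below is not the SSH hardware or any tool API; the function name `fold_packets` and its tuple layout are hypothetical, and it only reproduces the arithmetic of the Fig. 2 example (9-bit packets streamed over an 8-bit bus, with bits 0 to 4 of each packet owned by block A and bits 5 to 8 by block B).

```python
# Illustrative model only: names and data layout are assumptions, not SSN APIs.

def fold_packets(packet_bits, bus_width, num_packets):
    """For each packet, list the (bus_word, first_bit, bit_count) segments it
    occupies when packets are streamed back to back over the bus."""
    layout = []
    for p in range(num_packets):
        start = p * packet_bits        # absolute position in the bit stream
        end = start + packet_bits
        segments = []
        while start < end:
            word, offset = divmod(start, bus_width)
            take = min(bus_width - offset, end - start)
            segments.append((word, offset, take))
            start += take
        layout.append(segments)
    return layout


# Fig. 2 example: 9-bit packets on an 8-bit bus.
layout = fold_packets(packet_bits=9, bus_width=8, num_packets=3)
for packet, segments in enumerate(layout):
    print(packet, segments)
# 0 [(0, 0, 8), (1, 0, 1)]
# 1 [(1, 1, 7), (2, 0, 2)]
# 2 [(2, 2, 6), (3, 0, 3)]
```

The first packet fills bus word 0 and one bit of word 1; the second takes the remaining 7 bits of word 1 and 2 bits of word 2, which is the rotation each SSH has to track.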
IV. MANAGING CLOCK SKEW & BUS WIDTH
To maximize SSN’s throughput, it is desired to run the bus at higher frequencies than shift frequencies of the cores. It is possible to implement a 400 MHz SSN bus. It is, however, often unrealistic to balance the SSN clock throughout a large SoC. The SSN clock may be balanced within each core or groups of cores, but there may be clock skew between those regions that must not be allowed to degrade the shift frequency. This is addressed using a Bus Frequency Divider (BFD)/Bus Frequency Multiplier (BFM) pair, as shown in Fig. 4.

Fig. 4: Managing clock skew across CTS regions using BFD/BFM

The pair acts as a deskew FIFO. By temporarily converting a fast narrow bus into a slow wide bus when crossing Clock Tree Synthesis (CTS) regions, a larger amount of clock skew can be tolerated without impacting the shift speed or throughput. The FIFO logically acts like pipeline stages in the SSN datapath. Splitting the FIFO into 2 discrete components allows the BFD to be placed in the transmitting region and the BFM in the receiving region, with each component driven by the local SSN clock in its region.

The BFD and BFM nodes may additionally be used to reduce the bus width distributed around the chip and reduce the SSN area. Although an SSN bus that operates at 400 MHz can be easily implemented, it is often not possible to shift data through the chip-level pins at more than 200 MHz. Assume that the SoC has enough pins to implement 64 scan inputs and 64 scan outputs. One option would be to implement a 64-bit bus throughout the chip and operate it at 200 MHz. Alternatively, the data can be scanned into the chip through 64 pins at 200 MHz and a BFM added between the scan inputs and the first SSH to convert this input stream to a 32-bit, 400 MHz bus. This 32-bit bus is then used across the chip, connecting SSH nodes with 32-bit buses. Then before exiting the chip, a BFD node is added to convert the SSN output bus back to a 200 MHz 64-bit bus driving the output pins.
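
The width/frequency trade described above is simple bandwidth arithmetic, sketched below under idealized assumptions (no pipeline latency, capture cycles, or protocol overhead). The helper names `bus_bandwidth_mbps` and `internal_shift_mhz` are hypothetical, and the 8-bit bus with 32-bit packets is an invented configuration used only to show how a 400 MHz bus can coexist with a 100 MHz internal shift limit.

```python
# Idealized arithmetic only; helper names are illustrative assumptions.

def bus_bandwidth_mbps(width_bits, freq_mhz):
    """Raw bus throughput in Mbit/s."""
    return width_bits * freq_mhz


def internal_shift_mhz(bus_width, bus_mhz, packet_bits):
    """Average core shift rate when one packet yields one internal shift.
    Assumes packet_bits >= bus_width, as the text requires."""
    return bus_width * bus_mhz / packet_bits


# A BFM converting 64 pins at 200 MHz into a 32-bit, 400 MHz bus keeps the
# throughput unchanged while halving the routed bus width.
print(bus_bandwidth_mbps(64, 200), bus_bandwidth_mbps(32, 400))  # 12800 12800

# Hypothetical case: an 8-bit bus at 400 MHz carrying 32-bit packets shifts
# the cores at 100 MHz.
print(internal_shift_mhz(8, 400, 32))  # 100.0

# A 1-bit bus with the 9-bit packets of Fig. 2 shifts the cores at 1/9th of
# the bus rate.
print(internal_shift_mhz(1, 1, 9))
```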
Fig. 5: Retargeting with aligned vs. independent capture
SSN has two features to reduce test time and test data volume in such cases. First, it supports independent shift/capture for different retargeted cores. This is possible because signals such as scan_enable and the shift clock are generated locally by each SSH. Second, it reduces the shift length/pattern count imbalances between cores by programmatically varying the bandwidth used for each core. If a core requires many fewer overall shift cycles across a pattern set than other cores, it can be sent fewer bits per packet. For example, a core with 4 channels does not need to be allocated 4 bits per packet. It can be throttled down and sent only 1 bit per packet such that it shifts internally every four packets instead of every packet. The result is that the total number of packets remains the same, but the size of the packets is reduced, speeding up the overall test time. The next section introduces further test optimization possible in the presence of multiple identical core instances.

Note that an additional benefit of independent capture is power. It can mitigate IR drop since cores under test do not all shift and capture at the same time. In addition to scan access, this may further facilitate testing a large number of cores concurrently.
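
The effect of throttling a core's bandwidth can be estimated with a rough model. The sketch below is an assumption-laden simplification: it counts only shift data (no capture cycles or setup), and the core names, channel counts, and shift-cycle totals are invented for illustration. The function `stream_cost` is hypothetical and not part of any SSN tool.

```python
from math import ceil

# Simplified cost model; all numbers and names below are illustrative.

def stream_cost(cores, bits_per_packet):
    """cores: name -> (channels, total_shift_cycles).
    bits_per_packet: name -> bits allocated to that core in every packet.
    Returns (packet_size, packets_needed, total_bits_streamed)."""
    packet_size = sum(bits_per_packet.values())
    packets = max(
        ceil(shifts * channels / bits_per_packet[name])
        for name, (channels, shifts) in cores.items()
    )
    return packet_size, packets, packet_size * packets


# Core X: 5 channels, 4000 shift cycles. Core Y: 4 channels, 1000 shift cycles.
cores = {"X": (5, 4000), "Y": (4, 1000)}

# Full allocation: Y finishes early and its 4 slots are padding afterwards.
print(stream_cost(cores, {"X": 5, "Y": 4}))  # (9, 4000, 36000)

# Y throttled to 1 bit per packet: it shifts once every four packets.
print(stream_cost(cores, {"X": 5, "Y": 1}))  # (6, 4000, 24000)
```

In this made-up case the packet count stays at 4000 while the packet shrinks from 9 to 6 bits, so a third less data crosses the bus.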
VI. TESTING OF MULTIPLE IDENTICAL CORES
Many SoCs that achieve high throughput by parallelizing processing contain a number of cores that are replicated multiple times. CPU chips often include multiple processor cores. AI and GPU chips in particular can have some cores replicated well over 100 times. As previously explained, in pin-mux scan access architectures, the scan inputs may be broadcast to identical core instances, but the scan outputs are usually observed independently to ensure lossless mapping and observability for diagnosis. This results in a non-scalable solution where increasing the number of core instances requires additional chip pins for concurrent test.

A. On-Chip Compare
SSN provides a scalable method for testing any number of identical core instances in near constant test time, independent of the number of available chip-level pins, even in the presence of tile-based design constraints explained earlier. Instead of shifting in the stimuli only and unloading the expected response for comparison on the tester, the stimuli, expected responses, and compare/no-compare mask data are scanned in within each packet so that each core can perform its own on-chip comparison. Note that the data arrives at each core instance at a slightly different time since the SSN bus data streams through the nodes. With each internal shift cycle, the channel data transferred from EDT to the SSH is compared, and a pass/fail status bit per channel per shift cycle is computed. What is ultimately observed on the tester is the following:

1. Per-shift status bits: This is the aforementioned pass/fail bit for a given channel in a given internal shift cycle. This status bit is allocated a timeslot in the packet for unloading. To provide a scalable solution for any number of identical core instances, the same status bit in the packet usually accumulates the pass/fail status from a given channel/shift cycle across all identical core instances (or a subset of them). If this bit indicates a fail, one can identify which core-level bit had a failure but not necessarily which core instance(s) this failure originated from. It is still possible to identify failing cores and per-core fail information for diagnosis as explained later.

2. Sticky status bits: One sticky bit per SSH indicates if there was a failure in scan observed by this SSH in any cycle/channel of the pattern set. This bit per SSH is unloaded through IJTAG at the end of a pattern set to quickly identify failing cores (for designs with redundant cores), and to aid in diagnosis. Note that where finer granularity than 1 fail bit per SSH is needed, it is possible to generate a sticky bit per channel output connected to the SSH.

Fig. 6 shows an example of data encoding into packets when using on-chip compare. Six identical core instances are used in this example, each driving an EDT controller that has 7 input channels and 2 output channels. Each packet has enough scan data for the cores to perform one internal shift operation. First, 7 bits per packet corresponding to the 7 input channels (shown in blue) are allocated. Those stimuli are broadcast (in time) to all identical core instances. The expected responses (2 output channels = 2 bits) and mask information (2 output channels = 2 bits) are also shifted in and broadcast (red). Last are the status bits that accumulate the pass/fail information per channel per shift cycle (green). Typically, we would allocate 2 bits corresponding to the 2 output channels. A failure in one of those bits would indicate that the corresponding channel of one of the 6 core instances failed, but we would not know which one. When we accumulate the status information of all 6 cores together, they are considered to be placed into 1 status group. In this example, we chose to partition the 6 cores into group “a” and group “b”. We only accumulate the fail information within each group. That is why we have 4 green bits: 2 output channels × 2 groups. The number of groups is programmable at pattern retargeting time. Increasing the number of groups beyond 1 sacrifices test efficiency for improved observability as will be explained in the diagnosis section.
Fig. 6: Packets when using on-chip compare to test multiple identical cores
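The on-chip compare bookkeeping described for Fig. 6 can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name `on_chip_compare`, the data layout, and the mask polarity (1 = compare, 0 = ignore) are hypothetical choices made for illustration and are not taken from the SSN hardware or the Tessent tools.

```python
from typing import Dict, List, Tuple


def on_chip_compare(
    responses: Dict[str, List[List[int]]],  # core -> per-shift list of channel output bits
    expected: List[List[int]],              # per-shift expected bits, broadcast to all cores
    compare_mask: List[List[int]],          # per-shift mask; 1 = compare (assumed polarity)
    groups: Dict[str, str],                 # core -> status group name
) -> Tuple[Dict[str, List[List[int]]], Dict[str, int]]:
    """Return (per-shift status bits per group, sticky fail bit per core)."""
    n_shifts = len(expected)
    n_channels = len(expected[0])
    group_names = sorted(set(groups.values()))
    status = {g: [[0] * n_channels for _ in range(n_shifts)] for g in group_names}
    sticky = {core: 0 for core in responses}

    for core, shifts in responses.items():
        for s in range(n_shifts):
            for ch in range(n_channels):
                fail = compare_mask[s][ch] & (shifts[s][ch] ^ expected[s][ch])
                # OR the fail bit into the packet slot shared by the core's group.
                status[groups[core]][s][ch] |= fail
                # The sticky bit records any failure over the whole pattern set.
                sticky[core] |= fail
    return status, sticky


# Six identical cores with 2 output channels, split into groups "a" and "b".
expected = [[1, 0], [0, 1]]
compare_mask = [[1, 1], [1, 0]]
cores = ["A1", "A2", "A3", "A4", "A5", "A6"]
groups = {c: ("a" if i < 3 else "b") for i, c in enumerate(cores)}
responses = {c: [row[:] for row in expected] for c in cores}
responses["A2"][0][0] = 0  # real failure on channel 0 in shift 0
responses["A5"][1][1] = 0  # mismatch on a masked bit, so it is ignored

status, sticky = on_chip_compare(responses, expected, compare_mask, groups)
```

In this model the per-shift status of group "a" flags channel 0 in shift 0 without saying which of A1-A3 failed, while the sticky bits single out A2.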
When using on-chip compare, the response data cannot replace the stimuli in the packet
because the stimuli have to travel to all other core instances. Separate time slots have to
be allocated for the stimuli, the expected responses and the masks shifted in, as well as
the status bits unloaded. In the common case of 1 status group, the number of bits per
packet is usually #input_channels + 3 × #output_channels. Because each output channel
requires at least 3 bits of data in the packet (expected value, mask, and pass/fail status),
using an asymmetric EDT with fewer output channels than input channels improves test time
and test data volume in conjunction with on-chip compare.
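The packet size rule for on-chip compare can be written as a one-line function. The sketch below is illustrative: `packet_bits` is a hypothetical name, and the extension to more than one status group is inferred from the Fig. 6 example (2 output channels × 2 groups = 4 status bits) rather than from a formula stated in the text.

```python
def packet_bits(input_channels: int, output_channels: int, status_groups: int = 1) -> int:
    """Bits per packet with on-chip compare.

    Stimuli: one bit per input channel.
    Expected values and masks: one bit each per output channel.
    Status: one bit per output channel per status group.
    """
    if min(input_channels, output_channels, status_groups) < 1:
        raise ValueError("channel and group counts must be positive")
    return input_channels + 2 * output_channels + output_channels * status_groups


one_group = packet_bits(7, 2)                    # 7 + 3 * 2
two_groups = packet_bits(7, 2, status_groups=2)  # 7 + 2 + 2 + 4, as in Fig. 6
symmetric = packet_bits(7, 7)                    # same inputs, symmetric EDT
print(one_group, two_groups, symmetric)
```

For the 7-input, 2-output EDT this gives 13 bits with one status group and 15 bits with two, against 28 bits if the EDT had 7 output channels, which is why an asymmetric EDT pairs well with on-chip compare.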

B. Diagnosis Flow
Failure data is needed even during high-volume manufacturing for on-tester identification of
failing cores to support partial good die strategies (redundant logic cores), and for
diagnosis-driven yield analysis. When not using on-chip compare, every channel output bit in
a core maps to a single bit on the top-level SSN bus outputs that are unloaded and compared
on the tester. Logic diagnosis is straightforward in that case: perform reverse mapping of
chip-level failures through the SSN network to the EDT channel outputs, then perform
conventional compressed pattern diagnosis (at the core level in case of retargeted patterns).

Diagnosis in the presence of on-chip compare is more involved and may require re-application
of the pattern set to collect all the data needed. Consider the case where all identical
core instances are placed in a single status group such that their per-cycle pass/fail
information is aggregated into the same packet timeslots. If any of those bits indicate
failures, we have the cumulative per-pin per-cycle fail data but may not know which core(s)
the failures came from. The sticky status bits unloaded at the end of the test set via IJTAG
indicate which core(s) failed at least once. If only one core in this group fails, then we
know the per-cycle pass/fail data came from this core alone and therefore we have all the
information needed for diagnosis. However, if multiple cores fail, we have to separately
test and observe each of those failing cores to get their individual fail data. If two cores
fail, for example, then the same test set is re-applied twice, with minor patching applied.
In each case, static bits in the setup of the cores are patched to control which cores are
allowed to contribute to the cumulative pass/fail results. Note there is no need to store
separate patterns for diagnosis on the tester.

If identical core instances are split into multiple groups, this slightly increases the test
time, but decreases the probability of resorting to multiple test applications for
collecting diagnosis data. In the example shown in Fig. 6, the six cores are split into two
groups. If cores A1 and A4 are found to have failed, there is no need for test
re-application because cores A1-A3 accumulate their status bits separately from cores A4-A6.
However, if cores A1 and A3 fail, test re-application with patching is needed to acquire the
individual fail data. In the extreme case, you may choose to assign each core instance to
its own group so that each core is observed individually. This mode of operation may be
better suited for silicon debug than high-volume manufacturing.
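The re-application decision in the diagnosis flow can be sketched as a small helper. The name `cores_needing_reapplication` and the dictionary layout are hypothetical; the logic only encodes the rule stated in the text, that a status group needs re-application when more than one of its cores is reported as failing by the sticky bits.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def cores_needing_reapplication(
    groups: Dict[str, str], failing_cores: Iterable[str]
) -> Dict[str, List[str]]:
    """Map each status group with more than one failing core to those cores.

    failing_cores is what the sticky status bits report after the pattern set.
    A group with exactly one failing core needs no re-application, because its
    accumulated per-cycle fail data can only have come from that core.
    """
    by_group = defaultdict(list)
    for core in failing_cores:
        by_group[groups[core]].append(core)
    return {g: sorted(cs) for g, cs in by_group.items() if len(cs) > 1}


# Grouping from the Fig. 6 example: A1-A3 in group "a", A4-A6 in group "b".
groups = {"A1": "a", "A2": "a", "A3": "a", "A4": "b", "A5": "b", "A6": "b"}
print(cores_needing_reapplication(groups, ["A1", "A4"]))  # {}
print(cores_needing_reapplication(groups, ["A1", "A3"]))  # {'a': ['A1', 'A3']}
```

Each core listed in the result would be observed alone in one patched re-application of the same pattern set.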
VII. ALTERNATE INTERFACES
A. Streaming Tests through JTAG/IJTAG Interfaces
It is possible not to use the SSH parallel bus at all, and instead use the
JTAG(chip)/IJTAG(core) interface for both setup and subsequent streaming of the test data.
There are two cases where this may be desirable:

1. As a survivability option. If during silicon bring-up, the bus is inaccessible due to a
silicon defect, this provides an alternate method of accessing any SSH or group of SSHs.

2. If a low pin count device only has a JTAG interface and no other digital pins, it is
possible to implement SSN without the parallel bus and rely on the JTAG/IJTAG interfaces for
streaming the test data.

B. Compatibility with Test Using SerDes (IEEE 1149.10)
IEEE 1149.10 [19] provides for re-using high-speed I/O (HSIO) SerDes lanes to enable very
high bandwidth transfer of test data to/from a chip. The Packet Encoder/Decoder and
Distribution Architecture (PEDDA) IP described in the standard results in deserialized data
presented on a parallel bus. SSN’s synchronous parallel bus is ideally suited to interface
with the PEDDA. SSN can handle on-chip distribution of test data and internal generation of
test signals. As the SSN network can operate internally at high frequencies (at least 400
MHz), it is capable of testing many cores concurrently and quickly when coupled with this
high-bandwidth chip-level interface.

VIII. PRACTICAL EXPERIENCE USING SSN
In collaboration with Mentor, Intel has been evaluating the use of SSN. SSN is capable of
scaling to large SoCs and server class designs that require support for large partition
counts and identical core testing. Previous generations of Intel SoCs have utilized an
internally developed high bandwidth packetized fabric, STF [10][11], to address these needs.
STF was developed to allow this scalability at much lower overhead than the traditional pin
muxed scan solutions. In evaluating SSN, the goals were to assess whether moving to SSN
could further improve test time and bandwidth utilization over STF, as well as reduce design
effort through the use of a vendor supported platform.

A. Comparison of Packet Encoding Overhead
Both STF and SSN can scale to support any number of partitions; however, the approach to
accomplish this differs between the two systems. The STF network relies on explicit
addressing information stored within each packet. This is accomplished by having a short
address ID tag contained within each packet, typically 4 bits in size. In addition, STF
requires an opcode field, 4 bits in size, as well as input and output valid bits. This
results in an overhead of 10 bits being added to each data packet. In contrast, under SSN
the destinations and interleave settings are statically programmed during the test setup,
allowing the entire bus bandwidth to be used for data. For a typical bus size of 32 bits,
STF has a 31% higher overhead than SSN. This is depicted in Fig. 7.
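The 31% packet encoding overhead quoted for STF follows from the field sizes given in subsection A. The short calculation below is a sketch: the function name is hypothetical, and it assumes the input and output valid bits are one bit each, which is what makes the total come to the stated 10 bits.

```python
def stf_packet_overhead(
    data_bits: int = 32,
    address_bits: int = 4,
    opcode_bits: int = 4,
    valid_bits: int = 2,  # assumed: one input valid bit plus one output valid bit
) -> float:
    """Per-packet control bits as a fraction of the data field."""
    return (address_bits + opcode_bits + valid_bits) / data_bits


overhead = stf_packet_overhead()
print(f"STF control overhead on a 32-bit data field: {overhead:.0%}")
```

SSN carries no per-packet control fields, so the corresponding figure for SSN is zero and the whole difference is attributable to STF's in-packet addressing.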

Fig. 7: STF packet overhead vs. SSN
Fig. 8: STF packing of narrow EDT data
B. Comparison of Data Field Utilization
STF utilizes a fixed data field size of 32 bits. To accommodate EDTs with a smaller number
of channels, the STF data word is divided up into fields, and the data for multiple shift
cycles is packed into the 32-bit word to achieve better utilization. However, when the EDT
channel size does not divide evenly into the 32-bit word, this reduces efficiency as
illustrated in Fig. 8. In this example, with 9-bit EDTs, we can pack 3 shift cycles of data
into the data word with 5 bits of unused data, resulting in an overhead of nearly 16%. In
the worst case of a 17-bit EDT, 47% of the data bandwidth is wasted. Thus, STF data field
utilization can range from 53% to 100% depending on how the EDT data packs into the 32-bit
word. Because SSN utilizes data rotation, any leftover bits within the bus become part of
the next packet, always achieving 100% utilization of the bus data word.

C. Interleaving, Vector Count and Chain Length Mismatch Handling
Both STF and SSN scale to any number of partitions; however, their approaches differ in how
they handle the interleaving of partitions. In the example shown in Fig. 9, a set of
partition patterns that have differing numbers of vectors are to be merged. Typically, STF
will have a specified interleave factor, in this case 4, to which the patterns are repacked
optimally into these 4 groups. These groups are then round-robin interleaved to create the
final pattern set, as shown in the figure.

SSN’s handling of interleaving achieves similar efficiency for vector count mismatch as STF,
but SSN can also partially mitigate chain length mismatch between partitions, which STF
cannot. STF requires all partitions in the pattern set to be padded to the same shift
length, resulting in overhead. This is depicted in Fig. 10. In our current designs, we allow
up to 20% chain length mismatch between partitions, so it is theoretically possible SSN
could have up to 20% better packing efficiency in the final pattern.
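The STF data field utilization figures quoted in subsection B can be reproduced with integer arithmetic. This is an illustrative sketch with a hypothetical function name; it assumes only what the text states, namely that whole shift cycles of EDT data are packed into a fixed 32-bit data word.

```python
def stf_field_utilization(edt_channels: int, word_bits: int = 32) -> float:
    """Fraction of a fixed-size data word that carries EDT data when only
    whole shift cycles are packed into the word."""
    if not 1 <= edt_channels <= word_bits:
        raise ValueError("EDT width must fit in the data word")
    shifts_per_word = word_bits // edt_channels
    return shifts_per_word * edt_channels / word_bits


for width in (8, 9, 16, 17):
    used = stf_field_utilization(width)
    print(f"{width:2d}-bit EDT: {used:.1%} used, {1 - used:.1%} unused")

# Width with the lowest utilization over all widths that fit in the word.
worst = min(range(1, 33), key=stf_field_utilization)
```

A 9-bit EDT leaves 5 of 32 bits unused, and the search over all widths confirms that 17 bits is the worst case, with 17 of 32 bits used.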

Fig. 9: STF pattern interleaving
Fig. 10: Chain length mismatch padding in STF
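The cost of padding every partition to the same shift length, as STF requires, can be estimated with a few lines of arithmetic. The sketch below uses a hypothetical function name and made-up chain lengths; it models only the STF padding and makes no claim about how much of that padding SSN recovers, since the text says SSN mitigates the mismatch only partially.

```python
from typing import Dict


def stf_padding_overhead(chain_lengths: Dict[str, int]) -> float:
    """Fraction of shift slots that are padding when every partition is
    padded to the longest shift length in the pattern set."""
    longest = max(chain_lengths.values())
    padded_total = longest * len(chain_lengths)
    useful_total = sum(chain_lengths.values())
    return (padded_total - useful_total) / padded_total


# Hypothetical partitions; the shortest chain is 20% shorter than the longest.
lengths = {"p1": 500, "p2": 450, "p3": 400}
padding = stf_padding_overhead(lengths)
print(f"{padding:.1%} of shifted slots are padding")
```

With these example lengths, one tenth of the shifted data is padding; the 20% figure in the text is the upper bound reached only when the other partitions all sit at the maximum allowed mismatch.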
D. Fabric Test Setup
Since the STF fabric is configured in-band using packets, the pattern overhead for network
setup is very small, approximately 10 cycles per active endpoint. SSN utilizes IJTAG to
program the network with approximately 160 bits of state per active endpoint, plus IJTAG
network overhead. Though this could result in substantially higher setup overhead for SSN,
the cost of the setup is amortized across the entire scan vector set. For large pattern
sets, network setup should not present a significant overhead of more than 1% for SSN.

E. On-Die Compare
STF and SSN provide comparable functionality for identical core testing using on-die
compare. Both systems require the input data stream to include the input data, mask data and
expected response, causing a 3X growth of the data volume, but allow testing of any number
of cores in constant time. SSN has a possible advantage in the handling of an asymmetric
number of input and output channels. In this case, SSN can more tightly pack the expect and
mask fields to match the smaller output channel case, possibly realizing less than 3X data
growth. STF, however, allocates bandwidth assuming symmetric usage and is always 3X data
volume. For the purpose of this analysis, we assumed that on-die compare would be neutral
between the two systems.

F. Total Estimated Overhead Comparison
In summary, STF pays a high overhead in packet encoding, data field utilization and handling
of chain length mismatch. Network setup overhead is higher in SSN, but amortized across the
number of scan vectors resulting in a negligible difference. Overall, this can lead to over
2X reduction in data volume under SSN vs. STF, as summarized in Table I.

G. SSN Pilot Study
SSN offers a compelling theoretical advantage over the current STF fabric in use. However,
we wanted to measure results on actual partition data to verify. Further, the study looked
at other aspects, such as design effort and run times. To perform the study, a simple test
design was created consisting of a single interface partition, partition1, and four
identical copies of a partition, partitions 2-5, as shown in Fig. 11.
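The data growth discussed in subsection E (On-Die Compare) can be made concrete with a small calculation. This is a sketch under an explicit assumption: the expect and mask fields are each sized to the number of output channels, as the text suggests SSN can do. The function name and the channel counts are illustrative only.

```python
def on_die_compare_growth(input_channels: int, output_channels: int) -> float:
    """Scan-in data volume with on-die compare relative to stimuli alone,
    assuming expect and mask fields are sized to the output channel count."""
    if input_channels < 1 or output_channels < 1:
        raise ValueError("channel counts must be positive")
    return (input_channels + 2 * output_channels) / input_channels


symmetric = on_die_compare_growth(9, 9)   # equal input and output channel counts
asymmetric = on_die_compare_growth(7, 2)  # channel counts of the Fig. 6 example
print(f"symmetric: {symmetric:.2f}X, asymmetric: {asymmetric:.2f}X")
```

With equal channel counts the growth is exactly 3X, matching the figure given for both systems, while the 7-input, 2-output case grows by well under 2X when the fields are packed tightly.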

Fig. 11: Pilot network topology
Table I. Theoretical Comparison of STF vs. SSN Data Volume

An SSN bus data width of 32 bits was chosen to match STF to allow direct
comparison. ATPG patterns were created targeting partitions 2-5, each having 9 EDT
channels for a total of 36 bits of channel data. By having a total channel data set size
of >32 bits, SSN will perform data rotation and create a more meaningful comparison.
The 9-bit EDT channel size represents a typical data field packing inefficiency for STF.
Multiple ATPG runs were conducted to analyze the overhead at 10, 500, and 10,000
vectors. The results from these runs are summarized in Table II, comparing STF, SSN,
and a legacy pin mux solution.
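Because the pilot's 36 bits of channel data exceed the 32-bit bus, each packet starts at a different bit offset from the one before it. The sketch below is a simplified model, assuming packets are streamed back to back with no gaps; the function names are hypothetical and the model ignores any hardware detail beyond that.

```python
from math import gcd
from typing import Tuple


def packet_start(packet_index: int, packet_bits: int = 36, bus_bits: int = 32) -> Tuple[int, int]:
    """(bus cycle, bit offset) at which a packet begins in a gap-free stream."""
    return divmod(packet_index * packet_bits, bus_bits)


def rotation_period(packet_bits: int = 36, bus_bits: int = 32) -> Tuple[int, int]:
    """(packets, bus cycles) after which the packet alignment repeats."""
    span = packet_bits * bus_bits // gcd(packet_bits, bus_bits)  # least common multiple
    return span // packet_bits, span // bus_bits


for k in range(4):
    cycle, offset = packet_start(k)
    print(f"packet {k}: starts in bus cycle {cycle} at bit {offset}")

packets, cycles = rotation_period()
print(f"alignment repeats every {packets} packets ({cycles} bus cycles)")
```

In this model every bus bit carries data: 8 packets of 36 bits fill exactly 9 bus cycles of 32 bits, which is the sense in which rotation gives full utilization of the bus data word.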
Table II. Pilot test time and data volume results
For this testcase, SSN shows a clear advantage over STF, with STF having 19% higher test
time and 57% more data volume than SSN. SSN test setup has higher overhead than STF;
however, when amortized across the 10,000 vectors in the run set, this impact is in the
expected range of 1.2%. This testcase used identical partitions and hence did not exercise
vector count mismatch between partitions nor chain length mismatch, which would further
favor SSN. For comparison purposes, a legacy pin muxed solution is included showing a large
overhead relative to SSN. Since the pin muxed solution cannot transport 36 bits of channel
data in a single run, it must be split into 2 runs, nearly doubling test time and data
volume.
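The pilot results are quoted in two directions: STF is 19% and 57% higher than SSN here, while the pilot study summary reports SSN as 16% and 36% lower than STF. A short conversion shows the two sets of figures agree; the function name is hypothetical.

```python
def reduction_from_increase(increase: float) -> float:
    """If X is `increase` higher than Y, return how much lower Y is than X."""
    if increase <= -1:
        raise ValueError("increase must be greater than -100%")
    return 1 - 1 / (1 + increase)


test_time_reduction = reduction_from_increase(0.19)
data_volume_reduction = reduction_from_increase(0.57)
print(f"test time reduction:   {test_time_reduction:.0%}")
print(f"data volume reduction: {data_volume_reduction:.0%}")
```

The results round to 16% and 36%, so both sets of percentages describe the same measurements.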

In addition to data volume and test time metrics, we also collected information on design
efficiency between the internal STF toolset and the Mentor Tessent™ tool flows for SSN. This
comparison is summarized in Table III.

As the table shows, SSN and the Tessent flows provide significant productivity improvement
over our previous flow built from multiple tools, enabling rapid integration into the design
and fast turnaround ATPG runs. The SSN flows do not require ATPG cut points and custom
setups to generate and retarget patterns, resulting in significant savings in pattern
retargeting. Though not in the scope of this analysis, further benefits are expected in gate
level simulation debug productivity.
Table III. Design efficiency comparison between STF and SSN
H. SSN Pilot Study Summary
Analysis of a small test network verified that the theoretical advantages of SSN over our
previous internal STF fabric are achievable, with a significant improvement in both test
time (16% reduction) and test data volume (36% reduction). The data shows that the approach
of static network configuration during test setup is more efficient for large scan data sets
than allocating addressing and opcode information within each packet. In addition, further
benefits were seen in design efficiency for insertion, ATPG setup and pattern retargeting
relative to our previous flows.
IX. CONCLUSION
The SSN technology introduced in this paper solves many of the scan distribution challenges
in complex SoCs. It enables simultaneous testing of any number of cores with few chip-level
pins, and it has multiple features to reduce test time and test data volume. It can test any
number of identical core instances in near constant time, minimizes padding in the presence
of cores with mismatched pattern counts and/or scan chain lengths, and enables fast
streaming of data to/from and throughout the chip. It simplifies design planning and
implementation, and is especially well suited for tile-based designs. Intel evaluated SSN
and compared it to STF as well as to conventional pin-muxed access. SSN was found to reduce
the test data volume by 36% and 43%, respectively. It reduced test cycles by 16% and 43%,
respectively. Steps in the design and retargeting flow were between 10x and 20x faster with
SSN compared to STF.
ACKNOWLEDGMENT
The authors wish to thank other contributors to the development of the SSN technology: Yahya
Zaidan, Pawel Galas, Szymon Walkowiak, Paul Reuter, and Tony Fryars. We would also like to
thank the contributors to the SSN pilot study: Sirish Chittoor, Yonsang Cho, Luis Briceño
Guerrero, Kavita Bansal, Kelsey Byers, and Ian Nuber. Finally, many thanks to all our other
partners who also provided invaluable feedback during the development, validation, and
deployment of SSN.

References
[1] Standard Testability Method for Embedded Core-based Integrated Circuits, IEEE Standard 1500, 2005.
[2] J. Remmers et al., “Hierarchical DFT methodology - a case study,” IEEE International Test Conference, 2004.
[3] D. Trock et al., “Recursive Hierarchical DFT Methodology with Multilevel Clock Control and Scan Pattern Retargeting,” IEEE Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016.
[4] J. Rajski et al., “Embedded Deterministic Test,” IEEE Trans. on CAD, vol. 23, May 2004, pp. 776-792.
[5] P. Wohl, J.A. Waicukauski, J.E. Colburn, M. Sonawane, “Achieving extreme scan compression for SoC Designs,” IEEE International Test Conference, 2014.
[6] C. Barnhart et al., “OPMISR: The foundation for compressed ATPG vectors,” IEEE International Test Conference, 2001.
[7] G. Giles et al., “Test Access Mechanism for Multiple Identical Cores,” IEEE International Test Conference, 2008.
[8] Y. Dong et al., “Maximizing Scan Pin and Bandwidth Utilization with a Scan Routing Fabric,” IEEE International Test Conference, 2017.
[9] J. Janicki et al., “EDT bandwidth management - Practical scenarios for large SoC designs,” IEEE International Test Conference, 2013.
[10] G. Colon-Bonet, “High Bandwidth DFT Fabric Requirements for Server and Microserver SoCs,” IEEE International Test Conference, 2015.
[11] G. Colon-Bonet, “High Bandwidth Packetized DFT Fabric for Server SoCs,” IEEE International System-on-Chip Conference, 2016.
[12] A. Sanghani et al., “Design and Implementation of A Time-Division Multiplexing Scan Architecture Using Serializer and Deserializer in GPU Chips,” IEEE VLSI Test Symposium, 2011.
[13] M. Sonawane et al., “Flexible Scan Interface Architecture for Complex SoCs,” IEEE VLSI Test Symposium, 2016.
[14] P. Wohl et al., “Achieving Extreme Scan Compression for SoC Designs,” IEEE International Test Conference, 2014.
[15] Standard for Access and Control of Instrumentation Embedded within a Semiconductor Device, IEEE Standard 1687, 2014.
[16] J. Durupt et al., “IJTAG supported 3D DFT using chiplet-footprints for testing multi-chips active interposer system,” IEEE European Test Symposium, 2016.
[17] M. Lin et al., “A 7nm 4GHz Arm®-core-based CoWoS® Chiplet Design for High Performance Computing,” Symposium on VLSI Circuits Digest of Technical Papers, 2019.
[18] T. Waayers et al., “Clock control architecture and ATPG for reducing pattern count in SoC designs with multiple clock domains,” IEEE International Test Conference, 2010.
[19] Standard for High-Speed Test Access Port and On-Chip Distribution Architecture, IEEE Standard 1149.10, 2017.

Siemens Digital Industries Software

Offices

Headquarters
Granite Park One 5800 Granite Parkway
Suite 600
Plano, TX 75024 USA
+1 972 987 3000

Americas
Granite Park One 5800 Granite Parkway
Suite 600
Plano, TX 75024 USA
+1 314 264 8499

Europe
Stephenson House
Sir William Siemens Square Frimley, Camberley
Surrey, GU16 8QD
+44 (0) 1276 413200

Asia-Pacific
Unit 901-902, 9/F
Tower B, Manulife Financial Centre
223-231 Wai Yip Street, Kwun Tong
Kowloon, Hong Kong
+852 2230 3333

About us
Siemens Digital Industries Software helps organizations of all sizes digitally transform
using software, hardware and services from the Siemens Xcelerator business platform.
Siemens' software and the comprehensive digital twin enable companies to optimize their
design, engineering and manufacturing processes to turn today's ideas into the sustainable
products of the future. From chips to entire systems, from product to process, across all
industries. Siemens Digital Industries Software – Accelerating transformation.

About the author
Jean-François Côté
Jean-François Côté is a Technical Fellow of Siemens EDA, and main inventor of the described
3D solution. His interests are design automation and chip design. He holds a Master in
Electrical Engineering from McGill University, Montreal, Canada.

Mark Kassab
Wojciech Janiszewski
Ricardo Rodrigues
Reinhard Meier
Bartosz Kaczmarek
Peter Orlando
Geir Eide
Janusz Rajski
Glenn Colon-Bonet
Naveen Mysore
Ya Yin
Pankaj Pant
10/2023 84615-C2
