Streaming Scan Network
Executive summary
Originally presented at the 2020 International Test Conference by Siemens and Intel
authors, this paper describes the Tessent Streaming Scan Network and demonstrates
how this packetized data network optimizes test time and implementation
productivity for today’s complex SoCs.
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained
for all other uses, in any current or future media, including reprinting/republishing this
material for advertising or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of this work in
other works.
siemens.com/software
White Paper – Streaming Scan Network
Contents
Abstract
I. INTRODUCTION
IX. CONCLUSION
ACKNOWLEDGMENT
References
Abstract
I. INTRODUCTION
With some Integrated Circuits (ICs) growing to billions of transistors, it is virtually
impossible to design, implement, and test them flat. A System-on-a-Chip (SoC) is an IC that is
comprised of multiple components, referred to as cores. Each core is typically designed,
implemented, and validated independently before being integrated with others. As design
complexity has grown, so have the levels of core hierarchy. It is not uncommon to have
lower-level cores integrated into subsystems, which are integrated into chiplets that are then
assembled into a chip.

As design is done hierarchically to manage complexity, so is DFT. In hierarchical test
methodologies [1][2][3], scan chains and compression logic [4][5][6] are inserted into every
core. The cores are wrapped with scan and interface control logic. Test patterns targeting most
faults in a core are generated and validated at the core level. Subsequently, the patterns from
multiple wrapped cores are retargeted or mapped to the top level. They are often merged with
patterns retargeted from other cores that are tested at the same time if scan access and design
constraints permit. In addition to retargeting patterns generated for testing the wrapped logic
within each core, test pattern generation is also run at the next level up to test peripheral
logic outside wrapper chains as well as logic at that higher level of hierarchy. If this parent
level is not the chip level, then those patterns will also have to be retargeted to the chip
level. The same test pattern generation and retargeting methodology is applied recursively
regardless of the levels of hierarchy, but the planning and implementation of DFT get more
complex with additional levels of hierarchy, especially when using conventional scan access
methods.

The following subsections explain key SoC test challenges inherent with pin-mux scan access,
which is commonly used in the industry and explained in the referenced papers.

A. SoC Test Challenges: Planning and Layout
Traditionally, for a group of cores to be
introduces complexity and limits the reuse of cores. Designing a new chip with more core
instances requires redesigning the cores to account for differences in pipelining and routing
channels.
Each SSH has two external interfaces: An IEEE 1687 [15] IJTAG interface predominantly used for
setup, and a parallel data bus that subsequently transports the payload scan data and connects
one SSH node to the next. The IJTAG network, shown as a 1-bit bus, is used to configure all
nodes in the SSN network prior to the application of a test pattern set. Each node is loaded
with information related to the protocol such as the active bus width, its location in the
series of nodes driven, the number of shift cycles per scan pattern, scan_enable transition
timing information, etc. Following this setup, the entire test pattern set is applied as
packetized scan data that is streamed on the parallel bus shown as an N-bit bus. Because the
protocol of alternating shift/capture operations is very regular and repeatable, each SSH is
pre-loaded with the information needed for its counters and finite state machine to track the
streaming operation. There is no need to send opcode or address information with each packet.
Only the scan payload is streamed, as shown in the next section. As data streams through the
SSH nodes, each node can identify when it needs to read scan_in data from the bus, when it
needs to place scan_out data on the bus, and when it needs to pass along data that is destined
for other nodes. Each SSH controls the local scan operations for the core, including
transitions between load/unload and capture stages, as well as performing individual shift
operations. All scan signals and EDT controls are generated by the SSN local to the core and
the only test signals that cross core boundaries are the SSN parallel bus (N-bit data bus +
clock) and the IJTAG signals. This allows scan timing closure to be completed at the core
level.

SSN supports the abutment of cores in tile-based designs with no routing outside the cores. The
outputs of one core connect to the inputs of the next adjacent core. A chip with SSN usually
has a single datapath (parallel bus) that goes through all cores. Depending on the floorplan
and pad locations, it may be preferable for physical design to implement multiple, physically
independent datapaths (for example, one datapath per chiplet [16][17]). Each datapath is also
configurable and can include muxes that can be programmed to include or exclude segments of
the network similar to the Segment Insertion Bit (SIB) in IJTAG networks.

As will be demonstrated in the upcoming sections, the SSN bus width is selected based on
chip-level pin availability and is independent of the number and logic size of the scanned
packet. The SSN payload delivered from the tester may be viewed as a continuous stream of
packets that may wrap across SSN bus boundaries. To illustrate this concept, consider the
example shown in Fig. 2 where two blocks are being tested concurrently. Block A loads/unloads 5
bits per shift cycle of the block (has 5 EDT channels). Block B has 4 channels. For both blocks
to perform one shift cycle, 9 bits have to be loaded/unloaded. In conventional scan access
methods, this would have required 9 chip-level scan input pins and 9 scan output pins. With
SSN, the packet size in this example gets set to 9 bits independent of the SSN 8-bit bus width.
9 bits have to be delivered for each of the 2 blocks to shift once. The first 5 bits of every
9-bit packet are programmed to belong to block A, and the next 4 bits of every packet are
programmed to belong to block B. This is all determined and programmed at pattern generation
time – it is not hard-coded in the SSN logic. After programming all the SSN nodes using IJTAG,
SSN delivers a continuous, repeating stream of 9-bit packets. The allocation of packet bit
positions to SSH nodes is the same for all packets and is programmed at setup. As soon as block
A extracts 5 bits from the bus, it performs one internal shift operation. Likewise for block B,
every time it accumulates 4 bits. The SSH is programmed with the shift count per scan load, so
it can identify when to perform shift, and when to perform capture. Capture involves events
generated by the SSH such as de-asserting scan_enable, applying capture clocks through an
On-chip Clock Controller (OCC) [18], and re-asserting scan_enable in preparation for the next
scan operation.

In this example, we have decided to use 9-bit packets although the bus width is 8 bits. The
stream of 9-bit packets is simply folded into the 8-bit bus with no bits wasted. The first
9-bit packet occupies the first 8-bit parallel word of the bus, and the first bit of the second
word (second tester cycle). The second packet starts immediately after that, occupying the
remaining 7 bits of the second parallel word, and the first 2 bits of the following parallel
word. While the allocation of bits within a packet to an SSH is invariant, there is no static
mapping between a bit of the bus and an EDT channel input/output. The locations of the 9-bit
packets within each 8-bit bus word rotate with each packet. Each SSH node keeps track of the
location of its data in each packet, including accounting for rotation of the data. The size of
each packet must be equal to or greater than the bus width. In exceptional cases where the
packet size is less than the physical bus width, the bus is re-programmed to reduce its active
width such that it does not exceed the number of bits in a packet.

Typically, the same time slots of the packet that carry scan-in data to an SSH node also carry
scan-out data from that node. (Multiple identical cores may be handled differently as explained
later.) As block A reads the first 5 bits of every packet, it replaces them with 5 bits scanned
out (with slight latency).

Any number of internal cores and their channels can be controlled with an SSN bus that is as
narrow as one bit. This is because the packets can be as wide as they need to be, and can
occupy as many bus words as needed. The internal channel requirements (9 bits in this example)
are decoupled from the available scan pins at the chip level (8 × 2 pins for scan in this
case). If the packet is wider than the bus and occupies multiple bus words, the cores shift
less often than once every bus shift cycle but it will be possible to drive all the cores
needed. In this example with 9-bit packets and an 8-bit bus, the blocks shift approximately
every bus/tester clock cycle. Occasionally, a block may omit shifting in a given cycle because
it has to wait to acquire all the bits it needs for one shift cycle. If the bus is 1 bit wide
instead of 8 bits wide, it takes 9 tester cycles to scan in each packet. So the internal shift
rate is 1/9th of the external shift rate, but it is still possible to drive all 9 internal
channels from the 1-bit bus. In fact, the bus width can be scaled down dynamically at pattern
generation time. When driving multiple cores concurrently such that the packet spans multiple
bus widths, and the internal shift frequency is slower than the external frequency as a result,
this presents an opportunity to deliver the data more quickly without exceeding the constraints
on the internal core shift frequencies. It is common in SSN implementations to cap the
core-internal shift frequency at 100 MHz yet run a faster, narrower bus at 400 MHz.

C. The Streaming Scan Host (SSH) Node
Fig. 3 shows a high-level view of the SSH. In addition to its aforementioned functionality,
other characteristics to highlight are:

1. If a core with an SSH is not under test in a given mode, the SSH may have to continue
   passing data through, being part of the network, but does not have to deliver scan data to
   its EDT. In this case, the SSH is said to be disabled. The data passes from the bus input
   register directly to the bus output register, such that the SSH acts as two pipeline stages
   within the network.
3. Because the packet data may rotate within the bus and span multiple parallel words, the SSH
   has shifters and registers to re-align and collect the data.
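The folding and rotation of packets described for the Fig. 2 example (9-bit packets, 8-bit bus, block A on packet bits 0 to 4 and block B on bits 5 to 8) can be sketched in a few lines of Python. This is only an illustrative bookkeeping model, not the SSH hardware: the function names are hypothetical, capture cycles and pipeline latency are ignored, and the 32-bit packet in the last line is an assumed value chosen to show how a 400 MHz bus could map to a 100 MHz internal shift rate.

```python
def fold_packets(packet_bits, bus_width, num_packets):
    """For each packet, list the (bus_word, bit_in_word) slot of every packet bit."""
    return [
        [divmod(p * packet_bits + i, bus_width) for i in range(packet_bits)]
        for p in range(num_packets)
    ]


def shifts_completed(packet_bits, bus_width, bus_cycles):
    """Whole packets (one internal shift each) delivered in `bus_cycles` bus words."""
    return (bus_cycles * bus_width) // packet_bits


def internal_shift_mhz(bus_mhz, bus_width, packet_bits):
    """Average internal shift rate when one packet yields one internal shift."""
    return bus_mhz * bus_width / packet_bits


layout = fold_packets(packet_bits=9, bus_width=8, num_packets=2)
block_a_slots = layout[1][:5]   # packet bits 0..4 belong to block A
block_b_slots = layout[1][5:]   # packet bits 5..8 belong to block B

print(layout[0])       # word 0 bits 0..7, then word 1 bit 0
print(block_a_slots)   # word 1 bits 1..5
print(block_b_slots)   # word 1 bits 6..7, then word 2 bits 0..1
print(shifts_completed(9, 8, bus_cycles=72))   # 64 shifts: 8 shifts per 9 bus cycles
print(shifts_completed(9, 1, bus_cycles=72))   # 8 shifts: 1 shift per 9 bus cycles
print(internal_shift_mhz(400, 8, 32))          # 100.0 (hypothetical 32-bit packet)
```

The bus positions used by each block differ between the first and second packet, which is the rotation that each SSH has to track, while the packet-relative allocation stays fixed.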
The pair acts as a deskew FIFO. By temporarily converting a fast narrow bus into a slow wide
bus when crossing Clock Tree Synthesis (CTS) regions, a larger amount of clock skew can be
tolerated without impacting the shift speed or throughput. The FIFO logically acts like
pipeline stages in the SSN datapath. Splitting the FIFO into 2 discrete components allows the
BFD to be placed in the transmitting region and the BFM in the receiving region, with each
component driven by the local SSN clock in its region.

The BFD and BFM nodes may additionally be used to reduce the bus width distributed around the
chip and reduce the SSN area. Although an SSN bus that operates at 400 MHz can be easily
implemented, it is often not possible to shift data through the chip-level pins at more than
200 MHz. Assume that the SoC has enough pins to implement 64 scan inputs and 64 scan outputs.
One option would be to implement a 64-bit bus throughout the chip and operate it at 200 MHz.
Alternatively, the data can be scanned into the chip through 64 pins at 200 MHz and a BFM added
between the scan inputs and the first SSH to convert this input stream to a 32-bit, 400 MHz
bus. This 32-bit bus is then used across the chip, connecting SSH nodes with 32-bit buses. Then
before exiting the chip, a BFD node is added to convert the SSN output bus back to a 200 MHz
64-bit bus driving the output pins.
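The width/frequency trade made by the BFM and BFD nodes can be checked with a small behavioral sketch. It assumes a 2:1 ratio as in the 64-bit/200 MHz versus 32-bit/400 MHz example, and it assumes that the lower half of each wide word is sent first; the ordering, like the function names, is an illustration choice and not taken from the SSN specification.

```python
def bandwidth_mbps(width_bits, freq_mhz):
    """Raw bus bandwidth in Mbit/s."""
    return width_bits * freq_mhz


def bfm(wide_words, ratio=2):
    """Slow/wide to fast/narrow: split each wide word into `ratio` narrow words."""
    narrow = []
    for word in wide_words:
        step = len(word) // ratio
        narrow.extend(word[i:i + step] for i in range(0, len(word), step))
    return narrow


def bfd(narrow_words, ratio=2):
    """Fast/narrow to slow/wide: concatenate every `ratio` narrow words."""
    return [
        sum(narrow_words[i:i + ratio], [])
        for i in range(0, len(narrow_words), ratio)
    ]


# Two 64-bit words arriving at the pins (bit values are just labels here).
pin_words = [list(range(64)), list(range(64, 128))]
internal_words = bfm(pin_words)      # four 32-bit words on the internal bus
restored = bfd(internal_words)       # back to two 64-bit words at the outputs

print(bandwidth_mbps(64, 200), bandwidth_mbps(32, 400))  # equal bandwidth
print(len(internal_words), len(internal_words[0]))
print(restored == pin_words)
```

Because both sides carry the same number of bits per unit time, halving the internal bus width does not reduce throughput as long as the internal bus can run at twice the pin frequency.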
SSN has two features to reduce test time and test data volume in such cases. First, it supports
independent shift/capture for different retargeted cores. This is possible because signals such
as scan_enable and the shift clock are generated locally by each SSH. Second, it reduces the
shift length/pattern count imbalances between cores by programmatically varying the bandwidth
used for each core. If a core requires many fewer overall shift cycles across a pattern set
than other cores, it can be sent fewer bits per packet. For example, a core with 4 channels
does not need to be allocated 4 bits per packet. It can be throttled down and sent only 1 bit
per packet such that it shifts internally every four packets instead of every packet. The
result is that the total number of packets remains the same, but the size of the packets is
reduced, speeding up the overall test time. The next section introduces further test
optimization possible in the presence of multiple identical core instances.

Note that an additional benefit of independent capture is power. It can mitigate IR drop since
cores under test do not all shift and capture at the same time. In addition to scan access,
this may further facilitate testing a large number of cores concurrently.
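The effect of throttling a core's bandwidth can be illustrated with a rough cost model. The shift-cycle counts (4000 and 1000) and the 4-bit bus are hypothetical values picked so the numbers divide evenly; capture cycles, setup, and latency are ignored, and `stream_cost` is an illustrative helper, not part of any tool.

```python
from math import ceil


def stream_cost(cores, bus_width):
    """cores: list of (bits_per_packet, channels, shift_cycles_needed).

    Returns (packets, packet_bits, bus_cycles) for streaming all cores together.
    """
    # A core receiving `bits` per packet needs channels/bits packets per shift.
    packets = max(ceil(shifts * channels / bits) for bits, channels, shifts in cores)
    packet_bits = sum(bits for bits, _, _ in cores)
    # The active bus width may not exceed the packet size.
    active_width = min(bus_width, packet_bits)
    bus_cycles = ceil(packets * packet_bits / active_width)
    return packets, packet_bits, bus_cycles


# Core A: 5 channels, 4000 shift cycles. Core B: 4 channels, only 1000 shift cycles.
unthrottled = stream_cost([(5, 5, 4000), (4, 4, 1000)], bus_width=4)
throttled = stream_cost([(5, 5, 4000), (1, 4, 1000)], bus_width=4)

print(unthrottled)  # (4000, 9, 9000): core B is padded for 3000 packets
print(throttled)    # (4000, 6, 6000): same packet count, smaller packets
```

In this sketch the packet count is unchanged at 4000, while the packet shrinks from 9 to 6 bits, which is the mechanism the text describes for reducing padding when cores have mismatched shift requirements.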
Fig. 6: Packets when using on-chip compare to test multiple identical cores
In collaboration with Mentor, Intel has been evaluating the use of SSN. SSN is capable of
scaling to large SoCs and server class designs that require support for large partition counts
and identical core testing. Previous generations of Intel SoCs have utilized an internally
developed high bandwidth packetized fabric, STF [10][11], to address these needs. STF was
developed to allow this scalability at much lower overhead than the traditional pin muxed scan
solutions. In evaluating SSN, the goals were to assess whether moving to SSN could further
improve test time and bandwidth utilization over STF, as well as reduce design effort through
the use of a vendor supported platform.

A. Comparison of Packet Encoding Overhead
Both STF and SSN can scale to support any number of partitions; however, the approach to
accomplish this differs between the two systems. The STF network relies on explicit addressing
information stored within each packet. This is accomplished by having a short address ID tag
contained within each packet, typically 4 bits in size. In addition, STF requires an opcode
field, 4 bits in size, as well as input and output valid bits. This results in an overhead of
10 bits being added to each data packet. In contrast, under SSN, the destinations and
interleave settings are statically programmed during the test setup, allowing the entire bus
bandwidth to be used for data. For a typical bus size of 32 bits, STF has a 31% higher overhead
than SSN. This is depicted in Fig. 7.
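The 31% figure follows from the field sizes quoted above. The short calculation below assumes one input-valid bit and one output-valid bit (inferred from the 10-bit total) and expresses the per-packet overhead relative to a 32-bit bus word; the dictionary keys are descriptive labels, not actual STF field names.

```python
# Per-packet STF encoding fields, in bits (valid-bit widths inferred from the 10-bit total).
stf_fields = {"address_id": 4, "opcode": 4, "input_valid": 1, "output_valid": 1}

bus_width = 32
stf_overhead_bits = sum(stf_fields.values())
ssn_overhead_bits = 0  # destinations are programmed once at setup, not per packet

stf_overhead_pct = 100 * stf_overhead_bits / bus_width
print(stf_overhead_bits, stf_overhead_pct)  # 10 bits, 31.25 % of a 32-bit word
```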
active endpoint, plus IJTAG network overhead. Though this could result in substantially higher
setup overhead for SSN, the cost of the setup is amortized across the entire scan vector set.
For large pattern sets, network setup should not present a significant overhead of more than 1%
for SSN.

E. On-Die Compare
STF and SSN provide comparable functionality for identical core testing using on-die compare.
Both systems require the input data stream to include the input data, mask data and expected
response, causing a 3X growth of the data volume, but allow testing of any number of cores in
constant time. SSN has a possible advantage in the handling of an asymmetric number of input
and output channels. In this case, SSN can more tightly pack the expect and mask fields to
match the smaller output channel case, possibly realizing less than 3X data growth. STF,
however, allocates bandwidth assuming symmetric usage and is always 3X data volume. For the
purpose of this analysis, we assumed that on-die compare would be neutral between the two
systems.

F. Total Estimated Overhead Comparison
In summary, STF pays a high overhead in packet encoding, data field utilization and handling of
chain length mismatch. Network setup overhead is higher in SSN, but amortized across the number
of scan vectors resulting in a negligible difference. Overall, this can lead to over 2X
reduction in data volume under SSN vs. STF, as summarized in Table I.

G. SSN Pilot Study
SSN offers a compelling theoretical advantage over the current STF fabric in use. However, we
wanted to measure results on actual partition data to verify. Further, the study looked at
other aspects, such as design effort and run times. To perform the study, a simple test design
was created consisting of a single interface partition, partition1, and four identical copies
of a partition, partitions 2-5, as shown in Fig. 11.
An SSN bus data width of 32 bits was chosen to match STF to allow direct
comparison. ATPG patterns were created targeting partitions 2-5, each having 9 EDT
channels for a total of 36 bits of channel data. By having a total channel data set size
of >32 bits, SSN will perform data rotation and create a more meaningful comparison.
The 9-bit EDT channel size represents a typical data field packing inefficiency for STF.
Multiple ATPG runs were conducted to analyze the overhead at 10, 500, and 10,000
vectors. The results from these runs are summarized in Table II, comparing STF, SSN,
and a legacy pin mux solution.
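To see why 36 bits of channel data on a 32-bit bus exercises data rotation, the alignment between packets and bus words can be computed directly. The helper below is a hypothetical sketch that assumes one 36-bit packet per internal shift and contiguous packing with no padding.

```python
from math import gcd


def rotation_period(packet_bits, bus_width):
    """Packets and bus words after which packet/bus-word alignment repeats."""
    common_bits = packet_bits * bus_width // gcd(packet_bits, bus_width)
    return common_bits // packet_bits, common_bits // bus_width


packets, words = rotation_period(packet_bits=36, bus_width=32)
start_offsets = [(p * 36) % 32 for p in range(packets)]

print(packets, words)   # 8 packets span exactly 9 bus words
print(start_offsets)    # each packet starts 4 bits later within the bus word
```

Under these assumptions every packet begins at a different bit offset until the pattern repeats after 8 packets, so each SSH has to re-align its data on every packet.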
For this testcase, SSN shows a clear advantage over STF, with STF having 19% higher test time
and 57% more data volume than SSN. SSN test setup has higher overhead than STF; however, when
amortized across the 10,000 vectors in the run set, this impact is in the expected range of
1.2%. This testcase used identical partitions and hence did not exercise vector count mismatch
between partitions nor chain length mismatch, which would further favor SSN. For comparison
purposes, a legacy pin muxed solution is included showing a large overhead relative to SSN.
Since the pin muxed solution cannot transport 36 bits of channel data in a single run, it must
be split into 2 runs, nearly doubling test time and data volume.
In addition to data volume and test time metrics, we also collected information on design
efficiency between the internal STF toolset and the Mentor Tessent™ tool flows for SSN. This
comparison is summarized in Table III.

As the table shows, SSN and the Tessent flows provide significant productivity improvement over
our previous flow built from multiple tools, enabling rapid integration into the design and
fast turnaround ATPG runs. The SSN flows do not require ATPG cut points and custom setups to
generate and retarget patterns, resulting in significant savings in pattern retargeting. Though
not in the scope of this analysis, further benefits are expected in gate level simulation debug
productivity.
IX. CONCLUSION
The SSN technology introduced in this paper solves many of the scan distribution challenges in
complex SoCs. It enables simultaneous testing of any number of cores with few chip-level pins,
and it has multiple features to reduce test time and test data volume. It can test any number
of identical core instances in near constant time, minimizes padding in the presence of cores
with mismatched pattern counts and/or scan chain lengths, and enables fast streaming of data
to/from and throughout the chip. It simplifies design planning and implementation, and is
especially well suited for tile-based designs. Intel evaluated SSN and compared it to STF as
well as to conventional pin-muxed access. SSN was found to reduce the test data volume by 36%
and 43%, respectively. It reduced test cycles by 16% and 43%, respectively. Steps in the design
and retargeting flow were between 10x – 20x faster with SSN compared to STF.
ACKNOWLEDGMENT
The authors wish to thank other contributors to the development of the SSN technology: Yahya
Zaidan, Pawel Galas, Szymon Walkowiak, Paul Reuter, and Tony Fryars. We would also like to
thank the contributors to the SSN pilot study: Sirish Chittoor, Yonsang Cho, Luis Briceño
Guerrero, Kavita Bansal, Kelsey Byers, and Ian Nuber. Finally, many thanks to all our other
partners who also
References
[1] Standard Testability Method for Embedded Core-based Integrated Circuits, IEEE Standard 1500, 2005.
[2] J. Remmers et al., “Hierarchical DFT methodology - a case study,” IEEE International Test Conference, 2004.
[3] D. Trock et al., “Recursive Hierarchical DFT Methodology with Multilevel Clock Control and Scan Pattern Retargeting,” IEEE Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016.
[4] J. Rajski et al., “Embedded Deterministic Test,” IEEE Trans. on CAD, vol. 23, May 2004, pp. 776-792.
[5] P. Wohl, J.A. Waicukauski, J.E. Colburn, M. Sonawane, “Achieving extreme scan compression for SoC Designs,” IEEE International Test Conference, 2014.
[6] C. Barnhart et al., “OPMISR: The foundation for compressed ATPG vectors,” IEEE International Test Conference, 2001.
[7] G. Giles et al., “Test Access Mechanism for Multiple Identical Cores,” IEEE International Test Conference, 2008.
[8] Y. Dong et al., “Maximizing Scan Pin and Bandwidth Utilization with a Scan Routing Fabric,” IEEE International Test Conference, 2017.
[9] J. Janicki et al., “EDT bandwidth management - Practical scenarios for large SoC designs,” IEEE International Test Conference, 2013.
[10] G. Colon-Bonet, “High Bandwidth DFT Fabric Requirements for Server and Microserver SoCs,” IEEE International Test Conference, 2015.
[11] G. Colon-Bonet, “High Bandwidth Packetized DFT Fabric for Server SoCs,” IEEE International System-on-Chip Conference, 2016.
[13] M. Sonawane et al., “Flexible Scan Interface Architecture for Complex SoCs,” IEEE VLSI Test Symposium, 2016.
[14] P. Wohl et al., “Achieving Extreme Scan Compression for SoC Designs,” IEEE International Test Conference, 2014.
[15] Standard for Access and Control of Instrumentation Embedded within a Semiconductor Device, IEEE Standard 1687, 2014.
[16] J. Durupt et al., “IJTAG supported 3D DFT using chiplet-footprints for testing multi-chips active interposer system,” IEEE European Test Symposium, 2016.
[17] M. Lin et al., “A 7nm 4GHz Arm®-core-based CoWoS® Chiplet Design for High Performance Computing,” Symposium on VLSI Circuits Digest of Technical Papers, 2019.
[18] T. Waayers et al., “Clock control architecture and ATPG for reducing pattern count in SoC designs with multiple clock domains,” IEEE International Test Conference, 2010.
[19] Standard for High-Speed Test Access Port and On-Chip Distribution Architecture, IEEE Standard 1149.10, 2017.
10/2023 84615-C2