Streaming Scan Network
Mentor, A Siemens Business, 8005 SW Boeckman Road, Wilsonville, OR 97070
Intel Corporation, 4701 Technology Parkway, Fort Collins, CO 80528
Intel Corporation, 75 Reed Road, Hudson, MA 01749
2020 IEEE International Test Conference (ITC) | 978-1-7281-9113-3/20/$31.00 ©2020 IEEE | DOI: 10.1109/ITC44778.2020.9325233
Abstract—System-on-Chip (SoC) designs are increasingly difficult to test using traditional scan access methods without incurring inefficient test time, high planning effort, and physical design/timing closure challenges. The number of cores keeps growing while chip pin counts available for scan remain constant or decline, limiting the ability to drive cores concurrently. With increasingly commonplace tiling and abutment, the scan distribution hardware must be placed inside the cores, making balanced pipelining when broadcasting to identical cores difficult. Optimizing test time requires analyzing all the cores and subsequently changing the test hardware in the cores. Internal shift speed constraints may limit the ability to shift data in and out of the chip at high rates. Differences in pattern counts or scan chain lengths between cores tested in parallel can result in padding and increased test time. SSN is a bus-based scan data distribution architecture designed to address all these challenges. It enables simultaneous testing of any number of cores even with few chip I/Os. It facilitates short test time by enabling high-speed data distribution, by efficiently handling imbalances between cores, and by supporting testing of any number of identical cores with a constant cost. It provides a plug-and-play interface in each core that is well suited for abutted tiles, and simplifies scan timing closure. This paper also compares the test cost and implementation productivity of SSN with those of Intel’s Structural Test Fabric.

Keywords—Design For Test, DFT, SoC Test, Hierarchical Test, Multiple Identical Cores, Known-Good-Die Testing, Test Time Reduction, Low Pin Count Test, Scan Distribution Architecture, Scan Fabric

I. INTRODUCTION

With some Integrated Circuits (ICs) growing to billions of transistors, it is virtually impossible to design, implement, and test them flat. A System-on-a-Chip (SoC) is an IC that is comprised of multiple components, referred to as cores. Each core is typically designed, implemented, and validated independently before being integrated with others. As design complexity has grown, so have the levels of core hierarchy. It is not uncommon to have lower-level cores integrated into subsystems, which are integrated into chiplets that are then assembled into a chip.

As design is done hierarchically to manage complexity, so is DFT. In hierarchical test methodologies [1][2][3], scan chains and compression logic [4][5][6] are inserted into every core. The cores are wrapped with scan and interface control logic. Test patterns targeting most faults in a core are generated and validated at the core level. Subsequently, the patterns from multiple wrapped cores are retargeted or mapped to the top level. They are often merged with patterns retargeted from other cores that are tested at the same time if scan access and design constraints permit. In addition to retargeting patterns generated for testing the wrapped logic within each core, test pattern generation is also run at the next level up to test peripheral logic outside wrapper chains as well as logic at that higher level of hierarchy. If this parent level is not the chip level, then those patterns will also have to be retargeted to the chip level. The same test pattern generation and retargeting methodology is applied recursively regardless of the levels of hierarchy, but the planning and implementation of DFT get more complex with additional levels of hierarchy, especially when using conventional scan access methods.

The following subsections explain key SoC test challenges inherent with pin-mux scan access, which is commonly used in the industry and explained in the referenced papers.

A. SoC Test Challenges: Planning and Layout

Traditionally, for a group of cores to be tested concurrently, one of the requirements is that their channel inputs and outputs must be directly connected to chip-level pins. As the number of cores in SoCs grows and the number of chip-level pins available for scan test remains the same or is reduced, additional groups of cores and scan access configurations must be created. This has negative implications on DFT implementation effort, silicon area, pattern retargeting complexity, and test time.

Part of hierarchical test planning is to identify early in the design flow the number of scan channels used in every core, and the groups of cores which will be tested concurrently in every scan access configuration. This can result in sub-optimal results since it creates fixed core groupings and forces premature decisions on channel counts per core before the cores are completed and before their compression configurations can be optimized and their pattern counts estimated. Chip-level design decisions depend on the cores. The cores are finalized too late in the design cycle, and their compression configurations are influenced by the chip-level core groupings and pin availability. This mutual dependency makes it virtually impossible to optimize compression for the SoC. As the number of levels of
Authorized licensed use limited to: University of Prince Edward Island. Downloaded on May 16,2021 at 18:36:16 UTC from IEEE Xplore. Restrictions apply.
core hierarchy increases, the planning complexity and test inefficiency also grow.

Connecting chip pins to the cores can have physical design implications. Connecting each pin to different cores in different test configurations can lead to routing congestion. The pads may be embedded inside cores in some packaging technologies such that the connections for one core impact the design of other cores to which the signals have to be routed, or through which the scan connections flow. Those connections are also often pipelined, so timing between those pipeline stages and compression logic must be carefully designed to achieve high shift speeds and avoid timing violations.

Tile-based layout is a relatively recent trend in SoC design that is adding further complexity and constraints to DFT architectures. In pure tiling layouts, virtually all logic and routing is within the cores and not at the top level. The cores are designed to abut one another when integrated into the chip such that connections flow from one core to the next. Any connectivity between cores has to flow through cores that are between them. Logic that is at the top level has to be pushed into the cores and designed as part of the cores.

B. SoC Test Challenges: Limited Chip-Level Pins

When retargeting core-level patterns, limited chip-level pin counts can be dealt with by increasing the number of core groups and test sessions, as long as there are enough chip pins to drive at least each core individually. However, there are cases where simultaneous access to multiple or all cores is necessary, and grouping cores into smaller groups is not an option. One example is Iddq test, where scan data is loaded across the entire chip before a relatively lengthy current measurement is taken. When using scan compression such as Embedded Deterministic Test (EDT) [4], this means there must be enough pins available to drive all the EDT channels of the cores concurrently.

C. SoC Test Challenges: Identical Core Instances

Pattern retargeting in the presence of identical core instances can benefit from generating patterns once, and from the ability to broadcast the scan inputs from the same top-level pins, reducing both ATPG runtime and pin requirements. There are, however, still multiple challenges to be resolved.

Although broadcast of scan inputs keeps the number of input pins constant for any number of identical cores, the outputs are often observed independently to guarantee the same test coverage achieved at the core level and to ensure enough observability for diagnosing failing cores. Since at least 1 output channel is needed per core instance, this can limit the number of identical core instances that can be tested concurrently just as there are similar limitations on heterogeneous core instances.

The second issue is that after scan loading, the capture clocking is usually applied concurrently to all core instances. Combined with the broadcast of input scan data, the number of pipeline stages must be equal between a scan input pin and all the identical core instances it drives. This can be difficult to achieve in the presence of tiling where no routing or logic may exist outside the cores. Signals, including scan inputs, may propagate across multiple instances of the same core, accumulating pipelining delay. Routing of individual output channels from each core instance through the other core instances can also be complicated due to the fact that all cores are copies of each other. A solution exists where every core instance is programmed with a different number of pipeline stages and different routing for scan output paths, but this introduces complexity and limits the reuse of cores. Designing a new chip with more core instances requires redesigning the cores to account for differences in pipelining and routing channels.

II. PRIOR WORK

To address some of the challenges explained, a few companies have developed and published scan access technologies beyond the traditional pin-mux topologies. They vary in the scope of the challenges they address and the trade-offs they make.

A packetized bus-based architecture specifically tailored to providing a scalable solution for testing of multiple identical core instances was introduced in [7]. It is not a general scan access mechanism that can simultaneously test heterogeneous cores. It supports shifting in the expected data, in addition to input stimuli, such that on-chip comparison can be done and pass/fail data accumulated and observed. It also allows some trade-offs between efficiency and diagnostic information. Getting full failure data for diagnosis may require the application of a different pattern set, one that uses a different configuration than the full-rate mode used for high-volume manufacturing. This architecture also has data overhead because every parallel word includes a command opcode in addition to the scan data payload. The fact that each parallel word has to include both payload and a command imposes limits on how narrow the bus may be, and imposes additional constraints on the bus width and its relation to the core scan channel counts.

The authors subsequently introduced a new architecture [8] that has a different focus: while it maintains a solution for testing of multiple identical cores, its primary new design objective is to enable better bin packing for retargeted core-level patterns. It does so by providing flexibility in mapping chip-level pins to core-level scan pins such that there is flexibility in controlling which cores are tested concurrently. Instead of a bus architecture as in [7], it uses a flexible mux-based switching network. The architecture succeeds in enabling effective dynamic bandwidth management [9] and late-binding core grouping to minimize padding caused by test length differences across cores. However, this architecture incurs some costs. Given the network provides flexibility in connecting any top-level pin to any core-level scan channel (although there are restrictions on combinations of connections), the network can result in significant routing cost especially in the presence of a large number of cores. Using a mux-based star network is also less amenable to connection-by-abutment in tile-based designs compared to bus-based architectures.

The Structural Test Fabric (STF) solution [10][11], published by co-authors of this paper, provides a general packet-based core access mechanism that works for heterogeneous cores, and has a scalable solution for multiple identical cores. It is flexible in that every parallel word is self-contained, but incurs
overhead per parallel bus word. A detailed comparison of this architecture to SSN is presented in Section VIII.

To allow simultaneously driving more internal scan channels than the number of chip-level scan pins, some architectures such as [12] employ serializers/deserializers. This additionally allows running chip-level scan pins at higher frequencies than internal scan chains support, improving overall bandwidth. A subsequent version of this technology [13] added flexibility to allow varying the number of scan pins per core. The number of external scan pins per core and the related serialization/deserialization ratio are programmable. The purpose is to enable reuse of the test data for a given core across SoCs with different scan pin configurations. It also enables varying shift frequencies in different cores within the SoC. Those methods facilitate IP reuse and access to cores in the presence of limited chip-level scan pins. However, they do not address routing challenges in tile-based designs nor provide an efficient and scalable solution for multiple identical cores.

Some scan compression methods have extensions to facilitate test across an SoC. For example, the architecture in [14] can distribute test data to compression logic in cores, and uses serializers/deserializers to manage pin count limitations. However, as with the preceding method, it is not an abutment-friendly architecture nor does it efficiently test many identical cores as SSN will be shown to do.

In the next sections, we describe how SSN aims to solve the challenges presented in Section I, while improving on efficiency, flexibility, and capabilities of previously published access mechanisms.

III. SSN TECHNOLOGY FUNDAMENTALS

A. Architecture Overview

Fig. 1 shows a simplified example of a 6-core design that uses SSN. Each core typically contains one Streaming Scan Host (SSH) node (yellow box). The SSH drives local scan resources to load/unload scan chains/channels with data delivered on the SSN bus. In the figure, an EDT scan compression controller is shown for simplicity as a representative of the scan logic within the core. In reality, the SSH node can interface with EDT controller(s), uncompressed/legacy scan chains, or a combination of the two.

Fig. 1: SSN Architecture

Each SSH has two external interfaces: An IEEE 1687 [15] IJTAG interface predominantly used for setup, and a parallel data bus that subsequently transports the payload scan data and connects one SSH node to the next. The IJTAG network, shown as a 1-bit bus, is used to configure all nodes in the SSN network prior to the application of a test pattern set. Each node is loaded with information related to the protocol such as the active bus width, its location in the series of nodes driven, the number of shift cycles per scan pattern, scan_enable transition timing information, etc. Following this setup, the entire test pattern set is applied as packetized scan data that is streamed on the parallel bus shown as an N-bit bus. Because the protocol of alternating shift/capture operations is very regular and repeatable, each SSH is pre-loaded with the information needed for its counters and finite state machine to track the streaming operation. There is no need to send opcode or address information with each packet. Only the scan payload is streamed, as shown in the next section. As data streams through the SSH nodes, each node can identify when it needs to read scan_in data from the bus, when it needs to place scan_out data on the bus, and when it needs to pass along data that is destined for other nodes. Each SSH controls the local scan operations for the core, including transitions between load/unload and capture stages, as well as performing individual shift operations. All scan signals and EDT controls are generated by the SSN local to the core and the only test signals that cross core boundaries are the SSN parallel bus (N-bit data bus + clock) and the IJTAG signals. This allows scan timing closure to be completed at the core level.

SSN supports the abutment of cores in tile-based designs with no routing outside the cores. The outputs of one core connect to the inputs of the next adjacent core. A chip with SSN usually has a single datapath (parallel bus) that goes through all cores. Depending on the floorplan and pad locations, it may be preferable for physical design to implement multiple, physically independent datapaths (for example, one datapath per chiplet [16][17]). Each datapath is also configurable and can include muxes that can be programmed to include or exclude segments of the network similar to the Segment Insertion Bit (SIB) in IJTAG networks.

As will be demonstrated in the upcoming sections, the SSN bus width is selected based on chip-level pin availability and is independent of the number and logic size of the scanned cores, and the number of channels needed by the EDT controller(s) in each core. This enables each core to have the same plug-and-play interface and bus width for scan test, allowing SSN to scale efficiently as the design floorplan, number of cores, or the content of the cores change.

The ability to route the bus carrying the data from one core to the next while dynamically controlling which cores are active/inactive/bypassed means one has flexibility in accessing any combination of cores without changing the hardware. Unlike pin-mux architectures, this flexibility does not come at the expense of routing congestion. Additionally, there is no need to try and predict at design time how to group cores that are to be tested concurrently. Whether performing ATPG on groups of cores or retargeting patterns from different cores, the same SSN network can provide access to one core at a time, all cores simultaneously, or anything in-between.
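
The streaming behavior described in this subsection (each node reads its own scan_in bits, places its scan_out bits on the bus, and passes along everything else, while bypassed nodes forward the data untouched) can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the SSH hardware: the class `SshNode`, its fields `offset`, `width`, and `active`, and the function `stream_packet` are hypothetical names chosen for illustration, and bus pipelining latency is ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SshNode:
    """Hypothetical behavioral model of one SSH node (illustrative names only)."""
    name: str
    offset: int          # first bit position in the data owned by this node (set at setup)
    width: int           # number of bits owned, e.g. the node's channel count
    active: bool = True  # a bypassed/inactive node forwards the data unchanged
    scan_in_log: List[List[int]] = field(default_factory=list)

    def process(self, packet: List[int], scan_out: List[int]) -> List[int]:
        """Read this node's bits, replace them with scan_out, pass the rest along."""
        if not self.active:
            return list(packet)
        if len(scan_out) != self.width:
            raise ValueError("scan_out must supply one bit per owned bit position")
        end = self.offset + self.width
        self.scan_in_log.append(list(packet[self.offset:end]))
        return list(packet[:self.offset]) + list(scan_out) + list(packet[end:])


def stream_packet(nodes: List[SshNode], packet: List[int],
                  scan_out_by_node: Dict[str, List[int]]) -> List[int]:
    """Send one group of bits through the nodes in bus order; return what leaves the chip."""
    for node in nodes:
        scan_out = scan_out_by_node.get(node.name, [0] * node.width)
        packet = node.process(packet, scan_out)
    return packet


# Three nodes on one datapath; the middle one is bypassed.
nodes = [
    SshNode("core1", offset=0, width=5),
    SshNode("core2", offset=0, width=0, active=False),
    SshNode("core3", offset=5, width=4),
]
tester_bits = [1, 0, 1, 1, 0, 0, 1, 1, 0]
returned = stream_packet(
    nodes, tester_bits, {"core1": [0, 0, 0, 0, 1], "core3": [1, 1, 1, 1]}
)
print(returned)  # [0, 0, 0, 0, 1, 1, 1, 1, 1]
```

In this model the same bus carries stimuli toward later nodes and responses back toward the tester, and changing which nodes are active only changes setup values, not connectivity.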
Fig. 2: Streaming scan packets
B. Packets

In SSN terminology, a “packet” usually consists of all the scan data needed for all the active SSH nodes to perform a single internal scan shift operation. A packet should not be confused with the actual SSN physical bus width which could be narrower or wider than a packet. The SSN payload delivered from the tester may be viewed as a continuous stream of packets that may wrap across SSN bus boundaries. To illustrate this concept, consider the example shown in Fig. 2 where two blocks are being tested concurrently. Block A loads/unloads 5 bits per shift cycle of the block (has 5 EDT channels). Block B has 4 channels. For both blocks to perform one shift cycle, 9 bits have to be loaded/unloaded. In conventional scan access methods, this would have required 9 chip-level scan input pins and 9 scan output pins. With SSN, the packet size in this example gets set to 9 bits independent of the SSN 8-bit bus width. 9 bits have to be delivered for each of the 2 blocks to shift once. The first 5 bits of every 9-bit packet are programmed to belong to block A, and the next 4 bits of every packet are programmed to belong to block B. This is all determined and programmed at pattern generation time – it is not hard-coded in the SSN logic. After programming all the SSN nodes using IJTAG, SSN delivers a continuous, repeating stream of 9-bit packets. The allocation of packet bit positions to SSH nodes is the same for all packets and is programmed at setup. As soon as block A extracts 5 bits from the bus, it performs one internal shift operation. Likewise for block B, every time it accumulates 4 bits. The SSH is programmed with the shift count per scan load, so it can identify when to perform shift, and when to perform capture. Capture involves events generated by the SSH such as de-asserting scan_enable, applying capture clocks through an On-chip Clock Controller (OCC) [18], and re-asserting scan_enable in preparation for the next scan operation.

In this example, we have decided to use 9-bit packets although the bus width is 8 bits. The stream of 9-bit packets is simply folded into the 8-bit bus with no bits wasted. The first 9-bit packet occupies the first 8-bit parallel word of the bus, and the first bit of the second word (second tester cycle). The second packet starts immediately after that, occupying the remaining 7 bits of the second parallel word, and the 2 bits of the following parallel word. While the allocation of bits within a packet to an SSH is invariant, there is no static mapping between a bit of the bus and an EDT channel input/output. The locations of the 9-bit packets within each 8-bit bus word rotate with each packet. Each SSH node keeps track of the location of its data in each packet, including accounting for rotation of the data. The size of each packet must be equal to or greater than the bus width. In exceptional cases where the packet size is less than the physical bus width, the bus is re-programmed to reduce its active width such that it does not exceed the number of bits in a packet.

Typically, the same time slots of the packet that carry scan-in data to an SSH node also carry scan-out data from that node. (Multiple identical cores may be handled differently as explained later.) As block A reads the first 5 bits of every packet, it replaces them with 5 bits scanned out (with slight latency).

Any number of internal cores and their channels can be controlled with an SSN bus that is as narrow as one bit. This is because the packets can be as wide as they need to be, and can occupy as many bus words as needed. The internal channel requirements (9 bits in this example) are decoupled from the available scan pins at the chip level (8 × 2 pins for scan in this case). If the packet is wider than the bus and occupies multiple bus words, the cores shift less often than once every bus shift cycle but it will be possible to drive all the cores needed. In this example with 9-bit packets and an 8-bit bus, the blocks shift approximately every bus/tester clock cycle. Occasionally, a block may omit shifting in a given cycle because it has to wait to acquire all the bits it needs for one shift cycle. If the bus is 1 bit wide instead of 8 bits wide, it takes 9 tester cycles to scan in each packet. So the internal shift rate is 1/9th of the external shift rate, but it is still possible to drive all 9 internal channels from the 1-bit bus. In fact, the bus width can be scaled down dynamically at pattern generation time. When driving multiple cores concurrently such that the packet spans multiple bus widths, and the internal shift frequency is slower than the external frequency as a result, this presents an opportunity to deliver the data more quickly without exceeding the constraints on the internal core shift frequencies. It is common in SSN
implementations to cap the core-internal shift frequency at 100 MHz yet run a faster/narrow bus at 400 MHz.

[...] using a Bus Frequency Divider (BFD)/Bus Frequency Multiplier (BFM) pair, as shown in Fig. 4.
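
The packet folding arithmetic from Section III.B (9-bit packets streamed over an 8-bit or 1-bit bus) can be checked with a few lines of Python. This is a minimal sketch of the bookkeeping only, assuming packets are packed back to back with no gaps, as the text describes; the function names `bus_positions`, `shifts_completed`, and `internal_shift_rate` are hypothetical and do not correspond to any SSN tool or hardware interface.

```python
from fractions import Fraction
from typing import List, Tuple


def bus_positions(packet_index: int, packet_size: int, bus_width: int) -> List[Tuple[int, int]]:
    """Return (bus_word, bit_within_word) for every bit of one packet.

    Packets are folded back to back into consecutive bus words, so the
    position of a packet within the bus word rotates from packet to packet.
    """
    first_bit = packet_index * packet_size
    return [divmod(first_bit + i, bus_width) for i in range(packet_size)]


def shifts_completed(bus_cycles: int, packet_size: int, bus_width: int) -> int:
    """Whole packets delivered (internal shift cycles possible) after N bus cycles."""
    return (bus_cycles * bus_width) // packet_size


def internal_shift_rate(packet_size: int, bus_width: int) -> Fraction:
    """Average internal shift cycles per bus cycle."""
    return Fraction(bus_width, packet_size)


# Example from the text: 9-bit packets (5 bits for block A, 4 for block B), 8-bit bus.
print(bus_positions(0, 9, 8))  # word 0 bits 0..7, then word 1 bit 0
print(bus_positions(1, 9, 8))  # word 1 bits 1..7, then word 2 bits 0..1
print([shifts_completed(n, 9, 8) for n in range(1, 10)])  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(internal_shift_rate(9, 8))  # 8/9
print(internal_shift_rate(9, 1))  # 1/9
```

The third line of output shows the occasional skipped shift: after 9 bus cycles only 8 packets have arrived, so the blocks shift in roughly, but not exactly, every bus cycle. With a 1-bit bus the same packets take 9 bus cycles each, which matches the 1/9 internal shift rate stated above.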
V. OPTIMIZING TEST TIME AND DATA VOLUME

It is important to differentiate when the capture cycles of all cores must be aligned and performed concurrently versus when each core (or group of cores) operates independently and can capture regardless of whether other cores are shifting or capturing. The latter enables more efficiency when it can be used.

When running ATPG on a group of interacting cores, such as during external test, it is always necessary to align capture events because of the interactions between the cores during capture. In this case, the SSH in each of those cores can shift independently, but all those SSH nodes are programmed to capture concurrently once they complete a scan load/unload.

However, consider when pattern generation is performed on wrapped cores (or groups of cores) that are isolated from one another and have their own OCCs. At the top level, those pattern sets are independent and can be merged and applied concurrently. In most other retargeting solutions, the capture events are aligned as shown in Fig. 5. While this is necessary for ATPG, it can be unnecessary and inefficient for the case of retargeted patterns. Imbalances in shift lengths per scan load may result in unnecessary padding. A core with short scan chains should not need to wait for other cores to complete shifting before it can capture. Furthermore, there are often significant imbalances in the pattern counts of different cores. Traditional retargeting methods pad the cores with fewer patterns such that there is a waste of data and test time.

Fig. 5: Retargeting with aligned vs. independent capture

SSN has two features to reduce test time and test data volume in such cases. First, it supports independent shift/capture for different retargeted cores. This is possible because signals such as scan_enable and the shift clock are generated locally by each SSH. Second, it reduces the shift length/pattern count imbalances between cores by programmatically varying the bandwidth used for each core. If a core requires many fewer overall shift cycles across a pattern set than other cores, it can be sent fewer bits per packet. For example, a core with 4 channels does not need to be allocated 4 bits per packet. It can be throttled down and sent only 1 bit per packet such that it shifts internally every four packets instead of every packet. The result is that the total number of packets remains the same, but the size of the packets is reduced, speeding up the overall test time. The next section introduces further test optimization possible in the presence of multiple identical core instances.

Note that an additional benefit of independent capture is power. It can mitigate IR drop since cores under test do not all shift and capture at the same time. In addition to scan access, this may further facilitate testing a large number of cores concurrently.

VI. TESTING OF MULTIPLE IDENTICAL CORES

Many SoCs that achieve high throughput by parallelizing processing contain a number of cores that are replicated multiple times. CPU chips often include multiple processor cores. AI and GPU chips in particular can have some cores replicated well over 100 times. As previously explained, in pin-mux scan access architectures, the scan inputs may be broadcast to identical core instances, but the scan outputs are usually observed independently to ensure lossless mapping and observability for diagnosis. This results in a non-scalable solution where increasing the number of core instances requires additional chip pins for concurrent test.

A. On-Chip Compare

SSN provides a scalable method for testing any number of identical core instances in near constant test time, independent of the number of available chip-level pins, even in the presence of tile-based design constraints explained earlier. Instead of shifting in the stimuli only and unloading the expected response for comparison on the tester, the stimuli, expected responses, and compare/nocompare mask data are scanned in within each packet so that each core can perform its own on-chip comparison. Note that the data arrives at each core instance at a slightly different time since the SSN bus data streams through the nodes. With each internal shift cycle, the channel data transferred from EDT to the SSH is compared, and a pass/fail status bit per channel per shift cycle is computed. What is ultimately observed on the tester is the following:

1. Per-shift status bits: This is the aforementioned pass/fail bit for a given channel in a given internal shift cycle. This status bit is allocated a timeslot in the packet for unloading. To provide a scalable solution for any number of identical core instances, the same status bit in the packet usually accumulates the pass/fail status from a given channel/shift cycle across all identical core instances (or a subset of them). If this bit indicates a fail, one can identify which core-level bit had a failure but not necessarily which core instance(s) this failure originated from. It is still possible to identify failing cores and per-core fail information for diagnosis as explained later.

2. Sticky status bits: One sticky bit per SSH indicates if there was a failure in scan observed by this SSH in any cycle/channel of the pattern set. This bit per SSH is unloaded through IJTAG at the end of a pattern set to quickly identify failing cores (for designs with redundant cores), and to aid in diagnosis. Note that where finer granularity than 1 fail bit per SSH is needed, it is possible to generate a sticky bit per channel output connected to the SSH.

Fig. 6: Packets when using on-chip compare to test multiple identical cores

Fig. 6 shows an example of data encoding into packets when using on-chip compare. Six identical core instances are used in this example, each driving an EDT controller that has 7 input channels and 2 output channels. Each packet has enough scan data for the cores to perform one internal shift operation. First,
7 bits per packet corresponding to the 7 input channels (shown in blue) are allocated. Those stimuli are broadcast (in time) to all identical core instances. The expected responses (2 output channels = 2 bits) and mask information (2 output channels = 2 bits) are also shifted in and broadcast (red). Last are the status bits that accumulate the pass/fail information per channel per shift cycle (green). Typically, we would allocate 2 bits corresponding to the 2 output channels. A failure in one of those bits would indicate that the first channel of one of the 6 core instances failed, but we would not know which one. When we accumulate the status information of all 6 cores together, they are considered to be placed into 1 status group. In this example, we chose to partition the 6 cores into group “a” and group “b”. We only accumulate the fail information within each group. That is why we have 4 green bits: 2 output channels × 2 groups. The number of groups is programmable at pattern retargeting time. Increasing the number of groups beyond 1 sacrifices test efficiency for improved observability as will be explained in the diagnosis section.

When using on-chip compare, the response data cannot replace the stimuli in the packet because the stimuli have to travel to all other core instances. Separate time slots have to be allocated for the stimuli, the expected responses and the masks shifted in, as well as the status bits unloaded. In the common case of 1 status group, the number of bits per packet is usually #input_channels + 3 × #output_channels. Because each output channel requires at least 3 bits of data in the packet (expected value, mask, and pass/fail status), using an asymmetric EDT with fewer output channels than input channels improves test time and test data volume in conjunction with on-chip compare.

B. Diagnosis Flow

Failure data is needed even during high-volume manufacturing for on-tester identification of failing cores to support partial good die strategies (redundant logic cores), and for diagnosis-driven yield analysis. When not using on-chip compare, every channel output bit in a core maps to a single bit [...]

Diagnosis in the presence of on-chip compare is more involved and may require re-application of the pattern set to collect all the data needed. Consider the case where all identical core instances are placed in a single status group such that their per-cycle pass/fail information is aggregated into the same packet timeslots. If any of those bits indicate failures, we have the cumulative per-pin per-cycle fail data but may not know which core(s) the failures came from. The sticky status bits unloaded at the end of the test set via IJTAG indicate which core(s) failed at least once. If only one core in this group fails, then we know the per-cycle pass/fail data came from this core alone and therefore we have all the information needed for diagnosis. However, if multiple cores fail, we have to separately test and observe each of those failing cores to get their individual fail data. If two cores fail, for example, then the same test set is re-applied twice, with minor patching applied. In each case, static bits in the setup of the cores are patched to control which cores are allowed to contribute to the cumulative pass/fail results. Note there is no need to store separate patterns for diagnosis on the tester.

If identical core instances are split into multiple groups, this slightly increases the test time, but decreases the probability of resorting to multiple test applications for collecting diagnosis data. In the example shown in Fig. 6, the six cores are split into two groups. If cores A1 and A4 are found to have failed, there is no need for test re-application because cores A1-A3 accumulate their status bits separately from cores A4-A6. However, if cores A1 and A3 fail, test re-application with patching is needed to acquire the individual fail data. In the extreme case, you may choose to assign each core instance to its own group so that each core is observed individually. This mode of operation may be better suited for silicon debug than high-volume manufacturing.

VII. ALTERNATE INTERFACES

A. Streaming Tests through JTAG/IJTAG Interfaces

It is possible not to use the SSH parallel bus at all, and
on the top-level SSN bus outputs that are unloaded and instead use the JTAG(chip)/IJTAG(core) interface for both
compared on the tester. Logic diagnosis is straightforward in setup and subsequent streaming of the test data. There are two
that case: perform reverse mapping of chip-level failures cases where this may be desirable:
through the SSN network to the EDT channel outputs, then
perform conventional compressed pattern diagnosis (at the core 1. As a survivability option. If during silicon bring-up, the
level in case of retargeted patterns). bus is inaccessible due to a silicon defect, this provides an
alternate method of accessing any SSH or group of SSHs.
Authorized licensed use limited to: University of Prince Edward Island. Downloaded on May 16,2021 at 18:36:16 UTC from IEEE Xplore. Restrictions apply.
2. If a low pin count device only has a JTAG interface and
no other digital pins, it is possible to implement SSN
without the parallel bus and rely on the JTAG/IJTAG
interfaces for streaming the test data.
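Looking back at the on-chip compare packet layout and the status-group diagnosis rule described before Section VII, the following Python sketch may help make both concrete. It is only an illustration under stated assumptions: the function names, the `GROUP_OF` mapping, and the choice of 4 input channels are hypothetical, and the generalization of the packet size to more than one status group is inferred from the "2 output channels × 2 groups" example rather than stated as a formula in the text.

```python
from collections import defaultdict


def compare_packet_bits(input_channels, output_channels, status_groups=1):
    """Bits per packet when on-chip compare is used.

    Stimuli, expected values and masks are shifted in; one status bit per
    output channel per status group is unloaded. The status_groups term is
    an assumption generalized from the 2-group example in the text.
    """
    stimuli = input_channels
    expected = output_channels
    mask = output_channels
    status = output_channels * status_groups
    return stimuli + expected + mask + status


def groups_needing_reapplication(failing_cores, group_of):
    """Return {group: [cores]} for every group with more than one failing core.

    A group with a single failing core already yields unambiguous per-cycle
    fail data, so only groups with several failing cores are reported.
    """
    failing_by_group = defaultdict(list)
    for core in failing_cores:
        failing_by_group[group_of[core]].append(core)
    return {g: cores for g, cores in failing_by_group.items() if len(cores) > 1}


# Six identical cores split into groups "a" (A1-A3) and "b" (A4-A6).
GROUP_OF = {"A1": "a", "A2": "a", "A3": "a", "A4": "b", "A5": "b", "A6": "b"}

if __name__ == "__main__":
    # 4 input channels is a hypothetical value; 2 output channels as in the text.
    print(compare_packet_bits(4, 2))       # one status group
    print(compare_packet_bits(4, 2, 2))    # two status groups
    print(groups_needing_reapplication(["A1", "A4"], GROUP_OF))
    print(groups_needing_reapplication(["A1", "A3"], GROUP_OF))
```

With one status group the first function reduces to #input_channels + 3 × #output_channels. The second function returns an empty result for failures in A1 and A4, and flags group "a" for failures in A1 and A3, matching the two cases discussed for Fig. 6.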
B. Compatibility with Test Using SerDes (IEEE 1149.10)

IEEE 1149.10 [19] provides for re-using high-speed I/O (HSIO) SerDes lanes to enable very high
bandwidth transfer of test data to/from a chip. The Packet Encoder/Decoder and Distribution
Architecture (PEDDA) IP described in the standard results in deserialized data presented on a
parallel bus. SSN’s synchronous parallel bus is ideally suited to interface with the PEDDA. SSN
can handle on-chip distribution of test data and internal generation of test signals. As the SSN
network can operate internally at high frequencies (at least 400 MHz), it is capable of testing
many cores concurrently and quickly when coupled with this high-bandwidth chip-level interface.

VIII. PRACTICAL EXPERIENCE USING SSN

In collaboration with Mentor, Intel has been evaluating the use of SSN. SSN is capable of scaling
to large SoCs and server-class designs that require support for large partition counts and
identical core testing. Previous generations of Intel SoCs have utilized an internally developed
high bandwidth packetized fabric, STF [10][11] to address these needs. STF was developed to allow
this scalability at much lower overhead than the traditional pin muxed scan solutions. In
evaluating SSN, the goals were to assess whether moving to SSN could further improve test time and
bandwidth utilization over STF, as well as reduce design effort through the use of a vendor
supported platform.

A. Comparison of Packet Encoding Overhead

Both STF and SSN can scale to support any number of partitions, however, the approach to
accomplish this differs between the two systems. The STF network relies on explicit addressing
information stored within each packet. This is accomplished by having a short address ID tag
contained within each packet, typically 4 bits in size. In addition, STF requires an opcode field,
4 bits in size, as well as input and output valid bits. This results in an overhead of 10 bits
being added to each data packet. In contrast under SSN, the destinations and interleave settings
are statically programmed during the test setup, allowing the entire bus bandwidth to be used for
data. For a typical bus size of 32 bits, STF has a 31% higher overhead than SSN. This is depicted
in Fig. 7.

B. Comparison of Data Field Utilization

STF utilizes a fixed data field size of 32 bits. To accommodate EDTs with a smaller number of
channels, the STF data word is divided up into fields, and the data for multiple shift cycles is
packed into the 32-bit word to achieve better utilization. However, when the EDT channel size does
not divide evenly into the 32-bit word, this reduces efficiency as illustrated in Fig. 8. In this
example, with 9-bit EDTs, we can pack 3 shift cycles of data into the data word with 5 bits of
unused data, resulting in an overhead of nearly 16%. In the worst case of a 17-bit EDT, 47% of the
data bandwidth is wasted. Thus, STF data field utilization can range from 53% to 100% depending on
how the EDT data packs into the 32-bit word. Because SSN utilizes data rotation, any leftover bits
within the bus become part of the next packet, always achieving 100% utilization of the bus data
word.

Fig. 8: STF packing of narrow EDT data

C. Interleaving, Vector Count and Chain Length Mismatch Handling

Both STF and SSN scale to any number of partitions, however their approaches differ in how they
handle the interleaving of partitions. In the example shown in Fig. 9, a set of partition patterns
that have differing numbers of vectors are to be merged. Typically, STF will have a specified
interleave factor, in this case 4, to which the patterns are repacked optimally into these 4
groups. These groups are then round-robin interleaved to create the final pattern set, as shown in
the figure.

Fig. 9: STF pattern interleaving

SSN’s handling of interleaving achieves similar efficiency for vector count mismatch as STF, but
SSN can also partially
mitigate chain length mismatch between partitions, which STF cannot. STF requires all partitions
in the pattern set to be padded to the same shift length, resulting in overhead. This is depicted
in Fig. 10. In our current designs, we allow up to 20% chain length mismatch between partitions,
so it is theoretically possible SSN could have up to 20% better packing efficiency in the final
pattern.

E. On-Die Compare

STF and SSN provide comparable functionality for identical core testing using on-die compare. Both
systems require the input data stream to include the input data, mask data and expected response,
causing a 3X growth of the data volume, but allow testing of any number of cores in constant time.
SSN has a possible advantage in the handling of an asymmetric number of input and output channels.
In this case, SSN can more tightly pack the expect and mask fields to match the smaller output
channel case, possibly realizing less than 3X data growth. STF, however, allocates bandwidth
assuming symmetric usage and is always 3X data volume. For the purpose of this analysis, we
assumed that on-die compare would be neutral between the two systems.

F. Total Estimated Overhead Comparison

In summary, STF pays a high overhead in packet encoding, data field utilization and handling of
chain length mismatch. Network setup overhead is higher in SSN, but amortized across the number of
scan vectors resulting in a negligible difference. Overall, this can lead to over 2X reduction in
data volume under SSN vs. STF, as summarized in Table I.

Table I. Theoretical Comparison of STF vs. SSN Data Volume

                  Packet      Data Field     Vector/Chain       Fabric
                  Encoding    Utilization    Length Mismatch    Setup
                  Overhead    Overhead       Overhead           Overhead
SSN (baseline)    1.0         1.0            1.0                1
STF               1.31        1.0 - 1.47     1.0 - 1.2          0.99

STF data volume vs. SSN (theoretical): 1.30 - 2.29

G. SSN Pilot Study

SSN offers a compelling theoretical advantage over the current STF fabric in use. However, we
wanted to measure results on actual partition data to verify. Further, the study looked at other
aspects, such as design effort and run times. To perform the study, a simple test design was
created consisting of a single interface partition, partition1, and four identical copies of a
partition, partitions 2-5, as shown in Fig. 11.

Fig. 11: Pilot network topology

An SSN bus data width of 32 bits was chosen to match STF to allow direct comparison. ATPG patterns
were created targeting partitions 2-5, each having 9 EDT channels for a total of 36 bits of
channel data. By having a total channel data set size of >32 bits, SSN will perform data rotation
and create a more meaningful comparison. The 9-bit EDT channel size represents a typical data
field packing inefficiency for STF. Multiple ATPG runs were conducted to analyze the overhead at
10, 500, and 10,000 vectors. The results from these runs are summarized in Table II, comparing
STF, SSN, and a legacy pin mux solution.

For this testcase, SSN shows a clear advantage over STF, with STF having 19% higher test time and
57% more data
volume than SSN. SSN test setup is higher overhead than STF, however when amortized across the
10,000 vectors in the run set, this impact is in the expected range of 1.2%. This testcase used
identical partitions and hence did not exercise vector count mismatch between partitions nor chain
length mismatch, which would further favor SSN. For comparison purposes, a legacy pin muxed
solution is included showing a large overhead relative to SSN. Since the pin muxed solution cannot
transport 36 bits of channel data in a single run, it must be split into 2 runs, nearly doubling
test time and data volume.

In addition to data volume and test time metrics, we also collected information on design
efficiency between the internal STF toolset and the Mentor Tessent™ tool flows for SSN. This
comparison is summarized in Table III.

As the table shows, SSN and the Tessent flows provide significant productivity improvement over
our previous flow built from multiple tools, enabling rapid integration into the design and fast
turnaround ATPG runs. The SSN flows do not require ATPG cut points and custom setups to generate
and retarget patterns, resulting in significant savings in pattern retargeting. Though not in the
scope of this analysis, further benefits are expected in gate level simulation debug productivity.

Table III. Design efficiency comparison between STF and SSN

Metric                                            STF Flow     Tessent SSN Flow
Tools Count                                       7            3
RTL Completion to ATPG Start                      ~10 Hours    ~1 Hour
ATPG Completion to Gate Level Simulation start    ~1 Day       ~2 Hours
ATPG pattern retargeting of a partition           ~4 Hours     ~12 Minutes

H. SSN Pilot Study Summary

Analysis of a small test network verified that the theoretical advantages of SSN over our previous
internal STF fabric are achievable and a significant improvement in both test time (16% reduction)
and test data volume (36% reduction). The data shows that the approach of static network
configuration during test setup is more efficient for large scan data sets than allocating
addressing and opcode information within each packet. In addition, further benefits were seen in
design efficiency for insertion, ATPG setup and pattern retargeting relative to our previous
flows.

IX. CONCLUSION

The SSN technology introduced in this paper solves many of the scan distribution challenges in
complex SoCs. It enables simultaneous testing of any number of cores with few chip-level pins, and
it has multiple features to reduce test time and test data volume. It can test any number of
identical core instances in near constant time, minimizes padding in the presence of cores with
mismatched pattern counts and/or scan chain lengths, and enables fast streaming of data to/from
and throughout the chip. It simplifies design planning and implementation, and is especially well
suited for tile-based designs. Intel evaluated SSN and compared it to STF as well as to
conventional pin-muxed access. SSN was found to reduce the test data volume by 36% and 43%,
respectively. It reduced test cycles by 16% and 43%, respectively. Steps in the design and
retargeting flow were between 10x – 20x faster with SSN compared to STF.

ACKNOWLEDGMENT

The authors wish to thank other contributors to the development of the SSN technology: Yahya
Zaidan, Pawel Galas, Szymon Walkowiak, Paul Reuter, and Tony Fryars. We would also like to thank
the contributors to the SSN pilot study: Sirish Chittoor, Yonsang Cho, Luis Briceño Guerrero,
Kavita Bansal, Kelsey Byers, and Ian Nuber. Finally, many thanks to all our other partners who
also provided invaluable feedback during the development, validation, and deployment of SSN.

REFERENCES

[1] Standard Testability Method for Embedded Core-based Integrated Circuits, IEEE Standard 1500,
    2005.
[2] J. Remmers et al., “Hierarchical DFT methodology - a case study,” IEEE International Test
    Conference, 2004.
[3] D. Trock et al., “Recursive Hierarchical DFT Methodology with Multi-level Clock Control and
    Scan Pattern Retargeting,” IEEE Design, Automation & Test in Europe Conference & Exhibition
    (DATE), 2016.
[4] J. Rajski et al., “Embedded Deterministic Test,” IEEE Trans. on CAD, vol. 23, May 2004,
    pp. 776-792.
[5] P. Wohl, J.A. Waicukauski, J.E. Colburn, M. Sonawane, “Achieving extreme scan compression for
    SoC Designs,” IEEE International Test Conference, 2014.
[6] C. Barnhart et al., “OPMISR: The foundation for compressed ATPG vectors,” IEEE International
    Test Conference, 2001.
[7] G. Giles et al., “Test Access Mechanism for Multiple Identical Cores,” IEEE International Test
    Conference, 2008.
[8] Y. Dong et al., “Maximizing Scan Pin and Bandwidth Utilization with a Scan Routing Fabric,”
    IEEE International Test Conference, 2017.
[9] J. Janicki et al., “EDT bandwidth management - Practical scenarios for large SoC designs,”
    IEEE International Test Conference, 2013.
[10] G. Colon-Bonet, “High Bandwidth DFT Fabric Requirements for Server and Microserver SoCs,”
     IEEE International Test Conference, 2015.
[11] G. Colon-Bonet, “High Bandwidth Packetized DFT Fabric for Server SoCs,” IEEE International
     System-on-Chip Conference, 2016.
[12] A. Sanghani et al., “Design and Implementation of A Time-Division Multiplexing Scan
     Architecture Using Serializer and Deserializer in GPU Chips,” IEEE VLSI Test Symposium, 2011.
[13] M. Sonawane et al., “Flexible Scan Interface Architecture for Complex SoCs,” IEEE VLSI Test
     Symposium, 2016.
[14] P. Wohl et al., “Achieving Extreme Scan Compression for SoC Designs,” IEEE International Test
     Conference, 2014.
[15] Standard for Access and Control of Instrumentation Embedded within a Semiconductor Device,
     IEEE Standard 1687, 2014.
[16] J. Durupt et al., “IJTAG supported 3D DFT using chiplet-footprints for testing multi-chips
     active interposer system,” IEEE European Test Symposium, 2016.
[17] M. Lin et al., “A 7nm 4GHz Arm®-core-based CoWoS® Chiplet Design for High Performance
     Computing,” Symposium on VLSI Circuits Digest of Technical Papers, 2019.
[18] T. Waayers et al., “Clock control architecture and ATPG for reducing pattern count in SoC
     designs with multiple clock domains,” IEEE International Test Conference, 2010.
[19] Standard for High-Speed Test Access Port and On-Chip Distribution Architecture, IEEE Standard
     1149.10, 2017.
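
As a cross-check on the figures quoted in Section VIII, the short Python sketch below reproduces the arithmetic behind them. It is a minimal illustration, not part of the study: the function names are hypothetical, the 2 valid bits are inferred from "input and output valid bits" and the stated 10-bit total, and the per-factor ranges are taken from Table I as given rather than re-derived.

```python
def stf_packet_overhead(data_bits=32, address_bits=4, opcode_bits=4, valid_bits=2):
    """Size of an STF packet relative to a pure-data bus word of data_bits."""
    return (data_bits + address_bits + opcode_bits + valid_bits) / data_bits


def stf_unused_fraction(edt_channels, word_bits=32):
    """Fraction of the fixed data word left unused when whole shift cycles are packed."""
    cycles = word_bits // edt_channels
    return (word_bits - cycles * edt_channels) / word_bits


def reduction_from_increase(increase):
    """If A is `increase` (e.g. 0.57) higher than B, B is this fraction lower than A."""
    return 1 - 1 / (1 + increase)


if __name__ == "__main__":
    print(stf_packet_overhead())            # 42 / 32
    print(stf_unused_fraction(9))           # 5 unused bits out of 32
    print(stf_unused_fraction(17))          # 15 unused bits out of 32
    # Product of the Table I factors, lower and upper ends of each range.
    print(1.31 * 1.0 * 1.0 * 0.99, 1.31 * 1.47 * 1.2 * 0.99)
    # Pilot study: STF relative to SSN, converted to SSN's reduction relative to STF.
    print(reduction_from_increase(0.57), reduction_from_increase(0.19))
```

Rounded to two digits, these values agree with the 31% packet overhead, the roughly 16% and 47% unused data field bits, the 1.30 - 2.29 range in Table I, and the 36% and 16% reductions reported for the pilot study.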