Advanced DFT Clock Control Architectures with Agile Development for Chisel-Based High Performance RISC-V Processors
Ruining Feng
School of Software and Microelectronics
Peking University
Beijing, China
Abstract—The impact of clock distribution networks on the performance, power dissipation, and reliability of high-performance processors has been extensively studied. Designing a DFT clock control architecture that matches mainstream clock distribution networks while meeting the requirements of advanced DFT methodologies is crucial for improving the quality of DFT designs in high-performance processors. Building on the open-source high-performance RISC-V processor XiangShan, this paper presents cutting-edge DFT clock control architectures suitable for H-tree and clock mesh structures. It achieves true hierarchical testing and significantly optimizes DFT timing closure. Employing the agile development language Chisel, a versatile, parameterizable OCC that accommodates various clock networks was developed. In this paper, the use of Chisel as a DFT EDA tool for developing the DFT clock control architecture provides a more agile and flexible DFT design flow, offering new insights for agile DFT design based on Chisel.
Index Terms—Clock Distribution Network, DFT Clock Control Architecture, Chisel-based Agile DFT Development
I. INTRODUCTION
The clock distribution network (CDN) is a crucial component of modern Very-Large-Scale-Integration (VLSI) designs, particularly in high-performance processors, as it significantly impacts their performance, power dissipation, and reliability. Clock networks are mainly implemented using tree, mesh, and hybrid structures that combine the advantages of both tree and mesh structures. Designing a high-performance Design-for-Testability (DFT) clock control scheme that is compatible with the functional clock network structure without compromising its performance is a challenging task. The scheme must correspond to the different implementations of the functional clock network in high-performance processors. In addition, it must fulfill the requirements of various test modes and hierarchical DFT design methodology, attain the desired test coverage, and facilitate the timing closure of DFT modes in static timing analysis. The agile development language Chisel and its associated agile design methodologies have proven to hold significant advantages in improving the efficiency of processor design. Chisel-based agile design methodologies offer new perspectives and innovative solutions for implementing DFT clock control architectures in high-performance processors.
A. Clock Distribution Network in High-Performance Processors
Traditionally, clock networks have been designed as tree structures due to their simplicity and ease of optimization. However, with the advancement of process nodes and the growing complexity of chip design, the impact of on-chip variation (OCV) on clock skew is increasing. The conventional clock tree structure is not robust against OCV and may require more optimization effort to meet tight timing constraints, leading to a trend to move away from conventional clock trees for high-speed, clock skew-sensitive designs [1]. On the other hand, the clock mesh offers high tolerance to OCV and can achieve very low clock skew due to the redundant paths between the clock source and the sinks [2]. This is particularly beneficial in high-performance processors where tight skew budgets are critical for maintaining timing integrity and maximizing clock frequencies. Furthermore, the clock mesh provides a consistent clock distribution framework that can adapt to the increasing size and complexity of high-performance processors. Because of these advantages, the clock mesh has become the mainstream CDN in high-performance processors.
B. DFT Clock Control Architecture in High-Performance Processors
DFT clock structures have been developed to address the challenges of increasing design complexity, multiple clock domains, and the necessity for effective at-speed testing of high-performance designs. Initially, a simple multiplexer
switching between functional clock and external test clock was sufficient for DFT requirements. Now, with the use of ATPG-programmable On-chip Clock Controllers (OCC), we can design a variety of clock control structures based on different test modes and design hierarchies. This enables the generation of complex test clock sequences for better ATPG results.
When designing DFT implementations for high-performance processors, it is necessary to adapt the DFT clock control architecture to match the characteristics of the specific functional clock distribution network, such as H-tree or clock mesh. This involves developing the functionalities of the OCCs according to the CDN requirements, and fine-tuning the OCC insertion points and how the OCCs interconnect across design levels. The goal is to maintain the performance of the functional CDN implementation while enabling advanced DFT strategies, ultimately simplifying timing closure, increasing test coverage, and creating a more efficient DFT design flow.
C. Chisel-based DFT Clock Control Architecture Design
In [3], the XiangShan team innovatively integrated DFT implementations with Chisel-based agile chip design methodologies. They developed a flexible Chisel-based XS-shared bus MBIST interface to improve design PPA, and established a structured DFT development approach and a Chisel-based DFT agile design flow for high-performance RISC-V processors, which improves DFT design efficiency and achieves industry-competitive performance. Building upon this work, we are continuing to use Chisel in our DFT development process, designing Chisel-based DFT clock control networks for both H-tree and clock mesh structures, satisfying the requirements of SSN and true hierarchical DFT. By shifting the integration of the OCC left, into the functional Chisel code development stage rather than at the traditional Register-Transfer Level (RTL) or netlist level, we benefit from earlier-stage performance evaluation of the DFT clock control architectures and early detection of timing risks while maintaining the agility of the design process.
The rest of this paper is organized as follows: Section II provides a review of existing related work. The DFT clock control architectures in the XiangShan processor are introduced in Section III. Section IV presents the Chisel-based OCC design and the agile development of DFT clock structures. Section V concludes this paper and outlines the directions for future work.
II. RELATED WORK
Early research on CDNs for high-performance processors focused on the development of algorithms for clock tree synthesis that aimed to minimize clock skew [4], [5]. The work in [6] discussed the implementation of a global clock distribution strategy in microprocessors, focusing on buffered, tunable tree networks with a common grid. This work emphasized the importance of a regular grid for reducing local skew and the use of transmission lines for optimal clock delivery.
The work in [7] presented a semi-automatic clock mesh synthesis framework designed to address the complexities involved in creating an efficient and robust clock mesh network. This paper mentioned that, to minimize skew variation, the postmesh tree (the clock network driven by the clock mesh net) should adhere to specific design requirements: (1) The loads among the clock sinks following the clock mesh should be balanced as much as possible. (2) The circuit should have minimal depth below the clock mesh, which means minimizing the number of buffers or gates between the clock mesh and the final clock sinks. Therefore, the clock multiplexer cannot be included in the postmesh tree and the level of Integrated Clock Gating (ICG) cells must be kept to a minimum. When designing DFT clock networks for high-performance processors that implement a clock mesh, these requirements must also be met.
One of the early concepts for delay testing was introduced in [8], focusing on the generation of test patterns that could detect delay faults in logic circuits. The necessity for at-speed testing was the primary reason for the shift of test clock control from external testers to the chip under test itself. With the increase of operational frequencies and complexity of clock domains, supplying clock signals directly from the test equipment has become impractical. The DFT challenges and solutions for the ColdFire microprocessor core were discussed in [9], which was the first to utilize a PLL-sourced clock for at-speed launch-to-capture cycles, enhancing the testability of the microprocessor core.
The work in [10] is significant for the development of DFT clock control. It detailed the implementation of on-chip high-speed clock generation for delay testing in System-on-Chip (SoC) devices with high-frequency clock domains. It discussed the challenges of generating high-speed clock pulses on-chip and the need for ATPG tools to control the clock generation mechanism for all clock domains, which further refined the requirements for DFT clock control. The OCCs are often implemented to be ATPG-programmable such that each internal clock is controlled by the logic values loaded at a set of dedicated scan cells.
The work in [11] emphasized the benefits of hierarchical test methodologies for large-scale designs and highlighted the importance of a programmable register-based OCC at the block level for the most flexibility and efficiency in hierarchical testing. Block-level patterns with block-level OCCs can be easily retargeted and merged with other block patterns at the top-level design. This approach simplifies the testing process and enhances the overall efficiency of the hierarchical test methodology.
III. THE DFT CLOCK CONTROL ARCHITECTURES IN XIANGSHAN PROCESSOR
A. The DFT Clock Control Architecture for H-tree
The second-generation microarchitecture ‘Nanhu’ of the XiangShan processor employs a functional clock with an H-tree structure. Local trees are constructed at the TAP points to drive the clock inputs of each Flip-Flop. The structured DFT development approach of Nanhu was showcased in [3]. To accommodate the large scale of the processor core design and the need to scale up to multi-core versions, a hierarchical DFT methodology is employed. The entire design is divided into two distinct levels of blocks, namely BOSC XSTop (top-level) and BOSC XSTile (core-level). Scan wrappers are inserted into these two blocks to treat them as independent plug-and-play blocks. This enables the generation of specific plug-and-play patterns for each block. To guarantee that any set of block-level scan patterns, with any type of clocking, is entirely plug-and-play and can be combined with any other block patterns at the top level of the design, it is recommended to place the OCCs within the blocks. If inserting the OCCs within
the blocks is not possible, hierarchical testing and pattern retargeting to the top-level design would be significantly more complicated.
The preliminary DFT clock control architecture used OCC IPs generated by Siemens EDA tools. These OCC IPs enable DFT clock selection, clock chopping control, and clock gating in different test modes under the control of ATPG patterns. BOSC XSTop and BOSC XSTile also use Embedded Deterministic Test (EDT), the scan compression technology developed by Siemens EDA. Therefore, the clock of the EDT IP is also a crucial component of the DFT clock structure. For instance, Figure 1 illustrates a schematic diagram of the DFT clock structure in the BOSC XSTile block.
Fig. 1. DFT clock control architecture of BOSC XSTile block
In Figure 1, func clk represents the functional clock output from the on-chip PLL, operating at a frequency of 2 GHz. This clock drives all sequential logic within the BOSC XSTile block, using an H-tree structure to minimize clock skew. test clock is a lower-speed test clock, generally not exceeding 100 MHz and driven by external test equipment. Both shift capture clock and edt clock are derivatives of test clock controlled by ICG cells. The shift capture clock serves as the shift clock during the scan shift stage and as the capture clock during the stuck-at test scan capture stage. The ICG enable signal is edt update. The edt clock drives the timing logic inside the EDT IP, with scan en acting as the ICG enable signal. clock out is the output clock from the OCC, which is used as the clock root for the H-tree. Figures 2 and 3 display the waveform diagrams of each clock during stuck-at and at-speed testing, respectively.
Fig. 2. Stuck-at test clock waveform
In functional mode, the OCC is inactive, and the func clock passes directly through the OCC to the FFs. During scan testing, different static control signals are configured through the test setup, enabling the OCC to operate in the corresponding test mode. The internal logic of the OCC generates enable and selection signals for each clock during different stages of the test, based on the setup procedure and the ATPG pattern. During the scan shift phase, the OCC enables and selects shift capture clock as the output clock. In the stuck-at test capture phase, it enables and selects shift capture clock, and during the at-speed test capture phase, it enables and selects func clock as the output. The OCC CONTROL module contains shift registers that are part of the scan chain, allowing the ATPG pattern to control the data scanned into shift regs during the scan shift phase. Consequently, during the capture phase, the values of shift regs act as part of the gating enable signal for the capture clock, precisely controlling each pulse of the capture clock cycle.
As illustrated in Figure 1, the EDT’s scan in output port supplies scan input data to the internal scan chain, and the scan out input port receives scan output data from the internal scan chain. During the scan shift phase, the data output and input from the EDT are synchronized with either the rising or falling edges of the edt clock. All FFs in the internal scan chain and the lockup cells are synchronized with the rising and falling edges of the shift capture clock. To ensure the timing of the shift data interaction between the EDT and the internal scan chain, it is critical to keep the skew between shift capture clock and edt clock as small as possible. Figure 4 illustrates the timing path between a FF in the decompression logic of the EDT and the FF at the head of an internal scan chain.
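The sensitivity of the shift path between the EDT and the head of the scan chain to clock skew can be made explicit with the standard setup and hold checks of static timing analysis. The relations below are a textbook sketch under stated assumptions, not constraints taken from the XiangShan flow: the symbols $s$, $t_{cq}$, $t_{d,\min}$, $t_{d,\max}$, $t_{setup}$, $t_{hold}$ and $T_{shift}$ are introduced here for illustration, and the path is assumed to launch and capture on the same clock edge.

```latex
% Requires amsmath. Assumed symbols (not from the paper):
%   s          skew = arrival of shift_capture_clock at the capturing scan FF
%              minus arrival of edt_clock at the launching EDT FF
%   t_{cq}     clock-to-Q delay of the launching FF
%   t_{d,min}, t_{d,max}  shortest / longest data-path delay
%   T_{shift}  period of the shift (test) clock
\begin{align}
  t_{cq} + t_{d,\min} &\ge t_{hold} + s
    && \text{(hold check)} \\
  t_{cq} + t_{d,\max} + t_{setup} &\le T_{shift} + s
    && \text{(setup check)}
\end{align}
```

A positive $s$ consumes hold margin and a negative $s$ consumes setup margin, so both checks are easiest to close when $|s|$ is small. For a path that launches and captures on opposite edges, $T_{shift}$ would be replaced by the corresponding half period.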
Fig. 4. Shift timing path of EDT register and internal scan cell
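The clock selection and pulse gating that the OCC performs in each test phase, as described in this subsection, can be summarised in a small behavioural model. This is a minimal Python sketch, not the authors' Chisel implementation: the `Phase` names, the function names, and the assumption that one shift-register bit gates exactly one capture cycle are hypothetical and serve only as illustration. The clock names follow the signal names used for Figure 1, written with underscores.

```python
from enum import Enum
from typing import List, Sequence


class Phase(Enum):
    """Operating phases of the OCC (names are illustrative assumptions)."""
    FUNCTIONAL = "functional"
    SHIFT = "shift"
    STUCK_AT_CAPTURE = "stuck_at_capture"
    AT_SPEED_CAPTURE = "at_speed_capture"


def selected_clock(phase: Phase) -> str:
    """Return which source clock the OCC forwards to clock_out."""
    if phase in (Phase.SHIFT, Phase.STUCK_AT_CAPTURE):
        # Scan shift and stuck-at capture both run on the slow test clock.
        return "shift_capture_clock"
    # Functional mode and at-speed capture both use the PLL clock.
    return "func_clock"


def clock_out_enables(phase: Phase, shift_regs: Sequence[int],
                      cycles: int) -> List[bool]:
    """Per-cycle gating enable of clock_out over `cycles` cycles.

    Assumption: bit i of shift_regs, scanned in by the ATPG pattern,
    enables capture pulse i; cycles beyond the register length are gated.
    """
    if phase in (Phase.FUNCTIONAL, Phase.SHIFT):
        # The clock runs freely: OCC inactive, or every shift cycle pulses.
        return [True] * cycles
    return [i < len(shift_regs) and bool(shift_regs[i])
            for i in range(cycles)]


if __name__ == "__main__":
    pattern = [1, 1, 0, 0]  # hypothetical ATPG-loaded register contents
    print(selected_clock(Phase.AT_SPEED_CAPTURE),
          clock_out_enables(Phase.AT_SPEED_CAPTURE, pattern, 4))
```

With the register contents `[1, 1, 0, 0]`, the at-speed capture phase selects `func_clock` and enables only the first two pulses, which corresponds to a two-pulse launch and capture sequence.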
same functionalities, including clock chopping, clock gating and clock selection, and works in either standard mode or parent mode (as explained later). The child OCC, on the other hand, has no clock multiplexer, so it cannot perform clock selection, can only operate in child mode, and must be connected to an active upstream parent OCC. The child OCC structure can also output the clock en signal directly without the ICG, which separates the ICG from the child OCC and places the ICG close to the clock root, as shown in Figure 11.
signals that can be configured for various operating modes, meeting the requirements of all test modes mentioned previously.
• ATPG-programmable. The Chisel-based OCC includes shift registers, and it generates its description file, such as the Tessent Core Description file, based on the parameters of the OCC instance during instantiation. This allows the Siemens EDA tools to identify the Chisel-based OCC and supports ATPG pattern generation.
• Plug-and-play. The OCC module can be instantiated in the design as either a standard or a parent-child OCC by directly passing parameters, depending on the functional clock network structure and design hierarchy.
Chisel’s parameterized generators enable easy scaling and customization of OCC designs to meet various design requirements. For instance, if the clock network uses a clock mesh structure and the BOSC XSTile module has a divided-by-two clock, we modify the child OCC inserted for the divided clock. Figure 10 illustrates the DIV2 child OCC and our approach. In our DIV2 child OCC design, the functional division logic and ICG cell are reused, reducing the levels of ICG by one and optimizing the OCC circuit area. As can be seen in Figure 10, the ICG shared by the DIV2 child OCC and the functional divider logic is called the DIV2 & child OCC ICG. The div2 en signal is the ICG enable signal for the output of the functional clock divider logic. In function mode, the div2 en signal controls the generation of the DIV2 clock by passing directly through the child OCC. In scan test mode, the DIV2 child OCC reuses div2 en and enables the signal accordingly under the control of the ATPG pattern.
Fig. 10. DFT clock structure for divider in post-mesh tree
Figure 11 shows the clock waveforms during the at-speed test INTEST mode for BOSC XSTile.
Fig. 11. At-speed test clock waveform of the DIV2 child OCC
The Chisel-based OCC allows the agile design of the DFT clock control architectures to be done during the functional code development stage, thereby significantly enhancing the flexibility of the DFT design flow.
V. CONCLUSION AND FUTURE WORK
By analyzing the characteristics of H-tree and clock mesh structures, we have made targeted optimizations in the design of our DFT clock control architecture. This reduces the difficulty of DFT timing closure early in the design stage and enhances the quality of DFT design. Our Chisel-based OCC also makes the DFT clock control architecture more agile and flexible, advancing the development of agile DFT design. In future work, we aim to optimize the design of the Chisel-based OCC, perform thorough verification, and validate its functional correctness through post-silicon testing. Additionally, we plan to develop a DFT clock architecture that integrates Streaming Scan Network technology to further advance the DFT solution for the XiangShan high-performance processor.
ACKNOWLEDGMENT
I would like to sincerely thank my mentor Mr. Zhiheng He and my supervisor Mr. Tao, for giving me the chance to join the Nanhu team and the XiangShan Project. Mr. Tao is very open-minded and has outstanding academic and engineering skills. It’s a great honour to be his student. Mr. He is humble and kind, rigorous in study and responsible in work. From him, I have not only learned professional knowledge, but also a serious attitude towards studying, work, and life.
REFERENCES
[1] Harvey Toyama, "What’s The Difference Between CTS, Multisource CTS, And Clock Mesh?", ElectronicDesign, https://www.electronicdesign.com/news/products/article/21765665/whats-the-difference-between-cts-multisource-cts-and-clock-mesh, Mar. 2012.
[2] C. Yeh et al., "Clock distribution architectures: a comparative study," 7th International Symposium on Quality Electronic Design (ISQED’06), San Jose, CA, USA, 2006, pp. 7 pp.-91.
[3] B. Zhang, Y. Cai, Z. He, S. Liang and W. He, "Structured DFT Development Approach for Chisel-Based High Performance RISC-V Processors," 2023 IEEE International Test Conference in Asia (ITC-Asia), Matsue, Japan, 2023, pp. 1-6.
[4] Ting-Hai Chao, Yu-Chin Hsu, Jan-Ming Ho and A. B. Kahng, "Zero skew clock routing with minimum wirelength," in IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 39, no. 11, pp. 799-814, Nov. 1992.
[5] M. P. Desai, R. Cvijetic and J. Jensen, "Sizing of clock distribution networks for high performance CPU chips," 33rd Design Automation Conference Proceedings, 1996, Las Vegas, NV, USA, 1996, pp. 389-394.
[6] P. J. Restle et al., "A clock distribution network for microprocessors," 2000 Symposium on VLSI Circuits. Digest of Technical Papers (Cat. No.00CH37103), Honolulu, HI, USA, 2000, pp. 184-187.
[7] P. Chakrabarti, V. Bhatt, D. Hill and A. Cao, "Clock mesh framework," Thirteenth International Symposium on Quality Electronic Design (ISQED), Santa Clara, CA, USA, 2012, pp. 424-431.
[8] Lesser and Shedletsky, "An Experimental Delay Test Generator for LSI Logic," in IEEE Transactions on Computers, vol. C-29, no. 3, pp. 235-248, March 1980.
[9] T. L. McLaurin and F. Frederick, "The testability features of the MCF5407 containing the 4th generation ColdFire(R) microprocessor core," Proceedings International Test Conference 2000 (IEEE Cat. No.00CH37159), Atlantic City, NJ, USA, 2000, pp. 151-159.
[10] Logic Design for On-Chip Test Clock Generation: Implementation Details and Impact on Delay Test Quality.
[11] Ron Press, "Design clock controllers for hierarchical test", EDN, https://www.edn.com/design-clock-controllers-for-hierarchical-test/, Jul. 2014.