A Comparative Study On Multisource Clock Network Synthesis
Abstract—Hybrid clock architectures offer a compromise between tree and mesh. While most related works focus on the tree-driven-mesh configuration, we are interested in the performance and optimization of the multisource CTS flow provided by a state-of-the-art commercial tool, which applies a coarse mesh with local sub-trees. In this study, we analyze the QoR of a conventional clock tree and multisource CTS on a real industrial design. We also propose several heuristic approaches to improve the performance of multisource CTS, especially for skew optimization. Based on the experimental results, we reveal the benefits and drawbacks of each method, give some guidelines for determining the proper configuration for a design, and summarize some future research directions.

I. INTRODUCTION

Clock network design is one of the most important research topics in VLSI design. There are two common clock distribution architectures implemented to meet the timing requirements of the system. One is the conventional clock tree, which is commonly used due to its low power consumption, low routing resource usage, and simplicity of implementation and simulation [1]. However, a tree-based architecture can be highly sensitive to process, voltage, and temperature (PVT) variations, especially in high-performance chip designs. The clock mesh, the other architecture, provides better tolerance to variations [2]. Nevertheless, with many mesh nodes and unbalanced loads, a clock mesh is difficult to analyze and automate [3]. In addition, the extra metal wires and drivers consume considerably more routing resources and power than a conventional clock tree.

Several works on non-tree network optimization have been presented. The authors of [4] used cross links instead of a mesh to improve robustness with a small increase in power consumption. Mesh edge reduction algorithms that reduce power at the cost of a small skew overhead were presented in [5] and [6]. Moreover, [7] proposed a technique to customize the mesh for non-uniform sink distributions, and [8] developed a buffer reduction method for mesh-based clock distribution.

A hybrid architecture that combines tree and mesh is another method for trading off power and skew. The work in [1] built a blockage-aware tree driving a mesh, considering the loadings to minimize local skew. Others produced a combination of non-uniform meshes and un-buffered trees to reduce skew variations while minimizing power and metal area overhead [3]. In addition, the authors of [9] proposed an algorithm to choose the positions of tapping points on a tree-driven-grid clock network that can handle non-uniform loads. Synopsys IC Compiler (ICC) provides a multisource clock tree synthesis (CTS) methodology, which applies a coarse mesh with local sub-trees to fill the gap between the conventional clock tree and the clock mesh. The introduction and implementation of this methodology are presented in [10], [11], [12] and [13].

While most related works focus on the tree-driven-mesh architecture, we are interested in the performance and optimization of the multisource CTS flow provided by a commercial EDA tool. In this paper, we analyze the quality of results (QoR) of a conventional clock tree and a multisource clock network on a real industrial design. We also propose several heuristic approaches to improve the performance of multisource CTS, especially for skew optimization. From the results, we give some guidelines for determining the proper configuration for a design.

The multisource CTS implementation flow and the construction of different mesh configurations are shown in Section II. Section III presents the details of the proposed approaches for tap-point and sink assignment. Section IV reports the analysis of our experimental results on a real industrial case. Finally, we give the conclusions in Section V.

II. MULTISOURCE MESH CONFIGURATION

A. Overall Flow
A multisource clock network is a hybrid of conventional CTS and clock mesh. The structure, shown in Fig. 1, consists of a mesh driven by a pre-mesh tree. Multisource drivers connect to the mesh at a limited number of locations referred to as taps. The multisource clock tree structure driven by the mesh consists of subtrees, each driven by a tap.

With this mixed structure, multisource CTS has lower skew, better QoR and better on-chip variation (OCV) tolerance than a conventional clock tree because of the increased common path. Furthermore, by using a coarse mesh rather than a dense mesh, multisource CTS consumes less power and fewer routing resources and is easier to implement than a traditional clock mesh.

Fig. 1. Structure of multisource clock distribution network. The red dashed line divides the structure into global and local clock network. Our modified parts are marked.

In our work, we first generate a mesh of appropriate size for the design; however, there are no guidelines on how dense the mesh should be. We experiment with the mesh configuration to find a good one, and details are presented in Sections II.B and II.C. Next, we insert the multisource drivers at the intersections of the horizontal and vertical mesh spines and assign each sink to the corresponding driver. The influence of this stage on the final results is significant, so we propose some approaches to minimize clock skew in Section III. Then we implement the rest of the flow following the steps recommended by [13], including building the pre-mesh tree and the subtrees driven by the multisource drivers, routing the clock nets, and analyzing the results.
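
As a rough illustration of the driver insertion and sink assignment step of this flow, the following Python sketch places candidate taps at the intersections of a uniform mesh and assigns each sink to its nearest tap by Manhattan distance. The evenly spaced straps, the nearest-tap rule, and all function names are assumptions made for illustration only; the commercial tool also takes blockages into account, which this sketch ignores.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def strap_positions(lo: float, hi: float, count: int) -> List[float]:
    """Evenly spaced strap coordinates strictly inside the interval [lo, hi]."""
    step = (hi - lo) / (count + 1)
    return [lo + step * (i + 1) for i in range(count)]


def tap_locations(bbox: Tuple[float, float, float, float],
                  n_horizontal: int, n_vertical: int) -> List[Point]:
    """Candidate taps at every horizontal/vertical strap intersection.

    bbox is (x_lo, y_lo, x_hi, y_hi). A horizontal strap has a fixed y,
    a vertical strap has a fixed x.
    """
    x_lo, y_lo, x_hi, y_hi = bbox
    ys = strap_positions(y_lo, y_hi, n_horizontal)
    xs = strap_positions(x_lo, x_hi, n_vertical)
    return [(x, y) for y in ys for x in xs]


def assign_sinks(sinks: List[Point], taps: List[Point]) -> List[int]:
    """Index of the nearest tap (Manhattan distance) for every sink.

    Assumption: proximity is the only criterion; blockages are ignored.
    """
    def manhattan(a: Point, b: Point) -> float:
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    return [min(range(len(taps)), key=lambda t: manhattan(s, taps[t]))
            for s in sinks]
```

For example, `tap_locations((0, 0, 90, 100), 1, 2)` yields two taps on a single horizontal strap, and `assign_sinks` then splits the sinks between them purely by distance, which is the kind of location-only grouping that Section III tries to improve on.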
B. Mesh Density Adjustment
The mesh configuration in multisource CTS dominates the latency shared by each sink. The lower the mesh density, the more the network behaves like a conventional clock tree. Conversely, as the mesh density increases, it becomes more like a pure mesh and benefits from better OCV tolerance, at the expense of higher power consumption and routing resource usage [10]. Thus, determining the mesh density, that is, the number of horizontal and vertical straps, is very important.

We first use an in-house utility from a design service company [14] to obtain a recommended mesh size. This utility estimates the mesh density from the aspect ratio of the bounding box of all sink pins and the empirical average number of pins driven by a tap. For the case we use, the utility recommends a 7*9 configuration (7 horizontal and 9 vertical straps). We then try other mesh sizes based on the recommended one, and the QoR results are shown in Section IV.

C. Non-uniform Mesh
Because the intersections of the horizontal and vertical mesh segments are potential multisource driver locations, mesh generation significantly affects the clock network built on top of it. The mesh configuration described in the previous section is a uniform structure; however, a non-uniform sink distribution and macros in the design might diminish the performance of the mesh, so we use two non-uniform clock mesh approaches to address the problem.

The first is the macro-avoiding approach. Mesh segments that overlap macros are not useful and waste power. We use a function provided by ICC to avoid creating straps and vias that intersect those macros so that the power can be reduced. The other method creates the mesh considering the sink distribution. Our flow is as follows: we first collect all sink locations, and then calculate the density of each column and row under the given uniform mesh configuration. Next, we compare each row with its adjacent rows and each column with its adjacent columns before moving straps. The final position of a strap depends on the difference in density between it and its neighbors. In addition, the straps can only be moved within a limited range in order to maintain the balanced characteristic of the mesh structure.

III. TAP-POINT DETERMINATION AND SINK ASSIGNMENT

In multisource CTS implementation, the number and positions of the tap points to which lower-level sinks are connected, as well as the arrangement of sink grouping, can affect the QoR of the final clock network. Variances between the trees that sit beyond a collection of multisource drivers lead to disparities in local clock skew and insertion delay, which might adversely impact buffer area minimization and inter-clock delay-balancing efforts [12].

In our flow, we place the multisource drivers on the vias where the horizontal mesh straps intersect the vertical ones, for convenience and uniform distribution. However, the loading and topology of the tree are also factors to be considered when deciding on tap points. Furthermore, ICC assigns sinks to taps based on location and blockages rather than the timing relation between them, which is another important element for skew minimization.

We propose several heuristic approaches and analytic methods for tap determination and sink assignment in this section. Because we can control only a few parts of the flow, our work modifies only some user-defined parameters according to the limited information we get from ICC, and we observe whether the modifications are meaningful.

A. Arrival-time-based Approach
Clock skew is defined as the difference in arrival time between any two registers. In order to minimize the skew, we observe the arrival time distribution of all sinks. Taking Fig. 2 as an example, the closer the color is to yellow, the larger the sink arrival time is; the closer the color is to blue, the smaller the sink arrival time is. It is obvious that the arrival times of sinks are distributed according to the tap locations, since we can easily divide the sinks with different colors by red dashed lines. Thus, we decide to decrease the delay in the regions where taps drive sinks with higher arrival times, so that clock skew might be minimized.

Fig. 2. Arrival time distribution graph of mesh configuration 4*6. We divide the sinks with different colors by red dashed lines.

As shown in Fig. 2, red dots with white text on them denote the tap drivers. We can see that the sinks around Tap 5 and Tap 6 are mostly orange and yellow, while others are pink, purple and even blue. According to this distribution, we insert another tap between Tap 5 and Tap 6 to share the sinks driven by them, hoping that this lowers the latency of those sinks. The result after tap insertion shows that colors change not only around the inserted tap but also in other regions of the design. The sinks in the targeted range do have lower arrival times than in the original case, but we cannot be sure whether the overall skew would be better after tap insertion.

B. Loading-based Approach
To work on tap location determination, we collect the information of each tap after multisource CTS. From the results, we know how many sinks and how many sub-tree levels are under each driver, the number of buffers used by each tap, and where the tap is located in the design.

We find that the number of levels, that is, the depth of the subtree under a tap, does not change much as the number of sinks increases (as shown in Fig. 3(a)). Thus, we infer that subtrees with very different numbers of sinks have similar numbers of levels because of the objective of balancing the latency. In contrast to the level, the number of buffers is more closely related to the number of driven sinks (as shown in Fig. 3(b)). We decide to have the same number of sinks driven by each tap so that the number of buffers used could be similar as well. That is, this approach is based on balancing the number of sinks and buffers, which means the loading of each tap is kept the same.

(a) Sink vs. level (b) Sink vs. buffer
Fig. 3. Relation between number of sinks, buffers and levels. Number of level does not change a lot as number of sinks increases. In contrast, number of buffers is more related to the driven sinks.

We focus on the taps driving larger numbers of sinks and find that there are two pairs of these taps next to each other. To share the sinks with them, a new tap is inserted in the middle of each pair. From the results, we confirm that the loadings of taps become more balanced than in the original, since the largest differences in the number of driven sinks and in the number of buffers are both reduced by nearly 25%.
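
The loading-based heuristic (find heavily loaded taps that sit next to each other and insert a new tap midway between them) can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the `top_k` cutoff for what counts as a heavily loaded tap, the `pitch` neighbour test, and the function names are all hypothetical, and the per-tap sink counts would in practice come from the tool's reports after multisource CTS.

```python
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]


def load_spread(sink_counts: List[int]) -> int:
    """Largest difference in the number of driven sinks between any two taps."""
    return max(sink_counts) - min(sink_counts)


def midpoint_taps(tap_xy: List[Point], sink_counts: List[int],
                  pitch: float, top_k: int = 4) -> List[Point]:
    """Propose new taps midway between neighbouring heavily loaded taps.

    Assumptions: the top_k taps by sink count are "heavily loaded", and two
    taps are neighbours when their Manhattan distance is at most `pitch`.
    """
    heavy = sorted(range(len(tap_xy)),
                   key=lambda i: sink_counts[i], reverse=True)[:top_k]
    proposals = []
    for a, b in combinations(sorted(heavy), 2):
        (xa, ya), (xb, yb) = tap_xy[a], tap_xy[b]
        if abs(xa - xb) + abs(ya - yb) <= pitch:
            proposals.append(((xa + xb) / 2, (ya + yb) / 2))
    return proposals
```

Comparing `load_spread` before and after re-running sink assignment with the proposed taps gives the kind of balance measure discussed in this section; whether the better balance also improves skew still has to be checked on the real design.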
C. Local-skew-based Approach
The skew between a sequentially related sink pair is called local clock skew. To minimize clock skew, we examine where the worst local skews appear in the design. We use ICC to report the worst 200 local clock skew pairs of sinks and check which taps the sinks are driven by. Then we make a list of those worst-skew tap pairs in Fig. 4.

The upper right of Fig. 4 shows the statistics of the worst local skew pairs. We can see that these pairs include both close tap pairs and distant ones. That is, sinks with a sequential relation may be placed far from each other. However, at this stage we cannot modify the placement, so if the two sinks of a worst local skew pair are far apart, the violation is hard to fix during this step. Thus, we choose the skew pairs with distance less than or equal to a given threshold, as listed in the lower right of Fig. 4.

1) Tap-point Determination: After filtering the worst local skew pairs, it is clear which adjacent pairs of taps have more skew violations. Since a skew violation occurs because the sinks are assigned to different taps, we hope to fix it by having these sinks driven by the same tap. Inserting a tap in the middle or merging two taps are candidate solutions. However, both have a potential drawback: they may not improve the result and may even make it worse.

Fig. 5 is an example of tap insertion. Rectangles with the same color represent sinks with a timing relation between them. Local skew violations of blue sinks and black sinks occur in the original tap arrangement on the left of Fig. 5. Therefore, we insert Tap 3 between Tap 1 and Tap 2, as shown on the right of Fig. 5. As a result, we successfully assign all blue sinks to the same tap, but the green ones, which were grouped to the same tap at first, become separated after tap insertion.

Fig. 5. An example of tap insertion based on local skew violation. After adding Tap 3, some timing-related sinks are assigned to the same tap, while others are not.

To sum up, it is hard to predict the result of tap rearrangement based on local skew violations, because other sinks driven by the targeted taps can be affected by the automated sink assignment.

2) Sink Reassignment: In the multisource CTS flow, ICC automatically splits sinks among multisource drivers based on their placement and the blockages after tap point determination. As a result, sinks with a timing relation might be assigned to different taps. Furthermore, ICC only provides a command to force a sink to be assigned to a specific tap driver. Although we find a way to check whether a pair of sinks has a timing relation by checking their local skew report, there is still no way to control sink grouping and assignment.

We decide to fix these violations by reassigning the sinks to the same tap without modifying the location and number of taps. We still focus on the sink pairs with distance under a given threshold so that the reassignment causes little penalty in latency. As illustrated in Fig. 6, a sink can have a timing relation with more than one sink. We reassign a sink when most of its timing-related sinks are assigned to another tap. For example, B3 is reassigned rather than B1 or B2, and in the same way A1 and A2 are chosen to be reassigned. Furthermore, if a sink has only one timing-related sink driven by another tap, we ignore it because there are too many sinks of this kind and the reassignment might be insufficient. For instance, we do not change the assignment of either black sink in Fig. 6.

IV. EXPERIMENT RESULTS

A. Experiment Setup
Our experiments are performed on a real industrial case in a 28nm process; other characteristics are listed in Table I. All of our flows use the same placement results produced by ICC as the input. We implement our work on two clock domains with different numbers of sinks in this design, and the information of the clocks is displayed in Table II. We analyze the results after the CTS stage and before routing, and we collect the data under the scenario operating at typical conditions and considering OCV effects.

TABLE I
TEST DESIGN DATA
Case Name                 ASIC1
Process                   28nm
Number of Macro Cell      101
Number of Std Cell        782460
Total Macro Cell Area     907977.76 μm²
Total Std Cell Area       756555.47 μm²
Clock Routing Layers      M3-M6
Clock Mesh Layers         M7, M8

TABLE II
DATA OF TWO CLOCKS IN ASIC1
Clock Name                CLK1      CLK2
Number of Sinks           79380     3007
Period                    1.49 ns   2.50 ns
Maximum Transition Time   100 ps    100 ps

B. Different Mesh Sizes
In Table III and Table IV, we present the QoR results of the multisource CTS flow under different mesh configurations. We apply
different mesh settings to each case based on the rule we mentioned in Section II.B. The CTS row gives the result of the original CTS flow, while the others are results of the multisource CTS flow with different mesh densities. The ClkCells item is the total number of cells in the clock network. The blue and red data denote the best and the worst values in each column, respectively.

TABLE III
QOR OF DIFFERENT MESH CONFIGURATIONS UNDER CLK1

TABLE IV
QOR OF DIFFERENT MESH CONFIGURATIONS UNDER CLK2

C. Non-uniform Mesh
Table V presents the results of different non-uniform mesh configurations, including the macro-avoiding approach and the mesh considering sink distribution. Red data are worse than the original, blue data are better, and black data are similar to the original. The results show that the macro-avoiding approach consumes 3-4% less clock power on average than the uniform mesh configuration, with a small trade-off in skew or timing performance.

We use a C++ program that takes the sink locations and the number of straps as input to generate the non-uniform mesh configuration. In Table V, we can see that clock skew is noticeably reduced when creating the mesh considering sink locations. However, more clock power is consumed; the reason may be the increase in the number of buffers used to balance the non-uniform mesh.

TABLE V
QOR OF DIFFERENT MESH AND NON-UNIFORM MESH CONFIGURATIONS
CLK1 AND CLK2
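
The sink-distribution-aware strap placement (collect sink locations, compute the density on either side of each uniform strap, then move the strap within a limited range) can be sketched along one axis as follows. This is an illustrative Python sketch, not the C++ program used in the experiments: the rule that shifts a strap toward its denser neighbouring band in proportion to the density difference, the `max_shift_ratio` limit, and the function name are all assumptions.

```python
from typing import List


def adjust_straps(coords: List[float], lo: float, hi: float,
                  n_straps: int, max_shift_ratio: float = 0.25) -> List[float]:
    """Shift uniformly spaced straps toward the denser adjacent band.

    coords are sink coordinates along one axis (x for vertical straps,
    y for horizontal straps). Each strap moves by at most
    max_shift_ratio * pitch, so the straps keep their original order.
    """
    pitch = (hi - lo) / (n_straps + 1)
    uniform = [lo + pitch * (i + 1) for i in range(n_straps)]

    # Count sinks in each band between consecutive straps (and the borders).
    counts = [0] * (n_straps + 1)
    for c in coords:
        band = max(0, min(int((c - lo) / pitch), n_straps))
        counts[band] += 1

    adjusted = []
    for i, pos in enumerate(uniform):
        before, after = counts[i], counts[i + 1]
        total = before + after
        # bias lies in [-1, 1]; positive means the band after the strap is denser.
        bias = 0.0 if total == 0 else (after - before) / total
        adjusted.append(pos + bias * max_shift_ratio * pitch)
    return adjusted
```

Calling the function once with the sink x-coordinates and once with the y-coordinates gives the vertical and horizontal strap positions. Bounding the shift is what keeps the mesh close to its balanced uniform structure.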
Tap 12, most sinks in that region are assigned to Tap 11 rather than split among different taps. Thus, the more unbalanced loading of each tap leads to worse skew, but timing violations are reduced because many nearby sinks are grouped to the same tap. The results of the other cases can be explained in the same way.

TABLE VII
QOR OF LOADING-BASED TAP ASSIGNMENT FOR CLK1 AND CLK2

Fig. 8. Taps and sink arrival time distribution of case 3*4 tap+1

Table VIII shows the results of tap assignment based on the worst local skew pairs. We apply tap insertion and tap merging in the case 6*8. According to the QoR results, it seems that only the clock skew in case 4*6 tap+1 is improved, and nothing gets better in the other cases. The results are expected due to the potential risk mentioned in Section III.C.1. From the reports of the worst skew pairs in the original and modified configurations, we can see that the violations in the original case almost disappear in the modified ones, with a few new violations. To sum up, though overall skew and QoR performance may not be improved by this approach, it can help reduce the skew in a local area.

TABLE VIII
QOR OF LOCAL-SKEW-BASED TAP ASSIGNMENT FOR CLK1 AND CLK2

E. Sink Reassignment
Table IX shows the result of the approach described in Section III.C, which is also based on the worst local skew pairs. We reassign only a few sinks to the taps to which many of their timing-related sinks are assigned. After the reassignment, the tap pairs shown in the original worst-skew report are gone, and it seems that the overall skew is reduced without a significant trade-off in clock power consumption. These results show that clock skew and QoR may improve when sinks are grouped to taps considering both location and timing relation.

TABLE IX
QOR OF LOCAL-SKEW-BASED SINK REASSIGNMENT FOR CLK1

V. CONCLUSIONS

In this work, we present a study analyzing the QoR of a conventional clock tree and a multisource clock network implemented with a state-of-the-art tool on a real industrial design. We focus on the steps which can be controlled in the flow and propose some heuristic approaches to improve the performance of multisource CTS, especially for skew optimization. From the results, we find the proper mesh configuration for a design and that multisource CTS performs better when the target clock covers a larger area. Though the results seem to be case-dependent, we can still conclude that skew and QoR may improve by performing tap and sink assignment considering both the location and the timing relation of sinks. How to automate this flow is an important future research direction.

REFERENCES

[1] L. Xiao, et al., “Local clock skew minimization using blockage-aware mixed tree-mesh clock network,” in International Conference on Computer-Aided Design, 2010, pp. 458–462.
[2] C. Yeh, et al., “Clock distribution architectures: a comparative study,” in International Symposium on Quality Electronic Design, 2006, pp. 85–91.
[3] A. Abdelhadi, et al., “Timing-driven variation-aware synthesis of hybrid mesh/tree clock distribution networks,” Integration, the VLSI Journal, vol. 46, pp. 382–391, Sept. 2013.
[4] R. Ewetz and C.-K. Koh, “Cost-effective robustness in clock networks using near-tree structures,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, pp. 515–528, 2015.
[5] A. Rajaram and D. Pan, “Meshworks: An efficient framework for planning, synthesis and optimization of clock mesh networks,” in Design Automation Conference, 2008, pp. 250–257.
[6] G. Venkataraman, et al., “Combinatorial algorithms for fast clock mesh optimization,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, pp. 131–141, Jan. 2010.
[7] M. R. Guthaus, et al., “Non-uniform clock mesh optimization with linear programming buffer insertion,” in Design Automation Conference, 2010, pp. 13–18.
[8] J. Reuben, et al., “Buffer reduction algorithm for mesh-based clock distribution,” in Advances in Electronics, Computers and Communications, 2014, pp. 1–4.
[9] N. Y. Zhou, et al., “Pacman: driving nonuniform clock grid loads for low-skew robust clock network,” in System Level Interconnect Prediction, 2014, pp. 1–5.
[10] N. Patel, “A novel clock distribution technology multisource clock tree system (MCTS),” Int. Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, vol. 2, pp. 2234–2239, June 2013.
[11] T. Yang and Y. Tang, “Implementation of multi-source clock tree synthesis,” in Synopsys Users Group, 2014.
[12] R. Helfand, “Application of the multisource CTS operative in IC Compiler,” in Synopsys Users Group, 2013.
[13] “Implementing multisource clock trees,” in IC Compiler Implementation User Guide, version I-2013.12-SP4, 2013, pp. 5.126–5.133.
[14] K. Chou, “IC Compiler multisource CTS - practical experience sharing,” in Synopsys Users Group, 2015.