Design of High-Speed SerDes Transceiver for Chip-to-Chip Communications in CMOS Process

Xuqiang Zheng
Supervisor: Professor Shigang Yue

School of Computer Science
University of Lincoln

A thesis submitted in partial fulfilment of the requirements of the
University of Lincoln for the degree of Doctor of Philosophy

May 2018

Abstract

With the continuous increase in on-chip computation capacity and the exponential
growth of data-intensive applications, high-speed data transmission over serial
links has become the backbone of modern communication systems. To satisfy the
massive data-exchange requirement, the data rate of such serial links has risen
from several Gb/s to tens of Gb/s. Currently, commercial standards such as
Ethernet 400GbE, InfiniBand high data rate (HDR), and common electrical
interface (CEI)-56G are moving towards 40+ Gb/s. As the core component of
these links, the transceiver chipset plays a fundamental role in balancing
operation speed, power consumption, area occupation, and operation range. Meanwhile,
CMOS has become the dominant technology for modern transceiver chip
fabrication owing to its large-scale digital integration capability and aggressive
pricing advantage. This research aims to explore advanced techniques capable of
exploiting the maximum operation speed of the CMOS process, and hence to provide
potential solutions for 40+ Gb/s CMOS transceiver designs. The major contributions
are summarized as follows.
A low-jitter ring-oscillator-based injection-locked clock multiplier (RILCM) with a
hybrid frequency tracking loop, consisting of a traditional phase-locked loop (PLL),
a timing-adjusted loop, and a loop-selection state machine, is implemented in a 65-nm
CMOS process. In the ring voltage-controlled oscillator, a full-swing pseudo-differential
delay cell is proposed to lower the conversion of device noise to phase noise. To obtain
high operation speed and high detection accuracy, a compact timing-adjusted phase
detector tightly combined with a well-matched charge pump is designed. Meanwhile,
a lock-loss detection and lock recovery mechanism is devised to endow the RILCM with a
lock-acquisition ability similar to that of a conventional PLL, thus removing the need
for an initial frequency set-up aid and preventing the potential lock-loss risk. The
experimental results show that the figure-of-merit of the designed RILCM reaches
-247.3 dB, which is better than previous RILCMs and even comparable to large-area
LC-ILCMs.
The transmitter (TX) and receiver (RX) chips are separately designed and fabricated
in a 65-nm CMOS process. The transmitter chip employs a quarter-rate
multi-multiplexer (MUX)-based 4-tap feed-forward equalizer (FFE) to pre-distort the output.
To increase the maximum operating speed, a bandwidth-enhanced 4:1 MUX capable of
eliminating the charge-sharing effect is proposed. To produce the quarter-rate
parallel data streams with appropriate delays, a compact latch array associated with an
interleaved-retiming technique is designed. The receiver chip employs a two-stage
continuous-time linear equalizer (CTLE) as the analog front-end and integrates an
improved clock data recovery (CDR) circuit to extract the sampling clocks and retime the
incoming data. To automatically balance jitter tracking and jitter suppression, passive
low-pass filters with adaptively adjusted bandwidth are introduced into the data-sampling
path. To optimize the linearity of the phase interpolation, a time-averaging-based
compensating phase interpolator is proposed. For equalization, a combined TX-FFE and
RX-CTLE is applied to compensate for the channel loss, where a low-cost edge-data
correlation-based sign zero-forcing adaptation algorithm is proposed to automatically
adjust the TX-FFE's tap weights. Measurement results show that the fabricated
transmitter/receiver chipset can deliver 40 Gb/s random data at a bit error rate of < 10^-12
over a channel with > 16 dB loss at the half-baud frequency, while consuming a total
power of 370 mW.

Declaration

I, Xuqiang Zheng, declare that this thesis describes an original study carried out on
my own. It has not been previously submitted to any university for the award of any
degree. Where I have quoted from the work of others, the source is always given.

Acknowledgements

First and foremost, I would like to thank my academic advisor, Professor Shigang
Yue, for his tolerance and patience in letting me explore the fields that interest me. He
encouraged me to think deeply and creatively. He also taught me how to effectively
communicate my research in papers and presentations. I hope through the years I have
been able to pick up a little of his ability to find and explain ideas and concepts with
such clarity. He will always be a role model to me in my future academic career.
Professor Chun Zhang is my co-advisor, and I am grateful to him for his help and
support when I was on secondment to Tsinghua University. I especially value his
trust in giving me plenty of tapeout chances, regardless of consequences for him. I
have learned a lot from him on how to communicate with people and how to address
troublesome issues. I also want to thank my second co-advisor, Dr. Tryphon Lambrou,
for his nice advice and kind discussions.
I would like to take this chance to thank my family for their selfless love and
constant support. In particular, my parents-in-law gave me great support when I decided
to start my Ph.D. study and provided generous help throughout it. I am grateful to
my wife, who supported the whole household while I was studying abroad. I also want to
say sorry to my son for my absence during my study abroad.
The environment at the University of Lincoln is full of brilliant and enthusiastic
colleagues who have provided me with valuable help and discussions. I wish to thank the
previous and present members of the Lincoln Centre for Autonomous Systems (L-CAS)
research group, who have brought me great convenience in daily life and academic
research. In particular, I want to thank Farshad Arvin, Yi Gao, Junxiong Jia, Feng
Zhao, Tuo Xie, Mingzhu Long, Yan Yan, Guopeng Zhang, Cheng Hu, Qinbing Fu,
Jingmin Huang, Biao Zhao, Xuelong Sun, Jiannan Zhao, Huatian Wang, and Tian Liu
for their selfless help and creative discussions.
I wish to thank Dr. Fangxu Lv for the joint work on parts of the project and for always
being ready to carry out the necessary chip measurements. I would also like to thank
Prof. Fule Li for his constructive advice on circuit designs. I thank him most for being
patient with me at the very beginning and for using his vision to open the door of
integrated circuit design to me.
Finally, I appreciate the financial support from the School of Computer Science at
the University of Lincoln, the EU FP7 projects EYE2E (269118) and LIVCODE (295151),
and the EU Horizon 2020 project STEP2DYNA (691154).

List of Main Publications

[1] X. Zheng, C. Zhang, and F. Lv et al., “A 40-Gb/s quarter-rate SerDes transmitter
and receiver chipset in 65-nm CMOS,” IEEE J. Solid-State Circuits (JSSC),
vol. 52, no. 11, pp. 2963–2978, Nov. 2017.

[2] X. Zheng, Z. Wang, and F. Li et al., “A 14-bit 250 MS/s IF sampling pipelined
ADC in 180 nm CMOS process,” IEEE Trans. Circuits Syst. I, Reg. Papers
(TCAS-I), vol. 63, no. 9, pp. 1381–1392, Sep. 2016.

[3] X. Zheng, F. Lv, and F. Zhao et al., “A 10 GHz 56 fs_rms-integrated-jitter and
-247 dB FOM ring-VCO based injection-locked clock multiplier with a continuous
frequency-tracking loop in 65 nm CMOS,” in Proc. IEEE Custom Integrated
Circuits Conf. (CICC), Jul. 2017, pp. 1–4.

[4] X. Zheng, C. Zhang, and S. Yuan et al., “An improved 40 Gb/s CDR with
jitter-suppression filters and phase-compensating interpolators,” in Proc. IEEE Asian
Solid-State Circuits Conf. (ASSCC), Nov. 2016, pp. 85–88.

[5] X. Zheng, C. Zhang, and F. Lv et al., “A 5-50 Gb/s quarter rate transmitter with
a 4-tap multiple-MUX based FFE in 65 nm CMOS,” in Proc. IEEE European
Solid-State Circuits Conf. (ESSCIRC), Sep. 2016, pp. 305–308.

[6] W. Cao, X. Zheng, Z. Wang, and D. Li et al., “A 15 Gb/s wireline repeater in
65 nm CMOS technology,” in Proc. IEEE International Conference on Electron
Devices and Solid-State Circuits (EDSSC), Oct. 2015, pp. 590–593.

List of Figures

1.1 Diagram of the global data traffic trend [1]. By 2020, 50 billion devices
will be connected, generating more than two zettabytes of data traffic annually.
1.2 Wired network roadmap [2]. The data rates in SFP+, QSFP, and CFP
are moving towards 100 Gb/s, 400 Gb/s, and 1 Tb/s, respectively.

2.1 Cutoff frequency (fT) scaling comparison among different processes
in terms of the inverse of the lithographic feature size [3].
2.2 Typical SerDes application spaces. (a) Rack-to-rack link, (b) chassis-to-chassis
connection, and (c) intra-chassis interconnect [4].
2.3 Reach details of each application space defined in CEI-56G [4].
2.4 Jitter decomposition and jitter sources.
2.5 CDR specifications of (a) JTRAN, (b) JGEN, and (c) JTOL in SONET [5].
2.6 Typical serial link for wireline communications.
2.7 Clock synthesis implementations and phase noise performances for (a)
PLL, (b) DLL, (c) ILO, and (d) IL-VCO. Here, f is the frequency of
the noise, Sθ(f) stands for the phase noise spectrum, fBW refers to
the -3 dB bandwidth of the loop, fc denotes the corner frequency of the
VCO, and finj represents the injection-locking bandwidth of the ILO.
2.8 Clock distribution structures based on (a) inverter chain, (b) CML chain,
(c) transmission line, and (d) inductive load.
2.9 Typical transmitter driver modes. (a) CML mode and (b) SST mode.
2.10 Schemes of the final 4:1 multiplexing. (a) Half-rate topology based
on two-stage 2:1 MUXs, (b) quarter-rate structure based on direct 4:1
MUX, (c) critical path and timing diagram of the 2:1 MUX, (d) timing
margin of the 2:1 MUX, and (e) timing margin of the 4:1 MUX.
2.11 Techniques of 1-UI delay generation based on (a) full-rate FF, (b) half-rate
2:1 MUX, (c) quarter-rate 4:1 MUX, and (d) analog delay line.
2.12 CDR topologies without a reference. (a) Single control of VCO frequency
tuning and (b) coarse and fine control of VCO frequency tuning.
2.13 CDR topologies with a reference. (a) Dual VCO architecture, (b) sequential
locking topology, (c) PI-based structure, and (d) variant of
PI-based structure.
2.14 Two typical CDR PDs. (a) Hogge PD implementation, (b) Hogge PD
detection mechanism, (c) Hogge PD gain, (d) Alexander PD implementation,
(e) Alexander PD detection mechanism, and (f) Alexander PD gain.
2.15 Clocked comparators. (a) CML-type latch-based comparator, (b) StrongArm
latch-based comparator, (c) latch sensitivity function comparison
[6], (d) latch transfer function comparison [6], and (e) energy
consumption comparison [7].
2.16 PI structures and implementations. (a) Structure with direct multiple-input
phases [8, 9], (b) structure with coarse phase selection followed
by a phase mixer [10, 11], (c) inverter-based implementation [12, 13],
and (d) CML-based implementation [14, 15].
2.17 (a) Phase constellation for quadrature PI, (b) phase constellation for
octagonal PI, (c) interpolated phase steps for quadrature PI in one quadrant,
and (d) interpolated phase steps for octagonal PI in one octant.
2.18 The FFE. (a) Functional block diagram, where Tb is the bit period and
αn is the weight of the nth tap. (b) Typical frequency response, where
k is the summation of the absolute tap weights.
2.19 The CTLE. (a) Passive implementation, (b) frequency response of the
passive CTLE, (c) active implementation, and (d) frequency response
of the active CTLE. Here, ωz is the angular frequency of the zero and
ωp is the angular frequency of the pole.
2.20 The DFE. (a) Functional diagram, where Tb is the bit period and αn is
the tap weight of the nth tap. (b) Typical frequency response, where
the frequency is normalized to the data rate.
2.21 Equalization adaptations. (a) Algorithm-based adjustment, (b) eye-monitor-based
coefficient update, and (c) spectrum-matching-based calibration.

3.1 Previous frequency tracking techniques. (a) Traditional IL-PLL, (b)
IL-PLL with DLL-based injection position adjustment, (c) dual-loop
architecture with replica-VCO/VCDL, (d) TDC-based FTL, and (e)
TPD-based FTL.
3.2 The architecture of the proposed RILCM.
3.3 Linear model of the RILCM under the injection-locked condition,
where θref(s), θi(s), θo(s), θn,ref(s), and θn,vco(s) represent the reference
input phase, total input phase, output phase, reference input noise, and
VCO noise, respectively.
3.4 NTF characteristics of the RILCM. (a) NTF behaviors and (b) simplified
noise shaping characteristics. Here, fc is the corner frequency of
the oscillator, finj stands for the bandwidth of the injection locking,
ftune denotes the tunable bandwidth of the TAL, 1/f^2 represents the
white noise of the oscillator, and 1/f^3 is the flicker noise of the oscillator.
3.5 IL-RVCO. (a) Four-stage RVCO implementation, (b) pulse generator,
and (c) injection locking behavior.
3.6 (a) FTG-based FS-PDDC, (b) CCI-based FS-PDDC, (c) effect of the
FTGs, and (d) effect of the CCIs. Here, the arrows stand for the effort
directions that are offered by the FTGs or CCIs.
3.7 Effect of the injection pulse on the speed of edge transitions, where the
preceding portion of the injection pulse contributes positive feedback
while the following portion provides negative feedback.
3.8 Transient simulation results of the IL-RVCO. (a) Injection locking range,
(b) the relative phase difference with respect to the transient time, and
(c) the relative phase difference versus the frequency offset.
3.9 Circuit implementation of the combined TPD and CP.
3.10 Locking behaviour of the proposed TPD. (a) Waveforms when injection
occurs at the falling edge of CLK_P, and (b) waveforms when injection
occurs at the rising edge of CLK_P.
3.11 Implementation of the introduced LSSM. (a) Circuit details and (b)
behavior of the FLD.
3.12 Layout view of the whole RILCM chip, where the block placement of
the core circuits is illustrated in the left view.
3.13 Layout views of the crucial blocks. (a) VCO, (b) PG, (c) PFD/CP1, (d)
TPD/CP2, and (e) LSSM.
3.14 Simulation setup of the RVCO, where the left curve depicts the VCTRL
of the RVCO.
3.15 Simulation results of the RVCO. (a) Differential output clock, (b) swing
reduction, (c) frequency range, and (d) phase noise.
3.16 Simulated performance comparison of the RVCOs with FTG-based
and CCI-based FS-PDDCs in terms of (a) operation frequency, (b) frequency
range, (c) FOM_PN, and (d) swing reduction. Here, the horizontal
axes denote the size of the FTG/CCI as a percentage of the main inverter.
3.17 Comparison of the transient procedure when operating in conventional
PLL mode and RILCM mode with LLD-LR.
3.18 Transient behavior comparison. (a) With injection-lock indicator
INJ_LOCK and (b) without injection-lock indicator INJ_LOCK.
3.19 Die micrograph of the RILCM.
3.20 Power breakdown of the RILCM.
3.21 Measured phase noise with half-rate output at 5 GHz.
3.22 Measured reference spur with half-rate output at 5 GHz. (a) RILCM
without FTL and (b) RILCM with FTL.
3.23 Integrated rms-jitter versus supply voltage.
3.24 Integrated rms-jitter versus reference frequency.
3.25 Performance-area-speed graph.

4.1 (a) Critical path and (b) timing diagram for the 2:1 MUX. Here, tdiv is
the delay of the divider, tck-q is the ck-to-q delay of the 2:1 MUX, and
tsetup is the setup time of the sampling latch.
4.2 (a) Traditional CML-based MUX implementation and (b) power consumption
with different multiplexing ratios [16]. Here, N refers to the
number of multiplexing branches.
4.3 Block diagram of the transmitter chip.
4.4 Conceptual circuit schematic of the traditional 4:1 MUX.
4.5 Four possible unit cell implementations of the 4:1 MUX.
4.6 Topology of the 4:1 MUX. (a) Conceptual schematic and (b) timing
diagram.
4.7 Traditional unit cell implementations for high-speed 4:1 MUX. (a)
Data-up structure and (b) clock-up structure.
4.8 Improved unit cell implementation.
4.9 Effect of the introduced PM on (a) high-level glitches and (b) edge
transitions.
4.10 Circuit details of the clocking blocks. (a) Clock conditioner, (b) DIV2,
and (c) CML2CMOS.
4.11 Pseudo-NAND2. (a) Circuit details and (b) operation waveform.
4.12 Layout view of the whole transmitter chip.
4.13 Layout views of the crucial blocks. (a) 4:1 MUX, (b) interleaved-retiming
latch array, (c) pseudo-NAND2 with an inverter, (d) CML2CMOS
converter, (e) DIV2, and (f) clock conditioner.
4.14 Simulation setup of the transmitter chip.
4.15 (a) Transient waveform of the traditional unit cell, (b) transient waveform
of the enhanced unit cell, (c) eye-diagram of the traditional
unit cell, and (d) eye-diagram of the enhanced unit cell.
4.16 Swing variations of the improved unit cell under different PVT corners.
4.17 Simulated eye-diagrams of the transmitter at (a) 10 Gb/s with
over-equalization, (b) 40 Gb/s with proper equalization, (c) 50 Gb/s without
equalization, and (d) 50 Gb/s with proper equalization.
4.18 Chip micrograph of the transmitter.
4.19 Power breakdown of the transmitter when operating at 50 Gb/s.
4.20 Measured output eye-diagrams of the transmitter at (a) 5 Gb/s with
over-equalization, (b) 40 Gb/s without equalization, (c) 40 Gb/s with
proper equalization, and (d) 50 Gb/s with proper equalization.
4.21 Measured output eye-diagrams with four separate eyes. (a) Clock pattern
and (b) PRBS pattern.

5.1 Block diagram of the receiver chip.
5.2 Conventional BBPD-based CDR.
5.3 Block diagram of the modified CDR architecture.
5.4 Functional view of the introduced LPFs. (a) Principle of the BBPD,
(b) linearized CDR model, and (c) jitter transfer functions.
5.5 Proposed compensating PI. (a) Quarter-rate 45°-spaced clock generation,
(b) in-phase I, Q clock generation for the data sampling, and (c)
45° phase-shifted I, Q clock generation for the edge sampling.
5.6 Details of (a) quadrature PI and (b) TA.
5.7 Phase transfer characteristics based on trigonometric-function approximation.
5.8 Simulation results of the phase compensating PI. (a) Simulated phase
transfer characteristics, (b) DNL performance, and (c) INL performance.
5.9 Layout view of the whole receiver chip.
5.10 Layout views of the (a) Terminals+CTLE and (b) CDR.
5.11 Layout views of the crucial blocks within the CDR. (a) Samplers, (b)
compensating PI, and (c) digital loop filter.
5.12 Simulation setup of the CDR. A PRBS generator is used to produce
the 40 Gb/s input data with 5 ps peak-to-peak jitter, a clock generator
is utilized to produce the 20 GHz input clock with a 1 UI amplitude
sinusoidal jitter at 500 kHz, the output data refers to the input data at
the samplers, the output clock is the recovered data-sampling clock,
the output biasa represents the current mirror bias for the 0° phase before
the LPF, and biasb stands for the current mirror bias for the 0° phase
after the LPF.
5.13 Effect of the LPFs with a bandwidth of (a) 4 MHz, (b) 20 MHz, (c) 50
MHz, and (d) adaptive adjustment.
5.14 Properties of the adaptive-bandwidth jitter suppression.
5.15 Effect of different input patterns on jitter attenuation. (a) PRBS7, (b)
PRBS15, (c) PRBS23, and (d) PRBS31.
5.16 (a) Chip micrograph and (b) power breakdown of the receiver.
5.17 Measured eye-diagrams for (a) input data at 40 Gb/s, (b) recovered data
at 10 Gb/s, (c) recovered edge-sampling clock without LPFs at 5 GHz,
and (d) recovered data-sampling clock with LPFs at 5 GHz.
5.18 Measured JTRAN and JTOL with PRBS7 at 28 Gb/s.

6.1 Implemented equalization scheme with the proposed EDC-SZF algorithm.
Here, TX-FFE and RX-CTLE are employed to compensate for
the channel loss; the control voltage of the RX-CTLE (VCTLE) is manually
calibrated, while the tap weights (α−1, α1, α2) of the TX-FFE are
adaptively adjusted by the proposed EDC-SZF.
6.2 TX-FFE. (a) Schematic details, (b) simulated output eye-diagram at 10
Gb/s, and (c) simulated output eye-diagram at 40 Gb/s.
6.3 RX-CTLE. (a) Schematic details and (b) frequency responses for different
control voltages.
6.4 Pulse response of a typical dispersive channel.
6.5 Block diagram of the EDC-SZF adaptation algorithm.
6.6 Correlation detector. (a) Operation principle illustration and (b) function
table.
6.7 Layout views of the equalization blocks. (a) TX-FFE, (b) RX-CTLE,
and (c) EDC-SZF.
6.8 Transistor-level simulation of the EDC-SZF adaptation. (a) Channel
frequency response, (b) convergence process of the TX-FFE tap weights,
(c) eye-diagram with zero TX-FFE tap weights, and (d) eye-diagram
with adaptively adjusted TX-FFE tap weights.
6.9 Constructed chip-to-chip interconnect. (a) Testing PCB, (b) auxiliary
PCB, and (c) duplicated channel frequency response.
6.10 Adaptively adjusted bias voltages of the TX-FFE with different RX-CTLE
control voltages.
6.11 Measured far-end eye-diagrams for (a) bias condition A, (b) bias condition
B, (c) bias condition D, and (d) bias condition F depicted in Fig. 6.10.
6.12 Measured bathtub curves under different bias conditions depicted in
Fig. 6.10.

A1 Phase accumulation behavior of the ILO. (a) Output waveform of the
ILO in one injection period, (b) flow-chart diagram of the phase accumulation,
and (c) intuitive diagram of the phase accumulation.
A2 Model of the ILO. (a) Signal flow chart and (b) linear model.

List of Tables

3.1 PERFORMANCE SUMMARY OF THE RILCM

4.1 PERFORMANCE SUMMARY OF THE TRANSMITTER

5.1 PERFORMANCE SUMMARY OF THE RECEIVER

List of Acronyms and Abbreviations

ADC analog-to-digital converter


BBPD bang-bang phase detector
BER bit error rate
CAGR compound annual growth rate
CCI cross-coupled inverter
CDR clock data recovery
CEI common electrical interface
CFP centum form-factor pluggable
CML current-mode logic
CP charge pump
CPU central processing unit
CTLE continuous-time linear equalizer
DAC digital-to-analog converter
DFE decision feedback equalizer
DIV divider
DJ deterministic jitter
DLL delay-locked loop
DNL differential nonlinearity
DRC design rule check
DSP digital signal processing
EDC-SZF edge-data correlation based sign zero-forcing
EDR enhanced data rate
ESD electro-static discharge
FEC forward error correction
FFE feed-forward equalizer
FIR finite impulse response
FOM figure-of-merit
FS-PDDC full-swing pseudo-differential delay cell
FTG forward transmission gate
FTL frequency tracking loop
GbE gigabit ethernet
GBW gain-bandwidth product

HDR high data rate
HPF high-pass filter
IBTA InfiniBand trade association
IEEE institute of electrical and electronics engineers
ILCM injection-locked clock multiplier
ILO injection-locked oscillator
IL-RVCO injection-locked ring voltage-controlled oscillator
INL integral nonlinearity
ISI inter-symbol interference
JGEN jitter generation
JTOL jitter tolerance
JTRAN jitter transfer
LD lock detector
LLD-LR lock-loss detection and lock recovery
LMS least mean square
LPF low-pass filter
LR long reach
LSSM loop-selection state machine
LVS layout versus schematics
MAC media access control
MEO maximum eye opening
MR medium reach
MUX multiplexer
NRZ non-return-to-zero
NTF noise transfer function
OC optical carrier
OSC oscillator
PCB printed circuit board
PD phase detector
PEX parasitic extraction
PFD phase frequency detector
PG pulse generator
PI phase interpolator
PLL phase-locked loop
POD polarity detector
PSD phase shift detection
PTL phase tracking loop
QSFP quad small form-factor pluggable
RILCM ring-oscillator-based injection-locked clock multiplier
RJ random jitter

RVCO ring voltage-controlled oscillator
RX receiver
S/H sample-and-hold
SerDes serializer/deserializer
SFP+ small form-factor pluggable plus
SNR signal-to-noise ratio
SONET synchronous optical network
SS-LMS sign-sign least mean square
SST source-series terminated
SSTPD sub-sampling timing-adjusted phase detector
TAL timing-adjusted loop
TDC time-to-digital converter
TPD timing-adjusted phase detector
TX transmitter
UI unit interval
USR ultra short reach
VCDL voltage-controlled delay line
VCO voltage-controlled oscillator
VCTRL control voltage
VSR very short reach
XSR extra short reach
ZF zero-forcing
fBW -3 dB bandwidth
fT cutoff frequency of the transistor
fc corner frequency of the oscillator
finj injection-locking bandwidth of the injection-locked oscillator
1/f^2 white noise of the oscillator
1/f^3 flicker noise of the oscillator
Sθ(f) phase noise spectrum of the oscillator

Contents

Abstract

Declaration

Acknowledgements

List of Main Publications

List of Figures

List of Tables

List of Acronyms and Abbreviations

1 Introduction
1.1 Background
1.2 Challenges in Cutting-Edge Transceivers
1.3 Research Objectives
1.4 Research Contributions
1.5 Organization of the Thesis

2 Literature Review
2.1 General Design Considerations
2.1.1 Technology Choices
2.1.2 Spaces of Electrical Links
2.1.3 On-Chip Wire Modeling
2.2 SerDes Design Metrics
2.2.1 Data Rate and Power Efficiency
2.2.2 Bit Error Rate
2.2.3 Clock Data Recovery (CDR) Specifications
2.3 Basics of Electrical Serial Links
2.3.1 Clocking Techniques
2.3.2 Transmitter Techniques
2.3.3 Receiver Techniques
2.3.4 Channel Equalization

3 Design of the Ring-Based Injection-Locked Clock Multiplier (RILCM)
3.1 Challenges in RILCM and Previous Solutions
3.1.1 Challenges in RILCM
3.1.2 Prior Arts
3.2 Proposed RILCM Architecture
3.2.1 Overall Architecture
3.2.2 Architecture Modeling
3.3 Injection-Locked Ring Voltage-Controlled Oscillator (IL-RVCO)
3.3.1 Implementation of the IL-RVCO
3.3.2 Relationship Between the Relative Phase Difference and the Frequency Offset
3.4 The Proposed Phase Difference Detection
3.4.1 Principle of the Proposed Timing-Adjusted Phase Detector
3.4.2 Polarity Selection
3.5 Mechanism of the Lock-Loss Detection and Lock Recovery (LLD-LR)
3.5.1 Operation Process of the LLD-LR
3.5.2 Principles of the Lock Loss and False Lock Detection
3.6 Experimental Results
3.6.1 Tools and Fabrication Process
3.6.2 Layout and Simulation Results
3.6.3 Chip Micrograph and Measurement Results
3.6.4 Performance Comparison
3.7 Chapter Summary

4 The Transmitter Design
4.1 Design Challenges in High-Speed Transmitter
4.1.1 Timing Constraints
4.1.2 Bandwidth Limitations
4.2 Transmitter Architecture
4.2.1 Overall Architecture
4.2.2 Features of the Transmitter
4.3 Enhanced 4:1 Multiplexer (MUX)
4.3.1 Previous 4:1 MUXs
4.3.2 Topology Consideration
4.3.3 Enhancement on the Unit Cell of the 4:1 MUX
4.4 Clocking for the Transmitter
4.4.1 Topology of the Clock Bundle
4.4.2 Clocking Blocks
4.5 Experimental Results
4.5.1 Tools and Fabrication Process
4.5.2 Layout and Simulation Results
4.5.3 Chip Fabrication and Measurement Results
4.5.4 Performance Comparison
4.6 Chapter Summary

5 The Receiver Design
5.1 Design Considerations of the Receiver
5.1.1 Receiver Sensitivity
5.1.2 CDR Bandwidth
5.1.3 Challenges within High-Speed CDR
5.2 Receiver Architecture
5.2.1 Overall Architecture
5.2.2 Features of the Receiver
5.3 Improved Digital CDR
5.3.1 Dithering Behavior in Digital CDR
5.3.2 Architecture Improvement
5.3.3 Behavior of the Improved CDR
5.4 Compensating Phase Interpolator
5.4.1 Implementation Details
5.4.2 Linearity Analysis
5.5 Experimental Results
5.5.1 Tools and Fabrication Process
5.5.2 Layout and Simulation Results
5.5.3 Chip Fabrication and Measurement Results
5.5.4 Performance Comparison
5.6 Chapter Summary

6 Overall Serial Link and Adaptive Equalization
6.1 Serial Link and Channel Equalization
6.1.1 Link Connection and Equalization Scheme
6.1.2 Equalizer Implementation Details
6.2 Edge-Data Correlation-Based Sign Zero-Forcing (EDC-SZF)
6.2.1 Drawbacks of Previous Adaptation Algorithms
6.2.2 Iteration of the EDC-SZF
6.2.3 Correlation between Edge Information and Recovered Data
6.2.4 Derivation of the EDC-SZF
6.2.5 Implementation of the EDC-SZF
6.3 Experimental Results
6.3.1 Layout and Simulation Results
6.3.2 Measurement Results
6.4 Chapter Summary

7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work

Bibliography

Appendices
Appendix A Modeling of the Injection-Locked Oscillator (ILO)
A.1 Behavior Model of the ILO
A.2 Linear Model of the ILO
A.3 Tracking Bandwidth of the ILO
Appendix B Convergence Proof of the Proposed EDC-SZF Iteration

Chapter 1

Introduction

1.1 Background

The exponential growth of cloud computing, social networking, and multimedia
sharing has led to an explosive bandwidth demand on data communication. The Cisco
global IP traffic forecast estimates that global IP traffic will grow at a compound
annual growth rate (CAGR) of 22 percent from 2015 to 2020. By 2020, 50 billion
connected devices are expected to generate more than two zettabytes (one zettabyte
being 2^30 terabytes) of data traffic annually (see Fig. 1.1). Moreover, 64 percent of
all Internet traffic will be delivered across content delivery networks [1]. To accommodate
this aggregate bandwidth requirement, the Institute of Electrical and
Electronics Engineers (IEEE) P802.3bs study group has approved a 400 Gigabit Ethernet
(GbE) standard to quadruple the backbone bandwidth of the existing 100 GbE [17], and the
InfiniBand® Trade Association (IBTA) has announced its 600 Gb/s computer networking
communication standard, high data rate (HDR), in its roadmap [18]. To support
such high-speed data communications, multi-lane high-speed serial links are usually
employed to extend the throughput bandwidth. As an example, the next-generation
400 GbE will most likely be implemented by multiple serial links in the form of 16×25
Gb/s or 8×50 Gb/s, where the latter lane configuration is more in line with the trend
because of its low cost, high capability, simplified cabling, high power efficiency, and
fewer coherent optical devices.
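
As a quick sanity check on the figures quoted above, the following minimal Python
sketch works through the arithmetic: what a 22 percent CAGR compounds to over 2015
to 2020, and how the two candidate lane configurations add up to 400 Gb/s. The
function names `compound_growth` and `aggregate_rate_gbps` are illustrative choices,
not taken from any standard, and the aggregate ignores line-coding and FEC overhead.

```python
# Illustrative arithmetic only; the input figures are those quoted in the text.


def compound_growth(rate: float, years: int) -> float:
    """Overall growth factor after `years` of compounding at annual `rate`."""
    return (1.0 + rate) ** years


def aggregate_rate_gbps(lanes: int, lane_rate_gbps: float) -> float:
    """Raw aggregate throughput of a multi-lane serial link (overhead ignored)."""
    return lanes * lane_rate_gbps


# A 22 % CAGR from 2015 to 2020 spans five compounding periods.
traffic_growth = compound_growth(0.22, 2020 - 2015)

# The two 400 GbE lane configurations mentioned in the text.
lane_options = {"16x25": (16, 25.0), "8x50": (8, 50.0)}
aggregates = {
    name: aggregate_rate_gbps(lanes, rate)
    for name, (lanes, rate) in lane_options.items()
}

if __name__ == "__main__":
    print(f"Traffic growth 2015-2020: {traffic_growth:.2f}x")
    for name, total in aggregates.items():
        print(f"{name} Gb/s lanes -> {total:.0f} Gb/s aggregate")
```

Under these assumptions the traffic grows by roughly 2.7 times over the five years,
and both lane configurations reach the same 400 Gb/s aggregate, so the choice between
them rests on lane count and per-lane speed rather than on total throughput.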
Among a variety of components in these high-speed links, the physical Serializ-

1
Chapter 1. Introduction

Figure 1.1: Diagram of the global data traffic trend [1]. By 2020, 50 billion devices
will be connected generating more than two zetta bytes of data traffic annually.

er/Deserializer (SerDes) transceiver plays a critical role in making up the communi-


cation connections between the data link layer and the physical medium. Due to the
development of the optical communication and post massive data-processing ability,
the data-moving capability is mainly limited by the maximum speed of the SerDes
transceivers. Over the past few decades, the transceiver data rate has constantly been
increased from Mb/s to tens of Gb/s [19]. Fig. 1.2 shows the wired network roadmap
since 2000 [2], where the small form-factor pluggable plus (SFP+) line describes the
port speeds on servers, the quad small form-factor pluggable (QSFP) curve shows the
speeds of switches above the servers, and the centum form-factor pluggable (CFP) line
illustrates the data rates of routers. Similar to 10 Gb/s ports on servers that have driven
the speed of switches to 40 Gb/s and 100 Gb/s, the development of 25 Gb/s network-
ing has updated 100 Gb/s switching and 400 Gb/s routing. At present, the Ethernet
Alliance is evaluating potential standards for 50 Gb/s on the server and 200 Gb/s on
the switch. Looking forward to 2025, the data communication speed is expected to be
upgraded to 100 Gb/s on the server, 400 Gb/s on the switch, and 1 Tb/s on the router.
So far, 25-28 Gb/s serial links approved by InfiniBand enhanced data rate (EDR), 32G
fibre channel (32GFC), and common electrical interface (CEI)-28G have stepped into
the period of industrial deployment [20, 21, 22]. Meanwhile, 38-64 Gb/s transceivers,
which will play key roles in the next-generation data rates supported by Ethernet 400
GbE, InfiniBand HDR, and CEI-56G, have attracted increasing research attention in
both the industry and academia [17, 18, 23, 24, 25, 26]. This dissertation mainly focuses
on the advanced techniques of high-speed SerDes transceivers for chip-to-chip
communications operating at 40+ Gb/s in CMOS process.

Figure 1.2: Wired network roadmap [2]. The data rates in SFP+, QSFP, and CFP are
updating towards 100 Gb/s, 400 Gb/s, and 1 Tb/s, respectively.

1.2 Challenges in Cutting-Edge Transceivers

To accommodate the requirement of the continuously increasing data communications,
cutting-edge transceivers operating at 38-64 Gb/s have become standard modules
within the next-generation connections for data centers and backbone networks
[17, 18, 27]. The main challenges in designing such high-speed transceivers originate
from the ever-decreasing unit interval (UI) [23, 24, 25], which not only poses high bandwidth
requirements on the blocks located on the critical path, but also makes the link timing budget
extremely tight. The CMOS fabrication process, which is preferred due to its
large-scale integration and aggressive pricing advantage, has made the designs even
more challenging because of its limited cutoff frequency and poor noise performance
[7, 28, 29]. Although an advanced process can provide a higher operation speed, it
cannot completely solve these problems as the parasitic capacitances/resistances at
the high-speed outputs usually do not scale well with the technology because of the
bonding and/or electro-static discharge (ESD) protection requirements. Meanwhile,
economic feasibility is another factor that must be considered when constructing multi-lane
connections. It usually involves power consumption, area occupation, and heat
dissipation, where small area occupation and low power consumption could improve
the port density and lower the requirement of heat dissipation, hence reducing the
overall cost [28, 30, 31]. For implementations, the digital media access control (MAC)
layer and the analog physical layer (SerDes transceiver) are developing at different
stages. Specifically, the 200G MAC (4×50 Gb/s) has been implemented and validated
in the industry [31], while the physical layer is still in the period of moving from the
lab to the market [17, 18, 23, 24, 25, 26, 27, 32]. This is because the MAC mainly
processes the parallel data streams, where the timing requirement can be relaxed by
increasing the parallel bit width. In contrast, the SerDes transceiver has to provide
accurate timing information, sufficient bandwidth, and appropriate equalization for the
full-rate data communication.

1.3 Research Objectives

The next-generation SerDes transceivers that support 38-64 Gb/s have attracted
great attention from both the industry and the academia due to their broad market
potential and significant academic value. Although the technical feasibility has been
proved by several 40-56 Gb/s transceiver designs [33, 34, 35, 36], plenty of research
studies are still demanded to further optimize the power consumption, area occupation,
and operation robustness, thus paving the path for the upcoming industrial deployment.
This thesis mainly focuses on the enhancement techniques to explore the maximum
process limit and hence provides potential solutions for the cutting-edge transceiver
designs. The major research objectives are summarized as follows.

• Designing a robust ring-oscillator-based injection-locked clock multiplier (RILCM)
with optimized figure-of-merit. RILCM has been proven to be one of the
most promising solutions for high-speed low-jitter clock multiplications since
it combines the good properties of small area occupation and low phase noise.
However, there still exist two difficulties that hinder its widespread adoption in product
applications. One is the limited accuracy of the frequency offset detection as the
accumulated phase error can always be reset by the injection pulse. The other is
the poor robustness due to its limited lock-in range and weak lock-acquisition
ability. This thesis aims to overcome these two difficulties and hence provide a
reliable, low-cost clock multiplier for wireline transceivers.

• Designing a wide-range transmitter that explores the maximum process limit.
The direct 4:1 MUX multiplexing scheme has provided a promising solution to
satisfy the stringent timing requirement at the final serialization stage. Nonetheless,
the doubled self-drain capacitance has limited the maximum bandwidth
and hence constrains the overall transmission data rate. Another difficulty in
the transmitter design is how to generate the UI-spaced serial sequences for the
FFE. This thesis aims to optimize the bandwidth of the 4:1 MUX and develop
a quarter-rate transmitter with a multi-MUX-based 4-tap FFE.

• Designing a jitter-performance-improved receiver. Quarter-rate PI-based CDR
has become the preferred choice for data rates over 20 Gb/s due to its robustness,
portability, and compactness. Nevertheless, its jitter performance is limited
by the nonlinearity-caused cycle-limited oscillation and the nonlinearity of
the phase interpolation. This thesis seeks to improve the CDR architecture to
suppress the deterministic jitter caused by the cycle-limited oscillation while
maintaining the loop parameter unchanged to satisfy the JTOL specification.
Meanwhile, we make an effort to optimize the linearity of the PI.

• Developing a low-cost adaptive equalization algorithm. Adaptive equalization
has become a dominant option for data rates over 20 Gb/s. Previous adaptation
algorithms such as sign-sign least mean square (SS-LMS), zero-forcing (ZF),
and maximum eye opening (MEO) have manifested their validity. However, the
auxiliary circuits associated with these methods have degraded their competi-
tiveness in the cutting-edge transceiver design. This thesis aims to develop a
low-cost adaptation algorithm that only uses the existing data/edge information
to automatically adjust the tap weights of the TX-FFE.


1.4 Research Contributions

This dissertation explores several advanced techniques to make the data rates of
the cutting-edge wireline transceivers approach the fundamental technology limit. It
addresses some of the architecture-level and circuit-level challenges with appropriate
compromises of power consumption, area occupation, performance margin, and operation
robustness. The main contributions of this dissertation are summarized as
follows.

• A low-jitter ring-oscillator-based injection-locked clock multiplier (RILCM) is
designed in 65-nm CMOS process. It employs a hybrid frequency tracking loop
that consists of a traditional phase-locked loop (PLL), a timing-adjusted loop,
and a loop selection state-machine to automatically adjust the control voltage
of the injection-locked voltage-controlled oscillator (VCO). In the ring-VCO, a
full-swing pseudo-differential delay cell is proposed to lower the device noise
to phase noise conversion. To satisfy the requirements of high operation speed,
high detection accuracy, and low output disturbance, a compact timing-adjusted
phase detector tightly combined with a well-matched charge pump is designed.
Meanwhile, a lock-loss detection and lock recovery scheme is devised to endow the RILCM
with a similar lock-acquisition ability as conventional PLLs, thus excluding
the initial frequency setup aid and preventing the potential lock-loss risk. The
measurement results show that the implemented RILCM achieves a good bal-
ance among jitter performance, area occupation, operation speed, and power
efficiency.

• A 5-50 Gb/s quarter-rate transmitter (TX) with a 4-tap feed-forward equalization
(FFE) based on multiple-multiplexer (multi-MUX) is designed in 65-nm CMOS
technology. To increase the maximum operating speed, a bandwidth enhanced
4:1 MUX with the capability of eliminating charge-sharing effect is proposed. To
produce the quarter-rate parallel data streams with appropriate delays, a compact
latch array associated with an interleaved-retiming technique is designed. The
measurement results indicate that the fabricated transmitter achieves better jitter
performance and power efficiency, even in comparison to the LC-delay-based
FFE, mainly because of the proposed high-speed 4:1 MUX and the compact
interleaved-latching scheme.

• A 40 Gb/s receiver (RX) with excellent performance on both jitter suppression
and jitter tracking is implemented in 65-nm CMOS process. Passive low-pass
filters with adaptively adjusted bandwidth are introduced into the data-sampling
path to automatically balance jitter tracking and jitter suppression for data de-
cisions. Additionally, a time-averaging-based compensating phase interpolator
is proposed to not only improve the phase-step uniformity but also reduce the
phase-spacing drift between edge and data sampling clocks. The measurement
results show that the maximum tolerable amplitude of sinusoidal jitter at high
frequency outperforms previous receivers, which is mainly because of the intro-
duced LPFs and the developed compensating PI.

• A chip-to-chip connection over a 12-cm printed circuit board (PCB) channel us-
ing the designed transmitter and receiver chips is constructed. The channel loss
is compensated by a combination of TX-FFE and RX-CTLE. To obtain the optimal
equalization coefficients and track the channel-loss variations with respect
to operation environment, a low-cost edge-data correlation-based sign zero-forcing
(EDC-SZF) adaptation algorithm is proposed to automatically adjust the
TX-FFE’s tap weights. The measurement results indicate that the equalization
scheme of the combination of TX-FFE and RX-CTLE is a good choice for the
equalization of the 16-dB loss channel at 40 Gb/s, and the proposed EDC-SZF
adaptation can effectively tune the TX-FFE to its optimal tap weights for a given
control voltage applied to the RX-CTLE.

1.5 Organization of the Thesis

This thesis is composed of seven chapters. Chapter 1 outlines the research background,
objectives, contributions, and organization of the dissertation. Chapter 2 summarizes
the mainstream techniques developed for high-speed serial links. The main
contributions of this thesis are detailed in Chapters 3, 4, 5 and 6, which present the
designed clock multiplier, transmitter chip, receiver chip, and chip-to-chip link, re-
spectively. In each of these four chapters, we discuss the design motivation, describe
the prototype implementation, and present the experimental results. Finally, Chapter 7
concludes this thesis and outlines possible future work. The details in each chapter
are summarized as follows.
Chapter 2 reviews the mainstream techniques that have been developed within the
wireline transceiver designs. It begins with a brief discussion on the general design
considerations when constructing a serial communication link, including technology
selection, link space choice, and on-chip wire modeling. Then, we summarize the
major metrics that are used to characterize the overall performance of a serial com-
munication link. Following that, the mainstream techniques of the crucial components
within a serial link are discussed in detail, including clock multiplier, transmitter, re-
ceiver, and equalizers.
Chapter 3 presents the design of the RILCM. It first summarizes the challenges
in previous RILCMs and then describes the proposed RILCM architecture. Following
that, we demonstrate the details of the ring-based voltage-controlled oscillator, the
phase-shift detection scheme, and the introduced lock-loss detection and lock recovery.
Finally, the experimental results are presented and discussed. This chapter is extended
from publication [3] on page VI.
Chapter 4 presents the designed transmitter chip. It first discusses the two main
challenges (i.e., timing constraints and bandwidth limitations) in high-speed transmitter
designs, and then presents our transmitter architecture. Following that, the enhancement
of the 4:1 multiplexer and the clocking techniques are separately illustrated. Finally,
the experimental results are demonstrated and discussed. This chapter is an enriched
version of the contents published in [1] and [5] on page VI.
Chapter 5 presents the implemented receiver chip, which mainly focuses on the improvement
of the clock data recovery (CDR) design. It first summarizes the design
considerations of the receiver, and then presents the receiver architecture. Following
that, we separately describe the improved digital CDR and the linearity-optimized
compensating PI. Finally, the experimental results are presented and discussed. This
chapter is extended based on the contents published in [1] and [4] on page VI.
Chapter 6 constructs an overall chip-to-chip communication link utilizing the chips
designed in Chapters 4 and 5. It first describes the link connection and equalization
scheme, and then demonstrates the developed low-cost EDC-SZF adaptation algorithm.
After that, we present the experimental setup and the measurement results. The
condensed contents of this chapter have been published in [1] on page VI.
Chapter 7 concludes this dissertation and discusses the potential
optimization work that can be done in the future.

Chapter 2

Literature Review

High-speed serial links are commonly adopted in chip-to-chip communication applications
ranging from handheld electronics to supercomputers. Driven by the exponential
growth of the computation ability and storage-volume capability, the throughput
bandwidths within the connections among memories, graphics, processors, chassis,
racks, and routers [37, 38, 39] have been continuously increased. In practical designs,
these bandwidth increases are achieved by either raising the number of data lanes or
increasing the data rate per lane [40]. As one of the most important components in such
links, the serial transceiver needs to provide precise timing information, sufficient bandwidth,
and appropriate equalization for the data transmission. These requirements have
posed significant challenges in the implementation of wireline transceivers and hence
made the design of the wireline transceivers a hot research field [41, 42, 36].
This chapter will review the related works for the wireline transceiver designs. It
begins by introducing the general design considerations when constructing a serial
chip-to-chip connection in Section 2.1. Section 2.2 then presents the crucial metrics
that are usually employed to characterize the performance of a serial link. Following
that, the pros and cons of the mainstream techniques within a serial link including
clocking techniques, transmitter techniques, receiver techniques, and channel equalizers
are discussed in detail in Section 2.3.


Figure 2.1: Cutoff frequency (fT ) scaling comparison among different processes in
terms of the inverse of the lithographic feature size [3].

2.1 General Design Considerations

2.1.1 Technology Choices

High-speed links over 10 Gb/s have traditionally been implemented in SiGe BiCMOS
technology due to its integration of high-speed SiGe bipolar and low-cost CMOS
transistors, where the former is suitable for the high-speed, low-noise blocks such as
transmitter (TX) driver and receiver (RX) pre-amplifier, while the latter is appropriate
for the control-logic implementation [43, 44]. However, for more complex applications
where SerDes function is combined with complicated digital functions, CMOS process
is preferred because of its fast shrinking that makes it feasible to keep the die size,
power consumption, and fabrication cost as low as possible [43]. These area, power,
and cost savings over equivalent SiGe circuits mainly come from the simple and com-
pact transistor implementation in CMOS process that makes the designs easily scaled
downward as semiconductor processes improve. Line card and optical module manu-
facturers utilizing CMOS products will benefit from the large community of competing
foundries, which engages in aggressive pricing strategies and rapid adoption of ever-
smaller process nodes that deliver successively lower cost per chip, reduced operating
voltages, and decreased power consumption [28]. Fig. 2.1 shows the scaling trend
of CMOS process versus SiGe BiCMOS technology in terms of the cutoff frequency
(fT ). Although the SiGe BiCMOS technology always retains a speed advantage over
CMOS process, the fT of 45 nm CMOS already reaches 270 GHz, which makes it
feasible to implement high-speed transceivers around tens of Gb/s.
Note that the potential advantages of using low-cost CMOS process come with
several significant challenges. The primary challenge is that mainstream CMOS tran-
sistors are slightly slower than exotic SiGe devices (see Fig. 2.1). Therefore, more in-
novative designs for crucial blocks such as voltage-controlled oscillator (VCO), trans-
mitter driver, receiver analog front-end, and channel equalizer are required to overcome
the slower, noisier characteristics of CMOS transistors. Driven by the large-scale mar-
ket requirements, Moore’s law curve is developing towards ever-better power, perfor-
mance and price. Meanwhile, process nodes are constantly scaled down under the ag-
gressive investment of the foundries. The resulting processes have provided platforms
for the development of serial transceivers at several tens of Gb/s with high efficiencies in both
cost and power. So far, 25-28 Gb/s serial transceivers in CMOS processes support-
ing InfiniBand enhanced data rate (EDR), 32G fiber channel (32GFC), and common
electrical interface (CEI)-28G have stepped into the period of industrial deployment
[20, 21, 22]. Meanwhile, 38-64 Gb/s transceivers, which will play a key role in the
next-generation data rate supported by 400 Gigabit Ethernet (GbE), InfiniBand high
data rate (HDR), and CEI-56G, have been successfully demonstrated in the lab and are
now moving from the lab to the market [17, 18, 23, 24, 25, 26, 27, 32].

2.1.2 Spaces of Electrical Links

Fig. 2.2 shows the main SerDes application spaces in electrical links, including
rack-to-rack link, chassis-to-chassis connection, and intra-chassis interconnect. This
thesis mainly focuses on the chip-to-chip connections described in Fig. 2.2(c). Ac-
cording to the communication distance, these serial links can be classified into ultra
short reach (USR), extra short reach (XSR), very short reach (VSR), medium reach
(MR), and long reach (LR). Fig. 2.3 summarizes the connection details of each appli-
cation space defined in CEI-56G.

Figure 2.2: Typical SerDes application spaces. (a) rack-to-rack link, (b) chassis-to-chassis
connection, and (c) intra-chassis interconnect [4].

Figure 2.3: Reach details of each application space defined in CEI-56G [4].

The USR link is usually used to connect multiple dies and optical engines within
a multi-chip module to achieve the power and signal integrity objectives. This 2.5D/3D
packaging solution can save substantial power since the communication
distance is typically less than 10 mm. This short channel length allows for a much
simpler physical layer implementation since it can be treated as a synchronous link.
Meanwhile, the low-loss communication channel makes it possible to omit
equalization.
The XSR link is often employed to realize the data communication between elec-
trical chips and optical devices, where the link distance is usually less than 50 mm.
Meanwhile, central processing units (CPUs) and digital signal processors (DSPs) can
also be connected via such a short connection to satisfy the latency requirements. The
XSR link is also used to connect CPUs with memory stacks to optimize the response
time of memory access.
The VSR link mainly refers to the connection between electrical chips and plug-
gable modules. Its typical communication distance is around 10 cm, where the channel
loss could reach 10-20 dB at the Nyquist frequency.
The MR link is usually used to implement the connection between two chips on the
same printed circuit board (PCB) or one on the main card and the other on a daughter
card [4]. Its communication distance ranges up to 50 cm and the channel loss is in the
range of 15-25 dB at half the symbol rate, i.e., the Nyquist frequency.
The LR interface is usually applied to realize the connection between two daughter
cards across a legacy backplane with a channel loss of up to 35 dB at the Nyquist frequency.
The total channel length is limited to less than 100 cm, and two connectors are
allowed.
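
The reach classes described above can be collected into a small lookup structure. The following is only a minimal sketch: the names `ReachClass`, `REACH_CLASSES`, and `classify_reach` are hypothetical, the numbers are transcribed from the preceding paragraphs, and the 100 mm bound for VSR is an assumption based on the "around 10 cm" wording. The loss column is informational only, since the quoted VSR and MR loss ranges overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReachClass:
    name: str
    max_length_mm: float           # length bound quoted in the text
    max_loss_db: Optional[float]   # upper Nyquist-loss figure quoted in the text, if any


# Transcribed from the paragraphs above. The 100 mm bound for VSR is an
# assumption ("around 10 cm"); no loss figure is given for USR or XSR.
REACH_CLASSES = (
    ReachClass("USR", 10.0, None),
    ReachClass("XSR", 50.0, None),
    ReachClass("VSR", 100.0, 20.0),
    ReachClass("MR", 500.0, 25.0),
    ReachClass("LR", 1000.0, 35.0),
)


def classify_reach(length_mm: float) -> ReachClass:
    """Return the shortest reach class whose length bound covers length_mm."""
    for reach in REACH_CLASSES:
        if length_mm <= reach.max_length_mm:
            return reach
    raise ValueError("channel is longer than the LR bound quoted in the text")


if __name__ == "__main__":
    for length in (5.0, 40.0, 300.0, 900.0):
        print(length, "mm ->", classify_reach(length).name)
```

Classifying by length alone is a simplification; in practice the loss at the Nyquist frequency and the number of connectors also decide which class applies.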
The channel loss in VSR, MR, and LR links has posed significant challenges in
transceiver designs as they need to compensate for the high-frequency loss within the
power budget. This problem becomes extremely severe for the large switch chips
where heat dissipation is also a performance-limiting factor. To address these issues,
complex equalization schemes, high-order modulation, and forward error correction
(FEC) have been developed [4]. To accommodate different channel losses, a proper
combination of these techniques is usually employed to correct the signal distortion.
For example, a TX-side feed-forward equalizer (FFE) alone is usually sufficient for
VSR links to compensate for the small channel loss (<10 dB) while a sophisticated
combination of complex equalization scheme, advanced modulation, and FEC is
required in the LR links to cope with the signal integrity problem associated with
the legacy communication channel, including severe signal attenuation caused by di-
electric loss (> 30 dB), signal reflection resulting from impedance discontinuity, and
mutual crosstalk among different transmission channels.

2.1.3 On-Chip Wire Modeling

With the rapid development of the manufacturing technologies, the channel length
and the transistor delay are respectively shrinking down to nanometer scale and below
tens of ps. This miniaturization trend for CMOS integrated circuits has led to a
tremendous cost advantage and performance improvement. However, the narrowed
cross-section and wire spacings have dramatically increased the parasitic effects of
the connection wires, thus degrading their high-speed performance. Previous studies
have demonstrated that when the signal’s rise/fall time roughly matches the propaga-
tion time through the line, the connection wire actually isolates the receiver from the
driver and plays the role of output/input impedance of the driver/receiver [45, 46, 47].
Consequently, how to model on-chip connection wires has become a tricky problem
for high-speed circuit designers. If it is not handled appropriately, the interconnect
effects including voltage ringing, signal delay, distortion, reflection, and crosstalk could
degrade the system robustness or even lead to undesired errors. A simple model may
ignore some important effects and result in a design failure,
while a sophisticated model could complicate the simulation, extend the design cycle,
or even make the simulation impractical. Hence, it becomes extremely important
for designers to properly simulate the entire designs as efficiently as possible while
maintaining the simulation accuracy [48].
The concept of “high-speed interconnect” is a relative concept. It refers to the interconnect
where the propagation time to travel between the two connection points cannot
be neglected. As discussed in [48, 47], the “electrical length” of an interconnect can
be considered as a criterion for classifying interconnects. If the wire length is shorter
than one-tenth of the corresponding wavelength (e.g., for a 10 GHz signal, λ=3 cm),
the interconnect can be considered as electrically short and hence can be modeled by
the lumped model. Otherwise, the interconnect can be referred to as electrically long (i.e.,
“high-speed interconnect”), which should be treated with a distributed or full-wave model
[45, 49]. In high-speed serial links, the highest frequency of interest is determined
by the rise/fall time of the transmission signal since most of the trapezoidal pulse en-
ergy is concentrated inside the first lobe. Correspondingly, fmax can be defined as the
-3 dB bandwidth of this major lobe [46, 47],

$$f_{\max} = \frac{0.35}{t_r}, \tag{2.1}$$

where tr is the rise/fall time of the signal. This implies that for a 0.1 ns rise time, the
maximum frequency of interest is around 3 GHz and the minimum wavelength is about 10 cm.
In some special cases, a more conservative bandwidth can be set as [50],

$$f_{\max} = \frac{1}{t_r}. \tag{2.2}$$
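
As a rough illustration of how Eqs. (2.1) and (2.2) combine with the one-tenth-wavelength rule, the sketch below checks whether a wire counts as electrically short. The function names and the `eps_eff` parameter (effective relative permittivity) are assumptions introduced here; the default of 1.0 reproduces the free-space wavelengths quoted in the text, whereas a real on-chip or PCB medium would shorten the wavelength further.

```python
C0 = 299_792_458.0  # speed of light in vacuum, m/s


def f_max_hz(rise_time_s: float, conservative: bool = False) -> float:
    """Highest frequency of interest: 0.35/t_r (Eq. 2.1) or 1/t_r (Eq. 2.2)."""
    return (1.0 if conservative else 0.35) / rise_time_s


def is_electrically_short(length_m: float, rise_time_s: float,
                          eps_eff: float = 1.0,
                          conservative: bool = False) -> bool:
    """Apply the one-tenth-wavelength rule at f_max.

    eps_eff is an assumed effective relative permittivity of the medium;
    1.0 corresponds to the free-space wavelength used in the text.
    """
    velocity = C0 / (eps_eff ** 0.5)
    wavelength_m = velocity / f_max_hz(rise_time_s, conservative)
    return length_m < wavelength_m / 10.0


if __name__ == "__main__":
    tr = 0.1e-9  # the 0.1 ns rise time used as the example in the text
    print(f_max_hz(tr) / 1e9)                 # about 3.5 (GHz)
    print(is_electrically_short(5e-3, tr))    # 5 mm wire, Eq. (2.1)
    print(is_electrically_short(5e-3, tr, conservative=True))  # same wire, Eq. (2.2)
```

With Eq. (2.1) the 0.1 ns example evaluates to about 3.5 GHz, which the text rounds to roughly 3 GHz; the more conservative Eq. (2.2) raises the limit to 10 GHz, so the same wire can move from the lumped regime to the distributed one.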

2.2 SerDes Design Metrics

2.2.1 Data Rate and Power Efficiency

The data rate of a high-speed serial link is the number of data bits transferred per
second from the transmitter to the receiver, while the power efficiency refers to the
power consumption normalized to the data rate, i.e., the power spent per Gb/s of throughput.
The former is usually measured in Gb/s and the latter is frequently characterized in
mW/Gb/s (equivalently, pJ/bit). Previous studies [51, 52, 53] have demonstrated that there exists an optimal
data rate to exploit the maximum potential of a given process to achieve the best power
efficiency. The analyses in [52] and [53] suggest that the power efficiency reaches the
optimal value when the bit time (the reciprocal of the data rate) is around (4∼6) ×
FO4 (the inverter delay of the target technology with a fan-out-of-4). At this speed,
it is relatively easy to drive the half-rate clock and build critical high-speed blocks
(e.g., TX-side half-rate 2:1 multiplexers and RX-side edge/data samplers) in power-
efficient CMOS logic [51]. The FO4 delays can be roughly approximated as 500 ps
per µm of minimum drawn gate length in CMOS technologies [54]. On one hand, if
the data rate is too low, the overhead of the stationary currents will become dominant,
thus deteriorating the power efficiency. On the other hand, when the bit period is
too short, power-hungry current-mode logic (CML) circuits and complicated equaliza-
tion techniques are usually employed to satisfy the stringent timing requirement and
compensate for the severe channel attenuation. This is also the reason why the cutting-
edge transceivers running at tens of Gb/s usually show an increasing trend in power
efficiency values. Previous research has demonstrated 2 mW/Gb/s transceivers in 65
nm CMOS operating around 10 Gb/s [55, 56]. Meanwhile, the power efficiency of commercial 28 Gb/s
transceivers with sophisticated equalizers in 28 nm CMOS is around 7 mW/Gb/s
[30]. Recently published non-return to zero (NRZ) transceivers operating from 40 to
60 Gb/s with an equalization ability of <20 dB in 28-65 nm CMOS processes have
shown energy efficiencies ranging from 4.4 to 16.4 mW/Gb/s [25, 34, 57, 35, 36].
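
The FO4 rule of thumb and the (4∼6) × FO4 bit-time window can be turned into a quick estimate of the most power-efficient data-rate range for a given node. This is a sketch under stated assumptions: the helper names are hypothetical, and the drawn gate length is taken equal to the node name (0.065 µm for 65 nm), which is a simplification.

```python
def fo4_delay_ps(gate_length_um: float, ps_per_um: float = 500.0) -> float:
    """FO4 delay from the ~500 ps per um of drawn gate length rule of thumb."""
    return ps_per_um * gate_length_um


def optimal_data_rate_gbps(gate_length_um: float, fo4_per_bit=(4.0, 6.0)):
    """Return (low, high) data rates in Gb/s for bit times of 6 and 4 FO4."""
    fo4 = fo4_delay_ps(gate_length_um)
    fewest, most = fo4_per_bit
    # A bit time of T ps corresponds to a data rate of 1000 / T Gb/s.
    return 1000.0 / (most * fo4), 1000.0 / (fewest * fo4)


def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """Power efficiency; mW/(Gb/s) is numerically equal to pJ/bit."""
    return power_mw / data_rate_gbps


if __name__ == "__main__":
    low, high = optimal_data_rate_gbps(0.065)   # assumed 65 nm drawn length
    print(f"FO4 = {fo4_delay_ps(0.065):.1f} ps, sweet spot {low:.1f}-{high:.1f} Gb/s")
    print(energy_per_bit_pj(280.0, 40.0), "pJ/bit")  # hypothetical 280 mW at 40 Gb/s
```

Under these assumptions a 65 nm process has an FO4 delay of about 32.5 ps, which places the efficiency sweet spot at roughly 5 to 8 Gb/s, well below the 40+ Gb/s rates targeted here; this is in line with the rising mW/Gb/s figures quoted above for the fastest links.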

2.2.2 Bit Error Rate

Bit error rate (BER) is the ratio of the error bit number to the total transmitted bit
number in a specific period. It is a measure of the correctness of the link operation,
which is expected to be lower than $10^{-12}$ for most serial connections. In serial communication
systems, the BER could be affected by the distribution of the random jitter
(RJ) and the deterministic jitter (DJ) in the link. Fig. 2.4 gives the jitter decomposition
components and their corresponding jitter sources [58], where the jitter generation and
amplification mechanisms can be found in [54] and [59, 60, 61], respectively. Com-
bining the jitter generated by each source, the total RJ and DJ can be respectively
computed by the following two equations,

$$T_{rj} = \sqrt{t_{rj1}^2 + \cdots + t_{rjn}^2}, \tag{2.3}$$

$$T_{dj} = t_{dj1} + \cdots + t_{djn}, \tag{2.4}$$

where Trj denotes the total RJ, trjn (n = 1, 2, ...) refers to the independent RJ generated
by different sources, Tdj represents the total DJ, and tdjn (n = 1, 2, ...) stands
for the separate DJ produced by different blocks. Assuming that samplings that happen
outside the bit period produce bit errors, the horizontal Q-factor of the BER can be
represented by,

$$Q_{BER} = \frac{T_{bit} - T_{dj}}{2 T_{rj}}, \tag{2.5}$$

where Tbit is the bit period. Referring to the analysis in [62], the BER can be roughly
evaluated by,

$$BER = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{Q_{BER}}{\sqrt{2}}\right), \tag{2.6}$$

where erfc() is the complementary error function, which is defined as,

$$\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\, dt. \tag{2.7}$$

Figure 2.4: Jitter decomposition and jitter sources. Total jitter (TJ) splits into bounded
deterministic jitter (DJ) and unbounded random jitter (RJ).

Jitter component                 Jitter sources
Data-dependent jitter (DDJ)      channel-loss ISI; limited buffer bandwidth
Duty-cycle jitter (DCD)          layout mismatch; threshold offset; jitter amplification
Periodic jitter (PJ)             PLL spur; CDR dithering jitter
Bounded uncorrelated jitter      power supply noise; crosstalk, reflection
Random jitter (RJ)               reference clock jitter; oscillation jitter; transport devices jitter

According to Eq. (2.6), the horizontal Q-factor should be about 7.0 to satisfy the commonly
required BER of $10^{-12}$. It is worth noting that the BER can be further degraded
by the non-ideal impairments such as asymmetric jitter distribution [63], non-optimal
sampling position [64], phase-spacing error, sampler input-offset [65], and sampler
metastability [65].
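
The jitter budget of Eqs. (2.3) to (2.6) can be evaluated numerically as sketched below. The function names and the example budget are hypothetical; the sketch assumes that the RJ terms are rms values and the DJ terms are peak-to-peak values, all in the same time unit, which is the reading under which Eq. (2.5) is dimensionally consistent.

```python
import math


def total_rj(rj_sources):
    """Eq. (2.3): independent random-jitter terms add in root-sum-square."""
    return math.sqrt(sum(t * t for t in rj_sources))


def total_dj(dj_sources):
    """Eq. (2.4): deterministic-jitter terms add linearly."""
    return sum(dj_sources)


def q_ber(t_bit, t_dj, t_rj):
    """Eq. (2.5): horizontal Q-factor."""
    return (t_bit - t_dj) / (2.0 * t_rj)


def ber_from_q(q):
    """Eq. (2.6): BER estimate from the Q-factor."""
    return 0.5 * math.erfc(q / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical 40 Gb/s budget, all values in ps and purely illustrative.
    t_bit = 25.0
    rj = total_rj([0.6, 0.5, 0.4])   # assumed rms contributions
    dj = total_dj([4.0, 3.0, 2.0])   # assumed peak-to-peak contributions
    q = q_ber(t_bit, dj, rj)
    print(f"Trj = {rj:.3f} ps, Tdj = {dj:.1f} ps, Q = {q:.2f}, BER = {ber_from_q(q):.2e}")
    print(f"BER at Q = 7.0: {ber_from_q(7.0):.2e}")  # about 1.3e-12
```

Evaluating Eq. (2.6) at Q = 7.0 gives a BER slightly above $10^{-12}$, so a practical budget usually targets a Q-factor marginally larger than 7.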
The vertical amplitude dimension is another factor that could affect the BER. It
usually involves the TX-side output swing, channel equalization, and RX-side input
sensitivity. The receiver sensitivity is defined as the lowest signal amplitude at which the
receiver can correctly extract the transmitted data. It is a function of equivalent input
noise, input offset, and minimum latch resolution. When the received signal has a
sufficiently large swing, the vertical amplitude shows negligible effect on the BER. If
the received signal swing is reduced close to the receiver sensitivity, the BER of the
whole link could be determined by the signal-to-noise ratio (SNR) of the received signal
even though there is adequate horizontal timing margin. Similar to the relationship
between the BER and horizontal Q-factor in Eq. (2.6), the BER is related to the SNR
through the following equation [66],


$$BER = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{SNR}{2\sqrt{2}}\right). \tag{2.8}$$

Note that Eq. (2.8) holds under the condition that there is a sufficient horizontal
timing margin. It seems that a vertical eye opening at the RX-side can always be
obtained by increasing the output swing at the TX-side. However, the enhanced inter-symbol
interference (ISI), reflection, and crosstalk associated with the increased swing
could overwhelm the RX-side amplitude increment and hence deteriorate the overall
performance of the link. The enlarged swing also needs a higher capacitor-charging
current and thus increases the power consumption. In practical designs, signal swing
and equalizer scheme are often carefully selected and designed to achieve both
low BER and low power consumption. Offset cancellation techniques are often employed
in the receiver to lower its sensitivity threshold to reduce the minimum swing requirement as
well, and hence optimize the power efficiency of the link.
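
To illustrate how Eq. (2.8) can be used in the other direction, the sketch below searches for the SNR needed to reach a target BER. It assumes the SNR in Eq. (2.8) is a plain (linear) ratio; the bisection bounds and the function names are illustrative choices, not part of the original analysis.

```python
import math

def ber_from_snr(snr):
    """Eq. (2.8): BER = 0.5 * erfc(SNR / (2 * sqrt(2)))."""
    return 0.5 * math.erfc(snr / (2.0 * math.sqrt(2.0)))

def required_snr(target_ber, lo=0.0, hi=40.0, iters=200):
    """Bisection for the smallest SNR whose BER is at or below target_ber."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ber_from_snr(mid) > target_ber:
            lo = mid   # BER still too high, need more SNR
        else:
            hi = mid   # target met, try a smaller SNR
    return hi

snr_needed = required_snr(1e-12)
print(f"SNR needed for BER 1e-12: {snr_needed:.2f}")
```

Because the argument of erfc in Eq. (2.8) is SNR/(2√2) while that in Eq. (2.6) is Q/√2, the required SNR comes out at roughly twice the required horizontal Q-factor.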

2.2.3 Clock Data Recovery (CDR) Specifications

The CDR used to extract the sampling clocks and retime the transmitted data must
satisfy stringent jitter specifications. Its performance is usually evaluated by “jitter
transfer (JTRAN)”, “jitter generation (JGEN)”, and “jitter tolerance (JTOL)” [67, 68].
Fig. 2.5 summarizes these three metric definitions in synchronous optical network
(SONET) [5].

[Figure 2.5(a) plot: jitter gain (dB) versus frequency (Hz), labelled with the 0.1 dB
peaking limit, the corner frequency fc, and a -20 dB/dec roll-off.]

OC Level   Rate            fc        P
1          51.84 Mb/s      40 kHz    0.1 dB
3          155.52 Mb/s     130 kHz   0.1 dB
12         622.08 Mb/s     500 kHz   0.1 dB
48         2.48832 Gb/s    2 MHz     0.1 dB
192        9.95328 Gb/s    120 kHz   0.1 dB

[Figure 2.5(b) plot: jitter filter gain (dB) versus frequency (Hz), with corner
frequencies f0 and f1.]

OC Level   Rate            f0       f1        Total Jitter
1          51.84 Mb/s      12 kHz   400 kHz   10 mUI RMS
3          155.52 Mb/s     12 kHz   1.3 MHz   10 mUI RMS
12         622.08 Mb/s     12 kHz   5 MHz     10 mUI RMS
48         2.48832 Gb/s    12 kHz   20 MHz    10 mUI RMS
192        9.95328 Gb/s    50 kHz   80 MHz    10 mUI RMS

[Figure 2.5(c) plot: sinusoidal jitter amplitude (UIpp) versus frequency (Hz); the mask
steps from A3 down to A2 and A1 with -20 dB/dec slopes at the corner frequencies f0, f1,
f2, f3, and ft, separating acceptable (above) from unacceptable (below) performance.]

OC Level   Rate (Mb/s)   f0 (Hz)   f1 (Hz)   f2 (Hz)   f3 (Hz)   ft (Hz)   A1 (UIpp)   A2 (UIpp)   A3 (UIpp)
1          51.84         10        30        300       2k        20k       0.15        1.5         15
3          155.52        10        30        300       6.5k      65k       0.15        1.5         15
12         622.08        10        30        300       25k       250k      0.15        1.5         15
48         2488.3        10        600       6k        100k      1M        0.15        1.5         15
192        9953.3        10        2k        20k       400k      4M        0.15        1.5         15

Figure 2.5: CDR specifications of (a) JTRAN, (b) JGEN, and (c) JTOL in SONET [5].

• The JTRAN is characterized by calculating the ratio of output jitter to input jit-
ter as a function of frequency. This metric is often used in long-haul networks
employing many data regenerators. To implement reliable data communications
in such cascaded systems, the JTRAN peaking of each regenerator must be suffi-
ciently small to ensure that the output jitter after tens of successive amplifications
is still acceptable. As depicted in Fig. 2.5(a), the maximum jitter peaking of the
retiming regenerator in SONET must be less than 0.1 dB [5].

• The JGEN is a measure of the intrinsic jitter produced by the CDR itself when
there is no jitter in the input data. It can be measured at the output of the CDR
using a high-pass filter with a specific cut-off frequency. Fig. 2.5(b) gives the
corner frequencies at different data rates for SONET and the maximum allowable
integrated rms jitter. For all OC levels, the rms jitter is required
to stay below 10 mUI.

• The JTOL is used to characterize the CDR jitter tracking ability, and it is defined
as the maximum amplitude of the injected sinusoidal jitter that the link can tolerate
without the BER exceeding a specified level. Fig. 2.5(c) displays the JTOL mask
for SONET, which defines the minimum jitter amplitude that must be tolerated
without exceeding a specific BER at different frequencies [68].

In summary, the JTRAN, JGEN, and JTOL respectively answer the following three
questions: (i) how much jitter passes through the CDR from the input to the output,
(ii) how much jitter is created by the CDR itself, and (iii) how much jitter can be
tolerated at the input of the CDR [68].
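
The JTOL mask of Fig. 2.5(c) can be expressed as a piecewise function of the jitter frequency. The sketch below is one possible encoding, using the OC-192 corner frequencies and amplitudes from the table in Fig. 2.5(c); the function names and the measurement format are hypothetical, and the region below f1 is simply treated as flat at A3.

```python
def jtol_mask_uipp(f_hz, f1, f2, f3, ft, a1=0.15, a2=1.5, a3=15.0):
    """Minimum sinusoidal jitter amplitude (UIpp) to be tolerated at f_hz.

    Flat at A3 up to f1, -20 dB/dec down to A2 at f2, flat up to f3,
    -20 dB/dec down to A1 at ft, and flat at A1 above ft.
    """
    if f_hz <= f1:
        return a3
    if f_hz < f2:
        return a3 * f1 / f_hz      # -20 dB/dec: amplitude scales as 1/f
    if f_hz <= f3:
        return a2
    if f_hz < ft:
        return a2 * f3 / f_hz      # second -20 dB/dec segment
    return a1

# OC-192 corner frequencies (Hz) from the table in Fig. 2.5(c).
OC192 = dict(f1=2e3, f2=20e3, f3=400e3, ft=4e6)

def passes_jtol(measured, corners):
    """measured: iterable of (frequency_hz, tolerated_jitter_uipp) pairs."""
    return all(tol >= jtol_mask_uipp(f, **corners) for f, tol in measured)

print(passes_jtol([(1e3, 20.0), (1e5, 2.0), (1e7, 0.3)], OC192))
```

A CDR passes the mask when its measured tolerance lies on or above the mask at every tested frequency.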

2.3 Basics of Electrical Serial Links

Fig. 2.6 describes a typical serial link for chip-to-chip communications. It is composed
of three primary components: a transmitter, a receiver, and a channel. The main
function of the transmitter is to convert the parallel digital data into an electrical signal
and launch it on the transmission channel with a proper waveform shape such that the
received signal after the lossy channel can be correctly recovered. A general transmitter
(see Fig. 2.6) usually consists of a phase-locked loop (PLL), a serializer, and a
combined driver-equalizer. Driven by the clocks with appropriate frequency and phase,
the parallel data D1-DN are successively multiplexed into a full-rate data stream DS
using the multiplexing stages in the serializer. To guarantee a robust serialization, the
bandwidth and timing margin of each multiplexing stage must be sufficient. After the
full-rate data stream is generated, it is applied to the combined driver-equalizer,
which pre-distorts the output waveform and launches it into the transmission channel. The
main task of the receiver located at the other end of the transmission channel is to
extract the originally transmitted data from the received signal using appropriate
equalization and clock data recovery (CDR) techniques [69, 61, 70]. A general receiver
(see Fig. 2.6) usually contains a front-end equalizer, a CDR, and a deserializer. The
incoming signal is first equalized by the front-end equalizer to obtain a sufficient
vertical eye opening and an adequate horizontal sampling margin. This equalized output
is then sliced by the samplers, where the sampling position is continuously adjusted
by the CDR loop. These sliced data sequences are further demultiplexed by the
deserializer to recover the originally transmitted data D1-DN. The communication channel
carries the serial data from the TX side to the RX side. The main problem
associated with the transmission channel is the channel loss. To overcome this
difficulty, a combination of TX-side feed-forward equalization (FFE) along with RX-side
continuous-time linear equalizer (CTLE) and decision feedback equalizer (DFE) is usually
employed, as shown in Fig. 2.6.

[Figure 2.6 block diagram: transmitter (PLL driven by Refclk, serializer combining D1-DN
into DS, driver with equalizer), channel, and receiver (CTLE, DFE, CDR, PLL driven by
Refclk, deserializer producing D1-DN).]

Figure 2.6: Typical serial link for wireline communications.


In the remainder of this chapter, we first present the clocking techniques, focusing
mainly on clock synthesis and distribution, in Section 2.3.1. Then, the general
architectures and crucial blocks of the transmitter and receiver are discussed
in Sections 2.3.2 and 2.3.3, respectively. Finally, Section 2.3.4 illustrates the equalization
techniques for ISI cancellation.

2.3.1 Clocking Techniques

Clocking circuitry plays a critical role in modern high-speed wireline communication
systems since the clock signals not only establish the flow-of-time for the downstream
data processing but also provide accurate timing information for the upstream
data serialization/deserialization and data transmission. The timing accuracy of the re-
timing clocks at the TX-side and the sampling clocks at the RX-side can directly affect
the timing margin of the serial link. According to the operation functions, the clock
circuitry in a serial link can be classified as clock synthesis, distribution, and recovery.
This section mainly focuses on the clock synthesis and distribution while the clock
recovery will be discussed in Section 2.3.3.1 together with the receiver basics.

2.3.1.1 Clock Synthesis

Clock synthesis is usually accomplished by a PLL- or delay-locked loop (DLL)-based
frequency multiplier, which takes a low-frequency, low-jitter reference clock and
synthesizes high-frequency clocks. Because the clock multiplier is the source of the
high-frequency clocks, any jitter at its output is directly converted into timing uncertainty
in the serial link and hence compresses the jitter budget of the whole link. Given that
numerous theoretical analyses and circuit implementations are already available
for integrated PLLs and DLLs [29, 71, 72, 73, 74, 75, 76], this section does not review
the design principles and implementation details. Instead, we summarize the main
features and design points of these two clock synthesis schemes. Additionally,
injection-locking-based clock multipliers have been developed in recent years to improve the clock
jitter performance. The features and challenges of this technique are also briefly
introduced in this section to give an overview of the common clock generation schemes for
serial links. The details of this injection locking technique will be discussed together
with the designed ring-oscillator-based injection locked clock multiplier (RILCM) in
Chapter 3. Fig. 2.7 presents the widely used clock generation techniques and their
corresponding phase noise performance.

[Figure 2.7 diagrams, each with a block diagram (left) and a phase noise plot of Sθ(f)
versus f (right): (a) PLL with PFD+CP, LPF, VCO, and /N divider; plot shows the REF
20log(N) and PFD/DIV/CP levels, the VCO 1/f^3 and 1/f^2 regions, fBW, and fc. (b) DLL
with PD+CP, LPF, VCDL, and edge-combining logic; plot shows the REF 20log(N) level,
jitter peaking, jitter amplifying, and fBW. (c) ILO with PG and VCTRL; plot shows the
REF 20log(N) level, the VCO 1/f^3 and 1/f^2 regions, fc, and finj. (d) IL-VCO with PG
and FTL; plot shows the same quantities plus fBW.]

Figure 2.7: Clock synthesis implementations and phase noise performance for (a)
PLL, (b) DLL, (c) ILO, and (d) IL-VCO. Here, f is the frequency of the noise, Sθ(f)
stands for the phase noise spectrum, fBW refers to the -3 dB bandwidth of the loop,
fc denotes the corner frequency of the VCO, and finj represents the injection-locking
bandwidth of the ILO.
The most general method to produce high-frequency clocks from a low-frequency
input is the traditional PLL [see the left diagram in Fig. 2.7(a)], due to its compact
implementation, robust operation, and convenient rate configuration. It consists of
a phase frequency detector (PFD), a charge pump (CP), a low-pass filter (LPF), a
voltage-controlled oscillator (VCO), and a divider (DIV). The PFD is utilized to detect
the phase errors between the input reference clock and the feedback divided clock, the
CP is used to convert the phase errors into current pulses, the LPF is adopted to sup-
press the ripples on the control voltage, the VCO generates high-frequency clocks, and
the divider is introduced to set the clock multiplication factor. Theoretical analyses
show that the PLL acts as an LPF for the reference noise, the DIV noise, and the PFD
noise, a band-pass filter for the CP noise, and a high-pass filter (HPF) for the VCO
noise [29, 71, 72]. X. Gao et al. [77] proposed two useful design criteria to
minimize the PLL output jitter for a given power budget. One is to spend equal power
on the loop (including the PFD, DIV, and CP) and the VCO. The other is to set the PLL
bandwidth to the optimal value at which the loop components and the VCO contribute
equally to the total jitter. As shown in the right diagram in Fig. 2.7(a), the optimal
bandwidth can be approximated by the frequency at which the phase-noise curves of the
loop components and the VCO intersect. The jitter performance of the PLL relies heavily
on the oscillator (OSC). Different types of OSCs offer different advantages and drawbacks
with respect to power efficiency, area occupation, phase noise, tuning range, and
multi-phase generation. The Ring-OSC has advantages over the LC-OSC in terms of small area
occupation, wide tuning range, and convenient multi-phase generation, while the
LC-OSC offers low phase noise and high power efficiency.
Neither of them can satisfy all the clock synthesis requirements of small area, low
power, low phase noise, and multi-phase generation. The poor phase noise of the
Ring-OSC is mainly due to device noise accumulation, while the large area of the
LC-OSC is due to its large inductor. Additionally, the phase noise
of both OSCs degrades rapidly when the operation frequency exceeds 10
GHz. Therefore, a wide PLL bandwidth is desirable to suppress the phase noise of the
VCO. However, the maximum loop bandwidth is often limited by the input reference
frequency for loop stability reasons.
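
The low-pass and high-pass filtering described above can be visualized with a deliberately simplified model. The sketch below assumes a first-order loop whose open-loop gain is unity at fBW; practical charge-pump PLLs are at least second order, so this only illustrates the trend. The bandwidth (1 MHz) and division ratio (40) are hypothetical values.

```python
import math

def pll_noise_gains(f, f_bw, n_div):
    """Return (|H_ref|, |H_vco|) for a simplified first-order loop model."""
    g = f_bw / (1j * f)             # open-loop gain, unity magnitude at f = f_bw
    h_ref = n_div * g / (1 + g)     # reference noise: low-pass, scaled by N in-band
    h_vco = 1 / (1 + g)             # VCO noise: high-pass
    return abs(h_ref), abs(h_vco)

def db(x):
    """Convert an amplitude ratio to dB."""
    return 20 * math.log10(x)

for f in (1e4, 1e6, 1e8):
    h_ref, h_vco = pll_noise_gains(f, f_bw=1e6, n_div=40)
    print(f"f = {f:.0e} Hz: ref {db(h_ref):6.1f} dB, VCO {db(h_vco):6.1f} dB")
```

Well inside the bandwidth, the reference noise is amplified by about 20log(N) while the VCO noise is strongly attenuated; well outside, the reference noise is filtered and the VCO noise passes unchanged, which is why a wider bandwidth helps suppress VCO phase noise.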
A DLL-based clock synthesizer is one of the possible solutions to satisfy the
aforementioned requirements [76, 78, 75, 79]. The left diagram in Fig. 2.7(b) presents its
conceptual implementation, which consists of a conventional DLL and an edge
combiner. Driven by the phase detection loop, the voltage-controlled delay line (VCDL)
is forced to produce equally spaced phases within a specific duration (e.g., a period
of the input clock). These evenly spaced low-frequency phases are then fed into the
edge combiner to produce the desired high-frequency clocks. The main advantage of
this DLL-based clock synthesizer is its low jitter, which can be mainly
attributed to the fact that the jitter accumulation in the open-loop VCDL lasts only for
a single pass through the delay line [78]. In addition, the phase noise transferred from
the PD and CP is negligible due to the small gain of the VCDL. The right diagram in
Fig. 2.7(b) presents the phase noise characteristics of the DLL-based clock synthesizer,
where the accumulated phase noise associated with the VCDL and the phase noise introduced
by the PD/CP are so small that they can be neglected. Note that jitter amplification
does exist at out-of-band frequencies, although it is usually very small.
Compared to the phase noise of a traditional PLL [see the right diagram in Fig. 2.7(a)], the
DLL-based synthesizer exhibits excellent jitter performance, which can be roughly
approximated by the reference clock jitter [75]. Another benefit of the DLL-based clock
synthesizer comes from its natural stability, since it behaves as a single-pole
system. However, this architecture has three major drawbacks. Firstly, its performance
is sensitive to static nonlinearities. Any phase inaccuracy of the evenly spaced clocks
translates directly into duty cycle error and/or phase spacing error. This phase
inaccuracy could be either caused by the mismatches in the PD, CP, and VCDL or induced
by the waveform-shape inconsistency due to an improper input waveform. These
factors make the DLL-based clock synthesizer fragile to fabrication mismatch and process,
voltage, and temperature (PVT) variations, thus exhibiting weak robustness.
Secondly, the clock multiplication factor is difficult to program due to the limited
number of VCDL stages. Thirdly, the additional high-speed edge combiner could
significantly degrade its power efficiency. Constrained by the poor robustness, high power
consumption, and inconvenient combining-timing control, the DLL-based clock synthesizer
has difficulty reaching frequencies higher than 10 GHz [80].
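
The sensitivity of edge combining to static phase errors can be illustrated with a small behavioural sketch. It assumes each VCDL tap contributes exactly one rising edge of the multiplied clock per reference cycle; the 1 GHz reference and the ±5 ps tap errors are hypothetical values chosen for illustration.

```python
def combined_edges(t_ref, tap_errors):
    """Rising-edge times within one reference period for an N-tap VCDL.

    tap_errors[k] is the static timing error of tap k in seconds;
    the ideal tap k fires at k * t_ref / N.
    """
    n = len(tap_errors)
    return [k * t_ref / n + err for k, err in enumerate(tap_errors)]

def output_periods(t_ref, tap_errors):
    """Periods of the edge-combined clock over one reference cycle."""
    edges = combined_edges(t_ref, tap_errors)
    edges.append(t_ref + edges[0])   # first edge of the next reference cycle
    return [b - a for a, b in zip(edges, edges[1:])]

t_ref = 1e-9                                   # hypothetical 1 GHz reference
ideal = output_periods(t_ref, [0.0] * 4)       # 4x multiplication, no mismatch
skewed = output_periods(t_ref, [0.0, 5e-12, 0.0, -5e-12])
print([f"{p * 1e12:.0f} ps" for p in ideal])
print([f"{p * 1e12:.0f} ps" for p in skewed])
```

In the mismatched case the individual output periods are no longer equal, although they still sum to one reference period. This periodic pattern is the phase spacing error mentioned above and repeats every reference cycle.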
The injection-locked clock multiplier (ILCM) is another promising scheme to produce
high-frequency multi-phase clocks with small area occupation, low power consumption,
and low jitter [81, 82, 83, 84, 85]. It has shown great potential in
serial link communications [86, 87, 88]. Fig. 2.7(c) depicts the functional diagram of
the injection locked oscillator (ILO) and its phase noise suppression effect. The
injection locking acts as a single-pole HPF that provides 20 dB/dec of in-band
noise shaping against the intrinsic phase noise of the OSC [82, 89]. Nonetheless,
this simple ILO suffers from the following three issues. Firstly, the jitter suppression
is sensitive to the frequency offset between the target frequency and the free-running
frequency of the OSC [81]. As the frequency deviation increases, the phase noise
tracking ability is significantly degraded while the spur increases dramatically.
Therefore, the ILO should be tuned close to the center of the locking range for
best jitter performance. Secondly, this injection locking technique cannot completely
suppress the 1/f^3 noise. This problem becomes particularly prominent for ring OSCs
implemented in deep sub-micron CMOS processes because their flicker-noise corner
frequencies usually reach tens of MHz [82]. Consequently, phase calibration
mechanisms are needed to assist in suppressing the 1/f^3 noise of the OSC. Thirdly, the small
locking range of the ILO reduces its robustness and reliability against PVT variations.
To address these issues, a frequency tracking loop (FTL) is introduced to provide a
proper control voltage such that the natural oscillation frequency of the VCO always
stays around the desired multiple of the injection frequency [see the left diagram in Fig.
2.7(d)]. This FTL brings the following two benefits [83, 84]. One is that the
frequency deviation between the target frequency and the natural frequency of the VCO can be
minimized. This not only enhances the jitter suppression effect of the injection locking
[see the right diagram in Fig. 2.7(d)], but also improves the robustness of the system
since the frequency deviation can always be kept within the locking range of the
IL-VCO. The other is the noise shaping ability, which helps to suppress the in-band
noise of the VCO. Combined with the 20 dB/dec low-frequency noise suppression of
the injection locking, the 1/f^3 noise of the VCO can be effectively attenuated.

2.3.1.2 Clock Distribution

Clock distribution plays an important role in modern high-speed multi-lane transceiver
applications. The clock frequency can range from a few GHz to tens of GHz, and
the distribution distance can reach several millimeters when a common clock
lane is amortized across multiple data lanes [61]. Moving such high-frequency clocks
over such long distances poses significant challenges for on-chip clock
distribution [90]. Firstly, the ever-increasing clock frequency compresses the absolute
jitter budget for timing uncertainty and duty-cycle error. Secondly, the increasing
distribution distance is approaching one tenth of the “electrical length” of the
transmitted clock, so the connection wires exhibit transmission line
characteristics. Thirdly, the parasitic resistance and capacitance limit the bandwidth
of the interconnect wires. This problem becomes even more severe as the feature
size scales down, because the scaled geometry could significantly
increase the parasitic effects, hence degrading the bandwidth of the connection wires.

[Figure 2.8 schematics: four clock distribution structures in panels (a) to (d); panel (c)
is labelled “Transmission Line”.]

Figure 2.8: Clock distribution structures based on (a) inverter chain, (b) CML chain,
(c) transmission line, and (d) inductive load.
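
To put the one-tenth rule into numbers, the sketch below estimates the on-chip wavelength of a clock and compares a wire length against one tenth of it. It interprets the “electrical length” above as the wavelength in the dielectric and assumes an effective relative permittivity of 4.0 (roughly that of silicon dioxide); both are assumptions made for illustration only.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum, m/s

def on_chip_wavelength(f_clk_hz, eps_r_eff=4.0):
    """Wavelength of a clock in a dielectric with effective permittivity eps_r_eff."""
    return C0 / (f_clk_hz * math.sqrt(eps_r_eff))

def needs_tline_model(length_m, f_clk_hz, eps_r_eff=4.0):
    """Rule of thumb: treat wires longer than about 1/10 wavelength as transmission lines."""
    return length_m >= on_chip_wavelength(f_clk_hz, eps_r_eff) / 10.0

lam = on_chip_wavelength(10e9)
print(f"wavelength at 10 GHz: {lam * 1e3:.1f} mm, one tenth: {lam * 1e2:.2f} mm")
print(needs_tline_model(3e-3, 10e9), needs_tline_model(0.5e-3, 10e9))
```

Under these assumptions, a 10 GHz clock routed over a few millimeters already exceeds the one-tenth threshold, whereas a sub-millimeter wire does not.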
Fig. 2.8 shows the four widely used clock distribution techniques. The most
traditional method is to employ a buffer chain that can be implemented by either simple
inverters [see Fig. 2.8(a)] or compact CMLs [see Fig. 2.8(b)]. In these two approaches,
the transmission wire is divided into several segments to optimize the desired metrics,
e.g., delay time, jitter performance, and power consumption. The analysis in [91]
shows that there exist an optimal segment number and wire geometry for a specific
distribution distance and a distinct optimization metric (delay, jitter, or power). Compared
to the full-swing digital inverter, the CML buffer is more suitable for high-frequency
clock distribution for the following reasons. Firstly, the propagation delay of the
CML buffer is much shorter than that of the logic inverter, since the CML buffer can use
a small swing to reduce the edge-transition time. Secondly, the CML buffer can fully exploit
the process potential, as its compact NMOS driving topology naturally features fast
current switching and small parasitic capacitance. Thirdly, the CML buffer with
resistor loads has much lower delay sensitivity to supply noise than inverters [92], due
to its excellent power supply rejection ratio (e.g., 5× in Intel 90 nm 1.2 V CMOS
process). The main disadvantage of the CML buffer is its high power consumption, because
it always draws current from the supply even when the clock is not switching [61].
Considering that the delay variation with respect to supply fluctuation is
mainly caused by the clock buffers rather than the transmission wires [91], minimizing
the delay through the clock buffers helps to reduce the delay susceptibility of the
clock network to power-supply noise.
Fig. 2.8(c) shows a repeaterless clock distribution network, which usually employs
an open-drain CML buffer to drive the terminated on-chip transmission lines.
The measurement results in [92] demonstrate that a 10 GHz global
clock can be transmitted nearly 3 mm using an open-drain buffer to drive a pair of
differential transmission lines with on-chip terminations. The delay of the transmission
line is the smallest among these schemes because the signal propagates at close to the
speed of light in the dielectric. The characteristic impedance does not need to be
exactly 100 Ω as long as it matches the far-end terminations. In practice, a large
characteristic impedance is preferred, because
it not only improves the ratio of the impedance to the metal resistance, but also saves
power in the driving CML buffer by reducing the driving current. Nonetheless,
the characteristic impedance is limited by the parasitic capacitance per unit wire length.
The design in [91] shows that a 120 Ω differential characteristic impedance can be
achieved by adjusting the metal geometry parameters such as metal layer, width, and
spacing. Due to the non-negligible resistance of the transmission line, it exhibits a
limited bandwidth and hence amplifies random jitter and duty-cycle distortion
during clock transmission. The analysis in [61] indicates that these amplification
effects increase very rapidly once the clock frequency exceeds the effective bandwidth
of the transmission line. Because the main issue associated with the clock distribution
line is its large capacitive loading, many researchers have proposed LC
resonance-based clock distribution to neutralize this capacitance [93, 94, 95]. As shown
in Fig. 2.8(d), the introduced differential spiral inductor and the parasitic wire
capacitance constitute an LC tank. Owing to the energy cycling
and impedance peaking of the LC resonance [93], this clock distribution scheme
exhibits great potential for power reduction and clock jitter suppression [94, 95]. It is
worth noting that the quality factor Q of the on-die inductor does not play a key role in
this clocking network, since the resistance of the long wire dominates the Q of the
LC tank [91]. Consequently, a compact multi-layer inductor can be used to save die
area. Since the impedance of the LC tank is frequency selective, this scheme
is not suitable for applications that need to support a wide operation range.
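
The size of the inductor needed to resonate out the wire capacitance follows from the standard LC resonance relation f = 1/(2π√(LC)). The sketch below applies it to a hypothetical 500 fF clock-wire load at 10 GHz; the capacitance value is an assumption and not a figure from the cited designs.

```python
import math

def resonant_inductance(f_clk_hz, c_wire_f):
    """Inductance that resonates with c_wire_f at f_clk_hz: L = 1 / ((2*pi*f)^2 * C)."""
    return 1.0 / ((2.0 * math.pi * f_clk_hz) ** 2 * c_wire_f)

def resonant_frequency(l_h, c_f):
    """Resonance frequency of an ideal LC tank."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_h * c_f))

l_needed = resonant_inductance(10e9, 500e-15)   # hypothetical 500 fF wire load
print(f"L = {l_needed * 1e9:.3f} nH")
print(f"check: {resonant_frequency(l_needed, 500e-15) / 1e9:.2f} GHz")
```

Since the tank is tuned to a single frequency, changing the clock frequency moves it away from resonance, which reflects the limited operation range noted above.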

2.3.2 Transmitter Techniques

2.3.2.1 Driving Mode

According to the driving mode, the output stage of the transmitter can be mainly
divided into current-mode logic (CML) and source-series terminated (SST) drivers. Fig.
2.9(a) shows the implementation details of a typical CML driver, which consists of
a differential pair, a pair of resistive loads, and a tail current source. Compared to the SST
driver described in Fig. 2.9(b), it possesses the good properties of high-speed switching,
adjustable output swing, good impedance matching, and convenient integration of
peaking inductors [96, 97]. These features endow it with the capability of exploiting
the maximum process potential, thus making it more suitable for cutting-edge drivers
that operate at tens of Gb/s. Recently, 50-64 Gb/s transmitters using CML drivers have
been implemented in 65 nm CMOS process [24, 25, 26]. The SST driver evolves from
the traditional CMOS inverter, where 50 Ω resistors are inserted in each branch to
reduce the impedance discontinuities and thus the reflections. The SST driver
demonstrates a high power efficiency: for the same output swing, it consumes only one
fourth of the current of the CML driver (4 mA versus 16 mA in Fig. 2.9). The symmetrical
topology makes it compatible with all of
the low, high, and mid common-mode terminations. Nonetheless, the large self-load
capacitances, slow PMOS transistors, and incompatibility with bandwidth-extension
inductors limit its maximum operation speed. These characteristics make the SST driver
popular in power-sensitive high-volume designs using advanced processes with
adequate speed margins. For example, a 28 Gb/s SST transmitter has been fabricated
in a 32 nm CMOS [98] and a 16-40 Gb/s NRZ/PAM4 dual-mode transmitter utilizing
an SST driver has been implemented in a 14 nm CMOS [99].

[Figure 2.9 schematics: (a) CML driver with 50 Ω loads, a 100 Ω differential termination,
a 16 mA tail current that splits into 12 mA and 4 mA, and a 400 mV swing; (b) SST driver
with 50 Ω series resistors, a 100 Ω differential termination, a 4 mA current, and a
400 mV swing.]

Figure 2.9: Typical transmitter driver modes. (a) CML mode and (b) SST mode.
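
The one-fourth current ratio can be reproduced with simple resistive-divider arithmetic using the values annotated in Fig. 2.9 (50 Ω loads, 100 Ω differential termination, 400 mV across the termination). The sketch below is an idealized DC calculation that assumes the CML tail current is fully steered to one side; the function names are hypothetical.

```python
def cml_supply_current(v_term, r_load=50.0, r_term_diff=100.0):
    """Tail current of a CML driver for v_term volts across the far-end termination.

    The steered tail current splits between the local load (r_load) and the
    path through the far-end termination plus the opposite load.
    """
    i_term = v_term / r_term_diff                        # current in the termination
    i_local = i_term * (r_term_diff + r_load) / r_load   # current in the local load
    return i_term + i_local

def sst_supply_current(v_term, r_term_diff=100.0):
    """An SST driver supplies only the current flowing through the termination."""
    return v_term / r_term_diff

i_cml = cml_supply_current(0.4)
i_sst = sst_supply_current(0.4)
print(f"CML: {i_cml * 1e3:.0f} mA, SST: {i_sst * 1e3:.0f} mA, ratio {i_cml / i_sst:.0f}x")
```

In the CML case three quarters of the tail current flows through the local load rather than the far-end termination, which is where the factor of four comes from.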

2.3.2.2 Multiplexing Scheme

The serializer usually utilizes a multiplexing tree to combine the low-speed parallel
data into a high-speed stream. Each multiplexing stage is composed of a multiplexer
(MUX) and several latches, where the latches are placed before the MUX to guarantee
sufficient timing margin for the following data selection and/or data sampling. These
timing constraints pose significant challenges for the high-speed serialization
in the last few stages. According to the ratio of the data rate to the maximum clock
frequency, the transmitters can be partitioned into half-rate architecture and
quarter-rate architecture. Fig. 2.10 describes the conceptual implementations and timing
requirements of the two typical multiplexing schemes.

[Figure 2.10 diagrams: (a) two quarter-rate 2:1 MUXs (D1/D3 and D2/D4) with latches (L)
feeding a half-rate 2:1 MUX, clocked through /2 dividers; (b) latches feeding a single
4:1 MUX driven by the quarter-rate phases PH0, PH90, PH180, and PH270; (c) critical path
between CK1 and CK2 with tdiv, tck-q, and tsetup; (d) 2:1 MUX timing with
tsetup + thold = 1 UI; (e) 4:1 MUX timing with tsetup + thold = 3 UI.]

Figure 2.10: Schemes of the final 4:1 multiplexing. (a) Half-rate topology based on
two-stage 2:1 MUXs, (b) quarter-rate structure based on direct 4:1 MUX, (c) critical
path and timing diagram of the 2:1 MUX, (d) timing margin of the 2:1 MUX, and (e)
timing margin of the 4:1 MUX.
For the half-rate architecture, the final 4:1 multiplexing is implemented by three
2:1 MUXs, where two of them work in quarter rate and the final one operates at half
rate [see Fig. 2.10 (a)]. This serialization topology is ubiquitously used mainly owing
to its simple clocking scheme, which only requires a pair of complementary clocks to
alternatively select the input data. The pulse width of the MUX output is subject to the
duty cycle of the driving clocks; thus, a 50% duty cycle is required. In practical designs,
a duty cycle correction circuit is usually employed to guarantee the desired duty cycle.
The main drawbacks of this architecture are the tight timing constraints and the large
number of latches (15 for the 4:1 serialization). Fig. 2.10(c) and (d) display the two
possible critical paths. One is located at the first latch in the final 2:1 MUX, where the
summation of the delay of the divider (by 2), the ck-to-q of the previous 2:1 MUX, and
the setup time of the latch must be smaller than 1 unit interval (UI). The other occurs at
the final 2:1 MUX, where the data selection margin [i.e., tsetup + thold in Fig. 2.10(d)]
is only 1 UI. When the data rate reaches several tens of Gb/s, it becomes a nontrivial
task to satisfy these timing requirements. The delay variations along with different
PVT corners make this problem even more challenging. To overcome this difficulty,
traditional half-rate transmitters often insert extra delay matching buffers [27, 24] or
phase calibration loops [100, 33, 26] between CK1 and the latch [see Fig. 2.10(a)]. For
the former method, the delay fluctuation between the multiplexing path and the matching
buffer may exceed 1 UI and thereby cause bit errors. For the latter approach, the
automatic phase adjustment is limited by the accuracy of phase detection, which could
reduce the stability, reliability, and robustness of the serializer. Additionally, both of
these techniques involve substantial power and area overheads.
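
The first critical-path condition (divider delay plus ck-to-q delay plus setup time smaller than 1 UI) can be checked numerically. The delays in the sketch below are hypothetical values, and the 20% slow-corner scaling is an arbitrary assumption used only to show how PVT variation erodes the margin.

```python
def half_rate_margin(data_rate_bps, t_div, t_ck_q, t_setup):
    """Slack (s) of the critical path in Fig. 2.10(c): 1 UI - (t_div + t_ck-q + t_setup)."""
    ui = 1.0 / data_rate_bps
    return ui - (t_div + t_ck_q + t_setup)

# Hypothetical delays at a typical corner, then scaled by 20% for a slow corner.
typ = dict(t_div=8e-12, t_ck_q=9e-12, t_setup=5e-12)
slow = {name: 1.2 * value for name, value in typ.items()}

for corner, delays in (("typical", typ), ("slow", slow)):
    slack = half_rate_margin(40e9, **delays)
    print(f"{corner}: slack = {slack * 1e12:+.1f} ps")
```

At 40 Gb/s the unit interval is only 25 ps, so a path that closes timing at the typical corner can fail once the delays grow by a modest amount.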
For the quarter-rate architecture, the final 4:1 multiplexing is performed by a single
4:1 MUX, where the input data operate at the quarter rate [see Fig. 2.10(b)]. This
serialization structure has attracted increasing attention for applications beyond 10
Gb/s. This is because it not only addresses the timing issues in the traditional 2:1 MUX by
removing the critical path in Fig. 2.10(c) and relaxing the data-selection margin from
1 UI [see Fig. 2.10(d)] to 3 UI [see Fig. 2.10(e)], but also saves substantial power by halving
the maximum clock speed and removing the half-rate latches [see Fig. 2.10(e)].
However, these benefits come with the penalty of a doubled self-drain capacitance, which
dramatically degrades the bandwidth of the 4:1 MUX, hence limiting its maximum
operation speed. Another difficulty associated with this 4:1 MUX is how to generate
the evenly spaced 90° multi-phase clocks and produce the UI-spaced input sequences
for the data selection. Both of these issues are addressed in this thesis, as
detailed in Chapter 4.
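
The two architectures differ in clocking and timing but produce the same serialized bit order. The behavioural sketch below ignores latches and timing entirely and models only the data ordering, following the lane pairing of Fig. 2.10(a) (D1 with D3, D2 with D4); the function names and lane labels are hypothetical.

```python
def mux2(a, b):
    """2:1 MUX: interleave two equal-length streams, taking from a first."""
    out = []
    for x, y in zip(a, b):
        out.extend((x, y))
    return out

def serialize_half_rate(d1, d2, d3, d4):
    """Two-stage tree: (D1, D3) and (D2, D4) are merged, then a final 2:1 MUX."""
    return mux2(mux2(d1, d3), mux2(d2, d4))

def serialize_quarter_rate(d1, d2, d3, d4):
    """Direct 4:1 MUX: one bit from each lane per quarter-rate clock period."""
    out = []
    for bits in zip(d1, d2, d3, d4):
        out.extend(bits)
    return out

lanes = (["a1", "a2"], ["b1", "b2"], ["c1", "c2"], ["d1", "d2"])
assert serialize_half_rate(*lanes) == serialize_quarter_rate(*lanes)
print(serialize_quarter_rate(*lanes))
```

The pairing of D1 with D3 and D2 with D4 in the first stage is what makes the final 2:1 MUX output the lanes in the order D1, D2, D3, D4.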

[Figure 2.11 block diagrams, each producing the tap sequences Dout1, Dout2, ..., Doutn:
(a) a 4:1 MUX (Din1-Din4) followed by a chain of flip-flops driven by a full-rate clock;
(b) latches and 2:1 MUXs driven by complementary quarter-rate and half-rate clocks
(0°/180°, with DIV2); (c) latches and 4:1 MUXs driven by multi-phase quarter-rate clocks
(0°, 90°, 180°, 270°); (d) a 4:1 MUX followed by a chain of delay lines whose control
voltage VCTRL comes from a DLL-based delay line bias generator (DIV2, delay line, PD, LPF)
driven by a half-rate clock.]

Figure 2.11: Techniques of 1-UI delay generation based on (a) full-rate FF, (b) half-rate
2:1 MUX, (c) quarter-rate 4:1 MUX, and (d) analog delay line.

2.3.2.3 1-UI Delay Generation

TX-FFE, which acts as a finite impulse response (FIR) filter and pre-distorts
the transmitted signal, is one of the most common techniques employed in high-speed
serial links to alleviate the ISI caused by the frequency-dependent channel loss.
In practical designs, the FIR taps are usually driven by full-rate 1 UI-spaced sequences.
To accommodate the exponentially growing data rate, the 1 UI delay generation
techniques have also evolved. Fig. 2.11 summarizes the mainstream 1 UI delay generation
techniques utilized in previous FFE implementations.
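
The role of the 1 UI-spaced sequences becomes clear when the FFE is written as a symbol-rate FIR filter, in which each tap weights a copy of the data delayed by one more UI. The sketch below is a behavioural model only; the three tap weights (pre-cursor, main cursor, post-cursor) are hypothetical and not taken from any of the cited designs.

```python
def tx_ffe(symbols, taps):
    """FIR pre-distortion: y[n] = sum_k taps[k] * x[n - k], with x = 0 before the start.

    Each successive tap sees the data delayed by one additional UI.
    """
    out = []
    for n in range(len(symbols)):
        acc = 0.0
        for k, coeff in enumerate(taps):
            if n - k >= 0:
                acc += coeff * symbols[n - k]
        out.append(acc)
    return out

# Hypothetical 3-tap setting: pre-cursor, main cursor, post-cursor.
taps = [-0.1, 0.7, -0.2]
bits = [0, 0, 1, 1, 1, 1, 0, 0]
nrz = [2 * b - 1 for b in bits]            # map 0/1 to -1/+1
print([round(v, 2) for v in tx_ffe(nrz, taps)])
```

With negative pre- and post-cursor weights, the output amplitude is larger right after a data transition than during a run of identical bits, which boosts the high-frequency content that the channel attenuates.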
The most general method is to utilize flip-flops (FFs) driven by a full-rate clock
to sequentially retime the serial data stream [see Fig. 2.11 (a)]. The main advantage
of this approach is its compactness, which only requires one FF for each tap sequence
generation. As the data rate exceeds the maximum reliable operation rate (e.g., 10
Gb/s for 65 nm CMOS [101]) of the FFE, the full-rate structure inevitably consumes
substantial power because every single block in it has to be realized in power-hungry
CML. Constrained by the ck-to-q delay, this FF-based 1 UI delay generator even with
CML topology fails to operate beyond 24 Gb/s in 65 nm CMOS process [34]. Another
drawback of this structure is that it needs a sophisticated full-rate clock tree to drive the
heavy loads of these retiming FFs, which results in considerable power consumption
and area occupation. The stringent full-rate timing requirement can be relaxed by
a half-rate structure based on 2:1 MUXs or a quarter-rate architecture based on 4:1 MUXs
[see Fig. 2.11(b) and (c)]. As discussed in [101], the half-rate structure in 65 nm
CMOS running at 20 Gb/s saves 12 mW (50%) of power in contrast to its FF-based
counterpart. Compared to the half-rate structure, the quarter-rate architecture further
relaxes the critical path timing margin from 1 UI to 3 UI and halves the maximum
clock speed, thus showing more potential in cutting-edge transceiver designs.
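
The FF-based scheme of Fig. 2.11(a) can be pictured as a shift register clocked once per UI, so that each tap sequence is the serial stream delayed by a whole number of UIs. The sketch below is only a behavioural illustration of that idea; the function name `tap_sequences` and the assumption that the flip-flops reset to 0 are introduced here and are not taken from the cited designs.

```python
def tap_sequences(bits, n_taps):
    """Return n_taps copies of `bits`, where copy i is delayed by i UI.

    Models a chain of flip-flops clocked once per UI. The flip-flops are
    assumed to reset to 0, so each delayed copy starts with zeros.
    """
    bits = list(bits)
    n = len(bits)
    # Prepend i zeros (the reset state) and keep the original length.
    return [([0] * i + bits)[:n] for i in range(n_taps)]


if __name__ == "__main__":
    for i, seq in enumerate(tap_sequences([1, 0, 1, 1, 0, 0, 1], 3)):
        print(f"tap {i}: {seq}")
```

In the half-rate and quarter-rate structures the same 1 UI-spaced sequences are produced, but by multiplexing lower-rate streams rather than by retiming at the full rate.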
As the bit period approaches the delay of a single buffer, the desired 1 UI delay
can also be produced by an analog delay line [see Fig. 2.11(d)], where a DLL-based
bias generator is often integrated to adaptively tune the control voltage of the delay
line [102]. The delay cell can be implemented with LC cells [24] or CML buffers [103].
Nonetheless, these techniques suffer from either a large area penalty (LC cells) or
high power consumption (CML buffers). Additionally, the delay
produced by the analog delay line is susceptible to PVT variations, power fluctuation,
and substrate noise. Moreover, the limited tuning range makes this technique
suitable only for narrow-range applications [104]. As an example, the design in [16]
demonstrates that the power consumption for each tap in the LC-cell delay-line-based FFE is
about 12 mW, which is much lower than that (48 mW) of the multi-MUX-based FFE.
On the other hand, it cannot support data rates below 50 Gb/s and occupies
a total area of 1.2 mm², which is about twice that of the multiple-MUX-based designs
in [104, 105].

Figure 2.12: CDR topologies without a reference. (a) Single control of VCO frequency
tuning and (b) coarse and fine control of VCO frequency tuning.

2.3.3 Receiver Techniques

2.3.3.1 CDR Architectures

Modern CDR designs mainly use a dual-loop architecture consisting of
a frequency tracking loop (FTL) and a phase tracking loop (PTL), where the FTL is in
charge of frequency capture, and the PTL is responsible for phase position adjustment
[106]. According to whether or not an external reference clock is needed, CDRs can be
categorized into reference-less CDRs and reference CDRs. In the former, the frequency
information is extracted from the received random data through a frequency detector
(FD), while the latter utilizes a traditional PLL to pull the VCO oscillation
frequency to the target value. Both topologies integrate
a similar PTL with a dedicated phase detector (PD) to finely adjust the sampling
position of the recovered clock to the mid-point of the incoming data.
Reference-less CDR- Reference-less CDRs arise from applications where the use
of an external crystal is not feasible [107]. One example is a repeater for either optical
or copper media, in which the space and pin count are too limited to accommodate
an external crystal oscillator. Additionally, adding a low-noise, rate-adjustable crystal
could increase the overall cost and complexity of these receivers [108].
Fig. 2.12(a) depicts a CDR without a reference clock, where the currents generated
by both the FTL/CP1 and PTL/CP2 are applied to a common LPF to produce the
control voltage of the VCO [109]. During either CDR startup or loss of phase lock, the
FD generates a control voltage through the CP1 and LPF to coarsely
tune the VCO oscillation frequency towards the input data rate. When the frequency
difference between the VCO and the input data falls into the capture range of the PTL,
the PD takes over to finely adjust the control voltage through the CP2 and LPF, thus
aligning the VCO output clock with the input data phase [110]. There are two
possible issues associated with this CDR architecture. Firstly, the FTL and the PTL
may interfere with each other when the voltage control is transferred from
the FD to the PD, resulting in prominent ripples on the VCO control line that could
even lead to a phase-lock failure [111]. Secondly, the FD could become momentarily
confused about the actual input data rate if the received data contains random runs of
consecutive identical digits or if the received rising and falling edges are corrupted
by the channel loss or electromagnetic crosstalk. To mitigate the effects of these two
issues, the loop bandwidth of the FTL is often chosen to be much smaller than that
of the PTL so as to reduce the noise contribution from the FD and preserve the clock
quality of the VCO [111]. Meanwhile, a CDR bandwidth proportional to the data
rate is required to satisfy the protocol specification. To independently optimize the
bandwidths of the FTL and the PTL, separate LPFs are adopted in the two loops [see
Fig. 2.12(b)], where the control voltages generated by the FTL and PTL respectively
drive the coarse control and fine control of the VCO [110]. The main drawback of this
architecture is that it requires a larger area due to the presence of the two LPFs. To alleviate
this area overhead, a hybrid analog/digital loop filter is developed in [112].
Figure 2.13: CDR topologies with a reference. (a) Dual VCO architecture, (b) sequential
locking topology, (c) PI-based structure, and (d) variant of PI-based structure.

Reference CDR- Fig. 2.13 summarizes the main CDR topologies with a reference, in
which a traditional PLL is embedded to initially adjust the VCO oscillation frequency.
Fig. 2.13(a) displays the dual-VCO architecture, which uses the conventional PLL
to lock the output clock phase of the VCO2 to that of the reference input Fref [113]. By
applying the control voltage of the VCO2 in the PLL to the replica VCO1 through
an additional LPF0, the oscillation frequency of the VCO1 should be very close to or
equal to the target value. The remaining frequency offset as well as the output clock
phase error with respect to the input data is finely tuned by the PTL. To accomplish a
fast lock acquisition and maintain a fine control of the VCO1, the slew rate of the FTL
should be higher than that of the PTL while the bandwidth of the FTL must be lower
than that of the PTL. On one hand, the physical separation of the FTL and the PTL
makes it easier to meet the lock-acquisition, loop stability, and tracking bandwidth
requirements. On the other hand, there are two possible problems associated with this
CDR architecture. One is the mismatch between VCO1 and VCO2, which may lead
to a difference in oscillation frequency even though the two VCOs share one coarse
control voltage. The other is the frequency pulling between the two VCOs in asynchronous
systems. Specifically, an asynchronous system often allows a
certain frequency offset between the transmitted data and the local clock.
The frequency pulling could make the output frequency of VCO1 shift away from the
incoming data rate and towards N×Fref. This could be especially problematic when a
spread spectrum clock is required, since the pulling may make the output
frequency of VCO1 unresponsive to its fine control input. Another issue associated with
this CDR is the area overhead, especially when an LC-VCO is adopted. To address
the pulling issue and reduce the area overhead, a sequential locking scheme is proposed
in [102, 114] to remove the need for dual CPs, LPFs, and VCOs. This CDR
is presented in Fig. 2.13(b), which utilizes a lock detector (LD) to alternately enable
the FTL and the PTL by continuously monitoring the frequency locking state. During
the CDR startup, the FTL is selected first to tune the control line of the VCO and pull
the oscillation frequency towards the target frequency N×Fref. If the LD detects that
the divided clock of the VCO output is locked to Fref, it disables the FTL
and enables the PTL. When frequency lock is lost, the LD switches from the PTL
back to the FTL to start a lock recovery. One potential problem in this topology is that the
transition from the FTL to the PTL may disturb the VCO control voltage and therefore
cause a VCO frequency shift. Once the frequency shift is beyond the capture range of
the PTL, a failure of phase lock could happen [111].
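
The sequential locking behaviour of Fig. 2.13(b) can be summarised as a two-state controller. The sketch below is a behavioural model only: the class name, the ppm-based lock criterion, and the two hysteresis thresholds are assumptions made for illustration and are not values reported in [102, 114].

```python
class SequentialLockController:
    """Two-state lock detector that enables either the FTL or the PTL."""

    def __init__(self, lock_ppm=200.0, unlock_ppm=1000.0):
        # Hypothetical thresholds; the gap between them provides hysteresis
        # so that the controller does not chatter near the lock boundary.
        self.lock_ppm = lock_ppm
        self.unlock_ppm = unlock_ppm
        self.active_loop = "FTL"  # the FTL is selected first at startup

    def update(self, freq_error_ppm):
        """freq_error_ppm: offset between the divided VCO clock and Fref."""
        err = abs(freq_error_ppm)
        if self.active_loop == "FTL" and err <= self.lock_ppm:
            self.active_loop = "PTL"  # frequency locked: hand over to the PTL
        elif self.active_loop == "PTL" and err >= self.unlock_ppm:
            self.active_loop = "FTL"  # frequency lock lost: re-acquire
        return self.active_loop


if __name__ == "__main__":
    ctrl = SequentialLockController()
    for err in [5000, 800, 150, 600, 1500, 100]:
        print(err, ctrl.update(err))
```

The model deliberately ignores the control-voltage disturbance at the FTL-to-PTL handover, which is the failure mechanism discussed above.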
Fig. 2.13(c) presents another typical reference CDR based on a phase interpolator
(PI) [14, 115]. The conventional PLL is adopted to provide multi-phase clocks with a
frequency of N×Fref that is very close or equal to the incoming data rate. These clocks
are further rotated by a PI driven by the PTL to lock the phase of the recovered clock
to that of the input data. The availability of high-frequency clocks gives this
architecture faster phase acquisition, increased system
stability, and less jitter peaking. It is worth noting that jitter peaking in a PI-based
CDR is absent only when the PTL is a first-order loop and the loop latency is not
significantly larger than the phase update period. Otherwise, fast-changing jitter
may have already reversed its direction by the time the updated phase code reaches the
PI [116]. Additionally, the physical separation of the FTL and the PTL makes it easier
to satisfy the loop bandwidth and stability requirements. This separation also allows
the clock lane consisting of PLL and bias generator to be shared by multiple data lanes,
thus making it a popular architecture in parallel-lane applications. Another advantage
of the PI-based CDR is the fully digital implementation of the loop filter, which
leads to smaller area occupation and less sensitivity to PVT variations. The primary
problem with this CDR is the discrete phase update steps, which may result in
prominent cycle-to-cycle jitter. The steady-state oscillation existing in the digital PTL
could make this impact even more severe, especially when the loop latency is large. To
smooth out the discrete phase steps, the PI-based CDR evolves into the structure shown
in Fig. 2.13(d), where the feedback clock and recovered clock respectively applied to
the divider and the sampler in the PD are swapped. The primary advantage of this
evolved CDR is that the discrete phase shift in the PI can be smoothed out by the LPF
in the FTL, which provides a smooth phase shift in the PTL. However, it requires an
FTL in each receiver lane, thus making it unsuitable for multilane applications.

Figure 2.14: Two typical CDR PDs. (a) Hogge PD implementation, (b) Hogge PD
detection mechanism, (c) Hogge PD gain, (d) Alexander PD implementation, (e) Alexander
PD detection mechanism, and (f) Alexander PD gain.
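
The effect of loop latency on the steady-state oscillation of a digital PTL, noted above for the PI-based CDR, can be illustrated with a very small behavioural model. The sketch assumes a noiseless input with a fixed phase, one early/late vote per update, and a first-order loop that moves the PI code by one step per vote; the function names and the numbers used are hypothetical.

```python
def simulate_bb_pi_cdr(n_updates, target_phase, pi_step, latency):
    """First-order bang-bang loop driving a phase interpolator (PI).

    target_phase: data phase to be tracked (same unit as pi_step)
    pi_step:      PI phase resolution
    latency:      number of update periods between a vote and its effect
    Returns the PI output phase after every update.
    """
    code = 0
    in_flight = [0] * latency  # votes that have not yet reached the PI
    phases = []
    for _ in range(n_updates):
        pi_phase = code * pi_step
        vote = 1 if target_phase > pi_phase else -1  # early/late decision
        in_flight.append(vote)
        code += in_flight.pop(0)  # apply the oldest pending vote
        phases.append(code * pi_step)
    return phases


def steady_state_dither(phases):
    """Peak-to-peak phase wander over the second half of the run."""
    tail = phases[len(phases) // 2:]
    return max(tail) - min(tail)


if __name__ == "__main__":
    for latency in (0, 4):
        run = simulate_bb_pi_cdr(200, 10.5, 1.0, latency)
        print(latency, steady_state_dither(run))
```

With zero latency the recovered phase toggles by a single PI step around the target, whereas a nonzero latency lets the loop overshoot before the corrective votes arrive, so the limit cycle grows.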


2.3.3.2 CDR Phase Detector

The main functions of the PD in CDR systems are to compare the phase difference
between the input data and the recovered clock, provide information to adjust
the sampling position, and simultaneously retime the incoming serial signal. Fig. 2.14
summarizes the implementations and behaviors of the widely used linear Hogge PD and
the non-linear Alexander PD [i.e., bang-bang PD (BBPD)].
Fig. 2.14(a) and (b) describe the implementation and operation waveforms of the
Hogge PD. The phase differences between the input data and the recovered sequence
are converted to high pulses [see signal X in Fig. 2.14(b)] by the top XOR. Meanwhile,
the reference pulses [see signal Y in Fig. 2.14(b)], whose width equals half of the clock cycle, are
produced by XORing the recovered sequence and its half-clock-cycle delayed version.
Taking the width difference of X and Y as the PD output, the phase error between the
optimal sampling position (i.e., lagging the data transition by half of a clock cycle) and
the rising edge of the recovered clock can be obtained. Fig. 2.14(c) gives the phase
transfer characteristics, and its PD gain can be given by,

\[ K_{PD} = \frac{1}{\pi}\,(TD) \quad (\text{unit of radian}^{-1}), \tag{2.9} \]

where TD is the transition density. The main advantage of the Hogge PD is that
it provides both sign and magnitude information of the sampling phase error, which
allows a linear feedback loop to be constructed. On the other hand, there also exist several
imperfections in the Hogge PD. Firstly, the ck-to-q delay of the first data-sampling FF
widens the proportional pulse (signal X) but does not affect the reference pulse (signal Y), thus causing
a skew of ∆T (i.e., the ck-to-q delay of the FF) when the CDR loop is locked. This
skew effect becomes a serious issue at high speeds since ∆T can occupy a significant
fraction of the clock period. The resulting phase offset may exceed several tens of
degrees, thus degrading the sampling phase margin and finally deteriorating the jitter
tolerance. This phase shift can be compensated by either narrowing the proportional
pulses or widening the reference pulses through inserting a proper dummy delay element
[67]. Nonetheless, the delay introduced by the dummy element may not track the FF
delay well against PVT variations. Another drawback of the Hogge PD stems from the
half-cycle shift between the two XOR outputs [see Fig. 2.14 (b)], where the reference
pulse follows the proportional pulse. This phase shift makes the CP driven by the Hogge
PD create tri-wave currents and hence generate ripples on the VCO control line, which
could severely disturb the VCO output phase. This tri-wave issue can be ameliorated
by introducing two additional reference pulses at a cost of one more full-rate latch and
two more power-hungry XOR gates [117]. Finally, the output pulses of the Hogge PD
are approximately half of the bit period wide, which demands extremely high-speed XOR
gates to generate these narrow pulses. Combined with the complex implementation
of the XOR, the Hogge PD could become the speed bottleneck of the whole CDR. As
a consequence, the Hogge PD is suitable for CDR designs with a low to moderate data
rate, where a sufficient margin can be guaranteed for the narrow pulse generation.
Fig. 2.14(d) describes the implementation of the BBPD. It utilizes three data samplers
driven by three consecutive 180°-shifted clocks along with two XOR gates to
determine whether the clock leads or lags the data when there is a data transition. If
there is no data transition, the outputs of the three samplers are identical and
hence the outputs of the two XORs remain at “0”. In the presence of a data transition, the
BBPD produces the early signal Y and the late signal X by XORing the edge sample with its
previous data and following data, respectively. Fig. 2.14(e) illustrates the waveforms
under the two possible locking conditions, namely, clock late and clock early. The
BBPD only outputs the sign information of the phase error in the form of an early
or late pulse with a fixed width; therefore, its gain is ideally infinite at zero phase error
[see the left diagram in Fig. 2.14(f)]. However, this gain is linearized by the
metastability of the samplers, the timing uncertainty of the input data, and the jitter of
the edge-sampling clocks. Previous studies [118, 119, 120] have demonstrated that
the overall phase transfer function of the BBPD in practical CDRs can be obtained by
convolving the ideal PD transfer function with the probability density function (PDF)
of the total jitter [see Fig. 2.14(f)] and its gain can be approximated as,

\[ K_{PD} \approx \frac{2}{J_{PP}}\,(TD) \quad (\text{unit of radian}^{-1}), \tag{2.10} \]

where TD is the transition density, and J_PP denotes the peak-to-peak jitter (including
the sampler metastability, input data jitter, and edge-sampling clock jitter). The binary
quantization of the BBPD simplifies the phase comparison, which utilizes the
recovered data and quantized edge sequences to extract the early/late signals. Compared
to the linear Hogge PD that needs to process pulses no wider than half of the
bit period, the minimum pulse width involved in this nonlinear BBPD equals the bit
period. Hence, it is able to support an even higher data rate. By replacing the XORs
following the full-rate samplers with a group of parallel XORs after the demultiplexer,
the operation speed of the BBPD logic can be further reduced to normal digital logic speed.
Unlike the traditional linear PD whose outputs gently toggle around zero, the outputs of
the BBPD exhibit abrupt toggling between the two states of ‘1’ and ‘0’. On one hand,
the abrupt toggling may introduce larger disturbances on the control voltage line of
VCO-based CDRs. On the other hand, the fully digital operation makes it more
convenient to implement digital CDRs.
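
The linearization of the BBPD gain by jitter, expressed by Eq. (2.10), can be checked numerically under a simplifying assumption. The sketch below averages the ideal sign-type characteristic over jitter that is assumed to be uniformly distributed across the peak-to-peak range, and compares the resulting slope around zero phase error with 2·TD/J_PP. The function names, the uniform PDF, and the example values are illustrative choices; real jitter distributions will give a different numerical factor.

```python
def bbpd_average_output(phase_err, jpp, td=0.5, n=20001):
    """Average BBPD output for a static phase error (radians).

    The ideal characteristic td*sign(phase) is averaged over jitter that is
    assumed to be uniformly distributed over [-jpp/2, +jpp/2].
    """
    total = 0
    for k in range(n):
        jitter = -jpp / 2 + jpp * k / (n - 1)
        x = phase_err + jitter
        total += (x > 0) - (x < 0)  # sign(x) as -1, 0 or +1
    return td * total / n


def bbpd_gain_numeric(jpp, td=0.5):
    """Slope of the averaged characteristic around zero phase error."""
    delta = jpp / 10
    rise = bbpd_average_output(delta, jpp, td) - bbpd_average_output(-delta, jpp, td)
    return rise / (2 * delta)


def bbpd_gain_formula(jpp, td=0.5):
    """Eq. (2.10): K_PD is approximately 2*TD/J_PP."""
    return 2 * td / jpp


if __name__ == "__main__":
    jpp = 0.2  # hypothetical peak-to-peak jitter in radians
    print(bbpd_gain_numeric(jpp), bbpd_gain_formula(jpp))
```

The two values agree closely in this uniform-jitter case, which also shows why the effective BBPD gain, unlike the Hogge PD gain of Eq. (2.9), depends on the jitter present in the link.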

2.3.3.3 Clocked Comparator

The basic function of the clocked comparator is to sample and resolve the input
signal to binary ‘0’ or ‘1’ at each rising edge of the driving clock. The output is
determined by the polarity of the sampled instantaneous value compared to a specific
reference (e.g., zero for the NRZ modulation). Unlike digital latches, which can be
described by the setup time, hold time, and latch delay, the sampling latches in analog
applications are usually characterized by their sensitivity and bandwidth [121, 6]. To
obtain correct bit streams from the attenuated noisy analog input, samplers with high
timing precision and high input sensitivity are in strong demand.
Fig. 2.15 summarizes the two most popular samplers, which are based on the CML-type
latch and the Strong-Arm latch, respectively. To convert the analog input to a logic
output, the CML-latch-based clocked comparator requires two CML latches and one
CML2CMOS converter [see Fig. 2.15(a)], while the Strong-Arm-based counterpart
only needs one Strong-Arm latch and one RS latch [see Fig. 2.15(b)]. Fig. 2.15(c) and
(d) respectively display the latch sensitivity function and latch transfer function for
the two clocked comparators in Fig. 2.15(a) and (b) [6]. Referring to the discussion in
[6], the following conclusions can be made: (i) the sensitivity window of the Strong-Arm
latch is smaller than that of the CML-type latch, meaning that the Strong-Arm
latch shows better time resolution, (ii) the DC gain of the CML-type latch is
10 dB higher than that of the Strong-Arm latch, implying that the CML-type latch
exhibits a higher sensitivity, (iii) the gain-bandwidth product (GBW) of the CML-type
latch is higher than that of the Strong-Arm one, indicating that the CML-type latch
is more suitable for high-speed design. Fig. 2.15(e) describes the normalized energy
comparison between the aforementioned two comparators, where the Strong-Arm
latch always demonstrates a better power efficiency [7]. In practical designs, although
Strong-Arm latches provide a narrower sensitivity window and dissipate less power, CML
latches are usually used in ultra-high-speed receivers because of their large GBW, superior
sensitivity (high gain), and high immunity to power fluctuation. Additionally,
on-chip inductors can be conveniently integrated into the CML-type latch to
further extend its bandwidth.

Figure 2.15: Clocked comparators. (a) CML-type latch-based comparator, (b) Strong-Arm
latch-based comparator, (c) latch sensitivity function comparison [6], (d) latch
transfer function comparison [6], and (e) energy consumption comparison [7].

2.3.3.4 Phase Interpolator

Phase interpolation can be performed by either a direct multiple-input PI [see Fig.
2.16(a)] or two coarse phase Muxes followed by a two-input phase mixer [see Fig.
2.16(b)]. Fig. 2.16(c) and (d) present the two typical PI implementations based on
inverters [12, 13] and CML buffers [14, 15], respectively. As shown in Fig. 2.16(b),
by introducing a phase-selection Mux before the phase mixer, the number of input devices of
the phase mixer can be reduced, thus improving the output bandwidth. However, the
forward coupling through the overlap capacitances of the input devices could cause
discrete phase jumps when the selected phase is updated in the Mux [11]. In contrast,
the direct phase mixer [i.e., PI in Fig. 2.16(a)] can effectively avoid these discrete phase
jumps since the forward coupling paths are always present between the input phases
and the output regardless of the phase swapping [61]. Previous work has demonstrated
that the direct PI [see Fig. 2.16(a)] is reasonable for four-phase mixing [8, 115, 9],
while the Mux-based two-stage PI [see Fig. 2.16(b)] is more suitable for six/eight-phase
interpolation [10, 11, 61].
Figure 2.16: PI structures and implementations. (a) Structure with direct multiple-input
phases [8, 9], (b) structure with coarse phase selection followed by a phase mixer
[10, 11], (c) inverter-based implementation [12, 13], and (d) CML-based implementation
[14, 15].

To keep the common-mode voltage of the interpolated clock constant, linearly adjusting
the weights of the two adjacent phases is usually employed in practical designs.
For an ideal multiple-input PI, the input clocks should share an equal phase spacing
between any two adjacent phases. Correspondingly, the interpolated output clock can
be represented by,

\[ CK_{PI}^{ideal} = A_{PI}\, e^{j\varphi_{PI}^{ideal}} = A_{PI}\, e^{j(\psi_{i+1}-\psi_i)\cdot m/K}, \quad \psi_{i+1}-\psi_i = \frac{2\pi}{N}, \tag{2.11} \]

\[ \varphi_{PI}^{ideal} = \frac{(\psi_{i+1}-\psi_i)\cdot m}{K} = \frac{2\pi\cdot m}{KN}, \tag{2.12} \]

where N is the input phase number, A_PI denotes the interpolated clock amplitude, K
stands for the total steps between ψ_{i+1} and ψ_i, and φ_PI^ideal represents the ideal output
phase when the phase code m ranges from 0 to K. Considering the fact that the phase
interpolation is achieved by mixing two input phases with different weights, the actual
interpolated output signal can be calculated by,

\[ CK_{pi} = A_{pi}\cdot e^{j\varphi_{pi}} = \frac{K-m}{K}\cdot A_0 e^{j\psi_i} + \frac{m}{K}\cdot A_0 e^{j\psi_{i+1}}, \tag{2.13} \]

where A_0 is the amplitude of the input clocks, and A_pi and φ_pi denote the instantaneous
amplitude and phase of the interpolated clock
signal, respectively. Taking quadrature and octagonal PIs as examples, Fig. 2.17
describes the phase mixing constellations and the interpolated phase step allocations
[8, 10, 122, 9]. It can be found that the maximum interpolation phase error for the
quadrature PI reaches 4° and the maximum interpolation phase error for the octagonal
PI is around 0.5°. The maximum deviation happens at the same positions for
both the quadrature and octagonal PIs, namely at about 1/4 and 3/4 of the
total steps between the two mixing input phases. These phase errors stem from the
linear weight sweeping, as the phase transfer characteristics of the PIs follow
an inverse trigonometric function of the input-phase weight ratio rather than the input-phase
weight ratio itself. It is worth noting that the phase error of the octagonal PI is
smaller than that of the quadrature PI. This makes the octagonal PI a superior choice
for high-linearity phase interpolators, but at the cost of doubled input phases, complex
phase-weight coding, and complicated circuit implementation.

Figure 2.17: (a) Phase constellation for quadrature PI, (b) phase constellation for octagonal
PI, (c) interpolated phase steps for quadrature PI in one quadrant, and (d)
interpolated phase steps for octagonal PI in one octant.
The amplitude of the interpolated clock (A_pi) is also modulated by the phase code.
When the phase code is 0, the PI actually performs as a buffer with a minimum mixing
factor, hence a maximum amplitude can be obtained. As the phase code increases, the
amplitude decreases with the increasing mixing factor. Once the weights of the
two input phases are equal to each other, the mixing factor reaches its
maximum value and the amplitude decreases to its minimum value. If the phase code
continues to rise, the amplitude increases with the decreasing mixing factor and
finally returns to its maximum value. According to the discussion in [67], these amplitude
fluctuations can be potentially converted into delay variations through amplitude
modulation (AM) to phase modulation (PM) conversion, and the delay variation is approximately
proportional to the square of the input-signal swing. Theoretically, the
maximum amplitude reduction of a quadrature PI can reach 29.3%, occurring at
half of the total steps for each quadrant. It is also under the same condition that the
phase deviation runs up to its maximum value, thus any extra delay caused by the AM-PM
conversion can further aggravate the maximum DNL directly. The linearity of the
phase interpolator can also be deteriorated by the I/Q mismatch, clock duty-cycle distortion,
and inadequate edge overlap of the input clocks [122, 123, 115]. To mitigate these
effects, a variety of techniques including local duty cycle correction, I/Q phase correction,
and slew rate calibration using slew buffers or harmonic rejection polyphase
filters are usually utilized to optimize the quality of the I/Q clocks [14, 123, 115].
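
The phase error and amplitude figures quoted above for linearly weighted PIs can be reproduced with a few lines of phasor arithmetic. The sketch below assumes ideal unit-amplitude sinusoidal inputs and complementary linear weights on two adjacent phases; the function name `pi_sweep` and the choice of 64 steps are assumptions made for this illustration.

```python
import cmath
import math


def pi_sweep(n_phases, steps):
    """Sweep a linear-weight PI between two adjacent input phases.

    n_phases: number of equally spaced input phases (4: quadrature, 8: octagonal)
    steps:    number of interpolation steps K between the two phases
    Returns (max |phase error| in degrees, minimum normalised amplitude).
    """
    spacing = 2 * math.pi / n_phases
    max_err, min_amp = 0.0, 1.0
    for m in range(steps + 1):
        w = m / steps
        # Weighted sum of the two unit-amplitude input phasors.
        ck = (1 - w) * cmath.exp(0j) + w * cmath.exp(1j * spacing)
        err = cmath.phase(ck) - w * spacing  # actual minus ideal phase
        max_err = max(max_err, abs(math.degrees(err)))
        min_amp = min(min_amp, abs(ck))
    return max_err, min_amp


if __name__ == "__main__":
    for name, n in (("quadrature", 4), ("octagonal", 8)):
        err, amp = pi_sweep(n, 64)
        print(f"{name}: max phase error {err:.2f} deg, "
              f"amplitude reduction {100 * (1 - amp):.1f}%")
```

Under these idealised assumptions the sweep gives a worst-case phase error of roughly 4° for the quadrature case and roughly 0.5° for the octagonal case, and a quadrature amplitude reduction of about 29.3% at the mid-code, in line with the values discussed in the text. Non-sinusoidal clock edges in a real PI will shift these numbers.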

2.3.4 Channel Equalization

When data are transmitted through electrical media, the insertion loss caused
by frequency-dependent skin effect and dielectric absorption could result in prominent
ISI. This ISI is directly converted into deterministic jitter, which compresses the link
jitter margin and hence reduces the maximum supported data rate or deteriorates the BER of
the serial link. For instance, for a channel with 12 dB of loss, the far-end eye diagram
can be completely closed. It might seem that this issue can be solved by simply
increasing the signal strength to counter the attenuation. In practical designs, there
does exist an optimal swing for a specific channel loss. This is the reason why many
transmitters integrate a swing adjustment function, which allows
users to adjust the driving strength to the optimal value for different
applications. If the signal swing is too small, the received signal could be buried
by the noise, thus exhibiting a low SNR. Theoretically, a high swing can effectively
improve the SNR of the system. However, this does not mean a higher swing is always
better for the link communication. Firstly, the increased signal swing does not
solve the ISI problem. This is because the increased symbol swing also increases the
energy spread to the other symbols, thus yielding no improvement in the ISI. Secondly,
the increased swing also raises the strength of proportional noise sources such
as reflection and crosstalk, which could deteriorate the performance of the link. Thirdly,
the increased swing always means higher power consumption as the driver needs
to draw more current. To overcome the frequency-dependent signal dispersion,
many equalization techniques have been developed to compensate for the channel loss
by either attenuating the low-frequency components or boosting the high-frequency
components [124]. This section summarizes the mainstream equalizers utilized in
high-speed links, including the FFE, CTLE, and DFE. These equalizers are usually
combined together to cover a broad range of channel spectrums, especially for high-loss
legacy channels. The FFE is usually employed to cancel the pre-cursor ISI and
partial nearby post-cursor ISI. The CTLE is often adopted to neutralize the long-tail
ISI. The DFE is frequently utilized to remove the nearby post-cursor ISI.

Figure 2.18: The FFE. (a) Functional block diagram, where Tb is the bit period and αn
is the weight of the nth tap. (b) Typical frequency response, where k is the summation
of the absolute tap weights.
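
The argument that a larger swing does not cure ISI can be made concrete with a simple peak-distortion estimate. In the sketch below the channel is represented by a UI-spaced single-bit pulse response, and the worst-case eye opening is taken as the main cursor minus the sum of all ISI magnitudes. The pulse-response values and function names are hypothetical and serve only to illustrate the scaling behaviour.

```python
def worst_case_eye(pulse_response, cursor):
    """Worst-case vertical eye opening from a sampled single-bit response.

    pulse_response: samples of the channel pulse response, one per UI
    cursor:         index of the main-cursor sample
    A value <= 0 means that the worst-case data pattern closes the eye.
    """
    isi = sum(abs(v) for i, v in enumerate(pulse_response) if i != cursor)
    return pulse_response[cursor] - isi


def scale(pulse_response, gain):
    """Model a larger TX swing as a uniform scaling of the whole response."""
    return [gain * v for v in pulse_response]


if __name__ == "__main__":
    # Hypothetical lossy-channel response: pre-cursor, main cursor, post-cursors.
    pulse = [0.05, 0.40, 0.25, 0.15, 0.08]
    print(worst_case_eye(pulse, 1))
    print(worst_case_eye(scale(pulse, 2.0), 1))
```

Doubling the swing scales the main cursor and every ISI term by the same factor, so an eye that is closed by ISI stays closed; only reshaping the response through equalization changes the ratio between them.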

2.3.4.1 Feed Forward Equalizer

The FFE, which is usually implemented using a finite impulse response (FIR) filter,
is one of the most common techniques in high-speed serial links. It pre-distorts
the output waveform shape over several symbols to pre-attenuate the low-frequency
portion of the transmitted signal, so that the signal spectrum after the lossy channel
maintains a proper balance between various frequency components. Fig. 2.18(a)
describes the functional block diagram of the FFE, where Tb is the bit period and
α(l) is the normalized tap weight. Clearly, the waveform pre-distortion is actually
performed by summing the symbol-spaced streams with different tap weights. Fig.
2.18(b) displays a typical frequency response of the FFE, which demonstrates prominent
low-frequency attenuation. The maximum de-emphasis amount is 20log(1 − 2k),
where $k = \sum_{l \neq 0} |\alpha(l)|$. Note that k must be between 0 and 1/2 to perform high-frequency
boosting. For k > 1/2, the frequency response actually exhibits attenuation rather
than boosting for high frequencies. The specific response shape is subject to the tap
number as well as the tap-weight distribution. The discussion in [101] shows that
more taps help to fit the desired response. Meanwhile, the increased tap number implies
an almost linear increase of parasitic capacitance at the output node, thus limiting the
output bandwidth. To keep sufficient bandwidth and maintain an adequate eye opening,
a tap number of three or four is usually adopted for data rates below 30 Gb/s
[101, 9, 125, 13]. For the cutting-edge transmitters operating around 40-60 Gb/s, two-tap
FFEs are usually adopted [23, 36, 57].
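
The de-emphasis expression 20log(1 − 2k) can be verified directly from the FIR response. The sketch below evaluates the magnitude response of a symbol-spaced FIR at DC and at the Nyquist frequency (normalized frequency 0.5). It assumes that the tap weights are normalized so that their absolute values sum to 1 and that all non-main taps are negative; the three-tap weight set and the function names are hypothetical.

```python
import cmath
import math


def ffe_gain_db(taps, f_norm):
    """Magnitude response (dB) of a symbol-spaced FIR at f_norm = f * Tb."""
    h = sum(a * cmath.exp(-2j * math.pi * f_norm * n) for n, a in enumerate(taps))
    return 20 * math.log10(abs(h))


def de_emphasis_db(taps, main):
    """Low-frequency attenuation predicted by 20*log10(1 - 2k)."""
    k = sum(abs(a) for i, a in enumerate(taps) if i != main)
    return 20 * math.log10(1 - 2 * k)


if __name__ == "__main__":
    # Hypothetical weights: one pre-cursor, the main tap, one post-cursor.
    taps = [-0.1, 0.7, -0.2]
    print(ffe_gain_db(taps, 0.0), de_emphasis_db(taps, 1))
    print(ffe_gain_db(taps, 0.5))
```

For this weight set k = 0.3, so the DC gain is 20log(0.4), about −8 dB, while the Nyquist gain stays at 0 dB. The equalization is therefore obtained entirely by attenuating low frequencies, which is the swing penalty discussed for the FFE.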
The FFE exhibits several unique advantages over its counterparts. Firstly, the FFE
is able to cancel pre-cursor ISI by introducing pre-cursor taps. Secondly, the FFE
shows negligible noise amplification due to its digital implementation. Thirdly, the
tap weights of the FFE can be accurately controlled by employing a high-resolution
digital-to-analog converter (DAC) For example, 5-6 bit resolution can be achieved
conveniently, which is usually accurate enough for the FFE tap-weight adjustmen-
t. The main disadvantage of the FFE is that it is implemented by attenuating the
low-frequency portion rather than boosting high-frequency ones. This equalization
mode can significantly reduce the eye-height in the RX-side. Another drawback is its
complex circuit implementation which involves multiple symbol-spaced full-rate data
generations. It not only decreases the maximum allowable data-rate by introducing
parasitic capacitance on the output nodes, but also increases the area occupation and
power consumption. These penalties become even more severe in ultra-high-speed
transceivers operating around the cutting-edge speed of the technology. The FFE
equalization can be located either at the TX-side or the RX-side. In the following two
paragraphs, we will separately discuss the pros/cons of the TX-FFE and RX-FFE.
Most designs put the FFE on the TX-side for the following two reasons. One is that
the 1 UI delay can be accurately generated by simply relatching. The other is that the
coefficient multiplication can be simply performed on binary values by changing the
current-controlling codes to adjust the tap weight. Nonetheless, TX-FFE has several prominent
disadvantages. Firstly, it is difficult to perform automatic tap-weight adaptation since
the quality of the received signal can only be known at the RX-side. Although a back
channel can be employed to transfer the continuously adjusted tap weights [126], this
extra back channel increases the system complexity in terms of extra chip pins, compli-
cated chip packaging, and additional PCB routing. Moreover, this RX-side adaptation
scheme may not even be available due to the problems in interoperability, especially
when the transmitter and receiver are from different vendors [13]. Secondly, the
compensation ability is limited by the allowable minimum signal swing after de-emphasis,
because the TX-FFE compensates for the high-frequency channel loss by attenuating
low-frequency components rather than increasing high-frequency components
in the signal.
By placing the FFE at the RX-side, the tap weights can be adapted locally at the
RX-side, thus eliminating the need for a back channel and removing the issue of TX-
RX interoperability. This also makes the driver at the TX-side simpler by removing
the combining taps, thus reducing the output capacitances and improving the driving
bandwidth. However, there also exist several drawbacks in the RX-FFE. The primary
challenge is how to generate the symbol-spaced versions of the received signal [13].
Passive delay cells using inductors and capacitors need a large area, and their tun-
able delays are not wide enough to handle a wide operation range. Active delay cells
such as CML buffers are power-hungry and distort the signal waveform due to their
delay-dependent bandwidth. Another challenge in the RX-FFE is how to carry out the
product of the coefficients and the analog signals [13]. In bipolar technology, a tradi-
tional Gilbert multiplier can be utilized to perform this multiplication. However, the
limited linearity performance of the CMOS transistors makes the Gilbert multiplier far
less accurate and its resulting distortion significantly degrades the FFE performance.
Unlike the TX-FFE that only sums the weighted digital streams, the RX-FFE process-
es the received signal containing both the useful signal information and useless noise
disturbance. Therefore, the high-frequency components of the noise are also boosted,
which is not desired in high-speed communication systems.

2.3.4.2 Continuous-Time Linear Equalizer

The CTLE is a simple continuous-time circuit with a high-frequency boosting
transfer function that effectively overcomes the high-frequency losses through a
transmission channel. It usually acts as a front-end amplification stage at the RX-side to
provide gain and high-frequency peaking with acceptable power and area overheads.
As the CTLE sharpens both the rising and falling edges of the received signal, it
shows a capability of canceling the long-tail ISI caused by both the pre-cursor and
post-cursor taps. Similar to the RX-FFE, there are also some drawbacks associated
with the CTLE. Firstly, the equalization ability of the CTLE is limited to first-order
compensation. Secondly, it also amplifies the noise and crosstalk in the boosting band.
Thirdly, its gain boosting is sensitive to PVT variations, and the tuning range is small.
Finally, its operation speed is limited by the GBW product of the amplifier.

Figure 2.19: The CTLE. (a) Passive implementation, (b) frequency response of the
passive CTLE, (c) active implementation, and (d) frequency response of the active
CTLE. Here, ωz is the angular frequency of the zero and ωp is the angular frequency
of the pole. Both responses plot gain (dB) against angular frequency (rad).

The CTLE can be realized in both passive components and active devices [124].
Fig. 2.19 displays both passive and active implementations of the CTLE and their fre-
quency responses. For the passive CTLE shown in Fig. 2.19(a), the frequency shaping
is achieved by a simple RC network, where low-frequency components are attenuat-
ed by the resistor and the high-frequency components are allowed to pass through the
capacitor, thus leading to high-frequency gain boosting. According to the signal
processing theorems, the transfer function and the associated pole-zero positions can be
calculated by,

$$H(s) = \frac{R_2}{R_1 + R_2} \cdot \frac{1 + R_1 C_1 s}{1 + \frac{R_1 R_2}{R_1 + R_2}(C_1 + C_2)s}, \tag{2.14}$$

$$\omega_z = \frac{1}{R_1 C_1}, \tag{2.15}$$

$$\omega_p = \frac{1}{\frac{R_1 R_2}{R_1 + R_2}(C_1 + C_2)}, \tag{2.16}$$

$$\text{DC-Gain} = \frac{R_2}{R_1 + R_2}, \tag{2.17}$$

$$\text{Gain-Boost} = 20\log\left(\frac{\omega_p}{\omega_z}\right). \tag{2.18}$$

Fig. 2.19(b) displays a typical frequency response of the passive CTLE. The boosting
frequency components are determined by the locations of the zero and the pole while
the boosting factor can be approximated by the ratio of the pole to the zero, since the
frequency response shows a 20 dB/dec rolling up [see Eq. (2.18)]. By appropriate-
ly choosing the resistor/capacitor values that determine the positions of the zero and
the pole, reasonable gain boosting including both frequency components and boosting
amounts can be achieved. The main feature of this equalizer is its compact imple-
mentation and zero power consumption since it only contains passive components of
resistors and capacitors. However, there are three prominent disadvantages in this sim-
ple RC equalizer. Firstly, the RC network introduces large impedance discontinuity
at the interface between the channel and the equalizer, which could cause significant
reflection. Secondly, this approach cannot improve the SNR since the equalization is
performed by attenuating low-frequency components. Thirdly, it is not convenient to
adjust the boosting parameters since the configuration of the RC values could introduce
additional overheads to the most high-speed critical path. Therefore, this technique has
seldom been utilized in high-speed serial links.
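
As a quick numerical check of Eqs. (2.14)-(2.18), the sketch below evaluates the zero, pole, DC gain, and boost of the RC equalizer. The component values are hypothetical and chosen only for illustration; the function names are not from the original work.

```python
import math


def passive_ctle(r1, c1, r2, c2):
    """Zero, pole, DC gain and boost of the RC equalizer, following Eqs. (2.15)-(2.18)."""
    wz = 1.0 / (r1 * c1)
    wp = 1.0 / ((r1 * r2 / (r1 + r2)) * (c1 + c2))
    dc_gain = r2 / (r1 + r2)
    boost_db = 20 * math.log10(wp / wz)
    return wz, wp, dc_gain, boost_db


def passive_ctle_mag(r1, c1, r2, c2, w):
    """|H(jw)| from Eq. (2.14), written as DC gain * (1 + jw/wz) / (1 + jw/wp)."""
    wz, wp, dc_gain, _ = passive_ctle(r1, c1, r2, c2)
    return dc_gain * abs(complex(1.0, w / wz) / complex(1.0, w / wp))


# Hypothetical component values (ohms and farads).
R1, C1, R2, C2 = 300.0, 1e-12, 100.0, 0.2e-12

wz, wp, dc_gain, boost_db = passive_ctle(R1, C1, R2, C2)
hf_gain = passive_ctle_mag(R1, C1, R2, C2, 1e3 * wp)   # far above the pole
print(f"wz = {wz:.3e} rad/s, wp = {wp:.3e} rad/s")
print(f"DC gain = {dc_gain:.3f}, HF gain = {hf_gain:.3f}, boost = {boost_db:.2f} dB")
```

Both the DC gain and the high-frequency gain stay below unity in this sketch, which reflects the observation above that the passive network shapes the response only by attenuating low-frequency components.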
Fig. 2.19(c) presents the widely used active CTLE implementation. It utilizes an
RC source degeneration to provide different gains for different frequencies in order to
realize high-frequency boosting. By analyzing the linear equivalent half circuits, the
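transfer function and the zero-pole positions can be given by,

$$H(s) = \frac{g_m}{C_L} \cdot \frac{s + \frac{1}{R_D C_D}}{s + \frac{g_m R_D + 1}{R_D C_D}} \cdot \frac{1}{s + \frac{1}{R_L C_L}}, \tag{2.19}$$

$$\omega_z = \frac{1}{R_D C_D}, \tag{2.20}$$

$$\omega_{p1} = \frac{g_m R_D + 1}{R_D C_D}, \tag{2.21}$$

$$\omega_{p2} = \frac{1}{R_L C_L}, \tag{2.22}$$

$$\text{DC-Gain} = \frac{g_m R_L}{g_m R_D + 1}, \tag{2.23}$$

$$\text{Gain-Boost} = 20\log\left(\frac{\omega_{p1}}{\omega_z}\right). \tag{2.24}$$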

Fig. 2.19(d) presents a typical frequency response of the active CTLE. The response
shape is mainly constrained by ωz , ωp1 and DC −Gain since the second pole is usually
determined by the load resistor and output capacitor. The boosting ability of the active
CTLE is usually changed by adjusting the source degeneration RC network [see Fig.
2.19(c)]. As the control voltage VCTLE is tuned from high to low, both the equivalent
resistor RD and equivalent capacitor CD become larger. The resulting zero (ωz ) can
be reduced while the ratio of the dominant pole to zero (ωp1 /ωz ) will increase, thus
the boosting frequency band and the boosting gain can be both improved. Inductive
peaking [127] or forward-coupling capacitance neutralization [124] can be used to
further increase the bandwidth of the CTLE and hence enhance the gain-boost ability.
Compared to the passive CTLE, the main feature of the active CTLE is its ability to
achieve higher gains (over 0 dB) for both low and high frequency components, which
helps to improve the SNR to optimize the BER of the link.
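
The influence of the second pole on the achievable boost can be examined by evaluating Eq. (2.19) directly. The sketch below assumes hypothetical device values (gm, RD, CD, RL, CL) and compares the ideal boost of Eq. (2.24) with the peaking actually obtained over a frequency sweep.

```python
import math


def active_ctle_h(s, gm, rd, cd, rl, cl):
    """Transfer function of Eq. (2.19) for the source-degenerated stage."""
    wz = 1.0 / (rd * cd)
    wp1 = (gm * rd + 1.0) / (rd * cd)
    wp2 = 1.0 / (rl * cl)
    return (gm / cl) * (s + wz) / ((s + wp1) * (s + wp2))


# Hypothetical values: 20 mS, 100 ohm, 0.5 pF, 200 ohm, 50 fF.
gm, RD, CD, RL, CL = 20e-3, 100.0, 0.5e-12, 200.0, 50e-15

dc_gain = abs(active_ctle_h(0j, gm, RD, CD, RL, CL))
ideal_boost_db = 20 * math.log10(gm * RD + 1.0)          # 20*log10(wp1/wz)

# Logarithmic sweep from 100 MHz to 1 THz to locate the peak gain.
freqs = [10 ** (8 + 0.01 * i) for i in range(401)]
peak = max(abs(active_ctle_h(2j * math.pi * f, gm, RD, CD, RL, CL)) for f in freqs)
actual_boost_db = 20 * math.log10(peak / dc_gain)

print(f"DC gain = {20 * math.log10(dc_gain):.2f} dB")
print(f"ideal boost = {ideal_boost_db:.2f} dB, swept boost = {actual_boost_db:.2f} dB")
```

In this sketch the swept boost is smaller than the ideal value because the output pole at 1/(RL CL) rolls the response off before the full ωp1/ωz ratio is reached, which is consistent with the use of bandwidth-extension techniques mentioned above.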

2.3.4.3 Decision Feedback Equalizer

The DFE is another effective signal conditioning technique to cancel the ISI caused
by frequency-dependent channel loss, which is commonly implemented at the RX-side
in serial links. Fig. 2.20 gives the conceptual diagram and typical frequency
response of the DFE. It works by directly subtracting (or adding) the previous decisions
multiplied by the corresponding tap weights. This previous-decision-based
ISI cancellation not only increases the boosting factor of the DFE, but also makes
it immune to noise amplification since the feedback signal is the scaled version of
well-recovered digital streams. Similar to the FFE, for a fixed tap-weight summation
k, where $k = \sum_{l=1}^{n} |\alpha(l)|$, (0 < k < 1), the maximum boost factor is a constant
[20log((1 + k)/(1 − k))], while the tap number and tap weight distribution only affect
the shape of the response.

Figure 2.20: The DFE. (a) Functional diagram, where Tb is the bit period and αn is
the tap weight of the nth tap. (b) Typical frequency response, where the frequency is
normalized to the value of the data rate.
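
The decision-feedback operation described above can be illustrated with a symbol-rate behavioural model. This is a minimal sketch under stated assumptions: the channel is represented by a hypothetical main cursor and two post-cursors, the DFE tap weights are set equal to those post-cursors, and noise is ignored.

```python
import random


def channel(bits, cursors):
    """Symbol-rate channel model: main cursor followed by post-cursors."""
    out = []
    for k in range(len(bits)):
        out.append(sum(h * bits[k - i] for i, h in enumerate(cursors) if k - i >= 0))
    return out


def dfe_receive(samples, taps):
    """Subtract tap-weighted previous decisions from each sample, then slice."""
    history = [0] * len(taps)            # previous decisions, most recent first
    decisions, equalized = [], []
    for x in samples:
        y = x - sum(a * d for a, d in zip(taps, history))
        d = 1 if y >= 0 else -1          # slicer decision
        decisions.append(d)
        equalized.append(y)
        history = [d] + history[:-1]
    return decisions, equalized


rng = random.Random(0)
bits = [rng.choice((-1, 1)) for _ in range(1000)]
cursors = [1.0, 0.4, 0.2]                # hypothetical pulse response samples
rx = channel(bits, cursors)
decisions, equalized = dfe_receive(rx, taps=cursors[1:])

print("worst-case level before DFE:", min(abs(x) for x in rx))
print("worst-case level after DFE: ", min(abs(y) for y in equalized))
```

Because the feedback uses sliced decisions rather than the noisy analog input, the post-cursor ISI is removed without scaling any noise that would be present on the received samples.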
There are three issues in the DFE design [128]. Firstly, there exists error propaga-
tion problem in the DFE because the ISI cancellation is based on the assumption that
all the previous decisions are correct. When there are bit errors, the subtraction or ad-
dition of the scaled decisions will rather exacerbate the ISI than cancel it. Fortunately,
this error propagation can be neglected for a robust serial link since its BER is usually
lower than $10^{-12}$. Secondly, the DFE can only remove post-cursor ISI as the feedback
sequences can only be the previously received data. This is the reason why the DFE
is usually combined with the TX-FFE and/or RX-CTLE to cancel the ISI caused by
both the pre-cursors and post-cursors. Finally, the DFE implementation suffers from
a stringent timing problem. As described in Fig. 2.20(a), there are two possible criti-
cal paths. One is the feedback loop of the first tap, whose timing requirement can be
expressed by,

$$t_{cq}^{slicer} + t_{setup}^{slicer} + t_{fb} < 1\,\text{UI}, \tag{2.25}$$

where $t_{cq}^{slicer}$ is the ck-to-q delay of the slicer, $t_{setup}^{slicer}$ denotes the setup time of the slicer,
and $t_{fb}$ stands for the feedback path delay. The other is the feedback loop on other
taps, whose loop delay also must be lower than 1 UI,

$$t_{cq}^{ff} + t_{setup}^{slicer} + t_{fb} < 1\,\text{UI}, \tag{2.26}$$

where $t_{cq}^{ff}$ represents the ck-to-q delay of the retiming FF. The main difference between
these two loops is that their ck-to-q delays come from different components. Compared
to the FF which retimes the sliced full-swing data sequence, the slicer needs to regenerate
the digital output from a small input. Consequently, $t_{cq}^{slicer}$ should be larger
than $t_{cq}^{ff}$, which makes the timing budget for the first tap [see Eq. (2.25)] tighter than
that for other taps [see Eq. (2.26)]. This is the reason why various techniques are
developed to relax the first tap timing requirement.
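
The two timing constraints in Eqs. (2.25) and (2.26) amount to a simple slack calculation. The sketch below assumes a full-rate DFE at a hypothetical 40 Gb/s and purely illustrative delay values in picoseconds; none of these numbers come from a measured design.

```python
def dfe_timing_slack_ps(data_rate_gbps, t_cq_ps, t_setup_ps, t_fb_ps):
    """Remaining margin (ps) of the feedback loop against one unit interval."""
    ui_ps = 1000.0 / data_rate_gbps
    return ui_ps - (t_cq_ps + t_setup_ps + t_fb_ps)


# Hypothetical delays: the slicer regenerates a small input, so its ck-to-q
# delay is assumed larger than that of the retiming flip-flop.
first_tap_slack = dfe_timing_slack_ps(40.0, t_cq_ps=12.0, t_setup_ps=5.0, t_fb_ps=6.0)
other_tap_slack = dfe_timing_slack_ps(40.0, t_cq_ps=8.0, t_setup_ps=5.0, t_fb_ps=6.0)

print(f"first-tap slack: {first_tap_slack:.1f} ps")
print(f"other-tap slack: {other_tap_slack:.1f} ps")
```

A negative slack would indicate that the loop cannot close within 1 UI; with the assumed values the first tap is left with the smaller margin, in line with the discussion above.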

2.3.4.4 Equalization Adaptation

In practical transmission systems, the connection channels usually have the fol-
lowing features. Firstly, the exact channel profiles in practical serial links are usually
unknown in advance. Secondly, the channel length can vary from one application to
another. Thirdly, the channel profile may change due to the fabrication variation. Fi-
nally, the channel profile will vary in real time with its operation environment, which
becomes particularly severe for data rates beyond 10 Gb/s. To accommodate the
different channel losses and track the real-time channel variations, many adaptive
equalization techniques like least mean square (LMS) [129, 34, 130, 131], zero-forcing
(ZF) [105, 132], maximum eye opening (MEO) [133], and spectrum matching [134]
have been developed. Fig. 2.21 summarizes the conceptual diagrams of these
adaptation methods.
Figure 2.21: Equalization adaptations. (a) Algorithm-based adjustment, (b) eye
monitor-based coefficient update, and (c) spectrum matching-based calibration.

Algorithm-Based Adaptation- Fig. 2.21(a) describes the conceptual diagram of the
most widely used algorithm-based adaptation, which can be applied to any type of
equalizers including the FFE, CTLE, and DFE. There are many algorithms that can be
used to adjust the equalizer coefficients, but only a few of them are suitable for
on-chip integrations. The most popular ones for compact hardware implementation are
the LMS, ZF, and their variants.

1. The LMS algorithm optimizes the equalization coefficients based on minimizing
   the mean squared error. The coefficient update equation can be expressed by,

   $$\alpha_{(k+1,l)} = \alpha_{(k,l)} - \lambda \cdot e_k \cdot x_{k-l}, \quad (l = 1, 2, \cdots, n), \tag{2.27}$$

   where $\alpha_{(k,l)}$ denotes the lth tap weight at the kth iteration, λ is the update step
   size, $x_k$ stands for the samples at the channel output, $y_k$ is the equalizer output
   [see Fig. 2.21(a)], $\hat{d}_k$ is the estimate of the transmitted data, and
   $e_k = \hat{d}_k - y_k$ represents the equalization error. The requirement
   of analog multiplications ($x_k$ and $e_k$ are naturally analog signals) in Eq.
   (2.27) makes it difficult to implement in hardware, thus reducing its
   competitiveness in equalization coefficient adaptations. To reduce the complexity
   of the traditional LMS, the sign-sign LMS (SS-LMS) algorithm has been developed,
   which utilizes the binary quantized $\text{sign}(e_k)$ and $\text{sign}(x_{k-l})$ to replace the
   analog $e_k$ and $x_{k-l}$ in Eq. (2.27). The update iteration is then changed to,

   $$\alpha_{(k+1,l)} = \alpha_{(k,l)} - \lambda \cdot \text{sign}(e_k) \cdot \text{sign}(x_{k-l}), \quad (l = 1, 2, \cdots, n). \tag{2.28}$$

   Considering the fact that the binary quantized $\text{sign}(e_k)$ and $\text{sign}(x_{k-l})$ can be
   directly mapped from the sliced error sequence and recovered data stream, the
   SS-LMS obviates the need for analog operations, hence making it more feasible
   for on-chip integrations. Since the binary quantization significantly reduces the
   iterative accuracy, the convergence time of the SS-LMS is generally longer than
   that of the traditional LMS. Fortunately, this increased convergence time is not a
   problem in most serial links.

2. The ZF solution is obtained by forcing residual ISI in the decision instant to zero
   [135], which can be theoretically achieved by completely inverting the channel
   response $H_C(s)$ [136],

   $$H_E(s) = \frac{1}{H_C(s)}, \tag{2.29}$$

   where $H_E(s)$ is the frequency response of the equalizer. The resulting total
   transfer function of the cascaded equalizer and channel should be
   flat. Optimal ZF equalization requires equalization filters with infinite taps to fit
   the long-tail impulse response. In practical implementations, suitable truncation
   is usually applied to construct a finite impulse response (FIR) to approximate the
   infinite impulse response (IIR). This method is suitable for a time-invariant
   channel that is well known in advance. To adaptively adjust the equalizer
   coefficients and track slow channel variations, the equalizer coefficients can
   be updated by the following iteration [137],

   $$\begin{aligned} \alpha_{k+1} &= \alpha_k - \lambda \cdot e_k \cdot x_k, \\ e_k &= \hat{s}_k - s_k, \quad \hat{s}_k = x_k^T \alpha_k, \end{aligned} \tag{2.30}$$

   where $\alpha_k$ is the equalizer coefficient vector, λ denotes the update step that
   controls the adaptation rate, $e_k$ stands for the error vector, $\hat{s}_k$ represents the estimate
   vector of the transmitted data, $x_k$ is a vector composed of the input signal
   applied to the equalizer, and $s_k$ denotes a vector consisting of the training
   symbols. Note that the subscript k or k + 1 refers to the kth or (k + 1)th iteration.
   Comparing Eq. (2.30) to Eq. (2.27), we can find that the ZF algorithm is
   equivalent to the LMS for FIR equalizers.
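
The sign-sign update of Eq. (2.28) can be sketched for a DFE as follows. Several assumptions are made here and are not taken from the cited designs: the feedback is subtracted from the input, the recovered decisions are used in place of sign(x_{k-l}) (as the text notes, this term can be mapped from the recovered data stream), the target level equals a normalized main cursor of 1, and the channel cursors and step size are hypothetical.

```python
import random


def sign(v):
    """Binary slicer output: +1 for v >= 0, otherwise -1."""
    return 1 if v >= 0 else -1


def ss_lms_dfe(samples, n_taps, step, target=1.0):
    """Adapt DFE tap weights with the sign-sign LMS rule and return them."""
    taps = [0.0] * n_taps
    history = [1] * n_taps               # previous decisions, most recent first
    for x in samples:
        y = x - sum(a * d for a, d in zip(taps, history))   # equalized sample
        d_hat = sign(y)                                      # recovered data
        e = target * d_hat - y                               # e_k = d_hat_k - y_k
        # alpha <- alpha - step * sign(e_k) * sign(data_{k-l})
        taps = [a - step * sign(e) * d for a, d in zip(taps, history)]
        history = [d_hat] + history[:-1]
    return taps


rng = random.Random(1)
bits = [rng.choice((-1, 1)) for _ in range(5000)]
cursors = [1.0, 0.4, 0.2]                # hypothetical main cursor and post-cursors
rx = [sum(h * bits[k - i] for i, h in enumerate(cursors) if k - i >= 0)
      for k in range(len(bits))]

taps = ss_lms_dfe(rx, n_taps=2, step=0.005)
print("adapted taps:", [round(a, 3) for a in taps])
```

Since every update moves a tap by a fixed step, the adapted weights keep dithering around the post-cursor values by a few step sizes after convergence, which is the accuracy penalty of the binary quantization mentioned above.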

The errors utilized in the aforementioned LMS and ZF algorithms are extracted by
measuring the amplitude differences between the equalized and desired outputs that
are sampled at the data-sampling positions. This level-based error extraction method
involves both data recovery and peak detection [138]. Moreover, this configuration of-
ten requires additional slicers or even an analog-to-digital converter (ADC) to extract
the amplitude errors between the equalized and expected eye heights, which makes it
less competitive for high-speed applications due to the following reasons. Firstly, these
auxiliary circuits (slicers or ADC) degrade the maximum bandwidth because their in-
put capacitances are directly connected to the maximum-speed signal path. Secondly,
the additional high-speed circuits will inevitably introduce more connections, which
not only makes the layout routing more complicated but also increases the parasitic
capacitances. Thirdly, the additional circuits consume considerable power since they
need to operate at the maximum speed. Meanwhile, the residual ISI can also be min-
imized using the errors at the crossing points since the ISI at the crossing points is
heavily correlated to the transmitted data for bandwidth-limited systems [138]. Lever-
aging this characteristic, Xilinx [138, 139, 42] has developed an edge-based algorithm,
where the error in Eq. (2.29) is replaced with the error at the crossing points. These
errors can be directly mapped from the quantized edge sequence that is indispensable
for the CDR. Consequently, the additional samplers can be obviated to optimize the
critical path capacitances and improve the power efficiency. Note that the indirect na-
ture of the edge-based algorithm shows a relatively lower effectiveness when compared
with its level-based counterpart. Fortunately, simulation results indicate that for low-
loss applications, the edge-based adaptation is sufficient to guarantee an acceptable eye
opening at the data-sampling point [138].
Eye Monitor-Based Adaptation- Fig. 2.21(b) describes the eye monitor-based adap-
tation, which is also applicable to any type of equalizer structures. The optimal
equalization coefficients are attained based on maximizing the eye opening. The
two-dimensional mask of the eye opening can be obtained by monitoring the BER while
adjusting the sampling position and slicing levels of the error-detection slicer [140].
As for the adaptive equalization process, the eye masks with different equalization
coefficients are first measured by the internal eye monitor under the control of the co-
efficient scanning engine [see Fig. 2.21(b)]. The optimal coefficient configuration is
then selected by a maximum-eye-opening searching algorithm. This method can produce
a visualized eye diagram with distinct eye width and eye height, thus providing
an intuitive window to observe the equalization effect. Nonetheless, there exist two
drawbacks in the eye monitor-based equalization adaptation. One is that the high power
consumption of the eye monitor (including full-rate slicer, clock PI, driving buffer,
scanning engine, and searching algorithm) can significantly degrade the power
efficiency of the serial link. The other is the trade-off among design complexity,
scanning speed, and measuring accuracy. Precise eye measurement needs a high-resolution
DAC and PI for slicing level adjustment and sampling position moving, which not only
complicates the design but also significantly prolongs the eye-scanning time. The
eye monitor presented in [141] shows that the combination of a 3-bit DAC and a 4-bit
PI contributes a total of 210 different masks, and it is a good balance for a 10 Gb/s
design. In addition, the eye-scanning accuracy is also limited by the slicer sensitivity,
slicer offset, and PI nonlinearity.
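
The coefficient-scanning and maximum-eye-opening search can be sketched as an exhaustive sweep. In the sketch below, the BER-based eye measurement is replaced by a simple analytical stand-in (the worst-case eye height after a one-tap DFE with a normalized main cursor), and the 5-bit tap code, LSB size, and post-cursor value are hypothetical.

```python
def eye_height(post_cursor, tap):
    """Stand-in for the eye monitor: worst-case vertical opening after a one-tap DFE."""
    return 2.0 * (1.0 - abs(post_cursor - tap))


def scan_for_max_eye(post_cursor, n_bits=5, lsb=0.02):
    """Sweep every tap code and keep the one giving the largest eye height."""
    best_code, best_eye = 0, float("-inf")
    for code in range(2 ** n_bits):
        eye = eye_height(post_cursor, code * lsb)
        if eye > best_eye:
            best_code, best_eye = code, eye
    return best_code, best_eye


best_code, best_eye = scan_for_max_eye(post_cursor=0.30)
print(f"selected tap code {best_code}, eye height {best_eye:.3f}")
```

The number of measurements grows with the product of the coefficient codes and the eye-mask grid points, which illustrates why a finer DAC and PI resolution lengthens the scanning time.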
Spectrum Matching-Based Adaptation- Fig. 2.21(c) presents the spectrum matching-
based adaptation, which is applicable to the RX-CTLE and one tap RX-FFE as it only
provides one control voltage [134, 142, 143]. The control voltage is optimized by forc-
ing the imbalance of the spectrum split by the frequency fm to zero [see Fig. 2.21(c)],
where fm equals 0.28/Tb and Tb is the bit period [134]. This fm is selected based on
the fact that it splits the spectral power equally for ideal random binary
sequences. Note that setting fm to 0.28/Tb is valid only for purely-random
or pseudo-random data streams [134]. There are several difficulties in this adaptation
method. Firstly, the LPF and HPF are directly connected to the critical path [Fig.
2.21(c)], which could degrade the maximum bandwidth. Secondly, the splitting
bandwidth is difficult to control since the passive components utilized in the LPF and HPF
are sensitive to PVT variations. Thirdly, the effective power detection is challenging,
especially for the high-frequency power detection. Finally, the accuracy is limited by
various system uncertainties. For example, the unbalanced power detection between
the low-frequency and high-frequency rectifiers could lead to underestimation or
overestimation of the boosting factors, thus resulting in a suboptimal solution [142].

Chapter 3

Design of the Ring-Based Injection-Locked Clock Multiplier (RILCM)

Clock multipliers continue to play important roles in modern wireline communication
systems. The rapid growth of per-lane data rate paired with the high-volume lane
integration has posed more stringent requirements on the clock multipliers, including
high-frequency ability, low-jitter generation, small-area occupation, and low-power
consumption. Over the past decades, plenty of efforts have been made to develop such
clock multipliers. Phase locked loop (PLL) is attractive because of its compact imple-
mentation, robust operation, and convenient configuration. Nonetheless, the infeasibil-
ity of combining the preferred properties of small area (within ring-oscillator) and low
jitter (within LC-oscillator) is prone to degrade its competitiveness. DLL-based clock
multiplier is an alternative solution that can offer superior jitter performance, while ob-
viating large-area inductors [144, 79]. However, the duty cycle error and fixed pattern
jitter caused by mismatches could hinder its widespread uses in practical applications.
Recently, injection locking has attracted increasing attention since it exhibits obvious
advantages over the above methods, including simple structure, high power efficiency,
and low phase noise [81, 83, 82, 145]. It has shown great potential in a variety of
applications such as clock multiplication [85, 146], frequency division [147, 148], clock
distribution [149], and clock data recovery [150].


This chapter presents a ring-oscillator-based injection-locked clock multiplier (RILCM)
that seeks to achieve the good properties of low jitter generation, small area oc-
cupation, and high power efficiency. To adaptively adjust the frequency offset, we
have developed a hybrid FTL. Meanwhile, a lock-loss detection and lock recovery
(LLD-LR) is devised to endow the RILCM with a similar lock-acquisition ability as
conventional PLL, thus excluding the initial frequency setup aid and preventing po-
tential lock loss. To satisfy the requirements of high operation speed, high detection
accuracy, and low output disturbance, a compact timing-adjusted phase detector (T-
PD) tightly combined with a well-matched charge pump (CP) is designed. To further
reduce the output jitter, a full-swing pseudo-differential delay cell (FS-PDDC)-based
injection-locked ring-VCO (IL-RVCO) is developed as well.
The remainder of this chapter is organized as follows. Section 3.1 summarizes the
challenges in the RILCM design and previous solutions. Section 3.2 describes the RIL-
CM architecture. The proposed IL-RVCO, the devised phase shift detection, and the
designed LLD-LR are presented in Section 3.3, 3.4, and 3.5, respectively. Section 3.6
details the experimental results and Section 3.7 summarizes the implemented RILCM.

3.1 Challenges in RILCM and Previous Solutions

3.1.1 Challenges in RILCM

It is a nontrivial task to design a robust RILCM for practical applications and the
challenges mainly focus on the following aspects. Firstly, the jitter suppression is
sensitive to the frequency offset between the target frequency and the free-running
frequency of the oscillator. Specifically, the phase noise tracking ability will decline
rapidly as the frequency offset increases [81]. Moreover, it is quite challenging to de-
tect the frequency offset since the accumulated phase error can always be reset by the
injection pulse. Therefore, the free-running frequency of the voltage-controlled oscil-
lator (VCO) should be tuned as close as possible to the center of the locking range to
obtain an optimum jitter suppression. Secondly, the injection locking technique cannot
suppress the $1/f^3$ noise of the VCO. This is because the injection locking is actually
equivalent to a single-pole feedback system that can only achieve 20 dB/dec of in-band
noise shaping [82, 89]. It means that the injection locking technique suppresses the
$1/f^2$ noise (converted from white noise) of the VCO but not its $1/f^3$ noise (converted
from flicker noise). Thirdly, the injected VCO is possibly locked to some harmonic
frequency of the injection signal [151]. This is traditionally solved by introducing an
initial calibration procedure [152, 145, 153] to adjust the control voltage
close to the desired value. However, it cannot prevent the hidden risk of possibly losing
lock due to its limited lock-in range and weak lock-acquisition ability [86, 87]. As a
consequence, robust frequency tracking techniques with low-frequency noise suppression
abilities are in high demand to overcome these difficulties.

Figure 3.1: Previous frequency tracking techniques. (a) Traditional IL-PLL, (b) IL-PLL
with DLL-based injection position adjustment, (c) dual-loop architecture with
replica-VCO/VCDL, (d) TDC-based FTL, and (e) TPD-based FTL.
65
Chapter 3. Design of the Ring-Based Injection-Locked Clock Multiplier (RILCM)

3.1.2 Prior Arts

Fig. 3.1 summarizes the previous frequency tracking techniques that are utilized
to address the aforementioned issues. According to the frequency offset detection
mechanism, they are categorized into two different classes. One is based on traditional
PLL/DLL [see Fig. 3.1(a), (b), and (c)] and the other is based on injection caused
phase shift detection (PSD) [see Fig. 3.1(d) and (e)].
Fig. 3.1(a) shows the most general injection-locked (IL)-PLL, where the PLL sys-
tem keeps the natural frequency of the VCO located at the desired frequency harmonic
[88]. The main problem of this scheme is the mutual-pulling between the PLL locking
force and the injection locking force, which could degrade the jitter performance or
even result in a stability problem. The mutual-pulling is usually caused by the delay
mismatch between ti (the intrinsic delay of the pulse generator) and td (the delay of
the asynchronous divider) [see Fig. 3.1(a)], and their delay fluctuations over different
PVT corners make it even more difficult to handle. This problem was solved in [82]
by adding a voltage-controlled delay line (VCDL) preceding the pulse generator [see
Fig. 3.1(b)]. Driven by the DLL loop, the delay of the VCDL is adaptively adjust-
ed to maintain an optimal injection position. This method removes the timing issue
with the penalty of an additional DLL loop. Fig. 3.1(c) describes the dual-loop ar-
chitecture, where the frequency deviation is monitored by a separate PLL utilizing a
replica-VCO [83] or an independent DLL using the same delay cell as the main V-
CO [84]. The physical separation of the FTL and the injection-locked oscillator (ILO)
can effectively prevent the mutual-pulling problem between the two locking forces.
However, there are still several drawbacks within this architecture. Firstly, the auxil-
iary PLL/DLL consumes substantial extra power which lowers the power efficiency.
Secondly, the fabrication mismatch constrains the calibration precision. Thirdly, the
separate FTL cannot suppress the $1/f^3$ noise of the VCO, since the flicker noise tracked
by the PLL/DLL is independent of that in the main VCO. Generally, the common fea-
ture of the above mentioned architectures is employing an additional PLL/DLL loop to
correct the frequency offset. Hence, they can be classified as PLL/DLL-based FTLs.
The main drawback of these FTLs is the low efficiencies in power consumption and

66
Chapter 3. Design of the Ring-Based Injection-Locked Clock Multiplier (RILCM)

area occupation.
Meanwhile, the PSD-based FTLs are attracting more attention because of their
low power consumption and high jitter performance. By merging the frequency off-
set detection and the injection error detection into one single PSD, the always-on
PLL/DLL in the aforementioned FTLs is only required to work during the frequency
initialization, thus saving substantial power consumption. Since the
phase disturbance induced by the device noise is equally detected by the PSD without
distinction, the FTL is capable of attenuating the VCO in-band noise like a traditional
IL-PLL. Combined with the 20 dB/dec noise shaping introduced by the injection
locking, the $1/f^3$ noise of the VCO can be completely suppressed. Fig. 3.1(d) and (e)
present the digital and analog PSD-based FTLs, respectively [85, 146]. The former
adopts a time-to-digital converter (TDC) to measure the periodic phase errors caused
by frequency offset [85], while the latter utilizes a TPD to detect the phase shift be-
tween the injection-pulse center and the zero-crossing point of the IL-VCO [146]. For
the TDC-based FTL, its performance is restricted by the TDC resolution and control
voltage granularity. The complex logic operation associated with the complicated cir-
cuit implementation also reduces its power efficiency. In contrast, the TPD-based FTL
shows superior power efficiency since its operation only involves the TPD, CP, and
LPF. As an example, the IL-PLL designed in [146] with the TPD-based FTL achieves
a figure-of-merit (FOM) of -247 dB at 3.2 GHz. However, there still exist several chal-
lenges within the TPD-based FTL. Firstly, it is quite challenging to design a high-speed
TPD since it needs to process the highest-speed injection pulse. Secondly, the TPD
must have high detection accuracy to distinguish the small phase shift caused by the
frequency offset. Thirdly, the hidden risk of possibly losing lock along with its limited
locking range and weak lock-acquisition ability reduces its robustness and reliability
[86, 87]. This work is aimed to address these issues in the TPD-based FTL.

Chapter 3. Design of the Ring-Based Injection-Locked Clock Multiplier (RILCM)

[Figure 3.2 block diagram: pulse generator and polarity detector in the injection path;
timing-adjusted loop (TPD/CP2); phase-locked loop (PFD/CP1 and DIV4 built from /2
dividers); MUX, shared LPF, and VCO; loop selection state machine (frequency lock
detector and loop selector) with signals REF_CLK, DIV4_90, INJ_LOCK, FRE_LOCK, TPD_EN,
PFD_EN, and EXT_MODE_SEL.]

Figure 3.2: The architecture of the proposed RILCM.

3.2 Proposed RILCM Architecture

3.2.1 Overall Architecture

Fig. 3.2 shows the block diagram of the proposed RILCM. It contains a pulse
generator (PG) and a hybrid FTL consisting of a traditional PLL, a timing-adjusted
loop (TAL), and a loop-selection state machine (LSSM). Driven by the LSSM, the
LPF/VCO alternately connects to PFD/CP1 and TPD/CP2 to accomplish lock acquisi-
tion. When the FTL switches from PLL to TAL, the resistor in series with the capacitor
is shorted to remove the stabilizing zero in the loop gain. This is because the injection
locking gives rise to the inclusion of a high pass filter within the TAL, thus making it
a first-order system.
This design has two main features. One is the newly developed TPD, which uses only a
few transistors to achieve both high detection accuracy and high operation speed. A
polarity detection mechanism is also introduced to avoid positive feedback. The other
is the lock-loss detection and lock recovery (LLD-LR) mechanism introduced in the
hybrid FTL, which automatically switches the FTL to the traditional PLL mode for a
specific duration to perform lock recovery whenever an injection-lock loss is detected.
In doing so, the pull-in range of the RILCM is effectively extended, which not only
solves the problem of initial lock acquisition but also removes the risk of losing lock
in the normal operation mode. Owing to these


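[Figure 3.3 block diagram: phase detector gain KTPD, LPF(s), and KVCO/s in the forward
path; the linear model of the IL-VCO formed by the 1 − Hinj(s) and N·Hinj(s) blocks;
the /N feedback divider; inputs θref(s) and θi(s), output θo(s), and noise inputs
θn,ref(s) and θn,vco(s).]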

Figure 3.3: Linear model of the RILCM in case of the injection-locked condition,
where θref (s), θi (s), θo (s), θn,ref (s), θn,vco (s) represent the reference input phase,
total input phase, output phase, reference input noise, and VCO noise, respectively.

two techniques, the proposed RILCM effectively prevents the mutual-pulling issue in
conventional IL-PLLs while keeping their good properties of enhanced in-band noise
suppression and high operation robustness, thus making it a competitive option for
commercial applications.

3.2.2 Architecture Modeling

Fig. 3.3 displays the detailed linear model of the RILCM with the TAL, where
the two main noise sources [i.e., the reference noise θn,ref (s) and the RVCO noise
θn,vco (s)] are included. In contrast to a traditional PLL, the injection locking gives
rise to the inclusion of [1 − Hinj (s)] within the TAL loop [85, 89], where Hinj (s)
denotes the normalized phase transfer function of the injection locking. Hinj (s) can be
approximated by an LPF with a left-half-plane pole around the tracking bandwidth of the
IL-VCO [89, 80, 154], so [1 − Hinj (s)] behaves as an HPF. The presence of this HPF
accounts for the in-band phase noise attenuation, since the phase errors are reset at
the arrival of each injection pulse. To explore the system stability and phase transfer
characteristics, the closed-loop characteristic equation is formulated as follows:

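\[
\left[\left(\theta_i(s) - \frac{\theta_o(s)}{N}\right)\cdot K_{TPD}\cdot LPF(s)\cdot\frac{K_{VCO}}{s} + \theta_{n,vco}(s)\right]\cdot\left[1 - H_{inj}(s)\right] + \theta_i(s)\cdot N\cdot H_{inj}(s) = \theta_o(s), \tag{3.1}
\]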


where θi (s) is the sum of the reference input θref (s) and the reference noise
θn,ref (s). Rearranging Eq. (3.1), the closed-loop output phase is obtained as

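\begin{align}
\theta_o(s) ={}& \theta_i(s)\cdot\frac{N\cdot LG(s)}{1 + LG(s)} + \theta_i(s)\cdot\frac{N}{1 + LG(s)}\cdot H_{inj}(s) + \theta_{n,vco}(s)\cdot\frac{1}{1 + LG(s)}\cdot[1 - H_{inj}(s)] \notag \\
={}& \theta_{ref}(s)\cdot\frac{N\cdot LG(s)}{1 + LG(s)}\cdot[1 - H_{inj}(s)] + N\cdot\theta_{ref}(s)\cdot H_{inj}(s) \notag \\
& + \theta_{n,ref}(s)\cdot\frac{N\cdot LG(s)}{1 + LG(s)}\cdot[1 - H_{inj}(s)] + N\cdot\theta_{n,ref}(s)\cdot H_{inj}(s) \notag \\
& + \theta_{n,vco}(s)\cdot\frac{1}{1 + LG(s)}\cdot[1 - H_{inj}(s)], \tag{3.2}
\end{align}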

where, in the expanded form, the first line represents the phase transfer of θref (s),
the second line stands for the noise transfer of θn,ref (s), the third line denotes the
noise transfer of θn,vco (s), and LG(s) is the loop gain, written as

\[
LG(s) = \frac{1}{N}\cdot K_{TPD}\cdot LPF(s)\cdot\frac{K_{VCO}}{s}\cdot[1 - H_{inj}(s)]. \tag{3.3}
\]
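
As a brief sketch of the intermediate algebra (added here for readability and not part of the original derivation), collecting the θo (s) terms of Eq. (3.1) and substituting the loop gain of Eq. (3.3) gives the first form of Eq. (3.2). The second form then follows from the identity in the last line, which can be checked by bringing its right-hand side over the common denominator 1 + LG(s).

```latex
\begin{align*}
\theta_o(s)\,[1 + LG(s)] &= N\cdot LG(s)\cdot\theta_i(s) + N\cdot H_{inj}(s)\cdot\theta_i(s)
                            + [1 - H_{inj}(s)]\cdot\theta_{n,vco}(s), \\
\frac{N\cdot LG(s) + N\cdot H_{inj}(s)}{1 + LG(s)}
  &= \frac{N\cdot LG(s)}{1 + LG(s)}\cdot[1 - H_{inj}(s)] + N\cdot H_{inj}(s).
\end{align*}
```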

Stability Consideration- The lock acquisition in this RILCM is achieved by alternately
enabling the PLL and TAL under the control of the LSSM. Hence, its stability problem
involves two aspects. One is that the transition between the two loops must be smooth
so as to avoid large voltage ripples on the control line of the IL-VCO. The other is
that the PLL and TAL must each be stable on their own, whichever loop is activated. To
guarantee smooth switching transitions, the MUX is placed before the LPF (see Fig. 3.2)
such that the sudden charge injection/extraction caused by the loop switching is
absorbed by the large capacitor in the shared LPF. To provide sufficient phase margin
for the PLL, a resistor in series with the loop filter capacitor is added in the LPF to
create a stabilizing zero in the loop gain. However, this stabilizing zero is not needed
in the TAL, since a pole located at the origin is cancelled by the [1 − Hinj (s)] term
in Eq. (3.3). Accordingly, the series resistor that helps to stabilize the PLL should be
shorted to maintain an adequate phase margin when the TAL is selected (see Fig. 3.2).
Shorting the series resistor also helps to reduce the ripples on the control voltage,
hence improving the spur


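[Figure 3.4 plots: (a) noise transfer paths from θn,vco and θn,ref to θn,out, namely
path A through 1/(1 + LG(s)) and 1 − Hinj(s), path B through N·Hinj(s) with a 20log(N)
dB plateau, and path C through N·LG(s)/(1 + LG(s)) and 1 − Hinj(s), with the corner
frequencies ftune and finj marked; (b) simplified output phase noise Sθ(f) versus f
showing the 1/f^3 and 1/f^2 regions for fc < ftune < finj.]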

Figure 3.4: NTF characteristics of the RILCM. (a) NTF behaviors and (b) simplified
noise shaping characteristics. Here, fc is the flicker-noise corner frequency of the
oscillator, finj stands for the bandwidth of the injection locking, ftune denotes the
tunable bandwidth of the TAL, 1/f^2 represents the phase noise region caused by the
white noise of the oscillator, and 1/f^3 represents the region caused by its flicker
noise.

performance. Referring to Eq. (3.3), the secondary pole within the TAL is set by the
dominant pole of Hinj (s). Therefore, the unity-gain bandwidth of the loop gain should
be designed to be smaller than the -3 dB bandwidth of Hinj (s) so as to guarantee
sufficient phase margin. Meanwhile, to suppress the 1/f^3 noise of the VCO, the TAL
bandwidth ftune is expected to be larger than the corner frequency fc of the VCO. In
this design, the bandwidth of the injection locking is designed to be 40 MHz, while the
TAL bandwidth can be adjusted by changing the CP current.
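
The phase-margin argument above can be illustrated numerically. The following is a minimal sketch under stated assumptions rather than the design's actual loop: Hinj (s) is taken as a single-pole low-pass at the 40 MHz injection bandwidth quoted in the text, the LPF is a capacitor only (series resistor shorted), and the constant `k` is a hypothetical lumped gain standing in for KTPD·KVCO/(N·C). The crossover frequencies in the loop at the bottom are illustrative values, not figures from the thesis.

```python
import cmath
import math

F_INJ_HZ = 40e6  # injection-locking bandwidth quoted in the text


def tal_loop_gain(f_hz, k, f_inj_hz=F_INJ_HZ):
    """Loop gain of Eq. (3.3) with an assumed first-order H_inj.

    k is a hypothetical constant lumping K_TPD * K_VCO / (N * C).
    """
    s = 2j * math.pi * f_hz
    w_inj = 2 * math.pi * f_inj_hz
    h_inj = 1 / (1 + s / w_inj)   # assumed single-pole low-pass
    lpf = 1 / s                   # capacitor-only filter, C folded into k
    vco = 1 / s                   # VCO integrator, K_VCO folded into k
    return k * lpf * vco * (1 - h_inj)


def gain_for_crossover(f_u_hz, f_inj_hz=F_INJ_HZ):
    """Pick k so that |LG| equals 1 at the chosen crossover frequency."""
    w_u = 2 * math.pi * f_u_hz
    w_inj = 2 * math.pi * f_inj_hz
    return w_u * math.hypot(w_u, w_inj)


def phase_margin_deg(f_u_hz, f_inj_hz=F_INJ_HZ):
    """Phase margin in degrees for a unity-gain bandwidth of f_u_hz."""
    k = gain_for_crossover(f_u_hz, f_inj_hz)
    lg = tal_loop_gain(f_u_hz, k, f_inj_hz)
    return 180.0 + math.degrees(cmath.phase(lg))


# Illustrative crossover choices (hypothetical values).
for f_u in (5e6, 10e6, 40e6):
    print(f"f_u = {f_u / 1e6:4.0f} MHz -> phase margin = {phase_margin_deg(f_u):5.1f} deg")
```

In this simplified model the [1 − Hinj (s)] term cancels one of the two integrator poles, so the loop gain reduces to k/[s·(s + ω_inj)] and the phase margin is 90° minus arctan(f_u/f_inj). Keeping the unity-gain bandwidth well below the injection bandwidth therefore preserves the margin, which matches the design guideline stated above.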
Noise Shaping Characteristics- Following the closed-loop transfer function in Eq.
(3.2), Fig. 3.4(a) describes the noise transfer function (NTF) behaviors of the two
main noise sources θn,ref and θn,vco . The three NTFs in Eq. (3.2) are mapped onto
three noise transfer paths: A, B, and C. Path A stands for the NTF from the VCO, path
B refers to the main NTF of the reference, and path C represents the secondary NTF
path from the reference. The TAL leads to the inclusion of an extra [1/(1 + LG(s))]
within the NTF of the IL-VCO [see path A in Fig. 3.4(a)] and introduces an additional
path [see path C in Fig. 3.4(a)] from the reference noise to the VCO output. The
equivalent NTFs for these two paths are plotted as gray solid lines [see Fig. 3.4(a)].
For path A, the presence of the [1/(1 + LG(s))] provides 20 dB/dec noise suppression.
Combined with the 20 dB/dec attenuation contributed by the [1 − Hinj (s)], the


1/f^3 noise of the VCO can be significantly suppressed as long as ftune and finj are
larger than the VCO corner frequency fc . This requirement can be easily satisfied by
adjusting the TAL loop parameters and the injection strength. Path B denotes the main
noise transfer mechanism of the RILCM, which can be considered as the reference NTF of
the IL-VCO without the TAL. As for path C, the reference noise transferred to the VCO
output is negligible, because the equivalent NTF of the cascaded [LG(s)/(1 + LG(s))]
and [1 − Hinj (s)] shows significant attenuation over all frequencies as long as their
bandwidths satisfy ftune < finj [see Fig. 3.4(a)]. This requirement is naturally met as
it coincides with the loop stability requirement. Fig. 3.4(b) presents the simplified
noise-shaping characteristics of the proposed RILCM with the TAL. The injection locking
along with the TAL strongly suppresses the in-band noise of the VCO, hence making the
in-band output noise closely track the reference noise.
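
The three noise paths of Eq. (3.2) can be evaluated numerically to see the shaping described above. This is a rough sketch under assumptions, not the simulated NTF of the design: Hinj (s) is modelled as a single-pole low-pass at 40 MHz, the loop gain uses a capacitor-only filter, and both `F_TUNE_HZ` and `N` are hypothetical values chosen only for illustration.

```python
import math

F_INJ_HZ = 40e6    # injection-locking bandwidth from the text
F_TUNE_HZ = 10e6   # hypothetical TAL unity-gain bandwidth (f_tune < f_inj)
N = 4              # hypothetical multiplication factor

W_INJ = 2 * math.pi * F_INJ_HZ
W_TUNE = 2 * math.pi * F_TUNE_HZ
# Lumped gain chosen so that |LG| = 1 at F_TUNE_HZ (assumption).
K = W_TUNE * math.hypot(W_TUNE, W_INJ)


def h_inj(s):
    """Assumed single-pole low-pass model of the injection locking."""
    return 1 / (1 + s / W_INJ)


def loop_gain(s):
    """Eq. (3.3) after the [1 - H_inj] term cancels one pole at the origin."""
    return K / (s * (s + W_INJ))


def ntf_paths(f_hz):
    """Return the complex NTFs of paths A, B, and C at offset f_hz."""
    s = 2j * math.pi * f_hz
    lg, h = loop_gain(s), h_inj(s)
    path_a = (1 - h) / (1 + lg)           # VCO noise to output
    path_b = N * h                        # main reference path
    path_c = N * lg / (1 + lg) * (1 - h)  # secondary reference path
    return path_a, path_b, path_c


def db(x):
    return 20 * math.log10(abs(x))


for f in (1e4, 1e5, 1e6, 1e7, 1e8):
    a, b, c = ntf_paths(f)
    print(f"{f:9.0e} Hz  A = {db(a):7.1f} dB  B = {db(b):6.1f} dB  C = {db(c):6.1f} dB")
```

Under these assumptions, path A falls at roughly 40 dB/dec towards low offsets (20 dB/dec from each of the two terms), path B sits at about 20log(N) dB inside the injection bandwidth, and path C stays below path B at every offset in the sweep.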

3.3 Injection-Locked Ring Voltage-Controlled Oscillator (IL-RVCO)

The LC oscillator has demonstrated excellent phase noise performance and power
efficiency. However, its large area, narrow tuning range, and inductor-induced
cross-coupling make it less suitable for multi-lane applications [153, 151]. In
contrast, the ring oscillator shows more potential in such applications because of its
wide operation range, multi-phase generation, and compact layout. Moreover, the
recently developed injection locking technique makes it possible to achieve jitter
performance comparable to that of its LC counterpart [82, 145]. This section first
describes the IL-RVCO based on a new FS-PDDC, and then explores the relative phase
difference (i.e., the crossing point of the IL-RVCO output relative to the injection
center) with respect to the frequency offset.


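[Figure 3.5 diagrams: (a) four delay cells connected in a ring, tuned by VCTRL, with
the INJ input; (b) and (c) pulse generator and injection-locking waveforms with the
signals INJ_EN, REF, INJ, CLK, and OUT.]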

Figure 3.5: IL-RVCO. (a) Four-stage RVCO implementation, (b) pulse generator, and
(c) injection locking behavior.

[Figure 3.6 diagrams: (a) and (b) delay cells with inputs IN/IP, outputs OP/ON, and
control voltage VCTRL; (c) FTG arrows marked "Accelerate Edge Transitions" on both
sides; (d) CCI arrows marked "Decelerate Edge Transitions" and "Accelerate Edge
Transitions".]

Figure 3.6: (a) FTG-based FS-PDDC, (b) CCI-based FS-PDDC, (c) effect of the FTGs,
and (d) effect of the CCIs. Here, the arrows stand for the effort directions that are
offered by the FTGs or CCIs.

3.3.1 Implementation of the IL-RVCO

Fig. 3.5(a) shows the adopted IL-RVCO, which consists of four identical delay cells
whose frequency is adjusted by the control voltage (VCTRL). The injection pulse is
applied to one of the four stages, while the injection transistors in the other stages
are connected to ground so that they do not disturb the injection. By injecting the
narrow pulses produced by the pulse generator in Fig. 3.5(b) into the IL-RVCO, the
accumulated jitter is corrected by the injection pulse at every reference cycle [see
Fig. 3.5(c)]. Fig. 3.6(a) presents the proposed FS-PDDC, where the pseudo differential
output is


ensured by a pair of forward transmission gates (FTGs). To illustrate the unique fea-
tures of this FS-PDDC, another representative implementation is also described in Fig.
3.6(b), whose pseudo differential output is guaranteed by two cross-coupled inverters
(CCIs) [155].
The common feature of these two FS-PDDCs is that they both employ a pair of
back-to-back varactors to tune the free-running frequency of the VCO, where VCTRL is
fed to the common body of the two varactors. In principle, when VCTRL goes down, the
equivalent voltage applied to the varactors increases, so the delay cells need to drive
higher load capacitances and the free-running frequency of the VCO decreases.
Conversely, as VCTRL goes up, the VCO frequency rises. Compared to conventional
supply-regulated delay cells [156, 155], the main advantages of these FS-PDDCs are
their high output swing and fixed common-mode voltage (around half of the power
supply), which remove the need for level-shift correction [123, 157] and thereby
facilitate their application. Fig. 3.6(c) and (d) describe the effects on edge
transitions that are contributed by the FTGs and CCIs [see Fig. 3.6(a) and (b)], where
the arrows stand for the effort directions that are offered by the FTGs or CCIs. The
arrows in Fig. 3.6(c) always coincide with the edge transitions, thus accelerating
them. The reason is that the transmission delay from IP to OP (IN to ON) through the
FTG is similar to that from IN to OP (IP to ON) via the inverter. In contrast, the CCIs
decelerate the edge transitions in the portion preceding the crossing point, while
accelerating the edge transitions in the portion succeeding the crossing point [see
Fig. 3.6(d)]. This can be understood by realizing that the state changes of the CCIs
happen at the crossing point. Before that, they provide negative feedback to preserve
the previous states [see the gray arrows in Fig. 3.6(d)]. After that, they contribute
positive feedback to speed up the state changes [see the black arrows in Fig. 3.6(d)].
Compared to the half-negative and half-positive feedback associated with the CCIs in
Fig. 3.6(b), the FTGs in Fig. 3.6(a) contribute persistent positive feedback, which
makes the FTG-based FS-PDDC a more promising solution for high-frequency applications.


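[Figure 3.7 diagram: left, current flow through the injection transistor while CLK_P
goes from VDD to VDD/2 and CLK_N from 0 to VDD/2 (positive feedback region, accelerates
the edge transitions); center, INJ pulse aligned with the CLK_P/CLK_N crossing; right,
current flow while CLK_P goes from VDD/2 to 0 and CLK_N from VDD/2 to VDD (negative
feedback region, decelerates the edge transitions).]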

Figure 3.7: Effect of the injection pulse on the speed of edge transitions, where the
preceding portion of the injection pulse contributes positive feedback while the
following portion provides negative feedback.

3.3.2 Relationship Between the Relative Phase Difference and the Frequency Offset

Previous work has demonstrated that the most challenging task in the TAL is detecting
the difference between the free-running frequency of the VCO and the target frequency,
since the VCO output frequency does not change with the control voltage in locked
conditions [82, 158]. Inspired by the design in [82], the relative phase difference of
the VCO output with respect to the center of the injection pulse is used to estimate
the frequency offset in this design. To explore the relationship between the relative
phase difference and the frequency offset, Fig. 3.7 summarizes the effect of the
injection pulse on the speed of edge transitions. An ideal injection, in which the
crossing point of the differential output clock occurs at the center of the injection
pulse, is depicted at the center of the diagram. The left subfigure describes the
current flow through the injection transistor for the preceding part of the injection
pulse. During this period, CLK_P falls from VDD to VDD/2 while CLK_N rises from 0 to
VDD/2. The current flows from CLK_P to CLK_N through the injection transistor,
providing an additional current path for both pulling down CLK_P and pulling up CLK_N.
Therefore, the preceding part of the injection pulse contributes positive feedback that
accelerates both the rising and falling edges. On the contrary, the following part of


[Figure 3.8 plots: (a) Frequency (GHz) versus Time (μs) for the injection frequency and
the injection-locked frequency, with markers at 2.492 GHz and 2.508 GHz; (b) Phase
Difference (mrad) versus Time (μs); (c) Phase Difference (mrad) versus Frequency (GHz)
with the slope annotation Kslope (1/40 M) and waveform insets labeled "Crossing Point
Lags Injection Center", "Crossing Point at Injection Center", and "Crossing Point Leads
Injection Center".]

Figure 3.8: Transient simulation results of the IL-RVCO. (a) Injection locking range,
(b) the relative phase difference with respect to the transient time, and (c) the relative
phase difference versus the frequency offset.

the injection pulse results in negative feedback. As illustrated in the right subfigure,
CLK_P falls from VDD/2 to 0 while CLK_N rises from VDD/2 to VDD. The current flows from
CLK_N to CLK_P, which slows down the edge transitions. Based on the above analysis,
when the free-running frequency is lower than the target frequency, the output clock
period needs to be decreased to catch up with the injection signal. Hence, the crossing
point should be located after the injection center so that the positive feedback is
stronger than the negative feedback and the VCO is sped up. Conversely, when the
free-running frequency is higher than the target frequency, the crossing point should
be located before the injection center to slow down the VCO. When the free-running
frequency equals the target frequency, the injection center is located at the crossing
point, where the phase-noise reduction contributed by the injection locking reaches its
maximum [151].
Fig. 3.8 displays the transient simulation results of the IL-RVCO, where an injection
pulse with a slowly ramping frequency is applied to the RVCO, while the control voltage
is set to a fixed value that places the center frequency of the IL-RVCO


around 10 GHz. The simulated frequencies of the RVCO’s quarter-rate output and the
injection pulse are plotted in Fig. 3.8(a). The IL-RVCO tracks the injection pulse over
a locking range of 2.491-2.509 GHz. Fig. 3.8(b) depicts the relative phase difference
(i.e., the crossing point of the IL-RVCO output relative to the center of the injection
pulse) with respect to the transient time. Replacing the transient time with the
frequency offset on the horizontal axis yields the relationship between the relative
phase difference and the frequency offset [see Fig. 3.8(c)], where the corresponding
locking positions are illustrated by the waveforms on the right. The relative phase
difference can be regarded as linear with respect to the frequency offset in the
vicinity of the locking center. According to the analysis in Appendix A, the reciprocal
of the slope Kslope is equal to the tracking bandwidth of the phase transfer function
Hinj (s) in Fig. 3.3. The linear relationship between the relative phase difference and
the frequency offset, together with the explicit tracking bandwidth, lays a solid
foundation for the FTL design, including the TPD implementation, stability analysis,
and bandwidth optimization.
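
The linear relationship near the locking center can be written as a small helper. This is only a sketch under assumptions: the tracking bandwidth is taken as the 40 MHz figure quoted for the injection locking, the slope is assumed to be expressed in radians per hertz of offset at the quarter-rate frequency, and the sign convention (positive offset meaning the crossing point lags the injection center) is chosen here for illustration. The function names are hypothetical.

```python
import math

F_TRACK_HZ = 40e6               # tracking bandwidth of H_inj (value from the text)
K_SLOPE = 1.0 / F_TRACK_HZ      # assumed units: rad per Hz of frequency offset


def phase_difference_rad(freq_offset_hz):
    """Relative phase difference predicted by the linear model near lock center."""
    return K_SLOPE * freq_offset_hz


def estimate_frequency_offset_hz(phase_rad):
    """Invert the linear model: frequency offset implied by a measured phase."""
    return phase_rad / K_SLOPE


# Illustrative offsets inside the simulated locking range (hypothetical samples).
for offset_hz in (-8e6, -4e6, 0.0, 4e6, 8e6):
    phase_mrad = 1e3 * phase_difference_rad(offset_hz)
    print(f"offset = {offset_hz / 1e6:+5.1f} MHz -> phase difference = {phase_mrad:+7.1f} mrad")
```

The model is only meant for the region around the locking center; close to the edges of the locking range the simulated curve is expected to depart from a straight line.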

3.4 The Proposed Phase Difference Detection

The main function of the TPD in the FTL is to detect the phase difference between the
crossing point of the differential output clock and the injection-pulse center, which
indicates the deviation between the free-running frequency of the VCO and the target
output frequency. Many attempts have been made to obtain an accurate phase difference.
In [153], a sub-sampling TPD (SSTPD), which embeds the sample-and-hold (S/H) circuits
into one of the stages, is adopted to monitor the injection timing. However, the heavy
load of the S/H along with the subsequent voltage-to-current converter can dramatically
prolong the delay of the SSTPD-embedded stage, which not only lowers the maximum
operation speed of the VCO but also leads to an I/Q matching problem. Additionally, its
output polarity is unpredictable since the sampled output depends on the injection
position (falling edge or rising edge). This leads to a 50% probability of forming an
undesired positive feedback in the initial state. Another TPD consisting of four AND
gates, six D-flip-flops (DFFs), and several logic gates is developed to convert the
phase differences to voltage pulses [146]. Nevertheless, its


[Figure 3.9 schematic: polarity detector built from two D flip-flops clocked by PULSE
that sample CLK_P and CLK_N to produce PO_P and PO_N; equivalent logic of the TPD with
polarity selection (inputs PS_P, PS_N, outputs PSP_P, PSP_N); and TPD/CP details with
1.2 V (MOS12) and 2.5 V (MOS25) devices, bias VBP, current ICP with ×β mirrors, enable
switches EN_P and EN_N, and nodes A to F.]

Figure 3.9: Circuit implementation of the combined TPD and CP.

complicated circuit implementation increases power consumption and area occupation.
Meanwhile, the logic operation involving the narrow injection pulse constrains its
maximum operation frequency, thus making it less suitable for high-speed applications.
To address these issues, a tightly combined TPD and CP with a polarity detector (POD)
is proposed (see Fig. 3.9), where both fast 1.2 V transistors and large-size 2.5 V
transistors are employed to provide high operation speed and high matching accuracy.
The phase detection is carried out by comparing the differential paths that are
connected to the injection pulse and the VCO’s complementary outputs. Each path adopts
two symmetrically connected branches to perform the NAND function. Owing to the compact
symmetrical implementation that solely utilizes fast NMOS transistors, this TPD can
achieve a high operation speed with low power consumption. For the output of the CP, a
differential structure with the enabling signals EN_P and EN_N along with a voltage
follower AMP is applied to realize smooth mode transitions between the PLL and TAL. The
voltage follower makes the voltage V(D) always follow V(C) to avoid obvious charge
extraction when the output is activated. Similarly, the CP in the PLL loop also
utilizes these techniques to make the two charge pumps (CP1 and CP2) compatible with
each other. It is worth noting that the CP works in the 2.5 V power domain, which is
capable of producing high-swing control voltages that help to extend the operation
range of the RVCO.


[Figure 3.10 waveforms, panels (a) and (b): CLK_P, CLK_N, PULSE (width Φ split into φp
and φn), PS_P, PS_N, PO_P, PO_N, PSP_P, and PSP_N.]

Figure 3.10: Locking behaviour of the proposed TPD. (a) Waveforms when injection
occurs at the falling edge of CLK P, and (b) waveforms when injection occurs at the
rising edge of CLK P.

3.4.1 Principle of the Proposed Timing-Adjusted Phase Detector

Fig. 3.10 shows the locking behaviour of the proposed TPD, where the operation of the
TPD logic can be considered as a pair of on/off switches that are controlled by the
equivalent signals PS_P and PS_N. The injection pulse is partitioned into two sections
ϕp and ϕn by the crossing point of the high-speed complementary clocks. When the
injection center leads the crossing point, the pulse width ϕp is larger than ϕn , and
vice versa. This width difference is then converted to current by the following CP in
Fig. 3.9, where the instantaneous current is determined by the threshold voltage of the
common-gate high-voltage transistor and the equivalent on-resistance of the
phase-detecting transistors connected to its source.
When the TAL is stable, the average output current of the CP is zero, so the crossing
point of the VCO’s output is located at the center of the injection pulse. It is
exactly under this condition that the free-running frequency of the VCO becomes close
to the target frequency, according to the analysis of the relationship between the
phase difference and frequency offset in Section 3.3.2. This means that the alignment
of the injection pulse center and the VCO’s output crossing point is a common target
for both the injection locking and the frequency tracking. Therefore, the race
condition between the two pulling forces is eliminated.
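
The partition of the injection pulse into ϕp and ϕn can be captured by a simple behavioural model. The sketch below is an idealization under assumptions and not the circuit of Fig. 3.9: the pulse is rectangular, ϕp is taken as the part of the pulse before the crossing point, the charge pump delivers a constant current `i_cp_a` (hypothetical parameter) during each section, and the sign of the net charge is a convention chosen here.

```python
import math


def partition_pulse(pulse_width_s, crossing_offset_s):
    """Split a rectangular injection pulse at the clock crossing point.

    crossing_offset_s is the crossing time minus the pulse-center time, so a
    positive value means the injection center leads the crossing point.
    Returns (phi_p, phi_n): the pulse portions before and after the crossing.
    """
    half = pulse_width_s / 2
    clamped = max(-half, min(half, crossing_offset_s))
    return half + clamped, half - clamped


def net_cp_charge(pulse_width_s, crossing_offset_s, i_cp_a):
    """Net charge per injection event for an idealized constant-current CP."""
    phi_p, phi_n = partition_pulse(pulse_width_s, crossing_offset_s)
    return i_cp_a * (phi_p - phi_n)


# Hypothetical numbers: 100 ps pulse, 100 uA charge-pump current.
for offset_ps in (-20, 0, 20):
    q = net_cp_charge(100e-12, offset_ps * 1e-12, 100e-6)
    print(f"crossing offset = {offset_ps:+3d} ps -> net charge = {q * 1e15:+6.2f} fC")
```

The net charge vanishes only when the crossing point sits at the pulse center, which is the steady-state condition described above.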


3.4.2 Polarity Selection

One common problem in the existing TPD-based TALs is that an improper injection
position may lead to an undesired positive feedback [82, 153]. To be more specific,
Fig. 3.10 illustrates the functional waveforms of the TPD logic when injection occurs
at different transition edges. For the condition that the injection happens at the
rising edge of CLK_P as shown in Fig. 3.10(a), the equivalent pulses PS_P and PS_N are
given by

\[ \mathrm{PS\_P} = \mathrm{CLK\_N} \cdot \mathrm{PULSE}, \tag{3.4} \]

\[ \mathrm{PS\_N} = \mathrm{CLK\_P} \cdot \mathrm{PULSE}. \tag{3.5} \]

When the injection occurs at the falling edge of CLK_P as depicted in Fig. 3.10(b),
they become

\[ \mathrm{PS\_P} = \mathrm{CLK\_P} \cdot \mathrm{PULSE}, \tag{3.6} \]

\[ \mathrm{PS\_N} = \mathrm{CLK\_N} \cdot \mathrm{PULSE}. \tag{3.7} \]

If no measure is taken, the sign of the detected phase difference flips as the
injection position switches between the two possible locking conditions depicted in
Fig. 3.10. This gives the TAL a 50% chance of operating in positive feedback in the
initial state, which may cause a false lock or even a lock failure since the injection
locking range is small. To solve this problem, the POD shown in Fig. 3.9 is introduced
to produce the polarity signals PO_P and PO_N by distinguishing the edge type at the
injection instant. The equivalent function of the TPD with the polarity selection is
also depicted in Fig. 3.9, and the waveforms of the final equivalent inputs PSP_P and
PSP_N are described in Fig. 3.10. The connections of the detected pulses PS_P and PS_N
are exchanged by the polarity selection signals PO_P and PO_N. Therefore, the same
equivalent pulses PSP_P and PSP_N are obtained for both conditions shown in Fig. 3.10.
Consequently, the possible positive feedback is avoided.
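
The effect of the polarity selection can be checked with a small sampled-waveform model. This is a behavioural sketch under assumptions, not the transistor-level logic: signals are lists of 0/1 samples, the POD is modelled as a flip-flop that samples CLK_P on the rising edge of PULSE, and the selection simply swaps the two AND products of Eqs. (3.4) to (3.7). The function name and the sample waveforms are hypothetical.

```python
def tpd_with_polarity(clk_p, pulse):
    """Return the widths (in samples) of PSP_P and PSP_N.

    clk_p and pulse are equal-length lists of 0/1 samples. CLK_N is taken
    as the complement of CLK_P (idealized differential clock).
    """
    clk_n = [1 - c for c in clk_p]

    # POD model: sample CLK_P at the rising edge of PULSE to obtain PO_P.
    po_p = 0
    for i in range(1, len(pulse)):
        if pulse[i] == 1 and pulse[i - 1] == 0:
            po_p = clk_p[i]
            break

    and_p = [c & p for c, p in zip(clk_p, pulse)]  # CLK_P . PULSE
    and_n = [c & p for c, p in zip(clk_n, pulse)]  # CLK_N . PULSE

    # Polarity selection: route the pre-crossing portion to PSP_P in both cases.
    if po_p:
        psp_p, psp_n = and_p, and_n
    else:
        psp_p, psp_n = and_n, and_p
    return sum(psp_p), sum(psp_n)


# Pulse is high for samples 5..14; the clock crossing happens at sample 12.
pulse = [0] * 5 + [1] * 10 + [0] * 5
falling = [1] * 12 + [0] * 8   # injection at a falling edge of CLK_P
rising = [0] * 12 + [1] * 8    # injection at a rising edge of CLK_P

print("falling edge:", tpd_with_polarity(falling, pulse))
print("rising edge: ", tpd_with_polarity(rising, pulse))
```

With the selection in place, both edge types yield the same pair of pulse widths, so the sign of the detected phase difference no longer depends on which clock edge the injection lands on.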


[Figure 3.11 diagrams: (a) LSSM circuit with the frequency lock detector (two D
flip-flops producing FD1 and FD2 from REF_CLK and DIV4_90, followed by an XOR that
generates FRE_LOCK) and the loop selector (XOR edge detection on INJ_LOCK and FRE_LOCK,
saturation edge counter, RC = 60 ns reset timer at node A, outputs TPD_EN and PFD_EN,
input EXT_MODE_SEL); (b) FLD waveforms of REF_CLK, DIV4_90, FD1, FD2, and FRE_LOCK for
the cases "Target Harmonic Locked", "False Harmonic Locked", and "Regular Frequency
Deviation".]

Figure 3.11: Implementation of the introduced LSSM. (a) Circuit details and (b) be-
havior of the FLD.

3.5 Mechanism of the Lock-Loss Detection and Lock Recovery (LLD-LR)

3.5.1 Operation Process of the LLD-LR

Previous TPD-based TALs suffer from an initial lock-acquisition problem and a risk of
losing lock due to their limited locking range and weak lock-acquisition ability [146,
153]. To overcome these difficulties, a complete LLD-LR mechanism is embedded into the
hybrid FTL under the control of the LSSM. Fig. 3.11(a) gives the details of the LSSM.
It consists of a frequency lock detector (FLD) and a loop selector (LS). The
injection-lock indicator INJ_LOCK and the frequency-lock indicator FRE_LOCK are applied
to the LS, where the total number of edge transitions on INJ_LOCK and FRE_LOCK is
recorded by a saturating counter. Once the number reaches a specific value (4 in this
design), the LS switches the FTL from TAL to PLL to start a lock-acquisition process.
Simultaneously, the RC timer with a time constant of 60 ns is launched to charge node A.


When its voltage climbs to the inverter threshold, the LS is reset and the FTL switches
back to TAL to engage injection locking. If the VCO successfully locks to the injection
pulse at the target frequency, both INJ_LOCK and FRE_LOCK will stay static, and so will
the LS. Otherwise, this process is repeated until injection locking is achieved. During
the initial period, the injection lock is obtained by repeating this LLD-LR process.
During the normal operation period, a lock loss is detected in time to activate this
LLD-LR process. Additionally, it is worth noting that almost no extra power is
dissipated in the normal loss-detecting mode since there is no signal transition in the
LSSM.
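
The LLD-LR sequence described above can be summarized as a small behavioural state machine. The following is a sketch under assumptions rather than the implemented circuit: the class name `LoopSelector` and its interface are hypothetical, the inverter threshold is assumed to be half of the supply (so the recovery time is RC·ln 2), and edges that arrive while the PLL is engaged are simply ignored.

```python
import math


class LoopSelector:
    """Behavioural model of the loop selector with a saturating edge counter."""

    def __init__(self, threshold=4, rc_s=60e-9, vth_ratio=0.5):
        self.threshold = threshold
        # Time for node A to charge to the assumed inverter threshold.
        self.recovery_time_s = -rc_s * math.log(1.0 - vth_ratio)
        self.count = 0
        self.mode = "TAL"
        self.timer_start_s = None
        self.prev = None

    def step(self, t_s, inj_lock, fre_lock):
        """Advance the model to time t_s and return the active loop."""
        if self.mode == "PLL":
            # Lock recovery in progress: wait for the RC timer, then reset.
            if t_s - self.timer_start_s >= self.recovery_time_s:
                self.mode = "TAL"
                self.count = 0
                self.prev = (inj_lock, fre_lock)
            return self.mode

        if self.prev is not None:
            edges = int(inj_lock != self.prev[0]) + int(fre_lock != self.prev[1])
            self.count = min(self.threshold, self.count + edges)
        self.prev = (inj_lock, fre_lock)

        if self.count >= self.threshold:
            self.mode = "PLL"
            self.timer_start_s = t_s
        return self.mode


# Hypothetical stimulus: INJ_LOCK toggles four times, then stays static.
selector = LoopSelector()
stimulus = [(0, 0), (1, 0), (0, 0), (1, 0), (0, 0)] + [(0, 0)] * 6
for i, (inj, fre) in enumerate(stimulus):
    t = i * 10e-9
    print(f"t = {t * 1e9:5.1f} ns  mode = {selector.step(t, inj, fre)}")
```

While both indicators are static the model stays in TAL mode and no state changes occur, which mirrors the observation that the LSSM dissipates almost no power during normal operation.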

3.5.2 Principles of the Lock Loss and False Lock Detection

Fig. 3.11(b) describes the functional behavior of the FLD. When the feedback frequency
is exactly equal to the reference frequency under the locked condition [left panel in
Fig. 3.11(b)], the outputs FD1 and FD2 of the mutually sampling D-flip-flops (DFFs)
stay unchanged and thus the frequency-lock indicating signal FRE_LOCK remains static.
When the frequency of the feedback clock equals a sub-harmonic or harmonic of the
reference frequency [middle panel in Fig. 3.11(b)], FRE_LOCK is a delayed version of
the low-frequency clock since the mutual sampling preserves all the timing information
of the low-frequency clock. For the regular condition in which there is a frequency
deviation between the feedback clock and the reference clock [right panel in Fig.
3.11(b)], the FLD also produces transitions on FRE_LOCK. In general, FRE_LOCK stays
static only when the VCO is running at the target frequency. Hence, the presence of
transitions on FRE_LOCK can be considered as a frequency-lock failure.
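
The mutual-sampling behaviour of the FLD can be reproduced with an event-driven sketch. This is an idealized model under assumptions: both clocks are 50% duty-cycle square waves, FD1 is taken as DIV4_90 sampled on the rising edge of REF_CLK and FD2 as REF_CLK sampled on the rising edge of DIV4_90, FRE_LOCK is their XOR, and exact rational arithmetic is used to avoid rounding at the sampling instants. The function names and the normalized frequencies are hypothetical.

```python
import math
from fractions import Fraction


def level(t, freq, phase):
    """Logic level of a 50% duty-cycle square wave (phase given in cycles)."""
    return 1 if (freq * t + phase) % 1 < Fraction(1, 2) else 0


def rising_edges(freq, phase, t_end):
    """Rising-edge times in [0, t_end) of the square wave defined above."""
    edges = []
    m = math.ceil(phase)
    while (m - phase) / freq < t_end:
        edges.append((m - phase) / freq)
        m += 1
    return edges


def fre_lock_transitions(f_ref, f_fb, phase_fb, t_end):
    """Count the edge transitions on FRE_LOCK = FD1 xor FD2."""
    zero = Fraction(0)
    ref_edges = set(rising_edges(f_ref, zero, t_end))
    fb_edges = set(rising_edges(f_fb, phase_fb, t_end))
    fd1 = fd2 = last = None
    transitions = 0
    for t in sorted(ref_edges | fb_edges):
        if t in ref_edges:
            fd1 = level(t, f_fb, phase_fb)   # FD1: feedback sampled by REF_CLK
        if t in fb_edges:
            fd2 = level(t, f_ref, zero)      # FD2: REF_CLK sampled by feedback
        if fd1 is None or fd2 is None:
            continue
        fre_lock = fd1 ^ fd2
        if last is not None and fre_lock != last:
            transitions += 1
        last = fre_lock
    return transitions


quarter = Fraction(1, 4)  # 90-degree offset of the feedback clock
print("target lock:        ", fre_lock_transitions(1, 1, quarter, 200))
print("false harmonic lock:", fre_lock_transitions(1, 2, quarter, 200))
print("frequency deviation:", fre_lock_transitions(1, Fraction(101, 100), quarter, 200))
```

In this model FRE_LOCK is static only in the first case, while a harmonic relationship or a small frequency deviation both produce transitions. With a small deviation the transitions are sparse, which is consistent with the slow detection noted in the following paragraph.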
Although the FLD can detect any frequency deviation [see Fig. 3.11(b)], it takes a long
time for a cycle slip to occur and generate the frequency-loss edge transitions when
the frequency of the VCO is close to the target frequency. Due to the small locking
range, such an occurrence is highly likely during the injection locking process. To
speed up the lock-loss detection, INJ_LOCK (a buffered version of PO_P) is also applied
to the LS. Recalling the cases when the VCO is locked to the

injected pulse as shown in Fig. 3.10, the DFFs in the POD triggered by the rising edge
of the injected pulse will always sample the same logic level. Therefore, the polarity
signal INJ_LOCK stays unchanged at either logic high or logic low. Conversely, if the
VCO fails to lock to the injected pulse, the injection position will drift as the phase
error accumulates, which will eventually bring in a cycle slip. At this moment, the
polarity signal INJ_LOCK presents an edge transition, which can be considered as an
effective indicator of injection-lock failure. By monitoring the edge transitions on
both INJ_LOCK and FRE_LOCK, any injection failure, including injection-lock loss and
false harmonic lock, can be quickly detected.

3.6 Experimental Results

3.6.1 Tools and Fabrication Process

The RILCM is designed using a Dell R730 server with two E5-2609V4 CPUs, 128 GB of
memory, and an 8 TB hard disk. The schematic, layout, and simulation are respectively
completed with Schematic Composer, Virtuoso Layout, and Spectre/APS, which are
developed by Cadence. The software version is IC5141. The layout verification and
parasitic extraction are carried out by layout versus schematic (LVS)/design rule check
(DRC) and parasitic extraction (PEX) using Calibre 2013, which is developed by Mentor
Graphics. To characterize the jitter performance of the fabricated prototype, a very
clean reference clock is generated by a Keysight N5191A. For a 2.5 GHz output, it
presents phase noise of -146 dBc/Hz and -150 dBc/Hz at 1 MHz and 10 MHz offsets,
respectively. The rms jitter integrated from 10 kHz to 40 MHz is 38.7 fs. Unless
otherwise stated, the rms jitter in the following description is integrated over the
same frequency range.
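
For reference, the conversion from a single-sideband phase noise profile to integrated rms jitter can be sketched as follows. This is a generic illustration under assumptions, not the measurement procedure used here: the profile is interpolated as straight lines on a log-log plot, and the flat -150 dBc/Hz profile in the example is purely hypothetical (it is not the measured profile of the reference source, so it does not reproduce the 38.7 fs figure).

```python
import math


def rms_jitter_s(points, f0_hz):
    """Integrate a phase noise profile into rms jitter in seconds.

    points is a list of (offset_hz, dbc_per_hz) pairs sorted by offset; the
    profile is assumed to be piecewise linear on a log-log plot.
    """
    area = 0.0
    for (f1, l1_db), (f2, l2_db) in zip(points, points[1:]):
        l1 = 10 ** (l1_db / 10)
        slope = (l2_db - l1_db) / (10 * math.log10(f2 / f1))  # power-law exponent
        if abs(slope + 1) < 1e-12:
            area += l1 * f1 * math.log(f2 / f1)
        else:
            area += l1 * f1 / (slope + 1) * ((f2 / f1) ** (slope + 1) - 1)
    phase_rms_rad = math.sqrt(2 * area)   # both sidebands contribute
    return phase_rms_rad / (2 * math.pi * f0_hz)


# Hypothetical flat profile over the integration range used in this chapter.
flat_profile = [(10e3, -150.0), (40e6, -150.0)]
jitter = rms_jitter_s(flat_profile, 2.5e9)
print(f"rms jitter = {jitter * 1e15:.1f} fs")
```

The factor of two under the square root accounts for both noise sidebands, and dividing by 2π·f0 converts the rms phase in radians into time.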
The prototype chip is designed and fabricated utilizing a 65 nm process. Under a
typical corner, the cut-off frequency (fT ) of the NMOS transistor and the inverter delay
with a fan-out-of-4 in this process achieve 200 GHz and 13 ps, respectively. These two
metrics indicate that the utilized 65 nm process is able to provide enough bandwidth
and timing margin for the targeted 10 GHz RILCM design. Although an advanced

Chapter 3. Design of the Ring-Based Injection-Locked Clock Multiplier (RILCM)

Figure 3.12: Layout view of the whole RILCM chip, where the block placement of the
core circuits is illustrated in the left view. [Labeled blocks: PFD/CP1, TPD/CP2,
LSSM, PG, VCO, LPF, capacitor, and RILCM core.]

Figure 3.13: Layout views of the crucial blocks. (a) VCO, (b) PG, (c) PFD/CP1, (d)
TPD/CP2, and (e) LSSM.

process with a smaller minimum channel length, such as 45 nm, 32 nm, 22 nm, or
16 nm, can offer a higher fT and a shorter inverter delay, its high cost puts it beyond
our reach. Fortunately, our RILCM mainly focuses on the hybrid-loop frequency
tracking architecture, the improved FS-PDDC-based RVCO, the TPD circuit
implementation, and the LLD-LR mechanism. These techniques can still be verified
in the economical and practical 65 nm CMOS process.


3.6.2 Layout and Simulation Results

3.6.2.1 Layout Designs

Fig. 3.12 displays the layout view of the whole RILCM chip, where the block
placement of the core circuits is illustrated in the left view. The PG is placed very
close to the VCO to reduce the parasitic capacitance on the pulse output, and hence
provides an injection pulse with sharp edges. The PFD/CP1, TPD/CP2, and LSSM are
placed together to facilitate the connections among the mode selection signals. The
LPF is put close to the VCO to reduce the effect caused by supply fluctuations. Fig.
3.13 further presents the layout views of the crucial blocks. As shown in Fig. 3.13(a),
the VCO layout is arranged in a ring, which helps each delay cell see the same
parasitic capacitance and hence optimizes the noise performance of the
VCO. The main design point of the PG is to minimize the parasitic capacitance on the
pulse output node [see Fig. 3.13(b)]. As for the PFD/CP1 and TPD/CP2 shown in
Fig. 3.13(c) and (d), we have paid special attention to keeping the two comparison
branches symmetrical, thereby reducing the mismatch between the two compared
phases. The main consideration for the LSSM is the convenience of routing the
connection signals.

Figure 3.14: Simulation setup of the RVCO, where the left curve depicts the VCTRL
of the RVCO. [Plot: piecewise-linear (PWL) VCTRL ramp, voltage axis from 0.2 V to 1.2 V.]


Figure 3.15: Simulation results of the RVCO. (a) Differential output clock, (b) swing
reduction, (c) frequency range, and (d) phase noise. [Plot annotations: 1.1 V swing
in (b); 4.69 GHz frequency range in (c); VCTRL = 700 mV, -91.83 dBc/Hz at 1 MHz
offset, 30 dB/dec (1/f^3) and 20 dB/dec (1/f^2) regions, and corner frequency fc of
10 MHz in (d).]

Figure 3.16: Simulated performance comparison of the RVCOs with FTG-based and
CCI-based FS-PDDCs in terms of (a) operation frequency, (b) frequency range, (c)
FOM_PN, and (d) swing reduction. Here, the horizontal axes denote the size of
the FTG/CCI as a percentage of the main inverter. [Vertical axes: (a) frequency (GHz),
(b) frequency range (GHz), (c) FOM_PN (dBc/Hz), and (d) swing reduction (%).]


3.6.2.2 Simulated Performance of the RVCO

Fig. 3.14 describes the simulation setup of the RVCO, where the VCTRL rises
linearly from 0.2 V to 1.2 V to continuously adjust the operation frequency of
the RVCO. Fig. 3.15(a) demonstrates the transient output waveforms of the RVCO.
Fig. 3.15(b) shows the zoomed-in waveforms, which clearly show that the swing of the
RVCO outputs is reduced to 1.1 V. Fig. 3.15(c) displays the operation frequency of
the RVCO, which shows that the frequency range of the RVCO is 4.69 GHz when the
VCTRL changes from 0.2 V to 1.2 V. Fig. 3.15(d) gives the simulated phase noise of
the RVCO when the VCTRL is set to 0.7 V. The corner frequency of the 1/f^3 noise is
around 10 MHz and the phase noise at 1 MHz offset is -91.83 dBc/Hz.
To compare the performance of the RVCOs with the FTG-based and CCI-based
FS-PDDCs that are described in Fig. 3.6(a) and (b), we repeated the simulations in
Fig. 3.15 using the setup in Fig. 3.14 while changing the ratio of the FTG/CCI to
the main inverter. Fig. 3.16 summarizes the simulated comparison results, where the
horizontal axis is the size of the FTG/CCI as a percentage of the main inverter.
As depicted in Fig. 3.16(a), (b), and (c), the RVCO integrated with the FTG-based
FS-PDDCs holds the advantages of higher operation frequencies, wider tunable ranges,
and lower FOM_PN values over that with the CCI-based FS-PDDCs. Here, FOM_PN refers
to the phase noise FOM of the VCO, which is defined by

 
\[
FOM_{PN} = L(f_0, \Delta f) + 10\log\left(\frac{\Delta f^2}{f_0^2} \cdot \frac{P_{DC}}{1\,\mathrm{mW}}\right), \tag{3.8}
\]

where L(f0, ∆f) is the single-sideband phase noise at a frequency offset ∆f from a
carrier frequency f0, and PDC denotes the power consumption. A lower FOM_PN
indicates a better VCO [147]. When the size of the FTG/CCI increases from 5% to 50%
of the main inverter, the metrics of the RVCO with the FTG-based FS-PDDCs
improve [see the red curves with circle markers in Fig. 3.16(a),
(b), and (c)], while those of the RVCO using the CCI-based FS-PDDCs
deteriorate [see the blue curves with square markers in Fig. 3.16(a),
(b), and (c)]. Particularly, for the RVCO with the FTG-based FS-PDDCs, the operation
frequency rises from 9.98 to 10.95 GHz, the tunable range slightly increases from
4.65 to 4.75 GHz, and the FOM_PN improves from -155.8 to -158.2 dBc/Hz. As for the
RVCO with the CCI-based FS-PDDCs, the operation frequency drops from 9.6 to 7.6
GHz, the tunable range declines from 4.4 to 3.0 GHz, and the FOM_PN degrades from
-155.5 to -153.7 dBc/Hz. This is because the enlarged FTGs provide a higher
pre-driving ability and thus enhance the positive feedback, while the enlarged CCIs
reinforce the negative feedback more than the positive feedback.
On the other hand, the increased pre-driving ability gives rise to a prominent swing
reduction in the RVCO with the FTG-based FS-PDDCs [see the red curve with circle
markers in Fig. 3.16(d)], which is undesirable in some applications. In this design,
the size of the FTG is chosen to be 15% of the main inverter to ensure that the
swing reduction is kept below 10%.
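
The phase-noise FOM of Eq. (3.8) can be evaluated with a few lines of code. The sketch below is only an illustration: the function name is hypothetical, and the inputs combine the simulated phase noise at 1 MHz offset with an assumed 10 GHz carrier and an assumed 48 mW VCO power, so it is not expected to reproduce the simulated values in Fig. 3.16(c).

```python
import math


def fom_pn(l_dbc_hz: float, f0_hz: float, df_hz: float, p_dc_w: float) -> float:
    """Phase-noise FOM of Eq. (3.8), in dBc/Hz.

    l_dbc_hz : single-sideband phase noise L(f0, df) at offset df
    f0_hz    : carrier frequency
    df_hz    : frequency offset
    p_dc_w   : power consumption in watts (normalized to 1 mW inside)
    """
    return l_dbc_hz + 10.0 * math.log10((df_hz / f0_hz) ** 2 * (p_dc_w / 1e-3))


# Assumed operating point: -91.83 dBc/Hz at 1 MHz offset,
# 10 GHz carrier (assumption), 48 mW VCO power (assumption).
fom = fom_pn(-91.83, 10e9, 1e6, 48e-3)
print(f"FOM_PN = {fom:.2f} dBc/Hz")
```

Since the carrier term contributes -20·log(f0/∆f), a higher oscillation frequency at the same phase noise and power directly lowers (improves) the FOM.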

Figure 3.17: Comparison of the transient procedure when operating in conventional
PLL mode and RILCM mode with LLD-LR. [Traces: VCTRL (mV) versus time (us) in
conventional PLL mode (annotated 3.48 us) and in RILCM mode with LLD-LR
(annotated 3.85 us); TPD_EN (V) versus time, switching between TAL mode and PLL mode.]


Figure 3.18: Transient behavior comparison. (a) With injection-lock indicator
INJ_LOCK and (b) without injection-lock indicator INJ_LOCK. [Traces: VCTRL (mV)
and TPD_EN (V) versus time, with zoomed insets.]


3.6.2.3 Settling Behavior of the Hybrid Frequency Tracking Loop

The simulated settling behavior of the proposed RILCM with the LLD-LR is plotted
in Fig. 3.17, where the PLL and the TAL work alternately under the control of
TPD_EN and PFD_EN. As shown in the zoomed-in subfigure, although the settling
process is periodically interrupted by frequent TAL engagement, it still exhibits a
lock-acquisition process similar to that of the traditional PLL. This is because the
lock loss can always be quickly detected by the designed LSSM, so the FTL spends
a high proportion of the time in PLL mode. Benefiting from the improved
lock-acquisition ability, the issues mentioned previously, such as possible harmonic
locking and weak robustness, are resolved.
Fig. 3.18 shows the transient behavior for the cases with and without the injection-
lock indicator INJ_LOCK. They share similar acquisition behavior when
the VCTRL is far away from its target value, since the large frequency difference
makes the frequency-loss detection behave almost the same as the combined
injection-loss and frequency-loss detection. For instance, the details for the first
500 ns are depicted in the top-left insets in Fig. 3.18(a) and (b). However, when the
VCO frequency is close to the target frequency, namely when the control voltage
VCTRL approaches its target value as detailed in the top-right insets in Fig. 3.18(a)
and (b), the detection method using only the frequency-lock loss indicator FRE_LOCK
requires a long time (e.g., 68 ns, 88 ns) to trigger the PLL loop. On the other hand,
the strategy monitoring both the frequency-lock loss signal FRE_LOCK and the
injection-lock loss signal INJ_LOCK can greatly shorten the detection time (e.g.,
18 ns, 19 ns). Consequently, the detection method involving both frequency-lock loss
and injection-lock loss makes the transient behavior of the proposed RILCM more
similar to that of the traditional PLL, which is convenient for fast start-up
applications.

3.6.3 Chip Micrograph and Measurement Results

3.6.3.1 Chip Micrograph and Power Breakdown

Fig. 3.19 shows the die micrograph of the fabricated prototype. The chip size
including pads is 0.8×0.9 mm2, where the active area of the RILCM occupies only
0.07 mm2. The power consumption is 59.4 mW, where 44.5 mA is drawn from a 1.2
V supply and 2.4 mA from a 2.5 V supply. The power breakdown is given
in Fig. 3.20. The introduced LSSM along with the two dividers costs only 3.0% (1.8
mW) of the total. The fabricated chip is mounted on a printed circuit board by
wire-bonding. The output clock of the RILCM is first divided by 2 and then applied
to an output buffer for measurement.

Figure 3.19: Die micrograph of the RILCM. [Annotated RILCM core dimensions:
350 μm × 200 μm.]

Figure 3.20: Power breakdown of the RILCM. [VCO: 48 mW; FTL: 7.8 mW; PG: 1.8 mW;
LSSM+DIV: 1.8 mW.]

Figure 3.21: Measured phase noise with half-rate output at 5 GHz. [Curves:
conventional PLL mode without injection, RILCM mode without FTL, and RILCM mode
with FTL; axes: phase noise (dBc/Hz) versus frequency offset (Hz).]

Figure 3.22: Measured reference spur with half-rate output at 5 GHz. (a) RILCM
without FTL and (b) RILCM with FTL. [Annotated spur levels: 58.31 dB in (a) and
57.13 dB in (b); axes: spur level (dB) versus frequency (Hz).]
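
The supply currents quoted above and the per-block values in Fig. 3.20 can be cross-checked with simple arithmetic. The snippet below is a small consistency check, not part of the design flow; the variable names are hypothetical, and the core dimensions are the ones annotated on the die micrograph.

```python
# Supply voltages (V) and currents (A) quoted in the text.
supplies = [(1.2, 44.5e-3), (2.5, 2.4e-3)]
total_mw = sum(v * i for v, i in supplies) * 1e3

# Per-block power (mW) read from the power breakdown in Fig. 3.20.
breakdown_mw = {"VCO": 48.0, "FTL": 7.8, "PG": 1.8, "LSSM+DIV": 1.8}
breakdown_total_mw = sum(breakdown_mw.values())
lssm_share = breakdown_mw["LSSM+DIV"] / breakdown_total_mw

# Core dimensions annotated in Fig. 3.19: 350 um x 200 um.
core_area_mm2 = 0.350 * 0.200

print(f"total from supplies : {total_mw:.1f} mW")
print(f"total from breakdown: {breakdown_total_mw:.1f} mW")
print(f"LSSM+DIV share      : {lssm_share:.1%}")
print(f"core area           : {core_area_mm2:.2f} mm^2")
```

Both totals come to 59.4 mW, the LSSM and dividers account for about 3.0% of it, and the annotated core dimensions give 0.07 mm2, in agreement with the quoted figures.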


Figure 3.23: Integrated rms-jitter versus supply voltage. [Curves: w/o FTL and
w/ FTL; axes: RMS jitter (fs) versus supply voltage (V), 1.1 V to 1.3 V.]

Figure 3.24: Integrated rms-jitter versus reference frequency. [Axes: RMS jitter (fs)
versus frequency of reference clock (GHz), 2 GHz to 3 GHz.]

3.6.3.2 Phase Noise and Spur Level Performance

Fig. 3.21 describes the measured phase noise (using the half-rate output at 5 GHz) in
three operation modes: conventional PLL without injection, RILCM without FTL, and
RILCM with FTL. The measured phase noise at an offset frequency of 10 MHz is
-120 dBc/Hz, -128 dBc/Hz, and -138 dBc/Hz, respectively, for the above three
operation modes. Correspondingly, the measured rms-jitters are 390.2 fs, 130.0 fs,
and 56.1 fs. The implemented RILCM thus demonstrates a significant improvement in
noise performance due to the noise shaping contributed by the pulse injection and the
continuous FTL. As illustrated in Fig. 3.22(a) and (b), the measured reference spur
levels without and with the FTL are 58.31 dB and 57.13 dB, respectively. Since Fig.
3.22(a) is measured under an ideal condition in which a nearly zero frequency
deviation is initially set, the slight spur degradation (1.2 dB) indicates that the FTL
can adjust the injection window to an optimal position without introducing destructive
disturbance.

3.6.3.3 Integrated-Jitter Performance

By repeating the phase noise measurement and jitter integration shown in Fig. 3.21, the
rms-jitter under different testing conditions can be obtained. Fig. 3.23 depicts the
rms-jitter versus the supply voltage for the modes with and without the FTL, where the
rms-jitter decreases as the supply voltage rises from 1.1 V to 1.32 V. This is because
the increased supply voltage enlarges the swing of the proposed VCO, which helps
sharpen the transition edges and thus reduces the conversion of device noise to jitter.
To evaluate the operation range of the RILCM, we recorded the rms-jitter while
continuously adjusting the reference frequency. The measurement results (see Fig. 3.24)
demonstrate that the RILCM can produce high-performance clocks (i.e., the rms-jitter
stays below 60 fs) over a wide output range of 8-12 GHz.

Table 3.1: PERFORMANCE SUMMARY OF THE RILCM

JSSC09 [85] ISSCC13 [146] ISSCC15 [158] ISSCC14 [82] ISSCC16 [81] JSSC14 [151] This work
Architecture LC-ILCM LC-ILCM LC-ILCM PPM Ring-ILCM Ring-ILCM Ring-ILCM Ring-ILCM
Freq. Tracking FTL w/ Timing FTL w/ FTL w/ Replica- Dual Loop w/
FTL w/ TDC None FTL w/ TPD
Method Adjusted PD Pulse Gating Delay Cell Replica-VCO
Lock-Acquisition Manually-Tuned PLL Coarse Freq. Manually-Tuned Coarse Freq.
None None
Auxiliary Control Voltage Initialization Selection Control Voltage Selection
Loss Detection, Not Not Not Not Not
Available Available
Lock Recovery Available Available Available Available Available
Output Freq. 3.2 GHz 2.4 GHz 6.75-8.25 GHz 2-16 GHz 0.96-1.44 GHz 0.5-1.6 GHz 8.0-12.0 GHz
Reference Freq. 50 MHz 150 MHz 105-129 MHz 0.25-2.0 GHz 120 MHz 40-300 MHz 2.0-3.0 GHz
Phase Noise at
-127.4 dBc/Hz -126.4 dBc/Hz -113.5dBc/Hz -115 dBc/Hz -134.4 dBc/Hz -124 dBc/Hz -133.8 dBc/Hz
1 MHz Offset
Jitterrms (δt ) 130 fs 188 fs 190 fs 268 fs 185 fs 700 fs 56.1 fs
(Integ. Range) (100k-40MHz) (1k-40MHz) (10k-100MHz) (100k-1GHz) (10k-40MHz) (10k-40MHz) (10k-40MHz)
Power Diss.
28.6 mW 5.2 mW 2.25 mW 46.2 mW 9.5 mW 0.97 mW 59.4 mW
(PDC )
Reference Spur -64 dBc -49 dBc -40 dBc -48 dBc -53 dBc -57 dBc -57.13 dBc
FOM -243.2 dB -247.0 dB -251.0 dB -235 dB -244.9 dB -243 dB -247.3 dB
Active Area 0.4 mm2 0.25 mm2 0.25 mm2 0.044 mm2 0.06 mm2 0.022 mm2 0.07 mm2
Technology 130 nm CMOS 65 nm CMOS 65 nm CMOS 20 nm CMOS 65 nm CMOS 65 nm CMOS 65 nm CMOS


3.6.4 Performance Comparison

Table 3.1 compares the performance of our RILCM with state-of-the-art ILCMs
that have the capability of frequency tracking. The phase noise at 1 MHz
offset and the integrated jitter of our RILCM outperform those of the other RILCMs
and are even comparable to those of the LC-ILCMs. This is mainly owing to the
effective combination of injection locking and frequency tracking, both of which
provide significant noise suppression. Additionally, the high-swing RVCO also helps
to reduce the phase noise. The good spur level indicates that the FTL can tune the
RVCO to the target free-running frequency and hence make the injection happen around
the optimal position. Meanwhile, it has a much smaller area than the LC-ILCMs
[85, 146, 158]. Additionally, the designed LLD-LR gives our RILCM a lock-acquisition
ability similar to that of conventional PLLs, thus making it a robust solution for
commercial products.
It is worth noting that some parameters of the proposed RILCM are inferior.
The tuning range of the proposed RILCM is narrower than that achieved in [82] due to
the limited tuning range of the back-to-back connected varactors. Fortunately, the 4
GHz tuning range is still relatively wide and can satisfy most applications.
The power consumption of our RILCM is higher than that of previous studies. However,
the power consumption alone cannot be considered as the performance criterion since it
is mainly determined by the utilized transistor sizes rather than the developed
techniques. To estimate the power efficiency of the proposed RILCM, the FOM of the
ILCMs is calculated, which is defined as

\[
FOM = 10 \cdot \log\left[\left(\frac{\delta_t}{1\,\mathrm{s}}\right)^2 \cdot \frac{P_{DC}}{1\,\mathrm{mW}}\right], \tag{3.9}
\]

where δt is the rms-jitter of the output signal and PDC is the power consumption. It is
usually considered as the performance-evaluation parameter of clock multipliers.
The proposed RILCM achieves the best FOM (-247.3 dB) among the RILCMs
[82, 81, 151], which indicates that the proposed RILCM (mainly referring to the
architecture and circuit topologies) has the potential to achieve a better FOM (i.e.,
power efficiency) than previously developed clock multipliers. As for the area
occupation, it is subject to the process, transistor sizes, and decoupling capacitance
values. If an advanced process with a smaller minimum channel length, such as 45 nm,
32 nm, 22 nm, or 16 nm, is utilized, the area occupation can be significantly reduced.

Figure 3.25: Performance-area-speed graph. [Top panel: FOM (dB) versus area (mm2);
bottom panel: FOM (dB) versus frequency (GHz). Plotted works: ISSCC 2011 [153]
ILRPLL, ISSCC 2012 [152] all-digital ILO, ISSCC 2014 [82] PPM-ILCM, VLSI 2015 [84]
Ring-ILCM, JSSC 2010 [87] ILRO, JSSC 2014 [151] IL-PLL, ISSCC 2016 [81] Ring-ILCM,
ISSCC 2013 [146] LC-ILCM, JSSC 2009 [85] LC-ILCM, ISSCC 2015 [158] LC-ILCM, and
this work.]

Fig. 3.25 gives a comparison between the proposed RILCM and previous work
in terms of the performance-area-speed trade-off. It can be seen that our RILCM
achieves a good balance among jitter performance, area occupation, operation speed,
and power efficiency.
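
Eq. (3.9) can be evaluated directly from the jitter and power rows of Table 3.1. The sketch below recomputes the FOM of each listed work and prints it next to the value given in the table; the function and variable names are hypothetical, and the inputs are copied from the table as printed.

```python
import math


def jitter_power_fom_db(jitter_s: float, p_dc_w: float) -> float:
    """FOM of Eq. (3.9): 10*log10[(jitter / 1 s)^2 * (P_DC / 1 mW)]."""
    return 10.0 * math.log10((jitter_s / 1.0) ** 2 * (p_dc_w / 1e-3))


# (rms-jitter in s, power in W, FOM in dB as listed in Table 3.1)
table_3_1 = {
    "JSSC09 [85]":   (130e-15, 28.6e-3, -243.2),
    "ISSCC13 [146]": (188e-15, 5.2e-3, -247.0),
    "ISSCC15 [158]": (190e-15, 2.25e-3, -251.0),
    "ISSCC14 [82]":  (268e-15, 46.2e-3, -235.0),
    "ISSCC16 [81]":  (185e-15, 9.5e-3, -244.9),
    "JSSC14 [151]":  (700e-15, 0.97e-3, -243.0),
    "This work":     (56.1e-15, 59.4e-3, -247.3),
}

recomputed = {name: jitter_power_fom_db(j, p) for name, (j, p, _) in table_3_1.items()}
for name, (_, _, listed) in table_3_1.items():
    print(f"{name:14s} recomputed {recomputed[name]:7.1f} dB, listed {listed:7.1f} dB")
```

The recomputed values agree with the listed ones to within a few tenths of a dB, and the 56.1 fs and 59.4 mW of this work give -247.3 dB. Because the jitter enters squared, halving the jitter improves the FOM by about 6 dB, whereas halving the power improves it by only about 3 dB.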

3.7 Chapter Summary

This chapter presents an RILCM using a newly developed TPD-based hybrid FTL, which
is capable of producing a low-jitter, high-speed output with low power consumption and
a small active area. The RILCM occupies 0.07 mm2 while producing an rms-jitter of
56.1 fs at a 10 GHz oscillation frequency and consuming 59.4 mW.


The utilization of the newly developed FS-PDDC-based RVCO leads to a low conversion
of device noise to phase noise and is convenient for subsequent applications.
A compact TPD associated with a well-matched CP is designed to accomplish high
phase-difference-detection accuracy and low charge-pumping disturbance. By starting
a traditional PLL in a timely manner under the control of the LSSM, the frequency
initialization that is essential in prior FTL-based ILCMs is eliminated. The LLD-LR
mechanism gives the developed RILCM a lock-acquisition ability comparable to that of
a conventional PLL, thus making it a robust solution for commercial products.
Moreover, the designed LSSM consumes little additional power since most of the logic
stays static once the target harmonic locking is obtained. Overall, the proposed system
achieves a good balance in the performance-area-speed-efficiency trade-off when
compared with other work.

Chapter 4

The Transmitter Design

As one of the most important components in serial links, the transmitter (TX) needs
to produce a full-rate data stream with precise timing for correct data transmission.
It must also provide sufficient voltage swing and appropriate equalization so that the
received signal maintains an adequate swing for the receiver to distinguish
the transmitted data bits without errors. This chapter presents a 5-50 Gb/s transmitter
with a 4-tap feed-forward equalizer (FFE), where the unit interval (UI)-spaced serial
data are produced by four parallel 4:1 multiplexers (MUXs). This scheme brings
several benefits, including compact layout implementation, accurate 1-UI delay
generation, and a wide operating range. To mitigate the inherently large self-drain
capacitance of the 4:1 MUX, an enhanced 4:1 pulling-down unit cell is proposed, which
not only improves the maximum operation speed but also effectively reduces the
charge-sharing effect. A compact latch array with an interleaved-retiming technique is
adopted to produce the required 16 paths of quarter-rate data streams, where the
retiming clocks for both the latch array and the 4:1 MUXs are generated by a clock
bundle that is implemented in a power-efficient CMOS style.
In the rest of this chapter, we first illustrate the design challenges in the
high-speed transmitter and then present the designed transmitter architecture.
Following that, the enhanced 4:1 MUX and the clocking techniques for the transmitter
are described. Finally, the experimental results of the transmitter are discussed.


Figure 4.1: (a) Critical path and (b) timing diagram for the 2:1 MUX. Here, tdiv is the
delay of the divider, tck-q is the ck-to-q delay of the 2:1 MUX, and tsetup is the setup
time of the sampling latch.

4.1 Design Challenges in High-Speed Transmitter

The difficulties in high-speed transmitter design lie mainly in two aspects.
The first is the timing constraints of the final-stage serialization. The second is
the bandwidth limitation of high-speed blocks such as latches, MUXs, and clock/data
driving buffers, which usually involves a tradeoff between bandwidth extension and
power consumption.

4.1.1 Timing Constraints

Fig. 4.1 re-draws the critical path and timing diagram of the 2:1 MUX. Note that
the latch needs to sample Da with CK1, hence sufficient setup time and hold time for
the sampling latch must be guaranteed for correct functionality. As shown
in Fig. 4.1(b), the hold time can be easily met as the data always hold for an adequate
time after the arrival of CK1. To satisfy the setup time constraint, the following
inequality must hold:

\[
t_{div} + t_{ck-q} + t_{setup} < 1\,\mathrm{UI}, \tag{4.1}
\]

where tdiv is the delay of the divider with a division factor of 2, tck-q is the ck-to-q
delay of the 2:1 MUX, and tsetup is the setup time of the sampling latch. The other
possible critical path is located at the final data selection stage, where the margin
for the sampling is 1 UI. It can be expressed as

\[
t^{MUX}_{setup} + t^{MUX}_{hold} < 1\,\mathrm{UI}, \tag{4.2}
\]

where t^MUX_setup and t^MUX_hold stand for the setup time and hold time of the final
2:1 MUX, respectively. As the data rate increases, the reduced bit period will lower
these timing margins and hence limit the maximum operation rate. Moreover, the delay
changes associated with the process, voltage, and temperature (PVT) variations make
this problem even more challenging. To overcome this difficulty, traditional half-rate
transmitters often insert extra delay-matching buffers [27, 24] or phase calibration
loops [100, 33, 26] between CK1 and the latch [see Fig. 4.1(a)]. For the former method,
the delay fluctuation between the multiplexing path and the delay-matching path may
exceed 1 UI and thereby cause bit errors. For the latter approach, the timing margin
is subject to the accuracy of phase detection, which could reduce the stability,
reliability, and robustness of the serializer. Meanwhile, both techniques involve
substantial power and area overhead. An alternative solution is to replace the last
three 2:1 MUXs with a single 4:1 MUX [159, 32, 24, 99]. The resulting quarter-rate
serialization relaxes the critical path timing margin to 3 UI, halves the maximum clock
speed, and saves considerable power, thus making it a promising solution for
high-speed serialization. It is worth noting that these benefits come with the penalty
of a doubled self-drain capacitance, which dramatically degrades the bandwidth of the
4:1 MUX and hence limits its maximum operation speed.
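
The setup constraint of Eq. (4.1) can be turned into a simple slack calculation. The sketch below is an illustration under stated assumptions: the function names and the delay values (8 ps, 7 ps, and 4 ps) are hypothetical and are not taken from the design, and the `budget_ui` parameter only illustrates how a larger UI budget, such as the 3 UI mentioned for the quarter-rate scheme, changes the slack for the same path delay.

```python
def unit_interval_ps(data_rate_gbps: float) -> float:
    """Bit period (1 UI) in picoseconds for a data rate in Gb/s."""
    return 1000.0 / data_rate_gbps


def setup_slack_ps(data_rate_gbps: float, t_div_ps: float, t_ckq_ps: float,
                   t_setup_ps: float, budget_ui: float = 1.0) -> float:
    """Slack of Eq. (4.1): budget - (t_div + t_ck-q + t_setup).

    A positive result means the constraint is met for the given budget.
    """
    budget_ps = budget_ui * unit_interval_ps(data_rate_gbps)
    return budget_ps - (t_div_ps + t_ckq_ps + t_setup_ps)


# Hypothetical path delays in picoseconds (assumptions for illustration).
delays = dict(t_div_ps=8.0, t_ckq_ps=7.0, t_setup_ps=4.0)

for rate in (25.0, 40.0, 50.0):
    slack_1ui = setup_slack_ps(rate, **delays)
    slack_3ui = setup_slack_ps(rate, budget_ui=3.0, **delays)
    print(f"{rate:4.0f} Gb/s: slack {slack_1ui:6.2f} ps (1 UI), {slack_3ui:6.2f} ps (3 UI)")
```

With these assumed delays the 1 UI budget leaves only 1 ps of slack at 50 Gb/s, which shows why PVT-induced delay changes of a few picoseconds are enough to violate the constraint at high data rates.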

4.1.2 Bandwidth Limitations

The transmitter contains a large number of latches, MUXs, and data/clock driving
buffers that operate at high speeds. As the data rate increases, the bandwidth
requirements for these blocks rise accordingly. An insufficient bandwidth prevents the
signal from fully reaching its high or low level, thus resulting in an attenuated
amplitude. This could bring significant detriments to the high-speed transmitter.
Firstly, the insufficient bandwidth can make the ck-to-q delay of the MUX occupy a
prominent portion of the bit period and hence restrict the maximum operation rate.


Figure 4.2: (a) Traditional CML-based MUX implementation and (b) power consumption
with different multiplexing ratios [16]. Here, N refers to the number of multiplexing
branches.

Secondly, the limited bandwidth could slow down the transition edges of the
transmitting clocks, which will deteriorate the jitter performance of the clock.
Thirdly, the insufficient bandwidth could lead to prominent inter-symbol interference
(ISI). Specifically, the limited bandwidth prevents the bit pulses from fully reaching
their high or low level, thereby leaving a long tail over the succeeding bits.
As a general method, the bandwidth can be extended by burning more power. Fig.
4.2 describes the power consumption versus data rate with different multiplexing
ratios (i.e., 1, 2, and 4), where the 1:1 MUX actually refers to the clock/data buffer
and the performance of the latch can be estimated by the 2:1 MUX. At low data rates,
the power consumption is linear with the data rate, as the self-drain capacitance can
be neglected. As the data rate rises, the power consumption grows exponentially. This
can be understood by noting that the self-drain capacitance gradually becomes the
dominant load, so the bandwidth of the MUXs can no longer be extended solely by
increasing the transistor sizes and power consumption. Referring to the
curves in Fig. 4.2(b), it seems that the half-rate serialization with the 2:1 MUX is
much more efficient than the quarter-rate serialization with the 4:1 MUX. However,
the quarter-rate serialization scheme eliminates the three half-rate latches, two
half-rate 2:1 MUXs, and a large number of quarter-rate latches as well as a few
half-rate clock/data driving buffers. These significant power savings can effectively
compensate for the power increase in the 4:1 MUX. The other advantage of this
multiplexing scheme is that it can significantly relax the timing constraints for the
final-stage data serialization since the data to be multiplexed operate at quarter rate
rather than half rate. The input data width is doubled and hence provides a doubled
timing margin, which makes it possible to produce the full-rate data stream across PVT
variations without additional matching buffers or phase tuning mechanisms. Owing to
these good properties, the quarter-rate architecture has become one of the most
promising solutions in 20+ Gb/s transmitter designs. One of the main tasks in this work
is to optimize the operation speed and energy efficiency of the 4:1 MUX, including
topology consideration, unit cell enhancement, and clocking optimization.

Figure 4.3: Block diagram of the transmitter chip. [Labeled blocks: 4-bit quarter-rate
parallel PRBS generator, latch array, four 4:1 MUXs with drivers (DRV) weighted by
α-1, α0, α1, and α2, 4-tap FFE combiner, ESD and termination at TX_P/TX_N, and
clock bundle (clock conditioner, DIV2, CML-to-CMOS conversion, buffers, and
pseudo-AND2s).]

4.2 Transmitter Architecture

4.2.1 Overall Architecture

The block diagram of the transmitter chip is illustrated in Fig. 4.3. It consists of
a multi-MUX-based 4-tap FFE combiner, a latch array, an on-chip PRBS generator,
and a clock bundle. In principle, the on-chip PRBS generator is utilized to generate
the parallel quarter-rate data streams D0<n>, D1<n>, D2<n>, and D3<n>. These four
data streams are then latched in an interleaved manner by the compact latch array to
produce the 16-path quarter-rate data for the following four 4:1 MUXs. The desired
timing relationship (see the signal positions in the latch array), which enables each
MUX to share the same timing margin, is satisfied by relatching with 90°-spaced
quarter-rate clocks. After the four 4:1 MUXs, the four full-rate UI-spaced serial
sequences are first buffered by the pre-drivers and then sent to the 4-tap FFE
combiner, which pre-distorts the output waveform and launches it into the transmission
channel. In the clock bundle, a clock conditioner is employed to convert the incoming
single-ended half-rate clock into differential outputs, which are then fed into a
divider (DIV2) to generate the quarter-rate I and Q clocks. These quadrature clocks are
applied to four CML2CMOS converters, which transform them into full-swing clocks. The
full-swing clocks are further applied to four driving buffers and four pseudo-AND2s to
produce the 50% and 25% duty-cycle clocks for the latch array
and the 4:1 MUXs, respectively.
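
The role of the four UI-spaced sequences in the FFE combiner can be illustrated with a short behavioral model. The sketch below is an idealized discrete-time view, assuming one pre-cursor, one main, and two post-cursor taps as labeled α-1, α0, α1, and α2 in Fig. 4.3; the function names, the ±1 symbol mapping, and the tap weights are hypothetical and do not correspond to the coefficients used in the chip.

```python
from typing import List, Sequence


def _symbol(symbols: Sequence[float], k: int) -> float:
    """Return symbol k, or 0.0 outside the sequence (idle line assumed)."""
    return symbols[k] if 0 <= k < len(symbols) else 0.0


def ffe_output(bits: Sequence[int], taps: Sequence[float]) -> List[float]:
    """Ideal 4-tap FFE output, one value per UI.

    taps = (a_pre, a_main, a_post1, a_post2). The pre-cursor tap weights
    the next bit, the post-cursor taps weight the two previous bits.
    """
    a_pre, a_main, a_post1, a_post2 = taps
    symbols = [1.0 if b else -1.0 for b in bits]
    out = []
    for n in range(len(symbols)):
        out.append(a_pre * _symbol(symbols, n + 1)
                   + a_main * _symbol(symbols, n)
                   + a_post1 * _symbol(symbols, n - 1)
                   + a_post2 * _symbol(symbols, n - 2))
    return out


# Hypothetical tap weights (assumptions), with |taps| summing to 1.
taps = (-0.1, 0.7, -0.15, -0.05)
bits = [0, 0, 0, 1, 1, 1, 1]
levels = ffe_output(bits, taps)
print([round(v, 2) for v in levels])
```

With these assumed weights the first bit after the 0-to-1 transition is driven at 0.8 of full swing, whereas the following repeated bits settle at 0.4. This emphasis of transitions relative to long runs is the pre-distortion that the combiner applies before the channel.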

4.2.2 Features of the Transmitter

The main feature of the transmitter chip is the compact implementation of the
multiple-4:1-MUX-based 4-tap FFE. Its quarter-rate multiplexing scheme not only relaxes
the stringent timing requirement of the final serialization stage but also provides a
robust approach to supporting a wide operation range. The interleaved-latching
method guarantees that the 16 quarter-rate data streams always maintain sufficient
timing margins for the 4:1 MUXs. To improve the performance of the 4:1 MUX,
we propose a new unit cell to cancel the charge-sharing effect, which not only reduces
its output jitter but also helps to optimize the self-drain capacitance and hence
improves its maximum operation speed. For the clocking, the shared 25% duty-cycle
UI-spaced clocks are produced by pseudo-AND2s. This clocking scheme not only
offers high power efficiency but also provides full-swing
outputs, which helps optimize the sizes of the gating transistors in the 4:1 MUX.


Figure 4.4: Conceptual circuit schematic of the traditional 4:1 MUX. [Four unit cells
driven by the data inputs D0N/P to D3N/P and by adjacent clock phases among CK1 to CK4.]

Figure 4.5: Four possible unit cell implementations of the 4:1 MUX. [Panels (a) to (d);
signals: Din, CKin,1, CKin,2, Iout, and internal node X in (c) and (d).]

4.3 Enhanced 4:1 Multiplexer (MUX)

4.3.1 Previous 4:1 MUXs

Fig. 4.4 displays the conceptual schematic of the traditional 4:1 MUX, which consists
of four pull-down unit cells and a pair of shunt-peaked loads. Each unit cell performs
two tasks, i.e., clock ANDing and data sampling. The former ANDs two adjacent clock
phases to determine the edge positions of the output pulse, while the latter samples the
input data and hence decides the logic level of the output pulse.
Fig. 4.5 shows four possible implementations of the unit cell within the 4:1 MUX. One
common feature of these unit cells is that the current source is eliminated to reduce
the number of stacked devices. In the first implementation [see Fig. 4.5(a)], the ANDing
and sampling operations are combined into one stage and hence the number of internal
nodes is reduced to the minimum. Nonetheless, the stacked devices in the output stage
need large sizes to provide sufficient driving current. The increased device size
presents a large capacitive load to the preceding stage and results in an increased
self-drain capacitance, which in turn limits the maximum operation speed and/or the
achievable power efficiency. To mitigate these issues, a second realization shown in
Fig. 4.5(b) is developed, where a separate stage is introduced to AND the two adjacent
clock phases CKin,1 and CKin,2 to produce the 25% duty-cycle pulse. This pulse is then
applied to the output stage to gate the enabling transistor, which transmits the input
data Din to the output. By separating the ANDing and sampling operations into two
stages, the number of stacked devices in the output stage is reduced to two, which could
significantly improve the operation speed and power efficiency of the 4:1 MUX. However,
processing the 25% duty-cycle pulse with sharp edges places a demanding requirement on
the 1-UI pulse generation. To avoid the 25% duty-cycle pulse, a third possible
implementation of the unit cell is developed in [24]. As shown in Fig. 4.5(c), the
leading clock CKin,1 is first sampled by the input data Din to remove the high pulses
whenever Din is low (corresponding to no discharging current in the output stage). After
that, this data-selected clock together with CKin,2 generates the pulse current that
transmits the input data onto the output with an accurate UI spacing. This technique
possesses three advantages. Firstly, the 25% duty-cycle pulse is avoided and hence the
stringent speed requirement on the internal nodes is relaxed. Secondly, the switching
activity of the preceding sampling stage is determined by the input data Din.
For random input data with equal probabilities of the two polarities, the switching
activity is 50%, which is lower than that of the design in Fig. 4.5(b). Finally, the
sampling stage actually performs the function of a latch, and thereby a latch in the
preceding stage can be saved. Fig. 4.5(d) shows a variant of the design in Fig. 4.5(c)
[32]. Instead of using an NMOS for the first latch, a PMOS latch is utilized to keep
node X pre-discharged rather than pre-charged. This allows the intermediate inverter to
be removed, which reduces the number of operating devices and hence leads to a
significant power saving. This unit cell also naturally implements the latching function
and therefore saves a latch in the preceding data path. The main disadvantage of this
topology is the stacked devices in the latch, which could slow down the edge transitions
of node X, thus limiting the maximum operation speed. Another common drawback of the
unit cells in Fig. 4.5(c) and (d) is that the sampling and ANDing operations are
integrated together in the unit cell, which rules out sharing the ANDing stage among
multiple MUXs.

4.3.2 Topology Consideration

Fig. 4.6(a) shows the schematic of the developed 4:1 MUX. Like the traditional 4:1 MUX
shown in Fig. 4.4, it is composed of a pair of shunt-peaked loads and four identical
pull-down unit cells. Unlike the conventional 4:1 MUXs that are directly driven by the
quadrature 50% duty-cycle clocks, these unit cells are activated sequentially by four
25% duty-cycle UI-spaced phases (CK0-90-180-270) to combine the four quarter-rate data
streams (D0-1-2-3) into one serial sequence (SDATA) [see Fig. 4.6(b)]. Compared to the
4:1 MUXs presented in [24, 32], which combine both the ANDing and sampling operations
into the pull-down unit cell, the unit cell in this design only performs the sampling
operation, while the ANDing operation is carried out by the pseudo-AND2s in the clock
bundle (see Fig. 4.3). This split arrangement allows the four 4:1 MUXs in Fig. 4.3 to
share one common ANDing stage, thus offering greater potential for power efficiency.
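
The serialization order of Fig. 4.6(b) can be illustrated with a minimal bit-level
sketch in Python. It is a behavioral model only, not the circuit: the function names
`phase_enables` and `serialize_4to1` and the list-based representation of the
quarter-rate streams are assumptions made for illustration. In each quarter-rate clock
period, the four one-hot 25% duty-cycle phases select D0, D1, D2, and D3 in turn.

```python
def phase_enables(slot):
    """Return the one-hot (CK0, CK90, CK180, CK270) pattern for UI slot 0..3.

    Each phase is high for exactly 1 UI out of 4, i.e. a 25% duty cycle.
    """
    return tuple(1 if slot == k else 0 for k in range(4))


def serialize_4to1(d0, d1, d2, d3):
    """Interleave four equal-length quarter-rate bit lists into one serial stream."""
    streams = (d0, d1, d2, d3)
    if len({len(s) for s in streams}) != 1:
        raise ValueError("all four quarter-rate streams must have equal length")
    sdata = []
    for n in range(len(d0)):          # one quarter-rate clock period
        for slot in range(4):         # four UIs per period
            enables = phase_enables(slot)
            # Exactly one unit cell is enabled in each UI and passes its data bit.
            sdata.append(sum(en * s[n] for en, s in zip(enables, streams)))
    return sdata


if __name__ == "__main__":
    # Illustrative data: D0 = [1, 0], D1 = [0, 0], D2 = [1, 1], D3 = [0, 1]
    print(serialize_4to1([1, 0], [0, 0], [1, 1], [0, 1]))
```

Because the phases are modeled as one-hot, the sketch also makes the sharing argument
visible: the enable pattern does not depend on the data, so one set of phases could
drive several MUXs.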


[Figure 4.6 labels: (a) four unit cells on phases PH0, PH90, PH180, and PH270, clocked by
CK0, CK90, CK180, and CK270, with data inputs D0N/P to D3N/P; (b) D0N/P to D3N/P, CK0 to
CK270, and SDATA, with Tsetup and Thold marked. SDATA carries D1<n-1>, D2<n-1>, D3<n-1>,
D0<n>, D1<n>, D2<n>, D3<n>, D0<n+1> in sequence.]

Figure 4.6: Topology of the 4:1 MUX. (a) Conceptual schematic and (b) timing diagram.

4.3.3 Enhancement on the Unit Cell of the 4:1 MUX

The main drawback of the quarter-rate serialization is the doubled self-drain
capacitance of the 4:1 MUX, which significantly constrains the maximum operation speed.
Consequently, bandwidth-extension techniques for the 4:1 MUX are highly desired. This
part first discusses the drawbacks of traditional unit cells and then presents our
optimization techniques. Fig. 4.7 depicts the two widely used traditional unit cells
that support the split placement of the ANDing and sampling operations. To optimize the
operation speed, the current source transistors are eliminated to avoid stacked devices.
In the data-up structure [101, 32] depicted in Fig. 4.7(a), the output can be corrupted
by the data transitions on other branches through the forward-coupling path from the
data input to the output when the MUX is performing data selection on one branch [37].
Fig. 4.7(b) describes the clock-up structure [21, 103], which addresses the
forward-coupling problem by moving the clocking pairs to the top to eliminate the
feed-through path. However, it suffers from a severe charge-sharing effect between the
outputs VOP/VON and the junction nodes X/Y, which either causes glitches when two
consecutive bits are at high level or slows down the rising edges of low-to-high
transitions.

[Figure 4.7 labels: outputs VOP/VON, data inputs D0N/D0P, clock CK0, transistors NM1 to
NM4, and internal nodes X/Y.]

Figure 4.7: Traditional unit cell implementations for high-speed 4:1 MUX. (a) Data-up
structure and (b) clock-up structure.

[Figure 4.8 labels: outputs VOP/VON, clock CK0, transistors NM1 to NM4 and PM1/PM2, data
inputs D0N/D0P, inverter (INV), and internal nodes X/Y.]

Figure 4.8: Improved unit cell implementation.

Inspired by the voltage-mode source-series terminated (SST) driver discussed in [98], we
introduce a pair of pre-charging transistors PM1/PM2 connected to nodes X/Y to mitigate
this effect. As shown in Fig. 4.8, the pre-charging PM1/PM2 and the data-gating NM1/NM2
constitute two inverters, which keep nodes X/Y always pre-driven to the desired states,
thus eliminating the charge-sharing effect. Compared to the SST implementation in [98],
the improved 4:1 MUX has greater potential for high-speed applications. The reason is
that it can fully exploit the process capability, as its compact NMOS driving topology
naturally features a fast current switching speed and a small parasitic capacitance.
Additionally, the speed-constraining output capacitances, including the self-drain load,
routing wire, and far-end driving load, can be neutralized by adopting on-chip peaking
inductors. In the rest of this part, we discuss the adverse effect of the charge sharing
in the conventional clock-up structure and the favorable effect of the introduced
pre-charging transistors.

[Figure 4.9 plot summary: simulated voltages VT(OUTP) and VT(X) without PM (top row) and
with PM (middle row), and the inputs VT(CK0) and VT(D0N) (bottom row). Without PM, node X
remains at the low state, which induces a large glitch in (a) and slows down the rising
edge in (b). With PM, node X is pre-charged to VDD, giving no glitch in (a) and a faster
rising edge in (b).]

Figure 4.9: Effect of the introduced PM on (a) high-level glitches and (b) edge
transitions.

(1) Charge-sharing effect in the conventional clock-up structure
The top row of the simulated waveforms in Fig. 4.9(a) and (b) demonstrates the two
adverse effects of the charge sharing in the conventional clock-up structure [see
Fig. 4.7(b)]. Assuming the upcoming data D0P/D0N are logic high/low, node Y is
pre-discharged to ground through NM2, which helps to speed up the falling edge. The
voltage of node X depends on the previously transmitted data. If the previous D0N was
logic low, node X has already been charged to its maximum allowed value (VDD − VTHN)
during the selection-enabled period (the high pulse duration of CK0) and holds this
value up to the present instant, since NM1 has remained in the cut-off state. In this
case, no prominent charge extraction occurs. If the previous D0N was logic high, node X
stays at the ground voltage to which it was pulled down during the hold time of the
previous bit period [i.e., Thold in Fig. 4.6(b)]. When the high pulse of CK0 arrives,
the capacitance at node X extracts charge from the output, thus causing a remarkable
glitch for two consecutive output bits at high level or slowing down the rising edge of
a low-to-high transition [see the waveform details in the top row of Fig. 4.9(a) and
(b)].
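
A first-order feel for the size of this effect follows from charge conservation. The
Python sketch below is only an estimate under stated assumptions: it treats the clock
transistor as an ideal switch between the output capacitance and the capacitance at
node X, ignores the current supplied by the load during the event, and uses
illustrative capacitance values that are not taken from the design. The function names
are hypothetical.

```python
def shared_voltage(c_out, v_out, c_x, v_x):
    """Voltage after two capacitors at v_out and v_x are connected together."""
    return (c_out * v_out + c_x * v_x) / (c_out + c_x)


def output_droop(c_out, v_out, c_x, v_x):
    """Drop of the output voltage caused by charge extraction into node X."""
    return v_out - shared_voltage(c_out, v_out, c_x, v_x)


if __name__ == "__main__":
    C_OUT = 45e-15  # hypothetical output capacitance (F)
    C_X = 5e-15     # hypothetical capacitance at node X (F)
    VDD = 1.2       # supply voltage (V)
    # Node X left at ground by the previous bit: the output is pulled down.
    print(output_droop(C_OUT, VDD, C_X, 0.0))
    # Node X already at the output level: no charge is extracted.
    print(output_droop(C_OUT, VDD, C_X, VDD))
```

The droop scales with C_X / (C_OUT + C_X) and with the initial voltage difference, which
is why the state left on node X by the previous bit matters.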
(2) Effect of the introduced pre-charging transistors
To demonstrate the effect of the pre-charging transistors PM1/PM2 shown in Fig. 4.8, we
take the PH0 branch as an example to illustrate the operation of the proposed pull-down
unit cell. When the input data arrive, nodes X/Y are, depending on D0N/D0P, either
pre-charged to VDD or pre-discharged to VSS by the two inverters consisting of PM1/PM2
and NM1/NM2. Nodes X/Y are therefore always pre-driven to the desired states, which
coincide with the output signal levels. When CK0 goes high, NM3/NM4 are turned on to
pass D0N/D0P to the MUX outputs. After a period of 1 UI, the pull-down path is switched
off by the falling edge of CK0, and the voltage levels of nodes X/Y stay unchanged until
the next input data arrive. The main feature of this 4:1 MUX is its ability to eliminate
the charge-sharing effect caused by the parasitic capacitances at nodes X/Y, which
brings several benefits. Firstly, the deterministic jitter and glitches caused by charge
extraction are remarkably mitigated [see the middle row in Fig. 4.9(a) and (b)].
Moreover, the glitch elimination effectively improves the noise margin, which allows a
lower output swing to save power. Secondly, the elimination of the charge-sharing effect
makes the capacitances at nodes X/Y less significant. Thus, large NM1/NM2 can be used to
enhance the discharging capability. Note that the output swing is determined by the
ratio between the resistive load and the equivalent resistance of the stacked NM1/NM3
(NM2/NM4). For a fixed minimum output swing, larger NM1/NM2 imply that the size of
NM3/NM4 can be reduced. The smaller NM3/NM4 help to decrease the self-drain capacitance
of the unit cells. Consequently, the bandwidth of the overall 4:1 MUX can be expanded.
Thirdly, the added transistors PM1/PM2 provide another path through NM3/NM4 that helps
to pull up the output, which accelerates the rising transitions.
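
The sizing trade-off described above can be sketched with a simple DC resistive-divider
model. This is a rough approximation that treats each on-transistor as a linear
resistor; the load resistance, the target swing, and the per-device resistances below
are illustrative assumptions, not design values, and the function names are
hypothetical.

```python
def single_ended_swing(vdd, r_load, r_on):
    """Swing of one output when the load is pulled down through resistance r_on."""
    return vdd * r_load / (r_load + r_on)


def required_on_resistance(vdd, r_load, swing):
    """Total stacked on-resistance that yields the requested swing."""
    if not 0.0 < swing < vdd:
        raise ValueError("swing must lie between 0 and vdd")
    return r_load * (vdd - swing) / swing


def clock_device_budget(r_on_total, r_data_device):
    """On-resistance left for the clock transistor (NM3/NM4) in the stack."""
    remaining = r_on_total - r_data_device
    if remaining <= 0.0:
        raise ValueError("data device alone exceeds the resistance budget")
    return remaining


if __name__ == "__main__":
    VDD, R_LOAD, SWING = 1.2, 50.0, 0.5           # illustrative values
    r_total = required_on_resistance(VDD, R_LOAD, SWING)
    for r_data in (40.0, 20.0):                   # a larger NM1/NM2 has lower resistance
        print(r_data, clock_device_budget(r_total, r_data))
```

In this model, lowering the resistance of the data device leaves a larger resistance
budget for the clock device, which corresponds to a smaller NM3/NM4 and hence a smaller
self-drain capacitance on the output.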


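[Figure 4.10 labels: (a) S2D stage and two buffers with input CK_IP, inputs IP/IN,
outputs OP/ON, resistors R and 2R, tail currents ISS, and component values 320 pH and
80 ohm; (b) two latches clocked by CKP/CKN with inputs IP/IN, outputs OP/ON, tail current
ISS, and 150 ohm loads; (c) component values 20k ohm and 200 fF.]

Figure 4.10: Circuit details of the clocking blocks. (a) Clock conditioner, (b) DIV2,
and (c) CML2CMOS.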


4.4 Clocking for the Transmitter

4.4.1 Topology of the Clock Bundle

As depicted at the bottom of Fig. 4.3, the desired full-swing clocks for the latch
array and the 4:1 MUXs are produced by a clock bundle, where current-mode logic
(CML)-style circuits are employed in the clock conditioner and DIV2 to support the
highest-speed (half-rate) operation, while the CML2CMOS and pseudo-AND2, which operate
at quarter rate, are implemented in a more power-efficient CMOS style.

4.4.2 Clocking Blocks

Fig. 4.10 presents the implementation details of these clocking blocks. In the clock
conditioner [see Fig. 4.10(a)], an AC-coupled CML stage with one input connected to a
fixed common-mode voltage (2VDD/3) is adopted to perform the single-ended-to-differential
conversion. This differential clock is further rectified by two CML buffers. To reduce
the power consumption, multi-layer on-chip inductors are employed to neutralize the
output capacitances. For the DIV2, a traditional inductorless CML latch shown in
Fig. 4.10(b) is used to balance the operation speed and layout compactness. Fig. 4.10(c)
gives the schematic details of the CML2CMOS, where an AC-coupled inverter with a
feedback resistor is utilized to convert the CML voltage level to full-swing CMOS logic.
This compact CML2CMOS features a small area and a high power efficiency. To some extent,
it is also capable of duty-cycle correction, since the DC voltage of the converted
full-swing clock is fed back to bias the common-mode voltage of the inverter.

[Figure 4.11 labels: (a) transistors PM1, PM2, NM1, and NM2, inputs CK0 and CK90, output
OUT, and internal node X; (b) waveforms of CK0, CK90, and OUT over phases PH1, PH2, and
PH3.]

Figure 4.11: Pseudo-NAND2. (a) Circuit details and (b) operation waveform.

[Figure 4.12 labels: clock conditioner, MCG, PRBS generator, main tap, pre1 tap, pst1
tap, pst2 tap, and FFE combiner.]

Figure 4.12: Layout view of the whole transmitter chip.

The function of the pseudo-AND2 is to AND two 50% duty-cycle quarter-rate clocks with a
90° phase shift to generate the 25% duty-cycle clocks (CK0-90-180-270 in Fig. 4.3) for
the 4:1 MUX. Because these clocks drive the final retiming stage, the transmitter
performance largely relies on them: any timing deviation is converted directly into
output jitter. This necessitates the following two properties: i) the high pulse width
of each phase should be exactly 1 UI, and ii) the spacing between any two adjacent
phases should be identical and equal to 1 UI. Generally speaking, these pulses can be
created by NORing or ANDing two 50% duty-cycle quarter-rate clocks with a 90° phase
shift. Considering the fact that series NMOS transistors are much faster than series
PMOS transistors, a NAND2 followed by a driving inverter is a better choice. Fig. 4.11
presents the designed pseudo-NAND2 and its operation waveforms. In contrast to a
conventional NAND2, this pseudo-NAND2 eliminates one of the two pull-up transistors [see
Fig. 4.11(a)]. In doing so, the output capacitance is reduced, thus leading to a higher
operation speed. The similar circuit realizations of the pseudo-AND2 and the BUF
(consisting of two cascaded inverters) also mitigate the delay mismatch between td1 and
td2 (see Fig. 4.3), which helps to meet the stringent timing constraints against PVT
variations. Fig. 4.11(b) presents the operation waveforms of the pseudo-NAND2. At the
beginning of PH1, node OUT is pulled up to VDD by PM1, and this level is held during PH2
since NM1 is still in the cut-off state. In PH3, both NM1 and NM2 are turned on to
generate the UI-spaced pulse. It is worth noting that there does exist a charge-sharing
effect between the capacitance at node X and the output. In particular, at the beginning
of PH1, CK0 goes low to trigger PM1 to charge the output node, while node X extracts
charge through NM1 since CK90 still remains high. To alleviate this effect, an abutment
layout approach with minimum gate spacing [see Fig. 4.11(a)] is exploited to reduce the
parasitic capacitance at node X. The large series transistors are divided into several
small series transistors, and every two small ones are connected in parallel, sharing a
common drain region to reduce the junction area.
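
The two timing properties required of the 25% duty-cycle clocks can be checked with an
idealized logic-level sketch. It assumes perfect 50% duty-cycle quadrature clocks with
zero rise time, sampled at a hypothetical resolution of 360 points per clock period;
none of the function names come from the design.

```python
def square_clock(n_samples, phase_deg, duty=0.5):
    """One period of an ideal clock whose rising edge sits at phase_deg."""
    shift = round(n_samples * phase_deg / 360.0)
    high = round(n_samples * duty)
    return [1 if (i - shift) % n_samples < high else 0 for i in range(n_samples)]


def and_pulse(ck_a, ck_b):
    """Logical AND of two sampled clocks (pseudo-NAND2 followed by an inverter)."""
    return [a & b for a, b in zip(ck_a, ck_b)]


def duty_cycle(wave):
    """Fraction of the period during which the waveform is high."""
    return sum(wave) / len(wave)


def quarter_rate_pulses(n_samples=360):
    """Four pulses, each the AND of one clock phase and the phase 90 degrees later."""
    phases = (0, 90, 180, 270)
    clocks = {p: square_clock(n_samples, p) for p in phases}
    return [and_pulse(clocks[p], clocks[(p + 90) % 360]) for p in phases]


if __name__ == "__main__":
    pulses = quarter_rate_pulses()
    print([duty_cycle(p) for p in pulses])
    # Exactly one pulse is high at any instant when the clocks are ideal.
    print(all(sum(column) == 1 for column in zip(*pulses)))
```

With ideal inputs each pulse is high for one quarter of the clock period (1 UI) and the
four pulses tile the period without overlap or gap. Any duty-cycle or quadrature error
in the input clocks would show up directly as unequal pulse widths or spacings in this
model.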

4.5 Experimental Results

4.5.1 Tools and Fabrication Process

The transmitter is designed on a Dell R730 server with two E5-2609V4 CPUs, 128 GB of
memory, and an 8 TB hard disk. The schematic, layout, and simulation are completed with
Schematic Composer, Virtuoso Layout, and Spectre/APS, respectively, which are developed
by Cadence (version IC5141). The layout verification and parasitic extraction, i.e.,
layout versus schematic (LVS), design rule check (DRC), and parasitic extraction (PEX),
are carried out with Calibre 2013 developed by Mentor Graphics. To measure the
fabricated prototype, a KEYSIGHT N5191A is used to generate the input clock, and a
KEYSIGHT DSA-X 93204A with an 80 GS/s sampling rate and a 32 GHz bandwidth is utilized
to characterize the jitter performance of the transmitter.


Figure 4.13: Layout views of the crucial blocks. (a) 4:1 MUX, (b) interleaved-retiming
latch array, (c) pseudo-NAND2 with an inverter, (d) CML2CMOS converter, (e) DIV2,
and (f) clock conditioner.

The prototype chip is designed and fabricated in a 65 nm process. Under a typical
corner, the cut-off frequency (fT) of the NMOS transistor and the fan-out-of-4 inverter
delay in this process are 200 GHz and 13 ps, respectively. This implies that the 65 nm
process is able to provide enough bandwidth and timing margin for the targeted 40 Gb/s
transmitter design. Although a more advanced process with a smaller minimum channel
length, such as 45 nm, 32 nm, 22 nm, or 16 nm, can offer a higher fT and a shorter
inverter delay, its high price puts it beyond our reach. Fortunately, our transmitter
mainly focuses on the interleaved-latching-based sequence generation, the
multi-MUX-based 4-tap FFE implementation, and the 4:1 MUX enhancement, which can still
be verified in the economical and practical 65 nm CMOS process.
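
As a rough sanity check of this timing argument, the sketch below expresses the clock
period of different clocking schemes in units of the quoted fan-out-of-4 (FO4) delay.
The 13 ps FO4 delay and the 40 Gb/s target come from the text; the comparison against
full-rate and half-rate clocking and the function names are assumptions added for
illustration.

```python
def unit_interval_ps(data_rate_gbps):
    """Unit interval in picoseconds for a serial data rate given in Gb/s."""
    return 1000.0 / data_rate_gbps


def clock_period_in_fo4(data_rate_gbps, fo4_delay_ps, ui_per_clock_period):
    """Clock period of a clocking scheme expressed in FO4 inverter delays."""
    return ui_per_clock_period * unit_interval_ps(data_rate_gbps) / fo4_delay_ps


if __name__ == "__main__":
    FO4_PS = 13.0  # FO4 delay quoted for the typical corner
    schemes = (("full-rate", 1), ("half-rate", 2), ("quarter-rate", 4))
    for name, ui_per_period in schemes:
        print(name, round(clock_period_in_fo4(40.0, FO4_PS, ui_per_period), 2))
```

At 40 Gb/s one UI is 25 ps, i.e. less than two FO4 delays, whereas a quarter-rate clock
period spans 100 ps, which gives the CMOS clocking logic several gate delays per cycle.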


4.5.2 Layout and Simulation Results

4.5.2.1 Layout Designs

Fig. 4.12 displays the layout view of the whole transmitter chip. The FFE combiner is
located at the right edge of the chip to directly drive the output pads. The four paths
consisting of 4:1 MUXs and drivers (i.e., the main tap, pst2 tap, pst1 tap, and pre1 tap
in Fig. 4.12) are placed next to the FFE combiner to shorten the connection wires. The
PRBS generator and the latch array are distributed in the free space among these four
multiplexing paths to generate the quarter-rate data with appropriate delays. The clock
conditioner and the MCG are placed at the left side of the chip to provide proper clocks
for the PRBS generator, latch array, and 4:1 MUXs.
Fig. 4.13 further presents the layout views of the crucial blocks. For the 4:1 MUX
shown in Fig. 4.13(a), the parasitic capacitances on the output nodes are minimized to
maximize the operation speed. For the latch array displayed in Fig. 4.13(b), special
attention is paid to the latch placement to facilitate the signal connections. For the
pseudo-NAND2 shown in Fig. 4.13(c), an abutment layout approach with minimum poly
spacing is adopted to reduce the parasitic capacitance on node X shown in Fig. 4.11. For
the CML2CMOS converter, DIV2, and clock conditioner [see Fig. 4.13(d), (e), and (f)],
special attention is paid to minimizing the parasitic capacitance, so that the received
clock can be well amplified, rectified, and divided.

4.5.2.2 Simulation Results

Fig. 4.14 illustrates the simulation setup of the transmitter chip. The inputs
bias_main, bias_pre, bias_pst1, and bias_pst2 correspond to the four tap weights of the
FFE combiner. The input clock operates at 25 GHz. The muxed data are the direct outputs
of the 4:1 MUX on the main-tap path. The output data are DC coupled to a pair of far-end
50 ohm resistors through a channel with 12 dB attenuation at 20 GHz.

[Figure 4.14 labels: inputs bias_main, bias_pre, bias_pst1, bias_pst2, and a 25 GHz clock
to the transmitter chip; muxed data; channel; output data.]

Figure 4.14: Simulation setup of the transmitter chip.

[Figure 4.15 plots: (a), (b) amplitude (V) versus time (ns) from 5.50 ns to 7.25 ns; (c),
(d) amplitude (V) versus time (ps) from 0 to 40 ps. Annotations: (c) maximum glitch
105 mV, jitter 1.6 ps; (d) no visible glitch, jitter 0.3 ps.]

Figure 4.15: (a) Transient waveform of the traditional unit cell, (b) transient waveform
of the enhanced unit cell, (c) eye-diagram of the traditional unit cell, and (d)
eye-diagram of the enhanced unit cell.

To evaluate the effect of the introduced PMs in the 4:1 MUX, the transient outputs and
overlapped eye-diagrams using the traditional unit cell [see Fig. 4.7(b)] and the
enhanced unit cell (see Fig. 4.8) are separately displayed in Fig. 4.15. The simulated
eye-diagrams indicate that the jitter induced by the charge sharing is reduced from
1.6 ps to 0.3 ps and the voltage glitches are mostly removed. It is worth noting that
the proposed 4:1 MUX has one drawback: its output swing is sensitive to PVT variations.
The reason is that the equivalent resistance of the two stacked transistors could change
considerably under different PVT corners. Fig. 4.16 gives the swing variations for
different PVT corners, where the swing variation is kept under 25%, and it can be
further reduced by adopting a tunable resistor as described in [24].

[Figure 4.16 plot: swing variation Δ% versus temperature (°C), where Δ% = ΔSW (swing
variation) / 500 mV (swing at the typical corner).]

Figure 4.16: Swing variations of the improved unit cell under different PVT corners.

[Figure 4.17 plots: amplitude (mV) versus time (ps). Annotations: (b) 440 mV, jitter
3.7 ps; (c) 150 mV, jitter 11.2 ps; (d) 400 mV, jitter 3.6 ps.]

Figure 4.17: Simulation eye-diagrams of the transmitter at (a) 10 Gb/s with over
equalization, (b) 40 Gb/s with proper equalization, (c) 50 Gb/s without equalization,
and (d) 50 Gb/s with proper equalization.

[Figure 4.18 labels: chip dimensions 1200 μm × 500 μm; clock conditioner, MCG, PRBS
generator, two groups of MUX ×2 and driver ×2, and FFE combiner.]

Figure 4.18: Chip micrograph of the transmitter.

[Figure 4.19 chart values: clocking 73 mW, driver/FFE combiner 43 mW, MUXs 22 mW, latch
array 11 mW, PRBS generator 7 mW; total power 156 mW.]

Figure 4.19: Power breakdown of the transmitter when operating at 50 Gb/s.

The performance of a transmitter is usually characterized by its output eye-diagram,
which folds a time-domain waveform into one or several bit periods. The two critical
parameters of the eye-diagram are the voltage swing and the inner eye opening: the
former determines the transmitter output power and sets a requirement for the receiver
sensitivity, while the latter indicates the overall performance in terms of jitter,
noise, and effective bandwidth. Fig. 4.17 shows the simulated eye-diagrams. Fig. 4.17(a)
displays the simulated eye-diagram at 10 Gb/s with over-equalization, where the
sub-levels are contributed by the FFE taps. Fig. 4.17(b) presents the simulated
eye-diagram at 40 Gb/s with proper equalization, where the horizontal jitter and the
vertical eye opening are 3.7 ps and 440 mV, respectively. Fig. 4.17(c) and (d) compare
the eye-diagrams before and after applying an appropriate equalization at 50 Gb/s.
Clearly, the FFE significantly improves the eye opening: the horizontal jitter is
reduced from 11.2 ps to 3.6 ps and the vertical swing is increased from 150 mV to
400 mV.
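
The sub-levels seen with over-equalization follow from the FIR nature of the FFE. The
sketch below is a generic 4-tap FIR model with one pre-cursor, one main, and two
post-cursor taps; the tap weights are purely illustrative and are not the settings used
in the simulations, and the function name is hypothetical. Bits outside the sequence
are treated as zero-valued symbols.

```python
def ffe_4tap(bits, c_pre, c_main, c_pst1, c_pst2):
    """Apply a 4-tap FFE (pre1, main, pst1, pst2) to a list of bits.

    Bits are mapped to +1/-1 symbols; y[n] combines the next symbol, the current
    symbol, and the two previous symbols with the given tap weights.
    """
    symbols = [1.0 if b else -1.0 for b in bits]

    def sym(i):
        # Symbols outside the sequence contribute nothing.
        return symbols[i] if 0 <= i < len(symbols) else 0.0

    return [
        c_pre * sym(n + 1) + c_main * sym(n) + c_pst1 * sym(n - 1) + c_pst2 * sym(n - 2)
        for n in range(len(symbols))
    ]


if __name__ == "__main__":
    TAPS = (-0.1, 0.7, -0.15, -0.05)  # illustrative weights (pre1, main, pst1, pst2)
    out = ffe_4tap([0, 0, 0, 1, 1, 1], *TAPS)
    print([round(v, 2) for v in out])
```

In this model the first bit after a transition is driven harder than a bit inside a long
run, so a slow data pattern on a low-loss channel shows several distinct output levels,
while on a lossy channel the same emphasis compensates the attenuated transitions.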


Figure 4.20: Measured output eye-diagrams of the transmitter at (a) 5 Gb/s with over
equalization, (b) 40 Gb/s without equalization, (c) 40 Gb/s with proper equalization,
and (d) 50 Gb/s with proper equalization.

4.5.3 Chip Fabrication and Measurement Results

4.5.3.1 Chip Fabrication and Power Consumption

Fig. 4.18 presents the chip micrograph; the chip occupies an area of 0.6 mm². Fig. 4.19
shows the power breakdown of the transmitter chip. It consumes 156 mW from a 1.2 V
supply when operating at 50 Gb/s, of which the four enhanced 4:1 MUXs consume only
22 mW. The fabricated chip is mounted on a printed circuit board (PCB) through wire
bonding. The transmitter output is measured after a compound channel consisting of
doubled bonding wires, a PCB trace, and a connection cable.

4.5.3.2 Measurement Results

Fig. 4.20 gives the measured eye-diagrams under different conditions. Fig. 4.20(a)
depicts the over-equalized eye-diagram when operating at 5 Gb/s, where the four
sub-levels are contributed by the four FFE taps. Fig. 4.20(b) and (c) present the output
eye-diagrams at 40 Gb/s before and after applying the 4-tap FFE. The comparison shows
that the FFE significantly improves the inner eye opening. Specifically, the eye height
and eye width are improved from 140 mV and 0.45 UI to 180 mV and 0.68 UI, respectively.
Meanwhile, the thickness of the eyelid is dramatically reduced from around 330 mV to
140 mV. Fig. 4.20(d) displays the properly compensated eye-diagram at the maximum
operation speed of 50 Gb/s. Its eye height and eye width are 50 mV and 0.38 UI,
respectively. Clearly, a wide operation range from 5 Gb/s to 50 Gb/s is achieved, which
is mainly attributed to the multi-MUX-based FFE implementation. Fig. 4.21 further
illustrates the transmitter output with four separate eyes. It can be seen that the
horizontal eye widths for both the fixed clock and PRBS patterns are almost identical,
thus proving that the four sampling phases are properly aligned.

Figure 4.21: Measured output eye-diagrams with four separate eyes. (a) Clock pattern
and (b) PRBS pattern.

4.5.4 Performance Comparison

Table 4.1 compares the measurement results of our transmitter chip with other
transmitters operating at similar data rates. The results indicate that this transmitter
chip achieves a wider operation range, lower jitter, and equal or better energy
efficiency compared with the others. These advantages are mainly owing to the proposed
high-speed 4:1 MUX and the compact interleaved-latching scheme. The comparison also
shows that the area of our transmitter is much larger than that of the design in [99],
which is mainly due to the following two reasons. Firstly, the area of our transmitter
refers to the whole chip including the core circuits, decoupling transistors, and
input/output pads, while the area in [99] only includes the core circuits. Secondly, the
transmitter in [99] is designed in a 14 nm process, whose feature size is much smaller
than that of the 65 nm process.

Table 4.1: PERFORMANCE SUMMARY OF THE TRANSMITTER

Reference                    [26]          [99]            [24]        This work
Technology (nm)              65            14              65          65
Supply (V)                   1.2           N/A             1.2         1.2
Data Rate (Gb/s)             60            16-40           50-64       5-50
Chip Area (mm²)              2.1 × 1.0     0.215 × 0.13    1.2 × 1.0   1.2 × 0.5
FFE                          N/A           4-tap           4-tap       4-tap
1UI-delay Gen.               N/A           Multi-MUX       LC-delay    Multi-MUX
MUX Type                     2:1           4:1             4:1         4:1
Data Jitter RJ (ps rms)      1.08@30Gb/s   0.33@28Gb/s     N/A         0.23@40Gb/s
                                           0.51@40Gb/s                 0.18@50Gb/s
Data Jitter TJ (ps)          N/A           10.72@28Gb/s    N/A         9.90@40Gb/s
  (BER = 10^-12)                           12.89@40Gb/s                10.58@50Gb/s
Power (mW)                   450           518             199         156
Energy Efficiency (pJ/bit)   7.5           12.9            3.1         3.1
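
The power and energy-efficiency figures quoted in this chapter can be cross-checked
with a few lines of Python. The block powers are those listed in Fig. 4.19 and the
power and data-rate pairs are those of Table 4.1; the assumption that each design is
evaluated at its maximum listed data rate is ours, although it reproduces the tabulated
values to within rounding.

```python
POWER_BREAKDOWN_MW = {
    "clocking": 73,
    "driver/FFE combiner": 43,
    "4:1 MUXs": 22,
    "latch array": 11,
    "PRBS generator": 7,
}


def energy_per_bit_pj(power_mw, data_rate_gbps):
    """mW divided by Gb/s gives pJ/bit (1e-3 W / 1e9 bit/s = 1e-12 J/bit)."""
    return power_mw / data_rate_gbps


if __name__ == "__main__":
    total_mw = sum(POWER_BREAKDOWN_MW.values())
    print("total power (mW):", total_mw)
    print("this work (pJ/bit):", round(energy_per_bit_pj(total_mw, 50.0), 2))
    # (power in mW, maximum data rate in Gb/s) taken from Table 4.1
    for ref, power_mw, rate in (("[26]", 450, 60), ("[99]", 518, 40), ("[24]", 199, 64)):
        print(ref, round(energy_per_bit_pj(power_mw, rate), 2))
```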

4.6 Chapter Summary

The quarter-rate transmitter with a 4-tap FFE is implemented in a 65 nm CMOS process.
The integration of a bandwidth-enhanced 4:1 MUX and an interleaved-retiming latch array
gives the transmitter both a low power consumption (3.1 pJ/bit) and a small area
(1.2 × 0.5 mm²). The measurement results show that the developed transmitter achieves a
maximum operation speed of 50 Gb/s with a total jitter of 10.58 ps after a 12 dB loss
channel. Owing to the multi-MUX-based FFE implementation, the transmitter can operate
down to 5 Gb/s, and thus a wide operation range is obtained.

Chapter 5

The Receiver Design

The main task of the receiver (RX) is to extract the originally transmitted data from
the received signal using appropriate equalization and clock data recovery (CDR)
techniques [69, 61, 70, 10]. This chapter presents a quarter-rate receiver operating at
40 Gb/s. It employs a two-stage continuous-time linear equalizer (CTLE) as the analog
front-end and integrates an improved CDR to extract the sampling clocks and retime the
incoming data. To automatically balance the jitter tracking and jitter suppression,
passive low-pass filters (LPFs) with adaptively adjusted bandwidth are introduced into
the data-sampling path, where the bandwidth control code is truncated from the frequency
code generated by the integral path of the digital LPF within the CDR loop. To optimize
the linearity of the phase interpolation, a time-averaging-based compensating phase
interpolator (PI) is proposed, which significantly reduces the differential nonlinearity
(DNL) and integral nonlinearity (INL) of the phase interpolation, thus improving the
phase-step and phase-spacing uniformity of the sampling clocks.
In the remainder of this chapter, we first discuss the design considerations of the
receiver, and then present the overall receiver chip and illustrate its main features.
After that, the architecture-level improvement of the CDR loop and the
linearity-optimized compensating PI are discussed in detail in two separate sections.
Finally, the experimental results are presented and discussed.


5.1 Design Considerations of the Receiver

5.1.1 Receiver Sensitivity

Receiver sensitivity is the minimum differential voltage level at which the receiver
can correctly differentiate between a “0” and a “1”. It is a function of the
input-referred noise, offset, minimum latch resolution, and bit error rate (BER)
requirement. It can be calculated by

$$V_{spp} = 2 V_{nrms} \cdot \mathrm{SNR} + V_{min} + V_{offset}, \qquad (5.1)$$

where $V_{spp}$ is the receiver sensitivity, $V_{nrms}$ denotes the equivalent input
random noise, SNR represents the signal-to-noise ratio, $V_{min}$ stands for the minimum
latch resolution, and $V_{offset}$ refers to the equivalent input offset. $V_{nrms}$
usually comes from the matching impedances, input amplifiers, and data slicers. The SNR
is determined by the BER requirement, e.g., SNR = 7 for a BER of $10^{-12}$. $V_{min}$
stems from the hysteresis, finite regeneration gain, and bounded noise sources.
Typically, its value is smaller than 5 mV. $V_{offset}$ is caused by circuit mismatches;
it is a strong function of the threshold voltage ($V_{th}$) mismatch and a weak function
of the electron mobility mismatch. Although enlarging the device area (4×) can reduce the
input offset (to 1/2), it is not feasible in practical designs due to the excessive area
occupation and power consumption. Instead, offset correction circuitry is usually
employed to reduce the input offset from a potentially large uncorrected value (>50 mV)
to near 1 mV.
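
To give a feel for the relative weight of the terms in Eq. (5.1), the sketch below
evaluates it as written in the text for two offset scenarios. The SNR factor of 7 and
the order of magnitude of the latch resolution and offsets come from the paragraph
above; the 1 mV input-referred noise is a hypothetical value chosen only for
illustration, and the function name is not from the thesis.

```python
def receiver_sensitivity_mv(v_n_rms_mv, snr, v_min_mv, v_offset_mv):
    """Evaluate Eq. (5.1): 2 * Vn_rms * SNR + Vmin + Voffset, all in millivolts."""
    return 2.0 * v_n_rms_mv * snr + v_min_mv + v_offset_mv


if __name__ == "__main__":
    V_N_RMS_MV = 1.0   # hypothetical input-referred rms noise
    SNR = 7.0          # value quoted in the text for BER = 1e-12
    V_MIN_MV = 5.0     # upper bound quoted for the latch resolution
    for label, offset_mv in (("offset corrected", 1.0), ("offset uncorrected", 50.0)):
        total = receiver_sensitivity_mv(V_N_RMS_MV, SNR, V_MIN_MV, offset_mv)
        print(label, total, "mV")
```

With these illustrative numbers an uncorrected offset would dominate the sensitivity,
which is consistent with the emphasis on offset correction above.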

5.1.2 CDR Bandwidth

The CDR bandwidth is one of the most important parameters in the CDR design, and it
involves a tradeoff among jitter tracking, jitter suppression, and jitter tolerance. A
narrow bandwidth provides strong input-jitter suppression and helps to reduce jitter
peaking, while a wide bandwidth enhances jitter tracking and jitter tolerance. To
suppress jitter amplification and accumulation in long-haul telecommunication systems,
a narrow bandwidth is usually specified (e.g., 120 kHz for optical carrier (OC)-192 in
synchronous optical network (SONET) [5]). To improve the jitter tracking ability in
chip-to-chip connections, a relatively wide bandwidth is frequently used (e.g., 10 MHz
for 32G Fibre Channel (FC) [160]). A wide CDR bandwidth also helps to suppress the VCO
phase noise, thus reducing the jitter of the sampling clocks (i.e., optimizing the jitter
generation), which ultimately helps to lower the link BER. Historically, the CDR
bandwidth in many SerDes protocols such as peripheral component interconnect express
(PCIe), InfiniBand, FC, and common electrical interface (CEI) has grown linearly with
the data rate and is usually defined as 1/1667 or 1/2500 of the data rate.
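The scaling rule above can be made concrete with a short calculation. This is a minimal sketch that applies the two divisors quoted in the text to the 40 Gb/s target of this work; the function name is an assumption, and the results are only what the rule of thumb would suggest, not a specification value.

```python
def cdr_bandwidth_hz(data_rate_bps, divisor):
    """Rule-of-thumb CDR bandwidth: data rate divided by a protocol-defined ratio."""
    return data_rate_bps / divisor

for divisor in (1667, 2500):
    bw = cdr_bandwidth_hz(40e9, divisor)
    print(f"40 Gb/s, data rate / {divisor}: {bw / 1e6:.1f} MHz")
```

For 40 Gb/s the rule gives roughly 24 MHz and 16 MHz, i.e., bandwidths in the tens of megahertz.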

5.1.3 Challenges within High-Speed CDR

As the data rate approaches the process limit, the short unit interval (UI) significantly
compresses the jitter budget of the RX-side CDR. An even smaller margin is left for
sampling position deviation, clock dithering, random and/or deterministic jitter, duty
cycle distortion, and spacing errors among different phases [23], which sets higher
standards for low-frequency jitter tracking, high-frequency jitter suppression, recovered
clock jitter generation, sampling clock duty cycle precision, and phase-spacing accuracy.
These requirements bring significant challenges to designing a high-performance CDR
[25, 23, 123], mainly for the following reasons. Firstly, the tightly coupled jitter
tolerance (JTOL) and jitter transfer (JTRAN) parameters make it difficult to design a
low JTRAN bandwidth to suppress the incoming jitter. Secondly, the limit-cycle dithering
caused by steady-state oscillation contributes a substantial amount of deterministic
jitter. Thirdly, the inevitable loop latency, together with a CDR bandwidth that is
proportional to the data rate, may degrade the system phase margin.

[Figure 5.1: block diagram. Clock path: half-rate clock input, clock conditioner, CML
DIV2 (CLK0/90/180/270), IDACs, and quarter-rate multiple PIs (CK0 to CK315 in 45° steps).
Data path: RX_P/RX_N, Rterm, T-coil, CTLE, data/edge samplers, DEMUX (D<3:0>, E<3:0> to
D<15:0>, E<15:0>), BBPD, and digital filter. FFE adaptation unit: EDC-SZF algorithm with
6-bit DACs producing Vbias,α-1, Vbias,α1, and Vbias,α2 for the TX-side FFE. Data/clock
testing circuits: 4:1 MUX, DIV2, MUX with CLK_SEL, and TX/CLK drivers.]

Figure 5.1: Block diagram of the receiver chip.

5.2 Receiver Architecture

5.2.1 Overall Architecture

Fig. 5.1 describes the block diagram of the receiver chip. It consists of a two-stage
CTLE, a quarter-rate CDR, a feed-forward equalizer (FFE) adaptation unit, and testing
circuits for measuring the recovered data and clock. The received signal is first
equalized by the CTLE and then sliced by eight quarter-rate data and edge samplers,
where the sampling clocks are generated by two quarter-rate compensating PIs and the
sampling positions are adjusted by a digital CDR using bang-bang phase detectors
(BBPDs). To support the high operation speed, the samplers, PIs, and clock/data buffers
are implemented in current-mode logic (CML) [10]. To relax the timing constraints, a
quarter-rate sampling scheme using multiple PIs is used to extend the slicer
regeneration time. The channel loss is compensated for by the TX-FFE and RX-CTLE, where
the TX-FFE is adaptively adjusted by the proposed edge-data correlation based sign
zero-forcing (EDC-SZF) algorithm (refer to Section 6.1) while the RX-CTLE is manually
calibrated.

[Figure 5.2: data and edge samplers on RX_P/RX_N clocked by CLKD and CLKE, DEMUXes, n
BBPDs with a voter, digital filter (KP and KI paths, frequency and phase integrators),
and a phase interpolator driven by CLK0/90/180/270; the phase code exhibits steady-state
oscillation.]

Figure 5.2: Conventional BBPD-based CDR.

5.2.2 Features of the Receiver

There are two main features in this receiver chip. One is the improved CDR ar-
chitecture, where passive LPFs with adaptively adjusted bandwidth are introduced into
the data-sampling path to automatically balance jitter tracking and jitter suppression for
data decisions. In doing so, the JTRAN bandwidth can be adjusted separately with-
out affecting the bandwidth of the JTOL. The other is the proposed compensating PI,
which not only improves the phase-step uniformity but also reduces the phase-spacing
drifting between edge and data sampling clocks.

5.3 Improved Digital CDR

5.3.1 Dithering Behavior in Digital CDR

Fig. 5.2 displays the conventional architecture of the BBPD-based CDR. Due to the
nonlinear behaviour and inevitable loop delay, the phase code applied to the PI usually
exhibits steady-state oscillation, which introduces substantial deterministic jitter by
rotating the PIs. This effect can become more severe as the data rate increases, because
the increased loop gain and the poorly scaling loop latency tend to cause a larger
limit-cycle oscillation amplitude. To attenuate this amplitude, a split-path CDR/DFE
architecture is proposed in [161], which employs a digital averaging technique to filter
the phase code for the separate data-sampling clocks. This approach can effectively
improve the JTOL amplitude at high frequencies, but the inevitable delay added by the
digital averaging block may make the sampling clocks drift away from their optimal
positions, thus degrading the maximum tolerable amplitude at low frequencies.

[Figure 5.3: block diagram. Data samplers (×4) and edge samplers (×4) on RX_P/RX_N, two
4:16 DEMUXes, 16 BBPDs with a voter, digital loop filter (KP and KI paths, frequency and
phase integrators), PHA<8:0> and PHB<8:0> (offset by 64 steps, half of the quadrant
steps), ABS and limiter producing DF<2:0>, two IDACs, current mirrors (8 cells) with
LPFs for the data-sampling clock, and compensating PI1 and PI2 driven by CLK0/90/180/270
to produce CK0 to CK315.]

Figure 5.3: Block diagram of the modified CDR architecture.

5.3.2 Architecture Improvement

Fig. 5.3 shows the block diagram of the improved CDR. It employs separate PI1 and PI2
to produce the two sets of 45°-spaced clocks for data sampling and edge sampling, where
passive LPFs are introduced into the clock branch for data sampling to provide extra
jitter suppression on the data-sampling clocks. The bandwidth of these LPFs is
adaptively adjusted by a common control code DF<2:0>, which is the absolute value of the
truncated frequency code generated by the integral path of the digital loop filter. In
this design, the minimum LPF bandwidth is about 4 MHz and the maximum is around 50 MHz.
In addition, a limiter sets DF<2:0> to its maximum value when the frequency code becomes
too large. In principle, a large frequency code indicates continuous phase slewing to
track accumulated jitter, so a wide bandwidth is chosen to improve the jitter tracking
ability. Conversely, a small frequency code implies that there is little trackable
jitter, so a narrow bandwidth is selected to suppress the high-frequency jitter.
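The bandwidth-selection rule described above can be sketched behaviourally as follows. This is only an illustrative model, not the circuit: the number of truncated bits (`shift`) and the linear mapping from DF<2:0> to bandwidth are assumptions, since the text specifies only the 3-bit code width, the absolute-value and limiter operations, and the 4 MHz and 50 MHz end points.

```python
DF_MAX = 7  # largest value of the 3-bit control code DF<2:0>

def bandwidth_code(freq_code, shift=4):
    """Return DF<2:0> from the signed frequency code of the integral path.

    `shift` (the number of low-order bits dropped) is a hypothetical value.
    """
    magnitude = abs(freq_code) >> shift   # take the magnitude, drop low-order bits
    return min(magnitude, DF_MAX)         # limiter: saturate instead of wrapping

def lpf_bandwidth_mhz(df, bw_min=4.0, bw_max=50.0):
    """Map DF<2:0> to an LPF bandwidth; the linear mapping is an assumption."""
    return bw_min + (bw_max - bw_min) * df / DF_MAX

for code in (0, 40, -1000):
    df = bandwidth_code(code)
    print(f"frequency code {code:6d} -> DF = {df}, LPF bandwidth = {lpf_bandwidth_mhz(df):.1f} MHz")
```

A small frequency code therefore selects the narrow (jitter-suppressing) setting, while a large code of either sign saturates at the wide (jitter-tracking) setting.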
For the implementation, 16 BBPDs followed by a majority voter generate a 5-bit signed
phase error, which is filtered by a digital loop filter consisting of a proportional
path and an integral path to produce a 14-bit output. The top 9 bits are applied to a
12-bit phase integrator, whose output is then truncated to form the phase code
PHA<8:0>. PHA<8:0> is further circularly incremented by 64 steps (half of the phase
steps in a quadrant) to obtain PHB<8:0>. These phase codes are applied to two current
digital-to-analog converters (IDACs) to produce 8 paths of weighted currents, which are
fed into a current mirror array consisting of 8 identical slots. As shown in Fig. 5.3,
each slot generates two branches of current: one is directly mirrored for the
edge-sampling PI2, while the other is mirrored through an LPF for the data-sampling PI1.
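The phase-code generation described above can be sketched as follows. This is a minimal behavioural sketch: it assumes that the truncation keeps the most significant 9 bits of the 12-bit integrator, and the function and constant names are illustrative rather than taken from the design.

```python
PHASE_INT_BITS = 12   # width of the phase integrator
PHASE_CODE_BITS = 9   # width of PHA<8:0> and PHB<8:0> (512 codes per 360 degrees)
HALF_QUADRANT = 64    # half of the 128 steps in one quadrant, i.e. 45 degrees

def phase_codes(phase_integrator):
    """Derive (PHA, PHB) from the phase integrator value."""
    acc = phase_integrator % (1 << PHASE_INT_BITS)         # integrator wraps at 12 bits
    pha = acc >> (PHASE_INT_BITS - PHASE_CODE_BITS)        # assumed: keep the top 9 bits
    phb = (pha + HALF_QUADRANT) % (1 << PHASE_CODE_BITS)   # circular +64 offset
    return pha, phb

print(phase_codes(0))      # PHB leads PHA by 64 steps
print(phase_codes(4095))   # PHB wraps around past code 511
```

With 512 codes per clock period, the 64-step offset corresponds to 64 × 360°/512 = 45°, which matches the 45° spacing between the two clock sets.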

5.3.3 Behavior of the Improved CDR

The working principle of the BBPD is illustrated in Fig. 5.4(a). Because the data
sample taken at the center of the eye serves as the reference for judging whether the
edge sample leads or lags the input data transition, there should be sufficient margin
for the data sampling. Accordingly, the outputs of the data samplers show a fairly low
sensitivity to phase errors in normally operating CDRs, which means that further jitter
suppression on the data-sampling clocks has little effect on the loop parameters for
jitter tracking. Leveraging this characteristic of the BBPD, we introduce LPFs into the
data-sampling path to further filter the output jitter while keeping the loop
parameters unchanged to satisfy the jitter tolerance specification. Fig. 5.4(b)
presents the small-signal model of the modified CDR, where the LPF located outside the
feedback loop provides additional jitter suppression for the data-sampling clocks [see
Fig. 5.4(c)]. Therefore, the dithering jitter caused by the limit-cycle oscillation can
be effectively attenuated. The noise sources are also depicted in Fig. 5.4(b),
including the input noise ($S_{IN}$), the quantization noise ($S_{QBB}$) of the BBPD,
truncation noise I ($S_{TF}$) due to the finite resolution of the integral path,
truncation noise II ($S_{TD}$) due to the limited resolution of the IDAC, and the
nonlinearity noise ($S_{PI1}$, $S_{PI2}$) of the PIs. Fig. 5.4(c) displays the transfer
function characteristics for these noise sources. It can be seen that the introduced
LPFs dramatically attenuate the remaining band-frequency and high-frequency components
from $S_{TF}$ and $S_{TD}$. The low-frequency components of $S_{IN}$, $S_{PI2}$, and
$S_{QBB}$ can be further reduced by these LPFs when lower bandwidths are employed.
Simultaneously, the potential jitter peak can be suppressed to alleviate the jitter
amplification problem.

[Figure 5.4: (a) BBPD timing with data sample D<n>, edge sample E<n>, and data sample
D<n+1>, and the early (E) and late (L) regions; (b) linearized model with gains K_PD,
K_P, K_I, K_DA, and K_PI, loop delay z^-N, the LPF on the data-clock branch, and the
noise injection points; (c) jitter transfer functions H_A(f), H_B(f), H_C(f), and
H_LPF(f) with corner frequencies f_0 and f_L < f_0.]

Figure 5.4: Functional view of the introduced LPFs. (a) Principle of the BBPD, (b)
linearized CDR model, and (c) jitter transfer functions.
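Because the LPF sits outside the feedback loop, its response simply multiplies the jitter transfer of the loop as seen by the data-sampling clock. The sketch below illustrates this under the assumption of a first-order low-pass response (the filter order is not stated in the text); the function names and the 100 MHz evaluation frequency are illustrative.

```python
import math

def lpf_gain_db(freq_hz, bandwidth_hz):
    """Magnitude of an assumed first-order low-pass filter, in dB."""
    return -10.0 * math.log10(1.0 + (freq_hz / bandwidth_hz) ** 2)

def data_clock_jtran_db(edge_jtran_db, freq_hz, bandwidth_hz):
    """Jitter transfer to the data-sampling clock.

    The LPF is outside the loop, so its gain adds (in dB) to the loop's jitter
    transfer observed on the edge-sampling clock.
    """
    return edge_jtran_db + lpf_gain_db(freq_hz, bandwidth_hz)

# Extra attenuation at 100 MHz for the two bandwidth limits quoted in the text.
for bw in (4e6, 50e6):
    print(f"LPF bandwidth {bw / 1e6:.0f} MHz: {lpf_gain_db(100e6, bw):.1f} dB at 100 MHz")
```

The narrow setting gives far more high-frequency attenuation than the wide one, which is the tradeoff the adaptive bandwidth control exploits.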

5.4 Compensating Phase Interpolator

The nonlinearity of phase interpolation can seriously degrade the overall performance
of the CDR. Specifically, the differential nonlinearity (DNL) produces phase jumps much
larger than the ideal step, which are directly converted into recovered clock jitter.
The integral nonlinearity (INL) can make the data-sampling clocks drift away from their
optimal decision points in quarter-rate architectures using multiple PIs [23]. To
reduce the PI nonlinearity, finely weighted current sources have been adopted in [115].
Unfortunately, the non-uniformity of the tail current sources gives rise to a
fluctuating common-mode output, which may distort the phase-interpolated clocks through
common-mode to differential-mode conversion. Moreover, its performance is also subject
to the input waveform shape and fabrication mismatches. Another commonly adopted
approach is the octagonal PI [122], which needs eight 45°-spaced clock phases to perform
the phase interpolation. Correspondingly, it requires a complex phase rotator and
phase-control circuits to generate the octagonal phase constellation. Note that even
the octagonal PI still exhibits nonlinearity in theory. As a consequence, new techniques
that can improve the linearity of the phase interpolation are still in high demand.

[Figure 5.5: (a) PHA<8:0> and PHB<8:0> (offset by 64 steps, half of the quadrant steps)
drive two IDACs and current mirrors, producing IA0/90/180/270 for the conventional PIA
(outputs CKA0/90/180/270) and IB45/135/225/315 for the conventional PIB (outputs
CKB45/135/225/315), both fed by CKI and CKQ; (b) TAs combining CKA and CKB clocks into
CK0/180 and CK90/270; (c) TAs combining CKB and CKA clocks into CK45/225 and CK135/315.]

Figure 5.5: Proposed compensating PI. (a) Quarter-rate 45°-spaced clock generation,
(b) in-phase I, Q clock generation for the data sampling, and (c) 45° phase-shifted I, Q
clock generation for the edge sampling.

[Figure 5.6: (a) quadrature PI with 250 ohm loads, an output buffer, clock inputs
CKIP/CKIN and CKQP/CKQN, and tail currents I0, I180, I90, and I270; (b) TA with 250 ohm
loads, inputs IP1/IN1 and IP2/IN2, and BIAS-controlled tail currents.]

Figure 5.6: Details of (a) quadrature PI and (b) TA.

5.4.1 Implementation Details

Fig. 5.5 shows the conceptual block diagram of the compensating PI. It employs two
conventional PIs (PIA and PIB) with phase codes spaced by half a quadrant (PHA<8:0> and
PHB<8:0>) to produce two sets of 45°-spaced clocks (CKA0-90-180-270 and
CKB45-135-225-315) [see Fig. 5.5(a)]. The two sets of clocks are then applied to four
time-averaging (TA) circuits [see Fig. 5.5(b) and (c)] to generate the final data and
edge sampling clocks. Specifically, the data-sampling clocks (CK0-90-180-270) are
obtained by averaging CKA0-90-180-270 and CKB45-135-225-315, while the edge-sampling
clocks (CK45-135-225-315) are obtained by averaging CKA90-180-270-0 and
CKB45-135-225-315. Fig. 5.6 further displays the schematic details of the quadrature PI
and TA, which are implemented in CML style. The simulation also shows that the
additional PI and TAs in each compensating PI consume around 10 mW, which accounts for
50% of the power of the compensating PI.

5.4.2 Linearity Analysis

Approximating the input-clock waveform by a sinusoid, the quadrature input clocks can
be expressed as

$$CK_I = A\sin(2\pi f t), \qquad CK_Q = A\cos(2\pi f t), \tag{5.2}$$

where $A$ and $f$ are the amplitude and frequency of the input clock. The output of the
traditional PIA can then be calculated as

$$CK_{PIA} = (1-\alpha)A\sin(2\pi f t) + \alpha A\cos(2\pi f t) = A_{PIA}\sin(2\pi f t + \theta_{PIA}), \tag{5.3}$$

$$\theta_{PIA} = \arctan\left(\frac{\alpha}{1-\alpha}\right), \tag{5.4}$$

$$A_{PIA} = \sqrt{\alpha^2 + (1-\alpha)^2}\,A, \tag{5.5}$$
where $\alpha$ is the ratio of the current phase code to the total number of phase steps,
so that $0 \le \alpha \le 1$. Similarly, the equations for PIB can be written as

$$CK_{PIB} = A_{PIB}\sin(2\pi f t + \theta_{PIB}), \tag{5.6}$$

where

$$\theta_{PIB} = \begin{cases} \arctan\left(\dfrac{\alpha+1/2}{1/2-\alpha}\right), & 0 \le \alpha \le 1/2, \\[2mm] \arctan\left(\dfrac{\alpha-1/2}{3/2-\alpha}\right) + \dfrac{\pi}{2}, & 1/2 \le \alpha \le 1, \end{cases} \tag{5.7}$$

$$A_{PIB} = \begin{cases} \sqrt{(\alpha+1/2)^2 + (1/2-\alpha)^2}\,A, & 0 \le \alpha \le 1/2, \\[2mm] \sqrt{(\alpha-1/2)^2 + (3/2-\alpha)^2}\,A, & 1/2 \le \alpha \le 1. \end{cases} \tag{5.8}$$

[Figure 5.7: phase (degree, 0 to 135) versus phase code (0 to 128) for PIA with
θ_A = arctan(α/(1−α)), for PIB with θ_B as in Eq. 5.7 (including the +π/2 term for
1/2 ≤ α ≤ 1), and for the compensating PI with (θ_A + θ_B)/2. Points E and F mark an
8.1° deviation from the nominal 45° spacing between PIA and PIB.]

Figure 5.7: Phase transfer characteristics based on trigonometric-function
approximation.
Previous studies [23, 103] have demonstrated that this 45°-spaced clock generation can
be directly used in CDR designs. However, the nonlinearity of the traditional PI could
significantly degrade the performance of the CDR. To gain more insight into this issue,
the red dashed and blue dotted lines in Fig. 5.7 present the phase transfer curves
according to Eq. 5.4 and Eq. 5.7, respectively. Clearly, the phase transfer curves of
the traditional PIA and PIB are S-shaped. When PIA rotates to point E and PIB rotates to
point F, the deviation of their phase spacing from 45° reaches a maximum of 8.1° (or
0.09 UI). For designs that directly use these phases as the sampling clocks [23, 103],
the edge-sampling clocks tightly track the edge transitions in the received data stream,
so any phase-spacing variation between the edge-sampling and data-sampling clocks makes
the data-sampling clocks drift away from the expected decision point. As a result, the
data decision margin is reduced, which directly degrades the CDR performance. Moreover,
improving the PI resolution cannot mitigate this effect, since finer step weights do not
change the shape of the phase transfer characteristics.

[Figure 5.8: (a) delay (ps, 0 to 125) versus phase code (0 to 512) for the conventional
PIA, the conventional PIB, and the compensating PI; (b) DNL (LSB) and (c) INL (LSB)
versus phase code for the conventional PIA and the compensating PI, with
1 LSB = 0.703125°.]

Figure 5.8: Simulation results of the phase compensating PI. (a) Simulated phase
transfer characteristics, (b) DNL performance, and (c) INL performance.
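The phase-spacing drift between PIA and PIB discussed above can be checked numerically from the ideal sinusoidal model. The sketch below is an illustration only: it evaluates the interpolated phases with `atan2` on the I and Q weights (which keeps the PIB phase continuous through 90°), assumes 128 phase steps per quadrant, and uses illustrative function names.

```python
import math

STEPS_PER_QUADRANT = 128  # 512 phase codes per 360 degrees

def theta_pia_deg(alpha):
    """Eq. (5.4): output phase of PIA for 0 <= alpha <= 1."""
    return math.degrees(math.atan2(alpha, 1.0 - alpha))

def theta_pib_deg(alpha):
    """Output phase of PIB, whose phase code leads PIA by half a quadrant."""
    q_weight = alpha + 0.5 if alpha <= 0.5 else 1.5 - alpha
    i_weight = 0.5 - alpha  # turns negative once PIB enters the next quadrant
    return math.degrees(math.atan2(q_weight, i_weight))

def spacing_error_deg(alpha):
    """Deviation of the PIA-to-PIB spacing from the nominal 45 degrees."""
    return theta_pib_deg(alpha) - theta_pia_deg(alpha) - 45.0

alphas = [k / STEPS_PER_QUADRANT for k in range(STEPS_PER_QUADRANT + 1)]
worst = max(abs(spacing_error_deg(a)) for a in alphas)
# One UI spans 90 degrees of the quarter-rate clock.
print(f"worst spacing error: {worst:.2f} deg = {worst / 90.0:.3f} UI")
```

Under this idealized model the worst-case spacing error occurs a quarter of the way through the quadrant and is close to the 8.1° (0.09 UI) quoted in the text.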
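Referring to the time-averaging effect of the TA, the output phase of the compensating
PI can be expressed as

$$\theta_{CPI} = \frac{1}{2}\begin{cases} \arctan\left(\dfrac{\alpha}{1-\alpha}\right) + \arctan\left(\dfrac{\alpha+1/2}{1/2-\alpha}\right), & 0 \le \alpha \le 1/2, \\[2mm] \arctan\left(\dfrac{\alpha}{1-\alpha}\right) + \arctan\left(\dfrac{\alpha-1/2}{3/2-\alpha}\right) + \dfrac{\pi}{2}, & 1/2 \le \alpha \le 1. \end{cases} \tag{5.9}$$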

The black solid line in Fig. 5.7 displays the phase transfer curve of the compensating
PI according to Eq. 5.9, which indicates that a more linear phase transfer curve with
negligible phase deviations of about 0.17° can be achieved. This is mainly because the
phase transfer curves of PIA and PIB compensate each other. In contrast to the
theoretical analysis, the practical linearity could be degraded by the transistors'
inherent nonlinearity and the nonideal input clock waveform. Fig. 5.8 shows the
transistor-level simulation results of the compensating PI. It can be seen that the
maximum DNL and INL of the compensating PI are significantly improved over the
traditional PI, where the INL can be kept below 2.5 LSB (or 1.8°), which is only a
quarter of that of the conventional PI.
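The compensating effect of averaging the two phase transfer curves can be illustrated with the same ideal sinusoidal model. This is a sketch under that idealization only (it ignores transistor nonlinearity and waveform shape); the 128 steps per quadrant and the function names are assumptions for illustration.

```python
import math

STEPS_PER_QUADRANT = 128  # 512 phase codes per 360 degrees

def theta_pia_deg(alpha):
    """Eq. (5.4): output phase of PIA for 0 <= alpha <= 1."""
    return math.degrees(math.atan2(alpha, 1.0 - alpha))

def theta_pib_deg(alpha):
    """Output phase of PIB, whose phase code leads PIA by half a quadrant."""
    q_weight = alpha + 0.5 if alpha <= 0.5 else 1.5 - alpha
    return math.degrees(math.atan2(q_weight, 0.5 - alpha))

def theta_cpi_deg(alpha):
    """Eq. (5.9): time averaging yields the mean of the two phases."""
    return 0.5 * (theta_pia_deg(alpha) + theta_pib_deg(alpha))

alphas = [k / STEPS_PER_QUADRANT for k in range(STEPS_PER_QUADRANT + 1)]
# Ideal straight lines: 0 to 90 degrees for PIA, 22.5 to 112.5 degrees for the average.
max_dev_pia = max(abs(theta_pia_deg(a) - 90.0 * a) for a in alphas)
max_dev_cpi = max(abs(theta_cpi_deg(a) - (22.5 + 90.0 * a)) for a in alphas)
print(f"PIA: {max_dev_pia:.2f} deg, compensating PI: {max_dev_cpi:.3f} deg")
```

In this model the conventional PIA deviates from the ideal line by roughly 4°, whereas the averaged phase stays within a few tenths of a degree, close to the 0.17° figure quoted above.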

5.5 Experimental Results

5.5.1 Tools and Fabrication Process

The receiver is designed using a Dell R730 server with two E5-2609V4 CPUs, 128 GB of
memory, and an 8 TB hard disk. The schematic, layout, and simulation are carried out
with Schematic Composer, Virtuoso Layout, and Spectre/APS, respectively, all developed
by Cadence (version IC5141). The layout verification and parasitic extraction are
carried out by layout versus schematics (LVS)/design rule check (DRC) and parasitic
extraction (PEX) using Calibre 2013 developed by Mentor Graphics. To measure the
fabricated prototype, an Anritsu MP1812A is used to generate the 40 Gb/s input data by
combining four 10 Gb/s PRBS7 sequences, a Tektronix BSA286C is used to characterize the
CDR performance, and a Keysight DSA-X 93204A with an 80 GS/s sampling rate and a 32 GHz
bandwidth is used to characterize the jitter performance of the output waveforms.
The prototype chip is designed and fabricated in a 65 nm process. Under a typical
corner, the cut-off frequency ($f_T$) of the NMOS transistor and the fan-out-of-4
inverter delay in this process are 200 GHz and 13 ps, respectively. This implies that
the 65 nm process provides enough bandwidth and timing margin for the targeted 40 Gb/s
receiver design. Although more advanced processes with smaller minimum channel lengths,
such as 45 nm, 32 nm, 22 nm, and 16 nm, can offer higher $f_T$ and shorter inverter
delay, their high prices put them out of our reach. Fortunately, our receiver mainly
focuses on the CDR architecture improvement and the high-linearity compensating PI
implementation, which can still be verified in the economical and practical 65 nm CMOS
process.

[Figure 5.9: chip layout, 1.6 mm × 1.2 mm, showing the terminals, CTLE, CDR, EDC-SZF
unit, clock conditioner, CLK driver, full-rate driver, and quarter-rate driver.]

Figure 5.9: Layout view of the whole receiver chip.

5.5.2 Layout and Simulation Results

5.5.2.1 Layout Designs

Fig. 5.9 displays the layout view of the whole receiver chip. The Terminals, CTLE,
and CDR located at the top side of the chip are the core blocks of the receiver. They
are placed in a line to guarantee the layout symmetry and reduce the parasitic effect on
the high-speed signals. The clock conditioner is placed close to the PI to facilitate the
high-speed clock connection. The full-rate driver, clock driver, and quarter-rate driver
are placed at the bottom side to output the measurement signals.

Figure 5.10: Layout views of the (a) Terminals+CTLE and (b) CDR.

Figure 5.11: Layout views of the crucial blocks within the CDR. (a) Samplers, (b)
compensating PI, and (c) digital loop filter.

Figure 5.12: Simulation setup of the CDR. A PRBS generator is used to produce the
40 Gb/s input data with 5 ps peak-to-peak jitter, a clock generator is utilized to
produce the 20 GHz input clock with a 1 UI amplitude sinusoidal jitter at 500 kHz, the
output data refers to the input data at the samplers, the output clock is the recovered
data-sampling clock, the output biasa represents the current mirror bias for the
0°-phase before the LPF, and biasb stands for the current mirror bias for the 0°-phase
after the LPF.

Fig. 5.11 gives the layout views of the blocks in the CDR. For the samplers [see
Fig. 5.11(a)], multi-layer inductors are used in the first latch to extend its bandwidth.
For the compensating PI [see Fig. 5.11(b)], the inductors are removed to reduce the
area occupation. For the digital loop filter [see Fig. 5.11(c)], it is designed based on
the standard cells provided by the foundry.

5.5.2.2 Effect of the Adaptively-Adjusted Bandwidth

To validate the effect of the adaptively-adjusted bandwidth, simulations are performed
based on the simulation setup in Fig. 5.12. Fig. 5.13 displays the filtering effect on
the current mirror bias for the 0°-phase and the jitter performance of the data-sampling
clock with different LPF bandwidths, where the eye-diagrams are overlapped from 0.9 µs
to 2.1 µs. For the simulated diagrams with a bandwidth of 4 MHz in Fig. 5.13(a), the
high-frequency ripples on the bias are significantly suppressed by the LPF. However, the
dithering jitter of the data-sampling clock reaches 7.54 ps, which is much larger than
that of the edge-sampling clock without the LPF (3.04 ps). This means that the CDR
performance actually deteriorates, mainly because of the prominent phase shift caused by
the LPF delay. As the bandwidth increases, the delay-caused phase shift becomes smaller,
so the dithering jitter of the sampling clock shows a descending trend [see Fig. 5.13(b)
and (c)]. For the bandwidth fixed at 50 MHz, the dithering jitter of the data-sampling
clock (2.66 ps) becomes smaller than that of the edge-sampling clock (3.04 ps). This
implies that the jitter improvement contributed by the bias-ripple suppression outweighs
the penalty of the delay-caused phase shift. Based on the above discussion, adopting a
fixed bandwidth is inadvisable, since a low bandwidth suffers from delay-caused phase
shift while a high bandwidth exhibits limited jitter suppression. Fig. 5.13(d) presents
the simulation results when utilizing the proposed bandwidth-adaptively-adjusting
technique, where the low dithering jitter is achieved by balancing the bias tracking and
ripple suppression. The high-frequency ripple in the slow input-jitter changing region
[circled region in Fig. 5.13(d)] is effectively attenuated, while the phase variations
in the fast input-jitter changing region [surrounding region in Fig. 5.13(d)] are
tightly tracked.

[Figure 5.13: for each LPF bandwidth, the current mirror bias voltage (mV) versus time
(µs) and the eye-diagram of the data-sampling clock (amplitude in V versus time in s).
Dithering jitter: (a) 4 MHz, 7.54 ps, with prominent delay; (b) 20 MHz, 4.04 ps; (c)
50 MHz, 2.66 ps, with slight attenuation; (d) adaptively adjusting, 2.62 ps, with tight
tracking and significant attenuation.]

Figure 5.13: Effect of the LPFs with a bandwidth of (a) 4 MHz, (b) 20 MHz, (c) 50
MHz, and (d) adaptively-adjusting.

[Figure 5.14: transient waveforms of the recovered clock jitter (delay, ps) for the
edge-sampling clock without LPFs and the data-sampling clock with LPFs, the injected
jitter on the input clock (frequency = 500 kHz, amplitude = 1 UI), and the bandwidth
control code (decimal), with the jitter-tracking (large bandwidth) and
jitter-suppression (small bandwidth) regions marked.]

Figure 5.14: Properties of the adaptive-bandwidth jitter suppression.

To further explore the bandwidth-adaptively-adjusting process, Fig. 5.14 gives the
transient simulation waveforms. For the fast input jitter changing region (jitter
tracking region), a large frequency code is accumulated in the frequency integrator (see
Fig. 5.3), thus a high bandwidth control code DF<2:0> for the LPFs can be obtained (see
the bottom waveform in Fig. 5.14). As a result, the data-sampling clocks can tightly
track the edge-sampling clocks to avoid data-sampling lagging. For the slow input jitter
changing region (jitter suppression region), the frequency code becomes small and so
does the bandwidth control code DF<2:0>. Correspondingly, the bandwidth of the LPFs
decreases, thus exhibiting a prominent jitter suppression effect. Owing to the proposed
adaptive bandwidth-adjusting scheme, the jitter suppression and jitter tracking can be
automatically balanced in this CDR. Overall, this automatic bandwidth selection
technique makes it possible to use a low bandwidth to significantly suppress the
high-frequency jitter while exhibiting little effect on the low-frequency jitter
tracking ability.

[Figure 5.15: for each input pattern, the current mirror bias voltage (mV) versus time
(µs), showing tight tracking and significant attenuation, and the eye-diagram of the
recovered clock (amplitude in V versus time in s). Dithering jitter: (a) PRBS7, 2.62 ps;
(b) PRBS15, 2.74 ps; (c) PRBS23, 3.13 ps; (d) PRBS31, 3.30 ps.]

Figure 5.15: Effect of different input patterns on jitter attenuation. (a) PRBS7, (b)
PRBS15, (c) PRBS23, and (d) PRBS31.

5.5.2.3 Effect of Different Input Patterns

To demonstrate the jitter suppression effect with different patterns, we have repeated
the simulations with the adaptively-adjusted bandwidth using the setup shown in Fig.
5.12. As depicted in Fig. 5.15, when the input pattern changes from PRBS7 to PRBS15,
PRBS23, and PRBS31, the jitter performance of the recovered clock becomes slightly
worse. This is because the increased run-length of “1s” or “0s” extends the wandering
time of the CDR loop, thus causing a larger amplitude of steady-state oscillation and
hence increasing the deterministic jitter. Additionally, the high-frequency jitter
suppression effect becomes more prominent as the maximum run-length of the input pattern
increases (see the voltage ripple attenuation in Fig. 5.15).

[Figure 5.16: (a) chip micrograph, 1.6 mm × 1.2 mm, showing the terminals, CTLE, CDR,
EDC-SZF unit, clock conditioner, CLK driver, full-rate driver, and quarter-rate driver;
(b) power breakdown with segments of 30 mW, 36 mW, and 159 mW among the clock
conditioner, CTLE, and CDR (with DEMUX); total power = 225 mW.]

Figure 5.16: (a) Chip micrograph and (b) power breakdown of the receiver.

5.5.3 Chip Fabrication and Measurement Results

5.5.3.1 Chip Fabrication and Power Consumption

The prototype receiver chip is fabricated in a 65-nm CMOS process. Fig. 5.16 illustrates
its micrograph and power breakdown when applying a 1.2 V supply and operating at
40 Gb/s. The receiver chip occupies 1.92 mm² (including the testing circuits) and
dissipates 225 mW (excluding the testing circuits). The fabricated chip is mounted on a
printed circuit board (PCB) through wire-bonding. The receiver inputs and outputs are
connected to the instruments through double-bonding wires, PCB traces, and connection
cables.

[Figure 5.17: measured eye-diagrams. (a) 117 mV, 5 ps scales; RJ: 260 fs, DJ: 3.63 ps,
TJ: 7.31 ps. (b) 149 mV, 20 ps scales; RJ: 450 fs, DJ: 6.38 ps, TJ: 12.73 ps. (c)
100 mV, 20 ps scales; RJ: 490 fs, DJ: 4.45 ps, TJ: 11.48 ps. (d) 100 mV, 20 ps scales;
RJ: 450 fs, DJ: 1.18 ps, TJ: 7.66 ps.]

Figure 5.17: Measured eye-diagrams for (a) input data at 40 Gb/s, (b) recovered data at
10 Gb/s, (c) recovered edge-sampling clock without LPFs at 5 GHz, and (d) recovered
data-sampling clock with LPFs at 5 GHz.

5.5.3.2 Measurement Results

The standalone measurement results of the receiver are presented in this part. Fig.
5.17(a) shows the eye-diagram of the 40 Gb/s input data, where the single-ended eye
height and eye width are around 410 mV and 0.71 UI. Fig. 5.17(b) presents the
eye-diagram of the 10 Gb/s recovered data with a total jitter of 12.73 ps. The
eye-diagrams of the recovered clocks (divided by 2) for the edge sampling and data
sampling are shown in Fig. 5.17(c) and (d), which reveal that the introduced LPFs reduce
the total jitter from 11.48 ps to 7.66 ps. To demonstrate the effect of the LPFs with
adaptively-adjusting bandwidth, the JTRAN and JTOL curves are measured using a Tektronix
BSA286C with a CDR block. The input peak-to-peak swing is tuned to 800 mV and the
control voltage of the CTLE is manually set to 710 mV. The JTRAN curves in Fig. 5.18
illustrate that the bandwidth of the data-sampling path, which depends on the LPFs, is
4 MHz, much smaller than the 18 MHz of the edge-sampling path determined by the loop
parameters. The measured JTOL in Fig. 5.18 indicates that the embedded LPFs
significantly attenuate the dip around the corner frequency and noticeably improve the
JTOL amplitudes at high jitter frequencies. Meanwhile, the adaptively-adjusting
bandwidth of the LPFs means that they have little effect on the phase-tracking slew rate
at low jitter frequencies. Additionally, the corner frequency of the JTOL is about
20 MHz, which is much larger than the JTRAN bandwidth of 4 MHz.

[Figure 5.18: JTRAN (dB, 0 to -12) and JTOL (UIpp) versus jitter frequency (MHz, 0.1 to
100) with an input jitter amplitude of 0.20 UI. Edge-sampling clock without LPFs: BW
18 MHz; data-sampling clock with LPFs: BW 4 MHz; the JTOL is improved by the introduced
LPFs.]

Figure 5.18: Measured JTRAN and JTOL with PRBS7 at 28 Gb/s.

5.5.4 Performance Comparison

Table 5.1 compares the performance of our receiver with previous studies. It can be
seen that the maximum tolerable amplitude of sinusoidal jitter at high frequency (0.41
UI@100 MHz) outperforms the other two, which is mainly because of the introduced LPFs
and the developed compensating PI. This parameter is important because it directly
indicates that our receiver can relax the timing budget of the link and hence improve
the communication BER. Meanwhile, it is worth noting that the area occupation and power
consumption of our receiver are larger than those of the design presented in [123]. This
is mainly because of the following two reasons. One is that the area occupation and the
power consumption in [123] are measured based on the core circuits, while these two
parameters in our design are measured based on the whole chip, including the core
circuits, testing circuits, decoupling transistors, and connection pads. The other is
that the receiver in [123] is implemented in a 22 nm process, which naturally offers
smaller area and lower power consumption. If our receiver were also implemented in such
an advanced process, it should be able to operate at a higher data rate with a smaller
area occupation and a lower power consumption than the one implemented in the 65 nm
process.

Table 5.1: PERFORMANCE SUMMARY OF THE RECEIVER

                          [23]             [123]           This work
Technology (nm)           28               22              65
Supply (V)                1.1/0.85         1.07            1.2
Data Rate (Gb/s)          40               4-32            40
Multi-phase Clock Gen.    DLL+PIs          MCDLL+PIs       DIV2+PIs
Jitter Suppression        Split-path CDR   N/A             Adaptive-BW LPFs
JTOL Amplitude (UI)       0.2@80MHz        0.2@40MHz       0.41@100MHz
JTOL Bandwidth (MHz)      10               20∗             20
Chip Area (mm²)           0.81/lane∗∗      0.079/lane      1.92
Power (mW)                630∗†            79.64††         225

∗ Estimated from jitter tolerance results. ∗∗ Area of whole transceiver.
∗† Including FFE+DFE equalization. †† Including RX-FFE.

5.6 Chapter Summary

This chapter presents a 40 Gb/s receiver with excellent performance in both jitter
suppression and jitter tracking, where a compensating PI is designed to alleviate the
issues of non-uniform phase steps and I/Q phase-spacing drift. Moreover, the
introduced LPFs with adaptively adjusted bandwidth provide additional high-frequency
jitter attenuation for the data-sampling clocks, while leaving the edge-sampling clocks
unfiltered to maintain a high jitter-tracking capacity.

Chapter 6

Overall Serial Link and Adaptive Equalization

To overcome the channel loss within stringent power and area budgets, a sophisticated
equalization design is needed that compensates for the loss while limiting the power
and area overheads. Based on the transmitter (TX) and receiver (RX) chips designed in
the previous two chapters, this chapter constructs a chip-to-chip connection in which
the output of the transmitter chip and the input of the receiver chip are DC-connected
over a 12-cm printed circuit board (PCB) trace. A combined TX-side feed-forward
equalizer (FFE) and RX-side continuous-time linear equalizer (CTLE) is adopted to
compensate for the channel loss. The control voltage of the RX-CTLE is manually
calibrated, while the tap weights of the TX-FFE are automatically adjusted by a newly
developed edge-data correlation-based sign zero-forcing (EDC-SZF) adaptation engine
located at the RX-side.
In the rest of this chapter, Section 6.1 illustrates the equalization scheme employed
in the serial link. Section 6.2 presents the proposed EDC-SZF adaptation: it begins by
summarizing the drawbacks of previous adaptation techniques and then presents the
update iteration and the derivation of the proposed EDC-SZF algorithm. Finally,
Section 6.3 gives the link setup and experimental results.


[Figure 6.1 block labels: TX data I(k); FFE with Z-1 delays and tap weights α−1, α0,
α1, α2; combined channel including the RX-CTLE (VCTLE, manually calibrated); CDR with
RX data and RX edge outputs; deserializer; EDC-SZF; range limiters; 6-bit DACs
(α−1, α1, α2 adaptively adjusted).]

Figure 6.1: Implemented equalization scheme with the proposed EDC-SZF algorithm.
Here, TX-FFE and RX-CTLE are employed to compensate for the channel loss, the
control voltage of the RX-CTLE (VCTLE) is manually calibrated while the tap weights
(α−1 , α1 , α2 ) of the TX-FFE are adaptively adjusted by the proposed EDC-SZF.

6.1 Serial Link and Channel Equalization

6.1.1 Link Connection and Equalization Scheme

Fig. 6.1 shows the block diagram of the serial link along with the equalization
scheme, where the output of the transmitter chip is directly connected to the receiver
chip through a channel. A TX-FFE and an RX-CTLE are employed to compensate for the
channel loss. The decision feedback equalizer (DFE) is ruled out here, mainly because
of its operation speed limitation, complicated implementation, and significant power
consumption [162, 101]. These overheads generally result from the increased number
of data samplers within the DFE [34, 25]. The RX-CTLE is manually calibrated, while
the tap weights of the TX-FFE are adaptively adjusted by an EDC-SZF algorithm at
the RX-side. The digital tap weights generated by the EDC-SZF engine are first
constrained by three range limiters and then applied to three 6-bit digital-to-analog
converters (DACs) to produce the bias voltages for the TX-FFE taps. To save output
pins, the DACs in practical implementations are located at the TX-side, and the
controlling tap-weight codes are sent through the communication channel under the
control of the status state machine in the media access control (MAC) layer [163]. In
our prototype, which lacks the MAC layer, the DACs are located at the RX-side and the
bias voltages are transferred to the transmitter through PCB traces.

[Figure 6.2 labels: (a) 400 pH, 50 ohm, outputs ON/OP, taps α−1, α0, α1, α2 driven by
Dpre, Dmain, Dpst1, Dpst2; (b), (c) vertical axis: Amplitude (mV).]

Figure 6.2: TX-FFE. (a) Schematic details, (b) simulated output eye-diagram at 10
Gb/s, and (c) simulated output eye-diagram at 40 Gb/s.

6.1.2 Equalizer Implementation Details

Fig. 6.2(a) shows the schematic details of the TX-FFE. It is realized by a 4-tap
current-mode logic (CML) combiner, where the tap weights are adjusted by changing
the bias voltages of the current sources. Fig. 6.2(b) and (c) display the simulated
near-end eye-diagrams at 10 Gb/s and 40 Gb/s when applying -3 dB, -6 dB, and -3 dB
equalization coefficients to the pre, post1, and post2 cursors in the FFE. The
circuit implementation of the RX-CTLE and its frequency responses with different
control voltages are presented in Fig. 6.3. To optimize the equalization configuration,
the control voltage of the RX-CTLE is manually adjusted while the tap weights of
the TX-FFE are adaptively adjusted by a low-cost EDC-SZF adaptation engine. The
next section focuses on the design of the proposed EDC-SZF algorithm.

[Figure 6.3 labels: (a) 320 pH, 65 ohm, inputs IN/IP, outputs ON/OP, VCTLE, ISS,
ISS/2; (b) Gain (dB) versus Frequency (Hz) for VCTLE = 900 mV, 800 mV, 700 mV, and
600 mV.]

Figure 6.3: RX-CTLE. (a) Schematic details and (b) frequency responses for different
control voltages.


6.2 Edge-Data Correlation-Based Sign Zero-Forcing (EDC-SZF)

6.2.1 Drawbacks of Previous Adaptation Algorithms

Previous adaptation algorithms for wireline communications can be mainly categorized
into sign-sign least mean square (SS-LMS), zero-forcing (ZF), and maximum eye opening
(MEO) [129, 130, 131, 105, 132, 133]. The SS-LMS algorithm, which aims to minimize the
mean square error between the expected eye height and the measured eye height, is
widely used to adjust the equalization coefficients for its simplicity and robustness
[129, 34, 130, 131]. However, it needs auxiliary samplers to extract the sign of the
error between the equalized and expected eye heights. This makes it less competitive
for applications operating at tens of Gb/s, for the following reasons. Firstly, the
additional high-speed samplers consume considerable power. Secondly, these auxiliary
samplers degrade the bottleneck bandwidth because their input capacitances are directly
connected to the maximum-speed signal path. Thirdly, more samplers mean more input and
output signals, which make the layout routing more complicated. The traditional ZF
solution is achieved by forcing the cross-correlation between the error sequence
$\varepsilon_k = I_k - \hat{I}_k$ and the desired information sequence $I_k$ to be
zero. Its main drawback is the requirement of an auxiliary analog-to-digital converter
(ADC) to measure $\hat{I}_k$ [105, 132], which also dramatically reduces the bandwidth
of the full-rate driver, thus limiting the maximum operating speed. Additionally, it
needs a large amount of logic to perform matrix multiplication [132]. For the MEO
method, the eye opening of the received signal is evaluated by gradually adjusting the
sampling thresholds and the sampling positions. Instead of producing error information,
this approach provides the visual received eye-diagram at the cost of a complete eye
monitor, which usually incorporates threshold-adjusting samplers, phase-adjusting
PIs, a micro-controller, and measurement software [133].


6.2.2 Iteration of the EDC-SZF

To avoid the auxiliary circuits required by previous adaptation algorithms [129, 130,
131, 34, 105, 132, 133], a low-cost EDC-SZF algorithm utilizing edge-data
cross-correlation is developed. The target is to force the cross-correlation between
the sign of the edge-sampling errors and the received data to zero. The iterative
update of the TX-FFE tap weights is given by

$$\alpha_l(k+1) = \alpha_l(k) - \lambda \cdot \mathrm{sign}[e(k)] \cdot D(k-l), \quad l = -1, 0, 1, 2, \tag{6.1}$$

where $\alpha_l(k)$ is the instantaneous weight of tap $l$, $\mathrm{sign}[e(k)]$
represents the sign of the edge-sampling error, $D(k)$ denotes the recovered data, and
$\lambda$ is the scale factor controlling the adjustment rate, whose value is usually
much smaller than 1. The sign of the edge-sampling error $\mathrm{sign}[e(k)]$ caused
by the inter-symbol interference (ISI) is directly mapped from the quantized edge
sequence $E(k)$, and it is correlated with the data bit $D(k-l)$ to produce the product
$\mathrm{sign}[e(k)] \cdot D(k-l)$. The result is then integrated to update the FFE tap
weight $\alpha_l(k)$.
The main feature of this approach is that it only involves the existing quantized edge
sequence $E(k)$ and recovered data sequence $D(k)$. As a result, the auxiliary circuits
such as samplers, ADCs, and PIs that are essential in previous adaptive equalizations
[129, 130, 131, 105, 132, 133] are removed, which offers more potential in operation
speed and cost effectiveness.
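
As a minimal sketch of the update in Eq. (6.1), the Python fragment below applies one
sign-based iteration to a set of tap weights. The function names, the dictionary
layout, the +1/-1 data encoding, and the value of the scale factor are illustrative
assumptions and are not taken from the implemented circuit.

```python
def sign(x):
    """Return +1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def edc_szf_step(alpha, edge_error, data, k, lam=2 ** -10):
    """Apply one iteration of Eq. (6.1) to every tap in `alpha`.

    alpha      : dict mapping tap index l to the current weight alpha_l(k)
    edge_error : edge-sampling error e(k); only its sign is used
    data       : dict mapping a time index to the recovered bit, coded as +1/-1
    k          : current time index
    lam        : scale factor lambda (hypothetical value, much smaller than 1)
    """
    s = sign(edge_error)
    return {l: w - lam * s * data[k - l] for l, w in alpha.items()}


# Hypothetical example: four taps, a positive edge error at k = 5.
alpha = {-1: 0.0, 0: 1.0, 1: 0.0, 2: 0.0}
data = {3: +1, 4: -1, 5: +1, 6: -1}
updated = edc_szf_step(alpha, edge_error=0.2, data=data, k=5)
print(updated)
```

A tap that should stay fixed (for instance the main tap) can simply be left out of the
`alpha` dictionary passed to the function.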

6.2.3 Correlation between Edge Information and Recovered Data

When dealing with band-limited channels that result in ISI, it is convenient to
develop an equivalent discrete-time model for the continuous-time system. The reason
is that the transmitter sends discrete-time symbols with a period of $T$ and the output
at the receiver side is also a discrete-time signal with samples of the same period.
Fig. 6.4 presents the UI-width pulse response of a typical dispersive channel, where
$h_k$ and $h_{k+0.5}$ denote the ISI tail values at the data-sampling and edge-sampling
points, respectively.

[Figure 6.4 labels: single input bit, channel, output pulse response with samples
h-1.5, h-1, h-0.5, h0, h0.5, h1, h1.5, h2, h2.5, h3, h3.5.]

Figure 6.4: Pulse response of a typical dispersion channel.

According to signal processing principles, the received discrete-time signal $q_k$ can
be computed by the convolution of the input data sequence $I_k$ and the channel pulse
response $h_k$,

$$q_k = \sum_i I_i h_{k-i}. \tag{6.2}$$

For a normally operating serial link where the data-sampling clock is always located
at the center of the eye-diagram, $q_k$ and $q_{k+0.5}$ can be considered as the
sampled analog values before binary quantization. After the decision latches, $q_k$ is
quantized to the data sequence $D_k$, while $q_{k+0.5}$ is quantized to the edge
sequence $E_k$. Applying the cross-correlation function to the edge-sampled sequence
$q_{k+0.5}$ and the recovered data sequence $D_k$, we get

$$
\begin{aligned}
R_{e,d}(n) &= \sum_j q_{j+0.5} D_{j-n} = \sum_j q_{j+0.5} I_{j-n} \\
&= \sum_j \left( \sum_i I_i I_{j-n} h_{j+0.5-i} \right) \\
&= \sum_j \left( I_{j-n} I_{j-n} h_{j+0.5-(j-n)} + \sum_{i \neq j-n} I_i I_{j-n} h_{j+0.5-i} \right) \\
&= \sum_j h_{n+0.5} + \sum_j \left( \sum_{i \neq j-n} I_i I_{j-n} h_{j+0.5-i} \right).
\end{aligned}
\tag{6.3}
$$

Here, $D_k$ is replaced by the input sequence $I_k$ because the bit-error-rate (BER) is
usually quite low ($< 10^{-12}$) for properly operating serial links, and the last step
uses $I_{j-n} I_{j-n} = 1$ for binary ($\pm 1$) symbols. Assuming $m = j - i$, we have

$$
\begin{aligned}
\sum_j \left( \sum_{i \neq j-n} I_i I_{j-n} h_{j+0.5-i} \right)
&= \sum_j \left( \sum_{m \neq n} I_{j-m} I_{j-n} h_{m+0.5} \right) \\
&= \sum_{m \neq n} \left( \sum_j I_{j-m} I_{j-n} h_{m+0.5} \right) = 0.
\end{aligned}
\tag{6.4}
$$

Note that the summation indices $i$ and $j$ traverse all integers except for
$i = j - n$, thus $m$ also ranges over all integers except for $m = n$. The final step
of Eq. (6.4) follows from the fact that the time-shifted data sequences $I_{j-m}$
and $I_{j-n}$ ($m \neq n$) are independent of each other, since the transmitted data
streams in wireline systems are usually random sequences. Substituting Eq. (6.4) into
Eq. (6.3), the cross-correlation function is simplified to

$$R_{e,d}(n) = \sum_j h_{n+0.5}. \tag{6.5}$$

By normalizing the cross-correlation function, we obtain

$$\rho_{e,d}(n) = h_{n+0.5}. \tag{6.6}$$

Clearly, the normalized cross-correlation coefficient $\rho_{e,d}(n)$ between the
edge-sampled sequence $q_{k+0.5}$ and the recovered data sequence $D_k$ exactly equals
the residual ISI value with a time shift of $(n + 0.5)T$, as shown in Fig. 6.4.
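
The result of Eq. (6.6) can be checked numerically. The following Python sketch builds
the edge samples from a random +1/-1 sequence according to Eq. (6.2), correlates them
with the time-shifted data, and prints the estimate next to the corresponding
pulse-response value. The pulse-response values, the sequence length, and the function
name are hypothetical choices made only for this illustration.

```python
import random


def edge_data_correlation(h, n, num_bits=20000, seed=1):
    """Estimate the normalized cross-correlation rho_{e,d}(n) of Eq. (6.6).

    h : dict mapping an edge-spaced offset (in UI, e.g. -0.5, 0.5, 1.5) to the
        pulse-response value at that offset (hypothetical numbers).
    """
    rng = random.Random(seed)
    bits = [rng.choice((-1, 1)) for _ in range(num_bits)]
    # Skip the ends of the sequence so every index used below is valid.
    reach = int(max(abs(o) for o in h)) + abs(n) + 1
    total = 0.0
    count = 0
    for j in range(reach, num_bits - reach):
        # q_{j+0.5} = sum_i I_i * h_{j+0.5-i}, with i = j - (offset - 0.5)
        q_edge = sum(bits[j - int(o - 0.5)] * value for o, value in h.items())
        total += q_edge * bits[j - n]
        count += 1
    return total / count


h_example = {-1.5: 0.02, -0.5: 0.45, 0.5: 0.55, 1.5: 0.20, 2.5: 0.08, 3.5: 0.03}
for n in (-1, 0, 1, 2):
    print(n, round(edge_data_correlation(h_example, n), 3), h_example[n + 0.5])
```

For a long random sequence, the cross terms of Eq. (6.4) average out and each estimate
approaches the pulse-response value at offset n + 0.5.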

6.2.4 Derivation of the EDC-SZF

For a transmitter with a multi-tap UI-spaced FFE, the pre-distorted output can be
represented by

$$t(k) = \sum_l \alpha_l I(k-l), \tag{6.7}$$

where $I(k)$ is the transmitting sequence, $\alpha_l$ denotes the tap weight, and $l$
is the tap index [133]. To make the analysis more compact, the cascaded passive channel
and RX-CTLE are treated as a combined channel with a new pulse response $c_k$. By
calculating the convolution of the pre-distorted output $t(k)$ and the channel pulse
response $c_k$, the received discrete-time sequence before binary quantization can be
given by

$$r(k) = \sum_l \alpha_l \left( \sum_i I(i) c_{k-l-i} \right). \tag{6.8}$$

Following the steps of the cross-correlation analysis in Section 6.2.3 and using the
derived results, we obtain the cross-correlation coefficient between the edge-sampling
error sequence $r(k + 0.5)$ and the recovered data sequence $D(k)$,

$$\hat{\rho}_{e,d}(n) = \sum_l \alpha_l c_{n-l+0.5}. \tag{6.9}$$
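
Equations (6.7) and (6.9) can be illustrated with a short Python sketch. The tap
weights and the combined pulse-response values below are hypothetical numbers, the
function names are illustrative, and samples outside the given sequence are assumed to
be zero.

```python
def ffe_output(alpha, bits):
    """Eq. (6.7): t(k) = sum_l alpha_l * I(k - l), zero outside the sequence."""
    n = len(bits)
    out = []
    for k in range(n):
        acc = 0.0
        for l, weight in alpha.items():
            if 0 <= k - l < n:
                acc += weight * bits[k - l]
        out.append(acc)
    return out


def edge_correlation_coeff(alpha, c, n):
    """Eq. (6.9): rho_hat(n) = sum_l alpha_l * c_{n-l+0.5}; missing c values count as 0."""
    return sum(weight * c.get(n - l + 0.5, 0.0) for l, weight in alpha.items())


# Hypothetical tap weights (pre, main, post1, post2) and combined pulse response.
alpha = {-1: -0.10, 0: 0.70, 1: -0.15, 2: -0.05}
c = {-1.5: 0.0, -0.5: 0.4, 0.5: 0.6, 1.5: 0.2, 2.5: 0.1}

# A single transmitted bit shows the tap weights directly at the FFE output.
print(ffe_output(alpha, [0, 1, 0, 0, 0]))
print(edge_correlation_coeff(alpha, c, 0))
```

With only the main tap active, `edge_correlation_coeff` reduces to the pulse-response
value at offset n + 0.5, which matches Eq. (6.6).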

For an ideally equalized serial link, the edge-sampling error sequence is supposed
to be an all-zero sequence. Hence, all the cross-correlation coefficients should be
zero. However, this needs infinite taps to cancel all the residual ISI. Considering
that the ISI tail decreases exponentially over time, it is reasonable to assume that
the ISI affects a finite number of symbols, and previous research has demonstrated that
equalizers with a specific number of taps can effectively compensate for legacy
channels [130, 131, 164, 133, 123]. In principle, when the tap weights are adjusted
close to the targeted values, the resulting cross-correlation coefficient
$\hat{\rho}_{e,d}(n)$ should be forced towards zero. Taking the implemented 4-tap FFE
in this design as an example, for a group of proper tap weights, we have

$$\hat{\boldsymbol{\rho}}_{e,d} = \mathbf{C}\boldsymbol{\alpha} = \mathbf{0}, \tag{6.10}$$

where

$$\hat{\boldsymbol{\rho}}_{e,d} = \left( \hat{\rho}_{e,d}(-1), \hat{\rho}_{e,d}(0), \hat{\rho}_{e,d}(1), \hat{\rho}_{e,d}(2) \right)^T,$$

$$\boldsymbol{\alpha} = \left( \alpha_{-1}, \alpha_0, \alpha_1, \alpha_2 \right)^T,$$

$$
\mathbf{C} =
\begin{pmatrix}
c_{0.5} & c_{-0.5} & c_{-1.5} & c_{-2.5} \\
c_{1.5} & c_{0.5} & c_{-0.5} & c_{-1.5} \\
c_{2.5} & c_{1.5} & c_{0.5} & c_{-0.5} \\
c_{3.5} & c_{2.5} & c_{1.5} & c_{0.5}
\end{pmatrix}.
$$

To find the optimal TX-FFE tap weights, a recursive equation is constructed as

$$\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) - \lambda \mathbf{C} \boldsymbol{\alpha}(k) = \boldsymbol{\alpha}(k) - \lambda \hat{\boldsymbol{\rho}}_{e,d}(k). \tag{6.11}$$

In each iteration, a small portion of the instantaneous cross-correlation coefficient
vector $\lambda \hat{\boldsymbol{\rho}}_{e,d}(k)$ is subtracted from the tap weight
vector $\boldsymbol{\alpha}(k)$ to bring it closer to the targeted value. For the
convergence, mathematical analysis (see Appendix B) indicates that a sufficient
condition is to keep the 1-norm of the matrix $\mathbf{I} - \lambda \mathbf{C}$ smaller
than 1 (i.e., the maximum absolute column sum is smaller than 1). For any
bandwidth-limited channel, the transmitted symbol spreads over multiple symbols at the
RX-side, so that the above condition holds. Consequently, a set of optimal tap weights
of the TX-FFE can be obtained by the iterative Eq. (6.11). It is worth noting that when
the transmission channel is beyond the scope of the equalization ability or is
over-equalized by improperly setting the control voltage of the RX-CTLE, the tap-weight
coefficients will go too high (or low). To manage this contingency, the range limiters
depicted in Fig. 6.1 are inserted between the EDC-SZF and the DACs to keep the control
codes received by the DACs not larger (or smaller) than the specified maximum (or
minimum) values.
Taking $\mathrm{sign}[e(k)]$ as the binary quantization of the edge-sampling error, the
cross-correlation between the sign of the edge-sampling error and the received data,
$\mathrm{sign}[e(k)] \cdot D(k-l)$, can be considered as an instantaneous estimate of
$\hat{\rho}_{e,d}(l)$. Hence, the final iterative equation presented in Section 6.2.2
can be obtained [refer to Eq. (6.1)].
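
The sufficient condition quoted above can be evaluated directly once the combined pulse
response is known. The Python sketch below builds the matrix C of Eq. (6.10) from a set
of pulse-response values and computes the 1-norm (maximum absolute column sum) of
I - λC. The pulse-response values and the scale factor are hypothetical and were chosen
so that the condition is met; they do not come from the measured channel.

```python
def build_c(c, taps=(-1, 0, 1, 2)):
    """Matrix C of Eq. (6.10): entry (n, l) equals c_{n-l+0.5}."""
    return [[c.get(n - l + 0.5, 0.0) for l in taps] for n in taps]


def one_norm_i_minus_lambda_c(c_matrix, lam):
    """Maximum absolute column sum of (I - lam * C)."""
    size = len(c_matrix)
    column_sums = []
    for col in range(size):
        total = 0.0
        for row in range(size):
            identity = 1.0 if row == col else 0.0
            total += abs(identity - lam * c_matrix[row][col])
        column_sums.append(total)
    return max(column_sums)


# Hypothetical combined pulse response sampled at the edge instants.
c_example = {-2.5: 0.01, -1.5: 0.05, -0.5: 0.15, 0.5: 0.60,
             1.5: 0.20, 2.5: 0.10, 3.5: 0.05}
norm = one_norm_i_minus_lambda_c(build_c(c_example), lam=0.5)
print(norm, norm < 1.0)
```

Whether the condition holds for a given link depends on the actual values of the
combined pulse response and on the chosen scale factor.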

6.2.5 Implementation of the EDC-SZF

Fig. 6.5 depicts the implementation of the EDC-SZF adaptation algorithm, which
contains three identical paths that process the quantized data and edge sequences to
produce the desired bias voltages for the TX-FFE taps. Here, the main tap weight is
pre-fixed to accelerate the convergence. In each path, the edge and data streams with a
proper time shift are applied to a correlation detector (CD) to generate the residual
correlation $\mathrm{ResCor}_l(n)$, which represents
$\mathrm{sign}[e(n)] \cdot D(n-l)$ in Eq. (6.1). These parallel correlation
coefficients are first summed and then fed into a 16-bit integrator to execute the
iteration of Eq. (6.1), where $\lambda$ is determined by the subsequent truncation
operation. In this design, a set of 4 consecutive bits of the 1/16-rate demultiplexed
data/edge streams is employed, which ensures that the data/edge information used for
equalization adaptation comes from different samplers. This decentralized error
collection method reduces the possibility of non-optimal adaptation caused by
imperfections such as fabrication mismatch, duty cycle distortion, and I/Q quadrature
error. Fig. 6.6 further details the operation principle and function table of the CD.
Clearly, if there is no transition ($D(n) \oplus D(n+1) = 0$), $\mathrm{ResCor}_l(n)$
is assigned 0. In case of a data transition ($D(n) \oplus D(n+1) = 1$),
$\mathrm{ResCor}_l(n)$ is assigned +1 or -1 when the polarities of $D(n-l)$ and $E(n)$
are identical ($D(n-l) \oplus E(n) = 0$) or opposite ($D(n-l) \oplus E(n) = 1$),
respectively.

[Figure 6.5 labels: three paths, one per tap (α−1, α1, α2); each path has a correlation
detector fed by E(n) and D(n+1), D(n−1), or D(n−2), producing ResCor-1(n), ResCor1(n),
or ResCor2(n) in {+1, 0, -1} over slots 1 to 4 (D1<3:0>, D0<3:0>), followed by an
adder, a 16-bit integrator, a 6-bit truncation output, a range limiter, and a 6-bit DAC
generating Vbias; D(n) XOR D(n+1); BW<1:0>.]

Figure 6.5: Block diagram of the EDC-SZF adaptation algorithm.

[Figure 6.6(a) labels: edge samples E(n-1), E(n), E(n+1) between data samples D(n-2),
D(n-1), D(n), D(n+1), D(n+2); correlations ResCor-1(n), ResCor0(n), ResCor1(n),
ResCor2(n).]

Figure 6.6(b) function table:

D(n-l)⊕E(n)    D(n)⊕D(n+1)    ResCor_l1(n)    ResCor_l0(n)
0              0              0               0
1              0              0               0
0              1              0               1
1              1              1               1

Note: The signed ResCor_l(n) is represented by two bits: ResCor_l1(n) and ResCor_l0(n).

Figure 6.6: Correlation detector. (a) Operation principle illustration and (b) function
table.
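
The behavior of one adaptation path can be sketched in Python as follows. The
correlation detector and its two-bit encoding follow the function table of Fig. 6.6(b).
The integrator polarity, the unsigned saturating accumulator, the mid-scale start
value, and the limiter bounds are assumptions made for illustration, since these
details are not specified in the text.

```python
def correlation_detector(d_shifted, edge, d_now, d_next):
    """Return ResCor_l(n) in {+1, 0, -1} from 0/1 logic inputs.

    d_shifted : D(n - l), edge : E(n), d_now : D(n), d_next : D(n + 1)
    """
    if (d_now ^ d_next) == 0:      # no data transition: output is 0
        return 0
    return 1 if (d_shifted ^ edge) == 0 else -1


def encode_two_bits(res_cor):
    """Two-bit signed code (ResCor_l1, ResCor_l0) as listed in the function table."""
    return {0: (0, 0), 1: (0, 1), -1: (1, 1)}[res_cor]


def integrator_step(acc, res_cors, acc_bits=16):
    """Subtract the summed slot results (polarity of Eq. (6.1)), saturating (assumed)."""
    acc -= sum(res_cors)
    return max(0, min((1 << acc_bits) - 1, acc))


def dac_code(acc, acc_bits=16, out_bits=6, code_min=4, code_max=60):
    """Keep the top out_bits bits, then apply the range limiter (hypothetical bounds)."""
    code = acc >> (acc_bits - out_bits)
    return max(code_min, min(code_max, code))


# Four slots of one 1/16-rate word, each given as (D(n-l), E(n), D(n), D(n+1)).
slots = [(1, 1, 0, 1), (1, 0, 0, 1), (0, 0, 1, 1), (0, 0, 1, 0)]
results = [correlation_detector(*s) for s in slots]
acc = integrator_step(1 << 15, results)   # start from mid-scale (assumed)
print(results, acc, dac_code(acc))
```

In this sketch, dropping the 10 least significant bits of the 16-bit accumulator plays
the role of the scale factor: a single ±1 correlation result moves the 6-bit code by
1/1024 of one code step.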


Figure 6.7: Layout views of the equalization blocks. (a) TX-FFE, (b) RX-CTLE, and
(c) EDC-SZF.


6.3 Experimental Results

6.3.1 Layout and Simulation Results

6.3.1.1 Layout Designs

Fig. 6.7 displays the layout views of the equalization blocks. For the TX-FFE [see
Fig. 6.7(a)], a pair of standard inductors is utilized to neutralize the capacitances
on the output nodes. For the RX-CTLE [see Fig. 6.7(b)], two T-coil inductors are used
in the terminals to support a high current capability, while multi-layer inductors are
adopted in the CTLE stages to save area. The EDC-SZF [see Fig. 6.7(c)] is implemented
with the standard cells provided by the foundry.

[Figure 6.8 labels: (a) Attenuation (dB), -15.9 dB at 20 GHz; (b) Voltage (mV) for the
Pre, Post1, and Post2 taps; (c), (d) Amplitude (mV).]

Figure 6.8: Transistor-level simulation of the EDC-SZF adaptation. (a) Channel
frequency response, (b) convergence process of the TX-FFE tap weights, (c) eye-diagram
with zero TX-FFE tap weights, and (d) eye-diagram with adaptively-adjusted TX-FFE
tap weights.


[Figure 6.9 labels: (a) transmitter chip, PCB channel, receiver chip; (b) TX channel,
duplicated channel; (c) Gain (dB) versus Frequency (Hz): -4.64 dB @ 5.0 GHz,
-11.26 dB @ 10.0 GHz, -16.34 dB @ 20.0 GHz.]

Figure 6.9: Constructed chip-to-chip interconnect. (a) Testing PCB, (b) auxiliary PCB,
and (c) duplicated channel frequency response.


[Figure 6.10 labels: Tap Bias (V), 0 to 0.6, versus VCTLE (V), 0.9 down to 0.6; curves
for the Post1, Pre, and Post2 taps; bias conditions A to F marked.]

Figure 6.10: Adaptively-adjusted bias voltages of the TX-FFE with different RX-CTLE
control voltages.

Figure 6.11: Measured far-end eye-diagrams for (a) bias condition A, (b) bias condition
B, (c) bias condition D, and (d) bias condition F depicted in Fig. 6.10.

6.3.1.2 Simulation Results

Fig. 6.8 gives the transistor-level simulation results of the serial link with the
EDC-SZF adaptation, where the control voltage of the RX-CTLE is pre-set to 700 mV and
the dispersive channel is imitated by an LPF with a -15.9 dB loss at 20 GHz. The
channel frequency response and the eye-diagram after the channel are shown in Fig.
6.8(a). Fig. 6.8(b) describes the convergence process of the TX-FFE tap weights. Fig.
6.8(c) and (d) display the eye-diagrams (measured at the output of the RX-CTLE) with
zero and adaptively-adjusted tap weights, respectively. It can be seen that the
developed EDC-SZF adaptation algorithm gradually tunes the TX-FFE tap weights towards
their optimal values, which effectively improves the eye opening and eyelid thickness.

[Figure 6.12 labels: Bit Error Rate (error ratio) versus Phase (UI) for bias conditions
A, C, and F.]

Figure 6.12: Measured bathtub curves under different bias conditions depicted in Fig.
6.10.

6.3.2 Measurement Results

Fig. 6.9 shows the measurement setup of the serial link. As shown in Fig. 6.9(a),
a chip-to-chip interconnect is constructed. The outputs of the transmitter chip and the
inputs of the receiver chip are separately wire-bonded to the two terminals of a 12
cm PCB channel. Meanwhile, an auxiliary PCB with a transmitter chip bonded to a
replica channel and a pair of duplicated PCB traces is also manufactured to measure
the far-end eye-diagrams and evaluate the channel characteristics [see Fig. 6.9(b)].
Fig. 6.9(c) depicts the frequency response of the PCB channel, where the channel loss
at the half-baud frequency is over 16 dB.
Fig. 6.10 shows the adaptively-adjusted bias voltages of the TX-FFE taps as the
control voltage of the RX-CTLE changes from 900 mV to 615 mV [see the corresponding
equalization abilities in Fig. 6.3(b)]. Fig. 6.11 shows the far-end eye-diagrams
under the bias conditions A, B, D, and F depicted in Fig. 6.10. As the control voltage
of the RX-CTLE is decreased (i.e., improving the high-frequency peaking ability
of the RX-CTLE), the TX-FFE bias voltages are adjusted accordingly to decrease the
equalization capability of the TX-FFE, thus keeping the frequency response of the
combined TX-FFE, RX-CTLE, and transmission channel close to a flat profile. By
detecting the BER while adjusting the sampling positions, the bathtub diagram can be
obtained. Fig. 6.12 displays the measured bathtub curves under the bias conditions
A, C, and F described in Fig. 6.10. For the balanced equalization coefficient
allocation under bias condition C, the horizontal eye opening at BER = $10^{-12}$
reaches 0.51 UI, which is much better than those measured under bias condition A
(0.30 UI) and bias condition F (0.35 UI). This indicates that a combined scheme of
TX-FFE and RX-CTLE is a good choice for the equalization of the 40 Gb/s link.

6.4 Chapter Summary

This chapter constructs a serial link over a PCB channel with more than 16 dB of loss
using the chips designed in Chapters 4 and 5. A combined TX-FFE and RX-CTLE is employed
to compensate for the channel loss. To obtain the optimal equalization coefficients
and track the channel-loss variations with respect to the operation environment, a
low-cost EDC-SZF adaptation algorithm is proposed to automatically adjust the TX-FFE's
tap weights. Unlike previous adaptation techniques that need auxiliary circuits to
extract the error information, the proposed EDC-SZF adaptation performs the tap-weight
adjustment by processing the existing data and edge sequences, hence introducing
little overhead to the link.

Chapter 7

Conclusions and Future Work

7.1 Conclusions

The rapid growth of computing power and storage volume has led to an explosive
bandwidth demand on data communication in both telecommunication equipment and
inter/intra data centers. To accommodate this requirement, the data rate of the
wireline SerDes transceiver has been continuously increased. Currently, 25-28 Gb/s
serial links have entered industrial deployment. The 38-64 Gb/s transceivers, which
will play a key role in the next-generation data rate, have attracted increasing
attention in both industry and academia. This thesis addresses some of the
architecture-level and circuit-level challenges associated with such cutting-edge
wireline transceiver designs. Several advanced techniques are developed to optimize
the operation speed, power efficiency, performance margin, and area occupation. The
prototype chips of a 10 GHz clock multiplier, a 40 Gb/s transmitter, and a 40 Gb/s
receiver are separately designed and fabricated in a 65 nm CMOS process. The main
features of these chips are summarized below.

• The main features of the implemented ring-oscillator-based injection-locked clock
multiplier (RILCM) focus on three aspects. Firstly, a hybrid frequency tracking
loop is proposed to automatically adjust the control voltage of the injection-locked
voltage-controlled oscillator (VCO). By introducing a lock-loss detection and lock
recovery mechanism, the hybrid loop endows the RILCM with a lock-acquisition ability
similar to that of conventional PLLs, thus removing the need for an initial frequency
setup aid and preventing the potential lock-loss risk. Secondly, a full-swing
pseudo-differential delay cell is developed to optimize the phase noise performance of
the VCO. Thirdly, a compact timing-adjusted phase detector tightly combined with a
well-matched charge pump is designed to satisfy the requirements of high operation
speed, high detection accuracy, and low output disturbance. The measurement results
show that the implemented 10 GHz RILCM chip achieves a good balance among jitter
performance, area occupation, operation speed, and power efficiency.

• The main features of the implemented transmitter focus on three aspects. Firstly,
a 4-tap feed-forward equalizer (FFE) based on multiple multiplexers (MUXs) is
designed. Thanks to the retiming-based symbol-spaced sequence generation, it
can support a wide operation range of 5-50 Gb/s. Secondly, an enhanced 4:1
MUX is developed. By introducing a pair of pre-charging PMOS transistors in
the pulling-down unit cell, it completely eliminates the charge-sharing effect,
which not only improves the jitter performance of the 4:1 MUX but also helps
to extend its maximum bandwidth. Thirdly, a compact latch array associated
with an interleaved-retiming technique is designed. By interleaved retiming of the
parallel data, the 16-path quarter-rate data streams with appropriate delays can
be obtained. The measurement results indicate that the fabricated 40 Gb/s
transmitter chip achieves excellent jitter performance and power efficiency.

• The main features of the implemented receiver focus on two aspects. One is
the architecture-level improvement on the clock data recovery (CDR). By introducing
passive low-pass filters with an adaptively adjusted bandwidth into the
data-sampling path, the jitter tracking and jitter suppression for data decisions
can be automatically balanced, thus improving the jitter tolerance of the CDR.
The other is the time-averaging-based compensating phase interpolator, which
not only improves the phase-step uniformity but also reduces the phase-spacing
errors between the edge and data sampling clocks. The measurement results
show that the maximum tolerable jitter amplitude of the implemented 40 Gb/s receiver
chip exceeds that of previous receivers at high jitter frequencies.


• Using the designed transmitter and receiver chips, a chip-to-chip communication
link over a 12-cm printed circuit board (PCB) channel is constructed. It employs
a combination of TX-FFE and RX-CTLE to compensate for the channel loss. A
low-cost edge-data correlation-based sign zero-forcing (EDC-SZF) adaptation
algorithm is proposed to automatically adjust the TX-FFE's tap weights. The
measurement results indicate that the combination of TX-FFE and RX-CTLE is a good
choice for the equalization of the 16 dB loss channel at 40 Gb/s, and that the
proposed EDC-SZF adaptation can effectively tune the TX-FFE to its optimal tap weights
for a given control voltage applied to the RX-CTLE.

7.2 Future Work

The factors to consider when designing a serial communication link mainly include
data transmission rate, power efficiency, and channel characteristics. The first factor
is usually set by particular operation standards, while the other two largely depend on
the network infrastructure, operation medium, and link distances. As the requirement
for data rates goes beyond 40 Gb/s, efforts in channel optimization, on-chip
transmission lines, and modulation schemes should also be made. Consequently, the
following items could be future tasks to further optimize the link performance.

• Enhancing the chip-package co-design. The chips presented in this dissertation
are measured by mounting them directly on the PCB using gold bonding wires. The
inductive parasitics of the bonding wires inevitably cause discontinuities, which
degrade the signal integrity by reinforcing undesired signal reflections. One can
extract the models of the bonding wires through high-frequency electromagnetic field
simulations and treat them as electrical components during the chip design. This
chip-package co-design method provides a possible way to reduce the effect of the
bonding wires and hence improve the continuity of the transmission channel.


• Developing on-chip transmission lines. The free-space wavelength of a 20 GHz signal
(the Nyquist frequency at 40 Gb/s) is around 1.5 cm, so a connection wire of about
1.5 mm (one tenth of the wavelength) should be treated as a transmission line.
Moreover, the highest frequency of interest is actually determined by the
rise/fall time of the transmitted signal, which means even shorter connection
wires should be modeled as transmission lines. Instead of lumped parasitic
capacitors and inductors, the parasitic effect of a transmission line is characterized
by its characteristic impedance. By placing a resistive matching termination at the
far end, the parasitic effect can be theoretically neutralized, thus saving
substantial driving power. Meanwhile, the series parasitic resistance can
degrade the performance of the transmission line, especially for long connection
wires. Additionally, the requirement of physical uniformity for transmission
lines poses significant challenges for the layout routing.

• Exploring advanced techniques for four-level pulse amplitude modulation (PAM4) chipset
design. PAM4 is considered one of the most promising multi-level modulation schemes for
next-generation data rates, due to its doubled channel capacity, moderate signal-to-noise
ratio (SNR), and applicability to the existing infrastructure. It uses four distinct
amplitude levels to convey two bits in one symbol, thus halving the Nyquist frequency to
relax the system loss budget and/or increase the link speed. However, it suffers from a
9.5 dB SNR penalty, since the eye height is reduced to one third of that of non-return-
to-zero (NRZ) modulation. To mitigate this effect, the transmitter is required to output
a large swing with high linearity, while the receiver is required to automatically adjust
the threshold levels to correctly extract the most significant bit and the least
significant bit. Additionally, the inherent inter-symbol interference associated with the
edge transitions among different symbol levels makes the clock data recovery design in
the PAM4 mode much more challenging than that in the NRZ mode. Moreover, the
three-eye-opening requirement poses new challenges in the equalization design.
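
As a rough numerical check of the quantities used in the last two items, the following
Python sketch computes the Nyquist frequency, the corresponding wavelength with its
one-tenth rule-of-thumb length, and the PAM4 eye-height penalty. The function names and
the relative permittivity of 3.9 (a typical value for silicon dioxide) are illustrative
assumptions and are not taken from this work.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum, m/s


def nyquist_hz(bit_rate_bps, bits_per_symbol=1):
    """Nyquist frequency, i.e. half the symbol rate."""
    return bit_rate_bps / bits_per_symbol / 2.0


def wavelength_m(freq_hz, eps_r=1.0):
    """Wavelength in a homogeneous medium with relative permittivity eps_r."""
    return C0 / (freq_hz * math.sqrt(eps_r))


def pam_eye_penalty_db(levels):
    """Eye-height loss of PAM-M versus NRZ at equal peak swing: 20*log10(M - 1)."""
    return 20.0 * math.log10(levels - 1)


BIT_RATE = 40e9      # 40 Gb/s link
EPS_R_SIO2 = 3.9     # assumed on-chip dielectric (illustrative value)

f_nrz = nyquist_hz(BIT_RATE)                        # NRZ: one bit per symbol
f_pam4 = nyquist_hz(BIT_RATE, bits_per_symbol=2)    # PAM4: two bits per symbol

lam_free = wavelength_m(f_nrz)
lam_chip = wavelength_m(f_nrz, EPS_R_SIO2)

print(f"NRZ Nyquist frequency : {f_nrz / 1e9:.1f} GHz")
print(f"PAM4 Nyquist frequency: {f_pam4 / 1e9:.1f} GHz")
print(f"Free-space wavelength : {lam_free * 1e3:.2f} mm, one tenth = {lam_free * 1e2:.2f} mm")
print(f"Assumed on-chip value : {lam_chip * 1e3:.2f} mm, one tenth = {lam_chip * 1e2:.2f} mm")
print(f"PAM4 eye-height penalty: {pam_eye_penalty_db(4):.2f} dB")
```

The penalty function only captures the reduction in eye height at a fixed peak swing; it
does not account for linearity, threshold errors, or transition-dependent jitter.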

Bibliography

[1] Cisco Visual Networking Index, “The zettabyte era-trends and analysis.” Cisco white paper,
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/vni-hyperconnectivity-wp.html,
Jun. 2016. [Online]. Accessed 26-Feb.-2017.

[2] T. P. Morgan, “Driving the Ethernet roadmap at 100x speeds.” The Next Platform,
https://www.nextplatform.com/2015/03/31/driving-the-ethernet-roadmap-at-100x-speeds/, Mar. 2015.

[3] S. Voinigescu et al., “SiGe BiCMOS for analog, high-speed digital and millimetre-wave ap-
plications beyond 50 GHz,” in Proc. IEEE Bipolar/BiCMOS Circuits and Technology Meeting,
pp. 1–8, Oct. 2006.

[4] Optical Internetworking Forum, “OIF CEI-56G application note-Common electrical interface
at 56Gb/s.” OIF application note,
http://www.oiforum.com/wp-content/uploads/OIF-CEI-white-paper-final-Mar-23-2016.pdf, 2016.
[Online]. Accessed 26-Feb.-2017.

[5] Telcordia Technologies, Synchronous Optical Network (SONET) Transport Systems: Common
Generic Criteria, Sep. 2000. GR-253-CORE.

[6] T. Toifl et al., “A 22-Gb/s PAM-4 receiver in 90-nm CMOS SOI technology,” IEEE J. Solid-State
Circuits, vol. 41, pp. 954–965, Apr. 2006.

[7] T. Toifl, Low-Power High-Speed CMOS I/Os: Design Challenges and Solutions. IBM Research
GmbH Zurich Research Laboratory, 2012.

[8] C. Kromer et al., “A 25-Gb/s CDR in 90-nm CMOS for high-density interconnects,” IEEE J.
Solid-State Circuits, vol. 41, pp. 2921–2929, Dec. 2006.

[9] J. F. Bulzacchelli et al., “A 10-Gb/s 5-tap DFE/4-tap FFE transceiver in 90-nm CMOS technology,”
IEEE J. Solid-State Circuits, vol. 41, pp. 2885–2900, Dec. 2006.

[10] L. Rodoni et al., “A 5.75 to 44 Gb/s quarter rate CDR with data rate selection in 90 nm bulk
CMOS,” IEEE J. Solid-State Circuits, vol. 44, pp. 1927–1941, Jul. 2009.

[11] S. Sidiropoulos et al., “A semidigital dual delay-locked loop,” IEEE J. Solid-State Circuits,
vol. 32, pp. 1683–1692, Nov. 1997.

[12] G.-Y. Wei et al., “A variable-frequency parallel I/O interface with adaptive power-supply regula-
tion,” IEEE J. Solid-State Circuits, vol. 35, pp. 1600–1610, Nov. 2000.

[13] A. Agrawal et al., “A 19-Gb/s serial link receiver with both 4-tap FFE and 5-tap DFE functions
in 45-nm SOI CMOS,” IEEE J. Solid-State Circuits, vol. 47, pp. 3220–3231, Dec. 2012.

[14] R. Kreienkamp et al., “A 10-Gb/s CMOS clock and data recovery circuit with an analog phase
interpolator,” IEEE J. Solid-State Circuits, vol. 40, pp. 736–743, Mar. 2005.

[15] H. Pan et al., “A digital wideband CDR with 15.6kppm frequency tracking at 8Gb/s in 40nm
CMOS,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 442–443, Feb. 2011.

[16] M.-S. Chen, Design of 60+Gb/s Serial-Link Transmitters Using Filter Techniques. PhD thesis,
Electrical Engineering, University of California, Los Angeles, 2015.


[17] B. Welch, “400G optics-technologies, timing, and transceivers.” IEEE P802.3bs,
http://www.ieee802.org/3/bs/public/14_05/welch_3bs_01_0514.pdf, May 2014. [Online].
Accessed 22-Oct.-2016.

[18] InfiniBand Trade Association, “InfiniBand roadmap.” Mellanox Technologies,
http://www.infinibandta.org/content/pages.php?pg=technology_overview. [Online]. Accessed
22-Oct.-2016.

[19] M. Cvijetic and I. B. Djordjevic, Advanced Optical Communication Systems and Networks, ch. 1,
pp. 1–38. Artech House, 2013.

[20] P. C. Chiang et al., “4 × 25 Gb/s transceiver with optical front-end for 100 GbE system in 65 nm
CMOS technology,” IEEE J. Solid-State Circuits, vol. 50, pp. 573–582, Feb. 2015.

[21] U. Singh et al., “A 780 mW 4 × 28 Gb/s transceiver for 100 GbE gearbox PHY in 40 nm CMOS,”
IEEE J. Solid-State Circuits, vol. 49, pp. 3116–3129, Dec. 2014.

[22] T. Takemoto et al., “A 25-Gb/s 2.2-W 65-nm CMOS optical transceiver using a power-supply-
variation-tolerant analog front end and data-format conversion,” IEEE J. Solid-State Circuits,
vol. 49, pp. 1903–1916, Feb. 2014.

[23] R. Navid et al., “A 40 Gb/s serial link transceiver in 28 nm CMOS technology,” IEEE J. Solid-
State Circuits, vol. 50, pp. 814–827, Dec. 2015.

[24] M. S. Chen and C. K. K. Yang, “A 50-64 Gb/s serializing transmitter with a 4-tap, LC-ladder-
filter-based FFE in 65 nm CMOS technology,” IEEE J. Solid-State Circuits, vol. 50, pp. 1903–
1916, Apr. 2015.

[25] J. Lee et al., “Design of 56 Gb/s NRZ and PAM4 SerDes transceivers in CMOS technologies,”
IEEE J. Solid-State Circuits, vol. 50, pp. 2061–2073, Sep. 2015.

[26] P. C. Chiang et al., “60Gb/s NRZ and PAM4 transmitters for 400GbE in 65nm CMOS link,” in
Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 42–43, Feb. 2014.

[27] H. Tao et al., “40-43-Gb/s OC-768 16:1 MUX/CMU chipset with SFI-5 compliance,” IEEE J.
Solid-State Circuits, vol. 38, pp. 2169–2180, Dec. 2003.

[28] Inphi, “CMOS paves the road to 100 GbE mainstream markets.”
https://www.inphi.com/products/whitepapers/inphi_whitepaper_iphy_final.pdf, 2011. [Online].
Accessed 28-Jul.-2017.

[29] T. H. Lee, The Design of CMOS Radio-Frequency Integrated Circuits. Cambridge: Cambridge
University Press, 1998.

[30] Altera, Altera’s 28-nm, Power-Efficient Transceivers, Jan. 2013.

[31] IEEE 802.3, 50 Gb/s Ethernet Over a Single Lane and Next Generation 100 Gb/s & 200 Gb/s
Ethernet Call For Interest Consensus Presentation, Nov. 2015.

[32] A. A. Hafez, M.-S. Chen, and C.-K. K. Yang, “A 32-48 Gb/s serializing transmitter using multi-
phase serialization in 65 nm CMOS technology,” IEEE J. Solid-State Circuits, vol. 50, pp. 763–
775, Mar. 2015.

[33] S. Kaeriyama et al., “A 40 Gb/s multi-data-rate CMOS transmitter and receiver chipset with SFI-5
interface for optical transmission systems,” IEEE J. Solid-State Circuits, vol. 44, pp. 3568–3579,
Dec. 2009.

[34] M. S. Chen et al., “A fully-integrated 40-Gb/s transceiver in 65-nm CMOS technology,” IEEE J.
Solid-State Circuits, vol. 47, pp. 627–640, Mar. 2012.

[35] T. Shibasaki et al., “A 56Gb/s NRZ-electrical 247mW/lane serial-link transceiver in 28nm CMOS,”
in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 64–65, Feb. 2016.


[36] J. Han et al., “A 60Gb/s 288mW NRZ transceiver with adaptive equalization and baud-rate clock
and data recovery in 65nm CMOS technology,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig.
Tech. Papers, pp. 112–113, Feb. 2017.

[37] D. Cui et al., “A dual-channel 23-Gbps CMOS transmitter/receiver chipset for 40-Gbps RZ-
DQPSK and CS-RZ-DQPSK optical transmission,” IEEE J. Solid-State Circuits, vol. 47,
pp. 3249–3260, Dec. 2012.

[38] M. Harwood et al., “A 225mW 28Gb/s SerDes in 40nm CMOS with 13dB of analog equalization
for 100GBASE-LR4 and optical transport lane 4.4 applications,” in Proc. IEEE Int. Solid-State
Circuits Conf. Dig. Tech. Papers, pp. 326–327, Feb. 2012.

[39] K. Kaviani et al., “A tri-modal 20-Gbps/link differential/DDR3/GDDR5 memory interface,”
IEEE J. Solid-State Circuits, vol. 47, pp. 926–937, Apr. 2012.

[40] ISSCC, ISSCC 2016 trends, Feb. 2016.

[41] T. Toifl et al., “A 72mW 0.03mm2 inductorless 40 Gb/s CDR in 65 nm SOI CMOS,” in Proc.
IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 226–227, Feb. 2007.

[42] J. Savoj et al., “Design of high-speed wireline transceivers for backplane communications in
28nm CMOS,” in Proc. IEEE Custom Integrated Circuits Conf., pp. 1–4, Sep. 2012.

[43] T. T. Vu, Compound Semiconductor Integrated Circuits, vol. 29. World Scientific, 2003.

[44] C.-K. K. Yang, Design of High-Speed Serial Links in CMOS. PhD thesis, Stanford University,
Dec. 1998.

[45] H. Bakoglu, ed., Circuits, Interconnections and Packaging for Very Large Scale Integration. Ad-
dison Wesley Longman Publishing Co., 1990.

[46] H. Johnson and M. Graham, High-Speed Digital Design. Prentice-Hall, 1993.

[47] H. Johnson and M. Graham, High-Speed Signal Propagation: Advanced Black Magic. Prentice-
Hall, 2003.

[48] W.-K. Chen, ed., The VLSI Handbook. Taylor & Francis Group, 2 ed., 2007.

[49] C. R. Paul, Analysis of Multiconductor Transmission Lines. John Wiley & Sons, 2 ed., 2008.

[50] T. Dhaene and D. D. Zutter, “Selection of lumped element models for coupled lossy transmission
lines,” IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., vol. 11, pp. 805–815, Jul. 1992.

[51] K. Fukuda et al., “A 12.3-mW 12.5-Gb/s complete transceiver in 65-nm CMOS process,” IEEE
J. Solid-State Circuits, vol. 45, pp. 2838–2849, Dec. 2010.

[52] K.-L. J. Wong et al., “A 27-mW 3.6-Gb/s I/O transceiver,” IEEE J. Solid-State Circuits, vol. 39,
pp. 602–612, Apr. 2004.

[53] H. Hatamkhani and C.-K. K. Yang, “Power analysis for high-speed I/O transmitters,” in Proc.
Symp. VLSI Circuits, pp. 142–145, Jun. 2004.

[54] A. Agrawal, Design of High Speed I/O Interfaces for High Performance Microprocessors. PhD
thesis, The School of Engineering and Applied Sciences, Harvard University, Oct. 2010.

[55] B. Kim et al., “A 10-Gb/s compact low-power serial I/O with DFE-IIR equalization in 65-nm
CMOS,” IEEE J. Solid-State Circuits, vol. 44, pp. 3526–3538, Dec. 2009.

[56] G. Balamurugan et al., “A scalable 5-15 Gbps, 14-75 mW low-power I/O transceiver in 65 nm
CMOS,” IEEE J. Solid-State Circuits, vol. 43, pp. 1010–1019, Apr. 2008.

[57] J. Lee et al., “56Gb/s PAM4 and NRZ SerDes transceivers in 40nm CMOS,” in Proc. IEEE Symp.
VLSI Circ. Dig. Tech. Papers, pp. 118–119, Jun. 2015.

[58] Tektronix, Understanding and Characterizing Timing Jitter, Feb. 2011.


[59] F. Rao and S. Hindi, “Frequency domain analysis of jitter amplification in clock channels,” in
Proc. IEEE 21st Conference on Electrical Performance of Electronic Packaging and Systems,
pp. 51–54, Oct. 2012.

[60] S. Chaudhuri et al., “Jitter amplification characterization of passive clock channels at 6.4 and 9.6
Gb/s,” in Proc. IEEE Electrical Performance of Electronic Packaging, pp. 23–25, Feb. 2006.

[61] B. Casper and F. O’Mahony, “Clocking analysis, implementation and measurement techniques for
high-speed data links-A tutorial,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, pp. 17–39,
Jan. 2009.

[62] Maxim Integrated, Converting between RMS and Peak-to-Peak Jitter at a Specified BER, Apr.
2008.

[63] Y. Moon et al., “A 0.6-2.5 GBaud CMOS tracked 3× oversampling transceiver with dead-zone
phase detection for robust clock/data recovery,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig.
Tech. Papers, pp. 212–213, Feb. 2001.

[64] J. L. Sonntag and J. Stonick, “A digital clock and data recovery architecture for multi-Gigabit/s
binary links,” IEEE J. Solid-State Circuits, vol. 41, pp. 1867–1875, Jul. 2006.

[65] P. K. Hanumolu et al., “Digitally-enhanced phase-locking circuits,” in Proc. IEEE Custom Inte-
grated Circuits Conf., pp. 361–368, Sep. 2007.

[66] A. Ghatak and K. Thyagarajan, An Introduction to Fiber Optics. Cambridge: Cambridge Univer-
sity Press, 1998.

[67] B. Razavi, Design of Integrated Circuits for Optical Communications. John Wiley & Sons. Inc,
2 ed., 2012.

[68] K. Kundert, Verification of Bit-Error Rate in Bang-Bang Clock and Data Recovery Circuits. The
Designers Guide Community, May 2010.

[69] R. Reutemann et al., “A 4.5 mW/Gb/s 6.4 Gb/s 22+1-lane source synchronous receiver core with
optional cleanup PLL in 65 nm CMOS,” IEEE J. Solid-State Circuits, vol. 45, pp. 2850–2860,
Dec. 2010.

[70] N. Kalantari and J. F. Buckwalter, “A multichannel serial link receiver with dual-loop clock-
and-data recovery and channel equalization,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60,
pp. 2920–2931, Nov. 2013.

[71] V. F. Kroupa, Phase Lock Loops and Frequency Synthesis. John Wiley & Sons Ltd, 2003.

[72] B. Razavi, ed., Phase-Locking in High-Performance Systems: From Devices to Architectures.
Wiley-IEEE Press, 2003.

[73] M. Mansuri and C.-K. K. Yang, “Jitter optimization based on phase-locked loop design parame-
ters,” IEEE J. Solid-State Circuits, vol. 37, pp. 1375–1382, Nov. 2002.

[74] J. G. Maneatis, “Low-jitter process-independent DLL and PLL based on self-biased techniques,”
IEEE J. Solid-State Circuits, vol. 31, pp. 1723–1732, Nov. 1996.

[75] M.-J. E. Lee, “Jitter transfer characteristics of delay-locked loops-theories and design tech-
niques,” IEEE J. Solid-State Circuits, vol. 38, pp. 614–615, Apr. 2003.

[76] C.-N. Chuang and S.-I. Liu, “A 40GHz DLL-based clock generator in 90nm CMOS technology,”
in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 178–179, Feb. 2007.

[77] X. Gao et al., “Jitter analysis and a benchmarking figure-of-merit for phase-locked loops,” IEEE
Trans. Circuits Syst. II, Exp. Briefs, vol. 56, pp. 117–121, Feb. 2009.

[78] C. Kim et al., “A low-power small-area 7.28-ps-jitter 1-GHz DLL-based clock generator,” IEEE
J. Solid-State Circuits, vol. 37, pp. 1414–1420, Nov. 2002.


[79] R. Farjad-Rad et al., “A low-power multiplying DLL for low-jitter multigigahertz clock genera-
tion in highly integrated digital chips,” IEEE J. Solid-State Circuits, vol. 37, pp. 1804–1812, Dec.
2002.

[80] H.-Y. Chang, “A low-jitter low-phase-noise 10-GHz sub-harmonically injection-locked PLL with
self-aligned DLL in 65-nm CMOS technology,” IEEE Trans. Microwave Theory Tech., vol. 62,
pp. 543–555, Mar. 2014.

[81] S. Choi et al., “A 185 fsrms-integrated-jitter and -245dB FOM PVT-robust ring-VCO-based
injection-locked clock multiplier with a continuous frequency-tracking loop using a replica-delay
cell and a dual-edge phase detector,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers,
pp. 194–195, Feb. 2016.

[82] J.-C. Chien et al., “A pulse-position-modulation phase-noise-reduction technique for a
2-to-16GHz injection-locked ring oscillator in 20nm CMOS,” in IEEE Int. Solid-State Circuits Conf.
Dig. Tech. Papers, pp. 52–53, Feb. 2014.

[83] W. Deng et al., “A 0.022mm2 970µW dual-loop injection-locked PLL with -243 dB FOM using
synthesizable all-digital PVT calibration circuits,” in IEEE Int. Solid-State Circuits Conf. Dig.
Tech. Papers, pp. 248–249, Feb. 2013.

[84] M. Kim et al., “A 450-fs jitter PVT-robust fractional-resolution injection-locked clock multiplier
using a DLL-based calibrator with replica-delay-cells,” in Proc. IEEE Symp. VLSI Circ. Dig.
Tech. Papers, pp. C142–C143, Jun. 2015.

[85] B. M. Helal et al., “A low jitter programmable clock multiplier based on a pulse injection-locked
oscillator with a highly-digital tuning loop,” IEEE J. Solid-State Circuits, vol. 44, pp. 1391–1400,
May 2009.

[86] M. Raj et al., “A 4-to-11GHz injection-locked quarter-rate clocking for an adaptive 153fJ/b opti-
cal receiver in 28nm FDSOI CMOS,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers,
pp. 404–405, Feb. 2015.

[87] K. Hu et al., “A 0.6 mW/Gb/s, 6.4-7.2 Gb/s serial link receiver using local injection-locked ring
oscillators in 90 nm CMOS,” IEEE J. Solid-State Circuits, vol. 45, pp. 899–908, Apr. 2010.

[88] J. Lee and H. Wang, “Study of subharmonically injection-locked PLLs,” IEEE J. Solid-State
Circuits, vol. 44, pp. 1539–1553, May 2009.

[89] S. Ye et al., “A multiple-crystal interface PLL with VCO realignment to reduce phase noise,”
IEEE J. Solid-State Circuits, vol. 37, pp. 1795–1803, Dec. 2002.

[90] X. Qi et al., “Compact on-chip wire models for the clock distribution of high-speed I/O
interfaces,” in Proc. IEEE Electrical Performance of Electronic Packaging, pp. 235–238, Oct. 2008.

[91] K. Hu et al., “Comparison of on-die global clock distribution methods for parallel serial links,”
in Proc. IEEE International Symposium on Circuits and Systems, pp. 1843–1846, May 2009.

[92] F. O’Mahony et al., “A low-jitter PLL and repeaterless clock distribution network for a 20Gb/s
link,” in Proc. IEEE Symp. VLSI Circ. Dig. Tech. Papers, pp. 29–30, Jun. 2006.

[93] L. Xiu, “Clock technology: The next frontier,” IEEE Circuits and Systems Magazine, vol. 17,
pp. 27–46, May. 2017.

[94] J. Poulton et al., “A 14-mW 6.25-Gb/s transceiver in 90-nm CMOS,” IEEE J. Solid-State Circuits,
vol. 42, pp. 2745–2757, Dec. 2007.

[95] S. Chan et al., “A resonant global clock distribution for the cell broadband-engine processor,” in
Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 512–513, Feb. 2008.

[96] N. Holland, Interfacing Between LVPECL, VML, CML, and LVDS Levels. Texas Instruments,
2002.

[97] Cypress Semiconductor, A Comparison of CML and LVDS for High-Speed Serial Links, 2002.


[98] C. Menolfi et al., “A 28Gb/s source-series terminated tx in 32nm CMOS SOI,” in Proc. IEEE Int.
Solid-State Circuits Conf. Dig. Tech. Papers, pp. 334–335, Feb. 2012.

[99] J. Kim et al., “A 16-to-40Gb/s quarter-rate NRZ/PAM4 dual-mode transmitter in 14nm CMOS,”
in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 60–61, Feb. 2015.

[100] K. Kanda et al., “A single-40 Gb/s dual-20 Gb/s serializer IC with SFI-5.2 interface in 65 nm
CMOS,” IEEE J. Solid-State Circuits, vol. 44, pp. 3580–3589, Dec. 2009.

[101] H. Wang and J. Lee, “A 21-Gb/s 87-mW transceiver with FFE/DFE/Analog equalizer in 65-nm
CMOS technology,” IEEE J. Solid-State Circuits, vol. 45, pp. 909–919, Apr. 2010.

[102] L. Henrickson et al., “Low power fully integrated 10-Gb/s SONET/SDH transceiver in 0.13-µm
CMOS,” IEEE J. Solid-State Circuits, vol. 38, pp. 1595–1601, Oct. 2003.

[103] B. Raghavan et al., “A sub-2 W 39.8-44.6 Gb/s transmitter and receiver chipset with SFI-5.2
interface in 40 nm CMOS,” IEEE J. Solid-State Circuits, vol. 48, pp. 3219–3228, Dec. 2013.

[104] X. Zheng, C. Zhang, and S. Yuan et al., “An improved 40 Gb/s CDR with jitter-suppression filters
and phase-compensating interpolators,” in Proc. IEEE Asian Solid-State Circuits Conf. (ASSCC),
pp. 85–88, Nov. 2016.

[105] J. W. Bergmans, Digital Baseband Transmission and Recording, ch. 8, pp. 400–412. Springer
Science & Business Media, 1996.

[106] F.-T. Chen et al., “A 10-Gb/s low jitter single-loop clock and data recovery circuit with rotational
phase frequency detector,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, pp. 3278–3287,
Nov. 2014.

[107] N. Kocaman et al., “An 8.5-11.5-Gbps SONET transceiver with referenceless frequency
acquisition,” IEEE J. Solid-State Circuits, vol. 48, pp. 1875–1884, Aug. 2013.

[108] M. S. Jalali et al., “A reference-less single-loop half-rate binary CDR,” IEEE J. Solid-State Cir-
cuits, vol. 50, pp. 2037–2047, Sep. 2015.

[109] A. Pottbacker et al., “A Si bipolar phase and frequency detector IC for clock extraction up to 8
Gb/s,” IEEE J. Solid-State Circuits, vol. 27, pp. 1747–1751, Dec. 1992.

[110] M.-T. Hsieh and G. E. Sobelman, “Architectures for multi-gigabit wire-linked clock and data
recovery,” IEEE Circuits and Systems Magazine, vol. 8, no. 4, pp. 45–57, 2008.

[111] B. Razavi, “Challenges in the design of high-speed clock and data recovery circuits,” IEEE Com-
munications Magazine, pp. 94–101, Aug. 2002.

[112] M. H. Perrott et al., “A 2.5-Gb/s multi-rate 0.25-µm CMOS clock and data recovery circuit
utilizing a hybrid analog/digital loop filter and all-digital referenceless frequency acquisition,”
IEEE J. Solid-State Circuits, vol. 41, pp. 2930–2944, Dec. 2006.

[113] J. C. Scheytt et al., “A 0.155-, 0.622-, and 2.488-Gb/s automatic bit-rate selecting clock and
data recovery IC for bit-rate transparent SDH systems,” IEEE J. Solid-State Circuits, vol. 34,
pp. 1935–1943, Dec. 1999.

[114] H. S. Muthali et al., “A CMOS 10-Gb/s SONET transceiver,” IEEE J. Solid-State Circuits,
vol. 39, pp. 1026–1033, Jul. 2004.

[115] M. Y. He and J. Poulton, “A CMOS mixed-signal clock and data recovery circuit for OIF CEI-
6G+ backplane transceiver,” IEEE J. Solid-State Circuits, vol. 41, pp. 597–606, Mar. 2006.

[116] H.-H. Chang et al., “Low jitter and multirate clock and data recovery circuit using a MSADLL for
chip-to-chip interconnection,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, pp. 2356–2364,
Dec. 2004.

[117] B. Razavi, Monolithic Phase-Locked Loops and Clock Recovery Circuits: Theory and Design.
Wiley-IEEE Press, 1996.


[118] Y. Sun and H. Wang, “Analysis of digital bang-bang clock and data recovery for multi-gigabits
serial transceivers,” in Proc. IEEE Custom Integrated Circuits Conf., pp. 13–16, Sep. 2009.

[119] S. Tertinek et al., “Binary phase detector gain in bang-bang phase-locked loops with DCO jitter,”
IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, pp. 941–945, Dec. 2010.

[120] N. D. Dalt, “Markov chains-based derivation of the phase detector gain in bang-bang PLLs,”
IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, pp. 1195–1199, Nov. 2006.

[121] J. Kim et al., “Simulation and analysis of random decision errors in clocked comparators,” IEEE
Trans. Circuits Syst. I, Reg. Papers, vol. 56, pp. 1844–1857, Aug. 2009.

[122] G. R. Gangasani et al., “A 16-Gb/s backplane transceiver with 12-tap current integrating DFE
and dynamic adaptation of voltage offset and timing drifts in 45-nm SOI CMOS technology,”
IEEE J. Solid-State Circuits, vol. 47, pp. 1828–1841, Aug. 2012.

[123] T. Musah et al., “A 4-32 Gb/s bidirectional link with 3-tap FFE/6-tap DFE and collaborative CDR
in 22 nm CMOS,” IEEE J. Solid-State Circuits, vol. 49, pp. 3079–3090, Dec. 2014.

[124] P. K. Hanumolu et al., “Equalizer for high-speed links,” International Journal of High Speed
Electronics and Systems, vol. 15, pp. 429–458, Jul. 2005.

[125] J. F. Bulzacchelli et al., “A 28-Gb/s 4-tap FFE/15-tap DFE serial link transceiver in 32-nm SOI
CMOS technology,” IEEE J. Solid-State Circuits, vol. 47, pp. 3232–3248, Dec. 2012.

[126] M. Altmann and F. Spagna, Adaptive Tx Equalization. IEEE 802.3ap, Nov. 2004.

[127] S. S. Mohan et al., “Bandwidth extension in CMOS with optimized on-chip inductors,” IEEE J.
Solid-State Circuits, vol. 35, pp. 346–355, Mar. 2000.

[128] S. Ibrahim and B. Razavi, “Low-power CMOS equalizer design for 20-Gb/s systems,” IEEE J.
Solid-State Circuits, vol. 46, pp. 1321–1336, Jun. 2011.

[129] C. Thakkar et al., “A 10 Gb/s 45 mW adaptive 60 GHz baseband in 65 nm CMOS,” IEEE J.
Solid-State Circuits, vol. 47, pp. 952–968, Apr. 2012.

[130] J. Jaussi et al., “A 205mW 32Gb/s 3-tap FFE/6-tap DFE bidirectional serial link in 22nm CMOS,”
in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 440–441, Feb. 2014.

[131] M. Pozzoni et al., “A multi-standard 1.5 to 10 Gb/s latch-based 3-tap DFE receiver with a SSC
tolerant CDR for serial backplane communication,” IEEE J. Solid-State Circuits, vol. 44,
pp. 1306–1315, Apr. 2009.

[132] H. Higashi et al., “A 5-6.4-Gb/s 12-channel transceiver with pre-emphasis and equalization,”
IEEE J. Solid-State Circuits, vol. 40, pp. 978–985, Apr. 2005.

[133] K. Krishna et al., “A multigigabit backplane transceiver core in 0.13-µm CMOS with a power-
efficient equalization architecture,” IEEE J. Solid-State Circuits, vol. 40, pp. 2658–2666, Dec.
2005.

[134] J. Lee, “A 20-Gb/s adaptive equalizer in 0.13-µm CMOS technology,” IEEE J. Solid-State Cir-
cuits, vol. 41, pp. 2058–2066, Sep. 2006.

[135] Wong and Lok, “Theory of digital communications: Chapter 4 intersymbol interference and
equalization.” http://wireless.ece.ufl.edu/twong/Notes/Comm/ch4.pdf. [Online]. Accessed
16-Sep.-2017.

[136] Schober, “Signal detection and estimation: Equalization of channels with ISI.”
http://courses.ece.ubc.ca/564/chapter6.pdf. [Online]. Accessed 16-Sep.-2017.

[137] Communication Capstone Design, “Channel equalization.” Electrical Engineering,
http://courses.washington.edu/ee417/handouts/handout2.pdf. [Online]. Accessed 16-Sep.-2017.


[138] J. Savoj et al., “A low-power 0.5-6.6 Gb/s wireline transceiver embedded in low-cost 28 nm
FPGAs,” IEEE J. Solid-State Circuits, vol. 48, pp. 2582–2594, Nov. 2013.

[139] J. Savoj et al., “A wide common-mode fully-adaptive multi-standard 12.5 Gb/s backplane
transceiver in 28 nm CMOS,” in Proc. IEEE Symp. VLSI Circ. Dig. Tech. Papers, pp. 104–105,
Jun. 2012.

[140] B. Analui et al., “A 10Gb/s eye-opening monitor in 0.13 µm CMOS.”
http://researcher.watson.ibm.com/researcher/files/us-sasha/10Gbps_Eye_Monitor_18_3_Behnam_Analui_ISSCC.pdf.
[Online]. Accessed 26-Feb.-2017.

[141] B. Analui et al., “A 10-Gb/s two-dimensional eye-opening monitor in 0.13-µm standard CMOS,”
IEEE J. Solid-State Circuits, vol. 40, pp. 2689–2699, Dec. 2005.

[142] J.-S. Choi et al., “A 0.18-µm CMOS 3.5-Gb/s continuous-time adaptive cable equalizer using
enhanced low-frequency gain control method,” IEEE J. Solid-State Circuits, vol. 39, pp. 419–
425, Mar. 2004.

[143] S. Gondi et al., “A 10Gb/s CMOS adaptive equalizer for backplane applications,” in Proc. IEEE
Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 328–329, Feb. 2005.

[144] P. C. Maulik and D. A. Mercer, “A DLL-based programmable clock multiplier in 0.18-µm CMOS
with 70 dBc reference spur,” IEEE J. Solid-State Circuits, vol. 42, pp. 1642–1648, Aug. 2007.

[145] Y.-C. Huang and S.-I. Liu, “A 2.4-GHz subharmonically injection-locked PLL with self-
calibrated injection timing,” IEEE J. Solid-State Circuits, vol. 48, pp. 417–428, Feb. 2013.

[146] I.-T. Lee et al., “A divider-less sub-harmonically injection-locked PLL with self-adjusted injec-
tion timing,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 414–415, Feb. 2013.

[147] H. M. Cheema et al., 60-GHz CMOS Phase-Locked Loops. Springer Science & Business Media,
2010.

[148] V. Manassewitsch, ed., Frequency Synthesizers. New York: Wiley, 1987.

[149] L. Zhang et al., “Injection-locked clocking: A low-power clock distribution scheme for high-
performance microprocessors,” IEEE Trans. VLSI Syst., vol. 16, pp. 1251–1256, Sep. 2008.

[150] J. Lee and M. Liu, “A 20-Gb/s burst-mode clock and data recovery circuit using injection-locking
technique,” IEEE J. Solid-State Circuits, vol. 43, pp. 619–630, Mar. 2008.

[151] A. Musa et al., “A compact, low-power and low-jitter dual-loop injection locked PLL using all-
digital PVT calibration,” IEEE J. Solid-State Circuits, vol. 49, pp. 50–60, Jan. 2014.

[152] P. Park, J. Park, H. Park, and S. Cho, “An all-digital clock generator using a fractionally injection-
locked oscillator in 65nm CMOS,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers,
pp. 336–337, Feb. 2012.

[153] C.-F. Liang and K.-J. Hsiao, “An injection-locked ring PLL with self-aligned injection window,”
in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 90–91, Feb. 2011.

[154] D. Dunwell and A. C. Carusone, “Modeling oscillator injection locking using the phase domain
response,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, pp. 2823–2833, Nov. 2013.

[155] Y.-H. Kwak et al., “A 20 Gb/s clock and data recovery with a ping-pong delay line for unlimited
phase shifting in 65 nm CMOS process,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60,
pp. 303–313, Feb. 2013.

[156] E. Alon et al., “Replica compensated linear regulators for supply-regulated phase-locked loops,”
IEEE J. Solid-State Circuits, vol. 41, pp. 413–424, Feb. 2006.

[157] L. Kull et al., “Implementation of low-power 6-8 b 30-90 GS/s time-interleaved ADCs with
optimized input bandwidth in 32 nm CMOS,” IEEE J. Solid-State Circuits, vol. 51, pp. 636–648,
Mar. 2016.


[158] A. Elkholy et al., “A 6.75-to-8.25GHz 2.25mW 190fsrms integrated-jitter PVT-insensitive
injection-locked clock multiplier using all-digital continuous frequency-tracking loop in 65nm
CMOS,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 188–189, Feb. 2015.

[159] P. Chiang et al., “A 20-Gb/s 0.13-µm CMOS serial link transmitter using an LC-PLL to directly
drive the output multiplexer,” IEEE J. Solid-State Circuits, vol. 40, pp. 1004–1011, Apr. 2005.

[160] INCITS, Fiber Channel Physical Interface-6, Oct. 2013.

[161] M. Hossain et al., “A 4x40 Gb/s quad-lane CDR with shared frequency tracking and data depen-
dent jitter filtering,” in IEEE Symp. VLSI Circuits Dig. Tech. Papers, pp. 1–2, Jun. 2014.

[162] T. O. Dickson, J. F. Bulzacchelli, and D. J. Friedman, “A 12-Gb/s 11-mW half-rate sampled 5-tap
decision feedback equalizer with current-integrating summers in 45-nm SOI CMOS technology,”
IEEE J. Solid-State Circuits, vol. 44, pp. 1298–1305, Apr. 2009.

[163] D. Vijayaraghavan et al., “Highly configurable FPGA-integrated PCI Express 3.0 digital IP ar-
chitecture,” in DesignCon, pp. 1274–1288, Jan.-Feb. 2011.

[164] H. Kimura et al., “A 28 Gb/s 560 mW multi-standard SerDes with single-stage analog front-end
and 14-tap decision feedback equalizer in 28 nm CMOS,” IEEE J. Solid-State Circuits, vol. 49,
pp. 3091–3103, Dec. 2014.

[165] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012.

Appendices

Appendix A Modeling of the Injection-Locked Oscillator (ILO)

A.1 Behavior Model of the ILO

The discussions in [89, 154] show that it is reasonable to assume that an injection event
shifts the ILO output phase instantaneously and that the phase shift is linear with respect
to the instantaneous phase difference relative to the injection signal. These assumptions
are also supported by circuit simulations and measurement results. Each injection phase
shift can therefore be modeled as an additional phase step applied to the oscillator
output. Fig. A1(a) shows the ILO waveform within the nth injection period. The total phase
of 2Nπ is divided into two portions, ϕosc(n) and ϕinj(n), which are contributed by the
self-oscillation of the oscillator and by the injection pulling, respectively. Under the
locking condition, ϕosc(n) and ϕinj(n) should satisfy

\[
\phi_{\mathrm{osc}}(n) + \phi_{\mathrm{inj}}(n) = 2N\pi, \tag{A1}
\]

where N is the factor of harmonic injection. The phase accumulation produced by the
oscillator self-oscillation can be calculated by,

\[
\phi_{\mathrm{osc}}(n) = \frac{2N\pi\,\omega_0}{\omega_{\mathrm{lock}}}, \tag{A2}
\]

where ω0 stands for the free-running frequency of the oscillator and ωlock represents
the target frequency of the ILO when it is locked to the injection signal. Accordingly,
the phase shift contributed by the injection pulling should be

\[
\phi_{\mathrm{inj}}(n) = \frac{2N\pi\,(\omega_{\mathrm{lock}} - \omega_0)}{\omega_{\mathrm{lock}}}. \tag{A3}
\]

Figure A1: Phase accumulation behavior of the ILO. (a) Output waveform of the ILO
in one injection period, (b) flow-chart diagram of the phase accumulation, and (c)
intuitive diagram of the phase accumulation.

Considering the fact that the ILO output phase can be calculated by summing all the
discrete phases in different injection periods, the ILO can be modeled as a discrete
phase integrator with an updating period of Tinj . The phase accumulation behavior
of the ILO is described in Fig. A1(b), which can be transformed into Fig. A1(c) to
give a more instructive view. Corresponding to the phase contribution in each injection
period, the total output phase θout (n) of the ILO is also divided into θosc (n) and θinj (n),
where the former denotes the phase accumulated by the oscillator self-oscillation
and the latter represents the sum of the phase shifts produced by the injection
events.

Figure A2: Model of the ILO. (a) Signal flow chart and (b) linear model.

A.2 Linear Model of the ILO

According to the discussion in [154], the phase shift ϕinj is a function of the instan-
taneous phase difference between the injection reference signal and oscillator output,
φss = θref − θout /N , which is usually defined as ϕinj = P (φss ). Here, the phase
difference is defined as the horizontal coordinate of the oscillator output crossing point
located inside the injection pulse, measured relative to the center of the injection pulse.
Embedding this phase shift function into Fig. A1(c), the complete signal flow chart of the
ILO can be obtained [see Fig. A2(a)]. Since the integration of ϕosc (n) actually
equals the output phase of the free-running oscillator, we replace the right chart
surrounded by the dashed line with θosc . When the ILO reaches steady state with
a relative phase difference of φss , the ILO can be modeled as a linear phase transfer
system for small signal analysis. The phase shift ϕinj can be treated as a linear function
with respect to the phase difference φss by a factor of β, which can be approximated
by the instantaneous slope of P (φss ) at φss,lock ,

$$ \beta = \left.\frac{dP(\varphi_{ss})}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}. \tag{A4} $$


Then the linear model of the ILO is constructed as shown in Fig. A2(b). To explore
the phase transfer characteristics, the closed-loop characteristic equation is formulated
as,

$$ \left[\theta_{ref}(s) - \frac{\theta_{out}(s)}{N}\right] \cdot \beta \cdot \frac{1}{1 - z^{-1}} + \frac{\omega_{osc}}{s} = \theta_{out}(s), \tag{A5} $$

where ωosc is the angular frequency of the oscillator. According to digital signal
processing theory, the discrete transfer function 1/(1 − z −1 ) can be approximated by
the continuous transfer function 1/(sTinj ), since z −1 = e−sTinj ≈ 1 − sTinj for
|sTinj | ≪ 1, where Tinj is the period of the injection signal (i.e., the sampling period).
Substituting this approximation into Eq. (A5) and
rearranging it, we get the ILO closed-loop transfer function,

$$ \theta_{out}(s) = \theta_{ref}(s) \cdot \frac{N\beta}{sNT_{inj} + \beta} + \frac{\omega_{osc}}{s} \cdot \frac{sNT_{inj}}{sNT_{inj} + \beta}. \tag{A6} $$

From Eq. (A6), the phase transfer function of the input reference is,

$$ H_{ref}(s) = \frac{N\beta}{sNT_{inj} + \beta} = \frac{N}{1 + \dfrac{s}{\beta/(NT_{inj})}} = \frac{N}{1 + \dfrac{s}{\omega_{TB}}} = N H_{inj}(s), \tag{A7} $$
where Hinj (s) is the normalized Href (s). Obviously, the phase transfer function
Href (s) is actually a first-order LPF with a left-plane pole located at ωT B = β/(N Tinj )
and its DC-gain is 20log(N ) dB. Hence the ILO shows a low-frequency noise tracking
ability of the input reference.
Reviewing Eq. (A6), the phase transfer function of the oscillator can be written as,

$$ H_{osc}(s) = \frac{sNT_{inj}}{sNT_{inj} + \beta} = 1 - H_{inj}(s). \tag{A8} $$

Clearly, Hosc (s) is a first-order HPF with the same pole as Href (s), and its high-
frequency gain is 0 dB. Thereby, the ILO suppresses the low-frequency noise of the
oscillator with a slope of 20 dB/dec.
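The low-pass and high-pass behavior described by Eqs. (A7) and (A8) can be checked numerically. The sketch below is only an illustration: the helper names and the values N = 4, β = 0.2, and Tinj = 0.4 ns are assumptions, not parameters of the fabricated ILO.

```python
import math

def h_ref(freq_hz, n, beta, t_inj):
    """Reference-to-output phase transfer function, Eq. (A7)."""
    s = 2j * math.pi * freq_hz
    return n * beta / (s * n * t_inj + beta)

def h_osc(freq_hz, n, beta, t_inj):
    """Oscillator-to-output phase transfer function, Eq. (A8)."""
    s = 2j * math.pi * freq_hz
    return s * n * t_inj / (s * n * t_inj + beta)

def db(value):
    """Magnitude of a (complex) gain in decibels."""
    return 20.0 * math.log10(abs(value))

# Assumed example values (hypothetical, for illustration only).
N, BETA, T_INJ = 4, 0.2, 0.4e-9
F_TB = BETA / (N * T_INJ) / (2.0 * math.pi)  # pole frequency in Hz

dc_gain_db = db(h_ref(0.0, N, BETA, T_INJ))   # expected 20*log10(N)
corner_db = db(h_ref(F_TB, N, BETA, T_INJ))   # about 3 dB below the DC gain
# Change in |H_osc| over one decade well below the pole: about 20 dB.
slope_db = db(h_osc(F_TB / 100, N, BETA, T_INJ)) - db(h_osc(F_TB / 1000, N, BETA, T_INJ))
# H_osc and the normalized H_inj = H_ref / N should sum to 1 at any frequency.
residual = abs(h_osc(F_TB, N, BETA, T_INJ) + h_ref(F_TB, N, BETA, T_INJ) / N - 1.0)

print(F_TB, dc_gain_db, corner_db, slope_db, residual)
```

Both functions share the pole at ωT B = β/(N Tinj ), so a larger β or a shorter injection period widens the band over which reference noise is tracked and oscillator noise is suppressed.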


A.3 Tracking Bandwidth of the ILO

According to the discussion in [154], the relative phase difference will settle to a
steady state, φss , where each injection event causes a phase shift P (φss ) that is just
sufficient to cancel the phase drift resulting from the frequency offset. This condition
can be expressed by,
$$ P(\varphi_{ss}) = \frac{2N\pi\,(f_{lock} - f_0)}{f_{lock}}, \tag{A9} $$

where N is the multiplication factor, flock denotes the locked frequency, and f0 rep-
resents the free-running frequency of the oscillator. For a different frequency offset,
there exists a different steady state φss . Assume there is a small phase perturbation
∆θinj in the injection signal, then the output phase perturbation ∆θout can be predicted
by β∆θinj , where β is the instantaneous slope of P (φss ). It can be obtained by
taking the derivative of Eq. (A9), resulting in,

$$ \beta = \left.\frac{dP(\varphi_{ss})}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}. \tag{A10} $$

Note that small perturbations in the injection signal tend to cause an instanta-
neous output frequency change; hence the output frequency flock can be considered
as the intermediate variable of P (φss ). Substituting Eq. (A9) into Eq. (A10) and
simplifying it using flock ≈ f0 , we can get,

$$ \beta = \frac{2N\pi}{f_{lock}} \cdot \left.\frac{df_{lock}}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}. \tag{A11} $$
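The intermediate algebra between Eq. (A9) and Eq. (A11) can be spelled out as follows. This is a sketch of the implied step, assuming f0 is held constant and flock is treated as a function of φss; it is not quoted from the cited references.

```latex
\begin{aligned}
P(\varphi_{ss}) &= 2N\pi\left(1 - \frac{f_0}{f_{lock}}\right), \\
\frac{dP(\varphi_{ss})}{d\varphi_{ss}}
  &= 2N\pi\,\frac{f_0}{f_{lock}^{2}}\cdot\frac{df_{lock}}{d\varphi_{ss}}
  \;\approx\; \frac{2N\pi}{f_{lock}}\cdot\frac{df_{lock}}{d\varphi_{ss}}
  \qquad (f_{lock}\approx f_0).
\end{aligned}
```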

Substituting Eq. (A11) into ωT B = β/(N Tinj ) and combining it with ωT B = 2πfT B
and flock = N/Tinj , we can get the tracking bandwidth,

$$ f_{TB} = \frac{1}{N} \cdot \left.\frac{df_{lock}}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}. \tag{A12} $$

The tracking bandwidth can also be obtained by the intuitive transient analysis.
Based on the deduced slope β of the phase shift P (φss ) with respect to φss , the output
phase perturbation can be written as,

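$$ \Delta\theta_{out} = \frac{2N\pi}{f_{lock}} \cdot \left.\frac{df_{lock}}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}} \cdot \Delta\theta_{inj}. \tag{A13} $$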

Assume that the first-order phase transfer function of the IL-RVCO is

$$ H_{inj}(s) = \frac{N}{1 + \dfrac{s}{\omega_{TB}}}, \tag{A14} $$

where N is the harmonic factor of the IL-RVCO and ωT B is the angular frequency of
the tracking bandwidth. Then its transient response to a small step input ∆θinj should
be

$$ \Delta\theta_{out} = N \Delta\theta_{inj} \left(1 - e^{-\omega_{TB} t}\right). \tag{A15} $$

Over one injection period, ωT B Tinj can be considered much smaller than 1, so
$1 - e^{-\omega_{TB} T_{inj}}$ can be approximated by ωT B Tinj . Correspondingly, Eq. (A15)
evaluated at t = Tinj can be simplified as

∆θout = N ωT B Tinj · ∆θinj . (A16)

Comparing Eq. (A13) with Eq. (A16), we get

$$ \frac{2N\pi}{f_{lock}} \cdot \left.\frac{df_{lock}}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}} = N \omega_{TB} T_{inj}. \tag{A17} $$

Rearranging Eq. (A17), we obtain

$$ \omega_{TB} = \frac{2\pi}{f_{lock} T_{inj}} \cdot \left.\frac{df_{lock}}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}. \tag{A18} $$

Considering ωT B = 2πfT B and flock = N/Tinj , Eq. (A18) can be simplified as

$$ f_{TB} = \frac{1}{N} \cdot \left.\frac{df_{lock}}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}, \tag{A19} $$

which is the same as Eq. (A12).
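The transient argument above can be illustrated by iterating the discrete phase integrator directly and comparing it with the first-order response of Eq. (A15). The sketch below assumes the linearized update of one injection event, θout ← θout + β(∆θinj − θout /N ); the function name and the values N = 4 and β = 0.05 are hypothetical.

```python
import math

def ilo_step_response(n_harmonic, beta, steps, d_theta_inj=1.0):
    """Iterate the linearized discrete phase integrator for an input phase step."""
    theta_out = 0.0
    history = []
    for _ in range(steps):
        phi_ss = d_theta_inj - theta_out / n_harmonic  # phase difference seen by the injection
        theta_out += beta * phi_ss                     # each injection adds beta * phi_ss
        history.append(theta_out)
    return history

# Assumed example values (hypothetical, for illustration only).
N, BETA, STEPS = 4, 0.05, 400
resp = ilo_step_response(N, BETA, STEPS)

# First-order model of Eq. (A15), sampled at t = (k + 1) * T_inj,
# using omega_TB * T_inj = beta / N.
omega_tb_t_inj = BETA / N
model = [N * (1.0 - math.exp(-omega_tb_t_inj * (k + 1))) for k in range(STEPS)]

worst = max(abs(a - b) for a, b in zip(resp, model))
print(resp[0], resp[-1], worst)
```

After the first injection the output moves by β∆θinj , which matches Eqs. (A13) and (A16), and the response then settles towards N ∆θinj . The two curves stay close as long as β/N is much smaller than 1, which is the same condition used to linearize the exponential.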


Appendix B Convergence Proof of the Proposed EDC-SZF Iteration

The iterative equation in Section 6.2.2 can be rewritten as,

$$ x^{k+1} = (I - \lambda B)\,x^{k} + f, \tag{A20} $$

where $x^k \in \mathbb{R}^{\ell}$ with $\ell \in \mathbb{N}^{+}$, $\lambda \in (0, 1)$, I denotes the identity matrix, and $f \in \mathbb{R}^{\ell}$ is
a fixed constant vector. It is well known that when discussing the convergence issue,
the norm (distance) defined on $\mathbb{R}^{\ell}$ should be specified to make it a normed space. With
respect to the properties of matrix B, we choose the 1-norm and denote the normed
space as $(\mathbb{R}^{\ell}, \|\cdot\|_1)$. For a vector $x \in \mathbb{R}^{\ell}$, the 1-norm could be defined as,

$$ \|x\|_1 = \sum_{i=1}^{\ell} |x_i|. \tag{A21} $$

In addition, the 1-norm for matrix $A = (a_{ij})_{i,j=1,2,\cdots,\ell}$ has the following two equivalent
definitions [165],

$$ \|A\|_1 = \sup_{x \in \mathbb{R}^{\ell}} \frac{\|Ax\|_1}{\|x\|_1}, \tag{A22} $$

$$ \|A\|_1 = \max_{1 \le j \le \ell} \sum_{i=1}^{\ell} |a_{ij}|. \tag{A23} $$
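The agreement between the two definitions can be spot-checked numerically. In the sketch below, the 3 × 3 matrix and the trial vectors are hypothetical; the maximum absolute column sum of Eq. (A23) is compared with the ratio in Eq. (A22) evaluated on a few nonzero vectors, including the unit vectors where the supremum is attained.

```python
def vec_norm1(x):
    """Vector 1-norm, Eq. (A21)."""
    return sum(abs(v) for v in x)

def mat_vec(a, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def mat_norm1(a):
    """Matrix 1-norm as the maximum absolute column sum, Eq. (A23)."""
    return max(sum(abs(row[j]) for row in a) for j in range(len(a[0])))

# Hypothetical example matrix (for illustration only).
A = [[1.0, -2.0, 0.5],
     [0.0, 3.0, -1.0],
     [-4.0, 1.0, 2.0]]

# Trial vectors: the three unit vectors plus two arbitrary nonzero vectors.
trials = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
          [1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]
ratios = [vec_norm1(mat_vec(A, x)) / vec_norm1(x) for x in trials]

print(mat_norm1(A), max(ratios))
```

No trial ratio exceeds the column-sum value, and the unit vector that selects the heaviest column reaches it exactly.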

Because $(\mathbb{R}^{\ell}, \|\cdot\|_1)$ is a complete space, the convergence of $\{x^k\}_{k=1}^{\infty}$ is equivalent to
the sequence $\{x^k\}_{k=1}^{\infty}$ being a Cauchy sequence in $(\mathbb{R}^{\ell}, \|\cdot\|_1)$.

For convenience, we denote $T := I - \lambda B$. In the following, we prove that $\|T\|_1 < 1$
is a sufficient condition to ensure the sequence $\{x^k\}_{k=1}^{\infty}$ is a Cauchy sequence. Let
$n, m \in \mathbb{N}^{+}$, without loss of generality, assuming $n > m$, we then obtain,

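$$
\begin{aligned}
\|x^n - x^m\|_1 &\le \|x^n - x^{n-1}\|_1 + \cdots + \|x^{m+1} - x^m\|_1 \\
&\le \|T\|_1^{\,n-1}\,\|x^1 - x^0\|_1 + \cdots + \|T\|_1^{\,m}\,\|x^1 - x^0\|_1 \\
&\le \|T\|_1^{\,m} \sum_{k=0}^{\infty} \|T\|_1^{\,k}\,\|x^1 - x^0\|_1 \\
&\le \|T\|_1^{\,m} \left(\frac{1}{1 - \|T\|_1}\right) \|x^1 - x^0\|_1,
\end{aligned}
\tag{A24}
$$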

where Eq. (A22) and a simple iteration are used for deducing the second inequality.
When the condition $\|T\|_1 < 1$ is satisfied, we have $\lim_{m \to \infty} \|T\|_1^{\,m} = 0$. Hence, for
all $\epsilon > 0$, there exists a constant $M > 0$ such that for any $n, m \ge M$, the following
inequality holds,

$$ \|x^n - x^m\|_1 \le \epsilon. \tag{A25} $$

This means $\{x^k\}_{k=1}^{\infty}$ is a Cauchy sequence. Therefore,

$$ \|I - \lambda B\|_1 < 1 $$

is a sufficient condition to make the iterative Eq. (A20) convergent.
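The sufficient condition can be exercised on a small example. The sketch below is only an illustration: the 2 × 2 matrix B, the vector f, and λ = 0.5 are hypothetical and are not taken from Section 6.2.2. It checks that ‖I − λB‖1 < 1, runs the iteration of Eq. (A20), and compares one pair of iterates against the bound in Eq. (A24).

```python
def vec_norm1(x):
    """Vector 1-norm, Eq. (A21)."""
    return sum(abs(v) for v in x)

def mat_norm1(a):
    """Matrix 1-norm as the maximum absolute column sum, Eq. (A23)."""
    return max(sum(abs(row[j]) for row in a) for j in range(len(a[0])))

def step(t, f, x):
    """One update x_next = T x + f, i.e. Eq. (A20) with T = I - lambda * B."""
    return [sum(t_ij * x_j for t_ij, x_j in zip(row, x)) + f_i
            for row, f_i in zip(t, f)]

# Hypothetical example data (assumed for illustration only).
B = [[1.0, 0.2],
     [0.1, 0.8]]
F = [1.0, 2.0]
LAM = 0.5

T = [[(1.0 if i == j else 0.0) - LAM * B[i][j] for j in range(2)] for i in range(2)]
t_norm = mat_norm1(T)  # must be below 1 for the proof to apply

xs = [[0.0, 0.0]]
for _ in range(200):
    xs.append(step(T, F, xs[-1]))

# Cauchy bound of Eq. (A24) for one pair of indices n > m.
m, n = 5, 20
lhs = vec_norm1([a - b for a, b in zip(xs[n], xs[m])])
rhs = t_norm ** m / (1.0 - t_norm) * vec_norm1([a - b for a, b in zip(xs[1], xs[0])])

print(t_norm, xs[-1], lhs, rhs)
```

The limit of the iteration satisfies λBx = f, so the final iterate can be cross-checked by solving that linear system directly. The condition is sufficient rather than necessary, so a matrix with ‖I − λB‖1 ≥ 1 may still converge.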
