Design of High-Speed SerDes Transceiver For Chip-To-Chip Communications in CMOS Process
Xuqiang Zheng
Supervisor: Professor Shigang Yue
May 2018
Abstract
up aid and preventing the potential lock-loss risk. The experimental results show that
the figure-of-merit of the designed RILCM reaches -247.3 dB, which is better than
previous RILCMs and even comparable to the large-area LC-ILCMs.
The transmitter (TX) and receiver (RX) chips are separately designed and fabricated
in a 65-nm CMOS process. The transmitter chip employs a quarter-rate
multi-multiplexer (MUX)-based 4-tap feed-forward equalizer (FFE) to pre-distort the output.
To increase the maximum operating speed, a bandwidth-enhanced 4:1 MUX with the
capability of eliminating charge-sharing effect is proposed. To produce the quarter-rate
parallel data streams with appropriate delays, a compact latch array associated with an
interleaved-retiming technique is designed. The receiver chip employs a two-stage
continuous-time linear equalizer (CTLE) as the analog front-end and integrates an
improved clock data recovery (CDR) circuit to extract the sampling clocks and retime the incoming
data. To automatically balance the jitter tracking and jitter suppression, passive low-
pass filters with adaptively-adjusted bandwidth are introduced into the data-sampling
path. To optimize the linearity of the phase interpolation, a time-averaging-based com-
pensating phase interpolator is proposed. For equalization, a combined TX-FFE and
RX-CTLE is applied to compensate for the channel loss, where a low-cost edge-data
correlation-based sign zero-forcing adaptation algorithm is proposed to automatically
adjust the TX-FFE’s tap weights. Measurement results show that the fabricated
transmitter/receiver chipset can deliver 40 Gb/s random data at a bit error rate of < $10^{-12}$
over a channel with >16 dB loss at the half-baud frequency, while consuming a total
power of 370 mW.
Declaration
I, Xuqiang Zheng, declare that this thesis describes an original study carried out on
my own. It has not been previously submitted to any university for the award of any
degree. Where I have quoted from the work of others, the source is always given.
Acknowledgements
First and foremost, I would like to thank my academic advisor, Professor Shigang
Yue, for his tolerance and patience in letting me explore the fields that interest me. He
encouraged me to think deeply and creatively. He also taught me how to effectively
communicate my research in papers and presentations. I hope through the years I have
been able to pick up a little of his ability to find and explain ideas and concepts with
such clarity. He will always be a role model to me in my future academic career.
Professor Chun Zhang is my co-advisor, and I am grateful to him for his help and
support when I was on secondment to Tsinghua University. I especially value his
trust in giving me plenty of tapeout chances, regardless of consequences for him. I
have learned a lot from him on how to communicate with people and how to address
troublesome issues. I also want to thank my second co-advisor, Dr. Tryphon Lambrou,
for his nice advice and kind discussions.
I would like to take this chance to thank my family for their selfless love and constant
support. I especially thank my parents-in-law, who strongly supported my decision to
start my Ph.D. study and generously helped me during my study. I am grateful to
my wife, who supported the whole home when I was studying abroad. I also want to
say sorry to my son for my absence while I was studying abroad.
The environment at the University of Lincoln is full of brilliant and enthusiastic
colleagues who have provided me with valuable help and discussions. I wish to thank the
previous and present members of the Lincoln Centre for Autonomous Systems (L-CAS)
research group, who have brought me great convenience in daily life and academic
research. In particular, I want to thank Farshad Arvin, Yi Gao, Junxiong Jia, Feng
Zhao, Tuo Xie, Mingzhu Long, Yan Yan, Guopeng Zhang, Cheng Hu, Qinbing Fu,
Jingmin Huang, Biao Zhao, Xuelong Sun, Jiannan Zhao, Huatian Wang, and Tian Liu
for their selfless help and creative discussions.
I wish to thank Dr. Fangxu Lv for our joint work on parts of the project and for always being
ready to carry out the necessary chip measurements. I also would like to thank Prof. Fule
Li for his constructive advice on circuit designs. I thank him most for being patient
with me at the very beginning and using his vision to open the door of the integrated
circuit design to me.
Finally, I appreciate the financial support from the School of Computer Science at
the University of Lincoln, the EU FP7 projects EYE2E (269118) and LIVCODE (295151),
and the EU Horizon 2020 project STEP2DYNA (691154).
List of Main Publications
[2] X. Zheng, Z. Wang, and F. Li et al., “A 14-bit 250 MS/s IF sampling pipelined
ADC in 180 nm CMOS process,” IEEE Trans. Circuits Syst. I, Reg. Papers
(TCAS-I), vol. 63, no. 9, pp. 1381–1392, Sep. 2016.
[4] X. Zheng, C. Zhang, and S. Yuan et al., “An improved 40 Gb/s CDR with jitter-
suppression filters and phase-compensating interpolators,” in Proc. IEEE Asian
Solid-State Circuits Conf. (ASSCC), Nov. 2016, pp. 85–88.
[5] X. Zheng, C. Zhang, and F. Lv et al., “A 5-50 Gb/s quarter rate transmitter with
a 4-tap multiple-MUX based FFE in 65 nm CMOS,” in Proc. IEEE European
Solid-State Circuits Conf. (ESSCIRC), Sep. 2016, pp. 305–308.
List of Figures
1.1 Diagram of the global data traffic trend [1]. By 2020, 50 billion devices
will be connected, generating more than two zettabytes of data traffic
annually.
1.2 Wired network roadmap [2]. The data rates in SFP+, QSFP, and CFP
are updating towards 100 Gb/s, 400 Gb/s, and 1 Tb/s, respectively.
2.15 Clocked comparators. (a) CML-type latch-based comparator, (b) StrongArm
latch-based comparator, (c) latch sensitivity function comparison
[6], (d) latch transfer function comparison [6], and (e) energy
consumption comparison [7].
2.16 PI structures and implementations. (a) Structure with direct multiple-input
phases [8, 9], (b) structure with coarse phase selection followed
by a phase mixer [10, 11], (c) inverter-based implementation [12, 13],
and (d) CML-based implementation [14, 15].
2.17 (a) Phase constellation for quadrature PI, (b) phase constellation for
octagonal PI, (c) interpolated phase steps for quadrature PI in one
quadrant, and (d) interpolated phase steps for octagonal PI in one octant.
2.18 The FFE. (a) Functional block diagram, where $T_b$ is the bit period and
$\alpha_n$ is the weight of the nth tap. (b) Typical frequency response, where
k is the summation of the absolute tap weights.
2.19 The CTLE. (a) Passive implementation, (b) frequency response of the
passive CTLE, (c) active implementation, and (d) frequency response
of the active CTLE. Here, $\omega_z$ is the angular frequency of the zero and
$\omega_p$ is the angular frequency of the pole.
2.20 The DFE. (a) Functional diagram, where $T_b$ is the bit period and $\alpha_n$ is
the tap weight of the nth tap. (b) Typical frequency response, where
the frequency is normalized to the value of the data rate.
2.21 Equalization adaptations. (a) Algorithm-based adjustment, (b) eye
monitor-based coefficient update, and (c) spectrum matching-based calibration.
3.10 Locking behaviour of the proposed TPD. (a) Waveforms when injection
occurs at the falling edge of CLK_P, and (b) waveforms when injection
occurs at the rising edge of CLK_P.
3.11 Implementation of the introduced LSSM. (a) Circuit details and (b)
behavior of the FLD.
3.12 Layout view of the whole RILCM chip, where the block placement of
the core circuits is illustrated in the left view.
3.13 Layout views of the crucial blocks. (a) VCO, (b) PG, (c) PFD/CP1, (d)
TPD/CP2, and (e) LSSM.
3.14 Simulation setup of the RVCO, where the left curve depicts the VCTRL
of the RVCO.
3.15 Simulation results of the RVCO. (a) Differential output clock, (b) swing
reduction, (c) frequency range, and (d) phase noise.
3.16 Simulated performance comparison of the RVCOs with FTG-based
and CCI-based FS-PDDCs in terms of (a) operation frequency, (b) frequency
range, (c) $\mathrm{FOM}_{PN}$, and (d) swing reduction. Here, the horizontal
axes denote the percentage of the FTG/CCI to the main inverter in
dimension.
3.17 Comparison of the transient procedure when operating in conventional
PLL mode and RILCM mode with LLD-LR.
3.18 Transient behavior comparison. (a) With injection-lock indicator
INJ_LOCK and (b) without injection-lock indicator INJ_LOCK.
3.19 Die micrograph of the RILCM.
3.20 Power breakdown of the RILCM.
3.21 Measured phase noise with half-rate output at 5 GHz.
3.22 Measured reference spur with half-rate output at 5 GHz. (a) RILCM
without FTL and (b) RILCM with FTL.
3.23 Integrated rms-jitter versus supply voltage.
3.24 Integrated rms-jitter versus reference frequency.
3.25 Performance-area-speed graph.
4.1 (a) Critical path and (b) timing diagram for the 2:1 MUX. Here, $t_{div}$ is
the delay of the divider, $t_{ck-q}$ is the ck-to-q delay of the 2:1 MUX, and
$t_{setup}$ is the setup time of the sampling latch.
4.2 (a) Traditional CML-based MUX implementation and (b) power consumption
with different multiplexing ratios [16]. Here, N refers to the
number of multiplexing branches.
4.3 Block diagram of the transmitter chip.
4.4 Conceptual circuit schematic of the traditional 4:1 MUX.
4.5 Four possible unit cell implementations of the 4:1 MUX.
4.6 Topology of the 4:1 MUX. (a) Conceptual schematic and (b) timing
diagram.
4.7 Traditional unit cell implementations for high-speed 4:1 MUX. (a)
Data-up structure and (b) clock-up structure.
4.8 Improved unit cell implementation.
4.9 Effect of the introduced PM on (a) high-level glitches and (b) edge
transitions.
4.10 Circuit details of the clocking blocks. (a) Clock conditioner, (b) DIV2,
and (c) CML2CMOS.
4.11 Pseudo-NAND2. (a) Circuit details and (b) operation waveform.
4.12 Layout view of the whole transmitter chip.
4.13 Layout views of the crucial blocks. (a) 4:1 MUX, (b) interleaved-retiming
latch array, (c) pseudo-NAND2 with an inverter, (d) CML2CMOS
converter, (e) DIV2, and (f) clock conditioner.
4.14 Simulation setup of the transmitter chip.
4.15 (a) Transient waveform of the traditional unit cell, (b) transient waveform
of the enhanced unit cell, (c) eye-diagram of the traditional
unit cell, and (d) eye-diagram of the enhanced unit cell.
4.16 Swing variations of the improved unit cell under different PVT corners.
4.17 Simulation eye-diagrams of the transmitter at (a) 10 Gb/s with over
equalization, (b) 40 Gb/s with proper equalization, (c) 50 Gb/s without
equalization, and (d) 50 Gb/s with proper equalization.
4.18 Chip micrograph of the transmitter.
4.19 Power breakdown of the transmitter when operating at 50 Gb/s.
4.20 Measured output eye-diagrams of the transmitter at (a) 5 Gb/s with
over equalization, (b) 40 Gb/s without equalization, (c) 40 Gb/s with
proper equalization, and (d) 50 Gb/s with proper equalization.
4.21 Measured output eye-diagrams with four separate eyes. (a) Clock pattern
and (b) PRBS pattern.
5.15 Effect of different input patterns on jitter attenuation. (a) PRBS7, (b)
PRBS15, (c) PRBS23, and (d) PRBS31.
5.16 (a) Chip micrograph and (b) power breakdown of the receiver.
5.17 Measured eye-diagrams for (a) input data at 40 Gb/s, (b) recovered data
at 10 Gb/s, (c) recovered edge-sampling clock without LPFs at 5 GHz,
and (d) recovered data-sampling clock with LPFs at 5 GHz.
5.18 Measured JTRAN and JTOL with PRBS7 at 28 Gb/s.
List of Tables
List of Acronyms and Abbreviations
HDR high data rate
HPF high-pass filter
IBTA InfiniBand trade association
IEEE institute of electrical and electronics engineers
ILCM injection-locked clock multiplier
ILO injection locked oscillator
IL-RVCO injection-locked ring voltage-controlled oscillator
INL integral nonlinearity
ISI inter-symbol interference
JGEN jitter generation
JTOL jitter tolerance
JTRAN jitter transfer
LD lock detector
LLD-LR lock-loss detection and lock recovery
LMS least mean square
LPF low-pass filter
LR long reach
LSSM loop-selection state machine
LVS layout versus schematics
MAC media access control
MEO maximum eye opening
MR medium reach
MUX multiplexer
NRZ non-return to zero
NTF noise transfer function
OC optical carrier
OSC oscillator
PCB printed circuit board
PD phase detector
PEX parasitic extraction
PFD phase frequency detector
PG pulse generator
PI phase interpolator
PLL phase-locked loop
POD polarity detector
PSD phase shift detection
PTL phase tracking loop
QSFP quad small form-factor pluggable
RILCM ring-oscillator-based injection-locked clock multiplier
RJ random jitter
RVCO ring voltage-controlled oscillator
RX receiver
S/H sample-and-hold
SerDes serializer/deserializer
SFP+ small form-factor pluggable plus
SNR signal-to-noise ratio
SONET synchronous optical network
SS-LMS sign-sign least mean square
SST source-series terminated
SSTPD sub-sampling timing-adjusted phase detector
TAL timing-adjusted loop
TDC time-to-digital converter
TPD timing-adjusted phase detector
TX transmitter
UI unit interval
USR ultra short reach
VCDL voltage-controlled delay line
VCO voltage-controlled oscillator
VCTRL control voltage
VSR very short reach
XSR extra short reach
ZF zero-forcing
$f_{BW}$ -3 dB bandwidth
$f_T$ cutoff frequency of the transistor
$f_c$ corner frequency of the oscillator
$f_{inj}$ injection-locking bandwidth of the injection-locked oscillator
$1/f^2$ white noise of the oscillator
$1/f^3$ flicker noise of the oscillator
$S_\theta(f)$ phase noise spectrum of the oscillator
Contents
Abstract
Declaration
Acknowledgements
1 Introduction
1.1 Background
1.2 Challenges in Cutting-Edge Transceivers
1.3 Research Objectives
1.4 Research Contributions
1.5 Organization of the Thesis
2 Literature Review
2.1 General Design Considerations
2.1.1 Technology Choices
2.1.2 Spaces of Electrical Links
2.1.3 On-Chip Wire Modeling
2.2 SerDes Design Metrics
2.2.1 Data Rate and Power Efficiency
2.2.2 Bit Error Rate
2.2.3 Clock Data Recovery (CDR) Specifications
2.3 Basics of Electrical Serial Links
2.3.1 Clocking Techniques
2.3.2 Transmitter Techniques
2.3.3 Receiver Techniques
2.3.4 Channel Equalization
3.2.2 Architecture Modeling
3.3 Injection-Locked Ring Voltage-Controlled Oscillator (IL-RVCO)
3.3.1 Implementation of the IL-RVCO
3.3.2 Relationship Between the Relative Phase Difference and the Frequency Offset
3.4 The Proposed Phase Difference Detection
3.4.1 Principle of the Proposed Timing-Adjusted Phase Detector
3.4.2 Polarity Selection
3.5 Mechanism of the Lock-Loss Detection and Lock Recovery (LLD-LR)
3.5.1 Operation Process of the LLD-LR
3.5.2 Principles of the Lock Loss and False Lock Detection
3.6 Experimental Results
3.6.1 Tools and Fabrication Process
3.6.2 Layout and Simulation Results
3.6.3 Chip Micrograph and Measurement Results
3.6.4 Performance Comparison
3.7 Chapter Summary
5.3.3 Behavior of the Improved CDR
5.4 Compensating Phase Interpolator
5.4.1 Implementation Details
5.4.2 Linearity Analysis
5.5 Experimental Results
5.5.1 Tools and Fabrication Process
5.5.2 Layout and Simulation Results
5.5.3 Chip Fabrication and Measurement Results
5.5.4 Performance Comparison
5.6 Chapter Summary
Bibliography
Appendices
Appendix A Modeling of the Injection-Locked Oscillator (ILO)
A.1 Behavior Model of the ILO
A.2 Linear Model of the ILO
A.3 Tracking Bandwidth of the ILO
Appendix B Convergence Proof of the Proposed EDC-SZF Iteration
Chapter 1
Introduction
1.1 Background
Chapter 1. Introduction
Figure 1.1: Diagram of the global data traffic trend [1]. By 2020, 50 billion devices
will be connected, generating more than two zettabytes of data traffic annually.
Figure 1.2: Wired network roadmap [2]. The data rates in SFP+, QSFP, and CFP are
updating towards 100 Gb/s, 400 Gb/s, and 1 Tb/s, respectively.
both the industry and academia [17, 18, 23, 24, 25, 26]. This dissertation mainly fo-
cuses on the advanced techniques of high-speed SerDes transceivers for chip-to-chip
communications operating at 40+ Gb/s in CMOS process.
dissipation, where small area occupation and low power consumption could improve
the port density and lower the requirement of heat dissipation, hence reducing the
overall cost [28, 30, 31]. For implementations, the digital media access control (MAC)
layer and the analog physical layer (SerDes transceiver) are developing at different
stages. Specifically, the 200G MAC (4×50 Gb/s) has been implemented and validated
in the industry [31], while the physical layer is still in the period of moving from the
lab to the market [17, 18, 23, 24, 25, 26, 27, 32]. This is because the MAC mainly
processes the parallel data streams, where the timing requirement can be relaxed by
increasing the parallel bit width. In contrast, the SerDes transceiver has to provide
accurate timing information, sufficient bandwidth, and appropriate equalization for the
full-rate data communication.
The next-generation SerDes transceivers that support 38-64 Gb/s have attracted
great attention from both industry and academia due to their broad market
potential and significant academic value. Although the technical feasibility has been
proved by several 40-56 Gb/s transceiver designs [33, 34, 35, 36], considerable research
is still needed to further optimize the power consumption, area occupation,
and operation robustness, thus paving the way for the upcoming industrial deployment.
This thesis mainly focuses on enhancement techniques that explore the maximum
process limit and hence provide potential solutions for cutting-edge transceiver
designs. The major research objectives are summarized as follows.
ability. This thesis aims to overcome these two difficulties and hence provide a
reliable, low-cost clock multiplier for wireline transceivers.
This dissertation explores several advanced techniques to make the data rates of
the cutting-edge wireline transceivers approach the fundamental technology limit. It
addresses some of the architecture-level and circuit-level challenges with appropriate
compromises of power consumption, area occupation, performance margin, and op-
eration robustness. The main contributions of this dissertation are summarized in the
following.
• A chip-to-chip connection over a 12-cm printed circuit board (PCB) channel using
the designed transmitter and receiver chips is constructed. The channel loss
is compensated by a combination of TX-FFE and RX-CTLE. To obtain the optimal
equalization coefficients and track the channel-loss variations with respect
to the operating environment, a low-cost edge-data correlation-based sign
zero-forcing (EDC-SZF) adaptation algorithm is proposed to automatically adjust the
TX-FFE’s tap weights. The measurement results indicate that combining
TX-FFE and RX-CTLE is a good choice for equalizing
the 16-dB-loss channel at 40 Gb/s, and that the proposed EDC-SZF
adaptation can effectively tune the TX-FFE to its optimal tap weights for a given
control voltage applied to the RX-CTLE.
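Functionally, a TX-FFE of the kind mentioned above behaves as a finite-impulse-response filter on the transmitted symbols: each output is a weighted sum of symbols spaced one bit period apart. The following Python sketch only illustrates that idea. The tap weights, the choice of one pre-cursor tap, and the function names are hypothetical; it does not reproduce the EDC-SZF adaptation or the coefficients used in the measurements.

```python
# Illustrative 4-tap FFE as an FIR filter on NRZ symbols (+1 / -1).
# The tap weights are hypothetical: one pre-cursor tap, the main tap, and
# two post-cursor taps, with absolute values summing to 1 (peak-swing limit).
TAPS = [-0.10, 0.70, -0.15, -0.05]


def ffe_output(symbols, taps):
    """Return y[n] = sum_k taps[k] * x[n - k], assuming zero initial state."""
    out = []
    for n in range(len(symbols)):
        acc = 0.0
        for k, weight in enumerate(taps):
            if n - k >= 0:
                acc += weight * symbols[n - k]
        out.append(acc)
    return out


# Six ones followed by six zeros, mapped to NRZ levels.
symbols = [1.0] * 6 + [-1.0] * 6
y = ffe_output(symbols, TAPS)

for n, value in enumerate(y):
    print(f"n={n:2d}  x={symbols[n]:+.0f}  y={value:+.2f}")
```

With these illustrative weights, a long run of identical bits settles to a magnitude of 0.4, while the symbol right after a transition reaches a magnitude of 0.8. This boost of transitions relative to long runs is the high-frequency emphasis that an FFE uses to pre-distort the output.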
This thesis is composed of seven chapters. Chapter 1 outlines the research background,
objectives, contributions, and organization of the dissertation. Chapter 2 summarizes
the mainstream techniques developed for high-speed serial links. The main
contributions of this thesis are detailed in Chapters 3, 4, 5, and 6, which present the
designed clock multiplier, transmitter chip, receiver chip, and chip-to-chip link,
respectively. In each of these four chapters, we discuss the design motivation, describe
the prototype implementation, and present the experimental results. Finally, Chapter 7
concludes this thesis and outlines possible future work. The details of each chapter
are summarized as follows.
Chapter 2 reviews the mainstream techniques that have been developed within the
wireline transceiver designs. It begins with a brief discussion on the general design
considerations when constructing a serial communication link, including technology
selection, link space choice, and on-chip wire modeling. Then, we summarize the
major metrics that are used to characterize the overall performance of a serial com-
munication link. Following that, the mainstream techniques of the crucial components
within a serial link are discussed in detail, including clock multiplier, transmitter, re-
ceiver, and equalizers.
Chapter 3 presents the design of the RILCM. It first summarizes the challenges
in previous RILCMs and then describes the proposed RILCM architecture. Following
that, we present the details of the ring-based voltage-controlled oscillator, the
phase-shift detection scheme, and the introduced lock-loss detection and lock recovery.
Finally, the experimental results are presented and discussed. This chapter extends
publication [3] in the List of Main Publications.
Chapter 4 presents the designed transmitter chip. It first discusses the two main
challenges (i.e., timing constraints and bandwidth limitations) in high-speed transmitter
designs, and then presents our transmitter architecture. Following that, the enhancements
to the 4:1 multiplexer and the clocking techniques are separately illustrated. Finally,
the experimental results are presented and discussed. This chapter is an enriched
version of the contents published in [1] and [5] in the List of Main Publications.
Chapter 5 presents the implemented receiver chip, focusing mainly on the
improvement of the clock data recovery (CDR) design. It first summarizes the design
considerations of the receiver, and then presents the receiver architecture. Following
that, we separately describe the improved digital CDR and the linearity-optimized
compensating PI. Finally, the experimental results are presented and discussed. This
chapter extends the contents published in [1] and [4] in the List of Main Publications.
Chapter 6 constructs an overall chip-to-chip communication link utilizing the chips
designed in Chapters 4 and 5. It first describes the link connection and equalization
scheme, and then presents the developed low-cost EDC-SZF adaptation algorithm.
After that, we present the experimental setup and the measurement results. A
condensed version of this chapter has been published in [1] in the List of Main Publications.
Chapter 7 concludes this dissertation and discusses potential optimizations
that can be pursued in future work.
Chapter 2
Literature Review
Chapter 2. Literature Review
Figure 2.1: Cutoff frequency ($f_T$) scaling comparison among different processes in
terms of the inverse of the lithographic feature size [3].
High-speed links over 10 Gb/s have traditionally been implemented in SiGe BiCMOS
technology due to its integration of high-speed SiGe bipolar and low-cost CMOS
transistors, where the former are suitable for high-speed, low-noise blocks such as
the transmitter (TX) driver and receiver (RX) pre-amplifier, while the latter are appropriate
for the control-logic implementation [43, 44]. However, for more complex applications
where the SerDes function is combined with complicated digital functions, a CMOS
process is preferred because its rapid scaling makes it feasible to keep the die size,
power consumption, and fabrication cost as low as possible [43]. These area, power,
and cost savings over equivalent SiGe circuits mainly come from the simple and compact
transistor implementation in CMOS, which allows designs to be scaled down easily
as semiconductor processes improve. Line card and optical module manufacturers
utilizing CMOS products benefit from the large community of competing
foundries, which engage in aggressive pricing strategies and rapid adoption of
ever-smaller process nodes that deliver successively lower cost per chip, reduced operating
voltages, and decreased power consumption [28]. Fig. 2.1 shows the scaling trend
of the CMOS process versus SiGe BiCMOS technology in terms of the cutoff frequency
($f_T$). Although SiGe BiCMOS technology retains a speed advantage over
CMOS, the $f_T$ of 45 nm CMOS already reaches 270 GHz, which makes it
feasible to implement high-speed transceivers operating at tens of Gb/s.
Note that the potential advantages of using low-cost CMOS process come with
several significant challenges. The primary challenge is that mainstream CMOS tran-
sistors are slightly slower than exotic SiGe devices (see Fig. 2.1). Therefore, more in-
novative designs for crucial blocks such as voltage-controlled oscillator (VCO), trans-
mitter driver, receiver analog front-end, and channel equalizer are required to overcome
the slower, noisier characteristics of CMOS transistors. Driven by the large-scale mar-
ket requirements, Moore’s law curve is developing towards ever-better power, perfor-
mance and price. Meanwhile, process nodes are constantly scaled down under the ag-
gressive investment of the foundries. The resulting processes have provided platforms
for the development of several tens of serial transceivers with high efficiencies in both
cost and power. So far, 25-28 Gb/s serial transceivers in CMOS processes support-
ing InfiniBand enhanced data rate (EDR), 32G fiber channel (32GFC), and common
electrical interface (CEI)-28G have stepped into the period of industrial deployment
[20, 21, 22]. Meanwhile, 38-64 Gb/s transceivers, which will play a key role in the
next-generation data rates supported by 400 Gigabit Ethernet (GbE), InfiniBand high
data rate (HDR), and CEI-56G, have been successfully demonstrated in the lab and are
now moving from the lab to the market [17, 18, 23, 24, 25, 26, 27, 32].
Fig. 2.2 shows the main SerDes application spaces in electrical links, including
rack-to-rack link, chassis-to-chassis connection, and intra-chassis interconnect. This
thesis mainly focuses on the chip-to-chip connections described in Fig. 2.2(c). Ac-
cording to the communication distance, these serial links can be classified into ultra
short reach (USR), extra short reach (XSR), very short reach (VSR), medium reach
(MR), and long reach (LR). Fig. 2.3 summarizes the connection details of each appli-
cation space defined in CEI-56G.
Figure 2.2: Typical SerDes application spaces. (a) rack-to-rack link, (b) chassis-to-
chassis connection, and (c) intra-chassis interconnect [4].
Figure 2.3: Reach details of each application space defined in CEI-56G [4].
The USR link is usually used to connect multiple dies and optical engines within
a multi-chip module to achieve the power and signal integrity objectives. This 2.5D/3D
packaging solution can save substantial power since the communication
distance is typically less than 10 mm. This short channel length allows for a much
simpler physical layer implementation since it can be treated as a synchronous link.
Meanwhile, the low-cost communication channel makes it possible to omit
equalization.
trical chips and optical devices, where the link distance is usually less than 50 mm.
Meanwhile, central processing units (CPUs) and digital signal processings (DSPs) can
also be connected via such a short connection to satisfy the latency requirements. This
XSR link is used to connect CPUs with memory stacks to optimize the responding
time of memory access as well.
The VSR link mainly refers to the connection between electrical chips and plug-
gable modules. Its typical communication distance is around 10 cm, where the channel
loss could reach 10-20 dB at the Nyquist frequency.
The MR link is usually used to implement the connection between two chips on the
same printed circuit board (PCB) or one on the main card and the other on a daughter
card [4]. Its communication distance ranges up to 50 cm and the channel loss is in the
range of 15-25 dB at a frequency of half the symbol rate (the Nyquist frequency).
The LR interface is usually applied to realize the connection between two daughter
cards across a legacy backplane with a channel loss of up to 35 dB at the Nyquist
frequency. The total channel length is limited to less than 100 cm, and two connectors are
allowed.
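The reach classes described above can be collected into a small lookup table, together with two conversions that are used repeatedly when discussing channel loss. This is a minimal Python sketch, not part of any standard tooling: the table treats the lengths quoted in the text as nominal upper bounds (an assumption, since the VSR figure is a typical value), and the function names are hypothetical. The NRZ assumption of one bit per symbol is also stated in the code.

```python
# Nominal reach classes as described in the text. Each entry is
# (name, nominal maximum length in cm, (min_loss_dB, max_loss_dB) or None).
# Treating every length as an upper bound is an assumption for illustration.
REACH_CLASSES = [
    ("USR", 1.0, None),
    ("XSR", 5.0, None),
    ("VSR", 10.0, (10.0, 20.0)),
    ("MR", 50.0, (15.0, 25.0)),
    ("LR", 100.0, (None, 35.0)),
]


def nyquist_frequency_ghz(data_rate_gbps):
    """Half the symbol rate; for NRZ the symbol rate equals the bit rate."""
    return data_rate_gbps / 2.0


def loss_db_to_amplitude_ratio(loss_db):
    """Convert a voltage loss in dB to the remaining amplitude fraction."""
    return 10.0 ** (-loss_db / 20.0)


def shortest_reach_class(length_cm):
    """Return the first class whose nominal length covers length_cm, else None."""
    for name, max_length_cm, _loss in REACH_CLASSES:
        if length_cm <= max_length_cm:
            return name
    return None


rate_gbps, loss_db = 40.0, 16.0
print(f"Nyquist frequency at {rate_gbps:.0f} Gb/s NRZ: "
      f"{nyquist_frequency_ghz(rate_gbps):.0f} GHz")
print(f"{loss_db:.0f} dB loss leaves "
      f"{loss_db_to_amplitude_ratio(loss_db):.3f} of the amplitude")
```

For a 40 Gb/s NRZ link the Nyquist frequency is 20 GHz, and a 16 dB loss at that frequency leaves roughly 16% of the signal amplitude, which gives a feel for why equalization is needed on such channels.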
The channel loss in VSR, MR, and LR links poses significant challenges in
transceiver design, as the transceivers need to compensate for the high-frequency loss within the
power budget. This problem becomes extremely severe for large switch chips,
where heat dissipation is also a performance-limiting factor. To address these issues,
complex equalization schemes, high-order modulation, and forward error correction
(FEC) have been developed [4]. To accommodate different channel losses, a proper
With the rapid development of manufacturing technologies, the channel length
and the transistor delay are shrinking down to the nanometer scale and to sub-tens
of ps, respectively. This miniaturization trend for CMOS integrated circuits has led to a
tremendous cost advantage and performance improvement. However, the narrowed
cross-sections and wire spacings have dramatically increased the parasitic effects of
the connection wires, thus degrading their high-speed performance. Previous studies
have demonstrated that when the signal’s rise/fall time roughly matches the propagation
time through the line, the connection wire actually isolates the receiver from the
driver and plays the role of output/input impedance of the driver/receiver [45, 46, 47].
Consequently, how to model on-chip connection wires has become a tricky problem
for high-speed circuit designers. If it is not handled appropriately, the interconnect
effects including voltage ringing, signal delay, distortion, reflection, and crosstalk could
degrade the system robustness or even lead to undesired errors. A simple model may
ignore some important effects and result in a design failure, while a sophisticated
model could complicate the simulation, extending the design cycle or even making
the simulation impractical. Hence, it becomes extremely important
for designers to simulate the entire design as efficiently as possible while
maintaining the simulation accuracy [48].
“High-speed interconnect” is a relative concept. It refers to the interconnect
where the propagation time to travel between the two connection points cannot
$$f_{\max} = \frac{0.35}{t_r}, \qquad (2.1)$$
where $t_r$ is the rise/fall time of the signal. This implies that for a 0.1 ns rise time, the
maximum frequency of interest is around 3 GHz and the minimum wavelength is 10 cm.
In some special cases, a more conservative bandwidth can be set as [50],
$$f_{\max} = \frac{1}{t_r}. \qquad (2.2)$$
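The two bandwidth estimates can be checked numerically for the 0.1 ns example. The short Python sketch below evaluates both expressions and the corresponding wavelength. The function names are hypothetical, and the wavelength uses the free-space speed of light as an assumption; on a real interconnect the propagation velocity is lower, so the wavelength is shorter.

```python
# Speed of light in free space (m/s); using it for the wavelength is an
# assumption, since signals on a real wire propagate more slowly.
C_FREE_SPACE_M_PER_S = 299_792_458.0


def max_frequency_hz(rise_time_s, factor=0.35):
    """f_max = factor / t_r; factor=0.35 gives Eq. (2.1), factor=1.0 gives Eq. (2.2)."""
    if rise_time_s <= 0.0:
        raise ValueError("rise time must be positive")
    return factor / rise_time_s


def wavelength_m(frequency_hz, velocity_m_per_s=C_FREE_SPACE_M_PER_S):
    """Wavelength at the given frequency and propagation velocity."""
    return velocity_m_per_s / frequency_hz


t_r = 0.1e-9  # 0.1 ns rise time, as in the example in the text
f_21 = max_frequency_hz(t_r)              # Eq. (2.1)
f_22 = max_frequency_hz(t_r, factor=1.0)  # Eq. (2.2), more conservative

print(f"Eq. (2.1): f_max = {f_21 / 1e9:.2f} GHz, "
      f"wavelength = {wavelength_m(f_21) * 100:.1f} cm")
print(f"Eq. (2.2): f_max = {f_22 / 1e9:.2f} GHz, "
      f"wavelength = {wavelength_m(f_22) * 100:.1f} cm")
```

Eq. (2.1) evaluates to 3.5 GHz for this rise time, which is consistent with the figure of around 3 GHz quoted in the text, and a 10 cm wavelength corresponds to about 3 GHz at the free-space velocity assumed here.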
The data-rate of a high-speed serial link is the number of data bits transferred per
second from the transmitter to the receiver, while the power efficiency refers to the
normalized power consumption when transferring every Gigabit data in one second .
The former is usually measured in Gb/s and the latter is frequently characterized by
mW/Gb/s. Previous studies [51, 52, 53] have demonstrated that there exists an optimal
data rate to exploit the maximum potential of a given process to achieve the best power
efficiency. The analyses in [52] and [53] suggest that the power efficiency reaches the
optimal value when the bit time (the reciprocal of the data rate) is around (4∼6) ×
FO4 (the inverter delay of the target technology with a fan-out-of-4). At this speed,
it is relatively easy to drive the half-rate clock and build critical high-speed blocks
(e.g., TX-side half-rate 2:1 multiplexers and RX-side edge/data samplers) in power-
efficient CMOS logic [51]. The FO4 delays can be roughly approximated as 500 ps
per µm of minimum drawn gate length in CMOS technologies [54]. On the one hand, if
the data rate is too low, the overhead of the static currents becomes dominant,
thus deteriorating the power efficiency. On the other hand, when the bit period is
too short, power-hungry current-mode logic (CML) circuits and complicated equaliza-
tion techniques are usually employed to satisfy the stringent timing requirement and
compensate for the severe channel attenuation. This is also why cutting-edge
transceivers running at tens of Gb/s usually consume more energy per bit (i.e., show
higher mW/Gb/s values). Previous research has demonstrated 2 mW/Gb/s transceivers in 65
nm CMOS operating around 10 Gb/s [55, 56]. Meanwhile, commercial 28 Gb/s
transceivers with sophisticated equalizers in 28 nm CMOS consume around 7 mW/Gb/s
[30]. Recently published non-return to zero (NRZ) transceivers operating from 40 to
60 Gb/s with an equalization ability of <20 dB in 28-65 nm CMOS processes have
shown energy efficiencies ranging from 4.4 to 16.4 mW/Gb/s [25, 34, 57, 35, 36].
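The FO4 rule of thumb and the (4∼6) × FO4 bit-time window can be turned into a quick estimate of the most power-efficient data rate for a given process. The sketch below assumes only the 500 ps/µm approximation quoted above; the function names are illustrative.

```python
FO4_PS_PER_UM = 500.0  # rule of thumb quoted above: ~500 ps per um of gate length


def fo4_delay_ps(gate_length_nm):
    """Approximate FO4 inverter delay in ps for a drawn gate length in nm."""
    return FO4_PS_PER_UM * gate_length_nm / 1000.0


def data_rate_gbps(gate_length_nm, fo4_per_bit):
    """Data rate whose bit time equals fo4_per_bit FO4 delays."""
    bit_time_ps = fo4_per_bit * fo4_delay_ps(gate_length_nm)
    return 1000.0 / bit_time_ps


if __name__ == "__main__":
    node_nm = 65
    print(f"FO4 at {node_nm} nm: {fo4_delay_ps(node_nm):.1f} ps")
    # Bit times of 6 and 4 FO4 bound the power-optimal window.
    print(f"optimal window: {data_rate_gbps(node_nm, 6):.1f} to "
          f"{data_rate_gbps(node_nm, 4):.1f} Gb/s")
```

Under this approximation a 65 nm process has an FO4 delay of 32.5 ps, so a 40 Gb/s link (25 ps bit time) operates with less than one FO4 per bit, well outside the power-optimal window.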
Bit error rate (BER) is the ratio of the error bit number to the total transmitted bit
number in a specific period. It is a measure of the correctness of the link operation,
which is expected to be lower than 10−12 for most serial connections. In serial com-
munication systems, the BER could be affected by the distribution of the random jitter
(RJ) and the deterministic jitter (DJ) in the link. Fig. 2.4 gives the jitter decomposition
components and their corresponding jitter sources [58], where the jitter generation and
amplification mechanisms can be found in [54] and [59, 60, 61], respectively. Com-
bining the jitter generated by each source, the total RJ and DJ can be respectively
computed by the following two equations,
$$T_{rj} = \sqrt{t_{rj1}^{2} + \cdots + t_{rjn}^{2}}, \tag{2.3}$$
where $T_{rj}$ denotes the total RJ, $t_{rjn}$ (n = 1, 2, ...) refers to the independent RJ
generated by different sources, $T_{dj}$ represents the total DJ, and $t_{djn}$ (n = 1, 2, ...)
stands for the separate DJ produced by different blocks. Assuming that samples taken
outside the bit period produce bit errors, the horizontal Q-factor of the BER can be
represented by,
$$Q_{BER} = \frac{T_{bit} - T_{dj}}{2T_{rj}}, \tag{2.5}$$
where $T_{bit}$ is the bit period. Referring to the analysis in [62], the BER can be roughly
evaluated by,
$$BER = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{Q_{BER}}{\sqrt{2}}\right), \tag{2.6}$$
where erfc() is the complementary error function, which is defined as,
$$\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_{x}^{\infty} e^{-t^{2}}\,dt. \tag{2.7}$$
According to Eq. (2.6), the horizontal Q-factor should be about 7.0 to satisfy the
commonly required BER of $10^{-12}$. It is worth noting that the BER can be further degraded
by non-ideal impairments such as asymmetric jitter distribution [63], non-optimal
sampling position [64], phase-spacing error, sampler input-offset [65], and sampler
metastability [65].
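The jitter budget in Eqs. (2.3), (2.5), and (2.6) can be evaluated directly. The following is a minimal sketch: the total DJ is taken as a given input, and the 40 Gb/s numbers in the usage part are hypothetical values chosen for illustration, not measurements from this work.

```python
import math


def total_rj(rj_sources):
    """Root-sum-square of independent RJ contributions, Eq. (2.3)."""
    return math.sqrt(sum(t * t for t in rj_sources))


def horizontal_q(t_bit, t_dj, t_rj):
    """Horizontal Q-factor, Eq. (2.5); all arguments in the same time unit."""
    return (t_bit - t_dj) / (2.0 * t_rj)


def ber_from_q(q):
    """BER estimate from the Q-factor, Eq. (2.6)."""
    return 0.5 * math.erfc(q / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical 40 Gb/s budget, in picoseconds.
    t_bit = 25.0
    t_dj = 8.0                      # assumed total DJ
    t_rj = total_rj([0.6, 0.8])     # two assumed independent RJ sources (rms)
    q = horizontal_q(t_bit, t_dj, t_rj)
    print(f"T_rj = {t_rj:.2f} ps rms, Q = {q:.2f}, BER = {ber_from_q(q):.2e}")
    print(f"BER at Q = 7.0: {ber_from_q(7.0):.2e}")
```

Evaluating `ber_from_q(7.0)` gives a value slightly above $10^{-12}$, which is why a Q-factor of about 7 is quoted as the target.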
The vertical amplitude dimension is another factor that could affect the BER. It
usually involves the TX-side output swing, channel equalization, and RX-side input
sensitivity. The receiver sensitivity is defined as the lowest signal amplitude at which
the receiver can correctly extract the transmitted data. It is a function of equivalent
input noise, input offset, and minimum latch resolution. When the received signal has a
sufficiently large swing, the vertical amplitude has a negligible effect on the BER. If
the received signal swing is reduced close to the receiver sensitivity, the BER of the
whole link could be determined by the signal-to-noise ratio (SNR) of the received signal
even though there is adequate horizontal timing margin. Similar to the relationship
between the BER and horizontal Q-factor in Eq. (2.6), the BER is related to the SNR
through the following equation [66],
$$BER = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{\sqrt{SNR}}{2\sqrt{2}}\right). \tag{2.8}$$
Note that Eq. (2.8) holds only when there is a sufficient horizontal timing margin.
It may seem that a vertical eye opening at the RX-side can always be obtained by
increasing the output swing at the TX-side. However, the enhanced inter-symbol
interference (ISI), reflection, and crosstalk associated with the increased swing
could overwhelm the RX-side amplitude increment and hence deteriorate the overall
performance of the link. The enlarged swing also needs a higher capacitor-charging
current and thus increases the power consumption. In practical designs, the signal swing
and equalizer scheme are carefully chosen to achieve both low BER and low power
consumption. Offset cancellation techniques are also often employed in the receiver to
improve its sensitivity, which reduces the minimum swing requirement and hence
improves the power efficiency of the link.
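Eq. (2.8) can be evaluated in the same way as the timing-domain estimate. The sketch below assumes that SNR is the linear ratio used in Eq. (2.8); the helper names are illustrative.

```python
import math


def ber_from_snr(snr_linear):
    """BER estimate from the received SNR (linear ratio), Eq. (2.8)."""
    if snr_linear < 0:
        raise ValueError("SNR must be non-negative")
    return 0.5 * math.erfc(math.sqrt(snr_linear) / (2.0 * math.sqrt(2.0)))


def snr_db(snr_linear):
    """Convert a linear power ratio to dB."""
    return 10.0 * math.log10(snr_linear)


if __name__ == "__main__":
    for snr in (100.0, 196.0, 256.0):
        print(f"SNR = {snr_db(snr):.1f} dB -> BER = {ber_from_snr(snr):.2e}")
```

Comparing Eq. (2.8) with Eq. (2.6) shows that sqrt(SNR)/2 plays the role of the Q-factor, so a Q of 7 corresponds to a linear SNR of 196, i.e., roughly 23 dB.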
Figure 2.5: CDR specifications of (a) JTRAN, (b) JGEN, and (c) JTOL in SONET [5].
(a) JTRAN mask versus frequency (Hz): peaking limit P = 0.1 dB, corner frequency fc.
OC Level   Rate            fc        P
1          51.84 Mb/s      40 kHz    0.1 dB
3          155.52 Mb/s     130 kHz   0.1 dB
12         622.08 Mb/s     500 kHz   0.1 dB
48         2.48832 Gb/s    2 MHz     0.1 dB
192        9.95328 Gb/s    120 kHz   0.1 dB
(b) JGEN: jitter filter gain (dB) versus frequency (Hz), with corner frequencies f0 and f1.
(c) JTOL mask versus frequency (Hz): the region above the mask is acceptable performance
and the region below is unacceptable; the mask falls at -20 dB/dec between the plateaus
A3, A2, and A1.
OC Level   Rate     f0     f1     f2     f3     ft     A1      A2      A3
           (Mb/s)   (Hz)   (Hz)   (Hz)   (Hz)   (Hz)   (UIpp)  (UIpp)  (UIpp)
1          51.84    10     30     300    2k     20k    0.15    1.5     15
3          155.52   10     30     300    6.5k   65k    0.15    1.5     15
12         622.08   10     30     300    25k    250k   0.15    1.5     15
48         2488.3   10     600    6k     100k   1M     0.15    1.5     15
192        9953.3   10     2k     20k    400k   4M     0.15    1.5     15
The CDR used to extract the sampling clocks and retime the transmitted data must
satisfy stringent jitter specifications. Its performance is usually evaluated by “jitter
transfer (JTRAN)”, “jitter generation (JGEN)”, and “jitter tolerance (JTOL)” [67, 68].
Fig. 2.5 summarizes these three metric definitions in synchronous optical network
(SONET) [5].
• The JTRAN is characterized by calculating the ratio of output jitter to input jit-
ter as a function of frequency. This metric is often used in long-haul networks
employing many data regenerators. To implement reliable data communications
in such cascaded systems, the JTRAN peaking of each regenerator must be suffi-
ciently small to ensure that the output jitter after tens of successive amplifications
is still acceptable. As depicted in Fig. 2.5(a), the maximum jitter peaking of the
retiming regenerator in SONET must be less than 0.1 dB [5].
• The JGEN is a measure of the intrinsic jitter produced by the CDR itself when
there is no jitter in the input data. It can be measured at the output of the CDR
using a high-pass filter with a specific cut-off frequency. Fig. 2.5(b) gives the
corner frequencies at different data rates for SONET and the maximum allowable
integrated rms jitter. For all OC levels, the rms jitter is required
to stay below 10 mUI.
• The JTOL is used to characterize the CDR jitter tracking ability, and it is defined
as the maximum amplitude of the injected sinusoidal jitter that the link can toler-
ate without dropping below a specific BER. Fig. 2.5(c) displays the JTOL mask
for SONET, which defines the minimum jitter amplitude that can be tolerated
while not exceeding a specific BER at different frequencies [68].
In summary, the JTRAN, JGEN, and JTOL separately answer the following three
questions: (i) how much jitter passes through the CDR from the input to the output,
(ii) how much jitter is created by the CDR itself, and (iii) how much jitter can be
tolerated at the input of the CDR [68].
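The JTOL mask of Fig. 2.5(c) is a piecewise function of the jitter frequency, and a measured tolerance curve passes when it lies on or above the mask at every test frequency. The sketch below encodes the mask corner values listed with Fig. 2.5(c); the -20 dB/dec segments are modeled as amplitude proportional to 1/f, and the class and function names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JtolMask:
    f0: float  # Hz
    f1: float
    f2: float
    f3: float
    ft: float
    a1: float  # UIpp
    a2: float
    a3: float


# Corner values as listed with Fig. 2.5(c), keyed by OC level.
SONET_JTOL = {
    1: JtolMask(10, 30, 300, 2e3, 20e3, 0.15, 1.5, 15),
    3: JtolMask(10, 30, 300, 6.5e3, 65e3, 0.15, 1.5, 15),
    12: JtolMask(10, 30, 300, 25e3, 250e3, 0.15, 1.5, 15),
    48: JtolMask(10, 600, 6e3, 100e3, 1e6, 0.15, 1.5, 15),
    192: JtolMask(10, 2e3, 20e3, 400e3, 4e6, 0.15, 1.5, 15),
}


def jtol_limit_uipp(mask, f_hz):
    """Minimum sinusoidal jitter amplitude (UIpp) that must be tolerated at f_hz."""
    if f_hz < mask.f0:
        raise ValueError("mask is not defined below f0")
    if f_hz <= mask.f1:
        return mask.a3
    if f_hz < mask.f2:
        return mask.a3 * mask.f1 / f_hz   # -20 dB/dec from A3 down to A2
    if f_hz <= mask.f3:
        return mask.a2
    if f_hz < mask.ft:
        return mask.a2 * mask.f3 / f_hz   # -20 dB/dec from A2 down to A1
    return mask.a1


def passes_jtol(mask, measured):
    """measured: iterable of (frequency_hz, tolerated_amplitude_uipp) pairs."""
    return all(amp >= jtol_limit_uipp(mask, f) for f, amp in measured)


if __name__ == "__main__":
    oc192 = SONET_JTOL[192]
    for f in (1e3, 1e4, 1e5, 1e6, 1e7):
        print(f"{f:>10.0f} Hz -> {jtol_limit_uipp(oc192, f):.2f} UIpp")
```

The tabulated corners are self-consistent with the -20 dB/dec slope: each sloped segment spans one decade in frequency while the amplitude drops by a factor of ten.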
[Fig. 2.6 block diagram. Transmitter: Refclk, PLL, serializer (D1-DN to DS), driver +
equalizer. Channel. Receiver: CTLE, DFE, CDR, deserializer (D1-DN), Refclk, PLL.]
Fig. 2.6 describes a typical serial link for chip-to-chip communications. It is
composed of three primary components: a transmitter, a receiver, and a channel. The main
function of the transmitter is to convert the parallel digital data into an electrical signal
and launch it on the transmission channel with a proper waveform shape such that the
received signal after the lossy channel can be correctly recovered. A general trans-
mitter (see Fig. 2.6) usually consists of a phase-locked loop (PLL), a serializer, and a
combined driver-equalizer. Driven by the clocks with appropriate frequency and phase,
the parallel data D1 -DN are successively multiplexed into a full-rate data stream DS
using the multiplexing stages in the serializer. To guarantee a robust serialization, the
bandwidth and timing margin of each multiplexing stage must be sufficient. After the
full-rate data streams are generated, they are applied to the combined driver-equalizer
to pre-distort the output waveform and launch it into the transmission channel. The
main task of the receiver located at the other end of the transmission channel is to
extract the originally transmitted data from the received signal using appropriate
equalization and clock data recovery (CDR) techniques [69, 61, 70]. A general receiver
(see Fig. 2.6) usually contains a front-end equalizer, a CDR, and a deserializer. The
incoming signal is first equalized by the front-end equalizer to obtain a sufficient
vertical eye opening and an adequate horizontal sampling margin. This equalized output
is then sliced by the samplers, where the sampling position is continuously adjusted
by the CDR loop. These sliced data sequences are further demultiplexed by the
deserializer to recover the originally transmitted data D1-DN. The communication channel
carries the serial data from the TX side to the RX side. The main problem
associated with the transmission channel is the channel loss. To overcome this
difficulty, a combination of TX-side feed-forward equalization (FFE) along with RX-side
continuous-time linear equalizer (CTLE) and decision feedback equalizer (DFE) is usually
Figure 2.7: Clock synthesis implementations and phase noise performances for (a)
PLL, (b) DLL, (c) ILO, and (d) IL-VCO. Here, f is the frequency of the noise, Sθ (f )
stands for the phase noise spectrum, fBW refers to the -3dB bandwidth of the loop,
fc denotes the corner frequency of the VCO, and finj represents the injection-locking
bandwidth of the ILO.
duced in this section to give an overview of the common clock generation schemes for
serial links. The details of this injection locking technique will be discussed together
with the designed ring-oscillator-based injection locked clock multiplier (RILCM) in
Chapter 3. Fig. 2.7 presents the widely used clock generation techniques and their
corresponding phase noise performance.
The most general method to produce high-frequency clocks from a low-frequency
input is the traditional PLL [see the left diagram in Fig. 2.7(a)], due to its compact
implementation, robust operation, and convenient rate configuration. It consists of
a phase frequency detector (PFD), a charge pump (CP), a low-pass filter (LPF), a
voltage-controlled oscillator (VCO), and a divider (DIV). The PFD is utilized to detect
the phase errors between the input reference clock and the feedback divided clock, the
CP is used to convert the phase errors into current pulses, the LPF is adopted to sup-
press the ripples on the control voltage, the VCO generates high-frequency clocks, and
the divider is introduced to set the clock multiplication factor. Theoretical analyses
show that the PLL acts as an LPF for the reference noise, DIV noise, and PFD
noise, a band-pass filter for the CP noise, and a high-pass filter (HPF) for the VCO
noise [29, 71, 72]. X. Gao et al. [77] proposed two useful design criteria to minimize
the PLL output jitter for a given power budget. One is spending equal power
on the loop (including PFD, DIV, and CP) and the VCO. The other is setting the PLL
bandwidth at an optimal value that makes the loop components and the VCO contribute
equally to the total jitter. As shown in the right diagram in Fig. 2.7(a), the optimal
bandwidth can be approximated by the intersection of the phase-noise curves of the loop
components and the VCO. The jitter performance of the PLL heavily relies on the oscillator (OSC).
Different types of OSCs provide different advantages and drawbacks with respect to
power efficiency, area occupation, phase noise, tuning range, and multi-phase
generation. The Ring-OSC has advantages over the LC-OSC in terms of small area
occupation, wide tuning range, and convenience of multi-phase generation, while the
LC-OSC offers low phase noise and high power efficiency. Neither of them can satisfy
all the clock synthesis requirements of small area, low power, low phase noise, and
multi-phase generation. The poor phase noise of the Ring-OSC is mainly due to device
noise accumulation, while the large area of the LC-OSC is due to the large inductor.
Additionally, the phase noise of both OSCs degrades rapidly when the operating
frequency exceeds 10 GHz. Therefore, a wide PLL bandwidth is desirable to suppress the
phase noise of the VCO. However, the maximum loop bandwidth is often limited by the
input reference frequency for loop stability reasons.
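The low-pass and high-pass behavior described above can be illustrated with a linear phase-domain model. The sketch below assumes a simplified second-order charge-pump PLL with a series RC loop filter; this topology and all component values are illustrative assumptions and are not the design used in this work. With the VCO gain expressed in Hz/V, the open-loop gain is G(s) = Icp·Z(s)·Kvco/(N·s).

```python
import math


def open_loop_gain(f_hz, i_cp, k_vco_hz_per_v, n_div, r, c):
    """Open-loop gain G(s) of a charge-pump PLL with a series RC loop filter."""
    s = 1j * 2.0 * math.pi * f_hz
    z_lpf = r + 1.0 / (s * c)            # series R-C filter impedance
    return i_cp * z_lpf * k_vco_hz_per_v / (n_div * s)


def ref_noise_transfer(f_hz, **loop):
    """Reference phase noise to output: low-pass, gain N in-band."""
    g = open_loop_gain(f_hz, **loop)
    return loop["n_div"] * g / (1.0 + g)


def vco_noise_transfer(f_hz, **loop):
    """VCO phase noise to output: high-pass, unity gain out-of-band."""
    g = open_loop_gain(f_hz, **loop)
    return 1.0 / (1.0 + g)


if __name__ == "__main__":
    # Hypothetical loop parameters chosen only for illustration.
    loop = dict(i_cp=100e-6, k_vco_hz_per_v=1e9, n_div=32, r=10e3, c=100e-12)
    for f in (1e3, 1e5, 1e7, 1e9):
        h_ref = abs(ref_noise_transfer(f, **loop))
        h_vco = abs(vco_noise_transfer(f, **loop))
        print(f"{f:>8.0e} Hz: |H_ref| = {20 * math.log10(h_ref):7.1f} dB, "
              f"|H_vco| = {20 * math.log10(h_vco):7.1f} dB")
```

In-band, the reference noise is multiplied by N (the 20log(N) term in Fig. 2.7) while the VCO noise is strongly attenuated; out-of-band, the roles reverse, which is why widening the loop bandwidth trades VCO noise for reference and loop noise.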
The DLL-based clock synthesizer is one of the possible solutions to satisfy the
aforementioned requirements [76, 78, 75, 79]. The left diagram in Fig. 2.7(b) presents its
conceptual implementation, which consists of a conventional DLL and an edge
combiner. Driven by the phase detection loop, the voltage-controlled delay line (VCDL)
is forced to produce equally spaced phases within a specific duration (e.g., a period
of the input clock). These evenly spaced low-frequency phases are then fed into the
edge combiner to produce the desired high-frequency clocks. The main advantage of
this DLL-based clock synthesizer is its good jitter performance, which can be mainly
attributed to the fact that the jitter accumulation in the open-loop VCDL is confined
to one pass through the delay line [78]. In addition, the phase noise transferred from
the PD and CP is negligible due to the small gain of the VCDL. The right diagram in
Fig. 2.7(b) presents the phase noise characteristics of the DLL-based clock synthesizer,
where the accumulated phase noise associated with the VCDL and the phase noise
introduced by the PD/CP are so small that they can be neglected. Note that jitter
amplification does exist at out-of-band frequencies, although it is usually very small.
Compared to the phase noise in a traditional PLL [see the right diagram in Fig. 2.7(a)],
the DLL-based synthesizer exhibits excellent jitter performance, which can be roughly
approximated by the reference clock jitter [75]. Another benefit of the DLL-based clock
synthesizer comes from its natural stability, since it behaves as a single-pole
system. However, this architecture has three major drawbacks. Firstly, its performance
is sensitive to static nonlinearities. Any phase inaccuracy of the evenly spaced clocks
translates directly into duty cycle error and/or phase spacing error. This phase
inaccuracy could be either caused by the mismatches in the PD, CP, and VCDL or induced
by the waveform-shape inconsistency due to an improper input waveform. These
factors make the DLL-based clock synthesizer fragile to fabrication mismatch and
process, voltage, and temperature (PVT) variations, thus exhibiting weak robustness.
Secondly, the clock multiplication factor is difficult to program due to the limited
number of VCDL stages. Thirdly, the additional high-speed edge combiner could
significantly degrade its power efficiency. Constrained by the poor robustness, high
power consumption, and inconvenient combining-timing control, the DLL-based clock
synthesizer can hardly reach frequencies higher than 10 GHz [80].
The injection-locked clock multiplier (ILCM) is another promising scheme to produce
high-frequency multi-phase clocks with small area occupation, low power consumption,
and good jitter performance [81, 82, 83, 84, 85]. It has shown great potential in
serial link communications [86, 87, 88]. Fig. 2.7(c) depicts the functional diagram of
the injection locked oscillator (ILO) and its phase noise suppression effect. The injec-
tion locking actually acts as a single-pole HPF system that achieves 20 dB/dec of in-
band noise shaping against the intrinsic phase noise of the OSC [82, 89]. Nonetheless,
this simple ILO suffers from the following three issues. Firstly, the jitter suppression
is sensitive to the frequency offset between the target frequency and the free-running
frequency of the OSC [81]. As the frequency deviation increases, the phase noise
tracking ability will be significantly degraded while the spur increases dramatically.
Therefore, the ILO should be tuned to be close to the center of the locking range for
best jitter performance. Secondly, this injection locking technique cannot completely
suppress the $1/f^3$ noise. This problem becomes particularly prominent for ring-OSCs
implemented in deep sub-micron CMOS processes because their flicker-noise corner
frequencies usually reach tens of MHz [82]. Consequently, phase calibration mechanisms
are needed to assist in suppressing the $1/f^3$ noise of the OSC. Thirdly, the small
locking range of the ILO reduces its robustness and reliability against PVT variations.
To address these issues, a frequency tracking loop (FTL) is introduced to provide a
proper control voltage such that the natural oscillation frequency of the VCO can always
stay around the desired multiple of the injection frequency [see the left diagram in Fig.
2.7(d)]. This FTL brings the following two benefits [83, 84]. One is that the
frequency deviation between the target frequency and the natural frequency of the VCO
can be minimized. This not only enhances the jitter suppression effect of the injection
locking [see the right diagram in Fig. 2.7(d)], but also improves the robustness of the
system since the frequency deviation can always be kept within the locking range of the
IL-VCO. The other is the noise shaping ability, which helps to suppress the in-band
noise of the VCO. Combined with the 20 dB/dec low-frequency noise suppression of
the injection locking, the $1/f^3$ noise of the VCO can be effectively attenuated.
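The reason why injection locking alone cannot remove the $1/f^3$ noise follows from the slopes: a single-pole high-pass response provides 20 dB/dec of shaping, whereas the flicker-induced phase noise rises at 30 dB/dec toward low offsets, leaving a residual 10 dB/dec. The sketch below assumes an ideal first-order high-pass with corner finj; the reference noise level and corner values are hypothetical.

```python
import math


def injection_shaping_db(f_offset_hz, f_inj_hz):
    """Gain (dB) of an ideal single-pole high-pass with corner f_inj."""
    ratio_sq = (f_offset_hz / f_inj_hz) ** 2
    return 10.0 * math.log10(ratio_sq / (1.0 + ratio_sq))


def flicker_noise_dbc(f_offset_hz, l_ref_dbc, f_ref_hz):
    """Free-running 1/f^3 phase noise: -30 dB/dec through (f_ref, l_ref)."""
    return l_ref_dbc - 30.0 * math.log10(f_offset_hz / f_ref_hz)


def shaped_noise_dbc(f_offset_hz, l_ref_dbc, f_ref_hz, f_inj_hz):
    """Oscillator 1/f^3 noise after injection-locking noise shaping."""
    return (flicker_noise_dbc(f_offset_hz, l_ref_dbc, f_ref_hz)
            + injection_shaping_db(f_offset_hz, f_inj_hz))


if __name__ == "__main__":
    # Hypothetical values: -80 dBc/Hz at 1 MHz offset, 100 MHz locking bandwidth.
    l_ref, f_ref, f_inj = -80.0, 1e6, 100e6
    for f in (1e4, 1e5, 1e6):
        print(f"{f:>8.0e} Hz: free-running "
              f"{flicker_noise_dbc(f, l_ref, f_ref):6.1f} dBc/Hz, shaped "
              f"{shaped_noise_dbc(f, l_ref, f_ref, f_inj):6.1f} dBc/Hz")
```

The shaped noise still grows by about 10 dB for every decade of decreasing offset, so an additional loop with its own noise shaping is needed to flatten it.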
Figure 2.8: Clock distribution structures based on (a) inverter chain, (b) CML chain,
(c) transmission line, and (d) inductive load.
ing distribution distance is approaching one tenth of the "electrical length" of the
transmission clock, thus making the connection wires exhibit transmission line charac-
teristics. Thirdly, the parasitic resistance and capacitance have limited the bandwidth
of the interconnect wires. This problem becomes even more severe when the feature
size scales downwards. The reason is that the scaled geometry could significantly in-
crease the parasitic effects, hence degrading the bandwidth of the connection wires.
Fig. 2.8 shows the four widely used clock distribution techniques. The most tra-
ditional method is to employ a buffer chain that can be implemented by either simple
inverters [see Fig. 2.8(a)] or compact CMLs [see Fig. 2.8(b)]. In these two approaches,
the transmission wire is divided into several segments to optimize the desired metrics,
e.g., delay time, jitter performance, and power consumption. The analysis in [91]
shows that there exist an optimal segment number and wire geometry for a specific
distribution distance and a given optimization metric (delay, jitter, or power). Compared
to the full-swing digital inverter, the CML buffer is more suitable for high-frequency
clock distribution for the following reasons. Firstly, the propagation delay of the
CML is much shorter than that of the CMOS inverter, since the CML can use a small
swing to reduce the edge-transition time. Secondly, the CML buffer can fully exploit
the process potentials as its compact NMOS driving topology naturally features fast
current switching speed and small parasitic capacitance. Thirdly, the CML buffer with
resistor loads has much less delay sensitivity to supply noise than inverters [92], due
to its excellent power supply rejection ratio (e.g., 5× in Intel 90 nm 1.2 V CMOS
process). The main disadvantage of the CML is the high power consumption because
it always draws a current from the supply even when the clock is not switching [61].
Considering the fact that the delay variation with respect to the supply fluctuation is
mainly caused by the clock buffers rather than the transmission wires [91], minimizing
the delay through the clock buffers is helpful to reduce the delay susceptibility of the
clock network to the power-supply noise.
Fig. 2.8(c) shows a repeaterless clock distribution network, which usually employs
an open-drain CML buffer to drive the terminated on-chip transmission lines.
The measurement results in reference [92] demonstrate that a 10 GHz global
clock can be transmitted nearly 3 mm using an open-drain buffer to drive a pair of
differential transmission lines with on-chip terminations. The delay of the transmission
line is the smallest because the signal propagates at the speed of light in the
dielectric. For the characteristic
impedance, it is not necessary to design exactly 100 Ω as long as it matches with the
Figure 2.9: Typical transmitter driver modes. (a) CML mode and (b) SST mode. In the
figure, both drivers deliver 400 mV across a 100 Ω differential termination; the CML
driver draws a 16 mA tail current, whereas the SST driver draws 4 mA.
According to the driving mode, the output stage of the transmitter can be mainly
divided into current-mode logic (CML) and source-series terminated (SST) drivers. Fig.
2.9(a) shows the implementation details of a typical CML driver, which consists of
a differential pair, a pair of resistive loads, and a tail current source. Compared to the
SST driver described in Fig. 2.9(b), it possesses the good properties of high-speed
switching, adjustable output swing, good impedance matching, and convenience to integrate
peaking inductors [96, 97]. These features endow it with the capability of exploiting
the maximum process potential, thus making it more suitable for cutting-edge drivers
that operate at tens of Gb/s. Recently, 50-64 Gb/s transmitters using CML drivers have
been implemented in 65 nm CMOS process [24, 25, 26]. The SST driver evolves from
the traditional CMOS inverter, where 50 Ω resistors are inserted in each branch to
reduce the impedance discontinuities and thus minimize the reflections. The SST driver
demonstrates a high power efficiency, consuming only one fourth of the power of the
CML driver (see Fig. 2.9). The symmetrical topology makes it compatible with all of
the low, high, and mid common-mode terminations. Nonetheless, the large self-load
capacitances, slow PMOS transistors, and incompatibility with bandwidth-extension
inductors have limited its maximum operation speed. These characteristics of the SST
driver make it popular in power-sensitive high-volume designs using advanced processes
with adequate speed margins. For example, a 28 Gb/s SST transmitter has been fabricated
in a 32 nm CMOS [98] and a 16-40 Gb/s NRZ/PAM4 dual-mode transmitter utilizing
an SST driver has been implemented in a 14 nm CMOS [99].
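The one-fourth ratio can be reproduced with simple DC nodal analysis. The sketch below assumes ideal switches, 50 Ω on-chip resistors, and a 100 Ω differential far-end termination for both drivers; the function names are illustrative. For the CML driver, solving the two output nodes gives I_tail = V_diff·(2R + R_t)/(R·R_t); for the SST driver the supply current simply equals the termination current V_diff/R_t.

```python
def cml_tail_current_a(v_diff, r_load=50.0, r_term=100.0):
    """Tail current needed by a CML driver for a differential swing v_diff.

    Model: each output has r_load to the supply, r_term connects the two
    outputs, and the full tail current is steered to one side.
    """
    return v_diff * (2.0 * r_load + r_term) / (r_load * r_term)


def sst_supply_current_a(v_diff, r_term=100.0):
    """Supply current of an SST driver: series path through r_term only."""
    return v_diff / r_term


def sst_supply_voltage_v(v_diff, r_series=50.0, r_term=100.0):
    """Driver supply needed so that v_diff appears across r_term."""
    return sst_supply_current_a(v_diff, r_term) * (2.0 * r_series + r_term)


if __name__ == "__main__":
    v = 0.4  # 400 mV across the termination
    i_cml = cml_tail_current_a(v)
    i_sst = sst_supply_current_a(v)
    print(f"CML tail current: {i_cml * 1e3:.1f} mA")
    print(f"SST supply current: {i_sst * 1e3:.1f} mA "
          f"(driver supply {sst_supply_voltage_v(v):.2f} V)")
    print(f"current ratio CML/SST: {i_cml / i_sst:.1f}")
```

The current ratio translates into the same power ratio only if both drivers run from the same supply voltage, which is an assumption of this comparison.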
Figure 2.10: Schemes of the final 4:1 multiplexing. (a) Half-rate topology based on
two-stage 2:1 MUXs, (b) quarter-rate structure based on direct 4:1 MUX, (c) critical
path and timing diagram of the 2:1 MUX, (d) timing margin of the 2:1 MUX, and (e)
timing margin of the 4:1 MUX.
The serializer usually utilizes a multiplexing tree to combine the low-speed parallel
data into a high-speed stream. Each multiplexing stage is composed of a multiplexer
(MUX) and several latches, where the latches are placed before the MUX to guarantee
sufficient timing margin for the following data selection and/or data sampling. These
timing constraints pose significant challenges for the high-speed serialization
in the last few stages. According to the ratio of the data rate to the maximum clock
frequency, the transmitters can be partitioned into half-rate and quarter-rate
architectures. Fig. 2.10 describes the conceptual implementations and timing
requirements of the two typical multiplexing schemes.
For the half-rate architecture, the final 4:1 multiplexing is implemented by three
2:1 MUXs, where two of them work at quarter rate and the final one operates at half
rate [see Fig. 2.10(a)]. This serialization topology is ubiquitous mainly owing
to its simple clocking scheme, which only requires a pair of complementary clocks to
alternately select the input data. The pulse width of the MUX output is subject to the
duty cycle of the driving clocks, so a 50% duty cycle is required. In practical designs,
a duty cycle correction circuit is usually employed to guarantee the desired duty
cycle. The main drawbacks of this architecture are the tight timing constraints and the
large number of latches (15 for the 4:1 serialization). Fig. 2.10(c) and (d) display the
two possible critical paths. One is located at the first latch in the final 2:1 MUX,
where the sum of the delay of the divider (by 2), the ck-to-q delay of the previous 2:1
MUX, and the setup time of the latch must be smaller than 1 unit interval (UI). The other
occurs at the final 2:1 MUX, where the data selection margin [i.e., tsetup + thold in
Fig. 2.10(d)] is only 1 UI. When the data rate reaches several tens of Gb/s, it becomes
a nontrivial task to satisfy these timing requirements. The delay variations across
different PVT corners make this problem even more challenging. To overcome this
difficulty, traditional half-rate transmitters often insert extra delay matching buffers
[27, 24] or phase calibration loops [100, 33, 26] between CK1 and the latch [see Fig.
2.10(a)]. For the former method, the delay mismatch between the multiplexing path and
the matching buffer may exceed 1 UI and thereby cause bit errors. For the latter
approach, the automatic phase adjustment is limited by the accuracy of phase detection,
which could reduce the stability, reliability, and robustness of the serializer.
Additionally, both of these techniques involve substantial power and area overheads.
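The first critical-path constraint (divider delay + ck-to-q delay + setup time < 1 UI) can be checked with a few lines of arithmetic. The sketch below uses hypothetical nominal delays and a hypothetical slow-corner scaling factor purely to illustrate how quickly the slack disappears at 40 Gb/s; none of the numbers are taken from this work.

```python
def unit_interval_ps(data_rate_gbps):
    """Bit period (1 UI) in picoseconds."""
    return 1000.0 / data_rate_gbps


def half_rate_setup_slack_ps(data_rate_gbps, t_div_ps, t_ck_q_ps, t_setup_ps):
    """Slack of the path into the final 2:1 MUX latch; negative means violation."""
    return unit_interval_ps(data_rate_gbps) - (t_div_ps + t_ck_q_ps + t_setup_ps)


if __name__ == "__main__":
    # Hypothetical nominal delays in ps (divider, ck-to-q, setup).
    nominal = (8.0, 10.0, 6.0)
    slow_corner_scale = 1.2  # assumed delay increase at a slow PVT corner
    for rate in (10.0, 20.0, 40.0):
        nom = half_rate_setup_slack_ps(rate, *nominal)
        slow = half_rate_setup_slack_ps(
            rate, *(d * slow_corner_scale for d in nominal))
        print(f"{rate:4.0f} Gb/s: UI = {unit_interval_ps(rate):5.1f} ps, "
              f"nominal slack = {nom:6.1f} ps, slow-corner slack = {slow:6.1f} ps")
```

With these assumed delays the path barely closes at 40 Gb/s under nominal conditions and fails at the slow corner, which matches the motivation for delay-matching buffers or phase calibration loops.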
For the quarter-rate architecture, the final 4:1 multiplexing is performed by a single
4:1 MUX, where the input data operate at the quarter rate [see Fig. 2.10(b)]. This
serialization structure has attracted increasing attention for applications beyond 10
Gb/s. This is because it not only addresses the timing issues in the traditional 2:1 MUX
by removing the critical path in Fig. 2.10(c) and relaxing the data-selection margin from
1 UI [see Fig. 2.10(d)] to 3 UI [see Fig. 2.10(e)], but also saves substantial power by
halving the maximum clock speed and removing the half-rate latches [see Fig. 2.10(e)].
However, these benefits come with the penalty of a doubled self-drain capacitance, which
dramatically degrades the bandwidth of the 4:1 MUX, hence limiting its maximum
operation speed. Another difficulty associated with this 4:1 MUX is how to generate
the evenly spaced 90° multi-phase clocks and produce the UI-spaced input sequences
for the data selection. Both of these issues are addressed in this thesis, as
detailed in Chapter 4.
Figure 2.11: Techniques of 1-UI delay generation based on (a) full-rate FF, (b) half-rate
2:1 MUX, (c) quarter rate 4:1 MUX, and (d) analog delay line.
The TX-FFE, which acts as a finite impulse response (FIR) filter and pre-distorts
the transmitted signal, is one of the most common techniques employed in high-speed
serial links to alleviate the ISI caused by the frequency-dependent channel loss.
In practical designs, the FIR taps are usually driven by full-rate 1 UI-spaced sequences.
To accommodate the exponentially growing data rate, the 1 UI delay generation tech-
niques have also evolved. Fig. 2.11 summarizes the mainstream 1 UI delay generation
techniques utilized in previous FFE implementations.
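The FIR behavior of the TX-FFE can be sketched as a weighted sum of 1 UI-delayed copies of the NRZ symbol stream. The example below is a behavioral model only; the 4-tap weights are hypothetical (one pre-cursor, one main cursor, two post-cursors) and are not the coefficients used in this design.

```python
def ffe_output(bits, taps):
    """Behavioral FIR feed-forward equalizer.

    bits: iterable of 0/1 values, mapped to -1/+1 NRZ symbols.
    taps: weights c_0 .. c_{K-1}; tap k acts on the stream delayed by k UI.
    Returns one output level per UI (symbols before the first bit count as 0).
    """
    symbols = [1.0 if b else -1.0 for b in bits]
    out = []
    for n in range(len(symbols)):
        acc = 0.0
        for k, c in enumerate(taps):
            if n - k >= 0:
                acc += c * symbols[n - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    taps = [-0.10, 0.70, -0.15, -0.05]   # hypothetical weights, sum of |c_k| = 1
    bits = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    for n, level in enumerate(ffe_output(bits, taps)):
        print(f"UI {n}: {level:+.2f}")
```

Long runs of identical bits settle to a reduced level (de-emphasis) while bits adjacent to a transition are driven with a larger amplitude, which pre-compensates the high-frequency loss of the channel.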
The most general method is to utilize flip-flops (FFs) driven by a full-rate clock
to sequentially retime the serial data stream [see Fig. 2.11 (a)]. The main advantage
of this approach is its compactness, which only requires one FF for each tap sequence
generation. As the data rate exceeds the maximum reliable operation rate (e.g., 10
Gb/s for 65 nm CMOS [101]) of the FFE, the full-rate structure inevitably consumes
substantial power because every single block in it has to be realized in power-hungry
CML. Constrained by the ck-to-q delay, this FF-based 1 UI delay generator even with
CML topology fails to operate beyond 24 Gb/s in 65 nm CMOS process [34]. Another
drawback of this structure is that it needs a sophisticated full-rate clock tree to drive the
heavy loads of these retiming FFs, which results in considerable power consumption
and area occupation. The stringent full-rate timing requirement can be relaxed by a
half-rate structure based on 2:1 MUXs or a quarter-rate architecture based on 4:1 MUXs
[see Fig. 2.11(b) and (c)]. As discussed in [101], the half-rate structure in 65 nm
CMOS running at 20 Gb/s saves 12 mW (50%) of power compared with its FF-based
counterpart. Compared to the half-rate structure, the quarter-rate architecture further
relaxes the critical path timing margin from 1 UI to 3 UI and halves the maximum
clock speed, thus showing more potential in cutting-edge transceiver designs.
As the bit period approaches the delay of a single buffer, the desired 1 UI delay
can also be produced by an analog delay line [see Fig. 2.11(d)], where a DLL-based
bias generator is often integrated to adaptively tune the control voltage of the delay
line [102]. The delay cell can be implemented with LC cells [24] or CML buffers [103].
Nonetheless, these techniques suffer from either a penalty of large area occupation
(LC cells) or a cost of high power consumption (CML buffers). Additionally, the delay
produced by the analog delay line is susceptible to PVT variations, supply fluctuation,
and substrate noise. Moreover, the limited adjusting range makes this technique only
suitable for narrow-range applications [104]. As an example, the design in [16]
demonstrates that the power consumption for each tap in the LC-cell delay line-based FFE
is about 12 mW, which is much lower than that (48 mW) of the multi-MUX-based FFE.
On the other hand, it cannot support speeds below 50 Gb/s and occupies
a total area of 1.2 mm², which is about twice that of the designs based on multiple MUXs
in [104, 105].
Figure 2.12: CDR topologies without a reference. (a) Single control of VCO frequency
tuning and (b) coarse and fine control of VCO frequency tuning.
or copper media, in which space and pin count are too limited to accommodate
an external crystal oscillator. Additionally, adding a low-noise, rate-adjustable crystal
could increase the overall cost and complexity of these receivers [108].
Fig. 2.12(a) depicts a CDR without a reference clock, where the currents generated
by both the FTL/CP1 and PTL/CP2 are applied to a common LPF to produce the
control voltage of the VCO [109]. During either CDR startup or loss of phase lock, the
FD plays a key role to generate a control voltage through the CP1 and LPF to coarsely
tune the VCO oscillation frequency towards the input data rate. When the frequency
difference between the VCO and the input data falls into the capture range of the PTL,
the PD takes over to finely adjust the control voltage through the CP2 and LPF, thus
making the VCO output clock coincide with the input data phase [110]. There are two
possible issues associated with this CDR architecture. Firstly, the FTL and the PTL
may potentially interfere with each other when the voltage control is transferred from
the FD to the PD, resulting in prominent ripples on the VCO control line that could
even lead to a phase-lock failure [111]. Secondly, the FD could become momentarily
confused about the actual input data rate if the received input data contains random
consecutive identical digits or if the received rising and falling edges are corrupted
by the channel loss or electromagnetic crosstalk. To mitigate the effects of these two
issues, the loop bandwidth of the FTL is often chosen to be much smaller than that
of the PTL so as to reduce the noise contribution from the FD for ensuring the clock
quality of the VCO [111]. Meanwhile, a CDR bandwidth proportional to the data
rate is required to satisfy the protocol specification. To independently optimize the
bandwidths of the FTL and the PTL, separate LPFs are adopted in the two loops [see
Fig. 2.12(b)], where the line voltages generated by the FTL and PTL respectively
drive the coarse control and fine control of the VCO [110]. The main drawback of this
architecture is that it requires a larger area because of the two LPFs. To alleviate
this area overhead, a hybrid analog/digital loop filter is developed in [112].
[Figure 2.13 image: block diagrams of panels (a)-(d); blocks include PFD, CP, LPF, VCO,
/N dividers, and a lock detector (LD), with Fref as input and the recovered clock and
retimed data as outputs.]
Figure 2.13: CDR topologies with a reference. (a) Dual-VCO architecture, (b) sequential
locking topology, (c) PI-based structure, and (d) variant of PI-based structure.
Reference CDR- Fig. 2.13 summarizes the main CDR topologies with a reference, in
which a traditional PLL is embedded to initially adjust the VCO oscillation frequency.
Fig. 2.13(a) displays the dual-VCO architecture, which uses the conventional PLL
to lock the output clock phase of the VCO2 to that of the reference clock [113]. By
applying the control voltage of the VCO2 in the PLL to the replica VCO1 through
an additional LPF0, the oscillation frequency of the VCO1 should be very close to or
equal to the target value. The remaining frequency offset as well as the output clock
phase error with respect to the input data is finely tuned by the PTL. To accomplish a
fast lock acquisition and maintain a fine control of the VCO1, the slew rate of the FTL
should be higher than that of the PTL while the bandwidth of the FTL must be lower
than that of the PTL. On one hand, the physical separation of the FTL and the PTL
makes it easier to meet the lock-acquisition, loop stability, and tracking bandwidth
requirements. On the other hand, there are two possible problems associated with this
CDR architecture. One is the mismatch between VCO1 and VCO2, which may lead
to a difference in oscillation frequency even though the two VCOs share one coarse
control voltage. The other is the frequency pulling between the two VCOs in asyn-
chronous systems. Specifically, the data rate in an asynchronous system often allows
certain frequency offset between the transmitted data and the local clock frequency.
The frequency pulling could make the output frequency of VCO1 shift away from the
incoming data rate and towards N×Fref. This could be especially problematic when a
spread spectrum clock is required, since the pulling phenomenon may prevent the output
frequency of VCO1 from following its fine control input. Another issue associated with
this CDR is the area overhead, especially when an LC-VCO is adopted. To address
the pulling issue and reduce the area overhead, a sequential locking scheme is proposed
in [102, 114] to remove the need for dual CPs, LPFs, and VCOs. This CDR
is presented in Fig. 2.13(b), which utilizes a lock detector (LD) to alternately enable
the FTL and the PTL by continuously monitoring the frequency locking state. During
the CDR startup, the FTL is selected first to tune the control line of the VCO and pull
the oscillation frequency towards the target frequency N×Fref. If the LD detects that
the divided clock of the VCO output is locked to Fref, it disables the FTL and
enables the PTL. When frequency lock is lost, the LD switches from the PTL back
to the FTL to start lock recovery. One potential problem in this topology is that the
transition from the FTL to the PTL may disturb the VCO control voltage and therefore
cause a VCO frequency shift. Once the frequency shift is beyond the capture range of
the PTL, a failure of phase lock could happen [111].
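The hand-over performed by the lock detector can be sketched behaviourally. The snippet
below is a minimal illustration, not the circuit in [102, 114]: the function `select_loop`,
the 100 ppm lock tolerance, and the 625 MHz reference with N = 16 are all hypothetical
values chosen for the example.

```python
def select_loop(f_vco_hz: float, f_ref_hz: float, n_div: int,
                tol_ppm: float = 100.0) -> str:
    """Return the loop that the lock detector enables for the current VCO frequency.

    The divided VCO clock is compared with Fref; within the (assumed) tolerance the
    frequency is treated as locked and the PTL is enabled, otherwise the FTL is.
    """
    err_ppm = abs(f_vco_hz / n_div - f_ref_hz) / f_ref_hz * 1e6
    return "PTL" if err_ppm <= tol_ppm else "FTL"


F_REF = 625e6  # hypothetical reference frequency
N_DIV = 16     # hypothetical divide ratio, so the target is 10 GHz

# Start-up far from target, approach, lock, then a disturbance that breaks lock.
vco_trace = [9.5e9, 9.99e9, 10.0005e9, 9.9e9]
trace = [select_loop(f, F_REF, N_DIV) for f in vco_trace]
print(trace)
```

A practical lock detector would normally add hysteresis or a counting window so that a
single noisy comparison does not toggle the loops; that refinement is omitted here.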
Fig. 2.13(c) presents another typical reference CDR based on a phase interpolator
(PI) [14, 115]. The conventional PLL is adopted to provide multi-phase clocks with a
frequency of N×Fref that is very close or equal to the incoming data rate. These clocks
are further rotated by a PI driven by the PTL to make the phase of the recovered clock
lock to that of the input data. The availability of high-frequency clocks gives this
architecture faster phase acquisition, increased system stability, and less jitter
peaking. It is worth noting that jitter peaking in a PI-based CDR is absent only when
the PTL is a first-order loop and the loop latency is not significantly larger than the
phase update period. Otherwise, fast-changing jitter may have already reversed its
direction by the time the updated phase code reaches the PI [116]. Additionally, the
physical separation of the FTL and the PTL makes it easier to satisfy the loop bandwidth
and stability requirements. This separation also allows the clock lane consisting of
PLL and bias generator to be shared by multiple data lanes, thus making it a popular
architecture in parallel-lane applications. Another advantage of the PI-based CDR is
the fully digital implementation of the loop filter, which leads to smaller area
occupation and less sensitivity to PVT variations. The primary problem with this CDR is
its discrete phase update steps, which may result in prominent cycle-to-cycle jitter.
The steady-state oscillation existing in the digital PTL could make this impact even
more severe, especially when the loop latency is large. To smooth out the discrete
phase steps, the PI-based CDR evolves into the structure shown in Fig. 2.13(d), where
the feedback clock and recovered clock respectively applied to the divider and the
sampler in the PD are swapped. The primary advantage of this evolved CDR is that the
discrete phase shift in the PI can be smoothed out by the LPF in the FTL, which provides
a smooth phase shift in the PTL. However, it requires an FTL in each receiver lane,
thus making it unsuitable for multilane applications.
[Figure 2.14 image: schematic, waveform, and gain panels (a)-(f); signals Din, CK, A, B, C,
X, and Y; gain plots over -π to π annotated with KPD = (1/π)(TD) and KPD = (2/Jpp)(TD).]
Figure 2.14: Two typical CDR PDs. (a) Hogge PD implementation, (b) Hogge PD detection
mechanism, (c) Hogge PD gain, (d) Alexander PD implementation, (e) Alexander
PD detection mechanism, and (f) Alexander PD gain.
The main functions of the PD in CDR systems are to compare the phase differ-
ence between the input data and the recovered clock, provide information to adjust
the sampling position, and simultaneously retime the incoming serial signal. Fig. 2.14
summarizes the implementations and behaviors of the widely used linear Hogge PD and
non-linear Alexander PD [i.e., bang-bang PD (BBPD)].
Fig. 2.14(a) and (b) describe the implementation and operation waveforms of the
Hogge PD. The phase differences between the input data and the recovered sequence
are converted to high pulses [see signal X in Fig. 2.14(b)] by the top XOR. Meanwhile,
the reference pulses [see signal Y in Fig. 2.14(b)], each as wide as half a clock cycle,
are produced by XORing the recovered sequence and its half-clock-cycle delayed version.
Taking the width difference of X and Y as the PD output, the phase error between the
optimal sampling position (i.e., lagging the data transition by half a clock cycle) and
the rising edge of the recovered clock can be obtained. Fig. 2.14(c) gives the phase
transfer characteristics, and its PD gain can be given by,
$$K_{PD} = \frac{1}{\pi}\,(TD) \quad (\text{unit of radian}^{-1}), \tag{2.9}$$
where TD is the transition density. The main advantage of the Hogge PD is that
it provides both sign and magnitude information of the sampling phase error, which
allows a linear feedback loop to be constructed. On the other hand, there also exist
several imperfections in the Hogge PD. Firstly, the ck-to-q delay of the first
data-sampling FF widens the pulse width of signal Y, but does not impact that of signal
X, thus causing a skew of ∆T (i.e., the ck-to-q delay of the FF) when the CDR loop is
locked. This skew effect becomes a serious issue at high speeds since ∆T can occupy a
significant fraction of the clock period. The resulting phase offset may exceed several
tens of degrees, thus degrading the sampling phase margin and finally deteriorating the
jitter tolerance. This phase shift can be compensated by either narrowing the
proportional pulses or widening the reference pulses through inserting a proper dummy
delay element [67]. Nonetheless, the delay introduced by the dummy element may not
track the FF delay well against PVT variations. Another drawback of the Hogge PD stems
from the half-cycle shift between the two XOR outputs [see Fig. 2.14(b)], where the
reference pulse comes after the proportional pulse. This phase shift makes the CP driven
by the Hogge PD create tri-wave currents and hence generate ripples on the VCO control
line, which could severely disturb the VCO output phase. This tri-wave issue can be
ameliorated by introducing two additional reference pulses at a cost of one more
full-rate latch and two more power-hungry XOR gates [117]. Finally, the output pulses of
the Hogge PD are approximately half a bit period wide, which demands extremely
high-speed XOR gates to generate these narrow pulses. Combined with the complex
implementation of the XOR, the Hogge PD could become the speed bottleneck of the whole
CDR. As a consequence, the Hogge PD is suitable for CDR designs with a low to moderate
data rate, where a sufficient margin can be guaranteed for the narrow pulse generation.
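Two of the quantities discussed above are easy to evaluate numerically: the Hogge PD gain
of Eq. (2.9) and the static phase offset produced by a ck-to-q delay ∆T. The following
sketch assumes hypothetical numbers (a 10 ps ck-to-q delay and a 100 ps clock period) and
uses illustrative function names; it only restates the arithmetic and does not model the
circuit.

```python
import math


def hogge_gain(transition_density: float) -> float:
    """Eq. (2.9): K_PD = TD / pi, in rad^-1."""
    return transition_density / math.pi


def skew_degrees(ck_to_q_ps: float, clock_period_ps: float) -> float:
    """Phase offset caused by a ck-to-q delay, in degrees of one clock period."""
    return 360.0 * ck_to_q_ps / clock_period_ps


k_pd = hogge_gain(0.5)              # random NRZ data has a transition density of 0.5
offset = skew_degrees(10.0, 100.0)  # assumed: 10 ps delay, 10 GHz clock
print(f"K_PD = {k_pd:.4f} rad^-1, skew = {offset:.1f} degrees")
```

With these assumed values the skew already amounts to 36 degrees, which is consistent
with the remark that the offset can reach several tens of degrees at high clock rates.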
Fig. 2.14(d) describes the implementation of the BBPD. It utilizes three data samplers
driven by three consecutive 180°-shifted clocks along with two XOR gates to
determine whether the clock leads or lags the data when there is a data transition. If
there is no data transition, the outputs of the three samplers are identical and
hence the outputs of the two XORs remain at “0s”. In the presence of a data transition,
the BBPD produces the signals of early Y and late X by XORing the edge sample with its
previous data and following data, respectively. Fig. 2.14(e) illustrates the waveforms
under the two possible locking conditions, namely, clock Late and clock Early. The
BBPD only outputs the sign information of the phase error in the form of an early
or late pulse with a fixed width, so its gain is ideally infinite at zero phase error
[see the left diagram in Fig. 2.14(f)]. However, this gain can be linearized by the
metastability of the samplers, the time uncertainty of the input data, and the jitter of
the edge-sampling clocks. Previous studies [118, 119, 120] have demonstrated that
the overall phase transfer function of the BBPD in practical CDRs can be obtained by
convolving the ideal PD transfer function with the probability density function (PDF)
of the total jitter [see Fig. 2.14(f)] and its gain can be approximated as,
$$K_{PD} \approx \frac{2}{J_{PP}}\,(TD) \quad (\text{unit of radian}^{-1}), \tag{2.10}$$
where TD is the transition density, and J_PP denotes the peak-to-peak jitter (including
the sampler metastability, input data jitter, and edge-sampling clock jitter). The
binary quantization of the BBPD has simplified the phase comparison, which utilizes the
recovered data and quantized edge sequences to extract the early/late signals. Compared
to the linear Hogge PD that needs to process pulses no wider than half of the
bit period, the minimum pulse width involved in this nonlinear BBPD equals the bit
period. Hence, it is able to support an even higher data rate. By replacing the XORs
following the full-rate samplers with a group of parallel XORs after the demultiplexer,
the operation speed of the BBPDs can be further reduced to normal digital logic speed.
Unlike the traditional linear PD whose outputs gently toggle around zero, the outputs of
the BBPD exhibit abrupt toggling between the two states of ‘1’ and ‘0’. On one hand,
the abrupt toggling may introduce larger disturbances on the control voltage line of the
VCO-based CDRs. On the other hand, the fully digital operation makes it more
convenient to implement digital CDRs.
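The linearization behind Eq. (2.10) can be checked numerically. The sketch below averages
the ideal bang-bang characteristic over a jitter distribution, which is equivalent to the
convolution described in the text. A uniform jitter PDF of peak-to-peak width J_PP is
assumed here purely for simplicity, and the values of J_PP and TD are hypothetical.

```python
def bbpd_average_output(phase_err: float, jitter_pp: float,
                        transition_density: float = 0.5, n: int = 2001) -> float:
    """Average early/late decision (in [-TD, TD]) at a given phase error in radians.

    The ideal characteristic sign(phase_err) is averaged over n equally spaced
    jitter values spanning jitter_pp, i.e. over an assumed uniform jitter PDF.
    """
    total = 0
    for i in range(n):
        jitter = -jitter_pp / 2 + jitter_pp * i / (n - 1)
        x = phase_err + jitter
        total += (x > 0) - (x < 0)  # sign of the instantaneous phase error
    return transition_density * total / n


jpp = 0.4     # hypothetical peak-to-peak jitter in radians
td = 0.5      # transition density of random data
delta = 0.05  # small phase error used to estimate the slope around zero

slope = (bbpd_average_output(delta, jpp, td)
         - bbpd_average_output(-delta, jpp, td)) / (2 * delta)
print(f"numerical gain = {slope:.3f}, Eq. (2.10) gives {2 * td / jpp:.3f}")
```

For a Gaussian jitter PDF the slope at zero phase error would differ by a constant factor,
so the agreement shown here holds for the uniform-jitter assumption only.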
The basic function of the clocked comparator is to sample and resolve the input
signal to binary ‘0’ or ‘1’ at each rising edge of the driving clock. The output is
determined by the polarity of the sampled instantaneous value compared to a specific
reference (e.g., zero for the NRZ modulation). Unlike the digital latches which can be
described by the setup time, hold time, and latch delay, the sampling latches in analog
applications are usually characterized by their sensitivity and bandwidth [121, 6]. To
obtain correct bit streams from the attenuated noisy analog input, samplers with high
timing precision and high input sensitivity are highly desirable.
Fig. 2.15 summarizes the two most popular samplers, which are based on the CML-type
latch and the Strong-Arm latch, respectively. To convert the analog input to logic
output, the CML-latch-based clocked comparator requires two CML latches and one
CML2CMOS converter [see Fig. 2.15(a)] while the Strong-Arm-based counterpart
only needs one Strong-Arm latch and one RS latch [see Fig. 2.15(b)]. Fig. 2.15(c) and
(d) respectively display the latch sensitivity function and latch transfer function for
the two clocked comparators in Fig. 2.15(a) and (b) [6]. Referring to the discussion in
[6], the following conclusions can be made: (i) the sensitivity window of the Strong-Arm
latch is smaller than that of the CML-type latch, meaning that the Strong-Arm
latch shows better time resolution, (ii) the DC gain of the CML-type latch is 10 dB
higher than that of the Strong-Arm latch, implying that the CML-type latch
exhibits a higher sensitivity, and (iii) the gain-bandwidth product (GBW) of the CML-type
latch is higher than that of the Strong-Arm one, indicating that the CML-type latch
is more suitable for high-speed design. Fig. 2.15(e) describes the normalized energy
comparison between the aforementioned two comparators, where the Strong-Arm
latch always demonstrates a better power efficiency [7]. In practical designs, although
Strong-Arm latches provide a narrower sensitivity window and dissipate less power, CML
latches are usually used in ultra-high-speed receivers because of their large GBW,
superior sensitivity (high gain), and high immunity to power-supply fluctuation.
Additionally, it is more convenient to integrate on-chip inductors into the CML-type
latch to further extend its bandwidth.
[Figure 2.15 image: schematics of panels (a) and (b) with inputs RXP/RXN, clocks CKP/CKN,
and full-swing output Q; plots (c) normalized latch sensitivity (ps^-1), (d) gain (dB),
and (e) normalized energy versus Tcycle (ps).]
Figure 2.15: Clocked comparators. (a) CML-type latch-based comparator, (b) Strong-Arm
latch-based comparator, (c) latch sensitivity function comparison [6], (d) latch
transfer function comparison [6], and (e) energy consumption comparison [7].
[Figure 2.16 image: panels (a)-(d); input phases Φ1 ... ΦN, select signals SEL1 ... SELN,
odd and even MUXes feeding a phase mixer, an output buffer, and the output phase ΦOUT.]
Figure 2.16: PI structures and implementations. (a) Structure with direct multiple-input
phases [8, 9], (b) structure with coarse phase selection followed by a phase mixer
[10, 11], (c) inverter-based implementation [12, 13], and (d) CML-based implementation
[14, 15].
justing the weights of the two adjacent phases is usually employed in practical designs.
For an ideal multiple-input PI, the input clocks should share an equal phase spacing
between any two adjacent phases. Correspondingly, the interpolated output clock can
be represented by,
$$CK_{PI}^{ideal} = A_{PI}\, e^{j\varphi_{PI}^{ideal}} = A_{PI}\, e^{j(\psi_{i+1}-\psi_i)\cdot m/K}, \quad \psi_{i+1}-\psi_i = \frac{2\pi}{N}, \tag{2.11}$$
where N is the input phase number, $A_{PI}$ denotes the interpolated clock amplitude, K
stands for the total steps between $\psi_{i+1}$ and $\psi_i$, and $\varphi_{PI}^{ideal}$
represents the ideal output phase when the phase code m ranges from 0 to K. Considering
the fact that the phase interpolation is achieved by mixing two input phases with
different weights, the actual interpolated output signal can be calculated by,
$$CK_{pi} = A_{pi}\cdot e^{j\varphi_{pi}} = \frac{m}{N}\cdot A_0 e^{j\psi_i} + \frac{N-m}{N}\cdot A_0 e^{j\psi_{i+1}}, \tag{2.13}$$
where $A_{pi}$ and $\varphi_{pi}$ denote the instantaneous amplitude and phase of the
interpolated clock signal, respectively. Taking quadrature and octagonal PIs as examples,
Fig. 2.17 describes the phase mixing constellations and the interpolated phase step
allocations [8, 10, 122, 9]. It can be found that the maximum interpolation phase error
for the quadrature PI reaches 4° and the maximum interpolation phase error for the
octagonal PI is around 0.5°. The maximum deviation happens at the same positions for
both the quadrature and octagonal PIs, namely at 1/4 and 3/4 of the total steps between
the two mixing input phases. These phase errors stem from sweeping the weights linearly,
because the phase transfer characteristic of the PI follows an inverse trigonometric
function of the input-phase weight ratio rather than the input-phase weight ratio
itself. It is worth noting that the phase error of the octagonal PI is
smaller than that of the quadrature PI. This makes the octagonal PI a superior choice
for high-linearity phase interpolators, but the cost is the doubled input phases, complex
phase-weight coding, and complicated circuit implementation.
[Figure 2.17 image: phase constellations with axes at 0°, 90°, 180°, and 270° for
panels (a) and (b).]
Figure 2.17: (a) Phase constellation for quadrature PI, (b) phase constellation for
octagonal PI, (c) interpolated phase steps for quadrature PI in one quadrant, and (d)
interpolated phase steps for octagonal PI in one octant.
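The phase-error figures quoted above can be reproduced with a short numerical sweep. The
sketch below mixes two unit-amplitude phasors with linearly swept weights and compares the
resulting phase with the ideal linear phase. The weighting convention (weight w on the
later phase and 1 - w on the earlier one), the 64-step resolution, and the function name
are assumptions made for this illustration.

```python
import cmath
import math


def max_phase_error_deg(n_phases: int, steps: int = 64) -> float:
    """Largest deviation (degrees) between the mixed phase and the ideal linear phase."""
    spacing = 2 * math.pi / n_phases  # phase spacing between adjacent inputs
    worst = 0.0
    for m in range(steps + 1):
        w = m / steps
        # Earlier input phase placed at 0 rad, later one at `spacing`.
        mixed = (1 - w) + w * cmath.exp(1j * spacing)
        err = cmath.phase(mixed) - w * spacing
        worst = max(worst, abs(math.degrees(err)))
    return worst


quad = max_phase_error_deg(4)  # quadrature PI, 90 degree spacing
octa = max_phase_error_deg(8)  # octagonal PI, 45 degree spacing
print(f"quadrature: {quad:.2f} deg, octagonal: {octa:.2f} deg")
```

Under these assumptions the sweep gives roughly 4 degrees for the quadrature case and a
little under 0.5 degrees for the octagonal case, in line with the values stated in the text.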
The amplitude of the interpolated clock ($A_{pi}$) is also modulated by the phase code.
When the phase code is 0, the PI actually performs as a buffer with a minimum mixing
factor, hence a maximum amplitude can be obtained. As the phase code increases, the
amplitude decreases with the increasing mixing factor. Once the weights of the
two input phases are equal to each other, the mixing factor reaches its
maximum value and the amplitude decreases to its minimum value. If the phase code
continues to rise, the amplitude increases with the decreasing mixing factor and
finally returns to its maximum value. According to the discussion in [67], these
amplitude fluctuations can be converted into delay variations through amplitude
modulation (AM) to phase modulation (PM) conversion, and the delay variation is
approximately proportional to the square of the input-signal swing. Theoretically, the
maximum amplitude reduction of a quadrature PI can reach 29.3%, occurring at
half of the total steps for each quadrant. Because the phase deviation is already
present under these conditions, any extra delay caused by the AM-PM conversion can
further aggravate the maximum DNL directly. The linearity of the
phase interpolator can also be deteriorated by the I, Q mismatch, clock duty-cycle
distortion, and inadequate edge overlap of the input clocks [122, 123, 115]. To mitigate
these effects, a variety of techniques including local duty cycle correction, I, Q phase
correction, and slew rate calibration using slew buffers or harmonic rejection polyphase
filters are usually utilized to optimize the quality of the I, Q clocks [14, 123, 115].
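The 29.3% amplitude reduction follows directly from adding two equally weighted
quadrature phasors. The sketch below evaluates the mixed amplitude under the same
assumption of unit-amplitude inputs with weights w and 1 - w; the function name is
illustrative only.

```python
import cmath
import math


def relative_amplitude(weight: float, n_phases: int = 4) -> float:
    """Amplitude of the mixed clock relative to a single input phase."""
    spacing = 2 * math.pi / n_phases
    return abs((1 - weight) + weight * cmath.exp(1j * spacing))


reduction = 1 - relative_amplitude(0.5)         # mid-point of a quadrant
reduction_octa = 1 - relative_amplitude(0.5, 8)  # mid-point of an octant
print(f"quadrature: {reduction:.1%}, octagonal: {reduction_octa:.1%}")
```

At the mid-point the quadrature amplitude is 1/sqrt(2) of the input, i.e. a 29.3%
reduction, whereas the same calculation for 45 degree spacing gives about 7.6%.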
When data are transmitted over electrical media, the insertion loss caused
by the frequency-dependent skin effect and dielectric absorption could result in
prominent ISI. This ISI is directly converted into deterministic jitter that compresses
the link jitter margin and hence reduces the maximum supported data rate or deteriorates
the BER of the serial link. For instance, for a channel with 12 dB loss, the far-end
eye diagram can be completely closed. It may seem that this issue can be solved by simply
increasing the signal strength to counter the attenuation. In practical designs, however,
there exists an optimal swing for a specific channel loss. This is why many
transmitters integrate swing adjustment, which allows users to adjust the driving
strength to the optimal value for different applications. If the signal swing is too
small, the received signal could be buried by the noise, thus exhibiting a low SNR.
Theoretically, a high swing can effectively improve the SNR of the system. However, this
does not mean a higher swing is always better for the link. Firstly, the increased signal
swing does not solve the ISI problem, because the increased symbol swing also increases
the energy spread to the other symbols, so the ISI is not reduced. Secondly, the
increased swing also strengthens proportional noise sources such as reflection and
crosstalk, which could deteriorate the performance of the link. Thirdly,
the increased swing means substantially higher power consumption, as the driver needs
to draw more current. To overcome this frequency-dependent signal dispersion,
many equalization techniques have been developed to compensate for the channel loss
by either attenuating the low-frequency components or boosting the high-frequency
components [124]. This section will summarize the mainstream equalizers utilized in
high-speed links, including the FFE, CTLE, and DFE. These equalizers are usually
combined to cover a broad range of channel characteristics, especially for high-loss
legacy channels. The FFE is usually employed to cancel the pre-cursor ISI and
part of the nearby post-cursor ISI. The CTLE is often adopted to neutralize the long-tail
ISI. The DFE is frequently utilized to remove the nearby post-cursor ISI.
[Figure 2.18 image: (a) tapped delay line with delays Tb from Din, tap weights α(-n) ... α(n),
and a summer producing Dout; (b) gain (dB) versus normalized frequency up to 0.5, with the
low-frequency level marked 20log(1-2k).]
Figure 2.18: The FFE. (a) Functional block diagram, where Tb is the bit period and αn
is the weight of the nth tap. (b) Typical frequency response, where k is the summation
of the absolute tap weights.
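The argument that a larger swing does not cure ISI can be illustrated with a
peak-distortion style estimate, in which the worst-case eye opening is the cursor sample
minus the sum of the absolute ISI samples. This estimate is a standard simplification
introduced here for illustration, and the pulse-response values below are hypothetical.

```python
def worst_case_eye(pulse_response, cursor_index):
    """Worst-case vertical eye opening: cursor minus the sum of |ISI| samples."""
    cursor = pulse_response[cursor_index]
    isi = sum(abs(v) for i, v in enumerate(pulse_response) if i != cursor_index)
    return cursor - isi  # zero or negative means a closed eye


# Hypothetical symbol-spaced samples of a lossy channel pulse response (volts).
base = [0.05, 0.40, 0.30, 0.15, 0.08]
doubled = [2 * v for v in base]  # same channel, transmit swing doubled

eye_base = worst_case_eye(base, cursor_index=1)
eye_doubled = worst_case_eye(doubled, cursor_index=1)
print(f"eye at nominal swing: {eye_base:.2f} V, at doubled swing: {eye_doubled:.2f} V")
```

Because the cursor and the ISI samples scale together, the eye that is closed at the
nominal swing remains closed when the swing is doubled.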
The FFE, which is usually implemented using a finite impulse response (FIR) filter,
is one of the most common techniques in high-speed serial links. It pre-distorts
the output waveform shape over several symbols to pre-attenuate the low-frequency
portion of the transmitted signal, thus making the signal spectrum after the lossy
channel maintain a proper balance between various frequency components. Fig. 2.18(a)
describes the functional block diagram of the FFE, where Tb is the bit period and
α(l) is the normalized tap weight. Clearly, the waveform pre-distortion is actually
performed by summing the symbol-spaced streams with different tap weights. Fig.
2.18(b) displays a typical frequency response of the FFE, which demonstrates prominent
low-frequency attenuation. The maximum de-emphasis amount is 20log(1 − 2k),
where $k = \sum_{l \neq 0} |\alpha(l)|$. Note that k must be within 0 and 1/2 to perform
high-frequency boosting. For k > 1/2, the frequency response actually exhibits
attenuation rather than boosting for high frequencies. The specific response shape is
subject to the tap number as well as the tap-weight distribution. The discussion in
[101] shows that more taps help to fit the desired response. Meanwhile, the increased
tap number implies an almost linear increase of parasitic capacitance at the output
node, thus limiting the output bandwidth. To keep sufficient bandwidth and maintain an
adequate eye opening, a tap number of three or four is usually adopted for data rates
below 30 Gb/s [101, 9, 125, 13]. For the cutting-edge transmitters operating around
40-60 Gb/s, two-tap FFEs are usually adopted [23, 36, 57].
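The de-emphasis expression can be verified by evaluating the FIR response directly. The
sketch below assumes a three-tap FFE with hypothetical weights whose absolute values sum
to one and whose pre- and post-cursor taps are negative, so that the gain at the
half-baud frequency is unity; the function name and tap values are illustrative only.

```python
import cmath
import math


def ffe_gain_db(taps, first_index, f_norm):
    """Gain in dB of an FIR FFE at the normalized frequency f_norm = f * Tb.

    taps[i] holds alpha(first_index + i), e.g. first_index = -1 for one pre-cursor tap.
    """
    h = sum(a * cmath.exp(-2j * math.pi * f_norm * (first_index + i))
            for i, a in enumerate(taps))
    return 20 * math.log10(abs(h))


taps = [-0.1, 0.7, -0.2]  # hypothetical alpha(-1), alpha(0), alpha(1)
k = sum(abs(a) for i, a in enumerate(taps) if i != 1)  # sum over l != 0

dc_db = ffe_gain_db(taps, -1, 0.0)       # low-frequency gain
nyquist_db = ffe_gain_db(taps, -1, 0.5)  # gain at the half-baud frequency
print(f"k = {k:.1f}, DC gain = {dc_db:.2f} dB, half-baud gain = {nyquist_db:.2f} dB")
```

With k = 0.3 the low-frequency gain equals 20log(1 − 2k), about −8 dB, while the
half-baud component passes without attenuation.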
The FFE exhibits several unique advantages over its counterparts. Firstly, the FFE
is able to cancel pre-cursor ISI by introducing pre-cursor taps. Secondly, the FFE
shows negligible noise amplification due to its digital implementation. Thirdly, the
tap weights of the FFE can be accurately controlled by employing a high-resolution
digital-to-analog converter (DAC). For example, 5-6 bit resolution can be achieved
conveniently, which is usually accurate enough for the FFE tap-weight adjustment.
The main disadvantage of the FFE is that it is implemented by attenuating the
low-frequency portion rather than boosting the high-frequency portion. This equalization
mode can significantly reduce the eye height at the RX side. Another drawback is its
complex circuit implementation, which involves generating multiple symbol-spaced
full-rate data streams. This not only decreases the maximum allowable data rate by
introducing parasitic capacitance on the output nodes, but also increases the area
occupation and power consumption. These penalties become even more severe in
ultra-high-speed transceivers operating around the cutting-edge speed of the technology.
The FFE equalization can be located either at the TX-side or the RX-side. In the
following two paragraphs, we will separately discuss the pros/cons of the TX-FFE and
RX-FFE.
Most designs put the FFE on the TX-side for the following two reasons. One is that
the 1 UI delay can be accurately generated by simply relatching. The other is that the
coefficient multiplication can be simply performed on binary values by changing the
current-controlling codes to adjust the tap weight. Nonetheless, the TX-FFE has several
prominent disadvantages. Firstly, it is difficult to perform automatic tap-weight
adaptation since the quality of the received signal can only be known at the RX-side.
Although a back channel can be employed to transfer the continuously adjusted tap
weights [126], this extra back channel increases the system complexity in terms of extra
chip pins, complicated chip packaging, and additional PCB routing. Moreover, this
RX-side adaptation scheme may not even be available due to the problems in
interoperability, especially when the transmitter and receiver are from different
vendors [13]. Secondly, the compensation ability is limited by the allowable minimum
signal swing after de-emphasis. This is because the TX-FFE compensates for the
high-frequency channel loss by attenuating low-frequency components rather than
increasing high-frequency components in the signal.
By placing the FFE at the RX-side, the tap weights can be adapted locally at the
RX-side, thus eliminating the need for a back channel and removing the issue of TX-
RX interoperability. This also makes the driver at the TX-side simpler by removing
the combining taps, thus reducing the output capacitances and improving the driving
bandwidth. However, there also exist several drawbacks in the RX-FFE. The primary
challenge is how to generate the symbol-spaced versions of the received signal [13].
Passive delay cells using inductors and capacitors need a large area, and their tun-
able delays are not wide enough to handle a wide operation range. Active delay cells
such as CML buffers are power-hungry and distort the signal waveform due to their
delay-dependent bandwidth. Another challenge in the RX-FFE is how to carry out the
product of the coefficients and the analog signals [13]. In bipolar technology, a tradi-
tional Gilbert multiplier can be utilized to perform this multiplication. However, the
limited linearity performance of the CMOS transistors makes the Gilbert multiplier far
less accurate and its resulting distortion significantly degrades the FFE performance.
Unlike the TX-FFE that only sums the weighted digital streams, the RX-FFE process-
es the received signal containing both the useful signal information and useless noise
disturbance. Therefore, the high-frequency components of the noise are also boosted,
which is not desired in high-speed communication systems.
[Figure 2.19 image: panels (a)-(d); passive RC network (R1, C1, R2, C2), source-degenerated
differential pair (RL, CL, RD, CD, ISS), and plots of gain (dB) versus angular frequency
annotated with ωz, ωp, ωp1, ωp2, and a 20 dB/dec slope.]
Figure 2.19: The CTLE. (a) Passive implementation, (b) frequency response of the
passive CTLE, (c) active implementation, and (d) frequency response of the active
CTLE. Here, ωz is the angular frequency of the zero and ωp is the angular frequency
of the pole.
s. As the CTLE sharpens both the rising and falling edges of the received signal, it
is capable of canceling the long-tail ISI caused by both the pre-cursor and
post-cursor taps. Similar to the RX-FFE, there are also some drawbacks associated
with the CTLE. Firstly, the equalization ability of the CTLE is limited to first-order
compensation. Secondly, it also amplifies the noise and crosstalk in the boosting band.
Thirdly, its gain boosting is sensitive to PVT variations, and the tuning range is small.
Finally, its operation speed is limited by the GBW product of the amplifier.
The CTLE can be realized in both passive components and active devices [124].
Fig. 2.19 displays both passive and active implementations of the CTLE and their fre-
quency responses. For the passive CTLE shown in Fig. 2.19(a), the frequency shaping
is achieved by a simple RC network, where low-frequency components are attenuat-
ed by the resistor and the high-frequency components are allowed to pass through the
capacitor, thus leading to high-frequency gain boosting. According to the signal pro-
cessing theorems, the transfer function and the associated pole-zero positions can be
calculated by,
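$$H(s) = \frac{R_2}{R_1 + R_2} \cdot \frac{1 + R_1 C_1 s}{1 + \frac{R_1 R_2}{R_1 + R_2}(C_1 + C_2)s}, \tag{2.14}$$
$$\omega_z = \frac{1}{R_1 C_1}, \tag{2.15}$$
$$\omega_p = \frac{1}{\frac{R_1 R_2}{R_1 + R_2}(C_1 + C_2)}, \tag{2.16}$$
$$\text{DC-Gain} = \frac{R_2}{R_1 + R_2}, \tag{2.17}$$
$$\text{Gain-Boost} = 20\log\frac{\omega_p}{\omega_z}. \tag{2.18}$$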
Fig. 2.19(b) displays a typical frequency response of the passive CTLE. The boosting
frequency components are determined by the locations of the zero and the pole while
the boosting factor can be approximated by the ratio of the pole to the zero, since the
frequency response rises at 20 dB/dec between them [see Eq. (2.18)]. By appropriately
choosing the resistor/capacitor values that determine the positions of the zero and
the pole, reasonable gain boosting including both frequency components and boosting
amounts can be achieved. The main feature of this equalizer is its compact imple-
mentation and zero power consumption since it only contains passive components of
resistors and capacitors. However, there are three prominent disadvantages in this sim-
ple RC equalizer. Firstly, the RC network introduces large impedance discontinuity
at the interface between the channel and the equalizer, which could cause significant
reflection. Secondly, this approach cannot improve the SNR since the equalization is
performed by attenuating low-frequency components. Thirdly, it is not convenient to
adjust the boosting parameters, since making the RC values configurable adds
overhead to the highest-speed critical path. Therefore, this technique has
seldom been utilized in high-speed serial links.
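As a rough numerical illustration of Eqs. (2.14) to (2.18), the following Python sketch evaluates the zero, the pole, the DC gain, and the gain boost of the passive CTLE. The component values and the function names are hypothetical and chosen only for illustration; they are not taken from any design in this work.

```python
import math

def passive_ctle_params(r1, c1, r2, c2):
    """Zero and pole (rad/s), DC gain (dB), and gain boost (dB), per Eqs. (2.15)-(2.18)."""
    r_par = r1 * r2 / (r1 + r2)            # R1 in parallel with R2
    w_z = 1.0 / (r1 * c1)
    w_p = 1.0 / (r_par * (c1 + c2))
    dc_gain_db = 20.0 * math.log10(r2 / (r1 + r2))
    boost_db = 20.0 * math.log10(w_p / w_z)
    return w_z, w_p, dc_gain_db, boost_db

def passive_ctle_mag_db(f_hz, r1, c1, r2, c2):
    """Magnitude of H(j*2*pi*f) from Eq. (2.14), in dB."""
    s = 1j * 2.0 * math.pi * f_hz
    r_par = r1 * r2 / (r1 + r2)
    h = (r2 / (r1 + r2)) * (1.0 + r1 * c1 * s) / (1.0 + r_par * (c1 + c2) * s)
    return 20.0 * math.log10(abs(h))

# Hypothetical component values (ohms and farads), for illustration only.
R1, C1, R2, C2 = 300.0, 200e-15, 100.0, 50e-15
w_z, w_p, dc_db, boost_db = passive_ctle_params(R1, C1, R2, C2)
print(f"zero at {w_z / (2 * math.pi) / 1e9:.2f} GHz, pole at {w_p / (2 * math.pi) / 1e9:.2f} GHz")
print(f"DC gain {dc_db:.2f} dB, gain boost {boost_db:.2f} dB")
```

With these values the high-frequency gain approaches DC gain plus boost and stays below 0 dB, which reflects the statement above that the passive equalizer boosts only by attenuating the low-frequency components.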
Fig. 2.19(c) presents the widely used active CTLE implementation. It utilizes
RC source degeneration to provide different gains at different frequencies in order to
realize high-frequency boosting. By analyzing the linear equivalent half circuits, the
transfer function and the associated pole-zero positions can be obtained as,
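$$H(s) = \frac{g_m}{C_L} \cdot \frac{s + \frac{1}{R_D C_D}}{s + \frac{g_m R_D + 1}{R_D C_D}} \cdot \frac{1}{s + \frac{1}{R_L C_L}}, \tag{2.19}$$
$$\omega_z = \frac{1}{R_D C_D}, \tag{2.20}$$
$$\omega_{p1} = \frac{g_m R_D + 1}{R_D C_D}, \tag{2.21}$$
$$\omega_{p2} = \frac{1}{R_L C_L}, \tag{2.22}$$
$$\text{DC-Gain} = \frac{g_m R_L}{g_m R_D + 1}, \tag{2.23}$$
$$\text{Gain-Boost} = 20\log\frac{\omega_{p1}}{\omega_z}. \tag{2.24}$$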
Fig. 2.19(d) presents a typical frequency response of the active CTLE. The response
shape is mainly constrained by ωz, ωp1, and the DC gain, since the second pole is usually
determined by the load resistor and output capacitor. The boosting ability of the active
CTLE is usually changed by adjusting the source-degeneration RC network [see Fig.
2.19(c)]. As the control voltage VCTLE is tuned from high to low, both the equivalent
resistor RD and the equivalent capacitor CD become larger. The zero (ωz) therefore
moves to a lower frequency while the ratio of the dominant pole to the zero (ωp1/ωz)
increases, so both the boosted frequency band and the boosting gain are extended. Inductive
peaking [127] or forward-coupling capacitance neutralization [124] can be used to
further increase the bandwidth of the CTLE and hence enhance its gain-boost ability.
Compared to the passive CTLE, the main feature of the active CTLE is its ability to
achieve higher gains (over 0 dB) for both low and high frequency components, which
helps to improve the SNR to optimize the BER of the link.
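The relations in Eqs. (2.19) to (2.24) can be checked numerically with a short Python sketch. The device values below (gm, RD, CD, RL, CL) and the function names are hypothetical and serve only to show how the degeneration network sets the zero and the first pole, while the load sets the second pole.

```python
import math

def active_ctle_params(gm, rd, cd, rl, cl):
    """Zero, poles (rad/s), DC gain (dB), and ideal gain boost (dB), per Eqs. (2.20)-(2.24)."""
    w_z = 1.0 / (rd * cd)
    w_p1 = (gm * rd + 1.0) / (rd * cd)
    w_p2 = 1.0 / (rl * cl)
    dc_gain_db = 20.0 * math.log10(gm * rl / (gm * rd + 1.0))
    boost_db = 20.0 * math.log10(w_p1 / w_z)
    return w_z, w_p1, w_p2, dc_gain_db, boost_db

def active_ctle_mag_db(f_hz, gm, rd, cd, rl, cl):
    """Magnitude of H(j*2*pi*f) from Eq. (2.19), in dB."""
    s = 1j * 2.0 * math.pi * f_hz
    h = (gm / cl) * (s + 1.0 / (rd * cd)) / (
        (s + (gm * rd + 1.0) / (rd * cd)) * (s + 1.0 / (rl * cl)))
    return 20.0 * math.log10(abs(h))

# Hypothetical values: 20 mS, 200 ohm, 200 fF, 100 ohm, 20 fF.
GM, RD, CD, RL, CL = 20e-3, 200.0, 200e-15, 100.0, 20e-15
w_z, w_p1, w_p2, dc_db, boost_db = active_ctle_params(GM, RD, CD, RL, CL)
f_p1 = w_p1 / (2.0 * math.pi)
print(f"DC gain {dc_db:.2f} dB, ideal boost {boost_db:.2f} dB")
print(f"gain at first pole ({f_p1 / 1e9:.1f} GHz): "
      f"{active_ctle_mag_db(f_p1, GM, RD, CD, RL, CL):.2f} dB")
```

Because the second pole is not infinitely far away, the gain reached near ωp1 falls short of the ideal DC gain plus boost, which is one way to see why the response shape is constrained by the load pole.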
The DFE is another effective signal conditioning technique to cancel the ISI caused
by frequency-dependent channel loss, and it is commonly implemented at the RX
side in serial links. Fig. 2.20 gives the conceptual diagram and typical frequency
response of the DFE. It works by directly subtracting (or adding) the previous decisions
multiplied by their corresponding tap weights. This previous-decision-based
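[Figure 2.20 graphic: (a) DFE block diagram with input Din, a summing node, and Tb delay elements; (b) gain (dB) versus normalized frequency.]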
Figure 2.20: The DFE. (a) Functional diagram, where Tb is the bit period and αn is
the tap weight of the nth tap. (b) Typical frequency response, where the frequency is
normalized to the value of the data rate.
ISI cancellation not only increases the boosting factor of the DFE, but also makes
it immune to noise amplification since the feedback signal is the scaled version of
well-recovered digital streams. Similar to the FFE, for a fixed tap-weight summation
$k$, where $k = \sum_{l=1}^{n} |\alpha(l)|$ $(0 < k < 1)$, the maximum boost factor is a constant
[$20\log((1 + k)/(1 - k))$], while the tap number and tap weight distribution only affect
the shape of the response.
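The dependence of the maximum boost on the tap-weight summation alone can be illustrated with a few lines of Python. The tap weights below are hypothetical, and the helper name is introduced only for this sketch.

```python
import math

def dfe_max_boost_db(tap_weights):
    """Maximum boost 20*log10((1 + k)/(1 - k)), with k the sum of |tap weights|."""
    k = sum(abs(a) for a in tap_weights)
    if not 0.0 < k < 1.0:
        raise ValueError("the tap-weight summation k must satisfy 0 < k < 1")
    return 20.0 * math.log10((1.0 + k) / (1.0 - k))

# Two hypothetical tap distributions with the same summation k = 0.4.
boost_a = dfe_max_boost_db([0.3, 0.1])
boost_b = dfe_max_boost_db([0.2, -0.2])
print(f"{boost_a:.2f} dB, {boost_b:.2f} dB")
```

Both distributions give the same maximum boost, in line with the statement that the tap number and the tap weight distribution only shape the response.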
There are three issues in the DFE design [128]. Firstly, the DFE suffers from error
propagation because the ISI cancellation is based on the assumption that
all the previous decisions are correct. When there are bit errors, the subtraction or
addition of the scaled decisions will exacerbate the ISI rather than cancel it. Fortunately,
this error propagation can be neglected for a robust serial link since its BER is usually
lower than $10^{-12}$. Secondly, the DFE can only remove post-cursor ISI as the feedback
sequences can only be the previously received data. This is the reason why the DFE
is usually combined with the TX-FFE and/or RX-CTLE to cancel the ISI caused by
both the pre-cursors and post-cursors. Finally, the DFE implementation suffers from
a stringent timing problem. As described in Fig. 2.20(a), there are two possible criti-
cal paths. One is the feedback loop of the first tap, whose timing requirement can be
expressed by,
$$t_{cq}^{slicer} + t_{setup}^{slicer} + t_{fb} < 1\,\mathrm{UI}, \tag{2.25}$$
where $t_{cq}^{slicer}$ is the ck-to-q delay of the slicer, $t_{setup}^{slicer}$ denotes the setup time of the slicer,
and $t_{fb}$ stands for the feedback path delay. The other is the feedback loop on other
taps, whose loop delay also must be lower than 1 UI,
$$t_{cq}^{ff} + t_{setup}^{slicer} + t_{fb} < 1\,\mathrm{UI}, \tag{2.26}$$
where $t_{cq}^{ff}$ represents the ck-to-q delay of the retiming FF. The main difference between
these two loops is that their ck-to-q delays come from different components. Compared
to the FF, which retimes the sliced full-swing data sequence, the slicer needs to regenerate
the digital output from a small input. Consequently, $t_{cq}^{slicer}$ is larger
than $t_{cq}^{ff}$, which makes the timing budget for the first tap [see Eq. (2.25)] tighter than
that for the other taps [see Eq. (2.26)]. This is the reason why various techniques have been
developed to relax the first-tap timing requirement.
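The two timing constraints in Eqs. (2.25) and (2.26) amount to a simple budget check against one unit interval. The sketch below uses hypothetical delay values (in picoseconds) at 40 Gb/s; none of the numbers are measured or simulated results from this work.

```python
def dfe_timing_slack_ps(data_rate_gbps, t_cq_ps, t_setup_ps, t_fb_ps):
    """Slack of a DFE feedback loop against 1 UI; a negative value violates the constraint."""
    ui_ps = 1000.0 / data_rate_gbps          # one unit interval in ps
    return ui_ps - (t_cq_ps + t_setup_ps + t_fb_ps)

# Hypothetical delays: the slicer regenerates from a small input, so its ck-to-q
# delay is assumed larger than that of the retiming flip-flop.
first_tap_slack = dfe_timing_slack_ps(40.0, t_cq_ps=14.0, t_setup_ps=6.0, t_fb_ps=7.0)
other_tap_slack = dfe_timing_slack_ps(40.0, t_cq_ps=9.0, t_setup_ps=6.0, t_fb_ps=7.0)
print(f"first tap slack: {first_tap_slack} ps, other taps slack: {other_tap_slack} ps")
```

With these assumed numbers the first-tap loop misses the 25 ps budget while the other taps meet it, which mirrors the tighter first-tap requirement described above.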
In practical transmission systems, the connection channels usually have the fol-
lowing features. Firstly, the exact channel profiles in practical serial links are usually
unknown in advance. Secondly, the channel length can vary from one application to
another. Thirdly, the channel profile may change due to the fabrication variation. Fi-
nally, the channel profile will vary in real time with its operation environment, which
becomes particularly severe for data rates beyond 10 Gb/s. To accommodate
different channel losses and track real-time channel variations, many adaptive
equalization techniques such as least mean square (LMS) [129, 34, 130, 131], zero-forcing
(ZF) [105, 132], maximum eye opening (MEO) [133], and spectrum matching [134]
have been developed. Fig. 2.21 summarizes the conceptual diagrams of these
adaptation methods.
Algorithm-Based Adaptation- Fig. 2.21(a) describes the conceptual diagram of the
most widely used algorithm-based adaptation, which can be applied to any type of
equalizer, including the FFE, CTLE, and DFE. There are many algorithms that can be
used to adjust the equalizer coefficients, but only a few of them are suitable for
on-chip integration. The most popular ones for compact hardware implementation are
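[Figure 2.21 graphic: conceptual diagrams of the adaptation methods, showing equalizer blocks with input xk, output yk, and decision dk; panel (c) shows a CTLE controlled by VCTLE with filter blocks split at the frequency fm.]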
where $\alpha(k,l)$ denotes the $l$th tap weight at the $k$th iteration, $\lambda$ is the update step
size, $x_k$ stands for the samples at the channel output, $\hat{d}_k$ is the estimate of the
transmitted data, and $e_k = \hat{d}_k - y_k$ represents the equalization error. The requirement
for analog multiplications ($x_k$ and $e_k$ are naturally analog signals) in Eq.
(2.27) makes it difficult to implement in hardware, thus reducing its
competitiveness in equalization coefficient adaptation. To reduce the complexity
of the traditional LMS, the sign-sign LMS (SS-LMS) algorithm has been developed,
which utilizes the binary quantized sign($e_k$) and sign($x_{k-l}$) to replace the
analog $e_k$ and $x_{k-l}$ in Eq. (2.27). The update iteration is then changed to,
Considering that the binary quantized sign($e_k$) and sign($x_{k-l}$) can be
directly mapped from the sliced error sequence and recovered data stream, the
SS-LMS obviates the need for analog operations, making it more feasible
for on-chip integration. Since the binary quantization significantly reduces the
iterative accuracy, the convergence time of the SS-LMS is generally longer than
that of the traditional LMS. Fortunately, this increased convergence time is not a
problem in most serial links.
2. The ZF solution is obtained by forcing residual ISI in the decision instant to zero
[135], which can be theoretically achieved by completely inverting the channel
response HC (s) [136],
$$H_E(s) = \frac{1}{H_C(s)}, \tag{2.29}$$
where $H_E(s)$ is the frequency response of the equalizer. The resulting overall
transfer function of the cascaded equalizer and channel should be
flat. Optimal ZF equalization requires equalization filters with infinite taps to fit
the long-tail impulse response. In practical implementations, suitable truncation
is usually applied to construct a finite impulse response (FIR) to approximate the
infinite impulse response (IIR). This method is suitable for a time-invariant
channel that is well known in advance. To adaptively adjust the equalizer
coefficients and track slow channel changes, the equalizer coefficients can
be updated by the following iteration [137],
$$\alpha_{k+1} = \alpha_k - \lambda \cdot e_k \cdot x_k, \qquad e_k = \hat{s}_k - s_k, \quad \hat{s}_k = x_k^T \alpha_k, \tag{2.30}$$
where $\alpha_k$ is the equalizer coefficient vector, $\lambda$ denotes the update step that
controls the adaptation rate, $e_k$ stands for the error vector, $\hat{s}_k$ represents the estimate
vector of the transmitted data, $x_k$ is a vector composed of the input signal
applied to the equalizer, and $s_k$ denotes a vector consisting of the training
symbols. Note that the subscript $k$ or $k + 1$ refers to the $k$th or $(k + 1)$th
iteration. Comparing Eq. (2.30) to Eq. (2.27), we can find that the ZF algorithm is
equivalent to the LMS for FIR equalizers.
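The coefficient update of Eq. (2.30), together with a sign-sign variant in the spirit of the SS-LMS described above, can be sketched in Python as follows. This is only a behavioral illustration under stated assumptions: the two-tap channel, the step sizes, the scalar (per-symbol) error, and the function names are hypothetical, and the sign-sign branch simply replaces the error and the input samples by their signs, as the text describes.

```python
import random

def fir_channel(symbols, h):
    """Pass the symbols through an FIR channel with impulse response h."""
    return [sum(h[i] * symbols[k - i] for i in range(len(h)) if k - i >= 0)
            for k in range(len(symbols))]

def sign(v):
    return (v > 0) - (v < 0)

def adapt_taps(rx, train, n_taps, step, sign_sign=False):
    """Adapt FIR equalizer taps and return (taps, squared_errors).

    Update: alpha <- alpha - step * e_k * x_k, with e_k = s_hat_k - s_k and
    s_hat_k = x_k^T alpha, following the form of Eq. (2.30). With
    sign_sign=True, e_k and the entries of x_k are replaced by their signs.
    """
    alpha = [1.0] + [0.0] * (n_taps - 1)      # start from a pass-through filter
    squared_errors = []
    for k in range(len(rx)):
        x = [rx[k - l] if k - l >= 0 else 0.0 for l in range(n_taps)]
        s_hat = sum(a * xi for a, xi in zip(alpha, x))
        e = s_hat - train[k]
        squared_errors.append(e * e)
        if sign_sign:
            alpha = [a - step * sign(e) * sign(xi) for a, xi in zip(alpha, x)]
        else:
            alpha = [a - step * e * xi for a, xi in zip(alpha, x)]
    return alpha, squared_errors

def mean(values):
    return sum(values) / len(values)

rng = random.Random(1)
symbols = [rng.choice((-1.0, 1.0)) for _ in range(4000)]   # training symbols
rx = fir_channel(symbols, [1.0, 0.5])    # hypothetical channel with one post-cursor
taps_lms, err_lms = adapt_taps(rx, symbols, n_taps=4, step=0.01)
taps_ss, err_ss = adapt_taps(rx, symbols, n_taps=4, step=0.001, sign_sign=True)
print("LMS taps:", [round(a, 3) for a in taps_lms], "final MSE:", mean(err_lms[-500:]))
print("SS-LMS taps:", [round(a, 3) for a in taps_ss], "final MSE:", mean(err_ss[-500:]))
```

The sign-sign branch needs no multiplications of analog quantities, at the cost of a smaller effective step and hence a longer convergence, consistent with the trade-off discussed for the SS-LMS.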
The errors utilized in the aforementioned LMS and ZF algorithms are extracted by
measuring the amplitude differences between the equalized and desired outputs that
are sampled at the data-sampling positions. This level-based error extraction method
involves both data recovery and peak detection [138]. Moreover, this configuration of-
ten requires additional slicers or even an analog-to-digital converter (ADC) to extract
the amplitude errors between the equalized and expected eye heights, which makes it
less competitive for high-speed applications due to the following reasons. Firstly, these
auxiliary circuits (slicers or ADC) degrade the maximum bandwidth because their in-
put capacitances are directly connected to the maximum-speed signal path. Secondly,
the additional high-speed circuits will inevitably introduce more connections, which
not only makes the layout routing more complicated but also increases the parasitic
capacitances. Thirdly, the additional circuits consume considerable power since they
need to operate at the maximum speed. Meanwhile, the residual ISI can also be min-
imized using the errors at the crossing points since the ISI at the crossing points is
heavily correlated to the transmitted data for bandwidth-limited systems [138].
Leveraging this characteristic, Xilinx [138, 139, 42] has developed an edge-based algorithm,
where the error in Eq. (2.29) is replaced with the error at the crossing points. These
errors can be directly mapped from the quantized edge sequence that is indispensable
for the CDR. Consequently, the additional samplers can be obviated to optimize the
critical path capacitances and improve the power efficiency. Note that, owing to its
indirect nature, the edge-based algorithm is relatively less effective than
its level-based counterpart. Fortunately, simulation results indicate that for
low-loss applications, the edge-based adaptation is sufficient to guarantee an acceptable eye
opening at the data-sampling point [138].
Eye Monitor-Based Adaptation- Fig. 2.21(b) describes the eye monitor-based
adaptation, which is also applicable to any type of equalizer structure. The optimal
equalization coefficients are obtained by maximizing the eye opening. The
two-dimensional mask of the eye opening can be obtained by monitoring the BER while
adjusting the sampling position and slicing levels of the error-detection slicer [140].
As for the adaptive equalization process, the eye masks with different equalization
coefficients are first measured by the internal eye monitor under the control of the
coefficient scanning engine [see Fig. 2.21(b)]. The optimal coefficient configuration is
then selected by a maximum-eye-opening searching algorithm. This method can
produce a visualized eye diagram with distinct eye width and eye height, thus providing
an intuitive window to observe the equalization effect. Nonetheless, there are two
drawbacks in the eye monitor-based equalization adaptation. One is that the high power
consumption of the eye monitor (including the full-rate slicer, clock PI, driving buffer,
scanning engine, and searching algorithm) can significantly degrade the power
efficiency of the serial link. The other is the trade-off among design complexity,
scanning speed, and measuring accuracy. Precise eye measurement needs a high-resolution
DAC and PI for slicing-level adjustment and sampling-position moving, which not
only complicates the design but also significantly prolongs the eye-scanning time. The
eye monitor presented in [141] shows that the combination of a 3-bit DAC and a 4-bit
PI contributes a total of 210 different masks, and it is a good balance for a 10 Gb/s
design. In addition, the eye-scanning accuracy is also limited by the slicer sensitivity,
slicer offset, and PI nonlinearity.
Spectrum Matching-Based Adaptation- Fig. 2.21(c) presents the spectrum matching-
based adaptation, which is applicable to the RX-CTLE and one tap RX-FFE as it only
provides one control voltage [134, 142, 143]. The control voltage is optimized by forc-
ing the imbalance of the spectrum split by the frequency fm to zero [see Fig. 2.21(c)],
where fm equals 0.28/Tb and Tb is the bit period [134]. This fm is selected based on
the fact that it equally splits the power of the spectrum for ideal random
binary sequences. Note that setting fm to 0.28/Tb is valid only for purely random
or pseudo-random data streams [134]. There are several difficulties in this adaptation
method. Firstly, the LPF and HPF are directly connected to the critical path [Fig.
2.21(c)], which could degrade the maximum bandwidth. Secondly, the splitting
bandwidth is difficult to control since the passive components utilized in the LPF and HPF
are sensitive to PVT variations. Thirdly, the effective power detection is challenging,
especially for the high-frequency power detection. Finally, the accuracy is limited by
various system uncertainties. For example, unbalanced power detection between
the low-frequency and high-frequency rectifiers could lead to underestimation or
overestimation of the boosting factors, thus resulting in a suboptimal solution [142].
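The choice of fm = 0.28/Tb can be sanity-checked numerically. The sketch below assumes that ideal random NRZ data has a power spectral density proportional to sinc²(f·Tb), which is a standard textbook model and not something derived in this chapter, and it integrates that spectrum below fm. The function names are hypothetical.

```python
import math

def nrz_psd(x):
    """Normalized PSD of ideal random NRZ data versus x = f * Tb, i.e. sinc^2(x)."""
    if x == 0.0:
        return 1.0
    return (math.sin(math.pi * x) / (math.pi * x)) ** 2

def low_band_fraction(fm_tb, steps=20000):
    """Fraction of the total (one-sided) power that lies below fm."""
    dx = fm_tb / steps
    low = sum(nrz_psd((i + 0.5) * dx) for i in range(steps)) * dx   # midpoint rule
    total = 0.5     # closed form: integral of sinc^2(x) from 0 to infinity
    return low / total

fraction = low_band_fraction(0.28)
print(f"power below fm = 0.28/Tb: {100.0 * fraction:.1f} % of the total")
```

Under this assumed spectrum the low band holds slightly more than half of the power, so 0.28/Tb gives a close-to-even split; data with a different spectrum (for example, coded or patterned data) would shift the balance point, which is consistent with the restriction to random data noted above.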
Chapter 3
Chapter 3. Design of the Ring-Based Injection-Locked Clock Multiplier (RILCM)
It is a nontrivial task to design a robust RILCM for practical applications and the
challenges mainly focus on the following aspects. Firstly, the jitter suppression is
sensitive to the frequency offset between the target frequency and the free-running
frequency of the oscillator. Specifically, the phase noise tracking ability will decline
rapidly as the frequency offset increases [81]. Moreover, it is quite challenging to de-
tect the frequency offset since the accumulated phase error can always be reset by the
injection pulse. Therefore, the free-running frequency of the voltage-controlled oscil-
lator (VCO) should be tuned as close as possible to the center of the locking range to
obtain an optimum jitter suppression. Secondly, the injection locking technique cannot
[Figure 3.1 graphic: block diagrams (a) to (e) built from pulse generator, PFD, CP, LPF, VCDL, /N divider, replica VCO/VCDL, DAC, and auxiliary-loop blocks, with signals fref, fout, and Vctrl and delays ti and td.]
Figure 3.1: Previous frequency tracking techniques. (a) Traditional IL-PLL, (b) IL-
PLL with DLL-based injection position adjustment, (c) dual-loop architecture with
replica-VCO/VCDL, (d) TDC-based FTL, and (e) TPD-based FTL.
suppress the $1/f^3$ noise of the VCO. This is because the injection locking is actually
equivalent to a single-pole feedback system that can only achieve 20 dB/dec of in-band
noise shaping [82, 89]. It means that the injection locking technique suppresses the
$1/f^2$ noise (converted from white noise) of the VCO but not its $1/f^3$ noise (converted
from flicker noise). Thirdly, the injected VCO may lock to an undesired harmonic
frequency of the injection signal [151]. This can be traditionally solved by introducing a
beginning-calibration procedure [152, 145, 153] to initially adjust the control voltage
close to the desired value. However, it cannot prevent the hidden risk of possibly losing
lock due to its limited lock-in range and weak lock-acquisition ability [86, 87]. As a
consequence, robust frequency tracking techniques with low-frequency noise suppres-
sion abilities are highly demanded to overcome these difficulties.
Fig. 3.1 summarizes the previous frequency tracking techniques that are utilized
to address the aforementioned issues. According to the frequency offset detection
mechanism, they are categorized into two different classes. One is based on traditional
PLL/DLL [see Fig. 3.1(a), (b), and (c)] and the other is based on injection caused
phase shift detection (PSD) [see Fig. 3.1(d) and (e)].
Fig. 3.1(a) shows the most general injection-locked (IL)-PLL, where the PLL sys-
tem keeps the natural frequency of the VCO located at the desired frequency harmonic
[88]. The main problem of this scheme is the mutual-pulling between the PLL locking
force and the injection locking force, which could degrade the jitter performance or
even result in a stability problem. The mutual-pulling is usually caused by the delay
mismatch between ti (the intrinsic delay of the pulse generator) and td (the delay of
the asynchronous divider) [see Fig. 3.1(a)], and their delay fluctuations over different
PVT corners make it even more difficult to handle. This problem was solved in [82]
by adding a voltage-controlled delay line (VCDL) preceding the pulse generator [see
Fig. 3.1(b)]. Driven by the DLL loop, the delay of the VCDL is adaptively adjust-
ed to maintain an optimal injection position. This method removes the timing issue
with the penalty of an additional DLL loop. Fig. 3.1(c) describes the dual-loop ar-
chitecture, where the frequency deviation is monitored by a separate PLL utilizing a
replica-VCO [83] or an independent DLL using the same delay cell as the main
VCO [84]. The physical separation of the FTL and the injection-locked oscillator (ILO)
can effectively prevent the mutual-pulling problem between the two locking forces.
However, there are still several drawbacks within this architecture. Firstly, the
auxiliary PLL/DLL consumes substantial extra power, which lowers the power efficiency.
Secondly, the fabrication mismatch constrains the calibration precision. Thirdly, the
separate FTL cannot suppress the $1/f^3$ noise of the VCO, since the flicker noise tracked
by the PLL/DLL is independent of that in the main VCO. Generally, the common
feature of the above-mentioned architectures is employing an additional PLL/DLL loop to
correct the frequency offset. Hence, they can be classified as PLL/DLL-based FTLs.
The main drawback of these FTLs is their low efficiency in power consumption and
area occupation.
Meanwhile, the PSD-based FTLs are attracting more attention because of their
low power consumption and high jitter performance. By merging the frequency offset
detection and the injection error detection into one single PSD, the always-on
PLL/DLL in the aforementioned FTLs is only required to work during the frequency
initialization, thus saving substantial power. Since the phase disturbance induced by
the device noise is detected by the PSD without distinction, the FTL is capable of
attenuating the VCO in-band noise like a traditional IL-PLL. Combined with the
20 dB/dec noise shaping introduced by the injection
locking, the $1/f^3$ noise of the VCO can be completely suppressed. Fig. 3.1(d) and (e)
present the digital and analog PSD-based FTLs, respectively [85, 146]. The former
adopts a time-to-digital converter (TDC) to measure the periodic phase errors caused
by frequency offset [85], while the latter utilizes a TPD to detect the phase shift
between the injection-pulse center and the zero-crossing point of the IL-VCO [146]. For
the TDC-based FTL, its performance is restricted by the TDC resolution and control
voltage granularity. The complex logic operation associated with the complicated
circuit implementation also reduces its power efficiency. In contrast, the TPD-based FTL
shows superior power efficiency since its operation only involves the TPD, CP, and
LPF. As an example, the IL-PLL designed in [146] with the TPD-based FTL achieves
a figure-of-merit (FOM) of -247 dB at 3.2 GHz. However, there still exist several
challenges within the TPD-based FTL. Firstly, it is quite challenging to design a high-speed
TPD since it needs to process the highest-speed injection pulse. Secondly, the TPD
must have high detection accuracy to distinguish the small phase shift caused by the
frequency offset. Thirdly, the hidden risk of possibly losing lock, along with its limited
locking range and weak lock-acquisition ability, reduces its robustness and reliability
[86, 87]. This work aims to address these issues in the TPD-based FTL.
[Figure 3.2 graphic: block diagram of the proposed RILCM, with the pulse generator and injection path, polarity detector, TPD/CP2 (timing-adjusted loop), PFD/CP1 (phase-locked loop), MUX, LPF, VCO, divide-by-2 stages (DIV4, DIV4_90), lock detector, and loop selector; signals include REF_CLK, INJ_LOCK, FRE_LOCK, TPD_EN, PFD_EN, and EXT_MODE_SEL.]
Fig. 3.2 shows the block diagram of the proposed RILCM. It contains a pulse
generator (PG) and a hybrid FTL consisting of a traditional PLL, a timing-adjusted
loop (TAL), and a loop-selection state machine (LSSM). Driven by the LSSM, the
LPF/VCO alternately connects to PFD/CP1 and TPD/CP2 to accomplish lock acquisi-
tion. When the FTL switches from PLL to TAL, the resistor in series with the capacitor
is shorted to remove the stabilizing zero in the loop gain. This is because the injection
locking gives rise to the inclusion of a high pass filter within the TAL, thus making it
a first-order system.
This design has two main features. One is the newly developed TPD, which uses
only a few transistors to achieve both high detection accuracy and high operation speed.
Meanwhile, a polarity detection mechanism is introduced to avoid positive feedback.
The other is the LLD-LR introduced in the hybrid FTL, which automatically switches
the FTL to traditional PLL mode for a specific duration to undertake lock recovery
if an injection-lock loss is detected. In doing so, the pull-in range of the RILCM
is effectively extended, which not only solves the problem of initial lock acquisition but
also prevents the hidden risk of losing lock in normal operation mode. Owing to these
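[Figure 3.3 graphic: linear phase-domain model with total input phase θi(s), injection path N·Hinj(s), and feedback divider /N.]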
Figure 3.3: Linear model of the RILCM in case of the injection-locked condition,
where θref (s), θi (s), θo (s), θn,ref (s), θn,vco (s) represent the reference input phase,
total input phase, output phase, reference input noise, and VCO noise, respectively.
two techniques, the proposed RILCM effectively prevents the mutual-pulling issue in
conventional IL-PLLs while keeping their good properties of enhanced in-band noise
suppression and high operation robustness, thus making it a competitive option for
commercial applications.
Fig. 3.3 displays the detailed linear model of the RILCM with the TAL, where
the two main noise sources [i.e., the reference noise θn,ref (s) and the RVCO noise
θn,vco (s)] are included. In contrast to traditional PLL, the injection locking gives rise
to the inclusion of [1 − Hinj (s)] within the TAL loop [85, 89], where Hinj (s) denotes
the normalized phase transfer function of the injection locking. It can be approximated
by an LPF with a left-half-plane pole around the tracking bandwidth of the IL-VCO [89, 80,
154], so that [1 − Hinj(s)] behaves as an HPF. The presence of such an HPF accounts for
the in-band phase noise attenuation, which corresponds to resetting the phase errors at
the arrival of each injection pulse. To explore
the system stability and phase transfer characteristics, the closed-loop characteristic
equation is formulated as below,
where θi (s) is the summation of the reference input θref (s) and the reference noise
θn,ref (s). Rearranging Eq. (3.1), the closed-loop transfer function can be obtained by
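$$\begin{aligned}
\theta_o(s) &= \theta_i(s)\cdot\frac{N\cdot LG(s)}{1 + LG(s)} + \theta_i(s)\cdot\frac{N}{1 + LG(s)}\cdot H_{inj}(s) + \theta_{n,vco}(s)\cdot\frac{1}{1 + LG(s)}\cdot[1 - H_{inj}(s)] \\
&= \theta_{ref}(s)\cdot\frac{N\cdot LG(s)}{1 + LG(s)}\cdot[1 - H_{inj}(s)] + N\cdot\theta_{ref}(s)\cdot H_{inj}(s) \\
&\quad + \theta_{n,ref}(s)\cdot\frac{N\cdot LG(s)}{1 + LG(s)}\cdot[1 - H_{inj}(s)] + N\cdot\theta_{n,ref}(s)\cdot H_{inj}(s) \\
&\quad + \theta_{n,vco}(s)\cdot\frac{1}{1 + LG(s)}\cdot[1 - H_{inj}(s)],
\end{aligned} \tag{3.2}$$
where, after the second equality, the first line represents the phase transfer of $\theta_{ref}(s)$, the second line stands for
the noise transfer of $\theta_{n,ref}(s)$, the third line denotes the noise transfer of $\theta_{n,vco}(s)$,
and $LG(s)$ is the loop gain, written as
$$LG(s) = \frac{1}{N}\cdot K_{TPD}\cdot LPF(s)\cdot\frac{K_{VCO}}{s}\cdot[1 - H_{inj}(s)]. \tag{3.3}$$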
[Figure 3.4 graphic: (a) noise transfer paths A, B, and C from θn,vco and θn,ref to θn,out, built from 1/(1 + LG(s)), [1 − Hinj(s)], N·Hinj(s), and N·LG(s)/(1 + LG(s)), with corner frequencies ftune and finj; (b) output phase-noise spectrum Sθ(f) versus f for fc < ftune < finj, showing the $1/f^3$ and $1/f^2$ regions and the 20log(N) dB offset of the reference noise.]
Figure 3.4: NTF characteristics of the RILCM. (a) NTF behaviors and (b) simplified
noise shaping characteristics. Here, fc is the corner frequency of the oscillator, finj
stands for the bandwidth of the injection locking, ftune denotes the tunable bandwidth
of the TAL, $1/f^2$ represents the phase noise converted from the white noise of the
oscillator, and $1/f^3$ is the phase noise converted from the flicker noise
of the oscillator.
performance. Referring to Eq. (3.3), we can find that the secondary pole within the
TAL is subject to the dominant pole of Hinj (s). Therefore, the unity-gain bandwidth
of the loop gain should be designed smaller than the -3 dB bandwidth of Hinj (s) so
as to guarantee sufficient phase margin. Meanwhile, to suppress the 1/f 3 noise of the
VCO, the TAL bandwidth ftune is expected to be larger than the corner frequency fc
of the VCO. In this design, the bandwidth of the injection locking is designed to be 40
MHz while the TAL bandwidth can be adjusted by changing the CP current.
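The stability argument above can be illustrated with a simplified evaluation of Eq. (3.3). The sketch below rests on assumptions that are not stated as formulas in the text: Hinj(s) is modeled as a single-pole low-pass response with its pole at the injection bandwidth, the loop filter is taken as a pure capacitor (the series resistor being shorted in TAL mode), and all constant factors (1/N, K_TPD, K_VCO, and the capacitance) are lumped into one hypothetical gain `k_lump`. The 10 MHz unity-gain bandwidth is an arbitrary example, while 40 MHz is the injection bandwidth quoted above.

```python
import cmath
import math

def tal_loop_gain(f_hz, k_lump, f_inj_hz):
    """LG(j*2*pi*f) = k_lump / s^2 * [1 - Hinj(s)], with Hinj(s) = w_inj / (s + w_inj)."""
    s = 1j * 2.0 * math.pi * f_hz
    w_inj = 2.0 * math.pi * f_inj_hz
    hpf = s / (s + w_inj)                    # 1 - Hinj(s), a first-order high-pass
    return k_lump / (s * s) * hpf

def lump_gain_for_bandwidth(f_u_hz, f_inj_hz):
    """Lumped gain that places the unity-gain frequency of LG at f_u_hz."""
    w_u = 2.0 * math.pi * f_u_hz
    w_inj = 2.0 * math.pi * f_inj_hz
    return w_u * math.sqrt(w_u ** 2 + w_inj ** 2)

def phase_margin_deg(f_u_hz, f_inj_hz):
    """Phase margin of the simplified TAL loop for a given unity-gain bandwidth."""
    k_lump = lump_gain_for_bandwidth(f_u_hz, f_inj_hz)
    lg = tal_loop_gain(f_u_hz, k_lump, f_inj_hz)
    return 180.0 + math.degrees(cmath.phase(lg))

for f_u in (10e6, 40e6):
    print(f"unity-gain bandwidth {f_u / 1e6:.0f} MHz -> "
          f"phase margin {phase_margin_deg(f_u, 40e6):.1f} deg")
```

In this simplified model the phase margin degrades as the unity-gain bandwidth approaches the injection bandwidth, which is the reason for keeping the TAL bandwidth below the -3 dB bandwidth of Hinj(s).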
Noise Shaping Characteristics- Following the closed-loop transfer function in Eq.
(3.2), Fig. 3.4(a) describes the noise transfer function (NTF) behaviors of the two
main noise sources θn,ref and θn,vco . The three NTFs in Eq. (3.2) are generalized into
three noise transfer paths: A, B, and C. Path A stands for the NTF from the VCO, path
B refers to the main NTF of the reference, and path C represents the secondary NTF
path from the reference. The TAL leads to the inclusion of an extra [1/(1 + LG(s))]
within the NTF of the IL-VCO [see path A in Fig. 3.4(a)] and introduces an addition-
al path [see path C in Fig. 3.4(a)] from the reference noise to the VCO output. The
equivalent NTFs for these two paths are plotted in gray solid line [see Fig. 3.4(a)].
For path A, the presence of the [1/(1 + LG(s))] provides 20 dB/dec noise suppres-
sion. Combining with the 20 dB/dec attenuation contributed by the [1 − Hinj (s)], the
1/f 3 noise of the VCO can be significantly suppressed as long as the bandwidths of
ftune and finj are larger than the VCO corner frequency fc . This requirement can be
easily satisfied by adjusting the TAL loop parameters and injection strength. Path B
denotes the main noise transfer mechanism of the RILCM, which can be considered
as the reference NTF of the IL-VCO without the TAL. As for path C, the reference
noise transferred to the VCO output is negligible, because the equivalent NTF of the
cascaded [LG(s)/(1 + LG(s))] and [1 − Hinj (s)] shows significant attenuation over
all frequencies as long as their bandwidths satisfy ftune < finj [see Fig. 3.4(a)]. This
requirement can be naturally met as it coincides with the loop stability requirement. Fig.
3.4(b) presents the simplified noise-shaping characteristics of the proposed RILCM
with the TAL. The injection locking along with the TAL can completely suppress the
in-band noise of the VCO, hence making its in-band noise tightly track the reference
noise.
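The combined 40 dB/dec shaping of path A can be checked with a small Python sketch that evaluates the VCO noise transfer term of Eq. (3.2), [1/(1 + LG(s))]·[1 − Hinj(s)]. As a simplification that goes beyond the text, Hinj(s) is assumed to be a single-pole low-pass response, the loop filter is assumed to be a pure capacitor, and the loop constants are lumped so that the loop gain is unity at the chosen ftune; the 10 MHz value of ftune is hypothetical.

```python
import math

def path_a_ntf_db(f_hz, f_tune_hz, f_inj_hz):
    """|[1/(1 + LG)] * [1 - Hinj]| in dB for the simplified single-pole model."""
    s = 1j * 2.0 * math.pi * f_hz
    w_inj = 2.0 * math.pi * f_inj_hz
    w_u = 2.0 * math.pi * f_tune_hz
    k_lump = w_u * math.sqrt(w_u ** 2 + w_inj ** 2)   # sets |LG| = 1 at f_tune
    hpf = s / (s + w_inj)                              # 1 - Hinj(s)
    lg = k_lump / (s * s) * hpf                        # loop gain in the form of Eq. (3.3)
    return 20.0 * math.log10(abs(hpf / (1.0 + lg)))

F_TUNE, F_INJ = 10e6, 40e6
slope_db_per_dec = path_a_ntf_db(1e4, F_TUNE, F_INJ) - path_a_ntf_db(1e3, F_TUNE, F_INJ)
print(f"low-frequency slope of path A: {slope_db_per_dec:.2f} dB/dec")
print(f"path A gain at 1 GHz: {path_a_ntf_db(1e9, F_TUNE, F_INJ):.3f} dB")
```

Well below ftune the transfer rises at about 40 dB/dec, which offsets the 30 dB/dec slope of the $1/f^3$ noise, whereas far above finj the VCO noise passes essentially unattenuated.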
tor (IL-RVCO)
The LC oscillator has demonstrated excellent performance on phase noise and pow-
er efficiency. However, its large area occupation, narrow tuning range, and inductor-
caused cross-coupling make it less suitable for multi-lane applications [153, 151]. In
contrast, the ring oscillator shows more potential in such applications because of its
wide operation range, multi-phase generation, and compact layout implementation.
Moreover, the recently developed injection locking technique makes it possible to
achieve a comparable jitter performance to its LC counterpart [82, 145]. This sec-
tion will firstly describe the IL-RVCO based on a new FS-PDDC, and then explore the
relative phase difference (i.e., the crossing point of the IL-RVCO output relative to the
injection center) with respect to the frequency offset.
[Figure 3.5 graphic: panels (a) to (c), with labels VCTRL, INJ_EN, INJ, REF, CLK, and OUT.]
Figure 3.5: IL-RVCO. (a) Four-stage RVCO implementation, (b) pulse generator, and
(c) injection locking behavior.
[Figure 3.6 graphic: panels (a) to (d), showing delay cells with inputs IP/IN, outputs OP/ON, and control voltage VCTRL.]
Figure 3.6: (a) FTG-based FS-PDDC, (b) CCI-based FS-PDDC, (c) effect of the FTGs,
and (d) effect of the CCIs. Here, the arrows stand for the effort directions that are
offered by the FTGs or CCIs.
Fig. 3.5(a) shows the adopted IL-RVCO, which consists of four identical delay
cells and whose frequency is adjusted by the control voltage (VCTRL). The injection pulse
is applied to one of the four stages, while the injection transistors of the other stages are
connected to ground so that they do not disturb the injection. The narrow pulses produced by
the pulse generator in Fig. 3.5(b) are injected into the IL-RVCO, so that the accumulated
jitter is corrected by the injection pulse at every reference cycle [see Fig. 3.5(c)].
Fig. 3.6(a) presents the proposed FS-PDDC, where the pseudo differential output is
ensured by a pair of forward transmission gates (FTGs). To illustrate the unique features
of this FS-PDDC, another representative implementation is also shown in Fig.
3.6(b), whose pseudo differential output is guaranteed by two cross-coupled inverters
(CCIs) [155].
The common feature of these two FS-PDDCs is that they both employ a pair of
back-to-back varactors to tune the free-running frequency of the VCO, where VCTRL
is fed to the common body of the two varactors. In principle, when VCTRL
goes down, the equivalent voltage applied to the varactors increases, so the
delay cells have to drive a higher load capacitance and the free-running frequency
of the VCO drops. Conversely, as VCTRL goes up, the VCO frequency
rises. Compared to conventional supply-regulated delay cells [156, 155], the main
advantages of these FS-PDDCs are their high output swing and fixed common-mode
voltage (around half of the power supply), which remove the need for level-shift
correction [123, 157] and thereby simplify their use. Fig. 3.6(c) and (d) describe
the effects on the edge transitions contributed by the FTGs and CCIs [see
Fig. 3.6(a) and (b)], where the arrows indicate the direction in which the FTGs or
CCIs push each edge. The arrows in Fig. 3.6(c) always coincide with the edge
transitions, thus accelerating them. The reason is that the transmission delay from IP
to OP (IN to ON) through the FTG is similar to that from IN to OP (IP to ON) via
the inverter. In contrast, the CCIs decelerate the edge transitions in the portion preceding
the crossing point, while accelerating them in the portion succeeding the crossing
point [see Fig. 3.6(d)]. This can be understood by noting that the CCIs change state
at the crossing point. Before that, they provide negative feedback that
preserves the previous state [see the gray arrows in Fig. 3.6(d)]. After that, they contribute
positive feedback that speeds up the state change [see the black arrows in Fig. 3.6(d)].
Compared to the half-negative and half-positive feedback associated with the CCIs in
Fig. 3.6(b), the FTGs in Fig. 3.6(a) contribute persistent positive feedback, which makes
the FTG-based FS-PDDC a more promising solution for high-frequency applications.
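The varactor tuning described above can be illustrated with the first-order ring-oscillator relation f = 1/(2·N·t_d), where N is the number of stages and t_d is the per-stage delay. The sketch below is only a back-of-the-envelope model: it assumes that t_d scales linearly with the load capacitance, and the 10% capacitance increase is a hypothetical value, not one taken from the design.

```python
def ring_frequency(n_stages: int, stage_delay_s: float) -> float:
    """First-order oscillation frequency of an N-stage ring: f = 1 / (2 * N * t_d)."""
    return 1.0 / (2 * n_stages * stage_delay_s)


def stage_delay_for(n_stages: int, freq_hz: float) -> float:
    """Per-stage delay needed for a target frequency (inverse of ring_frequency)."""
    return 1.0 / (2 * n_stages * freq_hz)


# Four-stage RVCO at the 10 GHz target: each delay cell contributes 12.5 ps.
t_d = stage_delay_for(4, 10e9)

# Assumption: delay is proportional to load capacitance, so a lower VCTRL that
# raises the varactor load by a hypothetical 10% lowers the frequency by about 9%.
f_loaded = ring_frequency(4, t_d * 1.10)
```

In this simplified picture, the tuning range is set by how far the varactors can move the load capacitance seen by each delay cell.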
Figure 3.7: Effect of the injection pulse on the speed of edge transitions, where the
preceding portion of the injection pulse contributes positive feedback while the following
portion provides negative feedback.
Frequency Offset
Previous work has demonstrated that the most challenging task in the TAL is
detecting the difference between the free-running frequency of the VCO and the target
frequency, since the VCO output frequency does not change with the control voltage
under locked conditions [82, 158]. Inspired by the design in [82], this design uses the
relative phase difference of the VCO output with respect to the center of the injection
pulse to estimate the frequency offset. To explore the relationship between the
relative phase difference and the frequency offset, Fig. 3.7 summarizes the effect of
the injection pulse on the speed of edge transitions. An ideal injection is depicted at
the center of the diagram, where the crossing point of the differential output clock
occurs at the center of the injection pulse. The left subfigure describes the current flow
through the injection transistor for the preceding part of the injection pulse. During
this period, CLK_P falls from VDD to VDD/2 while CLK_N rises from
0 to VDD/2. The current flows from CLK_P to CLK_N through the injection transistor,
providing an additional current path for both pulling down CLK_P and pulling up CLK_N.
Therefore, the preceding part of the injection pulse contributes positive feedback that
accelerates both the rising and falling edges. On the contrary, the following part of
Figure 3.8: Transient simulation results of the IL-RVCO. (a) Injection locking range,
(b) the relative phase difference with respect to the transient time, and (c) the relative
phase difference versus the frequency offset.
the injection pulse results in negative feedback. As illustrated in the right subfigure,
CLK_P falls from VDD/2 to 0 while CLK_N rises from VDD/2 to VDD. The current
flows from CLK_N to CLK_P, which slows down the edge transitions. Based on the
above analysis, when the free-running frequency is lower than the target frequency, the
output clock period needs to be shortened to catch up with the injection signal. The
crossing point should therefore lie after the injection center, so that the
positive feedback outweighs the negative feedback and speeds up the VCO. Conversely,
when the free-running frequency is higher than the target frequency, the crossing point
should lie before the injection center to slow
down the VCO. When the free-running frequency equals the target frequency, the
injection center should coincide with the crossing point, where the phase-noise reduction
contributed by the injection locking reaches its maximum [151].
Fig. 3.8 displays the transient simulation results of the IL-RVCO, where an injection
pulse with a slowly ramping frequency is applied to the RVCO, while the control
voltage is set to a fixed value that places the center frequency of the IL-RVCO at
around 10 GHz. The simulated frequencies of the RVCO’s quarter-rate output and the
injection pulse are plotted in Fig. 3.8(a). The IL-RVCO tracks the injection
pulse over a locking range of 2.491 to 2.509 GHz. Fig. 3.8(b) depicts the relative phase
difference (i.e., the crossing point of the IL-RVCO output relative to the center of the
injection pulse) with respect to the transient time. Replacing the transient time with the
frequency offset on the horizontal axis gives the relationship between the relative phase
difference and the frequency offset [see Fig. 3.8(c)], where the corresponding locking
positions are also illustrated by the waveforms on the right. The relative phase difference
can be regarded as linear with respect to the frequency offset in the vicinity of the locking
center. According to the analysis in Appendix A, the reciprocal of the slope Kslope
is equal to the tracking bandwidth of the phase transfer function Hinj(s) in
Fig. 3.3. The linear relationship of the relative phase difference versus the frequency
offset and the explicit tracking bandwidth lay a solid foundation for the FTL design,
including the TPD implementation, stability analysis, and bandwidth optimization.
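As a quick check of the numbers quoted from Fig. 3.8(a), the locking range can be expressed relative to the 2.5 GHz quarter-rate frequency and referred to the VCO output. The snippet assumes that the quarter-rate output is exactly the VCO frequency divided by four; the variable names are illustrative only.

```python
LOCK_LOW_HZ = 2_491_000_000   # lower edge of the locking range (quarter-rate output)
LOCK_HIGH_HZ = 2_509_000_000  # upper edge of the locking range (quarter-rate output)
DIVIDE_RATIO = 4              # assumption: quarter-rate output = VCO frequency / 4

center_hz = (LOCK_LOW_HZ + LOCK_HIGH_HZ) // 2       # 2.5 GHz
half_range_hz = (LOCK_HIGH_HZ - LOCK_LOW_HZ) // 2   # 9 MHz on either side
relative_half_range = half_range_hz / center_hz     # 0.0036, i.e. +/-0.36 %
vco_half_range_hz = half_range_hz * DIVIDE_RATIO    # +/-36 MHz at the 10 GHz output
```

A range this narrow is why the free-running frequency has to be tuned close to the target before injection locking can take over.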
The main function of the TPD in the FTL is to detect the phase difference between
the crossing point of the differential output clock and the injection-pulse center, which
indicates the frequency deviation between the free-running frequency of the VCO and the
target output frequency. Many attempts have been made to obtain an accurate phase
difference. In [153], a sub-sampling TPD (SSTPD), which embeds sample-and-hold
(S/H) circuits into one of the stages, is adopted to monitor the injection timing.
However, the heavy load of the S/H along with the subsequent voltage-to-current conversion can
dramatically prolong the delay of the stage with the embedded SSTPD, which not only lowers
the maximum operating speed of the VCO but also leads to an I/Q matching
problem. Additionally, its output polarity is unpredictable since the sampled output
depends on the injection position (falling edge or rising edge). This gives a
50% probability of forming undesired positive feedback in the initial state. Another
TPD consisting of four AND gates, six D-flip-flops (DFFs), and several logic gates is
developed to convert the phase differences to voltage pulses [146]. Nevertheless, its
[Figure 3.9: Proposed TPD, including the polarity detector and the charge pump (CP).]
Figure 3.10: Locking behaviour of the proposed TPD. (a) Waveforms when injection
occurs at the falling edge of CLK_P, and (b) waveforms when injection occurs at the
rising edge of CLK_P.
Fig. 3.10 shows the locking behaviour of the proposed TPD, where the operation of
the TPD logic can be considered as a pair of on/off switches that are controlled by the
equivalent signals of PS_P and PS_N. The injection pulse is partitioned into
two sections ϕp and ϕn by the crossing point of the high-speed complementary clocks.
When the injection center leads the crossing point, the pulse width ϕp is larger than
ϕn, and vice versa. This width difference is then converted to current by the following
CP in Fig. 3.9, where the instantaneous current is determined by the threshold voltage of the
common-gate high-voltage transistor and the equivalent on-resistance of the
phase-detecting transistors at its source.
When the TAL is stable, the average output current of the CP should be zero, so
the crossing point of the VCO’s output should be located at the center of the injection
pulse. It is precisely under this condition that the frequency of the free-running VCO
becomes close to the target frequency, according to the analysis of the relationship
between the phase difference and frequency offset in Section 3.3.2. This means that
aligning the injection-pulse center with the crossing point of the VCO’s output is a common
target for both the injection locking and the frequency tracking. Therefore, the race
condition between the two pulling forces is eliminated.
One common problem in the existing TPD-based TALs is that an improper injection
position may lead to undesired positive feedback [82, 153]. To be more specific,
Fig. 3.10 illustrates the functional waveforms of the TPD logic when injection occurs
at different transition edges. For the condition that the injection happens at the rising
edge of CLK_P as shown in Fig. 3.10(a), the equivalent pulses of PS_P and PS_N can
be given by
When the injection occurs at the falling edge of CLK_P as depicted in Fig. 3.10(b),
they can be given by
Clearly, if no measure is taken, the detected phase difference changes
sign as the injection position switches between the two possible locking
conditions depicted in Fig. 3.10. This gives the TAL a 50% chance of operating
in positive feedback in the initial state, which may cause a false lock or even a failure to
lock since the injection locking range is small. To solve this problem, a POD shown in
Fig. 3.9 is introduced to produce the polarity signals PO_P and PO_N by distinguishing
the edge type at the injection instant. The equivalent function of the TPD with
the polarity selection is also depicted in Fig. 3.9, and the waveforms of the final
equivalent inputs PSP_P and PSP_N are shown in Fig. 3.10. It can be seen that
the connections of the detected pulses PS_P and PS_N are exchanged by the polarity
selection signals PO_P and PO_N. Therefore, the same equivalent pulses PSP_P
and PSP_N are obtained for both conditions shown in Fig. 3.10. Consequently, the
possible positive feedback is avoided.
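The sign flip and its correction can be sketched with a small behavioral model. This is not the circuit in Fig. 3.9: it assumes, for illustration only, that PS_P is the part of the injection pulse during which CLK_P is high and PS_N the part during which CLK_N is high, and that the polarity detector swaps the two when the injection lands on a rising edge of CLK_P. The function names and the choice of which edge triggers the swap are hypothetical.

```python
def tpd_pulses(pulse_width, crossing_offset, edge):
    """Widths of PS_P and PS_N for one injection pulse.

    crossing_offset = crossing time minus pulse-center time (same unit as
    pulse_width); positive means the injection center leads the crossing.
    """
    before = min(max(pulse_width / 2 + crossing_offset, 0.0), pulse_width)
    after = pulse_width - before
    if edge == "falling":   # CLK_P is high before the crossing point
        return before, after
    if edge == "rising":    # CLK_P is high after the crossing point
        return after, before
    raise ValueError("edge must be 'falling' or 'rising'")


def polarity_select(ps_p, ps_n, edge):
    """Assumed POD behavior: exchange the two pulses for rising-edge injection."""
    return (ps_n, ps_p) if edge == "rising" else (ps_p, ps_n)


def detected_error(pulse_width, crossing_offset, edge, use_pod):
    """Width difference that the charge pump would integrate."""
    ps_p, ps_n = tpd_pulses(pulse_width, crossing_offset, edge)
    if use_pod:
        ps_p, ps_n = polarity_select(ps_p, ps_n, edge)
    return ps_p - ps_n
```

Without the polarity selection, the same timing error of +5 units on a 40-unit pulse yields +10 on one edge type and -10 on the other; with it, both edge types yield +10, so the loop sees one consistent feedback sign.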
Figure 3.11: Implementation of the introduced LSSM. (a) Circuit details and (b) behavior
of the FLD.
Recovery (LLD-LR)
Previous TPD-based TALs suffer from initial lock-acquisition problems and a risk of
losing lock because of their limited locking range and weak lock-acquisition ability [146,
153]. To overcome these difficulties, a complete LLD-LR mechanism is embedded into
the hybrid FTL under the control of the LSSM. Fig. 3.11(a) gives the details of the
LSSM. It consists of a frequency lock detector (FLD) and a loop selector (LS). The
injection-lock indicator INJ_LOCK and the frequency-lock indicator FRE_LOCK are applied
to the LS, where the total number of edge transitions on INJ_LOCK and FRE_LOCK is
recorded by a saturating counter. Once the count reaches a specific value (4 in this design),
the LS switches the FTL from TAL to PLL to start a lock-acquisition process. Simultaneously,
the RC timer with a time constant of 60 ns is launched to charge node A.
When its voltage climbs to the inverter threshold, the LS is reset and the FTL switches
back to TAL to engage injection locking. If the VCO successfully locks to the injection
pulse at the target frequency, both INJ_LOCK and FRE_LOCK stay static,
and so does the LS. Otherwise, this process is repeated until injection locking
is achieved. During the initial period, injection lock is obtained by repeating
this LLD-LR process. During normal operation, a loss of lock is detected
in time to activate the LLD-LR process. Additionally, it is worth noting that almost
no extra power is dissipated in the normal loss-detection mode since there are no signal
transitions in the LSSM.
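The loop-selection sequence described above can be summarized in a short behavioral sketch. It is a simplified software model, not the circuit of Fig. 3.11(a): the class and function names are hypothetical, the indicators are sampled at discrete instants, and the inverter threshold is assumed to sit at half the supply, which the text does not state.

```python
import math


class LoopSelector:
    """Saturating edge counter that switches the FTL from TAL to PLL."""

    def __init__(self, threshold=4):
        self.threshold = threshold   # 4 edge transitions in this design
        self.count = 0
        self.mode = "TAL"
        self._prev = None

    def sample(self, inj_lock, fre_lock):
        """Feed the current logic levels of INJ_LOCK and FRE_LOCK."""
        cur = (inj_lock, fre_lock)
        if self._prev is not None and self.mode == "TAL":
            edges = sum(a != b for a, b in zip(cur, self._prev))
            self.count = min(self.count + edges, self.threshold)
            if self.count >= self.threshold:
                self.mode = "PLL"    # start a lock-acquisition phase
        self._prev = cur
        return self.mode

    def timer_expired(self):
        """Node A reached the inverter threshold: reset and return to TAL."""
        self.count = 0
        self.mode = "TAL"


def rc_timeout(tau_s, vdd, v_threshold):
    """Time for an RC node charging from 0 V toward vdd to reach v_threshold."""
    return tau_s * math.log(vdd / (vdd - v_threshold))


# Assumption: inverter threshold at VDD/2, which gives tau * ln(2), about 41.6 ns.
pll_window_s = rc_timeout(60e-9, 1.2, 0.6)
```

With static indicators the counter never advances, which mirrors the observation that the LSSM has no signal transitions, and hence almost no dynamic power, once lock is achieved.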
Fig. 3.11(b) describes the functional behavior of the FLD. When the feedback
frequency is exactly equal to the reference frequency under the locked condition [left
part of Fig. 3.11(b)], the outputs FD1 and FD2 of the mutually sampling D-flip-flops
(DFFs) stay unchanged and thus the frequency-lock indicating signal FRE_LOCK
remains static. When the frequency of the feedback clock equals a
sub-harmonic or a harmonic multiple of the reference frequency [middle part of Fig.
3.11(b)], FRE_LOCK is a delayed version of the low-frequency clock, since
the mutual sampling preserves all the timing information of the low-frequency clock.
In the regular case where there is a frequency deviation between the feedback
clock and the reference clock [right part of Fig. 3.11(b)], the FLD also produces
transitions on FRE_LOCK. In general, FRE_LOCK stays static only when the VCO is
running at the target frequency. Hence, the
presence of transitions on FRE_LOCK can be regarded as a frequency-lock failure.
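The three cases above can be reproduced with a small event-driven model of two mutually sampling flip-flops. The wiring is an assumption made for this sketch: FD1 samples the feedback clock on rising edges of REF_CLK, FD2 samples REF_CLK on rising edges of the feedback clock, and FRE_LOCK is their XOR. Ideal 50% duty-cycle clocks and integer picosecond timestamps are used, and the function names are illustrative.

```python
def clock_level(t_ps, period_ps, offset_ps):
    """Level of a 50%-duty square wave with rising edges at offset + k * period."""
    return int(((t_ps - offset_ps) % period_ps) < period_ps // 2)


def fre_lock_transitions(ref_period_ps, fb_period_ps, fb_offset_ps, sim_time_ps):
    """Count FRE_LOCK edges produced by the two mutually sampling DFFs."""
    events = [(t, "ref") for t in range(0, sim_time_ps, ref_period_ps)]
    events += [(t, "fb") for t in range(fb_offset_ps, sim_time_ps, fb_period_ps)]
    events.sort()

    fd1 = fd2 = last = None
    transitions = 0
    for t, kind in events:
        if kind == "ref":    # FD1: feedback clock sampled by REF_CLK
            fd1 = clock_level(t, fb_period_ps, fb_offset_ps)
        else:                # FD2: REF_CLK sampled by the feedback clock
            fd2 = clock_level(t, ref_period_ps, 0)
        if fd1 is None or fd2 is None:
            continue         # wait until both flip-flops have sampled once
        fre_lock = fd1 ^ fd2
        if last is not None and fre_lock != last:
            transitions += 1
        last = fre_lock
    return transitions


# 2.5 GHz reference (400 ps period); feedback locked with a 90-degree offset.
locked = fre_lock_transitions(400, 400, 100, 100_000)
# Feedback at twice the reference frequency (false harmonic lock).
harmonic = fre_lock_transitions(400, 200, 50, 100_000)
# Feedback 1% slow (404 ps period): the phase drifts until the samples flip.
deviated = fre_lock_transitions(400, 404, 100, 100_000)
```

In this model only the locked case leaves FRE_LOCK static; both the harmonic case and the small frequency deviation produce edges that the loop selector can count.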
Although the FLD can detect any frequency deviation [see Fig. 3.11(b)], it takes
a long time for a cycle slip to occur and generate the frequency-loss edge transitions
when the frequency of the VCO is close to the target frequency. Due to the small
locking range, this situation is likely to occur during the injection
locking process. To speed up the lock-loss detection, INJ_LOCK (a buffered version
of PO_P) is also applied to the LS. Recalling the cases when the VCO is locked to the
injected pulse as shown in Fig. 3.10, the DFFs in the POD, which are triggered by the rising
edge of the injected pulse, always sample the same logic level. Therefore, the polarity
signal INJ_LOCK stays unchanged at either logic high or logic low. Conversely, if the
VCO fails to lock to the injected pulse, the injection position changes as the phase
error accumulates, which finally brings in a cycle slip. At this moment,
the polarity signal INJ_LOCK presents an edge transition, which can be regarded
as an effective indicator of injection-lock failure. By monitoring the edge transitions
on both INJ_LOCK and FRE_LOCK, any injection failure, including injection-lock loss
and false harmonic lock, can be quickly detected.
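The reason frequency-only detection becomes slow near the target can be seen from the beat period between the two clocks: their relative phase drifts by one full cycle in 1/|Δf|. The helper below is a hypothetical illustration of that relation; the example frequencies are chosen for readability and are not simulation results.

```python
def cycle_slip_time(f_feedback_hz: float, f_reference_hz: float) -> float:
    """Time for the phase between two clocks to drift by one full cycle."""
    offset = abs(f_feedback_hz - f_reference_hz)
    if offset == 0:
        return float("inf")   # no drift, so no cycle slip ever occurs
    return 1.0 / offset


far_from_lock_s = cycle_slip_time(2.6e9, 2.5e9)    # 100 MHz apart: 10 ns
near_lock_s = cycle_slip_time(2.501e9, 2.5e9)      # 1 MHz apart: 1 us
```

The closer the VCO gets to the target, the longer a frequency detector alone has to wait for a slip, which is the gap that the additional INJ_LOCK monitoring is meant to close.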
The RILCM is designed on a Dell R730 server with two E5-2609V4 CPUs, 128
GB of memory, and an 8 TB hard disk. The schematic, layout, and simulation are carried
out with Schematic Composer, Virtuoso Layout, and Spectre/APS, respectively, all developed
by Cadence. The software version is IC5141. Layout verification and parasitic
extraction, i.e., layout versus schematic (LVS), design rule check (DRC),
and parasitic extraction (PEX), are carried out using Calibre 2013, developed by Mentor
Graphics. To characterize the jitter performance of the fabricated prototype, a very clean
reference clock is generated by a KEYSIGHT N5191A. For a 2.5 GHz output, it presents
phase noise of -146 dBc/Hz and -150 dBc/Hz at 1 MHz and 10 MHz offset, respectively.
The rms-jitter integrated from 10 kHz to 40 MHz is 38.7 fs. Unless otherwise stated,
the rms-jitter in the following description is integrated
over the same frequency range.
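For readers unfamiliar with the integration behind these jitter figures, the sketch below applies the standard conversion from single-sideband phase noise L(f) to rms jitter, σ_t = sqrt(2·∫L(f)df) / (2π·f0). It is a minimal illustration with simple trapezoidal integration on a linear frequency axis; the flat -150 dBc/Hz profile is a hypothetical input, not the measured spectrum of the reference source, so the result is not expected to match the 38.7 fs quoted above.

```python
import math


def rms_jitter(offsets_hz, pn_dbc_hz, carrier_hz):
    """RMS jitter (s) from SSB phase-noise points via trapezoidal integration."""
    linear = [10 ** (p / 10) for p in pn_dbc_hz]       # dBc/Hz -> 1/Hz
    area = 0.0
    for i in range(len(offsets_hz) - 1):
        df = offsets_hz[i + 1] - offsets_hz[i]
        area += 0.5 * (linear[i] + linear[i + 1]) * df
    phase_rms_rad = math.sqrt(2 * area)                # factor 2: both sidebands
    return phase_rms_rad / (2 * math.pi * carrier_hz)


# Hypothetical flat profile over the 10 kHz to 40 MHz integration range.
jitter_s = rms_jitter([10e3, 40e6], [-150.0, -150.0], 2.5e9)   # about 18 fs
```

A measured spectrum has many points that are usually log-spaced, so a real evaluation would integrate segment by segment over the full data set rather than over two points.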
The prototype chip is designed and fabricated in a 65 nm process. Under the
typical corner, the cut-off frequency (fT) of the NMOS transistor and the fan-out-of-4
inverter delay in this process are 200 GHz and 13 ps, respectively. These two
metrics indicate that the utilized 65 nm process provides enough bandwidth
and timing margin for the targeted 10 GHz RILCM design. Although an advanced
Figure 3.12: Layout view of the whole RILCM chip, where the block placement of the
core circuits is illustrated in the left view.
Figure 3.13: Layout views of the crucial blocks. (a) VCO, (b) PG, (c) PFD/CP1, (d)
TPD/CP2, and (e) LSSM.
process with a smaller minimum channel length, such as 45 nm, 32 nm, 22 nm, or
16 nm, can offer a higher fT and a shorter inverter delay, the high cost of such processes
puts them beyond our reach. Fortunately, our RILCM mainly focuses on the hybrid-loop
frequency tracking architecture, the improved FS-PDDC-based RVCO, the TPD circuit
implementation, and the LLD-LR mechanism. These techniques can still be verified in the
economical and practical 65 nm CMOS process.
Fig. 3.12 displays the layout view of the whole RILCM chip, where the block
placement of the core circuits is illustrated in the left view. The PG is placed very
close to the VCO to reduce the parasitic capacitance on the pulse output and hence
provide an injection pulse with sharp edges. The PFD/CP1, TPD/CP2, and LSSM are
placed together to ease the routing of the mode selection signals. The
LPF is put close to the VCO to reduce the effect of supply fluctuations. Fig.
3.13 further presents the layout views of the crucial blocks. As shown in Fig. 3.13(a),
the VCO layout is arranged in a ring, which helps each delay cell
see the same parasitic capacitance and hence optimizes the noise performance of the
VCO. The main design goal for the PG is to minimize the parasitic capacitance on the
pulse output node [see Fig. 3.13(b)]. As for the PFD/CP1 and TPD/CP2 shown in
Fig. 3.13(c) and (d), special attention is paid to keeping the two comparison
branches symmetrical, thereby reducing the mismatch between the two compared
phases. The main consideration for the LSSM is the ease of routing the
connection signals.
Figure 3.14: Simulation setup of the RVCO, where the left curve depicts the VCTRL
of the RVCO.
Figure 3.15: Simulation results of the RVCO. (a) Differential output clock, (b) swing
reduction, (c) frequency range, and (d) phase noise.
Figure 3.16: Simulated performance comparison of the RVCOs with FTG-based and
CCI-based FS-PDDCs in terms of (a) operation frequency, (b) frequency range, (c)
FOMPN , and (d) swing reduction. Here, the horizontal axes denote the percentage of
the FTG/CCI to the main inverter in dimension.
Fig. 3.14 describes the simulation setup of the RVCO, where VCTRL rises
linearly from 0.2 V to 1.2 V to continuously adjust the operating frequency of
the RVCO. Fig. 3.15(a) shows the transient output waveforms of the RVCO.
Fig. 3.15(b) shows the zoomed-in waveforms, which indicate that the swing of the
RVCO outputs shrinks to 1.1 V. Fig. 3.15(c) displays the operating frequency of
the RVCO, which shows that the frequency range of the RVCO is 4.69 GHz when
VCTRL changes from 0.2 V to 1.2 V. Fig. 3.15(d) gives the simulated phase noise of
the RVCO when VCTRL is set to 0.7 V. The corner frequency of the 1/f^3 noise is
around 10 MHz and the phase noise at 1 MHz offset is -91.83 dBc/Hz.
To compare the performance of the RVCOs with the FTG-based and CCI-based
FS-PDDCs described in Fig. 3.6(a) and (b), we repeated the simulations in
Fig. 3.15 using the setup in Fig. 3.14 while changing the size ratio of the FTG/CCI to
the main inverter. Fig. 3.16 summarizes the comparison results, where the
horizontal axis is the size of the FTG/CCI as a percentage of the main inverter.
As depicted in Fig. 3.16(a), (b), and (c), the RVCO with the FTG-based
FS-PDDCs offers higher operating frequencies, wider tunable ranges,
and lower FOMPN values than that with the CCI-based FS-PDDCs. Here, FOMPN refers
to the phase-noise FOM of the VCO, which is defined by
$$ \mathrm{FOM}_{PN} = L(f_0, \Delta f) + 10\log\left[ \frac{\Delta f^2}{f_0^2} \cdot \frac{P_{DC}}{1\,\mathrm{mW}} \right], \tag{3.8} $$
where L(f0, ∆f) is the single-sideband phase noise at a frequency offset ∆f from a
carrier frequency f0, and PDC denotes the power consumption. A lower FOMPN
indicates a better VCO [147]. When the size of the FTG/CCI relative to the main inverter
increases from 5% to 50%, the metrics of the RVCO with the FTG-based FS-PDDCs
improve [see the red curves with circle markers in Fig. 3.16(a),
(b), and (c)], while those of the RVCO using the CCI-based FS-PDDCs
deteriorate [see the blue curves with square markers in Fig. 3.16(a),
(b), and (c)]. In particular, for the RVCO with the FTG-based FS-PDDCs, the operation
frequency rises from 9.98 to 10.95 GHz, the tunable range slightly increases from
4.65 to 4.75 GHz, and the FOMPN improves from -155.8 to -158.2 dBc/Hz. As for the
RVCO with the CCI-based FS-PDDCs, the operation frequency drops from 9.6 to 7.6
GHz, the tunable range declines from 4.4 to 3.0 GHz, and the FOMPN degrades from
-155.5 to -153.7 dBc/Hz. This is because larger FTGs provide a stronger pre-driving
ability and thus enhance the positive feedback, while larger CCIs reinforce
the negative feedback more than the positive feedback.
On the other hand, the increased pre-driving ability gives rise to a prominent swing
reduction in the RVCO with the FTG-based FS-PDDCs [see the red curve with circle
markers in Fig. 3.16(d)], which is undesirable in some applications. In this design,
the size of the FTG is chosen to be 15% of the main inverter to ensure that the
swing reduction stays below 10%.
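To make the phase-noise FOM concrete, the sketch below evaluates the FOMPN definition given above for one operating point. The phase noise (-91.83 dBc/Hz at 1 MHz offset) is the simulated value quoted for VCTRL = 0.7 V; the 10 GHz carrier and the 48 mW power are assumptions chosen for illustration (the latter borrowed from the measured VCO power), so the result only indicates the order of magnitude and is not a value read from Fig. 3.16.

```python
import math


def fom_pn(pn_dbc_hz, offset_hz, carrier_hz, power_w):
    """Phase-noise FOM: L(f0, df) + 10*log10[(df / f0)^2 * (P_DC / 1 mW)]."""
    return pn_dbc_hz + 10 * math.log10((offset_hz / carrier_hz) ** 2 * (power_w / 1e-3))


# Assumed operating point: f0 = 10 GHz and P_DC = 48 mW (illustrative values).
example_fom_pn = fom_pn(-91.83, 1e6, 10e9, 48e-3)   # about -155.0 dBc/Hz
```

The carrier-to-offset ratio contributes -80 dB here and the power term about +16.8 dB, which shows how strongly the metric rewards a high oscillation frequency at a given phase noise.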
[Figure 3.17: Simulated settling behavior of the RILCM with the LLD-LR, showing VCTRL and TPD_EN versus time in TAL and PLL modes.]
Figure 3.18: Transient behavior comparison. (a) With injection-lock indicator INJ_LOCK
and (b) without injection-lock indicator INJ_LOCK.
The simulated settling behavior of the proposed RILCM with the LLD-LR is plotted
in Fig. 3.17, where the PLL and the TAL work alternately under the control of
TPD_EN and PFD_EN. As shown in the zoomed-in subfigure, although the settling process
is periodically interrupted by frequent TAL engagement, it still resembles the
lock-acquisition process of a traditional PLL. This is because the lock loss
can always be quickly detected by the designed LSSM, so the FTL spends
a high proportion of the time in PLL mode. Benefiting from the improved lock-acquisition
ability, the issues mentioned previously, such as possible harmonic locking
and weak robustness, are resolved.
Fig. 3.18 shows the transient behavior for the cases with and without the injection-lock
indicator INJ_LOCK. The two cases share similar acquisition behavior when
VCTRL is far away from its target value, since the large frequency difference
makes the frequency-loss detection nearly as fast as the combined injection-loss and
frequency-loss detection. For instance, the details for the first 500 ns are depicted in the
top-left insets in Fig. 3.18(a) and (b). However, when the VCO frequency is close to the
target frequency, namely when the control voltage VCTRL approaches its target value as
detailed in the top-right insets in Fig. 3.18(a) and (b), the detection method that uses only
the frequency-lock loss indicator FRE_LOCK requires a long time (e.g., 68 ns,
88 ns) to trigger the PLL loop. In contrast, the strategy that monitors both the
frequency-lock loss signal FRE_LOCK and the injection-lock loss signal INJ_LOCK
greatly shortens the detection time (e.g., 18 ns, 19 ns). Consequently, the detection
method involving both frequency-lock loss and injection-lock loss makes the transient
behavior of the proposed RILCM more similar to that of a traditional PLL, which is
a significant convenience for fast start-up applications.
Fig. 3.19 shows the die micrograph of the fabricated prototype. The chip size
including pads is 0.8×0.9 mm², where the active area of the RILCM only occupies
Figure 3.19: Die micrograph of the RILCM.
[Figure 3.20: Power breakdown. VCO 48 mW, FTL 7.8 mW, PG 1.8 mW, LSSM+DIV 1.8 mW.]
Figure 3.22: Measured reference spur with half-rate output at 5 GHz. (a) RILCM without
FTL and (b) RILCM with FTL.
0.07 mm². The power consumption is 59.4 mW, where 44.5 mA is drawn from a 1.2
V supply and 2.4 mA from a 2.5 V supply. The power breakdown is given
in Fig. 3.20. The introduced LSSM along with the two dividers costs only 3.0% (1.8
mW) of the total. The fabricated chip is mounted on a printed circuit board by wire-bonding.
The output clock of the RILCM is first divided by 2 and then applied to an output buffer
for measurement.
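The power figures above are mutually consistent, as the short check below shows. The supply currents and voltages come from the text; the per-block values are those labeled in the power-breakdown figure, and the dictionary keys are simply the block names used there.

```python
SUPPLIES = [(1.2, 44.5e-3), (2.5, 2.4e-3)]   # (voltage in V, current in A)
BREAKDOWN_MW = {"VCO": 48.0, "FTL": 7.8, "PG": 1.8, "LSSM+DIV": 1.8}

total_mw = sum(v * i for v, i in SUPPLIES) * 1e3      # 53.4 mW + 6.0 mW
breakdown_total_mw = sum(BREAKDOWN_MW.values())       # should match total_mw
lssm_share = BREAKDOWN_MW["LSSM+DIV"] / total_mw      # about 3.0 %
```

Both totals come to 59.4 mW, with the VCO accounting for roughly four fifths of the budget.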
Fig. 3.21 describes the measured phase noise (using the half-rate output at 5 GHz) in
three operation modes: conventional PLL without injection, RILCM without FTL, and
RILCM with FTL. The measured phase noise is -120 dBc/Hz, -128 dBc/Hz, and -138 dBc/Hz,
respectively, at an offset frequency of 10 MHz for the above three operation modes.
Correspondingly, the measured rms-jitters are 390.2 fs, 130.0 fs, and 56.1 fs.
The implemented RILCM demonstrates a significant improvement in noise performance
due to the noise shaping contributed by the pulse injection and the continuous FTL. As
illustrated in Fig. 3.22(a) and (b), the measured reference spur levels without and with
the FTL are 58.31 dB and 57.13 dB, respectively. Note that Fig. 3.22(a) is measured
under an ideal condition with a nearly zero frequency deviation that is set initially. The
slight spur degradation (1.2 dB) therefore indicates that the FTL can adjust the injection
window to an optimal position without introducing destructive disturbance.
By repeating the phase noise measurement and jitter integration shown in Fig. 3.21, the
rms-jitter under different testing conditions can be obtained. Fig. 3.23 depicts the rms-jitter
versus the supply voltage for the modes with and without the FTL, where the rms-jitter
decreases as the supply voltage increases from 1.1 V to 1.32 V. This is because the
higher supply voltage increases the swing of the proposed VCO, which
sharpens the transition edges and reduces the conversion of device noise to jitter. To
evaluate the operation range of the RILCM, we recorded the rms-jitter while continuously
adjusting the reference frequency. The measurement results (see Fig. 3.24) demonstrate that
the RILCM can produce high-performance clocks (i.e., the rms-jitter stays below 60 fs)
over a wide range of 8-12 GHz.
JSSC09 [85] ISSCC13 [146] ISSCC15 [158] ISSCC14 [82] ISSCC16 [81] JSSC14 [151] This work
Architecture LC-ILCM LC-ILCM LC-ILCM PPM Ring-ILCM Ring-ILCM Ring-ILCM Ring-ILCM
Freq. Tracking FTL w/ Timing FTL w/ FTL w/ Replica- Dual Loop w/
FTL w/ TDC None FTL w/ TPD
Method Adjusted PD Pulse Gating Delay Cell Replica-VCO
Lock-Acquisition Manually-Tuned PLL Coarse Freq. Manually-Tuned Coarse Freq.
None None
Auxiliary Control Voltage Initialization Selection Control Voltage Selection
Loss Detection, Not Not Not Not Not
Available Available
Lock Recovery Available Available Available Available Available
Output Freq. | 3.2 GHz | 2.4 GHz | 6.75-8.25 GHz | 2-16 GHz | 0.96-1.44 GHz | 0.5-1.6 GHz | 8.0-12.0 GHz
Reference Freq. | 50 MHz | 150 MHz | 105-129 MHz | 0.25-2.0 GHz | 120 MHz | 40-300 MHz | 2.0-3.0 GHz
Phase Noise at 1 MHz Offset | -127.4 dBc/Hz | -126.4 dBc/Hz | -113.5 dBc/Hz | -115 dBc/Hz | -134.4 dBc/Hz | -124 dBc/Hz | -133.8 dBc/Hz
Jitter rms (δt) | 130 fs | 188 fs | 190 fs | 268 fs | 185 fs | 700 fs | 56.1 fs
(Integ. Range) | (100k-40MHz) | (1k-40MHz) | (10k-100MHz) | (100k-1GHz) | (10k-40MHz) | (10k-40MHz) | (10k-40MHz)
Power Diss. (PDC) | 28.6 mW | 5.2 mW | 2.25 mW | 46.2 mW | 9.5 mW | 0.97 mW | 59.4 mW
Reference Spur | -64 dBc | -49 dBc | -40 dBc | -48 dBc | -53 dBc | -57 dBc | -57.13 dBc
FOM | -243.2 dB | -247.0 dB | -251.0 dB | -235 dB | -244.9 dB | -243 dB | -247.3 dB
Active Area | 0.4 mm² | 0.25 mm² | 0.25 mm² | 0.044 mm² | 0.06 mm² | 0.022 mm² | 0.07 mm²
Technology | 130 nm CMOS | 65 nm CMOS | 65 nm CMOS | 20 nm CMOS | 65 nm CMOS | 65 nm CMOS | 65 nm CMOS
Chapter 3. Design of the Ring-Based Injection-Locked Clock Multiplier (RILCM)
Table 3.1 compares the performance of our RILCM with state-of-the-art ILCMs that
have the capability of frequency tracking. The integrated jitter of our RILCM (56.1 fs)
is the lowest in the table, and its phase noise at 1 MHz offset (-133.8 dBc/Hz) is better
than that of the other RILCMs except [81], whose output frequency (0.96-1.44 GHz) is
much lower; both figures are comparable to, or better than, those of the LC-ILCMs. This
is mainly owing to the effective combination of injection locking and frequency tracking,
both of which provide significant noise suppression. Additionally, the high-swing RVCO
helps to reduce the phase noise. The good spur level indicates that the FTL can tune the
RVCO to the target free-running frequency and hence make the injection occur around the
optimal position. Meanwhile, our RILCM occupies a much smaller area than the LC-ILCMs
[85, 146, 158]. Additionally, the designed LLD-LR gives our RILCM a lock-acquisition
ability similar to that of conventional PLLs, thus making it a robust solution for
commercial production.
It is worth noting that some parameters of the proposed RILCM are inferior.
The tuning range of the proposed RILCM is narrower than that in [82] due to
the limited tuning range of the back-to-back connected varactors. Fortunately, the 4
GHz tuning range is still relatively wide and can satisfy most applications.
The power consumption of our RILCM is higher than that of the previous designs. However,
power consumption alone cannot be taken as the performance criterion, since it is
mainly determined by the transistor sizes used rather than by the developed techniques.
To estimate the power efficiency of the proposed RILCM, the FOM of each ILCM is
calculated, which is defined as
\[ \mathrm{FOM} = 10 \cdot \log \left[ \left( \frac{\delta_t}{1\,\mathrm{s}} \right)^{2} \cdot \frac{P_{DC}}{1\,\mathrm{mW}} \right], \tag{3.9} \]
where δt is the rms-jitter of the output signal and PDC is the power consumption. This
FOM is commonly used as the performance-evaluation parameter of clock multipliers.
The proposed RILCM achieves the best FOM (-247.3 dB) among the RILCMs
[82, 81, 151], which indicates that the proposed RILCM (mainly referring to the
architecture and circuit topologies) has the potential to achieve a better FOM (i.e.,
power efficiency) than previously developed clock multipliers. As for the area occupation,
it is subject to the process, transistor sizes, and decoupling capacitance values. If an
advanced process with a smaller minimum channel length, such as 45 nm, 32 nm, 22 nm,
or 16 nm, is utilized, the area occupation can be significantly reduced.
[Figure 3.25: Comparison between the proposed RILCM and previous work [81, 84, 85, 87, 146, 151, 158].]
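As a quick numerical cross-check of Eq. (3.9), the following minimal Python sketch recomputes the FOM from the rms-jitter and power values listed in Table 3.1. The function name `fom_db` and the dictionary layout are illustrative choices rather than part of the original work, and a base-10 logarithm is assumed.

```python
import math


def fom_db(jitter_rms_s, power_w):
    """FOM of Eq. (3.9): 10*log10[(jitter / 1 s)^2 * (P_DC / 1 mW)]."""
    return 10.0 * math.log10((jitter_rms_s / 1.0) ** 2 * (power_w / 1e-3))


# (rms jitter in seconds, power in watts), copied from Table 3.1.
TABLE_3_1 = {
    "JSSC09 [85]": (130e-15, 28.6e-3),
    "ISSCC13 [146]": (188e-15, 5.2e-3),
    "ISSCC15 [158]": (190e-15, 2.25e-3),
    "ISSCC14 [82]": (268e-15, 46.2e-3),
    "ISSCC16 [81]": (185e-15, 9.5e-3),
    "JSSC14 [151]": (700e-15, 0.97e-3),
    "This work": (56.1e-15, 59.4e-3),
}

FOM_TABLE = {name: fom_db(j, p) for name, (j, p) in TABLE_3_1.items()}

for name, fom in FOM_TABLE.items():
    print(f"{name:14s} {fom:7.1f} dB")
```

The recomputed values agree with the FOM row of Table 3.1 to within a few tenths of a dB, the residual differences being attributable to rounding of the tabulated jitter and power figures.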
Fig. 3.25 gives a comparison between the proposed RILCM and previous work
in terms of the performance-area-speed trade-off. It can be seen that our RILCM
achieves a good balance among jitter performance, area occupation, operation speed,
and power efficiency.
This chapter presents a RILCM using a newly developed TPD-based hybrid FTL, which is
capable of producing a low-jitter, high-speed output at a low power consumption and
with a small active area. The RILCM occupies 0.07 mm², while producing a
56.1 fs rms-jitter at 10 GHz oscillation frequency and consuming a power of 59.4 mW.
The utilization of the newly developed FS-PDDC-based RVCO leads to a low conversion of
device noise into phase noise and is highly convenient for subsequent applications.
A compact TPD associated with a well-matched CP is designed to accomplish high
phase-difference-detection accuracy and low charge-pumping disturbance. By starting a
traditional PLL at the right time under the control of the LSSM, the frequency
initialization that is essential in prior FTL-based ILCMs is eliminated. The LLD-LR
mechanism gives the developed RILCM a lock-acquisition ability comparable to that of a
conventional PLL, thus making it a robust solution for commercial production. Moreover,
the designed LSSM consumes little additional power, since most of its logic stays static
once the target harmonic locking is obtained. Overall, the proposed system achieves a
good balance in the performance-area-speed-efficiency trade-off when compared to other
work.
Chapter 4
As one of the most important components in serial links, the transmitter (TX) needs
to produce a full-rate data stream with precise timing for correct data transmission,
and to provide sufficient voltage swing and appropriate equalization so that the
received signal maintains an adequate swing for the receiver to distinguish
the transmitted data bits without errors. This chapter presents a 5-50 Gb/s transmitter
with a 4-tap feed-forward equalizer (FFE), where the unit interval (UI)-spaced serial
data are produced by four parallel 4:1 multiplexers (MUXs). This scheme brings
several benefits, including compact layout implementation, accurate 1-UI delay
generation, and a wide operating range. To mitigate the inherently large self-drain
capacitance of the 4:1 MUX, an enhanced 4:1 pull-down unit cell is proposed, which not
only improves the maximum operation speed but also effectively reduces the
charge-sharing effect. A compact latch array with an interleaved-retiming technique is
adopted to produce the required 16 paths of quarter-rate data streams, where the
retiming clocks for both the latch array and the 4:1 MUXs are generated by a clock
bundle that is implemented in a power-efficient CMOS style.
In the rest of this chapter, we first illustrate the design challenges in the
high-speed transmitter and then present the designed transmitter architecture.
Following that, the enhanced 4:1 MUX and the clocking techniques for the transmitter
are described. Finally, the experimental results of the transmitter are discussed.
Chapter 4. The Transmitter Design
Figure 4.1: (a) Critical path and (b) timing diagram for the 2:1 MUX. Here, tdiv is the
delay of the divider, tck−q is the ck-to-q delay of the 2:1 MUX, and tsetup is the setup
time of the sampling latch.
The difficulties in high-speed transmitter design mainly lie in two aspects.
The first is the timing constraints of the final-stage serialization; the second is
the bandwidth limitations of high-speed blocks such as latches, MUXs, and clock/data
driving buffers, which usually involve a trade-off between bandwidth extension and
power consumption.
Fig. 4.1 redraws the critical path and timing diagram of the 2:1 MUX. Note that
the latch needs to sample Da with CK1; hence, sufficient setup time and hold time for
the sampling latch must be guaranteed for correct functionality. As shown
in Fig. 4.1(b), the hold time can be easily met, as the data always remain valid for an
adequate time after the arrival of CK1. To satisfy the setup-time constraint, the
following condition must hold.
where tdiv is the delay of the divider with a division factor of 2, tck-q is the ck-to-q
delay of the 2:1 MUX, and tsetup is the setup time of the sampling latch. The other
possible critical path is located at the final data-selection stage, where the margin
must satisfy
\[ t_{setup}^{MUX} + t_{hold}^{MUX} < 1\,\mathrm{UI}, \tag{4.2} \]
where $t_{setup}^{MUX}$ and $t_{hold}^{MUX}$ stand for the setup time and hold time of
the final 2:1 MUX, respectively. As the data rate increases, the reduced bit period will
lower these timing margins
and hence limit the maximum operation rate. Moreover, the delay changes associated
with process, voltage, and temperature (PVT) variations make this problem even
more challenging. To overcome this difficulty, traditional half-rate transmitters often
insert extra delay-matching buffers [27, 24] or phase calibration loops [100, 33, 26]
between CK1 and the latch [see Fig. 4.1(a)]. For the former method, the delay
fluctuation between the multiplexing path and the delay-matching path may exceed 1 UI
and thereby cause bit errors. For the latter approach, the timing margin is subject to
the accuracy of phase detection, which could reduce the stability, reliability, and
robustness of the serializer. Meanwhile, both techniques involve substantial power
and area overhead. An alternative solution is to replace the last three 2:1 MUXs with
a single 4:1 MUX [159, 32, 24, 99]. The resulting quarter-rate serialization relaxes the
critical-path timing margin to 3 UI, halves the maximum clock speed, and saves
considerable power, thus making it a promising solution for high-speed serialization.
It is worth noting that these benefits come with the penalty of a doubled self-drain
capacitance, which dramatically degrades the bandwidth of the 4:1 MUX and hence
limits its maximum operation speed.
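To put the timing windows above into numbers, the short Python sketch below converts a data rate into the bit period and compares the 1 UI window of Eq. (4.2) with the 3 UI window quoted for quarter-rate serialization. The helper names are illustrative, and the data rates are merely sample points within the 5-50 Gb/s range targeted in this chapter.

```python
def unit_interval_ps(data_rate_gbps):
    """Bit period (1 UI) in picoseconds for a data rate given in Gb/s."""
    return 1e3 / data_rate_gbps


def timing_window_ps(data_rate_gbps, window_ui):
    """Absolute duration of a timing window expressed in UI."""
    return window_ui * unit_interval_ps(data_rate_gbps)


for rate in (5, 20, 40, 50):
    one_ui = timing_window_ps(rate, 1)    # window of Eq. (4.2)
    three_ui = timing_window_ps(rate, 3)  # quarter-rate critical-path window
    print(f"{rate:>2} Gb/s: 1 UI = {one_ui:6.1f} ps, 3 UI = {three_ui:6.1f} ps")
```

At 40 Gb/s, for instance, 1 UI is 25 ps, so setup time, hold time, and PVT-induced delay drift must all fit within a few tens of picoseconds in a half-rate final stage.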
The transmitter contains a large number of latches, MUXs, and data/clock driving
buffers that operate at high speeds. As the data rate increases, the bandwidth
requirements for these blocks rise accordingly. With insufficient bandwidth, the
signal has difficulty reaching its high level or returning to its low level, resulting
in an attenuated amplitude. This is detrimental to the high-speed transmitter in three
ways. Firstly, the insufficient bandwidth can make the ck-to-q delay of the MUX occupy a
prominent portion of the bit period and hence restrict the maximum operation rate.
Figure 4.2: (a) Traditional CML-based MUX implementation and (b) power consumption
with different multiplexing ratios [16]. Here, N refers to the number of multiplexing
branches.
Secondly, the limited bandwidth could slow down the transition edges of the
transmitting clocks, which deteriorates the jitter performance of the clock. Thirdly,
the insufficient bandwidth could lead to prominent inter-symbol interference (ISI).
Specifically, with limited bandwidth the bit pulses cannot fully reach the high level or
return to the low level, and thereby leave long tails over the succeeding bits.
As a general method, the bandwidth can be extended by burning more power. Fig.
4.2(b) describes the power consumption versus data rate for different multiplexing
ratios (i.e., 1, 2, and 4), where the 1:1 MUX actually refers to the clock/data buffer
and the performance of the latch can be estimated by that of the 2:1 MUX. At low data
rates, where the self-drain capacitance can be neglected, the power consumption is
linear in the data rate. As the data rate rises, the power consumption grows
exponentially. This can be understood by noting that the self-drain capacitance
gradually becomes the dominant load, so that the bandwidth of the MUXs can no longer be
extended solely by increasing the transistor sizes and power consumption. Referring to
the curves in Fig. 4.2(b), it seems that the half-rate serialization with the 2:1 MUX is
much more efficient than the quarter-rate serialization with the 4:1 MUX. However,
the quarter-rate serialization scheme eliminates three half-rate latches, two
half-rate 2:1 MUXs, and a large number of quarter-rate latches as well as a few half-rate
clock/data driving buffers. These significant power savings can effectively compensate
for the power increase in the 4:1 MUX. The other advantage of this multiplexing
scheme is that it significantly relaxes the timing constraints for the final-stage data
serialization, since the data to be multiplexed operate at quarter rate rather than
half rate. The input data width is doubled and hence provides a doubled timing margin,
which makes it possible to produce the full-rate data stream across PVT variations
without additional matching buffers or phase-tuning mechanisms. Owing to these good
properties, the quarter-rate architecture has become one of the most promising solutions
for 20+ Gb/s transmitter designs. One of the main tasks in this work is to optimize the
operation speed and energy efficiency of the 4:1 MUX, including topology consideration,
unit cell enhancement, and clocking optimization.
[Figure 4.3: Block diagram of the transmitter chip, comprising the parallel PRBS generator, the latch array, four 4:1 MUXs with drivers (taps α-1, α0, α1, α2), the 4-tap FFE combiner, and the clock bundle.]
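The bandwidth limitation caused by the self-drain capacitance, discussed earlier in this section, can be illustrated with a first-order RC model. The sketch below is not taken from this work: it assumes that the load resistance scales inversely with the device width (fixed output swing), that the self-drain capacitance scales linearly with the width and with the number of multiplexing branches N, and it uses arbitrary illustrative parameter values.

```python
import math


def mux_bandwidth_hz(width_um, n_branches,
                     r_unit_ohm_um=1000.0,   # assumed: R_load = r_unit / W
                     c_sd_f_per_um=1e-15,    # assumed self-drain cap per um per branch
                     c_fixed_f=20e-15):      # assumed fixed wiring/next-stage load
    """First-order -3 dB bandwidth of an N:1 MUX output node."""
    r_load = r_unit_ohm_um / width_um
    c_total = c_fixed_f + n_branches * c_sd_f_per_um * width_um
    return 1.0 / (2.0 * math.pi * r_load * c_total)


for width in (5, 10, 20, 40, 80):
    bw2 = mux_bandwidth_hz(width, n_branches=2) / 1e9
    bw4 = mux_bandwidth_hz(width, n_branches=4) / 1e9
    print(f"W = {width:3d} um: 2:1 MUX {bw2:5.1f} GHz, 4:1 MUX {bw4:5.1f} GHz")
```

Under these assumptions the bandwidth saturates at 1/(2π·r_unit·N·c_sd) as the width (and hence the power) grows, and the saturation level of the 4:1 MUX is half that of the 2:1 MUX, which is consistent with the qualitative argument about the doubled self-drain capacitance.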
The block diagram of the transmitter chip is illustrated in Fig. 4.3. It consists of
a multi-MUX-based 4-tap FFE combiner, a latch array, an on-chip PRBS generator,
and a clock bundle. The on-chip PRBS generator produces the parallel quarter-rate
data streams D0<n>, D1<n>, D2<n>, and D3<n>. These four data streams are then
latched in an interleaved manner by the compact latch array to produce the
16-path quarter-rate data for the following four 4:1 MUXs. The desired timing
relationship (see the signal positions in the latch array), which enables each MUX to
share the same timing margin, is satisfied by relatching with 90°-spaced quarter-rate
clocks. After the four 4:1 MUXs, the four full-rate UI-spaced serial sequences are first
buffered by the pre-drivers and then sent to the 4-tap FFE combiner, which pre-distorts
the output waveform and launches it into the transmission channel. In the clock bundle,
a clock conditioner converts the incoming single-ended half-rate clock into differential
outputs, which are then fed into a divider (DIV2) to generate the quarter-rate I and
Q clocks. Four CML2CMOS converters transform these quadrature clocks into
full-swing clocks, which are further applied to four driving buffers
and four pseudo-AND2s to produce the 50% and 25% duty-cycle clocks for the latch array
and the 4:1 MUXs, respectively.
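The combining operation of the 4-tap FFE described above can be sketched behaviorally as a weighted sum of four UI-spaced copies of the serial data (one pre-cursor, the main cursor, and two post-cursors, corresponding to α-1, α0, α1, and α2 in Fig. 4.3). The Python sketch below is only an idealized illustration: the function name, the ±1 symbol mapping, and the tap weights are assumptions and are not measured or designed values from this work.

```python
def ffe_output(bits, pre, main, post1, post2):
    """Weighted sum of four UI-spaced copies of the bit stream (alpha_-1 .. alpha_2)."""
    symbols = [2 * b - 1 for b in bits]  # map {0, 1} to {-1, +1}

    def sym(n):
        # Treat symbols outside the burst as zero.
        return symbols[n] if 0 <= n < len(symbols) else 0

    return [pre * sym(n + 1) + main * sym(n) + post1 * sym(n - 1) + post2 * sym(n - 2)
            for n in range(len(symbols))]


# Hypothetical tap weights, chosen only for illustration.
TAPS = dict(pre=-0.1, main=0.7, post1=-0.15, post2=-0.05)

bits = [0, 1, 1, 0, 0, 0, 1, 0]
levels = ffe_output(bits, **TAPS)
print([round(v, 2) for v in levels])
```

With negative pre- and post-cursor weights, a bit that differs from its neighbors is emphasized relative to bits inside a long run, which is the pre-distortion that counteracts the low-pass behavior of the channel.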
The main feature of the transmitter chip is the compact implementation of the
multiple-4:1-MUX-based 4-tap FFE, which not only relaxes the stringent timing
requirement of the final serialization stage but also provides a robust approach to
supporting a wide operation range. The quarter-rate multiplexing scheme implemented by
the 4:1 MUXs is what relaxes the timing requirement, while the interleaved-latching
method guarantees that the 16 quarter-rate data streams always maintain sufficient
timing margins for the 4:1 MUXs. To improve the performance of the 4:1 MUX,
we propose a new unit cell that cancels the charge-sharing effect, which not only reduces
the output jitter but also helps to optimize the self-drain capacitance and hence
improves the maximum operation speed. For the clocking, the shared 25% duty-cycle
UI-spaced clocks are produced by pseudo-AND2s (pseudo-NAND2s followed by inverters).
This clocking scheme not only offers high power efficiency but also provides full-swing
outputs and hence optimizes the sizes of the gating transistors in the 4:1 MUX.
Figure 4.5: Four possible unit cell implementations of the 4:1 MUX.
Fig. 4.4 displays the conceptual schematic of the traditional 4:1 MUX, which consists
of four pull-down unit cells and a pair of shunt-peaked loads. Each unit cell performs
two tasks, i.e., clock ANDing and data sampling. The former ANDs two adjacent clock
phases to determine the edge positions of the output pulse, and the latter samples the
input data and hence decides the logic level of the output pulse.
Fig. 4.5 shows four possible implementations of the unit cell within the 4:1 MUX.
One common feature of these unit cells is that the current source is eliminated to reduce
the number of stacked devices. In the first implementation [see Fig. 4.5(a)], the
ANDing and sampling operations are combined into one stage, and hence the number of
internal nodes is reduced to the minimum. Nonetheless, the stacked devices
in the output stage need large sizes to provide sufficient driving current. The increased
device size presents a large capacitive load to the preceding stage and results in an
increased self-drain capacitance, which in turn limits the maximum operation speed
and/or the achievable power efficiency. To mitigate these issues, a second realization,
shown in Fig. 4.5(b), is developed, where a separate stage is introduced to
AND the two adjacent clock phases CKin,1 and CKin,2 to produce the 25% duty-cycle
pulse. This pulse is then applied to the output stage to gate the enabling transistor,
which transmits the input data Din to the output. By separating the ANDing and sampling
operations into two stages, the stacked devices in the output stage are reduced to two,
which could significantly improve the operation speed and power efficiency of the 4:1
MUX. However, processing the 25% duty-cycle pulse, together with its sharp-edge
requirement, places a stringent requirement on the 1-UI pulse generation.
To avoid the 25% duty-cycle pulse, a third possible implementation
of the unit cell is developed in [24]. As shown in Fig. 4.5(c), the leading clock CKin,1
is first sampled by the input data Din to remove the high pulses whenever Din is
low (corresponding to no discharging current in the output stage). After that, this
data-selected clock together with CKin,2 generates the pulse current that transmits
the input data to the output with an accurate UI spacing. This technique possesses
three advantages. Firstly, the 25% duty-cycle pulse is avoided, and hence
the stringent speed requirement on the internal nodes is relaxed. Secondly, the switching
activity of the preceding sampling stage is determined by the input data Din.
For random input data with equal probabilities of high and low, the switching activity
is 50%, which is lower than that of the design in Fig. 4.5(b). Finally, the sampling
stage actually performs the function of a latch, and thereby a latch in the preceding
stage can be saved. Fig. 4.5(d) shows a variant of the design in Fig. 4.5(c) [32].
Instead of using an NMOS for the first latch, a PMOS latch is utilized to keep node X
pre-discharged rather than pre-charged. This allows the intermediate inverter to be
removed, which reduces the number of operating devices and hence leads to a significant
power saving. This unit cell also naturally implements the latching function and
therefore saves a latch in the preceding data path. The main disadvantage of this
topology is the stacked devices in the latch, which could slow down the edge transitions
of node X, thus limiting its maximum operation speed. Another common drawback of the
unit cells in Fig. 4.5(c) and (d) is that both the sampling and ANDing operations are
integrated in the unit cell, which rules out the possibility of sharing the ANDing stage.
Fig. 4.6(a) describes the schematic of the developed 4:1 MUX. Like the traditional
4:1 MUX shown in Fig. 4.4, it is composed of a pair of shunt-peaked loads
and four identical pull-down unit cells. Unlike the conventional 4:1 MUX, which is
directly driven by the quadrature 50% duty-cycle clocks, these unit cells are activated
sequentially by four 25% duty-cycle UI-spaced phases (CK0-90-180-270) to combine
the four quarter-rate data streams (D0-1-2-3) into one serial sequence (SDATA) [see
Fig. 4.6(b)]. Compared to the 4:1 MUXs presented in [24, 32], which combine both the
ANDing and sampling operations in the pull-down unit cell, the unit cell
in this design only performs the sampling operation, while the ANDing operation is
carried out by the pseudo-AND2s in the clock bundle (see Fig. 4.3). This split
arrangement allows the four 4:1 MUXs in Fig. 4.3 to share one common ANDing stage,
thus offering greater potential for power efficiency.
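The sequential activation described above can be captured in a small behavioral model: in every UI exactly one of the four 25% duty-cycle phases is high, and the enabled unit cell passes its quarter-rate data bit to the output. The following Python sketch is an idealized logic-level illustration only (no analog effects, delays, or setup/hold checks); the function name and data layout are assumptions.

```python
def mux4to1(streams):
    """Serialize four quarter-rate bit streams (D0..D3) into one full-rate stream.

    streams[k][n] is bit n of quarter-rate stream Dk. Each quarter-rate word
    spans 4 UI, and in UI slot `phase` only unit cell `phase` is enabled.
    """
    n_words = len(streams[0])
    sdata = []
    for slot in range(4 * n_words):
        word, phase = divmod(slot, 4)
        # One-hot enables model the 25% duty-cycle UI-spaced clocks.
        enables = [1 if k == phase else 0 for k in range(4)]
        bit = 0
        for k in range(4):
            bit |= enables[k] & streams[k][word]
        sdata.append(bit)
    return sdata


d0, d1, d2, d3 = [1, 0], [0, 0], [1, 1], [0, 1]
print(mux4to1([d0, d1, d2, d3]))
```

The output order D0<n>, D1<n>, D2<n>, D3<n>, D0<n+1>, ... corresponds to the SDATA sequence of the timing diagram in Fig. 4.6(b).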
[Timing diagram signals in Fig. 4.6(b): quarter-rate data D0N/P-D3N/P, clocks CK0, CK90, CK180, CK270, and output SDATA in the order D1<n-1>, D2<n-1>, D3<n-1>, D0<n>, D1<n>, D2<n>, D3<n>, D0<n+1>, with Tsetup and Thold marked.]
Figure 4.6: Topology of the 4:1 MUX. (a) Conceptual schematic and (b) timing diagram.
The main drawback of the quarter-rate serialization is the doubled self-drain
capacitance of the 4:1 MUX, which significantly constrains the maximum operation speed.
Consequently, bandwidth-extending techniques for the 4:1 MUX are highly desired.
This part first discusses the drawbacks of traditional unit cells and then presents
our optimization techniques. Fig. 4.7 depicts the two widely used traditional unit cells
that support the split placement of the ANDing and sampling operations. To optimize
the operation speed, the current-source transistors are eliminated to avoid stacked
devices. In the data-up structure [101, 32] depicted in Fig. 4.7(a), the output can be
corrupted by the data transitions on other branches through the forward-coupling path
from the data input to the output when the MUX is performing data selection on one
branch [37]. Fig. 4.7(b) describes the clock-up structure [21, 103], which addresses
the forward-coupling problem by moving the clocking pairs to the top to eliminate the
feed-through path. However, it suffers from a severe charge-sharing effect between the
outputs VOP/VON and the junction nodes X/Y, which causes glitches when two consecutive
bits are at high level or slows down the rising edges of low-to-high transitions.
Inspired by the voltage-mode source-series-terminated (SST) driver discussed in [98],
we introduce a pair of pre-charging transistors PM1/PM2 connected to nodes X/Y to
mitigate this effect. As shown in Fig. 4.8, the pre-charging PM1/PM2 and the
data-gating NM1/NM2 actually constitute two inverters, which keep nodes X/Y always
pre-driven to the desired states, thus eliminating the charge-sharing effect. Compared
to the SST implementation in [98], the improved 4:1 MUX shows greater potential for
high-speed applications. The reason is that it can fully exploit the process capability,
as its compact NMOS driving topology naturally features fast current switching and
small parasitic capacitance. Additionally, the speed-constraining output capacitances,
including the self-drain load, routing wire, and far-end driving load, can be neutralized
by adopting on-chip peaking inductors. In the rest of this part, we discuss the adverse
effect of charge sharing in the conventional clock-up structure and the favorable effect
of the introduced pre-charging transistors.
Figure 4.7: Traditional unit cell implementations for high-speed 4:1 MUX. (a) Data-up
structure and (b) clock-up structure.
[Figure 4.8: Proposed pull-down unit cell with pre-charging transistors PM1/PM2 (nodes X/Y, data inputs D0N/D0P, clock CK0, outputs VOP/VON).]
Figure 4.9: Effect of the introduced PM on (a) high-level glitches and (b) edge
transitions.
(1) Charge-sharing effect in conventional clock-up structure
The top row of the simulated waveforms in Fig. 4.9(a) and (b) demonstrates the
two adverse effects of charge sharing in the conventional clock-up structure [see
Fig. 4.7(b)]. Assuming the upcoming data D0P/D0N are logic high/low, node Y is
pre-discharged to ground through NM2, which helps to speed up the falling edge.
The voltage of node X depends on the previously transmitted data. If the previous
D0N is logic low, node X has been charged to its allowed maximum value
(VDD − VTHN) during the selection-enabled period (the high-pulse duration of CK0)
and holds this value up to the present instant, since NM1 has remained in the cut-off
state. This does not cause a prominent charge-extraction effect, as node X has already
been charged to the allowed maximum value by the previously transmitted bit. If the
previous D0N is logic high, node X stays at the ground voltage to which it was pulled
down during the hold time in the previous bit period [i.e., Thold in Fig. 4.6(b)]. When
the high pulse of CK0 arrives, the capacitance at node X extracts charge from the
output, thus causing a remarkable glitch for two consecutive output bits at high level
or slowing down the rising edge of a low-to-high transition [see the waveform details
in the top row of Fig. 4.9(a) and (b)].
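The size of the glitch described above can be estimated to first order from charge conservation between the output capacitance and the capacitance at node X at the instant the clock-gated transistor turns on. The Python sketch below is a simplified illustration only: it ignores the recharging through the load resistor and the VDD − VTHN limit of the NMOS pass device, and the supply voltage and capacitance values are hypothetical rather than extracted from the design.

```python
def shared_voltage(c_out_f, v_out, c_x_f, v_x):
    """Common voltage after two capacitors are connected (charge conservation)."""
    return (c_out_f * v_out + c_x_f * v_x) / (c_out_f + c_x_f)


VDD = 1.2       # hypothetical supply voltage (V)
C_OUT = 60e-15  # hypothetical output-node capacitance (F)
C_X = 15e-15    # hypothetical junction capacitance at node X (F)

# Conventional clock-up cell: node X was left at ground by the previous bit.
droop_without_pm = VDD - shared_voltage(C_OUT, VDD, C_X, 0.0)

# With a pre-charging transistor: node X is already at the output level.
droop_with_pm = VDD - shared_voltage(C_OUT, VDD, C_X, VDD)

print(f"droop without pre-charge: {droop_without_pm * 1e3:.0f} mV")
print(f"droop with pre-charge:    {droop_with_pm * 1e3:.0f} mV")
```

In this simplified picture the droop equals VDD·C_X/(C_OUT + C_X) when node X starts at ground and vanishes when node X is pre-driven to the output level, which is the motivation for keeping nodes X/Y pre-driven to the desired states.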
(2) The effect of the introduced pre-charging transistors
To demonstrate the effect of the introduced pre-charging transistors PM1/PM2
shown in Fig. 4.8, we take the PH0 branch as an example to illustrate the operation
of the proposed pull-down unit cell. When the input data arrive, depending on
D0N/D0P, nodes X/Y are either pre-charged to VDD or pre-discharged to VSS by the
two inverters consisting of PM1/PM2 and NM1/NM2. This makes nodes X/Y always
pre-driven to the desired states, which coincide with the output signal levels. When
CK0 goes high, NM3/NM4 are turned on to send D0N/D0P to the MUX's
outputs. After a period of 1 UI, the pull-down path is switched off by the falling edge
of CK0, and the voltage level of nodes X/Y stays unchanged until the next input data
arrive. The main feature of this 4:1 MUX is its ability to eliminate the charge-sharing
effect caused by the parasitic capacitances at nodes X/Y, which brings several benefits.
Firstly, the deterministic jitter and glitches caused by charge extraction can
be remarkably mitigated [see the middle row in Fig. 4.9(a) and (b)]. Moreover, the
glitch elimination effectively improves the noise margin, which allows a lower output
swing to save power. Secondly, the elimination of the charge-sharing effect makes the
capacitances at nodes X/Y less significant. Thus, large-size NM1/NM2 can be used
to enhance the discharging capability. Note that the output swing is determined by
the ratio of the resistive load to the equivalent resistance of the stacked NM1/NM3
(NM2/NM4). For a fixed minimum output swing, the large size of NM1/NM2 implies that
the size of NM3/NM4 can be reduced. The smaller NM3/NM4 help to decrease the
self-drain capacitances of the unit cells. Consequently, the bandwidth of the overall
4:1 MUX can be expanded. Thirdly, the added transistors PM1/PM2 provide another
path through NM3/NM4 to help pull up the output, which accelerates the rising
transitions.
[Component values labeled in Fig. 4.10: (a) 320 pH, 80 ohm; (b) 150 ohm; (c) 20k ohm, 200 fF.]
Figure 4.10: Circuit details of the clocking blocks. (a) Clock conditioner, (b) DIV2,
and (c) CML2CMOS.
As depicted at the bottom of Fig. 4.3, the desired full-swing clocks for the latch
array and the 4:1 MUXs are produced by a clock bundle, where current-mode logic
(CML)-style circuits are employed in the clock conditioner and DIV2 to support the
highest-speed (half-rate) operation, while the CML2CMOS and pseudo-AND2, which
operate at quarter rate, are implemented in a more power-efficient CMOS style.
Fig. 4.10 presents the implementation details of these clocking blocks. In the clock
conditioner [see Fig. 4.10(a)], an AC-coupled CML stage with one input connected to a
fixed common-mode voltage (2VDD/3) is adopted to perform the
single-ended-to-differential conversion. This differential clock is further rectified by
two CML buffers. To reduce the power consumption, multi-layer on-chip inductors are
employed to neutralize the output capacitances. For the DIV2, a traditional inductorless
CML latch shown in Fig. 4.10(b) is used to balance the operation speed and layout
compactness. Fig. 4.10(c) gives the schematic details of the CML2CMOS, where an
AC-coupled inverter with a feedback resistor is utilized to convert the CML voltage
level to full-swing CMOS logic. This compact CML2CMOS offers small area
occupation and high power efficiency. To some extent, it is also capable of performing
duty-cycle correction, since the DC voltage of the converted full-swing
clock is fed back to bias the common-mode voltage of the inverter.
For the pseudo-AND2, its function is to AND two 50% duty-cycle quarter-rate
clocks with a 90° phase shift to generate the 25% duty-cycle clocks (CK0-90-180-270
in Fig. 4.3) for the 4:1 MUX. As these clocks drive the final retiming stage, the
transmitter performance largely relies on them, since any timing deviation is directly
converted into output jitter. This necessitates the following two properties: i) the
high pulse width of each phase should be accurately 1 UI, and ii) the spacing
between any two adjacent phases should be the same and equal to 1 UI. Generally
speaking, these pulses can be created by NOR/AND of two 50% duty-cycle quarter-rate
clocks with 90° phase shifts. Considering the fact that series NMOS transistors are
much faster than series PMOS transistors, a NAND2 followed by a driving inverter
is a better choice. Fig. 4.11 presents the designed pseudo-NAND2 and its operation
waveforms. In contrast to a conventional NAND2, this pseudo-NAND2 eliminates
one of the two pull-up transistors and keeps only PM1, which is driven by CK0 [see
Fig. 4.11(a)]. In doing so, the output capacitance is reduced, thus leading to a higher
operation speed. The similar circuit realizations of the pseudo-AND2 and the BUF
(consisting of two cascaded inverters) also mitigate the delay mismatch between td1 and
td2 (see Fig. 4.3), which helps to meet the stringent timing constraints across PVT
variations. Fig. 4.11(b) presents the operation waveforms of the pseudo-NAND2. At the
beginning of PH1, node OUT is pulled up to VDD by PM1, and this level is held during
PH2 since NM1 is still off. In PH3, both NM1 and NM2 are turned on to generate the
UI-spaced pulse. It is worth noting that a charge-sharing effect does exist between the
capacitance at node X and the output. In particular, at the beginning of PH1, CK0 goes
low to trigger PM1 to charge the output node, while node X extracts charge through NM1
since CK90 still remains high. To alleviate this effect, an abutment layout approach
with minimum gate spacing [see Fig. 4.11(a)] is exploited to reduce the parasitic
capacitance at node X. The large series transistors are divided into several small
series transistors, and every two small ones are connected in parallel, sharing a
common drain region to reduce the junction area.
Figure 4.11: Pseudo-NAND2. (a) Circuit details and (b) operation waveform.
[Figure 4.12: Layout view of the whole transmitter chip, with labeled blocks: clock conditioner, MCG, PRBS generator, Pre1/Main/Pst1/Pst2 tap paths, and FFE combiner.]
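The logic function of the pseudo-AND2 stage can be checked with a simple discrete-time model in which one quarter-rate clock period is divided into four UI slots. The Python sketch below is an idealized illustration (ideal edges, no delay mismatch); the function names and the choice of which slots each phase occupies are assumptions made for the example.

```python
SLOTS_PER_PERIOD = 4  # one quarter-rate clock period spans 4 UI


def quadrature_clock(phase_deg, n_periods=2):
    """50% duty-cycle quarter-rate clock, sampled once per UI slot."""
    start = (phase_deg // 90) % SLOTS_PER_PERIOD
    high_slots = {start, (start + 1) % SLOTS_PER_PERIOD}
    return [1 if (slot % SLOTS_PER_PERIOD) in high_slots else 0
            for slot in range(SLOTS_PER_PERIOD * n_periods)]


def and2(a, b):
    """Ideal AND of two sampled clocks."""
    return [x & y for x, y in zip(a, b)]


phases = {deg: quadrature_clock(deg) for deg in (0, 90, 180, 270)}

# Each 25% duty-cycle pulse is the AND of two adjacent 90-degree-spaced phases.
pulses = [and2(phases[deg], phases[(deg + 90) % 360]) for deg in (0, 90, 180, 270)]

for deg, pulse in zip((0, 90, 180, 270), pulses):
    print(f"AND(PH{deg}, PH{(deg + 90) % 360}): {pulse}")
```

Each resulting pulse is high for exactly one of every four UI slots, and the four pulses never overlap, which corresponds to the two properties required above: a 1-UI pulse width and a uniform 1-UI spacing between adjacent phases.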
The transmitter is designed using a Dell R730 server with two E5-2609V4 CPUs,
128 GB of memory, and an 8 TB hard disk. The schematic, layout, and simulation are
completed with Schematic Composer, Virtuoso Layout, and Spectre/APS, respectively, which
are developed by Cadence; the Cadence version is IC5141. The layout verification,
i.e., layout versus schematic (LVS) and design rule check (DRC), and the parasitic
extraction (PEX) are carried out using Calibre 2013, which is developed by Mentor
Graphics. To perform the measurements of the fabricated prototype, a
KEYSIGHT N5191A is used to generate the input clock, and a KEYSIGHT DSA-X
93204A with an 80 GS/s sampling rate and a 32 GHz bandwidth is utilized to characterize
the jitter performance of the transmitter.
Figure 4.13: Layout views of the crucial blocks. (a) 4:1 MUX, (b) interleaved-retiming latch array, (c) pseudo-NAND2 with an inverter, (d) CML2CMOS converter, (e) DIV2, and (f) clock conditioner.
Fig. 4.12 displays the layout view of the whole transmitter chip. The FFE combiner is located at the right edge of the chip to directly drive the output pads. The four paths consisting of 4:1 MUXs and drivers (i.e., main tap, pst2 tap, pst1 tap, and pre tap in Fig. 4.12) are placed next to the FFE combiner to shorten the connection wires they drive. The PRBS generator and the latch array are distributed in the free space among these four multiplexing paths to generate the quarter-rate data with appropriate delays. The clock conditioner and the MCG are placed at the left side of the chip to provide proper clocks for the PRBS generator, latch array, and 4:1 MUXs.
Fig. 4.13 further presents the layout views of the crucial blocks. For the 4:1 MUX shown in Fig. 4.13(a), the parasitic capacitances on the output nodes are minimized to maximize the operation speed. For the latch array displayed in Fig. 4.13(b), special attention is paid to the latch placement to facilitate the signal connections. For the pseudo-NAND2 shown in Fig. 4.13(c), an abutment layout approach with minimum poly spacing is adopted to reduce the parasitic capacitance on node X shown in Fig. 4.11. For the CML2CMOS converter, DIV2, and clock conditioner [see Fig. 4.13(d), (e), and (f)], special attention is paid to minimizing the parasitic capacitance, so that the received clock can be properly amplified, rectified, and divided.
Fig. 4.14 illustrates the simulation setup of the transmitter chip. The inputs bias_main, bias_pre, bias_pst1, and bias_pst2 correspond to the four tap weights of the FFE combiner. The input clock operates at 25 GHz. The muxed data are the direct outputs of the 4:1 MUX on the main-tap path. The output data are DC coupled to a pair of far-end 50 ohm resistors through a channel with 12 dB attenuation at 20 GHz.
To evaluate the effect of the introduced PMs in the 4:1 MUX, the transient outputs and overlapped eye-diagrams using the traditional unit cell [see Fig. 4.7(b)] and the enhanced unit cell (see Fig. 4.8) are separately displayed in Fig. 4.15. The simulated eye-diagrams indicate that the ISI induced by charge sharing is reduced from 1.6 ps to 0.3 ps and that the voltage glitches are mostly removed. It is worth noting that the proposed 4:1 MUX has a drawback: its output swing is sensitive to PVT variations. The reason is that the equivalent resistance of the two stacked transistors can change considerably across PVT corners. Fig. 4.16 gives the swing variations for different PVT corners, showing that the swing variation can be kept under 25%; it can be further reduced by adopting the tunable resistor described in [24].
Figure 4.15: (a) Transient waveform of the traditional unit cell, (b) transient waveform of the enhanced unit cell, (c) eye-diagram of the traditional unit cell, and (d) eye-diagram of the enhanced unit cell.
Figure 4.16: Swing variations of the improved unit cell under different PVT corners. (Plotted quantity: Δ% = ΔSW / 500 mV, i.e., the swing variation relative to the swing at the typical corner, versus temperature in °C.)
Figure 4.17: Simulated eye-diagrams of the transmitter at (a) 10 Gb/s with over-equalization, (b) 40 Gb/s with proper equalization, (c) 50 Gb/s without equalization, and (d) 50 Gb/s with proper equalization.
The performance of a transmitter is usually characterized by its output eye-diagram, which folds a time-domain waveform into one or several bit periods. The two critical parameters of the eye-diagram are the voltage swing and the inner eye opening: the former determines the transmitter output power and sets a requirement for the receiver sensitivity, while the latter reflects the overall jitter, noise, and effective bandwidth. Fig. 4.17 shows the simulated eye-diagrams. Fig. 4.17(a) displays the simulated eye-diagram at 10 Gb/s with over-equalization, where the sub-levels are contributed by the FFE taps. Fig. 4.17(b) presents the simulated eye-diagram at 40 Gb/s with proper equalization, where the horizontal jitter and the vertical eye opening are 3.7 ps and 440 mV, respectively. Fig. 4.17(c) and (d) compare the eye-diagrams before and after applying appropriate equalization at 50 Gb/s. Clearly, the FFE significantly improves the eye opening: the horizontal jitter is reduced from 11.2 ps to 3.6 ps and the vertical swing is increased from 150 mV to 400 mV.
[Figure 4.18: chip micrograph with annotated dimensions of 1200 μm × 500 μm.]
[Figure 4.19: power breakdown. MUXs: 22 mW; latch array: 11 mW; driver/FFE combiner: 43 mW; PRBS generator: 7 mW; clocking: 73 mW.]
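The origin of the sub-levels in an over-equalized eye can be illustrated with a minimal behavioural sketch of a 4-tap FFE (one pre-cursor, one main, and two post-cursor taps). The tap weights below are hypothetical values chosen only for illustration, not the weights used in this design, and the channel is assumed to be lossless so that the FFE output is observed directly.

```python
from itertools import product

# Hypothetical tap weights (pre1, main, pst1, pst2); not taken from the thesis.
TAPS = (-0.10, 0.60, -0.20, -0.10)

def ffe_output(bits, taps=TAPS):
    """Return the FFE output samples for an NRZ bit sequence.

    Each sample is a weighted sum of the next bit (pre-cursor), the current
    bit (main cursor), and the two previous bits (post-cursors).
    """
    symbols = [1.0 if b else -1.0 for b in bits]
    out = []
    for n in range(2, len(symbols) - 1):
        window = (symbols[n + 1], symbols[n], symbols[n - 1], symbols[n - 2])
        out.append(sum(w * s for w, s in zip(taps, window)))
    return out

# Enumerate every 4-bit neighbourhood to list the distinct output levels.
levels = sorted({round(ffe_output(p)[0], 6) for p in product((0, 1), repeat=4)})
print(len(levels), "distinct levels:", levels)
```

With these assumed weights, a long run of identical bits is de-emphasized to a fraction of the full swing, while isolated transitions reach the full swing; at low data rates, where the channel does not smooth these levels, they appear as separate traces in the eye.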
Figure 4.20: Measured output eye-diagrams of the transmitter at (a) 5 Gb/s with over
equalization, (b) 40 Gb/s without equalization, (c) 40 Gb/s with proper equalization,
and (d) 50 Gb/s with proper equalization.
Fig. 4.18 presents the chip micrograph; the chip occupies an area of 0.6 mm². Fig. 4.19 shows the power breakdown of the transmitter chip. It consumes 156 mW from a 1.2 V supply when operating at 50 Gb/s, of which the four enhanced 4:1 MUXs consume only 22 mW. The fabricated chip is mounted on a printed circuit board (PCB) through wire bonding. The transmitter output is measured after a compound channel consisting of doubled bonding wires, a PCB trace, and a connection cable.
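As a quick consistency check on these figures, the per-block power values annotated in Fig. 4.19 can be summed and converted into an energy per bit. The short sketch below uses only the numbers quoted in this chapter; the variable names are illustrative.

```python
# Per-block power in mW, as annotated in the power-breakdown figure.
POWER_MW = {
    "MUXs": 22,
    "latch array": 11,
    "driver/FFE combiner": 43,
    "PRBS generator": 7,
    "clocking": 73,
}
DATA_RATE_GBPS = 50  # data rate at which the total power is quoted

total_mw = sum(POWER_MW.values())
# mW divided by Gb/s is numerically equal to pJ/bit.
energy_pj_per_bit = total_mw / DATA_RATE_GBPS
mux_share = POWER_MW["MUXs"] / total_mw

print(f"total = {total_mw} mW, {energy_pj_per_bit:.2f} pJ/bit, MUX share = {mux_share:.1%}")
```

The block powers add up to the quoted 156 mW, and the clocking path, rather than the 4:1 MUXs, is the largest single contributor.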
Fig. 4.20 gives the measured eye-diagrams under different conditions. Fig. 4.20(a) depicts the over-equalized eye-diagram when operating at 5 Gb/s, where the four sub-levels are contributed by the four FFE taps. Fig. 4.20(b) and (c) present the output eye-diagrams at 40 Gb/s before and after applying the 4-tap FFE. The comparison shows that the FFE significantly improves the inner eye opening. Specifically, the eye height and eye width are improved from 140 mV and 0.45 UI to 180 mV and 0.68 UI, respectively. Meanwhile, the thickness of the eyelid is dramatically reduced from around 330 mV to 140 mV. Fig. 4.20(d) displays the properly compensated eye-diagram at the maximum operation speed of 50 Gb/s. Its eye height and eye width are 50 mV and 0.38 UI, respectively. Clearly, a wide operation range from 5 Gb/s to 50 Gb/s is achieved, which is mainly attributed to the multi-MUX-based FFE implementation. Fig. 4.21 further illustrates the transmitter output with four separate eyes. The horizontal eye widths for both the fixed clock pattern and the PRBS pattern are almost identical, indicating that the four sampling phases are properly aligned.
Figure 4.21: Measured output eye-diagrams with four separate eyes. (a) Clock pattern and (b) PRBS pattern.
Table 4.1 compares the measurement results of our transmitter chip with other transmitters operating at similar data rates. The results indicate that this transmitter chip achieves a wider operation range, lower jitter, and better power efficiency than the others. These advantages are mainly owing to the proposed high-speed 4:1 MUX and the compact interleaved-latching scheme. The comparison also shows that the area of our transmitter is much larger than that developed in [99]. This is mainly due to the following two reasons. Firstly, the area of our transmitter refers to the whole chip including the core circuits, decoupling transistors, and input/output pads, while the area in [99] only includes the core circuits. Secondly, the transmitter in [99] is designed based on
Chapter 5
The main task of the receiver (RX) is to extract the originally transmitted data from
the received signal using appropriate equalization and clock data recovery (CDR) tech-
niques [69, 61, 70, 10]. This chapter presents a quarter-rate receiver operating at 40
Gb/s. It employs a two-stage continuous-time linear equalizer (CTLE) as the analog
front-end and integrates an improved CDR to extract the sampling clocks and retime
the incoming data. To automatically balance jitter tracking and jitter suppression, passive low-pass filters (LPFs) with adaptively adjusted bandwidth are introduced into the data-sampling path, where the control code of the bandwidth is truncated from the frequency code generated by the integral path of the digital LPF within the CDR loop. To improve the linearity of the phase interpolation, a time-averaging-based compensating phase interpolator (PI) is proposed, which significantly reduces the differential nonlinearity (DNL) and integral nonlinearity (INL) of the phase interpolation, thus improving the phase-step and phase-spacing uniformity of the sampling clocks.
In the remainder of this chapter, we first discuss the design considerations of the receiver, and then present the overall receiver chip and illustrate its main features. After that, the architecture-level improvement of the CDR loop and the linearity-optimized compensating PI are discussed in detail in the following two sections. Finally, the experimental results are presented and discussed.
Chapter 5. The Receiver Design
Receiver sensitivity is the minimum differential voltage level at which the receiver can correctly distinguish between a “0” and a “1”. It is a function of the input-referred noise, offset, minimum latch resolution, and bit error rate (BER) requirement. It can be calculated by
$$V_{spp} = 2V_{nrms}\sqrt{SNR} + V_{min} + V_{offset}, \tag{5.1}$$
where $V_{spp}$ is the receiver sensitivity, $V_{nrms}$ denotes the equivalent input random noise, $SNR$ represents the signal-to-noise ratio, $V_{min}$ stands for the minimum latch resolution, and $V_{offset}$ refers to the equivalent input offset. $V_{nrms}$ usually comes from matching impedances, input amplifiers, and data slicers. The $SNR$ is determined by the BER requirement, e.g., $\sqrt{SNR} = 7$ for a BER of $10^{-12}$. $V_{min}$ stems from the hysteresis, finite regeneration gain, and bounded noise sources. Typically, its value is smaller than 5 mV. $V_{offset}$ is caused by circuit mismatches; it is primarily a strong function of the $V_{th}$ mismatch and a weak function of the electron-mobility mismatch. Although enlarging the device area (by 4×) reduces the input offset (to 1/2×), this is not feasible in practical designs due to the excessive area occupation and power consumption. Instead, offset correction circuitry is usually employed to reduce the input offset from a potentially large uncorrected value (>50 mV) to near 1 mV.
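To make the relative weight of the terms in Eq. 5.1 concrete, the following sketch evaluates the sensitivity for one set of inputs. The 1 mV rms input noise is an assumed value for illustration only; the remaining inputs follow the figures quoted above (a factor of 7 for a BER of 10^-12, a 5 mV latch resolution bound, and a 1 mV corrected offset). The helper names are hypothetical.

```python
import math

def receiver_sensitivity_mv(v_nrms_mv, sqrt_snr, v_min_mv, v_offset_mv):
    """Evaluate Eq. 5.1: Vspp = 2*Vnrms*sqrt(SNR) + Vmin + Voffset (all in mV)."""
    return 2.0 * v_nrms_mv * sqrt_snr + v_min_mv + v_offset_mv

def offset_after_upsizing(offset_mv, area_ratio):
    """Scale an offset by 1/sqrt(area ratio), the trend stated in the text (4x area, 1/2 offset)."""
    return offset_mv / math.sqrt(area_ratio)

# Assumed input noise of 1 mV rms; other values follow the text.
sensitivity = receiver_sensitivity_mv(v_nrms_mv=1.0, sqrt_snr=7.0, v_min_mv=5.0, v_offset_mv=1.0)
print(f"sensitivity = {sensitivity:.1f} mV")
print(f"50 mV offset after 4x area = {offset_after_upsizing(50.0, 4.0):.1f} mV")
```

Under these assumptions the noise term dominates the budget, and upsizing alone would still leave a large residual offset, which is consistent with the preference for offset correction circuitry.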
The CDR bandwidth is one of the most important parameters in CDR design, as it involves a tradeoff among jitter tracking, jitter suppression, and jitter tolerance. A narrow bandwidth provides strong input-jitter suppression and helps to reduce jitter peaking, while a wide bandwidth enhances jitter tracking and jitter tolerance. To suppress jitter amplification and accumulation in long-haul telecommunication systems, a narrow bandwidth is usually specified (e.g., 120 kHz for optical carrier (OC)-192 in synchronous optical network (SONET) [5]). To improve the jitter tracking ability in chip-to-chip connections, a relatively wide bandwidth is frequently utilized (e.g., 10 MHz for 32G Fibre Channel (FC) [160]). A wide CDR bandwidth also helps to suppress the VCO phase noise, thus reducing the jitter of the sampling clocks (i.e., optimizing the jitter generation), which ultimately helps to lower the link BER. Historically, the CDR bandwidth in many SerDes protocols such as peripheral component interconnect express (PCIe), InfiniBand, FC, and common electrical interface (CEI) has grown linearly with the data rate and is usually defined as 1/1667 or 1/2500 of the data rate.
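Applied to the 40 Gb/s target of this chapter, the two scaling rules above give the following bandwidths. This is simple arithmetic on the ratios stated in the text; the function name is illustrative.

```python
def cdr_bandwidth_mhz(data_rate_gbps, divisor):
    """Return the CDR bandwidth in MHz for a data rate in Gb/s and a protocol divisor."""
    return data_rate_gbps * 1e3 / divisor

for divisor in (1667, 2500):
    print(f"40 Gb/s / {divisor}: {cdr_bandwidth_mhz(40, divisor):.1f} MHz")
```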
As the data rate approaches the process limit, the short unit interval (UI) significantly compresses the jitter budget for the CDR at the RX side. This means there is an even smaller margin left for sampling position deviation, clock dithering, random and/or deterministic jitter, duty cycle distortion, and spacing errors among different phases [23], thus setting higher standards for low-frequency jitter tracking, high-frequency jitter suppression, recovered clock jitter generation, sampling clock duty cycle precision, and phase-spacing accuracy. These requirements pose significant challenges in designing a high-performance CDR [25, 23, 123], mainly for the following reasons. Firstly, the tightly coupled jitter tolerance (JTOL) and jitter transfer (JTRAN) parameters make it difficult to design a low JTRAN bandwidth to suppress the incoming jitter. Secondly, the limit-cycle dithering caused by steady-state oscillation contributes a substantial amount of deterministic jitter. Thirdly, the inevitable loop latency along with the data-rate-proportional CDR bandwidth may degrade the system phase margin.
[Figure 5.1: block diagram of the receiver chip. Labeled blocks: T-coil and termination, CTLE, data/edge samplers, DEMUX, BBPD, digital filter, multiple PIs with IDACs, clock conditioner, DIV2, EDC-SZF algorithm with DACs (to the TX-side FFE), and test MUX/drivers.]
Fig. 5.1 describes the block diagram of the receiver chip. It consists of a two-stage CTLE, a quarter-rate CDR, a feed-forward equalizer (FFE) adaptation unit, and some testing circuits for the recovered data and clock measurements. The received signal is first equalized by the CTLE and then sliced by eight quarter-rate data and edge samplers, where the sampling clocks are generated by two quarter-rate compensating PIs and the sampling positions are adjusted by a digital CDR using bang-bang phase detectors (BBPDs). To support the high operation speed, the samplers, PIs, and clock/data buffers are implemented in current-mode logic (CML) [10]. To relax the timing constraints, a quarter-rate sampling scheme using multiple PIs is used to extend the slicer regeneration time. The channel loss is compensated for by the TX-FFE and RX-CTLE, where the TX-FFE is adaptively adjusted by the proposed edge-data correlation-based sign zero-forcing (EDC-SZF) algorithm (refer to Section 6.1) while the
[Figure 5.2: conventional architecture of the BBPD-based CDR.]
There are two main features in this receiver chip. One is the improved CDR architecture, where passive LPFs with adaptively adjusted bandwidth are introduced into the data-sampling path to automatically balance jitter tracking and jitter suppression for data decisions. In doing so, the JTRAN bandwidth can be adjusted separately without affecting the JTOL bandwidth. The other is the proposed compensating PI, which not only improves the phase-step uniformity but also reduces the phase-spacing drift between the edge-sampling and data-sampling clocks.
Fig. 5.2 displays the conventional architecture of the BBPD-based CDR. Due to the
nonlinear behaviour and inevitable loop delay, the phase code applied to the PI usually
127
Chapter 5. The Receiver Design
16
BBPD
16 Voter
64 Steps
DEMUX DEMUX Loop Filter
4:16 4:16 KI KP
4 4 Freq. Phase
RX_N +
Data Samplers Edge Samplers Integ. Integ.
RX_P X4 X4 +
PHA PHB
ABS
<8:0> <8:0>
CK135
CK225
CK315
CK180
CK270
CK45
CK90
CK0
Limiter
IDAC
IDAC
CLK0
DF<2:0>
CLK180
Compensating Compensating
CLK90
CLK270
PI1 PI2 4 4
8 8 8
Current Mirrors
Fig. 5.3 shows the block diagram of the improved CDR. It employs separate PI1 and PI2 to produce the two sets of 45°-spaced clocks for the data sampling and edge sampling, where passive LPFs are introduced into the clock branch for the data sampling to provide extra jitter suppression on the data-sampling clocks. The bandwidth of these LPFs is adaptively adjusted by the same DF<2:0>, which is the absolute value of the truncated frequency code generated from the integral path of the digital loop filter. In this design, the minimum bandwidth of the LPFs is about 4 MHz while the maximum is around 50 MHz. In particular, a limiter is utilized to set DF<2:0> to its maximum value when the frequency code becomes too large. In principle, a large frequency code indicates continuous phase slewing to track accumulated jitter. Thus, a wide bandwidth is chosen to improve the jitter tracking ability. Conversely, a small frequency code implies that there is little trackable jitter. Accordingly, a narrow bandwidth is selected to suppress the high-frequency jitter.
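The truncate, absolute-value, and limit steps described above can be summarized in a small behavioural sketch. The number of truncated bits and the mapping from DF<2:0> to the LPF bandwidth are not specified in this section, so `TRUNCATE_SHIFT` and the linear mapping below are assumptions made purely for illustration.

```python
DF_MAX = 7            # DF<2:0> is a 3-bit code
TRUNCATE_SHIFT = 4    # hypothetical number of LSBs dropped from the frequency code
BW_MIN_MHZ, BW_MAX_MHZ = 4.0, 50.0  # LPF bandwidth range quoted in the text

def bandwidth_code(freq_code, shift=TRUNCATE_SHIFT):
    """Map a signed frequency code to DF<2:0>: truncate, take the magnitude, then limit."""
    truncated = freq_code >> shift      # arithmetic shift, like dropping LSBs in two's complement
    return min(abs(truncated), DF_MAX)  # the limiter saturates large codes at the maximum

def lpf_bandwidth_mhz(df):
    """Assumed linear mapping from DF<2:0> to LPF bandwidth (the real mapping is not given)."""
    return BW_MIN_MHZ + (BW_MAX_MHZ - BW_MIN_MHZ) * df / DF_MAX

for code in (0, 48, -1000):
    df = bandwidth_code(code)
    print(f"frequency code {code:6d} -> DF = {df}, bandwidth ~ {lpf_bandwidth_mhz(df):.1f} MHz")
```

A near-zero frequency code selects the narrowest bandwidth for jitter suppression, whereas a large code of either sign saturates DF<2:0> and selects the widest bandwidth for jitter tracking.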
For the implementation, 16 BBPDs associated with a majority voter are adopted to generate a 5-bit signed phase error, which is filtered by a digital loop filter consisting of a proportional path and an integral path to produce a 14-bit output. Here, the top 9 bits are applied to a 12-bit phase integrator whose output is then truncated to form the phase code PHA<8:0>, to which 64 steps (half of the phase steps in a quadrant) are circularly added to obtain PHB<8:0>. These phase codes are applied to two current digital-to-analog converters (IDACs) to produce 8 paths of weighted currents that are fed into a current mirror array consisting of 8 identical slots. As shown in Fig. 5.3, each slot generates two branches of current: one is directly mirrored for the edge-sampling PI2, while the other is mirrored through an LPF for the data-sampling PI1.
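The relationship between the two phase codes can be sketched as follows. The sketch assumes that the 9-bit code spans one full clock period (512 steps, i.e., 128 steps per quadrant) and that PHA<8:0> is taken as the top 9 bits of the 12-bit integrator; the function name is illustrative.

```python
CODE_BITS = 9
STEPS = 1 << CODE_BITS          # 512 phase steps per clock period
LSB_DEG = 360.0 / STEPS         # phase step size in degrees
HALF_QUADRANT = STEPS // 8      # 64 steps, half of the 128 steps in a quadrant

def phase_codes(integrator_value):
    """Derive PHA<8:0> and PHB<8:0> from a 12-bit phase integrator value."""
    pha = (integrator_value & 0xFFF) >> 3   # keep the top 9 of 12 bits
    phb = (pha + HALF_QUADRANT) % STEPS     # circular addition of 64 steps
    return pha, phb

print(f"1 LSB = {LSB_DEG} degrees, 64 steps = {HALF_QUADRANT * LSB_DEG} degrees")
print(phase_codes(500 << 3))  # a code near the top of the range wraps around
```

The circular addition keeps PHB<8:0> exactly 45° ahead of PHA<8:0> even when the code wraps from 511 back to 0.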
The working principle of the BBPD is illustrated in Fig. 5.4(a). Since the data sampling occurring at the center of the eye-diagram serves as a reference to judge whether the edge sampling is leading or lagging the input data transitions, there should be sufficient margin for the data sampling. Accordingly, the outputs of the data samplers show a fairly low sensitivity to phase errors in normally operating CDRs, which means that further jitter suppression on the data-sampling clocks has little effect on the loop parameters for jitter tracking. Leveraging this characteristic of the BBPD, we introduce LPFs into the data-sampling path to further filter the output jitter while keeping the loop parameters unchanged to satisfy the jitter tolerance specification. Fig. 5.4(b) presents the small-signal model of the modified CDR, where the LPF located outside of the feedback loop is able to provide additional jitter suppression for the data-sampling clocks [see Fig. 5.4(c)]. Therefore, the dithering jitter caused by the limit-cycle oscillation can be effectively attenuated. The noise sources are also depicted in Fig. 5.4(b), including the input noise ($S_{IN}$), the quantization noise ($S_{QBB}$) of the BBPD, truncation noise I ($S_{TF}$) due to the finite resolution of the integral path, truncation noise II ($S_{TD}$) due to the limited resolution of the IDAC, and the nonlinearity noise ($S_{PI1}$, $S_{PI2}$) of the PIs. Fig. 5.4(c) displays the transfer function characteristics for these noise sources. It can be seen that the introduced LPFs can dramatically attenuate the remaining band-frequency and high-frequency components from $S_{TF}$ and $S_{TD}$. The low-frequency components of $S_{IN}$, $S_{PI2}$, and $S_{QBB}$ can be further reduced by these LPFs when lower bandwidths are employed. Simultaneously, the potential jitter peak can be suppressed to alleviate the jitter amplification problem.
Figure 5.4: Functional view of the introduced LPFs. (a) Principle of the BBPD, (b) linearized CDR model, and (c) jitter transfer functions.
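The tracking-versus-suppression tradeoff of an LPF placed outside the loop can be illustrated with a first-order filter model. The filter order is an assumption (the text only states that the LPFs are passive), and the two test frequencies, 500 kHz for a slowly varying trackable component and 100 MHz for a high-frequency ripple, are chosen only for illustration.

```python
import math

def lpf_gain_db(f_hz, f_corner_hz):
    """Magnitude response of a first-order low-pass filter, in dB."""
    return -10.0 * math.log10(1.0 + (f_hz / f_corner_hz) ** 2)

for corner in (4e6, 50e6):  # minimum and maximum LPF bandwidths quoted in the text
    slow = lpf_gain_db(500e3, corner)   # assumed slowly varying (trackable) component
    fast = lpf_gain_db(100e6, corner)   # assumed high-frequency ripple component
    print(f"fc = {corner / 1e6:.0f} MHz: {slow:.2f} dB at 500 kHz, {fast:.2f} dB at 100 MHz")
```

Under this simplified model, the narrow setting attenuates high-frequency ripple far more strongly than the wide setting while leaving slow variations almost untouched in amplitude; the associated delay of the narrow setting is not captured by this magnitude-only sketch.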
The nonlinearity of phase interpolation can seriously degrade the overall performance of the CDR. Specifically, the differential nonlinearity (DNL) introduces a much larger phase jump than the ideal one, which is directly converted into recovered clock jitter. The integral nonlinearity (INL) can make the data-sampling clocks drift away from their optimal decision points in quarter-rate architectures using multiple PIs [23]. To reduce the PI nonlinearity, finely weighted current sources have been adopted in [115]. Unfortunately, the non-uniformity of the tail current sources gives rise to a fluctuating common-mode output, which may distort the phase-interpolated clocks through common-mode to differential-mode conversion. Moreover, its performance is also subject to the input waveform shape and fabrication mismatches. Another approach that is commonly adopted to improve the PI linearity is the octagonal PI [122], which needs eight 45°-spaced clock phases to perform the phase interpolation. Correspondingly, it requires a complex phase rotator and phase control circuits to generate the octagonal phase constellation. Note that even in the octagonal PI, some nonlinearity remains in theory. As a consequence, new techniques that can improve the linearity of the phase interpolation are still in high demand.
Figure 5.5: Proposed compensating PI. (a) Quarter-rate 45°-spaced clock generation, (b) in-phase I, Q clock generation for the data sampling, and (c) 45° phase-shifted I, Q clock generation for the edge sampling.
[Figure 5.6: schematic details of the quadrature PI and TA.]
Fig. 5.5 shows the conceptual block diagram of the compensating PI. It employs two conventional PIs (PIA and PIB) with phase codes (PHA<8:0> and PHB<8:0>) spaced by half of the quadrant steps to produce the two sets of 45°-spaced clocks (CKA0-90-180-270 and CKB45-135-225-315) [see Fig. 5.5(a)]. The two sets of 45°-spaced clocks are then applied to four time-averaging (TA) circuits [see Fig. 5.5(b) and (c)] to generate the final data-sampling and edge-sampling clocks. Specifically, the data-sampling clocks (CK0-90-180-270) are obtained by averaging CKA0-90-180-270 and CKB45-135-225-315, while the edge-sampling clocks (CK45-135-225-315) are obtained by averaging CKA90-180-270-0 and CKB45-135-225-315. Fig. 5.6 further displays the schematic details of the quadrature PI and TA, which are implemented in CML style. The simulation also shows that the additional PI and TAs in each compensating PI consume around 10 mW, which accounts for 50% of the power of the compensating PI.
Taking the sinusoidal waveform to approximate the input-clock wave shape, the
quadrature input clocks can be expressed by
where A and f are the amplitude and frequency of the input clock. Then the output of
the traditional PIA can be calculated by
$$\cdots = A_{PIA}\sin(2\pi f t + \theta_{PIA}), \tag{5.3}$$
$$\theta_{PIA} = \arctan\left(\frac{\alpha}{1-\alpha}\right), \tag{5.4}$$
$$A_{PIA} = \sqrt{\alpha^{2} + (1-\alpha)^{2}}\,A, \tag{5.5}$$
where α is the ratio of the current phase code to the total phase steps and its range always meets 0 ≤ α ≤ 1. Similarly, the equations for PIB can also be obtained, written as
where
$$\theta_{PIB} = \begin{cases} \arctan\left(\dfrac{\alpha + 1/2}{1/2 - \alpha}\right), & 0 \le \alpha \le 1/2, \\[2mm] \arctan\left(\dfrac{\alpha - 1/2}{3/2 - \alpha}\right), & 1/2 \le \alpha \le 1. \end{cases} \tag{5.7}$$
$$A_{PIB} = \begin{cases} \sqrt{(\alpha + 1/2)^{2} + (1/2 - \alpha)^{2}}\,A, & 0 \le \alpha \le 1/2, \\ \sqrt{(\alpha - 1/2)^{2} + (3/2 - \alpha)^{2}}\,A, & 1/2 \le \alpha \le 1. \end{cases} \tag{5.8}$$
[Figure 5.7: theoretical phase transfer curves of PIA, PIB, and the compensating PI; phase (degree) versus phase code (0 to 128), with points E and F marked and α = phase code / phase steps.]
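For reference, the form of Eqs. 5.3 to 5.5 follows from a weighted sum of two quadrature sinusoids. The short derivation below is a sketch under two assumptions that are not stated explicitly in this copy of the text: the quadrature inputs are written as a sine and a cosine of equal amplitude, and PIA applies the weights (1 − α) and α to them. The symbols CK_I, CK_Q, and CK_PIA are introduced here only for illustration.

```latex
% Assumed quadrature inputs and assumed interpolation weights (1-\alpha), \alpha.
\begin{align*}
CK_I &= A\sin(2\pi f t), \qquad CK_Q = A\cos(2\pi f t), \\
CK_{PIA} &= (1-\alpha)\,CK_I + \alpha\,CK_Q \\
         &= A\left[(1-\alpha)\sin(2\pi f t) + \alpha\cos(2\pi f t)\right] \\
         &= \sqrt{\alpha^{2} + (1-\alpha)^{2}}\,A\,
            \sin\!\left(2\pi f t + \arctan\frac{\alpha}{1-\alpha}\right).
\end{align*}
% The last step uses a\sin x + b\cos x = \sqrt{a^2+b^2}\,\sin(x + \arctan(b/a)) for a > 0.
```

Under these assumptions the amplitude and phase of the last line match Eq. 5.5 and Eq. 5.4, respectively. The same identity with the weights (1/2 − α) and (α + 1/2) reproduces the first branch of Eqs. 5.7 and 5.8.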
Previous studies [23, 103] have demonstrated that this 45°-spaced clock generation can be directly used in CDR designs. However, the nonlinearity of the traditional PI could significantly degrade the performance of the CDR. To gain more insight into this issue, the red dashed and blue dotted lines in Fig. 5.7 respectively present the phase transfer curves according to Eq. 5.4 and Eq. 5.7. Clearly, the phase transfer curves of the traditional PIA and PIB are S-shaped. When PIA rotates to point E and PIB rotates to point F, the phase-spacing error between them can reach a maximum of 8.1° (or 0.09 UI). For the designs directly using these phases as the sampling clocks [23, 103], since the edge-sampling clocks tightly track the edge transitions in the received data stream, any phase-spacing variation between the edge-sampling and data-sampling clocks could make the data-sampling clocks drift away from the expected decision point. As a result, the data decision margin is reduced, which directly degrades the CDR performance. Moreover, improving the PI resolution cannot mitigate this effect since finer step weights cannot change the shape of the phase transfer characteristics.
Figure 5.8: Simulation results of the phase compensating PI (1 LSB = 0.703125°). (a) Simulated phase transfer characteristics, (b) DNL performance, and (c) INL performance.
Referring to the time-averaging effect of the TA, the output phase of the compen-
The black solid line in Fig. 5.7 displays the phase transfer curve of the compensating
PI according to Eq. 5.9, which indicates that a more linear phase transfer curve with
negligible phase deviations smaller than 0.17◦ can be achieved. This is mainly because
of the compensating characteristics of the phase transfer curves of PIA and PIB. In
contrast to the theoretical analysis, the practical linearity could be degraded by the
transistors’ inherent nonlinearity and the nonideal input clock waveform. Fig. 5.8
shows the transistor-level simulation results of the compensating PI. It can be seen that
the maximum DNL and INL of the compensating PI can be significantly improved over
the traditional PI, where the INL can be controlled below 2.5 LSB (or 1.8◦ ), which is
only a quarter of that of the conventional PI.
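The theoretical behaviour discussed above can be reproduced numerically from Eq. 5.4 and Eq. 5.7. The sketch below rests on two assumptions that should be checked against the original equations: a 90° quadrant offset is added to the second branch of Eq. 5.7 so that the PIB curve is continuous, and the compensating PI output phase is modelled as the plain average of the PIA and PIB phases with the fixed 22.5° offset removed. All function names are illustrative.

```python
import math

def theta_pia(alpha):
    """Eq. 5.4 in degrees; atan2 avoids a division by zero at alpha = 1."""
    return math.degrees(math.atan2(alpha, 1.0 - alpha))

def theta_pib(alpha):
    """Eq. 5.7 in degrees, with an assumed +90 degree offset on the second branch."""
    if alpha <= 0.5:
        return math.degrees(math.atan2(alpha + 0.5, 0.5 - alpha))
    return 90.0 + math.degrees(math.atan2(alpha - 0.5, 1.5 - alpha))

def theta_comp(alpha):
    """Assumed averaging model: mean of both phases minus the fixed 22.5 degree offset."""
    return 0.5 * (theta_pia(alpha) + theta_pib(alpha)) - 22.5

STEPS = 128  # phase steps per quadrant for a 9-bit code (512 / 4)
alphas = [k / STEPS for k in range(STEPS + 1)]

# Deviation of the PIA-to-PIB spacing from the ideal 45 degrees.
max_spacing_error = max(abs(theta_pib(a) - theta_pia(a) - 45.0) for a in alphas)
# Deviation of the averaged phase from an ideal linear ramp over the quadrant.
max_comp_error = max(abs(theta_comp(a) - 90.0 * a) for a in alphas)

print(f"max spacing error between PIA and PIB: {max_spacing_error:.2f} degrees")
print(f"max deviation of the averaged phase:   {max_comp_error:.3f} degrees")
```

Under these assumptions the spacing error between the two conventional PIs peaks at roughly 8.1°, in line with the value quoted for points E and F, while the averaged phase stays within a small fraction of a degree of the ideal ramp because the two S-shaped curves bend in opposite directions.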
The receiver is designed on a Dell R730 server with two E5-2609V4 CPUs, 128 GB of memory, and an 8 TB hard disk. The schematic, layout, and simulation are completed with Schematic Composer, Virtuoso Layout, and Spectre/APS, respectively, all developed by Cadence (version IC5141). The layout verification and parasitic extraction, i.e., layout versus schematic (LVS), design rule check (DRC), and parasitic extraction (PEX), are carried out using Calibre 2013, developed by Mentor Graphics. To measure the fabricated prototype, an Anritsu MP1812A is used to generate the 40 Gb/s input data by combining four 10 Gb/s PRBS7 sequences, a Tektronix BSA286C is used to characterize the CDR performance, and a KEYSIGHT DSA-X 93204A with an 80 GS/s sampling rate and 32 GHz bandwidth is used to characterize the jitter performance of the output waveforms.
The prototype chip is designed and fabricated in a 65 nm process. Under a typical corner, the cut-off frequency (fT) of the NMOS transistor and the fan-out-of-4 inverter delay in this process are 200 GHz and 13 ps, respectively. This implies that the utilized 65 nm process provides enough bandwidth and timing margin for the targeted 40 Gb/s receiver design. Although more advanced processes with smaller minimum channel lengths, such as 45 nm, 32 nm, 22 nm, and 16 nm, can offer a higher fT and a shorter inverter delay, their high cost puts them beyond our reach. Fortunately, our receiver mainly focuses on the CDR architecture improvement and the high-linearity compensating PI implementation, which can still be verified in the economical and practical 65 nm CMOS process.
[Figure 5.9: layout view of the whole receiver chip (1.6 mm × 1.2 mm). Labeled blocks: terminals, CTLE, CDR, EDC-SZF, clock conditioner, clock driver, full-rate driver, and quarter-rate driver.]
Fig. 5.9 displays the layout view of the whole receiver chip. The Terminals, CTLE,
and CDR located at the top side of the chip are the core blocks of the receiver. They
are placed in a line to guarantee the layout symmetry and reduce the parasitic effect on
the high-speed signals. The clock conditioner is placed close to the PI to facilitate the
high-speed clock connection. The full-rate driver, clock driver, and quarter-rate driver
are placed at the bottom side to output the measurement signals.
Figure 5.10: Layout views of the (a) Terminals+CTLE and (b) CDR.
Figure 5.11: Layout views of the crucial blocks within the CDR. (a) Samplers, (b) compensating PI, and (c) digital loop filter.
Figure 5.12: Simulation setup of the CDR. A PRBS generator is used to produce the 40 Gb/s input data with 5 ps peak-to-peak random jitter, a clock generator is utilized to produce the 20 GHz input clock with a 1 UI amplitude sinusoidal jitter at 500 kHz, the output data refers to the input data at the samplers, the output clock is the recovered data-sampling clock, the output biasa represents the current mirror bias for the 0° phase before the LPF, and biasb stands for the current mirror bias for the 0° phase after the LPF.
Fig. 5.11 gives the layout views of the blocks in the CDR. For the samplers [see Fig. 5.11(a)], multi-layer inductors are used in the first latch to extend its bandwidth. For the compensating PI [see Fig. 5.11(b)], the inductors are removed to reduce the area. The digital loop filter [see Fig. 5.11(c)] is designed using the standard cells provided by the foundry.
Figure 5.13: Effect of the LPFs with (a) a 4 MHz bandwidth, (b) a 20 MHz bandwidth, (c) a 50 MHz bandwidth, and (d) an adaptively adjusted bandwidth. [Jitter values annotated in the panels: 7.54 ps, 4.04 ps, 2.66 ps, and 2.62 ps.]
ing jitter of the sampling clock [see Fig. 5.13(b) and (c)]. For the bandwidth fixed at
50 MHz, the dithering jitter of the data-sampling clock (2.66 ps) becomes smaller than
that of the edge-sampling clock (3.04 ps). This implies that the jitter optimization
contributed by the bias-ripple suppression overwhelms the delay-caused phase shift.
Based on the above discussion, adopting a fixed bandwidth is inadvisable, since a low
bandwidth suffers from delay-caused phase shift while a high bandwidth provides limited
jitter suppression. Fig. 5.13(d) presents the simulation results when utilizing the proposed
bandwidth-adaptively-adjusting technique, where a low dithering jitter is achieved by
balancing the bias tracking and ripple suppression. The high-frequency ripple in the slow
input-jitter changing region [circled region in Fig. 5.13(d)] is effectively attenuated, while
the phase variations in the fast input-jitter changing region [surrounding region in
Fig. 5.13(d)] are tightly tracked.
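The bandwidth trade-off discussed here can be illustrated with a first-order RC low-pass model. This is only a sketch under stated assumptions: the thesis describes the LPFs as passive, but the first-order response, the 500 MHz ripple frequency, and the function names below are illustrative choices rather than details of the actual design.

```python
import math

def lpf_gain_db(f_hz, fc_hz):
    """Magnitude response of a first-order low-pass filter in dB."""
    return -10.0 * math.log10(1.0 + (f_hz / fc_hz) ** 2)

def lpf_delay_ns(f_hz, fc_hz):
    """Phase delay (phase lag divided by angular frequency) in ns."""
    phase_lag = math.atan(f_hz / fc_hz)
    return phase_lag / (2.0 * math.pi * f_hz) * 1e9

SJ_FREQ = 500e3      # sinusoidal jitter frequency used in the simulation setup
RIPPLE_FREQ = 500e6  # assumed bias-ripple frequency (hypothetical value)

for fc in (4e6, 20e6, 50e6):
    delay = lpf_delay_ns(SJ_FREQ, fc)
    gain = lpf_gain_db(RIPPLE_FREQ, fc)
    print(f"fc = {fc / 1e6:4.0f} MHz: delay of 500 kHz bias variation = {delay:6.2f} ns, "
          f"ripple gain = {gain:6.1f} dB")
```

Under this simplified model, the lowest cut-off gives the strongest ripple attenuation but also the longest delay of the slow bias variation, which is the conflict that an adaptively adjusted bandwidth is meant to resolve.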
[Figure residue (caption not recovered): delay (ps) versus bandwidth control code (decimal) for
a sinusoidal jitter of 1 UI amplitude at 500 kHz; labels: edge-sampling clock without LPFs,
jitter tracking, jitter suppression, large bandwidth, small bandwidth.]
[Figure 5.15 panels: voltage (mV) versus time (µs) and clock amplitude (V) versus time (s) for
PRBS7, PRBS15, PRBS23, and PRBS31; each panel is annotated "tightly tracking" and "significant
attenuation"; annotated jitter values for panels (a) to (d): 2.62 ps, 2.74 ps, 3.13 ps, and 3.30 ps.]
Figure 5.15: Effect of different input patterns on jitter attenuation. (a) PRBS7, (b)
PRBS15, (c) PRBS23, and (d) PRBS31.
To demonstrate the jitter suppression effect with different patterns, we have repeated
the simulations with the adaptively-adjusted bandwidth using the setup shown in Fig. 5.12.
As depicted in Fig. 5.15, when the input pattern changes from PRBS7 to PRBS15, PRBS23, and
PRBS31, the jitter performance of the recovered clock becomes slightly worse. This is because
the increased run-length of “1s” or “0s” extends the wandering time of the CDR loop, thus
causing a larger amplitude of steady-state oscillation and hence increasing the deterministic
jitter. Additionally, the high-frequency jitter suppression effect becomes more prominent as
the maximum run-length of the input pattern increases (see the voltage ripple attenuation in
Fig. 5.15).
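The dependence of the maximum run-length on the PRBS order can be checked with a small linear-feedback shift register model. The sketch below assumes the commonly used generator polynomials x^7 + x^6 + 1 and x^15 + x^14 + 1; the thesis does not state which polynomials its PRBS generator uses, and the function names are illustrative. PRBS23 and PRBS31 are omitted only because a full period is too long for a quick script.

```python
def prbs_bits(order, tap, count, seed=1):
    """Generate `count` bits from a Fibonacci LFSR with feedback taps (order, tap)."""
    state = seed  # any non-zero seed works
    mask = (1 << order) - 1
    bits = []
    for _ in range(count):
        new_bit = ((state >> (order - 1)) ^ (state >> (tap - 1))) & 1
        state = ((state << 1) | new_bit) & mask
        bits.append(new_bit)
    return bits

def max_run_length(bits):
    """Length of the longest run of identical consecutive bits."""
    longest = current = 1
    for prev, cur in zip(bits, bits[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest

# Assumed feedback taps: PRBS7 -> (7, 6), PRBS15 -> (15, 14).
for name, order, tap in (("PRBS7", 7, 6), ("PRBS15", 15, 14)):
    period = (1 << order) - 1
    bits = prbs_bits(order, tap, 2 * period)  # two periods cover wrap-around runs
    print(name, "period", period, "max run length", max_run_length(bits))
```

For a maximal-length sequence of order n, the longest run of identical bits is n, so a higher-order pattern leaves the CDR without transitions for longer.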
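[Figure 5.16 content: (a) chip micrograph of 1.6 mm × 1.2 mm with the terminals, CTLE, CDR,
EDC-SZF, clock conditioner, CLK driver, full-rate driver, and quarter-rate driver labelled;
(b) power breakdown among the CTLE, CDR, DEMUX, and clock conditioner with slices of 30 mW,
36 mW, and 159 mW, for a total power of 225 mW.]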
Figure 5.16: (a) Chip micrograph and (b) power breakdown of the receiver.
The prototype receiver chip is fabricated in a 65-nm CMOS process. Fig. 5.16 illustrates
its micrograph and power breakdown when applying a 1.2 V supply and operating at 40 Gb/s.
The receiver chip occupies 1.92 mm² (including the testing circuits) and dissipates 225 mW
(excluding the testing circuits). The fabricated chip is
Jitter values annotated in the panels of Fig. 5.17:
(a) RJ: 260 fs, DJ: 3.63 ps, TJ: 7.31 ps
(b) RJ: 450 fs, DJ: 6.38 ps, TJ: 12.73 ps
(c) RJ: 490 fs, DJ: 4.45 ps, TJ: 11.48 ps
(d) RJ: 450 fs, DJ: 1.18 ps, TJ: 7.66 ps
Figure 5.17: Measured eye-diagrams for (a) input data at 40 Gb/s, (b) recovered data at
10 Gb/s, (c) recovered edge-sampling clock without LPFs at 5 GHz, and (d) recovered
data-sampling clock with LPFs at 5 GHz.
mounted on a printed circuit board (PCB) through wire-bonding. The receiver inputs
and outputs are connected to the instruments through double-bonding wires, PCB
traces, and connection cables.
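As a quick consistency check of the figures quoted for the prototype, the arithmetic below recomputes the chip area and total power from the values shown in Fig. 5.16 and derives an energy-per-bit number. The energy-per-bit figure is a derived quantity that is not quoted in the text, and the variable names are illustrative.

```python
# Values reported for the receiver prototype in the text and in Fig. 5.16.
width_mm, height_mm = 1.6, 1.2
block_power_mw = [30, 36, 159]   # slices of the power breakdown
data_rate_gbps = 40

area_mm2 = width_mm * height_mm
total_power_mw = sum(block_power_mw)
# mW divided by Gb/s gives pJ/bit (1e-3 W / 1e9 bit/s = 1e-12 J/bit).
energy_pj_per_bit = total_power_mw / data_rate_gbps

print(f"area = {area_mm2:.2f} mm^2")
print(f"total power = {total_power_mw} mW")
print(f"energy per bit = {energy_pj_per_bit:.3f} pJ/bit")
```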
The receiver standalone measurement results are presented in this part. Fig. 5.17(a)
shows the eye-diagram of the 40 Gb/s input data, where the single-ended eye height and
eye width are around 410 mV and 0.71 UI. Fig. 5.17(b) presents the eye-diagram of
the 10 Gb/s recovered data with a total jitter of 12.73 ps. The eye-diagrams of the
recovered clocks (divided by 2) for edge sampling and data sampling are shown
in Fig. 5.17(c) and (d), which reveal that the introduced LPFs reduce the total
jitter from 11.48 ps to 7.66 ps. To demonstrate the effect of the LPFs with
adaptively-adjusting bandwidth, the JTRAN and JTOL curves are measured using a Tektronix
BSA286C with a CDR block. The input peak-to-peak swing is tuned to 800 mV and
the control voltage of the CTLE is manually set to 710 mV. The JTRAN curves in Fig.
[Figure 5.18 residue (caption not recovered): JTRAN (dB) versus jitter frequency (MHz) at an
input jitter amplitude of 0.20 UI for the edge-sampling clock (without LPFs, bandwidth 18 MHz)
and the data-sampling clock (with LPFs, bandwidth 4 MHz); JTOL (UIpp) versus frequency (MHz),
annotated "improved by introduced LPFs".]
5.18 illustrate that the bandwidth of the data-sampling path, which is set by the LPFs, is
4 MHz, much smaller than the 18 MHz of the edge-sampling path, which is determined
by the loop parameters. The measured JTOL in Fig. 5.18 indicates that the embedded
LPFs significantly attenuate the dip around the corner frequency and noticeably improve
the JTOL amplitudes at high jitter frequencies. Meanwhile, the adaptively-adjusting
bandwidth of the LPFs means that they have little effect on the phase-tracking
slew rate at low jitter frequencies. Additionally, the corner frequency of the JTOL is
about 20 MHz, which is much larger than the JTRAN bandwidth of 4 MHz.
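The RJ, DJ, and TJ values annotated in Fig. 5.17 can be cross-checked with the widely used dual-Dirac jitter model. The sketch below assumes that the annotated RJ is an rms value, that TJ is specified at a BER of 1e-12, and that the instrument follows the plain dual-Dirac relation; none of this is stated in the text, so small deviations from the measured numbers are expected.

```python
from statistics import NormalDist

def q_scale(ber):
    """One-sided Gaussian quantile for a target BER (about 7.03 at 1e-12)."""
    return -NormalDist().inv_cdf(ber)

def total_jitter_ps(rj_rms_ps, dj_ps, ber=1e-12):
    """Dual-Dirac estimate: TJ = DJ + 2 * Q(BER) * RJ_rms."""
    return dj_ps + 2.0 * q_scale(ber) * rj_rms_ps

# (RJ, DJ, measured TJ) in ps, as annotated in Fig. 5.17(a)-(d).
measured = {
    "(a) 40 Gb/s input data": (0.26, 3.63, 7.31),
    "(b) 10 Gb/s recovered data": (0.45, 6.38, 12.73),
    "(c) edge-sampling clock": (0.49, 4.45, 11.48),
    "(d) data-sampling clock": (0.45, 1.18, 7.66),
}

for label, (rj, dj, tj) in measured.items():
    estimate = total_jitter_ps(rj, dj)
    print(f"{label}: estimated TJ = {estimate:.2f} ps, measured TJ = {tj:.2f} ps")
```

Under these assumptions the estimates land within a fraction of a picosecond of the measured totals, and the LPF benefit in panel (d) shows up mainly as a smaller deterministic component.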
Table 5.1 compares the performance of our receiver with previous studies. The maximum
tolerable amplitude of sinusoidal jitter at high frequency (0.41 UI at 100 MHz) outperforms
the other two designs, mainly because of the introduced LPFs and the developed compensating
PI. This parameter is important because it directly indicates that our receiver can relax the
timing budget of the link and hence improve the communication BER. Meanwhile,
it is worth noting that the area and power consumption of our receiver
are larger than those of the design presented in [123], for two main reasons.
First, the area and power consumption in [123]
are measured on the core circuits only, while these two parameters in our design
are measured on the whole chip, including the core circuits, testing circuits,
decoupling transistors, and connection pads. Second, the receiver in [123]
is implemented in a 22 nm process, which naturally offers
smaller area and lower power consumption. If our receiver were also implemented in
such an advanced process, it should be able to operate at a higher
data rate with a smaller area and a lower power consumption than the version
implemented in the 65 nm process.
This chapter presents a 40 Gb/s receiver with excellent performance in both jitter
suppression and jitter tracking, where a compensating PI is designed to alleviate the
issues of non-uniform phase steps and I, Q phase-spacing drift. Moreover, the introduced
bandwidth-adaptively-adjusted LPFs provide additional high-frequency
jitter attenuation for the data-sampling clocks, while leaving the edge-sampling clocks
unfiltered to maintain a high jitter tracking capability.
Chapter 6
To overcome the channel loss within stringent power and area budgets, a sophisticated
equalization design is required that compensates for the channel loss while limiting the
power and area overheads. Based on the transmitter (TX) and receiver (RX) chips designed in
the previous two chapters, this chapter constructs a chip-to-chip connection where the output
of the transmitter chip and the input of the receiver chip are DC-connected over a 12-cm
printed circuit board (PCB) trace. A combined TX-side feed-forward equalizer (FFE) and
RX-side continuous-time linear equalizer (CTLE) is adopted to compensate for the channel
loss. The control voltage of the RX-CTLE is manually calibrated while the tap weights of the
TX-FFE are automatically adjusted by a newly developed edge-data correlation-based sign
zero-forcing (EDC-SZF) adaptation engine located at the RX-side.
In the rest of this chapter, Section 6.1 illustrates the equalization scheme employed
in the serial link. The proposed EDC-SZF adaptation is presented in Section 6.2, which
begins by summarizing the drawbacks of previous adaptation techniques and then
presents the update iteration and the derivation of the proposed EDC-SZF algorithm.
Finally, Section 6.3 gives the link setup and experimental results.
Chapter 6. Overall Serial Link and Adaptive Equalization
[Figure 6.1 block labels: TX data I(k); UI delays (Z^-1) with taps α0, α1, α2; combined
channel; CTLE; CDR producing RX data and RX edge; deserializer; EDC-SZF; range limiters;
6-bit DACs; adaptively adjusted α−1, α1, α2.]
Figure 6.1: Implemented equalization scheme with the proposed EDC-SZF algorithm.
Here, TX-FFE and RX-CTLE are employed to compensate for the channel loss, the
control voltage of the RX-CTLE (VCTLE) is manually calibrated while the tap weights
(α−1 , α1 , α2 ) of the TX-FFE are adaptively adjusted by the proposed EDC-SZF.
Fig. 6.1 describes the block diagram of the serial link along with the equalization
scheme, where the output of the transmitter chip is directly connected to the receiver
chip through a channel. It employs a TX-FFE and an RX-CTLE to compensate for the
channel loss. The decision feedback equalizer (DFE) is ruled out here, mainly because
of its operation speed limitation, complicated implementation, and significant power
consumption [162, 101]. These overheads generally result from the increased number
of data samplers within the DFE [34, 25]. The RX-CTLE is manually calibrated while
the tap weights of the TX-FFE are adaptively adjusted by an EDC-SZF algorithm at
the RX-side. The digital tap weights generated by the EDC-SZF engine are first
constrained by three range limiters and then applied to three 6-bit digital-to-analog
converters (DACs) to produce the bias voltages for the TX-FFE taps. These bias voltages
are transferred to the transmitter through PCB traces. To save output pins,
the DACs in practical implementations are located at the TX-side and the controlling
tap-weight codes are sent through the communication channel under the control of the
status state machine in the media access control (MAC) layer [163]. In our prototype,
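[Figure 6.2 labels: (a) 4-tap combiner with tap weights α−1, α0, α1, α2 driven by Dpre, Dmain,
Dpst1, and Dpst2, with 400 pH inductors and 50 ohm loads at the outputs OP and ON; (b), (c)
eye-diagrams, amplitude (mV).]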
Figure 6.2: TX-FFE. (a) Schematic details, (b) simulated output eye-diagram at 10
Gb/s, and (c) simulated output eye-diagram at 40 Gb/s.
due to the lack of the MAC layer, the DACs are located at the RX-side and the bias
voltages are transferred to the transmitter through PCB traces.
Fig. 6.2(a) shows the schematic details of the TX-FFE. It is realized by a 4-tap
current-mode logic (CML) combiner, where the tap weights are adjusted by changing
the bias voltages of the current sources. Fig. 6.2(b) and (c) display the simulated
near-end eye-diagrams at 10 Gb/s and 40 Gb/s when applying -3 dB, -6 dB, and -3 dB
equalization coefficients to the pre, post1, and post2 cursors in the FFE. The
circuit implementation of the RX-CTLE and its frequency responses with different
control voltages are presented in Fig. 6.3. To optimize the equalization configuration,
[Figure 6.3 labels: (a) CTLE schematic with 320 pH inductors, 65 ohm loads, inputs IP/IN,
outputs OP/ON, control voltage VCTLE, and tail currents ISS and ISS/2; (b) gain (dB) versus
frequency (Hz) for VCTLE = 900 mV, 800 mV, 700 mV, and 600 mV.]
Figure 6.3: RX-CTLE. (a) Schematic details and (b) frequency responses for different
control voltages.
the control voltage of the RX-CTLE is manually adjusted while the tap weights of
the TX-FFE are adaptively adjusted by a low-cost EDC-SZF adaptation engine. In
the remainder of this section, we will focus on the design of the proposed EDC-SZF
algorithm.
SZF)
where α_l(k) is the instantaneous weight of tap l, sign[e(k)] represents the sign of the edge
sampling error, D(k) denotes the recovered data, and λ stands for the scale factor controlling
the adjustment rate, whose value is usually much smaller than 1. The sign of the
edge sampling error sign[e(k)] caused by the inter-symbol interference (ISI) is directly
mapped from the quantized edge sequence E(k), and it is correlated with the data bit
D(k − l) to produce the product sign[e(k)] · D(k − l). The result is then integrated to
update the FFE tap weight α_l(k).
The main feature of this approach is that it only involves the existing quantized edge
sequence E(k) and recovered data sequence D(k). As a result, the auxiliary
circuits such as samplers, ADCs, and PIs that are essential in previous adaptive equalizations
[129, 130, 131, 105, 132, 133] are removed, which offers greater potential in operation speed
and cost effectiveness.
When dealing with band-limited channels that result in ISI, it is convenient to develop
an equivalent discrete-time model for the continuous-time system. The reason
is that the transmitter sends discrete-time symbols with a period of T and the output
at the receiver side is also a discrete-time signal with samples of the same period. Fig.
6.4 presents the UI-width pulse response of a typical dispersive channel, where h_k and
h_{k+0.5} denote the ISI tail values at data-sampling and edge-sampling points, respectively.
According to the signal processing principles, the received discrete-time signal
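[Figure 6.4 residue (caption not recovered): a single input bit applied to the channel and the
resulting output pulse response, with the data-sampling values h−1, h0, h1, h2, h3 and the
edge-sampling values h−1.5, h−0.5, h0.5, h1.5, h2.5, h3.5 marked.]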
q_k can be computed by the convolution of the input data sequence I_k and channel pulse
response h_k,
$$q_k = \sum_i I_i \, h_{k-i}. \qquad (6.2)$$
For a normally operating serial link where the data-sampling clock is always located at
the center of the eye-diagram, q_k and q_{k+0.5} can be considered as the sampled analog
values before binary quantization. After the decision latches, q_k is quantized to the
data sequence D_k, while q_{k+0.5} is quantized to the edge sequence E_k. Applying the
cross-correlation function to the edge-sampled sequence q_{k+0.5} and the recovered data
sequence D_k, we get
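$$
\begin{aligned}
R_{e,d}(n) &= \sum_j q_{j+0.5} \, D_{j-n} = \sum_j q_{j+0.5} \, I_{j-n} \\
&= \sum_j \Big( \sum_i I_i \, I_{j-n} \, h_{j+0.5-i} \Big) \\
&= \sum_j \Big( I_{j-n} I_{j-n} \, h_{j+0.5-(j-n)} + \sum_{i \neq j-n} I_i \, I_{j-n} \, h_{j+0.5-i} \Big) \\
&= \sum_j h_{n+0.5} + \sum_j \sum_{i \neq j-n} I_i \, I_{j-n} \, h_{j+0.5-i}.
\end{aligned}
\qquad (6.3)
$$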
Here, D_k is replaced by the input sequence I_k because the bit-error-rate (BER) is usually
quite low (< 10^{-12}) for properly operating serial links. Letting m = j − i, we have
$$
\begin{aligned}
\sum_j \sum_{i \neq j-n} I_i \, I_{j-n} \, h_{j+0.5-i}
&= \sum_j \sum_{m \neq n} I_{j-m} \, I_{j-n} \, h_{m+0.5} \\
&= \sum_{m \neq n} \sum_j I_{j-m} \, I_{j-n} \, h_{m+0.5} = 0.
\end{aligned}
\qquad (6.4)
$$
Note that the summation indices i and j traverse all the integers except for i = j − n,
thus m also ranges over all integers except for m = n. The final step
of Eq. (6.4) is obtained based on the fact that the time-shifted data sequences I_{j−m}
and I_{j−n} (m ≠ n) are independent of each other, since the transmitted data
streams in wireline systems are usually random sequences. Substituting Eq. (6.4) into
Eq. (6.3), the cross-correlation function is simplified to
$$R_{e,d}(n) = \sum_j h_{n+0.5}. \qquad (6.5)$$
Clearly, the normalized cross-correlation coefficient ρ_{e,d}(n) between the sequence q_{k+0.5}
and the recovered data sequence D_k, that is, R_{e,d}(n) divided by the number of accumulated
samples, exactly equals the residual ISI value with a time shift of (n + 0.5)T, as shown in
Fig. 6.4.
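The result that the normalized edge-data correlation recovers the edge-sampled pulse response can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the edge-sampled pulse response values in `H_EDGE` are hypothetical, the data are random ±1 symbols, and the transmitted sequence is used in place of the recovered data, as in the derivation.

```python
import random

# Hypothetical edge-sampled pulse response h_{m+0.5}, indexed by the integer m.
H_EDGE = {-2: 0.02, -1: 0.45, 0: 0.50, 1: 0.15, 2: 0.05}

def edge_samples(data):
    """q_{j+0.5} = sum_m h_{m+0.5} * I_{j-m}, evaluated where all taps are defined."""
    m_min, m_max = min(H_EDGE), max(H_EDGE)
    return {
        j: sum(h * data[j - m] for m, h in H_EDGE.items())
        for j in range(m_max, len(data) + m_min)
    }

def normalized_correlation(q, data, n):
    """rho(n): average over j of q_{j+0.5} * I_{j-n}."""
    terms = [q[j] * data[j - n] for j in q if 0 <= j - n < len(data)]
    return sum(terms) / len(terms)

random.seed(0)
data = [random.choice((-1, 1)) for _ in range(50_000)]
q = edge_samples(data)

for n in sorted(H_EDGE):
    rho = normalized_correlation(q, data, n)
    print(f"n = {n:+d}: rho = {rho:+.3f}, h_(n+0.5) = {H_EDGE[n]:+.3f}")
```

The estimates only approach the true values statistically, so a finite record leaves a small residual that shrinks as more symbols are accumulated.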
For a transmitter with an l-tap UI-spaced FFE, the pre-distorted output can be represented by
$$t(k) = \sum_l \alpha_l \, I(k-l), \qquad (6.7)$$
where I(k) is the transmitting sequence, α_l denotes the tap weight, and l is the tap
index [133]. To make the analysis more compact, the cascaded passive channel and
RX-CTLE are treated as a combined channel with a new pulse response of c_k. By
calculating the convolution of the pre-distorted output t(k) and the channel pulse response
c_k, the received discrete-time sequence before binary quantization can be given by
$$r(k) = \sum_l \alpha_l \Big( \sum_i I(i) \, c_{k-l-i} \Big). \qquad (6.8)$$
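Eq. (6.8) can be read as the data sequence being filtered by an equalized pulse response p_k = Σ_l α_l c_{k−l}. The following sketch checks this equivalence numerically; the pulse response `C`, the tap weights `ALPHA`, and the short data pattern are all hypothetical values chosen for illustration, not values from the design.

```python
# Hypothetical UI-spaced combined pulse response c_k (passive channel plus RX-CTLE).
C = {-1: 0.10, 0: 0.60, 1: 0.30, 2: 0.12, 3: 0.05}
# Hypothetical FFE tap weights alpha_l for l = -1, 0, 1, 2.
ALPHA = {-1: -0.08, 0: 0.70, 1: -0.17, 2: -0.05}

def c(k):
    """Combined pulse response, zero outside the tabulated range."""
    return C.get(k, 0.0)

def received(data, k):
    """Eq. (6.8): r(k) = sum_l alpha_l * sum_i I(i) * c_{k-l-i}."""
    return sum(
        a * sum(bit * c(k - l - i) for i, bit in enumerate(data))
        for l, a in ALPHA.items()
    )

def equalized_pulse(k):
    """Pulse response seen by the data after the FFE: p_k = sum_l alpha_l * c_{k-l}."""
    return sum(a * c(k - l) for l, a in ALPHA.items())

data = [1, -1, -1, 1, 1, 1, -1, 1]
k = 5
direct = received(data, k)
via_pulse = sum(bit * equalized_pulse(k - i) for i, bit in enumerate(data))
print(f"r({k}) from Eq. (6.8) = {direct:.6f}, from equalized pulse = {via_pulse:.6f}")
```

Viewing the FFE and the combined channel as a single equalized pulse is what allows the cross-correlation result derived for h_k to be reused with c_k weighted by the tap values.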
Following the steps of the cross-correlation analysis in Section 6.2.3 and using the derived
results, we obtain the cross-correlation coefficient between the edge-sampling error
sequence r(k + 0.5) and the recovered data sequence D(k),
$$\hat{\rho}_{e,d}(n) = \sum_l \alpha_l \, c_{n-l+0.5}. \qquad (6.9)$$
For an ideally equalized serial link, the edge-sampling error sequence is supposed
to be an all-zero sequence. Hence, all the cross-correlation coefficients should be zero.
However, this needs infinite taps to cancel all the residual ISI. Considering that the
ISI tail decreases exponentially over time, it is reasonable to assume that
the ISI affects a finite number of symbols, and previous research has demonstrated that
equalizers with a specific number of taps can effectively compensate for legacy channels
[130, 131, 164, 133, 123]. In principle, when the tap weights are adjusted close to
the targeted values, the resulting cross-correlation coefficient ρ̂_{e,d}(n) should be forced
towards zero. Taking the implemented 4-tap FFE in this design as an example, for a
group of proper tap weights, we have
$$\hat{\boldsymbol{\rho}}_{e,d} = \mathbf{C}\boldsymbol{\alpha} = \mathbf{0}, \qquad (6.10)$$
where
$$
\hat{\boldsymbol{\rho}}_{e,d} = \big(\hat{\rho}_{e,d}(-1),\ \hat{\rho}_{e,d}(0),\ \hat{\rho}_{e,d}(1),\ \hat{\rho}_{e,d}(2)\big)^T, \qquad
\boldsymbol{\alpha} = (\alpha_{-1},\ \alpha_0,\ \alpha_1,\ \alpha_2)^T,
$$
$$
\mathbf{C} =
\begin{pmatrix}
c_{0.5} & c_{-0.5} & c_{-1.5} & c_{-2.5} \\
c_{1.5} & c_{0.5} & c_{-0.5} & c_{-1.5} \\
c_{2.5} & c_{1.5} & c_{0.5} & c_{-0.5} \\
c_{3.5} & c_{2.5} & c_{1.5} & c_{0.5}
\end{pmatrix}.
$$
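The structure of the matrix C follows directly from Eq. (6.9): the entry in row n and column l is c_{n−l+0.5}. The sketch below builds the matrix from a table of half-UI samples and evaluates the residual correlation vector for a given set of tap weights. The sample values in `C_EDGE` and the example tap weights are hypothetical.

```python
# Hypothetical half-UI samples of the combined pulse response, keyed by m + 0.5.
C_EDGE = {-2.5: 0.01, -1.5: 0.03, -0.5: 0.40, 0.5: 0.45, 1.5: 0.20, 2.5: 0.08, 3.5: 0.03}
TAPS = (-1, 0, 1, 2)  # tap indices l, also used as the correlation lags n

def build_c_matrix():
    """C[n][l] = c_{n-l+0.5}, matching the matrix below Eq. (6.10)."""
    return [[C_EDGE[n - l + 0.5] for l in TAPS] for n in TAPS]

def residual_correlation(alpha):
    """rho_hat = C * alpha, with alpha ordered as (alpha_-1, alpha_0, alpha_1, alpha_2)."""
    return [sum(c_nl * a for c_nl, a in zip(row, alpha)) for row in build_c_matrix()]

unequalized = (0.0, 1.0, 0.0, 0.0)      # main tap only
example = (-0.05, 1.0, -0.30, -0.05)    # hypothetical tap weights, not optimized
print("main tap only :", [round(r, 3) for r in residual_correlation(unequalized)])
print("example weights:", [round(r, 3) for r in residual_correlation(example)])
```

With only the main tap active, the residual correlation vector is simply the column of C that belongs to α_0, which is the quantity the adaptation then tries to drive towards zero.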
To find the optimal TX-FFE tap weights, a recursive equation is constructed as,
Fig. 6.5 depicts the implementation of the EDC-SZF adaptation algorithm, which
contains three identical paths to process the quantized data and edge sequences to pro-
[Figure 6.5 residue (caption not recovered): three adaptation paths for α−1, α1, and α2 operating
on four time slots (Slot 1 to Slot 4). In each path, D(n+1), D(n−1), or D(n−2) and E(n) drive a
correlation detector whose outputs ResCor_l(n) ∈ {+1, 0, −1} (D1<3:0>, D0<3:0>) pass through an
adder, a 16-bit integrator, a truncation output (6 bits), a range limiter, and a 6-bit DAC to
produce Vbias,α−1, Vbias,α1, and Vbias,α2. Further labels: D(n) XOR D(n+1), BW<1:0>,
ResCor−1(n), ResCor0(n), ResCor1(n), ResCor2(n).]
Function table of the correlation detector [Figure 6.6(b)]:
D(n−l) ⊕ E(n) | D(n) ⊕ D(n+1) | ResCorl1(n) | ResCorl0(n)
      0       |       0       |      0      |      0
      1       |       0       |      0      |      0
      0       |       1       |      0      |      1
      1       |       1       |      1      |      1
Note: The signed ResCorl(n) is represented by two bits: ResCorl1(n) and ResCorl0(n).
Figure 6.6: Correlation detector. (a) Operation principle illustration and (b) function
table.
duce the desired bias voltages for TX-FFE taps. Here, the main tap weight is pre-fixed
to accelerate the convergence speed. In each path, the edge and data streams with a
proper time shift are applied to a correlation detector (CD) to generate the residual
correlation ResCor_l(n), which is used to represent the sign[e(n)] · D(n − l) in Eq.
(6.1). These parallel correlation coefficients are first summed and then fed into a
16-bit integrator to execute the iteration of Eq. (6.1), where λ is determined by the
subsequent truncation operation. In this design, a set of consecutive 4-bit data/edge of
the 1/16-rate demultiplexed data/edge are employed, which ensures that the data/edge
information used for equalization adaptation comes from different samplers. This
decentralized error collection method reduces the possibility of non-optimal adaptation
caused by imperfections such as fabrication mismatch, duty cycle distortion, and I, Q
quadrature error. Fig. 6.6 further details the operation principle and function table of
the CD. Clearly, if there is no transition (D(n) ⊕ D(n + 1) = 0), ResCor_l(n) is assigned
0. In case of a data transition (D(n) ⊕ D(n + 1) = 1), ResCor_l(n) is assigned
+1 or -1 when the polarities of D(n − l) and E(n) are identical (D(n − l) ⊕ E(n) = 0)
or opposite (D(n − l) ⊕ E(n) = 1), respectively.
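The behaviour of one adaptation path can be sketched in software as follows. This is a behavioural illustration, not the implemented logic: the correlation detector follows the rule described in the text, while the unsigned accumulator format, the saturation, the mid-scale starting code, the range-limiter bounds, and the polarity of the update are assumptions introduced here.

```python
def res_cor(d_n, d_next, d_shift, e_n):
    """Correlation detector: 0 without a data transition, otherwise +1 or -1."""
    if d_n ^ d_next == 0:                    # no transition between D(n) and D(n+1)
        return 0
    return 1 if d_shift ^ e_n == 0 else -1   # identical -> +1, opposite -> -1

class TapPath:
    """One path: adder, 16-bit integrator, truncation to 6 bits, range limiter."""

    def __init__(self, code_min=0, code_max=63, start_code=32):
        # code_min, code_max, and start_code are hypothetical; the text does not give them.
        self.acc = start_code << 10          # 16-bit accumulator, code in the 6 MSBs
        self.code_min, self.code_max = code_min, code_max

    def update(self, corr_values):
        """Add the correlations of the four slots, saturating at 16 bits (assumed)."""
        self.acc = max(0, min(0xFFFF, self.acc + sum(corr_values)))
        return self.code

    @property
    def code(self):
        truncated = self.acc >> 10           # dropping 10 LSBs acts as a 2**-10 step scale
        return max(self.code_min, min(self.code_max, truncated))

path = TapPath()
slots = [res_cor(0, 1, 1, 1)] * 4            # four slots that all report +1
for _ in range(256):
    path.update(slots)
print("DAC code after 256 updates:", path.code)
```

Dropping the ten least significant bits of the integrator plays the role of the scale factor λ: in this model, 1024 net correlation counts are needed to move the DAC code by one step.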
Figure 6.7: Layout views of the equalization blocks. (a) TX-FFE, (b) RX-CTLE, and
(c) EDC-SZF.
Fig. 6.7 displays the layout views of the equalization blocks. For the TX-FFE [see
Fig. 6.7(a)], a pair of standard inductors is utilized to neutralize the capacitances on the
output nodes. For the RX-CTLE [see Fig. 6.7(b)], two T-coil inductors are used in the
Terminals to support the high current, while multi-layer inductors are adopted
in the CTLE stages to save area. The EDC-SZF [see Fig. 6.7(c)] is
implemented with the standard cells provided by the foundry.
[Figure 6.8 panel labels: (a) attenuation (dB), with -15.9 dB at 20 GHz; (b) voltage (mV) traces
of the pre, post1, and post2 taps; (c), (d) eye-diagrams, amplitude (mV).]
Figure 6.8: Transistor-level simulation of the EDC-SZF adaptation. (a) Channel fre-
quency response, (b) convergence process of the TX-FFE tap weights, (c) eye-diagram
with zero TX-FFE tap weights, and (d) eye-diagram with adaptively-adjusted TX-FFE
tap weights.
[Figure 6.9 labels: (a) transmitter chip, PCB channel, receiver chip; (b) TX channel,
duplicated channel; (c) gain (dB) versus frequency (Hz).]
Figure 6.9: Constructed chip-to-chip interconnect. (a) Testing PCB, (b) auxiliary PCB,
and (c) duplicated channel frequency response.
[Figure 6.10 axes: tap bias (V), 0 to 0.6, versus VCTLE (V), 0.9 down to 0.6; curves for the
pre, post1, and post2 taps; bias conditions A to F marked.]
Figure 6.10: Adaptively-adjusted bias voltages of the TX-FFE with different RX-
CTLE control voltages.
Figure 6.11: Measured far-end eye-diagrams for (a) bias condition A, (b) bias condition
B, (c) bias condition D, and (d) bias condition F depicted in Fig. 6.10.
Fig. 6.8 gives the transistor-level simulation results of the serial link with the EDC-
SZF adaptation, where the control voltage of the RX-CTLE is pre-set to 700 mV, and
[Figure 6.12 axes: bit error rate versus phase (UI); curves for bias conditions A, C, and F.]
Figure 6.12: Measured bathtub curves under different bias conditions depicted in Fig.
6.10.
the dispersive channel is imitated by an LPF with a -15.9 dB loss at 20 GHz. The
channel frequency response and the eye-diagram after the channel are shown in Fig.
6.8(a). Fig. 6.8(b) describes the convergence process of the TX-FFE tap weights. Fig.
6.8(c) and (d) display the eye-diagrams (measured at the output of the RX-CTLE) with
zero and adaptively-adjusted tap weights, respectively. It can be seen that the
developed EDC-SZF adaptation algorithm gradually tunes the TX-FFE tap weights
to optimal values, which effectively improves the eye opening and eyelid thickness.
Fig. 6.9 shows the measurement setup of the serial link. As shown in Fig. 6.9(a),
a chip-to-chip interconnect is constructed. The outputs of the transmitter chip and the
inputs of the receiver chip are separately wire-bonded to the two terminals of a 12
cm PCB channel. Meanwhile, an auxiliary PCB with a transmitter chip bonded to a
replica channel and a pair of duplicated PCB traces is also manufactured to measure
the far-end eye-diagrams and evaluate the channel characteristics [see Fig. 6.9(b)]. Fig.
6.9(c) depicts the frequency response of the PCB channel, where the channel loss at
the half-baud frequency is over 16 dB.
Fig. 6.10 shows the adaptively-adjusted bias voltages of the TX-FFE taps as the
control voltage of the RX-CTLE changes from 900 mV to 615 mV [see the corresponding
equalization abilities in Fig. 6.3(b)]. Fig. 6.11 describes the far-end eye-diagrams
under the bias conditions A, B, D, and F depicted in Fig. 6.10. As the control voltage
of the RX-CTLE is decreased (i.e., as the high-frequency peaking of the RX-CTLE is
increased), the TX-FFE bias voltages are adjusted accordingly to decrease the
equalization strength of the TX-FFE, thus keeping the frequency response of the
combined TX-FFE, RX-CTLE, and transmission channel close to a flat profile. By
detecting the BER while adjusting the sampling positions, the bathtub diagram can be
obtained. Fig. 6.12 displays the measured bathtub curves under the bias conditions
A, C, and F described in Fig. 6.10. For the balanced equalization coefficient allocation
under bias condition C, the horizontal eye opening at BER = 10^-12 reaches 0.51
UI, which is much better than those measured under bias condition A (0.30 UI) and
bias condition F (0.35 UI). This demonstrates that a combination of the TX-FFE and
RX-CTLE is a good choice for the equalization of the 40 Gb/s link.
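To put the reported numbers into absolute terms, the short calculation below converts the measured horizontal eye openings from UI to picoseconds at 40 Gb/s and expresses a 16 dB channel loss as an amplitude ratio. It assumes NRZ signalling, so that the half-baud frequency is 20 GHz, and uses 16 dB as a lower bound because the text reports the loss as "over 16 dB"; the variable names are illustrative.

```python
DATA_RATE_BPS = 40e9

ui_ps = 1e12 / DATA_RATE_BPS            # one unit interval in ps
nyquist_ghz = DATA_RATE_BPS / 2 / 1e9   # half-baud frequency for NRZ (assumed)

# Horizontal eye openings at BER = 1e-12 reported for bias conditions A, C, and F.
eye_opening_ui = {"A": 0.30, "C": 0.51, "F": 0.35}
for condition, opening in eye_opening_ui.items():
    print(f"bias condition {condition}: {opening:.2f} UI = {opening * ui_ps:.2f} ps")

loss_db = 16.0                          # lower bound of the reported channel loss
amplitude_ratio = 10 ** (-loss_db / 20)
print(f"half-baud frequency = {nyquist_ghz:.0f} GHz, "
      f"amplitude ratio at {loss_db:.0f} dB loss = {amplitude_ratio:.3f}")
```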
This chapter constructs a serial link over a PCB channel with more than 16 dB loss using the
chips designed in Chapters 4 and 5. A combined TX-FFE and RX-CTLE is employed
to compensate for the channel loss. To obtain the optimal equalization coefficients
and track the channel-loss variations with respect to the operating environment, a low-cost
EDC-SZF adaptation algorithm is proposed to automatically adjust the TX-FFE’s tap
weights. Unlike previous adaptation techniques that need auxiliary circuits to extract
the error information, the proposed EDC-SZF adaptation performs the tap-weight
adjustment by processing the existing data and edge sequences, hence introducing
little overhead to the link.
Chapter 7
7.1 Conclusions
The rapid growth of computing power and storage volume has led to an explosive
bandwidth demand on data communication in both telecommunication equipment
and inter/intra data centers. To accommodate this requirement, the data rate
of the wireline SerDes transceiver has been continuously increased. Currently, 25-28
Gb/s serial links have entered industrial deployment. The 38-64 Gb/s
transceivers, which will play a key role in the next generation of data rates, have attracted
increasing attention in both industry and academia. This thesis addresses some
of the architecture-level and circuit-level challenges associated with such cutting-edge
wireline transceiver designs. Several advanced techniques are developed to optimize
the operation speed, power efficiency, performance margin, and area occupation. The
prototype chips of a 10 GHz clock multiplier, a 40 Gb/s transmitter, and a 40 Gb/s
receiver are separately designed and fabricated in a 65 nm CMOS process. The main
features of these designed chips are summarized below.
Chapter 7. Conclusions and Future Work
tial frequency setup aid and preventing the potential lock-loss risk. Secondly,
a full-swing pseudo-differential delay cell is developed to optimize the phase
noise performance of the VCO. Thirdly, a compact timing-adjusted phase detector
tightly combined with a well-matched charge pump is designed to satisfy the
requirements of high operation speed, high detection accuracy, and low output
disturbance. The measurement results show that the implemented 10 GHz RILCM
chip achieves a good balance among jitter performance, area occupation,
operation speed, and power efficiency.
• The main features of the implemented transmitter focus on three aspects. Firstly,
a 4-tap feed-forward equalizer (FFE) based on multiple multiplexers (MUXs) is
designed. Thanks to the retiming-based symbol-spaced sequence generation, it
can support a wide operation range of 5-50 Gb/s. Secondly, an enhanced 4:1
MUX is developed. By introducing a pair of pre-charging PMOS transistors in
the pulling-down unit cell, it completely eliminates the charge-sharing effect,
which not only improves the jitter performance of the 4:1 MUX but also helps
to extend its maximum bandwidth. Thirdly, a compact latch array associated
with an interleaved-retiming technique is designed. By interleaved-retiming the
parallel data, the 16-path quarter-rate data streams with appropriate delays can
be obtained. The measurement results indicate that the fabricated 40 Gb/s
transmitter chip achieves excellent jitter performance and power efficiency.
• The main features of the implemented receiver focus on two aspects. One is
the architecture-level improvement on the clock data recovery (CDR). By
introducing passive low-pass filters with an adaptively adjusted bandwidth into the
data-sampling path, the jitter tracking and jitter suppression for data decisions
can be automatically balanced, thus improving the jitter tolerance of the CDR.
The other is the time-averaging-based compensating phase interpolator, which
not only improves the phase-step uniformity but also reduces the phase-spacing
errors between the edge and data sampling clocks. The measurement results
show that the maximum tolerable jitter amplitude of the implemented 40 Gb/s receiver
chip outperforms previous receivers at high jitter frequencies.
The factors to consider when designing a serial communication link mainly include
data transmission rate, power efficiency, and channel characteristics. The first factor is
usually set by particular operation standards, while the other two factors largely depend on
the network infrastructure, operation medium, and link spaces. As the requirement for
data rates goes beyond 40 Gb/s, efforts in channel optimization, on-chip transmission
lines, and modulation schemes should also be made to further optimize these factors.
As a consequence, the following items could be future tasks to
further optimize the link performance.
Bibliography
[1] C. V. N. Index, “The zettabyte era-trends and analysis.” Cisco white paper,
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/vni-hyperconnectivity-wp.html,
Jun. 2016. [Online]. Accessed 26-Feb.-2017.
[3] S. Voinigescu et al., “SiGe BiCMOS for analog, high-speed digital and millimetre-wave
applications beyond 50 GHz,” in Proc. IEEE Bipolar/BiCMOS Circuits and Technology Meeting,
pp. 1–8, Oct. 2006.
[4] Optical Internetworking Forum, “OIF CEI-56G application note-Common electrical interface
at 56Gb/s.” OIF application note,
http://www.oiforum.com/wp-content/uploads/OIF-CEI-white-paper-final-Mar-23-2016.pdf, 2016.
[Online]. Accessed 26-Feb.-2017.
[5] Telcordia Technologies, Synchronous Optical Network (SONET) Transport Systems: Common
Generic Criteria, Sep. 2000. GR-253-CORE.
[6] T. Toifl et al., “A 22-Gb/s PAM-4 receiver in 90-nm CMOS SOI technology,” IEEE J. Solid-State
Circuits, vol. 41, pp. 954–965, Apr. 2006.
[7] T. Toifl, Low-Power High-Speed CMOS I/Os: Design Challenges and Solutions. IBM Research
GmbH Zurich Research Laboratory, 2012.
[8] C. Kromer et al., “A 25-Gb/s CDR in 90-nm CMOS for high-density interconnects,” IEEE J.
Solid-State Circuits, vol. 41, pp. 2921–2929, Dec. 2006.
[9] J. F. Bulzacchelli et al., “A 10-Gb/s 5-tap DFE/4-tap FFE transceiver in 90-nm CMOS
technology,” IEEE J. Solid-State Circuits, vol. 41, pp. 2885–2900, Dec. 2006.
[10] L. Rodoni et al., “A 5.75 to 44 Gb/s quarter rate CDR with data rate selection in 90 nm bulk
CMOS,” IEEE J. Solid-State Circuits, vol. 44, pp. 1927–1941, Jul. 2009.
[11] S. Sidiropoulos et al., “A semidigital dual delay-locked loop,” IEEE J. Solid-State Circuits,
vol. 32, pp. 1683–1692, Nov. 1997.
[12] G.-Y. Wei et al., “A variable-frequency parallel I/O interface with adaptive power-supply regula-
tion,” IEEE J. Solid-State Circuits, vol. 35, pp. 1600–1610, Nov. 2000.
[13] A. Agrawal et al., “A 19-Gb/s serial link receiver with both 4-tap FFE and 5-tap DFE functions
in 45-nm SOI CMOS,” IEEE J. Solid-State Circuits, vol. 47, pp. 3220–3231, Dec. 2012.
[14] R. Kreienkamp et al., “A 10-Gb/s CMOS clock and data recovery circuit with an analog phase
interpolator,” IEEE J. Solid-State Circuits, vol. 40, pp. 736–743, Mar. 2005.
[15] H. Pan et al., “A digital wideband CDR with 15.6kppm frequency tracking at 8Gb/s in 40nm
CMOS,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 442–443, Feb. 2011.
[16] M.-S. Chen, Design of 60+Gb/s Serial-Link Transmitters Using Filter Techniques. PhD thesis,
Electrical Engineering, University of California, Los Angeles, 2015.
[17] B. Welch, “400G optics-technologies, timing, and transceivers.” IEEE P802.3bs,
http://www.ieee802.org/3/bs/public/14_05/welch_3bs_01_0514.pdf, May 2014. [Online].
Accessed 22-Oct.-2016.
[19] M. Cvijetic and I. B. Djordjevic, Advanced Optical Communication Systems and Networks, ch. 1,
pp. 1–38. Artech House, 2013.
[20] P. C. Chiang et al., “4 × 25 Gb/s transceiver with optical front-end for 100 GbE system in 65 nm
CMOS technology,” IEEE J. Solid-State Circuits, vol. 50, pp. 573–582, Feb. 2015.
[21] U. Singh et al., “A 780 mW 4 × 28 Gb/s transceiver for 100 GbE gearbox PHY in 40 nm CMOS,”
IEEE J. Solid-State Circuits, vol. 49, pp. 3116–3129, Dec. 2014.
[22] T. Takemoto et al., “A 25-Gb/s 2.2-W 65-nm CMOS optical transceiver using a power-supply-
variation-tolerant analog front end and data-format conversion,” IEEE J. Solid-State Circuits,
vol. 49, pp. 1903–1916, Feb. 2014.
[23] R. Navid et al., “A 40 Gb/s serial link transceiver in 28 nm CMOS technology,” IEEE J. Solid-
State Circuits, vol. 50, pp. 814–827, Dec. 2015.
[24] M. S. Chen and C. K. K. Yang, “A 50-64 Gb/s serializing transmitter with a 4-tap, LC-ladder-
filter-based FFE in 65 nm CMOS technology,” IEEE J. Solid-State Circuits, vol. 50, pp. 1903–
1916, Apr. 2015.
[25] J. Lee et al., “Design of 56 Gb/s NRZ and PAM4 SerDes transceivers in CMOS technologies,”
IEEE J. Solid-State Circuits, vol. 50, pp. 2061–2073, Sep. 2015.
[26] P. C. Chiang et al., “60Gb/s NRZ and PAM4 transmitters for 400GbE in 65nm CMOS link,” in
Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 42–43, Feb. 2014.
[27] H. Tao et al., “40-43-Gb/s OC-768 16:1 MUX/CMU chipset with SFI-5 compliance,” IEEE J.
Solid-State Circuits, vol. 38, pp. 2169–2180, Dec. 2003.
[28] Inphi, “CMOS paves the road to 100 GbE mainstream markets.”
https://www.inphi.com/products/whitepapers/inphi_whitepaper_iphy_final.pdf, 2011. [Online].
Accessed 28-Jul.-2017.
[29] T. H. Lee, The Design of CMOS Radio-Frequency Integrated Circuits. Cambridge: Cambridge
University Press, 1998.
[31] IEEE 802.3, 50 Gb/s Ethernet Over a Single Lane and Next Generation 100 Gb/s & 200 Gb/s
Ethernet Call For Interest Consensus Presentation, Nov. 2015.
[32] A. A. Hafez, M.-S. Chen, and C.-K. K. Yang, “A 32-48 Gb/s serializing transmitter using multi-
phase serialization in 65 nm CMOS technology,” IEEE J. Solid-State Circuits, vol. 50, pp. 763–
775, Mar. 2015.
[33] S. Kaeriyama et al., “A 40 Gb/s multi-data-rate CMOS transmitter and receiver chipset with SFI-5
interface for optical transmission systems,” IEEE J. Solid-State Circuits, vol. 44, pp. 3568–3579,
Dec. 2009.
[34] M. S. Chen et al., “A fully-integrated 40-Gb/s transceiver in 65-nm CMOS technology,” IEEE J.
Solid-State Circuits, vol. 47, pp. 627–640, Mar. 2012.
[36] J. Han et al., “A 60Gb/s 288mW NRZ transceiver with adaptive equalization and baud-rate clock
and data recovery in 65nm CMOS technology,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig.
Tech. Papers, pp. 112–113, Feb. 2017.
[37] D. Cui et al., “A dual-channel 23-Gbps CMOS transmitter/receiver chipset for 40-Gbps RZ-DQPSK
and CS-RZ-DQPSK optical transmission,” IEEE J. Solid-State Circuits, vol. 47,
pp. 3249–3260, Dec. 2012.
[38] M. Harwood et al., “A 225mW 28Gb/s SerDes in 40nm CMOS with 13dB of analog equalization
for 100GBASE-LR4 and optical transport lane 4.4 applications,” in Proc. IEEE Int. Solid-State
Circuits Conf. Dig. Tech. Papers, pp. 326–327, Feb. 2012.
[41] T. Toifl et al., “A 72mW 0.03mm2 inductorless 40 Gb/s CDR in 65 nm SOI CMOS,” in Proc.
IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 226–227, Feb. 2007.
[42] J. Savoj et al., “Design of high-speed wireline transceivers for backplane communications in
28nm CMOS,” in Proc. IEEE Custom Integrated Circuits Conf., pp. 1–4, Sep. 2012.
[43] T. T. Vu, Compound Semiconductor Integrated Circuits, vol. 29. World Scientific, 2003.
[44] C.-K. K. Yang, Design of High-Speed Serial Links in CMOS. PhD thesis, Stanford University,
Dec. 1998.
[45] H. Bakoglu, ed., Circuits, Interconnections and Packaging for Very Large Scale Integration.
Addison Wesley Longman Publishing Co., 1990.
[47] H. Johnson and M. Graham, High-Speed Signal Propagation: Advanced Black Magic. Prentice-
Hall, 2003.
[48] W.-K. Chen, ed., The VLSI Handbook. Taylor & Francis Group, 2 ed., 2007.
[49] C. R. Paul, Analysis of Multiconductor Transmission Lines. John Wiley & Sons, 2 ed., 2008.
[50] T. Dhaene and D. D. Zutter, “Selection of lumped element models for coupled lossy transmission
lines,” IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., vol. 11, pp. 805–815, Jul. 1992.
[51] K. Fukuda et al., “A 12.3-mW 12.5-Gb/s complete transceiver in 65-nm CMOS process,” IEEE
J. Solid-State Circuits, vol. 45, pp. 2838–2849, Dec. 2010.
[52] K.-L. J. Wong et al., “A 27-mW 3.6-Gb/s I/O transceiver,” IEEE J. Solid-State Circuits, vol. 39,
pp. 602–612, Apr. 2004.
[53] H. Hatamkhani and C.-K. K. Yang, “Power analysis for high-speed I/O transmitters,” in Proc.
Symp. VLSI Circuits, pp. 142–145, Jun. 2004.
[54] A. Agrawal, Design of High Speed I/O Interfaces for High Performance Microprocessors. PhD
thesis, The School of Engineering and Applied Sciences, Harvard University, Oct. 2010.
[55] B. Kim et al., “A 10-Gb/s compact low-power serial I/O with DFE-IIR equalization in 65-nm
CMOS,” IEEE J. Solid-State Circuits, vol. 44, pp. 3526–3538, Dec. 2009.
[56] G. Balamurugan et al., “A scalable 5-15 Gbps, 14-75 mW low-power I/O transceiver in 65 nm
CMOS,” IEEE J. Solid-State Circuits, vol. 43, pp. 1010–1019, Apr. 2008.
[57] J. Lee et al., “56Gb/s PAM4 and NRZ SerDes transceivers in 40nm CMOS,” in Proc. IEEE Symp.
VLSI Circ. Dig. Tech. Papers, pp. 118–119, Jun. 2015.
[59] F. Rao and S. Hindi, “Frequency domain analysis of jitter amplification in clock channels,” in
Proc. IEEE 21st Conference on Electrical Performance of Electronic Packaging and Systems,
pp. 51–54, Oct. 2012.
[60] S. Chaudhuri et al., “Jitter amplification characterization of passive clock channels at 6.4 and 9.6
Gb/s,” in Proc. IEEE Electrical Performance of Electronic Packaging, pp. 23–25, Feb. 2006.
[61] B. Casper and F. O’Mahony, “Clocking analysis, implementation and measurement techniques for
high-speed data links-A tutorial,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, pp. 17–39,
Jan. 2009.
[62] Maxim Integrated, Converting between RMS and Peak-to-Peak Jitter at a Specified BER, Apr.
2008.
[63] Y. Moon et al., “A 0.6-2.5 GBaud CMOS tracked 3× oversampling transceiver with dead-zone
phase detection for robust clock/data recovery,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig.
Tech. Papers, pp. 212–213, Feb. 2001.
[64] J. L. Sonntag and J. Stonick, “A digital clock and data recovery architecture for multi-Gigabit/s
binary links,” IEEE J. Solid-State Circuits, vol. 41, pp. 1867–1875, Jul. 2006.
[65] P. K. Hanumolu et al., “Digitally-enhanced phase-locking circuits,” in Proc. IEEE Custom
Integrated Circuits Conf., pp. 361–368, Sep. 2007.
[66] A. Ghatak and K. Thyagarajan, An Introduction to Fiber Optics. Cambridge: Cambridge
University Press, 1998.
[67] B. Razavi, Design of Integrated Circuits for Optical Communications. John Wiley & Sons, Inc.,
2 ed., 2012.
[68] K. Kundert, Verification of Bit-Error Rate in Bang-Bang Clock and Data Recovery Circuits. The
Designer’s Guide Community, May 2010.
[69] R. Reutemann et al., “A 4.5 mW/Gb/s 6.4 Gb/s 22+1-lane source synchronous receiver core with
optional cleanup PLL in 65 nm CMOS,” IEEE J. Solid-State Circuits, vol. 45, pp. 2850–2860,
Dec. 2010.
[70] N. Kalantari and J. F. Buckwalter, “A multichannel serial link receiver with dual-loop clock-
and-data recovery and channel equalization,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60,
pp. 2920–2931, Nov. 2013.
[71] V. F. Kroupa, Phase Lock Loops and Frequency Synthesis. John Wiley & Sons Ltd, 2003.
[73] M. Mansuri and C.-K. K. Yang, “Jitter optimization based on phase-locked loop design parameters,”
IEEE J. Solid-State Circuits, vol. 37, pp. 1375–1382, Nov. 2002.
[74] J. G. Maneatis, “Low-jitter process-independent DLL and PLL based on self-biased techniques,”
IEEE J. Solid-State Circuits, vol. 31, pp. 1723–1732, Nov. 1996.
[75] M.-J. E. Lee, “Jitter transfer characteristics of delay-locked loops-theories and design techniques,”
IEEE J. Solid-State Circuits, vol. 38, pp. 614–615, Apr. 2003.
[76] C.-N. Chuang and S.-I. Liu, “A 40GHz DLL-based clock generator in 90nm CMOS technology,”
in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 178–179, Feb. 2007.
[77] X. Gao et al., “Jitter analysis and a benchmarking figure-of-merit for phase-locked loops,” IEEE
Trans. Circuits Syst. II, Exp. Briefs, vol. 56, pp. 117–121, Feb. 2009.
[78] C. Kim et al., “A low-power small-area 7.28-ps-jitter 1-GHz DLL-based clock generator,” IEEE
J. Solid-State Circuits, vol. 37, pp. 1414–1420, Nov. 2002.
[79] R. Farjad-Rad et al., “A low-power multiplying DLL for low-jitter multigigahertz clock generation
in highly integrated digital chips,” IEEE J. Solid-State Circuits, vol. 37, pp. 1804–1812, Dec.
2002.
[80] H.-Y. Chang, “A low-jitter low-phase-noise 10-GHz sub-harmonically injection-locked PLL with
self-aligned DLL in 65-nm CMOS technology,” IEEE Trans. Microwave Theory Tech., vol. 62,
pp. 543–555, Mar. 2014.
[81] S. Choi et al., “A 185 fsrms-integrated-jitter and -245dB FOM PVT-robust ring-VCO-based
injection-locked clock multiplier with a continuous frequency-tracking loop using a replica-delay
cell and a dual-edge phase detector,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers,
pp. 194–195, Feb. 2016.
[83] W. Deng et al., “A 0.022mm2 970µW dual-loop injection-locked PLL with -243 dB FOM using
synthesizable all-digital PVT calibration circuits,” in IEEE Int. Solid-State Circuits Conf. Dig.
Tech. Papers, pp. 248–249, Feb. 2013.
[84] M. Kim et al., “A 450-fs jitter PVT-robust fractional-resolution injection-locked clock multiplier
using a DLL-based calibrator with replica-delay-cells,” in Proc. IEEE Symp. VLSI Circ. Dig.
Tech. Papers, pp. C142–C143, Jun. 2015.
[85] B. M. Helal et al., “A low jitter programmable clock multiplier based on a pulse injection-locked
oscillator with a highly-digital tuning loop,” IEEE J. Solid-State Circuits, vol. 44, pp. 1391–1400,
May 2009.
[86] M. Raj et al., “A 4-to-11GHz injection-locked quarter-rate clocking for an adaptive 153fJ/b optical
receiver in 28nm FDSOI CMOS,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers,
pp. 404–405, Feb. 2015.
[87] K. Hu et al., “A 0.6 mW/Gb/s, 6.4-7.2 Gb/s serial link receiver using local injection-locked ring
oscillators in 90 nm CMOS,” IEEE J. Solid-State Circuits, vol. 45, pp. 899–908, Apr. 2010.
[88] J. Lee and H. Wang, “Study of subharmonically injection-locked PLLs,” IEEE J. Solid-State
Circuits, vol. 44, pp. 1539–1553, May 2009.
[89] S. Ye et al., “A multiple-crystal interface PLL with VCO realignment to reduce phase noise,”
IEEE J. Solid-State Circuits, vol. 37, pp. 1795–1803, Dec. 2002.
[90] X. Qi et al., “Compact on-chip wire models for the clock distribution of high-speed I/O interfaces,”
in Proc. IEEE Electrical Performance of Electronic Packaging, pp. 235–238, Oct. 2008.
[91] K. Hu et al., “Comparison of on-die global clock distribution methods for parallel serial links,”
in Proc. IEEE International Symposium on Circuits and Systems, pp. 1843–1846, May 2009.
[92] F. O’Mahony et al., “A low-jitter PLL and repeaterless clock distribution network for a 20Gb/s
link,” in Proc. IEEE Symp. VLSI Circ. Dig. Tech. Papers, pp. 29–30, Jun. 2006.
[93] L. Xiu, “Clock technology: The next frontier,” IEEE Circuits and Systems Magazine, vol. 17,
pp. 27–46, May 2017.
[94] J. Poulton et al., “A 14-mW 6.25-Gb/s transceiver in 90-nm CMOS,” IEEE J. Solid-State Circuits,
vol. 42, pp. 2745–2757, Dec. 2007.
[95] S. Chan et al., “A resonant global clock distribution for the cell broadband-engine processor,” in
Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 512–513, Feb. 2008.
[96] N. Holland, Interfacing Between LVPECL, VML, CML, and LVDS Levels. Texas Instruments,
2002.
[97] Cypress Semiconductor, A Comparison of CML and LVDS for High-Speed Serial Links, 2002.
[98] C. Menolfi et al., “A 28Gb/s source-series terminated tx in 32nm CMOS SOI,” in Proc. IEEE Int.
Solid-State Circuits Conf. Dig. Tech. Papers, pp. 334–335, Feb. 2012.
[99] J. Kim et al., “A 16-to-40Gb/s quarter-rate NRZ/PAM4 dual-mode transmitter in 14nm CMOS,”
in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 60–61, Feb. 2015.
[100] K. Kanda et al., “A single-40 Gb/s dual-20 Gb/s serializer IC with SFI-5.2 interface in 65 nm
CMOS,” IEEE J. Solid-State Circuits, vol. 44, pp. 3580–3589, Dec. 2009.
[101] H. Wang and J. Lee, “A 21-Gb/s 87-mW transceiver with FFE/DFE/Analog equalizer in 65-nm
CMOS technology,” IEEE J. Solid-State Circuits, vol. 45, pp. 909–919, Apr. 2010.
[102] L. Henrickson et al., “Low power fully integrated 10-Gb/s SONET/SDH transceiver in 0.13-µm
CMOS,” IEEE J. Solid-State Circuits, vol. 38, pp. 1595–1601, Oct. 2003.
[103] B. Raghavan et al., “A sub-2 W 39.8-44.6 Gb/s transmitter and receiver chipset with SFI-5.2
interface in 40 nm CMOS,” IEEE J. Solid-State Circuits, vol. 48, pp. 3219–3228, Dec. 2013.
[104] X. Zheng, C. Zhang, and S. Yuan et al., “An improved 40 Gb/s CDR with jitter-suppression filters
and phase-compensating interpolators,” in Proc. IEEE Asian Solid-State Circuits Conf. (ASSCC),
pp. 85–88, Nov. 2016.
[105] J. W. Bergmans, Digital Baseband Transmission and Recording, ch. 8, pp. 400–412. Springer
Science & Business Media, 1996.
[106] F.-T. Chen et al., “A 10-Gb/s low jitter single-loop clock and data recovery circuit with rotational
phase frequency detector,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, pp. 3278–3287,
Nov. 2014.
[107] N. Kocaman et al., “An 8.5-11.5-Gbps SONET transceiver with referenceless frequency acquisition,”
IEEE J. Solid-State Circuits, vol. 48, pp. 1875–1884, Aug. 2013.
[108] M. S. Jalali et al., “A reference-less single-loop half-rate binary CDR,” IEEE J. Solid-State
Circuits, vol. 50, pp. 2037–2047, Sep. 2015.
[109] A. Pottbacker et al., “A Si bipolar phase and frequency detector IC for clock extraction up to 8
Gb/s,” IEEE J. Solid-State Circuits, vol. 27, pp. 1747–1751, Dec. 1992.
[110] M.-T. Hsieh and G. E. Sobelman, “Architectures for multi-gigabit wire-linked clock and data
recovery,” IEEE Circuits and Systems Magazine, vol. 8, no. 4, pp. 45–57, 2008.
[111] B. Razavi, “Challenges in the design of high-speed clock and data recovery circuits,” IEEE
Communications Magazine, pp. 94–101, Aug. 2002.
[112] M. H. Perrott et al., “A 2.5-Gb/s multi-rate 0.25-µm CMOS clock and data recovery circuit
utilizing a hybrid analog/digital loop filter and all-digital referenceless frequency acquisition,”
IEEE J. Solid-State Circuits, vol. 41, pp. 2930–2944, Dec. 2006.
[113] J. C. Scheytt et al., “A 0.155-, 0.622-, and 2.488-Gb/s automatic bit-rate selecting clock and
data recovery IC for bit-rate transparent SDH systems,” IEEE J. Solid-State Circuits, vol. 34,
pp. 1935–1943, Dec. 1999.
[114] H. S. Muthali et al., “A CMOS 10-Gb/s SONET transceiver,” IEEE J. Solid-State Circuits,
vol. 39, pp. 1026–1033, Jul. 2004.
[115] M. Y. He and J. Poulton, “A CMOS mixed-signal clock and data recovery circuit for OIF CEI-
6G+ backplane transceiver,” IEEE J. Solid-State Circuits, vol. 41, pp. 597–606, Mar. 2006.
[116] H.-H. Chang et al., “Low jitter and multirate clock and data recovery circuit using a MSADLL for
chip-to-chip interconnection,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, pp. 2356–2364,
Dec. 2004.
[117] B. Razavi, Monolithic Phase-Locked Loops and Clock Recovery Circuits: Theory and Design.
Wiley-IEEE Press, 1996.
[118] Y. Sun and H. Wang, “Analysis of digital bang-bang clock and data recovery for multi-gigabits
serial transceivers,” in Proc. IEEE Custom Integrated Circuits Conf., pp. 13–16, Sep. 2009.
[119] S. Tertinek et al., “Binary phase detector gain in bang-bang phase-locked loops with DCO jitter,”
IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, pp. 941–945, Dec. 2010.
[120] N. D. Dalt, “Markov chains-based derivation of the phase detector gain in bang-bang PLLs,”
IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, pp. 1195–1199, Nov. 2006.
[121] J. Kim et al., “Simulation and analysis of random decision errors in clocked comparators,” IEEE
Trans. Circuits Syst. I, Reg. Papers, vol. 56, pp. 1844–1857, Aug. 2009.
[122] G. R. Gangasani et al., “A 16-Gb/s backplane transceiver with 12-tap current integrating DFE
and dynamic adaptation of voltage offset and timing drifts in 45-nm SOI CMOS technology,”
IEEE J. Solid-State Circuits, vol. 47, pp. 1828–1841, Aug. 2012.
[123] T. Musah et al., “A 4-32 Gb/s bidirectional link with 3-tap FFE/6-tap DFE and collaborative CDR
in 22 nm CMOS,” IEEE J. Solid-State Circuits, vol. 49, pp. 3079–3090, Dec. 2014.
[124] P. K. Hanumolu et al., “Equalizer for high-speed links,” International Journal of High Speed
Electronics and Systems, vol. 15, pp. 429–458, Jul. 2005.
[125] J. F. Bulzacchelli et al., “A 28-Gb/s 4-tap FFE/15-tap DFE serial link transceiver in 32-nm SOI
CMOS technology,” IEEE J. Solid-State Circuits, vol. 47, pp. 3232–3248, Dec. 2012.
[126] M. Altmann and F. Spagna, Adaptive Tx Equalization. IEEE 802.3ap, Nov. 2004.
[127] S. S. Mohan et al., “Bandwidth extension in CMOS with optimized on-chip inductors,” IEEE J.
Solid-State Circuits, vol. 35, pp. 346–355, Mar. 2000.
[128] S. Ibrahim and B. Razavi, “Low-power CMOS equalizer design for 20-Gb/s systems,” IEEE J.
Solid-State Circuits, vol. 46, pp. 1321–1336, Jun. 2011.
[130] J. Jaussi et al., “A 205mW 32Gb/s 3-tap FFE/6-tap DFE bidirectional serial link in 22nm CMOS,”
in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 440–441, Feb. 2014.
[131] M. Pozzoni et al., “A multi-standard 1.5 to 10 Gb/s latch-based 3-tap DFE receiver with a SSC
tolerant CDR for serial backplane communication,” IEEE J. Solid-State Circuits, vol. 44,
pp. 1306–1315, Apr. 2009.
[132] H. Higashi et al., “A 5-6.4-Gb/s 12-channel transceiver with pre-emphasis and equalization,”
IEEE J. Solid-State Circuits, vol. 40, pp. 978–985, Apr. 2005.
[133] K. Krishna et al., “A multigigabit backplane transceiver core in 0.13-µm CMOS with a power-
efficient equalization architecture,” IEEE J. Solid-State Circuits, vol. 40, pp. 2658–2666, Dec.
2005.
[134] J. Lee, “A 20-Gb/s adaptive equalizer in 0.13-µm CMOS technology,” IEEE J. Solid-State Circuits,
vol. 41, pp. 2058–2066, Sep. 2006.
[135] Wong and Lok, “Theory of digital communications: Chapter 4 intersymbol interference and
equalization.” http://wireless.ece.ufl.edu/twong/Notes/Comm/ch4.pdf. [Online]. Accessed 16-Sep.-2017.
[136] Schober, “Signal detection and estimation: Equalization of channels with ISI.”
http://courses.ece.ubc.ca/564/chapter6.pdf. [Online]. Accessed 16-Sep.-2017.
[138] J. Savoj et al., “A low-power 0.5-6.6 Gb/s wireline transceiver embedded in low-cost 28 nm
FPGAs,” IEEE J. Solid-State Circuits, vol. 48, pp. 2582–2594, Nov. 2013.
[139] J. Savoj et al., “A wide common-mode fully-adaptive multi-standard 12.5 Gb/s backplane
transceiver in 28 nm CMOS,” in Proc. IEEE Symp. VLSI Circ. Dig. Tech. Papers, pp. 104–105,
Jun. 2012.
[141] B. Analui et al., “A 10-Gb/s two-dimensional eye-opening monitor in 0.13-µm standard CMOS,”
IEEE J. Solid-State Circuits, vol. 40, pp. 2689–2699, Dec. 2005.
[142] J.-S. Choi et al., “A 0.18-µm CMOS 3.5-Gb/s continuous-time adaptive cable equalizer using
enhanced low-frequency gain control method,” IEEE J. Solid-State Circuits, vol. 39, pp. 419–
425, Mar. 2004.
[143] S. Gondi et al., “A 10Gb/s CMOS adaptive equalizer for backplane applications,” in Proc. IEEE
Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 328–329, Feb. 2005.
[144] P. C. Maulik and D. A. Mercer, “A DLL-based programmable clock multiplier in 0.18-µm CMOS
with 70 dBc reference spur,” IEEE J. Solid-State Circuits, vol. 42, pp. 1642–1648, Aug. 2007.
[145] Y.-C. Huang and S.-I. Liu, “A 2.4-GHz subharmonically injection-locked PLL with self-
calibrated injection timing,” IEEE J. Solid-State Circuits, vol. 48, pp. 417–428, Feb. 2013.
[146] I.-T. Lee et al., “A divider-less sub-harmonically injection-locked PLL with self-adjusted injection
timing,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 414–415, Feb. 2013.
[147] H. M. Cheema et al., 60-GHz CMOS Phase-Locked Loops. Springer Science & Business Media,
2010.
[149] L. Zhang et al., “Injection-locked clocking: A low-power clock distribution scheme for high-
performance microprocessors,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16,
pp. 1251–1256, Sep. 2008.
[150] J. Lee and M. Liu, “A 20-Gb/s burst-mode clock and data recovery circuit using injection-locking
technique,” IEEE J. Solid-State Circuits, vol. 43, pp. 619–630, Mar. 2008.
[151] A. Musa et al., “A compact, low-power and low-jitter dual-loop injection locked PLL using all-
digital PVT calibration,” IEEE J. Solid-State Circuits, vol. 49, pp. 50–60, Jan. 2014.
[152] P. Park, J. Park, H. Park, and S. Cho, “An all-digital clock generator using a fractionally injection-
locked oscillator in 65nm CMOS,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers,
pp. 336–337, Feb. 2012.
[153] C.-F. Liang and K.-J. Hsiao, “An injection-locked ring PLL with self-aligned injection window,”
in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pp. 90–91, Feb. 2011.
[154] D. Dunwell and A. C. Carusone, “Modeling oscillator injection locking using the phase domain
response,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, pp. 2823–2833, Nov. 2013.
[155] Y.-H. Kwak et al., “A 20 Gb/s clock and data recovery with a ping-pong delay line for unlimited
phase shifting in 65 nm CMOS process,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60,
pp. 303–313, Feb. 2013.
[156] E. Alon et al., “Replica compensated linear regulators for supply-regulated phase-locked loops,”
IEEE J. Solid-State Circuits, vol. 41, pp. 413–424, Feb. 2006.
[157] L. Kull et al., “Implementation of low-power 6-8 b 30-90 GS/s time-interleaved ADCs with
optimized input bandwidth in 32 nm CMOS,” IEEE J. Solid-State Circuits, vol. 51, pp. 636–648,
Mar. 2016.
[159] P. Chiang et al., “A 20-Gb/s 0.13-µm CMOS serial link transmitter using an LC-PLL to directly
drive the output multiplexer,” IEEE J. Solid-State Circuits, vol. 40, pp. 1004–1011, Apr. 2005.
[161] M. Hossain et al., “A 4x40 Gb/s quad-lane CDR with shared frequency tracking and data dependent
jitter filtering,” in IEEE Symp. VLSI Circuits Dig. Tech. Papers, pp. 1–2, Jun. 2014.
[162] T. O. Dickson, J. F. Bulzacchelli, and D. J. Friedman, “A 12-Gb/s 11-mW half-rate sampled 5-tap
decision feedback equalizer with current-integrating summers in 45-nm SOI CMOS technology,”
IEEE J. Solid-State Circuits, vol. 44, pp. 1298–1305, Apr. 2009.
[163] D. Vijayaraghavan et al., “Highly configurable FPGA-integrated PCI Express 3.0 digital IP architecture,”
in DesignCon, pp. 1274–1288, Jan.-Feb. 2011.
[164] H. Kimura et al., “A 28 Gb/s 560 mW multi-standard SerDes with single-stage analog front-end
and 14-tap decision feedback equalizer in 28 nm CMOS,” IEEE J. Solid-State Circuits, vol. 49,
pp. 3091–3103, Dec. 2014.
[165] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge: Cambridge University Press, 2012.
Appendices
tor (ILO)
The discussions in [89, 154] show that it is reasonable to assume that an injection event
shifts the ILO output phase instantaneously and that the phase shift is linear with respect
to the instantaneous phase difference relative to the injection signal. These assumptions
are also supported by circuit simulations and measurement results. Each injection phase
shift can therefore be modeled as an additional phase step applied to the oscillator
output. Fig. A1(a) shows the ILO waveform within the nth injection period. The total
phase of 2Nπ is divided into two portions, ϕosc(n) and ϕinj(n), which are contributed by
the self-oscillation of the oscillator and by the injection pulling, respectively. Under
the locking condition, ϕosc(n) and ϕinj(n) should satisfy

$$\phi_{osc}(n) + \phi_{inj}(n) = 2N\pi, \tag{A1}$$

where N is the factor of harmonic injection. The phase accumulation produced by the
oscillator self-oscillation can be calculated by

$$\phi_{osc}(n) = \frac{2N\pi\omega_0}{\omega_{lock}}, \tag{A2}$$
Figure A1: Phase accumulation behavior of the ILO. (a) Output waveform of the ILO
in one injection period, (b) flow-chart diagram of the phase accumulation, and (c)
intuitive diagram of the phase accumulation.
where ω0 stands for the free-running frequency of the oscillator and ωlock represents
the target frequency of the ILO when it is locked to the injection signal. Accordingly,
the phase shift contributed by the injection pulling should be

$$\phi_{inj}(n) = \frac{2N\pi(\omega_{lock} - \omega_0)}{\omega_{lock}}. \tag{A3}$$
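As a quick numerical check of Eqs. (A2) and (A3), the sketch below splits the 2Nπ phase of one injection period into its self-oscillation and injection-pulling parts. The function name `phase_split` and the operating point (N = 4, oscillator 1% below the locked frequency) are illustrative assumptions, not values from the measured design.

```python
import math


def phase_split(n_harm: int, f0: float, f_lock: float) -> tuple[float, float]:
    """Split the 2*N*pi phase of one injection period per Eqs. (A2) and (A3).

    Frequencies may be given in Hz because only the ratio f0 / f_lock matters.
    """
    total = 2 * n_harm * math.pi
    phi_osc = total * f0 / f_lock             # Eq. (A2): self-oscillation part
    phi_inj = total * (f_lock - f0) / f_lock  # Eq. (A3): injection-pulling part
    return phi_osc, phi_inj


# Hypothetical operating point: N = 4, free-running frequency 1% below lock.
N_HARM = 4
phi_osc, phi_inj = phase_split(N_HARM, f0=9.9e9, f_lock=10e9)
print(f"phi_osc = {phi_osc:.4f} rad, phi_inj = {phi_inj:.4f} rad")
print(f"sum = {phi_osc + phi_inj:.4f} rad, 2*N*pi = {2 * N_HARM * math.pi:.4f} rad")
```

The two portions always add up to 2Nπ, and ϕinj vanishes when the free-running frequency already equals the locked frequency.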
Because the ILO output phase can be calculated by summing the discrete phases of the
successive injection periods, the ILO can be modeled as a discrete phase integrator with
an updating period of Tinj. The phase accumulation behavior of the ILO is described in
Fig. A1(b), which can be transformed into Fig. A1(c) to give a more instructive view.
Corresponding to the phase contributions in each injection period, the total output phase
θout(n) of the ILO is also divided into θosc(n) and θinj(n), where the former denotes the
phase accumulated by the oscillator self-oscillation and the latter represents the
summation of the phase shifts produced by the injection
event.

Figure A2: Model of the ILO. (a) Signal flow chart and (b) linear model.
According to the discussion in [154], the phase shift ϕinj is a function of the
instantaneous phase difference between the injection reference signal and the oscillator
output, φss = θref − θout/N, which is usually written as ϕinj = P(φss). Here, the phase
difference is defined as the horizontal coordinate of the oscillator output crossing point
located inside the injection pulse, relative to the center of the injection pulse.
Embedding this phase shift function into Fig. A1(c) yields the complete signal flow chart
of the ILO [see Fig. A2(a)]. Since the integration of ϕosc(n) equals the output phase of
the free-running oscillator, the right part of the chart surrounded by the dashed line is
replaced with θosc. When the ILO reaches steady state with a relative phase difference of
φss, it can be modeled as a linear phase transfer system for small-signal analysis. The
phase shift ϕinj can then be treated as a linear function of the phase difference φss with
a factor of β, which can be approximated by the instantaneous slope of P(φss) at φss,lock,

$$\beta = \left.\frac{dP(\varphi_{ss})}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}. \tag{A4}$$
Then the linear model of the ILO is constructed as shown in Fig. A2(b). To explore
the phase transfer characteristics, the closed-loop characteristic equation is formulated
as

$$\left[\theta_{ref}(s) - \frac{\theta_{out}(s)}{N}\right]\cdot\beta\cdot\frac{1}{1 - z^{-1}} + \frac{\omega_{osc}}{s} = \theta_{out}(s), \tag{A5}$$

where ωosc is the angular frequency of the oscillator. For frequencies well below the
injection rate, $z^{-1} = e^{-sT_{inj}} \approx 1 - sT_{inj}$, so the discrete transfer
function $1/(1 - z^{-1})$ can be approximated by the continuous transfer function
$1/(sT_{inj})$, where Tinj is the period of the injection signal (i.e., the sampling
period). Substituting this approximation into Eq. (A5) and rearranging it yields the ILO
closed-loop transfer function,

$$\theta_{out}(s) = \theta_{ref}(s)\cdot\frac{N\beta}{sNT_{inj} + \beta} + \frac{\omega_{osc}}{s}\cdot\frac{sNT_{inj}}{sNT_{inj} + \beta}. \tag{A6}$$

From Eq. (A6), the phase transfer function of the input reference is

$$H_{ref}(s) = \frac{N\beta}{sNT_{inj} + \beta} = \frac{N}{1 + \dfrac{s}{\beta/(NT_{inj})}} = \frac{N}{1 + \dfrac{s}{\omega_{TB}}} = N H_{inj}(s), \tag{A7}$$

where Hinj(s) is Href(s) normalized by N. The phase transfer function Href(s) is a
first-order LPF with a left-half-plane pole located at ωTB = β/(NTinj) and a DC gain of
20log(N) dB. Hence the ILO tracks the low-frequency noise of the input reference.
Reviewing Eq. (A6), the phase transfer function of the oscillator can be written as

$$H_{osc}(s) = \frac{sNT_{inj}}{sNT_{inj} + \beta} = 1 - H_{inj}(s). \tag{A8}$$

Hosc(s) is a first-order HPF with the same pole as Href(s) and a high-frequency gain of
0 dB. The ILO therefore suppresses the low-frequency noise of the oscillator with a slope
of 20 dB/dec.
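The low-pass and high-pass behavior of Eqs. (A7) and (A8) can be checked numerically. The following is a minimal sketch; the values N = 4, β = 0.2, and a 2.5 GHz injection rate are assumptions chosen only for illustration, and the helper names `h_ref`, `h_osc`, and `db` are hypothetical.

```python
import math


def h_ref(f_hz, n, beta, t_inj):
    """Reference-to-output phase transfer function, Eq. (A7)."""
    s = 1j * 2 * math.pi * f_hz
    return n * beta / (s * n * t_inj + beta)


def h_osc(f_hz, n, beta, t_inj):
    """Oscillator-to-output phase transfer function, Eq. (A8)."""
    s = 1j * 2 * math.pi * f_hz
    return s * n * t_inj / (s * n * t_inj + beta)


def db(x):
    """Magnitude in dB."""
    return 20 * math.log10(abs(x))


# Assumed example values (not taken from the measured chip).
N, BETA, T_INJ = 4, 0.2, 1 / 2.5e9
F_TB = BETA / (N * T_INJ) / (2 * math.pi)  # pole frequency in Hz

print(f"f_TB = {F_TB / 1e6:.2f} MHz")
for f in (F_TB / 100, F_TB / 10, F_TB, 10 * F_TB):
    print(f"f = {f / 1e6:9.3f} MHz  |Href| = {db(h_ref(f, N, BETA, T_INJ)):7.2f} dB"
          f"  |Hosc| = {db(h_osc(f, N, BETA, T_INJ)):7.2f} dB")
```

Well below the pole, |Href| approaches 20log(N) dB while |Hosc| rises by about 20 dB per decade; at the pole frequency both are 3 dB below their passband values.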
According to the discussion in [154], the relative phase difference settles to a
steady state, φss, where each injection event causes a phase shift P(φss) that is just
sufficient to cancel the phase drift resulting from the frequency offset. This condition
can be expressed by

$$P(\varphi_{ss}) = \frac{2N\pi(f_{lock} - f_0)}{f_{lock}}, \tag{A9}$$

where N is the multiplication factor, flock denotes the locked frequency, and f0
represents the free-running frequency of the oscillator. Each frequency offset
corresponds to a different steady state φss. Assume there is a small phase perturbation
∆θinj in the injection signal; the output phase perturbation ∆θout can then be predicted
by β∆θinj, where β is the instantaneous slope of P(φss). It can be obtained by taking the
derivative of Eq. (A9), resulting in

$$\beta = \left.\frac{dP(\varphi_{ss})}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}. \tag{A10}$$

Note that a small perturbation in the injection signal tends to cause an instantaneous
change in the output frequency; hence the output frequency flock can be considered as
the intermediate variable of P(φss). Substituting Eq. (A9) into Eq. (A10) and
simplifying it using flock ≈ f0 gives

$$\beta = \frac{2N\pi}{f_{lock}}\cdot\left.\frac{df_{lock}}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}. \tag{A11}$$

Substituting Eq. (A11) into ωTB = β/(NTinj), combining with ωTB = 2πfTB, and noting that
one injection period spans N output cycles (flock Tinj = N), we get the tracking
bandwidth,

$$f_{TB} = \frac{1}{N}\cdot\left.\frac{df_{lock}}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}. \tag{A12}$$
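The consistency between ωTB = β/(NTinj) and Eq. (A12) can be verified with a few lines of arithmetic. This sketch assumes that one injection period spans N output cycles (Tinj = N/flock); the values of N, flock, and β are hypothetical.

```python
import math


def f_tb_from_beta(beta, n, f_lock):
    """Tracking bandwidth from the pole omega_TB = beta / (N * T_inj)."""
    t_inj = n / f_lock  # assumption: one injection period spans N output cycles
    omega_tb = beta / (n * t_inj)
    return omega_tb / (2 * math.pi)


def f_tb_from_slope(dflock_dphi, n):
    """Tracking bandwidth from Eq. (A12)."""
    return dflock_dphi / n


# Assumed example values (illustrative only).
N, F_LOCK, BETA = 4, 10e9, 0.2
DFLOCK_DPHI = BETA * F_LOCK / (2 * N * math.pi)  # Eq. (A11) solved for the slope

print(f"from beta : {f_tb_from_beta(BETA, N, F_LOCK) / 1e6:.3f} MHz")
print(f"from slope: {f_tb_from_slope(DFLOCK_DPHI, N) / 1e6:.3f} MHz")
```

Both routes give the same bandwidth, which is the point of the substitution leading to Eq. (A12).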
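The tracking bandwidth can also be obtained by an intuitive transient analysis.
Based on the deduced slope β of the phase shift P(φss) with respect to φss, the output
phase perturbation caused by a small injection phase step ∆θinj is

$$\Delta\theta_{out} = \frac{2N\pi}{f_{lock}}\cdot\left.\frac{df_{lock}}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}\cdot\Delta\theta_{inj}. \tag{A13}$$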
Assume that the first-order phase transfer function of the IL-RVCO is

$$H_{inj}(s) = \frac{N}{1 + \dfrac{s}{\omega_{TB}}}, \tag{A14}$$

where N is the harmonic factor of the IL-RVCO and ωTB is the angular frequency of
the tracking bandwidth. Its transient response to a small step input ∆θinj, evaluated
after one injection period Tinj, should then be

$$\Delta\theta_{out} = N\left(1 - e^{-\omega_{TB}T_{inj}}\right)\Delta\theta_{inj}. \tag{A15}$$

Within one injection period, ωTB Tinj can be considered much smaller than 1, so
$(1 - e^{-\omega_{TB}T_{inj}})$ can be approximated by ωTB Tinj. Correspondingly, Eq. (A15)
can be simplified as

$$\Delta\theta_{out} \approx N\omega_{TB}T_{inj}\,\Delta\theta_{inj}. \tag{A16}$$

Comparing Eq. (A13) with Eq. (A16) gives
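$$\frac{2N\pi}{f_{lock}}\cdot\left.\frac{df_{lock}}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}} = N\omega_{TB}T_{inj}. \tag{A17}$$

Solving Eq. (A17) for ωTB gives

$$\omega_{TB} = \frac{2\pi}{f_{lock}T_{inj}}\cdot\left.\frac{df_{lock}}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}. \tag{A18}$$

Using ωTB = 2πfTB and flock Tinj = N, the tracking bandwidth is

$$f_{TB} = \frac{1}{N}\cdot\left.\frac{df_{lock}}{d\varphi_{ss}}\right|_{\varphi_{ss}=\varphi_{ss,lock}}, \tag{A19}$$

which agrees with Eq. (A12).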
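The transient derivation relies on approximating (1 − e^(−ωTB·Tinj)) by ωTB·Tinj. The sketch below gauges how good that approximation is, using the relation ωTB·Tinj = β/N that follows from ωTB = β/(NTinj). The β values and the helper names are illustrative assumptions.

```python
import math


def step_exact(n, x):
    """First-order step response after one period: N * (1 - exp(-x))."""
    return n * -math.expm1(-x)


def step_linear(n, x):
    """Small-argument approximation: N * x."""
    return n * x


def rel_error(n, x):
    """Relative overestimate of the linear approximation."""
    return step_linear(n, x) / step_exact(n, x) - 1


N = 4  # assumed harmonic factor
for beta in (0.05, 0.2, 0.8):
    x = beta / N  # omega_TB * T_inj
    print(f"beta = {beta:4.2f}  omega_TB*T_inj = {x:6.4f}  "
          f"relative error = {100 * rel_error(N, x):5.2f} %")
```

The relative error is roughly half of ωTB·Tinj, so the linearization is accurate as long as β is small compared with N.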
SZF Iteration
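$$\| x \|_1 = \sum_{i=1}^{\ell} | x_i |. \tag{A21}$$

In addition, the 1-norm for matrix $A = (a_{ij})_{i,j=1,2,\cdots,\ell}$ has the following two
equivalent definitions [165],

$$\| A \|_1 = \sup_{x \in \mathbb{R}^{\ell}} \frac{\| Ax \|_1}{\| x \|_1}, \tag{A22}$$

$$\| A \|_1 = \max_{1 \le j \le \ell} \sum_{i=1}^{\ell} |a_{ij}|. \tag{A23}$$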
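The equivalence of Eqs. (A22) and (A23) can be illustrated numerically: the maximum absolute column sum is attained by a unit vector and is never exceeded by any other vector. The matrix `A` and the helper names below are arbitrary examples, not quantities from the SZF algorithm.

```python
import random


def vec_norm1(x):
    """Vector 1-norm, Eq. (A21)."""
    return sum(abs(v) for v in x)


def mat_vec(a, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def mat_norm1(a):
    """Matrix 1-norm as the maximum absolute column sum, Eq. (A23)."""
    n_cols = len(a[0])
    return max(sum(abs(row[j]) for row in a) for j in range(n_cols))


A = [[1.0, -2.0], [3.0, 4.0]]  # arbitrary example; column sums are 4 and 6
norm_a = mat_norm1(A)

# The ratio in Eq. (A22) is attained at the unit vector of the largest column.
ratio_unit = vec_norm1(mat_vec(A, [0.0, 1.0])) / vec_norm1([0.0, 1.0])

# Random vectors never exceed the column-sum value.
rng = random.Random(0)
worst = 0.0
for _ in range(1000):
    x = [rng.uniform(-1, 1) for _ in range(2)]
    worst = max(worst, vec_norm1(mat_vec(A, x)) / vec_norm1(x))

print(norm_a, ratio_unit, worst <= norm_a)
```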
$$
\begin{aligned}
\| x^{n} - x^{m} \|_1 &\le \| x^{n} - x^{n-1} \|_1 + \cdots + \| x^{m+1} - x^{m} \|_1 \\
&\le \| T \|_1^{n-1} \| x^{1} - x^{0} \|_1 + \cdots + \| T \|_1^{m} \| x^{1} - x^{0} \|_1 \\
&\le \| T \|_1^{m} \sum_{k=0}^{\infty} \| T \|_1^{k} \, \| x^{1} - x^{0} \|_1 \\
&\le \| T \|_1^{m} \, \frac{1}{1 - \| T \|_1} \, \| x^{1} - x^{0} \|_1,
\end{aligned} \tag{A24}
$$

where Eq. (A22) and a simple iteration are used for deducing the second inequality.
When the condition $\| T \|_1 < 1$ is satisfied, we have $\lim_{m \to \infty} \| T \|_1^{m} = 0$.
Hence, for all ε > 0, there exists a constant M > 0 such that for any n, m ≥ M, the
following inequality holds,

$$\| x^{n} - x^{m} \|_1 \le \varepsilon. \tag{A25}$$
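The Cauchy-sequence bound of Eq. (A24) can be checked on a toy example. The sketch assumes an affine fixed-point iteration of the form x^(k+1) = T·x^k + c, which is a generic stand-in and not necessarily the exact SZF update; the matrix `T`, the vector `C`, and the starting point are hypothetical.

```python
def norm1(x):
    """Vector 1-norm."""
    return sum(abs(v) for v in x)


def mat_norm1(t):
    """Matrix 1-norm as the maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in t) for j in range(len(t[0])))


def iterate(t, c, x0, steps):
    """Return [x^0, x^1, ..., x^steps] for the assumed update x <- T x + c."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append([sum(t_ij * x_j for t_ij, x_j in zip(row, x)) + c_i
                   for row, c_i in zip(t, c)])
    return xs


def diff(a, b):
    return [u - v for u, v in zip(a, b)]


# Hypothetical contraction: column sums 0.7 and 0.4, so ||T||_1 = 0.7 < 1.
T = [[0.5, 0.1], [0.2, 0.3]]
C = [1.0, 1.0]
xs = iterate(T, C, [0.0, 0.0], 40)

q = mat_norm1(T)
d0 = norm1(diff(xs[1], xs[0]))


def cauchy_bound(m):
    """Right-hand side of Eq. (A24)."""
    return q ** m / (1 - q) * d0


for m in (0, 5, 10, 20):
    lhs = norm1(diff(xs[40], xs[m]))
    print(f"m = {m:2d}  ||x^40 - x^m||_1 = {lhs:.3e}  bound = {cauchy_bound(m):.3e}")
```

The distance between iterates stays below the bound and shrinks geometrically with m, which is the convergence behavior that the condition on the 1-norm guarantees.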
$$\| I - \lambda B \|_1 < 1,$$