# A 2.8 mW/Gb/s, 14 Gb/s Serial Link Transceiver

Saurabh Saxena, Member, IEEE, Guanghua Shu, Student Member, IEEE, Romesh Kumar Nandwana, Student Member, IEEE, Mrunmay Talegaonkar, Ahmed Elkholy, Student Member, IEEE, Tejasvi Anand, Member, IEEE, Woo-Seok Choi, Student Member, IEEE, and Pavan Kumar Hanumolu, Member, IEEE

Abstract—Design techniques to improve energy efficiency of serial link transceivers are presented. Power consumption is reduced by using: 1) low-power clock generation, recovery, and distribution schemes; 2) charge-based circuits to implement analog front-end and samplers/flip-flops; and 3) a partially segmented voltage-mode (VM) output driver. An LC-oscillator based digital phase-locked loop (PLL) is used to generate a low-jitter clock that is shared between the transmitter (Tx) and receiver (Rx). The clock recovery unit uses a local ring-oscillator based PLL to reduce the number of phase interpolators and the amount of high-frequency clock distribution. Charge-based samplers that were shown to operate with limited return-to-zero voltage swings and consume only dynamic power are modified to provide non-return-to-zero outputs and used extensively in the deserializer and Rx front-end circuits. A partially segmented VM output driver with embedded 2-tap de-emphasis is proposed to reduce power consumption of pre-drivers. Fabricated in a 65 nm CMOS process, the 14 Gb/s transceiver prototype employs the aforementioned techniques and achieves an energy efficiency of 2.8 mW/Gb/s. The Tx achieves a phase margin of 0.36 UI (BER = $10^{-12}$) at the end of an 11 dB loss channel with an energy efficiency of 0.89 mW/Gb/s. The Rx recovers clock with 1.8 ps<sub>rms</sub> long-term absolute jitter at BER $< 10^{-12}$ and achieves an energy efficiency of 1.69 mW/Gb/s. The LC-oscillator based digital PLL achieves an integrated jitter of 0.605 ps<sub>rms</sub> with an energy efficiency of 0.5 mW/GHz at 7 GHz output frequency.

*Index Terms*—Charge-based flip-flop (CFF), digital clock and data recovery (CDR), voltage-mode (VM) transmitter (Tx).

## I. INTRODUCTION

THE demand for off-chip I/O bandwidth is constantly increasing in order to meet the requirements of modern multi-core processors and server platforms [1], [2]. Fig. 1 shows performance metrics for state-of-the-art serial link transceivers with embedded clocking published in the last 15 years (2001–2015). Aggressive technology scaling as governed by

Manuscript received August 1, 2016; revised November 23, 2016; accepted December 21, 2016. Date of publication March 30, 2017; date of current version April 20, 2017. This work was supported in part by Analog Devices and in part by Intel. This paper was approved by Associate Editor Azita Emami.

S. Saxena is with the Department of Electrical Engineering, IIT Madras, Chennai 600036, India (e-mail: ssaurabh0204@gmail.com).

G. Shu is with Oracle Labs VLSI Research, Belmont, CA 94002 USA.

R. K. Nandwana, A. Elkholy, W.-S. Choi, and P. K. Hanumolu are with the Department of Electrical and Computer Engineering, University of Illinois, Urbana–Champaign, IL 61801 USA.

M. Talegaonkar is with Infi Corporation, Irvine, CA 92617 USA.

T. Anand is with the Department of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331 USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSSC.2016.2645738

Moore's law [3] has helped to increase per-pin bandwidth and improve energy efficiency as illustrated by the trends depicted in Fig. 1(a) and (b), respectively. Plotting energy efficiency versus data rate, as shown in Fig. 1(c), indicates that state-of-the-art serial link transceivers with embedded clocking achieve an energy efficiency of about 4 pJ/bit. This lower bound on energy/bit is a byproduct of increasing data rate while using the same channel (for legacy and cost reasons). As benefits from technology and voltage scaling taper off, new serial link architectures and circuit techniques are needed to reduce energy consumption. In this paper, we identify major power-hungry operations in a serial link and propose techniques to implement them in an energy-efficient manner.

To this end, consider a typical block diagram of a serial link shown in Fig. 2. On the receiver (Rx) side, the received signal is terminated, typically with a 50 $\Omega$ resistor, amplified, and sampled/sliced to recover data that is subsequently deserialized. The sampling clock, RCLK, is recovered from the received signal using a clock recovery unit (CRU). Front-end samplers and the CRU consume a significant portion of the Rx power. Fig. 3 shows a block diagram of a commonly used phase interpolator (PI)-based digital clock and data recovery (CDR) [4]–[7]. In this particular half-rate implementation, input sampling phase error is detected by a bang-bang phase detector (!!PD) and processed by a digital loop filter (DLF) whose output, $D_F$, drives the PIs. PIs interpolate multiple clock phases provided by a multiphase generator (MPG) and correct for sampling phase/frequency errors. While the PI-based digital CDR architecture is commonly used, high-speed distribution of equally spaced multiphase clocks from the MPG to the PIs and the need for multiple PIs in sub-rate architectures [5], [6] increase its power dissipation. Providing multiple clock phases to several Rx lanes operating in parallel further exacerbates this issue. In this paper, we present a CDR architecture that alleviates the power consumption issue in the CRU.

Samplers used in the Rx front-end (Rx FE) and deserializer also consume a significant portion of the Rx power. Both full-swing sense-amplifier flip-flops (SAFFs) [8] and low-swing current mode logic (CML) samplers consume large power due to dynamic and static power dissipation, respectively. A low-swing charge-steered sampler that consumes very little dynamic power was proposed in [9] to address the issue of sampler power dissipation. While it offers an attractive alternative to conventional CML or CMOS samplers, its return-to-zero (RZ) operation and the need for I/Q clock phases when used for deserialization [9]–[12]

0018-9200 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.html for more information.



Fig. 1. (a) Data rate versus process node, (b) energy efficiency versus process node, and (c) energy efficiency versus data rate for embedded serial links published over the last 15 years (2001–2015).



Fig. 2. Block diagram of a serial link with embedded clocking.



Fig. 3. Block diagram of a half-rate PI-based digital CDR.

diminish its power benefits. In this paper, we introduce a charge-based flip-flop (CFF) with non-return-to-zero (NRZ) output and low dynamic power consumption and demonstrate its efficacy when used for deserialization.

On the transmitter (Tx) side, the most power-hungry blocks are high-speed portions of the serializer, pre-driver, and output (O/P) driver. Because serializer power is by and large determined by the technology and data rate, the majority of efforts have focused on improving energy efficiency of the pre-driver and output driver. One popular approach is using a voltage-mode (VM) driver [13] in place of a classical current-mode (CM) driver. While a VM driver is, in principle, more power efficient, its efficiency is degraded once de-emphasis is incorporated into it. Multiple design techniques introduced in [14]–[17] have demonstrated various means to improve efficiency of VM drivers with de-emphasis. However, performing de-emphasis with fine resolution while keeping output impedance matched to the channel characteristic impedance increases pre-driver power consumption [17] and degrades energy efficiency of VM transmitters. In this paper, we propose a partially segmented VM driver with embedded 2-tap FIR equalizer to alleviate this tradeoff.

The rest of this paper is organized as follows. The architecture used to implement a 14 Gb/s transceiver is described



Fig. 4. Block diagram of the proposed 14 Gb/s transceiver.

in Section II. Section III elaborates on the proposed CDR architecture. In Section IV, an NRZ CFF is presented along with its usage in charge-based Rx FE. The partially segmented VM Tx is described in Section V. The effectiveness of the above methods is demonstrated with measurement results obtained from the transceiver prototype in Section VI and key contributions are summarized in Section VII.

## II. PROPOSED ARCHITECTURE

A block diagram of the proposed transceiver (XCVR) is shown in Fig. 4. It consists of an *LC*-oscillator based digital phase-locked loop (*LC*-DPLL), a partially segmented VM Tx, a half-rate digital CDR, and a ring PLL (RPLL)-based MPG. The *LC*-DPLL generates a low-jitter half-rate 7 GHz clock, which is shared between the Tx and Rx. On the Tx side, an on-chip PRBS generator uses a divided 7 GHz clock and generates 16 parallel streams of random data at 0.875 Gb/s. A 16:1 serializer serializes these low-rate parallel streams to 14 Gb/s full-rate data. An N-over-N based partially segmented VM output (O/P) driver with embedded 2-tap de-emphasis launches the full-rate data onto the channel. The Tx output swing is controlled by regulating the supply voltage of the O/P driver ($V_{ODRV}$), and the output impedance of the O/P driver is matched to the channel characteristic impedance by regulating the supply voltage of the pre-driver ($V_{PDRV}$).

The Rx FE includes a wide-bandwidth front-end amplifier, charge-based DATA/EDGE samplers, and a 4:32 charge-based deserializer. A fully synthesized !!PD operates on deserialized DATA/EDGE samples and outputs the sign of the phase error in the form of early/late (E/L) output signals. These E/L signals are filtered by a DLF. Instead of controlling the PI with the sum of proportional and integral control signals as is commonly done [4]–[7], only the integral control is implemented through the PI path. To this end, the integral control output from the DLF is integrated by a phase accumulator ACC<sub>PI</sub> and is used to control the PI. The RPLL multiplies the divided PI output, $F_{REF,RPLL}$, and generates the desired I/Q sampling phases (RCLK) needed for samplers in the half-rate CDR. Since the RPLL is embedded in the CDR loop, its finite bandwidth increases loop delay and deteriorates the CDR's dithering jitter performance. To reduce loop delay, the CDR's proportional path control signal from the DLF is directly fed into the RPLL instead of controlling the PI. The proposed multiphase sampling clock generation using the RPLL has two advantages compared with the standard PI-based CDR. First, it reduces the number of PIs needed to generate multiple sampling phases, thereby reducing overall PI area and power. Second, placing a PI close to the high-frequency clock source and distributing only the low-frequency $F_{REF,RPLL}$ to the RPLL minimizes clock distribution power. Implementation details of key building blocks of the transceiver are discussed in Sections III, IV, and V.

## III. CLOCK AND DATA RECOVERY

The proposed CDR architecture has evolved from a conventional PI-based Type-II digital CDR architecture, as shown in Fig. 5. Fig. 5(a) shows a simplified block diagram of a conventional sub-rate digital CDR [4]–[7] where the phase error ($\Phi_{err}$) between the input bit stream, D<sub>IN</sub>, and the sampling clock, RCLK, is detected by a !!PD. The !!PD's output is filtered by a first-order DLF and then used to control sampling clock phases using a roll-over accumulator, ACC<sub>PI</sub>, and PIs. High-speed multiphase clock distribution with precise phase spacing from the MPG to the PIs, along with multiple PIs, consumes significant power and degrades the power efficiency of this CDR.
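The behavior of a conventional Type-II bang-bang loop can be illustrated with a short behavioral model. The sketch below is not the authors' implementation: the function name, the gains `kp` and `ki`, and the frequency offset `df_ui` are hypothetical values chosen only for illustration, phase is tracked in UI without PI roll-over, and loop delay is ignored.

```python
def simulate_bb_cdr(df_ui=0.002, kp=0.01, ki=1e-4, steps=5000):
    """Behavioral Type-II bang-bang CDR loop (illustrative values only).

    df_ui : input frequency offset, in UI of phase drift per loop update
    kp    : proportional-path phase step per update, in UI
    ki    : integral-path (frequency) step per update, in UI/update
    """
    phi_out = 0.0   # recovered clock phase, in UI
    freq = 0.0      # integral-path state: estimated frequency offset
    errors, freqs = [], []
    for n in range(steps):
        phi_in = df_ui * n                   # input phase ramps with the offset
        err = phi_in - phi_out               # true sampling phase error
        early_late = 1 if err >= 0 else -1   # !!PD keeps only the sign
        freq += ki * early_late              # integral path accumulates E/L
        phi_out += kp * early_late + freq    # proportional step + frequency term
        errors.append(err)
        freqs.append(freq)
    return errors, freqs


if __name__ == "__main__":
    errors, freqs = simulate_bb_cdr()
    print("peak |phase error| over last 1000 updates (UI):",
          max(abs(e) for e in errors[-1000:]))
    print("integral-path frequency estimate (UI/update):", freqs[-1])
```

With these example gains the integral path settles near the applied offset while the phase error dithers within roughly one proportional step, which is the dithering jitter that grows once loop delay is added.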

Multiphase high-speed clock distribution and the number of PIs can be minimized by reversing the order of the MPG and PIs, as shown in Fig. 5(b). Using the four phases generated by the divide-by-2 stage, the accumulator $ACC_{PI}$ together with a PI generates a single-phase recovered clock. Multiple phases of this PI output are generated using an MPG that can be implemented using a PLL or delay-locked loop. Implementing the MPG as a clock multiplier helps reduce the PI output frequency [see Fig. 5(c)], thus reducing the amount of high-frequency clock distribution further. However, the finite bandwidth of the MPG along with the delay introduced by $ACC_{PI} + PI$ increases CDR loop delay, which results in undesirable jitter peaking and degraded jitter tolerance (JTOL).

In view of this, CDR architecture shown in Fig. 5(c) seeks to reduce loop delay by: 1) maximizing the bandwidth of



Fig. 5. Evolution of the proposed CDR from conventional Type-II PI-based CDR.

the MPG, which is implemented using a ring-oscillator based PLL (RPLL) and 2) bypassing PI and divider and implementing proportional control inside the RPLL. Some hardware overhead is reduced by sharing ACC<sub>PI</sub> between proportional and integral paths as illustrated in Fig. 5(d).

The detailed block diagram of the proposed CDR is depicted in Fig. 6(a). The RPLL uses an architecture similar to the one reported in [18] containing analog proportional path and digital integral path. CDR's proportional control is implemented by adding accumulated phase error directly to the current controlled ring oscillator (CCO) and to the integral path of the RPLL. Direct control of the oscillator through CDAC corrects for high-frequency phase perturbations at the input ( $\Phi_{in}$ ). Phase addition with gain  $K_F$  through RPLL's integral path compensates for low-frequency sampling phase error. The open loop gain only through CDR's proportional path is given by

$$LG_{prop,cdr}(s) = \left. \frac{\Phi_{out}}{\Phi_{in}} \right|_{prop,cdr} = \frac{K_{bbpd} f_{ref}}{s} \times \frac{N_r LG_{RPLL}(s)}{1 + LG_{RPLL}(s)} \left[ \frac{K_{pc,pr} + \frac{K_{pc,ir} f_{ref}}{s}}{K_{pr} + \frac{K_{ir} f_{ref}}{s}} \right] \tag{1}$$



Fig. 6. (a) Block diagram for CDR's proportional and integral path controls through the ring PLL. (b) Open loop gains for CDR's proportional path with and without integral path control in the ring PLL.



Fig. 7. Detailed block diagram of the Rx FE.

where $LG_{RPLL}(s)$ is the open-loop gain of the RPLL, $N_r$ is the divider ratio in the RPLL's feedback path, and $K_{pr}$ and $K_{ir}$ are the proportional and integral path gains within the RPLL, respectively. In the CDR's proportional path, $K_{bbpd}$ is the gain of the !!PD, and $K_{pc,pr}$ and $K_{pc,ir}$ are proportional path gains ($\Phi_{out}/\Phi_{err+}$) through the CCO and the RPLL's integral path, respectively. Under the approximation that the RPLL has much larger bandwidth than that of the CDR, $LG_{RPLL}/(1 + LG_{RPLL}) \approx 1$ holds true for frequencies within the CDR's jitter transfer bandwidth. Denoting the CDR's proportional path gains as scaled versions of the RPLL's loop gain parameters, i.e., $K_{pc,pr} = \alpha K_{pr}$ and $K_{pc,ir} = \alpha K_{ir}$, simplifies (1) to (2), which is the same as the proportional path gain of a conventional Type-II CDR loop.

$$LG_{prop,cdr}(s) = \left. \frac{\Phi_{out}}{\Phi_{in}} \right|_{prop,cdr} = \frac{\alpha N_r K_{bbpd} f_{ref}}{s}. \tag{2}$$

In the above expressions, it is assumed that all accumulators are clocked at $f_{ref}$ and loop delay is ignored. Fig. 6(b) shows the magnitude of the open-loop gain, $\Phi_{out}/\Phi_{in}$, for the CDR's proportional path when the RPLL's bandwidth is set to $f_{ref}/50$. Below the RPLL's bandwidth, the proportional path gain through the CDAC alone remains constant, while the sum of gains through the CDAC and the RPLL's integral path accumulator exhibits a 20 dB/decade roll-off as a function of input phase error frequency.
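The two behaviors described for Fig. 6(b) can be reproduced numerically from (1). The sketch below is only an illustration under stated assumptions: the RPLL open-loop gain is modeled as a lumped hypothetical form $G\,(K_{pr} + K_{ir} f_{ref}/s)/s$, and the values of `G`, `K_PR`, `K_IR`, `ALPHA`, `K_BBPD`, and `N_R` are placeholders rather than design values from the prototype.

```python
import math

# Assumed example values; only F_REF (0.875 GHz) is quoted in the text.
F_REF = 875e6                      # accumulator clock / RPLL reference, Hz
N_R = 8                            # RPLL feedback divider ratio (assumed)
K_BBPD = 1.0                       # !!PD gain (placeholder)
K_PR = 1.0                         # RPLL proportional gain (placeholder)
G = 2 * math.pi * F_REF / 50       # lumped gain: unity-gain near f_ref/50
K_IR = G / (10 * F_REF)            # puts the RPLL zero a decade below that
ALPHA = 0.01                       # CDR-to-RPLL gain scaling (placeholder)


def lg_rpll(s):
    """Hypothetical lumped model of the RPLL open-loop gain."""
    return G * (K_PR + K_IR * F_REF / s) / s


def lg_prop_cdr(f, k_pc_pr, k_pc_ir):
    """Evaluate equation (1) at offset frequency f (Hz)."""
    s = 2j * math.pi * f
    closed = N_R * lg_rpll(s) / (1 + lg_rpll(s))
    ratio = (k_pc_pr + k_pc_ir * F_REF / s) / (K_PR + K_IR * F_REF / s)
    return K_BBPD * F_REF / s * closed * ratio


def lg_prop_ideal(f):
    """Evaluate equation (2), the conventional Type-II proportional gain."""
    return ALPHA * N_R * K_BBPD * F_REF / (2j * math.pi * f)


if __name__ == "__main__":
    for f in (1e3, 1e4, 1e5, 1e6):
        both = abs(lg_prop_cdr(f, ALPHA * K_PR, ALPHA * K_IR))
        cdac = abs(lg_prop_cdr(f, ALPHA * K_PR, 0.0))
        print(f"{f:9.0f} Hz  CDAC+integral: {20 * math.log10(both):7.2f} dB"
              f"  CDAC only: {20 * math.log10(cdac):7.2f} dB")
```

In this toy model the CDAC-only gain is flat at offsets well below the RPLL's loop-filter zero, whereas enabling the scaled integral-path term restores the $1/s$ behavior of (2).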

Frequency error between the received data rate and the sampling clock generated by the RPLL is corrected through the integral path of the CDR [Fig. 6(a)]. The frequency error represented by the ACC<sub>IC</sub>'s output is integrated by the ACC<sub>PI</sub> and used to control the PI, thereby changing the reference clock frequency of the RPLL. The open-loop gain of the CDR's integral path is given by (3), which simplifies to the integral path loop gain of a Type-II CDR loop under the approximation $LG_{RPLL}/(1 + LG_{RPLL}) \approx 1$.

$$\begin{aligned} LG_{int,cdr}(s) &= \frac{K_{bbpd}K_{ic}f_{ref}^{2}}{s^{2}N_{c}} \times \frac{N_{r}LG_{RPLL}(s)}{1 + LG_{RPLL}(s)} \\ &\approx \frac{K_{bbpd}K_{ic}f_{ref}^{2}N_{r}}{s^{2}N_{c}} \end{aligned} \tag{3}$$

where  $K_{ic}$  is integral path gain from  $\Phi_{err+}$  to PI's output. Since !!PD has a limited frequency error detection capability, additional frequency acquisition aids like the ones reported in [19] or [20] can be used, if needed.

## IV. CHARGE-BASED RX FRONT-END

Fig. 7 shows a detailed block diagram of the Rx FE. A wide-bandwidth amplifier is used to drive half-rate DATA/EDGE samplers that were implemented using low-swing charge-based sense-amplifiers (CSAs) connected in series to improve sensitivity, similar to series-connected strong-ARM latch-based sense-amplifiers in [21]. The four half-rate sampled DATA/EDGE values are synchronized by low-swing CFFs (LS-CFFs). The synchronized samples are first deserialized by a factor of 4:8 and then by 8:16 using LS-CFF based 1:2 DMUX units. The last deserialization stage uses full-swing CFFs (FS-CFFs) to restore CMOS levels needed to interface deserialized DATA/EDGE samples to synthesized CDR logic. The building blocks of the charge-based front-end are described in detail below.



Fig. 8. (a) Schematic diagram of the proposed LS-CFF. (b) Transient signal waveforms in LS-CFF.



Fig. 9. (a) Differential output voltage swing versus input voltage for LS-CFF clocked at 7 GHz, and (b) output voltage swing and power consumption versus sampling frequency for LS-CFF.

### A. Low-Swing Charge-Based Flip-Flop

Fig. 8(a) shows the schematic of the proposed LS-CFF. It is composed of a CSA followed by a low-swing sample-and-hold circuit (LS-SHC). The CSA operates in two phases: 1) reset phase and 2) active phase. During the reset phase, clock $\Phi$ is low and outputs $V_{op1}/V_{on1}$ are set to $V_{DD}$, M<sub>1</sub> is switched OFF, and M<sub>0</sub> discharges the tail capacitor, C<sub>T</sub>, to GND [see Fig. 8(b)]. In the active phase ($\Phi$ is high), inputs $V_{ip}$, $V_{in}$ are sampled by $M_2, M_3$ and the voltage difference $V_{ip} - V_{in}$ is regenerated using cross-coupled inverters formed by transistors $M_{11} - M_{14}$. The regenerated output swing, $|\Delta V|_{op, csa} = |V_{op1} - V_{on1}|$, is limited by the charge transferred from $V_{op1}/V_{on1}$ to $C_T$. The source potential of the input pair, $V_x$, rises in proportion to $|\Delta V|_{op, csa}C_{p1}/C_T$, where $C_{p1}$ is the parasitic capacitance at the CSA's output nodes. This reduces the gate overdrive and $V_{DS}$ of the input pair, which is cut OFF by the end of the active phase. A smaller C<sub>T</sub> shuts down the input pair at a lower $|\Delta V|_{op, csa}$. In the limit, as $C_T \rightarrow \infty$, the CSA behaves more like a regular sense-amplifier with an output swing of $V_{DD}$. For a finite $C_T$, the CSA has a low output swing and an RZ pulse shape.

The second-stage sample-and-hold circuit [Fig. 8(a): LS-SHC] samples the RZ outputs of the CSA. During the active phase with $V_{\text{on1}} = V_{\text{DD}} - |\Delta V|_{\text{op, csa}} (|\Delta V|_{\text{op, csa}} \ge \text{GND})$ and $V_{\text{op1}} = V_{\text{DD}}$, M<sub>6</sub> charges output node $V_{\text{op}}$ to $V_{\text{DD}}$, M<sub>5</sub> discharges $V_{\text{on}}$, and M<sub>4</sub>/M<sub>7</sub> are switched OFF. Hence, the output swing $(V_{op} - V_{on})$ is proportional to $(V_{op1} - V_{on1})$. The CSA's limited swing restricts the minimum source node voltage of M<sub>4</sub>(M<sub>5</sub>) to $\min(V_{op1}, V_{on1}) + |V_{th}|$, leading to a differential output voltage swing of $V_{DD} - \min(V_{op1}, V_{on1}) - |V_{th}| = |\Delta V|_{op, csa} - |V_{th}|$, where $V_{th}$ is the threshold voltage of M<sub>4</sub>/M<sub>5</sub>. In the reset phase, transistors M<sub>4</sub> - M<sub>7</sub> are cut off as their gates are pulled to $V_{DD}$, and M<sub>8</sub> - M<sub>9</sub> help retain the sampled output. Thus, a CFF operates with low input/output swings and provides NRZ outputs.

The operation of an LS-CFF designed in a 65 nm process is verified by extensive transient simulations. The flip-flop is simulated with extracted RC parasitics, a 10 fF output load, a 7 GHz sinusoidal sampling clock, a 1 V supply voltage, and across a range of input voltage swings. Fig. 9(a) shows that the LS-CFF's differential output voltage swing is limited and is more than 90% of the final value when the differential input voltage is $\geq 50$ mV. The output voltage swing and power consumption (with 100 mV differential input) of the flip-flop are plotted as a function of sampling frequency $(f_{sam})$ in Fig. 9(b). While the power consumption is directly proportional to $f_{\rm sam}$, the output voltage swing is fairly constant across sampling frequencies. A slight variation (about 10%) in the output swing across $f_{\text{sam}}$ is due to the varying charge/discharge period and leakage of held voltages during the reset phase. Change in output swing due to PVT variations can be minimized by



Fig. 10. (a) Differential output voltage swing versus input voltage, and (b) power consumption versus input voltage for different values of tail capacitor for LS-CFF clocked at 7 GHz.



Fig. 11. (a) Output voltage swing versus frequency, and (b) power consumption versus frequency for optimized LS-CFF.



Fig. 12. (a) Schematic diagram of an FS-CFF. (b) Output voltage swing and power consumption versus sampling frequency for FS-CFF.



Fig. 13. Block diagram of a clock divider with tunable delay.

controlling C<sub>T</sub>. Fig. 10(a) depicts differential output voltage swing for varying input with different values of tail capacitor. The output swing saturates for higher input voltage and it is lower for smaller C<sub>T</sub>. Power consumption increases with C<sub>T</sub> as depicted in Fig. 10(b) and is roughly constant for input  $\geq$ 50 mV. LS-CFF's input referred offset has a standard deviation of  $\sigma_{offset} = 6.2$  mV. Using noise estimation method in [22], LS-CFF's input referred noise is estimated to be  $\sigma_{\text{noise}} = 0.44 \text{ mV}.$ 

Fig. 14. (a) Eye diagram at the output of front-end sampler. (b) Eye diagram at the output of the first deserialization stage.

Fig. 15. (a) Block diagram of the proposed partially segmented VM Tx with embedded de-emphasis. (b) Block diagram of the proposed VM Tx.

LS-CFF is used in various building blocks of the Rx, such as samplers, synchronizer, and deserializer (DMUX). Because speed requirements and loading constraints are different for each of these blocks, an LS-CFF is optimized for power based on loading and clock frequency while keeping the output swing the same. The optimization of power and area is crucial especially for deserialization where the number of 1:2 DMUX units increases by $2\times$ with each stage while the sampling frequency reduces by 2. Simulation results of an LS-CFF optimized at three different sampling frequencies (8 GHz, 4 GHz, and 2 GHz) shown in Fig. 11(a) illustrate fairly constant output voltage swing across different sampling frequencies. Fig. 11(b) confirms reduction in the power consumption at a given frequency. The limited swing outputs are scaled to full-swing signals in the last deserialization stage to enable processing with standard CMOS logic in the later stages. This is achieved by using an FS-CFF described next.
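The reason per-stage optimization matters can be seen from a simple scaling sketch. It assumes identical flip-flop cells whose dynamic power is proportional to clock frequency, and it uses a clock frequency normalized to the first stage; the function name and the unit counts (4, 8, 16, taken from a 4:32 deserializer built from 1:2 DMUX units) are illustrative.

```python
def stage_activity(first_stage_units=4, stages=3):
    """Relative dynamic power per deserializer stage for identical cells.

    Each stage doubles the number of 1:2 DMUX units and halves the clock
    frequency, so (units x relative frequency) is the same in every stage.
    """
    rows = []
    for k in range(stages):
        units = first_stage_units * 2 ** k   # 4, 8, 16 units
        rel_freq = 1.0 / 2 ** k              # 1, 1/2, 1/4 of the first clock
        rows.append((units, rel_freq, units * rel_freq))
    return rows


if __name__ == "__main__":
    for units, rel_freq, activity in stage_activity():
        print(f"{units:2d} units at {rel_freq:.2f} x f0 -> "
              f"relative power {activity:.1f}")
```

Because the product stays constant, an unoptimized cell would cost the same power in every stage; resizing the cell for the lower frequency and lighter timing constraints of later stages is what reduces the total.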

### B. Full-Swing Charge-Based Flip-Flop (FS-CFF)

Fig. 12(a) shows the circuit diagram of an FS-CFF. It consists of a CSA, which is the same as that used in the LS-CFF, followed by a full-swing SHC (FS-SHC). In contrast to the LS-CFF, the SHC in the FS-CFF employs an inverter-based cross-coupled latch that regenerates the CSA's output to full-swing signals. Unlike the LS-SHC, the cross-coupled latch at the FS-SHC's output limits the maximum frequency of operation, so the FS-CFF is suitable for use only in the later stages of deserialization. Fig. 12(b) shows the FS-CFF's output voltage swing and power consumption as a function of sampling frequency. The output voltage swing is constant and is equal to $V_{DD}$ whereas power consumption scales linearly with frequency.

Clock-to-Q delay for the TSPC logic-based clock divider at different stages of deserialization is different from the LS-CFF's clock-to-Q delay in the 1:2 DMUX, which is potentially detrimental in terms of timing margins across PVT variations. Fig. 13 depicts a clock divider with 1-bit (*Sel*) tunable delay to compensate for limited timing margins. Based on the *Sel* signal, the MUX selects between rising edges of $\Phi_{in}$ and $\overline{\Phi_{in}}$ as input to the divider, and the output gets delayed by half of the input clock period. The *Sel* signal is evaluated to be 1/0 during pattern synchronization at the output of the deserializer.

### C. Rx Front-End Simulation Results

To verify the low-swing operation of the charge-based Rx FE, it is simulated with extracted *RC* parasitics and 14 Gb/s, 7-bit PRBS data as input. Fig. 14(a) shows a simulated eye diagram at the output of front-end samplers. The differential output amplitude is roughly 400 mV and valid for 2-bit unit intervals (UIs). At the end of the first deserialization stage (4:8), outputs extend over 4-bit periods, as shown in Fig. 14(b). Similarly, output swings of 8:16 deserialization stage are limited. Voltage swings at the output of final deserialization stage are restored to rail-to-rail swings by the FS-CFF in the last stage. The low-swing and dynamic power consumption for samplers and deserializer help to reduce



Fig. 16. Block diagram of an LC-oscillator based digital PLL.

power consumption by roughly 40% when compared with the front-end implementation with strong-ARM latch-based flip-flops. The clocking power is roughly the same in both the cases and excluded in the above comparison.

## V. TRANSMITTER

The main idea behind the proposed VM Tx is based on the observation that the main cursor coefficient in a 2-tap equalizer is always greater than 0.5 due to the maximum output swing constraint. Fig. 15(a) shows a simplified block diagram of a partially segmented VM O/P driver that leverages this to implement 2-tap de-emphasis in a power-efficient manner. For simplicity, only a single-ended version of the implemented differential O/P driver is shown. It consists of a segment to implement a fixed main cursor and N uniformly segmented tunable equalizer cells. Each O/P driver segment is a source-terminated resistive segment. The fixed main cursor is controlled by current bit D[n] and tunable cells are controlled by both D[n] and D[n + 1] bits so as to achieve the desired de-emphasis magnitude. Output impedances of the main cursor and each tunable cell are equal to $R_{\text{MAIN}}$ $(\leq 2R_T)$ and $R_{\rm EQ}$ $(\geq N \times 2R_T)$, respectively [see Fig. 15(a)]. Hence, the total number of tunable segments is roughly halved compared with uniform segmentation for the same resolution $(V_{\text{ODRV}}/2N)$. The total output impedance of the driver is $R_{\rm MAIN}||(R_{\rm EQ}/N)$, which is designed to match the channel characteristic impedance $R_T$. A result of merging uniform equalizer segments to form a fixed main cursor is that the pre-driver and MUX sizing is not limited by technology. Consequently, their power consumption can be minimized by optimally sizing them to drive the O/P driver cell. Input-to-output delays for the main cursor and tunable cell paths should be matched to reduce circuit-induced ISI.
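The impedance and resolution claims can be checked with a small resistive model. This is a sketch under assumptions, not the implemented driver: it takes the boundary case where the main segment equals $2R_T$ and each tunable cell equals $N \times 2R_T$, assumes `k` tunable cells pull in the same direction as the main segment while the remaining cells pull the opposite way, and uses hypothetical values for `n_cells` and `v_odrv`.

```python
def parallel(*resistances):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)


def driver_model(r_t=50.0, n_cells=8, v_odrv=1.0):
    """Single-ended model of a partially segmented VM driver (illustrative).

    Returns the output impedance and the open-circuit (Thevenin) output
    levels when the main segment and k tunable cells pull to v_odrv while
    the other n_cells - k cells pull to ground.
    """
    r_main = 2.0 * r_t              # fixed main-cursor segment
    r_cell = n_cells * 2.0 * r_t    # each tunable equalizer cell
    r_out = parallel(r_main, r_cell / n_cells)

    g_total = 1.0 / r_main + n_cells / r_cell
    levels = []
    for k in range(n_cells + 1):
        g_high = 1.0 / r_main + k / r_cell   # conductance tied to v_odrv
        levels.append(v_odrv * g_high / g_total)
    return r_out, levels


if __name__ == "__main__":
    r_out, levels = driver_model()
    print("output impedance (ohm):", r_out)
    print("Thevenin levels (V):", [round(v, 4) for v in levels])
```

In this model the levels run from half of the driver supply to the full supply in steps of the supply divided by $2N$, which is why only N tunable cells are needed for that resolution; the voltage seen on a matched line is half of each Thevenin level.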

### A. Low-Swing Partially Segmented VM Tx

Fig. 15(b) shows a detailed block diagram of a low-swing VM Tx implementing an N-over-N based partially segmented VM O/P driver. It consists of an *LC*-DPLL, an on-chip PRBS generator, a 16:1 serializer, a partially segmented O/P driver, and a voltage regulator. The *LC*-DPLL's output is divided to provide clocks for the PRBS generator and different stages of serialization. 16:2 serialization is achieved using standard TSPC latches and transmission gate 2:1 MUX, whereas the final 2:1 serialization stage is realized using a split-load 2:1 MUX as in [23]. The split-load 2:1 MUX reduces data-dependent ISI by making the clock-to-Q delay independent of the bit pattern. The serialized full-rate data $D_n/D_{n+1}$ is fed to an N-over-N based low-swing partially segmented O/P driver. The differential output impedance of the O/P driver is tuned to 100 $\Omega$ by controlling the supply voltage of the pre-driver ($V_{PDRV}$) [24]. The transmitter's output swing is controlled by regulating the supply voltage of the O/P driver ($V_{ODRV}$) with a low-dropout regulator.

### B. LC-Oscillator Based Digital Phase-Locked Loop

Fig. 16 shows a block diagram of an *LC*-oscillator based Type-II digital PLL. The *LC*-DPLL generates a low-jitter 7 GHz clock using an external 109.375-MHz reference clock. Phase error between the reference clock and divided output clock is measured by a phase and frequency detector (PFD) and quantized using a D flip-flop (DFF) [18]. The DFF output is fed to a DLF that implements proportional-integral control with gains of $K_P$ and $K_I$ for the proportional and integral paths, respectively. The 8-bit loop filter output FFS$\langle 7:0 \rangle$ is converted to a 256-level thermometer code and used to vary the *LC*-oscillator's output frequency with 20-ppm resolution at 7 GHz. An additional 7-bit binary coarse frequency select (CFS) signal tunes the output frequency with a resolution of 300 ppm around 7 GHz. The *LC*-oscillator topology is similar to the one used in [25].
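The numbers quoted for the *LC*-DPLL imply a few derived quantities that are easy to tabulate. The sketch below simply restates the arithmetic; treating the 256 thermometer levels as 256 uniform steps and the variable names are assumptions made for illustration.

```python
F_OUT_HZ = 7e9          # LC-DPLL output frequency
F_REF_HZ = 109.375e6    # external reference clock
FINE_PPM = 20           # fine tuning resolution at 7 GHz
FINE_LEVELS = 256       # thermometer-coded levels from the 8-bit FFS word
COARSE_PPM = 300        # coarse (CFS) tuning resolution


def dpll_numbers():
    """Derived quantities from the quoted LC-DPLL parameters."""
    divide_ratio = F_OUT_HZ / F_REF_HZ            # feedback division ratio
    fine_step_hz = FINE_PPM * 1e-6 * F_OUT_HZ     # about 140 kHz per level
    fine_range_hz = FINE_LEVELS * fine_step_hz    # approximate fine span
    coarse_step_hz = COARSE_PPM * 1e-6 * F_OUT_HZ # about 2.1 MHz per CFS code
    return divide_ratio, fine_step_hz, fine_range_hz, coarse_step_hz


if __name__ == "__main__":
    ratio, fine_step, fine_range, coarse_step = dpll_numbers()
    print("division ratio:", ratio)
    print("fine step (kHz):", fine_step / 1e3)
    print("fine range (MHz):", fine_range / 1e6)
    print("coarse step (MHz):", coarse_step / 1e6)
```

Under these assumptions the fine tuning span is several times larger than one coarse step, so adjacent coarse codes overlap and the loop can lock without gaps in frequency coverage.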

## VI. MEASUREMENT RESULTS

The proposed transceiver was implemented in a TSMC 65 nm CMOS process. Fig. 17(a) shows a chip micrograph of the prototype. It occupies an active area of $1.1 \text{ mm}^2$. The die was packaged in a standard 88-pin plastic QFN package and characterized using a four-layer FR-4 PCB. The prototype was tested for a peak data rate of 14 Gb/s. Fig. 17(b) shows the measured phase noise plot of the *LC*-DPLL's output; its integrated jitter (over the frequency range 20 kHz–1 GHz) is 0.6 ps r.m.s.

Fig. 18(a) shows measured differential Tx output eye diagram when the Tx is configured to transmit 0.4 V<sub>pp</sub> differential amplitude. Tx signal path includes 4.6 mm bond wire, package parasitics,  $\mu$ -stripline on the test board, SMA connectors, and 3-ft long SMA cables. Tx output has a vertical eye opening of 150 mV and horizontal eye opening of 0.71 UI. BER bathtub plots measured at the output of FR-4 stripline channel using 80SJNB software available with Tektronix DSA8300 are shown in Fig. 18(b). The total channel loss is estimated to be about 11 dB at 7 GHz. Phase margin is less than 0.1 UI for BER < 10<sup>-12</sup> without equalization but it improves to 0.36 UI at BER < 10<sup>-12</sup> when the 2-tap FIR equalizer is enabled.

Fig. 19(a) shows the RPLL's output phase noise plot at 7 GHz when it is locked to a 0.875 GHz reference clock generated by dividing the PI output (see Fig. 4). The RPLL's output has an integrated r.m.s. jitter of 1.5 ps when the phase noise is integrated over the 20 kHz to 1 GHz band. The RPLL's bandwidth is greater than 30 MHz, which is much larger than the desired CDR bandwidth. Fig. 19(b) depicts the simulated PSNR ($\Phi_{out}(s)/V_{dd}(s)$) for the RPLL while considering supply noise for all current sources, including PDAC, IDAC, and CDAC, feeding into the ring oscillator.

Fig. 20 shows the recovered clock waveform when it is locked to 14 Gb/s PRBS7 data provided by a bit error rate tester (BERT). The data is recovered with BER <  $10^{-12}$ . The long-term absolute jitter of the recovered clock is 1.8 ps<sub>rms</sub> and 15.4 ps<sub>pp</sub>. The measured jitter transfer (JTRAN) function of the CDR is shown in Fig. 21(a). Jitter tolerance (JTOL) measured for BER  $< 10^{-12}$  is shown in Fig. 21(b). The peaking in JTRAN and the dip in JTOL are attributed to the large loop delay introduced by digital logic in the CDR loop.

Fig. 17. (a) Die photo of the transceiver prototype. (b) Phase noise spectrum of the LC-DPLL locked at 7 GHz.

Fig. 18. (a) Differential transmit eye diagram at 14 Gb/s. (b) Bathtub plots at the output of channel with and without FIR equalization.

Fig. 19. (a) Phase noise spectrum at output of the ring PLL locked at 7 GHz. (b) Simulated PSNR for supply noise in the ring PLL.

The breakdown of the power consumption of the 14 Gb/s transceiver is shown in Fig. 22(a). It consumes a total power of 39.6 mW, of which the *LC*-DPLL consumes only 3.6 mW. On the Tx side, the serializer consumes 5.9 mW. The supply-regulated VM output driver, pre-driver, and 2-tap equalizer, together with the low-dropout regulator, dissipate 6.3 mW. On the Rx side, the PI, RPLL, and CDR logic consume 2 mW, 3.2 mW, and 2 mW, respectively. The Rx FE dissipates 16.6 mW. Fig. 22(b) and (c) depict energy efficiency versus data rate and energy efficiency versus channel loss at Nyquist frequency, respectively, for transceivers with embedded clocking published in the last 15 years (2001–2015). Table I summarizes the performance of the proposed 14 Gb/s transceiver and compares it with state-of-the-art transceivers with comparable energy efficiency. This work achieves an energy efficiency of 2.8 pJ/bit while operating at 14 Gb/s and compares favorably with the state of the art.

Fig. 20. Recovered clock waveform at output of the Rx.

Fig. 21. (a) JTRAN curve for the proposed CDR locked at 14 Gb/s. (b) JTOL plot measured with stressed input data.

Fig. 22. (a) Power breakdown in 14 Gb/s serial link. (b) Energy efficiency versus data rate and (c) energy efficiency versus channel loss at Nyquist frequency for embedded XCVRs published in 2001–2015.

TABLE I. Performance Summary for 14-Gb/s XCVR and Comparison to the State-of-the-Art

|                                  | This Work    | JSSC'15<br>[26]                | CICC'15<br>[27]                | JSSC'14<br>[28]     | ISSCC'12<br>[29]      | ISSCC'10<br>[30]     |
|----------------------------------|--------------|--------------------------------|--------------------------------|---------------------|-----------------------|----------------------|
| Technology [nm]                  | 65           | 40                             | 40                             | 22                  | 90                    | 65                   |
| Data Rate [Gb/s]                 | 14           | 28                             | 10                             | 8                   | 8                     | 12.5                 |
| Architecture                     | 1/2-rate     | 1/4-rate                       | 1/2-rate                       | 1/2-rate            | 1/4-rate              | 1/2-rate             |
| Supply Voltage [V]               | 1.0/1.1/1.15 | 0.9                            | 0.9                            | 0.72                | 1.25                  | 1.0                  |
| Transmit Swing (pk-pk) [V]       | 0.4          | 0.48                           | 0.839                          | 0.15                | 0.2                   | 0.15                 |
| Channel Loss [dB]                | 11           | 25                             | N/A                            | 8                   | N/A                   | 12.1                 |
| Equalizer                        | 2-tap FIR    | CTLE<br>1-tap DFE<br>3-tap FIR | CTLE<br>1-tap DFE<br>2-tap FIR | 3-tap FIR           | NO                    | CTLE                 |
| Rx CLK Jitter [ps<sub>rms</sub>] | 1.8          | N/A                            | 0.93                           | N/A                 | N/A                   | N/A                  |
| Tx Power [mW]                    | 12.5         | 42.4                           | 21.3                           | 6.36                | 7.12                  | 5.1                  |
| Rx Power [mW]                    | 23.6         | 41.3                           | 20.3                           | 6.43                | 12.72                 | 6.6                  |
| PLL Power [mW]                   | 3.5          | 7.6<sup>[*8]</sup>             | 2.47<sup>[*10]</sup>           | 13.2<sup>[*4]</sup> | 12.32<sup>[*8]</sup>  | 0.63<sup>[*6]</sup>  |
| FOM [pJ/bit]                     | 2.8          | 3.9\*\*                        | 4.41                           | 3.25                | 4.02                  | 0.98                 |

[\*N]: Power after amortizing over N parallel lanes.

\*\*: Additional 17.4 mW/lane power.
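The FOM entries in Table I can be cross-checked with a few lines of arithmetic. The sketch below assumes that the FOM is the sum of the per-lane Tx, Rx, and PLL power divided by the data rate (1 mW per Gb/s equals 1 pJ/bit), and that the additional 17.4 mW/lane noted for [26] is included in its FOM entry. Both points are assumptions inferred from the table rather than stated definitions, and the names `TABLE_I` and `fom_pj_per_bit` are hypothetical.

```python
# Per-lane entries copied from Table I:
# (data rate in Gb/s, Tx mW, Rx mW, PLL mW, additional mW per lane)
TABLE_I = {
    "This work": (14.0, 12.5, 23.6, 3.5, 0.0),
    "[26]": (28.0, 42.4, 41.3, 7.6, 17.4),  # extra power from the ** footnote (assumed included)
    "[27]": (10.0, 21.3, 20.3, 2.47, 0.0),
    "[28]": (8.0, 6.36, 6.43, 13.2, 0.0),
    "[29]": (8.0, 7.12, 12.72, 12.32, 0.0),
    "[30]": (12.5, 5.1, 6.6, 0.63, 0.0),
}


def fom_pj_per_bit(rate_gbps, tx_mw, rx_mw, pll_mw, extra_mw=0.0):
    """Total per-lane power divided by data rate; mW/(Gb/s) is numerically pJ/bit."""
    return (tx_mw + rx_mw + pll_mw + extra_mw) / rate_gbps


for name, row in TABLE_I.items():
    print(f"{name:>9}: {fom_pj_per_bit(*row):.2f} pJ/bit")
```

Under these assumptions, each computed value lands within about 0.03 pJ/bit of the tabulated FOM.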

# VII. CONCLUSION

Meeting the ever-increasing demand for I/O bandwidth in processors and server applications requires maximizing the per-pin bandwidth of serial links without compromising their power efficiency. With the benefits of technology scaling tapering off, there seems to be an energy efficiency barrier of about 4 pJ/bit for embedded serial links. Consequently, new low-power circuit design techniques are needed to improve energy efficiency beyond this point. To this end, we first identified power-hungry circuit blocks/functions in standard serial links and then proposed techniques to improve their energy efficiency. On the Rx side, a sub-rate dual-loop digital CDR is amenable to high data rates, but the need for multiphase clock generation and distribution with low jitter and precise phase matching makes it power hungry. Deserialization of data and edge samples also consumes significant power. Finally, on the Tx side, the power efficiency of the output driver and pre-driver is degraded by the need to have high-resolution equalization and impedance matching simultaneously. In view of this, we proposed a serial link architecture employing the following energy efficiency improvement techniques: 1) low-power clock generation, recovery, and distribution; 2) a charge-based Rx FE, including half-rate sampling and deserialization; and 3) a partially segmented VM output driver with reduced pre-driver power consumption.

The proposed CDR incorporates a single-phase clock recovery that uses only one PI and a ring-oscillator based PLL to generate multiple sampling clock phases. This minimizes multi-phase high-speed clock distribution and the power consumed in comparison to the standard PI-based digital clock recovery. We introduced an LS-CFF with NRZ outputs, which acts as a building block of the Rx FE and saves 40% power in comparison to SAFF-based sampling and deserialization. On the Tx side, pre-driver power consumption in a VM Tx with a 2-tap FIR equalizer is minimized with non-uniform segmentation of the output driver without compromising the resolution of equalization. The proposed design techniques are implemented in a 14 Gb/s transceiver prototype fabricated in a 65 nm CMOS technology. The Tx achieves a sampling time margin of 0.36 UI at the end of an 11 dB loss channel with an energy efficiency of 0.89 mW/Gb/s. The Rx recovers a 7 GHz sampling clock with 1.8 ps<sub>rms</sub> long-term absolute jitter at BER <  $10^{-12}$  and achieves an energy efficiency of 1.69 mW/Gb/s. An LC-oscillator based digital PLL is shared between the Tx and Rx. It achieves an integrated jitter of 0.605 ps<sub>rms</sub> with an energy efficiency of 0.5 mW/GHz at 7 GHz output. Overall, the transceiver achieves an energy efficiency of 2.8 mW/Gb/s while operating at 14 Gb/s.
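As a quick consistency check, the per-block efficiencies can be combined into the overall figure. This is only a sketch: it assumes the shared PLL's power is charged entirely to the single 14 Gb/s lane and uses the 0.5 mW/GHz PLL figure quoted in the abstract. The symbols $P_{\mathrm{PLL}}$, $E_{\mathrm{PLL}}$, and $E_{\mathrm{total}}$ are introduced here for illustration.

$$
P_{\mathrm{PLL}} \approx 0.5\ \tfrac{\mathrm{mW}}{\mathrm{GHz}} \times 7\ \mathrm{GHz} = 3.5\ \mathrm{mW}, \qquad E_{\mathrm{PLL}} \approx \frac{3.5\ \mathrm{mW}}{14\ \mathrm{Gb/s}} = 0.25\ \mathrm{pJ/bit}
$$

$$
E_{\mathrm{total}} \approx \underbrace{0.89}_{\mathrm{Tx}} + \underbrace{1.69}_{\mathrm{Rx}} + \underbrace{0.25}_{\mathrm{PLL}} = 2.83\ \mathrm{pJ/bit} \approx 2.8\ \mathrm{mW/Gb/s}
$$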

#### ACKNOWLEDGMENT

The research was partly funded by Analog Devices and Intel. The authors would like to thank Berkeley Design Automation for providing Analog Fast Spice simulator. They would also like to thank S.-J. Kim for help in testing the transceiver prototype at UIUC.

#### REFERENCES

- [1] E. J. Fluhr *et al.*, "The 12-core POWER8 processor with 7.6 Tb/s IO bandwidth, integrated voltage regulation, and resonant clocking," *IEEE J. Solid-State Circuits*, vol. 50, no. 1, pp. 10–23, Jan. 2015.
- [2] G. K. Konstadinidis et al., "SPARC M7: A 20 nm 32-core 64 MB L3 cache processor," *IEEE J. Solid-State Circuits*, vol. 51, no. 1, pp. 79–91, Jan. 2016.
- [3] G. E. Moore, "Cramming more components onto integrated circuits," *IEEE Solid-State Circuits Soc. Newslett.*, vol. 38, no. 8, p. 114, Apr. 1965. [Online]. Available: http://ieeexplore.ieee.org/document/4785860/
- [4] M. Y. He and J. Poulton, "A CMOS mixed-signal clock and data recovery circuit for OIF CEI-6G+ backplane transceiver," *IEEE J. Solid-State Circuits*, vol. 41, no. 3, pp. 597–606, Mar. 2006.
- [5] B. Raghavan et al., "A sub-2W 39.8-to-44.6Gb/s transmitter and receiver chipset with SFI-5.2 interface in 40nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2013, pp. 32–33.
- [6] U. Singh et al., "A 780mW 4×28Gb/s transceiver for 100GbE gearbox PHY in 40nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2014, pp. 40–41.
- [7] P. Upadhyaya et al., "A 0.5-to-32.75Gb/s flexible-reach wireline transceiver in 20nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [8] B. Nikolic, V. G. Oklobdzija, V. Stojanovic, W. Jia, J. K.-S. Chiu, and M. M.-T. Leung, "Improved sense-amplifier-based flip-flop: Design and measurements," *IEEE J. Solid-State Circuits*, vol. 35, no. 6, pp. 876–884, Jun. 2000.
- [9] J. W. Jung and B. Razavi, "A 25-Gb/s 5-mW CMOS CDR/deserializer," *IEEE J. Solid-State Circuits*, vol. 48, no. 3, pp. 684–697, Mar. 2013.
- [10] J. W. Jung and B. Razavi, "A 25Gb/s 5.8mW CMOS equalizer," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2014, pp. 44–45.
- [11] J. W. Jung and B. Razavi, "A 25 Gb/s 5.8 mW CMOS equalizer," *IEEE J. Solid-State Circuits*, vol. 50, no. 2, pp. 515–526, Feb. 2015.
- [12] A. Manian and B. Razavi, "A 40-Gb/s 9.2-mW CMOS equalizer," in *Symp. VLSI Circuits Dig. Tech. Papers*, Jun. 2015, pp. C226–C227.
- [13] H. Hatamkhani, K.-L. J. Wong, R. Drost, and C.-K. K. Yang, "A 10-mW 3.6-Gbps I/O transmitter," in *Symp. VLSI Circuits, Dig. Tech. Papers*, Jun. 2003, pp. 97–98.
- [14] M. Kossel *et al.*, "A T-coil-enhanced 8.5Gb/s high-swing source-series-terminated transmitter in 65nm bulk CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2008, pp. 110–599.
- [15] W. D. Dettloff et al., "A 32mW 7.4Gb/s protocol-agile source-series-terminated transmitter in 45nm CMOS SOI," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2010, pp. 370–371.
- [16] J. F. Bulzacchelli *et al.*, "A 28-Gb/s 4-tap FFE/15-tap DFE serial link transceiver in 32-nm SOI CMOS technology," *IEEE J. Solid-State Circuits*, vol. 47, no. 12, pp. 3232–3248, Dec. 2012.
- [17] Y. Lu, K. Jung, Y. Hidaka, and E. Alon, "Design and analysis of energy-efficient reconfigurable pre-emphasis voltage-mode transmitters," *IEEE J. Solid-State Circuits*, vol. 48, no. 8, pp. 1898–1909, Aug. 2013.
- [18] W. Yin, R. Inti, A. Elshazly, B. Young, and P. K. Hanumolu, "A 0.7-to-3.5 GHz 0.6-to-2.8 mW highly digital phase-locked loop with bandwidth tracking," *IEEE J. Solid-State Circuits*, vol. 46, no. 8, pp. 1870–1880, Aug. 2011.
- [19] R. Inti et al., "A highly digital 0.5-to-4Gb/s 1.9mW/Gb/s serial-link transceiver using current-recycling in 90nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2011, pp. 152–153.
- [20] G. Shu, W.-S. Choi, S. Saxena, T. Anand, A. Elshazly, and P. K. Hanumolu, "A 4-to-10.5Gb/s 2.2mW/Gb/s continuous-rate digital CDR with automatic frequency acquisition in 65nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2014, pp. 150–151.
- [21] B.-J. Lee, M.-S. Hwang, S.-H. Lee, and D.-K. Jeong, "A 2.5-10-Gb/s CMOS transceiver with alternating edge-sampling phase detection for loop characteristic stabilization," *IEEE J. Solid-State Circuits*, vol. 38, no. 11, pp. 1821–1829, Nov. 2003.
- [22] J. Kim, B. S. Leibowitz, J. Ren, and C. J. Madden, "Simulation and analysis of random decision errors in clocked comparators," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 56, no. 8, pp. 1844–1857, Aug. 2009.
- [23] K. Fukuda et al., "A 12.3-mW 12.5-Gb/s complete transceiver in 65-nm CMOS process," *IEEE J. Solid-State Circuits*, vol. 45, no. 12, pp. 2838–2849, Dec. 2010.
- [24] S. Saxena, R. K. Nandwana, and P. K. Hanumolu, "A 5 Gb/s energy-efficient voltage-mode transmitter using time-based de-emphasis," *IEEE J. Solid-State Circuits*, vol. 49, no. 8, pp. 1827–1836, Aug. 2014.

- [25] T. Anand, M. Talegaonkar, A. Elkholy, S. Saxena, A. Elshazly, and P. K. Hanumolu, "A 7Gb/s rapid on/off embedded-clock serial-link transceiver with 20ns power-on time, 740  $\mu$ W off-state power for energy-proportional links in 65nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [26] H. Won et al., "A 0.87 W transceiver IC for 100 gigabit Ethernet in 40 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 50, no. 2, pp. 399–413, Feb. 2015.
- [27] J.-Y. Lee *et al.*, "A power-and-area efficient 10×10 Gb/s bootstrap transceiver in 40 nm CMOS for reference-less and lane-independent operation," in *Proc. IEEE Custom Integr. Circuits Conf.*, Sep. 2015, pp. 1–4.
- [28] T. Musah et al., "A 4–32 Gb/s bidirectional link with 3-Tap FFE/6-tap DFE and collaborative CDR in 22 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 49, no. 12, pp. 3079–3090, Dec. 2014.
- [29] Y.-S. Kim et al., "An 8GB/s quad-skew-cancelling parallel transceiver in 90nm CMOS for high-speed DRAM interface," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2012, pp. 136–138.
- [30] K. Fukuda et al., "A 12.3mW 12.5Gb/s complete transceiver in 65nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2010, pp. 368–369.



**Romesh Kumar Nandwana** (S'12) received the B.Tech. degree in electronics and communication engineering from the Motilal Nehru National Institute of Technology, Allahabad, India, in 2009, and the M.Eng. degree in electrical engineering from Oregon State University, Corvallis, OR, USA, in 2013, respectively. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the University of Illinois at Urbana–Champaign, Champaign, IL, USA.

From 2009 to 2010, he was a Scientist with the Indian Space Research Organization, Ahmedabad, India, where he was involved in the design of RF power amplifiers and dc-dc converters for communication satellites. He was involved in low phase noise clock buffers with Linear Technology Corporation, Grass Valley, CA, USA, as an Engineering Intern, in 2011. In 2015, he was a Research Intern with Xilinx Inc., San Jose, CA, USA, where he was developing clocking circuits for high-speed links. In 2016, he joined Intel Labs, Hillsboro, OR, USA, as a Graduate Intern, where he was involved in high-speed optical circuits. His current research interests include frequency synthesizers, digital phase-locked loops, clock and data recovery circuits, high-speed serial links, and low-voltage mixed-signal circuits.

Mr. Nandwana serves as a Reviewer of the IEEE JOURNAL OF SOLID-STATE CIRCUITS, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I, and the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS.



**Saurabh Saxena** (S'10–M'16) received the B.Tech. degree in electrical engineering and the M.Tech. degree in microelectronics and VLSI design from IIT Madras, Chennai, India, in 2009, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana–Champaign, Champaign, IL, USA, in 2015.

He is currently an Assistant Professor with the Department of Electrical Engineering, IIT Madras. His current research interests include delta-sigma modulators, high-speed I/O interfaces, and clocking circuits.

Dr. Saxena serves as a Reviewer of the IEEE JOURNAL OF SOLID-STATE CIRCUITS, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS, and the IEEE International Symposium on Circuits and Systems.



**Guanghua Shu** (S'10) received the M.S. degree in microelectronics from Fudan University, Shanghai, China, and the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Champaign, IL, USA.

In 2014, he was a Research Intern with Xilinx, San Jose, CA, USA, where he was developing power and area-efficient parallel link architectures. He was involved in 56Gb/s wireline receivers, both electrical and optical, with the IBM Thomas J. Watson Research Center, Mixed-Signal Communication IC Design Group, Yorktown Heights, NY, USA, in 2014 and 2015. He is currently a Research Staff with Oracle Labs, Belmont, CA, USA. His current research interests include energy-efficient wireline communication systems, clocking circuits, power converters, and hardware accelerations for efficient computing systems.

Dr. Shu was a recipient of the Dissertation Completion Fellowship from the University of Illinois at Urbana–Champaign from 2015 to 2016 and the SSCS Predoctoral Achievement Award from 2014 to 2015. He serves as a Reviewer of the IEEE JOURNAL OF SOLID-STATE CIRCUITS, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I&II, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS, and the International Symposium on Circuits and Systems.



**Mrunmay Talegaonkar** received the B.Tech. and M.Tech. degrees in electrical engineering from IIT Madras, Chennai, India, in 2007, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana–Champaign, Champaign, IL, USA, in 2016.

From 2007 to 2009, he was a Design Engineer with Analog Devices, Bangalore, India, where he was involved in design of digital-to-analog converters. From 2009 to 2010, he was a Project Associate with IIT Madras, where he was involved in high-speed clock and data recovery circuits. From 2010 to 2013, he was a Research Assistant with Oregon State University, Corvallis, OR, USA, where he was involved in high-speed links. He is currently a Staff Engineer with Inphi Corporation, Irvine, CA, USA. His current research interests include high-speed I/O interfaces and clocking circuits.

Dr. Talegaonkar was a recipient of the Analog Devices Outstanding Student Designer Award in 2012.



**Ahmed Elkholy** (S'08) received the B.Sc. degree (Hons.) and the M.Sc. degree in electrical engineering from Ain Shams University, Cairo, Egypt, in 2008 and 2012, respectively. He is currently pursuing the Ph.D. degree with the University of Illinois at Urbana–Champaign, Champaign, IL, USA.

From 2008 to 2012, he was an Analog/Mixed-Signal Design Engineer with Si-Ware Systems, Cairo, Egypt, where he was involved in designing high-performance clocking circuits and LC-based reference oscillators. He was with Xilinx, San Jose, CA, USA, in 2014, where he was involved in high-performance flexible clocking architectures. He is currently a Research Assistant with the University of Illinois at Urbana–Champaign. His current research interests include frequency synthesizers, high-speed serial links, and low-power data converters.

Mr. Elkholy was a recipient of the IEEE Solid-State Circuits Society (SSCS) Predoctoral Achievement Award from 2015 to 2016, the Analog Devices Outstanding Student Designer Award in 2016, and the M. E. Van Valkenburg Graduate Research Award from the University of Illinois from 2016 to 2017. He also received the IEEE SSCS Student Travel Grant Award in 2015, the Intel/IBM/Catalyst Foundation CICC Student Award in 2015, an Edward N. Rickert Engineering Fellowship from Oregon State University from 2012 to 2013, and the Best M.Sc. Thesis Award from Ain Shams University in 2012. He serves as a Reviewer of the IEEE JOURNAL OF SOLID-STATE CIRCUITS, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II, and the IEEE International Symposium on Circuits and Systems.



**Tejasvi Anand** (S'12–M'15) received the M.Tech. degree (Hons.) in electronics design and technology from the Indian Institute of Science, Bangalore, India, in 2008, and the Ph.D. degree in electrical engineering from the University of Illinois at Urbana–Champaign, Champaign, IL, USA, in 2015.

From 2008 to 2010, he was an Analog Design Engineer with Cadence, Bangalore. He was with the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, in 2015. He is currently an Assistant Professor with the Department of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA. His current research interests include wireline communication, frequency synthesizers, and sensors with an emphasis on energy efficiency.

Dr. Anand was a recipient of the IEEE Solid-State Circuits Society Predoctoral Achievement Award from 2014 to 2015, the 2015 Broadcom Foundation University Research Competition Award, the 2015 M. E. Van Valkenburg Graduate Research Award from the University of Illinois, the 2013 Analog Devices Outstanding Student Designer Award, and the 2009 CEDT Design (Gold) Medal from the Indian Institute of Science.



**Woo-Seok Choi** (S'08) received the B.S. and M.S. degrees in electrical engineering and computer science from Seoul National University, Seoul, South Korea, in 2008 and 2010, respectively. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the University of Illinois at Urbana–Champaign, IL, USA.

His current research interests include designing power-efficient high-speed serial links, low-power analog-to-digital converters, and interface circuits for capacitive sensors.



**Pavan Kumar Hanumolu** (S'99–M'07) received the Ph.D. degree from the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA, in 2006.

He served as a Faculty Member with Oregon State University till 2013. He is currently an Associate Professor with the Department of Electrical and Computer Engineering and a Research Associate Professor with the Coordinated Science Laboratory, University of Illinois at Urbana–Champaign, Champaign, IL, USA. His current research interests include energy-efficient integrated circuit implementation of analog and digital signal processing, sensor interfaces, wireline communication systems, and power conversion.

Dr. Hanumolu received the National Science Foundation CAREER Award in 2010. He currently serves as an Associate Editor of the IEEE JOURNAL OF SOLID-STATE CIRCUITS and a Technical Program Committee Member of the VLSI Circuits Symposium and the International Solid-State Circuits Conference.