# A 7 Gb/s Embedded Clock Transceiver for Energy Proportional Links

Tejasvi Anand, Student Member, IEEE, Mrunmay Talegaonkar, Ahmed Elkholy, Student Member, IEEE, Saurabh Saxena, Amr Elshazly, Member, IEEE, and Pavan Kumar Hanumolu, Member, IEEE

Abstract—A rapid-on/off transceiver for embedded clock architecture that enables energy proportional communication over the serial link is presented. In an energy proportional link, energy consumed by serial link is proportional to the amount of data communicated. Energy proportionality can be achieved by scaling the serial link power linearly with the link utilization, and fine grained rapid power state transition (rapid-on/off) is one such technique which can achieve this objective. In this paper, architecture and circuit techniques to achieve rapid-on/off in PLL, transmitter and receiver are discussed. Background phase calibration technique in PLL and CDR phase calibration logic in receiver enable instantaneous lock on power-on. The proposed transceiver demonstrates power scalability with a wide range of link utilization and, therefore, helps in improving overall system efficiency. Fabricated in 65 nm CMOS technology, the 7 Gb/s transceiver achieves power-on-lock in less than 20 ns. Proposed PLL achieves power-on-lock in 1 ns. The transceiver achieves power scaling by  $44 \times$  (63.7 mW-to-1.43 mW) and energy efficiency degradation by only  $2.2 \times (9.1 \text{ pJ/bit-to-}20.5 \text{ pJ/bit})$ , when the effective data rate (link utilization) changes by 100× (7 Gb/s-to-70 Mb/s). The proposed transceiver occupies an active die area of 0.39 mm<sup>2</sup>.

*Index Terms*—Burst mode, embedded clock, energy efficient, energy proportional, fast power-on, I/O, low power, PLL, power scalable, rapid on/off, serial link, transceiver.

# I. INTRODUCTION

GROWING demand for high-performance computing has resulted in a constant increase in the number of processing cores and, consequently, in the demand for off-chip I/O bandwidth. Because a processor's package is typically constrained by the number of I/O pins, bandwidth demand is usually met by increasing the per-pin data rate, as shown in Fig. 1(a). Energy efficiency of serial links, measured in terms of the energy-per-bit metric, has also improved, as depicted in Fig. 1(b), but has started to taper off in recent years.<sup>1</sup> Voltage and process scaling have been the biggest contributors to improving energy efficiency over the last 15 years. However, as supply voltage scaling has saturated in finer technology nodes, energy efficiency improvement due to voltage scaling has diminished. Plotting energy efficiency as a function of technology, as shown in Fig. 1(c), also reveals that there is no net efficiency improvement beyond 65 nm, which can be explained as follows. Over the last several years, channel characteristics have remained more or less fixed for cost reasons. As a result, increasing data rates meant that channel loss has increased, which mandates more equalization and hence increased power consumption. In other words, energy efficiency improvements provided by process scaling are offset by the energy needed to compensate for channel loss.

Manuscript received May 04, 2015; revised July 16, 2015; accepted August 11, 2015. Date of publication September 14, 2015; date of current version November 24, 2015. This paper was approved by Guest Editor Jaeha Kim. This work was supported in part by Intel Labs University Research Office, SRC, under task ID: 1836.129, and in part by the NSF under CAREER EECS-0954969.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSSC.2015.2470553

Fig. 1. Serial link trend for the last 15 years. (a) Data rate versus year of publication. (b) Energy-per-bit versus year of publication. (c) Energy-per-bit versus technology node.

<sup>1</sup>The serial link data is collected from papers published in ISSCC, VLSI Symp., CICC, ESSCIRC, and A-SSCC.

0018-9200 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.html for more information.

Fig. 2. Proposed rapid-on/off transceiver architecture.


The combined effect of increasing bandwidth and saturating energy efficiency would eventually result in increased power consumption of serial links. Due to thermal constraints of a package, increasing serial link power could potentially constrain the power budget allocated for computational cores. Consequently, serial link power may become the bottleneck to increasing processor's computational capacity. Therefore, a paradigm shift in reducing serial link power is needed.

It was observed that serial links in data centers are used only 15%–30% of the time [1], [2]. During idle periods, links consume almost peak power to maintain synchronization between transmitter and receiver. In view of this, fine-grained power state transition is a promising way to reduce power consumption of serial links by almost 70%–85% [3]–[10]. In this technique, the link is powered down when idle and powered back up when data is ready to be transmitted. This results in energy proportional operation, where the energy consumed to transfer data is directly proportional to the amount of data transferred and is independent of link utilization. We estimated that, in 2016, the idle power savings offered by the rapid-on/off technique could amount to approximately \$870M in data centers across North America [11]–[14]. To perform rapid-on/off operation, the state transition time (power-on time) of the link must be small, ideally zero, and the off-state power should also be close to zero [8]. Both of these requirements are very difficult to achieve in practice. The best known power-on-lock times for embedded clock transceiver architectures are on the order of microseconds, as specified in the IEEE 802.3az standard for Energy Efficient Ethernet (EEE) [7].

In this paper, a 7 Gb/s rapid-on/off embedded clock transceiver that achieves a power-on-lock time of less than 20 ns is presented [15]. The transceiver achieves $44 \times$ power scaling (63.7 mW-to-1.43 mW) with an energy efficiency degradation of only $2.2 \times$ (9.1 pJ/bit-to-20.5 pJ/bit) when the effective data rate (link utilization) changes by $100 \times$ (7 Gb/s-to-70 Mb/s).
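As a quick sanity check on the headline numbers, the following sketch recomputes energy per bit and the scaling ratios from the reported power and data-rate figures. The helper name `energy_per_bit_pj` is illustrative and not from the paper. Because the published inputs are rounded, the results match the quoted values only approximately.

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """Energy per bit in pJ/bit. Units: mW / (Gb/s) = pJ/bit."""
    return power_mw / data_rate_gbps


# Reported operating points: full utilization and 1% utilization.
full_rate = energy_per_bit_pj(63.7, 7.0)     # 7 Gb/s effective data rate
low_rate = energy_per_bit_pj(1.43, 0.07)     # 70 Mb/s effective data rate

power_scaling = 63.7 / 1.43                  # reported as 44x
efficiency_degradation = low_rate / full_rate  # reported as 2.2x
utilization_range = 7.0 / 0.07               # 100x
```

With the rounded inputs, the low-rate point evaluates to slightly under the quoted 20.5 pJ/bit, which is consistent with rounding of the measured power.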

The rest of the paper is organized as follows: Section II introduces the complete transceiver architecture. Design details of fast power-on-lock LC-PLL are described in Section III. Architecture and circuit details of the output driver are presented in Section IV. Section V discusses the fast power-on-lock CDR architecture. Section VI presents the measured results. Section VII concludes the paper.

# II. TRANSCEIVER ARCHITECTURE

The proposed 7 Gb/s rapid-on/off transceiver architecture is shown in Fig. 2. The transmitter consists of a parallel PRBS generator, a 16:1 serializer, a three-tap feed-forward equalizer, and a fast power-on current-mode logic (CML)-based output driver. The serializer is designed with a series of 2:1 multiplexers. PRBS data, generated using synthesized parallel PRBS generators [16] operating at 437.5 MHz, is serialized to generate a true 7 Gb/s PRBS data stream. Fast power-on biasing is used to power on the CML-based predriver and output driver.
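The serializer structure described above can be illustrated with a small behavioral model: 16 parallel lanes at 437.5 MHz are merged by a binary tree of 2:1 multiplexers into one 7 Gb/s stream. This is a sketch under assumptions, not the paper's implementation: the PRBS7 polynomial $x^7 + x^6 + 1$ and the lane ordering are chosen for illustration, and the parallel lanes are obtained by slicing a serial reference stream rather than by a true parallel generator.

```python
def prbs7(n_bits, seed=0x7F):
    """Serial PRBS7 reference stream (assumed polynomial x^7 + x^6 + 1)."""
    state, out = seed, []
    for _ in range(n_bits):
        bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | bit) & 0x7F
        out.append(bit)
    return out


def mux2(a, b):
    """2:1 multiplexer: interleave two equal-length streams at twice the rate."""
    out = []
    for x, y in zip(a, b):
        out += [x, y]
    return out


def serialize(lanes):
    """N:1 serializer built as a binary tree of 2:1 multiplexers."""
    if len(lanes) == 1:
        return list(lanes[0])
    # Even-indexed lanes feed one input of the final mux, odd-indexed the other.
    return mux2(serialize(lanes[0::2]), serialize(lanes[1::2]))


LANES = 16
WORD_RATE_HZ = 437.5e6
line_rate_hz = WORD_RATE_HZ * LANES            # 7 Gb/s

serial_ref = prbs7(LANES * 127)                # whole number of 16 bit words
lanes = [serial_ref[i::LANES] for i in range(LANES)]  # lane i carries bit i of each word
```

Calling `serialize(lanes)` reproduces `serial_ref`, which shows that the even/odd split at each mux stage preserves bit order.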

The receiver consists of quarter-rate bang-bang phase detectors (BBPD), a 4:16 deserializer, clock and data recovery (CDR) logic, phase interpolators (PI), a PRBS checker, and the START Rx Generator circuit (also known as the receiver wake-up circuit). Receiver lock time is estimated from the Error signal, which is generated by performing a logical OR operation on the parallel PRBS checker outputs. Receiver lock is declared when the Error signal goes low. CDR phase calibration logic (CPCL) and dynamic gain calibration logic (DGCL) are the two techniques used to achieve fast phase and frequency locking.

Fig. 3. Evolution of a conventional PLL into a fast power-on-lock PLL. (a) Conventional PLL with random initial phase error. (b) PLL with fixed initial phase error. (c) PLL with fixed initial phase error of max one VCO time period. (d) PLL with zero initial phase error.

A fast power-on-lock LC-PLL generates a 7 GHz clock for both the transmitter and receiver blocks. A programmable divider following the LC-PLL divides the PLL output for operation at lower data rates. Because power-off periods can be on the order of milliseconds [17], frequency drifts caused by die temperature change [18] are compensated using an on-chip temperature sensor [19] and a look-up table (see Section III-A for details). The proposed transceiver can be configured in either transmitter or receiver mode. In the transmitter mode, the PLL powers on in a phase-locked condition, while in the receiver mode, the PLL powers on without a phase lock. The advantages of starting the PLL in an out-of-lock condition are discussed in Section V-B.

# III. FAST POWER-ON LC-PLL

The proposed Type-II LC PLL is based on the conventional hybrid PLL architecture shown in Fig. 3(a). The proportional control is implemented in the analog domain by driving the oscillator directly with UP/DN signals generated by the PFD. Integral control is realized in the digital domain by detecting the sign of the phase error at the output of the PFD and driving the oscillator with the integrated phase error provided by the digital accumulator. Lock time, defined as the time needed to achieve phase lock (assuming frequency error is zero), is an important consideration in the design of the PLL. Because the hybrid PLL exhibits linear loop dynamics, its lock time is dictated by the loop bandwidth and the initial phase error between  $\Phi_{\text{REF}}$  and  $\Phi_{\text{FB}}$  ( $\Phi_{\text{ERROR}} = \Phi_{\text{REF}} - \Phi_{\text{FB}}$ ). While it is possible to reduce lock time by increasing bandwidth, the reference clock frequency,  $F_{REF}$ , limits the maximum PLL bandwidth to about  $F_{REF}/10$  for stability reasons [20]. Gear shifting techniques can alleviate this trade-off [21], but the lock time is still on the order of several microseconds. Oversampling the output of the feedback divider can achieve fast phase locking [22], but it results in large jitter, which is unacceptable for high-speed wireline applications. An alternate approach to reducing lock time is to make the initial phase error zero, so that the loop starts in the locked condition, independent of PLL bandwidth. In this work, we explore this approach to reduce PLL lock time. To this end, we first evaluate the sources of initial phase error and then present techniques to make the error zero.

There are three main sources of initial phase error as depicted in the timing diagram of Fig. 3(a). First, the initial VCO output phase ( $\Delta \Phi_{\rm VCO}$ ) is unknown as it depends on the thermal noise dependent start-up profile of the oscillator. Second, initial state of the feedback divider ( $\Delta \Phi_{\rm IC-DIV}$ ) is unknown as it depends on the state in which it was powered off and leakage during the off period (if dynamic flops are used). Third, delay in the feedback path ( $\Delta \Phi_{\rm DEL}$ ) is unknown because it depends on layout parasitics and is also sensitive to process variations.

A conventional LC-tank builds up oscillations by amplifying thermal noise voltage. Start-up time of the oscillator depends on the initial thermal noise amplitude and is of the order of several nanoseconds. The startup time can be reduced by providing initial condition to the LC-tank [23]. Mathematically, amplitude build-up phenomenon and startup time can be analyzed with the help of Van der Pol equation.

Based on the solution of Van der Pol equation, it was observed that if the initial condition is fixed for every power-on event, then the output phase trajectory of an oscillator is deterministic. In this work, initial condition is given to oscillator in the form of a narrow pulse injected into the oscillator. This pulse is generated from the START signal and has a known phase relation to the phase of the reference clock, as shown in Fig. 3(b).

While using an initial condition removes the uncertainty in the oscillator output phase,  $\Delta \Phi_{\rm VCO}$ , the initial feedback divider state,  $\Delta \Phi_{\rm IC-DIV}$ , and the delay in the feedback path,  $\Delta \Phi_{\rm DEL}$ , still cause  $\Phi_{\rm FB}$  to have a fixed offset from  $\Phi_{\text{REF}}$ . Therefore, this phase offset must be canceled to achieve instantaneous phase locking. A simple digital delay-locked loop (DLL), in which the feedback clock is tuned such that its phase is aligned to the reference phase, can be used for this purpose. Because the phase offset can be as large as one reference time period ( $\approx 10$  ns), a 14 bit accurate delay line is needed to keep the quantization error to within 1 ps. The design of such a wide-range and high-resolution delay line is difficult, especially in a 65 nm CMOS process that has an FO4 delay of approximately 16 ps. To alleviate this requirement, we propose to use the feedback divider to first reduce the phase offset to within one VCO period and then use a 7 bit digitally controlled delay line with a range spanning approximately 250 ps ( $1.7 \times$  VCO period) to correct the residual offset.
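The resolution argument above is simple arithmetic, sketched below. The helper `bits_needed` is an illustrative name; the 109.375 MHz reference and 7 GHz output frequencies are taken from the PLL description in Section III-A.

```python
import math


def bits_needed(range_ps: float, resolution_ps: float) -> int:
    """Smallest bit count such that range / 2**bits <= resolution."""
    return math.ceil(math.log2(range_ps / resolution_ps))


F_REF_HZ = 109.375e6
F_VCO_HZ = 7e9
ref_period_ps = 1e12 / F_REF_HZ    # about 9143 ps, rounded to 10 ns in the text
vco_period_ps = 1e12 / F_VCO_HZ    # about 142.9 ps

# A single delay line covering a full reference period at 1 ps resolution.
single_line_bits = bits_needed(10_000.0, 1.0)

# The proposed split: divider for coarse steps, 7 bit delay line for the rest.
fine_lsb_ps = 250.0 / 2**7                        # about 1.95 ps per step
range_in_vco_periods = 250.0 / vco_period_ps      # about 1.75
```

The split reduces the delay-line requirement from 14 bits to 7 bits while keeping the step size near 2 ps.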

To illustrate how the divide-by-64 feedback divider can provide programmable delay, we treat it as a 64-state finite state machine clocked at the VCO frequency. FB output is asserted high after reaching the 64<sup>th</sup> state. If the divider is powered on in the 1<sup>st</sup> state, then the first positive edge of the FB clock is asserted high after a delay of 64 VCO clock periods. In general, if the divider is powered-on in N<sup>th</sup> state, then first FB clock edge occurs after 65-N VCO clock cycles. Therefore, by setting the initial state of the divider, the FB clock can be delayed in steps of VCO period, thereby making the unknown phase difference between  $\Phi_{\rm REF}$  and  $\Phi_{\rm FB}$  to be within one VCO period, as shown in Fig. 3(c). In this work, the optimal initial state of the divider is set based on simulations.
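The divider-as-delay idea can be checked with a tiny state-machine model. This is a behavioral sketch only: states are numbered 1 to 64 as in the text, and the function names are hypothetical.

```python
DIVIDE_RATIO = 64


def cycles_to_first_fb_edge(initial_state: int) -> int:
    """Count VCO cycles from power-on until the first rising FB edge.

    The divider advances one state per VCO cycle, and FB rises on the
    cycle that leaves the 64th state.
    """
    if not 1 <= initial_state <= DIVIDE_RATIO:
        raise ValueError("initial_state must be in 1..64")
    state, cycles = initial_state, 0
    while True:
        cycles += 1
        if state == DIVIDE_RATIO:
            return cycles
        state += 1


def initial_state_for_delay(delay_cycles: int) -> int:
    """Inverse mapping: initial state that yields the requested FB delay."""
    if not 1 <= delay_cycles <= DIVIDE_RATIO:
        raise ValueError("delay_cycles must be in 1..64")
    return DIVIDE_RATIO + 1 - delay_cycles
```

For every initial state N, the model returns 65 - N cycles, so presetting the divider moves the first FB edge in steps of one VCO period.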

The remaining unknown phase was compensated using a DLL [see Fig. 3(d)]. Thanks to the coarse delay adjustment provided by the divider, the DLL has to cover a range of only one VCO period. This unknown phase also includes the unknown delay in the pulse generation path, and residual frequency error even after precisely setting the oscillator frequency. By accumulating sign of the phase error between  $\Phi_{\rm REF}$  and  $\Phi_{\rm FB}$  measured immediately after a power-on event, the DLL shifts  $\Phi_{\rm FB}$  by one LSB of the digitally controlled delay line and forces  $\Phi_{\rm REF} - \Phi_{\rm FB}$  to zero after several power-on events. The DLL is updated on the falling edge of START<sub>LCH</sub> after stopping the PLL. As a result, the DLL has no impact on PLL loop dynamics.

#### A. Complete PLL Architecture

The proposed PLL architecture along with the timing diagram of phase calibration logic is shown in Fig. 4. Registers in the integral path are clocked with REF, while registers in the DLL path are clocked with  $\mathrm{START}_{\mathrm{LCH}}$ . The PLL multiplies the 109.375 MHz reference clock by 64 and generates a 7 GHz output clock. It employs a hybrid architecture [24] in which the proportional path is implemented in the analog domain and the integral path in the digital domain. It can be reconfigured to operate in bang-bang mode, where only the sign of the phase error is used in the proportional control path. The 8 bit integral path accumulator output, which acts as the digital frequency control word, is stored during the off-state and restored on power-on to ensure that the PLL starts in a frequency-locked condition. However, temperature drift during long off-periods may cause the oscillator to start at a different frequency, and the resulting frequency error may increase the phase lock time of the PLL. To mitigate this, a lookup table (LUT)-based temperature compensation scheme is used. The LUT contains the frequency control word, as a function of temperature, needed to ensure a 7 GHz free-running frequency. Using the die temperature measured by the integrated temperature sensor, the corresponding digital frequency control word is read from the  $64 \times 8$  LUT and is restored into the integral path accumulator. The LUT stores the absolute value of the 8 bit frequency control word. The LUT contents are initially filled with values obtained from transistor-level simulations of the DCO and the temperature sensor and are subsequently refreshed at every power-off event. When the PLL is powered off, the digital control word from the integral path accumulator is written to the address corresponding to the temperature provided by the temperature sensor. The LUT operation can be optionally enabled at the expense of one reference cycle of power-on latency.
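The LUT read and refresh behavior described above can be sketched as a small data structure. This is an illustration under assumptions: the temperature range, the linear temperature-to-address mapping, and the default word are hypothetical, since the text specifies only the $64 \times 8$ organization and the read-on-power-on, write-on-power-off policy.

```python
class FrequencyLut:
    """64-row table of 8 bit frequency control words indexed by temperature."""

    ROWS = 64
    WORD_BITS = 8

    def __init__(self, t_min_c=-40.0, t_max_c=125.0, initial_word=128):
        # Assumed sensor range and fill value; the paper fills the table
        # from transistor-level simulations instead.
        self.t_min_c = t_min_c
        self.t_max_c = t_max_c
        self.table = [initial_word] * self.ROWS

    def address(self, temp_c: float) -> int:
        """Map a temperature to a row, clamping at the table edges."""
        span = self.t_max_c - self.t_min_c
        row = int((temp_c - self.t_min_c) / span * self.ROWS)
        return min(max(row, 0), self.ROWS - 1)

    def on_power_off(self, temp_c: float, accumulator_word: int) -> None:
        """Refresh the row for the current temperature with the locked word."""
        if not 0 <= accumulator_word < 2 ** self.WORD_BITS:
            raise ValueError("control word must fit in 8 bits")
        self.table[self.address(temp_c)] = accumulator_word

    def on_power_on(self, temp_c: float) -> int:
        """Word to restore into the integral path accumulator."""
        return self.table[self.address(temp_c)]
```

A typical sequence is `lut.on_power_off(27.0, 140)` when the link goes idle, followed by `lut.on_power_on(27.0)` at the next wake-up, which returns 140 as long as the temperature maps to the same row.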

Although the LUT can compensate for large frequency errors, the limited number of LUT rows and the quantization of fine frequency tuning into 256 discrete steps could result in residual frequency error. If the PLL is configured in bang-bang mode, then the PLL can tolerate an error less than or equal to the bang-bang step size. Any error larger than the step size would result in phase slewing. In this work, the bang-bang step size is programmable and ranges from 40 ppm ( $\pm 20$  ppm) to 640 ppm. If the PLL is configured in the proportional control mode, then the frequency error would result in a phase offset at the input of the PFD on power-on. The amount of offset depends on the PLL loop dynamics, multiplication factor, and frequency error.

Fig. 4. Proposed fast power-on-lock LC-PLL architecture and phase calibration timing diagram.

As described earlier, the phase error between  $\Phi_{REF}$  and  $\Phi_{FB}$  is made zero on power-on with the help of a DLL, which operates in the background. The DLL consists of a delay line that is digitally controlled with 7 bits and has a range of 250 ps with a step size of approximately 2 ps. Assuming the initial condition of the divider is set properly, the phase difference between the feedback clock (FB<sub>DEL</sub>) and the reference clock (REF<sub>DEL</sub>) is at most one VCO period (approximately 142 ps). As a result, the phase calibration loop takes at most 128 power-on/off cycles to reach steady state. A replica delay line is added on the reference clock path to compensate for the phase drifts in the delay line and divider caused by temperature variations during long off-periods and voltage variations during on-off events.
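The background phase calibration can be modeled as a sign-driven accumulator that moves the delay code by one LSB per power-on event. The sketch below is a simplified, noise-free model: the starting code of zero, the exact 2 ps step, and the function name are assumptions for illustration.

```python
def calibrate_dll(initial_offset_ps, lsb_ps=2.0, n_codes=128, n_events=128):
    """Simulate the one-LSB-per-event DLL update.

    initial_offset_ps: amount by which REF lags FB before calibration,
    assumed to lie within one VCO period after the divider preset.
    Returns the final delay code and the residual error seen at each event.
    """
    code, history = 0, []
    for _ in range(n_events):
        residual = initial_offset_ps - code * lsb_ps   # sampled right after power-on
        history.append(residual)
        step = 1 if residual > 0 else -1               # only the sign is accumulated
        code = min(max(code + step, 0), n_codes - 1)   # applied after the PLL stops
    return code, history
```

For an offset of 100 ps the code reaches 50 after 50 events and then dithers by one LSB, so the residual stays within one step. Since the largest expected offset is one VCO period, about 71 steps suffice, comfortably inside the 128-event bound.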

#### B. LC-Digitally Controlled Oscillator (LC-DCO)

The LC-DCO architecture is shown in Fig. 5. It consists of a 7 bit binary weighted coarse frequency selection (CFS) capacitor bank and an 8 bit thermometer coded fine frequency selection (FFS) capacitor bank. The tuning resolution of CFS and FFS is 300 ppm and 20 ppm, respectively. A single-turn 0.44 nH inductor is used to achieve a quality factor of approximately 20. Two additional pull-down NMOS transistors are added on either side of the LC-DCO to provide the initial condition. On power-on, a narrow pulse derived from the START signal is applied to the gate of one of these NMOS transistors, which pulls down one end of the LC-tank momentarily. This ensures that the oscillator phase trajectory remains fixed on every power-on event.
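For orientation, the tank values implied by these numbers can be estimated with the ideal resonance formula $f = 1/(2\pi\sqrt{LC})$. The capacitance and tuning ranges below are first-order estimates derived here, not figures reported in the paper; they assume an ideal tank, uniform steps, and $2^n - 1$ steps per bank.

```python
import math


def tank_capacitance(f_hz: float, l_h: float) -> float:
    """Total capacitance of an ideal LC tank resonating at f_hz."""
    return 1.0 / ((2 * math.pi * f_hz) ** 2 * l_h)


c_total_f = tank_capacitance(7e9, 0.44e-9)    # roughly 1.17 pF

CFS_BITS, CFS_STEP_PPM = 7, 300.0
FFS_BITS, FFS_STEP_PPM = 8, 20.0

# Estimated tuning spans, assuming uniform steps across each bank.
cfs_range_ppm = (2 ** CFS_BITS - 1) * CFS_STEP_PPM
ffs_range_ppm = (2 ** FFS_BITS - 1) * FFS_STEP_PPM

# The fine bank should span more than one coarse step to avoid frequency gaps.
fine_covers_coarse_step = ffs_range_ppm > CFS_STEP_PPM
```

Under these assumptions the fine bank spans about 17 coarse steps, which leaves ample overlap between adjacent coarse settings.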

A large initial condition helps to reduce the uncertainty in the oscillator phase trajectory. Design parameters such as startup pulse width, rise/fall time, and pull-down NMOS transistor size were chosen based on the peak-to-peak DCO phase error estimated from 100 power-on transient noise simulations. Simulation results suggest that a pulse width larger than 100 ps with 100 ps rise and fall times results in a phase error of 480 fs<sub>pk-pk</sub>. A 16  $\mu$m/80 nm pull-down NMOS transistor results in a phase error of 400 fs<sub>pk-pk</sub>, which meets our design goals.

Proportional control is implemented using 7-step MIM capacitor bank using UP/DN controls with a frequency step of approximately +12,250/-9,750 ppm. This results in a PLL bandwidth of approximately 1.5 MHz.

When the oscillator is powered off, its output nodes OUTP and OUTM settle towards  $V_{DD}$ . On power-on, the oscillator output common-mode quickly changes from  $V_{DD}$  to  $V_{DD}/2$ . As a result, nodes A, B, C, and D, which are at the bottom plates of the MIM capacitor bank, experience a step-like transition, which decays slowly based on the time constant on these nodes. For an always-on PLL, this transition time is not a problem, but for the rapid-on/off PLL, the voltage transition on nodes A, B, C, and D could change the NMOS switch resistance between the MIM caps and result in slow frequency settling [25]. To address this issue, care was taken to reduce the time constants on nodes A, B, C, and D by adding resistors in the form of always-on transistors M1, M2, M3, and M4.

Fig. 5. Proposed LC-DCO architecture.

#### C. PLL Modes and Power-On Sequence

Proposed PLL can be configured to operate in 4 different power-on sequence modes. These modes are Tx PLL without LUT, Tx PLL with LUT, Rx PLL without LUT, and Rx PLL with LUT mode.

In Tx PLL without LUT mode, PLL is configured to be used for the transmitter. Previously saved frequency control word is restored from the integral path accumulator. PLL starts instantaneously with the START PLL signal that is synchronized with the REF clock.

In Tx PLL with LUT mode, PLL is configured to be used for the transmitter. Previously saved frequency control word is restored from the LUT. It takes one reference cycle to perform this operation. Therefore, in this mode the PLL power-on-lock time is increased by one reference cycle.

In Rx PLL without LUT mode, PLL is configured to be used in the receiver. Previously saved frequency control word is restored from the integral path accumulator. PLL starts instantaneously with the START PLL signal, but in out-of-lock condition.

In Rx PLL with LUT mode, PLL is configured to be used in the receiver. PLL starts instantaneously with the START PLL signal in out-of-lock condition. Previously saved frequency control word is restored from the LUT. It takes one reference cycle to perform this operation. While the LUT read operation is underway, the PLL output could gain or lose phase in the first reference cycle after power-on. CDR corrects the resulting sampling phase error with the help of the phase interpolator (PI).
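The four power-on sequence modes differ in only three respects, which the following table-like sketch makes explicit. The dictionary keys and field names are hypothetical labels introduced for illustration; the values follow the mode descriptions above.

```python
from typing import NamedTuple

REF_PERIOD_NS = 1e3 / 109.375    # one reference cycle, about 9.14 ns


class PowerOnMode(NamedTuple):
    word_source: str             # where the frequency control word is restored from
    starts_phase_locked: bool    # Tx modes start locked, Rx modes do not
    lut_read_ref_cycles: int     # reference cycles spent reading the LUT


MODES = {
    "tx_pll_without_lut": PowerOnMode("integral path accumulator", True, 0),
    "tx_pll_with_lut": PowerOnMode("LUT", True, 1),
    "rx_pll_without_lut": PowerOnMode("integral path accumulator", False, 0),
    "rx_pll_with_lut": PowerOnMode("LUT", False, 1),
}


def lut_read_time_ns(mode_name: str) -> float:
    """Time spent on the LUT read for the given mode, in nanoseconds."""
    return MODES[mode_name].lut_read_ref_cycles * REF_PERIOD_NS
```

In the Tx case the LUT read adds directly to the power-on-lock time, whereas in the Rx case the PLL is already running during the read and the CDR absorbs the resulting phase error.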

# IV. TRANSMITTER

The proposed rapid-on/off CML-based transmitter output driver chain is shown in Fig. 6. While it is possible to achieve rapid-on/off capability in low power voltage-mode output drivers [5], [8], CML is preferred for its lower sensitivity to supply transients, which occur during power-on events. Compared to the previously reported rapid-on/off CML drivers [6], [9], the proposed CML driver achieves a smaller power-on time at the expense of a small always-on bias current. The transmitter driver chain is designed in a sliced fashion, where each slice consists of a CMOS-to-CML converter, a predriver, and an output driver (see Fig. 6). Three-tap feed-forward equalization is provided at the transmitter and is integrated into the output driver. The precursor and postcursor taps of the equalizer consist of one slice each, while the main cursor consists of four slices. In each slice of the output driver, the tail current source consists of a 2 bit current DAC. By configuring the number of slices and the tail current of the CML logic, equalization coefficients can be adjusted in a coarse and a fine manner, respectively.

Fig. 6. Transmitter output driver slice architecture.

Fast power-on CML logic and the associated timing diagram are shown in Fig. 7. A small bias current  $I_{BIAS}$  is kept always-on to achieve fast power-on capability. During power-off, switch S1 is off, and, as a result, the common-mode voltage of the output driver is at  $V_{\rm DD}$ . On power-on, the voltage at node P falls sharply from  $V_{\rm DD}$  to voltage  $V_{\rm P}=V_{\rm IN-CM}-V_{\rm TH},$  where  $V_{\rm IN-CM}$  is the input common-mode voltage of the output driver. At the same time, node X rises from ground to voltage  $V_{\rm X}\approx V_{\rm P}.$  The large jump in  $V_{\rm X}$  causes a kick-back on the BIAS node voltage  $V_{\rm BIAS},$  which is given by the following expression:

$$V_{\rm KICK} = \frac{C_{\rm P}}{C_{\rm P} + C_{\rm D}} V_{\rm X} \tag{1}$$

where  $C_P$  is the gate-to-drain parasitic capacitance, and  $C_D$  is the decoupling capacitor on the BIAS node. The decay time of the kick-back depends on the time constant associated with the BIAS node and is usually on the order of a few nanoseconds. Kick-back causes current overshoot in the output driver, which manifests itself as jitter. Adding a decoupling capacitor  $C_D$  on the BIAS node helps reduce the magnitude of  $V_{\rm KICK}$  and, consequently, the jitter. A conventional CML driver takes more than 120 ns to power on and settle [9]. With the proposed fast power-on CML technique, the transmitter output driver settles within 500 ps (based on measured results).
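The capacitive-divider relation in (1) can be evaluated numerically to see how the decoupling capacitor suppresses kick-back. The component values below are hypothetical and chosen only to show the trend; the paper does not report $C_P$, $C_D$, or $V_X$.

```python
def kickback_voltage(c_p_f: float, c_d_f: float, v_x: float) -> float:
    """Kick-back on the BIAS node from equation (1): C_P / (C_P + C_D) * V_X."""
    return c_p_f / (c_p_f + c_d_f) * v_x


# Hypothetical example values, not taken from the paper.
C_P = 10e-15      # 10 fF gate-to-drain parasitic capacitance
V_X = 0.5         # 0.5 V step on node X

# Sweep the decoupling capacitor from 0.1 pF to 10 pF.
sweep = [(c_d, kickback_voltage(C_P, c_d, V_X)) for c_d in (0.1e-12, 1e-12, 10e-12)]
```

Each tenfold increase in $C_D$ reduces the kick-back by roughly a factor of ten once $C_D \gg C_P$, at the cost of area on the BIAS node.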

# V. RECEIVER

Rapid-on/off operation requires the receiver to synchronize quickly with the received data and to power on and off instantaneously. Conventional high-speed burst mode receivers can quickly lock to the received data, but cannot be powered down [26], [27]. Gated VCO-based burst mode receivers can be powered down [10], but they are limited to low data rates (2.2 Gb/s). In view of these limitations, we propose a high-speed burst mode receiver with rapid-on/off capability. Two techniques, referred to as CDR Phase Calibration Logic (CPCL) and Dynamic Gain Calibration Logic (DGCL), are employed to achieve fast phase and frequency acquisition.

Fig. 7. Fast power-on biasing scheme for CML-based logic.

#### A. START Rx Generator

The START Rx Generator circuit is shown in Fig. 8. It consists of two cross-coupled stages denoted as Stage-1 and Stage-2. The Stage-1 output does not go all the way to  $V_{\rm DD}$  due to the current sinking transistors M5 and M6; for that reason, Stage-2 is used to amplify the Stage-1 output and drive the XOR gate. To ensure reasonable swing at the output of Stage-1, transistors M1 and M2 are made wider than transistors M5 and M6. Similarly, in Stage-2, transistors M17 and M18 are made wider than M9, M10, M11, and M12.

Fig. 8. Schematic of START Rx generator circuit.

Fig. 9. (a) Simulated START Rx signal sensitivity to transmitter amplitude. (b) Simulated START Rx signal sensitivity to receiver temperature. (c) Simulated START Rx signal sensitivity to receiver supply voltage. (d) Simulated START Rx signal sensitivity to transmitter supply voltage.

When the transmitter is powered on, the common-mode voltage of the output driver drops from  $V_{\rm DD}$  to  $V_{\rm DD} - V_{\rm SWING}/2$ , where  $V_{\rm SWING}$  is the transmitter output swing. Transistors M1 and M2 in Stage-1 sense this change and start to pump current into the cross-coupled pair consisting of transistors M3, M4, M7, and M8. As a result, the Stage-1 latch acquires a known state, i.e., one output goes high and its complement goes low. This state depends on the relative value of currents I1 and I2, which are drawn from transistors M5 and M6, respectively. The relative value of the currents depends on the starting data pattern. For instance, if the incoming data is all 0's ( $Rx_{\rm INP} < Rx_{\rm INM}$ ), then I2 > I1 and  $V_{\rm OUTP} < V_{\rm OUTM}$ . In this work, the starting data pattern is kept fixed. In a practical usage scenario, it can be fixed by using a fixed preamble on the data packet. When the transmitter is powered off, both nodes  $Rx_{INP}$  and  $Rx_{INM}$  are at  $V_{DD}$ . Transistors M1 and M2 are off, and transistors M5 and M6 pull the Stage-1 latch outputs down to ground. Consequently, both outputs of the Stage-2 latch are at  $V_{DD}$  and the START Rx signal is low. The START Rx generator imposes an upper limit on the input common-mode voltage for correct operation. Based on simulations, at the typical corner, the START Rx generator fails to operate with a common-mode voltage above 945 mV.

Fig. 10. Simulated START Rx generator for 1000 transient noise simulations and a transmitter swing of (a) 400 mV ( $\text{Diff}_{pk-pk}$ ); (b) 500 mV ( $\text{Diff}_{pk-pk}$ ); (c) 600 mV ( $\text{Diff}_{pk-pk}$ ).

The simulated sensitivity of the START Rx signal is shown in Fig. 9. START Rx has a sensitivity of -1.4 ps/mV to the transmitter signal swing, -2.3 ps/°C to the receiver temperature, -3.2 ps/mV to the receiver supply, and 2.4 ps/mV to the transmitter supply. In the present work, the on-chip temperature sensor is not used to track START Rx phase drift. At higher data rates, large phase drifts due to temperature variation during long off-periods could be detrimental. Therefore, it may be possible to use the temperature sensor to compensate for this drift in future work.
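To get a feel for these sensitivities, the sketch below combines them into a first-order estimate of the START Rx delay shift. Linear superposition of the four effects is an assumption made here for illustration, and the dictionary keys and function name are hypothetical.

```python
# Simulated sensitivities quoted in the text, in ps per unit change.
SENSITIVITY_PS = {
    "tx_swing_mv": -1.4,     # per mV of transmitter signal swing
    "rx_temp_c": -2.3,       # per degree C of receiver temperature
    "rx_supply_mv": -3.2,    # per mV of receiver supply
    "tx_supply_mv": 2.4,     # per mV of transmitter supply
}

UI_PS = 1e12 / 7e9           # one unit interval at 7 Gb/s, about 142.9 ps


def start_rx_delay_shift_ps(**deltas: float) -> float:
    """First-order delay shift for the given parameter changes (assumed linear)."""
    return sum(SENSITIVITY_PS[name] * change for name, change in deltas.items())


def shift_in_ui(shift_ps: float) -> float:
    """Express a delay shift as a fraction of the unit interval."""
    return shift_ps / UI_PS
```

For example, `start_rx_delay_shift_ps(rx_temp_c=20)` evaluates to about -46 ps, roughly a third of a UI at 7 Gb/s, which illustrates why temperature drift over long off-periods matters at higher data rates.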

One thousand Monte Carlo mismatch simulations were performed for the START Rx generator. The peak-to-peak variation of the START Rx signal is 94 ps, with a standard deviation of approximately 12.6 ps. One thousand transient noise simulations were performed for three different transmitter output swings, as shown in Fig. 10. The simulated peak-to-peak jitter of the START Rx signal with a 500 mV (Diff<sub>pk-pk</sub>) swing is only 16 ps, which is 0.11 UI. Simulations were also performed for five different amounts of channel loss to understand the effect of ISI on the START Rx generator. As shown in Fig. 11, the START Rx signal delay varies from approximately -150 ps to 150 ps.

Simulation results indicate that the START Rx signal is generated in less than 1 ns after the transmitter is powered on (assuming no channel delay). The fixed phase relationship between the received data on nodes  $Rx_{INP}$ ,  $Rx_{INM}$  and the START Rx signal is leveraged to speed up CDR phase acquisition, as discussed next.

#### B. CDR Phase Calibration Logic (CPCL)

Receiver architecture and proposed CDR phase calibration logic are shown in Fig. 12. In this scheme, sampling phase of the CDR ( $\Phi_{\rm PI}$ ) at the time of power-on is generated with a fixed phase relation to the data phase ( $\Phi_{\rm DATA}$ ), such that  $\Phi_{\rm PI}$  becomes independent of the local reference clock phase ( $\Phi_{\rm REF}$ ). With the help of background calibration,  $\Phi_{\rm PI}$  is placed at the center of received data eye to ensure that the CDR starts in a phase locked condition i.e.  $\Phi_{\rm DATA} - \Phi_{\rm PI} \approx 0.5 \text{UI}$ .

To meet the above-mentioned phase condition, the PLL on the receiver chip is configured to be powered on without a phase lock. In this configuration, the PLL is powered on instantly upon receiving the START Rx signal in bang-bang proportional control mode. The advantage of using bang-bang control for a PLL starting in an out-of-lock condition is that the phase update rate of the oscillator can be well controlled. Care was taken to ensure that the PLL phase update rate in the bang-bang mode is smaller than the CDR phase update rate. Note that, in contrast to the PLL configured in transmitter mode where the START signal is retimed by the reference clock  $\Phi_{REF}$ , the START signal used to start the receiver PLL is not retimed (En Rx Mode = 1 in Fig. 4). This helps in establishing a known phase relation between the phase of the START signal (in this case the START Rx signal)  $\Phi_{START Rx}$  and the PLL output phase  $\Phi_{PLL}$ , which can be expressed as

$$\Phi_{\rm PLL} - \Phi_{\rm START \ Rx} = \Delta \Phi_2. \tag{2}$$

Using a PI, the phase difference between  $\Phi_{PLL}$  and  $\Phi_{PI}$  can be adjusted such that

$$\Phi_{\rm PI} - \Phi_{\rm PLL} = \Delta \Phi_3. \tag{3}$$

Based on the discussion on START Rx Generator circuit, there is a fixed phase relation between  $\Phi_{DATA}$  and  $\Phi_{START Rx}$ , which can be expressed as

$$\Phi_{\text{START Rx}} - \Phi_{\text{DATA}} = \Delta \Phi_1. \tag{4}$$

From (2), (3), and (4), phase relation between  $\Phi_{DATA}$  and  $\Phi_{PI}$  is given by

$$\Phi_{\rm PI} - \Phi_{\rm DATA} = \Delta \Phi_1 + \Delta \Phi_2 + \Delta \Phi_3. \tag{5}$$

By digitally adjusting $\Delta \Phi_3$, sampling clock phase $\Phi_{PI}$ is placed approximately in the middle of received data at the time of power-on. This ensures that CDR starts in phase-locked condition.

Fig. 11. (a) Simulated START Rx generator with Channel 1. (b) Simulated START Rx generator with Channel 2. (c) Simulated START Rx generator with Channel 3. (d) Simulated START Rx generator with Channel 4. (e) Simulated START Rx generator with Channel 5. (f) Simulated START Rx signal variation for various channels with PRBS7 data.

Fig. 12. Proposed receiver architecture and CDR phase calibration logic (CPCL) concept.

Adjustment of $\Delta \Phi_3$ is performed in the background by observing the Error signal on every power-on event. If the sampling phase is not positioned in the middle of the received data at power-on, errors are recorded by the PRBS checker. Since the CDR loop runs in parallel with CPCL, the CDR eventually locks to the received data and errors in the received data cease to exist. By observing the time duration of observed errors, $\Delta \Phi_3$ is digitally adjusted by setting the initial 5-bit digital control word for the PI, so as to minimize the duration of errors. This adjustment is done once at the end of every power-on/off cycle. Consequently, after a few on/off cycles, CPCL converges and the CDR starts in a phase-locked condition. In this work, monitoring the Error signal and controlling $\Delta \Phi_3$ are performed off-chip.

When the transceiver is powered-on for the first time, the CPCL may start in a nonoptimum phase. Consequently, at the end of 20 ns (based on measurements) the Error would still exist because the CDR has not achieved lock. This will force the CPCL to adjust $\Delta \Phi_3$ for the next power-on event. Since there are 32 PI steps to cover one unit interval (UI, also known as data bit duration), CPCL may take on average 16 on/off cycles to converge. In other words, CPCL has a cumulative training time of 320 ns ($16 \times 20$ ns).
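The training-time estimate above can be reproduced with a minimal sketch. It assumes, purely for illustration, that the off-chip logic steps the initial PI code by one position per power-on event in a fixed direction and that a single code is optimal; the function names and the one-step search policy are hypothetical and are not taken from the measured implementation.

```python
PI_STEPS = 32        # 5-bit PI control word covering one UI (from the text)
OBSERVE_NS = 20      # error observation window per power-on event (from the text)


def cycles_to_converge(start_code: int, best_code: int) -> int:
    """Power-on events that still show errors before the optimal code is reached.

    Assumption: the initial PI code is advanced by one step (modulo 32)
    after every power-on event that ends with the Error signal asserted.
    """
    return (best_code - start_code) % PI_STEPS


def average_cycles(start_code: int = 0) -> float:
    """Mean number of adjustment cycles over all equally likely optimal codes."""
    total = sum(cycles_to_converge(start_code, best) for best in range(PI_STEPS))
    return total / PI_STEPS


def training_time_ns(cycles: int) -> int:
    """Cumulative observation time spent while CPCL is converging."""
    return cycles * OBSERVE_NS
```

Under these assumptions the mean comes out at about half of the 32 PI steps, which is in line with the roughly 16 cycles and 320 ns of cumulative training time quoted above.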

In a practical usage scenario, transmitted data is not PRBS pattern and, therefore, for the operation of CPCL and detection of CDR lock, a preamble prefixed to the data packet could be used (see Fig. 13). Correct decoding of this preamble by the receiver could be used to determine if the CDR is locked. The Error signal from the preamble decoder would be used by the CPCL to adjust the initial data sampling phase.

#### C. CDR Architecture and Dynamic Gain Calibration Logic (DGCL)

The proposed CDR consists of a quarter rate bang-bang phase detector, which generates early, late and data signals, as shown in Fig. 14. The early and late signals go to the digital loop filter, which consists of a proportional and integral path followed by a phase accumulator. Output of phase accumulator is applied to 4 PIs, which are used to sample data and edge in the BBPD.

Fig. 13. Practical usage scenario of the proposed transceiver.

Fig. 14. Proposed CDR architecture and DGCL blocks.

Fig. 15. Simulated power-on lock profile of transceiver with 0 ppm, 1000 ppm, and 2500 ppm of frequency error between the transmitter and the receiver with PRBS7 data.

Fig. 16. Simulated power-on-lock time as a function of starting sampling phase for various horizontal eye openings with PRBS7 data.

Dynamic gain calibration logic (DGCL) helps to achieve fast phase and frequency acquisition and operates in conjunction with CPCL (see Fig. 14). DGCL starts the CDR loop with a very high gain on power-on. High gain helps with fast phase and frequency acquisition but increases recovered clock jitter. Therefore, in order to reduce the recovered clock jitter, the loop gain is reduced progressively as the CDR approaches frequency lock. Frequency lock can be detected by monitoring the ACC-F output. When the CDR acquires frequency lock, the ACC-F output just moves around a static value. The variation in the ACC-F output around the average value is a function of latency in the system and loop gain. The higher the latency, the larger the movement of ACC-F output codes in steady state. Similarly, higher loop gain also results in larger movements of the ACC-F output around the steady state.



Fig. 17. (a) Die micrograph of the proposed transceiver. (b) Die packed in 88-pin QFN package. (c) Active area break down of the proposed transceiver.

To identify if the ACC-F output has reached the steady state, the ACC-F output is first differentiated. Because CDR loop latency is around 5 cycles, with the highest gain setting, it was observed from simulations that the CDR limit cycle has an average period of 10 CDR cycles. Therefore, to identify the variation of ACC-F output, ACC-F output must be differentiated after 10 CDR cycles. In the present architecture, the delay for the differentiator is programmable from 1 to 10. All the measurements were done with a delay of 10.

An absolute value operation is performed on the output of the differentiator so as to obtain either zero or a positive value. If the output of the absolute operation is greater than the threshold, this signifies that ACC-F is still settling and there is no need to change the loop gain setting. Once the output of the absolute operation is less than the threshold, then the 3 bit accumulator increments, which decreases the CDR loop gain. Measurements indicate that at maximum gain setting, the CDR BW is approximately 10 MHz and the worst case power-on time without CDR phase calibration logic (CPCL) is 180 ns. In steady-state, the CDR BW is around 2 MHz.
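The gain-reduction steps described above (delayed difference, absolute value, threshold comparison, 3-bit accumulator) can be sketched behaviorally as follows. This is an illustrative model only: the threshold value, the saturation at the top accumulator code, and the choice to restart the difference window after each gain change are assumptions, and the class and parameter names are hypothetical.

```python
from collections import deque


class GainCalibrator:
    """Behavioral sketch of a DGCL-style gain schedule driven by ACC-F samples."""

    def __init__(self, delay: int = 10, threshold: int = 2, max_level: int = 7):
        self.delay = delay                    # differentiator delay in CDR cycles
        self.threshold = threshold            # hypothetical settling threshold
        self.max_level = max_level            # top code of the 3-bit accumulator
        self.level = 0                        # 0 = highest loop gain
        self.history = deque(maxlen=delay)    # last `delay` ACC-F samples

    def update(self, acc_f: int) -> int:
        """Consume one ACC-F sample and return the current gain level."""
        if len(self.history) == self.delay:
            # Differentiate over `delay` cycles and take the absolute value.
            settled = abs(acc_f - self.history[0]) < self.threshold
            if settled and self.level < self.max_level:
                self.level += 1               # a higher level means lower gain
                self.history.clear()          # assumption: re-observe after a change
        self.history.append(acc_f)
        return self.level
```

While ACC-F is still ramping, the delayed difference stays above the threshold and the gain level is held; once ACC-F hovers around a static value, the level steps up and the loop gain is reduced, saturating at the last accumulator code in this sketch.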

Behavioral simulation of the transceiver in the presence of 0 ppm, 1000 ppm and 2500 ppm of frequency error between the transmitter and the receiver was performed, and the results are shown in Fig. 15. The first plot is the output of the accumulator ACC- $\Phi$ , which controls the PI. In the presence of frequency error, the ACC- $\Phi$  output wraps around, and the rate at which it wraps around is governed by the frequency error. The second plot is the output of accumulator ACC-F, which forms the CDR integral path. Third plot is the Error signal. For all three frequency error conditions, the Error signal goes down at the same time. Thus, the power-on-lock time of the CDR remains the same, irrespective of the frequency error.
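As a rough illustration of why the wrap-around rate tracks the frequency error, the time for the transmitter and receiver phases to slip by one UI can be computed directly from the offset. The sketch below assumes a 7 Gb/s data rate and treats one UI of accumulated phase as the unit of interest; how many UIs correspond to one full wrap of ACC-$\Phi$ depends on the accumulator range, which is not assumed here. The function name is hypothetical.

```python
DATA_RATE_BPS = 7e9  # data rate used in the simulations (from the text)


def ui_slip_time_ns(freq_error_ppm: float) -> float:
    """Time for the Tx/Rx phase to drift by one UI at a given frequency offset."""
    if freq_error_ppm == 0:
        return float("inf")                  # no offset, no accumulated drift
    ui_s = 1.0 / DATA_RATE_BPS               # one unit interval in seconds
    uis_per_slip = 1.0 / (abs(freq_error_ppm) * 1e-6)
    return ui_s * uis_per_slip * 1e9
```

With these assumptions a 1000 ppm offset slips one UI roughly every 143 ns and a 2500 ppm offset roughly every 57 ns, so a larger frequency error makes the phase accumulator advance proportionally faster.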

To understand the effect of ISI on the power-on-lock time for DGCL and CPCL, behavioral simulations of the transceiver in the presence of different amounts of channel loss were performed and the results are shown in Fig. 16. Simulations were done by sweeping the starting sampling phase for four different channels with different percentage of horizontal eye opening. If only the DGCL is used, the data sampling clock has no relation to the received data on power-on. Consequently, power-on-lock time is dependent on the sampling instant and ISI. It can be observed from the simulation that worst case power-on-lock time increases as the horizontal eye opening reduces. On the other hand, if both CPCL and DGCL are used in the receiver, power-on-lock time becomes fixed and independent of ISI or sampling instant. Thanks to CPCL, the sampling clock has a known phase relation with the received data on power-on. Thanks to DGCL, the CDR is quickly locked to the transmitter frequency and it continuously tracks the phase.

Fig. 18. Measured PLL jitter in proportional control mode.

### VI. MEASUREMENT RESULTS

The prototype transceiver was implemented in a 65 nm CMOS process and the die micrograph is shown in Fig. 17(a). The chip was packaged in a 10 mm $\times$ 10 mm 88-pin QFN plastic package [see Fig. 17(b)]. The transceiver occupies an active area of approximately 0.39 mm<sup>2</sup> and the area breakdown is shown in Fig. 17(c). This chip operates on 1 V and 1.1 V supplies (1.1 V for BBPD and deserializer).

Because of large current steps during power-on/off events, the power distribution network was designed carefully. Several critical supplies, which experience large current steps, are bonded with multiple bonding wires to reduce the parasitic inductance. Both on-chip and off-chip decoupling capacitors were used. Total on-chip decoupling capacitance is approximately 1.5 nF. Off-chip decoupling of every supply was done with 100 nF, 1 $\mu$F, and 10 $\mu$F capacitors. Damping resistors of approximately 1 to 5 Ω are added in series on the PCB to dampen the ringing caused by bond wire inductance and decoupling capacitance. Off-chip voltage regulators manufactured by Analog Devices (part# ADP123) were used. Experimental results to quantify the transceiver performance in always-on and rapid-on/off modes are presented in the following subsections.



Fig. 19. Measured PLL phase noise in (a) proportional control mode; (b) bang-bang control mode.



Fig. 20. Measured PLL spectrum in (a) proportional control mode; (b) bang-bang control mode.



Fig. 21. Measured transmitter output eye at 7 Gb/s without FFE in an always-on condition.

#### A. Measurements in Always-On Mode

Raw PLL performance such as long-term jitter, phase noise, and output spectrum was measured in always-on condition. Measured jitter histogram of the PLL output clock at 7 GHz is shown in Fig. 18. Long term absolute jitter is 1 ps<sub>rms</sub> and 9.3 ps<sub>pk-pk</sub>, including the trigger jitter of the Tektronix DSA8300 oscilloscope. Measured phase noise plot is shown in Fig. 19 and the integrated jitter is 435 fs<sub>rms</sub> in proportional control mode and 890 fs<sub>rms</sub> in bang-bang control mode. Measured reference spur is -50.1 dB in proportional control mode and -46.8 dB in bang-bang control mode, as shown in Fig. 20.

Fig. 22. Measured BathTub plot of receiver at 7 Gb/s in an always-on condition.

Fig. 23. Measured jitter tolerance of the CDR at 7 Gb/s.

Fig. 24. (a) Measured transceiver power break down in an always-on state. (b) Measured transceiver power break down in an always-off state.

Fig. 25. Oscilloscope capture of the PLL power-on/off transient.

Fig. 26. Measured absolute phase drift of the PLL in proportional and bang-bang control modes.

Measured temperature sensitivity of the LC tank operating at 7 GHz is -79.1 ppm/°C and supply sensitivity is -52.85 ppm/mV. With 20 ppm of fine frequency step (FFS), the PLL can track approximately 5120 ppm of frequency offset, which translates to approximately 65 °C of temperature range. With 64 entries, the LUT can provide approximately 1 °C of temperature resolution. The temperature sensor is placed near the LC-DCO. Update rate of the temperature sensor is programmable, with a maximum measured update rate of around 150 kSamples/s. This is equivalent to around 6.5 $\mu$s of temperature measurement time.
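The temperature-tracking figures quoted above follow from simple arithmetic on the measured sensitivities. The sketch below only restates the numbers given in the text; the variable names are illustrative, and the number of fine frequency steps is derived here rather than stated by the source.

```python
FFS_PPM = 20.0              # fine frequency step (from the text)
TRACK_RANGE_PPM = 5120.0    # trackable frequency offset (from the text)
TEMP_SENS_PPM_PER_C = 79.1  # magnitude of the LC tank temperature sensitivity
LUT_ENTRIES = 64            # LUT depth (from the text)
SENSOR_RATE_SPS = 150e3     # maximum measured sensor update rate

# Number of fine frequency steps implied by the range and the step size.
ffs_steps = TRACK_RANGE_PPM / FFS_PPM

# Temperature span covered by the trackable frequency range.
temp_range_c = TRACK_RANGE_PPM / TEMP_SENS_PPM_PER_C

# Temperature covered by each LUT entry if the span is divided evenly.
temp_per_entry_c = temp_range_c / LUT_ENTRIES

# Time per temperature measurement at the maximum update rate.
sample_period_us = 1e6 / SENSOR_RATE_SPS
```

The span evaluates to a little under 65 °C and the per-entry resolution to about 1 °C, matching the text. The reciprocal of the update rate comes out slightly above the quoted 6.5 $\mu$s, which is consistent with both figures being approximate.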

Measured single-ended eye diagram of the CML output driver is shown in Fig. 21. Transmitter output swing is 500 mV (Diff<sub>pk-pk</sub>) and single-ended horizontal and vertical eye openings are approximately 105 ps and 102 mV, respectively. This near-end transmitter measurement was done without enabling FFE. ISI present in the eye is due to package parasitics and impedance discontinuities on the PCB. For the transceiver operation, FFE taps pre, main and post are set to 0, 0.75, and -0.25, respectively.

Fig. 27. Measured absolute phase drift to demonstrate the effectiveness of the replica delay in the presence of voltage variations on the digital power supply.

Fig. 28. Measured CML output driver power-on/off transient.

Fig. 29. Measured transceiver power-on/off transient at 7 Gb/s.

Measured bathtub plot from BERT-to-receiver is shown in Fig. 22. This measurement was performed by synchronizing the BERT clock with the receiver clock and sweeping phase interpolator codes in the CDR loop. For a BER of $10^{-12}$, the measured eye opening is approximately 0.25 UI.

The measured lock-in range of the CDR is $\pm 2500$ ppm. Measured jitter on the recovered clock for 0 ppm, 2500 ppm, and -2500 ppm frequency error is 4.8 ps<sub>rms</sub>, 7.6 ps<sub>rms</sub>, and 8.3 ps<sub>rms</sub>, respectively. Fig. 23 shows the measured jitter tolerance (JTOL) curve of the CDR in an always-on condition (steady-state CDR locked condition). JTOL corner frequency is approximately 2 MHz. High frequency JTOL is limited primarily by the ISI caused by package parasitics.

The transceiver power breakdown in the always-on and always-off states is shown in Fig. 24(a) and (b), respectively. Operating at 7 Gb/s, the transceiver consumes 63.7 mW, of which the serializer and output driver consume nearly 45%. The power consumption in the off-state is only 740 $\mu$W, which is approximately 1.16% of the on-state power. Leakage in the BBPD and de-serializer blocks makes up 61% of the off-state power consumption.



Fig. 30. Measured power-on-lock time versus starting PI-phase.

#### B. Measurements in Rapid-On/Off Mode

The measured power-on transient of the PLL, shown in Fig. 25, demonstrates that the PLL powers-on/off instantaneously. To quantify the settling time of the PLL, error in the output time period was calculated from the output waveform captured using a high-frequency sampling scope, Agilent DSO-81204A. Cumulative sum of the period error signifies the absolute phase drift as depicted in Fig. 26. Mathematically, absolute phase drift can be calculated as

$$\text{Phase Drift} = \sum_{i=1}^{\infty} \left( P_i - \frac{1}{7 \times 10^9} \right) \tag{6}$$

where $P_i$ is the $i$th measured period. The dotted and solid lines show the phase drift without and with background DLL-based phase calibration, respectively.

Fig. 31. (a) Measured transceiver power versus effective data rate. (b) Measured transceiver energy-per-bit versus effective data rate.

TABLE I

PERFORMANCE COMPARISON OF THE PROPOSED TRANSCEIVER WITH STATE-OF-THE-ART DESIGNS

|                            | This Work                          | [6] VLSI'11     | [5] JSSC'10                |
|----------------------------|------------------------------------|-----------------|----------------------------|
| Architecture               | Embedded clock                     | Forwarded clock | Forwarded clock            |
| Technology                 | 65nm GP                            | 40nm LP         | 40nm LP                    |
| Supply [V]                 | 1/1.1                              | N/A             | 1.1                        |
| Data Rate [Gb/s]           | 7                                  | 2.5-5.6         | 2.7-4.3                    |
| Power-on-Lock Time [ns]    | Less than 20                       | 8               | 241.8                      |
| Energy Efficiency [pJ/bit] | 9.1                                | 2.4             | 3.3                        |
| On-State Power [mW]        | 63.7                               | 13.4            | 14.2                       |
| Off-State Power [µW]       | 740                                | 0               | 50                         |
| De/Serialization Ratio     | 16:1                               | N/A             | 8:1                        |
| Output Swing [mV]          | $500(\text{Diff}_{pk-pk})$         | N/A             | $200(\text{Diff}_{pk-pk})$ |
| Area [mm<sup>2</sup>]      | 0.39                               | N/A             | N/A                        |

The absolute phase drift with phase calibration is measured to be $\pm 3$ ps in a measurement span of 220 ns, which is approximately 2–3 time constants of the PLL loop. Beyond this span, PLL feedback ensures that the phase does not drift beyond 3 ps. Fig. 27 shows the measured effectiveness of the replica delay line. PLL phase drift measured with 100 mV of voltage variation on the feedback divider and delay line during the off-period is less than 2 ps, compared to the nominal supply case.
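The cumulative-sum definition in (6) can be sketched as a running sum over a captured record of output periods. The sum in (6) is written without an upper limit; the sketch below assumes a finite record from the scope capture and returns one drift value per measured period. The function name and the sample values are illustrative only.

```python
from itertools import accumulate

NOMINAL_PERIOD_S = 1 / 7e9  # ideal period of the 7 GHz PLL output, as in (6)


def absolute_phase_drift(periods_s):
    """Running sum of (P_i - nominal period), one entry per measured period."""
    return list(accumulate(p - NOMINAL_PERIOD_S for p in periods_s))


# Illustrative record: one long period, one short period, one nominal period.
example = [NOMINAL_PERIOD_S + 1e-12, NOMINAL_PERIOD_S - 1e-12, NOMINAL_PERIOD_S]
drift = absolute_phase_drift(example)
```

In this toy record the drift rises by about 1 ps after the first period and returns to approximately zero after the second, which is the behavior a plot such as Fig. 26 tracks over the power-on transient.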

The measured transmitter settling transient during power-on/off cycle is shown in Fig. 28. The transmitter powers-on within 500 ps. It can be observed that the common-mode drops when the START signal is asserted high, and the common-mode goes to  $V_{\rm DD}$  when the START signal is de-asserted.

Measurement capturing the successful transmission and reception of data in rapid-on/off mode is performed using two separate test chips. For this measurement, one chip was configured as the transmitter while the other chip was configured as the receiver. The chips were connected by a channel consisting of a 3.2-in FR-4 trace, a 6-in SMA cable, and the QFN package. PRBS7 data was transmitted and recovered in this experiment. Approximately 4 billion on/off transactions are captured using an oscilloscope as shown in Fig. 29. The duration of each transaction is 450 ns. Observed signals at the output of the receiver are: START Rx, Error, Recovered Data, and Recovered Clock. The Error signal is used as an indicator to check if the CDR has locked to the PRBS pattern [4]. In these measurements, the Error signal goes low in less than 20 ns (in less than 140 bits). PRBS checker seeding latency of approximately 3 to 4 CDR clock cycles ($\approx$7–9 ns) is included in the observed power-on-lock time of 20 ns. A small portion of this time ($\approx 2-3$ ns) is contributed by the PI+PLL power-on time. We believe the remaining power-on time could be attributed to a power-supply glitch, which could have moved the optimum sampling phase of the CDR to a sub-optimal position because of the START Rx generator's sensitivity to supply voltage. Given the 20 ns of power-on-lock time, it may not be beneficial to power-off the link if the idle time between very long active times (say hundreds of milliseconds) is small (say 20 ns).

TABLE II

PERFORMANCE COMPARISON OF THE PROPOSED FAST POWER-ON-LOCK LC-PLL WITH STATE-OF-THE-ART DESIGNS

|                         | This Work | [28] VLSI'13 | [5] JSSC'10 | [29] ISSCC'13 | [6] VLSI'11 | [30] CICC'12 |
|-------------------------|-----------|--------------|-------------|---------------|-------------|--------------|
| Architecture            | PLL       | PLL          | PLL         | MDLL          | MILO        | MILO         |
| Technology              | 65nm GP   | 40nm         | 40nm LP     | 90nm          | 40nm LP     | 65nm         |
| Supply [V]              | 1         | N/A          | 1.1         | 1.1           | N/A         | 1.1          |
| Output Freq. [GHz]      | 7         | 25           | 4.3         | 2.5           | 2.8         | 2.3-4        |
| Reference Freq. [MHz]   | 109.357   | 390          | 537.5       | 312.5         | 700         | 790          |
| Integrated Jitter [fs]  | 435       | 394          | N/A         | 752           | N/A         | N/A          |
| Power-on-Lock Time [ns] | 1         | 100          | 241.8       | 10            | 8           | 10           |
| Efficiency [mW/GHz]     | 0.68      | 2.56         | N/A         | 0.88          | 4.8         | 30.4         |
| On-Power [mW]           | 4.8       | 64           | N/A         | 2.2           | 13.44       | 96           |
| Off-Power [µW]          | 41.6      | N/A          | N/A         | 25            | 0           | N/A          |

Fig. 30 shows the measured power-on-lock time of the CDR plotted as a function of PI starting phase. In this measurement, the PI starting code ($\Phi_{PI}$ in Fig. 12) was swept across 32 codes and power-on-lock time was measured by observing the Error signal. CPCL helps the CDR to start with the optimal PI code, which ensures minimal power-on-lock time (less than 20 ns in this case). If DGCL is used alone, i.e., CPCL is switched off, the phase of the incoming data will be random with respect to the sampling clock and the measured worst-case power-on-lock time is 180 ns.

The measured transceiver power and energy efficiency are plotted as a function of effective data rate in Fig. 31(a) and (b), respectively, for 8, 32, and 128 byte data burst lengths. Effective data rate was obtained by duty cycling the transceiver and is equal to

$$\text{Effective Data Rate} = \frac{\text{Bits transmitted during on-state}}{\text{On-state time} + \text{Off-state time}}. \tag{7}$$

For the 128 byte burst, when the effective data rate changes by $100 \times$ (from 7 Gb/s to 70 Mb/s), the power scales by $44 \times$ (from 63.7 mW to 1.43 mW) and the energy efficiency changes by only $2.2 \times$ (from 9.1 pJ/bit to 20.5 pJ/bit). This demonstrates the energy proportional feature of the proposed transceiver.
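The relation in (7) and the reported scaling numbers can be checked with a short sketch. The measured on-state and off-state powers are taken from the text; the two-state average power model is an idealization introduced here for illustration (it ignores the power-on-lock overhead), and all function names are hypothetical.

```python
FULL_RATE_BPS = 7e9   # peak data rate (from the text)
P_ON_W = 63.7e-3      # measured on-state power (from the text)
P_OFF_W = 740e-6      # measured off-state power (from the text)


def effective_data_rate(bits_on: float, t_on_s: float, t_off_s: float) -> float:
    """Effective data rate as defined in (7)."""
    return bits_on / (t_on_s + t_off_s)


def energy_per_bit_pj(power_w: float, rate_bps: float) -> float:
    """Energy efficiency in pJ/bit for a given average power and data rate."""
    return power_w / rate_bps * 1e12


def two_state_power_w(utilization: float) -> float:
    """Idealized average power: time-weighted mix of on- and off-state power."""
    return utilization * P_ON_W + (1.0 - utilization) * P_OFF_W


# A 128 byte burst at full rate, followed by an idle time 99x as long.
burst_bits = 128 * 8
t_on = burst_bits / FULL_RATE_BPS
rate = effective_data_rate(burst_bits, t_on, 99 * t_on)
```

With these inputs the effective rate is 70 Mb/s. The idealized model gives an average power of about 1.37 mW at 1% utilization, slightly below the measured 1.43 mW, which is plausible given that the model leaves out the energy spent during power-on-lock. Dividing the measured powers by the corresponding data rates gives about 9.1 pJ/bit at full rate and about 20.4 pJ/bit at 70 Mb/s, close to the quoted values.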

Table I compares the proposed transceiver with state-of-the-art designs. To the best of our knowledge, the proposed transceiver is the first reported embedded clock architecture with a power-on-lock time of less than 20 ns. Table II compares the proposed PLL with the state-of-the-art fast power-on-lock frequency multipliers. The proposed PLL achieves the smallest power-on-lock time, which is two orders of magnitude lower than other reported PLL architectures and an order of magnitude lower than reported MDLL and MILO architectures.

#### VII. CONCLUSION

Fine grained rapid power state transition technique is used to reduce the overall serial link power. Architectural design techniques for achieving fast locking in PLL, transmitter, and receiver were presented. The prototype fast power-on-lock transceiver with embedded clock architecture was fabricated in 65 nm CMOS technology and occupies an active die area of 0.39 mm<sup>2</sup>. It achieves power-on-lock in less than 20 ns and consumes 63.7 mW/740 $\mu$W on/off-state power from 1 V and 1.1 V supplies. The proposed transceiver demonstrates power scalability with link utilization and achieves energy proportional operation. To the best of our knowledge, these are the first reported measured results and techniques of a rapid-on/off transceiver for embedded clock architecture.

#### ACKNOWLEDGMENT

The authors would like to thank Berkeley Design Automation for providing the Analog Fast Spice (AFS) simulator. They would also like to thank S. J. Kim and TwistedTraces for their help in testing.

#### References

- [1] L. Barroso and U. Hölzle, "The case for energy-proportional computing," *Computer*, vol. 40, no. 12, pp. 33–37, Dec. 2007.

- [2] D. Abts, M. R. Marty, P. M. Wells, P. Klausler, and H. Liu, "Energy proportional datacenter networks," in *Proc. Int. Symp. Computer Architecture (ISCA)*, Jun. 2010, pp. 338–347.
- [3] F. O'Mahony, G. Balamurugan, J. Jaussi, J. Kennedy, M. Mansuri, S. Shekhar, and B. Casper, "The future of electrical I/O for microprocessors," in *Proc. IEEE Symp. VLSI Design Automation and Test* (VLSI-DAT), Apr. 2009, pp. 31–34.
- [4] F. O'Mahony, J. Jaussi, J. Kennedy, G. Balamurugan, M. Mansuri, C. Roberts, S. Shekhar, R. Mooney, and B. Casper, "A 47 x 10 Gb/s 1.4 mW/Gb/s parallel interface in 45 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 45, no. 12, pp. 2828–2837, Dec. 2010.
- [5] B. Leibowitz, R. Palmer, J. Poulton, Y. Frans, S. Li, J. Wilson, M. Bucher, A. Fuller, J. Eyles, M. Aleksic, T. Greer, and N. Nguyen, "A 4.3 GB/s mobile memory interface with power-efficient bandwidth scaling," *IEEE J. Solid-State Circuits*, vol. 45, no. 4, pp. 889–898, Apr. 2010.
- [6] J. Zerbe, B. Daly, W. Dettloff, T. Stone, W. Stonecypher, P. Venkatesan, K. Prabhu, B. Su, J. Ren, B. Tsang, B. Leibowitz, D. Dunwell, A. Carusone, and J. Eble, "A 5.6 Gb/s 2.4 mW/Gb/s bidirectional link with 8 ns power-on," in *IEEE Symp. VLSI Circuits Dig.*, Jun. 2011, pp. 82–83.
- [7] Energy Efficient Ethernet Task Force 2010, IEEE P802.3az [Online]. Available: http://grouper.ieee.org/groups/802/3/az
- [8] T. Anand, A. Elshazly, M. Talegaonkar, B. Young, and P. Hanumolu, "A 5 Gb/s, 10 ns power-on-time, 36 μW off-state power, fast power-on transmitter for energy proportional links," *IEEE J. Solid-State Circuits*, vol. 49, no. 10, pp. 2243–2258, Oct. 2014.
- [9] M. Talegaonkar, A. Elshazly, K. Reddy, P. Prabha, T. Anand, and P. Kumar Hanumolu, "An 8 Gb/s-64 Mb/s, 2.3–4.2 mW/Gb/s burst-mode transmitter in 90 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 49, no. 10, pp. 2228–2242, Oct. 2014.
- [10] W.-S. Choi, T. Anand, G. Shu, A. Elshazly, and P. Hanumolu, "A burst-mode digital receiver with programmable input jitter filtering for energy proportional links," *IEEE J. Solid-State Circuits*, vol. 50, no. 3, pp. 737–748, Mar. 2015.
- [11] J. Howard, S. Dighe, S. Vangal, G. Ruhl, N. Borkar, S. Jain, V. Erraguntla, M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, and R. Van Der Wijngaart, "A 48-Core IA-32 processor in 45 nm CMOS using on-die message-passing and DVFS for performance and power scaling," *IEEE J. Solid-State Circuits*, vol. 46, no. 1, pp. 173–183, Jan. 2011.
- [12] J. L. Shin, R. Golla, H. Li, S. Dash, Y. Choi, A. Smith, H. Sathianathan, M. Joshi, H. Park, M. Elgebaly, S. Turullols, S. Kim, R. Masleid, G. Konstadinidis, M. Doherty, G. Grohoski, and C. McAllister, "The next generation 64b SPARC core in a T4 SoC processor," *IEEE J. Solid-State Circuits*, vol. 48, no. 1, pp. 82–90, Jan. 2013.
- [13] Emerson, "Energy Logic: Reducing data center energy consumption by creating savings that cascade across systems," [Online]. Available: http://www.emersonnetworkpower.com/documentation/en-us/latest-thinking/edc/documents/white%20paper/energylogicreducingdatacenterenergyconsumption.pdf
- [14] P. Jones, 2014, Is the Industry Getting Better at Using Power [Online]. Available: http://www.datacenterdynamics.com/focus/archive/2014/ 01/dcd-industry-census-2013-data-center-power
- [15] T. Anand, M. Talegaonkar, A. Elkholy, S. Saxena, A. Elshazly, and P. Hanumolu, "A 7 Gb/s rapid-on/off embedded-clock serial-link transceiver with 20 ns power-on time, 740 μW off-state power for energy-proportional links in 65 nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2015, pp. 64–65.
- [16] M.-S. Chen and C.-K. K. Yang, "A low-power highly multiplexed parallel PRBS generator," in *Proc. IEEE Custom Integr. Circuits Conf.* (CICC), Sep. 2012, pp. 1–4.
- [17] K. T. Malladi, I. Shaeffer, L. Gopalakrishnan, D. Lo, B. C. Lee, and M. Horowitz, "Rethinking DRAM power modes for energy proportionality," in *Proc. IEEE/ACM Int. Symp. Microarchitecture*, Dec. 2012, pp. 131–142.
- [18] F. J. Mesa-Martinez, E. K. Ardestani, and J. Renau, "Characterizing processor thermal behavior," in *Proc. Int. Conf. Architectural Support for Programming Languages and Operating Syst.* (ASPLOS), Mar. 2010, pp. 193–204.
- [19] T. Anand, K. Makinwa, and P. Hanumolu, "A self-referenced VCO based temperature sensor with 0.034 °C/mV supply sensitivity in 65 nm CMOS," in *IEEE Symp. VLSI Circuits Dig.*, Jun. 2015, pp. 200–201.
- [20] P. Hanumolu, M. Brownlee, K. Mayaram, and U.-K. Moon, "Analysis of charge-pump phase-locked loops," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 51, no. 9, pp. 1665–1674, Sep. 2004.

- [21] R. Staszewski and P. Balsara, "All-digital PLL with ultra fast settling," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 54, no. 2, pp. 181–185, Feb. 2007.
- [22] S. Hoppner, S. Haenzsche, G. Ellguth, D. Walter, H. Eisenreich, and R. Schuffny, "A fast-locking ADPLL with instantaneous restart capability in 28-nm CMOS technology," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 60, no. 11, pp. 741–745, Nov. 2013.
- [23] D. Barras, F. Ellinger, H. Jackel, and W. Hirt, "Low-power ultra-wideband wavelets generator with fast start-up circuit," *IEEE J. Microw. Theory Tech.*, vol. 54, no. 5, pp. 2138–2145, May 2006.
- [24] W. Yin, R. Inti, A. Elshazly, B. Young, and P. Hanumolu, "A 0.7-to-3.5 GHz 0.6-to-2.8 mW highly digital phase-locked loop with bandwidth tracking," *IEEE J. Solid-State Circuits*, vol. 46, no. 8, pp. 1870–1880, Aug. 2011.
- [25] P. Andreani, K. Kozmin, P. Sandrup, M. Nilsson, and T. Mattsson, "A TX VCO for WCDMA/EDGE in 90 nm RF CMOS," *IEEE J. Solid-State Circuits*, vol. 46, no. 7, pp. 1618–1626, Jul. 2011.
- [26] J. Lee and M. Liu, "A 20-Gb/s burst-mode clock and data recovery circuit using injection-locking technique," *IEEE J. Solid-State Circuits*, vol. 43, no. 3, pp. 619–630, Mar. 2008.
- [27] A. Rylyakov, J. Proesel, S. Rylov, B. Lee, J. Bulzacchelli, A. Ardey, B. Parker, M. Beakes, C. Baks, C. Schow, and M. Meghelli, "A 25 Gb/s burst-mode receiver for rapidly reconfigurable optical networks," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2015, pp. 400–401.
- [28] R. Navid, M. Hekmat, F. Aryanfar, J. Wei, and V. Gadde, "A 25 GHz 100 ns lock time digital LC PLL with an 8-phase output clock," in *IEEE Symp. VLSI Circuits Dig.*, Jun. 2013, pp. 196–197.
- [29] T. Anand, M. Talegaonkar, A. Elshazly, B. Young, and P. Hanumolu, "A 2.5 GHz 2.2 mW/25 μW on/off-state power 2 ps<sub>rms</sub>-long-term-jitter digital clock multiplier with 3-reference-cycles power-on time," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2013, pp. 256–257.
- [30] D. Dunwell, A. Carusone, J. Zerbe, B. Leibowitz, B. Daly, and J. Eble, "A 2.3–4 GHz injection-locked clock multiplier with 55.7% lock range and 10-ns power-on," in *Proc. IEEE Custom Integr. Circuits Conf.* (CICC), Sep. 2012, pp. 1–4.



**Tejasvi Anand** (S'12) received the M.Tech. degree (with distinction) in electronics design and technology from the Indian Institute of Science, Bangalore, India, in 2008. He is currently pursuing the Ph.D. degree at the University of Illinois at Urbana-Champaign, Urbana, IL, USA.

He was with the RF Circuits and Systems Group at the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, during the summer of 2015. From 2008 to 2010, he was an Analog Design Engineer at Cosmic Circuits (now Cadence), Bangalore, India.

His research focuses on wireline communication, frequency synthesizers, and sensors, with an emphasis on energy efficiency.

Mr. Anand is a recipient of the 2014–2015 IEEE Solid-State Circuits Society Predoctoral Achievement Award, the 2015 Broadcom Foundation University Research Competition Award (BFURC), the 2015 M. E. Van Valkenburg Graduate Research Award from the University of Illinois, the 2013 Analog Devices Outstanding Student Designer Award, and the 2009 CEDT Design (Gold) Medal from the Indian Institute of Science, Bangalore, India. He serves as a reviewer for the IEEE JOURNAL OF SOLID-STATE CIRCUITS.



**Mrunmay Talegaonkar** received the B.Tech. degree in electrical engineering and the M.Tech. degree in microelectronics and VLSI Design from Indian Institute of Technology Madras, Chennai, India, in 2007. He is currently pursuing the Ph.D. degree at University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Between 2007 and 2009, he was a design engineer at Analog Devices, Bangalore, India, where he was involved in the design of digital-to-analog converters. During 2009 and 2010, he was a project associate at the Indian Institute of Technology, Madras, India, working on high-speed clock and data recovery circuits. From 2010 to 2013, he was a research assistant, working on high-speed links, at the Oregon State University, Corvallis, OR, USA. His research interests include high-speed I/O interfaces and clocking circuits.



**Ahmed Elkholy** (S'08) received the B.Sc. degree with honors and the M.Sc. degree in electrical engineering from Ain Shams University, Cairo, Egypt, in 2008 and 2012, respectively.

Currently, he is a research assistant at the University of Illinois, Urbana-Champaign, Urbana, IL, USA, where he is pursuing the Ph.D. degree. From 2008 to 2012, he was an analog/mixed-signal design engineer at Si-Ware Systems, Cairo, Egypt, designing high-performance clocking circuits and LC-based reference oscillators. His research interests include frequency synthesizers, high-speed serial links, and low-power data converters.

Mr. Elkholy received the Edward N. Rickert Engineering Fellowship from Oregon State University (2012–2013) and the Best M.Sc. Degree Award from Ain Shams University in 2012. He serves as a reviewer for the IEEE JOURNAL OF SOLID-STATE CIRCUITS, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, and IEEE International Symposium on Circuits and Systems.



**Saurabh Saxena** received the B.Tech. degree in electrical engineering and the M.Tech. degree in microelectronics and VLSI design from the Indian Institute of Technology Madras, Chennai, India, in 2009, as a part of the dual degree program. He is currently pursuing the Ph.D. degree at the University of Illinois at Urbana-Champaign, Urbana, IL, USA. His research interests include data converters, high-speed I/O interfaces, and clocking circuits.



**Amr Elshazly** (S'04–M'13) received the B.Sc. (Hons.) and M.Sc. degrees from Ain Shams University, Cairo, Egypt, in 2003 and 2007, respectively, and the Ph.D. degree from Oregon State University, Corvallis, OR, USA, in 2012, all in electrical engineering.

He is currently a Design Engineer at Intel Corporation, Hillsboro, OR, USA, developing high-performance high-speed I/O circuits and architectures for next generation process technologies. From 2004 to 2006, he was a VLSI Circuit Design Engineer at AIAT, Inc., working on the design of RF building blocks. From 2006 to 2007, he was with Mentor Graphics Inc., Cairo, designing multistandard clock and data recovery circuits. His research interests include high-speed serial links, frequency synthesizers, digital phase-locked loops, multiplying delay-locked loops, clock and data recovery circuits, data converter techniques, and low-power mixed-signal circuits.

Dr. Elshazly received the Analog Devices Outstanding Student Designer Award in 2011, the Center for Design of Analog-Digital Integrated Circuits (CDADIC) Best Poster Award in 2012, and the Graduate Research Assistant of the year Award in 2012 from the College of Engineering, Oregon State University, Corvallis, OR, USA. He serves as a reviewer for the IEEE JOURNAL OF SOLID-STATE CIRCUITS, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS, the IEEE International Symposium on Circuits and Systems, the IEEE International Conference of Electronic Circuits Systems, and the IEEE Asian Solid State Circuits Conference.



**Pavan Kumar Hanumolu** (S'99–M'07) received the Ph.D. degree from the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA, in 2006, where he subsequently served as a faculty member until 2013.

He is currently an Associate Professor in the Department of Electrical and Computer Engineering and a Research Associate Professor with the Coordinated Science Laboratory, University of Illinois, Urbana-Champaign, Urbana, IL, USA. His research interests are in energy-efficient integrated circuit implementation of analog and digital signal processing, sensor interfaces, wireline communication systems, and power conversion.

Dr. Hanumolu received the National Science Foundation CAREER Award in 2010. He currently serves as an Associate Editor of the IEEE JOURNAL OF SOLID-STATE CIRCUITS and is a technical program committee member of the VLSI Circuits Symposium and the IEEE International Solid-State Circuits Conference.