A 2.8mW/Gb/s 14Gb/s Serial Link Transceiver in 65nm CMOS

Saurabh Saxena, Guanghua Shu, Romesh Kumar Nandwana, Mrunmay Talegaonkar, Ahmed Elkholy, Tejasvi Anand, Seong Joong Kim, Woo-Seok Choi, and Pavan Kumar Hanumolu

University of Illinois at Urbana-Champaign, Urbana, IL, USA
ssaxena4@illinois.edu

Abstract

A low-power 14Gb/s transceiver is presented that uses a partially segmented voltage-mode driver, a charge-based analog front-end, and a low-power clock and data recovery circuit that also minimizes clock distribution power. Fabricated in a 65nm CMOS process, the transceiver achieves a power efficiency of 2.8mW/Gb/s and BER<10^-12 while operating at 14Gb/s over a channel with 12dB loss.

Introduction

Improving energy efficiency is the most important aspect of high-speed serial link transceiver design. Several techniques have been proposed to lower the power consumed in both transmitters and receivers. On the transmitter side, voltage-mode (VM) output drivers are commonly used, but they suffer from a de-emphasis range/resolution tradeoff, and their power efficiency is degraded by the output driver segmentation needed to implement de-emphasis [1]. On the receiver side, low-swing charge-based (CB) CMOS logic was introduced to reduce power in high-speed samplers and de-serializers [2]. However, the return-to-zero (RZ) behavior of such circuits and their need for additional clock phases for de-serialization degrade power efficiency. Clock generation and distribution also consume a significant portion of the link power. Resonant techniques help lower this power, but they are susceptible to process variation and can only be used in a narrow frequency range [3]. In this paper, we propose: (i) a partially segmented VM driver that improves power efficiency by more than 30% across all de-emphasis settings, (ii) a CB sampling flip-flop (CFF) with non-return-to-zero (NRZ) operation, and (iii) a low-power clock and data recovery (CDR) architecture that also reduces clock distribution power. Using these techniques, the prototype transceiver fabricated in a 65nm CMOS process achieves 2.8mW/Gb/s power efficiency at a 14Gb/s data rate.

Transceiver Architecture

The block diagram of the proposed transceiver is shown in Fig. 1. The transmitter consists of parallel PRBS generators operating at 0.875GHz, a 16-to-1 serializer, and the proposed partially segmented N-over-N VM driver with 1 tap of de-emphasis embedded into it. Fig. 1 shows a single-ended diagram of the VM output driver, which is partially segmented into a single main cursor and 12 tunable cells that serve as the 1-tap precursor. Compared to a fully segmented output driver, this reduces the parasitic capacitance in the high-speed signal path and saves roughly 30% of the power for all equalizer settings. A digital LC PLL placed in close proximity to the transmitter provides a low-jitter clock to both the transmitter and the receiver.
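The rate relationship between the parallel PRBS generators and the 16-to-1 serializer can be checked with a short behavioral sketch. The PRBS order (PRBS7, x^7 + x^6 + 1), the seed, and the bit ordering within each word are assumptions made for illustration; the text specifies only the 0.875GHz generator rate and the 16-to-1 ratio.

```python
from itertools import islice

LINE_RATE_BPS = 14e9   # serial data rate stated in the text
SER_RATIO = 16         # 16-to-1 serializer stated in the text


def word_rate_hz(line_rate_bps=LINE_RATE_BPS, ratio=SER_RATIO):
    """Rate at which parallel words must be produced to keep the serializer fed."""
    return line_rate_bps / ratio


def prbs7(seed=0x7F):
    """Yield PRBS7 bits (x^7 + x^6 + 1). Polynomial and seed are assumptions."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        yield new_bit


def parallel_words(bits, width=SER_RATIO):
    """Group a bit stream into width-bit words, one word per low-rate clock cycle."""
    bits = iter(bits)
    while True:
        word = list(islice(bits, width))
        if len(word) < width:
            return
        yield word


def serialize(words):
    """Serializer model: emit each word's bits in order (ordering is assumed)."""
    for word in words:
        yield from word


if __name__ == "__main__":
    print(f"word rate = {word_rate_hz() / 1e9} GHz")
```

With these numbers the word rate evaluates to 14Gb/s / 16 = 0.875GHz, which matches the PRBS generator clock quoted above.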

The half-rate receiver is composed of the proposed low-power charge-based analog front-end (AFE), which generates data and edge samples from the incoming data and feeds them to a type-II clock and data recovery (CDR) circuit that provides 4 equally spaced half-rate clocks to the AFE. In conventional phase-interpolator-based CDRs, the sign of the phase error generated by the bang-bang phase detector (!!PD) is processed by a proportional-integral filter and fed to a digital phase accumulator implemented using ACCPi and a phase interpolator. However, such CDRs require multiple phase interpolators (PIs) for half-rate operation, and distributing multiple high-frequency clocks from the LC PLL to the PIs and from the PIs to the samplers incurs a significant power penalty.
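The sign decision made by a bang-bang phase detector from data and edge samples can be sketched as follows. This is the textbook Alexander-style decision rule, not necessarily the logic used in this receiver; the function names and the sign convention (-1 for an early clock, +1 for a late clock) are assumptions.

```python
def bang_bang_pd(prev_data, edge, curr_data):
    """Return -1 (clock early), +1 (clock late), or 0 (no transition)."""
    if prev_data == curr_data:
        return 0  # no data transition, so no timing information
    # The edge sample sits between the two data samples. If it still equals
    # the previous bit, the edge clock fired before the transition: early.
    return -1 if edge == prev_data else +1


def phase_error_signs(data_samples, edge_samples):
    """Apply the detector; edge_samples[i] lies between data_samples[i] and [i+1]."""
    return [
        bang_bang_pd(data_samples[i], edge_samples[i], data_samples[i + 1])
        for i in range(len(data_samples) - 1)
    ]


if __name__ == "__main__":
    print(phase_error_signs([0, 1, 1, 0], [0, 1, 1]))
```

Only the sign of the phase error is produced, which is why the loop filter that follows accumulates and scales these decisions rather than a linear error value.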

To overcome these drawbacks, a single PI controlled by the integral control of the CDR is placed close to the LC PLL, and its output is divided (down to 0.875GHz) and used as the reference clock to a ring-oscillator-based digital phase locked loop (DPLL), which is implemented using a hybrid analog/digital architecture [4] (see Fig. 2). Proportional control of the CDR is implemented in the DPLL by adding the accumulated and scaled bang-bang PD output to both the ring VCO control node (using CDAC) and the integrator (ACCi) input. This direct control path in the DPLL reduces the CDR loop delay compared to a conventional PI-based CDR. As a result, deterministic jitter of the recovered clock is greatly reduced. The ring oscillator, placed in close proximity to the AFE, provides 4 equally spaced sampling clocks. Ring oscillator phase noise is suppressed by choosing a relatively high digital PLL bandwidth. Simulations indicate that a 0.875GHz reference clock is sufficient to achieve less than 1ps rms long-term absolute jitter. Because only a single-phase low-frequency reference clock is distributed from the PI to the DPLL, the clock distribution power is minimized.
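The interplay of a proportional path and an integral path in a bang-bang CDR can be illustrated with a minimal discrete-time model. This is a generic type-II bang-bang loop under idealized assumptions (one decision per bit, no loop latency, no DPLL dynamics, no noise); it is not a model of the circuit in Fig. 2, and the gains kp and ki, the frequency offset, and the initial error are hypothetical values chosen only so the example settles.

```python
def simulate_bb_cdr(n_bits=20000, freq_offset_ui=1e-4, kp=2e-3, ki=1e-6,
                    initial_error_ui=0.1):
    """Simulate a type-II bang-bang loop; phases are in unit intervals (UI).

    All parameter values are illustrative assumptions.
    Returns the per-bit phase error and the final integral-path value.
    """
    data_phase = initial_error_ui
    clk_phase = 0.0
    integ = 0.0          # integral path: tracks frequency offset (UI per bit)
    errors = []
    for _ in range(n_bits):
        err = data_phase - clk_phase
        sign = 1 if err > 0 else -1       # bang-bang decision: sign only
        integ += ki * sign                # integral control
        clk_phase += kp * sign + integ    # proportional control acts directly
        data_phase += freq_offset_ui      # data drifts due to frequency offset
        errors.append(err)
    return errors, integ


if __name__ == "__main__":
    errors, integ = simulate_bb_cdr()
    print(f"final |error| = {abs(errors[-1]):.4f} UI, integral = {integ:.2e} UI/bit")
```

In this model the proportional term sets the size of the steady-state dither, while the integral term slowly absorbs the frequency offset. Shortening the delay around the proportional path, as described above, limits how far the phase can wander before a correction takes effect.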

AFE Circuit Design

As shown in Fig. 3, the analog front end (AFE) employs NRZ CFFs for data and edge detection. The front-end amplifier drives four samplers clocked by the I/Q clock phases from the DPLL for half-rate operation. Each sampler consists of a cascade of two charge-based sense amplifiers (CSAs) followed by a low-swing sample and hold circuit (SHC), which holds the output voltage while the CSA is in its reset state. In other words, the CFF can be used as a low-swing flip-flop for energy-efficient sampling and de-serialization without needing additional clock phases as in [2]. The CSA outputs are converted to full CMOS swing signals using a cross-coupled latch only in the final de-serialization stage.
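How four half-rate clock phases split the incoming stream into data and edge samples can be sketched with simple slicing. The mapping of phases to lanes (0 and 180 degrees for data, 90 and 270 degrees for edges) and the lane names are assumptions for illustration; the text states only that four samplers are clocked by I/Q phases.

```python
def half_rate_lanes(oversampled):
    """Split a 2-samples-per-UI sequence (data, edge, data, edge, ...) into
    four lanes, one per assumed half-rate clock phase."""
    return {
        "data_even": oversampled[0::4],  # assumed phase 0 (I)
        "edge_even": oversampled[1::4],  # assumed phase 90 (Q)
        "data_odd": oversampled[2::4],   # assumed phase 180 (I-bar)
        "edge_odd": oversampled[3::4],   # assumed phase 270 (Q-bar)
    }


if __name__ == "__main__":
    lanes = half_rate_lanes(["d0", "e0", "d1", "e1", "d2", "e2", "d3", "e3"])
    for name, samples in lanes.items():
        print(name, samples)
```

Each lane runs at half the bit rate, so every sampler has two unit intervals per decision instead of one.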

Measurement Results

The prototype transceiver was implemented in a 65nm CMOS process and occupies an active area of 1.1mm^2. The die is packaged in a standard 88-pin plastic QFN package and characterized using a four-layer FR-4 printed circuit board. Fig. 4(a) shows the transmitted eye diagram at the end of the channel when the transmitter is configured to transmit a 0.4Vpp differential amplitude at 14Gb/s. The channel includes the bond wire, package parasitics, a microstrip line on the test board, and SMA cables. The vertical and horizontal eye openings are 150mV and 0.71UI, respectively. Fig. 4(b) shows bathtub plots for the transmitter output at the end of a channel with 12dB loss at 7GHz. Without equalization, the sampling time margin is <0.1UI for BER<10^-12. When a 2-tap FIR equalizer is applied, the time margin improves to 0.36UI. Fig. 5(a) shows the recovered clock waveform when the receiver is locked to 14Gb/s PRBS7 data fed externally from a BERT with BER<10^-12. The long-term absolute jitter of the recovered clock is 1.84ps rms and 17.8ps pp, while that of the LC PLL is 1.25ps rms and 10.5ps pp. Fig. 5(b) shows the power distribution of the transceiver. Thanks to the proposed techniques, the transceiver achieves an excellent energy efficiency of 2.8mW/Gb/s (= 39.6mW at 14Gb/s, which includes the power of all the circuits shown in Fig. 1 except the PRBS generators). The performance summary and comparison table, along with the die photo, are shown in Fig. 6.
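The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this section; the helper names are hypothetical.

```python
def energy_per_bit_pj(power_mw, rate_gbps):
    """Energy efficiency; mW/(Gb/s) is numerically equal to pJ/bit."""
    return power_mw / rate_gbps


def unit_interval_ps(rate_gbps):
    """Duration of one bit (1 UI) in picoseconds."""
    return 1000.0 / rate_gbps


def ps_to_ui(t_ps, rate_gbps):
    """Express a time in picoseconds as a fraction of the unit interval."""
    return t_ps / unit_interval_ps(rate_gbps)


if __name__ == "__main__":
    rate = 14.0  # Gb/s
    print(f"efficiency       = {energy_per_bit_pj(39.6, rate):.2f} mW/Gb/s")
    print(f"unit interval    = {unit_interval_ps(rate):.2f} ps")
    print(f"rms jitter       = {ps_to_ui(1.84, rate):.4f} UI")
    print(f"pk-pk jitter     = {ps_to_ui(17.8, rate):.4f} UI")
    print(f"0.36UI margin    = {0.36 * unit_interval_ps(rate):.2f} ps")
```

At 14Gb/s one unit interval is about 71.4ps, so 39.6mW corresponds to roughly 2.83mW/Gb/s (quoted as 2.8mW/Gb/s), the 17.8ps pp recovered clock jitter is about a quarter of a UI, and the equalized 0.36UI timing margin is about 25.7ps.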

Acknowledgements
This work was supported by the NSF CAREER award EECS-0954969. We thank Berkeley Design Automation for providing the Analog Fast Spice (AFS) simulator.

References

Fig. 1. Proposed transceiver architecture.

Fig. 2. Block diagram of ring oscillator-based DPLL.

Fig. 3. Charge-based receiver analog front-end.

Fig. 4. (a) Transmit eye. (b) Bathtub plots at o/p of channel.

Fig. 5. (a) Recovered clock jitter. (b) Tx/Rx power summary.

Fig. 6. Performance table and die photo.