Single-cycle On-chip Traversal Using SMART Networks-on-chip
High-performance chips nowadays use multiple processing cores on the same die, connected by a network-on-chip (NoC). The NoC is a grid of wires, with switches at crosspoints to multiplex different flows over the shared wire segments. Messages typically traverse the chip hop by hop (1 hop ≈ 1 mm), from one switch to the next. As technology scales and transistors become smaller, the number of cores on-chip is expected to scale proportionally. More cores mean more hops between the source and the destination, and each intermediate switch adds latency and power overhead from its buffering and control circuitry, which slows the system down.
We propose a novel NoC architecture called SMART (single-cycle multi-hop asynchronous repeated traversal)[1][2][3] that allows messages to traverse multiple hops (i.e., multiple switches and wire segments), potentially all the way from the source to the destination, within a single cycle before getting latched. Figure 1 illustrates this traversal. The key enabler for this design is a clockless repeater at each switch, which replaces the conventional clocked driver. Before the message can flow through, all intermediate switches need to be reconfigured[1][2] to enable or disable the input latch and to connect the appropriate input port to the appropriate output port.
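The latency benefit of multi-hop traversal can be sketched with a simple cost model. The Python below is an illustration under stated assumptions, not the model used in the cited papers: the function names, the 2-cycle per-hop cost, the 1-cycle setup cost for reconfiguring the intermediate switches, and the bypass limit `max_hops_per_cycle` are all hypothetical, and contention between flows is ignored.

```python
import math

# Hypothetical cost model; none of the default parameter values
# below are taken from the SMART papers.

def manhattan_hops(src, dst):
    """Hop count between two (x, y) switch coordinates on a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def hop_by_hop_cycles(hops, cycles_per_hop=2):
    """Latency when the message is latched at every switch.

    cycles_per_hop is an assumed per-hop cost (switch plus wire segment).
    """
    return hops * cycles_per_hop

def smart_cycles(hops, max_hops_per_cycle, setup_cycles=1):
    """Latency when up to max_hops_per_cycle switches are bypassed per cycle.

    setup_cycles is an assumed cost for configuring the intermediate
    switches before the message flows through.
    """
    if hops == 0:
        return 0
    return setup_cycles + math.ceil(hops / max_hops_per_cycle)

if __name__ == "__main__":
    hops = manhattan_hops((0, 0), (7, 7))  # corner to corner on an 8x8 mesh
    print(hops, hop_by_hop_cycles(hops), smart_cycles(hops, max_hops_per_cycle=16))
```

Under these assumed costs, a 14-hop corner-to-corner route on an 8x8 mesh takes 28 cycles hop by hop and 2 cycles with a 16-hop bypass limit. The gap narrows as the bypass limit shrinks or the setup cost grows.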
We design two low-swing clockless repeater circuits that provide fast transmission at high energy efficiency. Figure 2 shows our test chip in 45-nm SOI. In the first design, called a self-resetting logic repeater (SRLR)[3], the repeater cell resets its input voltage to zero as soon as it recognizes the input logic value. In other words, data is transmitted as voltage-limited pulses. In the second design, called a voltage-locked repeater (VLR)[2], the input voltage is maintained near the RX threshold voltage by feedback circuits and keeper transistors. Measurement results on a 10-mm link with 1-mm repeater spacing show that SRLR consumes 404 fJ/b at a data rate of 4.1 Gb/s, while VLR pushes the data rate to 6.8 Gb/s at 608 fJ/b. In contrast, full-swing repeaters achieve a data rate of 5.5 Gb/s at 765 fJ/b. At 1 GHz, VLR enables 16 mm/cycle traversal, in contrast with 13 mm/cycle using full-swing repeaters.
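The arithmetic behind these figures can be checked with a short Python sketch. It uses only the numbers quoted above; the helper names are hypothetical, and the comparison against full-swing repeaters assumes their 765 fJ/b figure was measured on the same 10-mm link, which the text does not state explicitly. The hop count per cycle assumes the roughly 1-mm hop length mentioned earlier.

```python
LINK_LENGTH_MM = 10.0  # measured link length quoted in the text

# Energy per bit (fJ/b) and data rate (Gb/s) as quoted in the text.
MEASUREMENTS = {
    "SRLR": {"energy_fj_per_bit": 404.0, "rate_gbps": 4.1},
    "VLR": {"energy_fj_per_bit": 608.0, "rate_gbps": 6.8},
    "full-swing": {"energy_fj_per_bit": 765.0, "rate_gbps": 5.5},
}

def energy_per_bit_per_mm(name):
    """Energy normalized to link length, in fJ/b/mm.

    Assumes the quoted energy covers the full 10-mm link.
    """
    return MEASUREMENTS[name]["energy_fj_per_bit"] / LINK_LENGTH_MM

def energy_saving_vs_full_swing(name):
    """Fractional energy-per-bit saving relative to full-swing repeaters."""
    baseline = MEASUREMENTS["full-swing"]["energy_fj_per_bit"]
    return 1.0 - MEASUREMENTS[name]["energy_fj_per_bit"] / baseline

def hops_per_cycle(mm_per_cycle, hop_length_mm=1.0):
    """Whole hops covered in one cycle, assuming a fixed hop length."""
    return int(mm_per_cycle // hop_length_mm)

if __name__ == "__main__":
    for name in ("SRLR", "VLR"):
        print(name, energy_per_bit_per_mm(name), energy_saving_vs_full_swing(name))
    print(hops_per_cycle(16.0), hops_per_cycle(13.0))
```

Normalized this way, SRLR comes to 40.4 fJ/b/mm, which matches the figure in the title of [3], and VLR to 60.8 fJ/b/mm. Relative to full-swing repeaters, the energy-per-bit savings are roughly 47% for SRLR and 20.5% for VLR under the same-link assumption.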
- T. Krishna, C-H. O. Chen, W-C. Kwon, and L-S. Peh, “Breaking the on-chip latency barrier using SMART,” in Proc. of the 19th IEEE International Symposium on High-Performance Computer Architecture, Feb 2013, pp. 378-389.
- C-H. O. Chen, S. Park, T. Krishna, S. Subramanian, A. P. Chandrakasan, and L-S. Peh, “SMART: A Single-Cycle Reconfigurable NoC for SoC Applications,” in Proc. of Design Automation and Test in Europe, Mar 2013, pp. 338-343.
- S. Park, M. Qazi, L-S. Peh, and A. P. Chandrakasan, “40.4fJ/bit/mm Low-Swing On-Chip Signaling with Self-Resetting Logic Repeaters embedded within a mesh NoC in 45nm SOI CMOS,” in Proc. of Design Automation and Test in Europe, Mar 2013, pp. 1637-1642.