Ultra Low Power Designs Using Asynchronous Design Techniques (Welcome to the World Without Clocks)
Noida, India
Abstract:
Wire delay is beginning to dominate gate delay in current CMOS technologies. According to Moore’s Law, by 2016 CMOS feature size should be on the order of 22 nm, with clock frequencies reaching around 28.7 GHz. Essentially, bus-based interconnects are being stretched to the point where they cannot be scaled further.
This paper presents the challenges of synchronous (clocked) designs and describes techniques for overcoming them with an asynchronous (clockless) design methodology. The paper proposes redesigning the synchronous interconnect as an asynchronous interconnect that caters to tomorrow’s needs for high speed and low power. These circuits work on handshaking techniques. If not today, the SoC industry will be driven to this methodology tomorrow.
CHALLENGES WITH THE CLOCKED DESIGN
Most digital circuits are synchronous, which means that their operation is controlled by a clock. Although the use of a clock has certain advantages in the design of a digital circuit, it also introduces a number of significant problems that are becoming more serious and more prevalent as technology becomes smaller and faster.
Following are some of the challenges with clocked design:
- Chip partitioned into multiple timing domains: this makes the logic susceptible to metastability and adds latency.
- Clock distribution and clock skew.
- Performance overhead: because the design is synchronous, a single slow component or logic path slows down the whole chip.
- The clock consumes a large part of the chip power (40-70%).
- Challenges with designing reusable components: a design normally has to be altered when migrating to a new SoC because of additional clocking and system constraints.
Probably the most significant problem is clock skew, which is the difference in the arrival time of the clock signal at different parts of a circuit. When a circuit is large and slow, clock skew is insignificant. But as circuits shrink and their speeds grow, this difference becomes very significant, and extra design time and often extra circuitry are needed to solve the problem. It is also becoming difficult to distribute the clock as the network spreads over the die and may have an irregular layout.
With all of the problems caused by the clock, it is very tempting to simply remove it from the system. This is the fundamental idea behind asynchronous design. However, it is not as simple as just removing the clock, since the operation of the circuit must still be controlled somehow. Asynchronous circuits essentially govern themselves, and are therefore called self-timed circuits.
ASYNCHRONOUS DESIGN
Much of today’s logic design is based on two major assumptions:
- All signals are binary
- Time is discrete.
Both of these assumptions are made in order to simplify logic design. By assuming binary values on signals, simple Boolean logic can be used to describe and manipulate logic constructs. By assuming time is discrete, hazards and feedback can largely be ignored.
However, as with many simplifying assumptions, a system that can operate without these assumptions has the potential to generate better results.
Asynchronous circuits keep the assumption that signals are binary but remove the assumption that time is discrete. The problems with synchronous designs discussed in the previous section can be avoided by asynchronous systems.
Figure 1 shows an asynchronous system where the blocks communicate without any global clock.
Figure 1 : Asynchronous System
Highlights:
- Communication using Local handshaking
- Digital design with no centralized clock
- Dynamic timing analysis of logic is needed to determine relative delays between paths
To avoid complex issues, circuits may be built as Delay-insensitive and/or Speed-independent.
MOTIVATION FOR ASYNCHRONOUS DESIGN
This section lists several possible benefits of migrating to asynchronous design.
- No Clock Skew Problem: Perhaps the most important and obvious advantage of asynchronous design is that it removes the clock skew problem inherent in synchronous designs, since asynchronous circuits by definition have no globally distributed clock.
- Higher Performance: The use of a clock forces a synchronous circuit to run at the speed of its slowest component, so its performance is governed by the worst-case delay. An asynchronous circuit, on the other hand, can change speed dynamically, and its performance is therefore governed by the average-case rather than the worst-case delay, potentially resulting in increased performance.
- Increased Power Efficiency (idle logic consumes essentially no dynamic power): The clock dissipates a lot of power, especially in larger and faster designs, so removing it can yield a substantial improvement in power efficiency. Although asynchronous circuits often require more transitions on the computation path than synchronous circuits, they generally have transitions only in the areas involved in the current computation.
- Greater Tolerance to Variation in Operating Conditions: Small variations in operating conditions, such as variations in the input voltage or the ambient temperature, can have a large impact on the timing of a global clock signal. Since the correct operation of a synchronous system depends so heavily on the proper operation of the clock, these slight variations can potentially cause enormous problems. Asynchronous systems, however, are not so dependent on any individual signal and are therefore much less sensitive to variations in operating conditions.
- Better Electromagnetic Compatibility: Since all components in a synchronous system operate at the same clock frequency, the electromagnetic noise produced by synchronous systems is concentrated in a very narrow frequency band. Components in asynchronous systems, on the other hand, often operate at vastly different speeds, so the electromagnetic noise they produce is spread over a much wider frequency band. This reduces interference between different components in a system and with external devices.
- Greater Component Modularity: In a synchronous system, when two distinct components are put together, great care must be taken to make sure that they will interface properly. In asynchronous systems, components can be integrated seamlessly since they automatically adapt their speed to function properly together. This means an asynchronous design or IP block can be easily ported to a variable-speed environment or a new SoC.
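The worst-case versus average-case argument in the "Higher Performance" item can be illustrated with a small calculation. This is only a sketch: the delay values are hypothetical, and handshake overhead in the asynchronous case is ignored.

```python
# Hypothetical delays (in ns) of four successive operations.
# These numbers are assumptions chosen for illustration only.
op_delays_ns = [2.0, 2.5, 3.0, 8.0]

# Synchronous: the clock period must cover the slowest operation,
# so every operation is charged the worst-case delay.
clock_period_ns = max(op_delays_ns)
sync_total_ns = clock_period_ns * len(op_delays_ns)

# Asynchronous (idealized, handshake overhead ignored): each operation
# takes only its own delay, so the total tracks the average case.
async_total_ns = sum(op_delays_ns)

print(sync_total_ns, async_total_ns)
```

In a real design the handshake itself adds delay, so the gap is smaller than this idealized figure suggests.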
Despite all the potential advantages of asynchronous circuits, synchronous systems still predominate. This may be because asynchronous circuits are more difficult to design: the designer must pay a great deal of attention to the dynamic state of the circuit. The next few sections describe the legacy synchronous interconnect bus and ways to upgrade it to a high-speed asynchronous interconnect.
MULLER C ELEMENT: FUNDAMENTAL COMPONENT OF ASYNCHRONOUS CIRCUIT
In a synchronous circuit, the role of the clock is to define points in time where signals are stable and valid. Between clock ticks, signals may exhibit hazards and may make multiple transitions as the combinational logic stabilizes. In an asynchronous system, the situation is different. The absence of a clock means signals must be valid all the time: every transition has a meaning, and consequently all hazards and races must be avoided.
In the synchronous world, the output of an OR gate indicates only when both inputs are LOW; when the output is HIGH, it does not indicate which input made the transition. Similarly, the output of an AND gate indicates only when both inputs are HIGH; when the output is LOW, it does not indicate which input went LOW. Knowing which transition occurred is very important in asynchronous circuits, because unobserved transitions may have an adverse impact or cause a hazard or race condition, and should be avoided. A better circuit in this respect is the Muller C-element shown in Figure 2.
Figure 2 : Muller C Element and corresponding CMOS implementation.
The Muller C-element is a state-holding element, much like a set-reset latch. When both inputs are LOW, the output is LOW, and when both inputs are HIGH, the output is HIGH. For other input combinations the output does not change. An observer who sees the output change from LOW to HIGH may conclude that both inputs are now HIGH. Similarly, an observer who sees the output change from HIGH to LOW may conclude that both inputs are now LOW.
Figure 3 shows the truth table for the Muller C Element.
Figure 3 : Truth Table for Muller C Element
The C-element is a fundamental building block of many asynchronous circuits. It can be thought of as an AND-gate for events. This is also the State Holding Element in the Asynchronous world.
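The behavior described above can be captured in a small behavioral model. This is a minimal sketch of a two-input C-element, not a gate-level or CMOS description; the class name `CElement` and its `update` method are hypothetical names chosen for illustration.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element (illustrative only)."""

    def __init__(self, initial=0):
        self.out = initial  # the held output state

    def update(self, a, b):
        # The output follows the inputs only when they agree;
        # for differing inputs the previous output is held.
        if a == b:
            self.out = a
        return self.out


c = CElement()
for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]:
    print(a, b, "->", c.update(a, b))
```

The output rises only after both inputs have risen and falls only after both have fallen, which is why the element can be read as an AND gate for events.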
DUAL RAIL ENCODING FOR HANDSHAKE COMMUNICATION
In an asynchronous circuit, the clock signal is replaced by some form of handshaking between neighboring registers. Several different protocols exist for implementing handshake communication, such as 2-phase dual-rail encoding, 4-phase dual-rail encoding and bundled data.
Figure 4 shows an example of 4-Phase Dual rail encoding.
Figure 4 : 4-Phase Dual Rail Encoding
Two parties can talk to each other reliably regardless of the delays in the wires connecting them; hence the protocol is also called delay-insensitive encoding.
Highlights:
- Dual rail uses 2 wires per bit of data transfer, one wire for signaling Logic 1 and other to indicate Logic 0.
- n–bit data communication requires 2n wires
- The codeword “00” is the spacer and means no data. Exactly one wire HIGH means a valid 0 or 1. “11” is invalid.
- Each bit is self-timed
- Other delay-insensitive codes (e.g. m-of-n) and event-based signaling also exist.
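The 4-phase dual-rail scheme in the highlights above can be sketched as follows. The wire ordering `(t, f)`, with `t` signaling logic 1 and `f` signaling logic 0, and all function names are assumptions made for illustration.

```python
SPACER = (0, 0)  # "00": no data

def encode_bit(bit):
    """Return the (t, f) wire pair for one data bit (assumed wire order)."""
    return (1, 0) if bit else (0, 1)

def decode_pair(pair):
    """Return 0 or 1 for valid data, None for the spacer; reject '11'."""
    if pair == (1, 1):
        raise ValueError("'11' is not a legal dual-rail codeword")
    if pair == SPACER:
        return None
    return 1 if pair == (1, 0) else 0

def is_complete(pairs):
    """Completion detection: every bit position carries valid data."""
    return all(decode_pair(p) is not None for p in pairs)

def four_phase_transfer(bits):
    """List the (step, wire pairs, ack) states of one 4-phase handshake."""
    word = [encode_bit(b) for b in bits]
    empty = [SPACER] * len(bits)
    return [
        ("sender drives valid codeword", word, 0),
        ("receiver raises acknowledge", word, 1),
        ("sender returns to spacer", empty, 1),
        ("receiver lowers acknowledge", empty, 0),
    ]

for step, pairs, ack in four_phase_transfer([1, 0, 1]):
    print(step, pairs, "ack =", ack)
```

Because the receiver waits until every pair has left the spacer state, it needs no timing assumption about when the individual wires arrive.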
2-phase dual-rail encoding also uses 2 wires per bit, but information is encoded as events (transitions from 0 to 1 or from 1 to 0). A new codeword is received when one of the wires makes a transition. There is no empty value (the “00” spacer does not exist). A valid message is acknowledged and then followed directly by another valid message.
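A minimal sketch of this transition-based signaling for a single bit position follows. It assumes that a transition on wire `t` signals logic 1 and a transition on wire `f` signals logic 0; the function names and the initial wire state are hypothetical.

```python
def encode_2phase(bits, start=(0, 0)):
    """Toggle wire t for a 1 and wire f for a 0; return successive wire states."""
    t, f = start
    states = []
    for bit in bits:
        if bit:
            t ^= 1  # an event on t signals logic 1
        else:
            f ^= 1  # an event on f signals logic 0
        states.append((t, f))
    return states

def decode_2phase(states, start=(0, 0)):
    """Recover the bits by checking which wire toggled at each step."""
    prev = start
    bits = []
    for cur in states:
        t_changed = cur[0] != prev[0]
        f_changed = cur[1] != prev[1]
        if t_changed == f_changed:
            raise ValueError("exactly one wire must toggle per codeword")
        bits.append(1 if t_changed else 0)
        prev = cur
    return bits

wire_states = encode_2phase([1, 1, 0])
print(wire_states)
print(decode_2phase(wire_states))
```

Note that the absolute wire levels carry no meaning here; only which wire changed since the previous codeword does.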
LEGACY SYNCHRONOUS INTERCONNECT
The initial design approach of SoC designers was to select the IP blocks needed to meet application requirements, place them on silicon and connect them with a standard on-chip bus. As was the case with multimillion-gate ASICs containing many connected IP blocks, today’s SoC cannot be built around a single bus. Instead, complex hierarchies of buses are used, with sophisticated protocols and multiple bridges between them (Figure 5).
Figure 5 : Synchronous IP Interconnect
Communication between any two IP blocks can be via several buses, which places a lot of strain on meeting timing requirements. Essentially bus-based interconnects are being stretched to the point where they cannot be scaled further.
SoC designers face a basic paradox in today's environment: rather than enjoying significant time savings by using acquired IP blocks, they spend additional time in learning the function of the blocks in order to build the logic and test vectors for these blocks. Except for the vendors of processor cores, IP vendors typically provide little of the detailed documentation designers need. Consequently, designers find they have to acquire some level of application expertise or use consulting resources to understand the IP well enough to complete these tasks. This additional design and verification burden currently adds months to SoC design projects. Besides imposing a drain on resource-strapped projects, the additional logic inevitably degrades performance and increases chip area, while the additional test requirements further complicate final test stages.
As CMOS feature size continues to decrease in line with Moore’s Law, it is clear that interconnect speed is not keeping up with the increase in transistor speed. This means that in future circuits wire delay will no longer be negligible, but will play a major role in deciding the maximum frequency at which a circuit can operate. In line with the clocking trends, global clock skew becomes an increasing fraction of the clock period.
Examining all these issues makes it clear that a new interconnect strategy is required to bring design risks back under control: large high-speed integrated circuits will eventually need to be designed without global clocking.
Per the International Technology Roadmap for Semiconductors (ITRS, ex. SIA), 1999 edition:
“With clock speed possibly exceeding 5 GHz, and across-chip communication taking upwards of 5 to 20 clock cycles, an approach is needed to building a hierarchy of clock speeds with locally synchronous and globally asynchronous interconnects. Tools to handle asynchronous, multi-cycle interconnect as well as locally synchronous, high performance near neighbor communication are needed.”
HIGH SPEED ASYNCHRONOUS INTERCONNECT: THE CONCEPT
Figure 6 shows the concept for a system designed around an asynchronous interconnect bus.
Figure 6 : High Speed Asynchronous Interconnect (The Concept)
The goal is to design a high-throughput, flexible and low-power digital crossbar.
Asynchronous circuits can interconnect multiple synchronous cores in an SoC design, eliminating global clock distribution and simplifying clock domain crossing. Following are some of the highlights:
- The asynchronous crossbar shown interfaces with masters on one side and slaves on the other.
- Flow control extends through the crossbar.
- Any asynchronous IP block can be connected directly to a crossbar master/slave port without any bridges.
- Legacy synchronous IP blocks can still be connected to a crossbar master/slave port via sync/async bridges, so this methodology still allows existing synchronous IP to be reused.
- Each module maintains input/output queues for traffic to/from each other module.
- Data is sent from an input queue to an output queue over the crossbar as a series of short bursts.
- The asynchronous interconnect does not require a global clock, which keeps the design power-efficient and high-performance.
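The queue-and-burst behavior in the highlights above can be sketched in software. This is a conceptual model only, not a description of any particular crossbar: the burst length, the output-queue depth and the function name `send_burst` are all assumptions.

```python
from collections import deque

BURST = 4         # assumed burst length, in words
OUT_CAPACITY = 8  # assumed output-queue depth, in words

def send_burst(in_q, out_q, burst=BURST, capacity=OUT_CAPACITY):
    """Move up to one burst from in_q to out_q; return the number of words moved.

    The transfer stops early when the output queue is full, which models
    flow control extending through the crossbar.
    """
    moved = 0
    while in_q and moved < burst and len(out_q) < capacity:
        out_q.append(in_q.popleft())
        moved += 1
    return moved

in_q = deque(range(10))  # ten words waiting at a master
out_q = deque()          # empty queue at the destination slave
bursts = []
while in_q and len(out_q) < OUT_CAPACITY:
    bursts.append(send_burst(in_q, out_q))

print(bursts, list(in_q), list(out_q))
```

Once the output queue fills, the remaining words simply wait in the input queue until the receiver drains data; no global clock is involved in that decision.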
With the Asynchronous Design methodology, the IP is completely de-coupled from the interconnect bus. This makes it possible to integrate asynchronous communication within an existing synchronous system. Due to the delay insensitive encoding, the wires supporting the communication at physical level do not have to be balanced. Unlike the “legacy” technique for IP integration, Asynchronous communication does not require large clock tree buffers due to the IP being de-coupled from the interconnect bus. This saves a considerable amount of power, which can be extremely important for handheld devices that operate on battery power.
The Asynchronous approach means that the interconnect bus can run at a much higher frequency, thus increasing overall system performance.
Last but not least, the Asynchronous approach can simplify system level verification of the IP block. With the IP block being completely decoupled from the interconnect bus, verification can be performed at the asynchronous dividing point. In the case of third party pre-verified IP, IP level verification can be completely eliminated.
DESIGN TOOLS FOR ASYNCHRONOUS DESIGN METHODOLOGY
A few CAD tools are available for asynchronous design implementation:
- ChainWorks “A tool suite for the design and synthesis on-chip interconnect” from Silistix.
- “Balsa Asynchronous Synthesis System” by “The Advanced Processor Technologies Group” University of Manchester.
- “Petrify” a tool for synthesis of Petri Nets
- “LARD” hardware description language developed for describing asynchronous systems by University of Manchester.
- “DESI” Tool for Decomposing Signal Transition Graphs.
“ChainWorks” provides tools for the design and synthesis of on-chip interconnect that fit directly into an existing design flow. For example, design entry is done via schematic capture or SystemVerilog, and the ChainWorks tool generates structural Verilog. Synthesis and STA are then done with the existing tool setup, using scripts generated by ChainWorks.
“Balsa” is built around the Handshake Circuits methodology and can generate gate level netlists from high-level descriptions in the Balsa language. Both dual-rail (QDI) and single-rail (bundled data) circuits can be generated. The approach adopted by Balsa is that of syntax-directed compilation into communicating handshaking components and closely follows the Tangram system of Philips.
CONCLUSION
Asynchronous design is a rich area of research, with many different approaches to circuit design. This paper describes the limitations and challenges of synchronous (clocked) design and the motivation for migrating to asynchronous design for higher performance and power efficiency. It also proposes replacing the existing synchronous interconnect with an asynchronous interconnect that caters to tomorrow’s needs for high speed and low power. If not today, the SoC industry will be driven to this methodology tomorrow.