Low Power Design Methodology for Core based ASSP
K S Gurumurthy Reader, U V College of Engineering
Bangalore, India
Abstract
As design cycles shrink and complexity increases manyfold, it has become regular practice to develop Application Specific Standard Products (ASSPs) from readily available cores with customization around them. In today’s complex ASSP system designs for portable and wireless applications, low power has become an absolutely necessary design goal. The conventional design technique addresses power late in the design cycle, almost at the physical design stage, and thus leaves few options for power optimization. However, there are ample opportunities to reduce power to a great extent if the designer is power conscious right from the system definition stage.
The paper deals with power optimisation techniques that can be applied to any design core selected for use in an ASSP design. A Bluetooth baseband core was considered for the power study. It is shown that a power saving of around 20% could be achieved.
1. Introduction
In today’s complex ASSP system designs for portable and wireless applications, there is strong emphasis on low power design. Power reduction and power saving are the key words in these applications. The conventional design technique addresses power late in the design cycle, almost at the physical design stage. This leaves very few options for power optimization. However, there are ample opportunities to reduce power to a greater extent if the designer is power conscious right from the system definition stage.
Section 2 discusses the theory of power in CMOS circuits. Power optimisation methodologies are discussed in Section 3. A case study is taken up for investigation in Section 4. The paper is concluded in Section 5.
2. Power theory in CMOS
The sources contributing to power in any VLSI circuit are conceptually of two types: static power dissipation and dynamic power dissipation. The first occurs constantly while the cell is powered on, while the second occurs when a signal that leads into the cell and/or comes out of the cell changes its voltage level. For computational purposes, the power can be broken down further into the following types, which depend on the status of the cell and its surroundings:
- Switching Power
- Short-circuit Power
- Internal Power
- Leakage Power
- Clock Gating Power
- Glitch Power
a) Switching Power
When a cell has to drive one of its output nets from one voltage value to another, it is working against the capacitive load of the output net. The power used to drive an output capacitance (C) from ground to the supply voltage (V) or vice versa, at a frequency f, can be calculated by the following equation:
$P = C \cdot V^{2} \cdot f$
So, for a cell with a given operating frequency, the lower the voltage and the load capacitance, the less power is used.
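The quadratic dependence on voltage can be made concrete with a small calculation. The sketch below simply evaluates the relation above; the capacitance, voltages and frequency are hypothetical values chosen for illustration and are not taken from the case study.

```python
def switching_power(c_load_farads, v_supply_volts, freq_hz):
    """Switching power P = C * V^2 * f, in watts."""
    return c_load_farads * v_supply_volts ** 2 * freq_hz


# Hypothetical operating points, chosen only for illustration.
p_nominal = switching_power(10e-12, 1.8, 100e6)  # 10 pF, 1.8 V, 100 MHz
p_scaled = switching_power(10e-12, 1.2, 100e6)   # same load and clock at 1.2 V

saving = 1.0 - p_scaled / p_nominal
print(f"nominal: {p_nominal * 1e3:.2f} mW")
print(f"scaled:  {p_scaled * 1e3:.2f} mW")
print(f"saving:  {saving:.1%}")
```

With these assumed numbers, lowering the supply from 1.8 V to 1.2 V reduces the switching power by a little over half, because the ratio depends on the square of the two voltages.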
To report or optimize power, Synopsys Power Compiler uses the toggle information of the design, that is, its switching activity. The tool models switching activity by static probability and toggle rate.
- Static probability (SP0 or SP1) is the probability that a signal is at a certain logic state and is expressed as a number between 0 and 1.
For example, if SP1 = 0.70, the signal is at the logic 1 state for 70 percent of the time. Synopsys power tools use SP1 when modeling switching activity.
- Toggle rate is the number of logic-0-to-logic-1 and logic-1-to-logic-0 transitions of a design object (for example, net, pin, or port) per unit of time.
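The two quantities defined above can be illustrated on a sampled waveform. This is only a sketch of the definitions, not the method used by any particular tool; the waveform and the 10 ns sample period are assumed values.

```python
def static_probability(samples):
    """SP1: fraction of the samples at which the signal is at logic 1."""
    return sum(samples) / len(samples)


def toggle_rate(samples, sample_period_s):
    """Number of 0->1 and 1->0 transitions per second over the window."""
    toggles = sum(1 for a, b in zip(samples, samples[1:]) if a != b)
    duration_s = (len(samples) - 1) * sample_period_s
    return toggles / duration_s


# Hypothetical net, sampled every 10 ns.
wave = [0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
sp1 = static_probability(wave)
tr = toggle_rate(wave, 10e-9)
print(f"SP1 = {sp1:.2f}, toggle rate = {tr:.3g} transitions/s")
```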
b) Short-circuit Power
When a cell changes its state from 0 to 1 (or vice versa), both the n- and p-transistors are in saturation for short periods of time during the transition. This allows a direct connection through the devices between the power and ground rails of the cell. Fig. 1 shows a CMOS inverter cell.
Fig.1 CMOS inverter cell
In the inverter cell, the amount of current that flows directly from VDD to GND depends on the rate at which the input signal changes. The faster the input signal changes, the less current flows and thus the smaller the short-circuit power consumption.
The most commonly used power estimation tools derive leakage power from the static probabilities of the cell pins and the state conditions in the leakage power tables of the libraries.
It is also worthwhile to note the significance of the following definitions:
Default cell leakage power: This is used as the default leakage power for all the cells in the library.
Cell leakage power: Each cell may have this definition. If state-dependent leakage is not defined, this value is used.
Leakage power: As shown above, this is used for state dependent leakage.
c) Internal Power
Not every change of an input to a cell will necessarily lead to a change in the state of the output net. The power consumed by a cell when an input changes but an output does not is typically classified as internal power consumption. The consumption can occur when the transistors in a cell turn either on or off.
d) Leakage Power
Two kinds of power consumption fall into the leakage power consumption category. The first is the power consumed through sub-threshold currents that constantly flow from drain to source in a transistor. This occurs even when the gate-to-source voltage of the device is below the threshold voltage for the transistor to be fully on. The second is the power consumed by the reverse-biased diodes that are formed between the diffusion regions of the transistors and the substrate. The value of the leakage power for a cell typically remains constant; it does not change when an input transition signal or the amount of the output load capacitance varies from instance to instance. However, since this is a constant drain of current, it can be an important factor in calculating the overall power consumption, particularly for the process technologies of smaller dimensions.
e) Clock gating power
Though the clock structure consumes 30% of the chip power, the power tool does not estimate any power specifically for the clock trees. The power of the clock trees always has to be worked out manually and added to the power estimated by the tool. If the clock trees are available in the early stages of the design and are included, the tool estimates their power as discussed above.
f) Glitch Power
Glitches in a circuit contribute significantly to the power. Therefore care should be taken to arrive at a design that avoids glitches.
3. Power optimization methodology
The IP cores chosen for the design implementation have to be thoroughly evaluated for low power considerations. The customisation activity should address low power consumption in the selected IP core. This section deals with power optimisation methods at the different stages of design.
3.1 System IP core definition:
It is in this phase of the design that the designer has the maximum flexibility to design a power-efficient system. Some of the areas the designer can look to for a power saving strategy while defining the system IP are the following:
- Voltage scaling,
- HW-SW partitioning,
- Power down mode operations
- Minimal Retransmissions
- Duplicate Frame Rejections
- RTL /Circuit Level
3.1.1 Voltage Scaling
This is one of the advanced techniques used for low power system design. Typically in SoCs there are several blocks that work at different speeds. [1] For modules whose voltage can be scaled down without affecting performance, voltage translators, level shifters or DC-DC converters are implemented within the core. This boils down to deciding whether the core under consideration can be made to work at a lower voltage. It requires separate low power library cells to be used selectively for this core during synthesis, whereas the rest will use normal library cells. Scaling the voltage down improves the power saving. In some advanced designs, dynamic voltage converters / scalers are implemented that shift the supply voltage dynamically as required, so that power is saved efficiently.
3.1.2 HW-SW Partitioning
Services can be implemented either in software running on cores, or in dedicated hardware. In a typical design flow this will be decided during the step of hardware-software partitioning. Efficient HW-SW partitioning has a great influence on low system power.
3.1.3 Power Down Mode Operation
Can the clock to the core be turned off when the core is not in use, or can the core be made to run on a lower clock? Operating at a slower clock is a strongly recommended low power design strategy.
3.2 Processor based Firmware/Software Power optimisation techniques :
The following techniques are more relevant for software/firmware architectures using processors in ASSPs.
3.2.1 Architectural / Algorithm Design :
It is in this phase of the design that the designer has good flexibility to design a power-efficient system. Some of the areas the designer can look to for a power saving strategy are:
- Temporal / Spatial Locality
- Concurrency / Pipelining
- Parallel Architectures
- Minimization Of Processor Interactions
3.2.1.1 Temporal / Spatial Locality
Temporality captures information about the lifetimes of the variables in the computation. A computation is considered to be temporally local if the expected lifetimes of the variables are short. It is temporally dense if the measured maximum expected number of variables alive at any time is large.
3.2.1.2 Concurrency / Pipelining
Concurrency and pipelining are very powerful methods for low power system design. Concurrent operations improve the speed of operation of a block. Pipelining improves throughput, so that clock cycles are not wasted (from the processor perspective), and also improves the speed of operation by splitting large combinational logic into stages. The speed gained gives the designer the flexibility to operate at lower voltages. The power saving is strongly affected by the voltage scaling, because power depends on the square of the supply voltage.
3.2.1.3 Parallel Architectures
Parallel computation of CRC / encryption / decryption / data processing has been found advantageous over serial computation, as the amount of switching activity in the parallel computation is found to be lower than in the serial one. Reducing the switching activity reduces the switched capacitance, thereby saving power.
3.2.1.4 Minimizing Processor Interactions
In communication subsystems (like Bluetooth and WLAN), minimizing processor interactions yields a good amount of power saving. This can be achieved by considering some of the following strategies at the architectural level.
3.3 Interrupt Strategy
Design an appropriate number of interrupts to the processor, so that the number of interrupts raised is optimal and the large overhead of context switches is avoided. Different architectures can also be followed for reporting the source of an interrupt. If the core has many interrupt sources, the interrupts raised by a source (block) can be reflected in a dedicated status register for that block, and the block is in turn reflected in the Main Status Register. The host then need not read all the status registers to arrive at the source of the interrupt; it performs only the reads needed to identify the source. This is a two-level hierarchical representation of the interrupt sources. The designer can also use more levels of hierarchy to optimise the host reads. If there are few interrupt sources, a single-level hierarchy is preferred, as it gives the minimum number of host reads needed to identify the source of the interrupt.
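The saving in host reads from a two-level status hierarchy can be sketched as follows. The register layout, the number of blocks and the function names are assumptions made for illustration; the sketch only counts reads and does not model any particular core.

```python
def set_bits(value):
    """Return the positions of the bits that are set in value."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def locate_flat(block_status):
    """Flat scheme: the host reads every block status register."""
    reads = len(block_status)
    sources = [(blk, bit) for blk, reg in enumerate(block_status)
               for bit in set_bits(reg)]
    return sources, reads


def locate_two_level(main_status, block_status):
    """Two-level scheme: read the main register, then only flagged blocks."""
    reads = 1  # the main status register
    sources = []
    for blk in set_bits(main_status):
        reads += 1  # dedicated status register of the flagged block
        sources += [(blk, bit) for bit in set_bits(block_status[blk])]
    return sources, reads


# Hypothetical core with 8 blocks; block 2 raises its interrupt source 2.
block_status = [0, 0, 0b0100, 0, 0, 0, 0, 0]
# Main status register: bit i is set when block i has a pending source.
main_status = sum(1 << blk for blk, reg in enumerate(block_status) if reg)

flat_sources, flat_reads = locate_flat(block_status)
two_sources, two_reads = locate_two_level(main_status, block_status)
print("flat:     ", flat_sources, "reads =", flat_reads)
print("two-level:", two_sources, "reads =", two_reads)
```

In this assumed example both schemes identify the same source, but the two-level scheme needs two host reads where the flat scheme needs eight.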
3.4 Minimal Retransmission
Communication protocols in particular have a handshake mechanism (frame - ack) for frame exchanges. Retransmission upon receiving a negative acknowledgement can be handled in hardware. This avoids extensive reprogramming of the frame-related registers and payload transfers (via DMA or direct host access), and hence a large number of I/O transactions, which in turn saves power.
3.5 Duplicate Frame Rejections
In communication protocols in particular, duplicate frame rejection can be handled in hardware, so that the overhead of processing a duplicate packet (raising the receive interrupt, payload transfer via DMA or direct host access, parsing) is avoided. This minimizes processor intervention, which in turn saves power. [6], [7]
Temporary Storage Of Constant Payload Packets
In communication protocols one can typically identify a group of frames (management or control) whose payload remains the same (sometimes a zero payload), or which have a constant header without a payload, throughout the network (for example Beacon / Probe Response / ATIM Frame / PS-Poll / ACK / CTS / RTS). [5] The overhead of host interactions for such frames can be reduced by keeping them in temporary storage and designing the hardware to transmit them automatically with minimal interaction from the host. The greater the frequency of such frames, the greater the reduction in processor interactions and the greater the power saving.
3.6 RTL / Circuit Level
It is in this phase of the design that the designer has a wide range of options to exploit for power saving. Some of the techniques are listed below.
3.6.1. Clocking Strategy
The clocking strategy is one of the most important attributes in deciding the power consumption of any design / architecture, and an efficient clocking strategy is one of the most effective power optimization techniques. Power optimization can be achieved with the following frequently used techniques.
3.6.2. Operating At Lower Frequency
As the switching power relation in Section 2 involves the frequency term (f), working at a lower frequency leads to less power consumption. In any design, if the specification allows operation at a lower frequency, the design can be clocked at a lower frequency to consume less power. This method is used mostly in processor based designs, where the processor and the peripherals can be made to work at a slower clock whenever possible. (See Power Down Mode Operation in the system design phase, Section 3.1.3.)
3.6.3. Gating Of The Clock
Clock gating can be applied wherever and whenever possible to blocks that are not in use. The clock can be turned on/off synchronously. This can be done at the global level or at the module level. The feasibility of clock gating at the module level depends on efficient partitioning into independent blocks. A simple clock gating circuit is given below.
3.6.4. Register Level Clocking Disable
Register level clock gating can also be considered to save power. The FSM state registers can have a gated clock, thereby disabling the clock to those flip-flops. If an FSM stays in a particular state for a long time, the clock to its state flip-flops can be gated to save power. The clock to the register bank (typically a large number of registers) can also be gated, as most of these registers are programmed only once; alternatively, the clock can be enabled only during write operations to the registers from the host or the design core. Register level clock gating is usually not preferred, as it leads to a large number of clock trees and makes synchronization tedious.
3.6.5. Data Representation / Signalling
Capacitances of the I/Os and the wide global buses are significantly larger than those of the internal circuitry. So if the switching on these I/Os and global buses can be minimized, a large power saving is possible. One option is instruction sequencing that minimizes switching activity on the wide global buses, especially in processor based designs. Another is to use transition signaling instead of level signaling for the detection of valid data.
3.6.6. FSM State Encoding
This is also an important technique by which the designer can minimize switching activity in the internal core. Specific instruction encoding and sequencing can save a large amount of power in processor based designs. CISC / RISC instruction encoding techniques can be followed to minimize the switching activity. This applies mainly to processor based designs with a large instruction list. In the same way, FSM state encoding techniques can be considered, as FSMs typically constitute a considerable portion of a design. One method to arrive at an optimal FSM encoding for power saving is the spanning tree method.
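The effect of state encoding on switching activity can be illustrated by counting state register bit toggles along a state trace. The sketch below compares plain binary and Gray encodings; it is not the spanning tree method mentioned above, and the state names and the trace are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which two state codes differ."""
    return bin(a ^ b).count("1")


def total_toggles(trace, encoding):
    """State register bit toggles accumulated along a trace of states."""
    codes = [encoding[state] for state in trace]
    return sum(hamming(a, b) for a, b in zip(codes, codes[1:]))


# Hypothetical four-state FSM that cycles through its states in order.
states = ["IDLE", "SYNC", "HEADER", "PAYLOAD"]
binary = {s: i for i, s in enumerate(states)}           # 00, 01, 10, 11
gray = {s: i ^ (i >> 1) for i, s in enumerate(states)}  # 00, 01, 11, 10

trace = states * 3 + ["IDLE"]  # three full loops, ending back in IDLE
binary_toggles = total_toggles(trace, binary)
gray_toggles = total_toggles(trace, gray)
print("binary encoding:", binary_toggles, "toggles")
print("gray encoding:  ", gray_toggles, "toggles")
```

For this cyclic trace every Gray-coded transition flips a single bit, whereas the binary encoding flips two bits on two of the four transitions in each loop.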
3.6.7. Path Balancing
Glitches are a major source of switching activity. Glitches often arise when paths with unbalanced propagation delays converge at the same point in the circuit. Since glitches can cause a node to make several power-consuming transitions, they should be avoided. Glitches are avoided by balancing the paths, either by re-arranging the blocks or by inserting delay buffers. This is illustrated by the example shown in Figure 2.
Fig.2 Path Balancing
Because the paths differ, the output may glitch due to the different propagation delays. Such a glitch causes undesired transitions that result in power consumption. This can be avoided by proper balancing of the logical paths, as in the glitch-free circuit of Fig 3.0. Power savings of 8% to 45% can be achieved in circuits ranging from simple logic to complex 16 x 16 multipliers. [4]
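A small unit-delay simulation shows how unequal path delays create a glitch. The circuit here is an assumed one (an XOR gate whose two inputs both rise at time 0 and reach the gate through paths of different delay), not the circuit of the figure; the delay values are hypothetical.

```python
def simulate_xor(delay_a, delay_b, steps=8):
    """XOR output over time when both inputs rise at t = 0.

    Each input reaches the gate only after its own path delay.
    """
    out = []
    for t in range(-1, steps):
        a = 1 if t >= delay_a else 0
        b = 1 if t >= delay_b else 0
        out.append(a ^ b)
    return out


def count_transitions(wave):
    """Number of value changes on a waveform."""
    return sum(1 for x, y in zip(wave, wave[1:]) if x != y)


unbalanced = simulate_xor(delay_a=1, delay_b=3)  # paths differ by 2 units
balanced = simulate_xor(delay_a=3, delay_b=3)    # delay buffer added on path a

print("unbalanced:", unbalanced, count_transitions(unbalanced), "transitions")
print("balanced:  ", balanced, count_transitions(balanced), "transitions")
```

The final output value is the same in both cases, but the unbalanced version makes two extra transitions on the way, and each of them charges or discharges the output load.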
3.6.8. Logic Optimization
Power consumption in a design is directly proportional to the amount of logic in the design. The more logic there is, the more power is consumed, so logic optimization that reduces the logic helps save power.
3.6.9. Signal Gating
In this method, inputs are gated when their transitions cause significant switching that does not change the final output / state transitions. This avoids a lot of switching activity in the module that has no effect, and thereby saves power. This is explained by the example shown in Fig 3.
Fig.3 Signal Gating
In the above example, if Enb1 and Enb2 are asserted (logical 1) and Enb3 is not asserted (logical 0), any change in the input Sig1 causes transitions in the 1st Mux, Combinational Logic 1, the 2nd Mux and Combinational Logic 2. All these transitions are unnecessary, as they are not propagated to the Out signal while Enb3 is not asserted. They add to the switched capacitance in the design and consume power. So Sig1 can be gated with Enb3 to avoid these unnecessary transitions.
3.6.10. Address Assignments
If the designer knows the sequence in which the host processor will address the registers, the addresses can be mapped to the registers in such a way that minimal switching occurs on the address bus, saving power. Gray encoding or minimum Hamming distance encoding techniques can be used for the address assignments. This technique also holds good for pointer manipulation in FIFOs (read pointer / write pointer), as the pointer sequencing is incremental.
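For an incrementing pointer, the benefit of Gray encoding can be checked directly. The sketch below converts between binary and Gray code and counts address bus toggles for one full pass through a FIFO; the depth of 16 is an assumed value.

```python
def binary_to_gray(n):
    """Standard reflected binary (Gray) code of n."""
    return n ^ (n >> 1)


def gray_to_binary(g):
    """Inverse of binary_to_gray."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bus_toggles(addresses):
    """Total number of bit flips on the bus over an address sequence."""
    return sum(bin(a ^ b).count("1") for a, b in zip(addresses, addresses[1:]))


DEPTH = 16  # hypothetical FIFO depth (4-bit pointer)
binary_seq = [i % DEPTH for i in range(DEPTH + 1)]  # 0..15, then wrap to 0
gray_seq = [binary_to_gray(a) for a in binary_seq]

binary_count = bus_toggles(binary_seq)
gray_count = bus_toggles(gray_seq)
print("binary pointer:", binary_count, "bit flips per pass")
print("gray pointer:  ", gray_count, "bit flips per pass")
```

Each Gray-coded increment, including the wrap-around, flips exactly one bit, so the number of flips per pass equals the FIFO depth.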
3.6.11. Avoid Tri-States
Avoid tri-state buses inside the chip or at the I/Os, as an undriven tri-state bus draws a high static current. Use level keepers at tri-stateable I/O pins to maintain the state at the output of the tri-state buffers.
4. Methodology applied to Bluetooth core
A Bluetooth baseband core of 60k gate complexity was considered for the power study. The following flow was followed for the analysis.
The power analysis makes sense only if the correct application model is used. The various Bluetooth scenarios are modelled for a particular application, such as a mouse or headset, and the power analysis is carried out. Power reports are taken with and without the optimisation techniques to arrive at the power saving figures. The power optimisation methodology applied is described below.
Voltage scaling: This was not applied, as the core itself was only 60k gates. The logic could not be partitioned in a way that would justify the use of a mix of libraries.
Hardware-software partitioning: The framer block, the major discovery state machine and the encryption block were placed in hardware, and the protocol control was implemented in software. Retries and link maintenance were also retained in hardware to avoid frequent processor intervention.
Power down mode: Some parts of the core logic, such as the master mode logic when the device is operating in slave mode and the slave mode logic when the device is operating in master mode, were intentionally turned off to lower power consumption. The processor supported sleep mode operation when the core was configured in Sniff/Park mode (the Bluetooth power save modes).
Architectural / algorithm design: As the baseband firmware was just the driver for the baseband hardware, this was not applied. It is more relevant for the Link Manager Protocol, which is more software task oriented. Duplicate frame rejection and retries were, of course, handled in hardware to avoid processor intervention.
RTL / circuit level: The clocking strategy selected was to use gated clocks to enable the blocks only when they are active. The Hold/Park/Sniff blocks were completely disabled when the device is in the respective modes. This gave a power saving close to 18%. This figure was arrived at by characterising the time the Bluetooth device spends in each mode over a day and comparing the power consumption estimates with and without clock gating.
Operating at lower frequency: This technique was not justifiable, as the amount of logic that could work at a low frequency was minimal.
Gating of clock: Clock gating was used effectively for switching of enable signals and for the Park/Sniff functions, the encryption block and the Master/Slave mode logic when they are not active.
FSM state encoding: The FSMs were deliberately coded with the nature of the function in mind. The logic was analysed to see how long the state machine stays in a particular state, and the state transitions were taken into account in the state encoding so that only single-bit transitions occur. For some state machines this gave around 2% power saving.
Address mapping: This was done such that the addresses of mandatory registers that are configured frequently differ in a minimum number of bits.
With these conscious power saving techniques, this work shows that a power saving of around 20% could be achieved.
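The clock gating figure above was obtained by weighting the power in each operating mode by the time the device spends in that mode over a day. The arithmetic can be sketched as below. All mode fractions and power values are hypothetical placeholders, not the measurements of this study, so the resulting percentage is not the reported one.

```python
def average_power(profile):
    """Time-weighted average power for (fraction_of_day, power_mw) pairs."""
    total_fraction = sum(fraction for fraction, _ in profile)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("mode fractions must sum to 1")
    return sum(fraction * power for fraction, power in profile)


# Hypothetical daily profile: (fraction of the day, power in mW).
# Order of modes: active, sniff, park.
ungated = [(0.1, 20.0), (0.3, 12.0), (0.6, 10.0)]
gated = [(0.1, 20.0), (0.3, 10.0), (0.6, 8.0)]

p_ungated = average_power(ungated)
p_gated = average_power(gated)
saving = 1.0 - p_gated / p_ungated
print(f"without gating: {p_ungated:.2f} mW")
print(f"with gating:    {p_gated:.2f} mW")
print(f"saving:         {saving:.1%}")
```

The more time a device spends in modes where large blocks can be gated, the larger the weighted saving becomes, which is why the application model matters for the analysis.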
5. Conclusion
Power strategies are traditionally believed to be design specific or core specific. The changed trend makes low power consumption a mandatory requirement for ASICs, calling for a generalized methodology for power optimization. Through this work we achieved a power saving of around 20% in a core of 60k gates. This result is an eye opener to the possibility of saving more power in a complex multi-million-gate ASSP that uses multiple cores. Thus it is worth evaluating cores from power considerations before they are used in ASSPs. Chip designers and ASSP product companies are aware of this fact, and more and more focused effort is under way to find novel ways of power optimization in the different phases of the product life cycle.
References:
[1] State Assignment for FSM Low Power Design-M. Koegst, G. Franke, K. Feske EURO-DAC-96
[2] Power Analysis for Sequential Circuits at Logic Level - M. Senn, P. Schneider, B. Wurth, EURO-DAC-96
[3] An effective Low Power design methodology based on interconnect Predictions -Shih-Hsu Huang, Mely Chen Chi, Hsu-Ming Hsiao -SLIP01, 2005
[4] Transistor optimization for Minimizing Switching Power in CMOS Circuits - Christian V. Schimpfle, Arthur Wroblewski and Josef A. Nossek, Technical Report TUM-LNS-TR-99-4, July 1999
[5] CRCD: Low-Power Wireless Communications for Virtual Environments - Julie A. Dickerson, Diane T. Rover, Carolina Cruz-Neira, Robert J. Weber, Benjamin Graubard, Feng Chen, and Zheng Min, 2002 American Society for Engineering Education Annual Conference & Exposition
[6] J. L. Ayala and M. López-Vallejo, “A unified framework for power-aware design of embedded systems,” in IEEE International Workshop on Power and Timing Modeling, Optimization and Simulation, September 2003, also published as Lecture Notes on Computer Science, vol. 2799 (Springer Verlag).
[7] J. L. Ayala, M. López-Vallejo, and C. L. Barrio, “A case study on power dissipation in the memory hierarchy of embedded systems,” in Design of Circuits and Integrated Systems Conference, November 2003.