Algorithm and Architectural Level Methodologies
Power has become a critical design parameter in the design of low-power devices.
It has been demonstrated that decisions made at the algorithm and architecture levels have a major impact on power consumption. Using known synthesis, optimization and estimation techniques, we can analyse circuits at different stages of the design with increasing accuracy.
A design environment must include optimization and estimation tools at all levels of the design flow. The most effective decisions are made at the highest levels of abstraction.
Estimates made at the algorithm level are not very accurate; they are therefore refined at the architectural level, which gives more accurate results.
Vector quantization
It is a data compression method used in voice recognition and video systems.
The image is broken into a sequence of 4x4-pixel blocks. Each pixel is represented by an 8-bit word, so each block is a vector of 16 words, each 8 bits long. The vectors are compared against a previously generated codebook containing 256 different code vectors.
Compression produces an 8-bit word that gives the address of the code vector approximating the 4x4 image block. This corresponds to a compression ratio of 16:1, since 16 8-bit words are represented by a single 8-bit word.
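A minimal sketch of the encoding step described above. The codebook here is filled with random vectors purely for illustration; a real codebook would be trained (e.g. with the LBG algorithm), and all names below are hypothetical.

```python
# Vector quantization of a 4x4 block of 8-bit pixels (illustrative sketch).
import random

random.seed(0)
CODEBOOK_SIZE = 256          # 256 code vectors -> an 8-bit index
VECTOR_LEN = 16              # one 4x4 block = 16 pixel words

# Hypothetical (untrained) codebook of 256 random 16-element vectors.
codebook = [[random.randrange(256) for _ in range(VECTOR_LEN)]
            for _ in range(CODEBOOK_SIZE)]

def encode_block(block):
    """Return the 8-bit index of the closest code vector (full search)."""
    best_index, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((x - c) ** 2 for x, c in zip(block, code))  # MSE distortion
        if dist < best_dist:
            best_index, best_dist = idx, dist
    return best_index

block = [random.randrange(256) for _ in range(VECTOR_LEN)]
index = encode_block(block)   # one 8-bit word now stands for 16 8-bit words
print(index)                  # 16*8 bits in, 8 bits out: 16:1 compression
```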
The power consumption of a CMOS chip consists of dynamic, short-circuit and leakage power. The dominant dynamic component is

Power = C_eff * V^2 * f

where f is the frequency of operation, V the supply voltage and C_eff the effective switching capacitance. C_eff combines two factors:
C, the capacitance being charged or discharged, and the corresponding switching probability.
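A back-of-the-envelope use of the dynamic power relation P = C_eff * V^2 * f. All numerical values below are illustrative assumptions, not measurements:

```python
# Dynamic power estimate, P = Ceff * V^2 * f (numbers are made up).
def dynamic_power(c_eff_farads, v_supply, freq_hz):
    """Dynamic power of a CMOS circuit."""
    return c_eff_farads * v_supply ** 2 * freq_hz

# Lowering the supply from 5 V to 3 V cuts dynamic power by (3/5)^2 = 0.36x.
p_5v = dynamic_power(100e-12, 5.0, 50e6)   # 100 pF effective, 50 MHz
p_3v = dynamic_power(100e-12, 3.0, 50e6)
print(p_5v, p_3v, p_3v / p_5v)
```

The quadratic dependence on V is why voltage scaling is the single most effective power lever at this level.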
At the algorithm level we can predict the relative merit of design decisions, but we cannot make absolute claims about power consumption.
Algorithm-inherent dissipation is the power needed for the basic functionality of the algorithm; it cannot be avoided, irrespective of the implementation. It serves as the prime basis for comparison between different algorithms. It can be modelled as a weighted sum of the number of operations in the algorithm.
The implementation overhead depends on the specific architectural platform. Ideally this overhead power should not be greater than the algorithm-inherent dissipation. First-order predictions of the overhead components can be obtained given some properties of the algorithm and the hardware architecture.
The distortion measure is the standard mean square error (MSE), and the best match is found by a full search through the entire codebook (FSVQ):

MSE(X, C) = sum over i of (X_i - C_i)^2

where C is a codebook code vector, X the 4x4 input vector and i the index of an individual pixel word.
A first-order approximation of complexity is obtained by counting the operations (e.g. multiplications, additions) required to search the codebook.
Computing the MSE between two vectors requires 16 memory accesses, 16 multiplies and 16 additions. In FSVQ this is done for each of the 256 vectors in the codebook.
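The per-lookup operation counts above can be tabulated directly:

```python
# Operation count for one full-search VQ (FSVQ) lookup, per the text:
# each MSE needs 16 memory accesses, 16 multiplies and 16 additions,
# repeated for all 256 code vectors.
VECTOR_LEN = 16
CODEBOOK_SIZE = 256

per_mse = {"memory_access": VECTOR_LEN, "multiply": VECTOR_LEN, "add": VECTOR_LEN}
fsvq = {op: n * CODEBOOK_SIZE for op, n in per_mse.items()}
print(fsvq)  # {'memory_access': 4096, 'multiply': 4096, 'add': 4096}
```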
Algorithm-inherent dissipation: the operation count can be used to estimate the switching capacitance of the targeted hardware architecture.
Using a black-box capacitance model of the hardware, a first-order estimate of the capacitance can be made.
This first-order analysis produces an overview stating which functions are candidates for optimization.
A ripple-carry adder dissipates less power than a carry-select adder (CSA), but it fails to meet the required throughput below 5 V.
The CSA continues to meet the required throughput when the voltage is reduced to 3 V.
Estimation tools must be integrated with design-space exploration and optimization tools to provide an easy-to-use environment for the designer. This gives the designer quick feedback on the effect of design choices.
Functional pipelining, algebraic transformations and loop transformations can be used to maintain speed at low voltages.
These techniques result in larger silicon area, hence the term trading area for power.
Strength reduction
Replacing an energy-consuming operation by a combination of simpler operations (e.g. expanding multiplication by a constant into shift and add operations).
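A concrete instance of the shift-and-add expansion just mentioned:

```python
# Strength reduction: multiplication by the constant 10 is replaced by
# 10*x = 8*x + 2*x = (x << 3) + (x << 1), i.e. two shifts and one add,
# which are cheaper than a full multiply in hardware.
def times_10(x):
    return (x << 3) + (x << 1)

# Sanity check against the original multiplication.
assert all(times_10(x) == 10 * x for x in range(1000))
print(times_10(7))  # 70
```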
Algorithms that possess structural properties such as locality and regularity map onto smaller chip area, and the reduced area translates into reduced bus capacitance.
Tree-structured vector quantization (TSVQ) requires less computation than FSVQ. It performs a binary search of the vector space instead of a full search, so the computational complexity is proportional to log2(N) rather than N.
At each level of the tree the input vector is compared with two codebook entries. The branch closer to the input vector is assigned 0 and the other branch 1; the losing branch is not considered for further analysis. Hence only 2 * log2(256) = 16 distortion computations have to be made, instead of 256 in the case of FSVQ.
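A sketch of this binary search, assuming a hypothetical randomly filled tree (a real one would be trained); the point is only that 2 distortion computations per level times 8 levels gives 16 instead of 256:

```python
# Tree-structured VQ (TSVQ) lookup sketch with a made-up codebook tree.
import random
random.seed(1)

DEPTH = 8                       # log2(256) levels
VECTOR_LEN = 16

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical binary tree stored in an array: node i has children 2i+1, 2i+2.
tree = [[random.randrange(256) for _ in range(VECTOR_LEN)]
        for _ in range(2 ** (DEPTH + 1) - 1)]

def tsvq_encode(block):
    node, comparisons = 0, 0
    for _ in range(DEPTH):
        left, right = 2 * node + 1, 2 * node + 2
        comparisons += 2        # two distortion computations per level
        node = left if mse(block, tree[left]) <= mse(block, tree[right]) else right
    return node, comparisons

block = [random.randrange(256) for _ in range(VECTOR_LEN)]
leaf, n = tsvq_encode(block)
print(n)  # 16 distortion computations instead of 256
```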
It involves rearranging the difference between the distortions of the input vector X and the two code vectors Ca and Cb.
Since the comparison is made between the two code vectors, the X_i^2 terms cancel and the test can be written under a single summation, with the code-vector-dependent terms precomputed.
The number of online multiplications is thereby reduced from 32 to 16, and the same holds for additions and subtractions.
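A sketch verifying this rearrangement: expanding ||X-Ca||^2 < ||X-Cb||^2, the X_i^2 terms cancel, leaving a test on sum_i [(Ca_i^2 - Cb_i^2)/2 - X_i*(Ca_i - Cb_i)], whose code-vector terms can be precomputed offline (function names here are illustrative):

```python
# Naive branch decision: 32 online multiplies (16 per distortion).
def naive_choice(x, ca, cb):
    da = sum((xi - ci) ** 2 for xi, ci in zip(x, ca))
    db = sum((xi - ci) ** 2 for xi, ci in zip(x, cb))
    return da < db            # True -> take branch Ca

# Rearranged decision: only the 16 products x_i * diff_i are computed online.
def rearranged_choice(x, ca, cb):
    diff = [a - b for a, b in zip(ca, cb)]                  # precomputable
    offset = [(a * a - b * b) / 2 for a, b in zip(ca, cb)]  # precomputable
    return sum(o - xi * d for o, d, xi in zip(offset, diff, x)) < 0

import random
random.seed(2)
for _ in range(100):
    x, ca, cb = ([random.randrange(256) for _ in range(16)] for _ in range(3))
    assert naive_choice(x, ca, cb) == rearranged_choice(x, ca, cb)
print("rearranged comparison agrees with the naive one")
```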
The capacitance of an RTL module (adder, multiplier, etc.) can be expressed as a function of its complexity parameters.
E.g. the switching capacitance of a multiplier is proportional to the square of its input word length.
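A toy black-box capacitance model along these lines. The quadratic dependence for the multiplier follows the text; the coefficients are made-up assumptions:

```python
# Illustrative capacitance-vs-word-length models (coefficients are invented).
def adder_cap(n_bits, c0=0.05e-12):
    return c0 * n_bits            # roughly linear in word length

def multiplier_cap(n_bits, c0=0.05e-12):
    return c0 * n_bits ** 2       # array multiplier: ~N^2 cells

# Doubling the word length roughly quadruples multiplier capacitance:
print(multiplier_cap(16) / multiplier_cap(8))  # 4.0
```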
The average power dissipation of a module is a function of the applied signal, and it is infeasible to build a capacitance model for all possible input patterns, so the power factor approximation is employed to analyse power dissipation.
It uses an experimentally determined weighting factor, called the power factor, to find the average power consumed by a given module. A more accurate model can be built for two's-complement data words, whose bits can be divided into two regions according to their behaviour:
activity in the higher-order bits depends on the temporal correlation of the data, while the lower-order bits behave like white noise.
The overall module is then characterized by separate capacitance models for the MSB and LSB regions. The breakpoints between the regions are determined from the applied signal statistics, obtained from theoretical analysis.
The power consumption of the final implementation of an algorithm depends on the quality of its mapping onto the architecture. The mapping process must exploit the relevant properties of the algorithm so that it preserves data correlation.
Spatial locality can be exploited during the binding of operations to hardware units.
Resource sharing is allowed only between operations in the same cluster. At the final stage this can be used to reduce the size, and hence the access capacitance, of the register files.
Each tree node requires 17 memory accesses, 16 multiply/accumulate instructions, and a final add operation for the comparison that determines the location of the next node in the tree.
A total of 18 clock cycles is required per node, and the same computation is required at each node. Hence 8 x 18 = 144 clock cycles are required per input vector.
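The cycle budget works out as follows:

```python
# Cycle budget for one TSVQ lookup, using the per-node counts from the text.
CYCLES_PER_NODE = 18      # memory accesses + MACs + final compare/add
TREE_DEPTH = 8            # log2(256) levels, one node visited per level
print(CYCLES_PER_NODE * TREE_DEPTH)  # 144 clock cycles per input vector
```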
The locality of reference enables partitioning of the memory into smaller memories, each associated with a single level of the tree.
Distributed memory
There are 8 controllers and processors, clocked at 1/8 of the original frequency; the capacitance switched per vector by these elements is unchanged.
Because there is less overhead in reading from a smaller memory, the switching capacitance of the memory accesses is reduced.
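A toy comparison of the two memory organizations. The per-access energy values are made-up assumptions, chosen only to illustrate that the access count stays the same while the cost per access drops:

```python
# Centralized vs. distributed memory for one TSVQ lookup (invented energies).
ACCESSES_PER_NODE = 17    # per the cycle budget above
LEVELS = 8                # one node visited per tree level

E_LARGE = 1.0             # assumed energy units/access, one big memory
E_SMALL = 0.4             # assumed energy units/access, eight small memories

centralized = ACCESSES_PER_NODE * LEVELS * E_LARGE
distributed = ACCESSES_PER_NODE * LEVELS * E_SMALL   # same access count
print(centralized, distributed)   # energy falls even though accesses don't
```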