Root Power2019
Abstract: In this paper, we propose a low complexity architecture design methodology for fixed point root and power computations. The state of the art approaches perform the root and power computations based on the natural logarithm-exponential relation using Hyperbolic COordinate Rotation DIgital Computer (CORDIC). In this paper, any root and power computations have been performed using the binary logarithm-binary inverse logarithm relation. The designs are modeled using VHDL for fixed point numbers and synthesized under the TSMC 40-nm CMOS technology @ 1 GHz frequency. The synthesis results show that the proposed Nth root computation saves 19.38% on chip area and 15.86% power consumption when compared with the state of the art architecture for root computation without compromising the computational accuracy. Similarly, the proposed Nth power computation saves 38% on chip area and 35.67% power consumption when compared with the state of the art power computation without loss in accuracy. The proposed root and power computation designs save 8 clock cycles of latency when compared with the state of the art implementations.

Index Terms: CORDIC, logarithm, exponential, VLSI architecture, root computation, power computation, hyperbolic CORDIC.

Manuscript received March 2, 2019; revised July 13, 2019; accepted August 29, 2019. This work was supported in part by the Science and Engineering Research Board (SERB), Government of India, for the project entitled “Intelligent IoT enabled Autonomous Structural Health Monitoring System for Ships, Aeroplanes, Trains and Automobiles” through the Impacting Research Innovation and Technology (IMPRINT) Program under Grant IMP/2018/000375. This article was recommended by Associate Editor A. Cilardo. (Corresponding author: Amit Acharyya.)
The authors are with the Department of Electrical Engineering, Indian Institute of Technology Hyderabad (IIT Hyderabad), Hyderabad 502285, India (e-mail: [email protected]; amit_ [email protected]).
Digital Object Identifier 10.1109/TCSI.2019.2939720

I. INTRODUCTION

ROOT and power computations have been used in different areas such as atmospheric models, digital image synthesis, 3-D graphics and many VLSI signal processing applications [1]–[3]. However, the design and implementation of a low complexity as well as highly accurate VLSI architecture for such Nth root and Nth power computation is a challenging task for real time resource constrained platforms.

There are various approaches available for root computation. The well known method is the Newton-Raphson (NR) method, which requires an initial guess that may result in different precision in the outputs [4]–[6]. The hardware complexity of the NR method increases with increasing value of N. A general digit-recurrence algorithm is presented in [7], whose hardware complexity, like the NR approach, also depends on N. In [8], a top-level approach has been presented based on the binary logarithm-binary inverse logarithm relation, i.e., $R^{1/N} = 2^{\log_2(R)/N}$. But this approach [8] did not present the implementation details of the binary logarithm, division and binary inverse logarithm. Another approach was presented in [9] based on the natural logarithm-exponential relation, i.e., $R^{1/N} = \exp(\ln(R)/N)$, where the natural logarithm, division and exponential computations are performed using CORDIC. On the other hand, the powers are computed using multipliers [10]–[13], in which the square and cube operations were computed using reduced partial product arrays and ancient Indian Vedic mathematics. However, these approaches [10]–[13] are not generic for the Nth power computation. Such a generic approach for Nth power computation is proposed in [14] based on the natural logarithm-exponential relation, i.e., $R^{N} = \exp(\ln(R) \times N)$, where the natural logarithm and exponential computations are performed using CORDIC.

It is well known that the CORDIC performs several tasks such as trigonometric, hyperbolic and logarithmic functions, real and complex multiplications, division and square-root using shift-add operations [15]–[21]. However, the convergence and precision of the CORDIC depend on its negative index (m) and positive index (n) boundaries respectively [15], [21] (elaborated in Section II, please see equations (11), (12) and (13a)). The CORDIC convergence boundary (m) poses the following limitations on the state of the art Nth root and Nth power computations [9], [14].
• The CORDIC negative index boundary (m) limits the input range of R and N. As the m value increases, the input ranges of R and N will increase.
• As the m value increases, the number of CORDIC iterative stages will increase; in turn the hardware complexity, area, power consumption and latency will increase.

Addressing the aforementioned limitations, in this paper,
• We propose a low complexity architecture design methodology for the Nth root and Nth power computation based on the binary logarithm-exponential relation using CORDIC.
• We propose the Binary Hyperbolic CORDIC algorithm to perform the binary logarithm and inverse binary logarithm computations.
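As a short worked derivation (not taken from the paper), the binary and natural forms of the relations quoted in the Introduction describe the same quantity. It assumes $R > 0$ and $N > 0$; the symbols $k$ and $r$ are used here for the power-of-two exponent and the normalized remainder of $R$.

$$R^{1/N} = 2^{\log_2(R)/N} = \exp\!\left(\frac{\ln(R)}{N}\right), \qquad R^{N} = 2^{N \log_2(R)} = \exp\big(N \ln(R)\big)$$

$$R = 2^{k}\, r,\ \ 1 \le r < 2 \ \Longrightarrow\ \log_2(R) = k + \log_2(r)$$

The second line suggests why the binary form is convenient in fixed point hardware: the integer part $k$ can be separated by shifting alone, so only $\log_2(r)$ on a fixed interval has to be computed iteratively.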
1549-8328 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
i.e., $\exp(\cdot)$. The natural logarithm and exponential computations are performed in [9], [14] using Hyperbolic CORDIC. The division operation is performed using linear CORDIC [9]. The basic working principle of Hyperbolic CORDIC can be expressed as:

$$\begin{bmatrix} x_f \\ y_f \end{bmatrix} = R_H \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}; \qquad R_H = \begin{bmatrix} \cosh(z) & \sinh(z) \\ \sinh(z) & \cosh(z) \end{bmatrix} \tag{2}$$

where $[x_0, y_0]$ and $[x_f, y_f]$ are the initial and final position vectors, $R_H$ is the hyperbolic rotation matrix and $z$ is the angle of rotation [15]. By factoring out the $\cosh(z)$ term, the above equation can be rewritten as follows

$$\begin{bmatrix} x_f \\ y_f \end{bmatrix} = \cosh(z) \begin{bmatrix} 1 & \tanh(z) \\ \tanh(z) & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{3}$$

$$|z|_{\max} = \sum_{i=1} \alpha_i = \sum_{i=1} \tanh^{-1}(2^{-i}) \tag{8}$$

The iterative formula for the Linear Vectoring (LV) mode CORDIC [15] is expressed as follows

$$x_{i+1} = x_i \tag{9a}$$
$$y_{i+1} = y_i + \sigma_i (2^{-i} x_i) \tag{9b}$$
$$z_{i+1} = z_i - \sigma_i (2^{-i}) \tag{9c}$$

where $i$ is an integer starting from 0, and $\sigma_i = -\mathrm{sign}(y_i)$. Table I summarizes the convergence range of the HR, HV and LV CORDIC as $n \to \infty$. The convergence limits of CORDIC shown in Table I are not enough for the implementation of logarithm, exponential and division computations in practical applications [9], [15], [21]. Hence, the convergence range for
Step 1: The binary logarithm can be computed using pre-log normalization and BV CORDIC (using (37), (38) and (39)) as shown in Fig. 1(a) and Fig. 1(b). The pre-log normalization can be performed using a shifting operation. If the input to the pre-log normalizer is $R = 67.55$, then $2^6 < R \le 2^7$ and the outputs are $r = 1.05546875$ and $k = 6$. Now $\log_2(r)$ can be computed using BV CORDIC. Consider the inputs to the BV CORDIC as $x_0 = r + 1 = 2.05546875$, $y_0 = r - 1 = 0.05546875$ and $z_0 = 0$. The output of BV CORDIC is $z_n = \frac{1}{2}\log_2(r) = 0.0389$. The $\log_2(r)$ is obtained by shifting $z_n$ one bit to the left, so $\log_2(r) = 0.0779$. The $\log_2(R)$ can be computed by adding $k = 6$ to $\log_2(r)$, so $\log_2(R) = 6.0779$.

Step 2: The second step in $R^{1/N}$ is the division computation. The division can be performed using LV CORDIC. The inputs to the LV CORDIC are $x_0 = N = 4.78$, $y_0 = \log_2(R) = 6.0779$ and $z_0 = 0$. The output of LV CORDIC is $z_n = \frac{\log_2(R)}{N} = 1.2715$, which is treated as $D$. The second step in $R^{N}$ is the multiplication operation. The inputs to the multiplier are $x_0 = N = 4.78$ and $y_0 = \log_2(R) = 6.0779$; the output $z_n = 29.0523$ is treated as $M$.

Step 3: The final step of the $R^{1/N}$ and $R^{N}$ computations is the binary inverse logarithm computation. The binary inverse logarithm can be computed using pre-exponential normalization and BR CORDIC (using (43), (44) and (45)) as shown in Fig. 1(a) and Fig. 1(b). In the $R^{1/N}$ computation, the input to the pre-exponential normalization is $P = D = 1.2715$. The outputs are $P_I = 1$, $P_F = 0.2715$. Now $2^{P_F}$ can be computed using basic BR CORDIC considering the inputs as $x_0 = K_b$, $y_0 = 0$ and $z_0 = P_F = 0.2715$. The outputs of BR CORDIC are $x_n = 1.0178$, $y_n = 0.1893$. The $2^{P_F}$ can be obtained by adding $x_n$ and $y_n$, so $2^{P_F} = 1.2071$. The $2^{P}$ is obtained by shifting $2^{P_F}$ by $P_I = 1$ bit to the left, so $2^{D} = 2.4141 = R^{1/N} = 67.55^{1/4.78}$. In the $R^{N}$ computation, the input to the pre-exponential normalization is $P = M = 29.0523$. The outputs are $P_I = 29$, $P_F = 0.0523$. Now $2^{P_F}$ can be computed using basic BR CORDIC considering the inputs as $x_0 = K_b$, $y_0 = 0$ and $z_0 = P_F = 0.0523$. The outputs of BR CORDIC are $x_n = 1.0007$, $y_n = 0.0363$. The $2^{P_F}$ can be obtained by adding $x_n$ and $y_n$, so $2^{P_F} = 1.0369$. The $2^{P}$ is obtained by shifting $2^{P_F}$ by $P_I = 29$ bits to the left. Then $2^{M} = 5.5669 \times 10^{8} = R^{N} = 67.55^{4.78}$.

C. Proposed Architecture

We implemented the architecture for the $R^{1/N}$ and $R^{N}$ computations in a pipelined manner using the proposed methodology shown in Section III-A. From the proposed methodology, the binary logarithm and binary inverse logarithm can be performed using Binary Hyperbolic CORDIC. The iterative formula for the proposed Binary Hyperbolic CORDIC is shown in (30). Using (30) and (6), the iteration structures of the Binary Hyperbolic CORDIC for rotation and vectoring mode are shown in Fig. 2(a) and Fig. 2(b) respectively. Similarly, using (9), the iteration structure of LV CORDIC is shown in Fig. 2(c). The iteration stages in the BV CORDIC, BR CORDIC and LV CORDIC are cascaded with each other to form a pipeline architecture. The critical path for the $R^{1/N}$ computation is a shift-add operation, which is the same as the state of the art approach [9]. The critical path for the $R^{N}$ computation is a multiplication operation, which is the same as the state of the art approach [14]. The proposed architectures are implemented in pipeline fashion. Therefore, an output is available every clock cycle. The throughput of the proposed approach for $R^{1/N}$ and $R^{N}$ computation is 100%, which is the same as the state of the art approaches [9], [14].

IV. EXPERIMENTAL RESULTS AND DISCUSSION

A. Verification of Proposed Methodology

In this subsection, the correctness of the proposed methodology has been verified by modeling in MATLAB and simulating the absolute errors. The Absolute Error (AE) is defined as

$$AE = \left| \frac{T - M}{T} \right| \tag{46}$$

where $T$ is the true value of the Nth root or Nth power and $M$ is the measured value of the Nth root or Nth power using the proposed method. Another important criterion is the Mean Absolute Error (MAE), which is defined as follows

$$MAE = \frac{\sum_{j=1}^{Num} AE_j}{Num} \tag{47}$$

where $Num$ denotes the number of test cases. The steps involved in the proposed approach and the state of the art approaches [9], [14] depend on the m and n values. Before simulating the errors, the dependency on the m and n values has been analyzed. In $R^{1/N}$ computation, the state of the art approach [9] performed the software implementation as well as hardware implementation for $R \in [10^{-6}, 10^{6}]$ and $N \in [2, 1002]$ [9]. In order to compare our proposed architecture with the state of the art architecture [9] on a uniform platform, we also consider the same values of R and N. In the state of the art approach [9], from Table III, m is chosen as 2 for HV CORDIC, so that $R \le 1.0562 \times 10^{6}$. If $m = 2$, from Table II, for the HR CORDIC, $z_0 \le 6.935112$, which requires $N \ge \frac{\ln(10^{6})}{6.935112} \approx 2$. The maximum N value is chosen as 1002. The convergence condition of the LV CORDIC is $\frac{\ln(R)}{N} \le 2^{m+1}$. The minimum value of N is 2, so $\frac{\ln(10^{6})}{2} \le 2^{m+1}$ and m should be 2. The n value is considered as 20. In the proposed approach, the input R is independent of the m value. In the proposed approach, from (39), (40) and (45), the division operation alone depends on the m value. The convergence condition of the LV CORDIC is $\frac{\log_2(10^{6})}{2} \le 2^{m+1}$

complexity and latency. Hence, the proposed methodology for $R^{N}$ computation does not have any limitation on its input ranges of R and N. The dependency of m and n for $R^{N}$ computation has been summarized in Table VI. For the m and n values shown in Table VI, the proposed and the state of the art approaches [9], [14] are coded in MATLAB and the absolute errors are simulated using (46) and (47). The $Num$ is chosen as 5 million, and the R and N values are generated randomly. The results are summarized in Table VII. From Table VII, it is evident that the proposed approach for $R^{1/N}$ computation

TABLE VI
DEPENDENCY OF m AND n FOR PROPOSED METHODOLOGY

TABLE VII
VERIFICATION OF THE PROPOSED METHODOLOGY (MATLAB SIMULATION)
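The three steps of the worked example in Section III-B, together with the error metric (46), can be sketched in floating point Python as follows. This is only an illustrative model under stated assumptions, not the paper's fixed point design: `math.log2` and `2.0 ** P_F` stand in for the BV and BR CORDIC stages (whose iteration (30) is not reproduced here), the division follows the LV CORDIC iteration (9a)–(9c) run from i = −m to n, and the function names and the default m = 3 are hypothetical choices (m = 3 is the smallest integer satisfying $\log_2(10^6)/2 \le 2^{m+1}$).

```python
import math


def pre_log_normalize(R):
    """Split R > 0 into (r, k) with R = r * 2**k and 1 <= r < 2 (a shift in hardware)."""
    mantissa, exponent = math.frexp(R)  # R = mantissa * 2**exponent, 0.5 <= mantissa < 1
    return 2.0 * mantissa, exponent - 1


def lv_cordic_divide(x0, y0, m=3, n=20):
    """Linear vectoring CORDIC per (9a)-(9c), iterating i = -m .. n.

    Approximates y0 / x0 for x0 > 0 and |y0 / x0| <= 2**(m + 1).
    """
    x, y, z = x0, y0, 0.0
    for i in range(-m, n + 1):
        sigma = -1.0 if y >= 0 else 1.0  # sigma_i = -sign(y_i)
        step = 2.0 ** (-i)
        y += sigma * step * x            # (9b); x stays constant, (9a)
        z -= sigma * step                # (9c)
    return z


def inverse_log2(P):
    """Pre-exponential normalization followed by a shift: 2**P = 2**P_F << P_I."""
    P_I = math.floor(P)
    P_F = P - P_I
    return math.ldexp(2.0 ** P_F, P_I)   # 2.0 ** P_F is a stand-in for BR CORDIC


def root_and_power(R, N, m=3, n=20):
    """Return (R**(1/N), R**N) via the binary logarithm route of Steps 1-3."""
    r, k = pre_log_normalize(R)
    log2_R = k + math.log2(r)            # Step 1 (math.log2 is a stand-in for BV CORDIC)
    D = lv_cordic_divide(N, log2_R, m, n)  # Step 2 for the root
    M = N * log2_R                         # Step 2 for the power
    return inverse_log2(D), inverse_log2(M)  # Step 3


def absolute_error(true_value, measured):
    """Absolute Error as defined in (46)."""
    return abs((true_value - measured) / true_value)


root, power = root_and_power(67.55, 4.78)
print(root, absolute_error(67.55 ** (1 / 4.78), root))
print(power, absolute_error(67.55 ** 4.78, power))
```

With n = 20 the division result is accurate to about $2^{-20}$, which bounds the error of the root in this model; the power path uses an exact multiplier, so its error here comes only from floating point rounding.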
TABLE VIII
WORD LENGTHS REQUIRED FOR THE STATE OF THE ART ARCHITECTURE AND PROPOSED ARCHITECTURE FOR $R^{1/N}$ COMPUTATION
lengths required for each step. First, let us analyze the word lengths required by the state of the art approach for $R^{1/N}$ computation. The state of the art approach [9] implemented their design for $R \in [10^{-6}, 10^{6}]$ and $N \in [2, 1002]$. The state of the art design [9] chose the fractional part of the data as 27 bits. The maximum value of R is $10^{6}$ and $\log_2(10^{6}) \approx 20$. The integer part of R will be 20 bits. The input data for the next module (LV CORDIC) are N and $\ln(R)$. The integer part of the input data for the LV CORDIC depends on the maximum of $\ln(R)$ and N. The maximum value of $\ln(R)$ is $\ln(2^{20}) = 13.863 \approx 2^{4}$. Four bits are necessary to represent the $\ln(\cdot)$ value. The maximum value of N is $1002 \approx 2^{10}$. Ten bits are required to represent the N value. Therefore, the integer part of the input for LV CORDIC is chosen as 10 bits. The input to the final step depends on the maximum of $\sinh\!\left(\frac{\ln(R)}{N}\right)$ and $\cosh\!\left(\frac{\ln(R)}{N}\right)$. Since $\sinh\!\left(\frac{\ln(2^{20})}{2}\right) \approx \cosh\!\left(\frac{\ln(2^{20})}{2}\right) \approx 512$, the integer part for the input of HR CORDIC is 10 bits. However, the integer part of the input data for HR CORDIC is taken as 11 bits to avoid the truncation errors due to the iteration formula. An extra sign bit is added in front of every input data and the word length requirements for each step are tabulated in Table VIII.

Next, we will analyze the input data word lengths required for the proposed architecture for $R^{1/N}$ computation. We will consider the same input range for R and N as the state of the art approach [9] to compare the proposed architecture with the state of the art architecture [9] on a uniform platform, and the integer part and fractional part of R are chosen as 20 and 27 bits respectively. Here, we followed the word length selection methodology presented in the state of the art approach [9]. For hyperbolic CORDIC, the convergence range and precision depend on its negative index (m) and positive index (n) boundaries respectively. The fractional word length will depend on the n value. In the hyperbolic CORDIC, the iterations $4, 13, \ldots, 3k + 1$ need to be repeated. The positive index boundary (n) considered in the MATLAB simulation is 20. For $n = 20$, the iterations 4 and 13 need to be repeated in hyperbolic CORDIC and the number of effective iterations will be $n_{eff} = 22$. To achieve n bit precision in the output of CORDIC, the internal registers should have $\log_2(n)$ extra bits at the LSB position [15]. Therefore, the fractional word length is $n_{eff} + \lceil \log_2(n_{eff}) \rceil = 27$. The first step is pre-log normalization as shown in Fig. 1(a). The output of the first step is r, which is fed as input to the BV CORDIC. From (37), $r \in [1, 2]$. The maximum of R is $2^{20}$; therefore, to bring R into the range $r \in [1, 2]$, it is required to shift R by $k = 19$ bits to the right. The least significant 19 bits may be ignored while performing normalization. However, we considered an additional 18 bits in the fractional part of r to improve the accuracy of the logarithm computation. The fractional part of r is set as 45 bits. The integer part of r is considered as 2 bits because $r \in [1, 2]$. After performing the logarithm computation, the least significant 18 bits of the fractional part will be ignored. The next step is compensation of the logarithm by adding k to $\log_2(r)$. The maximum value of k is 19. The integer part of k is set as 5 bits. The input data for LV CORDIC are N and $\log_2(R)$. The word length of the input data for the LV CORDIC depends on the maximum of N and $\log_2(R)$. The maximum value is 1002; therefore, the integer part for N and $\log_2(R)$ is chosen as 10 bits. The maximum input to the pre-exponential normalization step is $P = \frac{\log_2(2^{20})}{2} = 10$. The integer part of the input P is set as 4 bits. The input to the final CORDIC depends on the maximum of $\sinh_b(P_F)$ and $\cosh_b(P_F)$. The $P_F$ is less than or equal to 1 so that $\max\{\cosh_b(P_F)\} = \max\{\sinh_b(P_F)\} \approx 1.25$. The integer part required for the input of BR CORDIC is 2 bits. The final output is shifted by $P_I$ bits to the original value. An extra sign bit is added in front of every input data and the settings are summarized in Table VIII.

From Table VIII, it can be noted that an extra step is required in the proposed approach in the logarithm computation, i.e., pre-log normalization, which involves only shift operations. The input word length to HV CORDIC and BV CORDIC is the same. The proposed approach will have more accuracy in the logarithm computation due to the range of $r \in [1, 2]$ and because the fractional part of r consists of 45 bits instead of 27 bits. Similarly, an extra step is required in the exponential computation, i.e., pre-exponential normalization, which involves only shift operations. The input word length required by the BR CORDIC is smaller than that of the HR CORDIC due to pre-exponential normalization, which reduces the hardware complexity.

Now, we will analyze the input data word length required for $R^{N}$ computation for the m and n shown in Table VI. We consider the input range of R as $R \in [10^{-2}, 100]$ as mentioned in Table VII. The fractional part of R is chosen as 27 bits to achieve an average precision of $10^{-7}$. The integer part of R depends on the maximum R value. The integer part will be $\log_2(100) \approx 7$ bits. In the state of the art approach [14], the exponential computation depends on m as shown in Table VI. From Table VI, $m = 4$, which limits the N value ($N \le 5.2669$). Hence, we consider N as $N \in [1, 5]$. The multiplier word length depends on the maximum value $\ln(R) = \ln(100) = 4.6051$ and $N = 5$. The maximum multiplier output is $\ln(100) \times 5 = 23.0258$. Therefore, the integer part of the multiplier is chosen as 5 bits. The word length for HR CORDIC depends on $\sinh(\ln(100) \times 5) = \cosh(\ln(100) \times 5) = 1.71 \times 10^{10}$. The integer part required by HR CORDIC is 34 bits. The word lengths required by the state of the art approach [14] are tabulated in Table IX for $R^{N}$ computation. In the proposed approach, the logarithm and exponential computations are independent of the CORDIC convergence limit (m). A word length analysis similar to that of the $R^{1/N}$ computation is performed for the $R^{N}$ computation and the input word lengths required in each step are tabulated in Table IX. From Table IX, it can be noted that the word length required by the BR CORDIC is 32 bits smaller than that of the HR CORDIC. This reduces the hardware complexity of the exponential computation. The hardware complexity analysis is performed in the following subsection.

TABLE IX
WORD LENGTHS REQUIRED FOR THE STATE OF THE ART ARCHITECTURE AND PROPOSED ARCHITECTURE FOR $R^{N}$ COMPUTATION

C. Hardware Complexity and Timing Analysis

In this subsection, we analyze the performance of the proposed architecture and compare it with the state of the art architectures [9], [14] in terms of the hardware complexity, latency and throughput. Throughout the analysis we keep a generalized view of the CORDIC stages m, n and the word length b. A Ripple Carry Adder (RCA) and a Conventional Array Multiplier (CAM) are considered here to provide a comparison on a uniform platform. A b-bit RCA requires b full adders (FA). A $b \times b$ CAM requires $b(b - 2)$ FA plus b half adders (HA) and $b^{2}$ AND gates. In addition, one FA cell requires 24 transistors, one HA cell consists of 12 transistors and a two-input AND gate consists of 6 transistors. Based on the approach presented in [9] and [18], the transistor count for the proposed architecture is expressed in terms of the Transistor Count (TC) of the RCA and CAM. We can calculate $TC_{RCA} = 24b$ and $TC_{CAM} = 6b(5b - 6)$. In the Hyperbolic CORDIC, each iteration requires six add operations for $i > 0$ and eight add operations for $i \le 0$. In the LV CORDIC, each iteration requires two add operations for all values of i. In conventional Hyperbolic CORDIC, for $i > 0$ the critical path is one shift and one add operation, but for $i \le 0$ the critical path is one shift and two add operations [9]. The state of the art approach [9] used a folding-delay technique to maintain the critical path as one shift and one add operation. The consequence of the folding-delay technique is that an iteration with $i \le 0$ requires two clock cycles and an iteration with $i > 0$ requires one clock cycle [9]. For LV CORDIC each iteration requires one clock cycle.

In the state of the art approaches [9], [14], the natural logarithm is computed using HV CORDIC along with two additional add operations as shown in (14). The total TC involved in the natural logarithm computation for the state of the art design [9] is expressed as follows

$$TC_{natural\_log} = (8 \times (m + 1) + 6 \times (n) + 2) \times TC_{RCA} \tag{48}$$
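The word length rule and the transistor count expressions above can be checked numerically with a small Python sketch. It is an illustration under stated assumptions rather than the authors' tooling: the function names are hypothetical, the repeat set (4, 13, 40) extends the "4, 13, ..., 3k + 1" sequence by one term, and rounding $\log_2(n_{eff})$ up to an integer is an assumption that reproduces the 27 bit figure quoted in the text.

```python
import math


def effective_iterations(n, repeats=(4, 13, 40)):
    """Hyperbolic CORDIC stages for i = 1..n, counting repeated iterations twice."""
    return n + sum(1 for i in repeats if i <= n)


def fractional_word_length(n):
    """n_eff plus ceil(log2(n_eff)) guard bits at the LSB side."""
    n_eff = effective_iterations(n)
    return n_eff + math.ceil(math.log2(n_eff))


def tc_rca(b):
    """b-bit ripple carry adder: b full adders at 24 transistors each."""
    return 24 * b


def tc_cam(b):
    """b x b array multiplier: b(b-2) FA, b HA (12 T each), b^2 AND gates (6 T each)."""
    return 24 * b * (b - 2) + 12 * b + 6 * b * b


def tc_natural_log(m, n, b):
    """Transistor count of the natural logarithm stage, equation (48)."""
    return (8 * (m + 1) + 6 * n + 2) * tc_rca(b)


print(effective_iterations(20), fractional_word_length(20))
print(tc_cam(27), 6 * 27 * (5 * 27 - 6))  # gate-level sum vs. closed form 6b(5b - 6)
print(tc_natural_log(m=2, n=20, b=27))
```

Summing the gate-level contributions in `tc_cam` gives $24b(b-2) + 12b + 6b^{2} = 30b^{2} - 36b$, which is the closed form $6b(5b - 6)$ stated in the text.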
TABLE X
TRANSISTOR COUNT AND CLOCK CYCLE ANALYSIS
The number of clock cycles required for the natural logarithm computation is expressed as

$$CLK_{natural\_log} = 2 \times m + n + 3 \tag{49}$$

In the proposed approach, the binary logarithm is computed using basic BV CORDIC along with two additional add operations as shown in (37), (38) and (39). The compensation with k is performed by an add operation. The TC involved in the $\log_2(r)$ computation is $TC_{\log_2(r)} = (6 \times (n) + 2) \times TC_{RCA}$. The total TC required for the binary logarithm computation can be expressed as follows

$$TC_{binary\_log} = TC_{\log_2(r)} + TC_{RCA} \tag{50}$$

From (37), (38) and (39), the number of clock cycles required for the binary logarithm computation is expressed as

$$CLK_{binary\_log} = n + 4 \tag{51}$$

The division operation has been performed using LV CORDIC in the proposed design and the state of the art design as shown in (15) and (40). The TC involved in the division computation is expressed as

$$TC_{div} = 2 \times (m + n + 1) \times TC_{RCA} \tag{52}$$

The number of clock cycles required for the division computation is expressed as

$$CLK_{div} = m + n + 1 \tag{53}$$

The second step in $R^{N}$ computation is the multiplication operation, which is performed using a CAM, and the number of clock cycles required by the CAM is 1. The final step in the state of the art design is the natural exponential computation. The natural exponential is computed using HR CORDIC and one add operation as shown in (16). The TC involved in the natural exponential computation is expressed in the following equation

$$TC_{natural\_exp} = (8 \times (m + 1) + 6 \times (n) + 1) \times TC_{RCA} \tag{54}$$

The number of clock cycles required for the natural exponential computation using (16) can be expressed as

$$CLK_{natural\_exp} = 2 \times m + n + 3 \tag{55}$$

The binary inverse logarithm in the proposed design is computed using basic BR CORDIC and one add operation as shown in (44) and (45). The TC involved in the binary inverse logarithm computation is given by

$$TC_{binary\_inv\_log} = (6 \times (n) + 1) \times TC_{RCA} \tag{56}$$

The number of clock cycles required for the binary inverse logarithm computation using (44) and (45) is given by

$$CLK_{binary\_inv\_log} = n + 2 \tag{57}$$

The total TC and clock cycles required for each step have been summarized in Table X for the values of m and n shown in Table VI and the word lengths shown in Table VIII and Table IX. From Table VI, n is considered as 20. In conventional and Binary Hyperbolic CORDIC, as per the CORDIC convergence theorem, the iterations $i = 4, 13$ are to be repeated, which results in additional complexity and latency. The repeated iterations are also accounted for in the TC and CLK computation, summarized in Table X for root and power computations. The TS (Transistor Saving) is defined as follows

$$TS = 1 - \frac{TC_{proposed}}{TC_{State\,of\,the\,art}} \tag{58}$$

In the proposed approach, if the pre-log normalization and pre-exponential normalization procedures are not performed, the number of iterations and word lengths required for the proposed Binary Hyperbolic CORDIC are the same as for conventional Hyperbolic CORDIC. Therefore, the hardware complexity of the proposed approach is the same as [9] and [14] when normalization is not performed. As can be seen from Table X, the proposed approach achieves a TS of 20.55% and 42.01% for $R^{1/N}$ and $R^{N}$ computations when compared with the state of the art approaches [9], [14] respectively. The proposed approach also saves 8 clock cycles of latency compared with the state of the art approaches [9], [14].

D. Implementation Results

The proposed architectures and the state of the art architectures are coded in VHDL as per the word lengths shown in Table VIII and Table IX. The ASIC implementation was done for the proposed architecture at TSMC 45nm CMOS technology @ $V_{DD} = 1.08$ V and clock frequency @ 1 GHz with the help of Synopsys Design Compiler (DC) and IC Compiler. The synthesis results of the ASIC implementation are shown in Table XI. The state of the art approach [9]
TABLE XI
HARDWARE IMPLEMENTATION FOR THE PROPOSED AND STATE OF THE ART ARCHITECTURES
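The clock cycle expressions (49), (51), (53), (55) and (57) and the transistor saving definition (58) from Section IV-C can be combined in a short Python sketch to see where the latency difference comes from. The function names are hypothetical, the repeated iterations are left out because they add the same number of cycles to both designs, and using the same m in every stage of both designs is an assumption made only for this illustration (the per-stage values belong to Table VI, which is not reproduced here).

```python
def clk_natural_log(m, n):
    return 2 * m + n + 3      # (49)


def clk_binary_log(n):
    return n + 4              # (51)


def clk_div(m, n):
    return m + n + 1          # (53)


def clk_natural_exp(m, n):
    return 2 * m + n + 3      # (55)


def clk_binary_inv_log(n):
    return n + 2              # (57)


def root_latency_state_of_art(m, n):
    """Natural log -> LV CORDIC division -> natural exponential."""
    return clk_natural_log(m, n) + clk_div(m, n) + clk_natural_exp(m, n)


def root_latency_proposed(m, n):
    """Binary log -> LV CORDIC division -> binary inverse log."""
    return clk_binary_log(n) + clk_div(m, n) + clk_binary_inv_log(n)


def transistor_saving(tc_proposed, tc_state_of_art):
    """Transistor Saving as defined in (58)."""
    return 1.0 - tc_proposed / tc_state_of_art


m, n = 2, 20
saved = root_latency_state_of_art(m, n) - root_latency_proposed(m, n)
print(saved)  # algebraically 4 * m when both designs share the same m
```

Under this equal-m assumption the difference reduces to $4m$ clock cycles, so m = 2 gives the 8 cycle saving reported in the text for the root computation.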
[3] S. P. Mohanty, N. Ranganathan, and R. K. Namballa, “VLSI implementation of visible watermarking for secure digital still camera design,” in Proc. 17th Int. Conf. VLSI Design, Mumbai, India, Jun. 2004, pp. 1063–1068.
[4] W. Liu and A. Nannarelli, “Power efficient division and square root unit,” IEEE Trans. Comput., vol. 61, no. 8, pp. 1059–1070, Apr. 2012.
[5] A. Seth and W.-S. Gan, “Fixed-point square roots using L-b truncation,” IEEE Signal Process. Mag., vol. 28, no. 6, pp. 149–153, Nov. 2011.
[6] H. Kabuo et al., “Accurate rounding scheme for the Newton-Raphson method using redundant binary representation,” IEEE Trans. Comput., vol. 43, no. 1, pp. 43–51, Jan. 1994.
[7] P. Montuschi, J. D. Bruguera, L. Ciminiera, and J.-A. Piñeiro, “A digit-by-digit algorithm for mth root extraction,” IEEE Trans. Comput., vol. 56, no. 12, pp. 1969–1706, Dec. 2007.
[8] A. Vázquez and J. D. Bruguera, “Composite iterative algorithm and architecture for q-th root calculation,” in Proc. IEEE Symp. Comput. Arith. (ARITH), Jul. 2011, pp. 52–61.
[9] Y. Luo, Y. Wang, H. Sun, Y. Zha, Z. Wang, and H. Pan, “CORDIC-based architecture for computing nth root and its implementation,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 12, pp. 4183–4195, Dec. 2018.
[10] A. A. Liddicoat and M. J. Flynn, “Parallel square and cube computations,” in Proc. 34th Asilomar Conf. Signals, Syst. Comput., vol. 2, Oct./Nov. 2000, pp. 1325–1329.
[11] J. E. Stine and J. M. Blank, “Partial product reduction for parallel cubing,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), Porto Alegre, Brazil, Mar. 2007, pp. 337–342.
[12] S. Bui, J. E. Stine, and M. Sadeghian, “Experiments with high speed parallel cubing units,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Tampa, FL, USA, Jul. 2014, pp. 48–53.
[13] H. Thapliyal, S. Kotiyal, and M. B. Srinivas, “Design and analysis of a novel parallel square and cube architecture based on ancient Indian Vedic mathematics,” in Proc. 48th Midwest Symp. Circuits Syst., vol. 2, Aug. 2005, pp. 1462–1465.
[14] J.-A. Pineiro, M. D. Ercegovac, and J. D. Bruguera, “High-radix iterative algorithm for powering computation,” in Proc. 16th IEEE Symp. Comput. Arithmetic, Santiago de Compostela, Spain, Jun. 2003, pp. 204–211. doi: 10.1109/ARITH.2003.1207680.
[15] P. Meher, J. Valls, T.-B. Juang, K. Sridharan, and K. Maharatna, “50 years of CORDIC: Algorithms, architectures, and applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 9, pp. 1893–1907, Sep. 2009.
[16] S. Aggarwal, P. K. Meher, and K. Khare, “Scale-free hyperbolic CORDIC processor and its application to waveform generation,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 2, pp. 314–326, Feb. 2013.
[17] A. Acharyya, K. Maharatna, B. M. Al-Hashimi, and J. Reeve, “Coordinate rotation based low complexity N-D fast ICA algorithm and architecture,” IEEE Trans. Signal Process., vol. 59, no. 8, pp. 3997–4011, Aug. 2011.
[18] S. Mopuri and A. Acharyya, “Low-complexity methodology for complex square-root computation,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 11, pp. 3255–3259, Nov. 2017.
[19] S. Aggarwal, P. K. Meher, and K. Khare, “Concept, design, and implementation of reconfigurable CORDIC,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 4, pp. 1588–1592, Apr. 2016.
[20] S. Mopuri, S. Bhardwaj, and A. Acharyya, “Coordinate rotation-based design methodology for square root and division computation,” IEEE Trans. Circuits Syst., II, Exp. Briefs, vol. 66, no. 7, pp. 1227–1231, Jul. 2019.
[21] X. Hu, R. G. Harber, and S. C. Bass, “Expanding the range of convergence of the CORDIC algorithm,” IEEE Trans. Comput., vol. 40, no. 1, pp. 13–21, Jan. 1991.
[22] F. de Dinechin, P. Echeverría, M. López-Vallejo, and B. Pasca, “Floating-point exponentiation units for reconfigurable computing,” ACM Trans. Reconfigurable Technol. Syst., vol. 6, no. 1, May 2013, Art. no. 4.
[23] Y. Luo, Y. Wang, Y. Ha, Z. Wang, S. Chen, and H. Pan, “Generalized hyperbolic CORDIC and its logarithmic and exponential computation with arbitrary fixed base,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 9, pp. 2159–2169, Sep. 2019.

Suresh Mopuri received the B.Tech. degree (Hons.) in electronics and communication engineering from the Sri Venkateswara University College of Engineering, Tirupati, India, in 2012. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering, Indian Institute of Technology Hyderabad (IIT Hyderabad), as an External Student. He joined as a Research Scholar with the Indian Institute of Technology, Hyderabad. He is also a Scientist with the Tracking Systems Group, Indian Space Research Organization (ISRO). His research interests include signal processing algorithms, VLSI architectures, low power design techniques, radar signal processing, and weather signal processing.

Amit Acharyya (M’11) received the Ph.D. degree from the School of Electronics and Computer Science, University of Southampton, U.K., in 2011. He is currently an Associate Professor with the Indian Institute of Technology Hyderabad (IIT Hyderabad), India. His research interests include VLSI systems design for real-time resource-constrained applications, machine learning and signal processing hardware architecture design, edge computing, health-care technology, hardware security, and design for testability and reliability.