CORDIC Designs For Fixed Angle of Rotation: Pramod Kumar Meher, Senior Member, IEEE, and Sang Yoon Park, Member, IEEE

Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO.

2, FEBRUARY 2013

217

CORDIC Designs for Fixed Angle of Rotation


Pramod Kumar Meher, Senior Member, IEEE, and Sang Yoon Park, Member, IEEE
AbstractRotation of vectors through xed and known angles has wide applications in robotics, digital signal processing, graphics, games, and animation. But, we do not nd any optimized coordinate rotation digital computer (CORDIC) design for vector-rotation through specic angles. Therefore, in this paper, we present optimization schemes and CORDIC circuits for xed and known rotations with different levels of accuracy. For reducing the area- and time-complexities, we have proposed a hardwired pre-shifting scheme in barrel-shifters of the proposed circuits. Two dedicated CORDIC cells are proposed for the xed-angle rotations. In one of those cells, micro-rotations and scaling are interleaved, and in the other they are implemented in two separate stages. Pipelined schemes are suggested further for cascading dedicated single-rotation units and bi-rotation CORDIC units for high-throughput and reduced latency implementations. We have obtained the optimized set of micro-rotations for xed and known angles. The optimized scale-factors are also derived and dedicated shift-add circuits are designed to implement the scaling. The xed-point mean-squared-error of the proposed CORDIC circuit is analyzed statistically, and strategies for reducing the error are given. We have synthesized the proposed CORDIC cells by Synopsys Design Compiler using TSMC 90-nm library, and shown that the proposed designs offer higher throughput, less latency and less area-delay product than the reference CORDIC design for xed and known angles of rotation. We nd similar results of synthesis for different Xilinx eld-programmable gate-array platforms. Index TermsCoordinate rotation digital computer (CORDIC), digital arithmetic, digital signal processing (DSP) chip, VLSI.

I. INTRODUCTION ORDIC stands for coordinate rotation digital computer. The key concept of CORDIC arithmetic is based on the simple and ancient principles of 2-D geometry. But the iterative formulation of a computational algorithm for its implementation was rst described in 1959 by Volder [1], [2] for the computation of trigonometric functions, multiplication, and division. Not only a wide variety of applications of CORDIC have been suggested over the time, but also a lot of progress has taken place in the area of algorithm design and development of architectures for high-performance and low-cost hardware solutions [3][12]. Rotation of vectors through a xed and known angle has wide applications in robotics, graphics, games, and animation [4], [13], [14]. Locomotion of robots is very often performed by successive rotations through small xed angles and translations of the links. The translation operations are realized by
Manuscript received February 15, 2011; revised December 14, 2011; accepted January 30, 2012. Date of publication March 07, 2012; date of current version January 17, 2013. The authors are with the Institute for Infocomm Research, Agency for Science, Technology, and Research (A STAR), 138632, Singapore (e-mail: [email protected]; [email protected]). Digital Object Identier 10.1109/TVLSI.2012.2187080

simple additions of coordinate values while the new coordinates of a rotational step could be accomplished by suitable successive rotations through a small xed angle which could be performed by a CORDIC circuit for xed rotation [4]. Similarly, interpolation of orientations between key-frames in computer graphics and animation could be performed by xed CORDIC rotations [14]. There are plenty of examples of uniform rotation starting from electrons inside an atom to the planets and satellites. A simple example of uniform rotations is the hands of an animated mechanical clock which perform one degree rotation each time. There are several cases where high-speed constant rotation is required in games, graphic, and animation. The objects with constant rotations are very often used in simulation, modelling, games, and animation. Efcient implementation of rotation through a known small angle to be used in these areas could be implemented efciently by simple and dedicated CORDIC circuits. Similarly, the multiplication of complex number with a known complex constant (which is the same as the rotation of vectors through a xed and known angle) is often encountered in communication, signal processing and many other scientic and engineering applications. In some early works, CORDIC circuits have been developed for the implementation of complex multiplications to be used for digital signal processing (DSP) applications [16][18], but we do not nd any detailed study pertaining to efcient CORDIC realization of xed and knownangle rotations and constant complex multiplication. Latency of computation is the major issue with the implementation of CORDIC algorithm due to its linear-rate convergence [19]. It requires iterations to have -bit precision of the output. Overall latency of computation increases linearly with the product of the word-length and the CORDIC iteration period. The speed of CORDIC operations is, therefore, constrained either by the precision requirement (iteration count) or the duration of the clock period. The angle recoding (AR) schemes [5][9] could be applied for reducing the iteration count for CORDIC implementation of constant complex multiplications by encoding the angle of rotation as a linear combination of a set of selected elementary angles of micro-rotations. In the conventional CORDIC, any given rotation angle is expressed as a linear combination of values of elementary angles that belong to the set to obtain an -bit value of . However, in AR methods, this constraint is relaxed by adding zero into the linear combination to obtain the desired angle using relatively fewer terms of for . The elementhe form tary-angle-set (EAS) used by AR scheme is given by . Hu and Naganathan [5] have proposed an AR method based on the greedy algorithm that tries to represent the remaining angle . Using this reusing the closest elementary angle

1063-8210/$31.00 2012 IEEE

218

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 2, FEBRUARY 2013

coding schemes the total number of iterations could be reduced to less than half of the conventional CORDIC algorithm for the same accuracy. Wu et al. [7] have suggested an AR scheme based on an extended elementary-angle-set (EEAS), that provides a more exible way of decomposing the target rotation angle. In the EEAS approach, the set of the elementaryangle set is extended further to and . EEAS has better recoding efciency in terms of the number of iterations and can yield better error performance than the AR scheme based on EAS. But the iteration period for EEAS is longer, and involves double the numbers of adders/subtractors in the CORDIC cell compared with that of the other. Most of the advantages gained in the AR schemes are cancelled out by the hardware and time involved in scaling the pseudo-rotated vector. Since the angle of rotation for the xed rotation case is known a priori, it is desirable to perform exhaustive search to obtain an optimal EAS instead of greedy search. Moreover, it is observed that the hardware-complexity of barrel-shifters alone is nearly half of that of a CORDIC circuit. We therefore aim at suggesting some techniques to minimize the complexity of barrel shifters. CORDIC computation is inherently sequential. Therefore, CORDIC is not suitable for parallel implementation, while it is a natural candidate for pipeline implementation. But, the efcient pipelined realization of CORDIC for xed-angle vector rotations is yet to be exploited. Keeping these in view, in this paper, we present the optimization schemes for reducing the number of micro-rotations and for reducing the complexity of barrel-shifters for xed-angle vector-rotation. We also derive a cascaded pipelined circuit for this class of problem which is faster and involves less area-delay complexity than the existing approaches. The contributions of this paper are as follows. 1) Optimized set of micro-rotations are derived for the implementation of xed-angle vector-rotation. 2) Shift-add operations for corresponding scaling circuits are derived. 3) A novel hardware pre-shifting scheme is suggested for reduction of barrel-shifter complexity. 4) Single-rotation and bi-rotation CORDIC circuits are designed and used to derive cascaded CORDIC for highspeed xed-angle vector rotations. 5) The xed-point mean-squared-error (MSE) of the proposed CORDIC circuit is analyzed, and an efcient strategy for reducing the error is described. The remainder of this paper is organized as follows. Section II deals with the optimization of elementary angle set for different accuracies of implementation. Efcient circuits for implementation of micro-rotations for xed rotations are presented in Section III. Implementation of scaling is discussed in Section IV. Section V analyzes the MSE of the proposed CORDIC. Hardware and time complexities are given and synthesis results of the proposed designs are compared with the conventional and a reference design in Section VI. Conclusions are presented in Section VII.

Fig. 1. Reference CORDIC circuit for xed rotations.

II. OPTIMIZATION OF ELEMENTARY ANGLE SET The rotation-mode CORDIC algorithm to rotate a vector through an angle to obtain a rotated vector is given by [1], [2] (1a) (1b) (1c) such that when is sufciently large (1d) where if and otherwise, and scale-factor of the CORDIC algorithm, given by is the

(2) In case of xed rotation, could be pre-computed and the sign-bits corresponding to could be stored in a sign-bit register (SBR) in CORDIC circuit. The CORDIC circuit therefore need not compute the remaining angle during the CORDIC iterations [3]. A reference CORDIC circuit for xed rotations according to (1a) and (1b) is shown in Fig. 1. and are fed as set/reset input to the pair of input registers and the successive feedback values and at the iteration are fed in parallel to the input registers. Note that conventionally we feed the pair of input registers with the initial values and as well as the feedback values and through a pair of multiplexers. We show here that for rotation of a vector through a known and xed angle of rotation using a rotation-mode CORDIC circuit, we can nd a set of a small number of predetermined elementary angles for , where is the elementary angle to be used for the th micro-rotation in the CORDIC algorithm (1), and is the minimum necessary number of micro-rotations. Meanwhile, it is well known that the rotation through any angle, can be mapped into a positive rotation through without any extra arithmetic operations [10]. Hence, as a rst step of optimization, we perform the rotation mapping so that the rotation angle lies in the range of . In the

MEHER AND PARK: CORDIC DESIGNS FOR FIXED ANGLE OF ROTATION

219

next step, we minimize the number of elementary angles in the set according to the accuracy requirements. The rotation mode CORDIC algorithm of (1), therefore, can be modied accordingly to have (3a) such that for a minimum number (3b) . The accuThe scale-factor now depends on the set racy of CORDIC algorithm depends on how closely the resultant rotation due to all the micro-rotations in (1) approximates to the desired rotation angle , which in turn determines the deviation of actual rotation vector from the estimated value. We show here that only a few elementary angles are sufcient to have a CORDIC rotation in the range , and different sets of elementary angles can be chosen according to the accuracy requirement. The simple pseudo code to optimize a set of micro-rotations is described in Algorithm 1. If the maximum accuracy which is dened as the maximum tolerable error between desired angle and approximated angle is given as an input, the optimization algorithm searches the parameters and that can minimize an objective function . The algorithm starts with the single micro-rotation, i.e., , then if the micro-rotation that has smaller angle of deviation than cannot be found, the number of micro-rotations is increased by one and the optimization algorithm is run again. Exhaustive search is employed in the optimization algorithm to search the entire parameter space for all the combinations of and . Based on the obtained micro-rotations, the parameters for scaling operation can be searched with the different objective function, which is described in Section IV. The sub-optimal set of micro-rotations may be used in some cases, if the optimal set of micro-rotations cannot satisfy the design constraint for scaling. We have used sub-optimal solutions particularly for the rotation with the angle of 31 and 35 in Table I since the scaling requires more terms in these two cases if optimal solutions are used. Algorithm 1 Obtains the Optimal Micro-Rotations. 1: 2: do 3: is nonnegative integer. 4: 5: while end while In the experiment with the maximum input angular deviation , we found that a set of four selected micro-rotations is enough. In Table I, it is shown that rotations through any angle in the range (in odd integer degrees) could be achieved with maximum angular deviation radian , where . Using a maximum of two selected micro-rotations, the rota-

TABLE I OPTIMIZATION OF FULL ROTATIONS WITH FOUR MICRO-ROTATIONS

TABLE II OPTIMIZATION OF SMALL ROTATIONS WITH FOUR MICRO-ROTATIONS

tions could be achieved with maximum angular deviation with (0.033 radian). In case of six micro-rotations, angular deviation could be reduced to . In Table II, it is shown further that rotations through in an interval of 0.1 could be obtained by four micro-rotations with angular deviation, . Here we can make an observation that we can always achieve higher accuracy with more number of micro-rotations. From Table II, we nd that higher accuracy could be achieved in case of small rotation angles like 1 or 2 , compared to the most of the larger angles when the same number of micro-rotations is used. III. IMPLEMENTATION OF MICRO-ROTATIONS Since the elementary angles and direction of micro-rotations are predetermined for the given angle of rotation, the angle es-

220

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 2, FEBRUARY 2013

Fig. 2. CORDIC cell for constant complex multiplications.

Fig. 4. Hardwired pre-shifted bi-rotation CORDIC circuit. SBR is sign-bit regindicates right-shift by bit-locations. ister of 2-bits size.

Fig. 3. Hardwired pre-shifting in basic CORDIC module.

timation data-path is not required in the CORDIC circuit for xed and known rotations. Moreover, because only a few elementary angles are involved in this case, the corresponding control-bits could be stored in a ROM of few words. A CORDIC circuit for complex constant multiplications is shown in Fig. 2. The ROM contains the control-bits for the number of shifts corresponding the micro-rotations to be implemented by the barrel-shifter and the directions of micro-rotations are stored in the sign-bit register (SBR). The major contributors to the hardware-complexity in the implementation of a CORDIC circuit are the barrel-shifters and the adders. There are several options for the implementation of adders [22], from which a designer can always choose depending on the constraints and requirements of the application. But, we have some scope to develop techniques for reducing the complexity of barrel-shifters over the conventional designs as discussed in the followings. 1) Minimization of Barrel-Shifter Complexity by Hardwired Pre-Shifting: A barrel-shifter for maximum of shifts for word-length is implemented by -stages of demultiplexors, where each stage requires number of 1:2 line MUXes. The hardware-complexity of barrel-shifter, therefore, increases linearly with the word-length and logarithmically with the maximum number of shifts. We can reduce the effective word-length in the MUXes of the barrel-shifters, and so also the number of stages of MUXes by simple hardwired pre-shifting as shown in Fig. 3. If is the minimum number of shifts in the set of selected micro-rotations, we can load only the more-signicant bits (MSBs) of an input word from the registers to the barrel-shifters, since the less signicant bits (LSBs) would get truncated during shifting. The barrel-shifter, therefore, needs to implement a maximum of shifts only, where is the maximum number of shifts in the set of selected micro-rotations. The output of the barrel-shifters are loaded as the LSBs to the add/subtract units, and the MSBs of

the corresponding operand of add/subtract unit are hardwired to 0. Therefore, the hardware-complexity of a barrel-shifter could be reduced by the hardwired pre-shifting approach. The time involved in a barrel-shifter could also be reduced by hardwired pre-shifting, since the delay of the barrel-shifter is proportional to the number of stages of MUXes, and it also be possible to reduce the number stages by hardwired pre-shifting. In Table I, we nd that the minimum number of shifts is greater than one in more than 75% of the cases. Similarly, in Table II, we nd that is always greater than 5 except the angle 1.5 . Using hardwired pre-shifting, it would therefore be possible to considerably reduce the total number of shifts to be implemented by barrel-shifters, so as to substantially reduce the hardware-complexity and delay of the barrel-shifters. A conventional barrel-shifter for maximum of shifts is implemented by -stages of 2:1 MUXes. But, when the number of shifts is known a priori, one can design the barrel-shifter to include the specic shifts. For implementing four discrete shifts (see Table I) irrespective of the maximum number of shifts, the barrel-shifter would require three stages of 2:1 MUXes by hardwiring the shifts. 2) Bi-Rotation CORDIC Cell: We nd that using only two micro-rotations, it is possible to get an accuracy up to 0.033 radian. Although the accuracy achieved by two micro-rotations is inadequate in many situations, but can be used for some applications where the outputs are quantized, e.g., in case of speech and image compression, etc., [23], [24]. Besides, the rotations with four and six micro-rotations can also be implemented successively by two and three pairs of micro-rotations, respectively. Therefore, we design an efcient CORDIC circuit to implement a pair of micro-rotations, and named as bi-rotation CORDIC. The proposed circuit for bi-rotation CORDIC is shown in Fig. 4. It consists of an adder-module, two 2:1 multiplexers and a sign-bit register (SBR) of two bit size. The adder-module consists of a pair of adders/subtractors. The adders/subtractors perform additions or subtractions according to the sign-bit available from the SBR. The components of the input vector (real and the imaginary parts of the input complex operand) are loaded to the input-registers through set/reset input. The output of the registers are sent in two lines where the content of the register is fed to one of the adders/subtractors directly while that in the other line is loaded

MEHER AND PARK: CORDIC DESIGNS FOR FIXED ANGLE OF ROTATION

221

Fig. 5. (a) Multi-stage single-rotation cascaded CORDIC circuit. (b) Structure indicates right-shift by bit-locations. of th rotation module.

to the barrel-shifter pre-shifted by bit-locations to right by hardwired pre-shifting technique. The output of the adders are loaded back to the input registers for the second CORDIC iteration. The bi-rotation CORDIC involves only a pair of barrel-shifters consisting of only one stage of 2:1 MUXes. The control-bit for the barrel-shifters is 0 for the rst micro-rotation (no shift) and 1 for the second micro-rotation (shift through ). The control bits are generated by a T ip-op, since they are 1 and 0 in each alternate cycle. 3) High-Throughput Implementation Using Cascaded Multi-Stage CORDIC: For the implementation of small rotations (the remaining angle after the rst two micro-rotations), as shown in Table II, except the angle 1.5 . Similarly, in Table I, we can notice that the second half of the micro-rotations has the minimum shifts . It would be possible to take the best advantage of hardwired-pre-shifting, if the micro-rotations are implemented in more than one CORDIC modules in separate stages in a cascade. Moreover, since the cascade of CORDIC modules is inherently pipelined, it would provide high-throughput pipelined implementation. To implement the CORDIC rotations with higher accuracy without affecting the throughput of computation, we can therefore have cascaded-multi-stage CORDIC consisting of single-rotation cells and bi-rotation CORDIC as described in the followings. Cascaded CORDIC with Single-Rotation Cells: A multi-stage-cascaded pipelined-CORDIC circuit consisting of single-rotation modules is shown in Fig. 5. Each stage of the cascaded design consists of a dedicated rotation-module that performs a specic micro-rotation. The structure and function of a rotation-module is depicted in Fig. 5(b). Each rotation-module consists of a pair of adders or subtractors (depending on the direction of micro-rotation which it is required to implement). Each of the adders/subtractors loads one of the

pair of inputs directly and loads the other input in a pre-shifted form at LSB locations, where is the number of right-shifts required to be performed to implement the th micro-rotation. The MSB locations are hardwired to be zero. The rotation-module in this case does not require input from SBR since each adder module always performs either addition or subtraction. It also does not require barrel-shifter since it has to implement only one xed micro-rotation. The output of each stage is latched to the input of its succeeding stage as shown in the gure. The critical-path in this case amounts to only one addition/subtraction operation in the adder module. Total latency of -stage single-rotation cascade amounts to , where and , are the addition/subtraction time and D ip-op delay, respectively. We nd that in more than two-third of the rotation angles as shown in Table I, only three micro-rotations are adequate to have the maximum deviation of up to 0.04 . The complex multiplications involving three such micro-rotations could be implemented by three-stage-cascaded CORDIC circuit shown in Fig. 5 (for ). The rotation using 4 and 6 micro-rotations, similarly, would require 4 and 6 stages of rotation module for pipelined implementation. This can also be implemented in nonpipelined form using carry-propagate adders with total latency of , where and , are, respectively, the time required for -bit addition-time and fulladder delay, being word-length of implementation. is the number of shifts of the th stages. Cascaded CORDIC with Bi-Rotation Cells: For reduction of adder complexity over the cascaded single-rotation CORDIC, the micro-rotations could be implemented by a cascaded bi-rotation CORDIC circuit. A two-stage cascaded bi-rotation CORDIC is shown in Fig. 6. The rst two of the micro-rotations as shown in Table I out of the four-optimized micro-rotations could be implemented by stage-1, while the rest two are performed by stage-2. The structure and function of the bi-rotation CORDIC is shown in Fig. 4. For implementing six selected micro-rotations, we can use a three-stage-cascade of bi-rotation CORDIC cells. The three-stage bi-rotation cells could however be extended further when higher accuracy is required. IV. SCALING OPTIMIZATION AND IMPLEMENTATION We discuss here the optimization of scaling to match with the optimized set of elementary angles for the micro-rotations. A. Scaling Approximation for Fixed Rotations The generalized expression for the scale-factor given by (2) can be expressed explicitly for the selected set of microrotations as (4) where for is the number of shifts in the th micro-rotation. Except for (i.e., rotation by 45 ), by binomial expansion, any term in (4) can be written as (5)

222

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 2, FEBRUARY 2013

TABLE III OPTIMIZED SHIFTS TO IMPLEMENT SCALING FOR THE CASE OF ROTATION WITH FOUR MICRO-ROTATIONS

Fig. 6. Two-stage cascaded bi-rotation CORDIC circuit. SBR is sign-bit register of 2-bits size.

where , being the number of shifts in a micro-rotation, and can be expressed alternatively in terms of as (6) Replacing each term in (4) by the expression of (6), we can obtain an approximate scale-factor as a product of shift-add terms of form (7) where is the number of shifts performed for the th iteration of scaling, , and is maximum number of scaling iterations required for the approximation. The number of terms of (6), those are required to be accounted for to obtain the approximate scale-factor [given by (7)] can be estimated according to value of and the desired output accuracy which is limited by the number of micro-rotations used for the pseudorotation. The number of shifts-add/subtract terms in the expression of (7) is therefore minimized separately for the CORDIC implementations by four micro-rotations and six micro-rotations for different angles of rotation. It can be found that for four micro-rotation CORDIC implementation, where the error in is , only the rst two terms in (6) contribute for , while up to the third and the fth terms contribute for and , respectively. Similarly, for six micro-rotation CORDIC implementation, where the error in is , the rst two terms in (6) contribute for , while up to the third, fourth and fth terms contribute for , , and , respec-

tively. Accordingly, we have obtained the recursive shift-add expressions of scale-factor in the form of (7). Algorithm 2 describes the optimization scheme to search the parameters and . Once the set of micro-rotations is obtained by Algorithm 1, the ideal scaling factor can be calculated using (4). The objective function is dened as deviation of from 1, i.e., . The algorithm starts with the single term of scaling, then the number of scaling terms is increased by one until is smaller than the given maximum deviation , which needs to be set as the same value as in the Algorithm 1 since and contribute equally to the overall approximation error. In the experiment, we need three terms in the expression of (7) as listed in Table III in the range of 1 to 41 when is set as which is the same value as used to obtain the Table I after conversion of 0.04 to radian. Algorithm 2 Obtains the Optimal Shifts for Scaling. 1: 2: 3: do 4: is nonnegative integer. 5: 6: while end while We derive here the expression of scale factors separately for 43 and 45 rotations to get scaling with desired accuracy with less number of iterations compared with the above approach.

MEHER AND PARK: CORDIC DESIGNS FOR FIXED ANGLE OF ROTATION

223

Fig. 7. Shift-add scaling circuit using hardwired pre-shifted loading.

To have an accuracy up to , the scale-factor for rotation through 43 and 45 can be expressed as (8) Equation (8) can be expressed in recursive shift-add forms (9) B. Implementation of Scaling Scaling and micro-rotations could be implemented either in the same circuit in interleaved manner or in two separate stages. The implementation of scaling as well as the micro-rotation would however depend on the level of desired accuracy, and the implementation of scaling also depends on the implementation of micro-rotations. Therefore, we discuss here the realization of the scaling circuits corresponding to different implementations of micro-rotations. 1) Generalized Implementation of Scaling: The shift-add circuit for scaling according to (7) is shown in Fig. 7. The scaling circuit of Fig. 7 can use hardwired pre-shifting for minimizing barrel-shifter complexity and could be placed after the CORDIC cell of Fig. 2 to perform micro-rotation and scaling in two separate stages. The generalized CORDIC circuit for xed rotation to perform the micro-rotation and the scaling in interleaved manner in alternate cycles is shown in Fig. 8. The circuit of Fig. 8 is similar to that of Fig. 2. It involves only an additional line-changer circuit to change the path of unshifted (direct) input. The structure and function of line-changer is shown in Fig. 8(b). The line-changer is placed on the unshifted input data line to keep the critical path the same as that of Fig. 2. 2) Implementation of Scaling for Bi-Rotation CORDIC: The scaling and micro-rotations for the proposed bi-rotation CORDIC could be implemented in two separate pipelined stages, where the pair of micro-rotations are implemented by the CORDIC circuit of Fig. 4, and scaling is implemented by a shift-add circuit. The scale factor for this case can be represented by two shift-add terms as (10) The two-factor scaling of (10) can be implemented by the shift-add circuit of Fig. 9. It consists of a pair of adders/subtractors and a pair of single-stage barrel-shifters. Each barrel-shifter consists of only one stage of 2:1 MUXes. The input of each of the barrel-shifters is hardwired pre-shifted by locations to right. Each of the barrel-shifters shifts the input through

Fig. 8. CORDIC circuit for interleaved implementation of micro-rotations as well as scaling circuit. (a) The CORDIC circuit. (b) Structure and function of it performs micro-rotations and for control-bit line-changer. For control-bit it performs the shift-add operations for scaling.

Fig. 9. Shift-add circuit of two-factor scaling using hardwired pre-shifting.

locations to right, when the control-bit is 1. No additional shifts are required when control-bit is 0. The control-bit can be generated by a T ip-op since it toggles in each cycle. The add-subtract cell performs addition if and performs subtraction if , which could be controlled through a two-bit SBR. 3) Implementation of Scaling for Cascaded Single-Rotation CORDIC: The shift-add circuit for single-rotation-cascaded CORDIC is shown in Fig. 10. It consists of a pair of dedicated adders-subtractors. It does not require any multiplexer or sign-bit register. A pair of input is fed to the adder/subtractor from the register, where one of the inputs is obtained directly from the content of the registers, while the other input is shifted by locations to right before being fed to the adder/subtractor. The choice of adder or subtractor depends on the sign-factor in the shift-add term to be implemented by the circuit. 4) Implementation of Scaling for Cascaded Bi-Rotation CORDIC: The cascaded bi-rotation CORDIC could either be used for implementing in two or three stages for four and six micro-rotations, respectively. For scaling by three shift-add-factors as shown in Table III, we can use one

224

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 2, FEBRUARY 2013

Fig. 10. Shift-add circuit for single-rotation-cascaded-scaling using hardwired pre-shifting.

Fig. 13. Proposed CORDIC operation and approximation error.

A. Approximation Error Fig. 13 illustrates the CORDIC iteration which consists of a pseudo-micro-rotations and a scaling. In the gure, is an input vector to be rotated through angle . It is assumed that scaling and micro-rotations are implemented in two separate stages. be the rotated vector after micro-rotations given by (11) The rotation matrix is given by is given by (3). The th scaling factor (12) such that after iterations of scaling we get (13)
Fig. 12. Scaling circuit for 43 and 45 rotation. (a) and 1correspond and 1correspond to two rightto addition and subtraction, respectively. shifts and four right-shifts, respectively, in the barrel-shifter.

Fig. 11. Time-multiplexed shift-add circuit for one-factor scaling.

two-factor-scaling circuit of Fig. 9 and the third scaling factor could be implemented by a multiplexed shift-add circuit of Fig. 11. The scaling for six micro-rations, which involves ve shift-add factors, could be implemented by a pair of two-factor scaling circuit and a multiplexed circuit. 5) Implementation of Scaling for Large Rotations: The scaling circuit for rotation through 43 and 45 based on (9) is shown in Fig. 12. We can implement this scaling also by simple modications of cascaded forms of single-factor scaling circuit, two-factor scaling circuits and time-multiplexed scaling circuits of Figs. 911. V. ANALYSIS OF ERROR There are two types of error encountered during the rotation mode CORDIC iterations. Those are: approximation error and round-off error. Approximation error arises due to approximation of angle of rotation and scaling factor, while the round-off error arises due to the nite word-length of the output components. We derive the expression for these two errors in the following subsections.

After the micro-rotations, there is a discrepancy between the desired angle and the resultant angle due to the limited number of micro-rotations. Moreover, cannot reach on the circle after the scaling since is an approximated value which is not same as the required given as (4). Similar to the method used in [20], the approximation error is evaluated as a distance between the desired output and the actual CORDIC output as follows: (14) is assumed to be greater than Without loss of generality, zero and is greater than one as shown in Fig. 13. Then (15) If is sufciently small (16) Since (17)

MEHER AND PARK: CORDIC DESIGNS FOR FIXED ANGLE OF ROTATION

225

For the known and xed angle, an expectation of the approximation error can be estimated once we know the input statistics as (18) It can be seen that the accuracy of CORDIC depends on how closely the angle difference approximates to zero, and also the ratio of scale-factors approximates to one. B. Round-off Error As the CORDIC iteration progresses through shift-add operations, the word-length increases, and consequently requires rounding after each CORDIC iteration. Let be the round-off error. The magnitude of round-off error depends on the wordlength in a data-path, especially the length of fractional bits which is denoted as . The mean and variance of are estimated separately and added to obtain as (19) where . When a data with fractional bits is shifted by , the mean and variance of resultant round-off error are calculated in [21] as (20a) (20b) respectively. The round-off error generated from each microrotation and scaling is propagated forward through the CORDIC iterations, and get accumulated in the output vector . The magnitude of accumulated round-off error in the estimation of vector after micro-rotations is (21a)

TABLE IV MINIMUM CLOCK PERIOD OF DIFFERENT ARCHITECTURES

C. Error of the Proposed CORDIC and , all the necessary values For the case of in (18) and (22) can be obtained from Tables I and III. Additionally, the number of fractional bit and average power are needed for the estimation of round-off and approximation error, respectively. Equation (22) is valid only for the case when the micro-rotations and scaling are performed in two separated stages. If the micro-rotations and scaling are deployed in interleaved manner, the sequence of and in (21) and (22) should be changed accordingly in order to represent the transfer function in interleaved manner. If we want to reduce the total error, and the approximation error is dominant error source, it would be a better strategy to increase the number of micro-rotations and scaling iterations. It would make and/or approach to zero. If the round-off error is greater than approximation error, we need to increase the number of fractional bits in order to reduce the total error. VI. COMPLEXITY CONSIDERATIONS We discuss here the hardware and time complexities of the proposed design. In the existing literature we do not nd similar work on CORDIC implementation of known and xed rotations. Therefore, we compare the proposed design with the conventional CORDIC design for the rotation of unknown angle. We have used the basic CORDIC processor in [3, Fig. 2] for the implementation of conventional CORDIC. In addition, we have designed a reference architecture (see Fig. 1) for straightforward implementation of xed rotations, and we have compared the complexities and speed performance of the proposed design with the conventional and reference design. The maximum deviation of amounting to is assumed to be accuracy level-1 (AL-1) and that amounting to is assumed to be accuracy level-2 (AL-2), so that AL-1 and AL-2 would correspond to the proposed CORDIC implementations of rotation through four and six micro-rotations, respectively. A. Complexity of the Conventional and Reference CORDIC The conventional rotation-mode CORDIC requires three -bit adders, three -bit registers, two barrel-shifters and four MUXes, where is the word-length. The complexity of barrel-shifter, however, depends on the accuracy of implementation. Considering that the rotations are mapped to the rst quadrant, the conventional CORDIC would involve 11 iterations and 17 iterations, respectively, for AL-1 and AL-2. Each of its pair of barrel-shifters would thus involve four and ve stages, where each stage requires 2:1 MUXes, for AL-1

(21b) The nal round-off error accumulated in the output vector after scaling is calculated by using (22)

(22a)

(22b)

226

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 2, FEBRUARY 2013

TABLE V AREA AND TIME COMPLEXITIES OF DIFFERENT ARCHITECTURES

TABLE VI AREA-TIME COMPLEXITIES OF DIFFERENT ARCHITECTURES BASED ON SYNTHESIS RESULT USING TSMC 90-nm LIBRARY

TABLE VII RELATIVE ADVANTAGES OF PROPOSED DESIGNS OVER DESIGN FOR FIXED ROTATIONS

THE

REFERENCE

has the same barrel-shifter complexity and time-complexity as the conventional CORDIC. B. Complexity of the Proposed CORDIC Designs Each of the proposed CORDIC designs involves a latency of 7 cycles and 11 cycles for accuracy level-1 and 2, respectively. But the hardware requirement, duration of clock period and throughput rate differ from one another. We discuss these complexities of proposed CORDIC designs in four categories: 1) single CORDIC cell with interleaved-scaling; 2) single CORDIC cell with separate-scaling; 3) single-rotation cascade; and 4) bi-rotation cascade. 1) Cordic Cell With Interleaved Scaling and Micro-Rotations: As shown in Fig. 8, the CORDIC implementation by interleaved scaling requires an additional ROM and a line changer over that of reference design of Fig. 1. The line changer requires number of tri-state buffers and a T ip op to generate the control-bit. Using hardwired pre-shifting, each of the pair of barrel-shifters involves 4 stages of 2:1 MUXes for implementing all the necessary shifts for micro-rotations as well as scaling for both accuracy levels. Accordingly, the duration of minimum clock period for the proposed interleaved CORDIC can be found to be for both the accuracy levels. It involves 7 and 11 iterations for AL-1 and AL-2, respectively, to implement both scaling and micro-rotations. The ROM therefore needs to store 7 and 11 control words of 4-bit size to be used by the barrel-shifter, and the SBR is of 7 and 11 bit size for AL-1 and AL-2, respectively. 2) CORDIC Cell With Separate Scaling and Micro-Rotation Stages: CORDIC implementation of xed rotation could be performed in two pipelined stages, where micro-rotations are implemented by Fig. 2 and scaling is implemented by Fig. 7.

and AL-2, respectively. Apart from that, all the three input registers are to be loaded through MUXes to allow direct input as well as the input through the feedback path. The ROM needs to store bits angles for 11 and 17 iterations for AL-1 and AL-2, respectively. The duration of minimum clock period in conventional CORDIC is and for AL-1 and AL-2, where , , and are the -bit addition-time, D ip op delay and delays of 2:1 MUX, respectively. The reference CORDIC for the xed rotation (shown in Fig. 1), consists of two adders, two barrel-shifters, one sign-bit-register and two input registers with MUXes. The MUXes accompanied by the input registers are, however, not shown in the reference as well as the proposed designs (as discussed in Section II for the description of Fig. 1). We assume that the rotation is mapped to half quadrant range so that for accuracy of AL-1 and AL-2, it requires 10 and 16 iterations. It

MEHER AND PARK: CORDIC DESIGNS FOR FIXED ANGLE OF ROTATION

227

TABLE VIII AREA AND TIME COMPLEXITIES COMPARISON OF DIFFERENT ARCHITECTURES ON FPGA

Using hardwired pre-shifting, the barrel-shifter involves 3 and 4 stages of 2:1 MUXes to implement the necessary shifts for micro-rotation and 2 and 4 stages for scaling for accuracy levels-1 and -2, respectively. The ROM therefore needs to store 4 control words of 3 bit size for micro-rotation and 3 control words of 2 bit size for scaling to be used by the barrel-shifter for AL-1 and 11 control words of 4 bit size for AL-2, along with SBR of 7 and 11 bit size for AL-1 and AL-2, respectively. Accordingly, the duration of minimum clock period for this implementation is found to be and for both accuracy levels-1 and 2, respectively. Although it involves 3 and 5 iterations for scaling, it involves 4 and 6 iterations for micro-rotations for AL-1 and AL-2, respectively. Therefore, the iteration count in this case is 4 and 6 for these two cases. 3) Single-Rotation Cascade: The single-rotation cascaded CORDIC for xed-angle rotation is shown in Fig. 5. For accuracy level-1 it involves seven stages out of which four stages perform the micro-rotations and the three remaining stages perform scaling. The rotation modules are modied to implement shift operations for scaling. Each stage requires two adders and two pipelining registers (except that the last stage does not require pipeline register). All the shiftings are hardwired and there is no feed-back path in this circuit. Therefore, it does not require any ROM, SBR, barrel-shifters or MUXes. The duration of minimum clock period for this implementation is for both accuracy levels and produces one output in each cycle. 4) Bi-Rotation Cascade: For accuracy level-1, it requires a cascaded two-stage bi-rotation CORDIC as shown in Fig. 6 for micro-rotation. To implement scaling, it requires a two-factor scaling circuit of Fig. 9 and time-multiplexed circuit of Fig. 11 for one-factor scaling. For accuracy level-2, it requires a cascaded three-stage bi-rotation CORDIC (see Fig. 6) for micro-rotation. To implement scaling, it requires three cascaded stages consisting of two stages of two-factor scaling circuit of Fig. 9 and one stage of a time-multiplexed circuit of Fig. 11. The duration of minimum clock period for both the accuracy-levels is and it gives an output in every alternate cycle. C. Comparative Performances The expressions of clock periods of the architectures are listed in Table IV. The single-rotation CORDIC has the minimum of clock period of one addition-time and bi-rotation CORDIC has slightly higher clock period. The hardware and

time-complexities of different architectures are listed comprehensively in Table V. The CORDIC algorithms are written in hardware description language and synthesized by Synopsys Design Compiler using the TSMC 90-nm library to obtain the complexities of proposed and the reference designs. Word size and 32 are used for accuracy level-1 and -2, respectively. The area, clock period, latency, throughput, average computation time (ACT), and area-delay product (ADP) are listed in Table VI. The reference design has the same clock-period as the conventional CORDIC but yields more throughput and involves less area, less latency and less area-delay product (ADP), over the conventional one, in average over the two levels of accuracy. The proposed design of single CORDIC cell with interleaved-scaling has more area but offers more throughput and involves less latency and less ADP, in average over both the levels of accuracy, compared to the reference design. The proposed design of single CORDIC unit with separate-scaling similarly, has nearly more area but offers nearly 2.7 time the throughput and involves less ADP and two-third of the latency over the reference design. The relative advantages of single-rotation cascade and bi-rotation cascade are shown in Table VII. In average over both the levels of accuracy, the single-rotation and bi-rotation cascades, respectively, involve nearly 3.6 times and 2.9 times more area over the reference design, but offer nearly 16.3 times and 7.0 times more throughput, and involve 4.6 and 2.5 times less ADP with nearly half and two-third less latency over the other. The reference and proposed designs are also implemented on the eld-programmable gate-array (FPGA) platform of Xilinx devices. The number of slices (NOS), maximum operating frequency (MUF), and slice-delay product (SDP) using two different devices of Spartan-3A (XC3SD1800A-4FG676) and Virtex-4 (XC4VSX3510FF668) are listed in Table VIII. The proposed design of single-rotation cascade involves smaller number of slices and faster operating frequency over the conventional and reference designs for two devices. In average over both levels of accuracy and devices, the single-rotation cascade offers nearly 23.5 and 32.2 less SDP over the reference design and conventional design, respectively. VII. CONCLUSION The number of micro-rotations for rotation of vectors through known and xed angles are optimized and several possible

228

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 2, FEBRUARY 2013

dedicated circuits are explored for rotation-mode CORDIC processing with different levels of accuracy. The proposed CORDIC cell with interleaved scaling involves more area, but offers more throughput and involves nearly less latency and less ADP, than the reference design for known and xed rotations. The proposed single-rotation cascade and birotation cascade require, respectively, and times more area over the reference design, but offer nearly 16.3 and 7.0 times more throughput, and involve nearly 4.6 and 2.5 times less ADP with nearly half and two-thirds of the latency of the other. With progressing scaling trends, since the silicon area is getting continually cheaper, it appears to be a good idea to use the cascaded designs for their potential for high-throughput and low-latency implementation. It is found that higher accuracy could be achieved in case of smaller angles of rotation when the same number of micro-rotations are used. The small angle rotators could therefore be very much useful for shape design and curve tracing for animation and gaming devices. The xed-angle CORDIC rotation would have wide applications in signal processing, games, animation, graphics and robotics, as well. REFERENCES
[1] J. E. Volder, The CORDIC trigonometric computing technique, IRE Trans. Electron. Comput., vol. EC-8, pp. 330334, Sep. 1959. [2] J. S. Walther, A unied algorithm for elementary functions, in Proc. 38th Spring Joint Comput. Conf., 1971, pp. 379385. [3] Y. H. Hu, CORDIC-based VLSI architectures for digital signal processing, IEEE Signal Process. Mag., vol. 9, no. 3, pp. 1635, Jul. 1992. [4] P. K. Meher, J. Valls, T.-B. Juang, K. Sridharan, and K. Maharatna, 50 years of CORDIC: Algorithms, architectures and applications, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 9, pp. 18931907, Sep. 2009. [5] Y. H. Hu and S. Naganathan, An angle recoding method for CORDIC algorithm implementation, IEEE Trans. Comput., vol. 42, no. 1, pp. 99102, Jan. 1993. [6] Y. H. Hu and H. H. M. Chern, A novel implementation of CORDIC algorithm using backward angle recoding (BAR), IEEE Trans. Comput., vol. 45, no. 12, pp. 13701378, Dec. 1996. [7] C.-S. Wu, A.-Y. Wu, and C.-H. Lin, A high-performance/low-latency vector rotational CORDIC architecture based on extended elementary angle set and trellis-based searching schemes, IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 50, no. 9, pp. 589601, Sep. 2003. [8] T.-B. Juang, S.-F. Hsiao, and M.-Y. Tsai, Para-CORDIC: Parallel cordic rotation algorithm, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 8, pp. 15151524, Aug. 2004. [9] T. K. Rodrigues and E. E. Swartzlander, Adaptive CORDIC: Using parallel angle recoding to accelerate CORDIC rotations, in Proc. 40th Asilomar Conf. Signals, Syst. Comput. (ACSSC), 2006, pp. 323327. [10] K. Maharatna, S. Banerjee, E. Grass, M. Krstic, and A. Troya, Modied virtually scaling free adaptive CORDIC rotator algorithm and architecture, IEEE Trans. Circuits Syst. for Video Technol., vol. 15, no. 11, pp. 14631474, Nov. 2005. [11] C. Y. Kang and E. E. Swartzlander, Jr., Digit-pipelined direct digital frequency synthesis based on differential CORDIC, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 5, pp. 10351044, May 2006. [12] L. Vachhani, K. Sridharan, and P. K. Meher, Efcient CORDIC algorithms and architectures for low area and high throughput implementation, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 1, pp. 6165, Jan. 2009.

[13] B. G. Blundell, An Introduction to Computer Graphics and Creative 3-D Environments. London, U.K.: Springer-Verlag, 2008. [14] T. Lang and E. Antelo, High-throughput CORDIC-based geometry operations for 3D computer graphics, IEEE Trans. Comput., vol. 54, no. 3, pp. 347361, Mar. 2005. [15] M.-P. Cani, T. Igarashi, and G. Wyvill, Interactive Shape Design. San Rafael, CA: Morgan & Claypool, 2007. [16] K. J. Jones, 2D systolic solution to discrete Fourier transform, IEE Proc. Comput. Digit. Techn., vol. 136, no. 3, pp. 211216, May 1989. [17] P. K. Meher, J. K. Satapathy, and G. Panda, Efcient systolic solution for a new prime factor discrete Hartley transform algorithm, IEE Proc. Circuits, Devices Syst., vol. 140, no. 2, pp. 135139, Apr. 1993. [18] S. Freeman and M. ODonnell, A complex arithmetic digital signal processor using cordic rotators, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1995, pp. 31913194. [19] J. R. Cavallaro and F. T. Luk, CORDIC arithmetic for a SVD processor, J. for Parallel Distrib. Comput., vol. 5, pp. 271290, 1988. [20] C. H. Lin and A. Y. Wu, Mixed-scaling-rotation CORDIC (MSRCORDIC) algorithm and architecture for high-performance vector rotational DSP applications, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 11, pp. 23852396, Nov. 2005. [21] S. Y. Park and N. I. Cho, Fixed-point error analysis of CORDIC processor based on the variance propagation formula, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 3, pp. 573584, Mar. 2004. [22] A. R. Omondi, Computer Arithmetic Systems: Algorithms, Architectures, and Implementations. New York: Prentice-Hall, 1994. [23] A. K. Jain, Fundamentals of Digital Image Processing. Englewood Cliffs, NJ: Prentice-Hall, 1989. [24] Signal Compression : Coding of Speech, Audio, Text, Image And Video, ser. Selected Topics in Electronics and Systems, N. E. Jayant, Ed. Singapore: World Scientic, 1997, vol. 9.

Pramod Kumar Meher (SM03) is currently working as a Senior Scientist with the Institute for Infocomm Research, Singapore. His research interests include design of dedicated and recongurable architectures for computation-intensive algorithms pertaining to signal, image, and video processing, communication, bio-informatics, and intelligent computing. He has contributed more than 180 technical papers to various reputed journals and conference proceedings. Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India. He has served as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSPART II: EXPRESS BRIEFS during 20082011. Currently, he is serving as a speaker for the Distinguished Lecturer Program (DLP) of IEEE Circuits Systems Society, and Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSPART I: REGULAR PAPERS, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and Journal of Circuits, Systems, and Signal Processing. He was a recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for the year 1999.

Sang Yoon Park (S03M11) received the B.S., M.S., and Ph.D. degrees from Seoul National University, Seoul, Korea, in 2000, 2002, and 2006, respectively. He was a Research Fellow with Nanyang Technological University, Singapore, from 2007 to 2008. He joined the Institute for Infocomm Research, Singapore, in 2008, where he is currently a Research Scientist. His research interests include architectures and algorithms for low-power/high-performance digital signal processing and communication systems.

You might also like