IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 2001
Novel VLSI architectures and a design methodology for adder-based Residue Number System (RNS) multipliers are presented in this paper. In the proposed approach, the exploitation of the non-occurring combinations of input bits reduces the number of 1-bit full adders (FAs) required to compose a multiplier. In particular, couples and triplets of input bits assigned to particular FAs are identified, which contain bits that cannot be simultaneously asserted for any valid input combination. It is shown that the particular couples or triplets can be assigned to OR gates instead of 1-bit adders, therefore reducing multiplier complexity. By comparing the performance and hardware complexity of the proposed multiplier to previously reported designs, it is found that the introduced architecture is more efficient in the area-time product sense. In fact, it is shown that more than 80% performance improvement can be achieved in certain cases.
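The OR-gate substitution described in this abstract can be checked exhaustively for a single full adder. The sketch below is only an illustration of that one property, not the paper's design methodology; the function names are hypothetical. It assumes that at most one of the three input bits is ever asserted, in which case the adder's sum bit equals the OR of its inputs and its carry is always zero.

```python
from itertools import product

def full_adder(a, b, c):
    """Return (sum, carry) of a 1-bit full adder."""
    total = a + b + c
    return total & 1, total >> 1

def or_gate(a, b, c):
    """Return the 3-input OR of the bits."""
    return a | b | c

def or_substitution_is_exact():
    """Check every input with at most one asserted bit (assumed constraint)."""
    for a, b, c in product((0, 1), repeat=3):
        if a + b + c > 1:
            continue  # combination assumed never to occur
        s, carry = full_adder(a, b, c)
        if s != or_gate(a, b, c) or carry != 0:
            return False
    return True

print(or_substitution_is_exact())
```

When two bits can be asserted together the substitution is no longer exact: `full_adder(1, 1, 0)` gives sum 0 and carry 1, while the OR gate outputs 1.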
IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 2000
Novel radix-r modulo-r^n arithmetic units for residue number system (RNS)-based architectures are introduced in this paper. The proposed circuits are shown to require several times less area than previously reported architectures for particular moduli of operation, while also being preferable in the area-time complexity sense. The complexity reduction is achieved by extending the carry-ignore property of modulo-2^n operations to radices higher than two, which are not powers of two. The carry-ignore property is efficiently exploited by introducing simplified digit adders, instead of general radix-r adders. The proposed simplification of digit adders is possible, since the maximum values of certain intermediate digits produced in the architecture are found to be less than r - 1. Detailed area and time complexity models are derived for the arithmetic units. The proposed radix-r architectures include multipliers, adders, and merged multipliers-adders. In addition, efficient radix-r binary-to-residue and residue-to-binary conversion techniques and architectures are introduced.
... The reserved interrupt is not available for use. The PCU contains five directly addressable registers in addition to the program counter (PC). These are the loop address (LA), loop counter (LC), status register (SR), operating mode register (OMR), and stack pointer (SP). ...
Proceedings of the 12th ACM Great Lakes Symposium on VLSI - GLSVLSI '02, 2002
A novel approach for the reduction of the power dissipated in a signal processing application is introduced in this paper. By exploiting the properties of the Polynomial Residue Number System (PRNS) and of the arithmetic modulo (2^n + 1), the power dissipation of implementing cyclic convolution is reduced up to four times. Furthermore, the corresponding power × delay product is reduced up to 24 times, while a simultaneous reduction of area cost is achieved. The particular performance improvement becomes possible by introducing a way to minimize the forward and inverse conversion overhead associated with PRNS. The introduced minimization exploits the fact that, for particular lengths of data sequences and particular moduli, the conversions require only multiplications with powers of two and additions, thus leading to low implementation complexity. In addition, multiple supply voltages are utilized to further reduce power dissipation by more than 30% for particular cases. Formulas that return the applicable supply voltage values per PRNS channel are derived in this paper.
1993 IEEE International Symposium on Circuits and Systems, 2000
A full adder-based arithmetic unit of a modulus m, called an FA-based AUm, is proposed. It performs both addition and multiplication at the same time. Since the proposed AUms use full adders as their basic units, they lead to modular and regular designs, which result in lower cost and easier implementation in VLSI.
Proceedings of 13th International Conference on Digital Signal Processing, 2000
A computationally efficient block matching algorithm is presented to perform motion estimation of image sequences. The algorithm evaluates an objective function for all neighbouring blocks and stops when no further improvement can be achieved. The complexity of the algorithm is reduced significantly, as the objective function is calculated from the projections of the blocks along the horizontal and the vertical axis. Furthermore, the relationship between projections of the neighbouring blocks is utilized, so as to alleviate the need for fully calculating the projection vectors for each candidate block. The proposed algorithm is compared against the full search (FS), two-dimensional logarithmic search (2D LS), and block-based gradient descent search (BBGDS), in terms of complexity and compression performance. Experimental results show that the proposed algorithm exhibits quite good performance at a significantly reduced computational complexity.
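The idea of scoring candidate blocks from their projections, rather than from every pixel, can be sketched as follows. The abstract does not state the objective function, so the sum of absolute differences between projection vectors used here is an assumption, and the function names are hypothetical. The incremental update of projections between neighbouring blocks mentioned in the abstract is not shown.

```python
def projections(block):
    """Return (row sums, column sums) of a 2-D block given as a list of rows."""
    rows = [sum(r) for r in block]
    cols = [sum(c) for c in zip(*block)]
    return rows, cols

def projection_cost(block_a, block_b):
    """Assumed cost: sum of absolute differences between the projections."""
    rows_a, cols_a = projections(block_a)
    rows_b, cols_b = projections(block_b)
    row_diff = sum(abs(x - y) for x, y in zip(rows_a, rows_b))
    col_diff = sum(abs(x - y) for x, y in zip(cols_a, cols_b))
    return row_diff + col_diff

reference = [[1, 2], [3, 4]]
candidate = [[1, 2], [3, 5]]
print(projection_cost(reference, candidate))
```

For an N x N block, this cost compares 2N projection values instead of N² pixels, which is where the complexity saving would come from under these assumptions.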
ISCAS '98. Proceedings of the 1998 IEEE International Symposium on Circuits and Systems (Cat. No.98CH36187), 1998
In this paper, a novel approach for low-power realization of DSP algorithms that are based on inner product computation is proposed. Inner product computation between data and coefficients is a very common computational structure in DSP algorithms. The proposed methodology is based on an architectural transformation that reorders the sequence of evaluation of the partial products forming the inner products. The total Hamming distance of the sequence of coefficients, which are known before realization, is used as the cost function driving the reordering. The reordering of computation reduces the switching activity at the inputs of the computational units. Experimental results show that the proposed methodology leads to significant savings in switching activity and thus in power consumption.
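The cost function named in this abstract, the total Hamming distance of the coefficient sequence, is easy to illustrate. The sketch below uses a greedy nearest-neighbour ordering, which is an assumption chosen for brevity and not necessarily the reordering procedure used in the paper; all function names are hypothetical. Because the inner product is a sum, any ordering of its partial products yields the same result.

```python
def hamming(a, b):
    """Number of bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")

def total_hamming(seq):
    """Total Hamming distance between consecutive coefficients."""
    return sum(hamming(x, y) for x, y in zip(seq, seq[1:]))

def greedy_reorder(coeffs):
    """Assumed heuristic: always move to the closest remaining coefficient."""
    remaining = list(coeffs)
    order = [remaining.pop(0)]  # keep the first coefficient as the start
    while remaining:
        nxt = min(remaining, key=lambda c: hamming(order[-1], c))
        remaining.remove(nxt)
        order.append(nxt)
    return order

coeffs = [0b0000, 0b1111, 0b0001, 0b1110]
reordered = greedy_reorder(coeffs)
print(total_hamming(coeffs), total_hamming(reordered))
```

For this example the total Hamming distance drops from 11 to 5, meaning fewer bit toggles at the coefficient input of the computational unit.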
ISCAS'99. Proceedings of the 1999 IEEE International Symposium on Circuits and Systems VLSI (Cat. No.99CH36349), 1999
Novel techniques for low-power synthesis of sum-of-product computation are presented. The proposed synthesis techniques aim at reducing the switching activity at the inputs of the functional units, leading to a reduction of the internal activity as well. Heuristics are used to assign the partial products of the computation to the functional units. These heuristics increase the correlation of the partial products that will be assigned to the same functional unit, thus reducing the switching activity. Next, scheduling techniques are used to reduce the switching activity at the inputs of the functional units required for the successive evaluation of the partial products assigned to the same functional unit. Both the assignment and scheduling steps use information from both data (dynamic) and coefficients (static). Experimental results from the application of the proposed techniques on signal processing algorithms show that significant switching activity savings can be achieved.
ISCAS'99. Proceedings of the 1999 IEEE International Symposium on Circuits and Systems VLSI (Cat. No.99CH36349), 1999
A digital area/power-efficient VLSI implementation of the baseband part of a DECT demodulator is introduced. Starting from algorithm level and after exhaustive architecture-level exploration employing low-power design techniques and transformations, we conclude with the hardware implementation of four optimized algorithms. The proposed DECT receiver will be integrated with the processor ASPIS implementing the baseband signal processing of a multi-mode terminal GSM/DECT/DCS-1800.
The Kluwer International Series in Engineering and Computer Science, 1993
This chapter addresses the design of two-level pipelined processor arrays. The parallelism of algorithms is exploited both in word-level and in bit-level operations. Given an algorithm in the form of a Fortran-like nested loop program, a two-step procedure is applied. First, any word-level parallelism is exploited by using loop transformation techniques, which include a uniformization method, if required, and a ...
Proceedings of the First International Conference on Massively Parallel Computing Systems (MPCS) The Challenges of General-Purpose and Special-Purpose Computing, 1994
In this paper, Petri Nets (PNs) are used for deriving efficient mapping transformations of a wide class of algorithms to processor arrays. In the proposed methodology, given an algorithm and the interconnections of the processor array, two PNs are constructed: one that is related to the algorithm and one that is related to the processor array. The former PN models the execution of the algorithm and differs drastically from the common data-flow methods. Based on properties of PNs and on the reachability tree analysis technique, a theorem is given, through which the two PN models suggest all possible ways of implementing the algorithm by the processor array.
2012 International Conference on Control Engineering and Communication Technology, 2012
This paper addresses the design of a low-complexity, high-speed matrix inversion algorithm using fast inverse square root, based on QR decomposition and a systolic array architecture. Matrix operations are the most costly computational module within MIMO-LTE receivers. We demonstrate a novel approach to matrix inversion that reduces the MIMO receiver module cost in terms of latency and complexity. The cost is reduced by implementing a 4x4 matrix inverse in a Xilinx Virtex-6 FPGA and optimizing the module for speed and power through pipelining, which achieves better throughput. The results are compared with state-of-the-art techniques based on CORDIC squared Givens rotation.
ICECS 2001. 8th IEEE International Conference on Electronics, Circuits and Systems (Cat. No.01EX483), 2001
Four types of VLSI architectures for the hardware realization of the FLOS-CM algorithm are introduced in this paper. Each architecture is appropriate for a particular environment. The FLOS-CM algorithm is found to be amenable to implementation using logarithmic arithmetic. A logarithmic architecture is shown to require up to 50% less area and be 14% faster than a linear fixed-point arithmetic counterpart. In terms of Area×Time and Area×Time² complexities, the logarithmic architecture is up to 120% better.
2006 IEEE International Symposium on Circuits and Systems, 2006
A VLSI Residue Number System (RNS) architecture of an ECPM is presented in this paper. In the proposed approach, the necessary mathematical conditions that need to be satisfied, in order to replace typical finite field circuits with RNS ones, are investigated. It is shown that such an application is feasible and that it leads to a significant improvement in the execution time of a scalar point multiplication.
2009 16th IEEE International Conference on Electronics, Circuits and Systems - (ICECS 2009), 2009
Elliptic Curve Point Multiplication is the main operation employed in all elliptic curve cryptosystems, as it forms the basis of the Elliptic Curve Discrete Logarithm Problem. Therefore, the efficient realization of an Elliptic Curve Point Multiplier is of fundamental ...
An Elliptic Curve Point Multiplier (ECPM) is the main part of all Elliptic Curve Cryptography (ECC) systems and its performance is decisive for the performance of the overall cryptosystem. A VLSI Residue Number System (RNS) architecture of an ECPM is presented in this paper. In the proposed approach, the necessary mathematical conditions that need to be satisfied, in order to replace typical finite field circuits with RNS ones, are investigated. It is shown that such an application is feasible and that it leads to a significant improvement in the execution time of a scalar point multiplication.
Papers by T. Stouraitis