Efficient Hardware Implementation of Hybrid Cosine-Fourier-Wavelet Transforms On A Single FPGA

Efficient Hardware Implementation of Hybrid CosineFourier-Wavelet Transforms on a Single FPGA
K. Wahid+, S. Shimu+, M. Islam+, D. Teng+, Moon Ho Lee*, and S-B. Ko*+

+
Department of Electrical and Computer Engineering

University of Saskatchewan, Saskatoon, Saskatchewan, Canada, S7N5A9
*Institute of Information and Communication, Chonbuk National University, Jeonju 561-756, Korea
email: [email protected]
Abstract This paper presents an efficient hardware
implementation of a hybrid architecture to compute three 8point transforms the Discrete Cosine Transform, the Discrete
Fourier Transform, and the Discrete Wavelet Transform on a
single FPGA. The architecture is based on an element-wise
matrix factorization and row-permutation algorithm, where the
forward basis transformation matrices are decomposed into
multiple sub-matrices and the common units are shared among
them. The hardware implementation is parallel, pipelined and
multiplication-free; it costs only 2,073 logic cells, 1,476 registers
and runs at maximum frequency of 118 MHz with a very high
process throughput of 944 Megabits/sec when synthesized onto
an Altera FPGA device. The synthesized results for other FPGA
technologies are also presented for performance assessment.
I.
INTRODUCTION
The paradigm of system on chip design techniques has

brought an added dimension to multimedia-based applications
by allowing third party intellectual property (IP) blocks to be
used by digital circuit designers. This provides us with an
opportunity to introduce efficient arithmetic techniques into
next generation integrated circuits, where we can take
advantage of the arithmetic structure to implement faster and
area-efficient processors.
The Discrete Cosine Transform (DCT) is reportedly the
best suboptimal transformation for image and video
processing and being used in various coding standards [1].
The Discrete Fourier Transform (DFT) has been widely used
as the fundamental computation unit for signal processing and
communication applications [2]. Over the past decade, the
Discrete Wavelet Transform (DWT) based on the Haar matrix
[3] has become very popular for compression algorithms,
especially after the DWTs inclusion in the JPEG2000
standard [4]. Taking the respective strengths and weaknesses
of these transforms into account, researchers have developed
several new schemes, called hybrid architectures, where these
transforms can be combined to achieve better performance.
For example, the Fourier-Wavelet transform can be used to
improve the denoising performance for images [5]; a CosineWavelet hybrid structure can be used to minimize the
978-1-4244-3828-0/09/$25.00 2009 IEEE
prediction errors in video coding [6][7] and to enhance the

security in digital watermarking [8], etc.
Here in this paper, we present an efficient hardware
implementation of a hybrid architecture to compute the three
popular 8-point transforms the DCT, the DFT, and the DWT
on a single FPGA. Fig. 1 illustrates the presented hybrid
architecture of the DCT-DFT-DWT. The architecture is based
on an element-wise matrix factorization and row-permutation
algorithm, where the forward basis transformation matrices
are first decomposed into multiple sub-matrices; the common
structures of the sub-matrices are then identified and shared
among them. This results in significant hardware savings in
terms of less computational resources, as well as faster design.
The architecture has been prototyped onto several FPGA
technologies for performance assessment. The implementation
details, as well as synthesis results are discussed.
Figure 1. Hybrid Cosine-Fourier-Wavelet transform architecture
II.
MATRIX DECOMPOSITION ALGORITHM
In order to decompose the forward transform basis

matrices, we utilize sparse matrix factorization where, the
DCT, the DFT, and the DWT matrices can be sub-divided into
one orthogonal character matrix and a special sparse matrix,
named the Jacket matrix [9].
A.
2325
Matrix Decomposition of the DCT

A classical N-point DCT is defined as follows:
2
m(2n +1)
km cos
where,{m, n} = 0,1,..., N 1
N
2N
[CN ]m,n =
(1) [ F ] = ([ Pr ] )1 F~
N
N

= [ Pr ]N
j = 1,2,...., N 1
and k = 1,
Now we define a row permutation matrix, [Pr ]N given

below in (2):
[ Pr ]2 = [ I ]2
and [Pr ]N = pri , j
pri , j = 1, if i = 2 j , 0 j N / 2 1
pri , j = 1, if i = (2 j + 1) mod N , N / 2 j N 1, and
pr = 0, others
i, j
i, j {0,1,..., N 1}. A reversible permutation matrix, [ Pc]N

is defined below in (3):
[Pc]2 = [I ]2 and [Pc]N
0
IN / 4
0
0
0
0
,N 4
IN / 4
0
0
0
IN / 4
(3)
Then, the general recursive form of the DCT can be

simplified and is shown in (4) ( [ D ] = diag C4N ,..., C4N ):
N
N 1
The proof of the algorithm can be found in [10]. The

algorithm is further modified for an 8-point data sequence.
B. Matrix Decomposition of the DFT
The DFT of a given sequence x(m) , 0 m N 1 is
defined by (5):
X ( n ) = x ( m) W nm , 0 n N 1
(5)
m=0
Where W = e
[ ]
2
N
, j = 1 .The N-point DFT matrix can be
denoted by F N = W
N . Using similar approach defined
in the previous section, the general recursive form for the DFT
matrix can be represented by the following (6):
nm
I N /2
I N /2
W0
W2
W0
W0 W2
W0 W0
W W
2
0
W0 W0
W0 W2
W0
W0
W0
[ 0]
W0
W0
W0
W0
W0
W2
W0
W2
(7)
[ I4 ]
[0]
W
0
0
0
[I ] [I ]
0
4
4
0 W1 0
0
[ I 4 ] [ I 4 ]
[ 0]
0 0 W2 0
0 0 0 W3
C. Matrix Decomposition of the DWT

The discrete wavelet transform based on the 8-point Haar
matrix (HWT) is defined below in (8):
r3
3
r
r 2
0
r
0
0
K (i, 0) = (1)i + j . 2
Here, [ K ] =
K (i, j ) = (1)i + j .2 , when i j , j 0
ij
otherwise
0,
N 1
I 2 I N /2
0 I N /2
...
I 2 0 WN /2 I N /2
W0 W0
W0 W2
W0 W0
[ I 4 ][ 0] W0 W2
[Pr]8F
F
[ 0][ Pr ]4
[ 0]
0
I2
I
W2 N /4 I 2
0
[ I F2 ]
Pr2 N /2
The algorithm is again modified for an 8-point sequence

for efficient implementation, which is given below in (7):
I N /2
0
I2 0
(4)

... I N /4 0 K [ I N /2 C2 ]
0
K
N
N /2
2
0
I I
0 I
I
I
I
1
[ I N /4 02 D I N /4 I2 I2 ... N0/2 D I N /2 IN /2 ( Pc
N )
N /2 N /2
N /2
2
2 2
[CN ] = Pr
0
I
... I 2
PrN /2 N /4 0
I N /2
0
(2)
,N 4
Where,
I N / 4
0
=
0
[ I N /4 02
j = 0, N
0.5 ,
(6)
r3
r3
r3
r3
r3
r3
r
0
r2
0
r
0
r 2
0
r
r2
r
r 2
r
r 2
r
0
0
r
0
0
0
0
r2
0
0
0
0
0
0
0
r
0
r
0
0
r
r3
r 3
(8)
0
r 2
1
where, r =
= C4
0
2
0
0
By taking similar approach, the recursive form can be

computed and is shown below in (9):
[ I4 ]
C4
[ I 4 ][ 0]
W1
C
[Pr]8
W1
4
[0] [ Pr ]4 [ 0]
1
[ 0]
C 4 C4
C 4 C4
1 0
0
C4
C4
(9)
[ I 4 ][ 0] [ I 4 ] [ I 4 ]
W2
[Pr]8
W2
[ 0][ Pr ]4 [ I 4 ] [ I 4 ]
Now we take advantage of the common sub-matrices

shown in (4), (7) and (9), and implement the architecture
which is discussed in the next section.
2326
III.
HARDWARE IMPLEMENTATION
The architecture is organized as a linear multistage

pipeline in order to achieve high throughput and based on 2s
complement arithmetic in order to naturally handle signed
numbers. The block diagram in Fig. 2 shows the processor
core. The data input is 8 bits and the output coefficient (both
real and imaginary) is 12 bits; a 2-bit input pin, t_select,
selects the transforms. The selection encoding is as follows:
{00: DCT; 01: DFT; 10: DWT; 11: DCT (by default)}. The
detailed flow graph inside the core is given in Fig. 3. All
modules are controlled by a controller, and the control signals
are denoted by arrows. The module description is as follows:
{C, F, W, #} stands for {Cosine, Fourier, Wavelet, Stage
number}; for example, CFW_2 means that this is a common
module required to compute all three transform and it belongs
CF_1
X0
X1
X2
X3
X4
X5
X6
X7
Data_in
[7:0]
DMUX
1x2
C
o
n
t
r
o
l
CF_2
CFW_2
to stage 2; F_5 means that the stage 5 module is only required

to compute the DFT.
Figure 2. Hybrid Cosine-Fourier-Wavelet transform processor
C_3
C
o
n
t
r
o
l
C
o
n
t
r
o
l
CFW_4
Buffer
t_select
[1:0]
CFW_6
B
u
f
f
e
r
F_6
B
u
f
f
e
r
CF_4
C
o
n
t
r
o
l
C_5
F_5
C
o
n
t
r
o
l
data_out_real
[11:0]
data_out_img
[11:0]
Controller
Figure 3. Top module of the hybrid transform processor
(a) DEMUX
(g) CF_2
(h) F_6
(i) CFW_6
(j) C_3
(b) CF_1
C1
C3
C5
C7
(c) C_5
(d) F_5
(e) CF_4 / CFW_4
Figure 4. Hardware blocks of the hybrid processor (a j)
(f) CFW_2
2327
By asserting high to the start pin will start the processor.

A high in the data_valid pin indicates the valid output
coefficients. The intermediate data that come out from the
blocks with a control arrow pin have been latched into register
banks. Thus from Fig. 3, we can see that a large number of
blocks can be shared among the transforms, which results in
smaller hardware, as well as lower power consumption.
The internal circuitry of each module has been shown in
Fig. 4. Internal counters are set to automatically control these
operations and maintain a parallel and congestion-free data
flow. The transform coefficients are denoted by small boxes
where, CK = cos k .
16
These multipliers are fixed in nature; hence, they have
been computed with shifts and additions only with a precision
of 8 bits as a way to minimize the hardware and optimize the
operations, which are completed in one clock cycle. Internally
to the multiplier, all significant bits are used to retain the
precision of the calculation but the multiplier outputs are
truncated to discard the fractional part.
IV.
HARDWARE SYNTHESIS
The proposed algorithm is implemented in Verilog HDL

and synthesized into several FPGA technologies so that the
cost can be assessed and compared. The results are
summarized in Table I. It can be seen that, the hardware cost
is low with a maximum operational frequency of 118 MHz
that results in a process throughput of 944 Megabits/sec.
TABLE I.
HARDWARE RESOURCES OF THE HYBRID ARCHITECTURE

Cell
Count
LUT
(utilization)
Registers
(utilization)
Frequency
(MHz)
N/A
2073 (19%)
1476 (N/A)
118
Technology
Altera Stratix
(EP1S10FC780)
Altera Cyclone II
(EP2C35FC672)
Xilinx Virtex4
(XCV4LX15SF363)
Actel Fusion
(AFS090)
TABLE II.
Topology
Standalone
Hybrid
N/A
1941 (N/A)
1476 (4%)
107
N/A
2017 (16%)
1479 (12%)
95
4338
(5%)
N/A
N/A
30
M+A
LUT
Registers
DCT
8 + 30
1118
1094
DFT
3 + 32
1073
887
DWT
1 + 14
764
800
Total
12 + 76
2955
2781
DCT-DFTDWT
10 + 38
2073
1476
17 + 50
30
47
Overall Savings (%)

1
V.
CONCLUSIONS
In this paper, we have presented an efficient hardware

implementation of a hybrid architecture to compute three 8point discrete transforms, the DCT, the DFT, and the DWT.
The matrix decomposition algorithm has been derived from
the sparse matrix representation based on the Jacket matrix.
The matrices have been further decomposed to identify the
symmetric components enabling the recurring sub-matrices to
be shared among the three transforms. The architecture has
been prototyped onto several FPGA platforms and the
synthesized results show that a savings of 47% in terms of
hardware resources can be achieved using the fast hybrid
algorithm. Future research is directed towards the VLSI
implementation of the presented hybrid architecture.
ACKNOWLEDGMENT
The authors would like to acknowledge the Natural
Science and Engineering Research Council of Canada
(NSERC) for its support to this research work.
REFERENCES
[1]
HARDWARE COMPARISONS: STANDALONE VS. HYBRID1

Technology
In order to demonstrate the savings in computational

resources of the hybrid architecture, we have implemented the
three transforms separately using similar hardware topology.
Table II summarizes the results (the LUT and the logic
resources are based on Altera Stratix EP1S10FC780 FPGA
technology). Here, we can see that by taking advantage of
matrix symmetry and hardware resources sharing, the
hardware cost of the hybrid architecture can be reduced by at
least 30%. The total number of multiplications and additions
are also reduced by a good margin. Such a hybrid processor
with three transforms on a single FPGA can be used in a wide
range of applications such as video coding, digital
communication, digital watermarking, signal denoising, etc.
M = Multiplications; these fixed coefficient multipliers have been implemented with adders and
shifters only which makes the entire architecture multiplication-free; A = Additions
K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms,

Advantages, and Applications, New York: Academic, 1990.
[2] E. O. Brigham, The Fast Fourier transform and its Applications,
Englewood Cliffs, N.J., Prentice Hall, 1988.
[3] G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley
Cambridge Pr, 1996.
[4] ISO/IEC, JPEG Committee Draft 15444-1, 2004.
[5] S. Jiang and X. Hao, Hybrid Fourier-Wavelet Image Denoising,
Electronics Letters, vol. 43, no. 20, pp. 1081-1082, 2007.
[6] M. Ezhilarasan and P. Thambidurai, A Hybrid Transformation
Technique for Video Coding, LNCS 4308, pp. 503-508, Springer,
2006.
[7] P. Paul and P. Girija, A Novel VLSI Architecture for Image
Compression, Proc. of the 8th IEEE International Symposium on
Multimedia, pp. 794-795, 2006.
[8] A. Haj, Combined DWT-DCT Digital Image Watermarking, Journal
of Computer Science, Science Publication, vol. 3, no. 9, pp. 740-746,
2007.
[9] Moon Ho Lee, On computing 2-D systolic algorithm for discrete
cosine transform, IEEE Trans. on Circuits and Systems, vol.37, no.10,
pp.1321-1323, 1990.
[10] Z. Chen and M. H. Lee On Fast Hybrid Source Coding Design, Proc.
of the 2007 International Symposium on Information Technology
Convergence, pp. 143-147, 2007.
2328

Efficient Hardware Implementation of Hybrid Cosine-Fourier-Wavelet Transforms On A Single FPGA

Uploaded by

Copyright:

Available Formats

Efficient Hardware Implementation of Hybrid Cosine-Fourier-Wavelet Transforms On A Single FPGA

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Efficient Hardware Implementation of Hybrid Cosine-Fourier-Wavelet Transforms On A Single FPGA

Uploaded by

Copyright:

Available Formats

Efficient Hardware Implementation of Hybrid CosineFourier-Wavelet Transforms on a Single FPGA

K. Wahid+, S. Shimu+, M. Islam+, D. Teng+, Moon Ho Lee*, and S-B. Ko*+

Department of Electrical and Computer Engineering

The paradigm of system on chip design techniques has

978-1-4244-3828-0/09/$25.00 2009 IEEE

prediction errors in video coding [6][7] and to enhance the

Figure 1. Hybrid Cosine-Fourier-Wavelet transform architecture

MATRIX DECOMPOSITION ALGORITHM

In order to decompose the forward transform basis

Matrix Decomposition of the DCT

Now we define a row permutation matrix, [Pr ]N given

and [Pr ]N = pri , j

i, j {0,1,..., N 1}. A reversible permutation matrix, [ Pc]N

[Pc]2 = [I ]2 and [Pc]N

Then, the general recursive form of the DCT can be

The proof of the algorithm can be found in [10]. The

, j = 1 .The N-point DFT matrix can be

C. Matrix Decomposition of the DWT

The algorithm is again modified for an 8-point sequence

By taking similar approach, the recursive form can be

Now we take advantage of the common sub-matrices

The architecture is organized as a linear multistage

to stage 2; F_5 means that the stage 5 module is only required

Figure 2. Hybrid Cosine-Fourier-Wavelet transform processor

Figure 3. Top module of the hybrid transform processor

(e) CF_4 / CFW_4

Figure 4. Hardware blocks of the hybrid processor (a j)

By asserting high to the start pin will start the processor.

The proposed algorithm is implemented in Verilog HDL

HARDWARE RESOURCES OF THE HYBRID ARCHITECTURE

Overall Savings (%)

In this paper, we have presented an efficient hardware

HARDWARE COMPARISONS: STANDALONE VS. HYBRID1

In order to demonstrate the savings in computational

K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms,

You might also like

K. Wahid+, S. Shimu+, M. Islam+, D. Teng+, Moon Ho Lee, and S-B. Ko+