Efficient Hardware Implementation of Hybrid Cosine-Fourier-Wavelet Transforms On A Single FPGA

Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

Efficient Hardware Implementation of Hybrid CosineFourier-Wavelet Transforms on a Single FPGA

K. Wahid+, S. Shimu+, M. Islam+, D. Teng+, Moon Ho Lee*, and S-B. Ko*+


+

Department of Electrical and Computer Engineering


University of Saskatchewan, Saskatoon, Saskatchewan, Canada, S7N5A9
*Institute of Information and Communication, Chonbuk National University, Jeonju 561-756, Korea
email: [email protected]
Abstract This paper presents an efficient hardware
implementation of a hybrid architecture to compute three 8point transforms the Discrete Cosine Transform, the Discrete
Fourier Transform, and the Discrete Wavelet Transform on a
single FPGA. The architecture is based on an element-wise
matrix factorization and row-permutation algorithm, where the
forward basis transformation matrices are decomposed into
multiple sub-matrices and the common units are shared among
them. The hardware implementation is parallel, pipelined and
multiplication-free; it costs only 2,073 logic cells, 1,476 registers
and runs at maximum frequency of 118 MHz with a very high
process throughput of 944 Megabits/sec when synthesized onto
an Altera FPGA device. The synthesized results for other FPGA
technologies are also presented for performance assessment.

I.

INTRODUCTION

The paradigm of system on chip design techniques has


brought an added dimension to multimedia-based applications
by allowing third party intellectual property (IP) blocks to be
used by digital circuit designers. This provides us with an
opportunity to introduce efficient arithmetic techniques into
next generation integrated circuits, where we can take
advantage of the arithmetic structure to implement faster and
area-efficient processors.
The Discrete Cosine Transform (DCT) is reportedly the
best suboptimal transformation for image and video
processing and being used in various coding standards [1].
The Discrete Fourier Transform (DFT) has been widely used
as the fundamental computation unit for signal processing and
communication applications [2]. Over the past decade, the
Discrete Wavelet Transform (DWT) based on the Haar matrix
[3] has become very popular for compression algorithms,
especially after the DWTs inclusion in the JPEG2000
standard [4]. Taking the respective strengths and weaknesses
of these transforms into account, researchers have developed
several new schemes, called hybrid architectures, where these
transforms can be combined to achieve better performance.
For example, the Fourier-Wavelet transform can be used to
improve the denoising performance for images [5]; a CosineWavelet hybrid structure can be used to minimize the

978-1-4244-3828-0/09/$25.00 2009 IEEE

prediction errors in video coding [6][7] and to enhance the


security in digital watermarking [8], etc.
Here in this paper, we present an efficient hardware
implementation of a hybrid architecture to compute the three
popular 8-point transforms the DCT, the DFT, and the DWT
on a single FPGA. Fig. 1 illustrates the presented hybrid
architecture of the DCT-DFT-DWT. The architecture is based
on an element-wise matrix factorization and row-permutation
algorithm, where the forward basis transformation matrices
are first decomposed into multiple sub-matrices; the common
structures of the sub-matrices are then identified and shared
among them. This results in significant hardware savings in
terms of less computational resources, as well as faster design.
The architecture has been prototyped onto several FPGA
technologies for performance assessment. The implementation
details, as well as synthesis results are discussed.

Figure 1. Hybrid Cosine-Fourier-Wavelet transform architecture

II.

MATRIX DECOMPOSITION ALGORITHM

In order to decompose the forward transform basis


matrices, we utilize sparse matrix factorization where, the
DCT, the DFT, and the DWT matrices can be sub-divided into
one orthogonal character matrix and a special sparse matrix,
named the Jacket matrix [9].
A.

2325

Matrix Decomposition of the DCT


A classical N-point DCT is defined as follows:

2
m(2n +1)
km cos
where,{m, n} = 0,1,..., N 1
N
2N

[CN ]m,n =

(1) [ F ] = ([ Pr ] )1 F~
N
N

= [ Pr ]N

j = 1,2,...., N 1

and k = 1,

Now we define a row permutation matrix, [Pr ]N given


below in (2):

[ Pr ]2 = [ I ]2

and [Pr ]N = pri , j

pri , j = 1, if i = 2 j , 0 j N / 2 1
pri , j = 1, if i = (2 j + 1) mod N , N / 2 j N 1, and
pr = 0, others
i, j

i, j {0,1,..., N 1}. A reversible permutation matrix, [ Pc]N


is defined below in (3):

[Pc]2 = [I ]2 and [Pc]N

0
IN / 4
0
0

0
0
,N 4
IN / 4

0
0
0
IN / 4

(3)

Then, the general recursive form of the DCT can be


simplified and is shown in (4) ( [ D ] = diag C4N ,..., C4N ):
N

N 1

The proof of the algorithm can be found in [10]. The


algorithm is further modified for an 8-point data sequence.
B. Matrix Decomposition of the DFT
The DFT of a given sequence x(m) , 0 m N 1 is
defined by (5):

X ( n ) = x ( m) W nm , 0 n N 1

(5)

m=0

Where W = e

[ ]

2
N

, j = 1 .The N-point DFT matrix can be

denoted by F N = W
N . Using similar approach defined
in the previous section, the general recursive form for the DFT
matrix can be represented by the following (6):
nm

I N /2
I N /2

W0
W2
W0

W0 W2
W0 W0
W W
2
0
W0 W0

W0 W2
W0

W0
W0

[ 0]
W0
W0
W0
W0

W0

W2

W0

W2

(7)

[ I4 ]

[0]

W
0
0
0

[I ] [I ]

0
4
4

0 W1 0
0
[ I 4 ] [ I 4 ]
[ 0]
0 0 W2 0

0 0 0 W3

C. Matrix Decomposition of the DWT


The discrete wavelet transform based on the 8-point Haar
matrix (HWT) is defined below in (8):
r3
3
r
r 2

0
r

0
0

K (i, 0) = (1)i + j . 2
Here, [ K ] =
K (i, j ) = (1)i + j .2 , when i j , j 0
ij

otherwise
0,

N 1

I 2 I N /2
0 I N /2
...
I 2 0 WN /2 I N /2

W0 W0

W0 W2
W0 W0

[ I 4 ][ 0] W0 W2

[Pr]8F
F

[ 0][ Pr ]4

[ 0]

0
I2
I
W2 N /4 I 2

0
[ I F2 ]
Pr2 N /2

The algorithm is again modified for an 8-point sequence


for efficient implementation, which is given below in (7):

I N /2
0
I2 0
(4)

... I N /4 0 K [ I N /2 C2 ]
0
K

N
N /2
2

0
I I
0 I
I 
I
I
1
[ I N /4 02 D I N /4 I2 I2 ... N0/2 D I N /2 IN /2 ( Pc
N )
N /2 N /2
N /2
2

2 2

[CN ] = Pr

0
I
... I 2
PrN /2 N /4 0

I N /2
0

(2)

,N 4

Where,

I N / 4
0
=
0

[ I N /4 02

j = 0, N

0.5 ,

(6)

r3

r3

r3

r3

r3

r3

r
0

r2
0

r
0
r 2
0

r
r2

r
r 2

r
r 2

r
0

0
r

0
0

0
0

r2
0

0
0

0
0

0
0

r
0

r
0

0
r

r3

r 3
(8)
0

r 2
1
where, r =
= C4
0
2

0
0

By taking similar approach, the recursive form can be


computed and is shown below in (9):
[ I4 ]

C4
[ I 4 ][ 0]
W1
C

[Pr]8
W1
4
[0] [ Pr ]4 [ 0]
1

[ 0]
C 4 C4
C 4 C4
1 0
0

C4

C4

(9)

[ I 4 ][ 0] [ I 4 ] [ I 4 ]
W2

[Pr]8
W2
[ 0][ Pr ]4 [ I 4 ] [ I 4 ]

Now we take advantage of the common sub-matrices


shown in (4), (7) and (9), and implement the architecture
which is discussed in the next section.

2326

III.

HARDWARE IMPLEMENTATION

The architecture is organized as a linear multistage


pipeline in order to achieve high throughput and based on 2s
complement arithmetic in order to naturally handle signed
numbers. The block diagram in Fig. 2 shows the processor
core. The data input is 8 bits and the output coefficient (both
real and imaginary) is 12 bits; a 2-bit input pin, t_select,
selects the transforms. The selection encoding is as follows:
{00: DCT; 01: DFT; 10: DWT; 11: DCT (by default)}. The
detailed flow graph inside the core is given in Fig. 3. All
modules are controlled by a controller, and the control signals
are denoted by arrows. The module description is as follows:
{C, F, W, #} stands for {Cosine, Fourier, Wavelet, Stage
number}; for example, CFW_2 means that this is a common
module required to compute all three transform and it belongs
CF_1

X0
X1
X2
X3
X4
X5
X6
X7

Data_in
[7:0]

DMUX
1x2

C
o
n
t
r
o
l

CF_2

CFW_2

to stage 2; F_5 means that the stage 5 module is only required


to compute the DFT.

Figure 2. Hybrid Cosine-Fourier-Wavelet transform processor

C_3

C
o
n
t
r
o
l

C
o
n
t
r
o
l

CFW_4

Buffer

t_select
[1:0]

CFW_6

B
u
f
f
e
r

F_6

B
u
f
f
e
r

CF_4

C
o
n
t
r
o
l

C_5

F_5

C
o
n
t
r
o
l

data_out_real
[11:0]

data_out_img
[11:0]

Controller

Figure 3. Top module of the hybrid transform processor

(a) DEMUX

(g) CF_2

(h) F_6

(i) CFW_6

(j) C_3

(b) CF_1

C1

C3

C5

C7

(c) C_5

(d) F_5

(e) CF_4 / CFW_4

Figure 4. Hardware blocks of the hybrid processor (a j)

(f) CFW_2

2327

By asserting high to the start pin will start the processor.


A high in the data_valid pin indicates the valid output
coefficients. The intermediate data that come out from the
blocks with a control arrow pin have been latched into register
banks. Thus from Fig. 3, we can see that a large number of
blocks can be shared among the transforms, which results in
smaller hardware, as well as lower power consumption.
The internal circuitry of each module has been shown in
Fig. 4. Internal counters are set to automatically control these
operations and maintain a parallel and congestion-free data
flow. The transform coefficients are denoted by small boxes
where, CK = cos k .
16
These multipliers are fixed in nature; hence, they have
been computed with shifts and additions only with a precision
of 8 bits as a way to minimize the hardware and optimize the
operations, which are completed in one clock cycle. Internally
to the multiplier, all significant bits are used to retain the
precision of the calculation but the multiplier outputs are
truncated to discard the fractional part.
IV.

HARDWARE SYNTHESIS

The proposed algorithm is implemented in Verilog HDL


and synthesized into several FPGA technologies so that the
cost can be assessed and compared. The results are
summarized in Table I. It can be seen that, the hardware cost
is low with a maximum operational frequency of 118 MHz
that results in a process throughput of 944 Megabits/sec.
TABLE I.

HARDWARE RESOURCES OF THE HYBRID ARCHITECTURE


Cell
Count

LUT
(utilization)

Registers
(utilization)

Frequency
(MHz)

N/A

2073 (19%)

1476 (N/A)

118

Technology
Altera Stratix
(EP1S10FC780)
Altera Cyclone II
(EP2C35FC672)
Xilinx Virtex4
(XCV4LX15SF363)
Actel Fusion
(AFS090)
TABLE II.
Topology

Standalone

Hybrid

N/A

1941 (N/A)

1476 (4%)

107

N/A

2017 (16%)

1479 (12%)

95

4338
(5%)

N/A

N/A

30

M+A

LUT

Registers

DCT

8 + 30

1118

1094

DFT

3 + 32

1073

887

DWT

1 + 14

764

800

Total

12 + 76

2955

2781

DCT-DFTDWT

10 + 38

2073

1476

17 + 50

30

47

Overall Savings (%)


1

V.

CONCLUSIONS

In this paper, we have presented an efficient hardware


implementation of a hybrid architecture to compute three 8point discrete transforms, the DCT, the DFT, and the DWT.
The matrix decomposition algorithm has been derived from
the sparse matrix representation based on the Jacket matrix.
The matrices have been further decomposed to identify the
symmetric components enabling the recurring sub-matrices to
be shared among the three transforms. The architecture has
been prototyped onto several FPGA platforms and the
synthesized results show that a savings of 47% in terms of
hardware resources can be achieved using the fast hybrid
algorithm. Future research is directed towards the VLSI
implementation of the presented hybrid architecture.
ACKNOWLEDGMENT
The authors would like to acknowledge the Natural
Science and Engineering Research Council of Canada
(NSERC) for its support to this research work.
REFERENCES
[1]

HARDWARE COMPARISONS: STANDALONE VS. HYBRID1


Technology

In order to demonstrate the savings in computational


resources of the hybrid architecture, we have implemented the
three transforms separately using similar hardware topology.
Table II summarizes the results (the LUT and the logic
resources are based on Altera Stratix EP1S10FC780 FPGA
technology). Here, we can see that by taking advantage of
matrix symmetry and hardware resources sharing, the
hardware cost of the hybrid architecture can be reduced by at
least 30%. The total number of multiplications and additions
are also reduced by a good margin. Such a hybrid processor
with three transforms on a single FPGA can be used in a wide
range of applications such as video coding, digital
communication, digital watermarking, signal denoising, etc.

M = Multiplications; these fixed coefficient multipliers have been implemented with adders and
shifters only which makes the entire architecture multiplication-free; A = Additions

K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms,


Advantages, and Applications, New York: Academic, 1990.
[2] E. O. Brigham, The Fast Fourier transform and its Applications,
Englewood Cliffs, N.J., Prentice Hall, 1988.
[3] G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley
Cambridge Pr, 1996.
[4] ISO/IEC, JPEG Committee Draft 15444-1, 2004.
[5] S. Jiang and X. Hao, Hybrid Fourier-Wavelet Image Denoising,
Electronics Letters, vol. 43, no. 20, pp. 1081-1082, 2007.
[6] M. Ezhilarasan and P. Thambidurai, A Hybrid Transformation
Technique for Video Coding, LNCS 4308, pp. 503-508, Springer,
2006.
[7] P. Paul and P. Girija, A Novel VLSI Architecture for Image
Compression, Proc. of the 8th IEEE International Symposium on
Multimedia, pp. 794-795, 2006.
[8] A. Haj, Combined DWT-DCT Digital Image Watermarking, Journal
of Computer Science, Science Publication, vol. 3, no. 9, pp. 740-746,
2007.
[9] Moon Ho Lee, On computing 2-D systolic algorithm for discrete
cosine transform, IEEE Trans. on Circuits and Systems, vol.37, no.10,
pp.1321-1323, 1990.
[10] Z. Chen and M. H. Lee On Fast Hybrid Source Coding Design, Proc.
of the 2007 International Symposium on Information Technology
Convergence, pp. 143-147, 2007.

2328

You might also like