E Cient Constant Coe Cient Multiplication Using Advanced FPGA Architectures

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

Efficient Constant Coefficient Multiplication

Using Advanced FPGA Architectures

Michael J. Wirthlin and Brian McMurtrey

Brigham Young University, Provo UT 84602, USA


[email protected], [email protected]

Abstract. Constant coefficient multiplication using look-up tables is


a popular form of multiplication in FPGAs. The ample look-up table
resources found within the FPGA match well to the architecture of a
look-up table based multiplier. While this form of multiplication maps
well to FPGAs, it isn’t particularly efficient. This paper presents an
efficient variant of this multiplier using the advanced features of the
Xilinx Virtex FPGA. Specifically, this approach combines the look-up
and add operations required by this multiplier architecture.

1 Introduction

Constant coefficient multiplication using look-up tables is a popular method of


performing multiplication within FPGAs. These multipliers perform multipli-
cation by a constant by determining partial-products using table look-up and
summing their results with a network of adders. Multiplication based on this
technique is frequently referred as KCM multiplication. The extensive array of
small look-up tables (LUT) within an FPGA make this implementation style
an attractive and efficient alternative for constant multiplication. Several FPGA
vendors and users have reported the use of these multipliers in FPGA designs
[1, 2].
Although there are several styles of performing constant coefficient multipli-
cation in FPGAs, this approach offers a number of specific advantages. First, the
structure of this multiplier is regular and maps nicely to the tile-based FPGA
architectures. Second, this multiplier is easy to configure for any constant of in-
terest. Since the structure of the multiplier does not depend on the constant of
interest, any arbitrary constant can be performed in the multiplier by modifying
the contents of the look-up tables.
While this style of multiplication is popular, recent results suggest that other
styles of constant coefficient multiplication may be more efficient than the KCM
multiplier[3, 4]. There are two disadvantages of this multiplier: first, this multi-
plier offers little or no constant specific optimizations. All constant multipliers of
similar bit-size are the same size and do not exploit constant-specific optimiza-
tions. Second, this multiplier requires separate resources for both the constant-
look up and the adder network. This separation of resources makes the multiplier
relatively larger than other constant multiplication approaches.

G. Brebner and R. Woods (Eds.): FPL 2001, LNCS 2147, pp. 555–564, 2001.

c Springer-Verlag Berlin Heidelberg 2001
556 Michael J. Wirthlin and Brian McMurtrey

This paper introduces an efficient look-up table based constant multiplier


that exploits advanced properties of the Xilinx Virtex FPGA[5]. Using advanced
features of this device, this KCM multiplier is able to merge both the look-up
and adder network of this operation. Such merging reduces the overall size of
the multiplier by an average of 33%. In addition, the Virtex FPGA simplifies the
process of modifying the constant at run-time by providing built-in shift registers
for the look-up tables. This paper will begin by reviewing the KCM constant
coefficient multiplier and tabulating its area for a variety of multiplier sizes.
Next, the inefficiencies of this multiplier on conventional FPGA architectures
will be described. The optimized KCM multiplier for the Virtex architecture will
be introduced and demonstrated with an example. The paper will conclude by
contrasting the optimized multiplier with the conventional approach of constant
coefficient multiplication.

2 Constant Coefficient Multiplication

Like any boolean function, a multiplication can be performed using a look-up


table or memory. However, multiplication is an extremely expensive operation
to perform with look-up tables since the size of memory needed for the multipli-
cation grows exponentially with the size of the multiplier operands. The size of
the memory can be reduced by performing a multiplication by a constant. The
multiplication of an arbitrary m-bit multiplicand (M ) by some c-bit constant
(C) can be performed with a 2m word memory. This memory is pre-loaded with
the product of the constant and every possible value of the multiplicand. The
result is obtained by addressing the memory with the m-bit multiplicand value.
A more efficient approach for multiplication by a constant (with respect to
memory size) is to use the look-up tables for computing the partial products of
the constant multiplication operation. The m-bit multiplicand is decomposed
into digits that are each 4 bits in size. Each 4-bit digit of the multiplicand is
used to address a partial-product look-up table 1 . This look-up table contains
the product of the constant multiplied by all possible 4-bit
 digit values (i.e. 16
entries). One look-up table is needed for each of the m 4 digits of the multipli-
cand. The final result is obtained by summing the results of all partial products
as follows:
m4 −1

M ×C = mi C · 24i , (1)
i=0

where mi represents bits 4 ∗ (i + 1) − 1 through 4i of the multiplicand.


An architecture that implements this method of constant multiplication re-
quires multiple look-up tables to compute each of the partial products and adder
circuits to sum their results. Figure 1 demonstrates this approach for multiplica-
tion between an 8-bit constant and a 12-bit multiplicand. This multiplier requires
three 4x12 look-up tables to perform the partial-product look-up and two 12-bit
1
4 bits are used since the common size of most FPGA look-up tables is 24 words deep.
Efficient Constant Coefficient Multiplication 557

adders. The contents of all three ROM blocks are the same, with location i of
each ROM containing the value i × C.

16x1 Look Up Table m3 - m0

Full-Adder
m7 - m 4

m 11 - m 8

Fig. 1. Architecture of a Constant 8x12 Bit Multiplier.

3 Implementing KCM Multiplier on Conventional FPGA


Architectures

Building the KCM multiplier on an FPGA requires two distinct pieces: a set of
look-up tables to determine the partial products and an adder network to sum
the results. Each look-up table performs a c × 4 bit multiplication resulting in a
c + 4 bit partial product. One 16 × 1 LUT is required for each bit of the partial-
product for a total of c + 4 LUTs for each ROM. Since one ROM is needed for
each
 m  four-bit digit of the multiplicand, the total LUT count for the ROMS is
4 × (c + 4).  
To sum the partial-product results, m 4 − 1 adders are necessary (one fewer
adder stage than the number of look-up tables). If an array of adders is used
to sum the results, each adder stage will produce a c + 5 bit result2 . Assuming
each full-adderrequires
  a look-up table and the carry logic is included within
the logic cell, m 4 − 1 × (c + 4) logic cells are required for the adder network.
The total logic cell count for this constant multiplier is as follows:
 m
#LogicCells = (c + 4) · 2 · −1 (2)
4

4 Merging Look-Up Tables with Adders

The use of separate resources for both the look-up tables and adders is not par-
ticularly efficient. Specifically, the use of look-up tables to perform the addition
wastes valuable look-up table resources. To perform a sum, only three inputs
of a four-input look-up table are used (A, B, and Carry In). Since one input
2
Alternative adder networks such as a tree network may be used instead.
558 Michael J. Wirthlin and Brian McMurtrey

is unused, half of the look-up table is wasted. One possible method for improv-
ing the efficiency of this multiplier is to use the wasted LUT resources to assist
with the table look-up operation. If part of the table look-up operation could be
combined with the summation, significant logic resources may be saved.
Consider the detailed arrangement of adders and look-up tables in Figure
1. The multiplier is composed of an initial look-up stage followed by a series
of look-up/adder pairs. If the look-up and adder pair can be combined into a
single row of logic blocks, significant FPGA resources can be saved. For example,
consider the combined look-up table, full-adder cell of Figure 2. This combined
cell includes both a look-up table to perform a single-bit partial-product as well
as a full-adder. The full-adder sums the result of the look-up with the sum of
the previous stage. Further, the full-adder produces a carry to support chaining
of cells. This cell is represented by the shaded cells in Figure 1.

sum in

LUT
LUT
address

full
carry out carry in
add

sum out

Fig. 2. Combined Look-Up/Adder Cell.

While the arrangement of Figure 2 appears to be straightforward, it does not


map well to most FPGA architectures. For most architectures, the high-speed
carry chain appears before the look-up table as shown in Figure 3-a. Because the
carry chain logic occurs before the table look-up, the logic cells cannot perform
a full-adder operation (including carry) using the results of the table look-up.
Instead, the carry logic has access only to the cell inputs and is limited to
performing a single full-add operation.

5 Virtex Architecture
Unlike most FPGAs, the Xilinx Virtex carry logic exits after the look-up table
as shown in Figure 3-b. This arrangement of the carry chain allows the combined
look-up and full-adder operation of Figure 2 to fit within a single logic cell of
the FPGA. The LUT within the logic cell performs a partial-product look-up
Efficient Constant Coefficient Multiplication 559

carry out

carry
chain carry out

A A
LUT B LUT B
sum C sum C
D D

carry in carry in

(a) Conventional Carry Chain (b) Virtex Carry Chain Archi-


Architecture tecture

Fig. 3. FPGA Carry Chain Architectures

operation and a summation (sum the results of the look-up with the previous
sum). By merging the partial-product look-up with the full-adder, the multiplier
can be built with fewer resources.

5.1 Combined Cell

The key feature of the Virtex CLB architecture that enables the combining of the
look-up with the add is the additional logic provided at the output of the LUT.
As shown in Figure 3-b, an exclusive OR (XOR) and multiplexer are available
to the designer after the results of the look-up table are computed3 . Both the
output XOR and Carry Chain multiplexer are used to combine the look-up and
full adder.
Although the look-up table within the Virtex logic cell provides 4 inputs (i.e.
a 24 × 1 ROM), only three of the four input wires are used for table look-up.
The fourth LUT input is used for the previous sum input as shown in Figure 4.
Internally, the LUT is programmed to add the three-bit table lookup with the
previous sum input. This half-add is performed by an exclusive OR operation
between the 3-bit look-up and previous sum as shown in the LUT of Figure 4.
This look-up and XOR operation are all performed within a single 4-input LUT.
The actual sum value is determined by the taking the LUT output (half-add)
and performing an exclusive OR with the carry-in. This XOR is performed in
the output XOR gate that is provided within all Virtex slices. Using this XOR
gate, no additional LUT resources are needed.
Computing the carry is slightly more complicated than the sum. As shown
in Figure 4, the carry out is driven by a multiplexer. The select line of the
3
This CLB architecture contains additional logic that is not used for this operation.
560 Michael J. Wirthlin and Brian McMurtrey

multiplexer is driven by the output of the LUT or, in this case, the result of the
half-add operation. The carry out is computed for two different cases based on
the value of the LUT output.
When the output of the LUT is 0, the multiplexer selects the previous sum
for the carry out (note that the previous sum must be mapped to the D input of
the LUT for this to occur). Under these conditions, the inputs to the half-add
(previous sum and the 3-input LUT) are the same value. A carry out will only
be generated when they are both 1 (See Figure 4-a). Since the carry out is driven
by the previous sum, the proper carry value will be generated.
When the output of the LUT is 1, the multiplexer selects the the carry in for
the carry out signal. Under this condition, the result of the XOR between the
previous sum and the 3-input LUT is 1. As shown in Figure 4-a, the carry out
under this condition is simply the carry in.

Virtex LUT
3-bit LUT
Carry Logic
Lut_Out
A0
A1
Out_Bit
P rev 3bit Carry LU T SU M Carry A2
Sum LU T In Out Out Out
Prev_Sum
0 0 0 0 0 0 0

0 0 1 0 1 0 1s Carry_Out
Carry_In
0 1 0 1 1 0
0 1 1 1 0 1
1 0 0 1 1 0 (b) The Virtex Cell with LUT and
1 0 1 1 0 1 Carry Logic
1 1 0 0 0 1
1 1 1 0 1 1

(a) Truth Table for a single bit of


the ROM/Adder stage

Fig. 4. One bit of the ROM/Adder stage

As described above, the additional logic at the output of the CLB allows a
look-up and add operation to be combined within a single logic cell of the CLB.
Although the look-up operation is only three bits, this combined cell uses 33%
less area than a separate four bit look-up table and separate sum operator.

5.2 Organization of Combined Cells

The combined ROM/Adder cell of Figure 4 are tiled together to form a multi-bit
ROM/Adder stage. Each stage requires c + 3 cells to perform a multiplication
between three bits of the multiplicand and the c bit constant.
Efficient Constant Coefficient Multiplication 561

To form a complete multiplier, this approach requires an initial 4-bit ROM


stage followed by as many combined 3-bit ROM/adder stages as necessary. An
initial ROM stage is needed to provide the “previous sum” for the following
combined ROM/Adder pair (there is no previous sum for the first stage). Since
the first stage is not a combined stage and does not take the previous sum input,
a full 4-bit LUT can be used. Figure 5 shows the stages of this modified 16 × 16
KCM.

Input
16
20 bit ROM
4
16 4
19 bit ROM/Adder
3
16 3
19 bit ROM/Adder
3
16 3
19 bit ROM/Adder
3
16 3
19 bit ROM/Adder
3
19 32
Output

Fig. 5. 16x16 KCM with Combined ROM/Adder Stages

A module generator has been designed to create this Virtex KCM for any
arbitrary constant. Written in JHDL [6], this module generator computes the
contents of each LUT ROM, insures proper mapping of FPGA logic within the
FPGA slice and performs relative placement of the design. Placement decisions
are made to optimize its area and timing. An example layout of this multiplier
is shown in Figure 6.

5.3 Reprogramming the Multiplier


Most constant coefficient multipliers perform multiplication by optimizing the
structure of the logic. While such approaches are efficient, they are difficult to
modify at run-time. Modification of the constant involves modification of the
circuit structure. Although recent approaches suggest that the structure of the
multiplier can be changed at run-time using novel architectural techniques and
bit-stream manipulation, modification of the constant is still slow and cumber-
some [4].
Unlike these constant multipliers, the structure of the KCM does not need
to be modified to support additional constants. KCM multipliers can support
arbitrary constants by modifying the contents of its look-up tables. In conven-
tional KCMs, support for constant modification occurs by replacing the ROM
look-up table with a RAM. The use of a RAM allows the designer to update
the constant by writing new values into the RAM. Although this requires some
562 Michael J. Wirthlin and Brian McMurtrey

10-bit
9-bit 9-bit
ROM
9-bit ROM/Adr 9-bit ROM/Adr
ROM/Adr ROM/Adr KEY

CLB with 2 LUTs

CLB with 1 LUT

Number of LUTs = 46

Number of CLBs = 25

Fig. 6. The Layout of the Modified KCM on an Virtex FPGA

additional logic for RAM addressing and data formatting, updating the constant
is relatively straightforward and fast.
The Virtex FPGA contains an additional architectural feature to ease the
process of updating the constant. In the Virtex FPGA, the 16 × 1 LUT can be
organized as a 16-bit shift register. Acting as a shift register, the contents of the
LUT can be loaded serially with little external logic. Further, the individual look-
up tables can be cascaded serially to form a single, long shift register containing
the contents of all LUTs in a stage of the multiplier. While taking longer to
update the constant, this approach allows the multiplier to be updated with
little external logic and only two additional inputs (serial data in and serial data
enable). The enhanced FPGA features allow the constant to be modified quickly,
easily and with little external logic.

6 Comparison
By combining the ROM with the full-adder, this combined KCM multiplier is
smaller than the conventional look-up table approach for constant coefficient
multiplication. As described above, this modified multiplier requires an initial
ROM stage followed by a series of combined ROM/Full-adder stages. The first
ROM stage of the multiplier consumes c + 4 logic cells(the result of a 4 × c bit
multiply). This pure ROM stage is followed by m−4 3 − 1 ROM/adder stages
with each stage taking c + 3 logic cells (each stage performs a 3 × c bit multiply).
The total size of this multiplier is:


n−4
#LUTs = c + 4 + · (c + 3) (3)
3
Efficient Constant Coefficient Multiplication 563

(m − 1)c
≈ +m (4)
3
The area of this multiplier is plotted in Figure 7 for various sizes of the
constant and along with that of the conventional multiplier approach. Clearly,
the merged multiplier is smaller than the conventional approach. On average,
merging the look-up tables and the full-adder reduces the size of the multiplier by
33%. The actual reduction in area of a specific multiplier will vary based on the
size of the multiplier. Note in the graph the distinct steps of both multipliers.
For the conventional multiplier, the addition of every fourth bit requires an
additional look-up and adder stage. For the merged multiplier, the addition of
every third bit requires an additional look-up/adder stage.

900
800
700
600
A re a (L U T s)

Con v en tio nal M u ltp lier


500
M erg ed ROM /A d der
400
300
200
100
0
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Cons tant and O perand S ize (bits )

Fig. 7. Size of Conventional and Merged KCM Multiplier.

7 Conclusion
The efficiency of the KCM constant multiplier can be improved by merging the
look-up table operation with the summation operation. This merging of look-
up table and adder has been mapped to the Xilinx Virtex FPGA. Using the
additional logic resources found within the Virtex slice, the look-up and add can
be combined for a 33% reduction in logic resources. A module generator has been
written to synthesize arbitrary constant multipliers on the Virtex architecture.
Additional optimizations can be made to further reduce the size of the KCM
multiplier. Currently, the KCM multiplier is a fixed area for all constants and
multiplicands of a specific size. Further improvements to the area can be made
by taking advantage of constant-specific optimizations. While such multipliers
cannot be reprogrammed to support other constants, they are extremely efficient
and a viable alternative for constant multiplication.
564 Michael J. Wirthlin and Brian McMurtrey

References
[1] Ken Chapman, “Constant coefficient multipliers for the XC4000E,” Tech. Rep.
XAPP 054, Xilinx Corporation, December 11 1996, Version 1.1.
[2] Kenneth David Chapman, “Fast integer multipliers fit in FPGA’s,” EDN, pp.
79–80, May 12 1994.
[3] Florent de Dinchin and Vincent Lefèvre, “Constant multipliers for FPGAs,” in
Proceedings of the International Conference on Parallel and Distributed Processing
Techniques and Applications, H.R. Arabnia, Ed. June 2000, vol. I, pp. 167–173,
CSREA Press.
[4] Tim Courtney, Richard Turner, and Roger Woods, “Multiplexer based reconfig-
uration for Virtex multipliers,” in Field-Programmable Logic and Applications.
Proceedings of the 9th International Workshop, FPL 2000, 2000, pp. 749–758.
[5] Xilinx Corporation, Virtex-II 1.5V Field-Programmable Gate Arrays, January
2001, DS031-2 (v1.3).
[6] B. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, and M. Rytting, “A
cad suite for high-performance fpga design,” in Proceedings of the IEEE Workshop
on FPGAs for Custom Computing Machines, K. L. Pocek and J. M. Arnold, Eds.,
Napa, CA, April 1999, IEEE Computer Society, pp. 12–24, IEEE.

You might also like