E Cient Constant Coe Cient Multiplication Using Advanced FPGA Architectures
E Cient Constant Coe Cient Multiplication Using Advanced FPGA Architectures
E Cient Constant Coe Cient Multiplication Using Advanced FPGA Architectures
1 Introduction
G. Brebner and R. Woods (Eds.): FPL 2001, LNCS 2147, pp. 555–564, 2001.
c Springer-Verlag Berlin Heidelberg 2001
556 Michael J. Wirthlin and Brian McMurtrey
adders. The contents of all three ROM blocks are the same, with location i of
each ROM containing the value i × C.
Full-Adder
m7 - m 4
m 11 - m 8
Building the KCM multiplier on an FPGA requires two distinct pieces: a set of
look-up tables to determine the partial products and an adder network to sum
the results. Each look-up table performs a c × 4 bit multiplication resulting in a
c + 4 bit partial product. One 16 × 1 LUT is required for each bit of the partial-
product for a total of c + 4 LUTs for each ROM. Since one ROM is needed for
each
m four-bit digit of the multiplicand, the total LUT count for the ROMS is
4 × (c + 4).
To sum the partial-product results, m 4 − 1 adders are necessary (one fewer
adder stage than the number of look-up tables). If an array of adders is used
to sum the results, each adder stage will produce a c + 5 bit result2 . Assuming
each full-adderrequires
a look-up table and the carry logic is included within
the logic cell, m 4 − 1 × (c + 4) logic cells are required for the adder network.
The total logic cell count for this constant multiplier is as follows:
m
#LogicCells = (c + 4) · 2 · −1 (2)
4
The use of separate resources for both the look-up tables and adders is not par-
ticularly efficient. Specifically, the use of look-up tables to perform the addition
wastes valuable look-up table resources. To perform a sum, only three inputs
of a four-input look-up table are used (A, B, and Carry In). Since one input
2
Alternative adder networks such as a tree network may be used instead.
558 Michael J. Wirthlin and Brian McMurtrey
is unused, half of the look-up table is wasted. One possible method for improv-
ing the efficiency of this multiplier is to use the wasted LUT resources to assist
with the table look-up operation. If part of the table look-up operation could be
combined with the summation, significant logic resources may be saved.
Consider the detailed arrangement of adders and look-up tables in Figure
1. The multiplier is composed of an initial look-up stage followed by a series
of look-up/adder pairs. If the look-up and adder pair can be combined into a
single row of logic blocks, significant FPGA resources can be saved. For example,
consider the combined look-up table, full-adder cell of Figure 2. This combined
cell includes both a look-up table to perform a single-bit partial-product as well
as a full-adder. The full-adder sums the result of the look-up with the sum of
the previous stage. Further, the full-adder produces a carry to support chaining
of cells. This cell is represented by the shaded cells in Figure 1.
sum in
LUT
LUT
address
full
carry out carry in
add
sum out
5 Virtex Architecture
Unlike most FPGAs, the Xilinx Virtex carry logic exits after the look-up table
as shown in Figure 3-b. This arrangement of the carry chain allows the combined
look-up and full-adder operation of Figure 2 to fit within a single logic cell of
the FPGA. The LUT within the logic cell performs a partial-product look-up
Efficient Constant Coefficient Multiplication 559
carry out
carry
chain carry out
A A
LUT B LUT B
sum C sum C
D D
carry in carry in
operation and a summation (sum the results of the look-up with the previous
sum). By merging the partial-product look-up with the full-adder, the multiplier
can be built with fewer resources.
The key feature of the Virtex CLB architecture that enables the combining of the
look-up with the add is the additional logic provided at the output of the LUT.
As shown in Figure 3-b, an exclusive OR (XOR) and multiplexer are available
to the designer after the results of the look-up table are computed3 . Both the
output XOR and Carry Chain multiplexer are used to combine the look-up and
full adder.
Although the look-up table within the Virtex logic cell provides 4 inputs (i.e.
a 24 × 1 ROM), only three of the four input wires are used for table look-up.
The fourth LUT input is used for the previous sum input as shown in Figure 4.
Internally, the LUT is programmed to add the three-bit table lookup with the
previous sum input. This half-add is performed by an exclusive OR operation
between the 3-bit look-up and previous sum as shown in the LUT of Figure 4.
This look-up and XOR operation are all performed within a single 4-input LUT.
The actual sum value is determined by the taking the LUT output (half-add)
and performing an exclusive OR with the carry-in. This XOR is performed in
the output XOR gate that is provided within all Virtex slices. Using this XOR
gate, no additional LUT resources are needed.
Computing the carry is slightly more complicated than the sum. As shown
in Figure 4, the carry out is driven by a multiplexer. The select line of the
3
This CLB architecture contains additional logic that is not used for this operation.
560 Michael J. Wirthlin and Brian McMurtrey
multiplexer is driven by the output of the LUT or, in this case, the result of the
half-add operation. The carry out is computed for two different cases based on
the value of the LUT output.
When the output of the LUT is 0, the multiplexer selects the previous sum
for the carry out (note that the previous sum must be mapped to the D input of
the LUT for this to occur). Under these conditions, the inputs to the half-add
(previous sum and the 3-input LUT) are the same value. A carry out will only
be generated when they are both 1 (See Figure 4-a). Since the carry out is driven
by the previous sum, the proper carry value will be generated.
When the output of the LUT is 1, the multiplexer selects the the carry in for
the carry out signal. Under this condition, the result of the XOR between the
previous sum and the 3-input LUT is 1. As shown in Figure 4-a, the carry out
under this condition is simply the carry in.
Virtex LUT
3-bit LUT
Carry Logic
Lut_Out
A0
A1
Out_Bit
P rev 3bit Carry LU T SU M Carry A2
Sum LU T In Out Out Out
Prev_Sum
0 0 0 0 0 0 0
0 0 1 0 1 0 1s Carry_Out
Carry_In
0 1 0 1 1 0
0 1 1 1 0 1
1 0 0 1 1 0 (b) The Virtex Cell with LUT and
1 0 1 1 0 1 Carry Logic
1 1 0 0 0 1
1 1 1 0 1 1
As described above, the additional logic at the output of the CLB allows a
look-up and add operation to be combined within a single logic cell of the CLB.
Although the look-up operation is only three bits, this combined cell uses 33%
less area than a separate four bit look-up table and separate sum operator.
The combined ROM/Adder cell of Figure 4 are tiled together to form a multi-bit
ROM/Adder stage. Each stage requires c + 3 cells to perform a multiplication
between three bits of the multiplicand and the c bit constant.
Efficient Constant Coefficient Multiplication 561
Input
16
20 bit ROM
4
16 4
19 bit ROM/Adder
3
16 3
19 bit ROM/Adder
3
16 3
19 bit ROM/Adder
3
16 3
19 bit ROM/Adder
3
19 32
Output
A module generator has been designed to create this Virtex KCM for any
arbitrary constant. Written in JHDL [6], this module generator computes the
contents of each LUT ROM, insures proper mapping of FPGA logic within the
FPGA slice and performs relative placement of the design. Placement decisions
are made to optimize its area and timing. An example layout of this multiplier
is shown in Figure 6.
10-bit
9-bit 9-bit
ROM
9-bit ROM/Adr 9-bit ROM/Adr
ROM/Adr ROM/Adr KEY
Number of LUTs = 46
Number of CLBs = 25
additional logic for RAM addressing and data formatting, updating the constant
is relatively straightforward and fast.
The Virtex FPGA contains an additional architectural feature to ease the
process of updating the constant. In the Virtex FPGA, the 16 × 1 LUT can be
organized as a 16-bit shift register. Acting as a shift register, the contents of the
LUT can be loaded serially with little external logic. Further, the individual look-
up tables can be cascaded serially to form a single, long shift register containing
the contents of all LUTs in a stage of the multiplier. While taking longer to
update the constant, this approach allows the multiplier to be updated with
little external logic and only two additional inputs (serial data in and serial data
enable). The enhanced FPGA features allow the constant to be modified quickly,
easily and with little external logic.
6 Comparison
By combining the ROM with the full-adder, this combined KCM multiplier is
smaller than the conventional look-up table approach for constant coefficient
multiplication. As described above, this modified multiplier requires an initial
ROM stage followed by a series of combined ROM/Full-adder stages. The first
ROM stage of the multiplier consumes c + 4 logic cells(the result of a 4 × c bit
multiply). This pure ROM stage is followed by m−4 3 − 1 ROM/adder stages
with each stage taking c + 3 logic cells (each stage performs a 3 × c bit multiply).
The total size of this multiplier is:
n−4
#LUTs = c + 4 + · (c + 3) (3)
3
Efficient Constant Coefficient Multiplication 563
(m − 1)c
≈ +m (4)
3
The area of this multiplier is plotted in Figure 7 for various sizes of the
constant and along with that of the conventional multiplier approach. Clearly,
the merged multiplier is smaller than the conventional approach. On average,
merging the look-up tables and the full-adder reduces the size of the multiplier by
33%. The actual reduction in area of a specific multiplier will vary based on the
size of the multiplier. Note in the graph the distinct steps of both multipliers.
For the conventional multiplier, the addition of every fourth bit requires an
additional look-up and adder stage. For the merged multiplier, the addition of
every third bit requires an additional look-up/adder stage.
900
800
700
600
A re a (L U T s)
7 Conclusion
The efficiency of the KCM constant multiplier can be improved by merging the
look-up table operation with the summation operation. This merging of look-
up table and adder has been mapped to the Xilinx Virtex FPGA. Using the
additional logic resources found within the Virtex slice, the look-up and add can
be combined for a 33% reduction in logic resources. A module generator has been
written to synthesize arbitrary constant multipliers on the Virtex architecture.
Additional optimizations can be made to further reduce the size of the KCM
multiplier. Currently, the KCM multiplier is a fixed area for all constants and
multiplicands of a specific size. Further improvements to the area can be made
by taking advantage of constant-specific optimizations. While such multipliers
cannot be reprogrammed to support other constants, they are extremely efficient
and a viable alternative for constant multiplication.
564 Michael J. Wirthlin and Brian McMurtrey
References
[1] Ken Chapman, “Constant coefficient multipliers for the XC4000E,” Tech. Rep.
XAPP 054, Xilinx Corporation, December 11 1996, Version 1.1.
[2] Kenneth David Chapman, “Fast integer multipliers fit in FPGA’s,” EDN, pp.
79–80, May 12 1994.
[3] Florent de Dinchin and Vincent Lefèvre, “Constant multipliers for FPGAs,” in
Proceedings of the International Conference on Parallel and Distributed Processing
Techniques and Applications, H.R. Arabnia, Ed. June 2000, vol. I, pp. 167–173,
CSREA Press.
[4] Tim Courtney, Richard Turner, and Roger Woods, “Multiplexer based reconfig-
uration for Virtex multipliers,” in Field-Programmable Logic and Applications.
Proceedings of the 9th International Workshop, FPL 2000, 2000, pp. 749–758.
[5] Xilinx Corporation, Virtex-II 1.5V Field-Programmable Gate Arrays, January
2001, DS031-2 (v1.3).
[6] B. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, and M. Rytting, “A
cad suite for high-performance fpga design,” in Proceedings of the IEEE Workshop
on FPGAs for Custom Computing Machines, K. L. Pocek and J. M. Arnold, Eds.,
Napa, CA, April 1999, IEEE Computer Society, pp. 12–24, IEEE.