IMPLEMENTATION
In the first part of this chapter, we present the different possibilities for the use of reconfigurable devices in a system. According to the way those devices are used, the target applications and the systems in which they are integrated, different terminologies will be defined for the systems. We follow up by presenting the design flow, i.e. the steps required to implement an application on those devices. Because the programming of coarse-grained reconfigurable devices is very similar to that of processors, we will not focus on the variety of tools that exist for this purpose. The implementation on FPGA devices is rather unusual in two points. First, the programming is not aimed at generating a set of instructions to be executed sequentially on a given processor. We seek the generation of the hardware components that will be mapped at different times on the available resources. According to the application, the resources needed for the computation of the application will be built as components to be downloaded to the device at run-time. The generation of such components is called logic synthesis. It is an optimization process whose goal is to minimize some cost functions aimed at producing, for instance, the fastest hardware with the smallest amount of resources and the smallest power consumption. The mapping of the application to the FPGA resources is a step of the logic synthesis called technology mapping. The second unusual point with FPGAs is that the technology mapping targets look-up tables rather than NAND gates, as is the case with many digital devices. In the last part of the chapter, we will therefore shortly present the steps required in logic synthesis and focus in more detail on technology mapping for FPGAs.
1. Integration
Reconfigurable devices are usually used in three different ways:
1 Today's leading-edge reconfigurable devices contain millions of gates running at a few hundred MHz.
the device and communication between newly placed modules. The management of the reconfigurable device is usually done by a scheduler and a placer that can be implemented as part of an operating system running on a processor (figure 3.1). The processor can either reside inside or outside the reconfigurable chip.
The scheduler manages the tasks and decides when a task should be executed. The tasks, which are available as configuration data in a database, are characterized through their bounding box and their run-time. The bounding box defines the area that a task occupies on the device. The management of task execution at run-time is therefore a temporal placement problem that will be studied in detail in chapter 5. The scheduler determines which task should be executed on the RPU and then gives the task to the placer, which will try to place it on the device, i.e. allocate a set of resources for the implementation of that task. If the placer is not able to find a site for the new task, then the task is sent back to the scheduler, which can then decide to send it later and to send another task to the placer. In this case, we say that the task is rejected.
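The interplay just described can be sketched in a few lines of Python. The task representation and the placer interface below are illustrative assumptions derived from the description above, not an actual operating system API; the sketch merely shows the reject-and-retry cycle between scheduler and placer.

from collections import deque

class Placer:
    """Very naive placer: keeps a list of free rectangles on the device."""
    def __init__(self, dev_width, dev_height):
        self.free = [(0, 0, dev_width, dev_height)]
    def find_site(self, w, h):
        for i, (x, y, fw, fh) in enumerate(self.free):
            if w <= fw and h <= fh:
                del self.free[i]          # crude: consume the whole rectangle
                return (x, y)
        return None                       # no site found: task is rejected
    def configure(self, site, bitstream):
        pass                              # download configuration data here

def schedule(tasks, placer, max_retries=3):
    """tasks: list of (name, width, height, bitstream) from the database."""
    queue = deque((t, 0) for t in tasks)
    while queue:
        (name, w, h, bits), tries = queue.popleft()
        site = placer.find_site(w, h)
        if site is not None:
            placer.configure(site, bits)
        elif tries < max_retries:
            queue.append(((name, w, h, bits), tries + 1))  # retry later
        # otherwise the task is dropped after repeated rejections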
bus that is used for the data transfer between the processor and the reconfigurable device. In supercomputing, systems are built upon a high-speed processor from Intel or AMD and an FPGA board attached to a bus like the PCI. Several systems (Cray XD1, SGI, Nallatech, SRC Computers) in which FPGAs cohabit and communicate with processors through a dedicated high-speed bus are commercially available. In embedded systems, the processors are more and more integrated in the reconfigurable devices and are heavily used for management purposes rather than for computation. The RPU acts like a coprocessor with varying instruction sets, accessible by the processor in a function call. The computation flow can be summarized as shown in algorithm 2.
2 It is already possible to trigger the reconfiguration of the Xilinx FPGAs from within the device using their ICAP-port. This might allow a self-reconfiguration of a system from a processor running within the device.
3 The data might also be collected by the RPU itself from an external source.
can also send the data directly to an external sink. In the computation flow presented above, the RPU is configured only once. However, in frequently reconfigured systems, several configurations might be done. If the RPU has to be configured more than once, then the body of the while loop must be run again according to the number of reconfigurations to be done.
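A possible host-side realization of this flow is sketched below in Python. The driver functions (configure_rpu, send_data, start_computation, collect_results) are hypothetical placeholders for whatever API a concrete system provides; the structure simply mirrors the flow just described, with one loop iteration per configuration.

def run_on_rpu(configurations, driver):
    """One loop iteration per (re)configuration of the RPU."""
    results = []
    for bitstream, data_blocks in configurations:
        driver.configure_rpu(bitstream)        # download configuration data
        for block in data_blocks:
            driver.send_data(block)            # processor provides the data
            driver.start_computation()         # RPU computes on the data
            results.append(driver.collect_results())
    return results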
3. Logic Synthesis
A function, assigned to the hardware in the hardware/software co-design
process, can be described as a digital structured system. As shown in Figure
3.4, such a digital structured system consists of a set of combinatorial logic
modules (the nodes), memory (the registers), inputs and outputs.
The inputs provide data to the system, while the outputs carry data out of the system. Computation is performed in the combinatorial parts, and the results might be temporarily stored in registers that are placed between the combinatorial blocks. A clock is used to synchronize the transfer of data from register to register via combinatorial parts. The description of a design at this level is usually called a register transfer description, due to the register to register operation mode previously described.
For such a digital system, the goal of the logic synthesis is to produce an optimal implementation of the system on a given hardware platform. In the case of FPGAs, the goal is the generation of configuration data that satisfies a set of given constraints like the maximal speed, the minimum area, the minimum power consumption, etc. In a structured system, each combinatorial block is a node that can be represented as a two-level function or as a multi-level function. Depending on the node representation, two synthesis approaches exist: two-level logic synthesis and multi-level logic synthesis.
Most of the circuits used in practice are implemented using multi-level logic.
Multi-level circuits are smaller, faster in most cases and consume less power
than two-level circuits. Two-level logic is most appropriate for PAL and PLA
implementations while multi-level is used for standard cell, mask-
programmable or field-programmable devices.
We formally represent a node of the structured system as a Boolean network, i.e. a network of Boolean operators that reflects the structure and function of the nodes. A Boolean network is defined as a directed acyclic graph (DAG) in which a node represents an arbitrary Boolean function and an edge (i, j) represents the data dependency between the two nodes i and j of the network.
In the first case, the final implementation matches the node representation. In the second case, the representation is technology independent, i.e. the design is not tied to any library. A final mapping must be done to the target library in order to have an implementation. The technology independent method is the most widely used, due to the large set of available optimization methods. With a technology independent representation, synthesis for FPGA devices is done in two steps. In the first step, all the Boolean equations are minimized, independent of the function generators used. In the second step, the technology mapping process maps the parts of the Boolean network to a set of LUTs.
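Such a Boolean network can be captured directly as a DAG. The following Python sketch is a generic illustration of this data structure, not part of any particular tool; storing only predecessor and successor lists is an assumption that suffices for the mapping discussion that follows.

class BooleanNetwork:
    """Boolean network as a DAG: per node, its fan-in and fan-out lists."""
    def __init__(self):
        self.preds = {}   # node -> list of fan-in nodes (empty for PIs)
        self.succs = {}   # node -> list of fan-out nodes (empty for POs)
    def add_node(self, name, fanins=()):
        self.preds[name] = list(fanins)
        self.succs.setdefault(name, [])
        for f in fanins:
            self.succs.setdefault(f, []).append(name)

# Example: f = (a AND b) OR c, described by two internal nodes
net = BooleanNetwork()
for pi in ("a", "b", "c"):
    net.add_node(pi)            # primary inputs
net.add_node("g", ("a", "b"))   # g = a AND b
net.add_node("f", ("g", "c"))   # f = g OR c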
In general, the following choices are made for the representation of a node:
Sum of products form: A sum of products (SOP) is the most trivial form to represent a Boolean function. It consists of a sum of product terms, and it is well adapted for two-level logic implementation on PALs and PLAs. Example: f = xyz + x̄yz + wxy.
This representation has the advantage that it is well understood and easy to manipulate. Many optimization algorithms are available (AND, OR, tautology, two-level minimizers). The main disadvantage is that it does not reflect the logic complexity of the design. In fact, the complexity of a design represented as a sum of products may decrease through manipulation, so estimating the progress of logic minimization on SOPs is difficult.
Factored form: A factored form is defined recursively as either a single literal or a product or a sum of two factored forms: a product is either a single literal or the product of two factored forms, and a sum is either a single literal or the sum of two factored forms. c(a + b(d + e)) is a product of the factored forms c and a + b(d + e), and a + b(d + e) is a sum of the factored forms a and b(d + e).
Factored forms are representative of the logic complexity. In many design styles, the implementation of a function corresponds to its factored form. Therefore, factored forms are good estimates of the complexity of the logic implementation. Their main disadvantage is the lack of manipulation algorithms. They are usually converted into SOPs before manipulation.
Binary decision diagram (BDD): A binary decision diagram (BDD) is a rooted directed acyclic graph used to represent a Boolean function. Two kinds of nodes exist in BDDs: variable nodes and constant nodes.
4 An index i defines a variable xi.
5 Two BDDs G1 and G2 are isomorphic ⇐⇒ there exists a bijective function σ from G1 to G2 such that: 1) for a terminal node v ∈ G1, σ(v) = w is a terminal node in G2 with value(v) = value(w); 2) for a non-terminal node v ∈ G1, σ(v) = w is a non-terminal node of G2 with index(v) = index(w), σ(low(v)) = low(w) and σ(high(v)) = high(w).
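The isomorphism condition of footnote 5 can be written down almost verbatim. The Python sketch below assumes a minimal BDD node type, with index, low and high for variable nodes and value for constant (terminal) nodes; it illustrates the definition only and is not a complete BDD package.

class BDDNode:
    def __init__(self, index=None, low=None, high=None, value=None):
        self.index = index   # variable index (None for terminal nodes)
        self.low = low       # successor for variable value 0
        self.high = high     # successor for variable value 1
        self.value = value   # 0 or 1 on terminal nodes

def isomorphic(v, w):
    """Recursive check of the two conditions of footnote 5, from the roots."""
    if v.index is None or w.index is None:   # at least one terminal node
        return (v.index is None and w.index is None
                and v.value == w.value)
    return (v.index == w.index
            and isomorphic(v.low, w.low)
            and isomorphic(v.high, w.high))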
Definition 3.2 (Primary Input, Primary Output, Node Level, Node Depth, Fan-in, Fan-out) Given a Boolean network G, we define the following:
1 A primary input (PI) node is a node without any predecessor.
2 A primary output (PO) node is a node without any successor.
3 The level l(v) of a node v is the length of the longest path from the primary inputs to v.
4 The depth of a network G is the largest level of a node in G.
5 The fan-in of a node v is the set of gates whose outputs are inputs of v.
6 The fan-out of v is the set of gates that use the output of v as input.
7 Given a node v ∈ G, input(v) is defined as the set of nodes of G which are fan-in of v, i.e. the set of predecessors of v.
With input(H), we denote the set of all nodes not included in H which are predecessors of some nodes in H.
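These notions translate directly into code. Reusing the BooleanNetwork sketch from above, the following Python fragment computes levels, depth, primary inputs and primary outputs exactly as in Definition 3.2; like the definition, it assumes the network is acyclic.

def levels(net):
    """Level of every node: length of the longest path from the PIs."""
    memo = {}
    def level(v):
        if v not in memo:
            preds = net.preds[v]
            memo[v] = 0 if not preds else 1 + max(level(u) for u in preds)
        return memo[v]
    return {v: level(v) for v in net.preds}

def depth(net):
    """Depth of the network: the largest node level."""
    return max(levels(net).values())

# PIs have no predecessors, POs no successors (points 1 and 2):
pis = [v for v, p in net.preds.items() if not p]
pos = [v for v, s in net.succs.items() if not s]
print(levels(net), depth(net), pis, pos)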
Figure 3.8. Example of a graph covering with K-feasible cones and the corresponding covering with LUTs
Mapping the trees. The strategy used by Chortle to map a tree is a combination of bin packing and dynamic programming. Each tree is traversed from the primary inputs to the primary outputs. At each node v, a circuit referred to as the best circuit, implementing the cone at v extending from the node to the primary inputs of the network, is constructed. The best circuit is characterized by two main factors: the tree rooted at v and represented by a cone must contain the minimum number of LUTs, and the output LUT (the root LUT) implementing v should contain the maximum number of unused input pins. For a primary input p, the best circuit is a single LUT whose function is a buffer. Using dynamic programming, the best circuit at a node v can be constructed from its fan-in nodes, because each of them is already optimally implemented. The procedure enforces the use of the minimum number of LUTs at a given node. The best circuit is then constructed from the minimum number of LUTs used to implement its fan-in nodes. The secondary goal is to minimize the number of unused inputs of the circuit rooted at node v.
The construction of the tree is done in two steps. First, a two-level decomposition of the cone at v is constructed, and then this decomposition is converted into a multi-level decomposition.
Multi-level Decomposition. In the second step, the first-level nodes are implemented using a tree of LUTs. The number of LUTs used is minimized by using second-level LUTs that have unused pins to implement a portion of the first-level tree, as shown in figure 3.10. The detailed procedure for converting a two-level decomposition into a multi-level decomposition is given in algorithm 4, and figure 3.10 provides an illustration of this process.
The fact that the most filled unconnected LUTs are always selected pushes less filled LUTs towards the root of the tree being built. Therefore, the LUTs with the most unused inputs will be found near the root of the tree.
a given node. The goal of the optimization is to pack the reconvergent paths caused by a given input into just one LUT.
This strategy is illustrated in figure 3.11. The two paths going through the blocks g and h are placed in the same LUT, thus reducing the number of LUTs used from five to four.
Figure 3.11. Exploiting reconvergent paths to reduce the amount of LUTs used
If more than one pair of fan-in LUTs share inputs, there will be several pairs of reconvergent paths. To determine which one should be packed into the same LUT, two approaches exist in the Chortle algorithm. The first one is an exhaustive approach that first finds all pairs of fan-in LUTs that share inputs. Then every possible combination is constructed by first merging the fan-in LUTs and then proceeding with the FFD bin packing algorithm. The two-level decomposition that produces the fewest bins and the smallest least-filled bin is retained as the solution.
A large number of pairs of fan-in LUTs sharing the same inputs causes the algorithm to become impractical. To overcome this limitation, a second heuristic, called Maximum Share Decreasing (MSD), can be used. The goal is to maximize the sharing of inputs when fan-in LUTs (boxes) are packed into bins. The MSD iteratively chooses the next box to be packed into bins according to the following criteria: 1) the box has the greatest number of inputs, 2) the box shares the greatest number of inputs with any existing bin, and 3) it shares the greatest number of inputs with any of the remaining boxes.
The first criterion ensures that the MSD algorithm works like FFD if no reconvergent path exists. The second and third criteria help to place boxes that share inputs into the same bins. The chosen box is then packed into the bin with which it shares the most inputs without exceeding the bin capacity, i.e. the number of inputs. If no bin exists into which the chosen box can be packed, then a new bin is created and the chosen box is packed into the new bin.
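The underlying packing step is ordinary first-fit-decreasing bin packing, with the bin capacity given by the number K of LUT inputs. The Python sketch below shows plain FFD on boxes represented as sets of input signals (a representation chosen for illustration); the MSD criteria would only change the order in which boxes are selected and the bin chosen for each box.

def ffd_pack(boxes, K):
    """First-fit decreasing: boxes and bins are sets of input signals."""
    bins = []
    for box in sorted(boxes, key=len, reverse=True):  # largest boxes first
        for b in bins:
            if len(b | box) <= K:     # shared inputs are counted only once
                b |= box              # first bin with enough free inputs
                break
        else:
            bins.append(set(box))     # no bin fits: open a new one
    return bins

# Example with 4-input LUTs: yields two bins
print(ffd_pack([{"a", "b", "c"}, {"b", "c"}, {"d", "e"}, {"a", "d"}], K=4))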
Figure 3.12. Logic replication at fan-out nodes to reduce the number of LUTs used
The Chortle algorithm exists in two versions: one with replication of all nodes and the other without replication. The solution that produces the smallest number of LUTs is retained.
The Chortle algorithm presented in this section is known in the literature [87] as Chortle-crf.6
As seen in figure 3.10, the paths generated by the Chortle algorithm can become very long if no effort is spent in reducing the delay. This problem is addressed in another version of the Chortle algorithm, Chortle-d.7 Instead of using bin packing to minimize the number of LUTs used, Chortle-d focuses, in its bin packing strategy, on the reduction of the number of levels in the final design. Because we investigate a delay-optimal algorithm in the next section, the Chortle-d algorithm will not be considered further. The FlowMap algorithm that we present next focuses on delay minimization using a network flow approach, a technique that was also used in the MIS-pga algorithm of Murgai et al. [163].
6 The c stands for the constructive bin packing, r for the reconvergent paths and f for the logic replication.
7 The d stands for delay.
the nodes in the LUTs according to their level number. We present the two steps in the next paragraphs.
The edge cut-size e(X, X̄) of a cut (X, X̄) is the sum of the capacities of the crossing edges.
The label l(t) of a node t is defined as the depth of the LUT which contains t in an optimal mapping of the cone at t.
The volume vol(X, X̄) of a cut (X, X̄) is the number of nodes in X̄, i.e. vol(X, X̄) = |X̄|.
With the previous reflections, the K-LUT mapping with minimal delay is reduced to the problem of finding a K-feasible cut with minimum height8 for each node in the graph. The level l(t) of the node t is therefore given by the following formula:
l(t) = min { h(X, X̄) + 1 : (X, X̄) is a K-feasible cut of Nt }
Lemma 3.7 The minimum depth of any mapping solution of Nt is given by min { h(X, X̄) + 1 : (X, X̄) is a K-feasible cut of Nt }.
8 It is assumed here that no cut (X, X̄) is computed with primary input nodes in X̄.
time O(Km), where m is the number of edges in the network. Further transformations are required on the networks Nt in order to reach this goal. The node labels defined by the FlowMap scheme satisfy the following property.
Lemma 3.8 Let l(t) be the label of node t, then l(t) = p or l(t) = p + 1, where p
is the maximum label of the nodes in input(t).
Proof Let t′ be any node in input(t). Then for any K-feasible cut (X, X̄) in Nt, either
1 t′ ∈ X, or
2 (X, X̄) also determines a K-feasible cut (X′, X̄′) in Nt′ with h(X′, X̄′) ≤ h(X, X̄), where X′ = X ∩ Nt′ and X̄′ = X̄ ∩ Nt′. Those two cases are illustrated in figure 3.15.
In the first case, we have l(t) = h(X, X̄) + 1 ≥ l(t′) + 1 and therefore l(t) ≥ l(t′). In the second case, we have l(t′) − 1 ≤ h(X′, X̄′) ≤ h(X, X̄) = l(t) − 1, which implies l(t) ≥ l(t′). Therefore, l(t) ≥ p. This part of the proof shows that the label of a node cannot be smaller than those of its predecessors.
Because the network Nt is K-bounded, |input(t)| ≤ K. Therefore (Nt − {t}, {t}) is a K-feasible cut. Since each node in Nt − {t} is either in input(t) or is a predecessor of some node in input(t), the maximum label of the nodes in Nt − {t} is p. Therefore, h(Nt − {t}, {t}) = p, i.e. l(t) ≤ p + 1.
Figure 3.15. Illustration of the two cases in the proof of Lemma 3.8
Proof Let Ht denote the set of nodes in Nt that are collapsed into t′.
⇐ If N′t has a K-feasible cut (X′, X̄′), let X = X′ and X̄ = (X̄′ − {t′}) ∪ Ht; then (X, X̄) is a K-feasible cut of Nt. Because no node in X′ (= X) has a label of p or larger, we have h(X, X̄) ≤ p − 1. Furthermore, according to Lemma 3.8, l(t) ≥ p, which implies h(X, X̄) ≥ p − 1. Therefore h(X, X̄) = p − 1.
Figure 3.17. Transforming the node cut constraints into the edge cut ones
each bridging edge is one, the edge cut-size in N′′t is equivalent to the node cut-size in N′t. Based on this observation, the following Lemma can be stated.
Lemma 3.10 N′t has a K-feasible cut if N′′t has a cut whose edge cut-size is no more than K.
The Ford and Fulkerson method [84] [55] can be used to check if a cut with edge cut-size smaller than or equal to K exists. We first briefly provide some more background on networks, which is important for understanding the testing procedure.
In a network N with source s and sink t, a flow can be seen as a streaming of data in different directions on the edges of the network. A node might have data coming in and data going out through its edges, each of which has a given capacity. Data streaming in the s-direction (resp. in the t-direction) causes a negative (resp. positive) value of the flow at a given node. The value of the flow in the network is therefore the sum of the flows on the network edges. Formally, a flow f in N can be defined as a function from N × N to ℝ.
A residual value exists on an edge if the flow on that edge has a lower value than the edge capacity. The residual value is then the difference between the capacity and the flow's value. It can be added to the flow in order to saturate the edge. The capacity of a cut in the network is defined as the sum of all positive crossing edge capacities. Negative crossing edges do not influence the capacity of a cut. An edge not saturated by the flow is called a residual edge. The residual network of a network is made up of all the residual edges as well
as the nodes that they connect. An augmenting path is a path from the source s to the sink t that contains only residual edges, i.e. a path in the residual network.
A very important relationship between a flow and a cut in a network is
given by the following corollary:
Corollary 3.11 The value of any flow in the network is bounded from above by
the capacity of any cut in the network.
This implies the following: the maximum flow value in the network is
bounded from above by the minimum cut of the network.
Based on the above observation and the notions of cut and residual network, the famous max-flow min-cut theorem [55] of Ford and Fulkerson states that a flow is maximum in the network if the residual network does not contain any augmenting path. The value of the flow is then equal to the capacity of the minimum cut.
Applying the max-flow min-cut theorem to our problem, we can state the following: if a cut with edge cut-size smaller than or equal to K exists in N′′t, then the maximum value of any flow between s and t′′ in N′′t will be no more than K.
Because we are only interested in testing if the value of the cut is smaller than K, the augmenting path method of Ford and Fulkerson can be applied to compute the maximum flow. The approach starts with a flow f whose value is set to 0 and then iteratively finds some augmenting path P in the residual network and increases the flow on P by the residual capacity cf(P), that is, the value by which the flow on each edge of P can be increased. If no path from s to t exists, then the computation stops and returns the maximum flow value.
Because each bridging edge in N′′t has a capacity of one, each augmenting path in the flow residual graph of N′′t from s to t′′ increases the flow by one. Therefore, the augmenting path method can be used iteratively to check if the maximum value of a flow associated to a cut is less than K. For a given cut, if K + 1 augmenting paths can be found, then the maximum flow in N′′t has a value of more than K. Otherwise, the residual graph will not contain a (K + 1)th path.
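Since only the threshold K matters, the computation can stop as soon as a (K + 1)th augmenting path has been found. The following Python sketch illustrates this test on a generic graph with unit edge capacities; the adjacency-dictionary representation is an assumption made for illustration, not the exact FlowMap implementation.

def cut_size_at_most_k(edges, s, t, K):
    """True if the minimum cut between s and t has edge cut-size <= K.
    edges: dict node -> iterable of successors; all capacities are one."""
    nodes = set(edges) | {v for vs in edges.values() for v in vs}
    cap = {}                                  # residual capacities
    for u in edges:
        for v in edges[u]:
            cap[(u, v)] = cap.get((u, v), 0) + 1
            cap.setdefault((v, u), 0)         # reverse (residual) edge
    def augment():
        stack, parent = [s], {s: None}        # DFS for an augmenting path
        while stack:
            u = stack.pop()
            if u == t:                        # path found: saturate it
                while parent[u] is not None:
                    p = parent[u]
                    cap[(p, u)] -= 1
                    cap[(u, p)] += 1
                    u = p
                return True
            for v in nodes:
                if v not in parent and cap.get((u, v), 0) > 0:
                    parent[v] = u
                    stack.append(v)
        return False
    flow = 0
    while flow <= K and augment():            # at most K + 1 iterations
        flow += 1
    return flow <= K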
Testing if N′′t has a cut whose value is no more than K can therefore be done through a depth-first search starting at s and including in X′′ all nodes reachable from s.
The run-time of the Ford and Fulkerson method is O(m|f*|), where |f*| is the value of the maximum flow computed. In the FlowMap algorithm, this value is K, which corresponds to the number of iterations that were performed to find the K augmenting paths. Since finding an augmenting path takes O(m) (m being the number of edges of N′′t), testing if a cut with edge cut-size less than or equal to K exists can be determined in time O(Km).
The resulting cut (X′′, X̄′′) in N′′t induces a cut (X′, X̄′) in N′t, which in turn induces a K-feasible cut (X, X̄) in Nt.
Because the above procedure must be repeated for each node in the original Boolean network, we conclude that the labeling phase, i.e. computing the labels of all nodes in the graph, can be done in O(Kmn), where n is the number of nodes and m the number of edges in N.
Node Mapping. In its second phase, the FlowMap algorithm maps the nodes into K-LUTs. The algorithm works on a set L of node outputs to be implemented in the LUTs. Each output will therefore be implemented as a LUT output.
Initially, L contains all primary outputs. For each node v in L, it is assumed that the minimum K-feasible cut (X, X̄) in Nv has been computed in the first phase. A K-LUT LUTv is then generated to implement the function of v as well as that of all nodes in X̄. The inputs of LUTv are the crossing edges from X to X̄, which are at most K, since the cut is K-feasible. L is then updated to be (L − {v}) ∪ input(X̄). Nodes w belonging to two different cut-sets X̄u and X̄v will automatically be duplicated. Algorithm 5 provides the pseudo-code that summarizes all the steps of FlowMap presented here.
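The mapping phase can be summarized in a few lines. The Python sketch below assumes that the labeling phase has stored, for every node v, the node set xbar[v] (the X̄ side of its minimum-height K-feasible cut) and the corresponding LUT inputs lut_inputs[v]; under these assumptions, the loop is a direct transcription of the description above.

def flowmap_mapping(primary_outputs, pis, xbar, lut_inputs):
    """Phase 2 of FlowMap: generate one K-LUT per node output in L."""
    pis = set(pis)
    to_implement = set(primary_outputs) - pis   # the set L
    luts = {}
    while to_implement:
        v = to_implement.pop()
        # LUTv implements v together with all nodes in X-bar of its cut
        luts[v] = (set(xbar[v]), set(lut_inputs[v]))
        for w in lut_inputs[v]:
            # every LUT input must itself become a LUT output;
            # a node needed by two different cuts is duplicated automatically
            if w not in pis and w not in luts:
                to_implement.add(w)
    return luts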
Improvements performed on the FlowMap algorithm have the goal of reducing the number of LUTs used while keeping the delay minimal. The first improvement is used during the mapping of the nodes into the LUTs. The algorithm tries to find the K-feasible cut with maximum volume. This allows the final LUTs to contain more nodes, thus reducing the area used. The second possibility is the use of the so-called flow-pack method, which generalizes the predecessor packing used in DAG-Map [46]. The idea of predecessor packing is to pack a K-input LUT v into the same K-input LUT u, if v is a fan-out-free fan-in LUT of u and if the total number of inputs of u and v does not exceed K. The flow-pack method generalizes the predecessor packing method to the set of predecessors Pu of the node u (including u), provided that Pu has a number of inputs less than or equal to K.
Figure 3.18 illustrates the predecessor packing approach. The gate decomposition presented in the Chortle algorithm can be used as well for the reduction of the number of LUTs.
Figure 3.18. Improvement of the FlowMap algorithm through efficient predecessor packing
4. Conclusion
In this chapter, the general use of FPGAs in different systems was explained, as well as the different steps needed to synthesize the hardware modules that will be executed in the FPGAs at run-time. We have taken a brief tour through logic synthesis, with the focus placed on LUT technology mapping, the step by which FPGAs differ from other logic devices. Our goal was to present two algorithms that best match the optimization requirements, namely minimal delay and minimal area. Research in FPGA synthesis keeps on going, and several new LUT-based technology mapping algorithms are expected to be proposed.
Developments in the field of FPGAs have been amazing in the last two decades. FPGAs have moved from tiny devices with a few thousand gates, only able to implement some finite state machines and glue logic, to very complex devices with millions of gates as well as coarse-grained cores. In 2003, a growth rate of 200% over less than 10 years was observed in the capacity of Xilinx FPGAs, while in the meantime a 50% reduction rate in power consumption was reached, with prices decreasing at the same rate. Other FPGA vendors have faced similar developments, and this trend is likely to continue for a while. This development, together with the progress in design tools, has boosted the acceptance of FPGAs in different computation fields. With coarse-grained elements like CPUs, memory and arithmetic units available in recent FPGAs, it is now possible to build a complete system consisting of one or more processors, embedded memory, peripherals and custom hardware blocks on a single FPGA. This opportunity limits the number of components that must be soldered on a printed circuit board in order to get an FPGA system working.
In this chapter, we present the different possibilities that exist to build those systems consisting of one or more processors, peripherals and custom hardware components.
We start with an introduction to systems on programmable chip, and then we present some of the various elements usually needed to build those systems. At the end of the chapter, we present a design approach for adaptive multiprocessor systems on chip.
1. Introduction to SoPC
A system on chip (SoC) is usually defined as a chip that integrates the major functional elements of a complete end product. Figure 8.1 illustrates this integration, from a system on a printed circuit board (PCB) to a SoC.
Figure 8.1. Integration of PCB modules into a single chip: from system on PCB to SoC
1.1 Processor
Processors are the central processing and coordination units in systems on programmable chip. Besides the coordination of the complete system, they are in charge of collecting the data from the different peripheral modules or from the memory, processing those data and storing them into the memory or sending them to the peripheral modules. The processor also initializes the peripherals as well as the dedicated hardware modules on the chip and manages the memory. The most widely used processors are those available on Xilinx and Altera devices, since those two companies control the largest part of the programmable device market. Besides the processors offered by those two companies, other platform-independent implementations exist. We list some of the available processors next.
a Harvard architecture with separate 32-bit address bus and 32-bit data bus
seven Fast Simplex Links (FSL) that can be used to connect the processor to custom hardware
a 32-bit single precision Floating Point Unit (FPU) in the IEEE 754 format
debug logic
1.2 Memory
A system on programmable chip needs memory for storing instructions and data. Memory elements are available in small amounts on almost all modern FPGAs. These are usually used to build caches directly on the chip. For applications that require more memory, external SRAM or DRAM must be used. On-chip memory can be built from the dedicated memory elements, the block RAMs, or from the LUTs. The use of LUTs to build memory has two drawbacks: first, the amount of available computing resources decreases, thus reducing the complexity of the logic that can be implemented on the chip; second, the memory built with LUTs is distributed across the chip, meaning that chip interconnections must be used to connect the spread LUTs, which leads to a decrease in memory access performance. On-chip block RAMs are usually dual-ported. They therefore provide a nice possibility to integrate custom hardware modules, in particular those working in two different clock domains. In this case, the memory is used to synchronize the modules with each other.
1.3 Peripheral
Peripherals are used in the system for communication with the external world and for debugging purposes. The peripheral components are usually provided by the board manufacturers as ready-to-use modules that can be inserted in the SoPC design. Because of their lower frequency, compared to that of processors, peripherals should be connected to a low-speed bus, which in turn can be connected to the processor bus through a bridge. The communication between the processor and the available peripherals can be done either through the I/O mapping paradigm or through the memory mapping paradigm. In the first case, the processor addresses each peripheral using special instructions that directly address the peripheral on the bus. In the second case, the peripheral is assigned an address space in the memory, and the communication happens in this space, with the processor writing into control registers that can be read by the peripheral. The peripheral responds by writing into status registers that can be read by the processor.
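The memory-mapped case can be pictured with a small model. In the Python sketch below, a plain byte array stands in for the peripheral's address window; the register offsets and bit assignments are invented for illustration and do not correspond to any particular peripheral.

import struct

CONTROL, STATUS = 0x0, 0x4         # hypothetical register offsets
START_BIT = DONE_BIT = 0x1         # hypothetical bit assignments

class MappedPeripheral:
    """A byte array standing in for the peripheral's address window."""
    def __init__(self, size=0x100):
        self.mem = bytearray(size)
    def write32(self, offset, value):
        struct.pack_into("<I", self.mem, offset, value)
    def read32(self, offset):
        return struct.unpack_from("<I", self.mem, offset)[0]

p = MappedPeripheral()
p.write32(CONTROL, START_BIT)              # processor writes a control register
p.write32(STATUS, DONE_BIT)                # (the peripheral would set this bit)
done = bool(p.read32(STATUS) & DONE_BIT)   # processor polls the status register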
1.5 Interconnection
The interconnection provides the mechanism for integrating all the components previously described into a workable system on programmable chip. The interconnection mechanism provides the communication channels, the interface specification, the communication protocols and the arbitration policy on the channels. The design of the interconnection infrastructure depends on the target application. In a traditional system on chip, where the infrastructure is fixed at chip production, the communication infrastructure must be as general as possible to serve all classes of applications that can be implemented on the chip. In programmable devices, the flexibility can be used to modify the interconnection infrastructure according to the type of application to be implemented. In this case, each application can first be analyzed in order to derive
the On-Chip Peripheral Bus (OPB), which is a secondary bus that can be used to decouple the peripherals from the PLB in order to avoid a loss of system performance. Peripherals like serial ports, parallel ports, UARTs, GPIO, timers and other low-bandwidth devices should be attached to the OPB. Access to the peripherals on the OPB by PLB masters is done through a bridge, which is used as a slave device on the PLB and as a master on the OPB. The bridge performs dynamic bus sizing, in order to allow devices with different data widths to communicate.
Figure 8.3 illustrates the implementation of the OPB. Note that no tri-state is required. The address and data buses are instead implemented using a distributed multiplexer. This is a common technique for implementing buses in programmable logic devices. All the master inputs to the bus are ORed and the result is provided to all the slaves. The arbitration module defines which master is granted the bus. All other masters must then place a zero signal on their outputs. The ORed result then carries the value of the bus master onto the bus.
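This ORed distributed multiplexer can be modelled in a few lines. In the Python sketch below, the arbiter grant forces all but one master to drive zero, and the value seen by the slaves is the bitwise OR of all master outputs; the names are illustrative only.

def bus_value(master_outputs, granted):
    """Distributed multiplexer: non-granted masters must drive zero."""
    value = 0
    for master, word in master_outputs.items():
        driven = word if master == granted else 0  # arbitration masks others
        value |= driven                            # OR over all master inputs
    return value

# Two masters; the arbiter grants the bus to "m0":
print(hex(bus_value({"m0": 0xCAFE, "m1": 0x1234}, granted="m0")))  # 0xcafe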
Figure 8.3. Implementation of the OPB for two masters and two slaves
nels, they allow multiple masters, and they support split transactions and burst transfers. The AMBA 2 interconnection infrastructure consists of two main buses:
the Advanced Peripheral Bus (APB), used to connect the slower peripherals. A bridge is used to interface between the AHB and APB buses and allows slow peripherals to be accessed from components hung on the AHB.
processors. The Intel Pentium D and the AMD Dual-Core Athlon 64 are other examples of multiprocessors on chip.
The work in [167] proposed an adaptive chip-multiprocessor architecture, where the number of active processors is dynamically adjusted to the current workload in order to save energy while preserving performance. The proposed architecture is shared-memory based, with a fixed number of embedded processor cores. The adaptivity lies in the possibility of switching the single processors on and off.
Programmable logic devices in general, and FPGAs in particular, have experienced a continuous growth in their capacity, a constant decrease in their power consumption and a permanent reduction of their price. This trend, which has increased interest in using FPGAs as flexible hardware accelerators, is likely to continue for a while.
After an unsuccessful attempt in the 90's to use FPGAs as co-processors in computing systems, the field of FPGAs in high-performance computing is experiencing a renaissance, with manufacturers like Cray now producing systems built upon high-speed processors coupled to FPGAs that act as hardware accelerators [58]. However, current FPGA solutions for high-performance computing still use fast microprocessors, and the results are systems with a very high power consumption and power dissipation.
A combination of the chip multiprocessor paradigm and flexible hardware
accelerators can be used to increase the computation speed of applications.
In this case, FPGAs can be used as target devices in which a set of
processors and a set of custom hardware accelerators are implemented and
communicate together in a network.
FPGA manufacturers like Xilinx have been very active in this field by providing ready-to-use components (soft- or hard-core processors, bus-based interconnection facilities, and peripherals) on the basis of which multiprocessors on chip can be built. The Virtex-II Pro FPGA, for example, provides up to four PowerPC processors and several features like memory and DSP modules.
The adaptivity of the whole system can be reached by modifying the computation infrastructure. This can be done at compile-time using full reconfiguration, or at run-time by means of partial device reconfiguration. This results in a multiprocessor on chip whose structure can be adapted at run-time to the computation paradigm of a given application.
Unfortunately, the capabilities of those devices are often not exploited. FPGA devices like the Xilinx Virtex-II Pro are usually used only to implement a small hardware component, while the two available PowerPC processors remain unused most of the time. This is in part due to the lack of support for multiprocessor design.
In this section, we investigate the use of Adaptive Multiprocessor on Chip
(AMoC) and present a design approach for such systems. We also present an
One of the main tasks in the design and implementation of a system like that of figure 8.4 is the interconnection network. Contrary to other modules (processors, peripherals, memory controllers and some custom hardware) that
for the whole class of communication media. It comprises buses, networks and point-to-point connections. Periphery and HWAccelerator differ only in their type of connections: while a HWAccelerator is used for internal components, Periphery is used to describe modules with external connections.
Having specified the system, the description may be functionally simulated to observe the system behaviour, and afterwards be validated. In the last step, the final system can be created by supplying a concrete system description for the target platform.
When transforming an abstract system description into a concrete one, traceability must be maintained. Therefore, in each of the two steps, the respective XML input file is validated against its document type definition to prevent invalid component specifications.
Subsequently, the given specification is processed by a Java application consisting of three consecutive phases. In phase one, an XML parser extracts re-
The bulk of the computation in the SVD is the computation of the dot-products of the column pairs, which is a good candidate for hardware implementation. Custom multiply-accumulate (MAC) modules for computing those column-pair multiplications were designed and directly attached to the MicroBlaze processors via the Fast Simplex Link (FSL) ports (figure 8.12), in order to increase the performance of the whole system.
2.7 Adaptivity
As stated at the beginning of the chapter, the adaptivity can be used for several purposes. According to the application, a physical topology consisting of a given arrangement of routers and processors may be built to best match the computation paradigm of that application. The topology can be modified, even at run-time, to better adapt to the changes observed in the application.
On the other hand, the adaptivity of the system can be reached by using reconfigurable hardware accelerators, provided that the target device supports partial reconfiguration.
Designing partially reconfigurable systems with unlimited capabilities is a complex engineering task that requires a lot of hand work. In order to avoid this and keep things simple, a template for partial reconfiguration can first be created for the target device. In this template, fixed locations must be selected where the reconfiguration is allowed to take place. These act as placeholders in order to accommodate reconfigurable modules at run-time. Any hardware accelerator can be plugged into such a placeholder at run-time by means of partial reconfiguration. This step can even be performed from within the device, using the appropriate port, like the ICAP in the case of Xilinx devices.
3. Conclusion
In this chapter, we have addressed the system on programmable chip paradigm and presented some advantages of using those architectures. The main components needed to build such a system were presented, with a focus on the leading manufacturers. The goal was not a comparative study of the different system on programmable chip infrastructures; we rather wanted to provide the user a brief view of the existing solutions.
In the second part of the chapter, we have presented a design approach for multiprocessor systems on FPGAs. Using reconfiguration, it is possible to modify the hardware infrastructure and adapt it to the best computation paradigm of the application being computed.
We have presented a framework to specify and implement a complete system without knowledge of the vendor tools. Obviously, all that is feasible with the PinHat tool is also feasible with current vendor tools. However, the complexity and the difficulty of using those tools do not allow a newcomer to generate designs with them. The goal here was to hide from the user the complexity of multiprocessor design, for which support in current tools is very limited, and to allow him to focus on the efficient analysis and partitioning of his application.
We believe that multiprocessing on FPGAs has a great potential to speed up computation in embedded systems. In order to facilitate their use, tools like PinHat are welcome to hide the complexity of the hardware generation and let the designer focus on the analysis of the application and the definition of the structure best adapted to its computation.
The adaptivity of the whole system is provided through the use of partial reconfiguration in order to exchange running hardware modules at run-time. However, in order to overcome the difficulty of designing for partial reconfiguration with the current tools, a template-based approach is recommended, which foresees predefined locations for placing components at run-time.
Chapter 7
* This chapter and the corresponding Appendix were prepared by Christophe Bobda and Dominik Murr.
Example 7.1 Consider for instance the two designs of figure 7.1, where the full designs are named Top1 and Top2. The fixed part consists of the modules F-Module 1, F-Module 2, and F-Module 3, which are placed at the same locations in the two designs. In Top1, the reconfigurable module is R-Module 1, which is placed at the same location, with the same bounding box constraints, as the reconfigurable module R-Module 2 in Top2.
The addition of R-Module 2 to the fixed part of the two bitstreams generates the full bitstream Top2; the same is true for R-Module 1 and Top1. Also, as shown in the figure, adding the two bitstreams Top1 and R-Module 2 will produce the bitstream Top2, because R-Module 2 will replace R-Module 1 in Top1. The subtraction of Top2 from Top1 produces the partial bitstream for R-Module 1.
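This addition and subtraction of bitstreams can be pictured with a toy model in which a bitstream is a mapping from frame addresses to frame contents. Real Xilinx bitstreams contain headers, commands and CRCs, so the Python sketch below is a conceptual illustration of the example, not working bitstream manipulation.

def add(base, partial):
    """Overlay a partial bitstream on a full one (frames are overwritten)."""
    merged = dict(base)
    merged.update(partial)
    return merged

def subtract(full_a, full_b):
    """Frames in which full_a differs from full_b form a partial bitstream."""
    return {addr: data for addr, data in full_a.items()
            if full_b.get(addr) != data}

# Toy frames: Top1 and Top2 share the fixed part and differ in one region.
fixed = {0: "F-Module 1", 1: "F-Module 2", 2: "F-Module 3"}
top1 = add(fixed, {3: "R-Module 1"})
top2 = add(fixed, {3: "R-Module 2"})
assert add(top1, subtract(top2, top1)) == top2        # Top1 + R-Module 2
assert subtract(top1, top2) == {3: "R-Module 1"}      # Top1 - Top2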
The JBits API is constructed from a set of Java classes, methods and tools which can be used to set the LUT values as well as the interconnections. This is all that is required to implement a function in an FPGA. JBits also provides functions to read back the content of an FPGA currently in use.
The content of a LUT can be set in JBits using the set function. Two pins can be connected using:
connect(outpin, inpin)
This function uses the JBits routing capabilities to connect the pin outpin to the pin inpin anywhere inside the FPGA. outpin and inpin should be CLB terminals.
Connecting the output of a LUT to a CLB terminal can be done with the
function.
Figure 7.2. Routing tools can route the same signals on different paths
(design entry and synthesis, initial budgeting, active implementation, and assembly), for which a viable directory structure must be used.
We recommend using ISE version 6.3, which is, in our opinion, the most stable version for implementing partial reconfiguration with the modular design flow.
The HDL Design directory (HDL): This directory is used to store the HDL descriptions of the modules and those of the top-level designs. The HDL descriptions can be in VHDL, Verilog or any other hardware description language.
Figure 7.3. The recommended directory structure for the modular design flow
Signals that connect modules to each other and to I/O pins must be
defined in the top-level design.
3 Global logic constraint: All global logic like clocks, power and ground
lines must be constrained in the top level. No global logic should be
locked in the module topX.ucf file.
The output NGD file is named after the top-level design and contains implementation information for both the top-level design and the individual modules. The -uc option ensures that the constraints from the local topX.ucf file are annotated.
3 Map the module to the library elements using the following command. In this step, only the logic of the design with the expanded active module is mapped.
4 The place and route of the library elements assigned to the module is done by invoking the placer and router (par) with:
The -w option ensures that any previous versions of the produced files design_name_routed.ncd are overwritten. If the area specified for the module cannot contain the physical logic for the module because it is sized incorrectly, then resizing must be done again, and the .ucf file generated during the initial budgeting phase must be regenerated with the changes made. The entire initial budgeting phase should be performed again. This would then imply that new .ucf files would need to be copied to each module and that each module needs to be reimplemented. If everything went fine, then the module is now placed and routed at the correct location in the given bounding box.
5 Run trace on the implemented design to check the timing report in order
to verify that the timing constraints are met.
This command creates the appropriate module directory inside the PIMs directory. It then copies the local, implemented module files, including the .ngo, .ngm and .ncd files, to the module directory inside the PIMs directory and renames the .ncd and .ngm files to module_name.ncd and module_name.ngm. The -ncd option specifies the fully routed NCD file that should be published.
1 In order to incorporate all the logic for each module into the top-level design, run ngdbuild as follows:
Ngdbuild generates a topX.ngd file from the top-level topX.ucf file, the top-level .ngo file, and each PIM's .ngo file.
MAP uses the .ncd and .ngm files from each of the module directories inside the PIMs directory to accurately recreate the module logic.
3 The full design can then be placed and routed using the following command:
The place and route command, par, uses the .ncd file from each of the module directories inside the PIMs directory to accurately reimplement the module logic.
4 In order to verify if timing constraints were met, the trace command must
be invoked as follows:
Figure 7.4. Limitation of a PR area to a block (dark) and the actual dimensions (light)
Figure 7.5. Scheme of a PR application with a traversing bus that may not be interrupted
design flow. Now, traversing signals need no longer be guided through a bus macro, formerly a tedious handwork.
4.1 New directory structure
The Early Access tools now support a modified, more sensible directory structure for PR projects, as depicted in figure 7.6. There, the folders' contents are presented in the order of usage in the design flow.
Figure 7.6. Improved directory structure for the Early Access Design Flow
Finally, the design is assembled in the merges folder, and full and partial bitstreams are generated.
all design modules, instantiated as black boxes with only ports and port
directions specified and
Figure 7.8 shows the difference between a wide and a narrow type bus macro. The classification concerns only the range of CLBs that one of these bus macros bridges. Both kinds offer eight signal lines to cross a boundary.
The unoccupied CLBs between wide bus macros can be used by user
logic or other bus macros. They can also be nested up to three times in the
same y-row as depicted in figure 7.9. With the old bus macros, only four
signals could traverse the boundary at this position whereas the new macros
put through up to 24 lines. The three staggered bus macros do not have to be
of the same type or direction.
Figure 7.8. Narrow (a) and wide (b) bus macros spanning two or four CLBs
the x-coordinate of the CLBs to which a bus macro is locked has to be divisible by two
the y-coordinate also has to be divisible by two
Virtex-4 devices require coordinates divisible by four, due to the location of block RAMs and DSPs
bus macros have to traverse a region boundary; each of the endpoints has to lie completely in a different area
Further actions for the Initial Budgeting like setting the timing constraints
follow the same guidelines as the Modular Design. One must keep in mind
that signals that will not interfere with a module can cross that module
without having to run through a bus macro.
To close the Initial Budgeting Phase, the top-level design has to be implemented. In the directory pr_project/top, the command
ngdbuild -uc system.ucf -modular initial system.ngc
has to be invoked. The static portion of the design is now budgeted in pr_project/static. The .nmc files for the used bus macros have to be placed in this directory. The necessary commands are:
ngdbuild -uc ../top/system.ucf -modular initial ../top/system.ngc
map system.ngd
par -w system.ncd system_base_routed.ncd
The routed outputs are placed in pr_project/top and pr_project/static, respectively. The process yields another file, static.used, that contains a list of routes within the partially reconfigurable regions that are already occupied by static design connections. This is used as an input for the placement and routing of the partially reconfigurable modules later on. The file allows for guiding signals through PRMs without using bus macros.
4.5 Activation
In the Activation Phase, the partially reconfigurable modules are implemented. First, the previously generated static.used is copied to each of the module directories (e.g. pr_project/reconfigmodules/prm_a1) and renamed to arcs.exclude. Each module is then run through the known sequence of ngdbuild, map and par, as follows for prm_a1. In the directory pr_project/reconfigmodules/prm_a1, do the following:
ngdbuild -uc ../../top/system.ucf -modular module -active \\
prm_component_name ../../top/system.ngc
map system.ngd
par -w system.ncd prm_des_routed.ncd
Note that prm_component_name is the component name, not the instance name, of the module. par automatically looks for and incorporates the arcs.exclude file.
1 Merge each PRM of one reconfigurable area with the static design. pr_verifydesign and pr_assemble generate intermediate designs with just the one PRM contained, and partial bitstreams that can be used to reconfigure the FPGA at this particular region afterwards. The necessary commands are
merges/prmB1withA1/pr_verifydesign \\
../../prmA/prm_a1/base_routed_full.ncd \\
../../reconfigmodules/prm_b1/prm_b1_routed.ncd
merges/prmB1withA1/pr_assemble \\
../../prmA/prm_a1/base_routed_full.ncd \\
../../reconfigmodules/prm_b1/prm_b1_routed.ncd
merges/prmB1withA2/pr_verifydesign \\
../../prmA/prm_a2/base_routed_full.ncd \\
../../reconfigmodules/prm_b1/prm_b1_routed.ncd
merges/prmB1withA2/pr_assemble \\
../../prmA/prm_a2/base_routed_full.ncd \\
../../reconfigmodules/prm_b1/prm_b1_routed.ncd
merges/prmB2withA1/pr_verifydesign \\
../../prmA/prm_a1/base_routed_full.ncd \\
../../reconfigmodules/prm_b2/prm_b2_routed.ncd
merges/prmB2withA1/pr_assemble \\
../../prmA/prm_a1/base_routed_full.ncd \\
../../reconfigmodules/prm_b2/prm_b2_routed.ncd
merges/prmB2withA2/pr_verifydesign \\
../../prmA/prm_a2/base_routed_full.ncd \\
../../reconfigmodules/prm_b2/prm_b2_routed.ncd
merges/prmB2withA2/pr_assemble \\
../../prmA/prm_a2/base_routed_full.ncd \\
../../reconfigmodules/prm_b2/prm_b2_routed.ncd
Over these iterations, several of the created files are equivalent to each other. In this case, for partially reconfigurable area B,
prmB1withA1/prm_b1_blank.bit,
prmB1withA2/prm_b1_blank.bit,
prmB2withA1/prm_b1_blank.bit and
prmB2withA2/prm_b1_blank.bit
are equal. Equivalent bitstreams for module b1 are
Four variations of the full bitstream are generated representing the four
combinations of the two partially reconfigurable modules in this example:
Figure 7.11. Two patterns that can be created with modules of Animated Patterns. Left: the
middle beam is moving. Right: the diagonal stripes are moving.
5.2 Video8
The second sample design is called Video8 and is used for the code and command execution examples in the following guide. It deploys the Digilent VDEC1 extension card [70] for the Virtex-II Pro XUP Development Board [228] to digitize an analog video stream from a camera. The extension card is plugged into the XUP via a proprietary slot and is technically connected through an I2C bus.
As shown in figure 7.12, the data stream is taken into the system by video_in. The module converts it to the RGB color space and hands it over to a filter, employing e.g. a Sobel implementation for edge detection. This filter module will be reconfigurable at the end of this tutorial. The filtered data is then processed in vga_out to be sent to a VGA-compliant screen.
The embedded hardcore processor on the FPGA is used to configure the video digitizer card and the ICAP on start-up. It is then used to run a simple program to choose which partial bitstream to load next. The input and output are brought to the user's console on a remote PC through the serial port.
reconstruct the initial design to place the reconfigurable parts, and also the module instantiations connected therewith, in the top-level. On the one hand, the resulting code is much easier to understand, and modules can be developed straightforwardly by different engineers according to the modular design flow. On the other hand, the facilitation that comes, for example, from using EDK for the basic design might be outweighed by the amount of work involved in extracting and reconnecting all the necessary modules. This counts all the more for the synthesis process. The placement tool places modules according to their connectivity with each other and the constraints given in the UCF, not according to their actual position in the hierarchy. Thus reconstructing the design might be useless, since the placement is done independently.
In the Video8_non_pr example, video_in and vga_out are directly affiliated to the partially reconfigurable module-to-be. As suggested in figure 7.13, these two modules at least would have to be instantiated in the top-level module. Additional difficulties arise from the use of the OPB to connect the filter module. Either the developer has to adapt the filter module and provide other means of connection to the rest of the system, or the OPB has to be brought up to the top-level as well. This again requires work and other modules to be changed accordingly.
Figure 7.13. Reconstructing example Video8 to place the partially reconfigurable part and
connected modules in the top-level.
The other solution is to only replace the old instantiations of the partially reconfigurable modules with instantiations at the top-level and to amend the entities that formerly hosted the modules: ports have to be added to redirect the input data for the partially reconfigurable modules up the module hierarchy to their new location at the top-level module, and the output data back down to be fed into the system again. Figure 7.14 depicts the new structures. The big advantage of this approach is the minimal adaptation work to be done. An initial EDK design may be used as-is and can easily be changed.
Figure 7.14. Moving a partially reconfigurable module to the top-level design on the example of Video8

entity video_in is
...
-- signals out to the filter:
R_out_to_filter <= R1;
G_out_to_filter <= G1;
B_out_to_filter <= B1;
h_counter_out_to_filter <= h_counter1;
v_counter_out_to_filter <= v_counter1;
valid_out_to_filter <= valid1;
R2 <= R_in_from_filter;
G2 <= G_in_from_filter;
B2 <= B_in_from_filter;
h_counter2 <= h_counter_in_from_filter;
v_counter2 <= v_counter_in_from_filter;
valid2 <= valid_in_from_filter;
...
docm_ctlr and iocm_ctlr are the controllers for data and instructions to be stored in on-chip block RAMs. The program that controls the reconfiguration of the FPGA is in this case too large to fit in a smaller RAM area. It could generally be put into the DDR-RAM as well, but in this particular design the framebuffer for the VGA output is placed in the DDR-RAM, consuming a great share of the bandwidth of the bus it is connected to. Thus, placing the program in the DDR-RAM as well generates unwanted distortions in the VGA output.
2.4. Rewriting the Control Software
The program deployed to control the non-PR design has to be extended by the initialization of the ICAP module and a means to choose which partial bitstream to load next.
For the sample design, the control program main.c contains just the initialization of the video decoder card. The initialization routines for the ICAP and the SystemACE have to be added, as well as a menu that prompts the user for the next partial bitstream.
2.5. Export to ISE
The generated design has to be synthesized to be a submodule of the top-level design. For this, EDK supports the export of a project to ISE, where detailed options for the synthesis can be set. Alternatively, the module can be synthesized using xst. To export the project, the following settings have to be made in Project|Project Options|Hierarchy and Flow:
Design is a sub-module with name system_i
use project navigator implementation flow (ISE)
don't add modules to an existing ISE-project
4. Synthesis
The synthesis of the partially reconfigurable project takes place in the
subfolder synth of the project base folder. Here, corresponding subfolders
for the static modules, the reconfigurable modules and the top-level module
can be found. It should be mentioned again that all submodules, including
the partially reconfigurable ones, have to be synthesized without
automatically added I/O-buffers; the xst option -iobuf therefore has to be
set to NO. I/O-buffers only have to be included in the top-level module.
The exported project contains the system as a submodule system_i of type
system and should already include the needed ports for the data flow to and
from the reconfigurable module. system_i is instantiated under a generated
super module system_stub and can therefore directly be included in a
top-level module. These ports should be created by EDK when entering and
wiring external ports as described above. If they are not, the generated
HDL files describing and instantiating entity system resp. system_i
(system.vhd resp. system_stub.vhd) have to be extended with the necessary
ports.
For the sample design, folders for the reconfigurable modules (mod_*) as
well as a top and an edk directory must be created. There is no separate
static module apart from the EDK system; therefore, no static folder is
required. The ports to add to entity system are the same as those added
when changing the former super module of the module that is to become
partially reconfigurable.
5. Appending the UCF
Taking a UCF built for example by EDK, several amendments have to be made
to make partial reconfiguration work. As stated before, the definition of
the areas in which to place the partial and static blocks and the locations
of the bus macros have to be declared in the UCF.
The placement regions are declared through AREA_GROUP constraints, e.g.
AREA_GROUP "pblock_fixed_area" RANGE=SLICE_X46Y16:SLICE_X87Y149;
A region is made partially reconfigurable by additionally attaching
MODE=RECONFIG to its AREA_GROUP; omitting MODE=RECONFIG tells the parser
that the block will be fixed.
Although the UCF is a simple text file and can be edited manually, tools
like PlanAhead can be a big help with their architecture-dependent
graphical user interfaces.
6. The Build Process
The build process starts with the synthesis of all the involved modules.
The submodules must not include I/O-buffers; these are all included only
in the top-level module. The initial budgeting is already partly done and
must be completed now. All other phases, from the active module
implementation to the final assembly, follow the steps described in
section 4.
void main() {
    unsigned 32 res;
    // Interface definitions: two input ports and one output port
    interface port_in(unsigned 32 var_1 with {busformat="B<I>"}) Invar_a();
    interface port_in(unsigned 32 var_2 with {busformat="B<I>"}) Invar_b();
    interface port_out() Outvar(unsigned 32 Res = res with {busformat="B<I>"});
    // Addition of the two input values
    res = Invar_a.var_1 + Invar_b.var_2;
}
The first part of the code is the definition of the interfaces for
communication with other modules; the second part realizes the addition of
the input values coming from the input interfaces, while the output value
is sent to the output interface. We need not provide the implementation of
the subtractor, since it is the same as that of the adder, except that we
subtract instead of adding. Having implemented the two modules separately,
each module is inserted in a separate top-level. Except for the use of the
adder or the subtractor, the two top-levels are the same. The design can
later be reconfigured to exchange the adder for the subtractor. Connecting
the module in a top-level design is done as shown in the following code
segment.
unsigned 32 operand1, operand2, result;
// connect to the adder module defined above; the port names
// follow the interface definition of the adder
interface adder(unsigned 32 Res)
    my_adder(unsigned 32 var_1 = operand1,
             unsigned 32 var_2 = operand2)
    with {busformat="B<I>"};

void main() {
    operand1 = produceoperand(0);
    operand2 = produceoperand(1);
    result = my_adder.Res;
    dosomethingwithresult(result);
}
Depending on which top-level design is implemented, the adder will be
replaced by the subtractor. The next question is how to keep the signal
integrity of the interfaces. Since there are no bus macros available in
Handel-C, the bus macros provided by Xilinx must be used as VHDL
components in a Handel-C design.
7. Platform Design
One of the main factors that has hindered the implementation of the
on-line placement algorithms developed in the past and presented in
chapter 5 is the development process of reconfigurable modules for the
Xilinx platforms. As we have seen in the previous section, partial
reconfiguration design is done with many restrictions, which make a
systematic development process for partial reconfiguration very
difficult. Each module placed at a given location on the device is
implicitly assigned all the resources in that area. This includes the
device pins, clock managers and other hard macro components like the
embedded multipliers and the embedded BlockRAMs.
Figure 7.15. Modules using resources (pins) not available in their placement area (the
complete column) must use feed-through signals to access those resources
Most of the FPGA platforms available on the market do not provide
solutions to the problems previously mentioned. Many systems on the
market offer various interfaces for audio and video capturing and
rendering, for communication and so forth. However, each interface is
connected to the FPGA using dedicated pins in a fixed location. Modules
wishing to access a given interface like the VGA must be placed in the
area of the chip where the VGA signals are available, such that they can
be assigned those signals among other resources. This makes a relocation
of modules at run-time impossible. Relocation provides the possibility to
have only one bitstream per module, representing the implementation of
that module for a given location. At run-time, the coordinates of the
module are modified to match the location assigned to the module by the
on-line placer. Without relocation, each module must be compiled for each
possible position on the device where it can later be placed. A copy of
the component bitstream must therefore be available for each location on
the chip. The amount of storage needed for such a solution does not make
this approach attractive.
A research platform, the Erlangen Slot Machine (ESM), which we present
next, was developed at the University of Erlangen to overcome the
drawbacks of existing platforms and thus to allow unrestricted on-line
placement on an FPGA platform. Through the presentation of the ESM, we
also hope to highlight some of the requirements in platform design for
reconfigurable computing.
The goal that the ESM designers had in mind while developing a new
platform was to overcome the deficiencies of existing FPGA platforms by
providing:
Feed-through signals routed through the reconfigurable module must be
used in order for the VGA module to access the pins covered by the
reconfigurable module.
Similar situations exist not only on the Celoxica boards: the XESS
boards [56], the Nallatech boards [165] and the Alpha boards [8] face
the same limitations.
On the XF-Board [177] [211] from the ETH Zurich, the connections to the
peripherals are made at the sides of the device. Each module accesses
its peripheral through an operating system (OS) layer implemented on the
left and right parts of the device, between which the components are
swapped in and out. This approach provides only a restricted solution to
the problem, since all modules must be implemented with a given number
of feed-through lines and interfaces to access the OS layer for
communication with the peripherals. The intermodule communication as
well as the communication between a module and its peripheral is done
via buffers provided by the OS. This indirect communication affects the
performance of the system. Many other existing platforms like the RAPTOR
board [132] and the Celoxica RC1000 and RC2000 [43] are PCI systems,
which require a workstation for operation. Their use in stand-alone
systems, as required in embedded applications, is not possible.
The previous limitations were the primary motivation in the design of
the Erlangen Slot Machine (ESM), whose concept we present next.
The FPGA on the ESM BabyBoard is organized in slots, each of which has
access to the SRAM at the top and to the crossbar at the bottom.
The name Erlangen Slot Machine was chosen because it was developed at
the University of Erlangen-Nuremberg in Germany, but mainly because of
this organization in slots. This modular organization of the device
simplifies relocation, a primary condition for a viable reconfigurable
computing system: each module moved from one slot to another will find
the same resources there. The architecture is illustrated in figure
7.17. The BabyBoard is logically divided into columns of 2 CLBs called
micro slots. The micro slots are the smallest allocatable units in the
system. Each module must therefore be implemented in a given number of
micro slots. Due to the number of pins needed to access one of the
external SRAMs, each module wishing to use an SRAM module must be
implemented in a multiple of 3 micro slots.
The CPLD is used to download the Spartan II's configuration from the
flash on power-up. It also contains board initialization routines for
the on-board PLL and the flash.
7.0.5 Memory
Six SRAM banks with 2 MBytes each are vertically attached on the top
side of the device, thus providing enough memory space for six different
slots for temporal data storage. The SRAMs can also be used for
shared-memory communication between neighboring modules, e.g. for
streaming applications. They are connected to the FPGA in such a way
that the reconfiguration of a given module will not affect the accesses
of the other modules.
memory using the external SRAMs or the internal BlockRAMs. However, only
neighboring modules can use these two communication modes. For modules
placed in non-adjacent slots, a dynamic signal-switching communication
scheme, the reconfigurable multiple bus (RMB) presented in section 4.1,
is provided. The crossbar switch can be used for communication between
modules as well.
[Figure: modules M1, M2 and M3 on the FPGA communicating directly and through the external crossbar]
The sender writes the data at one side with its own frequency, while the
receiver reads the data at the other end with its own frequency.
First, the crossbar switch can now be implemented in the FPGA rather than
on a separate device. In this case, we need to attach all the peripherals
on one side of the block in which the crossbar is implemented. Modules
can then be connected on the other side of the block. We can even think
of a system in which all components are attached around the chip; the
crossbar can then be implemented as a ring of distributed modules around
the chip. The distribution of resources like the external RAM need no
longer be done in a columnwise manner. Resources should now be spread
homogeneously around the device, in order to allow the different
modules, placed on different blocks, to access their own resources.
Figure 7.21. Possible enhancement of the Erlangen Slot Machine on the Xilinx Virtex 4 and
Virtex 5 FPGAs
Despite the advantages provided by the new class of Virtex devices, a
great disadvantage that results from this architectural enhancement
concerns the communication: on the columnwise reconfigurable devices,
1-D circuit switching was a simple and efficient means to ensure
communication among the different modules running on the FPGA. In a
two-dimensional reconfigurable device, such a simple solution is no
longer available.
9. Conclusion
We have presented in this chapter the different possibilities to design
for partial reconfiguration on Xilinx devices. Our goal was not to
rewrite a complete manual on partial reconfiguration, since several
descriptions of this already exist [128] [227]. The Xilinx manuals as
well as the work of Sedcole [190] provide very good descriptions of
using the modular design flow and the early access flow. Our primary
motivation was to provide a kind of tutorial, based on our experience,
for which a workable design exists that is not too complex, but also not
too easy. The designs as well as all the scripts needed for compiling
are available for download from the book's web page. The difficulty in
designing for partial reconfiguration can be reduced if the target
platform is well designed. One such platform is the Erlangen Slot
Machine, which was presented with the goal to emphasize the challenges
in designing such a platform. The ESM is however strongly influenced by
the columnwise reconfigurable Virtex, and the price paid for flexibility
is very high. With the advent of the new Virtex 4 and Virtex 5,
enhancements can be made to the design in order to increase the
flexibility while reducing the costs.
One of the main problems remains the communication between the modules
at run-time. While this problem is somewhat better solved in a 1-D
reconfigurable device through the use of circuit switching and dedicated
channels on the modules, its extension to a 2-D reconfigurable device is
not feasible due to the amount of resources needed. A viable
communication approach, the DyNoC, was presented in chapter 6; however,
with the amount of resources needed by such approaches, their
feasibility is only given if manufacturers provide coarse-grained
communication elements in their devices.
MODULE IV
CHAPTER 21
IMPLEMENTING APPLICATIONS WITH FPGAS
Brad L. Hutchings, Brent E. Nelson
Department of Electrical and Computer Engineering
Brigham Young University
Developers can choose from three general classes of devices when implementing
an algorithm or application: microprocessor, FPGA, or ASIC (for simplicity, ASSPs
are not considered here). This section provides a brief summary of the advantages
and disadvantages of these devices in terms of time to market, cost, development
time, power consumption, and debug and verification.
More difficult to quantify but perhaps more important are: (1) refabrications
(respins) caused by either errors in the design or late changes to the specification,
due to a change in an evolving standard, for example, and (2) software
development schedules that depend on access to the ASIC. Both of these items
impact product production schedules; a respin can easily consume an additional
four months, and early access to hardware can greatly accelerate software
development and debug, particularly for the embedded software that communicates
directly with the device.
In light of these considerations, a conservative estimate of the time-to-market
advantage of FPGAs relative to ASICs is 6 to 12 months. Such a reduction is
significant; in consumer electronics markets, many products have only a 24-month
lifecycle.
21.1.2 Cost
Per device, FPGAs can be much less expensive than ASICs, especially in lower
volumes, because the nonrecurring costs of FPGA fabrication are borne by many
users. However, because of their reprogrammability, FPGAs require much more
silicon area to implement equivalent functionality. Thus, at the highest volumes
possible in consumer electronics, FPGA device cost will eventually exceed ASIC
device cost.
FPGA applications can be verified in simulators just as typical ASIC designs are. However, many designers verify their
designs directly, by downloading them into an FPGA and testing them in a system.
With this approach the application can be tested at speed (a million times faster
than simulation) in the actual operating environment, where it is exposed to real-
world conditions. If thorough, this testing provides a stronger form of functional
verification than simulation. However, debugging applications in an FPGA can be
difficult because vendor tools provide much less observability and controllability
than, for example, a hardware description language (HDL) simulator.
FPGA clock rates are limited to about 300 to 350 MHz, while CPUs operate at approximately 3 GHz. As a
result, FPGAs must perform at least 10 times the computational work per cycle to
perform on par with processors. To be a compelling alternative, an FPGA-based
solution should exceed the performance of a processor-based solution by 5 to 10
times and hence must actually perform 50 to 100 times the computational work per
clock cycle. This kind of performance is feasible only if the target application
exhibits a corresponding amount of exploitable parallelism.
The guideline of 5 to 10 times is suggested for two main reasons. First of all,
prior to actual implementation, it is difficult or impossible to foresee the impact of
various system and I/O issues on eventual performance. In our experience, 5 times
can quickly become 2 times or less as various system and algorithmic issues arise
during implementation. Second, application development for FPGAs is much more
difficult than conventional software development. For that reason, the additional
development effort must be carefully weighed against the potential performance
advantages. A guideline of 5 to 10 times provides some insurance that any FPGA-
specific performance advantages will not completely vanish during the
implementation phase.
Ultimately, the intrinsic characteristics of the application place an upper bound on
FPGA performance. They determine how much raw parallelism exists, how
exploitable it is, and how fast the clock can operate. A review of the literature [3–6,
11, 16, 19–21, 23, 26, 28] shows that the application characteristics that have the
most impact on application performance are: data parallelism, amenability to
pipelining, data element size and arithmetic complexity, and sim-ple control
requirements.
Data parallelism
Large datasets with few or no data dependencies are ideal for FPGA
implementation for two reasons: (1) They enable high performance because many
computations can occur concurrently, and (2) they allow operations to be
extensively rescheduled. As previously mentioned, concurrency is extremely important
because FPGA applications must be able to achieve 50 to 100 times the operations
per clock cycle of a microprocessor to be competitive. The ability to reschedule
computations is also important because it makes it feasible to tailor the circuit
design to FPGA hardware and achieve higher performance. For example,
computations can be scheduled to maximize data reuse to increase performance
and reduce memory bandwidth requirements. Image-processing algorithms with
their attendant data parallelism have been among the highest-performing
algorithms mapped to FPGA devices.
Data element size and arithmetic complexity also matter: larger and more
complex data elements and operators require larger circuits, leaving room for
fewer computational elements and less parallelism. Moreover, larger and more
complex circuits exhibit more delay, which slows the clock rate and impacts
performance. Not surprisingly, representing data with the fewest possible bits
and performing computation with the simplest operators generally lead to the
highest performance. Designing high-performance applications in FPGAs almost
always involves a precision/performance trade-off.
Pipelining
Pipelining is essential to achieving high performance in FPGAs. Because FPGA
performance is limited primarily by interconnect delay, pipelining (inserting
registers on long circuit pathways) is an essential way to improve clock rate (and
therefore throughput) at the cost of latency. In addition, pipelining allows
computational operations to be overlapped in time and leads to more parallelism in the
implementation. Generally speaking, because pipelining is used extensively
throughout FPGA-based designs, applications must be able to tolerate some
latency (via pipelining) to be suitable candidates for FPGA implementation.
For example, multiplying two n × n matrices requires 3n² I/Os but n³
calculations and is thus compute bound. In contrast, matrix–matrix addition
requires 3n² I/Os and only n² calculations and is thus I/O bound. Another way
to see this is to note that each source element read from memory in a
matrix–matrix multiplication is used n times and each result is produced using
n multiply–accumulate operations. In matrix–matrix addition, each element
fetched from memory is used only once and each result is produced from only a
single addition.
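Stated as a worked ratio (assuming n × n matrices, as above):

\[
\text{multiplication: } \frac{n^3\ \text{operations}}{3n^2\ \text{I/Os}} = \frac{n}{3},
\qquad
\text{addition: } \frac{n^2\ \text{operations}}{3n^2\ \text{I/Os}} = \frac{1}{3}.
\]

The first ratio grows with the problem size, so added processing elements can be kept busy; the second is constant, so more parallelism only raises the I/O requirement.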
Carefully coordinating data transfer, I/O movement, and computation order is
crucial to achieving enough parallelism to provide effective speedup. The entire
field of systolic array design is based on the concepts of (1) arranging the I/O and
computation in a compute-bound application so that each data element fetched
from memory is reused multiple times, and (2) keeping many processing elements
busy operating in parallel on that data.
FPGAs offer a wide variety of memory elements that can be used to coordinate
I/O and computation: flip-flops to provide single-bit storage (10,000s of bits); LUT-
based RAM to provide many small blocks of randomly distributed memory
(100,000s of bits); and larger RAM or ROM memories (1,000,000s of bits). Some
vendors’ FPGAs contain multiple sizes of random access memories, and these
memories are often easily configured into special-purpose structures such as
dynamic-length shift registers, content-addressable memories (CAMs), and so forth.
In addition to these types of on-chip memory, most FPGA platforms provide off-
chip memory as well.
Increasing the I/O bandwidth to memory is usually critical in harnessing the
parallelism inherent in a computation. That is, after some point, further
multiplying the number of processing elements (PEs) in a design (to increase
parallelism) usually requires a corresponding increase in I/O. This additional
I/O can often be provided by the many on-chip memories in a typical modern
FPGA. The work of Graham and Nelson [8] describes a series of early experiments
to map time-delay SONAR beam forming to an FPGA platform where memory bandwidth
was the limiting factor in design speedup. While the data to be processed were
an infinite stream of large data blocks, many of the other data structures in
the computation were not large (e.g., coefficients, delay values). In this
computation, it was not the total amount of memory that limited the speedup but
rather the number of memory ports available. Thus, the use of multiple small
memories in parallel was able to provide the needed bandwidth.
The availability of many small memories in today’s FPGAs further supports the
idea of trading off computation for table lookup. Conventional FPGA fabrics are
based on a foundation of 4-input LUTs; in addition, larger on-chip memories can be
used to support larger lookup structures. Because the memories already exist on
chip, unlike in ASIC technology, using them adds no additional cost to the system.
A common approach in FPGA-based design, therefore, is to evaluate which parts of
the system’s computations might lend themselves to table lookup and use the
available RAM blocks for these lookups.
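As a software illustration of this computation-for-lookup trade, the sketch below replaces repeated sine evaluations by a 256-entry table of the kind that fits in a single block RAM; the table size and the Q15 output scaling are illustrative choices, not taken from the text.

#include <math.h>
#include <stdio.h>

#define TBL_BITS 8
#define TBL_SIZE (1 << TBL_BITS)

static short sin_tbl[TBL_SIZE];   /* Q15 samples of one sine period */

static void init_table(void) {
    for (int i = 0; i < TBL_SIZE; i++)
        sin_tbl[i] = (short)(32767.0 *
                             sin(2.0 * 3.14159265358979 * i / TBL_SIZE));
}

/* phase is an 8-bit fraction of a full circle; one table lookup
   replaces a full trigonometric evaluation */
static short fast_sin(unsigned char phase) { return sin_tbl[phase]; }

int main(void) {
    init_table();
    printf("sin(90deg) = %d (expected ~32767)\n", fast_sin(64));
    return 0;
}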
In summary, the performance of FPGA-based applications is largely determined
by how much exploitable parallelism is available, and by the ability of the system to
provide data to keep the parallel hardware operational.
21.3 General Implementation Strategies for FPGA-based Systems
21.3.1 Configure-once
Configure-once (during operation) is the simplest and most common way to
implement applications with reconfigurable logic. The distinctive feature of
configure-once applications is that they consist of a single system-wide
configuration. Prior to operation, the FPGAs comprising the reconfigurable resource are
loaded with their respective configurations. Once operation commences, they
remain in this configuration until the application completes. This approach is very
similar to using an ASIC for application acceleration. From the application point of
view, it matters little whether the hardware used to accelerate the application is an
FPGA or a custom ASIC because it remains constant throughout its operation.
Global RTR
Global RTR allocates all (FPGA) hardware resources in each configuration step.
More specifically, global RTR applications are divided into distinct temporal phases,
with each phase implemented as a single system-wide configuration that occupies
all system FPGA resources. At runtime, the application steps through each phase
by loading all of the system FPGAs with the appropriate configuration data
associated with a given phase.
Local RTR
Local RTR takes an even more flexible approach to reconfiguration than does
global RTR. As the name implies, these applications locally (or selectively)
reconfigure subsets of the logic as they execute. Local RTR applications may configure
any percentage of the reconfigurable resources at any time, individual FPGAs may
be configured, or even single FPGA devices may themselves be partially
reconfigured on demand. This flexibility allows hardware resources to be
tailored to the runtime profile of the application with finer granularity than
is possible with global RTR. Whereas global RTR approaches implement the
execution process by loading relatively large, global application partitions,
local RTR applications need load only the necessary functionality at each
point in time. This can reduce the
amount of time spent downloading configurations and can lead to a more efficient
runtime hardware allocation.
The organization of local RTR applications is based more on a functional division
of labor than the phased partitioning used by global RTR applications. Typically,
local RTR applications are implemented by functionally partitioning an application
into a set of fine-grained operations. These operations need not be temporally
exclusive—many of them may be active at one time. This is in direct contrast to
global RTR, where only one configuration (per FPGA) may be active at any given
time. Still, with local RTR it is important to organize the operations such that idle
circuitry is eliminated or greatly reduced. Each operation is implemented as a
distinct circuit module, and these circuit modules are then downloaded to the
FPGAs as necessary during operation. Note that, unlike global RTR, several of
these operations may be loaded simultaneously, and each may consume any
portion of the system FPGA resources.
RTR applications
Runtime Reconfigured Artificial Neural Network (RRANN) is an early example of a
global RTR application [7]. RRANN divided the back-propagation algorithm (used to
train neural networks) into three temporally exclusive configurations that were
loaded into the FPGA in rapid succession during operation. It demonstrated a
500 percent increase in density by eliminating idle circuitry in individual
algorithm phases.
RRANN was followed up with RRANN-2 [9], an application using local RTR. Like
RRANN, the algorithm was still divided into three distinct phases. However, unlike
the earlier version, the phases were carefully designed so that they shared common
circuitry, which was placed and routed into identical physical locations for each
phase. Initially, only the first configuration was loaded; thereafter, the common
circuitry remained resident and only circuit differences were loaded during
operation. This reduced configuration overhead by 25 percent over the global RTR
approach.
The Dynamic Instruction Set Computer (DISC) [29] used local RTR to create a
sequential control processor with a very small fixed core that remained
resident at all times. This resident core was augmented by circuit modules that were
dynamically loaded as required by the application. DISC was used to implement an
image-processing application that consisted of various filtering operations. At
runtime, the circuit modules were loaded as necessary. Although the application
used all of the filtering circuit modules, it did not require all of them to be loaded
simultaneously. Thus, DISC loaded circuit modules on demand as required. Only a
few active circuit modules were ever resident at any time, allowing the application
to fit in a much smaller device than possible with global RTR.
Unlike global RTR, where circuit interfaces typically remain fixed between
configurations, local RTR allows these interfaces to change with each
configuration. When circuit
configurations become small enough for multiple configurations to fit into a single
device, the designer needs to ensure that all configurations will interface correctly
one with another. Moreover, the designer may have to ensure not only structural
compliance but physical compliance as well. That is, when the designer creates
circuit configurations that do not occupy an entire FPGA, he or she will have to
ensure that the physical footprint of each is compatible with that of others that may
be loaded concurrently.
form 1.9, where the largest number is 0.111111111 = 0.998 (decimal) and the
smallest is 1.000000000 = −1 (decimal). As can be seen, fixed-point arithmetic
exactly follows the rules learned in grade school, where lining up the implied
binary point is required for performing addition or subtraction.
When designing with fixed-point values, one must keep track of the number
format on each wire; such bookkeeping is one of the design costs associated with
fixed-point design. At any point in a computation, either truncation or rounding can
be used to reduce the number of bits to the right of the binary point, the effect being
to simply reduce the precision with which the number is represented.
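A small software sketch of this bookkeeping, assuming 9 fractional bits as in the 1.9 format above; the type and helper names are made up:

#include <stdint.h>
#include <stdio.h>

#define FRAC 9                  /* bits to the right of the binary point */
typedef int32_t fx_t;           /* value scaled by 2^FRAC */

/* multiplying two values with FRAC fractional bits yields 2*FRAC
   fractional bits; rounding discards FRAC of them again */
static fx_t fx_mul_round(fx_t a, fx_t b) {
    int64_t p = (int64_t)a * b;         /* 2*FRAC fractional bits */
    p += (int64_t)1 << (FRAC - 1);      /* round to nearest */
    return (fx_t)(p >> FRAC);           /* back to FRAC fractional bits */
}

int main(void) {
    fx_t half = 1 << (FRAC - 1);             /* 0.5  */
    fx_t x    = (fx_t)(0.75 * (1 << FRAC));  /* 0.75 */
    printf("0.5 * 0.75 = %f\n",
           fx_mul_round(half, x) / (double)(1 << FRAC));
    return 0;
}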
The use of floating point in FPGA-based design has been the topic of much
research over the past decade. Early papers, such as Ligon and colleagues [15]
and Shirazi et al. [24], focused on the cost of floating point and demonstrated
that small floating-point formats as well as single-precision formats could
eventually be implemented using FPGA technology. Later work, such as that by
Bellows and Hutchings [1] and Roesler and Nelson [22], demonstrated novel ways
of leveraging FPGA-specific features to more efficiently implement
floating-point modules. Finally,
Underwood [27] argued that the capabilities of FPGA-based platforms for
performing floating point would eventually surpass those of standard computing
systems.
All of the research just mentioned contains size and performance estimates for
floating-point modules on FPGAs at the time they were published. Clever design
techniques and growing FPGA densities and clock rates continually combine to
produce smaller, faster floating-point circuits on FPGAs. At the time of this writing,
floating-point module libraries are available from a number of sources, both
commercial and academic.
The key is that BFP allows for growth in the range of values in the data block
while retaining the low cost of fixed-point computations. Block floating point has
found extensive use in fast Fourier transform (FFT) computations where an input
block (such as from an A/D converter) may have a limited range of values, the data
is processed in stages, and stage boundaries provide natural renormalization
locations.
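A sketch of the renormalization step at such a stage boundary, for 16-bit data with one shared exponent per block; the structure and names are assumptions for illustration, not taken from the text:

#include <stdint.h>
#include <stdio.h>

typedef struct {
    int16_t *data;   /* fixed-point mantissas */
    int      n;      /* block length */
    int      exp;    /* shared exponent: value[i] = data[i] * 2^exp */
} bfp_block;

/* shift the whole block left until the largest magnitude uses the full
   range, and compensate in the shared block exponent */
static void bfp_renormalize(bfp_block *b) {
    int16_t maxmag = 0;
    for (int i = 0; i < b->n; i++) {
        int16_t m = (int16_t)(b->data[i] < 0 ? -b->data[i] : b->data[i]);
        if (m > maxmag) maxmag = m;
    }
    int shift = 0;
    while (maxmag != 0 && !(maxmag & 0x4000)) {  /* bit 14, below the sign */
        maxmag <<= 1;
        shift++;
    }
    for (int i = 0; i < b->n; i++)
        b->data[i] = (int16_t)(b->data[i] << shift);
    b->exp -= shift;              /* represented values are unchanged */
}

int main(void) {
    int16_t d[4] = {100, -200, 50, 25};
    bfp_block b = {d, 4, 0};
    bfp_renormalize(&b);
    printf("exp=%d, d[1]=%d\n", b.exp, b.data[1]);
    return 0;
}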
FIGURE 21.1 ■ Two comparator implementations: (a) with and (b) without constant folding.
21.5 SUMMARY
FPGAs provide a flexible, high-performance, and reprogrammable means for
implementing a variety of electronic applications. Because of their
reprogrammability, they are well suited to applications that require some form of direct
reprogrammability, and to situations where reprogrammability can be used indirectly
to increase reuse and thereby reduce device cost or count. FPGAs achieve the
highest performance when the application can be implemented as many hardware
units operating in parallel, and where the aggregate I/O requirements for
these parallel units can be reasonably met by the overall system. Most FPGA
applications are described using HDLs because HDL tools and synthesis software
are mature and well developed, and because, for now, they provide the best means
for describing applications in a highly parallel manner.
Once FPGAs are determined to be a suitable choice, there are several ways to
tailor the system design to exploit their reprogrammability by reconfiguring them at
runtime or by compiling specific, temporary application-specific data into the FPGA
circuitry. Performance can be further enhanced by crafting arithmetic circuitry to
work around FPGA limitations and to exploit the FPGA’s special arithmetic
features. Finally, FPGAs provide additional debug and verification methods that are
not available in ASICs and that enable debug and verification to occur in a system
and at speed.
In summary, FPGAs combine the advantages and disadvantages of
microprocessors and ASICs. On the positive side, they can provide high performance that is
achievable only with custom hardware, they are reprogrammable, and they can be
purchased in volume as a fully tested, standard product. On the negative side,
they remain largely inaccessible to the software community; moreover,
high-performance application development requires hardware design and the use of
standard synthesis tools and Verilog or VHDL.
References
P. Bellows, B. L. Hutchings. JHDL—An HDL for reconfigurable systems. Proceedings of
the IEEE Workshop on FPGAs for Custom Computing Machines, April 1998.
B. Catanzaro, B. Nelson. Higher radix floating-point representations for FPGA-based
arithmetic. Proceedings of the IEEE Symposium on FPGAs for Custom Computing
Machines, April 2005.
W. Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider. Exploring architectures
for volume visualization on the Teramac custom computer. Proceedings of the IEEE
Workshop on FPGAs for Custom Computing Machines, April 1996.
E. Roesler, B. Nelson. Novel optimizations for hardware floating-point units in a modern FPGA
architecture. Proceedings of the 12th International Workshop on Field-Programmable Logic
and Applications, August 2002.
N. Shirazi, P. M. Athanas, A. L. Abbott. Implementation of a 2D fast Fourier transform on an
FPGA-based custom computing machine. Fifth International Workshop on Field-Programmable
Logic and Applications, September 1995.
N. Shirazi, A. Walters, P. Athanas. Quantitative analysis of floating point arithmetic on FPGA-
based custom computing machines. Proceedings of the IEEE Workshop on FPGAs for Custom
Computing Machines, April 1995.
A. Slade, B. Nelson. Reconfigurable computing application frameworks. Proceedings of
the IEEE Symposium on Field-Programmable Custom Computing Machines, April 2003.
L. E. Turner, P. J. W. Graumann, S. G. Gibb. Bit-serial FIR filters with CSD coefficients
for FPGAs. Fifth International Workshop on Field-Programmable Logic and Applications,
September 1995.
K. Underwood. FPGAs vs. CPUs: Trends in peak floating-point performance. Proceedings
of the ACM/SIGDA 12th International Symposium on Field-Programmable Gate Arrays, 2004.
J. E. Vuillemin. On computing power. Programming languages and system architectures.
Lecture Notes in Computer Science, vol. 781, Springer-Verlag, 1994.
M. J. Wirthlin, B. L. Hutchings. A dynamic instruction set computer. Proceedings of the
IEEE Workshop on FPGAs for Custom Computing Machines, April 1995.
M. J. Wirthlin. Constant coefficient multiplication using look-up tables. Journal of VLSI Signal
Processing 36, 2004.
APPLICATIONS
Up to now, we have focused on the technology behind reconfigurable devices,
their integration in different systems, and their programmability at different
levels of abstraction. The natural question which arises is in which areas
reconfigurable computing could be helpful. If areas of application exist, then
we would also like to know what the gain of using reconfigurable technology
is, compared to other processing devices used before. The gain of using one
solution rather than another is usually expressed as the computation speed,
the improvement in power consumption, the price, or even the ease of
programming. In [57], Craven and Athanas provide a performance/price
comparative study between FPGA-based high-performance computing machines and
traditional supercomputers. Based on the gain in accelerating floating-point
computations, they come to the conclusion that the price to pay is too high
compared to the marginal gain obtained when using FPGAs in supercomputers.
Other researchers have reported substantial gains with FPGA-based computing
machines in several fields of computation, and several companies that started
offering FPGA-based computing solutions a decade ago are still operating and
even growing. Our goal in this chapter is to present some applications that
benefit from an implementation on reconfigurable devices. Almost all the
experiments reported were done on FPGA platforms. We will not report all the
experiments in detail; we rather focus on the coarse-grained structure of the
applications and point out where reconfiguration can be helpful. We are
primarily concerned with the speed improvement and the use of the flexibility
of reconfigurable computing systems.
Despite the very large number of applications that could benefit from an
implementation in FPGAs, we have selected only six application domains, which
we present in the following sections.
1. Pattern Matching
Pattern matching can be defined as the process of checking if a character
string is part of a longer sequence of characters. Pattern matching is used in
a large variety of fields in computer science. In text processing programs
like Microsoft Word, pattern matching is used in the search function. The
purpose is to match the keyword being searched against a sequence of
characters that builds the complete text. In database information retrieval,
the contents of different fields of a given database entry are matched against
the sequence of characters that builds the user request. Searching in genetic
databases also uses pattern matching to match the contents of character
sequences from database entries with the sequence of characters that builds a
given query. In this case the alphabet is built upon a set of genomic
characters. Speech recognition and other pattern recognition tools also use
pattern matching as a basic operation on top of which complex intelligent
functions may be built, in order to better classify the audio sequences. Other
applications using pattern matching are: dictionary implementation, spam
avoidance, network intrusion detection and content surveillance.
Because text mining, whose primary role is the categorization of documents,
makes heavy use of pattern matching, we choose in this section to present the
use of pattern matching in text mining and to point out the possible use of
reconfiguration.
Document categorization is the process of assigning a given set of documents
from a database to a given set of categories. The categories can either be
manually defined by a user, or they can be computed automatically by a
software tool. In the first case the categorization is supervised, and in the
second case we have an unsupervised categorization. In categorization, the
first step usually consists of indexing the collection of available documents.
This is usually done through the so-called vector space representation. In
this model a document is represented as a vector of keywords or terms present
in the document. A complete collection of n documents over a list of m
keywords is then represented by a term-by-document matrix A ∈ R^(m×n). An
entry a_ij in A represents the frequency of the word i in the document j. The
term-by-document matrix is then used for indexing purposes or statistical
analysis like LSI (Latent Semantic Indexing) [67]. Building a term-by-document
matrix is done by scanning the documents of the given collection in order to
find the appearances of keywords and update the corresponding entry in the
matrix for each document. Pattern matching is used for this purpose.
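A software sketch of this scanning step — counting, for one document, the frequency of each keyword to fill one column of A; the keyword list and the sample document are made up:

#include <stdio.h>
#include <string.h>

#define M 3
static const char *keywords[M] = {"fpga", "logic", "routing"};

/* count keyword occurrences: freq[i] becomes the entry a_ij for this document */
static void index_document(const char *doc, int freq[M]) {
    char buf[1024];
    strncpy(buf, doc, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *tok = strtok(buf, " .,;\n"); tok; tok = strtok(NULL, " .,;\n"))
        for (int i = 0; i < M; i++)
            if (strcmp(tok, keywords[i]) == 0) freq[i]++;
}

int main(void) {
    int freq[M] = {0};
    index_document("fpga routing and fpga logic", freq);
    for (int i = 0; i < M; i++)
        printf("a[%d][j] = %d\n", i, freq[i]);
    return 0;
}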
The first advantage of using a reconfigurable device here is the inherent
fine-grained parallelism that characterizes the search. Many different words
can be searched for in parallel by matching the input text against a set of
words on different paths. The second advantage is the possibility of quickly
exchanging the list of searched words by means of reconfiguration.
Pattern matching has been investigated in different works using different
approaches on FPGAs [52] [85] [104] [151] [101] [180] [34], each with a
different overhead in compiling the set of words down to the hardware and with
different capacities. We define the capacity of a search engine as the number
of words that can be searched for in parallel. A large capacity also means a
high complexity of the function to be implemented in hardware, which in turn
means a large amount of resources. The goal in implementing a search engine in
hardware is to have a maximal hardware utilization, i.e. as many words as
possible that can be searched for in parallel. We present in this section
various hardware implementations of pattern matching, each with its advantages
and drawbacks.
1 With dictionary, we mean here the set of keywords compiled into the reconfigurable device.
Figure 9.1. Sliding windows for the search of three words in parallel
This method however has a main drawback, which is the redundancy in the
number of register fields used to store the words as well as in the number of
comparators. Redundancy in the sliding-windows approach reduces the capacity
of such an implementation, thus making its use not competitive. In the case of
the Xilinx Virtex FPGA XCV300 with a maximum of 6144 flip-flops, a sliding
window of length 10 needs 10 × 8 = 80 flip-flops to store one target word in
the device. Assume that placement and routing permit a device utilization of
80%. The number of words which can be folded into such an FPGA will
theoretically be in the range of 60, which is small compared to the number of
words which can be folded into the same device with a more efficient
implementation. A better approach would be to perform the comparison only once
for each character of a given alphabet and to use the result in the evaluation
of each single word. This leads to a reduced number of register fields and a
reduced number of comparators.
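The capacity estimate above, spelled out:

\[
\frac{6144 \times 0.8}{80\ \text{flip-flops per word}} \approx 61\ \text{words}.
\]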
In the SPLASH implementation, a hashing technique is used. A hash table in
memory stores the value of a presence bit for a particular target word. An
entry zero in this table means that the word is not present, and a value one
means that the word is probably present. A hash register (with a length of 22
bits in the SPLASH implementation), which is initially set to zero, is
incrementally modified by the incoming superbytes. At the end of a word,
marked by a delimiter, the hash register is reset to zero to allow the next
word to be processed. The hash register is used to address the hash table of
2^22 pseudo-random mappings of words. The modification of the hash register
happens as follows: when a non-delimiter superbyte is encountered, the
contents of the 22-bit hash register are updated by first XOR-ing the upper 16
bits of the register with the incoming superbyte and a value of a hash
function. The modified value of the hash register is then circularly shifted
by a fixed number of bits. Upon reaching the end of a word, the content of the
register is used to address the hash table and determine the value of the
presence bit. To reduce the likelihood of false hits, a number of independent
hash functions are calculated for each word of the text, with the stipulation
that each hash function lookup of a word must result in a hit for that word to
be counted as a target word.
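A software model of one update step of this hash register follows; the 22-bit width and the 16-bit superbytes are from the text, while the hash-function value and the shift amount are placeholders:

#include <stdint.h>

#define HBITS 22
#define HMASK ((1u << HBITS) - 1)
#define SHIFT 5                       /* fixed circular shift (assumed) */

/* XOR the upper 16 bits of the register with the superbyte combined
   with a hash-function value, then rotate within the 22 bits */
uint32_t hash_step(uint32_t hreg, uint16_t superbyte, uint16_t hval) {
    hreg ^= ((uint32_t)(superbyte ^ hval)) << (HBITS - 16);
    hreg &= HMASK;
    return ((hreg << SHIFT) | (hreg >> (HBITS - SHIFT))) & HMASK;
}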
Estimations of the performance of the SPLASH text searching have been done
mainly on the basis of the critical path length returned by the place and
route tool. Communication between the host computer and the boards, as well as
memory latency, has not been taken into consideration.
Figure 9.2. FSM recognizers for the word "conte": a) state diagram, b) transition table, c) basic
structure of the hardware implementation; 4 flip-flops are needed to code a 5 × 6 transition table
Words that share a common prefix use a common path from the root, whose
length corresponds to that of their common prefix. A split occurs at the node
where the common prefix ends. In the hardware implementation of a group of
words with a common prefix, common flip-flops will be used to implement the
common path (figure 9.3a)).
Figure 9.3. a) Use of the common prefix to reduce the number of flip-flops of the common
word detector for "partir", "paris", "avale", "avant". b) implementation without use of common
prefix and common comparator set
For each character in the target alphabet only one comparator is needed.
The comparison then occurs only once, at a particular location in the device.
Figure 9.4. Basis structure of a FSM-based words recognizer that exploits the common prefix
and a common set of characters
The one-hot approach (figure 9.2c)) is probably the best implementation for
the kind of automaton realized in the state decoder. First, it takes advantage
of the regular structure of the FPGA and is the fastest way to implement an
FSM in hardware [142]. Moreover, there is no need to incorporate backtracking
edges in a suitable hardware representation. To implement a set of different
words, the resulting tree structure of the common FSM is mapped into hardware.
All the leaves corresponding to target words are connected to the hit port.
The width of the hit port is determined by the number of positions needed to
code the number of target words. Upon matching a particular word,
one leaf of the automaton is reached and the hit output is set to the index
of the matched target word.
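The behavior of such a one-hot recognizer can be modeled in software as follows; the two example words and the input text are made up, and each array element stands for one flip-flop of the prefix tree:

#include <stdio.h>

#define LEN 5   /* both example words have five characters */

int main(void) {
    const char *text = "nous allons a paris";
    const char *w1 = "paris", *w2 = "avale";
    int s1[LEN + 1] = {1};   /* s[0] is the always-active root state */
    int s2[LEN + 1] = {1};

    for (const char *p = text; *p; p++) {
        /* a one-hot token advances when its predecessor state is active
           and the character comparator for this edge matches */
        for (int i = LEN; i >= 1; i--) {
            s1[i] = s1[i - 1] && (*p == w1[i - 1]);
            s2[i] = s2[i - 1] && (*p == w2[i - 1]);
        }
        if (s1[LEN]) printf("hit: %s\n", w1);   /* leaf reached */
        if (s2[LEN]) printf("hit: %s\n", w2);
    }
    return 0;
}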
Figure 9.5. Processing steps of the FSM for the word "tictic"
2 Can the state machine recognize the word "issip" in a file containing the word Mississippi? [52]
2. Video Streaming
Video streaming is the process of performing computations on video data,
which are streamed through a given system, picture after picture. The
implementation of video streaming on FPGAs has attracted several researchers
in the past, resulting in the building of various platforms [39] [115] [198]
[72] [207] [140] [201] for various computations. Two main reasons are always
stated as motivation for implementing video streaming in FPGAs: the
performance, which results from exploiting the inherent parallelism in the
target applications to build a tailored hardware architecture, and the
adaptivity, which can be used to adjust the overall system by exchanging parts
of the computing blocks with better adapted ones.
In [193], Richard G. Shoup discusses the idea of parameterized program
generation of convolution filters in an FPGA. A 2-D filter is assembled from a
set of multipliers and adders, which are in turn generated from a canonical
serial-parallel multiplier stage. Atmel application notes [13] discuss 3x3
convolution implementations with a run-time reconfigurable vector multiplier
in Atmel FPGAs. To overcome the difficulties of programming devices with
classic hardware description languages (HDL) like VHDL and Verilog, Celoxica
has developed the Handel-C language. Handel-C is essentially an extended
subset of the standard ANSI-C language, specifically designed for use in a
hardware environment. The Celoxica DK1 development environment for the
Handel-C compiler includes PixelStream, a tool for the easy generation of
video processing functions for FPGA implementations. PixelStream offers a set
of basic modules and functions that can be (graphically) combined to build a
more complex datapath in the FPGA. The resulting design can be mapped to one
of the Celoxica FPGA development boards.
We will not focus in this section on the details of video or image
processing. A comprehensive description of video processing algorithms and
their hardware implementation on FPGAs is provided in [153].
The Sonic architecture [115] is a configurable computing platform for the
acceleration of real-time video image processing. The platform consists of
plug-in processing elements (PIPEs), which are FPGA daughter cards that can be
mounted on a PCI board plugged into a PCI slot of a workstation. A PIPEflow
bus exists to connect adjacent PIPEs, while the PIPE bus provides a global
connection to the PIPEs. Sonic's architecture exploits the reconfiguration
capabilities of the PIPEs to adapt part of the computation flow at run-time.
The Splash [97] architecture was also used in video processing. Its
systolic array structure makes it well suited to image processing.
The ESM platform introduced in section 7.0.2 presents an optimal pipelined
architecture for the modular implementation of video streams. Its organization
in exchangeable slots, each of which can be reconfigured at run-time to
perform a different computation, makes it a viable platform for video
streaming. Moreover, the communication infrastructure available on the
platform provides unlimited access of modules to their data, no matter in
which slot they are placed, thus increasing the degree of flexibility in the
system.
The computation on video frames is usually performed in different steps,
while the pictures stream through the datapath. Video processing therefore
presents a nice structure for a pipelined computation. This has led to a
chained architecture on the basis of which most video streaming systems are
built. The chain consists of a set of modules, each of which is specialized
for a given computation. The first module in the chain is in charge of
capturing the video frames, while the last module outputs the processed
frames. The output can be rendered on a monitor or sent as compressed data
over a network. Between the first and the last modules, several computational
modules can be used, according to the algorithm implemented. A module can be
implemented in hardware or in software according to the goals. While software
provides a great degree of flexibility, it is usually not fast enough to carry
out the challenging computations required in video applications. ASICs can be
used to implement the computationally demanding parts; however, ASICs do not
provide the flexibility needed in many systems. On a reconfigurable platform,
the flexibility of a processor can be combined with the efficiency of ASICs in
order to build a high-performance flexible system. The partial reconfiguration
capability of reconfigurable devices provides the possibility to replace a
given module in the chain at run-time.
Most video capture modules provide the frames to the system in a
pixel-by-pixel manner, leading to a serial computation on the incoming pixels.
Since many algorithms need the neighborhood of a pixel to compute its new
value, a complete block must be stored and processed for each pixel.
Capturing the neighborhood of a pixel is often done using a sliding window data
structure of varying size. The size of the sliding window corresponds to that
of the neighborhood region of a pixel needed for the computation of the new
value. As shown in figure 9.6, a sliding window is a data structure used to
sequentially capture the neighborhood of pixels in a given image.
A given set of buffers (FIFOs) is used to update the window; the number of
FIFOs varies according to the size of the window. In each step, a pixel is
read from the memory and placed in the lower left cell of the window. Except
for the upper right pixel, which is discarded (i.e., output), all the pixels
in the rightmost column of the window are placed at the tail of the FIFOs one
level higher; a software model of this mechanism is sketched after the
following list. The processing part of the video is a normal image processing
algorithm combining some of the basic operators like:
Median filtering
Basic Morphological operations
Convolution
Edge detection
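A minimal software model of the sliding window and its line FIFOs, for a 3 × 3 window; the image width and the test pattern are arbitrary choices:

#include <stdio.h>

#define W 320                         /* image width (illustrative) */
#define K 3                           /* 3x3 window */

static unsigned char fifo[K - 1][W];  /* line buffers holding older rows */
static unsigned char win[K][K];       /* the window registers */

/* push one new pixel of column x: the window shifts one column right
   and the FIFOs advance by one row */
static void push_pixel(int x, unsigned char p) {
    for (int r = 0; r < K; r++)
        for (int c = K - 1; c > 0; c--)
            win[r][c] = win[r][c - 1];  /* shift; win[0][K-1] is dropped */
    win[2][0] = p;                      /* new pixel, lower left cell */
    win[1][0] = fifo[1][x];             /* same column, one row up */
    win[0][0] = fifo[0][x];             /* two rows up */
    fifo[0][x] = fifo[1][x];            /* FIFOs move one level higher */
    fifo[1][x] = p;
}

int main(void) {
    for (int y = 0; y < 4; y++)         /* stream a small test pattern */
        for (int x = 0; x < W; x++)
            push_pixel(x, (unsigned char)(10 * y + (x & 7)));
    printf("window center = %d\n", win[1][1]);
    return 0;
}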
The second module reads the pixels into the sliding windows and passes each
neighborhood to the third module, which processes the pixel and saves it in
its own memory or directly passes it to the next module. This architecture
presents a pipelined computation in which the computational blocks are the
modules that process the data frames. RAMs are used to temporally store frames
between two modules, thus allowing a frame to stream from RAM to RAM and the
processed pictures to the output.
This is also true for a median operator, which cannot be replaced by a Gauss
operator by just changing parameters. The network capture module and the
camera capture module require two different algorithms for collecting the
pixels and bringing them into the system. The second possibility consists of
replacing the complete module at run-time with a module of the same size, but
different in its structure, while the rest of the system keeps running.
Reconfigurable devices in general, and FPGAs in particular, fulfill this
requirement: many available FPGAs can be partly configured while the rest of
the system is running. Moreover, many FPGAs provide small on-chip memories
able to hold part of a frame (a so-called region of interest). It is therefore
possible to perform many computations in parallel on different regions of
interest, thus increasing the performance and flexibility of video
applications.
3. Distributed Arithmetic
Distributed arithmetic (DA) is certainly one of the most powerful tools for
the computation of the dot product of two vectors, one of which is constant,
i.e. it consists of constant values. DA exploits the nature of LUT-based
computation provided by FPGAs by storing in a LUT all the possible results for
a set of variable combinations. The computation at run-time then only consists
of retrieving the results from the LUT, where they were previously stored. The
elements of the variable vector are used to address the look-up table and
retrieve partial sums in a bit-serial (1-BAAT, 1 Bit At A Time) manner.
Investigations of distributed arithmetic have been reported in [215] [158]
[223] [224] [33] [95] [63] [62] [31].
One of the best-known DA contributions is the work of White [215], who
proposed the use of ROMs to store the precomputed values. The surrounding
logic to access the ROM and retrieve the partial sums had to be implemented on
a separate chip. Because of this cumbersome two-chip architecture, distributed
arithmetic could not initially be successful. With the appearance of
SRAM-based FPGAs, DA became an interesting alternative for implementing signal
processing applications in FPGAs [158] [223] [224]. Because of the
availability of SRAM in those FPGAs, the precomputed values could now be
stored on the same chip as the surrounding logic.
In [33], the design automation of DA architectures was investigated and a
tool was designed to help the user in the code generation of a complete DA
design and to perform investigations of the various trade-offs. Also, an
initial attempt to implement DA for floating-point numbers, in order to
increase the accuracy, was presented. However, the amount of memory required
to implement such a solution makes it applicable only to small examples.
The idea behind the distributed arithmetic is to distribute the bits of one
operand across the equation to be computed. This operation results in a new
one, which can then be computed in a more efficient way. The dot product of
the two vectors X and A, where each variable X_i is coded on w bits (X_ij
denoting the j-th bit of X_i), is given by the following equation:

Z = \sum_{i=0}^{n} A_i \times X_i
  = \sum_{i=0}^{n} A_i \times \sum_{j=0}^{w-1} X_{ij} 2^j
  = \sum_{j=0}^{w-1} 2^j \times \sum_{i=0}^{n} X_{ij} A_i     (3.2)
At each step, the n-tuple of bits (X_{0j}, ..., X_{nj}) is used to address the
DALUT and retrieve the partial sum \sum_{i=0}^{n} X_{ij} A_i: the n-tuple
(0, 0, ..., 0) addresses location 0, (0, ..., 0, 1) addresses location 1, and
(1, 1, ..., 1) addresses location 2^n − 1.
The complete dot product Z requires w steps to be computed in a 1-BAAT
manner. At step j, the j-th bits of the variables are used to address the
DALUT and retrieve the value \sum_{i=0}^{n} X_{ij} A_i. This value is then
left-shifted by j positions (a multiplication by 2^j) and added to the value
in the accumulator. After w steps, the result is collected from the
accumulator.
Many enhancements can be made to a DA implementation. The size of the
DALUT, for example, can be halved if only positive values are stored. In this
case the sign bit, which is the first bit of a number, is used to decide
whether the retrieved value should be added to or subtracted from the
accumulated sum. On the other hand, it is obvious that all the bit operations,
i.e. the retrievals of a value from the DALUT, are independent of each other.
The computation can therefore be performed in parallel. The degree of
parallelism in a given implementation depends on the available memory to
implement the DALUTs. In the case where w DALUTs and datapaths can be
instantiated in parallel, the retrieval of all partial sums can be done in
only one step, meaning that the complete computation can be done in only one
step instead of w as in the serial case.
In general, if k DALUTs are instantiated in parallel, i.e. for a computation
performed on a k-BAAT basis, then w/k steps are required to retrieve all the
partial sums, which can be directly accumulated, thus leading to a run-time of
w/k steps. Figure 9.9 shows a datapath for the DA computation.
The input values are segmented in different fields, each of which is assig
ned to a datapath for the retrieval of the partial sums from the corresponding
DA-LUT. The retrieved values from the k DALUTs are shifted by the
correspond-ing values and added to the accumulated sum. After w/k steps,
the result can be collected from the accumulator.
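Continuing the sketch above (same includes and buildDalut), a hypothetical k-BAAT routine can emulate the k parallel datapaths; in hardware the k lookups of one step would happen concurrently, whereas here an inner loop stands in for them. We assume k divides w evenly:

```cpp
// k-BAAT emulation: k DALUT lookups per step (concurrent in hardware),
// so only w/k sequential steps remain. Assumes k divides w evenly.
int64_t dotProductDA_kBAAT(const std::vector<int64_t>& dalut,
                           const std::vector<uint64_t>& X,
                           unsigned w, unsigned k) {
    int64_t acc = 0;
    for (unsigned step = 0; step < w / k; ++step) {
        for (unsigned p = 0; p < k; ++p) {   // p-th parallel datapath
            unsigned j = step * k + p;       // bit position it handles
            std::size_t addr = 0;
            for (std::size_t i = 0; i < X.size(); ++i)
                addr |= ((X[i] >> j) & 1u) << i;
            acc += dalut[addr] << j;         // shift by j, accumulate
        }
    }
    return acc;
}
```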
Equation 3.4 can be implemented using s DALUTs. The i-th (i ∈ {1, ..., s}) DALUT is used to compute the vector product $z_i = \sum_{j=0}^{r} x_j \times a_{ij}$ with the constants $a_{i1}$ to $a_{ir}$. If there is enough space on the chip to hold all the DALUTs, then equation 3.4 can be computed in a few clock cycles. As we will see later, partial reconfiguration can also be used to implement many dot-products if there is not enough memory on the chip to hold all the DALUTs.
for each number. The integer part is separated from the fractional part using a point. Because the point is only imaginary and not physically present in the number representation, operations on fixed-point numbers do not differ from those on normal integers. The datapath previously presented can be used without modification for real numbers represented as fixed-point. Representing real numbers as fixed-point is advantageous in terms of computation speed and the amount of resources required to implement the datapath. However, the range of a fixed-point representation, as well as its precision, is small compared to that of a floating-point representation. Therefore we would like to handle real numbers as floating-point.
In this section we present a concept, first investigated in [33], for handling real numbers represented as floating-point in the IEEE 754 format using distributed arithmetic. In IEEE 754 format, a number X is represented by an exponent $e_X$ and a mantissa $m_X$, with value $X = 2^{e_X} \times m_X$. The dot-product therefore becomes:

$$Z = X \times A = \sum_{i=0}^{n} (X_i \times A_i) = \sum_{i=0}^{n} (2^{e_{X_i}} \times m_{X_i}) \times (2^{e_{A_i}} \times m_{A_i}) = \sum_{i=0}^{n} 2^{e_{A_i} + e_{X_i}} \times (m_{A_i} \times m_{X_i}) \quad (3.6)$$
The goal here is to compute and provide Z as a floating-point number, using the same principle of partial sums previously described. At each step of the computation, a floating-point value $F_i$ must be read from the floating-point DALUT and added to an accumulated sum, which is also a floating-point number. Since the adder used at this stage is a floating-point adder, we assume that issues like rounding and normalization have been integrated in its implementation. From the last part of equation 3.6, we see that the value $2^{e_{A_i}+e_{X_i}} \times (m_{A_i} \times m_{X_i})$ represents a floating-point number whose exponent is $e_{A_i}+e_{X_i}$ and whose mantissa is $m_{A_i} \times m_{X_i}$; both parts can be precomputed and stored in two separate look-up tables. We call the first DALUT, which stores the exponents, the EDALUT, and the second DALUT, which stores the mantissas, the MDALUT. The sizes of the EDALUT and the MDALUT are given in equations (3.7) and (3.8):

$$size(EDALUT) = n \times 2^{|E|} \times |E| \text{ bits} \quad (3.7)$$

$$size(MDALUT) = n \times 2^{|M|} \times |M| \text{ bits} \quad (3.8)$$
|E| is the exponent width and |M| is the mantissa width of the floating-point representation.
The main argument against this approach is the size of the DALUTs used for this floating-point DA implementation, which can be so large that an implementation becomes impossible. However, if we consider a DA implementation involving five variables and five coefficients represented in the IEEE 754 floating-point format with an 8-bit exponent and a 10-bit mantissa, the total memory requirement for the EDALUTs and the MDALUTs is: $((5 \times 2^8 \times 8) + (5 \times 2^{10} \times 10))/1024 = 60$ Kbits. The EDALUTs and the MDALUTs will therefore easily fit into a very small low-cost FPGA like the Xilinx Spartan III 50 with a total of 72 Kbits of Block RAM [225].
For large vectors with numbers represented on large mantissas and exponents, this approach will require a large amount of memory, certainly not directly available on the chip. An external memory must therefore be used.
Having the EDALUTs and the MDALUTs for each coefficient, the datapath will not be implemented as in the fixed-point or integer case. The variables are no longer input in a bit-serial way. At step i, the variable $X_i$ is used to address the i-th EDALUT and MDALUT: the bits of $e_{X_i}$ are used to access the EDALUT, while the bits of $m_{X_i}$ are used to access the MDALUT in parallel. The values collected from the EDALUT and the MDALUT are used to build the floating-point number $F_i$. After n steps, the floating-point dot-product is computed. Figure 9.10 shows the datapath for the floating-point DA. If all the DALUTs for the n coefficients are available on the device, they can be accessed in parallel to compute the dot-product in only one step.
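To make the EDALUT/MDALUT mechanics concrete, here is a deliberately simplified C++ sketch (our names, unbiased exponent codes, implicit-one mantissas; the real IEEE 754 handling of rounding and normalization is left to the floating-point adder, as stated above):

```cpp
// Hedged sketch of the floating-point DA lookup. Per coefficient A_i, an
// EDALUT maps the exponent code of X_i to e_Ai + e_Xi, and an MDALUT maps
// the mantissa code of X_i to m_Ai * m_Xi.
#include <cmath>
#include <cstdint>
#include <vector>

struct FpDalut {
    std::vector<int32_t> edalut;   // 2^|E| entries: summed exponents
    std::vector<double>  mdalut;   // 2^|M| entries: mantissa products
};

// Build both tables for one coefficient with exponent eA and mantissa mA.
// Mantissa code m encodes the value 1 + m / 2^|M| (implicit leading one).
FpDalut buildTables(int eA, double mA, unsigned E, unsigned M) {
    FpDalut t;
    t.edalut.resize(std::size_t(1) << E);
    t.mdalut.resize(std::size_t(1) << M);
    for (std::size_t e = 0; e < t.edalut.size(); ++e)
        t.edalut[e] = eA + int32_t(e);                   // e_Ai + e_Xi
    for (std::size_t m = 0; m < t.mdalut.size(); ++m)
        t.mdalut[m] = mA * (1.0 + double(m) / double(std::size_t(1) << M));
    return t;
}

// One step of the dot product: build F_i = 2^(exponent sum) * (mantissa
// product) from the two lookups for variable X_i, and accumulate it.
double accumulateStep(double acc, const FpDalut& t,
                      std::size_t eX, std::size_t mX) {
    return acc + std::ldexp(t.mdalut[mX], t.edalut[eX]);
}
```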
Signal processing applications are usually based on the computation of the dot-product of a vector of variables with a vector of constants. They are ideal candidates for a DA implementation in reconfigurable hardware. As stated earlier in this section, a tool was implemented in [33] to facilitate and automate the design of DA applications. It also offers the possibility to investigate the different tradeoffs for a given application. For a given device and a given set of constants, different DA implementations can be generated, from the fully parallel implementation (w-BAAT), which can be computed in only one step, to the fully sequential DA (1-BAAT).
Figure 9.10. Datapath of the distributed arithmetic computation for floating-point numbers
For each possibility, the user is provided with the area and speed of the design. Real numbers can be handled either as fixed-point or as floating-point in the IEEE 754 format with the technique previously described. The width of the mantissa as well as that of the exponent has to be provided. For the final design, the tool generates a description in the Handel-C language.
3.4 Applications
In this section, we present one application, the implementation of a recursive convolution algorithm for time-domain simulation of optical multimode intrasystem interconnects, that was substantially sped up through the use of distributed arithmetic. Another application that has benefited from a DA implementation is the adaptive controller. Because section 4 is devoted to the adaptive controller, we explain only the first application here.
4. Adaptive Controller
In this section, we investigate the next field of application of reconfigurable devices, namely the implementation of adaptive controllers, also identified here as multi-controllers, using partial reconfiguration. We first introduce the multi-controller concept. Thereafter, we investigate its implementation using distributed arithmetic. Finally, the implementation of the whole design using partial reconfiguration is shown.
Figure 9.12. Screenshot of the 6-parallel DA implementation of the recursive convolution equation on the Celoxica RC100-PP platform
4 At a given time, the active controller is the one which controls the plant.
surements of physical values of the plant. The strategy of the supervisor can
vary from simple boolean combinations of the input values to very complex
reasoning techniques [206].
(Figure: multi-controller structure, in which a supervisor selects via a MUX which of the controller modules CM 1, ..., CM n drives the plant.)
state vector of the controller. The matrices A, B, C and D are used for the calculation of the outputs based on the inputs. The computation of one time step can be written as a single matrix-vector product:

$$z_{(s+q,1)} = M_{(s+q,s+p)} \, v_{(s+p,1)} \quad (4.2)$$

$$\underbrace{\begin{pmatrix} x_1(k+1) \\ \vdots \\ x_s(k+1) \\ y_1(k) \\ \vdots \\ y_q(k) \end{pmatrix}}_{z} = \underbrace{\begin{pmatrix} a_{11} & \cdots & a_{1s} & b_{11} & \cdots & b_{1p} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ a_{s1} & \cdots & a_{ss} & b_{s1} & \cdots & b_{sp} \\ c_{11} & \cdots & c_{1s} & d_{11} & \cdots & d_{1p} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ c_{q1} & \cdots & c_{qs} & d_{q1} & \cdots & d_{qp} \end{pmatrix}}_{M} \underbrace{\begin{pmatrix} x_1(k) \\ \vdots \\ x_s(k) \\ u_1(k) \\ \vdots \\ u_p(k) \end{pmatrix}}_{v} \quad (4.3)$$
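In software terms, one controller step is exactly the matrix-vector product of equation 4.3. The minimal C++ sketch below (names ours) makes explicit that every output row is a dot product of constant coefficients with the vector v, which is precisely the operation the DA scheme of section 3 accelerates:

```cpp
// Hedged sketch: one time step of the state-space controller as the single
// matrix-vector product z = M v of equation 4.3 (names hypothetical).
#include <vector>

// M has (s+q) rows and (s+p) columns; v = [x(k); u(k)], z = [x(k+1); y(k)].
std::vector<double> controllerStep(const std::vector<std::vector<double>>& M,
                                   const std::vector<double>& v) {
    std::vector<double> z(M.size(), 0.0);
    for (std::size_t row = 0; row < M.size(); ++row)
        for (std::size_t col = 0; col < v.size(); ++col)
            z[row] += M[row][col] * v[col];  // each row is a dot product with
    return z;                                // constants: a natural DALUT fit
}
```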
Figure 9.14. Adaptive controller architecture. Left: the one-slot implementation; right: the two-slot implementation
into the reconfigurable slot, which has predefined interfaces to the supervisor and to the plant.
The problem with this approach is the vacuum which arises when switching from one module to the next. Because only one slot is available, the reconfiguration of this slot places the plant in a "floating state" during reconfiguration, in which it is not really controlled. In order to avoid this situation, the approach developed in [64] uses two slots: one active slot and a passive one (figure 9.14, right). The active slot is in charge of controlling the plant, while the reconfiguration takes place in the passive slot. Whenever the supervisor decides to replace a controller module, the reconfiguration first takes place in the passive slot. Thereafter, the control of the plant is given to the newly configured module, which becomes the active one, while the former active slot becomes passive.
thought of a few years ago. Applications like e-commerce, e-government, virtual private networks, and on-line banking must provide a high degree of security. Over the years, a large variety of standards like Triple-DES, Advanced Encryption Standard (AES), Data Encryption Standard (DES), RSA, OpenPGP, CipherSaber, IPsec, Blowfish, ANSI X9.59, CAST, RC4 and RC6 have been developed to provide high security. With this large variety of standards and the customized implementation possibilities for each standard, cryptography can be seen as one of the most versatile application domains of computer science. Depending on criteria like speed, degree of flexibility, and degree of security, implementations of cryptographic applications were developed either as software or as intellectual property components.
In [221], the advantages of using reconfigurable hardware in cryptography are listed. The author focuses on the traditional advantages of flexibility and performance. Flexibility is important because it offers the possibility to use the same hardware to switch from one algorithm to the next at run-time, according to factors like the degree of security, the computational speed, or the power consumption. A given algorithm can also be tuned according to some parameters. Moreover, algorithms that have been broken and whose security is no longer assured can be exchanged by means of reconfiguration. The system can easily be upgraded to include new standards developed while the system was already deployed: the corresponding algorithm can be compiled and included in the library of bitstreams for the device configuration. The second advantage provided by reconfigurable hardware, namely performance, can be used to implement the components efficiently, by exploiting the inherent parallelism and building efficient operators for computing boolean operations on very large amounts of data.
This results in a high throughput at a moderate cost. Experiments reported a throughput of 12 Gbit/s for an implementation of the block cipher AES on an FPGA Virtex 1000, using 12,600 slices and 80 RAMs [90], while an ASIC implementation, the Amphion CS5240TK [150] clocked at 200 MHz, provided twice the throughput of the FPGA solution. The same algorithm implemented on a DSP TI TMS320C6201 provided a throughput of 112.3 Mbit/s [222], while a throughput of 718.4 Mbit/s could be reached with a counterpart implementation on a Pentium III [10].
The general architecture of an adaptive cryptographic engine proposed by Prasanna and Dandalis [61] [178] basically consists of a database holding the different configurations that can be downloaded at run-time onto the device, an FPGA for instance, to perform the computation, and a configuration controller to perform the reconfiguration, i.e. to download the corresponding bitstream from the database into the FPGA. Each bitstream represents a given algorithm implementing a given standard, tuned with some parameters according to the current user's needs.
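The following C++ sketch illustrates that control flow (all names and the loader stub are hypothetical; the actual transfer over SelectMap or ICAP is platform specific and only stubbed here):

```cpp
// Hedged sketch of the configuration-controller idea: a database maps a
// requested cipher and parameter set to a bitstream that a platform-specific
// loader would write to the FPGA configuration port.
#include <map>
#include <stdexcept>
#include <string>

struct CryptoRequest { std::string cipher; int keyBits; };

class ConfigController {
    std::map<std::string, std::string> database_;  // key -> bitstream file
public:
    void registerBitstream(const std::string& key, const std::string& file) {
        database_[key] = file;
    }
    // Select the matching configuration and hand it to the loader.
    void reconfigure(const CryptoRequest& req) {
        auto key = req.cipher + "-" + std::to_string(req.keyBits);
        auto it = database_.find(key);
        if (it == database_.end()) throw std::runtime_error("no bitstream");
        loadBitstream(it->second);   // hypothetical platform call
    }
private:
    void loadBitstream(const std::string& /*file*/) { /* platform stub */ }
};
```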
single chip. Also, in contrast to the adaptive architecture for a control system presented in figure 9.14, the loader module resides inside the device. Depending on the implementation chosen, the loader can reside inside or outside the device. If a Xilinx Virtex chip is used and the configuration takes place via the ICAP port, then the loader, which is the ICAP module in this case, is automatically available inside the device. However, if the configuration happens through the normal SelectMap port, then we need an external loader module to collect configuration data from the database and copy it to the configuration port.
In figure 9.15, the architecture is logically divided into two main blocks. A fixed one remains continuously on the chip; it consists of the parts which are common to all cryptographic algorithms in general, or common to the algorithms of a given class only. In the figure, we show only one reconfigurable slot; however, it can be implemented as a set of configurable blocks, each of which can be changed by means of reconfiguration to realize a given customized standard.
The last point concerns the development of the building blocks that will be compiled into bitstreams to be downloaded into the device at run-time. A designer is no longer required to focus on the hardware implementation of the cryptographic algorithms: a lot of work has been done in this direction and the results are available. We mostly need to focus on the architecture of the overall system and find out how a viable partitioning can be done, according to the reconfiguration scheme.
Most of the work has focused on various implementations of a given approach, like RSA [185] in [155] [88] [154] [232] [50], or the implementations described in [78] [19] [20] [47] [145] [152] [15], mostly parameterizable and based on the elliptic curve approach [138] [157]. Generators producing a customized description in a hardware description language have been developed, for example in [47]. These can be used to generate the variety of configurations that will be used at run-time to move from one implementation to the next.
The ideal realization of a software defined radio would push the ADC and DAC as close as possible to the antenna. The practical realization is, however, very challenging, as explained in [217].
8. Conclusion
We have presented in this chapter a couple of applications that can take advantage of the flexibility as well as the performance of reconfigurable devices. For each application targeted, we placed the focus mostly on the architectural organization, because the goal was not necessarily to present the detailed implementation of an application. However, we have presented a comprehensive description of the pattern matching application, while keeping the other presentations short. This is due to the low attention paid to pattern matching in hardware in available textbooks. Given the large amount of literature on the other presented topics, like image processing and control, we chose not to focus on the details of those applications.
A lot of work has been done in the last two decades, and a large number of applications have been implemented in FPGAs, the main representative of reconfigurable devices. We could not present or cite all available implementations here; we have rather focused on a few where we could show the usefulness of reconfigurability. Finally, we would like to emphasize that despite two decades of research, the existence of a "killer application" could not be shown for reconfigurable devices, thus limiting the acceptance of reconfiguration in the industry. The existence of a killer application would certainly boost the development of reconfigurable devices, leading to new classes of devices, tools and programmers. Despite this missing step, research keeps going, and the support of the industry is needed more than it has ever been.
AUTOMATIC TARGET RECOGNITION SYSTEMS
ON RECONFIGURABLE DEVICES
Young H. Cho
Open Acceleration Systems Research
Sandia real-time SAR ATR systems use a hierarchy of algorithms to reduce the processing demands for SAR images in order to yield a high probability of detection (PD) and a low false alarm rate (FAR).
(Figure: ATR processing flow, from a focus-of-attention stage to a reporting module identifying, for example, an M-47 tank together with its estimated angle and elevation.)
hypotheses, and returns the hit information for each image chip consisting of the
best two orientation matches and other relevant information.
The system has a database of target models. For each target, and for each of its
three different elevations, 72 templates are defined corresponding to its all-around
views. The orientations of adjacent views are separated by 5 degrees.
SLD is a binary silhouette matcher that has a bright mask and a surround mask
that are mutually exclusive. Each template is composed of several parameters
along with a “bright mask” and a “surround mask,” where the former defines the
image pixels that should be bright for a match, and the latter defines the ones that
should not. The bright and surround masks are 32×32 bitmaps, each with about
100 asserted bits. “Bright” is defined relative to a dynamic threshold.
On receiving tasks from the FOA, the SLD unit compares all of the stored
templates for this target and elevation and the applicable orientations with the
image chip, and computes the level of matching (the “hit quality”). The two hits with
the highest quality are reported to the SLD driver as the most likely candidates to
include targets. For each hit, the template index number, the exact position of the
hit in the search area, and the hit quality are provided. After receiving this
information, the SLD driver reports it to the ATR system.
The purpose of the first step in the SLD algorithm, called the shape sum, is to
distinguish the target from its surrounding background. This consists of adaptively
estimating the illumination for each position in the search area, assuming that the
target is at that orientation and location. If the energy is too little or too much, no
further processing for that position for that template match is required. Hence, for
each mask position in the search area, a specific threshold value is computed as in
equation 28.1.
$$SM_{x,y} = \sum_{u=0}^{31} \sum_{v=0}^{31} B_{u,v} \, M_{x+u,y+v} \quad (28.1)$$

$$TH_{x,y} = \frac{SM_{x,y}}{BC} - Bias \quad (28.2)$$
The next step in the algorithm distinguishes the target from the background by thresholding each image pixel with respect to the threshold of the current mask position, as computed before. The same pixel may be above the threshold for some mask positions but below it for others. This threshold calculation determines the actual bright and surround pixels for each position. As shown in equation 28.2, it consists of dividing the shape sum by the number of pixels in the bright mask and subtracting a template-specific Bias constant.
As shown in equation 28.3, the pixel values under the bright mask that are greater than or equal to the threshold are counted; if this count exceeds the minimal bright sum, the processing continues. On the other hand, the pixel values under the surround mask that are less than the threshold are counted to calculate the surround sum, as shown in equation 28.4. If this count exceeds the minimal surround sum, it is declared a hit.
$$BS_{x,y} = \sum_{u=0}^{31} \sum_{v=0}^{31} B_{u,v} \, (M_{x+u,y+v} \ge TH_{x,y}) \quad (28.3)$$

$$SS_{x,y} = \sum_{u=0}^{31} \sum_{v=0}^{31} S_{u,v} \, (M_{x+u,y+v} < TH_{x,y}) \quad (28.4)$$
Once the position of the hit is determined, we can calculate its quality by taking
the average of bright and surround pixels that were correct, as shown in equation
28.5. This quality value is sent back to the driver with the position to
determine the two best targets.
$$Q_{x,y} = \frac{1}{2}\left(\frac{BS_{x,y}}{BC} + \frac{SS_{x,y}}{SC}\right) \quad (28.5)$$
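A software rendering of one SLD template evaluation, following equations 28.1 through 28.5, might look as follows (a hedged sketch: names are ours, and the early-exit tests against the minimal bright and surround sums are omitted for brevity):

```cpp
// Hedged sketch of one SLD template evaluation at offset (x, y). B and S
// are the 32x32 bright and surround bitmaps, M is the image; BC and SC are
// the asserted-bit counts of B and S.
#include <array>
#include <cstdint>

constexpr int T = 32;
using Mask = std::array<std::array<uint8_t, T>, T>;
template <int W, int H>
using Image = std::array<std::array<uint8_t, W>, H>;

template <int W, int H>
double evaluateHit(const Mask& B, const Mask& S, const Image<W, H>& M,
                   int x, int y, int BC, int SC, double bias) {
    // Equation 28.1: shape sum under the bright mask.
    long sm = 0;
    for (int u = 0; u < T; ++u)
        for (int v = 0; v < T; ++v)
            sm += B[u][v] * M[y + v][x + u];
    // Equation 28.2: adaptive threshold for this mask position.
    double th = double(sm) / BC - bias;
    // Equations 28.3 and 28.4: bright and surround counts.
    int bs = 0, ss = 0;
    for (int u = 0; u < T; ++u)
        for (int v = 0; v < T; ++v) {
            if (B[u][v] && M[y + v][x + u] >= th) ++bs;
            if (S[u][v] && M[y + v][x + u] <  th) ++ss;
        }
    // Equation 28.5: hit quality, averaging the two normalized counts.
    return 0.5 * (double(bs) / BC + double(ss) / SC);
}
```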
FPGAs can be reconfigured to perform multiple functions with the same logic
resources by providing a number of corresponding configuration bit files. This ability
allows us to develop dynamically reconfigurable designs. In this section, we present
an ATR system implementation of UCLA’s Mojave project that uses an FPGA’s
dynamic reconfigurability.
that would make relatively poor utilization of FPGA resources. More important, the sparse nature of the templates can be exploited to achieve a far more efficient implementation in the FPGA than could be realized in a general-purpose correlation device. This can be illustrated using the example of the simple template shown in Figure 28.2.
In the example template shown in the figure, only 5 of the 20 pixels are asserted. At any given relative offset between the template and the chip, the correlation output is the sum of the 5 binary pixels in the chip that match the asserted bits in the template. The template can therefore be implemented in the FPGA as a simple multiple-port adder. The chip pixel values can be stored in flip-flops and are shifted to the right by one flip-flop with each clock cycle. Though correlation of a large image with a small mask is often understood conceptually in terms of the mask being scanned across the image, in this case the opposite occurs: the template is hardwired into the FPGA while the image pixels are clocked past it.
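The hardwired-template idea can be mimicked in software: the asserted pixel offsets are fixed constants, so each correlation output is just a small sum over those taps (a sketch with hypothetical tap positions; in the FPGA this sum is a dedicated adder tree fed by the shifting pixel registers):

```cpp
// Hedged sketch of a hardwired sparse template: the asserted positions are
// fixed at build time, so correlation reduces to a small multi-input add of
// the chip bits currently under those taps.
#include <array>
#include <bitset>
#include <cstddef>

// Asserted offsets of a 5-pixel example template (hypothetical positions;
// all must be smaller than the window size N).
constexpr std::array<std::size_t, 5> kTemplateTaps = {0, 3, 7, 12, 18};

// One correlation output per shift step: sum the chip bits under the taps.
template <std::size_t N>
int correlate(const std::bitset<N>& window) {
    int sum = 0;
    for (std::size_t tap : kTemplateTaps)
        sum += window[tap];   // in hardware: one leaf of the adder tree
    return sum;
}
```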
A potential advantage FPGAs have over ASICs is that they can be dynamically optimized at the gate level to exploit template characteristics. For our application, a programmable ASIC design would need to provide large general-purpose adder trees to handle the worst-case condition of summing all possible template bits, as shown in Figure 28.4. In contrast, an FPGA exploits the sparse nature of the templates and constructs only the small adder trees required. Additionally, FPGAs can optimize the design based on other application-specific characteristics.
FIGURE 28.2 ■ An example template and a corresponding register chain with an adder tree.
(Figure residue: template-specific adder trees for TemplateA and TemplateB, contrasted with the general-purpose correlator of Figure 28.4 built from registers for the image chip and templates, AND gates to perform the dot product, and a large adder.)
LUT can implement an element that is effectively a 16-bit shift register in which the
internal bits cannot be accessed. A flip-flop is also needed at the output of each
RAM to act as a buffer and synchronizer. A single control circuit is used to control
the stepping of the address lines and the timely assertion of the write-enable and
output-enable signals for all RAM-based shift register elements. This is a small
price to pay for the savings in configurable logic block (CLB) usage relative to a
brute-force implementation using flip-flops.
In contrast, the 256-pixel template images, like those shown in Figure 28.5, can
be stored easily using flip-flop–based registers. This is because sufficient flip-flops
are available to do this, and the adder tree structures do not consume them. Also,
using standard flip-flop–based shift registers for image pixels in the template
simplifies the mapping process by allowing every pixel to be accessed. New
templates can be implemented by simply connecting the template pixels of concern
to the inputs of the adder tree structures. This leads to significant simplification of
automated template-mapping tools.
The resources used by the two components of target correlation, namely storage of active pixels on the FPGA and implementation of the adder trees corresponding to the templates, are independent of each other. The resources used by the pixel storage are determined by the template size and are independent of the number of templates being implemented. Adding templates involves adding new adder tree structures and hence increases the number of function generators being used. The total number of templates on an FPGA is bounded by the number of usable function generators.
The experimental results suggest that in practice we can expect to fit 6 to 10
surround templates having a higher number of overlapping pixels onto a 13,000-
gate FPGA. However, intelligent grouping of compatible templates is important.
Because the bright templates are less populated than the surround templates, we
estimate that 15 to 20 of them can be mapped onto the same FPGA.
FIGURE 28.5 ■ Example of eight rotation templates of a SAR 16×16 bitmap image.
Using C++, we have created a design tool to implement the partitioning process, which uses an iterative approach to partitioning templates. Templates that compare well to a chosen "base" template (usually selected by largest area) are removed from the main template set and placed in a separate partition. This process is repeated until all templates are partitioned, as sketched below. After the partitions have been selected, the tool computes the adder tree for each partition.
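The following C++ sketch captures the iterative idea; the actual tool's similarity criterion is not spelled out in the text, so a simple shared-pixel count stands in for it:

```cpp
// Hedged sketch of iterative template partitioning: repeatedly pick the
// largest remaining template as the base and group templates that share
// enough asserted pixels with it into one partition.
#include <algorithm>
#include <bitset>
#include <vector>

using Template = std::bitset<1024>;   // 32x32 bitmap, flattened

std::vector<std::vector<Template>> partitionTemplates(
        std::vector<Template> pool, std::size_t minShared) {
    std::vector<std::vector<Template>> partitions;
    while (!pool.empty()) {
        // Base template: the one with the largest asserted area.
        auto base = std::max_element(pool.begin(), pool.end(),
            [](const Template& a, const Template& b) {
                return a.count() < b.count();
            });
        Template baseT = *base;
        pool.erase(base);
        std::vector<Template> part{baseT};
        // Move templates that overlap the base well into this partition.
        for (auto it = pool.begin(); it != pool.end();) {
            if ((*it & baseT).count() >= minShared) {
                part.push_back(*it);
                it = pool.erase(it);
            } else {
                ++it;
            }
        }
        partitions.push_back(std::move(part));
    }
    return partitions;
}
```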
Figure 28.6 shows the creation of an adder tree from the templates in a partition. Within each partition, the templates are searched for shared subsets of pixels. Called terms, these subsets can be added together once, leading to a template description that uses terms instead of pixels. The most common addition of two terms is chosen to form a new term that can be used by the templates. In this way, each template is rebuilt by combining terms in such a way that the most redundant additions are shared between templates; the final result is terms that compute entire templates. For the sample templates shown in Figure 28.6, 39 additions would be required to compute the correlations for all 5 in a naive approach.
(Figure 28.6: five sample templates A through E of one partition and the shared terms from which their adder trees are built.)
However, after combining the templates through the process just described, only 17 additions are required.
FIGURE 28.7 ■ Each of eight FPGAs correlating each bit plane of the template (a). A single FPGA correlating bit planes and adding the partial sums serially (b).
The large number of sums in the algorithm can be computed in parallel. This requires a total of D clock cycles, where D is each pixel's depth of representation. Once the sum results are obtained, the correlation outputs are produced at the rate of one per clock cycle. Parallelism cannot be as directly exploited in this step, because different pixels are asserted for different templates. However, in the limit of very large FPGAs, the number of clock cycles to compute the correlation is upper-bounded by the number of possible thresholds, as opposed to the number of templates.
has improved dramatically over the years, any time spent on reconfiguration is time that could be used to process more data. Unlike the dynamically reconfigurable architecture described in the previous section, the FPGA design we describe next does not require complete design reconfiguration. However, like the previous system, it uses a number of parameters to build a highly pipelined custom design that maximizes utilization of the design space and exploits the parallelism in the algorithm.
template set, there are a total of 4608 bitmap rows to use in the correlation tasks. Of the 4608 bright rows, only 2206 are nonzero, and of the 4608 surround rows, 2815 are nonzero. Since the bright mask is used for both the threshold and bright sum calculations, and the surround mask is used once, skipping the zero rows reduces the number of row operations from 13,824 to 7227, i.e., to about 52 percent of the original count.
It is also possible to reduce the computation by skipping zero columns. However, as will be described in the following section, the FPGA implementation works on an entire search row concurrently. Hence, skipping rows reduces time, but skipping columns only reduces the number of active elements that work in parallel, yielding no savings.
FIGURE 28.9 ■ Computation logic for equations 28.1, 28.3, and 28.4.
account for the hardware requirements of the entire design, including the control hardware that is common to all designs as well as the template-specific adder trees. Relative savings in the adder trees alone are higher.
Table 28.3 lists the overall resources used for both FPGAs in the system: the dynamic devices used for correlation and the static support device used to implement system control features. Because the image source is a standard video camera rather than a SAR sensor, the surround template is the complement of the bright template, resulting in more hardware than would be required for true SAR templates. The majority of the flip-flops in the dynamic FPGA
are assigned to holding the 8-bit chip data in a set of shift registers. This load
increases as a linear function of the template size.
Each configuration of the dynamic FPGA requires 16 milliseconds to complete an evaluation of the entire chip for four template pairs. The Xilinx 4013PG223-4 requires 30 milliseconds for reconfiguration. Thus, a total of 4 template pairs can be evaluated every 46 milliseconds, or about 87 template pairs per second. This timing will increase logarithmically with the template size.
Comparing configurable machines with traditional ASIC solutions is necessary but complicated. Clearly, for almost any application, a bank of ASICs could be designed that used the same techniques as the multiple configurations of the FPGA and would likely achieve higher performance and consume less power. The principal advantage of configurable computing is that a single FPGA may act as many ASICs without the cost of manufacturing each device. If the comparison is restricted to a single IC (e.g., a single FPGA against a single ASIC of similar size), relative performance becomes a function of the hardware savings enabled by data specificity. For example, in the ATR application the templates used are quite sparse (only 5 to 10 percent of the pixels are important in the computation), which translates directly into a hardware savings that is much more difficult to realize in an ASIC. Further savings in the ATR application are possible by leveraging topological similarities across templates. Again, this is an advantage that ASICs cannot easily exploit.
If the power and speed advantages of ASICs over FPGAs are estimated at a factor of 10, the configurable computing approach achieves a factor of improvement anywhere from 2 to 10 (depending on sparseness and topological properties) for the ATR application.
The host is a SparcStation IPX running SunOS 4.1.3 with a Myrinet interface board having 512K of memory. The FPGA node, consisting of Lucent Technologies' ORCA FPGA 40K and Myricom's LANai 4.1 running at 3.3 V and 40 MHz, communicates with the host through an 8-port Myrinet switch.
Without additional optimization, static implementation of the complete ATR
algorithm on one FPGA node processes more than 900 templates per second.
Each template requires about 450,000 iterations of 1-bit conditional accumulate for
the complete shape sum calculation. The threshold calculation requires one division
followed by subtraction. The bright and surround sum compares all the image pixels
against the threshold results. Next, 1-bit conditional accumulate is executed for each sum. Then the quality values are calculated using two divides, an add, and a multiply.

FIGURE 28.11 ■ A Myrinet 8-port switch motherboard with Myricom ORCA FPGA daughter nodes. Four FPGA nodes can be plugged into a single motherboard.
Given that 1-bit conditional accumulate, subtract, divide, multiply, and 8-bit
compare are one operation each, the total number of 8-bit operations to process
one 32 × 32 template over a 64 × 64 image is approximately 3.1 million. Each
FPGA node executes over 2.8 billion 8-bit operations per second (GOPS).
After the simulations, we found that the sparseness of the actual templates reduced their average number of valid rows to approximately one-half the total number of template rows. This optimization was implemented and increased the throughput by 40 percent. Further simulations revealed more room for improvement, such as dividing the shape sum in the FPGA, transposing narrow template masks, and skipping invalid threshold lines. Although these optimizations were not implemented in the FPGA, the simulation results indicated an additional 94 percent increase in throughput. Implementing all optimizations would yield a result equivalent to about a 7.75 GOPS correlator.
(Figure: three computing models: (a) a traditional processor with data memory, (b) a dynamically reconfiguring FPGA with data memory and intermediate result storage, and (c) a fixed FPGA design with data memory.)
memory, performs a simple operation in the program, and returns the result to data memory. Similarly, dynamic computing uses a small number of rapidly reconfiguring FPGAs tightly coupled to an intermediate result memory, data memory, and configuration memory. A reconfigurable custom computer is similar to a fixed ASIC device in that, usually, only one highly tuned design is configured on the FPGA; there is no need to reconfigure to perform a needed function. In most cases, a custom ASIC performs far better than a traditional processor. However, traditional processors continue to be used for their programmability. FPGAs attempt to bridge the gap between custom ASICs and software by allowing designers to build custom hardware using programmable firmware. Therefore, unlike in pure ASIC designs, configuration memory is used to program the reconfigurable hardware, much as instructions in a traditional processor dictate the functionality of a program. Unlike software, once the FPGA is configured, it functions just like a custom device.
As shown in previous sections, the ATR was implemented in an FPGA using two different methods. The first implementation uses the dynamic computing model, where parts of the entire algorithm are dynamically configured to produce the final results. The second design uses simulation results to produce a highly tuned fixed design in the FPGA that does not require more than a single reconfiguration. Because of algorithm modifications made to the first design, there is no clear way to compare the two designs. However, looking deeper, we find that there is no drastic difference in the subcomponents or the algorithm; in fact, the number of required operations for the algorithm in either design should be the same.
The adders make up the critical path of both designs. Because both designs are
reconfigurable, we expect the adders used to have approximately the same
performance as long as pipelining is done properly. Clever use of adders in the
static design allows it to execute more than one calculation
28.5 SUMMARY
Like many streaming image correlation algorithms, the Sandia ATR system discussed in this chapter can be efficiently implemented on an FPGA. Because of the high degree of parallelism in the algorithm, designers can take full advantage of parallel processing in hardware while linearly scaling total throughput with available hardware resources. In this chapter we presented two different ways of implementing such a system.
The first system employs a dynamic computing model to effectively implement a large custom design using a smaller reconfigurable device. To make such designs fit, high-performance custom designs can be divided into subcomponents, which then share a single FPGA and execute parts of the algorithm at high speed. For the ATR algorithm, this process produced a resource-efficient design that exceeded the performance of previous custom ASIC-based systems.
The second system is based on a more generic architecture highly tuned for a given set of templates. Through extensive simulations, many parameters of the algorithm are tuned to efficiently process the incoming data. With algorithm-specific optimizations, the throughput of the system increased threefold from an initial naive implementation. Because of the highly pipelined structure of the design, the maximum clock frequency is more than three times that of the