The IUA Feedback Concentrator
Deepak Rana
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst MA 01003.
rana@cs.umass.edu
Charles C. Weems
Department of Computer and Information Sciences
University of Massachusetts, Amherst MA 01003.
weems@cs.umass.edu
ABSTRACT
Many low- and intermediate-level vision algorithms are characterized by massive spatial or data parallelism. When these algorithms are mapped onto multiple-processor systems, making a global decision often requires gathering data from all of the processors. In SIMD parallel processors, making a global decision is equivalent to test-and-branch in a uniprocessor, and should have similar time cost. This requires a fast global feedback mechanism in the system. This paper presents a description of the custom hardware in the Image Understanding Architecture that provides global feedback, and a few example algorithms that take advantage of it.
1 INTRODUCTION
Machine vision is one of the most computationally intractable problems, requiring at least three levels of computational abstraction, with two of these levels obvious: processing of the sensory data, and processing of the world knowledge. The necessity of an intermediate level of processing, bridging the gap between the low and high levels, has been motivated in [Boldt 87; Draper 89; Duff 86; Hanson 78] among others.
The Image Understanding Architecture (IUA) is a massively parallel, multi-level system, representing a hardware implementation of these three levels of abstraction. [Weems 89] provides an overview of the IUA. High speed, fine-grained communication, and control at the low and the intermediate levels in the IUA is achieved by using associative processing techniques. Foster [Foster 76] has identified four key processing capabilities with associative computation:
(1) Global Broadcast/local compare/Activity control.
(2) Select a Single Responder.
(3) Some/None response.
(4) Count-responders.
The purpose of this paper is to present the architectural details of the last two of these capabilities as they are implemented in the IUA.
This paper is organized as follows. Section 2 presents a brief overview of the IUA. Section 3 presents the details of the feedback concentrator mechanism as implemented using a custom VLSI chip. The custom chip uses an innovative combination of circuit techniques to achieve high speed. A description of the custom VLSI concentrator chip is provided in section 3.1. In section 4 we compare the performance of our architecture with a mesh connected processor without the feedback mechanism for three common, low-level vision tasks. The paper ends with conclusions and references.
2 OVERVIEW OF THE IUA
The IUA integrates three different parallel processors operating simultaneously at three levels of abstraction and computational granularity in a tightly coupled manner. Communication between levels takes place via parallel data and control paths. The processing elements within each level can also communicate with each other in parallel, via different mechanisms at each level.
The low-level, called the Content Addressable Array Parallel Processor (CAAPP), is a 512 x 512 array of bit-serial processors intended to perform low-level image processing tasks on the input image pixels. The CAAPP architecture is especially oriented toward associative processing with hardware support for the four capabilities described in section 1. The CAAPP processing elements are linked through a four-way communications grid that is augmented with a Coterie Network that allows certain types of long-distance communications to take place quickly. The CAAPP operates in SIMD mode under the control of a dedicated Array Control Unit (ACU).
The intermediate-level, called the Intermediate and Communication Associative Processor (ICAP), is a collection of 4096 16-bit fast Digital Signal Processor (DSP) chips. The ICAP is designed for manipulating symbolic descriptions of extracted image events. Control of the ICAP is provided by the ACU in synchronous-MIMD mode, and by the high-level in pure MIMD mode. When operating under the control of the ACU, the ICAP is treated as an associative array with the same four capabilities listed in section 1. The individual processors in the ICAP communicate with each other through a reconfigurable interconnection network [Rana 88; Rana 89].
At the high-level, called the Symbolic Processing Array (SPA), a set of 64 processors capable of executing LISP programs support computation involving inference, hypothesis generation and verification, analysis of uncertainty, model-based processing, and indirect control of processing at lower levels. The SPA processors operate in MIMD mode.
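As a point of reference for the associative capabilities that the CAAPP and ICAP expose to the ACU, the following is a minimal software sketch, not a description of the hardware. The class `AssociativeArray`, its method names, and the helper `find_greatest` are hypothetical names chosen for illustration; `find_greatest` follows the bit-serial maximum search that section 4 analyses, assuming unsigned fields.

```python
class AssociativeArray:
    """Toy model of an associative SIMD array as seen by its controller."""

    def __init__(self, values):
        self.values = list(values)               # one memory field per PE
        self.activity = [1] * len(self.values)   # A register of each PE
        self.response = [0] * len(self.values)   # X register of each PE

    def broadcast_compare(self, predicate):
        """(1) Broadcast a test; each active PE sets its own response bit."""
        self.response = [
            int(bool(a and predicate(v)))
            for v, a in zip(self.values, self.activity)
        ]

    def select_single(self):
        """(2) Return the index of one responder (the first), or None."""
        return self.response.index(1) if 1 in self.response else None

    def some(self):
        """(3) Some/None: 1 if at least one PE responded, else 0."""
        return int(any(self.response))

    def count(self):
        """(4) Count-responders: number of PEs whose response bit is set."""
        return sum(self.response)


def find_greatest(array, field_length=8):
    """Bit-serial maximum search driven only by the Some/None feedback."""
    greatest = 0
    for bit in range(field_length - 1, -1, -1):      # high-order bit first
        array.broadcast_compare(lambda v: (v >> bit) & 1)
        some = array.some()
        if some:
            # PEs with a 0 in this bit can no longer hold the maximum.
            array.activity = list(array.response)
        greatest = (greatest << 1) | some            # controller saves the bit
    return greatest


pes = AssociativeArray([13, 200, 7, 200, 42])
pes.broadcast_compare(lambda v: v > 10)
responders = pes.count()
greatest = find_greatest(pes)
```

The controller never reads an individual PE in this sketch; every decision is made from the one-bit Some/None summary or the responder count, which is why the speed of those two summaries matters.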
This work was funded in part by the Defense Advanced Projects
Agency under contract number DACA76-86-C-0015, and DACA76-89-C0016 monitored by the U.S. Army Engineer Topographic Laboratory, and
contract number F49620-86-C-0041, monitored by the Air Force Office of
Scientific Research, and by a Coordinated Experimental Research grant
#DCA8500332, from the National Science Foundation.
CH2898-5/90/0000/0540$01.00 © 1990 IEEE
A proof-of-concept prototype of 1/64th of the IUA is currently being constructed by the University of Massachusetts, Amherst, and Hughes Research Laboratories, Malibu, California. The remainder of this paper focuses on the hardware implementation of fast global summary operations (Some/None response, and responder count) in the CAAPP and the ICAP levels of the IUA.
3 THE IUA FEEDBACK CONCENTRATOR
The hardware comprising two lower level processors of the IUA is partitioned into motherboards and daughterboards. The full IUA system comprises 64 motherboards. Each motherboard contains 64 daughterboards. Each daughterboard holds one CAAPP chip and an ICAP processor, along with memory chips and the associated interface and I/O circuitry. Each CAAPP chip contains 64 bit-serial CAAPP PEs, their local memory, and interface logic.
Each CAAPP PE has a response register called the X register, and an activity register called the A register. The fastest method of counting the number of responders would be to feed the 1 bit output from the X registers of the 512 x 512 = 262,144 CAAPP PEs to a hardware adder and generate a 19 bit sum for the ACU. One technique, proposed by Favor [Favor 64] for counting the number of responders, uses a pyramid of full adders to generate the least significant bit of the count of the inputs. Next, the carries generated at various stages of the adder pyramid are fed to a second smaller pyramid to generate the second least significant count bit. Successively smaller adder pyramids are chained together to develop the full count. In this method each stage waits for one count bit to be formed before feeding any carries from the current level to the next. Foster [Foster 71] improved upon Favor's scheme such that at any stage of the current adder pyramid, as soon as 3 carries are available, they are fed to the next level adder pyramid. Foster's scheme, called a carry-shower adder, results in a significant speedup over Favor's scheme. For example, Favor's scheme has a delay of 21 full adders for 64 inputs, whereas Foster's scheme has a delay of 10 full adders for 64 inputs.¹ However, the fan-in from 262,144 processors to a single sum is too great to be practically realized this way.
In the IUA design, the pin constraints on the processor chip, and space and pin limitations for the external circuitry required that the chip-level counts be output serially. Figure 1 illustrates the logical organization of the Some/None and the Count-responder mechanisms on one daughterboard. Local Some/None is the logical sum of the X register of the 64 CAAPP PEs on a daughterboard. This signal is generated at the end of every CAAPP instruction. If one or more PEs have a logical 1 in their X register, the Local Some/None is set to logical 1. Foster's scheme is used within the processor chip to produce a response count at the end of every CAAPP instruction cycle. A special instruction, called latch-count, allows either the CAAPP count or an 8 bit value from the local ICAP to be loaded in the daughterboard count register (CR). The value in the CR can be read out serially from the CAAPP chip. A separate instruction, called latch-local-count, allows the CAAPP count to be loaded in the local count register, where it can be read by the ICAP processor on the same daughterboard. Three additional general purpose signals D0 - D2 are provided by the ICAP and are multiplexed under program control with three special purpose feedback signals from the CAAPP chip.
Fig 1: Daughterboard Some/None and Count Network
(Figure labels: CAAPP CHIP; signals L-S/N, D2/EF, D1/B, D0/IC, L-COUNT.)
D2, D1, D0 : ICAP Status Lines
EF, B, IC : ICAP/CAAPP Status Lines
S/N : Local Some/None line on a daughterboard
LCR : Local Count Register
ICR : ICAP Count Register
CR : Daughterboard Count Register
To summarize, the following feedback signals are provided on each daughterboard: Local Some/None (L-S/N), L-Count which can either be a CAAPP responder count or an 8 bit value from the ICAP, and three general purpose feedback signals from the ICAP which are multiplexed with three special purpose feedback signals from the CAAPP chip.
From the daughterboards, a finite state machine is used that automatically shifts the count out of the chip, least significant bit first, one bit per cycle. Output begins as soon as a count is latched. The processors are thus able to overlap computation with the development of the current count.
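The serial read-out can be pictured with a few lines of Python. This is only an illustrative sketch: the generator name `shift_out_lsb_first` is hypothetical, and the 8 bit register width is taken from the count register described above.

```python
def shift_out_lsb_first(count, width=8):
    """Yield the bits of `count`, least significant first, one per cycle."""
    if not 0 <= count < (1 << width):
        raise ValueError("count does not fit in the count register")
    for cycle in range(width):
        yield (count >> cycle) & 1


# A latched count of 37 (binary 00100101) leaves the chip as this bit stream.
stream = list(shift_out_lsb_first(37))
```

Sending the least significant bit first is what lets a downstream adder start work immediately, since carries only ever propagate toward bits that have not yet arrived.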
There are many ways to sum the daughterboard-level counts. One method would be to feed 8 bit values from 4096 daughterboards to a hardware adder such as a Carry-Save-Adder (CSA) [Cavanagh 84]. This approach is infeasible because of the hardware cost and the number of wires. If the CSA was arbitrarily fast, this scheme would take about 1µS in computing the global count.
Our final solution was to trade a small amount of the speed of counting responders for a substantial saving in the hardware and the number of wires: We serialized the entire adding process. This scheme is shown in figure 2 and figure 3. Figure 2 illustrates the Some/None and the Count-responder mechanism on one motherboard. The local Some/None (L-S/N) outputs from the 64 daughterboards are fed into the motherboard Some/None tree. (The D0 - D2 feedback signals are treated in the same way.) The local counts (L-Count) from the count register (CR) of the 64 daughterboards are serially fed into the motherboard count responder tree. Figure 3 illustrates the global Some/None and Count-responder mechanisms for the full IUA, which concentrate the outputs of the motherboard level networks and are similar in design.
¹ In another study, Swartzlander [Swartzlander 73] corrected Foster's lower bound formula and showed that the theoretical lower bound on the delay is 9 full adders for 64 inputs. But the best known actual circuit has a delay of 10 full adders. Further, Swartzlander proposed a "faster" scheme for the counting hardware. However, his scheme was suitable for only a small number of inputs, and he assumed that the delay of a read-only memory is independent of its size.
Before describing the functioning of the feedback concentrator, we discuss the design of a custom VLSI chip that was
built to implement all four blocks of figures 2 and 3.
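The serialized addition can be modelled in software as follows. This is a behavioural sketch under simplifying assumptions, not the chip's logic: the function name `serial_sum` is hypothetical, pipeline registers and their latency are ignored, and each loop iteration stands for one cycle in which one bit position of all 64 inputs is consumed.

```python
def serial_sum(values, width=8, flush_cycles=6):
    """Sum up to 64 unsigned `width`-bit values, one bit position per cycle."""
    if len(values) > 64:
        raise ValueError("one concentrator stage accepts at most 64 inputs")
    high = 0        # recirculated high-order portion (fits in 6 bits)
    out_bits = []   # serial output, least significant bit first
    for cycle in range(width + flush_cycles):
        if cycle < width:
            # Count of 1s in this bit position: a value from 0 to 64 (7 bits).
            column = sum((v >> cycle) & 1 for v in values)
        else:
            column = 0              # zeros are fed in to flush the result
        total = column + high       # at most 64 + 63 = 127, still 7 bits
        out_bits.append(total & 1)  # low bit leaves on the serial output
        high = total >> 1           # the rest is kept for the next cycle
    return sum(bit << i for i, bit in enumerate(out_bits))


# Two cascaded levels: 64 motherboards of 64 daughterboards each.
boards = [[(7 * m + 3 * d) % 256 for d in range(64)] for m in range(64)]
motherboard_sums = [serial_sum(b) for b in boards]       # 14-bit results
global_sum = serial_sum(motherboard_sums, width=14)      # 20-bit result
```

Because each stage only ever holds a 7 bit column count plus a 6 bit carried portion, the same small adder handles inputs of any length, which is the saving in hardware and wiring that serialization buys.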
3.1 THE CONCENTRATOR CHIP
A schematic of the concentrator chip is shown in figure 4. It comprises four hardware blocks: a carry-shower adder, registers, an auxiliary logic unit (ALU), and a carry-select adder. The CAAPP PEs operate with a 100nS cycle time, therefore the delay from the inputs to the outputs should be less than 100nS. There was an additional architectural constraint, that the delay from the inputs of DReg-1 to the outputs should be less than 25nS. In addition, the technology was constrained to a 2 micron CMOS process available through the MOSIS facility.
Fig 2: Motherboard Some/None and Count Network
(Figure labels: inputs L-S/N-0 through L-S/N-63 into the Motherboard Some/None Tree, output M-S/N; inputs L-Count-0 through L-Count-63, output M-Ser-Count.)
(Networks for other 3 inputs are similar to L-S/N)
Fig 4: Schematic of the Concentrator Chip
(Figure labels: most significant 6 bits; serial sum to next level.)
The carry-shower adder, designed using a basic full adder cell, is similar to the one proposed by Foster, as discussed in the previous section. The carry-shower adder generates a 7 bit sum from 64 inputs and has a delay of about 50nS.
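The idea of counting with nothing but full adders can be sketched in Python. This is a functional illustration only: the names `full_adder` and `carry_shower_count` are hypothetical, the order in which adders are applied here is arbitrary, and the sketch makes no claim about gate delay or about the exact adder arrangement used on the chip.

```python
from collections import defaultdict


def full_adder(a, b, c):
    """Return (sum, carry) for three 1-bit inputs."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def carry_shower_count(bits):
    """Count the 1s in `bits` using only full adders.

    Bits are grouped by binary weight. Three bits of one weight are replaced
    by one sum bit of the same weight and one carry of the next weight; a
    leftover pair is completed with a constant 0 input.
    """
    columns = defaultdict(list)
    columns[0] = list(bits)
    weight = 0
    while weight <= max(columns):
        column = columns[weight]
        while len(column) > 1:
            a = column.pop()
            b = column.pop()
            c = column.pop() if column else 0
            s, carry = full_adder(a, b, c)
            column.append(s)                      # stays at this weight
            columns[weight + 1].append(carry)     # showers into the next one
        weight += 1
    # Each weight now holds at most one bit: read them off as a binary number.
    return sum(col[0] << w for w, col in columns.items() if col)


inputs = [1, 0, 1, 1] * 16          # 64 one-bit inputs, 48 of them set
count = carry_shower_count(inputs)
```

Every full adder preserves the weighted sum of its inputs, so the result equals the number of 1s regardless of the order in which adders fire; Foster's contribution concerns how early the carries are used, which affects delay rather than the result.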
Another hardware block of the concentrator chip is a pair of registers, DReg-1 and DReg-2, that are of 7 and 6 bits respectively. The clock-to-output delay in the registers is about 6nS.
The auxiliary logic unit (ALU) is used to generate the logical AND, logical OR and logical EXOR of the 64 inputs to the concentrator chip. The inputs for the ALU are taken from the 7 binary outputs of DReg-1, which hold the count of the inputs. The logical EXOR of the 64 inputs is the parity of that count, and hence simply the least significant bit of DReg-1. The logical AND of the 64 inputs is equivalent to a count of 64, and thus simply the most significant bit of DReg-1. The logical OR is generated by using a two stage OR gate tree structure, which takes the 7 bits from DReg-1 as its inputs. The logical OR is used for building the Some/None circuitry at the motherboard and the global level. The logical AND and the logical EXOR are provided for future use. The auxiliary logic unit has a delay of less than 10nS.
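The three ALU outputs follow directly from the 7 bit count, as the short sketch below illustrates. The function name `alu_outputs` is hypothetical, and the code models only the logic relationships, not the gate-level structure.

```python
def alu_outputs(count):
    """Derive (AND, OR, EXOR) of 64 one-bit inputs from their 7-bit count."""
    if not 0 <= count <= 64:
        raise ValueError("count of 64 inputs must lie between 0 and 64")
    logical_and = (count >> 6) & 1   # bit 6 is set only when count == 64
    logical_or = int(any((count >> i) & 1 for i in range(7)))  # count != 0
    logical_xor = count & 1          # parity of the inputs is the count's LSB
    return logical_and, logical_or, logical_xor


inputs = [1] * 5 + [0] * 59
and_bit, or_bit, xor_bit = alu_outputs(sum(inputs))
```

Deriving all three functions from the registered count means the ALU needs no access to the 64 raw inputs, which keeps it small and fast.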
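Fig 3: Global Some/None and Count Network
(Figure labels: inputs M-S/N-0 through M-S/N-63 into the Global Some/None Tree, output G-S/N; inputs M-Ser-Count-0 through M-Ser-Count-63 into the Global Count Responder Tree, with Reset, outputs G-Par-Count and G-Ser-Count.)
(Networks for other 3 inputs are similar to M-S/N)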
The last major block of the concentrator chip is a 7 bit
carry-select adder. Its design was particularly critical to the
overall speed, and the timing of the chip. Subtracting 6nS for
the D register delay and 4nS for the delay in the output pads
in chip from the 25nS goal left us with a maximum of 15nS
allowable for the delay in the 7 bit adder. This could not be
achieved with a ripple carry adder using 7 adder cells, because
each cell would have a 5nS delay for a total of 35nS. Also, a 7 bit
carry look-ahead adder cannot be constructed in the available
technology with a delay of less than 15nS.
We achieved the desired speed by trading more hardware (VLSI chip area) for speed, using a 7 bit carry-select adder [Cavanagh 84]. A detailed description of the concentrator chip can be found in [Rana 90]. A microphotograph of the concentrator chip is shown in figure 5.
Fig 5: Microphotograph of the Concentrator Chip
3.2 FEEDBACK CONCENTRATOR OPERATION
When the concentrator chip is used for computing global Some/None, it functions as follows. The logical sum (L-S/N) of the response registers (X registers) on a daughterboard stabilizes at the end of the instruction cycle (this operation is carried out in every CAAPP instruction cycle). In the second cycle, a logical sum of the daughterboard L-S/N is computed in every motherboard concentrator chip, and is latched in its DReg-1. By the end of the second cycle, the M-S/N from the 64 motherboard concentrator chips is ready at the inputs of the global concentrator chip. Meanwhile, in the second cycle, the L-S/N of the next instruction passes through the motherboard concentrator chips in a pipelined manner. By the end of the third cycle (or 0.3µS), a global Some/None is available at the output of the global concentrator chip.
When the concentrator chip is used for counting the responders in the CAAPP or for adding 8 bit values from the ICAP, it functions as follows. One cycle after the low-order bits of a set of 64 partial counts are input, the low-order bit of the result appears at its serial output. The high-order bits of the result appear at the parallel outputs. If another set of bits are input, the high-order portion is recirculated and added to the next result through the carry-select adder. Thus on the third cycle, another result bit appears at the serial output and the high-order portion is again present at the parallel output. The process can be repeated to sum 64 inputs of any length with the low-order portion of the result being shifted out serially and the high-order six bits available in parallel one cycle after the last set of bits has been input. To flush the entire result out serially, zeros are input for six cycles after the last set of bits. Serial flushing of the high-order portion allows the chips to be cascaded. Thus, only two levels of the chips are required to form a count for the entire array. After the ACU issues a latch-count instruction, the first bit of a count reaches it at the end of the third cycle, and 13 cycles later the last of the low-order bits is output (we always assume an 8-bit value is being output by the processor chips) and the high-order portion is available in parallel. Thus, only 1.6µS are required to count the responders in an array of 262,144 CAAPP PEs or to sum 8-bit values from 4096 ICAPs, using 65 copies of a single custom chip.
4 SAMPLE ALGORITHMS
In this section we provide three sample algorithms, used extensively in low-level vision tasks. We provide their exact running time for a 512 x 512 image on the same size CAAPP array, and then compare these times with another 512 x 512 mesh-connected SIMD architecture but without the Some/None and the Count-responders mechanisms. The algorithms presented here are merely intended to demonstrate the power of the two feedback mechanisms in the extreme case. There is a great body of literature on various other architectures for low-level vision, that provide speedups between the two mesh architectures. We will not compare these architectures with the CAAPP, within the scope of this paper.
Both the mesh connected processor (MCP), and the CAAPP are assumed to have the same 100nS machine cycle time. For the CAAPP, the feedback response is stored in the array control unit (ACU), whereas the feedback response from the MCP is stored in the top rightmost PE (possibly to offload it later; the top right PE is chosen merely for convenience).
A. Some/None
As mentioned earlier, it takes 3 cycles on the CAAPP or 0.3µS for this operation. On the MCP, in the worst case, a 1 from the lower left corner PE will have to be shifted to the upper right hand corner PE. Assuming that the MCP has an instruction that allows a PE to get a value from a neighbor PE's register, OR it with its register value, and store the result in its local register (which is possible in CAAPP), all in one cycle, it will take 2 x 512 x 0.1 = 102.4µS on the MCP for this instruction.
B. Count Responders
Here we want a count of PEs with 1 stored in their registers. As mentioned earlier, the latch-count instruction takes 16 cycles or 1.6µS for this instruction in the CAAPP. In the MCP, the final count length is 19 bits. First we accumulate the counts of the rows in the rightmost PEs. The variable 'count' is kept in each rightmost column PE's memory. By successively shifting right the X register values in the rows and accumulating in the rightmost PEs, we get a maximum length of 10 for the 'count' variable. For the initial two cycles, the 'count' variable can be a maximum of 2 bits, for the next 4 cycles it can be of 3 bits, for the next 8 cycles it can be of 4 bits, and so on. This gives us a formula to compute the time for computing sums in the rows; which is:
$$2 \times 2 + 3 \times 4 + 4 \times 8 + \cdots + 9 \times 256 + 10 \times 1 = 4106 \text{ cycles, or } 410.6\,\mu\text{S}.$$
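As a check on this arithmetic, the schedule can be summed mechanically. The sketch below is an illustration with assumed names (`accumulation_cycles`, `CYCLE_US`); it takes the cost model exactly as stated in the text, namely one cycle per bit of the running total, with the total growing by one bit each time the number of accumulated values doubles. Starting from a 10 bit operand instead of a 1 bit operand gives the cost of the column accumulation that follows the row phase.

```python
CYCLE_US = 0.1  # 100 nS machine cycle, as assumed for both machines


def accumulation_cycles(operand_bits, shifts=511):
    """Cycles to fold 512 values into one running total, bit-serially."""
    cycles = 0
    width = operand_bits + 1   # total after the first addition
    group = 2                  # number of additions done at this width
    remaining = shifts         # 512 PEs in a line need 511 shift-and-add steps
    while remaining > 0:
        steps = min(group, remaining)
        cycles += width * steps
        remaining -= steps
        width += 1
        group *= 2
    return cycles


row_cycles = accumulation_cycles(operand_bits=1)       # 1-bit X register values
column_cycles = accumulation_cycles(operand_bits=10)   # 10-bit row counts
total_us = (row_cycles + column_cycles) * CYCLE_US
```

Under these assumptions the row phase comes to 4106 cycles and the column phase to 8705 cycles, about 1281.1µS in total, against 16 cycles for the latch-count path on the CAAPP.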
Next the values in the rightmost PEs are accumulated from bottom to top in the rightmost column. The time taken for this operation is given by
$$11 \times 2 + 12 \times 4 + \cdots + 18 \times 256 + 19 \times 1 = 8705 \text{ cycles, or } 870.5\,\mu\text{S}.$$
The total time taken is 1281.1µS.
It should be noted that count can be used to quickly compute other statistical measures, such as mean and standard deviation, and to form a histogram of an array of data.
C. Find greatest value
In this algorithm, the goal is to determine the greatest value in a given memory field of the PEs. In the CAAPP, the algorithm begins by loading the high-order bit of a given field into the response register of all cells. The ACU then tests the Some/None output of the CAAPP. If any PEs have their high-order bit set, then they are candidates for the maximum value, in which case, any cells that have a 0 in their high-order bit are then deactivated. However, if no cells have their high-order bit set, then none are deactivated because they are still potential candidates. This process repeats with each successively lower-order bit in the field. When the low-order bit has been processed, only those cells that have the maximum value will remain active. For each iteration, the ACU saves the Some/None response so that the maximum value is available in the ACU at the conclusion of processing. A pseudo-code algorithm is shown below.
    For BitNum := FieldLength - 1 Down to 0 Do    {Beginning with the high-order bit}
        Response := Field[BitNum]                 {Put bit in response register}
        If Some                                   {If any cell has a 1 in this bit}
            Then Activity := Response             {Then turn off activity in cells with a 0 in this bit}
This algorithm takes 40 CAAPP instruction cycles or 4.0µS for an 8 bit value.
For the MCP, first we find the maximum in each row, and put it in the rightmost PEs. Next we find the maximum in the right column and put the value in the top PE. The basic operation is to compare the fields of two neighbors, and put the greater value in the right hand side PE's memory during the row phase, and in the upper PE's memory during the column phase. It will take some k multiples of 8 instructions for each compare. The value of k for MCP will depend upon the specific architectural implementation, and its instruction set. When this algorithm is emulated on the CAAPP, the value of k is 4. Thus, the total run time for the algorithm on the MCP will be of the order of 4 x 8 x 2 x 512 cycles, or 3276.8µS, for the operation.
There are numerous algorithms where a hardware Some/None and Count-responder mechanism will give significant speedups. Our objective in this section was to demonstrate a few low-level algorithms that make use of the feedback concentrator mechanism.
5 CONCLUSIONS
One of the most important architectural requirements of a multilevel parallel processor for computer vision is that it should provide mechanisms for rapid global summary for any centrally controlled processing.
The hardware implementation of two important summary feedback mechanisms: Some/None, and Count-responders, for the lower two processing levels of the Image Understanding Architecture, was described. Both mechanisms are implemented using multiple copies of a single custom VLSI chip. The architecture of the custom chip was described in detail. The performance of the IUA's low-level processor with the feedback concentrator was compared to a similar mesh-connected parallel processor without the feedback concentrator mechanism, and shown to be significantly faster.
REFERENCES
[Boldt 87] M. Boldt and R. Weiss, "Token based extraction of straight lines," COINS Tech Report 87-104, Computer and Information Science Dept., University of Massachusetts, Amherst, MA 01003, Oct 1987.
[Cavanagh 84] Joseph J. F. Cavanagh, Digital Computer Arithmetic, McGraw Hill, New York, 1984.
[Draper 89] B.A. Draper, R.T. Collins, J. Brolio, J. Griffith, A.R. Hanson, and E.M. Riseman, "The Schema System," International Journal of Computer Vision, Vol 2, March 1989, pp 209 - 250.
[Duff 86] M.J.B. Duff (Ed), Intermediate-Level Image Processing, Academic Press, New York, 1986.
[Favor 64] J. Favor, "A method for obtaining the exact count of responses using Full and Half adders," AP-111770, Goodyear Aerospace Corporation, Akron, Ohio, Oct 1964.
[Foster 71] C. C. Foster, "Counting responders in an Associative Memory," IEEE Trans. on Computers, Dec 1971, pp 1580 - 1583.
[Foster 76] C. C. Foster, Content Addressable Parallel Processors, Van Nostrand Reinhold Company, New York, 1976.
[Hanson 78] A.R. Hanson and E.M. Riseman, Computer Vision Systems, Academic Press, New York, 1978.
[Rana 88] D. Rana, C.C. Weems, and S.P. Levitan, "An easily reconfigurable circuit-switched connection network," Proc 1988 IEEE Int Symp on Circuits and Systems, June 1988, pp 247 - 250.
[Rana 89] D. Rana, and C.C. Weems, "The ICAP Parallel Processor Communication Switch," Proc 1989 IEEE Int Symp on Circuits and Systems, May 1989, pp 126 - 129.
[Rana 90] D. Rana, and C.C. Weems, "A Feedback Concentrator for the Image Understanding Architecture," COINS Tech Report, Computer and Information Science Dept., University of Massachusetts, Amherst, MA 01003, May 1990.
[Swartzlander 73] E.E. Swartzlander, "Parallel Counters," IEEE Trans on Computers, Nov 1973, pp 1021 - 1024.
[Weems 89] C.C. Weems, S.P. Levitan, A.R. Hanson, E.M. Riseman, D.B. Shu, and J.G. Nash, "The Image Understanding Architecture," International Journal of Computer Vision, March 1989, pp 251-282.