
The IUA feedback concentrator

[1990] Proceedings. 10th International Conference on Pattern Recognition

Many low- and intermediate-level vision algorithms are characterized by massive spatial or data parallelism. When these algorithms are mapped onto multiple-processor systems, making a global decision often requires gathering data from all of the processors. In SIMD parallel processors, making a global decision is equivalent to test-and-branch in a uniprocessor, and should have similar time cost. This requires a fast global feedback mechanism in the system. This paper presents a description of the custom hardware in the Image Understanding Architecture that provides global feedback, and a few example algorithms that take advantage of it.

Deepak Rana
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, MA 01003
Rana%cs.umass.edu

Charles C. Weems
Department of Computer and Information Sciences
University of Massachusetts, Amherst, MA 01003
Weems@cs.umass.edu

This work was funded in part by the Defense Advanced Research Projects Agency under contract numbers DACA76-86-C-0015 and DACA76-89-C0016, monitored by the U.S. Army Engineer Topographic Laboratory, and contract number F49620-86-C-0041, monitored by the Air Force Office of Scientific Research, and by a Coordinated Experimental Research grant #DCA8500332 from the National Science Foundation.

CH2898-5/90/0000/0540$01.00 © 1990 IEEE

1 INTRODUCTION

Machine vision is one of the most computationally intractable problems, requiring at least three levels of computational abstraction, with two of these levels obvious: processing of the sensory data, and processing of the world knowledge. The necessity of an intermediate level of processing, bridging the gap between the low and high levels, has been motivated in [Boldt 87; Draper 89; Duff 86; Hanson 78], among others.

The Image Understanding Architecture (IUA) is a massively parallel, multi-level system, representing a hardware implementation of these three levels of abstraction. [Weems 89] provides an overview of the IUA. High-speed, fine-grained communication and control at the low and the intermediate levels in the IUA is achieved by using associative processing techniques. Foster [Foster 76] has identified four key processing capabilities with associative computation:

(1) Global broadcast/local compare/activity control.
(2) Select a single responder.
(3) Some/None response.
(4) Count responders.

The purpose of this paper is to present the architectural details of the last two of these capabilities as they are implemented in the IUA.

This paper is organized as follows. Section 2 presents a brief overview of the IUA. Section 3 presents the details of the feedback concentrator mechanism as implemented using a custom VLSI chip. The custom chip uses an innovative combination of circuit techniques to achieve high speed. A description of the custom VLSI concentrator chip is provided in section 3.1. In section 4 we compare the performance of our architecture with a mesh-connected processor without the feedback mechanism for three common low-level vision tasks. The paper ends with conclusions and references.

2 OVERVIEW OF THE IUA

The IUA integrates three different parallel processors operating simultaneously at three levels of abstraction and computational granularity in a tightly coupled manner. Communication between levels takes place via parallel data and control paths. The processing elements within each level can also communicate with each other in parallel, via different mechanisms at each level.

The low level, called the Content Addressable Array Parallel Processor (CAAPP), is a 512 x 512 array of bit-serial processors intended to perform low-level image processing tasks on the input image pixels. The CAAPP architecture is especially oriented toward associative processing, with hardware support for the four capabilities described in section 1. The CAAPP processing elements are linked through a four-way communications grid that is augmented with a Coterie Network that allows certain types of long-distance communications to take place quickly. The CAAPP operates in SIMD mode under the control of a dedicated Array Control Unit (ACU).

The intermediate level, called the Intermediate and Communication Associative Processor (ICAP), is a collection of 4096 16-bit fast Digital Signal Processor (DSP) chips. The ICAP is designed for manipulating symbolic descriptions of extracted image events. Control of the ICAP is provided by the ACU in synchronous-MIMD mode, and by the high level in pure MIMD mode. When operating under the control of the ACU, the ICAP is treated as an associative array with the same four capabilities listed in section 1. The individual processors in the ICAP communicate with each other through a reconfigurable interconnection network [Rana 88; Rana 89].

At the high level, called the Symbolic Processing Array (SPA), a set of 64 processors capable of executing LISP programs supports computation involving inference, hypothesis generation and verification, analysis of uncertainty, model-based processing, and indirect control of processing at lower levels. The SPA processors operate in MIMD mode.

A proof-of-concept prototype of 1/64th of the IUA is currently being constructed by the University of Massachusetts, Amherst, and Hughes Research Laboratories, Malibu, California. The remainder of this paper focuses on the hardware implementation of fast global summary operations (Some/None response and responder count) in the CAAPP and the ICAP levels of the IUA.

3 THE IUA FEEDBACK CONCENTRATOR

The hardware comprising the two lower-level processors of the IUA is partitioned into motherboards and daughterboards. The full IUA system comprises 64 motherboards. Each motherboard contains 64 daughterboards. Each daughterboard holds one CAAPP chip and an ICAP processor, along with memory chips and the associated interface and I/O circuitry. Each CAAPP chip contains 64 bit-serial CAAPP PEs, their local memory, and interface logic. Each CAAPP PE has a response register called the X register, and an activity register called the A register.

The fastest method of counting the number of responders would be to feed the 1 bit output from the X registers of the 512 x 512 = 262,144 CAAPP PEs to a hardware adder and generate a 19 bit sum for the ACU. One technique, proposed by Favor [Favor 64] for counting the number of responders, uses a pyramid of full adders to generate the least significant bit of the count of the inputs. Next, the carries generated at various stages of the adder pyramid are fed to a second, smaller pyramid to generate the second least significant count bit. Successively smaller adder pyramids are chained together to develop the full count. In this method each stage waits for one count bit to be formed before feeding any carries from the current level to the next. Foster [Foster 71] improved upon Favor's scheme such that at any stage of the current adder pyramid, as soon as 3 carries are available, they are fed to the next level adder pyramid. Foster's scheme, called a carry-shower adder, results in a significant speedup over Favor's scheme. For example, Favor's scheme has a delay of 21 full adders for 64 inputs, whereas Foster's scheme has a delay of 10 full adders for 64 inputs.¹ However, the fan-in from 262,144 processors to a single sum is too great to be practically realized this way.

In the IUA design, the pin constraints on the processor chip, and space and pin limitations for the external circuitry, required that the chip-level counts be output serially. Figure 1 illustrates the logical organization of the Some/None and the Count-responder mechanisms on one daughterboard. Local Some/None is the logical sum of the X registers of the 64 CAAPP PEs on a daughterboard. This signal is generated at the end of every CAAPP instruction. If one or more PEs have a logical 1 in their X register, the Local Some/None is set to logical 1. Foster's scheme is used within the processor chip to produce a response count at the end of every CAAPP instruction cycle. A special instruction, called latch-count, allows either the CAAPP count or an 8 bit value from the local ICAP to be loaded in the daughterboard count register (CR). The value in the CR can be read out serially from the CAAPP chip. A separate instruction, called latch-local-count, allows the CAAPP count to be loaded in the local count register, where it can be read by the ICAP processor on the same daughterboard. Three additional general purpose signals D0 - D2 are provided by the ICAP and are multiplexed under program control with three special purpose feedback signals from the CAAPP chip.

[Fig 1: Daughterboard Some/None and Count Network. Signals shown: L-S/N, D2/EF, D1/B, D0/IC, L-Count. Legend: D2, D1, D0: ICAP status lines; EF, B, IC: ICAP/CAAPP status lines; S/N: local Some/None line on a daughterboard; LCR: Local Count Register; ICR: ICAP Count Register; CR: Daughterboard Count Register.]

To summarize, the following feedback signals are provided on each daughterboard: Local Some/None (L-S/N); L-Count, which can be either a CAAPP responder count or an 8 bit value from the ICAP; and three general purpose feedback signals from the ICAP, which are multiplexed with three special purpose feedback signals from the CAAPP chip.

On the daughterboards, a finite state machine automatically shifts the count out of the chip, least significant bit first, one bit per cycle. Output begins as soon as a count is latched.
The processors are thus able to overlap computation with the development of the current count.

There are many ways to sum the daughterboard-level counts. One method would be to feed 8 bit values from 4096 daughterboards to a hardware adder such as a Carry-Save-Adder (CSA) [Cavanagh 84]. This approach is infeasible because of the hardware cost and the number of wires. If the CSA were arbitrarily fast, this scheme would take about 1 µs to compute the global count. Our final solution was to trade a small amount of the speed of counting responders for a substantial saving in the hardware and the number of wires: we serialized the entire adding process. This scheme is shown in figure 2 and figure 3.

Figure 2 illustrates the Some/None and the Count-responder mechanisms on one motherboard. The local Some/None (L-S/N) outputs from the 64 daughterboards are fed into the motherboard Some/None tree. (The D0 - D2 feedback signals are treated in the same way.) The local counts (L-Count) from the count register (CR) of the 64 daughterboards are serially fed into the motherboard count responder tree. Figure 3 illustrates the global Some/None and Count-responder mechanisms for the full IUA, which concentrate the outputs of the motherboard-level networks and are similar in design.

Before describing the functioning of the feedback concentrator, we discuss the design of a custom VLSI chip that was built to implement all four blocks of figures 2 and 3.

¹ In another study, Swartzlander [Swartzlander 73] corrected Foster's lower bound formula and showed that the theoretical lower bound on the delay is 9 full adders for 64 inputs. But the best known actual circuit has a delay of 10 full adders. Further, Swartzlander proposed a "faster" scheme for the counting hardware. However, his scheme was suitable for only a small number of inputs, and he assumed that the delay of a read-only memory is independent of its size.
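The serialized, two-level summation can be pictured in software. The following is a minimal Python sketch under stated assumptions, not a model of the chip's circuitry: it assumes a flat grouping of 64 X-register bits per CAAPP chip, 64 daughterboards per motherboard, and 64 motherboards, and it treats each concentrator chip as a bit-serial adder that counts the 64 bits arriving in a cycle, adds a recirculated high-order portion, and shifts one result bit out (the behaviour described in section 3.2). The names serial_sum and global_count are illustrative and do not come from the paper.

```python
from typing import List

CHIP_INPUTS = 64  # inputs per concentrator chip, as stated in the paper


def serial_sum(values: List[int], width: int) -> int:
    """Add up to 64 non-negative `width`-bit integers fed one bit-plane per cycle, LSB first."""
    if len(values) > CHIP_INPUTS:
        raise ValueError("a concentrator chip has only 64 inputs")
    high = 0    # recirculated high-order portion; it stays below 64, so 6 bits suffice
    result = 0
    cycle = 0
    # Feed `width` real bit-planes, then zeros until the high-order portion is flushed
    # (the hardware always feeds zeros for six cycles; here we stop once it is empty).
    while cycle < width or high:
        column = sum((v >> cycle) & 1 for v in values) if cycle < width else 0
        total = column + high             # at most 64 + 63 = 127, a 7-bit value
        result |= (total & 1) << cycle    # low-order bit leaves on the serial output
        high = total >> 1                 # remaining bits recirculate
        cycle += 1
    return result


def global_count(x_bits: List[int]) -> int:
    """Count responders with one level of motherboard chips and one global chip."""
    # Each CAAPP chip counts its own 64 PEs and latches the result as an 8-bit value.
    local = [sum(x_bits[i:i + 64]) for i in range(0, len(x_bits), 64)]
    # Each motherboard chip sums the serial counts of its 64 daughterboards.
    mother = [serial_sum(local[i:i + 64], 8) for i in range(0, len(local), 64)]
    # A motherboard result has 8 low-order bits plus 6 flushed high-order bits.
    return serial_sum(mother, 14)
```

For a full 512 x 512 array, global_count returns the same value as a direct sum of the X bits while using only 65 calls to serial_sum, which mirrors the 65 chip copies mentioned in the paper.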
[Fig 2: Motherboard Some/None and Count Network. Inputs L-S/N-0 to L-S/N-63 feed the Motherboard Some/None Tree, which outputs M-S/N; inputs L-Count-0 to L-Count-63 feed the motherboard count network, which outputs M-Ser-Count. (Networks for other 3 inputs are similar to L-S/N.)]

[Fig 3: Global Some/None and Count Network. Inputs M-S/N-0 to M-S/N-63 feed the Global Some/None Tree, which outputs G-S/N; inputs M-Ser-Count-0 to M-Ser-Count-63 feed the Global Count Responder Tree, which outputs G-Par-Count and G-Ser-Count. (Networks for other 3 inputs are similar to M-S/N.)]

3.1 THE CONCENTRATOR CHIP

A schematic of the concentrator chip is shown in figure 4. It comprises four hardware blocks: a carry-shower adder, registers, an auxiliary logic unit (ALU), and a carry-select adder. The CAAPP PEs operate with a 100 ns cycle time, therefore the delay from the inputs to the outputs should be less than 100 ns. There was an additional architectural constraint, that the delay from the inputs of DReg-1 to the outputs should be less than 25 ns. In addition, the technology was constrained to a 2 micron CMOS process available through the MOSIS facility.

[Fig 4: Schematic of the Concentrator Chip. Labels shown: 64 inputs, most significant 6 bits, serial sum to next level.]

The carry-shower adder, designed using a basic full adder cell, is similar to the one proposed by Foster, as discussed in the previous section. The carry-shower adder generates a 7 bit sum from 64 inputs and has a delay of about 50 ns.

Another hardware block of the concentrator chip is a pair of registers, DReg-1 and DReg-2, that are of 7 and 6 bits respectively. The clock-to-output delay in the registers is about 6 ns.

The auxiliary logic unit (ALU) is used to generate the logical AND, logical OR, and logical EXOR of the 64 inputs to the concentrator chip. The inputs for the ALU are taken from the 7 binary outputs of DReg-1. The logical EXOR of the 64 inputs is the parity of the number of inputs that are 1, and hence simply the least significant bit of the 7 bit sum. The logical AND of the 64 inputs is equivalent to a count of 64, and thus simply the most significant bit of DReg-1. The logical OR is generated by using a two stage OR gate tree structure, which takes the 7 bits from DReg-1 as its inputs. The logical OR is used for building the Some/None circuitry at the motherboard and the global level. The logical AND and the logical EXOR are provided for future use. The auxiliary logic unit has a delay of less than 10 ns.

The last major block of the concentrator chip is a 7 bit carry-select adder. Its design was particularly critical to the overall speed and the timing of the chip. Subtracting 6 ns for the D register delay and 4 ns for the delay in the output pads of the chip from the 25 ns goal left us with a maximum of 15 ns allowable for the delay in the 7 bit adder. This could not be achieved with a ripple carry adder using 7 adder cells, because each cell would have a 5 ns delay for a total of 35 ns. Also, a 7 bit carry look-ahead adder cannot be constructed in the available technology with a delay of less than 15 ns. We achieved the desired speed by trading more hardware (VLSI chip area) for speed, using a 7 bit carry-select adder [Cavanagh 84]. A detailed description of the concentrator chip can be found in [Rana 90]. A microphotograph of the concentrator chip is shown in figure 5.

[Fig 5: Microphotograph of the Concentrator Chip]

3.2 FEEDBACK CONCENTRATOR OPERATION

When the concentrator chip is used for computing global Some/None, it functions as follows. The logical sum (L-S/N) of the response registers (X registers) on a daughterboard stabilizes at the end of the instruction cycle (this operation is carried out in every CAAPP instruction cycle). In the second cycle, a logical sum of the daughterboard L-S/N signals is computed in every motherboard concentrator chip, and is latched in its DReg-1. By the end of the second cycle, the M-S/N from the 64 motherboard concentrator chips is ready at the inputs of the global concentrator chip. Meanwhile, in the second cycle, the L-S/N of the next instruction passes through the motherboard concentrator chips in a pipelined manner. By the end of the third cycle (or 0.3 µs), a global Some/None is available at the output of the global concentrator chip.

When the concentrator chip is used for counting the responders in the CAAPP or for adding 8 bit values from the ICAP, it functions as follows. One cycle after the low-order bits of a set of 64 partial counts are input, the low-order bit of the result appears at its serial output. The high-order bits of the result appear at the parallel outputs. If another set of bits is input, the high-order portion is recirculated and added to the next result through the carry-select adder. Thus on the third cycle, another result bit appears at the serial output and the high-order portion is again present at the parallel output. The process can be repeated to sum 64 inputs of any length, with the low-order portion of the result being shifted out serially and the high-order six bits available in parallel one cycle after the last set of bits has been input. To flush the entire result out serially, zeros are input for six cycles after the last set of bits. Serial flushing of the high-order portion allows the chips to be cascaded. Thus, only two levels of the chips are required to form a count for the entire array.

After the ACU issues a latch-count instruction, the first bit of a count reaches it at the end of the third cycle, and 13 cycles later the last of the low-order bits is output (we always assume an 8-bit value is being output by the processor chips) and the high-order portion is available in parallel. Thus, only 1.6 µs are required to count the responders in an array of 262,144 CAAPP PEs or to sum 8-bit values from 4096 ICAPs, using 65 copies of a single custom chip.

4 SAMPLE ALGORITHMS

In this section we provide three sample algorithms, used extensively in low-level vision tasks. We provide their exact running time for a 512 x 512 image on the same size CAAPP array, and then compare these times with another 512 x 512 mesh-connected SIMD architecture but without the Some/None and the Count-responders mechanisms. The algorithms presented here are merely intended to demonstrate the power of the two feedback mechanisms in the extreme case. There is a great body of literature on various other architectures for low-level vision that provide speedups between the two mesh architectures. We will not compare these architectures with the CAAPP within the scope of this paper.

Both the mesh-connected processor (MCP) and the CAAPP are assumed to have the same 100 ns machine cycle time. For the CAAPP, the feedback response is stored in the array control unit (ACU), whereas the feedback response from the MCP is stored in the top rightmost PE (possibly to offload it later; the top right PE is chosen merely for convenience).

A. Some/None

As mentioned earlier, it takes 3 cycles on the CAAPP, or 0.3 µs, for this operation. On the MCP, in the worst case, a 1 from the lower left corner PE will have to be shifted to the upper right hand corner PE. Assuming that the MCP has an instruction that allows a PE to get a value from a neighbor PE's register, OR it with its register value, and store the result in its local register (which is possible in CAAPP), all in one cycle, it will take 2 x 512 x 0.1 = 102.4 µs on the MCP for this instruction.

B. Count Responders

Here we want a count of PEs with 1 stored in their X registers. As mentioned earlier, the latch-count instruction takes 16 cycles, or 1.6 µs, for this operation in the CAAPP. In the MCP, the final count length is 19 bits. First we accumulate the counts of the rows in the rightmost PEs. The variable 'count' is kept in each rightmost column PE's memory. By successively shifting right the X register values in the rows and accumulating in the rightmost PEs, we get a maximum length of 10 for the 'count' variable. For the initial two cycles, the 'count' variable can be a maximum of 2 bits, for the next 4 cycles it can be of 3 bits, for the next 8 cycles it can be of 4 bits, and so on. This gives us a formula to compute the time for computing sums in the rows, which is:

2 x 2 + 3 x 4 + 4 x 8 + ... + 9 x 256 + 10 x 1 = 4106 cycles, or 410.6 µs.

Next the values in the rightmost PEs are accumulated from bottom to top in the rightmost column. The time taken for this operation is given by

11 x 2 + 12 x 4 + ... + 18 x 256 + 19 x 1 = 8705 cycles, or 870.5 µs.

The total time taken is 1281.1 µs.

It should be noted that count can be used to quickly compute other statistical measures, such as mean and standard deviation, and to form a histogram of an array of data.

C. Find greatest value

In this algorithm, the goal is to determine the greatest value in a given memory field of the PEs. In the CAAPP, the algorithm begins by loading the high-order bit of a given field into the response register of all cells. The ACU then tests the Some/None output of the CAAPP. If any PEs have their high-order bit set, then they are candidates for the maximum value, in which case any cells that have a 0 in their high-order bit are deactivated. However, if no cells have their high-order bit set, then none are deactivated because they are still potential candidates. This process repeats with each successively lower-order bit in the field. When the low-order bit has been processed, only those cells that have the maximum value will remain active. For each iteration, the ACU saves the Some/None response so that the maximum value is available in the ACU at the conclusion of processing. A pseudo-code algorithm is shown below.

    For BitNum := FieldLength - 1 Down to 0 Do    {Beginning with the high-order bit}
        Response := Field[BitNum]                 {Put bit in response register}
        If Some                                   {If any cell has a 1 in this bit}
            Then Activity := Response             {Then turn off activity in cells with a 0 in this bit}

This algorithm takes 40 CAAPP instruction cycles, or 4.0 µs, for an 8 bit value. For the MCP, first we find the maximum in each row, and put it in the rightmost PEs. Next we find the maximum in the right column and put the value in the top PE. The basic operation is to compare the fields of two neighbors, and put the greater value in the right hand side PE's memory during the row phase, and in the upper PE's memory during the column phase. It will take some k multiples of 8 instructions for each compare. The value of k for the MCP will depend upon the specific architectural implementation and its instruction set. When this algorithm is emulated on the CAAPP, the value of k is 4. Thus, the total run time for the algorithm on the MCP will be of the order of 4 x 8 x 2 x 512 cycles, or 3276.8 µs.

There are numerous algorithms where a hardware Some/None and Count-responder mechanism will give significant speedups. Our objective in this section was to demonstrate a few low-level algorithms that make use of the feedback concentrator mechanism.

5 CONCLUSIONS

One of the most important architectural requirements of a multilevel parallel processor for computer vision is that it should provide mechanisms for rapid global summary for any centrally controlled processing. The hardware implementation of two important summary feedback mechanisms, Some/None and Count-responders, for the lower two processing levels of the Image Understanding Architecture, was described. Both mechanisms are implemented using multiple copies of a single custom VLSI chip. The architecture of the custom chip was described in detail. The performance of the IUA's low-level processor with the feedback concentrator was compared to a similar mesh-connected parallel processor without the feedback concentrator mechanism, and shown to be significantly faster.

REFERENCES

[Boldt 87] M. Boldt and R. Weiss, "Token based extraction of straight lines," COINS Tech Report 87-104, Computer and Information Science Dept., University of Massachusetts, Amherst, MA 01003, Oct 1987.

[Cavanagh 84] Joseph J. F. Cavanagh, Digital Computer Arithmetic, McGraw Hill, New York, 1984.

[Draper 89] B.A. Draper, R.T. Collins, J. Brolio, J. Griffith, A.R. Hanson, and E.M. Riseman, "The Schema System," International Journal of Computer Vision, Vol 2, March 1989, pp 209-250.

[Duff 86] M.J.B. Duff (Ed), Intermediate-Level Image Processing, Academic Press, New York, 1986.

[Favor 64] J. Favor, "A method for obtaining the exact count of responses using Full and Half adders," AP-111770, Goodyear Aerospace Corporation, Akron, Ohio, Oct 1964.

[Foster 71] C.C. Foster, "Counting responders in an Associative Memory," IEEE Trans. on Computers, Dec 1971, pp 1580-1583.

[Foster 76] C.C. Foster, Content Addressable Parallel Processors, Van Nostrand Reinhold Company, New York, 1976.

[Hanson 78] A.R. Hanson and E.M. Riseman, Computer Vision Systems, Academic Press, New York, 1978.

[Rana 88] D. Rana, C.C. Weems, and S.P. Levitan, "An easily reconfigurable circuit-switched connection network," Proc 1988 IEEE Int Symp on Circuits and Systems, June 1988, pp 247-250.

[Rana 89] D. Rana and C.C. Weems, "The ICAP Parallel Processor Communication Switch," Proc 1989 IEEE Int Symp on Circuits and Systems, May 1989, pp 126-129.

[Rana 90] D. Rana and C.C. Weems, "A Feedback Concentrator for the Image Understanding Architecture," COINS Tech Report, Computer and Information Science Dept., University of Massachusetts, Amherst, MA 01003, May 1990.

[Swartzlander 73] E.E. Swartzlander, "Parallel Counters," IEEE Trans on Computers, Nov 1973, pp 1021-1024.

[Weems 89] C.C. Weems, S.P. Levitan, A.R. Hanson, E.M. Riseman, D.B. Shu, and J.G. Nash, "The Image Understanding Architecture," International Journal of Computer Vision, March 1989, pp 251-282.
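As a supplementary illustration of the find-greatest-value algorithm of section 4C, the following Python sketch emulates the bit-serial search on an ordinary list. It is a sketch under stated assumptions rather than CAAPP code: each list element stands for one PE's memory field, Python's any() stands in for the hardware Some/None response, and the response is assumed to be masked by the activity register so that deactivated PEs no longer respond. The function name find_greatest is hypothetical.

```python
from typing import List, Tuple


def find_greatest(field: List[int], field_length: int = 8) -> Tuple[int, List[bool]]:
    """Return the maximum value and the final activity flag of every emulated PE."""
    active = [True] * len(field)   # activity (A) register of each PE
    maximum = 0                    # value assembled bit by bit in the controller
    for bit_num in range(field_length - 1, -1, -1):   # high-order bit first
        # Response (X) register: the current bit of each PE that is still active.
        response = [a and bool((v >> bit_num) & 1) for v, a in zip(field, active)]
        some = any(response)                    # global Some/None test
        maximum = (maximum << 1) | int(some)    # the controller saves each response
        if some:
            active = response   # turn off PEs that have a 0 in this bit
    return maximum, active
```

For example, find_greatest([3, 7, 7, 2], 3) yields 7 together with activity flags that are set only for the two PEs holding 7. The loop body runs once per bit regardless of the array size, which is the property the Some/None hardware is meant to exploit.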