
A computational model for static data flow machines

1995, Computers & Electrical Engineering

Abstract—Computer architects have been constantly looking for new approaches to design high-performance machines. Data flow and VLSI offer two mutually supportive approaches towards a promising design for future super-computers. When very high speed computations are needed, data flow machines may be relied upon as an adequate solution in which extremely parallel processing is achieved. This paper presents a formal analysis for data flow machines. Moreover, the following three machines are considered: (1) MIT static data flow machine; (2) TI's DDP static data flow machine; (3) LAU data flow machine. These machines are investigated by making use of a reference model. The contributions of this paper include: (1) Developing a Data Flow Random Access Machine model (DFRAM), for the first time, to serve as a formal modeling tool. Also, by making use of this model one can calculate the time cost of various static data flow machines, as well as the performance of these machines. (2) Constructing a practical Data Flow Simulator (DFS) on the basis of the DFRAM model. Such a DFS is modular and portable and can be implemented with ease. The DFS is used not only to study the performance of the underlying data flow machines but also to verify the DFRAM model.

A COMPUTATIONAL MODEL FOR STATIC DATA FLOW MACHINES

FATHY E. EASSA, M. M. EASSA and M. ZAKI

Computers and Systems Engineering Department, Faculty of Engineering, Al-Azhar University, Nasr City, Cairo, Egypt

(Received for publication 3 July 1995)

Key words: Static data flow machine, parallel processing, computer architecture, random access machine model, data flow random access machine model, data flow graph, firing time, token table, queue structure, enabling network controller.

1. INTRODUCTION

The demand for ultrahigh speed computing machines for analyzing physical processes, solving scientific problems and intelligence computations is increasing every day. The major difficulty in satisfying this demand with uniprocessing is the physical constraints of hardware and the sequential control in the Von-Neumann abstract computing model. The alternative to sequential processing is parallel processing with high density devices. Computer architects have been constantly searching for new approaches to develop high performance machines, and data flow offers a supportive approach towards the design of future supercomputers.

Data flow computers [1-9] are based on the concept of data-driven computation, which is drastically different from the operation of a conventional Von-Neumann machine. The fundamental difference is that instruction execution in a conventional computer is under program-flow control, whereas in a data flow computer, processing is driven by data availability. This means that the concepts of control flow and data flow computing are distinguished by the control of computation sequences in two distinct computational paradigms. Data flow computers are designed to execute data flow program graphs [9-15]. The nodes of a data flow graph are computation and control constructs. Arcs pass tokens carrying data values between nodes. When a node receives tokens on each of its incoming arcs, it can then fire: it absorbs the input tokens, computes a result, and generates a token carrying the result value.
The Random Access Machine (RAM) model [12] is widely used as a formal model for describing Von-Neumann computers with sequential control. This model has also been extended to the Parallel Random Access Machine (PRAM) model [12] in order to serve as a modeling tool for control flow machines with parallel computations. However, to our knowledge, nothing has been reported about a "formal" model for data flow machines. In what follows we present the DFRAM model as a formal representation for data flow machines.

The DFRAM is a new paradigm of the RAM model which is concerned with data flow instead of control flow. Moreover, the DFRAM verification has been carried out by making use of three different machines: the MIT [13], the DDP [13] and the LAU [13] static data flow machines. The three machines are experimental and have been developed at different places. The MIT machine has been built by the Dennis group at MIT, the DDP has been made by Texas Instruments Co., and the LAU machine is available at the ONERA/CERT Laboratory, France.

2. ARCHITECTURE OF STATIC DATA FLOW MACHINES

In static architectures of data flow machines the nodes of a program graph are loaded into a memory before the computation begins, and at most one instance of a node is enabled for firing at a time, i.e. in the static data flow model only one token (or instruction operand) is allowed on a program arc at any time [22-25]. Three static data flow architectures are presented in this section: MIT, TI's DDP and LAU.

2.1. MIT architecture

The MIT static architecture consists of memory units, a processing section, an arbitration network and a distribution network, as shown in Fig. 1 [1,13]. Every memory unit is a collection of blocks of storage locations. Each block (called a cell block) stores the operation, operands and destination address of a node. The program graph to be executed is loaded in the cell blocks of the memory by the host before computation begins. To support rapid access of array elements, a part of the memory is used for storing arrays. Enabled instructions in the memory are communicated to the processing elements as operation packets using the routing network R1 in Fig. 1. The results of execution are communicated to cell blocks as data packets using the routing network R2. The execution of a program graph terminates when none of the nodes in the memory units is enabled. It is assumed that the routing networks are fault tolerant and that the FIFOs in the routing networks have sufficient capacity to reduce blockage [13]. Faults in the machine require restarting the computation from the beginning.

[Fig. 1. The MIT machine architecture: instruction and array memory units Mi, processing elements PEj, routing networks R1 and R2, and the host.]

In the MIT machine a processor does not have a private memory for instructions or data; the machine has a central memory. Packets are communicated between memory cell blocks and processors using FIFO-buffered arbitration and distribution networks. A node in the machine model has three input and four output arcs.
The MIT machine supports multiactor instructions and vector instructions to reduce token traffic in the routing networks. It has one path with a width of 32 bits and packets of up to 11 words. Streams are handled by pipelining word data packets.

2.2. TI's DDP architecture

The data-driven processor, or DDP, of Texas Instruments is designed for executing Fortran programs using some of the principles outlined in the MIT architecture. The DDP investigates the feasibility of static data flow computing without acknowledgement signals for high speed computing systems. A block diagram of the DDP machine is shown in Fig. 2.

[Fig. 2. The DDP machine architecture: processing elements connected by a 34-bit E-bus, a maintenance controller with its own bus, a result port, a Q-FIFO and the host.]

The program graph corresponding to a source program is generated by a compiler in the host processor. The program graph is partitioned into subgraphs by a cluster detection algorithm, and the subgraphs are loaded into the memory units of the processing elements. Every node stored in the memory unit of a processor has an operation, a counter (called the predecessor counter) used to determine whether the node can be enabled for firing, a maximum of 13 operands, and a maximum of 13 destination addresses; the total number of input and output arcs cannot exceed 14. A node is enabled when its predecessor counter reaches zero. The enabled node is executed in the ALU of the processing element. The results of a node firing are forwarded to successor nodes in the processor's memory unit or in another processing element. If the successor node is in another processing element, the result is communicated over the interconnection network (the E-bus) as a series of 34-bit packets, as shown in Fig. 2. As a result of this token forwarding, one or more nodes may be enabled. Since there is a single ALU in a processing element, the enabled nodes are linked together in a queue called the pending instruction queue (PIQ). The node at the head of the queue is dispatched for execution [13].

The maintenance controller (Fig. 2) detects faulty processor(s). Faults require restarting the computation at the previous checkpoint, where the buffer contents can be dumped using the maintenance controller. Each instruction packet has a field for the processor number. After detecting a fault in a processor, that processor number will not be used in allocating instruction packets. The previously scheduled instruction packets and the results destined for a failed processor can be rerouted to another processor using the maintenance controller.

In the DDP machine the pending instruction queue (PIQ) holds executable instructions that have been enabled. Executable instructions are removed from this queue by the arithmetic unit. When the capacity of the queue is exceeded, the enabled instructions are linked, in memory, to the PIQ via a link field that is reserved for this purpose. This method has the advantage that no amount of program parallelism can overflow the capacity of the hardware resources [1,3,13].
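The predecessor-counter and PIQ mechanisms lend themselves to a short illustration. The following is a minimal sketch, not TI's implementation: the class names, the apply_op helper and the PIQ_CAPACITY value are our own hypothetical choices, and only the enabling and overflow bookkeeping described above is modeled.

```python
from collections import deque

PIQ_CAPACITY = 8   # hypothetical hardware queue size, chosen for illustration

def apply_op(op, operands):
    """Toy ALU: only the operations this sketch needs."""
    if op == '+':
        return sum(operands)
    if op == '*':
        result = 1
        for v in operands:
            result *= v
        return result
    raise ValueError(op)

class Node:
    def __init__(self, op, n_inputs, destinations):
        self.op = op
        self.pred_count = n_inputs        # predecessor counter: operands still missing
        self.operands = []
        self.destinations = destinations  # ids of successor nodes in this PE

class ProcessingElement:
    """One DDP PE: nodes in local memory, enabled nodes queued in the PIQ."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.piq = deque()    # hardware pending instruction queue
        self.overflow = []    # enabled nodes linked in memory when the PIQ is full
        self.results = []     # stands in for the result port towards the host

    def deliver_token(self, node_id, value):
        node = self.nodes[node_id]
        node.operands.append(value)
        node.pred_count -= 1
        if node.pred_count == 0:                 # all operands arrived: enabled
            if len(self.piq) < PIQ_CAPACITY:
                self.piq.append(node)
            else:                                 # link into memory, never lost
                self.overflow.append(node)

    def step(self):
        """Fire the node at the head of the PIQ in the PE's single ALU."""
        if not self.piq:
            return False
        node = self.piq.popleft()
        result = apply_op(node.op, node.operands)
        if node.destinations:
            for dest in node.destinations:        # forward the result token,
                self.deliver_token(dest, result)  # which may enable successors
        else:
            self.results.append(result)           # final result to the host
        if self.overflow:                         # refill the PIQ from memory
            self.piq.append(self.overflow.pop(0))
        return True

# (2 + 3) * 4 inside one processing element:
pe = ProcessingElement({'add': Node('+', 2, ['mul']),
                        'mul': Node('*', 2, [])})
pe.deliver_token('add', 2)
pe.deliver_token('add', 3)
pe.deliver_token('mul', 4)
while pe.step():
    pass
print(pe.results)   # [20]
```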
2.3. LAU architecture

The LAU machine contains four major units: memory, execution, control, and an interface, as shown in Fig. 3. Programs written in a single assignment language are compiled to produce data flow graphs in the host.

[Fig. 3. The LAU machine architecture: memory unit, instruction control unit (ICU), data control unit (DCU), execution unit, host interface and control-related buses.]

Each node in the data flow graph can have a maximum of two input arcs and several output arcs. The instruction format corresponding to a node contains an operation part and a control part. The operation part of an instruction is loaded into the memory and the control part is maintained in the control unit. A node is enabled when the operands are ready and the proper context exists. The instruction control unit detects enabled nodes using a simulated associative memory and sends their addresses to the memory. The addresses in the ICU and DCU of Fig. 3 have a one-to-one correspondence with those in the memory unit. The memory reads the operation part of the instruction and sends it to a queue (FIFO) that can hold 128 enabled instructions. The instruction at the head of the queue is dispatched to an available processor in the execution unit.

The execution unit in the LAU machine uses a number of buses to access the rest of the system. It can support up to 32 processors. Each processor executes the instruction assigned to it by reading the operands from the memory unit. The results produced are written into the memory and the destination addresses are obtained. In the LAU machine, enabled instructions are kept in a queue until results come out of the processor; this helps to reassign instructions to a healthy processor if a faulty processor is detected. The LAU execution unit does not have a local memory; instead it has a central memory from which the execution unit can read and write data.

3. THE DFRAM MODEL

There are several fundamental models of computing devices. The best known are the random access machine (RAM), which models a sequential computer that carries out one operation at a time, and the parallel random access machine (PRAM), which models parallel machines with control flow [12]. In the following we present our DFRAM model and emphasize its basic characteristics.

3.1. Data flow random access machine (DFRAM) model

Data flow graphs have been used extensively to model parallel computations. In these models the data flow graph is considered as an uninterpreted bipartite graph with nodes and links, as shown in Fig. 4 [16-21]. Unfortunately, such a representation is unable to express the structural and behavioral details of data flow machines. This shortcoming can be avoided if the semantic actions of the graph actors (nodes) are considered. Therefore, we propose the DFRAM model, which augments the basic structural and behavioral characteristics of data flow machines in every node of the corresponding graph.

[Fig. 4. A pipe-lined data flow graph.]

The DFRAM model for static data flow architectures, shown in Fig. 5, consists of:

(1) a set of sharable cell blocks and/or local RAM memory cells;
(2) an enabling network controller;
(3) an enabled queue structure;
(4) processing (firing) element(s);
(5) a processor-availability checker;
(6) a results queue structure;
(7) a distribution controller.

[Fig. 5. The DFRAM model of a graph actor.]

The DFRAM model works according to the following rules:

(1) The data flow graph of a computation (program) should be generated and stored in the cell blocks and/or the local memory cells before the DFRAM model operation is initiated, in order to provide the data required by the model.
(2) When the data is available, the enabling network controller determines the nodes to be enabled. A node can be enabled if and only if (a) there are tokens on every input link required for enabling the node, and (b) there are no tokens on any of the node's output links.
(3) All enabled nodes are arranged in a queue structure.
(4) The front node of the enabled queue is assigned a processing element to accomplish the node firing.
(5) The output of the processing element is checked. If it is available, then it is passed to the results queue; otherwise the firing process is canceled. If the firing process is canceled, its input is returned either to the enabled queue or to the memory cells.
(6) All the results are, in turn, arranged in a results queue.
(7) A distribution network controller takes the results from their queue and directs (writes) them either to the memory cells as intermediate results or to the output as final results. Consequently, the intermediate results act as new data for the rest of the data flow graph.
(8) A computation terminates when there are no enabled nodes.

From the DFRAM model we can point out the following:

(a) Each data flow operation consists of four steps:
(1) reading the node information from the cell blocks (reading step);
(2) realizing the enabling conditions of a node and detecting the enabled nodes (enabling step);
(3) assigning nodes to processing elements to fire and passing tokens to the results queue (firing step); and
(4) moving results to output links (distribution or writing step).
(b) The DFRAM model is asynchronous. The PRAM model [12], on the other hand, assumes synchronous control flow.
(c) Rule (2b) of the model is confined to static architectures only and is violated when dynamic data flow machines are considered.
(d) Rules (2b) and (8) ensure that no write conflicts occur in the DFRAM model, since a node result has a predefined address value. Such conflicts may, however, take place in the PRAM model.
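To make rules (2)-(8) and the four-step cycle concrete, here is a minimal sketch of one DFRAM cycle over a toy graph. The dictionary-based graph representation and all names are our own; the reading step is folded into the enabling step, and each output link is modeled as the input slot of its successor.

```python
from collections import deque

def dfram_step(graph, outputs):
    """One DFRAM cycle over `graph`, a dict: nid -> {'op': callable,
    'inputs': [token or None, ...], 'dests': [(succ_id, arc) or None, ...]}.
    A sketch of rules (2)-(8), not any particular machine."""
    # Enabling step (rule 2): tokens on every input link and every output
    # link free.  An output link is free when the successor slot it feeds
    # is empty; final-output links (dest None) are always free.
    def outputs_free(node):
        return all(d is None or graph[d[0]]['inputs'][d[1]] is None
                   for d in node['dests'])
    enabled = deque(nid for nid, n in graph.items()
                    if all(t is not None for t in n['inputs'])
                    and outputs_free(n))                      # rule 3: a queue
    fired = len(enabled)
    # Firing step (rules 4-5): a processing element absorbs the input tokens
    # and computes; results are queued (rule 6).
    results = deque()
    while enabled:
        nid = enabled.popleft()
        n = graph[nid]
        results.append((nid, n['op'](*n['inputs'])))
        n['inputs'] = [None] * len(n['inputs'])
    # Distribution step (rule 7): write each result to its predefined
    # destination, so no write conflicts can arise (observation (d)).
    while results:
        nid, value = results.popleft()
        for d in graph[nid]['dests']:
            if d is None:
                outputs[nid] = value                  # final result
            else:
                graph[d[0]]['inputs'][d[1]] = value   # intermediate token
    return fired > 0    # rule 8: terminate when no node was enabled

# (3 * 4) + 5, evaluated by repeated DFRAM cycles:
graph = {
    'mul': {'op': lambda a, b: a * b, 'inputs': [3, 4], 'dests': [('add', 0)]},
    'add': {'op': lambda a, b: a + b, 'inputs': [None, 5], 'dests': [None]},
}
outputs = {}
while dfram_step(graph, outputs):
    pass
print(outputs)   # {'add': 17}
```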
The performance of data flow machines is assessed by comparing their time costs. For the sake of comparison, the time complexity of all the underlying machines is computed for the matrix multiplication problem. Two n x n matrices A and B are considered and their manipulation by the three machines, MIT, DDP and LAU, is investigated.

3.2. The time cost of data flow machines

For calculating the product matrix C, where C = A x B, in the three machines, the unit time T is given by:

T = (reading + enabling + firing + distributing) time.

In what follows the reading time is assumed to be negligibly small.

3.2.1. The MIT machine. In the MIT machine, the program graph to be executed is stored in the cell blocks by the host. Each cell block stores the operations, operands and destination address of a node. Enabled nodes in the cell blocks are sent to the processing elements using the enabling network controller. The results of execution are passed either to cell blocks or to the output link using the distribution network of the machine. Typically, one enabled node is using a processor and a number of enabled nodes are waiting in the enabled queue of that processor, which has no local memory. The execution of a program graph terminates when none of the nodes in the memory units is enabled.

The data flow graph of the multiplication process is stored in the cell blocks of the MIT machine in O(log n) time. The multiplication then proceeds as follows:

(1) One column of the matrix B is selected.
(2) The selected column and one row of the matrix A are assigned concurrently to a processing element. This occurs in n/p time, where p is the number of processors.
(3) The processing element multiplies the elements of the selected column by the elements of the selected row sequentially in n time.
(4) Steps 1, 2 and 3 are repeated n times until all columns of the matrix B have been multiplied by the rows of the matrix A.

The time complexity of multiplication is

(No. of A rows x No. of B columns x No. of elements per row x T(MIT)) / No. of processing elements,

where T(MIT) is the unit time for the MIT machine. Thus the multiplication time = (n x n x n/p) x T(MIT), and

the total time = storing time + multiplication time = O(log n) + (n^3/p) x T(MIT).   (1)

If p << n, then time = O(log n) + O(n^3) x T(MIT); if p = n, then time = O(log n) + O(n^2) x T(MIT); if p >> n (p >= n^3), then time = O(log n) + O(1) x T(MIT).

3.2.2. The DDP machine. In the DDP machine, a program graph is generated by the compiler (data flow generator) in the host processor. The program graph is partitioned into subgraphs, and the subgraphs are loaded into the local memory cells of the processing elements, since the DDP machine does not have sharable cell blocks. An enabled node is executed in the processing element. The results of a node firing are forwarded to the corresponding successor nodes in the processors' local memory cells using the distribution network. As a result, one or more nodes may be enabled. The enabled nodes are linked together in the enabled queue, and the node at the head of the queue is dispatched for execution.

The time required for storing a data flow graph is O(log n). After partitioning the flow graph into subgraphs we find, in the local memory of each processing element, a copy of matrix B and n/p rows of matrix A. Since there is only one ALU in each processing element, the time required for matrix multiplication will be (n x n x n)/p x T(DDP). Then

the total time = storing time + multiplication time = O(log n) + (n^3/p) x T(DDP).   (2)

Again, if p >> n (p >= n^3), then time = O(log n) + O(1) x T(DDP).

3.2.3. The LAU machine. The LAU has neither local memory cells nor results queue facilities. The data flow graph is stored in the cell blocks. As in the above two machines, the execution consists of enabling, firing and distribution steps. The enabled nodes are linked into the enabled queue. An enabled node from the queue is assigned to a processing element. The results produced are written into the cell blocks using the distribution network.

In this machine the data flow graph of the matrix multiplication process is stored in the cell blocks in O(log n) time. The enabled nodes are stored in the enabled queue of the machine; the total number of enabled nodes equals n x n x n. At the first unit time of multiplication, p enabled nodes are assigned to the p processing elements concurrently. After that, another p enabled nodes at the head of the enabled queue are assigned to the empty processing elements. This process continues until all enabled nodes have been fired. The time of multiplication = (n x n x n/p) x T(LAU), and

the total time = storing time + multiplication time = O(log n) + (n^3/p) x T(LAU).   (3)

If p >> n, time = O(log n) + O(1) x T(LAU).

From the above we find that equations (1)-(3) are the same; the difference between them is the unit-time value. The unit time consists of three terms, for enabling, firing and writing (distribution).
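Equations (1)-(3) share the form O(log n) + (n^3/p) x T. A small sketch, with an arbitrary constant standing in for the one hidden by the O(log n) storing term, shows the three regimes:

```python
import math

def total_time(n, p, T_unit, c_store=1.0):
    """Equations (1)-(3): total time = O(log n) + (n**3 / p) * T_unit.
    c_store is an arbitrary constant hidden by the O(log n) storing term;
    T_unit stands for T(MIT), T(DDP) or T(LAU)."""
    storing = c_store * math.log2(n)
    multiplication = (n ** 3 / p) * T_unit
    return storing + multiplication

n = 64
print(total_time(n, p=1, T_unit=1.0))       # p << n  : multiplication ~ O(n^3)
print(total_time(n, p=n, T_unit=1.0))       # p = n   : multiplication ~ O(n^2)
print(total_time(n, p=n ** 3, T_unit=1.0))  # p >= n^3: multiplication ~ O(1)
```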
In the following we exploit the DFRAM model to calculate the unit times of the three machines in terms of their primitives.

enabling time = (n_i + n_o) x t_c, where n_i is the number of input arcs, n_o is the number of output arcs and t_c is the time of one comparison.

firing time = t_ex + [t_wf, if any], where t_ex is the time of executing one operation and t_wf is the writing time of a control bit.

writing time = [t_cf, if any] + (n_o x t_c) + (d x t_w), where t_cf is the comparison time of control bits, t_w is the destination writing time and d is the number of destinations.

Since the number of input arcs and output arcs of a node differs from machine to machine, the enabling and writing times will differ. The unit times of the three machines (MIT, DDP and LAU) are as follows.

(1) The MIT unit time, T(MIT). In the MIT, the numbers of input and output arcs are 3 and 4, respectively:

enabling time = (3 + 4) x t_c = 7 t_c
firing time = t_ex
writing time = 4 t_c + d x t_w, where 1 <= d <= 4.

Thus, T(MIT) = 11 t_c + t_ex + d x t_w.

(2) The DDP unit time. In the DDP, a node can have up to 13 input arcs or output arcs, but the total number of input and output arcs cannot exceed 14. Accordingly,

enabling time = 14 t_c
firing time = t_ex
writing time = 13 t_c + d x t_w, where 1 <= d <= 13.

T(DDP) = 27 t_c + t_ex + d x t_w.

(3) The LAU unit time. In the LAU, a node has two input arcs and 13 output arcs, and the machine contains a control unit which synchronizes the operation:

enabling time = (2 + 13) x t_c = 15 t_c
firing time = t_ex + t_wf, where t_wf is the writing time of a control bit c_d. The c_d is a control bit in the control unit of the LAU machine. The result of a node firing is not available until the c_d bit is set; if a fault is detected in a processor, then the c_d bit will never be set.
writing time = t_cf + 13 t_c + d x t_w, where 1 <= d <= 13 and t_cf is the comparison time of c_d. The c_d must have the value 1; if the value of c_d is zero, then a fault is detected in a processor and it is possible to reassign the nodes to a healthy processor.

Unit time = 15 t_c + t_ex + t_wf + t_cf + 13 t_c + d x t_w, so

T(LAU) = 28 t_c + t_ex + t_wf + t_cf + d x t_w.

From the above, we find that T(LAU) > T(DDP) > T(MIT). This is due to the differences in the architecture of the three machines [26-29]. The LAU machine contains a control unit which synchronizes the operation, so its unit time increases by the t_wf and t_cf values. In the LAU machine a node can have at most two input arcs and several output arcs; since the number of output arcs is large, the enabling and writing times increase. The DDP machine does not have a control unit, but a node can have up to 13 input arcs or 13 output arcs (the total not exceeding 14); due to this large number of input and output arcs, the enabling and writing times increase, and consequently so does the unit time of the DDP machine. In the MIT machine a node can have up to three input arcs and four output arcs; therefore the enabling and writing times of the MIT machine are smaller than those of the DDP and LAU.
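The three unit-time expressions are easy to compare numerically. The sketch below evaluates them for hypothetical primitive times (the values are illustrative only) and reproduces the ordering T(LAU) > T(DDP) > T(MIT):

```python
def t_mit(t_c, t_ex, t_w, d):
    """T(MIT) = 11 t_c + t_ex + d t_w, with 1 <= d <= 4."""
    assert 1 <= d <= 4
    return 11 * t_c + t_ex + d * t_w

def t_ddp(t_c, t_ex, t_w, d):
    """T(DDP) = 27 t_c + t_ex + d t_w, with 1 <= d <= 13."""
    assert 1 <= d <= 13
    return 27 * t_c + t_ex + d * t_w

def t_lau(t_c, t_ex, t_w, d, t_wf, t_cf):
    """T(LAU) = 28 t_c + t_ex + t_wf + t_cf + d t_w, with 1 <= d <= 13."""
    assert 1 <= d <= 13
    return 28 * t_c + t_ex + t_wf + t_cf + d * t_w

# Hypothetical primitive times (arbitrary units) and one destination (d = 1):
t_c, t_ex, t_w, t_wf, t_cf = 1.0, 5.0, 2.0, 1.0, 1.0
print(t_mit(t_c, t_ex, t_w, 1))              # 18.0
print(t_ddp(t_c, t_ex, t_w, 1))              # 34.0
print(t_lau(t_c, t_ex, t_w, 1, t_wf, t_cf))  # 37.0, so T(LAU) > T(DDP) > T(MIT)
```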
4. THE SIMULATOR

On the basis of the DFRAM model a Data Flow Simulator (DFS) has been designed and implemented. Here the basic features of the DFS are emphasized.

4.1. Generation of the data flow graph

The generation of a data flow graph and token table is important because it is the interface between a user program and the data flow machine. We have built a token-table generator that accepts the user source program and generates the corresponding token table; this token table is then manipulated by a machine simulator. The generator consists of two phases: a scanner phase and a token-table generator phase. The scanner converts the stream of input characters into a stream of tokens, and these tokens become the input to the next phase. In the data flow generator, the scanner phase accepts a user program and generates tokens for each statement in the program. Each token consists of two fields, token type and value, i.e. token(type, value). The token-table generator phase accepts the tokens generated by the scanner phase and generates the corresponding token table. The token table consists of many fields: op-code (operator), dn (node number), dr (arc number), value (of operand), next-dn (node number of a destination node), next-opcode (operator of a destination node), next-arc (arc number of a destination node) and ready-f. The ready-f is set to one if the value of an operand exists and reset to zero if it does not. The rules used by the token-table generator, illustrated in the sketch at the end of this subsection, are:

(1) if the token type is identifier (id), then a zero (0) is stored in the value field of the token table;
(2) if the token type is constant, then the token value is stored in the value field of the token table;
(3) if the token type is opcode (op), then an integer value is stored in the op-code field. This field will be one (1) for multiplication, two (2) for division, three (3) for addition and so on.

There are two counters: the first counts the nodes and the second counts the arcs of each node. The value of the first counter is stored in the dn field and the value of the second counter is stored in the dr field of the token table. We have provided three token-table generators: one for the MIT machine, the second for the DDP machine and the third for the LAU machine. The number of fields of the generated token table differs from machine to machine, because the number of output nodes differs from machine to machine.
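As promised above, here is a minimal sketch of the two generator phases for rules (1)-(3). It handles only simple three-address statements, uses the opcode numbering given in the text, and leaves the destination fields unlinked; the function names and the toy grammar are our own simplifications, not the DFS implementation.

```python
OPCODES = {'*': 1, '/': 2, '+': 3}   # 1 = multiplication, 2 = division, 3 = addition

def scan(statement):
    """Scanner phase: turn 'c = a * 4' into a stream of (type, value) tokens."""
    tokens = []
    for lexeme in statement.replace('=', ' ').split():
        if lexeme in OPCODES:
            tokens.append(('op', lexeme))
        elif lexeme.lstrip('-').isdigit():
            tokens.append(('constant', int(lexeme)))
        else:
            tokens.append(('id', lexeme))
    return tokens

def generate_token_table(statements):
    """Token-table generator phase, applying rules (1)-(3) of Section 4.1.
    Destination fields (next-dn, next-opcode, next-arc) stay unlinked here."""
    table, dn = [], 0
    for statement in statements:
        tokens = scan(statement)
        dn += 1                              # first counter: node number
        op = next(v for t, v in tokens if t == 'op')
        dr = 0
        for ttype, value in tokens[1:]:      # skip the assigned identifier
            if ttype == 'op':
                continue
            dr += 1                          # second counter: arc number
            table.append({
                'op-code': OPCODES[op],      # rule (3): integer opcode
                'dn': dn, 'dr': dr,
                'value': 0 if ttype == 'id' else value,   # rules (1) and (2)
                'ready-f': 0 if ttype == 'id' else 1,
                'next-dn': None, 'next-opcode': None, 'next-arc': None,
            })
    return table

for row in generate_token_table(['c = a * 4', 'e = c + 2']):
    print(row)
```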
4.2. Flow graph construction

Many data flow machine simulators have been built. Some of them [9,13-15] permit us to identify the basic mechanisms that must be supported by any data flow computation, and they have been used to evaluate the performance of some data flow machines. However, they may fail to display the essential differences between the various simulated machines. Here we present the DFS as a simulator that has been built on the basis of the DFRAM model to study the detailed performance of data flow machines. Three static machines are considered: MIT, DDP and LAU. The structure (block diagram) of the DFS is shown in Fig. 6.

[Fig. 6. The DFS structure: a token-table generator (module 1) feeds the enabling phase (module 2), which builds per-opcode enabled-node tables (multiply, divide, add, ...); the firing phase (module 3) produces per-opcode output tables that the writing (distribution) phase (module 4) feeds back to the token table.]

In the DFS the user program is transformed into a data flow graph and a token table. The data flow graph consists of nodes, input arcs and output arcs. Each node corresponds to an instruction operator, an input arc corresponds to an input data token (instruction operand), and an output arc corresponds to a result (it is also an input arc of the next node). Each node has a specific number of input arcs and a specific number of output arcs according to the type of the machine.

The common features of data flow machines are specified in a "reference machine". Such a machine is hypothetical and does not include any particularities. The token table of the reference machine, shown in Fig. 7, contains the instruction opcode, node number (dn), arc number (dr), token value (value), addresses of the next nodes and other control data. The control data are ready-f, then-no, else-no and while-no. The opcode value is an integer, for example the value 1 for the multiplication operator, the value 2 for the division operator and so on. The node number (dn) is an integer value, and the arc number (dr) takes the values 1, 2, ..., 13 according to the type of data flow machine. The address of the next node (destination node) contains the node number (next-dn), the opcode (next-opcode) and the arc number (next-arc) of the next node.

[Fig. 7. The token table of the reference machine: opcode, dn, dr, value, next-dn, next-opcode, next-arc, next-dn2, next-opcode2, next-arc2, ready-f, then-no, else-no.]

The ready-f is set to 1 if the value of the operand (token) is ready. There are two counters, one to count the if-statements and the second to count the while-statements. The then-no and else-no contain the counter value of the if-statements. The while-no field of all instructions existing inside the then-body of an if-statement has the if-statement counter value; likewise, the else-no field of the instructions existing inside the else-body of the if-statement has the if-statement counter value. Thus, when an if-statement is satisfied, the instructions inside its then-body are enabled; otherwise, the instructions inside its else-body are enabled.

The execution process of instructions consists of an enabling stage, a firing stage and a distributing stage. In the enabling stage a group of node tables is generated, one table for each opcode type; generating a node table for each opcode type serves to exploit the parallelism of the data flow machines. Each node table contains the enabled-node data. Figure 8 shows the contents of the node table. This table structure is suitable for the conditional opcode types used in all data flow machines. The table contains the opcode of the node (opcode), the node number (dn), the values of the input tokens (dr1, dr2), the output arc values (dor1, dor2), the output node addresses (next-dn, next-opcode, next-arc, ...) and other control data (c, ef, of, ready-f, then-no, else-no, while-no). A node is enabled when the tokens of its input arcs are ready and its output arcs are free.

[Fig. 8. The node table of the reference machine: opcode, dn, dr1, dr2, dor1, dor2, next-dn, next-opcode, next-arc, next-dn2, next-opcode2, next-arc2, c, ef, of, ready-f, then-no, else-no, while-no.]

In the enabling stage, the enabling module in Fig. 6 accepts the token table and generates node tables that contain the enabled nodes. In order for a node to be enabled, the enabling module reads the token-table records and collects the operands of each node. When the number of ready operands of a node equals the required input, the output arcs of the node are tested. If the output arcs of the node are free, the node data shown in Fig. 8 are written to the node table.
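A minimal sketch of the enabling module follows. The required_inputs table and the output_arc_free test stand in for information the DFS keeps in its token and node tables; the field names follow the text, everything else is our own simplification.

```python
from collections import defaultdict

def enabling_stage(token_table, required_inputs, output_arc_free):
    """Enabling-module sketch: read the token-table records, collect the
    operands of each node and, when every required operand is ready and the
    output arcs are free, write the node into the node table of its opcode
    type (one table per opcode, so unlike operations can fire in parallel)."""
    operands = defaultdict(list)
    opcode_of = {}
    for record in token_table:
        opcode_of[record['dn']] = record['op-code']
        if record['ready-f'] == 1:
            operands[record['dn']].append((record['dr'], record['value']))
    node_tables = defaultdict(list)          # one enabled-node table per opcode
    for dn, ready in operands.items():
        if len(ready) == required_inputs[dn] and output_arc_free(dn):
            node_tables[opcode_of[dn]].append({
                'opcode': opcode_of[dn],
                'dn': dn,
                'inputs': [value for _, value in sorted(ready)],
            })
    return node_tables

token_table = [
    {'op-code': 1, 'dn': 1, 'dr': 1, 'value': 3, 'ready-f': 1},
    {'op-code': 1, 'dn': 1, 'dr': 2, 'value': 4, 'ready-f': 1},
    {'op-code': 3, 'dn': 2, 'dr': 1, 'value': 0, 'ready-f': 0},
]
print(dict(enabling_stage(token_table, {1: 2, 2: 2}, lambda dn: True)))
# {1: [{'opcode': 1, 'dn': 1, 'inputs': [3, 4]}]}
```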
During the firing stage, the firing module accepts the node tables and passes them to a table of processing elements in order to fire the enabled nodes. Before the firing moment, the status of the processing element is tested; the status may be fault, busy or ready, and firing occurs only if the status of the processor is ready. During the distributing stage, the writing module communicates the results of the firing stage to the token table, updating the data values of some tokens so that other nodes may be enabled.

5. PERFORMANCE EVALUATION

By making use of the DFS, the performance of MIT, DDP and LAU has been evaluated under various operational conditions. Naturally, a data flow program may undergo several "stages". A stage represents one operation that may be carried out in one or more actors concurrently. For multiplying two n x n matrices A and B, Fig. 9 shows the increase of the time cost with the increase of the matrix size n for the MIT, DDP and LAU machines. It is obvious that at fault-free operation the MIT machine has the best performance, followed by the DDP, while the LAU machine has the worst performance. Also, this figure agrees with the result of the DFRAM model that, for p >> n, the multiplication of two n x n matrices takes O(1) DFRAM steps. (As indicated in [12], it takes O(log n) PRAM steps and O(n^3) RAM steps.)

[Fig. 9. Total time of MIT, DDP and LAU versus the number of elements, at fault-free operation.]

During faulty operational conditions, the high degradation rate of the MIT machine is due to the fact that when one of the machine processors becomes faulty, the current computation is turned off and the machine restarts the computation from its beginning [13]. The degradation rate of the DDP machine is smaller than that of the MIT machine, since a processor fault returns the computation to the nearest break point. However, the LAU machine can fail safe (without degradation) because, if a processor fails, the machine can immediately provide a fault-free reconfiguration to continue the computation. Table 1 summarizes the essential characteristics of these machines; a sketch quantifying the three recovery strategies follows the table.

Table 1. Characteristics comparison of the MIT, DDP and LAU machines

Machine feature                     | MIT                          | DDP                                      | LAU
No. of communication buses          | One                          | Two                                      | Six unidirectional buses
Memory of the machine               | Cell blocks (memory units)   | Local memory                             | Central memory
Behaviour with faulty processors    | Restarting the process from the beginning | Complete operation from the break point | Reassignment and complete operation
Node input arcs                     | 3                            | up to 13                                 | 2
Node output arcs                    | 4                            | up to 13 (total i/p and o/p <= 14)       | up to 13
Time cost                           | Small                        | Medium                                   | Large
Availability                        | Least                        | Good                                     | Best
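The availability rows of Table 1 can be illustrated with a toy calculation of completion time under a single processor fault. The checkpoint spacing and the assumption of zero-cost LAU reassignment are made-up parameters for illustration, not measurements from the paper:

```python
def completion_time(base, fault_at, policy, checkpoint=0.25):
    """Toy model: one processor fault after `fault_at` time units of a job
    needing `base` units fault-free.  Mirrors the recovery behaviour in
    Table 1; the checkpoint spacing and the zero-cost LAU reassignment are
    illustrative assumptions."""
    if policy == 'MIT':   # computation is turned off and restarted from scratch
        return fault_at + base
    if policy == 'DDP':   # fall back to the nearest break point
        lost = fault_at % checkpoint
        resume_point = fault_at - lost
        return fault_at + (base - resume_point)
    if policy == 'LAU':   # reassign to a healthy processor, no work lost
        return base
    raise ValueError(policy)

for policy in ('MIT', 'DDP', 'LAU'):
    print(policy, completion_time(base=10.0, fault_at=6.1, policy=policy))
# MIT 16.1, DDP ~10.1, LAU 10.0: availability LAU best, DDP good, MIT least
```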
6. CONCLUSION

In this paper the DFRAM model and the DFS simulator have been presented. The DFRAM model can be relied upon as a formal model for data flow machines. The essential features of that model are:

(1) It extends the semantic data flow graph so that each graph node includes the semantic actions of the underlying instruction. Accordingly, each machine time unit consists of three phases, for enabling, firing and writing tokens, respectively.
(2) No write conflicts may arise and, consequently, it can easily be used for computing the time cost of various data flow algorithms.
(3) It is an asynchronous model, i.e. the token flow is neither controlled nor time synchronized.

On the basis of the DFRAM, the DFS has been built to act as a modeling tool for data flow machines. Since the DFRAM augments the semantic actions of data flow processing, the DFS inherently has the capability of emphasizing the structural and behavioral differences of the simulated machines. Therefore, the DFS may have an edge over similar simulators. Moreover, the DFS is modular, portable and can be implemented with ease.

Three data flow machines have been simulated by the DFS: the MIT, DDP and LAU machines. The results of the simulator indicate that under fault-free conditions the MIT machine has the best performance, the DDP machine is second, and the LAU machine has the worst performance. On the other hand, if a processor becomes faulty, then the LAU machine possesses the highest availability, followed by the DDP machine in second position, whilst the MIT architecture has the lowest availability.

REFERENCES

1. J. Herath, Y. Yamaguchi, N. Saito and T. Yuba, Data flow computing models, languages and machines for intelligence computations. IEEE Trans. Software Engng 14 (1988).
2. D. Abramson and G. Egan, The RMIT data flow computer: a hybrid architecture. The Computer J. 33 (1990).
3. P. C. Treleaven, D. R. Brownbridge and R. P. Hopkins, Data driven and demand-driven computer architecture. Computing Surveys 14 (1982).
4. M. L. Cambell, Static allocation for a data flow multiprocessor. IEEE Computer, pp. 511-517 (1985).
5. S. A. Thoreson, A feasibility study of a memory hierarchy in a data flow environment. IEEE Computer, pp. 356-360 (1985).
6. Arvind and D. E. Culler, Data flow architectures. Annual Review of Computer Science 1, 225-253 (1986).
7. K. P. Gostelow and R. E. Thomas, Performance of a simulated data flow computer. IEEE Trans. Computers C-29 (1980).
8. B. Lee, A. R. Hurson and B. Shirazi, A hybrid scheme for processing data structures in a data flow environment. IEEE Trans. Parallel and Distributed Systems 3 (1992).
9. M. Takesue, Cache memories for data flow machines. IEEE Trans. Computers 41 (1992).
10. J. B. Dennis, C. K. C. Leung and D. P. Misunas, A highly parallel processor using a data flow machine language. Laboratory for Computer Science, MIT, CSG Memo 134-1 (June 1979).
11. J. B. Dennis, Data flow supercomputers. IEEE Computer, pp. 48-56 (1980).
12. A. V. Aho, J. E. Hopcroft and J. D. Ullman, The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Mass. (1974).
13. V. P. Srini, An architectural comparison of data flow systems. IEEE Computer, pp. 68-87 (1986).
14. A. P. W. Bohm and J. R. Gurd, Iterative instructions in the Manchester data flow computer. IEEE Trans. Parallel and Distributed Systems 1 (1990).
15. P. C. Treleaven, R. P. Hopkins and P. W. Rautenbach, Combining data flow and control flow computing. The Computer J. 25 (1982).
16. K. M. Kavi, B. P. Buckles and N. U. Bhat, A formal definition of data flow graph models. IEEE Trans. Computers C-35 (1986).
17. R. H. Perrott, Parallel Programming, pp. 13-17 and 25-26. Addison-Wesley, New York (1988).
18. K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing. McGraw-Hill, New York (1985).
19. A. S. Tanenbaum, Computer Networks. Prentice-Hall, Englewood Cliffs, New Jersey (1981).
20. J. B. Dennis and D. P. Misunas, A preliminary architecture for a basic data-flow processor. Proceedings of the 2nd Annual Symposium on Computer Architecture, New York (1975).
21. D. A. Reed and R. M. Fujimoto, Multicomputer Networks: Message-Based Parallel Processing. MIT Press, Cambridge, Mass. (1987).
22. W. W. Carlson and K. Hwang, Algorithmic performance of data flow multiprocessors. IEEE Computer, pp. 30-40 (1985).
23. D. Ghosal and L. N. Bhuyan, Performance evaluation of a data flow architecture. IEEE Trans. Computers 39 (1990).
24. R. Duncan, A survey of parallel computer architectures. IEEE Computer, pp. 5-16 (1990).
25. H. Burkhart and E. Millen, Performance-measurement tools in a multiprocessor environment. IEEE Trans. Computers 38 (1989).
26. M. Sowa and T. Murata, A data flow computer architecture with program and token memories. IEEE Trans. Computers C-31, 820-824 (1982).
27. Y. C. Hong, T. H. Payne and B. Ferguson, An architecture for a data flow multiprocessor. IEEE Computer (1985).
28. L. S. Haynes, R. L. Low, D. P. Siewiorek and D. W. Mizall, A survey of highly parallel computing. IEEE Computer (1982).
29. W. D. Hillis and G. L. Steele, Data parallel algorithms. CACM 29 (1986).

AUTHORS' BIOGRAPHIES

Fathy E. Eassa received the B.Sc. degree in electronics and electrical communications engineering from Cairo University, Egypt, in 1978, and the M.S. and Ph.D. degrees in computers and systems engineering from Al-Azhar University, Cairo, Egypt, in 1984 and 1989, respectively. He is an Assistant Professor with the Department of Computers and Systems Engineering at Al-Azhar University, Cairo, Egypt. His research interests include data flow machines, software engineering and artificial intelligence.

Mohamed M. Eassa received his B.Sc. degree in electronic engineering from Menofia University and his M.S. and Ph.D. degrees in computer engineering from the University of Al-Azhar. He is an Assistant Professor of software engineering and computer science in the Sadat Academy for Management Sciences, Cairo. His main research interests are parallel processing systems, data flow machines and computer networks.

Mohamed Zaki is the Professor of Computer Science, Faculty of Engineering, Al-Azhar University. He received his B.Sc. and M.Sc. in electrical engineering from Cairo University in 1968 and 1972, respectively. He received his Ph.D. in engineering from Warsaw Polytechnic. His research interests include parallel processing, data flow machines, knowledge engineering and distributed databases.