Pergamon

Computers Elect. Engng Vol. 21, No. 6, pp. 483-497, 1995
Copyright © 1995 Elsevier Science Ltd
Printed in Great Britain. All rights reserved
0045-7906(95)00018-6
0045-7906/95 $9.50 + 0.00

A COMPUTATIONAL MODEL FOR STATIC DATA FLOW MACHINES
FATHY E. EASSA, M. M. EASSA and M. ZAKI
Computers and Systems Engineering Department, Faculty of Engineering, Al-Azhar University,
Nasr City, Cairo, Egypt
(Received for publication 3 July 1995)
Abstract-Computer architects have been constantly looking for new approaches to design high-performance machines. Data flow and VLSI offer two mutually supportive approaches towards a promising design for future supercomputers. When very high speed computations are needed, data flow machines may be relied upon as an adequate solution in which extremely parallel processing is achieved.
This paper presents a formal analysis for data flow machines. Moreover, the following three machines
are considered: (1) MIT static data flow machine; (2) TI’s DDP static data flow machine; (3) LAU data
flow machine.
These machines are investigated by making use of a reference model. The contributions of this paper include: (1) Developing a Data Flow Random Access Machine model (DFRAM), for the first time, to serve as a formal modeling tool. By making use of this model one can calculate the time cost of various static data flow machines, as well as the performance of these machines. (2) Constructing a practical Data Flow Simulator (DFS) on the basis of the DFRAM model. The DFS is modular and portable and can be implemented with little sophistication. It is used not only to study the performance of the underlying data flow machines but also to verify the DFRAM model.
Key words: Static data flow machine, parallel processing, computer architecture, random access machine model, data flow random access machine model, data flow graph, firing time, token table, queue structure, enabling network controller.
1. INTRODUCTION
The demand for ultrahigh speed computing machines for analyzing physical processes, solving scientific problems and intelligence computations is increasing every day. The major difficulty in satisfying this demand with uniprocessing lies in the physical constraints of hardware and the sequential control of the Von-Neumann abstract computing model. The alternative to sequential processing is parallel processing with high density devices.
Computer architects have been constantly searching for new approaches to develop high
performance machines. Data flow offers a supportive approach towards the design of future supercomputers.
Data flow computers [1-9] are based on the concept of data-driven computation, which is drastically different from the operation of a conventional Von-Neumann machine. The fundamental difference is that instruction execution in a conventional computer is under program-flow control, whereas in a data flow computer, processing is driven by data availability. Control flow and data flow computing are thus two distinct computational paradigms, distinguished by how they control the sequencing of computations.
Data flow computers are designed to execute data flow program graphs [9-15]. The nodes of a
data flow graph are computation and control constructs. Arcs pass tokens carrying data values
between nodes. When a node receives tokens on each of its incoming arcs, it can then fire. Thus
it absorbs the input tokens, computes a result, and generates a token carrying the result value.
The Random Access Machine (RAM) model [12] is widely used as a formal model for describing
Von-Neumann computers with sequential control. The model has also been extended to the Parallel Random Access Machine (PRAM) model [12] in order to serve as a modeling tool for control flow
machines with parallel computations. However, to our knowledge, nothing has been reported about
a “formal” model for data flow machines. In what follows we present the DFRAM model as a
formal representation for Data Flow Machines.
The DFRAM is a new paradigm of the RAM model which is concerned with data flow instead
of control flow. Moreover, the DFRAM verification has been carried out by making use of three
different machines. These machines are the MIT [13], the DDP [13] and the LAU [13] static data flow machines. The three machines are experimental and have been developed at different places. The MIT machine has been built by the Dennis group at MIT, while the DDP has been built by Texas Instruments Co., and the LAU machine is available at the ONERA/CERT Laboratory, France.
2. ARCHITECTURE OF STATIC DATA FLOW MACHINES
In static architectures of data flow machines the nodes of a program graph are loaded into a
memory before the computation begins and at most one instance of a node is enabled for firing
at a time, i.e. in the static data flow model only one token (or instruction operand) is allowed on
a program arc at any time [22-25].
Three static data flow architectures are presented in this section. They are MIT, TI’s DDP and
LAU.
2.1. MIT architecture
The MIT static architecture consists of memory units, processing section, arbitration network
and distribution network, as shown in Fig. 1 [1,13]. Every memory unit is a collection of blocks
of storage locations. Each block (called a cell block) stores the operation, operands and destination
address of a node. The program graph to be executed is loaded in the cell blocks of the memory
by the host before computation begins. To support the rapid access of array elements a part
of the memory is used for storing arrays. Enabled instructions in the memory are communicated
to the processing elements as operation packets using the routing network R1 in Fig. 1. The results of execution are communicated to cell blocks as data packets using the routing network R2.
The execution of a program graph terminates when none of the nodes in the memory units is
enabled. It is assumed that the routing networks are fault tolerant and the FIFOs in the routing network have sufficient capacity to reduce blockage [13]. Faults in the machine require restarting the computation from the beginning.

[Fig. 1. The MIT machine architecture. Mi: instruction and array memory unit; PEj: processing element; R1, R2: routing networks.]

[Fig. 2. The DDP machine architecture: processing elements (PEs) connected by the E-bus (a 34-bit shift register) and a maintenance bus, with a maintenance controller, a result port, the host and a Q-FIFO.]
In the MIT machine a processor does not have a private memory for instructions or data. However, the machine has a central memory. Packets are communicated between memory cell blocks and processors using FIFOs together with arbitration and distribution networks. A node in the machine model has three input and four output arcs.

The MIT machine supports multiactor instructions and vector instructions to reduce token traffic in the routing networks. It has one path with a width of 32 bits and packets of 11 words. Streams are handled by pipelining word data packets.
2.2. TI's DDP architecture
The data-driven processor, or DDP, of Texas Instruments is designed for executing Fortran programs using some of the principles outlined in the MIT architecture. The DDP investigates the
feasibility of static data flow computing without acknowledgement signals for high speed
computing systems. A block diagram of the DDP machine is shown in Fig. 2. The program graph
corresponding to a source program is generated by a compiler in the host processor. The program
graph is partitioned into subgraphs by a cluster detection algorithm. The subgraphs are loaded into
the memory units of processing elements. Every node stored in the memory unit of a processor
has an operation and a counter (called the predecessor counter) to determine whether a node can
be enabled for firing, a maximum of 13 operands, and a maximum of 13 destination addresses but
the total number of input and output arcs cannot exceed 14. A node is enabled when the
predecessor counter reaches zero. The enabled node is executed in the ALU of the processing
element. The results of a node firing are forwarded to successor nodes in the processor’s memory
unit or another processing element. If the successor node is in another processing element, then
it is communicated using the interconnection network E-bus as a series of 34-bit packets, as shown
in Fig. 2. As a result of this token forwarding, one or more nodes may be enabled. Since there
is a single ALU in a processing element, the enabled nodes are linked together in a queue called the pending instruction queue (PIQ). The node at the head of the queue is dispatched for execution [13].

[Fig. 3. The LAU machine architecture: the memory unit, instruction control unit (ICU), data control unit (DCU), execution unit (EU) and host, connected via the instruction bus, data bus and control-related buses.]
The maintenance controller, Fig. 2, detects faulty processor(s). Actually, faults require restarting
the computation at the previous checkpoint, where the buffer contents can be dumped using the
maintenance controller. Each instruction packet has a field for processor number. After detecting
a fault in a processor, that processor number will not be used in allocating instruction packets.
The previously scheduled instruction packets and the results to a failed processor can be rerouted
to another processor using the maintenance controller.
In the DDP machine the Pending Instruction Queue (PIQ) holds executable instructions
that have been enabled. Executable instructions are removed from this queue by the arithmetic
unit. When the capacity of the queue is exceeded, the enabled instructions are linked, in memory,
to the PIQ via their link field that is already reserved for this purpose. This method has the
advantage that no amount of program parallelism can overflow the capacity of the hardware resources [1,3,13].
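To make the enabling discipline concrete, the following minimal Python sketch mimics the predecessor-counter and PIQ behaviour described above; the class and field names are illustrative assumptions, and the in-memory overflow linking of the PIQ is omitted.

from collections import deque

class DDPNode:
    """One DDP graph node: a predecessor counter and destination
    addresses (at most 13 destinations in the DDP)."""
    def __init__(self, name, pred_count, dests):
        self.name = name
        self.pred_count = pred_count   # operands still missing; 0 => enabled
        self.dests = dests             # names of successor nodes

def run(nodes):
    """Fire nodes in PIQ order; each result token decrements the
    predecessor counter of its successors, possibly enabling them."""
    piq = deque(n for n in nodes.values() if n.pred_count == 0)
    while piq:
        node = piq.popleft()           # head of the PIQ goes to the ALU
        print("firing", node.name)
        for name in node.dests:        # forward result tokens to successors
            succ = nodes[name]
            succ.pred_count -= 1
            if succ.pred_count == 0:   # last awaited token enables the node
                piq.append(succ)

# (a + b) * (c + d): the two adds start enabled, the multiply waits for both.
nodes = {"add1": DDPNode("add1", 0, ["mul"]),
         "add2": DDPNode("add2", 0, ["mul"]),
         "mul":  DDPNode("mul", 2, [])}
run(nodes)                             # fires add1, add2, then mul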
2.3. LAU architecture
The LAU machine contains four major units: memory, execution, control, and an interface as
shown in Fig. 3. Programs written in a single assignment language are compiled to produce data
flow graphs in the host. Each node in the data flow graph can have a maximum of two input arcs
and several output arcs. The instruction format corresponding to a node contains an operation part
and a control part. The operation part of an instruction is loaded into the memory and the control
part is maintained in the control unit. A node is enabled when the operands are ready and proper
context exists. The instruction control part detects enabled nodes using a simulated associative
memory and sends the addresses to the memory. The addresses in the ICU and DCU of Fig. 3
have one-to-one correspondence with those in the memory unit. The memory reads the operation
part of the instruction and sends it to a queue (FIFO) that can hold 128 enabled instructions. The
instruction at the head of the queue is dispatched to an available processor in the execution unit.
The execution unit in the LAU machine uses a number of buses to access the rest of the system.
It can support up to 32 processors. Each processor executes the instruction assigned to it by reading
the operands from the memory unit. The results produced are written into the memory and the
destination addresses are obtained.
In the LAU machine, enabled instructions are kept in a queue until results come out of the
processor. This helps to reassign instructions to a healthy processor if a faulty processor was
detected. The LAU execution unit does not have a local memory; instead it has a central memory from which the execution unit can read or write data.
3. THE DFRAM MODEL
There are several fundamental models of computing devices. The best known are the random access machine (RAM), which models a sequential computer that carries out one operation at a time, and the parallel random access machine (PRAM), which models parallel machines with control flow [12]. In the following we present our DFRAM model and emphasize its basic characteristics.
3.1. Data flow random access machine (DFRAM) model
Data flow graphs have been used extensively to model parallel computations. In these models
the data flow graph is considered as an uninterpreted bipartite graph with nodes and links, as shown
in Fig. 4 [16-21].
Unfortunately, such a representation is unable to express the structural and the behavioral details
of data flow machines. This shortcoming might be avoided if the semantic actions of the graph actors
(nodes) are considered. Therefore, here, we propose the DFRAM model that augments the basic
structural and behavioral characteristics of data flow machines in every node of the corresponding
graph.
The DFRAM model for static data flow architectures in Fig. 5 consists of:

(1) a set of sharable cell blocks and/or local RAM memory cells;
(2) an enabling network controller;
(3) an enabled queue structure;
(4) processing (firing) element(s);
(5) a processor-availability checker;
(6) a results queue structure;
(7) a distribution controller.
[Fig. 4. A pipelined data flow graph, with input links feeding actor nodes.]

[Fig. 5. The DFRAM model of a graph actor: cell blocks loaded from the host feed an enabling-network controller, processing elements and a distribution network that delivers the results.]

The DFRAM model works according to the following rules:
(1) The data flow graph of a computation (program) should be generated and stored in the cell
blocks and/or the local memory cells before the DFRAM model operation is initiated in order to
provide the data required for the model.
(2) When the data is available, the enabling network controller determines the nodes to be
enabled. A node can be enabled if and only if
(a) there are tokens on every link of the input links required for enabling the node, and
(b) there are no tokens on any of the node's output links.
(3) All enabled nodes are arranged in a queue structure.
(4) The front node of the enabled queue is assigned a processing element to accomplish the node
firing.
(5) The output of the processing element is checked. If it is available, then it is passed to the
results queue; otherwise the firing process is canceled. If the firing process is canceled, then its input
is returned either to the enabled queue or to the memory cells.
(6) Again all the results are arranged in a results queue.
(7) A distribution network controller takes the results from their queue and directs (writes) them
either to the memory cells as intermediate results or to the output as final results. Consequently,
the intermediate results will act as new data for the rest of the data flow graph.
(8) A computation terminates when there are no enabled nodes (these rules are illustrated by the sketch following the observations below).

From the DFRAM model, we can point out the following:
(a) Each data flow operation consists of four steps. These steps are:
    (1) reading the node information from the cell blocks (reading step);
    (2) realizing the enabling conditions of a node and detecting the enabled nodes (enabling step);
    (3) assigning nodes to processing elements to fire and passing tokens to the results queue (firing step); and
    (4) moving results to output links (distribution or writing step).
(b) The DFRAM model is asynchronous. On the other hand, the PRAM model [12] assumes synchronous control flow.
(c) Rule (2b) of the model is confined to static architectures only and is violated when dynamic data flow machines are considered.
(d) Rules (2b) and (8) ensure that no write conflicts occur in the DFRAM model, since a node result has a predefined address value. However, such conflicts may take place in the PRAM model.
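The following minimal Python sketch walks through rules (1)-(8) for a two-node graph; the token dictionary, Node class and arc naming are illustrative assumptions, and a single Python loop stands in for the concurrent processing elements.

from collections import deque

class Node:
    def __init__(self, op, inputs, outputs):
        self.op = op              # semantic action fired by a processing element
        self.inputs = inputs      # input arc names
        self.outputs = outputs    # output arc names

def enabled(node, tokens):
    # Rule 2: a token on every input arc and no token on any output arc.
    return (all(a in tokens for a in node.inputs)
            and not any(a in tokens for a in node.outputs))

def run(graph, tokens):
    """Enable, fire and distribute until no node is enabled (rule 8)."""
    while True:
        queue = deque(n for n in graph if enabled(n, tokens))  # rule 3
        if not queue:
            return tokens                          # computation terminates
        results = deque()                          # rule 6: results queue
        while queue:
            node = queue.popleft()                 # rule 4: assign front node
            operands = [tokens.pop(a) for a in node.inputs]
            results.append((node, node.op(*operands)))   # rule 5: fire
        for node, value in results:                # rule 7: distribute results
            for arc in node.outputs:
                tokens[arc] = value                # intermediate or final token

# Rule 1: the graph and initial tokens are loaded before operation starts.
graph = [Node(lambda x, y: x + y, ["a", "b"], ["s"]),
         Node(lambda x, y: x * y, ["s", "c"], ["out"])]
print(run(graph, {"a": 2, "b": 3, "c": 4})["out"])   # (2 + 3) * 4 = 20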
The performance of data flow machines is assessed by comparing the time cost of these machines. For the sake of comparison, the time complexity of all the underlying machines is computed for the matrix multiplication problem. Two n x n matrices A and B are considered and their manipulation by the three machines, MIT, DDP and LAU, is investigated.
3.2. The time cost of data flow machines
For calculating the product matrix C, where C = A x B in the three machines, the unit time T
is given by:
T = (reading + enabling + firing + distributing) time.
In what follows the reading time is assumed to be negligibly small.
3.2.1. The MIT machine. In the MIT machine, the program graph to be executed is stored in
the cell blocks by the host. Each cell block stores the operations, operands and destination address
of a node. Enabled nodes in the cell blocks are sent to the processing elements using the enabling
network controller. The results of execution are passed either to cell blocks or to the output link
using the distribution network of the machine. Typically, one enabled node is using a processor
and a number of enabled nodes are waiting in the enabled queue of that processor, which does not have a local memory. The execution of a program graph terminates when none of the nodes in the
memory units is enabled.
The data flow graph of the multiplication process is stored in the cell blocks of the MIT machine in O(log n) time. The multiplication proceeds as follows:
(1) One column from the matrix B is selected.
(2) The selected column and one row from the matrix A are assigned concurrently to a processing element. This occurs in n/p time, where p is the number of processors.
(3) The processing element multiplies the elements of the selected column by the elements of the selected row sequentially in n time.
(4) Steps 1, 2 and 3 are repeated n times until all columns of the matrix B are multiplied by the rows of the matrix A.
The time complexity of multiplication

= [(No. of A rows) x (No. of B columns) x (No. of elements of a row) / (No. of processing elements)] x T(MIT),

where T(MIT) is the unit time for the MIT machine.

Thus the multiplication time = (n x n x n/p) x T(MIT), and

the total time = storing time + multiplication time = O(log n) + (n^3/p) x T(MIT).    (1)

(1) If p << n, then time = O(log n) + O(n^3) x T(MIT).
(2) If p = n, then time = O(log n) + O(n^2) x T(MIT).
(3) If p >> n (p >= n^3), then time = O(log n) + O(1) x T(MIT).
3.2.2. The DDP machine. In the DDP machine, a program graph is generated by the compiler (data flow generator) in the host processor. The program graph is partitioned into subgraphs. The subgraphs are loaded into the local memory cells of the processing elements, since the DDP machine does not have sharable cell blocks. The enabled node is executed in the processing element. The results of a node firing are forwarded to the corresponding successor nodes in the processor's local memory cell using the distribution network. As a result, one or more nodes may be enabled. The enabled nodes are linked together in the enabled queue. The node at the head of the queue is dispatched for execution.
The time required for storing a data flow graph is O(log n). After partitioning the flow graph into subgraphs we can find in the local memory of each processing element a copy of matrix B and n/p rows from matrix A.

Since there is only one ALU in the processing element, the time required for matrix multiplication will be:

(n x n x n)/p x T(DDP).

Then the total time = storing time + multiplication time = O(log n) + (n^3/p) x T(DDP).    (2)

Again, if p >> n (p >= n^3), then time = O(log n) + O(1) x T(DDP).
3.2.3. The LAU machine. The LAU has neither local memory cells nor results queue facilities. The data flow graph is stored in the cell blocks. As in the above two machines, the execution consists of enabling, firing and distribution steps. The enabled nodes are linked into the enabled queue. An enabled node from the queue is assigned to a processing element. The results produced are written into the cell blocks using the distribution network.

In that machine the data flow graph of the matrix multiplication process is stored in the cell blocks of the machine in O(log n) time.

The enabled nodes are stored in the enabled queue of the machine. The total number of enabled nodes equals (n x n x n). At the first unit time of multiplication, p enabled nodes are assigned to the p processing elements concurrently. After that, another p enabled nodes at the head of the enabled queue will be assigned to the empty processing elements. This process will continue until all enabled nodes are fired.

The time of multiplication = (n x n x n/p) x T(LAU).

The total time = storing time + multiplication time = O(log n) + (n^3/p) x T(LAU).    (3)

If p >> n, time = O(log n) + O(1) x T(LAU).
From the above we find that equations (1)-(3) are the same; the difference between them is the unit-time value. The unit time consists of three terms for enabling, firing and writing (distribution).
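Since equations (1)-(3) share one form, the total time for any of the three machines follows mechanically from n, p and the machine's unit time. A minimal sketch; the function name and the use of log2 for the O(log n) storing term are assumptions:

import math

def total_time(n, p, unit_time):
    """Total time of n x n matrix multiplication per equations (1)-(3):
    an O(log n) graph-storing term plus n^3/p firings of one unit time."""
    storing = math.log2(n)            # O(log n) loading of the program graph
    firings = math.ceil(n**3 / p)     # n^3 enabled nodes shared by p PEs
    return storing + firings * unit_time

print(total_time(8, 4, 1.0))          # p << n^3: dominated by the n^3/p term
print(total_time(8, 8**3, 1.0))       # p >= n^3: O(log n) + O(1) unit times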
In the following we exploit the DFRAM model to calculate the unit times of the three machines
in terms of their primitives.
- enabling time = (ni + no) x tc,
  where ni is the number of input arcs, no is the number of output arcs and tc is the time of one comparison;
- firing time = tex + [twf, if any],
  where tex is the time of executing one operation and twf is the writing time of a control bit;
- writing time = [tcf, if any] + (no x tw) + (d x tw),
  where tcf is the comparison time of control bits, tw is the destination writing time and d is the number of destinations.
Since the number of input arcs and output arcs of a node differs from machine to machine, the enabling and writing times will differ as well. The unit time of the three machines (MIT, DDP and LAU) can be given as follows (in the totals below the per-arc comparison and writing terms, tc and tw, are collected into a single coefficient):
(1) The MIT unit time, T(MIT)
In the MIT, the numbers of input and output arcs are 3 and 4, respectively.

enabling time = (3 + 4) x tc = 7tc
firing time = tex
writing time = 4 x tw + d x tw, where 1 <= d <= 4.

Thus, T(MIT) = 11tc + tex + d x tw.

(2) The DDP unit time
In the DDP, a node can have up to 13 input arcs or output arcs, but the total number of input and output arcs cannot exceed 14. Accordingly,

enabling time = 14tc
firing time = tex
writing time = 13tw + d x tw, where 1 <= d <= 13.

T(DDP) = 27tc + tex + d x tw.

(3) The LAU unit time
In the LAU, a node has two input arcs and 13 output arcs. The LAU machine contains a control unit which can synchronize the operation.

enabling time = (2 + 13) x tc = 15tc
firing time = tex + twf,
where twf is the writing time of a control bit cd. The cd is a control bit in the control unit of the LAU machine. The result of a node firing is not available until the cd bit is set. If a fault is detected in a processor, then the cd bit will never be set.
writing time = tcf + 13tw + d x tw, where 1 <= d <= 13,
and tcf is the comparison time of cd. The cd must have the value 1. If the value of cd is zero, then a fault is detected in a processor and it is possible to reassign the nodes to a healthy processor.

Unit time = 15tc + tex + twf + tcf + 13tw + d x tw
T(LAU) = 28tc + tex + twf + tcf + d x tw.

From the above, we find that:

T(LAU) > T(DDP) > T(MIT).
This is due to the difference in the architecture of the three machines [26-29].

The LAU machine contains a control unit which can synchronize the operation, so the unit time increases by the twf and tcf values due to this synchronization process. In the LAU machine, a node
can have at most two input and several output arcs. Since the number of output arcs is large, the
enabling and writing times will increase.
The DDP machine does not have a control unit, but a node can have up to 13 input arcs or 13 output arcs. However, the total number of input and output arcs cannot exceed 14. Due to this
large number of input and output arcs, the enabling and writing times increase. Consequently, the
unit time of the DDP machine will increase.
In the MIT machine, a node can have up to three input arcs and four output arcs. Therefore,
the enabling and writing times of the MIT machine are smaller than the times of DDP and LAU.
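The three unit times can be reproduced mechanically from the arc counts. The sketch below collects the coefficients of the primitive times; the function and key names are illustrative, and tc and tw are lumped exactly as in the totals above.

# Coefficients of the primitive times in T = enabling + firing + writing:
# tc (arc comparison), tex (operation execution), twf/tcf (control-bit
# write/compare, LAU only) and tw (per-arc write); d destination writes
# stay symbolic as "d*tw".
def unit_time(total_arcs, out_arcs, control_unit=False):
    return {"tc": total_arcs,           # enabling: compare every in/out arc
            "tex": 1,                   # firing: one operation execution
            "twf": int(control_unit),   # firing: control-bit write (LAU)
            "tcf": int(control_unit),   # writing: control-bit compare (LAU)
            "tw": out_arcs,             # writing: one write per output arc
            "d*tw": 1}                  # writing: d destination writes

for name, t in [("MIT", unit_time(3 + 4, 4)),
                ("DDP", unit_time(14, 13)),
                ("LAU", unit_time(2 + 13, 13, control_unit=True))]:
    # Lumping tc with tw reproduces the 11, 27 and 28 coefficients above.
    print(name, t, "lumped t-terms:", t["tc"] + t["tw"])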
4. THE SIMULATOR
On the basis of the DFRAM model a Data Flow Simulator (DFS) has been designed and
implemented. Here the basic features of DFS are emphasized.
4.1. Generation of the data flow graph
The generation
of a data flow graph and token table is important
because it is an interface
between a user program and data flow machine. We have built a token-table generator that accepts
the user source program and generates the corresponding
token table. This token table as such is
manipulated
by a machine simulator.
The generator consists of two phases: a scanner phase and token-table
generator phase. The
scanner converts the stream of input characters into a stream of tokens. Such tokens become the
input to the next phase.
In the data flow generator, the scanner phase accepts a user program and generates tokens for
each statement in the program. Each token consists of two fields: token type and value, i.e. token
(type, value).
The token-table
generator phase accepts the tokens that are generated from the scanner phase
and generates the corresponding
token table. The token-table
consists of many fields. These fields
are op-code (operator),
dn (node number), dr (arc number), value (of operand), next-dn (node
number of a destination
node), next-opcode
(opcode (operator) of a destination
node), next-arc
(arc number of a destination
node) and ready-f. The ready-f is set to one if the value of an operand
exists and reset to zero if the value of an operand does not exist.
The rules used for the token-table generator are:

(1) if the token-type is identifier (id), then a zero (0) is stored in the value field of the token-table;
(2) if the token-type is constant, then the token-value is stored in the value field of the token-table;
(3) if the token-type is opcode (op), then an integer value is stored in the op-code field. This field will be one (1) for multiplication, two (2) for division, three (3) for addition and so on.
There are two counters: the first counts the nodes and the second counts the arcs for each node.
The value of the first counter is stored in the dn field and the value of the second counter is stored
in the dr field of the token table.
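A minimal sketch of the two-phase generator for single-operator statements follows; the tokenizer regular expression is an assumption, opcode numbers follow rule (3), and the destination fields (next-dn, next-opcode, next-arc) are omitted for brevity.

import re

OPCODES = {"*": 1, "/": 2, "+": 3}    # 1 multiplication, 2 division, 3 addition

def scan(statement):
    """Scanner phase: convert the character stream into (type, value) tokens."""
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|[*/+]", statement):
        if lexeme.isdigit():
            tokens.append(("constant", int(lexeme)))
        elif lexeme in OPCODES:
            tokens.append(("op", lexeme))
        else:
            tokens.append(("id", lexeme))
    return tokens

def token_table(statements):
    """Token-table generator phase: one record per operand arc, per rules (1)-(3)."""
    table, dn = [], 0
    for stmt in statements:
        dn += 1                                    # first counter: node number
        tokens = scan(stmt)
        op = next(v for t, v in tokens if t == "op")
        dr = 0
        for ttype, value in tokens:
            if ttype == "op":
                continue
            dr += 1                                # second counter: arc number
            table.append({"op-code": OPCODES[op], "dn": dn, "dr": dr,
                          # rule (1): identifiers store 0 and are not ready;
                          # rule (2): constants store their value, ready-f = 1
                          "value": 0 if ttype == "id" else value,
                          "ready-f": 0 if ttype == "id" else 1})
    return table

for record in token_table(["a * 3", "b + 4"]):
    print(record)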
We have provided three token-table
generators: one for the MIT machine, the second for the
DDP machine and the third for the LAU machine. The number of fields of the generated token
table is different from machine to machine. This is because the number of output nodes is different
from machine to machine.
4.2. Flow graph construction
Many data flow machine simulators have been built. Some of them [9,13-15] permit us to identify
the basic mechanisms that must be supported by any data flow computation.
Also, they have been
used to evaluate the performance of some data flow machines. However, they may fail in displaying
the essential differences between various simulated machines.
Here we present DFS as a simulator that has been built on the basis of the DFRAM model to
study the detailed performance
of data flow machines. Three static machines are considered: MIT,
DDP and LAU. The structure (block diagram) of DFS is shown in Fig. 6. In DFS the user program
is transformed
into a data flow graph and a token table. The data flow graph consists of nodes,
input arcs and output arcs. Each node corresponds
to an instruction
operator,
the input arc
corresponds to input data token (instruction
operand), and the output arc corresponds to a result
(it is also an input arc for next node). Each node has a specific number of input arcs and a specific
number of output arcs according to the type of the machine.
The common features of data flow machines are specified in a “reference machine”.
Such a
machine is hypothetical
and does not include any particularities.
The token-table of the reference machine, shown in Fig. 7, contains the instruction opcode, node
number (dn), arc number (dr), token value (value), addresses of the next nodes and other control
data. The control data are ready-f, then-no, else-no and while-no. The opcode value is an integer
value, for example, the value 1 for multiplication
operator, the value 2 for division operator and
so on. The node number (dn) is an integer value, and the arc number (dr) takes on the values 1, 2, ..., 13 according to the type of data flow machine. The address of the next node (destination node) contains the node number (next-dn), the opcode (next-opcode) and the arc number (next-arc) of the next node.

[Fig. 6. The DFS structure: the token-table generator (Module 1), an initialization and enabling module (Module 2) producing per-opcode enabled-node tables (multiply-, divide- and add-enabled node tables), the firing phase, and the writing (distribution) phase with per-opcode output tables (multiply, add and divide output tables).]

The ready-f is set to 1 if the value of the operand (token) is ready. There are two
counters, one to count the if-statements and the second to count the while-statements. The then-no and else-no contain the counter value of the if-statements. The then-no field of all instructions existing inside the then-body of the if-statement has the if-statement counter value. Also the else-no field of the instructions existing inside the else-body of the if-statement has the if-statement counter value. Thus, when an if-statement is satisfied, the instructions inside its then-body are enabled; otherwise, the instructions inside its else-body are enabled.
The execution process of instructions consists of enabling stage, firing stage and distributing
stage. In the enabling stage, a group of node tables are generated; a table for each opcode type.
The importance of generating a node table for each opcode type is to satisfy the parallelism of the
data flow machines. Each node table contains the enabled nodes data. Figure 8 shows the contents
of the node table. This table structure is suitable for conditional opcode type used in all data flow
machines. The table contains the opcode of the node (opcode), the number (dn), the values of the
input tokens (dr, , dr,), the output arcs values (dor, , dor,), the output nodes addresses (next-dn . . . ,
Fathy E. E!assa et al.
494
Opcode
dn
value
dr
next-dn
next-arc
next-dn2
next-opcode2
next-arc2
else-no
then-no
ready-f
next-opcodc
Fig. 7. The token-table of the reference machine.
Opcode
next-dn2
dn
dr,
next-opcode2
dr,
dor,
next-arc2
dorz
C
ready-f
Fig. 8. The node-table
of
ef
then-no
of the reference
next-dn
next-opcode
else-no
while-no
next-arc
machine.
next-arc) and other control data (c, ef, of, ready-f, then-no, else-no, while-no). The node is enabled
when the tokens of its input arcs are ready and its output arcs are free.
In the enabling stage, the enabling module in Fig. 6 accepts the token-table and generates node
tables that contain the enabled nodes. In order for a node to be enabled, the enabling module reads
the token-table records and collects the operands of each node. When the number of ready
operands of a node are equal to the required input, the output arcs of the node must be tested.
If the output arcs of the node are free, the node data shown in Fig. 8(b) are written to the node
table.
During the firing stage, the firing module accepts the node tables and passes them to a table for processing elements in order to fire the enabled nodes. Before the firing moment, the status of the processing element is tested. The status may be fault, busy or ready; the firing occurs only if the status of the processor is ready.

During the distributing stage, the writing module communicates the results of the firing stage to the token table, updating the data values of some tokens so that other nodes may be enabled.
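A minimal sketch of one enabling/firing pass of the DFS follows: per-opcode node tables are built from the token table, and a node fires only on a ready processor. The record layout is simplified, the processor-selection rule is an arbitrary assumption, and output-arc testing and the distribution stage are omitted.

from collections import defaultdict

OPS = {1: lambda x, y: x * y, 2: lambda x, y: x / y, 3: lambda x, y: x + y}

def enabling_stage(token_table, required=2):
    """Collect ready operands per node and emit one node table per opcode."""
    operands = defaultdict(list)
    for rec in token_table:
        if rec["ready-f"]:                         # operand value is present
            operands[(rec["op-code"], rec["dn"])].append(rec["value"])
    node_tables = defaultdict(list)                # one table per opcode type
    for (opcode, dn), values in operands.items():
        if len(values) == required:                # all input arcs carry tokens
            node_tables[opcode].append({"dn": dn, "inputs": values})
    return node_tables

def firing_stage(node_tables, pe_status):
    """Fire enabled nodes whose assigned processing element is ready."""
    results = []
    for opcode, nodes in node_tables.items():
        for node in nodes:
            pe = node["dn"] % len(pe_status)       # crude processor assignment
            if pe_status[pe] == "ready":           # status: fault, busy or ready
                results.append((node["dn"], OPS[opcode](*node["inputs"])))
    return results

table = [{"op-code": 3, "dn": 1, "dr": 1, "value": 2, "ready-f": 1},
         {"op-code": 3, "dn": 1, "dr": 2, "value": 5, "ready-f": 1},
         {"op-code": 1, "dn": 2, "dr": 1, "value": 4, "ready-f": 1},
         {"op-code": 1, "dn": 2, "dr": 2, "value": 0, "ready-f": 0}]
pe_status = {0: "busy", 1: "ready"}
print(firing_stage(enabling_stage(table), pe_status))   # [(1, 7)]: node 2 waits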
5. PERFORMANCE EVALUATION
By making use of the DFS the performance of MIT, DDP and LAU has been evaluated under
various operational conditions. Naturally, a data flow program may undergo several “stages”. A
stage represents one operation that may be carried out in one or more actors concurrently. For
multiplying two matrices A(n x n) and B(n x n), Fig. 9 shows the increase of the time cost with the increase of the matrix size n for the MIT, DDP and LAU machines. It is obvious that at fault-free operation the MIT machine has the best performance, followed by the DDP, while the LAU machine has the least performance. Also, this figure agrees with the result of the DFRAM model that, for p >> n, the multiplication of two n x n matrices takes O(1) DFRAM steps. (As indicated in [12], it takes O(log n) PRAM steps and O(n^3) RAM steps.)
During faulty operational conditions, the high degradation rate of the MIT machine is due to
the fact that when one of the machine processors becomes faulty, the current computation is turned
off and the machine restarts the computation from its beginning [13]. The degradation rate of the
DDP machine is smaller than that of the MIT machine since a processor fault returns the
computation to the nearest break point. However, the LAU machine can fail safe (without
degradation) because if a processor fails, the machine can immediately provide a fault-free
reconfiguration to continue the computation. Table 1 summarizes the essential characteristics of
these machines.
[Fig. 9. Total time of MIT, DDP and LAU: comparison at fault-free operation (total time in unit times vs number of elements).]

Table 1. Characteristics of MIT, DDP and LAU machines

Machine feature                  MIT                           DDP                                          LAU
No. of communication buses       One                           Two                                          Six unidirectional buses
Memory of the machine            Cell blocks (memory units)    Local memory                                 Central memory
Operation with faulty            Restarting the process        Complete operation from                     Reassignment and
processors                       from the beginning            the break point                              complete operation
Node input arcs                  3                             Up to 13                                     2
Node output arcs                 4                             Up to 13 (total i/p and o/p                  Up to 13
                                                               arcs must not exceed 14)
Time cost                        Small                         Medium                                       Large
Availability                     Least                         Good                                         Best
6. CONCLUSION
In this paper the DFRAM model and the DFS simulator have been presented. The DFRAM
model can be relied upon as a formal model for data flow machines. The essential features of that
model are:
(1) It extends the semantic data flow graph so that each graph node includes the semantic actions of the underlying instruction. Accordingly, each machine time unit consists of three phases for enabling, firing and writing tokens, respectively.
(2) No write conflicts may arise and, consequently, it can be easily used for computing the time cost of various data flow algorithms.
(3) It is an asynchronous model, i.e. the token flow is neither controlled nor synchronized.
On the basis of the DFRAM, the DFS has been built to act as a modeling tool for data flow machines. Since the DFRAM augments the semantic actions of data flow processing, the DFS inherently has the capability of emphasizing the structural and the behavioral differences of the simulated machines. Therefore, DFS may have an edge over similar simulators. Moreover, DFS is modular, portable and can be implemented with ease.
Three data flow machines have been simulated by the DFS. They are the MIT, DDP and LAU machines. The results of the simulator have indicated that with fault-free conditions the MIT machine has the best performance. The second is the DDP machine while the LAU machine has the least performance. On the other hand, if a processor becomes faulty, then the LAU machine will possess the highest availability, followed by the DDP machine in the second position, whilst the MIT architecture will have the lowest availability.
REFERENCES
1. J. Herath, Y. Yamaguchi, N. Saito and T. Yuba, Data flow computing models, languages and machines for intelligence computations. IEEE Trans. Software Engng 14, (1988).
2. D. Abramson and G. Egan, The RMIT data flow computer: a hybrid architecture. The Computer J. 33, (1990).
3. P. C. Treleaven, D. R. Brownbridge and R. P. Hopkins, Data driven and demand-driven computer architecture. Computing Surveys 14, (1982).
4. M. L. Cambell, Static allocation for a data flow multiprocessor. IEEE Computer, pp. 511-517 (1985).
5. S. A. Thoreson, A feasibility study of a memory hierarchy in a data flow environment. IEEE Computer, pp. 356-360 (1985).
6. A. Culler and D. E. Culler, Data flow architectures. IEEE Computer 1, 225-253 (1986).
7. K. P. Gostelow and R. E. Thomas, Performance of a simulated data flow computer. IEEE Trans. Computers C-29, (1980).
8. B. Lee, A. R. Hurson and B. Shirazi, A hybrid scheme for processing data structures in a data flow environment. IEEE Trans. Parallel and Distributed Systems 3, (1992).
9. M. Takesue, Cache memories for data flow machines. IEEE Trans. Computers 41, (1992).
10. J. B. Dennis, C. K. C. Leung and D. P. Misunas, A highly parallel processor using a data flow machine language. Laboratory for Computer Science, MIT, CSG Memo 134-1 (June 1979).
11. J. B. Dennis, Data flow supercomputers. IEEE Computer, pp. 48-56 (1980).
12. A. V. Aho, J. E. Hopcroft and J. D. Ullman, The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Mass. (1974).
13. V. P. Srini, An architectural comparison of data flow systems. IEEE Computer, pp. 68-87 (1986).
14. A. P. W. Bohm and J. R. Gurd, Iterative instructions in the Manchester data flow computer. IEEE Trans. Parallel and Distributed Systems 1, (1990).
15. P. C. Treleaven, R. P. Hopkins and P. W. Rautenbach, Combining data flow and control flow computing. The Computer J. 25, (1982).
16. K. M. Kavi, B. P. Buckles and N. U. Bhat, A formal definition of data flow graph models. IEEE Trans. Computers C-35, (1986).
17. R. H. Perrott, Parallel Programming, pp. 13-17 and 25-26. Addison-Wesley, New York (1988).
18. K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing. McGraw-Hill, New York (1985).
19. A. S. Tanenbaum, Computer Networks. Prentice-Hall, Englewood Cliffs, New Jersey (1981).
20. J. B. Dennis and D. P. Misunas, A preliminary architecture for a basic data-flow processor. Proceedings of the 2nd Annual Symposium on Computer Architecture, New York (1975).
21. D. A. Reed and R. M. Fujimoto, Multicomputer Networks: Message-Based Parallel Processing. MIT Press, Cambridge, Mass. (1987).
22. W. W. Carlson and K. Hwang, Algorithmic performance of data flow multiprocessors. IEEE Computer, pp. 30-40 (1985).
23. D. Ghosal and L. N. Bhuyan, Performance evaluation of a data flow architecture. IEEE Trans. Computers 39, (1990).
24. R. Duncan, Survey of parallel computer architectures. IEEE Computer, pp. 5-16 (1990).
25. H. Burkhart and E. Millen, Performance-measurement tools in a multiprocessor environment. IEEE Trans. Computers 38, (1989).
26. M. Sowa and T. Murata, A data flow computer architecture with program and token memories. IEEE Trans. Computers C-31, 820-824 (1982).
27. Y. C. Hong, T. H. Payne and B. Ferguson, An architecture for a data flow multiprocessor. IEEE Computer (1985).
28. L. S. Haynes, R. L. Lau, D. P. Siewiorek and D. W. Mizell, A survey of highly parallel computing. IEEE Computer (1982).
29. W. D. Hillis and G. L. Steele, Data parallel algorithms. CACM 29, (1986).
AUTHORS' BIOGRAPHIES
Fathy E. Eassa received the B.Sc. degree in electronics and electrical communications engineering from Cairo University, Egypt in 1978, and the M.S. and Ph.D. degrees in computers and systems engineering from Al-Azhar University, Cairo, Egypt in 1984 and 1989, respectively. He is an Assistant Professor with the Department of Computers and Systems Engineering at Al-Azhar University, Cairo, Egypt. His research interests include data flow machines, software engineering and artificial intelligence.
Mohamed M. Eassa received his B.Sc. degree in electronic engineering from Menofia University and his M.S. and Ph.D. degrees in computer engineering from the University of Al-Azhar. He is an Assistant Professor of software engineering and computer science in the Sadat Academy for Management Sciences, Cairo. His main research interests are parallel processing systems, data flow machines and computer networks.
Mohamed Zaki is the Professor of Computer Science, Faculty of Engineering, Al-Azhar University. He received his B.Sc. and M.Sc. in electrical engineering from Cairo University in 1968 and 1972, respectively. He received his Ph.D. in engineering from Warsaw Polytechnic. His research interests include parallel processing, data flow machines, knowledge engineering and distributed databases.