System-level Synthesis using Re-programmable Components

Rajesh K. Gupta and Giovanni De Micheli
Center for Integrated Systems, Stanford University, Stanford, CA 94305.

Abstract

We formulate the synthesis problem of complex behavioral descriptions with performance constraints as a hardware-software co-design problem. The target system architecture consists of a software component, as a program running on a re-programmable processor, assisted by application-specific hardware components. System synthesis is performed by first partitioning the input system description into hardware and software portions and then implementing each of them separately. The synthesis of dedicated hardware is achieved by means of hardware synthesis tools [1], while the software component is generated using software compiling techniques. We consider the problem of identifying potential hardware and software components of a system described in a high-level modeling language, and we present a partitioning procedure. We then describe the results of partitioning a network coprocessor.

1 Introduction

Existing high-level synthesis techniques attempt to generate a purely hardware implementation of a system design, either as a single chip or as an interconnection of multiple chips, each of which is individually synthesized [1] [2] [3] [4]. A common objection to such an approach to system design is the cost-effectiveness of an application-specific hardware implementation versus a corresponding software solution using re-programmable components, such as off-the-shelf microprocessors. It is often the case that system design requires a mixed implementation that blends application-specific (ASIC) chips with processors, memory and other special-purpose modules. Important examples are embedded controllers and telecommunication systems. Indeed, most digital functions can be implemented by software programs. The major reason for building dedicated ASIC hardware is the satisfaction of performance constraints. These performance constraints can be on the overall time (latency) to perform a given task or subtask, and/or on the input/output data rates. For example, consider the design of a data encryption/protocol controller chip. As shown in Figure 1, the DES transmitter takes data from memory using a DMA controller, assembles the frame for transmission, encrypts the data after it receives the key, and transmits the encrypted data. The encryption algorithm is a long, iterative process of rotations and bit permutations. It is generally more cost-effective to implement such bit-oriented operations in dedicated hardware, since they may take too long to execute as a sequence of instructions on most processors. On the other hand, operations related to memory access and frame assembly are typically unconstrained, so they can be relegated to a program running on an already available general-purpose microprocessor, thus substantially reducing the amount of dedicated hardware required for system implementation.
Therefore, while a complete application-specific hardware implementation of the controller may be too expensive in terms of hardware size and design time, an implementation which utilizes a re-programmable component may satisfy the performance requirements and at the same time provide the ease and flexibility of reprogramming in software.

[Figure 1: Example of a Mixed System Implementation. In the DES transmitter, "receive data from memory using DMA", "assemble frame" and "receive encryption key" map to the re-programmable (software) component, while "encrypt data" (subject to a max time constraint) and "transmit data" map to dedicated hardware.]

The problem of hardware-software co-design is fairly complex, and to date there are no available CAD tools to support it. This paper addresses the co-design issue by formulating it as a partitioning problem into application-specific and re-programmable components. We can view it as an extension of high-level synthesis techniques to systems with generic re-programmable resources. Nevertheless, the overall problem is much more complex and it involves, among others, solving the following subproblems:

1. Modeling the system functionality and performance constraints.
2. Determination of the boundary of tasks between ASIC and re-programmable components.
3. Specification and synthesis of the hardware-software interface and related synchronization mechanisms.
4. Implementation of software routines to provide real-time response to concurrently executing hardware modules.

Due to limited space, we touch only briefly on the modeling problem and implementation issues in order to describe our approach to obtaining feasible partitions. We consider an approach to hardware-software partitioning based on the distribution of unknown-delay operations in the system model. This is by no means the only way to obtain such a partition, but it allows us to delve into the problem and experiment with existing system designs. For more details on this subject and the related issues mentioned above, the reader is referred to [5]. Some potential applications of the techniques presented in this paper are:

Design of cost-effective systems: The overall cost of a system implementation can be reduced by the ability to use already available general-purpose re-programmable components while reducing the number of application-specific components.

Rapid prototyping of complex system designs: A complete hardware prototype of a complex system is often large. A feasible partition that shifts the non-performance-critical tasks to software programs can be used to quickly evaluate the design.

Speedup of hardware emulation software: During their development phase, many system designs are modeled and emulated in software for testing purposes. Such an emulation can be assisted by dedicated hardware components, which provides a speedup of the emulation time.

Rapid prototyping and hardware emulation are two opposite ends of the system synthesis objective. Rapid prototyping attempts to minimize the application-specific component to reduce design time, whereas hardware emulation attempts to maximize the application-specific component in order to realize maximum speed-up.

[Figure 2: Example of a System Graph Model. A program (a loop body with read, operation and write vertices) running on a microprocessor communicates through an interface buffer in memory with operations implemented in an ASIC.]

2 System Model

We model system behavior using a graph representation based on data-flow graphs. The system graph model, G(V, E), consists of a set of vertices V, representing operations, and a set of edges E, representing either data or sequencing dependencies between vertices.
Vertices correspond to operations, some of which may have data-dependent delays. Data-dependent and synchronization operations introduce uncertainty over the precise delay and order of operations in the system model and thus make its execution non-deterministic [6]. We refer to a vertex with data-dependent delay as a point of non-determinism in the system graph model. We discuss the property of non-determinism and its effects in Section 3.2. Overall, the system graph model consists of concurrent data-flow sections which are ordered by control flow. The data-flow sections preserve the parallelism, while control constructs like conditionals and loops obviate the need for a separate description of the system control flow. Control operations like loops and conditionals are specified as separate subgraphs through the use of hierarchy. Figure 2 shows an example of a graph model.

Timing constraints are of two types: (a) minimum/maximum timing separation between pairs of operations, and (b) rate constraints on system inputs and outputs. Timing constraints between operations are indicated by tagging the corresponding operations. The input (output) rate constraints refer to the rates at which data is required to be consumed (produced); that is, they are time constraints on multiple executions of the same input or output operation. The system model has no notion of how the actual data transfer takes place in the final system implementation; for example, it may be a synchronous or an asynchronous transfer, which may or may not be blocking. The final determination of the data-transfer protocol takes place when considering operation partitioning between hardware and software and the associated implementation overheads. The system graph model can be compiled from most hardware description languages. We use the HardwareC [7] language, which has a C-like syntax and supports timing and resource constraints. The input/output rate constraints are specified as additional attributes to the corresponding input/output statements.
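To make the model concrete, the following C sketch shows one possible in-memory representation of such a system graph; the type and field names are our own illustrative assumptions and are not part of HardwareC or of the authors' tools.

    #include <stddef.h>

    /* Kinds of operations (vertices) in the system graph model. */
    typedef enum {
        OP_DATAFLOW,   /* fixed, known delay                              */
        OP_COND,       /* conditional: internal point of non-determinism  */
        OP_LOOP,       /* data-dependent loop: internal non-determinism   */
        OP_IO_SYNC     /* external synchronization (I/O): external non-determinism */
    } op_kind;

    typedef struct vertex {
        op_kind kind;
        int delay;               /* known delay in cycles; -1 if data-dependent        */
        struct vertex *subgraph; /* body of a loop/conditional (hierarchy), or NULL    */
    } vertex;

    /* Edge: data or sequencing dependency, optionally tagged with a
     * min/max timing separation between its endpoints.               */
    typedef struct edge {
        vertex *from, *to;
        int min_sep, max_sep;    /* timing constraint in cycles; max_sep < 0 means "none" */
    } edge;

    /* Rate constraint attached to an input/output operation. */
    typedef struct rate_constraint {
        vertex *io_op;
        double rate;             /* required consumption/production rate, items per second */
    } rate_constraint;

    typedef struct graph {
        vertex *v; size_t nv;
        edge   *e; size_t ne;
        rate_constraint *r; size_t nr;
    } graph;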
3 Partitioning of a System Model

Any partition of the system model is also a system graph model and hence it may contain concurrent sets of operations. While such concurrency among operations is natural to any hardware implementation, the corresponding program implementations can execute operations only serially. Thus a hardware-software system may require multiple re-programmable components in order to preserve the concurrency inherent in (or externally imposed on) the system model. System designs with multiple processors are complicated by the need for coherent memory management among the various processors. In this paper, we make the simplifying assumption that the target system design consists of only one re-programmable component, and we develop our system synthesis approach based on the target architecture shown in Figure 3. If needed, the hardware component of the system design may be partitioned further based on area, pin-out and latency constraints using the homogeneous partitioning approaches described in [8]. The resulting hardware modules are connected to the system address and data busses. The memory used for program and data storage may be on-board the re-programmable component. However, the interface buffer memory needs to be accessible to the hardware modules directly. Because of the complexities associated with modeling hierarchical memory design, we consider the case where all memory accesses are to a single-level memory, i.e., outside the re-programmable component.

[Figure 3: Target System Architecture. A single re-programmable component and application-specific (ASIC) hardware components communicate over the system bus.]

3.1 Feasibility

A system partition into an application-specific and a re-programmable component is considered feasible when it implements the original specifications and satisfies the performance and interface constraints. We assume that hardware and software compilation, done using standard tools, preserves the functionality. We therefore concentrate on constraints. Constraints can be on the sizes of the hardware and software components, on the timing between operations, and on the data rates of system inputs and outputs. Satisfaction of timing constraints is complicated by the presence of non-determinism in the system model. In the presence of uniform rates of data transfer, we use in our approach the constraint analysis and synthesis technique of [9], which guarantees that either the synthesized circuit satisfies the min/max timing constraints for any value of the delay of the non-deterministic operations, or no solution exists. Since these techniques have already been described elsewhere [10], we refer the reader to the original papers.

As mentioned earlier, when partitioning the system model into hardware and software components, the data rates may not be uniform across the partition. The discrepancy in data rates arises because the application-specific hardware and the re-programmable component may be operated off different clocks, and because the system execution model supports multi-rate executions, which makes it possible to produce data faster than it can be consumed by the software component through a finite-sized buffer. In the presence of multi-rate data transfers, a hardware/software partition is feasible when, for all data transfers across the partition, the production and consumption data rates are compatible with a finite, size-constrained interface buffer. That is, for any data transfer across the partition, the data consumption rate must be at least as high as the data production rate. The size of the actual buffer needed may then be determined using the scheme proposed in [11]. In addition, the target architecture shown in Figure 3 contains a single system bus over which all data transfer to and from the re-programmable component takes place. Therefore, the net effect of all data transfers over this bus should not exceed the pre-specified system bus bandwidth. The available bus bandwidth is a function of the bus/processor clock rate and the memory latency. The hardware-software partitioning problem thus requires first finding a feasible partition. Among data-rate-feasible solutions, a cost function of overall hardware size, program and data storage cost, bus bandwidth and synchronization overhead is used to determine the quality of a solution.
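To make the data-rate feasibility test concrete, here is a minimal C sketch of the check described above; the structures and the bandwidth threshold are illustrative assumptions, not the authors' implementation.

    #include <stdbool.h>
    #include <stddef.h>

    /* One data transfer across the hardware/software partition. */
    typedef struct {
        double production_rate;   /* items per second produced by the source side  */
        double consumption_rate;  /* items per second consumed by the sink side    */
        double bytes_per_item;    /* size of each transferred item                 */
    } cut_edge;

    /* A partition is data-rate feasible when every cross-partition transfer can
     * be consumed at least as fast as it is produced (so a finite interface
     * buffer suffices), and the aggregate traffic fits within the system bus
     * bandwidth (bytes per second).                                             */
    bool partition_is_feasible(const cut_edge *cut, size_t n, double bus_bandwidth)
    {
        double total_traffic = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (cut[i].consumption_rate < cut[i].production_rate)
                return false;                    /* the buffer would grow without bound */
            total_traffic += cut[i].production_rate * cut[i].bytes_per_item;
        }
        return total_traffic <= bus_bandwidth;   /* single shared system bus */
    }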
3.2 Partitioning

We now consider approaches to system partitioning in order of increasing complexity of the system model. Let us first consider a system graph model with no unbounded-delay operations and with a non-multi-rate execution model. We then look for a partition of the system model driven by satisfaction of the imposed timing constraints. Consider an algorithm summarized as follows: starting with an initial solution with all operations in hardware, we select operations to move into the software component based on a cost criterion of communication overhead. Movement of operations to software requires a serialization of operations in accordance with the partial order imposed by the system model. With this serialization and an analysis of the corresponding assembly code for a given re-programmable processor, we derive delays through the software component. The movement of operations is then constrained by satisfaction of the imposed timing constraints. Such a partitioning algorithm strives to place a maximal number of operations in the software component.

Effect of non-determinism on system partitioning. Non-determinism in our system models is caused either by external synchronization operations or by internal data-dependent delay operations, like conditionals and data-dependent loops. In the presence of unbounded-delay operations, we can still apply the algorithm described above. Note that unbounded-delay operations cannot be subject to any maximum timing constraints. Therefore, we transfer all such operations into the software component and then identify deterministic-delay operations to move into the software component such that all timing constraints are satisfied. However, in systems with a multi-rate execution model, the data-dependent delay operations make it difficult to predict the actual data rates of production and consumption across partitions. Further, non-deterministic delays in the system model make it difficult to statically schedule operations in any implementation of the system design. When considering a mixed implementation of the system design, it is possible to use dynamic scheduling of operations in either or both the hardware and software components. Dynamic scheduling of operations in hardware or software incurs both area and time overheads that may sometimes render a hardware-software co-design solution difficult or even infeasible. On the other hand, the use of static scheduling requires a careful analysis of data-transfer rates across the hardware and software portions in order to make sure that the possible data rates can indeed be supported by the interface implementation.

Due to non-determinism in system models, the most general implementation of hardware and software components requires a control generation scheme that supports data-driven dynamic scheduling of the various operations. Since the software component is implemented on a processor that physically supports only a single thread of control, realization of concurrency in software entails both storage and execution-time overheads. On the other hand, in the absence of any point of non-determinism in the software, all the operations in software can be scheduled statically. However, such a software model may be too restrictive, since it requires the control flow to be entirely in hardware. In our model of software implementation, we take an intermediate approach to the scheduling of operations, as described below. First, we make the following assumptions about the implementation model:

- The system has an application-specific hardware component that handles all external synchronization operations (external points of non-determinism).
- All data-dependent delay operations (internal points of non-determinism) are implemented by software fragments running on the re-programmable component.
- For simplicity of explanation within the scope of this paper, we assume that the system model contains no nested unbounded-delay operations, such as a nested loop. However, the software synthesis technique described here can be expanded to include nested unbounded operations [5].
The software component is thought of as a set of concurrently executing routines, called threads. A thread consists of a linearly ordered set of operations. The serialization of the operations is imposed by the control flow in the corresponding graph model. Concurrent sets of operations are implemented as separate threads in order to preserve the concurrency specified in the system graph model. All threads begin with a point of non-determinism, and as such they are scheduled dynamically. However, within each thread of execution all operations are statically scheduled. As an example, a data-dependent loop in software is implemented as a single thread with a data-dependent repeat count. In this way, we take an intermediate approach between dynamic and static scheduling of software operations. Instead of scheduling every operation dynamically, we create statically known deterministic threads of execution which are scheduled in a cyclo-static manner depending upon the availability of data. Thus, an individual operation in software has a fixed schedule in its thread; however, the time at which, and the number of times, the thread is invoked is data-driven. Therefore, for a given re-programmable processor, the latency, λ_i, of each thread is known statically. For a given data-input operation in a thread i with latency λ_i, the data consumption rate ρ_i is bounded as

    1/λ_max ≤ ρ_i ≤ 1/λ_i

where λ_max refers to the latency of the longest thread. It is assumed that the latency includes any synchronization overhead that may be required to implement multiple threads of execution on a single-thread re-programmable processor. The lower bound on ρ_i is obtained by implementing a software scheduling scheme that reschedules a repeating thread for execution at the end of every iteration.
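As a brief numerical illustration of this bound (the latencies used here are hypothetical, not measured values from the paper): if thread i has a latency of λ_i = 20 cycles, the longest thread has λ_max = 100 cycles, and the processor runs at 10 MHz (100 ns per cycle), then

    1/(100 × 100 ns) ≤ ρ_i ≤ 1/(20 × 100 ns),

i.e., the data-input operation in thread i consumes between 0.1 and 0.5 million items per second.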
The system partitioning across hardware and software components is performed by decoupling the external and internal points of non-determinism in the system model. It is assumed that, for all external points of non-determinism, the corresponding data rates are externally specified. Thus, through this decoupling we are able to determine the data rates for all the inputs to the re-programmable component. The production data rates of the re-programmable component are determined by the software synchronization scheme used. We consider the issue of software implementation in Section 4.2.

    Input:  system graph model, G = (V, E)
    Output: partitioned system graph model, V = VH ∪ VS

    partition(V):
      Vn = V - Vd                      /* identify points of non-determinism */
      Vn = Vne ∪ Vni                   /* external vs internal non-determinism */
      VH = Vd ∪ Vne                    /* the initial hardware component */
      VS = Vni                         /* the initial software component */
      create software threads(Vni)     /* create |Vni| routines */
      compute data rates(processor)
      if not(feasible(VH, VS)) exit    /* no feasible solution exists */
      fmin = f(VH, VS)                 /* initialize cost function */
      repeat
        foreach vdi ∈ VH               /* select a deterministic delay operation */
          move(vdi)                    /* recursively move operations to sw */
      until no improvement in cost function
      return (VH, VS)

    move(vdi):                         /* considers a vertex for move from VH to VS */
      if feasible(VH - vdi, VS + vdi)
        if f(VH - vdi, VS + vdi) < fmin
          VH = VH - vdi                /* move this operation to sw */
          VS = VS + vdi
          fmin = f(VH, VS)
          update software threads
          update data rates(processor)
          foreach vdj ∈ succ(vdi) ∩ VH /* identify successors for move */
            move(vdj)
      return

Figure 4: Partitioning Algorithm

The pseudo-code in Figure 4 outlines the partitioning algorithm. A system graph model, G = (V, E), is created by compiling the HardwareC description. From externally specified data rates we compute data rates for the data-flow edges in the system graph model. The vertex set V consists of two sets of vertices, V = {Vd, Vn}, where Vd denotes the set of operations whose delay is bounded and known at compile time, and Vn refers to the non-deterministic delay vertices. With a data-rate-annotated system model as input, we first separate its points of non-determinism, Vn, into two groups: Vne, those caused by external input/output operations, and Vni, those caused by internal data-dependent operations. The external points of non-determinism, Vne, are assigned solely to the hardware, while the internal points of non-determinism, Vni, are assigned solely to the software component. With this initial partition we determine the feasibility of data transfers across the partition. If this initial partition is not feasible, then the algorithm fails, since no feasible partition exists under the proposed hardware-software interface and software implementation scheme. Alternative approaches to system partitioning when partitioning of the non-deterministic operations fails are discussed in [5]. If the initial partition is feasible, then it is refined by migrating operations from hardware to software (i.e., moving vertices from VH to VS) in the search for a lower-cost feasible partition. Associated with each internal point of non-determinism (e.g., data-dependent loop bodies, conditional case bodies, etc.) we create a program fragment, or thread of execution. Each thread of execution corresponds to a software routine created as C code from the HardwareC description. For the various threads of execution in the software component, we derive latency and static storage measures by analyzing the corresponding assembly code. The assembly code is obtained by compiling the corresponding C descriptions. We have considered to date two off-the-shelf components, the R3000 and the 8086, and used existing compilers to evaluate the performance of the corresponding implementation. The algorithm presented in Figure 4 uses a cost function, f = f(size(VH), size(VS), Synch_cost(VH, VS), interface data rates), that is a weighted sum of its arguments. The algorithm uses a greedy approach to the selection of vertices for movement into VS. There is no backtracking, since a vertex moved into VS stays in that set throughout the rest of the algorithm. Therefore, the resulting partition is a local optimum with respect to single-vertex moves. The overall complexity of the algorithm is quadratic in the number of vertices. A partition of the system model is indicated by tagging its vertices as either hardware or software. Individual hardware and software components and interface circuitry are created from a partitioned model as described in the following section.
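The following C sketch shows the shape of this greedy refinement loop with a weighted-sum cost function; the weights, helper functions (declared but not defined here) and data structures are illustrative assumptions rather than the authors' Vulcan-II code. Unlike Figure 4, the sketch sweeps over all hardware vertices instead of recursing on the successors of an accepted move; the intent is only to show the accept/revert structure around the feasibility test and the cost function.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical partition state: in_sw[v] is true when vertex v is in VS. */
    typedef struct {
        bool  *in_sw;
        size_t nv;
    } partition;

    /* Assumed helpers, provided elsewhere: feasibility test and cost terms. */
    bool   feasible(const partition *p);
    double hw_size(const partition *p);
    double sw_size(const partition *p);
    double synch_cost(const partition *p);
    double interface_rate_cost(const partition *p);

    /* Weighted-sum cost function f, as described in the text; the weights are
     * illustrative and would be tuned for a given design.                    */
    static double cost(const partition *p)
    {
        return 1.0 * hw_size(p) + 0.5 * sw_size(p)
             + 2.0 * synch_cost(p) + 1.0 * interface_rate_cost(p);
    }

    /* Greedy refinement: try to move each deterministic-delay vertex from VH
     * to VS; keep the move only if the result stays feasible and lowers the
     * cost. There is no backtracking, so the result is a local optimum with
     * respect to single-vertex moves.                                        */
    void refine(partition *p, const bool *is_deterministic)
    {
        double fmin = cost(p);
        bool improved = true;
        while (improved) {
            improved = false;
            for (size_t v = 0; v < p->nv; v++) {
                if (p->in_sw[v] || !is_deterministic[v])
                    continue;                /* only bounded-delay vertices still in hardware */
                p->in_sw[v] = true;          /* tentatively move v to software */
                if (feasible(p) && cost(p) < fmin) {
                    fmin = cost(p);          /* accept: keep v in software */
                    improved = true;
                } else {
                    p->in_sw[v] = false;     /* reject: move v back to hardware */
                }
            }
        }
    }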
4 Implementation of System Components

Figure 5 shows the methodology for generating the individual hardware and software components starting from a functional description in HardwareC.

[Figure 5: Generation of Hardware and Software Components. The HardwareC description is compiled by Hercules (behavioral optimizations) into the system graph model; hardware generation through Hebe and Ceres (structural synthesis) produces the ASIC hardware, while software generation produces a C program that is compiled to machine-level code, together yielding a mixed system implementation.]

The system synthesis is performed by a program called Vulcan-II, which invokes the appropriate hardware and software synthesis tools to generate the final system design. Hardware synthesis is done using the program Hebe [1], and the software component is generated using an available C compiler for the target processor. Synthesis of the interface logic may also be obtained using techniques indicated in [12] [13].

4.1 Interface

As mentioned earlier, the hardware-software interface depends upon the data transfer requirements imposed on the system model. In the case of known data rates, where (non-blocking) synchronous data transfers are possible, the interface contains an interface buffer memory for data transfer. A different policy-of-use for the interface buffer is adopted when transferring data versus control information across the partition. Therefore, the interface buffer consists of two parts: a data-transfer buffer and a control-transfer buffer (Figure 6). The data-transfer buffer uses an associative memory with statically determined tags, while the control-transfer buffer uses a FIFO policy-of-use in order to dynamically schedule multiple threads of execution in the software. Associated with each data transfer we assign a unique tag consisting of two parts: a software thread id and the specific data-transfer id. Since all the threads and all input/output operations are known, the tags are determined statically. In addition, the data buffer contains a request flag (RQ bit) associated with each tag to facilitate demand scheduling of the various threads in software.

[Figure 6: Hardware and Software Interface Architecture. The interface buffer policy-of-use comprises a direct-mapped buffer for data transfer (TAG, RQ, DATA fields), where (1) tags are determined statically, (2) the RQ bit is used for demand scheduling of software, and (3) the MW/DW ratio supports multiple hardware executions; and a FIFO buffer for dynamic control flow, where control-flow modifications arrive from a memory read, an interrupt, or a dedicated input port.]

Figure 7 explains the modus operandi of data transfer across a hardware/software partition. In the software, a thread of execution is in the compute state as long as it has all the available data [Figure 7(a)]. In case of a miss on a data item, the corresponding RQ bit is raised and the thread is detached [Figure 7(c)]. The processor then selects a new thread of execution from the control FIFO [Figure 7(b)]. On data arrival at the interface buffer, if the corresponding RQ bit is on, its tag is put into the control FIFO [Figure 7(c)].

[Figure 7: Hardware and Software Interface Architecture: (a) the thread state machine (compute, wait on a data miss, ready on a hit), (b) the control FIFO from which the next thread pointer is dequeued, and (c) the direct-mapped data buffer in which a miss sets RQ(i) and a subsequent data arrival enqueues thread i; here i refers to the routine associated with point of non-determinism i.]
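The following C sketch illustrates this demand-scheduling handshake between the data-transfer buffer (tags with RQ bits) and the control FIFO; the types, sizes and function names are ours, intended only to mirror the behavior described above (no overflow checking is shown).

    #include <stdbool.h>
    #include <stdint.h>

    #define N_TAGS 32                 /* statically determined tags (thread id, transfer id) */
    #define FIFO_DEPTH 16

    /* Direct-mapped data-transfer buffer entry. */
    typedef struct {
        uint32_t data;
        bool     valid;               /* data present for this tag            */
        bool     rq;                  /* RQ bit: a software thread is waiting */
    } dm_entry;

    static dm_entry dm_buffer[N_TAGS];

    /* Control-transfer FIFO holding tags of threads that became runnable. */
    static uint8_t fifo[FIFO_DEPTH];
    static unsigned head, tail;

    static void fifo_enqueue(uint8_t tag) { fifo[tail++ % FIFO_DEPTH] = tag; }
    static uint8_t fifo_dequeue(void)     { return fifo[head++ % FIFO_DEPTH]; }

    /* Software side: a thread asks for a data item. On a hit it keeps
     * computing; on a miss it raises the RQ bit and detaches (returns false)
     * so the processor can pick the next runnable thread from the FIFO.     */
    bool read_or_detach(uint8_t tag, uint32_t *out)
    {
        if (dm_buffer[tag].valid) {
            *out = dm_buffer[tag].data;
            dm_buffer[tag].valid = false;
            return true;                  /* hit: stay in the compute state */
        }
        dm_buffer[tag].rq = true;         /* miss: request the data ...     */
        return false;                     /* ... and detach this thread     */
    }

    /* Hardware side: data arrives from the ASIC for a given tag. If the RQ
     * bit is set, the tag is enqueued so the waiting thread is rescheduled. */
    void data_arrival(uint8_t tag, uint32_t value)
    {
        dm_buffer[tag].data  = value;
        dm_buffer[tag].valid = true;
        if (dm_buffer[tag].rq) {
            dm_buffer[tag].rq = false;
            fifo_enqueue(tag);            /* demand-schedule the waiting thread */
        }
    }

    /* Processor main-loop step: run the next runnable thread from the FIFO. */
    void scheduler_step(void (*run_thread)(uint8_t tag))
    {
        if (head != tail)
            run_thread(fifo_dequeue());
    }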
4.2 Hardware and Software Components

Application-specific hardware synthesis under resource and timing constraints has been addressed in detail elsewhere [10]. Therefore, in this section we focus on the problem of synthesizing the software component of the system design. As mentioned earlier, we essentially have a set of program fragments, each initiated by a point of non-determinism. The problem of concurrent multi-thread implementation is well known [14]. In general, multiple program threads may be implemented as subroutines operating under a global task scheduler. However, subroutine calling adds overheads, which can be reduced by putting all the program fragments at the same level of execution. Such an alternative is provided by implementing the different threads as coroutines [15]. In this case, the routines maintain a co-operative rather than a hierarchical relationship by keeping all individual data as local storage. The coroutines maintain a local state and willingly relinquish control of the processor at exception conditions, which may be caused by unavailability of data or by an interrupt. The code for a coroutine-based scheduler comes to 34 instructions taking about 100 bytes in an assembly language that provides register-memory operands (like the 8086). The coroutine switch takes 364 cycles when implemented for the 8086 processor. By contrast, an implementation of a global task scheduler using subroutines takes 728 clock cycles on the 8086 processor [5].

Since the processor is completely dedicated to the implementation of the system model and all software tasks are known statically, we can use a simpler and more relevant scheme to implement the software component. In this approach, we merge the different routines and describe all operations in a single routine using a method of description by cases [16]. This scheme is simpler than the coroutine scheme presented above. Here we construct a single program which has a unique state assignment for each point of non-determinism. A global state register is used to store the state of execution of a thread. Transitions between states are determined by the requirement on interrupt latency for blocking transfers and by the scheduling of the different points of non-determinism based on the data received. This method is restrictive, since it precludes the use of nested routines and requires the description to be a single switch statement, which, in the case of particularly large software descriptions, may be too cumbersome. The overhead due to state save and restore amounts to 85 clock cycles for every point of non-determinism when implemented on an 8086 processor. Consequently, this scheme entails smaller overheads than the general coroutine scheme described earlier.
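A minimal C sketch of this "description by cases" scheme is shown below; the thread bodies, state names and tag values are hypothetical, the global state register is modeled as a plain variable, and the helpers read_or_detach and next_ready_tag are assumed to be provided by the interface layer sketched earlier.

    #include <stdbool.h>
    #include <stdint.h>

    /* One state per point of non-determinism (hypothetical thread names). */
    typedef enum { S_DMA_RCVD, S_DMA_XMIT, S_EXEC_UNIT } state_t;

    static state_t state;            /* global state register                 */

    /* Assumed interface-buffer helpers: a read returns true on a hit and
     * false on a miss (a miss raises the RQ bit, as sketched above).        */
    bool read_or_detach(uint8_t tag, uint32_t *out);
    uint8_t next_ready_tag(void);    /* tag dequeued from the control FIFO    */

    /* The entire software component as a single routine: each case is the
     * statically scheduled body of one thread; on a data miss the case
     * simply returns, leaving its identity in the global state register.    */
    void software_component_step(void)
    {
        uint32_t d;
        switch (state) {
        case S_DMA_RCVD:
            if (!read_or_detach(/*tag=*/0, &d)) return;   /* detach on miss */
            /* ... statically scheduled operations of the DMA-rcvd thread ... */
            break;
        case S_DMA_XMIT:
            if (!read_or_detach(/*tag=*/1, &d)) return;
            /* ... statically scheduled operations of the DMA-xmit thread ... */
            break;
        case S_EXEC_UNIT:
            if (!read_or_detach(/*tag=*/2, &d)) return;
            /* ... execution-unit operations ... */
            break;
        }
        state = (state_t)next_ready_tag();   /* pick the next thread to run */
    }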
5 Example: Network Coprocessor

As an example of hardware-software system implementation, we describe the implementation of a network coprocessor for communication via an Ethernet link. The coprocessor manages the processes of transmitting and receiving data frames over a network under the CSMA/CD protocol. The purpose of this coprocessor is to off-load the host CPU from managing communication activities. The coprocessor provides the following functions: data framing and de-framing, network/link operation, address sensing, error detection, data encoding, and memory access. In addition, the coprocessor provides a repertoire of eight instructions that let the host CPU program the machine for specific operations (for example, transmit some data from memory). For details on the coprocessor operation, the reader is referred to [5]. The network coprocessor block diagram shown in Figure 8 is modeled on the target architecture described earlier.

[Figure 8: Network Coprocessor Block Diagram — transmit, receive and execution units (including the DMA-xmit and DMA-rcvd blocks) connected to the host CPU and memory over the system bus.]

The important rate and timing constraints on the coprocessor design are: the maximum input/output bit rate is 10 Mb/s; the maximum propagation delay is 46.4 µs; the maximum jam time is 4.8 µs; and the minimum interframe spacing is 67.2 µs. The Ethernet coprocessor is modularly described as a set of 13 concurrently executing processes which interact with each other by means of 24 send and 40 receive operations. The total HardwareC description consists of 1036 lines of code. A mixed implementation following the approach outlined in Section 3 was attempted by decoupling the points of non-determinism in the system model. Table 1 shows the results of synthesis of the application-specific hardware component of the system implementation, synthesized in the Olympus Synthesis System and mapped using the LSI Logic 10K library of gates. The software component is implemented in a single program containing case switches corresponding to 17 synchronization points, i.e., internal points of non-determinism, as described in Section 4.2. With reference to Figure 8, the software component consists of the execution unit and portions of the DMA rcvd and DMA xmit blocks. The reception and transmission of data on the Ethernet line is handled by the application-specific hardware running at 20 MHz. The total interface buffer cost is 314 bits of memory elements. Table 2 lists statistics on the code generated by a commercial compiler for the Ethernet software component implementation.

Table 1: Network Coprocessor Application-Specific Hardware Component (LSI LCA10K gates)

  Unit                  Process        Area   Delay
  Transmission Unit     xmit bit        271   14.31 ns
                        xmit frame     3183   37.15 ns
                        DMA xmit       2560   45.06 ns
  Reception Unit        DMA rcvd        400   27.51 ns
                        rcvd bit        282   12.30 ns
                        rcvd buffer     127   22.09 ns
                        rcvd frame     1571   38.12 ns
  Coprocessor (total)                  8394   45.06 ns

Table 2: Network Coprocessor Software Component

  Target Processor    Pgm & Data Size    Max Delay
  R3000, 10 MHz       8572 bytes         56 cycles, 5.6 µs
  8086, 10 MHz        1295 bytes         115 cycles, 11.5 µs

By contrast, a purely hardware implementation of the network coprocessor requires 10882 gates (using the LSI 10K library). With a maximum limit of 10000 gates on a single chip, a pure hardware implementation would require two application-specific chips. Thus, through the use of system partitioning into hardware and software components, we are able to achieve 20 MHz coprocessor operation while decreasing the dedicated hardware cost to only one application-specific chip (or by 23% in terms of gate count). The reprogrammability of the software component makes it possible to increase the coprocessor functionality, for example by adding self-test and diagnostic features, with little or no increase in the dedicated hardware required.
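As a quick check of the quoted saving (our arithmetic, using the figures above): the dedicated hardware drops from 10882 gates to the 8394 gates of Table 1, i.e., (10882 − 8394)/10882 ≈ 0.229, which rounds to the 23% reduction cited.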
6 Conclusion

We have presented a scheme for performing constraint-driven system-level partitioning into hardware and software components using a system model that supports non-deterministic delay operations and timing constraints. The partitioning algorithm is driven by the satisfaction of data-rate constraints. A feasible solution to the data-rate constraints is obtained by identification and separation of the internal and external points of non-determinism in the system model. Using the partitioning approach presented, we were able to partition the design of an Ethernet-based network coprocessor into feasible hardware and software components. The mixed implementation requires 23% less dedicated hardware than a purely application-specific implementation. More importantly, the reprogrammability of the software component makes it possible to extend the coprocessor functionality without the need for additional application-specific hardware modules.

Currently we do not consider memory hierarchy in our model of system design. Most modern processors come with a certain amount of on-chip cache memory that can be used to speed up the response time of the software component. However, this is not an inherent limitation of our approach, and future extensions include modeling the effect of cache misses on software performance. The topic of system synthesis using hardware-software partitioning is exploratory in nature because of its novelty. The main limitation of the technique presented here is that a feasible partition may not exist for some system designs. In addition, the assumptions on the hardware and software implementation models and on the interface scheme influence the partition. As a result, the partitioning results may not generalize to all system designs but may be specific to the assumptions made, for example to the type of re-programmable processor being considered.

7 Acknowledgements

The authors would like to thank Claudionor Coelho, Jr. for helpful discussions. This research was sponsored by NSF-ARPA, under grant No. MIP 8719546, by DEC jointly with NSF under a PYI Award program, and by a fellowship provided by Philips/Signetics. We also acknowledge support from ARPA, under contract No. J-FBI-89-101.

References

[1] G. De Micheli, D. C. Ku, F. Mailhot, and T. Truong, "The Olympus Synthesis System for Digital Design," IEEE Design and Test Magazine, pp. 37–53, Oct. 1990.
[2] J. Rabaey, H. De Man, et al., "Cathedral II: A Synthesis System for Multiprocessor DSP Systems," in Silicon Compilation, D. Gajski, ed., pp. 311–360. Addison-Wesley, 1988.
[3] D. Thomas, E. Lagnese, R. Walker, J. Nestor, J. Rajan, and R. Blackburn, Algorithmic and Register-Transfer Level Synthesis: The System Architect's Workbench. Kluwer Academic Publishers, 1990.
[4] R. Camposano and W. Rosenstiel, "Synthesizing Circuits from Behavioral Descriptions," IEEE Transactions on CAD/ICAS, vol. 8, no. 2, pp. 171–180, Feb. 1989.
[5] R. K. Gupta and G. De Micheli, "System Synthesis via Hardware-Software Co-design," CSL Technical Report, Stanford University, 1992.
[6] D. Bustard, J. Elder, and J. Welsh, Concurrent Program Structures, p. 3. Prentice Hall, 1988.
[7] D. Ku and G. De Micheli, "HardwareC – A Language for Hardware Design (version 2.0)," CSL Technical Report CSL-TR-90-419, Stanford University, Apr. 1990.
[8] R. Gupta and G. De Micheli, "Partitioning of Functional Models of Synchronous Digital Systems," in Proceedings of the International Conference on Computer-Aided Design, Santa Clara, pp. 216–219, Nov. 1990.
[9] D. Ku and G. De Micheli, "Relative Scheduling under Timing Constraints," in Proceedings of the 27th Design Automation Conference, Orlando, June 1990.
[10] D. Ku and G. De Micheli, "Optimal Synthesis of Control Logic from Behavioral Specifications," VLSI Integration Journal, vol. 3, no. 10, pp. 271–298, Feb. 1990.
[11] T. Amon and G. Borriello, "Sizing Synchronization Queues: A Case Study in Higher Level Synthesis," in Proceedings of the 28th Design Automation Conference, June 1991.
[12] T. H. Meng, Synchronization Design for Digital Systems, ch. "Synthesis of Self-Timed Circuits," pp. 23–63. Kluwer Academic Publishers, 1991.
[13] G. Borriello and R. Katz, "Synthesis and Optimization of Interface Transducer Logic," in Proceedings of the International Conference on Computer-Aided Design, Nov. 1987.
[14] G. R. Andrews and F. Schneider, "Concepts and Notations for Concurrent Programming," ACM Computing Surveys, vol. 15, no. 1, pp. 3–44, Mar. 1983.
[15] M. E. Conway, "Design of a Separate Transition-Diagram Compiler," Communications of the ACM, vol. 6, pp. 396–408, 1963.
[16] P. J. H. King, "Decision Tables," The Computer Journal, vol. 10, no. 2, Aug. 1967.