
Fault simulation in a distributed environment

Proceedings of the 25th ACM/IEEE Design Automation Conference, 1988.

Abstract

Fault simulation of VLSI circuits takes considerable computing resources and there have been significant efforts to speed up the fault simulation process. This paper describes a distributed fault simulator implemented on a loosely-coupled network of general purpose computers. The techniques used result in a close to linear speedup and can be used effectively in most industrial VLSI CAD environments.

* This work was supported by the Semiconductor Research Corporation under Contract 87-DP-109.

Patrick A. Duba, Rabindra K. Roy, Jacob A. Abraham, William A. Rogers

Computer Systems Group, Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801
and
Department of Electrical & Computer Engineering, University of Texas at Austin, Austin, TX 78712

I. Introduction

Test generation is an important phase in the successful design and validation of a circuit. It has been determined that test generation is an NP-complete problem [1,2]; thus, the computational requirements grow exponentially as the circuit size increases. The problem is much worse for sequential circuits, where the search space must also consider one more dimension, that of time. Efficient automatic test pattern generators for sequential circuits are not yet available, so the burden is laid on the test engineers, who have to generate test sequences from their experience. Fault simulation involves the process of applying such test patterns to a circuit model in order to validate their effectiveness. For a given fault model, the quality of the test set is expressed as the fault coverage, which is the percentage of faults detected out of all the possible faults. Fault simulators are thus indispensable tools in the test generation process.

Considering the fact that a significant portion of the cost of producing an IC is due to testing, it is essential to perform the fault simulation in as short a time as possible, to reduce both the design-test-redesign cycle and the time the produced chip spends on the test bed. In its first form, fault simulation was performed serially, injecting one fault at a time into the circuit. The inadequacy of this approach became evident with the development of MSI and LSI circuits, which prompted researchers to find new methods of fault simulation. At present, the three most prevailing techniques used for fault simulation are parallel, deductive and concurrent [3]. For simulating very large circuits, concurrent fault simulation is considered to be the best approach [4]. The costs of parallel and deductive fault simulation are O(G^3) and at least O(G^2), respectively [5], where G is the number of gates in the circuit. A formal model shows that the cost of concurrent fault simulation is worse than linear for realistic circuits [4]; based on observations, concurrent fault simulation seems to behave as an O(G^2) algorithm [6]. As a result, when these algorithms are implemented on a general purpose processor, they take a prohibitive amount of time for current VLSI circuits. This has led to research into making the fault simulation process faster. There are three ways to achieve this: by making a simpler model, which suffers from lack of accuracy; by using a more efficient algorithm, such as one which exploits hierarchy in the circuit [7]; or by using special purpose hardware accelerators, e.g. IBM's Yorktown Simulation Engine (YSE) [8], NEC's Hardware Accelerator (HAL) [9], Zycad, Silicon Solutions, Daisy, etc. [10,11].

The hardware accelerators no doubt provide the fastest simulation, but they have some severe drawbacks. They are very expensive in initial cost as well as maintenance, they are very inflexible because of their fixed architectures, and it is not possible to incorporate new features into them. Moreover, they have fixed fault libraries, which are applicable to only one or a few technologies, and any major change in the prevailing technology may lead to their obsolescence. Hence, there is an interest in software simulators, which are flexible and can be run on general purpose computers.

An alternative to hardware accelerators is the use of distributed and parallel processing for fault simulation. The distributed processing approach is applicable to general purpose processors, and the implementation can be made flexible enough to incorporate changes in technology. In almost all cases, distributed processing is less expensive when compared to hardware accelerators. In the late 1980s, there has been a strong trend towards networking. Almost all the IC design houses have a network of engineering workstations. Most of the workstations in a network are idle during nonwork hours. If all these workstations are used as a distributed processing system to work on the same problem, it is possible to solve problems that are too complex for most mainframes. This led to the investigation and development of the distributed fault simulator described in this paper.

Implementing an algorithm in a distributed environment is not a simple task. The problem domain must be decomposed among the processors [12]; this is also known as task partitioning. Stated in simple words, task partitioning is nothing but breaking the original problem into several smaller subproblems. Another important issue is task allocation, i.e., mapping the subproblems to the processors. If task partitioning and task allocation are not performed properly, an increase in the number of processors in a system may actually result in a decrease of total throughput [13].
This decrease can occur if a key factor called interprocessor communication (IPC), which consists of all messages exchanged among processors and shared memories, is not taken into consideration [14]. This communication overhead, which is absent in the analyses of sequential algorithms, can be a bottleneck if not handled efficiently. Granularity of computation (the smallest size of a task given to a processor) should be based on the parameters of a particular system. If the granularity is chosen to be small, then the subproblems can be divided evenly among the processors. This ensures better utilization of the processors, but the amount of interprocessor communication becomes large. Therefore, if communication is the bottleneck of a system, granularity should be large. It is very difficult to predict the exact granularity size which will yield the best performance for a given problem and distributed system. However, some broad ranges, based on computation and communication requirements for each granule size, can be specified within which the algorithm is expected to behave reasonably.

(Patrick A. Duba is currently with Hewlett-Packard EDD, Colorado Springs, Colorado.)

Fault simulation algorithms, which were developed for sequential processors, are not suitable for distributed implementation in their original forms. One of the reasons is the difficulty in properly partitioning the given task into subtasks. However, novel techniques developed recently provide a way to overcome this difficulty [4]. These techniques automatically partition the problem into several smaller and somewhat similar sized subproblems using information about the circuit hierarchy.
The partitioning is done in such a way that each subproblem gets a unique subset of the set of all the possible faults in the circuit. Thus the subsets of faults associated with different subproblems are mutually disjoint. As a result of this, the task allocation problem is greatly simplified.

This paper describes the implementation of a distributed fault simulator running on a loosely coupled network of general purpose processors connected via a bus. A large number of simulations were run on circuits of varying sizes and total number of faults. For a given problem, there were two parameters to be decided upon: the partition size (how many faults are injected into each partition) and the number of processors used. When compared to a serial run, many situations yielded a near linear speedup. As the circuit size, and hence the total number of faults, increased, the speedup figures were found to improve.

The paper is divided into five sections. Section I gives the introduction to the problem. Section II describes the method of hierarchical fault simulation, a hierarchical fault simulator (CHIEFS), and hierarchical fault partitioning. The distributed algorithm and implementation are described in Section III. Section IV reports the results and explains various anomalies in the performance graphs, and Section V concludes the paper.

II. Hierarchical fault simulation

The computationally intensive task in fault simulation is the evaluation of logic elements. Fault simulation speed can be increased by reducing the number of evaluations or by increasing the speed of element evaluation. If the evaluation rate is increased, the simulation will go faster, but the computational complexity of the algorithm remains unchanged. Since the complexity of fault simulation is worse than linear [4], the computational requirements for circuit evaluations will increase faster than the corresponding increase in circuit complexity.
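To make the cost growth concrete, the worse-than-linear behaviour can be sketched numerically. This is our illustration, not the paper's; it simply assumes runtime scales as G^2, the empirical behaviour cited above for concurrent fault simulation [6]:

```python
# Illustrative sketch only (not from the paper): if fault-simulation
# runtime behaves as roughly O(G^2) in the gate count G, runtime grows
# much faster than the circuit itself.
def relative_runtime(gates, base_gates=1000):
    """Runtime relative to a base_gates circuit under a quadratic model."""
    return (gates / base_gates) ** 2

for g in (1000, 2000, 10000, 100000):
    print(g, relative_runtime(g))
```

Under this assumption, doubling the gate count quadruples the simulation work, and a hundred-fold larger circuit costs ten-thousand-fold more time; this is why reducing the number of evaluations, rather than merely speeding up each evaluation, is the attractive lever.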
Reducing the number of evaluations decreases the complexity of the algorithm and provides increased performance from existing hardware. An event driven simulation performs evaluations only when signals change, so there is no scope for reduction in the number of evaluations in the event driven algorithm. Hence the only other alternative is to reduce the number of primitives in the circuit by altering the circuit representation. Hierarchy provides the framework to make these changes, and hierarchical design practices provide the information required to construct a hierarchical circuit representation. The fault simulator, CHIEFS (Concurrent HIerarchical and Extensible Fault Simulator), has demonstrated that these changes are feasible and has provided dramatic simulation speedup under appropriate conditions [4,7].

A. Description of the CHIEFS System

The CHIEFS system consists of several C programs, two parsers, a fault simulator, and a wrapper. These programs contain about 20,000 lines of code, which were developed under UNIX. The parsers were written with LEX [15] and YACC [16] to parse the fault and circuit description languages. These parsers produce a common data structure to be used by the fault simulator and the wrapper. The circuit description languages currently accepted are SCALD [17,18] and TDL [19]. The fault library language was developed in-house. Figure 1 shows the relationship among various parts of the simulator.

[Figure 1. CHIEFS Simulation System Diagram]

B. Hierarchical Fault Partitioning

CHIEFS features a hierarchical fault partitioning technique which increases simulator performance by reducing the total amount of computation required to perform the simulation. This technique is based on the hierarchical circuit description and (redundant) high level functional definitions for the higher level macromodules in the circuit description. Initially, the circuit description is automatically partitioned, using hierarchical information, into subcircuits, each of which contains a small subset of all the possible faults, as shown in Figure 2. During simulation, the circuit description is reconfigured so that the current fault partition is simulated at the primitive level and all other (fault-free) parts of the circuit are simulated at the highest possible levels. This is achieved by substituting the high-level functional definition for subcircuits wherever possible in the fault-free areas. Fault simulation is performed and the circuit is reconfigured; then the next partition is faulted and the previous partition is now fault-free. This reconfiguration process is repeated until all partitions have been fault simulated. A performance model shows that the total time for simulating many passes is indeed much less than simulating everything in one pass for very large circuits [4]. It is instructive to note that this technique partitions the given problem into subproblems that do not require information from each other during evaluation. If, in a distributed environment, each processor simulates a disjoint set of fault partitions, many fault partitions can be evaluated in parallel. This was the key idea behind the distributed implementation of this fault simulator.

[Figure 2. Hierarchical Fault Partitioning]

C. Partitioning Algorithm

The hierarchical partitioning algorithm creates a series of circuit representations (partitions), each of which is a complete representation of the circuit. Each partition has a unique partition number associated with it and considers a unique set of modules for fault injection. In other words, each partition has a unique set of faults for simulation. The partitioning algorithm uses a user-specified number of faults (partition size) as a guide in creating a partition. This recommended size has a -25% to +50% tolerance to allow the algorithm to partition along the higher-level module boundaries whenever possible, which helps in minimizing the number of modules in the partition. Fault behavior is modeled only in the simulator primitives, so each partition must contain a portion of the circuit which is represented at the primitive level. The remaining part of the circuit is represented functionally (where possible) by a higher-level functional representation. A configuration is created by marking the nodes of a tree expansion of the circuit hierarchy. This tree data structure is also used to record the faults that have been simulated, the faults that are in the current partition, the faults that have been detected, and the faults that remain undetected. Each module in the circuit hierarchy also contains the sum of the number of faults in all of its lower-level constituent modules, or in itself if the module is a primitive. This number is used in the partitioning algorithm to decide if a module is an acceptable object for partitioning or if the module contains too many faults and therefore must be broken into its constituent lower-level modules.

The partitioning algorithm starts at the highest level definition in the circuit hierarchy and searches the tree data structure from left to right to find modules which have not been incorporated into a partition (see Figure 2). The number of faults in the first available module is checked against the recommended partition size. If it is within the tolerances, the module, all its descendents, and all its direct ancestors are marked as fault enabled in the current partitioning.
All siblings of the module and all siblings of the direct ancestors of the module may be simulated with functional models if they exist. These modules are checked for the presence of a functional model and, if present, the module is marked for functional evaluation. If a functional model does not exist, then the descendents of the module are checked. This procedure continues until a complete circuit representation consisting of fault-enabled and functional modules is created. If the number of faults in the module under consideration is too large, the partitioning algorithm descends into the descendents of the module and continues. If the number of faults was too small, the module is marked as fault-enabled and the next sibling is considered for incorporation into the current partition. Subsequent siblings are incorporated until the minimum number of faults have been enabled for the current partition. Once the simulation for a partition has finished, all the fault-enabled modules are marked as previously simulated to prevent them from being incorporated in another partition. The simulation is complete when all the modules have been marked as previously simulated.

III. Distributed Hierarchical Fault Simulation

Distributed processing can be defined as several loosely coupled processors working on the same problem to achieve some goal. It is different from tightly coupled parallel processing, where communication time between processors is comparable to local memory access time. In loosely coupled systems, the communication costs are relatively high. These systems are preferable if computation granularity is large and task interactions are relatively infrequent. Though they offer less efficiency than tightly coupled systems or simulation engines, loosely coupled systems have the advantage of being less expensive and more versatile. The distributed fault simulator, DCHIEFS (Distributed Concurrent HIerarchical and Extensible Fault Simulator), was developed on a loosely coupled system.

Several parameters influence the performance of a distributed system [20], some of which are:
(1) The amount of parallelism inherent in the application problem.
(2) The decomposition of a problem into subproblems.
(3) The allocation of these subproblems to processors.
(4) The communication delay of the interconnection structure relative to the speed of the processors.
The first parameter is application-specific. The next two depend on the implementation, and the last one depends on the particular multiple-processor system used. All these parameters will be examined for the fault simulation problem.

The precedence relation of a problem determines the inherent sequentiality that must be observed in that problem [21]. Because only single faults are assumed to occur in a faulty circuit, there is no precedence relation in the simulation of different faults. Thus high parallelism can be achieved in the fault simulation problem.

The method for decomposing the problem into smaller subproblems is based on hierarchical fault partitioning. Each faulted partition is considered as a subproblem and allocated a processor. The proper grain size of a subproblem (faulted partition) is very difficult to determine. Some general guidelines have been followed. The partition size must be small enough so that each of the available processors gets at least one partition, yet large enough so that the evaluation time of the subproblem is not negligible compared to the communication time. The allocation of the subproblems to processors can be done in any order. The interconnection network employed is a bus, which is slow relative to the processor speed and limits the maximum number of processors that can be utilized efficiently. However, it was adequate for the maximum number of processors considered in this paper.

The following conditions must be satisfied for a distributed program to be efficient:

    T = T_s / p                                                  (1)

where T is the turnaround time for p processors working on the problem, T_s is the turnaround time for a single processor implementation, and p is the number of processors in the multiple-processor implementation. It should be noted that T also includes communication time. This implies that the overhead due to distributing the algorithm should be low to satisfy the condition of linear speedup.

    T_i ≈ T_j                                                    (2)

where T_i and T_j are the turnaround times of processors i and j, respectively. This condition implies a homogeneous distribution and hence a balanced load on all the processors.

An alternative way to partition the problem of fault simulation of a given circuit could be at the circuit topology level, where different processors have different portions of the circuit. However, in that case some processors will require results of computation from other processors, which would lead to large amounts of communication between the processors. In the environment under consideration, communication is the bottleneck, so minimizing communication is necessary. Hence partitioning at the circuit topology level is not an efficient method. The advantage of using faulted partitions is the independence of the subproblems from each other, resulting in minimized communication. Figure 2 shows the general picture of hierarchical fault partitioning. Each host processor, also referred to as a server, obtains a copy of the circuit and injects faults in a unique portion of the circuit.

A. Architecture of the distributed system

The architecture of our loosely coupled system is shown in Figure 3. The client (master) and servers are on a communication bus using the TCP protocol. The system employs the Network File System (NFS), which enables the sharing of files in a heterogeneous environment of machines, operating systems and networks [22]. The Remote Procedure Call (RPC) facility of NFS supports a mechanism where a client machine can request a remote server machine to execute a procedure and return the results to the client. These two facilities were used to implement the communications between the client and servers.

[Figure 3. Architecture of the distributed system.]

B. Control structure

To keep the performance close to the optimal level, an efficient control paradigm should be selected. The control structures may be synchronous or asynchronous. In the synchronous case, all the servers start and finish a parallel portion of the algorithm at the same time; whereas in the asynchronous case, each server may be in different phases of the algorithm simultaneously. In a synchronous control structure, if imperfect load balancing (varying partition sizes in the case of DCHIEFS) occurs, then the servers that get smaller loads eventually become idle. In addition, if the client has to determine which servers are ready to process more partitions, it has to serially check the status of the servers (polling). Both of these procedures cause a performance penalty. To avoid these penalties, we chose an asynchronous control structure. With this control structure, the inherent difficulties associated with different subproblem sizes (imperfect load balancing) have a minimized effect on performance. In our algorithm, when a server finishes its assigned subproblem, it is assigned a new partition number, which has not already been assigned to any server. This happens irrespective of the status of simulation on other servers. To minimize polling, a call-back scheme is chosen, where each server process is responsible for initiating the request to the client for a new fault partition number.

C. Distributed Algorithm

Figure 4 shows the structure of the distributed process. The process proceeds as follows: the user specifies the number of servers, which the client initializes by sending the circuit name and the user input options. The client determines if the requested number of servers is available. If so, the client initializes the desired number of servers; otherwise, the available servers are initialized. Based on the user-specified fault partition size, the servers concurrently divide the circuit into fault partitioned sections using the algorithm given in Section II.C. The algorithm is based on information about the circuit hierarchy, which usually does not have a uniform or homogeneous structure. As a result, the actual sizes of fault partitions are not always equal to the specified size.

Each server gets a unique partition number from the client. Using the partition number, each server injects faults in the corresponding partition, as explained in Section II. The server simulates the faults contained in that partition. Once completed, the server sends the results to the client and requests a new partition number (call-back). The client saves the result and provides a new partition number for the server. The process continues until all the partitions of the circuit are fault simulated. Figure 5 gives an overview of the algorithm.

    Client process determines the number of server processes available
    Client process initializes the server processes
    While there are simulations to perform {
        A server process sends results to the client process and
            requests a new partition number
        The client process saves the results and provides a new
            partition number for the server process
        The server process repartitions the circuit for the partition
            number it received
        The server process simulates for the faults contained in that
            partition
    }

    Figure 5. The distributed algorithm.

[Figure 4. Servers communicating with the client through a call-back scheme.]

The implementation required a server process to be running in the background on each of the server machines. A distributed simulation is initiated by starting a client process on the client machine (refer to Figures 3 and 4); then the distributed process proceeds until simulations of all the partitions are complete. It should be noted that there is no reason why the client machine can not also have a server process running on it.

D. Implementation

The distributed fault simulator, DCHIEFS, has been implemented on a network of SUN workstations. The network is comprised of eight SUN 3/50's and one SUN 3/280 file-server connected by a 10 Mb/s Ethernet. The file-server is also connected to three other file-servers, each with its own set of workstations. Although workstations connected to the other file-servers were not included in the experiment, their utilization is planned for future studies. Either the file-server or a workstation can be chosen as the client. In our environment, however, the file-server (SUN 3/280) is 3 to 4 times faster than any of the workstations (SUN 3/50's). To simplify the analysis, the file-server was not used as a client or a server.

IV. Results

Several circuits of varying gate counts and total fault counts were simulated on DCHIEFS. Two different parameters, the size of the fault partitions and the number of servers used, were varied to study the performance of the distributed simulator. Since the communications between the client processor and server processors were found to occur infrequently, the client processor (a workstation) was also used to run a server process. It is well known that the runtime of any program depends on the system load. To eliminate this load dependency, no other user processes were allowed to run either on the file-server or on the workstations during the course of the experiment. This no-load condition ensured repeatability of the results, and was confirmed by repeating some of the experiments.

Figures 6, 7, and 8 show the speedups for different numbers of processors and various partition sizes for three different circuits. The first circuit, MULT44, is a 4x4 multiplier, which has 3 levels of hierarchy (the number of levels in the circuit hierarchy tree, as shown in Figure 2) and 416 possible single line stuck-at faults. Simulation results of MULT44 are included in Figure 6. Figure 7 shows the results for a control logic module, which contains many sequential elements, with 3 levels of hierarchy and 664 possible single line stuck-at faults. The last circuit simulated, FASTMULT, is a 24x24 bit mantissa multiplier with 25530 faults and 5 levels of hierarchy. Figure 8 shows the simulation results for FASTMULT.

[Figure 6. Speedup versus the number of processors for varying partition sizes (MULT44).]

Almost all the speedup curves are monotonically increasing with respect to the number of processors, which ensures that the use of more processors results in lower run-time in most of the cases, but the rates of increase in the curves do not follow a regular pattern with respect to either the number of processors or the partition sizes. Two factors, both of which are related to the irregularity in the hierarchical structure of the circuit, are responsible for this anomalous behavior. First, the circuits do not have a uniform hierarchical structure, i.e., the sizes and the heights of the subtrees associated with different macromodules at the same level are not the same. This implies that the corresponding hierarchical tree structure of the circuit, similar to Figure 2, is not balanced. As a result of this, partitions having deeper leaves in the faulty part take longer time to be simulated.
This causes an imbalance in the amount of work associated with each partition. Also, the nonfaulty parts of the circuit are different in different partitions. Some of the nonfaulty parts have a functional description and some do not. So the partitions, even when containing identical numbers of faults, may take different times to be simulated, depending on the availability of a functional description of the nonfaulty part of the circuit.

Secondly, the actual partition sizes may be widely varying. The spread in sizes may be as much as 75% of the user-specified partition size (50% on the higher side and 25% on the lower side). As a result, the total number of partitions actually created, n_a, becomes more than n_s, where n_s is given by

    n_s = ⌊N / P_s⌋,

where N is the total number of faults and P_s is the partition size specified by the user. If n_a is not a perfect multiple of the number of processors, then some of the servers remain idle for some period of time when the simulation process is about to end. This penalizes the performance, as can be seen for larger numbers of processors for all the partition sizes in Figures 7(a) and 7(b).

[Figure 7. Speedup versus the number of processors for varying partition sizes (control logic, 664 faults; panels (a) and (b)).]

The two factors mentioned above explain the anomalies in the speedup curves. A very uniform hierarchical structure of a circuit, both in terms of the number of faults and the availability of functional descriptions of nonfaulty parts, will ensure good uniformity in the sizes of the tasks associated with the subproblems. However, this is very hard to achieve for any general circuit, which will have some parts that are random as opposed to some that are regular.
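As a rough numerical check of the partition-count estimate and the end-of-run idle effect described above, consider the following sketch. The helper names are ours, not the paper's, and the synchronous-rounds view is only a crude approximation of DCHIEFS's asynchronous behaviour:

```python
import math

def min_partition_count(total_faults, partition_size):
    # n_s = floor(N / P_s): the fewest partitions the user-specified
    # size would yield; the actual count n_a can be larger, since real
    # partitions range from -25% to +50% of the requested size.
    return total_faults // partition_size

def idle_server_slots(n_partitions, n_servers):
    # Crude synchronous-rounds approximation: servers sit idle in the
    # final round unless n_partitions divides evenly by n_servers.
    rounds = math.ceil(n_partitions / n_servers)
    return rounds * n_servers - n_partitions

# FASTMULT figures from the paper: N = 25530 faults, P_s = 500.
n_s = min_partition_count(25530, 500)
print(n_s)                         # 51
print(idle_server_slots(n_s, 5))   # 4 idle server-slots in the last round
```

With 51 or more partitions spread over 5 servers, the final-round idle time is a small fraction of the total run, which is consistent with the near-optimal speedup observed for that configuration; with fewer partitions per server, the same remainder costs proportionally more.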
Also, the regular parts are usually of different sizes, so it is very hard to say which combination of partition size and number of processors will lead to the best performance. In the experiments reported in this paper, some situations occurred when the speedups were very close to optimal (partition size 50, 5 to 7 processors for the control logic in Figure 7(b), and partition size 500, 5 processors for FASTMULT in Figure 8). In these cases, near-perfect load balancing occurred. But, given a circuit, it is still unknown how to predict the best partition size.

The three example circuits presented here show very good speedups for the distributed implementation. Several experiments were performed in an attempt to find a formula for the optimal partition size for a given circuit, but it became evident that the optimal partition size depends heavily on the circuit topology; thus it can vary vastly even for circuits of similar size and functionality. However, even if the optimal partition size is not used, distributed fault simulation still shows very good speedup, allowing this technique to be used to simulate very large circuits that may currently require prohibitive simulation times on large mainframes.

[Figure 8. Speedup versus the number of processors for varying partition sizes (FASTMULT, 25530 faults).]

V. Conclusion

The purpose behind this research was to find a low-cost method to speed up fault simulation by providing more general purpose processors in a network environment. The results show that distributed fault simulation is a promising alternative to the use of hardware accelerators. Most IC design houses have networks of engineering workstations, and distributed fault simulation over such a network can be an attractive solution to their fault simulation needs. Distributed fault simulation as a method is not restricted to CHIEFS, and it can be applied to any hierarchical fault simulator which allows hierarchical fault partitioning. The speedups obtained for the circuits are very close to optimal in those situations where the partitioning size and the number of processors used are chosen correctly. Unfortunately, the optimal fault partition size depends heavily on the circuit topology and is difficult to predict. A rule of thumb is to use a partition size less than the ratio of the total number of faults to the total number of servers. If a larger partition size is used, some server(s) will certainly remain idle, thus reducing efficiency. Even if the optimal partition size is not used, the speedup figures are still very attractive. So, this technique can be effectively used to fault simulate very large circuits. Future work will try to find a model which would predict the optimal or near optimal values of partition size and number of processors based on circuit topology.

Acknowledgement

This work was supported by the Semiconductor Research Corporation under Contract 87-DP-109. Special thanks to Ms. Carol Gura, Mr. Jeff Baxter and Mr. Ralph Kling, without whose help this work would not have been possible.

References

[2] H. Fujiwara and S. Toida, "The Complexity of Fault Detection Problems for Combinational Logic Circuits," IEEE Transactions on Computers, vol. C-31, pp. 555-560, June 1982.
[3] Y. H. Levendel and P. R. Menon, "Fault-Simulation Methods - Extensions and Comparison," Bell System Technical Journal, vol. 60, pp. 2235-2259, November 1981.
[4] W. A. Rogers, J. F. Guzolek, and J. A. Abraham, "Concurrent Hierarchical Fault Simulation: A Performance Model and Two Optimizations," IEEE Transactions on Computer-Aided Design, vol. CAD-6, pp. 848-862, September 1987.
[5] P. Goel, "Test Generation Costs Analysis and Projections," Proceedings of the 17th Annual Design Automation Conference, pp. 77-84, 1980.
[6] T. W. Williams and K. P. Parker, "Design for Testability - A Survey," Proceedings of the IEEE, vol. 71, pp. 98-112, January 1983.
[7] W. A. Rogers and J. A. Abraham, "CHIEFS: A Concurrent Hierarchical and Extensible Fault Simulator," Proceedings of the IEEE International Test Conference, pp. 710-716, 1985.
[8] G. Pfister, "The Yorktown Simulation Engine: Introduction," Proceedings of the 19th Annual Design Automation Conference, pp. 51-54, 1982.
[9] T. Sasaki, N. Koike, K. Ohmori, and K. Tomita, "HAL: A Block Level Hardware Logic Simulator," Proceedings of the 20th Annual Design Automation Conference, pp. 150-156, 1983.
[10] T. Blank, "A Survey of Hardware Accelerators Used in Computer-Aided Design," IEEE Design and Test, pp. 21-39, August 1984.
[11] B. Milne, "Put the Pedal to the Metal with Simulation Accelerators," Electronic Design, pp. 39-52, September 1987.
[12] D. A. Reed, L. A. Adams, and M. L. Patrick, "Stencils and Problem Partitioning: Their Influence on the Performance of Multiple Processor Systems," IEEE Transactions on Computers, vol. C-36, pp. 845-858, July 1987.
[13] W. W. Chu, D. Lee, and B. m a, "A Distributed Processing System for Naval Data Communication Networks," Proc. AFIPS National Computer Conference, vol. 47, pp. 783-793, 1978.
[14] W. W. Chu, M.-T. Lan, and J. Hellerstein, "Estimation of Intermodule Communication (IMC) and Its Applications in Distributed Processing Systems," IEEE Transactions on Computers, vol. C-33, pp. 691-699, August 1984.
[15] M. E. Lesk and E. Schmidt, "Lex - A Lexical Analyzer Generator," in UNIX Programmer's Manual. Murray Hill, New Jersey: Bell Laboratories, 1979.
[16] S. C. Johnson, "Yacc: Yet Another Compiler-Compiler," in UNIX Programmer's Manual. Murray Hill, New Jersey: Bell Laboratories, 1979.
[17] T. M. McWilliams, J. B. Rubin, L. C. Widdoes, and S. Correl, SCALD II User's Manual. Lawrence Livermore Laboratory, 1979 Annual Report, The S-1 Project.
[18] J. M. Acken and J. D.
Stauffer, “Logic Circuit Simulation,” IEEE Circuits and Systems Magazine, vol. 1, pp. 3-12, June 1979. TEGAS Design Language User and Reference Manual. Austin, Texas: Calma Company, 1984. References [ll 0. H. Ibarra and S. K. Sahni, “Polynomially Complete Fault Detection Problems,” IEEE Transactions on Computers, vol. C-24, pp. 242-249, March 1975. 2. Cvetanovic, “The Effects of Problem Partitioning, Allocation, and Granularity on the Performance of Multiple-Processor Sysetms,” IEEE Transactions on Computers, vol. C-36, pp. 421432, April 1987. K. Hwang and F. A. Briggs, Computer Architecture and Parrallel Processing, 1984. Sun Microsystems Inc., in Networking on the Sun Workrtation, Mountain View, CA, May 1985. Paper 42.1 691