
Journal of VLSI Signal Processing 41, 169–182, 2005
© 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

MPARM: Exploring the Multi-Processor SoC Design Space with SystemC

LUCA BENINI AND DAVIDE BERTOZZI
DEIS—University of Bologna, Via Risorgimento 2, Bologna, Italy

ALESSANDRO BOGLIOLO
ISTI—University of Urbino, Piazza della Repubblica 13, Urbino, Italy

FRANCESCO MENICHELLI AND MAURO OLIVIERI
DIE—La Sapienza University of Rome, Via Eudossiana 18, 00184 Roma, Italy

Received February 13, 2003; Revised December 24, 2003; Accepted July 30, 2004

Abstract. Technology is making the integration of a large number of processors on the same silicon die technically feasible. These multi-processor systems-on-chip (MP-SoC) can provide a high degree of flexibility and represent the most efficient architectural solution for supporting multimedia applications, which are characterized by the demand for highly parallel computation. As a consequence, tools for the simulation of these systems are needed at the design stage, with the distinctive requirements of simulation speed, accuracy and the capability to support design space exploration. We developed a complete simulation platform for an MP-SoC, called MPARM, based on SystemC as the modelling and simulation environment, and including models for processors, an AMBA-compliant communication architecture, memory models and support for parallel programming. A fully operational Linux version for embedded systems has been ported to this platform, and a cross-toolchain has been developed as well. Our multiprocessor simulation environment turns out to be a powerful tool for the MP-SoC design stage. As an example thereof, we use our tool to evaluate the impact of architectural parameters and of bus arbitration policies on system performance, showing that the effectiveness of a particular system configuration strongly depends on the application domain and the generated traffic profile.

Keywords: system-on-chip simulation, multiprocessor embedded systems, design space exploration

1. Introduction

Systems-on-chip (SoC) are increasingly complex and expensive to design, debug and fabricate. The costs incurred in taking a new SoC to market can be amortized only with large sales volumes. This is achievable only if the architecture is flexible enough to support a number of different applications in a given domain. Processor-based architectures are completely flexible, and they are often chosen as the backbone for current SoCs. Multimedia applications often contain highly parallel computation, so it is quite natural to envision multi-processor SoCs (MPSoCs) as the platforms of choice for multimedia. Indeed, most high-end multimedia SoCs on the market today are MPSoCs [1–3].

Supporting the design and architectural exploration of MPSoCs is key to accelerating the design process and converging towards the best-suited architectures for a target application domain. Unfortunately, we are today in a transition phase where design tuning, optimization and exploration are supported either at a very high level or at the register-transfer level. In this paper we describe an MPSoC architectural template and a simulation-based exploration tool, which operates at the macro-architectural level, and we demonstrate its usage on a classical MPSoC design problem, i.e., the analysis of bus-access performance under changing architectures and access profiles.
To support research on general-purpose multiprocessors, a number of architectural-level multiprocessor simulators have been developed in the past by the computer architecture community [4–6] for performance analysis of large-scale parallel machines. These tools operate at a very high level of abstraction: their processor models are highly simplified in an effort to speed up simulation and enable the analysis of complex software workloads. Furthermore, they all postulate a symmetric multiprocessing model (i.e., all the processing units are identical), which is universally accepted in large-scale, general-purpose multiprocessors. This model is not appropriate for embedded systems, where very different processing units (e.g., general-purpose, DSP, VLIW) can coexist.

To enable MPSoC design space exploration, flexibility and accuracy in hardware modeling must be significantly enhanced. Increased flexibility is required because most MPSoCs for multimedia applications are highly heterogeneous: they contain various types of processing nodes (e.g., general-purpose embedded processors and specialized accelerators), multiple on-chip memory modules and I/O units, and a heterogeneous system interconnect fabric. These architectures are targeted towards a restricted class of applications, and they do not need to be highly homogeneous as in the case of general-purpose machines. Hardware modeling accuracy is highly desirable because it would make it possible to use the same exploration engine both during architectural exploration and during hardware design. These needs are well recognized in the EDA (Electronic Design Automation) community, and several simulators have been developed to support SoC design [7–11]. However, these tools are primarily targeted towards single-processor architectures (e.g., a single processor core with many hardware accelerators), and their extension towards MPSoCs, albeit certainly possible, is a non-trivial task.

In analogy with current SoC simulators, our design space exploration engine supports a hardware-oriented abstraction level and continuity between architectural and hardware design, but it fully supports multiprocessing. In contrast with traditional mixed-language co-simulators [7], we assume that all components of the system are modeled in the same language. This motivates our choice of SystemC as the modeling and simulation environment for our MPSoC platform.

The primary contribution of this paper is not centered on describing a simulation engine, but on introducing MPARM, a complete platform for MPSoC research, including processor models (ARM), SoC bus models (AMBA), memory models, hardware support for parallel programming, a fully operational operating system port (uClinux) and code development tools (GNU toolchain). We demonstrate how our MPSoC platform enables the exploration of different hardware architectures and the analysis of complex interaction patterns between parallel processors sharing storage and communication resources. In particular, we demonstrate the impact of various bus arbitration policies on system performance, one of the most critical elements in MPSoC design, as demonstrated in previous work [12–15].
The paper is organized as follows: Section 2 describes the concepts of the emulated platform architecture and its subsystems (network, master and slave modules), Section 3 presents the software support elements developed for the platform (compiler, peripheral drivers, synchronization, O.S.), and Section 4 gives some examples of using the tool for hardware/software exploration and for the exploration of bus arbitration policies.

2. Multiprocessor Simulation Platform

Integrating multiple Instruction Set Simulators (in the following, ISSs) in a unified system simulation framework entails several non-trivial challenges, such as the synchronization of multiple CPUs to a common time base, or the definition of an interface between the ISS and the simulation engine. The utilization of SystemC [16] as the backbone simulation framework represents a powerful solution for embedding ISSs in a framework for efficient and scalable simulation of multiprocessor SoCs. Besides its distinctive features for modeling software algorithms, hardware architectures and SoC or system-level design interfaces, SystemC makes it possible to plug an ISS into the simulation framework as a system module, activated by the common system clock provided to all of the modules (not a physical clock). SystemC provides a standard and well-defined interface for the description of the interconnections between modules (ports and signals). Moreover, among the advantages of C/C++-based hardware descriptions is the possibility of bridging the hardware/software description language gap [17].

SystemC can be used in such a way that each module consists of a C/C++ implementation of the ISS, encapsulated in a SystemC wrapper. The wrapper realizes the interface and synchronization layer between the instruction set simulator and the SystemC simulation framework: in particular, the cycle-accurate communication architecture has to be connected with the coarser-granularity time domain of the ISS. The applicability of this technique is not limited to ISSs, but can be extended to encapsulate C/C++ implementations of system blocks (such as memories and peripherals) into SystemC wrappers, thus achieving considerable simulation speedups. This methodology trades off simulation accuracy for time, and represents an efficient alternative to a full SystemC description of the system modules (SystemC as a hardware description language) at a lower abstraction level. The latter solution would slow down the simulation, and for complex multiprocessor systems this performance penalty could turn out to be unacceptable. A co-simulation scenario can also be supported by SystemC, where modules encapsulating C++ code (describing the simulated hardware at a high level of abstraction, i.e., behavioural) coexist with modules completely written in SystemC (generally realizing a description at a lower level of abstraction). In this way, performance versus simulation accuracy can be tuned and differentiated between the modules.

Based on these guidelines, we have developed a multiprocessor simulation framework using SystemC 1.0.2 as the simulation engine. The simulated system currently contains a model of the communication architecture (compliant with the AMBA bus standard), along with multiple masters (CPUs) and slaves (memories) (Fig. 1).
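To make the wrapping approach just described concrete, the following minimal sketch (SystemC 2.x style) shows a C++ ISS encapsulated in a clock-driven SystemC module. It is not taken from the MPARM sources: the ISS class name (ArmIss), its Cycle() method and its pin-level members are assumptions made for illustration, loosely modeled on the SWARM interface described below.

    #include <systemc.h>

    // Hypothetical C++ ISS with a cycle-level interface (names assumed).
    class ArmIss {
    public:
        bool     mem_request;      // mirrors the bus-request pin of the core
        unsigned addr;
        bool     mem_ready;        // bus/slave ready, seen by the core
        void Cycle();              // executes one clock cycle of the core
    };

    // SystemC wrapper: translates the ISS pin variables to SystemC
    // signals and advances the ISS by exactly one cycle per clock edge.
    SC_MODULE(IssWrapper) {
        sc_in_clk            clock;
        sc_out<bool>         req;      // bus request towards the AMBA model
        sc_out<sc_uint<32> > address;
        sc_in<bool>          ready;    // sampled from the bus each cycle

        ArmIss iss;                    // one private ISS instance per module

        void step() {
            iss.mem_ready = ready.read();  // propagate bus state into the ISS
            iss.Cycle();                   // run one processor clock cycle
            req.write(iss.mem_request);    // drive SystemC signals from pins
            address.write(iss.addr);
        }

        SC_CTOR(IssWrapper) {
            SC_METHOD(step);
            sensitive << clock.pos();  // control returns to SystemC every cycle
        }
    };

Because the ISS is a plain C++ class rather than a separate thread or process, instantiating one wrapper per processor is enough to obtain a multiprocessor simulator.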
The intrinsic multi-master communication supported by the AMBA protocol has been exploited by declaring multiple instances of the ISS master module, thus constructing a scalable multiprocessor simulator.

Figure 1. System architecture.

Figure 2. Processing module architecture.

Processing Modules

The processing modules of the system are represented by cycle-accurate models of cached ARM cores. Each module (Fig. 2) is internally composed of the ARM CPU, the first-level cache and peripherals (UART, timer, interrupt controller), simulated in C++. It was derived from the open-source cycle-accurate SWARM (SoftWare ARM) simulator [18], encapsulated in a SystemC wrapper.

The insertion of an external (C++) ISS requires interfacing it with the SystemC environment: for example, accesses to memories and interrupt requests must be trapped and translated to SystemC signals. Another important issue is synchronization: the ISS, typically written to run as a single unit, must be capable of being synchronized with the multiprocessing environment (i.e., there must be a way to start and stop it while maintaining cycle accuracy). Finally, as a further requirement, the ISS must be multi-instantiable (for example, it must be a C++ class), since there will be one instance of the module for each simulated processor.

The SWARM simulator is entirely written in C++. It emulates an ARM CPU and is structured as a C++ class which communicates with the external world through a Cycle function, which executes one clock cycle of the core, and through a set of variables in very close relation to the corresponding pins of a real hardware ARM core. Along with the CPU, a set of peripherals is emulated (timers, interrupt controller, UART) to provide support for an operating system running on the simulator. The cycle-level accuracy of the SWARM simulator simplifies the synchronization with the SystemC environment (i.e., the wrapper module), especially in a multiprocessor scenario, since control is returned to the main system simulation synchronizer (SystemC) at every clock cycle [18].

The interesting aspect of ISS wrapping is that, with relatively little effort, other processor simulators (e.g., MIPS) can be embedded in our multiprocessor simulation backbone. Provided they are written in C/C++, their access requests to the system bus need to be trapped, so as to make the communication explicit and generate the cycle-accurate bus signals in compliance with the communication architecture protocol. Moreover, the need for synchronization between simulation time and ISS simulated time arises only when the ISS to be embedded has a coarse time resolution, i.e., when it does not simulate each individual processor clock cycle. Finally, the wrapping methodology incurs negligible communication overhead between the ISS and the SystemC simulation engine, because the ISS does not run as a separate thread, and the communication primitives that would otherwise become a simulation-speed bottleneck are not required.
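As a sketch of how such a scalable simulator can be assembled, the top level below instantiates one wrapped core per processor, reusing the IssWrapper sketch above. The bus and memory classes (AmbaAhbBus, Memory) and their bind helpers are invented for illustration and are not the actual MPARM class names.

    #include <systemc.h>

    // Hypothetical top level: N wrapped cores as AMBA masters,
    // memory modules as slaves on the shared bus.
    int sc_main(int argc, char* argv[]) {
        const int N_CPU = 2;
        sc_clock clock("clock", 10, SC_NS);    // common system clock

        AmbaAhbBus bus("bus");                 // arbiter + address decoding (assumed class)
        bus.clock(clock);

        std::vector<IssWrapper*> cpus;
        for (int i = 0; i < N_CPU; ++i) {
            IssWrapper* cpu = new IssWrapper(sc_gen_unique_name("cpu"));
            cpu->clock(clock);
            bus.bind_master(*cpu);             // assumed helper: wires req/grant/address/ready
            cpus.push_back(cpu);
        }

        Memory mem0("mem0", 0x00000000);       // shared memory banks as slaves (assumed class)
        Memory mem1("mem1", 0x00100000);
        bus.bind_slave(mem0);
        bus.bind_slave(mem1);

        sc_start();                            // run the simulation
        return 0;
    }

Scaling the simulated system up then amounts to changing N_CPU and the memory map, which is precisely the property exploited in the experiments of Section 4.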
AMBA Bus Model

AMBA is a widely used standard defining the communication architecture for high-performance embedded systems [19]. Multi-master communication is supported by this backbone bus, and requests for simultaneous accesses to the shared medium are serialized by means of an arbitration algorithm.

The AMBA specification includes an advanced high-performance system bus (AHB), and a peripheral bus (APB) optimized for minimal power consumption and reduced interface complexity to support the connection of low-performance peripherals. We have developed a SystemC description only for the former, given the multi-processor scenario we are targeting. Our implementation supports the distinctive standard-defined features of AHB, namely burst transfers, split transactions and single-cycle bus master handover. The model has been developed with scalability in mind, so that multiple masters and slaves can easily be plugged in through proper bus interfaces.

Bus transactions are triggered by asserting a bus request signal. The master then waits until bus ownership is granted by the arbiter: at that time, address and control lines are driven, while data bus ownership is delayed by one clock cycle, as an effect of the pipelined operation of the AMBA bus. Finally, data sampling at the master side (for read transfers) or slave side (for write transfers) takes place when a ready signal is asserted by the slave, indicating that on the next rising edge of the clock the configuration of the data bus can be considered stable and the transaction can be completed. Besides single transfers, four-, eight- and sixteen-beat bursts are defined in the AHB protocol; unspecified-length bursts are also supported.

An important characteristic of the AMBA bus is that the arbitration algorithm is not specified by the standard, and it represents a degree of freedom for a task-dependent performance optimization of the communication architecture. A great number of arbitration policies can be implemented in our multiprocessor simulation framework by exploiting some relevant features of the AMBA bus. For example, the standard allows higher-priority masters to gain ownership of the bus even though the master currently using it has not yet completed its transfer. This is the case of the early burst termination mechanism, which comes into play whenever the arbiter does not allow a master to complete an ongoing burst. In this case, masters must be able to appropriately rebuild the burst when they next regain access to the bus.

We exploited this bus preemption mechanism to implement an arbitration strategy called "slot reservation". In practice, the bus is periodically granted to the same master, which is therefore provided with a minimum guaranteed bandwidth, even though it cannot compete for bus access during the remaining period of time. This strategy allows tasks with particular bandwidth requirements not to fail as an effect of arbitration delays due to a high level of bus contention. In our SystemC model of the AMBA bus, we exploit early burst termination to suspend, when necessary, ongoing bursts being carried out by the master owning the bus, and to let that master resume its operation once the reserved slot has expired. This arbitration mechanism is parameterized, as the slot duration and period can be independently set to search for the most efficient solution. A traditional round-robin policy has also been implemented, allowing a comparison between the two strategies in terms of their impact on a number of metrics, such as execution time, average waiting time of the processors before their bus access requests are satisfied, degree of bus idleness, number of burst early terminations, average preemption time, etc.
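A minimal sketch of the grant decision behind such a slot-reservation arbiter is shown below. The policy parameters and the preemption of ongoing bursts follow the description above; all identifiers are invented for illustration, and the real MPARM arbiter is certainly more detailed (burst rebuilding, default master, etc.).

    #include <vector>

    // Slot-reservation arbiter sketch: the reserved master owns the bus
    // during a periodic time slot; outside the slot, pending requests are
    // served round-robin. Times are expressed in bus clock cycles.
    class SlotArbiter {
        int slot_duration;   // length of the reserved slot
        int slot_period;     // distance between the starts of two slots
        int reserved;        // index of the master holding the reservation
        int last_granted;    // round-robin state
    public:
        SlotArbiter(int dur, int per, int res)
            : slot_duration(dur), slot_period(per),
              reserved(res), last_granted(0) {}

        // Called every bus cycle; request[i] is true if master i wants the
        // bus. Returns the granted master's index, or -1 if the bus idles.
        int grant(long cycle, const std::vector<bool>& request) {
            bool in_slot = (cycle % slot_period) < slot_duration;
            if (in_slot) {
                // During the slot an ongoing burst of another master is
                // preempted (early burst termination); only the reserved
                // master may be granted, even if this leaves the bus idle.
                return request[reserved] ? reserved : -1;
            }
            // Outside the slot: plain round-robin over all masters.
            int n = (int)request.size();
            for (int k = 1; k <= n; ++k) {
                int i = (last_granted + k) % n;
                if (request[i]) { last_granted = i; return i; }
            }
            return -1;
        }
    };

Sweeping slot_duration and slot_period, as done in the experiments of Section 4, exposes the bandwidth/latency trade-off between the reserved master and the others.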
Our multiprocessor simulation platform allows design space exploration of arbitration policies, making it easy to derive the most critical parameters determining the performance of the communication architecture of an MP-SoC. This capability of the simulation environment is becoming of critical importance, as the design paradigm for SoCs is shifting from device-centric to interconnect-centric [20]. The efficiency of a certain arbitration strategy can be easily assessed for multiple hardware configurations, such as different numbers of masters or different master characteristics (e.g., cache size, general-purpose versus application-specific, etc.).

Memory Sub-System

The system is provided with two levels of memory hierarchy, namely cache memory and main memory. The cache memory is contained in the processing module and is directly connected to the CPU core through its local bus. Each processing module has its own cache, acting as a local instruction and data memory; it can be configured as a unified instruction and data cache or as two separate banks of instruction and data caches. Configuration parameters also include cache size, line length and the definition of non-cacheable areas in the address space.

Main memory banks reside on the shared bus as slave devices. They consist of multiple instantiations of a basic SystemC memory module. Each memory module is mapped onto its reserved area within the address space; it communicates with the masters through the bus using a request-ready asynchronous protocol; the access latency, expressed in clock cycles, is configurable.

Multiprocessor Synchronization Module

In a multiprocessing system there is a need for hardware support for process synchronization, in order to avoid race conditions when two or more processes try to access the same shared resource simultaneously. Support for mutual exclusion is generally provided by ad hoc non-interruptible CPU instructions, such as the test&set instruction. In a multiprocessor environment, the presence of non-interruptible instructions must be combined with external hardware support in order to obtain mutually exclusive access to resources shared between different processors. We have equipped the simulator with a bank of memory-mapped registers which work as hardware semaphores. They are shared among the processors and their behavior is similar to that of a shared memory, with the difference that when one of these 32-bit registers is read, its value is returned to the requester, but at the same time the register is automatically set to a predefined value before the completion of the read access. In this way a single read of one of the registers works as an atomic test&set operation. This module is connected to the bus as a slave and its locations are memory-mapped in a reserved address space.
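The read-side behavior of such a semaphore bank can be sketched as follows. This is an illustrative fragment, not the MPARM source; the register count and the "taken" value written back on read are assumptions.

    #include <cstdint>

    // Hardware semaphore bank sketch: a bus slave whose registers implement
    // test&set in a single read. Reading a register returns its current
    // value and sets it to SEM_TAKEN before the read access completes.
    class SemaphoreBank {
        static const uint32_t SEM_TAKEN = 1;   // assumed "busy" marker
        static const uint32_t SEM_FREE  = 0;
        uint32_t reg[16];                      // register count is illustrative
    public:
        SemaphoreBank() { for (int i = 0; i < 16; ++i) reg[i] = SEM_FREE; }

        // Bus read: one read acts as an atomic test&set. A master that gets
        // back SEM_FREE has acquired the lock; later readers see SEM_TAKEN.
        uint32_t read(int index) {
            uint32_t old = reg[index];
            reg[index] = SEM_TAKEN;            // auto-set before the access ends
            return old;
        }

        // Bus write: storing SEM_FREE releases the semaphore.
        void write(int index, uint32_t value) { reg[index] = value; }
    };

Atomicity comes for free in this setting because the bus model serializes all slave accesses, so no additional locking inside the module is needed.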
3. Software Support

Cross-Compilation Toolchain

The cross-compilation toolchain includes the GNU gcc-3.0.4 compiler for the ARM family of processors and its related utilities, compiled under Linux. The result of the compilation and linking step is a binary image of the memory, which can be uploaded into the simulator.

Operating System Support: uClinux

Hardware support for booting an operating system has been provided in the simulator through the emulation of two basic peripherals needed by a multitasking O.S.: a timer and an interrupt controller. An additional UART I/O device allows startup, error and debug information to be displayed on a virtual console. Linux-style drivers have been written for these devices, running under the Linux 2.4 kernel. The kernel ported onto the emulation platform is a reduced version of Linux (uClinux) for embedded systems without memory management unit support [21]. Our simulation platform allows booting multiple parallel uClinux kernels on independent processors and running benchmarks or interactive programs, using the UART as an I/O console.

Support for Multiple Processors

The software support for multiprocessing includes the initialization step and synchronization primitives, together with some modifications of the memory map. When a processor performs an access to the memory region where it expects to find the exception vectors, the address is shifted to a different region in the main memory, so that each processor can have its own distinct exception table. The result is a virtual memory map specific to each processor (Fig. 3), which must not be confused with general-purpose memory management support.

Figure 3. Memory map.

Having its own reset vector, each processor can execute its own startup code independently of the others. Each processor initializes its registers (e.g., stack pointer) and private resources (timers, interrupt controllers). Shared resources are initialized by a single processor while the others wait, using semaphore synchronization. At the end of the initialization step, each processor branches to its own main routine (namely main0, main1, main2, etc.). The linker script is responsible for the allocation of the startup routines and of the code and data sections of the C program.

The synchronization software facilities include definitions and primitives supporting the hardware semaphore region (multiprocessor synchronization module) at the C programming level. The routines consist of a blocking test&set function, a non-blocking test function and a free function.

4. Experimental Results

Our simulation environment can be used for different kinds of design exploration, and this section gives some examples thereof. To this purpose, we used the aforementioned software toolchain to write benchmark programs for a two-processor system with different levels of data interaction between the two processors. Figure 4 shows the common system architecture configuration used for the examples. Two ARM processing modules are connected to the AMBA bus and act as masters, and two identical memory modules are connected as slaves and can be accessed by both processors. The third slave module is the semaphore unit, used for synchronization in one of the following benchmark programs.

Figure 4. System architecture for benchmark examples.

Benchmark Description

(a) Same data set program (shared data source). The two processors execute the same algorithm (matrix multiplication) on the same data source. In this program, half of the result matrix is generated by the first processor while the other half is generated by the second (Fig. 5). The two processors share the source data (the two matrices to be multiplied), but there are no data dependencies between them, so there is no need for synchronization functions between the processors.

Figure 5. Matrix multiplication.

(b) Data-dependent program (producer-consumer algorithm). The first processor executes a one-dimensional N-size integer FFT on a data source stream, while the second executes a one-dimensional N-size integer IFFT on the data produced by the first processor. For each N-size FFT block completed, a dedicated semaphore is released by the first CPU before it initiates the elaboration of the subsequent block. The second CPU, before performing the IFFT on a data block, checks the related semaphore and is blocked until data-ready is signaled, as sketched below.
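The following is a minimal sketch of the handshake in benchmark (b), written against the C-level primitives just described. The primitive names (sem_testset, sem_free), the kernels (fft, ifft) and the buffer layout are illustrative assumptions, not the actual benchmark code.

    /* Producer-consumer handshake sketch over the hardware semaphores.
       Semaphores start in the "taken" state, so the consumer blocks on a
       block's semaphore until the producer has released it. */

    #define N_BLOCKS 8
    #define N        16                 /* FFT size */

    extern void sem_testset(int i);     /* blocking test&set on semaphore i */
    extern void sem_free(int i);        /* release semaphore i              */
    extern int  in_buf[N_BLOCKS][N];    /* input stream, one block per row  */
    extern int  mid_buf[N_BLOCKS][N];   /* FFT output / IFFT input          */
    extern int  out_buf[N_BLOCKS][N];
    extern void fft(const int *in, int *out);
    extern void ifft(const int *in, int *out);

    void main0(void)                    /* producer, runs on CPU 0 */
    {
        int i;
        for (i = 0; i < N_BLOCKS; i++) {
            fft(in_buf[i], mid_buf[i]);
            sem_free(i);                /* signal: block i is ready */
        }
    }

    void main1(void)                    /* consumer, runs on CPU 1 */
    {
        int i;
        for (i = 0; i < N_BLOCKS; i++) {
            sem_testset(i);             /* spins until block i is ready */
            ifft(mid_buf[i], out_buf[i]);
        }
    }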
Architectural Exploration

In this example we show the results obtained by running the preceding benchmarks while varying architectural or program parameters. Two parameters are explored: one related to the system architecture, the cache size, and one related to the program being executed, the FFT size (which affects data locality). The FFT performed on an N-size block is hereafter indicated as "FFT N". Using our multiprocessor simulation tool we can obtain output statistics in text format, such as the ones shown in Fig. 6. In Figs. 7–9 we graphically illustrate the results for contention-free bus accesses (the percentage of times a CPU is immediately granted the bus upon its access request, with respect to the total number of bus access requests), the average waiting time before gaining bus ownership (this delay is a side effect of the arbitration mechanism and of the serialization of bus requests), and the average cache miss rate for the two processors.

Figure 6. Statistics collection example (FFT 16 with 512-byte cache size).

Figure 7. Contention-free bus accesses vs. cache size.

Figure 8. Average waiting for bus cycles vs. cache size.

Figure 9. Cache miss rate vs. cache size.

Arbitration Policies Exploration

We performed extensive experiments to evaluate the effects of bus arbitration policies on the performance provided by the system on our benchmark applications. In particular, we simulated each benchmark with different bus arbiters, implementing round-robin and slot reservation policies. Slot reservation was parameterized in terms of slot duration and slot period, and the effects of both parameters were analyzed by means of simulation sweeps. Results obtained for benchmark (a) are reported in Figs. 10–14. Figures 10 and 11 show the effects of slot duration and period on the average waiting time per memory access perceived by each processor. Horizontal lines on the two graphs represent the average waiting time obtained with round robin, which is almost the same for the two processors. In our experiments, slots are reserved to processor 1.

Figure 10. Average waiting time vs. slot duration (slot period = 10 ms).

Figure 11. Average waiting time vs. slot period (slot duration = 1 ms).

Figure 12. Bus cycles vs. slot duration (slot period = 10 ms).

Figure 13. Bus cycles vs. slot period (slot duration = 1 ms).
As expected, increasing the duration of the slot reduces the waiting time of processor 1 and increases that of processor 2. As regards the slot period, its increase corresponds to an increase of the inter-slot time, since the slot duration is kept constant here, and this translates into the opposite effect on waiting times. In fact, in a system with only two masters the inter-slot time can be viewed as a slot time reserved to the second processor, so that the larger the inter-slot time, the better the bus performance perceived by processor 2. It is worth noting that slot reservation never outperforms round robin. In fact, in our application the two processors have exactly the same workload and the same memory access needs, so that it is counterproductive to implement unbalanced arbitration policies.

Figure 12 shows the overall number of idle bus cycles and the execution time of the two processors for different slot durations. Interestingly, the idleness of the bus has a non-monotonic behavior, clearly showing a global optimum. This can be explained by observing that matrix multiplication is the only task in the system, so that each processor stops executing as soon as it finishes its sub-task. If the two processors have different execution times because of slot reservation, once the faster processor completes execution it cannot generate further requests, while the bus arbiter keeps reserving time slots to it, thus necessarily causing idle bus cycles. The number of bus idle cycles reported in Fig. 12 suffers from this situation, since it is computed over the execution time of the slower processor. A similar behavior is shown in Fig. 13 as a function of the slot period.

Finally, Fig. 14 shows the preemptive effect of slot reservation. The number of preemptions represents the number of times a data burst is interrupted because it exceeds the time slot. For a fixed slot duration of 1 ms, the number of preemptions decreases as the slot period, and hence the inter-slot time, increases. Notice that the inter-slot time affects only the number of preemptions of processor 2, while the number of preemptions of processor 1 depends only on the slot duration, which does not change in our experiment. The average burst preemption time (also reported in Fig. 14) is a measure of the time the interrupted processor has to wait before resuming the data transfer. For bursts initiated by processor 2, the preemption time is fixed and equal to the slot duration. For bursts initiated by processor 1, on the other hand, the preemption time grows with the inter-slot time. This is why the average burst preemption time is a sub-linear function of the inter-slot time.

Figure 14. Preemption (slot duration = 1 ms).
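A back-of-the-envelope model (our illustration, not taken from the paper) makes the last observation concrete. Let $T_s$ be the slot duration, $T_p$ the slot period (so $T_p - T_s$ is the inter-slot time), and $f_1$ the fraction of preempted bursts that were initiated by processor 1. Treating $f_1$ as roughly constant, the average burst preemption time is

$$\bar{t}_{\mathrm{preempt}} \;\approx\; f_1\,(T_p - T_s) \;+\; (1 - f_1)\,T_s, \qquad \frac{\partial \bar{t}_{\mathrm{preempt}}}{\partial T_p} \;\approx\; f_1 \;<\; 1,$$

i.e., because only a fraction of the preempted bursts experience the delay that grows with the inter-slot time, the average grows more slowly than the inter-slot time itself. In reality $f_1$ drifts as the number of processor-2 preemptions changes with $T_p$, but the qualitative sub-linear trend of Fig. 14 is unchanged.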
Simulation Performance

In this section we report the simulation speed and performance of our simulator. These data are very important, since they are an index of the complexity vs. accuracy trade-off and contribute to defining the design space that can actually be explored using MPARM. In order to obtain more significant measures, a further benchmark was developed. The benchmark consists of a chain of pipelined data manipulation operations (in this specific case, matrix multiplications), where each processor is a stage of the pipeline (i.e., processor N operates on data produced by processor N − 1 and its data outputs are the inputs of processor N + 1) and the processing load is equally distributed over the CPUs. This system structure has the advantage of being easily expandable (regarding the number of processing stages) with minor modifications to the simulator and to the benchmark code. In this way we can produce data on simulation time and show how it scales with the number of processors.

Figure 15 shows the simulation time needed for the execution of about 1.5 million simulated cycles on each processing module. We report the measures as a function of the number of processors and of the output statistics produced (global statistics collection, as in Fig. 6, or complete signal waveform tracing (VCD files) and memory access tracing). The whole simulator was run on a Pentium 4, 2.26 GHz workstation. Simulation speed is in the range of 60,000–80,000 cycles/s overall without signal tracing (i.e., 2 simulated processors proceed at about 30,000–40,000 cycles/s each; 6 processors, roughly, at 10,000 cycles/s each). The collection of global statistics does not significantly affect simulation speed, while run-time signal tracing has a deeper impact.

Figure 15. Simulation time (seconds).

5. Conclusions

We have developed a complete platform for the simulation of an MP-SoC, allowing investigation of the parameter space (related to the architecture configuration or to the protocols) to come up with the most efficient solution for a particular application domain. Our platform uses SystemC as the simulation engine, so that hardware and software can be described in the same language, and is based on an AMBA-compliant communication architecture. ARM processors act as bus masters (as in commercial high-end multimedia SoCs), and the simulation platform includes memory modules, synchronization facilities, and support for system software (a port of the uClinux OS and a cross-toolchain).

We have shown examples of applications wherein our simulation environment is used to explore some design parameters, namely cache parameters and bus arbitration policies. The applications involve data-independent or data-dependent tasks running on different ARM CPUs sharing the main memory through a common AMBA bus. The examples show how to derive important metrics (cache miss rate, average waiting time for accessing the bus from when the request is asserted, etc.) that heavily impact system performance, proving the platform's effectiveness in supporting the design stage of a multi-processor system-on-chip.

References

1. Philips, "Nexperia Media Processor," http://www.semiconductors.philips.com/platforms/nexperia/media processing/
2. Mapletree Networks, "Access Processor," http://www.mapletree.com/products/vop tech.cfm
3. Intel, "IXS1000 Media Signal Processor," http://www.intel.com/design/network/products/wan/vop/ixs1000.htm
4. P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A Full System Simulation Platform," IEEE Computer, vol. 35, no. 2, Feb. 2002.
5. M. Rosenblum, S.A. Herrod, E. Witchel, and A. Gupta, "Complete Computer System Simulation: The SimOS Approach," IEEE Parallel & Distributed Technology: Systems & Applications, vol. 3, no. 4, Winter 1995.
6. C.J. Hughes, V.S. Pai, P. Ranganathan, and S.V. Adve, "Rsim: Simulating Shared-Memory Multiprocessors with ILP Processors," IEEE Computer, vol. 35, no. 2, Feb. 2002.
7. Mentor Graphics, "Seamless Hardware/Software Co-Verification," http://www.mentor.com/seamless/products.html
8. CoWare, Inc., "N2C," http://www.coware.com/cowareN2C.html
9. K. Van Rompaey, D. Verkest, I. Bolsens, and H. De Man, "CoWare—A Design Environment for Heterogeneous Hardware/Software Systems," in Proceedings of EURO-DAC'96 with EURO-VHDL'96 and Exhibition, Sept. 1996, pp. 16–20.
10. S.S. Mukherjee, S.K. Reinhardt, B. Falsafi, M. Litzkow, S. Huss-Lederman, M.D. Hill, J.R. Larus, and D.A. Wood, "Wisconsin Wind Tunnel II: A Fast and Portable Parallel Architecture Simulator," in Workshop on Performance Analysis and Its Impact on Design (PAID), June 1997.
11. B. Falsafi and D.A. Wood, "Modeling Cost/Performance of a Parallel Computer Simulator," ACM Transactions on Modeling and Computer Simulation (TOMACS), Jan. 1997.
12. K. Lahiri, A. Raghunathan, G. Lakshminarayana, and S. Dey, "Communication Architecture Tuners: A Methodology for the Design of High-Performance Communication Architectures for System-on-Chips," in Proceedings of the 37th Design Automation Conference, 2000, pp. 513–518.
13. K. Lahiri, A. Raghunathan, and G. Lakshminarayana, "LOTTERYBUS: A New High-Performance Communication Architecture for System-on-Chip Designs," in Proceedings of the Design Automation Conference, 2001.
14. K. Anjo, A. Okamura, T. Kajiwara, N. Mizushima, M. Omori, and Y. Kuroda, "NECoBus: A High-End SOC Bus with a Portable & Low-Latency Wrapper-Based Interface Mechanism," in Proceedings of the IEEE Custom Integrated Circuits Conference, 2002, pp. 315–318.
15. K.K. Ryu, E. Shin, and V.J. Mooney, "A Comparison of Five Different Multiprocessor SoC Bus Architectures," in Proceedings of the Euromicro Symposium on Digital Systems Design, 2001, pp. 202–209.
16. Synopsys, Inc., "SystemC, Version 2.0," http://www.systemc.org
17. G. De Micheli, "Hardware Synthesis from C/C++ Models," in DATE'99: Design Automation and Test in Europe, Mar. 1999, pp. 382–383.
18. M. Dales, "SWARM—Software ARM," http://www.dcs.gla.ac.uk/~michael/phd/swarm.html
19. ARM, "AMBA Bus," http://www.arm.com/armtech.nsf/html/AMBA?OpenDocument&style=AMBA
20. J. Cong, "An Interconnect-Centric Design Flow for Nanometer Technologies," in Proceedings of the International Symposium on VLSI Technology, Systems, and Applications, June 1999, pp. 54–57.
21. uClinux, http://www.uclinux.org

Luca Benini received the B.S. degree (summa cum laude) in electrical engineering from the University of Bologna, Italy, in 1991, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University in 1994 and 1997, respectively. He is an associate professor in the Department of Electronics and Computer Science at the University of Bologna. He also holds visiting researcher positions at Stanford University and the Hewlett-Packard Laboratories, Palo Alto, CA. Dr. Benini's research interests are in all aspects of computer-aided design of digital circuits, with special emphasis on low-power applications, and in the design of portable systems. He is co-author of the book Dynamic Power Management: Design Techniques and CAD Tools (Kluwer, 1998). Dr. Benini is a member of the technical program committees of several technical conferences, including the Design Automation Conference, the International Symposium on Low Power Design and the International Symposium on Hardware-Software Codesign.
[email protected]

Davide Bertozzi received the B.S. degree in electrical engineering from the University of Bologna, Bologna, Italy, in 1999. He is currently pursuing the Ph.D. degree at the same university and is expected to graduate in 2003. His research interests concern the development of SoC co-simulation platforms, the exploration of SoC communication architectures and low-power system design.
[email protected]

Alessandro Bogliolo received the Laurea degree in electrical engineering and the Ph.D. degree in electrical engineering and computer science from the University of Bologna, Bologna, Italy, in 1992 and 1998, respectively. In 1995 and 1996 he was a Visiting Scholar at the Computer Systems Laboratory (CSL), Stanford University, Stanford, CA. From 1999 to 2002 he was an Assistant Professor at the Department of Engineering (DI) of the University of Ferrara, Ferrara, Italy. Since 2002 he has been with the Information Science and Technology Institute (STI) of the University of Urbino, Urbino, Italy, as an Associate Professor. His research interests are mainly in the area of digital integrated circuits and systems, with emphasis on low power and signal integrity.
[email protected]

Francesco Menichelli was born in Rome in 1976. He received the Electronic Engineering degree in 2001 from the University of Rome "La Sapienza". Since 2002 he has been a Ph.D. student in Electronic Engineering at the "La Sapienza" University of Rome. His scientific interests focus on low-power digital design, and in particular on system-level techniques for low power consumption, and on power modeling and simulation of digital systems.
[email protected]

Mauro Olivieri received a Master degree in electronic engineering "cum laude" in 1991 and a Ph.D. degree in electronic and computer engineering in 1994 from the University of Genoa, Italy, where he also worked as an assistant professor. In 1998 he joined the University of Rome "La Sapienza", where he is currently an associate professor in electronics. His research interests are digital systems-on-chip and microprocessor core design. Prof. Olivieri supervises several research projects supported by private and public funding in the field of VLSI system design.
[email protected]