Host-Compiled Multicore RTOS Simulator For Embedded Real-Time Software Development
I. INTRODUCTION
With increasing system complexities, embedded software forms a large portion of today's systems. Software provides a high degree of flexibility and code reuse at reduced development times and costs. To achieve continued high software performance while managing power budgets, multicore processors are rapidly becoming a permanent part of embedded systems. As we are reaching limits in technology and frequency scaling, multicore architectures provide concurrent resources and high performance at lower clock frequencies and power consumption [1]. Multicore processor architectures open new software development challenges, such as ensuring correct communication and synchronization between tasks running concurrently on different cores [2].

In traditional multiprocessor setups, tasks are assigned to different processors in an asymmetric multi-processing (AMP) fashion. Each processor has an independent address space and is managed by a dedicated operating system (OS). With each processor acting like a conventional single-core machine, this approach allows for easy adaptation of legacy applications. By contrast, in a multicore setup, each processor can in turn contain multiple cores that share a common set of resources and are managed by a single OS. Tasks execute on different cores under a shared memory model in a symmetric or bound multi-processing (SMP or BMP) context, where the latter model allows designers to limit task migration and to control which cores are allowed to execute a particular task.

A crucial component for the performance of real-time software on multicore processors is the OS scheduler that assigns tasks to cores. There are two general classes of multicore scheduling schemes, distinguished by the number of task queues associated with each core (Fig. 1). In a partitioned scheme, each core has a separate ready queue and tasks are initially assigned to a fixed core and queue. The OS picks tasks for a core only from the associated queue, but it can perform load balancing to migrate tasks between queues and cores either at regular intervals or whenever a task leaves a core. By contrast, in a global scheme the OS maintains only a single ready queue and tasks can be freely assigned to the next available core. A global queue can lead to better core utilization but is less scalable and may result in degraded performance due to cache pollution when tasks move between cores too frequently [3]. Both schemes are further characterized by the policy used to organize the task queues (e.g. round-robin or by priority), as well as scheduling parameters such as time slices or affinities that restrict a task to a subset of cores.

Combined, there is a large number of scheduling options that determine overall real-time behavior. Complex interactions and the highly dynamic nature of the system make static analysis difficult. Instead, designers rely on simulations to validate software and optimize the implementation for specific application requirements like real-time constraints. With ever growing complexities and software content, fast and accurate simulations of multicore platforms are therefore essential. Traditional micro-architecture or instruction set simulators (ISS) can be very accurate but tend to be slow due to the high level of detail and fine granularity required for modeling interactions, especially in a multicore or multi-processor context. In this paper, we introduce a novel host-compiled multicore real-time software simulator. Our main objective is to address the need for rapid yet precise design space exploration of embedded multicore software at early stages of the design process.
The proposed approach uses a fast, high-level host-compiled application model that incorporates an abstract real-time OS (RTOS) to accurately simulate software execution in a multicore context. The model integrates into existing transaction-level modeling (TLM) backplanes using standard system-level design languages (SLDLs) for co-simulation with external hardware and the rest of the system, e.g. in a multi-processor context. Furthermore, the model is parametrizable, and software developers can easily evaluate the effect of a wide range of design decisions on the real-time performance of their multicore software applications.

The rest of the paper is organized as follows: in the following subsections, we review related work and present an overview of our proposed simulator. Section II then describes the supported programming model for hosting applications in the simulator. Details of our abstract RTOS model are discussed in Section III, and we evaluate the proposed approach on an industrial-strength design example and random task sets in Section IV. Finally, Section V concludes the paper with a summary and outlook.

978-3-9810801-7-9/DATE11/2011 EDAA

A. Related Work
Traditionally, software is simulated using implementation-level micro-architecture or interpreted instruction-set simulators (ISSs) [4]. Such models can reach cycle accuracy at the expense of slow simulation speeds. More recently, virtual platform simulators using binary translation have become popular [5][6]. Such code morphing ISSs can provide significant speedups (reaching simulation throughputs of several hundred MIPS), but only focus on fast functional simulation with limited or no timing information. At an early and abstract level, several real-time simulation environments allow evaluation based on idealized task delay models without execution of actual task functionalities [7][8]. Other approaches evaluate different multicore scheduling strategies directly on a real OS and architecture [9][10]. However, such low-level implementations are time-consuming and labor-intensive.

Host-compiled (sometimes also called native or computational TLM) approaches have recently received widespread attention as solutions that can provide both fast and accurate simulations [11][12]. In such approaches, application code is natively compiled and executed on the simulation host to achieve the fastest possible functional simulation.
The code is further back-annotated with timing estimates that are typically obtained by compiling to an intermediate representation [13][14]. Finally, back-annotated code is wrapped into an abstract model of the software execution environment built and integrated on top of a proprietary or standard SLDL environment, such as SystemC [15] or SpecC [16].

Some of the earliest host-compiled approaches were centered around models of the OS itself [17][18]. Later, these approaches were extended into complete processor models that include timing-accurate descriptions of interrupt chains and TLM-based bus interfaces [19][20]. Such processor models have been shown to simulate at speeds beyond 500 MIPS with more than 95% timing accuracy. In all cases, however, existing models are for single-core processors only. To the best of our knowledge, no host-compiled real-time multicore OS and processor modeling approach currently exists.

B. Host-Compiled Multicore RTOS Simulator
We propose a high-level, host-compiled multicore RTOS simulator that is based on the approach presented in [17]. Fig. 2 illustrates the structure of the simulator, which is constructed in a layered fashion. At the highest level, the user application consists of a set of sequential and concurrent tasks that are controlled by and interact with an underlying OS model through an abstract OS interface. We extend existing approaches by developing a novel multicore SMP OS model that supports both partitioned and global queue structures. In both cases, the OS layer manages the scheduling and dispatching of application tasks across available queues and cores. In doing so, the OS model wraps around the basic SLDL event handling mechanism, replacing SLDL primitives with calls to the OS interface and ensuring that at any time only as many SLDL threads are active as there are cores. The complete simulator is developed on top of an SLDL simulation kernel, which provides basic concurrency and event handling models and can be integrated with standard TLM libraries to provide an environment for fast system-wide co-simulation. In the following sections, we will describe each of the layers of our simulator in more detail.

(a) Partitioned queue structure (b) Global queue structure
Figure 2. Host-compiled multicore RTOS simulation.

II. MULTI-TASK APPLICATION MODEL
We provide a simple yet powerful programming model to mount applications on top of our simulator and the underlying C-based SLDL (Fig. 3). As mentioned before, an application model is composed of concurrent and sequential high-level SLDL processes. Tasks can communicate and synchronize with each other using high-level inter-process communication (IPC) primitives provided by the standard SLDL channel library. For external communication with the rest of the system, tasks can access externally provided TLM bus interfaces. Internally, task code is provided in standard C-based form. For timing accuracy, we assume that the execution delays of the task functionality are back-annotated from measurements or estimations.

User application tasks are integrated into and access services of the OS model via an abstract OS interface shown in Fig. 4. Fig. 3 demonstrates the general structure of a multi-task application and shows how a software designer utilizes the OS interface to run the application on the multicore simulator. The OS interface thereby provides the facilities to configure the OS and integrate an application with APIs for OS initialization and startup, task management, execution delay modeling, and event synchronization.

During the system startup phase, the Init() method initializes the OS model data structures and defines the OS parameters, such as the number of cores (see Section III). The Start() method triggers multicore scheduling after all tasks have been attached to the model. Tasks are added to the OS kernel by calling TaskCreate(), which allocates an internal representation inside the OS. At the start of simulation, task threads are spawned by the SLDL and register themselves with the OS via a call to the TaskActivate() method at the beginning of their execution. This allows the OS model to collect all threads and enter them into the scheduler. At the end of their execution, tasks remove themselves from the OS kernel by calling TaskTerminate(). In addition, tasks can suspend themselves (TaskSleep()) or can be resumed or killed by other tasks (TaskResume() and TaskKill()). Finally, if supported by the underlying SLDL, tasks can fork children and temporarily remove themselves from OS scheduling (ParStart()) until all children are collected on the SLDL level (ParEnd()).

During creation of tasks, the designer can specify task properties and characteristics. Specifically, the supported task parameters are:
1) Affinity: a bitmap representing the set of cores allowed to execute the task;
2) Initial Core: the initial core for the task in a partitioned queue structure;
3) Priority: the priority level of the task;
4) Time Slice: the time interval the task is allowed to execute without preemption by other tasks with the same priority.

During execution, tasks communicate with the OS interface to model delays and synchronization points. The designer utilizes TimeWait() methods for modeling task delays, which in turn define possible preemption points. Likewise, PreWait() and PostWait() methods are used to encapsulate regular wait-for-event statements that implement event and task synchronization. Similar to forking and joining, these methods remove and return the task from/to OS scheduling while it waits for an external SLDL event that can in turn be notified by another task through regular SLDL mechanisms. Note that the designer will typically not have to deal with event handling directly. Instead, we provide a reimplementation of the SLDL channel library that is properly hooked into the OS.

III. ABSTRACT MULTICORE OS MODEL
/* OS initialization and startup */
void Init(OSParam param);
void Start(void);
/* Task management */
proc TaskCreate(TaskParam param);
proc ParStart(proc tid);   /* fork */
void ParEnd(proc tid);     /* join */
void TaskActivate(proc tid);
void TaskSleep(proc tid);
void TaskResume(proc id);
void TaskTerminate(proc tid);
void TaskKill(proc tid);
/* Delay modeling and event handling */
void TimeWait(long long nsec, proc p);
proc PreWait(proc tid);
void PostWait(proc me);
Figure 4. Multicore OS interface.
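To make the four task parameters of Section II concrete, the following Python sketch models them as a plain record and decodes the affinity bitmap. It is illustrative only: the names TaskParam, allowed_cores, and validate, the field names, the rule that the initial core must lie within the affinity set, and the convention that a smaller number means a higher priority are assumptions, not part of the OS interface in Fig. 4.

```python
from dataclasses import dataclass
from typing import List

INFINITE_SLICE = -1  # mirrors the paper's "infinite (-1) time slice" marker


@dataclass
class TaskParam:
    """Hypothetical container for the four task parameters."""
    name: str
    affinity: int       # bitmap: bit i set means core i may execute the task
    initial_core: int   # only meaningful for a partitioned queue structure
    priority: int       # assumption: smaller value = higher priority
    time_slice_ns: int  # INFINITE_SLICE disables time slicing


def allowed_cores(param: TaskParam, num_cores: int) -> List[int]:
    """Decode the affinity bitmap into a list of core indices."""
    return [core for core in range(num_cores) if param.affinity & (1 << core)]


def validate(param: TaskParam, num_cores: int) -> List[int]:
    """Reject parameter sets that could never be scheduled (assumed rule)."""
    cores = allowed_cores(param, num_cores)
    if not cores:
        raise ValueError(f"{param.name}: affinity selects no existing core")
    if param.initial_core not in cores:
        raise ValueError(f"{param.name}: initial core outside affinity set")
    return cores


mp3 = TaskParam("mp3", affinity=0b01, initial_core=0,
                priority=1, time_slice_ns=INFINITE_SLICE)
jpeg = TaskParam("jpeg", affinity=0b11, initial_core=1,
                 priority=2, time_slice_ns=500_000)
```

With two cores, validate(mp3, 2) pins the MP3 task to core 0, while the Jpeg task may run on either core.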
We have designed and developed a general model of a multicore RTOS with comprehensive configuration through task and OS parameters. The designer can adjust the OS model for a desired application to meet system requirements. The OS model parameters are as follows:
1) Number of Cores: Our simulator supports a configurable number of cores in the OS model.
2) Scheduler Structure: We support both partitioned and global ready queue structures.
3) Scheduling Policy: In general, the designer can assign different priorities and time slices to each task. For first-come, first-served (FCFS or FIFO) scheduling, tasks are assigned the same priority and an infinite (-1) time slice. For round-robin (RR) scheduling, tasks with the same priority are assigned a fixed time slice instead.
Overall, our OS model helps designers to evaluate the real-time behavior of applications across different scheduling policies and facilities.

A. OS Scheduler
At the core of the OS model is the multicore scheduler, the body of which is shown in Fig. 5. The scheduler is an internal function of the OS model and is called by the OS interface methods whenever a task switch is possible or required. The main functionality of the scheduler is to retire the currently active task on a core, if any, and place it at the proper position in the right ready queue. Note that suspended tasks, e.g. when waiting for an event wrapped via Pre/PostWait(), will have been previously removed from all cores and ready queues, placing them in a wait state and a separate wait queue instead. We utilize a time slice notion to model FIFO or RR scheduling of tasks that have the same priority. When the RTOS runs the scheduler for a specific core, it calculates the remaining time slice of the currently active task on that core (lines 4-5 in Fig. 5). Then, the current task is moved to the corresponding ready queue based on the new value of the time

1  function Scheduler(int CoreID):
2    QueueID = GLOBAL_SCHEDULING ? 0 : CoreID;
3    if ((oldCurrent = Current[CoreID]) != NIL) then
4      TaskList[Current[CoreID]].TimeSlice -=
5        (now() - TaskList[Current[CoreID]].StartTime);
6      TaskList[Current[CoreID]].State = READY;
7      if (TaskList[Current[CoreID]].TimeSlice == 0) then
8        FIFO(&ReadyQueue[QueueID], Current[CoreID]);
9      else
       ...
     Dispatch(CoreID);
     if (oldCurrent) Wait4Sched(oldCurrent);
Figure 5. Multicore RTOS scheduler.
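The retire step of the scheduler in Fig. 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the ReadyQueue class, the dictionary-based task record, the refill of an expired slice from a hypothetical "quantum" field, the clamping of the slice at zero, and the explicit special case for an infinite slice are all introduced here for illustration. What it takes from the text is the policy itself: a task whose slice has reached zero goes to the back of its priority level, any other task goes back to the front.

```python
from collections import deque

INFINITE = -1  # the paper's marker for an unlimited time slice


class ReadyQueue:
    """Priority-ordered ready queue; smaller number = higher priority (assumption)."""

    def __init__(self):
        self.levels = {}  # priority -> deque of task names

    def push_back(self, prio, name):
        self.levels.setdefault(prio, deque()).append(name)

    def push_front(self, prio, name):
        self.levels.setdefault(prio, deque()).appendleft(name)

    def pop_first(self):
        """Return the first task of the highest non-empty priority level."""
        for prio in sorted(self.levels):
            if self.levels[prio]:
                return self.levels[prio].popleft()
        return None


def retire(task, now, queue):
    """Charge the elapsed run time to the task's slice and requeue the task."""
    if task["slice"] != INFINITE:
        task["slice"] = max(0, task["slice"] - (now - task["start"]))
    if task["slice"] == 0:
        task["slice"] = task["quantum"]  # assumption: refill for the next round
        queue.push_back(task["prio"], task["name"])   # behind its siblings (RR)
    else:
        queue.push_front(task["prio"], task["name"])  # resumes before its siblings
```

For example, a round-robin task that has used up its full slice is queued behind a ready sibling of equal priority, whereas a task with an infinite slice that is retired early stays ahead of its siblings, which yields FIFO behavior.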
function Dispatch(int coreID):
  LoadBalance(coreID);
  if (!Empty(&ReadyQueue[coreID])) then
    ActiveTaskID = GetFirst(&ReadyQueue[coreID]);
    Current[coreID] = ActiveTaskID;
    TaskList[Current[coreID]].State = RUN;
    SendSched(ActiveTaskID);
  else
    Current[coreID] = NIL;
  endif
(a) Partitioned queue.
B. OS Kernel
The abstract model of the complete OS kernel implements OS interface methods by calling the scheduler at appropriate preemption points and task switch events. The TimeWait() method models task delays by calling the underlying SLDL to advance time, followed by a call to the OS scheduler. The latter gives any higher priority tasks that are asynchronously triggered (e.g. by a different core or an external interrupt) a chance to preempt the currently running task. The PreWait() and TaskSleep() methods remove the current task from all cores and ready queues, put it into a wait or sleep state and queue, and call the OS scheduler to assign a new task to the now idle core. At the other end, the PostWait() and TaskResume() methods return the calling or given task back into the ready queue after it has been externally woken up. From there it will be picked up for scheduling on the next scheduler call. Finally, both TaskActivate() and TaskSleep() suspend the calling task and wait until the SendSched() function activates the schedule event of the corresponding task.

IV. EXPERIMENTAL RESULTS
  ...
  ReadyQueue_Mutex.Release();
(b) Global queue.
Figure 6. Multicore task dispatcher.
To validate our models and demonstrate their efficiency for fast and accurate design space exploration, we have applied our approach to a set of randomly generated artificial tasks and an industrial-strength design example. For our experiments, we implemented models in SpecC, but we are also in the process of transferring results to other SLDLs, such as SystemC.

A. Random task sets
To evaluate the speed and accuracy of the proposed models, we compared the execution behavior of artificial task sets running on our processor model and on a virtual reference platform. We evaluate the response time of randomly generated periodic tasks on a dual-core MIPS34Kc Malta platform running a 2.6.24 Linux SMP kernel, which realizes a partitioned queue scheduling scheme and is configured with preemption and high resolution timers. We compare our model of this platform against a reference ISS running the actual Linux binary [6]. Following the setup presented in [9], task periods are uniformly distributed over [10, 100] ms, while task utilizations are distributed over [0.001, 0.1], [0.1, 0.4], and [0.001, 0.4], i.e. in the small (S), large (L) or whole/medium (M) range of execution delays. For each task set, we generate tasks until a maximum is reached or the core utilization falls into the range [0.3, 0.6), [0.6, 0.8) or [0.8, 1) for light, medium, or heavy task loads, respectively. Task priorities were assigned inversely to their periods following a rate-monotonic scheduling strategy. Actual task delays were measured on the reference simulator and back-annotated into our model at different levels of timing granularity.

Generated task sets and resulting modeling accuracies are summarized in Tables I and II and Fig. 7. Model error was measured as the average absolute difference in individual task response times over all tasks and task iterations.
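The task set generation described above can be sketched in Python as follows. This is a hedged reconstruction of the procedure from its description only: the function name generate_task_set, the decision to redraw a task that would overshoot the target load range, the bounded retry count, and the numbering of priorities (0 = highest) are assumptions; the period and utilization ranges and the rate-monotonic priority rule are taken from the text.

```python
import random


def generate_task_set(util_range, load_range, max_tasks, rng):
    """Draw periodic tasks for one core until the total utilization is in load_range."""
    load_lo, load_hi = load_range
    tasks, total = [], 0.0
    for _ in range(10_000):  # bounded retries keep the sketch terminating
        if len(tasks) >= max_tasks or total >= load_lo:
            break
        util = rng.uniform(*util_range)
        if total + util >= load_hi:  # assumption: redraw instead of overshooting
            continue
        period_ms = rng.uniform(10.0, 100.0)  # periods uniform over [10, 100] ms
        tasks.append({"period_ms": period_ms,
                      "util": util,
                      "delay_ms": util * period_ms})  # execution delay per period
        total += util
    # Rate-monotonic: the shorter the period, the higher the priority (0 = highest).
    for prio, task in enumerate(sorted(tasks, key=lambda t: t["period_ms"])):
        task["prio"] = prio
    return tasks, total


rng = random.Random(1)
small_light, load = generate_task_set((0.001, 0.1), (0.3, 0.6), 50, rng)
```

Here small_light corresponds to a set of small (S) tasks under a light load; other combinations follow by changing the two ranges.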
Results show that with a timing granularity of 1 μs and 10 μs the average timing error across all task sets is less than 0.4% and 1%, respectively. In general, timing error grows with increasing timing granularity. In addition, the timing error is a function of the number of task switches (Fig. 8), and it is higher for task sets with a large number of small tasks. This is due to
slice. If the time slice reaches zero, the task is added at the end of its priority list, where it will be scheduled after all currently ready tasks with the same priority. Otherwise, it will be placed back at the beginning of the priority list and will be scheduled again immediately, before any other ready tasks with the same priority. Consequently, in RR scheduling the value of the time slice defines the portion of time that every task is allowed to execute without preemption, while setting an infinite time slice value will result in a FIFO schedule.

At the end of the scheduler, an OS-specific Dispatch() function is called to assign a new task to the current core. Fig. 6 shows the implementation of the Dispatch() function for both global and partitioned queue structures. This function selects the highest priority task in the ready queue and assigns it to run on the current core. Ready queues are sorted by task priorities, and tasks with the same priority are arranged based on time slices. In case of partitioned queues (Fig. 6(a)), load balancing is performed before dispatching to optionally migrate tasks between queues, e.g. at regular intervals. In case of a global queue structure (Fig. 6(b)), a semaphore controls all accesses to the shared queue. In both cases, the choice of tasks allowed to be migrated or picked for a given core is further restricted based on task affinities. After selecting a new task, the dispatcher releases it by calling SendSched() to notify an SLDL event associated with the chosen task. After returning from the Dispatch() call at the end of the scheduler, the current task on the given core in turn suspends itself on its own event. Leveraging SLDL events assigned to each task, this implements actual context switches. Note that if no higher priority or other sibling task is available, the current task may simply dispatch itself and be immediately triggered again.
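The selection rule of the dispatcher, including the affinity restriction, can be sketched in Python as follows. This is an illustrative model under assumptions, not the listing of Fig. 6: ready queues are plain lists kept in arrival order, a smaller number means a higher priority, and load balancing, the mutual exclusion around the global queue, and the SendSched() event notification are deliberately left out. The queue selection mirrors line 2 of Fig. 5 (queue 0 for global scheduling, otherwise the core's own queue).

```python
def dispatch(core, ready_queues, current, global_scheduling):
    """Assign the best eligible ready task to `core`; return its name or None."""
    queue = ready_queues[0 if global_scheduling else core]
    # Only tasks whose affinity bitmap includes this core are eligible.
    candidates = [task for task in queue if task["affinity"] & (1 << core)]
    if not candidates:
        current[core] = None  # the core stays idle
        return None
    # min() keeps the earliest-queued task among equal priorities.
    chosen = min(candidates, key=lambda task: task["prio"])
    queue.remove(chosen)
    current[core] = chosen["name"]
    return chosen["name"]


# Global queue: mp3 has the higher priority but may only run on core 0.
jpeg = {"name": "jpeg", "prio": 2, "affinity": 0b11}
mp3 = {"name": "mp3", "prio": 1, "affinity": 0b01}
global_queues = {0: [jpeg, mp3]}
running = {}
first = dispatch(1, global_queues, running, True)   # core 1 cannot take mp3
second = dispatch(0, global_queues, running, True)  # core 0 takes mp3
```

In this example core 1 receives the Jpeg task even though the MP3 task has the higher priority, because the MP3 task's affinity excludes core 1.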
TABLE I. AVERAGE ERROR FOR SETS WITH SMALL TASKS UNDER DIFFERENT TIMING GRANULARITIES.
Task Set  Core ID  # of Tasks  Total Util.  Avg. Task Util.  Avg. Err. (1 μs)  Avg. Err. (10 μs)  Avg. Err. (100 μs)  Avg. Err. (1 ms)
S1        C0       7           0.33         0.047            0.58%             0.54%              0.94%               8.56%
S1        C1       6           0.30         0.050            0.30%             0.29%              0.59%               3.55%
S2        C0       11          0.47         0.043            0.39%             0.33%              1.51%               10.0%
S2        C1       10          0.45         0.045            0.73%             0.68%              1.78%               10.1%
S3        C0       9           0.56         0.062            0.64%             0.57%              1.17%               8.58%
S3        C1       9           0.50         0.056            0.38%             0.35%              0.58%               4.09%
S4        C0       12          0.70         0.058            1.02%             0.88%              2.74%               16.8%
S4        C1       15          0.69         0.046            0.88%             0.81%              1.77%               12.6%
S5        C0       13          0.84         0.064            1.10%             0.98%              2.04%               13.2%
S5        C1       16          0.87         0.054            1.12%             0.98%              3.57%               15.6%

TABLE II. AVERAGE ERROR FOR SETS WITH LARGE AND MEDIUM TASKS UNDER DIFFERENT TIMING GRANULARITIES.
Task Set  Core ID  # of Tasks  Total Util.  Avg. Task Util.  Avg. Err. (1 μs)  Avg. Err. (10 μs)  Avg. Err. (100 μs)  Avg. Err. (1 ms)
L1        C0       3           0.63         0.209            0.14%             0.16%              0.26%               2.40%
L1        C1       4           0.44         0.109            0.09%             0.08%              0.23%               2.91%
L2        C0       3           0.64         0.212            0.08%             0.08%              0.50%               4.70%
L2        C1       2           0.63         0.315            0.14%             0.13%              0.14%               0.63%
L3        C0       3           0.88         0.294            0.07%             0.04%              0.43%               4.30%
L3        C1       3           0.92         0.306            0.11%             0.13%              1.02%               9.05%
M1        C0       4           0.54         0.137            0.65%             0.57%              0.88%               8.4%
M1        C1       4           0.56         0.139            0.14%             0.1%               1.1%                12.7%
M2        C0       4           0.69         0.173            0.25%             0.25%              0.26%               1.44%
M2        C1       4           0.64         0.160            0.55%             0.53%              1.97%               19.3%
M3        C0       4           0.71         0.176            0.35%             0.23%              1.05%               11.4%
M3        C1       3           0.70         0.235            0.16%             0.15%              0.44%               3.16%
M4        C0       3           0.86         0.286            0.12%             0.11%              0.31%               2.75%
M4        C1       3           0.69         0.230            0.09%             0.09%              0.12%               0.81%
TABLE III.
Set  1 μs             100 μs            1000 μs
S1   4.1s (240MIPS)   0.07s (14.3GIPS)  0.03s (33GIPS)
S2   5.0s (200MIPS)   0.08s (12.5GIPS)  0.05s (20GIPS)
S3   6.1s (160MIPS)   0.08s (12.5GIPS)  0.04s (25GIPS)
S4   12.0s (80MIPS)   0.14s (7.1GIPS)   0.04s (25GIPS)
S5   10.5s (90MIPS)   0.16s (6.2GIPS)   0.08s (12GIPS)
M1   7.6s (130MIPS)   0.10s (10GIPS)    0.03s (33GIPS)
M2   7.5s (130MIPS)   0.12s (8.3GIPS)   0.03s (33GIPS)
M3   7.6s (130MIPS)   0.12s (8.3GIPS)   0.04s (25GIPS)
M4   9.3s (110MIPS)   0.16s (6.2GIPS)   0.03s (33GIPS)
L1   5.9s (170MIPS)   0.12s (8.3GIPS)   0.03s (33GIPS)
L2   7.4s (130MIPS)   0.10s (10GIPS)    0.02s (50GIPS)
L3   10s (100MIPS)    0.13s (7.7GIPS)   0.04s (25GIPS)

Figure 7. Average error in average response time of task sets. [Plot: average error [%] over average task utilization for timing granularities of 1 μs, 10 μs, and 100 μs.]
Figure 8. Average timing error over number of task switches. [Plot: average error [%] over number of task switches.]
[Figure 9: response time plot for Linux and the model. Legend: T1 (delay = 1 ms, period = 20 ms): 0.21% avg. err.; T2 (delay = 3.4 ms, period = 50 ms): 0.1% avg. err.]
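The throughput values reported alongside the runtimes in Table III follow from simple arithmetic, sketched below in Python. The function name throughput_mips is hypothetical; the inputs (10 s of simulated time per task set, a nominal rate of 100 MIPS, and about 30 s of wall time on the reference ISS) are the ones stated in the text.

```python
def throughput_mips(simulated_s, nominal_mips, wall_s):
    """Simulated instructions per core (in millions) divided by wall-clock seconds."""
    instructions_m = simulated_s * nominal_mips  # 10 s x 100 MIPS = 1000 M instructions
    return instructions_m / wall_s


s1_fine = throughput_mips(10, 100, 4.1)     # set S1 at 1 us granularity
s1_coarse = throughput_mips(10, 100, 0.07)  # set S1 at 100 us granularity
reference = throughput_mips(10, 100, 30)    # reference ISS, about 30 s wall time
faster_than_real_time = 0.07 < 10           # wall time below simulated time
```

For set S1 this yields roughly 244 MIPS at 1 μs granularity and about 14.3 GIPS at 100 μs granularity, consistent with the rounded table entries.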
context switch delays, which are not yet included in our model. Finally, another source of errors is the non-ideal behavior of the real Linux kernel. Fig. 9 plots the response times of the two highest-priority tasks in the M1 task set on the Linux kernel and our simulator. As can be seen, on the Linux kernel the highest priority task is interrupted at regular intervals by additional, unknown background activities.

Table III shows the simulation runtimes of the models for each task set. We ran each task set for 10 s of simulated time. At a nominal rate of 100 MIPS simulated by the reference ISS, this corresponds to 1000 million NOP instructions on each core for artificial delay loops and any idling. Simulations each ran for about 30 s of wall time on the reference ISS. By contrast, our model simulates such a non-functional, delay-only setup faster than real time with a throughput of more than 1000 MIPS per core (or 2000 MIPS for the whole dual-core system) at timing granularities of 10 μs and above.

B. Industrial-strength example
To demonstrate the benefits of our models for design space exploration, we implemented a cellphone example running concurrent control, MP3 decoding, and Jpeg encoding tasks on a model of a dual-core ARM system running at 100 MHz. Tasks communicate with external hardware and the rest of the system via an AHB bus. Task delays are back-annotated from measurements obtained on a cycle-accurate ISS [21]. We explored both single-core and global and partitioned dual-core
[Figure 10: Jpeg and MP3 frame delays for configurations S1-S5, PQ1-PQ6, and GQ1-GQ5; MP3 deadline (DL) = 26 ms.]

TABLE IV. CELLPHONE EXAMPLE EXPERIMENTAL RESULTS
                                                Avg. / Max. Frame Delay
Group         Config.  Description              Jpeg [ms]    MP3 [ms]     MCPS  Sim. Time
Single Core   S1       MP3 > Jpeg               23.01/25.36  8.84/9.99    500   0.35s
Single Core   S2       MP3 = Jpeg, FIFO         23.01/25.45  23.84/43.43  583   0.30s
Single Core   S3       MP3 = Jpeg, RR(1us)      23.01/25.36  9.42/10.79   514   0.34s
Single Core   S4       MP3 = Jpeg, RR(.5ms)     23.01/25.36  13.16/15.99  530   0.33s
Single Core   S5       Jpeg > MP3               20.70/23.97  28.12/79.69  667   0.32s
Dual Core-PQ  PQ1      1: MP3 > Jpeg            22.97/25.29  8.65/9.72    427   0.35s
Dual Core-PQ  PQ2      1: MP3 = Jpeg, FIFO      22.97/32.25  26.97/43.70  546   0.32s
Dual Core-PQ  PQ3      1: MP3 = Jpeg, RR(1us)   22.97/25.29  9.23/10.61   605   0.29s
Dual Core-PQ  PQ4      1: MP3 = Jpeg, RR(.5ms)  22.97/25.28  13.05/16.22  546   0.32s
Dual Core-PQ  PQ5      1: Jpeg > MP3            19.82/23.01  28.24/115.8  529   0.33s
Dual Core-PQ  PQ6      1: MP3, 2: Jpeg          18.73/21.20  8.52/8.59    724   0.31s
Dual Core-GQ  GQ1      MP3 > Jpeg               18.66/21.21  8.56/9.80    426   0.35s
Dual Core-GQ  GQ2      MP3 = Jpeg, FIFO         18.62/20.81  8.52/8.67    414   0.36s
Dual Core-GQ  GQ3      MP3 = Jpeg, RR(1us)      18.62/21.20  8.52/8.67    438   0.34s
Dual Core-GQ  GQ4      MP3 = Jpeg, RR(.5ms)     18.62/20.90  8.52/8.67    438   0.34s
Dual Core-GQ  GQ5      Jpeg > MP3               18.62/20.81  8.52/8.66    438   0.34s
scheduling structures with different task priorities, core assignments and time slices (FIFO or RR with different intervals). The complete system is modeled in 32,000 lines of SLDL code, and the effects of different scheduling policies on system performance were evaluated using model simulations.

Fig. 10 and Table IV compare average and maximum MP3 and Jpeg frame delays for all possible configurations. Results generally match expectations. When running on different cores (PQ6) or in a global queue (GQx), each task is effectively assigned to a dedicated core and frame delays are consistently minimal. When running on a single (Sx) or the same core (PQ1-PQ5), the MP3 task misses deadlines if its priority is lower than that of the Jpeg encoder (S5 or PQ5). Likewise, a FIFO strategy can lead to unpredictable blocking of the MP3 decoder by the Jpeg task (S2 and PQ2). Since the order of task executions highly depends on the order in which tasks become ready, results can vary widely, e.g. depending on the background load on the corresponding core. By contrast, an RR strategy can provide the necessary fairness in task accesses to a single core. Overall, such effects are hard to predict statically.

All in all, we were able to easily cycle through different system configurations of our model within seconds. Furthermore, system models simulate at an average speed of 520 Mcycles/s. Assuming a nominal CPI of 1, this corresponds to 520 MIPS per core or 1040 MIPS total. We simulated decoding of 55 MP3 frames and encoding of one 640x480 picture. Not counting idle periods and hardware interactions, this translates into roughly 200 million simulated instructions for an average of 610 MIPS. In summary, the models provide rapid feedback for evaluation and analysis of scheduling options.

V. SUMMARY AND CONCLUSION
In this paper, we presented a configurable and high-level host-compiled multicore software simulator that integrates an abstract SMP RTOS model into a complete SLDL- and TLM-based multi-processor and system environment. Simulations are fast and accurate at speeds of more than 1000 MIPS with less than 3% timing error, supporting rapid and early embedded software development and design space exploration. A medium timing granularity of 10 μs thereby seems to represent the best compromise. In the future, we plan to further investigate tradeoffs and improvements in accuracy and speed, e.g. through automatic adjustment of timing granularities, by including OS-internal timing models, and by broadly supporting other scheduling algorithms, such as PFair. Furthermore, we are currently working on expanding the OS model into a complete parametrizable processor simulator that includes models of interrupt chains and caches for evaluation of synchronization, task migration and cache pollution effects.
REFERENCES
[1] G. Blake, R. G. Dreslinski, T. Mudge, "A survey of multicore processors," IEEE Signal Processing Magazine, vol. 26, no. 6, Nov. 2009.
[2] R. Goodman, S. Black, "Design challenges for realization of the advantages of embedded multi-core processors," AUTOTESTCON, Sept. 2008.
[3] S. Lauzac, R. Melhem, D. Mosse, "Comparison of global and partitioning schemes for scheduling rate monotonic tasks on a multiprocessor," Euromicro Workshop on Real-Time Systems, Jun. 1998.
[4] L. Benini, D. Bertozzi, A. Bogoliolo, F. Menichelli, M. Olivieri, "MPARM: Exploring the multi-processor SoC design space with SystemC," Journal of VLSI Signal Processing, vol. 41, no. 2, 2005.
[5] F. Bellard, "QEMU, a fast and portable dynamic translator," USENIX, 2005.
[6] Open Virtual Platforms [online]. Available: http://www.ovpworld.org
[7] Xenomai: Real-Time Framework for Linux [online]. Available: http://www.xenomai.org
[8] VirtualTime: Simulation of Real-Time Systems [online]. Available: http://www.rapitasystems.com/virtualtime
[9] J. Calandrino, H. Leontyev, A. Block, U. C. Devi, J. H. Anderson, "LITMUS^RT: A Testbed for Empirically Comparing Real-Time Multiprocessor Schedulers," RTSS, Dec. 2006.
[10] RTSIM [online]. Available: http://rtsim.sssup.it
[11] J. Schnerr, O. Bringmann, A. Viehl, W. Rosenstiel, "High-performance timing simulation of embedded software," DAC, Jun. 2008.
[12] J. Ceng, W. Sheng, J. Castrillon, A. Stulova, R. Leupers, G. Ascheid, H. Meyr, "A high-level virtual platform for early MPSoC software development," CODES+ISSS, Sep. 2009.
[13] Z. Wang, A. Herkersdorf, "An efficient approach for system-level timing simulation of compiler-optimized embedded software," DAC, Jul. 2009.
[14] Y. Hwang, S. Abdi, D. Gajski, "Cycle approximate retargettable performance estimation at the transaction level," DATE, Mar. 2008.
[15] SystemC [online]. Available: http://www.systemc.org
[16] D. Gajski, J. Zhu, R. Dömer, A. Gerstlauer, S. Zhao, SpecC: Specification Language and Methodology, Kluwer, 2000.
[17] A. Gerstlauer, H. Yu, D. Gajski, "RTOS modeling for system-level design," DATE, Mar. 2003.
[18] H. Posadas, J. A. Adamez, E. Villar, F. Blasco, F. Escuder, "RTOS modeling in SystemC for real-time embedded SW simulation: A POSIX model," DAES, vol. 10, no. 4, Dec. 2005.
[19] A. Bouchhima, I. Bacivarov, W. Yousseff, M. Bonaciu, A. Jerraya, "Using abstract CPU subsystem simulation model for high level HW/SW architecture exploration," ASPDAC, Jan. 2005.
[20] G. Schirner, A. Gerstlauer, R. Dömer, "Fast and Accurate Processor Models for Efficient MPSoC Design," TODAES, vol. 15, no. 2, article no. 10, Feb. 2010.
[21] M. Dales, SWARM 0.44 Documentation. Available: http://www.cl.cam.ac.uk/~mwd24/phd/swarm.html