HC-Sim: A Fast and Exact L1 Cache Simulator With Scratchpad Memory Co-Simulation Support

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

HC-Sim: A Fast and Exact L1 Cache Simulator with Scratchpad Memory Co-simulation Support

Yu-Ting Chen, Jason Cong, and Glenn Reinman


Computer Science Department University of California, Los Angeles Los Angeles, CA 90095, USA

[email protected], [email protected], [email protected] ABSTRACT


The conguration of L1 caches has a signicant impact on the performance and energy consumption of an embedded system. Normally, an embedded system is designed for a specic application or a domain of applications. Performing simulations on the application(s) is the most popular way to nd the optimal L1 cache conguration. However, the simulation-based approach suffers from long simulation time due to the need to exhaustively simulate all congurations, which are characterized by three parameters: the number of cache sets, associativity, and the cache line size. In previous work, the most time-consuming part was to determine the hit or miss status of a cache access under each conguration by performing a linear search on a long linked-list based on the inclusion property. In this work, we propose a novel simulator, HC-Sim, which adopts elaborate data structures, a centralized hash table, and a novel miss counter structure, to effectively reduce the search time. On average, we can achieve 2.56X speedup compared to the existing fastest approach (SuSeSim). In addition, we implement HCSim by using the dynamic binary instrumentation tool, Pin. This provides scalability for simulating larger applications by eliminating the overhead of generating and storing a huge trace le. Furthermore, HC-Sim provides the capacity to simulate an L1 cache and a scratchpad memory (SPM) simultaneously. It helps designers to explore the design space considering both L1 cache congurations and the SPM sizes.

1.

INTRODUCTION

Categories and Subject Descriptors


B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids; C.4 [Performance of Systems]: Performance Attributes

General Terms
Algorithms, Performance

Keywords
L1 cache, Scratchpad memory, Simulation, Cache simulation, LRU, Miss rate, Dynamic binary instrumentation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the full citation on the rst page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specic permission and/or a fee. CODES+ISSS11, October 914, 2011, Taipei, Taiwan. Copyright 2011 ACM 978-1-4503-0715-4/11/10 ...$10.00.

The speed gap between the processor and main memory has been increasing constantly. Caches are utilized as intermediate storage to mitigate the gap. The L1 cache, which is accessed most frequently by the processor, has a signicant impact on the performance and energy consumption of a system. Increasing the cache size can improve locality, leading to a higher cache hit rate and possible performance improvement. However, a large cache will increase the access time and the energy of a cache access. The cache size determination depends on the requirements of system performance and the constraint on energy consumption. An embedded system is designed for a specic application or a domain of applications. Therefore, the performance and energy of the system can be optimized by providing a customized cache hierarchy. One important issue is to nd the optimal conguration for the L1 cache. A cache conguration can be dened by three parameters: the number of cache sets, associativity, and cache line size. Previous work has shown that a correct L1 cache conguration can signicantly improve performance and reduce energy consumption [2, 5, 10, 18]. To nd the most energy/performance efcient conguration, the L1 cache miss rate should be evaluated for each conguration. The distribution of L1 cache miss rates denes the working set of the application and is an important metric used for the performance and energy estimation [2, 5, 18]. It is also very useful for energy optimization of computer systems with heterogeneous [19] or customizable cache designs [2, 26, 6]. To obtain the miss rates of different cache congurations on a target application, two classes of approaches are proposed: simulationbased techniques [14, 27, 9, 18, 28, 13] and analytical approaches [10, 25, 4, 11, 12, 29]. The simulation-based techniques perform exact simulation on every cache access in a given program trace. The hit or miss status of every access is simulated exactly. However, the simulation-based techniques suffer from long simulation time since each conguration should be simulated for the target application. The analytical approaches provide analytical models for cache miss rate estimation. Generally, the analytical approaches are fast but inaccurate. In this work we focus on improving the performance of simulationbased techniques. Two reasons lead to the long simulation time of exact techniques. First, exhaustive simulation on each cache conguration is required. In [24, 14], the researchers discovered and applied the inclusion property to simulate multiple cache congurations simultaneously. This property species that when a cache hit occurs in a cache, then cache hits occur in all larger caches. For example, if a hit occurs in a fully associative cache with two cache blocks, then hits will denitely occur in a fully associative cache with more than two blocks. Therefore, multiple congurations can be simulated in one run, and this reduces the number of

295

runs to simulate all congurations. Second, based on the inclusion property, the hit/miss status of a cache access of multiple cache congurations can be simulated by using a stack [24]. For each cache access, a linear search is required to check if the tag of the access is still in the stack. If not, cache misses occur. The process of the linear search may be time consuming since it depends on the total stack distance traversed by all accesses. In [28, 13], the authors developed efcient techniques to reduce the search space and thus improve simulation performance. The scalability of the cache simulator is another important issue. The implementations of previous simulation-based techniques rely on a given trace le as an input [9, 18, 28, 13]. However, if the target application has a long execution time, the generated trace will be huge. The storage of memory trace consumes tremendous disk space. For example, the trace le of a medical imaging application, riciandenoise, has 3.5 billion data cache accesses, requiring 42.3GB to store the trace. For some SPEC2006 benchmarks [17], such as h264re f , the trace le is estimated to be 1.1TB. For applications with predictable access patterns, scratchpad memory (SPM) can used to achieve better performance and energy consumption of an embedded system [6, 3, 20, 7]. The predictable patterns can be identied by programmers or compilers to provide optimized codes. Compared to a cache, SPM does not need to perform expensive associative way driving and tag comparisons and hence is more energy-efcient. The recent NVIDIA Fermi GPU [1] also provides a mechanism to congure the ratio between L1 cache and SPM. However, currently no tool can efciently simulate the hybrid memory system with L1 cache and SPM. The co-simulation is achieved by integrating SPM into the memory system in a full system simulator, which suffers from long simulation time [7]. In this paper we propose HC-Sim (Hashing-based Cache Simulator), which efciently simulates multiple L1 cache congurations simultaneously and supports fast co-simulation for L1 caches and SPM. The contributions are summarized as follows. (1) We propose novel data structures and algorithms, which incorporates a centralized hash table and a novel miss counter structure to avoid time-consuming tag searches on the stacks. Compared to SuSeSim [13] and CRCB algorithm [28], HCSim results can be up to 5.21X and 13.73X faster, respectively. On average, HC-Sim is 2.56X and 5.44X faster than SuSeSim and CRCB algorithm. Furthermore, we shows that the efciency of HC-Sim can be improved by using misscounter-based structure for a group of applications. (2) To enhance scalability of HC-Sim, we build our simulation framework on the dynamic binary instrumentation tool, Pin [22]. The trace generation is embedded into HC-Sim and is performed on-the-y to avoid the overhead of a huge trace le. (3) To observe the interaction between a L1 cache and SPM, such as the miss rates and the distribution of memory accesses on L1 caches and SPM, fast cache and SPM co-simulation is needed. This is non-trivial since the implementation of HC-Sim is based on Pin and only instruction-level information can be obtained. Here, we provide an interface for designers to specify the address range of SPM so that the SPM accesses can be bypassed and the correct miss rates of L1 caches can be simulated. We also provide a mechanism to lter out prefetching loads, which arise from SPM prefetching, to maintain the correctness of L1 cache simulation.

2. 2.1

BACKGROUND Terminology

A cache conguration is determined by three parameters: the number o f cache sets (s), associativity (a), and cache line size (b). The cache size is the multiple of the three parameters (s a b). The largest settings of the number of cache sets, associativity, and cache line size are denoted as S, A, and B respectively. We assume that the three parameters can only be set to powers of two. Hence, to nd the optimal conguration, (log2 (S) + 1) (log2 (A) + 1) (log2 (B) + 1) congurations should be simulated. For a memory reference, the address can be divided into three elds: a tag, an index, and an o f f set as shown in Figure 1. The index is used to indicate in which cache set the datum is stored. We can use log2 (s) bits for the index. In a set-associative cache, a datum can be stored in any cache way of the cache set. Therefore, tag comparison is performed to nd the datum. Typically, the cache line size is larger than the size of one datum. The offset is used to retrieve the datum from the cache line. log2 (b) bits are required for the offset. Hence, in a 64-bit system, the tag eld contains (64 log2 (s) - log2 (b)) bits. The block address contains the elds of the tag and the index of a reference.

Figure 1: The Address Fields of a 64-Bit Memory Reference

2.2

Inclusion Property

The inclusion property can be used to achieve fast simulation in multiple set-associative caches with LRU replacement policy [24, 14, 18, 28, 13]. The inclusion property holds when the cache line sizes of these set-associative caches are the same and no prefetching is performed. We summarize the inclusion property into the following two cases. First, for caches with the same number of sets (s) and different associativity (a), whenever a hit occurs in a cache, hits are then guaranteed in all caches with larger associativity [24, 14]. For example, if a hit occurs in a cache where s = 2 and a = 1, it is then guaranteed that the same hit occurs in the caches with s = 2 and a > 1. This can be used to simulate associative caches simultaneously as described in Section 2.3. Second, for caches with the same associativity and different number of sets, if a hit occurs in a cache, then all caches with a larger number of sets guarantee the same hit [14, 18]. For example, if a hit occurs in a cache with s = 2 and a = 1, it is then guaranteed that hits occur in the caches with s > 2 and a = 1. Based on the property, the forest simulation is proposed to simulate direct-mapped caches [14] and set-associative caches [18] as reviewed in Section 2.4 and Section 2.5.

2.3

Stack Simulation for Associative Caches

By utilizing the inclusion property, the authors in [24] showed that a stack can be used to model multiple associative caches/memories. Figure 2(a) shows the linked-list structure used to simulate the stack behavior. A linked-list with four nodes can be used to simulate from 1-way to 4-way associative caches simultaneously. Note that we consider that all associative caches are in the same cache set. Each node in the list represents a cache way and stores the tag of a cache access. The most recently accessed cache access is stored in the head while the second most recently accessed on is stored on the second node and so on. Only the four most recently accessed

296

ones can be stored in the stack. Here, we dene the stack distance of a node to be the distance between the head and the node. To nd a node in the list by linear search, the stack distance is that the number of nodes that must be traversed. Here, we illustrate the process of stack simulation with an example. Figure 2(b) shows the situation after a cache access sequence with addresses {0110, 1001, 1010, 1000} is performed. The initial tags stored in the linked-list are {1000, 1001, 0010, 1100}. Here, we assume the cache line size is one byte and the number of sets is one. When the rst access 0110 comes in, a linear search is performed on the linked-list. Since no matched tag is found, misses occur at all associative caches. For the second access 1001, a mapped tag is found in the third node. Based on the inclusion property, hits occur at the 3-way and 4-way associative caches while misses occur at 1-way and 2-way associative caches. After that, 1010 and 1000 are processed and their corresponding status are shown in Figure 2(b). The total number of misses for 1-way to 4-way caches are four, four, three, and two respectively.

2.5

Forest Simulation for Set-Associative Caches

To simulate set-associative caches in one run, the researchers in [18] proposed a data structure based on the forest simulation framework described in Section 2.4. In [18], the tag entry is replaced by a linked-list described in Section 2.3. The linked-list is used to simulate a group of associative caches. Figure 4 shows the data structure used to simulate the caches where 0 s 8 and 0 a 4. If we want to simulate set-associative caches with multiple cache line sizes, we can replicate the structure for multiple copies.

Figure 3: Forest Simulation for Direct-Mapped Caches The authors proposed an algorithm upon this data structure to obtain miss rates for caches [18]. The algorithm can be briey explained as follows. For each access in the trace, we traverse the tree from the root to the leaves in a top-down scheme. Note that only one of the nodes in each tree level is traversed because a cache access can only be mapped to one cache set in a set-associative cache. For example, the 1001 reference will be mapped to the tree node with set number 01 in the third tree level. After that, a linear search is performed on the linked-list pointed to by the cache set 01. If a matched tag is found at the n-th node of the linkedlist, hits occur on caches with associativity larger than or equal to n. Finally, we need to update the order of the linked-list to maintain the LRU property. Note that the cache miss counters need to be updated if misses occur on corresponding cache congurations.

Figure 2: LRU Stack Simulation

2.4

Forest Simulation for Direct-Mapped Caches

Initially, forest simulation was proposed to simulate direct-mapped caches [14]. As mentioned in Section 2.1, we use log2 (s) bits to encode the index eld of a memory reference. We can use 1-bit to encode a direct-mapped cache with two sets, which are encoded by 0 and 1. For a cache with four sets, we can use 00, 01, 10, and 11 to encode them. It is natural to represent a group of direct-mapped caches of different sizes in a binary tree, as shown in Figure 3(a). The root of the tree represents a direct-mapped cache with one cache set while two nodes in the second level represents a cache with two sets and so on. In Figure 3, a group of directmapped caches with one, two, four, and eight sets are represented in a four-level binary tree. Figure 3(a) also shows an intermediate state of the tree. Each node has an entry to store a tag. We assume the cache line size to be one byte. In forest simulation, the tree traversal starts at the root and proceeds to the leaves. However, only one node in each tree level will be traversed since an access can only be mapped to one cache set. First, suppose the next address accessed is 1001, where a hit occurs at the root. Based on the inclusion property, we do not need to traverse the tree any more since hits are guaranteed in all larger caches with two, four, and eight sets. Therefore, tag updates are not required. After that, the next access is 1100, a miss occurs in the root and further traversal of the tree should be performed until either the tag is found or the leaf of the tree is arrived at. Since a matched tag is not found until the node 100, we can conclude misses occur in direct-mapped caches with one, two, and four sets. The updated state of the tree is shown in Figure 3(b).

Figure 4: Forest Simulation for Set-Associative Caches Assume that there are N accesses in the trace. For each access, it takes at most O(log2 (S)) time, which is equal to the height of the tree, to perform the tree traversal. The time complexity to perform linear search in a linked-list is O(A) in the worst case if no matched tag is found. Note that we assume that the associativity can only be powers of two. Therefore, log2 (A) counters are required for each linked-list. To update the miss coun-

297

ters, it takes O(log2 (A)) for each linked-list. The time complexity is O(N + N (log2 (S))(log2 (A)) + N (log2 (S))A), which is equal to O(N (log2 (S))A) [18].

2.6

Strategies for Speedup

Based the framework described in Section 2.5, the researchers presented strategies utilizing the inclusion property to speed up the simulation [28, 13]. In [28], the researchers provided two enhancements which reduced the total number of nodes to be traversed in the linked-lists. First, if a hit occurs in the head of the linked-list with s = i, then we do not need to perform simulations for caches with s > i since hits occur at all congurations. Second, if the block addresses of two consecutive cache accesses are the same, hits occur for all cache congurations. In [13], the authors had two other observations. First, the inclusion property can also be interpreted in another way. If a miss occurs in caches with s = i for all associativity, then misses will occur in all smaller caches with s < i for all associativity. A bottom-up tree traversal scheme can be developed according to the observation. This removes lots of unnecessary tag searches on the linkedlists, and thus reduces the simulation time. Second, we assume the largest setting of associativity is A. If a matched tag is found in X -th way where X > (A/2), then cache hits can only occur after X -th way for caches with a smaller number of cache sets. Therefore, by performing linear search from the tail to the head, more unnecessary searches can be removed.

this does not improve the simulation efciency compared to linear search. To improve the efciency, the marker pointers are used to point to the boundaries of markers. For example, the marker pointer of size = 22 is pointed to the fth node which follows the last node with marker 2, as shown in Figure 5(a). When a cache access occurs, the marker pointers and the corresponding hit counter should be updated. Consider the address of the next access as 110111, Figure 5(b) shows the update of the data structures. First, since 110111 is found in the hash table, the node in the LRU stack can be found by the pointer. The node should be moved to the head of the stack and the marker should be set to 0. Second, the marker pointers of 20 and 21 should be moved upward by one node such that the marker pointers point to the correct places. The marker pointers are moved from 1010 and 0110 to 1011 and 1010 respectively. Also, the markers of the node 1011 and 1010 should be set to 1 and 2 respectively. Third, the hit counter of 22 should be incremented by 1. Based on the inclusion property, the hit counter of 23 should be also incremented to record the hit. However, we can save the increment to improve simulation efciency. The total number of hits of a specic cache size can be calculated by accumulation after all cache accesses are processed.

3. 3.1

HC-SIM Motivation

We observe that the most important factor that leads to the long simulation time is the linear search performed on the linked-lists. If we can improve the efciency of searching, the simulation time can be signicantly reduced. However, even if one can replace the linked-lists with alternative data structures to improve searching efciency, it is still costly to update the linear order for every cache access. The linear order is naturally maintained by the LRU stacks, i.e. the linked-lists. Therefore, we focus on improving the searching efciency of the linked-lists. A hash table is one of the data structures that can be used to improve the searching efciency. In [21], the researchers provided a data structure utilizing a hash table to simulate a memory trace. Figure 5 shows the data structure. A hash table is used to nd the corresponding memory reference in the stack. The key in the hash table is the address of the memory reference, while the value of the hash table is a pointer used to point to the corresponding reference in the stack. With the aid of the hash table, a memory reference can be found in constant time. However, to obtain the miss rates of different memory congurations, the stack distance of a node should be found in constant time as well. Then, we can use counters to record the hit information for a cache conguration. Figure 5(a) shows that the marker is added as a eld in each node of the stack. Normally, a designer only considers memory congurations to be power-oftwo sizes. Therefore, we assume that the stack sizes can only be powers of two. A marker represents the relative stack distance of a node instead of the distance from the top of the stack. For example, the value of the marker of the fth element in the LRU stack is 3, which means the node can be found in the stack only when the stack size is greater than or equal to 23 . The node cannot be found when the stack size is less than or equal to 22 . Whenever a new memory reference is processed, the marker elds are required to be updated. It can be naively done by updating the elds from the top of the stack to the node found by the hash table. However,

Figure 5: Hashing-Based Structure for Stack Simulation If the maximum stack distance is A, the time complexity to update marker pointers is O(log2 (A)). Compared to the time complexity of linear search, which is O(A), this method is more efcient. In addition, the hit counter takes O(1) to be updated, which is more efcient than the miss counters used in [18, 28, 13]. Therefore, this method has been used to simulate disk and le-system traces, which have long stack distances [21]. The hashing-base structure can be used to simulate associative caches efciently. However, it is not clear how to extend the proposed data structure to simulate set-associative caches in one run. Simply replicating hash tables for all stacks is not feasible. Considering the forest simulation framework, if we want to simulate set-associative caches with up to 512 sets, there will be 1023 nodes

298

in the forest. Each node requires a hash table, a LRU stack and a counter structure. However, the size of a hash table depends on the number of distinct cache accesses appearing in the hash table, which may be large in some applications. Therefore, it may be infeasible to realize the system due to huge memory demand. We shall introduce an elaborate centralized hash table design, as described in Section 3.2.2, to handle the memory usage problem. Also, the extra data structure modications are required and are discussed in Section 3.2.1.

different levels in the binary tree. The data structures shown in Figure 7 can be used to simulate set-associative caches where b = 1, 0 a 4, and 0 s 8.

3.2
3.2.1

HC-Sim Data Structures


Modications of Data Structures

To simulate associative caches, the size of a LRU stack is the largest setting of associativity. For a cache access, if misses occur for all associative caches, the access will become the head of the linked-list. In the meantime, the original tail of the linked-list should be removed from the linked-list since it is no longer in the LRU stack anymore. Otherwise, the stack will grow continuously without the recycling mechanism. This issue is not mentioned in [21]. Therefore, we require a mechanism to invalidate the pointer in the hash table when the corresponding cache line is no longer available. Figure 6(a) shows a new eld, called reset pointer, which is added to each node of the LRU stack for invalidation use. In this example, we assume the cache line size is four bytes and the number of cache sets is one. Moreover, we use block addresses as the keys of the hash table, as shown in Figure 6(b). It is possible that many references stored in the hash table point to the same node in the stack, as shown in Figure 6(a). In this example, 000110, 000111, and 000100 point to the same node in the stack since their tags and indexes are the same. The design of reset pointers becomes complicated since multiple entries in the hash table require invalidation when there is a miss. One feasible solution is to use a linked-list to model a set of reset pointers of a node in the LRU stack. However, this degrades the performance since O(B) times of invalidation are required. Also, extra memory overhead in O(B) is introduced. In a cache, the granularity of a hit or a miss is based on one cache line. Cache accesses in the same cache line share the same block address and thus the same hit/miss status. Therefore, it is natural to use the block address as the key. One important benet is that the number in entries of the hash table can be greatly reduced, and thus the memory usage is reduced. In Figure 6(b), two redundant accesses are removed from the hash table. Furthermore, the invalidation mechanism is simplied and can be realized by using one reset pointer per stack node.

Figure 6: Modications of Data Structures: (a) Reset Pointers (b) Use of Block Addresses For the forest, we used the modied node with reset pointer to realize the invalidation of a hash-table entry. For each linked-list, a corresponding counter structure is provided for stack distance calculation and thus the number of hits can be updated.

3.2.3

Miss-Counter-Based HC-Sim

3.2.2

Centralized Hash Table Design

Instead of using independent hash tables for all stacks, which results in huge memory overhead, we adopt a centralized hash table structure. Figure 7 shows the data structures used by HC-Sim, which can be divided into three parts: the hash table, the forest, and the counter structures. The key of each entry is the block address, as described in Section 3.2.1. For the value eld, a node pointer array is used to store the node pointers that point to the corresponding nodes in the LRU stacks. Intuitively, it seems that the size of the node pointer array should be equal to the number of tree nodes in the forest. However, the array requires only (log2 (S)) + 1 elements, which are equal to the number of the levels of the binary tree. This is because a cache line can only be mapped to one of the cache sets in any level of the binary tree. For example, the cache access 10011 is mapped to the sets with set number 1, 11, and 011 respectively. 1, 11, and 011 are the set numbers of three

The counter structures are used to record the number of hits or misses for miss rates calculation. Figure 8 shows the process of updating the counters in three different counter structures. In this example, the next coming access 0101101 is found at the fourth node of the linked list. Therefore, hits occur when associativity equals 22 and 23 while misses occur when associativity equals 20 and 21 . Figure 8(a) shows the hit counter structure used in [21]. As mentioned in Section 3.1, we only increment the counter of 22 . The time complexity for the counter update is O(1). In [18, 28, 13], they use a centralized miss counter structure, as shown in Figure 8(b). Since they do not use the marker pointers to point to the boundaries of markers, the counter structure is a two dimensional array that records the total number of misses of each cache conguration. For the counter updates, they simply increment the miss counters of all congurations with cache misses. The time complexity is O(log2 (A)). However, spatial locality exists in most applications. Normally, the desired miss rates for most L1 cache congurations are less than 10%. The total number of hits are an order larger than the the total number of misses. Therefore, the hit counter structure may or may not outperform the centralized miss counter structure. Based on this observation, we design a new miss counter structure as shown in 8(c). In this example, only the miss counter of 21 is required to be incremented. Similarly, the miss rates can be calculated by accumulation. The miss counter structure inherits both advantages of

299

Figure 7: HC-Sim Structure

the O(1) complexity and fewer increments on misses. We demonstrate the statistics of counter updates in Section 5.2. For each cache access, at most O(log2 (S)) LRU stacks are traversed in the binary tree. Therefore, the time complexity of counter updates for the hit counter structure and the miss counter structure is O(log2 (S)) per access while that for the centralized miss counter structure is O((log2 (S))(log2 (A))).

tions. We need to insert the block into the hash table and perform corresponding updates (lines 30-40). Note that the observations of the CRCB algorithm [28] are also applied to avoid unnecessary accesses to node pointers. Algorithm 1 HC-Sim Algorithm 1: H the centralized hash table in HC-Sim 2: array_size the size of a node pointer array 3: prev_x the block address of the previous cache access 4: 5: for each cache access x do 6: block_x = get_block_address(x) 7: /* CRCB2: Two consecutive accesses are in the same cache line */ 8: if block_x == prev_x then 9: Break 10: end if 11: prev_x = block_x 12: if block_x is found at entry h in H then 13: for i = 1 to array_size do 14: /* Cache hit(s) occur(s) in this cache set */ 15: if h.array[i] is valid then 16: if h.array[i] does not point to the head of the stack then 17: Move the node pointed by h.array[i] to the head 18: Increment the miss counter in the corresponding counter structure 19: Update the marker pointers in the corresponding counter structure 20: end if 21: /* Cache miss(es) occur(s) in this cache set */ 22: else 23: Push the new node of x into the corresponding stack 24: Point h.array[i] to the head of the stack 25: Increment the miss counter of largest associativity in the corresponding 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: 38: 39: 40: 41:
counter structure Update all marker pointers in the corresponding counter structure Pop out the tail and invalidate the corresponding node pointer in H end if end for /* Cache misses for all congurations */ else Insert block_x into the hash table for i = 1 to array_size do Push the new node of x into the corresponding stack Point h.array[i] to the head of the stack Increment the miss counter of largest associativity in the corresponding counter structure Update all marker pointers in the corresponding counter structure Pop out the tail and invalidate the corresponding node pointer in H end for end if end for

Figure 8: Counter Structure Design: (a) Hit Counter Structure (b) Centralized Miss Counter Structure (c) Miss Counter Structure

3.3

HC-Sim Algorithm

Algorithm 1 is proposed to simulate set-associative caches based on the HC-Sim data structures. For each cache access, we will rst check to determine if the block address of the access is already in the centralized hash table (line 12). If so, then we check the node pointers in the array to determine the hit/miss status for all congurations. If the node pointer is valid, it means that it points to a node in the corresponding LRU stack and thus a hit is detected. If the node pointer is invalid, a miss is detected. Lines 14-29 show the checking and updating procedure. If the block address does not exist in the hash table, then misses occur for all cache congura-

To compute the time complexity of the algorithm, we assume that there are N cache accesses in the trace. Hence, there will be O(N ) accesses on the centralized hash table. For each access, the

300

node point array is accessed for at most O(log2 (S)) times and thus the complexity of updating miss counters is also O(log2 (S)). However, the number of marker pointers required to be updated may be O(log2 (A)) per linked-list access. Therefore, the time complexity of Algorithm 1 is O(N + N (log2 (S)) + N (log2 (S))(log2 (A))), which is equal to O(N (log2 (S))(log2 (A))). In the worst case, HCSim outperforms previous methods [18, 28, 13] since the complexity of updating marker pointers is O(N (log2 (S))(log2 (A))), which is more efcient than that of performing linear search. The time complexity of linear search is O(N (log2 (S))A), as described in Section 2.5. For the counter updates, the time complexity of HC-Sim is also smaller than that of the previous work [18, 28, 13], which is O(N (log2 (S))(log2 (A))). Note that Algorithm 1 can be slightly modied to be used for a hit-counter-based HC-Sim while the time complexity does not change.

3.4

HC-Sim Implementation

For scalability concerns, HC-Sim is implemented based on the Pin framework. Pin is a dynamic binary instrumentation system that supports x86 architectures. We can place instrumentations on different types of instructions by using Pins APIs. The implementation of HC-Sim can be partitioned into two parts as shown in Figure 9. Trace Generator bypasses the instructions which are not memory accesses. This can be done by placing instrumentations before every load/store instruction. When a memory access is identied by Trace Generator, the address will be sent to Simulation Engine. Next, Simulation Engine simulates the hit/miss status for each conguration and nally gives the control back to Trace Generator. Simulation Engine implements the data structures and the corresponding algorithms described in Section 3.2 and Section 3.3 respectively. In addition, the implementation of the hash table uses the google sparsehash [15], which is an extremely memory-efcient implementation. The overhead is only 2 bits/entry. The memory-efcient feature is important since the memory demand of HC-Sim is proportional to the number of entries in the hash table. The number of entries, which is usually a large number, depends on the number of distinct cache lines occurring in the execution of an application. Therefore, the memory efciency is one of the important factors determining the scalability of HC-Sim.

to obtain the miss rates of different L1 cache congurations, the accesses to SPM should be ltered out. Since the accesses to SPM have an address range which is not overlapping with the range of cache accesses, one feasible solution is to examine the address of each access before the cache simulation is performed. Figure 10(a) shows a code sample of riciandenoise [16] re-written by a programmer for performance optimization by using SPM. Here, we assume that an one-dimensional array is used to model SPM in the program, and the address range of the SPM array is contiguous. The accesses to SPM can be specied by a programmer or a compiler. We provide an interface function, which is called spm_alloc(), for programmers to specify the starting address and the size of the array, as shown in Figure 10(a) (line 72). This function is added to inform HC-Sim the address range of SPM accesses. The same interface is used in [8, 7]. In [7], the interface is integrated with a full system simulator to support co-simulation. Figure 12 shows the design of the co-simulation support for HC-Sim. By intercepting the spm_alloc(), the address range of SPM can be known in advance by Address Filter, as shown in Figure 11. Therefore, SPM accesses can be ltered out, and they do not enter Simulation Engine. Even if the address range is non-contiguous or segmented, the interface function still can be applied to specify the address range of SPM.

Figure 9: Implementation of HC-Sim

Figure 10: (a) An SPM Code Sample of riciandenoise (b) Detection of Prefetching Loads

4.

SPM CO-SIMULATION SUPPORT

In this section we discuss how to add the co-simulation support for scratchpad memory (SPM) into HC-Sim. Our motivation is to evaluate the miss rates of different L1 cache congurations when SPM is provided. The miss rates of caches can provide information for designers to select a suitable L1 cache conguration under a given SPM conguration for a target application. Based on the fast and exact L1 cache simulation framework, we show that the cosimulation can be done accurately and efciently.

Figure 11: Co-Simulation Support for HC-Sim

4.1

SPM Access Handling

4.2

Prefetching Support and Prefetching Loads

In a hybrid cache system, a memory access will be directed to the L1 cache and SPM according to its memory addresses. Therefore,

Prefetching is a common technique to improve the performance of SPM. In Figure 10(a), the expressions in line 6 and line 7 rep-

301

resent that the data in array u should be prefetched into SPM from lower-level cache or main memory for future computation. The expression in line 7 would be translated into instructions that contain one load and one store, as shown in Figure 10(b). First, a load instruction is generated since the data in u should be loaded into the register. After that, a store instruction is generated to store the data loaded from array u. The second store instruction would be ltered out as described previously. However, the rst load instruction should never enter Simulation Engine. It is used to model the behavior of prefetching but it is not a real access on L1 caches. We call these loads prefetching loads. However, it is not trivial to lter out these prefetching loads. One reason is that implementation of HC-Sim is based on Pin. Through binary level instrumentation, we can only observe the instructionlevel information, such as register names or the types of instructions. One way to handle this is by tracking the dependence between the SPM store and the prefetching load. However, another problem is that the prefetching load always occurs before the SPM store. Therefore, we introduce a code buffer to deal with the problem. With the assistance of the code buffer, memory accesses can be temporarily stored. When an SPM store is detected, we can lter out the prefetching load corresponding to that store and then re-simulate the memory accesses stored in the code buffer in program order. Therefore, the correctness of cache simulation can be maintained. Figure 12 shows the co-simulation ow. There is one potential problem for the code buffer mechanism. A prefetching load may be simulated when the buffer is full. However, this case is rare if an adequate size of code buffer is given since the prefetching load and the SPM store are adjacent. For example, only 0.01% of prefetching loads are simulated when we perform simulation on the riciandenoise benchmark. The buffer size is 10000 elements. The simulation error is negligible.

Since the trace is dynamically generated during runtime by the tool itself, the addresses of memory references will be different every time and that makes verication difcult. Therefore, we implement a golden version that takes a trace le as an input for each method, and the golden version is veried by DineroIV [9]. The golden version provides a reference to guarantee the correctness of the implementations on the Pin framework. We simulated 400 congurations for each method on each workload. The range of each cache parameter is described in Table 1. Note that we consider only L1 data cache simulation in this work. However, the proposed method can also be applied to L1 instruction cache simulation. Table 1: Cache Congurations
The Number of Cache Sets (s) = 2i Associativity (a) = 2i Cache Line Size (b) = 2i 0i9 0i9 4i7

As shown in Table 2, the benchmarks we used in this work cover memory-intensive workloads from SPEC2006 [17]. The medical imaging benchmarks are a set of benchmarks which are used to process the image produced by a CT scanner [16]. The benchmarks include applications for image reconstruction, de-noise, deblur, image registration, and image segmentation. These medical imaging benchmarks consists of stencil computation and are dataintensive. The total number of accesses and the simulation time of miss-counter-based HC-Sim for each benchmark are listed in Table 3. In our experiment, the largest trace contains 87.07 billion of accesses, which results in a huge disk demand in terabyte scale. By using HC-Sim, we can eliminate the problem and can perform simulation on an even larger program. Table 2: Workloads
Benchmark SPEC2006 Medical Imaging Applications gcc, mcf, libquantum, h264ref, astar soplex, lbm rician-denoise, rician-deblure, registration segmentation, compressive sensing

Table 3: Trace Size and Simulation Time of HC-Sim(M)


Benchmarks mcf lbm gcc h264ref astar soplex libquantum riciandenoise riciandeblure registration segmentation comp. sensing Num of Accesses 4.84B 3.40B 5.53B 87.07B 64.30B 128.94M 70.14M 3.34B 10.85B 10.46B 6.03B 35.55B Simulation Time (Second) 33712 38526 112523 324017 198116 452 339 33892 61233 49557 30389 139727

Figure 12: Flow of Co-simulation on SPM and L1 Caches

5. 5.1

EXPERIMENTAL RESULTS Simulation Setup and Workloads 5.2

Performance Evaluation

As mentioned in Section 3.4, we implement HC-Sim based on the Pin framework for scalability concerns. Therefore, we also implement SuSeSim [13] and the CRCB algorithm [28] based on the Pin framework to perform a fair comparison. All methods are implemented in C++ and compiled by the g++ compiler with -O3 optimization level. Simulations are performed on an Intel Xeon Quad 2Ghz processor with 8GB main memory.

Table 4 shows the normalized simulation time of HC-Sim, SuSeSim [13], and the CRCB algorithm [28]. Note that the simulation time of SuSeSim, CRCB, and hit-counter-based HC-Sim are normalized to that of miss-counter-based HC-Sim. Compared to hitcounter-based HC-Sim, SuSeSim and CRCB, miss-counter-based HC-Sim can achieve up to 1.27X, 5.21X and 13.73X reduction of runtime respectively. On average, miss-counter-based HC-Sim can

302

run 1.12X, 2.56X and 5.44X faster than hit-counter-based HC-Sim, SuSeSim, and CRCB respectively. Table 4: Normalized Simulation Time
Benchmarks mcf lbm gcc h264ref astar soplex libquantum riciandenoise riciandeblure registration segmentation comp. sensing Average HC-Sim (M) 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 HC-Sim (H) 1.03 1.14 1.10 1.27 1.19 1.25 1.07 1.09 1.05 1.05 1.12 1.09 1.12 SuSeSim 3.74 5.21 2.38 1.70 1.22 1.50 3.05 2.52 4.17 2.28 1.03 1.86 2.56 CRCB1 8.57 13.73 3.33 2.27 1.69 2.04 3.89 7.49 12.47 5.92 1.29 2.58 5.44

system simulator [23] which takes almost eight hours to obtain the result of only one cache conguration with SPM. With HC-Sim, we can evaluate the distribution of miss rates more efciently and thus provide the capacity for an early stage design space exploration with more than 10000X simulation runtime reduction. When the miss rate distribution of L1 caches is the metric we would like to measure, full system simulation is inefcient and is not required.

As mentioned in Section 3.3, the time complexity of HC-Sim is bounded by the marker update, which is O(N (log2 (S))(log2 (A))). The complexities of previous methods [18, 28, 13] are bounded by linked-list traversal, which is O(N (log2 (S))A)). To show the effectiveness of the reduced complexity, we compare the number of marker updates of HC-Sim with the number of traversed list nodes for SuSeSim and CRCB, as shown in Figure 13. Here, we collect data for the cache congurations where 0 s 9, 0 a 9, and b = 64. The data are normalized to the number of marker updates of miss-counter-based HC-Sim. We can observe that the number of marker updates is from 5X to 13X and from 7X to 40X less than the traversed list nodes of SuSeSim and the CRCB algorithm, respectively. Therefore, the number of nodes accessed is signicantly reduced through the hashing-based structure. Even if the constant overhead of a hash table access is large, the time complexity reduction from O(N (log2 (S))A)) to O(N (log2 (S))(log2 (A))) still compensates for the overhead. Figure 13(b) quanties the benet of utilizing miss counters to further improve the efciency of HC-Sim. First, we can observe the number of counter updates of miss-counter-based HC-Sim is from 2.2X to 7.6X less than that of hit-counter-based HC-Sim. Moreover, compared to the centralized miss counter structure used in [28, 13], miss-counter-based HC-Sim reduces the number of counter updates from 2.1X to 5.5X since the time complexity is reduced from O(N (log2 (S))(log2 (A))) to O(N (log2 (S))). Note that we collect the data where the cache line size is equal to 64 bytes.

Figure 13: Normalized Number of (a) Marker Updates and Traversed List Nodes and (b) Counter Updates (b = 64 bytes)

6.

RELATED WORK

5.3

Evaluation of Co-Simulation Support

In this section a case study of riciandenoise is used to evaluate the efciency of co-simulation support for HC-Sim. Figure 14 shows the miss rate distribution of a hybrid cache system with a L1 cache and SPM and a cache-only system. The cache line sizes for both systems are 64 bytes. In the hybrid system, the size of SPM is 4320 bytes as indicated in Figure 10(b). The input size of riciandenoise is an image with 64 64 64 pixels. Figure 14(a)(b) shows the miss rates distribution of both systems. With SPM support, the hybrid cache system has lower miss rate distribution as shown in Figure 14(b). The simulation time of HC-Sim with co-simulation support takes only 122 seconds, while the simulation time for the cache-only system takes 309 seconds. The rst reason is that SPM accesses are ltered out and thus the number of accesses for cache simulation is reduced. Second, with the assistance of SPM, the miss rates are reduced, leading to faster cache simulation. In comparison, recently the authors in [8, 7] evaluated the performance of SPM with a full

Section 2 provided a detail review of simulation-based methods. In this section we discuss analytical approaches for cache miss rate estimation. In contrast to the exact simulation-based methods, the analytical approaches provide a fast estimation of cache misses. The researchers in [10] use system-level simulations to estimate the energy-delay product under different cache congurations. In contrast to exploring the whole cache design space, they use a sensitivity-based analysis to optimize the three cache parameters sequentially. In their work, the optimization procedure is to give initial values of cache line size and associativity and rst optimize the number of cache sets according to the ED metric. Next, the line sizes are optimized under the xed number of sets. Finally, the associativity is optimized. In [12] the cache miss equations (CME) are proposed to represent the cache misses of a loop nest. By analyzing the iteration space and reuse vectors of a loop nest, the CMEs can capture cache misses in a simple loop nest, such as a perfect nested loop, accurately. The cache parameters such associativity, line size, and the number of sets, are treated as inputs for the CMEs. In [29] the authors provided a fast and accurate approach to solving CMEs. By using a sampling technique, a small subset of the iteration space can be analyzed to approximate the miss ratio of each reference.

303

In [4], the authors developed a probability model to estimate miss rates by utilizing the stack distance distribution of the cache accesses in a program. They showed that the miss rate can be estimated accurately with a 104 sampling rate. However, the approach can only be used on fully-associative cache.

dition in Computing Award CCF-0926127). We would like to thank Yi Zou and Hui Huang for providing the benchmarks that utilizing the scratchpad memory.

Figure 14: Miss Rate Distribution of (a) L1 Caches Only (b) L1 Caches + SPM In [11], instead of calculating the cache miss rates for different cache congurations, the authors try to obtain the congurations that satisfy the constraint of a cache miss rate. They provided an exact method for analyzing the trace and nding feasible congurations for different miss rate targets. In general, the estimation approaches may be inaccurate since they do not examine the hit/miss status of each cache access. In addition, some analytical models can be only applied on a subset of caches [4] or a subset of programs [12, 29].

7.

CONCLUSIONS

We propose a fast and exact L1 cache simulator, HC-Sim, to simulate multiple cache congurations in one run. HC-Sim adopts a centralized hash table and supplementary data structures to efciently reduce the search time performed on the LRU stacks. Assuming no prefetching, HC-Sim can simulate multiple caches that adopt the LRU replacement policy simultaneously. For a collection of 12 workloads, HC-Sim can be 2.56X and 5.44X faster than SuSeSim and CRCB algorithm on average. To enhance the scalability, HC-Sim is implemented based on the dynamic instrumentation framework, Pin, to generate traces during runtime. The overhead of huge trace les can thus be avoided. In addition, HC-Sim provides the capacity to perform co-simulation on L1 caches and SPM simultaneously. The miss rates of L1 caches can be efciently calculated with a given SPM conguration. Therefore, designers can efciently explore the design space of a hybrid memory system consisting of L1 cache and SPM.

[1] NVIDIAs Next Generation CUDA Compute Architecture: Fermi (Whitepaper), 2009. [2] D. H. Albonesi. Selective Cache Ways: On-Demand Cache Resource Allocation. In Proc. MICRO, pages 248259, 1999. [3] R. Banakar, S. Steinke, B. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad memory: A design alternative for cache on-chip memory in embedded systems. In Proc. CODES, pages 7378, 2002. [4] E. Berg and E. Hagersten. Statcache: A probabilistic approach to efcient and accurate data locality analysis. In Proc. ISPASS, pages 2027, 2004. [5] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. ISCA, pages 8394, 2000. [6] D. Chiou, P. Jain, L. Rudolph, and S. Devadas. Application-specic memory management for embedded systems using software-controlled caches. In Proc. DAC, pages 416419, 2000. [7] J. Cong, K. Gururaj, H. Huang, C. Liu, G. Reinman, and Y. Zou. An Energy-Efcient Adaptive Hybrid Cache. In Proc. ISLPED, 2011. [8] J. Cong, H. Huang, C. Liu, and Y. Zou. A Reuse-Aware Prefetching Algorithm for Scratchpad Memory. In Proc. DAC, pages 960965, 2011. [9] J. Edler and M. D. Hill. Dinero IV Trace-Driven Uniprocessor Cache Simulator. http://pages.cs.wisc.edu/ markhill/DineroIV/, 1998. [10] W. Fornaciari, D. Sciuto, C. Silvano, and V. Zaccaria. A design framework to efciently explore energy-delay tradeoffs. In Proc. CODES, pages 260265, 2001. [11] A. Ghosh and T. Givargis. Analytical design space exploration of caches for embedded systems. In Proc. DATE, pages 650655, 2003. [12] S. Ghosh, M. Martonosi, , and S. Malik. Cache miss equations: a compiler framework for analyzing and tuning memory behavior. ACM Transactions on Programming Languages and Systems (TOPLAS), 21(4):703746, 1999. [13] M. S. Haque, A. Janapsatya, and S. Parameswaran. Susesim: A fast simulation strategy to nd optimal l1 cache conguration for embedded systems. In Proc. CODES+ISSS, pages 295304, 2009. [14] M. D. Hill and A. J. Smith. Evaluating associativity in cpu caches. IEEE Transactions on Computers, 38(12):16121630, 1989. [15] http://code.google.com/p/google-sparsehash/. Google Sparsehash. [16] http://www.itk.org/ItkSoftwareGuide.pdf. ITK Software Guide. [17] http://www.spec.org/cpu2006. SPEC Benchmark, 2006. [18] A. Janapsatya, A. Ignjatovi c, and S. Parameswaran. Finding optimal l1 cache conguration for embedded systems. In Proc. ASPDAC, pages 796801, 2006. [19] X. Jiang, A. Mishra, L. Zhao, R. Iyer, Z. Fang, S. Srinivasan, S. Makineni, P. Brett, and C. R. Das. Access: Smart scheduling for asymmetric cache cmps. In Proc. HPCA, 2011. [20] M. Kandemir and A. Choudhary. Compiler-directed scratch pad memory hierarchy design and management. In Proc. DAC, pages 628633, 2002. [21] Y. H. Kim, M. D. Hill, and D. A. Wood. Implementing stack simulation for highly-associative memories. In Proc. SIGMETRICS, pages 212213, 1991. [22] C. K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proc. PLDI, pages 190200, 2005. [23] M. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill, and D. Wood. Multifacets General Execution-Driven Multiprocessor Simulator(GEMS) Toolset. In Computer Architecture News, pages 9299, 2005. [24] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78117, 1970. [25] J. J. Pieper, A. Mellan, J. M. Paul, D. E. Thomas, and F. Karim. High level cache simulation for heterogeneous multiprocessors. In Proc. DAC, pages 287292, 2004. [26] P. Ranganathan, S. Adve, and N. P. Jouppi. Recongurable caches and their application to media processing. In Proc. ISCA, pages 214224, 2000. [27] R. A. Sugumar and S. G. Abraham. Set-associative cache simulation using generalized binomial trees. ACM Transactions on Computer Systems (TOCS), 13(1):3256, 1995. [28] N. Tojo, N. Togawa, M. Yanagisawa, and T. Ohtsuki. Exact and fast l1 cache simulation for embedded systems. In Proc. ASPDAC, pages 817822, 2009. [29] X. Vera, N. Bermudo, J. Llosa, and A. Gonzlez. A fast and accurate framework to analyze and optimize cache memory behavior. ACM Transactions on Programming Languages and Systems (TOPLAS), 26(2):263300, 2004.

9.

REFERENCES

8.

ACKNOWLEDGMENTS

This work is partially supported by the SRC Contract 2009-TJ1984, and the Center for Domain Specic Computing (NSF Expe-

304

You might also like