Locality Optimizations for Multi-Level Caches
Gabriel Rivera, Chau-Wen Tseng
Department of Computer Science
University of Maryland
College Park, MD 20742
Abstract

Compiler transformations can significantly improve data locality of scientific programs. In this paper, we examine the impact of multi-level caches on data locality optimizations. We find nearly all the benefits can be achieved by simply targeting the L1 (primary) cache. Most locality transformations are unaffected because they improve reuse for all levels of the cache; however, some optimizations can be enhanced. Inter-variable padding can take advantage of modular arithmetic to eliminate conflict misses and preserve group reuse on multiple cache levels. Loop fusion can balance increasing group reuse for the L2 (secondary) cache at the expense of losing group reuse at the smaller L1 cache. Tiling for the L1 cache also exploits locality available in the L2 cache. Experiments show enhanced algorithms are able to reduce cache misses, but performance improvements are rarely significant. Our results indicate existing compiler optimizations are usually sufficient to achieve good performance for multi-level caches.
1 Introduction
Because of the increasing disparity between memory
and processor speeds, effectively exploiting caches
is widely regarded as the key to achieving good performance on modern microprocessors. As the gap
between processors and memory grows, architects
are beginning to rely on several levels of cache in
order to hide memory latencies. Two-level caches
are now common, while processors such as the DEC
Alpha 21164 have three levels of cache.
Compiler transformations for improving data locality can be useful in hiding the complexities of the memory hierarchy from scientists and engineers. Compilers may either rearrange the computation order through loop transformations (e.g., loop permutation, fusion, tiling), or change the layout of data through data transformations (e.g., padding, transpose). Such transformations are usually guided by a simple cache model to evaluate different optimization choices. In almost all cases, the current state of the art models a single level of cache.

In this paper, we examine the impact of multi-level caches on data locality optimizations. We wish to discover 1) whether compiler optimizations need to consider multiple levels of cache in order to maximize program performance, and 2) quantify the improvements possible by targeting multi-level caches. We find that many compiler transformations improve locality for all levels of cache simultaneously, without any need to explicitly consider multiple cache levels. In other cases, targeting just the smallest cache generally yields most of the benefits of explicit optimizations for the larger cache(s). We show several transformations which can be enhanced for multi-level caches. In these cases we propose new heuristics and evaluate their impact. The contributions of this paper are:
- Examining when data locality optimizations will benefit from targeting multi-level caches.

- Developing new heuristics for inter-variable padding, loop fusion, and tiling for multi-level caches.

- Experimentally evaluating the impact of multi-level cache optimizations. Results show that while miss rates may be reduced, performance is not improved except in rare cases.
// Original
real A(N,M), B(N)
do j = 1,N
do i = 1,M
B(j) = A(j,i)
// Loop Permutation
real A(N,M), B(N)
do i = 1,M
do j = 1,N
B(j) = A(j,i)
// Array Transpose
real A(M,N), B(N)
do j = 1,N
do i = 1,M
B(j) = A(i,j)
Figure 1  Loop Nest and Data Layout Transformations
For simplicity, we usually consider a multi-level cache of two levels, with a small level-one primary cache (L1) and a large level-two secondary cache (L2). In modern microprocessors, the L1 cache is typically 8K or 16K, while the L2 cache is 128K to 4M. We also assume both the L1 and L2 caches are direct-mapped. Optimizations which avoid conflict misses on a direct-mapped cache certainly avoid conflicts in k-way associative caches. Our experience indicates that simply treating k-way associative caches as direct-mapped for locality optimizations achieves nearly all the benefits of explicitly considering higher associativity.

In the rest of the paper, we begin by considering why many locality optimizations do not need to consider multiple levels of cache, then focus on individual transformations which may benefit from targeting multi-level caches. We describe how to extend inter-variable padding, loop fusion, and tiling for multi-level caches. We experimentally evaluate our optimizations on a selection of representative programs through cache simulations and actual program execution and discuss our results. We describe related work and then conclude.

2 Locality Optimizations

When we examine compiler transformations for improving data locality, we find many loop nest and data layout transformations do not need to explicitly target multi-level caches. To see why this is the case, we briefly review different forms of data locality. As pointed out by Wolf and Lam [29], reuse can be either temporal (multiple accesses to the same memory location) or spatial (multiple accesses to nearby memory locations). If reuse is between multiple references to the same variable, it is considered group reuse; otherwise it is self reuse.

Consider the example Fortran code in Figure 1. In the code there is temporal reuse of B(j) on the i loop because all iterations of i will access the same memory location. There is spatial reuse of B(j) on the j loop because iterations of j will access consecutive locations in memory. Similarly, there is spatial reuse of A(j,i) on the j loop, because on iterations of j the reference will access consecutive memory locations in A (arrays are column-major in Fortran).

In the original version of the program, spatial reuse of A is unlikely to be exploited for large values of M, since iterations of loop i access many different cache lines. To improve data locality, we can apply loop nest and/or data layout transformations.

2.1 Loop nest transformations

Loop nest transformations can change both the order in which loop iterations are executed and the structure of the loop nest. The most basic loop nest transformation is loop permutation, which changes the order of loops in a loop nest. Consider applying loop permutation to the code in Figure 1. Now that the j loop is innermost, both the temporal reuse of B and spatial reuse of A occur much closer in time, on consecutive loop iterations rather than across M iterations of the i loop. As a result, for large values of M, loop permutation significantly increases the probability the cache line being accessed is still in the cache, improving performance.

Notice that bringing reuse closer together in time is almost always desirable, regardless of cache size or level. The only role cache size plays is whether the cache is large enough to keep data in the cache across M iterations of the i loop in the original program. If not, then miss rates should drop for that level of cache. If the change is significant, it will
produce a performance improvement. Loop permutation in this example thus benefits all levels of cache
simultaneously. For small values of N, M, only upper levels of cache will benefit. For large enough values of N, M, all levels of cache will benefit. The
compiler does not have to explicitly target multiple
levels of cache.
In addition to loop permutation, other unimodular loop nest transformations such as loop skewing,
loop reversal, and their combinations [29, 30] do not
need to target multi-level caches, for reasons just described. One issue which may arise is if tradeoffs
must be made between spatial and temporal locality
in choosing a profitable loop permutation. In such
cases the cache line size may affect the impact of spatial locality. So if cache line sizes differ significantly
for different levels of cache, loop nest transformations may need to be aware of multi-level caches.
We have not found any such cases in practice.
real A(N,N), B(N,N), C(N,N)
do j = 2,N-1
// loop nest 1
do i = 1,N
= A(i,j) + A(i,j+1)
= B(i,j) + B(i,j+1)
= C(i,j) + C(i,j+1)
do j = 2,N-1
// loop nest 2
do i = 1,N
= B(i,j-1) + B(i,j) + B(i,j+1)
= C(i,j)
Figure 2  Example program
2.2 Data layout transformations

Another class of locality optimizations modify the data layout, rather than the loop iterations. Data layout transformations are designed to improve only spatial locality. An example of array transpose [5, 13], a simple data layout transformation, is shown in Figure 1. Array A is transposed, changing the reference A(j,i) to A(i,j). Consecutive accesses are now to adjacent memory locations, increasing the chance they will access the same cache line.

As with loop permutation, data layout transformations benefit multiple levels of cache simultaneously, since bringing references close together in memory improves spatial locality at all cache levels. Cache line size is important in determining whether transformations will actually reduce misses; improvements will result only if accesses are close enough to end up on the same cache line. If so, the locality optimization can improve cache line utilization, reduce working set size, and exploit hardware prefetching.

In addition to array transpose, other unimodular [12] and nonlinear [4] array layout transformations also do not need to target multiple levels of cache, for the same reasons as just discussed. One issue which may arise is if cache line sizes differ significantly across cache levels. If data layout transformations incur overhead, then how successful the transformation is in putting data accesses onto the same cache line becomes a factor in deciding whether to perform the transformation. In this case the cache line size (and miss penalty) for each cache level must be considered.

We have seen that many loop nest and data layout transformations are largely insensitive to multi-level caches. However, several optimizations can benefit from targeting multiple levels of cache. We examine them in the following sections.
3 Inter-variable Padding
Inter-variable padding is a data transformation useful
for eliminating conflict misses [1, 20, 21]. Existing
methods for inter-variable padding generally require
knowledge of cache size and line size. To see how
padding variables eliminates conflict misses at a single cache level, consider the program in Figure 2.
The unit-stride references to A and B provide spatial locality, leading to cache reuse. However, if A
and B are separated by a multiple of the cache size
in a direct-mapped cache, references A(i,j) and B(i,j) will map to the same cache line in the first loop nest, eliminating reuse. In this case severe or
loop nest, eliminating reuse. In this case severe or
ping-pong conflict misses result, since misses can
occur on every iteration. To avoid severe conflicts,
we can apply inter-variable padding to change the
base address of B relative to A. Further inter-variable
padding can eliminate conflicts between the remaining variables.
There is also group reuse of columns of B carried
on the outer j loop in both nests, since if B(i,j+1)
can be kept in the cache it can be reused by B(i,j)
on the next iteration of the j loop. As we review in
the next section, inter-variable padding is also useful
for preserving such group reuse.
3.1 Avoiding Severe Conflicts

3.1.1 Current Methods

[Figure 3  PAD layout for example (layout diagram)]
PAD is a simple compiler technique for eliminating severe conflict misses [20]. It analyzes array subscripts in loop nests to compute a memory access pattern for each array variable. It then iteratively increments each variable's base address until no conflicts result with the other variables analyzed. When considering a base address for a variable A, if PAD finds a loop in which a reference to A maps to a location on the cache within one cache line of a reference to a different variable, PAD will increment the address until conflicts are eliminated. In practice, PAD requires only a few cache lines of padding per variable [20].
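The following C fragment sketches the flavor of this search (our illustration only, not the implementation from [20]); it treats each variable's base offset independently and ignores how padding one variable shifts variables laid out after it. The multi-level variant described in Section 3.1.2 amounts to running the same search with the L1 cache size and the largest line size found at any level.

    #include <stdlib.h>

    typedef struct {
        long base;          /* byte offset of the variable's base address    */
        long *ref_offsets;  /* byte offsets of its references within a nest  */
        int  nrefs;
    } Var;

    /* returns 1 if some reference of v maps within one cache line of a
       reference to an already-placed variable on a direct-mapped cache */
    static int conflicts(const Var *v, const Var *placed, int nplaced,
                         long cache_size, long line_size) {
        for (int p = 0; p < nplaced; p++)
            for (int i = 0; i < v->nrefs; i++)
                for (int j = 0; j < placed[p].nrefs; j++) {
                    long a = (v->base + v->ref_offsets[i]) % cache_size;
                    long b = (placed[p].base + placed[p].ref_offsets[j]) % cache_size;
                    long d = labs(a - b);
                    if (d < line_size || cache_size - d < line_size)
                        return 1;
                }
        return 0;
    }

    void pad_variables(Var *vars, int nvars, long cache_size, long line_size) {
        for (int v = 1; v < nvars; v++)            /* place variables one at a time */
            while (conflicts(&vars[v], vars, v, cache_size, line_size))
                vars[v].base += line_size;         /* pad by one line and retry     */
    }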
Figure 3 illustrates a possible layout achieved
by PAD when transforming the layout of Figure 2
to avoid severe conflicts on the L1 cache. Each
box corresponds to the L1 cache during a given loop
nest, with the width representing the cache size. In
this case, the cache size is slightly more than double the common column size. Each dot represents
a variable reference; its position in a box indicates
its cache location inside the loop nest. The vertical
line above A shows the relative position of A(i,j);
other vertical lines are interpreted similarly. As a
consequence, vertical lines also reveal relative positions of base addresses. Arcs connect references to
the same variable. For instance, the reference connected to A(i,j) in the first loop is A(i,j+1).
The two dots are thus apart by a distance of N, the
column size. Since all references are uniformly generated, these relative positions do not change over
loop iterations.
Layout diagrams such as Figure 3 appear in several places in this paper. These diagrams are convenient for showing how inter-variable padding can
avoid severe conflicts and preserve group temporal
reuse. Severe conflicts occur when two references
are mapped to the same cache line, and would be
illustrated by superimposing dots. Group reuse between two columns of an array can be exploited only
if the cache lines for the first column are not flushed
before they are reused. Group reuse is represented
by having no dots appear between an arc connecting
two array columns. To see why, consider that all references move in unit stride between loop iterations.
It follows that if a reference is connected by an arc
from the right, it reuses the data accessed by its right
neighbor only if there are no intervening references
“underneath” this arc.
Supposing array sizes are multiples of the L1
cache size, we find all base addresses in the original sample program coincide on the cache, causing
severe conflicts between references to different arrays. In Figure 3, we see PAD eliminates severe
conflicts by inserting small pads between successive
variables, so no dots overlap. However, since dots
appear between the endpoints of 4 out of 5 arcs,
group reuse is not fully exploited. For instance, the
reuse of B(i,j+1) by B(i,j) in the second loop
is prevented by C(i,j).
3.1.2 Multi-level Methods
The PAD algorithm generalizes easily to multiple
cache levels. Base addresses are tested for conflicts
with respect to all cache levels instead of just one
cache. An even simpler method follows from the
fact that the cache size at a given level evenly divides
that of lower levels. Base addresses are padded when
conflicts result with respect to a single cache configuration. This configuration consists of the L1 cache
size, S1, and the largest cache line size found at any level, Lmax. Note that S1 is the smallest cache size
at any level. Thus, when each cache level shares a
[Figure 4  GROUPPAD layout for example on L1 cache (layout diagram)]

[Figure 5  L2MAXPAD layout for example on L2 cache (layout diagram)]
common line size, this configuration is the same as
the L1 cache. Otherwise, this precise configuration
does not actually exist in the memory hierarchy. We
call this MULTILVLPAD since it generalizes PAD for
multiple levels of cache.
The validity of this method follows from modular
arithmetic. If two references maintain a distance of
at least Lmax on a cache of size S1 , then the distance
must be equal or greater on a cache of size kS1 (larger
by an integer factor k). Spacing references by at least
Lmax ensures severe conflicts are avoided at each
cache level regardless of line size.
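One way to see this (our sketch of the argument): let d be the difference between the two addresses and let sep_S(d) denote their separation on a direct-mapped cache of size S,

    sep_S(d) = min(d mod S, S - (d mod S)).

Writing d mod (k S_1) = (d mod S_1) + m S_1 for some integer 0 <= m < k,

    d mod (k S_1) >= d mod S_1 >= L_max,   and
    k S_1 - (d mod (k S_1)) >= S_1 - (d mod S_1) >= L_max,

so sep_{kS_1}(d) >= L_max whenever sep_{S_1}(d) >= L_max.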
3.2 Preserving Group Reuse

3.2.1 Current Methods

Often the L1 cache contains sufficient space with which to exploit some or all group temporal reuse across outer loop iterations. We previously introduced GROUPPAD, which inserts larger pads than PAD to obtain a layout both preserving group reuse on the L1 cache and avoiding severe conflict misses [21]. Figure 4 gives the result of applying GROUPPAD to the example program using a diagram similar to Figure 3. By sufficiently separating B from A and C on the cache, all group reuse between B references is preserved. Though A and C references fail to exploit group reuse, it is apparent that the L1 cache lacks the capacity to preserve all group reuse in the first loop (as this would require a cache size three times the column size). It is thus unavoidable that two out of three arcs must overlap.

GROUPPAD obtains such a layout by considering for each variable a limited number of positions relative to other variables. The number of references successfully exploiting group reuse at the L1 cache is counted for each position. GROUPPAD then selects the position maximizing this value.

3.2.2 Multi-level Methods
Once we consider a secondary cache, another goal
emerges: to preserve on the L2 cache the group reuse
which remains unexploited on the L1 cache following GROUPPAD. In order to preserve the GROUPPAD
layout on the L1 cache, we restrict pad sizes to multiples of S1 , the size of the L1 cache. If we pad a
variable B by such an amount, the relative distances
between B references and references to other variables may change on the L2 cache but not on the L1
cache. By using only S1 sized pads, we can then
reapply GROUPPAD to optimize group reuse for the
L2 cache. We can thus apply GROUPPAD in such a manner
that it begins targeting the L1 cache as already described, and then in later phases recursively applies
GROUPPAD to exploit group reuse for lower levels of
cache, using pads which are multiples of the previous
cache size to preserve group reuse at higher levels of
cache.
For large L2 caches, a simpler approach may
sometimes suffice. If array column sizes are a small
fraction of the L2 cache size, merely spacing variables as far apart as possible on the L2 cache can
preserve all group reuse at this cache level. This
is illustrated in Figure 5. Boxes now represent the
much larger L2 cache. We see that all group reuse is
exploited on this cache.
To preserve the L1 cache layout computed by
GROUPPAD while separating variables in this manner,
we also round pads to the nearest S1 multiple after
determining the approximate position for a variable
on the L2 cache. In this way we can maintain the layout of Figure 5 on the L2 cache while simultaneously
maintaining the layout of Figure 4 on the L1 cache.
This will allow A and C references in the first loop to
exploit group reuse on the L2 cache where they do not
on the L1 cache. We call this method L2MAXPAD,
since it extends our MAXPAD algorithm [21].
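A small sketch of the key step follows (ours, not code from [21]; the function name and arguments are hypothetical, and the L2 size S2 is assumed to be a multiple of the L1 size S1). Because the returned pad is a multiple of S1, adding it leaves a variable's L1 mapping unchanged while shifting its L2 position.

    /* Choose a pad that moves a variable toward a desired offset on the L2
     * cache (size S2) without disturbing its position on the L1 cache (size
     * S1): (base + pad) mod S1 == base mod S1 since pad is a multiple of S1. */
    long pad_for_l2(long base, long target_l2_offset, long S1, long S2) {
        long shift = (target_l2_offset - base % S2 + S2) % S2;  /* bytes to move on L2   */
        return (shift / S1) * S1;                               /* round to S1 multiple  */
    }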
3.3 Summary
Transformations for the L2 cache combine easily
with PAD and GROUPPAD. To eliminate severe conflict misses at both cache levels, MULTILVLPAD need
only use the largest line size instead of the L1 cache
line size. To preserve group reuse on the L2 cache,
we follow GROUPPAD with L2MAXPAD, in which we
maximally separate variables on the L2 cache using
pads which are multiples of S1. These techniques
easily generalize to three or more cache levels.
4 Loop Fusion
Loop fusion is a transformation where adjacent loops
are fused into a single loop containing both loop
bodies. It can be used to improve locality directly by
bringing together memory references [14, 17, 24], or
to enable additional locality optimizations such as
loop permutation [18] and array contraction [9].
We observe improvements in temporal locality
after fusing the loop nests of Figure 2 at the innermost level, obtaining the nest shown in Figure 6. Assuming array sizes exceed the L2 cache size, in the
original program the reference B(i,j+1) would miss
both cache levels in both loops. Fusing the loops
ensures that only one memory access is needed; the
second reference to B(i,j+1) may be found in
cache or register. Fused loops may therefore exhibit
improved temporal locality.
Loop fusion can improve data locality, but it also
increases the chance of severe conflicts. We find
applying inter-variable padding using the PAD algorithm after loop fusion is important. Fortunately, it
can eliminate severe conflicts on all levels of cache
fairly easily.
real A(N,N), B(N,N), C(N,N)
do j = 2,N-1
do i = 1,N
= A(i,j) + A(i,j+1)
= B(i,j) + B(i,j+1)
= C(i,j) + C(i,j+1)
= B(i,j-1) + B(i,j) + B(i,j+1)
= C(i,j)

Figure 6  Example program after fusion

Another disadvantage to loop fusion is that the increased amount of data accessed per loop iteration can force a loss of group temporal reuse on smaller caches. Consider Figure 7, which illustrates the layout of the fused nest after GROUPPAD. This figure consists of only one box, since the two loops have been fused into one. Note that unlike in earlier diagrams, dots may represent two identical references, due to fusion. We see that on the L1 cache, group reuse is exploited only for one reference, B(i,j-1). An L1 cache size over four times the column size would be required to exploit all group reuse.

[Figure 7  GROUPPAD layout for example on L1 cache after fusion (layout diagram)]

Since three references exhibit group reuse in Figure 4, we find that loop fusion has decreased the overall amount of group reuse exploited on the L1 cache. To get a precise accounting of the cache effects of fusing the two loops, we can count the total number of references which, due to cache faults, access either the L2 cache or main memory. We compute these totals for the original and fused versions. We assume L2MAXPAD is applied following GROUPPAD, so that group reuse is exploited on the L2 cache whenever it is not on the L1 cache. We also assume no reuse between nests due to capacity constraints.

In Figure 4, we see that references A(i,j+1), B(i,j+1), and C(i,j+1) in the first loop must access main memory, as do B(i,j+1) and C(i,j) in the second, totaling 5 memory references. Since A(i,j) and C(i,j) in the first loop do not exploit group reuse on the L1 cache, they must access the L2 cache. The remaining references (all to B) successfully exploit group reuse on the L1 cache. In total, 2 references access the L2 cache. Of course, due to self-spatial reuse, these cache faults occur only whenever a reference accesses a new cache line. In the fused loop of Figure 7, 3 references, A(i,j+1), B(i,j+1), and C(i,j+1), must access main memory, an improvement over Figure 4. However, 3 references, A(i,j), B(i,j), and C(i,j), will access the L2 cache. Note that wherever there are two identical references, only the first may cause a cache fault; the second will access the L1 cache or a register. Fusion has therefore saved two memory accesses but cost one L2 cache access.

We find therefore that loop fusion may involve a tradeoff between L1 and L2 cache performance. Fortunately, the compiler can predict the group reuse exploited at each cache level before and after loop fusion. Deciding whether fusion is profitable requires comparing the sum of reuse at each cache level, scaled by the cost of cache misses at that level. Since the cost of L2 misses is typically much higher than that of L1 misses, fusion will generally be profitable if it enables the compiler to exploit more L2 reuse.
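As a concrete illustration of this cost comparison (our notation, using the reference counts derived above, per iteration of the j loop and ignoring the reduction from self-spatial reuse), let C_L2 and C_mem be the penalties of an L1 miss satisfied by the L2 cache and of a miss to main memory:

    cost_orig  = 5 C_mem + 2 C_L2
    cost_fused = 3 C_mem + 3 C_L2
    cost_orig - cost_fused = 2 C_mem - C_L2

so in this example fusion is predicted to be profitable whenever one L2 access costs less than two accesses to main memory.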
5 Tiling

Tiling or blocking is a loop transformation which combines strip-mining with loop permutation to form small tiles of loop iterations which are executed together to exploit data locality [2, 11, 31]. Effectively utilizing the cache also requires avoiding self-interference conflict misses within each tile using techniques such as tile size selection, intra-variable padding, and copying tiles to contiguous buffers [6, 3, 22].

do KK = 1,N,W      // W = tile width
do II = 1,N,H      // H = tile height
do J = 1,N
do K = KK,min(KK+W-1,N)
do I = II,min(II+H-1,N)
C(I,J) = C(I,J) + A(I,K)*B(K,J)

Figure 8  Tiled matrix multiplication

Figure 8 illustrates a tiled version of matrix multiplication of NxN arrays in which reference A(I,K) accesses a W by H tile on each J loop iteration. When this tile fits in cache with no self-interference, data for array A is brought into cache just once, exploiting much temporal reuse. Data for the other arrays are brought in multiple times, causing a number of cache misses proportional to 1/H + 1/W [6, 22]. When targeting a single level of cache, selecting the largest tile size which fits in cache will thus yield the fewest cache misses.

Tiling for multi-level caches is more complex. For each level of cache, selecting a tile larger than the cache will cause A to overflow, requiring it be read in N times. Smaller tiles, however, will cause more misses for arrays B and C. What tile size should be selected thus depends on the relative cost of misses at different levels of cache.

We believe most of the benefits of tiling may be obtained by simply choosing tile sizes targeting the L1 cache. First, from modular arithmetic we can show tiles with no L1 self-interference conflict misses will also have no L2 conflicts. Tiling for the L1 cache thus maximizes L1 reuse and also captures L2 reuse.

In comparison, choosing L2-sized tiles can reduce misses for B and C in the L2 cache but loses almost all L1 temporal reuse for A. Benefits are moderate since the reduction in L2 cache miss rates from larger tiles grows slowly. For instance, quadrupling the size of a tile only reduces misses by 50% (to 1/(2H) + 1/(2W)), while the amount of reuse lost for L1 is proportional to N. As a result, tiling for the smallest level of cache is likely to be more profitable unless the cost of L2 misses is much greater than for L1 misses. To choose the desirable tile size, the compiler can compare estimated cache misses at each cache level, scaled by their costs.

We note an exception to simply tiling for L1 caches. Song and Li recently extended tiling techniques to handle multiple loop nests enclosed in a single time-step loop, allowing tiles to be overlapped from different time steps [25]. Because the large amount of data that must be held in cache spans many loop nests, the L1 cache is unlikely to be sufficiently large for reasonable sized tiles. As a result the tiling algorithm targets the L2 cache, completely bypassing the L1 cache.
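The comparison of scaled miss estimates described above can be sketched as follows (our illustration, not the compiler's implementation; the per-level miss estimator, the candidate tile list, and the per-level cost weights are assumed to be supplied by the caller):

    typedef struct { int h, w; } Tile;

    /* estimated misses at each cache level, scaled by the per-level miss cost */
    double weighted_cost(Tile t, double (*misses)(Tile, int),
                         const double *cost, int nlevels) {
        double total = 0.0;
        for (int lvl = 0; lvl < nlevels; lvl++)
            total += misses(t, lvl) * cost[lvl];
        return total;
    }

    /* return the candidate tile with the lowest weighted miss cost */
    Tile pick_tile(const Tile *cands, int ncands, double (*misses)(Tile, int),
                   const double *cost, int nlevels) {
        Tile best = cands[0];
        double best_cost = weighted_cost(best, misses, cost, nlevels);
        for (int i = 1; i < ncands; i++) {
            double c = weighted_cost(cands[i], misses, cost, nlevels);
            if (c < best_cost) { best = cands[i]; best_cost = c; }
        }
        return best;
    }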
Program      Description                           Lines

KERNELS
ADI32        2D ADI Integration Fragment (Liv8)       63
DOT256       Vector Dot Product (Liv3)                32
ERLE64       3D Tridiagonal Solver                   612
EXPL512      2D Explicit Hydrodynamics (Liv18)        59
IRR500K      Relaxation over Irregular Mesh          196
JACOBI512    2D Jacobi with Convergence Test          52
LINPACKD     Gaussian Elimination w/Pivoting         795
SHAL512      Shallow Water Model                     227

NAS BENCHMARKS
APPBT        Block-Tridiagonal PDE Solver           4441
APPLU        Parabolic/Elliptic PDE Solver          3417
APPSP        Scalar-Pentadiagonal PDE Solver        3991
BUK          Integer Bucket Sort                     305
CGM          Sparse Conjugate Gradient               855
EMBAR        Monte Carlo                             265
FFTPDE       3D Fast Fourier Transform               773
MGRID        Multigrid Solver                        680

SPEC95 BENCHMARKS
APSI         Pseudospectral Air Pollution           7361
FPPPP        2 Electron Integral Derivative         2784
HYDRO2D      Navier-Stokes                          4292
SU2COR       Quantum Physics                        2332
SWIM         Vector Shallow Water Model              429
TOMCATV      Mesh Generation                         190
TURB3D       Isotropic Turbulence                   2100
WAVE5        Maxwell's Equations                    7764

Table 1  Test programs for experiments
6 Experimental Evaluation

6.1 Evaluation Framework

We experimentally evaluated multi-level locality transformations for uniprocessors using both cache simulations and timings. Cache simulations were made for a 16K direct-mapped L1 cache with 32-byte cache lines and a 512K direct-mapped L2 cache with 64-byte lines. Miss rates for both the L1 and L2 cache are reported as the number of cache misses for that level, relative to the total number of memory references (i.e., L2 misses are normalized to L1 misses). Timings were made on a Sun UltraSparc I, which has the same cache configuration as in our simulations.

Transformations were applied to a number of scientific kernels and applications from NAS and SPEC95 benchmarks, shown in Table 1. All data transformations were implemented as passes in the Stanford SUIF compiler [27]. Before these passes, some transformations are performed to give the compiler control over variable base addresses. First, local array and structure variables are promoted into the global scope. Formal parameters to functions cannot and do not need to be promoted, since they represent variables declared elsewhere. Array and structure global variables are then made into fields of a large structured variable, resulting in a single global variable containing all of the variables to be optimized. Optimizing passes may now modify the base addresses of variables by reordering fields in the structure and inserting pad variables. Also, intra-variable (array column) padding is first performed in ADI32 and ERLE64 to avoid severe conflicts between references to the same variable, as described in [20].
6.2 Padding To Avoid Severe Conflicts
To examine the effectiveness of data transformations
for eliminating severe conflicts on multi-level caches,
we transformed several programs. Performance was
measured for three versions: original, optimized by
PAD for only the L1 cache, and optimized by MULTILVLPAD for both caches (L1&L2). Figure 9 gives
the simulated cache miss rates and execution time
improvements for these programs.
The top two graphs present L1 and L2 cache miss
rates. From the L2 cache w/ L1 Opt miss rates we
note that PAD, unaware of the L2 cache, obtains an L2
miss rate reduction similar in magnitude to the L1
reduction. We see from L2 cache w/ L1&L2 Opt that
MULTILVLPAD performs only slightly better on the
L2 cache than PAD, mostly in the case of EXPL. This
suggests that by eliminating severe misses on the L1
cache, we avoid many accesses to the L2 cache which
could conflict with one another. Even though L2
cache lines are longer, PAD is able to eliminate most
L2 conflict misses by moving conflicting references
apart by a distance equal to an L1 cache line. By
noting the similarity between values in L1 cache w/
L1 Opt and L1 cache w/ L1&L2 Opt, we also find
that generalizing PAD for two levels of cache does
not have an adverse effect on L1 miss rates.
The final graph in Figure 9 gives the UltraSparc
execution time improvement relative to the original
program for PAD and MULTILVLPAD versions. Timings were made for programs showing large miss rate
changes in cache simulations. Comparing the two
versions we find that multi-level optimizations for
eliminating severe conflicts have only a minor effect
on performance on this architecture, even slightly
degrading performance in cases such as ERLE64. Reductions in L2 cache miss rates thus do not translate
into performance improvements.¹

¹ An improvement was found for DOT256, even though cache miss rates were not improved significantly. We believe this is due to the differences in the ability of the underlying memory system to handle multiple outstanding cache misses [1], since the two input vectors were padded 64 instead of 32 bytes due to the longer L2 cache lines. Such memory effects are prominent only for simple kernels such as DOT256, which have very few references. For larger loops the effects average out.
6.3 Padding To Preserve Group Reuse

6.3.1 Test Programs

We evaluated multi-level data transformations for preserving group reuse using five programs with numerous opportunities for improving group reuse. The results appear in Figure 10. The L1 Opt versions were transformed with GROUPPAD alone, while the L1&L2 Opt versions were transformed by both GROUPPAD and L2MAXPAD. The first graph gives cache miss rates at both cache levels for the three versions. Again, we find that optimizing for the L2 cache in addition to the L1 cache is needed in few programs; only EXPL benefited on the L2 cache. L1 optimizations again account for most of the improvement in L2 cache miss rates. As we also see in Figure 9, optimizing for the L2 cache does not adversely affect L1 miss rates. Thus no inherent tradeoff exists between data transformations for the L1 cache and the L2 cache.

The second graph demonstrates a very small improvement in EXPL execution time, but improvements also in programs whose cache performance does not benefit from L2MAXPAD, the L2 transformation. Small degradations also occur in SWIM and TOMCATV, again suggesting that L2 optimizations have a small impact on this architecture.

6.3.2 Varying Problem Size
Prior work has shown that the data transformations
can be particularly useful for pathological problem
sizes which might not arise in a limited set of test
programs [21]. To reveal such opportunities for L2
data transformations and to investigate the robustness
of these transformations, we varied the problem size of two programs, EXPL and SHAL, from 250 to 520 and simulated cache miss rates on both caches for L1 Opt and L1&L2 Opt versions. Results appear in Figure 11, where the X-axis represents problem size
and the Y-axis gives the cache miss rate.
We see that while both versions have similar
L1 miss rates, L1 Opt (GROUPPAD alone) experiences clusters of problem sizes where L2 miss rates
increase by up to 5%. The L1&L2 Opt versions
avoid these increases. These results indicate that the
clusters correspond to problem sizes in which overlapping array columns of different variables prevent
group-temporal reuse or self-spatial reuse on the L2
cache. While for most problem sizes the distance
between references is large on the L2 cache, references occasionally converge on one another as the
problem size is varied. L2MAXPAD prevents this by
separating variables on the L2 cache.
A prominent feature of these graphs is the invariant L2 miss rate of L1&L2 Opt, in contrast to the
occasionally increasing L1 miss rates for both versions. This is attributed to the relative capacities of
the two caches. The L1 cache, which can hold only
3 to 8 columns, depending on problem size, increasingly lacks the capacity to preserve group reuse as the
problem size increases. All group reuse is exploited
on the much larger L2 cache following L2MAXPAD.
6.4 Loop Fusion
To further explore the potential tradeoff between L1
performance and L2 performance as a result of fusion, we determined the effects of fusing two loops
in EXPL by several measures. Using reuse statistics
available through GROUPPAD compiler analysis, we
first determined the number of L2 references, i.e.,
the number of references in all loops which miss the
L1 cache but hit the L2 cache, in the same manner
as in Section 4. As in Section 4, we also determined
the number of memory references, i.e., the number
of references in all loops missing both the L1 and L2
cache.
L2 and memory references were computed assuming both GROUPPAD and L2MAXPAD transformations, so that all group reuse not exploited on the
L1 cache was assumed to be preserved on the L2
cache. L1 and L2 miss rates were then obtained before and after fusion. To account for a decrease in
the reference count associated with fusion, miss rates
for both versions were computed as the number of
cache misses divided by the number of references
in the original version. From this data we computed
the change in L2 references, memory references, and
cache miss rates as a result of fusion. These values
were obtained for problem sizes ranging from 250 to
700. Results appear in Figure 12.
The upper graph reveals that the change in group
reuse on the L1 cache may vary considerably depending on problem size. The increase in L2 references alternates between 1 and 2 before plateauing at
3. This high is maintained from problem size 334 to
398. After this point, fusion usually results in a small
change in L2 references, 0 in most cases. Because of
the larger capacity of the L2 cache, improved L2 locality as a result of fusion is fully exploited, resulting
in a constant decrease by 3 of the number of memory
references. Thus, the upper graph suggests that the
steady improvement in L2 performance as reflected
by the memory reference count is offset somewhat
by a loss of group reuse on the L1 cache, especially
for problem sizes under 398.
The lower graph reveals a nearly linear relationship between the computed reference counts and
the changes in cache miss rates as a result of fusion.
The change in the L1 miss rate varies closely in proportion to the change in the number of L2 references.
Like the change in memory references, the change in
the L2 miss rate is a flat curve. However, the curve
for L1 miss rates is not situated over the X-axis as
is the curve for L2 references. Instead, this curve is
somewhat lower, with the plateau from 334 to 398
barely breaking 0%. Thus, the miss rate improves on
the L1 cache even with the net loss of group reuse on
this cache. This is possibly due to an overall decrease
in the number of L1 misses as the result of fusion. It
is apparent though that had a fourth reference missed
the L1 cache as the result of fusion, L1 performance
would be adversely affected.
The results show that the compiler can predict
relative cache miss rates fairly accurately by analyzing group reuse. As a result it should be able to
accurately decide whether loop fusion is profitable,
given the relative cost of L1 and L2 cache misses.
6.5 Tiling
Figure 13 compares UltraSparc performance in MFLOPS
for different versions of matrix multiply. We examined matrix sizes from 100² to 400². None of the matrix sizes evaluated fit in L1 cache, and matrices larger than 256² do not fit in L2 cache. Several ap-
proaches for tile size selection are examined: picking
L1-sized and L2-sized tiles (i.e., attempts to maximize L1 and L2 cache reuse, respectively), as well as
picking intermediate sized tiles two and four times
larger than the L1 cache (2xL1, 4xL1). We use the
eucPad algorithm to choose tile dimensions which
eliminate tile self-interference [22].
The graph shows the actual MFLOP rates for each version of the code.² We see L1-sized tiles yield the best performance. They maximize L1 reuse but can also exploit L2 reuse, since performance is steady even for large matrices. In comparison, L2-sized tiles can improve performance for large matrices, but not small ones. The reason for this disadvantage is clear: L2-sized tiles are of no use when the data already fits in L2 cache. Intermediate tiles 2xL1 and 4xL1 achieve performance slightly higher than
L2-sized tiles, showing most L1 benefits are lost as
soon as tiles exceed what can fit in L1 cache. Results
show that the benefits of exploiting L1 cache reuse
outweigh the cost in capacity misses on the L2 cache.
Tiling for the L1 cache is thus effective in improving
performance at both levels of the memory hierarchy,
and yields the best overall performance.

² Raw MFLOP rates (around 38 MFLOPS) are about half of what is achieved by ATLAS, a tuned version of matrix multiply, on our UltraSparc (around 84 MFLOPS). However, if we unroll the loop by hand and apply scalar replacement, we achieve 60 MFLOPS. The difference is thus mostly due to performance tuning to exploit registers and instruction-level parallelism, and our conclusions are still valid despite lower absolute performance.
6.6 Discussion
Overall, our experiments show that while locality
optimizations can be enhanced to improve miss rates on multi-level caches, their actual impact on program
performance is minimal. This outcome is because
locality optimizations which target L1 caches also
exploit most of the reuse at other levels of cache. As
a result, existing optimizing compilers appear quite
capable of achieving good performance for processors with multi-level caches.
7 Related Work
Data locality has been recognized as a significant performance issue for modern processor architectures. Wolf and Lam provide a concise definition
and summary of important types of data locality [29].
Computation-reordering transformations such as loop
permutation and tiling are the primary optimization
techniques [8, 18, 23, 29], though loop fission (distribution) and loop fusion have also been found to
be helpful [18]. Several cache estimation techniques
have been proposed to help guide data locality optimizations [7, 8, 29]. Recently these techniques have
been enhanced to take into account conflict misses
due to limited cache associativity [10, 26].
Data layout optimizations such as padding and
transpose have been shown to be useful in eliminating conflict misses and improving spatial locality [1, 13, 20, 21]. They have also been combined
with loop transformations for improved effect [5,
12]. In previous work, we examined the applicability of inter-variable padding for eliminating severe
conflict misses [20] and preserving group reuse [21].
In this paper we extend our padding algorithms to
consider multi-level caches.
A number of researchers have examined techniques related to this paper. Manjikian and Abdelrahman propose a new loop fusion algorithm called shift-and-peel which expands the applicability of loop fusion [17]. They also propose cache partitioning,
a version of MAXPAD which does not take severe
conflict misses into account. Singhai and McKinley present a parameterized loop fusion algorithm
which considers parallelism and register pressure in
addition to reuse [24]. In comparison, our fusion
algorithm explicitly calculates group reuse benefits
for loop fusion in conjunction with inter-variable
padding. We also consider multiple levels of cache.
Lam, Rothberg, and Wolf show conflict misses can
severely degrade the performance of tiling [16]. Coleman and McKinley select rectangular non-conflicting
tile sizes [6] while others focus on using a portion of
cache [28]. Chame and Moon propose a new cost
model for estimating both interference and capacity
misses to guide tiling [3]. Kodukula and Pingali develop data shackling, a data-centric approach to tiling
which can be applied to a wide variety of loop nests,
but doesn’t account for tile conflicts [15]. Song and
Li extended tiling techniques to handle multiple loop
nests [25]. They concentrate only on the L2 cache since the L1 cache is too small to provide reuse for their necessarily large tiles. Nonlinear array layouts can be
used in conjunction with tiling to improve spatial
locality [4].
In comparison to these previous researchers, we
consider the effects of locality optimizations on multi-level caches. Mitchell et al. are the only other researchers considering multi-level caches, examining
the interactions of multi-level tiling and effects for
goals such as cache, TLB, and parallelism [19]. They
found explicitly considering multiple levels of the
memory hierarchy (cache and TLB) led to the choice
of compromise tile sizes which can yield significant
improvements in performance. In this paper we expand the consideration of multi-level caches to other
locality optimizations besides tiling.
8 Conclusions

Compiler transformations can significantly improve data locality in scientific programs. In this paper, we show that most locality transformations can usually improve reuse for multiple levels of cache by simply targeting the smallest usable level of cache. Benefits for lower levels of cache are then obtained indirectly as a side effect. Some optimizations can benefit from explicitly considering multiple levels of cache, including inter-variable padding and loop fusion. Cache simulations and timings show that while enhanced algorithms are able to reduce cache miss rates, they rarely improve execution times for current processors. While our results do not point out major opportunities to improve program performance, we believe they are still worthwhile because they indicate existing compiler optimizations are sufficient to achieve good performance for multi-level caches.

References

[1] D. Bacon, J.-H. Chow, D.-C. Ju, K. Muthukumar, and V. Sarkar. A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness. In Proceedings of CASCON'94, Toronto, Canada, October 1994.

[2] S. Carr and K. Kennedy. Compiler blockability of numerical algorithms. In Proceedings of Supercomputing '92, Minneapolis, MN, November 1992.

[3] J. Chame and S. Moon. A tile selection algorithm for data locality and cache interference. In Proceedings of the 1999 ACM International Conference on Supercomputing, Rhodes, Greece, June 1999.

[4] S. Chatterjee, V. Jain, A. Lebeck, S. Mundhra, and M. Thottethodi. Nonlinear array layouts for hierarchical memory systems. In Proceedings of the 1999 ACM International Conference on Supercomputing, Rhodes, Greece, June 1999.

[5] M. Cierniak and W. Li. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.

[6] S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.

[7] J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, Fourth International Workshop, Santa Clara, CA, August 1991. Springer-Verlag.

[8] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5(5):587–616, October 1988.

[9] G. Gao, R. Olsen, V. Sarkar, and R. Thekkath. Collective loop fusion for array contraction. In Proceedings of the Fifth Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August 1992.

[10] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In Proceedings of the 1997 ACM International Conference on Supercomputing, Vienna, Austria, July 1997.

[11] F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the Fifteenth Annual ACM Symposium on the Principles of Programming Languages, San Diego, CA, January 1988.

[12] M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee. Improving locality using loop and data transformations in an integrated framework. In Proceedings of the 31st IEEE/ACM International Symposium on Microarchitecture, Dallas, TX, November 1998.

[13] M. Kandemir, J. Ramanujam, and A. Choudhary. A compiler algorithm for optimizing locality in loop nests. In Proceedings of the 1997 ACM International Conference on Supercomputing, Vienna, Austria, July 1997.

[14] K. Kennedy and K. S. McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Proceedings of the Sixth Workshop on Languages and Compilers for Parallel Computing, Portland, OR, August 1993.

[15] I. Kodukula and K. Pingali. An experimental evaluation of tiling and shacking for memory hierarchy management. In Proceedings of the 1999 ACM International Conference on Supercomputing, Rhodes, Greece, June 1999.

[16] M. Lam, E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), Santa Clara, CA, April 1991.

[17] N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193–209, February 1997.

[18] K. S. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424–453, July 1996.

[19] N. Mitchell, L. Carter, J. Ferrante, and K. Högstedt. Quantifying the multi-level nature of tiling interactions. In Proceedings of the Tenth Workshop on Languages and Compilers for Parallel Computing, Minneapolis, MN, August 1997.

[20] G. Rivera and C.-W. Tseng. Data transformations for eliminating conflict misses. In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation, Montreal, Canada, June 1998.

[21] G. Rivera and C.-W. Tseng. Eliminating conflict misses for high performance architectures. In Proceedings of the 1998 ACM International Conference on Supercomputing, Melbourne, Australia, July 1998.

[22] G. Rivera and C.-W. Tseng. A comparison of compiler tiling algorithms. In Proceedings of the 8th International Conference on Compiler Construction (CC'99), Amsterdam, The Netherlands, March 1999.

[23] V. Sarkar. Automatic selection of higher order transformations in the IBM XL Fortran compilers. IBM Journal of Research and Development, 41(3):233–264, May 1997.

[24] S. Singhai and K. S. McKinley. A parameterized loop fusion algorithm for improving parallelism and cache locality. The Computer Journal, 40(6):340–355, 1997.

[25] Y. Song and Z. Li. New tiling techniques to improve cache temporal locality. In Proceedings of the SIGPLAN '99 Conference on Programming Language Design and Implementation, Atlanta, GA, May 1999.

[26] O. Temam, C. Fricker, and W. Jalby. Cache interference phenomena. In Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement & Modeling of Computer Systems, Santa Clara, CA, May 1994.

[27] R. Wilson et al. SUIF: An infrastructure for research on parallelizing and optimizing compilers. ACM SIGPLAN Notices, 29(12):31–37, December 1994.

[28] M. Wolf, D. Maydan, and D.-K. Chen. Combining loop transformations considering caches and scheduling. In Proceedings of the 29th IEEE/ACM International Symposium on Microarchitecture, Paris, France, December 1996.

[29] M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Canada, June 1991.

[30] M. E. Wolf and M. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452–471, October 1991.

[31] M. J. Wolfe. More iteration space tiling. In Proceedings of Supercomputing '89, Reno, NV, November 1989.
[Figure 9  Miss rates and execution time improvements for PAD and MULTILVLPAD. Panels: L1 and L2 percent miss rates for the Orig, L1 Opt, and L1&L2 Opt versions of each program, and percent improvement in UltraSparc execution time for the L1 Opt and L1&L2 Opt versions.]
[Figure 10  Miss rates and execution time improvements for GROUPPAD, with and without L2MAXPAD for optimizing the L2 cache. Panels: L1 and L2 percent miss rates (Orig, L1 Opt, L1&L2 Opt) and percent improvement in UltraSparc execution time for expl512, jacobi512, shal512, swim, and tomcatv.]
[Figure 11  Cache miss rates over varying problem sizes for GROUPPAD with and without L2MAXPAD. Panels: percent miss rate vs. Expl problem size and Shal problem size, for L1 and L2 caches under the L1 Opt and L1&L2 Opt versions.]
[Figure 12  Change in L2 refs, memory refs, and miss rates as a result of fusion. Panels: change in number of references (L2, Memory) and change in miss rate (L1, L2) vs. Expl problem size.]
[Figure 13  Performance for two tiling methods over varying problem size. Panel: MFLOPS on the UltraSparc vs. matrix size for the Orig, L1, 2xL1, 4xL1, and L2 tile-size versions.]