Cache Complexity
Plan
Matrix Transposition
[Figure: the memory hierarchy, with each staging level annotated by its capacity, access time, and cost per transfer unit (e.g., ~$1/GByte at the disk level).]
CPU Cache (1/3)
[Figure: main memory on the left and a small cache memory on the right; the cache stores (index, data) pairs:
Index  Data
0      xyz
1      pdq
2      abc
3      rgf ]
CPU Cache (3/3)
[Figure: main memory and cache memory, continuing the lookup example from the previous slide.]
CPU Cache (5/7)
[Figure: two cache organizations side by side.]
Left: each location in main memory can be cached by just one cache location (direct-mapped).
Right: each location in main memory can be cached by one of two cache locations (2-way associative).
Issues with matrix representation
► Contiguous accesses are better:
► data is fetched as cache lines (64 bytes per cache line on a Core 2 Duo);
► with contiguous data, a single cache-line fetch supports 8 reads of
doubles.
► Transposing the matrix C should reduce L1 cache misses!
Transposing for optimizing spatial locality
float testMM(const int x, const int y, const int z)
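Only the signature of testMM survived extraction. A minimal sketch of what the body might look like, assuming the product A += B·C with C transposed once into an auxiliary matrix Ct so that the innermost loop reads both operands contiguously (Ct, the zero-initialization, and the return value are illustrative assumptions, not the slide's original code):

#include <stdlib.h>

float testMM(const int x, const int y, const int z) {
    /* Hedged sketch: A is x-by-y, B is x-by-z, C is z-by-y, row-major. */
    float *A  = calloc((size_t)x * y, sizeof(float));
    float *B  = calloc((size_t)x * z, sizeof(float));
    float *C  = calloc((size_t)z * y, sizeof(float));
    float *Ct = calloc((size_t)y * z, sizeof(float));
    for (int j = 0; j < y; j++)        /* transpose C once: Ct[j][k] = C[k][j] */
        for (int k = 0; k < z; k++)
            Ct[j * z + k] = C[k * y + j];
    for (int i = 0; i < x; i++)        /* A[i][j] += B[i][k] * Ct[j][k] */
        for (int j = 0; j < y; j++)
            for (int k = 0; k < z; k++)
                A[i * y + j] += B[i * z + k] * Ct[j * z + k];
    float result = A[0];
    free(A); free(B); free(C); free(Ct);
    return result;
}

After the one-time transpose, both B and Ct are scanned along consecutive addresses in the k-loop, which is exactly the contiguous-access pattern advocated on the previous slide.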
Issues with data reuse
[Figure: data reuse in the multiplication of matrices of order 1024.]
Transposing and blocking for optimizing data locality
Experimental results
Other performance counters
Hardware count events
► CPI, Clock cycles Per Instruction: the average number of clock
cycles spent per executed instruction. With pipelining we can
improve the CPI by exploiting instruction-level parallelism.
► L1 and L2 Cache Miss Rate.
► Instructions Retired: In the event of a misprediction,
instructions that were scheduled to execute along the
mispredicted path must be canceled.
              CPI    L1 Miss   L2 Miss   Percent SSE    Instructions
                     Rate      Rate      Instructions   Retired
Naive (in C)  4.78   0.24      0.02      43%            13,137,280,000
Transposed    1.13   0.15      0.02      50%            13,001,486,336
Tiled         0.49   0.02      0.00      39%            18,044,811,264

(Between successive rows, CPI improves by roughly 4x and then 2x, and
the L1 miss rate by roughly 1.6x and then 7.5x.)
Analyzing cache misses in the naive and transposed
multiplication
► Let A, B and C have format (m, n), (m, p) and (p, n)
respectively.
► A is scanned once, so mn/L cache misses if L is the number of
coefficients per cache line.
► B is scanned n times, so mnp/L cache misses if the cache
cannot hold a row.
► C is accessed “nearly randomly” (for m large enough) leading
to mnp cache misses.
► Since 2mnp arithmetic operations are performed, this means
roughly one cache miss per flop!
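For concreteness (numbers assumed for illustration, not from the slides): with m = n = p = 1024 and L = 16, this gives

A: mn/L = 2^20/16 ≈ 6.6 × 10^4 misses,
B: mnp/L = 2^30/16 ≈ 6.7 × 10^7 misses,
C: mnp = 2^30 ≈ 1.1 × 10^9 misses,

i.e. about 1.14 × 10^9 misses for 2mnp ≈ 2.1 × 10^9 flops, or roughly 0.5 misses per flop.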
Analyzing cache misses in the tiled multiplication
[Figure: the tiled multiplication of matrices of order 1024, with block parameter 384.]

Counting sort: cache complexity analysis
1. n/L cache misses to compute k.
2. k/L cache misses to initialize Count.
3. n/L + n cache misses for the histogram (worst case).
4. k/L cache misses for the prefix sum.
5. n/L + n + n cache misses for building Output (worst case).
Total: 3n + 3n/L + 2k/L cache misses (worst case); see the sketch below.
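A minimal C sketch of the counting sort analyzed above (the names Count and Output follow steps 1–5; the signature and the assumption of nonnegative keys are illustrative, not from the slides):

#include <stdlib.h>

/* Hedged sketch of counting sort; assumes nonnegative keys. */
void counting_sort(const int *Input, int *Output, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)              /* 1. scan Input to compute k */
        if (Input[i] >= k) k = Input[i] + 1;
    int *Count = calloc(k, sizeof(int));     /* 2. initialize Count to zero */
    for (int i = 0; i < n; i++)              /* 3. histogram of the values */
        Count[Input[i]]++;
    for (int v = 0, sum = 0; v < k; v++) {   /* 4. exclusive prefix sum */
        int c = Count[v];
        Count[v] = sum;
        sum += c;
    }
    for (int i = 0; i < n; i++)              /* 5. build Output */
        Output[Count[Input[i]]++] = Input[i];
    free(Count);
}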
Counting sort: cache complexity analysis: explanations
n              classical        cache-oblivious counting sort
               counting sort    (preprocessing + sorting)
100,000,000     13.74            4.66  (3.04 + 1.62)
200,000,000     30.20            9.93  (6.16 + 3.77)
300,000,000     50.19           16.02  (9.32 + 6.70)
400,000,000     71.55           22.13  (12.50 + 9.63)
500,000,000     94.32           28.37  (15.71 + 12.66)
600,000,000    116.74           34.61  (18.95 + 15.66)
Basic idea of a cache memory (review)
► S reads to A.
► 16 elements of A per cache line
► 15 of every 16 hit in cache.
► Total access time: 15(S/16) + 100(S/16).
► Spatial locality.
► Cold misses.
Exercise 2 (1/2)
Exercise 2 (2/2)
► S reads to A
► All except the first one hit in cache.
► Total access time: 100 + (S - 1).
► Temporal locality
► Cold misses.
Exercise 3 (1/2)
► S reads to A
► One miss for each accessed line, rest hit in cache.
► Number of accessed lines: 2^(N-4).
► Total access time: 2^(N-4) · 100 + (S - 2^(N-4)).
► Temporal and spatial locality
► Cold misses.
Exercise 4 (1/2)
// Assume 14 <= N
#define S ((1<<20)*sizeof(int))
int A[S];
for (i = 0; i < S; i++) {
  read A[i % (1<<N)];
}
► S reads to A.
► The first access to each line misses.
► The remaining accesses to that line hit.
► Total access time: 15(S/16) + 100(S/16).
► Spatial locality
► Cold and capacity misses.
Exercise 5 (1/2)
// Assume 14 <= N
#define S ((1<<20)*sizeof(int))
int A[S];
for (i = 0; i < S; i++) {
  read A[(i*16) % (1<<N)];
}
► S reads to A.
► First access to each line misses
► One access per line.
► Total access time: 100S.
► No locality!
► Cold and conflict misses.
Exercise 6 (1/2)
Exercise 6 (2/2)
► S reads to A.
► After N iterations, for some N, the cache is full.
► Then the chance of hitting in cache is 2^9/2^18 = 1/512, that
is, the number of lines in the cache divided by the total
number of cache lines used by A.
► Estimated total access time: S(511/512) · 100 + S(1/512).
► Almost no locality!
► Cold, capacity and conflict misses.
Exercise 7 (1/2)
► S reads to A and B.
► A and B interfere in cache: indeed, two memory lines whose
indices differ by a multiple of 2^9 map to the same
location in the cache.
► Total access time: 200S.
► Spatial locality but the cache cannot exploit it.
► Cold and conflict misses.
Exercise 8 (1/2)
► S reads to A and B.
► A and B almost do not interfere in cache.
► Total access time: 2(15S/16 + 100S/16).
► Spatial locality.
► Cold misses.
Set Associative Caches
► Set associative caches have sets with multiple lines per set.
► Each line in a set is called a way.
► Each memory line maps to a specific set and can be put into
any cache line in its set.
► In our example, we assume a 32 Kbyte cache, with 64 byte
lines, 2-way associative. Hence we have:
► 256 sets
► Bottom six bits determine offset in cache line
► Next 8 bits determine the set.
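As a concrete illustration (assumed, not from the slides), here is how a byte address decomposes for this cache:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* 32 KByte cache, 64-byte lines, 2-way: 512 lines / 2 ways = 256 sets. */
    uintptr_t addr = 0x12345678u;           /* an arbitrary byte address */
    unsigned offset = addr & 0x3F;          /* bottom 6 bits: offset in line */
    unsigned set    = (addr >> 6) & 0xFF;   /* next 8 bits: set index */
    printf("offset = %u, set = %u\n", offset, set);
    return 0;
}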
Exercise 9 (1/2)
► S reads to A and B.
► A and B lines hit the same set, but there are enough lines in a set.
► Total access time: 2(15S/16 + 100S/16).
► Spatial locality.
► Cold misses.
Tuned cache-oblivious square matrix transposition
void DC_matrix_transpose(int *A, int lda, int i0, int i1,
                         int j0, int dj0, int j1 /*, int dj1 = 0 */){
  const int THRESHOLD = 16; // tuned for the target machine
tail:
  int di = i1 - i0, dj = j1 - j0;
  if (dj >= 2 * di && dj > THRESHOLD) {
    int dj2 = dj / 2;
    cilk_spawn DC_matrix_transpose(A, lda, i0, i1, j0, dj0, j0 + dj2);
    j0 += dj2; dj0 = 0; goto tail;
  } else if (di > THRESHOLD) {
    int di2 = di / 2;
    cilk_spawn DC_matrix_transpose(A, lda, i0, i0 + di2, j0, dj0, j1);
    i0 += di2; j0 += dj0 * di2; goto tail;
  } else {
    for (int i = i0; i < i1; ++i) {
      for (int j = j0; j < j1; ++j) {
        int x = A[j * lda + i];
        A[j * lda + i] = A[i * lda + j];
        A[i * lda + j] = x;
      }
      j0 += dj0; // each row of the region starts one column further right
    }
  }
}
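An assumed top-level call (not shown on the slide): for an n × n row-major matrix with leading dimension n, starting with j0 = 0 and dj0 = 1 makes row i of the processed region cover columns j ≥ i, so each off-diagonal pair is swapped exactly once:

// Hypothetical driver call for an n x n matrix A.
DC_matrix_transpose(A, n, 0, n, 0, 1, n);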
For the above 2-way associative cache (of size 32 KByte, with
64-byte lines): what is the total access time? What kind of locality?
What kind of misses?
Extra Exercise B
Let A be an n × n invertible lower triangular matrix. A simple
divide-and-conquer strategy to invert A is described below. Let A
be partitioned into (n/2) × (n/2) blocks as follows:
\[
A = \begin{pmatrix} A_1 & 0 \\ A_2 & A_3 \end{pmatrix}, \qquad (1)
\]
then
\[
A^{-1} = \begin{pmatrix} A_1^{-1} & 0 \\ -A_3^{-1} A_2 A_1^{-1} & A_3^{-1} \end{pmatrix}. \qquad (2)
\]
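To check (2), multiply blockwise:
\[
A\,A^{-1} = \begin{pmatrix} A_1 & 0 \\ A_2 & A_3 \end{pmatrix}
\begin{pmatrix} A_1^{-1} & 0 \\ -A_3^{-1} A_2 A_1^{-1} & A_3^{-1} \end{pmatrix}
= \begin{pmatrix} I & 0 \\ A_2 A_1^{-1} - A_3 A_3^{-1} A_2 A_1^{-1} & I \end{pmatrix}
= I.
\]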
The (Z, L) ideal cache model (1/4)
[Figure: an ideal cache of Z words, organized by cache lines of L words, backed by main memory.]
The (Z, L) ideal cache model (2/4)
Cache complexity of the naive matrix multiplication
// A is stored in ROW-major and B in COLUMN-major
for(i=0; i < n; i++)
for(j=0; j < n; j++)
for(k=0; k < n; k++)
C[i][j] += A[i][k] * B[j][k];
A cache-aware matrix multiplication algorithm (2/2)
// A, B and C are in row-major storage
for(i = 0; i < n/s; i++)
for(j = 0; j < n/s; j++)
for(k=0; k < n/s; k++)
blockMult(A,B,C,i,j,k,s);
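The slide does not show the body of blockMult; a plausible sketch (the signature, the extra parameter n, and double as the scalar type are assumptions) multiplying the s × s block (i,k) of A by block (k,j) of B into block (i,j) of C:

#include <stddef.h>

/* Hedged sketch: s x s block multiply-accumulate; A, B and C are n x n,
 * row-major, and s divides n. */
void blockMult(const double *A, const double *B, double *C,
               int i, int j, int k, int s, int n) {
    const double *a = A + ((size_t)i * n + k) * s;  /* top-left of block A[i][k] */
    const double *b = B + ((size_t)k * n + j) * s;  /* top-left of block B[k][j] */
    double       *c = C + ((size_t)i * n + j) * s;  /* top-left of block C[i][j] */
    for (int ii = 0; ii < s; ii++)
        for (int kk = 0; kk < s; kk++)
            for (int jj = 0; jj < s; jj++)
                c[ii * n + jj] += a[ii * n + kk] * b[kk * n + jj];
}

Choosing s so that three s × s blocks fit in cache is what makes the outer three loops cache-friendly.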
Scanning
Figure 2. Scanning an array of N elements arbitrarily aligned with blocks may cost
one more memory transfer than ⌈N/B⌉.
Median and selection (1/8)
Median and selection (2/8)
select(L, k):
   if L has few elements, sort L and
      return the element in the kth position
   partition L into n/5 subsets S[i] of five elements each
   for (i = 1 to n/5) do
      x[i] = select(S[i], 3)
   M = select({x[i]}, n/10)
   ...
Median and selection (3/8)
Median and selection (7/8)
How to solve Q(n) = Q(n/5) + Q(7n/10) + O(1 + n/L)? The number of
leaves of the recursion tree is Θ(n^c), where c satisfies
(1/5)^c + (7/10)^c = 1, leading to c ≈ 0.8397803.
► Since each leaf incurs a constant number of cache misses, we
have Q(n) ∈ Ω(n^c), which could be larger or smaller than Θ(n/L).
How to solve
Plan
Matrix Transposition
Matrix transposition: various algorithms
Matrix transposition: a first divide-and-conquer (1/4)
► For simplicity, assume that our input matrix A is square of
order n and that n is a power of 2, say n = 2k.
► We divide A into four square quadrants of order n/2 and we
have
\[
A = \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix}
\qquad\Longrightarrow\qquad
A^{T} = \begin{pmatrix} A_{1,1}^{T} & A_{2,1}^{T} \\ A_{1,2}^{T} & A_{2,2}^{T} \end{pmatrix}.
\]
► This observation yields an “in-place” algorithm:
1. If n = 1 then return A.
2. If n > 1 then
2.1 recursively compute A_{1,1}^T, A_{1,2}^T, A_{2,1}^T, A_{2,2}^T in place as
\[
\begin{pmatrix} A_{1,1}^{T} & A_{1,2}^{T} \\ A_{2,1}^{T} & A_{2,2}^{T} \end{pmatrix}
\]
2.2 exchange A_{1,2}^{T} and A_{2,1}^{T}.
The number of data moves M(n) then satisfies:
\[
M(n) = \begin{cases} 0 & \text{if } n = 1,\\ 4\,M(n/2) + 2\,(n/2)^2 & \text{if } n > 1. \end{cases}
\]
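Unrolling this recurrence (a quick check, not on the slide): level j of the recursion tree has 4^j nodes, each contributing 2(n/2^{j+1})² moves, so every one of the log₂ n levels costs n²/2 and
\[
M(n) = \sum_{j=0}^{\log_2 n - 1} 4^j \cdot 2\left(\frac{n}{2^{j+1}}\right)^2 = \frac{n^2}{2}\,\log_2 n .
\]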
Thus, for a fixed L, the functions ⌊n/L⌋, n/L and ⌈n/L⌉ are
asymptotically of the same order of magnitude.
► We need to translate “for n small enough” into a formula. We
claim that there exists a real constant α > 0 s.t. for all n and Z
we have:
n² ≤ αZ ⟹ Q(n) ≤ n²/L + n.
Matrix transposition: a first divide-and-conquer (4/4)
► Q(n) satisfies the following recurrence relation:
\[
Q(n) = \begin{cases} n^2/L + n & \text{if } n^2 \le \alpha Z \text{ (base case)},\\ 4\,Q(n/2) + O(1) & \text{if } n^2 > \alpha Z \text{ (recurrence)}. \end{cases}
\]
A matrix transposition cache-oblivious algorithm (2/2)
Case I: max{m, n} < αL.
Case II: n < αL < m.
Consider n < αL < m. The Rec-Transpose algorithm
divides the greater dimension m by 2 and recurses.
At some point in the recursion, we have αL/2 ≤ m ≤ αL and
the whole computation fits in cache. At this point:
► the input array resides in contiguous locations, requiring at
most Θ(1 + nm/L) cache misses;
► the output array consists of nm elements in n rows, where in
the worst case every row starts at a different cache line,
leading to at most Θ(n + nm/L) cache misses.
Since m/L ∈ [α/2, α], the total cache complexity for this base
case is Θ(1 + n), yielding the recurrence (where the resulting
Q(m, n) is a worst-case estimate)
\[
Q(m,n) = \begin{cases}
\Theta(m + n + mn/L) & \text{if } m, n \in [\alpha L/2,\, \alpha L],\\
2\,Q(m/2, n) + O(1) & \text{if } m \ge n,\\
2\,Q(m, n/2) + O(1) & \text{otherwise.}
\end{cases}
\]
A cache-oblivious matrix multiplication algorithm (1/3)
► We describe and analyze a cache-oblivious algorithm for
multiplying an m x n matrix by an n x p matrix
cache-obliviously using
► Θ(mnp) work and incurring
► Θ(m + n + p + (mn + np + mp)/L + mnp/(L√Z)) cache
misses.
► This straightforward divide-and-conquer algorithm contains no
voodoo parameters (tuning parameters) and it uses cache
optimally.
► Intuitively, this algorithm uses the cache effectively, because
once a subproblem fits into the cache, its smaller subproblems
can be solved in cache with no further cache misses.
► These results require the tall-cache assumption for matrices
stored in row-major layout format.
► This assumption can be relaxed for certain other layouts, see
(Frigo et al. 1999).
► The case of Strassen's algorithm is also treated in (Frigo et al.
1999).
A cache-oblivious matrix multiplication algorithm (2/3)
► To multiply an m x n matrix A and an n x p matrix B, the
Rec-Mult algorithm halves the largest of the three
dimensions and recurs according to one of the following three
cases:
\[
\begin{pmatrix} A_1 \\ A_2 \end{pmatrix} B = \begin{pmatrix} A_1 B \\ A_2 B \end{pmatrix}, \qquad (3)
\]
\[
\begin{pmatrix} A_1 & A_2 \end{pmatrix}\begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2, \qquad (4)
\]
\[
A \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A B_1 & A B_2 \end{pmatrix}. \qquad (5)
\]
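A serial C sketch of this recursion (assumed for illustration: all three matrices row-major with a common leading dimension ld, and a small base case standing in for the constant-size leaves):

#include <stddef.h>

/* Hedged sketch of the Rec-Mult recursion: C += A * B, where A is m x n,
 * B is n x p, C is m x p, all row-major with leading dimension ld. */
void rec_mult(const double *A, const double *B, double *C,
              int m, int n, int p, int ld) {
    const int THRESHOLD = 16;                  /* assumed base-case size */
    if (m <= THRESHOLD && n <= THRESHOLD && p <= THRESHOLD) {
        for (int i = 0; i < m; i++)            /* base case: naive multiply */
            for (int k = 0; k < n; k++)
                for (int j = 0; j < p; j++)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
    } else if (m >= n && m >= p) {             /* case (3): split A's rows */
        rec_mult(A, B, C, m / 2, n, p, ld);
        rec_mult(A + (size_t)(m / 2) * ld, B, C + (size_t)(m / 2) * ld,
                 m - m / 2, n, p, ld);
    } else if (n >= m && n >= p) {             /* case (4): split the inner dim */
        rec_mult(A, B, C, m, n / 2, p, ld);
        rec_mult(A + n / 2, B + (size_t)(n / 2) * ld, C,
                 m, n - n / 2, p, ld);
    } else {                                   /* case (5): split B's columns */
        rec_mult(A, B, C, m, n, p / 2, ld);
        rec_mult(A, B + p / 2, C + p / 2, m, n, p - p / 2, ld);
    }
}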
Case I: m, n, p > α√Z (2/2)
\[
Q(m,n,p) = \begin{cases}
\Theta\big((mn + np + mp)/L\big) & \text{if } m, n, p \in [\alpha\sqrt{Z}/2,\; \alpha\sqrt{Z}],\\
2\,Q(m/2, n, p) + O(1) & \text{otherwise, if } m \ge n \text{ and } m \ge p,\\
2\,Q(m, n/2, p) + O(1) & \text{otherwise, if } n > m \text{ and } n \ge p,\\
2\,Q(m, n, p/2) + O(1) & \text{otherwise.}
\end{cases}
\]
The solution is Q(m, n, p) = Θ(mnp/(L√Z)).
Typical memory layouts for matrices
[Figure: typical memory layouts, e.g. (a) row-major order, elements numbered 0, 1, 2, …, 47 across the rows, and (b) column-major order.]
Acknowledgements and references
Acknowledgements.
► Charles E. Leiserson (MIT) and Matteo Frigo (Intel) for
providing me with the sources of their article Cache-Oblivious
Algorithms.
► Charles E. Leiserson (MIT) and Saman P. Amarasinghe (MIT)
for sharing with me the sources of their course notes and
other documents.
References.
► Cache-Oblivious Algorithms by Matteo Frigo, Charles E.
Leiserson, Harald Prokop and Sridhar Ramachandran.