Cache Performance Research Paper
Fall, 2006
CSCE430/830
AMAT_Data = HitTime_Data + MissRate_Data x MissPenalty_Data
Cache Performance Example
• Assume we have a computer where the cycles per instruction (CPI) is 1.0
when all memory accesses hit in the cache. The only data accesses are
loads and stores, and these total 50% of the instructions. If the miss penalty
is 25 clock cycles and the miss rate is 2% (unified instruction and
data cache), how much faster would the computer be if all instructions and
data accesses were cache hits?
CPUtime = (CPUClockCycles + MemoryStallCycles) x ClockCycleTime
        = (IC x CPI + MemoryStallCycles) x ClockCycleTime
In reality:
MemoryStallCycles = IC x (MemAccess/Inst) x MissRate x MissPenalty
                  = IC x (1 + 0.5) x 0.02 x 25 = IC x 0.75
So the effective CPI is 1.0 + 0.75 = 1.75, and the all-hit computer would be 1.75/1.0 = 1.75 times faster.
For gcc, the frequency of all loads and stores is 36%. Suppose the clock rate
is doubled while memory speed stays the same, so the miss penalty grows from
40 to 80 clock cycles (base CPI = 2):
Instruction miss cycles = IC x 2% x 80 = 1.600 x IC
Data miss cycles = IC x 36% x 4% x 80 = 1.152 x IC
Total memory stall cycles = 2.752 x IC, so CPIfastClk = 2 + 2.752 = 4.752
With the original clock the stalls cost half as many cycles: CPIslowClk = 2 + 1.376 = 3.376
Speedup = (I x CPIslowClk x Clock period) / (I x CPIfastClk x Clock period / 2)
        = 3.376 / (4.752 x 0.5) = 1.42 (not 2)
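The arithmetic above can be checked with a short script. The base CPI of 2 and the original 40-cycle miss penalty are inferred from the slide's totals (3.376 and 4.752), so treat them as assumptions:

```python
# Sketch of the clock-doubling example: doubling the clock rate doubles the
# miss penalty in cycles (assumed 40 -> 80), so the speedup falls short of 2x.
def cpi_with_stalls(base_cpi, inst_miss_rate, mem_per_inst, data_miss_rate, penalty):
    # Stall cycles per instruction = instruction-fetch stalls + data-access stalls
    stalls = inst_miss_rate * penalty + mem_per_inst * data_miss_rate * penalty
    return base_cpi + stalls

cpi_slow = cpi_with_stalls(2.0, 0.02, 0.36, 0.04, 40)  # original clock
cpi_fast = cpi_with_stalls(2.0, 0.02, 0.36, 0.04, 80)  # doubled clock rate
speedup = cpi_slow / (cpi_fast * 0.5)                  # fast machine: half the cycle time
print(round(cpi_slow, 3), round(cpi_fast, 3), round(speedup, 2))  # 3.376 4.752 1.42
```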
[Figure: memory addresses 00-4C mapping into a small direct-mapped cache; successive accesses fill cache blocks 2 and 3]
Example: Accessing a Direct-Mapped Cache
• DM cache contains 4 one-word blocks. Find the number of misses for each
cache given this sequence of memory block accesses: 0, 8, 0, 6, 8
• Blocks 0 and 8 both map to set 0 (0 mod 4 = 8 mod 4 = 0), so set 0 gets
Mem[0], is overwritten with Mem[8], then with Mem[0], then with Mem[8]
again; block 6 maps to set 2. All five accesses miss.
Direct-Mapped Cache with n one-word blocks
• Pro: finds data fast
• Con: what if we access addresses 00001 and 10001 repeatedly? They map to
the same cache index, so we always miss…
[Figure: direct-mapped cache with indices 000-111 and the memory blocks that map to each index]
Fully Associative Block Placement
• Arbitrary block mapping: a memory block may be placed in any cache location
[Figure: fully associative cache; memory addresses 00-4C can map anywhere in the cache]
[Figure: stepping through the five memory accesses on the fully associative cache]
Set-Associative Cache
• PRO: easier to find a block, and a given block won't always overwrite the
same location
• CON (disadvantages):
  – More tag bits
  – More hardware
  – Higher access time
[Figure: 4-way set-associative cache; the tag is compared in each way and a 4-to-1 multiplexor selects the Hit/Data output]
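To see how associativity changes the earlier example, the same 4-block cache can be modeled direct-mapped, 2-way set-associative, and fully associative (LRU replacement is assumed for the associative cases):

```python
# One cache model parameterized by (num_sets, ways); num_sets * ways = 4 blocks.
def misses(accesses, num_sets, ways):
    sets = [[] for _ in range(num_sets)]  # each set is an LRU list, MRU at the end
    count = 0
    for block in accesses:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)               # hit: refresh LRU position
        else:
            count += 1                    # miss
            if len(s) == ways:
                s.pop(0)                  # set full: evict the LRU block
        s.append(block)
    return count

seq = [0, 8, 0, 6, 8]
print(misses(seq, 4, 1))   # direct-mapped: 5 misses
print(misses(seq, 2, 2))   # 2-way set-associative: 4 misses
print(misses(seq, 1, 4))   # fully associative: 3 misses
```

More associativity removes the conflict misses between blocks 0 and 8 at the cost of the extra comparators and multiplexing listed above.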
• How many total bits are needed for a direct-mapped cache with
64 KBytes of data and one-word blocks, assuming a 32-bit
address?
• How many total bits would be needed for a 4-way set-associative
cache to store the same amount of data?
• How many total bits are needed for a direct-mapped cache with
64 KBytes of data and 8-word blocks, assuming a 32-bit address?
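The first and third questions can be worked out as follows (the helper name is illustrative; the sizes assume byte addressing and one valid bit per block):

```python
# Total storage for a direct-mapped cache: data bits + tag bits + valid bit per block.
def dm_cache_bits(data_bytes, block_words, addr_bits=32):
    block_bytes = block_words * 4                  # 4 bytes per 32-bit word
    num_blocks = data_bytes // block_bytes
    index_bits = num_blocks.bit_length() - 1       # log2 of the block count
    offset_bits = block_bytes.bit_length() - 1     # byte offset within a block
    tag_bits = addr_bits - index_bits - offset_bits
    return num_blocks * (block_words * 32 + tag_bits + 1)

print(dm_cache_bits(64 * 1024, 1))  # 16K blocks x (32 + 16 + 1) = 802816 bits (~98 KB)
print(dm_cache_bits(64 * 1024, 8))  # 2K blocks x (256 + 16 + 1) = 559104 bits (~68.25 KB)
```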
Write Through
[Figure: processor stores update both the cache and memory; loads are serviced from the cache]
Write Back
[Figure: processor stores update only the cache; a modified block is written back to memory when it is replaced]
• Write-Allocate
– Bring written block into cache
– Update word in block
– Anticipate further use of block
• No-write Allocate
– Main memory is updated
– Cache contents unmodified
• Cache parameters
– Cache size, block size, associativity
[Figure: 3C miss-rate breakdown — miss rate per type (Conflict, Capacity, Compulsory) vs. cache size (1-128 KB) for 1-, 2-, 4-, and 8-way associativity; the compulsory component is vanishingly small]
2:1 Cache Rule
• A direct-mapped cache of size N has about the same miss rate as a 2-way
set-associative cache of size N/2.
[Figure: the same 3C miss-rate breakdown vs. cache size (1-128 KB), illustrating the rule]
3C Relative Miss Rate
[Figure: relative (percentage) breakdown of Conflict, Capacity, and Compulsory misses vs. cache size (1-128 KB) for 1- to 8-way associativity]
• Flaws: holds block size fixed
• Good: insight => invention
Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
Larger blocks exploit the principle of locality: the larger the block, the
greater the chance that parts of it will be used again.
[Figure: miss rate (0-20%) vs. block size (16-256 bytes) for cache sizes 1K-256K]
Victim Cache
[Figure: the CPU address is checked against the data-cache tag and, on a miss, against a small victim cache; a write buffer sits on the path to memory]
c) All N x N elements of Y and Z are accessed N times and each element of X
is accessed once. Thus, there are N^3 operations and 2N^3 + N^2 reads!
Capacity misses are a function of N and cache size in this case.
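The counts can be sanity-checked by instrumenting a naive N x N multiply X = Y x Z; the loop below only tallies accesses, and the accumulate-in-a-register assumption is why each X element is touched once:

```python
# Tally memory accesses in a naive N x N matrix multiply X = Y * Z.
def count_accesses(n):
    reads = x_stores = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                reads += 2      # one read of Y[i][k], one read of Z[k][j]
            x_stores += 1       # X[i][j] written once (held in a register meanwhile)
    return reads, x_stores

n = 8
reads, x_stores = count_accesses(n)
print(reads == 2 * n**3, x_stores == n**2)   # True True
```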
Calculating AMAT
• If a direct-mapped cache has a hit rate of 95%, a hit
time of 4 ns, and a miss penalty of 100 ns, what is the
AMAT?
• AMAT = Hit Time + Miss Rate x Miss Penalty = 4 ns + 5% x 100 ns = 9 ns
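A one-liner confirms the answer to the question above:

```python
# AMAT = Hit Time + Miss Rate x Miss Penalty
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(4, 0.05, 100))   # 9.0 ns (95% hit rate -> 5% miss rate)
```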
• Example:
– 16KB I&D: Inst miss rate = 0.64%, Data miss rate = 6.47%
– 32KB unified: Aggregate miss rate = 1.99%
• Which is better (ignoring the L2 cache)?
– Assume 33% of instructions are data ops, so 75% of accesses are instruction fetches (1.0/1.33)
– Hit time = 1, miss time = 50
– Note that a data hit incurs 1 stall cycle in the unified cache (only one port)
AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
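The comparison can be rechecked in a few lines; the 75%/25% access split and the extra unified-cache stall cycle come from the assumptions above:

```python
# Split (Harvard) vs. unified cache AMAT; hit time = 1 cycle, miss penalty = 50.
def harvard_amat():
    # 75% instruction fetches (0.64% miss rate), 25% data accesses (6.47% miss rate)
    return 0.75 * (1 + 0.0064 * 50) + 0.25 * (1 + 0.0647 * 50)

def unified_amat():
    # Data accesses pay one extra stall cycle: the single port is busy with fetches.
    return 0.75 * (1 + 0.0199 * 50) + 0.25 * (1 + 1 + 0.0199 * 50)

print(round(harvard_amat(), 2))   # 2.05
print(round(unified_amat(), 2))   # 2.24
```

Despite the unified cache's lower aggregate miss rate, the structural hazard on its single port makes the split design faster here.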
Cache Performance Summary