ECE4680 Computer Organization and Architecture Memory Hierarchy: Cache System
ECE4680 Cache.1
2002-4-17
[Figure: Processor — Cache — DRAM]
Motivation: Large memories (DRAM) are slow; small memories (SRAM) are fast. Make the average access time small by servicing most accesses from a small, fast memory. This also reduces the bandwidth required of the large memory.
Levels of the memory hierarchy, from Upper Level (faster) to Lower Level (larger):
  Registers — staging/transfer unit: instr. operands (1-8 bytes), managed by program/compiler
  Cache — blocks (8-128 bytes), managed by cache controller
  Memory — pages (512-4K bytes), managed by OS
  Disk — files (Mbytes), managed by user/operator
  Tape
[Figure: probability of reference across the address space]
The Principle of Locality: Programs access a relatively small portion of the address space at any instant of time. Example: 90% of time in 10% of the code. Two different types of locality: Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon. Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
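The loop below is a minimal illustration of both kinds of locality (the array size here is arbitrary): the scalar `total` is re-referenced on every iteration (temporal locality), while `data[i]` walks consecutive addresses (spatial locality) — exactly what multi-word cache blocks exploit.

```python
# Temporal locality: `total` and `i` are reused every iteration.
# Spatial locality: data[i] touches consecutive memory addresses.
data = list(range(1024))
total = 0
for i in range(len(data)):
    total += data[i]

assert total == 1023 * 1024 // 2
```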
Block: the minimum unit of information that can either be present or not present in the two-level hierarchy.
[Figure: block transfer in a two-level hierarchy — data moves between the processor and the upper level, and blocks (e.g. Blk Y) move between the upper and lower levels]
Typical values:
  Block (line) size: 4 - 128 bytes
  Hit time: 1 - 4 cycles
  Miss penalty: 8 - 32 cycles (and increasing)
    access time: 6 - 10 cycles; transfer time: 2 - 22 cycles
  Miss rate: 1% - 20%
  Cache size: 1 KB - 256 KB
Cache location 0 can be occupied by data from memory location 0, 4, 8, ... etc. — in general, any memory location whose 2 LSBs are 0s, so Address<1:0> gives the cache index. In general: cache index = (block address) modulo (# cache blocks). How can we tell which one is in the cache? Which one should we place in the cache?
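The mapping rule above can be sketched in a few lines (assuming, as in the example, four cache blocks and block-level addresses):

```python
# Direct-mapped placement: the cache index is the block address modulo
# the number of cache blocks — with 4 blocks, just its 2 LSBs.
NUM_BLOCKS = 4

def cache_index(block_addr):
    return block_addr % NUM_BLOCKS        # same as block_addr & 0b11

# Memory blocks 0, 4, 8, ... all compete for cache location 0.
assert [cache_index(b) for b in (0, 4, 8)] == [0, 0, 0]
# The tag (the remaining upper bits) tells which one is resident.
assert [b // NUM_BLOCKS for b in (0, 4, 8)] == [0, 1, 2]
```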
Example: Consider an eight-word direct mapped cache. Show the contents of the cache as it responds to a series of requests (decimal addresses): 22, 26, 22, 26, 16, 3, 16, 18
  Address   Binary   Cache index   Result
  22        10110    110           miss
  26        11010    010           miss
  22        10110    110           hit
  26        11010    010           hit
  16        10000    000           miss
  3         00011    011           miss
  16        10000    000           hit
  18        10010    010           miss
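A minimal simulation of this trace (8 one-word blocks, word addresses, index = address mod 8) reproduces the hit/miss pattern:

```python
# Direct-mapped cache: 8 one-word blocks; each index holds one tag.
NUM_BLOCKS = 8
cache = {}  # index -> tag (an entry is valid iff the index is present)

def access(addr):
    index, tag = addr % NUM_BLOCKS, addr // NUM_BLOCKS
    hit = cache.get(index) == tag
    cache[index] = tag                    # on a miss, the block is loaded
    return "hit" if hit else "miss"

trace = [22, 26, 22, 26, 16, 3, 16, 18]
results = [access(a) for a in trace]
assert results == ["miss", "miss", "hit", "hit", "miss", "miss", "hit", "miss"]
```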
[Figure: a 2^N-byte direct-mapped cache — the N-bit cache index (e.g. 0x03) selects a line; each line holds a valid bit, a cache tag (e.g. 0x50), and cache data (Byte 0, Byte 1, Byte 2, Byte 3, ...)]
[Figure: cache contents during an example reference sequence — entries such as M[00001] and M[01010] appearing at their cache indexes]
Question: For block size, can we say the larger, the better?
[Figure: the valid bit and cache tag (e.g. 0x50) are stored as part of the cache state, alongside the cache data (Byte 0 ... Byte 1023)]
Bits in a cache: How many total bits are required for a direct-mapped cache with 64 KB of data and one-word blocks, assuming a 32-bit address?
64 KB = 2^14 words = 2^14 blocks, so the cache index is 14 bits and the byte offset is 2 bits. Each block stores 32 data bits, a (32 - 14 - 2) = 16-bit tag, and 1 valid bit:
  Total = 2^14 × (32 + (32 - 14 - 2) + 1) = 2^14 × 49 = 784 Kbits ≈ 98 KB
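A quick check of the count, using the sizes from the problem statement:

```python
# 64 KB of data, one-word (4-byte) blocks, 32-bit byte addresses.
num_blocks = 64 * 1024 // 4               # 2^14 blocks
index_bits, offset_bits = 14, 2
tag_bits = 32 - index_bits - offset_bits  # 16 tag bits
bits_per_block = 32 + tag_bits + 1        # data + tag + valid = 49 bits
total = num_blocks * bits_per_block

assert total == 2**14 * 49                # 784 Kbits, about 98 KB
```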
[Figure: miss rate, miss penalty, and average access time, each plotted against block size]
True: if an item is accessed, it is likely to be accessed again soon. But it is unlikely to be accessed again immediately! So the next access will likely be a miss again: the cache continually loads data but discards (forces out) it before it is used again. A cache designer's worst nightmare: the ping-pong effect.
Conflict misses are misses caused by different memory locations mapping to the same cache index. Solution 1: make the cache size bigger.
Designing a cache: determine the internal register transfers, design the datapath, and design the cache controller.
[Figure: cache interface — Address, Data In, Data Out]
Cache hit time: directly tied to clock rate; increases with cache size; increases with associativity.
[Figure: pipeline stalls on cache misses — an I-cache miss invalidates the IR as it moves through the pipeline stages (IR, IRex, IRm, IRwb); a D-cache miss stalls similarly]
Average memory access time = Hit time + Miss rate × Miss penalty
Execution time = IC × CT × (ideal CPI + memory stall cycles per instruction)
Memory stall cycles = Read stall cycles + Write stall cycles
  Read stall cycles = #reads × read miss rate × read miss penalty
  Write stall cycles = #writes × write miss rate × write miss penalty
Combining reads and writes:
  Memory stall cycles = #accesses × miss rate × miss penalty
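As a numeric illustration of these formulas — the hit time, miss rate, penalty, and access count below are assumed values, not from the slides:

```python
# Average memory access time = hit time + miss rate * miss penalty.
hit_time, miss_rate, miss_penalty = 1, 0.05, 40   # assumed values
amat = hit_time + miss_rate * miss_penalty
assert amat == 3.0                                # cycles per access

# Combined stall formula: #accesses * miss rate * miss penalty.
accesses = 1000
stall_cycles = accesses * miss_rate * miss_penalty
assert stall_cycles == 2000
```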
Examples:
Q1: Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a machine has a CPI of 2 without any memory stalls and the miss penalty is 40 cycles for all misses, determine how much faster a machine would run with a perfect cache that never missed. (The frequency of all loads and stores in gcc is 36%.)
Answer: Instruction miss cycles = I × 2% × 40 = 0.80I. Data miss cycles = I × 36% × 4% × 40 = 0.576I ≈ 0.56I. The CPI with memory stalls is about 2 + 1.36 = 3.36, so the machine with the perfect cache is faster by a factor of about 3.36/2 = 1.68.
Q2: Suppose we speed up the machine by reducing the CPI from 2 to 1 — how does the speedup from a perfect cache change?
Q3: What if we double the clock rate?
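Redoing the Q1 arithmetic exactly (the slide truncates 0.576 to 0.56, which yields 3.36 and 1.68):

```python
base_cpi, penalty = 2.0, 40
i_miss, d_miss, ls_freq = 0.02, 0.04, 0.36   # gcc figures from the example

i_stall = i_miss * penalty                   # 0.80 stall cycles / instruction
d_stall = ls_freq * d_miss * penalty         # 0.576 stall cycles / instruction
cpi = base_cpi + i_stall + d_stall           # 3.376
speedup_perfect = cpi / base_cpi             # ~1.69 (slide rounds to 1.68)

assert abs(i_stall - 0.80) < 1e-9
assert abs(cpi - 3.376) < 1e-9
assert abs(speedup_perfect - 1.688) < 1e-9
```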
[Figure: two-way set-associative cache — the cache index selects a set; the address tag (Adr Tag) is compared in parallel against both ways' cache tags (each way has a valid bit, cache tag, and cache data); the compare outputs drive Sel1/Sel0 on a mux that selects the cache block, and their OR signals Hit]
Example: There are three small caches, each consisting of four one-word blocks. One cache is fully associative, a second is two-way set associative, and the third is direct mapped. Find the number of misses for each cache organization given the following sequence of block addresses: 0, 8, 0, 6, 8
Assuming LRU replacement:
  Block addr   Binary   F.A.   2-Way   D.M.
  0            0000     miss   miss    miss
  8            1000     miss   miss    miss
  0            0000     hit    hit     miss
  6            0110     miss   miss    miss
  8            1000     hit    miss    miss
  Total misses:         3      4       5
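A small LRU simulation, parameterized by the number of sets and ways, reproduces these miss counts:

```python
# Set-associative cache model: num_sets * ways blocks total, LRU per set.
def misses(block_trace, num_sets, ways):
    sets = [[] for _ in range(num_sets)]  # each set: tags in LRU order
    count = 0
    for addr in block_trace:
        s = sets[addr % num_sets]
        if addr in s:
            s.remove(addr)                # hit: refresh LRU position
        else:
            count += 1
            if len(s) == ways:
                s.pop(0)                  # evict least recently used
        s.append(addr)
    return count

trace = [0, 8, 0, 6, 8]
assert misses(trace, num_sets=1, ways=4) == 3  # fully associative
assert misses(trace, num_sets=2, ways=2) == 4  # two-way set associative
assert misses(trace, num_sets=4, ways=1) == 5  # direct mapped
```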
[Figure: fully associative cache — the address tag is compared against every cache entry's tag in parallel]
Note: if you are going to run billions of instructions, compulsory misses are insignificant.
[Figure: replacement pointer cycling through cache entries Entry 0 ... Entry 63]
- Control can be complex.
Write Through: write to the cache and memory at the same time. What?! How can this be? Isn't memory too slow for this?
[Figure: Processor → Cache and Write Buffer → DRAM]
A write buffer is needed between the cache and memory:
  Processor: writes data into the cache and the write buffer.
  Memory controller: writes the contents of the buffer to memory.
The write buffer is just a FIFO; a typical number of entries is 4.
Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.
Memory system designer's nightmare: store frequency (w.r.t. time) > 1 / DRAM write cycle — write buffer saturation.
Store frequency (w.r.t. time) > 1 / DRAM write cycle: if this condition exists for a long period of time (CPU cycle time too short and/or too many store instructions in a row), the write buffer will overflow no matter how big you make it — the CPU cycle time is <= the DRAM write cycle time, so stores can arrive faster than they drain.
Solutions for write buffer saturation: use a write-back cache, or install a second-level (L2) cache:
[Figure: Processor → Cache → Write Buffer → L2 Cache → DRAM]
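The saturation condition above can be sketched with a toy FIFO model; the store interval, drain rate, and 4-entry depth below are illustrative assumptions, not figures from the slides:

```python
from collections import deque

# Toy write-buffer model: one store every `store_interval` cycles,
# one DRAM write retired every `dram_write_cycle` cycles, 4-entry FIFO.
def simulate(store_interval, dram_write_cycle, cycles=1000, depth=4):
    buf = deque()
    overflows = 0
    for t in range(cycles):
        if t % dram_write_cycle == 0 and buf:
            buf.popleft()                 # memory controller drains one entry
        if t % store_interval == 0:
            if len(buf) < depth:
                buf.append(t)             # processor writes into the buffer
            else:
                overflows += 1            # saturation: the CPU must stall
    return overflows

# Store frequency << 1 / DRAM write cycle: the buffer keeps up.
assert simulate(store_interval=50, dram_write_cycle=10) == 0
# Store frequency > 1 / DRAM write cycle: the buffer saturates.
assert simulate(store_interval=5, dram_write_cycle=10) > 0
```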
[Figure: cache data array — valid bit and cache data (Byte 0 ... Byte 1023)]
[Figure: memory organizations — a one-word-wide memory (cache, bus, memory) vs. a wide memory with a mux between the cache and the bus vs. interleaved banks]
DRAM (read/write) cycle time >> DRAM (read/write) access time — roughly 2:1; why?
DRAM (read/write) cycle time: how frequently can you initiate an access?
  Analogy: a little kid can only ask his father for money on Saturday.
DRAM (read/write) access time: how quickly will you get what you want once you initiate an access?
  Analogy: as soon as he asks, his father gives him the money.
DRAM bandwidth limitation analogy: what happens if he runs out of money on Wednesday?
[Figure: four-way interleaved memory — the CPU accesses Bank 0, Bank 1, Bank 2, and Bank 3 in turn, overlapping their cycle times; by the time Bank 3 is accessed, Bank 0 can be accessed again]
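The benefit of overlapping bank cycle times can be shown with simple timing arithmetic; the per-phase cycle counts below (1 cycle to send the address, 15 cycles of DRAM access, 1 cycle per word transferred) are assumed for illustration:

```python
# Assumed timing, per access phase (not figures from the slides).
SEND, ACCESS, XFER = 1, 15, 1
words = 4  # one 4-word cache block

# One-word-wide memory: each word pays the full latency in sequence.
one_word_wide = words * (SEND + ACCESS + XFER)   # 68 cycles

# Four interleaved banks: the four ACCESS phases overlap, and only the
# word transfers are serialized on the bus.
interleaved = SEND + ACCESS + words * XFER       # 20 cycles

assert interleaved < one_word_wide
```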
Interleaving works for sequential accesses; otherwise the request will return to the original bank before it has the next word ready.
Increasing DRAM density => fewer chips => harder to organize multiple banks.
Growth in bits/chip for DRAM: 50%-60%/yr.
Nathan Myhrvold (Microsoft): mature software grows 33%/yr (for NT), while MB/$ of DRAM grows only 25%-30%/yr.
[Figure: SPARCstation 20 — Memory Bus (SIMM bus, 128-bit-wide datapath) connecting Memory Modules 0-7; Processor Module (MBus module) containing the SuperSPARC processor with register file, instruction cache, data cache, and external cache]
SPARCstation 20's external cache:
  Size and organization: 1 MB, direct mapped
  Block size: 128 B
  Sub-block size: 32 B
  Write policy: write back, write allocate
SPARCstation 20's internal instruction cache:
  Size and organization: 20 KB, 5-way set associative
  Block size: 64 B
  Sub-block size: 32 B
  Write policy: does not apply
  Note: sub-block size is the same as in the external (L2) cache
SPARCstation 20's internal data cache:
  Size and organization: 16 KB, 4-way set associative
  Block size: 64 B
  Sub-block size: 32 B
  Write policy: write through, write not allocate
  Note: sub-block size is the same as in the external (L2) cache
Why did they use an N-way set-associative cache internally? Answer: an N-way set-associative cache is like having N direct-mapped caches in parallel, and they wanted each of those N direct-mapped caches to be 4 KB — the same as the virtual page size. How many levels of cache does the SPARCstation 20 have? Answer: three levels: (1) internal I & D caches, (2) external cache, and (3) ...
DRAM Performance
A 60 ns (tRAC) DRAM can:
  perform a row access only every 110 ns (tRC);
  perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC) — in practice, external address delays and bus turnaround make it 40 to 50 ns.
These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead. Driving the parallel DRAMs, the external memory controller, bus turnaround, the SIMM module, and the pins all add delay: 180 ns to 250 ns of latency from processor to memory is considered good for a 60 ns (tRAC) DRAM.
Summary:
The Principle of Locality: temporal locality vs. spatial locality.
Four questions for any cache:
  Where to place a block in the cache
  How to locate a block in the cache
  Which block to replace on a miss
  Write policy: write through vs. write back; on a write miss: write allocate vs. write not allocate
Three major categories of cache misses:
  Compulsory misses: sad facts of life (example: cold-start misses).
  Conflict misses: reduced by increasing cache size and/or associativity. Nightmare scenario: the ping-pong effect!
  Capacity misses: reduced by increasing cache size.