Memory Hierarchy
Department of Computer Science, National Tsing Hua University
Second semester, academic year 100 (spring 2012)
Outline
Memory hierarchy
The basics of caches
Measuring and improving cache performance
Virtual memory
A common framework for memory hierarchy
Using a Finite State Machine to Control a Simple Cache
Parallelism and Memory Hierarchies: Cache Coherence
[Figure: the processor-memory performance gap, 1980-2000. Processor performance grows about 60%/yr (Moore's Law, 2x every 1.5 years), while DRAM performance grows about 9%/yr (2x every 10 years); the gap grows roughly 50% per year.]
[Figure: control, datapath, and a hierarchy of memories; the probability of reference across the address space 0 to 2^n - 1, illustrating locality.]
Memory-7 Computer Architecture
Levels of Memory Hierarchy
Upper levels are faster; lower levels are larger. Each level stages a different transfer unit, managed by a different agent:
Registers: instr. operands, managed by the program/compiler
Cache: blocks, managed by the cache controller
Memory: pages, managed by the OS
Disk: files, managed by the user/operator
Tape: the lowest level
How Is the Hierarchy Managed?
Registers <-> memory: by the compiler (and the programmer?)
Cache <-> memory: by the hardware
Memory <-> disks: by the hardware and operating system (virtual memory), and by the programmer (files)
Cache Memory
Cache memory: the level of the memory hierarchy closest to the CPU.
Given accesses X1, ..., Xn-1, Xn, a direct-mapped cache places each block at the entry selected by the low-order bits of its address.
[Figure: an 8-entry direct-mapped cache; memory addresses map to cache indices 000-111 by their low 3 bits.]
Example: 1K words, 1-word blocks:
Cache index: lower 10 bits
Cache tag: upper 20 bits
Valid bit (at start-up, valid is 0)
With multiword blocks, a 32-bit address divides as:
Tag: bits 31-10 (22 bits); Index: bits 9-4 (6 bits); Offset: bits 3-0 (4 bits)
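Extracting these fields is just shifting and masking; a minimal sketch using the widths above (function name illustrative):

```python
OFFSET_BITS = 4   # 16-byte blocks -> 4 offset bits
INDEX_BITS = 6    # 64 cache entries -> 6 index bits

def split_address(addr):
    """Split a 32-bit byte address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x1234))  # (4, 35, 4)
```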
Write Buffer
Use a write buffer (WB):
Processor: writes data into both the cache and the WB
Memory controller: writes WB data to memory
The write buffer is just a FIFO; a typical depth is 4 entries.
Memory system designer's nightmare: the store frequency exceeds 1 / (DRAM write cycle), i.e., stores arrive faster than memory can retire them.
Write buffer saturation => the CPU stalls until an entry drains.
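Saturation can be seen in a toy simulation: a 4-entry FIFO fed by stores and drained once per DRAM write cycle (all rates and names below are illustrative, not from the slides):

```python
from collections import deque

def simulate(num_stores, store_interval, dram_cycle, capacity=4):
    """Count CPU stall cycles for a FIFO write buffer.
    One store issues every store_interval cycles; the memory
    controller retires one buffered write every dram_cycle cycles."""
    buf = deque()
    stalls = 0
    next_retire = dram_cycle
    cycle = 0
    issued = 0
    while issued < num_stores:
        cycle += 1
        if cycle >= next_retire and buf:
            buf.popleft()                # memory controller drains one entry
            next_retire = cycle + dram_cycle
        if cycle % store_interval == 0:  # CPU wants to issue a store
            if len(buf) < capacity:
                buf.append(cycle)
                issued += 1
            else:
                stalls += 1              # buffer full: CPU stalls this cycle
    return stalls

# Store every cycle while DRAM retires one write per 3 cycles -> saturation.
print(simulate(num_stores=20, store_interval=1, dram_cycle=3) > 0)   # True
# Store every 4 cycles with a 3-cycle DRAM write -> buffer never fills.
print(simulate(num_stores=20, store_interval=4, dram_cycle=3) == 0)  # True
```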
[Fig. 5.11: three memory organizations -- (a) a one-word-wide memory, (b) a wide memory with a multiplexor into the cache, (c) interleaved memory banks.]
Interleaving for Bandwidth
Access pattern without interleaving: each access occupies the memory for a full cycle time (the access for D1 must complete before the access for D2 can start).
Access pattern with interleaving: accesses to banks 0, 1, 2, 3 start on successive cycles, overlapping their access times; bank 0 can be accessed again once its cycle time has elapsed, so data becomes ready every cycle.
Miss Penalty for Different Memory Organizations
Assume:
1 memory bus clock to send the address
15 memory bus clocks for each DRAM access initiated
1 memory bus clock to send a word of data
A cache block = 4 words
Three memory organizations:
A one-word-wide bank of DRAMs: miss penalty = 1 + 4 x 15 + 4 x 1 = 65
A four-word-wide memory: miss penalty = 1 + 15 + 1 = 17
Four interleaved one-word-wide banks: miss penalty = 1 + 15 + 4 x 1 = 20
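The arithmetic can be checked directly; the four-word-wide and interleaved figures below assume the standard wide and four-bank organizations of Fig. 5.11:

```python
# Miss penalty (in memory bus clocks) for a 4-word cache block, given:
# 1 clock to send the address, 15 clocks per DRAM access initiated,
# 1 clock to transfer each word of data.
ADDR, ACCESS, XFER, WORDS = 1, 15, 1, 4

one_word_wide = ADDR + WORDS * ACCESS + WORDS * XFER  # accesses fully serial
four_word_wide = ADDR + ACCESS + XFER                 # whole block at once
interleaved = ADDR + ACCESS + WORDS * XFER            # 4 banks overlap access

print(one_word_wide, four_word_wide, interleaved)  # 65 17 20
```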
[Figure: a 4M x 1 DRAM built as a 2048 x 2048 square array, addressed by row and column with address bits 21-0.]
[Figure: a four-way set-associative cache with 256 sets -- four tag comparators and a 4-to-1 multiplexor select the hit way and its 32-bit data; 22-bit tags, Hit/Data outputs.]
Virtual Memory
Use main memory as a "cache" for secondary (disk) storage
Managed jointly by CPU hardware and the operating system (OS)
Programs share main memory:
Each gets a private virtual address space holding its frequently used code and data, starting at address 0 and accessible only to itself
Yet any program can run anywhere in physical memory: it executes in a name space (virtual address space) different from the memory space (physical address space), and virtual memory implements the translation from virtual space to physical space
Each program is protected from the others
Every program appears to have lots of memory (> physical memory)
[Figure: registers and cache above main memory, which in turn acts as a "cache" for disk; virtual pages map to physical frames.]
Block Size and Placement Policy
Huge miss penalty: a page fault may take millions of
cycles to process
Pages should be fairly large (e.g., 4KB) to amortize the
high access time
Reducing page faults is important
fully associative placement
=> use page table (in memory) to locate pages
[Figure: an address translation mechanism maps a virtual address a to a physical address a' in main memory; on a fault, the OS performs the transfer from secondary memory.]
Page Tables
Stores placement information
Array of page table entries, indexed by virtual page
number
Page table register in CPU points to page table in
physical memory
If page is present in memory
PTE stores the physical page number
Plus other status bits (referenced, dirty, …)
If page is not present
PTE can refer to location in swap space on disk
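A lookup following this description, with an illustrative dict standing in for the PTE array (a real PTE packs the valid/dirty bits and page number into one word):

```python
PAGE_BITS = 12  # 4 KB pages

page_table = {
    0x00000: {"valid": True, "ppn": 0x2A, "dirty": False},
    0x00001: {"valid": False, "disk": "swap location"},  # page is on disk
}

def translate(vaddr):
    """Return the physical address, or raise on a page fault."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    pte = page_table.get(vpn)
    if pte is None or not pte.get("valid"):
        raise RuntimeError("page fault")  # OS fetches the page from swap
    return (pte["ppn"] << PAGE_BITS) | offset

print(hex(translate(0xABC)))  # vpn 0 -> ppn 0x2A -> 0x2aabc
```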
Virtual-to-physical translation:
A 32-bit virtual address splits into a 20-bit virtual page number (bits 31-12) and a 12-bit page offset (bits 11-0); translation replaces the virtual page number with a physical page number to form the physical address.
Two-level page table for the 4 GB virtual address space:
First level: 4 KB of PTE1 (4-byte entries; each entry indicates whether any page in its segment is allocated)
Second level: 4 MB of PTE2, itself paged, with holes for unallocated segments
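Such a two-level lookup can be sketched as follows, assuming an illustrative 10-bit segment / 10-bit page / 12-bit offset split of the 32-bit virtual address; second-level tables exist only for segments in use ("paged, with holes"):

```python
SEG_BITS, PAGE_BITS, OFFSET_BITS = 10, 10, 12

# Only segment 3 is allocated; within it, page 5 -> physical page 0x1F.
level1 = {3: {5: 0x1F}}

def translate(vaddr):
    seg = vaddr >> (PAGE_BITS + OFFSET_BITS)
    page = (vaddr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    level2 = level1.get(seg)
    if level2 is None:
        raise RuntimeError("segment not allocated")  # no PTE2 table exists
    ppn = level2.get(page)
    if ppn is None:
        raise RuntimeError("page fault")
    return (ppn << OFFSET_BITS) | offset

vaddr = (3 << 22) | (5 << 12) | 0x123
print(hex(translate(vaddr)))  # 0x1f123
```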
[Fig. 5.23: a page table; each entry's valid bit selects between a physical page number in physical memory (valid = 1) and a disk address pointing into disk storage (valid = 0).]
Integrating the TLB and Cache
[Figure: the 20-bit virtual page number is translated by the TLB (on a TLB hit, it yields the 20-bit physical page number); concatenated with the 12-bit page offset, this forms the 32-bit physical address used to access the cache.]
Read/write flow after the TLB hit:
Write? No: on a cache hit, deliver the data to the CPU; on a cache miss, stall while the block is read.
Write? Yes: check write access; raise a write-protection exception on a violation, otherwise try to write the data to the cache, stalling on a miss while the block is read.
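The TLB is a small cache of recent translations, consulted before the in-memory page table. A minimal sketch (structures and eviction policy illustrative, not real hardware):

```python
PAGE_BITS = 12
page_table = {7: 0x40, 8: 0x41}  # vpn -> ppn (toy data)
tlb = {}                          # vpn -> ppn, capacity-limited
TLB_ENTRIES = 2

def translate(vaddr):
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    if vpn in tlb:                  # TLB hit: no page-table access needed
        ppn = tlb[vpn]
    else:                           # TLB miss: walk the page table
        ppn = page_table[vpn]
        if len(tlb) >= TLB_ENTRIES:
            tlb.pop(next(iter(tlb)))  # evict an entry (illustrative policy)
        tlb[vpn] = ppn
    return (ppn << PAGE_BITS) | offset

print(hex(translate(0x7ABC)))  # TLB miss, translation cached -> 0x40abc
print(hex(translate(0x7DEF)))  # same page: TLB hit -> 0x40def
```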
Virtually Addressed Cache
Requires address translation only on a miss!
Problems:
The same virtual address (in different processes) maps to different physical addresses: extend the tag with a process id.
Synonym/alias problem: two different virtual addresses map to the same physical address, so two different cache entries can hold data for the same physical address!
On an update, all cache entries with the same physical address must be updated, or memory becomes inconsistent.
Detecting this requires significant hardware, essentially an associative lookup on the physical address tags to see if there are multiple hits.
Alternatively, software can enforce an alias boundary: aliases must agree in their least-significant address bits up to the cache size, so they always index the same cache entry.
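A small numeric illustration of the alias problem, assuming 4 KB pages and an 8 KB direct-mapped virtually indexed cache with 32-byte lines, so one index bit lies above the page offset (all parameters illustrative):

```python
PAGE_BITS, LINE_BITS, CACHE_LINES = 12, 5, 256  # 8 KB / 32 B = 256 lines

def cache_index(vaddr):
    """Index of a virtually indexed direct-mapped cache."""
    return (vaddr >> LINE_BITS) % CACHE_LINES

# Two virtual addresses assumed to map to the same physical page: they
# differ in bit 12 and bit 13, i.e., above the 4 KB page offset.
va1 = 0x1000
va2 = 0x2000
print(cache_index(va1), cache_index(va2))  # 128 0 -> two lines, one PA
```

Because the indices differ, the same physical data can live in two cache lines at once; enforcing the alias boundary above makes such pairs index identically.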
[Figure: overlapped TLB and cache access -- the cache is indexed with untranslated low bits (10-bit index, 2-bit word offset, 4-byte words) while the TLB translates the 20-bit virtual page number; the cache tag is then compared against the physical address (PA) from the TLB to determine hit/miss.]
IF cache hit AND (cache tag = PA) THEN deliver data to CPU
ELSE IF [cache invalid OR (cache tag != PA)] AND TLB hit THEN
    access memory with the PA from the TLB
ELSE do standard VA translation
Memory Protection
Different tasks can share parts of their virtual address
spaces
But need to protect against errant access
Requires OS assistance
Hardware support for OS protection
2 modes: kernel, user
Privileged supervisor mode (aka kernel mode)
Privileged instructions
Page tables and other state information only
accessible in supervisor mode
System call exception (e.g., syscall in MIPS): switches the CPU from user to kernel mode
A Common Framework for Memory Hierarchy
Hardware caches: reduce comparisons to reduce cost
Virtual memory: a full table lookup makes full associativity feasible; the benefit is a reduced miss rate
Trends:
Synchronous SRAMs (provide a burst of data)
Redesign DRAM chips to provide higher bandwidth or processing
Restructure code to increase locality
Use prefetching (make the cache visible to the ISA)
Using a Finite State Machine to Control a Simple Cache
Cache address fields for the FSM example:
Tag: bits 31-14 (18 bits); Index: bits 13-4 (10 bits); Offset: bits 3-0 (4 bits)
[Figure: signals between the CPU, the cache controller, and memory -- each interface has Read/Write, Valid, a 32-bit Address, Write Data (32 bits from the CPU, 128 bits to memory), Read Data (32 bits to the CPU, 128 bits from memory), and Ready.]
Memory takes multiple cycles per access; the FSM could be partitioned into separate states to reduce the clock cycle time.
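The controller can be sketched as a small state machine; the four states below (Idle, Compare Tag, Write-Back, Allocate) follow the textbook's simple cache controller, with datapath actions and the memory Ready handshake elided for brevity:

```python
def next_state(state, request, hit, dirty):
    """Next state of a simple four-state cache-controller FSM."""
    if state == "Idle":
        return "CompareTag" if request else "Idle"
    if state == "CompareTag":
        if hit:
            return "Idle"        # hit: access completes
        return "WriteBack" if dirty else "Allocate"
    if state == "WriteBack":
        return "Allocate"        # old dirty block written to memory
    if state == "Allocate":
        return "CompareTag"      # new block fetched; re-check the tag
    raise ValueError(state)

# Miss on a clean block: Idle -> CompareTag -> Allocate -> CompareTag -> Idle
s = "Idle"
trace = [s]
for hit in (None, False, None, True):
    s = next_state(s, request=True, hit=hit, dirty=False)
    trace.append(s)
print(" -> ".join(trace))
```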
[Coherence example (truncated): time step 3 -- CPU A writes 1 to X; cache A's copy = 1, cache B's stale copy = 0, memory = 1.]