Chapter 5


CpE 440, Second 2020-2021, Yarmouk University, 2/26/2021

CpE 440
Computer Architecture
Dr. Haithem Al-Mefleh
Computer Engineering Department
Yarmouk University, Second 2020-2021

Large and Fast: Exploiting


Memory Hierarchy


• We want unlimited amounts of fast memory (M)


• We create that illusion

• A program does not access all Code/Data at once with equal probability

• Principle of Locality
• Programs access a small portion of their address space at any given
time

Two types of locality
• Temporal (locality in time)
an item is referenced → it tends to be referenced again soon

• Spatial (locality in space)
an item is referenced → items with close addresses tend to be
referenced soon

• Loops, arrays, ..
• Sequential access to code, data


Memory Hierarchy
• Take advantage of Principle of Locality by implementing M hierarchy

• Multiple levels of M of different speeds and sizes

• Faster M – more expensive per bit → smaller & closer to processor

• Main M – DRAM
• Caches – SRAM
• Magnetic Disks, or Flashes in Embedded Systems

Goal: provide the user with
as much memory as is available in the cheapest technology,
with access at the speed of the fastest M

• Access time ≈ that of Level 1
• Size ≈ that of Level n


Data is also hierarchical
• All data – at the lowest level
• A subset of that data at the next level up
• and so on

• Data is copied only between 2 adjacent levels at a time

• Block (or line): the minimum unit of information that can be
either present or not present in a cache

• Requested data is in a block in the upper level → a hit

• Requested data is not in the upper level → a miss

• Hit Rate = 1 – Miss Rate (also stated as a rate or ratio)

• Hit Time
• Time to access upper level; includes determining a miss or a hit

• Miss Penalty
• Time to replace a block in the upper level with corresponding block from lower
level + time to deliver to processor

• Hit time << time to access the next level


• Concepts used to build the M hierarchy affect other aspects of a computer


• OS M and I/O management
• Compilers generating code
• Applications using the computer

• Programs spend much of their time accessing M


• A major factor of performance

• Programmers must now understand M hierarchy

• Do not forget to try “Check Yourself” sections


• Answers given at end of chapter


The Basics of Caches


• A storage that takes advantage of locality

• Levels of M between main M and the processor

• M’s in the datapath (Chapter 4) – caches


A very simple example,….

• Processor – requests 1 word
• Block – 1 word

• Xn initially not in cache →
a miss →
fetch Xn from M to cache

How to know if data is in the cache?
How to find it?


Direct Mapped
• Simplest
• Each M location is directly mapped to exactly 1 location in the cache

• In almost all direct-mapped caches:
• Number of entries in cache = power of 2 →
cache location = (block address) modulo (number of blocks in cache)
• Modulo computed by using the low-order log2(cache size in blocks)
bits of the block address
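The modulo mapping can be checked with a short Python sketch (the function name is just for illustration):

```python
# Direct-mapped cache index: (block address) modulo (number of cache blocks).
# When the number of blocks is a power of 2, the modulo is simply the
# low-order log2(number of blocks) bits of the block address.

def cache_index(block_address, num_blocks):
    """Index of the cache entry a memory block maps to (direct mapped)."""
    assert num_blocks & (num_blocks - 1) == 0, "num_blocks must be a power of 2"
    return block_address % num_blocks      # == block_address & (num_blocks - 1)

# Example: an 8-block cache uses the 3 lowest bits of the block address.
print(cache_index(22, 8))   # 22 = 10110b -> index 110b = 6
```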


An 8-block cache:
the 3 lowest bits of the block address select the entry

More than one M location maps to each cache entry,…

• How to know if the data requested is in the cache?
• Is it valid?


• Add tags to the cache …. the upper bits of the address

• Specify whether the word in cache corresponds to the requested word
• An 8-block cache:
• address – 5 bits
• low 3 bits – index
• high 2 bits – tag

• Add a valid bit

• 1 – valid
• 0 – not valid → no match for this block
• Initially,…. 0

• How to handle: reads, misses, writes,…?


Example: 8-block cache, 1 word/block, direct mapped
Initial state – all entries invalid:

Index  V  Tag  Data
000    0
001    0
010    0
011    0
100    0
101    0
110    0
111    0

Word address 22 = 10 110₂ → Miss → cache block 110

Index  V  Tag  Data
000    0
001    0
010    0
011    0
100    0
101    0
110    1  10   M[10110]
111    0

Word address 26 = 11 010₂ → Miss → cache block 010

Index  V  Tag  Data
000    0
001    0
010    1  11   M[11010]
011    0
100    0
101    0
110    1  10   M[10110]
111    0

Word address 22 = 10 110₂ → Hit → cache block 110

Index  V  Tag  Data
000    0
001    0
010    1  11   M[11010]
011    0
100    0
101    0
110    1  10   M[10110]
111    0

Word address 26 = 11 010₂ → Hit → cache block 010

Index  V  Tag  Data
000    0
001    0
010    1  11   M[11010]
011    0
100    0
101    0
110    1  10   M[10110]
111    0

Word address 16 = 10 000₂ → Miss → cache block 000

Index  V  Tag  Data
000    1  10   M[10000]
001    0
010    1  11   M[11010]
011    0
100    0
101    0
110    1  10   M[10110]
111    0

Word address 3 = 00 011₂ → Miss → cache block 011

Index  V  Tag  Data
000    1  10   M[10000]
001    0
010    1  11   M[11010]
011    1  00   M[00011]
100    0
101    0
110    1  10   M[10110]
111    0

Word address 16 = 10 000₂ → Hit → cache block 000

Index  V  Tag  Data
000    1  10   M[10000]
001    0
010    1  11   M[11010]
011    1  00   M[00011]
100    0
101    0
110    1  10   M[10110]
111    0

Word address 18 = 10 010₂ → Miss → cache block 010 (replaces word 26)

Index  V  Tag  Data
000    1  10   M[10000]
001    0
010    1  10   M[10010]
011    1  00   M[00011]
100    0
101    0
110    1  10   M[10110]
111    0
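The whole access sequence from the slides can be replayed with a minimal direct-mapped cache simulator (a sketch, not production code):

```python
# Direct-mapped cache simulation of the slides' example:
# 8 blocks, 1 word/block, 5-bit word addresses (index = low 3 bits, tag = high 2).

def simulate(addresses, num_blocks=8):
    valid = [False] * num_blocks
    tag   = [0] * num_blocks
    results = []
    for a in addresses:
        idx, t = a % num_blocks, a // num_blocks
        if valid[idx] and tag[idx] == t:
            results.append("Hit")
        else:                       # miss: fetch the block, overwrite the entry
            valid[idx], tag[idx] = True, t
            results.append("Miss")
    return results

seq = [22, 26, 22, 26, 16, 3, 16, 18]
print(simulate(seq))
# ['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Miss', 'Hit', 'Miss']
```

The final miss on address 18 replaces word 26, since both map to block 010.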

Address Subdivision

• 32-bit address
• One-word blocks

Naming convention:
a "4 KB cache" → the size refers to the data only


2^n-block Cache
• Each block – 2^m words (2^(m+2) bytes)
• m bits – a word within the block
• 2 bits – a byte within the word

• tag ….. 32 − (n + m + 2) bits

• Total size (bits) = 2^n × (2^m × 32 + (32 − n − m − 2) + 1)

16 KB / 4 B = 4K words

4K / 4 = 1K blocks in cache

tag  index  word  byte
18   10     2     2

Total size = V bits + tag bits + data bits
           = 1K + 1K(18) + 16K×8 = 147 Kbit
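The arithmetic above can be checked directly:

```python
# Total cache size in bits for the example: 16 KB of data, 4-word (16-byte)
# blocks, 32-bit addresses, direct mapped.

data_bits  = 16 * 1024 * 8            # 16 KB of data = 128 Kbit
words      = 16 * 1024 // 4           # 4K words
blocks     = words // 4               # 4 words/block -> 1K blocks
index_bits = 10                       # log2(1024 blocks)
tag_bits   = 32 - index_bits - 2 - 2  # minus index, word-in-block, byte-in-word
total_bits = blocks * (1 + tag_bits) + data_bits   # valid + tag + data
print(total_bits // 1024)             # 147 (Kbit)
```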


Address 1200 in M is in block 1200/16 = 75 in M

64 blocks in cache → 6 index bits

75 = 100 1011₂ → block 11 in cache (75 mod 64)
Block 75 in M includes addresses 1200 to 1215
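A quick Python check of this mapping:

```python
# Byte address 1200, 16-byte blocks, 64-block direct-mapped cache.

block_size, num_blocks = 16, 64
memory_block = 1200 // block_size          # block 75 in M
cache_block  = memory_block % num_blocks   # 75 mod 64 = 11
first = memory_block * block_size          # first byte address in the block
last  = first + block_size - 1             # last byte address in the block
print(memory_block, cache_block, first, last)   # 75 11 1200 1215
```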


Block Size?
Larger blocks exploit spatial locality to reduce miss rates
• eventually, the miss rate may go up if block size is a significant fraction of cache size
• the number of blocks that can be in the cache becomes small
• a block is replaced before many of its words are accessed – spatial locality among
words of a block decreases
• fewer blocks → blocks are replaced many times

Increasing the block size →

• the cost of a miss increases – miss penalty
• latency to the 1st word, plus transfer of the rest of the block
• the miss-rate improvement starts to decrease


Handling a Miss – stall the processor

• Instruction miss – stall until the instruction is fetched from M

• Data miss – the same; stall until memory responds with the data


Handling Writes
• Inconsistency
• Writing to a cache word → that word in M becomes different

• Write policy
• Write Through – write to both cache and M (simplest)

• Write Back (Copy Back) – write to the lower level only when the block is replaced
• More complex to implement
• A dirty bit is needed


• On a write miss – fetch from M, place into cache, overwrite the word, then write to M
using the full address (if write through)
• Example
• 10% of instructions – stores
• CPI without cache misses – 1
• 100 extra cycles – on every write
• CPI = 1 + 0.1×100 = 11 → reduces performance by more than a factor of 10
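The CPI impact of this naive write-through scheme in a few lines:

```python
# Naive write-through: every store also writes M and stalls the processor.
# 10% of instructions are stores, base CPI = 1, 100 extra cycles per write.

base_cpi, store_frac, write_cycles = 1.0, 0.10, 100
cpi = base_cpi + store_frac * write_cycles
print(cpi, cpi / base_cpi)   # 11.0 -> more than 10x slower
```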

• Write Buffer – write to cache and a buffer; the processor continues while the buffer writes to M
• When the write to M is complete, free the entry in the buffer
• If the buffer is full, stall
• If the rate at which M can finish writes < the rate at which the processor generates writes
• buffering cannot help
• If the rate at which M can finish writes > the rate at which the processor generates writes
• stalls can still happen – when writes come in bursts
• to reduce this, make the depth of the write buffer > one entry


Design Memory System to Support Caches


• It is difficult to reduce the latency to fetch the first word from M
• But the miss penalty can be decreased by increasing the bandwidth from M to cache
• Larger blocks then have a miss penalty close to that of a smaller block

• Processor connected to M over a bus


• Clock rate of the bus usually much slower than the processor
• Speed of bus affects the miss penalty


• One-word-wide, sequential access


• Example – impact of memory organization
• 1 memory bus clock cycle – to send address
• 15 memory bus clock cycles – for each DRAM access initiated
• 1 memory bus clock cycle – to send a data word

• A cache block – 4 words


• 1-word-wide bank of DRAMs

• Miss penalty = 1 + 4x15 + 4x1 = 65 memory bus clock cycles


• Number of bytes transferred per bus clock cycle for a miss
= 4x4/65 = 0.25
• 0.25 bytes per bus clock cycle – bandwidth for a single miss


• Increase bandwidth to M by widening M and the buses between processor and M
• Parallel access to multiple words of the block
• decreases the access-time and transfer-time parts of the miss penalty

• an M that is 2 words wide
• Miss penalty = 1 + 2×15 + 2×1 = 33 memory bus clock cycles
• Bandwidth of a single miss = 0.48 bytes per bus clock cycle

• Cost
• Wider bus, wider memory
• Possible increase in cache access time
• Multiplexor & control logic between processor and cache


• Increase bandwidth by interleaving M banks instead of widening M and the bus
• still pay the cost to transmit each word one at a time
• but incur the access latency only once

• Each bank can be one word wide

• No change to the width of cache and bus
• Send an address to all banks
• All banks read simultaneously

• Interleaved memory – 4 banks, 1 word wide each

• Miss penalty = 1 + 1×15 + 4×1 = 20 memory bus clock cycles
• Bandwidth for a miss = 0.8 bytes per clock

• On writes, each bank can write independently,


• quadrupling the write bandwidth, leading to fewer stalls in a write-through cache.
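The three memory organizations compared in these slides can be summarized in one short computation:

```python
# Miss penalty (in bus cycles) for a 4-word block under the three memory
# organizations: 1 cycle to send the address, 15 cycles per DRAM access,
# 1 cycle to send each data word.

block_words = 4

one_word_wide = 1 + block_words * 15 + block_words * 1   # sequential accesses
two_word_wide = 1 + 2 * 15 + 2 * 1                       # 2 words at a time
interleaved   = 1 + 1 * 15 + block_words * 1             # 4 banks, latency paid once

print(one_word_wide, two_word_wide, interleaved)         # 65 33 20
for penalty in (one_word_wide, two_word_wide, interleaved):
    # bytes transferred per bus cycle for one miss (4 words x 4 bytes)
    print(round(block_words * 4 / penalty, 2))
```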



Measuring and Improving Cache Performance


• CPI with stalls = 2 + (2%)(100) + (36%)(4%)(100) = 2 + 3.44 = 5.44
(base CPI 2, 100-cycle miss penalty, 2% instruction miss rate, 4% data miss rate,
36% of instructions are loads/stores; the instruction count IC cancels out)

• 5.44/2 = 2.72 → 2.72× slower than with a perfect cache

• Fraction of execution time spent on M stalls = 3.44/5.44 = 63%
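The same stall-cycle computation, spelled out:

```python
# Memory-stall CPI example: base CPI 2, 100-cycle miss penalty,
# 2% instruction miss rate, 4% data miss rate, 36% loads/stores.

base_cpi, penalty = 2.0, 100
inst_miss, data_miss, mem_frac = 0.02, 0.04, 0.36

stalls = inst_miss * penalty + mem_frac * data_miss * penalty   # per instruction
cpi = base_cpi + stalls
print(round(cpi, 2))              # 5.44
print(round(cpi / base_cpi, 2))   # 2.72x slower than a perfect cache
print(round(stalls / cpi, 2))     # 0.63 of execution time spent on stalls
```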



Average memory access time (AMAT) = hit time + miss rate × miss penalty
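A minimal sketch of the AMAT formula; the numbers are illustrative, not from the slides:

```python
# AMAT = hit time + miss rate x miss penalty, per memory access.

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Illustrative: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(round(amat(1, 0.05, 100), 2))   # 6.0 cycles per access
```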


Fully Associative Cache

• A block can be placed in any location in the cache

• To find a block in the cache → all entries must be searched

• Practical search → search all entries in parallel

• a comparator per entry
• Expensive
• Practical only for a small number of blocks


Set-Associative Cache
• A block can be placed in one of a fixed number of locations in the cache
• n locations → an n-way set-associative cache

• Each block in M → a unique set in the cache
- the block can be placed in any element of the set

• A block directly maps to a set
• Then all blocks in the set are searched


• 8 blocks total
• Block 12 = 1100₂
• Direct mapped → cache block 4 (12 mod 8 = 100₂)
• 2-way set-associative → 4 sets → set 0 (12 mod 4) – any block inside this set
• Fully associative – any block
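The three placements of block 12 can be computed directly:

```python
# Where can memory block 12 go in an 8-block cache?

block, num_blocks = 12, 8

direct_index = block % num_blocks    # direct mapped: exactly one entry
sets_2way    = num_blocks // 2       # 2-way: 4 sets of 2 blocks each
set_2way     = block % sets_2way     # either of the 2 entries in this set

print(direct_index)   # 4  (12 mod 8)
print(set_2way)       # 0  (12 mod 4); fully associative: any of the 8 entries
```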

All block placement strategies can be seen as a variation on set
associativity… (cache of 8 blocks)

Increasing the degree of associativity →
• decreases the miss rate
• increases the hit time


• Use the LRU replacement rule,….


• Replace the least recently used block within a set; that is, the block
that was used furthest in the past.

• What happens if cache size is 8 one-word blocks?
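A 2-way set-associative cache with LRU replacement can be sketched as follows; the block-address sequence 0, 8, 0, 6, 8 is the textbook's reference example (an assumption here, since the slides' figure is not shown):

```python
# 2-way set-associative cache with LRU replacement: within each set, the
# block unused for the longest time is the one replaced on a miss.

def simulate_2way_lru(block_addresses, num_blocks=4):
    num_sets = num_blocks // 2
    sets = [[] for _ in range(num_sets)]   # each set holds tags, LRU first
    results = []
    for b in block_addresses:
        s, tag = b % num_sets, b // num_sets
        if tag in sets[s]:
            sets[s].remove(tag)            # hit: move tag to MRU position
            sets[s].append(tag)
            results.append("Hit")
        else:
            if len(sets[s]) == 2:          # set full: evict the LRU block
                sets[s].pop(0)
            sets[s].append(tag)
            results.append("Miss")
    return results

print(simulate_2way_lru([0, 8, 0, 6, 8]))
# ['Miss', 'Miss', 'Hit', 'Miss', 'Miss']
```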


Direct,….

• Find miss rate


2-way set-associative,… LRU


Fully associative,…

For each case, what would be number of misses if cache has:


• 8 blocks
• 16 blocks
What do results show?


a 4-way set-associative cache


64K bits, 68K bits, 112K bits


Reducing miss penalty using multilevel caches


• Most microprocessors support an additional level of caching
• Usually on the same chip
• Accessed when a miss occurs in the primary cache
• If 2nd level cache contains the data 
miss penalty of 1st level cache = the access time of the 2nd level cache
• Design considerations of both caches are significantly different
• Two-level structure
• Primary – focus on minimizing hit time to have shorter clock cycle or fewer
pipeline stages
• Secondary – focus on miss rate to reduce penalty of long M access times
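The benefit of a second-level cache can be sketched with AMAT; all the numbers below are illustrative assumptions, not from the slides:

```python
# Two-level AMAT sketch: an L2 hit turns a long main-memory penalty
# into the (much shorter) L2 access time.

l1_hit, l1_miss_rate = 1, 0.05     # cycles; fraction of accesses missing L1
l2_hit, l2_local_miss = 10, 0.20   # L2 access time; fraction of L1 misses that also miss L2
mem_penalty = 100                  # cycles to go to main memory

amat_l1_only = l1_hit + l1_miss_rate * mem_penalty
amat_two_level = l1_hit + l1_miss_rate * (l2_hit + l2_local_miss * mem_penalty)
print(round(amat_l1_only, 2), round(amat_two_level, 2))   # 6.0 2.5
```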


• Compared to a single-level cache


• Primary
• often smaller
• may use smaller block size
• go with smaller size
• reduce miss penalty
• Secondary
• much larger
• access time is less critical
• may use a larger block size
• often a higher associativity than the primary cache; focus on reducing miss rates


Global miss rate – the fraction of all references that miss in every level of the multilevel cache


Virtual Memory


• Main Memory (MM)… acts as a cache for secondary storage, usually magnetic disks
• Share memory among programs – safely/efficiently
• A program can read/write only its own part of MM
• Removes the burden of limited memory

• MM needs to contain only the active portion of a program


• Principle of locality

• Which programs share memory?

• Dynamic at run time
• Compile each program into a separate address space [range of M locations accessible
only by that program]

• VM – maps a program's address space to physical addresses (addresses in MM)

• enforces protection of address spaces


• Allow a program to exceed the size of MM (physical M)

• Divide it into pieces – load/unload as necessary
• Overlays – modules of code and data
• The programmer makes sure
• the program doesn't access an overlay that was not loaded
• loaded overlays do not exceed the total size of M

• VM automates this management
• Page = a VM block
• Virtual Address (VA)
• corresponds to a location in virtual space
• produced by the processor
• mapped to a Physical Address (using a combination of hardware and software)
• Page Fault = a VM miss; an accessed page is not in MM
• Address Translation = address mapping; VA → an address used to access M


2 virtual addresses  same physical address:


2 programs share code or data

• Paging – use of fixed-size blocks


• Segmentation – use of variable-size blocks

Relocation – map VAs in a program to different Physical addresses


• allows loading the program anywhere in MM
• not necessarily adjacent M block
• Just find a sufficient # of pages for a program


Mapping from a virtual to a physical address

• Page offset = 12 bits → page size = 2^12 = 4 KB

• Max # of physical pages = 2^18
• max 1 GB of MM
• Virtual space – max 2^32 = 4 GB
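The address split can be verified in a few lines (the sample address is arbitrary):

```python
# Splitting a 32-bit virtual address with 4 KB pages (12-bit page offset).

offset_bits = 12
page_size = 2 ** offset_bits                    # 4 KB
virtual_pages = 2 ** (32 - offset_bits)         # 2^20 virtual pages

physical_page_bits = 18
physical_mem = 2 ** physical_page_bits * page_size   # 2^30 bytes = 1 GB

va = 0x12345678                                 # arbitrary sample address
vpn, offset = va >> offset_bits, va & (page_size - 1)
print(hex(vpn), hex(offset))     # 0x12345 0x678
print(physical_mem // 2 ** 30)   # 1 (GB)
```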


Page Table

• A table in memory – indexed by virtual page number to find pages
• A table for each program
• Each entry can have extra information
• protection, …

Example:
32-bit virtual address, 4 KB pages
Assuming 4 bytes per page table entry

# of page table entries = 2^32/2^12 = 2^20

Size of page table = 2^20 × 2^2 bytes = 4 MB
→ we use 4 MB for each program in
execution at any time
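The page-table size computation, checked:

```python
# Page table size: 32-bit virtual addresses, 4 KB pages, 4 bytes per entry.

entries = 2 ** 32 // 2 ** 12      # 2^20 virtual pages, one entry each
table_bytes = entries * 4         # 4 bytes per entry
print(entries, table_bytes // 2 ** 20)   # 1048576 4  (i.e. 4 MB per program)
```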


Page Fault
• Operating System
• Finds the page in the next level (usually magnetic disk)
• Decides where to place it in MM

• Swap Space – space on disk for the full VM space of a process

• A data structure – records where each virtual page is stored on disk
• can be part of the Page Table, or separate
• A data structure – records which processes and virtual addresses use each physical
page

• Replacement policy – e.g., Least Recently Used (LRU)

• Write the replaced page to swap space


• One page table


Writes
• Write-back – write back a page to disk when replaced in memory
• Dirty pages


TLB – Translation Lookaside Buffer

• Page tables are stored in MM
• Every access takes 2 times as long:
• one M access for the physical address, one for the data

• A special cache to do the translation

• the TLB


Integrating VM, TLBs, and Caches

• VM and cache work together as a hierarchy
• Data cannot be in the cache unless it is in MM

• The OS maintains this hierarchy by flushing the contents of a page from the cache
when it decides to migrate that page to disk
• The OS modifies the page tables and TLB so that an access to data in a migrated page
results in a page fault

• Best case
• A virtual address is translated by the TLB, sent to the cache, data found in the cache
• Worst case
• A reference misses in all components (TLB, page table, and cache)


Pitfalls

32-byte direct-mapped cache, 4 bytes per block

Which block in the cache maps to byte address 36?
The cache has 32/4 = 8 blocks → 3 index bits
In M, byte 36 is in block 9 = 1001₂ → cache block 1 (9 mod 8)

Which block in the cache maps to word address 36?

Block 36 = 100100₂ → block 4 in cache


Any questions/comments?

Thank you
