Chapter 5


CpE 440, Second 2020-2021, Yarmouk University, 2/26/2021

CpE 440
Computer Architecture
Dr. Haithem Al-Mefleh
Computer Engineering Department
Yarmouk University, Second 2020-2021

Large and Fast: Exploiting


Memory Hierarchy


• We want unlimited amounts of fast memory (M)


• We create that illusion

• A program does not access all Code/Data at once with equal probability

• Principle of Locality
• Programs access a small portion of their address space at any given
time

Two types of locality
• Temporal (locality in time)
an item is referenced → it tends to be referenced again soon

• Spatial (locality in space)
an item is referenced → items with close addresses tend to be
referenced soon

• Loops, arrays, ..
• Sequential access to code, data


Memory Hierarchy
• Take advantage of Principle of Locality by implementing M hierarchy

• Multiple levels of M of different speeds and sizes

• Faster M – more expensive per bit → smaller & closer to processor

• Main M – DRAM
• Caches – SRAM
• Magnetic Disks, or Flashes in Embedded Systems

Goal: provide the user with
as much memory as is available in the cheapest technology,
with access at the speed of the fastest M

• Access time ≈ that of Level 1
• Size ≈ that of Level n


Data is also hierarchical
• All data – at the lowest level
• A subset of that data at the next level up
• and so on

• Data is copied only between 2 adjacent levels at a time

• Block (or line): the minimum unit of information that can be
either present or not present in a cache

• Requested data is in a block in the upper level → a hit

• Requested data is not in the upper level → a miss

• Hit Rate = 1 – Miss Rate (also stated as a rate or ratio)

• Hit Time
• Time to access upper level; includes determining a miss or a hit

• Miss Penalty
• Time to replace a block in the upper level with corresponding block from lower
level + time to deliver to processor

• Hit time << time to access the next level


• Concepts used to build the M hierarchy affect other aspects of a computer


• OS M and I/O management
• Compilers generating code
• Applications using the computer

• Programs spend much of their time accessing M


• A major factor of performance

• Programmers must now understand M hierarchy

• Do not forget to try “Check Yourself” sections


• Answers given at end of chapter


The Basics of Caches


• A storage that takes advantage of locality

• Levels of M between main M and the processor

• M’s in the datapath (Chapter 4) – caches


A very simple example,….

• Processor – requests 1 word
• Block – 1 word

• Xn initially not in cache →
a miss →
fetch Xn from M to cache

How to know if data is in the cache?
How to find it?


Direct Mapped
• Simplest
• Each M location is directly mapped to exactly 1 location in the cache

• In almost all direct-mapped caches:
• Number of entries in cache = power of 2 →
cache location = (block address) modulo (number of blocks in cache)
• Modulo computed by using the low-order log2(cache size in blocks)
bits of the block address
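The modulo mapping can be checked with a short Python sketch (the function name is just for illustration):

```python
# Direct-mapped cache index: (block address) modulo (number of cache blocks).
# When the number of blocks is a power of 2, the modulo is simply the
# low-order log2(number of blocks) bits of the block address.

def cache_index(block_address, num_blocks):
    """Index of the cache entry a memory block maps to (direct mapped)."""
    assert num_blocks & (num_blocks - 1) == 0, "num_blocks must be a power of 2"
    return block_address % num_blocks      # == block_address & (num_blocks - 1)

# Example: an 8-block cache uses the 3 lowest bits of the block address.
print(cache_index(22, 8))   # 22 = 10110b -> index 110b = 6
```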


An 8-block cache:
the 3 lowest bits of the block address select the entry

More than one M location maps to each cache entry,…

• How to know if the data requested is in the cache?
• Is it valid?


• Add tags to the cache …. the upper bits of the address

• Specify whether the word in cache corresponds to the requested word
• An 8-block cache:
• address – 5 bits
• low 3 bits – index
• high 2 bits – tag

• Add a valid bit

• 1 – valid
• 0 – not valid → no match for this block
• Initially,…. 0

• How to handle: reads, misses, writes,…?


Example: 8-block cache, 1 word/block, direct mapped
Initial state – all entries invalid:

Index  V  Tag  Data
000    0
001    0
010    0
011    0
100    0
101    0
110    0
111    0

Word address 22 = 10 110₂ → Miss → cache block 110

Index  V  Tag  Data
000    0
001    0
010    0
011    0
100    0
101    0
110    1  10   M[10110]
111    0

Word address 26 = 11 010₂ → Miss → cache block 010

Index  V  Tag  Data
000    0
001    0
010    1  11   M[11010]
011    0
100    0
101    0
110    1  10   M[10110]
111    0

Word address 22 = 10 110₂ → Hit → cache block 110

Index  V  Tag  Data
000    0
001    0
010    1  11   M[11010]
011    0
100    0
101    0
110    1  10   M[10110]
111    0

Word address 26 = 11 010₂ → Hit → cache block 010

Index  V  Tag  Data
000    0
001    0
010    1  11   M[11010]
011    0
100    0
101    0
110    1  10   M[10110]
111    0

Word address 16 = 10 000₂ → Miss → cache block 000

Index  V  Tag  Data
000    1  10   M[10000]
001    0
010    1  11   M[11010]
011    0
100    0
101    0
110    1  10   M[10110]
111    0

Word address 3 = 00 011₂ → Miss → cache block 011

Index  V  Tag  Data
000    1  10   M[10000]
001    0
010    1  11   M[11010]
011    1  00   M[00011]
100    0
101    0
110    1  10   M[10110]
111    0

Word address 16 = 10 000₂ → Hit → cache block 000

Index  V  Tag  Data
000    1  10   M[10000]
001    0
010    1  11   M[11010]
011    1  00   M[00011]
100    0
101    0
110    1  10   M[10110]
111    0

Word address 18 = 10 010₂ → Miss → cache block 010 (replaces word 26)

Index  V  Tag  Data
000    1  10   M[10000]
001    0
010    1  10   M[10010]
011    1  00   M[00011]
100    0
101    0
110    1  10   M[10110]
111    0
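The whole access sequence from the slides can be replayed with a minimal direct-mapped cache simulator (a sketch, not production code):

```python
# Direct-mapped cache simulation of the slides' example:
# 8 blocks, 1 word/block, 5-bit word addresses (index = low 3 bits, tag = high 2).

def simulate(addresses, num_blocks=8):
    valid = [False] * num_blocks
    tag   = [0] * num_blocks
    results = []
    for a in addresses:
        idx, t = a % num_blocks, a // num_blocks
        if valid[idx] and tag[idx] == t:
            results.append("Hit")
        else:                       # miss: fetch the block, overwrite the entry
            valid[idx], tag[idx] = True, t
            results.append("Miss")
    return results

seq = [22, 26, 22, 26, 16, 3, 16, 18]
print(simulate(seq))
# ['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Miss', 'Hit', 'Miss']
```

The final miss on address 18 replaces word 26, since both map to block 010.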

Address Subdivision

• 32-bit address
• One-word blocks

Naming convention:
a "4 KB cache" → the size refers to the data only


2^n-block Cache
• Each block – 2^m words (2^(m+2) bytes)
• m bits – a word within the block
• 2 bits – a byte within the word

• tag ….. 32 − (n + m + 2) bits

• Total size (bits) = 2^n × (2^m × 32 + (32 − n − m − 2) + 1)

16 KB / 4 B = 4K words

4K / 4 = 1K blocks in cache

tag  index  word  byte
18   10     2     2

Total size = V bits + tag bits + data bits
           = 1K + 1K(18) + 16K×8 = 147 Kbit
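The arithmetic above can be checked directly:

```python
# Total cache size in bits for the example: 16 KB of data, 4-word (16-byte)
# blocks, 32-bit addresses, direct mapped.

data_bits  = 16 * 1024 * 8            # 16 KB of data = 128 Kbit
words      = 16 * 1024 // 4           # 4K words
blocks     = words // 4               # 4 words/block -> 1K blocks
index_bits = 10                       # log2(1024 blocks)
tag_bits   = 32 - index_bits - 2 - 2  # minus index, word-in-block, byte-in-word
total_bits = blocks * (1 + tag_bits) + data_bits   # valid + tag + data
print(total_bits // 1024)             # 147 (Kbit)
```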


Address 1200 in M is in block 1200/16 = 75 in M

64 blocks in cache → 6 index bits

75 = 100 1011₂ → block 11 in cache (75 mod 64)
Block 75 in M includes addresses 1200 to 1215
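A quick Python check of this mapping:

```python
# Byte address 1200, 16-byte blocks, 64-block direct-mapped cache.

block_size, num_blocks = 16, 64
memory_block = 1200 // block_size          # block 75 in M
cache_block  = memory_block % num_blocks   # 75 mod 64 = 11
first = memory_block * block_size          # first byte address in the block
last  = first + block_size - 1             # last byte address in the block
print(memory_block, cache_block, first, last)   # 75 11 1200 1215
```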


Block Size?
Larger blocks exploit spatial locality to reduce miss rates
• eventually, the miss rate may go up if block size is a significant fraction of cache size
• the number of blocks that can be in the cache becomes small
• a block is replaced before many of its words are accessed – spatial locality among
words of a block decreases
• fewer blocks → blocks are replaced many times

Increasing the block size →

• the cost of a miss increases – miss penalty
• latency to the 1st word, plus transfer of the rest of the block
• the miss-rate improvement starts to decrease


Handling a Miss – stall the processor

• Instruction miss – stall until the instruction is fetched from M

• Data miss – the same; stall until memory responds with the data


Handling Writes
• Inconsistency
• Writing to a cache word → that word in M becomes different

• Write policy
• Write Through – write to both cache and M (simplest)

• Write Back (Copy Back) – write to the lower level only when the block is replaced
• More complex to implement
• A dirty bit is needed


• On a write miss – fetch from M, place into cache, overwrite the word, then write to M
using the full address (if write through)
• Example
• 10% of instructions – stores
• CPI without cache misses – 1
• 100 extra cycles – on every write
• CPI = 1 + 0.1×100 = 11 → reduces performance by more than a factor of 10
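The CPI impact of this naive write-through scheme in a few lines:

```python
# Naive write-through: every store also writes M and stalls the processor.
# 10% of instructions are stores, base CPI = 1, 100 extra cycles per write.

base_cpi, store_frac, write_cycles = 1.0, 0.10, 100
cpi = base_cpi + store_frac * write_cycles
print(cpi, cpi / base_cpi)   # 11.0 -> more than 10x slower
```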

• Write Buffer – write to cache and a buffer; the processor continues while the buffer writes to M
• When the write to M is complete, free the entry in the buffer
• If the buffer is full, stall
• If the rate at which M can finish writes < the rate at which the processor generates writes
• buffering cannot help
• If the rate at which M can finish writes > the rate at which the processor generates writes
• stalls can still happen – when writes come in bursts
• to reduce this, make the depth of the write buffer > one entry


Design Memory System to Support Caches


• It is difficult to reduce the latency to fetch the first word from M
• But the miss penalty can be decreased by increasing the bandwidth from M to cache
• Larger blocks then have a miss penalty close to that of a smaller block

• Processor connected to M over a bus


• Clock rate of the bus usually much slower than the processor
• Speed of bus affects the miss penalty


• One-word-wide, sequential access


• Example – impact of memory organization
• 1 memory bus clock cycle – to send address
• 15 memory bus clock cycles – for each DRAM access initiated
• 1 memory bus clock cycle – to send a data word

• A cache block – 4 words


• 1-word-wide bank of DRAMs

• Miss penalty = 1 + 4x15 + 4x1 = 65 memory bus clock cycles


• Number of bytes transferred per bus clock cycle for a miss
= 4x4/65 = 0.25
• 0.25 bytes per bus clock cycle – bandwidth for a single miss


• Increase bandwidth to M by widening M and the buses between processor and M
• Parallel access to multiple words of the block
• decreases the access-time and transfer-time parts of the miss penalty

• an M that is 2 words wide
• Miss penalty = 1 + 2×15 + 2×1 = 33 memory bus clock cycles
• Bandwidth of a single miss = 0.48 bytes per bus clock cycle

• Cost
• Wider bus, wider memory
• Possible increase in cache access time
• Multiplexor & control logic between processor and cache


• Increase bandwidth by interleaving M banks instead of widening M and the bus
• still pay the cost to transmit each word one at a time
• but incur the access latency only once

• Each bank can be one word wide

• No change to the width of cache and bus
• Send an address to all banks
• All banks read simultaneously

• Interleaved memory – 4 banks, 1 word wide each

• Miss penalty = 1 + 1×15 + 4×1 = 20 memory bus clock cycles
• Bandwidth for a miss = 0.8 bytes per clock

• On writes, each bank can write independently,


• quadrupling the write bandwidth, leading to fewer stalls in a write-through cache.
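The three memory organizations compared in these slides can be summarized in one short computation:

```python
# Miss penalty (in bus cycles) for a 4-word block under the three memory
# organizations: 1 cycle to send the address, 15 cycles per DRAM access,
# 1 cycle to send each data word.

block_words = 4

one_word_wide = 1 + block_words * 15 + block_words * 1   # sequential accesses
two_word_wide = 1 + 2 * 15 + 2 * 1                       # 2 words at a time
interleaved   = 1 + 1 * 15 + block_words * 1             # 4 banks, latency paid once

print(one_word_wide, two_word_wide, interleaved)         # 65 33 20
for penalty in (one_word_wide, two_word_wide, interleaved):
    # bytes transferred per bus cycle for one miss (4 words x 4 bytes)
    print(round(block_words * 4 / penalty, 2))
```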



Measuring and Improving Cache Performance


• CPI with stalls = 2 + (2%)(100) + (36%)(4%)(100) = 2 + 3.44 = 5.44
(base CPI 2, 100-cycle miss penalty, 2% instruction miss rate, 4% data miss rate,
36% of instructions are loads/stores; the instruction count IC cancels out)

• 5.44/2 = 2.72 → 2.72× slower than with a perfect cache

• Fraction of execution time spent on M stalls = 3.44/5.44 = 63%
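The same stall-cycle computation, spelled out:

```python
# Memory-stall CPI example: base CPI 2, 100-cycle miss penalty,
# 2% instruction miss rate, 4% data miss rate, 36% loads/stores.

base_cpi, penalty = 2.0, 100
inst_miss, data_miss, mem_frac = 0.02, 0.04, 0.36

stalls = inst_miss * penalty + mem_frac * data_miss * penalty   # per instruction
cpi = base_cpi + stalls
print(round(cpi, 2))              # 5.44
print(round(cpi / base_cpi, 2))   # 2.72x slower than a perfect cache
print(round(stalls / cpi, 2))     # 0.63 of execution time spent on stalls
```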



Average memory access time (AMAT) = hit time + miss rate × miss penalty
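A minimal sketch of the AMAT formula; the numbers are illustrative, not from the slides:

```python
# AMAT = hit time + miss rate x miss penalty, per memory access.

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Illustrative: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(round(amat(1, 0.05, 100), 2))   # 6.0 cycles per access
```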


Fully Associative Cache

• A block can be placed in any location in the cache

• To find a block in the cache → all entries must be searched

• Practical search → search all entries in parallel

• a comparator per entry
• Expensive
• Practical only for a small number of blocks


Set-Associative Cache
• A block can be placed in one of a fixed number of locations in the cache
• n locations → an n-way set-associative cache

• Each block in M → a unique set in the cache
- the block can be placed in any element of the set

• A block directly maps to a set
• Then all blocks in the set are searched


• 8 blocks total
• Block 12 = 1100₂
• Direct mapped → cache block 4 (12 mod 8 = 100₂)
• 2-way set-associative → 4 sets → set 0 (12 mod 4) – any block inside this set
• Fully associative – any block
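The three placements of block 12 can be computed directly:

```python
# Where can memory block 12 go in an 8-block cache?

block, num_blocks = 12, 8

direct_index = block % num_blocks    # direct mapped: exactly one entry
sets_2way    = num_blocks // 2       # 2-way: 4 sets of 2 blocks each
set_2way     = block % sets_2way     # either of the 2 entries in this set

print(direct_index)   # 4  (12 mod 8)
print(set_2way)       # 0  (12 mod 4); fully associative: any of the 8 entries
```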

All block placement strategies can be seen as a variation on set
associativity… (cache of 8 blocks)

Increasing the degree of associativity →
• decreases the miss rate
• increases the hit time


• Use the LRU replacement rule,….


• Replace the least recently used block within a set; that is, the block
that was used furthest in the past.

• What happens if cache size is 8 one-word blocks?
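A 2-way set-associative cache with LRU replacement can be sketched as follows; the block-address sequence 0, 8, 0, 6, 8 is the textbook's reference example (an assumption here, since the slides' figure is not shown):

```python
# 2-way set-associative cache with LRU replacement: within each set, the
# block unused for the longest time is the one replaced on a miss.

def simulate_2way_lru(block_addresses, num_blocks=4):
    num_sets = num_blocks // 2
    sets = [[] for _ in range(num_sets)]   # each set holds tags, LRU first
    results = []
    for b in block_addresses:
        s, tag = b % num_sets, b // num_sets
        if tag in sets[s]:
            sets[s].remove(tag)            # hit: move tag to MRU position
            sets[s].append(tag)
            results.append("Hit")
        else:
            if len(sets[s]) == 2:          # set full: evict the LRU block
                sets[s].pop(0)
            sets[s].append(tag)
            results.append("Miss")
    return results

print(simulate_2way_lru([0, 8, 0, 6, 8]))
# ['Miss', 'Miss', 'Hit', 'Miss', 'Miss']
```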


Direct,….

• Find miss rate


2-way set-associative,… LRU


Fully associative,…

For each case, what would be number of misses if cache has:


• 8 blocks
• 16 blocks
What do results show?


a 4-way set-associative cache


64K bits, 68K bits, 112K bits


Reducing miss penalty using multilevel caches


• Most microprocessors support an additional level of caching
• Usually on the same chip
• Accessed when a miss occurs in the primary cache
• If 2nd level cache contains the data 
miss penalty of 1st level cache = the access time of the 2nd level cache
• Design considerations of both caches are significantly different
• Two-level structure
• Primary – focus on minimizing hit time to have shorter clock cycle or fewer
pipeline stages
• Secondary – focus on miss rate to reduce penalty of long M access times
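The benefit of a second-level cache can be sketched with AMAT; all the numbers below are illustrative assumptions, not from the slides:

```python
# Two-level AMAT sketch: an L2 hit turns a long main-memory penalty
# into the (much shorter) L2 access time.

l1_hit, l1_miss_rate = 1, 0.05     # cycles; fraction of accesses missing L1
l2_hit, l2_local_miss = 10, 0.20   # L2 access time; fraction of L1 misses that also miss L2
mem_penalty = 100                  # cycles to go to main memory

amat_l1_only = l1_hit + l1_miss_rate * mem_penalty
amat_two_level = l1_hit + l1_miss_rate * (l2_hit + l2_local_miss * mem_penalty)
print(round(amat_l1_only, 2), round(amat_two_level, 2))   # 6.0 2.5
```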


• Compared to a single-level cache


• Primary
• often smaller
• may use smaller block size
• go with smaller size
• reduce miss penalty
• Secondary
• much larger
• access time is less critical
• may use a larger block size
• often a higher associativity than the primary cache; focus on reducing miss rates


Global miss rate – the fraction of all references that miss in every level of the multilevel cache


Virtual Memory


• Main Memory (MM)… acts as a cache for secondary storage, usually magnetic disks
• Share memory among programs – safely/efficiently
• A program can read/write only its own part of MM
• Removes the burden of limited memory

• MM needs to contain only the active portion of a program


• Principle of locality

• Which programs share memory?

• Dynamic at run time
• Compile each program into a separate address space [range of M locations accessible
only by that program]

• VM – maps a program's address space to physical addresses (addresses in MM)

• enforces protection of address spaces


• Allow a program to exceed the size of MM (physical M)

• Divide it into pieces – load/unload as necessary
• Overlays – modules of code and data
• The programmer makes sure
• the program doesn't access an overlay that was not loaded
• loaded overlays do not exceed the total size of M

• VM automates this management
• Page = a VM block
• Virtual Address (VA)
• corresponds to a location in virtual space
• produced by the processor
• mapped to a Physical Address (using a combination of hardware and software)
• Page Fault = a VM miss; an accessed page is not in MM
• Address Translation = address mapping; VA → an address used to access M


2 virtual addresses  same physical address:


2 programs share code or data

• Paging – use of fixed-size blocks


• Segmentation – use of variable-size blocks

Relocation – map VAs in a program to different Physical addresses


• allows loading the program anywhere in MM
• not necessarily adjacent M block
• Just find a sufficient # of pages for a program


Mapping from a virtual to a physical address

• Page offset = 12 bits → page size = 2^12 = 4 KB

• Max # of physical pages = 2^18
• max 1 GB of MM
• Virtual space – max 2^32 = 4 GB
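The address split can be verified in a few lines (the sample address is arbitrary):

```python
# Splitting a 32-bit virtual address with 4 KB pages (12-bit page offset).

offset_bits = 12
page_size = 2 ** offset_bits                    # 4 KB
virtual_pages = 2 ** (32 - offset_bits)         # 2^20 virtual pages

physical_page_bits = 18
physical_mem = 2 ** physical_page_bits * page_size   # 2^30 bytes = 1 GB

va = 0x12345678                                 # arbitrary sample address
vpn, offset = va >> offset_bits, va & (page_size - 1)
print(hex(vpn), hex(offset))     # 0x12345 0x678
print(physical_mem // 2 ** 30)   # 1 (GB)
```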


Page Table

• A table in memory – indexed by virtual page number to find pages
• A table for each program
• Each entry can have extra information
• protection, …

Example:
32-bit virtual address, 4 KB pages
Assuming 4 bytes per page table entry

# of page table entries = 2^32/2^12 = 2^20

Size of page table = 2^20 × 2^2 bytes = 4 MB
→ we use 4 MB for each program in
execution at any time
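The page-table size computation, checked:

```python
# Page table size: 32-bit virtual addresses, 4 KB pages, 4 bytes per entry.

entries = 2 ** 32 // 2 ** 12      # 2^20 virtual pages, one entry each
table_bytes = entries * 4         # 4 bytes per entry
print(entries, table_bytes // 2 ** 20)   # 1048576 4  (i.e. 4 MB per program)
```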


Page Fault
• Operating System
• Finds the page in the next level (usually magnetic disk)
• Decides where to place it in MM

• Swap Space – space on disk for the full VM space of a process

• A data structure – records where each virtual page is stored on disk
• can be part of the Page Table, or separate
• A data structure – records which processes and virtual addresses use each physical
page

• Replacement policy – e.g., Least Recently Used (LRU)

• Write the replaced page to swap space


• One page table


Writes
• Write-back – write back a page to disk when replaced in memory
• Dirty pages


TLB – Translation Lookaside Buffer

• Page tables are stored in MM
• Every access takes 2 times as long:
• one M access for the physical address, one for the data

• A special cache to do the translation

• the TLB


Integrating VM, TLBs, and Caches

• VM and cache work together as a hierarchy
• Data cannot be in the cache unless it is in MM

• The OS maintains this hierarchy by flushing the contents of a page from the cache
when it decides to migrate that page to disk
• The OS modifies the page tables and TLB so that an access to data in a migrated page
results in a page fault

• Best case
• A virtual address is translated by the TLB, sent to the cache, data found in the cache
• Worst case
• A reference misses in all components (TLB, page table, and cache)


Pitfalls

32-byte direct-mapped cache, 4 bytes per block

Which block in the cache maps to byte address 36?
The cache has 32/4 = 8 blocks → 3 index bits
In M, byte 36 is in block 9 = 1001₂ → cache block 1 (9 mod 8)

Which block in the cache maps to word address 36?

Block 36 = 100100₂ → block 4 in cache


Any questions/comments?

Thank you
