
- [Lab3: Cache Lab](#lab3-cache-lab)

- [Problem Description](#problem-description)
- [Part A. Cache Simulator](#part-a-cache-simulator)
- [Step 1. Parse the trace created by valgrind](#step-1-parse-the-trace-created-by-valgrind)
- [Step 2. Simulate the behavior of cache](#step-2-simulate-the-behavior-of-cache)
- [Part B. Matrix Transpose](#part-b-matrix-transpose)
- [Step 1. Analyze the capability of cache](#step-1-analyze-the-capability-of-cache)
- [Step 2. Blocking of 32 x 32 matrix](#step-2-blocking-of-32-x-32-matrix)
- [Step 3. Blocking of 64 x 64 matrix](#step-3-blocking-of-64-x-64-matrix)
- [Step 4. Blocking of 61 x 67 matrix](#step-4-blocking-of-61-x-67-matrix)
- [Test Result](#test-result)
# Lab3: Cache Lab
---
> *Covered Concept: Cache mechanism, Cache optimization*

# Problem Description
---
The lab has two parts. In Part A you will implement a cache simulator. In Part B
you will write a matrix transpose function that is optimized for cache performance.

# Part A. Cache Simulator


---
In Part A you will write a cache simulator in csim.c that takes a valgrind memory
trace as input,
simulates the hit/miss behavior of a cache memory on this trace, and outputs the
total number of hits, misses, and evictions.
We have provided you with the binary executable of a reference cache simulator,
called csim-ref, that simulates the behavior of a cache with arbitrary size and
associativity on a valgrind trace file. It uses the LRU (least-recently used)
replacement policy when choosing which cache line to evict.
The reference simulator takes the following command-line arguments:
```
Usage: ./csim-ref [-hv] -s <s> -E <E> -b <b> -t <tracefile>
• -h: Optional help flag that prints usage info
• -v: Optional verbose flag that displays trace info
• -s <s>: Number of set index bits (S = 2^s is the number of sets)
• -E <E>: Associativity (number of lines per set)
• -b <b>: Number of block bits (B = 2^b is the block size)
• -t <tracefile>: Name of the valgrind trace to replay

The command-line arguments are based on the notation (s, E, and b) from
page 597 of the CS:APP2e textbook. For example:

linux> ./csim-ref -s 4 -E 1 -b 4 -t traces/yi.trace
hits:4 misses:5 evictions:3

The same example in verbose mode:

linux> ./csim-ref -v -s 4 -E 1 -b 4 -t traces/yi.trace
L 10,1 miss
M 20,1 miss hit
L 22,1 hit
S 18,1 hit
L 110,1 miss eviction
L 210,1 miss eviction
M 12,1 miss eviction hit
hits:4 misses:5 evictions:3
```
Your job for Part A is to fill in the `csim.c` file so that it takes the same
command line arguments and produces the identical output as the reference
simulator. Notice that this file is almost completely empty. You’ll need to write
it from scratch.

### Step 1. Parse the trace created by valgrind


---
The provided trace files, generated by `Valgrind`, are used to verify the cache
simulator. A trace file records memory operations in order.
Valgrind memory traces have the following form:
> I 0400d7d4,8
> &nbsp; M 0421c7f0,4
> &nbsp; L 04f6b868,8
> &nbsp; S 7ff0005c8,8
>
> Each line denotes one or two memory accesses. The format of each line is
>
> ***[space] operation address,size***
>
> The operation field denotes the type of memory access: “I” denotes an instruction
load, “L” a data load, “S” a data store, and “M” a data modify (==i.e., a data load
followed by a data store==). There is never a space before each “I”. There is
always a space before each “M”, “L”, and “S”. The address field specifies a 64-bit
hexadecimal memory address. The size field specifies the number of bytes accessed
by the operation.

For simplicity, let's take a look at the data structures defined in `csim.h`.
The details of processing the input arguments and strings are not described here.
The purpose of parsing the trace is to fill in the structure `memOpInfo_t` below.

```clike=
#include "stdbool.h"

typedef unsigned int uint;

typedef unsigned char uint8_t;
typedef char int8_t;
typedef unsigned long uint64_t;

typedef enum {
    INSTRUCTION = 'I',
    DATALOAD    = 'L',
    DATASTORE   = 'S',
    DATAMODIFY  = 'M'  /* Load and then Store */
} memOpType_t;

typedef struct {
    memOpType_t memOpType; /* Type of operation */
    uint64_t addr;         /* which memory address is operated on */
    uint8_t size;          /* size accessed by the operation */
} memOpInfo_t;
```

After parsing a single line in trace file, we have `memOpType`, `addr` and `size`
which will be used in cache simulation.
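A minimal sketch of parsing one trace line into this structure (the helper name `parseTraceLine` is mine, and I use standard `<stdint.h>` types instead of the hand-rolled typedefs above):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum {
    INSTRUCTION = 'I',
    DATALOAD    = 'L',
    DATASTORE   = 'S',
    DATAMODIFY  = 'M'  /* Load and then Store */
} memOpType_t;

typedef struct {
    memOpType_t memOpType;
    uint64_t addr;
    uint8_t size;
} memOpInfo_t;

/* Parse one valgrind trace line, e.g. " M 0421c7f0,4".
 * Returns true on success. Instruction loads ("I") are parsed
 * too, so the caller can recognize and skip them. */
bool parseTraceLine(const char *line, memOpInfo_t *op)
{
    char type;
    unsigned long addr;
    unsigned size;

    /* The leading " %c" skips the optional space before M/L/S. */
    if (sscanf(line, " %c %lx,%u", &type, &addr, &size) != 3)
        return false;

    op->memOpType = (memOpType_t)type;
    op->addr = addr;
    op->size = (uint8_t)size;
    return true;
}
```

Feeding each line of the trace file through a parser like this yields exactly the `memOpType`, `addr`, and `size` fields used in the next step.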

### Step 2. Simulate the behavior of cache


---
We've got the memory-operation information from the trace file. In this step, we
use this information to simulate the cache.
According to the properties of a cache, the simulator follows the process flow below.

![](https://i.imgur.com/oy5odFo.png)

A cache consists of sets. A set consists of lines. A line consists of a valid bit
and a tag. The number of sets is determined by the input argument `-s`. The number
of lines in a set is determined by the input argument `-E`. The size (in bytes) of
a line is determined by the input argument `-b`.
The lines in a set are represented as a linked list so that we can easily insert a
new line and move a specified line to the head of the list on a miss or an
eviction. Note that the eviction strategy used here is Least Recently Used (LRU).
Refer to `cache.c` for more details of the implementation.
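To make the LRU bookkeeping concrete, here is a minimal sketch of one set's access logic. It keeps the tags in most-recently-used-first order in a small array rather than the linked list the actual `cache.c` uses; the names (`cacheSet_t`, `accessSet`, `MAX_E`) are mine:

```c
#include <stdint.h>

#define MAX_E 16  /* assumed upper bound on associativity for this sketch */

typedef struct {
    int E;               /* lines per set (from -E) */
    int used;            /* lines currently valid */
    uint64_t tag[MAX_E]; /* tag[0] is most recently used */
} cacheSet_t;

typedef enum { HIT, MISS, MISS_EVICTION } accessResult_t;

/* Look up `tag` in the set, maintaining LRU order:
 * every touched tag is promoted to index 0 (MRU),
 * so the LRU victim is always the last valid slot. */
accessResult_t accessSet(cacheSet_t *set, uint64_t tag)
{
    int i;

    for (i = 0; i < set->used; i++) {
        if (set->tag[i] == tag) {       /* hit: promote to MRU */
            for (; i > 0; i--)
                set->tag[i] = set->tag[i - 1];
            set->tag[0] = tag;
            return HIT;
        }
    }
    if (set->used < set->E) {           /* cold miss: insert at MRU */
        for (i = set->used; i > 0; i--)
            set->tag[i] = set->tag[i - 1];
        set->tag[0] = tag;
        set->used++;
        return MISS;
    }
    for (i = set->E - 1; i > 0; i--)    /* full: evict LRU (last slot) */
        set->tag[i] = set->tag[i - 1];
    set->tag[0] = tag;
    return MISS_EVICTION;
}
```

An `M` (modify) operation is simply this access performed twice: the load may miss or evict, while the following store always hits, matching the `miss eviction hit` lines in the verbose output above.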

# Part B. Matrix Transpose


---
In Part B you will write a transpose function in `trans.c` that causes as few cache
misses as possible.
Let **A** denote a matrix, and **A**<sub>ij</sub> denote the element on the ith row
and jth column. The transpose of **A**, denoted **A**^T^, is the matrix such that
**A**<sub>ij</sub> = **A**^T^<sub>ji</sub>. To help you get started, we have given
you an example transpose function in `trans.c` that computes the transpose of an
N × M matrix **A** and stores the result in the M × N matrix **B**:

```clike=
char trans_desc[] = "Simple row-wise scan transpose";
void trans(int M, int N, int A[N][M], int B[M][N]);
```

The example transpose function is correct, but it is inefficient because its access
pattern results in relatively many cache misses. Your job in Part B is to write a
similar function, called `transpose_submit`, that minimizes the number of cache
misses across different sized matrices:

```clike=
char transpose_submit_desc[] = "Transpose submission";
void transpose_submit(int M, int N, int A[N][M], int B[M][N]);
```

Do not change the description string (“Transpose submission”) for your
`transpose_submit` function. The autograder searches for this string to determine
which transpose function to evaluate for credit.

***Programming Rules for Part B***

• Include your name and loginID in the header comment for `trans.c`.
• Your code in `trans.c` must compile without warnings to receive credit.
• You are allowed to define at most 12 local variables of type `int` per transpose
function.
• You are not allowed to side-step the previous rule by using any variables of type
`long` or by using any bit tricks to store more than one value in a single variable.
• Your transpose function may not use recursion.
• If you choose to use helper functions, you may not have more than 12 local
variables on the stack at a time between your helper functions and your top-level
transpose function. For example, if your transpose declares 8 variables, and then
calls a function which uses 4 variables, which calls another function which uses 2,
you will have 14 variables on the stack, and you will be in violation of the rule.
• Your transpose function may not modify array A. You may, however, do whatever you
want with the contents of array B.
• You are NOT allowed to define any arrays in your code or to use any variant of
`malloc`.

***Performance***

For each matrix size, the performance of your `transpose_submit` function is
evaluated by using valgrind to extract the address trace for your function, and
then using the reference simulator to replay this trace on a cache with parameters
`(s = 5, E = 1, b = 5)`.
Your performance score for each matrix size scales linearly with the number of
misses, m, up to some threshold:

• 32 × 32: 8 points if m < 300, 0 points if m > 600
• 64 × 64: 8 points if m < 1,300, 0 points if m > 2,000
• 61 × 67: 10 points if m < 2,000, 0 points if m > 3,000

Your code must be correct to receive any performance points for a particular size.
Your code only needs to be correct for these three cases, and you can optimize it
specifically for them. In particular, it is perfectly OK for your function to
explicitly check for the input sizes and implement separate code optimized for
each case.

### Step 1. Analyze the capability of cache


---
The given cache is `(s = 5, E = 1, b = 5)`, which is a direct-mapped cache.

`b = 5` : Each line (block) contains 32 bytes (8 integers).
`E = 1` : Each set contains 1 line.
`s = 5` : There are 32 sets in the cache.

It's assumed that data is properly aligned, so there is no need to worry about
operations crossing the boundary of a line. The given cache looks like:

![](https://i.imgur.com/FSgBRMI.png)

Since the matrix is composed of `int`s, we discuss sizes in units of `int` (4 bytes).

- Each block accommodates 8 integers.
- The whole cache accommodates 8 * 32 = 256 integers.
- An integer and the integer 256 positions after it compete for the same set.
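The 256-integer conflict period can be checked directly from the address bits. A small sketch (the function name `setIndex` and the aligned base address are mine):

```c
#include <stdint.h>

/* For the grading cache (s = 5, b = 5), the set index of an address
 * is bits [9:5]: drop the 5 block-offset bits, then keep the next
 * 5 set-index bits. */
unsigned setIndex(uint64_t addr)
{
    return (unsigned)((addr >> 5) & 0x1f);
}
```

With a 32-byte-aligned base address, `setIndex(base + 256 * sizeof(int))` equals `setIndex(base)` — two `int`s 256 positions apart collide in the same set — while `setIndex(base + 8 * sizeof(int))` lands in the next set, since 8 `int`s exactly fill one block.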

### Step 2. Blocking of 32 x 32 matrix

There are 32 integers in a row, which equals 4 sets in the cache. Elements
conflict every **8** rows of the matrix. The layout is shown below.

![](https://i.imgur.com/GuBN9RZ.png)

The sub-rows in green compete with each other for the same set. Similarly, the
sub-rows in yellow compete with each other for the same set. Apparently, we can
set the block size to 8 x 8 for blocking.

- **Non-diagonal blocking (blue)**

The purpose of blocking is to increase the utilization of the cache and reduce
misses and evictions. Since the cache loads a whole block (8 integers in a row)
each time, we should make full use of those 8 integers.
Consider the blue blocks of A and B: A<sub>ur</sub> is transposed and stored in
B<sub>dl</sub>. Loading the 1st row of A<sub>ur</sub> takes 1 miss. Storing the
transposed row takes 8 misses because it goes into the 1st column of
B<sub>dl</sub>.
Moving on to the next row of A<sub>ur</sub>, loading it also takes 1 miss.
However, the 2nd column of B<sub>dl</sub> is already in the cache, so storing it
takes no misses.

- **Diagonal blocking (white)**

On the other hand, a diagonal block doesn't behave like the case above, since
A<sub>ul</sub> competes with B<sub>ul</sub> for the same cache sets.
Putting the 1st element of the 1st row of A<sub>ul</sub> into the 1st element of
the 1st column of B<sub>ul</sub> causes the 1st row of A<sub>ul</sub> to be
evicted from the cache. Diagonal blocking would therefore cost one extra miss and
eviction for each row loaded.
To eliminate this cost, we create a `tmpDiagData[8]` so that a row of
A<sub>ul</sub> can be temporarily stored in it. After that, we move the contents
of `tmpDiagData[8]` into B<sub>ul</sub>. In that case, the additional miss and
eviction are eliminated. See `trans.c` for more details.

*Note: `tmpDiagData[8]` can also be used in non-diagonal blocking without losing
generality.*
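The whole 32x32 strategy can be sketched as follows. The function name is mine and the sizes are hardcoded for readability; note also that this sketch uses a temporary array for clarity, whereas the lab's no-array rule would require spelling it out as 8 separate `int` locals:

```c
#define DIM 32  /* assumed square matrix size for this sketch */

/* 8x8 blocking with a temporary row buffer: each row of the source
 * block is read once into tmp (one cache line), then scattered into
 * the destination column, so a diagonal block never evicts the A row
 * it is still reading. */
void transpose32(int A[DIM][DIM], int B[DIM][DIM])
{
    int bi, bj, r, k;
    int tmp[8];  /* stands in for tmpDiagData[8] */

    for (bi = 0; bi < DIM; bi += 8)
        for (bj = 0; bj < DIM; bj += 8)
            for (r = bi; r < bi + 8; r++) {
                for (k = 0; k < 8; k++)   /* load the whole A row once */
                    tmp[k] = A[r][bj + k];
                for (k = 0; k < 8; k++)   /* then fill the B column */
                    B[bj + k][r] = tmp[k];
            }
}
```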

### Step 3. Blocking of 64 x 64 matrix

The transpose of a 64x64 matrix seems similar to the 32x32 one. However, the
method above ends up yielding a large number of misses, caused by the bigger size
of a row.

![](https://i.imgur.com/1xrO0dG.jpg)

The memory layout of a 64x64 matrix is shown above. Remember that the given cache
can accommodate 256 integers. That is, a row competes with the 4th row after it
for the same sets in the cache. This phenomenon leads to more misses whenever a
transposed row of **A** is moved into **B**.

- **Method 1.**
Let's see what happens when we use the method from the 32x32 case.
**A<sub>ur</sub>** is transposed and put in **B<sub>dl</sub>**. Say we are
processing the 1st row of **A<sub>ur</sub>** and storing it to the 1st column of
**B<sub>dl</sub>**. Extra misses occur when storing that column: although the
first 4 elements of the column are fine, more misses occur for the last 4
elements because the two halves of the column compete for the same set.
`Estimation of misses: (8 + 64) * 64 = 4608`

- **Method 2.**
The conflicts can be avoided by reducing the blocking size to 4x4, but that is
not good enough for the requirement `m < 1,300`, since 4 of the 8 integers in
each cache line go unused.
`Estimation of misses: (4 + 4 + 4) * 2 * 64 = 1536`

- **Method 3.**
To solve this problem, I searched for [others'
solutions](https://hackmd.io/@Chang-Chia-Chi/CSAPP/https%3A%2F%2Fhackmd.io%2F%40Chang-Chia-Chi%2FrkRCq_vbY).
Most of them use a tricky method which divides an 8x8 block into four 4x4
sub-blocks.
Let's dig into the method in detail. As the figure above shows, we divide
**A<sub>ur</sub>** into four sub-blocks (A<sub>11</sub>, A<sub>12</sub>,
A<sub>21</sub>, A<sub>22</sub>) and **B<sub>dl</sub>** into four sub-blocks
(B<sub>11</sub>, B<sub>12</sub>, B<sub>21</sub>, B<sub>22</sub>).
- **step 1**. A<sub>11</sub> to B<sub>11</sub>
As in the 32x32 method, we load the 1st row of A<sub>11</sub> and store it in
tmpDiagData[8] (only 4 elements are used), then put it into the 1st column of
B<sub>11</sub>. The following rows of A<sub>11</sub> do the same, and so on.
`Estimation of misses: 4 + 4 = 8`
- **step 2**. A<sub>12</sub> to B<sub>12</sub> (temporarily)
Do the same as in step 1, except that the targets become A<sub>12</sub> and
B<sub>12</sub>.
`Estimation of misses: 0`
- **step 3**. A<sub>21</sub> to B<sub>12</sub>, B<sub>12</sub> to B<sub>21</sub>
**3.1.** Store the 1st row of B<sub>12</sub> in tmpDiagData[8] (only 4 elements
are used).
**3.2.** Read the 1st **column** of A<sub>21</sub>, transpose it, and put it into
the 1st **row** of B<sub>12</sub>.
**3.3.** Move the saved 1st row of B<sub>12</sub> from tmpDiagData[8] to the 1st
row of B<sub>21</sub>.
**3.4.** Do the same for the other rows.
`Estimation of misses: 4 + 4 = 8`
- **step 4**. A<sub>22</sub> to B<sub>22</sub>
Move the transposed rows of A<sub>22</sub> to B<sub>22</sub>. This step costs no
misses because both A<sub>22</sub> and B<sub>22</sub> are already stored in the
cache.
`Estimation of misses: 0`

In total, method 3 costs 16 misses for each 8x8 block.

`Estimation of misses: 16 * 64 = 1024`

Consequently, method 3 is chosen as the optimized method for the 64x64 matrix.
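The four steps above can be sketched for one 8x8 block as follows. The function name and index convention are mine, and as in the 32x32 sketch a temporary array stands in for the individual `int` locals the lab's no-array rule would require:

```c
/* Transpose the 8x8 block of A at (bi, bj) into the block of B at
 * (bj, bi), split into four 4x4 sub-blocks (method 3). */
void transposeBlock64(int A[64][64], int B[64][64], int bi, int bj)
{
    int r, c, tmp[8];

    /* Steps 1-2: top half of A. Each A row read is one cache line;
     * A11^T goes to B11 (final), A12^T is parked in B12 for now. */
    for (r = 0; r < 4; r++) {
        for (c = 0; c < 8; c++)
            tmp[c] = A[bi + r][bj + c];
        for (c = 0; c < 4; c++) {
            B[bj + c][bi + r] = tmp[c];          /* A11^T -> B11 */
            B[bj + c][bi + 4 + r] = tmp[4 + c];  /* A12^T -> B12 (temp) */
        }
    }

    /* Step 3: row by row, swap the parked A12^T out of B12 into B21
     * while pulling the matching A21 column into B12. */
    for (c = 0; c < 4; c++) {
        for (r = 0; r < 4; r++)
            tmp[r] = B[bj + c][bi + 4 + r];              /* save parked row */
        for (r = 0; r < 4; r++)
            B[bj + c][bi + 4 + r] = A[bi + 4 + r][bj + c]; /* A21^T -> B12 */
        for (r = 0; r < 4; r++)
            B[bj + 4 + c][bi + r] = tmp[r];              /* parked -> B21 */
    }

    /* Step 4: A22^T -> B22; both sub-blocks are already cached. */
    for (r = 0; r < 4; r++)
        for (c = 0; c < 4; c++)
            B[bj + 4 + c][bi + 4 + r] = A[bi + 4 + r][bj + 4 + c];
}
```

Calling this for every 8x8 block of a 64x64 pair (A, B) produces the full transpose.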

### Step 4. Blocking of 61 x 67 matrix

We can simply pass this test with the method used in the 32x32 case.

# Test Result
Test result yielded by `./driver.py`:

```
Part A: Testing cache simulator
Running ./test-csim
                       Your simulator      Reference simulator
Points (s,E,b)    Hits Misses Evicts   Hits Misses Evicts
     3 (1,1,1)       9      8      6      9      8      6  traces/yi2.trace
     3 (4,2,4)       4      5      2      4      5      2  traces/yi.trace
     3 (2,1,4)       2      3      1      2      3      1  traces/dave.trace
     3 (2,1,3)     167     71     67    167     71     67  traces/trans.trace
     3 (2,2,3)     201     37     29    201     37     29  traces/trans.trace
     3 (2,4,3)     212     26     10    212     26     10  traces/trans.trace
     3 (5,1,5)     231      7      0    231      7      0  traces/trans.trace
     6 (5,1,5)  265189  21775  21743 265189  21775  21743  traces/long.trace
    27

Part B: Testing transpose function
Running ./test-trans -M 32 -N 32
Running ./test-trans -M 64 -N 64
Running ./test-trans -M 61 -N 67

Cache Lab summary:
                     Points  Max pts  Misses
Csim correctness       27.0       27
Trans perf 32x32        8.0        8     290
Trans perf 64x64        8.0        8    1286
Trans perf 61x67       10.0       10    2000
Total points           53.0       53
```
