Csapp Lab3
- [Problem Description](#problem-description)
- [Part A. Cache Simulator](#part-a-cache-simulator)
  - [Step 1. Parse the trace created by valgrind](#step-1-parse-the-trace-created-by-valgrind)
  - [Step 2. Simulate the behavior of cache](#step-2-simulate-the-behavior-of-cache)
- [Part B. Matrix Transpose](#part-b-matrix-transpose)
  - [Step 1. Analyze the capability of cache](#step-1-analyze-the-capability-of-cache)
  - [Step 2. Blocking of 32 x 32 matrix](#step-2-blocking-of-32-x-32-matrix)
  - [Step 3. Blocking of 64 x 64 matrix](#step-3-blocking-of-64-x-64-matrix)
  - [Step 4. Blocking of 61 x 67 matrix](#step-4-blocking-of-61-x-67-matrix)
- [Test Result](#test-result)
# Lab3: Cache Lab
---
> *Covered Concepts: Cache mechanism, Cache optimization*
# Problem Description
---
The lab has two parts. In Part A you will implement a cache simulator. In Part B
you will write a matrix transpose function that is optimized for cache performance.
# Part A. Cache Simulator
---
## Step 1. Parse the trace created by valgrind
For simplicity, let's take a look at the data structures defined in `csim.h`. The
details of argument and string processing are not described here; the purpose of
parsing the trace is to fill in the structure `memOpInfo_t` below.
```clike=
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    INSTRUCTION = 'I',
    DATALOAD    = 'L',
    DATASTORE   = 'S',
    DATAMODIFY  = 'M'  /* Load and then Store */
} memOpType_t;

typedef struct {
    memOpType_t memOpType; /* type of operation */
    uint64_t    addr;      /* which memory address is operated on */
    uint8_t     size;      /* number of bytes accessed by the operation */
} memOpInfo_t;
```
After parsing a single line of the trace file, we have `memOpType`, `addr`, and
`size`, which are then used in the cache simulation.
## Step 2. Simulate the behavior of cache
![](https://i.imgur.com/oy5odFo.png)
A cache consists of sets, a set consists of lines, and a line consists of a valid
bit and a tag. The number of sets is determined by the input argument `-s`, the
number of lines per set by `-E`, and the size (in bytes) of a line by `-b`.
The lines in a set are kept in a linked list so that we can easily insert a new
line and move a referenced line to the head of the list on a miss or an eviction.
Note that the eviction strategy used here is Least Recently Used (LRU).
Refer to `cache.c` for more details of the implementation.
# Part B. Matrix Transpose
---
***Performance***
For each matrix size, the performance of your `transpose_submit` function is
evaluated by using valgrind to extract the address trace for your function, and
then using the reference simulator to replay this trace on a cache with parameters
`(s = 5, E = 1, b = 5)`.
Your performance score for each matrix size scales linearly with the number of
misses, m, up to some threshold:
- 32 × 32: 8 points if m < 300, 0 points if m > 600
- 64 × 64: 8 points if m < 1,300, 0 points if m > 2,000
- 61 × 67: 10 points if m < 2,000, 0 points if m > 3,000
Your code must be correct to receive any performance points for a particular size.
It only needs to be correct for these three cases, and you can optimize it
specifically for them. In particular, it is perfectly OK for your function to
explicitly check for the input sizes and implement separate code optimized for
each case.
## Step 1. Analyze the capability of cache
It's assumed that data is properly aligned, so there is no need to worry about an
operation crossing a line boundary. The given cache looks like:
![](https://i.imgur.com/FSgBRMI.png)
## Step 2. Blocking of 32 x 32 matrix
Since the matrix is composed of `int`s, we measure sizes in `int`s (4 bytes each).
There are 32 integers in a row, which corresponds to 4 sets in the cache (a line
holds 8 integers). Since the cache has 32 sets, elements **8** rows apart in the
matrix compete for the same sets. The layout is shown below.
![](https://i.imgur.com/GuBN9RZ.png)
The sub-rows in green compete with each other for the same set; similarly, the
sub-rows in yellow compete with each other for the same set. Hence we can choose a
block size of 8 x 8 for blocking.
## Step 3. Blocking of 64 x 64 matrix
The transpose of a 64x64 matrix seems similar to the 32x32 one. However, the
method above ends up with a much higher miss count, caused by the bigger size of
a row.
![](https://i.imgur.com/1xrO0dG.jpg)
The memory layout of the 64x64 matrix is shown above. Remember that the given
cache can accommodate 256 integers. Since a row now spans 8 sets, every row
competes for the same sets with the row 4 rows below it. This phenomenon leads to
more misses whenever moving a transposed row of **A** to **B**.
- **Method 1.**
Let's see what happens when we use the method mentioned above. **A<sub>ur</sub>**
is transposed and put in **B<sub>dl</sub>**. Say we are processing the 1st row of
**A<sub>ur</sub>** and storing it to the 1st column of **B<sub>dl</sub>**. The
extra misses occur when storing the 1st column of **B<sub>dl</sub>**. Although the
first 4 elements in that column are fine, further misses occur at the last 4
elements, because the two halves compete for the same set.
`Estimation of misses: (8 + 64) * 64 = 4608`
- **Method 2.**
This phenomenon can be eliminated by shrinking the block size to 4x4, but that is
not good enough for the requirement `m < 1,300`, since 4 of the 8 integers in
each cache line go unused.
`Estimation of misses: (4 + 4 + 4) * 2 * 64 = 1536`
- **Method 3.**
To solve this problem, I searched for [others'
solutions](https://hackmd.io/@Chang-Chia-Chi/CSAPP/https%3A%2F%2Fhackmd.io%2F%40Chang-Chia-Chi%2FrkRCq_vbY).
Most of them use a tricky method that divides an 8x8 block into four 4x4
sub-blocks.
Let's dig into the method in detail. As the figure above shows, we divide
**A<sub>ur</sub>** into four sub-blocks (A<sub>11</sub>, A<sub>12</sub>,
A<sub>21</sub>, A<sub>22</sub>) and **B<sub>dl</sub>** into four sub-blocks
(B<sub>11</sub>, B<sub>12</sub>, B<sub>21</sub>, B<sub>22</sub>).
- **step 1**. A<sub>11</sub> to B<sub>11</sub>
As in the 32x32 method, we load the 1st row of A<sub>11</sub>, store it in
tmpDiagData[8] (only 4 elements are used), and then put it into the 1st column of
B<sub>11</sub>. The following rows of A<sub>11</sub> are handled the same way.
`Estimation of misses: 4 + 4 = 8`
- **step 2**. A<sub>12</sub> to B<sub>12</sub> (Temporarily)
Do the same as in step 1, except that the targets become A<sub>12</sub> and
B<sub>12</sub>.
`Estimation of misses: 0`
- **step 3**. A<sub>21</sub> to B<sub>12</sub>, B<sub>12</sub> to B<sub>21</sub>
**3.1.** Store the 1st row of B<sub>12</sub> in tmpDiagData[8] (only 4
elements are used).
**3.2.** Read the 1st **column** of A<sub>21</sub>, transpose it, and put it in
the 1st **row** of B<sub>12</sub>.
**3.3.** Move the saved 1st row of B<sub>12</sub> from tmpDiagData[8] to the
1st row of B<sub>21</sub>.
**3.4.** Do the same for the other rows.
`Estimation of misses: 4 + 4 = 8`
- **step 4**. A<sub>22</sub> to B<sub>22</sub>
Move the transposed rows of A<sub>22</sub> to B<sub>22</sub>. This step costs no
misses because both A<sub>22</sub> and B<sub>22</sub> are already stored in the
cache.
`Estimation of misses: 0`
## Step 4. Blocking of 61 x 67 matrix
Since the threshold here is looser (`m < 2,000`), we can simply pass this test
with the blocking method used in the 32x32 case.
# Test Result
Test result yield from `./driver.py`
```
Part A: Testing cache simulator
Running ./test-csim
                          Your simulator      Reference simulator
Points (s,E,b)     Hits  Misses  Evicts     Hits  Misses  Evicts
     3 (1,1,1)        9       8       6        9       8       6  traces/yi2.trace
     3 (4,2,4)        4       5       2        4       5       2  traces/yi.trace
     3 (2,1,4)        2       3       1        2       3       1  traces/dave.trace
     3 (2,1,3)      167      71      67      167      71      67  traces/trans.trace
     3 (2,2,3)      201      37      29      201      37      29  traces/trans.trace
     3 (2,4,3)      212      26      10      212      26      10  traces/trans.trace
     3 (5,1,5)      231       7       0      231       7       0  traces/trans.trace
     6 (5,1,5)   265189   21775   21743   265189   21775   21743  traces/long.trace
    27
```