Assignment #1
Group Id: 20
Akash Maji ([email protected]) 24212
Nehal Shah ([email protected]) 24623
Question 1:
● Here we use perf as our performance measurement tool. It is a widely used Linux utility.
● We measure various PMU events (hardware-based and software-based).
● We use a standard O(n^3)-time matrix multiplication program written in C/C++ to measure PMU
events such as L1/L2/L3 cache misses, TLB misses, page faults, etc., in three loop orders: <i,j,k>,
<j,i,k> and <k,i,j>.
● Here, part (a) uses the basic versions with regular 4KB pages, part (b) uses the basic versions with
2MB huge pages, and part (c) uses tiled versions of those loop orders with 4KB pages.
Question 2:
● Here we use `ChampSim` as our architecture simulation tool. It is a free open-source tool.
● We measure various performance metrics (MPKI, IPC, etc.).
● We run three CloudSuite benchmark traces to simulate four predictors: Bimodal, GShare,
Perceptron and TAGE, each given a maximum storage budget of 64KB.
● Here, part (a) uses at least 500M simulation instructions, part (b) uses TAGE variants with
different history lengths and table sizes, and part (c) explores a hybrid (GShare + TAGE) predictor
with varying storage distributions.
The following GitHub repo contains all code files for Question 1:
https://github.com/akashmaji946/HPCA-Assignment-01
The following GitHub repo contains all code files for Question 2:
https://github.com/akashmaji946/ChampSim-IISc
Question #1
In terms of time efficiency with huge 2MB pages, the loop orders rank (least to most efficient): [I, J, K] < [J, I, K] < [K, I, J]
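For part (b), the matrices are backed by 2MB huge pages. A minimal sketch of one way such backing can be requested on Linux, assuming transparent huge pages are enabled (the helper below is illustrative, not necessarily the code in the repository); the counters themselves can then be gathered with a command such as `perf stat -e LLC-load-misses,dTLB-load-misses,page-faults ./matmul`:

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)  /* 2 MB */

/* Hypothetical helper: allocate a buffer aligned to 2 MB and hint the
 * kernel to back it with transparent huge pages (the kernel may refuse). */
static void *alloc_huge(size_t bytes)
{
    void *p = NULL;
    if (posix_memalign(&p, HUGE_PAGE_SIZE, bytes) != 0)
        return NULL;
    madvise(p, bytes, MADV_HUGEPAGE);
    return p;
}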
Observations:
● For the 2048 variants, (k,i,j) is the fastest, which can be attributed to it incurring the fewest LLC loads and
stores, while (i,j,k) takes the most time as it incurs the most LLC load misses.
● For the 8192 variants, (k,i,j) is again the fastest, which can be attributed to it incurring the fewest LLC load
misses and store misses, while (j,i,k) takes the most time as it incurs the most LLC stores and store misses.
● All three loop orders incur almost the same number of page faults within their respective 2048 and 8192 groups.
Metric | (i,j,k) 2048 | (i,j,k) 8192 | (j,i,k) 2048 | (j,i,k) 8192 | (k,i,j) 2048 | (k,i,j) 8192
Duration time (ns) | 93,711,056,202 | 11,134,738,841,697 | 92,630,806,364 | 11,139,637,268,956 | 23,323,075,474 | 1,566,293,575,103
User time (ns) | 93,704,515,000 | 11,131,221,227,000 | 92,627,513,000 | 11,136,436,269,000 | 23,322,073,000 | 1,565,706,047,000
Page-faults | 76 | 436 | 76 | 438 | 77 | 437
L1-dcache-loads | 146,111,309,862 | 9,351,893,672,538 | 146,274,782,105 | 9,351,843,396,129 | 146,346,423,246 | 9,348,517,946,862
L1-dcache-load-misses | 8,628,169,783 | 585,449,138,806 | 9,166,601,291 | 554,643,660,292 | 552,740,466 | 66,942,935,635
L1-dcache-stores | 17,277,682,693 | 1,103,650,129,000 | 17,308,685,726 | 1,103,796,102,446 | 17,315,062,334 | 1,101,331,384,271
L1-icache-load-misses | 8,293,170 | 1,277,390,048 | 9,859,822 | 1,254,965,004 | 3,207,756 | 217,260,196
LLC-loads | 2,701,304,319 | 550,133,261,235 | 2,913,869,006 | 550,138,866,918 | 5,327,420 | 213,379,416
LLC-load-misses | 1,139,672,283 | 548,151,801,003 | 954,685,548 | 548,435,729,154 | 4,289,713 | 166,808,666
LLC-stores | 3,910,346 | 211,207,525 | 6,287,668 | 582,761,944 | 1,078,413 | 24,755,121
LLC-store-misses | 2,203,642 | 101,606,528 | 2,707,380 | 522,126,714 | 766,603 | 10,942,586
dTLB-loads | 146,428,902,934 | 9,353,470,761,736 | 146,317,418,345 | 9,353,519,539,316 | 146,124,811,633 | 9,346,937,597,114
dTLB-load-misses | 3,012 | 18,071,320 | 12,430 | 13,429,387 | 5,810 | 829,690
dTLB-stores | 17,325,900,270 | 1,103,699,301,055 | 17,301,784,980 | 1,103,765,705,725 | 17,284,801,579 | 1,101,350,107,039
dTLB-store-misses | 91,607 | 12,470,670 | 93,163 | 11,835,387 | 24,827 | 1,618,156
iTLB-load-misses | 7,877 | 2,558,554 | 15,404 | 1,740,466 | 6,010 | 367,316
Branch-instructions | 4,613,776 | 102,508,848 | 4,683,516 | 99,854,918 | 4,558,781 | 76,515,626
Branch-misses | 8,760,926,085 | 554,820,377,522 | 8,767,129,620 | 554,780,482,096 | 8,759,089,629 | 552,441,645,258
Bus-cycles | 2,234,646,940 | 265,344,609,240 | 2,211,446,148 | 265,443,790,303 | 545,669,830 | 36,795,383,547
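For reference, the kernels measured above differ only in the nesting order of the three loops. A minimal sketch of the non-tiled (k,i,j) variant, assuming square matrices of dimension SIZE as in the tiled code below:

// non-tiled matrix multiplication, loop order: (k, i, j)
// with j innermost, matrixB[k][j] and matrixC[i][j] are both traversed
// row-wise, which is consistent with the low LLC miss counts for this order
for(int k = 0; k < SIZE; k++){
    for(int i = 0; i < SIZE; i++){
        for(int j = 0; j < SIZE; j++){
            matrixC[i][j] += (matrixA[i][k] * matrixB[k][j]);
        }
    }
}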
Part C: We have to do matrix multiplication in a tiled fashion, as shown below.
(i, j, k) order
// do matrix multiplication in a tiled fashion with tile size = 64; loop order: (i, j, k)
for(int i = 0; i < ROWS; i += BLOCK_SIZE){
    for(int j = 0; j < COLS; j += BLOCK_SIZE){
        for(int k = 0; k < SIZE; k += BLOCK_SIZE){
            for(int ii = i; ii < i + BLOCK_SIZE; ii++){
                for(int jj = j; jj < j + BLOCK_SIZE; jj++){
                    for(int kk = k; kk < k + BLOCK_SIZE; kk++){
                        matrixC[ii][jj] += (matrixA[ii][kk] * matrixB[kk][jj]);
                    }
                }
            }
        }
    }
}
Time Taken => 27.941030455 seconds 2048x2048 matrix
Time Taken => 1756.026040607 seconds 8192x8192 matrix
(k, i, j) order
// do matrix multiplication in a tiled fashion with tile size = 64; loop order: (k, i, j)
for(int k = 0; k < SIZE; k += BLOCK_SIZE){
    for(int i = 0; i < ROWS; i += BLOCK_SIZE){
        for(int j = 0; j < COLS; j += BLOCK_SIZE){
            for(int kk = k; kk < k + BLOCK_SIZE; kk++){
                for(int ii = i; ii < i + BLOCK_SIZE; ii++){
                    for(int jj = j; jj < j + BLOCK_SIZE; jj++){
                        matrixC[ii][jj] += (matrixA[ii][kk] * matrixB[kk][jj]);
                    }
                }
            }
        }
    }
}
Time Taken => 24.635800144 seconds 2048x2048 matrix
Time Taken => 1579.586545771 seconds 8192x8192 matrix
(j, i, k) order
// do matrix multiplication in a tiled fashion with tile size = 64; loop order: (j, i, k)
for(int j = 0; j < COLS; j += BLOCK_SIZE){
    for(int i = 0; i < ROWS; i += BLOCK_SIZE){
        for(int k = 0; k < SIZE; k += BLOCK_SIZE){
            for(int jj = j; jj < j + BLOCK_SIZE; jj++){
                for(int ii = i; ii < i + BLOCK_SIZE; ii++){
                    for(int kk = k; kk < k + BLOCK_SIZE; kk++){
                        matrixC[ii][jj] += (matrixA[ii][kk] * matrixB[kk][jj]);
                    }
                }
            }
        }
    }
}
Time Taken => 25.647599198 seconds 2048x2048 matrix
Time Taken => 1692.886438544 seconds 8192x8192 matrix
Observations:
In this case, we do not see large time savings across the three loop orders; the differences in time taken
are only marginal.
● In the 2048 variant, (i,j,k) incurs the most LLC load misses and stores, while (k,i,j) incurs the fewest.
● In the 8192 variant, (i,j,k) incurs the most LLC stores, while (k,i,j) incurs the fewest.
● Here, (i,j,k) performs worst while (k,i,j) takes the least amount of time, consistent with (i,j,k) also being
the worst-performing order in the non-tiled versions.
Question #2
So, the TAGE predictor has the best outcomes and Bimodal is the least desirable.
Storage Justification (64 KB)
Bimodal Predictor
/*
-------------------
Storage Cost Estimation:
-------------------
We want to have at most 64KB for our bimodal predictor per CPU
We will use 2-bit saturating counter
So entry size = 2 bits
Number of entries = 64 KB / 2 bits = 64 * 8 Kbits / 2 bits = 256 K = 262144
Largest prime less than 262144 = 262139
*/
#define BIMODAL_TABLE_SIZE 262144
#define BIMODAL_PRIME 262139
#define MAX_COUNTER 3
int bimodal_table[NUM_CPUS][BIMODAL_TABLE_SIZE];
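A minimal sketch of how this table could drive prediction and update with 2-bit saturating counters; the helper names and the IP hash are illustrative assumptions rather than the exact ChampSim hooks used in the repository:

// hypothetical helpers built on bimodal_table defined above
uint8_t bimodal_predict(uint32_t cpu, uint64_t ip)
{
    uint64_t idx = ip % BIMODAL_PRIME;                        // hash the branch IP into the table
    return bimodal_table[cpu][idx] >= (MAX_COUNTER + 1) / 2;  // taken if counter is in the upper half
}

void bimodal_update(uint32_t cpu, uint64_t ip, uint8_t taken)
{
    uint64_t idx = ip % BIMODAL_PRIME;
    int &ctr = bimodal_table[cpu][idx];
    if (taken && ctr < MAX_COUNTER) ctr++;                    // saturate at MAX_COUNTER
    else if (!taken && ctr > 0)     ctr--;                    // saturate at 0
}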
GShare Predictor
/*
-------------------
Storage Cost Estimation:
-------------------
We want to have at most 64KB for our gshare predictor per CPU
We will use 2-bit saturating counter
So entry size = 2 bits
Number of entries = 64 * 8 K bits / 2 bits = 256 K = 262144
*/
#define GLOBAL_HISTORY_LENGTH 16
#define GLOBAL_HISTORY_MASK ((1 << GLOBAL_HISTORY_LENGTH) - 1)
int branch_history_vector[NUM_CPUS];
#define GS_HISTORY_TABLE_SIZE 262144
int gs_history_table[NUM_CPUS][GS_HISTORY_TABLE_SIZE];
int my_last_prediction[NUM_CPUS];
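A minimal sketch of the corresponding GShare lookup and update, assuming the global history is XORed with the branch IP to index the table (helper names are illustrative, not the repository's code):

// hypothetical helpers built on the GShare state defined above
uint64_t gs_index(uint32_t cpu, uint64_t ip)
{
    uint64_t hist = branch_history_vector[cpu] & GLOBAL_HISTORY_MASK;
    return (ip ^ hist) % GS_HISTORY_TABLE_SIZE;              // fold IP XOR history into the table
}

uint8_t gs_predict(uint32_t cpu, uint64_t ip)
{
    return gs_history_table[cpu][gs_index(cpu, ip)] >= 2;    // 2-bit counter: predict taken if >= 2
}

void gs_update(uint32_t cpu, uint64_t ip, uint8_t taken)
{
    int &ctr = gs_history_table[cpu][gs_index(cpu, ip)];
    if (taken && ctr < 3) ctr++;
    else if (!taken && ctr > 0) ctr--;
    // shift the outcome into the global history register
    branch_history_vector[cpu] = ((branch_history_vector[cpu] << 1) | taken) & GLOBAL_HISTORY_MASK;
}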
Perceptron Predictor
/*
-----------------------
Storage Cost Estimation:
-----------------------
Each perceptron has N=32 weights and each weight takes 8 bits
Each perceptron is thus 32 Bytes
We have a table of NUM_PERCEPTRONS=2048 perceptrons
Perceptron Table Size = 2048*32 Bytes = 64 KB
Update Table Entry Size = 6 B (at most) [i.e. 32 + 1 + 1 + 11 = 45 bits, rounded up to 6 B]
Update Table Size = 256 * 6 B < 2 KB
Total size taken is within 64 + 2 = 66 KB
*/
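A minimal sketch of a perceptron lookup and training rule matching the sizes above (NUM_PERCEPTRONS = 2048 perceptrons, N = 32 eight-bit weights each); the threshold and helper names follow the standard perceptron predictor and are illustrative, not the repository's code:

#include <cstdint>
#include <cstdlib>

constexpr int N_WEIGHTS = 32;                                        // one weight per history bit
constexpr int NUM_PERCEPTRONS = 2048;
constexpr int THRESHOLD = static_cast<int>(1.93 * N_WEIGHTS + 14);   // standard training threshold

int8_t weights[NUM_PERCEPTRONS][N_WEIGHTS];                          // 2048 * 32 B = 64 KB
int    history[N_WEIGHTS];                                           // entries are +1 (taken) or -1 (not taken)

// dot product of the selected perceptron with the global history; predict taken if y >= 0
int perceptron_output(uint64_t ip)
{
    const int8_t *w = weights[ip % NUM_PERCEPTRONS];
    int y = 0;
    for (int i = 0; i < N_WEIGHTS; i++)
        y += w[i] * history[i];
    return y;
}

// train when mispredicted or when |y| is below the threshold, then shift the history
void perceptron_train(uint64_t ip, int y, bool taken)
{
    int t = taken ? 1 : -1;
    int8_t *w = weights[ip % NUM_PERCEPTRONS];
    if ((y >= 0) != taken || std::abs(y) <= THRESHOLD) {
        for (int i = 0; i < N_WEIGHTS; i++) {
            int nw = w[i] + t * history[i];
            if (nw <= 127 && nw >= -128)                             // keep weights within 8 bits
                w[i] = static_cast<int8_t>(nw);
        }
    }
    for (int i = N_WEIGHTS - 1; i > 0; i--)
        history[i] = history[i - 1];
    history[0] = t;
}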
TAGE Predictor
/*
---------------------------
Storage Budget Justification:
---------------------------
We are using 12 TAGE tables with entry sizes as:
[24, 22, 20, 20, 18, 18, 16, 16, 14, 14, 12, 12]
as the tag_width are:
[19, 17, 15, 15, 13, 13, 11, 11, 9, 9, 7, 7]
and 5 bits for 'ctr' and 'u' fields
There are 2^11 entries in each TAGE table. TAGE_BITS=11
So TAGE Tables Size = (sum[24, 22, 20, 20, 18, 18, 16, 16, 14, 14, 12, 12] bits * 2^11) = 206 bits * 2048 ≈ 52 KB
Bimodal Table Size = 2^13 entries each of 2 bits = 2^14 bits = 2^11 B = 2 KB
Total Size = 52 + 2 KB = 54 KB, which is within the 64 KB budget
*/
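A minimal sketch of the per-table entry layout these numbers imply, with a compile-time check of the tagged-table budget; it assumes the 5 extra bits split into a 3-bit counter and a 2-bit usefulness field, and all names are illustrative:

#include <cstdint>

constexpr int TAGE_NUM_TABLES = 12;
constexpr int TAGE_BITS = 11;                               // 2^11 entries per tagged table
constexpr int TAG_WIDTH[TAGE_NUM_TABLES] = {19, 17, 15, 15, 13, 13, 11, 11, 9, 9, 7, 7};

// one tagged-table entry: tag + 3-bit prediction counter + 2-bit usefulness
struct TageEntry {
    uint32_t tag;   // only TAG_WIDTH[t] bits are counted against the budget
    int8_t   ctr;   // 3 bits in the storage budget
    uint8_t  u;     // 2 bits in the storage budget
};

// architectural storage of the tagged tables: sum over tables of (tag + 5) bits * 2^TAGE_BITS
constexpr long tagged_storage_bits()
{
    long bits = 0;
    for (int t = 0; t < TAGE_NUM_TABLES; t++)
        bits += (TAG_WIDTH[t] + 5) * (1L << TAGE_BITS);
    return bits;
}
static_assert(tagged_storage_bits() == 206 * 2048, "matches the ~52 KB estimate above");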
Part B: We have to run CloudSuite simulations using ChampSim with TAGE variants.
We are first using a fixed number of TAGE Tables (say 8).
Case-1
We use different TAGE table sizes (at most 2^13 entries per table) and vary the minimum history length across four
different cases, as specified below. We first check whether this setting yields any improvement.
// number of bits used to index into each table, indicating table size
const uint8_t TAGE_INDEX_BITS[TAGE_NUM_COMPONENTS] = {13, 13, 12, 12, 11, 11, 11, 11};
const uint8_t TAGE_TAG_BITS[TAGE_NUM_COMPONENTS] = {13, 13, 11, 11, 9, 9, 9, 9};
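The per-table history lengths can then be derived geometrically from the chosen minimum (and maximum) history length, as in standard TAGE. A minimal sketch of that derivation; the min/max values, the rounding, and the output array name are illustrative, not necessarily those used in the four cases:

#include <cmath>

// derive num_components history lengths as a geometric series between
// min_hist and max_hist (index 0 holds the longest history in this sketch)
void compute_history_lengths(int min_hist, int max_hist,
                             int hist_len[], int num_components)
{
    double ratio = std::pow((double)max_hist / min_hist, 1.0 / (num_components - 1));
    for (int i = 0; i < num_components; i++)
        hist_len[num_components - 1 - i] = (int)(min_hist * std::pow(ratio, i) + 0.5);
}

// example: compute_history_lengths(4, 640, tage_history_lengths, TAGE_NUM_COMPONENTS);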
From the two tables, we observe that even after changing the history lengths and re-running the traces, the
differences are only marginal; an increase in MPKI corresponds to a decrease in IPC, and vice versa.
Part C: We use a hybrid predictor (GShare + TAGE) with varying storage distributions, totalling under 64KB. We
observe the following MPKI, IPC, and prediction accuracy for each GShare:TAGE storage split:
Trace | 25:75 (MPKI / IPC / Accuracy) | 50:50 (MPKI / IPC / Accuracy) | 75:25 (MPKI / IPC / Accuracy)
trace1 | 0.00607091 / 2.05946 / 99.9976% | 0.00677091 / 2.05939 / 99.9973% | 0.00769273 / 2.05927 / 99.997%
trace2 | 1.50078 / 0.406576 / 99.3107% | 1.51612 / 0.406312 / 99.3036% | 1.82682 / 0.404121 / 99.1609%
trace3 | 2.75407 / 0.362922 / 97.9377% | 2.85958 / 0.362615 / 97.8587% | 3.28599 / 0.361268 / 97.5394%
Observations:
1. For all three traces, IPC and prediction accuracy decrease as we increase the fraction of storage given to GShare and reduce the
fraction given to TAGE across the three configurations of the hybrid predictor.
2. Similarly, GShare under-performs TAGE in the 25:75, 50:50, and 75:25 configurations.
3. TAGE delivers better performance per unit of storage budget than GShare, so 25:75 is the best configuration in terms of performance.
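For reference, a minimal sketch of how such a GShare + TAGE hybrid can be combined through a chooser table; the chooser size, the helper names, and the update rule are illustrative assumptions, not necessarily the selection logic used in the repository:

#include <cstdint>

// component predictors (defined elsewhere in this sketch)
uint8_t gs_predict(uint32_t cpu, uint64_t ip);
uint8_t tage_predict(uint32_t cpu, uint64_t ip);

constexpr int CHOOSER_ENTRIES = 4096;   // 2-bit counters: >= 2 means "trust TAGE"
int chooser[CHOOSER_ENTRIES];

uint8_t hybrid_predict(uint32_t cpu, uint64_t ip)
{
    uint8_t g = gs_predict(cpu, ip);
    uint8_t t = tage_predict(cpu, ip);
    return (chooser[ip % CHOOSER_ENTRIES] >= 2) ? t : g;
}

// train the chooser only when the two components disagree
void hybrid_update_chooser(uint64_t ip, uint8_t g, uint8_t t, uint8_t taken)
{
    if (g == t) return;
    int &c = chooser[ip % CHOOSER_ENTRIES];
    if (t == taken && c < 3) c++;
    else if (g == taken && c > 0) c--;
}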