

E0-243 Computer Architecture

Assignment #1
Group Id: 20
Akash Maji ([email protected]) 24212
Nehal Shah ([email protected]) 24623

Question 1:
● Here we use perf as our performance measurement tool. It is a widely used Linux utility (an illustrative perf invocation is shown right after this list).
● We measure various PMU events (hardware-based and software-based).
● We use a standard O(n^3)-time matrix multiplication program written in C/C++ and measure PMU events such as L1/L2/L3 misses, TLB misses, page faults, etc. for three loop orders: <i,j,k>, <j,i,k> and <k,i,j>.
● Here, part (a) uses the basic versions with regular 4KB pages, part (b) uses the basic versions with 2MB huge pages, and part (c) uses tiled versions of those loop orders with 4KB pages.
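For reference, the PMU counters reported in the tables below can be collected with a perf stat invocation of roughly the following shape (the binary name ./matmul_ijk is a placeholder; the exact event list depends on what the CPU exposes):

perf stat -e page-faults,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,dTLB-loads,dTLB-load-misses,iTLB-load-misses,branch-instructions,branch-misses ./matmul_ijk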

Question 2:
● Here we use `ChampSim` as our architecture simulation tool. It is a free, open-source simulator.
● We measure various performance metrics (MPKI, IPC, prediction accuracy, etc.); their definitions are recalled right after this list.
● We run 3 CloudSuite benchmark traces to simulate four predictors: Bimodal, GShare, Perceptron and TAGE, each given a maximum storage budget of 64KB.
● Here, part (a) uses at least 500M simulation instructions, part (b) uses TAGE variants with different history lengths and table sizes, and part (c) explores a hybrid (GShare+TAGE) predictor with varying storage distributions.
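For reference, MPKI and prediction accuracy as reported below follow the standard definitions; the small helpers below (our own illustration, not ChampSim code) make them concrete:

#include <cstdint>

// MPKI = branch mispredictions per 1000 simulated instructions
double mpki(uint64_t mispredictions, uint64_t instructions) {
    return 1000.0 * (double)mispredictions / (double)instructions;
}

// prediction accuracy = fraction of branches predicted correctly
double accuracy(uint64_t correct_predictions, uint64_t total_branches) {
    return (double)correct_predictions / (double)total_branches;
}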

The following GitHub repo contains all code files for Question 1:
https://github.com/akashmaji946/HPCA-Assignment-01

The following GitHub repo contains all code files for Question 2:
https://github.com/akashmaji946/ChampSim-IISc
Question #1

Part A We have to do basic matrix multiplication using 4KB pages

The first variant uses (i, j, k) as the loop order


// do the matrix multiplication C = A * B
// loop order is (i, j, k)
for(int i = 0; i < ROWS; i++){
    for(int j = 0; j < COLS; j++){
        for(int k = 0; k < SIZE; k++){
            matrixC[i][j] += (matrixA[i][k] * matrixB[k][j]);
        }
    }
}

Time Taken => 30.839546165 seconds for 2048x2048 matrix


Time Taken => 3657.000821699 seconds for 8192x8192 matrix

The second variant uses (j, i, k) as the loop order


// do the matrix multiplication C = A * B
// loop order is (j, i, k)
for(int j = 0; j < COLS; j++){
    for(int i = 0; i < ROWS; i++){
        for(int k = 0; k < SIZE; k++){
            matrixC[i][j] += (matrixA[i][k] * matrixB[k][j]);
        }
    }
}

Time Taken => 28.747145839 seconds for 2048x2048 matrix


Time Taken => 4426.857946943 seconds for 8192x8192 matrix

The third variant uses (k, i, j) as the loop order


// do the matrix multiplication C = A * B
// loop order is (k, i, j)
for(int k = 0; k < SIZE; k++){
    for(int i = 0; i < ROWS; i++){
        for(int j = 0; j < COLS; j++){
            matrixC[i][j] += (matrixA[i][k] * matrixB[k][j]);
        }
    }
}

Time Taken => 24.600546672 seconds for 2048x2048 matrix


Time Taken => 1578.107532743 seconds for 8192x8192 matrix
In terms of time taken with normal 4KB pages, the order of efficiency for the 2048 case is (i,j,k) < (j,i,k) < (k,i,j); for the 8192 case, (k,i,j) is still by far the fastest, but (j,i,k) is the slowest.
Observations:
● For the 2048 runs, (k,i,j) is fastest, which can be attributed to it incurring the fewest LLC loads and stores, while (i,j,k) incurs the most.
● For the 8192 runs, (k,i,j) is fastest, which can be attributed to it incurring the fewest LLC load misses and store misses, while (j,i,k) incurs the most.
● Within each matrix size, all three variants incur almost the same number of page faults.

Metric | (i,j,k) 2048 | (i,j,k) 8192 | (j,i,k) 2048 | (j,i,k) 8192 | (k,i,j) 2048 | (k,i,j) 8192
Duration time (ns) | 30,839,546,165 | 3,657,000,821,699 | 28,747,145,839 | 4,426,857,946,943 | 24,600,546,672 | 1,578,107,532,743
User time (ns) | 30,810,343,000 | 3,656,602,950,000 | 28,713,868,000 | 4,425,380,384,000 | 24,579,683,000 | 1,577,534,617,000
Page-faults | 12,376 | 196,804 | 12,377 | 197,230 | 12,377 | 196,808
L1-dcache-loads | 180,716,201,859 | 11,548,025,757,67 | 180,799,591,141 | 11,546,530,116,698 | 180,493,924,548 | 1,548,034,424,737
L1-dcache-load-misses | 13,051,193,798 | 1,200,446,330,014 | 12,931,597,982 | 1,175,863,490,126 | 566,698,299 | 68,981,574,642
L1-dcache-stores | 17,324,396,716 | 1,102,180,578,461 | 17,323,845,706 | 1,102,123,802,619 | 17,280,379,677 | 1,101,538,256,169
L1-icache-load-misses | 4,028,646 | 367,988,829 | 4,192,323 | 523,788,201 | 4,980,158 | 193,654,440
LLC-loads | 717,213,971 | 963,041,823,085 | 97,421,469 | 957,922,755,311 | 10,044,197 | 294,832,978
LLC-load-misses | 20,139,255 | 1,305,607,741 | 10,706,600 | 2,861,865,612 | 6,827,789 | 205,877,525
LLC-stores | 681,560 | 51,369,940 | 663,832 | 44,348,743 | 579,777 | 24,475,056
LLC-store-misses | 285,538 | 15,393,320 | 286,705 | 18,226,105 | 321,250 | 13,983,975
dTLB-loads | 180,522,694,359 | 11,550,454,804,09 | 180,650,953,750 | 11,552,726,081,292 | 180,800,981,910 | 1,548,569,857,358
dTLB-load-misses | 3,124,469,564 | 543,022,254,070 | 2,985,825,485 | 518,437,302,979 | 6,763,029 | 129,763,674
dTLB-stores | 17,290,052,333 | 1,102,257,021,574 | 17,305,298,422 | 1,102,405,097,230 | 17,344,176,075 | 1,101,875,244,138
dTLB-store-misses | 111,706 | 9,517,610 | 625,479 | 20,618,786 | 296,218 | 9,327,580
iTLB-load-misses | 81,775 | 10,251,533 | 63,775 | 13,713,203 | 45,773 | 1,392,163
Branch-instructions | 8,782,139,839 | 553,185,083,901 | 8,775,284,817 | 553,363,279,103 | 8,774,585,811 | 552,735,244,961
Branch-misses | 4,576,725 | 76,749,309 | 4,557,846 | 85,614,929 | 4,621,447 | 75,776,978
Bus-cycles | 730,209,252 | 86,486,454,981 | 681,141,936 | 104,104,517,082 | 575,552,191 | 37,048,955,948
Part B We have to do basic matrix multiplication using 2MB pages

For the 2048x2048 matrix multiplication using huge pages


/*
Some Calculation on large pages (2MB)
matrix size : 2048 * 2048 elements
: 2^22 ints
: 2^24 B
: 16 MB
: 8 huge pages per matrix
total size : 3 matrices A, B, C
: 3 * 8 huge pages each
: 24 huge pages
That means, we must have 24 huge pages each of size 2MB allocated
*/

For the 8192x8192 matrix multiplication using huge pages


/*
Some Calculation on large pages (2MB)
matrix size : 8192 * 8192 elements
: 2^26 ints
: 2^28 B
: 256 MB
: 128 huge pages per matrix
total size : 3 matrices A, B, C
: 3 * 128 huge pages each
: 384 huge pages
That means, we must have 384 huge pages each of size 2MB allocated
*/
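On Linux, the required pool of 2MB huge pages can be reserved before the run, for example via the nr_hugepages sysctl (the count must cover the larger 8192x8192 case, i.e. at least 384 pages):

# reserve 384 huge pages of 2MB each (needs root)
sudo sysctl -w vm.nr_hugepages=384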

This is how we allocate and deallocate memory


// allocate memory dynamically to matrices A, B and C
// using mmap() with the appropriate size and huge-page flags
// (needs <sys/mman.h>; MAP_HUGE_2MB may also need <linux/mman.h>)
int *matrixA = (int*)mmap(NULL,
                          NMAP_ALLOC_SIZE,
                          PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                          -1, 0);
// note: mmap() returns MAP_FAILED on error, which should be checked before use
/*
Do your work here
*/

// deallocate the memory allocated by mmap() using munmap()
munmap(matrixA, NMAP_ALLOC_SIZE);
The first variant uses (i, j, k) as the loop order
// do the matrix multiplication C = A * B
// loop order is (i, j, k)
for(int i = 0; i < ROWS; i++){
    for(int j = 0; j < COLS; j++){
        for(int k = 0; k < SIZE; k++){
            matrixC[i * COLS + j] += (matrixA[i * COLS + k] * matrixB[k * COLS + j]);
        }
    }
}

Time Taken => 93.711056202 seconds for 2048x2048 matrix


Time Taken => 11139.637268956 seconds for 8192x8192 matrix

The second variant uses (j, i, k) as the loop order


// do the matrix multiplication C = A * B
// loop order is (j, i, k)
// (flat indexing, consistent with the mmap()-based allocation above)
for(int j = 0; j < COLS; j++){
    for(int i = 0; i < ROWS; i++){
        for(int k = 0; k < SIZE; k++){
            matrixC[i * COLS + j] += (matrixA[i * COLS + k] * matrixB[k * COLS + j]);
        }
    }
}

Time Taken => 92.630806364 seconds for 2048x2048 matrix


Time Taken => 11134.738841697 seconds for 8192x8192 matrix

The third variant uses (k, i, j) as the loop order


// do the matrix multiplication C = A * B
// loop order is (k, i, j)
// (flat indexing, consistent with the mmap()-based allocation above)
for(int k = 0; k < SIZE; k++){
    for(int i = 0; i < ROWS; i++){
        for(int j = 0; j < COLS; j++){
            matrixC[i * COLS + j] += (matrixA[i * COLS + k] * matrixB[k * COLS + j]);
        }
    }
}

Time Taken => 23.323075474 seconds for 2048x2048 matrix


Time Taken => 1566.293575103 seconds for 8192x8192 matrix

In terms of time taken with huge 2MB pages, the order of efficiency is (i,j,k) < (j,i,k) < (k,i,j) for both matrix sizes, i.e. (k,i,j) is the fastest.
Observations:
● For the 2048 runs, (k,i,j) is fastest, which can be attributed to it incurring the fewest LLC loads and stores, while (i,j,k) is slowest and incurs the most LLC load misses.
● For the 8192 runs, (k,i,j) is fastest, which can be attributed to it incurring the fewest LLC load misses and store misses; (i,j,k) and (j,i,k) take almost the same (much longer) time, with (i,j,k) marginally slower and incurring the most LLC stores and store misses.
● Within each matrix size, all three variants incur almost the same (and, thanks to the huge pages, far smaller) number of page faults.
Metric | (i,j,k) 2048 | (i,j,k) 8192 | (j,i,k) 2048 | (j,i,k) 8192 | (k,i,j) 2048 | (k,i,j) 8192
Duration time (ns) | 93,711,056,202 | 11,139,637,268,956 | 92,630,806,364 | 11,134,738,841,697 | 23,323,075,474 | 1,566,293,575,103
User time (ns) | 93,704,515,000 | 11,136,436,269,000 | 92,627,513,000 | 11,131,221,227,000 | 23,322,073,000 | 1,565,706,047,000
Page-faults | 76 | 438 | 76 | 436 | 77 | 437
L1-dcache-loads | 146,111,309,862 | 9,351,843,396,129 | 146,274,782,105 | 9,351,893,672,538 | 146,346,423,246 | 9,348,517,946,862
L1-dcache-load-misses | 8,628,169,783 | 554,643,660,292 | 9,166,601,291 | 585,449,138,806 | 552,740,466 | 66,942,935,635
L1-dcache-stores | 17,277,682,693 | 1,103,796,102,446 | 17,308,685,726 | 1,103,650,129,000 | 17,315,062,334 | 1,101,331,384,271
L1-icache-load-misses | 8,293,170 | 1,254,965,004 | 9,859,822 | 1,277,390,048 | 3,207,756 | 217,260,196
LLC-loads | 2,701,304,319 | 550,138,866,918 | 2,913,869,006 | 550,133,261,235 | 5,327,420 | 213,379,416
LLC-load-misses | 1,139,672,283 | 548,435,729,154 | 954,685,548 | 548,151,801,003 | 4,289,713 | 166,808,666
LLC-stores | 3,910,346 | 582,761,944 | 6,287,668 | 211,207,525 | 1,078,413 | 24,755,121
LLC-store-misses | 2,203,642 | 522,126,714 | 2,707,380 | 101,606,528 | 766,603 | 10,942,586
dTLB-loads | 146,428,902,934 | 9,353,519,539,316 | 146,317,418,345 | 9,353,470,761,736 | 146,124,811,633 | 9,346,937,597,114
dTLB-load-misses | 3,012 | 13,429,387 | 12,430 | 18,071,320 | 5,810 | 829,690
dTLB-stores | 17,325,900,270 | 1,103,765,705,725 | 17,301,784,980 | 1,103,699,301,055 | 17,284,801,579 | 1,101,350,107,039
dTLB-store-misses | 91,607 | 11,835,387 | 93,163 | 12,470,670 | 24,827 | 1,618,156
iTLB-load-misses | 7,877 | 1,740,466 | 15,404 | 2,558,554 | 6,010 | 367,316
Branch-instructions | 8,760,926,085 | 554,780,482,096 | 8,767,129,620 | 554,820,377,522 | 8,759,089,629 | 552,441,645,258
Branch-misses | 4,613,776 | 99,854,918 | 4,683,516 | 102,508,848 | 4,558,781 | 76,515,626
Bus-cycles | 2,234,646,940 | 265,443,790,303 | 2,211,446,148 | 265,344,609,240 | 545,669,830 | 36,795,383,547
Part C We have to do matrix multiplication in a tiled fashion as shown.
(i, j, k) order
// do matrix multiplication in a tiled fashion with tile-size = 64
// loop order: (i, j, k)
for(int i = 0; i < ROWS; i += BLOCK_SIZE){
    for(int j = 0; j < COLS; j += BLOCK_SIZE){
        for(int k = 0; k < SIZE; k += BLOCK_SIZE){
            for(int ii = i; ii < i + BLOCK_SIZE; ii++){
                for(int jj = j; jj < j + BLOCK_SIZE; jj++){
                    for(int kk = k; kk < k + BLOCK_SIZE; kk++){
                        matrixC[ii][jj] += (matrixA[ii][kk] * matrixB[kk][jj]);
                    }
                }
            }
        }
    }
}
Time Taken => 27.941030455 seconds for 2048x2048 matrix
Time Taken => 1756.026040607 seconds for 8192x8192 matrix
(k, i, j) order
// do matrix multiplication in a tiled fashion with tile-size = 64
// loop order: (k, i, j)
for(int k = 0; k < SIZE; k += BLOCK_SIZE){
    for(int i = 0; i < ROWS; i += BLOCK_SIZE){
        for(int j = 0; j < COLS; j += BLOCK_SIZE){
            for(int kk = k; kk < k + BLOCK_SIZE; kk++){
                for(int ii = i; ii < i + BLOCK_SIZE; ii++){
                    for(int jj = j; jj < j + BLOCK_SIZE; jj++){
                        matrixC[ii][jj] += (matrixA[ii][kk] * matrixB[kk][jj]);
                    }
                }
            }
        }
    }
}
Time Taken => 24.635800144 seconds for 2048x2048 matrix
Time Taken => 1579.586545771 seconds for 8192x8192 matrix
(j, i, k) order
// do matrix multiplication in a tiled fashion with tile-size = 64
// loop order: (j, i, k)
for(int j = 0; j < COLS; j += BLOCK_SIZE){
    for(int i = 0; i < ROWS; i += BLOCK_SIZE){
        for(int k = 0; k < SIZE; k += BLOCK_SIZE){
            for(int jj = j; jj < j + BLOCK_SIZE; jj++){
                for(int ii = i; ii < i + BLOCK_SIZE; ii++){
                    for(int kk = k; kk < k + BLOCK_SIZE; kk++){
                        matrixC[ii][jj] += (matrixA[ii][kk] * matrixB[kk][jj]);
                    }
                }
            }
        }
    }
}
Time Taken => 25.647599198 seconds for 2048x2048 matrix
Time Taken => 1692.886438544 seconds for 8192x8192 matrix
Observations:
In this case we do not see large time savings from changing the loop order; the three variants differ only marginally in time taken (a rough tile working-set estimate is given right after this list).
● In the 2048 runs, (i,j,k) incurs the most LLC load misses and stores while (k,i,j) incurs the least.
● In the 8192 runs, (i,j,k) incurs the most LLC stores while (k,i,j) incurs the least.
● Here (i,j,k) performs worst and (k,i,j) takes the least time, mirroring the non-tiled versions, where (i,j,k) was also among the worst performers.
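A rough working-set estimate (assuming 4-byte ints and BLOCK_SIZE = 64) explains why the tiled loops behave so similarly across orders:

/*
Tile working-set estimate (4-byte ints, BLOCK_SIZE = 64):
one 64x64 tile          : 64 * 64 * 4 B = 16 KB
three live tiles (A,B,C): 3 * 16 KB     = 48 KB
Each inner tile computation therefore works out of roughly 48 KB, which fits
comfortably in a typical L2 cache (and largely in L1d + L2), so the outer
loop order matters far less than in the non-tiled versions.
*/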

Metric | (i,j,k) 2048 | (i,j,k) 8192 | (j,i,k) 2048 | (j,i,k) 8192 | (k,i,j) 2048 | (k,i,j) 8192
Duration time (ns) | 27,941,030,455 | 1,756,026,040,607 | 25,647,599,198 | 1,692,886,438,544 | 24,635,800,144 | 1,579,586,545,771
User time (ns) | 27,920,088,000 | 1,755,841,046,000 | 25,627,140,000 | 1,692,689,712,000 | 24,623,036,000 | 1,579,331,811,000
Page-faults | 12,377 | 196,805 | 12,377 | 196,827 | 12,376 | 196,808
L1-dcache-loads | 190,194,232,622 | 12,150,517,050,774 | 190,010,392,317 | 12,149,808,482,668 | 190,260,678,597 | 12,149,874,528,789
L1-dcache-load-misses | 61,483,215 | 3,446,557,830 | 318,199,290 | 19,942,640,700 | 65,210,685 | 3,672,203,829
L1-dcache-stores | 17,576,630 | 1,118,981,561,228 | 17,556,658,794 | 1,118,977,407,534 | 17,594,147,665 | 1,118,889,954,245
L1-icache-load-misses | 4,857,490 | 149,885,643 | 3,429,295 | 151,370,928 | 3,586,939 | 144,493,229
LLC-loads | 4,731,392 | 371,554,980 | 6,612,223 | 517,173,439 | 1,576,002 | 83,369,550
LLC-load-misses | 2,325,423 | 283,490,200 | 553,649 | 404,442,395 | 525,688 | 29,758,027
LLC-stores | 666,641 | 24,699,157 | 603,218 | 23,768,203 | 597,245 | 21,997,877
LLC-store-misses | 417,711 | 11,171,659 | 398,225 | 11,933,107 | 408,272 | 11,181,211
dTLB-loads | 189,938,841,938 | 12,151,094,554,189 | 189,978,432,782 | 12,151,335,842,639 | 189,246,743,067 | 12,153,672,025,760
dTLB-load-misses | 2,018,513 | 143,697,521 | 1,980,500 | 143,434,123 | 187,679 | 3,635,874
dTLB-stores | 17,581,663,757 | 1,118,892,390,497 | 17,565,664,488 | 1,118,934,040,459 | 17,525,520,826 | 1,118,969,231,880
dTLB-store-misses | 95,280 | 3,263,733 | 90,075 | 3,251,544 | 91,484 | 3,224,840
iTLB-load-misses | 17,707 | 185,199 | 5,703 | 206,662 | 3,867 | 116,615
Branch-instructions | 9,175,329,258 | 578,749,548,881 | 9,168,421,057 | 578,636,484,224 | 9,169,744,340 | 578,559,572,924
Branch-misses | 136,791,425 | 8,732,335,847 | 136,695,025 | 8,730,766,311 | 136,601,847 | 8,717,341,208
Bus-cycles | 660,603,499 | 41,737,917,678 | 610,683,266 | 39,853,243,827 | 583,186,121 | 37,105,852,844
Question #2

Part A We have to do CloudSuite simulations using ChampSim with different predictor variants

The 3 CloudSuite traces are:


1. Trace1: nutch_phase2_core1.trace.xz
2. Trace2: cloud9_phase1_core1.trace.xz
3. Trace3: classification_phase0_core3.trace.xz
Note: We are using 10M warmup and 500M actual simulation instructions.
The observed parameters are:

Trace MPKI IPC Prediction Accuracy Predictor Used

Trace1 16.5163 1.36976 93.499% Bimodal

Trace1 0.005922 2.05911 99.9977% GShare

Trace1 0.00844 2.059 99.9967% Perceptron

Trace1 0.006368 2.05922 99.9975% TAGE

Trace MPKI IPC Prediction Accuracy Predictor Used

Trace2 4.374 0.393449 97.9909% Bimodal

Trace2 0.73988 0.409175 99.6602% GShare

Trace2 0.416594 0.410222 99.8086% Perceptron

Trace2 0.117964 0.409526 99.9458% TAGE

Trace MPKI IPC Prediction Accuracy Predictor Used

Trace3 4.42255 0.361737 96.6896% Bimodal

Trace3 3.08063 0.364767 97.6941% GShare

Trace3 2.48727 0.366098 98.1382% Perceptron

Trace3 1.6329 0.370232 98.7777% TAGE


Observations:
For trace1, TAGE gives the highest IPC and Bimodal the lowest; TAGE also gives the lowest MPKI and Bimodal the highest.
For trace2, Perceptron gives the highest IPC and Bimodal the lowest, while TAGE gives the lowest MPKI and Bimodal the highest.
For trace3, TAGE gives the highest IPC and Bimodal the lowest; TAGE also gives the lowest MPKI and Bimodal the highest.

So the TAGE predictor has the best outcomes overall, and Bimodal is the least desirable.
Storage Justification (64 KB)
Bimodal Predictor
/*
-------------------
Storage Cost Estimation:
-------------------
We want to have at most 64KB for our bimodal predictor per CPU.
We use a 2-bit saturating counter, so entry size = 2 bits.
Number of entries = 64 KB / 2 bits = 64 * 8 Kbits / 2 bits = 256 K = 262144
Largest prime less than 262144 = 262139
*/
#define BIMODAL_TABLE_SIZE 262144
#define BIMODAL_PRIME 262139
#define MAX_COUNTER 3
int bimodal_table[NUM_CPUS][BIMODAL_TABLE_SIZE];
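For reference, a minimal standalone sketch of the 2-bit saturating-counter logic this table implements (illustrative only; function names are ours, the per-CPU dimension is dropped for brevity, and this is not ChampSim's exact predictor interface):

#include <cstdint>

static int bimodal_sketch_table[BIMODAL_TABLE_SIZE];   // each entry conceptually 2 bits (0..3)

// predict taken when the counter is in the upper half (2 or 3)
bool bimodal_predict(uint64_t ip) {
    uint64_t idx = ip % BIMODAL_PRIME;                  // hash the PC into the table
    return bimodal_sketch_table[idx] >= 2;
}

// saturating update toward the actual outcome
void bimodal_update(uint64_t ip, bool taken) {
    uint64_t idx = ip % BIMODAL_PRIME;
    if (taken  && bimodal_sketch_table[idx] < MAX_COUNTER) bimodal_sketch_table[idx]++;
    if (!taken && bimodal_sketch_table[idx] > 0)            bimodal_sketch_table[idx]--;
}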
GShare Predictor
/*
-------------------
Storage Cost Estimation:
-------------------
We want to have at most 64KB for our gshare predictor per CPU.
We use a 2-bit saturating counter, so entry size = 2 bits.
Number of entries = 64 * 8 Kbits / 2 bits = 256 K = 262144
*/
#define GLOBAL_HISTORY_LENGTH 16
#define GLOBAL_HISTORY_MASK ((1 << GLOBAL_HISTORY_LENGTH) - 1)
int branch_history_vector[NUM_CPUS];
#define GS_HISTORY_TABLE_SIZE 262144
int gs_history_table[NUM_CPUS][GS_HISTORY_TABLE_SIZE];
int my_last_prediction[NUM_CPUS];
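A minimal sketch of how a gshare-style index is typically formed from these structures (illustrative only; the per-CPU dimension is dropped for brevity):

#include <cstdint>

static int history_sketch;   // last GLOBAL_HISTORY_LENGTH branch outcomes, newest in bit 0

// fold the PC with the global history and map the result into the table
unsigned gs_table_index(uint64_t ip) {
    uint64_t hash = ip ^ (uint64_t)(history_sketch & GLOBAL_HISTORY_MASK);
    return (unsigned)(hash % GS_HISTORY_TABLE_SIZE);
}

// shift the newest outcome into the global history register
void gs_update_history(bool taken) {
    history_sketch = ((history_sketch << 1) | (taken ? 1 : 0)) & GLOBAL_HISTORY_MASK;
}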
Perceptron Predictor
/*
-----------------------
Storage Cost Estimation:
-----------------------
Each perceptron has N=32 weights and each weight takes 8 bits,
so each perceptron is 32 Bytes.
We have a table of NUM_PERCEPTRONS=2048 perceptrons.
Perceptron Table Size = 2048 * 32 Bytes = 64 KB
Update Table Entry Size: 32 + 1 + 1 + 11 = 45 bits ~ 6 B (8 B at most)
Update Table Size = 256 * 6 B < 2 KB
Total size taken is within 64+2 KB.
*/
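A minimal sketch of the perceptron prediction step implied by this budget (our illustration: N = 32 eight-bit weights per perceptron, 2048 perceptrons, bias handling folded into the weights for simplicity):

#include <cstdint>

#define PERCEPTRON_HISTORY 32      // 32 weights of 8 bits each => 32 B per perceptron
#define NUM_PERCEPTRONS    2048    // 2048 * 32 B = 64 KB of weight storage

static int8_t   p_weights[NUM_PERCEPTRONS][PERCEPTRON_HISTORY];
static uint64_t p_history;         // last 32 branch outcomes, bit i = 1 if taken

// y = sum_i w_i * h_i with h_i = +1 (taken) or -1 (not taken); predict taken if y >= 0
int perceptron_output(uint64_t ip) {
    int idx = (int)(ip % NUM_PERCEPTRONS);
    int y = 0;
    for (int i = 0; i < PERCEPTRON_HISTORY; i++) {
        int h = ((p_history >> i) & 1) ? 1 : -1;
        y += h * p_weights[idx][i];
    }
    return y;
}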
TAGE Predictor
/*
---------------------------
Storage Budget Justification:
---------------------------
We are using 12 TAGE tables with entry sizes (in bits):
[24, 22, 20, 20, 18, 18, 16, 16, 14, 14, 12, 12]
since the tag widths are:
[19, 17, 15, 15, 13, 13, 11, 11, 9, 9, 7, 7]
plus 5 bits for the 'ctr' and 'u' fields.
There are 2^11 entries in each TAGE table (TAGE_BITS = 11).
So TAGE Tables Size = sum([24, 22, ..., 12]) bits * 2^11 = 206 bits * 2048 entries ~ 52 KB
Bimodal Table Size = 2^14 entries of 2 bits each = 2^15 bits = 2^12 B = 4 KB
Total Size ~ 52 + 4 = 56 KB, which is within the 64 KB budget.
*/
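The per-entry accounting above (tag width plus 3-bit ctr plus 2-bit u) corresponds conceptually to an entry like the following (a sketch only; the real tables pack these bit-fields, and the tag width varies from 19 down to 7 bits across the 12 tables):

#include <cstdint>

// conceptual TAGE tagged-table entry: size = tag_width + 3 + 2 bits
struct TageEntry {
    uint32_t tag;   // partial tag, tag_width bits (19 .. 7 depending on the table)
    int8_t   ctr;   // 3-bit signed saturating prediction counter
    uint8_t  u;     // 2-bit usefulness counter
};
// e.g. the largest table: (19 + 3 + 2) bits * 2^11 entries = 24 * 2048 bits = 6 KB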
Part B We have to do CloudSuite simulations using ChampSim and TAGE variants
We are first using a fixed number of TAGE Tables (say 8).

Case-1
We use TAGE tables of different sizes (at most 2^13 entries per table) and vary the minimum history length across the four cases specified below, to see whether this setting brings any improvement.

// Global history lengths for each of the four minimum-history settings


History lengths = [4 6 10 16 26 42 67 107] min history length = 4
History lengths = [8 13 20 33 52 84 134 215] min history length = 8
History lengths = [12 19 31 49 79 126 201 322] min history length = 12
History lengths = [16 26 41 66 105 168 268 429] min history length = 16

// number of bits used to index into each table, indicating table size
const uint8_t TAGE_INDEX_BITS[TAGE_NUM_COMPONENTS] = {13, 13, 12, 12, 11, 11, 11, 11};
const uint8_t TAGE_TAG_BITS[TAGE_NUM_COMPONENTS] = {13, 13, 11, 11, 9, 9, 9, 9};

// Clearly there are 8 TAGE Tables


// Assume each has different table size (max is 2^13 entries)
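For reference, each of the four history-length sets above follows the usual TAGE geometric spacing; the small standalone sketch below (our own illustration, not part of the submitted code) reproduces the listed values from the minimum length with a ratio of roughly 1.6:

#include <cmath>
#include <cstdio>

// history_length(i) = round(min_hist * r^i) for i = 0..7, with r ~ 1.6
int main() {
    const double r = 1.6;
    const int mins[4] = {4, 8, 12, 16};
    for (int m : mins) {
        for (int i = 0; i < 8; i++)
            std::printf("%ld ", std::lround(m * std::pow(r, i)));
        std::printf("\n");   // e.g. prints: 4 6 10 16 26 42 67 107
    }
    return 0;
}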

When we simulate with 300M instructions:

Trace | MPKI (min hist 4) | IPC (min hist 4) | MPKI (min hist 8) | IPC (min hist 8) | MPKI (min hist 12) | IPC (min hist 12) | MPKI (min hist 16) | IPC (min hist 16)
trace1 | 0.00743333 | 2.05839 | 0.00785 | 2.05832 | 0.00795333 | 2.05831 | 0.00816667 | 2.05826
trace2 | 0.0963733 | 0.409516 | 0.103427 | 0.409591 | 0.11343 | 0.409467 | 0.114957 | 0.409534
trace3 | 1.92031 | 0.368561 | 1.81499 | 0.369058 | 1.75186 | 0.369506 | 1.69893 | 0.369608


Case-2
We again use TAGE tables of different sizes (at most 2^13 entries per table, as before), but with the history lengths in geometric progression, and then check whether this setting brings any improvement.

// Global history lengths according to the GP


History lengths = [4 8 16 32 64 128 256 512]
// number of bits used to index into each table, indicating table size
const uint8_t TAGE_INDEX_BITS[TAGE_NUM_COMPONENTS]={13, 13, 12, 12, 11, 11, 11, 10};
// We still assume there are 8 TAGE Tables
// And each table has different size (max is 2^13 entries)
When we simulate with 500M instructions:

Trace MPKI IPC Prediction Accuracy Predictor Used

Trace1 0.00460392 2.05965 99.9982% TAGE

Trace2 0.112835 0.409517 99.9482% TAGE

Trace3 1.8186 0.368683 98.6385% TAGE

Best trace (lowest MPKI, highest IPC, highest accuracy): trace1

From the two tables, we observe that even after changing the history lengths and re-running the traces, we obtain only marginal differences. Where MPKI increases, IPC decreases slightly, and vice versa.

Part C We have to do CloudSuite simulations using ChampSim and a hybrid predictor

We use a hybrid predictor (GShare + TAGE) with varying storage distributions between the two components, each configuration totaling under 64KB. We observe the following parameters (the columns below are grouped by the GShare:TAGE storage split):

Trace | MPKI (25:75) | IPC (25:75) | Acc. (25:75) | MPKI (50:50) | IPC (50:50) | Acc. (50:50) | MPKI (75:25) | IPC (75:25) | Acc. (75:25)
trace1 | 0.00607091 | 2.05946 | 99.9976% | 0.00677091 | 2.05939 | 99.9973% | 0.00769273 | 2.05927 | 99.997%
trace2 | 1.50078 | 0.406576 | 99.3107% | 1.51612 | 0.406312 | 99.3036% | 1.82682 | 0.404121 | 99.1609%
trace3 | 2.75407 | 0.362922 | 97.9377% | 2.85958 | 0.362615 | 97.8587% | 3.28599 | 0.361268 | 97.5394%

Storage budget justification:


1. 25:75 distribution (Gshare: 16KB | TAGE: 48KB)
We use gshare table of size 65536 entries of size 2 bits, and use TAGE index lengths = [13, 12, 11, 11, 11,
11, 10, 10] of entry sizes = [14, 15, 16, 17, 17, 18, 19, 20] respectively across 8 tables
const uint8_t TAGE_INDEX_BITS[TAGE_NUM_COMPONENTS] = {13, 12, 11, 11, 11, 11, 10, 10};
const uint8_t TAGE_TAG_BITS[TAGE_NUM_COMPONENTS] = {9, 10, 11, 12, 12, 13, 14, 15};
2. 50:50 distribution (Gshare: 32KB | TAGE: 32KB)
We use gshare table of size 131072 entries of size 2 bits, and use TAGE index lengths = [12, 12, 11, 11, 11,
10] of entry sizes = [14, 15, 16, 17, 18, 19] respectively across 6 tables
const uint8_t TAGE_INDEX_BITS[TAGE_NUM_COMPONENTS] = {12, 12, 11, 11, 11, 10};
const uint8_t TAGE_TAG_BITS[TAGE_NUM_COMPONENTS] = {9, 10, 11, 12, 13, 14};
3. 75:25 distribution (Gshare: 48KB | TAGE: 16KB)
We use gshare table of size 262144 entries of size 2 bits, and use TAGE index lengths = [11, 11, 11, 10]
of entry sizes = [16, 17, 18, 19] respectively across 4 tables
const uint8_t TAGE_INDEX_BITS[TAGE_NUM_COMPONENTS] = {11, 11, 11, 10};
const uint8_t TAGE_TAG_BITS[TAGE_NUM_COMPONENTS] = {11, 12, 13, 14};
The storage justification for the 25:75 distribution is given below; the other two configurations can be justified similarly.
Gshare Table Size = 2^16 * 2 bits = 16 KB
Total TAGE Table Size = 2^13 * 14 bits + 2^12 * 15 bits + … ≈ 44 KB
TAGE Bimodal Table Size = 2^14 * 2 bits = 4 KB
The remaining ~1KB can be used for a bimodal chooser (meta predictor) with 2^12 entries of 2 bits each.
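To illustrate how such a ~1KB chooser can arbitrate between the two components (a conceptual sketch only; the submitted code in the linked repo is the authoritative version):

#include <cstdint>

#define CHOOSER_ENTRIES 4096   // 2^12 entries of (conceptually) 2 bits = 1 KB meta predictor

static uint8_t chooser[CHOOSER_ENTRIES];   // 0..1 favour GShare, 2..3 favour TAGE

// final prediction: use whichever component the chooser currently trusts for this PC
bool hybrid_predict(uint64_t ip, bool gshare_pred, bool tage_pred) {
    return (chooser[ip % CHOOSER_ENTRIES] >= 2) ? tage_pred : gshare_pred;
}

// when the two components disagree, nudge the chooser toward the one that was right
void hybrid_update_chooser(uint64_t ip, bool gshare_pred, bool tage_pred, bool taken) {
    uint8_t &c = chooser[ip % CHOOSER_ENTRIES];
    if (tage_pred == taken && gshare_pred != taken && c < 3) c++;
    if (gshare_pred == taken && tage_pred != taken && c > 0) c--;
}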

Observations:
1. For all three traces, IPC and prediction accuracy decrease as we increase the fraction of storage given to GShare and reduce the fraction given to TAGE across the three hybrid configurations.
2. In other words, the GShare component under-performs TAGE in the 25:75, 50:50 and 75:25 configurations alike.
3. TAGE gives better performance per unit of storage budget than GShare, so 25:75 is the best of the three configurations in terms of performance.
