Designing an Advanced Simulator for Unbiased Branches' Prediction
Adrian Florea1, Ciprian Radu1, Horia Calborean1, Adrian Crapciu1, Arpad Gellert1 and Lucian Vintan1
1 "Lucian Blaga" University of Sibiu, Computer Science Department, E. Cioran Street, No. 4, Sibiu-550025, ROMANIA,
Tel./Fax: +40-269-212716, E-mail: [email protected], {radu_ciprianro, horia.calborean, adrian.crapciu}@yahoo.com, [email protected], [email protected]
Abstract — In this paper we continue our work on detecting and predicting unbiased branches. We focused on two directions: first, based on a simple example from the Perm Stanford benchmark, we show that by extending the context information some branches in certain contexts become fully biased, thus diminishing the frequency of unbiased branches at benchmark level. Second, we use some state-of-the-art branch predictors to predict the unbiased branches. Following these aims, we developed the ABPS tool (Advanced Branch Prediction Simulator), an original and useful simulator written in Java that performs trace-driven simulation on 25 benchmarks from the Stanford and SPEC suites.
Keywords — unbiased branches, neural predictors, trace-driven simulation, benchmarking.
I. INTRODUCTION
Branch prediction has become a challenging problem for processor designers. Without branch prediction (BP) it is not possible to aggressively exploit a program's instruction-level parallelism. All present branch prediction techniques are limited in their accuracy. An important cause of this limitation is given by the prediction contexts used (global and local histories, respectively path information). Using these dynamic contexts, some branches are unbiased and their outcomes are non-deterministically shuffled, therefore unpredictable. The percentage of these branches represents a fundamental prediction limitation. One of our goals is to demonstrate the insufficiency of global correlation information (frequently the global history is too short, does not retain the branches that are really correlated with the predicted one, and keeps quite a lot of noise). If the context permitted it, a correlation could be seen between branches situated at a large distance in the dynamic instruction stream. Also, the local correlation reduces the noise included in the global history. Another aim of our work is to use state-of-the-art neural branch predictors (the simple and the fast path-based perceptron) to predict the unbiased branches.
The rest of this paper is organized as follows. In Section II we review related work in the field of branch prediction. Section III shows, based on a simple example from the Perm Stanford benchmark, the influence of context information of different lengths (global and local) on unbiased branches. Section IV presents the software design of ABPS. Section V includes the simulation methodology and the experimental results obtained using the developed ABPS simulator. Finally, Section VI suggests directions for future work and concludes the paper.
II. RELATED WORK
Present-day processors use hybrid prediction structures, combining two (or more) two-level adaptive predictors [1]: one correlated with the local history of the predicted branch (PAg predictor) and one correlated with its global history (GAg predictor). The selection between the two predictions is made using a confidence table that records the dynamic behavior of each predictor. The Alpha 21264 processor embeds a hybrid predictor consisting of a local predictor with 1024 entries (keeping a local history of 10 bits) and a global predictor with 4096 entries, reaching almost 95% prediction accuracy [2].
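As an illustration of the selection mechanism just described, the following minimal sketch in Java uses a confidence (meta) counter to choose between the two component predictions; the 2-bit counter width, the class name and the initial state are our own assumptions, not details taken from [1] or from the Alpha design.
/**********************************************/
// Illustrative hybrid selection: a small saturating meta counter
// chooses between the global (GAg) and local (PAg) predictions.
public class HybridSelector {
    private int meta = 2;   // 2-bit meta counter; values >= 2 favor the GAg component

    public boolean select(boolean gagPrediction, boolean pagPrediction) {
        return meta >= 2 ? gagPrediction : pagPrediction;
    }

    // Reinforce the component that was correct when the two disagree.
    public void update(boolean gagCorrect, boolean pagCorrect) {
        if (gagCorrect && !pagCorrect)      meta = Math.min(3, meta + 1);
        else if (pagCorrect && !gagCorrect) meta = Math.max(0, meta - 1);
    }
}
/**********************************************/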
The most accurate single-component branch predictors in the literature are neural branch predictors [2, 3, 4]. Their main advantage is the possibility of using longer correlation information at linear cost. The Perceptron predictor – the simplest neural branch predictor – keeps a table of weight vectors (small integers that are learned through the perceptron learning rule) [2]. As in global two-level adaptive branch prediction, a shift register records the global history of conditional branch outcomes, recording true for taken and false for not taken. To predict a branch outcome, a weight vector is selected by indexing the table with the branch address modulo the number of weight vectors. The dot product of the selected vector and the global history register is then computed, where true in the history represents 1 and false represents -1. If the dot product is at least 0, the branch is predicted taken, otherwise it is predicted not taken. Once the perceptron output has been computed, the training algorithm starts: it increments the i-th weight when the branch outcome agrees with the i-th bit of the global branch history register and decrements it otherwise. Unfortunately, the high latency of the perceptron predictor and its inability to predict linearly inseparable branches still make it impractical for hardware implementation. In order to reduce the prediction latency, the Fast Path-based Perceptron [3] chooses the weights used to generate a prediction according to the current branch's path, rather than according to the branch's PC and history register; the prediction latency is hidden by speculatively computing the perceptron output. Intel includes the perceptron predictor in one of its IA-64 simulators for researching future microarchitectures [2]. The piecewise linear branch predictors [4] use a piecewise-linear function for a given branch, exploiting in this way the different paths that lead to the same branch in order to predict otherwise linearly inseparable branches.
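For concreteness, the sketch below shows the global-history perceptron prediction and training rule described above. The class and field names are illustrative (they are not taken from ABPS or from [2]), and weight saturation to a fixed bit-width is omitted for brevity.
/**********************************************/
// Minimal global-history perceptron predictor (illustrative sketch).
public class PerceptronPredictor {
    private final int[][] weights;         // one weight vector per table entry
    private final int historyLength;
    private final int threshold;           // training threshold (prevents overtraining)
    private final boolean[] globalHistory;

    public PerceptronPredictor(int tableEntries, int historyLength, int threshold) {
        this.weights = new int[tableEntries][historyLength + 1];  // +1 for the bias weight
        this.historyLength = historyLength;
        this.threshold = threshold;
        this.globalHistory = new boolean[historyLength];
    }

    // Dot product of the selected weight vector and the global history (+1 / -1 encoding).
    private int output(int pc) {
        int[] w = weights[Math.floorMod(pc, weights.length)];
        int y = w[0];                                      // bias weight
        for (int i = 0; i < historyLength; i++)
            y += globalHistory[i] ? w[i + 1] : -w[i + 1];
        return y;
    }

    public boolean predict(int pc) {
        return output(pc) >= 0;                            // taken if the output is at least 0
    }

    // Perceptron learning rule: train on a misprediction or when |output| <= threshold.
    public void update(int pc, boolean taken) {
        int y = output(pc);
        int[] w = weights[Math.floorMod(pc, weights.length)];
        if ((y >= 0) != taken || Math.abs(y) <= threshold) {
            w[0] += taken ? 1 : -1;
            for (int i = 0; i < historyLength; i++)
                w[i + 1] += (taken == globalHistory[i]) ? 1 : -1;
        }
        // Shift the new outcome into the global history register.
        System.arraycopy(globalHistory, 0, globalHistory, 1, historyLength - 1);
        globalHistory[0] = taken;
    }
}
/**********************************************/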
In [5] the authors proposed a hybrid scheme that employs two Prediction by Partial Matching (PPM) Markovian predictors, one that predicts based on local branch histories and one based on global branch histories. The two independent predictions are combined using a simple, hardware-feasible perceptron.
Vintan et al. proved that a branch in a certain dynamic context is difficult to predict if it is unbiased and its outcomes are shuffled [6]. In other words, a dynamic branch instruction is unpredictable with given prediction information if it is unbiased in the considered dynamic context and its behavior in that context cannot be modeled through Markov stochastic processes of any order.
III. UNBIASED BRANCH PREDICTION – A CHALLENGING PROBLEM FOR PROCESSOR DESIGNERS
A. Analyzing the Influence of Branch Prediction Contexts
In this section we analyze the branch prediction contexts used today (global and local histories, respectively path information) from the point of view of their limits in predicting unbiased branches. The main idea is that in a perfect dynamic context all branch instances should have the same outcome. If the outcome is not the same, a first solution might consist in extending the context information. We varied the context lengths and observed that some dynamic contexts remained unpredictable regardless of their length.
In the following we partially present the C and assembly code of the Stanford Perm benchmark, which generates a suite of permutations. We detect unbiased branches and focus on two of the most important branch instructions (having PC=35 and PC=58 after the compilation process).
/**********************************************/
void Permute(int n) {
    int k;
    pctr = pctr + 1;
    if (n != 1)                       /* the first branch instruction analyzed (PC=35) */
    {
        Permute(n - 1);
        for (k = n - 1; k >= 1; k--)  /* the second branch instruction analyzed (PC=58) */
        {
            Swap(&permarray[n], &permarray[k]);
            Permute(n - 1);
            Swap(&permarray[n], &permarray[k]);
        }
    }
}
/**********************************************/
_Permute:
SUB SP, SP, #128
ST 0(SP), RA
ST 8(SP), R17
ST 12(SP), R18
ST 16(SP), R19
ST 20(SP), R20
MOV R20, R5
LD R13, _pctr
ADD R13, R13, #1
ST _pctr, R13
EQ B1, R20, #1
BT B1, L8 (#0)   # after the compilation process this branch has the address 35 (PC=35)
ADD R17, R20, #-1
MOV R5, R17
BSR RA, _Permute (#0)
MOV R18, R17
LES B1, R18, #0
BT B1, L8 (#0)
ASL R13, R20, #2
MOV R7, #_permarray
ADD R19, R13, R7
ASL R13, R18, #2
ADD R17, R13, R7
L12:
MOV R5, R19
MOV R6, R17
BSR RA, _Swap (#0)
ADD R5, R20, #-1
BSR RA, _Permute (#0)
MOV R5, R19
MOV R6, R17
BSR RA, _Swap (#0)
ADD R17, R17, #-4
ADD R18, R18, #-1
GTS B1, R18, #0
BT B1, L12 (#0)   # after the compilation process this branch has the address 58 (PC=58)
***************************************************
In the following simulations the parameters are set as follows: Path = not selected, unbiased polarization degree = 0.95, with HRl and HRg denoting the local and the global history.
We define the polarization index (bias) of a certain branch context as:
bias = max( T / (T + NT), NT / (T + NT) ),
where T and NT represent the number of "taken", respectively "not taken", branch instances corresponding to that certain context.
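As a small illustrative helper (not part of ABPS), the polarization index can be computed from the per-context counters as follows; the class and method names are our own.
/**********************************************/
public final class BiasUtil {
    // Polarization index (bias) of a branch context, as defined above.
    public static double bias(long taken, long notTaken) {
        long total = taken + notTaken;
        if (total == 0)
            return 0.0;                  // the context never occurred
        return Math.max((double) taken / total, (double) notTaken / total);
    }

    public static void main(String[] args) {
        // Example from the first simulation case below: PC 35, HRg 101 -> 2520 taken, 1100 not taken.
        System.out.printf("bias = %.3f%n", bias(2520, 1100));  // ~0.696, i.e. unbiased for a 0.95 threshold
    }
}
/**********************************************/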
1. Parameters: HRl = not selected, HRg on 3 bits, =>
Unbiased contexts: 25.0[%]
From the unbiased branches list we selected just two branch
instructions in two global contexts:
PC: 35 HRg: 101 T: 2520 NT: 1100 Bias: 0.696
PC: 58 HRg: 111 T: 1419 NT: 3620 Bias: 0.718
2. Parameters: HRl = not selected, HRg on 4 bits, =>
Unbiased contexts: 17.813[%]
PC: 35 HRg: 0101 T: 840 NT: 260 Bias: 0.763
PC: 35 HRg: 1101 T: 1680 NT: 840 Bias: 0.667
PC: 58 HRg: 0111 T: 1419 NT: 1100 Bias: 0.563
PC: 58 HRg: 1111 T: 0 NT: 2520 Bias: 1.000 => The branch at PC 58 in context HRg: 1111 became fully biased; practically, it no longer appears in the unbiased branch list.
3. Parameters: HRl on 1 bit, HRg on 4 bits, => Unbiased
contexts: 17.813[%]
PC: 35 HRg: 0101 HRl: 0 T: 840 NT: 260 Bias: 0.763
PC: 35 HRg: 0101 HRl: 1 – this context doesn’t occur
PC: 35 HRg: 1101 HRl: 0 T: 1680 NT: 840 Bias: 0.667
PC: 35 HRg: 1101 HRl: 1 – this context doesn’t occur
PC: 58 HRg: 0111 HRl: 0 T: 1419 NT: 1100 Bias: 0.563
PC: 58 HRg: 0111 HRl: 1 – this context doesn’t occur
4. Parameters: HRl on 2 bits, HRg on 4 bits, => Unbiased
contexts: 9.673[%]
PC: 35 HRg: 0101 HRl: 00 T: 840 NT: 260 Bias: 0.763
PC: 35 HRg: 0101 HRl: 10 – this context doesn’t occur
PC: 35 HRg: 1101 HRl: 00 – this context doesn’t occur
PC: 35 HRg: 1101 HRl: 10 T: 1680 NT: 840 Bias: 0.667
PC: 58 HRg: 0111 HRl: 00 T: 1419 NT: 260 Bias: 0.845
PC: 58 HRg: 0111 HRl: 10 T: 0 NT: 840 Bias: 1.000 => The branch at PC 58 in context HRg: 0111 and HRl: 10 became fully biased; practically, it no longer appears in the unbiased branch list.
…
5. Parameters: HRl on 2 bits, HRg on 7 bits, => Unbiased
contexts: 9.668[%]
PC: 58 HRg: 1110111 HRl: 00 T: 1419 NT: 260 Bias: 0.845
6. Parameters: HRl on 2 bits, HRg on 8 bits, => Unbiased
contexts: 8.134[%]
PC: 58 HRg: 01110111 HRl: 00 T: 579 NT: 260 Bias: 0.690
PC: 58 HRg: 11110111 HRl: 00 T: 840 NT: 0 Bias: 1.000 => The branch at PC 58 in context HRg: 11110111 and HRl: 00 became fully biased; practically, it no longer appears in the unbiased branch list.
Conclusion: As can be observed, by increasing the context length some branches in certain contexts become fully biased, but a significant percentage still remains unbiased.
Comparing the previous results, it can be observed that the longer (increased history length) or richer (local history added beside the global history) the context becomes, the smaller the percentage of unbiased branches becomes. From the 1st case to the 2nd, the unbiased branches percentage decreases by 7.187%, and it can be observed that the two unbiased branches in the short contexts are still unsolved. However, the branch at PC 58 in context HRg: 1111 became fully biased, decreasing the number of unbiased branch instances by 2520; practically, it no longer appears in the unbiased branch list. In the 3rd case (adding one bit of local history) the unbiased branches percentage remains unchanged. In the 4th case the local history is set to 2 bits and many more contexts become biased (the unbiased branches percentage decreases by 8.14%). Nevertheless, some contexts remain unbiased (see above: PC: 35, HRg: x101, HRl: x0, where x can be 0 or 1).
Analyzing the code sequence, the results regarding unbiased branches are correct. It can be observed that, in order to reach the conditional branch at PC 58, the previous 3 branches are taken every time (the return from the Permute function, the call of the Swap function and the return from it), and they are not necessarily correlated with branch 58. One reason for the large percentage of unbiased branches is that the branches within the global history may have no correlation with the current branch, or the relevant history may be too far away. If the context permitted it, a correlation could be seen between branches situated at a large distance in the dynamic instruction stream. Recursion and function calls hide some branches that are really correlated with the analyzed one. Also, the local correlation reduces the noise included in the global history. We found similar examples in the tower benchmark, which solves the Towers of Hanoi problem.
The insufficiency of global correlation information is also remarked in the case of programs or data structures that produce a variable number of history bits as the data changes (data correlation). This occurs with linked lists or trees, where the address of an element is tested (usually a comparison with 0), followed by a recursive call of the same function to test the next element (in the left or right sub-tree). The same situation occurs with hash tables that use linked lists to resolve collisions. A possible solution could be to use data values or structural information to keep the predictor better synchronized with the data. We tried such an approach in [7].
B. Predicting Unbiased Branches using State-of-the-art
Predictors
The prediction process involves accessing the prediction tables for every branch instruction in the traces and establishing the prediction as a function of the associated prediction automaton or of the computed perceptron output. After the branch's resolution, the update algorithm starts (every correct prediction reinforces the automaton state or the perceptron weights, otherwise the same parameters are weakened). The automatons are implemented as saturating counters and, in the neural predictors' case, the threshold prevents overtraining, permitting the perceptron to adapt quickly to any change of behavior.
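A minimal sketch of a saturating-counter automaton of the kind described above is given below (the perceptron update rule itself was sketched in Section II). The 2-bit width, the state encoding and the initial state are our own assumptions; ABPS may parameterize these differently.
/**********************************************/
// 2-bit saturating counter automaton: states 0,1 -> predict not taken; 2,3 -> predict taken.
public final class SaturatingCounter {
    private static final int MAX = 3;
    private int state = 2;                 // weakly taken as initial state (assumption)

    public boolean predict() {
        return state >= 2;
    }

    public void update(boolean taken) {
        if (taken) state = Math.min(MAX, state + 1);  // reinforce towards "taken"
        else       state = Math.max(0,   state - 1);  // reinforce towards "not taken"
    }
}
/**********************************************/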
ABPS includes the following implemented predictors: GAg, PAg, PAp, GShare and Perceptron. The implemented two-level predictors (the first 4) require as input parameters the number of entries in the prediction table and the history length (global / local). Besides the input parameters used by the two-level predictors, the neural predictors (Simple Perceptron and Fast Path-based Perceptron) need some additional ones: the threshold value used by the learning algorithm and the number of bits for storing the weights. Each predictor can predict either all branches or only the unbiased branches.
If the user selects the Perceptron predictor (Simple or Fast Path-based), the simulation results consist of four important metrics. The prediction accuracy is the number of correct predictions divided by the total number of dynamic branches. We also compute a confidence metric, representing the number of cases when the prediction was correct and the perceptron did not need to be trained (the magnitude of the perceptron output was greater than the threshold), divided by the total number of correct predictions. While the first two metrics have an impact on the processor's performance, the next two metrics have a direct influence on the transistor budget and integration area: the number of perceptrons used in the prediction process and, respectively, the saturation degree of the perceptrons. The saturation degree represents the percentage of cases when the perceptron weights cannot be incremented / decremented because they are saturated. If the last two metrics are quite low, it means that the perceptrons are underused. The prediction accuracy and the usage degree of the prediction table are also computed for the classical two-level predictors.
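Under our reading of the metrics above, they could be accumulated roughly as follows; the class, field and method names are illustrative and are not the actual ABPS code.
/**********************************************/
// Illustrative accumulation of the perceptron-related simulation metrics.
public class PerceptronMetrics {
    long dynamicBranches, correctPredictions;
    long confidentCorrect;        // correct predictions with |output| > threshold (no training needed)
    long saturatedUpdates, attemptedUpdates;
    java.util.Set<Integer> usedPerceptrons = new java.util.HashSet<>();

    void record(int tableIndex, boolean correct, int output, int threshold) {
        dynamicBranches++;
        usedPerceptrons.add(tableIndex);            // track which perceptrons actually get used
        if (correct) {
            correctPredictions++;
            if (Math.abs(output) > threshold)
                confidentCorrect++;
        }
    }

    void recordWeightUpdate(boolean saturated) {    // called for every attempted weight increment/decrement
        attemptedUpdates++;
        if (saturated) saturatedUpdates++;
    }

    double predictionAccuracy() { return (double) correctPredictions / dynamicBranches; }
    double confidence()         { return (double) confidentCorrect / correctPredictions; }
    double saturationDegree()   { return (double) saturatedUpdates / attemptedUpdates; }
    int    perceptronsUsed()    { return usedPerceptrons.size(); }
}
/**********************************************/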
IV. SOFTWARE DESIGN OF ABPS SIMULATOR
The user diagram (Fig.1a) illustrates the general user
interaction process with ABPS. A generic user can mainly
interact with ABPS in two ways (not fully distinct):
• Default start -> the user starts a simulation using the
default input parameters.
• Custom start (Choose simulation type) -> the user
chooses:
1. The simulation type – detection or prediction;
2. The benchmarks (Stanford and/or SPEC 2000);
3. The values for the simulation parameters.
After the three steps presented above, the user can start the simulation process. In both the Default start and the Custom start cases, the simulation results are shown after the simulation process finishes.
NOTES: Steps 1, 2 and 3 can be executed in any order. Steps 1 and 3 are not mandatory; if one of them is not executed, default values are used. Step 2 (choosing the benchmarks) is necessary the first time (initially no traces are selected for simulation) for both user interaction types.
The activity diagram (Fig.1b) shows a general view for the
simulation process flow in ABPS:
• Initialization – all simulation parameters are set (traces,
simulation type: detection / prediction, detector /
predictor values);
• Starts simulation – the simulation begins after all the
inputs had been set. The simulation process consists
basically in processing each trace included (in a
multithreaded manner);
• Read trace – each trace is processed, branch after
branch. Each branch instruction is fed to the selected
detector / predictor. This is done until all branch
instructions (from the selected trace) are processed.
During this, results are accumulated.
• Processing results – after a trace has been processed, the obtained results are processed in order to compute certain metrics;
• Display results – the results are displayed and the
simulation process stops.
NOTE: At any time the simulation process can be aborted from the GUI (Graphical User Interface).
Figure 1. UML Diagrams – User and Activity perspectives
Figure 2. Sequence Diagram
The sequence diagram (Fig.2) presents in detail how ABPS
performs the process of detecting unbiased branches. The
process starts in the GUI, where the detection parameters are
set. After this initialization, the user can trigger the detection
process, which will be managed by another thread (1: create,
st:SimulatorThread). In this way, the GUI will not block itself,
leaving the user with the ability to perform other tasks from
ABPS. The simulation thread will create and start a detection
thread (1.1: create, dt:DetectorThread). The detection thread
will manage the whole detection process (1.1.1: Create1, tr:TraceReader). When all the above initializations have been performed, the detection process actually starts (2:
startSimulation(), 2.1: run()): the trace used for simulation is
processed using the appropriate detector (see: 2.1.1 – 2.1.6).
Finally, the detection thread signals (by returning the results)
the simulation thread that the detection is done (2.2: Destruct3).
In the same manner, the simulation thread signals the GUI
thread (3:Destruct4), which will display the results.
NOTE: Although the above diagram does not show it, at any time the detection process can be aborted from the GUI.
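A simplified, self-contained skeleton of this interaction is sketched below. The class names SimulatorThread and DetectorThread follow the sequence diagram; the bodies, the trace representation and the abort mechanism are our own sketch and certainly simpler than the real ABPS classes.
/**********************************************/
// Minimal illustration of running detection off the GUI thread, with an abort flag.
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

class DetectionDemo {
    static class DetectorThread {
        private final AtomicBoolean aborted = new AtomicBoolean(false);

        long detect(List<Boolean> branchOutcomes) {            // stands in for TraceReader processing
            long takenCount = 0;
            for (Boolean taken : branchOutcomes) {
                if (aborted.get()) break;                      // the GUI can abort at any time
                if (taken) takenCount++;
            }
            return takenCount;                                 // results returned to the simulation thread
        }

        void abort() { aborted.set(true); }
    }

    static class SimulatorThread extends Thread {
        private final DetectorThread dt = new DetectorThread();  // 1.1: create
        private final List<Boolean> trace;

        SimulatorThread(List<Boolean> trace) { this.trace = trace; }

        @Override
        public void run() {
            long result = dt.detect(trace);                    // 2.1: run the detection
            System.out.println("Taken branches: " + result);   // 3: hand the results back (GUI omitted)
        }
    }

    public static void main(String[] args) {
        // 1: create -> the GUI thread would start the simulation thread and stay responsive.
        new SimulatorThread(List.of(true, false, true, true)).start();
    }
}
/**********************************************/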
V. SIMULATION METHODOLOGY AND EXPERIMENTAL RESULTS
We developed ABPS (Advanced Branch Prediction Simulator), an original interactive graphical trace-driven simulator [8]. We simulate eight C Stanford integer benchmarks, designed by Professor John Hennessy (Stanford University) to be computationally intensive and representative of non-numeric code while at the same time being compact. We also simulate all of the SPEC CPU2000 integer benchmarks and all of the SPEC CPU95 integer benchmarks that are not duplicated in SPEC CPU2000, each benchmark having 1 million dynamic branch instructions. All these benchmarks cover a wide range of applications, from compression (text/image) to word processing, and from compilers and architectures to games enhanced with artificial intelligence. We chose to simulate different versions of benchmarks (Stanford and SPEC) in order to discover how these different testing programs influence the neural branch predictors' micro-architectural features.
The ABPS simulator provides a wide variety of configuration options. Thus, it can be determined how the prediction accuracy varies with the input parameters (number of entries in the prediction tables, history length, number of bits for the weights representation, threshold value used for perceptron training, etc.). ABPS is written in Java and assures three of the features specific to most high-performance standard simulators: free availability for use, extensibility and portability. Inheritance and polymorphism are fully used, allowing new functionalities to be added with ease in the future.
Repeating the detection methodology for a length-ordered set of contexts, it can be observed how the number of unbiased branches in the tested benchmarks decreases. Figure 3 shows the reduction in the number of unbiased branches when varying the length of the prediction contexts from 8 to 32 bits. The percentage of unbiased branches decreases from 25.12% to 9.26%. We consider that the last value is still too high and that further investigations are required.
[Figure 3 is a chart, %Unbias = f(HRg): the unbiased branches percentage [%] on the SPEC benchmarks for global history lengths HRg = 8, 16, 24 and 32 bits; the average decreases from 25.12% (HRg = 8) through 17.44% and 12.64% to 9.26% (HRg = 32).]
Figure 3. Reducing the number of unbiased branches with increasing global history register length
Figure 4 graphically illustrates the influence of the global history on the prediction accuracy using a fast path-based perceptron predictor. It is very clear that the longer the global history becomes, the greater the prediction accuracy on all branches becomes. This ascending trend of the prediction accuracy is also maintained when the number of perceptrons increases. The best prediction accuracy obtained with a fast path-based perceptron predictor, 95.21%, is superior to that provided by the Alpha 21264, with a hardware budget about 8 times smaller (≈32 Kbytes vs. ≈257 Kbytes).
[Figure 4 is a chart, Ap = f(HRg): prediction accuracy [%] versus the number of perceptrons (100, 200, 1024) for HRg = 8, 16, 24 and 32 bits; the accuracies range from 92.81% to 95.21%.]
Figure 4. The influence of global history on prediction accuracy using a fast path-based perceptron
Despite a significant reduction of the unbiased branches percentage (26.79% on average) on five of the SPEC benchmarks (gzip, vpr, parser, bzip2 and twolf), the prediction accuracy varies only asymptotically (by less than 1.30% on average) when the global history length grows from 8 to 32 bits (see Figure 5). We call these SPEC testing programs critical benchmarks. The average prediction accuracy on these benchmarks is very low (91.06%, see Figure 5). When the global history length is 32 bits, the unbiased branches percentage on the 5 critical benchmarks is still high (15.90% on average) and may be responsible for the lower prediction accuracy. This happens because the current amount of prediction information (global correlation) is limited; using such limited information, the unbiased branches cannot be predicted with a high degree of accuracy. Consequently, other information is required to predict branches that have been classified as unbiased (local history, path or condition sign information).
[Figure 5 is a chart, Ap = f(HRg): prediction accuracy [%] on the SPEC critical benchmarks, obtained with a fast path-based perceptron predictor with 1024 entries, for HRg = 8, 16, 24 and 32 bits.]
Figure 5. Prediction accuracy on SPEC critical benchmarks

VI. CONCLUSIONS AND FUTURE WORK
In this paper we have shown that the design of branch predictors should consider the identification of unbiased branches, due to their negative impact on prediction accuracy. Repeating the detection methodology for a length-ordered set of contexts (varying the global history length from 8 to 32 bits), we observed that the percentage of unbiased branches decreases from 25.12% to 9.26%, which still represents a significant fraction. Further, we have demonstrated the insufficiency of global correlation information. We have also shown that even state-of-the-art branch predictors are unable to accurately predict these unbiased branches (the best prediction accuracy measured on all branches using a fast path-based perceptron predictor is 95.21%). We therefore consider that the use of additional prediction contexts (e.g., some HLL code information) is required to further improve prediction accuracies. In order to efficiently use such information, we consider that a significant amount of compiler support will be necessary.

REFERENCES
[1] Yeh T., Patt Y., Alternative Implementations of Two-Level Adaptive Branch Prediction, 19th Annual International Symposium on Computer Architecture, 1992.
[2] Jiménez D., Lin C., Neural Methods for Dynamic Branch Prediction, ACM Transactions on Computer Systems, Vol. 20, New York, USA, November 2002.
[3] Jiménez D., Fast Path-Based Neural Branch Prediction, Proceedings of the 36th Annual International Symposium on Microarchitecture, December 2003.
[4] Jiménez D., Idealized Piecewise Linear Branch Prediction, Journal of Instruction-Level Parallelism, Vol. 7, pp. 1-11, 2005.
[5] Srinivasan R., Frachtenberg E., Lubeck O., Pakin S., Cook J., NeuroPPM Branch Prediction, The 2nd Journal of Instruction-Level Parallelism Championship Branch Prediction Competition (CBP-2), Orlando, Florida, USA, pp. 30-35, 2006.
[6] Vintan L., Gellert A., Florea A., Oancea M., Egan C., Understanding Prediction Limits through Unbiased Branches, Lecture Notes in Computer Science, Vol. 4186, Springer-Verlag, Berlin Heidelberg, ISSN 0302-9743, pp. 480-487, 2006.
[7] Gellert A., Florea A., Vintan M., Egan C., Vintan L., Unbiased Branches: An Open Problem, The 12th Asia-Pacific Computer Systems Architecture Conference (ACSAC 2007), Seoul, Korea, August 2007.
[8] Radu C., Calborean H., Crapciu A., Gellert A., Florea A., An Interactive Graphical Trace-Driven Simulator for Teaching Branch Prediction in Computer Architecture, The 6th EUROSIM Congress on Modelling and Simulation, Ljubljana, Slovenia, 2007.