Reconsidering Complex Branch Predictors
Daniel A. Jiménez
Department of Computer Science
Rutgers University, Piscataway, NJ 08854
Abstract

To sustain instruction throughput rates in more aggressively clocked microarchitectures, microarchitects have incorporated larger and more complex branch predictors into
their designs, taking advantage of the increasing numbers
of transistors available on a chip. Unfortunately, because
of penalties associated with their implementations, the extra accuracy provided by many branch predictors does not
produce a proportionate increase in performance. Specifically, we show that the techniques used to hide the latency
of a large and complex branch predictor do not scale well
and will be unable to sustain IPC for deeper pipelines.
We investigate a different way to build large branch predictors. We propose an alternative predictor design that
completely hides predictor latency so that accuracy and
hardware budget are the only factors that affect the efficiency of the predictor. Our simple design allows the predictor to be pipelined efficiently by avoiding difficulties introduced by complex predictors. Because this predictor
eliminates the penalties associated with complex predictors,
overall performance exceeds that of even the most accurate
known branch predictors in the literature at large hardware
budgets. We conclude that as chip densities increase in the
next several years, the accuracy of complex branch predictors must be weighed against the performance benefits of
simple branch predictors.
1 Introduction
Accurate branch prediction is an essential component
of today’s deeply pipelined microprocessors. As improvements in process technology have continued to provide
more transistors in the same area, branch predictors have
become larger, more complex, and more accurate. This
trend will continue into the future, with branch predictor
hardware budgets running into the scores or hundreds of
kilobytes. For example, the design for the Compaq EV8 included a hybrid branch predictor with approximately 45KB
of state [14].
Figure 1 shows the arithmetic mean misprediction rates
achieved over the SPEC 2000 integer benchmarks by extending several branch predictors from the literature into
large hardware budgets, as well as two very complex but
highly accurate predictors: the multi-component hybrid
predictor [5] and the perceptron predictor [8].
[Figure 1. Arithmetic mean misprediction rates on SPECint 2000 for Gshare, Bi-Mode, 2Bc-gskew, the multi-component hybrid, and the perceptron predictor, at hardware budgets from 2KB to 512KB.]
1.1 Better Accuracy Doesn't Always Mean Better Performance
Unfortunately, a large hardware budget and complex organization can have a negative impact on performance relative to equally large but simpler branch predictors. Because
the branch predictor is on the critical path for fetching instructions, it must deliver a prediction in a single cycle [7].
However, as pipelines deepen and clock rates increase, access delay significantly limits the size, and hence the accuracy, of large on-chip SRAM arrays such as branch predictors that can be accessed in a single cycle [1]. Since larger branch
predictors are more accurate, modern microarchitectures attempt to overcome the access delay problem by using complex schemes such as hierarchical overriding branch predictors, in which a simple and quick predictor is backed up
by a large, complex, and slower predictor [7]. This technique was used for the Alpha EV6 branch predictor [9] and was built into the design for the Alpha EV8 branch predictor [14]. Because of penalties associated with their implementations, the extra accuracy provided by such complex predictors does not yield a proportionate increase in performance.

[Figure 2. Ideal IPC for perceptron and multi-component predictors, contrasted with realistic, overriding versions, at hardware budgets from 16KB to 512KB.]
To illustrate our point, Figure 2 shows the instructions
per cycle (IPC) rates yielded by the multi-component hybrid predictor and the perceptron predictor. We show the
IPCs yielded by ideal, zero-delay versions of these predictors, as well as IPCs yielded by more realistic versions that
use overriding. Overriding has been shown to yield better
performance [7] than other proposed delay-hiding schemes
such as lookahead [21] and cascading [7, 4]. However, no
study of which we are aware has addressed the utility of
overriding or other techniques on very large hardware budgets in the hundreds of kilobytes. Figure 2 shows that, even
with the best known delay-hiding techniques, large branch
predictors lead to significantly worse overall performance
than more moderate sized predictors. For instance, the
512KB version of the perceptron predictor yields an IPC
11% less than that of the 32KB version, even though the
larger predictor is more accurate. This is because overriding
cannot completely hide the access delay of the branch predictor; when the slow and fast predictor disagree, a penalty
proportional to the predictor access delay must be paid.
Obviously, no microarchitect would actually design such a large overriding branch predictor with such a negative impact on performance. We conclude that either branch
predictors cannot get larger than a few tens of kilobytes,
or we must find another predictor organization capable of
completely hiding access delay.
1.2 Our Solution

We propose an alternative design that eliminates the problems of previous delay-hiding organizations. We describe a simple, large branch predictor that can be pipelined
to deliver every prediction in a single cycle, thus solving
the problem of access delay. We present the design of our
branch predictor and contrast it to previous work, and we
present the results of simulations showing that our simplified predictor is capable of improving performance over the
best known branch predictors in the literature. We make optimistic assumptions about the implementations of the complex predictors we study, providing an upper bound on the
accuracy of predictors that may actually be designed by microarchitects. We conclude that in the next several years,
as hardware budgets allow for larger branch predictors, the
ability to pipeline a branch predictor will become a more
important design criterion than a clever prediction scheme
that improves accuracy incrementally.
1.3 The Basic Idea
Our idea is to pipeline the branch predictor so that it
always produces an accurate prediction in a single cycle.
We apply our technique to a special version of the gshare
branch predictor which we call gshare.fast. Because of its
simple design, gshare.fast can be pipelined to provide a
branch prediction in a single cycle. The key idea is that our
predictor is organized so that a small set of candidate entries
from the pattern history table (PHT) is prefetched several
cycles before the prediction is needed. Figure 3 illustrates
the notion of narrowing down the number of entries from
which to select the final prediction. As more instructions
are fetched and executed, the location of the exact PHT entry needed to make a prediction becomes known, and by
the time branch prediction is required, the prediction can be
selected in a single cycle. This idea can be applied to any
simple predictor that uses only global history and a few bits of the branch address. The predictor can produce a
prediction in a single cycle because, unlike complex predictors, it does not use local histories, compute hashing functions, or otherwise create dependences between the branch
address and the prefetched PHT entries.
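To make the division of work concrete, here is a minimal C sketch of the two-phase lookup, written for illustration only; the table size, the LINE_BITS/LINE_SIZE constants, the function names, and the counter threshold are all our assumptions, and the real design forms its index from the lower nine branch-address bits rather than the simplified masking shown here.

    #include <stdint.h>
    #include <stdio.h>

    #define PHT_BITS  15                /* illustrative: 2^15 two-bit counters */
    #define LINE_BITS 3                 /* 8 counters per prefetched line */
    #define LINE_SIZE (1 << LINE_BITS)

    static uint8_t pht[1 << PHT_BITS];  /* 2-bit saturating counters, 0..3 */

    /* Cycle t-3: begin the slow PHT read using only the global history
       known so far, so the fetched line cannot depend on the branch address. */
    static void prefetch_line(uint32_t old_history, uint8_t line[LINE_SIZE]) {
        uint32_t base = (old_history << LINE_BITS) & ((1u << PHT_BITS) - LINE_SIZE);
        for (int i = 0; i < LINE_SIZE; i++)
            line[i] = pht[base + i];
    }

    /* Cycle t: the branch address and the newest history bits are now known;
       a shallow mux selects one counter from the buffered candidates. */
    static int predict(const uint8_t line[LINE_SIZE],
                       uint32_t branch_pc, uint32_t new_history_bits) {
        uint32_t sel = (branch_pc ^ new_history_bits) & (LINE_SIZE - 1);
        return line[sel] >= 2;          /* 1 = predict taken */
    }

    int main(void) {
        uint8_t line[LINE_SIZE];
        prefetch_line(0x1a2bu, line);   /* started three cycles early */
        printf("taken = %d\n", predict(line, 0x400bf0u, 0x5u));
        return 0;
    }

The point of the sketch is structural: nothing computed in predict() feeds back into prefetch_line(), so the slow read and the fast select pipeline cleanly.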
This paper is organized as follows. We first present some
background into branch prediction, describing some basic
concepts and reviewing some implemented and proposed
complex branch predictors. We then discuss the impact of
branch predictor complexity on performance. We describe
our solution to the problem, giving a detailed discussion of
gshare.fast. We provide experimental results showing the
advantages of our predictor, and then conclude.
2 Impact of Branch Predictor Complexity
Branch predictor complexity leads to branch predictor delay, which has a large negative impact on performance [7]. Techniques have been proposed and imple-
mented that hide some of the delay, but none of these techniques can hide the delay completely and still maintain predictor accuracy. Thus, branch predictor delay will always remain a problem for complex predictors. In this section, we describe some existing branch predictors, discuss the impact of complexity on branch predictor design, and review the solutions that have been proposed for this problem.

[Figure 3. gshare.fast separates the process of indexing the PHT into stages, prefetching a contiguous set of candidate predictions: old global history indexes the large PHT, while newer global history XORed with lower branch PC bits selects the prediction from the PHT buffer.]

2.1 Some Implemented Branch Predictors

As microprocessor designs have become more aggressive, with deeper pipelines and higher clock rates, branch predictor designs have become more complex. We review a few of the branch predictors in industrial designs, in chronological order of their introduction. As we will see, industrial branch predictor designs have become more and more complex in recent years:

The Alpha EV6 core uses a hybrid branch predictor composed of two two-level predictors [9, 20]. A 4K-entry PHT is indexed by a global history register, while a 1K-entry PHT is indexed by one of 1024 local 10-bit branch histories. The final branch prediction is chosen by indexing a third, chooser PHT. Its implementation complexity comes with a cost: the Alpha branch predictor overrides a less accurate line predictor, introducing a single-cycle bubble into the pipeline whenever the two disagree [9].

AMD's new Hammer (K8) architecture uses a sophisticated branch prediction scheme designed for large workloads [19]. An array of branch selectors cached in the L2 cache decides whether to use two-level prediction for hard-to-predict branches or to simply predict a branch bias for highly biased branches. Hard-to-predict branches are predicted by a 16K-entry pattern history table indexed by a global history register. Because of the complexity of the branch predictor, a predicted fetch introduces a single-cycle bubble into Hammer's 12-stage pipeline [19].

Seznec et al. describe the Compaq EV8 hybrid predictor, which has four components: two two-level predictors, one bi-modal predictor, and one meta-predictor [14]. The hardware budget for this predictor is 352 Kbits.

2.2 Sources of Branch Predictor Complexity

Branch predictor complexity comes from predictor organizations that place many levels of logic on the critical path to making a prediction. Some examples of this complexity are:

Large tables indexed by the branch PC. If the branch PC or previous branch target address is needed to fetch a prediction from a large table with a high access delay, the prediction might arrive too late and require overriding or some other latency-hiding trick.

Computation. Hybrid predictors must bring data from several tables together to compute a final prediction. This adds more gate delay to the process of making a prediction. Other predictors perform more complex computations; for instance, the perceptron predictor [8] computes the dot product of two vectors using a deep circuit similar to a multiplier.

Dependence on the contents of the branch instruction. Some architectures expose details of the branch predictor to the ISA. This idea puts the instruction cache on the critical path to making a prediction.

2.3 Branch Predictor Delay

Instruction fetch bandwidth is a bottleneck in modern microprocessor designs. Thus, it is critical that branches be predicted as quickly as possible to avoid introducing fetch bubbles into the pipeline. As explained in the Introduction, eliminating branch predictor access delay is essential to sustaining instruction fetch rates and thus performance.

2.3.1 Table Access Delay

As with caches, larger prediction tables take longer to read. The problem is particularly bad with branch predictors because of the increased decoder delay. Consider two SRAM tables of 4KB: a direct-mapped L1 data cache with 32-byte lines, and a pattern history table with two-bit saturating counters. An access to the cache requires selecting among only 128 entries, while the pattern history table requires selecting among 16K entries. Thus, a pattern history table incurs a higher decoding cost than a cache of the same size.
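As a quick check of the decode disparity, a tiny self-contained calculation (our own illustration, using the numbers from the paragraph above):

    #include <stdio.h>

    int main(void) {
        int bytes = 4096;                      /* both structures are 4KB */
        int cache_entries = bytes / 32;        /* 32-byte lines  ->   128 */
        int pht_entries   = bytes * 8 / 2;     /* 2-bit counters -> 16384 */
        printf("cache rows: %d, PHT rows: %d\n", cache_entries, pht_entries);
        return 0;                              /* 128x more rows to decode */
    }

A decoder must distinguish among all rows, so the PHT's 16K entries imply a wider, slower decode than the cache's 128 lines.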
2.4 Increased Predictor Complexity
Complexity also contributes to delay. In addition to table
accesses, these predictors incur extra delay by, for instance,
combining information from multiple tables, using an index
read from one table to access another table, computing information such as the majority of two counters [14], or even
computing the dot product of two vectors [8].
2.5 Impact of Delay
Jiménez et al. point out that a 100% accurate predictor
with a latency of two cycles results in worse performance
than a relatively inaccurate predictor with single-cycle latency [7]. Because of increasing clock rates and wire delay, the maximum size of a pattern history table accessible in a single cycle in future designs will be 1024 entries.
Several techniques for mitigating branch predictor delay have been proposed. Nevertheless, our research shows that even with these techniques, branch predictor delay still has a significant impact on performance.
2.6 Techniques for Mitigating Delay
2.6.1 Overriding Predictors
An overriding predictor uses two branch predictors: a quick
but relatively inaccurate predictor, and a slower, more accurate predictor. When branch prediction is needed, the quick
predictor makes a single-cycle prediction, after which instructions are fetched and speculatively executed. A few
cycles later, the slower predictor returns its prediction. If
the two predictions disagree, the few instructions fetched
and issued are squashed, and fetch continues down the other
path. This process incurs a penalty smaller than a full
branch misprediction. The overriding technique is used in
the Alpha EV6 and EV7 cores, and is included in the EV8
design [9, 14]: a quick instruction cache line predictor is
overridden by a two-cycle hybrid branch predictor.
There are two drawbacks to the overriding idea. First,
although a high-latency predictor may have a higher overall accuracy, the improvement in performance is not directly proportional to the improvement in accuracy, since
pipeline bubbles result from those cases when the quick and
slow predictors disagree. Thus, the actual performance improvement is a function of the accuracies of the quick predictor and the slow predictor, as well as the latency of the
slow predictor. Second, implementing the overriding mechanism introduces extra complexity into the design of the rest
of the pipeline. Instead of having the two cases of correct
and incorrect prediction, there are now four combinations of
correct and incorrect quick and slow predictions. The processor designer must design logic to squash instructions for
the case when the slow predictor overrides the fast predictor,
and for when the overriding prediction itself is incorrect.
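The four combinations of quick and slow outcomes can be captured in a small cost model; this is our own abstraction for illustration, and the penalty values are free parameters rather than measured numbers:

    #include <stdio.h>

    /* Cycles charged to one branch under an overriding scheme.
       fast_ok / slow_ok: does each prediction match the actual outcome?
       override_penalty: roughly the slow predictor's access delay.
       mispredict_penalty: the full pipeline-flush cost. */
    static int branch_cost(int fast_ok, int slow_ok,
                           int override_penalty, int mispredict_penalty) {
        if (fast_ok && slow_ok)  return 0;                /* agree and correct */
        if (!fast_ok && slow_ok) return override_penalty; /* slow fixes fast   */
        if (fast_ok && !slow_ok)                          /* slow wrongly      */
            return override_penalty + mispredict_penalty; /* overrides fast    */
        return mispredict_penalty;      /* both wrong: ordinary misprediction */
    }

    int main(void) {
        /* e.g., a 4-cycle slow predictor correcting a wrong quick prediction */
        printf("%d cycles\n", branch_cost(0, 1, 4, 20));  /* prints 4 */
        return 0;
    }

The model makes the scaling problem visible: every disagreement costs on the order of the slow predictor's latency, so a larger, slower second-level table pays its own access delay more and more often.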
2.6.2 Dual-Path Fetch
An alternative to overriding is dual-path fetch, which is used
in the AMD Hammer architecture [3]. When an instruction cache line is fetched that contains a branch instruction,
it may be several cycles before the branch prediction for
that instruction becomes available. During this time, the
processor fetches and speculatively issues down both possible paths. This technique effectively halves instruction
fetch bandwidth and available execution resources while the
branch prediction is being computed, and is not scalable to
situations where multiple branches are being predicted in
the front end.
3 Our Solution: A Large, Fast gshare
In this section, we present a highly accurate large branch
predictor based on gshare [10] called gshare.fast that produces its prediction in a single cycle, uses the most recently
updated history, and incurs no added penalty or complexity
for the rest of the pipeline. We first describe the idea for predicting a single branch per cycle using a pipelined gshare,
then discuss related issues such as a way to extend the idea
to predict multiple branches in a single cycle.
3.1 A Pipelined Implementation
Figure 4 shows a diagram of a pipelined implementation
of gshare.fast. Note that this pipeline runs in parallel with
the rest of the fetch engine components, interacting with the
components only to receive a branch address, send a prediction, or recover from a misprediction. There is no overriding or other overhead. Our results model several different
PHT access latencies, but for this discussion we will assume
that the PHT has an access latency of three cycles, and can
return one line of 8 two-bit counters on each cycle. The
branch predictor is pipelined in four stages: three for reading the PHT, and one for computing a prediction. In order
to keep track of new speculative global history encountered
since the PHT access began, the first three stages of the predictor pipeline each contain four latches. The first latch,
which we call the Branch Present latch, records whether a
branch was fetched and predicted during the corresponding
cycle.

[Figure 4. gshare.fast pipeline for predicting a branch fetched at cycle t: four stages, the first three holding New History Bit and Branch Present latches; the lower 9 bits of the branch address are XORed with the global history shift register to select a prediction from the 8-entry PHT buffer read from the large PHT, and the result is forwarded to the fetch logic.]

The second latch, which we call the New History Bit latch, records the corresponding speculative global history
bit (if any) shifted in from the next predictor pipeline stage.
The third and fourth latches (not pictured) receive the corresponding bits from later pipeline stages in the first half of
the cycle. To predict a branch fetched during cycle t, the
predictor pipeline is organized as follows:
Stage 1, cycle t−3. In this cycle, the global history register
is used to begin fetching a line of 8 two-bit counters from
the PHT into the 8-entry PHT buffer. If a branch had been
fetched and predicted during pipeline stage 2, the resulting
New History Bit is shifted in from stage 2 and the Branch
Present latch is set; otherwise, the Branch Present latch is
reset.
Stage 2, cycle t−2. During this cycle, the multi-cycle
PHT access continues. In the first half of the cycle, any history and Branch Present bits are forwarded to the previous
stage. In the second half of the cycle, any New History and
Branch Present bits are shifted in from stage 3 to maintain
the current speculative global history.
Stage 3, cycle t−1. At the end of stage 3, the 8-entry PHT
buffer has been read from the PHT. Any New History and
Branch Present latches are forwarded to the previous stage
in the first half of the cycle. In the second half of the cycle,
New History and Branch Present bits are shifted from stage
4.
Stage 4, cycle t. If a branch is fetched in this stage, the
lower nine bits of its address are exclusive-ORed with the
low bits of the global history register, shifted left and combined with up to three newly generated New History Bits
from the previous stages. This computation forms an index whose low-order bits select an entry of the PHT buffer; the selected counter's
value provides a branch prediction that is forwarded to the
rest of the fetch engine. This stage sends New History and
Branch Present bits to the previous stage.
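The latch movement described in the four stages can be mimicked with a short cycle-level sketch; this is our own illustrative C, collapsing the half-cycle forwarding and the PHT itself, and the names are assumptions rather than the paper's signal names:

    #include <stdint.h>
    #include <stdio.h>

    #define STAGES 3                  /* stages 1-3 track in-flight history */

    typedef struct {
        int branch_present;           /* was a branch predicted that cycle? */
        int new_history_bit;          /* its speculative taken/not-taken bit */
    } hist_latch;

    static hist_latch latch[STAGES];  /* latch[0] = stage 1, latch[2] = stage 3 */
    static uint64_t   ghr;            /* global history shift register */

    /* One predictor clock: a bit generated by stage 4 enters stage 3, each
       earlier stage takes the bits from the next stage, and whatever leaves
       stage 1 retires into the global history shift register. */
    static void clock_history(int predicted_branch, int predicted_taken) {
        if (latch[0].branch_present)
            ghr = (ghr << 1) | (uint64_t)latch[0].new_history_bit;
        for (int s = 0; s + 1 < STAGES; s++)
            latch[s] = latch[s + 1];
        latch[STAGES - 1].branch_present  = predicted_branch;
        latch[STAGES - 1].new_history_bit = predicted_branch && predicted_taken;
    }

    int main(void) {
        clock_history(1, 1);          /* a taken branch predicted in stage 4 */
        clock_history(0, 0);          /* three cycles later, the bit has */
        clock_history(0, 0);          /* traversed stages 3, 2, and 1... */
        clock_history(0, 0);
        printf("ghr = %llu\n", (unsigned long long)ghr); /* ...prints 1 */
        return 0;
    }

In hardware these are latches updated in half-cycle phases rather than a loop, but the direction of travel is the same: speculative history bits flow backward through the predictor pipeline and drain into the history register.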
3.2 Predictor Update

Here, we describe our mechanisms for updating the branch predictor. There are three types of update:

Speculative update of the global history register. When
a branch is predicted, the global history is updated
speculatively to reflect this prediction, on the assumption that the prediction is correct. This policy has
been shown to produce better accuracy [17]. To speculatively update the global history in our predictor,
the New History Bit and Branch Present latches are
shifted to previous pipeline stages as described above.
A global history shift register records history bits that
are shifted out of the first predictor pipeline stage.
Non-speculative update of the PHT. When a branch is
executed, the PHT entry that provided the prediction
for that branch must be incremented or decremented
based on whether the branch is taken or not taken. Because it takes several cycles to update the predictor,
a problem occurs when in-flight branches need to access PHT entries that have not yet been updated. Our approach is simply to update the table slowly. This policy has a negligible impact on predictor accuracy, since
the PHT entry for a particular address/history combination tends not to change very much. For instance,
in our simulations, we have found that if we allow the
branch predictor to predict 64 branches between the
time it predicts and updates any branch, the average
misprediction rate for a 256KB budget increases from
4.03% to 4.07%, with less than a 1% decrease in IPC. A sketch of this delayed-update policy follows this list.
Recovery after a misprediction. When a branch misprediction occurs, the speculative global history register must be overwritten with a non-speculative history
that is updated only when branches are executed. The
information in the PHT buffer will be invalid for the
few cycles it takes to refill it. To overcome this latency, the PHT buffer can be checkpointed to a buffer
associated with the pipeline stage where branch prediction occurs. This buffer is propagated to similar buffers
kept for each pipeline stage. During a misprediction,
the buffer associated with the commit stage, whose
contents reflect the PHT entries associated with the by
now non-speculative global history, is copied into the
PHT buffer. The copies of the PHT buffer that had
been fetched and stored for previous pipeline stages
will serve to fill the PHT buffer until the gshare.fast
pipeline is filled again. The portion of the history register that had been used to fetch these copies contained
the correct and now non-speculative history, since any
speculative and wrong-path histories recorded in the
global history register would only have been used in
indexing the PHT buffer, not fetching from the PHT itself. This idea requires no extra interaction between the microarchitecture and the branch predictor, since the checkpoint buffers are maintained within the predictor alongside the pipeline stages with which they are associated.
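A minimal sketch of the delayed non-speculative update, assuming a simple FIFO of pending counter writes drained behind execution (the structure, names, and sizes are our illustration; the paper specifies only that the table may be updated slowly):

    #include <stdint.h>
    #include <stdio.h>

    #define DELAY 64                     /* updates may land 64 branches late */

    typedef struct { uint32_t index; int taken; } pending_update;

    static uint8_t        pht[1 << 15];  /* 2-bit saturating counters */
    static pending_update fifo[DELAY];
    static int            head, tail, count;

    /* Called once per cycle: lazily drain one pending write into the PHT. */
    static void drain_one(void) {
        if (count == 0) return;
        pending_update u = fifo[head];
        head = (head + 1) % DELAY;
        count--;
        if (u.taken  && pht[u.index] < 3) pht[u.index]++; /* strengthen taken */
        if (!u.taken && pht[u.index] > 0) pht[u.index]--; /* strengthen not taken */
    }

    /* Called when a branch executes: queue the update instead of writing the
       PHT immediately; in-flight predictions may read the stale counter. */
    static void branch_executed(uint32_t index, int taken) {
        if (count == DELAY) drain_one();        /* never drop an update */
        fifo[tail] = (pending_update){ index, taken };
        tail = (tail + 1) % DELAY;
        count++;
    }

    int main(void) {
        branch_executed(42u, 1);   /* the branch mapped to entry 42 was taken */
        drain_one();               /* a few cycles later the write lands */
        printf("counter[42] = %d\n", pht[42]);  /* moved from 0 to 1 */
        return 0;
    }

The reported 4.03% to 4.07% change in misprediction rate is the whole argument: predictions read counters that may be up to 64 updates stale, and it barely matters.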
3.3 Discussion
The ideas we introduce with our predictor raise several
issues. We discuss some of them here.
3.3.1 Multiple Branch Prediction

We have presented our branch predictor in the simple setting of a fetch engine that predicts at most one branch per cycle. However, modern wide-issue microarchitectures often need to predict many branches at once, and methods for doing multiple branch prediction have been proposed [21, 15]. For instance, the EV8 branch predictor is capable of predicting as many as 16 branches from two fetch blocks in the same cycle [14]. It does this by arranging the dual-ported branch prediction table such that the branch predictions for each fetch block reside in consecutive locations in the table, so that they can be fetched all at once. This predictor uses stale speculative global histories that can be as many as three fetch blocks old; using these stale histories is reported to have minimal impact on branch prediction accuracy.

Our predictor design can also be enhanced to provide many predictions in a single cycle. Predictions for consecutive branch instructions are already laid out close to one another in the PHT buffer. By enlarging this buffer, more branches can be predicted simultaneously. Suppose the latency of the PHT is k cycles. The PHT buffer has to be large enough to accommodate each combination of PHT entries that might be required after k cycles. In our original design, where we predict up to one branch per cycle, the PHT buffer can be as small as 2^k entries. If we enhance the design to accommodate p predictions per block, then the PHT buffer size should increase to 2^k * p entries. For example, if up to 8 branch instructions can be fetched in a single cycle and the branch predictor latency is 3 cycles, then the PHT buffer must have at least 2^3 * 8 = 64 entries. The Branch Present and speculative history latches must also be increased in each pipeline stage by a factor of p, since up to p new speculative history bits may be generated in each cycle. This organization can still be easily designed to provide single-cycle access to the branch predictor. If necessary, an extra predictor pipeline stage can be introduced to separate the process of reading the PHT buffer from the process of computing its index.

3.3.2 Large Hardware Budgets
Currently, branch predictors with hardware budgets in the
hundreds of kilobytes are still considered infeasible due
to limits on the number of transistors available on a chip.
However, even now, a PA-RISC processor has been built
with over 2MB of on-chip cache [18]. Intel has recently announced that it has built a chip only 109 square millimeters
in size with 52 megabits of SRAM in its next-generation 90
nm process technology [13]. Thus, it is reasonable to expect
that a 100KB branch predictor consuming less than 2% of
the area of the chip will be designed for future processor
generations.
3.3.3 Branch Targets
This paper is concerned only with the design of the branch
direction predictor, which predicts whether a conditional
branch will be taken or not. The branch direction predictor is a component of the branch target predictor, which
predicts the address of the next instruction to fetch after a
branch. The branch target predictor can be implemented as
a branch target buffer (BTB). If a branch is predicted to be
taken, the branch target predictor must predict the branch
target so that instruction fetch may continue along the predicted path. Our predictor is useful even in situations with
a slow BTB because the branch target only needs to be predicted if the branch is predicted taken, and compilers try to
minimize the number of taken branches to avoid the penalty
of a discontinuous fetch.
3.3.4 Interactions With Microarchitecture
We have claimed that our predictor reduces design complexity in two ways: it is conceptually simple, and it minimizes interactions with other aspects of the microarchitecture. This second point is important: the only interactions
between the predictor and the rest of the microarchitecture occur when the fetch
engine requests and receives a prediction, and when the microarchitecture signals the predictor to recover from a misprediction. We contrast this with the interactions with an
overriding predictor, such as the ones in the Alpha EV6 and EV8 cores, which require the interactions mentioned above as well as the following:
When the quick and slow predictors disagree, the few
instructions fetched so far must be squashed and instruction fetch restarted down the path indicated by the
slow predictor.
The microarchitecture must keep track of two types
of speculative instructions: instructions predicted with
the quick predictor that must be squashed when the
slow predictor overrides them, and instructions predicted with the slow predictor. Even when the quick
and slow predictors agree, the instructions predicted
with the quick predictor must be re-marked as having
been predicted with the slow predictor.
4 Experimental Results
In this section, we present experimental results showing
that our gshare.fast predictor achieves better performance
than the best branch predictors in the literature. We describe
our experimental methodology, discuss the accuracies of
each predictor, and present performance results measured
with instructions-per-cycle (IPC).
4.1 Experimental Methodology
4.1.1 Predictors Simulated
We use the perceptron predictor [8], multi-component hybrid predictor [5], and the 2Bc-gskew predictor [11]. The
perceptron predictor and multi-component hybrid predictor
are the most accurate known branch predictors in the academic literature. The recently presented EV8 branch predictor is a practical implementation of 2Bc-gskew [14]. For
each predictor, we use the overriding mechanism to mitigate access delay. For the perceptron predictor, we use both
global and local history as input to the predictor.
4.1.2 Optimistic Assumptions
Our claim is that gshare.fast provides superior performance
to these other predictors. To strengthen our claim, we
choose optimistic assumptions for the other predictors.
Thus, our results provide an upper bound on the performance achievable by more realistic, stripped-down versions
of these complex predictors. We make the following assumptions:
Speculative Update Both global and local history are updated speculatively and recovered with no latency after misprediction. This policy has been shown to give the best accuracy, but is difficult to implement, especially with large
tables [17].
Latency Most of the access latency for the table-based predictors comes from the time it takes to select the counters from the various tables; we charge only one additional fan-out-of-four inverter gate delay for computation. For the perceptron predictor, we assume that the delay in computing the perceptron output is a single cycle, and that the rest of the delay comes from the table lookups. Although actual implementations of these predictors would likely have a higher latency, even a more practical complex predictor design, the branch predictor of the Alpha EV8, has a two-cycle latency, the same as the multi-component predictor we model for up to a 53KB hardware budget.
Overriding Penalty The cycle penalty associated with
overriding a quick prediction is equivalent to the access latency of the branch predictor, with no extra time taken to
squash instructions or fetch from the other path.
Clock Rate We assume an aggressive clock period of 8
fan-out-of-four inverter (FO4) delays, yielding a 3.5 GHz
clock rate in 100 nm process technology. Hrishikesh et al.
suggest that 8 FO4s is the optimal clock period – 6 FO4 for
doing useful work and 2 FO4 for latch delay – giving the
best combination of pipeline depth and time for useful work
in the processor [6]. Jiménez et al. show that the largest
pattern history table accessible at this clock rate contains
1K entries [7]. However, we optimistically assume that the
quick predictor in the overriding scheme can contain 2K
entries.
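The quoted clock rate can be sanity-checked with the common rule of thumb that one FO4 inverter delay is roughly 360 ps per micron of drawn gate length; the rule of thumb is our addition, not the paper's timing model:

    #include <stdio.h>

    int main(void) {
        double gate_len_um = 0.10;                 /* 100 nm technology    */
        double fo4_ps      = 360.0 * gate_len_um;  /* about 36 ps per FO4  */
        double period_ps   = 8.0 * fo4_ps;         /* 8 FO4 clock period   */
        printf("clock = %.2f GHz\n", 1000.0 / period_ps);  /* ~3.47 GHz */
        return 0;
    }

Eight FO4 delays at 100 nm works out to roughly 288 ps, i.e., about the 3.5 GHz assumed here.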
4.1.3 Execution Driven Simulations
We use the 12 SPEC 2000 integer benchmarks running under a modified version of SimpleScalar/Alpha [2], a cycle-accurate out-of-order execution simulator, to evaluate our
branch predictors. To better capture the steady-state performance behavior of the programs, our evaluation runs skip
the first 500 million instructions, as several of the benchmarks have an initialization period (lasting fewer than 500
million instructions), during which branch prediction accuracy is unusually high. Each benchmark executes over one
billion instructions on the ref inputs before the simulation
ends. Table 1 shows the microarchitectural parameters used
for the simulations.
4.1.4 Branch Predictor Configuration
For the multi-component hybrid and perceptron predictors,
we use the configurations of global and local history lengths
and table sizes reported for the corresponding hardware
budgets in the literature [5, 8]. Our gshare.fast predictor uses the maximum history length possible, i.e., the base-two logarithm of the number of PHT entries. We explore hardware budgets with power-of-two numbers of entries in the range of the hardware budgets chosen for the other two predictors.

Table 1. Parameters used for the simulations

  L1 I-cache      64 KB, 64-byte lines, direct mapped
  L1 D-cache      64 KB, 64-byte lines, direct mapped
  L2 cache        2 MB, 128-byte lines, 4-way set assoc.
  BTB             512 entry, 2-way set-assoc.
  Issue width     8
  Pipeline depth  20
4.1.5 Estimating Predictor Access Delay

Each predictor has two delay components: PHT access latency and computation time. For the multi-component and 2Bc-gskew predictors, we estimate the latency of the largest table component and optimistically add a single fan-out-of-four inverter delay for the computation cost, e.g., for 2Bc-gskew, the cost of computing the majority function and choosing between the bimodal table and the gskew predictor. In addition to a table access, the perceptron predictor has a substantial computation component, estimated to be at least two cycles at our clock rate [8]. We optimistically assume that, with clever custom logic design, the perceptron predictor will take only one cycle to compute a prediction once the table has been accessed.

We estimate pattern history table access times for 100 nm process technology using the CACTI 3.0 tool [16] for simulating cache delay. This modified version of CACTI is more accurate than the original in several ways. First, while the original CACTI 2.0 [12] uses a simplistic linear scaling for delay estimates, the modified simulator uses separate wire models to account for the physical layout of wire interconnects: thin local interconnect, taller and wider wires for longer distances, and the widest and tallest metal traces for global interconnect. Second, wire resistance is based on copper rather than aluminum material properties. Third, all capacitance values are derived from three-dimensional electric field equations. Finally, bit-lines are placed in the middle metal layer, where resistance is lower.

Table 2 shows the access delays we computed for the multi-component hybrid predictor, 2Bc-gskew, and the perceptron predictor at several hardware budgets, using the optimistic assumptions mentioned in Section 4.1.2 and a clock period of 8 fan-out-of-four inverter delays.

Table 2. Predictor access latencies

  Multi-Component          Perceptron               2Bc-gskew
  Budget   Delay (cycles)  Budget   Delay (cycles)  Budget   Delay (cycles)
  18KB     3               16KB     4               16KB     3
  30KB     3               32KB     4               32KB     3
  53KB     4               64KB     5               64KB     4
  98KB     5               128KB    7               128KB    5
  188KB    7               256KB    8               256KB    7
  368KB    9               512KB    11              512KB    9

4.2 Misprediction Rates

Figure 5 shows the misprediction rates of the four large predictors simulated over hardware budgets ranging between 16KB and 512KB. We report arithmetic mean misprediction rates showing a slight advantage in favor of the three complex predictors over gshare.fast. At a hardware budget of 64KB, the perceptron predictor achieves a 3.6% misprediction rate, compared with 4.4% for a 64KB gshare.fast predictor. Thus, the complex predictors are more accurate than gshare.fast. Figure 6 shows, for each benchmark, the misprediction rates of the complex predictors and of gshare.fast at comparable hardware budgets.

[Figure 5. Arithmetic mean misprediction rates for Fast Gshare (gshare.fast), 2Bc-gskew, the multi-component hybrid, and the perceptron predictor, at hardware budgets from 16KB to 512KB.]

[Figure 6. Misprediction rates for each SPECint 2000 benchmark at a 53KB budget.]
[Figure 7. Harmonic mean IPC for 1-cycle prediction (left) and overriding prediction (right), at hardware budgets from 16KB to 512KB.]

4.3 Ideal IPC
Figure 7 shows the harmonic mean IPCs produced by our
cycle-accurate simulations. On the left, we show a graph
with the ideal IPCs for each predictor, disregarding any
impact of branch predictor latency. At a hardware budget
of 53KB, the multi-component predictor achieves an IPC
of 1.80, which is 4% higher than the IPC for gshare.fast
at 64KB, which is 1.73. The 64KB perceptron predictor
achieves an IPC of 1.86, which is 7.5% higher than the
IPC for gshare.fast. The two complex predictors would
clearly allow better performance than our predictor given an
ideal implementation, but the magnitude of the advantage is
somewhat unimpressive.
[Figure 8. IPCs for each SPECint 2000 benchmark at a 53KB hardware budget.]

4.4 Realistic IPC

When branch predictor latency is taken into account, as in the graph on the right in Figure 7, the advantage of the two complex predictors vanishes. This graph shows the more realistic IPCs achieved when the two complex predictors are implemented as overriding predictors with a single-cycle, 2K-entry gshare initial predictor. The IPCs for gshare.fast do not change, since that predictor uses
its special properties to operate in a single cycle. Because
the overriding mechanism must sometimes be invoked, the
small performance advantage of the complex predictors
over gshare.fast turns into a slight loss. Figure 8 shows
the IPCs for each SPECint 2000 benchmark, as well as the
harmonic and arithmetic means, at a hardware budget of
53KB. The perceptron predictor produces a harmonic mean
IPC of 1.71, slightly lower than that of gshare.fast, which is
still 1.73. The multi-component predictor achieves an IPC
of 1.69, which is lower than that of gshare.fast. For some
benchmarks such as 181.mcf and 254.gap, the complex
predictors result in slightly higher IPCs than gshare.fast,
but for others such as 186.crafty and 300.twolf,
gshare.fast yields slightly higher IPCs. At larger hardware
budgets with higher access delays, the complex predictors
yield lower performance, despite their better accuracies, due
to the higher overriding penalty. However, the point of presenting these figures is not necessarily that our branch predictor is strictly superior to the other predictors, but that the
IPCs are about the same without the added overheads imposed by the complex predictors.
4.5 Explaining the Difference
For complex predictors there is a significant difference
between the ideal IPC assuming single-cycle prediction and
the realistic IPC when we take delay into account. The reason is that the overriding mechanism relies on a relatively
inaccurate first level branch predictor. Each time the first
and second level predictors disagree, a single cycle overriding penalty is incurred. The more accurate the second level
predictor is, the more often the cost of overriding will have
to be paid. perceptron predictor overrides the first level predictor an average of 7.38% of the time. On the benchmark
300.twolf with the multi-component predictor, the first
and second level predictors disagree 18.1% of the time.
5 Conclusions and Future Work
In future processor generations, the availability of large
hardware budgets will favor conceptually simple branch
predictors over complex predictors with high implementation overheads. This assertion is driven by two ideas.
First, the overhead of implementing a complex predictor
can eliminate the performance advantage it might have had
because of its improved accuracy. Second, complex predictor organizations affect other aspects of the microarchitecture, increasing overall design complexity. We have presented a simple predictor that solves these problems with
its low implementation overhead and isolation from the
rest of the microarchitecture. We are currently studying
ways to reorganize other predictors to take advantage of
the same ideas. We conclude that future research into complex branch predictors must consider the impact of the extra
complexity on performance.
6 Acknowledgments
I thank Calvin Lin and Samuel Z. Guyer for their helpful
comments on an earlier draft, and I thank Emery D. Berger
for inspiring the title of this paper.
References
[1] V. Agarwal, M.S. Hrishikesh, S. W. Keckler, and D. Burger.
Clock rate versus IPC: The end of the road for conventional microarchitectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 248–259, May 2000.
[2] Doug Burger and Todd M. Austin. The SimpleScalar tool
set version 2.0. Technical Report 1342, Computer Sciences
Department, University of Wisconsin, June 1997.
[3] Hans de Vries. AMD’s hammer microarchitecture preview.
Chip Architect, October 2001.
[4] Karel Driesen and Urs Hölzle. The cascaded predictor: Economical and adaptive branch target prediction. In Proceedings of the 31st International Symposium on Microarchitecture, December 1998.
[5] Marius Evers. Improving Branch Prediction by Understanding Branch Behavior. PhD thesis, University of Michigan,
Department of Computer Science and Engineering, 2000.
[6] M.S. Hrishikesh, Norman P. Jouppi, Keith I. Farkas, Doug
Burger, Stephen W. Keckler, and Premkishore Shivakumar.
The optimal useful logic depth per pipeline stage is approximately 6 fo4. In Proceedings of the 29th International Symposium on Computer Architecture, May 2002.
[7] Daniel A. Jiménez, Stephen W. Keckler, and Calvin Lin. The
impact of delay on the design of branch predictors. In Proceedings of the 33rd Annual International Symposium on Microarchitecture, pages 67–76, December 2000.
[8] Daniel A. Jiménez and Calvin Lin. Neural methods for dynamic branch prediction. ACM Transactions on Computer
Systems, 20(4), November 2002.
[9] Richard E. Kessler. The Alpha 21264 microprocessor. IEEE
Micro, 19(2):24–36, March/April 1999.
[10] Scott McFarling. Combining branch predictors. Technical
Report TN-36, Digital Western Research Laboratory, June
1993.
[11] P. Michaud, A. Seznec, and R. Uhlig. Trading conflict and
capacity aliasing in conditional branch predictors. In Proceedings of the 24th International Symposium on Computer
Architecture, pages 292–303, June 1997.
[12] Glenn Reinman and Norm Jouppi. Extensions to CACTI, 1999.
Unpublished document.
[13] Intel Corporation Press Release. Intel builds world's first one square micron SRAM cell. Intel Press Room, http://www.intel.com/pressroom/archive/releases/20020312tech.htm, March 2002.
[14] André Seznec, Stephen Felix, Venkata Krishnan, and Yiannakakis Sazeides. Design tradeoffs for the Alpha EV8 conditional branch predictor. In Proceedings of the 29th International Symposium on Computer Architecture, May 2002.
[15] André Seznec, Stéphan Jourdan, Pascal Sainrat, and Pierre
Michaud. Multiple-block ahead branch predictors. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating
Systems, pages 116–127, October 1996.
[16] Premkishore Shivakumar and Norman P. Jouppi. CACTI 3.0:
An integrated cache timing, power and area model. Technical Report 2001/2, Compaq Computer Corporation, August
2001.
[17] Kevin Skadron, M. Martonosi, and D.W. Clark. Speculative
updates of local and global branch history: A quantitative
analysis. Journal of Instruction-Level Parallelism, 2, January
2000.
[18] Li C. Tsai. A 1GHz PA-RISC processor. In Proceedings
of the 2001 International Solid State Circuits Conference
(ISSCC), February 2001.
[19] Fred Weber. AMD’s next generation microprocessor architecture. In Microprocessor Forum, October 2001.
[20] T.-Y. Yeh and Yale N. Patt. Two-level adaptive branch prediction. In Proceedings of the 24th ACM/IEEE Int’l Symposium on Microarchitecture, pages 51–61, November 1991.
[21] Tse-Yu Yeh, Deborah T. Marr, and Yale N. Patt. Increasing the instruction fetch rate via multiple branch prediction
and a branch address cache. In Proceedings of the 7th ACM
Conference on Supercomputing, pages 67–76, July 1993.