A Sampling Method Focusing on Practicality
Daniel Gracia-Perez, Hugues Berry, Olivier Temam
To cite this version:
Daniel Gracia-Perez, Hugues Berry, Olivier Temam. A Sampling Method Focusing on Practicality. IEEE Micro, Institute of Electrical and Electronics Engineers, 2006. <inria-00158808v2>
HAL Id: inria-00158808
https://hal.inria.fr/inria-00158808v2
Submitted on 1 Jul 2007
A Sampling Method Focusing on Practicality ∗
Daniel Gracia Pérez
[email protected]
CEA List, France
Hugues Berry, Olivier Temam
{hugues.berry,olivier.temam}@inria.fr
INRIA Futurs, France
Abstract. In the past few years, several research works have
demonstrated that sampling can drastically speed up architecture simulation, and several of these sampling techniques are
already largely used. However, for a sampling technique to be both easily and properly used, i.e., plugged into many simulators and used reliably with little or no effort or knowledge from the user, it must fulfill a number of conditions: it should require no hardware-dependent modification of the functional or timing simulator, and it should consider warm-up and sampling simultaneously, while still delivering high speed and accuracy.
The motivation for this article is that, with the advent of
generic and modular simulation frameworks like ASIM, SystemC, LSE, MicroLib or UniSim, there is a need for sampling
techniques with the aforementioned properties, i.e., which are
almost entirely transparent to the user and simulator agnostic.
In this article, we propose a sampling technique focused more
on transparency than on speed and accuracy, though the technique delivers almost state-of-the-art performance. Our sampling technique is a hardware-independent and integrated approach to warm-up and sampling; it requires no modification
of the functional simulator and solely relies on the performance
simulator for warm-up. We make the following contributions:
(1) a technique for splitting the execution trace into a potentially
very large number of variable-size regions to capture program
dynamic control flow, (2) a clustering method capable of efficiently coping with such a large number of regions, (3) a budget-based method for jointly considering warm-up and sampling
costs, presenting them as a single parameter to the user, and for
distributing the number of simulated instructions between warm-up and sampling based on the region partitioning and clustering
information.
Overall, the method achieves an accuracy/time tradeoff that is
close to the best reported results using clustering-based sampling
(though usually with perfect or hardware-dependent warm-up),
with an average CPI error of 1.68% and an average number of
simulated instructions of 288 million instructions over the Spec
benchmarks. The technique/tool can be readily applied to a wide
range of benchmarks, architectures and simulators, and will be
used as a sampling option of the UniSim modular simulation
framework.
1 Introduction and Related Work
Sampling is a delicate balance, or tradeoff, between simulation accuracy, overall simulation time, and practicality (scope of target architectures, user effort, transparency for the user). All these characteristics are important for a sampling technique to be efficient and useful for architecture researchers. SimPoint [20] can be credited with sparking a surge of interest in sampling because the technique is both efficient and easy to use. Considering the achievements of SimPoint and of later techniques/improvements [25, 12, 10, 24], do we need to push sampling research any further? Most sampling techniques have been applied to a specific simulator. However, with the advent of modular simulation frameworks such as ASIM [6], SystemC [21], the Liberty Simulation Environment (LSE) [23], MicroLib [19] or UniSim [22], there is a need for a sampling technique which is as independent as possible of the simulator and target architecture. In such frameworks, the simulation is driven by a generic engine which calls the different simulator modules, so any sampling technique would have to be plugged directly into the engine itself. But plugging a sampling technique into the engine would also have the benefit of automatically providing the sampling capability to almost any simulator written for that framework. However, since the engine can be used to simulate a large range of architectures, the sampling technique must be as architecture-independent as possible, so that the user can use this capability transparently. Note also that modular simulators are typically 10 to 20 times slower than monolithic simulators like SimpleScalar, so sampling is critical for them. After surveying existing sampling techniques, we concluded that, in spite of significant recent progress, they still do not achieve a satisfactory accuracy/time/practicality tradeoff for end-users. We feel it is important to explain why in detail, in order to better expose the rationale and contributions of this article. Let us first characterize what a good accuracy/time/practicality tradeoff should be.

∗ This article is a modified version of the article originally published in IEEE Micro. In the IEEE Micro version, we compared our clustering technique against the technique used in SimPoint 2.0; in this version, we compare against SimPoint 3.0, where the speed of clustering was largely improved.

We consider a good accuracy target to be of the order of one or a few percent, because designing an architecture mechanism is a trial-and-error process composed of many “micro decisions” (parameter value selection, choosing between two architecture options, etc.) based on simulation results which often correspond to small performance differences.

With respect to time, the functional simulation time largely dominates the overall simulation time (timing simulation of sampling intervals plus functional simulation between sampling intervals) because the sampling intervals only correspond to a few percent of the total number of instructions in the program trace, so improving sampling efficiency (reducing the total sample size) would bring little improvement in that context. However, recent studies, such as TurboSMARTS [24] and SimPoint with LVS (Load Value Sequence) and MHS (Memory Hierarchy State) [3], show that checkpointing can drastically reduce the overall simulation time of sampling techniques (from a few hours to a few minutes per benchmark) by getting rid of functional simulation. Considering the speedup factor brought by checkpointing, the advent of more complex processor architectures (more complex superscalar processors, multi-cores), and slower modular simulation, checkpointing will likely become a necessary complement to sampling techniques in the near future. Assuming checkpointing becomes a common feature, overall simulation time again becomes entirely determined by the number of simulated instructions. In other words, reducing the sample size by X% reduces the overall simulation time by X% as well. Since this article is solely concerned with the sampling technique (not functional simulation vs. checkpointing), and since the proposed sampling technique is perfectly compatible with a checkpointing technique, throughout the article "time" will denote the number of simulated instructions (as opposed to overall simulation time), and we will thus focus on reducing the total number of simulated instructions.

Finally, practicality may be the least measurable characteristic of the sampling tradeoff, but one should not forget that, in the end, scope and ease of use are just as important as efficiency for users to adopt a new methodology. Considering that architecture researchers have long been using the crude approach of randomly picking traces of arbitrary sizes, it is a safe bet to assume they will discard any technique that is not simple enough to use or which imposes too many restrictions. Among the three components of the sampling tradeoff (accuracy, time and practicality), we believe that the least attention has been paid to practicality. The rationale of this article is to achieve a good accuracy/time tradeoff without degrading practicality. We explain below why recent sampling techniques still have significant practicality limitations, especially when it comes to warm-up.

Why current sampling techniques are not yet satisfactory, from a practicality perspective. Arguably, the current best sampling techniques are SimPoint [20, 15, 16, 11, 10, 3], SMARTS/TurboSMARTS [25, 24], and EXPERT [12]. There are two possible approaches for selecting sampling intervals: either (1) pick a large number of uniformly (or randomly) selected small intervals, or (2) pick a few but large and carefully selected intervals.

SMARTS adopts the former approach and uses uniform sampling with a large number of very small intervals, achieving one of the best accuracy/time tradeoffs (0.64% CPI error / 50 million instructions on the Spec benchmarks and SimpleScalar [4]). However, due to the small size of the intervals (around 1,000 instructions), it must continuously warm up the main SRAM structures in the functional simulator (especially the caches, and possibly the branch predictors), an approach called functional warm-up. The main limitation of this approach is practicality, not efficiency: a range of cache mechanisms, for instance prefetching mechanisms, do not lend themselves well to this warm-up approach. Prefetching naturally affects cache behavior, so the prefetching mechanism should be added to the functional simulator as well, but it requires timing information, which the functional simulator does not have. Moreover, with this approach to warm-up, whenever an architecture optimization affects a large processor SRAM structure (cache, branch predictor, TLB, ...), it should ideally be implemented in both the functional and the timing simulator. This is fairly impractical, if not impossible, for some mechanisms, such as prefetching. It also adds a software engineering burden to the user. It must be noted, though, that the same authors recently proposed to embed the warm-up information in the simulator modules of their FLEXUS [7] infrastructure in order to reduce that burden. The same authors also proposed TurboSMARTS, which obviates functional warm-up by checkpointing micro-architectural state such as the contents of SRAM, DRAM and register structures. While this method drastically improves overall simulation time by getting rid of full-program functional simulation, it has similar practicality restrictions related to architecture dependence. The authors show how to partially relax these constraints so that the checkpoints can be reused when some architecture parameters vary, but they acknowledged that the method is difficult to adapt to
some structures, such as modern branch predictors. Still,
more recently, Barr et al. [1] have proposed to compress all the trace information required for warming branch predictors; this approach makes it possible to tackle a wider range of branch predictors, though it is still specific to that type of architectural component.
EXPERT [12] adopts the second sampling approach, i.e.,
wisely picking a few intervals of different sizes; the selection is based on characteristic program constructs, such
as loops or subroutines. This program-aware approach
to interval selection brings excellent accuracy (0.89%
of CPI error and 1 billion¹ simulated instructions on
average, again on Spec benchmarks and SimpleScalar).
However, because of the small interval granularity/size,
it has to resort to continuous warm-up in the functional
simulator, like SMARTS, with the same practicality limitations. EXPERT can achieve further gains in accuracy and time by pre-characterizing sampling intervals using simulations, but again, this approach raises significant
practicality and stability issues when the architecture
varies. Also, EXPERT uses checkpointing to drastically
reduce simulation time, like TurboSMARTS, but it is
again based on micro-architectural state (caches and
branch predictors), i.e., architecture-dependent warm-up.
The original SimPoint version [20] was the first step
toward wisely picking sampling intervals, based on
basic block frequency characteristics. However, having
few but large intervals achieved a poorer accuracy/time
tradeoff than SMARTS and EXPERT later did (CPI error
of 3.7% and 500 million instructions, again with the
same benchmarks and simulator). Even though most
SimPoint articles assume perfect warm-up (implemented
using functional warm-up) [20, 16, 11], the original
sampling interval sizes were large enough (100 million
instructions) that good accuracy could be achieved
without warm-up; and our own SimPoint experiments
confirm that warm-up has little impact with such large
traces. In the past few years, the SimPoint group has
experimented with more but smaller intervals (10 and
1 million instruction intervals and a maximum of 300
simulation points [16, 11]) in order to achieve a better
accuracy/time tradeoff, and eventually, they recently
proposed a variable-length interval technique [10], in the
spirit of EXPERT, but with a different interval selection
approach. While this latter technique exhibits again
very good accuracy (≃2.5% of CPI error over 9 Spec
benchmarks), the simulation time is still long (≃1 billion
instructions); it again assumes perfect warm-up before the sampled intervals, which makes it difficult to
evaluate the actual accuracy/time/practicality tradeoff.
¹ This number is an approximation derived from the article figures.
In this article, we focus on practicality, and take a
holistic approach by considering sampling (trace partitioning), clustering (sample selection), and warm-up
together. We make no compromise on practicality by
strictly relying on the timing simulator for warm-up,
as opposed to the functional simulator or checkpointing. In other words, it is not necessary to implement
caches, TLBs, or branch-prediction tables warm-up in
the functional simulator, or to have checkpoints include
micro-architectural state. We set an accuracy goal of the same order as SMARTS/TurboSMARTS, EXPERT or SimPoint VLI, i.e., of the order of one or a few percent, and we then try to minimize the total number of simulated instructions (warm-up plus measurement). Though our number of simulated instructions is higher than that of SMARTS/TurboSMARTS (50 million instructions), we have no practicality restrictions due to functional-simulator or checkpointing warm-up. Moreover, this approach
will enable the replacement of functional simulation with
micro-architecture-independent checkpointing for further
reducing overall simulation time, without the need to
resort to micro-architectural state checkpointing as in
TurboSMARTS.
We achieve these results through a method which combines several contributions. (1) First, our technique splits the trace into variable-size regions to capture the program's dynamic control flow [17]. The main benefit of variable-size regions is to decompose the program trace into a very large number of regions. Smaller regions are better for sampling accuracy, but they have two caveats: they are more sensitive to warm-up, and they increase the number of regions, straining clustering techniques. Our method avoids both caveats because some of the regions remain large and thus less sensitive to warm-up, and the presence of these large regions keeps the total number of regions reasonable. In some sense, it performs a kind of program-aware “pre-clustering”, a bit as if the trace had been split into very small regions and then later clustered into larger ones, but without incurring the high cost of clustering a huge number of very small regions. (2) Because the resulting number of regions is still high, we found that standard clustering techniques [14], as used for fixed-size 1-million-instruction intervals in SimPoint, were not appropriate (they become fairly time-consuming). As a result, we have developed a new clustering technique, called IDDCA [18]. Since then, a new version of SimPoint, SimPoint 3.0 [8], has been implemented, which significantly increases the performance of SimPoint; the improvements brought by SimPoint 3.0 are orthogonal to, and thus potentially complementary with, our approach. The second motivation
for this new clustering technique is to integrate clustering
with our budget approach to decide which intervals should receive the largest share of the budget. (3) Finally, we propose a budget-oriented method for distributing the simulation time between warm-up and performance measurement. The total budget is set by the user according to the maximum accepted simulation time. The budget allocated to a given cluster representative interval (the simulated interval for this cluster) is proportional to the importance of this cluster with respect to the whole trace. Within that cluster budget, the warm-up budget is inversely proportional to the size of the cluster representative interval.
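This budget split can be sketched as follows. This is an illustrative sketch, not the paper's exact allocation formula (which is derived in Section 4): each cluster's share is proportional to its total size, and whatever the representative interval does not consume within its share is left for warm-up, so smaller representatives get proportionally more warm-up. The function name `allocate_budget` and the keys `size` and `rep_len` are ours.

```python
def allocate_budget(B, clusters):
    """Illustrative split of a total instruction budget B.

    Each cluster's share is proportional to its total size S_i
    (its importance in the trace); within a share, the part not
    consumed by simulating the representative interval is left
    for warm-up, so smaller representatives get more warm-up.
    """
    total_size = sum(c["size"] for c in clusters)
    plan = []
    for c in clusters:
        share = B * c["size"] / total_size    # proportional to importance
        measure = min(c["rep_len"], share)    # simulate the representative
        warmup = share - measure              # remainder warms up state
        plan.append({"measure": measure, "warmup": warmup})
    return plan
```

For instance, with a budget of 100 and two clusters of sizes 300 and 100 whose representatives are 10 and 50 instructions long, the shares are 75 and 25; the first cluster spends 10 on measurement and 65 on warm-up, while the second spends its entire share on measurement.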
The method achieves an average CPI error of 1.68%
and 288 million total simulated instructions per benchmark, over the Spec benchmark suite.
Section 2 presents our method for partitioning the program trace into regions, and how it compares to existing variable-sized sampling intervals methods. Section 3
presents the IDDCA clustering algorithm. Section 4 combines our region partitioning and clustering techniques
with a budget-based approach for allocating simulated
instructions among warm-up and performance measurement intervals. Section 5 provides an experimental evaluation of our technique.
2 Program Partitioning Into Regions

We describe our approach for partitioning the trace into a very large number of regions. As mentioned above, we use a variable-size approach to obtain the benefits of both a large number of regions, for sampling accuracy, and several large regions, which are less sensitive to warm-up.

Recently, EXPERT and SimPoint VLI explored variable-size regions with positive results on accuracy. However, we decided not to retain the EXPERT or SimPoint VLI program partitioning methods, for the following reasons.

EXPERT partitioning. The principle of EXPERT is to partition the program based on subroutines, with a distinction between long, short and infrequently executed subroutines, then to characterize the performance variability of these subroutines using simulation, and then to select the number and location of subroutine representatives. Beyond the practicality issue mentioned in the introduction, related to continuous warm-up, this partitioning/characterization method has two flaws: the characterization is hardware-dependent and it heavily relies on loops. Hardware-dependent characterization means a code must be simulated on an architecture before it can be sampled on that architecture. If the target architecture changes, it is hard to anticipate the consequences for the characterization. Loop-based subroutine characterization may also be an issue for codes with complex control flow behavior (multiple if statements within a procedure, recursion, etc.), and not surprisingly EXPERT demonstrates better results on SpecFp than on SpecInt.

Consider the example of Figure 1, which corresponds to a sequence of basic blocks within the program static control flow graph; each lettered node denotes a basic block. Assuming BEF is a large loop called within an even larger loop ABDCE, EXPERT would most likely define an interval that encapsulates BEF only; multiple invocations of the BEF loop would breed multiple intervals.

SimPoint VLI partitioning. SimPoint VLI adopts a different approach, though one also based on loops and procedures. Each code structure is numbered and the trace is then viewed as a sequence of identifiers. The Sequitur [9, 13] hierarchical pattern matching algorithm is then used to identify repeating sequences of variable size within the trace. The main issue with this approach is that it relies on an exact match between two sequences within the trace. Programs with complex control flow, due to non-trivial if statement behavior, may exhibit many different sequences and/or may only enable exact matching for smaller sequences. Therefore, after this pattern matching phase, the SimPoint VLI technique applies several heuristics to relax this exact-match constraint in order to obtain longer sequences.

Consider again the example of Figure 1. Sequitur would partition the sequence into two main recurring parts, i.e., ABDCE and BEF. As mentioned above, SimPoint VLI adds several heuristics to improve the flexibility of this sequence partitioning.

Region-Based partitioning. Our program partitioning approach is based on the principle that programs can exhibit complex control flow behavior, even within phases. More precisely, the very principle of phases means that programs usually “stay” within a set of static basic blocks for a certain time, then move to another (possibly overlapping) set of basic blocks, and so on. This set of basic blocks can span small code sections such as loops or several subroutines. Moreover, the order and frequency with which these basic blocks are traversed may be very irregular (e.g., if statements with very irregular behavior, subroutines which are called infrequently within looping statements, etc.). We call such sets of basic blocks, where the program “stays” for a while, regions. These regions capture the program's stability while accommodating
Figure 1: Program trace partitioning algorithms.
its irregular behavior. We propose a simple method, composed of two rules, for characterizing these basic block
regions:
1. Whenever the reuse distance between two occurrences of the same basic block (expressed in number of basic blocks) is greater than a certain time T, the program is said to leave a region.

2. After the program has left a region, application of rule 1 is suspended during T basic blocks, in order to “learn” the new region.

Implicitly, the method progressively builds a pool of basic blocks: whenever a new basic block is accessed, it examines whether this basic block has been recently referenced (less than T ago); if so, it assumes the program is still traversing the same region of basic blocks; if not, it assumes the program is leaving this region. The second rule then gives the program time to build the new pool of basic blocks. Consider again the example of Figure 1. Because A, B, C, D, E, and F are assumed to be all referenced within a time interval smaller than T, they all belong to the same region, in spite of the sometimes irregular control flow behavior. G, H, and I mark the beginning of a new region because their reuse distance is greater than T.

Since T determines which reuse distances are captured by regions, a fixed value of T can potentially miss key reuses in certain programs or, conversely, insufficiently discriminate regions in other programs.² We use a benchmark-tolerant method to capture “enough but not too many” reuses. The method sets T for each benchmark such that a fixed percentage P of reuse distances are captured in regions, and we experimentally found P = 99.6% would capture the appropriate amount of reuse, and thus would result in appropriate values of T for all benchmarks. Table 1 shows T and the region statistics obtained with P = 99.6%.

SPEC       Number of Instructions   T         Num. Regions   Insn. per Region   Number of Clusters
ammp       326,548,908,728          45,000    183,558        1,778,996          49
applu      223,883,652,707          1,500     187,278        1,195,462          37
apsi       347,924,060,406          3,000     187,311        1,857,450          44
art        41,798,846,919           1,500     112,350        372,041            42
bzip2      108,878,091,744          25,000    170,903        637,075            318
crafty     191,882,991,994          100,000   199,499        961,824            527
eon        80,614,082,807           20,000    194,912        413,592            92
equake     131,518,587,184          2,000     196,991        667,637            17
facerec    211,026,682,877          35,000    196,206        1,075,536          22
fma3d      268,369,311,687          15,000    184,667        1,453,260          73
galgel     409,366,708,209          70,000    111,399        3,674,779          140
gap        269,035,811,516          90,000    192,658        1,396,442          92
gcc        46,917,702,075           20,000    95,529         4,911,357          323
gzip       84,367,396,275           30,000    170,966        493,475            167
lucas      142,398,812,356          100       187,849        758,049            56
mcf        61,867,398,195           25,000    178,469        346,653            54
mesa       281,694,701,214          80,000    187,916        1,499,046          16
mgrid      419,156,005,842          2,500     54,440         7,699,412          32
parser     546,749,947,007          300,000   177,738        3,076,157          507
perlbmk    39,933,232,781           100,000   41,866         953,834            129
sixtrack   470,948,977,898          9,500     183,823        2,561,970          46
swim       225,830,956,489          400       75,740         2,981,660          54
twolf      346,485,090,250          200,000   161,142        2,150,184          28
vortex     118,972,497,867          80,000    190,722        623,806            31
vpr        84,068,782,425           8,500     193,173        435,199            155
wupwise    349,623,848,084          200,000   13,696         25,527,442         16
Average    231,987,140,463          61,130    151,915        2,712,371          118

Table 1: Region statistics and T.
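The two partitioning rules and the percentile-based choice of T can be sketched as follows. This is a minimal sketch; `trace` is assumed to be a list of basic-block identifiers, and the function names are ours.

```python
def choose_T(reuse_distances, P=0.996):
    """Set T so that a fraction P of observed reuse distances
    fall within regions (the paper uses P = 99.6%)."""
    s = sorted(reuse_distances)
    return s[min(len(s) - 1, int(P * len(s)))]

def partition_regions(trace, T):
    """Split a trace of basic-block IDs into (start, end) regions.

    Rule 1: a reuse distance greater than T ends the current region.
    Rule 2: rule 1 is suspended during T basic blocks after a region
    ends, to "learn" the new region.
    """
    last_seen = {}            # basic block -> index of last occurrence
    regions = []
    region_start = 0
    learning_until = T        # rule 2 also covers the initial region
    for i, bb in enumerate(trace):
        reuse = i - last_seen[bb] if bb in last_seen else None
        last_seen[bb] = i
        if i < learning_until:
            continue          # rule 2: still learning the new region
        if reuse is None or reuse > T:   # rule 1: program left a region
            regions.append((region_start, i))
            region_start = i
            learning_until = i + T
    regions.append((region_start, len(trace)))
    return regions
```

For instance, a trace that loops over basic blocks {0, 1, 2} and then switches to {7, 8, 9} yields two regions with T = 5, since the switch makes every reuse distance exceed T.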
For some programs, the average region size is of the order of a few hundred thousand instructions, with some regions as small as a few thousand instructions in crafty. Thanks to a mix of large and small regions in each program, the total number of regions is not excessively high (several tens of thousands to a few hundred thousand). But it is large enough that the k-means [14] clustering method used in SimPoint would take an excessively long time (of the order of one day per benchmark). The IDDCA clustering method presented in the next section can reduce clustering time by more than two orders of magnitude.

² Note, however, that we did observe very good average accuracy/time trade-offs for the same T value applied across all benchmarks.
3 Clustering a Large Number of Regions Using IDDCA

In this section, we present our technique for clustering a large number of regions. Below, we use the term “region” and the more classic term “interval” interchangeably. The distance between two intervals is defined as the distance between their two Basic-Block Vectors, as proposed in SimPoint [20].

The popular k-means clustering technique has three main shortcomings: (1) the method works by randomly selecting intervals in the start-up phase, so that several runs of the method on the same trace may not provide the same sampling intervals, and thus the same accuracy results; (2) the number of clusters is a user-selected parameter, but it is sensitive to benchmarks/traces, so that inappropriately setting this parameter can degrade either simulation time or accuracy; (3) the method requires multiple passes, which may be impractical for a large number of intervals.

We ran the default SimPoint 2.0 clustering script (runsimpoint; default parameters except for max k = 100) on our region partitioning in order to evaluate its execution time. k-means requires 21 hours per benchmark on average, on an Athlon XP 2800+ (and up to two days for crafty), against 9 minutes on average for IDDCA (and 44 minutes for crafty), see Figure 2. SimPoint version 3.0 [8] has significantly improved clustering time, essentially by reducing the number of intervals in the trace to which clustering is applied, see Figure 3. With this new approach, SimPoint 3.0 is able to perform clustering 10 to 50 times faster than SimPoint 2.0 while keeping the same accuracy results. Note that the approach used in SimPoint 3.0 is orthogonal to the approach used in IDDCA, so we expect IDDCA to benefit as well from interval reduction, but we have not yet evaluated the combined technique.

IDDCA algorithm. Our clustering method is called IDDCA (Interleaved Double DCA); it is derived from the Dynamical Clustering Analysis (DCA) [2] clustering method and adapted to sampling. IDDCA is an online algorithm, clustering regions one at a time and constantly refining the number of clusters and their centroids. This dynamic process relies upon three parameters: Θnew, Θmerge and Θstep_factor.

Intuitively, Θnew, Θmerge and Θstep_factor control the creation and merging of clusters. Θnew and Θmerge are threshold distance parameters for determining, respectively, when a point is far enough from all clusters to induce the creation of a new cluster, or close enough to an existing cluster to be merged into it. Θstep_factor determines the rate at which these threshold distances change. Θnew and Θmerge are initialized using a simple heuristic: 10% of the distance between the global centroid (the centroid of all regions) and the farthest region. Θstep_factor is related to the number of data points, but the clustering method is robust enough to tolerate the same Θstep_factor value across all Spec benchmarks, empirically set to Θstep_factor = 10⁻⁵.

IDDCA starts with two elements:

• An empty cluster list;

• The list of regions to cluster (called R). The regions are regularly interleaved in this list, because this makes the online clustering method less sensitive to the original program trace order. Assuming there are N regions in the trace and the interleaving factor is I, the list is the following:

1, N/I + 1, 2N/I + 1, . . . , [(I − 1)×N/I] + 1,
2, N/I + 2, 2N/I + 2, . . . , [(I − 1)×N/I] + 2,
. . .

The method is fairly insensitive to interleaving for I ≥ 2, and we selected I = 10 for all benchmarks. Note that randomly picking regions would have performed similarly well or better, and was not used simply due to implementation constraints.

The IDDCA algorithm is then the following:

1. Pick a region r from the list of regions R; if there is no cluster yet, create a first cluster containing region r and go to step 5.

2. Find the cluster ci with the closest centroid to the current region r and compute the distance d between r and the centroid of ci.

3. If d is greater than Θnew, then create a new cluster containing the current region r.

4. If d is less than or equal to Θnew, then:

• Add r to cluster ci and update the ci centroid accordingly.
Figure 2: Clustering time of IDDCA vs. k-means (logarithmic scale).
Figure 3: Clustering time of IDDCA vs. Estimated k-means in SimPoint 3.0.
• Find the cluster (cj ) with the closest centroid
to that of ci . If the distance between the centroids of ci and cj is less or equal to Θmerge
then merge the two clusters into a unique one
and compute its centroid.
otherwise go to step 1.
At the end of this process IDDCA has created a set of
clusters. If one of the clusters contains more than 90% of
the regions, then IDDCA is hierarchically applied again
within this cluster; until clustering is spread enough (no
cluster accounts for 90% or more of the regions). Finally,
the instructions which must be simulated (sampled) are
the region individuals which are the closest regions to the
clusters centroids, one per cluster.
Weighted vs. Unweighted IDDCA. Because large regions represent a greater share of the global execution
trace than small regions, regions should normally be
weighed with their size when computing the centroid.
SimPoint VLI is weighing intervals with their size when
• Update Θnew and Θmerge thresholds so as
to make cluster creation and merger more
diffi cult. For that purpose, increase Θnew
and decrease Θmerge as follows: Θnew =
Θnew / (1 − Θstep f actor ) and Θmerge =
Θmerge × (1 − Θstep f actor ).
5. Remove r from the list of regions R. If there are
no more regions in R, then the process terminates,
7
Figure 4: Simulated instructions with IDDCA and weighted IDDCA (IDDCAW).
Figure 5: Simulation accuracy with IDDCA and weighted IDDCA (IDDCAW).
applying k-means. We decided not to weigh regions with
their size in order to privilege a reduction of sampling size
over accuracy. This choice was driven by initial sampling results which suggested an effort was required on
size rather than accuracy.
Still, in order to investigate the effect of not weighing
the clusters, we have run IDDCA clustering with weighted
clusters. As expected, the total sampling size increases,
and rather signifi cantly (34%), see Figure 4. More surprisingly, the accuracy is lower with weighted clusters than
with unweighted clusters: 2.00% CPI error for weighted
clusters versus 1.62% for unweighted clusters, see Figure 5. The two observations combined empirically validate the unweighted approach.
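For concreteness, the online clustering loop described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the region signatures, the Euclidean distance metric, the threshold values, and the function names are assumptions, and centroids are unweighted means, matching the unweighted variant retained above.

```python
import math

# Placeholder value: the article tunes this parameter elsewhere.
THETA_STEP_FACTOR = 0.01  # growth/decay rate applied after each insertion

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    # Unweighted mean of the member signatures (the variant kept in the text).
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def iddca(regions, theta_new=0.5, theta_merge=0.25):
    """One-pass (online) clustering of region signature vectors."""
    clusters = []  # each cluster: {"points": [...], "centroid": (...)}
    for r in regions:
        if not clusters:
            clusters.append({"points": [r], "centroid": r})
            continue
        # Find the cluster ci whose centroid is closest to region r.
        ci = min(clusters, key=lambda c: distance(r, c["centroid"]))
        if distance(r, ci["centroid"]) <= theta_new:
            # Step 4: add r to ci and update its (unweighted) centroid.
            ci["points"].append(r)
            ci["centroid"] = centroid(ci["points"])
            # Merge ci with its closest neighbour if the centroids are close.
            others = [c for c in clusters if c is not ci]
            if others:
                cj = min(others,
                         key=lambda c: distance(ci["centroid"], c["centroid"]))
                if distance(ci["centroid"], cj["centroid"]) <= theta_merge:
                    ci["points"].extend(cj["points"])
                    ci["centroid"] = centroid(ci["points"])
                    clusters = [c for c in clusters if c is not cj]
            # Make cluster creation and merging progressively harder.
            theta_new /= (1 - THETA_STEP_FACTOR)
            theta_merge *= (1 - THETA_STEP_FACTOR)
        else:
            # r is too far from every centroid: it starts a new cluster.
            clusters.append({"points": [r], "centroid": r})
    return clusters
```

One representative per final cluster (the region closest to each centroid) is then what actually gets simulated; the 90% hierarchical re-clustering rule is omitted for brevity.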
The reason why unweighted clustering performs so well is related both to the size distribution of intervals within a cluster and to warming. Within the same cluster, interval size can vary significantly, even among intervals at approximately the same distance from the centroid. As a result, it is sometimes possible to drastically reduce the representative interval size without significantly affecting accuracy. Moreover, the results of Figure 4 already incorporate warming, since we want to evaluate the best clustering strategy for our global method. Since the latter method is based on a fixed budget and distributes the budget between sampling and warming, any simulated instruction budget not consumed by sampling can be spent on warming. As a result, the overall accuracy of the method improves when smaller sampling interval representatives are selected. This latter property illustrates the benefits of properly integrating the different components of a sampling strategy.
4 Budget-Oriented Integration of Warm-Up and Sampling

Warm-up is implemented using the main simulator (as opposed to the functional simulator or checkpointing), so warm-up and performance measurement share the same simulation budget. Because simulation is costly, we must take great care to allocate the simulated instructions wisely. And due to both variable-size regions and warm-up implemented through simulation, we must determine the budget allocated to each cluster (one region representative is simulated per cluster). The general philosophy is: spend the budget where it is most needed (and, in the process, try to minimize the total needed budget).

In that spirit, we make two simple observations. (1) The weight of each cluster should be factored in when allocating its (warm-up and measurement) instruction budget; the cluster weight is defined as the total number of trace (dynamic) instructions in the cluster, which is itself the sum of the lengths of all intervals in the cluster. And (2) the length of each cluster representative region should be factored in when determining the warm-up size for this region.

Let us go back to observation (1). The goal of clustering methods, such as IDDCA or k-means, is to find a representative for each cluster of regions. Not all clusters contain the same total number of instructions; for instance, the cluster sizes range from 57,193 instructions to 430 billion instructions in sixtrack. Naturally, when extrapolating performance statistics collected for each cluster representative to the whole program trace, the relative weight of each cluster is factored in. Unlike for the clustering method, this weighting has an impact only on accuracy, not size. Weighting means that the performance measurements of some of the representatives will have a greater impact on the total estimated performance than others. So, we should allocate a greater share of the simulated instruction budget to representatives of large clusters in order to more accurately estimate their performance.

The number of simulated instructions allocated per region consists of the region size plus the additional instructions simulated for warm-up purposes. This brings observation (2). If a cluster representative region is large (the representative itself, not necessarily the cluster), then it will need fewer warm-up instructions, as the start-up effect will be diluted in the simulation of the cluster representative region. Conversely, small representative regions need significant warm-up, which is a key reason why SMARTS and EXPERT use continuous functional simulator/checkpointing-based warm-up.

Determining measurement and warm-up sizes. Let us call B the total instruction budget, i.e., the maximum number of simulated instructions (including warm-up). Let us number clusters i, with 1 ≤ i ≤ k, where k is the total number of clusters, and let us call Si the total size (in number of instructions) of cluster i; the clusters are ordered by decreasing size, i.e., Si > Sj if i < j. fi is the weight factor of cluster i over the whole program trace size (fi = Si / Σr=1..k Sr), and si is the size of the representative region of cluster i.
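As a concrete illustration of observation (1), the weighted extrapolation from per-representative measurements to a whole-program estimate can be sketched as follows. This is a minimal sketch; the function name and the CPI values are illustrative, not taken from the article.

```python
def extrapolate_cpi(cluster_sizes, representative_cpis):
    """Whole-trace CPI estimate: each representative's measured CPI is
    weighted by f_i, its cluster's share of the dynamic instruction count."""
    total = sum(cluster_sizes)                    # sum over r=1..k of S_r
    weights = [s / total for s in cluster_sizes]  # f_i = S_i / sum_r S_r
    return sum(f * cpi for f, cpi in zip(weights, representative_cpis))

# Three clusters covering 60%, 30% and 10% of the trace (made-up numbers):
estimate = extrapolate_cpi([600, 300, 100], [1.0, 2.0, 4.0])
# 0.6*1.0 + 0.3*2.0 + 0.1*4.0 = 1.6
```

A representative from a large cluster thus contributes more to the estimate, which is why large clusters deserve a larger share of the measurement budget.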
Because of observation (1), we distribute the budget for each cluster based on the global weight fi of the cluster. For that purpose, we define Bi as the maximum simulation budget for cluster i (measurement and warm-up):

B1 = B × f1 and Bi = (B − Σj=1..i−1 Bj) × fi / Σl=i..k fl, ∀ i > 1,

which can be simplified to Bi = B × fi if all the clusters are considered, i.e., Σi=1..k fi = 1. The actual number of simulated instructions for cluster i is ri + wi, where ri is the measurement size (it is a subset of the representative of cluster i), and wi is the warm-up size.
Since the measurement size ri must be smaller than the budget Bi, i.e., ri = min(si, Bi), we sometimes need to truncate the simulation of the cluster representative. This rarely degrades accuracy, thanks to the looping behavior which is at the core of our region-partitioning scheme.
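Combining the per-cluster budget Bi with this truncation rule gives a compact allocation loop. The sketch below assumes the simplified form Bi = B × fi (all clusters considered) and is illustrative only; the function name is not from the article.

```python
def allocate_budget(B, cluster_sizes, rep_sizes):
    """For each cluster i: budget B_i = B * f_i, measurement size
    r_i = min(s_i, B_i) (the representative is truncated if needed),
    and the remainder w_i = B_i - r_i is left for warm-up."""
    total = sum(cluster_sizes)
    allocation = []
    for S_i, s_i in zip(cluster_sizes, rep_sizes):
        B_i = B * (S_i / total)   # cluster's share of the total budget
        r_i = min(s_i, B_i)       # truncated measurement size
        w_i = B_i - r_i           # leftover budget available for warm-up
        allocation.append((B_i, r_i, w_i))
    return allocation
```

Note how a small representative (s_i well below B_i) automatically leaves a large w_i, matching observation (2): small regions are precisely those needing the most warm-up.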
Because of observation (2), we preferentially allocate warm-up instructions to small samples, within the constraint of budget Bi, i.e., wi = Bi − ri. The warm-up instructions wi are instructions preceding the representative of cluster i. Now, due to our region-based partitioning approach, these instructions may reference code sections and data structures which are distinct from the ones referenced in the representative. To avoid simulating useless warm-up instructions, we use the BLRL [5] (Boundary Line Reuse Latency) technique for determining the size of the useful warm-up interval. BLRL is an architecture-independent method which consists of collecting the memory addresses and branch instruction addresses used in the sampled interval, and identifying the earliest point in the trace before the interval where they will all be accessed. By starting the warm-up at that point, most SRAM structures are likely to be adequately warmed up (e.g., the first access to an address will be correctly identified as a hit or a miss) independently of the SRAM structure sizes. However, under that constraint, the actual warm-up interval per sampled region can be very large (e.g., parser requires more than 2 billion warm-up instructions for a region of only 1.8 million instructions). For that reason, the authors propose to set a percentage threshold of the sampled interval addresses covered in the warm-up interval, thereby relaxing
the constraint and significantly reducing the warm-up interval size (we used a threshold of 95% across all benchmarks). Still, because our budget approach introduces a size constraint on the warm-up interval, we slightly modify BLRL by limiting the warm-up size to wi.

5 Evaluation

For evaluation purposes, we used the SimpleScalar [4] 3.0b toolset for the Alpha ISA and experimented with all 26 SPEC CPU2000 benchmarks. To create the regions we used the sim-fast functional simulator. Table 2 shows the microarchitecture configuration used for our experiments.

Table 2: Baseline simulation model.

Instruction Cache: 16K 4-way set-associative, 32 byte blocks, 1 cycle latency
Data Cache: 16K 4-way set-associative, 32 byte blocks, 1 cycle latency
L2 Cache: 128K 8-way set-associative, 64 byte blocks, 12 cycle latency
Main Memory: 120 cycle latency
Branch Predictors: hybrid, 8-bit gshare w/ 2k 2-bit predictors + an 8k bimodal predictor
O-O-O Issue: out-of-order issue of up to 8 operations per cycle, 64-entry re-order buffer
Memory Disambiguation: load/store queue, loads may execute when all prior store addresses are known
Registers: 32 integer, 32 floating point
Functional Units: 2 integer ALUs, 2 load/store units, 1 FP adder, 1 integer MULT/DIV, 1 FP MULT/DIV
Virtual Memory: 8K byte pages, 30 cycle fixed TLB miss latency after earlier-issued instructions complete

Figures 6 and 7 respectively show the number of instructions and the accuracy for our budget approach (setting the budget to B = 500M instructions), and for different configurations of SimPoint (the maximum number of samples is set to 50 for 10M intervals, and to 100 for 1M intervals, so as to provide a fair accuracy/size comparison). We use perfect warm-up for SimPoint as in most of the articles [20, 16, 11, 10] (recall SimPoint treats sampling as an issue independent from warm-up); we sometimes use no warm-up for comparison purposes. As mentioned in the introduction, while the accuracy of SimPoint 10M is barely sensitive to warm-up, SimPoint 1M becomes fairly sensitive (from 0.7% down to 2.4%), and the trend can only worsen as the sample size decreases. Therefore, while SimPoint 1M requires few instructions compared to SimPoint 10M or our budget approach with warm-up, it would actually require additional instructions for warm-up in order to preserve its accuracy. Our budget approach has lower accuracy but requires fewer instructions than SimPoint 10M. More importantly, our contribution does not lie so much in this instruction budget reduction as in the fact that the user need not worry about setting the appropriate sample and warm-up sizes for a new given program; it is all integrated in the partitioning and budgeting approach. All the user needs to set is the maximum simulation budget (i.e., time).

Still, for several cases, especially eon and vpr, our budget approach performs significantly worse than SimPoint. Some programs have very small but frequently recurring regions, which translates into clusters with many small intervals. And performance is more variable across small intervals, i.e., it is harder to reach steady-state performance after just a few hundred thousand instructions. This is in part due to the higher influence of start-up state on performance for such small intervals. This variability, in turn, can result in significant performance estimation error. As shown in Table 1, eon and vpr have particularly small regions on average, around 400,000 instructions. While small regions are not necessarily synonymous with performance variability and higher error, they are a potentially aggravating factor. Still, these two codes highlight the necessity to fine-tune our heuristic more than a shortcoming, since both codes use only around 20 million instructions for sampling, and 80 million overall, i.e., a small fraction of the total available budget of 500 million instructions.

A possible extension of our method could be to simulate multiple intervals within clusters where the most representative intervals are small, e.g., 10% of the cluster budget. Not only would it average out the performance variability within such clusters, but it would also provide a means for estimating the error within such clusters. The latter would provide a significant enhancement to our technique, because one of the shortcomings of clustering-based techniques compared to statistical simulation techniques is that they cannot easily provide a confidence estimate of the error [16].

Other codes like gap and perlbmk also behave worse than SimPoint 10M. However, it must be noted that SimPoint 1M without warm-up behaves significantly worse in both cases as well. In fact, these codes illustrate the difficulty of properly selecting both regions and warm-up size. They show that systematic techniques like SimPoint may perform well or poorly depending on how the user selects this interval size, unless the user is willing to engage in a trial-and-error process for selecting the size. Our budget approach may not be optimal, but it does attempt to shield the user from such decisions by selecting region sizes automatically.
Figure 6: Number of simulated instructions with different sampling techniques.
Figure 7: CPI error.
We also evaluated our approach without any budget limitation and with no budget spent on warm-up, see Budget Unlimited, Perfect warm-up (for the Perfect warm-up bars, none of the budget is allocated to warm-up). We can see that wisely allocating the budget allows drastic reductions of the number of simulated instructions in several cases, with limited impact on accuracy (from 1.62% to 1.68%). In some cases, the unlimited budget requires fewer instructions than the standard budget approach, because the latter includes warm-up. Overall, our allocation strategies result in a total budget which is significantly lower than the maximum accepted budget, set at B = 500 million instructions in these experiments.

Figure 8 displays, for each benchmark, how our approach actually distributes its instruction budget between measurement and warm-up. Obviously, the number of simulated instructions devoted to measurement is rather low (only 84 million instructions on average). This value is close to the number of instructions simulated by SimPoint with 1-million-instruction samples (71 million instructions). Warm-up uses most of the instruction budget, with an average of 70% of the total number of simulated instructions. This observation suggests that warm-up and sampling should not be considered separately, especially if the goal is to develop an architecture-independent sampling method by implementing warm-up through simulation.
Figure 8: Distribution of the number of measurement and warmed-up instructions with the budget approach.

6 Conclusions

The rationale for our sampling approach is that some of the most recent and efficient sampling techniques have practicality shortcomings due to their warm-up approach (in the functional simulator or by checkpointing microarchitectural state, or simply using perfect warm-up on the principle of separating sampling and warm-up issues). These shortcomings can make it difficult to explore specific and/or a large range of architectural organizations. Nor are they compatible with current and upcoming modular simulation frameworks, where the sampling technique will be implemented in the common simulation engine; thus, it will have to be architecture-independent and transparent to the user. This technique will be one of the sampling options implemented in the UniSim framework under development.

Our sampling+warm-up approach focuses on transparency and architecture independence, and still achieves an accuracy/time tradeoff that is of the same order of magnitude as the best sampling techniques. There are three key features in our approach: a novel trace partitioning into variable-size regions which provides a useful compromise between many small intervals and few large intervals, a clustering technique capable of harnessing a large number of intervals, and a budget-oriented distribution of simulated instructions between warm-up and measurement.

References

[1] Kenneth C. Barr and Krste Asanovic. Branch Trace Compression for Snapshot-Based Simulation. In International Symposium on Performance Analysis of Systems and Software, February 2006.

[2] A. Baune, F. T. Sommer, M. Erb, D. Wildgruber, B. Kardatzki, G. Palm, and W. Grodd. Dynamical Cluster Analysis of Cortical fMRI Activation. NeuroImage, 6(5):477–489, May 1999.

[3] Michael Van Biesbrouck, Lieven Eeckhout, and Brad Calder. Efficient Sampling Startup for Sampled Processor Simulation. In International Conference on High Performance Embedded Architectures & Compilers, 2005.

[4] Doug Burger, Todd M. Austin, and Steve Bennett. Evaluating Future Microprocessors: The SimpleScalar Tool Set. Technical Report CS-TR-1996-1308, University of Wisconsin-Madison, Madison, 1996.

[5] L. Eeckhout, S. Eyerman, B. Callens, and K. De Bosschere. Accurately Warmed-up Trace Samples for the Evaluation of Cache Memories. In I. Banicescu, editor, Proceedings of the High Performance Computing Symposium (HPC 2003), pages 267–274, Orlando, FL, USA, April 2003. SCS.

[6] Joel S. Emer, Pritpal Ahuja, Eric Borch, Artur Klauser, Chi-Keung Luk, Srilatha Manne, Shubhendu S. Mukherjee, Harish Patil, Steven Wallace, Nathan L. Binkert, Roger Espasa, and Toni Juan. Asim: A Performance Model Framework. IEEE Computer, 35(2):68–76, 2002.

[7] Flexus. http://www.ece.cmu.edu/~simflex/flexus.html.

[8] Greg Hamerly, Erez Perelman, Jeremy Lau, and Brad Calder. SimPoint 3.0: Faster and More Flexible Program Analysis. In MoBS '05: Workshop on Modeling, Benchmarking and Simulation, June 2005.

[9] James R. Larus. Whole Program Paths. In PLDI '99: Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pages 259–269. ACM Press, 1999.

[10] Jeremy Lau, Erez Perelman, Greg Hamerly, Timothy Sherwood, and Brad Calder. Motivation for Variable Length Intervals and Hierarchical Phase Behavior. In ISPASS '05: IEEE International Symposium on Performance Analysis of Systems and Software, 2005.

[11] Jeremy Lau, Stefan Schoenmackers, and Brad Calder. Structures for Phase Classification. In ISPASS '04: IEEE International Symposium on Performance Analysis of Systems and Software, 2004.

[12] Wei Liu and Michael C. Huang. EXPERT: Expedited Simulation Exploiting Program Behavior Repetition. In ICS '04: Proceedings of the 18th Annual International Conference on Supercomputing, pages 126–135. ACM Press, 2004.

[13] C. G. Nevill-Manning and I. H. Witten. Compression and Explanation Using Hierarchical Grammars. The Computer Journal, 40(2/3):103–116, 1997.

[14] Dan Pelleg and Andrew W. Moore. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In ICML '00: Proceedings of the Seventeenth International Conference on Machine Learning, pages 727–734. Morgan Kaufmann Publishers Inc., 2000.

[15] Erez Perelman, Greg Hamerly, Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder. Using SimPoint for Accurate and Efficient Simulation. SIGMETRICS Perform. Eval. Rev., 31(1):318–319, 2003.

[16] Erez Perelman, Greg Hamerly, and Brad Calder. Picking Statistically Valid and Early Simulation Points. In PACT '03: Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, page 244. IEEE Computer Society, 2003.

[17] Daniel Gracia Pérez, Hugues Berry, and Olivier Temam. Budgeted Region Sampling (BeeRS): Do Not Separate Sampling From Warm-Up, And Then Spend Wisely Your Simulation Budget (6-page abstract). In ISSPIT '05: IEEE International Symposium on Signal Processing and Information Technology, December 2005.

[18] Daniel Gracia Pérez, Hugues Berry, and Olivier Temam. IDDCA: A New Clustering Approach For Sampling. In MoBS '05: Workshop on Modeling, Benchmarking and Simulation, 2005.

[19] Daniel Gracia Pérez, Gilles Mouchard, and Olivier Temam. MicroLib: A Case for the Quantitative Comparison of Micro-Architecture Mechanisms. In MICRO-37: Proceedings of the 37th International Symposium on Microarchitecture, pages 43–54. IEEE Computer Society, 2004.

[20] Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. Automatically Characterizing Large Scale Program Behavior. SIGOPS Oper. Syst. Rev., 36(5):45–57, 2002.

[21] SystemC v2.0.1 Language Reference Manual, http://www.systemc.org/, 2003.

[22] UNISIM: UNIted SIMulation environment. http://unisim.org.

[23] Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason A. Blome, and David I. August. Microarchitectural Exploration with Liberty. In Proceedings of the 34th Annual International Symposium on Microarchitecture, Austin, Texas, USA, December 2001.

[24] Thomas F. Wenisch, Roland E. Wunderlich, Babak Falsafi, and James C. Hoe. TurboSMARTS: Accurate Microarchitecture Simulation Sampling in Minutes. In SIGMETRICS '05, June 2005.

[25] Roland E. Wunderlich, Thomas F. Wenisch, Babak Falsafi, and James C. Hoe. SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling. In ISCA '03: Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 84–97. ACM Press, 2003.