CS-16 1
Performance Characterization of SPEC CPU2006 Benchmarks on Intel Core 2 Duo Processor
Tribuvan Kumar Prakash and Lu Peng, Member, IEEE Computer Society

Abstract: As processor architectures evolve, it is very important to develop appropriate benchmarks to measure their performance. It is equally important to design compilers that can make optimal use of the new features of these processors. This requires complete insight into the performance characteristics of the benchmarks and the impact of compilers on those characteristics. In this paper, we first report a performance characterization of the SPEC CPU2006 suite on the Intel Core 2 Duo processor, which represents an emerging popular computing platform. Second, we compare the effects of two widely used C++ compilers: Intel C++ and Microsoft VC++. Performance characteristics including instructions per cycle (IPC), run time, cache miss rate and branch miss rate are measured and reported. Our results show that the Intel compiler performs better than the Microsoft VC++ compiler for a majority of SPEC CPU2006 C/C++ programs running on the Intel Core 2 Duo processor.

Index Terms: SPEC CPU2006, Intel Core 2 Duo, Intel C++ Compiler, Microsoft VC++ Compiler.

Area of Interests: 5.5 Computer Architecture

Manuscript received Oct. 12, 2007. This work is supported in part by the Louisiana Board of Regents grants NSF (2006)-Pfund-80 and LEQSF (2006-09)-RD-A-10, the Louisiana State University and an ORAU Ralph E. Powe Junior Faculty Enhancement Award.
Tribuvan Kumar Prakash is with Realization Technologies, Inc., San Jose, CA 95113 USA (email: [email protected]).
Lu Peng is with the Electrical and Computer Engineering Department, Louisiana State University, Baton Rouge, LA 70803 USA (phone: 1-225-578-5535, fax: 1-225-578-5200, email: [email protected]).

I. INTRODUCTION
WITH the evolution of processor architecture over time, benchmarks that were used to measure the performance of these processors are not as useful today as they were before, because they cannot stress the new architectures to their maximum capacity in terms of clock cycles, cache, main memory and I/O bandwidth. Hence new and improved benchmarks need to be developed and used. SPEC CPU2006 [9] is one such benchmark: it has intensive workloads based on real applications and is the successor of the SPEC CPU2000 benchmark [9]. In addition, the need for compilers that keep up with these advanced architectures and maximize performance has prompted researchers to study the impact of compilers on performance characteristics.

This paper presents a detailed analysis of the SPEC CPU2006 benchmark running on the Intel Core 2 Duo processor [4] and emphasizes its workload characteristics and memory system behavior. We compare the CPU2006 and CPU2000 benchmarks with respect to performance bottlenecks by using the Intel VTune performance analyzer [5] for the entire program execution. We also compare various performance aspects of two popular C/C++ compilers: Intel C++ 9.1 [3] and Microsoft Visual C++ 2005 [7].

The Intel C++ compiler 9.1 for Windows provides advanced optimization features that maximize performance for applications running on the latest Intel processors, including Chip Multi-processors (CMP). Key features of the Intel C++ 9.1 compiler include multi-threaded application support, multi-core development support, Microsoft Visual Studio 200X integration and advanced optimizations such as Interprocedural Optimization (IPO), Profile-guided Optimization (PGO), Automatic Vectorizer and High-Level Optimization (HLO) [3]. On the other hand, Microsoft Visual C++ 2005 [7] is an integrated development environment (IDE) product developed by Microsoft. It has features such as syntax highlighting, IntelliSense (a coding auto completion feature) and advanced debugging functionality. It includes MFC (Microsoft Foundation Classes) 8.0 and support for the C++/CLI language and OpenMP.

According to our measurements, CPU2006 benchmarks have larger input datasets and longer execution times than those of CPU2000. Our results also show that, apart from architectural features, compilers have a high impact on performance. For some applications such as hmmer and h264ref, Intel C++ shows its superiority in performance over the Microsoft VC++ compiler. In addition, it shows better microarchitecture performance in L2 cache miss rate and branch miss rate for most programs because of its specific optimizations for the Intel Core architecture. However, its larger dynamic instruction counts compromise this effect for some floating point programs such as lbm.

The remainder of this paper is organized as follows. Section II describes the methodology. Section III reports the performance characterization of SPEC CPU2006 and CPU2000 on the Intel Core 2 Duo processor. Section IV details the comparison of performance characteristics for the Intel C++ 9.1 and Microsoft Visual C++ 2005 compilers on SPEC CPU2006. Section V describes the related work. Lastly, Section VI gives a brief conclusion obtained from our analysis.
II. METHODOLOGY
We installed Microsoft Visual C++ 2005 (also known as VC++ 8) and Intel C++ compiler 9.1 on 32 bit Windows XP with SP2 operating system running on Intel Core 2 Duo E6400 processor with 2.13GHz. The specification of Intel Core 2 Duo machine is shown in Table I.

TABLE I
SYSTEM SPECIFICATION
CPU               Intel Core 2 Duo E6400 (2 x 2.13GHz)
Technology        65nm
Transistors       291 Millions
Hyperthreading    No
Branch Predictor  Combined three types of predictors - global, bi-modal and loop detectors.
L1 Cache          Code and Data: 32 KB X 2, 8 way, 64-byte cache line size, write-back
L2 Cache          2MB shared cache (2MB x 1), 8-way, 64-byte line size, non-inclusive with L1 cache.
L1 TLB size       Instructions: 128 entries; Data: 256 entries
Memory            2GB (1GB x 2) DDR2 533MHz
FSB               1066MHz Data Rate 64-bit
FSB bandwidth     8.5GB/s
HD Interface      SATA 375MB/s
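As a quick sanity check on the system specification, the cache geometry and the peak front-side bus bandwidth can be derived from the listed parameters. The Python sketch below is not part of the original study; the helper names `cache_sets` and `fsb_bandwidth_gb_s` are hypothetical, and it assumes 1 KB = 1024 bytes for cache sizes and 1 GB = 10^9 bytes for bandwidth.

```python
def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    return size_bytes // (ways * line_bytes)


def fsb_bandwidth_gb_s(transfers_per_sec: float, bus_width_bits: int) -> float:
    """Peak bus bandwidth in decimal GB/s: transfers per second * bytes per transfer."""
    return transfers_per_sec * (bus_width_bits / 8) / 1e9


# Values taken from the system specification table (assumed units noted above).
l1_sets = cache_sets(32 * 1024, ways=8, line_bytes=64)        # 32 KB, 8-way, 64-byte lines
l2_sets = cache_sets(2 * 1024 * 1024, ways=8, line_bytes=64)  # 2 MB, 8-way, 64-byte lines
fsb = fsb_bandwidth_gb_s(1066e6, bus_width_bits=64)           # 1066 MHz data rate, 64-bit bus

print(l1_sets, l2_sets, round(fsb, 1))
```

Under these assumptions the sketch gives 64 sets for each L1 cache, 4096 sets for the shared L2 cache, and roughly 8.5 GB/s for the bus, which agrees with the FSB bandwidth listed in the specification.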
For performance characterization of the SPEC CPU2006 benchmark suite, all the integer and floating point programs were considered. The details of the applications in the benchmark suite can be found in [9]. We also made a comparison with SPEC CPU2000 C/C++ programs. Microsoft Visual C++ 2005 and Intel FORTRAN Compiler 9.1 were used to compile most of the applications under consideration. The exceptions were libquantum, xalancbmk, calculix, povray, tonto, wrf and zeusmp, which we compiled with the Intel C++ 9.1 compiler because of compilation problems.

After that, a subset of the C/C++ SPEC CPU2006 benchmark suite was used to analyze the performance characteristics of the two compilers under consideration. We used the fastest-speed compilation flags for both compilers. For the Microsoft VC++ compiler, we set “-O2”, while for the Intel C++ compiler we set “-fast”, which is equal to “-O3 -ipo -xP” [3].

All benchmark applications were analyzed using Intel(R) VTune(TM) Performance Analyzer 8.0.1. At any given time, the analyzer can measure only a limited number of events, depending upon the configuration; hence, several complete runs were made to measure all the events. Event-based sampling was selected for monitoring. We measured microarchitecture events such as L1D cache misses, L2 cache misses, DTLB misses, instructions per cycle (IPC), branch mispredictions, etc.

III. CHARACTERIZATION OF SPEC CPU2006 BENCHMARK
Compared with CPU2000 programs, CPU2006 benchmarks have larger input datasets and longer execution times. According to our measurements, the execution time for CPU2000 programs ranges from 56-170 seconds while that for CPU2006 benchmarks ranges from 563-1590 seconds on the Intel Core 2 Duo system.

Figure 1. (a) IPC of SPEC CPU2006 Benchmarks; (b) IPC of SPEC CPU2000 Benchmarks

Figure 1(a) and Figure 1(b) depict the instructions per cycle (IPC) of CPU2006 and CPU2000 respectively. The average IPC for CPU2006 and CPU2000 benchmarks was measured at 0.97 and 1.1 respectively. From the figures, it can be observed that mcf, omnetpp and lbm have low IPC among CPU2006 benchmarks, while mcf, art and swim have low IPC among the CPU2000 benchmarks.

Figure 2(a) and Figure 2(b) represent the instructions-retired profile of CPU2006 and CPU2000 respectively. It is evident from the figure that a very high percentage of the instructions retired are loads and stores. CPU2006 benchmarks like h264ref, hmmer, bwaves, leslie3d and GemsFDTD have a comparatively high percentage of loads, while gcc, libquantum, mcf, perlbench, sjeng, xalancbmk and gamess have a high percentage of branch instructions. On the other hand, CPU2000 benchmarks like gap, parser, vortex, applu, equake, fma3d, mgrid and swim have a comparatively high percentage of loads, while almost all integer programs have a high percentage of branch instructions.

A higher percentage of load and store instructions retired, or a higher percentage of branches, does not necessarily indicate the presence of more bottlenecks. For example, h264ref and perlbench have a high percentage of load, store and branch instructions, but they also have comparatively high IPC. Similarly, among CPU2000 benchmarks, crafty, parser and perl have a high percentage of load, store and branch instructions and have better IPC. To get a better understanding of the bottlenecks of
these benchmarks, L1 data cache misses per 1000 instructions, L2 cache misses per 1000 instructions and branch mispredictions per 1000 instructions were measured and analyzed.

Figure 2. (a) Instruction Profile of SPEC CPU2006 Benchmark; (b) Instruction Profile of SPEC CPU2000 Benchmark
[Panel (b), “CPU2000 Instruction Profile”: percentage of load, store, branch and other instructions (0% to 100%) for each CPU2000 benchmark.]

Figure 3. (a) L1 D Cache Misses Per 1000 Instructions of SPEC CPU2006 Benchmarks; (b) L1 D Cache Misses Per 1000 Instructions of SPEC CPU2000 Benchmarks

Figure 3(a) and 3(b) show the L1 data cache misses per 1000 instructions of the CPU2006 and CPU2000 benchmarks. The results show that CPU2006 does not stress the L1 cache significantly more than CPU2000. The average L1D cache misses per 1000 instructions for the CPU2006 and CPU2000 benchmark sets under consideration were found to be 22.5 and 27 respectively. The mcf benchmark has the highest L1 cache misses per 1000 instructions in both CPU2000 and CPU2006. This is one of the significant reasons for its low IPC.

Mcf is a memory-intensive integer benchmark written in C. Code analysis using Intel(R) VTune(TM) Performance Analyzer 8.0.1 shows that the key functions responsible for stressing the various processor units are primal_bea_mpp and refresh_potential. Primal_bea_mpp (72.6%) and refresh_potential (12.8%) together are responsible for 85% of the overall L1 data cache miss events.

A code sample of the primal_bea_mpp function is shown in Figure 4. The function traverses an array of pointers (denoted by arc_t) to a set of structures. For each structure traversed, it optimizes the routines used for massive communication. In the code under consideration, pointer chasing in line 6 is responsible for more than 50% of the overall L1D cache misses for the whole program. A similar result for mcf in CPU2000 was found in previous work [11]. Apart from mcf, lbm has a comparatively significant L1 cache miss rate in CPU2006, and mcf, art and swim have comparatively significant L1 cache miss rates in CPU2000.

We also measured L1 DTLB misses for SPEC CPU2006. Only a few programs have L1 DTLB miss rates equal to or larger than 1%. They are astar (1%), mcf (6%), omnetpp (1%) and cactusADM (2%). Some programs have very small L1 DTLB miss rates; for example, the miss rates for hmmer and gromacs are 3.3*10^-5 and 6.2*10^-5 respectively.

Figure 5(a) and 5(b) represent the L2 cache misses per 1000 instructions of the CPU2006 and CPU2000 SPEC benchmarks respectively. The average L2 cache misses per 1000 instructions for the CPU2006 and CPU2000 benchmarks under consideration were found to be 4.1 and 2.6 respectively. Lbm has the highest L2 cache misses, which accounts for its low IPC. Lbm (Lattice Boltzmann Method) is a floating point benchmark written in C. It is used in the field of fluid dynamics to simulate the behavior of fluids in 3D. Lbm has two steps of accessing memory, namely i) a streaming step, in which values are derived from neighboring cells, and ii) linear memory access to read the cell values (collide-stream) and write the values to the cell (stream-collide) [9].

Code analysis reveals that the LBM_performStreamCollide function, used to write the values to the cell, is responsible for 99.98% of the overall L2 cache miss events. A code sample of the same function is shown in Figure 6(a). A macro “TEST_FLAG_SWEEP” is responsible for 21% of the overall L2 cache misses. The definition of TEST_FLAG_SWEEP is shown in Figure 6(b). The pointer *MAGIC_CAST dynamically accesses over 400MB of data, which is much larger than the available L2 cache size (2MB), resulting in very high L2 cache misses. Hence it can be concluded that
lbm has a very large data footprint, which results in high stress on the L2 cache. For mcf, primal_bea_mpp (33.4%) and refresh_potential (20.2%) are the two major functions resulting in L2 cache misses. Intensive pointer chasing is responsible for this.

Figure 4. Code Sample of MCF

Figure 5. (a) L2 Cache Misses Per 1000 Instructions of SPEC CPU2006 Benchmarks; (b) L2 Cache Misses Per 1000 Instructions of SPEC CPU2000 Benchmarks

Figure 6. Code Sample of LBM

Figure 7. (a) Branch Mispredictions Per 1000 Instructions of SPEC CPU2006 Benchmarks; (b) Branch Mispredictions Per 1000 Instructions of SPEC CPU2000 Benchmarks

Figure 7(a) and 7(b) represent the branch mispredictions per 1000 instructions of the CPU2006 and CPU2000 SPEC benchmarks. CPU2006 benchmarks have comparatively higher branch misprediction than CPU2000 benchmarks, and almost all floating point benchmarks under consideration have comparatively negligible branch misprediction. The average branch mispredictions per 1000 instructions for CPU2006 and CPU2000 integer benchmarks were measured as 4.2 and 4.0 respectively, and those for CPU2006 and CPU2000 floating point benchmarks were measured as 0.38 and 0.08 respectively.

Thus, from the results analyzed so far, we can conclude that the CPU2006 benchmarks have larger data sets and require longer execution time than their predecessor CPU2000 benchmarks.

IV. MICROSOFT VC++ VS. INTEL C++
In this section, we compare compiler effects on SPEC CPU2006. We first compared static code size and dynamic instruction counts. Table II lists the static code size of binaries generated by both compilers. In general, we observed that Intel C++ binaries are larger than those generated by the Microsoft VC++ compiler. Figure 8 shows the instructions-retired profile for Microsoft VC++ and Intel C++. The vertical axis represents the absolute number of instructions broken down by type. A few observations can be made:
(1) For 9 out of 15 programs, the dynamic instructions retired for Intel C++ binaries are fewer than those for binaries generated by the Microsoft VC++ compiler, even though the former have larger static code size.
(2) The percentage of load and store instructions is lower in most cases for binaries generated by the Intel C++ compiler than for Microsoft VC++ binaries. Hence, the Intel C++ compiler comparatively reduces the number of memory accesses.
(3) The percentage of branch instructions is nearly the same for both Intel C++ and Microsoft VC++ binaries. Other instructions consist of various integer and floating point instructions, which on average account for approximately 37% and 32% of the overall instructions for Intel C++ binaries and Microsoft VC++ binaries respectively.

TABLE II
STATIC CODE SIZE (IN BYTES) OF BINARIES GENERATED BY MICROSOFT VC++ AND INTEL C++
Name        VC++      IC++
ASTAR       126976    163840
BZIP2       122880    163840
GCC         2744320   3788800
GOBMK       3190784   3792896
H264REF     552960    1294336
HMMER       237568    323584
MCF         90112     106496
OMNETPP     724992    1286144
PERLBENCH   978944    1536000
SJENG       188416    266240
LBM         102400    102400
MILC        180224    323584
NAMD        356352    561152
SOPLEX      409600    1093632
SPHINX3     262144    393216

Figure 8. CPU2006 Instruction Retired Profile (VC vs. ICC)

We then compared the normalized runtime for the Intel C++ and Microsoft VC++ compilers running SPEC CPU2006 benchmarks. For normalization, the runtime of Microsoft VC++ was considered to be the base runtime. Figure 9 shows the normalized runtime for the Intel C++ and Microsoft VC++ compilers. From the figure, it is evident that the runtimes for most of the applications are very close. However, for the applications hmmer and h264ref there is a drastic decrease in runtime when running with the Intel C++ compiler. Microsoft VC++ shows improvement in runtime for the floating point programs lbm, soplex and sphinx3.

To better understand the performance impact of compilers, we compared various performance metrics. We analyzed the L1D cache misses per 1000 instructions, L2 cache misses per 1000 instructions and branch mispredictions per 1000 instructions for binaries generated by the Intel C++ and Microsoft VC++ compilers. Figure 10 shows the comparison of L1D cache misses per 1000 instructions. From this figure, the L1D cache miss rate is almost the same for both compilers except for sphinx3 and soplex. The L1 data cache miss rate gap between Intel C++ and Microsoft VC++ is responsible for the execution time difference for these two programs.

Figure 11 shows the comparison of L2 cache misses per 1000 instructions for both compilers. The figure shows that there was considerable improvement in L2 cache miss rate for memory-intensive applications such as mcf, lbm, perlbench and soplex with the Intel C++ compiler compared to the Microsoft VC++ compiler. From this figure, we can conclude that the Intel C++ compiler, which utilizes more features of the Intel Core 2 Duo processor, has better memory performance than Microsoft VC++.

Figure 12 shows the branch misprediction rate. From this figure, it can be observed that astar, h264ref, hmmer and omnetpp show improvement in branch misprediction rate when running with the Intel C++ compiler compared to the Microsoft VC++ compiler. Other programs show similar behavior.

In general, we find that the Intel C++ compiler shows superior performance for hmmer and h264ref. In addition, it also shows better microarchitecture performance in L2 cache miss rate and branch miss rate for most programs. However, its larger dynamic instruction counts compromise this effect for some floating point programs such as lbm.

V. RELATED WORK
Researchers in the computer architecture area show strong interest in the performance characterization of CPU2006. Bird et al. [1] reported the performance characterization of SPEC
CPU2006 and analyzed the impact of “Macro fusion” and “Micro-op fusion” of the Woodcrest processor. These results parallel our own, upon which this paper is based. Ye et al. [10] compared CPU2006 integer benchmark binaries in 64-bit and 32-bit formats on an x86-64 architecture based processor.

Figure 9. Runtime Comparison

Figure 10. Comparison of L1D Cache Miss Per 1000 Instructions

Figure 11. Comparison of L2 Cache Miss Per 1000 Instructions

Figure 12. Comparison of Branch Misprediction Per 1000 Instructions

The effect of compilers and compiler optimizations on application performance has been studied and analyzed for a long time. Gurumani and Milenkovic studied the execution characteristics of Visual C++ 6.0 and Intel C++ on the Pentium 4 processor using the SPEC CPU2000 benchmark suite in [2]. They concluded that Intel C++ compilers performed better for graphics and visualization applications.

Compared with software simulation, using the Intel VTune performance analyzer and performance counters in real processors is a fast and feasible way to characterize emerging workloads. There are a few recent works analyzing bioinformatics and data mining workloads [6][8] using performance counters and the VTune analyzer.

VI. CONCLUSION
In this paper, we analyzed the emerging CPU2006 on the Intel Core 2 Duo processor. According to our measurements, CPU2006 benchmarks have larger input datasets and longer execution times than those of CPU2000. Our results also show that, apart from architectural features, compilers have a high impact on performance. For some applications such as hmmer and h264ref, Intel C++ shows its superiority in performance over the Microsoft VC++ compiler. In addition, it shows better performance in L2 cache miss rate and branch miss rate for most programs because of its specific optimizations for the Intel Core architecture. However, its larger dynamic instruction counts compromise this effect for some floating point programs such as lbm.

REFERENCES
[1] S. Bird, A. Phansalkar, L. K. John, A. Mericas and R. Indukuru, “Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor”, in Proceedings of 2007 SPEC Benchmark Workshop, Jan. 2007.
[2] S. T. Gurumani and A. Milenkovic, “Execution Characteristics of SPEC CPU2000 Benchmarks: Intel C++ vs. Microsoft VC++”, in Proceedings of the 42nd ACM annual southeast regional conference, 2004.
[3] Intel, Intel C++ Compiler 9.1 for Windows, http://cache.www.intel.com/cd/00/00/28/48/284831_284831.pdf
[4] Intel, Announcing Intel Core 2 Processor Family Brand, http://www.intel.com/products/processor/core2/index.htm
[5] Intel, Intel VTune Performance Analyzer, http://www.intel.com/cd/software/products/asmona/eng/vtune/239144.htm
[6] Y. Li, T. Li, T. Kahveci and J. Fortes, “Workload characterization of bioinformatic applications”, in Proceedings of IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2005.
[7] Microsoft, 32-bit Optimizations and Command-Line Switches, http://msdn.microsoft.com/vstudio/tour/vs2005_guided_tour/VS2005pro/Framework/CPlus32BitOptimization.htm
[8] B. Ozisikyilmaz, R. Narayanan, J. Zambreno, G. Memik and A. Choudhary, “An Architectural Characterization Study of Data Mining and Bioinformatics Workloads”, in Proceedings of IEEE International Symposium on Workload Characterization, Oct. 2006.
[9] SPEC, SPEC CPU2000 and CPU2006, http://www.spec.org/
[10] D. Ye, J. Ray, C. Harle and D. Kaeli, “Performance Characterization of SPEC CPU2006 Integer Benchmarks on x86-64 Architecture”, in Proceedings of IEEE International Symposium on Workload Characterization, Oct. 2006.
[11] H. Zhou and T. M. Conte, “Enhancing memory level parallelism via recovery-free value prediction”, in Proceedings of the 17th annual international conference on Supercomputing (ICS), Jun. 2003.
