TABLE I
SYSTEM SPECIFICATION
CPU               Intel Core 2 Duo E6400 (2 x 2.13GHz)
Technology        65nm
Transistors       291 million
Hyperthreading    No
Branch Predictor  Combination of three predictor types: global, bi-modal and loop detector
L1 Cache          Code and data: 32 KB x 2, 8-way, 64-byte cache line size, write-back
L2 Cache          2MB shared cache (2MB x 1), 8-way, 64-byte line size, non-inclusive with L1 cache
L1 TLB size       Instructions: 128 entries; Data: 256 entries
Memory            2GB (1GB x 2) DDR2 533MHz
FSB               1066MHz data rate, 64-bit
FSB bandwidth     8.5GB/s
HD Interface      SATA 375MB/s
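The FSB bandwidth entry in Table I follows from the data rate and bus width listed just above it. A minimal sketch of that arithmetic in Python, assuming that "1066MHz data rate" means 1066 million transfers per second and that 1 GB is 10^9 bytes; the function name is hypothetical:

```python
def peak_bandwidth_gb_per_s(mega_transfers_per_s: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes) for a bus.

    Assumption: the data rate is given in millions of transfers per second,
    and every transfer moves one full bus width of data.
    """
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Values taken from Table I: 1066MHz data rate, 64-bit bus.
fsb_gb_per_s = peak_bandwidth_gb_per_s(1066, 64)
print(f"FSB peak bandwidth: {fsb_gb_per_s:.1f} GB/s")
```

This is a theoretical peak; the sustained bandwidth seen by the benchmarks is lower.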
II. METHODOLOGY
We installed Microsoft Visual C++ 2005 (also known as VC++ 8) and Intel C++ compiler 9.1 on a 32-bit Windows XP SP2 operating system running on an Intel Core 2 Duo E6400 processor at 2.13GHz. The specification of the Intel Core 2 Duo machine is shown in Table 1.
For performance characterization of the SPEC CPU2006 benchmark suite, all the integer and floating point programs were considered. The details of the applications in the benchmark suite can be found in [9]. We also made a comparison with SPEC CPU2000 C/C++ programs. Microsoft Visual C++ 2005 and Intel FORTRAN Compiler 9.1 were used to compile most of the applications under consideration. The exceptions were libquantum, xalancbmk, calculix, povray, tonto, wrf and zeusmp, which ran into compilation problems; we therefore compiled these programs using the Intel C++ 9.1 compiler.
After that, a subset of the C/C++ SPEC CPU2006 benchmark suite was used to analyze the performance characteristics of the two compilers under consideration. We used the fastest-speed compilation flags for both compilers. For the Microsoft VC++ compiler, we set “-O2”, while for the Intel C++ compiler we set “-fast”, which is equivalent to “-O3 -ipo -xP” [3].
All benchmark applications were analyzed using Intel(R) VTune(TM) Performance Analyzer 8.0.1. At any given time, VTune can measure only a limited number of events, depending on the configuration; hence, several complete runs were made to measure all the events. Event-based sampling was selected for monitoring. We measured microarchitecture events such as L1D cache misses, L2 cache misses, DTLB misses, Instructions per Cycle (IPC), branch mispredictions, etc.
III. CHARACTERIZATION OF SPEC CPU2006 BENCHMARK
Compared with CPU2000 programs, CPU2006 benchmarks have larger input datasets and longer execution times. According to our measurements, the execution time for CPU2000 programs ranges from 56 to 170 seconds, while that for CPU2006 benchmarks ranges from 563 to 1590 seconds on the Intel Core 2 Duo system.
Figure 1. (a) IPC of SPEC CPU2006 Benchmarks; (b) IPC of SPEC CPU2000 Benchmarks
Figure 1(a) and Figure 1(b) depict the Instructions per Cycle (IPC) of CPU2006 and CPU2000 respectively. The average IPC for the CPU2006 and CPU2000 benchmarks was measured at 0.97 and 1.1 respectively. From the figures, it can be observed that mcf, omnetpp and lbm have low IPC among the CPU2006 benchmarks, while mcf, art and swim have low IPC among the CPU2000 benchmarks.
Figure 2(a) and Figure 2(b) represent the instructions retired profile of CPU2006 and CPU2000 respectively. It is evident from the figures that a very high percentage of the instructions retired consists of loads and stores. CPU2006 benchmarks like h264ref, hmmer, bwaves, leslie3d and gemsfdtd have a comparatively high percentage of loads, while gcc, libquantum, mcf, perlbench, sjeng, xalancbmk and gamess have a high percentage of branch instructions. On the other hand, CPU2000 benchmarks like gap, parser, vortex, applu, equake, fma3d, mgrid and swim have a comparatively high percentage of loads, while almost all integer programs have a high percentage of branch instructions.
A higher percentage of load and store instructions retired or a higher percentage of branches does not necessarily indicate the presence of more bottlenecks. For example, h264ref and perlbench have a high percentage of load, store and branch instructions, but they also have comparatively high IPC. Similarly, among the CPU2000 benchmarks, crafty, parser and perl have a high percentage of load, store and branch instructions and have better IPC. To get a better understanding of the bottlenecks of
CS-16 3
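The metrics reported in this section, IPC and events per 1000 instructions, are plain ratios of hardware event counts. A minimal sketch in Python of how such ratios can be derived; the counts below are hypothetical illustration values, not measurements from this study, and the function and key names are assumptions rather than VTune event names:

```python
def ipc(instructions_retired: int, cycles: int) -> float:
    """Instructions per Cycle: retired instructions divided by clock cycles."""
    return instructions_retired / cycles


def per_kilo_instructions(event_count: int, instructions_retired: int) -> float:
    """Number of events (cache misses, mispredictions, ...) per 1000 retired instructions."""
    return 1000.0 * event_count / instructions_retired


# Hypothetical raw counts for one benchmark run (illustration only).
sample = {
    "instructions_retired": 2_000_000_000,
    "cycles": 2_500_000_000,
    "l2_misses": 6_000_000,
    "branch_mispredictions": 8_400_000,
}

sample_ipc = ipc(sample["instructions_retired"], sample["cycles"])
l2_mpki = per_kilo_instructions(sample["l2_misses"], sample["instructions_retired"])
branch_mpki = per_kilo_instructions(
    sample["branch_mispredictions"], sample["instructions_retired"]
)
print(f"IPC={sample_ipc:.2f}  L2 misses/1000 inst={l2_mpki:.1f}  "
      f"branch mispredictions/1000 inst={branch_mpki:.1f}")
```

Normalizing by instructions retired rather than by time lets programs with very different run lengths, such as the CPU2000 and CPU2006 suites, be compared on the same scale.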
lbm has a very large data footprint, which results in high stress on the L2 cache. For mcf, Primal_bea_mpp (33.4%) and refresh_potential (20.2%) are the two major functions resulting in L2 cache misses. Intensive pointer chasing is responsible for this.
Figure 7(a) and 7(b) show the branch mispredictions per 1000 instructions of the CPU2006 and CPU2000 SPEC benchmarks. CPU2006 benchmarks have comparatively higher branch misprediction than CPU2000 benchmarks, and almost all floating point benchmarks under consideration have negligible branch misprediction by comparison. The average branch mispredictions per 1000 instructions for CPU2006 and CPU2000 integer benchmarks were measured as 4.2 and 4.0 respectively, and those for CPU2006 and CPU2000 floating point benchmarks were measured as 0.38 and 0.08 respectively.
Thus, from the results analyzed so far, we can conclude that the CPU2006 benchmarks have larger data sets and require longer execution times than their predecessor CPU2000 benchmarks.
IV. MICROSOFT VC++ VS. INTEL C++
In this section, we compare compiler effects on SPEC CPU2006. We first compared static code size and dynamic instruction counts. Table 2 lists the static code size of the binaries generated by both compilers. In general, we observed that Intel C++ binaries are larger than those generated by the Microsoft VC++ compiler. Figure 8 compares the instructions retired profile of Microsoft VC++ and Intel C++. The vertical axis represents the absolute number of instructions broken down by type. A few observations can be made:
(1) For 9 out of 15 programs, the number of dynamic instructions retired for Intel C++ binaries is smaller than that for binaries generated by the Microsoft VC++ compiler, though the former have larger static code size.
(2) The percentage of load and store instructions is lower in most cases for binaries generated by the Intel C++ compiler than for Microsoft VC++ binaries. Hence, the Intel C++ compiler comparatively reduces the number of memory accesses.
(3) The percentage of branch instructions is nearly the same for both Intel C++ and Microsoft VC++ binaries. Other instructions consist of various integer and floating point instructions, which on average account for approximately 37% and 32% of the overall instructions for Intel C++ binaries and Microsoft VC++ binaries respectively.
TABLE II
STATIC CODE SIZE (IN BYTES) OF BINARIES GENERATED BY MICROSOFT VC++ AND INTEL C++
Name         VC++       IC++
ASTAR        126976     163840
BZIP2        122880     163840
GCC          2744320    3788800
GOBMK        3190784    3792896
H264REF      552960     1294336
HMMER        237568     323584
MCF          90112      106496
OMNETPP      724992     1286144
PERLBENCH    978944     1536000
SJENG        188416     266240
LBM          102400     102400
MILC         180224     323584
NAMD         356352     561152
SOPLEX       409600     1093632
SPHINX3      262144     393216
We then compared the normalized runtime for the Intel C++ and Microsoft VC++ compilers running SPEC CPU2006 benchmarks. For normalization, the runtime of Microsoft VC++ was taken as the base runtime. Figure 9 shows the normalized runtime for the Intel C++ and Microsoft VC++ compilers. From the figure, it is evident that the runtimes for most of the applications are very close. However, for the applications hmmer and h264ref there is a drastic decrease in runtime when running with the Intel C++ compiler. Microsoft VC++ shows an improvement in runtime for the floating point programs lbm, soplex and sphinx3.
Figure 9. Runtime Comparison
To better understand the performance impact of the compilers, we compared various performance metrics. We analyzed the L1D cache misses per 1000 instructions, L2 cache misses per 1000 instructions and branch mispredictions per 1000 instructions for binaries generated by the Intel C++ and Microsoft VC++ compilers. Figure 10 shows the comparison of L1D cache misses per 1000 instructions. From this figure, the L1D cache miss rate is almost the same for both compilers except for sphinx3 and soplex. The gap in L1 data cache miss rate between Intel C++ and Microsoft VC++ is responsible for the execution time difference for these two programs.
Figure 10. Comparison of L1D Cache Miss Per 1000 Instructions
Figure 11 shows the comparison of L2 cache misses per 1000 instructions for both compilers. The figure shows that there was considerable improvement in L2 cache miss rate for memory intensive applications such as mcf, lbm, perlbench and soplex with the Intel C++ compiler compared to the Microsoft VC++ compiler. From this figure, we can conclude that the Intel C++ compiler, which utilizes more features of the Intel Core 2 Duo processor, has better memory performance than Microsoft VC++.
Figure 11. Comparison of L2 Cache Miss Per 1000 Instructions
Figure 12 shows the branch misprediction rate. From this figure, it can be observed that astar, h264ref, hmmer and omnetpp show improvement in branch misprediction rate when running with the Intel C++ compiler compared to the Microsoft VC++ compiler. Other programs show similar behavior with both compilers.
Figure 12. Comparison of Branch Mis-prediction Per 1000 Instructions
In general, we find that the Intel C++ compiler shows superior performance for hmmer and h264ref. In addition, it also shows better microarchitecture performance in L2 cache miss rate and branch miss rate for most of the programs. However, its larger dynamic instruction counts compromise this effect for some floating point programs such as lbm.
V. RELATED WORK
Researchers in the computer architecture area show strong interest in performance characterization of CPU2006. Bird et al [1] reported the performance characterization of SPEC
CPU2006 and analyzed the impact of “Macro fusion” and “Micro-op fusion” of the Woodcrest processor. These results parallel our own upon which this paper is based. Ye et al [10] compared CPU2006 integer benchmark binaries in 64-bit and 32-bit formats on an x86-64 architecture based processor.
The effect of compilers and compiler optimizations on application performance has been studied and analyzed for a long time. Gurumani and Milenkovic studied the execution characteristics of Visual C++ 6.0 and Intel C++ on a Pentium 4 processor using the SPEC CPU2000 benchmark suite in [2]. They concluded that Intel C++ compilers performed better for graphics and visualization applications.
Compared with software simulation, using the Intel VTune performance analyzer and performance counters in real processors is a fast and feasible way to characterize emerging workloads. There are a few recent works analyzing Bioinformatics and Data Mining workloads [6][8] using performance counters and the VTune analyzer.
VI. CONCLUSION
In this paper, we analyzed the emerging CPU2006 on the Intel Core 2 Duo processor. According to our measurements, CPU2006 benchmarks have larger input datasets and longer execution times than those of CPU2000. Our results also show that, apart from architectural features, compilers also have a high impact on performance. For some applications such as hmmer and h264ref, Intel C++ shows its superiority in performance over the Microsoft VC++ compiler. In addition, it also shows better performance in L2 cache miss rate and branch miss rate for most of the programs because of its specific optimizations for the Intel Core architecture. However, its larger dynamic instruction counts compromise this effect for some floating point programs such as lbm.
REFERENCES
[1] S. Bird, A. Phansalkar, L. K. John, A. Mericas and R. Indukuru, “Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor”, in Proceedings of 2007 SPEC Benchmark Workshop, Jan 2007.
[2] S. T. Gurumani and A. Milenkovic, “Execution Characteristics of SPEC CPU2000 Benchmarks: Intel C++ vs. Microsoft VC++”, in Proceedings of the 42nd ACM annual southeast regional conference, 2004.
[3] Intel, Intel C++ Compiler 9.1 for Windows, http://cache.www.intel.com/cd/00/00/28/48/284831_284831.pdf
[4] Intel, Announcing Intel Core 2 Processor Family Brand, http://www.intel.com/products/processor/core2/index.htm
[5] Intel, Intel VTune Performance Analyzer, http://www.intel.com/cd/software/products/asmona/eng/vtune/239144.htm
[6] Y. Li, T. Li, T. Kahveci, and J. Fortes, “Workload characterization of bioinformatic applications”, in Proceedings of IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2005.
[7] Microsoft, 32-bit Optimizations and Command-Line Switches, http://msdn.microsoft.com/vstudio/tour/vs2005_guided_tour/VS2005pro/Framework/CPlus32BitOptimization.htm
[8] B. Ozisikyilmaz, R. Narayanan, J. Zambreno, G. Memik, A. Choudhary, “An Architectural Characterization Study of Data Mining and Bioinformatics Workloads”, in Proceedings of IEEE International Symposium on Workload Characterization, Oct. 2006.
[9] SPEC, SPEC CPU2000 and CPU2006, http://www.spec.org/
[10] D. Ye, J. Ray, C. Harle and D. Kaeli, “Performance Characterization of SPEC CPU2006 Integer Benchmarks on x86-64 Architecture”, in Proceedings of IEEE International Symposium on Workload Characterization, Oct. 2006.
[11] H. Zhou and T. M. Conte, “Enhancing memory level parallelism via recovery-free value prediction”, in Proceedings of the 17th annual international conference on Supercomputing (ICS), Jun. 2003.