EC 14-15 Computer Organization Fall 2014 Chapter 1: Measuring & Understanding Performance
Dept. of Comp. Arch., UMA, 2014
EC 14-15
Computer Organization
Fall 2014
Chapter 1: Measuring &
Understanding Performance
M. Angeles González Navarro
www.ac.uma.es/~angeles
[Adapted from Mary Jane Irwin's slides (PSU) based on Computer
Organization and Design, 4th Edition, Patterson & Hennessy, 2009, MK]
The Computer Revolution
! Progress in computer technology
! Underpinned by Moore's Law
! Makes novel applications feasible
! Computers in automobiles
! Cell phones
! Human genome project
! World Wide Web
! Search Engines
! Computers are pervasive
Classes of Computers
! Desktop computers
! Designed to deliver good performance to a single user at low
cost, usually executing third-party software and usually
incorporating a graphics display, a keyboard, and a mouse
! Servers
! Used to run larger programs for multiple, simultaneous users
typically accessed only via a network and that places a greater
emphasis on dependability and (often) security
! Supercomputers
! A high performance, high cost class of servers with hundreds to
thousands of processors, terabytes of memory and petabytes of
storage that are used for high-end scientific and engineering
applications
! Embedded computers (processors)
! A computer inside another device used for running one
predetermined application
The Processor Market
embedded growth >> desktop growth
! Where else are embedded processors found?
Embedded Processor Characteristics
The largest class of computers spanning the widest range
of applications and performance
! Often have minimum performance requirements.
Example?
! Often have stringent limitations on cost. Example?
! Often have stringent limitations on power consumption.
Example?
! Often have low tolerance for failure. Example?
Servers and supercomputers
! Servers and supercomputing market
! Google: porting the search engine for ARM and PowerPC
! AMD Seattle Server-on-a-Chip based on Cortex-A57 (v8)
! E4's EK003 servers: X-Gene ARM A57 (8 cores) + K20
! Mont Blanc project: a supercomputer built from ARM processors
- Commodity processors once took over the supercomputer market
- Be prepared for when mobile processors do the same
Supercomputers (TOP500, June 2014)
! http://www.top500.org/lists/2014/
What You Will Learn
! How programs are translated into the machine language
! And how the hardware executes them
! The hardware/software interface
! What determines program performance
! And how it can be improved
! How hardware designers improve performance
! What is parallel processing
Understanding Performance
! Algorithm
! Determines number of operations executed
! Programming language, compiler, architecture
! Determine number of machine instructions executed
per operation
! Processor and memory system
! Determine how fast instructions are executed
! I/O system (including OS)
! Determines how fast I/O operations are executed
Below the Program
! System software
! Operating system: supervising program that interfaces the
user's program with the hardware (e.g., Linux, MacOS,
Windows)
- Handles basic input and output operations
- Allocates storage and memory
- Schedules tasks and provides for protected sharing among
multiple applications
! Compiler: translates programs written in a high-level language
(e.g., C, Java) into instructions that the hardware can execute
[Figure: layers of applications software (written in a high-level
language), systems software, and hardware]
Below the Program, Cont.
! High-level language program (in C)
void swap(int v[], int k)
{
    int temp;          /* holds v[k] while it is overwritten */
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}
! Assembly language program (for MIPS)
swap: sll $2, $5, 2
add $2, $4, $2
lw $15, 0($2)
lw $16, 4($2)
sw $16, 0($2)
sw $15, 4($2)
jr $31
! Machine (object, binary) code (for MIPS)
000000 00000 00101 0001000010000000
000000 00100 00010 0001000000100000
. . .
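C compiler: one-to-many translation (one C statement becomes several assembly instructions)
Assembler: one-to-one translation (one assembly instruction becomes one machine instruction)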
Advantages of Higher-Level Languages?
! Higher-level languages
! Allow the programmer to think in a more natural language and
for their intended use (Fortran for scientific computation,
Cobol for business programming, Lisp for symbol
manipulation, Java for web programming, ...)
! Improve programmer productivity: more understandable
code that is easier to debug and validate
! Improve program maintainability
! Allow programs to be independent of the computer on which
they are developed (compilers and assemblers can translate
high-level language programs to the binary instructions of any
machine)
! Emergence of optimizing compilers that produce very efficient
assembly code optimized for the target machine
! As a result, very little programming is done today at the
assembler level
Under the Covers
! Same components for all kinds of computer
! Five classic components of a computer: input, output, memory,
datapath, and control
! datapath + control = processor (CPU)
! Input/output includes
- User-interface devices: display, keyboard, mouse
- Storage devices: hard disk, CD/DVD, flash
- Network adapters: for communicating with other computers
Opening the Box
! Ivy Bridge HM-4
! 6 MB L3 Cache
! 1,400 million transistors
! GPU with 8 EU
! 4 cores
! 2.5-3.5 GHz
! TDP: 45-77 W
Processors for desktops and laptops
! Desktops (35-130 W) and laptops (15-57 W):
http://www.techspot.com/photos/article/770-amd-a8-7600-kaveri/
http://techguru3d.com/4th-gen-intel-haswell-processors-architecture-and-lineup/
! Intel Haswell (22 nm)
- 2-20 MB L3 cache
- > 1,400 million transistors
- GPU with 20-40 EU
! AMD APU Kaveri (28 nm)
- 2-4 cores
- 4 MB L2 (resizable)
- GPU with 8-512 EU
A Safe Place for Data
! Volatile main memory
! Loses instructions and data when powered off
! Non-volatile secondary memory
! Magnetic disk
! Flash memory
! Optical disk (CDROM, DVD)
Networks
! Communication and resource sharing
! Local area network (LAN): Ethernet
! Within a building
! Wide area network (WAN): the Internet
! Wireless network: WiFi, Bluetooth
Communicating to the Outside World: Cluster Networking
This online section describes the networking hardware and software used to
connect the nodes of a cluster together. As there are whole books and courses just on
networking, this section only introduces the main terms and concepts. While our
example is networking, the techniques we describe apply to storage controllers and
other I/O devices as well.
Ethernet has dominated local area networks for decades, so it is not surprising
that clusters primarily rely on Ethernet as the cluster interconnect. It became
commercially popular at 10 Megabits per second link speed in the 1980s, but
today 1 Gigabit per second Ethernet is standard and 10 Gigabit per second is being
deployed in datacenters. Figure 6.9.1 shows a network interface card (NIC) for 10
Gigabit Ethernet.
Computers offer high-speed links to plug in fast I/O devices like this NIC. While
there used to be separate chips to connect the microprocessor to the memory and
high-speed I/O devices, thanks to Moore's Law these functions have been absorbed
into the main chip in recent offerings like Intel's Sandy Bridge. A popular high-speed
link today is PCIe, which stands for Peripheral Component Interconnect
Express. It is called a link in that the basic building block, called a serial lane,
consists of just four wires: two for receiving data and two for transmitting data.
This small number contrasts with an earlier version of PCI that consisted of 64
FIGURE 6.9.1 The NetFPGA 10-Gigabit Ethernet card (see http://netfpga.org/), which
connects up to four 10-Gigabit/sec Ethernet links. It is an FPGA-based open platform for
network research and classroom experimentation. The DMA engine and the four MAC chips
in Figure 6.9.2 are just portions of the Xilinx Virtex FPGA in the middle of the board. The four PHY chips
in Figure 6.9.2 are the four black squares just to the right of the four white rectangles on the left edge of the
board, which is where the Ethernet cables are plugged in.
NetFPGA 10 Gigabit Ethernet card
Abstractions
! Abstraction helps us deal with complexity
! Hide lower-level detail
! Instruction set architecture (ISA)
! The hardware/software interface
! Application binary interface
! The ISA plus system software interface
! Implementation
! The details underlying an interface
The BIG Picture
Technology Trends
! Electronics technology continues to evolve
! Increased capacity and performance
! Reduced cost
Year   Technology                   Relative performance/cost
1951   Vacuum tube                  1
1965   Transistor                   35
1975   Integrated circuit (IC)      900
1995   Very large scale IC (VLSI)   2,400,000
2005   Ultra large scale IC         6,200,000,000
[Figure: DRAM capacity]
Instruction Set Architecture (ISA)
! ISA, or simply architecture: the abstract interface
between the hardware and the lowest level software that
encompasses all the information necessary to write a
machine language program, including instructions,
registers, memory access, I/O, ...
! Enables implementations of varying cost and performance to run
identical software
! The combination of the basic instruction set (the ISA) and
the operating system interface is called the application
binary interface (ABI)
! ABI: the user portion of the instruction set plus the operating
system interfaces used by application programmers. Defines a
standard for binary portability across computers.
Moore's Law
! In 1965, Gordon Moore (who later co-founded Intel)
predicted that the number of
transistors that can be
integrated on a single chip would
double every year; in 1975 he revised
the rate to about every two years
[Figure annotations: feature size and die size. 2014: quad-core Haswell
with 1.4B transistors. Courtesy, Intel]
Technology Scaling Road Map (ITRS)
Year                                             2006   2008   2010   2012   2014
Feature size (nm)                                  65     45     32     22     14
Integration capacity (billions of transistors)      4      6     16     32     50
! Fun facts about 45nm transistors
! 30 million can fit on the head of a pin
! You could fit more than 2,000 across the width of a human
hair
! If car prices had fallen at the same rate as the price of a
single transistor has since 1968, a new car today would cost
about 1 cent
Another Example of Moore's Law Impact
[Figure: DRAM capacity growth over 3 decades, with chip capacities rising from 16K through 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M and 512M to 1G]
But What Happened to Clock Rates and Why?
! Clock rates hit a "power wall"
The Sea Change
"For the P6, success criteria included performance above a
certain level and failure criteria included power
dissipation above some threshold."
(Bob Colwell, Pentium Chronicles)
The Sea Change: the switch to multiprocessors
Constrained by power, instruction-level parallelism,
memory latency
Uniprocessor Performance
The Sea Change: multiprocessors
! The power challenge has forced a change in the design of
microprocessors
! Since 2002 the rate of improvement in the response time of programs
on desktop computers has slowed from a factor of 1.5 per year to less
than a factor of 1.2 per year
! As of 2006 all desktop and server companies are shipping
microprocessors with multiple processor cores per chip
Product          AMD Kaveri   Intel Haswell   Samsung Exynos Octa   Qualcomm Snapdragon
Cores per chip   2-4 + GPU    4 + GPU         4 + 4                 4 + GPU + DSP
Clock rate       ~4 GHz       ~3.0 GHz        1.3-1.8 GHz           2.3 GHz
Power            ~100 W       ~15-100 W       < 10 W                < 10 W
! Plan of record is to double the number of cores per chip per
generation (about every two years)
The sea change: multiprocessors
! Multicore microprocessors
! More than one processor per chip
! Requires explicitly parallel programming
! Compare with instruction level parallelism
- Hardware executes multiple instructions at once
- Hidden from the programmer
! Hard to do
- Programming for performance
- Load balancing
- Optimizing communication and synchronization
Performance Metrics
! Purchasing perspective
! given a collection of machines, which has the
- best performance ?
- least cost ?
- best cost/performance?
! Design perspective
! faced with design options, which has the
- best performance improvement ?
- least cost ?
- best cost/performance?
! Both require
! basis for comparison
! metric for evaluation
! Our goal is to understand what factors in the architecture
contribute to overall system performance and the relative
importance (and cost) of these factors
Throughput versus Response Time
! Response time (execution time): the time between the
start and the completion of a task
! Important to individual users
! Throughput (bandwidth): the total amount of work done
in a given unit time
! Important to data center managers
! Will need different performance metrics as well as a
different set of applications to benchmark embedded and
desktop computers, which are more focused on response
time, versus servers, which are more focused on
throughput
! How are response time and throughput affected by
! Replacing the processor with a faster version?
! Adding more processors?
We'll focus on response time for now.
Response Time Matters
Justin Rattner's ISCA'08 Keynote (VP and CTO of Intel)
Defining (Speed) Performance
! To maximize performance, need to minimize execution
time
$$\text{performance}_X = \frac{1}{\text{execution\_time}_X}$$
If X is n times faster than Y, then
$$\frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n$$
! Decreasing response time almost always improves
throughput
Relative Performance Example
! If computer A runs a program in 10 seconds and
computer B runs the same program in 15 seconds, how
much faster is A than B?
We know that A is n times faster than B if
$$\frac{\text{performance}_A}{\text{performance}_B} = \frac{\text{execution\_time}_B}{\text{execution\_time}_A} = n$$
The performance ratio is
$$\frac{15}{10} = 1.5$$
So A is 1.5 times faster than B
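The ratio in this example can be checked with a few lines of Python. This is only a sketch; the function name `times_faster` is an illustrative choice, not something from the slides.

```python
def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times faster than Y.

    Uses performance = 1 / execution_time, so
    performance_X / performance_Y = execution_time_Y / execution_time_X.
    """
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


# Values from the example: A takes 10 s, B takes 15 s.
n = times_faster(exec_time_x=10.0, exec_time_y=15.0)
print(f"A is {n:.1f} times faster than B")
```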
Measuring Execution Time
! Elapsed time
! Total response time = wall clock time = elapsed time
- it includes all aspects to complete a task
- Processing, I/O operations, OS overhead, idle time
! Determines system performance
! Productivity
! Throughput: the total amount of work done in a given unit time
Measuring Execution Time
! CPU time
! Time spent processing a given job
- Discounts I/O time and other jobs' shares
! Comprises user CPU time and system CPU time
! Different programs are affected differently by CPU and system
performance
! Example: time in Unix:
90.7u 12.9s 2:39 65%
- 90.7u: user CPU time (seconds)
- 12.9s: system CPU time (seconds)
- 2:39: elapsed time (minutes:seconds), which also covers I/O and other processes
- 65%: CPU utilization, (90.7 + 12.9) / (2*60 + 39)
! Our goal: user CPU time + system CPU time
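The 65% figure reported by `time` can be reproduced from the other three fields. The sketch below assumes the elapsed field is written as minutes:seconds, as in the example; the helper names are illustrative.

```python
def elapsed_seconds(clock_text: str) -> float:
    """Convert an elapsed time written as minutes:seconds into seconds."""
    minutes, seconds = clock_text.split(":")
    return int(minutes) * 60 + float(seconds)


def cpu_utilization(user_s: float, system_s: float, elapsed_s: float) -> float:
    """Fraction of the elapsed time spent on this job's user + system CPU time."""
    return (user_s + system_s) / elapsed_s


# Fields taken from the example output "90.7u 12.9s 2:39 65%".
elapsed = elapsed_seconds("2:39")
utilization = cpu_utilization(90.7, 12.9, elapsed)
print(f"CPU time = {90.7 + 12.9:.1f} s, elapsed = {elapsed:.0f} s, "
      f"utilization = {utilization:.0%}")
```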
Review: Machine Clock Rate
! Clock rate (clock cycles per second in MHz or GHz) is
inverse of clock cycle time (clock period)
CC = 1 / CR
one clock period
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec (10^-9 s) clock cycle => 1 GHz (10^9 Hz) clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
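The conversions in this list all follow from CC = 1 / CR. A minimal sketch (the function name is an illustrative choice) that regenerates the list from the cycle times in picoseconds:

```python
def clock_rate_hz(cycle_time_s: float) -> float:
    """Clock rate is the inverse of the clock cycle time."""
    return 1.0 / cycle_time_s


# Cycle times from the slide, expressed in picoseconds (1 ns = 1000 ps).
cycle_times_ps = [10_000, 5_000, 2_000, 1_000, 500, 250, 200]
rates_ghz = {ps: clock_rate_hz(ps * 1e-12) / 1e9 for ps in cycle_times_ps}

for ps, ghz in rates_ghz.items():
    print(f"{ps:>6} ps clock cycle => {ghz:.1f} GHz clock rate")
```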
Performance Factors
! CPU execution time (CPU time): the time the CPU spends
working on a task
! Does not include time waiting for I/O or running other programs
$$\text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time}$$
or
$$\text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}}$$
! Can improve performance by increasing the clock
rate or by reducing the number of clock cycles required
for a program
! Hardware designer must often trade off clock rate against cycle
count
Improving Performance Example
! A program runs on computer A with a 2 GHz clock in 10
seconds. What clock rate must computer B run at to run
this program in 6 seconds? Unfortunately, to accomplish
this, computer B will require 1.2 times as many clock
cycles as computer A to run the program.
$$\text{CPU time}_A = \frac{\text{CPU clock cycles}_A}{\text{clock rate}_A}$$
$$\text{CPU clock cycles}_A = 10\ \text{sec} \times 2 \times 10^9\ \text{cycles/sec} = 20 \times 10^9\ \text{cycles}$$
$$\text{CPU time}_B = \frac{1.2 \times 20 \times 10^9\ \text{cycles}}{\text{clock rate}_B}$$
$$\text{clock rate}_B = \frac{1.2 \times 20 \times 10^9\ \text{cycles}}{6\ \text{seconds}} = 4\ \text{GHz}$$
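The same derivation can be followed step by step in code. This is a sketch under the example's assumptions (2 GHz, 10 s, 1.2 times as many cycles, 6 s target); the function names are illustrative.

```python
def cpu_clock_cycles(cpu_time_s: float, clock_rate_hz: float) -> float:
    """CPU clock cycles = CPU time x clock rate."""
    return cpu_time_s * clock_rate_hz


def required_clock_rate(clock_cycles: float, target_time_s: float) -> float:
    """Clock rate needed to finish the given cycle count in the target time."""
    return clock_cycles / target_time_s


cycles_a = cpu_clock_cycles(cpu_time_s=10.0, clock_rate_hz=2e9)  # 20 x 10^9 cycles
cycles_b = 1.2 * cycles_a             # B needs 1.2 times as many cycles as A
rate_b = required_clock_rate(cycles_b, target_time_s=6.0)
print(f"Computer B needs a {rate_b / 1e9:.1f} GHz clock")
```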
Clock Cycles per Instruction
! Not all instructions take the same amount of time to
execute
! One way to think about execution time is that it equals the
number of instructions executed multiplied by the average time
per instruction
! Clock cycles per instruction (CPI): the average number
of clock cycles each instruction takes to execute
! A way to compare two different implementations of the same ISA
$$\text{\# CPU clock cycles for a program} = \text{\# Instructions for a program} \times \text{Average clock cycles per instruction}$$
CPI for this instruction class
       A   B   C
CPI    1   2   3
Using the Performance Equation
! Computers A and B implement the same ISA. Computer
A has a clock cycle time of 250 ps and an effective CPI of
2.0 for some program and computer B has a clock cycle
time of 500 ps and an effective CPI of 1.2 for the same
program. Which computer is faster and by how much?
Each computer executes the same number of instructions, I,
so
$$\text{CPU time}_A = I \times 2.0 \times 250\ \text{ps} = 500 \times I\ \text{ps}$$
$$\text{CPU time}_B = I \times 1.2 \times 500\ \text{ps} = 600 \times I\ \text{ps}$$
Clearly, A is faster, by the ratio of execution times
$$\frac{\text{performance}_A}{\text{performance}_B} = \frac{\text{execution\_time}_B}{\text{execution\_time}_A} = \frac{600 \times I\ \text{ps}}{500 \times I\ \text{ps}} = 1.2$$
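Because both computers execute the same instruction count I, it cancels out and only CPI x clock cycle time needs to be compared. A small sketch of that comparison (names are illustrative):

```python
def time_per_instruction_ps(cpi: float, cycle_time_ps: float) -> float:
    """Average time per instruction = CPI x clock cycle time."""
    return cpi * cycle_time_ps


time_a = time_per_instruction_ps(cpi=2.0, cycle_time_ps=250.0)  # computer A
time_b = time_per_instruction_ps(cpi=1.2, cycle_time_ps=500.0)  # computer B

# The instruction count I is identical on both machines, so it cancels.
ratio = time_b / time_a
print(f"A: {time_a:.0f} ps/instr, B: {time_b:.0f} ps/instr, "
      f"A is {ratio:.1f} times faster")
```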
Effective (Average) CPI
! Computing the overall effective CPI is done by looking at
the different types of instructions and their individual
cycle counts and averaging
$$\text{Overall effective CPI} = \sum_{i=1}^{n} (CPI_i \times IC_i)$$
! Where $IC_i$ is the count (percentage) of the number of instructions
of class i executed
! $CPI_i$ is the (average) number of clock cycles per instruction for
that instruction class
! n is the number of instruction classes
! The overall effective CPI varies by instruction mix: a
measure of the dynamic frequency of instructions across
one or many programs
THE Performance Equation
! Our basic performance equation is then
$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}$$
or
$$\text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}}$$
! These equations separate the three key factors that
affect performance
! Can measure the CPU execution time by running the program
! The clock rate is usually given
! Can measure overall instruction count by using profilers/
simulators without knowing all of the implementation details
! CPI varies by instruction type and ISA implementation for which
we must know the implementation details
Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle
                       Instruction_count   CPI   clock_cycle
Algorithm                      X            X
Programming language           X            X
Compiler                       X            X
ISA                            X            X         X
Core organization                           X         X
Technology                                            X
A Simple Example
Op       Freq   CPI_i   Freq x CPI_i
ALU      50%    1       0.5
Load     20%    5       1.0
Store    10%    3       0.3
Branch   20%    2       0.4
                Sum =   2.2
! How much faster would the machine be if a better data cache
reduced the average load time to 2 cycles?
Freq x CPI_i becomes 0.5 + 0.4 + 0.3 + 0.4 = 1.6
CPU time new = 1.6 x IC x CC, so 2.2/1.6 = 1.375 means 37.5% faster
! How does this compare with using branch prediction to shave
a cycle off the branch time?
Freq x CPI_i becomes 0.5 + 1.0 + 0.3 + 0.2 = 2.0
CPU time new = 2.0 x IC x CC, so 2.2/2.0 = 1.1 means 10% faster
! What if two ALU instructions could be executed at once?
Freq x CPI_i becomes 0.25 + 1.0 + 0.3 + 0.4 = 1.95
CPU time new = 1.95 x IC x CC, so 2.2/1.95 = 1.128 means 12.8% faster
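The three what-if questions all apply the effective CPI formula to a modified instruction mix. The sketch below encodes the table as a dictionary and recomputes each case; the helper names `effective_cpi` and `with_cpi` are illustrative, and executing two ALU instructions at once is modeled, as on the slide, by halving the ALU CPI.

```python
def effective_cpi(mix: dict) -> float:
    """Sum of frequency x CPI over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix: dict, op: str, new_cpi: float) -> dict:
    """Return a copy of the mix with one class's CPI replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cpi)
    return changed


# Instruction class -> (dynamic frequency, CPI), from the table.
base = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}
base_cpi = effective_cpi(base)

variants = {
    "better data cache (load = 2 cycles)": with_cpi(base, "Load", 2),
    "branch prediction (branch = 1 cycle)": with_cpi(base, "Branch", 1),
    "two ALU instructions at once (ALU = 0.5)": with_cpi(base, "ALU", 0.5),
}
results = {}
for label, mix in variants.items():
    cpi = effective_cpi(mix)
    # IC and CC are unchanged, so the speedup is the ratio of CPIs.
    results[label] = base_cpi / cpi
    print(f"{label}: CPI = {cpi:.2f}, {results[label] - 1:.1%} faster")
```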
Summary: Evaluating ISAs
! Design-time metrics:
! Can it be implemented, in how long, at what cost?
! Can it be programmed? Ease of compilation?
! Static Metrics:
! How many bytes does the program occupy in memory?
! Dynamic Metrics:
! How many instructions are executed? How many bytes does the
processor fetch to execute the program?
! How many clocks are required per instruction?
! How "lean" a clock is practical?
Best Metric: Time to execute the program!
Execution time = Inst. Count x CPI x Cycle Time, which depends on
the instruction set, the processor organization, and compilation
techniques.
Pitfall: MIPS as a Performance Metric
! MIPS: Millions of Instructions Per Second
! Doesn't account for
- Differences in ISAs between computers
- Differences in complexity between instructions
$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$
CPI varies between programs on a given CPU
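To see why MIPS can mislead, consider two compiled versions of the same program on one machine. The numbers below are hypothetical, chosen only for illustration: the version with the higher MIPS rating executes more instructions and ends up slower.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def cpu_time_s(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


CLOCK_HZ = 4e9  # hypothetical 4 GHz machine

# Hypothetical versions: (instruction count, CPI).
version_1 = (15e9, 1.0)  # many simple instructions
version_2 = (8e9, 1.5)   # fewer, more complex instructions

mips_1, time_1 = mips(CLOCK_HZ, version_1[1]), cpu_time_s(*version_1, CLOCK_HZ)
mips_2, time_2 = mips(CLOCK_HZ, version_2[1]), cpu_time_s(*version_2, CLOCK_HZ)

print(f"version 1: {mips_1:.0f} MIPS, {time_1:.2f} s")
print(f"version 2: {mips_2:.0f} MIPS, {time_2:.2f} s")
```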
Workloads and Benchmarks
! Benchmarks: a set of programs that form a "workload"
specifically chosen to measure performance
! SPEC (System Performance Evaluation Cooperative)
creates standard sets of benchmarks starting with
SPEC89. The latest is SPEC CPU2006 which consists
of 12 integer benchmarks (CINT2006) and 17 floating-point
benchmarks (CFP2006).
www.spec.org
! There are also benchmark collections for power
workloads (SPECpower_ssj2008), for mail workloads
(SPECmail2008), for multimedia workloads
(mediabench), ...
Old SPEC Benchmarks
Integer benchmarks                       FP benchmarks
gzip      compression                    wupwise    Quantum chromodynamics
vpr       FPGA place & route             swim       Shallow water model
gcc       GNU C compiler                 mgrid      Multigrid solver in 3D fields
mcf       Combinatorial optimization     applu      Parabolic/elliptic pde
crafty    Chess program                  mesa       3D graphics library
parser    Word processing program        galgel     Computational fluid dynamics
eon       Computer visualization         art        Image recognition (NN)
perlbmk   perl application               equake     Seismic wave propagation simulation
gap       Group theory interpreter       facerec    Facial image recognition
vortex    Object oriented database       ammp       Computational chemistry
bzip2     compression                    lucas      Primality testing
twolf     Circuit place & route          fma3d      Crash simulation fem
                                         sixtrack   Nuclear physics accel
                                         apsi       Pollutant distribution
SPEC CPU Benchmark
! Programs used to measure performance
! Supposedly typical of actual workload
! Standard Performance Evaluation Corp (SPEC)
! Develops benchmarks for CPU, I/O, Web, ...
! SPEC CPU2006
! Elapsed time to execute a selection of programs
- Negligible I/O, so focuses on CPU performance
! Normalize relative to reference machine
! Summarize as geometric mean of performance ratios
- CINT2006 (integer) and CFP2006 (floating-point)
$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
Comparing and Summarizing Performance
! Guiding principle in reporting performance measurements
is reproducibility: list everything another experimenter
would need to duplicate the experiment (version of the
operating system, compiler settings, input set used,
specific computer configuration (clock rate, cache sizes
and speed, memory size and speed, etc.))
! How do we summarize the performance for a benchmark
set with a single number?
! First the execution times are normalized giving the "SPEC
ratio" (bigger is faster, i.e., SPEC ratio = reference time /
execution time)
! The SPEC ratios are then "averaged" using the geometric mean
(GM)
$$GM = \sqrt[n]{\prod_{i=1}^{n} \text{SPEC ratio}_i}$$
SPEC CINT2006 on Barcelona (CC = 0.4 x 10^-9 s)
Name         IC x 10^9   CPI     ExTime (s)   RefTime (s)   SPEC ratio
perl         2,118       0.75    637          9,770         15.3
bzip2        2,389       0.85    817          9,650         11.8
gcc          1,050       1.72    724          8,050         11.1
mcf          336         10.00   1,345        9,120         6.8
go           1,658       1.09    721          10,490        14.6
hmmer        2,783       0.80    890          9,330         10.5
sjeng        2,176       0.96    837          12,100        14.5
libquantum   1,623       1.61    1,047        20,720        19.8
h264avc      3,102       0.80    993          22,130        22.3
omnetpp      587         2.94    690          6,250         9.1
astar        1,082       1.79    773          7,020         9.1
xalancbmk    1,058       2.70    1,143        6,900         6.0
Geometric Mean                                              11.7
CINT2006 for Opteron X4 2356
Name         Description                     IC x 10^9   CPI     Tc (ns)   Exec time   Ref time   SPECratio
perl         Interpreted string processing   2,118       0.75    0.40      637         9,770      15.3
bzip2        Block-sorting compression       2,389       0.85    0.40      817         9,650      11.8
gcc          GNU C Compiler                  1,050       1.72    0.40      724         8,050      11.1
mcf          Combinatorial optimization      336         10.00   0.40      1,345       9,120      6.8
go           Go game (AI)                    1,658       1.09    0.40      721         10,490     14.6
hmmer        Search gene sequence            2,783       0.80    0.40      890         9,330      10.5
sjeng        Chess game (AI)                 2,176       0.96    0.40      837         12,100     14.5
libquantum   Quantum computer simulation     1,623       1.61    0.40      1,047       20,720     19.8
h264avc      Video compression               3,102       0.80    0.40      993         22,130     22.3
omnetpp      Discrete event simulation       587         2.94    0.40      690         6,250      9.1
astar        Games/path finding              1,082       1.79    0.40      773         7,020      9.1
xalancbmk    XML parsing                     1,058       2.70    0.40      1,143       6,900      6.0
Geometric mean                                                                                    11.7
Note: the highest CPI values (e.g., mcf) reflect high cache miss rates
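The SPEC ratios and their geometric mean can be recomputed from the execution and reference times in the table. The sketch below takes the ratio as reference time divided by execution time; for perl it uses the 9,770 s reference time, and the helper names are illustrative.

```python
import math

# Benchmark -> (execution time in s, reference time in s), from the table.
times = {
    "perl": (637, 9770), "bzip2": (817, 9650), "gcc": (724, 8050),
    "mcf": (1345, 9120), "go": (721, 10490), "hmmer": (890, 9330),
    "sjeng": (837, 12100), "libquantum": (1047, 20720),
    "h264avc": (993, 22130), "omnetpp": (690, 6250),
    "astar": (773, 7020), "xalancbmk": (1143, 6900),
}


def spec_ratio(exec_time_s: float, ref_time_s: float) -> float:
    """Normalized performance: reference time / measured execution time."""
    return ref_time_s / exec_time_s


def geometric_mean(values) -> float:
    """n-th root of the product, computed via logarithms for stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = {name: spec_ratio(ex, ref) for name, (ex, ref) in times.items()}
gm = geometric_mean(ratios.values())

for name, ratio in ratios.items():
    print(f"{name:<12}{ratio:5.1f}")
print(f"Geometric mean: {gm:.1f}")
```

The geometric mean is used because it gives the same relative ranking regardless of which machine is chosen as the reference.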
SPEC Power Benchmark
! Power consumption of server at different workload levels
! Performance: ssj_ops/sec
! Power: Watts (Joules/sec)
$$\text{Overall ssj\_ops per Watt} = \left(\sum_{i=0}^{10} \text{ssj\_ops}_i\right) \Big/ \left(\sum_{i=0}^{10} \text{power}_i\right)$$
SPECpower_ssj2008 for X4
Target Load %   Performance (ssj_ops/sec)   Average Power (Watts)
100%            231,867                     295
90%             211,282                     286
80%             185,803                     275
70%             163,427                     265
60%             140,160                     256
50%             118,324                     246
40%             92,035                      233
30%             70,500                      222
20%             47,126                      206
10%             23,066                      180
0%              0                           141
Overall sum     1,283,590                   2,605
Sum of ssj_ops / sum of power               493
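The overall figure is the sum of the performance column divided by the sum of the power column, not an average of per-level ratios. A short sketch that recomputes it from the table values (the variable names are illustrative):

```python
# Target load (%) -> (ssj_ops/sec, average power in watts), from the table.
measurements = {
    100: (231_867, 295), 90: (211_282, 286), 80: (185_803, 275),
    70: (163_427, 265), 60: (140_160, 256), 50: (118_324, 246),
    40: (92_035, 233), 30: (70_500, 222), 20: (47_126, 206),
    10: (23_066, 180), 0: (0, 141),
}

total_ops = sum(ops for ops, _ in measurements.values())
total_power = sum(watts for _, watts in measurements.values())
overall = total_ops / total_power

print(f"sum of ssj_ops = {total_ops:,}")
print(f"sum of power   = {total_power:,} W")
print(f"overall ssj_ops per watt = {overall:.0f}")
```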
Other Performance Metrics
! Power consumption: especially in the embedded market
where battery life is important
! For power-limited applications, the most important metric is
energy efficiency
Power Trends
! In CMOS IC technology
$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$
[Figure annotations: power x30, voltage 5 V -> 1 V, frequency x1000]
Reducing Power
! Suppose a new CPU has
! 85% of capacitive load of old CPU
! 15% voltage and 15% frequency reduction
$$\frac{P_{old}}{P_{new}} = \frac{C_{old} \times V_{old}^2 \times F_{old}}{C_{old} \times 0.85 \times (V_{old} \times 0.85)^2 \times F_{old} \times 0.85} = \frac{1}{0.85^4} = 1.92$$
! The power wall
- We can't reduce voltage further
- We can't remove more heat
! How else can we improve performance?
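The power ratio for the new CPU follows directly from Power = Capacitive load x Voltage^2 x Frequency. The sketch below normalizes the old CPU's parameters to 1 (an assumption made only to keep the arithmetic visible) and scales each by 0.85:

```python
def dynamic_power(capacitive_load: float, voltage: float, frequency: float) -> float:
    """Dynamic power in CMOS: capacitive load x voltage^2 x frequency."""
    return capacitive_load * voltage ** 2 * frequency


# Old CPU normalized to 1.0 for every parameter (illustrative units).
p_old = dynamic_power(1.0, 1.0, 1.0)
# New CPU: 85% of the capacitive load, voltage, and frequency.
p_new = dynamic_power(0.85, 0.85, 0.85)

print(f"P_new / P_old = {p_new / p_old:.2f}")
print(f"P_old / P_new = {p_old / p_new:.2f}")
```

Voltage enters squared, which is why lowering it was the most effective lever until it could not be reduced further.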
Fallacy: Low Power at Idle
! X4 power benchmark
! At 100% load: 295W
! At 50% load: 246W (83%)
! At 10% load: 180W (61%)
! Google data center
! Mostly operates at 10%-50% load
! At 100% load less than 1% of the time
! Consider designing processors to make power
proportional to load
Amdahl's Law
! Pitfall: Improving an aspect of a computer and
expecting a proportional improvement in overall
performance
$$S = \frac{T_{CPU\_org}}{T_{CPU\_imp}} = \frac{1}{\dfrac{Fm}{Sm} + (1 - Fm)}$$
! Fm = Fraction of the execution time affected by the improvement
! Sm = Factor of improvement (speedup of the improved part)
! The opportunity for improvement is affected by how
much time the modified event consumes
! Corollary 1: make the common case fast
! The performance enhancement possible with a given
improvement is limited by the amount the improved
feature is used
! Corollary 2: the bottleneck will limit the improvement
Amdahl's Law
! Example: multiply accounts for 80s/100s
- How much improvement in multiply performance to
get 5x overall?
$$20 = \frac{80}{n} + 20$$
- Can't be done!
! Study the limit cases in Amdahl's law
! Fm = 0: $S = \frac{1}{0 + (1)} = 1$
! Sm = 1: $S = \frac{1}{Fm + (1 - Fm)} = 1$
! Fm = 1: $S = \frac{1}{1/Sm + (1 - 1)} = Sm$
! Sm = inf: $S = \frac{1}{0 + (1 - Fm)}$
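The multiply example and the limit cases can be explored numerically. A minimal sketch of the speedup formula (the function name is illustrative), showing that with Fm = 0.8 the overall speedup approaches but never reaches 5:

```python
def amdahl_speedup(fm: float, sm: float) -> float:
    """Overall speedup S = 1 / (Fm / Sm + (1 - Fm))."""
    if not 0.0 <= fm <= 1.0:
        raise ValueError("Fm must lie between 0 and 1")
    if sm <= 0:
        raise ValueError("Sm must be positive")
    return 1.0 / (fm / sm + (1.0 - fm))


# Multiply takes 80 s out of 100 s, so Fm = 0.8.
for sm in (2, 5, 10, 100, 1_000_000):
    print(f"Sm = {sm:>9}: overall speedup = {amdahl_speedup(0.8, sm):.4f}")

# Limit cases from the slide.
print(amdahl_speedup(0.0, 10.0))  # Fm = 0: no part is improved
print(amdahl_speedup(0.8, 1.0))   # Sm = 1: the improved part is not faster
print(amdahl_speedup(1.0, 7.0))   # Fm = 1: the whole program speeds up by Sm
```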
Manufacturing ICs
! Yield: proportion of working dies per wafer
AMD Opteron X2 Wafer
! X2: 300mm wafer, 117 chips, 90nm technology
! X4: 45nm technology
Integrated Circuit Cost
! Nonlinear relation to area and defect rate
! Wafer cost and area are fixed
! Defect rate determined by manufacturing process
! Die area determined by architecture and circuit design
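$$\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$$
$$\text{Dies per wafer} \approx \frac{\text{Wafer area}}{\text{Die area}}$$
$$\text{Yield} = \frac{1}{(1 + (\text{Defects per area} \times \text{Die area}/2))^2}$$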
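The nonlinear dependence on die area shows up when the three formulas are combined. In the sketch below the wafer cost, defect density, and die areas are hypothetical values chosen only for illustration; the 300 mm wafer diameter matches the Opteron X2 example.

```python
import math


def dies_per_wafer(wafer_area: float, die_area: float) -> float:
    """Approximation: wafer area / die area (ignores wasted edge area)."""
    return wafer_area / die_area


def yield_fraction(defects_per_area: float, die_area: float) -> float:
    """Yield = 1 / (1 + defects per area x die area / 2)^2."""
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2


def cost_per_die(wafer_cost, wafer_area, die_area, defects_per_area):
    """Cost per die = cost per wafer / (dies per wafer x yield)."""
    good_dies = dies_per_wafer(wafer_area, die_area) * yield_fraction(
        defects_per_area, die_area)
    return wafer_cost / good_dies


WAFER_AREA_CM2 = math.pi * 15.0 ** 2  # 300 mm wafer, radius 15 cm
WAFER_COST = 5000.0                   # hypothetical cost per wafer
DEFECTS_PER_CM2 = 0.5                 # hypothetical defect density

cost_small = cost_per_die(WAFER_COST, WAFER_AREA_CM2, 1.0, DEFECTS_PER_CM2)
cost_large = cost_per_die(WAFER_COST, WAFER_AREA_CM2, 2.0, DEFECTS_PER_CM2)
print(f"1 cm^2 die: {cost_small:.2f} per die")
print(f"2 cm^2 die: {cost_large:.2f} per die")
```

Doubling the die area more than doubles the cost per die, because fewer dies fit on the wafer and a smaller fraction of them work.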
Concluding Remarks
! Cost/performance is improving
! Due to underlying technology development
! Hierarchical layers of abstraction
! In both hardware and software
! Instruction set architecture
! The hardware/software interface
! Execution time: the best performance measure
! Power is a limiting factor
! Use parallelism to improve performance