Performance of Computer Systems
CSE 675.02: Introduction to Computer Architecture
• "X is n times faster than Y" means:
$$ n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} $$
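Here is a minimal Python sketch of this definition. It assumes, as is standard, that performance is the reciprocal of execution time. The function names and the 10 s / 15 s timings are hypothetical.

```python
def performance(exec_time_s: float) -> float:
    """Assumed definition: performance is the reciprocal of execution time."""
    return 1.0 / exec_time_s


def times_faster(time_x_s: float, time_y_s: float) -> float:
    """n = Performance(X) / Performance(Y): how many times faster X is than Y."""
    return performance(time_x_s) / performance(time_y_s)


# Hypothetical timings: X runs the program in 10 s, Y runs it in 15 s.
n = times_faster(10.0, 15.0)
print(f"X is {n:.2f} times faster than Y")  # X is 1.50 times faster than Y
```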
CPU Time Equation
• CPU time = Clock cycles for a program * Clock cycle time
  = Clock cycles for a program / Clock rate
Clock cycles for a program is the total number of clock cycles needed to execute all instructions of a given program.
• CPU time = Instruction count * CPI / Clock rate
Instruction count is the number of instructions executed, sometimes referred to as the instruction path length.
CPI, the average number of clock cycles per instruction (for a given execution of a given program), is an important parameter given as:
CPI = Clock cycles for a program / Instruction count

Calculating Components of CPU time
• For an existing processor it is easy to obtain the CPU time (i.e. the execution time) by measurement, and the clock rate is known. But it is difficult to figure out the instruction count or CPI.
Newer processors, the MIPS processor being one example, include counters for instructions executed and for clock cycles. These can help programmers who are trying to understand and tune the performance of an application.
• Also, different simulation techniques and queuing theory could be used to obtain values for the components of the execution (CPU) time.
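A small sketch can check the two forms of the CPU time equation against each other. The counter readings (2 × 10^9 cycles, 10^9 instructions, 500 MHz clock) are hypothetical values of the kind that instruction and cycle counters might report. The function names are illustrative only.

```python
def cpu_time_from_cycles(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time = Clock cycles for a program / Clock rate (seconds)."""
    return clock_cycles / clock_rate_hz


def cpi(clock_cycles: float, instruction_count: float) -> float:
    """CPI = Clock cycles for a program / Instruction count."""
    return clock_cycles / instruction_count


def cpu_time(instruction_count: float, cpi_value: float, clock_rate_hz: float) -> float:
    """CPU time = Instruction count * CPI / Clock rate (seconds)."""
    return instruction_count * cpi_value / clock_rate_hz


# Hypothetical counter readings and clock rate.
cycles = 2e9
instructions = 1e9
rate_hz = 500e6

print(cpu_time_from_cycles(cycles, rate_hz))                      # 4.0
print(cpu_time(instructions, cpi(cycles, instructions), rate_hz))  # 4.0
```

Both calls print the same time, because Instruction count * CPI is just the total clock cycle count written in another way.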
g. babic Presentation C 7
CPU Time: Example 2
Consider another implementation of the MIPS ISA with a 1 GHz clock, where
– each ALU instruction takes 4 clock cycles,
– each branch/jump instruction takes 3 clock cycles,
– each sw instruction takes 5 clock cycles,
– each lw instruction takes 6 clock cycles.
Also, consider the same program as in Example 1.
Find CPI and CPU time. Assume a sequentially executing CPU.
---------------------------------------------------------------------------------
$$ \text{CPI} = \frac{x \cdot 4 + y \cdot 3 + z \cdot 5 + w \cdot 6}{x + y + z + w} = 4.03 \text{ clock cycles/instruction} $$
$$ \text{CPU time} = \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} = \frac{(x + y + z + w) \times 4.03}{1000 \times 10^{6}} = \frac{300 \times 10^{6} \times 4.03}{1000 \times 10^{6}} = 1.21 \text{ sec} $$

Calculating CPI
$$ \text{CPI} = \frac{x \cdot 3 + y \cdot 2 + z \cdot 4 + w \cdot 5}{x + y + z + w} $$
The calculation is not necessarily correct (and usually it isn't), since the given numbers of cycles per instruction don't account for pipeline effects and other advanced design techniques.
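The CPI formula is an average of per-class cycle counts, weighted by how many instructions fall in each class. The sketch below assumes that x, y, z, w are the numbers of ALU, branch/jump, sw and lw instructions, following the order in which Example 2 lists the classes. The instruction mix is hypothetical because Example 1's counts are not reproduced here. The last line recomputes the CPU time from the slide's own figures.

```python
def weighted_cpi(counts: dict, cycles: dict) -> float:
    """CPI = sum(count_c * cycles_c) / sum(count_c) over instruction classes c.

    Only meaningful for a sequentially executing CPU with fixed cycles per class.
    """
    total_cycles = sum(counts[c] * cycles[c] for c in counts)
    return total_cycles / sum(counts.values())


def cpu_time_s(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = Instruction count * CPI / Clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Cycles per class for Example 2 and for the "Calculating CPI" formula
# (assumed pairing: x = ALU, y = branch/jump, z = sw, w = lw).
EXAMPLE2 = {"alu": 4, "branch_jump": 3, "sw": 5, "lw": 6}
CALC_CPI = {"alu": 3, "branch_jump": 2, "sw": 4, "lw": 5}

# Hypothetical instruction mix, NOT the (unstated) mix of Example 1.
mix = {"alu": 100, "branch_jump": 50, "sw": 25, "lw": 25}
print(weighted_cpi(mix, EXAMPLE2))  # 4.125
print(weighted_cpi(mix, CALC_CPI))  # 3.125

# Figures from the worked answer: 300 * 10^6 instructions, CPI 4.03, 1 GHz clock.
print(round(cpu_time_s(300e6, 4.03, 1e9), 2))  # 1.21
```

In Example 2 every class costs exactly one cycle more than in the "Calculating CPI" formula. So if x, y, z, w denote the same counts in both formulas, the Example 2 CPI is exactly 1 higher for any mix, since sum(count * (cycles + 1)) / sum(count) = CPI + 1.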
– ID: Instruction decode and register fetch
– EX: Execution, effective address or branch calculation
– MEM: Memory access (for lw and sw instructions only)
– WB: Register write back (for ALU and lw instructions)
[Figure: sequential execution, in program execution order, of lw r1, 100(r4); lw r2, 200(r5); lw r3, 300(r6). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg and takes 8 ns before the next one starts.]
Every lw instruction needs 8 nsec to execute.
In this course, we shall design a processor that executes instructions sequentially, i.e. as illustrated here.
Pipelined Laundry
[Figure: pipelined laundry, tasks A, B, C, D in task order plotted against time from 6 PM to midnight; segment durations 30, 40, 40, 40, 40, 20 minutes.]
Pipelined laundry takes 3.5 hours for 4 loads;

Pipeline Executing 3 LW Instructions
• Assume the same delays as in the sequential case and a pipelined processor with a clock cycle time of 2 nsec.
[Figure: pipelined execution, in program execution order, of lw r1, 100(r4); lw r2, 200(r5); lw r3, 300(r6) on a time axis from 2 to 14 ns. Each instruction starts 2 ns after the previous one and passes through Instruction fetch, Reg, ALU, Data access, Reg, each stage taking 2 ns.]
Note that registers are written during the first part of a cycle and read during the second part of the same cycle.
• Pipelining doesn't help to execute a single instruction; it may improve performance by increasing instruction throughput;
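Simple formulas reproduce the timings in the sequential and pipelined figures. This sketch assumes an ideal pipeline with no stalls and five stages (the Instruction fetch, Reg, ALU, Data access, Reg steps shown for each lw). For the laundry, it assumes stage times of 30, 40 and 20 minutes for wash, dry and fold, as in the familiar laundry analogy. The stage names and the function names are assumptions.

```python
def sequential_time(n_instructions: int, instr_time: float) -> float:
    """Sequential CPU: each instruction finishes before the next one starts."""
    return n_instructions * instr_time


def pipelined_time(n_instructions: int, n_stages: int, cycle_time: float) -> float:
    """Ideal pipeline (no stalls): the first instruction takes n_stages cycles,
    and each following instruction completes one cycle after the previous one."""
    return (n_stages + n_instructions - 1) * cycle_time


def pipelined_laundry(n_loads: int, stage_minutes: list) -> float:
    """Identical loads through unequal stages: the first load passes every stage,
    then each further load adds the time of the slowest stage."""
    return sum(stage_minutes) + (n_loads - 1) * max(stage_minutes)


print(sequential_time(3, 8))    # 24  (ns: three lw instructions at 8 ns each)
print(pipelined_time(3, 5, 2))  # 14  (ns: five stages, 2 ns clock cycle)

# Assumed stage times: wash 30, dry 40, fold 20 minutes.
minutes = pipelined_laundry(4, [30, 40, 20])
print(minutes, minutes / 60)    # 210 3.5
```

With these numbers a single lw takes 5 * 2 = 10 ns in the pipeline versus 8 ns sequentially. This illustrates that pipelining improves throughput, not the latency of one instruction.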
Summarizing Performance
• The arithmetic mean of the execution times is given as:
$$ \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i $$
where $\text{Time}_i$ is the execution time for the ith program of a total of n in the workload (benchmark).
• The weighted arithmetic mean of the execution times is given as:
$$ \sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i $$
where $\text{Weight}_i$ is the frequency of the ith program in the workload.
• The geometric mean of execution times is given as:
$$ \sqrt[n]{\prod_{i=1}^{n} \text{Time}_i}, \qquad \text{where } \prod_{i=1}^{n} x_i = x_1 \times x_2 \times x_3 \times \dots \times x_n $$

Summarizing SPEC CPU2000 Performance
SPEC CPU2000 summarizes performance using a geometric mean of ratios, with larger numbers indicating higher performance.
CINT2000 is an indicator of integer performance and is given as:
$$ \text{CINT2000} = k_1 \sqrt[12]{\prod_{i=1}^{12} \frac{\text{Exec time base}_i}{\text{CPU time}_i}} $$
where $k_1$ is a coefficient and $\text{CPU time}_i$ is the CPU time for the ith integer program of a total of 12 programs in the workload.
Similarly, for floating point performance, CFP2000 is given as:
$$ \text{CFP2000} = k_2 \sqrt[14]{\prod_{i=1}^{14} \frac{\text{FPExec time base}_i}{\text{FPExec time}_i}} $$
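The three means and the SPEC-style summary translate directly into code. This is a sketch with hypothetical inputs. spec_style_score only mirrors the k × geometric-mean-of-ratios structure of the CINT2000/CFP2000 formulas (k = 100 and all times are made up) and does not reproduce SPEC's actual reporting rules.

```python
import math


def arithmetic_mean(times):
    """(1/n) * sum of Time_i."""
    return sum(times) / len(times)


def weighted_arithmetic_mean(weights, times):
    """Sum of Weight_i * Time_i; weights are program frequencies, assumed to sum to 1."""
    return sum(w * t for w, t in zip(weights, times))


def geometric_mean(values):
    """n-th root of x_1 * x_2 * ... * x_n (math.prod requires Python 3.8+)."""
    return math.prod(values) ** (1.0 / len(values))


def spec_style_score(k, base_times, measured_times):
    """k * geometric mean of (base time / measured time) ratios."""
    ratios = [b / m for b, m in zip(base_times, measured_times)]
    return k * geometric_mean(ratios)


times = [2.0, 8.0]                                    # hypothetical execution times (s)
print(arithmetic_mean(times))                         # 5.0
print(weighted_arithmetic_mean([0.25, 0.75], times))  # 6.5
print(geometric_mean(times))                          # 4.0
print(spec_style_score(100, [10.0, 40.0], [5.0, 10.0]))
```

For long lists of programs, averaging logarithms (exp of the mean of the logs) avoids overflow in the product. math.prod is used here only for readability.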
Performance Example (part 5/5)
c. Calculate MFLOPS for both computers.
$$ \text{MFLOPS} = \frac{\text{Number of floating point operations in a program}}{\text{Execution time} \times 10^{6}} $$
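Here is a one-function sketch of the MFLOPS formula. The operation count and execution time are hypothetical, because the data for the two computers in this example are not part of this section.

```python
def mflops(fp_operations: float, exec_time_s: float) -> float:
    """MFLOPS = floating point operations / (execution time * 10^6)."""
    return fp_operations / (exec_time_s * 1e6)


# Hypothetical: 50 million floating point operations executed in 2 seconds.
print(mflops(50e6, 2.0))  # 25.0
```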