Performance Issues
Designing for Performance
 The cost of computer systems continues to drop dramatically, while the performance
and capacity of those systems continue to rise equally dramatically

 Today’s laptops have the computing power of an IBM mainframe from 10 or 15 years ago

 Processors are so inexpensive that we now have microprocessors we throw away

 Desktop applications that require the great power of today’s microprocessor-based
systems include:
 Image processing
 Three-dimensional rendering
 Speech recognition
 Videoconferencing
 Multimedia authoring
 Voice and video annotation of files
 Simulation modeling

 Businesses are relying on increasingly powerful servers to handle transaction and
database processing and to support massive client/server networks that have
replaced the huge mainframe computer centers of yesteryear

 Cloud service providers use massive high-performance banks of servers to
satisfy high-volume, high-transaction-rate applications for a broad spectrum of
clients
Microprocessor Speed
Techniques built into contemporary processors include:

 Pipelining: The processor moves data or instructions into a conceptual pipe,
with all stages of the pipe processing simultaneously

 Branch prediction: The processor looks ahead in the instruction code fetched
from memory and predicts which branches, or groups of instructions, are likely
to be processed next

 Superscalar execution: The ability to issue more than one instruction in every
processor clock cycle (in effect, multiple parallel pipelines are used)

 Data flow analysis: The processor analyzes which instructions are dependent on
each other’s results, or data, to create an optimized schedule of instructions

 Speculative execution: Using branch prediction and data flow analysis, some
processors speculatively execute instructions ahead of their actual appearance
in the program execution, holding the results in temporary locations, keeping
execution engines as busy as possible
Performance Balance

 Adjust the organization and architecture to compensate for the mismatch among
the capabilities of the various components

 Architectural examples include:
 Increase the number of bits that are retrieved at one time by making DRAMs
“wider” rather than “deeper” and by using wide bus data paths
 Reduce the frequency of memory access by incorporating increasingly complex
and efficient cache structures between the processor and main memory
 Change the DRAM interface to make it more efficient by including a cache or
other buffering scheme on the DRAM chip
 Increase the interconnect bandwidth between processors and memory by using
higher speed buses and a hierarchy of buses to buffer and structure data flow
[Figure 2.1 Typical I/O Device Data Rates: a bar chart of data rates (bps, on a
log scale from 10^1 to 10^11) for the keyboard, mouse, scanner, laser printer,
optical disc, hard disk, Wi-Fi modem (max speed), graphics display, and
Ethernet modem (max speed)]

Improvements in Chip
Organization and Architecture
 Increase hardware speed of processor
 Fundamentally due to shrinking logic gate size
 More gates, packed more tightly, increasing clock rate
 Propagation time for signals reduced

 Increase size and speed of caches
 Dedicating part of processor chip
 Cache access times drop significantly

 Change processor organization and architecture
 Increase effective speed of instruction execution
 Parallelism
Problems with Clock Speed and
Logic Density
 Power
 Power density increases with density of logic and clock speed
 Dissipating heat

 RC delay
 Speed at which electrons flow limited by resistance and
capacitance of metal wires connecting them
 Delay increases as the RC product increases
 As components on the chip decrease in size, the wire
interconnects become thinner, increasing resistance
 Also, the wires are closer together, increasing capacitance

 Memory latency
 Memory speeds lag processor speeds
[Figure 2.2 Processor Trends: a log-scale plot (0.1 to 10^7) of transistors
(thousands), frequency (MHz), power (W), and number of cores, from 1970 to 2010]


Multicore

 The use of multiple processors on the same chip provides the potential to
increase performance without increasing the clock rate

 Strategy is to use two simpler processors on the chip rather than one more
complex processor

 With two processors larger caches are justified

 As caches became larger it made performance sense to create two and then
three levels of cache on a chip
+
Many Integrated Core (MIC)
Graphics Processing Unit (GPU)
MIC
 Leap in performance as well as the challenges in developing software to
exploit such a large number of cores
 The multicore and MIC strategy involves a homogeneous collection of general
purpose processors on a single chip

GPU
 Core designed to perform parallel operations on graphics data
 Traditionally found on a plug-in graphics card, it is used to encode and
render 2D and 3D graphics as well as process video
 Used as vector processors for a variety of applications that require
repetitive computations
Amdahl’s Law

 Gene Amdahl

 Deals with the potential speedup of a program using multiple processors
compared to a single processor

 Illustrates the problems facing industry in the development of multi-core
machines
 Software must be adapted to a highly parallel execution environment to
exploit the power of parallel processing

 Can be generalized to evaluate and design technical improvement in a
computer system
For a program with total execution time T on a single processor, where a
fraction f of the computation can be parallelized across N processors:

    Speedup = T / [(1 – f)T + fT/N] = 1 / [(1 – f) + f/N]

Figure 2.3 Illustration of Amdahl’s Law


[Figure 2.4 Amdahl’s Law for Multiprocessors: speedup versus number of
processors for f = 0.95, f = 0.90, f = 0.75, and f = 0.5]
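The speedup curves in Figure 2.4 follow directly from the formula above. A minimal sketch in Python (the function name is ours):

```python
def amdahl_speedup(f, n):
    """Speedup of a program in which a fraction f of the work is
    parallelizable, run on n processors (Amdahl's law)."""
    return 1.0 / ((1.0 - f) + f / n)

# With f = 0.95 and 8 processors the speedup is about 5.9x; even with
# an unlimited number of processors it can never exceed 1/(1 - f) = 20x.
print(amdahl_speedup(0.95, 8))
```

Note how quickly the serial fraction dominates: going from 8 to a billion processors only improves the f = 0.95 case from about 5.9x to 20x.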


Little’s Law

 Fundamental and simple relation with broad applications

 Can be applied to almost any system that is statistically in steady state,
and in which there is no leakage

 Queuing system
 If the server is idle an item is served immediately; otherwise an arriving
item joins a queue
 There can be a single queue for a single server or for multiple servers, or
multiple queues with one being for each of multiple servers

 Average number of items in a queuing system equals the average rate at which
items arrive multiplied by the time that an item spends in the system
 Relationship requires very few assumptions
 Because of its simplicity and generality it is extremely useful
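Little's Law is simply L = λW. A tiny illustration (the request rate and residence time are invented numbers):

```python
def littles_law_items(arrival_rate, avg_time_in_system):
    """Average number of items in a steady-state system: L = lambda * W."""
    return arrival_rate * avg_time_in_system

# A server receiving 10 requests/second, each spending 0.5 s in the
# system (queueing plus service), holds 5 requests on average.
print(littles_law_items(10.0, 0.5))
```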
Basic Measures of Computer Performance

 Clock Speed

 Instruction Execution Rate
 MIPS
 MFLOPS

[Figure 2.5 System Clock: a quartz crystal oscillator produces an analog
signal that is converted to a digital clock signal by analog-to-digital
conversion. From Computer Desktop Encyclopedia, 1998, The Computer Language Co.]
Basic Measures of Computer Performance

Performance = 1 / Execution time, simplified to

Performance = 1 / CPU execution time

(Performance of M1) / (Performance of M2) = Speedup of M1 over M2
= (Execution time of M2) / (Execution time of M1)

Terminology: M1 is x times as fast as M2 (e.g., 1.5 times as fast)
M1 is 100(x – 1)% faster than M2 (e.g., 50% faster)

CPU time = Instructions × (Cycles per instruction) × (Secs per cycle)
         = Instructions × CPI / (Clock rate)

Instruction count, CPI, and clock rate are not completely independent,
so improving one by a given factor may not lead to overall execution
time improvement by the same factor.
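The identities above are easy to check numerically. A sketch with invented machine parameters (M1, M2, and all numbers are hypothetical):

```python
def cpu_time(instructions, cpi, clock_rate_hz):
    """CPU time = Ic x CPI / clock rate."""
    return instructions * cpi / clock_rate_hz

# Same program (2 billion dynamic instructions) on two machines:
# M1: CPI 1.5 at 2 GHz; M2: CPI 2.0 at 1.5 GHz.
t1 = cpu_time(2e9, 1.5, 2.0e9)   # 1.5 s
t2 = cpu_time(2e9, 2.0, 1.5e9)   # ~2.67 s

# Speedup of M1 over M2 = (execution time of M2) / (execution time of M1)
speedup = t2 / t1
# M1 is 100*(x - 1)% faster than M2
percent_faster = 100 * (speedup - 1)
```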
Elaboration on the CPU Time Formula

CPU time = Instructions × (Cycles per instruction) × (Secs per cycle)
         = Instructions × Average CPI / (Clock rate)

Instructions: Number of instructions executed, not number of instructions
in our program (dynamic count)

Average CPI: Calculated based on the dynamic instruction mix and knowledge
of how many clock cycles are needed to execute various instructions (or
instruction classes)

Clock rate: 1 GHz = 10^9 cycles/s (clock period 10^–9 s = 1 ns)
            200 MHz = 200 × 10^6 cycles/s (clock period = 5 ns)
Dynamic Instruction Count

How many instructions are executed in this program fragment? Each “for”
consists of two instructions per iteration: increment index, check exit
condition.

250 instructions
for i = 1, 100 do
  20 instructions
  for j = 1, 100 do
    40 instructions
    for k = 1, 100 do
      10 instructions
    endfor
  endfor
endfor

 Innermost (k) loop: 2 + 10 instructions per iteration, 100 iterations,
1,200 instructions in all
 Middle (j) loop: 2 + 40 + 1,200 instructions per iteration, 100 iterations,
124,200 instructions in all
 Outer (i) loop: 2 + 20 + 124,200 instructions per iteration, 100 iterations,
12,422,200 instructions in all
 Total dynamic count: 250 + 12,422,200 = 12,422,450 instructions
 Static count = 326
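The arithmetic above can be reproduced mechanically, using the slide's convention of two overhead instructions per iteration of each "for":

```python
def loop_dynamic_count(body_instrs, iterations, overhead=2):
    """Dynamic instruction count of a counted loop: every iteration
    executes the loop overhead (increment + exit test) plus the body."""
    return (overhead + body_instrs) * iterations

inner  = loop_dynamic_count(10, 100)           # 1,200
middle = loop_dynamic_count(40 + inner, 100)   # 124,200
outer  = loop_dynamic_count(20 + middle, 100)  # 12,422,200
total  = 250 + outer                           # 12,422,450
print(total)
```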
Basic Measures of Computer
Performance

 CPI = Σᵢ₌₁ⁿ (CPIᵢ × Iᵢ) / Ic

 Processor Time T = Ic × CPI × τ

 T = Ic × [p + (m × k)] × τ

 MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6)

 MFLOPS rate = (No. of executed floating-point operations in a program) /
(Execution time × 10^6)
                               Ic    p    m    k    τ
Instruction set architecture   X     X
Compiler technology            X     X    X
Processor implementation             X              X
Cache and memory hierarchy                     X    X

Table 2.1 Performance Factors and System Attributes
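A numeric sketch of the CPI and MIPS formulas above (the instruction mix and clock rate are invented for illustration):

```python
def average_cpi(mix):
    """mix: list of (instruction_count, cycles_per_instruction) pairs.
    CPI = sum(CPI_i * I_i) / Ic."""
    total_instr = sum(count for count, _ in mix)
    total_cycles = sum(count * cpi for count, cpi in mix)
    return total_cycles / total_instr

def mips_rate(clock_rate_hz, cpi):
    """MIPS rate = f / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)

# Hypothetical mix on a 400 MHz processor: 60% of executed instructions
# take 1 cycle, 30% take 2 cycles, 10% take 4 cycles.
cpi = average_cpi([(60, 1), (30, 2), (10, 4)])  # CPI = 1.6
print(mips_rate(400e6, cpi))                    # ~250 MIPS
```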


Example
 The use of benchmarks to compare systems involves calculating the mean value
of a set of data points related to execution time

 The three common formulas used for calculating a mean are:
• Arithmetic
• Geometric
• Harmonic
[Figure 2.6 Comparison of Means on Various Data Sets (each set has a maximum
data point value of 11): for each of the data sets (a)–(g) below, each with
eleven data points, the chart plots the median (MD), arithmetic mean (AM),
geometric mean (GM), and harmonic mean (HM) on a scale of 0 to 11.]

(a) Constant (11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11)
(b) Clustered around a central value (3, 5, 6, 6, 7, 7, 7, 8, 8, 9, 11)
(c) Uniform distribution (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
(d) Large-number bias (1, 4, 4, 7, 7, 9, 9, 10, 10, 11, 11)
(e) Small-number bias (1, 1, 2, 2, 3, 3, 5, 5, 8, 8, 11)
(f) Upper outlier (11, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
(g) Lower outlier (1, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11)
Arithmetic Mean

 An Arithmetic Mean (AM) is an appropriate measure if the sum of all the
measurements is a meaningful and interesting value
 The AM is a good candidate for comparing the execution time performance of
several systems

For example, suppose we were interested in using a system for large-scale
simulation studies and wanted to evaluate several alternative products. On
each system we could run the simulation multiple times with different input
values for each run, and then take the average execution time across all runs.
The use of multiple runs with different inputs should ensure that the results
are not heavily biased by some unusual feature of a given input set. The AM of
all the runs is a good measure of the system’s performance on simulations, and
a good number to use for system comparison.
Harmonic Mean

 The AM used for a time-based variable, such as program execution time, has
the important property that it is directly proportional to the total time
 If the total time doubles, the mean value doubles

 For some situations, a system’s execution rate may be viewed as a more
useful measure of the value of the system
 This could be either the instruction execution rate, measured in MIPS or
MFLOPS, or a program execution rate, which measures the rate at which a given
type of program can be executed

 Consider how we wish the calculated mean to behave. It makes no sense to say
that we would like the mean rate to be proportional to the total rate, where
the total rate is defined as the sum of the individual rates. The sum of the
rates would be a meaningless statistic. Rather, we would like the mean to be
inversely proportional to the total execution time.

 For example, if the total time to execute all the benchmark programs in a
suite of programs is twice as much for system C as for system D, we would want
the mean value of the execution rate to be half as much for system C as for
system D.
Example: How do the AM and HM
perform?

 Suppose we have a set of n benchmark programs and record the execution times
of each program on a given system as t1, t2, …, tn. For simplicity, let us
assume that each program executes the same number of operations Z; we could
weight the individual programs and calculate accordingly, but this would not
change the conclusion of our argument.

 The execution rate for each individual program is Ri = Z/ti. We use the AM
to calculate the average execution rate.

 The AM of the execution rates is proportional to the sum of the inverse
execution times, which is not the same as being inversely proportional to the
sum of the execution times. Thus, the AM does not have the desired property.
Example: How do the AM and HM
perform?

 HM
 The HM is inversely proportional to the total execution time, which is the
desired property.

Table 2.2 A Comparison of Arithmetic and Harmonic Means for Rates

                              Computer A   Computer B   Computer C
Program 1 (10^8 FP ops)
  time (secs)                    2.0          1.0          0.75
  rate (MFLOPS)                   50          100        133.33
Program 2 (10^8 FP ops)
  time (secs)                    0.75         2.0          4.0
  rate (MFLOPS)                133.33          50           25
Total execution time (secs)      2.75         3.0          4.75
Arithmetic mean of times         1.38         1.5          2.38
Inverse of total execution
time (1/sec)                     0.36         0.33         0.21
Arithmetic mean of rates        91.67        75.00        79.17
Harmonic mean of rates          72.72        66.67        42.11
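The Computer A rows of Table 2.2 can be recomputed directly. A sketch (Z = 10^8 FP operations per program, so a time of t seconds gives 100/t MFLOPS):

```python
def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

times = [2.0, 0.75]                 # Computer A, seconds per program
rates = [100.0 / t for t in times]  # MFLOPS, since Z = 10^8 ops

am = arithmetic_mean(rates)         # ~91.67: NOT inversely proportional
                                    # to the total execution time
hm = harmonic_mean(rates)           # ~72.73
# The HM equals n*Z / (total time), expressed in MFLOPS:
check = 2 * 100.0 / sum(times)      # same value as hm
```

This is the desired property in code: doubling every execution time halves the HM of the rates, but not the AM.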
Geometric Mean

 Looking at the equations for the three types of means, it is easier to get
an intuitive sense of the behavior of the AM and the HM than that of the GM
 Some observations:
 With respect to changes in values, the GM gives equal weight to all of the
values in the data set
 For example, suppose the set of data values to be averaged includes a few
large values and more small values. Here, the AM is dominated by the large
values. A change of 10% in the largest value will have a noticeable effect,
while a change in the smallest value by the same factor will have a negligible
effect. In contrast, a change in value by 10% of any of the data values
results in the same change in the GM.
 For the GM of a ratio, the GM of the ratios equals the ratio of the GMs
Geometric Mean

 For use with execution times, as opposed to rates, one drawback of the GM is
that it may be non-monotonic relative to the more intuitive AM. In other
words, there may be cases where the AM of one data set is larger than that of
another set, but the GM is smaller.
Table 2.3 A Comparison of Arithmetic and Geometric Means for Normalized
Results

(a) Results normalized to Computer A

                        Computer A time  Computer B time  Computer C time
Program 1                  2.0 (1.0)        1.0 (0.5)       0.75 (0.38)
Program 2                  0.75 (1.0)       2.0 (2.67)      4.0 (5.33)
Total execution time       2.75             3.0             4.75
Arithmetic mean of
normalized times           1.00             1.58            2.85
Geometric mean of
normalized times           1.00             1.15            1.41

(b) Results normalized to Computer B

                        Computer A time  Computer B time  Computer C time
Program 1                  2.0 (2.0)        1.0 (1.0)       0.75 (0.75)
Program 2                  0.75 (0.38)      2.0 (1.0)       4.0 (2.0)
Total execution time       2.75             3.0             4.75
Arithmetic mean of
normalized times           1.19             1.00            1.38
Geometric mean of
normalized times           0.87             1.00            1.22
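The key property illustrated by Table 2.3 — the GM ranks machines consistently no matter which machine is chosen as the reference — can be verified with the table's execution times. A sketch:

```python
import math

def geometric_mean(xs):
    return math.prod(xs) ** (1.0 / len(xs))

# Execution times from Table 2.3: [Program 1, Program 2] per machine.
times = {"A": [2.0, 0.75], "B": [1.0, 2.0], "C": [0.75, 4.0]}

def gm_normalized(machine, reference):
    """GM of this machine's times normalized to the reference machine."""
    ratios = [t / r for t, r in zip(times[machine], times[reference])]
    return geometric_mean(ratios)

# The ratio of GMs between two machines is the same whichever
# machine is used as the normalization reference:
ratio_ref_a = gm_normalized("B", "A") / gm_normalized("C", "A")
ratio_ref_b = gm_normalized("B", "B") / gm_normalized("C", "B")
```

The AM of normalized times does not have this property, which is why Tables 2.3 and 2.4 show the AM flipping rankings depending on the reference machine.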
Table 2.4 Another Comparison of Arithmetic and Geometric Means for Normalized
Results

(a) Results normalized to Computer A

                        Computer A time  Computer B time  Computer C time
Program 1                  2.0 (1.0)        1.0 (0.5)       0.20 (0.1)
Program 2                  0.4 (1.0)        2.0 (5.0)       4.0 (10)
Total execution time       2.4              3.00            4.2
Arithmetic mean of
normalized times           1.00             2.75            5.05
Geometric mean of
normalized times           1.00             1.58            1.00

(b) Results normalized to Computer B

                        Computer A time  Computer B time  Computer C time
Program 1                  2.0 (2.0)        1.0 (1.0)       0.20 (0.2)
Program 2                  0.4 (0.2)        2.0 (1.0)       4.0 (2)
Total execution time       2.4              3.00            4.2
Arithmetic mean of
normalized times           1.10             1.00            1.10
Geometric mean of
normalized times           0.63             1.00            0.63
Benchmark Principles

 Desirable characteristics of a benchmark program:

1. It is written in a high-level language, making it portable across
different machines
2. It is representative of a particular kind of programming domain or
paradigm, such as systems programming, numerical programming, or commercial
programming
3. It can be measured easily
4. It has wide distribution
+
System Performance Evaluation
Corporation (SPEC)
 Benchmark suite
 A collection of programs, defined in a high-level language
 Together attempt to provide a representative test of a computer in
a particular application or system programming area

 SPEC
 An industry consortium
 Defines and maintains the best known collection of benchmark
suites aimed at evaluating computer systems
 Performance measurements are widely used for comparison and
research purposes
SPEC CPU2006

 Best known SPEC benchmark suite

 Industry standard suite for processor intensive applications
 Appropriate for measuring performance for applications that spend most of
their time doing computation rather than I/O

 Consists of 17 floating point programs written in C, C++, and Fortran and
12 integer programs written in C and C++

 Suite contains over 3 million lines of code

 Fifth generation of processor intensive suites from SPEC
+
Other SPEC suites

 SPECviewperf: Standard for measuring 3D graphics performance based on
professional applications.

 SPECwpc: Benchmark to measure all key aspects of workstation performance
based on diverse professional applications, including media and entertainment,
product development, life sciences, financial services, and energy.

 SPECjvm2008: Intended to evaluate performance of the combined hardware and
software aspects of the Java Virtual Machine (JVM) client platform.

 SPECjbb2013 (Java Business Benchmark): A benchmark for evaluating
server-side Java-based electronic commerce applications.

 SPECsfs2008: Designed to evaluate the speed and request-handling
capabilities of file servers.

 SPECvirt_sc2013: Performance evaluation of datacenter servers used in
virtualized server consolidation. Measures the end-to-end performance of all
system components including the hardware, virtualization platform, and the
virtualized guest operating system and application software. The benchmark
supports hardware virtualization, operating system virtualization, and
hardware partitioning schemes.
Table 2.5 SPEC CPU2006 Integer Benchmarks
(Table can be found on page 69 in the textbook.)

Benchmark       Ref time  Instr count  Language  Application Area   Brief Description
                (hours)   (billion)
400.perlbench     2.71      2,378        C       Programming        PERL programming language
                                                 Language           interpreter, applied to a set
                                                                    of three programs.
401.bzip2         2.68      2,472        C       Compression        General-purpose data compression
                                                                    with most work done in memory,
                                                                    rather than doing I/O.
403.gcc           2.24      1,064        C       C Compiler         Based on gcc Version 3.2,
                                                                    generates code for Opteron.
429.mcf           2.53        327        C       Combinatorial      Vehicle scheduling algorithm.
                                                 Optimization
445.gobmk         2.91      1,603        C       Artificial         Plays the game of Go, a simply
                                                 Intelligence       described but deeply complex game.
456.hmmer         2.59      3,363        C       Search Gene        Protein sequence analysis using
                                                 Sequence           profile hidden Markov models.
458.sjeng         3.36      2,383        C       Artificial         A highly ranked chess program that
                                                 Intelligence       also plays several chess variants.
462.libquantum    5.76      3,555        C       Physics / Quantum  Simulates a quantum computer,
                                                 Computing          running Shor's polynomial-time
                                                                    factorization algorithm.
464.h264ref       6.15      3,731        C       Video Compression  H.264/AVC (Advanced Video Coding)
                                                                    video compression.
471.omnetpp       1.74        687        C++     Discrete Event     Uses the OMNet++ discrete event
                                                 Simulation         simulator to model a large
                                                                    Ethernet campus network.
473.astar         1.95      1,200        C++     Path-finding       Pathfinding library for 2D maps.
                                                 Algorithms
483.xalancbmk     1.92      1,184        C++     XML Processing     A modified version of Xalan-C++,
                                                                    which transforms XML documents
                                                                    to other document types.


Table 2.6 SPEC CPU2006 Floating-Point Benchmarks
(Table can be found on page 70 in the textbook.)

Benchmark       Ref time  Instr count  Language    Application Area     Brief Description
                (hours)   (billion)
410.bwaves        3.78      1,176      Fortran     Fluid Dynamics       Computes 3D transonic transient
                                                                        laminar viscous flow.
416.gamess        5.44      5,189      Fortran     Quantum Chemistry    Quantum chemical computations.
433.milc          2.55        937      C           Physics / Quantum    Simulates behavior of quarks
                                                   Chromodynamics       and gluons.
434.zeusmp        2.53      1,566      Fortran     Physics / CFD        Computational fluid dynamics
                                                                        simulation of astrophysical
                                                                        phenomena.
435.gromacs       1.98      1,958      C, Fortran  Biochemistry /       Simulates Newtonian equations of
                                                   Molecular Dynamics   motion for hundreds to millions
                                                                        of particles.
436.cactusADM     3.32      1,376      C, Fortran  Physics / General    Solves the Einstein evolution
                                                   Relativity           equations.
437.leslie3d      2.61      1,273      Fortran     Fluid Dynamics       Model fuel injection flows.
444.namd          2.23      2,483      C++         Biology /            Simulates large biomolecular
                                                   Molecular Dynamics   systems.
447.dealII        3.18      2,323      C++         Finite Element       Program library targeted at
                                                   Analysis             adaptive finite elements and
                                                                        error estimation.
450.soplex        2.32        703      C++         Linear Programming,  Test cases include railroad
                                                   Optimization         planning and military airlift
                                                                        models.
453.povray        1.48        940      C++         Image Ray-tracing    3D image rendering.
454.calculix      2.29      3,04`      C, Fortran  Structural           Finite element code for linear
                                                   Mechanics            and nonlinear 3D structural
                                                                        applications.
459.GemsFDTD      2.95      1,320      Fortran     Computational        Solves the Maxwell equations
                                                   Electromagnetics     in 3D.
465.tonto         2.73      2,392      Fortran     Quantum Chemistry    Quantum chemistry package,
                                                                        adapted for crystallographic
                                                                        tasks.
470.lbm           3.82      1,500      C           Fluid Dynamics       Simulates incompressible fluids
                                                                        in 3D.
481.wrf           3.10      1,684      C, Fortran  Weather              Weather forecasting model.
482.sphinx3       5.41      2,472      C           Speech Recognition   Speech recognition software.
Terms Used in SPEC Documentation

 Benchmark
 A program written in a high-level language that can be compiled and
executed on any computer that implements the compiler

 System under test
 This is the system to be evaluated

 Reference machine
 This is a system used by SPEC to establish a baseline performance for all
benchmarks
 Each benchmark is run and measured on this machine to establish a
reference time for that benchmark

 Base metric
 These are required for all reported results and have strict guidelines for
compilation

 Peak metric
 This enables users to attempt to optimize system performance by optimizing
the compiler output

 Speed metric
 This is simply a measurement of the time it takes to execute a compiled
benchmark
 Used for comparing the ability of a computer to complete single tasks

 Rate metric
 This is a measurement of how many tasks a computer can accomplish in a
certain amount of time
 This is called a throughput, capacity, or rate measure
 Allows the system under test to execute simultaneous tasks to take
advantage of multiple processors
[Figure 2.7 SPEC Evaluation Flowchart: Start → get next program → run the
program three times → select the median value → compute Ratio(prog) =
Tref(prog)/TSUT(prog) → if more programs remain, repeat; otherwise compute
the geometric mean of all ratios → End]
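The flowchart reduces to a few lines of code. A sketch with invented reference and measured times (three runs per benchmark, as in the figure):

```python
import math
import statistics

def spec_metric(results):
    """results: list of (ref_time, [three measured run times]).
    For each benchmark, take the median run, form the ratio
    Tref/Tsut, then return the geometric mean of all ratios."""
    ratios = [ref / statistics.median(runs) for ref, runs in results]
    return math.prod(ratios) ** (1.0 / len(ratios))

# Hypothetical: three benchmarks on a system under test.
results = [
    (9000.0,  [450.0, 440.0, 460.0]),   # ratio 9000/450 = 20
    (6000.0,  [400.0, 410.0, 390.0]),   # ratio 6000/400 = 15
    (12000.0, [400.0, 395.0, 405.0]),   # ratio 12000/400 = 30
]
print(spec_metric(results))             # ~20.8
```

Using the geometric mean here gives each benchmark equal weight and, per the normalization property discussed earlier, makes the final metric independent of which machine SPEC uses as the reference.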
