Lecture 9: Performance Issues in Shared Memory

The lecture contains: Data Access, Capacity Problem, Spatial Locality, Transfer Granularity, Contention, Overlap, Summary
Data Access
In multiprocessors
Every processor wants to see the memory interface as its own local cache and the main memory
In reality it is much more complicated
If the system has a centralized memory (e.g., SMPs), there are still the caches of the other processors; if the memory is distributed, then some part of it is local and some is remote
For shared memory, data movement from local or remote memory to cache is transparent, while for message passing it is explicit
View a multiprocessor as an extended memory hierarchy, where the extension includes the caches of other processors, remote memory modules, and the network topology
Capacity Problem
The most probable reason for artifactual communication
Arises due to the finite capacity of the cache, local memory, or remote memory
May view a multiprocessor as a three-level memory hierarchy for this purpose: local cache, local memory, remote memory
Communication due to cold (compulsory) misses and inherent communication is independent of capacity
Capacity and conflict misses generate communication resulting from finite capacity
The generated traffic may be local or remote depending on the allocation of pages
General technique: exploit spatial and temporal locality to use the cache properly
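As an illustration of exploiting temporal locality, the sketch below blocks a matrix multiply so that small sub-blocks are reused from cache before being evicted. The sizes N and B are illustrative choices, not values from the lecture; B would be tuned so three B×B blocks fit in the local cache.

```c
#include <string.h>

#define N 32   /* matrix dimension (illustrative) */
#define B 8    /* tile size, assumed to divide N and fit in cache */

/* Tiled matrix multiply: each BxB block of A and Bm is reused B times
   while it is still resident in cache, cutting capacity misses. */
void matmul_tiled(double A[N][N], double Bm[N][N], double C[N][N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += A[i][k] * Bm[k][j];
}

/* Small check: identity times all-ones should give all-ones exactly. */
double tiled_demo(void) {
    static double A[N][N], Bm[N][N], C[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (i == j) ? 1.0 : 0.0;
            Bm[i][j] = 1.0;
        }
    matmul_tiled(A, Bm, C);
    double err = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            err += (C[i][j] - 1.0) * (C[i][j] - 1.0);
    return err;   /* 0.0 when the tiled product is correct */
}
```

The same result is computed with or without tiling; only the order of accesses (and hence the miss count) changes.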
Spatial Locality
Consider a square block decomposition of the grid solver and a C-like row-major layout, i.e., A[i][j] and A[i][j+1] occupy contiguous memory locations
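The row-major consequence can be made concrete with pointer arithmetic. In the sketch below (grid size N is illustrative), neighbors along j are one element apart, while neighbors along i are a full row apart, so walking a column of a block touches a different cache line on every access.

```c
#include <stddef.h>

#define N 8   /* small grid for illustration */

static double grid[N][N];

/* In C's row-major layout, grid[i][j] and grid[i][j+1] are adjacent,
   so one fetched cache line serves several consecutive j accesses. */
ptrdiff_t row_stride(void) { return &grid[0][1] - &grid[0][0]; }

/* grid[i][j] and grid[i+1][j] are N elements apart: column walks get
   no spatial locality from the fetched lines. */
ptrdiff_t col_stride(void) { return &grid[1][0] - &grid[0][0]; }
```

This is why the left and right boundary columns of a square block are the expensive ones to communicate under a row-major layout.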
Transfer Granularity
How much data do you transfer in one communication?
For message passing, it is explicit in the program
For shared memory, it is really under the control of the cache coherence protocol: there is a fixed size for which transactions are defined (normally the block size of the outermost level of the cache hierarchy)
In shared memory you have to be careful
Since the minimum transfer size is a cache line, you may end up transferring extra data, e.g., in the grid solver, the boundary elements of the left and right neighbors for a square block decomposition (you need only one element, but must transfer the whole cache line): there is no good solution
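A related cache-line-granularity hazard is false sharing: logically private data placed on one line is transferred as a unit. The sketch below shows the usual padding fix; the 64-byte line size is an assumption (real machines vary), and the struct names are illustrative.

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size; not universal */

/* Unpadded: four per-processor counters share cache lines, so every
   update by one processor drags a whole line out of the others' caches
   even though no data is logically shared. */
struct counters_shared { long c[4]; };

/* Padded: each counter occupies its own line, trading wasted space
   for the elimination of line-granularity (artifactual) transfers. */
struct padded_counter {
    long c;
    char pad[CACHE_LINE - sizeof(long)];
};
struct counters_private { struct padded_counter c[4]; };
```

Padding helps for private counters, but it cannot help the grid-solver boundary columns above: those elements genuinely live in someone else's rows.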
Contention
It is very easy to ignore contention effects when designing algorithms
Contention can severely degrade performance by creating hot-spots
Location hot-spot:
Consider accumulating into a global variable; the accumulation takes place on a single node, i.e., all nodes access the variable allocated on that particular node whenever they try to increment it
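A standard way to defuse such a location hot-spot is to accumulate into per-thread partials and combine them once at the end, so the shared location is written once per thread rather than once per increment. A minimal pthreads sketch follows; the thread count T and iteration count are illustrative, and the trivial `+1` stands in for real per-iteration work.

```c
#include <pthread.h>

#define T 4
#define ITERS 100000L

static long partial[T];   /* one slot per thread: no single hot location */

static void *worker(void *arg) {
    long id = (long)arg;
    long local = 0;       /* private accumulator, lives in a register/cache */
    for (long i = 0; i < ITERS; i++)
        local += 1;       /* stands in for the real per-iteration work */
    partial[id] = local;  /* one shared write per thread, not per iteration */
    return NULL;
}

long accumulate(void) {
    pthread_t th[T];
    for (long t = 0; t < T; t++)
        pthread_create(&th[t], NULL, worker, (void *)t);
    long sum = 0;
    for (long t = 0; t < T; t++) {
        pthread_join(th[t], NULL);
        sum += partial[t];   /* combine after all threads finish */
    }
    return sum;
}
```

In practice the partial[] slots would also be padded to separate cache lines (as in the transfer-granularity discussion) so the final writes do not falsely share a line.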
Overlap
Increase the overlap between communication and computation
Not much to do at the algorithm level unless the programming model and/or OS provide primitives to carry out prefetching, block data transfer, non-blocking receive, etc.
Normally, these techniques increase the bandwidth demand because you end up communicating the same amount of data in a shorter amount of time (execution time hopefully goes down if you can exploit the overlap)
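The classic program structure for overlap is double buffering: issue the fetch of the next chunk before computing on the current one. The sketch below only models the transfer with a synchronous memcpy (so nothing actually overlaps here); with a real non-blocking primitive such as a prefetch or MPI_Irecv, the issued fetch and the compute loop would proceed concurrently. Sizes and names are illustrative.

```c
#include <string.h>

#define NCHUNKS 8
#define CHUNK 256

/* Double buffering: while chunk c is consumed from one buffer, chunk
   c+1 is being brought into the other. The (c & 1) index alternates
   between the two buffers. */
long pipeline(long src[NCHUNKS][CHUNK]) {
    static long buf[2][CHUNK];
    long sum = 0;
    memcpy(buf[0], src[0], sizeof buf[0]);            /* fetch chunk 0 */
    for (int c = 0; c < NCHUNKS; c++) {
        if (c + 1 < NCHUNKS)                          /* issue next fetch */
            memcpy(buf[(c + 1) & 1], src[c + 1], sizeof buf[0]);
        for (int j = 0; j < CHUNK; j++)               /* compute on current */
            sum += buf[c & 1][j];
    }
    return sum;
}

/* Demo: all-ones input, so the pipelined sum is NCHUNKS * CHUNK. */
long pipeline_demo(void) {
    static long src[NCHUNKS][CHUNK];
    for (int c = 0; c < NCHUNKS; c++)
        for (int j = 0; j < CHUNK; j++)
            src[c][j] = 1;
    return pipeline(src);
}
```

Note the bandwidth point from the text: the pipelined version moves exactly the same data as a naive loop; it only compresses the transfers into the compute time.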
Summary
Parallel programs introduce three overhead terms: busy overhead (extra work), remote data access time, and synchronization time
The goal of a good parallel program is to minimize these three terms
The goal of a good parallel computer architecture is to provide sufficient support to let programmers optimize these three terms (and this is the focus of the rest of the course)