CH17-COA10e - Parallel Processing

+
William Stallings
Computer Organization
and Architecture
10th Edition
© 2016 Pearson Education, Inc., Hoboken,
NJ. All rights reserved.
+ Chapter 17
Parallel Processing
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
+
Multiple Processor Organization
 Single instruction, single data  Multiple instruction, single data

(SISD) stream (MISD) stream
 Single processor executes a  A sequence of data is transmitted
single instruction stream to to a set of processors, each of
operate on data stored in a single which executes a different
memory instruction sequence
 Uniprocessors fall into this  Not commercially implemented
category
 Single instruction, multiple data  Multiple instruction, multiple

(SIMD) stream data (MIMD) stream
 A single machine instruction  A set of processors
controls the simultaneous
simultaneously execute different
execution of a number of
processing elements on a instruction sequences on different
lockstep basis data sets
 Vector and array processors fall  SMPs, clusters and NUMA systems
into this category fit this category

Processor Organizations
Single Instruction, Single Instruction, Multiple Instruction, Multiple Instruction,

Single Data Stream Multiple Data Stream Single Data Stream Multiple Data Stream
(SISD) (SIMD) (MISD) (MIMD)
Uniprocessor
Vector Array Shared Memory Distributed Memory

Processor Processor (tightly coupled) (loosely coupled)
Clusters
Symmetric Nonumiform
Multiprocessor Memory
(SMP) Access
(NUMA)
Figure 17.1 A Taxonomy of Parallel Processor Architectures

IS DS DS
CU PU MU PU1 LM1
(a) SISD DS
PU2 LM2
IS
CU
IS DS DS
CU1 PU1 PUn LMn
IS DS (b) SIMD (with distributed memory)

CU2 PU2
Memory
Shared
IS DS
CU1 PU1 LM1
Interconnection
IS DS IS DS
Network
CUn PUn CU2 PU2 LM2
(c) MIMD (with shared memory)
CU = control unit SISD = single instruction,

IS = instruction stream single data stream IS DS
PU = processing unit SIMD = single instruction, CUn PUn LMn
DS = data stream multiple data stream
MU = memory unit MIMD = multiple instruction, (d) MIMD (with distributed memory)
LM = local memory multiple data stream
Figure 17.2 Alternative Computer Organizations

Symmetric Multiprocessor (SMP)
A stand alone computer with

the following characteristics:
Processors All System
share same processors controlled by
memory and share access integrated
I/O facilities to I/O All operating
Two or more devices processors system
• Processors are
similar connected by a can perform
• Either through • Provides
processors of bus or other same channels the same interaction
comparable internal or different functions between
capacity connection channels giving (hence processors and
• Memory access paths to same their programs
time is devices
“symmetric”) at job, task, file
approximately and data
the same for element levels
each processor

Time
Process 1
Process 2
Process 3
(a) Interleaving (multiprogramming, one processor)
Process 1
Process 2
Process 3
(b) Interleaving and overlapping (multiprocessing; two processors)
Blocked Running
Figure 17.3 Multiprogramming and Multiprocessing

Processor Processor Processor
I/O
I/O
Interconnection
Network
I/O
Main Memory
Figure 17.4 Generic Block Diagram of a Tightly Coupled Multiprocessor

Processor Processor Processor
L1 Cache L1 Cache L1 Cache
L2 Cache L2 Cache L2 Cache
shared bus
Main I/O
Memory I/O Adapter
Subsytem
I/O
Adapter
I/O
Adapter
Figure 17.5 Symmetric Multiprocessor Organization

+
The bus organization has several
attractive features:
 Simplicity
 Simplest approach to multiprocessor organization
 Flexibility
 Generally easy to expand the system by attaching more
processors to the bus
 Reliability
 The bus is essentially a passive medium and the failure of any
attached device should not cause failure of the whole system

+
Disadvantages of the bus organization:
 Main drawback is performance

 All memory references pass through the common bus
 Performance is limited by bus cycle time
 Each processor should have cache memory

 Reduces the number of bus accesses
 Leads to problems with cache coherence

 If a word is altered in one cache it could conceivably invalidate a
word in another cache
 To prevent this the other processors must be alerted that an
update has taken place
 Typically addressed in hardware rather than the operating system

+
Multiprocessor Operating System
Design Considerations
 Simultaneous concurrent processes
 OS routines need to be reentrant to allow several processors to execute the same IS code simultaneously
 OS tables and management structures must be managed properly to avoid deadlock or invalid operations
 Scheduling
 Any processor may perform scheduling so conflicts must be avoided
 Scheduler must assign ready processes to available processors
 Synchronization
 With multiple active processes having potential access to shared address spaces or I/O resources, care must be
taken to provide effective synchronization
 Synchronization is a facility that enforces mutual exclusion and event ordering
 Memory management
 In addition to dealing with all of the issues found on uniprocessor machines, the OS needs to exploit the available
hardware parallelism to achieve the best performance
 Paging mechanisms on different processors must be coordinated to enforce consistency when several processors
share a page or segment and to decide on page replacement
 Reliability and fault tolerance

 OS should provide graceful degradation in the face of processor failure
 Scheduler and other portions of the operating system must recognize the loss of a processor and restructure
accordingly

+
Cache Coherence
Software Solutions
 Attempt to avoid the need for additional hardware circuitry

and logic by relying on the compiler and operating system to
deal with the problem
 Attractive because the overhead of detecting potential

problems is transferred from run time to compile time, and
the design complexity is transferred from hardware to
software
 However, compile-time software approaches generally must make
conservative decisions, leading to inefficient cache utilization

+
Cache Coherence
Hardware-Based Solutions
 Generally referred to as cache coherence protocols
 These solutions provide dynamic recognition at run time of

potential inconsistency conditions
 Because the problem is only dealt with when it actually arises

there is more effective use of caches, leading to improved
performance over a software approach
 Approaches are transparent to the programmer and the

compiler, reducing the software development burden
 Can be divided into two categories:

 Directory protocols
 Snoopy protocols

Directory Protocols
Collect and Effective in large
maintain scale systems with
information about complex
copies of data in interconnection
cache schemes
Directory stored in Creates central

main memory bottleneck
Requests are Appropriate

checked against transfers are
directory performed

Snoopy Protocols
 Distribute the responsibility for maintaining cache coherence
among all of the cache controllers in a multiprocessor
 A cache must recognize when a line that it holds is shared with other
caches
 When updates are performed on a shared cache line, it must be
announced to other caches by a broadcast mechanism
 Each cache controller is able to “snoop” on the network to observe
these broadcast notifications and react accordingly
 Suited to bus-based multiprocessor because the shared bus

provides a simple means for broadcasting and snooping
 Care must be taken that the increased bus traffic required for
broadcasting and snooping does not cancel out the gains from the
use of local caches
 Two basic approaches have been explored:

 Write invalidate
 Write update (or write broadcast)
+
Write Invalidate
 Multiple readers, but only one writer at a time
 When a write is required, all other caches of the line are

invalidated
 Writing processor then has exclusive (cheap) access until

line is required by another processor
 Most widely used in commercial multiprocessor systems

such as the x86 architecture
 State of every line is marked as modified, exclusive, shared

or invalid
 For this reason the write-invalidate protocol is called MESI

+
Write Update
Can be multiple readers and writers
When a processor wishes to update a shared line

the word to be updated is distributed to all others
and caches containing that line can update it
Some systems use an adaptive mixture of both

write-invalidate and write-update mechanisms

+
MESI Protocol
To provide cache consistency on an SMP the data cache
supports a protocol known as MESI:
 Modified
 The line in the cache has been modified and is available only in
this cache
 Exclusive
 The line in the cache is the same as that in main memory and is
not present in any other cache
 Shared
 The line in the cache is the same as that in main memory and may
be present in another cache
 Invalid
 The line in the cache does not contain valid data

Table 17.1
MESI Cache Line States
M E S I
Modified Exclusive Shared Invalid
This cache line Yes Yes Yes No
valid?
The memory out of date valid valid —
copy is…
Copies exist in
No No Maybe Maybe
other caches?
A write to this does not go to does not go to goes to bus and goes directly to
line… bus bus updates cache bus

SHR
Invalid RMS Shared RH Invalid SHW Shared
R
M
WM
SHR
SHW
SH
W
R
H
SH
W
RH Modified WH Exclusive RH Modified Exclusive
WH
(a) Line in cache at initiating pr ocessor (b) Line in snooping cache
Dirty line copyback

RH Read hit
RMS Read miss, shared
RME Read miss, exclusive Invalidate transaction
WH Write hit
WM Write miss
SHR Snoop hit on read Read-with-intent-to-modify
SHW Snoop hit on write or
read-with-intent-to-modify Cache line fill
Figure 17.6 MESI State Transition Diagram

+
Multithreading and Chip
Multiprocessors
 Processor performance can be measured by the rate at which it
executes instructions
 MIPS rate = f * IPC

 f = processor clock frequency, in MHz
 IPC = average instructions per cycle
 Increase performance by increasing clock frequency and

increasing instructions that complete during cycle
 Multithreading
 Allows for a high degree of instruction-level parallelism without
increasing circuit complexity or power consumption
 Instruction stream is divided into several smaller streams, known as
threads, that can be executed in parallel

Definitions of Threads
and Processes Thread in multithreaded
processors may or may not be
the same as the concept of
software threads in a
multiprogrammed operating
system
Thread is concerned with

Thread switch scheduling and execution,
whereas a process is
• The act of switching processor control
between threads within the same concerned with both
process scheduling/execution and
• Typically less costly than process resource and resource
switch ownership
Thread:
• Dispatchable unit of work within a Process:
process • An instance of program running on
• Includes processor context (which computer
includes the program counter and • Two key characteristics:
stack pointer) and data area for stack
• Resource ownership
• Executes sequentially and is
interruptible so that the processor can • Scheduling/execution
turn to another thread
Process switch
• Operation that switches the processor
from one process to another by saving all
the process control data, registers, and
other information for the first and
replacing them with the process
information for the second
Implicit and Explicit
Multithreading
 All commercial processors and most
experimental ones use explicit multithreading
 Concurrently execute instructions from different
explicit threads
 Interleave instructions from different threads on
shared pipelines or parallel execution on parallel
pipelines
 Implicit multithreading is concurrent execution

+ of multiple threads extracted from single
sequential program
 Implicit threads defined statically by compiler or
dynamically by hardware

+ Approaches to Explicit
Multithreading
 Interleaved  Blocked
 Fine-grained  Coarse-grained
 Processor deals with two or  Thread executed until event
more thread contexts at a causes delay
time  Effective on in-order
 Switching thread at each processor
clock cycle  Avoids pipeline stall
 If thread is blocked it is
skipped  Chip multiprocessing
 Processor is replicated on a
 Simultaneous (SMT) single chip
 Instructions are  Each processor handles
simultaneously issued from separate threads
multiple threads to  Advantage is that the
execution units of available logic area on a chip
superscalar processor is used effectively

A ABCD ABCD A
A A A A A
thread switches
thread switches
A B A A
cycles
C A
D A A A A
A B
A B B A A A
(a) single-threaded (b) interleaved (c) blocked (d) superscalar

scalar multithreading multithreading
scalar scalar
ABCD ABCD A ABCD
ot
e sl
A A A A issu A A N N A A N N
thread switches
thread switches
thread switches
B B B A A A N N N B B B N
latency
C cycle
C N N N
D D D D B B B A A A A D D D D
A A B A A N N
B C A A A N B N N N
issue bandwidth
(g) VLIW (h) interleaved
(e) interleaved (f) blocked multithreading
multithreading multithreading VLIW
superscalar superscalar
ABCD ABCD ABCD
A A N N A A A A B B B C A A B B C
thread switches
A A N N D D D A A A B D A B B D D
D D D A A A B C B D
B B B N B D A A A A B B A A C D D
B N N N C D D A A A A A B B C C D D
C N N N A B B D D D D D A A B C C D
(i) blocked (j) simultaneous (k) chip multiprocessor

multithreading multithreading (multicore)
VLIW (SMT)
Figure 17.7 Approaches to Executing Multiple Threads

Clusters
 Alternative to SMP as an approach to providing
high performance and high availability
 Particularly attractive for server applications
 Defined as:
 A group of interconnected whole computers working
together as a unified computing resource that can
create the illusion of being one machine
 (The term whole computer means a system that can run
on its own, apart from the cluster)
 Each computer in a cluster is called a node

+  Benefits:
 Absolute scalability
 Incremental scalability
 High availability
 Superior price/performance
P P P P
High-speed message link

M I/O I/O I/O I/O M
(a) Standby server with no shared disk
High-speed message link

P P I/O I/O P P
M I/O I/O I/O I/O M
RAID
(b) Shared disk
Figure 17.8 Cluster Configurations

Table 17.2
Clustering Methods: Benefits and Limitations
Clustering Method Description Benefits Limitations
Passive Standby A secondary server Easy to implement. High cost because the
takes over in case of secondary server is
primary server failure. unavailable for other
processing tasks.
Active Secondary: The secondary server is Reduced cost because Increased complexity.
also used for processing secondary servers can
tasks. be used for processing.
Separate Servers Separate servers have High availability. High network and
their own disks. Data is server overhead due to
continuously copied copying operations.
from primary to
secondary server.
Servers Connected Servers are cabled to the Reduced network and Usually requires disk
to Disks same disks, but each server overhead due to mirroring or RAID
server owns its disks. If elimination of copying technology to
one server fails, its disks operations. compensate for risk of
are taken over by the disk failure.
other server.
Servers Share Disks Multiple servers Low network and server Requires lock manager
simultaneously share overhead. Reduced risk software. Usually used
access to disks. of downtime caused by with disk mirroring or
disk failure. RAID technology.
+
Operating System Design Issues
 How failures are managed depends on the clustering method used
 Two approaches:
 Highly available clusters
 Fault tolerant clusters
 Failover
 The function of switching applications and data resources over from a failed system
to an alternative system in the cluster
 Failback
 Restoration of applications and data resources to the original system once it
has been fixed
 Load balancing
 Incremental scalability
 Automatically include new computers in scheduling
 Middleware needs to recognize that processes may switch between machines

Parallelizing Computation
Effective use of a cluster requires executing

software from a single application in parallel
Three approaches are:
Parallelizing complier Parallelized Parametric computing

• Determines at compile time application • Can be used if the essence of
which parts of an application • Application written from the the application is an
can be executed in parallel outset to run on a cluster and algorithm or program that
• These are then split off to be uses message passing to must be executed a large
assigned to different move data between cluster number of times, each time
computers in the cluster nodes with a different set of starting
conditions or parameters

Parallel Applications
Sequential Applications Parallel Programming Environment
Cluster Middleware
(Single System Image and Availability Infrastructure)
PC/Workstation PC/Workstation PC/Workstation PC/Workstation PC/Workstation
Comm SW Comm SW Comm SW Comm SW Comm SW
Net. Interface HW Net. Interface HW Net. Interface HW Net. Interface HW Net. Interface HW
High Speed Network/Switch
Figure 17.9 Cluster Computer Architecture

Additional blade
server racks
N 100GbE
Eth Switch Eth Switch Eth Switch
Eth Switch Eth Switch

100GbE
10GbE
& Eth Switch Eth Switch Eth Switch
40GbE
Figure 17.10 Example 100-Gbps Ethernet

Configuration for Massive Blade Server Cloud Site
+
Clusters Compared to SMP
 Both provide a configuration with multiple processors to
support high demand applications
 Both solutions are available commercially
SMP Clustering
 Easier to manage and  Far superior in terms of
configure incremental and absolute
scalability
 Much closer to the original
single processor model for  Superior in terms of
which nearly all applications availability
are written
 All components of the system
 Less physical space and lower can readily be made highly
power consumption redundant
 Well established and stable

+
Nonuniform Memory Access
(NUMA)
 Alternative to SMP and clustering
 Uniform memory access (UMA)

 All processors have access to all parts of main memory using loads and stores
 Access time to all regions of memory is the same
 Access time to memory for different processors is the same
 Nonuniform memory access (NUMA)

 All processors have access to all parts of main memory using loads and stores
 Access time of processor differs depending on which region of main memory
is being accessed
 Different processors access different regions of memory at different speeds
 Cache-coherent NUMA (CC-NUMA)

 A NUMA system in which cache coherence is maintained among the caches of
the various processors

Motivation
SMP has practical limit to In clusters each node has its
number of processors that own private main memory
can be used • Applications do not see a large
• Bus traffic limits to between 16 and global memory
64 processors • Coherency is maintained by
software rather than hardware
Objective with NUMA is to

maintain a transparent
NUMA retains SMP flavor system wide memory while
while giving large scale permitting multiple
multiprocessing multiprocessor nodes, each
with its own bus or internal
interconnect system
Processor Processor
1-1 1-m
L1 Cache L1 Cache
L2 Cache L2 Cache Directory
I/O
Main
Memory 1
Processor Processor
2-1 2-m
L1 Cache L1 Cache
Interconnect
Network L2 Cache L2 Cache Directory
I/O
Main
Memory 2
Processor Processor
N-1 N-m
L1 Cache L1 Cache
L2 Cache L2 Cache I/O
Directory
Main
Memory N
Figure 17.11 CC-NUMA Organization

+
NUMA Pros and Cons
 Main advantage of a CC-

NUMA system is that it can
deliver effective performance
at higher levels of parallelism  Does not transparently look
than SMP without requiring like an SMP
major software changes
 Software changes will be
 Bus traffic on any individual required to move an operating
node is limited to a demand system and applications from
that the bus can handle an SMP to a CC-NUMA system
 If many of the memory  Concern with availability

accesses are to remote nodes,
performance begins to break
down

Characteristics
Broad Rapid Measured On-Demand
Essential Network Access Elasticity Service Self-Service
Resource Pooling
Software as a Service (SaaS)

Platform as a Service (PaaS)
Service
Models
Infrastructure as a Service (IaaS)

Deployment
Models
Public Private Hybrid Community
Figure 17.12 Cloud Computing Elements

Cloud Application Software Cloud Application Software
(provided by cloud, visible to subscriber) (developed by subscriber)
Cloud Platform Cloud Platform
(visible only to provider) (visible to subscriber)
Cloud Cloud
Infrastructure Infrastructure
(visible only (visible only
to provider) to provider)
(a) SaaS (b) PaaS
Cloud Application Software

(developed by subscriber)
Cloud Platform
(visible to subscriber)
Cloud
Infrastructure
(visible to
subscriber)
(c) IaaS
Figure 17.13 Cloud Service Models

+
Deployment Models
 Community cloud
 Like a private cloud it is not
open to any subscriber
 Public cloud  Like a public cloud the
 The cloud infrastructure is resources are shared among a
made available to the general number of independent
public or a large industry orgaizations
group and is owned by an
organization selling cloud  Hybrid cloud
services  The cloud infrastructure is a
 Major advantage is cost composition of two or more
clouds that remain unique
 Private cloud entities but are bound
 A cloud infrastructure
together by standardized or
implemented within the proprietary technology that
internal IT environment of the enables data and application
organization portability
 Sensitive information can be
 A key motivation for opting
for a private cloud is security placed in a private area of the
cloud and less sensitive data
can take advantage of the cost
benefits of the public cloud

Enterprise -
Cloud User
LAN
switch
Router
Network
or Internet
Router
LAN Cloud
switch service
provider
Servers
Figure 17.14 Cloud Computing Context

+
Cloud Computing Reference
Architecture
 NIST SP 500-292 establishes a reference architecture,
described as:
“The NIST cloud computing reference architecture focuses on

the requirements of “what” cloud services provide, not a “how
to” design solution and implementation. The reference
architecture is intended to facilitate the understanding of the
operational intricacies in cloud computing. It does not
represent the system architecture of a specific
cloud computing system; instead it is a tool for describing,
discussing, and developing a system-specific
architecture using a common framework of reference.”

Cloud Provider
Cloud Service Orchestration Cloud Cloud
Consumer Service Layer Service Broker
Management
SaaS
Service
PaaS Intermediation
Cloud Business
Auditor Support
Security
Privacy
IaaS Service
Security Aggregation
Resource Abstraction Provisioning/
Audit
and Control Layer Configuration Service
Privacy Physical Resource Layer Arbitrage
Impact Audit
Hardware Portability/
Performance Interoperability
Facility
Audit
Cloud Carrier
Figure 17.15 NIST Cloud Computing Reference Architecture

+ Summary Parallel
Processing
Chapter 17
 Multithreading and chip multiprocessors
 Implicit and explicit multithreading
 Approaches to explicit multithreading
 Multiple processor organizations
 Types of parallel processor systems
 Clusters
 Cluster configurations
 Parallel organizations
 Operating system design issues
 Symmetric multiprocessors  Cluster computer architecture
 Blade servers
 Organization
 Clusters compared to SMP
 Multiprocessor operating system
design considerations
 Nonuniform memory access
 Motivation
 Cache coherence and the MESI
 Organization
protocol
 NUMA Pros and cons
 Software solutions
 Hardware solutions  Cloud computing
 The MESI protocol  Cloud computing elements
 Cloud computing reference architecture

CH17-COA10e - Parallel Processing

Uploaded by

Copyright:

Available Formats

CH17-COA10e - Parallel Processing

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CH17-COA10e - Parallel Processing

Uploaded by

Copyright:

Available Formats

+

 Single instruction, single data  Multiple instruction, single data

 Single instruction, multiple data  Multiple instruction, multiple

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

Single Instruction, Single Instruction, Multiple Instruction, Multiple Instruction,

Vector Array Shared Memory Distributed Memory

Figure 17.1 A Taxonomy of Parallel Processor Architectures

IS DS (b) SIMD (with distributed memory)

(c) MIMD (with shared memory)

CU = control unit SISD = single instruction,

Figure 17.2 Alternative Computer Organizations

A stand alone computer with

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

(a) Interleaving (multiprogramming, one processor)

(b) Interleaving and overlapping (multiprocessing; two processors)

Figure 17.3 Multiprogramming and Multiprocessing

Figure 17.4 Generic Block Diagram of a Tightly Coupled Multiprocessor

L2 Cache L2 Cache L2 Cache

Figure 17.5 Symmetric Multiprocessor Organization

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

 Main drawback is performance

 Each processor should have cache memory

 Leads to problems with cache coherence

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

 Reliability and fault tolerance

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

 Attempt to avoid the need for additional hardware circuitry

 Attractive because the overhead of detecting potential

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

 These solutions provide dynamic recognition at run time of

 Because the problem is only dealt with when it actually arises

 Approaches are transparent to the programmer and the

 Can be divided into two categories:

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

Directory stored in Creates central

Requests are Appropriate

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

 Suited to bus-based multiprocessor because the shared bus

 Two basic approaches have been explored:

 Multiple readers, but only one writer at a time

 When a write is required, all other caches of the line are

 Writing processor then has exclusive (cheap) access until

 Most widely used in commercial multiprocessor systems

 State of every line is marked as modified, exclusive, shared

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

Can be multiple readers and writers

When a processor wishes to update a shared line

Some systems use an adaptive mixture of both

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

Invalid RMS Shared RH Invalid SHW Shared

RH Modified WH Exclusive RH Modified Exclusive

Dirty line copyback

Figure 17.6 MESI State Transition Diagram

 MIPS rate = f * IPC

 Increase performance by increasing clock frequency and

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

Thread is concerned with