CPP Unit-4

UNIT-IV
Concurrent and Parallel Programming: Parallel programming paradigms – Data parallel,

Task parallel, Shared memory and Message Passing, Parallel Architectures, GPGPU, pthreads,
STM
Parallel Algorithm - Models

The model of a parallel algorithm is developed by considering a strategy for dividing the data
and processing method and applying a suitable strategy to reduce interactions. In this chapter,
we willdiscuss the following Parallel Algorithm Models−
• Data parallel model
• Task graph model
• Work pool model
• Master slave model
• Producer consumer or pipeline model
• Hybrid model
Data Parallel
In data parallel model, tasks are assigned to processes and each task performs similar types of
operations on different data. Data parallelism is a consequence of single operations that is
being applied on multiple data items.
Data-parallel model can be applied on shared-address spaces and message-passing paradigms.
In data- parallel model, interaction overheads can be reduced by selecting a locality preserving
decomposition, by using optimized collective interaction routines, or by overlapping
computation and interaction.
The primary characteristic of data-parallel model problems is that the intensity of data
parallelism increases with the size of the problem, which in turn makes it possible to use more
processes to solve larger problems.
Example − Dense matrix multiplication.
Fig 4.1: Dense Matrix Multiplication
1
Task Graph Model
In the task graph model, parallelism is expressed by a task graph. A task graph can be either
trivial or nontrivial. In this model, the correlation among the tasks are utilized to promote
locality or to minimize interaction costs. This model is enforced to solve problems in which the
quantity of data associated with the tasks is huge compared to the number of computation
associated with them. The tasks are assigned to help improve the cost of data movement among
the tasks.
Examples − Parallel quick sort, sparse matrix factorization, and parallel algorithms derived via
divide-and- conquer approach.
Fig 4.2: Task Graph Model
Here, problems are divided into atomic tasks and implemented as a graph. Each task is an
independent unit of job that has dependencies on one or more antecedent task. After the
completion of a task, the output of an antecedent task is passed to the dependent task. A task
with antecedent task starts execution only when its entire antecedent task is completed. The
final output of the graph is received when the last dependent task is completed (Task 6 in the
above Figure).
Work Pool Model
In work pool model, tasks are dynamically assigned to the processes for balancing the load.
Therefore, any process may potentially execute any task. This model is used when the quantity
of data associated with tasks is comparatively smaller than the computation associated with the
tasks.
There is no desired pre-assigning of tasks onto the processes. Assigning of tasks is centralized
or decentralized. Pointers to the tasks are saved in a physically shared list, in a priority queue,
or in a hash table or tree, or they could be saved in a physically distributed data structure.
The task may be available in the beginning, or may be generated dynamically. If the task is
generated dynamically and a decentralized assigning of task is done, then a termination
detection algorithm is required so that all the processes can actually detect the completion of
the entire program and stop looking for more tasks.
Example – Parallel tree search
2
Fig: 4.3 Parallel Tree Search
Master-Slave Model
In the master-slave model, one or more master processes generate task and allocate it to slave
processes. The tasks may be allocated beforehand if the master can estimate the volume of the
tasks, or a random assigning can do a satisfactory job of balancing load, or slaves are assigned
smaller pieces of task at different times.
This model is generally equally suitable to shared-address-space or message-passing
paradigms,
since the interaction is naturally two ways.
In some cases, a task may need to be completed in phases, and the task in each phase must be
completed before the task in the next phases can be generated. The master-slave model can be
generalized to hierarchical or multi-level master-slave model in which the top level master
feeds the large portion of tasks to the second-level master, who further subdivides the tasks
among its own slaves and may perform a part of the task itself.
Fig 4.4: Master-Slave Model
Precautions in using the master-slave model are:

Care should be taken to assure that the master does not become a congestion point. It may
happen if the tasks are too small or the workers are comparatively fast.
The tasks should be selected in a way that the cost of performing a task dominates the cost of
communication and the cost of synchronization.
Asynchronous interaction may help overlap interaction and the computation associated with
work generation by the master.
3
Pipeline Model
It is also known as the producer-consumer model. Here a set of data is passed on through a
series of processes, each of which performs some task on it. Here, the arrival of new data
generates the execution of a new task by a process in the queue. The processes could form a
queue in the shape of linear or multidimensional arrays, trees, or general graphs with or without
cycles.
This model is a chain of producers and consumers. Each process in the queue can be considered
as a consumer of a sequence of data items for the process preceding it in the queue and as a
producer of data for the process following it in the queue. The queue does not need to be a
linear chain; it can be a directed graph. The most common interaction minimization technique
applicable to this model is overlapping interaction with computation.
Example − Parallel LU factorization algorithm
Fig 4.5: Parallel LU factorization algorithm
Hybrid Models
A hybrid algorithm model is required when more than one model may be needed to solve a
problem.
A hybrid model may be composed of either multiple models applied hierarchically or multiple
models applied sequentially to different phases of a parallel algorithm.
Example − Parallel quick sort
Parallel Random Access Machines

Parallel Random Access Machines (PRAM) is a model, which is considered for most of the
parallel algorithms. Here, multiple processors are attached to a single block of memory.
APRAM model contains:
• A set of similar type of processors.
• All the processors share a common memory unit. Processors can communicate among
themselves through the shared memory only.
• A memory access unit (MAU) connects the processors with the single shared memory.
4
Fig 4.6: PRAM Model
Here, number of processors can perform independent operations on number of data in a

particular unit of time. This may result in simultaneous access of same memory location by
different processors.
To solve this problem, the following constraints have been enforced on PRAM model –
• Exclusive Read Exclusive Write(EREW)−Here no two processors are allowed to read

from or write to the same memory location at the same time.
• Exclusive Read Concurrent Write(ERCW)−Here no two processors are allowed to read
from the same memory location at the same time, but are allowed to write to the same
memory location at the same time.
• Concurrent Read Exclusive Write (CREW) − Here all the processors are allowed to read
from the same memory location at the same time, but are not allowed to write to the same
memory location at the same time.
• Concurrent Read Concurrent Write (CRCW) − All the processors are allowed to read
from or write to the same memory location at the same time.
There are many methods to implement the PRAM model, but the most prominent ones are −
• Shared memory model
• Message passing model
• Data parallel model
Shared Memory Model

Shared memory emphasizes on control parallelism than on data parallelism. In the shared
memory model, multiple processes execute on different processors independently, but they
share a common memory space. Due to any processor activity, if there is any change in any
memory location, it is visible to the rest of the processors.
As multiple processors access the same memory location, it may happen that at any particular
point of time, more than one processor is accessing the same memory location. Suppose one is
reading that location and the other is writing on that location. It may create confusion. To avoid
this, some control mechanism, like lock / semaphore, is implemented to ensure mutual
exclusion.
5
Fig 4.8: Shared Memory Model
Shared memory programming has been implemented in the following:
Thread libraries−The thread library allows multiple threads of control that run concurrently in
the same memory location. Thread library provides an interface that supports multithreading
through a library of subroutine. It contains subroutines for:
• Creating and destroying threads
• Scheduling
• Execution of thread
• Passing data and message between threads
• Saving and restoring thread contexts
Examples of thread libraries include−SolarisTM threads for Solaris, POSIX threads as
implemented in Linux, Win32 threads available in Windows NT and Windows 2000, and
JavaTM threads as part of the standard JavaTM Development Kit(JDK).
• Distributed Shared Memory (DSM) Systems − DSM systems create an abstraction of
shared memory on loosely coupled architecture in order to implement shared memory
programming without hardware support. They implement standard libraries and use the
advanced user-level memory management features present in modern operating systems.
Examples include Tread Marks System, Munin, IVY, Shasta, Brazos, and Cashmere.
• Program Annotation Packages − This is implemented on the architectures having uniform
memory access characteristics. The most not able example of program annotation
packagesis OpenMP. OpenMP implements functional parallelism. It mainly focuses on
parallelization of loops.
• The concept of shared memory provides a low-level control of shared memory system,
but intends to be tedious and erroneous. It is more applicable for system programming
than application programming.
Merits of Shared Memory Programming

• Global address space gives a user-friendly programming approach to memory.
• Due to the closeness of memory to CPU, data sharing among processes is fast and
uniform.
• There is no need to specify distinctly the communication of data among processes.
• Process-communication overhead is negligible.
• It is very easy to learn.
Demerits of Shared Memory Programming

• It is not portable.
• Managing data locality is very difficult.
6
Message Passing Model
Message passing is the most commonly used parallel programming approach in distributed
memory systems. Here, the programmer has to determine the parallelism. In this model, all the
processors have their own local memory unit and they exchange data through a communication
network.
Fig 4.9: Message Passing Model
Processors use message-passing libraries for Communication among themselves. Along with
the data being sent, the message contains the following components−
• The address of the processor from which the message is being sent
• Startingaddressofthememorylocationofthedatainthesendingprocessor; Data type of the
sending data;
• Data size of the sending data;
• The address of the processor to which the message is being sent
• Starting address of the memory location for the data in the receiving processor.
Processors can communicate with each other by any of the following methods −
• Point-to-Point Communication
• Collective Communication
• Message Passing Interface
Point-to-Point Communication
Point-to-point communication is the simplest form of message passing. Here, a message can be
sent from the sending processor to a receiving processor by any of the following transfer modes
Synchronous mode − The next message is sent only after the receiving a confirmation that its
previous message has been delivered, to maintain the sequence of the message.
Asynchronous mode − To send the next message, receipt of the confirmation of the delivery of
the previous message is not required.
Collective Communication
• Collective communication involves more than two processors for message passing.
Following modes allow collective communications −
• Barrier − Barrier mode is possible if all the processors included in the communications
run a particular bock (known as barrier block) for message passing.
• Broadcast − Broadcasting is of two types −
• One-to-all − Here, one processor with a single operation sends same message to all other
processors.
• All-to-all − Here, all processors send message to all other processors.
7
Messages broadcasted may be of three types –
Personalized − Unique messages are sent to all other destination processors.
Non-personalized − All the destination processors receive the same message.
Reduction−In reduction broadcasting, one processor of the group collects all the messages from
all other processors in the group and combine them to a single message which all other
processors in the group can access.
Merits of Message Passing

• Provides low-level control of parallelism;
• It is portable;
• Less error prone;
• Less overhead in parallel synchronization and data distribution.
Demerits of Message Passing
As compared to parallel shared-memory code, message-passing code generally needs more
software overhead.
Message Passing Libraries
There are many message- passing libraries. Here, we will discuss two of the most-used message
-passing libraries −
• Message Passing Interface (MPI)
• Parallel Virtual Machine(PVM)
Message Passing Interface (MPI)
It is a universal standard to provide communication among all the concurrent processes in a
distributed memory system. Most of the commonly used parallel computing platforms provide
at least one implementation of message passing interface. It has been implemented as the
collection of predefined functions called library and can be called from languages such as C,
C++, Fortran, etc. MPIs are both fast and portable as compared to the other message passing
libraries.
Merits of Message Passing Interface
• Runs only on shared memory architectures or distributed memory architectures;
• Each processors has its own local variables;
• As compared to large shared memory computers, distributed memory computers are less
expensive.
Demerits of Message Passing Interface
• More programming changes are required for parallel algorithm; Sometimes difficult to
debug; and
• Does not perform well in the communication network between the nodes.
Parallel Virtual Machine (PVM)
PVM is a portable message passing system, designed to connect separate heterogeneous host
machines to form a single virtual machine. It is a single manageable parallel computing
resource. Large computational problems like superconductivity studies, molecular dynamics
simulations, and matrix algorithm scan be solved more cost effectively by using them memory
8
and the aggregate power of many computers. It manages all message routing, data conversion,
task scheduling in the network of incompatible computer architectures.
Features of PVM
• Very easy to install and configure.
• Multiple users can use PVM at the same time
• One user can execute multiple applications
• It’s a small package
• Supports C, C++, Fortran
• For a given run of a PVM program, users can select the group of machines;
• It is a message-passingmodel,
• Process-based computation;
• Supports heterogeneous architecture.
Data Parallel Programming

The major focus of data parallel programming model is on performing operations on a data set
simultaneously. The data set is organized into some structure like an array, hypercube, etc.
Processors perform operations collectively on the same data structure. Each task is performed
on a different partition of the same data structure.
It is restrictive, as not all the algorithms can be specified in terms of data parallelism. This is
the reason why data parallelism is not universal.
Data parallel an gauges help to specify the data decomposition and mapping to the processors.
It also includes data distribution statements that allow the programmer to have control on data–
for example, which data will go on which processor–to reduce the amount of communication
within the processors.
Parallel Architectures
In this next section various parallel architectures are discussed, which are based on the
classification of parallel computers considered earlier. The two major parametric considerations
in designing a parallel computer architecture are: (i) executing multiple number of instructions
in parallel, and (ii) increasing the efficiency of processors. There are various methods by which
instructions can be executed in parallel and parallel architectures are based on these methods of
executing instructions in parallel. Pipelining is one of the classical and effective methods to
increase parallelism where different stages perform repeated functions on different operands.
Vector processing is the arithmetic or logical computation applied on vectors whereas in scalar
processing only one data item or a pair of data items is processed. Parallel architectures have
also been developed based on associative memory organizations. Another idea of improving the
processor’s speed by having multiple instructions per cycle is known as Superscalar processing.
Multithreading for increasing processor utilization has also been used in parallel computer
architecture. All the architectures based on these parallel-processing types have been discussed
in detail.
Pipeline Processing
Pipelining is a method to realize, overlapped parallelism in the proposed solution of a problem,
on a digital computer in an economical way. To understand the concept of pipelining, we need
to understand first the concept of assembly lines in an automated production plant where items
are assembled from separate parts (stages) and output of one stage becomes the input to another
stage. Taking the analogy of assembly lines, pipelining is the method to introduce temporal
9
parallelism in computer operations. Assembly line is the pipeline and the separate parts of the
assembly line are different stages through which operands of an operation are passed. To
introduce pipelining in a processor P, the following steps must be followed:
• Sub-divide the input process into a sequence of subtasks. These subtasks will make
stages of pipeline, which are also known as segments.
• Each stage Si of the pipeline according to the subtask will perform some operation on a
distinct set of operands.
• When stage Si has completed its operation, results are passed to the next stage Si+1 for the
next operation.
• The stage Si receives a new set of input from previous stage Si-1.
In this way, parallelism in a pipelined processor can be achieved such that m independent
operations can be performed simultaneously in m segments as shown in Figure 4.10
Fig 4.10: Pipelined processor
The stages or segments are implemented as pure combinational circuits performing arithmetic
or logic operations over the data streams flowing through the pipe. Latches are used to separate
the stages, which are fast registers to hold intermediate results between the stages as shown in
Figure 2. Each stage Si consists of a latch Li and a processing circuit Ci. The final output is
stored in output register R. The flow of data from one stage to another is controlled by a
common clock. Thus, in each clock period, one stage transfers its results to another stage.
10
Fig 4.11: pipelined processor
Pipelined Processor: Having discussed pipelining; now we can define a pipeline processor. A
pipeline processor can be defined as a processor that consists of a sequence of processing
circuits called segments and a stream of operands (data) is passed through the pipeline. In each
segment partial processing of the data stream is performed and the final output is received when
the stream has passed through the whole pipeline. An operation that can be decomposed into a
sequence of well-defined sub tasks is realized through the pipelining concept.
Classification of Pipeline Processors

In this section, we describe various types of pipelining that can be applied in
computer operations. These types depend on the following factors:
• Level of Processing
• Pipeline configuration
• Type of Instruction an ddata
Classification according to level of processing

According to this classification, computer operations are classified as instruction execution and
arithmetic operations. Next, we discuss these classes of this classification:
Instruction Pipeline: We know that an instruction cycle may consist of many operations like,
fetch opcode, decode , compute operand addresses, fetch operands, and execute instructions.
These operations of the instruction execution cycle can be realized through the pipelining
concept. Each of these operations forms one stage of a pipeline. The overlapping of execution
of the operations through the pipeline provides a speedup over the normal execution. Thus, the
pipelines used for instruction cycle operations are known as instruction pipelines.
11
Arithmetic Pipeline: The complex arithmetic operations like multiplication, and floating point
operations consume much of the time of the ALU. These operations can also be pipelined by
segmenting the operations of the ALU and as a consequence, high speed performance may be
achieved. Thus, the pipelines used for arithmetic operations are known as arithmetic pipelines.
Classification according to pipelinecon figuration:
According to the configuration of a pipeline, the following types are identified under
this classification:
Unifunction Pipelines: When a fixed and dedicated function is performed through a pipeline,
it is called a Unifunction pipeline.
Multifunction Pipelines: When different functions at different times are performed
through the pipeline, this is known as Multifunction pipeline. Multifunction pipelines are re
configurable at different times according to the operation being performed
Classification according to type of instruction and data:
According to the types of instruction and data, following types are identified under this
classification:
Scalar Pipelines: This type of pipeline processes scalar operands of repeated scalar
instructions.
Vector Pipelines: This type of pipeline processes vector instructions over vector operands.
Instruction Pipelines
As discussed earlier, the stream of instructions in the instruction execution cycle, can be
realized through a pipeline where overlapped execution of different operations are performed.
The process of executing the instruction involves the following major steps:
• Fetch the instruction from the main memory
• Decode the instruction
• Fetch the operand
• Execute the decode dinstruction
These four steps become the candidates for stages for the pipeline, which we callas instruction
pipeline (It is shown in Figure4.13).
12
Fig 4.13: Instruction Pipeline
Since, in the pipelined execution, there is overlapped execution of operations, the four stages of
the instruction pipeline will work in the overlapped manner. First, the instruction address is
fetched from the memory to the first stage of the pipeline. The first stage fetches the instruction
and gives its output to the second stage. While the second stage of the pipeline is decoding the
instruction, the first stage gets another input and fetches the next instruction. When the first
instruction has been decoded in the second stage, then its output is fed to the third stage. When
the third stage is fetching the operand for the first instruction, then the second stage gets the
second instruction and the first stage gets input for another instruction and so on. In this way,
the pipeline is executing the instruction in an overlapped manner increasing the throughput and
speed of execution.
The scenario of these overlapped operations in the instruction pipeline can be illustrated
through the space-time diagram. In Figure4, first we show the space-time diagram for non-
overlapped execution in a sequential environment and then for the overlapped pipelined
environment. It is clear from the two diagrams that in non-overlapped execution, results are
achieved only after 4 cycles while in overlapped pipelined execution, after 4 cycles, we are
getting output after each cycle. Soon in the instruction pipeline, the instruction cycle has been
reduced to ¼ of the sequential execution.
13
Fig 4.14: space – time diagram for non pipelined process
Fig 4.15: space-time diagram for overlapped Instruction pipelined processor
Instruction buffers: For taking the full advantage of pipelining, pipelines should be filled
continuously. Therefore, instruction fetch rate should be matched with the pipeline
consumption rate. To do this, instruction buffers are used. Instruction buffers in CPU have high
speed memory for storing the instructions. The instructions are pre-fetched in the buffer from
the main memory. Another alternative for the instruction buffer is the cache memory between
the CPU and the main memory. The advantage of cache memory is that it can be used for both
instruction and data. But cache requires more complex control logic than the instruction buffer.
Some pipelined computers have adopted both.
Arithmetic Pipelines
The technique of pipelining can be applied to various complex and slow arithmetic operations
to speed up the processing time. The pipelines used for arithmetic computations are called
Arithmetic pipelines. In this section, we discuss arithmetic pipelines based on arithmetic
operations. Arithmetic pipelines are constructed for simple fixed-point and complex floating-
point arithmetic operations. These arithmetic operations are well suited to pipelining as these
operations can be efficiently partitioned into subtasks for the pipeline stages. For implementing
the arithmetic pipelines we generally use following two types of adder:
i) Carry propagation adder (CPA): It adds two numbers such thatcarries generated in
14
successive digits arepropagated.
ii) Carry save adder (CSA): It adds two numbers such that carries generated are not
propagated rather these are saved in a carry vector.
Fixed Arithmetic pipelines: We take the example of multiplication of fixed numbers. Two
fixed-point numbers are added by the ALU using add and shift operations. This sequential
execution makes the multiplication a slow process. If we look at the multiplication process
carefully, then we observe that this is the process of adding the multiple copies of shifted
multiplicands as show below:
Fig 4.16: Multiple copies of shifted multiplicands
Now, we can identify the following stages for the pipeline:

• The first stage generates the partial product of the numbers, which form the six rows of
shifted multiplicands.
• In the second stage, the six numbers are given to the two CSAs merging into four
numbers.
• In the third stage, there is a single CSA merging the numbers into 3 numbers.
• In the fourth stage, there is a single number merging three numbers into 2 numbers.
• In the fifth stage, the last two numbers are added through a CPA to get the final product.
These stages have been implemented using CSA tree as shown in Figure
Fig 4.17: Carry save adder
Floating point Arithmetic pipelines: Floating point computations are the best candidates for
pipelining. Take the example of addition of two floating point numbers. Following stages are
identified for the addition of two floating point numbers:
First stage will compare the exponents of the two numbers.
• Second stage will look for alignment of mantissas.

• In the third stage, mantissas are added.
• In the last stage, the result is normalized.
These stages are shown in Figure6.
Fig 4.18: Arithmetic pipelining for floating point addition of two numbers
Performance and Issues in Pipelining
Speedup :First we take the speedup factor that is we see how much speed up performance we
get through pipelining.
• First we take the ideal case for measuring the speedup.

• Let n be the total number of tasks executed through m stages of pipelines.
• Then m stages can process n tasks in clock cycles = m + (n-1)
Time taken to execute without pipelining = m.n
Speedup due to pipelining = m.n/[m +(n-1)].
As n>=∞ , There is speedup of n times over the non-pipelined execution.
Efficiency: The efficiency of a pipeline can be measured as the ratio of busy time span to the
total time span including the idle time. Let c be the clock period of the pipeline, the efficiency
E can be denoted as:
E = (n. m. c) / m.[m.c + (n-1).c] = n / (m + (n-1) As n-> ∞ , E becomes 1.

Throughput: Throughput of a pipeline can be defined as the number of results that have been
achieved per unit time. It can be denoted as: T = n / [m + (n-1)]. c = E / c Throughput denotes
the computing power of the pipeline. Maximum speedup, efficiency and throughput are the
ideal cases but these are not achieved in the practical cases, as the speedup is limited due to the
following factors:
• Data dependency between successive tasks: There may be dependencies between the
instructions of two tasks used in the pipeline. For example, one instruction cannot be
started until the previous instruction returns the results, as both are interdependent.
Another instance of data dependency will be when that both instructions try to modify
the same data object. These are called data hazards.
• Resource Constraints: When resources are not available at the time of execution then
delays are caused in pipelining. For example, if one common memory is used for both
data and instructions and there is need to read/write and fetch the instruction at the same
time then only one can be carried out and the other has to wait. Another example is of
limited resource like execution unit, which may be busy at the required time.
• Branch Instructions and Interrupts in the program: A program is not a straight flow
of sequential instructions. There may be branch instructions that alter the normal flow
of program, which delays the pipelining execution and affects the performance.
Similarly, there are interrupts that postpones the execution of next instruction until the
interrupt has been serviced. Branches and the interrupts have damaging effects on the
pipelining.
Vector Processing
A vector is an ordered set of the same type of scalar data items. The scalar item can be a
floating pint number, an integer or a logical value. Vector processing is the arithmetic or logical
computation applied on vectors whereas in scalar processing only one or pair of data is
processed. Therefore, vector processing is faster compared to scalar processing. When the
scalar code is converted to vector form then it is called vectorization. A vector processor is a
special coprocessor, which is designed to handle the vector computations.
Vector instructions can be classified as below:
Vector-Vector Instructions: In this type, vector operands are fetched from the vector register
and stored in another vector register. These instructions are denoted with the following function
mappings:
F1:V->V
F2:V×V->V
For example, vector square root is of F1 type and addition of two vectors is of F2.
Vector-Scalar Instructions: In this type, when the combination of scalar and vector are
fetched and stored in vector register. These instructions are denoted with the following function
mappings:
F3 : S × V -> V where S is the scalar item
For example, vector-scalar addition or divisions are of F3 type.

Vector reduction Instructions: When operations on vector are being reduced to scalar items
as the result, then these are vector reduction instructions. These instructions are denoted with
the following function mappings:
F4:V->S
F5:V×V->S
For example, finding the maximum, minimum and summation of all the elements of vector are
of the type F4. The dot product of two vectors is generated by F5.
Vector-Memory Instructions: When vector operations with memory M are performed then
these are vector-memory instructions. These instructions are denoted with the following
function mappings:
F6:M->V
F7:V->V
For example, vector load is of type F6 and vector store operation is of F7.
Vector Processing with Pipelining: Since in vector processing, vector instructions perform the
same computation on different data operands repeatedly, vector processing is most suitable for
pipelining. Vector processors with pipelines are designed to handle vectors of varying length n
where n is the length of vector. A vector processor performs better if length of vector is larger.
But a large value of n causes the problem in storage of vectors and there is difficulty in moving
the vectors to and from the pipelines.
Pipeline Vector processors adopt the following two architectural conFigureurations for this
problem as discussed below:
Memory-to-Memory Architecture: The pipelines can access vector operands, intermediate

and final results directly in the main memory. This requires the higher memory bandwidth.
Moreover, the information of the base address, the offset and vector length should be specified
for transferring the data streams between the main memory and pipelines. STAR-100 and TI-
ASC computers have adopted this architecture for vector instructions.
Register-to-Register Architecture: In this organization, operands and results are accessed

indirectly from the main memory through the scalar or vector registers. The vectors which are
required currently can be stored in the CPU registers. Cray- 1 computer adopts this architecture
for the vector instructions and its CPY contains 8 vector registers, each register capable of
storing a 64 element vector where one element is of 8 bytes.
Efficiency of Vector Processing over Scalar Processing:
As we know, a sequential computer processes scalar operands one at a time. Therefore if we

have to process a vector of length n through the sequential computer then the vector must be
broken into n scalar steps and executed one by one.
For example, consider the following vector addition:

A+B→C
The vectors are of length 500. This operation through the sequential computer can be specified
by 500 add instructions as given below:
If we perform the same operation through a pipelined-vector computer then it does not break
the vectors in 500 add statements. Because a vector processor has the set of vector instructions
that allow the operations to be specified in one vector instruction as:
A (1:500) + B (1:500) → C (1:500)
Each vector operation may be broken internally in scalar operations but they are executed in
parallel which results in mush faster execution as compared to sequential computer. Thus, the
advantage of adopting vector processing over scalar processing is that it eliminates the
overhead caused by the loop control required in a sequential computer.
Array Processing
We have seen that for performing vector operations, the pipelining concept has been used.
There is another method for vector operations. If we have an array of n processing elements
(PEs) i.e., multiple ALUs for storing multiple operands of the vector, then an n instruction, for
example, vector addition, is broadcast to all PEs such that they add all operands of the vector at
the same time. That means all PEs will perform computation in parallel. All PEs are
synchronized under one control unit. This organization of synchronous array of PEs for vector
operations is called Array Processor. The organization is same as in SIMD which we studied in
unit 2. An array processor can handle one instruction and multiple data streams as we have seen
in case of SIMD organization. Therefore, array processors are also called SIMD array
computers.
The organization of an array processor is shown in Figure 7. The following components are
organized in an array processor:
Fig 4.19: Organization of SIMD Array Processor
Control Unit (CU): All PEs are under the control of one control unit. CU controls the inter
communication between the PEs. There is a local memory of CU also called CY memory. The
user programs are loaded into the CU memory. The vector instructions in the program are
decoded by CU and broadcast to the array of PEs. Instruction fetch and decoding is done by the
CU only.
Processing elements (PEs): Each processing element consists of ALU, its registers and a local
memory for storage of distributed data. These PEs have been interconnected via an
interconnection network. All PEs receive the instructions from the control unit and the different
component operands are fetched from their local memory. Thus, all PEs perform the same
function synchronously in a lock-step fashion under the control of the CU.
It may be possible that all PEs need not participate in the execution of a vector instruction.
Therefore, it is required to adopt a masking scheme to control the status of each PE. A masking
vector is used to control the status of all PEs such that only enabled PEs are allowed to
participate in the execution and others are disabled.
Interconnection Network (IN): IN performs data exchange among the PEs, data routing and
manipulation functions. This IN is under the control of CU.
Host Computer: An array processor may be attached to a host computer through the control
unit. The purpose of the host computer is to broadcast a sequence of vector instructions through
CU to the PEs. Thus, the host computer is a general-purpose machine that acts as a manager of
the entire system.
Array processors are special purpose computers which have been adopted for the following:
• various scientific applications,
• matrix algebra,
• matrix eigen value calculations,
• real-time scene analysis
SIMD array processor on the large scale has been developed by NASA for earth resources
satellite image processing. This computer has been named Massively parallel processor (MPP)
because it contains 16,384 processors that work in parallel. MPP provides real-time time
varying scene analysis.
However, array processors are not commercially popular and are not commonly used. The
reasons are that array processors are difficult to program compared to pipelining and there is
problem in vectorization.
Associative Array Processing
Consider that a table or a list of record is stored in the memory and you want to find some
information in that list. For example, the list consists of three fields as shown below:
Suppose now that we want to find the ID number and age of Ravi. If we use conventional
RAM then it is necessary to give the exact physical address of entry related to Ravi in the
instruction access the entry such as:
READ ROW 3
Another alternative idea is that we search the whole list using the Name field as an address
in the instruction such as:
READ NAME = RAVI
Again with serial access memory this option can be implemented easily but it is a very
slow process.
An associative memory helps at this point and simultaneously examines all the entries in the
list and returns the desired list very quickly. SIMD array computers have been developed with
associative memory. An associative memory is content addressable memory, by which it is
meant that multiple memory words are accessible in parallel. The parallel accessing feature also
support parallel search and parallel compare. This capability can be used in many applications
such as:
• Storage and retrieval of databases which are changing rapidly
• Radar signal tracking
• Image processing
• Artificial Intelligence
The inherent parallelism feature of this memory has great advantages and impact in parallel
computer architecture. The associative memory is costly compared to RAM. The array
processor built with associative memory is called Associative array processor. In this section,
we describe some categories of associative array processor. Types of associative processors are
based on the organization of associative memory. Therefore, first we discuss about the
associative memory organization.
Associative Memory Organizations
The associative memory is organized in w words with b bits per word. In w x b array, each bit
is called a cell. Each cell is made up of a flip-flop that contains some comparison logic gates for
pattern match and read-write operations. Therefore, it is possible to read or write in parallel due
to this logic structure. A group of bit cells of all the words at the same position in a vertical
column is called bit slice as shown in Figure 4.20.
Fig 4.20: Associative Memory
In the organization of an associative memory, following registers are used:
• Comparand Register (C): This register is used to hold the operands, which are being
searched for, or being compared with.
• Masking Register (M): It may be possible that all bit slices are not involved in parallel
operations. Masking register is used to enable or disable the bit slices.
• Indicator (I) and Temporary (T) Registers: Indicator register is used to hold the current
match patterns and temporary registers are used to hold the previous match patterns.
There are following two methods for organizing the associative memory based on bit slices:
• Bit parallel organization: In this organization all bit slices which are not masked off,
participate in the comparison process, i.e., all words are used in parallel.
• Bit Serial Organization: In this organization, only one bit slice participate in the
operation across all the words. The bit slice is selected through an extra logic and
control unit. This organization is slower in speed but requires lesser hardware as
compared to bit parallel which is faster.
Types of Associative Processor
Based on the associative memory organizations, we can classify the associative processors into
the following categories:
1) Fully Parallel Associative Processor: This processor adopts the bit parallel memory
organization. There are two type of this associative processor:
• Word Organized associative processor: In this processor one comparison logic is
used with each bit cell of every word and the logical decision is achieved at the
output of every word.
• Distributed associative processor: In this processor comparison logic is provided
with each character cell of a fixed number of bits or with a group of character cells.
This is less complex and therefore less expensive compared to word organized
associative processor.
2) Bit Serial Associative Processor: When the associative processor adopts bit serial
memory organization then it is called bit serial associative processor. Since only one bit slice
is involved in the parallel operations, logic is very much reduced and therefore this processor
is much less expensive than the fully parallel associative processor.
PEPE is an example of distributed associative processor which was designed as a special

purpose computer for performing real time radar tracking in a missile environment. STARAN
is an example of a bit serial associative processor which was designed for digital image
processing. There is a high cost performance ratio of associative processors. Due to this reason
these have not been commercialised and are limited to military applications.
Superscalar Processors
In scalar processors, only one instruction is executed per cycle. That means only one instruction
is issued per cycle and only one instruction is completed. But the speed of the processor can be
improved in scalar pipeline processor if multiple instructions instead of one are issued per
cycle. This idea of improving the processor’s speed by having multiple instructions per cycle is
known as Superscalar processing. In superscalar processing multiple instructions are issued per
cycle and multiple results are generated per cycle. Thus, the basic idea of superscalar processor
is to have more instruction level parallelism.
Instruction Issue degree: The main concept in superscalar processing is how many
instructions we can issue per cycle. If we can issue k number of instructions per cycle in a
superscalar processor, then that processor is called a k-degree superscalar processor. If we want
to exploit the full parallelism from a superscalar processor then k instructions must be
executable in parallel.
For example, we consider a 2-degree superscalar processor with 4 pipeline stages for
instruction cycle, i.e. instruction fetch (IF), decode instruction (DI), fetch the operands (FO),
execute the instruction (EI) as shown in Figure3. In this superscalar processor, 2 instructions
are issued per cycle as shown in Figure 9. Here, 6 instructions in 4 stage pipeline have been
executed in 6 clock cycles. Under ideal conditions, after steady state, two instructions are being
executed per cycle.
Fig 4.21: Super scalar processing of instruction cycle in 4-stage instruction pipeline
For implementing superscalar processing, some special hardware must be provided which is
discussed below:
• The requirement of data path is increased with the degree of superscalar processing.
Suppose, one instruction size is 32 bit and we are using 2-degree superscalar processor,
then 64 data path from the instruction memory is required and 2 instruction registers are
also needed.
• Multiple execution units are also required for executing multiple instructions and to
avoid resource conflicts.
Data dependency will be increased in superscalar processing if sufficient hardware is not
provided. The extra hardware provided is called hardware machine parallelism. Hardware
parallelism ensures that resource is available in hardware to exploit parallelism. Another
alternative is to exploit the instruction level parallelism inherent in the code. This is achieved
by transforming the source code by an optimizing compiler such that it reduces the dependency
and resource conflicts in the resulting code.
Many popular commercial processors have been implemented with superscalar architecture like
IBM RS/6000, DEC 21064, MIPS R4000, Power PC, Pentium, etc.
VLIW Architecture
Superscalar architecture was designed to improve the speed of the scalar processor. But it has
been realized that it is not easy to implement as we discussed earlier. Following are some
problems faced in the superscalar architecture:
• It is required that extra hardware must be provided for hardware parallelism such as
instruction registers, decoder and arithmetic units, etc.
• Scheduling of instructions dynamically to reduce the pipeline delays and to keep all
processing units busy, is very difficult.
Another alternative to improve the speed of the processor is to exploit a sequence of

instructions having no dependency and may require different resources, thus avoiding resource
conflicts. The idea is to combine these independent instructions in a compact long word
incorporating many operations to be executed simultaneously. That is why; this architecture is
called very long instruction word (VLIW) architecture. In fact, long instruction words carry
the opcodes of different instructions, which are dispatched to different functional units of the
processor. In this way, all the operations to be executed simultaneously by the functional units
are synchronized in a VLIW instruction. The size of the VLIW instruction word can be in
hundreds of bits. VLIW instructions must be formed by compacting small instruction words of
conventional program. The job of compaction in VLIW is done by a compiler. The processor
must have the sufficient resources to execute all the operations in VLIW word simultaneously.
For example, one VLIW instruction word is compacted to have load/store operation, floating
point addition, floating point multiply, one branch, and one integer arithmetic as shown in
Figure 10.
Fig 4.22: VLIW Instruction word
A VLIW processor to support the above instruction word must have the functional components
as shown in Figure 11. All the functions units have been incorporated according to the VLIW
instruction word. All the units in the processor share one common large register file.
Fig 4.23: VLIW Processor
Parallelism in instructions and data movement should be completely specified at compile time.
But scheduling of branch instructions at compile time is very difficult. To handle branch
instructions, trace scheduling is adopted. Trace scheduling is based on the prediction of branch
decisions with some reliability at compile time. The prediction is based on some heuristics,
hints given by the programmer or using profiles of some previous program executions.
Multi-Threaded Processors
In unit 3, we have seen the use of distributed shared memory in parallel computer architecture.
But the use of distributed shared memory has the problem of accessing the remote memory,
which results in latency problems. This problem increases in case of large-scale
multiprocessors like massively parallel processors (MPP).
For example, one processor in a multiprocessor system needs two memory loads of two
variables from two remote processors as shown in Figure 12. The issuing processor will use
these variables simultaneously in one operation. In case of large-scale MPP systems, the
following two problems arise:
Fig 4.24: Latency Problem in MPP
Remote-load Latency Problem: When one processor needs some remote loading of data from
other nodes, then the processor has to wait for these two remote load operations. The longer the
time taken in remote loading, the greater will be the latency and idle period of the issuing
processor.
Synchronization Latency Problem: If two concurrent processes are performing remote

loading, then it is not known by what time two processes will load, as the issuing processor
needs two remote memory loads by two processes together for some operation. That means two
concurrent processes return the results asynchronously and this causes the synchronization
latency for the processor.
Concept of Multithreading: These problems increase in the design of large-scale

multiprocessors such as MPP as discussed above. Therefore, a solution for optimizing these
latency should be acquired at. The concept of Multithreading offers the solution to these
problems. When the processor activities are multiplexed among many threads of execution,
then problems are not occurring. In single threaded systems, only one thread of execution per
process is present. But if we multiplex the activities of process among several threads, then the
multithreading concept removes the latency problems.
In the above example, if multithreading is implemented, then one thread can be for issuing a
remote load request from one variable and another thread can be for remote load for second
variable and third thread can be for another operation for the processor and so on.
Multithreaded Architecture: It is clear now that if we provide many contexts to multiple

threads, then processors with multiple contexts are called multithreaded systems. These systems
are implemented in a manner similar to multitasking systems. A multithreaded processor will
suspend the current context and switch to another. In this way, the processor will be busy most
of the time and latency problems will also be optimized. Multithreaded architecture depends on
the context switching time between the threads. The switching time should be very less as
compared to latency time.
The processor utilization or its efficiency can be measured as:
U=P/(P+I+S)
where
P = useful processing time for which processor is busy
I = Idle time when processor is waiting
S = Context switch time used for changing the active thread on the processor
The objective of any parallel system is to keep U as high as possible. U will be high if I and S
are very low or negligible. The idea of multithreading systems is to reduce I such that S is not
increasing. If context-switching time is more when compared to idle time, then the purpose of
multithreaded systems is lost.
Design issues: To achieve the maximum processor utilization in a multithreaded architecture,

the following design issues must be addressed:
• Context Switching time: S < I, that means very fast context switching mechanism is
needed.
• Number of Threads: A large number of threads should be available such that processor
switches to an active thread from the idle state.
General-purpose computing on graphics processing units
General-purpose computing on graphics processing units (GPGPU, rarely GPGP) is the

use of a graphics processing unit (GPU), which typically handles computation only for
computer graphics, to perform computation in applications traditionally handled by the central
processing unit (CPU). The use of multiple video cards in one computer, or large numbers of
graphics chips, further parallelizes the already parallel nature of graphics processing. In
addition, even a single GPU-CPU framework provides advantages that multiple CPUs on their
own do not offer due to the specialization in each chip.
Essentially, a GPGPU pipeline is a kind of parallel processing between one or more GPUs and
CPUs that analyzes data as if it were in image or other graphic form. While GPUs operate at
lower frequencies, they typically have many times the number of cores. Thus, GPUs can
process far more pictures and graphical data per second than a traditional CPU. Migrating data
into graphical form and then using the GPU to scan and analyze it can create a large speedup.
GPGPU pipelines were developed at the beginning of the 21st century for graphics processing
(e.g., for better shaders). These pipelines were found to fit scientific computing needs well, and
have since been developed in this direction.
History
In principle, any arbitrary boolean function, including those of addition, multiplication and
other mathematical functions can be built-up from a functionally complete set of logic
operators. In 1987, Conway's Game of Life became one of the first examples of general
purpose computing using an early stream processor called a blitter to invoke a special sequence
of logical operations on bit vectors.
General-purpose computing on GPUs became more practical and popular after about 2001,
with the advent of both programmable shaders and floating point support on graphics
processors. Notably, problems involving matrices and/or vectors – especially two-, three-, or
four- dimensional vectors – were easy to translate to a GPU, which acts with native speed and
support on those types. The scientific computing community's experiments with the new
hardware began with a matrix multiplication routine (2001); one of the first common scientific
programs to run faster on GPUs than CPUs was an implementation of LU factorization (2005).
These early efforts to use GPUs as general-purpose processors required reformulating
computational problems in terms of graphics primitives, as supported by the two major APIs
for graphics processors, OpenGL and DirectX. This cumbersome translation was obviated by
the advent of general-purpose programming languages and APIs such as Sh/RapidMind, Brook
and Accelerator. These were followed by Nvidia's CUDA, which allowed programmers to
ignore the underlying graphical concepts in favor of more common high-performance
[8]
computing concepts. Newer, hardware vendor-independent offerings include Microsoft's
Direct Compute and Apple/Khronos Group's OpenCL. This means that modern GPGPU
pipelines can leverage the speed of a GPU without requiring full and explicit conversion of the
data to a graphical form.
Implementation:
Any language that allows the code running on the CPU to poll a GPU shaderfor return values,
can create a GPGPU framework.
As of 2016, OpenCL is the dominant open general-purpose GPU computing language, and is an
open standard defined by the Khronos Group. OpenCL provides a cross-platform GPGPU
platform that additionally supports data parallel compute on CPUs. OpenCL is actively
supported on Intel, AMD, Nvidia, and ARM platforms. The Khronos Group has also
standardized and implemented SYCL, a higher-level programming model for OpenCL as a
single-source domain specific embedded language based on pureC++11.
The dominant proprietary framework is NvidiaCUDA. Nvidia launched CUDA in 2006, a
software development kit (SDK) and application programming interface (API) that allows
using the programming language C to code algorithms for execution on GeForce 8 series and
laterGPUs.
Programming standards for parallel computing include OpenCL(vendor-independent),
OpenACC, and OpenHMPP. Mark Harris, the founder of GPGPU.org, coined the termGPGPU.
The Xcelerit SDK, created by Xcelerit, is designed to accelerate large existing C++ or C# code-
bases on GPUs with minimal effort. It provides a simplified programming model, automates
parallelisation, manages devices and memory, and compiles to CUDA binaries. Additionally,
multi-core CPUs and other accelerators can be targeted from the same source code.
Open VIDIA was developed at University of Toron to between 2003-2005, in collaboration
with Nvidia. Altimesh
available as a Visual Studio Extension on Visual Studio Marketplace.
Microsoft introduced the DirectComputeGPU computing API, released with the DirectX 11
API. Alea GPU created by Quant Alea introduces native GPU computing capabilities for the
Microsoft .NET language F# and C#. Alea GPU also provides a simplified GPU programming
model based on GPU parallel-for and parallel aggregate using delegates and automatic memory
management.
MATLAB supports GPGPU acceleration using the Parallel Computing Toolbox and MATLAB
Distributed Computing Server, and third-party packages like Jacket.
GPGPU processing is also used to simulate Newtonian physics by Physics engines, and
commercial implementations include Havok Physics, FX and PhysX, both of which are
typically used for computer and video games.
Close to Metal, now called Stream, is AMD's GPGPU technology for ATI Radeon-based
GPUs.
C++ Accelerated Massive Parallelism (C++ AMP) is a library that accelerates execution of C+
+ code by exploiting the data-parallel hardware onGPUs.
Mobile computers
Due to a trend of increasing power of mobile GPUs, general-purpose programming became
available also on the mobile devices running major mobile operating systems.
Google Android 4.2 enabled running RenderScriptcode on the mobile device GPU. Apple
introduced a proprietary Metal API for iOSapplications, able to execute arbitrary code through
Apple's GPU computeshaders.
Hardware support:
Computer video cards are produced by various vendors, such as Nvidia, and AMD and ATI.
Cards from such vendors differ on implementing data-format support, such as integer and
floating-point formats (32-bit and 64-bit). Microsoft introduced a
ShaderModel standard, to help rank the various features of graphic cards into a simple Shader
Model version number (1.0, 2.0, 3.0,etc.).
Integer numbers
Pre-DirectX 9 video cards only supported palettedor integer color types. Various formats are
[
available, each containing a red element, a green element, and a blue element. Sometimes
another alpha value is added, to be used for transparency. Common formats are:
• 8 bits per pixel – Sometimes palette mode, where each value is an index in a table with
the real color value specified in one of the other formats. Sometimes three bits for red,
three bits for green, and two bits forblue.
• 16 bits per pixel – Usually the bits are allocated as five bits for red, six bits for green,
and five bits forblue.
• 24 bits per pixel – There are eight bits for each of red, green, andblue.
• 32 bits per pixel – There are eight bits for each of red, green, blue, andalpha.
Floating-point numbers
For early fixed-function or limited programmability graphics (i.e., up to and including DirectX
8.1-compliant GPUs) this was sufficient because this is also the representation used in displays.
It is important to note that this representation does have certain limitations. Given sufficient
graphics processing power even graphics programmers would like to use better formats, such as
floating point data formats, to obtain effects such as high dynamic range imaging. Many
GPGPU applications require floating point accuracy, which came with video cards conforming
to the DirectX 9specification.
DirectX 9 Shader Model 2.x suggested the support of two precision types: full and partial
precision. Full precision support could either be FP32 or FP24 (floating point 32- or 24-bit per
component) or greater, while partial precision was FP16. ATI's Radeon R300 series of GPUs
supported FP24 precision only in the programmable fragment pipeline (although FP32 was
supported in the vertex processors) while Nvidia'sNV30 series supported both FP16 and FP32;
other vendors such as S3 Graphics and XGI supported a mixture of formats up to FP24.
The implementations of floating point on Nvidia GPUs are mostly IEEE compliant; however,
this is not true across all vendors. This has implications for correctness which are considered
important to some scientific applications. While 64-bit floating point values (double precision
float) are commonly available on CPUs, these are not universally supported on GPUs. Some
GPU architectures sacrifice IEEE compliance, while others lack double-precision. Efforts have
occurred to emulate double-precision floating point values on GPUs; however, the speed
tradeoff negates any benefit to offloading the computing onto the GPU in the first place.
Vectorization
Most operations on the GPU operate in a vectorized fashion: one operation can be performed
on up to four values at once. For example, if one color <R1, G1, B1>is to be modulated by
another color <R2, G2, B2>, the GPU can produce the resulting color <R1*R2, G1*G2,
B1*B2>in one operation. This functionality is useful in graphics because almost every basic
[citation needed]
data type is a vector (either 2-, 3-, or 4-dimensional). Examples include
vertices, colors, normal vectors, and texture coordinates. Many other applications can put this
to good use, and because of their higher performance, vector instructions, termed single
instruction, multiple data (SIMD), have long been available on CPUs.
GPU Vs CPU
Originally, data was simply passed one-way from a central processing unit (CPU) to a graphics
processing unit (GPU), then to a display device. As time progressed, however, it became
valuable for GPUs to store at first simple, then complex structures of data to be passed back to
the CPU that analyzed an image, or a set of scientific-data represented as a 2D or 3D format
that a video card can understand. Because the GPU has access to every draw operation, it can
analyze data in these forms quickly, whereas a CPU must poll every pixel or data element much
more slowly, as the speed of access between a CPU and its larger pool of random-access
memory (or in an even worse case, a hard drive) is slower than GPUs and video cards, which
typically contain smaller amounts of more expensive memory that is much faster to access.
Transferring the portion of the data set to be actively analyzed to that GPU memory in the form
of textures or other easily readable GPU forms results in speed increase. The distinguishing
feature of a GPGPU design is the ability to transfer information bi directionally back from the
GPU to the CPU; generally the data throughput in both directions is ideally high, resulting in a
multiplier effect on the speed of a specific high-use algorithm. GPGPU pipelines may improve
efficiency on especially large data sets and/or data containing 2D or 3D imagery. It is used in
complex graphics pipelines as well as scientific computing; more so in fields with large data
sets like genome mapping, or where two- or three-dimensional analysis is useful – especially at
present biomolecule analysis, protein study, and other complex organic chemistry. Such
pipelines can also vastly improve efficiency in image processing and computer vision, among
other fields; as well as parallel processing generally. Some very heavily optimized pipelines
have yielded speed increases of several hundred times the original CPU-based
pipeline on one high- use task.
A simple example would be a GPU program that collects data about average lighting values as
it renders some view from either a camera or a computer graphics program back to the main
program on the CPU, so that the CPU can then make adjustments to the overall screen view. A
more advanced example might use edge detection to return both numerical information and a
processed image representing outlines to a computer vision program controlling, say, a mobile
robot. Because the GPU has fast and local hardware access to every pixel or other picture
element in an image, it can analyze and average it (for the first example) or apply a Sobel edge
filter or other convolution filter (for the second) with much greater speed than a CPU, which
typically must access slower random-access memory copies of the graphic in question.
GPGPU is fundamentally a software concept, not a hardware concept; it is a type of algorithm,
not a piece of equipment. Specialized equipment designs may, however, even further enhance
the efficiency of GPGPU pipelines, which traditionally perform relatively few algorithms on
very large amounts of data. Massively parallelized, gigantic-data-level tasks thus may be
parallelized even further via specialized setups such as rack computing (many similar, highly
tailored machines built into a rack), which adds a third layer – many computing units each
using many CPUs to correspond to many GPUs. Some Bitcoin"miners" used such setups for
high-quantity processing.
Caches
Historically, CPUs have used hardware-managed caches but the earlier GPUs only provided
software-managed local memories. However, as GPUs are being increasingly used for general-
purpose applications, state-of-the-art GPUs are being designed with hardware-managed multi-
level caches which have helped the GPUs to move towards mainstream computing. For
example, GeForce 200 series GT200 architecture GPUs did not feature an L2 cache, the Fermi
GPU has 768 KiB last-level cache, the KeplerGPU has 1.5 MiBlast-level cache,theMaxwell
GPU has 2 MiB last-level cache and the Pascal GPU has 4 MiB last-level cache.
Register file
GPUs have very large register files, which allow them to reduce context-switching latency.
Register file size is also increasing over different GPU generations, e.g., the total register file
size on Maxwell (GM200) and Pascal GPUs are 6 MiBand 14 MiB, respectively. By
comparison, the size of a register file on CPUs is small, typically tens or hundreds of kilobytes.
Energy efficiency
Several research projects have compared the energy efficiency of GPUs with that of CPUs and
FPGAs.
Stream processing
GPUs are designed specifically for graphics and thus are very restrictive in operations and
programming. Due to their design, GPUs are only effective for problems that can be solved
using stream processing and the hardware can only be used in certain ways.
The following discussion referring to vertices, fragments and textures concerns mainly the
legacy model of GPGPU programming, where graphics APIs (OpenGL or DirectX) were used
to perform general-purpose computation. With the introduction of the CUDA (Nvidia, 2007)
and OpenCL(vendor-independent, 2008) general-purpose computing APIs, in new GPGPU
codes it is no longer necessary to map the computation to graphics primitives. The stream
processing nature of GPUs remains valid regardless of the APIs used.
GPUs can only process independent vertices and fragments, but can process many of them in
parallel. This is especially effective when the programmer wants to process many vertices or
fragments in the same way. In this sense, GPUs are stream processors – processors that can
operate in parallel by running one kernel on many records in a stream at once.
A stream is simply a set of records that require similar computation. Streams provide data
parallelism. Kernels are the functions that are applied to each element in the stream. In the
GPUs, vertices and fragments are the elements in streams and vertex and fragment shaders are
the kernels to be run on them. For each element we can only read from the input, perform
operations on it, and write to the output. It is permissible to have multiple inputs and multiple
outputs, but never a piece of memory that is both readable and writable.
Arithmetic intensity is defined as the number of operations performed per word of memory
transferred. It is important for GPGPU applications to have high arithmetic intensity else the
memory access latency will limit computational speedup.
Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency
between data elements.
GPU programming concepts
Computational resources
There are a variety of computational resources available on the GPU:
• Programmable processors – vertex, primitive, fragment and mainly compute pipelines

allow programmer to perform kernel on streams of data
• Rasterizer – creates fragments and interpolates per-vertex constants such as
texture coordinates and color
• Texture unit – read-only memory interface
• Frame buffer – write-only memory interface
In fact, a program can substitute a write only texture for output instead of the frame buffer. This
is done either through Render to Texture (RTT), Render-To-Backbuffer-Copy-To-Texture
(RTBCTT), or the more recent stream-out.
Textures as stream
The most common form for a stream to take in GPGPU is a 2D grid because this fits naturally
with the rendering model built into GPUs. Many computations naturally map into grids: matrix
algebra, image processing, physically based simulation, and so on.
Since textures are used as memory, texture lookups are then used as memory reads. Certain
operations can be done automatically by the GPU because of this
Kernels
Compute kernels can be thought of as the body of loops. For example, a programmer operating
// Input and output grids have 10000 x 10000 or 100 million elements.
void transform_10k_by_10k_grid(float in[10000][10000], float out[10000][10000])

{
for(int x = 0; x < 10000; x++) {
for(int y = 0; y < 10000; y++) {
// The next line is executed 100 million times
out[x][y] = do_some_hard_work(in[x][y]);
}
}
}
on a grid on the CPU might have code that looks like this:
On the GPU, the programmer only specifies the body of the loop as the kernel and what data to
loop over by invoking geometry processing.
Flow control
In sequential code it is possible to control the flow of the program using if-then-else statements
and various forms of loops. Such flow control structures have only recently been added to
GPUs. Conditional writes could be performed using a properly crafted series of arithmetic/bit
operations, but looping and conditional branching were not possible.
Recent GPUs allow branching, but usually with a performance penalty. Branching should
generally be avoided in inner loops, whether in CPU or GPU code, and various methods, such
as static branch resolution, pre-computation, predication, loop splitting, and Z-cull can be used
to achieve branching when hardware support does not exist.
GPU methods
Map
The map operation simply applies the given function (the kernel) to every element in the
stream. A simple example is multiplying each value in the stream by a constant (increasing the
brightness of an image). The map operation is simple to implement on the GPU. The
programmer generates a fragment for each pixel on screen and applies a fragment program to
each one. The result stream of the same size is stored in the output buffer.
Reduce
Some computations require calculating a smaller stream (possibly a stream of only 1 element)
from a larger stream. This is called a reduction of the stream. Generally, a reduction can be
performed in multiple steps. The results from the prior step are used as the input for the current
step and the range over which the operation is applied is reduced until only one stream element
remains.
Stream filtering
Stream filtering is essentially a non-uniform reduction. Filtering involves removing items from
the stream based on some criteria.
Scan
The scan operation, also termed parallel prefix sum, takes in a vector (stream) of data elements
and an (arbitrary) associative binary function '+' with an identity element 'i'. If the input is [a0,
a1, a2, a3, ...], an exclusive scan produces the output [i, a0, a0 + a1, a0 + a1 + a2, ...], while an
inclusive scan produces the output [a0, a0 + a1, a0 + a1 + a2, a0 + a1 + a2 + a3, ...] and does
not require an identity to exist. While at first glance the operation may seem inherently serial,
efficient parallel scan algorithms are possible and have been implemented on graphics
processing units. The scan operation has uses in e.g., quick sort and sparse matrix-vector
multiplication.
Scatter
The scatter operation is most naturally defined on the vertex processor. The vertex processor is
able to adjust the position of the vertex, which allows the programmer to control where
information is deposited on the grid. Other extensions are also possible, such as controlling
how large an area the vertex affects.
The fragment processor cannot perform a direct scatter operation because the location of each
fragment on the grid is fixed at the time of the fragment's creation and cannot be altered by the
programmer. However, a logical scatter operation may sometimes be recast or implemented
with another gather step. A scatter implementation would first emit both an output value and an
output address. An immediately following gather operation uses address comparisons to see
whether the output value maps to the current output slot.
In dedicated compute kernels, scatter can be performed by indexed writes.
Gather
Gather is the reverse of scatter, after scatter reorders elements according to a map, gather can
restore the order of the elements according to the map scatter used. In dedicated compute
kernels, gather may be performed by indexed reads. In other shaders, it is performed with
texture-lookups.
Sort
The sort operation transforms an unordered set of elements into an ordered set of elements. The
most common implementation on GPUs is using radix sort for integer and floating point data
and coarse-grained merge sort and fine-grained sorting networks for general comparable data.
[42][43]
Search
The search operation allows the programmer to find a given element within the stream, or
possibly find neighbors of a specified element. The GPU is not used to speed up the search for
an individual element, but instead is used to run multiple searches in parallel.Mostly the search
method used is binary search on sorted elements.
Data structures
A variety of data structures can be represented on the GPU:
• Dense arrays
• Sparse matrixes (sparse array) – static or dynamic
• Adaptive structures (union type)
Application
The following are some of the areas where GPUs have been used for general
purpose computing:
• Automatic parallelization
• Computer clusters or a variant of a parallel computing (using GPU cluster technology)
for highly calculation-intensive tasks:
• High-performance computing (HPC) clusters, often termed supercomputers
• including cluster technologies like Message Passing Interface, and single-system image
(SSI), distributed computing, and Beowulf
• Grid computing (a form of distributed computing) (networking many heterogeneous
computers to create a virtual computer architecture)
• Load-balancing clusters, sometimes termed a server farm
• Physical based simulation and physics engines (usually based on
Newtonian physics models)
• Conway's Game of Life, cloth simulation, fluid incompressible flow by solution
of Euler equations (fluid dynamics) or Navier–Stokes equations
• Statistical physics
• Ising model
• Lattice gauge theory
• Segmentation – 2D and3D
• Level set methods
• CTree construction
• Fast Fourier transform
• k-nearest neighbor algorithm
• Fuzzy logic
• Tone mapping
• Audio signal processing
▪
Audio and sound effects processing, to use a GPU for digital signal
processing(DSP)
▪
Analog signal processing
▪
Speech processing
• Digital image processing
• Video processing
▪
Hardware accelerated video decoding and post-processing
• Motion compensation (mocomp)
• Inverse discrete cosine transform (iDCT)
• Variable-length decoding (VLD), Huffmancoding
• Inverse quantization (IQ (not to be confused by Intelligence Quotient))
• In-loopdeblocking
• Bitstream processing (CAVLC/CABAC) using special purpose hardware for this
task because this is a serial task not suitable for regular GPGPU computation
• Deinterlacing
▪ Spatial-temporaldeinterlacing
• Noise reduction
• Edge enhancement
• Color correction
• Hardware accelerated video encoding and pre-processing
• Global
Geometric computing – constructive solid geometry, distance fields, collision detection,
transparency computation, shadow generation
• ▪
Scientific computing
Monte Carlo simulation of light propagation
▪
Weather forecasting
▪
Climate research
▪
Molecular modeling on GPU
▪
Quantum mechanical physics
▪
Astrophysics
• Bioinformatics
• Computational finance
• Medical imaging
• Clinical decision support system(CDSS)
• Compute rvision
• Digital signal processing / signal processing
• Control
▪
engineering
Operation sresearch
▪
Implementations of: the GPU Tabu Search algorithm solving the Resource Constrained
;
Project Scheduling problem is freely available on GitHub the GPU algorithm solving the
.
Nurse Rerostering problem is freely available on GitHub
▪ Neural networks
▪ Database operations
▪ Computational Fluid Dynamics especially using Lattice Boltzmann methods
▪ Cryptography and cryptanalysis
▪ Performance modeling: computationally intensive tasks on GPU
▪ Anti virus software
▪ Intrusion detection
• Increase computing power for distributed computing projects like
SETI@home,Einstein@home
Multithreaded Programming (POSIX pthreads)

Introduction
Code is often written in a serialized (or sequential) fashion. What is meant by the term
serialized? Ignoring instruction level parallelism (ILP), code is executed sequentially, one after
the next in a monolithic fashion, without regard to possibly more available processors the
program could exploit. Often, there are potential parts of a program where performance can be
improved through the use of threads. With increasing popularity of machines with symmetric
multiprocessing (largely due in part to the rise of multicore processors), programming with
threads is a valuable skill set worth learning. Why is it that most programs are sequential? One
guess would be that students are not taught how to program in a parallel fashion until later or in
a difficult-to-follow manner. To make matters worse, multithreading non-trivial code is
difficult. Careful analysis of the problem, and then a good design is not an option for
multithreaded programming; it is an absolute must. We will dive into the world of threads with
a little bit of background first. We will examine thread synchronization primitives and then a
tutorial on how to use POSIX pthreads will be presented.
What is a thread?
In order to define a thread formally, we must first understand the boundaries of where a
thread operates.
A computer program becomes a process when it is loaded from some store into the computer's
memory and begins execution. A process can be executed by a processor or a set of processors.
A process description in memory contains vital information such as the program counter which
keeps track of the current position in the program (i.e. which instruction is currently being
executed), registers, variable stores, file handles, signals, and so forth. A thread is a sequence
of such instructions within a program that can be executed independently of othercode.
The Figureto the right conceptually shows that threads are within the same process address
space, thus, much of the information present in the memory description of the process can be
shared across threads. Some information cannot be replicated, such as the stack (stack pointer
to a different memory area per thread), registers and thread-specific data. This information
sufficies to allow threads to be scheduled independently of the program's main thread and
possibly one or more other threads within the program.Explicit operating system support is
required to run multithreaded programs. Fortunately, most modern operating systems support
threads such as Linux (via NPTL), BSD variants, Mac OS X, Windows, Solaris, AIX, HP-UX,
etc.Operatingsystems may use different mechanisms to implement multithreading support.
Fig 4.25: Multithreaded process
Terminology
Before we can dive in depth into threading concepts, we need to get familiarized with a
few terms related to threads, parallelism and concurrency.
• Lightweight Process (LWP) can be thought of as a virtual CPU where the number of
LWPs is usually greater than the number of CPUs in the system. Thread libraries
communicate with LWPs to schedule threads. LWPs arealso sometimes referred to as
kernelthreads.
• X-to-Y model. The mapping between LWPs and Threads. Depending upon the
operating system implementation and/or user-level thread library in use, this can vary
from 1:1, X:1, or X:Y. Linux, some BSD kernels, and some Windows versions use the
1:1 model. User-level threading libraries are commonly in the X:1 class as the
underlying kernel does not have any knowledge of the user-level threads. The X:Y
model is used in Windows7.
• Contention Scope is how threads compete for system resources (i.e.scheduling).
• Bound threads have system-wide contention scope, in other words, these threads
contend with other processes on the entiresystem.
• Unbound threads have process contentionscope.
• Thread-safe means that the program protects shared data, possibly through the use of
mutualexclusion.
• Reentrant code means that a program can have more than one thread executing
concurrently.
• Async-safe means that a function is reentrant while handling a signal (i.e. can be called
from a signalhandler).
• Concurrency vs. Parallelism - They are not the same! Parallelism implies
simultaneous running of code (which is not possible, in the strict sense, on uniprocessor
machines) while concurrency implies that many tasks can run in any order and possibly
inparallel.
Amdahl’s Law and the Pareto Principle
Gene Amdahl argued the theoretical maximum improvement that is possible for a computer
program that is parallelized, under the premise that the program is strongly scaled (i.e. the
program operates on a fixed problem size). His claim is a well known assertion known as
Amdahl's Law. Essentially, Amdahl's law states that the speedup of a program due to
parallelization can be no larger than the inverse of the portion of the program that is immutably
sequential. For example, if 50% of your program is not parallelizable, then you can only expect
a maximum speed up of 2x, regardless the number of processors you throw at the problem. Of
course many problems and data sets that parallel programs process are not of fixed size or the
serial portion can be very close to zero. What is important to the reader here, is to understand
that most interesting problems that are solved by computer programs tend to have some
limitations in the amount of parallelism that can be effectively expressed (or introduced by the
very mechanism to parallelize) and exploited as threads or some other parallel construct.
It must be underscored how important it is to understand the problem the computer program is
trying to solve first, before simply jumping in head first. Careful planning and consideration of
not only what the program must attack in a parallel fashion and the means to do so by way of
the algorithms employed and the vehicle for which they are delivered must be performed.
There is a common saying: "90% of processor cycles are spent in 10% of the code." This is
more formally known as the Pareto Principle. Carefully analyze your code or your design plan;
don't spend all of your time optimizing/parallelizing the 90% of the code that doesn't matter
much! Code profiling and analysis is outside of the scope of this document, but it is
recommended reading left to those unfamiliar with the subject.
Thread Design Process
There are different ways to use threads within a program. Here, three common thread design
patterns are presented. There is no hard and fast rule on which is the best. It depends on what
the program is intended to tackle and in what context. It is up to you to decide which best
pattern or patterns fit your needs.
Thread Pool(Boss/Worker)
One thread dispatches other threads to do useful work which are usually part of a worker thread
pool. This thread pool is usually pre-allocated before the boss (or master) begins dispatching
threads to work. Although threads are lightweight, they still incur overhead when they are
created.
Peer (work crew)
The peer model is similar to the boss/worker model except once the worker pool has been
created, the boss becomes the another thread in the thread pool, and is thus, a peer to the other
threads.
Pipe line
Similar to how pipelining works in a processor, each thread is part of a long chain in a
processing factory. Each thread works on data processed by the previous thread and hands it off
to the next thread. You must be careful to equally distribute work and take extra steps to ensure
non-blocking behavior in this thread model or you could experience pipeline"stalls."
Protecting Shared Resources
Threads may operate on disparate data, but often threads may have to touch the same data. It is
unsafe to allow concurrent access to such data or resources without some mechanism that
defines a protocol for safe access! Threads must be explicitly instructed to block when other
threads maybe potentially accessing the same resources.
Mutual Exclusion
Mutual exclusion is the method of serializing access to shared resources. You do not want a
thread to be modifying a variable that is already in the process of being modified by another
thread! Another scenario is a dirty read where the value is in the process of being updated and
another thread reads an old value.
Mutual exclusion allows the programmer to create a defined protocol for serializing access to
shared data or resources. Logically, a mutexis a lock that one can virtually attach to some
resource. If a thread wishes to modify or read a value from a shared resource, the thread must
first gain the lock. Once it has the lock it may do what it wants with the shared resource without
concerns of other threads accessing the shared resource because other threads will have to wait.
Once the thread finishes using the shared resource, it unlocks the mutex, which allows other
threads to access the resource. This is a protocol that serializes access to the shared resource.
Note that such a protocol must been forced for the data or resource a mutex is protecting across
all threads that may touch the resource being protected. If the protocol is violated (e.g., a thread
modifies a shared resource without first requesting a mutex lock), then the protocol defined by
the programmer has failed. There is nothing preventing a thread programmer, whether
unintentionally (most often the case, i.e., a bug -- see race conditions below) or intentionally
from implementing a flawed serialization protocol.
As an analogy, you can think of a mutex as a safe with only one key (for a standard mutex
case), and the resource it is protecting lies within the safe. Only one person can have the key to
the chest at any time, therefore, is the only person allowed to look or modify the contents of the
chest at the time it holds the key.
The code between the lock and unlock calls to the mutex, is referred to as a critical section.
Minimizing time spent in the critical section allows for greater concurrency because it
potentially reduces the amount of time other threads must wait to gain the lock. Therefore, it is
important for a thread programmer to minimize critical sections if possible.
Mutex Types
There are different types of locks other than the standard simple blocking kind.
• Recursive: allows a thread holding the lock to acquire the same lock again which may
be necessary for recursive algorithms.
• Queuing: allows for fairness in lock acquisition by providing FIFO ordering to the
arrival of lock requests. Such mutexes may be slower due to increased overhead and the
possibility of having to wake threads next in line that may besleeping.
• Reader/Writer: allows for multiple readers to acquire the lock simultaneously. If
existing readers have the lock, a writer request on the lock will block until all readers
have given up the lock. This can lead to writerstarvation.
• Scoped: RAII-style semantics regarding lock acquisition andunlocking.
Depending upon the thread library or interface being used, only a subset of the additional types
of locks may be available. POSIX pthreads allows recursive and reader/writer stylelocks.
Potential traps with Mutexes

An important problem associated with mutexesis the possibility of deadlock. A program can
deadlock if two (or more) threads have stopped execution or are spinning permanently. For
example, a simple deadlock situation: thread 1 locks lock A, thread 2 locks lock B, thread 1
wants lock B and thread 2 wants lock A. Instant deadlock. You can prevent this from
happening by making sure threads acquire locks in an agreed order (i.e. preservation of lock
ordering). Deadlock can also happen if threads do not unlock mutexes properly.
A race condition is when non-deterministic behavior results from threads accessing shared
data or resources without following a defined synchronization protocol for serializing such
access. This can result in erroneous outcomes that cause failure or inconsistent behavior
making race condition sparticularly difficult to debug. In addition to incorrectly synchronized
access to shared resources, library calls outside of your program's control are common culprits.
Make sure you take steps within your program to enforce serial access to shared file descriptors
and other external resources. Most man pages will contain information about thread safety of a
particular function, and if it is not thread-safe, if any alternatives exist (e.g., gethostbyname()
andgethostbyname_r()).
Another problem with mutexes is that contention for a mutex can lead to priority inversion. A
higher priority thread can wait behind a lower priority thread if the lower priority thread holds a
lock for which the higher priority thread is waiting. This can be eliminated/reduced by limiting
the number of shared mutexes between different priority threads. A famous case of priority
inversion occurred on the Mars Pathfinder.
Atomic Operation
Atomic operations allow for concurrent algorithms and access to certain shared data types
without the use of mutexes. For example, if there is sufficient compiler and system support, one
can modify some variable (e.g., a 64-bit integer) within a multithreaded context without having
to go through a locking protocol. Many atomic calls are non-portable and specific to the
compiler and system. Intel Threading Building Blocks (see below), contains semi-portable
atomic support under C++. The C++1x and C1x standards will also include atomic operations
support.
Lock-free algorithms can provide highly concurrent and scalable operations. However, lock-
free algorithms may be more complex than their lock-based counterparts, potentially incurring
additional overhead that may induce negative cache effects and other problems. Careful
analysis and performance testing is required for the problem under consideration.
Thread synchronization Primitives

As we have just discussed, mutexes are one way of synchronizing access to shared resources.
There are other mechanisms available for not only coordinating access to resources but
synchronizing threads.
Join
A thread join is a protocol to allow the programmer to collect all relevant threads at a logical
synchronization point. For example, in fork-join parallelism, threads are spawned to tackle
parallel tasks and then join back up to the main thread after completing their respective tasks
(thus performing an implicit barrier at the join point). Note that a thread that executes a join has
terminated execution of their respective thread function.
Condition variables
Condition variables allow threads to synchronize to a value of a shared resource. Typically,
condition variables are used as a notification system between threads.
For example, you could have a counter that once reaching a certain count, you would like for a
thread to activate. The thread (or threads) that activates once the counter reaches the limit
would wait on the condition variable. Active threads signal on this condition variable to notify
other threads waiting/sleeping on this condition variable; thus causing a waiting thread to wake.
You can also use a broadcast mechanism if you want to signal all threads waiting on the
condition variable to wakeup. Conceptually, this is modeled by the Figure on the right with
pseudocode.
When waiting on condition variables, the wait should be inside a loop, not in a simple if
statement because of spurious wakeups. You are not guaranteed that if a thread wakes up, it is
the result of a signal or a broadcastcall
Fig 4.26: Pseudocode

Barriers
Barriers are a method to synchronize a set of threads at some point in time by having all
participating threads in the barrier wait until all threads have called the said barrier function.
This, in essence, blocks all threads participating in the barrier until the slowest participating
thread reaches the barrier call.
SpinLocks
Spinlocks are locks which spin on mutexes. Spinning refers to continuously polling until a
condition has been met. In the case of spinlocks, if a thread cannot obtain the mutex, it will
keep polling the lock until it is free. The advantage of a spinlock is that the thread is kept active
and does not enter a sleep-wait for a mutexto become available, thus can perform better in
certain cases than typical blocking-sleep-wait style mutexes. Mutexes which are heavily
contended are poorCandidates for spinlocks.
Spinlocks should be avoided in uni processor contexts. Why is this?
Semaphores
Semaphores are another type of synchronization primitive that come in two flavors: binary and
counting. Binary semaphores act much like simple mutexes, while counting semaphores can
behave as recursive mutexes. Counting semaphores can be initialized to any arbitrary value
which should depend on how many resources you have available for that particular shared data.
Many threads can obtain the lock simultaneously until the limit is reached. This is referred to as
lock depth.
Semaphores are more common in multiprocess programming (i.e. it's usually used as a synch
primitive between processes).
POSIXpthreads
Now that we have a good foundation of thread concepts, lets talk about a particular threading
implementation, POSIX pthreads. The pthread library can be found on almost any modern
POSIX-compliant OS (and even under Windows, seepthreads-win32).
Note that it is not possible to cover more than an introduction on pthreads within the context of
this short overview and tutorial. pthreads concepts such as thread scheduling classes, thread-
specific data, thread canceling, handling signals and reader/writer locks are not covered here.
If you are programming in C++, I highly recommend evaluating the Boost C++ Libraries. One
of the libraries is the Thread library which provides a common interface for portable
multithreading.
It is assumed that you have a good understanding of the C programming language. If you do
not or need to brush up, please review basic C (especially pointers and arrays). Here are some
resources.
Preliminaries
Before we begin, there are a few required steps you need to take before starting any pthreads
coding:
1. Add #include <pthread.h> to your sourcefile(s).

2. If you are using gcc, you can simply specify -pthread which will set all proper defines and
link-time libraries. On other compilers, you may have to define _REENTRANT and link
against-lpthread.
3. Optional: some compilers may require defining _POSIX_PTHREAD_SEMANTICS for
certain function calls likesigwait().
Creating Pthreads
A pthread is represented by the type pthread_t. To create a thread, the following function
is available:?
1. intpthread_create(pthread_t *thread, pthread_attr_t*attr,

2. void *(*start_routine)(void *), void*arg);
Let's digest the arguments required for pthread_create():
1. pthread_t *thread: the actual thread object that contains pthreadid

2. pthread_attr_t *attr: attributes to apply to this thread
3. void *(*start_routine)(void *): the function this thread executes
4. void *arg: arguments to pass to thread function above
Before we dive into an example, let's first look at two other important thread functions:?
1. void pthread_exit(void*value_ptr);
2. intpthread_join(pthread_t thread, void **value_ptr); 3
3. /* ignore me: needed so underscores above do not get clipped off*/
pthread_exit() terminates the thread and provides the pointer*value_ptravailable to any

pthread_join()call.
pthread_join() suspends the calling thread to waitfor successful termination of the thread
specified as the first argument pthread_t thread with an optional *value_ptr data passed from
the terminating thread's call topthread_exit().
Let's look at an example program exercising the above pthread functions:

1. #include <stdio.h>
2. #include <stdlib.h>
3. #include <pthread.h>4
4. #define NUM_THREADS 2 6
5. /* create thread argument struct for thr_func()*/
6. typedefstruct _thread_data_t{
7. int tid;
8. double stuff;
9. } thread_data_t; 12
10. /* thread function*/
11. void *thr_func(void *arg){
12. thread_data_t *data = (thread_data_t *)arg;
13. printf("hello from thr_func, thread id: %d\n", data->tid);
14. pthread_exit(NULL);
15. }
16. intmain(intargc, char **argv){
17. pthread_tthr[NUM_THREADS];
18. int i,rc;
19. /* create a thread_data_t argument array*/
20. thread_data_tthr_data[NUM_THREADS]; 27
21. /* create threads*/
22. for (i = 0; i < NUM_THREADS; ++i){
23. thr_data[i].tid =i;
24. if ((rc = pthread_create(&thr[i], NULL, thr_func, &thr_data[i]))){
25. fprintf(stderr, "error: pthread_create, rc: %d\n",rc);
26. return EXIT_FAILURE;
27. }
28. }
29. /* block until all threads complete*/
30. for (i = 0; i < NUM_THREADS; ++i){
31. pthread_join(thr[i],NULL);
32. }
33. return EXIT_SUCCESS;
34. }
This program creates NUM_THREADS threads and prints their respective user-assigned thread
id. The first thing to notice is the call to pthread_create()in the main function. The syntax of the
third and fourth argument are particularly important. Notice that the thr_funcis the name of the
thread function, while the fourth argument is the argument passed to said function. Here we are
passing a thread function argument that we created as a thread_data_tstruct. Of course, you can
pass simple data types as pointers if that is all that is needed, or NULLif no arguments are
required. However, it is good practice to be able to pass arguments of arbitrary type and size,
and isthus
Illustrated for this purpose.
A few things to mention:
• Make sure you check the return values for all important functions.
• The second argument to pthread_create()is NULL indicating to create a thread with
default attributes. The defaults vary depend upon the system and pthread
implementation.
• Notice that we have broken apart the pthread_join() from the pthread_create(). Why is it
that you should not integrate the pthread_join()in to the thread creation loop?
• Although not explicitly required to call pthread_exit() at the end of the thread function,
it is good practice to do so, as you may have the need to return some arbitrary data back
to the caller viapthread_join().
PThread Attribute
Threads can be assigned various thread attributes at the time of thread creation. This is
controlled through the second argument to pthread_create(). You must first pass the
pthread_attr_t variable through:
1. intpthread_attr_init(pthread_attr_t *attr);
Some attributes that can be set are:

1. intpthread_attr_setdetachstate(pthread_attr_t *attr, intdetachstate);
2. intpthread_attr_setguardsize(pthread_attr_t *attr, size_tguardsize);
3. intpthread_attr_setinheritsched(pthread_attr_t *attr, intinheritsched);
4. intpthread_attr_setschedparam(pthread_attr_t *attr, conststructsched_param*param);
5. intpthread_attr_setschedpolicy(pthread_attr_t *attr, intpolicy);
6. intpthread_attr_setscope(pthread_attr_t *attr, intcontentionscope);
7. intpthread_attr_setstackaddr(pthread_attr_t *attr, void*stackaddr);
8. intpthread_attr_setstacksize(pthread_attr_t *attr, size_tstacksize);
Attributes can be retrieved via complimentary get functions. Consult the man pages for the
effect of each of these attributes.
PThread Mutex
pthreadmutexes are created through the following function:
1 intpthread_mutex_init(pthread_mutex_t *mutex, constpthread_mutexattr_t *mutexattr);

2 /* ignore me: needed so underscores above do not get clipped off*/
The pthread_mutex_init() function requires a pthread_mutex_t variable to operate on as the

firstargument. Attributes for the mutex can be given through the second parameter. To specify
default attributes, pass NULL as the second parameter. Alternatively, mutexes can be initialized
to default values through a convenient macro rather than a function call:
pthread_mutex_t lock =PTHREAD_MUTEX_INITIALIZER;
Here a mutex object named lock is initialized to the default pthread mutex values. To perform
mutex locking and unlocking, the pthreads provides the following functions:
1 intpthread_mutex_lock(pthread_mutex_t*mutex);
2 intpthread_mutex_trylock(pthread_mutex_t*mutex);
3 intpthread_mutex_unlock(pthread_mutex_t *mutex);
Each of these calls requires a reference to the mutex object. The difference between the lock
and trylock calls is that lock is blocking and trylockis non-blocking and will return immediately
evenif gaining the mutex lock has failed due to it already being held/locked. It is absolutely
essential to check the return value of the trylock call to determine if the mutex has been
successfullyacquiredornot.Ifithasnot,thentheerrorcodeEBUSYwillbereturned.
Let's expand the previous example with code that uses mutexes:
1. #include <stdio.h>
2. #include <stdlib.h>
3. #include <pthread.h>
4. #define NUM_THREADS 5 6
5. /* create thread argument struct for thr_func()*/
6. typedefstruct _thread_data_t{
7. inttid;
8. double stuff;
9. } thread_data_t; 12
10. /* shared data between threads*/
11. double shared_x;
12. pthread_mutex_tlock_x; 16
13. void *thr_func(void *arg){
14. thread_data_t *data = (thread_data_t *)arg; 19
15. printf("hello from thr_func, thread id: %d\n",data->tid);
16. /* get mutex before modifying and printing shared_x*/
17. pthread_mutex_lock(&lock_x);
18. shared_x +=data->stuff;
19. printf("x = %f\n",shared_x);
20. pthread_mutex_unlock(&lock_x);
22. }
23. intmain(intargc, char **argv){
24. pthread_tthr[NUM_THREADS];
25. inti,rc;
26. /* create a thread_data_t argument array*/
thread_data_tthr_data[NUM_THREADS];
/* initialize shared data */
shared_x = 0;
/* initialize pthreadmutex protecting "shared_x"
*/ pthread_mutex_init(&lock_x, NULL);
/* create threads */
for (i = 0; i < NUM_THREADS; ++i) {
thr_data[i].tid = i;
thr_data[i].stuff = (i + 1) * NUM_THREADS;
if ((rc = pthread_create(&thr[i], NULL, thr_func, &thr_data[i]))) {
fprintf(stderr, "error: pthread_create, rc: %d\n", rc);
return EXIT_FAILURE;
/* block until all threads complete */
for (i = 0; i < NUM_THREADS; ++i) {
pthread_join(thr[i], NULL);
return EXIT_SUCCESS;
In the above example code, we add some shared data called shared_x and ensure serialized
access to this variable through a mutex named lock_x. Within the thr_func() we call
pthread_mutex_lock() before reading or modifying the shared data. Note that we continue to
maintain the lock even through the printf() function call as releasing the lock before this and
printing can lead to inconsistent results in the output. Recall that the code in-between the lock
and unlock calls is called a critical section. Critical sections should be minimized for increased
concurrency.
PThread Condition Variable
pthread condition variables are created through the following function call or initializer macro
similar to mutexes:
1. intpthread_cond_init(pthread_cond_t *cond, pthread_condattr_t *cond_attr);
2. pthread_cond_tcond =PTHREAD_COND_INITIALIZER; 4
Similar to the mutex initialization call, condition variables can be given non-default attributes
through the second parameter. To specify defaults, either use the initializer macro or specify
NULLinthe second parameter to the call topthread_cond_init(). Threads can act on condition
variables in three ways: wait, signal or broadcast:
1. intpthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t*mutex);

2. intpthread_cond_signal(pthread_cond_t*cond);
3. intpthread_cond_broadcast(pthread_cond_t *cond);
pthread_cond_wait() puts the current thread to sleep. It requires a mutexof the associated
shared resource value it is waiting on. pthread_cond_signal() signals one thread out of the
possibly many sleeping threads to wakeup. pthread_cond_broadcast() signals all threads
waiting on the cond condition variable to wakeup. Here is an example on using pthread
conditionvariables:
1. void *thr_func1(void *arg){

2. /* thread code blocks here until MAX_COUNT is reached*/
3. pthread_mutex_lock(&count_lock);
4. while (count < MAX_COUNT){
5. pthread_cond_wait(&count_cond,&count_lock);
a. }
6. pthread_mutex_unlock(&count_lock);
7. /* proceed with thread execution*/
9. }
10. /* some other thread code that signals a waiting thread that MAX_COUNT has
been reached*/
11. void *thr_func2(void *arg){
12. pthread_mutex_lock(&count_lock);
13. /* some code here that does interesting stuff and modifies count */
14. if (count == MAX_COUNT){
16. pthread_cond_signal(&count_cond);
17. } else{
19. }
In thr_func1(), we are locking the count_lockmutex so we can read the value of count without
entering a potential race condition. The subsequent pthread_cond_wait() also requires a locked
mutex as the second parameter to avoid a race condition where a thread prepares to wait on a
condition variable and another thread signals the condition just before the first thread actually
waits on it (as explained from the man page on pthread_cond_wait). Notice how a while loop is
used instead of an if statement for the pthread_cond_wait() call. This is because of spurious
wakeups problem mentioned previously. If a thread has been woken, it does not mean it was
due to a pthread_cond_signal() or pthread_cond_broadcast() call. pthread_cond_wait() if
awoken, automatically tries to re-acquire the mutex, and will block if it cannot. Locks that other
threads could be waiting on should be released before you signal or broadcast.
PThread Barrries
pthreads can participate in a barrier to synchronize to some point in time. Before a barrier can
be called, a pthread barrier object must be initialized first:
1 int pthread_barrier_init(pthread_barrier_t *barrier, pthread_barrierattr_t
*barrier_attr, unsigned intcount); 2
2 pthread_barrier_t barrier =PTHREAD_BARRIER_INITIALIZER(count); 4
Barrier objects are initialized like mutexes or condition variables, except there is one additional
parameter, count. The count variable defines the number threads that must join the barrier for
the barrier to reach completion and unblock all threads waiting at the barrier. If default barrier
attributes are used (i.e. NULL for the second parameter), one can use the initializer macro with
the specified count.
The actual barrier call follows
1 intpthread_barrier_wait(pthread_barrier_t *barrier); 2
This function would be inside thread code where the barrier is to take place. Once count
number of threads have called pthread_barrier_wait() then the barrier condition is met and all
threads are unblocked and progress continues.
Miscellaneous
Here are some suggestions and issues you should consider when using pthreads:
• You should check all return values for important pthread functioncalls!
• Sometimes it is desirable for a thread not to terminate (e.g., a server with a worker
thread pool). This can be solved by placing the thread code in an infinite loop and using
condition variables. Of course, there needs to be some terminating condition(s) to the
infinite loop (i.e., break when it is deemednecessary).
Additional useful pthread calls:
• pthread_kill() can be used to deliver signals to specific threads.

• pthread_self() returns a handle on the calling thread.
• pthread_equal() compares for equality between two pthread ids
• pthread_once() can be used to ensure that an initializing function within a thread is only
runonce.
• There are many more useful functions in the pthread library. Consult pthreads
man pages or the Nichols text (AppendixC).
Performance Consideration
The performance gains from using threads can be substantial when done properly and in the
right problem context, but can it be even better? You should consider the following when
analyzing your program for potential bottlenecks:
• Lock granularity - How "big" (coarse) or "small" (fine) are your mutexes? Do they
lock your whole structure or fields of a structure? The more fine-grained you make your
locks, the more concurrency you can gain, but at the cost of more overhead and
potential deadlocks.
• Lock ordering - Make sure your locks are always locked in an agreed order (if they are
not, make sure you take steps to rectify situations where locks are obtained in an out-of-
order fashion, e.g. by using trylock/unlockcalls).
• Lock frequency - Are you locking too often? Locking at unnecessary times? Reduce
such occurrences to fully exploit concurrency and reduce synchronization overhead.
• Critical sections - This has been mentioned before, but you should take extra steps to
minimize critical sections which can be potentially large bottlenecks.
• Worker thread pool - If you are using a Boss/Worker thread model, make sure you
pre- allocate your threads instead of creating threads on demand. It doesn't matter to the
user how long it took your server to initialize, it only matters how fast it processes his or
her request!
• Contention scope - Do your threads perform better when they are in contention with all
of the system's processes? Or do they perform better when individually scheduled by the
thread library itself? Only experimentation can give you the answers.
• Scheduling class - We have not touched on this topic, but changing the thread
scheduling class from FIFO to RR can give better response times. But is this what you
really want? Refer to Nichols or Lewis book for more information on thread scheduling
classes.
• Too many threads? - At what point are there too many threads? Can it severely impact
and degrade performance? Again, only experimentation will give you the real answers
to this question.
Other Approaches
C++ Template Libraries

There are various template libraries available that ease implementation of multithreading in a
(semi-)portable fashion. For those programming in C++, you may want to look at Boost, Intel
Threading Building Blocks (TBB) and POCO.
Multiprocessor and Shared Memory

This tutorial has explored the very basics of multithreaded programming. What about multi
process programming? These topics are beyond the scope of this document, but to perform
cross-process synchronization, one would use some form of IPC: pipes, semaphores, message
queues, or shared memory. Of all of the forms of IPC, shared memory is usually the fastest
(excluding doors). You can use mmap(), POSIX (e.g., shm_open()) or SysV (e.g., shmget())
semantics when dealing with cross-process resource management, IPC and synchronization.
For those interested in shared memory programming in C++, I recommend looking at Boost.
Inter process first.
OpenMP
OpenMP is a portable interface for implementing fork-join parallelism on shared memory
multiprocessor machines. It is available for C/C++ and Fortran. For a quick introduction,
please see the slide share.
MPI
The Message Passing Interface (MPI) is the de-facto standard for distributed memory parallel
processing. Data can be sent/received from distinct computing machines with support for
vectored I/O (scatter/gather), synchronization and collectives. It is not uncommon to see
programs that are both multithreaded and contain MPI calls to take advantage of shared
memory within a node and MPI to perform processing across nodes.
Software Transactional Memory

In computer science, software transactional memory (STM) is a concurrency control
mechanism analogous to database transactions for controlling access to shared memory in
concurrent computing. It is an alternative to lock-based synchronization. STM is a strategy
implemented in software, rather than as a hardware component. A transaction in this context
occurs when a piece of code executes a series of reads and writes to shared memory. These
reads and writes logically occur at a single instant in time; intermediate states are not visible to
other (successful) transactions. The idea of providing hardware support for transactions
originated in a 1986 paper by Tom Knight. The idea was popularized by Maurice Herlihy and
J.Eliot B. Moss. In 1995 NirShavitand Dan Touitou extended this idea to software-only
transactional memory (STM). Since 2005, STM has been the focus of intense research and
support for practical implementations is growing.
Performance
Unlike the locking techniques used in most modern multithreaded applications, STM is often
very optimistic: a thread completes modifications to shared memory without regard for what
other threads might be doing, recording every read and write that it is performing in a log.
Instead of placing the onus on the writer to make sure it does not adversely affect other
operations in progress, it is placed on the reader, who after completing an entire transaction
verifies that other threads have not concurrently made changes to memory that it accessed in
the past. This final operation, in which the changes of a transaction are validated and, if
validation is successful, made permanent, is called a commit. A transaction may also abort at
any time, causing all of its prior changes to be rolled back or undone. If a transaction cannot be
committed due to conflicting changes, it is typically aborted and re-executed from the
beginning until it succeeds.
The benefit of this optimistic approach is increased concurrency: no thread needs to wait for
access to a resource, and different threads can safely and simultaneously modify disjoint parts
of a data structure that would normally be protected under the same lock.
However, in practice, STM systems also suffer a performance hit compared to fine-grained
lock- based systems on small numbers of processors (1 to 4 depending on the application). This
is due primarily to the overhead associated with maintaining the log and the time spent
committing transactions. Even in this case performance is typically no worse than twice as
slow. Advocates of STM believe this penalty is justified by the conceptual benefits ofSTM.
Theoretically, the worst case space and time complexity of n concurrent transactions is O(n).
Actual needs depend on implementation details (one can make transactions fail early enough to
avoid overhead), but there will also be cases, albeit rare, where lock-based algorithms have
better time complexity than software transactional memory.
Conceptual advantages and disadvantages
In addition to their performance benefits, STM greatly simplifies conceptual understanding of
multithreaded programs and helps make programs more maintainable by working in harmony
with existing high-level abstractions such as objects and modules. Lock-based programming
has a number of well-known problems that frequently arise in practice:
• Locking requires thinking about overlapping operations and partial operations in

distantly separated and seemingly unrelated sections of code, a task which is very
difficult and error- prone.
• Locking requires programmers to adopt a locking policy to prevent deadlock, live lock,
and other failures to make progress. Such policies are often informally enforced and
fallible, and when these issues arise they are insidiously difficult to reproduce and
debug.
• Locking can lead to priority inversion, a phenomenon where a high-priority thread is
forced to wait for a low-priority thread holding exclusive access to a resource that it
needs.
In contrast, the concept of a memory transaction is much simpler, because each transaction can
be viewed in isolation as a single-threaded computation. Deadlock and livelock are either
prevented entirely or handled by an external transaction manager; the programmer need hardly
worry about it. Priority inversion can still be an issue, but high-priority transactions can abort
conflicting lower priority transactions that have not already committed.
On the other hand, the need to abort failed transactions also places limitations on the behavior
of transactions: they cannot perform any operation that cannot be undone, including most I/O.
Such limitations are typically overcome in practice by creating buffers that queue up the
irreversible operations and perform them at a later time outside of any transaction. In Haskell,
this limitation is enforced at compile time by the type system.
Composable operations
In 2005, Tim Harris, Simon Marlow, Simon Peyton Jones, and Maurice Herlihy described an
STM system built on Concurrent Haskell that enables arbitrary atomic operations to be
composed into larger atomic operations, a useful concept impossible with lock-based
programming. To quote the authors:
Perhaps the most fundamental objection [...] is that lock-based programs do not compose:
correct fragments may fail when combined. For example, consider a hash table with thread-safe
insert and delete operations. Now suppose that we want to delete one item A from table t1, and
insert it into table t2; but the intermediate state (in which neither table contains the item) must
not be visible to other threads. Unless the implementer of the hash table anticipates this need,
there is simply no way to satisfy this requirement. [...]. In short, operations that are individually
correct (insert, delete) cannot be composed into larger correct operations.
With STM, this problem is simple to solve: simply wrapping two operations in a transaction
makes the combined operation atomic. The only sticking point is that it is unclear to the caller,
who is unaware of the implementation details of the component methods, when they should
attempt to re-execute the transaction if it fails. In response, the authors proposed a retry
command which uses the transaction log generated by the failed transaction to determine which
memory cells it read, and automatically retries the transaction when one of these cells is
modified, based on the logic that the transaction will not behave differently until at least one
such value is changed.
The authors also proposed a mechanism for composition of alternatives, the or Else function. It
runs one transaction and, if that transaction does a retry, runs a second one. If both retry, it tries
them both again as soon as a relevant change is made. This facility, comparable to features such
as the POSIX networking select() call, allows the caller to wait on any one of a number of
events simultaneously. It also simplifies programming interfaces, for example by providing a
simple mechanism to convert between blocking and non blocking operations.
This scheme has been implemented in the Glasgow Haskell Compiler.
Proposed language support

The conceptual simplicity of STMs enables them to be exposed to the programmer using
relatively simple language syntax. Tim Harris and Keir Fraser's "Language Support for
Lightweight Transactions" proposed the idea of using the classical conditional critical region
(CCR) to represent transactions. In its simplest form, this is just an "atomic block", a block of
code which logically occurs at a single instant:
// Insert a node into a doubly linked list atomically
atomic{
newNode->prev = node; newNode->next = node->next; node->next->prev =

newNode; node->next = newNode;
}
When the end of the block is reached, the transaction is committed if possible, or else aborted
and retried. (This is simply a conceptual example, not correct code. For example, it behaves
incorrectly if node is deleted from the list during the transaction.)
CCRs also permit a guard condition, which enables a transaction to wait until it has work to do:
atomic(queueSize> 0) {
remove item from queue and use it

}
If the condition is not satisfied, the transaction manager will wait until another transaction has
made a commit that affects the condition before retrying. This loose coupling between
producers and consumers enhances modularity compared to explicit signaling between threads.
[6]
"Composable Memory Transactions" took this a step farther with its retry command
(discussed above), which can, at any time, abort the transaction and wait until some value
previously read by the transaction is modified before retrying. For example:
atomic{
if (queueSize> 0) {
remove item from queue and use it
} else{
retry
}
}
This ability to retry dynamically late in the transaction simplifies the programming model and
opens up new possibilities.
One issue is how exceptions behave when they propagate outside of transactions. In
"Composable Memory Transactions", the authors decided that this should abort the transaction,
since exceptions normally indicate unexpected errors in Concurrent Haskell, but that the
exception could retain information allocated by and read during the transaction for diagnostic
purposes. They stress that other design decisions may be reasonable in other settings.
Transactional locking
STM can be implemented as a lock-free algorithm or it can use locking. There are two types of
locking schemes: In encounter-time locking (Ennals, Saha, and Harris), memory writes are
done by first temporarily acquiring a lock for a given location, writing the value directly, and
logging it in the undo log. Commit-time locking locks memory locations only during the
commit phase.
A commit-time scheme named "Transactional Locking II" implemented by Dice, Shalev, and
Shavit uses a global version clock. Every transaction starts by reading the current value of the
clock and storing it as the read-version. Then, on every read or write, the version of the
particular memory location is compared to the read-version; and, if it is greater, the transaction
is aborted. This guarantees that the code is executed on a consistent snapshot of memory.
During commit, all write locations are locked, and version numbers of all read and write
locations are re-checked. Finally, the global version clock is incremented, new write values
from the log are written back to memory and stamped with the new clock version.
An increasingly utilized method to manage transactional conflicts in Transactional memory,

and especially in STM, is Commitment ordering (also called Commit ordering; CO). It is
utilized for achieving serializability optimistically (i.e., without blocking upon conflict, and
only locking for commit) by "commit order" (e.g., Ramadan et al.2009,and Zhang et al.2006).
Serializabilityis the basis for the correctness of (concurrent transactions and) transactional
memory. Tens of STM articles on "commit order" have already been published, and the
technique is encumbered by a number of patents.
With CO the desired serializability property is achieved by committing transactions only in
chronological order that is compatible with the precedence order (as determined by
chronological orders of operations in conflicts) of the respective transactions. To enforce CO
some implementation of the Generic local CO algorithm needs to be utilized. The patent
abstract quoted above describes a general implementation of the algorithm with a pre-
determined commit order (this falls into the category of "CO generic algorithm with real-time
constraints").
Implementation issues
One problem with implementing software transactional memory with optimistic reading is that
it is possible for an incomplete transaction to read inconsistent state (that is, to read a mixture
of old and new values written by another transaction). Such a transaction is doomed to abort if
it ever tries to commit, so this does not violate the consistency condition enforced by the
transactional system, but it is possible for this "temporary" inconsistent state to cause a
transaction to trigger a fatal exceptional condition such as a segmentation fault or even enter an
endless loop, as in the following contrived example from Figure 4 of "Language Support for
Lightweight Transactions":
atomic {
atomic { x+
if (x != y)
y++;
while (true)
}
}
}
Transaction Transaction
A B
Provided x=y initially, neither transaction above alters this invariant, but it is possible that
transaction A will read x after transaction B updates it but read y before transaction B updates
it, causing it to enter an infinite loop. The usual strategy for dealing with this is to intercept any
fatal exceptions and abort any transaction that is not valid.
One way to deal with these issues is to detect transactions that execute illegal operations or fail
to terminate and abort them cleanly; another approach is the transactional locking scheme
Implementations
A number of STM implementations (on varying scales of quality and stability) have been
released, many under liberal licenses. These include:
C/C++
• TinySTMa time-based STM and Tangerto integrate STMs with C and C++ viaLLVM.
• The Lightweight Transaction Library (LibLTX), a C implementation by Robert Ennals
focusing on efficiency and based on his papers "Software Transactional Memory
Should Not Be Obstruction-Free" and "Cache Sensitive Software
TransactionalMemory".
• LibCMT, an open-source implementation in C by DuilioProtti based on "Composable
[6]
Memory Transactions". The implementation also includes a C#binding.
• TARIFA is a prototype that brings the "atomic" keyword to C/C++ by instrumenting the
assembler output of thecompiler.
• Intel STM Compiler Prototype Edition implements STM for C/C++ directly in a
compiler (the Intel Compiler) for Linux or Windows producing 32- or 64 bit code for
Intel or AMD processors. Implements atomic keyword as well as providing ways to
decorate (declspec) function definitions to control/authorize use in atomic sections. A
substantial implementation in a compiler with the stated purpose to enable large scale
experimentation in any size C/C++ program. Intel has made four research releases of
this special experimental version of its productcompiler.
• stmmapAn implementation of STM in C, based on shared memory mapping. It is for
sharing memory between threads and/or processes (not just between threads within a
process) with transactional semantics. The multi-threaded version of its memory
allocator is inC++.
• CTL An implementation of STM in C, based on TL2 but with many extensions and
optimizations.
• The TL2 lock-based STM from the Scalable Synchronization research group at Sun
Microsystems Laboratories, as featured in the DISC 2006 article "Transactional
LockingII".
• Several implementations by Tim Harris &KeirFraser, based on ideas from his papers
"Language Support for Lightweight Transactions", "Practical Lock Freedom", and an
upcoming unpublished work.
• RSTM The University of Rochester STM written by a team of researchers led by
Michael L. Scott.
• G++ 4.7 now supports STM for C/C++ directly in the compiler. The feature is still
listed as "experimental", but may still provide the necessary functionality for testing.
• STM is part of the picotm transaction framework for C
C#
• Shielded A strict and mostly obstruction-free STM for .NET, written in C#. Features
include: conditional transactions, commutable (low conflict) operations, transactional
collection types, and automatic generation of transactional proxy-subclasses for POCO
objects.
• STMNetA pure C#, open-source, lightweight software transactional memory API.
• SXM, an implementation of transactions for C# by Microsoft Research.

Documentation, Download page Discontinued.
• LibCMT, an open-source implementation in C by DuilioProtti based on "Composable
[6]
Memory Transactions". The implementation also includes a C# binding.
• NSTM, .NET Software Transactional Memory written entirely in C# offering truly
nested transactions and even integrating with System.Transactions.
• MikroKosmosA Verification-Oriented Model Implementation of an STM inC#.
• ObjectFabriccf. Java implementations.
• Sasa.TM A pure C# implementation of software transactional memory.
Clojure
• Clojure has STM support built into the corelanguage

Common Lisp
• CL-STM A multi-platform STM implementation for CommonLisp.

• STMX An open-source, actively maintained concurrency library providing
software, hardware and hybrid memory transactions for Common Lisp.
Erlang
• MnesiaA distributed, transactional, in-memory DBMS built into Erlang/OTP, that

performs the role of STM; Perhaps the oldest implementation of STM in thewild.
F#
• F# have them through [1] FSharpX - sample at [2]F#
Groovy
• GPars - The Gpars framework contains support for STM leveraging the Java Multiverse
implementation.
•
Haskell
• The STM library, as featured in "Composable Memory Transactions", is part of

the Haskell Platform.
• The DSTM library, a distributed STM, based on the abovelibrary.
Java
• SCAT research group's implementation of AtomJava.
• JVSTM implements the concept of Versioned Boxes proposed by JoãoCachopo and
AntónioRito Silva, members of the Software Engineering Group - INESC-ID.
Beginning from version 2.0, JVSTM is completely lock-free.
• Deuce A runtime environment for Java Software Transactional Memory using byte code
manipulation.
• Multiverse Is a Java 1.6+ based Software Transactional Memory (STM) implementation
that uses Multi Version Concurrency Control (MVCC) as concurrency control
mechanism.
• DSTM2 Sun Lab's Dynamic Software Transactional MemoryLibrary
• ObjectFabricis an open source implementation for Java and .NET. It can be turned into
a Distributed STM through an extension mechanism. Other extensions allow logging,
change notification, and persistence.
• ScalaSTM- A library-based STM written in Scala that additionally provides a Java-
focused API to allow use with Runnable and Callable objects.
JavaScript
• AtomizeJSimplements Distributed Software Transactional Memory to web browsers

with a single NodeJS server to validate the effects of transactions.
OCaml
• coThreads, a concurrent programming library of OCaml, offers STM (originally

STMLib) as a module. Just like any other components in this library, the STM module
can be used uniformly with VM-level threads, system threads and processes.
Perl
• STM for Perl 6 has been implemented in Pugs via the Glasgow Haskell
Compiler's STM library.
Python
• Fork of CPython with atomic locks- Armin Rigo explains his patch to CPythonin an
email to the pypy-dev list.
• PyPy STM with Threads announcement from Armin Rigo forPyPy.
Python Software Transactional Memory and Dual-Port Memory Based Python Software
Transactional Memory, two versions of Python STM that is being developed on University of
NoviSad.
Ruby
• MagLevis an implementation of the Ruby interpreter built on top of the

GemStone/S virtual machine
• Concurrent Ruby is a library providing concurrent features for Ruby, including STM
Scala
• Scala STM– A draft proposal along with reference implementation CCSTM to be

included in the Scala standard library
• Akka STM – The Akka framework contains support for STM in both Scala & Java
• MUTS – Manchester University Transactions for Scala
• ZIO -- Implementation in ZIO inspired by STM API in Haskell.
Smalltalk
• GemStone/S [3] A Transactional Memory Object Server for Smalltalk.

• STM for the open-source Smalltalk (MIT Licence) Pharo
Other languages
• Fortress is a language developed by Sun that usesDSTM2

• STM.NET

CPP Unit-4

Uploaded by

Copyright:

Available Formats

CPP Unit-4

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CPP Unit-4

Uploaded by

Copyright:

Available Formats

UNIT-IV

Concurrent and Parallel Programming: Parallel programming paradigms – Data parallel,

Parallel Algorithm - Models

Fig 4.1: Dense Matrix Multiplication

Fig 4.2: Task Graph Model

Fig 4.4: Master-Slave Model

Precautions in using the master-slave model are:

Fig 4.5: Parallel LU factorization algorithm

Parallel Random Access Machines

Here, number of processors can perform independent operations on number of data in a

• Exclusive Read Exclusive Write(EREW)−Here no two processors are allowed to read

Shared Memory Model

Shared memory programming has been implemented in the following:

Merits of Shared Memory Programming

Demerits of Shared Memory Programming

Fig 4.9: Message Passing Model

Merits of Message Passing

Data Parallel Programming

Fig 4.10: Pipelined processor

Classification of Pipeline Processors

Classification according to level of processing

Classification according to type of instruction and data:

Fig 4.15: space-time diagram for overlapped Instruction pipelined processor

Fig 4.16: Multiple copies of shifted multiplicands

Now, we can identify the following stages for the pipeline:

First stage will compare the exponents of the two numbers.

• Second stage will look for alignment of mantissas.

Performance and Issues in Pipelining

• First we take the ideal case for measuring the speedup.

Time taken to execute without pipelining = m.n

Speedup due to pipelining = m.n/[m +(n-1)].

As n>=∞ , There is speedup of n times over the non-pipelined execution.

E = (n. m. c) / m.[m.c + (n-1).c] = n / (m + (n-1) As n-> ∞ , E becomes 1.

Vector instructions can be classified as below:

F3 : S × V -> V where S is the scalar item

For example, vector-scalar addition or divisions are of F3 type.

Memory-to-Memory Architecture: The pipelines can access vector operands, intermediate

Register-to-Register Architecture: In this organization, operands and results are accessed

Efficiency of Vector Processing over Scalar Processing:

As we know, a sequential computer processes scalar operands one at a time. Therefore if we

For example, consider the following vector addition:

A (1:500) + B (1:500) → C (1:500)

Associative Array Processing

READ NAME = RAVI

Associative Memory Organizations

Fig 4.20: Associative Memory

In the organization of an associative memory, following registers are used:

Types of Associative Processor

PEPE is an example of distributed associative processor which was designed as a special

Another alternative to improve the speed of the processor is to exploit a sequence of

Fig 4.22: VLIW Instruction word

Synchronization Latency Problem: If two concurrent processes are performing remote

Concept of Multithreading: These problems increase in the design of large-scale

Multithreaded Architecture: It is clear now that if we provide many contexts to multiple

P = useful processing time for which processor is busy

I = Idle time when processor is waiting

Design issues: To achieve the maximum processor utilization in a multithreaded architecture,

General-purpose computing on graphics processing units

General-purpose computing on graphics processing units (GPGPU, rarely GPGP) is the

1. intpthread_create(pthread_t thread, pthread_attr_tattr,

1 intpthread_mutex_init(pthread_mutex_t mutex, constpthread_mutexattr_t mutexattr);

1. intpthread_cond_wait(pthread_cond_t cond, pthread_mutex_tmutex);

1. void thr_func1(void arg){