Instruction Level Parallelism


CS 6303 - Computer Architecture Unit 4 – Q & A

1. Explain instruction level parallelism and the difficulties in implementing it.

Instruction-level parallelism (ILP) is a measure of how many of the operations in a
computer program can be performed simultaneously. The potential overlap among
instructions is called instruction-level parallelism.

The goal of ILP is to achieve not only instruction overlap but the actual execution of more
than one instruction at a time, through dynamic scheduling, and to maximize the throughput
of the processor. In typical RISC programs, however, instructions depend on one another,
so the amount of overlap that can be exploited is limited.
Instruction-level dependencies and hazards during ILP
If two instructions are not dependent, they can execute simultaneously, assuming
sufficient resources (i.e., no structural hazards). If one instruction depends on another, they
must execute in order, though they may still partially overlap.
1. Data dependencies
An instruction is data dependent on another if there exists a chain of data dependences
between them. Compilers can be of great help in detecting and scheduling around these
hazards; hardware can resolve them only with severe limitations.
2. Name Dependencies
A name dependency occurs when two instructions use the same register or memory
location, called a name, but there is no flow of data between them.



 Anti-dependence occurs when j writes a register/memory that i reads.


 Output dependence occurs when i and j write to the same register/memory location.
The name used in one of the instructions is changed so that they no longer conflict; this
technique is known as register renaming (using temporary registers).
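As a minimal sketch (the register-style variable names below are hypothetical and chosen only for illustration), the following C program shows a WAR and a WAW name dependence on r1 that disappear once the later writes are renamed to the temporaries t1 and t2:

#include <stdio.h>

int main(void) {
    int r1 = 10, r2 = 20, r4 = 3, r5 = 4, r6 = 9, r7 = 2;
    int r3, t1, t2;

    /* Without renaming, the later statements would reuse r1:
     *   r3 = r1 + r2;   reads r1
     *   r1 = r4 * r5;   WAR: must not write r1 before the read above
     *   r1 = r6 - r7;   WAW: output dependence with the previous write
     * Renaming the two destinations to t1 and t2 removes the WAR and
     * WAW name dependences; only true (RAW) dependences remain. */
    r3 = r1 + r2;
    t1 = r4 * r5;
    t2 = r6 - r7;

    printf("%d %d %d\n", r3, t1, t2);
    return 0;
}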
3. Data Hazards
A data hazard exists when there is a data dependence between instructions that are close
enough for pipelined execution to cause a stall. There are three types of data hazards:
read after write (RAW)—j tries to read a source before i writes it; this is the most common
type and is a true data dependence.
 Example sequence: (1) i=i+1; (2) j=i+1
write after write (WAW)—j tries to write an operand before it is written by i; this
corresponds to an output dependence.
 Example sequence: (1) i=i+1; (2) print i; (3) i=j+1
write after read (WAR)—j tries to write a destination before i has read it; this corresponds
to an anti-dependence.
 Example sequence: (1) read i; (2) i=j+1

4. Control Dependencies
A control dependency determines the ordering of an instruction i with respect to a branch,
so that i is executed only when it should be. For example:
if (p1)
S1
if (p2)
S2
S1 is control dependent on p1, and S2 is control dependent on p2. Speculatively executed
instructions must not affect the program state until the branch result is determined; this
implies that speculatively executed instructions must not raise an exception or otherwise
cause visible side effects.
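A small C illustration of why this matters (the function and pointer below are hypothetical): the load is control dependent on the null check, so if the hardware executes it speculatively before the branch resolves, any fault it raises must be suppressed or deferred until the branch outcome confirms that the load should really have executed.

#include <stddef.h>

/* x = *p is control dependent on the branch (p != NULL).  Speculating the
 * load above the branch must not make an exception (e.g. a fault on a
 * NULL pointer) visible unless the branch is actually taken. */
int read_if_valid(const int *p) {
    int x = 0;
    if (p != NULL) {
        x = *p;        /* control-dependent load */
    }
    return x;
}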
Implementation of ILP and overcoming hazards and dependencies
To implement ILP and avoid delays, three methods are used:
1. Scoreboarding
2. Tomasulo's algorithm for dynamic scheduling
3. Branch prediction
1. Scoreboarding
Instructions are issued when they are ready, not necessarily in order; hence out-of-order
execution. To implement out-of-order issue, the instruction decode phase is split into two:

1. Issue—decode instructions and check for structural hazards;



2. Read operands—wait until no data hazards remain, then read the operands and start
executing.
The scoreboard dynamically schedules the pipeline. Instructions must pass through the issue
phase in order, but they may stall or bypass each other in the read-operands phase and
enter, or even complete, execution out of order.
Example
The CDC 6600 used a scoreboard. The goal of a scoreboard is to maintain a processor
throughput of one instruction per clock cycle (when there is no structural hazard). If the next
instruction would stall, it is stored in a queue and a later instruction is started; the scoreboard
takes full responsibility for instruction issue and execution. The CDC 6600 used as many as
16 separate functional units.
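The sketch below shows, in C, the kind of per-functional-unit status a scoreboard keeps in order to make the issue and read-operands decisions described above. The field names follow the usual textbook description of the CDC 6600 scoreboard; the structure is illustrative, not the exact hardware.

/* One entry of the scoreboard's functional-unit status table (sketch). */
typedef struct {
    int busy;     /* is the functional unit in use?                      */
    int op;       /* operation the unit is performing                    */
    int Fi;       /* destination register number                         */
    int Fj, Fk;   /* source register numbers                             */
    int Qj, Qk;   /* units producing Fj and Fk (0 = value already ready) */
    int Rj, Rk;   /* are the source operands ready and not yet read?     */
} FuncUnitStatus;

/* Issue stage: stall on a structural hazard (the unit is busy) or a WAW
 * hazard (result_unit records another unit that will write dest_reg). */
int can_issue(const FuncUnitStatus *unit, const int *result_unit, int dest_reg) {
    return !unit->busy && result_unit[dest_reg] == 0;
}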
2. Tomasulo's algorithm for dynamic scheduling
Instructions execute only when their operands are available; a waiting instruction is stored
in a reservation station. Reservation stations keep track of pending operands (resolving
RAW hazards), while WAR and WAW hazards are avoided using register renaming (with
additional physical registers). The Tomasulo architecture executes instructions in three
phases; each phase may take more than one clock cycle:

Three steps:
1. Issue
o Get the next instruction from the FIFO queue
o If a reservation station (RS) is available, issue the instruction to the RS, along with any
operand values that are already available
o If an operand value is not yet available, record which RS will produce it; if no RS is
available, the instruction stalls
2. Execute
o When an operand becomes available, store it in any reservation stations waiting for it
o When all operands are ready, the instruction begins execution
o Loads and stores are maintained in program order through their effective addresses
o No instruction is allowed to initiate execution until all branches that precede it in
program order have completed
3. Write result
o Write the result on the common data bus (CDB) into the reservation stations and store buffers
o (Stores must wait until both the address and the value have been received)
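A minimal C sketch of a reservation-station entry supporting the three steps above; the field names follow the common textbook description of Tomasulo's algorithm, and the number of stations is a hypothetical value.

#define NUM_RS 6   /* hypothetical number of reservation stations */

/* One reservation station entry (illustrative sketch). */
typedef struct {
    int  busy;    /* is this entry in use?                                    */
    int  op;      /* operation to perform on the source operands              */
    int  Qj, Qk;  /* stations that will produce each source (0 = value ready) */
    long Vj, Vk;  /* source operand values, valid when Qj/Qk are 0            */
    long A;       /* effective address, used by loads and stores              */
} ReservationStation;

ReservationStation rs[NUM_RS];

/* An instruction may begin execution only when both operands are available. */
int operands_ready(const ReservationStation *r) {
    return r->busy && r->Qj == 0 && r->Qk == 0;
}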
3. Branch prediction.
The predictor is a simple saturating n-bit counter. Each time a particular branch is taken, its
entry is incremented; otherwise it is decremented.



If the most significant bit of the counter is set, the branch is predicted taken.
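A minimal sketch of such a predictor in C, assuming a 2-bit saturating counter per entry (the table size and the PC indexing are hypothetical): the counter is incremented on a taken branch, decremented otherwise, and the prediction is the most significant bit (counter values 2 and 3 predict taken).

#include <stdint.h>

#define BHT_SIZE 1024                 /* hypothetical branch history table size */
static uint8_t counters[BHT_SIZE];    /* 2-bit saturating counters, values 0..3 */

/* Predict taken when the most significant bit of the counter is set. */
int predict_taken(uint32_t pc) {
    return counters[(pc >> 2) % BHT_SIZE] >= 2;
}

/* Update after the branch resolves: increment if taken, decrement if not,
 * saturating at 3 and 0. */
void update_predictor(uint32_t pc, int taken) {
    uint8_t *c = &counters[(pc >> 2) % BHT_SIZE];
    if (taken && *c < 3)  (*c)++;
    if (!taken && *c > 0) (*c)--;
}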
Limitations of ILP
1. To measure the limits of ILP, the instruction stream must be run on an ideal processor
with no significant limitations.
2. The ideal processor always predicts branches correctly and has no structural hazards.
3. This eliminates all control and name dependencies, leaving only data dependencies.
4. Theoretically, it is then possible for the last dynamically executed instruction in the
program to be scheduled on the first cycle.

2. Parallel processing challenges


 Parallel processing is the simultaneous use of more than one CPU to execute a program or
multiple computational threads.
 Ideally, parallel processing makes programs run faster because there are more engines
(CPUs or cores) running it.
 A parallel computer (or multiple processor system) is a collection of communicating
processing elements (processors) that cooperate to solve large computational problems fast
by dividing such problems into parallel tasks, exploiting Thread-Level Parallelism (TLP).
Advantages:
o Faster execution time, so higher throughput.
Disadvantages:
o More hardware required, also more power requirements.
o Not good for low power and mobile devices.
Challenges in parallel processing
 Connecting your CPUs
o Dynamic vs Static—connections can change from one communication to the
next
o Blocking vs Nonblocking—can simultaneous connections be present?
o Connections can be complete, linear, star, grid, tree, hypercube, etc.

 Bus-based routing

o Crossbar switching—impractical for all but the most expensive supercomputers
o 2X2 switch—can route inputs to different destinations
 Dealing with memory
 Various options:
o Global Shared Memory
o Distributed Shared Memory
o Global shared memory with separate cache for processors
 Potential Hazards:
o Individual CPU caches or memories can become out of sync with each
other (the "cache coherence" problem)
o Solutions:
 UMA/NUMA machines
 Snoopy cache controllers
 Write-through protocols
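As a rough sketch of one of the solutions listed above, the C fragment below imitates a snoopy, write-through invalidation scheme (the processor and cache-line counts are hypothetical): every write goes to shared memory, the other cache controllers observe it on the bus, and they invalidate their copies so no cache holds a stale value.

#define NUM_CPUS    4      /* hypothetical number of processors     */
#define CACHE_LINES 256    /* hypothetical lines per private cache  */

/* valid[cpu][line] == 1 means that line is valid in that CPU's cache. */
static int valid[NUM_CPUS][CACHE_LINES];

/* Write-through with snooping: the write is sent to shared memory over
 * the bus; every other controller snoops it and invalidates its copy. */
void snoop_write(int writer, int line) {
    for (int cpu = 0; cpu < NUM_CPUS; cpu++)
        if (cpu != writer)
            valid[cpu][line] = 0;   /* invalidate remote copies */
    valid[writer][line] = 1;        /* the writer's copy stays current */
}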

3. Write short notes on the different organizations of SMP.


A symmetric multiprocessor (SMP) can be defined as a standalone computer system
with the following characteristics:
1. There are two or more similar processors of comparable capability.
2. Processors share the same main memory and I/O facilities and are interconnected by a
bus or other internal connection scheme.
3. All processors share access to I/O devices, either through the same channels or
through different channels.
4. All processors can perform the same functions.
5. The system is controlled by an integrated operating system that provides interaction
between processors and their programs at the job, task, file and data element levels.
An SMP has a number of potential advantages over a uniprocessor architecture, including the following:
Performance
A system with multiple processors will perform better than one with a single processor of
the same type if the work can be organized so that some portion of it can be done in parallel.
Availability
Since all the processors can perform the same functions in a symmetric multiprocessor,
the failure of a single processor does not stop the machine. Instead, the system can continue
to function at a reduced performance level.

Incremental growth

A user can enhance the performance of a system by adding an additional processor.


Scaling
Vendors can offer a range of products with different price and performance
characteristics based on the number of processors configured in the system.
Organization
The organization of a multiprocessor system is shown in Figure 10.1

Figure 10.1: Block diagram of tightly coupled multiprocessors



 The processors can communicate with each other through memory (messages and status
information left in common data areas).
 It may also be possible for processors to exchange signals directly.
 The memory is often organized so that multiple simultaneous accesses to separate
blocks of memory are possible.
 In some configurations each processor may also have its own private main memory and
I/O channels in addition to the shared resources.

The organization of multiprocessor system can be classified as follows:

 Time shared or common bus


 Multiport memory.
 Central control unit.

Time shared Bus


A time-shared bus is the simplest mechanism for constructing a multiprocessor system.
The bus consists of control, address, and data lines.

The following features are provided in time-shared bus organization:

 Addressing: It must be possible to distinguish modules on the bus to determine the


source and destination of data
 Arbitration: Any I/O module can temporarily function as "master". A mechanism is
provided to arbitrate competing requests for bus control, using some sort of priority
scheme.
 Time sharing: When one module is controlling the bus, other modules are locked out
and must, if necessary, suspend operation until bus access is achieved.

The bus organization has several advantages compared with other approaches:

 Simplicity: This is the simplest approach to multiprocessor organization. The physical
interface and the addressing, arbitration, and time-sharing logic of each processor remain
the same as in a single-processor system.
 Flexibility: It is generally easy to expand the system by attaching more processors to
the bus.
 Reliability: The bus is essentially a passive medium and the failure of any attached
device should not cause failure of the whole system.

The main drawback of the bus organization is performance: all memory references pass
through the common bus, so the speed of the system is limited by the bus cycle time. To
improve performance, each processor can be equipped with a local cache memory.

Multiport Memory
The multiport memory approach allows direct, independent access to main memory
modules by each processor and I/O module. The multiport memory system is shown in
Figure 10.3 (Multiport memory).

The multiport memory approach is more complex than the bus approach, requiring a
fair amount of logic to be added to the memory system. Logic associated with the memory is
required for resolving conflicts. The method often used to resolve conflicts is to assign
permanently designated priorities to each memory port.

4. Explain hardware-based speculation


Hardware-based speculation
 Execute instructions along predicted execution paths, but only commit the results if the
prediction was correct.
 Instruction commit: allowing an instruction to update the register file when the instruction
is no longer speculative.
 An additional piece of hardware is needed to prevent any irrevocable action (i.e., updating
state or taking an exception) until the instruction commits.
Reorder Buffer
 Reorder buffer – holds the result of an instruction between completion and commit
 Four fields:
o Instruction type: branch/store/register
o Destination field: register number
o Value field: output value
o Ready field: completed execution?
 Modify reservation stations:
o Operand source is now reorder buffer instead of functional unit
 Register values and memory values are not written until an instruction commits
 On misprediction:
o Speculated entries in the ROB are cleared
 Exceptions:
o Not recognized until the instruction is ready to commit
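A minimal C sketch of a reorder-buffer entry holding the four fields listed above (the types and field widths are illustrative assumptions):

typedef enum { ROB_BRANCH, ROB_STORE, ROB_REGISTER } RobType;

/* One reorder-buffer entry: the result lives here between completion
 * and commit (illustrative sketch). */
typedef struct {
    RobType type;    /* instruction type: branch / store / register        */
    int     dest;    /* destination register number (or store address)     */
    long    value;   /* output value, written to real state only at commit */
    int     ready;   /* has the instruction completed execution?           */
} RobEntry;

/* Commit is strictly in program order: only the oldest entry may commit,
 * and only once its result is ready. */
int can_commit(const RobEntry *oldest) {
    return oldest->ready;
}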


Multiple issue and static scheduling

 To achieve CPI < 1, need to complete multiple instructions per clock


 Solutions:
o Statically scheduled superscalar processors
o VLIW (very long instruction word) processors
o Dynamically scheduled superscalar processors

5. Explain multicore organization and its processors.


or
Explain the implementation of multicore organization?

At the top level of description, the main variables in a multicore organization are as follows:

 The number of core processors on the chip.


 The number of levels of cache memory.
 The amount of cache memory that is shared.


Dedicated L1 cache
Figure 18.8a is an organization found in some of the earlier multicore computer chips
and is still seen in embedded chips. In this organization, the only on-chip cache is L1 cache,
with each core having its own dedicated L1 cache. Almost invariably, the L1 cache is divided
into instruction and data caches. An example of this organization is the ARM11 MPCore.
Dedicated L2 cache
Figure 18.8b is also an organization with no on-chip cache sharing. In this case, there is
enough area available on the chip to give each core a dedicated L2 cache. An example of this
organization is the AMD Opteron.
Shared L2 cache
Figure 18.8c shows a similar allocation of chip space to memory, but with the use of a
shared L2 cache. The Intel Core Duo has this organization.
Shared L3 cache
In Figure 18.8d, as the amount of cache memory available on the chip continues to grow,
performance considerations dictate splitting off a separate, shared L3 cache, with dedicated L1
and L2 caches for each core processor.

The Intel Core i7 is an example of this organization (shared L3 cache). The use of a shared
cache on the chip has several advantages over exclusive reliance on dedicated caches:


1. Constructive interference can reduce overall miss rates. That is, if a thread on one
core accesses a main memory location, this brings the frame containing the referenced
location into the shared cache. If a thread on another core soon thereafter accesses the
same memory block, the memory locations will already be available in the shared on-
chip cache.
2. A related advantage is that data shared by multiple cores is not replicated at the shared
cache level.
3. With proper frame replacement algorithms, the amount of shared cache allocated to
each core is dynamic
4. Interprocessor communication is easy to implement, via shared memory locations.
A potential advantage of having only dedicated L2 caches on the chip is that each core
enjoys more rapid access to its private L2 cache. This is advantageous for threads that
exhibit strong locality.

Intel Core Duo superscalar cores:


The Intel Core Duo implements two x86 superscalar processors with a shared L2 cache
(Figure 18.8c). The general structure of the Intel Core Duo is shown in Figure 18.9. Let us
consider the key elements starting from the top of the figure.
 As is common in multicore systems, each core has its own dedicated L1 cache. In this
case, each core has a 32-KB instruction cache and a 32-KB data cache.

 Each core has an independent thermal control unit. With the high transistor density of

today’s chips, thermal management is a fundamental capability, especially for laptop


and mobile systems.
 The Core Duo thermal control unit is designed to manage chip heat dissipation to
maximize processor performance within thermal constraints. Thermal management
also improves ergonomics with a cooler system and lower fan acoustic noise.
 In essence, the thermal management unit monitors digital sensors for high-accuracy die
temperature measurements. Each core can be defined as an independent thermal zone.
The maximum temperature for each thermal zone is reported separately via dedicated
registers that can be polled by software.
 If the temperature in a core exceeds a threshold, the thermal control unit reduces the
clock rate for that core to reduce heat generation.

Advanced Programmable Interrupt Controller (APIC).


The APIC performs a number of functions, including the following:
1. The APIC can provide interprocessor interrupts, which allow any processor to interrupt
any other processor or set of processors. In the case of the Core Duo, a thread in one
core can generate an interrupt, which is accepted by the local APIC, routed to the APIC
of the other core, and communicated as an interrupt to the other core.

2. The APIC accepts I/O interrupts and routes these to the appropriate core.
3. Each APIC includes a timer, which can be set by the OS to generate an interrupt to the
local core.
The power management logic is responsible for reducing power consumption when
possible, thus increasing battery life for mobile platforms, such as laptops. In essence, the
power management logic monitors thermal conditions and CPU activity and adjusts voltage
levels and power consumption appropriately. It includes an advanced power-gating capability
that allows for an ultra-fine-grained logic control that turns on individual processor logic
subsystems only if and when they are needed.
The Core Duo chip includes a shared 2-MB L2 cache. The cache logic allows for a
dynamic allocation of cache space based on current core needs, so that one core can be assigned
up to 100% of the L2 cache.


A cache line gets the M (modified) state when a processor writes to it.
 If the line is not in the E or M state prior to the write, the cache sends a Read-For-Ownership
(RFO) request, which ensures that the line exists in the writing core's L1 cache and is in the
I (invalid) state in the other core's L1 cache.
 When a core issues an RFO, if the line is shared only by the other cache within the local
die, the RFO can be resolved internally very quickly, without going to the external bus at
all. Only if the line is shared with another agent on the external bus does the RFO need to
be issued externally.
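The fragment below is a rough, simplified sketch of that write path (the MESI state names are standard, but the helper functions and the on-die versus external-bus test are hypothetical stand-ins for the real coherence hardware):

#include <stdio.h>

typedef enum { STATE_M, STATE_E, STATE_S, STATE_I } MesiState;

/* Hypothetical stand-ins for the coherence hardware. */
static int  shared_only_with_other_local_core(unsigned long addr) { (void)addr; return 1; }
static void send_rfo_on_die(unsigned long addr)       { printf("on-die RFO: %lx\n", addr); }
static void send_rfo_external_bus(unsigned long addr) { printf("external RFO: %lx\n", addr); }

/* A store to a line not already in E or M must first gain ownership via an
 * RFO, which invalidates the line in the other core's L1 cache. */
static MesiState write_line(MesiState state, unsigned long addr) {
    if (state != STATE_E && state != STATE_M) {
        if (shared_only_with_other_local_core(addr))
            send_rfo_on_die(addr);         /* resolved without the external bus */
        else
            send_rfo_external_bus(addr);
    }
    return STATE_M;                        /* the writer's copy becomes Modified */
}

int main(void) { write_line(STATE_S, 0x1000UL); return 0; }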
The bus interface connects to the external bus, known as the Front Side Bus, which connects
to main memory, I/O controllers, and other processor chips.

6. Implement Hardware Multithreading (or) Explain Hardware


multithreading
Hardware multithreading increases the utilization of a processor by switching to another
thread when one thread is stalled.
 Thread: A thread includes the program counter, the register state, and the stack. It is a
lightweight process; whereas threads commonly share a single address space, processes
do not.
 Process: A process includes one or more threads, the address space, and the operating
system state.
There are three main approaches to hardware multithreading.
1. Fine-grained multithreading
2. Coarse-grained multithreading
3. Simultaneous multithreading
Fine-grained multithreading
 Fine-grained multithreading switches between threads on each instruction, resulting
in interleaved execution of multiple threads.
 This interleaving is often done in a round-robin fashion, skipping any threads that are
stalled at that clock cycle.



 To make fine-grained multithreading practical, the processor must be able to switch
threads on every clock cycle.
Advantage
 It can hide the throughput losses that arise from both short and long stalls, since
instructions from other threads can be executed when one thread stalls.
Disadvantage
 It slows down the execution of the individual threads, since a thread that is ready to
execute without stalls will be delayed by instructions from other threads.
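A rough C sketch of the round-robin thread selection described above (the hardware thread count and the stall flags are hypothetical): each cycle the processor picks the next thread after the one that issued last, skipping any thread that is stalled.

#define NUM_THREADS 4                /* hypothetical number of hardware threads */

/* stalled[t] == 1 means thread t cannot issue this cycle (e.g. cache miss). */
static int stalled[NUM_THREADS];

/* Round-robin fine-grained selection: return the next ready thread after
 * last_thread, or -1 if every thread is stalled (the pipeline bubbles). */
int pick_next_thread(int last_thread) {
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (last_thread + i) % NUM_THREADS;
        if (!stalled[t])
            return t;
    }
    return -1;
}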
Coarse-grained multithreading
Coarse-grained multithreading switches threads only on costly stalls, such as last-level
cache misses.
Advantage:
 Eliminates the need to have fast thread switching.
 Does not slow down execution of individual threads, since instructions from other
threads will only be issued when a thread encounters a costly stall.

Disadvantages:
 Does not cover short stalls. This limitation arises from the pipeline start-up cost of
coarse-grained multithreading.
 Since instructions are issued from a single thread, when a stall occurs, the pipeline must
be emptied or frozen.
 The new thread that begins executing after the stall must fill the pipeline before
instructions will be able to complete.
 Due to this start-up overhead, coarse-grained multithreading is much more useful for
reducing the penalty of high-cost stalls, where pipeline refill is negligible compared to
the stall time.
Simultaneous multithreading
Simultaneous multithreading (SMT) is a variation on hardware multithreading.
Instructions from multiple threads are issued in the same cycle, using the register renaming
and dynamic scheduling facilities of a multiple-issue architecture.
Advantage
 Maximizes utilization of execution units
Disadvantage
 Needs more hardware support

o Register files and PCs for each thread



o Temporary result registers before commit


o Support to sort out which threads get results from which instructions
Example
The figures below show how four threads use the issue slots of a superscalar processor under
the different approaches.

Coarse Multithreading

Fine Multithreading


Simultaneous Multithreading

7. How is parallel processing implemented, and what architectures are used?


(or) Explain Flynn's classification.
Flynn's taxonomy is a classification of computer architectures, proposed by Michael J.
Flynn in 1966. The four classifications defined by Flynn are based upon the number of
concurrent instruction (or control) and data streams available in the architecture:
Two types of information flow into a processor:

 The instruction stream is defined as the sequence of instructions performed by the


processing unit.
 The data stream is defined as the data traffic exchanged between the memory and the
processing unit.

Computer architecture can be classified into the following four distinct categories:

a) Single-instruction single-data streams (SISD)


b) Single-instruction multiple-data streams (SIMD)
c) Multiple-instruction single-data streams (MISD)
d) Multiple-instruction multiple-data streams (MIMD).
Single Instruction, Single Data stream (SISD)

 Conventional single-processor von Neumann computers are classified as SISD systems.



where CU = control unit, PE = processing element, M = memory


Single-instruction multiple-data streams (SIMD)

 The SIMD model of parallel computing consists of two parts: a front-end computer of
the usual von Neumann style, and a processor array.

 The processor array is a set of identical synchronized processing elements capable of
simultaneously performing the same operation on different data.
 Each processor in the array has a small amount of local memory where the distributed
data resides while it is being processed in parallel.
 The processor array is connected to the memory bus of the front end so that the front
end can randomly access the local processor memories as if it were another memory.
 The front end can issue special commands that cause parts of the memory to be operated
on simultaneously or cause data to move around in the memory.
 The application program is executed by the front end in the usual serial way, but issues
commands to the processor array to carry out SIMD operations in parallel.
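As a minimal illustration of the SIMD idea (plain C, purely conceptual): the single operation below is applied across N data elements, which is the kind of loop a front end would hand to the processor array so that all elements are processed simultaneously, each processing element working on its own local slice of the arrays.

#define N 8   /* hypothetical number of data elements / processing elements */

/* One instruction ("c = a + b") applied to many data items: on a SIMD
 * machine every processing element would perform its iteration at the
 * same time on its own local data. */
void vector_add(const int a[N], const int b[N], int c[N]) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}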
Multiple-instruction multiple-data streams (MIMD)
 Multiple-instruction multiple-data streams (MIMD) parallel architectures are made of
multiple processors and multiple memory modules connected together via some
interconnection network. They fall into two broad categories: shared memory or
message passing.
 Processors exchange information through their central shared memory in shared
memory systems, and exchange information through their interconnection network in
message passing systems.
 A shared memory system typically accomplishes interprocessor coordination through
a global memory shared by all processors.
 Because access to shared memory is balanced, these systems are also called SMP
(symmetric multiprocessor) systems.



 A message passing system (also referred to as distributed memory) typically combines


the local memory and processor at each node of the interconnection network.
 There is no global memory, so it is necessary to move data from one local memory to
another by means of message passing.
 This is typically done by a Send/Receive pair of commands, which must be written into
the application software by a programmer.

Multiple-instruction single-data streams (MISD)


 In the MISD category, the same stream of data flows through a linear array of
processors executing different instruction streams.

 In practice, there is no viable MISD machine.

