Instruction Level Parallelism
The goal of ILP is to achieve not only instruction overlap, but the actual execution of more than
one instruction at a time through dynamic scheduling, thereby maximizing the throughput of
the processor. Even on typical RISC processors, instructions usually depend on one another,
and as a result the amount of overlap that can be exploited is limited.
Instruction-level dependencies and hazards in ILP
If two instructions are not dependent, then they can execute simultaneously, assuming
sufficient resources (that is, no structural hazards). If one instruction depends on another, they
must execute in order, though they may still partially overlap.
1. Data dependencies
Data dependence means that one instruction depends on another if there exists a
chain of dependencies between them. Compilers can be of great help in detecting and
scheduling around these sorts of hazards; hardware can resolve these dependencies only with
severe limitations.
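As a minimal illustration (the function and variable names here are invented for the example), the following C fragment forms a chain of RAW data dependences: each statement reads the result of the one before it, so the three operations must execute in order no matter how much hardware is available.

    /* Each statement reads the result of the previous one (RAW), so no
       two of these operations can execute simultaneously. */
    int chain(int x, int y)
    {
        int a = x + y;    /* produces a */
        int b = a * 2;    /* RAW: reads a */
        int c = b - 5;    /* RAW: reads b */
        return c;
    }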
2. Name Dependencies
A name dependency occurs when two instructions use the same register or memory
location (the same name), but there is no flow of data between the instructions associated
with that name.
3. Control Dependencies
A control dependency determines the ordering of an instruction with respect to a branch,
so that the instruction executes only when it should. For example:
if (p1)
S1
if (p2)
S2
S1 is control dependent on p1, and S2 is control dependent on p2. Speculatively executed
instructions must not affect the program state until the branch outcome is determined. This
implies that instructions executed speculatively must not raise an exception or otherwise
cause side effects.
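To make the exception constraint concrete, here is a small hedged C sketch (not from the original notes): the load *p is control dependent on the null test, and executing it speculatively when p is NULL would raise a fault the original program could never produce.

    #include <stddef.h>

    /* The load *p must not be moved above the branch: if p is NULL, a
       speculative load would fault, a side effect the original program
       would never cause. Hardware speculation must suppress such
       exceptions until the branch outcome is known. */
    int guarded_load(int *p)
    {
        if (p != NULL)
            return *p;   /* control dependent on (p != NULL) */
        return 0;
    }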
Implementation of ILP and overcoming hazards or dependencies
Three methods are used to implement ILP while avoiding the delays caused by these hazards:
1. Scoreboarding.
2. Tomasulo's solution for dynamic scheduling.
3. Branch prediction.
1. Scoreboarding.
Instructions are issued when they are ready, not necessarily in order; hence out-of-order
execution. To implement out-of-order issue, the instruction decode phase is split into two:
1. Issue—decode the instruction and check for structural hazards.
2. Read operands—wait until no data hazards remain, then read the operands and start
executing.
The scoreboard dynamically schedules the pipeline. Instructions must pass through the issue
phase in order, but they may stall or bypass each other in the read-operands phase and thus
enter, or even complete, execution out of order.
Example
The CDC 6600 used a scoreboard. The goal of a scoreboard is to maintain a processor
throughput of one instruction per clock cycle when there are no structural hazards. If the next
instruction would stall, it is stored in a queue and a later instruction is started instead. The
scoreboard takes full responsibility for instruction issue and execution, and the CDC 6600
used as many as 16 separate functional units.
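As an illustrative sketch (the field names Fi, Fj, Fk, Qj, Qk follow the classic CDC 6600 scoreboard description, but the C code itself is invented for teaching purposes), one scoreboard entry and the issue test might look like this:

    #include <stdbool.h>

    /* One functional-unit status entry of a scoreboard. */
    typedef struct {
        bool busy;        /* unit currently in use? */
        int  op;          /* operation being performed */
        int  Fi, Fj, Fk;  /* destination and source registers */
        int  Qj, Qk;      /* units producing Fj and Fk (-1 = value ready) */
        bool Rj, Rk;      /* source operands ready to be read? */
    } FuncUnit;

    /* Register result status: unit that will write register r (-1 = none). */
    int result_unit[32];

    /* Issue only if the unit is free (no structural hazard) and no active
       unit is already going to write the destination (no WAW hazard). */
    bool can_issue(const FuncUnit *fu, int dest_reg)
    {
        return !fu->busy && result_unit[dest_reg] == -1;
    }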
2. Tomasulo's solution for dynamic scheduling.
An instruction executes only when its operands are available; until then, the waiting
instruction is stored in a reservation station. Reservation stations keep track of pending
operands (resolving RAW hazards), while WAW hazards are avoided by register renaming
(80 registers).
The Tomasulo architecture executes instructions in three phases; each phase may take more
than one clock cycle:
Three Steps:
1. Issue
o Get the next instruction from the FIFO queue.
o If a reservation station is available, issue the instruction to it, together with any operand
values that are already available.
o If an operand value is not yet available, record which reservation station will produce it;
if no reservation station is free, the instruction stalls.
2. Execute
o When an operand becomes available, store it in every reservation station waiting for it.
o When all of its operands are ready, the instruction can begin execution.
o Loads and stores are kept in program order through their effective-address calculation.
o No instruction is allowed to initiate execution until all branches that precede it in
program order have completed.
3. Write result
o Write the result on the CDB (common data bus) into the reservation stations and store
buffers.
o Stores must wait until both the address and the value have been received.
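As a sketch (assuming the standard Vj/Vk/Qj/Qk reservation-station fields from textbook descriptions of Tomasulo's algorithm; the C code is illustrative, not the actual hardware), a station entry and the CDB broadcast could look like this:

    #include <stdbool.h>

    /* One reservation-station entry. Qj/Qk hold the tag of the station
       that will produce a missing operand (0 = value already in Vj/Vk),
       which is how renaming by station tag avoids WAW and WAR hazards. */
    typedef struct {
        bool busy;
        int  op;        /* operation to perform */
        int  Vj, Vk;    /* operand values, valid when Qj/Qk == 0 */
        int  Qj, Qk;    /* tags of producing stations, 0 = ready */
    } ResStation;

    /* Write result: one broadcast on the common data bus (CDB) delivers
       the value to every station waiting on that tag. */
    void cdb_broadcast(ResStation rs[], int n, int tag, int value)
    {
        for (int i = 0; i < n; i++) {
            if (rs[i].Qj == tag) { rs[i].Vj = value; rs[i].Qj = 0; }
            if (rs[i].Qk == tag) { rs[i].Vk = value; rs[i].Qk = 0; }
        }
    }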
3. Branch prediction.
The predictor is a simple saturating n-bit counter. Each time a particular branch is taken,
the counter is incremented, saturating at its maximum value; each time it is not taken, the
counter is decremented, saturating at zero. If the most significant bit of the counter is set,
the branch is predicted taken.
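As a concrete sketch for n = 2, the most common case (the C interface here is invented for illustration), the counter saturates at 0 and 3 and the prediction is simply its most significant bit:

    /* 2-bit saturating counter: values 0..3. Prediction = MSB, i.e.
       counter >= 2 means predict taken. Saturation means a single
       anomalous outcome cannot flip a well-established prediction. */
    typedef struct { unsigned counter; } Predictor;

    int predict_taken(const Predictor *p)
    {
        return p->counter >= 2;              /* MSB of the 2-bit counter */
    }

    void update(Predictor *p, int taken)
    {
        if (taken  && p->counter < 3) p->counter++;  /* saturate at 3 */
        if (!taken && p->counter > 0) p->counter--;  /* saturate at 0 */
    }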
Limitations of ILP
1. To measure the limits of ILP, an instruction stream must be run on an ideal processor with
no significant limitations.
2. The ideal processor always predicts branches correctly and has no structural hazards.
3. This eliminates all control and name dependencies, leaving only true data dependencies.
4. In theory, the last dynamically executed instruction in the program could then be scheduled
on the first cycle.
A bus-based organization also permits incremental growth: the system can be expanded by
attaching additional processors to the shared bus.
The processors can communicate with each other through memory (messages and status
information left in common data areas).
It may also be possible for processors to exchange signals directly.
The memory is often organized so that multiple simultaneous accesses to separate
blocks of memory are possible.
In some configurations, each processor may also have its own private main memory and
I/O channels in addition to the shared resources.
The bus organization has several advantages compared with other approaches:
1. Simplicity—it is the simplest approach to multiprocessor organization.
2. Flexibility—it is generally easy to expand the system by attaching more processors to
the bus.
3. Reliability—the bus is essentially a passive medium, and the failure of any attached
device should not cause failure of the whole system.
The main drawback of the bus organization is performance: all memory references pass
through the common bus, so the speed of the system is limited by the bus cycle time. To
improve performance, each processor can be equipped with a local cache memory.
Multiport Memory
The multiport memory approach allows direct, independent access to the main memory
modules by each processor and each I/O module. The multiport memory system is shown
in Figure 10.3 (Multiport memory).
The multiport memory approach is more complex than the bus approach, requiring a
fair amount of logic to be added to the memory system. Logic associated with the memory
is required for resolving conflicts; the method often used is to assign permanently
designated priorities to each memory port.
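A minimal sketch of such fixed-priority arbitration (the port count and interface are invented for illustration): the lowest-numbered requesting port always wins, which is what a permanently designated priority per port means.

    #define NPORTS 4

    /* request[i] is nonzero if port i wants the memory module this cycle.
       Returns the winning port, or -1 if the module is idle. Port 0 holds
       the highest permanently assigned priority. */
    int arbitrate(const int request[NPORTS])
    {
        for (int port = 0; port < NPORTS; port++)
            if (request[port])
                return port;   /* lowest-numbered requester wins */
        return -1;             /* no requests this cycle */
    }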
At the top level of description, the main variables in a multicore organization are as follows:
1. The number of core processors on the chip.
2. The number of levels of cache memory.
3. The amount of cache memory that is shared.
Dedicated L1 cache
Figure 18.8a shows an organization found in some of the earlier multicore computer chips
that is still seen in embedded chips. In this organization, the only on-chip cache is the L1
cache, with each core having its own dedicated L1 cache. Almost invariably, the L1 cache is
divided into instruction and data caches. An example of this organization is the ARM11
MPCore.
Dedicated L2 cache
Figure 18.8b also shows an organization with no on-chip cache sharing, but here there is
enough area available on the chip to give each core a dedicated L2 cache. An example of this
organization is the AMD Opteron.
Shared L2 cache
Figure 18.8c shows a similar allocation of chip space to memory, but with the use of a
shared L2 cache. The Intel Core Duo has this organization.
Shared L3 cache
In Figure 18.8d, as the amount of cache memory available on the chip continues to grow,
performance considerations dictate splitting off a separate, shared L3 cache, with dedicated
L1 and L2 caches for each core.
Each core has an independent thermal control unit. With the high transistor density of
today's chips, thermal management is a major concern, especially for laptop and mobile
systems. The Advanced Programmable Interrupt Controller (APIC) performs several
functions:
1. The APIC can provide interprocessor interrupts, which allow any process to interrupt
any other processor or set of processors.
2. The APIC accepts I/O interrupts and routes these to the appropriate core.
3. Each APIC includes a timer, which can be set by the OS to generate an interrupt to the
local core.
The power management logic is responsible for reducing power consumption when
possible, thus increasing battery life for mobile platforms, such as laptops. In essence, the
power management logic monitors thermal conditions and CPU activity and adjusts voltage
levels and power consumption appropriately. It includes an advanced power-gating capability
that allows for an ultra-fine-grained logic control that turns on individual processor logic
subsystems only if and when they are needed.
The Core Duo chip includes a shared 2-MB L2 cache. The cache logic allows for a
dynamic allocation of cache space based on current core needs, so that one core can be assigned
up to 100% of the L2 cache.
Disadvantages of coarse-grained multithreading:
It does not cover short stalls. This limitation arises from the pipeline start-up cost:
since instructions are issued from a single thread, when a stall occurs the pipeline must
be emptied or frozen.
The new thread that begins executing after the stall must fill the pipeline before any of
its instructions can complete.
Because of this start-up overhead, coarse-grained multithreading is most useful for
reducing the penalty of high-cost stalls, where the pipeline refill time is negligible
compared with the stall time, as the rough calculation below illustrates.
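As a rough illustration (the figures here are invented for the example): on a pipeline that takes about 20 cycles to refill after a thread switch, hiding a 200-cycle memory-miss stall recovers roughly 180 cycles, while switching on a 5-cycle stall would cost far more in refill time than the stall itself.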
Simultaneous multithreading
Simultaneous multithreading (SMT) is a variation on hardware multithreading in which
instructions from multiple threads are issued in the same cycle. It uses the register-renaming
and dynamic-scheduling facilities of a multiple-issue architecture.
Advantage
Maximizes utilization of execution units
Disadvantage
Needs more hardware support
[Figure: issue-slot usage under coarse multithreading, fine multithreading, and simultaneous multithreading]
Computer architecture can be classified into the following four distinct categories (Flynn's
taxonomy):
1. Single instruction stream, single data stream (SISD).
2. Single instruction stream, multiple data streams (SIMD).
3. Multiple instruction streams, single data stream (MISD).
4. Multiple instruction streams, multiple data streams (MIMD).
The SIMD model of parallel computing consists of two parts: a front-end computer of
the usual von Neumann style, and a processor array.
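As a hedged sketch of the execution model (the serial C loop below merely stands in for what the processor array does in lockstep, one element per processing element), the front end broadcasts a single instruction and every processing element applies it to its own data:

    #define N 1024

    /* Conceptually, all N element updates happen at once: the front end
       issues one "add 1" instruction and each processing element applies
       it to its own slot of the array. */
    void simd_add_one(int data[N])
    {
        for (int i = 0; i < N; i++)   /* serial stand-in for lockstep PEs */
            data[i] += 1;
    }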