Tuesday, October 31, 2023 10:53 PM: Discuss, The Schemes For Dealing With The Pipeline Stalls Caused by Branch Hazards
1. Explain the flow of execution in a pipeline using a simple diagram.
2. a. Consider the following instructions executed on a five-stage pipeline:
lw R1, 0(R2)
add R3, R4, R5
add R6, R1, R3
Explain the flow of execution in a pipeline using a simple diagram.
b. For the instruction mix above, on what instruction results does the last add instruction depend?
3. Indicate the type of hazards and data dependences for the following instructions: (5)
b. lw R5, -16(R5)
sw R5, -16(R5)
add R5, R5, R5
4. For the above sequence, add nop instructions to eliminate the hazards. Assume that there is
no forwarding in this pipelined processor.
5. Discuss the schemes for dealing with the pipeline stalls caused by branch hazards. (3)
ChatGPT
Sure, let's dive into the strategies for mitigating pipeline stalls caused by branch hazards.
1. Branch Prediction:
• Static Prediction: This involves predicting branches at compile-time based on the
program's structure. However, it may not be very accurate for dynamic and runtime-
dependent branches.
• Dynamic Prediction: Here, the processor predicts the outcome of a branch during runtime. Two-level predictors, tournament predictors, and neural branch predictors are examples of dynamic prediction techniques (a minimal sketch of a simple two-bit predictor appears after this list).
2. Branch Target Buffer (BTB):
• A BTB is a cache that stores target addresses of recent branches. When a branch
instruction is encountered, the BTB is checked to fetch the predicted target address.
This can help in reducing stalls as the pipeline can continue with the predicted
instruction.
3. Delayed Branching:
• The architecture defines one or more branch delay slots: the instruction(s) immediately following a branch are always executed, regardless of the branch outcome. The compiler tries to fill these slots with useful instructions (taken from before the branch, from the branch target, or from the fall-through path); slots it cannot fill are padded with nops, which wastes cycles. Because delay-slot instructions execute either way, no flush is needed for the branch itself.
4. Out-of-Order Execution:
• Modern processors often use out-of-order execution, allowing instructions after a
branch to be executed even before the branch outcome is determined. If the branch
prediction was correct, the results are committed; otherwise, the pipeline is rolled back.
5. Speculative Execution:
• Similar to out-of-order execution, speculative execution involves executing instructions
based on predicted branch outcomes. If the prediction is correct, the results are
committed; otherwise, the speculatively executed instructions are discarded.
6. Loop Buffer:
• A loop buffer is a small, very fast instruction buffer that holds the most recently fetched instructions, typically the body of a small loop. While the loop executes, instructions (including the branch target at the top of the loop) are fetched from the buffer instead of memory, so the taken branch at the end of each iteration does not stall instruction fetch.
7. Software Pipelining:
• This is a compiler optimization technique that reorganizes the code to minimize
pipeline stalls. It can be used to overlap the execution of multiple iterations of a loop,
reducing the impact of branch hazards.
Each of these techniques comes with its advantages and limitations. The effectiveness of a
particular scheme may depend on the characteristics of the code being executed and the
architecture of the processor.
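Illustrative sketch (not part of the quoted answer above): a minimal two-bit saturating-counter predictor of the kind mentioned under dynamic prediction, written in C. The table size, indexing scheme, and function names are assumptions made for the example.

#include <stdbool.h>
#include <stdint.h>

#define PRED_TABLE_SIZE 1024                    /* illustrative size, power of two */

/* One 2-bit saturating counter per entry:
   0,1 = predict not taken; 2,3 = predict taken. */
static uint8_t pred_table[PRED_TABLE_SIZE];

static unsigned pred_index(uint32_t pc)
{
    return (pc >> 2) & (PRED_TABLE_SIZE - 1);   /* drop byte-offset bits, mask to table size */
}

bool predict_taken(uint32_t pc)
{
    return pred_table[pred_index(pc)] >= 2;
}

/* Called once the real branch outcome is known; nudges the counter
   toward the observed direction, saturating at 0 and 3. */
void predictor_update(uint32_t pc, bool taken)
{
    uint8_t *ctr = &pred_table[pred_index(pc)];
    if (taken && *ctr < 3)
        (*ctr)++;
    else if (!taken && *ctr > 0)
        (*ctr)--;
}

A branch has to be mispredicted twice in a row before the prediction flips, which is why loop-closing branches are handled well by this scheme.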
From <https://chat.openai.com/c/cb22d345-9d92-4035-ad45-9cbc0fb32bb6>
4.13.1: Recall that lw loads a word from memory into a register, sw stores a register into memory, and add writes the sum of its two source registers into its destination register.
a
I1: lw $1,40($6)
I2: add $6,$2,$2
I3: sw $6,50($1)
RAW on $1 from I1 to I3
RAW on $6 from I2 to I3
WAR on $6 from I1 to I2
b
I1: lw $5,-16($5)
I2: sw $5,-16($5)
I3: add $5,$5,$5
RAW on $5 from I1 to I2 and I3
WAR on $5 from I1 and I2 to I3
WAW on $5 from I1 to I3
4.13.2: In the basic five-stage pipeline, WAR and WAW dependences do not cause any hazards. Without forwarding, any RAW dependence between an instruction and either of the next two instructions causes a data hazard (assuming the register write happens in the first half of the clock cycle and the register read happens in the second half). The code that eliminates these hazards by inserting nop instructions is:
Instruction sequence
a
lw $1,40($6)
add $6,$2,$2
nop
nop
sw $6,50($1)
b
lw $5,-16($5)
nop
nop
sw $5,-16($5)
add $5,$5,$5
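A tiny sketch (an illustration under the assumptions stated in 4.13.2, not from the source answer) of the distance rule in C: without forwarding, and with the register file written in the first half of a cycle and read in the second half, a RAW dependence is safe only once the consumer is at least three instruction slots after the producer.

#include <stdbool.h>

/* Positions are 1-based slots in program order (inserted nops count as slots).
   Without forwarding, a dependent instruction closer than 3 slots behind its
   producer still reads the old register value, i.e. a data hazard. */
static bool raw_hazard_no_forwarding(int producer_pos, int consumer_pos)
{
    return (consumer_pos - producer_pos) < 3;
}

For sequence a, the sw originally sat only one slot after the add that writes $6, so two nops are needed to push it three slots away.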
4.13.3: With full forwarding, an ALU instruction can forward a value to the EX stage of the next instruction without a hazard. However, a load cannot forward to the EX stage of the next instruction (but it can to the instruction after that). The code that eliminates these hazards by inserting nop instructions is:
Instruction sequence
a
lw $1,40($6)
add $6,$2,$2
sw $6,50($1)
b
lw $5,-16($5)
nop
sw $5,-16($5)
add $5,$5,$5
4.13.4 The total execution time is the clock cycle time times the number of cycles. Without any stalls, a three-instruction sequence executes in 7 cycles (5 to complete the first instruction, then one per instruction). The execution without forwarding must add a stall cycle for every nop we had in 4.13.2, and the execution with forwarding must add a stall cycle for every nop we had in 4.13.3. Overall, we get:
7 × 400ps = 2800ps
(7 + 2) × 200ps = 1800ps
(7 + 1) × 250ps = 2000ps
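A minimal sketch of the cycle-count arithmetic, assuming (as stated above) 7 cycles for a three-instruction sequence with no stalls plus one extra cycle per nop/stall; the function name is illustrative and the cycle times are just the ones quoted above.

#include <stdio.h>

/* Total time = (7 base cycles + one cycle per stall) x clock cycle time. */
static long total_time_ps(int stalls, int cycle_time_ps)
{
    return (7L + stalls) * cycle_time_ps;
}

int main(void)
{
    printf("%ld ps\n", total_time_ps(0, 400));   /* 7 x 400ps = 2800ps */
    printf("%ld ps\n", total_time_ps(2, 200));   /* (7 + 2) x 200ps = 1800ps */
    printf("%ld ps\n", total_time_ps(1, 250));   /* (7 + 1) x 250ps = 2000ps */
    return 0;
}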
4.13.5 With ALU-ALU-only forwarding, an ALU instruction can forward to the next instruction,
but not to the second-next instruction (because that would be forwarding from MEM to EX). A
load cannot forward at all, because it determines the data value in MEM stage, when it is too
late for ALU-ALU forwarding. We have:
Instruction sequence
a
lw $1,40($6)
add $6,$2,$2
nop
sw $6,50($1)
b
lw $5,-16($5)
nop
nop
sw $5,-16($5)
add $5,$5,$5
4.13.6: The total execution times of these instruction sequences with ALU-ALU-only forwarding and with no forwarding are given below.
(7 + 1) × 360ps = 2880ps
(7 + 2) × 200ps = 1800ps
(7 + 2) × 220ps = 1980ps
2A
From <https://www.chegg.com/homework-help/questions-and-answers/410-exercise-examine-resource-hazards-control-hazards-instruction-set-architecture-isa-des-q24942268>
○ 1.ADD R1, R2, R3
2.SUB R4, R1, R5
3.MUL R6, R7, R1
○ In Tomasulo's algorithm, these instructions would enter reservation stations. The ADD
instruction would reserve a spot in the reservation station associated with the ADD
functional unit. The subsequent SUB and MUL instructions would also enter reservation
stations.
○ If the ADD instruction has not completed its execution, the SUB and MUL instructions can still be issued to reservation stations. Each records the tag of the station that will produce R1 and waits; when the ADD result is broadcast on the common data bus, the waiting stations capture it and can then begin execution. RAW dependences are thus respected without stalling instruction issue, and renaming through the stations removes WAR and WAW hazards.
1. Structural Hazards:
○ Structural hazards arise when multiple instructions compete for the same functional unit.
Tomasulo's algorithm addresses this by allowing multiple reservation stations for the same
type of functional unit.
○ For example, if there are multiple ADD instructions in the instruction stream, each ADD
instruction can use a separate reservation station associated with the ADD functional unit.
2. Control Hazards:
○ Tomasulo's algorithm primarily focuses on data and structural hazards, but control hazards,
such as branches, may still exist. Techniques like branch prediction and out-of-order
execution can help mitigate control hazards.
In summary, Tomasulo's algorithm prevents hazards by:
• Using reservation stations to handle data hazards.
• Allowing multiple reservation stations for the same type of functional unit to address structural
hazards.
Remember that Tomasulo's algorithm is just one approach to dynamic scheduling, and other techniques
like scoreboarding also address similar issues in out-of-order execution architectures.
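Rough illustration (not part of the quoted answer above): a reservation-station entry and the common-data-bus broadcast, sketched in C using the classic textbook field names (op, Vj, Vk, Qj, Qk); the struct layout, tag encoding, and function names are assumptions made for the sketch.

#include <stdbool.h>
#include <stdint.h>

/* One reservation-station entry. A tag of 0 means "value already present". */
typedef struct {
    bool    busy;       /* entry currently holds an issued instruction      */
    int     op;         /* operation to perform (ADD, SUB, MUL, ...)        */
    int32_t vj, vk;     /* operand values, valid only when qj/qk are 0      */
    int     qj, qk;     /* tags of the stations that will produce operands  */
} ReservationStation;

/* An instruction may start executing once both operands are available. */
static bool ready_to_execute(const ReservationStation *rs)
{
    return rs->busy && rs->qj == 0 && rs->qk == 0;
}

/* When a functional unit finishes, its (tag, result) pair is broadcast on the
   common data bus and every waiting station captures the value it needs. */
static void cdb_broadcast(ReservationStation *stations, int n,
                          int tag, int32_t result)
{
    for (int i = 0; i < n; i++) {
        if (stations[i].qj == tag) { stations[i].vj = result; stations[i].qj = 0; }
        if (stations[i].qk == tag) { stations[i].vk = result; stations[i].qk = 0; }
    }
}

In the ADD/SUB/MUL example above, the SUB and MUL entries would hold the ADD station's tag in qj or qk and become ready only after that tag appears on the bus.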
From <https://chat.openai.com/c/6f195770-72b0-41e4-a502-26ffc055ebcc>
From <https://www.chegg.com/homework-help/questions-and-answers/following-loop-find-true-dependences-output-dependences-anti-dependences-eliminate-output--q45515765>
An output dependence exists when two or more instructions write to the same register or memory location; here there is an output dependence between S1 and S3 through A[i].
An anti-dependence occurs whenever a later instruction writes a register or memory location that an earlier instruction reads; here there is an anti-dependence between S3 and S4 through C[i] (S3 reads C[i], S4 writes it), and another between S2 and S3 through A[i] (S2 reads A[i], S3 writes it).
These are removed by renaming the written locations, as in the rewritten code below:
for (i=0;i<100;i++)
{
T[i] = A[i] * B[i]; /* S1 */
B[i] = T[i] + c; /* S2 */
A1[i] = C[i] * c; /* S3 */
C1[i] = D[i] * A1[i]; /* S4 */
}
True dependences remain in the rewritten code: S1 to S2 through T[i], and S3 to S4 through A1[i].
From <http://exedustack.nl/why-112-million-rated-moiss-caicedo-shouldnt-leave-brighton-this-year/>
Explanation:
=>S1: A[i] = A[i] + B[i];
=>S2: B[i+1] = C[i] + D[i];
=>Within a single iteration there is no dependence between S1 and S2, but there is a loop-carried dependence: S2 in iteration i writes B[i+1], and S1 in the next iteration (i+1) reads that same element.
=>Because of this loop-carried dependence, the iterations are not independent as written: iteration i+1 of S1 must use the value of B produced by iteration i of S2, so the two statements cannot simply be split into separate loops and run on different CPU cores without changing the results.
Making parallel:
=>We call a loop parallel when its iterations can be run by separate processes, for example on different CPU cores.
=>The dependence here is not circular (S1 depends on S2, but S2 does not depend on S1), so the loop can be restructured so that the dependence stays inside one iteration:
A[0] = A[0] + B[0];
for(i=0;i<99;i++){
B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];
=>After this transformation no loop-carried dependence remains, so the iterations of the new loop can run in parallel on different CPU cores.
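As a sketch under stated assumptions (array sizes chosen to cover the index ranges; OpenMP used only as one possible way to run iterations on different cores), the transformed loop above could be written and parallelized like this:

#define N 100

double A[N], B[N + 1], C[N], D[N];

void compute(void)
{
    A[0] = A[0] + B[0];                  /* peeled: uses the original B[0] */

    /* No loop-carried dependence remains, so iterations are independent
       and may run on different cores. */
    #pragma omp parallel for
    for (int i = 0; i < N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }

    B[N] = C[N - 1] + D[N - 1];          /* peeled: last write of the old loop */
}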
From <http://exedustack.nl/why-112-million-rated-moiss-caicedo-shouldnt-leave-brighton-this-year/>
https://www.lkouniv.ac.in/site/writereaddata/siteContent/202004291236420982rohit_vector_architecture.pdf