Lecture 4


Parallel and Distributed Systems

Performance Issues

Prepared by : Shah Murtaza Rashid Al Masud


Department of Computer Science
Najran University

Pipelining Example
 Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
3. Fold clothes and put them into drawers
 Each stage takes 30 minutes to complete
 Four loads of clothes (A, B, C, D) to wash, dry, and fold
Sequential Laundry
[Figure: sequential laundry timeline from 6 PM to midnight; each load occupies three consecutive 30-minute slots]
 Sequential laundry takes 6 hours for 4 loads
 Intuitively, we can use pipelining to speed up laundry
Pipelined Laundry: Start Load ASAP
[Figure: pipelined laundry timeline from 6 PM to 9 PM; loads A, B, C, D overlap, each shifted by 30 minutes]
 Pipelined laundry takes 3 hours for 4 loads
 Speedup factor is 2 for 4 loads
 Time to wash, dry, and fold one load is still the same (90 minutes)
Serial Execution versus Pipelining
 Consider a task that can be divided into k subtasks
 The k subtasks are executed on k different stages
 Each subtask requires one time unit
 The total execution time of the task is k time units
 Pipelining starts a new task before the previous one has finished
 The k stages work in parallel on k different tasks
 Tasks enter and leave the pipeline at the rate of one task per time unit

[Figure: subtask timelines without pipelining (tasks run one after another) and with pipelining (tasks overlap, offset by one time unit)]

Without Pipelining: one completion every k time units
With Pipelining: one completion every time unit
Pipeline Performance
In this subsection, we develop some simple measures of pipeline performance and
relative speedup (based on a discussion in [HWAN93]). The cycle time of an
instruction pipeline is the time needed to advance a set of instructions one stage
through the pipeline; each column in the following Figure represents one cycle
time. The cycle time can be determined as:
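The equation itself is left to the figure; following Stallings' treatment (which this slide is based on), with τi the delay of stage i, τm the delay of the slowest stage, and d the latch delay between stages, it can be written as:

\[
\tau = \max_{1 \le i \le k} [\tau_i] + d = \tau_m + d
\]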


Figure: Timing Diagram for Instruction Pipeline Operation


Pipeline Performance

The total time Tk,n required to execute n instructions on a k-stage pipeline with cycle time τ is Tk,n = [k + (n − 1)]τ. A total of k cycles are required to complete the execution of the first instruction, and the remaining n − 1 instructions require n − 1 cycles. This equation is easily verified from the Figure above: with k = 6 stages, the ninth instruction completes at time unit 14 = 6 + (9 − 1).
Pipeline Performance
Now consider a processor with equivalent functions but no pipeline, and assume that the instruction cycle time is kτ. The speedup factor for the instruction pipeline compared to
execution without the pipeline is defined as:
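Written out (the unpipelined time is T1 = nkτ and the pipelined time is the Tk,n = [k + (n − 1)]τ derived above, so this agrees with the ideal-speedup slide that follows):

\[
S_k = \frac{T_1}{T_{k,n}} = \frac{nk\tau}{[k + (n-1)]\tau} = \frac{nk}{k + n - 1}
\]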
Pipeline Performance Cont’d
 Let ti = time delay in stage Si
 Clock cycle t = max(ti) is the maximum stage delay
 Clock frequency f = 1/t = 1/max(ti)
 A pipeline can process n tasks in k + n – 1 cycles
 k cycles are needed to complete the first task
 n – 1 cycles are needed to complete the remaining n – 1 tasks

 Ideal speedup of a k-stage pipeline over serial execution:

Sk = (Serial execution in cycles) / (Pipelined execution in cycles) = nk / (k + n – 1)

Sk → k for large n
Single-Cycle vs Pipelined Performance
 Consider a 5-stage instruction execution in which …
 Instruction fetch = ALU operation = Data memory access = 200 ps
 Register read = register write = 150 ps
 What is the single-cycle non-pipelined time?
 What is the pipelined cycle time?
 What is the speedup factor for pipelined execution?
 Solution
Non-pipelined cycle = 200 + 150 + 200 + 200 + 150 = 900 ps
[Figure: two instructions executed one after another, each occupying a single 900 ps cycle (IF, Reg, ALU, MEM, Reg)]
Single-Cycle versus Pipelined – cont’d
 Pipelined cycle time = max(200, 150) = 200 ps
[Figure: pipelined timing diagram; a new instruction enters every 200 ps, and the IF, Reg, ALU, MEM, Reg stages of successive instructions overlap]

 CPI for pipelined execution = 1


 One instruction completes each cycle (ignoring pipeline fill)
 Speedup of pipelined execution = 900 ps / 200 ps = 4.5
 Instruction count and CPI are equal in both cases
 Speedup factor is less than 5 (the number of pipeline stages)
 Because the pipeline stages are not balanced
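A short Python sketch of the arithmetic in this example (the stage latencies are the ones given on the slide; the function names are only illustrative):

```python
# Stage latencies in picoseconds, as given on the slide.
STAGE_PS = {"IF": 200, "Reg read": 150, "ALU": 200, "MEM": 200, "Reg write": 150}

def single_cycle_time(stages):
    """Non-pipelined: one long cycle must cover every stage."""
    return sum(stages.values())

def pipelined_cycle_time(stages):
    """Pipelined: the clock must accommodate the slowest stage."""
    return max(stages.values())

if __name__ == "__main__":
    t_single = single_cycle_time(STAGE_PS)    # 900 ps
    t_pipe = pipelined_cycle_time(STAGE_PS)   # 200 ps
    print(f"single-cycle time = {t_single} ps")
    print(f"pipelined cycle   = {t_pipe} ps")
    print(f"speedup           = {t_single / t_pipe}")  # 4.5, not 5, because stages are unbalanced
```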
INSTRUCTION PIPELINING

Instruction pipelining is similar to the use of an assembly line in a manufacturing plant. An


assembly line takes advantage of the fact that a product goes through various stages of
production. By laying the production process out in an assembly line, products at various
stages can be worked on simultaneously. This process is also referred to as pipelining,
because, as in a pipeline, new inputs are accepted at one end before previously accepted
inputs appear as outputs at the other end.
To apply this concept to instruction execution, we must recognize that, in fact, an instruction has a number of stages. Figure 1, for example, breaks the instruction cycle up into 10 tasks, which occur in sequence. Clearly, there should be some opportunity for pipelining.

Figure 1 Instruction Cycle State Diagram


Pipeline cycling

Six-stage pipelining:

To gain more speedup, the pipeline must have more stages.

Let us consider the following decomposition of the instruction processing.

Figure: Six-Stage CPU Instruction Pipeline

1) Fetch instruction (FI): Read the next expected instruction into a buffer.
2) Decode instruction (DI): Determine the opcode and the operand specifiers.
3) Calculate operands (CO): Calculate the effective address of each source operand. This may involve displacement, register indirect, indirect, or other forms of address calculation.
4) Fetch operands (FO): Fetch each operand from memory. Operands in registers need not be fetched.
5) Execute instruction (EI): Perform the indicated operation and store the result, if any, in the specified destination operand location.
6) Write operand (WO): Store the result in memory.
Figure 2 shows that a six-stage pipeline can reduce the execution time for 9 instructions from 54
time units to 14 time units.

Figure 2: Timing Diagram for Instruction Pipeline Operation


Example
Consider a nonpipelined machine with 6 execution stages of lengths 50 ns, 50 ns, 60 ns, 60 ns, 50 ns, and 50 ns.
- Find the instruction latency on this machine.
- How much time does it take to execute 100 instructions?


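A worked sketch of the answer, assuming (as the later speedup slide does) that a non-pipelined instruction simply passes through all six stages back to back:

\[
\text{Instruction latency} = 50 + 50 + 60 + 60 + 50 + 50 = 320 \text{ ns}, \qquad
\text{Time for 100 instructions} = 100 \times 320 = 32{,}000 \text{ ns}
\]

These are the same 320 ns and 32,000 ns figures used in the speedup calculation a few slides later.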
Example
Suppose we introduce pipelining on this machine. Assume that when
introducing pipelining, the clock skew adds 5ns of overhead to each execution
stage.
- What is the instruction latency on the pipelined machine?
- How much time does it take to execute 100 instructions?

Solution
Remember that in the pipelined implementation, the lengths of the pipe stages must all be the same, i.e., the speed of the slowest stage plus overhead. With 5 ns of overhead this comes to:

Length of a pipelined stage = MAX(lengths of un-pipelined stages) + overhead = 60 + 5 = 65 ns
Instruction latency = 6 stages × 65 ns = 390 ns
Time to execute 100 instructions = 65*6*1 + 65*1*99 = 390 + 6435 = 6825 ns
Cont’d
Problem:
What is the speedup obtained from pipelining?
Solution:
Speedup is the ratio of the average instruction time without pipelining to the
average instruction time with pipelining.

Average instruction time not pipelined = 320 ns


Average instruction time pipelined = 65 ns
Speedup for 100 instructions = 32000 / 6825 = 4.69
Pipeline: Enhancing Performance

-We observed that we can obtain better performance in executing instructions if a single cycle accomplishes multiple operations.
-We added a shift register after the ALU to improve the performance of the multiplication and division algorithms.
-Not all operations will need both the ALU and the SHIFT register.

-But all instructions do have some things in common:
Instruction fetch
Decode
Operand fetch (if operands)
PIPELINES Parallel Level 1
-Pipelines are an architectural enhancement that improves the performance of a computer system without increasing the clock speed.
-Pipelines take advantage of the parallelism inherent in the instruction execution process and in the instructions themselves.
-Based on the observation that all instructions go through the same set of phases (stages) in the process of being executed (one possible example):
INSTRUCTION FETCH
INSTRUCTION DECODE
OPERAND FETCH
EXECUTION
RESULT STORE (not all architectures)
PARALLEL STAGES
-Do each stage in parallel, with different instructions in each stage
-Instructions “flow” from one stage to the next
PIPELINE PERFORMANCE

k: stages in the pipeline
t: time to do each stage (in practice the slowest stage constrains performance)
n: number of instructions to be processed

Time required to complete the processing on a non-pipelined processor (CPU cycle = kt): n(kt)

Time required for a pipelined processor:
1st instruction (pipeline fill): kt
All after the 1st (n – 1 more): (n – 1)t
Total time:
= kt + (n-1)t
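A small Python sketch of these two timing formulas (the variable names follow the slide; the numbers in the example loop are arbitrary):

```python
def non_pipelined_time(n, k, t):
    """Non-pipelined model from the slide: every instruction takes a full k*t cycle."""
    return n * (k * t)

def pipelined_time(n, k, t):
    """k*t to push the first instruction through, then one completion every t."""
    return k * t + (n - 1) * t

if __name__ == "__main__":
    k, t = 5, 1.0                      # 5 stages, unit stage time (arbitrary choice)
    for n in (1, 10, 100, 1000):
        s = non_pipelined_time(n, k, t) / pipelined_time(n, k, t)
        print(f"n = {n:4d}  speedup = {s:.2f}")   # tends toward k = 5 as n grows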
PIPELINE SPEEDUP
PIPELINE EFFICIENCY
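Written out from the timing expressions above (η here denotes pipeline efficiency, i.e., speedup divided by the number of stages; the closed form for the speedup agrees with the ideal-speedup slide earlier):

\[
S(k, n) = \frac{n(kt)}{kt + (n-1)t} = \frac{nk}{k + n - 1},
\qquad
\eta(k, n) = \frac{S(k, n)}{k} = \frac{n}{k + n - 1}
\]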
Taxonomy of Parallel Processor
Architectures
Parallel Speedup
 Parallel speedup is defined as the ratio of the time required to compute some function using a single processor (T1) to the time required to compute it using P processors (TP). That is: speedup = T1/TP.
 For example, if it takes 10 seconds to run a program sequentially and 2 seconds to run it in parallel on some number of processors, P, then the speedup is 10/2 = 5 times.
Speedup Formula

Speedup = Sequential execution time / Parallel execution time
Parallel Efficiency
Parallel efficiency measures how effectively the parallel processors are being used. For P processors, it is defined as: efficiency = 1/P × speedup = 1/P × T1/TP. For example, continuing with the same example, if P is 10 processors and the speedup is 5 times, then the parallel efficiency is 5/10 = 0.5. In other words, on average, only half of the processors were used to gain the speedup and the other half were idle.
Efficiency Formula

Efficiency = Sequential execution time / (Processors × Parallel execution time)

Efficiency = Speedup / Processors
Parallel Processing and Amdahl’s Law
When considering system performance, computer system designers look for ways to improve performance through improvements in technology or changes in design. Examples include the use of parallel processors, the use of a memory cache hierarchy, and speedup in memory access time and I/O transfer rate due to technology improvements.

In all of these cases, it is important to note that a speedup in one aspect of the technology or design does not necessarily result in a corresponding improvement in performance. This limitation is succinctly expressed by Amdahl’s law.

Amdahl’s Law: A small number of sequential operations can effectively limit the speedup of a parallel algorithm.
 Let f be the fraction of operations in a computation that must be performed sequentially, where 0 < f < 1. Then the maximum speedup S achievable by a parallel computer with p processors performing the computation is S ≤ 1 / [f + (1 − f)/p]. For example, if 10% of the computation must be performed sequentially, then the maximum speedup achievable is 10, no matter how many processors a parallel computer has.
Maximum Speedup in Amdahl’s Law
Amdahl’s law states that the maximum speedup possible in parallelizing an algorithm is limited by the sequential portion of the code.

 Given an algorithm which is P% parallel, Amdahl’s law states that: Maximum Speedup = 1/(1 − (P/100)).
For example if 80% of a program is parallel, then the
maximum speedup is 1/(1-0.8)=1/.2=5 times.
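A tiny Python sketch of both ways the law is stated on these slides (the function name is only illustrative):

```python
def amdahl_speedup(f_serial, p):
    """Maximum speedup with p processors when a fraction f_serial of the
    work must run sequentially (the rest parallelizes perfectly)."""
    return 1.0 / (f_serial + (1.0 - f_serial) / p)

if __name__ == "__main__":
    # 10% sequential: the bound is 10 no matter how many processors are added.
    for p in (2, 8, 64, 1_000_000):
        print(p, round(amdahl_speedup(0.10, p), 2))
    # 80% parallel (i.e. 20% sequential): the limit is 1 / 0.2 = 5.
    print("limit for 80% parallel:", round(1 / (1 - 0.80), 2))
```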
Parallel Processing and Amdahl’s Law
Amdahl’s law was first proposed by Gene Amdahl in [AMDA67] and deals with the potential speedup of a program using multiple processors compared to a single processor. Consider a program running on a single processor such that a fraction (1 − f) of the execution time involves code that is inherently serial and a fraction f involves code that is infinitely parallelizable with no scheduling overhead.
Let T be the total execution time of the program using a single processor. Then the speedup
using a parallel processor with N processors that fully exploits the parallel portion of the
program is as follows:
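Written out (T is the single-processor execution time; the form below is consistent with the two conclusions listed after it):

\[
\text{Speedup} = \frac{T}{(1-f)\,T + \dfrac{f\,T}{N}} = \frac{1}{(1-f) + \dfrac{f}{N}}
\]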
Two important conclusions can be drawn:

1. When f is small, the use of parallel processors has little effect.

2. As N approaches infinity, speedup is bound by 1/(1-f), so that there are diminishing returns
for using more processors.
Parallel Processing and Amdahl’s Law
Amdahl’s law illustrates the problems facing industry in the development of multicore
machines with an ever-growing number of cores: The software that runs on such machines
must be adapted to a highly parallel execution environment to exploit the power of parallel
processing.
Amdahl’s law can be generalized to evaluate any design or technical improvement in a
computer system. Consider any enhancement to a feature of a system that results in a
speedup. The speedup can be expressed as
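The usual way to write this (performance being the reciprocal of execution time, and consistent with the overall-speedup formula on the next slide) is:

\[
\text{Speedup} = \frac{\text{Performance after enhancement}}{\text{Performance before enhancement}}
= \frac{\text{Execution time before enhancement}}{\text{Execution time after enhancement}}
\]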

Suppose that a feature of the system is used during execution a fraction of the time f,
before enhancement, and that the speedup of that feature after enhancement is SUf . Then
the overall speedup of the system is:
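Written out (a sketch that matches the 1/(1 − f) limiting behaviour used in the examples that follow):

\[
\text{Speedup} = \frac{1}{(1 - f) + \dfrac{f}{SU_f}}
\]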

Parallel Processing and Amdahl’s Law (Summary)
[Slides of worked Amdahl’s Law examples and questions; figures not reproduced in this text]
Parallel Processing and Amdahl’s Law
Example: suppose that a task makes extensive use of floating-point operations, with 40% of the time consumed by floating-point operations. With a new hardware design, the floating-point module is sped up by a factor of K. Then
the overall speedup is:
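Substituting f = 0.4 and SU_f = K into the generalized formula above:

\[
\text{Speedup} = \frac{1}{(1 - 0.4) + \dfrac{0.4}{K}} = \frac{1}{0.6 + \dfrac{0.4}{K}}
\;\longrightarrow\; \frac{1}{0.6} \approx 1.67 \quad \text{as } K \to \infty
\]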

Thus, no matter how large K is, the maximum overall speedup is 1/0.6 ≈ 1.67.


Parallel Processing and Amdahl’s Law
Problem:
Suppose you have a system that contains a special processor for doing floating-point operations. You have determined that 50% of your computations can use the floating-point processor, and the speedup of the floating-point processor is 15. What is the overall speedup?

Solution
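A sketch of the expected answer, applying the same generalized Amdahl formula with f = 0.5 and SU_f = 15:

\[
\text{Speedup} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{15}} = \frac{1}{0.5 + 0.033} \approx 1.88
\]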
Parallel Processing and Amdahl’s Law
Problems and solutions
EFFECTIVENESS OF PARALLEL PROCESSING

There are certain measures to compare the effectiveness of various parallel algorithms or
architectures for solving desired problems. The following definitions and notations are
applicable [Lee80]:
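As a sketch of the intended notation (it matches the tree-sum example at the end of this deck and Parhami's presentation of Lee's measures; p is the number of processors):

W(p): total number of unit computational steps performed by the p processors
T(p): execution time with p processors, so that T(1) = W(1)
S(p) = T(1) / T(p): speedup
E(p) = T(1) / (p T(p)): efficiency
R(p) = W(p) / W(1): redundancy
U(p) = W(p) / (p T(p)): utilization
Q(p) = T(1)^3 / (p T(p)^2 W(p)): quality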

The significance of each measure is self-evident from its name and defining equation given above. It is not difficult to
establish the following relationships between these parameters.
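A sketch of those relationships, derived directly from the definitions above rather than quoted:

\[
1 \le S(p) \le p, \qquad U(p) = R(p)\,E(p), \qquad
\frac{1}{p} \le E(p) \le U(p) \le 1, \qquad
Q(p) = \frac{E(p)\,S(p)}{R(p)} \le S(p) \le p
\]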
EFFECTIVENESS OF PARALLEL PROCESSING Cont’d
EFFECTIVENESS OF PARALLEL PROCESSING Cont’d
Example
Example: Finding the sum of 16 numbers can be represented by the binary-tree computation graph in the Figure mentioned below, with T(1) = W(1) = 15. Assume unit-time additions and ignore all else.

With p = 8 processors, we have


W(8) = 15 T(8) = 4 E(8) = 15/(8 * 4) = 47%
S(8) = 15/4 = 3.75 R(8) = 15/15 = 1 Q(8) = 1.76

Essentially, the 8 processors perform all of the additions at the same tree level in
each time unit, beginning with the leaf nodes and ending at the root. The relatively
low efficiency is the result of limited parallelism near the root of the tree.
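A quick Python check of the numbers quoted above, using the definitions sketched earlier (the quality formula used here is the assumed one):

```python
# Binary-tree summation of 16 numbers on p = 8 processors (unit-time additions).
T1 = W1 = 15                      # 15 additions done one after another on a single processor
p, T8, W8 = 8, 4, 15              # 4 tree levels; still 15 additions in total

S = T1 / T8                       # speedup     = 3.75
E = T1 / (p * T8)                 # efficiency  ~ 0.47
R = W8 / W1                       # redundancy  = 1.0
Q = T1**3 / (p * T8**2 * W8)      # quality     ~ 1.76
print(S, round(E, 2), R, round(Q, 2))
```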
Thank You
