Lect27 Parallel Processing
Parallel Processing
Simultaneous data-processing tasks for the purpose of increasing the computational speed.
Perform concurrent data processing to achieve faster execution time.
Multiple Functional Units :
Separate the execution unit into eight functional units operating in parallel, placed between the processor registers and memory:
» Adder-subtractor
» Integer multiply
» Logic unit
» Shift unit
» Incrementer
» Floating-point add-subtract
» Floating-point multiply
» Floating-point divide
Pipelining : the process of decomposing a sequential process into suboperations, with each suboperation executed in a special dedicated segment that operates concurrently with all other segments.
A pipeline is a collection of processing segments through which binary information flows. Each segment performs partial processing dictated by the way the task is partitioned.
Pipelining
Example : multiply and add ( Fig. 9-2 )
Operation : Ai * Bi + Ci ( for i = 1, 2, …, 7 )
Separated into 3 suboperation segments :
» 1) R1 ← Ai, R2 ← Bi : Input Ai and Bi
» 2) R3 ← R1 * R2, R4 ← Ci : Multiply and input Ci
» 3) R5 ← R3 + R4 : Add Ci to the product
Content of registers in pipeline example : Tab. 9-1
Fig. 9-2 : Ai and Bi load R1 and R2; the multiplier output loads R3 while Ci loads R4; the adder output loads R5.
Clock pulse   Segment 1     Segment 2        Segment 3
number        R1    R2      R3      R4       R5
1             A1    B1      -       -        -
2             A2    B2      A1*B1   C1       -
3             A3    B3      A2*B2   C2       A1*B1+C1
4             A4    B4      A3*B3   C3       A2*B2+C2
5             A5    B5      A4*B4   C4       A3*B3+C3
6             A6    B6      A5*B5   C5       A4*B4+C4
7             A7    B7      A6*B6   C6       A5*B5+C5
8             -     -       A7*B7   C7       A6*B6+C6
9             -     -       -       -        A7*B7+C7
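The register contents in Tab. 9-1 can be reproduced with a small simulation. The following is only a sketch: the function name run_pipeline and the use of None for an empty register are assumptions made for illustration, not part of the lecture.

```python
def run_pipeline(A, B, C):
    """Simulate the 3-segment multiply-add pipeline, one clock pulse per loop pass."""
    n = len(A)
    R1 = R2 = R3 = R4 = R5 = None   # None stands for an empty register ("-")
    results = []
    for pulse in range(1, n + 3):   # k + n - 1 pulses with k = 3 segments
        # All registers load on the same clock pulse, so every new value
        # is computed from the register contents of the previous pulse.
        new_R5 = R3 + R4 if R3 is not None else None           # segment 3: add
        new_R3 = R1 * R2 if R1 is not None else None           # segment 2: multiply
        new_R4 = C[pulse - 2] if 2 <= pulse <= n + 1 else None  # segment 2: input Ci
        new_R1 = A[pulse - 1] if pulse <= n else None           # segment 1: input Ai
        new_R2 = B[pulse - 1] if pulse <= n else None           # segment 1: input Bi
        R1, R2, R3, R4, R5 = new_R1, new_R2, new_R3, new_R4, new_R5
        if R5 is not None:
            results.append(R5)
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3], [4, 5, 6], [7, 8, 9]))
```

The first result appears at clock pulse 3, and one further result leaves the pipeline on every pulse after that, as in the table.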
General considerations
4-segment pipeline : each operand passes through all four segments in a fixed sequence. Each segment consists of a combinational circuit Si that performs a suboperation on the data stream. The segments are separated by registers Ri that hold the intermediate results, and all registers are driven by a common clock.
Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4
Space-time diagram ( 6 tasks T1 to T6, 4 segments ) :
Clock cycle   1    2    3    4    5    6    7    8    9
Segment 1     T1   T2   T3   T4   T5   T6   -    -    -
Segment 2     -    T1   T2   T3   T4   T5   T6   -    -
Segment 3     -    -    T1   T2   T3   T4   T5   T6   -
Segment 4     -    -    -    T1   T2   T3   T4   T5   T6
Speedup S : Nonpipeline / Pipeline
S = n • tn / ( ( k + n - 1 ) • tp ) = 6 • 6 tp / ( ( 4 + 6 - 1 ) • tp ) = 36 tp / 9 tp = 4
» n : number of tasks ( 6 )
» tn : time to complete each task in nonpipeline ( 6 cycle times = 6 tp )
» tp : clock cycle time ( 1 clock cycle )
» k : number of segments ( 4 )
» k + n - 1 : clock cycles needed to complete n tasks in the pipeline
If n → ∞ , S = tn / tp
If we assume that the time it takes to process a task is the same in the
pipeline and nonpipeline circuits then we have
nonpipeline ( tn ) = pipeline ( k • tp )
S = tn / tp = k • tp / tp = k
Where k is the number of segments.
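The speedup formula can be checked numerically. This is a minimal sketch; the function name speedup is chosen here for illustration only.

```python
def speedup(n, k, tn, tp):
    """Speedup of a k-segment pipeline over a nonpipelined unit for n tasks.

    tn: time per task without pipelining, tp: pipeline clock cycle time.
    """
    return (n * tn) / ((k + n - 1) * tp)


if __name__ == "__main__":
    # Values from the example: n = 6 tasks, k = 4 segments, tn = 6 tp.
    print(speedup(6, 4, 6, 1))
    # As n grows, the speedup approaches tn / tp.
    print(speedup(1_000_000, 4, 6, 1))
```

With tn = k • tp, the same function approaches k for large n, which is the theoretical maximum stated above.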
Arithmetic Pipeline
Floating-point Adder Pipeline Example :
Add / subtract two normalized floating-point binary numbers ( decimal values are used in the example for simplicity )
» X = A × 2^a = 0.9504 × 10^3
» Y = B × 2^b = 0.8200 × 10^2
4 segment suboperations
» 1) Compare exponents by subtraction : 3 - 2 = 1
X = 0.9504 × 10^3
Y = 0.8200 × 10^2
» 2) Align mantissas
X = 0.9504 × 10^3
Y = 0.08200 × 10^3
» 3) Add mantissas
Z = 1.0324 × 10^3
» 4) Normalize result
Z = 0.10324 × 10^4
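Figure : four-segment floating-point adder pipeline. Inputs are the exponents a, b and the mantissas A, B; registers R separate the segments.
Segment 1 : Compare exponents by subtraction ( produces the difference )
Segment 2 : Align mantissas
Segment 3 : Add or subtract mantissas
Segment 4 : Adjust exponent and normalize result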
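The four suboperations can be traced in code. The sketch below works on decimal mantissas, as in the worked example, and uses Python's decimal module to avoid rounding noise; the function name fp_add and the (mantissa, exponent) pair representation are assumptions made for illustration.

```python
from decimal import Decimal


def fp_add(a_mant, a_exp, b_mant, b_exp):
    """Add two decimal floating-point numbers given as mantissa and exponent."""
    # Segment 1: compare exponents by subtraction.
    diff = a_exp - b_exp
    # Segment 2: align mantissas by shifting the one with the smaller exponent right.
    if diff >= 0:
        b_mant = b_mant.scaleb(-diff)
        exp = a_exp
    else:
        a_mant = a_mant.scaleb(diff)
        exp = b_exp
    # Segment 3: add mantissas.
    z = a_mant + b_mant
    # Segment 4: normalize so that 0.1 <= |z| < 1, adjusting the exponent.
    while abs(z) >= 1:
        z = z.scaleb(-1)
        exp += 1
    while z != 0 and abs(z) < Decimal("0.1"):
        z = z.scaleb(1)
        exp -= 1
    return z, exp


if __name__ == "__main__":
    mant, exp = fp_add(Decimal("0.9504"), 3, Decimal("0.8200"), 2)
    print(mant, exp)
```

In hardware each of the four commented steps is one pipeline segment, so a new pair of operands can enter segment 1 while earlier pairs are still in segments 2 to 4.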
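Instruction Pipeline
Instruction Cycle ( flowchart ) :
Segment 1 : Fetch instruction
Segment 2 : Decode instruction and calculate the effective address
» Branch ? If yes, update PC and empty the pipe
Segment 3 : Fetch operand from memory
Segment 4 : Execute instruction
» Interrupt ? If yes, perform interrupt handling, then update PC and empty the pipe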
Example : Four-segment Instruction Pipeline
Four-segment CPU pipeline :
» 1) FI : Instruction Fetch
» 2) DA : Decode Instruction & calculate EA ( effective address )
» 3) FO : Operand Fetch
» 4) EX : Execution
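Timing of Instruction Pipeline ( instruction 3 is a branch ) :
Step            1    2    3    4    5    6    7    8    9    10   11   12   13
Instruction 1   FI   DA   FO   EX   -    -    -    -    -    -    -    -    -
            2   -    FI   DA   FO   EX   -    -    -    -    -    -    -    -
   (Branch) 3   -    -    FI   DA   FO   EX   -    -    -    -    -    -    -
            4   -    -    -    FI   -    -    FI   DA   FO   EX   -    -    -
            5   -    -    -    -    -    -    -    FI   DA   FO   EX   -    -
            6   -    -    -    -    -    -    -    -    FI   DA   FO   EX   -
            7   -    -    -    -    -    -    -    -    -    FI   DA   FO   EX
No branch : a new instruction is fetched in every step.
Branch : the fetch of instruction 4 in step 4 is discarded, and instruction 4 is fetched again in step 7, after the branch has executed.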
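The timing diagram can be generated programmatically. The sketch below assumes, as the diagram suggests, that a branch is resolved at the end of its EX step and that the instruction fetched right behind it is discarded and fetched again; the function name timing and the dictionary layout are illustrative assumptions.

```python
STAGES = ["FI", "DA", "FO", "EX"]


def timing(num_instr, branch_at=None):
    """Return {instruction number: {step: stage}} for a four-segment pipeline.

    branch_at is the number of the branch instruction, or None for no branch.
    """
    rows = {}
    for i in range(1, num_instr + 1):
        row = {}
        start = i  # without a branch, instruction i is fetched in step i
        if branch_at is not None and i > branch_at:
            if i == branch_at + 1:
                row[i] = "FI"  # fetched behind the branch, then discarded
            # Fetching resumes only after the branch has finished its EX step.
            start = i + len(STAGES) - 1
        for offset, stage in enumerate(STAGES):
            row[start + offset] = stage
        rows[i] = row
    return rows


if __name__ == "__main__":
    for instr, row in timing(7, branch_at=3).items():
        print(instr, " ".join(f"{step}:{stage}" for step, stage in sorted(row.items())))
```

With seven instructions and a branch at instruction 3, the last instruction completes in step 13 instead of step 10, so the branch costs three steps in this model.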
Pipeline Conflicts : 3 major difficulties
1) Resource conflicts
» Caused by access to memory by two segments at the same time.
» Can be avoided by using separate instruction and data memories.
2) Data dependency
» Arises when an instruction depends on the result of a previous instruction, but this result is not yet available.
3) Branch difficulties
» Arise from branch and other instructions ( interrupt, return, .. ) that change the value of PC.
Solutions to Data Dependency
Hardware methods
» Hardware Interlock
The hardware forcibly inserts delays until the result of the previous instruction is available.
» Operand Forwarding
The result of the previous instruction is passed directly to the ALU ( normally it would pass through a register ).
Software methods
» Delayed Load
No-operation instructions are inserted until the result of the previous instruction is available.
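Delayed load can be illustrated as a compiler pass that inserts no-operation instructions. This is only a sketch under stated assumptions: the instruction format (opcode, destination register, source registers), the opcode names, and the default load delay of one instruction slot are hypothetical and not taken from the lecture.

```python
NOP = ("NOP", None, ())


def insert_nops(program, delay=1):
    """Insert NOPs so that no instruction reads a register within `delay` slots of its LOAD.

    Each instruction is a tuple (opcode, destination register, source registers).
    """
    out = []
    for op, dest, srcs in program:
        # Look back over the most recently emitted instructions, nearest first.
        for back in range(1, delay + 1):
            if back > len(out):
                break
            prev_op, prev_dest, _ = out[-back]
            if prev_op == "LOAD" and prev_dest in srcs:
                # Pad with enough NOPs to move this instruction out of the delay window.
                out.extend([NOP] * (delay - back + 1))
                break
        out.append((op, dest, srcs))
    return out


if __name__ == "__main__":
    program = [
        ("LOAD", "R1", ()),            # R1 <- M[A]
        ("LOAD", "R2", ()),            # R2 <- M[B]
        ("ADD", "R3", ("R1", "R2")),   # R3 <- R1 + R2, needs R2 right after its load
        ("STORE", None, ("R3",)),      # M[C] <- R3
    ]
    for instr in insert_nops(program):
        print(instr[0])
```

A hardware interlock achieves the same effect by stalling the pipeline at run time, while operand forwarding removes the need for the delay whenever the result can be routed straight to the ALU.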
Assignment
What is meant by pipelining and parallel processing?
Explain vector processing.