CO Mod5 SB
5.1 Pipelining
• Pipelining is a technique of decomposing a sequential process into
suboperations, with each subprocess being executed in a special dedicated
segment that operates concurrently with all other segments.
• The overlapping of computation is made possible by associating a register
with each segment in the pipeline.
• The registers provide isolation between each segment so that each can operate
on distinct data simultaneously.
• Perhaps the simplest way of viewing the pipeline structure is to imagine that
each segment consists of an input register followed by a combinational
circuit.
o The register holds the data.
o The combinational circuit performs the suboperation in the particular
segment.
• A clock is applied to all registers after enough time has elapsed to perform all
segment activity.
• The pipeline organization will be demonstrated by means of a simple
example.
o To perform the combined multiply and add operations with a stream of
numbers
Ai * Bi + Ci for i = 1, 2, 3, …, 7
• Each suboperation is to be implemented in a segment within a pipeline.
R1 ← Ai, R2 ← Bi        Input Ai and Bi
R3 ← R1 * R2, R4 ← Ci   Multiply and input Ci
R5 ← R3 + R4            Add Ci to product
• Each segment has one or two registers and a combinational circuit as shown in
Fig. 4-1.
• The five registers are loaded with new data every clock pulse. The effect of
each clock is shown in Table 4-1.
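The clock-by-clock behavior summarized in Table 4-1 can be sketched in Python. This is a minimal simulation (the data values are illustrative, not from the text): all registers are clocked simultaneously, so each new register value is computed from the values latched on the previous clock pulse.

```python
# Minimal sketch of the three-segment pipeline computing Ai*Bi + Ci.
# R1..R5 model the five interface registers; all are reloaded together
# on each clock pulse, so new values come from previously latched ones.
def pipeline_multiply_add(A, B, C):
    n = len(A)
    R1 = R2 = R3 = R4 = None
    results = []                       # successive contents of R5
    for clock in range(n + 2):         # n items + 2 clocks to drain
        new_R5 = R3 + R4 if R3 is not None else None    # segment 3
        if R1 is not None:                              # segment 2
            new_R3, new_R4 = R1 * R2, C[clock - 1]
        else:
            new_R3 = new_R4 = None
        if clock < n:                                   # segment 1
            new_R1, new_R2 = A[clock], B[clock]
        else:
            new_R1 = new_R2 = None
        R1, R2, R3, R4 = new_R1, new_R2, new_R3, new_R4
        if new_R5 is not None:
            results.append(new_R5)
    return results

A = [1, 2, 3, 4, 5, 6, 7]
B = [10, 20, 30, 40, 50, 60, 70]
C = [5, 5, 5, 5, 5, 5, 5]
print(pipeline_multiply_add(A, B, C))  # one Ai*Bi + Ci per clock after the fill
```

After the two-clock fill time, one completed result leaves segment 3 on every clock pulse, which is the point of the overlap.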
Fig 4-1: Example of pipeline processing
Table 4-1: Content of Registers in Pipeline Example
General Considerations
• Any operation that can be decomposed into a sequence of suboperations of
about the same complexity can be implemented by a pipeline processor.
• The general structure of a four-segment pipeline is illustrated in Fig. 4-2.
• We define a task as the total operation performed going through all the
segments in the pipeline.
• The behavior of a pipeline can be illustrated with a space-time diagram.
o It shows the segment utilization as a function of time.
Fig 4-2: Four Segment Pipeline
• The space-time diagram of a four-segment pipeline is demonstrated in Fig. 4-3.
• Consider a k-segment pipeline with a clock cycle time tp used to execute n tasks.
o The first task T1 requires a time equal to k*tp to complete its operation.
o The remaining n-1 tasks will be completed after a time equal to (n-1)tp.
o Therefore, to complete n tasks using a k-segment pipeline requires k+(n-1)
clock cycles.
• Consider a nonpipeline unit that performs the same operation and takes a time
equal to tn to complete each task.
o The total time required for n tasks is ntn.
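Putting the two timing results together gives the speedup ratio n*tn / ((k+n-1)*tp). A short sketch with illustrative numbers (not from the text):

```python
# Speedup of a k-segment pipeline over a nonpipelined unit, following
# the timing argument above.
def pipeline_speedup(k, tp, tn, n):
    pipeline_time = (k + n - 1) * tp    # k + (n - 1) clock cycles
    nonpipeline_time = n * tn           # n tasks at tn each
    return nonpipeline_time / pipeline_time

# With tn = k*tp (segments of equal delay), the speedup approaches k
# as the number of tasks n grows large.
print(pipeline_speedup(k=4, tp=20, tn=80, n=5))     # modest for few tasks
print(pipeline_speedup(k=4, tp=20, tn=80, n=1000))  # close to 4
```

The second call shows the usual limiting behavior: for large n the (k-1)-cycle fill time is amortized and the speedup tends to tn/tp.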
Fig 4-3: Space-time diagram for pipeline
Fig 4-4: Multiple functional units in parallel
• There are various reasons why the pipeline cannot operate at its maximum
theoretical rate.
o Different segments may take different times to complete their
suboperation.
o It is not always correct to assume that a nonpipeline circuit has the same
time delay as that of an equivalent pipeline circuit.
• There are two areas of computer design where the pipeline organization is
applicable.
o Arithmetic pipeline
o Instruction pipeline
Fig 4-5: Processor with multiple functional units
SIMD
• Represents an organization that includes many processing units under the
supervision of a common control unit.
• All processors receive the same instruction from the control unit but operate on
different items of data.
• The shared memory unit must contain multiple modules so that it can
communicate with all the processors simultaneously.
MISD & MIMD
• MISD structure is only of theoretical interest since no practical system has been
constructed using this organization.
• MIMD organization refers to a computer system capable of processing several
programs at the same time. e.g. multiprocessor and multicomputer system
• Flynn’s classification depends on the distinction between the performance of the
control unit and the data-processing unit.
• It emphasizes the behavioral characteristics of the computer system rather than its
operational and structural interconnections.
• One type of parallel processing that does not fit Flynn’s classification is
pipelining.
• We consider parallel processing under the following main topics:
o Pipeline processing
Is an implementation technique where arithmetic suboperations or
the phases of a computer instruction cycle overlap in execution.
o Vector processing
Deals with computations involving large vectors and matrices.
o Array processing
Perform computations on large arrays of data.
o When an overflow occurs, the mantissa of the sum or difference is
shifted right and the exponent incremented by one.
o If an underflow occurs, the number of leading zeros in the mantissa
determines the number of left shifts in the mantissa and the number
that must be subtracted from the exponent.
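The overflow/underflow adjustment just described can be sketched as follows, assuming a base-10 fraction mantissa of the form 0.xxxx (the function name and representation are illustrative, not from the text):

```python
# Hedged sketch of the normalization step for a decimal floating-point
# result represented as (mantissa, exponent) with mantissa a fraction.
# Overflow: |mantissa| >= 1.0 -> shift right (divide by 10), exponent + 1.
# Underflow: leading zeros     -> shift left (multiply by 10), exponent - 1.
def normalize(mantissa, exponent):
    if mantissa == 0:
        return 0.0, 0                  # zero has no normalized form
    while abs(mantissa) >= 1.0:        # overflow of the fraction
        mantissa /= 10.0
        exponent += 1
    while abs(mantissa) < 0.1:         # leading zeros -> underflow
        mantissa *= 10.0
        exponent -= 1
    return mantissa, exponent

print(normalize(1.032, 2))    # one right shift, exponent incremented
print(normalize(0.0035, 5))   # two left shifts, exponent decremented twice
```

Real hardware does the same adjustment on binary mantissas with shifters rather than division, but the bookkeeping on the exponent is identical.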
• The following numerical example may clarify the suboperations performed in
each segment.
• The comparator, shifter, adder, subtractor, incrementer, and decrementer in the
floating-point pipeline are implemented with combinational circuits.
• Suppose that the time delays of the four segments are t1=60 ns, t2=70 ns,
t3=100 ns, t4=80 ns, and the interface registers have a delay of tr=10 ns.
o Pipeline floating-point arithmetic delay: tp = t3 + tr = 110 ns
o Nonpipeline floating-point arithmetic delay: tn = t1 + t2 + t3 + t4 + tr = 320 ns
o Speedup: 320/110 = 2.9
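The arithmetic above can be reproduced directly; note that the pipeline clock period is set by the slowest segment plus the register delay:

```python
# Reproducing the delay figures from the text.
t1, t2, t3, t4, tr = 60, 70, 100, 80, 10   # segment and register delays, ns

tp = max(t1, t2, t3, t4) + tr   # clock cycle: slowest segment + register
tn = t1 + t2 + t3 + t4 + tr     # one pass through a nonpipelined circuit

print(tp)                 # 110
print(tn)                 # 320
print(round(tn / tp, 1))  # 2.9
```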
Fig 4-6: Pipeline for floating point addition and subtraction
o Calculate the effective address.
o Fetch the operands from memory.
o Execute the instruction.
o Store the result in the proper place.
• There are certain difficulties that will prevent the instruction pipeline from
operating at its maximum rate.
o Different segments may take different times to operate on the
incoming information.
o Some segments are skipped for certain operations.
o Two or more segments may require memory access at the same time,
causing one segment to wait until another is finished with the memory.
Fig 4-7: Four-segment CPU pipeline
Fig 4-8: Timing of Instruction Pipeline
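The overlapped timing of Fig. 4-8 can be sketched as a small table-building function. The segment names FI (fetch instruction), DA (decode and calculate address), FO (fetch operand), and EX (execute) are the conventional labels for a four-segment CPU pipeline; the helper itself is illustrative, not from the text:

```python
# Ideal (conflict-free) timing of a four-segment instruction pipeline:
# instruction i enters segment s one clock after instruction i-1 did.
SEGMENTS = ["FI", "DA", "FO", "EX"]

def timing(n_instructions):
    """Return {instruction: {segment: clock_cycle}} for an ideal pipeline."""
    table = {}
    for i in range(n_instructions):
        # instruction i (0-based) occupies segment s at clock i + s + 1
        table[i + 1] = {seg: i + s + 1 for s, seg in enumerate(SEGMENTS)}
    return table

print(timing(3)[1])        # instruction 1: FI at clock 1 ... EX at clock 4
print(timing(3)[3]["EX"])  # last instruction completes at clock k+n-1 = 6
```

The last print illustrates the earlier result: n instructions finish in k+(n-1) clock cycles.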
Pipeline Conflicts
• In general, there are three major difficulties that cause the instruction pipeline
to deviate from its normal operation.
o Resource conflicts caused by access to memory by two segments at the
same time.
Can be resolved by using separate instruction and data
memories
o Data dependency conflicts arise when an instruction depends on the
result of a previous instruction, but this result is not yet available.
o Branch difficulties arise from branch and other instructions that change
the value of PC.
• A difficulty that may cause a degradation of performance in an instruction
pipeline is due to possible collision of data or address.
o A data dependency occurs when an instruction needs data that are not
yet available.
o An address dependency may occur when an operand address cannot be
calculated because the information needed by the addressing mode is
not available.
• Pipelined computers deal with such conflicts in a variety of ways.
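As a toy illustration of the data-dependency case described above (the register names and the helper function are hypothetical):

```python
# A data dependency between adjacent instructions: the second reads a
# register before the first has written its result back.
def has_data_dependency(instr1, instr2):
    """Each instruction is modeled as (dest_register, source_registers)."""
    dest1, _ = instr1
    _, sources2 = instr2
    return dest1 in sources2

# ADD R1 <- R2 + R3, followed by SUB R4 <- R1 - R5: R1 is not yet available
print(has_data_dependency(("R1", ["R2", "R3"]),
                          ("R4", ["R1", "R5"])))  # True
print(has_data_dependency(("R1", ["R2", "R3"]),
                          ("R4", ["R6", "R5"])))  # False
```

A real pipeline resolves a detected dependency by stalling, by operand forwarding, or by compiler instruction reordering.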
• Loop buffer: This is a small, very-high-speed register file maintained by the
instruction fetch segment of the pipeline.
• Branch prediction: A pipeline with branch prediction uses some additional logic
to guess the outcome of a conditional branch instruction before it is executed.
• Delayed branch: in this procedure, the compiler detects the branch instructions
and rearranges the machine language code sequence by inserting useful
instructions that keep the pipeline operating without interruptions.
o A procedure employed in most RISC processors.
o e.g. no-operation instruction
• To achieve the required level of high performance it is necessary to utilize the
fastest and most reliable hardware and apply innovative procedures from vector
and parallel processing techniques.
Vector Operations
• Many scientific problems require arithmetic operations on large arrays of
numbers.
• A vector is an ordered set of a one-dimensional array of data items.
• A vector V of length n is represented as a row vector by V = [V1, V2, …, Vn].
• To examine the difference between a conventional scalar processor and a vector
processor, consider the following Fortran DO loop:
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
• This is implemented in machine language by the following sequence of
operations.
Initialize I=0
20 Read A(I)
Read B(I)
Store C(I) = A(I)+B(I)
Increment I = I + 1
If I <= 100 go to 20
Continue
• A computer capable of vector processing eliminates the overhead associated with
the time it takes to fetch and execute the instructions in the program loop.
C(1:100) = A(1:100) + B(1:100)
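The contrast between the scalar loop and the single vector statement can be sketched in plain Python (illustrative only: a real vector processor issues one instruction for the whole operation, whereas the list comprehension here merely expresses the computation at once):

```python
A = list(range(1, 101))      # A(1) .. A(100)
B = list(range(101, 201))    # B(1) .. B(100)

# Scalar version: one add per iteration, with loop overhead (increment,
# compare, branch) on every pass, like the machine sequence above.
C_scalar = []
for i in range(100):
    C_scalar.append(A[i] + B[i])

# "Vector" version: the whole operation expressed at once, the way
# C(1:100) = A(1:100) + B(1:100) expresses it.
C_vector = [a + b for a, b in zip(A, B)]

print(C_scalar == C_vector)  # True
```

The result is identical; what the vector form removes is the per-element fetch/decode and loop-control overhead.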
• A possible instruction format for a vector instruction is shown in Fig. 4-11.
o This assumes that the vector operands reside in memory.
• It is also possible to design the processor with a large number of registers and
store all operands in registers prior to the addition operation.
o The base address and length in the vector instruction specify a group of
CPU registers.
Fig 4-11: Instruction format for vector processor
Matrix Multiplication
• The multiplication of two n x n matrices consists of n^2 inner products or n^3
multiply-add operations.
o Consider, for example, the multiplication of two 3 x 3 matrices A and B.
o c11= a11b11+ a12b21+ a13b31
o This requires three multiplications and (after initializing c11 to 0) three
additions.
• In general, the inner product consists of the sum of k product terms of the form
C = A1B1 + A2B2 + A3B3 + … + AkBk.
o In a typical application k may be equal to 100 or even 1000.
• The inner product calculation on a pipeline vector processor is shown in Fig. 4-12.
C = A1B1 + A5B5 + A9B9  + A13B13 + ⋯
  + A2B2 + A6B6 + A10B10 + A14B14 + ⋯
  + A3B3 + A7B7 + A11B11 + A15B15 + ⋯
  + A4B4 + A8B8 + A12B12 + A16B16 + ⋯
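The four-way grouping above (source term i feeds partial sum i mod 4, and the four partial sums are combined at the end) can be checked with a short sketch; the data values are illustrative:

```python
# Sketch of the four-way interleaving used by a four-segment adder
# pipeline: four partial sums accumulate in the pipeline, and are
# added together after the last pair enters.
def pipelined_inner_product(A, B, segments=4):
    partial = [0] * segments
    for i in range(len(A)):
        partial[i % segments] += A[i] * B[i]   # term i feeds segment i mod 4
    return sum(partial)

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [1, 1, 1, 1, 1, 1, 1, 1]
print(pipelined_inner_product(A, B))           # same as the direct inner product
print(pipelined_inner_product(A, B) ==
      sum(a * b for a, b in zip(A, B)))        # True
```

The grouping changes only the order of the additions, not the result, which is why the pipeline can keep all four adder segments busy.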
Fig 4-12: Pipeline for calculating an inner product
Memory Interleaving
• Pipeline and vector processors often require simultaneous access to memory from
two or more sources.
o An instruction pipeline may require the fetching of an instruction and an
operand at the same time from two different segments.
o An arithmetic pipeline usually requires two or more operands to enter the
pipeline at the same time.
• Instead of using two memory buses for simultaneous access, the memory can be
partitioned into a number of modules connected to a common memory address
and data buses.
o A memory module is a memory array together with its own address and
data registers.
• Fig. 4-13 shows a memory unit with four modules.
Fig 4-13: Multiple module memory organization
• The advantage of a modular memory is that it allows the use of a technique called
interleaving.
• In an interleaved memory, different sets of addresses are assigned to different
memory modules.
• By staggering the memory access, the effective memory cycle time can be
reduced by a factor close to the number of modules.
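One common assignment scheme, low-order interleaving, maps consecutive addresses to consecutive modules; the text does not fix a particular mapping, so this sketch is one possibility:

```python
# Low-order interleaving with four modules: the low-order address bits
# select the module, so sequential accesses rotate through all modules
# and can be staggered in time.
NUM_MODULES = 4

def module_of(address):
    return address % NUM_MODULES       # low-order 2 bits select the module

def word_within_module(address):
    return address // NUM_MODULES      # remaining bits select the word

print([module_of(a) for a in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]
```

Because successive addresses land in different modules, a new access can start every clock even though each module's cycle time spans several clocks.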
Supercomputers
• A commercial computer with vector instructions and pipelined floating-point
arithmetic operations is referred to as a supercomputer.
o To speed up the operation, the components are packed tightly together to
minimize the distance that the electronic signals have to travel.
• Its scalar instruction set is augmented by instructions that process vectors and
combinations of scalars and vectors.
• A supercomputer is a computer system best known for its high computational
speed, fast and large memory systems, and the extensive use of parallel
processing.
o It is equipped with multiple functional units and each unit has its own
pipeline configuration.
• It is specifically optimized for the type of numerical calculations involving
vectors and matrices of floating-point numbers.
• They are limited in their use to a number of scientific applications, such as
numerical weather forecasting, seismic wave analysis, and space research.
• A measure used to evaluate computers in their ability to perform a given number
of floating-point operations per second is referred to as flops.
• A typical supercomputer has a basic cycle time of 4 to 20 ns.
• Examples of supercomputers:
o Cray-1: uses vector processing with 12 distinct functional units in parallel; a
large number of registers (over 150); multiprocessor configurations (Cray X-MP
and Cray Y-MP)
o Fujitsu VP-200: 83 vector instructions and 195 scalar instructions; 300
megaflops
5.6 Array Processing
• An array processor is a processor that performs computations on large arrays of
data.
• The term is used to refer to two different types of processors.
o Attached array processor:
Is an auxiliary processor.
It is intended to improve the performance of the host computer in
specific numerical computation tasks.
o SIMD array processor:
Has a single-instruction multiple-data organization.
It manipulates vector instructions by means of multiple functional
units responding to a common instruction.
Attached Array Processor
• Its purpose is to enhance the performance of the computer by providing vector
processing for complex scientific applications.
o Parallel processing with multiple functional units
• Fig. 4-14 shows the interconnection of an attached array processor to a host
computer.
• For example, when attached to a VAX 11 computer, the FPS-164/MAX from
Floating-Point Systems increases the computing power of the VAX to
100 megaflops.
• The objective of the attached array processor is to provide vector manipulation
capabilities to a conventional computer at a fraction of the cost of a supercomputer.
Fig 4-14: Attached array processor with host computer
Fig 4-15: SIMD array processor organization
JNNCE_CSE