S Rawat
S Rawat
S Rawat
S.Rawat
ALU
ALU an Engine for any Computational Silicon. We have different units ALU/FPUs for Integers/Floats respectively. Mainly Decided based on the fact that FP must be pipelined to be in harmony with other blocks in the design. While ALU mainly Single Cycle Operation if we are in given Timing Bucket. So the Fast-Parallel Systolic Array implementation of Computational blocks is one of the main concern.
Further Classification Based on Application Now What all kind of operation an ALU must Support. Lets take an example of IIR/FIR Filters i.e. DSP processor. Where Frequently performed operation is for an example consider a second order FIR. Y[n] = a0*x[n] + a1*x[n-1] + a2*x[n-3]. Either Implement MAC Units or perform using separate multiplication and addition
Further Classification Based on Application So the idea is that depending upon the Application one might have an ALU where normal Arithmetic and Logical operations are frequent or DSP shifts and MAC more Frequent. Why this is important is because it has to be decided initially before starting any computational device according to given Timing and Power Bucket.
An ALU is supposed to fit-in either Finite Field Arithmetic or with the allowed range infinite field arithmetic (i.e. Integer ALU of some generic processor). Present Superscalar processors have separate FPU (Floating Point Unit) and Integer ALU. For Communication Dedicated processor where Finite Field Arithmetic is needed to encode and decode, Finite Field ALUs are designed, where the Number Line becomes Cyclic and interpretation of + * / is changed according to periodicity.
Other Descriptions: state diagrams, timing diagrams, reg xfer, . . . Optimization Criteria: Gate Count [Package Count] Pin Out Area Logic Levels Delay Fan-in/Fan-out Cost Design time Power
Representation Languages
Hardware Representation Languages: Block Diagrams: FUs, Registers, & Dataflows
Fifth Representation "Language": Hardware Description Languages hw modules described like programs E.G., ISP' with i/o ports, internal state, & parallel Verilog execution of assignment statements Descriptions in these languages can be used as input to simulation systems synthesis systems "To Design is to Represent" "software breadboard" generate hw from high level description
Levels of Description
Architectural Simulation models programmer's view at a high level; written in your favorite programming language more detailed model, like the block diagram view commitment to datapath FUs, registers, busses; register xfer operations are clock phase accurate model is in terms of logic gates; higher level MSI functions described in terms of these Less Abstract More Accurate Slower Simulation
Logic Circuit
Verilog
Goals:
Support design, documentation, and simulation of hardware Digital system level to gate level Technology Insertion
Concepts:
Design entity Time-based execution model.
Design Entity == Hardware Component Interface == External Characteristics
Interface
Externally Visible Characteristics Ports: channels of communication (inputs, outputs, clocks, control) Generic Parameters: define class of components (timing characteristics, size, fan-out) --- determined where instantiated or by default Internally Visible Characteristics Declarations: Assertions: constraints on all alternative bodies (i.e., implementations)
Interface Architecture
details of implementation
divu $2,$3
mfhi $1 mflo $1
Lo = $2 $3,
$1 = Hi $1 = Lo
MULTIPLY (unsigned)
Paper and pencil example (unsigned): Multiplicand 1000 Multiplier 1001 1000 0000 0000 1000 Product 01001000 m bits x n bits = m+n bit product Binary makes it easy: 0 => place 0 ( 0 x multiplicand) 1 => place a copy ( 1 x multiplicand) 4 versions of multiply hardware & algorithm: successive refinement
A3
A2
A1
A0
B0
A3
A2
A1
A0
B1
A3
A2
A1
A0 B2
A3
A2
A1
A0
B3
P7
P6
P5
P4
P3
P2
P1
P0
Stage i accumulates A * 2 i if Bi == 1 Q: How much hardware for 32 bit multiplier? Critical path?
B0
B1
A3 A3 P7 P6 A2 P5
A2 A1 P4
A1 A0 P3
A0
B2 B3 P2 P1 P0
at each stage shift A left ( x 2) use next bit of B to determine whether to add in shifted multiplicand accumulate 2n bit partial product at each stage
64-bit Multiplicand reg, 64-bit ALU, 64-bit Product reg, 32-bit multiplier reg
Multiplicand 64 bits Shift Left
Shift Right
Product 64 bits
Control
Start
1. Test Multiplier0
Multiplier0 = 0
1a. Add multiplicand to product & place the result in Product register
Product 0000 0000 0000 0010 0000 0110 0000 0110 Multiplier 0011 0001 0000 Multiplicand 0000 0010 0000 0100 0000 1000
2. Shift the Multiplicand register left 1 bit. 3. Shift the Multiplier register right 1 bit. 32nd repetition?
Control
Multiplier0 = 1
1. Test Multiplier0
Multiplier0 = 0
1a. Add multiplicand to the left half of product & place the result in the left half of Product register Product 0000 0000 Multiplier Multiplicand 0011 0010 2. Shift the Product register right 1 bit. 3. Shift the Multiplier register right 1 bit. 32nd repetition?
A3
A2
A1
A0
B0
A3
A2
A1
A0
B1
A3
A2
A1
A0 B2
A3
A2
A1
A0
B3
P7
P6
P5
P4
P3
P2
P1
P0
Multiplicand 32 bits 32-bit ALU Shift Right Product (Multiplier) 64 bits Write
Control
Start
1. Test Product0
Product0 = 0
1a. Add multiplicand to the left half of product & place the result in the left half of Product register
32nd repetition?
Booths Algorithm
end of run middle of run
0 1 1 1 1 0
beginning of run
Current Bit Bit to the Right Explanation Example Op 1 0 Begins run of 1s 0001111000 sub 1 1 Middle of run of 1s 0001111000 none 0 1 End of run of 1s 0001111000 add 0 0 Middle of run of 0s 0001111000 none Originally for Speed (when shift was faster than add) Replace a string of 1s in multiplier with an initial subtract when we first see a one and then later add for the bit after the last one
1 + 10000 01111
Booths Example (2 x 7)
Operation 0. initial value 1a. P = P - m 1b. 2. 3. 4a. Multiplicand 0010 1110 0010 0010 0010 0010 Product 0000 0111 0 + 1110 1110 0111 0 1111 0011 1 1111 1001 1 1111 1100 1 + 0010 0001 1100 1 next? 10 -> sub shift P (sign ext) 11 -> nop, shift 11 -> nop, shift 01 -> add shift
4b.
0010
0000 1110 0
done
2b.
3a. 3b. 4a 4b.
0010
0010 0010 0010
10 -> sub
shift 11 -> nop shift done
Shifters
Two kinds: logical-- value shifted in is always "0" "0" msb lsb "0"
Note: these are single bit shifts. A given instruction might request 0 to 32 bits to be shifted!
D
A5 A4 A3 A2 A1 A0 S2 S1 S0
0 R7
0 R6
1 R5
1 R4
1 R3
1 R2
1 R1
1 R0
What comes in the MSBs? How many levels for 32-bit shifter? What if we use 4-1 Muxes ?
S3 (0, 8) If added Right-to-left connections could support Rotate (not in MIPS but found in ISAs)
Barrel Shifter
Technology-dependent solutions: transistor per switch SR3
SR2
SR1
SR0
D3
D2 A6
D1 A5
D0 A4
A3
A2
A1
A0
Quotient Dividend
See how big a number can be subtracted, creating quotient bit on each step Binary => 1 * divisor or 0 * divisor Dividend = Quotient x Divisor + Remainder => | Dividend | = | Quotient | + | Divisor | 3 versions of divide, successive refinement
Divisor 64 bits
Shift Left
Remainder 64 bits
Control
Takes n+1 steps for n-bit Quotient & Rem. 1. Subtract the Divisor register from the Remainder Quotient Divisor Remainder register, and place the result 0000 0111 0000 0010 0000
Remainder 0
Remainder < 0
2a. Shift the Quotient register to the left setting the new rightmost bit to 1.
2b. Restore the original value by adding the Divisor register to the Remainder register, & place the sum in the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0.
3. Shift the Divisor register right1 bit. n+1 repetition? Done No: < n+1 repetitions
Observations on Divide Version 1 1/2 bits in divisor always 0 => 1/2 of 64-bit adder is wasted => 1/2 of divisor is wasted Instead of shifting divisor to right, shift remainder to left? 1st step cannot produce a 1 in quotient bit (otherwise too big) => switch order to shift first and then subtract, can save 1 iteration
DIVIDE HARDWARE Version 2 32-bit Divisor reg, 32-bit ALU, 64-bit Remainder reg, 32-bit Quotient reg
Divisor 32 bits Quotient 32-bit ALU Shift Left Remainder 64 bits Write 32 bits Shift Left
Control
Start: Place Dividend in Remainder 1. Shift the Remainder register left 1 bit.
2. Subtract the Divisor register from the left half of the Remainder register, & place the result in the left half of the Remainder register. Remainder 0 Test Remainder Remainder < 0
3a. Shift the Quotient register to the left setting the new rightmost bit to 1.
3b. Restore the original value by adding the Divisor register to the left half of the Remainderregister, &place the sum in the left half of the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0.
DIVIDE HARDWARE Version 3 32-bit Divisor reg, 32 -bit ALU, 64-bit Remainder reg, (0-bit Quotient reg)
Divisor 32 bits 32-bit ALU
HI LO
Shift Left
Control
1. Shift the Remainder register left 1 bit. 2. Subtract the Divisor register from the left half of the Remainder register, & place the result in the left half of the Remainder register. Remainder 0 Test Remainder Remainder < 0
3a. Shift the Remainder register to the left setting the new rightmost bit to 1.
3b. Restore the original value by adding the Divisor register to the left half of the Remainderregister, &place the sum in the left half of the Remainder register. Also shift the Remainder register to the left, setting the new least significant bit to 0.
nth repetition?
Yes: n repetitions (n = 4 here) Done. Shift left half of Remainder right 1 bit.
Note: Quotient negated if Divisor sign & Dividend sign disagree e.g., 7 2 = 3, remainder = 1 Possible for quotient to be too large: if divide 64-bit interger by 1, quotient is 64 bits (called saturation)
Summary
Intro to Verilog a language to describe hardware Modules, reg, wire, always, assign, for, etccccccccc behavior can be higher level x <= boolean_expression(A,B,C,D); Has time as concept Can activate when inputs change, not specifically invoked Inherently parallel Multiply: successive refinement to see final design 32-bit Adder, 64-bit shift register, 32-bit Multiplicand Register Booths algorithm to handle signed multiplies There are algorithms that calculate many bits of multiply per cycle Shifter: success refinement 1/bit at a time shift register to barrel shifter Whats Missing from MIPS is Divide & Floating Point Arithmetic:
More Info
David Patterson & John Hennessy, Computer Organization & Design, Morgan Kaufmann Publishers, 1994. David Winkel & Franklin Prosser, The Art of Digital Design: An Introduction to Top-Down Design, Prentice-Hall, Inc., 1980. Kai Hwang, Computer Arithmetic: Principles, archtiecture, and design, Wiley 1979
ThAnKs
tHaNKs