Computer Arithmetic
Computer Arithmetic
Computer Arithmetic
Ecole polytechnique federale de Zurich Politecnico federale di Zurigo Swiss Federal Institute of Technology Zurich
Lecture notes on
Reto Zimmermann
Integrated Systems Laboratory Swiss Federal Institute of Technology (ETH) CH-8092 Z rich, Switzerland u [email protected]
Contents
Contents
Contents
1 Introduction and Conventions 1.1 Outline
::::::::::::::::::::::: 4 :::::::::::::::::::::::::::::::::::::::::: 4 1.2 Motivation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 4 1.3 Conventions : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 5 1.4 Recursive Function Evaluation : : : : : : : : : : : : : : : : : : : : : 6 Arithmetic Operations : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 8 2.1 Overview : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 8 2.2 Implementation Techniques : : : : : : : : : : : : : : : : : : : : : : : 9 Number Representations : : : : : : : : : : : : : : : : : : : : : : : : : : : 10 3.1 Binary Number Systems (BNS) : : : : : : : : : : : : : : : : : : : 10 3.2 Gray Numbers : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 13 3.3 Redundant Number Systems : : : : : : : : : : : : : : : : : : : : : : 14 3.4 Residue Number Systems (RNS) : : : : : : : : : : : : : : : : : : 16 3.5 Floating-Point Numbers : : : : : : : : : : : : : : : : : : : : : : : : : : 18 3.6 Logarithmic Number System : : : : : : : : : : : : : : : : : : : : : 19 3.7 Antitetrational Number System : : : : : : : : : : : : : : : : : : : 19 3.8 Composite Arithmetic : : : : : : : : : : : : : : : : : : : : : : : : : : : 20 3.9 Round-Off Schemes : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 21 Addition : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 22 4.1 Overview : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 22 4.2 1-Bit Adders, (m, k)-Counters : : : : : : : : : : : : : : : : : : : : 23
1
: : : : : : : : : : : : : : : : : : : 26 4.4 Carry-Save Adder (CSA) : : : : : : : : : : : : : : : : : : : : : : : : : 45 4.5 Multi-Operand Adders : : : : : : : : : : : : : : : : : : : : : : : : : : : 46 4.6 Sequential Adders : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 52 Simple / Addition-Based Operations : : : : : : : : : : : : : : : : 53 5.1 Complement and Subtraction : : : : : : : : : : : : : : : : : : : : : 53 5.2 Increment / Decrement : : : : : : : : : : : : : : : : : : : : : : : : : : : 54 5.3 Counting : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 58 5.4 Comparison, Coding, Detection : : : : : : : : : : : : : : : : : : : 60 5.5 Shift, Extension, Saturation : : : : : : : : : : : : : : : : : : : : : : 64 5.6 Addition Flags : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 66 5.7 Arithmetic Logic Unit (ALU) : : : : : : : : : : : : : : : : : : : : : 68 Multiplication : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 69 6.1 Multiplication Basics : : : : : : : : : : : : : : : : : : : : : : : : : : : : 69 6.2 Unsigned Array Multiplier : : : : : : : : : : : : : : : : : : : : : : : 71 6.3 Signed Array Multipliers : : : : : : : : : : : : : : : : : : : : : : : : : 72 6.4 Booth Recoding : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 73 6.5 Wallace Tree Addition : : : : : : : : : : : : : : : : : : : : : : : : : : : 75 6.6 Multiplier Implementations : : : : : : : : : : : : : : : : : : : : : : : 75 6.7 Composition from Smaller Multipliers : : : : : : : : : : : : : 76 6.8 Squaring : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 76 Division / Square Root Extraction : : : : : : : : : : : : : : : : : : 77 7.1 Division Basics : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 77
4.3 Carry-Propagate Adders (CPA)
2
: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 78 : : : : : : : : : : : : : : : : : : : : : : : : : : 78 7.4 Signed Division : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 79 7.5 SRT Division : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 80 7.6 High-Radix Division : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 81 7.7 Division by Multiplication : : : : : : : : : : : : : : : : : : : : : : : 81 7.8 Remainder / Modulus : : : : : : : : : : : : : : : : : : : : : : : : : : : : 82 7.9 Divider Implementations : : : : : : : : : : : : : : : : : : : : : : : : : 83 7.10 Square Root Extraction : : : : : : : : : : : : : : : : : : : : : : : : : 84 Elementary Functions : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 85 8.1 Algorithms : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 85 8.2 Integer Exponentiation : : : : : : : : : : : : : : : : : : : : : : : : : : : 86 8.3 Integer Logarithm : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 87 VLSI Design Aspects : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 88 9.1 Design Levels : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 88 9.2 Synthesis : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 90 9.3 VHDL : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 91 9.4 Performance : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 93 9.5 Testability : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 95 Bibliography : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 96
7.2 Restoring Division 7.3 Non-Restoring Division
1.2 Motivation
1.3 Conventions
A (1-D), Ai (2-D), ai:k (subbus, 1-D) Signals : a, ai (1-D), ai k (2-D), Ai:k (group signal) Circuit complexity measures : A (area), T (cycle time, delay), AT (area-time product), L (latency, # cycles) Arithmetic operators : +, ;, , =, log (= log2 )
Signal buses : Logic operators :
+ (or),
(and),
(xor),
(xnor), (not)
Circuit complexity measures Unit-gate model ( Inverter, buffer : gate-equivalents (GE) model) :
A=0 T =0
(i.e. ignored)
Simple monotonic 2-input gates (AND, NAND, OR, NOR) : A = 1 T = 1 Simple non-monotonic 2-input gates (XOR, XNOR) : A=2 T =2 Complex gates : composed from simple gates
) Simple m-input gates :
A = m ; 1 T = dlog me
Wiring not considered (acceptable for comparison purposes, local wiring, multilevel metallization) Only estimations given for complex circuits
Computer Arithmetic: Principles, Architectures, and VLSI Design 1 Introduction and Conventions 5
1.4 Recursive Function Evaluation Given : inputs ai , outputs zi , function f (graph sym. : ) Non-recursive functions (n.) Output zi is a function of input ai (or aj +m:j
2.
m const.)
zi = f (ai x) ; i = 0 : : : n ; 1
) parallel structure :
a3 a2 a1 a0 funn.epsi 119 17 mm z3 z2 z1 z0
A = O(n) T = O(1)
Recursive functions (r.) Output zi is a function of all inputs ak a) with single output z
k i
2.
= zn (r.s.) :
A = O(n2) T = O(log n)
a3 a2 a1 a0
1 funrsn.epsi 219 24 mm 3 z 6
) or shared-tree structure :
a3 a2 a1 a0 1funrma2.epsi 219 21 mm z3 z2 z1 z0
2 Arithmetic Operations
2.1 Overview
2 Arithmetic Operations
2 Arithmetic Operations
2.1 Overview
based on operation related operation << , >>
=,<
+1 , 1
+/
+,
+,
Sequential implementation using simpler units and several clock cycles () decomposition) : sometimes : 6 in most cases : 7, 8, 9
arithops.epsi 98 83 mm
sqrt (x)
Table look-up techniques using ROMs : universal : simple application to all operations efcient only for single-operand operations of high complexity (8 12) and small word length (note: ROM size = 2n n) Approximation techniques using simpler units : 712
exp (x)
log (x)
trig (x)
hyp (x)
1 2 3 4 5 6
shift/extension 7 division comparison 8 square root extraction increment/decrement 9 exponential function complement 10 logarithm function addition/subtraction 11 trigonometric functions multiplication 12 hyperbolic functions
8
taylor series expansion polynomial and rational approximations convergence of recursive equation systems CORDIC (COordinate Rotation DIgital Computer)
Computer Arithmetic: Principles, Architectures, and VLSI Design 3 Number Representations 9
3 Number Representations
3.1 Binary Number Systems (BNS) Radix-2, binary number system (BNS) : irredundant, weighted, positional, monotonic [1, 2]
n-bit number is ordered sequence of bits (binary digits) : A = (an;1 an;2 : : : a0)2 ai 2 f0 1g
Simple and efcient implementation in digital circuits MSB/LSB (most-/least-signicant bit) : an;1 / a0 Represents an integer or xed-point number, exact Fixed-point numbers :
Sign : an;1 Properties : asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers) Ones (1s) complement : similar to 2s complement n;2 X Value : A = ;an;1 (2n;1 + 1) + ai 2i i=0 Range : ;(2n;1 ; 1) 2n;1 ; 1] Complement : ;A = 2n ; A ; 1 = A Sign : an;1
n ; m)-bit fraction
n;1 X i=0
A = an;1 2n;1 +
+ a12 + a0 =
ai2i
Properties : double representation of zero, symmetric range, modulo (2n ; 1) number system Sign-magnitude : alternative representation of signed numbers n;2 X ai 2i Value : A = (;1)an;1 i=0 n;1 ; 1) 2n;1 ; 1] Range : ;(2 Complement : ;A = (an;1 an;2 Sign : an;1
10
Range : 0 2n ; 1]
Twos (2s) complement : standard representation of signed or integer numbers n;2 X ai2i Value : A = ;an;1 2n;1 + i=0 Range : ;2n;1 2n;1 ; 1]
Computer Arithmetic: Principles, Architectures, and VLSI Design
: : : a0 )
11
3 Number Representations
3 Number Representations
Properties : double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 () low power) Graphical representation
000...0 011...1 100...0 111...1
3.2 Gray Numbers Gray numbers (code) : binary, irredundant, non-weighted, non-monotonic + Property : unit-distance coding (i.e. exactly one bit toggles between adjacent numbers) Applications : counters with low output toggle rate (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times) Non-monotonic numbers : difcult arithmetic operations, e.g. addition, comparison :
n1
n1
numrep.epsi 95 73 mm
binary ! Gray :
gi = bi+1 bi bn = 0 ; i = 0 : : : n ; 1 (n.)
Gray ! binary :
Conventions 2s complement used for signed numbers in these notes Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned
Computer Arithmetic: Principles, Architectures, and VLSI Design 3 Number Representations 12
bi = bi+1 gi bn = 0 ; i = n ; 1 : : : 0 (r.m.a.)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
b3 b2 b1 b0 g3 g2 g1 g0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
binary
Gray
13
3.3 Redundant Number Systems Non-binary, redundant, weighted number systems [1, 2] Digit set larger than radix (typically radix 2) ) multiple representations of same number ) redundancy + No carry-propagation in adders ) more efcient impl. of adder-based units (e.g. multipliers and dividers) Redundancy ) no direct implementation of relational operators ) conversion to irredundant numbers Several bits used to represent one digit ) higher storage requirements Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed) Delayed-carry of half-adder number representation : ri 2 f0 1 2g , ci si ai bi 2 f0 1g , ri = (ci+1 si) = 2ci+1 + si = ai + bi , ci+1si = 0 ; R = Pn=01 ri2i = (C S ) = C + S = A + B i 1 digit holds sum of 2 bits (no carry-out digit) example : (00 10) = 00 + 10 = 01 + 01 = (10 00) irredundant representation of ;1 [8], since ci+1si = 0 & C + S = ;1 ! S = ;1 C = 0 Carry-save number representation : ri 2 f0 1 2 3g , ci si ai bi di 2 f0 1g , ri = (ci+1 si) = 2ci+1 + si = ai + bi + di = ai + ri0 ; R = Pn=01 ri2i = (C S ) = C + S = A + R0 i
Computer Arithmetic: Principles, Architectures, and VLSI Design 14
1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved) standard redundant number system for fast addition Signed-digit (SD) or redundant digit (RD) number representation : ; ri si ti 2 f;1 0 1g f1 0 1g , R = Pn=01 ri2i i no carry-propagation in S = R + T :
ri + ti = (ci+1 ui) = 2ci+1 + ui , ci+1 ui 2 f1 (ci+1 ui) is redundant (e.g. 0 + 1 = 01 = 11) 8i 9(ci ui ) j ci + ui = si 2 f1 0 1g
0 1g
1 digit holds sum of 2 digits (no carry-out digit) minimal SD representation : minimal number of non-zero digits, 011f1g10 ! 100f0g10 applications : sequential multiplication (less cycles), lters with constant coefcients (less hardware) example : minimal 7 = (0111 j 1111 j 1011 j 1001 j 11111 j
z }| {
canonical SD repres.: minimal SD + not two non-zero digits in sequence, 01f1g10 ! 10f0g10 SD ! binary : carry-propagation necessary () adder) other applications : high-speed multipliers [9] similar to carry-save, simple use for signed numbers
Computer Arithmetic: Principles, Architectures, and VLSI Design 15
3 Number Representations
3 Number Representations
3.4 Residue Number Systems (RNS) Non-binary, irredundant, non-weighted number system [1] + Carry-free and fast additions and multiplications Complex and slow other arithmetic operations (e.g. comparison, sign and overow detection) because digits are not weighted, conversion to weighted mixed-radix or binary system required Codes for error detection and correction [1] Possible applications (but hardly used) : digital lters : fast additions and multiplications error detection and correction for arithmetic operations in conventional and residue number systems Base is n-tuple of integers (mn;1 mn;2 : : : m0 ), residues (or moduli) mi pairwise relatively prime
zi = jZ jmi = jf (A)jmi = f (jAjmi ) mi = jf (ai)jmi jA + B jmi = jAjmi + jB jmi = jai + bijmi mi jA B jmi = jAjmi jB jmi = jai bijmi mi j ; ai jmi = jmi ; ai jmi a;1 mi = aimi ;2 mi (Fermats theorem) i
high storage efciency with k bits simple modular addition : 2k : k -bit adder without cout , 2k ; 1 : k -bit adder with end-around carry (cin = cout )
Example :
(m1 m0) = (3 2) , M = 6
mn;2 ::: m0 ,
A a1 a0
;4 ;3 ;2 ;1 0 1 2 3 4 5 6 7 8 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 0 1 0 1 0 1 0 1 0 1 0
| {z }
possible range
M=
n;1 X i=0
n; 1 Y i=0
mi, anywhere in Z Z
, Ci
ai = A mod mi = jAjmi , A = mi qi + ai
jAjM
j4 + 5j6 = (1 0) + (2 1) = = (j1 + 2j3 j0 + 1j2) = (0 1) = j3j6 j4 5j6 = (1 0) (2 1) = = (j1 2j3 j0 1j2) = (2 0) = j2j6
j5j6 = A = (a1
a0) = (j5j3
j5j2 ) = (2 1)
Ciai
= (: : :
0 1 0 |{z} i
: : :)
16
17
3.5 Floating-Point Numbers Larger range, smaller precision than xed-point representation, inexact, real numbers [1, 2] Double-number form ) discontinuous precision S biased exponent E unsigned norm. mantissa M F = (;1)S M E = (;1)S 1:M 2E;bias Basic arithmetic operations : A B = (;1)SA SB MA MB EA+EB A + B = (;1)SA MA + EA (;1)SB MB (EA ; EB ) base on xed-point add, multiply, and shift operations postnormalization required (1= M < 1) Applications : processors : real oating-point formats (e.g. IEEE standard), large range due to universal use ASICs : usually simplied oating-point formats with small exponents, smaller range, used for range extension of normal xed-point numbers IEEE oating-point format : precision single double
3.6 Logarithmic Number System Alternative representation to oating-point (i.e. mantissa + integer exponent ! only xed-point exponent) [1] Single-number form ) continuous precision ) higher accuracy, more reliable S biased xed-point exponent E L = (;1)S E = (;1)S 2E;bias (signed-logarithmic) Basic arithmetic operations :
+ Simpler multiplication/exponent., more complex addition Expensive conversion : (anti)logarithms (table look-up) Applications : real-time digital lters 3.7 Antitetrational Number System Tetration (t. x = |{z}) and antitetration (a.t. x) [10] 22 x Larger range, smaller precision than logarithmic repres., otherwise analogous (i.e. 2x ! t. x log x ! a.t. x)
2
n nM nE
32 64 23 52 8 11
bias
range
18
19
3 Number Representations
3 Number Representations
3.8 Composite Arithmetic Proposal for a new standard of number representations [10] Scheme for storage and display of exact (primary: integer, secondary: rational) and inexact (primary: logarithmic, secondary: antitetrational) numbers Secondary forms used for numbers not representable by primary ones () no over-/underow handling necessary) Choice of number representation hidden from user, i.e. software/compiler selects format for highest accuracy Number representations :
integer : rational : logarithmic : antitetrational : tag 00 01 10 11 value 2s complement integer slash log integer a.t. integer
3.9 Round-Off Schemes Intermediate results with d additional lower bits () higher accuracy) : A = (an;1 : : : a0 a;1 : : : Rounding : keeping error small during nal word length reduction : R = (rn;1 : : : r0 ) = A ; Trade-off : numerical accuracy vs. implementation cost Truncation :
1 bias = ; 1 + 2d+ 2
1
a;d)
RTRUNC = (an;1 : : : a0 )
(= average error )
Round-to-nearest-even/-odd :
Rational numbers : slash position (i.e. size of numerator/ denominator) is variable and stored (oating slash) Storage form sizes : 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended) Implementation : mixed hardware/software solutions Hardware proposal : long accumulator (4096 bits) holds any oating-point number in xed-point format ) higher accurary ) large hardware/software overhead
Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition 20 4.1 Overview
3 guard bits for rounding after oating-point operations : guard bit G (postnormalization), round bit R (round-to-nearest), sticky bit S (round-to-nearest-even)
Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition 21
4 Addition
4.1 Overview
1-bit adders HA FA (m,k) (m,2)
4.2 1-Bit Adders, (m, k)-Counters Add up m bits of same magnitude (i.e. 1-bit numbers) Output sum as k -bit number (k
= blog mc + 1)
(cout s) = 2cout + s = a + b
s=a b cout = ab
A = 3 T = 2 (1)
CSA
(sum) (carry-out)
a b a b
multi-operand
adder array
adder tree
a b
multi-operand adders array adder tree adder
hasym.epsi HA 18 23 mm c
out
c out haschema1.epsi 19 28 mm
haschema2.epsi 21 43 mm c out
s
Legend: HA: FA: (m,k): (m,2): half-adder full-adder (m,k)-counter (m,2)-compressor CPA: carry-propagate adder RCA: ripple-carry adder CSKA:carry-skip adder CSLA: carry-select adder CIA: carry-increment adder related component CLA: carry-lookahead adder PPA: parallel-prefix adder COSA:conditional-sum adder CSA: carry-save adder
(reference)
s
based on component
22
23
4 Addition
4 Addition
(m, k)-counters
A = 7 T = 4 (2 )
g = ab (generate) c0 = ab p = a b (propagate) c1 = a + b s = a b cin = p cin cout = ab + acin + bcin = ab + (a b)cin = g + pcin = pg + pcin = pa + pcin = cinc0 + cinc1
a b a b g HA faschematic3.epsi p 29 32 mm c c out in HA s a b a b a b
a0
a m-1
...
cntsymbol.epsi 23 18 (m,k)mm
...
s k-1 s 0
Usually built from full-adders Associativity of addition allows convertion from linear to tree structure ) faster at same number of FAs
fasymbol.epsi FA c18 21 mm c
out
in
c out
faschematic2.epsi c in 32 35 mm
A = 28 T = 14
a0a1 a2a3a4a5a6 FA
A = 28 T = 10
a0a1 a2 FA a3a4 a5a6 FA
s s a b
FA
p faschematic4.epsi c out c in 29 1 41 mm
0
count73ser.epsi 42 59 mm
c out
count73par.epsi FA 36 48 mm
c out
faschematic1.epsi g p 29 43 mm
c in
faschematic5.epsi 0 c0 35 47 mm 1 c1
c in
FA s2
FA s1 s0
FA s2 s1 s0
(reference)
s s
linear structure
tree structure
25
24
4.3 Carry-Propagate Adders (CPA) Add two n-bit operands A and B and an optional carry-in cin by performing carry-propagation [1, 2, 11] Sum (cout
Carry-propagation speed-up techniques a) Concatenation of partial CPAs with fast cin ! cout
a n-1:j b n-1:j
...
a i-1:k b i-1:k
speedup1.epsi CPA c i84 26 mm
...
a k-1:0 b k-1:0
c out
CPA
cj
ck
CPA
c in
s n-1:j
s i-1:k
s k-1:0
Ripple-carry adder (RCA) Serial arrangement of n full-adders Simplest, smallest, and slowest CPA structure
c out
... preprocessing
speedup2.epsi 104 50 mm
carry propagation
c in
A = 7n T = 2n AT = 14n2
a n-1 FA s n-1 b n-1
...
...
postprocessing
a1
b1
a0 FA s0
b0
s n-1
s1
s0
c out
c n-1
c1
c in
...
s1
26
27
4 Addition
4 Addition
Carry-select adder (CSLA) Type a) : partial CPA with fast ck ! ci and ck ! si;1:k
ci = P i;1:k c0i + Pi;1:k ck (bit group (ai;1 : : : ak )) Pi;1:k = pi;1pi;2 pk (group propagate)
) path ck ! c0i ! ci never sensitized ) fast ck ! ci ) false path ) inherent logic redundancy ) problems in circuit optimization, timing analysis, and testing
1) Pi;1:k 2) Pi;1:k
= 0 : ck 6! c0i and c0i selected (c0i ! ci) = 1 : ck ! c0i but c0i skipped (c0i 6! ci)
Two CPAs compute two possible results (cin = 0=1), group carry-in ck selects correct one afterwards Variable group sizes (faster) : larger groups at end (MSB) (balance delays a0 ! ck and ak ! c0 ) i Part. CPA typ. is RCA, CSLA () multil. CSLA), or CLA High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)
Variable group sizes (faster) : larger groups in the middle (minimize delays a0 ! ck ! si;1 and ak ! ci ! sn;1 ) Partial CPA typ. is RCA or CSKA () multilevel CSKA) Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)
A
...
0
14n
2:8n1=2
AT
39n3=2
a k-1:0 b k-1:0
a i-1:k b i-1:k
A
a n-1:j b n-1:j
8n
4n1=2
AT
32n3=2
a k-1:0 b k-1:0 c out
a i-1:k b i-1:k
...
c i0
CPA
0
csla.epsi 102 50CPA mm
ci
...
ck
CPA
c in
c out
CPA
cj
...
ci
cska.epsi 99 36 mm 1
ci
CPA ck CPA c in ck
c i1
0 s i-1:k 0 1
1 s i-1:k
s i-1:k
Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition
s k-1:0
29
Carry-increment adder (CIA) Type a) : partial CPA with fast ck ! ci and ck ! si;1:k
Example : gate-level schematic of carry-incr. adder (CIA) only 2 different logic cells (bit-slices) : IHA and IFA
max ngroup
= 1 [12, 11]
IFA
a i-1
a i-2
b i-2 IFA
a k+1
b k+1 IHA
ak
bk
Variable group sizes (faster) : larger groups at end (MSB) (balance delays a0 ! ck and ak ! c0i ) Part. CPA typ. is RCA, CIA () multilevel CIA) or CLA High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group)
...
...
A
...
10n
T
ci
2:8n1=2
a i-1:k b i-1:k
AT
0 ck
28n3=2
a k-1:0 b k-1:0
ci
sk IHA
ck
CPA
86 cia.epsi si-1:k 43 mm
c out
...
ci P i-1:k
CPA
c in
...
bits i-1...k
...
bits 6...4
bits 3,2
bit 1
bit 0
+1
s i-1:k
s k-1:0
c out c in
30
31
4 Addition
4 Addition
Conditional-sum adder (COSA) Type a) : optimized multilevel CSLA with (log n) levels (i.e. double CPAs are merged at higher levels) Correct sum bits (si;1:k or si;1:k ) are (conditionally) selected through (log n) levels of multiplexers
0 1
Carry-lookahead adder (CLA), traditional Type b) : carries looked ahead before sum bits computed Typically 4-bit blocks used (e.g. standard IC SN74181)
Bit groups of size 2l at level l Higher parallelism, more balanced signal paths Highest speed-up at highest hardware overhead (2 RCA + more than (log n) MUX/bit)
c0 = c00 c1 = g0 + p0c00 c2 = g1 + p1g0 + p1p0c00 c3 = g2 + p2g1 + p2p1g0 + p2p1 p0c00 0 g3 = g3 + p3g2 + p3p2g1 + p3p2 p1g0 0 = p3 p2 p1 p0 p3
(g3,p3)
...
(g0,p0)
clbsymbol.epsi 26 27 CLB mm c 0
(g,p) c 3 . . . c 0 3 3
3n log n
a3 b3
T
a2
2 log n
b2
AT
a1
6n log
b1
n
a0 b0
Hierarchical arrangement using ( 1 log n) levels : 2 (g30 p03) passed up, c00 passed down between levels High speed-up at medium hardware overhead
A
c in
14n
4 log n
AT
56n log n
level 0
...
FA FA
0 1
FA FA
0 1
FA FA
0 1 FA
level 1
...
cosa.epsi 100 57 mm
0 1
CLB c 15 . . . c 12
CLB
CLB
CLB c3 . . . c0
level 2
...
c 11 . . . c 8 cla.epsi c 7 . . . c 4 97 48 mm
c out
Parallel-prex adders (PPA) Type b) : universal adder architecture comprising RCA, CIA, CLA, and more (i.e. entire range of area-delay trade-offs from slowest RCA to fastest CLA) Preprocessing, carry-lookahead, and postprocessing step Carries calculated using parallel-prex algorithms + High regularity : suitable for synthesis and layout + High exibility : special adders, other arithmetic operations, exchangeable prex algorithms (i.e. speeds) + High performance : smallest and fastest adders
a1 b1 a0 b0
s1
s0
... s3 s2 s1 s0
CLB
c in
+ preprocessing : gi = ai bi + postprocessing : si = pi
pi = ai bi ci
33
32
Prex problem Inputs (xn;1 : : : x0 ), outputs (yn;1 binary operator [11, 13]
: : : y0), associative
or
{z
} }
{z
2 y3 = Y3:0
{z
A
...
(gn-1 , p n-1 )
5n + 3A
T = 4 + 2T
preprocessing:
c in
...
(g0 , p0 )
add.epsi///gures 73 64 mm
gi = aibi pi = ai bi
carry-lookahead: prex algorithm
: : : xi) at level l Carry-propagation is prex problem : Yil:k = (Gl :k Pil:k ) i (G0:i Pi0:i) = (gi pi) i 1 1 (Gli:k Pil:k ) = (Gli;+1 Pil:;+1) (Glj;k1 Pjl:;1) ; k j i :j j : k 1 1 1 = (Gil;+1 + Pil:;+1Glj;k1 Pil:;+1Pjl:;1) :j j j : k ci+1 = Gm ; i = 0 : : : n ; 1 l = 1 : : : m i:0
Parallel-prex algorithms [14] : multi-tree structures (T = O(n) ! O(log n)) sharing subtrees (A = O(n2 ) ! O(n log n)) different algorithms trading area vs. delay (inuences also from wiring and maximum fan-out FOmax )
c n p n-1
c1
p0
c0
...
...
postprocessing:
si = pi ci
34
35
4 Addition
4 Addition
Prex algorithms Algorithms visualized by directed acyclic graphs (DAG) with array structure (n bits m levels) Graph vertex symbols : l 1 1 (Gil;+1 Pil:;+1) (Gj;k1 Pjl:;1 ) :j j : k
A
0 1 2 3 4
1 2
1 2
(Gil;1 Pil:;1 ) :k k
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
sk.epsi///gures 67 30 mm
Brent-Kung parallel-prex algorithm () PPA-BK) Traditional CLA is PPA-BK with 4-bit groups Tree-like redistribution of carries (fan-out tree)
A = n ; 1 T = n ; 1 FOmax = 2
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 ... 14 15
ser.epsi///gures 69 38 mm
bk.epsi///gures 67 38 mm
36
37
Mixed serial/parallel-prex algorithm () RCA + PPA) linear size-depth trade-off using parameter k : 0
k n ; 2dlog ne + 2
A = n ; 1 + k T = n ; 1 ; k FOmax = var.
4 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10
2n ; 1:4n1=2
1:4n1=2
FOmax
1:4n1=2
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5
var.epsi///gures 68 54 mm
cia.epsi///gures 67 34 mm
38
39
4 Addition
4 Addition
Example : 4-bit parallel-prex adder (PPA-SK) efcient AND-OR-prex circuit for the generate and AND-prex circuit for the propagate signals optimization: alternatingly AOI-/OAI- resp. NAND-/ NOR-gates (inverting gates are smaller and faster) can also be realized using two MUX-prex circuits
a3 b3 a2 b2 a1 b1 a0 b0
A =3 T =3
0 unfact.epsi 1 20 26 mm 2 3
depth-decr. transform
3 2 1 0 0 fact.epsi 1 20 26 mm 2 3
;! ;
size-decr. transform
A =4 T =2
c in
Repeated (local) prex transformations result in overall minimization of graph depth or size ) which sequence ? Goal: minimal size (area) at given depth (delay) Simple algorithm for sequence of applied transforms : Step 1 : prex graph compression (depth minimization) : depth-decr. transforms in right-to-left bottom-up order Step 2 : prex graph expansion (size minimization) : size-decreasing transforms in left-to-right top-down order, if allowed depth not exceeded Prex adder synthesis : 1) generate serial-prex graph, 2) graph compression, 3) depth-controlled graph expansion, 4) generate pre-/postprocessing and prex logic + Generates all previous prex graphs (except PPA-KS) + Universal adder synthesis algorithm : generates area-optimal adders for any given timing constraints [14] (including non-uniform signal arrival times)
c out P n-1:0 s3 s2 s1 s0
40
41
Multilevel adders Multilevel versions of adders of type a) possible (CSKA, CSLA, and CIA; notation: 2-level CIA = CIA-2L) + Delay is O(n1=(m+1) ) for m levels Area increase small for CSKA and CIA, high for CSLA () COSA) Difcult computation of optimal group sizes Hybrid adders Arbitrary combinations of speed-up techniques possible ) hybrid/mixed adder architectures Often used combinations : CLA and CSLA [15] Pure architectures usually perform best (at gate-level)
Self-timed adders + RCA is fast in average case (T = O(log n)), slow in worst case ) suitable for self-timed asynchronous designs [16] Completion detection is not trivial Adder performance comparisons Standard-cell implementations, 0:8
area [lambda^2] RCA 1e+07 128-bit CSKA-2L CIA-1L 5 64-bit CIA-2L PPA-SK PPA-BK 32-bit 2 16-bit 1e+06 8-bit 5
m process
Transistor-level adders Inuence of logic styles (e.g. dynamic logic, pass-transistor logic ) faster) + Efcient transistor-level implementation of ripple-carry chains (Manchester chain) [15] + Combinations of speed-up techniques make sense Much higher design effort Many efcient implementations exist and published
Computer Arithmetic: Principles, Architectures, and VLSI Design 42
2 5 10 20
addperf.ps 84 84 mm
delay [ns]
43
4 Addition
4 Addition
Complexity comparison under the unit-gate model adder RCA CSKA-1L CSKA-2L CSLA-1L CIA-1L CIA-2L CIA-3L PPA-SK PPA-BK PPA-KS CLA 5 COSA
1
4.4 Carry-Save Adder (CSA) a) Adds three n-bit operands A0 , A1 , A2 performing no carry-propagation (i.e. carries are saved) [1]
7n
2n 4n1=2 xn1=3 4 2:8n1=2 2:8n1=2 3:6n1=3 4:4n1=4 2 log n 4 log n 2 log n 4 log n 2 log n
AT
(C S ) = C + S = A0 + A1 + A2
2ci+1 + si
A0 A1 A2
csasymbol.epsi 26 21 CSAmm
p p p p p p ( )
b) Adds one n-bit operand to an n-digit carry-save operand Result is in redundant carry-save format (n digits), represented by two n-bit numbers S (sum bits) and C (carry bits) + Parallel arrangement of n full-adders, constant delay
optimality regarding area and delay aaa : smallest area, longest delay aat : small area, medium delay att : medium area, short delay ttt : large area, shortest delay : not optimal 2 obtained from prex adder synthesis 3 automatic logic optimization not possible (redundancy) 4 exact factors not calculated 5 corresponds to 4-bit PPA-BK
44
A = 7n T = 4
a 0,n-1 a 1,n-1 a 2,n-1 a 0,1 a 1,1 a 2,1 a 0,0 a 1,0 FA c1 s0
csa.epsi 27 mm FA
FA cn s n-1
. . . 67
c2
s1
Multi-operand carry-save adders (m > 3) ) adder array (linear arrangement), adder tree (tree arr.)
Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition 45
4.5 Multi-Operand Adders Add three or more (m > 2) n-bit operands, yield (n + dlog me)-bit result in irredundant number rep. [1, 2] Array adders Realization by array adders : (see gures on next page) a) linear arrangement of CPAs b) linear arr. of CSAs (adder array) and nal CPA a) and b) differ in bit arrival times at nal CPA : ) if CPA = RCA : a) and b) have same overall delay ) if fast nal CPA : uniform bit arrival times required ) CSA array (b) Fast implementation : CSA array + fast nal CPA (note: array of fast CPAs not efcient/necessary)
...
FA a 2,n-1 FA a 3,n-1 FA sn FA
... ... ...
FA a 2,2
FA a 2,1
cparray.epsi 93 FA 57 mm FA
a 3,2 FA s2
a 3,1 FA s1
s n-1
A0 A1 A2
A3
A m-1
CSA
...
FA a 3,n-1
...
FA a 3,2
FA a 3,1 FA
FA a 3,0 HA
A = O(mn + n) T = O(m + n)
mopadd.epsi CSA 30 58 mm
...
FA
...
csarray.epsi 99 57 mm FA
Fast CPA :
CPA
FA sn
FA
...
FA s2
HA s1 s0
s n-1
S
Computer Arithmetic: Principles, Architectures, and VLSI Design 46 Computer Arithmetic: Principles, Architectures, and VLSI Design 47
4 Addition
4 Addition
m;4 X
clout) + s =
m;4 X l =0
a0
a m-1
...
0 c out m-4 c out
ai +
clin
cprsymbol.epsi 26 37 (m,2)mm
0 c in m-4 c in
1-bit adders (similar to (m, k)-counters) [17] Compresses m bits down to 2 by forwarding (m ; 3) intermediate carries to next higher bit position Is bit-slice of multi-operand CSA array (see prev. page) + No horizontal carry-propagation (i.e. cl ! ck in out k > l) Built from full-adders (= (3, 2)-compressor) or (4, 2)-compressors arranged in linear or tree structures Example : 4-operand adder using (4, 2)-compressors
a 0,n-1 a 1,n-1 a 2,n-1 a 3,n-1 a 0,2 a 1,2 a 2,2 a 3,2 a 0,1 a 1,1 a 2,1 a 3,1 a 0,0 a 1,0 a 2,0 a 3,0
...
...
A = 14 T = 8
a0 a1 a2 a3 FA
cpr42fa.epsi 32 38 mm
A = 14 T = 6
a0 a1 a2 a3
)
c in c out
c out
cpr42opt.epsi 1 41 53 mm
FA c s
c in
0 1
with full-adders
CSA
(4,2)
(4,2) cpradd.epsi 99 44 mm FA s2
(4,2)
(4,2)
FA s n+1 sn
FA s n-1
HA s1 s0
CPA
SD-FA (signed-digit full-adder) is similar to (4, 2)-compressor regarding structure and complexity
Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition 49
48
Advantages of (4, 2)-compressors over FAs for realizing (m, 2)-compressors : higher compression rate (4:2 instead of 3:2) less deep and more regular trees tree depth # operands FA (4,2) 012 3 4 5 6 7 8 9 10
Tree adders (Wallace tree) Adder tree : n-bit m-operand carry-save adder composed of n tree-structured (m, 2)-compressors [1, 18] Tree adders : fastest multi-operand adders using an adder tree and a fast nal CPA
2 3 4 6 9 13 19 28 42 63 94 2 4 8 16 32 64 128
A = 42 T = 16
a0a1 a2a3 FA a4a5 a6a7 FA
0 c out 0 c in 1 c in 1 c out 2 c out 3 c out
A = 42 T = 12
a0a1a2a3 a4a5a6a7
0 c in 1 c in
(4,2)
(4,2)
FA
FA
cpr82fa.epsi 47 65 mm
cpr82cpr42.epsi 47 50 mm
2 c in 3 c in
2 c in 3 c in
FA
4 c out 4 c in
(4,2)
4 c in
FA c s
full-adder tree
Computer Arithmetic: Principles, Architectures, and VLSI Design
4 Addition
neg.epsi 21 32 mm1
+1
si
Z A B
2s complement subtractor
accucpa.epsi CPA 28 mm 27
A ; B = A + (;B ) =A+B+1
sub.epsi 29 32 mm
c out
CPA
S A B
2s complement adder/subtractor
CSA
accucsa.epsi 33 52 mm
addsub.epsi 36 35 mm
c out
CPA
sub
S CPA A B
Mixed CSA/CPA : CSA with partial CPAs (i.e. fewer carries saved), trade-off between speed and register size
Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations 52
A + B (mod 2n ; 1) = A + B + cout
(end-around carry)
c out
addmod.epsi 28 29 CPAmm
c in
S
Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations 53
5.2 Increment / Decrement Incrementer Adds a single bit cin to an n-bit operand A (cout Z ) = cout2n + Z = A + cin
Prex problem :
Ci:k = Ci:j+1Cj:k
) AND-prex struct.
1 2
A
A
1 2
n log n + 2n T = dlog ne + 2 AT
n log2 n
Decrementer
a n-1
(cout Z ) = A ; cin
a2 a1 a0
...
incsymbol.epsi +1 c out 29 26 mm c in
Z c out
...
= 0 () FA ! HA)
3n2
dec.epsi 93 41 mm
c in
A = 3n T = n + 1 AT
a n-1
...
z n-1
z2
z1
z0
a1
incfa.epsi HA 59 23 mm c c2 1
a0 HA z0
Incrementer-decrementer
c in
c out
HA z n-1
c n-1
...
z1
a2
a1
a0
incdec.epsi 94 46 mm inc.epsi 83 33 mm
c out
...
c in
c out
...
c in
HA z n-1 z2 z1 z0
54
z n-1
z2
z1
z0
55
inccg.epsi 62 39 mm
c in
c out z3 z2 z1 z0
c0 = an;1 an;2 a0 (parity) ci+1 = ai ci ; i = 0 : : : n ; 3 (r.m.a.) z0 = a0 c0 zi = ai ai;1 ci;1 ; i = 1 : : : n ; 2 zn;1 = an;1 cn;2
Prex problem ) AND-prex structure
incpp.epsi 98 63 mm
c out
z7
z6
z5
z4
z3
z2
z1
z0
56 5.3 Counting Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations 57 5.3 Counting
Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations
5.3 Counting Count clock cycles ) counter, divide clock frequency ) frequency divider (cout ) Binary counter Sequential in-/decrementer Incrementer speed-up techniques applicable Down- and up-down-counters using decrementers / incrementer-decrementers
Fast divider (T = O(1)) using delayed-carry numbers (irredundant carry-save represention of ;1 allows using fast carry-save incrementer) [8] Gray counter Counter using Gray incrementer
c out
+1 cntblock.epsi 32 33 mm
c in clk
Q q n-1
Example : Ripple-carry up-counter using counter slices (= HA + FF), cin is count enable
c out
...
q2
q1
q0
State is not encoded ) n FF for counting n states Must be initialized correctly (e.g. 00 Applications: fast dividers (no logic between FF) state counter for one-hot coded FSMs 01)
c in
cntripple.epsi 87 36 mm
q n-1
q2
q1
q0
cntasync.epsi 64 18 mm
clk
q n-1
q2
q1
q0
q n-1
q2
q1
q0
58
Comparators Subtractor (A ; B ) :
EQ = (A = B ) (equal) NE = (A 6= B ) = EQ (not equal) GE = (A B ) (greater or equal) LT = (A < B ) = GE (less than) GT = (A > B ) = GE EQ (greater than) LE = (A B ) = GT = GE + EQ (less or equal)
Equality comparison
GE = cout EQ = Pn;1:0
(for free in PPA)
cmpsub.epsi 37 31 mm
GE = c out
CPA
EQ = P n-1:0
2 log n
EQ = (A = B ) eqi+1 = (ai = bi) eqi = (ai bi) eqi ; i = 0 ::: n ; 1 eq0 = 1 EQ = eqn (r.s.a.)
Magnitude comparison
a n-1 b n-1
a2 b2
a1 b1
...
a0 b0
cmpeq.epsi 40 36 mm
A = 6n TLIN = 2n TTREE
a n-1 b n-1 a2 b2 a1 b1
2 log n
GE = (A B ) gei+1 = (ai > bi) + (ai = bi) gei = aibi + (ai bi) gei ; i = 0 : : : n ; 1 ge0 = 1 GE = gen (r.s.a.)
Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations 60 5.4 Comparison, Coding, Detection
GE
...
cmpripple.epsi 100 47 mm
EQ
Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations
61
zi =
1 if A = i 0 else ;
a2
i = 0 ::: m ; 1
a1 a0
decoder.epsi 58 28 mm
= 2A
A
decodersym.epsi decoder 21 26 mm
A = n T = log n
Leading-zeroes detection (LZD) : for scaling, normalization, priority encoding
z2 z1 z0
z7
z6
z5
z4
z3
A = (n ; 1)2n T = dlog ne
Encoder Encodes vector Am;1:0 to binary number Zn;1:0 (m = 2n ) (condition: 9i 8k j if k = i then ak = 1 else ak = 0) Z = i if ai = 1 ; i = 0 : : : m ; 1 Z = log2 A
A
encodersym.epsi encoder 21 26 mm
a) non-encoded output :
f0g1f0j1g ! f0g1f0g
a n-1 a n-2
...
a1
a0
lzdnenc.epsi 50 28 mm
...
A = 2n T = n
z n-1
z n-2
z1
z0
a7a5a3a1 a6a4a2a0
encoder.epsi 30 34 mm
A = n(2n;1 ; 1) T =n;1
z2
5.5 Shift, Extension, Saturation Shift : a) shift n-bit vector by k bit positions b) select n out of more bits at position k also: logical (= unsigned), arithmetic (= signed)
Rotation by k bit positions, n constant (logic operation) Extension of word lengths by k bits (n ! n + k ) (i.e. sign-extension for signed numbers) Saturation to highest/lowest value after over-/underow shift a) unl. signed r. signed l. r. unsigned signed l. r. unl. signed r. signed l. r.
Applications : adaption of magnitude (shift a)) or word length (extension) of operands (e.g. for addition) multiplication/division by multiples of 2 (shift) logic bit/byte operations (shift, rotation) scaling of numbers for word-length reduction (i.e. ignore leading zeroes, shift b)) or normalization (e.g. of oating-point numbers, shift a)) using LZD reducing error after over-/underow (saturation) Implementation of shift/extension/rotation by constant values : hard-wired variable values : multiplexers n possible values : nbyn barrel-shifter/rotator Example : 4by4 barrel-rotator
an;2 0 an;1 an;1 an;3 an;1 an;1 an;2 an+k;1 a2n;1 an+k;2 an;2 a0 an;1 0 an;1 an;1 an;1 an;1 an;2 an;1 an;2 an;1 an;1 an;1
::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: :::
rol ror
A = O(n2) T = O(log n)
a3 a2 a1 a0
a3 s1 s0 s1 s0 s1 s0 s1 s0
a2
a1
a0
barshift.epsi 44 49 mm
s0 s1 z3
muxshift.epsi 41 28 mm
z2
z1
z0
z3
z2
z1
z0
multiplexers
64
tristate buffers
65 5.6 Addition Flags
Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations
Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations
Basic and derived condition ags description carry ag signed overow ag zero ag negative ag, sign condition operation: S=0 S<0 S 0 ag formula signed
C V
formula
C , N : for free V : fast cn, cn;1 computed by e.g. PPA ) very cheap Z : a) cin = 1 (subtract.) : Z = (A = B ) = Pn;1:0 (of PPA) b) cin = 0=1 : Z = sn;1 + sn;2 + + s0 (r.s.a.) 1) A = ACPA + n TZ = TCPA + dlog ne
2) faster without nal sum (i.e. carry prop.) [19] example : 01001 1 00 0 + 10110 1 00 = 00000 0 00
S = A + B (+) or S = A ; B (;) zero Z Z negative N positive N S > max overow C (+) VC S < min underow C (;) VC operation: A ; B A=B EQ Z Z A 6= B NE Z Z A B GE C N V + NV A>B GT CZ (N V + NV )Z A<B LT C NV + NV A B LE C + Z NV + NV + Z
Unsigned and signed addition/subtraction only differ with respect to the condition ags
unsigned
z0 = ((a0 b0) cin) zi = ((ai bi) (ai;1 + bi;1)) Z = zn;1 zn;2 z0 ; i = 0 : : : n ; 1 (r.s.a.) A = ACPA + 3n TZ = 4 + dlog ne
66 Computer Arithmetic: Principles, Architectures, and VLSI Design 67
6 Multiplication
6 Multiplication
6.1 Multiplication Basics Multiplies two n-bit operands A and B [1, 2] Product P is (2n)-bit unsigned number or (2n ; 1)-bit signed number Example : unsigned multiplication n;1 n;1 n;1 n;1 X X XX P = A B = ai2i bj 2j = aibj 2i+j or i=0 j =0 i=0 j =0 n;1 X Pi = ai B P = Pi2i ; i = 0 : : : n ; 1 (r.s.a.) i=0 Algorithm 1) Generation of n partial products Pi 2) Adding up partial products : a) sequentially (sequential shift-and-add), b) serially (combinational shift-and-add), or c) in parallel Speed-up techniques Reduce number of partial products Accelerate addition of partial products
Computer Arithmetic: Principles, Architectures, and VLSI Design 6 Multiplication 69
ALU operations arithmetic add inc pass and or xor pass sll sla rol
logic
shift/ rotate
sub dec neg nand nor xnor not srl sra ror
A;B A;1 ;A ai bi ai + bi ai bi ai A 1 A a1 A r1
Sequential multipliers : partial products generated and added sequentially (using accumulator)
mulseq.epsi 34 28 mm
CPA
A = O(n) T = O(log n) L = n
Array multipliers : partial products generated and added simultaneously in linear array (using array adder)
CSA CSA mularr.epsi 34 47 mm CSA
P=
aibj 2i+j a0 b3 a1 b2 a2 b1 a3 b0 p3
A = 8n2 ; 11n T = 6n ; 9 a0 b2 a0 b1 a0 b0 a1 b1 a1 b0 a2 b0 p2 p1 p0
b1 b0
A = O(n ) T = O(n)
2
CSA CPA
a1
p0 HA HA HA p1
Parallel multipliers : partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)
mulpar.epsi 34 43 mm
1
a2
FA
mulbraun.epsi FA 99 83 mm
FA p2
A = O(n ) T = O(log n)
2
2
CSA CPA
FA
FA
FA p3
Signed multipliers : a) complement operands before and result after multiplication ) unsigned multiplication b) direct implementation (dedicated multiplier structure)
Computer Arithmetic: Principles, Architectures, and VLSI Design 70
3
p7
FA p6
FA p5
HA p4
71
6 Multiplication
6 Multiplication
6.3 Signed Array Multipliers Modied Braun multiplier Subtract bits with negative weight ) special FAs [1] 1 neg. bit : ;a + b + cin = 2cout ; s 2 neg. bits : a ; b ; cin = ;2cout + s Replace FAs in regions 1 , 2 , and 3 by : (input a at mark )
6.4 Booth Recoding Speed-up technique : reduction of partial products Sequential multiplication + One cycle per non-zero partial product (i.e. 8ai j ai = 0) 6 Negative partial products Minimal (or canonical) signed-digit (SD) represent. of A
Data-dependent reduction of partial products and latency Combinational multiplication Only xed reduction of partial product possible Radix-4 modied Booth recoding : 2 bits recoded to one multiplier digit ) n=2 partial products
Otherwise exactly same structure and complexity as Braun multiplier ) efcient and exible Baugh-Wooley multiplier Arithmetic transformations yield the following partial products (two additional ones) :
A=
n=2;1 X i=0
a0 b3 a1b3 a1 b2 a2 b3 a2 b2 a2 b1 a3 b3 a3 b2 a3 b1 a3 b0 a3 a3 + 1 b3 b3 p7 p6 p5 p4 p3
Booth recoding
a0 b2 a0 b1 a0 b0 a1 b1 a1 b0 a2 b0 p2 p1 p0
f;2 ;1 0 +1 +2g
=0
mulbooth.epsi 41 43 mm
Applicable to sequential, array, and parallel multipliers additional recoding logic and more complex partial product generation (MUX for shift, XOR for negation) + adder array/tree cut in half ) considerably smaller (array and tree)
) much faster for adder arrays ) slightly or not faster for adder trees
6.5 Wallace Tree Addition Speed-up technique : fast partial product addition
A : +8n T : +7 A : =2 T : =2 T : ;0 p2 p1 p0 p2 p1 p0
1
A = O(n2) T = O(log n)
Applicable to parallel multipliers : parallel partial product generation (normal or Booth recoded) Irregular adder tree (Wallace tree) due to different number of bits per column ) irregular wiring and/or layout ) non-uniform bit arrival times at nal adder 6.6 Multiplier Implementations Sequential multipliers : low performance, small area, component re-use (adder) Braun or Baugh-Wooley multiplier (array multiplier) : medium performance, high area, high regularity layout generators ) data paths and macro-cells simple pipelining, faster CPA ) higher speed Booth-Wallace multiplier (parallel multiplier) [9] : high performance, high area, low regularity custom multipliers, netlist generators often pipelined (e.g. register between CSA-tree and CPA) Signed-unsigned multiplier : signed multiplier with operands extended by 1 bit (an = an;1 =0, bn = bn;1 =0)
74 Computer Arithmetic: Principles, Architectures, and VLSI Design 75
p3 p3 p3 p3 p2 p1 p0 = | {z }
ext. sign
0 0 0 ;p3 = 1 + 1 1 1 p3
p03 p02 p01 p00 p03 p03 p03 p03 p02 p01 p00 p13 p12 p11 p10 p13 p13 p13 p12 p11 p10 ! p23 p22 p21 p20 p23 p23 p22 p21 p20 + p33 p32 p31 p30 + p33 p32 p31 p30
p6 p5 p4 p3 p2 p1 p0
p6 p5 p4 p3 p2 p1 p0
Suited for signed multiplication (incl. Booth recod.) Extend A for unsigned multiplication : an
=0
6 Multiplication
6.8 Squaring
(2n 2n)-bit multiplier can be composed from 4 (n n)-bit multipliers (can be repeated recursively)
A B = (AH 2n + AL) (BH 2n + BL) = AH BH 22n + (AH BL + ALBH )2n + ALBL AH BL AH BH AL BL AL BH
4 (n n)-bit multipliers + (2n)-bit CSA + (3n)-bit CPA less efcient (area and speed) 6.8 Squaring
A=Q B+R; R <B R = A rem B (remainder) A 2 0 22n ; 1] B Q R 2 0 2n ; 1] B 6= 0 Q < 2n ! A < 2nB , otherwise overow ) normalize B before division (B 2 2n;1 2n ; 1]) A =Q+ R B B
Algorithms (radix-2) Subtract-and-shift : partial remainders Ri [1, 2] Sequential algorithm : recursive, f non-associative
P = A2 = AA
!
;
+ bn=2c + 1 partial products (if no Booth recoding used) ) optimized squarer more efcient than multiplier Table look-up (ROM) less efcient for every n
Computer Arithmetic: Principles, Architectures, and VLSI Design 7 Division / Square Root Extraction
qi =
1 if 0 if
q0 = 1 if
i
1 if
Ri+1 ; B 2i < 0 : qi = 0 Ri = Ri+1 (restored) i ; 1 Ri+1 ; B 2i;1 0 : qi;1 = 1 Ri;1 = Ri+1 ; B 2i;1 i
7.3 Non-Restoring Division
Example : signed non-restoring array divider (simplications: B > 0, nal correction of R omitted)
qi0 =
1 if ;1 = 1 if
A = 9n2 T = 2n2 + 4n
a6 b2 a5 b1 a4 b0
a6 b3
b3
a3
Ri+1 0 : qi0 = 1 Ri = Ri+1 ; B 2i i ; 1 Ri+1 ; B 2i < 0 : qi0;1 = 1 Ri;1 = Ri+1 ; B 2i +B 2i;1 = Ri+1 ; B 2i;1 i
One subtraction/addition (CPA) per step Final correction step for R (additional CPA) 0 Simple quotient digit conversion : (note: qi irredundant)
q3
FA
FA
FA
FA
a2 q2 FA FA FA divarray.epsi 81 101 mm FA
a1 q1 FA FA FA FA
a0 q0 FA r3 FA r2 FA r1 FA r0
79
R
Computer Arithmetic: Principles, Architectures, and VLSI Design 78
qi0 = >0 if ;B 2i
> :1
if if
= 2m , qi0 2 f
;1
:::
1 0 1
:::
; 1g
if 2n;1 ) ;B 2i
)
B < 2n , i.e. B is normalized : ;2n+i;1 Ri+1 < 2n+i;1 B 2i 8 >1 if 2n+i;1 Ri+1 > < 0 = 0 if ;2n+i;1 R < 2n+i;1 qi > i+1 > :1 if Ri+1 < ;2n+i;1
0 + Only 3 MSB are compared ) qi are estimated ) CSA instead of CPA can be used (precise enough) [20] Correction in following steps (+ nal correction step) 0 Redundant representation of qi (SD representation) ) nal conversion necessary (CPA) + Highly regular and fast (O(n)) SRT array dividers ) only slightly slower/larger than array multipliers
A B
A A Q = B = B R0R1 R0 R1
Rm;1 ! A Rm;1 B
B 1 B
Q = Q resp. 2n 1
CPA
R
Computer Arithmetic: Principles, Architectures, and VLSI Design 7 Division / Square Root Extraction 80
Quadratic convergence :
Computer Arithmetic: Principles, Architectures, and VLSI Design 7 Division / Square Root Extraction
7.9 Divider Implementations Iterative dividers (through multiplication) : re-use of existing components (multiplier) medium performance, medium area high efciency if components are re-used Sequential dividers (restoring, non-restoring, SRT) : re-use of existing components (e.g. adder) low performance, low area Array dividers (restoring, non-restoring, SRT) : dedicated hardware component high performance, high area high regularity ) layout generators, pipelining square root extraction possible by minor changes combination with multiplication or/and square root No parallel dividers exist (sequential nature of division)
f (X ) = 0
by recursion
1 1 1 f (X ) = X ; B f 0 (X ) = ; X 2 f B = 0 Algorithm :
R = A rem B M = A mod B M
0
sign(R) = sign(A)
M = R+B R
if A else
82
83
8 Elementary Functions
8.1 Algorithms
8 Elementary Functions
A = Q2 + R
A2
0 22n ; 1]
Q2
0 2n ; 1]
Exponential function : ex (exp x) Logarithm function : ln x, log x Trigonometric functions : sin x, cos x, tan x
Algorithm Subtract-and-shift : partial remainders Ri and quotients Qi = Qi+1 + qi2i = (qn;1 : : : qi 0 : : : 0) 2 Q2 = Qi+1 + qi2i = Q2+1 + qi2i 2Qi+1 + qi2i i i
Inverse trig. functions : arcsin x, arccos x, arctan x Hyperbolic functions : sinh x, cosh x, tanh x 8.1 Algorithms Table look-up : inefcient for large word lengths [5] Taylor series expansion : complex implementation Polynomial and rational approximations [1, 5] Shift-and-add algorithms [5] Convergence algorithms [1, 2] : similar to division-by-convergence two (or more) recursive formulas : one formula converges to a constant, the other to the result Coordinate rotation (CORDIC) [2, 5, 21] :
A ADIV =2 T TDIV
R
Computer Arithmetic: Principles, Architectures, and VLSI Design 8 Elementary Functions 84
3 equations for x-, y-coordinate, and angle computes all elementary functions by proper input settings and choice of modes and outputs simple, universal hardware, small look-up table
Computer Arithmetic: Principles, Architectures, and VLSI Design 8 Elementary Functions 85
b)
xy = ey ln x = 2y log x
Ab )2 Ab
1
= (: : :
0 1 0 |{z} A
: : :)
AB
=A A A | {z }
B
L=0
2n ; 1 (!)
Applications : modular exponentiation AB (mod in cryptographic algorithms (e.g. IDEA, RSA) Algorithms : square-and-multiply a)
C)
Z = blog2 Ac
For detection/comparison of order of magnitude Corresponds to leading-zeroes detection (LZD) with encoded output
A4b A2b Ab
2 1
Ei = Pibi Ei;1 Pi+1 = Pi2 ; i = 0 : : : n ; 1 E;1 = 1 P0 = A E = En;1 (r.s.n.) A = 2AMUL T = TMUL L = n A = AMUL T = TMUL L = 2n
Computer Arithmetic: Principles, Architectures, and VLSI Design
or
86
87
Gate-level design Cell-based design techniques : standard-cells, gate-array/ sea-of-gates, eld-programmable gate-array (FPGA) Circuit implemented by hand or by synthesis (library) Layout implemented by automated place-and-route Medium to high design efciency Medium to low circuit performance Medium to low exibility : full choice of architecture Block-level design Layout blocks and netlists from parameterized automatic generators or compilers (library) High design efciency Medium to high circuit performance Low exibility : limited choice of architectures Implementations : data-path : bit-sliced, bus-oriented layout (array of cells: n bits m operations), implementation of entire data paths, medium performance, medium diversity macro-cells : tiled layout, xed/single-operation components, high performance, small diversity portable netlists : ) gate-level design
Computer Arithmetic: Principles, Architectures, and VLSI Design 9 VLSI Design Aspects 89 9.3 VHDL
k i-1 p i-1
c in
a b
fulladder :
c in c in
b b
facmos.epsi 76 40 mm
c in s c in b c out
c in
Computer Arithmetic: Principles, Architectures, and VLSI Design 9 VLSI Design Aspects
88 9.2 Synthesis
9.2 Synthesis High-level synthesis Synthesis from abstract, behavioral hardware description (e.g. data dependency graphs) using e.g. VHDL Involves architectural synthesis and arithmetic transformations High-level synthesis is still in the beginnings Low-level synthesis Layout and netlist generators Included in libraries and synthesis tools Low-level synthesis is state-of-the-art Basis for efcient ASIC design Limited diversity and exibility of library components Circuit optimization Efcient optimization of random logic (low factorization degree) is state-of-the-art Optimization of entire arithmetic circuits (high factorization degree) is not feasible ) only local optimizations possible Logic optimization cannot replace the synthesis of efcient arithmetic circuit structures using generators
Computer Arithmetic: Principles, Architectures, and VLSI Design 90
9.3 VHDL Arithmetic types : unsigned, signed (2s complement) Arithmetic packages numeric_bit, numeric_std (IEEE standard 1076.3), std_logic_arith (Synopsys) contain overloaded arithmetic operators and resizing / type conversion routines for unsigned, signed types Arithmetic operators (VHDL87/93) [22] relational shift, rotate (93 only) adding sign (unary) multiplying exponent, absolute Synthesis Typical limitations of synthesis tools :
/, mod, rem : both operands must be constant or divisor
: : : : : :
=, /=, <, <=, >, >= rol, ror, sla, sll, sra, srl +, +, *, /, mod, rem **, abs
must be a power of two ** : for power-of-two bases only Variety of arithmetic components provided in separate libraries (e.g. DesignWare by Synopsys)
Computer Arithmetic: Principles, Architectures, and VLSI Design 91
9.3 VHDL
9.4 Performance
Resource sharing Sharing one resource for multiple operations Done automatically by some synthesis tools Otherwise, appropriate coding is necessary : a) b)
S <= A + C when SELA = 1 else B + C;
9.4 Performance Pipelining Pipelining is basically possible with every combinational circuit ) higher throughput Arithmetic circuits are well suited for pipelining due to high regularity Pipelining of arithmetic circuits can be very costly : large amount of internal signals in arithmetic circuits array structures : many small pipeline registers tree structures : few large pipeline registers
) no advantage of tree structures anymore (except for smaller latency)
) 2 adders + 1 multiplexer
T <= A when SELA = 1 else B; S <= T + C; ) 1 multiplexer + 1 adder
Fine-grain pipelining ) systolic arrays (often applied to arithmetic circuits) High speed Fast circuit architectures, pipelining, replication (parallelization), and combinations of those Optimal solution depends on arithmetic operation, circuit architecture, user specications, and circuit environment
92 Computer Arithmetic: Principles, Architectures, and VLSI Design 9 VLSI Design Aspects 93 9.5 Testability
Synthesis : check synthesis result for allocated arithmetic units ) code sanity check, control of circuit size VHDL library of arithmetic units Structural, synthesizable VHDL code for most circuits described in this text is found in [23]
Computer Arithmetic: Principles, Architectures, and VLSI Design 9 VLSI Design Aspects
9.4 Performance
Low power Power-related properties of arithmetic circuits : High glitching activity due to high bit dependencies and large logic depth Power reduction in arithmetic circuits [24] : Reduce the switched capacitance by choosing an area efcient circuit architecture Allow for lower supply voltage by speeding up the circuitry Reduce the transition activity : apply stable inputs while circuit is not in use () disabling subcircuits) reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difcult to realize) reduce glitching transitions by reducing logic depth (pipelining) take advantage of correlated data streams choose appropriate number representations (e.g. Gray codes for counters)
9.5 Testability Testability goal : high fault coverage with few test vectors that are easy to generate/apply Random test vectors : easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits Special test vectors : sometimes hard to generate and apply, required for coverage of hard-detectable faults which are inherent in most arithmetic circuits Hard-detectable faults found in : circuits of arithmetic operations with inherent special cases (arithmetic exceptions) : detectors, comparators, incrementers and counters (MSBs), adder ags circuits using redundant number representations (= redundant hardware) : dividers (Pentium bug!) 6
94
95
Bibliography
Bibliography
Bibliography
[1] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993. [2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979. [3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981. [4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984. [5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhauser Boston, 1997. [6] Proceedings of the Xth Symposium on Computer Arithmetic. [7] IEEE Transactions on Computers. [8] D. R. Lutz and D. N. Jayasimha, Programmable modulo-k counters, IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939941, Nov. 1996. [9] H. Makino et al., An 8.8-ns 54 54-bit multiplier with high speed redundant binary architecture, IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773783, June 1996. [10] W. N. Holmes, Composite arithmetic: Proposal for a new standard, IEEE Computer, vol. 30, no. 3, pp. 6573, Mar. 1997.
Computer Arithmetic: Principles, Architectures, and VLSI Design Bibliography 96
[11] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998. [12] A. Tyagi, A reduced-area scheme for carry-select adders, IEEE Trans. Comput., vol. 42, no. 10, pp. 11621170, Oct. 1993. [13] T. Han and D. A. Carlson, Fast area-efcient VLSI adders, in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 4956. [14] R. Zimmermann, Non-heuristic optimization and synthesis of parallel-prex adders, in Proc. Int. Workshop on Logic and Architecture Synthesis, Grenoble, France, Dec. 1996, pp. 123132. [15] D. W. Dobberpuhl et al., A 200-MHz 64-b dual-issue CMOS microprocessor, IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 15551564, Nov. 1992. [16] A. De Gloria and M. Olivieri, Statistical carry lookahead adders, IEEE Trans. Comput., vol. 45, no. 3, pp. 340347, Mar. 1996. [17] V. G. Oklobdzija, D. Villeger, and S. S. Liu, A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach, IEEE Trans. Comput., vol. 45, no. 3, pp. 294305, Mar. 1996.
Computer Arithmetic: Principles, Architectures, and VLSI Design 97
[18] Z. Wang, G. A. Jullien, and W. C. Miller, A new design technique for column compression multipliers, IEEE Trans. Comput., vol. 44, no. 8, pp. 962970, Aug. 1995. [19] J. Cortadella and J. M. Llaberia, Evaluation of A + B = K conditions without carry propagation, IEEE Trans. Comput., vol. 41, no. 11, pp. 14841488, Nov. 1992. [20] S. E. McQuillan and J. V. McCanny, Fast VLSI algorithms for division and square root, J. VLSI Signal Processing, vol. 8, pp. 151168, Oct. 1994. [21] Y. H. Hu, CORDIC-based VLSI architectures for digital signal processing, IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 1635, July 1992. [22] K. C. Chang, Digital Design and Modeling with VHDL and Synthesis, IEEE Computer Society Press, Los Alamitos, California, 1997. [23] R. Zimmermann, VHDL Library of Arithmetic Units,
http://www.iis.ee.ethz.ch/zimmi/arith_lib.html.
[24] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, Norwell, MA, 1995.
98