Computer Arithmetic

Eidgenössische Technische Hochschule Zürich
École polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich
Institut für Integrierte Systeme
Integrated Systems Laboratory
Lecture notes on
Computer Arithmetic: Principles, Architectures, and VLSI Design

June 25, 1998
Reto Zimmermann
Integrated Systems Laboratory, Swiss Federal Institute of Technology (ETH), CH-8092 Zürich, Switzerland, [email protected]
Copyright © 1998 by Integrated Systems Laboratory, ETH Zürich

http://www.iis.ee.ethz.ch/~zimmi/publications/comp_arith_notes.ps.gz
Contents

1 Introduction and Conventions
  1.1 Outline
  1.2 Motivation
  1.3 Conventions
  1.4 Recursive Function Evaluation
2 Arithmetic Operations
  2.1 Overview
  2.2 Implementation Techniques
3 Number Representations
  3.1 Binary Number Systems (BNS)
  3.2 Gray Numbers
  3.3 Redundant Number Systems
  3.4 Residue Number Systems (RNS)
  3.5 Floating-Point Numbers
  3.6 Logarithmic Number System
  3.7 Antitetrational Number System
  3.8 Composite Arithmetic
  3.9 Round-Off Schemes
4 Addition
  4.1 Overview
  4.2 1-Bit Adders, (m, k)-Counters
  4.3 Carry-Propagate Adders (CPA)
  4.4 Carry-Save Adder (CSA)
  4.5 Multi-Operand Adders
  4.6 Sequential Adders
5 Simple / Addition-Based Operations
  5.1 Complement and Subtraction
  5.2 Increment / Decrement
  5.3 Counting
  5.4 Comparison, Coding, Detection
  5.5 Shift, Extension, Saturation
  5.6 Addition Flags
  5.7 Arithmetic Logic Unit (ALU)
6 Multiplication
  6.1 Multiplication Basics
  6.2 Unsigned Array Multiplier
  6.3 Signed Array Multipliers
  6.4 Booth Recoding
  6.5 Wallace Tree Addition
  6.6 Multiplier Implementations
  6.7 Composition from Smaller Multipliers
  6.8 Squaring
7 Division / Square Root Extraction
  7.1 Division Basics
  7.2 Restoring Division
  7.3 Non-Restoring Division
  7.4 Signed Division
  7.5 SRT Division
  7.6 High-Radix Division
  7.7 Division by Multiplication
  7.8 Remainder / Modulus
  7.9 Divider Implementations
  7.10 Square Root Extraction
8 Elementary Functions
  8.1 Algorithms
  8.2 Integer Exponentiation
  8.3 Integer Logarithm
9 VLSI Design Aspects
  9.1 Design Levels
  9.2 Synthesis
  9.3 VHDL
  9.4 Performance
  9.5 Testability
Bibliography
1 Introduction and Conventions

1.1 Outline
Basic principles of computer arithmetic [1, 2, 3, 4, 5, 6, 7]
Circuit architectures and implementations of the main arithmetic operations
Aspects regarding VLSI design of arithmetic units

1.2 Motivation
Arithmetic units are, among others, the core of every data path and addressing unit
The data path is the core of:
  microprocessors (CPU)
  signal processors (DSP)
  data-processing application-specific ICs (ASIC) and programmable ICs (e.g. FPGA)
Standard arithmetic units are available from libraries
Design of arithmetic units is necessary for:
  non-standard operations
  high-performance components
  library development
1.3 Conventions

Naming conventions
Signal buses: $A$ (1-D), $A_i$ (2-D), $a_{i:k}$ (subbus, 1-D)
Signals: $a$, $a_i$ (1-D), $a_{i,k}$ (2-D), $A_{i:k}$ (group signal)
Circuit complexity measures: $A$ (area), $T$ (cycle time, delay), $AT$ (area-time product), $L$ (latency, # cycles)
Arithmetic operators: $+$, $-$, $\cdot$, $/$, $\log$ ($= \log_2$)
Logic operators: $+$ (or), $\cdot$ (and), $\oplus$ (xor), $\odot$ (xnor), $\overline{\phantom{a}}$ (not)

Circuit complexity measures
Unit-gate model ($\approx$ gate-equivalents (GE) model):
  Inverter, buffer: $A = 0$, $T = 0$ (i.e. ignored)
  Simple monotonic 2-input gates (AND, NAND, OR, NOR): $A = 1$, $T = 1$
  Simple non-monotonic 2-input gates (XOR, XNOR): $A = 2$, $T = 2$
  Complex gates: composed from simple gates
    ⇒ Simple $m$-input gates: $A = m - 1$, $T = \lceil \log m \rceil$
Wiring not considered (acceptable for comparison purposes, local wiring, multilevel metallization)
Only estimations given for complex circuits
1.4 Recursive Function Evaluation
Given: inputs $a_i$, outputs $z_i$, function $f$ (graph symbol: •)

Non-recursive functions (n.)
Output $z_i$ is a function of input $a_i$ (or $a_{j+m:j}$, $m$ const.)
  $z_i = f(a_i, x)$ ; $i = 0, \ldots, n-1$
  ⇒ parallel structure: $A = O(n)$, $T = O(1)$

Recursive functions (r.)
Output $z_i$ is a function of all inputs $a_k$, $k \le i$
a) with single output $z$ (r.s.):
  $t_i = f(a_i, t_{i-1})$ ; $i = 0, \ldots, n-1$ ; $t_{-1} = 0/1$ ; $z = t_{n-1}$
  1. $f$ is non-associative (r.s.n.)
     ⇒ serial structure: $A = O(n)$, $T = O(n)$
  2. $f$ is associative (r.s.a.)
     ⇒ serial or single-tree structure: $A = O(n)$, $T = O(\log n)$
b) with multiple outputs $z_i$ (r.m.) (= prefix problem):
  $z_i = f(a_i, z_{i-1})$ ; $i = 0, \ldots, n-1$ ; $z_{-1} = 0/1$
  1. $f$ is non-associative (r.m.n.)
     ⇒ serial structure: $A = O(n)$, $T = O(n)$
  2. $f$ is associative (r.m.a.)
     ⇒ serial or multi-tree structure: $A = O(n^2)$, $T = O(\log n)$
     ⇒ or shared-tree structure: $A = O(n \log n)$, $T = O(\log n)$

[Figures: 4-input example structures (inputs $a_3 \ldots a_0$) for the cases n., r.s.n., r.s.a., r.m.n., and r.m.a. (multi-tree and shared-tree)]
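
The difference between the serial and the tree structure for an associative function can be sketched in a few lines of Python. This is only an illustrative model: the function names are hypothetical, and XOR (parity) is chosen here as an example of an associative $f$; neither comes from the notes.

```python
from operator import xor

def serial_reduce(f, a, init):
    """Single output, serial structure: t_i = f(a_i, t_{i-1}); depth grows with n."""
    t = init
    for x in a:
        t = f(x, t)
    return t

def tree_reduce(f, a):
    """Single output, tree structure (requires associative f); depth ceil(log2 n)."""
    level = list(a)
    depth = 0
    while len(level) > 1:
        # combine neighbours pairwise, higher index as first argument
        nxt = [f(level[i + 1], level[i]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd element is passed on unchanged
        level = nxt
        depth += 1
    return level[0], depth

def serial_prefix(f, a):
    """Multiple outputs, serial structure: z_i = f(a_i, z_{i-1}), with z_0 = a_0."""
    z = [a[0]]
    for x in a[1:]:
        z.append(f(x, z[-1]))
    return z

bits = [1, 0, 1, 1, 0, 1, 1, 1]
parity_serial = serial_reduce(xor, bits, 0)
parity_tree, tree_depth = tree_reduce(xor, bits)
running_parity = serial_prefix(xor, bits)
```

For the 8 inputs above the tree needs 3 levels, while the serial chain passes through 8 evaluations of $f$; both use a number of operators linear in $n$.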

2 Arithmetic Operations
2.1 Overview
[Figure: dependency graph of arithmetic operations for fixed-point numbers, ordered by complexity (arrows: "based on operation", "related operation"); the same graph applies to floating-point numbers]

Operations, in order of increasing complexity:
   1  shift/extension ($\ll$, $\gg$)
   2  comparison ($=$, $<$)
   3  increment/decrement ($+1$, $-1$)
   4  complement ($+/-$)
   5  addition/subtraction ($+$, $-$)
   6  multiplication
   7  division
   8  square root extraction (sqrt(x))
   9  exponential function (exp(x))
  10  logarithm function (log(x))
  11  trigonometric functions (trig(x))
  12  hyperbolic functions (hyp(x))

2.2 Implementation Techniques
Direct implementation of dedicated units:
  always: 1-5
  in most cases: 6
  sometimes: 7, 8
Sequential implementation using simpler units and several clock cycles (⇒ decomposition):
  sometimes: 6
  in most cases: 7, 8, 9
Table look-up techniques using ROMs:
  universal: simple application to all operations
  efficient only for single-operand operations of high complexity (8-12) and small word length (note: ROM size $= 2^n \times n$)
Approximation techniques using simpler units: 7-12
  Taylor series expansion
  polynomial and rational approximations
  convergence of recursive equation systems
  CORDIC (COordinate Rotation DIgital Computer)
3 Number Representations
3.1 Binary Number Systems (BNS)
Radix-2, binary number system (BNS): irredundant, weighted, positional, monotonic [1, 2]
$n$-bit number is an ordered sequence of bits (binary digits): $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_2$, $a_i \in \{0, 1\}$
Simple and efficient implementation in digital circuits
MSB/LSB (most-/least-significant bit): $a_{n-1}$ / $a_0$
Represents an integer or fixed-point number, exact
Fixed-point numbers: $(a_{m-1} \ldots a_0 \,.\, a_{-1} \ldots a_{m-n})$ with $m$-bit integer and $(n-m)$-bit fraction

Unsigned: positive or natural numbers
  Value: $A = a_{n-1} 2^{n-1} + \cdots + a_1 2 + a_0 = \sum_{i=0}^{n-1} a_i 2^i$
  Range: $[0, 2^n - 1]$

Two's (2's) complement: standard representation of signed or integer numbers
  Value: $A = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$
  Range: $[-2^{n-1}, 2^{n-1} - 1]$
  Complement: $-A = 2^n - A = \overline{A} + 1$, where $\overline{A} = (\overline{a}_{n-1}, \overline{a}_{n-2}, \ldots, \overline{a}_0)$
  Sign: $a_{n-1}$
  Properties: asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers)

One's (1's) complement: similar to 2's complement
  Value: $A = -a_{n-1} (2^{n-1} - 1) + \sum_{i=0}^{n-2} a_i 2^i$
  Range: $[-(2^{n-1} - 1), 2^{n-1} - 1]$
  Complement: $-A = 2^n - A - 1 = \overline{A}$
  Sign: $a_{n-1}$
  Properties: double representation of zero, symmetric range, modulo $(2^n - 1)$ number system

Sign-magnitude: alternative representation of signed numbers
  Value: $A = (-1)^{a_{n-1}} \sum_{i=0}^{n-2} a_i 2^i$
  Range: $[-(2^{n-1} - 1), 2^{n-1} - 1]$
  Complement: $-A = (\overline{a}_{n-1}, a_{n-2}, \ldots, a_0)$
  Sign: $a_{n-1}$
  Properties: double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 (⇒ low power)

Graphical representation
[Figure: values represented by the binary codes 000...0, 011...1, 100...0, 111...1 for unsigned, 2's complement, 1's complement, and sign-magnitude numbers]

Conventions
2's complement is used for signed numbers in these notes
Unsigned and signed numbers can be treated equally in most cases; exceptions are mentioned

3.2 Gray Numbers
Gray numbers (code): binary, irredundant, non-weighted, non-monotonic
+ Property: unit-distance coding (i.e. exactly one bit toggles between adjacent numbers)
Applications: counters with low output toggle rate (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times)
Non-monotonic numbers: difficult arithmetic operations, e.g. addition, comparison:
  $(g_1 g_0)$: $00 < 01$ with $g_0$: $0 < 1$, and $11 < 10$ but $g_0$: $1 > 0$
binary → Gray: $g_i = b_{i+1} \oplus b_i$ ; $b_n = 0$ ; $i = 0, \ldots, n-1$ (n.)
Gray → binary: $b_i = b_{i+1} \oplus g_i$ ; $b_n = 0$ ; $i = n-1, \ldots, 0$ (r.m.a.)

  value   binary ($b_3 b_2 b_1 b_0$)   Gray ($g_3 g_2 g_1 g_0$)
    0            0000                      0000
    1            0001                      0001
    2            0010                      0011
    3            0011                      0010
    4            0100                      0110
    5            0101                      0111
    6            0110                      0101
    7            0111                      0100
    8            1000                      1100
    9            1001                      1101
   10            1010                      1111
   11            1011                      1110
   12            1100                      1010
   13            1101                      1011
   14            1110                      1001
   15            1111                      1000
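
The two Gray conversion rules can be checked with a short Python sketch. The function names are illustrative, and integers stand in for the bit vectors; the binary → Gray direction is evaluated for all bits at once (non-recursive), while Gray → binary runs bit by bit from the MSB down, as in the recursive definition.

```python
def binary_to_gray(b: int) -> int:
    """g_i = b_{i+1} XOR b_i with b_n = 0: all bits computed in parallel."""
    return b ^ (b >> 1)

def gray_to_binary(g: int, n: int) -> int:
    """b_i = b_{i+1} XOR g_i with b_n = 0, evaluated for i = n-1 ... 0."""
    b = 0
    prev = 0                      # holds b_{i+1}; starts as b_n = 0
    for i in range(n - 1, -1, -1):
        prev ^= (g >> i) & 1      # b_i
        b |= prev << i
    return b

codes = [binary_to_gray(v) for v in range(16)]
```

`codes` reproduces the Gray column of a 4-bit code table, and neighbouring entries differ in exactly one bit.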
3.3 Redundant Number Systems
Non-binary, redundant, weighted number systems [1, 2]
Digit set larger than radix (typically radix 2) ⇒ multiple representations of same number ⇒ redundancy
+ No carry-propagation in adders ⇒ more efficient implementation of adder-based units (e.g. multipliers and dividers)
Redundancy ⇒ no direct implementation of relational operators ⇒ conversion to irredundant numbers
Several bits used to represent one digit ⇒ higher storage requirements
Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed)

Delayed-carry or half-adder number representation:
  $r_i \in \{0, 1, 2\}$ ; $c_i, s_i, a_i, b_i \in \{0, 1\}$
  $r_i = (c_{i+1}, s_i) = 2c_{i+1} + s_i = a_i + b_i$ ; $c_{i+1} s_i = 0$
  $R = \sum_{i=0}^{n-1} r_i 2^i = (C, S) = C + S = A + B$
  1 digit holds sum of 2 bits (no carry-out digit)
  example: $(00, 10) = 00 + 10 = 01 + 01 = (10, 00)$
  irredundant representation of $-1$ [8], since $c_{i+1} s_i = 0$ and $C + S = -1$ → $S = -1$, $C = 0$

Carry-save number representation:
  $r_i \in \{0, 1, 2, 3\}$ ; $c_i, s_i, a_i, b_i, d_i \in \{0, 1\}$
  $r_i = (c_{i+1}, s_i) = 2c_{i+1} + s_i = a_i + b_i + d_i = a_i + r'_i$
  $R = \sum_{i=0}^{n-1} r_i 2^i = (C, S) = C + S = A + R'$
  1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved)
  standard redundant number system for fast addition

Signed-digit (SD) or redundant digit (RD) number representation:
  $r_i, s_i, t_i \in \{-1, 0, 1\} \equiv \{\overline{1}, 0, 1\}$ ; $R = \sum_{i=0}^{n-1} r_i 2^i$
  no carry-propagation in $S = R + T$:
    $r_i + t_i = (c_{i+1}, u_i) = 2c_{i+1} + u_i$ ; $c_{i+1}, u_i \in \{\overline{1}, 0, 1\}$
    $(c_{i+1}, u_i)$ is redundant (e.g. $0 + 1 = 01 = 1\overline{1}$)
    $\forall i \; \exists (c_i, u_i) \mid c_i + u_i = s_i \in \{\overline{1}, 0, 1\}$
  1 digit holds sum of 2 digits (no carry-out digit)
  minimal SD representation: minimal number of non-zero digits, $011\{1\}10 \rightarrow 100\{0\}\overline{1}0$
  applications: sequential multiplication (fewer cycles), filters with constant coefficients (less hardware)
  example: $7 = (0111 \mid 1\overline{1}11 \mid 10\overline{1}1 \mid 100\overline{1} \mid 1\overline{1}\,\overline{1}11 \mid \ldots)$, where $100\overline{1}$ is minimal
  canonical SD representation: minimal SD + no two non-zero digits in sequence, $01\{1\}10 \rightarrow 10\{0\}\overline{1}0$
  SD → binary: carry-propagation necessary (⇒ adder)
  other applications: high-speed multipliers [9]
  similar to carry-save, simple use for signed numbers
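
As an illustration of the canonical SD representation, the following Python sketch recodes an integer into signed digits with no two adjacent non-zero digits. The recoding rule used here (look at the value modulo 4) is one common way to obtain this form and is an assumption of the sketch, not a method taken from the notes; the function names are hypothetical.

```python
def to_canonical_sd(value: int) -> list:
    """Return signed digits (LSB first), each in {-1, 0, 1}, no two adjacent non-zero."""
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)   # +1 if value mod 4 == 1, -1 if value mod 4 == 3
            value -= d            # the remainder is now divisible by 4
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits

def sd_value(digits: list) -> int:
    """R = sum of r_i * 2^i."""
    return sum(d * (1 << i) for i, d in enumerate(digits))

seven = to_canonical_sd(7)        # digits of 100(-1), LSB first
```

For 7 this yields the digits $100\overline{1}$ (two non-zero digits instead of three), and for 0111110 the digits $100000\overline{1}0$.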
3.4 Residue Number Systems (RNS)
Non-binary, irredundant, non-weighted number system [1]
+ Carry-free and fast additions and multiplications
Complex and slow other arithmetic operations (e.g. comparison, sign and overflow detection) because digits are not weighted; conversion to weighted mixed-radix or binary system required
Codes for error detection and correction [1]
Possible applications (but hardly used):
  digital filters: fast additions and multiplications
  error detection and correction for arithmetic operations in conventional and residue number systems
Base is an $n$-tuple of integers $(m_{n-1}, m_{n-2}, \ldots, m_0)$, the moduli $m_i$ being pairwise relatively prime
  $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_{m_{n-1}, m_{n-2}, \ldots, m_0}$ ; $a_i \in \{0, 1, \ldots, m_i - 1\}$
  Range: $M = \prod_{i=0}^{n-1} m_i$, anywhere in $\mathbb{Z}$
  $a_i = A \bmod m_i = |A|_{m_i}$ ; $A = m_i q_i + a_i$
  $|A|_M = \left| \sum_{i=0}^{n-1} C_i a_i \right|_M$ ; $C_i = (\ldots, 0, 1, 0, \ldots)$ with the 1 at position $i$
Arithmetic operations (each digit computed separately):
  $z_i = |Z|_{m_i} = |f(A)|_{m_i} = |f(|A|_{m_i})|_{m_i} = |f(a_i)|_{m_i}$
  $|A + B|_{m_i} = \big| |A|_{m_i} + |B|_{m_i} \big|_{m_i} = |a_i + b_i|_{m_i}$
  $|A \cdot B|_{m_i} = \big| |A|_{m_i} \cdot |B|_{m_i} \big|_{m_i} = |a_i \cdot b_i|_{m_i}$
  $|-a_i|_{m_i} = |m_i - a_i|_{m_i}$
  $|a_i^{-1}|_{m_i} = |a_i^{m_i - 2}|_{m_i}$ (Fermat's theorem, for prime $m_i$)
Best moduli $m_i$ are $2^k$ and $(2^k - 1)$:
  high storage efficiency with $k$ bits
  simple modular addition: $2^k$: $k$-bit adder without $c_{out}$ ; $2^k - 1$: $k$-bit adder with end-around carry ($c_{in} = c_{out}$)
Example: $(m_1, m_0) = (3, 2)$, $M = 6$

  $A$     -4 -3 -2 -1  0  1  2  3  4  5  6  7  8
  $a_1$    2  0  1  2  0  1  2  0  1  2  0  1  2
  $a_0$    0  1  0  1  0  1  0  1  0  1  0  1  0
  (any $M = 6$ consecutive values form a possible range)

  $|5|_6 = A = (a_1, a_0) = (|5|_3, |5|_2) = (2, 1)$
  $|4 + 5|_6 = (1, 0) + (2, 1) = (|1 + 2|_3, |0 + 1|_2) = (0, 1) = |3|_6$
  $|4 \cdot 5|_6 = (1, 0) \cdot (2, 1) = (|1 \cdot 2|_3, |0 \cdot 1|_2) = (2, 0) = |2|_6$
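
The digit-wise arithmetic and the conversion back to a weighted number can be sketched in Python for the moduli $(3, 2)$ of the example. The function names are illustrative; the reconstruction uses the standard Chinese-remainder construction for the weights $C_i$ (each $C_i$ has residue 1 at position $i$ and 0 elsewhere), which is assumed here to be what the $C_i$ in the notes denote.

```python
from math import prod

MODULI = (3, 2)                       # (m1, m0), M = 6

def to_rns(a, moduli=MODULI):
    """a_i = |A|_{m_i} for every modulus."""
    return tuple(a % m for m in moduli)

def rns_add(x, y, moduli=MODULI):
    """Digit-wise modular addition, no carries between digits."""
    return tuple((xi + yi) % m for xi, yi, m in zip(x, y, moduli))

def rns_mul(x, y, moduli=MODULI):
    """Digit-wise modular multiplication."""
    return tuple((xi * yi) % m for xi, yi, m in zip(x, y, moduli))

def from_rns(digits, moduli=MODULI):
    """|A|_M = |sum_i C_i a_i|_M with C_i = 1 (mod m_i) and 0 (mod m_j), j != i."""
    M = prod(moduli)
    total = 0
    for a_i, m in zip(digits, moduli):
        Mi = M // m
        Ci = Mi * pow(Mi, -1, m)      # modular inverse (Python 3.8+)
        total += Ci * a_i
    return total % M

sum_digits = rns_add(to_rns(4), to_rns(5))    # digits of |4 + 5|_6
prod_digits = rns_mul(to_rns(4), to_rns(5))   # digits of |4 * 5|_6
```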
3.5 Floating-Point Numbers
Larger range, smaller precision than fixed-point representation, inexact, real numbers [1, 2]
Double-number form ⇒ discontinuous precision
  Format: sign $S$ | biased exponent $E$ | unsigned normalized mantissa $M$
  $F = (-1)^S \cdot M \cdot \beta^E = (-1)^S \cdot 1.M \cdot 2^{E - bias}$
Basic arithmetic operations:
  $A \cdot B = (-1)^{S_A \oplus S_B} \cdot M_A M_B \cdot \beta^{E_A + E_B}$
  $A + B = \left( (-1)^{S_A} M_A + (-1)^{S_B} (M_B \gg (E_A - E_B)) \right) \cdot \beta^{E_A}$
  based on fixed-point add, multiply, and shift operations
  postnormalization required ($1/\beta \le M < 1$)
Applications:
  processors: real floating-point formats (e.g. IEEE standard), large range due to universal use
  ASICs: usually simplified floating-point formats with small exponents, smaller range, used for range extension of normal fixed-point numbers
IEEE floating-point format:

  precision   $n$   $n_M$   $n_E$   bias   range                  precision
  single      32    23      8       127    $3.8 \cdot 10^{38}$    $10^{-7}$
  double      64    52      11      1023   $9 \cdot 10^{307}$     $10^{-15}$

3.6 Logarithmic Number System
Alternative representation to floating-point (i.e. mantissa + integer exponent → only fixed-point exponent) [1]
Single-number form ⇒ continuous precision ⇒ higher accuracy, more reliable
  Format: sign $S$ | biased fixed-point exponent $E$
  $L = (-1)^S \cdot \beta^E = (-1)^S \cdot 2^{E - bias}$ (signed-logarithmic)
Basic arithmetic operations:
  $(A < B) = (E_A < E_B)$ (additionally consider sign)
  $A + B$: by approximation or addition in conventional number system and double conversion
  $A \cdot B = (-1)^{S_A \oplus S_B} \cdot \beta^{E_A + E_B}$
  $A^y = (-1)^{S_A} \cdot \beta^{y \cdot E_A}$
  $\sqrt[y]{A} = (-1)^{S_A} \cdot \beta^{E_A / y}$
+ Simpler multiplication/exponentiation, more complex addition
Expensive conversion: (anti)logarithms (table look-up)
Applications: real-time digital filters

3.7 Antitetrational Number System
Tetration (t. $x = 2^{2^{\cdot^{\cdot^{2}}}}$ with $x$ twos) and antitetration (a.t. $x$) [10]
Larger range, smaller precision than logarithmic representation, otherwise analogous (i.e. $2^x$ → t. $x$, $\log x$ → a.t. $x$)
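
To make the floating-point value formula $F = (-1)^S \cdot 1.M \cdot 2^{E - bias}$ concrete, the following Python sketch splits an IEEE single-precision number into its three fields and rebuilds the value. The helper names are hypothetical, and the reconstruction covers normalized numbers only (zero, denormals, infinities, and NaNs are not handled).

```python
import struct

def decode_single(x: float):
    """Return (S, E, M) of the IEEE single-precision encoding of x."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31                 # 1 sign bit
    e = (bits >> 23) & 0xFF        # 8 exponent bits (biased)
    m = bits & 0x7FFFFF            # 23 mantissa bits (hidden leading 1 not stored)
    return s, e, m

def value_from_fields(s, e, m, bias=127, n_m=23):
    """F = (-1)^S * 1.M * 2^(E - bias), valid for normalized numbers."""
    return (-1) ** s * (1 + m / 2 ** n_m) * 2.0 ** (e - bias)

fields = decode_single(-6.5)       # -6.5 = -1.625 * 2^2
rebuilt = value_from_fields(*fields)
```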
3.8 Composite Arithmetic
Proposal for a new standard of number representations [10]
Scheme for storage and display of exact (primary: integer, secondary: rational) and inexact (primary: logarithmic, secondary: antitetrational) numbers
Secondary forms used for numbers not representable by primary ones (⇒ no over-/underflow handling necessary)
Choice of number representation hidden from user, i.e. software/compiler selects format for highest accuracy
Number representations:

  type              tag   value
  integer           00    2's complement integer
  rational          01    slash | denominator | numerator
  logarithmic       10    log integer | log fraction
  antitetrational   11    a.t. integer | a.t. fraction

Rational numbers: slash position (i.e. size of numerator/denominator) is variable and stored (floating slash)
Storage form sizes: 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended)
Implementation: mixed hardware/software solutions
Hardware proposal: long accumulator (4096 bits) holds any floating-point number in fixed-point format ⇒ higher accuracy ⇒ large hardware/software overhead

3.9 Round-Off Schemes
Intermediate results with $d$ additional lower bits (⇒ higher accuracy): $A = (a_{n-1} \ldots a_0 \,.\, a_{-1} \ldots a_{-d})$
Rounding: keeping the error $\varepsilon$ small during final word length reduction: $R = (r_{n-1} \ldots r_0) = A - \varepsilon$
Trade-off: numerical accuracy vs. implementation cost
Truncation:
  $R_{TRUNC} = (a_{n-1} \ldots a_0)$
  bias $= -\frac{1}{2} + \frac{1}{2^{d+1}}$ (= average error)
Round-to-nearest (i.e. normal rounding):
  $R_{ROUND} = (a'_{n-1} \ldots a'_0)$, $A' = A + 0.1_2$
  bias $= \frac{1}{2^{d+1}}$ (nearly symmetric)
  "$+ 0.1$" can often be included in previous operation
Round-to-nearest-even/-odd:
  $R_{ROUND-EVEN} = R_{ROUND}$ if $(a_{-1} \ldots a_{-d}) \ne (1 0 \ldots 0)$, and $(a'_{n-1} \ldots a'_1, 0)$ otherwise
  bias $= 0$ (symmetric)
  mandatory in IEEE floating-point standard
3 guard bits for rounding after floating-point operations: guard bit G (postnormalization), round bit R (round-to-nearest), sticky bit S (round-to-nearest-even)
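
The three round-off schemes and their biases can be compared with a small Python model. It assumes the intermediate result is held as an integer `a` with `d` fractional bits (so its value is `a / 2**d`); the function names and the averaging helper are illustrative only.

```python
from fractions import Fraction

def truncate(a: int, d: int) -> int:
    """Drop the d lower bits."""
    return a >> d

def round_nearest(a: int, d: int) -> int:
    """Add binary 0.1 (one half), then truncate."""
    return (a + (1 << (d - 1))) >> d

def round_nearest_even(a: int, d: int) -> int:
    """As round-to-nearest, but ties (fraction 10...0) go to the even result."""
    r = round_nearest(a, d)
    if a & ((1 << d) - 1) == 1 << (d - 1):
        r &= ~1                    # force the LSB to 0
    return r

def bias(round_fn, d: int) -> Fraction:
    """Average of (rounded - exact) over two integer steps (even and odd)."""
    n = 2 << d
    return sum(Fraction(round_fn(a, d)) - Fraction(a, 1 << d) for a in range(n)) / n

biases = {f.__name__: bias(f, 3) for f in (truncate, round_nearest, round_nearest_even)}
```

With $d = 3$ the measured averages are $-\frac{1}{2} + \frac{1}{16}$, $\frac{1}{16}$, and $0$, matching the bias expressions for truncation, round-to-nearest, and round-to-nearest-even.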
4 Addition
4.1 Overview
[Figure: overview of adder components and their relations (arrows: "based on component", "related component")]
  1-bit adders: HA, FA, (m,k), (m,2)
  carry-propagate adders (CPA): RCA, CSKA, CSLA, CIA, CLA, PPA, COSA
  3-operand carry-save adders: CSA
  multi-operand adders: adder array, adder tree, array adder, tree adder
Legend:
  HA: half-adder
  FA: full-adder
  (m,k): (m,k)-counter
  (m,2): (m,2)-compressor
  CPA: carry-propagate adder
  RCA: ripple-carry adder
  CSKA: carry-skip adder
  CSLA: carry-select adder
  CIA: carry-increment adder
  CLA: carry-lookahead adder
  PPA: parallel-prefix adder
  COSA: conditional-sum adder
  CSA: carry-save adder

4.2 1-Bit Adders, (m, k)-Counters
Add up $m$ bits of same magnitude (i.e. 1-bit numbers)
Output sum as $k$-bit number ($k = \lfloor \log m \rfloor + 1$)
or: count 1's at inputs ⇒ (m, k)-counter [3] (combinational counters)

Half-adder (HA), (2, 2)-counter
  $(c_{out}, s) = 2c_{out} + s = a + b$
  $s = a \oplus b$ (sum)
  $c_{out} = ab$ (carry-out)
  $A = 3$, $T = 2\ (1)$
  [Figures: half-adder symbol and two gate-level schematics]

Full-adder (FA), (3, 2)-counter
  $(c_{out}, s) = 2c_{out} + s = a + b + c_{in}$
  $A = 7$, $T = 4\ (2)$
  $g = ab$ (generate), $p = a \oplus b$ (propagate)
  $c^0 = ab$, $c^1 = a + b$
  $s = a \oplus b \oplus c_{in} = p \oplus c_{in}$
  $c_{out} = ab + a c_{in} + b c_{in} = ab + (a \oplus b) c_{in} = g + p c_{in} = \overline{p} g + p c_{in} = \overline{p} a + p c_{in} = \overline{c}_{in} c^0 + c_{in} c^1$
  [Figures: full-adder symbol and five gate-level schematics]

(m, k)-counters
  $(s_{k-1}, \ldots, s_0) = \sum_{j=0}^{k-1} s_j 2^j = \sum_{i=0}^{m-1} a_i$
  Usually built from full-adders
  Associativity of addition allows conversion from linear to tree structure ⇒ faster at same number of FAs
  $A = 7 \sum_{k=1}^{\log m} \lfloor m 2^{-k} \rfloor \approx 7(m - \log m)$
  $T_{LIN} = 4m + 2\lfloor \log m \rfloor$
  $T_{TREE} = 4\lceil \log_3 m \rceil + 2\lfloor \log m \rfloor$
  Example: (7, 3)-counter
    linear structure: $A = 28$, $T = 14$
    tree structure: $A = 28$, $T = 10$
  [Figures: (m,k)-counter symbol; (7,3)-counter built from four FAs in linear and in tree structure]
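
The full-adder equations and the construction of a (7, 3)-counter from four full-adders can be modelled in Python as follows. The particular tree wiring shown is one possible arrangement chosen for this sketch and is not claimed to match the figure in the notes; the function names are illustrative.

```python
def half_adder(a, b):
    """(c_out, s) with s = a XOR b, c_out = a AND b."""
    return a & b, a ^ b

def full_adder(a, b, c_in):
    """(c_out, s) using generate g = ab and propagate p = a XOR b."""
    g = a & b
    p = a ^ b
    return g | (p & c_in), p ^ c_in

def counter_7_3(a):
    """(7,3)-counter from four full-adders: returns (s2, s1, s0)."""
    c1, t1 = full_adder(a[0], a[1], a[2])   # first group of three bits
    c2, t2 = full_adder(a[3], a[4], a[5])   # second group of three bits
    c3, s0 = full_adder(t1, t2, a[6])       # sum of all weight-1 signals
    s2, s1 = full_adder(c1, c2, c3)         # sum of all weight-2 signals
    return s2, s1, s0

example = counter_7_3([1, 1, 0, 1, 1, 0, 1])   # five 1's at the inputs
```

Here `example` is `(1, 0, 1)`, i.e. binary 101 = 5.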
4.3 Carry-Propagate Adders (CPA)
Add two $n$-bit operands $A$ and $B$ and an optional carry-in $c_{in}$ by performing carry-propagation [1, 2, 11]
Sum $(c_{out}, S)$ is an irredundant $(n+1)$-bit number
  $(c_{out}, S) = c_{out} 2^n + S = A + B + c_{in}$
  $2c_{i+1} + s_i = a_i + b_i + c_i$ ; $i = 0, 1, \ldots, n-1$ ; $c_0 = c_{in}$, $c_{out} = c_n$ (r.m.a.)
  [Figure: CPA symbol]

Ripple-carry adder (RCA)
Serial arrangement of $n$ full-adders
Simplest, smallest, and slowest CPA structure
  $A = 7n$, $T = 2n$, $AT = 14n^2$
  [Figure: chain of $n$ full-adders with carries $c_{in} = c_0, c_1, \ldots, c_{n-1}, c_n = c_{out}$]

Carry-propagation speed-up techniques
a) Concatenation of partial CPAs with fast $c_{in} \rightarrow c_{out}$
  [Figure: partial CPAs for bit groups $n-1{:}j$, ..., $i-1{:}k$, $k-1{:}0$ connected through the group carries $c_j$, $c_i$, $c_k$]
b) Fast carry look-ahead logic for entire range of bits
  [Figure: preprocessing, carry propagation, and postprocessing stages across all bits]
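
The bit-level recurrence $2c_{i+1} + s_i = a_i + b_i + c_i$ translates directly into a behavioural model of a ripple-carry adder. This is a minimal Python sketch with an illustrative function name; it models the arithmetic only, not gate delays.

```python
def ripple_carry_add(a: int, b: int, n: int, c_in: int = 0):
    """Return (c_out, S) for n-bit operands, one full-adder step per bit."""
    c = c_in                       # c_0 = c_in
    s = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        total = a_i + b_i + c      # 2*c_{i+1} + s_i = a_i + b_i + c_i
        s |= (total & 1) << i      # s_i
        c = total >> 1             # c_{i+1}
    return c, s                    # c_out = c_n

result = ripple_carry_add(0b1011, 0b0110, 4)   # 11 + 6 = 17 = 1_0001
```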
Carry-skip adder (CSKA)
Type a): partial CPA with fast $c_k \rightarrow c_i$
  $c_i = \overline{P}_{i-1:k} \, c'_i + P_{i-1:k} \, c_k$ (bit group $(a_{i-1} \ldots a_k)$)
  $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
  1) $P_{i-1:k} = 0$: $c_k \nrightarrow c'_i$ and $c'_i$ selected ($c'_i \rightarrow c_i$)
  2) $P_{i-1:k} = 1$: $c_k \rightarrow c'_i$ but $c'_i$ skipped ($c'_i \nrightarrow c_i$)
  ⇒ path $c_k \rightarrow c'_i \rightarrow c_i$ never sensitized ⇒ fast $c_k \rightarrow c_i$
  ⇒ false path ⇒ inherent logic redundancy ⇒ problems in circuit optimization, timing analysis, and testing
Variable group sizes (faster): larger groups in the middle (minimize delays $a_0 \rightarrow c_k \rightarrow s_{i-1}$ and $a_k \rightarrow c_i \rightarrow s_{n-1}$)
Partial CPA typically is RCA or CSKA (⇒ multilevel CSKA)
Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)
  $A \approx 8n$, $T \approx 4n^{1/2}$, $AT \approx 32n^{3/2}$
  [Figure: partial CPAs per bit group; each group carry $c'_i$ passes a multiplexer controlled by $P_{i-1:k}$ that selects between $c'_i$ and $c_k$]

Carry-select adder (CSLA)
Type a): partial CPA with fast $c_k \rightarrow c_i$ and $c_k \rightarrow s_{i-1:k}$
  $s_{i-1:k} = \overline{c}_k \, s^0_{i-1:k} + c_k \, s^1_{i-1:k}$
  $c_i = \overline{c}_k \, c^0_i + c_k \, c^1_i$
Two CPAs compute two possible results ($c_{in} = 0/1$), group carry-in $c_k$ selects the correct one afterwards
Variable group sizes (faster): larger groups at end (MSB) (balance delays $a_0 \rightarrow c_k$ and $a_k \rightarrow c^0_i$)
Partial CPA typically is RCA, CSLA (⇒ multilevel CSLA), or CLA
High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)
  $A \approx 14n$, $T \approx 2.8n^{1/2}$, $AT \approx 39n^{3/2}$
  [Figure: per bit group two partial CPAs with carry-in 0 and 1; multiplexers controlled by $c_k$ select $s_{i-1:k}$ and $c_i$]
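
The carry-select principle (compute both candidate results per group, then let the incoming group carry pick one) can be sketched behaviourally in Python. The partial CPA is modelled with plain integer addition, and the function names and the group sizes in the example are assumptions made for illustration.

```python
def add_group(a, b, width, c_in):
    """Behavioural partial CPA for one bit group: returns (carry-out, sum)."""
    total = a + b + c_in
    return total >> width, total & ((1 << width) - 1)

def carry_select_add(a, b, group_sizes, c_in=0):
    """Return (c_out, S); group_sizes lists the group widths from LSB to MSB."""
    c, s, k = c_in, 0, 0
    for w in group_sizes:
        mask = (1 << w) - 1
        ag, bg = (a >> k) & mask, (b >> k) & mask
        c0, s0 = add_group(ag, bg, w, 0)     # candidate result for c_k = 0
        c1, s1 = add_group(ag, bg, w, 1)     # candidate result for c_k = 1
        c, sg = (c1, s1) if c else (c0, s0)  # multiplexers controlled by c_k
        s |= sg << k
        k += w
    return c, s

# 16-bit example with groups growing towards the MSB
result = carry_select_add(0xBEEF, 0x1234, [2, 3, 5, 6])
```

In hardware the two candidates of all groups are computed in parallel, so only the multiplexer chain lies on the carry path; the loop above evaluates them sequentially but produces the same result.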
Carry-increment adder (CIA)
Type a): partial CPA with fast $c_k \rightarrow c_i$ and $c_k \rightarrow s_{i-1:k}$
  $s_{i-1:k} = s'_{i-1:k} + c_k$
  $c_i = c'_i + P_{i-1:k} \, c_k$
  $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
Result is incremented after addition, if $c_k = 1$ [12, 11]
Variable group sizes (faster): larger groups at end (MSB) (balance delays $a_0 \rightarrow c_k$ and $a_k \rightarrow c'_i$)
Partial CPA typically is RCA, CIA (⇒ multilevel CIA), or CLA
High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group)
Logic of CPA and incrementer can be merged [11]
  $A \approx 10n$, $T \approx 2.8n^{1/2}$, $AT \approx 28n^{3/2}$
  [Figure: per bit group a partial CPA with carry-in 0, followed by an incrementer (+1) controlled by $c_k$; group carry $c_i$ formed from $c'_i$ and $P_{i-1:k}$]

Example: gate-level schematic of carry-increment adder (CIA)
only 2 different logic cells (bit-slices): IHA and IFA

  $T$               4   6   10  12  14  16  18  20  22  24  26  28  ...  38
  max $n_{group}$           2   3   4   5   6   7   8   9   10  11  ...  16
  max $n$           1   2   4   7   11  16  22  29  37  46  56  67  ...  137

  [Figure: gate-level schematic with bit groups "bit 0", "bit 1", "bits 3,2", "bits 6...4", ..., "bits i-1...k"; each group consists of one IHA cell and otherwise IFA cells]
Conditional-sum adder (COSA)
Type a): optimized multilevel CSLA with $(\log n)$ levels (i.e. double CPAs are merged at higher levels)
Correct sum bits ($s^0_{i-1:k}$ or $s^1_{i-1:k}$) are (conditionally) selected through $(\log n)$ levels of multiplexers
Bit groups of size $2^l$ at level $l$
Higher parallelism, more balanced signal paths
Highest speed-up at highest hardware overhead (2 RCA + more than $(\log n)$ MUX/bit)
  $A \approx 3n \log n$, $T \approx 2 \log n$, $AT \approx 6n \log^2 n$
  [Figure: 4-bit example with two FAs per bit (carry-in 0 and 1) at level 0 and multiplexer levels 1 and 2]

Carry-lookahead adder (CLA), traditional
Type b): carries looked ahead before sum bits computed
Typically 4-bit blocks used (e.g. standard IC SN74181)
  $c_0 = c'_0$
  $c_1 = g_0 + p_0 c'_0$
  $c_2 = g_1 + p_1 g_0 + p_1 p_0 c'_0$
  $c_3 = g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c'_0$
  $g'_3 = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0$
  $p'_3 = p_3 p_2 p_1 p_0$
Hierarchical arrangement using $(\frac{1}{2} \log n)$ levels: $(g'_3, p'_3)$ passed up, $c'_0$ passed down between levels
High speed-up at medium hardware overhead
  $A \approx 14n$, $T \approx 4 \log n$, $AT \approx 56n \log n$
  + preprocessing: $g_i = a_i b_i$, $p_i = a_i \oplus b_i$
  + postprocessing: $s_i = p_i \oplus c_i$
  [Figure: carry-lookahead block (CLB) symbol; 16-bit CLA with four CLBs on $(g_{15}, p_{15}) \ldots (g_0, p_0)$ and one CLB on the second level]

Parallel-prefix adders (PPA)
Type b): universal adder architecture comprising RCA, CIA, CLA, and more (i.e. entire range of area-delay trade-offs from slowest RCA to fastest CLA)
Preprocessing, carry-lookahead, and postprocessing step
Carries calculated using parallel-prefix algorithms
+ High regularity: suitable for synthesis and layout
+ High flexibility: special adders, other arithmetic operations, exchangeable prefix algorithms (i.e. speeds)
+ High performance: smallest and fastest adders
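
The carry equations of a 4-bit carry-lookahead block can be written out directly. The Python sketch below models one such block together with the pre- and postprocessing needed to form a 4-bit adder; the function names are illustrative, and the group signals are returned the way a second lookahead level would consume them.

```python
def cla_block(g, p, c0):
    """4-bit lookahead block; g, p are lists of bit signals (index 0 = LSB)."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    g_grp = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    p_grp = p[3] & p[2] & p[1] & p[0]
    return [c0, c1, c2, c3], g_grp, p_grp

def cla_add4(a, b, c_in=0):
    """4-bit adder: preprocessing, one lookahead block, postprocessing."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(a_bits, b_bits)]       # generate
    p = [x ^ y for x, y in zip(a_bits, b_bits)]       # propagate
    c, g_grp, p_grp = cla_block(g, p, c_in)
    s = sum((p[i] ^ c[i]) << i for i in range(4))     # s_i = p_i XOR c_i
    c_out = g_grp | (p_grp & c_in)                    # carry of the whole group
    return c_out, s

result = cla_add4(0b1011, 0b0110)   # 11 + 6 = 17
```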
Prefix problem
Inputs $(x_{n-1}, \ldots, x_0)$, outputs $(y_{n-1}, \ldots, y_0)$, associative binary operator $\bullet$ [11, 13]
  $(y_{n-1}, \ldots, y_0) = (x_{n-1} \bullet \cdots \bullet x_0, \; \ldots, \; x_1 \bullet x_0, \; x_0)$
  or
  $y_0 = x_0$ ; $y_i = x_i \bullet y_{i-1}$ ; $i = 1, \ldots, n-1$ (r.m.a.)
Associativity of $\bullet$ ⇒ tree structures for evaluation:
  $x_3 \bullet (x_2 \bullet (x_1 \bullet x_0)) = (x_3 \bullet x_2) \bullet (x_1 \bullet x_0)$, but $y_2$ ?
  left side: $y_1 = Y^1_{1:0}$, $y_2 = Y^2_{2:0}$, $y_3 = Y^3_{3:0}$
  right side: $Y^1_{3:2}$, $y_1 = Y^1_{1:0}$, $y_3 = Y^2_{3:0}$
Group variables $Y^l_{i:k}$: covers bits $(x_k, \ldots, x_i)$ at level $l$
Carry-propagation is a prefix problem: $Y^l_{i:k} = (G^l_{i:k}, P^l_{i:k})$
  $(G^0_{i:i}, P^0_{i:i}) = (g_i, p_i)$
  $(G^l_{i:k}, P^l_{i:k}) = (G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1}) \bullet (G^{l-1}_{j:k}, P^{l-1}_{j:k})$ ; $k \le j \le i$
  $\phantom{(G^l_{i:k}, P^l_{i:k})} = (G^{l-1}_{i:j+1} + P^{l-1}_{i:j+1} G^{l-1}_{j:k}, \; P^{l-1}_{i:j+1} P^{l-1}_{j:k})$
  $c_{i+1} = G^m_{i:0}$ ; $i = 0, \ldots, n-1$ ; $l = 1, \ldots, m$
Parallel-prefix algorithms [14]:
  multi-tree structures ($T = O(n) \rightarrow O(\log n)$)
  sharing subtrees ($A = O(n^2) \rightarrow O(n \log n)$)
  different algorithms trading area vs. delay (influences also from wiring and maximum fan-out $FO_{max}$)

Prefix adder structure
  preprocessing: $g_i = a_i b_i$, $p_i = a_i \oplus b_i$
  carry-lookahead: prefix algorithm
  postprocessing: $s_i = p_i \oplus c_i$
  $A \approx 5n + 3A_{\bullet}$, $T = 4 + 2T_{\bullet}$
  [Figure: prefix adder with preprocessing producing $(g_{n-1}, p_{n-1}) \ldots (g_0, p_0)$, prefix carry-lookahead producing $c_n \ldots c_0$, and postprocessing]
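
The prefix operator on (generate, propagate) pairs and its use for carry computation can be sketched in Python. The serial evaluation below corresponds to the recursive form $y_i = x_i \bullet y_{i-1}$; folding the carry-in into $g_0$ is an assumption of this sketch, and the function names are illustrative.

```python
def dot(hi, lo):
    """Prefix operator: (G, P) o (G', P') = (G + P G', P P')."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo

def prefix_add(a, b, n, c_in=0):
    """n-bit adder: preprocessing, serial prefix computation, postprocessing."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x ^ y for x, y in zip(a_bits, b_bits)]
    # assumption of this sketch: the carry-in is absorbed into the LSB generate
    y = [(g[0] | (p[0] & c_in), p[0])]
    for i in range(1, n):
        y.append(dot((g[i], p[i]), y[i - 1]))     # y_i = x_i o y_{i-1}
    c = [c_in] + [G for G, _ in y]                # c_{i+1} = G_{i:0}
    s = sum((p[i] ^ c[i]) << i for i in range(n)) # s_i = p_i XOR c_i
    return c[n], s

result = prefix_add(0b1011, 0b0110, 4)
```

Because `dot` is associative, the serial loop may be replaced by any tree-shaped evaluation order without changing the resulting carries.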
4 Addition
4 Addition
Prex algorithms Algorithms visualized by directed acyclic graphs (DAG) with array structure (n bits m levels) Graph vertex symbols : l 1 1 (Gil;+1 Pil:;+1) (Gj;k1 Pjl:;1 ) :j j : k
Sklansky parallel-prex algorithm () PPA-SK) Tree-like collection, parallel redistribution of carries
A
0 1 2 3 4
1 2
n log n T = dlog ne FOmax
1 2
? ; y ;? l l ; (Gli:k Pil:k ) (Gi:k Pi:k )

(contains logic for )
? i ;?l l (Gli:k Pil:k ) (Gi:k Pi:k )

(contains no logic)
(Gil;1 Pil:;1 ) :k k
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
sk.epsi///gures 67 30 mm
Performance measures : A : graph size (number of black nodes)
Brent-Kung parallel-prex algorithm () PPA-BK) Traditional CLA is PPA-BK with 4-bit groups Tree-like redistribution of carries (fan-out tree)
: graph depth (number of black nodes on critical path)
Serial-prex algorithm () RCA)
A = n ; 1 T = n ; 1 FOmax = 2
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 ... 14 15
A = 2n ; dlog ne ; 2 T = 2dlog ne ; 2 FOmax log n

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6
ser.epsi///gures 69 38 mm
bk.epsi///gures 67 38 mm
36
37
Kogge-Stone parallel-prex algorithm () PPA-KS) very high wiring requirements
Mixed serial/parallel-prex algorithm () RCA + PPA) linear size-depth trade-off using parameter k : 0
n log n ; n + 1 T = dlog ne FOmax = 2

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 ks.epsi///gures 67 52 mm
k n ; 2dlog ne + 2
k = 0 : serial-prex graph k = n ; 2dlog ne + 1 : Brent-Kung parallel-prex

graph lls gap between RCA and PPA-BK (i.e. CLA) in steps of single -operations
A = n ; 1 + k T = n ; 1 ; k FOmax = var.
4 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10
Carry-increment parallel-prefix algorithm ($\Rightarrow$ CIA)
$A \approx 2n - 1.4 n^{1/2}$, $T \approx 1.4 n^{1/2}$, $FO_{max} \approx 1.4 n^{1/2}$
[Figure cia.epsi: carry-increment prefix graph, bit columns 15 ... 0, levels 0 ... 5]
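The prefix graphs above can be tried out in software. The following Python sketch is an illustration only: the function names (`combine`, `serial`, `sklansky`, `kogge_stone`, `prefix_add`) are hypothetical, the word length is assumed to be a power of two, and $c_{in} = 0$ is assumed. It evaluates the serial, Sklansky, and Kogge-Stone graphs on (generate, propagate) pairs and counts black nodes (size $A$) and levels (depth $T$), so the formulas given for these three graphs can be checked numerically.

```python
from math import log2

def combine(hi, lo):
    """Prefix operator on (G, P) pairs: upper group combined with lower group."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def serial(y):
    """Serial-prefix graph (ripple-carry): one black node per bit position."""
    y, size = list(y), 0
    for i in range(1, len(y)):
        y[i] = combine(y[i], y[i - 1])
        size += 1
    return y, size, len(y) - 1

def sklansky(y):
    """Sklansky graph: at level l, the upper half of every 2^(l+1)-bit block
    combines with the top bit of the lower half (which is unchanged at level l)."""
    y, size, levels = list(y), 0, int(log2(len(y)))
    for l in range(levels):
        for i in range(len(y)):
            if (i >> l) & 1:
                j = ((i >> l) << l) - 1
                y[i] = combine(y[i], y[j])
                size += 1
    return y, size, levels

def kogge_stone(y):
    """Kogge-Stone graph: at level l, bit i combines with bit i - 2^l."""
    y, size, levels = list(y), 0, int(log2(len(y)))
    for l in range(levels):
        prev = list(y)  # all nodes of one level read the previous level
        for i in range(1 << l, len(y)):
            y[i] = combine(prev[i], prev[i - (1 << l)])
            size += 1
    return y, size, levels

def prefix_add(a, b, n, prefix):
    """n-bit addition: preprocessing, prefix graph, postprocessing (c_in = 0)."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    y, size, depth = prefix(gp)
    carries = [0] + [g for g, _ in y]          # c_0 = 0, c_{i+1} = G_{i:0}
    s = sum((gp[i][1] ^ carries[i]) << i for i in range(n))
    return s | (carries[n] << n), size, depth
```

For example, `prefix_add(a, b, 16, sklansky)` returns the 17-bit sum together with the node count and depth of the graph that produced it.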
Example: 4-bit parallel-prefix adder (PPA-SK)
Efficient AND-OR-prefix circuit for the generate signals and AND-prefix circuit for the propagate signals
Optimization: alternating AOI-/OAI- and NAND-/NOR-gates (inverting gates are smaller and faster)
Can also be realized using two MUX-prefix circuits
[Figure askgate.epsi: 4-bit PPA-SK with inputs $a_3 b_3 \ldots a_0 b_0$, $c_{in}$ and outputs $c_{out}$, $P_{n-1:0}$, $s_3 \ldots s_0$]

Prefix adder synthesis
Local prefix graph transformation:
[Figure unfact.epsi: unfactorized graph, $A = 3$, $T = 3$] $\rightleftarrows$ [Figure fact.epsi: factorized graph, $A = 4$, $T = 2$]
(depth-decreasing transform from left to right, size-decreasing transform from right to left)
Repeated (local) prefix transformations result in overall minimization of graph depth or size $\Rightarrow$ which sequence?
Goal: minimal size (area) at given depth (delay)
Simple algorithm for sequence of applied transforms:
Step 1: prefix graph compression (depth minimization): depth-decreasing transforms in right-to-left bottom-up order
Step 2: prefix graph expansion (size minimization): size-decreasing transforms in left-to-right top-down order, if allowed depth not exceeded
Prefix adder synthesis: 1) generate serial-prefix graph, 2) graph compression, 3) depth-controlled graph expansion, 4) generate pre-/postprocessing and prefix logic
+ Generates all previous prefix graphs (except PPA-KS)
+ Universal adder synthesis algorithm: generates area-optimal adders for any given timing constraints [14] (including non-uniform signal arrival times)
Multilevel adders
Multilevel versions of adders of type a) possible (CSKA, CSLA, and CIA; notation: 2-level CIA = CIA-2L)
+ Delay is $O(n^{1/(m+1)})$ for $m$ levels
Area increase small for CSKA and CIA, high for CSLA ($\Rightarrow$ COSA)
Difficult computation of optimal group sizes

Hybrid adders
Arbitrary combinations of speed-up techniques possible $\Rightarrow$ hybrid/mixed adder architectures
Often used combinations: CLA and CSLA [15]
Pure architectures usually perform best (at gate level)

Self-timed adders
+ RCA is fast in average case ($T = O(\log n)$), slow in worst case $\Rightarrow$ suitable for self-timed asynchronous designs [16]
Average carry-propagation length: $\log n$
Completion detection is not trivial

Transistor-level adders
Influence of logic styles (e.g. dynamic logic, pass-transistor logic $\Rightarrow$ faster)
+ Efficient transistor-level implementation of ripple-carry chains (Manchester chain) [15]
+ Combinations of speed-up techniques make sense
Much higher design effort
Many efficient implementations exist and have been published

Adder performance comparisons
Standard-cell implementations, 0.8 $\mu$m process
[Figure addperf.ps: area [$\lambda^2$] vs. delay [ns] for 8-, 16-, 32-, 64-, and 128-bit RCA, CSKA-2L, CIA-1L, CIA-2L, PPA-SK, PPA-BK, CLA, and COSA, with lines of constant $AT$]
Complexity comparison under the unit-gate model

adder       A                          T                 AT
RCA         $7n$                       $2n$              $14n^2$
CSKA-1L     $8n$                       $4n^{1/2}$        $32n^{3/2}$
CSKA-2L     $8n$                       $x n^{1/3}$ (4)   $x n^{4/3}$ (4)
CSLA-1L     $14n$                      $2.8n^{1/2}$      $39n^{3/2}$
CIA-1L      $10n$                      $2.8n^{1/2}$      $28n^{3/2}$
CIA-2L      $10n$                      $3.6n^{1/3}$      $36n^{4/3}$
CIA-3L      $10n$                      $4.4n^{1/4}$      $44n^{5/4}$
PPA-SK      $\frac{3}{2} n \log n$     $2 \log n$        $3n \log^2 n$
PPA-BK      $10n$                      $4 \log n$        $40n \log n$
PPA-KS      $3n \log n$                $2 \log n$        $6n \log^2 n$
CLA (5)     $14n$                      $4 \log n$        $56n \log n$
COSA        $3n \log n$                $2 \log n$        $6n \log^2 n$

[Columns opt. (1) and syn. (2): entries aaa, aat, (3), att, att, ttt, att and six check marks; row alignment lost]
(1) optimality regarding area and delay: aaa: smallest area, longest delay; aat: small area, medium delay; att: medium area, short delay; ttt: large area, shortest delay; otherwise: not optimal
(2) obtained from prefix adder synthesis
(3) automatic logic optimization not possible (redundancy)
(4) exact factors not calculated
(5) corresponds to 4-bit PPA-BK

4.4 Carry-Save Adder (CSA)
a) Adds three $n$-bit operands $A_0$, $A_1$, $A_2$ performing no carry-propagation (i.e. carries are saved) [1]
$(C, S) = C + S = A_0 + A_1 + A_2$
$2c_{i+1} + s_i = a_{0,i} + a_{1,i} + a_{2,i}$ ; $i = 0, 1, \ldots, n-1$ (n.)
b) Adds one $n$-bit operand to an $n$-digit carry-save operand
$(C, S)_{out} = A + (C, S)_{in}$
Result is in redundant carry-save format ($n$ digits), represented by two $n$-bit numbers $S$ (sum bits) and $C$ (carry bits)
+ Parallel arrangement of $n$ full-adders, constant delay
$A = 7n$, $T = 4$
[Figure csasymbol.epsi: CSA symbol with inputs $A_0$, $A_1$, $A_2$ and outputs $C$, $S$]
[Figure csa.epsi: row of $n$ full-adders with inputs $a_{0,i}$, $a_{1,i}$, $a_{2,i}$ and outputs $c_{i+1}$, $s_i$]
Multi-operand carry-save adders ($m > 3$) $\Rightarrow$ adder array (linear arrangement), adder tree (tree arrangement)
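A carry-save adder is easy to model with bitwise operations on integers. The sketch below is only an illustration (the names `carry_save_add` and `multi_operand_add` are hypothetical): it forms the sum and carry words of a row of full-adders and then uses a linear chain of such adders followed by one final carry-propagate addition.

```python
def carry_save_add(a0, a1, a2):
    """One CSA row: per-bit full-adders, no carry propagation between bits."""
    s = a0 ^ a1 ^ a2                                  # sum bits s_i
    c = ((a0 & a1) | (a0 & a2) | (a1 & a2)) << 1      # carry bits c_{i+1}, weight 2
    return c, s

def multi_operand_add(operands):
    """Linear CSA arrangement: accumulate in carry-save form, then one final CPA."""
    c, s = 0, 0
    for a in operands:
        c, s = carry_save_add(a, c, s)
    return c + s                                      # final carry-propagate addition
```

The invariant is that `c + s` always equals the sum of the operands consumed so far, which is exactly the relation $(C, S) = C + S$ used above.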
4.5 Multi-Operand Adders
Add three or more ($m > 2$) $n$-bit operands, yield $(n + \lceil \log m \rceil)$-bit result in irredundant number representation [1, 2]

Array adders
Realization by array adders (see figures):
a) linear arrangement of CPAs
b) linear arrangement of CSAs (adder array) and final CPA
a) and b) differ in bit arrival times at final CPA:
$\Rightarrow$ if CPA = RCA: a) and b) have same overall delay
$\Rightarrow$ if fast final CPA: uniform bit arrival times required $\Rightarrow$ CSA array (b)
Fast implementation: CSA array + fast final CPA (note: array of fast CPAs not efficient/necessary)
$A = (m - 2) A_{CSA} + A_{CPA}$, $T = (m - 2) T_{CSA} + T_{CPA}$
CPA = RCA: $A = O(mn + n)$, $T = O(m + n)$
Fast CPA: $A = O(mn + n \log n)$, $T = O(m + \log n)$
[Figure mopadd.epsi: operands $A_0, A_1, A_2, A_3, \ldots, A_{m-1}$ added by a chain of CSAs and a final CPA producing $S$]
[Figure cparray.epsi: a) 4-operand CPA (RCA) array built from FA and HA cells, outputs $s_n \ldots s_0$]
[Figure csarray.epsi: b) 4-operand CSA array with final CPA (RCA), outputs $s_n \ldots s_0$]
(m, 2)-compressors
$2 \left( c + \sum_{l=0}^{m-4} c_{out}^{l} \right) + s = \sum_{i=0}^{m-1} a_i + \sum_{l=0}^{m-4} c_{in}^{l}$
$A = 7(m - 2)$, $T_{LIN} = 4(m - 2)$, $T_{TREE} = 6(\lceil \log m \rceil - 1)$
[Figure cprsymbol.epsi: (m, 2)-compressor symbol with inputs $a_0 \ldots a_{m-1}$, $c_{in}^{0} \ldots c_{in}^{m-4}$ and outputs $c_{out}^{0} \ldots c_{out}^{m-4}$, $c$, $s$]
1-bit adders (similar to (m, k)-counters) [17]
Compresses $m$ bits down to 2 by forwarding $(m - 3)$ intermediate carries to next higher bit position
Is bit-slice of multi-operand CSA array
+ No horizontal carry-propagation (i.e. $c_{in}^{l} \to c_{out}^{k}$ only for $k > l$)
Built from full-adders (= (3, 2)-compressor) or (4, 2)-compressors arranged in linear or tree structures
Example: 4-operand adder using (4, 2)-compressors
[Figure cpradd.epsi: row of (4, 2)-compressors (CSA) followed by a CPA built from FA and HA cells, outputs $s_{n+1} \ldots s_0$]

Optimized (4, 2)-compressor: 2 full-adders merged and optimized (i.e. XORs arranged in tree structure)
with full-adders: $A = 14$, $T = 8$ [Figure cpr42fa.epsi]
optimized: $A = 14$, $T = 6$ [Figure cpr42opt.epsi]
+ same area, 25% shorter delay
SD-FA (signed-digit full-adder) is similar to (4, 2)-compressor regarding structure and complexity
Advantages of (4, 2)-compressors over FAs for realizing (m, 2)-compressors:
higher compression rate (4:2 instead of 3:2)
less deep and more regular trees

tree depth             0   1   2   3   4   5   6   7   8   9   10
# operands (FA)        2   3   4   6   9   13  19  28  42  63  94
# operands ((4, 2)):   2, 4, 8, 16, 32, 64, 128 (doubling with every (4, 2)-compressor level)

Tree adders (Wallace tree)
Adder tree: $n$-bit $m$-operand carry-save adder composed of $n$ tree-structured (m, 2)-compressors [1, 18]
Tree adders: fastest multi-operand adders using an adder tree and a fast final CPA
$A = A_{(m,2)} \, n + A_{CPA} = O(mn + n \log n)$
$T = T_{(m,2)} + T_{CPA} = O(\log m + \log n)$

Adder arrays and adder trees revisited
Some FA can often be replaced by HA or eliminated (i.e. redundant due to constant inputs)
Number of (irredundant) FA does not depend on adder structure, but number of HA does
An $m$-operand adder accommodates $(m - 1)$ carry inputs
Adder trees ($T = O(\log n)$) are faster than adder arrays ($T = O(n)$) at same amount of gates ($A = O(mn)$)
Adder trees are less regular and have more complex routing than adder arrays $\Rightarrow$ larger area, difficult layout (i.e. limited use in layout generators)
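The operand counts in the tree-depth table follow from a simple recurrence, sketched here in Python under the assumption that every compressor level turns `ins` signals into 2 and that a leftover signal simply passes through a level. The function name `max_operands` is hypothetical.

```python
def max_operands(depth, ins):
    """Largest number of operands that a tree of `depth` compressor levels
    (each compressor: `ins` inputs, 2 outputs) reduces to 2."""
    m = 2
    for _ in range(depth):
        # working backwards: every pair of outputs comes from one compressor
        # with `ins` inputs, an unpaired signal is passed through unchanged
        m = (m // 2) * ins + m % 2
    return m

fa_row = [max_operands(d, 3) for d in range(11)]    # full-adder (3, 2) levels
c42_row = [max_operands(d, 4) for d in range(7)]    # (4, 2)-compressor levels
```

With `ins = 3` the recurrence gives the full-adder row of the table, and with `ins = 4` it gives the doubling sequence of the (4, 2)-compressor row.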
Example: (8, 2)-compressor
full-adder tree: $A = 42$, $T = 16$ [Figure cpr82fa.epsi: inputs $a_0 \ldots a_7$, carries $c_{in}^{0} \ldots c_{in}^{4}$ and $c_{out}^{0} \ldots c_{out}^{4}$, outputs $c$, $s$]
(4, 2)-compressor tree: $A = 42$, $T = 12$ [Figure cpr82cpr42.epsi: same interface, built from three (4, 2)-compressors]
4.6 Sequential Adders
Bit-serial adder: sequential $n$-bit adder
$A = A_{FA} + A_{FF}$, $T = T_{FA} + T_{FF}$, $L = n$
[Figure bitseradd.epsi: FA with inputs $a_i$, $b_i$ and output $s_i$]

Accumulators: sequential $m$-operand adders
With CPA:
$A = A_{CPA} + A_{REG}$, $T = T_{CPA} + T_{REG}$, $L = m$
[Figure accucpa.epsi: CPA with input $A$ and accumulated output $S$]
With CSA and final CPA:
Allows higher clock rates
Final CPA too slow $\Rightarrow$ pipelining or multiple cycles for evaluation
$A = A_{CSA} + A_{CPA} + 4A_{REG}$, $T = T_{CSA} + T_{REG}$, $L = m$
[Figure accucsa.epsi: CSA with input $A$, followed by CPA with output $S$]
Mixed CSA/CPA: CSA with partial CPAs (i.e. fewer carries saved), trade-off between speed and register size

5 Simple / Addition-Based Operations
5.1 Complement and Subtraction
2's complementer (negation):
$-A = \overline{A} + 1$
[Figure neg.epsi: inverter and +1 with input $A$ and output $Z$]
2's complement subtractor:
$A - B = A + (-B) = A + \overline{B} + 1$
[Figure sub.epsi: CPA with inputs $A$, $\overline{B}$, carry-in 1, outputs $c_{out}$, $S$]
2's complement adder/subtractor:
$A \pm B = A + (-1)^{sub} B = A + (B \oplus sub) + sub$
[Figure addsub.epsi: CPA with inputs $A$, $B \oplus sub$, carry-in $sub$, outputs $c_{out}$, $S$]
1's complement adder:
$A + B \pmod{2^n - 1} = A + B + c_{out}$ (end-around carry)
[Figure addmod.epsi: CPA with $c_{out}$ fed back to $c_{in}$, output $S$]
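The adder/subtractor and the end-around-carry adder of Section 5.1 can be checked with a few lines of Python. This is a sketch under the assumption of unsigned $n$-bit bit patterns; the function names are hypothetical.

```python
def add_sub(a, b, sub, n):
    """2's complement adder/subtractor: S = A + (B xor sub) + sub."""
    mask = (1 << n) - 1
    b_x = b ^ (mask if sub else 0)       # invert every bit of B when sub = 1
    total = a + b_x + sub                # sub doubles as carry-in
    return total & mask, total >> n      # (S, c_out)

def add_end_around(a, b, n):
    """1's complement addition: the carry-out is added back in (end-around carry)."""
    mask = (1 << n) - 1
    total = a + b
    return (total & mask) + (total >> n)
```

Note that `add_end_around` may return the all-ones pattern $2^n - 1$, which represents zero modulo $2^n - 1$.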
5.2 Increment / Decrement
Incrementer
Adds a single bit $c_{in}$ to an $n$-bit operand $A$
$(c_{out}, Z) = c_{out} 2^n + Z = A + c_{in}$
$z_i = a_i \oplus c_i$, $c_{i+1} = a_i c_i$ ; $i = 0, \ldots, n-1$
$c_0 = c_{in}$, $c_{out} = c_n$ (r.m.a.)
Corresponds to addition with $B = 0$ ($\Rightarrow$ FA $\to$ HA)
[Figure incsymbol.epsi: incrementer symbol (+1) with input $A$, $c_{in}$ and outputs $c_{out}$, $Z$]
Example: ripple-carry incrementer using half-adders, or using incrementer slices (= half-adder)
$A = 3n$, $T = n + 1$, $AT \approx 3n^2$
[Figures incfa.epsi, inc.epsi: chain of HA cells with inputs $a_{n-1} \ldots a_0$, $c_{in}$ and outputs $c_{out}$, $z_{n-1} \ldots z_0$]
Prefix problem: $C_{i:k} = C_{i:j+1} C_{j:k}$ $\Rightarrow$ AND-prefix structure
$A \approx \frac{1}{2} n \log n + 2n$, $T = \lceil \log n \rceil + 2$, $AT \approx \frac{1}{2} n \log^2 n$

Decrementer
$(c_{out}, Z) = A - c_{in}$
[Figure dec.epsi: ripple decrementer with inputs $a_{n-1} \ldots a_0$, $c_{in}$ and outputs $c_{out}$, $z_{n-1} \ldots z_0$]

Incrementer-decrementer
$(c_{out}, Z) = A \pm c_{in} = A + (-1)^{dec} c_{in}$
[Figure incdec.epsi: inputs $a_{n-1} \ldots a_0$, $dec$, $c_{in}$ and outputs $c_{out}$, $z_{n-1} \ldots z_0$]
Fast incrementers
4-bit incrementer using multi-input gates:
[Figure inccg.epsi: inputs $a_3 \ldots a_0$, $c_{in}$ and outputs $c_{out}$, $z_3 \ldots z_0$]
Prefix problem $\Rightarrow$ AND-prefix structure
8-bit parallel-prefix incrementer (Sklansky AND-prefix structure):
[Figure incpp.epsi: inputs $a_7 \ldots a_0$, $c_{in}$ and outputs $c_{out}$, $z_7 \ldots z_0$]

Gray incrementer
Increments in Gray number system
$c_0 = a_{n-1} \oplus a_{n-2} \oplus \cdots \oplus a_0$ (parity)
$c_{i+1} = \overline{a_i} \, c_i$ ; $i = 0, \ldots, n-3$ (r.m.a.)
$z_0 = a_0 \oplus \overline{c_0}$
$z_i = a_i \oplus a_{i-1} c_{i-1}$ ; $i = 1, \ldots, n-2$
$z_{n-1} = a_{n-1} \oplus c_{n-2}$
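The Gray incrementer recursion can be checked against the usual binary-to-Gray mapping. In the Python sketch below the placement of the complements is an assumption: the overbars are not legible in the extracted source, so the code uses $c_0$ = odd parity, $c_{i+1} = \overline{a_i} c_i$, and $z_0 = a_0 \oplus \overline{c_0}$, a choice that passes an exhaustive check. The names `gray_increment` and `to_gray` are hypothetical, and $n \ge 2$ is assumed.

```python
def to_gray(k):
    """Standard binary-to-Gray conversion, used only as a reference."""
    return k ^ (k >> 1)

def gray_increment(a, n):
    """Increment an n-bit Gray-coded value a (n >= 2) bit by bit."""
    bit = lambda i: (a >> i) & 1
    c = [0] * (n - 1)                       # c_0 .. c_{n-2}
    c[0] = bin(a).count("1") & 1            # parity of a (assumed: 1 = odd)
    for i in range(n - 2):                  # i = 0 .. n-3
        c[i + 1] = (1 - bit(i)) & c[i]      # all lower bits zero and parity odd
    z = bit(0) ^ (1 - c[0])                 # flip LSB when parity is even
    for i in range(1, n - 1):               # flip bit left of the rightmost 1
        z |= (bit(i) ^ (bit(i - 1) & c[i - 1])) << i
    z |= (bit(n - 1) ^ c[n - 2]) << (n - 1)
    return z
```

The last line also covers the wrap-around from the pattern 10...0 back to 00...0.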
5.3 Counting
Count clock cycles $\Rightarrow$ counter, divide clock frequency $\Rightarrow$ frequency divider ($c_{out}$)

Binary counter
Sequential in-/decrementer
Incrementer speed-up techniques applicable
Down- and up-down-counters using decrementers / incrementer-decrementers
[Figure cntblock.epsi: +1 block with register, signals $c_{in}$, $c_{out}$, clk, $Q$]
Example: ripple-carry up-counter using counter slices (= HA + FF), $c_{in}$ is count enable
[Figure cntripple.epsi: counter slices with outputs $q_{n-1} \ldots q_0$, $c_{in}$, $c_{out}$]
Asynchronous counter using toggle-flip-flops (lower toggle rate $\Rightarrow$ lower power)
[Figure cntasync.epsi: chain of T flip-flops with clk and outputs $q_{n-1} \ldots q_0$]
Fast divider ($T = O(1)$) using delayed-carry numbers (irredundant carry-save representation of $-1$ allows using fast carry-save incrementer) [8]

Gray counter
Counter using Gray incrementer

Ring counters
Shift register connected to ring:
[Figure cntring.epsi: outputs $q_{n-1} \ldots q_0$]
State is not encoded $\Rightarrow$ $n$ FF for counting $n$ states
Must be initialized correctly (e.g. 00...01)
Applications: fast dividers (no logic between FF), state counter for one-hot coded FSMs
Johnson / twisted-ring counter (inverted feed-back):
[Figure cntjohnson.epsi: outputs $q_{n-1} \ldots q_0$]
$n$ FF for counting $2n$ states
5.4 Comparison, Coding, Detection
Comparison operations
$EQ = (A = B)$ (equal)
$NE = (A \ne B) = \overline{EQ}$ (not equal)
$GE = (A \ge B)$ (greater or equal)
$LT = (A < B) = \overline{GE}$ (less than)
$GT = (A > B) = GE \cdot \overline{EQ}$ (greater than)
$LE = (A \le B) = \overline{GT} = \overline{GE} + EQ$ (less or equal)

Comparators
Subtractor ($A - B$):
$GE = c_{out}$, $EQ = P_{n-1:0}$ (for free in PPA)
$A_{RCA} = 7n$, $T_{RCA} = 2n$ or $A_{PPA-KS} \approx \frac{3}{2} n \log n$, $T_{PPA-KS} \approx 2 \log n$
[Figure cmpsub.epsi: CPA with $GE = c_{out}$ and $EQ = P_{n-1:0}$]
Optimized comparator:
removing redundancies in subtractor (unused $s_i$), single-tree structure $\Rightarrow$ speed-up at no cost:
$A = 6n$, $T_{LIN} = 2n$, $T_{TREE} \approx 2 \log n$
example: ripple comparator using comparator slices
[Figure cmpripple.epsi: comparator slices (equality & magnitude, magnitude, equality) with inputs $a_{n-1} b_{n-1} \ldots a_0 b_0$ and outputs $GE$, $EQ$]

Equality comparison
$EQ = (A = B)$
$eq_{i+1} = (a_i = b_i) \, eq_i = \overline{(a_i \oplus b_i)} \, eq_i$ ; $i = 0, \ldots, n-1$
$eq_0 = 1$, $EQ = eq_n$ (r.s.a.)
[Figure cmpeq.epsi: inputs $a_{n-1} b_{n-1} \ldots a_0 b_0$, output $EQ$]

Magnitude comparison
$GE = (A \ge B)$
$ge_{i+1} = (a_i > b_i) + (a_i = b_i) \, ge_i = a_i \overline{b_i} + \overline{(a_i \oplus b_i)} \, ge_i$ ; $i = 0, \ldots, n-1$
$ge_0 = 1$, $GE = ge_n$ (r.s.a.)
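The two ripple recursions for $EQ$ and $GE$ translate directly into a bit-serial loop. The Python sketch below is an illustration with hypothetical names; it scans from LSB to MSB, starts from $eq_0 = ge_0 = 1$, and derives the remaining four comparison results from $EQ$ and $GE$ as listed under "Comparison operations".

```python
def ripple_compare(a, b, n):
    """Ripple comparator for unsigned n-bit operands (one slice per bit)."""
    eq, ge = 1, 1                               # eq_0 = 1, ge_0 = 1
    for i in range(n):                          # LSB first, MSB decides last
        ai, bi = (a >> i) & 1, (b >> i) & 1
        same = 1 - (ai ^ bi)                    # (a_i = b_i)
        ge = (ai & (1 - bi)) | (same & ge)      # ge_{i+1}
        eq = same & eq                          # eq_{i+1}
    return {"EQ": eq, "NE": 1 - eq, "GE": ge, "LT": 1 - ge,
            "GT": ge & (1 - eq), "LE": (1 - ge) | eq}
```

Because higher bits are processed later, a difference in a more significant position overrides whatever the lower positions produced.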
Decoder
Decodes binary number $A_{n-1:0}$ to vector $Z_{m-1:0}$ ($m = 2^n$)
$z_i = 1$ if $A = i$, $0$ else ; $i = 0, \ldots, m-1$
$Z = 2^A$
$A = (n - 1) 2^n$, $T = \lceil \log n \rceil$
[Figures decodersym.epsi, decoder.epsi: decoder with inputs $a_2 a_1 a_0$ and outputs $z_7 \ldots z_0$]

Encoder
Encodes vector $A_{m-1:0}$ to binary number $Z_{n-1:0}$ ($m = 2^n$)
(condition: $\exists i \, \forall k$ : if $k = i$ then $a_k = 1$ else $a_k = 0$)
$Z = i$ if $a_i = 1$ ; $i = 0, \ldots, m-1$
$Z = \log_2 A$
$A = n (2^{n-1} - 1)$, $T = n - 1$
[Figures encodersym.epsi, encoder.epsi: encoder with inputs $a_7 \ldots a_0$ and outputs $z_2 z_1 z_0$]

Detection operations
All-zeroes detection: $z = \overline{a_{n-1} + a_{n-2} + \cdots + a_0}$
All-ones detection: $z = a_{n-1} a_{n-2} \cdots a_0$ (r.s.a.)
$A = n$, $T = \log n$
Leading-zeroes detection (LZD): for scaling, normalization, priority encoding
a) non-encoded output: $\{0\} 1 \{0|1\} \to \{0\} 1 \{0\}$ (e.g. 000101 $\to$ 000100)
prefix problem (r.m.a.) $\Rightarrow$ AND-prefix structure
$A = 2n$, $T = n$
[Figure lzdnenc.epsi: ripple LZD with inputs $a_{n-1} \ldots a_0$ and outputs $z_{n-1} \ldots z_0$]
b) encoded output: + encoder
signed numbers: + leading-ones detector (LOD)
(note: connections according to PPA-SK)
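Leading-zeroes detection with non-encoded output keeps only the most significant 1 of the input. A small Python sketch of the ripple (AND-prefix) formulation follows; the names `lzd_one_hot` and `leading_zeros` are hypothetical, and the second function stands in for the "+ encoder" step by returning the number of leading zeroes.

```python
def lzd_one_hot(a, n):
    """Non-encoded LZD: z_i = a_i AND (no 1 in any higher position)."""
    z, none_above = 0, 1
    for i in range(n - 1, -1, -1):          # ripple from MSB to LSB
        ai = (a >> i) & 1
        z |= (ai & none_above) << i
        none_above &= 1 - ai                # AND-prefix of inverted higher bits
    return z

def leading_zeros(a, n):
    """Encoded output: count of leading zeroes (n if a == 0)."""
    z = lzd_one_hot(a, n)
    return n if z == 0 else n - z.bit_length()
```

For the example from the text, `lzd_one_hot(0b000101, 6)` yields `0b000100`, i.e. three leading zeroes.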

5.5 Shift, Extension, Saturation
Shift: a) shift $n$-bit vector by $k$ bit positions, b) select $n$ out of more bits at position $k$; also: logical (= unsigned), arithmetic (= signed)
Rotation by $k$ bit positions, $n$ constant (logic operation)
Extension of word lengths by $k$ bits ($n \to n + k$) (i.e. sign-extension for signed numbers)
Saturation to highest/lowest value after over-/underflow

operation (input $a_{n-1} \ldots a_0$, shift by 1)      result (MSB ... LSB)
shift a), unsigned left (sll)       $a_{n-2}, a_{n-3}, \ldots, a_0, 0$
shift a), unsigned right (srl)      $0, a_{n-1}, \ldots, a_1$
shift a), signed left (sla)         $a_{n-1}, a_{n-3}, \ldots, a_0, 0$
shift a), signed right (sra)        $a_{n-1}, a_{n-1}, \ldots, a_1$
shift b) (input $a_{2n-1} \ldots a_0$)   $a_{n+k-1}, a_{n+k-2}, \ldots, a_k$
rotate left (rol)                   $a_{n-2}, \ldots, a_0, a_{n-1}$
rotate right (ror)                  $a_0, a_{n-1}, \ldots, a_1$
extend, unsigned                    $0, \ldots, 0, a_{n-1}, \ldots, a_0$
extend, signed                      $a_{n-1}, \ldots, a_{n-1}, a_{n-1}, \ldots, a_0$
saturate, unsigned / signed         highest/lowest value after over-/underflow

Applications:
adaption of magnitude (shift a)) or word length (extension) of operands (e.g. for addition)
multiplication/division by multiples of 2 (shift)
logic bit/byte operations (shift, rotation)
scaling of numbers for word-length reduction (i.e. ignore leading zeroes, shift b)) or normalization (e.g. of floating-point numbers, shift a)) using LZD
reducing error after over-/underflow (saturation)
Implementation of shift/extension/rotation by
constant values: hard-wired
variable values: multiplexers
$n$ possible values: $n$-by-$n$ barrel-shifter/rotator
Example: 4-by-4 barrel-rotator
[Figure muxshift.epsi: implementation with multiplexers, inputs $a_3 \ldots a_0$, select $s_1 s_0$, outputs $z_3 \ldots z_0$]
[Figure barshift.epsi: implementation with tristate buffers, inputs $a_3 \ldots a_0$, select $s_1 s_0$, outputs $z_3 \ldots z_0$]
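5.6 Addition Flags

flag   formula                                                                                                                   description
$C$    $c_n$                                                                                                                     carry flag
$V$    $c_n \oplus c_{n-1} = \overline{a_{n-1}} \, \overline{b_{n-1}} \, s_{n-1} + a_{n-1} b_{n-1} \overline{s_{n-1}}$   signed overflow flag
$Z$    $\forall i : s_i = 0$                                                                                                     zero flag
$N$    $s_{n-1}$                                                                                                                 negative flag, sign

Basic and derived condition flags

operation: $S = A + B$ (+) or $S = A - B$ ($-$)
condition                   unsigned                signed
$S = 0$ (zero)              $Z$                     $Z$
$S < 0$ (negative)                                  $N$
$S \ge 0$ (positive)                                $\overline{N}$
$S > max$ (overflow)        $C$ (+)                 $V \overline{C}$
$S < min$ (underflow)       $\overline{C}$ ($-$)    $V C$

operation: $A - B$
condition           unsigned                signed
$A = B$ (EQ)        $Z$                     $Z$
$A \ne B$ (NE)      $\overline{Z}$          $\overline{Z}$
$A \ge B$ (GE)      $C$                     $\overline{N} \, \overline{V} + N V$
$A > B$ (GT)        $C \overline{Z}$        $(\overline{N} \, \overline{V} + N V) \overline{Z}$
$A < B$ (LT)        $\overline{C}$          $N \overline{V} + \overline{N} V$
$A \le B$ (LE)      $\overline{C} + Z$      $N \overline{V} + \overline{N} V + Z$

Unsigned and signed addition/subtraction only differ with respect to the condition flags

Implementation of adder with flags
$C$, $N$: for free
$V$: fast $c_n$, $c_{n-1}$ computed by e.g. PPA $\Rightarrow$ very cheap
$Z$: a) $c_{in} = 1$ (subtraction): $Z = (A = B) = P_{n-1:0}$ (of PPA)
b) $c_{in} = 0/1$:
1) $Z = \overline{s_{n-1} + s_{n-2} + \cdots + s_0}$ (r.s.a.)
$A = A_{CPA} + n$, $T_Z = T_{CPA} + \lceil \log n \rceil$
2) faster without final sum (i.e. carry propagation) [19]
example: 01001100 + 10110100 = 00000000
$z_0 = \overline{(a_0 \oplus b_0) \oplus c_{in}}$
$z_i = \overline{(a_i \oplus b_i) \oplus (a_{i-1} + b_{i-1})}$ ; $i = 1, \ldots, n-1$
$Z = z_{n-1} z_{n-2} \cdots z_0$ (r.s.a.)
$A = A_{CPA} + 3n$, $T_Z = 4 + \lceil \log n \rceil$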
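The flag definitions and the carry-free zero detection can be cross-checked in Python. The sketch below rests on assumptions: operands are unsigned $n$-bit patterns, the function names are hypothetical, and the complemented form of the $z_i$ terms (XNOR of $a_i \oplus b_i$ with $a_{i-1} + b_{i-1}$) is the reading that makes the example 01001100 + 10110100 come out as zero.

```python
def add_with_flags(a, b, cin, n):
    """n-bit addition returning the sum and the flags C, V, Z, N."""
    mask = (1 << n) - 1
    total = a + b + cin
    s = total & mask
    c_n = total >> n                                     # carry out of the MSB
    low = mask >> 1
    c_n1 = ((a & low) + (b & low) + cin) >> (n - 1)      # carry into the MSB
    return s, {"C": c_n, "V": c_n ^ c_n1, "Z": int(s == 0), "N": s >> (n - 1)}

def fast_zero(a, b, cin, n):
    """Zero flag without carry propagation: every bit is checked locally."""
    z = 1 - (((a ^ b) & 1) ^ cin)                        # z_0
    for i in range(1, n):
        p_i = ((a ^ b) >> i) & 1                         # a_i xor b_i
        k = ((a | b) >> (i - 1)) & 1                     # a_{i-1} + b_{i-1}
        z &= 1 - (p_i ^ k)                               # z_i
    return z
```

The local test works because a zero sum bit at position $i-1$ forces the carry into position $i$ to equal $a_{i-1} + b_{i-1}$, so no carry chain has to be evaluated.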
5.7 Arithmetic Logic Unit (ALU)
[Figure alusymbol.epsi: ALU symbol with inputs $A$, $B$, $c_{in}$, op and outputs $c_{out}$, flags, $Z$]

ALU operations
arithmetic      add    $A + B + c_{in}$         sub    $A - B$
                inc    $A + 1$                  dec    $A - 1$
                pass   $A$                      neg    $-A$
logic           and    $a_i b_i$                nand   $\overline{a_i b_i}$
                or     $a_i + b_i$              nor    $\overline{a_i + b_i}$
                xor    $a_i \oplus b_i$         xnor   $\overline{a_i \oplus b_i}$
                pass   $a_i$                    not    $\overline{a_i}$
shift/rotate    sll    $A \ll 1$                srl    $A \gg 1$
                sla    $A \ll_a 1$              sra    $A \gg_a 1$
                rol    $A \ll_r 1$              ror    $A \gg_r 1$
s/ro: shift/rotate ; l/r: left/right ; l/a: logic (unsigned) / arithmetic (signed)
Logic of adder/subtractor can partly be re-used for logic operations

6 Multiplication
6.1 Multiplication Basics
Multiplies two $n$-bit operands $A$ and $B$ [1, 2]
Product $P$ is $(2n)$-bit unsigned number or $(2n - 1)$-bit signed number
Example: unsigned multiplication
$P = A \cdot B = \sum_{i=0}^{n-1} a_i 2^i \cdot \sum_{j=0}^{n-1} b_j 2^j = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$ or
$P_i = a_i B$, $P = \sum_{i=0}^{n-1} P_i 2^i$ ; $i = 0, \ldots, n-1$ (r.s.a.)
Algorithm
1) Generation of $n$ partial products $P_i$
2) Adding up partial products: a) sequentially (sequential shift-and-add), b) serially (combinational shift-and-add), or c) in parallel
Speed-up techniques
Reduce number of partial products
Accelerate addition of partial products
Sequential multipliers: partial products generated and added sequentially (using accumulator)
$A = O(n)$, $T = O(\log n)$, $L = n$
[Figure mulseq.epsi: CPA-based accumulator]
Array multipliers: partial products generated and added simultaneously in linear array (using array adder)
$A = O(n^2)$, $T = O(n)$
[Figure mularr.epsi: linear array of CSAs followed by CPA]
Parallel multipliers: partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)
$A = O(n^2)$, $T = O(\log n)$
[Figure mulpar.epsi: CSA tree followed by CPA]
Signed multipliers:
a) complement operands before and result after multiplication $\Rightarrow$ unsigned multiplication
b) direct implementation (dedicated multiplier structure)

6.2 Unsigned Array Multiplier
Braun multiplier: array multiplier for unsigned numbers
$P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$
$A = 8n^2 - 11n$, $T = 6n - 9$
Partial products ($n = 4$):
                      a0b3  a0b2  a0b1  a0b0
                a1b3  a1b2  a1b1  a1b0
          a2b3  a2b2  a2b1  a2b0
  + a3b3  a3b2  a3b1  a3b0
  p7  p6    p5    p4    p3    p2    p1    p0
[Figure mulbraun.epsi: 4-bit Braun multiplier array of FA and HA cells with inputs $a_3 \ldots a_0$, $b_3 \ldots b_0$, outputs $p_7 \ldots p_0$, and marked regions 1, 2, 3]
6.3 Signed Array Multipliers
Modified Braun multiplier
Subtract bits with negative weight $\Rightarrow$ special FAs [1]
1 neg. bit: $-a + b + c_{in} = 2c_{out} - s$
2 neg. bits: $a - b - c_{in} = -2c_{out} + s$
Replace FAs in regions 1, 2, and 3 by (input $a$ at mark $\bullet$):
$s = a \oplus b \oplus c_{in}$
$c_{out} = \overline{a} b + \overline{a} c_{in} + b c_{in}$
Otherwise exactly same structure and complexity as Braun multiplier $\Rightarrow$ efficient and flexible

Baugh-Wooley multiplier
Arithmetic transformations yield the following partial products (two additional ones), with $\overline{x}$ written as x' ($n = 4$):
                       a0'b3  a0b2   a0b1   a0b0
                a1'b3  a1b2   a1b1   a1b0
         a2'b3  a2b2   a2b1   a2b0
   a3b3  a3b2'  a3b1'  a3b0'
   a3'                 a3
 + 1  b3'              b3
   p7  p6  p5  p4      p3     p2     p1     p0
Less efficient and regular than modified Braun multiplier

6.4 Booth Recoding
Speed-up technique: reduction of partial products
Sequential multiplication
+ One cycle per non-zero partial product (i.e. $\forall a_i \mid a_i \ne 0$)
Negative partial products
Minimal (or canonical) signed-digit (SD) representation of $A$
Data-dependent reduction of partial products and latency
Combinational multiplication
Only fixed reduction of partial products possible
Radix-4 modified Booth recoding: 2 bits recoded to one multiplier digit $\Rightarrow$ $n/2$ partial products
$A = \sum_{i=0}^{n/2-1} \underbrace{(a_{2i-1} + a_{2i} - 2a_{2i+1})}_{\in \{-2, -1, 0, +1, +2\}} 2^{2i}$ ; $a_{-1} = 0$

$a_{2i+1}$  $a_{2i}$  $a_{2i-1}$   $P_i$
0  0  0    $+0$
0  0  1    $+B$
0  1  0    $+B$
0  1  1    $+2B$
1  0  0    $-2B$
1  0  1    $-B$
1  1  0    $-B$
1  1  1    $-0$
[Figure mulbooth.epsi: Booth recoding of $A$, partial product generation from $B$, CSA array/tree, CPA]
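The radix-4 modified Booth recoding can be demonstrated with a short Python sketch. The function names are hypothetical, the multiplier is assumed to be an $n$-bit two's complement value with even $n$, and plain integer arithmetic stands in for the CSA array/tree and the CPA.

```python
def booth_radix4_digits(a, n):
    """Recode the n-bit two's complement pattern a (n even) into n/2 digits
    d_i = a_{2i-1} + a_{2i} - 2*a_{2i+1}, each in {-2, -1, 0, +1, +2}."""
    bit = lambda i: 0 if i < 0 else (a >> i) & 1     # a_{-1} = 0
    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(n // 2)]

def booth_multiply(a, b, n):
    """Signed multiplication using n/2 partial products P_i = d_i * B."""
    digits = booth_radix4_digits(a & ((1 << n) - 1), n)
    return sum(d * b * 4 ** i for i, d in enumerate(digits))
```

For instance, `booth_radix4_digits(0b0110, 4)` gives `[-2, 2]`, i.e. $6 = -2 + 2 \cdot 4$, so two partial products replace four.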
Applicable to sequential, array, and parallel multipliers
additional recoding logic and more complex partial product generation (MUX for shift, XOR for negation): $A$: $+8n$, $T$: $+7$
+ adder array/tree cut in half $\Rightarrow$ considerably smaller (array and tree): $A$: $/2$
$\Rightarrow$ much faster for adder arrays: $T$: $/2$
$\Rightarrow$ slightly or not faster for adder trees: $T$: $-0$
Negative partial products (avoid sign-extension):
$\underbrace{p_3 \, p_3 \, p_3}_{\text{ext. sign}} \, p_3 \, p_2 \, p_1 \, p_0 = 0 \, 0 \, 0 \, (-p_3) \, p_2 \, p_1 \, p_0 = 1 \, 1 \, 1 \, \overline{p_3} \, p_2 \, p_1 \, p_0 + 1 \, 0 \, 0 \, 0$
[Figure: array of four sign-extended partial products $p_{i3} \, p_{i2} \, p_{i1} \, p_{i0}$ and its transformed version without sign-extension, results $p_6 \ldots p_0$]
Suited for signed multiplication (incl. Booth recoding)
Extend $A$ for unsigned multiplication: $a_n = 0$
Radix-8 (3-bit recoding) and higher radices: precomputing $3B$, ... $\Rightarrow$ inefficient

6.5 Wallace Tree Addition
Speed-up technique: fast partial product addition
Applicable to parallel multipliers: parallel partial product generation (normal or Booth recoded)
Irregular adder tree (Wallace tree) due to different number of bits per column
$\Rightarrow$ irregular wiring and/or layout
$\Rightarrow$ non-uniform bit arrival times at final adder

6.6 Multiplier Implementations
Sequential multipliers:
low performance, small area, component re-use (adder)
Braun or Baugh-Wooley multiplier (array multiplier):
medium performance, high area, high regularity
layout generators $\Rightarrow$ data paths and macro-cells
simple pipelining, faster CPA $\Rightarrow$ higher speed
Booth-Wallace multiplier (parallel multiplier) [9]:
high performance, high area, low regularity
custom multipliers, netlist generators
often pipelined (e.g. register between CSA-tree and CPA)
Signed-unsigned multiplier:
signed multiplier with operands extended by 1 bit ($a_n = a_{n-1}$ / $= 0$, $b_n = b_{n-1}$ / $= 0$)
6.7 Composition from Smaller Multipliers
$(2n \times 2n)$-bit multiplier can be composed from 4 $(n \times n)$-bit multipliers (can be repeated recursively)
$A \cdot B = (A_H 2^n + A_L) \cdot (B_H 2^n + B_L) = A_H B_H 2^{2n} + (A_H B_L + A_L B_H) 2^n + A_L B_L$
4 $(n \times n)$-bit multipliers + $(2n)$-bit CSA + $(3n)$-bit CPA
less efficient (area and speed)

6.8 Squaring
$P = A^2 = A A$: multiplier optimizations possible
Partial products ($n = 4$), before and after merging equal terms:
                      a0a3  a0a2  a0a1  a0a0
                a1a3  a1a2  a1a1  a1a0
          a2a3  a2a2  a2a1  a2a0
  + a3a3  a3a2  a3a1  a3a0
$\to$
     a2a3  a1a3  a0a3  a0a2  a0a1        a0a0
     a3a3        a1a2        a1a1
  +              a2a2
  p7  p6   p5    p4    p3    p2    p1    p0
+ $\lfloor n/2 \rfloor + 1$ partial products (if no Booth recoding used) $\Rightarrow$ optimized squarer more efficient than multiplier
Table look-up (ROM) less efficient for every $n$

7 Division / Square Root Extraction
7.1 Division Basics
$A = Q \cdot B + R$ ; $R < B$
$R = A \ \mathrm{rem} \ B$ (remainder)
$A \in [0, 2^{2n} - 1]$ ; $B, Q, R \in [0, 2^n - 1]$ ; $B \ne 0$
$Q < 2^n \to A < 2^n B$, otherwise overflow $\Rightarrow$ normalize $B$ before division ($B \in [2^{n-1}, 2^n - 1]$)
$\frac{A}{B} = Q + \frac{R}{B}$
Algorithms (radix-2)
Subtract-and-shift: partial remainders $R_i$ [1, 2]
Sequential algorithm: recursive, $f$ non-associative
$q_i = (R_{i+1} \ge 2^i B)$, $R_i = R_{i+1} - q_i 2^i B$ ; $i = n-1, \ldots, 0$
$R_n = A$, $R = R_0$ (r.m.n.)
Basic algorithm: compare and conditionally subtract $\Rightarrow$ expensive comparison and CPA
Restoring division: subtract and conditionally restore (adder or multiplexer) $\Rightarrow$ expensive CPA and restoring
Non-restoring division: detect sign, subtract/add, and correct by next steps $\Rightarrow$ expensive CPA
SRT division: estimate range, subtract/add (CSA), and correct by next steps $\Rightarrow$ inexpensive CSA
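The subtract-and-shift recursion of Section 7.1 and its non-restoring variant can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the operands are assumed to satisfy $0 \le A < 2^n B$ (no overflow), and the non-restoring quotient is accumulated directly from digits in $\{-1, +1\}$ with one final correction step.

```python
def restoring_divide(a, b, n):
    """q_i = (R_{i+1} >= 2^i B), R_i = R_{i+1} - q_i 2^i B, i = n-1 .. 0."""
    assert 0 <= a < (b << n)
    r, q = a, 0
    for i in range(n - 1, -1, -1):
        if r >= (b << i):          # compare (or subtract and restore if negative)
            r -= b << i
            q |= 1 << i
    return q, r

def nonrestoring_divide(a, b, n):
    """Subtract if the partial remainder is non-negative, add otherwise."""
    assert 0 <= a < (b << n)
    r, q = a, 0
    for i in range(n - 1, -1, -1):
        if r >= 0:
            r -= b << i            # digit +1
            q += 1 << i
        else:
            r += b << i            # digit -1
            q -= 1 << i
    if r < 0:                      # final correction step for R (and Q)
        r += b
        q -= 1
    return q, r
```

In both functions the relation $A = Q \cdot B + R$ holds after every step, which is what allows the non-restoring version to postpone the correction to later steps.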
7.2 Restoring Division
$q_i = 1$ if $R_{i+1} - B 2^i \ge 0$ ; $q_i = 0$ if $R_{i+1} - B 2^i < 0$
step $i$: $R_{i+1} - B 2^i < 0$ : $q_i = 0$, $R_i = R_{i+1}$ (restored)
step $i-1$: $R_{i+1} - B 2^{i-1} \ge 0$ : $q_{i-1} = 1$, $R_{i-1} = R_{i+1} - B 2^{i-1}$

7.3 Non-Restoring Division
$q_i' = 1$ if $R_{i+1} \ge 0$ ; $q_i' = \overline{1} = -1$ if $R_{i+1} < 0$
step $i$: $R_{i+1} \ge 0$ : $q_i' = 1$, $R_i = R_{i+1} - B 2^i$
step $i-1$: $R_{i+1} - B 2^i < 0$ : $q_{i-1}' = \overline{1}$, $R_{i-1} = R_{i+1} - B 2^i + B 2^{i-1} = R_{i+1} - B 2^{i-1}$
One subtraction/addition (CPA) per step
Final correction step for $R$ (additional CPA)
Simple quotient digit conversion (note: $q_i'$ irredundant):
$q_i' \in \{\overline{1}, 1\} \to q_i \in \{0, 1\}$ : $q_i = \frac{1}{2} (q_i' + 1)$
$Q = (\overline{q_{n-1}}, q_{n-2}, q_{n-3}, \ldots, q_0, 1)$
$A = (n + 1) A_{CPA} = O(n^2)$ or $O(n^2 \log n)$
$T = (n + 1) T_{CPA} = O(n^2)$ or $O(n \log n)$
[Figure divnr.epsi: chain of +/- CPAs with inputs $A$, $B$ and output $R$]

7.4 Signed Division
$q_i' = 1$ if $R_{i+1}$, $B$ same sign ; $q_i' = \overline{1}$ if $R_{i+1}$, $B$ opposite sign
Example: signed non-restoring array divider (simplifications: $B > 0$, final correction of $R$ omitted)
$A = 9n^2$, $T = 2n^2 + 4n$
[Figure divarray.epsi: array of FA cells with inputs $a_6 \ldots a_0$, $b_3 \ldots b_0$ and outputs $q_3 \ldots q_0$, $r_3 \ldots r_0$]
7.5 SRT Division
$q_i' = 1$ if $B 2^i \le R_{i+1}$ ; $q_i' = 0$ if $-B 2^i \le R_{i+1} < B 2^i$ ; $q_i' = \overline{1}$ if $R_{i+1} < -B 2^i$ ($q_i'$ is SD number)
if $2^{n-1} \le B < 2^n$, i.e. $B$ is normalized:
$\Rightarrow$ $-B 2^i \le -2^{n+i-1} \le R_{i+1} < 2^{n+i-1} \le B 2^i$
$\Rightarrow$ $q_i' = 1$ if $2^{n+i-1} \le R_{i+1}$ ; $q_i' = 0$ if $-2^{n+i-1} \le R_{i+1} < 2^{n+i-1}$ ; $q_i' = \overline{1}$ if $R_{i+1} < -2^{n+i-1}$
+ Only 3 MSB are compared $\Rightarrow$ $q_i'$ are estimated $\Rightarrow$ CSA instead of CPA can be used (precise enough) [20]
Correction in following steps (+ final correction step)
Redundant representation of $q_i'$ (SD representation) $\Rightarrow$ final conversion necessary (CPA)
+ Highly regular and fast ($O(n)$) SRT array dividers $\Rightarrow$ only slightly slower/larger than array multipliers
$A = n A_{CSA} + 2 A_{CPA} = O(n^2)$
$T = n T_{CSA} + T_{CPA} = O(n)$
[Figure divsrt.epsi: chain of +/- CSAs followed by CPAs, inputs $A$, $B$, output $R$]

7.6 High-Radix Division
Radix $\beta = 2^m$, $q_i' \in \{\overline{\beta - 1}, \ldots, \overline{1}, 0, 1, \ldots, \beta - 1\}$
$m$ quotient bits per step $\Rightarrow$ fewer, but more complex steps
+ Suitable for SRT algorithm $\Rightarrow$ faster
Complex comparisons (more bits) and decisions $\Rightarrow$ table look-up ($\Rightarrow$ Pentium bug!)

7.7 Division by Multiplication
Division by convergence
$Q = \frac{A}{B} = \frac{A R_0 R_1 \cdots R_{m-1}}{B R_0 R_1 \cdots R_{m-1}} \to \frac{Q}{1}$ (resp. denominator $\to 2^n$)
$B_{i+1} = B_i R_i = \underbrace{2^n (1 - y)}_{B_i} \underbrace{(1 + y)}_{R_i} = 2^n (1 - y^2) > B_i \to 2^n$
$y = 1 - B_i 2^{-n}$, $R_i = 2 - B_i 2^{-n} = \overline{B_i} + 1$ (signed)
Algorithm:
$B_{i+1} = B_i R_i$, $A_{i+1} = A_i R_i$, $R_i = \overline{B_i} + 1$ ; $i = 0, \ldots, m-1$
$A_0 = A$, $B_0 = B$, $Q = A_m$ (r.s.n.)
Quadratic convergence: $L = \lceil \log n \rceil$
Division by reciprocation
$Q = \frac{A}{B} = A \cdot \frac{1}{B}$
Newton-Raphson iteration method: find $f(X) = 0$ by recursion
$X_{i+1} = X_i - \frac{f(X_i)}{f'(X_i)}$
$f(X) = \frac{1}{X} - B$, $f'(X) = -\frac{1}{X^2}$, $f\left(\frac{1}{B}\right) = 0$
Algorithm:
$X_{i+1} = X_i (2 - B X_i)$ ; $i = 0, \ldots, m-1$
$X_0 = B$, $Q = A X_m$ (r.s.n.)
Quadratic convergence: $L = O(\log n)$
Speed-up: first approximation $X_0$ from table

7.8 Remainder / Modulus
Remainder (rem): signed remainder of a division
$R = A \ \mathrm{rem} \ B$, $\mathrm{sign}(R) = \mathrm{sign}(A)$
Modulus (mod): positive remainder of a division
$M = A \bmod B$, $M \ge 0$
$M = R$ if $A \ge 0$, $M = R + B$ else

7.9 Divider Implementations
Iterative dividers (through multiplication):
re-use of existing components (multiplier)
medium performance, medium area
high efficiency if components are re-used
Sequential dividers (restoring, non-restoring, SRT):
re-use of existing components (e.g. adder)
low performance, low area
Array dividers (restoring, non-restoring, SRT):
dedicated hardware component
high performance, high area
high regularity $\Rightarrow$ layout generators, pipelining
square root extraction possible by minor changes
combination with multiplication and/or square root
No parallel dividers exist (sequential nature of division)
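The Newton-Raphson reciprocal iteration of Section 7.7 can be tried with exact rational arithmetic. The Python sketch below makes several assumptions: the divisor is normalized to $\frac{1}{2} \le B < 1$ so that the start value $X_0 = B$ converges, six iterations are an arbitrary choice, and the function names are hypothetical.

```python
from fractions import Fraction

def reciprocal_errors(b, iterations=6):
    """Run X_{i+1} = X_i (2 - B X_i) from X_0 = B and return the final X
    together with the errors e_i = 1 - B X_i of every step."""
    b = Fraction(b)
    x, errors = b, []
    for _ in range(iterations):
        errors.append(1 - b * x)
        x = x * (2 - b * x)
    errors.append(1 - b * x)
    return x, errors

def divide_by_reciprocal(a, b, iterations=6):
    """Q = A * X_m with X_m approximating 1/B."""
    x, _ = reciprocal_errors(b, iterations)
    return Fraction(a) * x
```

Since $1 - B X_{i+1} = (1 - B X_i)^2$, the error is squared in every step, which is the quadratic convergence stated above; a table look-up for $X_0$ simply starts the sequence with a smaller $e_0$.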
7.10 Square Root Extraction
$\sqrt{A - R} = Q$, $A = Q^2 + R$
$A \in [0, 2^{2n} - 1]$, $Q \in [0, 2^n - 1]$
Algorithm
Subtract-and-shift: partial remainders $R_i$ and quotients $Q_i = Q_{i+1} + q_i 2^i = (q_{n-1}, \ldots, q_i, 0, \ldots, 0)$
$Q_i^2 = (Q_{i+1} + q_i 2^i)^2 = Q_{i+1}^2 + q_i 2^i (2 Q_{i+1} + q_i 2^i)$
$q_i = (R_{i+1} \ge 2^i (2 Q_{i+1} + 2^i))$
$Q_i = Q_{i+1} + q_i 2^i$
$R_i = R_{i+1} - q_i 2^i (2 Q_{i+1} + q_i 2^i)$ ; $i = n-1, \ldots, 0$
$R_n = A$, $Q_n = 0$, $R = R_0$, $Q = Q_0$ (r.m.n.)
Implementation
+ Similar to division $\Rightarrow$ same algorithms applicable (restoring, non-restoring, SRT, high-radix)
+ Combination with division in same component possible
Only triangular array required (step $i$: $q_{k<i} = 0$)
$A \approx A_{DIV} / 2$, $T \approx T_{DIV}$
[Figure sqrtnr.epsi: chain of +/- CPAs with input $A$ and output $R$]

8 Elementary Functions
Exponential function: $e^x$ ($\exp x$)
Logarithm function: $\ln x$, $\log x$
Trigonometric functions: $\sin x$, $\cos x$, $\tan x$
Inverse trig. functions: $\arcsin x$, $\arccos x$, $\arctan x$
Hyperbolic functions: $\sinh x$, $\cosh x$, $\tanh x$

8.1 Algorithms
Table look-up: inefficient for large word lengths [5]
Taylor series expansion: complex implementation
Polynomial and rational approximations [1, 5]
Shift-and-add algorithms [5]
Convergence algorithms [1, 2]:
similar to division-by-convergence
two (or more) recursive formulas: one formula converges to a constant, the other to the result
Coordinate rotation (CORDIC) [2, 5, 21]:
3 equations for x-, y-coordinate, and angle
computes all elementary functions by proper input settings and choice of modes and outputs
simple, universal hardware, small look-up table
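The square-root recursion of Section 7.10 has the same shape as restoring division, with the subtrahend $2^i (2 Q_{i+1} + 2^i)$ taking the place of $2^i B$. A minimal Python sketch (the function name is hypothetical; the operand is assumed to be a $2n$-bit unsigned integer):

```python
def sqrt_subtract_and_shift(a, n):
    """Integer square root of a 2n-bit operand: returns (Q, R) with A = Q^2 + R."""
    assert 0 <= a < (1 << (2 * n))
    r, q = a, 0                                   # R_n = A, Q_n = 0
    for i in range(n - 1, -1, -1):
        t = (1 << i) * (2 * q + (1 << i))         # 2^i (2 Q_{i+1} + 2^i)
        if r >= t:                                # q_i = 1
            r -= t
            q += 1 << i
    return q, r
```

For example, `sqrt_subtract_and_shift(200, 4)` returns `(14, 4)`, since $200 = 14^2 + 4$.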
8.2 Integer Exponentiation
Approximated exponentiation: $x^y = e^{y \ln x} = 2^{y \log x}$
Base-2 integer exponentiation: $2^A = (\ldots 0 \, 1 \, 0 \ldots)$ with the single 1 at bit position $A$
Integer exponentiation (exact): $A^B = \underbrace{A \cdot A \cdots A}_{B}$, $L = 0 \ldots 2^n - 1$ (!)
Applications: modular exponentiation $A^B \pmod{C}$ in cryptographic algorithms (e.g. IDEA, RSA)
Algorithms: square-and-multiply
a) $E = A^B = A^{b_{n-1} 2^{n-1} + \cdots + b_1 2 + b_0} = A^{2^{n-1} b_{n-1}} A^{2^{n-2} b_{n-2}} \cdots A^{4 b_2} A^{2 b_1} A^{b_0}$
$E_i = P_i^{b_i} E_{i-1}$, $P_{i+1} = P_i^2$ ; $i = 0, \ldots, n-1$
$E_{-1} = 1$, $P_0 = A$, $E = E_{n-1}$ (r.s.n.)
$A = 2 A_{MUL}$, $T = T_{MUL}$, $L = n$ or $A = A_{MUL}$, $T = T_{MUL}$, $L = 2n$
b) $E = A^B = A^{b_{n-1} 2^{n-1} + \cdots + b_1 2 + b_0} = (\cdots((A^{b_{n-1}})^2 A^{b_{n-2}})^2 \cdots A^{b_1})^2 A^{b_0}$
$E_i = E_{i+1}^2 A^{b_i}$ ; $i = n-1, \ldots, 0$
$E_n = 1$, $E = E_0$ (r.s.n.)
$A = A_{MUL}$, $T = T_{MUL}$, $L = 2(n - 1)$

8.3 Integer Logarithm
$Z = \lfloor \log_2 A \rfloor$
For detection/comparison of order of magnitude
Corresponds to leading-zeroes detection (LZD) with encoded output
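Both square-and-multiply variants of Section 8.2 are shown below as a Python sketch. The function names and the optional `mod` parameter (for the modular exponentiation mentioned under applications) are assumptions for illustration.

```python
def power_lsb_first(a, b, n, mod=None):
    """Variant a): E_i = P_i^{b_i} E_{i-1}, P_{i+1} = P_i^2, scanning b from the LSB."""
    e, p = 1, a                       # E_{-1} = 1, P_0 = A
    for i in range(n):
        if (b >> i) & 1:
            e = e * p
        p = p * p                     # independent of e: can run in a second multiplier
        if mod is not None:
            e, p = e % mod, p % mod
    return e

def power_msb_first(a, b, n, mod=None):
    """Variant b): E_i = E_{i+1}^2 A^{b_i}, scanning b from the MSB."""
    e = 1                             # E_n = 1
    for i in range(n - 1, -1, -1):
        e = e * e
        if (b >> i) & 1:
            e = e * a
        if mod is not None:
            e %= mod
    return e
```

Variant a) keeps two running values whose updates do not depend on each other, which matches the choice between two multipliers with $L = n$ and one multiplier with $L = 2n$; variant b) needs only one running value.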
9 VLSI Design Aspects
9.1 Design Levels
Transistor-level design
Circuit and layout designed by hand (full custom)
Low design efficiency
High circuit performance: high speed, low area
High flexibility: choice of architecture and logic style
Transistor-level circuit optimizations:
logic style: static vs. dynamic logic, complementary CMOS vs. pass-transistor logic
special arithmetic circuits: better than with gates
carry chain: [Figure carrychain.epsi: carry chain with signals $c_{out}$, $c_i$, $c_{i-1}$, $c_{in}$ and $k_i$, $p_i$, $g_i$, $k_{i-1}$, $p_{i-1}$, $g_{i-1}$]
full-adder: [Figure facmos.epsi: transistor-level full-adder with inputs $a$, $b$, $c_{in}$ and outputs $s$, $c_{out}$]

Gate-level design
Cell-based design techniques: standard-cells, gate-array/sea-of-gates, field-programmable gate-array (FPGA)
Circuit implemented by hand or by synthesis (library)
Layout implemented by automated place-and-route
Medium to high design efficiency
Medium to low circuit performance
Medium to low flexibility: full choice of architecture

Block-level design
Layout blocks and netlists from parameterized automatic generators or compilers (library)
High design efficiency
Medium to high circuit performance
Low flexibility: limited choice of architectures
Implementations:
data-path: bit-sliced, bus-oriented layout (array of cells: $n$ bits $\times$ $m$ operations), implementation of entire data paths, medium performance, medium diversity
macro-cells: tiled layout, fixed/single-operation components, high performance, small diversity
portable netlists: $\Rightarrow$ gate-level design
9.2 Synthesis

High-level synthesis
  Synthesis from an abstract, behavioral hardware description (e.g. data dependency graphs), using e.g. VHDL
  Involves architectural synthesis and arithmetic transformations
  High-level synthesis is still in its early stages

Low-level synthesis
  Layout and netlist generators
  Included in libraries and synthesis tools
  Low-level synthesis is state-of-the-art
  Basis for efficient ASIC design
  Limited diversity and flexibility of library components

Circuit optimization
  Efficient optimization of random logic (low factorization degree) is state-of-the-art
  Optimization of entire arithmetic circuits (high factorization degree) is not feasible ⇒ only local optimizations possible
  Logic optimization cannot replace the synthesis of efficient arithmetic circuit structures using generators
9.3 VHDL

Arithmetic types : unsigned, signed (2's complement)

Arithmetic packages
  numeric_bit, numeric_std (IEEE standard 1076.3), std_logic_arith (Synopsys)
  contain overloaded arithmetic operators and resizing / type conversion routines for unsigned, signed types

Arithmetic operators (VHDL87/93) [22]
  relational               : =, /=, <, <=, >, >=
  shift, rotate (93 only)  : rol, ror, sla, sll, sra, srl
  adding                   : +, -
  sign (unary)             : +, -
  multiplying              : *, /, mod, rem
  exponent, absolute       : **, abs

Synthesis
  Typical limitations of synthesis tools :
    /, mod, rem : both operands must be constant or divisor must be a power of two
    ** : for power-of-two bases only
  Variety of arithmetic components provided in separate libraries (e.g. DesignWare by Synopsys)
Resource sharing
  Sharing one resource for multiple operations
  Done automatically by some synthesis tools
  Otherwise, appropriate coding is necessary :
  a) S <= A + C when SELA = '1' else B + C;
     ⇒ 2 adders + 1 multiplexer
  b) T <= A when SELA = '1' else B;
     S <= T + C;
     ⇒ 1 multiplexer + 1 adder

Coding & synthesis hints
  Addition : single adder with carry-in/carry-out :
    Aext <= resize(A, width+1) & Cin;   -- append carry-in as extra LSB
    Bext <= resize(B, width+1) & '1';   -- append constant '1' as extra LSB
    Sext <= Aext + Bext;
    S    <= Sext(width downto 1);       -- sum without the extra LSB
    Cout <= Sext(width+1);              -- carry-out
  Synthesis : check synthesis result for allocated arithmetic units ⇒ code sanity check, control of circuit size

VHDL library of arithmetic units
  Structural, synthesizable VHDL code for most circuits described in this text is found in [23]

9.4 Performance

Pipelining
  Pipelining is basically possible with every combinational circuit ⇒ higher throughput
  Arithmetic circuits are well suited for pipelining due to high regularity
  Pipelining of arithmetic circuits can be very costly :
    large amount of internal signals in arithmetic circuits
    array structures : many small pipeline registers
    tree structures : few large pipeline registers
      ⇒ no advantage of tree structures anymore (except for smaller latency)
  Fine-grain pipelining ⇒ systolic arrays (often applied to arithmetic circuits)

High speed
  Fast circuit architectures, pipelining, replication (parallelization), and combinations of those
  Optimal solution depends on arithmetic operation, circuit architecture, user specifications, and circuit environment
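The single-adder coding hint of Section 9.3 appends Cin to A and a constant '1' to B as extra least-significant bits, so that one plain addition yields A + B + Cin. The following Python sketch checks this arithmetic numerically; it is a bit-level model, not VHDL, and the function name `add_with_carry` is an assumption made for the example.

```python
def add_with_carry(a, b, cin, width):
    """Model of the carry-in trick for unsigned operands of `width` bits.

    Aext = A & Cin  ->  2*A + Cin
    Bext = B & '1'  ->  2*B + 1
    Sext = Aext + Bext = 2*(A + B) + Cin + 1, so the extra LSB generates
    a carry into bit 1 exactly when Cin = 1.
    """
    mask = (1 << (width + 2)) - 1            # Sext has width+2 bits
    a_ext = ((a << 1) | cin) & mask
    b_ext = ((b << 1) | 1) & mask
    s_ext = (a_ext + b_ext) & mask
    s = (s_ext >> 1) & ((1 << width) - 1)    # Sext(width downto 1)
    cout = (s_ext >> (width + 1)) & 1        # Sext(width+1)
    return s, cout


if __name__ == "__main__":
    print(add_with_carry(9, 6, 1, 4))  # (0, 1): 9 + 6 + 1 = 16 = 1_0000b
```

The extra bit position costs one additional adder cell but avoids a second adder or incrementer for the carry-in.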
Low power
  Power-related properties of arithmetic circuits :
    High glitching activity due to high bit dependencies and large logic depth
  Power reduction in arithmetic circuits [24] :
    Reduce the switched capacitance by choosing an area-efficient circuit architecture
    Allow for lower supply voltage by speeding up the circuitry
    Reduce the transition activity :
      apply stable inputs while circuit is not in use (⇒ disabling subcircuits)
      reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize)
      reduce glitching transitions by reducing logic depth (pipelining)
      take advantage of correlated data streams
      choose appropriate number representations (e.g. Gray codes for counters)
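The remark on Gray codes for counters can be made concrete by counting output bit transitions over one full counting cycle. The Python sketch below is an illustration only; the helper names `to_gray` and `toggles` are assumptions, and only transitions of the counter state bits are counted, not glitches inside the logic.

```python
def to_gray(i):
    """Binary-reflected Gray code of the integer i."""
    return i ^ (i >> 1)


def toggles(codes):
    """Total number of bit flips over one cyclic pass through `codes`."""
    total = 0
    for cur, nxt in zip(codes, codes[1:] + codes[:1]):
        total += bin(cur ^ nxt).count("1")
    return total


if __name__ == "__main__":
    n = 4
    binary = list(range(1 << n))
    gray = [to_gray(i) for i in binary]
    print(toggles(binary), toggles(gray))  # prints "30 16"
```

For an n-bit counter the binary sequence flips 2^(n+1) − 2 bits per cycle, while the Gray sequence flips exactly one bit per step, i.e. 2^n bits per cycle.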
9.5 Testability

Testability goal : high fault coverage with few test vectors that are easy to generate/apply
Random test vectors : easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits
Special test vectors : sometimes hard to generate and apply, required for coverage of hard-detectable faults which are inherent in most arithmetic circuits
Hard-detectable faults found in :
  circuits of arithmetic operations with inherent special cases (arithmetic exceptions) : detectors, comparators, incrementers and counters (MSBs), adder flags
  circuits using redundant number representations (≠ redundant hardware) : dividers (Pentium bug!)
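One reason why the MSBs of incrementers and counters are hard to test with random vectors is that the carry into the MSB is active only when all lower bits are 1. The Python sketch below counts how many input values activate that carry; it is a simplified illustration, not a fault simulator, and the function names are assumptions.

```python
def carry_into_msb(a, n):
    """True if incrementing the n-bit value a carries into bit n-1."""
    low_mask = (1 << (n - 1)) - 1
    return (a & low_mask) == low_mask


def activating_fraction(n):
    """Fraction of all n-bit inputs that activate the carry into the MSB."""
    hits = sum(carry_into_msb(a, n) for a in range(1 << n))
    return hits / (1 << n)


if __name__ == "__main__":
    print(activating_fraction(8))  # 0.0078125, i.e. 2 of 256 inputs
```

Only 2 of the 2^n input values exercise this carry, so the probability that a random vector does so drops as 2^-(n-1) with the word length n.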
Bibliography
[1] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993.
[2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.
[3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981.
[4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.
[5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhäuser Boston, 1997.
[6] Proceedings of the Xth Symposium on Computer Arithmetic.
[7] IEEE Transactions on Computers.
[8] D. R. Lutz and D. N. Jayasimha, "Programmable modulo-k counters", IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939–941, Nov. 1996.
[9] H. Makino et al., "An 8.8-ns 54 × 54-bit multiplier with high speed redundant binary architecture", IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773–783, June 1996.
[10] W. N. Holmes, "Composite arithmetic: Proposal for a new standard", IEEE Computer, vol. 30, no. 3, pp. 65–73, Mar. 1997.
[11] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998.
[12] A. Tyagi, "A reduced-area scheme for carry-select adders", IEEE Trans. Comput., vol. 42, no. 10, pp. 1162–1170, Oct. 1993.
[13] T. Han and D. A. Carlson, "Fast area-efficient VLSI adders", in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 49–56.
[14] R. Zimmermann, "Non-heuristic optimization and synthesis of parallel-prefix adders", in Proc. Int. Workshop on Logic and Architecture Synthesis, Grenoble, France, Dec. 1996, pp. 123–132.
[15] D. W. Dobberpuhl et al., "A 200-MHz 64-b dual-issue CMOS microprocessor", IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555–1564, Nov. 1992.
[16] A. De Gloria and M. Olivieri, "Statistical carry lookahead adders", IEEE Trans. Comput., vol. 45, no. 3, pp. 340–347, Mar. 1996.
[17] V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach", IEEE Trans. Comput., vol. 45, no. 3, pp. 294–305, Mar. 1996.
[18] Z. Wang, G. A. Jullien, and W. C. Miller, "A new design technique for column compression multipliers", IEEE Trans. Comput., vol. 44, no. 8, pp. 962–970, Aug. 1995.
[19] J. Cortadella and J. M. Llaberia, "Evaluation of A + B = K conditions without carry propagation", IEEE Trans. Comput., vol. 41, no. 11, pp. 1484–1488, Nov. 1992.
[20] S. E. McQuillan and J. V. McCanny, "Fast VLSI algorithms for division and square root", J. VLSI Signal Processing, vol. 8, pp. 151–168, Oct. 1994.
[21] Y. H. Hu, "CORDIC-based VLSI architectures for digital signal processing", IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 16–35, July 1992.
[22] K. C. Chang, Digital Design and Modeling with VHDL and Synthesis, IEEE Computer Society Press, Los Alamitos, California, 1997.
[23] R. Zimmermann, VHDL Library of Arithmetic Units, http://www.iis.ee.ethz.ch/zimmi/arith_lib.html.
[24] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, Norwell, MA, 1995.
