Number System

Download as pdf or txt
Download as pdf or txt
You are on page 1of 18

Computer Organization

Arithmetic and Logic Unit

Module 3.

STUDY MATERIALS ON COMPUTER ORGANIZATION


(As per the curriculum of Third semester B.Sc. Electronics of Mahatma Gandh Uniiversity)

Compiled by Sam Kollannore U.. Lecturer in Electronics M.E.S. College, Marampally

3. ARITHMETIC AND LOGIC UNIT


asic operation of all digital computers is the addition or subtraction of two numbers. These operations are implemented along with basic logical functions such as AND, OR, NOT and Exclusive OR in the Arithmetic and Logic Unit (ALU) subsystem of the processor. The time needed to perform these operations affect the overall system performance. Compared with arithmetic operations, logic operations are simple to implement. They require only independent Boolean operations on individual bit positions of the operands, whereas carry/borrow lateral signals are required in arithmetic operations.

B
3.1

Number Representations
B = bn-1.bn-2..b1.b0 where bi = 0 or 1 for i = 0,1,.. n-1

Binary number system is used in computers. Consider an n-bit vector (number) This vector can represent positive integer values V in the range 0 to 2n-1, where V(B) = bn-1 . 2n-1 +. + b1 . 21 + b0 . 20 Three systems are widely used for representing both positive and negative numbers 1) Sign-and-magnitude 2) 1s-compliment 3) 2s-compliment
In all the three systems, the leftmost bit is 0 for positive numbers and 1 for negative numbers. Table shows all the three types of representations using 4-bit numbers Positive numbers have identical representations in all systems, but negative values have different representations. In sign-and-magnitude system, negative values are represented by changing the most significant bit to 1. For example +5 is represented by 0101 and -5 is represented by 1101. In the 1s compliment representation, negative values are obtained by complimenting each bit of the corresponding positive number. E.g. Representation for -3 is obtained by complimenting each bit in the vector 0011 to yield 1100 The operation of obtaining the 1s compliment of a number is equivalent to subtracting that number from 2n 1 i.e. from 1111 (in the case of 4 bit numbers) In 2s compliment system, a negative number is obtained by subtracting the corresponding positive number from 2n. Hence the 2s compliment representation is obtained by adding 1 to the 1s compliment representation. Note:There are distinct +0 and 0 representations in both the sign-and-magnitude and 1s compliment systems; but the 2s compliment system has only a +0 representation. For 4 bit numbers the value 8 is representable in the 2s compliment system but not in the other systems. 2s compliment system yields the most efficient logic circuit implementation (most often used in computers for addition and subtraction)

Computer Organization
Arithmetic and Logic Unit B b3b2b1b0 0111 0110 0101 0100 0011 0010 0001 0000 1000 1001 1010 1011 1100 1101 1110 1111 Values represented Sign and magnitude +7 +6 +5 +4 +3 +2 +1 +0 0 1 2 3 4 5 6 7 1s compliment +7 +6 +5 +4 +3 +2 +1 +0 7 6 5 4 3 2 1 0

Module 3.

2s compliment +7 +6 +5 +4 +3 +2 +1 +0 8 7 6 5 4 3 2 1

Fig 3.1 Binary representations

3.2

Addition of Positive Numbers

Consider the case of unsigned numbers. Consider adding two 1-bit numbers. The results are shown in figure. The sum of 1 and 1 requires the 2-bit vector 10 to represent the value 2. We say the sum is 0 and the carry-out is 1. 0 +0 0 1 +0 1 0 +1 1 1 +1 10 Carry-out

3.3

RIPPLE - CARRY ADDERS

In order to add multiple-bit numbers, we use a method analogous to that used for manual computation with decimal numbers. We add bit pairs starting from the low-order (right) end of the bit vectors, propagating carries towards the high-order (left) end. A straight forward 2-level combinational logic circuit implementation of the truth table for addition is shown along with a convenient symbol for the circuit called an Adder or sometimes called a Full-adder.

Computer Organization
Arithmetic and Logic Unit xi 0 0 0 0 1 1 1 1 yi 0 0 1 1 0 0 1 1 Carry-in ci 0 1 0 1 0 1 0 1 yi ci Sum si 0 1 1 0 1 0 0 1 Carry-out ci+1 0 0 0 1 0 1 1 1

Module 3.

si = xiyici + xiyici + xiyici + xiyici = xi ci+1 = yici + xici + xiy

Fig 3.2 (a)

Fig 3.2 (b)

Computer Organization
Arithmetic and Logic Unit

Module 3.

A cascaded connection of n adder blocks can be used to add two n-bit numbers. Since the carries must propagate or ripple through this cascade, the configuration is called an n-bit-ripple-carry adder. When the carry-out cn, from the MSB position equals 1, an overflow from the operation occurs ( i.e. the result cannot be represented in n bits) The carry-in, co into the LSB position provides a convenient means of adding 1 to a number. ( for eg. Adding 1 to the 1s compliment of a number forms a 2s compliment

3.4

ADDITION AND SUBTRACTION OF SIGNED NUMBERS


1. To add two numbers, add their n-bit representations, ignoring the carry-out signal from the MSB position. The sum will be the algebraically correct value in the 2s compliment representation as long as the answer is in the range 2n1 through + 2n1 1. 2. To subtract two numbers X and Y, that is to perform XY, find out the 2s compliment of Y and then add it to X. Again the result will be algebraically correct value in the 2s compliment representation if the answer is in the range 2n1 through + 2n1 1.

Rules for addition and subtraction of n-bit signed numbers using 2s compliment representation

Examples:Addition 0010 + 0011 0101 1011 + 1110 1001 0111 + 1101 0100 (+ 2) (+ 3) (+ 5) ( 5) ( 2) ( 7) (+ 2) ( 3) (+ 4) 1001 0001 ( 7) (+ 1) 0010 0100 (+ 2) (+ 4) 1101 1001 Subtraction ( 3) ( 7) 1101 + 0111 0100 0010 + 1100 1110 1001 + 1111 1000 ( 8) ( 2) (+ 4)

Note: In all these 4-bit examples, the answer falls into the representable range of 8 and +7. When answers do not fall within the representable range, we say that arithmetic overflow has occurred.

Computer Organization
Arithmetic and Logic Unit

Module 3.

3.5

BINARY ADDITION SUBTRACTION LOGIC NETWORK

Fig 3.3

The subtraction operation requires the subtrahend i.e. the bottom value to be 2s complimented before the addition is performed. The Add/Sub control wire is set to 0 for addition and 1 for subtraction. So during addition, the Y vector is applied without change to one of the adder inputs along with a carry in signal c0 = 0. During subtraction, the Y vector is 1s complimented by the EXOR gate and c0 is set to 1 to complete the 2s complementation of Y. Note:- An EXOR gate can be added to detect the overflow condition cn cn-1

3.5.1 Overflow Logic


In all the 4 bit examples the answers fall into the representable range of 8 through +7. When answers do not fall within the representable range, we say that arithmetic overflow has occurred. When adding unsigned numbers the carry out cn from the MSB position serves as the overflow indicator. However this does not work for adding signed numbers. Eg. Adding +7 and +4 0111 0100 01011 which is 5 ; a wrong result. The carry out signal from MSB position is 0 Similarly adding 4 and 6 1100 1010 10110

which is +6; again a wrong result. The carry out signal is 1.

Conclusions:Overflow can occur only when adding two numbers that have the same sign and the sign of the sum is different from them. A simple way to detect overflow is to examine the signs of the two summands X and Y and the sign of the result. In an n-bit adder, we can define a signal overflow by the logical expression Overflow = xn-1 yn-1 sn-1 + xn-1 yn-1 sn-1 It can also be shown that overflow occurs when the carry bits cn and cn-1 are different 5

Computer Organization
Arithmetic and Logic Unit

Module 3.

Therefore a simple alternative circuit for detecting overflow can be obtained by implementing the expression cn cn-1 with an EXOR gate.

3.6

DESIGN OF FAST ADDERS

Drawback of n-bit ripple-carry adder is the too much delay in developing its output. This delay is considerable only by considering the speed of other processor components and the data transfer times of registers and cache memory. The length of the delay through a network of logic gates depends on i) IC fabrication technology and ii) number of gates in the path from input to output. After a particular technology is chosen, the delay caused with any combinational logic circuit can be calculated by adding the number of logic gate delays along the longest path through the network. In the above example longest signal propagations from inputs x0, y0 and c0 at the LSB position to outputs cn, sn-1 at the MSB position.

Fig 3.4

Using the logic gate implementation cn-1 is available in 2(n-1) gate delays, and sn-1 is correct one XOR gate delay later. The final carry-out, cn is available after 2n gate delays. Therefore, if a ripple-carry adder is used to implement the addition/subtraction unit (as shown in the previous topic), all sum bits are available in 2n gate delays, including the delay through the XOR gates on the Y input. Using the implementation cn cn-1 for overflow, this indicator is available after 2n + 2 gate delays. Two approaches can be taken to reduce delay in adders. i) To use the fastest possible electronic technology in implementing the ripple-carry logic design ii) To use an augmented logic gate network structure that is larger than that shown in figure 3.4(b) 6

Computer Organization
Arithmetic and Logic Unit

Module 3.

3.7

CARRY-LOOKAHEAD ADDITION

A fast adder circuit must speed up the generation of the carry signals. We have si = x i y i ci ci+1 = yici + xici + xiyi

Factoring the second equation ci+1 = xi yi + ( xi + yi ) ci We can write ci+1 = Gi + Pi ci ; where Gi = xi yi and Pi = xi + yi The expressions Gi and Pi are called Generate and Propagate functions for stage i If the generate function for stage i is equal to 1, then ci + 1 = 1, independent of the input carry ci. . This occurs when both xi and yi are 1. The propagate function means that an input carry will produce an output carry when either xi is 1 or yi is 1. All Gi and Pi functions can be formed independently and in parallel in one logic gate delay after the X and Y vectors are applied to the inputs of an n-bit adder. Each bit stage contains: i) an AND gate to form Gi ii) an OR gate to form Pi iii) a 3- input XOR gate to form si

Fig 3.5

Computer Organization
Arithmetic and Logic Unit

Module 3.

Note:- Here Pi is realized as Pi = xi yi , which differs from Pi = xi + yi only when xi = yi = 1. This does not effect the final ci+1, since when xi = yi = 1; Gi = 1. Thus using a cascade of 2-input XOR gates to realize the 3-input XOR function, the basic cell B in the figure can be used in each bit stage. Expanding ci in terms of i1 subscripted variables and substituting into the ci + 1 expression, we obtain c i+1 = Gi + Pi ci = Gi + Pi (Gi1 + Pi1 ci1) = Gi + Pi Gi1 + Pi Pi1 ci1 = Gi + Pi Gi1 + Pi Pi1 (Gi2 + Pi2 ci2) = Gi + Pi Gi1 + Pi Pi1 Gi2 + Pi Pi1 Pi2 ci2 = Gi + Pi Gi1 + Pi Pi1 Gi2 + . . . . + Pi Pi1 . . P1G0 + Pi Pi1 Pi2 . . P 0c0 Thus all carries can be obtained three gate delays after the input signals X, Y and c0 are applied because only one gate delay is needed to develop all Pi and Gi signals, followed by two gate delays in the AND-OR circuit for ci+1. After a further XOR gate delay, all sum bits are available. Therefore independent of n, the n-bit addition process requires only four gate delays. Example: Consider the design of a 4-bit adder. The carries can be implemented as c1 = Go + Poco c2 = G1 + P1G1 + P1P0c0 c3 = G2 + P2G1 + P2P1G0 + P2P1P0c0 c4 = G3 + P3G2 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0c0 Carries are implemented in the block labeled carry-lookahead logic. An adder implemented in this form is called a carry-look ahead adder. Delay through the adder is 3 gate delays for all carry bits and 4 gate delays for all sum bits. In comparison, note that a 4-bit ripple-carry adder requires 7 gate delays for s3 and 8 gate delays for c4 Sign extension:We often need to represent a given number in 2s compliment system by using a large number of bits. For a positive number, this is achieved by adding 0s to the left. Similarly for a negative number, longer number with the same value is obtained by replicating the sign bit to the left as many times as desired. This operation is called Sign extension.

3.8

MULTIPLICATION OF POSITIVE NUMBERS

The product of two n-digit numbers can be accommodated in 2n digits as shown in the manual multiplication example. 1101 (13) Multiplicand M 1011 (11) Multiplier Q 1101 1101 Partial products Fig 3.6 (a) 0000 1101 10001111 (143) Product P 8

Computer Organization
Arithmetic and Logic Unit

Module 3.

In binary system, if the multiplier bit is 1, the multiplicand is entered in the appropriate position to be added to the partial product. If the multiplier bit is 0, then 0s are entered.

3.8.1 Method I Using combinational logic circuit


Binary multiplication of positive operands can be implemented in a pure combinational two dimensional logic array as shown. The main component in any cell is an Adder circuit. The AND gate in any cell determines whether the multiplicand bit mj, is added to the partial product bit, based on the value of the multiplier bit qi. The multiplicand is shifted left one position per row by the diagonal signal path. PP4 is the desired product.

Fig 3.6 (b)

3.8.2 Method II Using combinational logic circuit & sequential techniques


Multiplication can also be performed using a mixture of combinational array techniques as shown above and sequential techniques. The simplest hardware arrangement for sequential multiplication is as shown. The circuit performs the multiplication by using a single n-bit adder n time. Register A and Q combined to hold the Partial Product i (PPi) while the multiplier bit qi generates the signal Add/Noadd. This signal controls the addition of multiplicand M to PPi to generate PP(i+1). The product is computed in n cycles. The partial product grows in length by one bit in each cycle. The carry-out from the adder is stored in flip flop C.

Computer Organization
Arithmetic and Logic Unit

Module 3.

Fig 3.7(a)

Fig 3.7(b)

At the start the multiplier is loaded into register Q, multiplicand into /register M and C & A are cleared to 0. At the end of each cycle, C, A and Q are shifted right one bit position to allow for growth of the partial product. Because of this shifting, multiplier bit qi appears at the LSB position of Q to generate the Add/Noadd signal at correct time in each cycle (with q0 during the 1st cycle, q1 during the 2nd cycle etc.). After they are used, the multiplier bits are discarded by the right shift operation. Carry-out from the adder is held in C and this is shifted right with the contents of A and Q. After n cycles, the highorder half of the product is held in register A and the low-order half in register Q. The multiplication example of Fig 3.6(a) is shown in Figure 3.7(b) as it would be performed by this hardware arrangement.

10

Computer Organization
Arithmetic and Logic Unit

Module 3.

3.9

SIGNED-OPERAND MULTIPLICATION

Multiplication of signed operands in 2s-compliment system generates a double length product. The general strategy is to accumulate partial products. Case I : Positive multiplier and negative multiplicand Here we must extend the sign-bit value of the multiplicand to the left as far as the product will extend. Example: 10011 01011 1111110011 111110011 00000000 1110011 000000 1101110001 (13) Multiplicand (+11) Multiplier

Partial products (Sign extension bits is shown in italics)

(143)

Here the 5-bit signed operand 13 is the multiplicand. It is multiplied by the multiplier +11 to get the 10-bit product 143. The hardware discussed earlier can be used for negative multiplicands if it provides for sign extension of the partial products. Case II : Negative multiplier For a negative multiplier, solution is to form 2s-compliment of both the multiplier and the multiplicand and proceed as earlier. This is because complementation of both operands does not change the value or sign of the product. A technique which is suitable for both negative and positive multipliers is called the Booth Algorithm.

3.10 BOOTH ALGORITHM


Powerful algorithm for signed number multiplication Generates 2n-bit product Treats both positive and negative 2s complement n-bit operands uniformly Consider a multiplication operation in which the multiplier is positive and has a single block of 1s eg. 0011110. To derive the product, we could add four properly shifted versions of the multiplicand. But we can reduce the number of required operations by considering this multiplier as the difference between two numbers. 0100000 0000010 0011110 (32) (2) (30)

This suggests that the product can be generated by adding 25 times the multiplicand to the 2scompliment of 21 times the multiplicand. We can describe the above sequence of required operations by recoding the multiplier as 0 +1 0 0 0 1 0 (Booth scheme) 11

Computer Organization
Arithmetic and Logic Unit

Module 3.

In this Booth scheme, 1 times the shifted multiplicand is selected when moving from 0 to 1 and +1 times the shifted multiplicand is selected when moving from 1 to 0, as the multiplier is scanned from right to left Example for Booth multiplication algorithm: 0 1 0 1 1 0 1 (Multiplicand) 0 0 1 1 1 1 0 (Multiplier) 0000000 0101101 0101101 0101101 (No sign extension required because the multiplicand is positive) 0101101 0000000 0000000 00010101000110

0 1 0 0 0 0 0 0

0 1 0 0 0 0 0 0

0 1 0 0 0 0 0 0

0 1 0 0 0 1 0 1

0 1 0 0 0 0 0 0

0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0

0 1 0 +1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0

0 0 0 0 0 0 0

1 0 0 0 0 0

1 0 1 0 1 0 0 0 0 1 1 0

(Multiplicand) (Multiplier) (2s compliment of the multiplicand) 0101101 1s - 1 0 1 0 0 1 0 + 1 2s - 1 0 1 0 0 1 1 Sign extension bits shown in italics

0 0 1 1 0

Note: 1) Booth algorithm can be extended to any number of blocks of1s in a multiplier including the situation in which a single 1 is considered as a block. 2) If the least significant bit is 0 or 1, then we must assume that an implied 0 lies to its right. 3) Booth algorithm can be used for negative multipliers also 16-bit number 0 0 1 0 1 1 0 0 1 1 1 0 1 0 1 1 0 0 0 +1 1 +1 0 1 0 +1 0 0 1 +1 1 +1 0 1 0 0

For negative multipliers: 0 1 1 0 1 1 1 0 1 0 (+13) (6) 12

Computer Organization
Arithmetic and Logic Unit

Module 3.

0 1 0 1 0 1

0 1 0 1 0 1

0 1 0 1 0 1

0 0 1 1 0 1 0 0 0 0 0 1

0 1 1 0 0 1 +1 1 0 0 0 0 0 0 1 1 1 0 1 1 1 0 1 0 0 1

1 0 0

(+13) (6) (2s compliment of the multiplicand 01101) 01101 1s - 1 0 0 1 0 + 1 2s - 1 0 0 1 1 (78) Sign extension bits shown in italics

Booth multiplier recoding table Multiplier Version of multiplicand selected by Bit i Bit (i) Bit (i1) 0 0 0M 0 1 +1 M 1 0 1 M 1 1 0M Fig 3.8 The transformation from original multiplier to booth recoded multiplier is called skipping over 1s 16-bit worst case multiplier 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 16-bit ordinary multiplier 1 1 0 0 0 1 0 1 1 0 1 1 1 1 0 0 16-bit good multiplier 0 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1

(no blocks of 1s) (some blocks of 1s) (large blocks of 1s)

Attractive features of Booth Algorithm 1. It handles both positive and negative multipliers uniformly 2. It achieves some efficiency in the number of additions required when the multiplier has a few large blocks of 1s.

3.11 FAST MULTIPLICATION


Two techniques to speed-up the multiplication operation

3.11.1 BIT-PAIR RECODING OF MULTIPLIERS


13

Computer Organization
Arithmetic and Logic Unit

Module 3.

Maximum number of summands that must be added is reduced by half Derived directly from booth algorithm Group the booth-recoded multiplier bits in pairs- we can observe the following things:The pair (+1 1) is equivalent to the pair (0 +1) i.e. instead of adding 1 times the multiplicand M at shift position i to +1 times M at position i +1, the same result is obtained by adding +1 times M at position i. Similarly (+1 0) is equivalent to (0 +2) (1 0) is equivalent to (0 2) (1 +1) is equivalent to (0 1) and (+1 1) is equivalent to (0 +1) as stated above. Thus if the booth recoded multiplier is examined two-bits at a time, starting from the right, it can be rewritten in a form that requires at most one version of the multiplicand to be added to the partial product Sign extension Implied 0 to right of LSB 1 0 0 1 1 0 1 0 0 1 +1 1 0 1 2 0 (6)

0 1 1 0 1 1 1 0 1 0

0 1 1 0 1 0 1 +1 1 0 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 0

0 1 1 0 1 0 1 2 1 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 1 0

01101 10010 + 1 10011

2 indicates that the 2s compliment of the multiplicand should be left shifted one place 1 indicates that 2s compliment of the multiplicand should be put in the correct position itself. Above example shows that bit-pair recoding of multiplier requires only half the number of summands compared to the normal algorithm.

3.11.2 CARRY-SAVE ADDITION OF SUMMANDS


Multiplication requires the addition of several summands. Carry-save addition technique will speedup the addition process. Method I: Consider the addition of three n-bit numbers W, X, Y to produce a sum Z. In general case, we can first add W to X to generate a number A and then we can add A to Y to produce Z. This can be done by using two ripple-carry adders

14

Computer Organization
Arithmetic and Logic Unit W +X A +Y Z 1 1 11 1 1 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 1 1 0 0 0

Module 3.

Fig 3.9 Method II: In a second method, instead of adding W and X to produce A, we introduce the bits of Y into the carry inputs. This generates the vectors S and the saved carries C as the outputs of the upper row. In the second row, S and C are added in a ripple-carry adder to produce the desired sum Z. W X +Y S +Y Z 1 1 1 1 1 0 1 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 1 0 0 0

Here the intermediate sum A is not generated in the second method. Carry-save addition transforms W, X and Y into S and C, because all bits of S and C are available in parallel.

Fig 3.10

Consider the addition of many summands. We can group the summands into threes and performs carry-save addition on them. We continue this process until there are only two vectors remaining. They can be added in a ripple-carry adder or a carry-lookahead adder to produce the desired sum. Each carry-save addition reducing three to two is done in only one FA unit delay independent of the length of the number. Consider the example of adding 9 numbers using this method:Step I : initial 3 groups of three numbers each is reduced to 6 numbers. Step II: These 6 nos. are reduced to 4. Step III: 4 reduced to 3 15

Computer Organization
Arithmetic and Logic Unit

Module 3.

Step IV: then 3 reduced to 2 nos. Step V: Final 2 numbers are added to produce the desired sum. The complete operation requires 4 full adder unit delays to do the carry-save additions, followed by a full addition operation on the final 2 vectors.

3.12 Applications of High- Speed Multiplication Techniques


The addition of 32 numbers using the carry-save addition method requires only 7 full adder unit delays before the final Add operation. In general about 1.7log2K full adder unit delays (i.e. carry-save addition steps) are needed to reduce K summands to 2 vectors which when added, produce the desired sum. Again a carry-lookahead adder can be used effectively to add the final 2 vectors because all bits of these vectors are produced in parallel as the result of the carry-save addition operation. This method saves time and enhances performance Conclusion: Bit-pair recoding of the multiplier derived from the Booth algorithm, reduces the number of summands by up to a factor of 2. These summands can then be reduced to only 2 by using a relatively small number of carrysave addition of steps. The final product can be generated by one addition operation which is speeded up by using a lookahead adder. In summary, all the 3 basic techniques bit-pair recoding of the multiplier, carry-save addition of the summands and lookahead addition have been used in high-performance processors to reduce the time needed to perform multiplication.

3.13 FLOATING POINT NUMBERS


Integers are considered of having an implied binary point at the right end of the number (Fixed point numbers). It is also possible to assume that the binary point is just to the right of the sign bit, thus representing a fraction. Consider an integer in 32 bit, signed, fixed-point format. The approximate range of value represented by this system is from 0 to 2.15109. If we consider them to be fractions, the range is approximately 4.551010 to 1. This range is not sufficient for scientific calculations involving parameters like Avogadros number (6.0247 1023 mole1) or Plancks constant (6.6254 1027 erg.s). Therefore we need to accommodate both very large numbers and very small fractions. This means that the computer must be able to represent numbers and operate on them in such a way that the position of binary point is variable and is automatically adjusted as computation proceeds. In such a case the binary point is said to float and the numbers are called floating point numbers. (Where as in the fixed-point numbers, the binary point is always in the same position)

3.13.1 FLOATING-POINT REPRESENTATION


In decimal scientific notation, numbers may be written as 6.0247 1023 , 6.6254 1027 and so on. These numbers are said to have five significant digits. Scale factors (1023, 1027 and so on) indicate 16

Computer Organization
Arithmetic and Logic Unit

Module 3.

the position of the decimal point with respect to the significant digits. By convention, when the decimal point is placed to the right of the first (nonzero) significant digit, the number is said to be normalized. A floating point representation is one in which a number is represented by its sign, a string of significant digits commonly called mantissa and an exponent to an implied base for the scale factor. Eg:- X1.X2X3X4X5X6X7 10Y1Y2 where Xi and Yi are decimal digits. Here both the number of significant digits (7) and the exponent range (99) are sufficient for a wide range of scientific calculations. It is possible to approximate this mantissa precision and scale factor range in a binary representation that occupies 32 bits which is a standard computer word length. Out of this 32 bits, 24 bits for mantissa (which can approximately represent a 7 digit number) and 8 bits for exponent to an implied base 2. Since the leading nonzero bit of a nonzero mantissa must be a binary 1, it is not included explicitly in the representation. Anyway one bit is needed for the sign of the number. This standard is developed by IEEE.

32-bits

8-bit signed exponent in excess-127 representation

23-bit mantissa fraction

Value represented = 1.M 2 E127 Fig (a) Single-precision

Sign of number 0 0 1 0 1 0 0 0 0 0 signifies + 1 signifies

001010 . . .

Value represented = 1.001010 . . . 0 2 87 Fig (b) Example of a single-precision number Fig 3.11 (a & b) Here instead of signed exponent E, the value actually stored in the exponent field is an unsigned integer E= E + 127. This is called the Excess-127 format. Thus E is in the range 0 E 255. Actual exponent is in the range 126 E +127 (0 and 255 are used to represent special values). Thus scale factor has a range of 2126 to 2+127 which is equal to 1038. Since binary normalization is used, the most significant bit of the mantissa is always equal to 1. This bit is not explicitly represented- it is assumed tp be at immediate left of the binary point. The 23 bits stored in the M field represent the fractional part of the mantissa i.e. the bits to the right of the binary point. This 32 bit standard representation is called a single precision representation because it occupies a single 32-bit word. 17

Computer Organization
Arithmetic and Logic Unit

Module 3.

To provide more precision and range for floating point numbers IEEE standard specifies a double precision format as shown below

64-bits

Sign

11-bit excess-1023 exponent

52-bit mantissa fraction

Value represented = 1.M 2 E1023 Fig 3.11(c) Double precision

Here 11 bit Excess-1023 3xponent E has the range 0 < E < 2047. Actual exponent is in the range 1022 E 1023 providing scale factors of 21022 to 21023 (which is equal to approximately 10308) 53 bit mantissa provides a precision equivalent to about 16 decimal digits Note:- Double precision representation is optional. Any commercial computer implementation must provide at least single-precision representation to confirm to the IEEE standard.

18

You might also like