Floating-Point Numbers and Operations Representation


Floating-point numbers and operations

Representation
 
The IEEE single-precision floating-point standard representation requires a 32-bit
word, whose bits may be numbered from 0 to 31, left to right. The first bit
is the sign bit, S, the next eight bits are the exponent bits, E’, and the final 23 bits are
the fraction, F. Instead of the signed exponent E, the value stored is an unsigned
integer E’ = E + 127, called the excess-127 format. Therefore, E’ is in the range
0 ≤ E’ ≤ 255.
 
S  E’E’E’E’E’E’E’E’  FFFFFFFFFFFFFFFFFFFFFFF
0  1              8  9                    31
 
The value V represented by the word may be determined as follows:

• If E’ = 255 and F is nonzero, then V = NaN (“Not a number”).
• If E’ = 255 and F is zero and S is 1, then V = -Infinity.
• If E’ = 255 and F is zero and S is 0, then V = Infinity.
• If 0 < E’ < 255, then V = (-1)**S * 2**(E’-127) * (1.F), where “1.F” represents the binary number created by prefixing F with an implicit leading 1 and a binary point.
• If E’ = 0 and F is nonzero, then V = (-1)**S * 2**(-126) * (0.F). These are “unnormalized” (denormalized) values.
• If E’ = 0 and F is zero and S is 1, then V = -0.
• If E’ = 0 and F is zero and S is 0, then V = 0.

For example,
 
0 00000000 00000000000000000000000 = 0
 
1 00000000 00000000000000000000000 = -0
 
0 11111111 00000000000000000000000 = Infinity
 
1 11111111 00000000000000000000000 = -Infinity
 
0 11111111 00000100000000000000000 = NaN
 
1 11111111 00100010001001010101010 = NaN
 
0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
 
0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
 
1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5
 
0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)

0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127) (unnormalized value)

0 00000000 00000000000000000000001 = +1 * 2**(-126) * 0.00000000000000000000001 = 2**(-149) (unnormalized value; smallest positive value)
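The decoding rules above can be sketched in Python. This is a minimal illustration rather than a reference implementation; the function name `decode_single` and its string input format are assumptions made for this example.

```python
import math

def decode_single(bits: str) -> float:
    """Decode a 32-bit pattern written as 'S EEEEEEEE FFF...F' (spaces optional)."""
    bits = bits.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    sign = -1.0 if bits[0] == "1" else 1.0
    e_stored = int(bits[1:9], 2)           # E' (excess-127 exponent)
    fraction = int(bits[9:], 2) / 2**23    # value of 0.F, in [0, 1)
    if e_stored == 255:
        # all-ones exponent: NaN if F is nonzero, otherwise signed infinity
        return math.nan if fraction else sign * math.inf
    if e_stored == 0:
        # zero (F == 0) or an unnormalized value (F != 0)
        return sign * fraction * 2.0**-126
    return sign * (1.0 + fraction) * 2.0**(e_stored - 127)

print(decode_single("0 10000001 10100000000000000000000"))  # 6.5
```

Feeding the other example patterns from the list above through the same function is a quick way to check them by hand.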
 
Double Precision Numbers:
 
The IEEE double-precision floating-point standard representation requires a 64-bit
word, whose bits may be numbered from 0 to 63, left to right. The first bit
is the sign bit, S, the next eleven bits are the excess-1023 exponent bits, E’, and the
final 52 bits are the fraction, F:
 
S  E’E’E’E’E’E’E’E’E’E’E’  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
0  1                   11  12                                                63
 
The value V represented by the word may be determined as follows:

• If E’ = 2047 and F is nonzero, then V = NaN (“Not a number”).
• If E’ = 2047 and F is zero and S is 1, then V = -Infinity.
• If E’ = 2047 and F is zero and S is 0, then V = Infinity.
• If 0 < E’ < 2047, then V = (-1)**S * 2**(E’-1023) * (1.F), where “1.F” represents the binary number created by prefixing F with an implicit leading 1 and a binary point.
• If E’ = 0 and F is nonzero, then V = (-1)**S * 2**(-1022) * (0.F). These are “unnormalized” (denormalized) values.
• If E’ = 0 and F is zero and S is 1, then V = -0.
• If E’ = 0 and F is zero and S is 0, then V = 0.
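To see the three double-precision fields of an actual value, the 64-bit word can be taken apart with shifts and masks. The sketch below uses Python's standard `struct` module; the helper name `double_fields` is hypothetical and chosen only for this example.

```python
import struct

def double_fields(x: float) -> tuple[int, int, int]:
    """Return (S, E', F) for the 64-bit double-precision pattern of x."""
    # Pack as a big-endian IEEE double, then reread the same 8 bytes as an unsigned integer
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = word >> 63                      # 1 sign bit
    e_stored = (word >> 52) & 0x7FF        # 11-bit excess-1023 exponent E'
    fraction = word & ((1 << 52) - 1)      # 52-bit fraction F
    return sign, e_stored, fraction

s, e, f = double_fields(-6.5)
print(s, e - 1023, f"{f:052b}")            # sign, true exponent, fraction bits
```

For -6.5 = -1.101 (binary) × 2**2, the stored exponent is 2 + 1023 = 1025 and the fraction begins with the bits 101.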

 
Arithmetic unit
 
Arithmetic operations on floating-point numbers consist of addition, subtraction,
multiplication and division. The operations are done with algorithms similar to those
used on sign-magnitude integers (because of the similarity of representation). For
example, only numbers of the same sign are added directly; if the numbers are of
opposite sign, a subtraction must be performed instead.
 
ADDITION
 
Example on decimal values given in scientific notation:

   3.25 x 10 ** 3
 + 2.63 x 10 ** -1
 -----------------

first step: align decimal points
second step: add

   3.25     x 10 ** 3
 + 0.000263 x 10 ** 3
 --------------------
   3.250263 x 10 ** 3

(presumes use of infinite precision, without regard for accuracy)

third step: normalize the result (already normalized!)
 
Example on floating-point values given in binary:

 .25 =    0 01111101 00000000000000000000000
 100 =    0 10000101 10010000000000000000000

To add these floating-point representations,
 
step 1:  align radix points
 
shifting the mantissa left by 1 bit decreases the exponent by 1
 
shifting the mantissa right by 1 bit increases the exponent by 1
 
we want to shift the mantissa right, because the bits that fall off the end should
come from the least significant end of the mantissa

-> choose to shift the .25, since we want to increase its exponent.
-> shift by
       10000101
     - 01111101
     ----------
       00001000    (8) places.
 
0 01111101 00000000000000000000000 (original value)
0 01111110 10000000000000000000000 (shifted 1 place)
(note that hidden bit is shifted into msb of mantissa)
0 01111111 01000000000000000000000 (shifted 2 places)
0 10000000 00100000000000000000000 (shifted 3 places)
0 10000001 00010000000000000000000 (shifted 4 places)
0 10000010 00001000000000000000000 (shifted 5 places)
 
0 10000011 00000100000000000000000 (shifted 6 places)
0 10000100 00000010000000000000000 (shifted 7 places)
0 10000101 00000001000000000000000 (shifted 8 places)
 
 
step 2: add (don’t forget the hidden bit for the 100)

     0 10000101 1.10010000000000000000000  (100)
   + 0 10000101 0.00000001000000000000000  (.25)
   ----------------------------------------
     0 10000101 1.10010001000000000000000

step 3: normalize the result (get the “hidden bit” to be a 1)
It already is for this example.

The result is
0 10000101 10010001000000000000000
which represents 100.25.
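The three steps (align, add, normalize) can be tried out on raw bit patterns. The following is a simplified sketch, not how any particular arithmetic unit is built: it handles only positive, normalized single-precision operands, truncates bits shifted out during alignment, and does not check for exponent overflow. The helper names `to_bits`, `from_bits` and `add_positive_singles` are assumptions for this example.

```python
import struct

def to_bits(x: float) -> int:
    """32-bit single-precision pattern of x, as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def from_bits(word: int) -> float:
    """Single-precision value of a 32-bit pattern."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

def add_positive_singles(a: int, b: int) -> int:
    """Add two positive, normalized single-precision bit patterns."""
    exp_a, exp_b = a >> 23, b >> 23            # sign bit is 0, so this is the stored exponent
    sig_a = (a & 0x7FFFFF) | 0x800000          # restore the hidden bit
    sig_b = (b & 0x7FFFFF) | 0x800000
    if exp_a < exp_b:                          # make operand a the one with the larger exponent
        exp_a, exp_b, sig_a, sig_b = exp_b, exp_a, sig_b, sig_a
    sig_b >>= exp_a - exp_b                    # step 1: align radix points (low bits fall off)
    total = sig_a + sig_b                      # step 2: add
    if total & 0x1000000:                      # step 3: a carry out means shift right once
        total >>= 1
        exp_a += 1
    return (exp_a << 23) | (total & 0x7FFFFF)  # drop the hidden bit again

result = add_positive_singles(to_bits(0.25), to_bits(100.0))
print(f"{result:032b}", from_bits(result))
```

For the operands .25 and 100 this reproduces the pattern derived by hand above; the carry branch is exercised by sums such as 1.5 + 1.5.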

Representation of Floating-Point numbers

(-1)**S × M × 2**E

Bit No   Size      Field Name
31       1 bit     Sign (S)
23-30    8 bits    Exponent (E)
0-22     23 bits   Mantissa (M)

A single-precision floating-point number occupies 32 bits, so there is a
compromise between the size of the mantissa and the size of the exponent.

These chosen sizes provide a range of approximately:

± 10**-38 ... 10**38

• Overflow

The exponent is too large to be represented in the Exponent field

• Underflow

The exponent is too small (too negative) to be represented in the Exponent field

To reduce the chances of underflow/overflow, 64-bit double-precision
arithmetic can be used:

Bit No   Size      Field Name
63       1 bit     Sign (S)
52-62    11 bits   Exponent (E)
0-51     52 bits   Mantissa (M)

providing a range of approximately

± 10**-308 ... 10**308
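Overflow and underflow in both formats can be observed from Python, whose `float` type is a 64-bit double on common platforms. The sketch below is only a demonstration; the helper name `to_single` is hypothetical, and it relies on the standard `struct` module raising `OverflowError` when a finite value is too large for the 32-bit format.

```python
import math
import struct
import sys

def to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Single precision: 1e39 is beyond the largest single-precision value
try:
    to_single(1e39)
    single_overflowed = False
except OverflowError:
    single_overflowed = True

# Single precision: 1e-46 is below the smallest positive value 2**-149 and becomes 0.0
single_underflow = to_single(1e-46)

# Double precision: the same effects appear, but much further out
double_overflow = sys.float_info.max * 2         # becomes infinity
double_underflow = sys.float_info.min / 2**60    # becomes 0.0

print(single_overflowed, single_underflow, double_overflow, double_underflow)
```

Values such as 1e38 and 1e-38 still fit in single precision, which matches the approximate range quoted above.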

These formats are specified by the IEEE 754 floating-point standard.

IEEE 754 Floating-Point Standard

Since the mantissa is always 1.xxxxxxxxx in the normalised form, there is no need to
represent the leading 1. So, effectively:

• Single Precision: mantissa ===> 1 (hidden) bit + 23 bits
• Double Precision: mantissa ===> 1 (hidden) bit + 52 bits

Since zero (0.0) has no leading 1, to distinguish it from other numbers it is given the
reserved bit pattern of all 0s for the exponent, so that hardware won't attach a
leading 1 to it. Thus:

• Zero (0.0) = 0000...0000
• Other numbers = (-1)**S × (1 + Mantissa) × 2**E

If we number the mantissa bits from left to right m1, m2, m3, ...

mantissa = m1 × 2**-1 + m2 × 2**-2 + m3 × 2**-3 + ...
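The weighted sum of mantissa bits is easy to evaluate directly. A small sketch, with the function name `mantissa_value` chosen only for illustration:

```python
def mantissa_value(bits: str) -> float:
    """Value of m1*2**-1 + m2*2**-2 + m3*2**-3 + ... for a string of mantissa bits."""
    return sum(int(m) * 2.0**-i for i, m in enumerate(bits, start=1))

# Mantissa field of 100.0 in single precision: 1 + 0.5625 = 1.5625, and 1.5625 * 2**6 = 100
print(mantissa_value("10010000000000000000000"))  # 0.5625
```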

Negative exponents could pose a problem in comparisons.

For example (with two's complement):

              Sign   Exponent   Mantissa
1.0 × 2**-1   0      11111111   0000000 00000000 00000000
1.0 × 2**+1   0      00000001   0000000 00000000 00000000

With this representation, the first exponent shows a "larger" binary number,
making direct comparison more difficult.

To avoid this, Biased Notation is used for exponents.

If the real exponent of a number is X then it is represented as (X + bias)

IEEE single-precision uses a bias of 127. Therefore, an exponent of

-1 is represented as -1 + 127 = 126 = 01111110 (binary)
 0 is represented as  0 + 127 = 127 = 01111111 (binary)
+1 is represented as +1 + 127 = 128 = 10000000 (binary)
+5 is represented as +5 + 127 = 132 = 10000100 (binary)

So the actual exponent is found by subtracting the bias from the stored
exponent. Therefore, given S, E, and M fields, an IEEE floating-point number
has the value:

(-1)**S × (1.0 + 0.M) × 2**(E-bias)

(Remember: it is (1.0 + 0.M) because, with normalised form, only
the fractional part of the mantissa needs to be stored)
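Biased notation and its effect on comparisons can be illustrated as follows. This is a sketch under the single-precision bias of 127; the names `encode_exponent`, `decode_exponent` and `single_bits` are hypothetical helpers for this example.

```python
import struct

BIAS = 127  # single-precision bias

def encode_exponent(x: int) -> str:
    """8-bit stored exponent field for the true exponent x (normalized range only)."""
    if not -126 <= x <= 127:
        raise ValueError("exponent outside the normalized single-precision range")
    return f"{x + BIAS:08b}"

def decode_exponent(field: str) -> int:
    """True exponent recovered by subtracting the bias from the stored field."""
    return int(field, 2) - BIAS

def single_bits(x: float) -> int:
    """32-bit single-precision pattern of x, as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

for x in (-1, 0, 1, 5):
    print(x, encode_exponent(x))

# With a biased exponent, positive values sort in the same order as their
# bit patterns read as unsigned integers.
values = [0.5, 2.0, 0.75, 100.0, 1.0]
print(sorted(values) == sorted(values, key=single_bits))
```

The last line shows the design benefit: for positive numbers, a plain integer comparison of the bit patterns gives the same ordering as comparing the values themselves.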

Floating Point Addition

Add the following two decimal numbers in scientific notation:

8.70 × 10**-1 with 9.95 × 10**1

1. Rewrite the smaller number such that its exponent matches the
exponent of the larger number.

8.70 × 10**-1 = 0.087 × 10**1

2. Add the mantissas

9.95 + 0.087 = 10.037 and write the sum 10.037 × 10**1

3. Put the result in Normalised Form

10.037 × 10**1 = 1.0037 × 10**2 (shift mantissa, adjust exponent)

Check for overflow/underflow of the exponent after normalisation.

4. Round the result

If the mantissa does not fit in the space reserved for it, it has to be
rounded off.

For example, if only 4 digits are allowed for the mantissa:

1.0037 × 10**2 ===> 1.004 × 10**2

(A hidden bit is available only with binary floating-point numbers.)
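The decimal example can be checked with Python's standard `decimal` module, which performs the alignment, addition and rounding internally. The 4-digit context below is an assumption chosen to match the "only 4 digits" case in step 4, and round-half-up is likewise an assumed rounding rule.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)  # keep 4 significant digits

a = Decimal("8.70E-1")
b = Decimal("9.95E+1")

exact = a + b             # default context keeps 28 digits, so this sum is exact
rounded = ctx.add(a, b)   # the same sum rounded to 4 significant digits

print(exact, rounded)
```

The exact sum equals 10.037 × 10**1 and the rounded sum equals 1.004 × 10**2, in agreement with steps 2 to 4.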

Example addition in binary

Perform 0.5 + (-0.4375)

0.5 = 0.1 × 2**0 = 1.000 × 2**-1 (normalised)

-0.4375 = -0.0111 × 2**0 = -1.110 × 2**-2 (normalised)

1. Rewrite the smaller number such that its exponent matches the
exponent of the larger number.

-1.110 × 2**-2 = -0.1110 × 2**-1

2. Add the mantissas:

1.000 × 2**-1 + (-0.1110 × 2**-1) = 0.001 × 2**-1

3. Normalise the sum, checking for overflow/underflow:

0.001 × 2**-1 = 1.000 × 2**-4

-126 <= -4 <= 127 ===> No overflow or underflow

4. Round the sum:

The sum fits in 4 bits so rounding is not required

Check: 1.000 × 2**-4 = 0.0625 which is equal to 0.5 - 0.4375
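The four steps can be sketched for signed operands with a 4-bit significand. This is an illustrative sketch only. It aligns by scaling the significand of the larger operand up instead of shifting the smaller one right, which gives the same sum and avoids dropping bits before the add. It truncates instead of rounding, and it reuses the exponent range -126 to 127 from step 3. The names `fp_add` and `value` and the (sign, significand, exponent) triple format are assumptions for this example.

```python
def fp_add(a, b, bits=4):
    """Add two values given as (sign, significand, exponent) triples.

    The significand is an integer of `bits` bits whose top bit is the leading 1,
    so the value is (-1)**sign * significand * 2**(exponent - (bits - 1)).
    """
    if a[2] < b[2]:
        a, b = b, a                        # a now has the larger exponent
    (sa, ma, ea), (sb, mb, eb) = a, b
    ma <<= ea - eb                         # step 1: align on the smaller exponent
    total = (-ma if sa else ma) + (-mb if sb else mb)   # step 2: add signed significands
    if total == 0:
        return (0, 0, 0)
    sign, mag, exp = int(total < 0), abs(total), eb
    shift = mag.bit_length() - bits        # step 3: normalise to exactly `bits` bits
    if shift > 0:
        mag >>= shift                      # step 4: extra low bits are truncated here
    else:
        mag <<= -shift
    exp += shift
    if not -126 <= exp <= 127:
        raise OverflowError("exponent overflow or underflow")
    return (sign, mag, exp)

def value(t, bits=4):
    """Numeric value of a (sign, significand, exponent) triple."""
    sign, mag, exp = t
    return (-1) ** sign * mag * 2.0 ** (exp - (bits - 1))

# 0.5 = 1.000 × 2**-1 and -0.4375 = -1.110 × 2**-2
total = fp_add((0, 0b1000, -1), (1, 0b1110, -2))
print(total, value(total))
```

The result triple corresponds to 1.000 × 2**-4, the same normalised sum obtained by hand.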
