Floating-Point Numbers and Operations Representation
Representation
The IEEE single precision floating point standard representation requires a 32-bit
word, which may be represented as numbered from 0 to 31, left to right. The first bit
is the sign bit, S; the next eight bits are the exponent bits, E'; and the final 23 bits are
the fraction, F. Instead of the signed exponent E, the value stored is an unsigned
integer E' = E + 127, called the excess-127 format. Therefore, E' is in the range
0 <= E' <= 255.
S E'E'E'E'E'E'E'E' FFFFFFFFFFFFFFFFFFFFFFF
(bit 0 is S, bits 1 to 8 are E', bits 9 to 31 are F)
The value V represented by the word may be determined as follows:
If E' = 255 and F is nonzero, then V = NaN ("Not a Number").
If E' = 255 and F is zero and S is 1, then V = -Infinity.
If E' = 255 and F is zero and S is 0, then V = Infinity.
If 0 < E' < 255, then V = (-1)**S * 2**(E'-127) * (1.F), where "1.F" is the binary
number created by prefixing F with an implicit leading 1 and a binary point.
If E' = 0 and F is nonzero, then V = (-1)**S * 2**(-126) * (0.F). These are the
denormalized (unnormalized) values.
If E' = 0 and F is zero and S is 1, then V = -0.
If E' = 0 and F is zero and S is 0, then V = 0.
For example,
0 00000000 00000000000000000000000 = 0
1 00000000 00000000000000000000000 = -0
0 11111111 00000000000000000000000 = Infinity
1 11111111 00000000000000000000000 = -Infinity
0 11111111 00000100000000000000000 = NaN
1 11111111 00100010001001010101010 = NaN
0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5
0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127)
0 00000000 00000000000000000000001 = +1 * 2**(-126) * 0.00000000000000000000001
                                   = 2**(-149) (smallest positive value)
(The last two are denormalized, or "unnormalized", values.)
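The bit patterns above can be verified with a short sketch in Python (using the
standard struct module; decode_float32 is an illustrative helper, not a library
function):

```python
import struct

def decode_float32(bits: str) -> float:
    """Interpret a 32-character bit string (S, E', F concatenated)
    as an IEEE single-precision value."""
    word = int(bits, 2)
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]

# S = 0, E' = 10000001 (129), F = 101... -> +1.101b * 2**(129-127) = 6.5
print(decode_float32("0" + "10000001" + "101" + "0" * 20))

# All-ones exponent with zero fraction encodes infinity.
print(decode_float32("1" + "11111111" + "0" * 23))   # -inf

# Smallest positive denormalized value, 2**(-149).
print(decode_float32("0" * 31 + "1"))
```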
Double Precision Numbers:
The IEEE double precision floating point standard representation requires a 64-bit
word, which may be represented as numbered from 0 to 63, left to right. The first bit
is the sign bit, S, the next eleven bits are the excess-1023 exponent bits, E’, and the
final 52 bits are the fraction ‘F’:
S E'E'E'E'E'E'E'E'E'E'E' FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
(bit 0 is S, bits 1 to 11 are E', bits 12 to 63 are F)
The value V represented by the word may be determined as follows:
If E' = 2047 and F is nonzero, then V = NaN.
If E' = 2047 and F is zero, then V = (-1)**S * Infinity.
If 0 < E' < 2047, then V = (-1)**S * 2**(E'-1023) * (1.F).
If E' = 0 and F is nonzero, then V = (-1)**S * 2**(-1022) * (0.F) (denormalized
values).
If E' = 0 and F is zero, then V = (-1)**S * 0.
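A double-precision bit pattern can be decoded the same way as in single
precision (struct is standard Python; decode_float64 is an illustrative helper
name):

```python
import struct

def decode_float64(bits: str) -> float:
    """Interpret a 64-character bit string (S, E', F concatenated)
    as an IEEE double-precision value."""
    word = int(bits, 2)
    return struct.unpack(">d", word.to_bytes(8, "big"))[0]

# S = 0, E' = 10000000001 (1025), F = 101... -> +1.101b * 2**(1025-1023) = 6.5
print(decode_float64("0" + "10000000001" + "101" + "0" * 49))
```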
Arithmetic unit
Arithmetic operations on floating-point numbers consist of addition, subtraction,
multiplication, and division. The operations are done with algorithms similar to
those used on sign-magnitude integers (because of the similarity of
representation): for example, only numbers of the same sign are added directly;
if the numbers are of opposite sign, a subtraction must be done instead.
ADDITION
Example on decimal values given in scientific notation:

    3.25     x 10 ** 3
+   2.63     x 10 ** -1
-----------------------

First step: align decimal points.
Second step: add.

    3.25     x 10 ** 3
+   0.000263 x 10 ** 3
-----------------------
    3.250263 x 10 ** 3

(presumes use of infinite precision, without regard for accuracy)

Third step: normalize the result (already normalized!).
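The three steps can be sketched with (significand, exponent) pairs; fp_add below
is an illustrative helper that assumes exact arithmetic, so it skips the rounding
a real arithmetic unit must also perform:

```python
def fp_add(a, b):
    """Add two numbers given as (significand, exponent) pairs in base 10."""
    (ma, ea), (mb, eb) = a, b
    # Step 1: align to the larger exponent by shifting the smaller significand.
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb *= 10.0 ** (eb - ea)
    # Step 2: add the aligned significands.
    m, e = ma + mb, ea
    # Step 3: normalize so that 1 <= |significand| < 10.
    while abs(m) >= 10:
        m, e = m / 10, e + 1
    while 0 < abs(m) < 1:
        m, e = m * 10, e - 1
    return m, e

print(fp_add((3.25, 3), (2.63, -1)))   # roughly (3.250263, 3)
```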
Example on floating-point value given in binary:
0.25 = 0 01111101 00000000000000000000000
Here S = 0, E' = 01111101 = 125, and F = 0, so the value is
(-1)**S * 1.0 * 2**(125-127) = 2**(-2) = 0.25.
Overflow
Overflow occurs when the true exponent of a result is too large to fit in the
exponent field; the result is typically replaced by Infinity.
Underflow
Underflow occurs when the true exponent of a result is too small to fit in the
exponent field; the result is replaced by a denormalized value or by zero.
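Both conditions are easy to provoke with ordinary doubles in Python (a sketch;
the constants are simply convenient values near the edges of the double-precision
range):

```python
huge = 1e308        # near the top of the double range (max is about 1.8 * 10**308)
print(huge * 10)    # exponent too large to represent: overflow -> inf

tiny = 5e-324       # smallest positive denormalized double, 2**(-1074)
print(tiny / 2)     # below the smallest denormal: underflow -> 0.0
```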
Since zero (0.0) has no leading 1, to distinguish it from other values it is given
the reserved bit pattern of all 0s for the exponent, so that hardware won't attach
a leading 1 to it.
If we number the mantissa bits from left to right m1, m2, m3, ..., then the value
of the mantissa is M = 1 + m1 * 2**-1 + m2 * 2**-2 + m3 * 2**-3 + ...
If exponents were stored directly in signed form, a negative exponent would
appear as a "larger" binary number than a positive one, making direct comparison
of stored values more difficult; storing the exponent with a bias avoids this.
So the actual exponent is found by subtracting the bias from the stored
exponent. Therefore, given the S, E, and M fields, an IEEE floating-point number
has the value:
V = (-1)**S * M * 2**(E - bias)
(where the bias is 127 for single precision and 1023 for double precision).
1. Rewrite the smaller number such that its exponent matches with the
exponent of the larger number.
8.70 x 10 ** -1 = 0.087 x 10 ** 1
If the mantissa does not fit in the space reserved for it, it has to be
rounded off.
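This rounding is visible even in ordinary doubles: 0.1 has no finite binary
representation, so its mantissa is rounded when stored (a standard Python
demonstration):

```python
# Both 0.1 and 0.2 are rounded when stored, and the sum is rounded again,
# so the result differs slightly from the (also rounded) literal 0.3.
print(0.1 + 0.2)            # 0.30000000000000004
print(0.1 + 0.2 == 0.3)     # False
```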
Perform 0.5 + (-0.4375)
1. Rewrite the smaller number such that its exponent matches with the
exponent of the larger number.
-1.110 x 2 ** -2 = -0.1110 x 2 ** -1
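The steps can be checked numerically with (significand, exponent) pairs in
base 2; fp2_add is an illustrative helper that assumes exact arithmetic:

```python
def fp2_add(a, b):
    """Add two (significand, exponent) pairs in base 2, mirroring the steps:
    align to the larger exponent, add, then renormalize to 1 <= |m| < 2."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb *= 2.0 ** (eb - ea)          # step 1: align exponents
    m, e = ma + mb, ea              # step 2: add significands
    while abs(m) >= 2:              # step 3: normalize
        m, e = m / 2, e + 1
    while 0 < abs(m) < 1:
        m, e = m * 2, e - 1
    return m, e

# 0.5 = +1.000b * 2**-1,  -0.4375 = -1.110b * 2**-2 (significand -1.75)
m, e = fp2_add((1.0, -1), (-1.75, -2))
print(m, e, m * 2.0 ** e)   # 1.0 -4 0.0625
```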