Floating-Point Numbers and Operations Representation

Floating-point numbers and operations
Representation

The IEEE single-precision floating-point standard representation requires a 32-bit
word, whose bits may be numbered from 0 to 31, left to right. The first bit
is the sign bit, S, the next eight bits are the exponent bits, E’, and the final 23 bits are
the fraction, F. Instead of the signed exponent E, the value stored is an unsigned
integer E’ = E + 127, called the excess-127 format. Therefore, E’ is in the range 0 ≤ E’
≤ 255.

S E’E’E’E’E’E’E’E’ FFFFFFFFFFFFFFFFFFFFFFF

0 1 8 9 31

The value V represented by the word may be determined as follows:
• If E’ = 255 and F is nonzero, then V = NaN (“Not a number”)
• If E’ = 255 and F is zero and S is 1, then V = -Infinity
• If E’ = 255 and F is zero and S is 0, then V = Infinity
• If 0 < E’ < 255, then V = (-1)**S * 2 ** (E’-127) * (1.F), where “1.F” is intended
to represent the binary number created by prefixing F with an implicit leading 1 and a
binary point.
• If E’ = 0 and F is nonzero, then V = (-1)**S * 2 ** (-126) * (0.F). These are
“unnormalized” values.
• If E’ = 0 and F is zero and S is 1, then V = -0
• If E’ = 0 and F is zero and S is 0, then V = 0
For example,

0 00000000 00000000000000000000000 = 0

1 00000000 00000000000000000000000 = -0

0 11111111 00000000000000000000000 = Infinity

1 11111111 00000000000000000000000 = -Infinity

0 11111111 00000100000000000000000 = NaN

1 11111111 00100010001001010101010 = NaN

0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2

0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5

1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5

0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)

0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127) (unnormalized value)

0 00000000 00000000000000000000001 = +1 * 2**(-126) * 0.00000000000000000000001 = 2**(-149) (unnormalized value; smallest positive value)
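
The rules above can be checked mechanically. The following is a minimal Python sketch, not part of the original notes: the function name decode_single and the choice to pass the word as a string of 0s and 1s are assumptions made for illustration. It applies the cases for E’ = 255, E’ = 0 and 0 < E’ < 255 in turn.

```python
import math


def decode_single(bits: str) -> float:
    """Decode a 32-bit single-precision pattern written as a string of 0s and 1s.

    Spaces between the S, E' and F fields are ignored.
    """
    bits = bits.replace(" ", "")
    if len(bits) != 32:
        raise ValueError("expected exactly 32 bits")
    s = int(bits[0], 2)          # sign bit S
    e = int(bits[1:9], 2)        # stored (excess-127) exponent E'
    f = int(bits[9:], 2)         # 23-bit fraction F as an integer
    sign = -1.0 if s else 1.0
    if e == 255:                 # NaN or +/- Infinity
        return math.nan if f else sign * math.inf
    if e == 0:                   # 0.F * 2**(-126); also yields +0 and -0
        return sign * (f / 2**23) * 2.0**-126
    return sign * (1 + f / 2**23) * 2.0**(e - 127)   # 1.F * 2**(E'-127)
```

For instance, decode_single("0 10000001 10100000000000000000000") reproduces the 6.5 example from the list above.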

Double Precision Numbers:

The IEEE double precision floating point standard representation requires a 64-bit
word, which may be represented as numbered from 0 to 63, left to right. The first bit
is the sign bit, S, the next eleven bits are the excess-1023 exponent bits, E’, and the
final 52 bits are the fraction ‘F’:

S E’E’E’E’E’E’E’E’E’E’E’

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

0 1 11 12

63

The value V represented by the word may be determined as follows:
• If E’ = 2047 and F is nonzero, then V = NaN (“Not a number”)
• If E’ = 2047 and F is zero and S is 1, then V = -Infinity
• If E’ = 2047 and F is zero and S is 0, then V = Infinity
• If 0 < E’ < 2047, then V = (-1)**S * 2 ** (E’-1023) * (1.F), where “1.F” is
intended to represent the binary number created by prefixing F with an implicit
leading 1 and a binary point.
• If E’ = 0 and F is nonzero, then V = (-1)**S * 2 ** (-1022) * (0.F). These are
“unnormalized” values.
• If E’ = 0 and F is zero and S is 1, then V = -0
• If E’ = 0 and F is zero and S is 0, then V = 0
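
The double-precision fields can be inspected from a running program. The sketch below is illustrative only: it assumes that a Python float is stored as an IEEE double (true on common platforms, but an assumption here), and the helper names double_fields and double_value are hypothetical. It uses the standard struct module to obtain the 64-bit word, splits it into S, E’ and F, and rebuilds the value from the finite cases listed above.

```python
import struct


def double_fields(x: float):
    """Return (S, E', F) for x, assuming x is stored as an IEEE double."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))  # raw 64-bit pattern
    s = word >> 63                  # first bit: sign
    e = (word >> 52) & 0x7FF        # next 11 bits: excess-1023 exponent
    f = word & ((1 << 52) - 1)      # final 52 bits: fraction
    return s, e, f


def double_value(s: int, e: int, f: int) -> float:
    """Rebuild the value from the fields (finite cases only: E' < 2047)."""
    sign = -1.0 if s else 1.0
    if e == 0:                                        # 0.F * 2**(-1022)
        return sign * (f / 2**52) * 2.0**-1022
    return sign * (1 + f / 2**52) * 2.0**(e - 1023)   # 1.F * 2**(E'-1023)
```

As a usage example, double_fields(6.5) gives a stored exponent of 1025 (that is, 2 + 1023) and a fraction whose leading bits are 101.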

Arithmetic unit

Arithmetic operations on floating-point numbers consist of addition, subtraction,
multiplication and division. The operations are done with algorithms similar to those
used on sign-magnitude integers (because of the similarity of representation). For
example, only numbers of the same sign are added directly. If the numbers are of
opposite sign, a subtraction must be done instead.

ADDITION

Example on decimal values given in scientific notation:

  3.25 x 10 ** 3
+ 2.63 x 10 ** -1
-----------------

first step: align decimal points

2.63 x 10 ** -1 = 0.000263 x 10 ** 3

second step: add

  3.25     x 10 ** 3
+ 0.000263 x 10 ** 3
--------------------
  3.250263 x 10 ** 3
(presumes use of infinite precision, without regard for accuracy)

third step: normalize the result (already normalized!)

Example on floating pt. value given in binary:

.25 = 0 01111101 00000000000000000000000
100 = 0 10000101 10010000000000000000000

To add these fl. pt. representations,

step 1: align radix points

shifting the mantissa left by 1 bit decreases the exponent by 1

shifting the mantissa right by 1 bit increases the exponent by 1

we want to shift the mantissa right, because the bits that fall off the end should
come from the least significant end of the mantissa

-> choose to shift the .25, since we want to increase its exponent.
-> shift by

  10000101
- 01111101
----------
  00001000 (8) places.

0 01111101 00000000000000000000000 (original value)
0 01111110 10000000000000000000000 (shifted 1 place)
(note that hidden bit is shifted into msb of mantissa)
0 01111111 01000000000000000000000 (shifted 2 places)
0 10000000 00100000000000000000000 (shifted 3 places)
0 10000001 00010000000000000000000 (shifted 4 places)
0 10000010 00001000000000000000000 (shifted 5 places)

0 10000011 00000100000000000000000 (shifted 6 places)
0 10000100 00000010000000000000000 (shifted 7 places)
0 10000101 00000001000000000000000 (shifted 8 places)

step 2: add (don’t forget the hidden bit for the 100)

  0 10000101 1.10010000000000000000000 (100)
+ 0 10000101 0.00000001000000000000000 (.25)
--------------------------------------------
  0 10000101 1.10010001000000000000000

step 3: normalize the result (get the “hidden bit” to be a 1)
It already is for this example.

The result is
0 10000101 10010001000000000000000 (100.25)
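
The three steps above can be sketched in Python. This is a simplified illustration, not a complete adder: the function name add_positive_singles is hypothetical, both operands are assumed to be positive and normalized, bits shifted off the low end are simply discarded (no rounding), and exponent overflow is not checked.

```python
def add_positive_singles(a: str, b: str) -> str:
    """Add two positive, normalized single-precision bit patterns.

    Operands and result are strings of the form 'S EEEEEEEE FFF...F'.
    """
    def fields(bits: str):
        bits = bits.replace(" ", "")
        exponent = int(bits[1:9], 2)
        mantissa = int(bits[9:], 2) | (1 << 23)   # restore the hidden bit
        return exponent, mantissa

    ea, ma = fields(a)
    eb, mb = fields(b)

    # Step 1: align radix points by shifting the smaller operand right.
    if ea < eb:
        ea, ma, eb, mb = eb, mb, ea, ma
    mb >>= ea - eb            # low-order bits fall off the end

    # Step 2: add the aligned mantissas (hidden bits included).
    m, e = ma + mb, ea

    # Step 3: normalize, so that the hidden bit is again the leading 1.
    if m >> 24:               # carry out of the hidden-bit position
        m >>= 1
        e += 1

    return f"0 {e:08b} {m & 0x7FFFFF:023b}"   # drop the hidden bit again
```

Called with the patterns for .25 and 100 given above, the function returns the pattern shown as the result of step 3.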
Representation of Floating-Point numbers
(-1)**S × M × 2**E

Bit No   Size      Field Name
31       1 bit     Sign (S)
23-30    8 bits    Exponent (E)
0-22     23 bits   Mantissa (M)
A Single-Precision floating-point number occupies 32 bits, so there is a
compromise between the size of the mantissa and the size of the exponent.
These chosen sizes provide a range of approx:

± 10**-38 ... 10**38
• Overflow
The exponent is too large to be represented in the Exponent field
• Underflow
The exponent is too small (too negative) to be represented in the Exponent field
To reduce the chances of underflow/overflow, one can use 64-bit Double-Precision
arithmetic
Bit No   Size      Field Name
63       1 bit     Sign (S)
52-62    11 bits   Exponent (E)
0-51     52 bits   Mantissa (M)

providing a range of approx

± 10**-308 ... 10**308
These formats are called ...
IEEE 754 Floating-Point Standard
Since the mantissa is always 1.xxxxxxxxx in the normalised form, there is no need to
represent the leading 1. So, effectively:
• Single Precision: mantissa ===> 1 bit + 23 bits
• Double Precision: mantissa ===> 1 bit + 52 bits
Since zero (0.0) has no leading 1, to distinguish it from other numbers, it is given the
reserved bit pattern of all 0s for the exponent so that hardware won't attach a
leading 1 to it. Thus:
• Zero (0.0) = 0000...0000
• Other numbers = (-1)**S × (1 + Mantissa) × 2**E
If we number the mantissa bits from left to right m1, m2, m3, ...
mantissa = m1 × 2**-1 + m2 × 2**-2 + m3 × 2**-3 + ....
Negative exponents could pose a problem in comparisons.
For example (with two's complement):

              Sign  Exponent  Mantissa
1.0 × 2**-1   0     11111111  0000000 00000000 00000000
1.0 × 2**+1   0     00000001  0000000 00000000 00000000

With this representation, the first exponent shows a "larger" binary number,
making direct comparison more difficult.
To avoid this, Biased Notation is used for exponents.
If the real exponent of a number is X then it is represented as (X + bias)
IEEE single-precision uses a bias of 127. Therefore, an exponent of
-1 is represented as -1 + 127 = 126 = 01111110₂
0 is represented as 0 + 127 = 127 = 01111111₂
+1 is represented as +1 + 127 = 128 = 10000000₂
+5 is represented as +5 + 127 = 132 = 10000100₂
So the actual exponent is found by subtracting the bias from the stored
exponent. Therefore, given S, E, and M fields, an IEEE floating-point number
has the value:
(-1)**S × (1.0 + 0.M) × 2**(E-bias)
(Remember: it is (1.0 + 0.M) because, with normalised form, only
the fractional part of the mantissa needs to be stored)
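
The effect of the bias on comparisons can be observed directly. The short Python sketch below is an illustration under stated assumptions: the helper names single_bits and stored_exponent are hypothetical, and the standard struct module is used to round a value to single precision and expose its 32-bit pattern.

```python
import struct


def single_bits(x: float) -> int:
    """Return the 32-bit single-precision pattern of x as an unsigned integer."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word


def stored_exponent(x: float) -> int:
    """Return the 8-bit stored exponent, i.e. real exponent + 127."""
    return (single_bits(x) >> 23) & 0xFF


# Real exponents -1, 0, +1 and +5 are stored as 126, 127, 128 and 132.
biased = [stored_exponent(v) for v in (0.5, 1.0, 2.0, 32.0)]

# With the bias, the smaller positive number also has the smaller bit pattern.
ordered = single_bits(0.5) < single_bits(2.0)
```

With biased exponents, positive numbers can therefore be compared as if their bit patterns were unsigned integers, which is what the two's complement encoding above prevented.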
Floating Point Addition
Add the following two decimal numbers in scientific notation:

8.70 × 10**-1 with 9.95 × 10**1
1. Rewrite the smaller number such that its exponent matches with the
exponent of the larger number.
8.70 × 10**-1 = 0.087 × 10**1
2. Add the mantissas
9.95 + 0.087 = 10.037 and write the sum 10.037 × 10**1
3. Put the result in Normalised Form
10.037 × 10**1 = 1.0037 × 10**2 (shift mantissa, adjust exponent)
check for overflow/underflow of the exponent after normalisation
4. Round the result
If the mantissa does not fit in the space reserved for it, it has to be
rounded off.
For Example: If only 4 digits are allowed for mantissa
1.0037 × 10**2 ===> 1.004 × 10**2
(only have a hidden bit with binary floating point numbers)
Example addition in binary
Perform 0.5 + (-0.4375)
0.5 = 0.1 × 2**0 = 1.000 × 2**-1 (normalised)
-0.4375 = -0.0111 × 2**0 = -1.110 × 2**-2 (normalised)
1. Rewrite the smaller number such that its exponent matches with the
exponent of the larger number.
-1.110 × 2**-2 = -0.1110 × 2**-1
2. Add the mantissas:
1.000 × 2**-1 + (-0.1110 × 2**-1) = 0.001 × 2**-1
3. Normalise the sum, checking for overflow/underflow:
0.001 × 2**-1 = 1.000 × 2**-4
-126 <= -4 <= 127 ===> No overflow or underflow
4. Round the sum:
The sum fits in 4 bits so rounding is not required
Check: 1.000 × 2**-4 = 0.0625 which is equal to 0.5 - 0.4375
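
Steps 1 to 3 of this example can be traced with exact rational arithmetic. The following Python sketch is illustrative only: the function name add_normalized is hypothetical, mantissas are held as exact fractions so that step 4 (rounding) never arises, and the single-precision exponent range -126 to 127 from the example is used for the overflow/underflow check.

```python
from fractions import Fraction


def add_normalized(m1, e1, m2, e2):
    """Add m1 * 2**e1 and m2 * 2**e2, returning a normalized (mantissa, exponent)."""
    # Step 1: rewrite the number with the smaller exponent to match the larger.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 = Fraction(m2) / 2 ** (e1 - e2)

    # Step 2: add the mantissas.
    m, e = Fraction(m1) + m2, e1

    # Step 3: normalise so that 1 <= |mantissa| < 2, adjusting the exponent.
    if m == 0:
        return m, 0
    while abs(m) >= 2:
        m /= 2
        e += 1
    while abs(m) < 1:
        m *= 2
        e -= 1
    if not -126 <= e <= 127:
        raise OverflowError("exponent outside the single-precision range")
    return m, e


# 0.5 = 1.000 × 2**-1 and -0.4375 = -1.110 × 2**-2 (binary 1.110 is 7/4)
mantissa, exponent = add_normalized(Fraction(1), -1, Fraction(-7, 4), -2)
```

Here mantissa is 1 and exponent is -4, matching the hand calculation 1.000 × 2**-4 = 0.0625.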
