4 Floating Point
Suppose we have 8 bits to store a real number, where 5 bits store the
integer part and 3 bits store the fractional part:
$(10111.011)_2$
with place values, from left to right, $2^4,\ 2^3,\ 2^2,\ 2^1,\ 2^0,\ 2^{-1},\ 2^{-2},\ 2^{-3}$.
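The place-value rule can be sketched in a few lines of Python. The helper name `fixed_point_value` is hypothetical and only illustrates the idea: each bit contributes its digit times the power of two of its position.

```python
def fixed_point_value(bits: str) -> float:
    """Evaluate an unsigned fixed-point bit string such as '10111.011'."""
    int_bits, _, frac_bits = bits.partition(".")
    value = 0.0
    # Integer bits: the rightmost bit has weight 2^0.
    for k, bit in enumerate(reversed(int_bits)):
        value += int(bit) * 2.0 ** k
    # Fractional bits: the first bit after the point has weight 2^-1.
    for k, bit in enumerate(frac_bits, start=1):
        value += int(bit) * 2.0 ** (-k)
    return value

print(fixed_point_value("10111.011"))  # 23.375
```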
(Unsigned) Fixed-point representation
Suppose we have 64 bits to store a real number, where 32 bits store the
integer part and 32 bits store the fractional part:
$(a_{31} \dots a_2 a_1 a_0 \,.\, b_1 b_2 b_3 \dots b_{32})_2 = \sum_{k=0}^{31} a_k 2^k + \sum_{k=1}^{32} b_k 2^{-k}$
Smallest positive number:
$a_i = 0\ \forall i$ and $b_1, b_2, \dots, b_{31} = 0$ and $b_{32} = 1 \;\rightarrow\; 2^{-32} \approx 10^{-10}$
Largest number:
$a_i = 1\ \forall i$ and $b_i = 1\ \forall i \;\rightarrow\; 2^{31} + \dots + 2^0 + 2^{-1} + \dots + 2^{-32} \approx 10^{9}$
Range: difference between the largest and smallest numbers possible.
More bits for the integer part ⟶ increase range
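As a rough check of these limits, the sketch below computes the smallest positive and the largest value of an unsigned fixed-point format with exact rational arithmetic. The function name `fixed_point_limits` is an assumption made for illustration.

```python
from fractions import Fraction

def fixed_point_limits(int_bits: int, frac_bits: int):
    """Smallest positive and largest value of an unsigned fixed-point format."""
    smallest = Fraction(1, 2 ** frac_bits)        # only the last fractional bit set
    largest = Fraction(2 ** int_bits) - smallest  # every bit set
    return smallest, largest

for fmt in [(5, 3), (32, 32)]:
    smallest, largest = fixed_point_limits(*fmt)
    print(fmt, float(smallest), float(largest))
```

Moving bits from the fractional part to the integer part raises `largest` and also raises `smallest`, which is the trade-off between range and resolution in a fixed-point format.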
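Floating-point numbers
$x = \pm\, q \times 2^m$
$q$ is the significand, normally a fractional value in the range $[1.0, 2.0)$
$m$ is the exponent
Numerical form:
$x = \pm\, q \times 2^m = \pm\, b_0 . b_1 b_2 \dots b_n \times 2^m$, with $b_i \in \{0, 1\}$
Exponent range: $m \in [L, U]$
Precision: $p = n + 1$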
“Floating” the binary point
$(1011.1)_2 = 1 \times 8 + 0 \times 4 + 1 \times 2 + 1 \times 1 + 1 \times \frac{1}{2} = (11.5)_{10}$
Moving the binary point one bit position to the left divides the number by 2.
Moving the binary point one bit position to the right multiplies the number by 2.
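A small sketch makes the effect of moving the binary point concrete. The helper `parse_binary` is hypothetical; it reads all digits as one integer and then scales by the number of fractional bits.

```python
def parse_binary(bits: str) -> float:
    """Value of a binary string with an optional binary point, e.g. '1011.1'."""
    int_bits, _, frac_bits = bits.partition(".")
    # Read all digits as one integer, then scale by the number of fractional bits.
    return int(int_bits + frac_bits, 2) / 2 ** len(frac_bits)

print(parse_binary("1011.1"))   # 11.5
print(parse_binary("101.11"))   # 5.75  (point moved left: 11.5 / 2)
print(parse_binary("10111."))   # 23.0  (point moved right: 11.5 * 2)
```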
Converting floating points
Convert $(39.6875)_{10} = (100111.1011)_2$ into floating-point
representation: move the binary point five positions to the left, so that
$(100111.1011)_2 = (1.001111011)_2 \times 2^5$.
The first bit to the left of the binary point, $b_0 = 1$, does not need to be
stored, since its value is fixed.
This representation “adds” 1 bit of precision (we will show some exceptions
later, including the representation of the number zero).
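The conversion can be sketched with Python's `math.frexp`, which returns a significand in $[0.5, 1)$; rescaling by 2 gives the $[1, 2)$ convention used here. The helper names `normalize` and `significand_bits` are assumptions for illustration, and the sketch assumes a finite, nonzero input.

```python
import math

def normalize(x: float):
    """Return (sign, significand, exponent) with |x| = significand * 2**exponent
    and 1 <= significand < 2. Assumes x is finite and nonzero."""
    mantissa, exp = math.frexp(abs(x))   # abs(x) = mantissa * 2**exp, 0.5 <= mantissa < 1
    sign = -1 if x < 0 else 1
    return sign, mantissa * 2, exp - 1

def significand_bits(q: float, n: int) -> str:
    """First n fractional bits of a significand q in [1, 2)."""
    frac = q - 1.0
    bits = []
    for _ in range(n):
        frac *= 2
        bit = int(frac)          # next binary digit
        bits.append(str(bit))
        frac -= bit
    return "1." + "".join(bits)

sign, q, m = normalize(39.6875)
print(significand_bits(q, 9), m)   # 1.001111011 5
```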
iClicker question
Determine the normalized floating-point representation
$1.f \times 2^m$ of the decimal number $x = 47.125$ ($f$ in binary
representation and $m$ in decimal)
A) $(1.01110001)_2 \times 2^5$
B) $(1.01110001)_2 \times 2^4$
C) $(1.01111001)_2 \times 2^5$
D) $(1.01111001)_2 \times 2^4$
Normalized floating-point numbers
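$x = \pm\, q \times 2^m = \pm\, 1.b_1 b_2 b_3 \dots b_n \times 2^m = \pm\, 1.f \times 2^m$
• Exponent range: $[L, U]$
• Precision: $p = n + 1$
• Smallest positive normalized number (underflow level): $\mathrm{UFL} = 2^L$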
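For a concrete instance of the underflow level, the sketch below assumes Python's `float` is an IEEE-754 double, which holds on common platforms. Python reports `min_exp` for a significand in $[0.5, 1)$, so one is subtracted to match the $1.f$ convention.

```python
import sys

L = sys.float_info.min_exp - 1   # lower exponent bound in the 1.f convention
ufl = 2.0 ** L                   # smallest positive normalized double
print(L, ufl == sys.float_info.min)   # -1022 True
```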
Floating-point numbers: Simple example
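A “toy” number system can be represented as $x = \pm 1.b_1 b_2 \times 2^m$
for $m \in [-4, 4]$ and $b_i \in \{0, 1\}$.
$(1.00)_2 \times 2^0 = 1$      $(1.00)_2 \times 2^1 = 2$      $(1.00)_2 \times 2^2 = 4.0$
$(1.01)_2 \times 2^0 = 1.25$   $(1.01)_2 \times 2^1 = 2.5$    $(1.01)_2 \times 2^2 = 5.0$
$(1.10)_2 \times 2^0 = 1.5$    $(1.10)_2 \times 2^1 = 3.0$    $(1.10)_2 \times 2^2 = 6.0$
$(1.11)_2 \times 2^0 = 1.75$   $(1.11)_2 \times 2^1 = 3.5$    $(1.11)_2 \times 2^2 = 7.0$
$\epsilon_m = (0.01)_2 \times 2^0 = 0.25$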
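The whole toy system is small enough to enumerate. The sketch below (the function name `toy_numbers` is an assumption) lists every positive number of the form $1.b_1 b_2 \times 2^m$ with $m \in [-4, 4]$ and recovers $\epsilon_m$ as the gap between 1 and the next larger number.

```python
def toy_numbers(n_frac: int = 2, m_min: int = -4, m_max: int = 4):
    """All positive numbers 1.b1...bn x 2^m of the toy system, in increasing order."""
    values = set()
    for m in range(m_min, m_max + 1):
        for k in range(2 ** n_frac):
            significand = 1 + k / 2 ** n_frac   # 1.00, 1.01, 1.10, 1.11
            values.add(significand * 2.0 ** m)
    return sorted(values)

nums = toy_numbers()
print(len(nums), nums[0], nums[-1])          # 36 0.0625 28.0
# Gap between 1 and the next larger number in the system.
print(nums[nums.index(1.0) + 1] - 1.0)       # 0.25
```

Printing consecutive differences of `nums` shows that the spacing doubles each time the exponent increases by one.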
Machine numbers: how are floating-point numbers stored?
Floating-point number representation
What do we need to store when representing floating point
numbers in a computer?
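$x = \pm\, 1.f \times 2^m$
$x$ = | ± | $m$ | $f$ |
Representation in memory:
$x$ = | $s$ | $c$ | $f$ |
The stored exponent $c$ is the true exponent $m$ plus a fixed shift (127 or 1023 below).
Single precision (32 bits):
| $s$ | $c = m + 127$ | $f$ |
| sign (1 bit) | exponent (8 bits) | significand (23 bits) |
Double precision (64 bits):
| $s$ | $c = m + 1023$ | $f$ |
| sign (1 bit) | exponent (11 bits) | significand (52 bits) |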
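To see the three fields of a stored number, one option is Python's standard `struct` module, which can round a value to single precision and expose its 32-bit word. The helper name `float32_fields` is hypothetical.

```python
import struct

def float32_fields(x: float):
    """Split x, rounded to single precision, into (s, c, f) as integers."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    s = word >> 31            # 1 sign bit
    c = (word >> 23) & 0xFF   # 8 exponent bits, c = m + 127
    f = word & 0x7FFFFF       # 23 fraction bits
    return s, c, f

s, c, f = float32_fields(39.6875)
print(s, c, c - 127)                # 0 132 5
print(f"{c:08b}", f"{f:023b}")      # exponent and fraction fields as bit strings
```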
Special Values:
$x = (-1)^s\, 1.f \times 2^m$, stored as | $s$ | $c$ | $f$ |
1) Zero:
$x$ = | $s$ | 000 … 000 | 0000 … … 0000 |
2) Infinity: $+\infty$ ($s = 0$) and $-\infty$ ($s = 1$)
$x$ = | $s$ | 111 … 111 | 0000 … … 0000 |
Note that the exponents $c = 000 \dots 000$ and $c = 111 \dots 111$ are reserved
for these special cases, which limits the exponent range for the other numbers.
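The reserved exponent patterns can be inspected directly. This sketch prints the single-precision bit patterns of zero and infinity using the standard `struct` module; the helper name `float32_bits` is an assumption.

```python
import struct

def float32_bits(x: float) -> str:
    """Bit pattern of x in single precision, formatted as 's c f'."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{word:032b}"
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"

for x in (0.0, -0.0, float("inf"), float("-inf")):
    print(f"{x!s:>5}", float32_bits(x))
```

Zero uses the all-zeros exponent and infinity uses the all-ones exponent, each with an all-zeros fraction; only the sign bit distinguishes the positive and negative variants.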
IEEE-754 Single Precision (32-bit)
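$x = (-1)^s\, 1.f \times 2^m$
| $s$ | $c = m + 127$ | $f$ |
| sign (1 bit) | exponent (8 bits) | significand (23 bits) |
IEEE-754 Double Precision (64-bit)
$x = (-1)^s\, 1.f \times 2^m$
| $s$ | $c = m + 1023$ | $f$ |
| sign (1 bit) | exponent (11 bits) | significand (52 bits) |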
Subnormal (or denormalized) numbers
• Noticeable gap around zero, present in any floating-point system, due to
normalization
  ✓ The smallest possible significand is 1.00
  ✓ The smallest possible exponent is $L$
• Relax the requirement of normalization, and allow the leading digit to be zero,
only when the exponent is at its minimum ($m = L$)
• Computations with subnormal numbers are often slow.
Representation in memory:
$x$ = | $s$ | $c = 000 \dots 000$ | $f$ |
$x = (-1)^s\, 0.f \times 2^L$
In this special case the exponent is not obtained by subtracting the shift
from $c$. Instead, the exponent is set to the lower bound, $m = L$.
IEEE-754 Single precision (32 bits):
$c = (00000000)_2 = 0$
Exponent set to $m = -126$
Smallest positive subnormal FP number: $2^{-23} \times 2^{-126} \approx 1.4 \times 10^{-45}$
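The value of the smallest positive subnormal number can be cross-checked against its bit pattern. The sketch below assumes only Python's standard `struct` module and builds the single-precision word whose exponent field is zero and whose last fraction bit is one.

```python
import struct

smallest_subnormal = 2.0 ** -23 * 2.0 ** -126     # (0.00...01)_2 x 2^-126
# Bit pattern with c = 0 and only the last fraction bit set.
(from_bits,) = struct.unpack(">f", struct.pack(">I", 0x00000001))
print(smallest_subnormal, from_bits == smallest_subnormal)
```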
What is the equivalent decimal number for each of the following
single-precision machine numbers?
0 10000100 00101101000000000000000
0 00000000 00000000000000000000000
1 11111111 00000000000000000000000
0 11111111 11111111110000111111111
0 00000000 11110000000000000000000
0 01111111 00000000000000000000000
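One way to check such patterns by hand is to apply the decoding rules case by case. The sketch below is a hypothetical helper, `decode_float32`, that follows the normalized, subnormal, zero and infinity rules from the text. The NaN case (all-ones exponent with a nonzero fraction) is not spelled out in the text above and is included here from the IEEE-754 standard.

```python
def decode_float32(pattern: str) -> float:
    """Decode an 's c f' single-precision bit pattern."""
    s_bits, c_bits, f_bits = pattern.split()
    sign = -1.0 if s_bits == "1" else 1.0
    c = int(c_bits, 2)
    f = int(f_bits, 2) / 2 ** len(f_bits)      # fraction 0.f as a number in [0, 1)
    if c == 0:                                 # zero (f = 0) or subnormal: 0.f x 2^-126
        return sign * f * 2.0 ** -126
    if c == 255:                               # reserved: infinity (f = 0), otherwise NaN
        return sign * float("inf") if f == 0 else float("nan")
    return sign * (1 + f) * 2.0 ** (c - 127)   # normalized: 1.f x 2^(c - 127)

print(decode_float32("0 10000100 00101101000000000000000"))  # 37.625
```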
iClicker question
A number system can be represented as $x = \pm 1.b_1 b_2 b_3 \times 2^m$
for $m \in [-5, 5]$ and $b_i \in \{0, 1\}$.
1) Let’s say you want to represent the decimal number 19.625 using the
binary number system above. Can you represent this number exactly?
2) What is the range of integer numbers that you can represent exactly using
this binary system?
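Questions of this kind can be explored with a brute-force check. The sketch below uses exact rational arithmetic and a hypothetical helper, `is_representable`, that tries every exponent in the allowed range and tests whether the resulting significand lies in $[1, 2)$ and fits in the available fractional bits.

```python
from fractions import Fraction

def is_representable(x, n_frac: int = 3, m_min: int = -5, m_max: int = 5) -> bool:
    """True if |x| = 1.b1...bn x 2^m for some m in [m_min, m_max]."""
    target = abs(Fraction(x))
    for m in range(m_min, m_max + 1):
        q = target / Fraction(2) ** m          # candidate significand
        # q must lie in [1, 2) and need at most n_frac fractional bits.
        if 1 <= q < 2 and (q * 2 ** n_frac).denominator == 1:
            return True
    return False

print([k for k in range(1, 41) if is_representable(k)])
```

Calling `is_representable(19.625)` addresses the first question, and the printed list shows where gaps between exactly representable integers begin.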
iClicker question
Determine the decimal number corresponding to the
following single-precision machine number:
1 10011001 00000000000000000000001
A) 67,108,872
B) −67,108,872
C) 67,108,864
D) −67,108,864
iClicker question
Determine the double-precision machine representation
of the decimal number 𝑥 = −37.625
A) 1 10000100000 00101101000000 … 0
B) 1 10000000100 00101101000000 … 0
C) 0 10000100000 00101101000000 … 0
D) 0 10000000100 00101101000000 … 0
(The significand field $f$ has 52 bits.)
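A candidate answer can be verified by printing the double-precision fields of a number. The sketch assumes Python's `float` is an IEEE-754 double and uses the standard `struct` module; the helper name `float64_bits` is hypothetical.

```python
import struct

def float64_bits(x: float) -> str:
    """Bit pattern of a Python float (IEEE-754 double) formatted as 's c f'."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = f"{word:064b}"
    # 1 sign bit, 11 exponent bits (c = m + 1023), 52 fraction bits.
    return f"{bits[0]} {bits[1:12]} {bits[12:]}"

print(float64_bits(1.0))
```

Calling `float64_bits(-37.625)` and comparing the three fields against the options is one way to check the answer after working it out by hand.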