IEEE Standard 754


IEEE Standard 754 Floating Point Numbers

Last Updated : 16 Mar, 2020


The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard
for floating-point computation which was established in 1985 by the Institute of
Electrical and Electronics Engineers (IEEE). The standard addressed many
problems found in the diverse floating point implementations that made them
difficult to use reliably and reduced their portability. IEEE Standard 754 floating
point is the most common representation today for real numbers on computers,
including Intel-based PCs, Macs, and most Unix platforms.

There are several ways to represent floating-point numbers, but IEEE 754 is
by far the most widely used. An IEEE 754 number has 3 basic components:

The Sign of Mantissa –
This is as simple as the name: 0 represents a positive number while 1 represents
a negative number.
The Biased Exponent –
The exponent field needs to represent both positive and negative exponents. A
bias is added to the actual exponent to get the stored exponent, so the stored
value is never negative.
The Normalised Mantissa –
The mantissa is the part of a number in scientific notation, or of a
floating-point number, that holds its significant digits. In binary there are
only 2 digits, 0 and 1, so a normalised mantissa is one with exactly one 1 to the
left of the binary point. Because that leading 1 is always present, it is not
stored.
IEEE 754 numbers are divided into two formats based on the above three
components: single precision and double precision.
TYPES              SIGN          BIASED EXPONENT    NORMALISED MANTISSA    BIAS
Single precision   1 (bit 31)    8 (bits 30-23)     23 (bits 22-0)         127
Double precision   1 (bit 63)    11 (bits 62-52)    52 (bits 51-0)         1023
Example –

85.125
85 = 1010101
0.125 = 0.001
85.125 = 1010101.001
       = 1.010101001 x 2^6
sign = 0

1. Single precision:
biased exponent 127 + 6 = 133
133 = 10000101
Normalised mantissa = 010101001
We add 0's on the right to complete the 23 bits.

The IEEE 754 single precision representation is:

= 0 10000101 01010100100000000000000
This can be written in hexadecimal form as 42AA4000.

2. Double precision:
biased exponent 1023 + 6 = 1029
1029 = 10000000101
Normalised mantissa = 010101001
We add 0's on the right to complete the 52 bits.

The IEEE 754 double precision representation is:

= 0 10000000101
0101010010000000000000000000000000000000000000000000
This can be written in hexadecimal form 4055480000000000
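
The hand conversion of 85.125 above can be cross-checked with a short Python sketch. It assumes Python's standard `struct` module, whose `>f` and `>d` formats pack a value as big-endian IEEE 754 single and double precision; the variable names are illustrative only.

```python
import struct

value = 85.125

# Pack as big-endian IEEE 754 single ('>f') and double ('>d') precision.
single_hex = struct.pack(">f", value).hex().upper()
double_hex = struct.pack(">d", value).hex().upper()

# Split the 32-bit pattern into its sign | exponent | mantissa fields.
bits32 = format(int(single_hex, 16), "032b")
fields32 = (bits32[0], bits32[1:9], bits32[9:])

print(single_hex)   # 42AA4000
print(double_hex)   # 4055480000000000
print(fields32)     # ('0', '10000101', '01010100100000000000000')
```

The three fields printed last match the sign, biased exponent and padded mantissa derived by hand.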
Special Values: IEEE 754 reserves some bit patterns for special values that
would otherwise be ambiguous.

Zero –
Zero is a special value denoted with an exponent and mantissa of 0. -0 and +0
are distinct values, though they compare as equal.
Denormalised –
If the exponent is all zeros but the mantissa is not, the value is a
denormalised number. Such a number does not have an assumed leading one
before the binary point.
Infinity –
The values +infinity and -infinity are denoted with an exponent of all ones and a
mantissa of all zeros. The sign bit distinguishes between negative infinity and
positive infinity. Operations with infinite values are well defined in IEEE 754.
Not A Number (NaN) –
The value NaN represents a result that is not a valid number, such as 0 ÷ 0. It
is encoded with an exponent field of all ones and a non-zero mantissa; the sign
bit may be either 0 or 1. It might also be used to denote a variable that doesn’t
yet hold a value.
For single precision:

EXPONENT    MANTISSA    VALUE
0           0           exact 0
255         0           Infinity
0           not 0       denormalised
255         not 0       Not a Number (NaN)
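
The table above can be turned into a small classifier. This is a minimal sketch for single precision only; `classify_float32` is a hypothetical helper name, and it relies on Python's `struct` module to obtain the 32-bit pattern.

```python
import struct

def classify_float32(x: float) -> str:
    """Classify x by the exponent and mantissa fields of its 32-bit encoding."""
    # Reinterpret the packed single-precision bytes as an unsigned integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    exponent = (raw >> 23) & 0xFF      # 8 exponent bits
    mantissa = raw & 0x7FFFFF          # 23 mantissa bits
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormalised"
    if exponent == 255:
        return "infinity" if mantissa == 0 else "NaN"
    return "normalised"

for sample in (0.0, -0.0, 1e-40, 85.125, float("inf"), float("nan")):
    print(sample, "->", classify_float32(sample))
```

The sign bit is deliberately ignored, since the table depends only on the exponent and mantissa fields.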
The same applies to double precision, with 255 replaced by 2047.

Ranges of floating-point numbers:

Single Precision
  Denormalized:         ± 2^-149 to (1 – 2^-23) × 2^-126
  Normalized:           ± 2^-126 to (2 – 2^-23) × 2^127
  Approximate decimal:  ± approximately 10^-44.85 to approximately 10^38.53
Double Precision
  Denormalized:         ± 2^-1074 to (1 – 2^-52) × 2^-1022
  Normalized:           ± 2^-1022 to (2 – 2^-52) × 2^1023
  Approximate decimal:  ± approximately 10^-323.3 to approximately 10^308.3
The range of positive floating-point numbers can be split into normalized
numbers and denormalized numbers, which use only a portion of the fraction's
precision. Since every floating-point number has a corresponding negated value,
the ranges above are symmetric around zero.
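
The limits quoted above can be recomputed directly. The sketch below assumes that Python's `float` is an IEEE 754 double, which holds on common platforms, so the double precision limits can be compared against `sys.float_info`. The single precision limits are evaluated exactly in double precision; the variable names are illustrative.

```python
import math
import sys

# Single precision limits, evaluated exactly using double precision floats.
single_min_denormal = 2.0 ** -149
single_min_normal = 2.0 ** -126
single_max = (2 - 2.0 ** -23) * 2.0 ** 127

# Double precision limits.
double_min_denormal = 2.0 ** -1074
double_min_normal = 2.0 ** -1022
double_max = (2 - 2.0 ** -52) * 2.0 ** 1023

# The double precision values agree with what the interpreter reports.
print(double_max == sys.float_info.max)          # True
print(double_min_normal == sys.float_info.min)   # True

# Decimal exponents, for comparison with the approximate decimal ranges.
print(math.log10(single_min_denormal), math.log10(single_max))
print(math.log10(double_min_denormal), math.log10(double_max))
```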

There are five distinct numerical ranges that single-precision floating-point
numbers are not able to represent with the scheme presented so far:

Negative numbers less than –(2 – 2^-23) × 2^127 (negative overflow)
Negative numbers greater than –2^-149 (negative underflow)
Zero
Positive numbers less than 2^-149 (positive underflow)
Positive numbers greater than (2 – 2^-23) × 2^127 (positive overflow)

Overflow generally means that values have grown too large to be represented.
Underflow is a less serious problem because it just denotes a loss of precision:
the result is guaranteed to be closely approximated by zero.
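
Overflow and underflow are easy to observe. Python's `float` is a double rather than a single, so this sketch uses the double precision limits instead of the single precision ones discussed above; the behaviour is analogous. The variable names are illustrative.

```python
import math

largest = 1.7976931348623157e308    # (2 - 2^-52) * 2^1023, largest finite double
smallest = 5e-324                   # 2^-1074, smallest positive denormalised double

overflowed = largest * 2     # too large to represent: becomes +infinity
underflowed = smallest / 4   # too small to represent: becomes 0.0

print(overflowed)    # inf
print(underflowed)   # 0.0
```

The overflowed result has lost all information about its magnitude, while the underflowed result is still within 2^-1074 of the true value.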

The total effective range of finite IEEE floating-point numbers is shown
below:

          Binary                    Decimal
Single    ± (2 – 2^-23) × 2^127     approximately ± 10^38.53
Double    ± (2 – 2^-52) × 2^1023    approximately ± 10^308.25
Special Operations –

Operation                   Result
n ÷ ±Infinity               0
±Infinity × ±Infinity       ±Infinity
±nonZero ÷ ±0               ±Infinity
±finite × ±Infinity         ±Infinity
Infinity + Infinity         +Infinity
Infinity – (–Infinity)      +Infinity
–Infinity – Infinity        –Infinity
–Infinity + (–Infinity)     –Infinity
±0 ÷ ±0                     NaN
±Infinity ÷ ±Infinity       NaN
±Infinity × 0               NaN
NaN == NaN                  False
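
Several rows of this table can be checked in Python using `math.inf` and `math.nan`. This is only a sketch: the rows that divide by zero are left out because Python raises `ZeroDivisionError` for float division by zero instead of returning the IEEE 754 result.

```python
import math

inf = math.inf
nan = math.nan

results = {
    "1 / inf": 1.0 / inf,
    "inf * inf": inf * inf,
    "5 * -inf": 5.0 * -inf,
    "inf + inf": inf + inf,
    "-inf - inf": -inf - inf,
    "inf / inf": inf / inf,
    "inf * 0": inf * 0.0,
}

for expression, value in results.items():
    print(f"{expression:12} -> {value}")

# NaN compares unequal to everything, including itself.
print("nan == nan ->", nan == nan)
```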

Introduction of Floating Point Representation


Last Updated : 17 May, 2023



1. To convert the floating point into decimal, we have 3
elements in a 32-bit floating point representation:
i) Sign
ii) Exponent
iii) Mantissa

• Sign bit is the first bit of the binary representation. ‘1’
implies a negative number and ‘0’ implies a positive number.
Example: 11000001110100000000000000000000 This is a
negative number.
• Exponent is decided by the next 8 bits of the binary
representation. 127 is the bias for the 32-bit floating point
representation. It is determined by 2^(k-1) - 1, where ‘k’ is
the number of bits in the exponent field.
There are 3 exponent bits in the 8-bit representation and 8
exponent bits in the 32-bit representation.
Thus
bias = 3 for 8-bit conversion (2^(3-1) - 1 = 4 - 1 = 3)
bias = 127 for 32-bit conversion (2^(8-1) - 1 = 128 - 1 = 127)
Example: 01000001110100000000000000000000
10000011 = (131)_10
131 - 127 = 4
Hence the exponent of 2 will be 4, i.e. 2^4 = 16.
• Mantissa is calculated from the remaining 23 bits of the
binary representation. It consists of an implicit ‘1’ and a
fractional part, in which the i-th bit after the binary point
has weight 1/2^i.
Example:
01000001110100000000000000000000
The fractional part of the mantissa is given by:
1*(1/2) + 0*(1/4) + 1*(1/8) + 0*(1/16) + ... = 0.625
Thus the mantissa will be 1 + 0.625 = 1.625
The decimal number is hence given as:
Sign * Exponent * Mantissa = (-1)^0 * (16) * (1.625) = 26
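
The three decoding steps above can be sketched in Python. `decode_float32` is a hypothetical helper name, and the sketch only handles normalised numbers (it ignores zero, denormalised values, infinity and NaN).

```python
def decode_float32(bits: str) -> float:
    """Decode a 32-character bit string as a normalised IEEE 754 single."""
    sign = int(bits[0], 2)                    # first bit
    stored_exponent = int(bits[1:9], 2)       # next 8 bits
    fraction = int(bits[9:], 2) / 2 ** 23     # bit weights 1/2, 1/4, 1/8, ...
    bias = 2 ** (8 - 1) - 1                   # 127 for an 8-bit exponent field
    return (-1) ** sign * 2.0 ** (stored_exponent - bias) * (1 + fraction)

print(decode_float32("01000001110100000000000000000000"))   # 26.0
print(decode_float32("11000001110100000000000000000000"))   # -26.0
```

The second call uses the negative example from the sign-bit step, which differs from the first only in its leading bit.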
2. To convert the decimal into floating point, we have 3
elements in a 32-bit floating point representation:
i) Sign (MSB)
ii) Exponent (8 bits after MSB)
iii) Mantissa (Remaining 23 bits)

• Sign bit is the first bit of the binary representation. ‘1’
implies a negative number and ‘0’ implies a positive number.
Example: To convert -17 into 32-bit floating point
representation, Sign bit = 1
• Exponent is decided by the nearest power of 2 that is smaller
than or equal to the number. For 17, 16 is the nearest 2^n.
Hence the exponent of 2 will be 4 since 2^4 = 16. 127 is the
bias for the 32-bit floating point representation. It is
determined by 2^(k-1) - 1, where ‘k’ is the number of bits in
the exponent field.
Thus bias = 127 for 32 bit. (2^(8-1) - 1 = 128 - 1 = 127)
Now, 127 + 4 = 131, i.e. 10000011 in binary
representation.
• Mantissa: 17 in binary = 10001.
Move the binary point so that there is only one bit to its
left. Adjust the exponent of 2 so that the value does
not change. This is normalizing the number: 1.0001 x 2^4.
Now, consider the fractional part and represent it as 23
bits by adding zeros.
00010000000000000000000
Putting the three fields together, -17 is stored as
1 10000011 00010000000000000000000.
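
The encoding steps can be sketched in Python as well. `encode_float32` is a hypothetical helper: it assumes a non-zero finite input in the normalised single precision range, and it truncates any fraction bits beyond the 23rd instead of rounding them as real hardware would. The standard `struct` module is used only as a cross-check.

```python
import math
import struct

def encode_float32(x: float) -> str:
    """Return 'sign exponent mantissa' bit fields for a normalised, non-zero x."""
    sign = "1" if x < 0 else "0"
    m, e = math.frexp(abs(x))             # abs(x) == m * 2**e with 0.5 <= m < 1
    exponent = e - 1                      # rewrite as (2m) * 2**(e-1), 1 <= 2m < 2
    fraction = 2 * m - 1                  # drop the implicit leading 1
    stored = exponent + 127               # add the bias
    mantissa = int(fraction * 2 ** 23)    # truncates bits beyond the 23rd
    return f"{sign} {stored:08b} {mantissa:023b}"

encoded = encode_float32(-17.0)
print(encoded)   # 1 10000011 00010000000000000000000

# Cross-check against the bit pattern produced by struct.
(raw,) = struct.unpack(">I", struct.pack(">f", -17.0))
reference = format(raw, "032b")
print(reference == encoded.replace(" ", ""))   # True
```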

Advantages:

Wide range of values: Floating point representation allows a wide
range of values to be represented, including very large and very
small numbers.
Precision: Floating point representation offers high precision,
which is important for scientific and engineering calculations.
Compatibility: Floating point representation is widely used in
computer systems, making it compatible with a wide range of
software and hardware.
Easy to use: Most programming languages offer built-in support
for floating point representation, making it easy to use and
manipulate in computer programs.

Disadvantages:

Complexity: Floating point representation is complex and can be
difficult to understand, especially for those who aren’t familiar
with the underlying mathematics.
Rounding errors: Floating point representation can lead to
rounding errors, where the real value of a number is slightly
different from its representation in the computer.
Speed: Floating point operations can be slower than integer
operations, particularly on older or less powerful hardware.
Limited precision: Despite its high precision, floating point
representation has a limited number of significant digits (about 7
decimal digits in single precision and 15 to 16 in double
precision), which can restrict its usefulness in some applications.
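
A classic illustration of the rounding errors mentioned above, as a minimal Python sketch: 0.1, 0.2 and 0.3 have no exact binary representation, so the stored sum differs slightly from the stored 0.3. Comparing with a tolerance through `math.isclose` is one common workaround, not the only one.

```python
import math

total = 0.1 + 0.2

exactly_equal = total == 0.3               # False: the stored values differ
close_enough = math.isclose(total, 0.3)    # True: equal within a relative tolerance

print(f"{total:.17f}")
print(exactly_equal, close_enough)
```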
Related Link:
https://www.youtube.com/watch?v=03fhijH6e2w
More questions on number representation:
https://www.geeksforgeeks.org/number-representation-gq/
This article is contributed by Kriti Kushwaha

Difference between 1’s Complement representation and 2’s Complement
representation Technique
Last Updated : 23 Apr, 2023



Prerequisite – Representation of Negative Binary Numbers
1’s complement of a binary number is another binary number
obtained by toggling all bits in it, i.e., transforming the 0 bit to 1
and the 1 bit to 0. Examples:
Let numbers be stored using 4 bits

1's complement of 7 (0111) is 8 (1000)
1's complement of 12 (1100) is 3 (0011)
2’s complement of a binary number is 1 added to the 1’s
complement of the binary number. Examples:
Let numbers be stored using 4 bits

2's complement of 7 (0111) is 9 (1001)
2's complement of 12 (1100) is 4 (0100)
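
Both definitions map directly onto bitwise operations. In this sketch, `ones_complement` and `twos_complement` are hypothetical helper names, and the mask keeps results within the chosen width of 4 bits.

```python
def ones_complement(value: int, n_bits: int = 4) -> int:
    """Toggle every bit of value within an n_bits-wide word."""
    mask = (1 << n_bits) - 1
    return ~value & mask

def twos_complement(value: int, n_bits: int = 4) -> int:
    """Add 1 to the 1's complement, discarding any carry out of the word."""
    mask = (1 << n_bits) - 1
    return (ones_complement(value, n_bits) + 1) & mask

for number in (7, 12):
    print(
        f"{number:2} ({number:04b}): "
        f"1's = {ones_complement(number)} ({ones_complement(number):04b}), "
        f"2's = {twos_complement(number)} ({twos_complement(number):04b})"
    )
```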
These representations are used for signed numbers.
The main difference between 1’s complement and 2’s
complement is that 1’s complement has two representations of 0
(zero): 00000000, which is positive zero (+0), and 11111111,
which is negative zero (-0); whereas in 2’s complement there is
only one representation for zero, 00000000. Negating zero in 2’s
complement means inverting 00000000 to get 11111111 and then
adding 1, which gives 100000000, nine bits long. Since only eight
bits are allowed, the left-most bit is discarded (overflowed),
leaving 00000000, which is the same as positive zero. This is the
reason why 2’s complement is generally used.
Another difference is that while adding numbers using 1’s
complement, we first do binary addition and then add in an
end-around carry value. 2’s complement has only one value for
zero and doesn’t need the end-around carry: any carry out of the
left-most bit is simply discarded.
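
The difference in addition can be sketched as follows. Both function names are hypothetical, and the operands are already-encoded 4-bit patterns. The example adds 5 and -3, where -3 is 1100 in 1's complement and 1101 in 2's complement.

```python
def add_ones_complement(a: int, b: int, n_bits: int = 4) -> int:
    """Add two 1's complement bit patterns using the end-around carry."""
    mask = (1 << n_bits) - 1
    total = a + b
    if total > mask:                  # a carry came out of the left-most bit
        total = (total & mask) + 1    # add it back in at the right-most bit
    return total & mask

def add_twos_complement(a: int, b: int, n_bits: int = 4) -> int:
    """Add two 2's complement bit patterns; the carry out is discarded."""
    mask = (1 << n_bits) - 1
    return (a + b) & mask

print(format(add_ones_complement(0b0101, 0b1100), "04b"))   # 0010
print(format(add_twos_complement(0b0101, 0b1101), "04b"))   # 0010
```

Both calls produce 0010, i.e. 2, but only the 1's complement version needs the extra carry step.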
The range of 1’s complement for an n-bit number is from -(2^(n-1) - 1) to
2^(n-1) - 1, whereas the range of 2’s complement for n bits is from -2^(n-1)
to 2^(n-1) - 1. There are 2^n - 1 distinct values in 1’s complement and 2^n
distinct values in 2’s complement.
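
These ranges can be confirmed by enumerating every 4-bit pattern. The two decoding helpers below are hypothetical names introduced for this sketch.

```python
def ones_complement_value(pattern: int, n_bits: int) -> int:
    """Interpret an n_bits-wide pattern as a 1's complement number."""
    mask = (1 << n_bits) - 1
    if pattern >> (n_bits - 1):          # sign bit set: negate by inverting
        return -(~pattern & mask)
    return pattern

def twos_complement_value(pattern: int, n_bits: int) -> int:
    """Interpret an n_bits-wide pattern as a 2's complement number."""
    if pattern >> (n_bits - 1):          # sign bit set: subtract 2**n_bits
        return pattern - (1 << n_bits)
    return pattern

n = 4
ones_values = {ones_complement_value(p, n) for p in range(1 << n)}
twos_values = {twos_complement_value(p, n) for p in range(1 << n)}

print(min(ones_values), max(ones_values), len(ones_values))   # -7 7 15
print(min(twos_values), max(twos_values), len(twos_values))   # -8 7 16
```

1's complement yields only 15 distinct values because 0000 and 1111 both decode to zero.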
Difference between 1’s Complement representation and
2’s Complement representation in tabular form:

Criteria: Definition
  1’s Complement: The 1’s complement of a binary number is obtained
  by inverting all its bits.
  2’s Complement: The 2’s complement of a binary number is obtained
  by adding 1 to the 1’s complement of the number.

Criteria: Range of values that can be represented with n bits
  1’s Complement: From -2^(n-1) + 1 to 2^(n-1) – 1
  2’s Complement: From -2^(n-1) to 2^(n-1) – 1

Criteria: Number of representations for zero
  1’s Complement: Can be represented in two ways (all 0s and all 1s).
  2’s Complement: Can be represented in only one way (all 0s).

Criteria: Addition of positive and negative numbers
  1’s Complement: Same as unsigned binary addition, followed by
  adding the end-around carry.
  2’s Complement: Same as unsigned binary addition.

Criteria: Subtraction of numbers
  1’s Complement: Subtract the smaller number from the larger one,
  then add a sign bit to the result.
  2’s Complement: Add the negative number to the positive one using
  binary addition.
