10.1.2 Representation by hardware floats

A real number is represented by a floating point number d of the form

d = ±2^α (1+m), 0 ≤ m < 1, −2¹⁰ < α < 2¹⁰

If α > 1−2¹⁰, then d is a normalized floating point number. Otherwise (α = 1−2¹⁰) d is denormalized: the implicit leading 1 is dropped and d = ±2^(2−2¹⁰) m. The special exponent 2¹⁰ is used to represent plus or minus infinity and NaN (Not a Number). A hardware float is made of 64 bits:

the first bit is the sign of d (0 for '+' and 1 for '-'),
the next 11 bits code the exponent: if E denotes the integer given by these 11 bits, then α = E − (2¹⁰−1), that is, the exponent α is stored as E = α+2¹⁰−1,
the last 52 bits code the mantissa m: if M denotes the integer given by these 52 bits, then m = M/2⁵², both for normalized floats (with the implicit leading 1) and for denormalized floats (without it).
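
The 64-bit layout described above can be checked with a minimal Python sketch. It assumes that Python's float is an IEEE 754 double, which holds on common platforms; the function names decode_double and rebuild, and the name E for the integer stored in the 11 exponent bits, are chosen here for illustration and are not part of any library.

```python
import struct

def decode_double(x):
    """Split a Python float (assumed to be a 64-bit hardware float) into its fields."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                # first bit
    E = (bits >> 52) & 0x7FF         # next 11 bits (stored exponent)
    M = bits & ((1 << 52) - 1)       # last 52 bits (mantissa integer)
    return sign, E, M

def rebuild(sign, E, M):
    """Recompute the value from the three fields (finite floats only)."""
    s = -1.0 if sign else 1.0
    if E == 0:
        # Denormalized: no implicit leading 1, value = 2^(-1022) * M / 2^52.
        return s * M * 2.0 ** (-1074)
    alpha = E - 1023                 # remove the offset 2^10 - 1
    return s * 2.0 ** alpha * (1 + M / 2 ** 52)

if __name__ == "__main__":
    for x in (1.0, -2.0, 0.75, 0.1):
        print(x, decode_double(x), rebuild(*decode_double(x)) == x)
```

For example, 1.0 decodes to sign 0, E = 1023 (so α = 0) and M = 0, while 0.75 = 2⁻¹ × 1.5 decodes to E = 1022 and M = 2⁵¹.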

Examples of representations of the exponent:

α=0 is coded by 011 1111 1111
α=1 is coded by 100 0000 0000
α=4 is coded by 100 0000 0011
α=5 is coded by 100 0000 0100
α=−1 is coded by 011 1111 1110
α=−4 is coded by 011 1111 1011
α=−5 is coded by 011 1111 1010
α=2¹⁰ is coded by 111 1111 1111
α=1−2¹⁰ is coded by 000 0000 0000
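
Exponent codes of this kind can be generated with a short Python sketch. It assumes the offset 2¹⁰−1 between the exponent α and the integer stored in the 11 bits; the function name exponent_code is hypothetical and chosen only for this illustration.

```python
def exponent_code(alpha):
    """Return the 11-bit code of the exponent alpha, grouped as 3+4+4 bits."""
    E = alpha + 2 ** 10 - 1          # integer stored in the 11 exponent bits
    if not 0 <= E <= 2047:
        raise ValueError("exponent out of range for 11 bits")
    b = format(E, "011b")            # zero-padded binary string of length 11
    return f"{b[:3]} {b[3:7]} {b[7:]}"

if __name__ == "__main__":
    for alpha in (0, 1, 4, 5, -1, -4, -5, 2 ** 10, 1 - 2 ** 10):
        print(alpha, exponent_code(alpha))
```

The two extreme inputs, 2¹⁰ and 1−2¹⁰, give the all-ones and all-zeros codes that are reserved for infinity/NaN and for denormalized floats respectively.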

Remark: 2⁻⁵²=0.2220446049250313e−15
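
The value in this remark is the weight of the last mantissa bit for floats between 1 and 2, so it is the gap between 1 and the next larger hardware float. A small Python sketch, again assuming that Python's float is an IEEE 754 double with the default round-to-nearest mode, illustrates this:

```python
import sys

eps = 2.0 ** -52                     # weight of the last of the 52 mantissa bits
print(eps)                           # 2.220446049250313e-16

# 1 + eps is the next float after 1, so it is distinguishable from 1.
print(1.0 + eps > 1.0)               # True

# Half of that gap is lost when added to 1 (the sum rounds back to 1).
print(1.0 + eps / 2 == 1.0)          # True

# Python exposes the same constant as sys.float_info.epsilon.
print(eps == sys.float_info.epsilon) # True
```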