Representation by hardware floats



A real number is represented, up to its sign, by a floating point number d of the form

$d = 2^{\alpha}(1 + m)$, $0 \leq m < 1$, $-2^{10} < \alpha < 2^{10}$

If $\alpha$ > 1 - 2¹⁰, then d is a normalized floating point number and the formula above applies. Otherwise ($\alpha$ = 1 - 2¹⁰) d is denormalized: the implicit leading 1 is dropped and $d = 2^{\alpha + 1} m$. The special exponent $\alpha$ = 2¹⁰ is used to represent plus or minus infinity and NaN (Not a Number). A hardware float is made of 64 bits:

the first bit codes the sign of d (0 for '+' and 1 for '-'),
the following 11 bits code the exponent: the integer read from these 11 bits is $\alpha$ + 2¹⁰ - 1,
the last 52 bits code the mantissa: if M denotes the integer read from these 52 bits, then m = M/2⁵² for both normalized and denormalized floats.
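
The bit layout can be checked on any machine whose floats follow IEEE 754 binary64, which is the case for Python's `float` on common platforms. The following is a minimal sketch, not part of giac: the function names `decode_float` and `rebuild` are hypothetical, and the code assumes the standard convention that a normalized float has the value 2^α (1 + M/2⁵²), while a denormalized one has the value 2^(α+1) M/2⁵² with α = 1 - 2¹⁰.

```python
import struct


def decode_float(d):
    """Split a float into (sign bit, exponent alpha, 52-bit integer M)."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", d))
    sign = bits >> 63                 # first bit
    stored = (bits >> 52) & 0x7FF     # following 11 bits
    M = bits & ((1 << 52) - 1)        # last 52 bits
    alpha = stored - (2**10 - 1)      # remove the offset 2^10 - 1
    return sign, alpha, M


def rebuild(sign, alpha, M):
    """Recompute the value from the three fields (finite floats only)."""
    if alpha == 2**10:
        raise ValueError("exponent 2^10 codes infinity or NaN")
    if alpha == 1 - 2**10:
        # Denormalized: no implicit leading 1.
        value = 2.0 ** (alpha + 1) * (M / 2**52)
    else:
        # Normalized: implicit leading 1.
        value = 2.0 ** alpha * (1 + M / 2**52)
    return -value if sign else value


if __name__ == "__main__":
    for x in (1.0, -6.5, 0.1, 5e-324):
        fields = decode_float(x)
        print(x, fields, rebuild(*fields) == x)
```

Going through `struct` rather than arithmetic on the float itself keeps the decoding exact, since the bits are read directly and no rounding can occur.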

Examples of representations of the exponent:

$\alpha$ = 0 is coded by 011 1111 1111
$\alpha$ = 1 is coded by 100 0000 0000
$\alpha$ = 4 is coded by 100 0000 0011
$\alpha$ = 5 is coded by 100 0000 0100
$\alpha$ = -1 is coded by 011 1111 1110
$\alpha$ = -4 is coded by 011 1111 1011
$\alpha$ = -5 is coded by 011 1111 1010
$\alpha$ = 2¹⁰ is coded by 111 1111 1111
$\alpha$ = 1 - 2¹⁰ is coded by 000 0000 0000
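
The codes listed above all follow from adding the offset 2¹⁰ - 1 to the exponent and writing the result on 11 bits. As a small illustration (the helper name `exponent_code` is hypothetical and only serves this example), the table can be regenerated as follows:

```python
def exponent_code(alpha):
    """Return the 11-bit code of exponent alpha, grouped as 3 + 4 + 4 bits."""
    stored = alpha + 2**10 - 1            # integer stored in the 11 bits
    if not 0 <= stored <= 2**11 - 1:
        raise ValueError("exponent does not fit in 11 bits")
    bits = format(stored, "011b")         # zero-padded binary, 11 digits
    return f"{bits[:3]} {bits[3:7]} {bits[7:]}"


if __name__ == "__main__":
    for alpha in (0, 1, 4, 5, -1, -4, -5, 2**10, 1 - 2**10):
        print(alpha, "is coded by", exponent_code(alpha))
```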

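Remark: 2⁻⁵² = 0.2220446049250313e-15. This is the weight of the last mantissa bit for exponent 0, that is the gap between 1 and the next larger hardware float.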
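
This value can be compared with what the host system reports. The sketch below assumes Python floats are IEEE 754 binary64; `sys.float_info.epsilon` is the standard library's name for the difference between 1.0 and the next representable float.

```python
import sys

eps = 2.0 ** -52

# The same quantity as reported by the Python runtime.
same_as_runtime = eps == sys.float_info.epsilon

# Adding eps to 1.0 gives a different float; adding half of it
# is rounded back to 1.0.
changes_one = 1.0 + eps > 1.0
half_is_lost = 1.0 + eps / 2 == 1.0

if __name__ == "__main__":
    print(repr(eps), same_as_runtime, changes_one, half_is_lost)
```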


giac documentation written by Renée De Graeve and Bernard Parisse