23.1.2 Representation by hardware floats

A real is represented by a floating-point number d, that is

d=±2^α (1+m), 0≤m<1, −2¹⁰ < α < 2¹⁰.

If α>1−2¹⁰, d is a normalized floating-point number; otherwise (α=1−2¹⁰) d is denormalized: the implicit leading 1 is dropped and d=±2^(2−2¹⁰) m. The special exponent 2¹⁰ is used to represent plus or minus infinity and NaN (Not a Number). A hardware float is made of 64 bits:

the first bit is the sign of d (0 for + and 1 for −),
the following 11 bits code the exponent: the integer given by these 11 bits is α+2¹⁰−1,
the last 52 bits code the mantissa: if M denotes the integer given by these 52 bits, then m=M/2⁵².
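
As a rough illustration, the three bit fields can be read back in Python (not the Xcas language). This is only a sketch: it assumes that the platform float is a standard IEEE 754 double, it takes the mantissa as m=M/2⁵² with d=±2^α (1+m), and the helper names fields and rebuild are hypothetical.

```python
import struct

def fields(x):
    """Split a double into (sign bit, 11-bit exponent code, 52-bit integer M)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    code = (bits >> 52) & 0x7FF
    M = bits & ((1 << 52) - 1)
    return sign, code, M

def rebuild(sign, code, M):
    """Recompute d from its fields (finite values only, an assumption of this sketch)."""
    if code == 0x7FF:
        raise ValueError("infinity or NaN")
    m = M / 2**52
    if code == 0:
        # Denormalized: no implicit leading 1, fixed scale 2^(2 - 2^10).
        d = 2.0**-1022 * m
    else:
        alpha = code - (2**10 - 1)
        d = 2.0**alpha * (1 + m)
    return -d if sign else d

for x in (3.0, -0.1, 5e-324):
    s, c, M = fields(x)
    print(x, s, format(c, "011b"), hex(M), rebuild(s, c, M) == x)
```

The last value in the loop, 5e-324, is the smallest positive denormalized double, so it exercises the branch without the implicit leading 1.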

Examples of representations of the exponent:

α=0 is coded by 011 1111 1111
α=1 is coded by 100 0000 0000
α=4 is coded by 100 0000 0011
α=5 is coded by 100 0000 0100
α=−1 is coded by 011 1111 1110
α=−4 is coded by 011 1111 1011
α=−5 is coded by 011 1111 1010
α=2¹⁰ is coded by 111 1111 1111
α=1−2¹⁰ is coded by 000 0000 0000
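
The codes in this list can be reproduced with a few lines of Python. This is a sketch with a hypothetical helper exponent_code; it only assumes that the 11 bits hold the integer α+2¹⁰−1, as in the examples above.

```python
def exponent_code(alpha):
    """Return the 11-bit code of the exponent alpha as a binary string."""
    if not -2**10 < alpha <= 2**10:
        raise ValueError("alpha must satisfy 1 - 2^10 <= alpha <= 2^10")
    return format(alpha + 2**10 - 1, "011b")

for alpha in (0, 1, 4, 5, -1, -4, -5, 2**10, 1 - 2**10):
    print(alpha, exponent_code(alpha))
```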

Remark.

2⁻⁵²=0.2220446049250313e−15.
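
This value is the weight of the last mantissa bit of a float between 1 and 2, that is, the gap between 1 and the next representable float. A small Python check, assuming IEEE 754 doubles with the default rounding to nearest:

```python
import sys

eps = 2.0**-52
print(eps)                         # the same number, written with one leading digit
print(eps == sys.float_info.epsilon)
print(1.0 + eps > 1.0)             # the last mantissa bit is set
print(1.0 + eps / 2 == 1.0)        # half of it is lost when rounding
```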

Examples of representations of normalized floats

Representation of 3.1.

We have

3.1 = 2·(1 + 1/2 + 1/2⁵ + 1/2⁶ + 1/2⁹ + 1/2¹⁰ + ⋯) = 2·(1 + 1/2 + ∑_{k=1}^{∞} (1/2^{4k+1} + 1/2^{4k+2}))

hence α=1 and m = 1/2 + ∑_{k=1}^{∞} (1/2^{4k+1} + 1/2^{4k+2}). Hence the hexadecimal and binary representation of 3.1 is:

40 (01000000), 08 (00001000), cc (11001100), cc (11001100),

cc (11001100), cc (11001100), cc (11001100), cd (11001101),

the last four bits are 1101: the last bit is 1 because the next digit of the binary expansion is 1 (rounding up).
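
The rounding of the last bit can be checked with exact rational arithmetic. The following Python sketch assumes IEEE 754 doubles; the helper hex_bytes and the variable names are hypothetical. It compares the truncated 52-bit mantissa of 3.1 with the one actually stored.

```python
import struct
from fractions import Fraction

def hex_bytes(x):
    """Big-endian bytes of the double x, as two-digit hexadecimal strings."""
    return " ".join(f"{b:02x}" for b in struct.pack(">d", x))

# Exact mantissa of 3.1 = 2*(1+m): m = 3.1/2 - 1 = 11/20.
m_exact = Fraction(31, 10) / 2 - 1
scaled = m_exact * 2**52                          # ideal, non-integer value of M
M_floor = scaled.numerator // scaled.denominator  # first 52 bits, truncated
tail = scaled - M_floor                           # part lost after bit 52

# Mantissa integer actually stored in the double.
M_stored = struct.unpack(">Q", struct.pack(">d", 3.1))[0] & (2**52 - 1)

print(hex_bytes(3.1))
print(hex(M_floor), tail)   # truncated mantissa and the discarded tail
print(hex(M_stored))        # stored mantissa, rounded to nearest
```

The discarded tail is larger than 1/2, so the stored mantissa is the truncated one plus 1.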

Representation of 3.0.

We have 3=2· (1+1/2). Hence the hexadecimal and binary representation of 3 is:

40 (01000000), 08 (00001000), 00 (00000000), 00 (00000000),

00 (00000000), 00 (00000000), 00 (00000000), 00 (00000000)

The difference between representations of 3.1−3.0 and 0.1.

For the representation of 0.1:

0.1 = 2⁻⁴·(1 + 1/2 + 1/2⁴ + 1/2⁵ + 1/2⁸ + 1/2⁹ + ⋯) = 2⁻⁴·∑_{k=0}^{∞} (1/2^{4k} + 1/2^{4k+1})

hence α=−4 and m = 1/2 + ∑_{k=1}^{∞} (1/2^{4k} + 1/2^{4k+1}), therefore the representation of 0.1 is

3f (00111111), b9 (10111001), 99 (10011001), 99 (10011001),

99 (10011001), 99 (10011001), 99 (10011001), 9a (10011010),

the last four bits are 1010: the last two bits 01 became 10 because the next digit of the binary expansion is 1 (rounding up).

For the representation of a=3.1−3: a is computed by aligning the exponents (here there is nothing to do, since both are α=1), then subtracting the mantissas and adjusting the exponent of the result to get a normalized float. The leading bit of the difference is 2·2⁻⁵=2⁻⁴, so the exponent is α=−4, and the first mantissa bit (weight 1/2) corresponds to 2·2⁻⁶: the bits of the mantissa are shifted 5 positions to the left, zeros fill the vacated low-order positions, and you get:

3f (00111111), b9 (10111001), 99 (10011001), 99 (10011001),

99 (10011001), 99 (10011001), 99 (10011001), a0 (10100000),

Therefore, a>0.1: the mantissas differ by 1/2⁵⁰+1/2⁵¹ (since 100000−11010=110), so a−0.1=2⁻⁴·(1/2⁵⁰+1/2⁵¹)=1/2⁵⁴+1/2⁵⁵.

This is the reason why:

floor(1/(3.1-3))

returns 9 and not 10 when Digits:=14.
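
The same effect can be observed outside Xcas with any IEEE 754 double arithmetic. The Python sketch below (the helper bits is hypothetical) compares the bit patterns of 3.1−3.0 and 0.1, measures the difference of their mantissas, and contrasts the floating-point result with exact rational arithmetic.

```python
import math
import struct
from fractions import Fraction

def bits(x):
    """The 64 bits of the double x, as an integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

a = 3.1 - 3.0
print(hex(bits(a)))      # bit pattern of the computed difference
print(hex(bits(0.1)))    # bit pattern of the literal 0.1

# Same sign and exponent, so the bit patterns differ only in the mantissa.
ulps = bits(a) - bits(0.1)
mantissa_gap = Fraction(ulps, 2**52)
print(ulps, mantissa_gap == Fraction(1, 2**50) + Fraction(1, 2**51))

# Exact difference of the two doubles: the mantissa gap scaled by 2^-4.
gap = Fraction(a) - Fraction(0.1)
print(gap == mantissa_gap / 2**4)

print(math.floor(1 / a))                         # hardware floats
print(math.floor(1 / (Fraction(31, 10) - 3)))    # exact rationals
```

Since a is slightly larger than 0.1, its inverse is slightly smaller than 10, and floor drops it to 9; with exact rationals the result is 10.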