
1.1 Floating-point numbers

The real number set $\real$ is infinite in two ways: it is unbounded and continuous. In most practical computing, the latter kind of infiniteness is much more consequential than the former, so we turn our attention there first.

Equation (1.1.2) represents the significand as a number in $[1,2)$ in base-2 form. Equivalently,

$$f = 2^{-d}\, \sum_{i=1}^{d} b_{i} \, 2^{d-i} = 2^{-d} z$$

for an integer $z$ in the set $\{0,1,\ldots,2^d-1\}$. Consequently, starting at $2^n$ and ending just before $2^{n+1}$ there are exactly $2^d$ evenly spaced numbers belonging to $\float$.

Observe that the smallest element of $\float$ that is greater than 1 is $1+2^{-d}$, and we call the difference machine epsilon:[1] $\macheps = 2^{-d}$.
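
In Julia (jumping ahead to the double precision defined in Section 1.1.2, where $d=52$), machine epsilon and the floating-point spacing at any value are available through the built-in functions `eps` and `nextfloat`:

```julia
@show eps()                 # machine epsilon: 2^-52 ≈ 2.2e-16 in double precision
@show nextfloat(1.0) - 1    # gap between 1 and the next float; equals eps()
@show eps(8.0)              # spacing within [2^3, 2^4): 2^3 * eps() = 2^-49
```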

We define the rounding function $\fl(x)$ as the map from real number $x$ to the nearest member of $\float$. The distance between the floating-point numbers in $[2^n,2^{n+1})$ is $2^n\macheps=2^{n-d}$. As a result, every real $x \in [2^n,2^{n+1})$ is no farther than $2^{n-d-1}$ away from a member of $\float$. Therefore we conclude that $|\fl(x)-x| \le \tfrac{1}{2}(2^{n-d})$, which leads to the bound

$$\frac{|\fl(x)-x|}{|x|} \le \frac{2^{n-d-1}}{2^n} = \tfrac{1}{2}\macheps. \tag{1.1.5}$$

In words, every real number is represented with a uniformly bounded relative precision. Inequality (1.1.5) holds for negative $x$ as well. In Exercise 2 you are asked to show that an equivalent statement is that

$$\fl(x)=x(1+\epsilon) \quad \text{for some } |\epsilon|\le \tfrac{1}{2}\macheps. \tag{1.1.6}$$

The value of $\epsilon$ depends on $x$, but this dependence is not usually shown explicitly.
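
As a concrete check of this bound, the following sketch measures the rounding error committed when the real number 1/10 is stored in double precision, comparing the stored value (converted exactly to extended precision) against the true rational value:

```julia
# fl(1/10): the literal 0.1 is rounded to the nearest double on input.
# big(0.1) converts that stored double to BigFloat exactly, while 1//10
# is the true rational value, so their difference is the rounding error.
ϵ = abs(big(0.1) - 1//10) / (1//10)
@show ϵ               # ≈ 5.55e-17
@show ϵ <= eps()/2    # true: within the bound in (1.1.6)
```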

1.1.1 Precision and accuracy

It may help to recast (1.1.1) and (1.1.2) in terms of base 10:

$$\pm \left(b_0 + \sum_{i=1}^d b_i \, 10^{-i} \right) \times 10^n = \pm (b_0.b_1b_2\cdots b_d) \times 10^n,$$

where each $b_i$ is in $\{0,1,\ldots,9\}$ and $b_0\neq 0$. This is simply scientific notation with $d+1$ significant digits. For example, Planck’s constant is $6.626068\times 10^{-34}\,\text{m}^2{\cdot}\text{kg/sec}$ to seven digits. If we alter just the last digit from 8 to 9, the relative change is

$$\frac{0.000001\times 10^{-34}}{6.626068\times 10^{-34}} \approx 1.51\times 10^{-7}.$$

We therefore say that the constant is given with 7 decimal digits of precision. That’s in contrast to noting that the value is given to 40 decimal places. A major advantage of floating point is that the relative precision does not depend on the choice of physical units. For instance, when expressed in eV·sec, Planck’s constant is $4.135668\times 10^{-15}$, which still has 7 digits but only 21 decimal places.
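
A quick numerical check of this units-independence, using the two quoted values of Planck’s constant: perturbing the last quoted digit gives a relative change near $10^{-7}$ in both unit systems.

```julia
h_SI = 6.626068e-34    # m^2⋅kg/sec
h_eV = 4.135668e-15    # eV⋅sec
# A change of one unit in the seventh significant digit is relatively
# tiny either way; both values carry about 7 digits of precision.
@show 1e-40 / h_SI     # ≈ 1.51e-7
@show 1e-21 / h_eV     # ≈ 2.42e-7
```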

Floating-point precision works the same way, except that computers prefer base 2 to base 10. The precision of a floating-point number is always $d$ binary digits, implying a resolution of the real numbers according to (1.1.5).
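
In Julia, the working precision of each IEEE type (discussed in Section 1.1.2) can be queried directly; note that the built-in `precision` counts the implicit leading bit along with the $d$ stored bits:

```julia
@show precision(Float64)   # 53 = 1 implicit bit + d = 52 stored fraction bits
@show precision(Float32)   # 24 = 1 implicit bit + d = 23 stored fraction bits
```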

It can be easy to confuse precision with accuracy, especially when looking at the result of a calculation on the computer. Every result is computed and represented using $d$ binary digits, but not all of those digits may accurately represent an intended value. Suppose $x$ is a number of interest and $\tilde{x}$ is an approximation to it. The absolute accuracy of $\tilde{x}$ is

$$|\tilde{x} - x|,$$

while the relative accuracy is

$$\frac{|\tilde{x} - x|}{|x|}.$$

Absolute accuracy has the same units as $x$, while relative accuracy is dimensionless. We can also express the relative accuracy as the number of accurate digits, computed in base 10 as

$$-\log_{10} \left| \frac{\tilde{x}-x}{x} \right|.$$

We often round this value down to an integer, but it does make sense to speak of “almost seven digits” or “ten and a half digits.”
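
For example, here are these definitions applied to the approximation $\pi \approx 22/7$ from Demo 1.1.2:

```julia
x̃ = 22/7
abs_acc = abs(x̃ - π)         # absolute accuracy
rel_acc = abs(x̃ - π) / π     # relative accuracy
@show abs_acc rel_acc
@show -log10(rel_acc)         # ≈ 3.4 accurate digits
```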

1.1.2 Double precision

Most numerical computing today is done in the IEEE 754 standard. This defines single precision with $d=23$ binary digits for the fractional part $f$, and the more commonly used double precision with $d=52$. In double precision,

$$\macheps = 2^{-52} \approx 2.2\times 10^{-16}.$$

We often speak of double-precision floating-point numbers as having about 16 decimal digits. The 52-bit significand is paired with a sign bit and 11 binary bits to represent the exponent $n$ in (1.1.1), for a total of 64 binary bits per floating-point number.
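
The layout of those 64 bits can be inspected with Julia’s built-in `bitstring`:

```julia
s = bitstring(1.0)   # 64 characters: sign, exponent, fraction
@show s[1]           # sign bit: '0'
@show s[2:12]        # 11 exponent bits: "01111111111", a biased encoding of n = 0
@show s[13:64]       # 52 fraction bits: all zero, since 1.0 = (1.0)_2 × 2^0
```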

Our theoretical description of $\float$ did not place limits on the exponent, but in double precision its range is limited to $-1022\le n \le 1023$. Thus, the largest number is just short of $2^{1024}\approx 2\times 10^{308}$, which is enough in most applications. Results that should be larger are said to overflow and will actually result in the value Inf. Similarly, the smallest positive number is $2^{-1022}\approx 2\times 10^{-308}$, and smaller values are said to underflow to zero.[2]
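
Julia exposes these limits as `floatmax` and `floatmin`, and overflow is easy to trigger deliberately:

```julia
@show floatmax()    # ≈ 1.8e308, just short of 2^1024
@show floatmin()    # 2^-1022 ≈ 2.2e-308, smallest positive normalized value
@show 2.0^1024      # overflows to Inf
```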

Note the crucial difference between $\macheps=2^{-52}$, which is the distance between 1 and the next larger double-precision number, and $2^{-1022}$, which is the smallest positive double-precision number. The former has to do with relative precision, while the latter is about absolute precision. Getting close to zero always requires a shift in thinking to absolute precision, because any finite error is infinite relative to zero.

One more double-precision value is worthy of note: NaN, which stands for Not a Number. It is the result of an undefined arithmetic operation such as 0/0.
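
A few ways to produce NaN, along with its most famous property: it compares as unequal to everything, including itself.

```julia
@show 0/0          # NaN (integer division promotes to Float64)
@show Inf - Inf    # NaN
@show NaN == NaN   # false: NaN is unequal even to itself
```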

1.1.3 Floating-point arithmetic

Computer arithmetic is performed on floating-point numbers and returns floating-point results. We assume the existence of machine-analog operations for real functions such as $+$, $-$, $\times$, $/$, $\sqrt{\quad}$, and so on. Without getting into the details, we will suppose that each elementary machine operation creates a floating-point result whose relative error is bounded by $\macheps$. For example, if $x$ and $y$ are in $\float$, then for machine addition $\oplus$ we have the bound

$$\frac{ |(x \oplus y)-(x+y)| }{ |x+y| } \le \macheps.$$

Hence the relative error in arithmetic is essentially the same as for the floating-point representation itself. However, playing by these rules can lead to disturbing results.
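
Demo 1.1.4 is not reproduced in this excerpt; the sketch below, assuming the demo concerns regrouped additions involving a number near $\tfrac{1}{2}\macheps$, shows the kind of disturbing result meant here. Every individual operation obeys the error bound above, yet the two groupings disagree completely:

```julia
ϵ = eps()/2              # exactly representable, but 1 + ϵ rounds back to 1
@show (1.0 + ϵ) - 1.0    # 0.0: the machine sum 1 ⊕ ϵ loses ϵ entirely
@show 1.0 + (ϵ - 1.0)    # ≈ 1.11e-16: regrouping recovers the true value ϵ
```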

There are two ways to look at Demo 1.1.4. On one hand, its two versions of the result differ by less than $1.2\times 10^{-16}$, which is very small, not just in everyday terms but with respect to the operands, which are all close to 1 in absolute value. On the other hand, the difference is as large as the exact result itself! We formalize and generalize this observation in the next section. In the meantime, keep in mind that exactness cannot be taken for granted in floating-point computation.

1.1.4 Exercises

  1. ✍ Consider a floating-point set $\float$ defined by (1.1.1) and (1.1.2) with $d=4$.

    (a) How many elements of $\float$ are there in the real interval $[1/2,4]$, including the endpoints?

    (b) What is the element of $\float$ closest to the real number $1/10$? (Hint: Find the interval $[2^n,2^{n+1})$ that contains $1/10$, then enumerate all the candidates in $\float$.)

    (c) What is the smallest positive integer not in $\float$? (Hint: For what value of the exponent does the spacing between floating-point numbers become larger than 1?)

  2. ✍ Prove that (1.1.5) is equivalent to (1.1.6). This means showing first that (1.1.5) implies (1.1.6), and then separately that (1.1.6) implies (1.1.5).

  3. ⌨ There are much better rational approximations to π than 22/7 as used in Demo 1.1.2. For each one below, find its absolute and relative accuracy, and (rounding down to an integer) the number of accurate digits.

    (a) 355/113

    (b) 103638/32989

  4. ✍ IEEE 754 single precision specifies that 23 binary bits are used for the value $f$ in the significand $1+f$ in (1.1.2). Because they need less storage space and can be operated on more quickly than double-precision values, single-precision values can be useful in low-precision applications. (They are supported as type Float32 in Julia.)

    (a) In base-10 terms, what is the first single-precision number greater than 1 in this system?

    (b) What is the smallest positive integer that is not a single-precision number? (See the hint to Exercise 1.)

  5. ⌨ Julia defines a function nextfloat that gives the next-larger floating-point value of a given number. What is the next float past floatmax()? What is the next float past -Inf?

Footnotes
  1. The terms machine epsilon, machine precision, and unit roundoff aren’t used consistently across references, but the differences are not consequential for our purposes.

  2. Actually, there are some still-smaller denormalized numbers that have less precision, but we won’t use that level of detail.