
1.1 Floating-point numbers

The real number set $\real$ is infinite in two ways: it is unbounded and continuous. In most practical computing, the latter kind of infiniteness is much more consequential than the former, so we turn our attention there first.

Equation (1.1.2) represents the significand as a number in $[1,2)$ in base-2 form. Equivalently,

$$f = 2^{-d}\, \sum_{i=1}^{d} b_{i} \, 2^{d-i} = 2^{-d} z$$

for an integer $z$ in the set $\{0,1,\ldots,2^d-1\}$. Consequently, starting at $2^n$ and ending just before $2^{n+1}$ there are exactly $2^d$ evenly spaced numbers belonging to $\float$.

Observe that the smallest element of $\float$ that is greater than 1 is $1+2^{-d}$, and we call the difference machine epsilon:[1] $\macheps = 2^{-d}$.
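
In Julia (jumping ahead to the double precision defined in Section 1.1.2, where $d=52$), machine epsilon and the floating-point spacing at any value are available through the built-in functions `eps` and `nextfloat`:

```julia
@show eps()                 # machine epsilon: 2^-52 ≈ 2.2e-16 in double precision
@show nextfloat(1.0) - 1    # gap between 1 and the next float; equals eps()
@show eps(8.0)              # spacing within [2^3, 2^4): 2^3 * eps() = 2^-49
```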

We define the rounding function $\fl(x)$ as the map from real number $x$ to the nearest member of $\float$. The distance between the floating-point numbers in $[2^n,2^{n+1})$ is $2^n\macheps=2^{n-d}$. As a result, every real $x \in [2^n,2^{n+1})$ is no farther than $2^{n-d-1}$ away from a member of $\float$. Therefore we conclude that $|\fl(x)-x| \le \tfrac{1}{2}(2^{n-d})$, which leads to the bound

$$\frac{|\fl(x)-x|}{|x|} \le \frac{2^{n-d-1}}{2^n} = \tfrac{1}{2}\macheps. \tag{1.1.5}$$

In words, every real number is represented with a uniformly bounded relative precision. Inequality (1.1.5) holds for negative $x$ as well. In Exercise 2 you are asked to show that an equivalent statement is that

$$\fl(x)=x(1+\epsilon) \quad \text{for some } |\epsilon|\le \tfrac{1}{2}\macheps. \tag{1.1.6}$$

The value of $\epsilon$ depends on $x$, but this dependence is not usually shown explicitly.
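
As a concrete check of this bound, the following sketch measures the rounding error committed when the real number 1/10 is stored in double precision, comparing the stored value (converted exactly to extended precision) against the true rational value:

```julia
# fl(1/10): the literal 0.1 is rounded to the nearest double on input.
# big(0.1) converts that stored double to BigFloat exactly, while 1//10
# is the true rational value, so their difference is the rounding error.
ϵ = abs(big(0.1) - 1//10) / (1//10)
@show ϵ               # ≈ 5.55e-17
@show ϵ <= eps()/2    # true: within the bound in (1.1.6)
```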

1.1.1 Precision and accuracy

It may help to recast (1.1.1) and (1.1.2) in terms of base 10:

$$\pm \left(b_0 + \sum_{i=1}^d b_i \, 10^{-i} \right) \times 10^n = \pm (b_0.b_1b_2\cdots b_d) \times 10^n,$$

where each $b_i$ is in $\{0,1,\ldots,9\}$ and $b_0\neq 0$. This is simply scientific notation with $d+1$ significant digits. For example, Planck’s constant is $6.626068\times 10^{-34}\,\text{m}^2{\cdot}\text{kg/sec}$ to seven digits. If we alter just the last digit from 8 to 9, the relative change is

$$\frac{0.000001\times 10^{-34}}{6.626068\times 10^{-34}} \approx 1.51\times 10^{-7}.$$

We therefore say that the constant is given with 7 decimal digits of precision. That’s in contrast to noting that the value is given to 40 decimal places. A major advantage of floating point is that the relative precision does not depend on the choice of physical units. For instance, when expressed in eV·sec, Planck’s constant is $4.135668\times 10^{-15}$, which still has 7 digits but only 21 decimal places.
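
A quick numerical check of this units-independence, using the two quoted values of Planck’s constant: perturbing the last quoted digit gives a relative change near $10^{-7}$ in both unit systems.

```julia
h_SI = 6.626068e-34    # m^2⋅kg/sec
h_eV = 4.135668e-15    # eV⋅sec
# A change of one unit in the seventh significant digit is relatively
# tiny either way; both values carry about 7 digits of precision.
@show 1e-40 / h_SI     # ≈ 1.51e-7
@show 1e-21 / h_eV     # ≈ 2.42e-7
```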

Floating-point precision works the same way, except that computers prefer base 2 to base 10. The precision of a floating-point number is always $d$ binary digits, implying a resolution of the real numbers according to (1.1.5).
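
In Julia, the working precision of each IEEE type (discussed in Section 1.1.2) can be queried directly; note that the built-in `precision` counts the implicit leading bit along with the $d$ stored bits:

```julia
@show precision(Float64)   # 53 = 1 implicit bit + d = 52 stored fraction bits
@show precision(Float32)   # 24 = 1 implicit bit + d = 23 stored fraction bits
```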

It can be easy to confuse precision with accuracy, especially when looking at the result of a calculation on the computer. Every result is computed and represented using $d$ binary digits, but not all of those digits may accurately represent an intended value. Suppose $x$ is a number of interest and $\tilde{x}$ is an approximation to it. The absolute accuracy of $\tilde{x}$ is

$$|\tilde{x} - x|,$$

while the relative accuracy is

$$\frac{|\tilde{x} - x|}{|x|}.$$

Absolute accuracy has the same units as $x$, while relative accuracy is dimensionless. We can also express the relative accuracy as the number of accurate digits, computed in base 10 as

$$-\log_{10} \left| \frac{\tilde{x}-x}{x} \right|.$$

We often round this value down to an integer, but it does make sense to speak of “almost seven digits” or “ten and a half digits.”
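
For example, here are these definitions applied to the approximation $\pi \approx 22/7$ from Demo 1.1.2:

```julia
x̃ = 22/7
abs_acc = abs(x̃ - π)         # absolute accuracy
rel_acc = abs(x̃ - π) / π     # relative accuracy
@show abs_acc rel_acc
@show -log10(rel_acc)         # ≈ 3.4 accurate digits
```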

1.1.2 Double precision

Most numerical computing today is done in the IEEE 754 standard. This defines single precision with $d=23$ binary digits for the fractional part $f$, and the more commonly used double precision with $d=52$. In double precision,

$$\macheps = 2^{-52} \approx 2.2\times 10^{-16}.$$

We often speak of double-precision floating-point numbers as having about 16 decimal digits. The 52-bit significand is paired with a sign bit and 11 binary bits to represent the exponent $n$ in (1.1.1), for a total of 64 binary bits per floating-point number.
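
The layout of those 64 bits can be inspected with Julia’s built-in `bitstring`:

```julia
s = bitstring(1.0)   # 64 characters: sign, exponent, fraction
@show s[1]           # sign bit: '0'
@show s[2:12]        # 11 exponent bits: "01111111111", a biased encoding of n = 0
@show s[13:64]       # 52 fraction bits: all zero, since 1.0 = (1.0)_2 × 2^0
```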

Our theoretical description of $\float$ did not place limits on the exponent, but in double precision its range is limited to $-1022\le n \le 1023$. Thus, the largest number is just short of $2^{1024}\approx 2\times 10^{308}$, which is enough in most applications. Results that should be larger are said to overflow and will actually result in the value Inf. Similarly, the smallest positive number is $2^{-1022}\approx 2\times 10^{-308}$, and smaller values are said to underflow to zero.[2]
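
Julia exposes these limits as `floatmax` and `floatmin`, and overflow is easy to trigger deliberately:

```julia
@show floatmax()    # ≈ 1.8e308, just short of 2^1024
@show floatmin()    # 2^-1022 ≈ 2.2e-308, smallest positive normalized value
@show 2.0^1024      # overflows to Inf
```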

Note the crucial difference between $\macheps=2^{-52}$, which is the distance between 1 and the next larger double-precision number, and $2^{-1022}$, which is the smallest positive double-precision number. The former has to do with relative precision, while the latter is about absolute precision. Getting close to zero always requires a shift in thinking to absolute precision, because any finite error is infinite relative to zero.

One more double-precision value is worthy of note: NaN, which stands for Not a Number. It is the result of an undefined arithmetic operation such as 0/0.
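
A few ways to produce NaN, along with its most famous property: it compares as unequal to everything, including itself.

```julia
@show 0/0          # NaN (integer division promotes to Float64)
@show Inf - Inf    # NaN
@show NaN == NaN   # false: NaN is unequal even to itself
```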

1.1.3 Floating-point arithmetic

Computer arithmetic is performed on floating-point numbers and returns floating-point results. We assume the existence of machine-analog operations for real functions such as $+$, $-$, $\times$, $/$, $\sqrt{\quad}$, and so on. Without getting into the details, we will suppose that each elementary machine operation creates a floating-point result whose relative error is bounded by $\macheps$. For example, if $x$ and $y$ are in $\float$, then for machine addition $\oplus$ we have the bound

$$\frac{ |(x \oplus y)-(x+y)| }{ |x+y| } \le \macheps.$$

Hence the relative error in arithmetic is essentially the same as for the floating-point representation itself. However, playing by these rules can lead to disturbing results.
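
Demo 1.1.4 is not reproduced in this excerpt; the sketch below, assuming the demo concerns regrouped additions involving a number near $\tfrac{1}{2}\macheps$, shows the kind of disturbing result meant here. Every individual operation obeys the error bound above, yet the two groupings disagree completely:

```julia
ϵ = eps()/2              # exactly representable, but 1 + ϵ rounds back to 1
@show (1.0 + ϵ) - 1.0    # 0.0: the machine sum 1 ⊕ ϵ loses ϵ entirely
@show 1.0 + (ϵ - 1.0)    # ≈ 1.11e-16: regrouping recovers the true value ϵ
```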

There are two ways to look at Demo 1.1.4. On one hand, its two versions of the result differ by less than $1.2\times 10^{-16}$, which is very small, not just in everyday terms but with respect to the operands, which are all close to 1 in absolute value. On the other hand, the difference is as large as the exact result itself! We formalize and generalize this observation in the next section. In the meantime, keep in mind that exactness cannot be taken for granted in floating-point computation.

1.1.4 Exercises

  1. ✍ Consider a floating-point set $\float$ defined by (1.1.1) and (1.1.2) with $d=4$.

    (a) How many elements of $\float$ are there in the real interval $[1/2,4]$, including the endpoints?

    (b) What is the element of $\float$ closest to the real number $1/10$? (Hint: Find the interval $[2^n,2^{n+1})$ that contains $1/10$, then enumerate all the candidates in $\float$.)

    (c) What is the smallest positive integer not in $\float$? (Hint: For what value of the exponent does the spacing between floating-point numbers become larger than 1?)

  2. ✍ Prove that (1.1.5) is equivalent to (1.1.6). This means showing first that (1.1.5) implies (1.1.6), and then separately that (1.1.6) implies (1.1.5).

  3. ⌨ There are much better rational approximations to π than 22/7 as used in Demo 1.1.2. For each one below, find its absolute and relative accuracy, and (rounding down to an integer) the number of accurate digits.

    (a) 355/113

    (b) 103638/32989

  4. ✍ IEEE 754 single precision specifies that 23 binary bits are used for the value $f$ in the significand $1+f$ in (1.1.2). Because they need less storage space and can be operated on more quickly than double-precision values, single-precision values can be useful in low-precision applications. (They are supported as type Float32 in Julia.)

    (a) In base-10 terms, what is the first single-precision number greater than 1 in this system?

    (b) What is the smallest positive integer that is not a single-precision number? (See the hint to Exercise 1.)

  5. ⌨ Julia defines a function nextfloat that gives the next-larger floating-point value of a given number. What is the next float past floatmax()? What is the next float past -Inf?

Footnotes
  1. The terms machine epsilon, machine precision, and unit roundoff aren’t used consistently across references, but the differences are not consequential for our purposes.

  2. Actually, there are some still-smaller denormalized numbers that have less precision, but we won’t use that level of detail.