The real number set R is infinite in two ways: it is unbounded and continuous. In most practical computing, the latter kind of infiniteness is much more consequential than the former, so we turn our attention there first.
Equation (1.1.2) represents the significand as a number in [1,2) in base-2 form. Equivalently,
for an integer $z$ in the set $\{0,1,\ldots,2^d-1\}$. Consequently, starting at $2^n$ and ending just before $2^{n+1}$ there are exactly $2^d$ evenly spaced numbers belonging to F.
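To make the spacing concrete, here is a brief Julia enumeration of one such interval; the choices $d=4$ and $n=0$ below are illustrative, not prescribed by the text.

```julia
# Enumerate the members of F in [2^n, 2^(n+1)) for an illustrative case:
# each one is (2^d + z) * 2^(n-d) for z = 0, 1, ..., 2^d - 1.
d, n = 4, 0
members = [(2^d + z) * 2.0^(n - d) for z in 0:2^d-1]
@show length(members)      # exactly 2^d = 16 values
@show members[1:4]         # 1.0, 1.0625, 1.125, 1.1875: evenly spaced by 2^(n-d)
@show members[end]         # 1.9375, just before 2^(n+1) = 2
```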
Observe that the smallest element of F that is greater than 1 is $1+2^{-d}$, and we call the difference, $\epsilon_\text{mach}=2^{-d}$, machine epsilon.[1]
We define the rounding function $\operatorname{fl}(x)$ as the map from a real number $x$ to the nearest member of F. The distance between the floating-point numbers in $[2^n,2^{n+1})$ is $2^n\epsilon_\text{mach}=2^{n-d}$. As a result, every real $x\in[2^n,2^{n+1})$ is no farther than $2^{n-d-1}$ away from a member of F. Therefore we conclude that $|\operatorname{fl}(x)-x|\le \tfrac{1}{2}(2^{n-d})$, which leads to the bound
In words, every real number is represented with a uniformly bounded relative precision. Inequality (1.1.5) holds true for negative x as well. In Exercise 2 you are asked to show that an equivalent statement is that
where each $b_i$ is in $\{0,1,\ldots,9\}$ and $b_0\neq 0$. This is simply scientific notation with $d+1$ significant digits. For example, Planck’s constant is $6.626068\times 10^{-34}\ \text{m}^2\cdot\text{kg/sec}$ to seven digits. If we alter just the last digit from 8 to 9, the relative change is
We therefore say that the constant is given with 7 decimal digits of precision. That’s in contrast to noting that the value is given to 40 decimal places. A major advantage of floating point is that the relative precision does not depend on the choice of physical units. For instance, when expressed in eV·sec, Planck’s constant is $4.135668\times 10^{-15}$, which still has 7 digits but only 21 decimal places.
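As a quick numerical check of this example (the altered eV·sec value below is an added illustration in the same spirit, not taken from the text), bumping the seventh significant digit produces a relative change of order $10^{-7}$ in either unit system:

```julia
# Relative change from altering the last of seven significant digits,
# in SI units and in eV*sec (the latter alteration is an extra check).
rel_SI = abs(6.626069e-34 - 6.626068e-34) / 6.626068e-34
rel_eV = abs(4.135669e-15 - 4.135668e-15) / 4.135668e-15
@show rel_SI    # about 1.5e-7
@show rel_eV    # about 2.4e-7
```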
Floating-point precision functions the same way, except that computers prefer base 2 to base 10. The precision of a floating-point number is always d binary digits, implying a resolution of the real numbers according to (1.1.5).
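A brief empirical check of this resolution claim in Julia (the sample values below are arbitrary choices, and BigFloat serves as a stand-in for exact real arithmetic):

```julia
# Round a few exact reals to double precision and compare the relative
# error against the bound eps()/2 from (1.1.5).
for x in (big"0.1", big(1)/3, BigFloat(π))
    relerr = abs(Float64(x) - x) / abs(x)
    println(Float64(relerr), "  (bound eps()/2 = ", eps()/2, ")")
end
```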
It can be easy to confuse precision with accuracy, especially when looking at the result of a calculation on the computer. Every result is computed and represented using $d$ binary digits, but not all of those digits may accurately represent an intended value. Suppose $x$ is a number of interest and $\tilde{x}$ is an approximation to it. The absolute accuracy of $\tilde{x}$ is
Absolute accuracy has the same units as x, while relative accuracy is dimensionless. We can also express the relative accuracy as the number of accurate digits, computed in base 10 as
Most numerical computing today is done according to the IEEE 754 standard. It defines single precision, with $d=23$ binary digits for the fractional part $f$, and the more commonly used double precision, with $d=52$. In double precision,
We often speak of double-precision floating-point numbers as having about 16 decimal digits of precision. The 52-bit significand is paired with a sign bit and 11 bits representing the exponent $n$ in (1.1.1), for a total of 64 bits per floating-point number.
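In Julia these ingredients are easy to inspect directly; the following small probe is illustrative and not part of the text’s demos.

```julia
# Machine epsilon and the 1 + 11 + 52 bit layout of a Float64.
@show eps() == 2.0^-52     # machine epsilon in double precision
bits = bitstring(1.0)
@show bits[1]              # sign bit
@show bits[2:12]           # 11 bits encoding the exponent n (stored with a bias)
@show bits[13:end]         # 52 bits of the fractional part f
```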
Our theoretical description of F did not place limits on the exponent, but in double precision its range is limited to $-1022\le n\le 1023$. Thus, the largest number is just short of $2^{1024}\approx 2\times 10^{308}$, which is enough in most applications. Results that should be larger are said to overflow and will actually result in the value Inf. Similarly, the smallest positive number is $2^{-1022}\approx 2\times 10^{-308}$, and smaller values are said to underflow to zero.[2]
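These limits are easy to observe in Julia (a quick illustration; the particular operations are arbitrary choices):

```julia
# Overflow and underflow at the edges of the double-precision exponent range.
@show floatmax()            # largest finite double, just below 2^1024
@show 2 * floatmax()        # overflow: the result is Inf
@show floatmin()            # smallest positive normalized double, 2^-1022
@show floatmin() / 1e100    # underflow: the result is 0.0
```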
Note the crucial difference between $\epsilon_\text{mach}=2^{-52}$, which is the distance between 1 and the next larger double-precision number, and $2^{-1022}$, which is the smallest positive double-precision number. The former has to do with relative precision, while the latter is about absolute precision. Getting close to zero always requires a shift in thinking to absolute precision, because any finite error is infinite relative to zero.
One more double-precision value is worthy of note: NaN, which stands for Not a Number. It is the result of an undefined arithmetic operation such as 0/0.
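For instance (the operations beyond 0/0 are added illustrations of the same point):

```julia
# NaN results from undefined operations and follows unusual comparison rules.
@show 0 / 0        # NaN
@show Inf - Inf    # NaN as well
@show NaN == NaN   # false: NaN compares unequal even to itself
```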
Computer arithmetic is performed on floating-point numbers and returns floating-point results. We assume the existence of machine-analog operations for real functions such as $+$, $-$, $\times$, $/$, $\sqrt{\;}$, and so on. Without getting into the details, we will suppose that each elementary machine operation creates a floating-point result whose relative error is bounded by $\epsilon_\text{mach}$. For example, if $x$ and $y$ are in F, then for machine addition $\oplus$ we have the bound
Hence the relative error in arithmetic is essentially the same as for the floating-point representation itself. However, playing by these rules can lead to disturbing results.
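Here is a minimal empirical check of the bound for machine addition, using BigFloat as a stand-in for exact arithmetic; the operands 0.1 and 0.2 are arbitrary sample floats, not taken from the text.

```julia
# Compare the machine sum of two floats with their exact sum.
x, y = 0.1, 0.2
exact = big(x) + big(y)                       # exact sum of these two Float64 values
relerr = abs((x + y) - exact) / abs(exact)    # relative error of machine addition
@show Float64(relerr)      # a tiny number, no larger than eps()
@show relerr <= eps()      # true: consistent with the stated bound
```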
There are two ways to look at Demo 1.1.4. On one hand, its two versions of the result differ by less than $1.2\times 10^{-16}$, which is very small, not just in everyday terms but with respect to the operands, which are all close to 1 in absolute value. On the other hand, the difference is as large as the exact result itself! We formalize and generalize this observation in the next section. In the meantime, keep in mind that exactness cannot be taken for granted in floating-point computation.
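Demo 1.1.4 is not reproduced here, but a snippet in the same spirit shows the effect being described: two algebraically identical expressions whose computed values differ by roughly the size of the exact answer.

```julia
# Two algebraically equivalent ways to compute the same small quantity.
e = eps() / 2
@show (1.0 + e) - 1.0     # exact value is e, but this evaluates to 0.0
@show 1.0 + (e - 1.0)     # algebraically the same, yet evaluates to about 1.11e-16
```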
✍ Consider a floating-point set F defined by (1.1.1) and (1.1.2) with d=4.
(a) How many elements of F are there in the real interval [1/2,4], including the endpoints?
(b) What is the element of F closest to the real number 1/10? (Hint: Find the interval $[2^n,2^{n+1})$ that contains 1/10, then enumerate all the candidates in F.)
(c) What is the smallest positive integer not in F? (Hint: For what value of the exponent does the spacing between floating-point numbers become larger than 1?)
⌨ There are much better rational approximations to π than 22/7 as used in Demo 1.1.2. For each one below, find its absolute and relative accuracy, and (rounding down to an integer) the number of accurate digits.
(a) 355/113
(b) 103638/32989
✍ IEEE 754 single precision specifies that 23 bits are used for the value $f$ in the significand $1+f$ in (1.1.2). Because single-precision values need less storage and can be operated on more quickly than double-precision values, they can be useful in low-precision applications. (They are supported as type Float32 in Julia.)
(a) In base-10 terms, what is the first single-precision number greater than 1 in this system?
(b) What is the smallest positive integer that is not a single-precision number? (See the hint to Exercise 1.)
⌨ Julia defines a function nextfloat that gives the next-larger floating-point value of a given number. What is the next float past floatmax()? What is the next float past -Inf?
The terms machine epsilon, machine precision, and unit roundoff aren’t used consistently across references, but the differences are not consequential for our purposes.