
A Note on Error due to Floating-Point Arithmetic

Pop Quiz!

What’s 1 + (1e-16) - 1?

That’s easy, you might say; it’s just 1e-16.

We can check that with our favorite calculator (in my case that’s a TI-84 that I’ve had for way too long, but it’s easier to take screenshots of Python) and we get
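Here’s that check in a Python interpreter:

>>> 1 + 1e-16 - 1
0.0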

Wait, that’s not right. What’s going on? (If you don’t believe me, give it a try yourself.)

How Computers Represent Rational Numbers

Computers don’t store numbers the same way we do. (Note: if you want to get into the nitty-gritty of how computers store and work on data, I recommend Computer Systems: A Programmer’s Perspective as a good starting guide.) They store data in banks of binary numbers, where each bit of information is either a 1 or a 0. Now, it’s easy to define a mapping from the non-negative integers to sets of 1s and 0s; we can just convert from base 10 to base 2

0 -> 0

1 -> 1

2 -> 10

3 -> 11

4 -> 100

5 -> 101
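If you want to play with this conversion yourself, Python’s built-in bin() does exactly this (the 0b prefix just marks the result as binary):

>>> bin(5)
'0b101'
>>> bin(3)
'0b11'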

Note that 1, 2, and 4 all require different numbers of bits to represent. From a computer architecture perspective, it’s easier to have them all be a fixed number of digits. For this example, let’s make that length 4 bits (also known as a nibble). Then we would represent the non-negative integers as

0 -> 0000

1 -> 0001

2 -> 0010

3 -> 0011

4 -> 0100

5 -> 0101
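You can reproduce this table with Python’s format(), which can zero-pad a binary conversion to a fixed width (here, our 4 bits):

>>> for n in range(6):
...     print(n, '->', format(n, '04b'))
...
0 -> 0000
1 -> 0001
2 -> 0010
3 -> 0011
4 -> 0100
5 -> 0101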

Using these 4-bit encodings, we’ll notice that the largest number we can represent is 15, which isn’t that high, but it gets even worse if we want to include negative integers. We can include negative numbers easily by making the first (leftmost) bit a 1 if the number is negative and a 0 if it is positive. We now get

-2 -> 1010

-1 -> 1001

 0 -> 0000

 1 -> 0001

 2 -> 0010
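Here’s a minimal Python sketch of this sign-magnitude idea (sign_magnitude is a made-up helper, not anything standard):

def sign_magnitude(n, bits=4):
    # First bit is the sign (1 = negative), the rest is |n| in binary
    sign = '1' if n < 0 else '0'
    return sign + format(abs(n), f'0{bits - 1}b')

for n in (-2, -1, 0, 1, 2):
    print(n, '->', sign_magnitude(n))  # matches the table above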

This cuts the number of values we can represent with 4 bits in half. Increasing the number of bits is one way of representing more numbers (most personal computers today are built around 64 bits), but we still haven’t reached representations of fractions. For those, let’s introduce floating point, where a value V is represented as

V = (-1)^S * M * 2^E

where S is a single bit encoding the sign, M (called the significand) is some value, and E is some positive or negative integer. IEEE has published a floating-point standard (IEEE 754), but for explanatory purposes let’s just say M is a 4-bit integer, S is a single bit, and E is 3 bits (with the first bit being its sign). Our floating-point number is then 8 bits of the following form

SEEEMMMM

So, just to practice, let’s evaluate a few

11100010 -> -0.50

11100100 -> -1.00

01100100 -> 1.00

01100101 -> 1.25
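If you want to check these (or try patterns of your own), here’s a small Python decoder for our toy SEEEMMMM format (decode_toy_float is my name for it, nothing official):

def decode_toy_float(bits):
    # bits is an 8-character string laid out as S EEE MMMM
    s = int(bits[0])                       # sign of the value
    e_sign = -1 if bits[1] == '1' else 1   # first exponent bit is its sign
    e = e_sign * int(bits[2:4], 2)         # 2-bit exponent magnitude
    m = int(bits[4:], 2)                   # 4-bit integer significand
    return (-1) ** s * m * 2.0 ** e

for pattern in ('11100010', '11100100', '01100100', '01100101'):
    print(pattern, '->', decode_toy_float(pattern))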

This encoding is bad because it has a lot of duplicates (for example, evaluate 01100100 and 01010010; both come out to 1.00), but it illustrates how floating-point numbers work. It also illustrates one of the key limitations of floating-point numbers: they are finite. In our current representation, there is no way to represent 0.9. To do that we would need to add more bits of information, but no matter how many bits we added, there would always be numbers that cannot be represented exactly. These numbers can, however, be close to a number we can represent, and so the natural solution is to just round. In that case, 0.9 would be rounded to 0.875, the closest value our format can encode. In other cases, we can add two representable numbers but can’t represent their sum. For example

01111111 + 00000001 = 1.875 + 1 = 2.875

Now 2.875 is 0.125 away from 3.0, which can be represented in our floating-point format (01101100), and 0.125 away from 2.75, which can also be represented (01101011), but we can’t represent 2.875 itself, so let’s round it up to 3.0. This leads to the fun result of

(01111111 + 00000001) - 00000001 = (1.875 + 1) - 1 = 3.0 - 1 = 2.0
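We can also use the decoder sketch from above to brute-force all 256 bit patterns and confirm that 2.75 and 3.0 have exact encodings in our toy format while 2.875 does not:

>>> representable = {decode_toy_float(format(i, '08b')) for i in range(256)}
>>> 2.75 in representable, 3.0 in representable, 2.875 in representable
(True, True, False)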

Do note that real floating-point numbers are a bit more complex. I tried to go for “intuitively understanding why floating-point representation can lead to unexpected results” rather than explaining everything with 100% accuracy. If you want to dive in deeper, I again recommend Computer Systems: A Programmer’s Perspective as a good starting point. Also, note that these problems can be avoided (or at least mitigated) now that we know what they look like. In our example from the beginning, if we just swap around the order of operations so that the 1e-16 is at the end (aka 1 - 1 + 1e-16), we obtain the correct answer, even though the two expressions look algebraically identical.
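Back in the interpreter, reordering really does fix it, and Python’s standard-library math.fsum, which tracks the rounding error exactly, gets it right regardless of order:

>>> 1 + 1e-16 - 1      # small term absorbed into the 1, then cancelled
0.0
>>> 1 - 1 + 1e-16      # same terms, reordered
1e-16
>>> import math
>>> math.fsum([1, 1e-16, -1])   # exact summation, order doesn't matter
1e-16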

Want More Gereshes?

If you want to receive new Gereshes blog posts directly to your email when they come out, you can sign up for that here!

Don’t want another email? That’s OK, Gereshes also has a Twitter account and a subreddit!