a = a + 1.0, revisited

Sun Dec 22 08:02:45 AEST 1985

The purpose of this note is to illustrate that some issues are much
more complicated than they appear on the surface.  Consider the
following C statement:

	a = a + 1.0;

and the superficially similar statement:

	a = a + b;

In these examples, a and b are presumed to be of type float, and
1.0 is (by definition) of type double.  The question:  is it
allowable for a compiler to quietly substitute a single-precision
addition for the double-precision addition that the language
definition requires?  In other words:  will the results always be
equivalent both ways?

To explore this, I need to define a few terms and assume a few things
about the architecture.  First of all, realize that any floating-point
value is the exact representation of some rational number, and that
the sum of any two rationals is a rational.  However, this sum may
not be capable of being represented exactly as a floating-point
value, even if both operands started out that way.

Thus, given a rational number, we need a way of deciding how to
represent it.  In general, we will choose one of the two nearest
representable neighbors.  Which one we choose is determined by the
rounding rule of the implementation, and for a good implementation
will be unique for any particular value being rounded.

If some process gives the correctly rounded value of the exact mathematical
result, I'll call the result of that process "correct."

Now, let's assume that both single- and double-precision addition on our
machine give correct results.  We then want to know:  will the two
examples above give correct results in both single and double precision?
If not, which one is correct?

It is straightforward that doing the addition in single gives correct
results in both cases.  The second case (a = a + b) follows immediately
from the definition of "correct" and the assumption that single-precision
addition gives correct results.  The first case is true given the
further assumption that 1.0 can be exactly represented in single
precision, an assumption that is true of every floating-point architecture
I have ever seen.

But what about the double-precision case?  Will (double)a + (double)b
always have the same value as (single)a + (single)b?  What about
(single)((double)a + (double)b)?

It's the last one that is the toughie, because what is really happening
with the double-precision case is that an exact mathematical result is
being formed which is then rounded to double and then rounded a second
time to single.  If we knew that this double rounding was always
harmless, then we could say with confidence that the result will be the
same either way.  But is it harmless?

The answer is: NO, double rounding is NOT harmless.  Consider a result
to be rounded twice with the default IEEE rounding rule:
round to nearest, and round up in case of ties.  Apply this rule to, say:

	.....1|01111111111|11...

Here, the ... represent uninteresting fraction bits and the | represent
the two rounding points.

If we round this value directly to single, the 0 after the leftmost
cut point clearly indicates rounding down.  However, if we round to
double first, we round up, which overflows into the 0 making it a 1.
This generates a tie in the second rounding, and the round-to-even rule
says we must round up.

So here is an example of a case where rounding directly to single
gives different results from rounding first to double and then to
single.

The next questions are these:  (a) is it possible to generate such
a result by adding two single-precision values?  (b) is it possible
to generate such a result if one of the operands is 1 and the other
is a single-precision value?

Ansure:  I'm not sure yet -- I have to think about it some more.