no noalias not negligible: more red herrings, conclusion unchanged
David Hough
dgh%dgh at Sun.COM
Fri May 27 10:02:10 AEST 1988
My previous posting on this subject was slightly in error. It turns out
that a properly crafted C source will fool the intermediate optimizer
into thinking it's Fortran and loop unrolling will occur, at least
for SunOS 4.0.
However, the lack of noalias information is propagated to the instruction
scheduler, which steadfastly refuses to schedule loads of dx[i+1] before
stores of dy[i]. Consequently, overall performance is identical between
the following versions of the ROLLED source: the first results in unrolled
object code; the second, and most other equivalent forms, do not:
daxpy(n, da, dx, dy)
double dx[], dy[], da;
int n;
{
    int i;

    i = 0;
    do {
        dy[i] = dy[i] + da * dx[i];
        i++;
    } while (i < n);
}
daxpy(n, da, dx, dy)
double dx[], dy[], da;
int n;
{
    int i;

    i = 0;
    do {
        dy[i] = dy[i] + da * dx[i];
    } while (++i < n);
}
Niels Jørgen Kruse suggested hand unrolling; the following is an
example of the idea (add cleanup code for the leftover element if n is odd):
daxpy(n, da, dx, dy)
double dx[], dy[], da;
int n;
{
    int i;
    double a, b;

    i = 0;
    do {
        a = dy[i] + da * dx[i];
        b = dy[i+1] + da * dx[i+1];
        dy[i] = a;
        dy[i+1] = b;
        i += 2;
    } while (i < n);
}
But this improves performance only slightly (850 -> 900), because of
another important effect: optimizers do much better work on simple loops
than on clever ones. The particular Sun release in question generates
some unnecessary fmove instructions for this loop. Maybe that will get
fixed in the future, but it will always be a good idea to keep your
programs as simple as possible if you are going to depend on optimizers.
It's also worth noting that several commentators referred to this as a
problem with vector machines, so that for instance PC programmers may have
concluded that it was irrelevant to their careers. However, the
measurements above were taken on a Sun-4/260, which is not a vector
machine but does allow the integer unit, the floating-point adder, and the
floating-point multiplier to operate in parallel in certain circumstances,
which arise frequently when the instruction scheduler does its job well.
It's to be expected that floating-point units of comparable complexity
will be common in PC-class machines soon; the Weitek 1167 is an example
of such a unit intended to be installed in 80386 systems. It has the
same floating-point ALU and multiplier as the Sun-4/260. Don't assume
that 80386-based PCs will consequently have floating-point throughput
comparable to the Sun-4, however; there is a small matter of memory
bandwidth to contend with, which becomes more and more noticeable as the
floating-point hardware gets faster, until at the top end it is the
main issue and the floating-point hardware is a minor distraction.
So the issues of loop unrolling and instruction scheduling (and
identifying unaliased operands) will soon
become important to everybody interested in getting the most out of
even simple scalar-oriented systems.
David Hough
dhough at sun.com
na.hough at na-net.stanford.edu
{ucbvax,decvax,decwrl,seismo}!sun!dhough