Vectorizing C compiler for the Cray
Eugene D. Brooks III
brooks at lll-crg.ARPA
Sun Apr 7 16:45:50 AEST 1985
> It's also not clear that a "vectorizing C compiler" makes
> much sense, given the form of typical C code.
For the traditional uses of C (operating systems programming, compilers,
text editors, and so on) a vectorizing C compiler indeed does not make much
sense. But C is not restricted to those uses; it is a very good language
for numerical applications (modulo the float-->double problem, illustrated
below, that is being fixed in the ANSI standard and has already been fixed
in every compiler I have used for numerical work). I have been using C for
numerical programming for 4 years now. Fortran used to be the only language
I used, and I have not touched it in 4 years; I have even forgotten some of
its keywords. I am not an isolated case: there is a small but growing
community of scientists who are using C instead of Fortran for their work.
The data structures that can be created in C make the layout of a typical
program much cleaner and more easily understood.
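To make the float-->double problem concrete: under the K&R rules every
float operand is widened to double in an expression and every float
argument is passed as a double, so single-precision code pays
double-precision cost everywhere. A minimal illustration:

float x, y, z;
z = x * y;      /* K&R: x and y are widened to double, multiplied in
                   double, and the product narrowed back to float; the
                   ANSI draft lets the multiply be done in float */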
It is clear that for C to be used for numerical work on supercomputers,
one must have a vectorizing C compiler, just as is the case for Fortran.
Consider the code below.
float **a, **b, **c;
int dim;
int i, j, k;

for(i = 0; i < dim; i += 1) {           /* a = b * c, square matrices */
    for(j = 0; j < dim; j += 1) {
        a[i][j] = 0.0;
        for(k = 0; k < dim; k += 1) {
            a[i][j] += b[i][k] * c[k][j];
        }
    }
}
Considering the inner loop, one can see that a vector dot product is being
formed. The fetch of b[i][k] is a stride 1 vector fetch. The fetch of
c[k][j] is a gather fetch using an offset of j from the array of pointers
c[k]. This loop will vectorize on the Cray X-MP/48, the Cray 2, the CDC 205,
and the Convex C-1, among others. So why not have a vectorizing C compiler
available to the users of C on these machines? The only valid argument
against the use of C for numerical work on supercomputers is the lack of a
vectorizing compiler.
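For concreteness, the stride claims above assume the usual layout in which
each of a, b, and c is an array of dim row pointers, each row a contiguous
block of dim floats. A minimal allocation sketch, in the pre-ANSI style of
the rest of this post (the name make_matrix is mine and error checking is
omitted):

extern char *malloc();

float **make_matrix(dim)
int dim;
{
    float **m;
    int i;

    m = (float **) malloc(dim * sizeof(float *));
    for(i = 0; i < dim; i += 1)
        m[i] = (float *) malloc(dim * sizeof(float));
    return m;
}

With rows allocated this way, stepping k walks b[i][k] through consecutive
addresses, while c[k][j] must fetch a different row pointer c[k] for each
k before it can index off it, which is why that fetch is a gather.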
A "vectorizing" compiler, where one means that the compiler unrolls loops
to reduce the jump overhead and picks the best possible way to get the
work done on a given machine is even useful on a scalar machine such as a VAX.
As an example consider the following trick that I have in fact used frequently
to get vector operations to run is fast as is possible on a VAX.
void vdadd3(a, b, c, dim)
double *a, *b, *c;
int dim;
{
    /* N is the unrolling factor, e.g. #define N 16. */
    /* Leaving out the code to take care of the dim%N extra elts;
       see the sketch below. */
    dim /= N;
    do {
        *a++ = *b++ + *c++;
        *a++ = *b++ + *c++;
        /* ... N-4 more copies of the same statement ... */
        *a++ = *b++ + *c++;
        *a++ = *b++ + *c++;
    } while(--dim > 0);
}
Just how big you can make N on a VAX is determined by how big the cache is.
At around N == 16 the instruction fetches start causing cache misses.
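For completeness, the dim % N cleanup that the comment in vdadd3 elides
could go just before the division, along these lines (a sketch, not the
only way to arrange it):

    int extra;                  /* goes with the other declarations */

    extra = dim % N;            /* elements that don't fill a full pass */
    while(extra-- > 0)
        *a++ = *b++ + *c++;
    if((dim /= N) > 0)
        do {
            /* ... the N unrolled adds ... */
        } while(--dim > 0);

The guard on the do/while matters: without it the loop makes one full pass
of N adds even when fewer than N elements remain.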
Wouldn't it have been nicer to have simply written
void vdadd3(a, b, c, dim)
double *a, *b, *c;
int dim;
{
    int i;

    for(i = 0; i < dim; i += 1) {
        a[i] = b[i] + c[i];
    }
}
and have the compiler pick up the vector operation and put in all the
pointer craziness and the loop unrolling! In this context a vectorizing
compiler is useful even on a scalar machine.