Vectorizing C compiler for the Cray
Eugene D. Brooks III
brooks at lll-crg.ARPA
Sun Apr 7 16:45:50 AEST 1985
> It's also not clear that a "vectorizing C compiler" makes
> much sense, given the form of typical C code.
For the traditional uses of C (operating systems programming, compilers,
text editors, and so on) a vectorizing C compiler indeed does not make much
sense. But C is not restricted to those uses; it is a very good language
for numerical applications (modulo the float-->double problem, illustrated
below, that is being fixed in the ANSI standard and has already been fixed
in every compiler I have used for numerical work). I have been using C for
numerical programming for 4 years now. Fortran used to be the only language
I used, and I have not touched it in 4 years; I have even forgotten some of
its keywords. I am not an isolated case: there is a small but growing
community of scientists who are using C instead of Fortran for their work.
The data structures that can be created in C make the layout of a typical
program much cleaner and more easily understood.
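To make the float-->double problem concrete: under the K&R rules every
float operand is widened to double in an expression and every float
argument is passed as a double, so single-precision code pays
double-precision cost everywhere. A minimal illustration:

float x, y, z;
z = x * y;      /* K&R: x and y are widened to double, multiplied in
                   double, and the product narrowed back to float; the
                   ANSI draft lets the multiply be done in float */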
It is clear that for C to be used for numerical work on supercomputers,
one must have a vectorizing C compiler, just as is the case for Fortran.
Consider the code below.
float **a, **b, **c;
int dim;
int i, j, k;

for(i = 0; i < dim; i += 1) {           /* a = b * c, square matrices */
    for(j = 0; j < dim; j += 1) {
        a[i][j] = 0.0;
        for(k = 0; k < dim; k += 1) {
            a[i][j] += b[i][k] * c[k][j];
        }
    }
}
Considering the inner loop, one can see that a vector dot product is being
formed. The fetch of b[i][k] is a stride 1 vector fetch. The fetch of
c[k][j] is a gather fetch using an offset of j from the array of pointers
c[k]. This loop will vectorize on the Cray X-MP/48, the Cray 2, the CDC 205,
and the Convex C-1, among others. So why not have a vectorizing C compiler
available to the users of C on these machines? The only valid argument
against the use of C for numerical work on supercomputers is the lack of a
vectorizing compiler.
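For concreteness, the stride claims above assume the usual layout in which
each of a, b, and c is an array of dim row pointers, each row a contiguous
block of dim floats. A minimal allocation sketch, in the pre-ANSI style of
the rest of this post (the name make_matrix is mine and error checking is
omitted):

extern char *malloc();

float **make_matrix(dim)
int dim;
{
    float **m;
    int i;

    m = (float **) malloc(dim * sizeof(float *));
    for(i = 0; i < dim; i += 1)
        m[i] = (float *) malloc(dim * sizeof(float));
    return m;
}

With rows allocated this way, stepping k walks b[i][k] through consecutive
addresses, while c[k][j] must fetch a different row pointer c[k] for each
k before it can index off it, which is why that fetch is a gather.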
A "vectorizing" compiler, where one means that the compiler unrolls loops
to reduce the jump overhead and picks the best possible way to get the
work done on a given machine is even useful on a scalar machine such as a VAX.
As an example consider the following trick that I have in fact used frequently
to get vector operations to run is fast as is possible on a VAX.
void vdadd3(a, b, c, dim)
double *a, *b, *c;
int dim;
{
    /* N is the unrolling factor, e.g. #define N 16. */
    /* Leaving out the code to take care of the dim%N extra elts;
       see the sketch below. */
    dim /= N;
    do {
        *a++ = *b++ + *c++;
        *a++ = *b++ + *c++;
        /* ... N-4 more copies of the same statement ... */
        *a++ = *b++ + *c++;
        *a++ = *b++ + *c++;
    } while(--dim > 0);
}
Just how big you can make N on a VAX is determined by how big the cache is.
At around N == 16 the instruction fetches start causing cache misses.
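For completeness, the dim % N cleanup that the comment in vdadd3 elides
could go just before the division, along these lines (a sketch, not the
only way to arrange it):

    int extra;                  /* goes with the other declarations */

    extra = dim % N;            /* elements that don't fill a full pass */
    while(extra-- > 0)
        *a++ = *b++ + *c++;
    if((dim /= N) > 0)
        do {
            /* ... the N unrolled adds ... */
        } while(--dim > 0);

The guard on the do/while matters: without it the loop makes one full pass
of N adds even when fewer than N elements remain.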
Wouldn't it have been nicer to have simply written
void vdadd3(a, b, c, dim)
double *a, *b, *c;
int dim;
{
    int i;

    for(i = 0; i < dim; i += 1) {
        a[i] = b[i] + c[i];
    }
}
and have the compiler pick up the vector operation and put in all the
pointer craziness and the loop unrolling! In this context a vectorizing
compiler is useful even on a scalar machine.