SGI GL matrix performance

Tue Apr 30 02:16:03 AEST 1991

In article <15407 at helios.TAMU.EDU>, jamie at archone.tamu.edu (James Price) writes:
> Has anyone done any benchmarking of the SGI matrix functions?  I was curious
> and wrote the program included below.  It does a number of 4x4 matrix 
> multiplies, first using software, and then using the geometry pipeline 
> functions (loadmatrix(), multmatrix(), getmatrix()).  
> 
> Here are some typical results:
> 
> 10000 iterations on fritz, with GL version: GL4DGT-3.3
> 
> Software - no optimization:     3.349 sec.
> Software - some optimization:   1.130 sec.
> Software - more optimization:   0.910 sec.
> Hardware - preserve CTM:        2.379 sec.
> Hardware - destroy CTM:         2.289 sec.
> Hardware - abandon results:     0.580 sec.
> 
> The actual hardware multiplication is fast (0.580 sec/10000 multiplies) 
> but if we call getmatrix() to access the results, it slows things down 
> by around 400% (to 2.379 sec/10000 multiplies).  I was hoping to use the 
> speed of the hardware for my own matrix needs, but it looks like the 
> getmatrix() call is simply too costly.  Is there a better way?

It's possible to do a complete 4x4 matrix multiply in under 310 cycles on
a MIPS processor (in single precision).  At 33 MHz this works out to over
100,000 matrix multiplies per second, or under 0.1 sec for your benchmark
above, more than 5 times faster than the hardware!
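
As a rough check of that arithmetic, here is a minimal C sketch.  The
function names are made up for illustration; the only inputs are the
figures quoted in this thread (310 cycles, 33 MHz, 10000 iterations,
0.580 sec for the hardware path).

```c
#include <stdio.h>

/* Matrix multiplies per second for a given clock rate (in Hz) and
 * cost per multiply (in cycles).  Hypothetical helper for this sketch. */
double multiplies_per_second(double clock_hz, double cycles_per_multiply)
{
    return clock_hz / cycles_per_multiply;
}

/* Seconds needed to run n multiplies at that rate. */
double seconds_for(double n, double clock_hz, double cycles_per_multiply)
{
    return n * cycles_per_multiply / clock_hz;
}

/* Prints the estimate for the benchmark discussed in this thread. */
void print_estimate(void)
{
    double clock_hz = 33.0e6;   /* 33 MHz, from the post                */
    double cycles   = 310.0;    /* measured cycles per 4x4 multiply     */
    double n        = 10000.0;  /* iterations in the quoted benchmark   */
    double hw_sec   = 0.580;    /* "Hardware - abandon results" timing  */
    double sw_sec   = seconds_for(n, clock_hz, cycles);

    printf("multiplies/sec: %.0f\n", multiplies_per_second(clock_hz, cycles));
    printf("time for %.0f multiplies: %.3f sec\n", n, sw_sec);
    printf("ratio to hardware: %.1fx\n", hw_sec / sw_sec);
}
```

Calling print_estimate() from main() reports the rate, the time for
10000 multiplies, and the ratio against the 0.580 sec hardware figure.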

I think one reason your software benchmark ran so slowly is that you may
have forgotten to compile with -float, in which case all floating point
math was done in double precision.
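
For reference, a plain single-precision 4x4 multiply might look like the
sketch below.  This is not the benchmark code from the original post,
and the name mat4_mul is made up for illustration.  The point is that
every operand and the accumulator are declared float; whether the
arithmetic actually stays in single precision still depends on the
compiler and its flags, as noted above.

```c
/* out = a * b for 4x4 matrices, with all values declared as float.
 * out must not be the same array as a or b.
 * "mat4_mul" is a hypothetical name used only in this sketch. */
void mat4_mul(float out[4][4], float a[4][4], float b[4][4])
{
    int i, j, k;

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            float sum = 0.0f;            /* single-precision accumulator */
            for (k = 0; k < 4; k++)
                sum += a[i][k] * b[k][j];
            out[i][j] = sum;
        }
    }
}
```

Timing this routine with and without -float would show how much of the
software time is due to double-precision promotion.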

The theoretical limit for a matrix multiply would be 64*4 = 256 cycles
plus a few.  Of course, reaching it requires writing very careful
assembler code in order to overlap all the adds and loads/stores with
the 4-cycle multiplies.  So I suspect that you could improve upon the
310 cycles I actually measured by about 10%.
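
The 64 in that bound is the number of multiplies in a straightforward
4x4 product (16 output elements, 4 multiplies each).  A small C sketch
that counts the operations and computes the bound follows; the struct
and function names are made up for illustration, and the bound assumes,
as above, that each multiply occupies 4 cycles and everything else is
fully overlapped.

```c
/* Operation counts for a straightforward 4x4 matrix multiply.
 * "op_count" and the functions below are hypothetical names. */
struct op_count {
    int multiplies;
    int adds;
};

/* Walks the same triple loop as the multiply and tallies the
 * floating-point operations instead of performing them. */
struct op_count count_mat4_ops(void)
{
    struct op_count c = {0, 0};
    int i, j, k;

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            for (k = 0; k < 4; k++) {
                c.multiplies++;          /* one product per term         */
                if (k > 0)
                    c.adds++;            /* 3 adds to sum 4 products     */
            }
    return c;
}

/* Lower bound in cycles if only the multiplies cost time. */
int lower_bound_cycles(int multiplies, int cycles_per_multiply)
{
    return multiplies * cycles_per_multiply;
}
```

This gives 64 multiplies and 48 adds, and a bound of 256 cycles.  That
is roughly 17% below the measured 310, so an improvement of about 10%
leaves some allowance for the "few" extra cycles and imperfect overlap.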

--------------------
	Gary Tarolli