SGI GL matrix performance
Gary Tarolli
tarolli at westcoast.esd.sgi.com
Tue Apr 30 02:16:03 AEST 1991
In article <15407 at helios.TAMU.EDU>, jamie at archone.tamu.edu (James Price) writes:
> Has anyone done any benchmarking of the SGI matrix functions? I was curious
> and wrote the program included below. It does a number of 4x4 matrix
> multiplies, first using software, and then using the geometry pipeline
> functions (loadmatrix(), multmatrix(), getmatrix()).
>
> Here are some typical results:
>
> 10000 iterations on fritz, with GL version: GL4DGT-3.3
>
> Software - no optimization: 3.349 sec.
>
> Software - some optimization: 1.130 sec.
>
> Software - more optimization: 0.910 sec.
>
> Hardware - preserve CTM: 2.379 sec.
>
> Hardware - destroy CTM: 2.289 sec.
>
> Hardware - abandon results: 0.580 sec.
>
>
> The actual hardware multiplication is fast (0.580 sec/10000 multiplies)
> but if we call getmatrix() to access the results, it slows things down
> by around 400% (to 2.379 sec/10000 multiplies). I was hoping to use the
> speed of the hardware for my own matrix needs, but it looks like the
> getmatrix() call is simply too costly. Is there a better way?
It's possible to do a complete 4x4 matrix multiply in under 310 cycles on
a MIPS processor (in single precision). At 33 MHz this works out to over
100,000 matrix multiplies per second, or about 0.10 sec for your benchmark
above, more than 5 times faster than the hardware!
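
For anyone who wants to check the arithmetic, here is a small sketch in C.
The function names are made up for illustration; the only inputs are the
figures quoted in this thread (33 MHz clock, 310 cycles per multiply,
10000 iterations, 0.580 sec for the hardware path).

```c
#include <assert.h>

/* Hypothetical helper: matrix multiplies per second, given the clock
 * rate in Hz and the cycle cost of one 4x4 multiply. */
double multiplies_per_second(double clock_hz, double cycles_per_multiply)
{
    assert(cycles_per_multiply > 0.0);
    return clock_hz / cycles_per_multiply;
}

/* Hypothetical helper: seconds needed for `count` multiplies. */
double seconds_for(double count, double clock_hz, double cycles_per_multiply)
{
    return count / multiplies_per_second(clock_hz, cycles_per_multiply);
}
```

Calling multiplies_per_second(33e6, 310.0) gives roughly 106,000 per second,
and seconds_for(10000.0, 33e6, 310.0) gives roughly 0.094 sec, against the
0.580 sec measured for the hardware multiply alone.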
I think one of the reasons your software benchmark ran so slowly is
that you might have forgotten to compile with -float (and thus all floating
point math was done in double precision).
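
As a rough illustration of what "single precision" means here, this is a
minimal sketch of a 4x4 multiply that keeps every operand and the
accumulator in float. It is not the code from the original benchmark and
not the tuned routine measured above; the name mat4_mul and the row-major
layout are assumptions. Whether the compiler really emits single-precision
instructions still depends on the compiler and its flags, as noted above.

```c
#include <assert.h>

/* Hypothetical routine: c = a * b for 4x4 row-major matrices.
 * All storage and the accumulator are float, so no double constants
 * or variables are introduced by the source itself.
 * c must not be the same array as a or b. */
void mat4_mul(float c[4][4], float a[4][4], float b[4][4])
{
    int i, j, k;

    assert(c != a && c != b);   /* the result would overwrite its inputs */

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            float sum = 0.0f;   /* float literal, not a double 0.0 */
            for (k = 0; k < 4; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}
```

The plain triple loop is only meant to show the data types; a fast version
would unroll it and schedule the loads, adds and multiplies by hand.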
The theoretical limit for a 4x4 matrix multiply would be 64*4 = 256 cycles
+ a few (64 multiplies at 4 cycles each).
Of course, this requires writing very careful assembler code in order
to overlap all the adds and load/stores with the 4 cycle multiplies.
So I suspect that you could improve upon the 310 number I actually
measured by about 10%.
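
The operation counts behind that limit can be sketched as follows. The
function names are hypothetical, and the model is deliberately crude: it
assumes multiplies issue one after another at 4 cycles each and that every
add, load and store is hidden behind them.

```c
#include <assert.h>

/* An n x n matrix product has n*n outputs, each a dot product of length n. */
int multiply_count(int n) { return n * n * n; }        /* 64 for n = 4 */
int add_count(int n)      { return n * n * (n - 1); }  /* 48 for n = 4 */

/* Cycle bound if only the multiplies cost time (assumption: everything
 * else is fully overlapped with them). */
int multiply_bound_cycles(int n, int mul_cycles)
{
    assert(n > 0 && mul_cycles > 0);
    return multiply_count(n) * mul_cycles;
}

/* Fraction of the measured cycle count that lies above the bound. */
double headroom(double measured_cycles, double bound_cycles)
{
    return (measured_cycles - bound_cycles) / measured_cycles;
}
```

With n = 4 and 4-cycle multiplies the bound is 256 cycles, so
headroom(310.0, 256.0) comes to about 0.17. That is the most this model
allows; the "+ a few" cycles of unavoidable overhead eat into it.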
--------------------
Gary Tarolli