SGI GL matrix performance
Jim Barton
jmb at patton.wpd.sgi.com
Thu May 2 11:03:02 AEST 1991
In all cases you must run the benchmark several times and average the results to get a true
performance number. The reasons are many and varied, but some of the more
significant ones are:
1) When you first run a program, it takes a while to fill up the processor
cache. Depending on context switching, etc., the cache can be more or
less effective at various times during the run.
2) When you first execute a program, IRIX must read it from disk. However,
IRIX is fanatical about caching disk blocks in memory, and it is quite
likely that the second execution just picks up the pages in memory, and
execution time could be significantly faster. This happens even when the
timing is built into the program, since executables are almost always
demand paged.
3) The way in which real memory pages are allocated to the process has a big
impact on performance because the processor caches are direct mapped.
For example, on a system with a 64Kb cache, real memory references
modulo 64Kb will map to the same cache location. IRIX tries its best to
allocate physical memory in a linear fashion, so that the probability of
cache thrashing is minimized, but in the final analysis the application
memory access pattern will determine the performance.
4) The 4D/20 and 4D/25 have a 1-deep write buffer. By default, C does all
floating point in double precision (two words). Thus, when the compiler
stores a double-precision value, the first word is buffered, but
the second stalls the processor until the first write has been retired.
Single precision floats (-float flag to the compiler) will eliminate this
problem (unless you really need double precision). The POWERSeries
machines have a 4-deep write buffer, while the 4D/35 has an 8-deep write
buffer.
Benchmarking is Art, not Science. I suspect it always will be, despite the
best efforts of SPEC, etc.
-- Jim Barton
Silicon Graphics Computer Systems
jmb at sgi.com
More information about the Comp.sys.sgi mailing list