Sun 3 vs uVAXII floating point speed....

Sat Jul 16 06:06:57 AEST 1988

For what it's worth, here are some benchmarks I did for one of my
programs.  I list the total time, the time in the "tweak" (number-
crunching) subroutine, and the time in the "io" (heavy on io) subroutine,
for two separate runs, one of which is more io-intensive than the other.

The VAX was an 11/780 with fpa, running ULTRIX.  The code was written in
Fortran, and compiled & run with f77 on the VAX and Sun, with fc on the
Convex C1.  The different -O levels for the Convex refer to different
levels of optimization (explained in conclusion (2) below).

Separate benchmarks of a different kind indicated that the uVAX-II is about
0.8 times as fast as an 11/780 with fpa on ordinary floating point
arithmetic.  Lots depends on the compiler, though.  A previous posting
pointed out that DEC's own Fortran compiler, previously available only
under VMS, is now available under ULTRIX, and that Sun now has a
DEC-compatible Fortran compiler, which people say also produces better code
than their version of f77 used to.

I advise you to skip the data for now and come back to it after reading
the conclusions at the bottom.

Comparison of times on the VAX, Sun3 and Convex for two typical random tweak
runs:
    l2-1000-0.0:    relatively high io/compute ratio
    l2-150-2.0all:  relatively low io/compute ratio

NUMBERS:

l2-1000-0.0             Sun-3     Sun-3     Convex    Convex    Convex
===========   VAX       -68881    -fpa      -O0       -O1       -O2
TIMES (cpu-s)
total:        2766      2107      1199       325       302       300
tweak:        1679      1824       950       208       184       174
io:           1029       226       224       108       110       119

TOTAL
SPEED:           1      1.31      2.31      8.51      9.16      9.22
(VAX = 1)

*************************************************************************
*************************************************************************

l2-150-2.0all           Sun-3     Sun-3     Convex    Convex    Convex
===========   VAX       -68881    -fpa      -O0       -O1       -O2
TIMES (cpu-s)
total:        2339      2734      1287       273       246       229
tweak:        1656      2266      1062       205       180       162
io:            161        34        34        16        17        17

TOTAL
SPEED:           1      0.86      1.82      8.57      9.51     10.21
(VAX = 1)

*************************************************************************
*************************************************************************
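
The TOTAL SPEED rows are simply the VAX total time divided by each
machine's total time.  The following is a minimal Python sketch, not part
of the original benchmark, that recomputes them from the "total:" rows
above; the names MACHINES, TOTALS and relative_speeds are made up for
illustration.

```python
# Total cpu-seconds copied from the "total:" rows of the two tables.
MACHINES = ["VAX", "Sun-3 -68881", "Sun-3 -fpa",
            "Convex -O0", "Convex -O1", "Convex -O2"]
TOTALS = {
    "l2-1000-0.0":   [2766, 2107, 1199, 325, 302, 300],
    "l2-150-2.0all": [2339, 2734, 1287, 273, 246, 229],
}


def relative_speeds(times):
    """Speed relative to the first entry (the VAX): t_vax / t_machine."""
    vax = times[0]
    return [round(vax / t, 2) for t in times]


for run, times in TOTALS.items():
    print(run)
    for name, speed in zip(MACHINES, relative_speeds(times)):
        print(f"  {name:<14} {speed:5.2f}")
```

A ratio below 1 (as for the Sun-3 -68881 on l2-150-2.0all) means the
machine took longer than the VAX for that run.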

CONCLUSIONS (for THIS PROGRAM!!!):

    (1) Sun-3 vs. VAX:  With -68881, Sun is 4-5 times faster on IO, about
        0.8 times as fast on single-precision arithmetic.  (I know through
        other tests that it's several times faster on double-precision.)
        With -fpa (Weitek floating point board), same IO comparison holds,
        but Sun is about 1.7 times the speed of the VAX in single-precision
        arithmetic.
    (2) Convex vs. VAX: with full optimization, about 9 times faster than
        the VAX on IO, about 10 times faster on single-precision arithmetic.
        Vectorization (-O2) gives a 20% speed-up over only local scalar
        optimization (-O0);  full scalar optimization (-O1) gives a 10%
        speed-up over only local.
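
The ratios quoted in these conclusions can be rechecked from the tweak and
io rows of the tables.  The sketch below is only an illustration in Python,
not something used in the original runs; the dictionary layout and the
speedup() helper are hypothetical names of my own.

```python
# (tweak, io) cpu-seconds per machine, copied from the two tables.
TIMES = {
    "l2-1000-0.0": {
        "VAX": (1679, 1029), "Sun-68881": (1824, 226), "Sun-fpa": (950, 224),
        "Convex-O0": (208, 108), "Convex-O1": (184, 110),
        "Convex-O2": (174, 119),
    },
    "l2-150-2.0all": {
        "VAX": (1656, 161), "Sun-68881": (2266, 34), "Sun-fpa": (1062, 34),
        "Convex-O0": (205, 16), "Convex-O1": (180, 17),
        "Convex-O2": (162, 17),
    },
}


def speedup(run, machine, baseline="VAX"):
    """Return (tweak, io) speed of `machine` relative to `baseline`."""
    base_tweak, base_io = TIMES[run][baseline]
    tweak, io = TIMES[run][machine]
    return base_tweak / tweak, base_io / io


for run in TIMES:
    print(run)
    # Each machine against the VAX: arithmetic (tweak) and IO (io).
    for machine in ("Sun-68881", "Sun-fpa", "Convex-O2"):
        t, i = speedup(run, machine)
        print(f"  {machine:<10} vs VAX:       tweak x{t:.2f}  io x{i:.2f}")
    # Effect of the Convex optimization levels on the tweak time alone.
    for level in ("Convex-O1", "Convex-O2"):
        t, _ = speedup(run, level, baseline="Convex-O0")
        print(f"  {level:<10} vs Convex-O0: tweak x{t:.2f}")
```

The "about" figures in the conclusions are rough averages over the two
runs; for example, the Sun -68881 tweak ratio comes out near 0.92 for
l2-1000-0.0 and near 0.73 for l2-150-2.0all.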

NOTES:

    (1) The program is (a) poorly written, and (b) not well-suited in its
        present form to automatic vectorization.  As such it is probably
        typical.  (On the other hand, it works....)
    (2) Estimates of IO and floating-point speeds were made from the
        io and tweak times, which are dominated by these kinds of operations,
        respectively.
    (3) VAX is the 11/780-fpa at Columbia Biology (cubsvax);  Sun3 -68881
        refers to the 68881 floating point processor.  This was also at
        Columbia Biology (ramon).  Sun3 -fpa was a machine at Sun in Fort
        Lee, NJ.  Convex was cuhhca at Howard Hughes Institute, Columbia
        Medical School.  See conclusion (2) above for what the -O options
        mean.
    (4) This particular program probably does not easily lend itself to great
        speed-up through vectorization, since the operations tend to be on
        fairly short vectors -- about 40 long in these examples, perhaps
        about 120 long in the "best" case, these being the numbers of atoms
        in the loop being repeatedly randomly generated.  With difficulty,
        it might be possible to rewrite the program so as to generate many
        loops together, and thereby deal with longer vectors.  Less drastic
        rewrites might conceivably speed things up by a factor of 1.5 to 2
        overall (just a guess, based on the speed-up of those portions of
        the code where everything vectorized).
-- 
*******************************************************************************
Peter S. Shenkin,    Department of Biological Sciences,    Columbia University,
New York, NY   10027         Tel: (212) 280-5517 (work);  (212) 829-5363 (home)
shenkin at cubsun.bio.columbia.edu    shenkin%cubsun.bio.columbia.edu at cuvmb.BITNET