Fortran vs. C for numerical work (SUMMARY)

Sat Dec 1 02:21:09 AEST 1990


>From: brnstnd at kramden.acf.nyu.edu (Dan Bernstein)
>
>In article <7318 at lanl.gov> jlg at lanl.gov (Jim Giles) writes:
>> The Crays have an integer multiply unit for addresses.  This mult
>> takes 4 clocks.
>
>But isn't that only for the 24-bit integer? If you want to multiply full
>words you have to (internally) convert to floating point, multiply, and
>convert back.
>
>I have dozens of machines that can handle a 16MB computation; I'm not
>going to bother with a Cray for those. The biggest advantage of the Cray
>line (particularly the Cray-2) is its huge address space.
>
>So what's the actual time for multiplying integers?
 
>---Dan
 
    The time for multiplying 32-bit integers on the YMP is 5 clock
    periods. Normally YMP addresses are interpreted as addresses of
    64-bit words, not of bytes. On the previous CRAY models, 24 bits
    are used to address 16 Mwords, not 16 Mbytes. (This saves 3 wires
    per address data path.) As most work on CRAYs is done on words
    (numerical) or packed-character strings, multiplication of longer
    integers is not provided for in the hardware.
 
    Personally I would like to have long integer support. The CRAY
    architecture supports a somewhat strange multiplication method
    which will yield a 48-bit product if the input words have a total
    length of no more than 48 bits. That is, one can multiply two
    24-bit quantities, a 16-bit and a 32-bit quantity, a 13-bit and a
    35-bit quantity, or shorter things. This operation takes two
    shifts and one multiply. The shifts may be overlapped, so the time
    is 3 clocks for the two shifts and 7 clocks for the multiply if
    the shift counts are known, or 4 clocks for the shifts and 7
    clocks for the multiply if the shift counts are variable. It's a
    bit of a pain to program, but the compiler does it for us.

    Another form of integer multiplication is sometimes used: the
    integers are converted to floating point, multiplied, and the
    result converted back to integer. This method fails if an
    intermediate value exceeds 46 bits of significance. The time is 2
    clocks for producing a "magic" constant, 3 clocks each for two
    integer adds (reduces to 4 total because of pipelining), 6 clocks
    each for two floating adds (reduces to 6 because of pipelining
    overlap with the integer add), 7 clocks for the floating multiply,
    6 clocks for another floating add, and 6 clocks for another
    integer multiply. Total is 29 clocks if no other operations may be
    pipelined with these operations. If the quantities being
    multiplied are addresses, some of the above is eliminated,
    bringing the result down to 20 clocks. Still, this is not as good
    as the floating point performance. All of the above may be
    vectorized, which would result in 3 clocks per result in vector
    mode.