ANSI memmove(), optimized for GnuC/68010
Karl Botts
kdb at chinet.chi.il.us
Tue Sep 26 11:05:28 AEST 1989
/*
The style of this file has been inspired by several references I have
encountered in recent months to what has been called, I suspect
unfortunately, "literate programming". One article was in JACM by D. E.
Knuth. The general idea is to produce files which are both an English
language document and compilable source files. Mr. Knuth is working on
some specialized tools with an eye toward research papers and
educational expositions. I like the general idea a great deal and
thought I would try it out on the less formal vehicle of a UseNet
article, without using special tools.
Karl D. Botts 5480 Hyde Park Blvd. Chicago IL, 60615
{gargoyle|mcdchg|att}!chinet!kdb
*/
/*
The result of compiling this file is an ANSI C memmove(), for systems
that don't have memmove() and have a memcpy() that doesn't handle
overlaps, i.e., SysV.2 ported by Convergent Technologies, as found on my
AT&T 3b1. It is specially optimized for GnuC and an mc68k, as described
in detail in the internal comments. It compiles with PCC but the result is
not as fast. The internal comments also describe some peculiarities of the
Gnu C optimizer discovered while tinkering with this code.
All cases are handled, including any of the "null copy operations" --
dest == src || n == 0. Note that in the first case the block will indeed
get copied over itself, but this can do nothing except burn cpu unless, I
suppose, something in the block is volatile.
Note also that n is expected to be unsigned; if size_t is signed (as it
is by default on my SysV) and you pass a negative number to this routine it
will interpret the number as unsigned and proceed. This will cause a core
dump but I believe it is correct.
I wound up doing a bunch of timing tests on this code; some results and
some remarks are appended to this file.
/*
/*
By the way, I have seen usenet code -- pcomm for one -- which does
#define memmove memcpy
when configured for the 3b1 and comments that memcpy() handles overlaps
and so can be used as memmove(). This is variant (as is stated by
memory(3C)) and is not true on my system (as determined the hard way),
although pcomm seems to work anyhow. I don't use it regularly.
*/
#include <stdio.h>
/* The goal here is to get a typedef name "size_t" defined. */
#if __STDC__
# include <stddef.h>
#else
#if defined(SYSV) && defined(PCC)
# include <unistd.h>/* defines size_t as int; see below */
#endif
#endif
/* The goal here is to find a declaration or prototype for
memmove(), in the hope of generating an error message if it doesn't
match the definition in this file. The main thing to worry about is
whether the third arg is signed or unsigned. */
#ifdef KDB
# include <kdb/stdlib.h>
#else
#if __STDC__
# include <stdlib.h>
# include <memory.h>
#endif
#endif
#if __STDC__ || defined(PROTOS)
void *memmove(void *d, void *s, register size_t n)
#else
void *memmove(d, s, n)
void *d;
void *s;
register size_t n;
#endif
{
register long *dl = (long *)d, *sl = (long *)s;
int nstash;
/* The rule for handling overlap is: Always start copying from the end
of the source toward which the destination is offset. If the offset is
zero it doesn't matter.*/
if ( dl > sl ) {
/* This ain't ANSI C (a cast does not form an lvalue) but both SysV.2 PCC
and GnuC accept it, and it's damn convenient. Also optimizes very
nicely.*/
((char *)dl) += n;
((char *)sl) += n;
}
if ( (int)dl & 1 ) {
if ( (int)sl & 1 )
goto byte_aligned;
/* else drop thru */
} else {
if ( ! ((int)sl & 1) )
goto word_aligned;
/* else drop thru */
}
mis_aligned:
/* I don't think I can do much better than the simple bytewise copy. As
written here, Gnu C optimizes the loop into the two instruction
sequence:
L%12:
mov.b -(%a2),-(%a3)
dbra %d0,L%12
which is the special and only copying loop which the 68010 can keep entirely
in the prefetch queue, thus freeing the bus for full-time data and nearly
doubling the speed of otherwise apparently equivalent loops. Therefore I
can't think of any fancy juggling to handle the odd-alignment that
seems likely to turn a profit.
The loop here (or actually, its longword analog, below), should still be
the fastest way to copy a block of memory, but it won't be as much
faster on a 68020 as on a 68010 because equivalent loops with a couple
of extra instructions will still get kept in the prefetch queue. The
same should be true on a 68000 because none of the loops get kept in the
prefetch queue. I don't have a 68020 or a 68000 machine handy so this
paragraph is sheer speculation. */
if ( dl > sl ) {
if ( n-- )
do
*--((char *)dl) = *--((char *)sl);
while ( n-- );
} else {
/* Here's the strange part. If the following loop is written with the
casts-to-lvalue like its complement just above, instead of the local
char * vars, Gnu C doesn't optimize it to the complement of the
seqeunce given above, which would be:
L%12:
mov.b (%a2)+,(%a3)+
dbra %d0,L%12
Instead it produces:
L%16:
mov.l %a3,%a0
addq.w &1,%a3
mov.l %a2,%a1
addq.w &1,%a2
mov.b (%a1),(%a0)
dbra %d0,L%16
I noticed this because it made enough difference to badly skew
some timing tests I was running on a parsing algorithm. Anyhow,
adding the char *temps as here coaxes Gnu C into performing the
desired optimization.
It is worth noting that SysV.2 PCC, which I also tried, did not
make this optimization despite my best attempts to devise
suggestively convoluted loops. The Gnu code is almost twice as fast
as the best PCC could do (on a 68010 -- see above).
The upshot of all this is that the inner loop of the resulting code
(together with the longword loop below) is identical to the main
loops of the standard library memcpy() routine on my 3b1, which I assume
is coded in assembler (but doesn't handle overlap, which is why I had to
write this routine in the first place.) This means that memory copying
functions, one of the last remaining rationalizations for coding library
routines in assembler, will no longer serve as an excuse. It also means
that inline C code to copy small blocks can be significantly faster than
calling memcpy() (it seldom was before, in practice.)
But see the caveats below.*/
register char *dc = (char *)d, *sc = (char *)s;/* casts needed by PCC */
if ( n-- )
do
*dc++ = *sc++;
while ( n-- );
}
return(d);
byte_aligned:
/* both byte aligned -- copy the odd byte */
if ( n-- )
if ( dl > sl )
*--((char *)dl) = *--((char *)sl);
else
*((char *)dl)++ = *((char *)sl)++;
/* drop thru... */
else
return(d);
word_aligned:
nstash = n;
n /= sizeof(long);
if ( dl > sl ) {
/* GnuC performs an optimization on this loop equivalent to the
one described above, but copying longwords. It really cooks along.
However, the loop must be written exactly as here; any deviation,
even to the extent of not performing the initial decrement of <n> and
using predecrement in the loop control expression, results in the
desired optimization not being done. I tried nearly a dozen
variations, including several array notation equivalents, as well as
using an end pointer instead of a loop counter; none resulted in the
desired optimization.
This is a shame, because what's below is a non-intuitive way to write
this loop -- I certainly wouldn't write it this way off the top of my
head, and I doubt if it occurs in this precise form in much existing
code. It seems to me that the compiler should be able to figure out
equivalent loops and perform the optimization. The speed increase is,
as noted above, close to doubling in some cases, and it is hard to think
of a more useful construct to optimize than block-copy loops.*/
if ( n-- )
do /* main loop -- copy longwords */
*--dl = *--sl;
while ( n-- );
n = nstash;
if ( n & 2 )
*--((short *)dl) = *--((short *)sl);
if ( n & 1 )
*--((char *)dl) = *--((char *)sl);
} else {
if ( n-- )
do
*dl++ = *sl++;
while ( n-- );
n = nstash;
if ( n & 2 )
*((short *)dl)++ = *((short *)sl)++;
if ( n & 1 )
*((char *)dl)++ = *((char *)sl)++;
}
return(d);
}
#ifdef TESTING
/*
These are some functions and code fragments used in the timeing tests,
the results of which are appended.
*/
/*
maxcpy() receives the GnuC/68010 optimization but doesn't
handle overlap, or any alignment except both destination and source
word-aligned. It is about the fastest copy function you can write in C for
the 68010, and I don't think you could do much better in assembler.
*/
char *maxcpy(dl, sl, n)
register long *dl, *sl;
register int n;
{
n /= sizeof(long);
if ( n-- )
do
*dl++ = *sl++;
while ( n-- );
}
/*
null() is the null function.
Incidentally, GnuC does significantly better than PCC at optimizing
function calls. It is smarter about picking up arguments off the stack
into registers and automatic vars, and it pushes the args for and calls
several functions before it cleans up the stack. That is, it does:
push args
call foo()
push args
call bar()
pop args for both foo() and bar()
GnuC sometimes calls four or more functions before cleaning up. All
this adds up to a null function about 3/2 as fast as PCC produces.
*/
char *null(d, s, n)
register char *d, *s;
register int n;
{
return(d);
}
/*
"inline longwords" is the same code as maxcpy() but in-line; as you would
expect, the the times for null() and inline longwords add up pretty closely
to the time for maxcpy() in all tests.
"inline bytes" is the same, but it copies bytes instead of longwords.
Since maxcpy() and inline longwords would cause a core dump in the
non-word-aligned tests, the results are not here.
*/
#endif
/*
Here are the results, in seconds of copying five sizes of blocks by
several methods with each of the basic alignment variations. d is the
destination offset from longword alignment (here, always zero), s is the
source offset, n is the length of the block in bytes, and i is the number
of iterations tested. The total bytes copied , n * i, is the same for
each test.
This test sequence is mainly to test alignment differences. Note that
for small blocks the alignment, which only effects the speed of the innermost
loop, is pretty much irrelevent; the function call and setup overhead
dominates completely. For big blocks, mis-alignment slows both memcpy()
and memmove() by a factor of three -- mis-alignment forces byte by byte
copying.
d: 0 s: 0 n: 4 i: 262144 null:4.3
d: 0 s: 0 n: 4 i: 262144 memcpy:9.5
d: 0 s: 0 n: 4 i: 262144 maxcpy:8.0
d: 0 s: 0 n: 4 i: 262144 memmove:14.9
d: 0 s: 0 n: 4 i: 262144 inline bytes:4.1
d: 0 s: 0 n: 4 i: 262144 inline longwords:2.0
d: 0 s: 1 n: 4 i: 262144 null:4.2
d: 0 s: 1 n: 4 i: 262144 memcpy:9.5
d: 0 s: 1 n: 4 i: 262144 memmove:14.9
d: 0 s: 1 n: 4 i: 262144 inline bytes:4.1
d: 0 s: 2 n: 4 i: 262144 null:4.2
d: 0 s: 2 n: 4 i: 262144 memcpy:9.2
d: 0 s: 2 n: 4 i: 262144 maxcpy:7.9
d: 0 s: 2 n: 4 i: 262144 memmove:14.8
d: 0 s: 2 n: 4 i: 262144 inline bytes:4.1
d: 0 s: 2 n: 4 i: 262144 inline longwords:2.0
d: 0 s: 3 n: 4 i: 262144 null:4.2
d: 0 s: 3 n: 4 i: 262144 memcpy:9.5
d: 0 s: 3 n: 4 i: 262144 memmove:14.7
d: 0 s: 3 n: 4 i: 262144 inline bytes:4.2
d: 0 s: 0 n: 16 i: 65536 null:1.1
d: 0 s: 0 n: 16 i: 65536 memcpy:2.8
d: 0 s: 0 n: 16 i: 65536 maxcpy:2.4
d: 0 s: 0 n: 16 i: 65536 memmove:4.3
d: 0 s: 0 n: 16 i: 65536 inline bytes:2.2
d: 0 s: 0 n: 16 i: 65536 inline longwords:1.4
d: 0 s: 1 n: 16 i: 65536 null:1.0
d: 0 s: 1 n: 16 i: 65536 memcpy:3.6
d: 0 s: 1 n: 16 i: 65536 memmove:5.2
d: 0 s: 1 n: 16 i: 65536 inline bytes:2.2
d: 0 s: 2 n: 16 i: 65536 null:1.1
d: 0 s: 2 n: 16 i: 65536 memcpy:2.7
d: 0 s: 2 n: 16 i: 65536 maxcpy:2.4
d: 0 s: 2 n: 16 i: 65536 memmove:4.2
d: 0 s: 2 n: 16 i: 65536 inline bytes:2.2
d: 0 s: 2 n: 16 i: 65536 inline longwords:1.3
d: 0 s: 3 n: 16 i: 65536 null:1.1
d: 0 s: 3 n: 16 i: 65536 memcpy:3.6
d: 0 s: 3 n: 16 i: 65536 memmove:5.2
d: 0 s: 3 n: 16 i: 65536 inline bytes:2.2
d: 0 s: 0 n: 64 i: 16384 null:0.3
d: 0 s: 0 n: 64 i: 16384 memcpy:1.1
d: 0 s: 0 n: 64 i: 16384 maxcpy:1.1
d: 0 s: 0 n: 64 i: 16384 memmove:1.6
d: 0 s: 0 n: 64 i: 16384 inline bytes:1.7
d: 0 s: 0 n: 64 i: 16384 inline longwords:0.8
d: 0 s: 1 n: 64 i: 16384 null:0.3
d: 0 s: 1 n: 64 i: 16384 memcpy:2.1
d: 0 s: 1 n: 64 i: 16384 memmove:2.8
d: 0 s: 1 n: 64 i: 16384 inline bytes:1.7
d: 0 s: 2 n: 64 i: 16384 null:0.3
d: 0 s: 2 n: 64 i: 16384 memcpy:1.1
d: 0 s: 2 n: 64 i: 16384 maxcpy:1.1
d: 0 s: 2 n: 64 i: 16384 memmove:1.6
d: 0 s: 2 n: 64 i: 16384 inline bytes:1.7
d: 0 s: 2 n: 64 i: 16384 inline longwords:0.8
d: 0 s: 3 n: 64 i: 16384 null:0.3
d: 0 s: 3 n: 64 i: 16384 memcpy:2.1
d: 0 s: 3 n: 64 i: 16384 memmove:2.8
d: 0 s: 3 n: 64 i: 16384 inline bytes:1.7
d: 0 s: 0 n: 512 i: 2048 null:0.0
d: 0 s: 0 n: 512 i: 2048 memcpy:0.7
d: 0 s: 0 n: 512 i: 2048 maxcpy:0.7
d: 0 s: 0 n: 512 i: 2048 memmove:0.8
d: 0 s: 0 n: 512 i: 2048 inline bytes:1.7
d: 0 s: 0 n: 512 i: 2048 inline longwords:0.6
d: 0 s: 1 n: 512 i: 2048 null:0.0
d: 0 s: 1 n: 512 i: 2048 memcpy:1.7
d: 0 s: 1 n: 512 i: 2048 memmove:2.2
d: 0 s: 1 n: 512 i: 2048 inline bytes:1.6
d: 0 s: 2 n: 512 i: 2048 null:0.0
d: 0 s: 2 n: 512 i: 2048 memcpy:0.7
d: 0 s: 2 n: 512 i: 2048 maxcpy:0.7
d: 0 s: 2 n: 512 i: 2048 memmove:0.9
d: 0 s: 2 n: 512 i: 2048 inline bytes:1.7
d: 0 s: 2 n: 512 i: 2048 inline longwords:0.6
d: 0 s: 3 n: 512 i: 2048 null:0.0
d: 0 s: 3 n: 512 i: 2048 memcpy:1.6
d: 0 s: 3 n: 512 i: 2048 memmove:2.1
d: 0 s: 3 n: 512 i: 2048 inline bytes:1.6
d: 0 s: 0 n: 4096 i: 256 null:0.0
d: 0 s: 0 n: 4096 i: 256 memcpy:0.6
d: 0 s: 0 n: 4096 i: 256 maxcpy:0.6
d: 0 s: 0 n: 4096 i: 256 memmove:0.7
d: 0 s: 0 n: 4096 i: 256 inline bytes:1.6
d: 0 s: 0 n: 4096 i: 256 inline longwords:0.6
d: 0 s: 1 n: 4096 i: 256 null:0.0
d: 0 s: 1 n: 4096 i: 256 memcpy:1.6
d: 0 s: 1 n: 4096 i: 256 memmove:2.1
d: 0 s: 1 n: 4096 i: 256 inline bytes:1.6
d: 0 s: 2 n: 4096 i: 256 null:0.0
d: 0 s: 2 n: 4096 i: 256 memcpy:0.6
d: 0 s: 2 n: 4096 i: 256 maxcpy:0.6
d: 0 s: 2 n: 4096 i: 256 memmove:0.7
d: 0 s: 2 n: 4096 i: 256 inline bytes:1.6
d: 0 s: 2 n: 4096 i: 256 inline longwords:0.7
d: 0 s: 3 n: 4096 i: 256 null:0.0
d: 0 s: 3 n: 4096 i: 256 memcpy:1.6
d: 0 s: 3 n: 4096 i: 256 memmove:2.2
d: 0 s: 3 n: 4096 i: 256 inline bytes:1.6
This sequence of tests does all longword copying, with more iterations
and more block sizes for better resolution, to test raw speed.
d: 0 s: 0 n: 4 i: 1048576 null:17.0
d: 0 s: 0 n: 4 i: 1048576 memcpy:36.8
d: 0 s: 0 n: 4 i: 1048576 maxcpy:31.6
d: 0 s: 0 n: 4 i: 1048576 memmove:60.0
d: 0 s: 0 n: 4 i: 1048576 inline bytes:16.5
d: 0 s: 0 n: 4 i: 1048576 inline longwords:7.9
d: 0 s: 0 n: 8 i: 524288 null:8.5
d: 0 s: 0 n: 8 i: 524288 memcpy:19.5
d: 0 s: 0 n: 8 i: 524288 maxcpy:16.9
d: 0 s: 0 n: 8 i: 524288 memmove:31.1
d: 0 s: 0 n: 8 i: 524288 inline bytes:11.5
d: 0 s: 0 n: 8 i: 524288 inline longwords:8.1
d: 0 s: 0 n: 16 i: 262144 null:4.2
d: 0 s: 0 n: 16 i: 262144 memcpy:10.9
d: 0 s: 0 n: 16 i: 262144 maxcpy:9.6
d: 0 s: 0 n: 16 i: 262144 memmove:17.0
d: 0 s: 0 n: 16 i: 262144 inline bytes:8.7
d: 0 s: 0 n: 16 i: 262144 inline longwords:5.3
d: 0 s: 0 n: 32 i: 131072 null:2.1
d: 0 s: 0 n: 32 i: 131072 memcpy:6.7
d: 0 s: 0 n: 32 i: 131072 maxcpy:5.9
d: 0 s: 0 n: 32 i: 131072 memmove:9.9
d: 0 s: 0 n: 32 i: 131072 inline bytes:7.6
d: 0 s: 0 n: 32 i: 131072 inline longwords:3.9
d: 0 s: 0 n: 64 i: 65536 null:1.0
d: 0 s: 0 n: 64 i: 65536 memcpy:4.5
d: 0 s: 0 n: 64 i: 65536 maxcpy:4.2
d: 0 s: 0 n: 64 i: 65536 memmove:6.4
d: 0 s: 0 n: 64 i: 65536 inline bytes:6.9
d: 0 s: 0 n: 64 i: 65536 inline longwords:3.2
d: 0 s: 0 n: 128 i: 32768 null:0.5
d: 0 s: 0 n: 128 i: 32768 memcpy:3.5
d: 0 s: 0 n: 128 i: 32768 maxcpy:3.3
d: 0 s: 0 n: 128 i: 32768 memmove:4.6
d: 0 s: 0 n: 128 i: 32768 inline bytes:6.6
d: 0 s: 0 n: 128 i: 32768 inline longwords:2.8
d: 0 s: 0 n: 512 i: 8192 null:0.1
d: 0 s: 0 n: 512 i: 8192 memcpy:2.7
d: 0 s: 0 n: 512 i: 8192 maxcpy:2.6
d: 0 s: 0 n: 512 i: 8192 memmove:3.3
d: 0 s: 0 n: 512 i: 8192 inline bytes:6.4
d: 0 s: 0 n: 512 i: 8192 inline longwords:2.5
*/
More information about the Unix-pc.sources
mailing list