Alignment IS important

Wed Sep 26 02:19:13 AEST 1990

There has been some debate in this newsgroup about the importance of
aligned memory access. I have finally packaged my own code for doing
core-to-core memory copies, aligned and unaligned, and I am posting it
here together with some discussion of the results.

This article is posted to comp.arch because it discusses architecture,
and to alt.sources because it contains generally useful source code.

	Usual disclaimer: this work has no relationship whatever to that
	of the University College of Wales; it was performed exclusively
	by me, with the use of my own time, funds, machines and know-how,
	and has not been aided, abetted or supported in any way by the
	University College of Wales. I thank them for providing the
	opportunity to access News and therefore to post this article,
	about which they do not actually know anything.

This article is about a library function essentially equivalent to
memcpy(3), which I have called CoreCopy().  It does not handle
overlapping moves, although it would not be difficult to extend it to
do so.
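
Since overlapping areas are not handled, a caller that cannot rule them out
needs some guard of its own. The following is only a minimal sketch of such
a guard; areas_overlap() is a hypothetical name that does not appear in the
posted sources, and comparing addresses as integers assumes a flat address
space.

```c
#include <stddef.h>
#include <stdint.h>

/*
    Hypothetical helper, not part of the posted sources: returns 1 when
    the areas [to, to+bytes) and [from, from+bytes) share at least one
    byte, and 0 otherwise. Assumes a flat address space.
*/
int areas_overlap(const void *to, const void *from, size_t bytes)
{
    uintptr_t t = (uintptr_t) to;
    uintptr_t f = (uintptr_t) from;
    uintptr_t distance = (t < f) ? f - t : t - f;

    /* Two equally sized areas overlap when their starts are closer
       together than their common length. */
    return bytes != 0 && distance < bytes;
}
```

A caller could fall back to memmove(3), where available, whenever this
returns 1.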

It is very portable, but also highly tuned and parametrized on the machine
characteristics. I use a set of my own (longish) headers for the parameters;
they are summarized here in the file "CoreHdr.h". I hope that the parameters
are self-explanatory, and that the way they are used in the CoreCopy source
is clear, even though virtually all of the source is preprocessor code,
which I have tried to make as readable as possible. This is one obvious case
where very careful hand optimization and parametrization pays off, as
core-to-core copy bandwidth is often crucial, e.g. to the overall efficiency
of the UNIX kernel.

Here is a list of files contained in the attached shar and their
contents:

Core.h		The user interface of CoreCopy()
CoreCopy.c	The source for CoreCopy()
CoreHdr.h	Environment parameters
CoreSun3.h	Tuning parameters for Sun 3 machines
CoreSv386.h	Tuning parameters for SysV/386 machines
CoreTest.c	A program to "benchmark" CoreCopy()
CoreRun.sh	A shell script to run CoreTest
CoreSv386.pr	Results of running CoreRun.sh under SysV/386
CoreSun3.pr	Results of running CoreRun.sh on a Sun 3

I will provide here some comments on the "benchmark" results:

The benchmarks involve three cases, each copying a total of 16MB in chunks of 8,
32, 128, 512 and 2048 bytes; each copy is done first with both source and
destination aligned on a "double" boundary, then with both misaligned by 1
byte, then with source aligned and destination misaligned by 3 bytes, and
then the reverse.

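User time is reported as returned by the OS. CoreTest.c prints it as
utime/HZ and utime%HZ, so the digits after the point are clock ticks; they
are centiseconds only where HZ is 100.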
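
To compare figures between machines with different tick rates, the tick
count can be converted explicitly. This is a minimal sketch under that
assumption; ticks_to_centiseconds() is a hypothetical helper, not part of
CoreTest.c, which prints utime/HZ and utime%HZ directly.

```c
#include <assert.h>

/*
    Hypothetical helper: converts a tick count, such as the difference
    of two tms_utime values, to whole centiseconds, rounding to the
    nearest. 'hz' is the number of ticks per second.
*/
unsigned long ticks_to_centiseconds(unsigned long ticks, unsigned long hz)
{
    assert(hz > 0);
    return (ticks * 100 + hz / 2) / hz;
}
```

If HZ was 60 on the Sun 3, as the fractional parts in CoreSun3.pr (none
above 59) suggest, a line printed as 2.25u stands for 2*60+25 = 145 ticks,
which this converts to 242 centiseconds.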

The first case does not really copy anything; it is run just to give an idea
of the function calling overhead, which dominates when copying small chunks.
The second copies using the system provided memcpy(3) function; the third
runs CoreCopy() itself. (You can, if you want, run additional cases just for
comparison; they do not involve CoreCopy() itself. Look at the "CoreTest.c"
file.)
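
To see why call overhead dominates for small chunks, it helps to count the
calls. The sketch below mirrors the loop in measure(), which advances by one
chunk per call until the total is reached; calls_per_run() is a hypothetical
name used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
    The loop `for (i = 0; i < M; i += b)` in measure() calls the copy
    function once per chunk, i.e. ceil(M / b) times.
*/
size_t calls_per_run(size_t total_bytes, size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

Copying 16MB takes 8192 calls with 2048 byte chunks, but 2097152 calls with
8 byte chunks, which is 256 times as many.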

Some parts of the "benchmark" may not be perfectly portable (one example is
that I ensure that the source and destination buffers are aligned by putting
a definition for a 'double' just before their definitions), but they should
be portable to nearly every common architecture I can imagine.
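
The alignment trick just described depends on the compiler laying out
adjacent definitions in order. One alternative, shown here only as a sketch
and not as what CoreTest.c does, is to put the buffer in a union with a
'double', so that the language itself guarantees the alignment. All names
below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define BUFFER_BYTES 4096             /* same size as B in CoreTest.c */
#define SLOP_BYTES   sizeof (double)  /* room for the misaligning offset */

union aligned_buffer {
    double force_alignment;           /* never read; only sets alignment */
    char   bytes[BUFFER_BYTES + SLOP_BYTES];
};

/* Returns a pointer 'offset' bytes past the aligned start of the buffer. */
char *misaligned_start(union aligned_buffer *buffer, size_t offset)
{
    return buffer->bytes + offset % SLOP_BYTES;
}

/* The "address modulus" reported in the result tables. */
unsigned address_modulus(const void *address, unsigned modulus)
{
    return (unsigned) ((uintptr_t) address % modulus);
}
```

This assumes, as on the machines discussed here, that a 'double' is aligned
to at least 4 bytes.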

    Environment of benchmarks:

    Sv386 is an i386DX 20MHz with (write-thru) 64KB cache, running System
    V/386 with the Register C Compiler. As a very rough measure of power,
    it does a bit more than 6000 2.x Dhrystones.

    Sun3 is a Sun 3/280, 68020 25MHz with (write-thru?) cache, running
    SunOS 4.0.3 with the PCC descended compiler. This also does a bit
    more than 6000 2.x Dhrystones.

Here is a subset of the results; on the left is the Sun3, on the right is
the Sv386. I have chosen block sizes of 512, because it is large enough
that procedure call overhead is not significant, and 32, because it is
small enough that the overhead starts to matter.

  .------------------ Size of block copied in bytes
  |
  |	.------------ Destination address modulus 4
  |	|
  |	|    .------- Source address modulus 4
  |	|    |
  |	|    |   .--- Time in seconds.centiseconds to copy 16MB
  |	|    |   |
  |	|    |	 |
  V	V    V   V

Sun3 memcpy(3)		  Sv386 memcpy(3)

512B t% 0 f% 0   2.25u    512B t% 0 f% 0   1.55u    
512B t% 1 f% 1   3.02u    512B t% 1 f% 1   4.21u
512B t% 0 f% 3   8.38u    512B t% 0 f% 3   3.32u
512B t% 3 f% 0   8.44u    512B t% 3 f% 0   2.43u
 32B t% 0 f% 0   7.02u     32B t% 0 f% 0   4.71u
 32B t% 1 f% 1   8.02u     32B t% 1 f% 1   7.39u
 32B t% 0 f% 3  12.20u     32B t% 0 f% 3   6.44u
 32B t% 3 f% 0  12.01u     32B t% 3 f% 0   5.64u

Sun3 CoreCopy()		  Sv386 CoreCopy()

512B t% 0 f% 0   2.49u    512B t% 0 f% 0   1.68u
512B t% 1 f% 1   3.11u    512B t% 1 f% 1   1.77u
512B t% 0 f% 3   4.09u    512B t% 0 f% 3   2.65u
512B t% 3 f% 0   3.23u    512B t% 3 f% 0   2.57u
 32B t% 0 f% 0   6.10u     32B t% 0 f% 0   6.48u
 32B t% 1 f% 1   6.46u     32B t% 1 f% 1   9.21u
 32B t% 0 f% 3   6.09u     32B t% 0 f% 3   8.28u
 32B t% 3 f% 0   6.29u     32B t% 3 f% 0   7.45u

The results are often surprising, and must be analyzed with some detailed
knowledge of the logic used by CoreCopy(), memcpy(3), and the performance
profiles of the compiler and CPU architecture and implementation
involved (please also refer to the full set of results in the shar
archive below).

In general CoreCopy() is as fast as, or just a little slower than, the
built-in memcpy(3) function for aligned copies; it is usually much faster
for unaligned copies. This holds true down to fairly small chunk sizes; for
very small chunk sizes the higher overheads of CoreCopy() become more
important.
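
One way to read the tables is to subtract the time of the "null" case from
the time of a copy case and turn the remainder into a rate. This is only a
sketch of that arithmetic: treating the null case as pure call overhead is
an assumption, net_kb_per_second() is a hypothetical helper, and the inputs
are taken to be centiseconds, as the Sv386 figures appear to be.

```c
#include <assert.h>

#define TOTAL_KB (16UL * 1024UL)      /* 16MB copied per run */

/*
    Net copy rate in KB per second, given the user time of a copy run
    and of the matching "null" run, both in centiseconds.
*/
unsigned long net_kb_per_second(unsigned long copy_cs, unsigned long null_cs)
{
    assert(copy_cs > null_cs);
    return TOTAL_KB * 100UL / (copy_cs - null_cs);
}
```

For the Sv386 rows with 512 byte blocks and both addresses misaligned by 1
(4.21u for memcpy(3), 1.77u for CoreCopy(), 0.20u for the null case), this
gives 4085 KB/s against 10435 KB/s.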

I have not included any statistics on this, but aligning the destination
instead of the source does provide a significant performance benefit.
Another interesting note is that (4-way) loop unrolling does not buy much
for the machines I have used; tight loops are probably just as good in this
case, because of pipelining or something else. It helps instead to unroll
the code that copies the misaligned head and tail of the area, because on
most machines the head and tail are 1, 2 or 3 bytes long, so a 4-way
unrolled loop is never repeated.
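
The head, clusters and tail scheme can be sketched in present-day C as
follows. This is not the posted CoreCopy(): sketch_copy() and its threshold
are hypothetical, the cluster is fixed at 4 bytes, and the cluster is moved
through a fixed-size memcpy() rather than the pointer casts of the original,
so that the possibly unaligned read stays well defined.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CLUSTER_BYTES sizeof (uint32_t)

/* Copies 'bytes' bytes; returns a pointer just past the last byte written. */
void *sketch_copy(void *to_area, const void *from_area, size_t bytes)
{
    unsigned char *to = to_area;
    const unsigned char *from = from_area;

    if (bytes >= 4 * CLUSTER_BYTES) {           /* hypothetical threshold */
        size_t odd = (uintptr_t) to % CLUSTER_BYTES;

        if (odd != 0) {                         /* align the destination */
            size_t head = CLUSTER_BYTES - odd;  /* 1, 2 or 3 bytes */
            bytes -= head;
            switch (head) {                     /* unrolled, never loops */
            case 3: *to++ = *from++;            /* fall through */
            case 2: *to++ = *from++;            /* fall through */
            case 1: *to++ = *from++;
            }
        }

        for (size_t n = bytes / CLUSTER_BYTES; n > 0; --n) {
            uint32_t cluster;
            memcpy(&cluster, from, sizeof cluster); /* maybe unaligned read */
            memcpy(to, &cluster, sizeof cluster);   /* aligned write */
            to += CLUSTER_BYTES, from += CLUSTER_BYTES;
        }
        bytes %= CLUSTER_BYTES;                 /* tail: 0 to 3 bytes */
    }

    while (bytes-- > 0)                         /* tail or small block */
        *to++ = *from++;

    return to;
}
```

Aligning on the destination means that writes are always aligned, while
reads are aligned only when source and destination have the same offset
within a cluster.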

Substituting CoreCopy() for memcpy(3) on each of the tested machines would
probably provide overall benefits, because CoreCopy() is only a little
worse than memcpy(3) with aligned copies, but usually dramatically better
with unaligned ones. In particular, if you use it in the "insdel.c" module
of GNU Emacs, for which a suitable patch will be posted, you may
experience huge speedups; currently "insdel.c" uses a C-coded
char-by-char loop to shift the buffer. Even just using the system
provided bcopy(3) or memcpy(3) will help a lot.

Inline assembler code is essential to performance on the 386, but not on
the 68020. This is probably because the compiled code for a string copy on
the 386 is fairly large; using the built-in string copy instructions
provides a 3x speedup, probably mostly by saving instruction word fetches,
given the limited pipelining of the 386. It would be interesting to see how
the 486 compares.  It is interesting to note that memcpy(3) on the 386 also
uses the string copy instructions; however, it ignores alignment, copying
word by word until less than a wordful of bytes is left, and then byte by
byte.

I think that some inline machine language would also be vital for machines
like the MIPS or SPARC that rely on traps to support unaligned accesses in
general. I did actually run some cases on a MIPS, but the unaligned cases
are simply too slow because of trapping (in the aligned ones CoreCopy() is
as quick as the assembler coded memcpy(3)).
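
Whether a copy hits this problem can be told from the two addresses alone.
The following is a minimal sketch; co_alignable() is a hypothetical helper
that does not appear in the posted sources.

```c
#include <stddef.h>
#include <stdint.h>

/*
    Hypothetical helper: returns 1 when 'to' and 'from' sit at the same
    offset within a cluster, so that aligning one of them also aligns
    the other, and 0 otherwise.
*/
int co_alignable(const void *to, const void *from, size_t cluster_bytes)
{
    return (uintptr_t) to % cluster_bytes
	== (uintptr_t) from % cluster_bytes;
}
```

In the benchmark, the case with both addresses misaligned by 1 is
co-alignable with 4 byte clusters, while the two cases with offsets 0 and 3
are not; for those, once the destination is aligned, every cluster read is
unaligned and would trap on such a machine.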

You are welcome to provide machine dependent headers for other architectures
and compilers, and to experiment with the various parameters, thresholds,
etc. that you will find in the source. I would be interested in knowing the
times for the VAX-11/780, on which the first incarnation of this function
was developed (as soon as I had read how the CPU-SBI-Memory interface worked
for byte stores :->).

The source and the full result files are in the following shar archive.

---------------------------cut here---------------------------------------
#! /bin/sh
# This is a shell archive.  Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file".  To overwrite existing
# files, type "sh file -c".  You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g..  If this archive is complete, you
# will see the following message at the end:
#		"End of shell archive."
# Contents:  Core.h CoreCopy.c CoreHdr.h CoreRun.sh CoreSun3.h
#   CoreSun3.pr CoreSv386.h CoreSv386.pr CoreTest.c
# Wrapped by pcg at thor on Tue Sep 25 16:32:40 1990
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
echo '
    Copyright 1982,1990 Piercarlo Grandi.  All rights reserved.

    This shar archive contains free software; you can redistribute
    it and/or modify it under the terms of the GNU General Public
    License as published by the Free Software Foundation; either
    version 1, or (at your option) any later version.

    This shar archive is distributed in the hope that it will be
    useful, but WITHOUT ANY WARRANTY; without even the implied
    warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
    PURPOSE.  See the GNU General Public License for more details.

    You may have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
'
sleep 4
if test -f 'Core.h' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'Core.h'\"
else
echo shar: Extracting \"'Core.h'\" \(487 characters\)
sed "s/^X//" >'Core.h' <<'END_OF_FILE'
X#ifndef Core_H
X#define Core_H
X#if __STDC__
X#   pragma once
X#endif
X
X#if 0
X#ifndef Extend_H
X#   include "Extend.h"
X#endif
X#endif
X
X/*
X	This is a set of library routines to allocate virtual memory and
X	manipulate  it.  It  strives  to  be  reliable  and  consistent,
X	efficient  and  portable.  Unfortunately  this latter quality is
X	more  difficult  to  obtain than the others for such a low level
X	library.
X*/
X
extern pointer		CoreCopy of((pointer,pointer,addressy));
X
X#endif /* Core_H */
END_OF_FILE
if test 487 -ne `wc -c <'Core.h'`; then
    echo shar: \"'Core.h'\" unpacked with wrong size!
fi
# end of 'Core.h'
fi
if test -f 'CoreCopy.c' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'CoreCopy.c'\"
else
echo shar: Extracting \"'CoreCopy.c'\" \(8519 characters\)
sed "s/^X//" >'CoreCopy.c' <<'END_OF_FILE'
X#if 1
X#   include "CoreHdr.h"
X#else
X#ifndef Extend_h
X#   include "Extend.h"
X#endif
X
X#include <import>
X#ifndef Here_h
X#   include "Here.h"
X#endif
X#ifndef With_h
X#   include "With.h"
X#endif
X#ifndef Type_h
X#   include "Type.h"
X#endif
X#ifndef Convert_h
X#   include "Convert.h"
X#endif
X#ifndef Bits_h
X#   include "Bits.h"
X#endif
X#ifndef Assert_h
X#   include "Assert.h"
X#endif
X
X#include <export>
X#endif /* 1 */
X
X#ifndef Core_h
X#   include "Core.h"
X#endif
X
X#if ((CcFEATURE & CcKR78) != CcKR78)
X#   include "ERROR: language supported too old"
X#endif
X
X/*
X    This function is handed pointers to two memory areas, and copies as many
X    units as it is told from the second to the first. The two areas are
X    expected to begin at any byte boundary, and the size is given in bytes
X    too.
X
X    If the memory subsystem of the machine handles more efficiently
X    naturally aligned requests in clusters (multiples of a unit), we try to
X    take advantage of that. Since we cannot take advantage of moving clusters
X    for both source and destination, we optimize the writing of clusters, of
X    course...
X*/
X
X/*
X    We need three copying operations. Only the first is always needed,
X    the remaining two are needed only if copying by clusters pays.
X
X    BYTECOPY(to,from,bytes) copies bytes by byte, the number of bytes is
X    guaranteed to be >= 0.
X
X    ODDCOPY(to,from,bytes) also copies byte by byte, but the number of bytes
X    is guaranteed to be >= 0 && < ClusterBYTES.
X
X    CLUSTERCOPY(to,from,clusters) copies cluster by cluster, and the number
X    of clusters is guaranteed to be > 0.
X
X    For all these macros, the value of bytes is not touched, but to and from
X    are updated to point to the end of the copied area.
X*/
X
X#if (CpuIS == CpuIAPX && CpuMODEL == 0x0386)
X#   include "CoreSv386.h"
X#endif /* CpuIAPX && 0x0386 */
X
X#if (CpuIS == CpuMC68000 && CpuMODEL == 0x0020)
X#   include "CoreSun3.h"
X#endif /* CpuMC68000 && 0x0020 */
X
X#if (CpuIS == CpuMIPS /* && CpuMODEL == 0x3000 */)
X#   include "CoreMips.h"
X#endif /* CpuMIPS */
X
X#ifndef ClusterBITS
X
X#   ifdef CoreFASTALIGN
X#	define ClusterBITS		(CoreFASTALIGN*CpuUNIT)
X#   else
X#	if ((CoreFEATURE & (CoreDCACHE|CoreWRITETHRU)) == CoreDCACHE)
X#	    define ClusterBITS		(CoreCACHELINE*CpuUNIT)
X#	else
X#	    ifdef CoreCORELINE
X#		define ClusterBITS	(CoreCORELINE*CpuUNIT)
X#	    else
X#		ifdef CoreINTERLEAVE
X#		    define ClusterBITS	(CoreINTERLEAVE*CpuUNIT)
X#		else
X#		    define ClusterBITS	CpuUNIT
X#		endif
X#	    endif
X#	endif
X#   endif
X
X#   if (ClusterBITS >= LongBITS && (ClusterBITS % LongBITS) == 0)
X#	undef ClusterBITS
X#	define ClusterBITS	LongBITS
X#   endif
X
X#   if ((ClusterBITS % ByteBITS) == 0)
X#	define ClusterBYTES	(ClusterBITS/ByteBITS)
X#   else
X#	include "ERROR: Cluster size is not an even # of bytes"
X#   endif
X
X#endif /* ndef ClusterBITS */
X
X#if (ClusterBYTES > 1)
X
X#   ifndef ClusterLNBYTES
X#	if (ClusterBYTES == 2)
X#	    define ClusterLNBYTES	1
X#	endif
X#	if (ClusterBYTES == 4)
X#	    define ClusterLNBYTES	2
X#	endif
X#	if (ClusterBYTES == 8)
X#	    define ClusterLNBYTES	3
X#	endif
X#   endif
X
X#   ifndef ClusterBEST
X#	if (CoreFEATURE & CoreWRITETHRU)
X#	    define ClusterBEST		(ClusterBYTES*4)
X#	else
X#	    define ClusterBEST		(ClusterBYTES*8)
X#	endif
X#   endif
X
X#   ifndef ClusterALIGNTO
X#	define ClusterALIGNTO		1
X#   endif
X
X#   ifndef ClusterDOALIGN
X#	define ClusterDOALIGN		(ClusterBEST*4)
X#   endif
X
X#   ifndef ClusterTYPE
X#	if (ClusterBITS == ShortBITS && !defined ClusterTYPE)
X#	    define ClusterTYPE		short
X#	endif
X#	if (ClusterBITS == IntBITS && !defined ClusterTYPE)
X#	    define ClusterTYPE		int
X#	endif
X#	if (ClusterBITS == LongBITS && !defined ClusterTYPE)
X#	    define ClusterTYPE		long
X#	endif
X#	if (!defined ClusterTYPE)
X#	    include "ERROR: cannot define a sensible ClusterTYPE"
X#	endif
X#   endif
X
X#   if (!defined ClusterREM && defined ClusterLNBYTES)
X#	define ClusterREM(n)	((n) & (ClusterBYTES-1))
X#	define ClusterDIV(n)	((n) >> ClusterLNBYTES)
X#   else
X#	define ClusterREM(n)	((n) % ClusterBYTES)
X#	define ClusterDIV(n)	((n) / ClusterBYTES)
X#   endif
X
X#   if (!defined Core4CLUSTERCOPY					\
X	    && (CodeREGISTERS >= 6 || CodePREGISTERS >= 5))
X#       define Core4CLUSTERCOPY(to,from,clusters)			\
X	begindef							\
X	    fast ClusterTYPE	*CoreTo = (ClusterTYPE *) (to);		\
X	    fast ClusterTYPE	*CoreFrom = (ClusterTYPE *) (from);	\
X	    fast addressy	CoreClusters = (clusters);		\
X	    while (CoreClusters) switch (CoreClusters)			\
X	    {								\
X	    default:	*CoreTo++ = *CoreFrom++; --CoreClusters;	\
X	    case 3:	*CoreTo++ = *CoreFrom++; --CoreClusters;	\
X	    case 2:	*CoreTo++ = *CoreFrom++; --CoreClusters;	\
X	    case 1:	*CoreTo++ = *CoreFrom++; --CoreClusters;	\
X	    case 0:	break; /* keep this "useless" break in ... */	\
X	    }								\
X	    /* do *CoreTo++ = *CoreFrom++; while (--CoreClusters); */	\
X	    (to) = (pointer) CoreTo, (from) = (pointer) CoreFrom;	\
X	enddef
X#   endif
X
X#   ifndef Core4CLUSTERCOPY
X#	define Core4CLUSTERCOPY(to,from,clusters)			\
X	begindef							\
X	    fast addressy	CoreClusters = (clusters);		\
X	    while (CoreClusters) switch (CoreClusters)			\
X	    {								\
X	    default:							\
X		*(ClusterTYPE *) (to) = *(ClusterTYPE *) (from);	\
X		(to) += ClusterBYTES, (from) += ClusterBYTES;		\
X		--CoreClusters;						\
X	    case 3:							\
X		*(ClusterTYPE *) (to) = *(ClusterTYPE *) (from);	\
X		(to) += ClusterBYTES, (from) += ClusterBYTES;		\
X		--CoreClusters;						\
X	    case 2:							\
X		*(ClusterTYPE *) (to) = *(ClusterTYPE *) (from);	\
X		(to) += ClusterBYTES, (from) += ClusterBYTES;		\
X		--CoreClusters;						\
X	    case 1:							\
X		*(ClusterTYPE *) (to) = *(ClusterTYPE *) (from);	\
X		(to) += ClusterBYTES, (from) += ClusterBYTES;		\
X		--CoreClusters;						\
X	    case 0: break; /* keep this "useless" break in ... */	\
X	    }								\
X	enddef
X#   endif
X
X#   ifndef CoreCLUSTERCOPY
X#	define CoreCLUSTERCOPY(to,from,clusters)			\
X	begindef							\
X	    fast addressy	CoreClusters = (clusters);		\
X	    do	{   *(ClusterTYPE *) (to) = *(ClusterTYPE *) (from);	\
X		    (to) += ClusterBYTES, (from) += ClusterBYTES;	\
X	    } while (--CoreClusters);					\
X	enddef
X#   endif
X
X#endif /* ClusterBYTES > 1 */
X
X#ifndef Core4BYTECOPY
X#   define Core4BYTECOPY(to,from,bytes)					\
X    begindef								\
X	fast addressy CoreBytes = (bytes);				\
X	while (CoreBytes) switch (CoreBytes)				\
X	{								\
X	default:    *(to)++ = *(from)++; --CoreBytes;			\
X	case 3:	    *(to)++ = *(from)++; --CoreBytes;			\
X	case 2:	    *(to)++ = *(from)++; --CoreBytes;			\
X	case 1:	    *(to)++ = *(from)++; --CoreBytes;			\
X	case 0:	    break; /* keep this "useless" break in ... */	\
X	}								\
X    enddef
X#endif /* ndef Core4BYTECOPY */
X
X/*
X    You may want to define this, but at least on my machine (iAPX 386)
X    unrolling loops does not pay.
X*/
X
X#ifndef CoreBYTECOPY
X#   if ((CpuFEATURE&CpuPIPELINE) && !(CpuIS == CpuIAPX && CpuMODEL == 0x0386))
X#	define CoreBYTECOPY	    Core4BYTECOPY
X#   endif
X#endif
X
X#ifndef CoreBYTECOPY
X#   define CoreBYTECOPY(to,from,bytes)					\
X    begindef								\
X	fast addressy CoreBytes = (bytes);				\
X	while (CoreBytes) *(to)++ = *(from)++, --CoreBytes;		\
X    enddef
X#endif /* ndef CoreBYTECOPY */
X
X#ifndef CoreODDCOPY
X#   if (ClusterBYTES <= 4)
X#	define CoreODDCOPY	Core4BYTECOPY
X#   else
X#	define CoreODDCOPY	CoreBYTECOPY
X#   endif
X#endif /* ndef CoreODDCOPY */
X
global pointer		CoreCopy(to,from,bytes)
X    fast pointer	    to;
X    fast pointer	    from;
X    addressy		    bytes;
X{
X#   ifndef CoreCLUSTERCOPY
X	CoreBYTECOPY(to,from,bytes);
X#   else
X    {
X    copySmallBlock:
X
X	if (bytes < ClusterBEST)
X	{
X	    CoreBYTECOPY(to,from,bytes);
X	    return to;
X	}
X
X#	if (ClusterDOALIGN != 0)
X	{
X	    /*
X                Note that here we usually want align cluster transfers
X                on 'to', as we care more about aligning writes than
X                reads, that are often easier to pipeline.
X	    */
X
X	copyHead:
X
X	    if (bytes >= ClusterDOALIGN)
X	    {
X		addressy		    odd;
X
X#		if (ClusterALIGNTO)
X#		    define ClusterALIGN		to
X#		else
X#		    define ClusterALIGN		from
X#		endif
X
X		if ((odd = ClusterREM((addressy) ClusterALIGN)) != 0)
X		{
X		    CoreODDCOPY(to,from,odd = ClusterBYTES - odd);
X		    bytes -= odd;
X		}
X
X#		undef ClusterALIGN
X	    }
X	}
X#	endif /* ClusterDOALIGN != 0 */
X
X    copyClusters:
X
X	assert (ClusterREM((addressy) to) == 0,"CoreCopy");
X	CoreCLUSTERCOPY(to,from,ClusterDIV(bytes));
X	assert (ClusterREM((addressy) to) == 0,"CoreCopy");
X
X    copyTail:
X
X	CoreODDCOPY(to,from,ClusterREM(bytes));
X    }
X#endif /* ndef CoreCLUSTERCOPY */
X
X    return to;
X}
END_OF_FILE
if test 8519 -ne `wc -c <'CoreCopy.c'`; then
    echo shar: \"'CoreCopy.c'\" unpacked with wrong size!
fi
# end of 'CoreCopy.c'
fi
if test -f 'CoreHdr.h' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'CoreHdr.h'\"
else
echo shar: Extracting \"'CoreHdr.h'\" \(2367 characters\)
sed "s/^X//" >'CoreHdr.h' <<'END_OF_FILE'
X#define CpuIAPX		0x0005
X#define CpuMC68000	0x0006
X#define CpuMIPS		0x0007
X
X#ifdef i386
X#define CpuIS		CpuIAPX		/* Type of instruction set	*/
X#define CpuMODEL	0x0386		/* In HEX !			*/
X#endif
X#ifdef sun3
X#define CpuIS		CpuMC68000	/* Type of instruction set	*/
X#define CpuMODEL	0x0020		/* In HEX !			*/
X#endif
X#ifdef mips
X#define CpuIS		CpuMIPS		/* Type of instruction set	*/
X#define CpuMODEL	0x3000		/* In HEX !			*/
X#endif
X
X#define CpuUNIT		8		/* Bits in addressable unit	*/
X
X#define CpuFEATURE	0x0008		/* Peculiarities of CPU		*/
X#define CpuPIPELINE	0x0002		/* Multi stage command obey	*/
X#define CpuDALIGN	0x000a		/* Must align data		*/
X
X#define CoreFEATURE	0x0006		/* Peculiarities of memory sys	*/
X#define CoreDCACHE	0x0002		/* Has a DATA cache		*/
X#define CoreWRITETHRU	0x0004		/* Updates directly to memory	*/
X
X#define CoreCACHELINE	16		/* D cache line size in units	*/
X#define CoreCORELINE	4		/* Units to/from mem at a time	*/
X#define CoreINTERLEAVE	1		/* Interleaving in units	*/
X#define CoreFASTALIGN	4		/* Align at this for fast move	*/
X
X#define CcPORTABLE	0x0005		/* Johnson's classic		*/
X#define CcREGISTER	0x0006		/* Successor to PORTABLE	*/
X
X#ifdef i386
X#define CcIS		CcREGISTER	/* Type (author) of compiler	*/
X#define CodeREGISTERS	5		/* Spare universal registers	*/
X#define CodeDREGISTERS	0		/* Spare data only registers	*/
X#define CodePREGISTERS	0		/* Spare pointer only registers	*/
X#endif
X#ifdef sun3
X#define CcIS		CcPORTABLE	/* Type (author) of compiler	*/
X#define CodeREGISTERS	0		/* Spare universal registers	*/
X#define CodeDREGISTERS	3		/* Spare data only registers	*/
X#define CodePREGISTERS	3		/* Spare pointer only registers	*/
X#endif
X#ifdef mips
X#define CcIS		CcPORTABLE	/* Type (author) of compiler	*/
X#define CodeREGISTERS	8		/* Spare universal registers	*/
X#define CodeDREGISTERS	0		/* Spare data only registers	*/
X#define CodePREGISTERS	0		/* Spare pointer only registers	*/
X#endif
X
X#define CcFEATURE	0x027f		/* Compiler dependent C		*/
X#define CcKR78		0x007f		/*=All that is in K&R 1st ed.	*/
X#define CcASM		0x0200		/*!asm(" ... ");		*/
X
X#define ByteBITS	8
X#define ShortBITS	16
X#define IntBITS		32
X#define LongBITS	32
X
X
X#define	of(ARGS)	(/* ARGS */)
X#define	begindef	do {
X#define enddef		} while (0)
X
X#define global		/* extern */
X#define fast		register
X#define assert(c,m)	/* no op */
X
typedef unsigned	addressy;
typedef char		*pointer;
END_OF_FILE
if test 2367 -ne `wc -c <'CoreHdr.h'`; then
    echo shar: \"'CoreHdr.h'\" unpacked with wrong size!
fi
# end of 'CoreHdr.h'
fi
if test -f 'CoreRun.sh' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'CoreRun.sh'\"
else
echo shar: Extracting \"'CoreRun.sh'\" \(126 characters\)
sed "s/^X//" >'CoreRun.sh' <<'END_OF_FILE'
for C in 0 1 2
do
X    for B in 2048 512 128 32 8
X    do
X	$1	$C	$B	0	0
X	$1	$C	$B	1	1
X	$1	$C	$B	0	3
X	$1	$C	$B	3	0
X    done
done
END_OF_FILE
if test 126 -ne `wc -c <'CoreRun.sh'`; then
    echo shar: \"'CoreRun.sh'\" unpacked with wrong size!
fi
chmod +x 'CoreRun.sh'
# end of 'CoreRun.sh'
fi
if test -f 'CoreSun3.h' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'CoreSun3.h'\"
else
echo shar: Extracting \"'CoreSun3.h'\" \(1135 characters\)
sed "s/^X//" >'CoreSun3.h' <<'END_OF_FILE'
X#define ClusterBITS		32	/* Bits in a cluster		*/
X#define ClusterBYTES		4	/* Bytes in a cluster		*/
X#define ClusterLNBYTES		2	/* Log2 of ClusterBYTES		*/
X#define ClusterTYPE		int	/* The type of a cluster	*/
X
X#define ClusterALIGNTO		1	/* Align destination		*/
X
X#if (CcIS == CcPORTABLE)
X
X#   define ClusterBEST		16	/* Copy clusters when longer	*/
X#   define ClusterDOALIGN	64	/* Align clusters when longer	*/
X#   define CoreODDCOPY		CoreBYTECOPY
X
X    /* Asm inlines do not improve speed */
X#   if (0 && (CcFEATURE&CcASM))
X	/*
X	    Having had a look at the generated code, we know that to is
X	    a5, from is a4, and the count is "always" ready in d0.
X	*/
X
X#	define CoreBYTECOPY(to,from,bytes)				\
X	begindef							\
X	    fast unsigned CoreBytes;					\
X	    if (CoreBytes = (bytes)) {					\
X		asm ("1:	movb    a4 at +,a5 at +");			\
X		asm ("		dbra	d0,1b"); }			\
X	enddef
X
X#	define CoreCLUSTERCOPY(to,from,clusters)			\
X	begindef							\
X	    fast unsigned CoreClusters;					\
X	    if (CoreClusters = (clusters)) {				\
X		asm ("1:	movl    a4 at +,a5 at +");			\
X		asm ("		dbra	d0,1b"); }			\
X	enddef
X
X#   endif /* 0 */
X
X#endif /* CcIS == CcPORTABLE */
END_OF_FILE
if test 1135 -ne `wc -c <'CoreSun3.h'`; then
    echo shar: \"'CoreSun3.h'\" unpacked with wrong size!
fi
# end of 'CoreSun3.h'
fi
if test -f 'CoreSun3.pr' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'CoreSun3.pr'\"
else
echo shar: Extracting \"'CoreSun3.pr'\" \(2040 characters\)
sed "s/^X//" >'CoreSun3.pr' <<'END_OF_FILE'
X 16MB C=0 2048B t% 0 f% 0   0.01u
X 16MB C=0 2048B t% 1 f% 1   0.02u
X 16MB C=0 2048B t% 0 f% 3   0.02u
X 16MB C=0 2048B t% 3 f% 0   0.02u
X 16MB C=0  512B t% 0 f% 0   0.07u
X 16MB C=0  512B t% 1 f% 1   0.04u
X 16MB C=0  512B t% 0 f% 3   0.08u
X 16MB C=0  512B t% 3 f% 0   0.04u
X 16MB C=0  128B t% 0 f% 0   0.27u
X 16MB C=0  128B t% 1 f% 1   0.25u
X 16MB C=0  128B t% 0 f% 3   0.25u
X 16MB C=0  128B t% 3 f% 0   0.23u
X 16MB C=0   32B t% 0 f% 0   1.44u
X 16MB C=0   32B t% 1 f% 1   1.52u
X 16MB C=0   32B t% 0 f% 3   1.43u
X 16MB C=0   32B t% 3 f% 0   1.45u
X 16MB C=0    8B t% 0 f% 0   7.04u
X 16MB C=0    8B t% 1 f% 1   7.23u
X 16MB C=0    8B t% 0 f% 3   7.25u
X 16MB C=0    8B t% 3 f% 0   7.25u
X 16MB C=1 2048B t% 0 f% 0   2.10u
X 16MB C=1 2048B t% 1 f% 1   2.45u
X 16MB C=1 2048B t% 0 f% 3   8.38u
X 16MB C=1 2048B t% 3 f% 0   8.37u
X 16MB C=1  512B t% 0 f% 0   2.25u
X 16MB C=1  512B t% 1 f% 1   3.02u
X 16MB C=1  512B t% 0 f% 3   8.38u
X 16MB C=1  512B t% 3 f% 0   8.44u
X 16MB C=1  128B t% 0 f% 0   3.22u
X 16MB C=1  128B t% 1 f% 1   4.32u
X 16MB C=1  128B t% 0 f% 3  10.43u
X 16MB C=1  128B t% 3 f% 0  10.00u
X 16MB C=1   32B t% 0 f% 0   7.02u
X 16MB C=1   32B t% 1 f% 1   8.02u
X 16MB C=1   32B t% 0 f% 3  12.20u
X 16MB C=1   32B t% 3 f% 0  12.01u
X 16MB C=1    8B t% 0 f% 0  22.05u
X 16MB C=1    8B t% 1 f% 1  25.23u
X 16MB C=1    8B t% 0 f% 3  24.37u
X 16MB C=1    8B t% 3 f% 0  24.35u
X 16MB C=2 2048B t% 0 f% 0   3.01u
X 16MB C=2 2048B t% 1 f% 1   2.44u
X 16MB C=2 2048B t% 0 f% 3   3.06u
X 16MB C=2 2048B t% 3 f% 0   3.21u
X 16MB C=2  512B t% 0 f% 0   2.49u
X 16MB C=2  512B t% 1 f% 1   3.11u
X 16MB C=2  512B t% 0 f% 3   4.09u
X 16MB C=2  512B t% 3 f% 0   3.23u
X 16MB C=2  128B t% 0 f% 0   3.48u
X 16MB C=2  128B t% 1 f% 1   4.10u
X 16MB C=2  128B t% 0 f% 3   4.53u
X 16MB C=2  128B t% 3 f% 0   4.19u
X 16MB C=2   32B t% 0 f% 0   6.10u
X 16MB C=2   32B t% 1 f% 1   6.46u
X 16MB C=2   32B t% 0 f% 3   6.09u
X 16MB C=2   32B t% 3 f% 0   6.29u
X 16MB C=2    8B t% 0 f% 0  29.09u
X 16MB C=2    8B t% 1 f% 1  29.53u
X 16MB C=2    8B t% 0 f% 3  28.50u
X 16MB C=2    8B t% 3 f% 0  28.56u
END_OF_FILE
if test 2040 -ne `wc -c <'CoreSun3.pr'`; then
    echo shar: \"'CoreSun3.pr'\" unpacked with wrong size!
fi
# end of 'CoreSun3.pr'
fi
if test -f 'CoreSv386.h' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'CoreSv386.h'\"
else
echo shar: Extracting \"'CoreSv386.h'\" \(1879 characters\)
sed "s/^X//" >'CoreSv386.h' <<'END_OF_FILE'
X#define ClusterBITS		32	/* Bits in a cluster		*/
X#define ClusterBYTES		4	/* Bytes in a cluster		*/
X#define ClusterLNBYTES		2	/* Log2 of ClusterBYTES		*/
X#define ClusterTYPE		int	/* The type of a cluster	*/
X
X#define ClusterALIGNTO		1	/* This should be 1; 25% diff.	*/
X
X#if (CcIS == CcREGISTER)
X
X#   define ClusterBEST		16	/* Copy clusters when longer	*/
X#   define ClusterDOALIGN	64	/* Align clusters when longer	*/
X
X    /* This is 0, but should be 1. Cannot get asm procs to work! */
X#   if (0 && (CcFEATURE&CcASM))
X
X	asm void CoreByteCopy(to,from,bytes)
X	{
X	%   ureg to,from; reg bytes;
X
X	    movl    to,%edi
X	    movl    from,%esi
X	    movl    bytes,%ecx
X	    rep
X	    movsb   /* (%esi),(%edi) */
X	}
X
X	asm void CoreClusterCopy(to,from,clusters)
X	{
X	%   ureg to,from; reg clusters;
X
X	    movl    to,%edi
X	    movl    from,%esi
X	    movl    clusters,%ecx
X	    rep
X	    movsl   /* (%esi),(%edi) */
X	}
X
X#	define CoreBYTECOPY		CoreByteCopy
X#	define CoreODDCOPY		CoreByteCopy
X#	define CoreCLUSTERCOPY		CoreClusterCopy
X
X#   endif /* 0 */
X
X#   /* This is 1, but should be 0, because we should use inline asm procs */
X#   if (1 && (CcFEATURE&CcASM))
X	/*
X	    Having had a look at the generated code, we know that to is
X	    %esi, from is in %edi, and bytes is in %ebx.
X	*/
X
X#	define CoreBYTECOPY(to,from,bytes)				\
X	begindef							\
X	    fast addressy CoreBytes = (bytes);				\
X	    asm ("	movl	%ebx,%ecx");				\
X	    asm ("	rep");						\
X	    asm ("	movsb	/ (%esi),(%edi)");			\
X	enddef
X
X#	define CoreCLUSTERCOPY(to,from,clusters)			\
X	begindef							\
X	    fast addressy CoreClusters = (clusters);			\
X	    asm ("	movl	%ebx,%ecx");				\
X	    asm ("	rep");						\
X	    asm ("	movsl	/ (%esi),(%edi)");			\
X	enddef
X
X#	define CoreODDCOPY	    CoreBYTECOPY
X
X#   endif /* 1 */
X
X#   ifndef CoreCLUSTERCOPY
X#	define CoreCLUSTERCOPY Core4CLUSTERCOPY
X#   endif
X
X#endif /* CcIS == CcREGISTER */
END_OF_FILE
if test 1879 -ne `wc -c <'CoreSv386.h'`; then
    echo shar: \"'CoreSv386.h'\" unpacked with wrong size!
fi
# end of 'CoreSv386.h'
fi
if test -f 'CoreSv386.pr' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'CoreSv386.pr'\"
else
echo shar: Extracting \"'CoreSv386.pr'\" \(2040 characters\)
sed "s/^X//" >'CoreSv386.pr' <<'END_OF_FILE'
X 16MB C=0 2048B t% 0 f% 0   0.05u
X 16MB C=0 2048B t% 1 f% 1   0.05u
X 16MB C=0 2048B t% 0 f% 3   0.05u
X 16MB C=0 2048B t% 3 f% 0   0.05u
X 16MB C=0  512B t% 0 f% 0   0.20u
X 16MB C=0  512B t% 1 f% 1   0.20u
X 16MB C=0  512B t% 0 f% 3   0.20u
X 16MB C=0  512B t% 3 f% 0   0.20u
X 16MB C=0  128B t% 0 f% 0   0.81u
X 16MB C=0  128B t% 1 f% 1   0.81u
X 16MB C=0  128B t% 0 f% 3   0.81u
X 16MB C=0  128B t% 3 f% 0   0.82u
X 16MB C=0   32B t% 0 f% 0   3.23u
X 16MB C=0   32B t% 1 f% 1   3.23u
X 16MB C=0   32B t% 0 f% 3   3.22u
X 16MB C=0   32B t% 3 f% 0   3.22u
X 16MB C=0    8B t% 0 f% 0  12.88u
X 16MB C=0    8B t% 1 f% 1  12.89u
X 16MB C=0    8B t% 0 f% 3  12.88u
X 16MB C=0    8B t% 3 f% 0  12.88u
X 16MB C=1 2048B t% 0 f% 0   1.39u
X 16MB C=1 2048B t% 1 f% 1   4.08u
X 16MB C=1 2048B t% 0 f% 3   3.16u
X 16MB C=1 2048B t% 3 f% 0   2.28u
X 16MB C=1  512B t% 0 f% 0   1.55u
X 16MB C=1  512B t% 1 f% 1   4.21u
X 16MB C=1  512B t% 0 f% 3   3.32u
X 16MB C=1  512B t% 3 f% 0   2.43u
X 16MB C=1  128B t% 0 f% 0   2.18u
X 16MB C=1  128B t% 1 f% 1   4.84u
X 16MB C=1  128B t% 0 f% 3   3.94u
X 16MB C=1  128B t% 3 f% 0   3.07u
X 16MB C=1   32B t% 0 f% 0   4.71u
X 16MB C=1   32B t% 1 f% 1   7.39u
X 16MB C=1   32B t% 0 f% 3   6.44u
X 16MB C=1   32B t% 3 f% 0   5.64u
X 16MB C=1    8B t% 0 f% 0  14.79u
X 16MB C=1    8B t% 1 f% 1  17.57u
X 16MB C=1    8B t% 0 f% 3  16.10u
X 16MB C=1    8B t% 3 f% 0  16.10u
X 16MB C=2 2048B t% 0 f% 0   1.42u
X 16MB C=2 2048B t% 1 f% 1   1.44u
X 16MB C=2 2048B t% 0 f% 3   2.33u
X 16MB C=2 2048B t% 3 f% 0   2.31u
X 16MB C=2  512B t% 0 f% 0   1.68u
X 16MB C=2  512B t% 1 f% 1   1.77u
X 16MB C=2  512B t% 0 f% 3   2.65u
X 16MB C=2  512B t% 3 f% 0   2.57u
X 16MB C=2  128B t% 0 f% 0   2.72u
X 16MB C=2  128B t% 1 f% 1   3.03u
X 16MB C=2  128B t% 0 f% 3   3.89u
X 16MB C=2  128B t% 3 f% 0   3.61u
X 16MB C=2   32B t% 0 f% 0   6.48u
X 16MB C=2   32B t% 1 f% 1   9.21u
X 16MB C=2   32B t% 0 f% 3   8.28u
X 16MB C=2   32B t% 3 f% 0   7.45u
X 16MB C=2    8B t% 0 f% 0  22.22u
X 16MB C=2    8B t% 1 f% 1  22.24u
X 16MB C=2    8B t% 0 f% 3  22.23u
X 16MB C=2    8B t% 3 f% 0  22.23u
END_OF_FILE
if test 2040 -ne `wc -c <'CoreSv386.pr'`; then
    echo shar: \"'CoreSv386.pr'\" unpacked with wrong size!
fi
# end of 'CoreSv386.pr'
fi
if test -f 'CoreTest.c' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'CoreTest.c'\"
else
echo shar: Extracting \"'CoreTest.c'\" \(3039 characters\)
sed "s/^X//" >'CoreTest.c' <<'END_OF_FILE'
X#include <sys/types.h>
X#include <sys/times.h>
X#include <sys/param.h>
X
X#ifndef HZ
X#    define HZ 60
X#endif
X
X#include <stdio.h>
X
X#ifndef B
X#   define B		4096		/* Maximum & default # of bytes	*/
X#endif
X#ifndef M
X#   define M		(16<<20)	/* Default megabytes copied	*/
X#endif
X
typedef char		*(*method)();
X
static time_t		measure(p,t,f,b)
X    register method	    p;
X    register char	    *t,*f;
X    register unsigned	    b;
X{
X    register unsigned	    i;
X    struct tms		    tms;
X    time_t		    utime;
X
X    (void) times(&tms);
X    utime = tms.tms_utime;
X
X    for (i = 0; i < M; i += b)
X	(void) (*p)(t,f,b);
X
X    (void) times(&tms);
X    return tms.tms_utime - utime;
X}
X
extern char		*null();
extern char		*memcpy();
extern char		*CoreCopy();
extern char		*copy1();
extern char		*copy2();
extern char		*copy3();
X
static method		methods[] = {null,memcpy,CoreCopy,copy1,copy2,copy3};
static unsigned		nmethods = sizeof methods/sizeof (method);
X
X#define SLOP		sizeof (long unsigned)
X
long unsigned		alignit1;
char			bfrom[B+SLOP];
X
long unsigned		alignit2;
char			bto[B+SLOP];
X
extern int		main(argc,argv)
X    int			    argc;
X    char		    **argv;
X{
X    register unsigned	    i,b;
X    register char	    *f,*t;
X    unsigned		    of,ot;
X    unsigned		    m;
X    time_t		    utime;
X    
X
X    m	= (argc <= 1) ? 0 : atoi(argv[1]);
X    b	= (argc <= 2) ? B : atoi(argv[2]);
X    ot	= (argc <= 3) ? 1 : atoi(argv[3]);
X    of	= (argc <= 4) ? 1 : atoi(argv[4]);
X
X    if (m >= nmethods)	    m = 1;
X    if (b > B)		    b = B;
X    if (ot > SLOP)	    ot %= SLOP;
X    if (of > SLOP)	    of %= SLOP;
X
X    f = bfrom + of; t = bto + ot;
X
X    printf("%3uMB C=%u %4uB t%% %u f%% %u ",
X	    M>>20,m,b,(unsigned) t%SLOP,(unsigned) f%SLOP);
X    fflush(stdout);
X
X    utime = measure(methods[m],f,t,b);
X
X    printf("%3u.%02uu\n",utime/HZ,utime%HZ);
X    fflush(stdout);
X
X    return 0;
X}
X
extern char		*null(to,from,bytes)
X    register char	    *to,*from;
X    register unsigned	    bytes;
X{
X    return to+bytes;
X}
X
extern char		*copy1(to,from,bytes)
X    register char	    *to,*from;
X    register unsigned	    bytes;
X{
X    if (bytes)
X    {
X	do *to++ = *from++;
X	while (--bytes);
X    }
X
X    return to;
X}
X
extern char		*copy2(to,from,bytes)
X    register char	    *to,*from;
X    register unsigned	    bytes;
X{
X    while (bytes >= sizeof (long))
X    {
X	*(long *) to = *(long *) from;
X	to += sizeof (long), from += sizeof (long);
X	bytes -= sizeof (long);
X    }
X
X    if (bytes)
X    {
X	do *to++ = *from++;
X	while (--bytes);
X    }
X
X    return to;
X}
X
extern char		*copy3(to,from,bytes)
X    register char	    *to,*from;
X    register unsigned	    bytes;
X{
X    while (bytes >= 2*sizeof (long))
X    {
X	*(long *) to = *(long *) from;
X	*((long *) to + 1) = *((long *) from +1);
X	to +=2*sizeof (long), from += 2*sizeof (long);
X	bytes -= 2*sizeof (long);
X    }
X
X    while (bytes >= sizeof (long))
X    {
X	*(long *) to = *(long *) from;
X	to += sizeof (long), from += sizeof (long);
X	bytes -= sizeof (long);
X    }
X
X    if (bytes)
X    {
X	do *to++ = *from++;
X	while (--bytes);
X    }
X
X    return to;
X}
END_OF_FILE
if test 3039 -ne `wc -c <'CoreTest.c'`; then
    echo shar: \"'CoreTest.c'\" unpacked with wrong size!
fi
# end of 'CoreTest.c'
fi
echo shar: End of shell archive.
exit 0
--
Piercarlo "Peter" Grandi           | ARPA: pcg%uk.ac.aber.cs at nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg at cs.aber.ac.uk