Alignment IS important
Piercarlo Grandi
pcg at cs.aber.ac.uk
Wed Sep 26 02:19:13 AEST 1990
There has been some debate in this newsgroup about the importance of
aligned memory access. I have finally neatly packaged my own technology
for doing core-to-core memory copies, aligned and unaligned, and here I
am posting the technology and some discussion of the results.
This article is posted to comp.arch because it discusses architecture,
and to alt.sources because it contains generally useful source code.
Usual disclaimer: this work has no relationship whatever to that
of the University College of Wales; it was performed exclusively
by me, with the use of my own time, funds, machines and know-how,
and has not been aided, abetted or supported in any way by the
University College of Wales. I thank them for providing the
opportunity to access News and therefore to post this article,
about which they do not actually know anything.
This article is about a library function essentially equivalent to
memcpy(3), which I have called CoreCopy(). It does not handle
overlapping moves, although it would not be difficult to extend it to do so.
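A minimal sketch of such an extension, assuming a hypothetical wrapper called overlap_copy() that is not part of the posted archive. It only chooses the copy direction; a real version would hand the non-overlapping case to CoreCopy() and would need a descending variant of the cluster copy for the other case. Like the C paths of CoreCopy(), it returns the end of the destination area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper (not in the posted archive): copies 'bytes' bytes
   from 'from' to 'to' and tolerates overlapping areas. */
char *overlap_copy(char *to, const char *from, size_t bytes)
{
    size_t i;

    assert(bytes == 0 || (to != NULL && from != NULL));

    if ((uintptr_t) to <= (uintptr_t) from
        || (uintptr_t) to >= (uintptr_t) from + bytes)
    {
        /* Destination is below the source, or clear of it: an ascending
           copy never overwrites bytes that are still to be read. */
        for (i = 0; i < bytes; i++)
            to[i] = from[i];
    }
    else
    {
        /* Destination starts inside the source area: copy descending. */
        for (i = bytes; i > 0; i--)
            to[i - 1] = from[i - 1];
    }

    return to + bytes;
}
```

The pointer comparison through uintptr_t assumes a flat address space, which holds on the machines discussed here.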
It is very portable, but also highly tuned and parametrized on the machine
characteristics. I use a set of my own (longish) headers for the parametric
information; they are summarized here in the file "CoreHdr.h". I hope
that the parameters are self-explanatory, and that the way they are used
in the CoreCopy source is clear, even though virtually all of the source
is preprocessor source, which I have tried to make as readable as
possible. This is one obvious case where very careful hand optimization and
parametrization pays off, as core-to-core copy bandwidth is often crucial,
e.g. to the overall efficiency of the UNIX kernel.
Here is a list of files contained in the attached shar and their
contents:
Core.h The user interface of CoreCopy()
CoreCopy.c The source for CoreCopy()
CoreHdr.h Environment parameters
CoreSun3.h Tuning parameters for Sun 3 machines
CoreSv386.h Tuning parameters for SysV/386 machines
CoreTest.c A program to "benchmark" CoreCopy()
CoreRun.sh A shell script to run CoreTest
CoreSv386.pr Results of running CoreRun.sh under SysV/386
CoreSun3.pr Results of running CoreRun.sh on a Sun 3
I will provide here some comments on the "benchmark" results:
The benchmarks involve three cases, copying a total of 16MB in chunks of 8,
32, 128, 512 and 2048 bytes; each copy is done first with both source and
destination aligned on a "double" boundary, then with both misaligned by 1
byte, then with source aligned and destination misaligned by 3 bytes, and
then the reverse.
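The four alignment combinations fall into three classes, which matter because a copy routine can align both pointers with one head copy only when they share the same offset. A small sketch of that classification follows; the names classify() and WORD_BYTES are hypothetical, and the 4-byte word matches the cluster size in the posted headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BYTES 4u   /* assumed word (cluster) size in bytes */

enum align_case {
    BOTH_ALIGNED,       /* e.g. offsets 0 and 0 */
    SAME_OFFSET,        /* e.g. offsets 1 and 1: one head copy aligns both */
    DIFFERENT_OFFSET    /* e.g. offsets 0 and 3: only one side can be aligned */
};

/* Classify a copy by the word offsets of its two pointers. */
enum align_case classify(const void *to, const void *from)
{
    uintptr_t t = (uintptr_t) to % WORD_BYTES;
    uintptr_t f = (uintptr_t) from % WORD_BYTES;

    assert(to != NULL && from != NULL);

    if (t == 0 && f == 0)
        return BOTH_ALIGNED;
    if (t == f)
        return SAME_OFFSET;
    return DIFFERENT_OFFSET;
}
```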
User time in seconds.centiseconds is reported, as returned by the OS.
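One caveat when reading the figures: CoreTest.c prints the fractional part as utime%HZ, which is hundredths of a second only when HZ is 100. On a system where HZ is 60 (the fallback in CoreTest.c), the digits after the point are sixtieths. A sketch of a conversion that always yields centiseconds, with hypothetical helper names:

```c
#include <assert.h>
#include <stdio.h>

/* Convert a count of clock ticks to whole centiseconds, rounding down.
   'hz' is the number of ticks per second. */
unsigned long ticks_to_centiseconds(unsigned long ticks, unsigned long hz)
{
    assert(hz > 0);
    return (ticks / hz) * 100 + (ticks % hz) * 100 / hz;
}

/* Print a tick count in the "seconds.centiseconds" layout used by the
   result files. */
void print_user_time(unsigned long ticks, unsigned long hz)
{
    unsigned long cs = ticks_to_centiseconds(ticks, hz);

    printf("%3lu.%02luu\n", cs / 100, cs % 100);
}
```

For example, 135 ticks at 60 ticks per second are 2.25 seconds, whereas printing 135/60 and 135%60 directly would show "2.15".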
The first case does not really copy anything; it is run just to give an idea
of the function calling overhead, which dominates when copying small chunks.
The second case copies using the system-provided memcpy(3) function; the
third case runs CoreCopy() itself. (You can run additional cases if you
want, just for comparison, as they do not involve CoreCopy() itself; look
at the "CoreTest.c" file.)
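The null case can be used to subtract the calling overhead from the other two cases. A minimal sketch of that arithmetic, with a hypothetical function name, assuming the reported times are read as decimal seconds:

```c
#include <assert.h>

/* Net copy bandwidth in MB/s: total megabytes copied divided by the time
   left after subtracting the time of the null (no copy) case. */
double net_bandwidth(double total_mb, double copy_secs, double null_secs)
{
    double net = copy_secs - null_secs;

    assert(net > 0.0);
    return total_mb / net;
}
```

With the Sv386 figures for aligned 512-byte chunks in the result file (1.55u for memcpy(3), 1.68u for CoreCopy(), 0.20u for the null case), this gives roughly 11.9 and 10.8 MB/s respectively.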
Some parts of the "benchmark" may not be perfectly portable (one example is
that I ensure that the source and destination buffers are aligned by putting
a definition for a 'double' before their definition), but they should work
on nearly every common architecture I can imagine.
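Placing an aligning variable before a buffer relies on the compiler laying out the two definitions next to each other (CoreTest.c as posted uses a 'long unsigned' for this). A union forces the alignment without that assumption. The sketch below is an alternative, not the code in the archive; the names aligned_buf, buffer_at_offset() and misalignment() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_BYTES 4096                  /* same as B in CoreTest.c */
#define SLOP sizeof (unsigned long)     /* same as SLOP in CoreTest.c */

/* The char array shares its start address with the other members, so it
   begins on a boundary suitable for a double and for an unsigned long. */
union aligned_buf {
    double force_double;
    unsigned long force_long;
    char bytes[BUF_BYTES + sizeof (unsigned long)];
};

static union aligned_buf from_buf, to_buf;

/* Return a pointer 'offset' bytes (modulo SLOP) past the aligned start. */
char *buffer_at_offset(union aligned_buf *buf, unsigned offset)
{
    assert(buf != NULL);
    return buf->bytes + offset % SLOP;
}

/* The address modulo SLOP, as printed in the "t%" and "f%" columns. */
unsigned misalignment(const char *p)
{
    return (unsigned) ((uintptr_t) p % SLOP);
}
```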
Environment of benchmarks:
Sv386 is an i386DX 20MHz with (write-thru) 64KB cache, running System
V/386 with the Register C Compiler. As a very rough measure of power,
it does a bit more than 6000 2.x Dhrystones.
Sun3 is a Sun 3/280, 68020 25MHz with (write-thru?) cache, running
SunOS 4.0.3 with the PCC-descended compiler. This also does a bit
more than 6000 2.x Dhrystones.
Here is a subset of the results; on the left is the Sun3, on the right is
the Sv386. I have chosen block sizes of 512, because it is large enough
that procedure call overhead is not large, and 32, because it is small
enough that the overhead starts to matter.
 .------------------- Size of block copied in bytes
 |
 |       .----------- Destination address modulus 4
 |       |
 |       |     .----- Source address modulus 4
 |       |     |
 |       |     |     .- Time in seconds.centiseconds to copy 16MB
 |       |     |     |
 V       V     V     V

     Sun3 memcpy(3)                 Sv386 memcpy(3)

512B  t% 0  f% 0   2.25u        512B  t% 0  f% 0   1.55u
512B  t% 1  f% 1   3.02u        512B  t% 1  f% 1   4.21u
512B  t% 0  f% 3   8.38u        512B  t% 0  f% 3   3.32u
512B  t% 3  f% 0   8.44u        512B  t% 3  f% 0   2.43u
 32B  t% 0  f% 0   7.02u         32B  t% 0  f% 0   4.71u
 32B  t% 1  f% 1   8.02u         32B  t% 1  f% 1   7.39u
 32B  t% 0  f% 3  12.20u         32B  t% 0  f% 3   6.44u
 32B  t% 3  f% 0  12.01u         32B  t% 3  f% 0   5.64u

     Sun3 CoreCopy()                Sv386 CoreCopy()

512B  t% 0  f% 0   2.49u        512B  t% 0  f% 0   1.68u
512B  t% 1  f% 1   3.11u        512B  t% 1  f% 1   1.77u
512B  t% 0  f% 3   4.09u        512B  t% 0  f% 3   2.65u
512B  t% 3  f% 0   3.23u        512B  t% 3  f% 0   2.57u
 32B  t% 0  f% 0   6.10u         32B  t% 0  f% 0   6.48u
 32B  t% 1  f% 1   6.46u         32B  t% 1  f% 1   9.21u
 32B  t% 0  f% 3   6.09u         32B  t% 0  f% 3   8.28u
 32B  t% 3  f% 0   6.29u         32B  t% 3  f% 0   7.45u
The results are often surprising, and must be analyzed with some detailed
knowledge of the logic used by CoreCopy(), memcpy(3), and the performance
profiles of the compiler and CPU architecture and implementation
involved (please also refer to the full set of results in the shar
archive below).
In general CoreCopy() is as fast as, or just a little slower than, the
built-in memcpy(3) function for aligned copies; it is usually much faster
for unaligned copies. This holds true down to fairly small chunk sizes; for
very small chunk sizes the higher overheads of CoreCopy() become more
important.
I have not included any statistics on this, but aligning the destination
instead of the source does provide a significant performance benefit (the
comment on ClusterALIGNTO in "CoreSv386.h" puts it at 25%). Another
interesting note is that (4-way) loop unrolling does not buy much on the
machines I have used; tight loops are probably just as good in this case,
because of pipelining or something else. What does help is unrolling the
code that copies the misaligned head and tail of the core area, because
on most machines the head and tail are 1, 2 or 3 bytes long, so a 4-way
unrolled loop never repeats.
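The overall shape of the copy (byte copy for small blocks, then head, clusters and tail with the destination aligned) can be sketched in plain C as follows. This is a simplified illustration rather than the macro-based source in the archive: the function names are hypothetical, the thresholds are the ones set in the posted Sun 3 and SysV/386 headers, and the fixed-size memcpy() calls stand in for the word load and store so that the sketch stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CLUSTER_BYTES   4u   /* ClusterBYTES in the posted headers */
#define CLUSTER_BEST    16u  /* below this size, just copy bytes */
#define CLUSTER_DOALIGN 64u  /* at or above this size, align first */

/* Copy n bytes one at a time, advancing both pointers. */
static void copy_bytes(char **to, const char **from, size_t n)
{
    while (n-- > 0)
        *(*to)++ = *(*from)++;
}

/* Returns a pointer to the end of the destination area. */
char *sketch_copy(char *to, const char *from, size_t bytes)
{
    size_t clusters;

    assert(bytes == 0 || (to != NULL && from != NULL));

    if (bytes < CLUSTER_BEST) {
        copy_bytes(&to, &from, bytes);
        return to;
    }

    if (bytes >= CLUSTER_DOALIGN) {
        /* Head: 1 to 3 bytes, until the destination is cluster aligned. */
        size_t odd = (size_t) ((uintptr_t) to % CLUSTER_BYTES);

        if (odd != 0) {
            odd = CLUSTER_BYTES - odd;
            copy_bytes(&to, &from, odd);
            bytes -= odd;
        }
    }

    /* Body: whole clusters; a fixed-size memcpy() is the portable way to
       spell one word load plus one word store. */
    for (clusters = bytes / CLUSTER_BYTES; clusters > 0; --clusters) {
        memcpy(to, from, CLUSTER_BYTES);
        to += CLUSTER_BYTES, from += CLUSTER_BYTES;
    }

    /* Tail: the 0 to 3 bytes left over. */
    copy_bytes(&to, &from, bytes % CLUSTER_BYTES);
    return to;
}
```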
Substituting CoreCopy() for memcpy(3) on each of the tested machines
would probably provide overall benefits, because CoreCopy() is only a
little worse than memcpy(3) with aligned copies, but usually
dramatically better with unaligned ones. In particular, if you use it in
the "insdel.c" module of GNU Emacs, for which a suitable patch will be
posted, you may experience huge speedups; currently "insdel.c" uses a
char-by-char loop coded in C to shift the buffer. Even just using the
system-provided bcopy(3) or memcpy(3) will help a lot.
Inline assembler code is essential to performance on the 386, but not on
the 68020; this is probably because the code to do a string copy on a
386 is fairly large. Using the built-in string copy instructions
provides a 3x speedup, probably mostly because of savings on instruction
word fetches and the limited pipelining of the 386. It would be interesting
to see how the 486 compares. It is interesting to note that memcpy(3) on the
386 also uses the string copy instructions; it ignores alignment, however,
and copies word by word until there is less than a wordful of bytes left,
and then byte by byte.
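The word-then-byte strategy with string instructions can be sketched as below. This is an illustration under assumptions, not the SysV/386 library code: it uses GCC/Clang extended asm rather than the Register C Compiler syntax in "CoreSv386.h", the function name string_copy() is hypothetical, and on other compilers or CPUs it falls back to a plain byte loop.

```c
#include <assert.h>
#include <stddef.h>

/* Copy 'bytes' bytes, ignoring alignment: 4-byte words first, then the
   remaining bytes. Returns the end of the destination area. */
char *string_copy(char *to, const char *from, size_t bytes)
{
    assert(bytes == 0 || (to != NULL && from != NULL));

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    {
        size_t words = bytes / 4;
        size_t tail = bytes % 4;

        /* rep movsl copies 'words' 4-byte units, advancing both pointers;
           the ABI guarantees the direction flag is clear on entry. */
        __asm__ volatile ("rep movsl"
                          : "+D" (to), "+S" (from), "+c" (words)
                          :
                          : "memory");

        /* rep movsb copies the 0 to 3 bytes that are left. */
        __asm__ volatile ("rep movsb"
                          : "+D" (to), "+S" (from), "+c" (tail)
                          :
                          : "memory");
    }
#else
    /* Portable fallback: plain byte loop. */
    while (bytes-- > 0)
        *to++ = *from++;
#endif

    return to;
}
```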
I think that some inline machine language would also be vital for machines
like the MIPS or SPARC, which generally trap to support unaligned accesses.
I did actually run some cases on a MIPS, but the unaligned cases are simply
too slow because of trapping (in the aligned ones CoreCopy() is as quick as
the assembler-coded memcpy(3)).
You are welcome to provide machine-dependent headers for other architectures
and compilers, and to experiment with the various parameters, thresholds,
etc. that you will find in the source. I would be interested in knowing the
times for the VAX-11/780, on which the first incarnation of this function
was developed (as soon as I had read how the CPU-SBI-Memory interface worked
for byte stores :->).
The source and the full result files are in the following shar archive.
---------------------------cut here---------------------------------------
#! /bin/sh
# This is a shell archive. Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file". To overwrite existing
# files, type "sh file -c". You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g.. If this archive is complete, you
# will see the following message at the end:
# "End of shell archive."
# Contents: Core.h CoreCopy.c CoreHdr.h CoreRun.sh CoreSun3.h
# CoreSun3.pr CoreSv386.h CoreSv386.pr CoreTest.c
# Wrapped by pcg at thor on Tue Sep 25 16:32:40 1990
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
echo '
Copyright 1982,1990 Piercarlo Grandi. All rights reserved.
This shar archive contains free software; you can redistribute
it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation; either
version 1, or (at your option) any later version.
This shar archive is distributed in the hope that it will be
useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE. See the GNU General Public License for more details.
You may have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
'
sleep 4
if test -f 'Core.h' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'Core.h'\"
else
echo shar: Extracting \"'Core.h'\" \(487 characters\)
sed "s/^X//" >'Core.h' <<'END_OF_FILE'
X#ifndef Core_H
X#define Core_H
X#if __STDC__
X# pragma once
X#endif
X
X#if 0
X#ifndef Extend_H
X# include "Extend.h"
X#endif
X#endif
X
X/*
X This is a set of library routines to allocate virtual memory and
X manipulate it. It strives to be reliable and consistent,
X efficient and portable. Unfortunately this latter quality is
X more difficult to obtain than the others for such a low level
X library.
X*/
X
extern pointer CoreCopy of((pointer,pointer,addressy));
X
X#endif /* Core_H */
END_OF_FILE
if test 487 -ne `wc -c <'Core.h'`; then
echo shar: \"'Core.h'\" unpacked with wrong size!
fi
# end of 'Core.h'
fi
if test -f 'CoreCopy.c' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'CoreCopy.c'\"
else
echo shar: Extracting \"'CoreCopy.c'\" \(8519 characters\)
sed "s/^X//" >'CoreCopy.c' <<'END_OF_FILE'
X#if 1
X# include "CoreHdr.h"
X#else
X#ifndef Extend_h
X# include "Extend.h"
X#endif
X
X#include <import>
X#ifndef Here_h
X# include "Here.h"
X#endif
X#ifndef With_h
X# include "With.h"
X#endif
X#ifndef Type_h
X# include "Type.h"
X#endif
X#ifndef Convert_h
X# include "Convert.h"
X#endif
X#ifndef Bits_h
X# include "Bits.h"
X#endif
X#ifndef Assert_h
X# include "Assert.h"
X#endif
X
X#include <export>
X#endif /* 1 */
X
X#ifndef Core_h
X# include "Core.h"
X#endif
X
X#if ((CcFEATURE & CcKR78) != CcKR78)
X# include "ERROR: language supported too old"
X#endif
X
X/*
X This function is handed pointers to two memory areas, and copies as many
X units as it is told from the second to the first. The two areas are
X expected to begin at any byte boundary, and the size is given in bytes
X too.
X
X If the memory subsystem of the machine handles more efficiently
X naturally aligned requests in clusters (multiples of a unit), we try to
X take advantage of that. Since we cannot take advantage of moving clusters
X for both source and destination, we optimize the writing of clusters, of
X course...
X*/
X
X/*
X We need three copying operations. Only the first is always needed,
X the remaining two are needed only if copying by clusters pays.
X
X BYTECOPY(to,from,bytes) copies bytes by byte, the number of bytes is
X guaranteed to be >= 0.
X
X ODDCOPY(to,from,bytes) also copies byte by byte, but the number of bytes
X is guaranteed to be >= 0 && < ClusterBYTES.
X
X CLUSTERCOPY(to,from,clusters) copies cluster by cluster, and the number
X of clusters is guaranteed to be > 0.
X
X For all these macros, the value of bytes is not touched, but to and from
X are updated to point to the end of the copied area.
X*/
X
X#if (CpuIS == CpuIAPX && CpuMODEL == 0x0386)
X# include "CoreSv386.h"
X#endif /* CpuIAPX && 0x0386 */
X
X#if (CpuIS == CpuMC68000 && CpuMODEL == 0x0020)
X# include "CoreSun3.h"
X#endif /* CpuMC68000 && 0x0020 */
X
X#if (CpuIS == CpuMIPS /* && CpuMODEL == 0x3000 */)
X# include "CoreMips.h"
X#endif /* CpuMIPS */
X
X#ifndef ClusterBITS
X
X# ifdef CoreFASTALIGN
X# define ClusterBITS (CoreFASTALIGN*CpuUNIT)
X# else
X# if ((CoreFEATURE & (CoreDCACHE|CoreWRITETHRU)) == (CoreDCACHE))
X# define ClusterBITS (CoreCACHELINE*CpuUNIT)
X# else
X# ifdef CoreCORELINE
X# define ClusterBITS (CoreCORELINE*CpuUNIT)
X# else
X# ifdef CoreINTERLEAVE
X# define ClusterBITS (CoreINTERLEAVE*CpuUNIT)
X# else
X# define ClusterBITS CpuUNIT
X# endif
X# endif
X# endif
X# endif
X
X# if (ClusterBITS >= LongBITS && (ClusterBITS % LongBITS) == 0)
X# undef ClusterBITS
X# define ClusterBITS LongBITS
X# endif
X
X# if ((ClusterBITS % ByteBITS) == 0)
X# define ClusterBYTES (ClusterBITS/ByteBITS)
X# else
X# include "ERROR: Cluster size is not an even # of bytes"
X# endif
X
X#endif /* ndef ClusterBITS */
X
X#if (ClusterBYTES > 1)
X
X# ifndef ClusterLNBYTES
X# if (ClusterBYTES == 2)
X# define ClusterLNBYTES 1
X# endif
X# if (ClusterBYTES == 4)
X# define ClusterLNBYTES 2
X# endif
X# if (ClusterBYTES == 8)
X# define ClusterLNBYTES 3
X# endif
X# endif
X
X# ifndef ClusterBEST
X# if (CoreFEATURE & CoreWRITETHRU)
X# define ClusterBEST (ClusterBYTES*4)
X# else
X# define ClusterBEST (ClusterBYTES*8)
X# endif
X# endif
X
X# ifndef ClusterALIGNTO
X# define ClusterALIGNTO 1
X# endif
X
X# ifndef ClusterDOALIGN
X# define ClusterDOALIGN (ClusterBEST*4)
X# endif
X
X# ifndef ClusterTYPE
X# if (ClusterBITS == ShortBITS && !defined ClusterTYPE)
X# define ClusterTYPE short
X# endif
X# if (ClusterBITS == IntBITS && !defined ClusterTYPE)
X# define ClusterTYPE int
X# endif
X# if (ClusterBITS == LongBITS && !defined ClusterTYPE)
X# define ClusterTYPE long
X# endif
X# if (!defined ClusterTYPE)
X# include "ERROR: cannot define a sensible ClusterTYPE"
X# endif
X# endif
X
X# if (!defined ClusterREM && defined ClusterLNBYTES)
X# define ClusterREM(n) ((n) & (ClusterBYTES-1))
X# define ClusterDIV(n) ((n) >> ClusterLNBYTES)
X# else
X# define ClusterREM(n) ((n) % ClusterBYTES)
X# define ClusterDIV(n) ((n) / ClusterBYTES)
X# endif
X
X# if (!defined Core4CLUSTERCOPY \
X && (CodeREGISTERS >= 6 || CodePREGISTERS >= 5))
X# define Core4CLUSTERCOPY(to,from,clusters) \
X begindef \
X fast ClusterTYPE *CoreTo = (ClusterTYPE *) (to); \
X fast ClusterTYPE *CoreFrom = (ClusterTYPE *) (from); \
X fast addressy CoreClusters = (clusters); \
X while (CoreClusters) switch (CoreClusters) \
X { \
X default: *CoreTo++ = *CoreFrom++; --CoreClusters; \
X case 3: *CoreTo++ = *CoreFrom++; --CoreClusters; \
X case 2: *CoreTo++ = *CoreFrom++; --CoreClusters; \
X case 1: *CoreTo++ = *CoreFrom++; --CoreClusters; \
X case 0: break; /* keep this "useless" break in ... */ \
X } \
X /* do *CoreTo++ = *CoreFrom++; while (--CoreClusters); */ \
X (to) = (pointer) CoreTo, (from) = (pointer) CoreFrom; \
X enddef
X# endif
X
X# ifndef Core4CLUSTERCOPY
X# define Core4CLUSTERCOPY(to,from,clusters) \
X begindef \
X fast addressy CoreClusters = (clusters); \
X while (CoreClusters) switch (CoreClusters) \
X { \
X default: \
X *(ClusterTYPE *) (to) = *(ClusterTYPE *) (from); \
X (to) += ClusterBYTES, (from) += ClusterBYTES; \
X --CoreClusters; \
X case 3: \
X *(ClusterTYPE *) (to) = *(ClusterTYPE *) (from); \
X (to) += ClusterBYTES, (from) += ClusterBYTES; \
X --CoreClusters; \
X case 2: \
X *(ClusterTYPE *) (to) = *(ClusterTYPE *) (from); \
X (to) += ClusterBYTES, (from) += ClusterBYTES; \
X --CoreClusters; \
X case 1: \
X *(ClusterTYPE *) (to) = *(ClusterTYPE *) (from); \
X (to) += ClusterBYTES, (from) += ClusterBYTES; \
X --CoreClusters; \
X case 0: break; /* keep this "useless" break in ... */ \
X } \
X enddef
X# endif
X
X# ifndef CoreCLUSTERCOPY
X# define CoreCLUSTERCOPY(to,from,clusters) \
X begindef \
X fast addressy CoreClusters = (clusters); \
X do { *(ClusterTYPE *) (to) = *(ClusterTYPE *) (from); \
X (to) += ClusterBYTES, (from) += ClusterBYTES; \
X } while (--CoreClusters); \
X enddef
X# endif
X
X#endif /* ClusterBYTES > 1 */
X
X#ifndef Core4BYTECOPY
X# define Core4BYTECOPY(to,from,bytes) \
X begindef \
X fast addressy CoreBytes = (bytes); \
X while (CoreBytes) switch (CoreBytes) \
X { \
X default: *(to)++ = *(from)++; --CoreBytes; \
X case 3: *(to)++ = *(from)++; --CoreBytes; \
X case 2: *(to)++ = *(from)++; --CoreBytes; \
X case 1: *(to)++ = *(from)++; --CoreBytes; \
X case 0: break; /* keep this "useless" break in ... */ \
X } \
X enddef
X#endif /* ndef Core4BYTECOPY */
X
X/*
X You may want to define this, but at least on my machine (iAPX 386)
X unrolling loops does not pay.
X*/
X
X#ifndef CoreBYTECOPY
X# if ((CpuFEATURE&CpuPIPELINE) && !(CpuIS == CpuIAPX && CpuMODEL == 0x0386))
X# define CoreBYTECOPY Core4BYTECOPY
X# endif
X#endif
X
X#ifndef CoreBYTECOPY
X# define CoreBYTECOPY(to,from,bytes) \
X begindef \
X fast addressy CoreBytes = (bytes); \
X while (CoreBytes) *(to)++ = *(from)++, --CoreBytes; \
X enddef
X#endif /* ndef CoreBYTECOPY */
X
X#ifndef CoreODDCOPY
X# if (ClusterBYTES <= 4)
X# define CoreODDCOPY Core4BYTECOPY
X# else
X# define CoreODDCOPY CoreBYTECOPY
X# endif
X#endif /* ndef CoreODDCOPY */
X
global pointer CoreCopy(to,from,bytes)
X fast pointer to;
X fast pointer from;
X addressy bytes;
X{
X# ifndef CoreCLUSTERCOPY
X CoreBYTECOPY(to,from,bytes);
X# else
X {
X copySmallBlock:
X
X if (bytes < ClusterBEST)
X {
X CoreBYTECOPY(to,from,bytes);
X return to;
X }
X
X# if (ClusterDOALIGN != 0)
X {
X /*
X Note that here we usually want align cluster transfers
X on 'to', as we care more about aligning writes than
X reads, that are often easier to pipeline.
X */
X
X copyHead:
X
X if (bytes >= ClusterDOALIGN)
X {
X addressy odd;
X
X# if (ClusterALIGNTO)
X# define ClusterALIGN to
X# else
X# define ClusterALIGN from
X# endif
X
X if ((odd = ClusterREM((addressy) ClusterALIGN)) != 0)
X {
X CoreODDCOPY(to,from,odd = ClusterBYTES - odd);
X bytes -= odd;
X }
X
X# undef ClusterALIGN
X }
X }
X# endif /* ClusterDOALIGN != 0 */
X
X copyClusters:
X
X assert (ClusterREM((addressy) to) == 0,"CoreCopy");
X CoreCLUSTERCOPY(to,from,ClusterDIV(bytes));
X assert (ClusterREM((addressy) to) == 0,"CoreCopy");
X
X copyTail:
X
X CoreODDCOPY(to,from,ClusterREM(bytes));
X }
X#endif /* ndef CoreCLUSTERCOPY */
X
X return to;
X}
END_OF_FILE
if test 8519 -ne `wc -c <'CoreCopy.c'`; then
echo shar: \"'CoreCopy.c'\" unpacked with wrong size!
fi
# end of 'CoreCopy.c'
fi
if test -f 'CoreHdr.h' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'CoreHdr.h'\"
else
echo shar: Extracting \"'CoreHdr.h'\" \(2367 characters\)
sed "s/^X//" >'CoreHdr.h' <<'END_OF_FILE'
X#define CpuIAPX 0x0005
X#define CpuMC68000 0x0006
X#define CpuMIPS 0x0007
X
X#ifdef i386
X#define CpuIS CpuIAPX /* Type of instruction set */
X#define CpuMODEL 0x0386 /* In HEX ! */
X#endif
X#ifdef sun3
X#define CpuIS CpuMC68000 /* Type of instruction set */
X#define CpuMODEL 0x0020 /* In HEX ! */
X#endif
X#ifdef mips
X#define CpuIS CpuMIPS /* Type of instruction set */
X#define CpuMODEL 0x3000 /* In HEX ! */
X#endif
X
X#define CpuUNIT 8 /* Bits in addressable unit */
X
X#define CpuFEATURE 0x0008 /* Peculiarities of CPU */
X#define CpuPIPELINE 0x0002 /* Multi stage command obey */
X#define CpuDALIGN 0x000a /* Must align data */
X
X#define CoreFEATURE 0x0006 /* Peculiarities of memory sys */
X#define CoreDCACHE 0x0002 /* Has a DATA cache */
X#define CoreWRITETHRU 0x0004 /* Updates directly to memory */
X
X#define CoreCACHELINE 16 /* D cache line size in units */
X#define CoreCORELINE 4 /* Units to/from mem at a time */
X#define CoreINTERLEAVE 1 /* Interleaving in units */
X#define CoreFASTALIGN 4 /* Align at this for fast move */
X
X#define CcPORTABLE 0x0005 /* Johnson's classic */
X#define CcREGISTER 0x0006 /* Successor to PORTABLE */
X
X#ifdef i386
X#define CcIS CcREGISTER /* Type (author) of compiler */
X#define CodeREGISTERS 5 /* Spare universal registers */
X#define CodeDREGISTERS 0 /* Spare data only registers */
X#define CodePREGISTERS 0 /* Spare pointer only registers */
X#endif
X#ifdef sun3
X#define CcIS CcPORTABLE /* Type (author) of compiler */
X#define CodeREGISTERS 0 /* Spare universal registers */
X#define CodeDREGISTERS 3 /* Spare data only registers */
X#define CodePREGISTERS 3 /* Spare pointer only registers */
X#endif
X#ifdef mips
X#define CcIS CcPORTABLE /* Type (author) of compiler */
X#define CodeREGISTERS 8 /* Spare universal registers */
X#define CodeDREGISTERS 0 /* Spare data only registers */
X#define CodePREGISTERS 0 /* Spare pointer only registers */
X#endif
X
X#define CcFEATURE 0x027f /* Compiler dependent C */
X#define CcKR78 0x007f /*=All that is in K&R 1st ed. */
X#define CcASM 0x0200 /*!asm(" ... "); */
X
X#define ByteBITS 8
X#define ShortBITS 16
X#define IntBITS 32
X#define LongBITS 32
X
X
X#define of(ARGS) (/* ARGS */)
X#define begindef do {
X#define enddef } while (0)
X
X#define global /* extern */
X#define fast register
X#define assert(c,m) /* no op */
X
typedef unsigned addressy;
typedef char *pointer;
END_OF_FILE
if test 2367 -ne `wc -c <'CoreHdr.h'`; then
echo shar: \"'CoreHdr.h'\" unpacked with wrong size!
fi
# end of 'CoreHdr.h'
fi
if test -f 'CoreRun.sh' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'CoreRun.sh'\"
else
echo shar: Extracting \"'CoreRun.sh'\" \(126 characters\)
sed "s/^X//" >'CoreRun.sh' <<'END_OF_FILE'
for C in 0 1 2
do
X for B in 2048 512 128 32 8
X do
X $1 $C $B 0 0
X $1 $C $B 1 1
X $1 $C $B 0 3
X $1 $C $B 3 0
X done
done
END_OF_FILE
if test 126 -ne `wc -c <'CoreRun.sh'`; then
echo shar: \"'CoreRun.sh'\" unpacked with wrong size!
fi
chmod +x 'CoreRun.sh'
# end of 'CoreRun.sh'
fi
if test -f 'CoreSun3.h' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'CoreSun3.h'\"
else
echo shar: Extracting \"'CoreSun3.h'\" \(1135 characters\)
sed "s/^X//" >'CoreSun3.h' <<'END_OF_FILE'
X#define ClusterBITS 32 /* Bits in a cluster */
X#define ClusterBYTES 4 /* Bytes in a cluster */
X#define ClusterLNBYTES 2 /* Log2 of ClusterBYTES */
X#define ClusterTYPE int /* The type of a cluster */
X
X#define ClusterALIGNTO 1 /* Align destination */
X
X#if (CcIS == CcPORTABLE)
X
X# define ClusterBEST 16 /* Copy clusters when longer */
X# define ClusterDOALIGN 64 /* Align clusters when longer */
X# define CoreODDCOPY CoreBYTECOPY
X
X /* Asm inlines do not improve speed */
X# if (0 && (CcFEATURE&CcASM))
X /*
X Having had a look at the generated code, we know that to is
X a5, from is a4, and the count is "always" ready in d0.
X */
X
X# define CoreBYTECOPY(to,from,bytes) \
X begindef \
X fast unsigned CoreBytes; \
X if (CoreBytes = (bytes)) { \
X asm ("1: movb a4@+,a5@+"); \
X asm (" dbra d0,1b"); } \
X enddef
X
X# define CoreCLUSTERCOPY(to,from,clusters) \
X begindef \
X fast unsigned CoreClusters; \
X if (CoreClusters = (clusters)) { \
X asm ("1: movl a4@+,a5@+"); \
X asm (" dbra d0,1b"); } \
X enddef
X
X# endif /* 0 */
X
X#endif /* CcIS == CcPORTABLE */
END_OF_FILE
if test 1135 -ne `wc -c <'CoreSun3.h'`; then
echo shar: \"'CoreSun3.h'\" unpacked with wrong size!
fi
# end of 'CoreSun3.h'
fi
if test -f 'CoreSun3.pr' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'CoreSun3.pr'\"
else
echo shar: Extracting \"'CoreSun3.pr'\" \(2040 characters\)
sed "s/^X//" >'CoreSun3.pr' <<'END_OF_FILE'
X 16MB C=0 2048B t% 0 f% 0 0.01u
X 16MB C=0 2048B t% 1 f% 1 0.02u
X 16MB C=0 2048B t% 0 f% 3 0.02u
X 16MB C=0 2048B t% 3 f% 0 0.02u
X 16MB C=0 512B t% 0 f% 0 0.07u
X 16MB C=0 512B t% 1 f% 1 0.04u
X 16MB C=0 512B t% 0 f% 3 0.08u
X 16MB C=0 512B t% 3 f% 0 0.04u
X 16MB C=0 128B t% 0 f% 0 0.27u
X 16MB C=0 128B t% 1 f% 1 0.25u
X 16MB C=0 128B t% 0 f% 3 0.25u
X 16MB C=0 128B t% 3 f% 0 0.23u
X 16MB C=0 32B t% 0 f% 0 1.44u
X 16MB C=0 32B t% 1 f% 1 1.52u
X 16MB C=0 32B t% 0 f% 3 1.43u
X 16MB C=0 32B t% 3 f% 0 1.45u
X 16MB C=0 8B t% 0 f% 0 7.04u
X 16MB C=0 8B t% 1 f% 1 7.23u
X 16MB C=0 8B t% 0 f% 3 7.25u
X 16MB C=0 8B t% 3 f% 0 7.25u
X 16MB C=1 2048B t% 0 f% 0 2.10u
X 16MB C=1 2048B t% 1 f% 1 2.45u
X 16MB C=1 2048B t% 0 f% 3 8.38u
X 16MB C=1 2048B t% 3 f% 0 8.37u
X 16MB C=1 512B t% 0 f% 0 2.25u
X 16MB C=1 512B t% 1 f% 1 3.02u
X 16MB C=1 512B t% 0 f% 3 8.38u
X 16MB C=1 512B t% 3 f% 0 8.44u
X 16MB C=1 128B t% 0 f% 0 3.22u
X 16MB C=1 128B t% 1 f% 1 4.32u
X 16MB C=1 128B t% 0 f% 3 10.43u
X 16MB C=1 128B t% 3 f% 0 10.00u
X 16MB C=1 32B t% 0 f% 0 7.02u
X 16MB C=1 32B t% 1 f% 1 8.02u
X 16MB C=1 32B t% 0 f% 3 12.20u
X 16MB C=1 32B t% 3 f% 0 12.01u
X 16MB C=1 8B t% 0 f% 0 22.05u
X 16MB C=1 8B t% 1 f% 1 25.23u
X 16MB C=1 8B t% 0 f% 3 24.37u
X 16MB C=1 8B t% 3 f% 0 24.35u
X 16MB C=2 2048B t% 0 f% 0 3.01u
X 16MB C=2 2048B t% 1 f% 1 2.44u
X 16MB C=2 2048B t% 0 f% 3 3.06u
X 16MB C=2 2048B t% 3 f% 0 3.21u
X 16MB C=2 512B t% 0 f% 0 2.49u
X 16MB C=2 512B t% 1 f% 1 3.11u
X 16MB C=2 512B t% 0 f% 3 4.09u
X 16MB C=2 512B t% 3 f% 0 3.23u
X 16MB C=2 128B t% 0 f% 0 3.48u
X 16MB C=2 128B t% 1 f% 1 4.10u
X 16MB C=2 128B t% 0 f% 3 4.53u
X 16MB C=2 128B t% 3 f% 0 4.19u
X 16MB C=2 32B t% 0 f% 0 6.10u
X 16MB C=2 32B t% 1 f% 1 6.46u
X 16MB C=2 32B t% 0 f% 3 6.09u
X 16MB C=2 32B t% 3 f% 0 6.29u
X 16MB C=2 8B t% 0 f% 0 29.09u
X 16MB C=2 8B t% 1 f% 1 29.53u
X 16MB C=2 8B t% 0 f% 3 28.50u
X 16MB C=2 8B t% 3 f% 0 28.56u
END_OF_FILE
if test 2040 -ne `wc -c <'CoreSun3.pr'`; then
echo shar: \"'CoreSun3.pr'\" unpacked with wrong size!
fi
# end of 'CoreSun3.pr'
fi
if test -f 'CoreSv386.h' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'CoreSv386.h'\"
else
echo shar: Extracting \"'CoreSv386.h'\" \(1879 characters\)
sed "s/^X//" >'CoreSv386.h' <<'END_OF_FILE'
X#define ClusterBITS 32 /* Bits in a cluster */
X#define ClusterBYTES 4 /* Bytes in a cluster */
X#define ClusterLNBYTES 2 /* Log2 of ClusterBYTES */
X#define ClusterTYPE int /* The type of a cluster */
X
X#define ClusterALIGNTO 1 /* This should be 1; 25% diff. */
X
X#if (CcIS == CcREGISTER)
X
X# define ClusterBEST 16 /* Copy clusters when longer */
X# define ClusterDOALIGN 64 /* Align clusters when longer */
X
X /* This is 0, but should be 1. Cannot get asm procs to work! */
X# if (0 && (CcFEATURE&CcASM))
X
X asm void CoreByteCopy(to,from,bytes)
X {
X % ureg to,from; reg bytes;
X
X movl to,%edi
X movl from,%esi
X movl bytes,%ecx
X rep
X movsb /* (%esi),(%edi) */
X }
X
X asm void CoreClusterCopy(to,from,clusters)
X {
X % ureg to,from; reg clusters;
X
X movl to,%edi
X movl from,%esi
X movl clusters,%ecx
X rep
X movsl /* (%esi),(%edi) */
X }
X
X# define CoreBYTECOPY CoreByteCopy
X# define CoreODDCOPY CoreByteCopy
X# define CoreCLUSTERCOPY CoreClusterCopy
X
X# endif /* 0 */
X
X# /* This is 1, but should be 0, because we should use inline asm procs */
X# if (1 && (CcFEATURE&CcASM))
X /*
X Having had a look at the generated code, we know that to is
X %esi, from is in %edi, and bytes is in %ebx.
X */
X
X# define CoreBYTECOPY(to,from,bytes) \
X begindef \
X fast addressy CoreBytes = (bytes); \
X asm (" movl %ebx,%ecx"); \
X asm (" rep"); \
X asm (" movsb / (%esi),(%edi)"); \
X enddef
X
X# define CoreCLUSTERCOPY(to,from,clusters) \
X begindef \
X fast addressy CoreClusters = (clusters); \
X asm (" movl %ebx,%ecx"); \
X asm (" rep"); \
X asm (" movsl / (%esi),(%edi)"); \
X enddef
X
X# define CoreODDCOPY CoreBYTECOPY
X
X# endif /* 1 */
X
X# ifndef CoreCLUSTERCOPY
X# define CoreCLUSTERCOPY Core4CLUSTERCOPY
X# endif
X
X#endif /* CcIS == CcREGISTER */
END_OF_FILE
if test 1879 -ne `wc -c <'CoreSv386.h'`; then
echo shar: \"'CoreSv386.h'\" unpacked with wrong size!
fi
# end of 'CoreSv386.h'
fi
if test -f 'CoreSv386.pr' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'CoreSv386.pr'\"
else
echo shar: Extracting \"'CoreSv386.pr'\" \(2040 characters\)
sed "s/^X//" >'CoreSv386.pr' <<'END_OF_FILE'
X 16MB C=0 2048B t% 0 f% 0 0.05u
X 16MB C=0 2048B t% 1 f% 1 0.05u
X 16MB C=0 2048B t% 0 f% 3 0.05u
X 16MB C=0 2048B t% 3 f% 0 0.05u
X 16MB C=0 512B t% 0 f% 0 0.20u
X 16MB C=0 512B t% 1 f% 1 0.20u
X 16MB C=0 512B t% 0 f% 3 0.20u
X 16MB C=0 512B t% 3 f% 0 0.20u
X 16MB C=0 128B t% 0 f% 0 0.81u
X 16MB C=0 128B t% 1 f% 1 0.81u
X 16MB C=0 128B t% 0 f% 3 0.81u
X 16MB C=0 128B t% 3 f% 0 0.82u
X 16MB C=0 32B t% 0 f% 0 3.23u
X 16MB C=0 32B t% 1 f% 1 3.23u
X 16MB C=0 32B t% 0 f% 3 3.22u
X 16MB C=0 32B t% 3 f% 0 3.22u
X 16MB C=0 8B t% 0 f% 0 12.88u
X 16MB C=0 8B t% 1 f% 1 12.89u
X 16MB C=0 8B t% 0 f% 3 12.88u
X 16MB C=0 8B t% 3 f% 0 12.88u
X 16MB C=1 2048B t% 0 f% 0 1.39u
X 16MB C=1 2048B t% 1 f% 1 4.08u
X 16MB C=1 2048B t% 0 f% 3 3.16u
X 16MB C=1 2048B t% 3 f% 0 2.28u
X 16MB C=1 512B t% 0 f% 0 1.55u
X 16MB C=1 512B t% 1 f% 1 4.21u
X 16MB C=1 512B t% 0 f% 3 3.32u
X 16MB C=1 512B t% 3 f% 0 2.43u
X 16MB C=1 128B t% 0 f% 0 2.18u
X 16MB C=1 128B t% 1 f% 1 4.84u
X 16MB C=1 128B t% 0 f% 3 3.94u
X 16MB C=1 128B t% 3 f% 0 3.07u
X 16MB C=1 32B t% 0 f% 0 4.71u
X 16MB C=1 32B t% 1 f% 1 7.39u
X 16MB C=1 32B t% 0 f% 3 6.44u
X 16MB C=1 32B t% 3 f% 0 5.64u
X 16MB C=1 8B t% 0 f% 0 14.79u
X 16MB C=1 8B t% 1 f% 1 17.57u
X 16MB C=1 8B t% 0 f% 3 16.10u
X 16MB C=1 8B t% 3 f% 0 16.10u
X 16MB C=2 2048B t% 0 f% 0 1.42u
X 16MB C=2 2048B t% 1 f% 1 1.44u
X 16MB C=2 2048B t% 0 f% 3 2.33u
X 16MB C=2 2048B t% 3 f% 0 2.31u
X 16MB C=2 512B t% 0 f% 0 1.68u
X 16MB C=2 512B t% 1 f% 1 1.77u
X 16MB C=2 512B t% 0 f% 3 2.65u
X 16MB C=2 512B t% 3 f% 0 2.57u
X 16MB C=2 128B t% 0 f% 0 2.72u
X 16MB C=2 128B t% 1 f% 1 3.03u
X 16MB C=2 128B t% 0 f% 3 3.89u
X 16MB C=2 128B t% 3 f% 0 3.61u
X 16MB C=2 32B t% 0 f% 0 6.48u
X 16MB C=2 32B t% 1 f% 1 9.21u
X 16MB C=2 32B t% 0 f% 3 8.28u
X 16MB C=2 32B t% 3 f% 0 7.45u
X 16MB C=2 8B t% 0 f% 0 22.22u
X 16MB C=2 8B t% 1 f% 1 22.24u
X 16MB C=2 8B t% 0 f% 3 22.23u
X 16MB C=2 8B t% 3 f% 0 22.23u
END_OF_FILE
if test 2040 -ne `wc -c <'CoreSv386.pr'`; then
echo shar: \"'CoreSv386.pr'\" unpacked with wrong size!
fi
# end of 'CoreSv386.pr'
fi
if test -f 'CoreTest.c' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'CoreTest.c'\"
else
echo shar: Extracting \"'CoreTest.c'\" \(3039 characters\)
sed "s/^X//" >'CoreTest.c' <<'END_OF_FILE'
X#include <sys/types.h>
X#include <sys/times.h>
X#include <sys/param.h>
X
X#ifndef HZ
X# define HZ 60
X#endif
X
X#include <stdio.h>
X
X#ifndef B
X# define B 4096 /* Maximum & default # of bytes */
X#endif
X#ifndef M
X# define M (16<<20) /* Default megabytes copied */
X#endif
X
typedef char *(*method)();
X
static time_t measure(p,t,f,b)
X register method p;
X register char *t,*f;
X register unsigned b;
X{
X register unsigned i;
X struct tms tms;
X time_t utime;
X
X (void) times(&tms);
X utime = tms.tms_utime;
X
X for (i = 0; i < M; i += b)
X (void) (*p)(t,f,b);
X
X (void) times(&tms);
X return tms.tms_utime - utime;
X}
X
extern char *null();
extern char *memcpy();
extern char *CoreCopy();
extern char *copy1();
extern char *copy2();
extern char *copy3();
X
static method methods[] = {null,memcpy,CoreCopy,copy1,copy2,copy3};
static unsigned nmethods = sizeof methods/sizeof (method);
X
X#define SLOP sizeof (long unsigned)
X
long unsigned alignit1;
char bfrom[B+SLOP];
X
long unsigned alignit2;
char bto[B+SLOP];
X
extern int main(argc,argv)
X int argc;
X char **argv;
X{
X register unsigned i,b;
X register char *f,*t;
X unsigned of,ot;
X unsigned m;
X time_t utime;
X
X
X m = (argc <= 1) ? 0 : atoi(argv[1]);
X b = (argc <= 2) ? B : atoi(argv[2]);
X ot = (argc <= 3) ? 1 : atoi(argv[3]);
X of = (argc <= 4) ? 1 : atoi(argv[4]);
X
X if (m >= nmethods) m = 1;
X if (b > B) b = B;
X if (ot > SLOP) ot %= SLOP;
X if (of > SLOP) of %= SLOP;
X
X f = bfrom + of; t = bto + ot;
X
X printf("%3uMB C=%u %4uB t%% %u f%% %u ",
X M>>20,m,b,(unsigned) t%SLOP,(unsigned) f%SLOP);
X fflush(stdout);
X
X utime = measure(methods[m],f,t,b);
X
X printf("%3u.%02uu\n",utime/HZ,utime%HZ);
X fflush(stdout);
X
X return 0;
X}
X
extern char *null(to,from,bytes)
X register char *to,*from;
X register unsigned bytes;
X{
X return to+bytes;
X}
X
extern char *copy1(to,from,bytes)
X register char *to,*from;
X register unsigned bytes;
X{
X if (bytes)
X {
X do *to++ = *from++;
X while (--bytes);
X }
X
X return to;
X}
X
extern char *copy2(to,from,bytes)
X register char *to,*from;
X register unsigned bytes;
X{
X while (bytes >= sizeof (long))
X {
X *(long *) to = *(long *) from;
X to += sizeof (long), from += sizeof (long);
X bytes -= sizeof (long);
X }
X
X if (bytes)
X {
X do *to++ = *from++;
X while (--bytes);
X }
X
X return to;
X}
X
extern char *copy3(to,from,bytes)
X register char *to,*from;
X register unsigned bytes;
X{
X while (bytes >= 2*sizeof (long))
X {
X *(long *) to = *(long *) from;
X *((long *) to + 1) = *((long *) from +1);
X to +=2*sizeof (long), from += 2*sizeof (long);
X bytes -= 2*sizeof (long);
X }
X
X while (bytes >= sizeof (long))
X {
X *(long *) to = *(long *) from;
X to += sizeof (long), from += sizeof (long);
X bytes -= sizeof (long);
X }
X
X if (bytes)
X {
X do *to++ = *from++;
X while (--bytes);
X }
X
X return to;
X}
END_OF_FILE
if test 3039 -ne `wc -c <'CoreTest.c'`; then
echo shar: \"'CoreTest.c'\" unpacked with wrong size!
fi
# end of 'CoreTest.c'
fi
echo shar: End of shell archive.
exit 0
--
Piercarlo "Peter" Grandi | ARPA: pcg%uk.ac.aber.cs at nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg at cs.aber.ac.uk