sort
A. Lester Buck
buck at siswat.UUCP
Wed Aug 9 01:00:52 AEST 1989
In article <770 at ocsmd.ocs.com>, glenn at ocsmd.ocs.com (glenn ford) writes:
> Let me explain to the world what I am currently doing. At our work we
> do a lot of B-tree builds which require a sorted text file as input.
> Thus the need for something fast, since the text file to sort can be
> up to 250 megabytes with about 12-15 million records. I currently run
> on a 6310 (VAX) rated at about 5 mips, with a 780 at 1 mip. Now the
> data is totally random and is spread across the alphabet pretty well.
> The only requirements are that it be CASE sensitive (yes, it makes
> things REAL slow then!) and that you have a prefix offset. Now I can
> sort about 1.5 megabytes per minute. I use ONLY quicksort in memory,
> and if I don't have enough memory I either (a) split the file into
> multiple smaller files, then sort and append to the output file, or
> (b) call sortmerge (VAX sort), which takes care of those problems. I
> usually use case (a) unless I have already split the file. I only
> split the original file, and I split it into 28 separate files.
> Now, is there anything that can beat quicksort?
Sure. Once you can't fit everything into memory, you have a sort/merge
problem. I had the same problem of sorting magnetic tapes full
of fixed length records, up to about 160-180 MBytes. I only have
one tape drive, so I, of course, have to simulate other tape drives
on disk. If you do the in-core sort with quicksort, you generate
fixed length runs, each the size of memory, to merge. Instead, you
should use a replacement selection sorting algorithm for the in-core
sort, which on average generates runs twice the size of memory.
(Much longer if the data is nearly sorted, which mine sometimes is.)
This saves a lot of tape handling in the merge phase. (Or a lot of
seeking on disk.)
I would hope that this is what the VAX sortmerge is doing. But it is
definitely _not_ what Unix sort does; it uses qsort(3) for all its
in-core sorting.
Just remember, if longer runs eliminate even one merge pass, the
entire file (~200 MBytes) is read and written one less time.
I had to write my own routines tuned for my application, but the
algorithms are described very clearly in several places. I used
Baase, "Computer Algorithms", and Knuth Vol. 3. A general discussion
in Folk and Zoellick, "File Structures: A Conceptual Toolkit"
covers various sorting problems, in-core and on disk.
--
A. Lester Buck ...!texbell!moray!siswat!buck