sort
A. Lester Buck
buck at siswat.UUCP
Wed Aug 9 01:00:52 AEST 1989
In article <770 at ocsmd.ocs.com>, glenn at ocsmd.ocs.com (glenn ford) writes:
> Let me explain to the world what I am currently doing. At our work we
> do a lot of B-tree builds which require a sorted text file as input.
> Thus the need for something fast, since the text file to sort can be
> up to 250 megabytes with about 12-15 million records. I currently run
> on a 6310 (VAX) rated at about 5 mips, with a 780 at 1 mip. Now the
> data is totally random and is spread across the alphabet pretty well.
> The only requirements are that it be CASE sensitive (yes, it makes
> things REAL slow then!) and that you have a prefix offset. Now I can
> sort about 1.5 megabytes per minute. I use ONLY quicksort in memory,
> and if I don't have enough memory I either (a) split the file into
> multiple smaller files, then sort and append to the output file, or
> (b) call sortmerge (VAX sort), which takes care of those problems. I
> usually use case (a) unless I have already split the file. I only
> split the original file, and I split it into 28 separate files.
> Now, is there anything that can beat quicksort?
Sure. Once you can't fit everything into memory, you have a sort/merge
problem. I had the same problem of sorting magnetic tapes full
of fixed length records, up to about 160-180 MBytes. I only have
one tape drive, so I, of course, have to simulate other tape drives
on disk. If you do the in-core sort with quicksort, you generate
fixed length runs, each the size of memory, to merge. Instead, you
should use a replacement selection sorting algorithm for the in-core
sort, which on average generates runs twice the size of memory.
(Much longer if the data is nearly sorted, which mine sometimes is.)
This saves a lot of tape handling in the merge phase. (Or a lot of
seeking on disk.)
I would hope that this is what the VAX sortmerge is doing. But it is
definitely _not_ what Unix sort does; it uses qsort(3) for all its
in-core sorting.
Just remember, if longer runs eliminate even one merge pass, the
entire file (~200 MBytes) is read and written one less time.
I had to write my own routines tuned for my application, but the
algorithms are described very clearly in several places. I used
Baase, "Computer Algorithms", and Knuth Vol. 3. A general discussion
in Folk and Zoellick, "File Structures: A Conceptual Toolkit"
covers various sorting problems, in-core and on disk.
--
A. Lester Buck ...!texbell!moray!siswat!buck