v02i041: qndxr - Inverted index builder for text files

Wed Feb 3 12:08:40 AEST 1988

Comp.sources.misc: Volume 2, Issue 41

Submitted-By: "Mark Zimmerman" <science at nems.ARPA>

Archive-Name: qndxr

Comp.sources.misc: Volume 2, Issue 41
Submitted-By: "Mark Zimmerman" <science at nems.ARPA>
Archive-Name: qndxr

[Yes, indeed, another un-shared program!  Sigh.  ++bsa]

Following C code builds inverted index files for big text files ... seems
to work on Sun, VAX, and Macintosh in my tests.  Heavily commented in a
helter-skelter style....  ^z

/*    revised main indexer program 'qndxr.c' by ^z -- 870820-...
 *
 * revised 870930-1008 to give the user more options and control of how
 * delimiters are defined and handled.  Now can choose:
 *    - the default:  any non-alphanumeric characters are delimiters (my
 *        original approach, and perhaps the simplest to understand and use);
 *    - option "-e":  keep punctuation characters ONLY when they are embedded
 *        in between non-delimiter characters, otherwise turn them into
 *        delimiters (approach invented by Bill Hole of MicroDynamics Ltd.;
 *        very nice for displaying 'words' such as don't, non-ASCII, 3.14159,
 *        87/10/02, etc. without splitting them up and requiring proximity
 *        searching);
 *    - option "-a":  keep ALL punctuation characters, whether embedded in
 *        words or not (best for indexing FORTH programs and such, but does
 *        make spurious 'distinct' words at ends of sentences, etc.);
 *    - option "-s":  keep all special (non-ASCII) characters (with values
 *        in the range 128-255; in the Apple character set, these are the
 *        symbols that carry umlauts, tilde, accents, etc., making this
 *        option the best for foreign-language and symbol-heavy files);
 *        default is to remove the special characters.
 *
 *    Note that option "-s" can be combined with any of the other options;
 *    options "-e" and "-a" are mutually exclusive.  (Actually, "-a" in my
 *    implementation overrides "-e".)  The "-e" option does require
 *    peeking ahead one character before deciding on the fate of
 *    a punctuation symbol, but that isn't too hard to arrange....
 *
 *    ---------------------------------------------------------------
 *
 * Synopsis of the qndxr.c program's approach to sorting an index:
 *    - load a chunk of the input file into memory; during the loading,
 *        filter the characters appropriately, turning lower-case
 *        into capitals, changing delimiters into \0's, and noting down
 *        the locations of the beginnings of keys (words) in a ptr array;
 *    - do a quicksort on the ptr array, using the keys in the buffer
 *        to sort on;
 *    - write the resulting sorted list of pointers along with their keys
 *        to disk as files, named x0k0 and x0p0, in the *.k and *.p formats
 *        used in ndxr.15;
 *    - repeat the above steps with another chunk of the input file, and
 *        name the resulting files x0k1 and x0p1; repeat, etc., until the
 *        input file is all sorted into chunks;
 *    - begin to merge the sorted files in an N-way merge ... for the
 *        specific case of N=2, for example, merge x0k0/x0p0 and
 *        x0k1/x0p1 into x1k0/x1p0; merge x0k2/x0p2 and x0k3/x0p3 into
 *        x1k1/x1p1; etc., deleting the 0th-generation files as they are
 *        merged;
 *    - merge x1k0/x1p0 and x1k1/x1p1 into x2k0/x2p0, etc., and continue
 *        merging subsequent generations until everything has become a
 *        single pair of index files, xnk0/xnp0, which is then renamed
 *        to be the original document name with '.k' and '.p' endings.
 *
 *    ---------------------------------------------------------------
 *
 * Comparison with the older radix sort approach:
 *    The new quicksort/mergesort method gains a bit in speed (since so
 *    much of the work is done in memory rather than streaming from disk
 *    back and forth) and also saves on disk space requirements (since
 *    the k and p files are significantly compressed relative to the raw
 *    index files formerly used during index sorting).  The old radix sort
 *    tended to only achieve about 2 MB/hour indexing on my Mac Plus setup,
 *    and it required 5-6 times the size of the original file in free
 *    disk space during indexing.  This new approach achieves >4 MB/hour
 *    in tests on the same Mac Plus, and only requires about 1.9 times
 * the original file size in free space!
 *
 *    The new approach also allows for greater key length -- try recompiling
 *    with KEY_LENGTH of 28, for instance -- without such severe disk space
 *    penalties as the old radix sort would have incurred, and without severe
 *    time penalties.  (Duplicate words in the chunks are only stored once
 *    in the key file, though each must have an entry in the pointer file.)
 *
 *    The only obvious disadvantage of the quicksort/mergesort approach is
 *    that it is an O(NlogN) procedure rather than O(N), and thus gets slower
 *    when files get very large.  (Radix sorting is strictly linear in N,
 *    in theory anyway.)
 *
 *    ---------------------------------------------------------------
 *
 *    For further details, contact the author:
 *     Mark Zimmermann             science (at) NEMS.ARPA
 *     9511 Gwyndale Dr.           [75066,2044] on CompuServe
 *     Silver Spring, MD  20910
 *     301-565-2166
 *    ---------------------------------------------------------------
 */

#include <stdio.h>
#include <strings.h>
#include <ctype.h>

/* --------------header file--------------- */

/* header file for project qndxr -- by ^z -- 870820-0913-...
 */

/* tell what compiler we're using ... Lightspeed C by Think, if the
 * following line is #defined ... essentially, when LIGHTSPEED is here,
 * assume that we have only 16-bit int variables and that Macintosh
 * toolbox traps are available ... when LIGHTSPEED is not #defined,
 * assume that ints can hold more like 32 bits without error, so more
 * can be done using standard I/O routines from <stdio>
 */
/* #define LIGHTSPEED */

/* preprocessor 'function' to turn on/off debug printing of detailed info
 *    during program runs ... when debugging, use a statement:
 * #define DEBUG(fmt, arg)   printf (fmt, arg)
 * ... and when not debugging, just use:  #define DEBUG(fmt, arg)
 * to effectively remove those print commands....
 */
/* #define DEBUG(fmt, arg)   printf (fmt, arg) */
#define DEBUG(fmt, arg)

/* sort on KEY_LENGTH characters ... make it 28 rather arbitrarily as an
 * experiment ... want it long enough to avoid truncation of legitimate
 * words, but short enough to save some space in the *.k files ... an
 * alternative value is 12, for compatibility with the older ndxr.c program
 * and brwsr.c, if they haven't been revised to allow for longer keys....
 */
#define KEY_LENGTH  28

/* define a structure for the index_keys file
 */
typedef struct
  {
    char kkey[KEY_LENGTH];
    long ccount;
  }  KEY_REC;

/* define a structure for my simpleminded I/O buffers ...
 */
typedef struct
  {
    char *zbufbase;
    char *zbufp;
    long zbufcounter;
    long zbufsize;
    FILE *zbuffilep;
    int  zbufitemsize;
  }  ZBUF;

/* choose this size to be the default memory size used for total buffer
 * space ... for convenience on the Mac Plus, I have picked 512 kB, which
 * leaves me a bit of space to spare ....
 */
#define DEFAULT_MEMSIZ  524288

/* merge this many files at each stage of the merging operation for index
 * building ... 2 means a binary merge, etc. ... one needs to have at least
 * 5 + 2 * NMERGE I/O buffers around:  for each of NMERGE files, there is
 * a *.k keys file and a *.p pointers file; plus there must be a single
 * output *.k and a single output *.p file; plus there is the need for stdin,
 * stdout, and stderr to be open as well.  Thus, I have found that a 4-way
 * merge (NMERGE = 4) works pretty nicely....
 */
#define NMERGE  4

#ifndef TRUE
#define TRUE  1
#endif

#ifndef FALSE
#define FALSE  0
#endif

/* CMASK makes sure that a returned character isn't sign-extended
 */
#ifndef CMASK
#define CMASK  0xFF
#endif

/* put the prototypes for my functions here... assume that if LIGHTSPEED
 * is #define'd, then we want to use prototypes....
 */
#ifdef LIGHTSPEED

/* in qndxr unix main.c */
void punt(void);
void openfile(FILE *file, char *mode);
void main(void);

/* in qndxr.c */
void _main(int argc, char *argv[]);

/* in bufio.c */
void create_zbuffer (int n, long bufsize, FILE *buffile, int bufitemsize);
void free_zbuffer (int n);
char *next_input_item (int n);
void load_zbuffer (int n);
char *next_output_item (int n);
void flush_zbuffer (int n);

/* in build_indices.c */
int build_indices (void);

/* in doc_buf.c */
char *make_buf (long bufsiz);
long load_doc_buffer (char *doc, long doc_bufsiz, char **ptr);
int filtered_getc (void);

/* in merge_files.c */
void nway_merge_kpfiles (FILE *ink[], FILE *inp[], FILE *outk, FILE *outp,
        int n);
void copy_ptr_recs (int inzbuf, long count, int outzbuf);
void copy_key_rec (char *kkey, long ccount, int outzbuf);
int merge_not_finished (KEY_REC *kr[], int n);
int smallest_key (KEY_REC *kr[], int n);

/* in merge_indices.c */
int merge_indices (char *doc_filename);

/* in file_utils.c */
void write_sorted_files (char *doc, char **ptr, long nwords,
        int pass_number, long offset);
int is_new_word (char *w0, char *w1);
void write_new_key (char *p, char *kp);

/* in merge_utils.c */
FILE *open_inkfile (int generation_number, int file_number);
FILE *open_inpfile (int generation_number, int file_number);
void fix_oddball_file_name (int generation_number, int file_number);
void fix_final_file_names (int generation_number, char *doc_filename);
FILE *open_outkfile (int generation_number, int file_number);
FILE *open_outpfile (int generation_number, int file_number);
void remove_used_infiles (int generation_number, int file_number, int n);
void close_infiles (FILE *ink[], FILE *inp[], int n);

/* in misc_utils.c */
long set_params (int argc, char *argv[]);
long set_zbufsiz (long zb);
void set_parser_options (char *str);
void check_interrupt (void);
long file_size (FILE *fp);

/* in open_files.c */
FILE *open_docfile (int argc, char *argv[]);
FILE *open_kfile (int pass_number);
FILE *open_pfile (int pass_number);

/* in zqsort.c */
void zqsort (char **first, char **last);
int compare_ptrs (char **p1, char **p2);
int zstrcmp (char *s1, char *s2);

#endif

/* ----------------------main program file---------------- */

/* define a global variable to hold the chosen size of a fundamental
 * quantum of buffer space ... experiments with dynamically choosing
 * it at runtime seem to cause occasional problems on the Macintosh,
 * and have other risks with virtual memory systems, so default to
 * DEFAULT_MEMSIZ total buffer space but let the user override with
 * a command-line choice to suit ... see function set_zbufsiz() for
 * further discussion....
 */
long zbufsiz;

/* define a global variable to tell whether or not all punctuation chars
 * are kept...
 */
int keep_all_punct = FALSE;

/* define a global variable to tell whether or not only embedded punctuation
 * characters are preserved...
 */
int keep_embedded_punct = FALSE;

/* define a global variable to tell whether or not to keep special,
 * non-ASCII characters...
 */
int keep_special_chars = FALSE;

/* define a global variable to hold the (FILE *) for the document file
 */
FILE *doc_file;

/* main program to do the work ... 
 */

void main(argc, argv)
  int argc;
  char *argv[];
  {
    unsigned long start_time, end_time, time();
    long set_params(), file_size();
    char *ctime();
    FILE *open_docfile();

    time (&start_time);
    printf ("Starting work:  %s\n", ctime (&start_time));

    printf ("\nOpening document file \"%s\"...\n", argv[1]);
    doc_file = open_docfile (argc, argv);

    zbufsiz = set_params (argc, argv);
    printf ("Using buffers of size %ld each...\n", zbufsiz);

    printf ("Beginning to build index...\n");
    while (build_indices ());

    printf ("Beginning to merge index subfiles...\n");    
    while (merge_indices (argv[1]));

    time (&end_time);
    printf ("\nEnding work:  %s\n", ctime (&end_time));
    printf ("Elapsed time  = %ld seconds.\n", end_time - start_time);
    printf ("Indexing rate = %.2f megabytes/hour.\n", file_size (doc_file) *
                  0.003600 / ( end_time - start_time ));
    printf ("Input database file \"%s\" was %ld bytes long.\n", argv[1],
                  file_size (doc_file));

    fclose (doc_file);
  }

/* -------------------bufio.c file------------------- */

/* This is the file 'bufio.c' ... by ^z, 870830-0913- ... it includes 
 * definitions of some buffer words to accumulate information on its
 * way to/from the disks ... use these to speed up operations and reduce
 * disk head movements, in place of calls to fputc(), fread(), fwrite(),
 * etc. ... try to make them very general so that they will simplify
 * other tasks....
 *
 */

/* set up some buffers here to save on disk head movement and speed up
 * operations ... use my simple ZBUF structure for both input and
 * output operations ... keep everything static, local here to this file
 * for safety's sake ... allocate enough items here for all the buffers
 * that we may need for an NMERGE-way merge operation, taking into account
 * the need for input buffers for all the NMERGE *.k and *.p files plus
 * the output files *.k and *.p as well....
 */

static ZBUF zbuffer[2 + 2 * NMERGE];

/* function to create a zbuffer and set its parameters appropriately
 */

void create_zbuffer (n, bufsize, buffile, bufitemsize)
  int n, bufitemsize;
  long bufsize;
  FILE *buffile;
  {
    char *make_buf();

    zbuffer[n].zbufbase = make_buf (bufsize);
    zbuffer[n].zbufp = zbuffer[n].zbufbase;
    zbuffer[n].zbufcounter = 0;
    zbuffer[n].zbufsize = bufsize;
    zbuffer[n].zbuffilep = buffile;
    zbuffer[n].zbufitemsize = bufitemsize;
  }

/* function to free a zbuffer ...
 */

void free_zbuffer (n)
  int n;
  {
    free (zbuffer[n].zbufbase);
  }

/* function to return a pointer to the next item in a chosen input
 * buffer 'n'; it reloads the buffer if necessary ... returns NULL
 * pointer when there's nothing left in the file ...
 */

char *next_input_item (n)
  int n;
  {
    char *result;
    void load_zbuffer();

    if (zbuffer[n].zbufcounter == 0)
        load_zbuffer (n);

    zbuffer[n].zbufcounter -= zbuffer[n].zbufitemsize;
    if (zbuffer[n].zbufcounter >= 0)
      {
        result = zbuffer[n].zbufp;
        zbuffer[n].zbufp += zbuffer[n].zbufitemsize;
        return (result);
      }
    else
        return (NULL);
  }

/* function to load the nth zbuffer appropriately ... it resets the buffer
 * pointers, etc. ... might as well give the user a chance to interrupt (in
 * the Macintosh version) here, as long as we have to go to the disk anyway....
 */

void load_zbuffer (n)
  int n;
  {
    long nread;
    void exit(), check_interrupt();

#ifdef LIGHTSPEED
    nread = zbuffer[n].zbufsize;
    FSRead (zbuffer[n].zbuffilep->refnum, &nread, zbuffer[n].zbufbase);
#else
    nread = fread (zbuffer[n].zbufbase, sizeof(char),
            (int)zbuffer[n].zbufsize, zbuffer[n].zbuffilep);
#endif

    zbuffer[n].zbufp = zbuffer[n].zbufbase;
    zbuffer[n].zbufcounter = nread;

    if (ferror (zbuffer[n].zbuffilep))
      {
        printf ("\nFatal error in load_zbuffer while reading in data!\n");
        printf ("(n=%d, nread=%ld)\n", n, nread);
        exit (1);
      }

#ifdef LIGHTSPEED
    check_interrupt ();
#endif
  }

/* function to return a pointer to the right place to store the next
 * output item for the nth zbuffer ... when the buffer becomes full,
 * it flushes it to disk, resets pointers, etc.
 */

char *next_output_item (n)
  int n;
  {
    char *result;
    void flush_zbuffer();

    if (zbuffer[n].zbufcounter == zbuffer[n].zbufsize)
        flush_zbuffer (n);

    result = zbuffer[n].zbufp;
    zbuffer[n].zbufp += zbuffer[n].zbufitemsize;
    zbuffer[n].zbufcounter += zbuffer[n].zbufitemsize;
    return (result);
  }

/* function to flush out a zbuffer to the disk ... might as well give user
 * a chance to interrupt here (in the Macintosh version), as long as we
 * have to go to the disk anyway....
 */

void flush_zbuffer (n)
  int n;
  {
    long chars_written;
    void exit();

    if (zbuffer[n].zbufcounter == 0)
        return;

#ifdef LIGHTSPEED
    chars_written = zbuffer[n].zbufcounter;
    FSWrite (zbuffer[n].zbuffilep->refnum, &chars_written,
                zbuffer[n].zbufbase);
#else
    chars_written = fwrite (zbuffer[n].zbufbase, sizeof(char),
                (int)zbuffer[n].zbufcounter, zbuffer[n].zbuffilep);
#endif
    if (ferror(zbuffer[n].zbuffilep) || chars_written == 0)
      {
        printf ("\nFatal error in flush_zbuffer while writing out data!\n");
        printf ("(n=%d, zbufcounter=%ld, chars_written=%ld)\n", n,
                    zbuffer[n].zbufcounter, chars_written);
        exit(1);
      }

    zbuffer[n].zbufp = zbuffer[n].zbufbase;
    zbuffer[n].zbufcounter = 0;

#ifdef LIGHTSPEED
    check_interrupt ();
#endif
  }

/* ---------------build_indices.c file---------------- */

/* file build_indices.c ... by ^z, 870820-0913-...
 *
 * revised 870930-871007 to allow more user options on keeping/discarding
 *    punctuation, etc. -- ideas based on Bill Hole's suggestions
 *
 * contains subroutine to build indices for each chunk of input document
 * (database) text file for program qndxr ... 
 *
 * Strategy is as outlined in qndxr:  take in a big chunk of the doc_file,
 *    generate the pointers to each word in the buffer as the buffer contents
 *    are converted into appropriate (all caps, delimiters filtered into
 *    spaces) form for sorting; sort the pointers in memory; and then write
 *    out to disk the pointers and keys in the ndxr.c format for *.k and *.p
 *    files.
 *
 * Allocate space for the doc and ptr buffers here so as to maximize use
 * of available memory ... note that we need to have room for the doc
 * buffer, for a ptr buffer that might (in the worst case of a file full
 * of 1-letter words) be twice as long as the doc buffer, and also space
 * for two standard zbuffers to accumulate the output *.k and *.p file
 * info in before sending it to disk.
 *
 * Note that for speed, while they are being sorted the pointers just point
 *    directly to the key strings in the input buffer; they must be converted
 *    into true offset pointers relative to the 0th byte of the document file
 *    as they are written to disk in the *.p file!  Make sure that all of
 *    the delimiters in the document/database buffer are converted into
 *    '\0' characters so that string comparison functions will work right.
 *
 * Also note that to avoid edge effects at the end of the buffer, an extra
 *    amount of space is required at the end of the buffer, of length
 *    KEY_LENGTH, to accomodate the end of the last word in the buffer.
 *
 * Use static local variables in the function here to keep track of where
 *    we are in the document file from one chunk to the next, what chunk
 *    number we are on now, etc.
 *
 * Give the user a chance to interrupt operations (in the Macintosh
 * version of this program) at intervals here, as long as
 * there are time-consuming I/O or sorting or scanning operations
 * to be done ...
 */

int build_indices ()
  {
    static int pass_number = 0;
    long doc_bufsiz, offset, load_doc_buffer(), nwords, 
            ftell();
    extern long zbufsiz;
    extern FILE *doc_file;
    char *doc, **ptr, *malloc(), *mlalloc(), *calloc(), *clalloc();
    void zqsort(), write_sorted_files();

    doc_bufsiz = 2 * NMERGE * zbufsiz / 3;
    DEBUG ("--allocating doc buffer of size %ld\n", doc_bufsiz + KEY_LENGTH);
    doc = make_buf (doc_bufsiz + KEY_LENGTH);

    DEBUG ("--allocating ptr buffer of size %ld\n", doc_bufsiz * 2);
    ptr = (char **)make_buf (doc_bufsiz * 2);

#ifdef LIGHTSPEED
    check_interrupt ();
#endif

    offset = ftell (doc_file);    
    DEBUG ("--loading doc buffer beginning at offset %ld\n", offset);    
    nwords = load_doc_buffer (doc, doc_bufsiz, ptr);

    if (nwords == 0)
      {
        DEBUG ("--Building done ... now freeing doc & ptr buffers\n", NULL);
        free (doc);
        free ((char *)ptr);
        return (FALSE);
      }

    printf ("Index subfile #%d contains %ld words...\n", pass_number,
                nwords);

#ifdef LIGHTSPEED
    check_interrupt ();
#endif

    DEBUG ("--sorting ptr array\n", NULL);
    zqsort (ptr, ptr + nwords);

#ifdef LIGHTSPEED
    check_interrupt ();
#endif

    DEBUG ("--writing sorted keys and ptrs to disk\n", NULL);
    write_sorted_files (doc, ptr, nwords, pass_number, offset);

#ifdef LIGHTSPEED
    check_interrupt ();
#endif

    DEBUG ("--freeing doc & ptr buffers\n", NULL);
    free (doc);
    free ((char *)ptr);

    ++pass_number;
    return (TRUE);
  }

/* ---------------doc_buf.c file------------------ */

/* file doc_buf.c ...  by ^z 870830-0919-...
 * functions to load in text from the document file and then
 * save out keys and pointers to the *.k and *.p files ...
 * modified 871007-... to unify the doc-buffer loading with the
 * character-filtering and the pointer array building....
 */

/* function to create a buffer for me to use...
 */

char *make_buf (bufsiz)
  long bufsiz;
  {
    char *buf, *malloc(), *mlalloc();
    void exit();

    DEBUG ("--allocating a buffer, size = %ld\n", bufsiz);

#ifdef LIGHTSPEED
    buf = mlalloc (bufsiz);
#else
    buf = malloc ((unsigned int)bufsiz);
#endif

    if (buf == NULL)
      {
        printf ("\nFatal error in attempt to allocate a buffer!\n");
        printf ("(bufsiz=%ld)\n", bufsiz);
        exit(1);
      }

    return (buf);
  }

/* function to load the document buffer ... bring in doc_bufsiz
 * characters, and then enough more to finish out the final word,
 * followed by a terminal delimiter .... as the characters are read
 * in, filter them appropriately (depending on user choices) and
 * build the pointer array in memory to the first character of each
 * word ... return the total number of words that were
 * read in to the buffer (zero if we're finished with the file)
 *
 * ... note that one must be sure to pull in and throw away
 * any excess characters beyond KEY_LENGTH in the final word, so that
 * the remaining fragment doesn't show up as the first "word" in the
 * next chunk of the file....
 *
 * Routine modified 871007-... in order to unify the buffer-loading and
 * character-filtering and pointer-array-building operations, and to go
 * back to using getc() from <stdio> rather than Macintosh-specific
 * operations for loading the buffer....
 */

long load_doc_buffer (doc, doc_bufsiz, ptr)
  char *doc, **ptr;
  long doc_bufsiz;
  {
    int c, i, in_a_word = FALSE;
    char **ptr0, *end_doc_buf;
    extern FILE *doc_file;

    DEBUG ("--Loading document buffer...\n", NULL);

     ptr0 = ptr;
     end_doc_buf = doc + doc_bufsiz;

     while (doc < end_doc_buf)
       {
         c = filtered_getc ();
         DEBUG ("--filtered character = \"%c\"\n", c);
         if (c == EOF)
           {
             *doc++ = '\0';
             in_a_word = FALSE;
             break;
           }
         if (! c)
             in_a_word = FALSE;
         else if (! in_a_word)
           {
             *ptr++ = doc;
             in_a_word = TRUE;
             DEBUG ("--adding new ptr = %ld\n", doc);
           }
         *doc++ = c;
       }

     if (doc == end_doc_buf && in_a_word)
      {
        DEBUG ("--finishing off a final buffer word...\n", NULL);
        for (i = 0; i < KEY_LENGTH; ++i)
         {
           c = filtered_getc ();
           if (c == EOF)
             {
               *doc++ = '\0';
               break;
             }
           if (! c)
             {
               *doc++ = '\0';
               break;
             }
           *doc++ = c;
         }
       if (i == KEY_LENGTH)
           while (filtered_getc ())
                ;
       }

     return (ptr - ptr0);
  }

/* function to get the next character from the document file and filter it
 * as the user desires ... return:
 *  EOF  if end of file encountered;
 *  '/0' if the character is a delimiter;
 *  otherwise, the character itself (filtered into upper-case,
 *        if it was lower-case)
 */

int filtered_getc ()
  {
    static int prevc, c = '\0';
    int nextc;
    extern int keep_all_punct, keep_embedded_punct, keep_special_chars;
    extern FILE *doc_file;

    prevc = c;
    c = getc (doc_file);

    if (c == EOF)
        return (EOF);

    if (islower (c))
        return (c = toupper (c));

    if (isupper (c) || isdigit (c))
        return (c);

    if (isspace (c))
        return (c = '\0');

    if (keep_special_chars && ! isascii (c))
        return (c);

    if (keep_all_punct && ispunct (c))
        return (c);

    if (keep_embedded_punct && ispunct (c))
      {
        if (prevc == '\0')
            return (c = '\0');
        nextc = getc (doc_file);
        ungetc (nextc, doc_file);
        if (nextc == EOF)
            return (c = '\0');
        if (isalnum (nextc) || (keep_special_chars && ! isascii (nextc)))
            return (c);
        else
            return (c = '\0');
      }

    return (c = '\0');
  }

/* ------------------------file_utils.c------------- */

/* file file_utils.c ... by ^z -- 870820-0913-...
 * some utility routines for qndxr project, associated with files...
 */

/* function to write out sorted k & p files based on the doc and ptr
 * arrays in memory....
 *
 * The kfile format is as described in detail elsewhere:
 *    the key word, turned into all capital letters and with spaces
 *        afterward, of fixed length KEY_LENGTH; and
 *    the cumulative count of how many words have passed before, including
 *        the current word, a long integer.
 *
 * Function revised 870907-... by ^z to use zbuffer method....
 */

void write_sorted_files (doc, ptr, nwords, pass_number, offset)
  char *doc, **ptr;
  long nwords, offset;
  int pass_number;
  {
    extern long zbufsiz;
    FILE *kfile, *pfile, *open_kfile(), *open_pfile();
    char *prev_word, *next_output_item();
    KEY_REC *outk;
    long *outp, i, file_size ();
    void create_zbuffer(), write_new_key();

    DEBUG ("--Entering write_sorted_files with nwords %ld\n", nwords);
    if (nwords == 0)
        return;

    DEBUG ("--Opening kfile & pfile for pass_number = %d\n", pass_number);
    kfile = open_kfile (pass_number);
    pfile = open_pfile (pass_number);

    DEBUG ("--Creating buffers for keys & ptrs, size = %ld\n", zbufsiz);
    create_zbuffer (0, zbufsiz, kfile, sizeof(KEY_REC));
    create_zbuffer (1, zbufsiz, pfile, sizeof(long));

    DEBUG ("--Beginning to write keys and ptrs; first key=%.28s\n", ptr[0]);
    prev_word = ptr[0];
    outk = (KEY_REC *)next_output_item (0);
    write_new_key (ptr[0], outk->kkey);

    for (i = 0; i < nwords; ++i)
      {
        if (is_new_word (prev_word, ptr[i]))
          {
            outk->ccount = i;
            outk = (KEY_REC *)next_output_item (0);
            write_new_key (ptr[i], outk->kkey);
            prev_word = ptr[i];
          }
        outp = (long *)next_output_item (1);
        *outp = (ptr[i] - doc) + offset;
      }
    outk->ccount = i;

    flush_zbuffer (0);
    flush_zbuffer (1);

    DEBUG ("--Getting rid of key and ptr buffers...\n", NULL);
    free_zbuffer (0);
    free_zbuffer (1);

    printf (" ...%ld distinct words\n",
            file_size (kfile) / sizeof(KEY_REC));
    fclose (kfile);
    fclose (pfile);
  }

/* function to determine if the current word is the same as or different
 * from the previous word -- if it is different, we'll need to write an
 * entry out to the key file kfile -- compare the words up to the first
 * '\0', or for a maximum distance of KEY_LENGTH, and return TRUE
 * if they differ, FALSE if they are identical that far.  Thus, a simple
 * call to zstrcmp() does the job.... but keep ours as a function instead
 * of a macro call for the moment, for safety and readability....
 */

int is_new_word (w0, w1)
  char *w0, *w1;
  {
    return (zstrcmp (w0, w1));
  }

/* function to write out a new key entry in the key_file:
 * KEY_LENGTH letters consisting of the key word (which will be found
 * delimited by a '\0'), followed by enough blanks to fill out the
 * record to total length KEY_LENGTH ...
 */

void write_new_key (p, kp)
  register char *p, *kp;
  {
    register int i, c;

    for (i = 0; i < KEY_LENGTH; ++i)
      {
        c = *p++;
        if (c == '\0')
            break;
        *kp++ = c;
      }

    for ( ; i < KEY_LENGTH; ++i)
        *kp++ = ' ';
  }

/* ---------------------merge_files.c--------------- */

/* file merge_files.c ...  870902-... ^z
 * more functions to do the work of merging subindices together ...
 * with buffering of inputs and outputs ...
 */

/* function to do the real work of merging sorted k & p
 * files into a single sorted file...
 *
 * Procedure is as follows:
 *
 *    Compare the current key records from each of the N files to be merged.
 *    Take the smallest of the keys (or, if there is a tie, take the one
 *    from the earlier file number) and write its pointer records out to
 *    the *.p file for the next generation.  If the key is a new one, that
 *    is, if it differs from the previous key, write out the previous key
 *    word and the value for cumulative_counts to the *.k file, and reset
 *    the previous key's value to that of the current key.  Then repeat
 *    this whole procedure, until all the input files are exhausted.
 *
 *    The above works provided that we are careful in choosing the smallest
 *    key and never let a file that has been exhausted (reached EOF) win a
 *    comparison.  The function smallest_key does that properly below, since
 *    a file that is at EOF has returned a NULL pointer for its key_rec...
 *
 *  For each file, the variable prev_cc[i] holds the previous value of
 *    cumulative_counts for that file, so that we can determine how
 *    many ptr_recs to write out by doing a simple subtraction.
 *
 * Buffer numbering scheme:  the Kth input file has zbuffer #K
 *    for its keys and zbuffer #(K+n) for its pointers; the output
 *    buffers are zbuffers #(2*n) for keys and #(2*n+1) for pointers.
 */

void nway_merge_kpfiles (ink, inp, outk, outp, n)
  FILE *ink[], *inp[], *outk, *outp;
  register int n;
  {
    register int i;
    KEY_REC *kr[NMERGE], prev_key;
    long prev_cc[NMERGE], out_cc;
    extern long zbufsiz;
    char *strncpy();
    void copy_ptr_recs(), copy_key_rec();

    DEBUG ("--entering nway_merge_kpfiles with n=%d\n", n);
    for (i = 0; i < n; ++i)
        prev_cc[i] = 0;
    out_cc = 0;

    for (i = 0; i < n; ++i)
      {
        create_zbuffer (i, zbufsiz, ink[i], sizeof(KEY_REC));
        create_zbuffer (i + n, zbufsiz, inp[i], sizeof(long));
      }
    create_zbuffer (2 * n, zbufsiz, outk, sizeof(KEY_REC));
    create_zbuffer (2 * n + 1, zbufsiz, outp, sizeof(long));

    for (i = 0; i < n; ++i)
        kr[i] = (KEY_REC *)next_input_item (i);

    i = smallest_key (kr, n);
    strncpy (prev_key.kkey, kr[i]->kkey, KEY_LENGTH);
    DEBUG ("--first key is %.28s\n", prev_key.kkey);

    while (merge_not_finished (kr, n))
      {
        i = smallest_key (kr, n);
        if (zstrcmp (prev_key.kkey, kr[i]->kkey))
          {
            copy_key_rec (prev_key.kkey, out_cc, 2 * n);
            strncpy (prev_key.kkey, kr[i]->kkey, KEY_LENGTH);
          }
        copy_ptr_recs (i + n, kr[i]->ccount - prev_cc[i], 2 * n + 1);
        out_cc += kr[i]->ccount - prev_cc[i];
        prev_cc[i] = kr[i]->ccount;
        kr[i] = (KEY_REC *)next_input_item (i);
      }

    copy_key_rec (prev_key.kkey, out_cc, 2 * n);
    DEBUG ("--finished nway merge ... final key=%.28s\n", prev_key.kkey);
    flush_zbuffer (2 * n);
    flush_zbuffer (2 * n + 1);
    for (i = 0; i < 2 * n + 2; ++i)
        free_zbuffer (i);
  }

/* function to copy a chosen number of ptr_recs (longs) from one file
 * to another ... used in merging two files by merge_kpfiles() above....
 * revised and simplified to use zbuffers....870902 ... ^z
 */

void copy_ptr_recs (inzbuf, count, outzbuf)
  register int inzbuf, outzbuf;
  register long count;
  {
    register long *inp, *outp;
    char *next_input_item(), *next_output_item();

    for ( ; count > 0; --count)
      {
        inp = (long *)next_input_item (inzbuf);
        outp = (long *)next_output_item (outzbuf);
        *outp = *inp;
      }
  }

/* function to write a key record including kkey (key word) and ccount
 * (cumulative_count) to an output file...
 */

void copy_key_rec (kkey, cc, outzbuf)
  char *kkey;
  long cc;
  int outzbuf;
  {
    KEY_REC *outp;
    char *strncpy(), *next_output_item();

    outp = (KEY_REC *)next_output_item (outzbuf);
    strncpy (outp->kkey, kkey, KEY_LENGTH);
    outp->ccount = cc;
  }

/* simple function to see if we are not yet finished with all of the input
 * files coming in to the merge operation ... signified by at least one of
 * the key pointers being non-NULL....
 */

int merge_not_finished (kr, n)
  KEY_REC *kr[];
  register int n;
  {
    register int i;

    for (i = 0; i < n; ++i)
        if (kr[i] != NULL)
            return (TRUE);

    return (FALSE);
  }

/* function to determine the smallest of the n keys pointed to by the
 * kr[] vector of pointers to KEY_RECs ... note that a NULL ptr ranks
 * after any other...also note that in case of a tie, the smaller
 * value of i is the one to return (for a stable merging sort)
 */

smallest_key (kr, n)
  KEY_REC *kr[];
  register int n;
  {
    register int i, smallest;

    for (i = 0; i < n; ++i)
        if (kr[i] != NULL)
          {
            smallest = i;
            break;
          }

    for (i = smallest + 1; i < n; ++i)
        if (kr[i] != NULL)
            if (zstrcmp (kr[smallest]->kkey, kr[i]->kkey) > 0)
                smallest = i;

    return (smallest);
  }

/* -------------------------merge_indices.c-------------- */

/* file "merge_indices.c" ... by ^z -- 870823-...
 * function to merge sorted indices together repeatedly until finished
 * with them all in a single set of *.k/*.p files ...
 *
 * The merging strategy is straightforward enough:
 *    Let "g" denote the generation_number and "f" denote the file_number.
 *    Temporary file names begin with the letter x, which is then followed
 *    by a generation number (decimal), the letter k or p (standing for
 *    key or ptr, respectively), and then the file number (decimal).  Thus,
 *    the file "x0k0" is the keys file #0 for generation #0 (the starting,
 *    pre-merging, generation), file "x2p3" is the ptr file #3 for generation
 *    #2, etc.
 *
 * (The following discussion is specifically for a 2-way merge ... but
 * the generalization for N-way merging is straightforward.)
 *
 *    On a given call to merge_indices, the following may happen:
 *        - files xgkf/xgpf and xgk(f+1)/xgp(f+1) are merged into files
 *            x(g+1)k(f/2)/x(g+1)p(f/2), and then the parent files are
 *            deleted;
 *        - file xgkf isn't found, which means we are done with this
 *            generation and must go on to the next;
 *        - file xgk(f+1) isn't found, which means that either we are
 *            completely done with the merging work (if f=0) and just
 *            have to rename the files xgkf/xgpf into the correct final
 *            names (that is, doc_filename.k/doc_filename.p), or else
 *            (if f>0) we have an odd number of files for this level
 *            of merging, and therefore just have to rename xgkf/xgpf
 *            to x(g+1)k(f/2)/x(g+1)p(f/2).
 */

int merge_indices (doc_filename)
  char *doc_filename;
  {
    static int generation_number = 0, file_number = 0;
    FILE *ink[NMERGE], *inp[NMERGE], *outk, *outp, *open_inkfile(),
            *open_inpfile(), *open_outkfile(), *open_outpfile();
    void fix_oddball_file_name(), fix_final_file_names(), merge_kpfiles(),
            remove_used_infiles(), close_infiles();
    long inwords, indistinctwords, outdistinctwords, file_size();
    int i, n;

    for (n = 0; n < NMERGE; ++n)
      {
        ink[n] = open_inkfile (generation_number, file_number + n);
        if (ink[n] == NULL)
            break;
        inp[n] = open_inpfile (generation_number, file_number + n);
      }

    if (file_number + n == 1)
      {
        DEBUG ("--All finished with merging files!\n", NULL);
        printf ("\nFinished with index sorting!  Final file sizes:\n");
        printf ("%9ld bytes of distinct key words\n", file_size (ink[0]));
        printf ("%9ld bytes of pointers to words\n", file_size (inp[0]));
        close_infiles (ink, inp, n);
        fix_final_file_names (generation_number, doc_filename);
        return (FALSE);
      }

    if (n < 2)
      {
        DEBUG ("--finished generation_number=%d ", generation_number);
        DEBUG ("-- file_number=%d\n", file_number);
        if (n == 1)
          {
            close_infiles (ink, inp, n);
            fix_oddball_file_name (generation_number, file_number);
          }
        ++generation_number;
        file_number = 0;
        return (TRUE);
      }

    outk = open_outkfile (generation_number, file_number);
    outp = open_outpfile (generation_number, file_number);

    inwords = 0;
    indistinctwords = 0;
    for (i = 0; i < n; ++i)
      {
        inwords += file_size (inp[i]) / sizeof(long);
        indistinctwords += file_size (ink[i]) / sizeof(KEY_REC);
      }
    printf ("Merge pass #%d.%d\n", generation_number, file_number / NMERGE);
    printf ("Input files contain %ld words -- %ld distinct words...\n",
                inwords, indistinctwords);

    nway_merge_kpfiles (ink, inp, outk, outp, n);

    outdistinctwords = file_size (outk) / sizeof(KEY_REC);
    printf (" ... merged result has %ld distinct words.\n",
                outdistinctwords);

    close_infiles (ink, inp, n);
    fclose (outk);
    fclose (outp);
    remove_used_infiles (generation_number, file_number, n);

    file_number += NMERGE;

    return (TRUE);
  }

/* --------------------merge_utils.c---------------- */

/* file "merge_utils.c" ... 870902-... ^z
 * misc. utilities for the merge_indices functions...
 */

/* function to open an input key file for this generation and file number
 */

FILE *open_inkfile (generation_number, file_number)
  int generation_number, file_number;
  {
    FILE *fopen();
    char fname[32];

    sprintf (fname, "x%dk%d", generation_number, file_number);
    return (fopen (fname, "rb"));
  }

/* function to open an input ptr file for this generation and file number
 */

FILE *open_inpfile (generation_number, file_number)
  int generation_number, file_number;
  {
    FILE *fopen();
    char fname[32];

    sprintf (fname, "x%dp%d", generation_number, file_number);
    return (fopen (fname, "rb"));
  }

/* function to open an output key file for this generation and file number
 */

FILE *open_outkfile (generation_number, file_number)
  int generation_number, file_number;
  {
    FILE *fopen();
    char fname[32];

    sprintf (fname, "x%dk%d", generation_number + 1, file_number / NMERGE);
    return (fopen (fname, "wb"));
  }

/* function to open an output ptr file for this generation and file number
 */

FILE *open_outpfile (generation_number, file_number)
  int generation_number, file_number;
  {
    FILE *fopen();
    char fname[32];

    sprintf (fname, "x%dp%d", generation_number + 1, file_number / NMERGE);
    return (fopen (fname, "wb"));
  }

/* function to rename the remaining last unpaired key file & ptr file
 * from one generation to the next...
 */

void fix_oddball_file_name (generation_number, file_number)
  int generation_number, file_number;
  {
    char oldname[32], newname[32];

    sprintf (oldname, "x%dk%d", generation_number, file_number);
    sprintf (newname, "x%dk%d", generation_number + 1, file_number / NMERGE);
    if (rename (oldname, newname))
        printf ("Sorry -- error in attempt to rename %s to %s!\n", oldname,
                newname);

    sprintf (oldname, "x%dp%d", generation_number, file_number);
    sprintf (newname, "x%dp%d", generation_number + 1, file_number / NMERGE);
    if (rename (oldname, newname))
        printf ("Sorry -- error in attempt to rename %s to %s!\n", oldname,
                newname);
  }

/* function to give the final key and ptr files their proper ultimate
 * names ...
 */

void fix_final_file_names (generation_number, doc_filename)
  int generation_number;
  char *doc_filename;
  {
    char oldname[32], finalname[65];

    sprintf (oldname, "x%dk0", generation_number);
    sprintf (finalname, "%s.k", doc_filename);
    if (rename (oldname, finalname))
        printf ("Sorry -- error in renaming file %s to %s!\n", oldname,
                finalname);

    sprintf (oldname, "x%dp0", generation_number);
    sprintf (finalname, "%s.p", doc_filename);
    if (rename (oldname, finalname))
        printf ("Sorry -- error in renaming file %s to %s!\n", oldname,
                finalname);
  }

/* function to get rid of the superfluous k & p files now that they
 * have been merged into the next generation....
 */

void remove_used_infiles (generation_number, file_number, n)
  int generation_number, file_number, n;
  {
    char fname[32];
    register int i;

    for (i = 0; i < n; ++i)
      {
        sprintf (fname, "x%dk%d", generation_number, file_number + i);
        DEBUG ("--removing %s\n", fname);
        if (unlink (fname))
            printf ("Sorry -- unable to delete file %s!\n", fname);
        sprintf (fname, "x%dp%d", generation_number, file_number + i);
        DEBUG ("--removing %s\n", fname);
        if (unlink (fname))
            printf ("Sorry -- unable to delete file %s!\n", fname);
      }
  }

/* function to close out the ink and inp files that have been opened...
 */

void close_infiles (ink, inp, n)
  FILE *ink[], *inp[];
  register int n;
  {
    register int i;

    for (i = 0; i < n; ++i)
      {
        fclose (ink[i]);
        fclose (inp[i]);
      }
  }

/* ---------------------misc_utils.c----------------- */

/* file "misc_utils.c" -- miscellaneous utilities for the qndxr project...
 * by ^z -- 870821-...
 */

/* this function returns a value for the size of the internal buffers,
 * zbufsiz, and it also takes care of setting the other global parameters,
 * keep_all_punct, keep_embedded_punct, and keep_special_chars,
 * which govern how punctuation and special characters are handled.
 * They are set based on flags such as -e, -a, and -s in the input line.
 * The input parameters are taken one after another, and any that do
 * not convert to a nonzero number are scanned for the letters "e", "a", and
 * "s".  If a parameter is not reset, the default is to leave keep_all_punct,
 * keep_embedded_punct, and keep_special_chars as FALSE.
 * The default memory size is DEFAULT_MEMSIZ, set in the header file.
 */

long set_params (argc, argv)
  int argc;
  char *argv[];
  {
    int i;
    void set_parser_options();
    long zb = 0, n, atol(), set_zbufsiz();

    for (i = 2; i < argc; ++i)
      {
        n = atol (argv[i]);
        if (n != 0)
            zb = set_zbufsiz (n);
        else
            set_parser_options (argv[i]);
      }

    if (zb == 0)
        zb = set_zbufsiz (DEFAULT_MEMSIZ);

    return (zb);
  }

/* determine how big we should make the big buffers -- want to be sure
 * to have room for at least 2*NMERGE+2 of them at one time in memory,
 * so that the N-way merge function can hold all the input files
 * (2N) plus the output files (2).
 *
 * NOTE that the chosen buffer size must be a multiple of both sizeof(long)
 * and sizeof(KEY_REC); I ensure that very simply in what follows by
 * a little modular arithmetic.  I also take care of the sign and of the
 * requirement that the buffer size be non-zero -- thus, UNIX aficionados
 * can precede the parameter with a "-" with no ill effects.  I have put in
 * various safeguards against picking illegal buffer sizes, along with an
 * ultimate safety net small minimum value for zbufsiz.
 */

long set_zbufsiz (zb)
  long zb;
  {
    char *testb;

    if (zb < 0)
        zb = -zb;

    zb /=  (2 * NMERGE + 2);
    zb = zb - zb % (sizeof(KEY_REC) * sizeof(long));

    if (zb < sizeof(KEY_REC) * sizeof(long))
        zb = sizeof(KEY_REC) * sizeof(long);

    DEBUG ("--Checking for memory to support buffer size zb=%ld\n", zb);
    testb = make_buf ((2 * NMERGE + 2) * zb);
    free (testb);

    return (zb);
  }

/* function to set the parser options based on a character string ...
 * 'a' turns on keep_all_punct, 'e' turns on keep_embedded_punct,
 * and 's' turns on keep_special_chars
 */

void set_parser_options (str)
  char *str;
  {
    extern int keep_all_punct, keep_embedded_punct, keep_special_chars;

    while (*str)
      {
        if (*str == 'a')
          {
            keep_all_punct = TRUE;
            printf ("All punctuation characters will be kept.\n");
          }
        else if (*str == 'e')
          {
            keep_embedded_punct = TRUE;
            printf ("Embedded punctuation characters will be kept.\n");
          }
        else if (*str == 's')
          {
            keep_special_chars = TRUE;
            printf ("Special characters will be kept.\n");
          }

        ++str;
      }
  }

/* function to check for user interruption of operations (for use in the
 * Macintosh version of this program only) ... call SystemTask() to give
 * desk accessories a bit of time, and then check for non-null events
 * with GetNextEvent() ... if something comes along (a mouse click, or key
 * press, or whatever) let the user choose to exit the program or to
 * carry on ....
 */

#ifdef LIGHTSPEED

void check_interrupt ()
  {
    EventRecord myEvent;
    char cmd[256], *gets();
    void exit();

    SystemTask ();
    if (GetNextEvent (everyEvent, &myEvent))
      {
        fprintf (stderr, "\Quit indexing now?\n");
        gets (cmd);
        if (cmd[0] == 'y' || cmd[0] == 'Y')
            exit (0);
      }
  }

#endif

/* a very simple function to return the size of a file ... leave the file
 * pointer positioned at wherever it was when the routine was entered....
 * should work fine with stdio methods, but not guaranteed compatible when
 * mixed in with Mac-specific FSRead() function!
 */

long file_size (fp)
  FILE *fp;
  {
    long fpos, result, ftell();

    DEBUG ("--finding the size of file fp=%ld\n", fp);
    fpos = ftell (fp);
    fseek (fp, 0L, 2);
    result = ftell (fp);
    fseek (fp, fpos, 0);
    return (result);
  }

/* ----------------------open_files.c----------------- */

/* functions to open files as needed in qndxr work...
 */

FILE *open_docfile (argc, argv) 
  int argc;
  char *argv[];
  {
    FILE *fp, *fopen();
    void exit();

    if (argc < 2)
      {
        printf ("\nToo few command line arguments!\n");
        printf ("\nEnter a text file name (no embedded spaces allowed)\n");
        printf ("and the program will build and sort a complete inverted\n");
        printf ("index to that file.  Use \">\" in front of a file name to\n");
        printf ("redirect console output (UNIX-fashion) if desired.\n");
        printf ("Give the program a value for available memory (in bytes)\n");
        printf ("if the default value of 512kB is unsatisfactory ... larger\n");
        printf ("memory allows for faster index building and sorting.\n");
        printf ("Other command-line arguments:\n");
        printf ("  \"-e\" preserves embedded punctuation\n");
        printf ("  \"-a\" preserves all punctuation characters\n");
        printf ("  \"-s\" preserves special (non-ASCII) characters\n");
        printf ("\nContact Mark Zimmermann, 9511 Gwyndale Drive, Silver Spring\n");
        printf ("Maryland  20910  USA; 301-565-2166; science (at) nems.arpa;\n");
        printf ("[75066,2044] on CompuServe for further information. - ^z\n");
        exit (1);
      }

    if ((fp = fopen (argv[1], "r")) == NULL)
      {
        printf ("\nFatal error opening document file\"%s\"!\n", argv[1]);
        exit (1);
      }

    return (fp);
  }

/* open the key file with proper name for this pass ... file will be
 * named "x0kN" where N represents the pass number ....
 */

FILE *open_kfile (pass_number)
  int pass_number;
  {
    FILE *fopen();
    char fname[32];

    sprintf (fname, "x0k%d", pass_number);
    return (fopen (fname, "wb"));
  }

/* open the ptr file with proper name for this pass ... file will be
 * named "x0pN" where N represents the pass number ....
 */

FILE *open_pfile (pass_number)
  int pass_number;
  {
    FILE *fopen();
    char fname[32];

    sprintf (fname, "x0p%d", pass_number);
    return (fopen (fname, "wb"));
  }

/* ----------------------zqsort.c-------------------- */

/* file zqsort.c -- by ^z, 870823-...
 * my quicksort to sort out the ptr array ... based, at least initially,
 * on the Lightspeed C library qsort routine, specialized to the task
 * at hand here ...
 */

/*  sort elements "first" through "last" */

void zqsort (first, last)
  register char **first, **last;
  {
    register char **i, **j, *tmp;

    while (last - first > 1)
      {
        i = first;
        j = last;
        for (;;)
          {
            while (++i < last && compare_ptrs (i, first) < 0)
                ;
            while (--j > first && compare_ptrs (j, first) > 0)
                ;
            if (i >= j)
                 break;
             tmp = *i;
             *i = *j;
             *j = tmp;
          }
        tmp = *first;
        *first = *j;
        *j = tmp;
        if (j - first < last - (j + 1))
          {
            zqsort (first, j);
            first = j + 1;
          }
        else
          {
            zqsort (j + 1, last);
            last = j;
          }
      }
  }

/* function to compare two index ptrs and give a result appropriate
 * for quicksort to use in alphabetizing them....
 *
 * Since the words pointed to have already been turned into all capital
 * letters and delimiters have been filtered out, simply doing zstrcmp()
 * for KEY_LENGTH letters works fine!
 *
 * Slight modification to make the quicksort stable:  if two words tie,
 * then we want to compare their pointers to make the lesser one come
 * out first in the sort ...
 */

int compare_ptrs (p1, p2)
  register char **p1, **p2;
  {
    register int diff;

    diff = zstrcmp (*p1, *p2);
    if (diff == 0)
        diff = ((*p1 - *p2) > 0) ? 1 : -1;
    return (diff);
  }

/* my function to compare two strings and give a result as to who is
 * alphabetically earlier.  Note that this is almost the same as strncmp()
 * with the fixed value of KEY_LENGTH as the maximum comparison distance,
 * except that I must be sure to mask the characters to make them positive
 * (since we want to be able to handle the non-ASCII funny letters in
 * the Apple character set properly/consistently).  If the masking isn't
 * done, then inconsistent results can occur with those non-ASCII chars!
 */

int zstrcmp (s1, s2)
  register char *s1, *s2;
  {
    register int n = KEY_LENGTH;

    for (; --n && ((*s1 & CMASK) == (*s2 & CMASK)); s1++, s2++)
        if (!*s1) break;

    return ((*s1 & CMASK) - (*s2 & CMASK));
  }

-------