v02i041: qndxr - Inverted index builder for text files
Mark Zimmermann
science at nems.ARPA
Wed Feb 3 12:08:40 AEST 1988
Comp.sources.misc: Volume 2, Issue 41
Submitted-By: "Mark Zimmerman" <science at nems.ARPA>
Archive-Name: qndxr
Comp.sources.misc: Volume 2, Issue 41
Submitted-By: "Mark Zimmerman" <science at nems.ARPA>
Archive-Name: qndxr
[Yes, indeed, another un-shared program! Sigh. ++bsa]
Following C code builds inverted index files for big text files ... seems
to work on Sun, VAX, and Macintosh in my tests. Heavily commented in a
helter-skelter style.... ^z
/* revised main indexer program 'qndxr.c' by ^z -- 870820-...
*
* revised 870930-1008 to give the user more options and control of how
* delimiters are defined and handled. Now can choose:
* - the default: any non-alphanumeric characters are delimiters (my
* original approach, and perhaps the simplest to understand and use);
* - option "-e": keep punctuation characters ONLY when they are embedded
* in between non-delimiter characters, otherwise turn them into
* delimiters (approach invented by Bill Hole of MicroDynamics Ltd.;
* very nice for displaying 'words' such as don't, non-ASCII, 3.14159,
* 87/10/02, etc. without splitting them up and requiring proximity
* searching);
* - option "-a": keep ALL punctuation characters, whether embedded in
* words or not (best for indexing FORTH programs and such, but does
* make spurious 'distinct' words at ends of sentences, etc.);
* - option "-s": keep all special (non-ASCII) characters (with values
* in the range 128-255; in the Apple character set, these are the
* symbols that carry umlauts, tilde, accents, etc., making this
* option the best for foreign-language and symbol-heavy files);
* default is to remove the special characters.
*
* Note that option "-s" can be combined with any of the other options;
* options "-e" and "-a" are mutually exclusive. (Actually, "-a" in my
* implementation overrides "-e".) The "-e" option does require
* peeking ahead one character before deciding on the fate of
* a punctuation symbol, but that isn't too hard to arrange....
*
* ---------------------------------------------------------------
*
* Synopsis of the qndxr.c program's approach to sorting an index:
* - load a chunk of the input file into memory; during the loading,
* filter the characters appropriately, turning lower-case
* into capitals, changing delimiters into \0's, and noting down
* the locations of the beginnings of keys (words) in a ptr array;
* - do a quicksort on the ptr array, using the keys in the buffer
* to sort on;
* - write the resulting sorted list of pointers along with their keys
* to disk as files, named x0k0 and x0p0, in the *.k and *.p formats
* used in ndxr.15;
* - repeat the above steps with another chunk of the input file, and
* name the resulting files x0k1 and x0p1; repeat, etc., until the
* input file is all sorted into chunks;
* - begin to merge the sorted files in an N-way merge ... for the
* specific case of N=2, for example, merge x0k0/x0p0 and
* x0k1/x0p1 into x1k0/x1p0; merge x0k2/x0p2 and x0k3/x0p3 into
* x1k1/x1p1; etc., deleting the 0th-generation files as they are
* merged;
* - merge x1k0/x1p0 and x1k1/x1p1 into x2k0/x2p0, etc., and continue
* merging subsequent generations until everything has become a
* single pair of index files, xnk0/xnp0, which is then renamed
* to be the original document name with '.k' and '.p' endings.
*
* ---------------------------------------------------------------
*
* Comparison with the older radix sort approach:
* The new quicksort/mergesort method gains a bit in speed (since so
* much of the work is done in memory rather than streaming from disk
* back and forth) and also saves on disk space requirements (since
* the k and p files are significantly compressed relative to the raw
* index files formerly used during index sorting). The old radix sort
* tended to only achieve about 2 MB/hour indexing on my Mac Plus setup,
* and it required 5-6 times the size of the original file in free
* disk space during indexing. This new approach achieves >4 MB/hour
* in tests on the same Mac Plus, and only requires about 1.9 times
* the original file size in free space!
*
* The new approach also allows for greater key length -- try recompiling
* with KEY_LENGTH of 28, for instance -- without such severe disk space
* penalties as the old radix sort would have incurred, and without severe
* time penalties. (Duplicate words in the chunks are only stored once
* in the key file, though each must have an entry in the pointer file.)
*
* The only obvious disadvantage of the quicksort/mergesort approach is
* that it is an O(NlogN) procedure rather than O(N), and thus gets slower
* when files get very large. (Radix sorting is strictly linear in N,
* in theory anyway.)
*
* ---------------------------------------------------------------
*
* For further details, contact the author:
* Mark Zimmermann science (at) NEMS.ARPA
* 9511 Gwyndale Dr. [75066,2044] on CompuServe
* Silver Spring, MD 20910
* 301-565-2166
* ---------------------------------------------------------------
*/
#include <stdio.h>
#include <strings.h>
#include <ctype.h>
/* --------------header file--------------- */
/* header file for project qndxr -- by ^z -- 870820-0913-...
*/
/* tell what compiler we're using ... Lightspeed C by Think, if the
* following line is #defined ... essentially, when LIGHTSPEED is here,
* assume that we have only 16-bit int variables and that Macintosh
* toolbox traps are available ... when LIGHTSPEED is not #defined,
* assume that ints can hold more like 32 bits without error, so more
* can be done using standard I/O routines from <stdio>
*/
/* #define LIGHTSPEED */
/* preprocessor 'function' to turn on/off debug printing of detailed info
* during program runs ... when debugging, use a statement:
* #define DEBUG(fmt, arg) printf (fmt, arg)
* ... and when not debugging, just use: #define DEBUG(fmt, arg)
* to effectively remove those print commands....
*/
/* #define DEBUG(fmt, arg) printf (fmt, arg) */
#define DEBUG(fmt, arg)
/* sort on KEY_LENGTH characters ... make it 28 rather arbitrarily as an
* experiment ... want it long enough to avoid truncation of legitimate
* words, but short enough to save some space in the *.k files ... an
* alternative value is 12, for compatibility with the older ndxr.c program
* and brwsr.c, if they haven't been revised to allow for longer keys....
*/
#define KEY_LENGTH 28
/* define a structure for the index_keys file
*/
typedef struct
{
char kkey[KEY_LENGTH];
long ccount;
} KEY_REC;
/* define a structure for my simpleminded I/O buffers ...
*/
typedef struct
{
char *zbufbase;
char *zbufp;
long zbufcounter;
long zbufsize;
FILE *zbuffilep;
int zbufitemsize;
} ZBUF;
/* choose this size to be the default memory size used for total buffer
* space ... for convenience on the Mac Plus, I have picked 512 kB, which
* leaves me a bit of space to spare ....
*/
#define DEFAULT_MEMSIZ 524288
/* merge this many files at each stage of the merging operation for index
* building ... 2 means a binary merge, etc. ... one needs to have at least
* 5 + 2 * NMERGE I/O buffers around: for each of NMERGE files, there is
* a *.k keys file and a *.p pointers file; plus there must be a single
* output *.k and a single output *.p file; plus there is the need for stdin,
* stdout, and stderr to be open as well. Thus, I have found that a 4-way
* merge (NMERGE = 4) works pretty nicely....
*/
#define NMERGE 4
#ifndef TRUE
#define TRUE 1
#endif
#ifndef FALSE
#define FALSE 0
#endif
/* CMASK makes sure that a returned character isn't sign-extended
*/
#ifndef CMASK
#define CMASK 0xFF
#endif
/* put the prototypes for my functions here... assume that if LIGHTSPEED
* is #define'd, then we want to use prototypes....
*/
#ifdef LIGHTSPEED
/* in qndxr unix main.c */
void punt(void);
void openfile(FILE *file, char *mode);
void main(void);
/* in qndxr.c */
void _main(int argc, char *argv[]);
/* in bufio.c */
void create_zbuffer (int n, long bufsize, FILE *buffile, int bufitemsize);
void free_zbuffer (int n);
char *next_input_item (int n);
void load_zbuffer (int n);
char *next_output_item (int n);
void flush_zbuffer (int n);
/* in build_indices.c */
int build_indices (void);
/* in doc_buf.c */
char *make_buf (long bufsiz);
long load_doc_buffer (char *doc, long doc_bufsiz, char **ptr);
int filtered_getc (void);
/* in merge_files.c */
void nway_merge_kpfiles (FILE *ink[], FILE *inp[], FILE *outk, FILE *outp,
int n);
void copy_ptr_recs (int inzbuf, long count, int outzbuf);
void copy_key_rec (char *kkey, long ccount, int outzbuf);
int merge_not_finished (KEY_REC *kr[], int n);
int smallest_key (KEY_REC *kr[], int n);
/* in merge_indices.c */
int merge_indices (char *doc_filename);
/* in file_utils.c */
void write_sorted_files (char *doc, char **ptr, long nwords,
int pass_number, long offset);
int is_new_word (char *w0, char *w1);
void write_new_key (char *p, char *kp);
/* in merge_utils.c */
FILE *open_inkfile (int generation_number, int file_number);
FILE *open_inpfile (int generation_number, int file_number);
void fix_oddball_file_name (int generation_number, int file_number);
void fix_final_file_names (int generation_number, char *doc_filename);
FILE *open_outkfile (int generation_number, int file_number);
FILE *open_outpfile (int generation_number, int file_number);
void remove_used_infiles (int generation_number, int file_number, int n);
void close_infiles (FILE *ink[], FILE *inp[], int n);
/* in misc_utils.c */
long set_params (int argc, char *argv[]);
long set_zbufsiz (long zb);
void set_parser_options (char *str);
void check_interrupt (void);
long file_size (FILE *fp);
/* in open_files.c */
FILE *open_docfile (int argc, char *argv[]);
FILE *open_kfile (int pass_number);
FILE *open_pfile (int pass_number);
/* in zqsort.c */
void zqsort (char **first, char **last);
int compare_ptrs (char **p1, char **p2);
int zstrcmp (char *s1, char *s2);
#endif
/* ----------------------main program file---------------- */
/* define a global variable to hold the chosen size of a fundamental
* quantum of buffer space ... experiments with dynamically choosing
* it at runtime seem to cause occasional problems on the Macintosh,
* and have other risks with virtual memory systems, so default to
* DEFAULT_MEMSIZ total buffer space but let the user override with
* a command-line choice to suit ... see function set_zbufsiz() for
* further discussion....
*/
long zbufsiz;
/* define a global variable to tell whether or not all punctuation chars
* are kept...
*/
int keep_all_punct = FALSE;
/* define a global variable to tell whether or not only embedded punctuation
* characters are preserved...
*/
int keep_embedded_punct = FALSE;
/* define a global variable to tell whether or not to keep special,
* non-ASCII characters...
*/
int keep_special_chars = FALSE;
/* define a global variable to hold the (FILE *) for the document file
*/
FILE *doc_file;
/* main program to do the work ...
*/
void main(argc, argv)
int argc;
char *argv[];
{
unsigned long start_time, end_time, time();
long set_params(), file_size();
char *ctime();
FILE *open_docfile();
time (&start_time);
printf ("Starting work: %s\n", ctime (&start_time));
printf ("\nOpening document file \"%s\"...\n", argv[1]);
doc_file = open_docfile (argc, argv);
zbufsiz = set_params (argc, argv);
printf ("Using buffers of size %ld each...\n", zbufsiz);
printf ("Beginning to build index...\n");
while (build_indices ());
printf ("Beginning to merge index subfiles...\n");
while (merge_indices (argv[1]));
time (&end_time);
printf ("\nEnding work: %s\n", ctime (&end_time));
printf ("Elapsed time = %ld seconds.\n", end_time - start_time);
printf ("Indexing rate = %.2f megabytes/hour.\n", file_size (doc_file) *
0.003600 / ( end_time - start_time ));
printf ("Input database file \"%s\" was %ld bytes long.\n", argv[1],
file_size (doc_file));
fclose (doc_file);
}
/* -------------------bufio.c file------------------- */
/* This is the file 'bufio.c' ... by ^z, 870830-0913- ... it includes
* definitions of some buffer words to accumulate information on its
* way to/from the disks ... use these to speed up operations and reduce
* disk head movements, in place of calls to fputc(), fread(), fwrite(),
* etc. ... try to make them very general so that they will simplify
* other tasks....
*
*/
/* set up some buffers here to save on disk head movement and speed up
* operations ... use my simple ZBUF structure for both input and
* output operations ... keep everything static, local here to this file
* for safety's sake ... allocate enough items here for all the buffers
* that we may need for an NMERGE-way merge operation, taking into account
* the need for input buffers for all the NMERGE *.k and *.p files plus
* the output files *.k and *.p as well....
*/
static ZBUF zbuffer[2 + 2 * NMERGE];
/* function to create a zbuffer and set its parameters appropriately
*/
void create_zbuffer (n, bufsize, buffile, bufitemsize)
int n, bufitemsize;
long bufsize;
FILE *buffile;
{
char *make_buf();
zbuffer[n].zbufbase = make_buf (bufsize);
zbuffer[n].zbufp = zbuffer[n].zbufbase;
zbuffer[n].zbufcounter = 0;
zbuffer[n].zbufsize = bufsize;
zbuffer[n].zbuffilep = buffile;
zbuffer[n].zbufitemsize = bufitemsize;
}
/* function to free a zbuffer ...
*/
void free_zbuffer (n)
int n;
{
free (zbuffer[n].zbufbase);
}
/* function to return a pointer to the next item in a chosen input
* buffer 'n'; it reloads the buffer if necessary ... returns NULL
* pointer when there's nothing left in the file ...
*/
char *next_input_item (n)
int n;
{
char *result;
void load_zbuffer();
if (zbuffer[n].zbufcounter == 0)
load_zbuffer (n);
zbuffer[n].zbufcounter -= zbuffer[n].zbufitemsize;
if (zbuffer[n].zbufcounter >= 0)
{
result = zbuffer[n].zbufp;
zbuffer[n].zbufp += zbuffer[n].zbufitemsize;
return (result);
}
else
return (NULL);
}
/* function to load the nth zbuffer appropriately ... it resets the buffer
* pointers, etc. ... might as well give the user a chance to interrupt (in
* the Macintosh version) here, as long as we have to go to the disk anyway....
*/
void load_zbuffer (n)
int n;
{
long nread;
void exit(), check_interrupt();
#ifdef LIGHTSPEED
nread = zbuffer[n].zbufsize;
FSRead (zbuffer[n].zbuffilep->refnum, &nread, zbuffer[n].zbufbase);
#else
nread = fread (zbuffer[n].zbufbase, sizeof(char),
(int)zbuffer[n].zbufsize, zbuffer[n].zbuffilep);
#endif
zbuffer[n].zbufp = zbuffer[n].zbufbase;
zbuffer[n].zbufcounter = nread;
if (ferror (zbuffer[n].zbuffilep))
{
printf ("\nFatal error in load_zbuffer while reading in data!\n");
printf ("(n=%d, nread=%ld)\n", n, nread);
exit (1);
}
#ifdef LIGHTSPEED
check_interrupt ();
#endif
}
/* function to return a pointer to the right place to store the next
* output item for the nth zbuffer ... when the buffer becomes full,
* it flushes it to disk, resets pointers, etc.
*/
char *next_output_item (n)
int n;
{
char *result;
void flush_zbuffer();
if (zbuffer[n].zbufcounter == zbuffer[n].zbufsize)
flush_zbuffer (n);
result = zbuffer[n].zbufp;
zbuffer[n].zbufp += zbuffer[n].zbufitemsize;
zbuffer[n].zbufcounter += zbuffer[n].zbufitemsize;
return (result);
}
/* function to flush out a zbuffer to the disk ... might as well give user
* a chance to interrupt here (in the Macintosh version), as long as we
* have to go to the disk anyway....
*/
void flush_zbuffer (n)
int n;
{
long chars_written;
void exit();
if (zbuffer[n].zbufcounter == 0)
return;
#ifdef LIGHTSPEED
chars_written = zbuffer[n].zbufcounter;
FSWrite (zbuffer[n].zbuffilep->refnum, &chars_written,
zbuffer[n].zbufbase);
#else
chars_written = fwrite (zbuffer[n].zbufbase, sizeof(char),
(int)zbuffer[n].zbufcounter, zbuffer[n].zbuffilep);
#endif
if (ferror(zbuffer[n].zbuffilep) || chars_written == 0)
{
printf ("\nFatal error in flush_zbuffer while writing out data!\n");
printf ("(n=%d, zbufcounter=%ld, chars_written=%ld)\n", n,
zbuffer[n].zbufcounter, chars_written);
exit(1);
}
zbuffer[n].zbufp = zbuffer[n].zbufbase;
zbuffer[n].zbufcounter = 0;
#ifdef LIGHTSPEED
check_interrupt ();
#endif
}
/* ---------------build_indices.c file---------------- */
/* file build_indices.c ... by ^z, 870820-0913-...
*
* revised 870930-871007 to allow more user options on keeping/discarding
* punctuation, etc. -- ideas based on Bill Hole's suggestions
*
* contains subroutine to build indices for each chunk of input document
* (database) text file for program qndxr ...
*
* Strategy is as outlined in qndxr: take in a big chunk of the doc_file,
* generate the pointers to each word in the buffer as the buffer contents
* are converted into appropriate (all caps, delimiters filtered into
* spaces) form for sorting; sort the pointers in memory; and then write
* out to disk the pointers and keys in the ndxr.c format for *.k and *.p
* files.
*
* Allocate space for the doc and ptr buffers here so as to maximize use
* of available memory ... note that we need to have room for the doc
* buffer, for a ptr buffer that might (in the worst case of a file full
* of 1-letter words) be twice as long as the doc buffer, and also space
* for two standard zbuffers to accumulate the output *.k and *.p file
* info in before sending it to disk.
*
* Note that for speed, while they are being sorted the pointers just point
* directly to the key strings in the input buffer; they must be converted
* into true offset pointers relative to the 0th byte of the document file
* as they are written to disk in the *.p file! Make sure that all of
* the delimiters in the document/database buffer are converted into
* '\0' characters so that string comparison functions will work right.
*
* Also note that to avoid edge effects at the end of the buffer, an extra
* amount of space is required at the end of the buffer, of length
* KEY_LENGTH, to accomodate the end of the last word in the buffer.
*
* Use static local variables in the function here to keep track of where
* we are in the document file from one chunk to the next, what chunk
* number we are on now, etc.
*
* Give the user a chance to interrupt operations (in the Macintosh
* version of this program) at intervals here, as long as
* there are time-consuming I/O or sorting or scanning operations
* to be done ...
*/
int build_indices ()
{
static int pass_number = 0;
long doc_bufsiz, offset, load_doc_buffer(), nwords,
ftell();
extern long zbufsiz;
extern FILE *doc_file;
char *doc, **ptr, *malloc(), *mlalloc(), *calloc(), *clalloc();
void zqsort(), write_sorted_files();
doc_bufsiz = 2 * NMERGE * zbufsiz / 3;
DEBUG ("--allocating doc buffer of size %ld\n", doc_bufsiz + KEY_LENGTH);
doc = make_buf (doc_bufsiz + KEY_LENGTH);
DEBUG ("--allocating ptr buffer of size %ld\n", doc_bufsiz * 2);
ptr = (char **)make_buf (doc_bufsiz * 2);
#ifdef LIGHTSPEED
check_interrupt ();
#endif
offset = ftell (doc_file);
DEBUG ("--loading doc buffer beginning at offset %ld\n", offset);
nwords = load_doc_buffer (doc, doc_bufsiz, ptr);
if (nwords == 0)
{
DEBUG ("--Building done ... now freeing doc & ptr buffers\n", NULL);
free (doc);
free ((char *)ptr);
return (FALSE);
}
printf ("Index subfile #%d contains %ld words...\n", pass_number,
nwords);
#ifdef LIGHTSPEED
check_interrupt ();
#endif
DEBUG ("--sorting ptr array\n", NULL);
zqsort (ptr, ptr + nwords);
#ifdef LIGHTSPEED
check_interrupt ();
#endif
DEBUG ("--writing sorted keys and ptrs to disk\n", NULL);
write_sorted_files (doc, ptr, nwords, pass_number, offset);
#ifdef LIGHTSPEED
check_interrupt ();
#endif
DEBUG ("--freeing doc & ptr buffers\n", NULL);
free (doc);
free ((char *)ptr);
++pass_number;
return (TRUE);
}
/* ---------------doc_buf.c file------------------ */
/* file doc_buf.c ... by ^z 870830-0919-...
* functions to load in text from the document file and then
* save out keys and pointers to the *.k and *.p files ...
* modified 871007-... to unify the doc-buffer loading with the
* character-filtering and the pointer array building....
*/
/* function to create a buffer for me to use...
*/
char *make_buf (bufsiz)
long bufsiz;
{
char *buf, *malloc(), *mlalloc();
void exit();
DEBUG ("--allocating a buffer, size = %ld\n", bufsiz);
#ifdef LIGHTSPEED
buf = mlalloc (bufsiz);
#else
buf = malloc ((unsigned int)bufsiz);
#endif
if (buf == NULL)
{
printf ("\nFatal error in attempt to allocate a buffer!\n");
printf ("(bufsiz=%ld)\n", bufsiz);
exit(1);
}
return (buf);
}
/* function to load the document buffer ... bring in doc_bufsiz
* characters, and then enough more to finish out the final word,
* followed by a terminal delimiter .... as the characters are read
* in, filter them appropriately (depending on user choices) and
* build the pointer array in memory to the first character of each
* word ... return the total number of words that were
* read in to the buffer (zero if we're finished with the file)
*
* ... note that one must be sure to pull in and throw away
* any excess characters beyond KEY_LENGTH in the final word, so that
* the remaining fragment doesn't show up as the first "word" in the
* next chunk of the file....
*
* Routine modified 871007-... in order to unify the buffer-loading and
* character-filtering and pointer-array-building operations, and to go
* back to using getc() from <stdio> rather than Macintosh-specific
* operations for loading the buffer....
*/
long load_doc_buffer (doc, doc_bufsiz, ptr)
char *doc, **ptr;
long doc_bufsiz;
{
int c, i, in_a_word = FALSE;
char **ptr0, *end_doc_buf;
extern FILE *doc_file;
DEBUG ("--Loading document buffer...\n", NULL);
ptr0 = ptr;
end_doc_buf = doc + doc_bufsiz;
while (doc < end_doc_buf)
{
c = filtered_getc ();
DEBUG ("--filtered character = \"%c\"\n", c);
if (c == EOF)
{
*doc++ = '\0';
in_a_word = FALSE;
break;
}
if (! c)
in_a_word = FALSE;
else if (! in_a_word)
{
*ptr++ = doc;
in_a_word = TRUE;
DEBUG ("--adding new ptr = %ld\n", doc);
}
*doc++ = c;
}
if (doc == end_doc_buf && in_a_word)
{
DEBUG ("--finishing off a final buffer word...\n", NULL);
for (i = 0; i < KEY_LENGTH; ++i)
{
c = filtered_getc ();
if (c == EOF)
{
*doc++ = '\0';
break;
}
if (! c)
{
*doc++ = '\0';
break;
}
*doc++ = c;
}
if (i == KEY_LENGTH)
while (filtered_getc ())
;
}
return (ptr - ptr0);
}
/* function to get the next character from the document file and filter it
* as the user desires ... return:
* EOF if end of file encountered;
* '/0' if the character is a delimiter;
* otherwise, the character itself (filtered into upper-case,
* if it was lower-case)
*/
int filtered_getc ()
{
static int prevc, c = '\0';
int nextc;
extern int keep_all_punct, keep_embedded_punct, keep_special_chars;
extern FILE *doc_file;
prevc = c;
c = getc (doc_file);
if (c == EOF)
return (EOF);
if (islower (c))
return (c = toupper (c));
if (isupper (c) || isdigit (c))
return (c);
if (isspace (c))
return (c = '\0');
if (keep_special_chars && ! isascii (c))
return (c);
if (keep_all_punct && ispunct (c))
return (c);
if (keep_embedded_punct && ispunct (c))
{
if (prevc == '\0')
return (c = '\0');
nextc = getc (doc_file);
ungetc (nextc, doc_file);
if (nextc == EOF)
return (c = '\0');
if (isalnum (nextc) || (keep_special_chars && ! isascii (nextc)))
return (c);
else
return (c = '\0');
}
return (c = '\0');
}
/* ------------------------file_utils.c------------- */
/* file file_utils.c ... by ^z -- 870820-0913-...
* some utility routines for qndxr project, associated with files...
*/
/* function to write out sorted k & p files based on the doc and ptr
* arrays in memory....
*
* The kfile format is as described in detail elsewhere:
* the key word, turned into all capital letters and with spaces
* afterward, of fixed length KEY_LENGTH; and
* the cumulative count of how many words have passed before, including
* the current word, a long integer.
*
* Function revised 870907-... by ^z to use zbuffer method....
*/
void write_sorted_files (doc, ptr, nwords, pass_number, offset)
char *doc, **ptr;
long nwords, offset;
int pass_number;
{
extern long zbufsiz;
FILE *kfile, *pfile, *open_kfile(), *open_pfile();
char *prev_word, *next_output_item();
KEY_REC *outk;
long *outp, i, file_size ();
void create_zbuffer(), write_new_key();
DEBUG ("--Entering write_sorted_files with nwords %ld\n", nwords);
if (nwords == 0)
return;
DEBUG ("--Opening kfile & pfile for pass_number = %d\n", pass_number);
kfile = open_kfile (pass_number);
pfile = open_pfile (pass_number);
DEBUG ("--Creating buffers for keys & ptrs, size = %ld\n", zbufsiz);
create_zbuffer (0, zbufsiz, kfile, sizeof(KEY_REC));
create_zbuffer (1, zbufsiz, pfile, sizeof(long));
DEBUG ("--Beginning to write keys and ptrs; first key=%.28s\n", ptr[0]);
prev_word = ptr[0];
outk = (KEY_REC *)next_output_item (0);
write_new_key (ptr[0], outk->kkey);
for (i = 0; i < nwords; ++i)
{
if (is_new_word (prev_word, ptr[i]))
{
outk->ccount = i;
outk = (KEY_REC *)next_output_item (0);
write_new_key (ptr[i], outk->kkey);
prev_word = ptr[i];
}
outp = (long *)next_output_item (1);
*outp = (ptr[i] - doc) + offset;
}
outk->ccount = i;
flush_zbuffer (0);
flush_zbuffer (1);
DEBUG ("--Getting rid of key and ptr buffers...\n", NULL);
free_zbuffer (0);
free_zbuffer (1);
printf (" ...%ld distinct words\n",
file_size (kfile) / sizeof(KEY_REC));
fclose (kfile);
fclose (pfile);
}
/* function to determine if the current word is the same as or different
* from the previous word -- if it is different, we'll need to write an
* entry out to the key file kfile -- compare the words up to the first
* '\0', or for a maximum distance of KEY_LENGTH, and return TRUE
* if they differ, FALSE if they are identical that far. Thus, a simple
* call to zstrcmp() does the job.... but keep ours as a function instead
* of a macro call for the moment, for safety and readability....
*/
int is_new_word (w0, w1)
char *w0, *w1;
{
return (zstrcmp (w0, w1));
}
/* function to write out a new key entry in the key_file:
* KEY_LENGTH letters consisting of the key word (which will be found
* delimited by a '\0'), followed by enough blanks to fill out the
* record to total length KEY_LENGTH ...
*/
void write_new_key (p, kp)
register char *p, *kp;
{
register int i, c;
for (i = 0; i < KEY_LENGTH; ++i)
{
c = *p++;
if (c == '\0')
break;
*kp++ = c;
}
for ( ; i < KEY_LENGTH; ++i)
*kp++ = ' ';
}
/* ---------------------merge_files.c--------------- */
/* file merge_files.c ... 870902-... ^z
* more functions to do the work of merging subindices together ...
* with buffering of inputs and outputs ...
*/
/* function to do the real work of merging sorted k & p
* files into a single sorted file...
*
* Procedure is as follows:
*
* Compare the current key records from each of the N files to be merged.
* Take the smallest of the keys (or, if there is a tie, take the one
* from the earlier file number) and write its pointer records out to
* the *.p file for the next generation. If the key is a new one, that
* is, if it differs from the previous key, write out the previous key
* word and the value for cumulative_counts to the *.k file, and reset
* the previous key's value to that of the current key. Then repeat
* this whole procedure, until all the input files are exhausted.
*
* The above works provided that we are careful in choosing the smallest
* key and never let a file that has been exhausted (reached EOF) win a
* comparison. The function smallest_key does that properly below, since
* a file that is at EOF has returned a NULL pointer for its key_rec...
*
* For each file, the variable prev_cc[i] holds the previous value of
* cumulative_counts for that file, so that we can determine how
* many ptr_recs to write out by doing a simple subtraction.
*
* Buffer numbering scheme: the Kth input file has zbuffer #K
* for its keys and zbuffer #(K+n) for its pointers; the output
* buffers are zbuffers #(2*n) for keys and #(2*n+1) for pointers.
*/
void nway_merge_kpfiles (ink, inp, outk, outp, n)
FILE *ink[], *inp[], *outk, *outp;
register int n;
{
register int i;
KEY_REC *kr[NMERGE], prev_key;
long prev_cc[NMERGE], out_cc;
extern long zbufsiz;
char *strncpy();
void copy_ptr_recs(), copy_key_rec();
DEBUG ("--entering nway_merge_kpfiles with n=%d\n", n);
for (i = 0; i < n; ++i)
prev_cc[i] = 0;
out_cc = 0;
for (i = 0; i < n; ++i)
{
create_zbuffer (i, zbufsiz, ink[i], sizeof(KEY_REC));
create_zbuffer (i + n, zbufsiz, inp[i], sizeof(long));
}
create_zbuffer (2 * n, zbufsiz, outk, sizeof(KEY_REC));
create_zbuffer (2 * n + 1, zbufsiz, outp, sizeof(long));
for (i = 0; i < n; ++i)
kr[i] = (KEY_REC *)next_input_item (i);
i = smallest_key (kr, n);
strncpy (prev_key.kkey, kr[i]->kkey, KEY_LENGTH);
DEBUG ("--first key is %.28s\n", prev_key.kkey);
while (merge_not_finished (kr, n))
{
i = smallest_key (kr, n);
if (zstrcmp (prev_key.kkey, kr[i]->kkey))
{
copy_key_rec (prev_key.kkey, out_cc, 2 * n);
strncpy (prev_key.kkey, kr[i]->kkey, KEY_LENGTH);
}
copy_ptr_recs (i + n, kr[i]->ccount - prev_cc[i], 2 * n + 1);
out_cc += kr[i]->ccount - prev_cc[i];
prev_cc[i] = kr[i]->ccount;
kr[i] = (KEY_REC *)next_input_item (i);
}
copy_key_rec (prev_key.kkey, out_cc, 2 * n);
DEBUG ("--finished nway merge ... final key=%.28s\n", prev_key.kkey);
flush_zbuffer (2 * n);
flush_zbuffer (2 * n + 1);
for (i = 0; i < 2 * n + 2; ++i)
free_zbuffer (i);
}
/* function to copy a chosen number of ptr_recs (longs) from one file
* to another ... used in merging two files by merge_kpfiles() above....
* revised and simplified to use zbuffers....870902 ... ^z
*/
void copy_ptr_recs (inzbuf, count, outzbuf)
register int inzbuf, outzbuf;
register long count;
{
register long *inp, *outp;
char *next_input_item(), *next_output_item();
for ( ; count > 0; --count)
{
inp = (long *)next_input_item (inzbuf);
outp = (long *)next_output_item (outzbuf);
*outp = *inp;
}
}
/* function to write a key record including kkey (key word) and ccount
* (cumulative_count) to an output file...
*/
void copy_key_rec (kkey, cc, outzbuf)
char *kkey;
long cc;
int outzbuf;
{
KEY_REC *outp;
char *strncpy(), *next_output_item();
outp = (KEY_REC *)next_output_item (outzbuf);
strncpy (outp->kkey, kkey, KEY_LENGTH);
outp->ccount = cc;
}
/* simple function to see if we are not yet finished with all of the input
* files coming in to the merge operation ... signified by at least one of
* the key pointers being non-NULL....
*/
int merge_not_finished (kr, n)
KEY_REC *kr[];
register int n;
{
register int i;
for (i = 0; i < n; ++i)
if (kr[i] != NULL)
return (TRUE);
return (FALSE);
}
/* function to determine the smallest of the n keys pointed to by the
* kr[] vector of pointers to KEY_RECs ... note that a NULL ptr ranks
* after any other...also note that in case of a tie, the smaller
* value of i is the one to return (for a stable merging sort)
*/
smallest_key (kr, n)
KEY_REC *kr[];
register int n;
{
register int i, smallest;
for (i = 0; i < n; ++i)
if (kr[i] != NULL)
{
smallest = i;
break;
}
for (i = smallest + 1; i < n; ++i)
if (kr[i] != NULL)
if (zstrcmp (kr[smallest]->kkey, kr[i]->kkey) > 0)
smallest = i;
return (smallest);
}
/* -------------------------merge_indices.c-------------- */
/* file "merge_indices.c" ... by ^z -- 870823-...
* function to merge sorted indices together repeatedly until finished
* with them all in a single set of *.k/*.p files ...
*
* The merging strategy is straightforward enough:
* Let "g" denote the generation_number and "f" denote the file_number.
* Temporary file names begin with the letter x, which is then followed
* by a generation number (decimal), the letter k or p (standing for
* key or ptr, respectively), and then the file number (decimal). Thus,
* the file "x0k0" is the keys file #0 for generation #0 (the starting,
* pre-merging, generation), file "x2p3" is the ptr file #3 for generation
* #2, etc.
*
* (The following discussion is specifically for a 2-way merge ... but
* the generalization for N-way merging is straightforward.)
*
* On a given call to merge_indices, the following may happen:
* - files xgkf/xgpf and xgk(f+1)/xgp(f+1) are merged into files
* x(g+1)k(f/2)/x(g+1)p(f/2), and then the parent files are
* deleted;
* - file xgkf isn't found, which means we are done with this
* generation and must go on to the next;
* - file xgk(f+1) isn't found, which means that either we are
* completely done with the merging work (if f=0) and just
* have to rename the files xgkf/xgpf into the correct final
* names (that is, doc_filename.k/doc_filename.p), or else
* (if f>0) we have an odd number of files for this level
* of merging, and therefore just have to rename xgkf/xgpf
* to x(g+1)k(f/2)/x(g+1)p(f/2).
*/
int merge_indices (doc_filename)
char *doc_filename;
{
static int generation_number = 0, file_number = 0;
FILE *ink[NMERGE], *inp[NMERGE], *outk, *outp, *open_inkfile(),
*open_inpfile(), *open_outkfile(), *open_outpfile();
void fix_oddball_file_name(), fix_final_file_names(), merge_kpfiles(),
remove_used_infiles(), close_infiles();
long inwords, indistinctwords, outdistinctwords, file_size();
int i, n;
for (n = 0; n < NMERGE; ++n)
{
ink[n] = open_inkfile (generation_number, file_number + n);
if (ink[n] == NULL)
break;
inp[n] = open_inpfile (generation_number, file_number + n);
}
if (file_number + n == 1)
{
DEBUG ("--All finished with merging files!\n", NULL);
printf ("\nFinished with index sorting! Final file sizes:\n");
printf ("%9ld bytes of distinct key words\n", file_size (ink[0]));
printf ("%9ld bytes of pointers to words\n", file_size (inp[0]));
close_infiles (ink, inp, n);
fix_final_file_names (generation_number, doc_filename);
return (FALSE);
}
if (n < 2)
{
DEBUG ("--finished generation_number=%d ", generation_number);
DEBUG ("-- file_number=%d\n", file_number);
if (n == 1)
{
close_infiles (ink, inp, n);
fix_oddball_file_name (generation_number, file_number);
}
++generation_number;
file_number = 0;
return (TRUE);
}
outk = open_outkfile (generation_number, file_number);
outp = open_outpfile (generation_number, file_number);
inwords = 0;
indistinctwords = 0;
for (i = 0; i < n; ++i)
{
inwords += file_size (inp[i]) / sizeof(long);
indistinctwords += file_size (ink[i]) / sizeof(KEY_REC);
}
printf ("Merge pass #%d.%d\n", generation_number, file_number / NMERGE);
printf ("Input files contain %ld words -- %ld distinct words...\n",
inwords, indistinctwords);
nway_merge_kpfiles (ink, inp, outk, outp, n);
outdistinctwords = file_size (outk) / sizeof(KEY_REC);
printf (" ... merged result has %ld distinct words.\n",
outdistinctwords);
close_infiles (ink, inp, n);
fclose (outk);
fclose (outp);
remove_used_infiles (generation_number, file_number, n);
file_number += NMERGE;
return (TRUE);
}
/* --------------------merge_utils.c---------------- */
/* file "merge_utils.c" ... 870902-... ^z
* misc. utilities for the merge_indices functions...
*/
/* function to open an input key file for this generation and file number
*/
FILE *open_inkfile (generation_number, file_number)
int generation_number, file_number;
{
FILE *fopen();
char fname[32];
sprintf (fname, "x%dk%d", generation_number, file_number);
return (fopen (fname, "rb"));
}
/* function to open an input ptr file for this generation and file number
*/
FILE *open_inpfile (generation_number, file_number)
int generation_number, file_number;
{
FILE *fopen();
char fname[32];
sprintf (fname, "x%dp%d", generation_number, file_number);
return (fopen (fname, "rb"));
}
/* function to open an output key file for this generation and file number
*/
FILE *open_outkfile (generation_number, file_number)
int generation_number, file_number;
{
FILE *fopen();
char fname[32];
sprintf (fname, "x%dk%d", generation_number + 1, file_number / NMERGE);
return (fopen (fname, "wb"));
}
/* function to open an output ptr file for this generation and file number
*/
FILE *open_outpfile (generation_number, file_number)
int generation_number, file_number;
{
FILE *fopen();
char fname[32];
sprintf (fname, "x%dp%d", generation_number + 1, file_number / NMERGE);
return (fopen (fname, "wb"));
}
/* function to rename the remaining last unpaired key file & ptr file
* from one generation to the next...
*/
void fix_oddball_file_name (generation_number, file_number)
int generation_number, file_number;
{
char oldname[32], newname[32];
sprintf (oldname, "x%dk%d", generation_number, file_number);
sprintf (newname, "x%dk%d", generation_number + 1, file_number / NMERGE);
if (rename (oldname, newname))
printf ("Sorry -- error in attempt to rename %s to %s!\n", oldname,
newname);
sprintf (oldname, "x%dp%d", generation_number, file_number);
sprintf (newname, "x%dp%d", generation_number + 1, file_number / NMERGE);
if (rename (oldname, newname))
printf ("Sorry -- error in attempt to rename %s to %s!\n", oldname,
newname);
}
/* function to give the final key and ptr files their proper ultimate
* names ...
*/
void fix_final_file_names (generation_number, doc_filename)
int generation_number;
char *doc_filename;
{
char oldname[32], finalname[65];
sprintf (oldname, "x%dk0", generation_number);
sprintf (finalname, "%s.k", doc_filename);
if (rename (oldname, finalname))
printf ("Sorry -- error in renaming file %s to %s!\n", oldname,
finalname);
sprintf (oldname, "x%dp0", generation_number);
sprintf (finalname, "%s.p", doc_filename);
if (rename (oldname, finalname))
printf ("Sorry -- error in renaming file %s to %s!\n", oldname,
finalname);
}
/* function to get rid of the superfluous k & p files now that they
* have been merged into the next generation....
*/
void remove_used_infiles (generation_number, file_number, n)
int generation_number, file_number, n;
{
char fname[32];
register int i;
for (i = 0; i < n; ++i)
{
sprintf (fname, "x%dk%d", generation_number, file_number + i);
DEBUG ("--removing %s\n", fname);
if (unlink (fname))
printf ("Sorry -- unable to delete file %s!\n", fname);
sprintf (fname, "x%dp%d", generation_number, file_number + i);
DEBUG ("--removing %s\n", fname);
if (unlink (fname))
printf ("Sorry -- unable to delete file %s!\n", fname);
}
}
/* function to close out the ink and inp files that have been opened...
*/
void close_infiles (ink, inp, n)
FILE *ink[], *inp[];
register int n;
{
register int i;
for (i = 0; i < n; ++i)
{
fclose (ink[i]);
fclose (inp[i]);
}
}
/* ---------------------misc_utils.c----------------- */
/* file "misc_utils.c" -- miscellaneous utilities for the qndxr project...
* by ^z -- 870821-...
*/
/* this function returns a value for the size of the internal buffers,
* zbufsiz, and it also takes care of setting the other global parameters,
* keep_all_punct, keep_embedded_punct, and keep_special_chars,
* which govern how punctuation and special characters are handled.
* They are set based on flags such as -e, -a, and -s in the input line.
* The input parameters are taken one after another, and any that do
* not convert to a nonzero number are scanned for the letters "e", "a", and
* "s". If a parameter is not reset, the default is to leave keep_all_punct,
* keep_embedded_punct, and keep_special_chars as FALSE.
* The default memory size is DEFAULT_MEMSIZ, set in the header file.
*/
long set_params (argc, argv)
int argc;
char *argv[];
{
int i;
void set_parser_options();
long zb = 0, n, atol(), set_zbufsiz();
for (i = 2; i < argc; ++i)
{
n = atol (argv[i]);
if (n != 0)
zb = set_zbufsiz (n);
else
set_parser_options (argv[i]);
}
if (zb == 0)
zb = set_zbufsiz (DEFAULT_MEMSIZ);
return (zb);
}
/* determine how big we should make the big buffers -- want to be sure
* to have room for at least 2*NMERGE+2 of them at one time in memory,
* so that the N-way merge function can hold all the input files
* (2N) plus the output files (2).
*
* NOTE that the chosen buffer size must be a multiple of both sizeof(long)
* and sizeof(KEY_REC); I ensure that very simply in what follows by
* a little modular arithmetic. I also take care of the sign and of the
* requirement that the buffer size be non-zero -- thus, UNIX aficionados
* can precede the parameter with a "-" with no ill effects. I have put in
* various safeguards against picking illegal buffer sizes, along with an
* ultimate safety net small minimum value for zbufsiz.
*/
long set_zbufsiz (zb)
long zb;
{
char *testb;
if (zb < 0)
zb = -zb;
zb /= (2 * NMERGE + 2);
zb = zb - zb % (sizeof(KEY_REC) * sizeof(long));
if (zb < sizeof(KEY_REC) * sizeof(long))
zb = sizeof(KEY_REC) * sizeof(long);
DEBUG ("--Checking for memory to support buffer size zb=%ld\n", zb);
testb = make_buf ((2 * NMERGE + 2) * zb);
free (testb);
return (zb);
}
/* function to set the parser options based on a character string ...
* 'a' turns on keep_all_punct, 'e' turns on keep_embedded_punct,
* and 's' turns on keep_special_chars
*/
void set_parser_options (str)
char *str;
{
extern int keep_all_punct, keep_embedded_punct, keep_special_chars;
while (*str)
{
if (*str == 'a')
{
keep_all_punct = TRUE;
printf ("All punctuation characters will be kept.\n");
}
else if (*str == 'e')
{
keep_embedded_punct = TRUE;
printf ("Embedded punctuation characters will be kept.\n");
}
else if (*str == 's')
{
keep_special_chars = TRUE;
printf ("Special characters will be kept.\n");
}
++str;
}
}
/* function to check for user interruption of operations (for use in the
* Macintosh version of this program only) ... call SystemTask() to give
* desk accessories a bit of time, and then check for non-null events
* with GetNextEvent() ... if something comes along (a mouse click, or key
* press, or whatever) let the user choose to exit the program or to
* carry on ....
*/
#ifdef LIGHTSPEED
void check_interrupt ()
{
EventRecord myEvent;
char cmd[256], *gets();
void exit();
SystemTask ();
if (GetNextEvent (everyEvent, &myEvent))
{
fprintf (stderr, "\Quit indexing now?\n");
gets (cmd);
if (cmd[0] == 'y' || cmd[0] == 'Y')
exit (0);
}
}
#endif
/* a very simple function to return the size of a file ... leave the file
* pointer positioned at wherever it was when the routine was entered....
* should work fine with stdio methods, but not guaranteed compatible when
* mixed in with Mac-specific FSRead() function!
*/
long file_size (fp)
FILE *fp;
{
long fpos, result, ftell();
DEBUG ("--finding the size of file fp=%ld\n", fp);
fpos = ftell (fp);
fseek (fp, 0L, 2);
result = ftell (fp);
fseek (fp, fpos, 0);
return (result);
}
/* ----------------------open_files.c----------------- */
/* functions to open files as needed in qndxr work...
*/
FILE *open_docfile (argc, argv)
int argc;
char *argv[];
{
FILE *fp, *fopen();
void exit();
if (argc < 2)
{
printf ("\nToo few command line arguments!\n");
printf ("\nEnter a text file name (no embedded spaces allowed)\n");
printf ("and the program will build and sort a complete inverted\n");
printf ("index to that file. Use \">\" in front of a file name to\n");
printf ("redirect console output (UNIX-fashion) if desired.\n");
printf ("Give the program a value for available memory (in bytes)\n");
printf ("if the default value of 512kB is unsatisfactory ... larger\n");
printf ("memory allows for faster index building and sorting.\n");
printf ("Other command-line arguments:\n");
printf (" \"-e\" preserves embedded punctuation\n");
printf (" \"-a\" preserves all punctuation characters\n");
printf (" \"-s\" preserves special (non-ASCII) characters\n");
printf ("\nContact Mark Zimmermann, 9511 Gwyndale Drive, Silver Spring\n");
printf ("Maryland 20910 USA; 301-565-2166; science (at) nems.arpa;\n");
printf ("[75066,2044] on CompuServe for further information. - ^z\n");
exit (1);
}
if ((fp = fopen (argv[1], "r")) == NULL)
{
printf ("\nFatal error opening document file\"%s\"!\n", argv[1]);
exit (1);
}
return (fp);
}
/* open the key file with proper name for this pass ... file will be
* named "x0kN" where N represents the pass number ....
*/
FILE *open_kfile (pass_number)
int pass_number;
{
FILE *fopen();
char fname[32];
sprintf (fname, "x0k%d", pass_number);
return (fopen (fname, "wb"));
}
/* open the ptr file with proper name for this pass ... file will be
* named "x0pN" where N represents the pass number ....
*/
FILE *open_pfile (pass_number)
int pass_number;
{
FILE *fopen();
char fname[32];
sprintf (fname, "x0p%d", pass_number);
return (fopen (fname, "wb"));
}
/* ----------------------zqsort.c-------------------- */
/* file zqsort.c -- by ^z, 870823-...
* my quicksort to sort out the ptr array ... based, at least initially,
* on the Lightspeed C library qsort routine, specialized to the task
* at hand here ...
*/
/* sort elements "first" through "last" */
void zqsort (first, last)
register char **first, **last;
{
register char **i, **j, *tmp;
while (last - first > 1)
{
i = first;
j = last;
for (;;)
{
while (++i < last && compare_ptrs (i, first) < 0)
;
while (--j > first && compare_ptrs (j, first) > 0)
;
if (i >= j)
break;
tmp = *i;
*i = *j;
*j = tmp;
}
tmp = *first;
*first = *j;
*j = tmp;
if (j - first < last - (j + 1))
{
zqsort (first, j);
first = j + 1;
}
else
{
zqsort (j + 1, last);
last = j;
}
}
}
/* function to compare two index ptrs and give a result appropriate
* for quicksort to use in alphabetizing them....
*
* Since the words pointed to have already been turned into all capital
* letters and delimiters have been filtered out, simply doing zstrcmp()
* for KEY_LENGTH letters works fine!
*
* Slight modification to make the quicksort stable: if two words tie,
* then we want to compare their pointers to make the lesser one come
* out first in the sort ...
*/
int compare_ptrs (p1, p2)
register char **p1, **p2;
{
register int diff;
diff = zstrcmp (*p1, *p2);
if (diff == 0)
diff = ((*p1 - *p2) > 0) ? 1 : -1;
return (diff);
}
/* my function to compare two strings and give a result as to who is
* alphabetically earlier. Note that this is almost the same as strncmp()
* with the fixed value of KEY_LENGTH as the maximum comparison distance,
* except that I must be sure to mask the characters to make them positive
* (since we want to be able to handle the non-ASCII funny letters in
* the Apple character set properly/consistently). If the masking isn't
* done, then inconsistent results can occur with those non-ASCII chars!
*/
int zstrcmp (s1, s2)
register char *s1, *s2;
{
register int n = KEY_LENGTH;
for (; --n && ((*s1 & CMASK) == (*s2 & CMASK)); s1++, s2++)
if (!*s1) break;
return ((*s1 & CMASK) - (*s2 & CMASK));
}
-------
More information about the Comp.sources.misc
mailing list