Silly Question?
Brian Rice
rice at dg-rtp.dg.com
Wed Oct 18 12:04:16 AEST 1989
In article <4726 at internal.Apple.COM> athos at apple.com (Rick Eames) writes:
>Okay, here it is: I am writing a program which takes a text file and
>outputs a concordance of the words in the file. I have it working fine,
>however, I have problems with contractions (e.g. can't). My question is
>this: does anyone have any good ideas for filtering every punctuation mark
>except contraction apostrophes?
Below is a function, fgetbaseword(), which may help; it filters out both
punctuation and contractions. For instance, it will work its way through

    Hey, Joe, isn't that Sally's "pinochle deck"?!

by giving back, one word per call (through a char[] buffer one provides it),

    Hey Joe is that Sally pinochle deck

It's only 116 rather sparse lines of C code, so I felt it would be O.K.
to post it here.
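
Note that fgetbaseword() throws the contraction away, whereas the question
asked about keeping the apostrophe. For comparison, here is a minimal
sketch of that other behavior. It is not part of the posted code: the name
next_word_keep_apostrophe() and its string-based interface are assumptions
made for illustration, and it treats an apostrophe as part of a word only
when a letter follows it inside the word.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from the posted code). Copies the next word
   from *p into out, storing at most lim-1 characters plus a '\0'.
   An apostrophe is kept only when it is inside a word and followed by
   a letter. Advances *p past the word; returns the stored length,
   or 0 when no word is left. */
size_t next_word_keep_apostrophe(const char **p, char *out, size_t lim)
{
    const char *s = *p;
    size_t n = 0;

    if (lim == 0)
        return 0;

    /* Skip everything that cannot start a word. */
    while (*s != '\0' && !isalpha((unsigned char)*s))
        s++;

    while (*s != '\0') {
        int letter = isalpha((unsigned char)*s);
        /* s[1] is safe to read because *s is not the terminator. */
        int inner_apostrophe = (*s == '\'' &&
                                isalpha((unsigned char)s[1]));

        if (!letter && !inner_apostrophe)
            break;
        if (n + 1 < lim)        /* characters that do not fit are dropped */
            out[n++] = *s;
        s++;
    }
    out[n] = '\0';
    *p = s;
    return n;
}
```

Called repeatedly on the sample sentence above, this sketch should hand back
"isn't" and "Sally's" intact, while still dropping the commas, quotes, and
the trailing "?!".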
Brian Rice rice at dg-rtp.dg.com (919) 248-6328
DG/UX Product Assurance Engineering
Data General Corp., Research Triangle Park, N.C.
"My other car is an AViiON."
--------------------clip here-------------------
#include <stdio.h>
#include <string.h>
#ifndef TRUE
#define TRUE 1
#define FALSE 0
#endif
#ifndef BOOLEAN
#define BOOLEAN char
#endif
#define WORD_SPLITTERS " \n\t.,;:^?/!@#$%%^&*()_-=+<>{}[]\\~|`\""
/* I included two %'s so that one can printf WORD_SPLITTERS without
getting tricked. */
fgetbaseword(fp, s, lim)
FILE *fp;
char *s;
int lim;
/* fgetbaseword() reads an input stream fp and puts the next word, minus
   any contraction it may have appended to it, into s, which must have
   room for at least lim characters. It returns the length of the word
   stored in s. (Note that it will not behave properly for words like
   "O'Shaughnessy", and "ain't" will trick it into reporting "ai".)
   Punctuation is filtered out. fgetbaseword() is case-insensitive:
   it recognizes "N'T" as well as "n't", and it leaves the case of the
   word itself alone.

   This function is an example of a finite-state machine, although
   not necessarily an efficient one. If you don't understand the
   code and you don't know what an F.S.M. is, it might help to find
   out.

   fgetbaseword() has as its spiritual ancestor getline(), from page 67
   of K&R-1. All hail. */
{
    int c,c2,i;
    char *end_of_word;
    BOOLEAN in_word;
    BOOLEAN ignore_text;
    BOOLEAN maybe_nt;     /* n't is a hard one to deal with, so
                             we give our finite-state machine a
                             special state for it */

    i = 0;
    end_of_word = NULL;
    in_word = TRUE;
    ignore_text = FALSE;
    maybe_nt = FALSE;

    while (--lim > 0 && (c = getc(fp)) != EOF) {
        if (strchr(WORD_SPLITTERS,c)) {
            if (in_word) {
                if (!ignore_text) {
                    end_of_word = s+i;
                }
                in_word = FALSE;
            }
            ignore_text = FALSE;
            continue;
        }
        if (c == '\'') {
            if (in_word) {
                if (maybe_nt) {
                    in_word = FALSE;
                    if ((c2=getc(fp)) != 't' &&
                        c2 != 'T') {
                        end_of_word = s+i;
                    }
                    if (c2 == EOF) {
                        break;
                    }
                } else {
                    end_of_word = s+i;
                    ignore_text = TRUE;
                }
            }
            continue;
        }
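        if (c == 'n' || c == 'N') {
            if (in_word || i == 0) {
                /* i == 0 with in_word FALSE means that only
                   leading punctuation has been seen so far,
                   so this n starts the word */
                in_word = TRUE;
                end_of_word = s+i;
                s[i++] = c;
                maybe_nt = TRUE;
                continue;
            } else {
                ungetc(c,fp);
                break;
            }
        }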
        if (in_word) {
            if (!ignore_text) {
                s[i++] = c;
                maybe_nt = FALSE;
            }
            continue;
        } else {
            if (i == 0) {
                s[i++] = c;
                in_word = TRUE;
            } else {
                ungetc(c,fp);
                break;
            }
        }
    }
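    /* If the loop ran out of input (or of lim) in the middle of a
       word, end_of_word may still hold a tentative value, set on
       seeing an 'n' or on skipping leading punctuation; in that
       case the word really ends at s+i. */
    if (end_of_word == NULL || (in_word && !ignore_text)) {
        s[i] = '\0';
        return i;
    } else {
        *end_of_word = '\0';
        return (end_of_word - s);
    }
}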
/* Brian Rice, 1989
This code is in the public domain. Everyone may
copy, use, and modify it at will. */
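
The comment in fgetbaseword() calls it a finite-state machine, but the
states are spread over three flags. For anyone who wants to see them spelled
out, below is a rough sketch of the same idea with an explicit state
variable. It is an illustration only, not a drop-in replacement: the names
(next_base_word, the enum values) are hypothetical, it scans a string rather
than a FILE *, it treats any non-letter other than an apostrophe as a word
splitter instead of consulting WORD_SPLITTERS, and it peeks one character
ahead for "n't" instead of keeping a separate maybe_nt state.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical states for this sketch (not names from the posted code). */
enum state { BETWEEN_WORDS, IN_WORD, IN_SUFFIX };

/* Copies the next word from *p into out, minus any contraction or
   possessive suffix, storing at most lim-1 characters plus a '\0'.
   A word longer than that is split across calls. Advances *p and
   returns the stored length, or 0 when no word is left. */
size_t next_base_word(const char **p, char *out, size_t lim)
{
    enum state st = BETWEEN_WORDS;
    const char *s = *p;
    size_t n = 0;

    if (lim == 0)
        return 0;

    for (; *s != '\0' && n + 1 < lim; s++) {
        unsigned char c = (unsigned char)*s;

        if (st == BETWEEN_WORDS) {
            if (isalpha(c)) {           /* first letter starts a word */
                out[n++] = *s;
                st = IN_WORD;
            }
        } else if (st == IN_WORD) {
            if (isalpha(c)) {
                out[n++] = *s;
            } else if (c == '\'') {
                /* For "n't", drop the n that was already stored,
                   unless that would leave an empty word. */
                if (n > 1 &&
                    tolower((unsigned char)out[n - 1]) == 'n' &&
                    tolower((unsigned char)s[1]) == 't')
                    n--;
                st = IN_SUFFIX;         /* discard the rest of the word */
            } else {
                break;                  /* splitter ends the word */
            }
        } else {                        /* IN_SUFFIX */
            if (!isalpha(c) && c != '\'')
                break;
        }
    }
    out[n] = '\0';
    *p = s;
    return n;
}
```

Fed the sample sentence from the top of this post one call at a time, the
sketch should yield Hey, Joe, is, that, Sally, pinochle, deck, and, like
fgetbaseword(), it reduces "ain't" to "ai".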