Text Processing Question
Tom Christiansen
tchrist at convex.COM
Mon Mar 18 15:19:09 AEST 1991
From the keyboard of goer at ellis.uchicago.edu (Richard L. Goerwitz):
:In article <31134 at usc> rkumar at buddha.usc.edu (C.P. Ravikumar) writes:
:
:>I was wondering if there is a utility to check
:>for repetition of words in a document....
:>
:>I have the feeling this can be done using "awk".
:
:The hard part, as always, is settling on a field separator -
Perhaps. I always thought the hard part was catching pairs of words that
extend over line boundaries. Here's a perl version that catches these,
although I admit it's probably overkill to suck up the whole file into
memory before munging it. Works fine on my machine. :-)
Here's the output when run on my C compiler man page:
/usr/man/man1/cc.1:
39 compiler. Certain extensions, notably the [* long long *] type,
57 Forces language and library interpretation based on [* the the *] original
770 Each library has a profiled version whose name is formed [* by
771 by *] inserting \(lq_p\(rq before the \(lq.a\(rq.
The precise definition of what constitutes a repeated word (and what
legit separators are) will vary according to tastes. I chose identifier-
like tokens separated by white space. Speed (and definitely memory)
optimizations are certainly possible, but this does the job well enough
for me. The program (not line noise :-) follows:
--tom
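#!/usr/bin/perl
undef $/; $* = 1; # process whole file
while ( $ARGV = shift ) {
    if (!open ARGV) { warn "$ARGV: $!\n"; next; }
    $_ = <>;
    # bracket each run of a repeated word with \200 markers; skip files with no hits
    s/\b(\s?)(([a-z]\w*)(\s+\3)+\b)/$1\200$2\200/g || next;
    split(/\n/); # lines land in @_
    $n = 0; @hits = ();
    # keep only the lines that contain a marker, prefixed with their line number
    for (@_) { $n++; push(@hits, sprintf("%5d %s", $n, $_)) if /\200/; }
    $_ = join("\n", @hits);
    # turn each marker pair into [* ... *]
    s/\200([^\200]+)\200/[* $1 *]/g;
    print "$ARGV:\n$_\n";
}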
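
For anyone trying this under a current Perl 5, here is a minimal sketch of
the same technique. It is not the original program. It assumes an
interpreter where $* and the implicit split into @_ are no longer available,
so it uses an explicit split and a lexical filehandle. The sub name
mark_repeats, the /i flag, and the dropped leading (\s?) group are my own
choices for illustration; \x80 is the same byte as \200. Like the original,
it assumes the input does not already contain that byte.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the numbered lines of $text that contain a repeated word, with each
# run of repeats wrapped in [* ... *]. Returns '' when nothing repeats.
# mark_repeats is a hypothetical name, not part of the original post.
sub mark_repeats {
    my ($text) = @_;
    return '' unless defined $text;

    # Bracket each run of a repeated word with \x80 markers. \s+ also matches
    # newlines, so a pair split across a line boundary is caught.
    # Assumption: /i is added so that "The the" counts as a repeat.
    return '' unless $text =~ s/\b(([a-z]\w*)(?:\s+\2)+)\b/\x80$1\x80/gi;

    my $n = 0;
    my @hits;
    for my $line (split /\n/, $text) {
        $n++;
        push @hits, sprintf("%5d %s", $n, $line) if $line =~ /\x80/;
    }

    # Join first, then convert marker pairs, so that a run spanning two
    # lines still ends up inside a single [* ... *].
    my $out = join "\n", @hits;
    $out =~ s/\x80([^\x80]+)\x80/[* $1 *]/g;
    return $out;
}

for my $file (@ARGV) {
    open my $fh, '<', $file or do { warn "$file: $!\n"; next };
    my $text = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    my $report = mark_repeats($text);
    print "$file:\n$report\n" if length $report;
}
```

Run it with one or more file names as arguments. It still reads each file
into memory in one piece, so the caveat about overkill applies here too.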