Fuzzy grep?
Larry Wall
lwall at jpl-devvax.JPL.NASA.GOV
Tue Nov 6 06:03:26 AEST 1990
In article <242 at locke.water.ca.gov> rfinch at caldwr.water.ca.gov (Ralph Finch) writes:
: Is there something like grep, except it will (easily) search an entire
: file (not just line-by-line) for regexp's near each other? Ideally it
: would rank hits by how much or how close they match, e.g.
:
: fzgrep 'abc.*123' filename
:
: would return hits not by line number but by how close abc & 123 are
: found together. Also it wouldn't matter what order the regexp's are.
I sincerely doubt you're going to find a specialized tool to do that.
But if you just slurp a file into a string in Perl, you can then
start playing with it. For example, if your search strings are fixed,
you can use index:
    #!/usr/bin/perl
    undef $/;
    while (<>) {                        # for each file
        $posabc = index($_, "abc");
        next if $posabc < 0;
        $pos123 = index($_, "123");
        next if $pos123 < 0;
        $diff = $posabc - $pos123;
        $diff = -$diff if $diff < 0;
        print "$ARGV: $diff\n";
    }
Of course, you'd probably want to make a subroutine of that middle junk.
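A minimal sketch of what such a subroutine might look like, assuming Perl 5 (lexical variables and `abs` are used for brevity). The name `fixed_gap` is made up for illustration and does not come from the post; the logic is the same first-occurrence `index` comparison as above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: distance between the first occurrence of each
# fixed string, or undef if either string is missing from $text.
sub fixed_gap {
    my ($text, $s1, $s2) = @_;
    my $p1 = index($text, $s1);
    return undef if $p1 < 0;
    my $p2 = index($text, $s2);
    return undef if $p2 < 0;
    return abs($p1 - $p2);
}

# Report the distance for each file named on the command line.
for my $file (@ARGV) {
    open(my $fh, '<', $file) or do { warn "$file: $!\n"; next };
    local $/;                           # slurp the whole file at once
    my $text = <$fh>;
    close($fh);
    $text = '' unless defined $text;
    my $gap = fixed_gap($text, "abc", "123");
    print "$file: $gap\n" if defined $gap;
}
```

Like the loop above, this only compares the first "abc" with the first "123", so it reports a distance, not necessarily the smallest one.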
Or you can say:
    #!/usr/bin/perl
    undef $/;
    while (<>) {                        # for each file
        tr/\n/ /;                       # so . matches anything
        (/(abc.*)123/ || /(123.*)abc/)
            && print "$ARGV: " . (length($1)-3) . "\n"
    }
Those .*'s are going to be expensive, though. Maybe
    #!/usr/bin/perl
    undef $/;
    while (<>) {                        # for each file
        next unless /abc/;
        $posabc = length($`);
        next unless /123/;
        $pos123 = length($`);
        $diff = $posabc - $pos123;
        $diff = -$diff if $diff < 0;
        print "$ARGV: $diff\n";
    }
Of course, none of these solutions is going to find the closest pair,
necessarily. To do that, use a nested split, which also works with arbitrary
regular expressions:
    #!/usr/bin/perl
    undef $/;
    while (<>) {                        # for each file
        $min = length($_);
        @abc = split(/abc/, $_, 999999);
        next if @abc == 1;              # no match
        &try(shift(@abc), 0, 1);        # text before the first abc
        &try(pop(@abc), 1, 0);          # text after the last abc
        foreach $chunk (@abc) {         # text between two abc's
            &try($chunk, 1, 1);
        }
        next if $min == length($_);     # no 123 anywhere
        print "$ARGV: $min\n";
    }
    sub try {
        ($hunk, $first, $last) = @_;
        @pieces = split(/123/, $hunk, 999999);
        return if @pieces < 2;          # no 123 in this hunk
        if ($first && $min > length($pieces[0])) {
            $min = length($pieces[0]);
        }
        if ($last && $min > length($pieces[$#pieces])) {
            $min = length($pieces[$#pieces]);
        }
    }
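The same closest-pair idea can also be sketched by recording match offsets instead of splitting. This is an alternative technique, not the nested split above: it assumes Perl 5.6 or later for the `@-` and `@+` offset arrays, and the names `spans` and `min_gap` are hypothetical. It compares every pair of matches, so it is quadratic in the number of matches, and the last loop ranks files by closeness as the original question asked.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: [start, end] offsets of every match of $re.
sub spans {
    my ($text, $re) = @_;
    my @spans;
    while ($text =~ /$re/g) {
        push @spans, [$-[0], $+[0]];
    }
    return @spans;
}

# Fewest characters lying between a match of $re1 and a match of $re2,
# in either order; undef if either pattern never matches.
sub min_gap {
    my ($text, $re1, $re2) = @_;
    my @first  = spans($text, $re1);
    my @second = spans($text, $re2);
    return undef unless @first && @second;
    my $min;
    for my $x (@first) {
        for my $y (@second) {
            my $gap = $x->[0] >= $y->[1] ? $x->[0] - $y->[1]
                    : $y->[0] >= $x->[1] ? $y->[0] - $x->[1]
                    : 0;                # overlapping matches
            $min = $gap if !defined $min || $gap < $min;
        }
    }
    return $min;
}

# Rank the files named on the command line, closest pair first.
my %gap;
for my $file (@ARGV) {
    open(my $fh, '<', $file) or do { warn "$file: $!\n"; next };
    local $/;                           # slurp the whole file at once
    my $text = <$fh>;
    close($fh);
    $text = '' unless defined $text;
    my $g = min_gap($text, qr/abc/, qr/123/);
    $gap{$file} = $g if defined $g;
}
for my $file (sort { $gap{$a} <=> $gap{$b} } keys %gap) {
    print "$file: $gap{$file}\n";
}
```

As with the split version, the number reported is the count of characters between the two matches, so adjacent matches score 0.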
Or something like that...
Larry Wall
lwall at jpl-devvax.jpl.nasa.gov