Pattern matching with awk
Tom Christiansen
tchrist at convex.COM
Mon Mar 4 15:10:48 AEST 1991
From the keyboard of lin at CS.WMICH.EDU (Lite Lin):
: This is a simple question, but I don't see it in "Frequently Asked
:Questions", so...
: I'm trying to identify all the email addresses in email messages, i.e.,
:patterns with the format user at node. Now I can use grep/sed/awk to find
:those lines containing user at node, but I can't figure out from the manual
:how or whether I can have access to the matching pattern (it can be
:anywhere in the line, and it doesn't have to be surrounded by spaces,
:i.e., it's not necessarily a separate "field" in awk). If there is no
:way to do that in awk, I guess I'll do it with lex (yytext holds the
:matching pattern).
Well, I wouldn't try to do it in awk, but that doesn't mean we have to
jump all the way to a C program!
    perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n"/ge;'
That does a fairly good job, but there are a lot of duplicates,
so let's not print any we've already seen:
    perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n" unless $seen{$1}++/ge;'
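For readers who want to see the "print each match once" idea spelled out step by step, here is a rough sketch in Python rather than Perl. It is a reading of the one-liner above, not a replacement for it: the names ADDRESS and unique_addresses are invented for illustration, the sample addresses are made up, and Python's \w may accept more characters than Perl's did.

```python
import re

# Same character classes as the Perl one-liner: runs of word characters,
# dots, and hyphens on either side of an "@". (Assumed equivalent; Python's
# \w is Unicode-aware by default.)
ADDRESS = re.compile(r"[-.\w]+@[-.\w]+")


def unique_addresses(lines):
    """Return address-like strings in first-seen order, without duplicates."""
    seen = set()
    found = []
    for line in lines:
        # findall returns every non-overlapping match on the line, so the
        # address does not have to be a separate whitespace-delimited field.
        for match in ADDRESS.findall(line):
            if match not in seen:
                seen.add(match)
                found.append(match)
    return found


if __name__ == "__main__":
    sample = [
        "From: alice@example.com",
        "To: bob@example.org, alice@example.com",
    ]
    for address in unique_addresses(sample):
        print(address)
```

The set plays the role of the %seen hash: an address is reported only the first time it turns up.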
A more sordid approach might be:
    #!/usr/bin/perl
    while (<>) { s/([-%:.\w]+@[-@%:.\w]+)/$seen{$1}++/ge; }
    print join("\n", sort keys %seen), "\n";
But you've got a basic problem in that you can't distinguish
message-ids from real addresses. A message_id at host looks
a lot like (in some cases indistinguishably so) a user_id at host.
Here's a half-hearted attempt to weed out a few strays:
    #!/usr/bin/perl
    while (<>) { s/([a-zA-Z][-%:.\w]+@[-@%:.\w]+)/$seen{$1}++/ge; }
    print join("\n", grep(!/^(AA)?\d/, sort keys %seen)), "\n";
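To make the weeding-out step easier to follow, here is the same collect, sort, and filter sequence sketched in Python instead of Perl. Treat it as an illustration under assumptions: the names ADDRESS, MESSAGE_ID_LIKE, and likely_addresses are hypothetical, the sample header lines are invented, and the filter is exactly as half-hearted as the Perl one.

```python
import re

# Candidate must start with a letter; after that the same generous character
# classes as the Perl script (including % and : for routed addresses).
ADDRESS = re.compile(r"[a-zA-Z][-%:.\w]+@[-@%:.\w]+")

# Strings beginning with a digit, or with "AA" followed by a digit, are
# assumed to be message-id fragments and are thrown away.
MESSAGE_ID_LIKE = re.compile(r"(AA)?\d")


def likely_addresses(lines):
    """Return sorted, de-duplicated candidates that do not look like message-ids."""
    seen = set()
    for line in lines:
        seen.update(ADDRESS.findall(line))
    # re.match anchors at the start of the string, like ^ in the Perl grep.
    return [a for a in sorted(seen) if not MESSAGE_ID_LIKE.match(a)]


if __name__ == "__main__":
    sample = [
        "Message-Id: <9103040510.AA12345@host.example.com>",
        "From: bob@example.org",
        "To: alice@example.com, bob@example.org",
    ]
    for address in likely_addresses(sample):
        print(address)
```

In the sample, the message-id is picked up as AA12345@host.example.com because the pattern has to start on a letter, and the filter then discards it. A message-id that does not fit that shape would still slip through.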
--tom
ps: dunno what all this ``node'' talk is. My manual talks
about nodes in the filesystem section, hosts in the
networking section. Or do you mail directly to i-nodes? :-)
--
"UNIX was not designed to stop you from doing stupid things, because
that would also stop you from doing clever things." -- Doug Gwyn
Tom Christiansen tchrist at convex.com convex!tchrist