Pattern matching with awk

Mon Mar 4 15:10:48 AEST 1991

From the keyboard of lin at CS.WMICH.EDU (Lite Lin):
:  This is a simple question, but I don't see it in "Frequently Asked
:Questions", so...
:  I'm trying to identify all the email addresses in email messages, i.e.,
:patterns with the format user@node.  Now I can use grep/sed/awk to find
:those lines containing user@node, but I can't figure out from the manual
:how or whether I can have access to the matching pattern (it can be
:anywhere in the line, and it doesn't have to be surrounded by spaces,
:i.e., it's not necessarily a separate "field" in awk).  If there is no
:way to do that in awk, I guess I'll do it with lex (yytext holds the
:matching pattern).

Well, I wouldn't try to do it in awk, but that doesn't mean we have to 
jump all the way to a C program!  

    perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n"/ge;'

That does a fairly good job, but there are a lot of duplicates,
so let's not print any we've already seen:

    perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n" unless $seen{$1}++/ge;'

A more sordid approach might be:

    #!/usr/bin/perl
    while (<>) { s/([-%:.\w]+@[-@%:.\w]+)/$seen{$1}++/ge; } 
    print join("\n", sort keys %seen), "\n";

But you've got a basic problem in that you can't distinguish 
message-ids from real addresses.  A message_id@host looks a lot
like (in some cases indistinguishably so) a user_id@host.

Here's a half-hearted attempt to weed out a few strays:

    #!/usr/bin/perl
    while (<>) { s/([a-zA-Z][-%:.\w]+@[-@%:.\w]+)/$seen{$1}++/ge; } 
    print join("\n", grep(!/^(AA)?\d/, sort keys %seen)), "\n";

--tom

ps: dunno what all this ``node'' talk is.  My manual talks 
    about nodes in the filesystem section, hosts in the
    networking section.  Or do you mail directly to i-nodes? :-)
--
"UNIX was not designed to stop you from doing stupid things, because
 that would also stop you from doing clever things." -- Doug Gwyn

 Tom Christiansen                tchrist at convex.com      convex!tchrist