Regular Expression tool

Larry Wall lwall at jpl-devvax.JPL.NASA.GOV
Tue Jun 12 09:32:30 AEST 1990


In article <1990Jun8.174056.15313 at icc.com> wdm at icc.com (Bill Mulert) writes:
: Consider the following statements containing regular expressions:
: 
: echo "`expr \"$1\" : \"^[^=]*=\(.*\)\"`"
: 
: df_usr=`df | sed -n '/^\/usr[   ]/s/[^)]*):[    ]*\([^  ]*\).*/\1/p'`
: 
: sed	-e 's/\([!:]\)\([0-9]\)/\1 \2/' \
: 	-e '/!/s/^\([^ 	][^ 	]*\).*[,-][,-]*\([0-9][0-9]*\)$/\1 1-\2/' \
: 	< .newsrc.old > .newsrc
: 
: sed 's/^\([^:! 	]*\).*$/\1/' $ACTIVE | sort > $TMPFILE.1
: 
: Do you have a headache, now? I do. I find any but the simplest regular
: expressions to be "write only". They are rather like C's declarations
: that so often cause even veteran programmers to look askance.
: Fortunately, we have cdecl to help create and decode the C declarations.
: 
: I wish there were something similar for regular expressions. I would
: like to have a tool, call it regex, that would allow me to say:
: 
: regex ' "^[^=]*=\(.*\)\" '
: and have regex say, in plain language, what the expression means.
: 
: Is there anything like that in existence? Any ideas on how large
: a project like that might be?

It's not likely to be too practical, for a couple of reasons.

First, there are a number of different standards out there.  For instance,
sed and expr use \( ... \) to indicate grouping, while egrep and perl
use ( ... ) for grouping, and \( and \) to indicate real parens.  (I'm
of course prejudiced in favor of the latter, but I think it's more readable
on the whole, since you do grouping a lot more often than you match real
parens.)  On top of that, when are ?, +, |, { and } metacharacters?  They
are in some programs, and aren't in others.  Are you going to have a
switch?

	regex -sed   ' "^[^=]*=\(.*\)\" '
	regex -expr  ' "^[^=]*=\(.*\)\" '
	regex -egrep ' "^[^=]*=\(.*\)\" '
	regex -perl  ' "^[^=]*=\(.*\)\" '
	regex -ed    ' "^[^=]*=\(.*\)\" '
	regex -emacs ' "^[^=]*=\(.*\)\" '
	regex -vi    ' "^[^=]*=\(.*\)\" '
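To make the dialect problem concrete, here is a minimal sketch of the same extraction written for the two grouping conventions. The function names are hypothetical, and it assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do). The two grep helpers show the same pattern text, (x), meaning different things depending on the dialect.

```shell
# BRE dialect (sed, expr, grep): \( \) group; a bare ( is a literal paren.
value_bre() {
    printf '%s\n' "$1" | sed -n 's/^[^=]*=\(.*\)$/\1/p'
}

# ERE dialect (egrep, sed -E): ( ) group; \( is a literal paren.
value_ere() {
    printf '%s\n' "$1" | sed -n -E 's/^[^=]*=(.*)$/\1/p'
}

# The same pattern text read by both dialects of grep.
# grep exits 1 when nothing matches, so "|| true" keeps the helper's status 0.
count_bre() { printf '%s\n' "$1" | grep -c '(x)' || true; }
count_ere() { printf '%s\n' "$1" | grep -E -c '(x)' || true; }

value_bre 'name=(x)'    # both calls print the text after the first "="
value_ere 'name=(x)'
count_bre 'x'           # BRE wants literal parens in the input
count_ere 'x'           # ERE treats the parens as grouping
```

A decoder would have to know which of these readings applies before it could say anything in plain language, which is the point of the switch question above.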

Second, your big problem is not so much the regular expressions themselves
as it is all the quoting you have to put around them because of the paucity of
quoting mechanisms.  Take your first example:

    echo "`expr \"$1\" : \"^[^=]*=\(.*\)\"`"

If we blame the gobbledygookiness on the backslashes, we see that half
the problem is that we are quoting three deep, so we have to use \", and
the other half of the problem is that \( ... \) are the grouping
metacharacters.  I think the following is more readable simply because
of the absence of \, which is simply too heavily overloaded in Unix:

    perl -e 'print shift =~ /^[^=]*=(.*)/' "$1"
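The quoting depth can also be reduced without leaving the shell. The sketch below is one way to do it, assuming a POSIX shell; the function names are hypothetical. The first helper keeps expr but uses $(...) instead of backquotes, so the inner double quotes need no backslashes. The second drops the regular expression entirely in favour of parameter expansion.

```shell
# $(...) opens a fresh quoting context, so no \" is needed inside it.
# The leading X guards against $1 looking like an expr operator.
# The pattern has no ^ because expr patterns are anchored at the start anyway.
value_expr() {
    printf '%s\n' "$(expr "X$1" : 'X[^=]*=\(.*\)')"
}

# Parameter expansion: strip the shortest prefix ending in "=".
# The case guard prints an empty line when there is no "=", as expr does.
value_param() {
    case $1 in
        *=*) printf '%s\n' "${1#*=}" ;;
        *)   printf '\n' ;;
    esac
}

value_expr 'a=b=c'
value_param 'a=b=c'
```

Neither form removes the \( ... \) grouping that expr requires, so only half of the problem described above goes away in the first helper.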

Using /PATTERN/ to search filenames forces you to backslash all the slashes
in the pattern:

    df_usr=`df | sed -n '/^\/usr[   ]/s/[^)]*):[    ]*\([^  ]*\).*/\1/p'`
			   ^^

It helps to have an alternate pattern delimiting method.  sed lets you have
an alternate delimiter on substitutions, but not on pattern matches.  (Perl
gives you both.)  Even in sed, you could write the above as:

    df_usr=`df | sed -n 's#^/usr[   ][^)]*):[    ]*\([^  ]*\).*#\1#p'`

That gets rid of one backslash, anyway.  Other filename patterns will
benefit more.  Filename patterns are the primary reason I added m#PATTERN#
to perl, where # can be any delimiter.
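A pattern with more slashes in it shows the benefit more clearly. This is a small sketch with hypothetical function names and an arbitrary /usr/local/ prefix chosen for illustration. The last helper relies on the \cREc address form that current POSIX sed specifies; that is a property of today's standard, not a claim about the sed implementations the post had in view.

```shell
# Default delimiter: every slash in the path needs a backslash.
strip_prefix_slash() {
    printf '%s\n' "$1" | sed 's/^\/usr\/local\///'
}

# Alternate delimiter on s: the path reads as a path.
strip_prefix_hash() {
    printf '%s\n' "$1" | sed 's#^/usr/local/##'
}

# POSIX context address with an alternate delimiter: \#RE#
only_usr() {
    printf '%s\n' "$1" | sed -n '\#^/usr/#p'
}

strip_prefix_slash '/usr/local/bin/perl'
strip_prefix_hash  '/usr/local/bin/perl'
only_usr '/usr/bin/sed'
```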

Similarly, we see a lot of cruft is there simply because of the overly
minimalistic implementations of some regexps.  Such as having to repeat
character classes because there's no +, or having to use uninterpretable
whitespace because there's no alternate way to specify spaces and tabs.
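Both kinds of cruft can be seen in isolation in the sketch below. It assumes a sed with -E and POSIX character classes such as [[:blank:]]; the function names and the sample .newsrc-style line are made up for illustration.

```shell
# Without +, "one or more digits" means writing the class twice.
last_number_bre() {
    printf '%s\n' "$1" | sed -n 's/^.*[^0-9]\([0-9][0-9]*\)$/\1/p'
}

# With ERE, + says the same thing once.
last_number_ere() {
    printf '%s\n' "$1" | sed -n -E 's/^.*[^0-9]([0-9]+)$/\1/p'
}

# [[:blank:]] names "space or tab" instead of embedding an invisible
# literal tab inside the brackets.
first_field() {
    printf '%s\n' "$1" | sed 's/[[:blank:]].*$//'
}

last_number_bre 'comp.lang.perl! 1-20,25-317'
last_number_ere 'comp.lang.perl! 1-20,25-317'
first_field 'comp.unix.questions 1234 5678 y'
```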

Compare

: sed	-e 's/\([!:]\)\([0-9]\)/\1 \2/' \
: 	-e '/!/s/^\([^ 	][^ 	]*\).*[,-][,-]*\([0-9][0-9]*\)$/\1 1-\2/' \
: 	< .newsrc.old > .newsrc

to

perl -p	-e 's/([!:])([0-9])/$1 $2/' \
	-e '/!/ && s/^(\S+).*[,-]+([0-9]+)$/$1 1-$2/' \
	< .newsrc.old > .newsrc

Actually, I'd probably write that as

perl -pe 's/:\s*/: /;  s/!.*\D(\d+)$/! 1-$1/;' .newsrc.old >.newsrc

Whatever.  For the most part, I don't think the problem with understanding
regular expressions is the regular expressions themselves, but all the
claptrap surrounding them.  And that will be very difficult to write
a decoder for.

Unix is not a simple language.

Larry Wall
lwall at jpl-devvax.jpl.nasa.gov



More information about the Comp.unix.questions mailing list