POSIX Regular Expression Funnyness

Doug Gwyn gwyn at smoke.BRL.MIL
Wed Feb 1 02:59:58 AEST 1989


In article <5980041 at hpfcdc.HP.COM> donn at hpfcdc.HP.COM (Donn Terry) writes:
>In Doug Gwyn's comments about [:ch:]  As far as character classes:
>these are specified by the natural language involved.  My Spanish is
>weak, but the *two characters* ch are treated as a single symbol with
>its own place in the collating sequence.  c and h can also appear
>independently, but when adjacent they are collated as another symbol.
>This is arguably a kluge, but it antedates the computer business by a
>few hundred years, and a few million users, so I doubt we can change it
>just for the sake of aesthetics.

My Spanish is not too weak and I'm well aware of ch, ll, nn (written
n-tilde), etc.  German also has some interesting features (e.g. the
sharp s, which is written SS when capitalized).  However, we took all
this into account when coming
up with the multibyte character specifications in the proposed ANSI C
standard.  The "internationalization" community helped formulate that
approach, and it bothers me more than somewhat to see it being ignored
by 1003.2.  A reasonable implementation of a Spanish-language locale
requires that ch, etc. be treated as multibyte sequences, not handled
as multiple separate single-byte characters by "grep".
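
To make the point concrete, here is a minimal sketch (in Python, purely
for brevity) of what it means to treat ch and ll as single symbols when
collating.  The ORDER table and the helpers elements() and sort_key()
are hypothetical names made up for this illustration; none of this is
taken from 1003.2 or from the proposed ANSI C standard.

```python
# Hypothetical traditional Spanish collating order, in which the
# digraphs "ch" and "ll" each occupy their own place in the sequence.
ORDER = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
         "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
         "v", "w", "x", "y", "z"]
RANK = {element: index for index, element in enumerate(ORDER)}
DIGRAPHS = ("ch", "ll")


def elements(word):
    """Split a word into collating elements, keeping digraphs whole."""
    result = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            result.append(pair)
            i += 2
        else:
            result.append(word[i])
            i += 1
    return result


def sort_key(word):
    """Rank each collating element; unknown symbols sort last."""
    return [RANK.get(e, len(ORDER)) for e in elements(word.lower())]


words = ["chico", "cuna", "dama", "cosa"]
by_bytes = sorted(words)                 # plain character-by-character order
by_elements = sorted(words, key=sort_key)  # "ch" collates after every "c..."
```

Sorting character by character puts "chico" first, while sorting by
collating elements puts it after "cuna", since ch follows c as a symbol
of its own.  A tool that sees only separate c and h characters cannot
get the second ordering right.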


