POSIX Regular Expression Funnyness
Doug Gwyn
gwyn at smoke.BRL.MIL
Wed Feb 1 02:59:58 AEST 1989
In article <5980041 at hpfcdc.HP.COM> donn at hpfcdc.HP.COM (Donn Terry) writes:
>In Doug Gwyn's comments about [:ch:] As far as character classes:
>these are specified by the natural language involved. My Spanish is
>weak, but the *two characters* ch are treated as a single symbol with
>its own place in the collating sequence. c and h can also appear
>independently, but when adjacent they are collated as another symbol.
>This is arguably a kluge, but it antedates the computer business by a
>few hundred years, and a few million users, so I doubt we can change it
>just for the sake of aesthetics.
My Spanish is not too weak and I'm well aware of ch, ll, nn (written
n-tilde), etc. German also has some interesting features (e.g. sharp s,
which is written ss when capitalized). However, we took all this stuff into account when coming
up with the multibyte character specifications in the proposed ANSI C
standard. The "internationalization" community helped formulate that
approach, and it bothers me more than somewhat to see it being ignored
by 1003.2. A reasonable implementation of Spanish-language locale
requires that ch etc. be multibyte sequences, not handled as multiple
separate single-byte characters by "grep".
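To make the point concrete, here is a rough sketch in C of what treating
ch and ll as single collating elements looks like. Everything in it is
hypothetical: the names es_elem_len, es_elem_count and es_collate are
made up for illustration, the weights are arbitrary, and only lowercase
ASCII input is handled. It is not the ANSI C multibyte interface and not
anything 1003.2 specifies; es_elem_len merely plays a role similar in
spirit to mblen().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the collating element that
 * starts at s. The digraphs "ch" and "ll" count as ONE element of two
 * bytes; anything else is one byte. Returns 0 at end of string. */
static size_t es_elem_len(const char *s)
{
    assert(s != NULL);
    if (s[0] == '\0')
        return 0;
    if ((s[0] == 'c' && s[1] == 'h') || (s[0] == 'l' && s[1] == 'l'))
        return 2;
    return 1;
}

/* Hypothetical weights: a plain byte c gets 2*c, a digraph gets
 * 2*first + 1, so "ch" falls between 'c' and 'd', and "ll" between
 * 'l' and 'm' (the traditional Spanish ordering). */
static int es_elem_weight(const char *s, size_t len)
{
    int c = (unsigned char)s[0];
    return len == 2 ? 2 * c + 1 : 2 * c;
}

/* Count collating elements rather than bytes. */
size_t es_elem_count(const char *s)
{
    size_t n = 0, len;
    while ((len = es_elem_len(s)) != 0) {
        s += len;
        n++;
    }
    return n;
}

/* Compare two strings element by element; returns -1, 0 or 1. */
int es_collate(const char *a, const char *b)
{
    for (;;) {
        size_t la = es_elem_len(a), lb = es_elem_len(b);
        int wa, wb;
        if (la == 0 || lb == 0)          /* one or both strings ended */
            return (la != 0) - (lb != 0);
        wa = es_elem_weight(a, la);
        wb = es_elem_weight(b, lb);
        if (wa != wb)
            return wa < wb ? -1 : 1;
        a += la;
        b += lb;
    }
}
```

With this, es_collate("chico", "cuna") is positive even though a plain
byte comparison puts "chico" first, and es_elem_count("chico") is 4
rather than 5. A utility that walks the string one byte at a time never
sees ch as a unit, which is the complaint above.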