is it really necessary for character values to be positive?
dave at murphy.UUCP
Wed Dec 17 04:29:33 AEST 1986
Summary: invent an 8-bit character set and then let some of them be negative
Line eater: fully conforming
I've been thinking about this business with long chars and short chars
and trigraphs and international character sets and such, and I've got a
proposal. The proposal is this: if someone can come up with an 8-bit
character set that contains all of the necessary characters for the
Western languages (and includes the existing USASCII set as a subset),
then let's drop the requirement that a member of a machine's "natural"
character set be represented as a positive number in a plain char. This
will have the following benefits:
1. Everyone can adopt a character set that will have all of the characters
that they need, and not have to overload any of the USASCII set with
other characters. Portability of programs and other text files will benefit
greatly, and trigraphs will be unnecessary. (For many languages, there
aren't enough punctuation characters to overmap; for example, I think that
it takes 17 characters to represent all of the possible letter-and-accent
combinations in French, and that's just for lower case.)
2. The character set will fit into almost everyone's byte size, meaning no
dramatic increase in the size of text files. (Nearly everyone uses at least
an 8-bit byte with UN*X; the only exceptions that I can think of are the
PDP10/20's, which can use 7-bit bytes.)
3. It won't be necessary to raise sizeof(char) from 1. This means that
programs that use chars for things other than text (yes, there are a *lot*
of them) won't be disturbed.
4. Each implementation can continue using the signedness for char that best
fits the architecture. It won't be necessary to force all plain chars to
be unsigned.
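Benefits 3 and 4 can be illustrated with a small C sketch. It assumes 8-bit
bytes and a hypothetical code 0xE9 standing in for an accented letter (no
particular character set is implied); the helper names char_index and
plain_char_is_signed are made up for illustration. The point is only that
such a character may come out negative in a plain char, yet converting it
through unsigned char still yields a usable table index on either kind of
machine.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical code point: 0xE9 stands in for an accented letter in an
   imagined 8-bit character set. */
#define SAMPLE_CODE 0xE9

/* Map a plain char to a table index in 0..UCHAR_MAX, regardless of
   whether plain char is signed on this implementation. */
static unsigned int char_index(char c)
{
    unsigned int i = (unsigned char)c;
    assert(i <= UCHAR_MAX);   /* always holds; documents the range */
    return i;
}

/* Nonzero if plain char is a signed type here, i.e. CHAR_MIN < 0. */
static int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

With these, char_index((char)SAMPLE_CODE) gives 0xE9 whether or not
plain_char_is_signed() is true, and sizeof(char) stays at 1.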
The disadvantages that I can see are these:
1. Since some of the char values may be negative, it will not be possible
to collate chars by simply comparing their values; you have to call a
collating routine defined for the particular implementation. (But, some
languages don't collate in strict alphabetic order, so you'll wind up
doing this with any international character set.)
2. You will have to use functions to do things like converting a letter
to upper or lower case; just masking off bits won't get it anymore.
3. Some terminals already use the codes > 127 for other purposes. There is
no easy answer to this problem.
4. The value 255 can't be used because it may look like EOF on some systems
(where plain char is signed, it widens to -1, the usual value of EOF).
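Disadvantages 1, 2 and 4 can be made concrete with a rough C sketch. Everything
here is an assumption for illustration: the weight table, the code 0xE9 for an
accented "e", the ASCII values of the ordinary letters, and the function names
init_collation, collate_chars, upper_char and looks_like_eof. A real
implementation would supply its own tables for its own character set.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical collating weights, one per character code. */
static unsigned short collate_weight[UCHAR_MAX + 1];

/* Start from code order, then slot the hypothetical accented letter
   0xE9 between 'e' and 'f' (assumes ASCII for the plain letters). */
static void init_collation(void)
{
    unsigned int i;

    assert(UCHAR_MAX == 255);   /* sketch assumes 8-bit bytes */
    for (i = 0; i <= UCHAR_MAX; i++)
        collate_weight[i] = (unsigned short)(2 * i);
    collate_weight[0xE9] = (unsigned short)(2 * 'e' + 1);
}

/* Compare two chars by weight rather than by raw (possibly negative)
   value. Returns <0, 0 or >0 in the manner of strcmp. */
static int collate_chars(char a, char b)
{
    int wa = collate_weight[(unsigned char)a];
    int wb = collate_weight[(unsigned char)b];

    return (wa > wb) - (wa < wb);
}

/* Case conversion through the library instead of bit masking; the
   cast keeps a negative char from being passed to toupper(). */
static char upper_char(char c)
{
    return (char)toupper((unsigned char)c);
}

/* Nonzero if this char, widened to int, is indistinguishable from EOF. */
static int looks_like_eof(char c)
{
    return (int)c == EOF;
}
```

After init_collation(), collate_chars('e', (char)0xE9) is negative and
collate_chars((char)0xE9, 'f') is negative too, even on a machine where
(char)0xE9 < 'e' as raw values. On a signed-char machine with EOF defined
as -1, looks_like_eof((char)0xFF) is nonzero, which is the problem in item 4.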
In short, it doesn't look to me like there is any good reason to require
characters to be represented as positive values. Or have I overlooked
something really basic?
---
"I used to be able to sing the blues, but now I have too much money."
-- Bruce Dickinson
Dave Cornutt, Gould Computer Systems, Ft. Lauderdale, FL
UUCP: ...!{sun,pur-ee,brl-bmd,bcopen}!gould!dcornutt
or ...!{ucf-cs,allegra,codas}!novavax!houligan!dcornutt
ARPA: dcornutt at gswd-vms.arpa (I'm not sure how well this works)
"The opinions expressed herein are not necessarily those of my employer,
not necessarily mine, and probably not necessary."