Multibyte characters
M.Marking
marking at drivax.UUCP
Thu Jul 5 13:26:37 AEST 1990
mikeb at inset.UUCP (Mike Banahan) writes:
) Let's say that I do have a multibyte execution character set which supports
) for the sake of argument, English and Greek, with Greek using a shift-in
) shift-out mechanism.
) A string of the form "abc@d" is valid C (using @ to represent the Greek
) character `alpha').
) It will contain 8 bytes, counting the shift-in, shift-out and the null
) at the end.
) Presumably the integral constant '@' is a three-byte constant, no matter
) what it may look like?
I don't know about Greek, but I have seen situations where the mbchar itself
is three bytes, so with the shift-in and shift-out you have five bytes. Not all
schemes use shift-in/shift-out: some don't know about shifts at all and some have
an implicit shift after each character, so it's *always* in the initial
state. For others, the shift is implied by the initial character of the
multibyte sequence being in certain ranges. Furthermore, some schemes use
characters of mixed lengths, so that a string might consist of a mixture
of 1, 2, and 3-byte characters.
(My apologies if you want to know about Greek specifically, but my
presumption is that we want to write code that will work in a variety of
locales.)
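
As an illustration of code that makes no assumption about character
length, here is a minimal sketch that counts the characters in a string
by letting mblen decide how many bytes each one occupies in the current
locale. The name count_mbchars is hypothetical; mblen and strlen are
standard library functions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the multibyte characters in s under the
 * current locale. Returns -1 if an invalid sequence is found. */
long count_mbchars(const char *s)
{
    size_t remaining = strlen(s);
    long count = 0;

    (void)mblen(NULL, 0);           /* reset mblen to the initial shift state */
    while (remaining > 0) {
        int n = mblen(s, remaining);
        if (n < 0)
            return -1;              /* invalid or incomplete sequence */
        if (n == 0)
            break;                  /* reached a null character */
        s += n;                     /* step over 1, 2, 3, ... bytes */
        remaining -= (size_t)n;
        count++;
    }
    return count;
}
```

In the default "C" locale every character is one byte, so
count_mbchars("abc") is 3; in other locales the result depends on what
setlocale has selected.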
) An alternative interpretation is that it violates
) the constraint in 2.2.1.2 `a .. character constant .. shall begin
) and end in the initial shift state', but presumably I can expect my
) implementation to do the necessary good deeds and put a shift-out
) in there too.
Good question. In Japanese, there are no separate shift characters, so
I don't know what compilers do when there are. Anyone?
) Since it is a three-byte constant (assuming I'm right), then can I be
) sure that I do not get overflow when I assign it to a char variable?
A char is not a multibyte char, so truncation or overflow or whatever
is the likely result. The type char is still a single scalar value, so
an array of them is needed for multibyte data.
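
A minimal sketch of the array approach, assuming the character is
available as a wchar_t: wctomb writes its multibyte form into a char
buffer of MB_LEN_MAX bytes. The helper name store_mbchar is
hypothetical; wctomb and MB_LEN_MAX are standard.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: convert wide character wc to its multibyte form
 * in buf, which must have room for MB_LEN_MAX bytes. Returns the number
 * of bytes stored, or -1 if wc has no representation in the current
 * locale. */
int store_mbchar(char buf[MB_LEN_MAX], wchar_t wc)
{
    (void)wctomb(NULL, 0);          /* reset wctomb to the initial shift state */
    return wctomb(buf, wc);
}
```

The returned count, not a fixed size, tells the caller how many of the
bytes in buf are meaningful.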
) 3.1.3.4 says that the value of a multi-character character constant
) will be implementation-defined, and 3.2.1.2 says that (paraphrase)
) demoting an int to a char gives an implementation-defined result.
) So to call it `overflow' is perhaps overstating the case, but I clearly
) end up in implementation-defined territory twice over.
You can test MB_LEN_MAX (for the compiler's worst case) or MB_CUR_MAX
(for the current locale's worst case) to check how many bytes you
might need to hold the value.
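
For example, a rough sketch of sizing a buffer at run time from
MB_CUR_MAX. The helper name mb_buffer_size is hypothetical, and the
calculation assumes that MB_CUR_MAX really is an upper bound on the
bytes needed per character in the current locale.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: bytes needed to hold nchars multibyte characters
 * plus a terminating null in the current locale, assuming MB_CUR_MAX
 * bounds the length of every character. Returns 0 if the size would
 * overflow size_t. */
size_t mb_buffer_size(size_t nchars)
{
    size_t per_char = MB_CUR_MAX;   /* run-time value, depends on the locale */

    if (nchars > ((size_t)-1 - 1) / per_char)
        return 0;
    return nchars * per_char + 1;
}
```

MB_LEN_MAX is a compile-time constant and so can size a static array;
MB_CUR_MAX can change with setlocale and is never larger than
MB_LEN_MAX.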
My question: do MB_LEN_MAX and MB_CUR_MAX include shift characters in
locales that use them? If not, my recollection is that the old ANSI
spec on extended characters allows multibyte shift sequences, so how
do we know the maximum length of a shift sequence (in or out)?
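
This does not settle the length question, but a program can at least ask
the library whether the current locale uses shift states at all: called
with a null pointer, mblen returns nonzero if the encoding is
state-dependent. A minimal sketch, with the hypothetical name
encoding_has_shift_states:

```c
#include <stdlib.h>

/* Hypothetical helper: nonzero if the current locale's multibyte
 * encoding is state-dependent (uses shift states), zero otherwise.
 * As a side effect, mblen is reset to the initial shift state. */
int encoding_has_shift_states(void)
{
    return mblen(NULL, 0) != 0;
}
```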
My experiences with shift characters antedate the introduction of
multibyte and wide characters into C. Any information on current use
of shift characters here would be appreciated.