International Unix
Tim Radzykewycz
radzy at calma.UUCP
Fri Nov 1 06:22:35 AEST 1985
In article <2344 at ukma.UUCP> sambo at ukma.UUCP (Father of micro-ln) writes:
>In article <2400 at brl-tgr.ARPA> bilbo.jbrown at ucla-locus.ARPA (Jordan Brown) writes:
>>Unfortunately, you CAN'T build a good international character set.
>>Some of those silly European countries have the same character in
>>several languages, but sort the character in different places in each
>>language. They also have interesting constructs like characters that
>>sort as two characters, and pairs of characters that sort as single
>>characters. That is, there might be a character @ which sorts as "xy",
>>so that @m sorts right after xylophone and before xyn. Similarly, they
>>sometimes say that the pair ll sorts as a single character; I don't
>>remember where.
>I guess I would like to see some examples of the above. Are you saying
>that in some language, the order of the letters might be "a b c ...",
>whereas in some other language, the order might be "a c b ..."? What
>pair of languages is like this? Also, in which language is some single
>character considered as two characters?
Basically, yes. That's the general idea. If you go through your archives
for net.nlang for about the last 3 or 4 weeks, you can get about 6 examples
of alphabets, at least two of which have "letters out of sequence".
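To make the two constructs from the quoted article concrete, here is a rough Python sketch. It is only an illustration, not a real collation routine. It assumes lowercase input and a Spanish ordering in which "ll" is the only extra letter, and the names spanish_key and expanding_key are made up for this example.

```python
# Hypothetical Spanish ordering for this sketch: "ll" is a letter after "l".
SPANISH = "a b c d e f g h i j k l ll m n o p q r s t u v w x y z".split()
RANK = {letter: i for i, letter in enumerate(SPANISH)}


def spanish_key(word):
    """Sort key that treats the pair "ll" as one letter.

    Only handles lowercase letters present in SPANISH; anything else
    raises KeyError.
    """
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ll", i):
            key.append(RANK["ll"])  # two characters, one collation unit
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key


def expanding_key(word):
    """Sort key for a made-up character "@" that sorts as "xy"."""
    return word.replace("@", "xy")


words = ["luz", "llama", "loco"]
print(sorted(words))                   # ['llama', 'loco', 'luz']
print(sorted(words, key=spanish_key))  # ['loco', 'luz', 'llama']

print(sorted(["xyn", "@m", "xylophone"], key=expanding_key))
# ['xylophone', '@m', 'xyn']
```

A plain comparison of character codes puts "llama" before "loco", while the key that treats "ll" as one letter puts it after "luz". In both cases the ordering comes from a per-language key function, not from the character codes themselves.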
One other way of looking at this (let's see how far ahead of myself
I can get) is to think of the reasons for the international character
set:
1. consistent sorting
2. consistent pred/succ operations
3. no special characters in one language that are printable chars in another
Well, reason 2 says we can't have gaps in the letters for *any* language.
Reason 3 says languages with smaller alphabets can't use the extra chars.
Reason 1 says everything has to be in order.
So let's take a look at 3 character sets (English, Spanish and German):
a b c d e f g h i j k l m n o p q r s t u v w x y z <- english
a b c d e f g h i j k l ll m n o p q r s t u v w x y z <- spanish
a b c d e f g h i j k l m n o p q r s B t u v w x y z <- german
(pardon me if any of this is wrong, but at least it makes the point,
even if it *is* wrong.)
So the letters (E:m-z,S:ll-z,G:m-z) are all different, and we're still
on the latin alphabet (How about cyrillic?).
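A small Python sketch of the same point, using the three alphabets exactly as listed above (including "B" standing in for the German sharp s, and with the same caveat that the tables may be wrong). The helper names ordinals and succ are invented for this illustration. If each language numbers its own letters without gaps, the same letter gets a different number in different languages.

```python
ALPHABETS = {
    "english": "a b c d e f g h i j k l m n o p q r s t u v w x y z".split(),
    "spanish": "a b c d e f g h i j k l ll m n o p q r s t u v w x y z".split(),
    "german": "a b c d e f g h i j k l m n o p q r s B t u v w x y z".split(),
}


def ordinals(letter):
    """Position of a letter in each alphabet that contains it (gap-free numbering)."""
    return {
        lang: alphabet.index(letter)
        for lang, alphabet in ALPHABETS.items()
        if letter in alphabet
    }


def succ(letter, lang):
    """Next letter in the given language's alphabet, or None after the last one."""
    alphabet = ALPHABETS[lang]
    i = alphabet.index(letter)
    return alphabet[i + 1] if i + 1 < len(alphabet) else None


print(ordinals("l"))  # {'english': 11, 'spanish': 11, 'german': 11}
print(ordinals("m"))  # {'english': 12, 'spanish': 13, 'german': 12}
print(ordinals("t"))  # {'english': 19, 'spanish': 20, 'german': 20}

print(succ("l", "english"), succ("l", "spanish"))  # m ll
print(succ("s", "english"), succ("s", "german"))   # t B
```

So a single code for "m" or "t" cannot be the gap-free position in all three alphabets at once, and succ gives different answers for the same letter depending on the language.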
Aside: I strongly recommend that anyone seriously interested in
international [issues|unix] read net.nlang. It is not too difficult
to cull the garbage from it and read only the relevant articles, such
as the ones I mentioned above. Please send flames to /dev/null and
discussions to me or the net.
>I speak Spanish and some French. Without thinking very much, something
>like the double "l" (which at least in Honduras is pronounced the same
>as a "y") would need to be treated as a single character, but written
>out as two characters. The problem is in capitalizing it. There need
>to be two forms for the uppercase double "l": "LL" and "Ll". This would
>mean that there would be two different codes for the uppercase double
>"l". Again, without thinking very much, this is the same situation as
>with vowels, since they may have an accent.
I assume this is all an argument to support the original article; however,
I don't think that was clear the way it was written.
--
Tim (radzy) Radzykewycz, The Incredible Radical Cabbage
calma!radzy at ucbvax.ARPA
{ucbvax,sun,csd-gould}!calma!radzy