International Unix

Fri Nov 1 06:22:35 AEST 1985

In article <2344 at ukma.UUCP> sambo at ukma.UUCP (Father of micro-ln) writes:
>In article <2400 at brl-tgr.ARPA> bilbo.jbrown at ucla-locus.ARPA (Jordan Brown) writes:
>>Unfortunately, you CAN'T build a good international character set.
>>Some of those silly European countries have the same character in
>>several languages, but sort the character in different places in each
>>language.  They also have interesting constructs like characters that
>>sort as two characters, and pairs of characters that sort as single
>>characters.  That is, there might be a character @ which sorts as "xy",
>>so that @m sorts right after xylophone and before xyn.  Similarly, they
>>sometimes say that the pair ll sorts as a single character; I don't
>>remember where.

>I guess I would like to see some examples of the above.  Are you saying
>that in some language, the order of the letters might be "a b c ...",
>whereas in some other language, the order might be "a c b ..."?  What
>pair of languages is like this?  Also, in which language is some single
>character considered as two characters?

Basically, yes.  That's the general idea.  If you go through your archives
for net.nlang for about the last 3 or 4 weeks, you can get about 6 examples
of alphabets, at least two of which have "letters out of sequence".

One other way of looking at this (let's see how far ahead of myself
I can get) is to think of the reasons for the international character
set:
    1. consistent sorting
    2. consistent pred/succ operations
    3. no special characters in one language that are printable chars in another

Well, reason 2 says we can't have gaps in the letters for *any* language.
Reason 3 says languages with smaller alphabets can't use the extra chars.
Reason 1 says everything has to be in order.
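
To make the clash between reasons 2 and 3 concrete, here is a minimal
sketch in Python.  Everything in it is an assumption for illustration:
the shared code table is made up, and the German ordering (a sharp s
sorted right after s) is the one assumed in the alphabet table in this
post, which may well be wrong.

```python
# Hypothetical shared code table: the 26 plain letters get codes 0..25,
# and the German sharp s is tacked on at the end as code 26.
SHARP_S = "\u00df"
SHARED = list("abcdefghijklmnopqrstuvwxyz") + [SHARP_S]
CODE = {letter: code for code, letter in enumerate(SHARED)}

# German alphabet as assumed in this post (sharp s right after s).
GERMAN = list("abcdefghijklmnopqrs") + [SHARP_S] + list("tuvwxyz")

def succ_by_code(letter):
    """Next letter found by adding 1 to the shared character code."""
    return SHARED[CODE[letter] + 1]

def succ_by_alphabet(letter, alphabet):
    """Next letter found by position in one language's own alphabet."""
    return alphabet[alphabet.index(letter) + 1]

code_next = succ_by_code("s")                 # follows the code table
german_next = succ_by_alphabet("s", GERMAN)   # follows the alphabet
```

With this table, succ on the code gives "t" after "s", while the
assumed German alphabet wants the sharp s there.  Moving the sharp s
in between s and t would fix German but leave English with a gap.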

So let's take a look at 3 alphabets (English, Spanish and German; the
"B" in the German row stands for the sharp s):
a b c d e f g h i j k l  m n o p q r s t u v w x y z	<- english
a b c d e f g h i j k l ll m n o p q r s t u v w x y z	<- spanish
a b c d e f g h i j k l  m n o p q r s B t u v w x y z	<- german

(pardon me if any of this is wrong, but at least it makes the point,
even if it *is* wrong.)

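So the tails of the three alphabets (E: m-z, S: ll-z, G: m-z) don't
line up: every letter from m on has a different position in Spanish
than in English, and German drifts away from English after s.  And
we're still on the Latin alphabet (how about Cyrillic?).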
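
The sorting side of this can be sketched in a few lines of Python.
This is only an illustration under the assumptions of the table above:
the Spanish alphabet here adds nothing but "ll" (other letters are
left out to match the table), and collation_key is a hypothetical
helper, not any library's routine.

```python
# Alphabets as in the table above; "ll" counts as one Spanish letter.
ENGLISH = list("abcdefghijklmnopqrstuvwxyz")
SPANISH = list("abcdefghijkl") + ["ll"] + list("mnopqrstuvwxyz")

def collation_key(word, alphabet):
    """Turn a word into a list of letter ranks for one alphabet.

    At each position the longest matching letter wins, so "ll" is
    taken as a single letter whenever the alphabet contains it.
    """
    rank = {letter: i for i, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        for size in range(longest, 0, -1):
            chunk = word[i:i + size]
            if chunk in rank:
                key.append(rank[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError("no such letter in this alphabet: " + word[i])
    return key

words = ["luz", "llama", "loco", "lima"]
english_order = sorted(words, key=lambda w: collation_key(w, ENGLISH))
spanish_order = sorted(words, key=lambda w: collation_key(w, SPANISH))
```

Under the English rules "llama" lands between "lima" and "loco";
under the Spanish rules it moves to the end, after "luz".  No single
assignment of character codes gives both orders from a plain
code-by-code comparison.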

Aside:  I strongly recommend that anyone seriously interested in
international [issues|unix] read net.nlang.  It is not too difficult
to cull the garbage from it and read only the relevant articles, such
as the ones I mentioned above.  Please send flames to /dev/null and
discussions to me or the net.

>I speak Spanish and some French.  Without thinking very much, something
>like the double "l" (which at least in Honduras is pronounced the same
>as a "y") would need to be treated as a single character, but written
>out as two characters.  The problem is in capitalizing it.  There need
>to be two forms for the uppercase double "l": "LL" and "Ll".  This would
>mean that there would be two different codes for the uppercase double
>"l".  Again, without thinking very much, this is the same situation as
>with vowels, since they may have an accent.

I assume this is all meant to support the original article, but I
don't think that was clear from the way it was written.
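
For what it's worth, the capitalization problem in the quoted
paragraph can be sketched like this in Python.  The table LETTER_FORMS
and both functions are hypothetical, made up only to show why a
double "l" treated as one letter needs two capital forms.

```python
# Hypothetical casing table for a digraph treated as a single letter:
# "upper" is for all-caps text, "title" is for the start of a word.
LETTER_FORMS = {"ll": {"upper": "LL", "title": "Ll"}}

def first_letter(word):
    """Return the first letter, taking a leading 'll' as one letter."""
    return "ll" if word.startswith("ll") else word[:1]

def naive_capitalize(word):
    """Uppercase the first letter, with only one capital form known."""
    head = first_letter(word)
    return head.upper() + word[len(head):]

def capitalize_word(word):
    """Capitalize using the separate title form where one exists."""
    head = first_letter(word)
    forms = LETTER_FORMS.get(head)
    title = forms["title"] if forms else head.upper()
    return title + word[len(head):]
```

With a single capital form, naive_capitalize("llama") comes out as
"LLama"; capitalize_word("llama") gives "Llama", at the price of a
second code (or table entry) for the same letter.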
-- 
Tim (radzy) Radzykewycz, The Incredible Radical Cabbage
	calma!radzy at ucbvax.ARPA
	{ucbvax,sun,csd-gould}!calma!radzy