long-char, kanji
- To: common-lisp@SU-AI.ARPA, ida@UTOKYO-RELAY.CSNET, rwk@SCRC-YUKON.ARPA
- Subject: long-char, kanji
- From: Masayuki Ida <a37078%ccut.u-tokyo.junet%utokyo-relay.csnet@CSNET-RELAY.ARPA>
- Date: Fri, 30 May 86 14:55:14+0900
>Date: Sun, 11 May 86 16:51 EDT
>From: "Robert W. Kerns" <RWK@SCRC-YUKON.ARPA>
>Subject: The first note on kanji, sent to junet site in Jan 1986 and some reactions in japan
>In-Reply-To: <8605100330.AA08572@tansei.utyo.junet>
>Message-Id: <860511165153.6.RWK@WHITE-BIRD.SCRC.Symbolics.COM>
>
Before replying to the mail, I want to summarize my understanding.
I am sending this mail because the first mail on KANJI to common-lisp at su-ai
was not helpful for the discussion.
This issue is also related to the relation between the character data type and the string data type.
CLtL says,
T > character > string-char > standard-char
string = (array string-char (*)) = (vector string-char)
An object of type string-char has zero for its font and bits attributes.
An object of type standard-char is one of
the 95 (ASCII) characters or #\newline.
The basic idea of my draft:
add long-char, or extended-string-char,
which is needed to represent multi-byte characters.
(In the last mail, I used the word "japanese-char" instead.
I realized it was a poor choice of name.
Here, I use "long-char" for multi-byte characters,
but the name is temporary.)
The opinions behind the attempt to add long-char:
There are many Lisps which cannot handle multi-byte characters correctly.
Many implementors and users want to have a common way to handle Japanese characters.
Related facts (for information only):
Each character of type standard-char has another representation, the
JIS 6226 two-byte representation, which I call here regular-long-char.
Further, almost all machines in Japan have yet another
two-byte representation, which I call hankaku-long-char.
Namely, "A", say, can be represented as a standard-char, a regular-long-char or
a hankaku-long-char. Furthermore, " " (blank character), ",", ".", "(" and ")"
can each have three different representations!
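As a present-day illustration of these multiple representations (not part of the proposal itself), the sketch below uses Python's `euc_jp` codec and the Unicode fullwidth form of "A" as stand-ins for regular-long-char. These stand-ins are assumptions made for the example. hankaku-long-char was a machine-specific encoding and has no counterpart here.

```python
import unicodedata

ascii_a = "A"        # standard-char form: a single byte
wide_a = "\uff21"    # FULLWIDTH LATIN CAPITAL LETTER A, the JIS two-byte "A"

# The same letter occupies one byte or two, depending on the form used.
print(ascii_a.encode("euc_jp"))   # one byte
print(wide_a.encode("euc_jp"))    # two bytes

# The two forms do not compare equal unless they are folded together first.
print(ascii_a == wide_a)
print(unicodedata.normalize("NFKC", wide_a) == ascii_a)
```

Any program that compares or searches text has to decide whether such forms count as the same character, which is the question the type hierarchy has to answer.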
The basic issues:
Is long-char a subtype of character?
Is long-char a subtype of string-char?
What is the relation between standard-char and long-char?
Can a vector of long-char be a component of a string?
If long-char is separate from string-char, should it have a font attribute or not?
--- Selection 1 ---
Make long-char a subtype of string-char, i.e. string-char > long-char.
long-char and standard-char are disjoint.
--- Selection 2 ---
Make long-char a subtype of the character type, i.e. character > long-char,
with string-char and long-char disjoint.
--- Selection 3 ---
Make standard-char a subtype of long-char, i.e. string-char > long-char > standard-char.
Possible consequences of the above selections:
Selection 1:
long-char (2 bytes or more) and standard-char (1 byte) can be mixed in a string.
--> It seems to be very expensive for general-purpose machines
    to support ELT, LENGTH etc. correctly,
    and users may get confused when writing software.
--> Add a string-normalize function.
    (string-normalize x) means:
    if x is composed purely of standard-char, then return x.
    if x is composed purely of long-char, then return x.
    if x is a mixture of subtypes of string-char,
      then if all the characters of x can be represented as standard-char,
             then each character is converted to its standard-char
             representation,
           elseif there is at least one character which can only be
             represented as long-char,
             then all characters are converted to the long-char representation,
           else error.
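The rules for string-normalize above can be sketched as follows. This is only a model under stated assumptions, not a proposed implementation: the `Char` class, the `is_long` flag, and the `NO_LONG_FORM` set are hypothetical names introduced for the example, and the "else error" branch is read here as the case where some character has no long-char form (newline is assumed to be such a character purely for illustration).

```python
from dataclasses import dataclass

# The 95 ASCII characters plus newline, standing in for standard-char.
STANDARD_GLYPHS = frozenset(chr(c) for c in range(0x20, 0x7F)) | {"\n"}

# Assumption for this sketch only: newline has no long-char form.
NO_LONG_FORM = frozenset({"\n"})


@dataclass(frozen=True)
class Char:
    glyph: str     # the abstract character
    is_long: bool  # True for the long-char representation


def string_normalize(x):
    """Return x with every character in a single representation."""
    if all(not c.is_long for c in x) or all(c.is_long for c in x):
        return x                                   # already pure
    if all(c.glyph in STANDARD_GLYPHS for c in x):
        return [Char(c.glyph, False) for c in x]   # narrow everything
    if all(c.glyph not in NO_LONG_FORM for c in x):
        return [Char(c.glyph, True) for c in x]    # widen everything
    raise ValueError("no common representation for this string")


mixed = [Char("A", False), Char("\u6f22", True)]   # "A" plus one kanji
print(string_normalize(mixed))                     # both become long-char
```

After normalization every element has the same width, so ELT and LENGTH no longer need to scan the string to find character boundaries.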
Selection 2:
A new problem arises. That is,
can the long-char type have a non-zero value for char-font and char-bits?
--> I feel the answer should be "NO".
--> (vector long-char) cannot be a "string", because a string is
    (vector string-char), and string-char and long-char are assumed to be disjoint.
--> We need another type of string, say long-char-based-string,
    which is parallel to string but disjoint from it.
I do not have a firm opinion about the selection yet.
At least in Japan, we have to settle the matter for several computer languages
such as Cobol, Fortran, C, Ada,...
and Common Lisp.
Masayuki Ida
ida%utokyo-relay.csnet@csnet-relay.arpa
----