- To: common-lisp@SU-AI.ARPA, ida@UTOKYO-RELAY.CSNET, rwk@SCRC-YUKON.ARPA
- Subject: long-char, kanji
- From: Masayuki Ida <a37078%ccut.u-tokyo.junet%utokyo-relay.csnet@CSNET-RELAY.ARPA>
- Date: Fri, 30 May 86 14:55:14+0900
>Date: Sun, 11 May 86 16:51 EDT
>From: "Robert W. Kerns" <RWK@SCRC-YUKON.ARPA>
>Subject: The first note on kanji, sent to junet site in Jan 1986 and some reactions in japan
Before replying to the mail, I want to summarize my understanding.
I am sending this mail because the first mail on KANJI to common-lisp at su-ai
was not helpful for the discussion.
This issue is also related to the relationship between the character data type and the string data type.
T > Character > string-char > standard-char
string = (array string-char (*)) = (vector string-char)
An object of type string-char has zero values for the font and bits attributes.
An object of type standard-char is one of
the 95 (ASCII) characters or #\newline.
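The two definitions above can be modelled in a few lines. This is only an illustrative sketch in Python, not Common Lisp code: the `Character` record, the predicate names, and the choice of code 10 for #\newline are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Character:
    """Hypothetical model of a character with code, bits and font attributes."""
    code: int
    bits: int = 0
    font: int = 0


NEWLINE_CODE = 0x0A  # assumption: #\newline is modelled as ASCII line feed


def string_char_p(c: Character) -> bool:
    # string-char: the font and bits attributes are both zero.
    return c.bits == 0 and c.font == 0


def standard_char_p(c: Character) -> bool:
    # standard-char: a string-char that is one of the 95 ASCII characters
    # from space (0x20) to tilde (0x7E), or the newline character.
    in_ascii_95 = 0x20 <= c.code <= 0x7E
    return string_char_p(c) and (in_ascii_95 or c.code == NEWLINE_CODE)
```

With this model, `Character(ord("A"))` satisfies both predicates, while `Character(ord("A"), bits=1)` is still a character but is neither a string-char nor a standard-char.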
The basic idea of my draft:
add long-char, or extended-string-char,
which is needed to represent multi-byte characters.
(In the last mail, I used the word "japanese-char" instead.
I realized it was a poor choice of name.
Here, I use "long-char" for multi-byte characters,
but the name is temporary.)
The opinions behind the attempt to add long-char:
There are many Lisps which cannot handle multi-byte characters correctly.
Many implementors and users want to have a common way to handle Japanese characters.
Related facts (for information only):
Each character of type standard-char has another representation in the
JIS 6226 two-byte code, which I call here regular-long-char.
Further, almost all the machines in Japan have yet another
two-byte representation, which I call hankaku-long-char.
Namely, "A", say, can be represented as a standard-char, a regular-long-char or
a hankaku-long-char. Furthermore, " " (blank character), ",", ".", "(", ")"
can each have three different representations!
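The "same letter, several representations" situation can be observed with present-day tools. The sketch below is an analogy only, and everything in it is an assumption of the example: Python's `euc_jp` codec stands in for the JIS two-byte code, and Unicode's fullwidth letter U+FF21 stands in for regular-long-char. The vendor-specific hankaku two-byte form has no counterpart here.

```python
import unicodedata

ASCII_A = "A"            # one-byte form, comparable to standard-char
FULLWIDTH_A = "\uff21"   # FULLWIDTH LATIN CAPITAL LETTER A, the two-byte form


def byte_length(ch: str, encoding: str = "euc_jp") -> int:
    """Number of bytes the character occupies in the given encoding."""
    return len(ch.encode(encoding))


def same_letter(a: str, b: str) -> bool:
    """True if two characters denote the same letter after compatibility folding."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
```

Here `ASCII_A` and `FULLWIDTH_A` compare unequal as characters and differ in encoded length, yet `same_letter` treats them as the same letter, which is the kind of ambiguity the message describes.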
The basic issues:
Is the long-char a subtype of character ?
Is the long-char a subtype of string-char ?
What is the relation between standard-char and long-char ?
Can a vector of long-char be a component of a string ?
If long-char is separated from string-char, should it have a font attribute or not?
--- Selection 1 ---
make long-char be a subtype of string-char, i.e. string-char > long-char.
long-char and standard-char are disjoint.
--- Selection 2 ---
make long-char be a subtype of character type, i.e. character > long-char.
and, string-char and long-char are disjoint.
--- Selection 3 ---
make standard-char be a subtype of long-char, i.e. string-char > long-char > standard-char.
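To compare the three selections, each one can be written down as a small table of "immediate supertype" links and queried. This is a hypothetical Python model, not a Common Lisp facility: the table layout and the helper names `subtypep` and `long_char_allowed_in_string` are assumptions of the sketch.

```python
# Each selection maps a type name to its immediate supertype.
BASE = {"character": "t", "string-char": "character"}

SELECTIONS = {
    # Selection 1: string-char > long-char, disjoint from standard-char.
    1: {**BASE, "standard-char": "string-char", "long-char": "string-char"},
    # Selection 2: character > long-char, disjoint from string-char.
    2: {**BASE, "standard-char": "string-char", "long-char": "character"},
    # Selection 3: string-char > long-char > standard-char.
    3: {**BASE, "long-char": "string-char", "standard-char": "long-char"},
}


def subtypep(sub: str, sup: str, parents: dict) -> bool:
    """Walk up the supertype links from sub and report whether sup is reached."""
    current = sub
    while current is not None:
        if current == sup:
            return True
        current = parents.get(current)
    return False


def long_char_allowed_in_string(selection: int) -> bool:
    # string = (vector string-char), so a long-char can be a string element
    # only if long-char is a subtype of string-char.
    return subtypep("long-char", "string-char", SELECTIONS[selection])
```

Under this model, Selections 1 and 3 allow long-char elements in an ordinary string and Selection 2 does not, and only Selection 3 makes every standard-char a long-char as well.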
Possible consequences of the above selections:
long-char (2 bytes or more) and standard-char (1 byte) can be mixed in a string.
--> It seems very heavy for general-purpose machines
    to support ELT, LENGTH, etc. correctly,
    and users may get confused when writing software.
--> Add a string-normalize function.
    (string-normalize x) means:
      if x is composed purely of standard-char, then return x.
      if x is composed purely of long-char, then return x.
      if x is a mixture of subtypes of string-char,
        then if all the characters of x are representable in standard-char,
          then each character is converted to its standard-char representation,
        else if there is at least one character which is representable
          only by long-char,
          then all characters are converted to the long-char representation.
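The string-normalize rules can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the proposed Common Lisp function: ASCII printing characters and newline play the role of standard-char, every other character plays the role of long-char, and Unicode fullwidth forms (offset 0xFEE0, plus the ideographic space U+3000) serve as the long-char representation of ASCII characters. All helper names are hypothetical.

```python
FULLWIDTH_OFFSET = 0xFEE0      # distance between "!" and its fullwidth form
IDEOGRAPHIC_SPACE = "\u3000"   # wide counterpart of the ASCII space


def is_standard(ch: str) -> bool:
    """Stand-in for standard-char: newline or an ASCII printing character."""
    return ch == "\n" or 0x20 <= ord(ch) <= 0x7E


def to_standard(ch: str):
    """Return the standard-char form of ch, or None if there is none."""
    if is_standard(ch):
        return ch
    if ch == IDEOGRAPHIC_SPACE:
        return " "
    if 0xFF01 <= ord(ch) <= 0xFF5E:
        return chr(ord(ch) - FULLWIDTH_OFFSET)
    return None


def to_long(ch: str) -> str:
    """Return the long-char form of ch (newline has none here and is kept)."""
    if ch == " ":
        return IDEOGRAPHIC_SPACE
    if 0x21 <= ord(ch) <= 0x7E:
        return chr(ord(ch) + FULLWIDTH_OFFSET)
    return ch


def string_normalize(x: str) -> str:
    kinds = {is_standard(ch) for ch in x}
    if len(kinds) <= 1:
        return x  # purely standard-char, purely long-char, or empty
    narrowed = [to_standard(ch) for ch in x]
    if all(ch is not None for ch in narrowed):
        return "".join(narrowed)  # everything fits in standard-char
    return "".join(to_long(ch) for ch in x)  # widen everything instead
```

For example, a string mixing ASCII "A" with a fullwidth "B" normalizes to plain "AB", while a string mixing "A" with a kanji is widened so that the "A" becomes its fullwidth form.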
A new problem arises, namely:
Can the long-char type have non-zero values for char-font and char-bits ?
--> I feel the answer should be "NO".
--> (vector long-char) cannot be a "string", because string is
(vector string-char) and string-char and long-char are assumed to be disjoint.
--> We need another type of string, say long-char-based-string,
which is parallel to string but disjoint from string.
I do not have a firm idea about the selection yet.
At least in Japan, we have to settle the matter for several computer languages
such as Cobol, Fortran, C, Ada,...
and Common Lisp.