long-char, kanji

>Date: Sun, 11 May 86 16:51 EDT
>From: "Robert W. Kerns" <RWK@SCRC-YUKON.ARPA>
>Subject: The first note on kanji, sent to junet site in Jan 1986 and some reactions in japan
>In-Reply-To: <8605100330.AA08572@tansei.utyo.junet>
>Message-Id: <860511165153.6.RWK@WHITE-BIRD.SCRC.Symbolics.COM>

Before replying to the mail, I want to summarize my understanding.
I am sending this mail because the first mail on KANJI to common-lisp at su-ai
was not helpful for the discussion.
This issue also concerns the relation between the character data type and the string data type.

CLtL says,
 T > Character > string-char > standard-char
   string = (array string-char (*)) = (vector string-char)
   An object of type string-char has zero font and bits attributes.
   An object of type standard-char is one of
       the 95 (ASCII) characters or #\newline.

The basic idea of my draft:
  add long-char, or extended-string-char,
  which is needed to represent multi-byte characters.
  (In the last mail, I used the word "japanese-char" instead.
   I realized it was a poor choice of name.
   Here, I use "long-char" for multi-byte characters,
   but the name is temporary.)
The opinions behind the attempt to add long-char:
  There are many Lisps which cannot handle multi-byte characters correctly.
  Many implementors and users want a common way to handle Japanese characters.

Related facts (for information only):
  Each character of type standard-char has another representation in the
  JIS 6226 two-byte code, which I call here regular-long-char.
  Further, almost all the machines in Japan have yet another two-byte
  representation, which I call hankaku-long-char.
  That is, "A", say, can be represented as a standard-char, a regular-long-char or
  a hankaku-long-char. Furthermore, " " (blank character), ",", ".", "(", ")"
  can each have three different representations!
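
As a rough illustration only (Python rather than Lisp), the one-byte and
two-byte forms of "A" are different objects unless something folds them
together. The assumptions here are mine: EUC-JP stands in for the JIS two-byte
code (it carries JIS X 0208, the later name of JIS C 6226), and the Unicode
fullwidth "A" stands in for regular-long-char. The hankaku-long-char form is
machine dependent and is not modelled.

```python
import unicodedata

ascii_a = "A"            # plays the role of the standard-char "A"
fullwidth_a = "\uff21"   # FULLWIDTH LATIN CAPITAL LETTER A, plays regular-long-char

# The same letter occupies one byte or two, depending on the representation.
print(ascii_a.encode("euc_jp"))       # b'A'
print(fullwidth_a.encode("euc_jp"))   # b'\xa3\xc1'

# A plain comparison treats the two forms as different characters.
print(ascii_a == fullwidth_a)         # False

# Only an explicit folding step (here Unicode NFKC) identifies them.
print(unicodedata.normalize("NFKC", fullwidth_a) == ascii_a)   # True
```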

The basic issues:
  Is the long-char a subtype of character ?
  Is the long-char a subtype of string-char ?
  What is the relation between standard-char and long-char ?
  Can a vector of long-char be a component of a string ?
  If the long-char is separated from string-char, should it have a font attribute or not?

--- Selection 1 ---
  make long-char be a subtype of string-char, i.e. string-char > long-char.
  long-char and standard-char are disjoint.
--- Selection 2 ---
  make long-char be a subtype of character type, i.e. character > long-char.
  and, string-char and long-char are disjoint.
--- Selection 3 ---
  make standard-char be a subtype of long-char, i.e. string-char > long-char > standard-char.

Possible consequences of the above selections

Selection 1:
 long-char (2 bytes or more) and standard-char (1 byte) can be mixed in a string.
--> It seems very expensive for general-purpose machines
    to support ELT, LENGTH etc. correctly,
    and users may be confused when writing software.
--> Add a string-normalize function.
 (string-normalize x) means:
   if x is composed purely of standard-char then return x.
   if x is composed purely of long-char then return x.
   if x is a mixture of subtypes of string-char,
     then if all the characters of x are representable in standard-char
            then each character is converted to its standard-char representation
          elseif there is at least one character which is
                 representable only by long-char,
            then all characters are converted to long-char representation.
          else error.
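
The string-normalize rules above can be sketched as follows. This is only a
model in Python, not a proposal for the Lisp definition, and every name in it
is hypothetical. I assume that standard-char means printable ASCII plus
newline, that every other character counts as a long-char, and that the
Unicode fullwidth forms (U+FF01 to U+FF5E, plus U+3000 for the blank) play
the role of the two-byte forms of the ASCII characters.

```python
FULLWIDTH_OFFSET = 0xFEE0  # U+FF01..U+FF5E mirror ASCII 0x21..0x7E

def is_standard(ch):
    """Model of standard-char: printable ASCII or newline (an assumption)."""
    return ch == "\n" or 0x20 <= ord(ch) <= 0x7E

def to_standard(ch):
    """Return the standard-char form of ch, or None if it has none."""
    if is_standard(ch):
        return ch
    if ch == "\u3000":                      # ideographic space
        return " "
    if 0xFF01 <= ord(ch) <= 0xFF5E:         # fullwidth ASCII variants
        return chr(ord(ch) - FULLWIDTH_OFFSET)
    return None

def to_long(ch):
    """Return the long-char form of ch, or None if it has none."""
    if not is_standard(ch):
        return ch                           # already a long-char in this model
    if ch == " ":
        return "\u3000"
    if 0x21 <= ord(ch) <= 0x7E:
        return chr(ord(ch) + FULLWIDTH_OFFSET)
    return None                             # newline has no long form here

def string_normalize(x):
    kinds = {is_standard(ch) for ch in x}
    if len(kinds) <= 1:
        return x                            # purely standard, purely long, or empty
    as_standard = [to_standard(ch) for ch in x]
    if None not in as_standard:
        return "".join(as_standard)         # everything fits in standard-char
    as_long = [to_long(ch) for ch in x]
    if None not in as_long:
        return "".join(as_long)             # widen everything to long-char
    raise ValueError("string cannot be normalized to a single representation")

print(string_normalize("\uff21 B") == "A B")            # True
print(string_normalize("\u6f22A") == "\u6f22\uff21")    # True
```

In this model the final "else error" case is reached by a string such as a
kanji followed by a newline, since the kanji has no standard-char form and the
newline has no long-char form.
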
Selection 2:
 A new problem arises, namely:
  Can the long-char type have non-zero values for char-font and char-bits ?
--> I feel the answer should be "NO".
--> (vector long-char) cannot be a "string", because string is
 (vector string-char) and string-char and long-char are assumed to be disjoint.
--> We need another type of string, say long-char-based-string,
  which is parallel to string, but is disjoint from string.

I do not have a firm opinion on the selection yet.
At least in Japan, we have to settle the matter for several computer languages,
such as Cobol, Fortran, C, Ada, ...
and Common Lisp.

Masayuki Ida