
Re: A multi-byte character extension proposal



I received mail from a person at Nippon UNIVAC regarding this matter.
He is one of the contributors to the Kanji WG.

Here is his comment.
(I am just acting as a relay between the Japanese network and the US.)

If anyone is interested in replying, I will forward the mail to him
(since direct mailing is not permitted).


Masayuki Ida


============================ his comment =================================

Date: Fri, 26 Jun 87 16:46:31 JST
From: tshimizu@xxx.yyy.junet (Toshihiko Shimizu)
To: ida@u-tokyo.junet
Subject: Our comments to Thom Linden's q

  We have reviewed the JEIDA proposal and Mr. Thom Linden's
questions.  We have been working on an implementation of the
extension of Common Lisp for Japanese character processing on the
Explorer Lisp machine.  Our work has proceeded in parallel with the
JEIDA Kanji WG activity, and our specifications had been nearly fixed
before the final proposal was issued.  Nevertheless, our
implementation conforms almost entirely to the proposal, except for
the points left implementation dependent.  We think it is useful to
answer Linden's questions in terms of our implementation, for your
information.

 First we have to give an overview of our implementation.  Our primary
goal is exactly the same as that of the JEIDA proposal: we want to use
Japanese characters in almost the same way as the characters already
available.  The Explorer uses an extended ASCII character set called
the Lisp Machine character set.  We have added a new character set for
Japanese characters, the JIS character set, which is the defined
standard in Japan.  This set has double-byte character codes.  The
Explorer can handle strings whose elements are 2 bytes long.  This
string type can be considered a subtype of type string, and we use
strings of this type to hold double-byte characters.  Naturally, these
strings can also hold single-byte characters mixed in with them.  This
implementation is almost the same as the scheme using the
"internal-thin-string" type described in the proposal.  We are now
preparing a paper on this implementation for WGSYM IPSJ, September
1987.  Please refer to it for further details.


The following are our answers to Linden's questions:


1)
 All Common Lisp standard characters are included in the standard JIS
character set, but they have different character codes from the ones
in the ASCII character set.  The situation is much the same in
ordinary file systems that allow the JIS character set.  We therefore
think these differences have to be preserved when files are read into
Lisp as a sequence of characters.  After that we can think about
parsing, discussed later.


2)
 The above interpretation seems to contradict the description in CLtL
(p.233).  We think that two distinct character objects may have the
same print glyph, but in that case they should have the same syntactic
properties.  They are indeed different characters, but sometimes it is
hard to tell, because they may be printed in various fonts and the
printed figures are sometimes very similar.


3), 4)
 Actually we have both single-byte and double-byte representations for
some characters.  But we never try to map them into one, except when
the Lisp reader parses them.  This is because these differences have
to be preserved, as described above.  We also think that once the two
representations are mapped into one, there is no reasonable way to
make the inverse mapping.  This is the crucial point for Lisp
applications that interact with other, conventional applications.
Suppose we have a text processing application in Lisp and we want to
use it on a text file in which single-byte and double-byte characters
are mixed.  It is not desirable for all single-byte characters in the
source text file to be mapped into double-byte ones.
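
The point about the missing inverse mapping can be illustrated with a
small sketch.  It is written in Python rather than Explorer Lisp, and
it assumes Unicode fullwidth forms as a modern stand-in for the JIS
double-byte codes; the helper name fold_to_single_byte is hypothetical
and not part of any implementation discussed here.

```python
import unicodedata

def fold_to_single_byte(text: str) -> str:
    """Map fullwidth forms to their ASCII counterparts (a one-way fold)."""
    # Assumption: Unicode fullwidth letters stand in for JIS double-byte
    # characters. NFKC maps fullwidth Latin letters and digits to ASCII.
    return unicodedata.normalize("NFKC", text)

# Two different source texts that mix the two representations.
source_1 = "AB\uff23"   # 'A', 'B', fullwidth 'C'
source_2 = "\uff21BC"   # fullwidth 'A', 'B', 'C'

folded_1 = fold_to_single_byte(source_1)
folded_2 = fold_to_single_byte(source_2)

# Both sources collapse to the same text, so nothing in the folded
# result says which characters were originally double-byte.
print(source_1 != source_2, folded_1 == folded_2)
```

Since distinct inputs produce the same folded text, no function of the
folded text alone can restore the original mixture.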


5)
 Our standpoint is that a double-byte character can be a standard
character within the parsing context only if its printed glyph is
regarded as that of a standard character.  As a result, there must be
some test for this correspondence.  We actually have such an
"equivalence test".  Both the single-byte character set and the
double-byte character set include the standard characters.  If a
character from the single-byte character set is a standard character,
there is a corresponding character in the double-byte character set.
These two characters pass the "equivalence test", but they are never
EQ.  However, this point may contradict the description in CLtL
(p.20).
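
A minimal sketch of what such an "equivalence test" might look like
follows.  It is in Python, not the Explorer's Lisp, and it assumes the
Unicode fullwidth block (U+FF01 to U+FF5E, which mirrors ASCII 0x21 to
0x7E at a fixed offset) in place of the JIS codes.  The names
single_byte_version and equivalent are hypothetical, and Python's
string inequality stands in for "not EQ".

```python
FULLWIDTH_OFFSET = 0xFEE0  # U+FF01..U+FF5E mirror ASCII 0x21..0x7E

def single_byte_version(ch: str) -> str:
    """Return the ASCII counterpart of a fullwidth character, else ch."""
    code = ord(ch)
    if 0xFF01 <= code <= 0xFF5E:
        return chr(code - FULLWIDTH_OFFSET)
    return ch

def equivalent(a: str, b: str) -> bool:
    """Hypothetical equivalence test: equal after folding to single-byte."""
    return single_byte_version(a) == single_byte_version(b)

narrow_a = "a"
wide_a = "\uff41"  # fullwidth 'a'

# The two versions are distinct objects, yet they pass the test.
print(narrow_a != wide_a, equivalent(narrow_a, wide_a))
```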

5a)
 Accordingly, our implementation recognizes some double-byte
characters as standard characters.  For example, STANDARD-CHAR-P
returns T for #\a in the double-byte character set.

5b)
 Our implementation takes option 3 in the proposal.  That is, we don't
distinguish single-byte and double-byte versions of symbols, but we
preserve the difference within strings.  For example, the two versions
of the symbol 'LAMBDA are considered to be EQ, but the two versions of
the string "LAMBDA" are distinguished, that is, they are not EQUAL,
although they pass the test described above.  Further, there may be
mixed versions of the string "LAMBDA".

5c)
 We might agree with Linden's point if we didn't have to think about
strings.  Indeed, our initial understanding was that there was no need
to make such a distinction for the sole purpose of Common Lisp.  But
there is a definite requirement for interaction with conventional
applications in which the distinction between single-byte and
double-byte versions is significant.  We therefore decided that the
distinction is not necessary for symbols, which play an important role
in programs, whereas it is necessary for strings, which are primarily
used for interaction with the outer world, such as files, displays,
and networks.
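
The different treatment of symbols and strings in 5b) and 5c) could be
modeled roughly as below.  This is an illustrative Python sketch under
stated assumptions, not the Explorer implementation: Unicode fullwidth
letters stand in for the JIS double-byte versions, object identity
(`is`) stands in for EQ, and the Symbol class and intern_symbol
function are hypothetical names.

```python
import unicodedata

class Symbol:
    """A minimal interned symbol; object identity stands in for Lisp EQ."""
    def __init__(self, name: str) -> None:
        self.name = name

_symbol_table = {}  # folded name -> Symbol

def intern_symbol(name: str) -> Symbol:
    """Intern by folded name, so both versions yield the same object."""
    key = unicodedata.normalize("NFKC", name)
    if key not in _symbol_table:
        _symbol_table[key] = Symbol(key)
    return _symbol_table[key]

narrow = "LAMBDA"
wide = "\uff2c\uff21\uff2d\uff22\uff24\uff21"  # fullwidth LAMBDA

# Symbols: the distinction is dropped when the name is interned.
print(intern_symbol(narrow) is intern_symbol(wide))
# Strings: the distinction is preserved, so the two are not equal.
print(narrow != wide)
```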

5d)
 Since we defined that a double-byte character may be a standard
character, it is consistent to define such a character to satisfy
ALPHA-CHAR-P.  Thus both versions of the character 'a' satisfy
ALPHA-CHAR-P, ALPHANUMERICP and LOWER-CASE-P.

5e)
 We think that these descriptions should be elaborated, but the JEIDA
committee has decided that they should be left implementation
dependent.


6)
 In our implementation, syntactic attributes relevant to parsing and
format control are defined only for standard characters.  That is, if
a character is a double-byte character and also a standard character,
it may have non-constituent syntax.  In fact it has the same syntax
attribute as its single-byte version.  For example, the string "123"
in its double-byte version is also parsed into the number 123.  Any
other double-byte character can only have constituent syntax.
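
The "123" example can be sketched as follows.  This is a simplified
Python illustration, assuming Unicode fullwidth digits in place of the
JIS double-byte digits; read_token is a hypothetical helper that
handles only unsigned decimal integers, not the full Lisp reader
algorithm.

```python
import unicodedata

def read_token(token: str):
    """Return an int if the folded token is a decimal integer,
    otherwise return the token unchanged."""
    # Fold double-byte (fullwidth) versions to single-byte before parsing.
    folded = unicodedata.normalize("NFKC", token)
    if folded.isascii() and folded.isdigit():
        return int(folded)
    # Other tokens are left as they are (constituent characters only).
    return token

wide_123 = "\uff11\uff12\uff13"  # fullwidth "123"
kanji = "\u6f22\u5b57"           # a token of non-standard characters

print(read_token("123") == read_token(wide_123))
print(read_token(kanji) == kanji)
```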


7)
 We think it is not necessary to have a readtable large enough to
cover all characters of type string-char.  We only have a readtable
for single-byte characters and use the "equivalence" mapping for the
double-byte versions of these characters.  The rest of the double-byte
characters are defined to have constituent syntax.
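
One way to picture this small readtable is the sketch below.  It is in
Python, with Unicode fullwidth forms assumed as a stand-in for the JIS
double-byte characters.  The table lists only a few entries, and the
names SINGLE_BYTE_READTABLE and syntax_type are hypothetical.

```python
# A deliberately partial table: only single-byte characters appear here.
SINGLE_BYTE_READTABLE = {
    "(": "terminating-macro",
    ")": "terminating-macro",
    "'": "terminating-macro",
    ";": "terminating-macro",
    '"': "terminating-macro",
    "#": "non-terminating-macro",
    "\\": "single-escape",
    "|": "multiple-escape",
    " ": "whitespace",
}

def syntax_type(ch: str) -> str:
    """Look up a character's syntax type through the single-byte table."""
    code = ord(ch)
    # Double-byte versions of standard characters are first mapped to
    # their single-byte equivalents (fullwidth block assumed here).
    if 0xFF01 <= code <= 0xFF5E:
        ch = chr(code - 0xFEE0)
    # Everything not in the table, including all remaining double-byte
    # characters, defaults to constituent syntax.
    return SINGLE_BYTE_READTABLE.get(ch, "constituent")

print(syntax_type("("), syntax_type("\uff08"), syntax_type("\u6f22"))
```

The table stays the size of the single-byte character set however
many double-byte characters exist.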


8)
 In our implementation, applying MAKE-DISPATCH-MACRO-CHARACTER to a
non-standard, double-byte character is an error.





------------- end of the message -----------