Re: long-char, kanji
- To: common-lisp@SU-AI.ARPA, ida@UTOKYO-RELAY.CSNET, nuyens.pa@XEROX.COM
- Subject: Re: long-char, kanji
- From: Masayuki Ida <a37078%ccut.u-tokyo.junet%utokyo-relay.csnet@CSNET-RELAY.ARPA>
- Date: Fri, 6 Jun 86 16:34:09+0900
>From ccut!Shasta!@SU-AI.ARPA:nuyens.pa%Xerox.COM@u-tokyo.junet Wed Jun 4 11:49:48 1986
>Date: 3 Jun 86 15:22 PDT
>Subject: re: long-char, kanji
>To: common-lisp@su-ai.ARPA
>
> ...
>representation:
>Strings are represented as homogeneous simple vectors of thin (8 bit) or
>fat (16 bit) characters. Ignoring storage taken to represent them, the
>difference between fat characters and thin characters is transparent to
>the user. In particular, since we allow fat characters in symbol print
>names, we use an equivalent of Ida's string-normalize function to
>guarantee unique representation for hashing.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I think this is the most important decision point.
I agree with doing so.
With Moon's idea, the relation between thin and fat characters is like the
relation between fixnum and bignum.
This means the characters in fat-char and those in thin-char
are completely independent.
But any character set may contain characters that have the same appearance
as a standard-char,
such as the space code, alphabetic characters, and terminating macro characters.
In fact, JIS 6226 has its own separate codes for the standard-chars.
I think other foreign character sets may also have characters with the same
visual figure as a standard-char.
With Moon's idea of ASSURE-FAT-STRING, once a string contains a fat-char,
it cannot be reduced
to a thin-string, even if a later modification leaves it containing only
characters that are representable in thin code.
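
As a rough illustration of the normalization discussed above (not code from
this thread), the sketch below folds fat duplicates of standard characters to
their thin codes and then picks the narrowest representation. It is written in
Python rather than Lisp, and it makes two loud assumptions: Unicode full-width
forms stand in for the JIS 6226 duplicates of the standard-chars, and the
names fold_char, string_normalize and THIN_LIMIT are hypothetical.

```python
# Hypothetical sketch: Unicode code points stand in for JIS 6226 codes,
# and none of these names come from an actual Lisp implementation.

THIN_LIMIT = 0x100  # assumption: codes below this fit an 8-bit thin character


def fold_char(ch):
    """Map a full-width duplicate of a standard character to its thin form."""
    code = ord(ch)
    if code == 0x3000:            # ideographic (full-width) space
        return " "
    if 0xFF01 <= code <= 0xFF5E:  # full-width forms of '!' through '~'
        return chr(code - 0xFEE0)
    return ch                     # anything else is left unchanged


def string_normalize(s):
    """Return (width, text): the folded text and its narrowest representation."""
    folded = "".join(fold_char(ch) for ch in s)
    is_thin = all(ord(ch) < THIN_LIMIT for ch in folded)
    return ("thin" if is_thin else "fat"), folded
```

Under these assumptions, a string typed with full-width letters and the same
string typed with thin letters normalize to one value, so they hash alike when
used as symbol print names. A string that once held a kanji normalizes back to
thin after the kanji is replaced, which is the reduction that
ASSURE-FAT-STRING alone would not perform.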
>
>kanji:
>NS includes all "JIS C 6226" graphic characters including the 6300 most
>common Japanese kanji. There are also Hiragana and Katakana character
>codes specified. (While there is substantial overlap with the Japanese
>kanji, Chinese characters are semantically separate and their character
>code assignments have not yet been published.)
>
The reason I stick to the kanji issue is not only that I am Japanese; I feel
it is the test case for coping with multi-byte characters, and as a Common Lisper
I feel a need to polish up the character data type.
>type hierarchy:
>Since we have char-bits-limit = char-font-limit = 1, STANDARD-CHAR is
>the same as STRING-CHAR. I agree with Moon that STRING should be
>(VECTOR CHARACTER) and provide specialisations (even though this is a
>change from the status quo). In our applications, we do as Fahlman
>suggests and use external data-structures to represent the sort of
>information encoded in "styles". (It is hard to standardize which
>attributes should be made part of style (some people claim "case" should
>be a style bit!)).
I also like the "style" idea.
I don't want to use font.
>
>number of character codes required:
>At first glance it seems hard to imagine exceeding 16 bits. Note
>however that the 7200 characters in NS don't include Chinese, Korean,
>Farsi, Hindi, etc. How many times have you been *sure* that the FOO
>field wouldn't be required to be larger than 16 bits?
>
As far as the Japanese character set is concerned, 16 bits for char-code is enough.
But for an international standard, I feel that room for more bits is needed.
>
>Greg Nuyens
>Text, Graphics and Printing,
>Xerox AI Systems
>
>
Masayuki Ida
ida%utokyo-relay.csnet@csnet-relay.arpa