
long-char, kanji



Here's how the character and string data types are organized in the
Symbolics system (Release 7 happened to be the version I looked at),
which supports Kanji as well as other extensions beyond the Common
Lisp standard-chars.  Perhaps this can serve as a guide: it is one
technique for extending Common Lisp characters that has been
demonstrated to work.

  STANDARD-CHAR is a subtype of STRING-CHAR
  STRING-CHAR is a subtype of CHARACTER
  FAT-CHAR is a subtype of CHARACTER
  FAT-CHAR and STRING-CHAR are an exhaustive partition of CHARACTER

  THIN-STRING and FAT-STRING are an exhaustive partition of STRING
  THIN-STRING is (VECTOR STRING-CHAR)
  FAT-STRING is (VECTOR (OR FAT-CHAR STRING-CHAR))

  STANDARD-CHAR includes only the 96 characters that Common Lisp
  says it includes.

  STRING-CHAR includes a few dozen additional characters, and
  has a representation that is 8 bits wide.

  FAT-CHAR has a representation that is 28 bits wide and can
  express all other characters that we support.
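
The partition above can be modeled in a few lines.  This is only an
illustrative sketch in Python, not Symbolics code.  The names Char,
is_string_char, is_fat_char and narrowest_string_kind are hypothetical,
and the rule that a character is thin exactly when its code fits in 8
bits and its bits and style are zero is an assumption based on the
widths quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    """Hypothetical model of a character: a code plus bits and style."""
    code: int
    bits: int = 0    # stands in for CHAR-BITS
    style: int = 0   # stands in for CHAR-STYLE (0 = no explicit style)


def is_string_char(ch: Char) -> bool:
    # ASSUMPTION: thin means "fits in 8 bits, with no bits and no style".
    return 0 <= ch.code < 256 and ch.bits == 0 and ch.style == 0


def is_fat_char(ch: Char) -> bool:
    # FAT-CHAR and STRING-CHAR partition CHARACTER, so fat is simply "not thin".
    return not is_string_char(ch)


def narrowest_string_kind(chars) -> str:
    # A fat string may hold either kind of character; a thin string may hold
    # only string-chars.  This reports the narrowest kind that can hold `chars`.
    return "thin" if all(is_string_char(c) for c in chars) else "fat"
```

Note that under this model a character with zero bits and zero style
can still be fat, simply because its code is too large for 8 bits.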

Note that there are slight deviations from the Common Lisp manual here.
It's not true in our system that any character whose bits and font are
zero is a STRING-CHAR, and it's not true that any STRING is a VECTOR of
STRING-CHAR.  These deviations are necessary and I don't think they
depart from the spirit of the language.

CHAR-CODE-LIMIT is 65536; the other bits in a FAT-CHAR are used
for CHAR-BITS and CHAR-STYLE.  CHAR-FONT-LIMIT is 1; that is,
Symbolics does not use CHAR-FONT.  CHAR-STYLE is a Symbolics
extension that is used to express how the character is
portrayed (size, italicization, boldface, typeface, etc.).
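
The arithmetic behind these numbers: a CHAR-CODE-LIMIT of 65536 takes
16 bits, which leaves 12 of the 28 bits for CHAR-BITS and CHAR-STYLE.
The sketch below packs and unpacks such a word in Python.  The split of
those 12 bits (4 for bits, 8 for style) and the order of the fields are
assumptions made purely for illustration; neither is stated here, and
pack_fat_char and unpack_fat_char are hypothetical names.

```python
CODE_WIDTH = 16   # follows from CHAR-CODE-LIMIT = 65536
BITS_WIDTH = 4    # ASSUMPTION: width of the CHAR-BITS field
STYLE_WIDTH = 8   # ASSUMPTION: width of the CHAR-STYLE field

# The three fields together must fill the 28-bit representation.
assert CODE_WIDTH + BITS_WIDTH + STYLE_WIDTH == 28


def pack_fat_char(code: int, bits: int = 0, style: int = 0) -> int:
    """Pack the three fields into one 28-bit integer (assumed layout)."""
    if not 0 <= code < (1 << CODE_WIDTH):
        raise ValueError("code out of range")
    if not 0 <= bits < (1 << BITS_WIDTH):
        raise ValueError("bits out of range")
    if not 0 <= style < (1 << STYLE_WIDTH):
        raise ValueError("style out of range")
    return code | (bits << CODE_WIDTH) | (style << (CODE_WIDTH + BITS_WIDTH))


def unpack_fat_char(word: int):
    """Inverse of pack_fat_char: return (code, bits, style)."""
    code = word & ((1 << CODE_WIDTH) - 1)
    bits = (word >> CODE_WIDTH) & ((1 << BITS_WIDTH) - 1)
    style = (word >> (CODE_WIDTH + BITS_WIDTH)) & ((1 << STYLE_WIDTH) - 1)
    return code, bits, style
```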

Character codes are assigned dynamically.  In files and for interchange,
FAT-CHARs are represented not by the binary representation used in
memory, but by a more symbolic representation involving the names of the
character set and character style.  This is the default; other
representations can be used for interchange with other systems.  Thus
interchange with JIS 6226 and Hankaku would be equally possible; there
is no assumption that the codes used internally and the codes used
externally are the same.

A SIMPLE-STRING is any string, thin or fat, that is a SIMPLE-ARRAY.
Since I don't know what the SIMPLE-STRING type is for, I don't know
whether allowing SIMPLE-STRINGs to be fat is good or bad.

Note: FAT-CHAR, THIN-STRING, and FAT-STRING are not actually accepted by
TYPEP (probably they ought to be), but there are predicates to test for
these types.  Usually the types are invisible; string-valued functions
produce thin or fat strings as necessary, depending on the contents of
the string.

Note that the difference between thin and fat characters is completely
transparent to the user, except that you cannot store a fat character
into a thin string.  There is no such thing as dual representations
of a character.  In this way thin and fat characters are closely analogous
to fixnums and bignums.  I feel that this property is very important
for the usability of the system.

We haven't found a need for your STRING-NORMALIZE function, perhaps
just because we don't worry that much about saving storage by making
strings thin whenever possible.  We do have a function ASSURE-FAT-STRING
that goes the other way; if its argument is a THIN-STRING, it makes
a new FAT-STRING that contains the same characters and returns it.
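
As a rough illustration of that behavior, here is a Python model, not
the Symbolics implementation.  ModelString, make_string and the
256-code threshold for thin characters are assumptions for the sketch,
and returning a string that is already fat unchanged is also an
assumption, since only the thin case is described above.

```python
THIN_LIMIT = 256  # ASSUMPTION: thin characters are those with codes below 256


class ModelString:
    """Hypothetical string model: a list of character codes plus a fat flag."""

    def __init__(self, codes, fat=False):
        codes = list(codes)  # always copy, so strings never share storage
        if not fat and any(c >= THIN_LIMIT for c in codes):
            raise TypeError("a thin string cannot hold a fat character")
        self.codes = codes
        self.fat = fat

    def store(self, i, code):
        # The one place the thin/fat distinction is visible to the user.
        if not self.fat and code >= THIN_LIMIT:
            raise TypeError("cannot store a fat character into a thin string")
        self.codes[i] = code


def make_string(codes):
    """Produce a thin or fat string as necessary, depending on the contents."""
    codes = list(codes)
    return ModelString(codes, fat=any(c >= THIN_LIMIT for c in codes))


def assure_fat_string(s):
    """Model of ASSURE-FAT-STRING: copy a thin string into a new fat one."""
    if s.fat:
        return s  # ASSUMPTION: an already-fat argument is returned as-is
    return ModelString(s.codes, fat=True)
```

A caller that wants to store an arbitrary character would call
assure_fat_string first and then store into the result.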