
re: long-char, kanji




This is just a note to expand on Fischer's info about the Xerox
corporate character code standard, one of the protocols that comprise
the Printing and Network Systems.  The character code standard (usually
just called NS characters) specifies character codes to represent text
fragments.  This requires a mapping of character codes to graphic,
rendering and control characters together with an interchange standard
describing the legal encoding of strings of these codes.  

The NS encoding is a 16-bit encoding with escapes for multibyte
characters.  Single byte characters are essentially in the familiar
ISO/ANSI encoding.  There are currently approximately 7100 character
codes assigned. NS is only specified as an interchange standard.  In
Xerox Lisp, we use NS characters without escapes as the internal
representation.  (I would discourage including escapes in internal
representations, since constant time random access is reduced to linear
scans.  This is a separate decision for external representation.)
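
To make the random-access point concrete, here is a minimal Python sketch.
The escape convention in it (a byte 0xFF followed by a byte that selects a
new high byte) is an assumption chosen for illustration; it is not meant as
a statement of the actual NS interchange rules, which are in the standard.
The names ESCAPE, decode and nth_escaped are likewise made up for this note.

```python
ESCAPE = 0xFF  # assumed escape byte, for this sketch only


def decode(stream):
    """Expand an escaped byte stream into a flat list of 16-bit codes."""
    codes = []
    high = 0  # current high byte; assumed to start at 0
    i = 0
    while i < len(stream):
        b = stream[i]
        if b == ESCAPE:
            if i + 1 >= len(stream):
                raise ValueError("truncated escape sequence")
            high = stream[i + 1]  # the following byte selects the high byte
            i += 2
        else:
            # Note: a low byte equal to ESCAPE cannot be represented in
            # this toy scheme.
            codes.append((high << 8) | b)
            i += 1
    return codes


def nth_escaped(stream, n):
    """Fetch character n from the escaped form; must scan from the start."""
    high = 0
    count = 0  # characters seen so far
    i = 0
    while i < len(stream):
        b = stream[i]
        if b == ESCAPE:
            if i + 1 >= len(stream):
                raise ValueError("truncated escape sequence")
            high = stream[i + 1]
            i += 2
            continue
        if count == n:
            return (high << 8) | b
        count += 1
        i += 1
    raise IndexError("character index out of range")


# 'A', two characters with high byte 0x30, then back to high byte 0 for 'B'.
stream = bytes([0x41, 0xFF, 0x30, 0x21, 0x22, 0xFF, 0x00, 0x42])
internal = decode(stream)   # done once, on input
third = internal[2]         # constant-time indexing from then on
```

With the escapes left in, every indexed access pays for the scan in
nth_escaped; decoding once at the boundary moves that cost to input time.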

Rendering codes reside in a separate region of the character code space.
Unlike graphic character codes (e.g. STANDARD-CHAR) which determine the
information included in a text fragment, rendering codes are only used
to specify appearance.  For example, to avoid the question of whether
(and how) a text search stops at an ffl ligature when searching for "ff",
rendering characters are only included in rendered images of a document
(to send to a printer, for instance).
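
A small Python sketch of that separation follows.  The ligature code value
and the function names are hypothetical (the value is not taken from the NS
code tables); the only point is that searching operates on the stored text
and ligatures appear only in the copy prepared for output.

```python
# Hypothetical rendering-only code for the "ffl" ligature.
# The value is invented for this example.
FFL_LIGATURE = 0xF0F0


def to_codes(text):
    """Turn a Python string into a list of integer character codes."""
    return [ord(c) for c in text]


def find(codes, pattern):
    """Return the index of the first occurrence of pattern, or -1."""
    n = len(pattern)
    for i in range(len(codes) - n + 1):
        if codes[i:i + n] == pattern:
            return i
    return -1


def render(codes):
    """Return a copy for output with each f-f-l run replaced by a ligature."""
    ffl = to_codes("ffl")
    out = []
    i = 0
    while i < len(codes):
        if codes[i:i + 3] == ffl:
            out.append(FFL_LIGATURE)
            i += 3
        else:
            out.append(codes[i])
            i += 1
    return out


stored = to_codes("baffle")           # what the document holds
hit = find(stored, to_codes("ff"))    # search sees plain graphic characters
printed = render(stored)              # only this copy contains the ligature
```

Because the stored text never contains the ligature code, the search for
"ff" behaves the same whether or not the printer will use a ligature.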

To give specific info about how our implementation of NS characters
addresses some of the problems mentioned recently on the list:

representation:
Strings are represented as homogeneous simple vectors of thin (8 bit) or
fat (16 bit) characters.  Ignoring storage taken to represent them, the
difference between fat characters and thin characters is transparent to
the user.  In particular, since we allow fat characters in symbol print
names, we use an equivalent of Ida's string-normalize function to
guarantee unique representation for hashing.  
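
As a rough illustration of why normalization matters for hashing, here is
a Python sketch.  It is not our Lisp implementation and not Ida's function;
make_string, normalize and hash_key are names invented for the example, and
the rule shown (store thin whenever every code fits in 8 bits) is an
assumption about what "unique representation" would mean.

```python
from array import array


def make_string(codes):
    """Store codes thin (bytes) if all fit in 8 bits, else fat (16-bit)."""
    codes = list(codes)
    if any(c < 0 or c > 0xFFFF for c in codes):
        raise ValueError("character code does not fit in 16 bits")
    if all(c <= 0xFF for c in codes):
        return bytes(codes)        # thin: one byte per character
    return array("H", codes)       # fat: unsigned 16-bit per character


def normalize(s):
    """Re-store a string in its unique form (thin whenever possible)."""
    return make_string(s)          # iterating bytes or array yields ints


def hash_key(s):
    """Key that depends on the stored representation after normalizing."""
    n = normalize(s)
    if isinstance(n, bytes):
        return ("thin", n)
    return ("fat", n.tobytes())


# The same text, once stored fat and once stored thin.
fat_hi = array("H", [72, 105])
thin_hi = b"Hi"
same_key = hash_key(fat_hi) == hash_key(thin_hi)
```

Without the normalize step, the fat and thin copies of "Hi" would produce
different keys even though the user sees the same string.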

kanji:
NS includes all "JIS C 6226" graphic characters including the 6300 most
common Japanese kanji.  There are also Hiragana and Katakana character
codes specified.  (While there is substantial overlap with the Japanese
kanji, Chinese characters are semantically separate and their character
code assignments have not yet been published.)

type hierarchy:
Since we have char-bits-limit = char-font-limit = 1, CHARACTER is
the same as STRING-CHAR.  I agree with Moon that STRING should be
(VECTOR CHARACTER), with specialisations provided (even though this is a
change from the status quo).  In our applications, we do as Fahlman
suggests and use external data-structures to represent the sort of
information encoded in "styles".  (It is hard to standardize which
attributes should be made part of a style; some people claim "case"
should be a style bit!)

number of character codes required:
At first glance it seems hard to imagine exceeding 16 bits.  Note
however that the roughly 7100 characters in NS don't include Chinese, Korean,
Farsi,  Hindi, etc.  How many times have you been *sure* that the FOO
field wouldn't be required to be larger than 16 bits?


A more detailed description of the NS character encoding is given in
XSIS 058404, available from:

Xerox Systems Institute
2100 Geng Rd.
Palo Alto, CA 94303
attn: Pam Cance (tell her you were referred by XAIS and your odds of
having any fee waived are good.)
phone: 415-496-6511


Greg Nuyens
Text, Graphics and Printing,
Xerox AI Systems