long-char, kanji
- To: common-lisp@SU-AI.ARPA
- Subject: long-char, kanji
- From: Robert W. Kerns <RWK@SCRC-YUKON.ARPA>
- Date: Sat, 31 May 86 13:21 EDT
- Cc: David A. Moon <Moon@SCRC-STONY-BROOK.ARPA>
- In-reply-to: <FAHLMAN.12210977147.BABYL@C.CS.CMU.EDU>
- Supersedes: <860531131604.5.RWK@WHITE-BIRD.SCRC.Symbolics.COM>, <860531131735.6.RWK@WHITE-BIRD.SCRC.Symbolics.COM>, <860531131827.7.RWK@WHITE-BIRD.SCRC.Symbolics.COM>

    Date: Sat, 31 May 1986 00:20 EDT
    From: "Scott E. Fahlman" <Fahlman@C.CS.CMU.EDU>

    True, but my guess is that few implementations will choose to add such a
    thing. I think our current view at CMU (Rob will correct me if I'm
    wrong) is that highlighting and the other things you do with "styles" is
    better accomplished with some sort of external data structure that
    indicates where the highlighting starts and stops.

I strongly disagree that this is "better". It might be "necessary"
in some environment or other, but I could argue that you're trying
to represent too much text in one string if it's really a problem
for you.
The point here is that one of the main purposes of the string data-type
is to hold text, not arbitrary sequences of characters. Character
styles are primarily part of text, not just some add-on highlighting
that's part of the presentation of that text. As part of text, they can
appear in text in editor buffers, in text files, in source files, etc.
True, sometimes they are part of the presentation, say for highlighting
a selected element in a menu. And indeed, in cases like this, we supply
the character-style information separately as either data-structure or
program-state.

    It seems wasteful to
    do this on a per-character basis, and even more wasteful to tax every
    character (or even just every Japanese character) with a field to
    indicate possible style modification. We wouldn't make it illegal to do
    this, but many implementations will go for the 2x compactness instead.

It's not for all characters, unless you put all your text in
one string, which is not a very good implementation technique,
especially in an editor.
Didn't I already suggest how to avoid paying a penalty for Japanese
the last time this topic came up? Instead of a two-way partitioning
of the type space, have a three-way partitioning. ``PLUMP-STRING''
uses 16 bits to represent the characters, with default style. Most
implementations would probably NOT need more than one character datatype
to implement any of these schemes, since even the hairy characters
would be an immediate datatype, but there would be a ``PLUMP-CHARACTER''
type, consisting of those characters which fit in PLUMP-STRINGs.
I believe I already explained how to use 16 bits to represent multiple
languages.
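The three-way partitioning above can be sketched as a function that picks
the narrowest string type able to hold a given piece of text. This is only
an illustration under stated assumptions: THIN-STRING, NARROWEST-STRING-TYPE,
and the modelling of a character as a (code . style) cons are hypothetical,
not part of any implementation, and the 8-bit width of the thin case is an
assumption. Only the 16-bit, default-style PLUMP-STRING comes from the text.

```lisp
;; Hypothetical model: a character is a cons (CODE . STYLE), where CODE is a
;; non-negative integer and a STYLE of NIL means "default style".
(defun narrowest-string-type (chars)
  "Return the narrowest string type (a keyword) that can hold the list CHARS."
  (cond ((some #'cdr chars)
         ;; At least one character carries a non-default style.
         :fat-string)
        ((every #'(lambda (c) (< (car c) 256)) chars)
         ;; Assumed 8-bit codes, default style.
         :thin-string)
        ((every #'(lambda (c) (< (car c) 65536)) chars)
         ;; 16-bit codes, default style.
         :plump-string)
        ;; Codes too wide for 16 bits fall back to the general case.
        (t :fat-string)))
```

Under this model, text in the default style whose codes fit in 16 bits is
classified as :PLUMP-STRING, and only text that actually carries a style is
pushed into the widest representation.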
Of course, this name isn't adequate; it should be named something
which reflects the fact that this is single-language.
More useful in the probably more common case of single-language
systems would be the equivalent technique applied to things which
are all of the same character-set (i.e. language) but with various
styles. Again, this would typically be a 16-bit-per-character
representation, although some might choose to do it with fewer.

    As I read the manual, Common Lisp strings are not now allowed to contain
    any characters with non-zero bit and font attributes. Arbitrary
    characters can be stored in vectors of type Character, which are not
    Strings and do not print with the double-quote notation.

This means they are useless, or nearly so. Please don't quote CLtL at
us, we know very well what the book says. We consider this portion of
the book to be very poorly thought out, and not suitable as a guide.
Preserving the status quo here would be a mistake. Let's not use the
book as a substitute for doing language design.
(I do consider compatibility to be a language design issue. Let
me assert that I believe compatibility is not a real problem here.
If you disagree with this, please give arguments other than just
"status quo").

    I am just
    suggesting that we preserve this status quo: the name String might be
    extended to include Fat-String (in the narrow sense of Fat-String
    defined above) but not to include vectors of arbitrary characters.

The only way we're not in compliance is that we allow storing of
characters with non-zero bits in strings (and files). I don't see
how this can be a problem for any legal CL program.
The issue here is one of type hierarchy. Remember, MAKE-ARRAY with
:ELEMENT-TYPE CHARACTER is allowed to give you back any kind
of array that can hold characters (i.e. one whose element type is any
supertype of CHARACTER, including T). Similarly, :ELEMENT-TYPE FAT-STRING-CHAR is
allowed to give you back an array that can hold characters with bits
as well. Nowhere else in CLtL do we forbid implementations from allowing
additional types to be stored in arrays. Especially note how the
:ELEMENT-TYPE argument to OPEN is defined, with explicit mention of
:ELEMENT-TYPE CHARACTER as storing "any character".
I think the intent was to not require implementations to support
putting arbitrary characters in strings, not to forbid them from
doing so.
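The point about MAKE-ARRAY can be checked directly in any implementation.
A minimal sketch, assuming only standard functions (MAKE-ARRAY,
ARRAY-ELEMENT-TYPE, SUBTYPEP); the function name CHARACTER-ARRAY-ELEMENT-TYPE
is made up for illustration, and the type reported will differ between
implementations:

```lisp
;; Ask for an array that can hold CHARACTER, then inspect what came back.
(defun character-array-element-type ()
  "Return the actual element type of an array requested with :ELEMENT-TYPE
CHARACTER, and whether CHARACTER is a subtype of that actual type."
  (let* ((a (make-array 4 :element-type 'character :initial-element #\a))
         (actual (array-element-type a)))
    ;; The actual element type is implementation-dependent. It only has to
    ;; be able to hold everything that was requested, so it may be
    ;; CHARACTER itself or some supertype of it, up to and including T.
    (values actual (subtypep 'character actual))))
```

The second value should be true everywhere. The first value is where
implementations are free to differ, which is the latitude being argued for
here.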
Note that I'm not advocating that putting characters with
bits in strings is really a good idea. If you want my honest
opinion, I don't think they should be of type CHARACTER at all.
I think they're more logically of type INPUT-GESTURE, and may
include various other bits of information, like whether you
hit the key with a feather or a hammer, or where you drew the
letter 'L' on the pad, or perhaps it wasn't a letter, but a
key on the synthesizer you keep next to your console.
But the rest of CL does not get into the business of forbidding
extensions. Let's not do it here, either. I see no reason why
CL has to forbid the inclusion of, say, diagrams in strings.