
long-char, kanji



    Date: Fri, 30 May 1986  22:41 EDT
    From: "Scott E. Fahlman" <Fahlman@C.CS.CMU.EDU>
    ....
    The Symbolics spec, as described by Moon, meets these goals.  However, he
    says that Fat-Char and String-Char form an EXHAUSTIVE partition of
    Character.

I'd say that's just an implementation detail.  I didn't mean to imply that
it should not be legal to introduce a third subtype of CHARACTER.  I agree
with the goals, by the way.
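
To make that concrete, the only real question is whether the last of
these SUBTYPEP facts is guaranteed; the first two hold either way.  (A
sketch only -- FAT-CHAR is our name, not part of the language, so this
is meaningful only on an implementation that has such a type.)

  (subtypep 'string-char 'character)                ; => T everywhere
  (subtypep 'fat-char 'character)                   ; => T where FAT-CHAR exists
  ;; The "exhaustive partition" claim, which I'd rather leave as an
  ;; implementation detail:
  (subtypep 'character '(or string-char fat-char))  ; => T only if no third subtype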

    This means that if an implementation supports any Char-Bit
    or Char-Font bits, the fat strings must be able to accommodate these, in
    addition to the longer Char-Code field.  Since the Char-Code will
    typically be 16 bits, it would be nice to be able to store just the
    char-code in a fat string, and not make the big jump to 32 bits per
    character, which is the next stop for most stock-hardware machines.

We considered this (and in fact partially implemented it at one time)
but felt that for our implementation the savings of storage did not
justify the extra complexity of having three subtypes of STRING instead
of two.  (Actually, as it turns out, there would be four subtypes: 8-bit,
32-bit, and two 16-bit subtypes, depending on whether the character has
no style or has a small code -- both special cases occur with
approximately equal frequency.)  I can easily understand that another
implementation that had less memory available and was more willing to
accept extra complexity might make this design decision the other way.
Having extra subtypes of CHARACTER is no problem, because, just as with
fixnums and bignums, the user never sees them; but any user who modifies
the contents of strings has to think about extra subtypes of STRING.
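
To spell the asymmetry out (a sketch only: MAKE-ARRAY with the
STRING-CHAR element type stands for however you ask your implementation
for a thin string, and SOME-WIDE-CHAR for any character that is not a
STRING-CHAR):

  ;; Integer subtypes are invisible: overflow just produces a bignum.
  (+ most-positive-fixnum 1)                ; => a bignum, nothing for the user to do

  ;; String subtypes are not: a thin string cannot hold a wide character,
  ;; so the right kind of string had to be chosen when it was created.
  (let ((thin (make-array 3 :element-type 'string-char
                            :initial-element #\Space)))
    (setf (char thin 0) #\A)                ; fine
    ;; (setf (char thin 1) some-wide-char)  ; an error, not a quiet widening
    thin)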

I really don't know how to deal with four subtypes of STRING in a
language standard.  I'm sure you stock-hardware people will jump all
over me if I suggest that SETF of CHAR should automatically change the
representation of the string if it isn't wide enough for the character
to fit, to make the subtypes essentially invisible to users.  Perhaps we
could take a hint from Common Lisp floating-point numbers, but I doubt
that that analogy is very helpful.
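
For concreteness, here is roughly what the non-magical version looks
like when it is pushed out into user code: store the character and hand
back a string that can hold it, copying to a wider string only when
forced to.  (Just a sketch of the idea, not a proposal; the magical SETF
would have to do the same forwarding invisibly, which is exactly the
part I expect objections to.)

  (defun store-char-widening (string index char)
    ;; If the character fits the string's element type, store it in
    ;; place; otherwise copy into a string wide enough for any CHARACTER.
    (if (typep char (array-element-type string))
        (progn (setf (char string index) char)
               string)
        (let ((wide (make-array (length string)
                                :element-type 'character
                                :initial-contents string)))
          ;; AREF rather than CHAR, since whether this wide vector counts
          ;; as a STRING is the very question under discussion.
          (setf (aref wide index) char)
          wide)))

Callers then have to write (setq s (store-char-widening s i c)), which
is the visible cost of not burying the forwarding inside SETF.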

    Two solutions are possible:

    First, we could alter the type hierarchy as Moon suggests, and begin to
    encourage implementations to exercise their right to have zero-length
    font and bit fields in characters.  A lot of us have come to feel that
    these were a major mistake and should begin to disappear.  (We wouldn't
    legislate these fields away, just not implement them and encourage code
    developers not to use them.)  An implementation that does this can have
    Fat-Strings with 16 bits per char, all of it Char-Code.

This would be fine.  The only problem is that if the implementation later
wants to add character styles, it has to double the width of fat-strings
or add a third type of string.

    Alternatively, we could say that Fat-Char is a subtype of Character, with
    Char-Bit and Char-Font of zero.  String-Char is a subtype of Fat-Char,
    with a Char-Code that fits (or can be mapped) into eight bits.  A
    Thin-String holds only characters that are of type String-Char.  A
    Fat-String holds Fat-Chars (some of which may also be String-Chars).  If
    you want a vector of characters that have non-zero bits and fonts, then
    you use (Vector Character).  I'm not sure what we do with the String
    type-specifier; the two reasonable possibilities are to equate it to
    Thin-String or to the union of Thin and Fat Strings.

I take it the way this differs from your first alternative is that there
are three subtypes of character and three subtypes of string, and you
propose to name the additional types CHARACTER and (VECTOR CHARACTER).
I don't think that's viable.  The informal definition of STRING is
anything that prints with double-quotes around it.  Surely any
one-dimensional array of characters should qualify as a string.  I don't
think it makes sense to use the name STRING for a specialized subtype of
(VECTOR CHARACTER) and have a different name for the general thing; I
think it's always cleaner to use the short name for the general thing
and qualified names for the specializations of it.  Surely using STRING
to mean the union of thin and fat strings, excluding extra-fat strings,
would be confusing.
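
Put as SUBTYPEP facts, the principle is just that every specialized
type sits under the short general name, never beside it (FAT-STRING
here is a made-up qualified name, not a proposal):

  (subtypep 'simple-string 'string)       ; => T -- qualified under general
  (subtypep 'fat-string 'string)          ; => T, where fat strings exist
  (subtypep '(vector character) 'string)  ; => T is what I am arguing for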

Another solution that should be permitted by the language is to have
only one representation for strings, which is fat enough to accommodate
all characters.  In some environments the frequency of thin strings
might be low enough that the storage savings would not justify the extra
complexity of optimizing strings that contain only STRING-CHARs.
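
In such an implementation asking for a thin string just gets you the
one string type there is; from the outside it looks like the element
type being rounded up.  A sketch (again with STRING-CHAR standing for
the thin element type):

  (array-element-type
    (make-array 5 :element-type 'string-char :initial-element #\Space))
  ;; => CHARACTER, not STRING-CHAR: every string is a fat string here.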

Stepping back a bit, what we have is an implementation-dependent
spectrum of subtypes of STRING.  We need names for the most general,
which can hold any CHARACTER, and the least general, which is only
required to be able to hold STANDARD-CHARs.  In addition, we need a
generic way to select from among implementation-dependent in-between
types, if there are any.  If you think Common Lisp should go this far,
some deep thought is in order.
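
One generic way to do the selecting is to let the element type asked
for pick the point on the spectrum, with the implementation free to
round up to the nearest representation it actually has.  A sketch,
using the :ELEMENT-TYPE argument to MAKE-ARRAY; the in-between element
type is only illustrative:

  ;; Least general: only required to hold STANDARD-CHARs.
  (make-array 10 :element-type 'standard-char :initial-element #\Space)

  ;; Most general: must be able to hold any CHARACTER.
  (make-array 10 :element-type 'character :initial-element #\Space)

  ;; In between: the implementation returns the narrowest of its string
  ;; representations that can hold this type.
  (make-array 10 :element-type 'fat-char :initial-element #\Space)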

    For those of us without microcoded type-dispatch, Simple-String is a
    very important concept.  It says that you can access the Nth character
    in the string simply by indexing off the address of the string by N
    bytes (maybe adding some fixed offset).  On a lot of machines that is
    one or two instructions, and no conditionals.  If Simple-Strings can be
    either fat or thin, then you have to make a runtime decision about
    whether to index by N bytes or 2N bytes.  So it is best to reserve
    Simple-String for simple thin strings and maybe add another type for
    Simple-Fat-String.

I see.  Let's put this in the manual next time around.
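
For the record, that access pattern is what a declared SCHAR is
supposed to compile into; a minimal sketch, with nothing
implementation-specific in it:

  (defun nth-thin-char (s n)
    ;; With these declarations a compiler for a stock machine can turn
    ;; the access into an indexed byte load, with no run-time dispatch
    ;; on the string's representation.
    (declare (simple-string s) (fixnum n)
             (optimize (speed 3) (safety 0)))
    (schar s n))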