[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

cs proposal part 1

\documentstyle{report}     % Specifies the document style.


Extensions to Common LISP to Support International
Character Sets}
Michael Beckerle\thanks{Gold Hill Computers} \and
Paul Beiser\thanks{Hewlett-Packard} \and
Robert Kerns\thanks{Independent consultant} \and
Kevin Layer\thanks{Franz, Inc.} \and
Thom Linden\thanks{IBM Research, Subcommittee Chair} \and
Larry Masinter\thanks{Xerox Research} \and
David Unietis\thanks{Lucid, Inc.}
\date{January 1, 1989}   % Deleting this command produces today's date.


\maketitle                 % Produces the title.




\newcommand{\apostrophe}{\clkwd '}


This is a proposal to the X3 J13 committee
for both extending and modifying the Common LISP
language definition to provide a standard basis for Common LISP
support of the variety of characters used to represent the
native languages of the international community.

This proposal was created by the Character Subcommittee of X3 J13.
We would like to acknowledge discussions with T. Yuasa and other
members of the JIS Technical Working Group,
comments from members of X3 J13,
and the proposals \cite{ida87},
\cite{linden87}, \cite{kerns87}, and \cite{kurokawa88} for
providing the motivation and direction for these extensions.
As all these documents and discussions were created
expressly for LISP standardization usage,
we have borrowed freely from their ideas as well as the texts

This document is separated into three parts. The first part explains the
major language changes and their motivations. While intended as
commentary to a general audience, and not explicitly as
part of the standard document, the X3 J13 editor may
include sections at her/his discretion.  The second part,
Appendix A, provides
the page by page set of editorial changes to \cite{steele84}.
The final part, Appendix B, contains language elements deleted
from \cite{steele84} which we view as important from a compatibility
viewpoint but consider deprecated Common LISP features.

The major objectives of this proposal are:
\item To provide a consistent, well-defined scheme allowing support
of both very large character sets and multiple character sets.
\footnote{The distinction between the terms {\em character repertoire}
and {\em coded character set} is made later.  The usage
of the term {\em character set},
avoided after this introduction, encompasses both terms.}

Many software applications are intended for international use, or
have requirements for incorporation of language elements of multiple
native languages within a single application.
Also, many applications require specialized languages including,
for example, scientific and typesetting symbols.
In order
to ensure some portability of these applications, data expressed in
a mixture of these
languages must be treated uniformly by the
software language.

All character and string manipulations should operate uniformly,
regardless of the character set(s) of the character objects.
This applies to array indexing, readtable definitions, read
symbol construction and I/O operations.

\item To ensure efficient performance of string and character

Many native
languages, such as Japanese and Chinese, use character
sets which contain more characters than the Latin alphabet.
Supporting larger sized character sets frequently means employing
larger data fields to uniquely encode each character.
Common LISP implementations using
larger sized character sets can
incur performance penalties in terms
of space, time, or both.

The use of large and/or multiple character sets by an
implies the need for a more complex character type representation.
Given a more complex character representation, the efficiency
of language operations on characters (e.g. string operations)
could be affected.

\item To assure forward compatibility of the proposed model
and definition with existing Common LISP implementations.

Developers should not be required to re-write large amounts of either
LISP code or data representations in order to apply the proposed
changes to existing implementations.
The proposed changes should provide an easy
portability path for existing code to many possible implementations.

There are a number of issues, some under the general rubric of
internationalization, which this proposal does {\em not} cover.
Among these issues are:
\item Time and date formats
\item Monetary formats
\item Numeric punctuation
\item Fonts
\item Lexicographic orderings
\item Right-to-left and bidirectional languages


We use several terms within this document which
are new in the context of Common LISP.
Definitions for the following prominent
terms are provided for the reader's convenience.

A {\em character repertoire} defines a collection of characters
independent of their specific rendered image or font.  This
corresponds to the mathematical notion of a {\em set}
\footnote{We avoid the term {\em character set} as it has been
(over)used in the context of character repertoire as well
as in the context of coded character set.}.
repertoires are specified independent of coding and their characters
are only identified with a unique label, a graphic symbol, and
a character description.
A {\em coded character set} is a character repertoire plus
an {\em encoding} providing a unique mapping between each character
and a number which serves as the character representation.
There are numerous internationally standardized coded character
sets; for example, \cite{iso8859/1} and \cite{iso646}.

A character may be included in one or more character repertoires.
Similarly, a character may be included in one or more
coded character sets.  For example, the Latin letter "A" is contained
in the coded character set standards: ISO 8859/1, ISO 8859/2,
ISO 6937/2, and others.

Common LISP
characters are partitioned into a unique collection of
repertoires called {\em
Character Registries}.  That is, each character is included
in one and only one Character Registry.  The label identifying
each character within a Character Registry is a unique numerical value
referred to as the {\em character index}.

In Common LISP a {\em character} data object is identified by its
{\em character code}, a unique numerical code.
Each character code is composed from
a Character Registry
shared by all characters of a particular Registry,
and a character index, a numerical value which
is unique within the Character Registry.

Character data objects which are classified as {\em graphic},
or displayable, are each associated with a {\em glyph}.  The
glyph is the visual representation of the character.

The primary purpose of introducing these terms is to provide a
consistent naming to Common LISP concepts which are related
to those found in ISO standardization of coded
character sets.
\footnote{The bibliography includes several relevant ISO
coded character set standards.}
They also serve as a demarkation between these
standardization activities.  For example, while Common LISP is free to
define unique repertoires and facilities to manipulate them, it should
not define coded character sets.

A secondary purpose is to detach the language specification from
underlying hardware representation.  From a language
specification viewpoint it is inconsequential whether
characters occupy one or more (8-bit) bytes or whether
a Common LISP implementation's
internal representation for characters is distinct from or identical
to any given external representation (for example, a text interchange
representation \cite{iso6937/2}).
We specifically do not propose any standard coded character sets.

\section{Character Identity}

Characters are uniquely distinguished by their codes,
which are drawn from the set of
non-negative integers.  That is, within Common LISP
a unique numerical code
is assigned to each semantically different character.
Character codes are composed from a Character Registry and a
character index.  The convention by which a character index and
Character Registry compose a character code is implementation

It is important to separate the notion of glyph from the notion of
character data object when defining a scheme under which issues of
identity can be rigorously decided by a computer language.  Glyphs are
the visual aspects of characters, writable on surfaces, and sometimes
called 'graphics'.  A language specification valid for more than a
narrow range of systems can only make assumptions about the existence
of {\em abstract} glyphs (for example, the Latin letter A) and not about
glyph variants (for example, the italicized Latin letter {\em A})
or characteristics of display devices.  Thus, a key element of this
proposal is the removal of the {\em font} and {\em bits}
attributes from the language specification.

One ramification is that the distinction between {\clkwd string-char}
and {\clkwd character} is eliminated.  {\bf All} characters can be
inserted into (type compatible) strings.
In addition, all functions
dealing with the {\em bits} and {\em font} attributes are either
removed or modified by this proposal.

A second ramification is the introduction of new functions to
compose and decompose character objects.
The {\clkwd characterp} predicate is extended to
support testing
membership of a character in a given Character Registry.
For example,
testing membership in the Japanese Katakana Character Registry.
Also, a global variable {\clkwd *all-registry-names*} is added to
support application determination of supported Character Registries.

A third ramification is that I/O functions must be modified to manage
the interaction between the Common LISP treatment of characters and
the external environment.

The definition in \cite{steele84} of semi-standard characters has
been eliminated.  This is replaced by a more uniform approach
with introduction of the Control Character
Registry (see below).

\section{Character Repertoires and Registries}

A Common LISP program must be able to compose and decompose
characters in a portable uniform manner, independent of any
underlying representation.  One possible composition is by
the pair $<$ coded character set standard, decimal representation $>$
\footnote{This syntax is for illustration only and is not being
Thus, for example, one might compose the Latin 'A' with the pair
$<$ "ISO8859/2-1987ccs", 65 $>$,
$<$ "ISO8859/6-1987ccs", 65 $>$, or
$<$ "ISO646-1983ccs", 65 $>$, etc..  The difficulty here is two-fold.
First, there are several ways to compose the same character and
second, there may be multiple answers to
the question: {\em To what coded character set
does character object x belong?}.\footnote{Even
worse, the answer might change yearly.}
The identical problems occur if the pair
$<$ character repertoire standard, decimal representation $>$ is used.
\footnote{Existing repertoires seem to be defined exclusively
in the context of coded character sets and not as standards
in their own right.}

The concept of Character Registry is introduced by this proposal
to resolve the problem of character composition and decomposition.
Each character is universally defined by the
pair $<$ Character Registry name, character index $>$. For this
to be a portable definition, it must have a standard meaning.
Thus this
proposal relies on a {\em Character Registry Standard}.
There is no existing Character Registry Standard.
Until such an ANSI or ISO standard exists, Common LISP
defines the {\em Common LISP Character Registry Standard}.
\footnote{It is the intention of X3 J13 to promote and adopt
an eventual ANSI or ISO Character Registry Standard.  In particular, we
acknowledge that X3 J13 is {\em not} the appropriate forum to
define the standard.  We believe
it is a required component of all programming languages
providing support for international characters.}

Common LISP defines the following Character Registries:
\footnote{In the interest of brevity, this document will
define only a partial list of
the Character Registry names.  A subsequent
document will define the complete Common LISP Character Registry
Standard including the effect of the character predicates
{\em alpha-char-p},
{\em lower-case-p}, etc..}
Character Registry names are strings formed from the Common LISP
{\clkwd standard-p} characters. Within registry names, alphabetic
case is ignored.}
\item Arabic
\item Armenian
\item Bo-po-mo-fo
\item Control
\item Cyrillic
\item Georgian
\item Greek
\item Hangul
\item Hebrew
\item Hiragana
\item Japanese-Punctuation
\item Kanji-JIS-Level-1
\item Kanji-JIS-Level-2
\item Kanji-Gaiji
\item Katakana
\item Latin
\item Latin-Punctuation
\item Mathematical
\item Pattern
\item Phonetic
\item Technical
The Common LISP Character Registry Standard is fixed;
an implementation
may not extend the set of characters within any Common LISP
Character Registry.

An implementation may provide support for all or part of any Common LISP
Character Registry
and may provide new character registries which include characters
having unique semantics (i.e. not defined in any other
implementation-defined character registry or Common LISP Character
Registry).  Implementation registries must be uniquely
named using only {\clkwd standard-p} characters.  In addition,
the repertoire names {\em base} and {\em standard} have
reserved Common LISP usage.

An implementation must document the registries it supports.
For each registry supported,
an implementation must define individual characters supported
including at least the following:
\item Character Labels,
Glyphs, and Descriptions.
\item $<$ Common LISP
Character Registry name, character index $>$ pair if one exists
otherwise $<$ implementation-defined
character registry name, character index $>$ pair.
\item Reader Canonicalization.
\item Position in total ordering.
The partial ordering of the Standard alphanumeric
characters must be preserved.
\item Effect of character predicates.
In particular,
\item {\clkwd alpha-char-p}
\item {\clkwd lower-case-p}
\item {\clkwd upper-case-p}
\item {\clkwd both-case-p}
\item {\clkwd graphic-char-p}
\item {\clkwd standard-char-p}
\item {\clkwd alphanumericp}
\item Interaction with File I/O.  In particular, the
coded character set standards
\footnote{For example, "ISO8859/1-1987ccs".} and
external encoding schemes
\footnote{For example, {\em "Xerox System Integration Character
Code Standard"}\cite{xerox87}.}
which are supported must be specified.

intent of the provision for multiple character registries
is that native language glyphs (with associated digits and
\footnote{For example, the glyphs on the keycaps of a particular
terminal, or any other glyph sets with a common use in graphics or
symbolic communication.
should each be mapped by the I/O interface
into registries inside
Common LISP, all the members of which
share a common registry name.
Which glyph sets are supported by the overall computing system, the
details of the mapping of
glyphs to character codes, and any implementation unique character
registry names used, are left unspecified by Common LISP.

The diversity of glyph sets and coded character
set conventions in use worldwide and the desirability
of allowing Common LISP to manipulate symbolic elements from many
languages, perhaps simultaneously, mandate such a flexible approach.

\section{Hierarchy of Types}

Providing support for extensive character repertoires may
impact Common LISP implementation performance in terms
of space, time, or both.
\footnote{This does not apply to all implementations.
Unique hardware support and user community requirements must
be taken into consideration.}
In particular, many existing
implementations support variants of the ISO 8859/1 standard.
Supporting large
repertoires argues for a multi-byte internal representation
for each character, even if an application primarily (or exclusively)
uses the ISO 8859/1 characters.

This proposal extends the definition of the character and string
type hierarchy to include specialized subtypes
of character and string.  An implementation is free to associate
compact internal representation tailored to each subtype.
The {\clkwd string} type specifier, when used as a
declaration (for example, in {\clkwd make-sequence})
is defined to mean the most general string subtype supported
by the implementation.  This definition emphasizes portability
of existing Common LISP applications to international
character environments over performance.  Applications emphasizing
efficiency of text processing in non-international environments
will require some modification to utilize subtypes with
compact internal representations.

It has been suggested that either a single type is
sufficient to support international characters,
or that a hierarchy of types could be used, in a manner
transparent to the user.  A desire to provide flexibility which
encourages implementations to support international
characters without compromising application efficiency
led us to accept the need for more than one type.
We believe that these choices reflect a minimal
modification of this aspect of the type system, and that
exposing the types for string and character construction while
requiring uniform treatment for characters otherwise
is the most reasonable approach.

\subsection{Character Type}

The following type specifier is added as a subtype
of {\clkwd character}.
\item {\clkwd base-character}

An implementation may support additional subtypes of {\clkwd character}
which may or may not be supertypes of {\clkwd base-character}.
In addition, an implementation may define {\clkwd base-character}
as equivalent to {\clkwd character}.

Characters of type {\clkwd base-character} are referred to as
{\em base characters}.  Characters of type {\clkwd
(and character (not base-character))}
are referred to as {\em extended characters}.
The base characters are
distinguished in the following respects:
The standard characters are a subrepertoire of the base characters.
Only members of the base character repertoire
can be elements of a base string.
The base characters are, in general, the default characters for I/O
No upper bound is specified for the number of glyphs in the base
character repertoire--that
is implementation dependent.  The lower bound is 96, the
number of standard characters defined for Common LISP.
\footnote{Or, in contrast, the base repertoire may include all
the Common LISP Character Registries.}

The distinction of base characters is largely a pragmatic
choice.  It permits efficient handling of common situations, is
in some sense privileged for host system I/O, and can serve as an
intermediate basis for portability, less general than the standard
characters, but possibly more useful across a narrower range of

Many computers have some "base" character representation which
is a function of hardware instructions for dealing with characters,
as well as the organization of the file system.  The base character
representation is likely to be the smallest transaction unit permitted
for text file and terminal I/O operations.  On a system with a record
based I/O paradigm, the base character representation is likely to
be the smallest record quantum.  On many computer systems,
this representation is a byte.

However, there are often multiple
coded character sets supportable on a
computer, through the use of special display and entry hardware, which
are varying interpretations of the basic system character
representation.  For example, ISO 8859/1 and ISO 6937/2 are two
different interpretations of the same 1-byte code representations.
Many countries have their own glyph-to-code mappings for 1-byte
character codes addressing the special requirements of national
languages.  Differentiating between these, without reference to
display hardware, is a matter of convention, since they all use the
same set of code representations.  When a single byte is not enough,
two or more bytes are sometimes used for character encoding.  This
makes character handling even more difficult on machines where the
natural representation size is a byte, since not only is the semantic
value of a character code a matter of convention, which may vary
within the same computing system, but so is the identification of a
set of bits as a complete character code.

It is the intention of this proposal that the base characters of
Common LISP
be the natural characters of the host system: its composition
should be
determined by the code capacity of the natural file system and I/O
transaction representations, and its assumed display glyphs should be
those of the terminals most commonly employed.
There are several advantages to this scheme.  Internal representation
of strings of just base characters can be more compact than
strings including extended characters.
Source programs are likely to consist predominantly of base characters
since the standard characters are a subset of the base character
repertoire. Parsing of pure base character text
can be more efficient than parsing of text including
extended characters.
I/O can be performed more simply
with base characters.

The standard characters are the 96 characters used in the Common LISP
definition {\bf or their equivalents}.

This was the Common LISP \cite{steele84} definition, but
{\em equivalents} is a vague term.

The standard characters are not defined by their glyphs, but by their
roles within the language.  There are two aspects to the roles of the
standard characters: one is their role in reader and format control
string syntax; the second is their role as components of the names of
all Common LISP
functions, macros, constants, and global variables.  As
long as an implementation chooses 96 glyphs
and treats those 96 in a manner consistent with
the language's specification for the standard characters (e.g.
the naming of functions), it doesn't matter what glyphs the I/O
hardware uses to represent those characters: they are the standard
characters.  Any program or
data text written wholly in those characters
is portable through simple code conversion.
\footnote{For example, the currency glyph, \$ , might be replaced
uniformly by the currency glyph available on a particular display.}

mechanisms, such as in \cite{linden87}, which support establishment of
equivalency between otherwise distinct characters are not excluded by
this proposal.
\footnote{We believe this is an important issue but it requires
additional implementation experience.  We also encourage
new proposals from JIS and ISO LISP Working Groups on this issue.}

\subsection{String Type}

The {\clkwd string} type
is defined as
a vector of characters.  More precisely, a string
is a specialized vector whose elements are of type
{\clkwd character} or a subtype of character.  The following string
subtypes are
distinguished with standardized names: {\clkwd base-string},
{\clkwd general-string}, {\clkwd simple-base-string}, and
{\clkwd simple-general-string}.
All strings which are not base strings
are referred to as {\em extended strings}.

A base string can only contain base characters.  A
{\clkwd general-string}
can contain any implementation supported base or extended characters,
in any mixture.
\footnote{This type might be more appropriately named
{\clkwd most-general-string}.  {\clkwd general-string} was
subjectively judged to be less offensive.}

All Common LISP functions defined to operate on strings treat
base and extended strings uniformly with the following
caveat: for any function which inserts a character into a string, it
is an error to insert an extended character
into a base string.
\footnote{An implementation may, optionally, provide automatic
coersion to an extended string.}

An implementation may support string subtypes more general
than {\clkwd base-string} but more specialized than
{\clkwd general-string}.
For example, a hypothetical
implementation supporting Arabic and Cyrillic Character Registries
might provide:
\item {\clkwd general-string} -- may contain Arabic, Cyrillic or
base characters in any mixture.
\item {\clkwd region-specialized-string} -- may contain installation
selected repertoire (Arabic/Cyrillic) or base characters in any
\item {\clkwd base-string} -- may contain base characters
Though, clearly, portability of applications using
{\clkwd region-specialized-string} is limited, a performance
advantage might argue for its use.
\footnote{{\clkwd region-specialized-string} is used here for
illustration only; it is not being proposed as a standardized
string subtype.}

an implementation
supporting a large base character repertoire
including, say, Japanese Character Registries may define
{\clkwd base-character}
as equivalent to {\clkwd character}.

We expect that applications sensitive to the performance
of character handling in some host environments will
utilize the string subtypes to provide performance
improvement.  Applications with emphasis on international
portability will likely utilize only {\clkwd general-string}s.

The {\clkwd coerce} function is extended to
allow for explicit coercion between base strings and extended strings.

During reader
construction of symbols, if all the characters
in the symbol's name are of type {\clkwd base-character},
then the name of the symbol may be stored as a base string.
Otherwise it will be stored as an extended string.

The base string type allows for more compact representation of strings
of base characters, which are likely to predominate in any system.
Note that in any particular implementation the base characters
need not be the
most compactly representable, since others might have
a smaller repertoire.
However, in most implementations base strings are
likely to be more space efficient than extended strings.

\section{Streams and System I/O}

A lot of the work of ensuring that a
Common LISP implementation operates correctly in a
multiple coded character set environment must be performed by
the I/O interface.
The system I/O interface, abstracted in
Common LISP as streams, is responsible
for ensuring that text input from outside LISP is properly mapped
into character objects internally, and that the inverse mapping
is performed on output.  It is beyond the scope of a language
definition to specify the details of this operation, but options
are specified which allow runtime indication from the user as to
what coded character sets a stream uses, and how the mappings
should be done.  It is expected that implementations will provide
reasonable defaults and invocation options to accommodate desired use
at an installation.

One keyword argument is proposed as an addition to {\clkwd open}:
\item {\clkwd :external-code-format}
whose value would be:
A name or list indicating an implementation recognized scheme for
representing 1 or more coded character sets.
For example, the so/si convention used by IBM on 370
machines could be selected by a list including
the name {\em "ibm-shift-delimited"}.
The run-encoding convention defined by XEROX could be
selected by {\em "xerox-run-encoded"}.
The convention based on
ASCII which uses leading bit patterns to distinguish two-byte codes
from one-byte codes could be selected by
{\em "ascii-high-byte-delimited"}.
As many coded character set names must be provided as the
implementation requires for that external coding convention.
For example, if {\em "ibm-shift-delimited"} were the
{\clkwd :external-code-format} argument, two
coded character set specifiers would have to be provided.

These arguments are provided for input, output, and
bidirectional streams.
It is an error to try to write a character other than a
member of the specified coded character sets
to a stream.  (This excludes the
\#$\backslash${\clkwd Newline} character.
Implementations must provide appropriate line division behavior
for all character streams.)

An implementation supporting multiple coded character sets
must allow for the external
representation of characters to be separately (and perhaps
multiply) specified to {\clkwd open},
since there can be circumstances under
which more than one external representation for characters
is in use, or more than one coded character set
is mixed together in an
external representation convention.

In addition to supporting conversion at the system interface, the
language must allow user programs to determine how much space data
objects will require when output in whichever external representations
are available.

The new function {\clkwd external-width}
takes a character
or string object as its required argument.  It also takes an optional
{\em output-stream}.
It returns the number of host system character
representation quantum units
Same as the storage width of a base character, usually a byte.
required to externally store that object, using the
representation convention associated with the stream.
If the object cannot be represented in
that convention, the function returns {\clkwd nil}.
This function is necessary
to determine if strings can be written to fixed length
fields in databases or terminal screen templates.  Note that this
function does not
address the problem of calculating
screen width of strings printed in proportional fonts.
modified to include these semantics is an
acceptable alternative to the {\clkwd external-width} function
proposed here.}