[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
cs proposal part1
- To: Common Lisp mailing <common-lisp@sail.stanford.edu>
- Subject: cs proposal part1
- From: Thom Linden <baggins@IBM.com>
- Date: Wed, 22 Feb 89 00:13:28 PST
\documentstyle{report} % Specifies the document style.
\pagestyle{headings}
\title{\bf
Extensions to Common LISP to Support International
Character Sets}
\author{
Michael Beckerle\thanks{Gold Hill Computers} \and
Paul Beiser\thanks{Hewlett-Packard} \and
Jerry Duggan\thanks{Hewlett-Packard} \and
Robert Kerns\thanks{Independent consultant} \and
Kevin Layer\thanks{Franz, Inc.} \and
Thom Linden\thanks{IBM Research, Subcommittee Chair} \and
Larry Masinter\thanks{Xerox Research} \and
David Unietis\thanks{Lucid, Inc.}
}
\date{February 21, 1989} % Deleting this command produces today's date.
\begin{document}
\maketitle % Produces the title.
\setcounter{secnumdepth}{4}
\setcounter{tocdepth}{4}
\tableofcontents
%----------------------------------------------------------------------
%----------------------------------------------------------------------
\newfont{\cltxt}{cmr10}
\newfont{\clkwd}{cmtt10}
\newcommand{\apostrophe}{\clkwd '}
\newcommand{\bq}{\clkwd\symbol{'22}}
%----------------------------------------------------------------------
%----------------------------------------------------------------------
\chapter{Introduction}
This is a proposal to the X3 J13 committee
for both extending and modifying the Common LISP
language definition to provide a standard basis for Common LISP
support of the variety of characters used to represent the
native languages of the international community.
This proposal was created by the Character Subcommittee of X3 J13.
We would like to acknowledge discussions with T. Yuasa and other
members of the JIS Technical Working Group,
comments from members of X3 J13,
and the proposals \cite{ida87},
\cite{linden87}, \cite{kerns87}, and \cite{kurokawa88} for
providing the motivation and direction for these extensions.
As all these documents and discussions were created
expressly for LISP standardization usage,
we have borrowed freely from their ideas as well as the texts
themselves.
This document is separated into two parts. The first part explains the
major language changes and their motivations. While intended as
commentary to a general audience, and not explicitly as
part of the standard document, the X3 J13 editor may
include sections at her/his discretion. The second part,
Appendix A, provides
the page by page set of editorial changes to \cite{steele84}.
\section{Objectives}
The major objectives of this proposal are:
\begin{itemize}
\item To provide a consistent, well-defined scheme allowing support
of both very large character sets and multiple character sets.
\footnote{The distinction between the terms {\em character repertoire}
and {\em coded character set} is made later. The usage
of the term {\em character set},
avoided after this introduction, encompasses both terms.}
Many software applications are intended for international use, or
have requirements for incorporation of language elements of multiple
native languages within a single application.
Also, many applications require specialized languages including,
for example, scientific and typesetting symbols.
In order
to ensure some portability of these applications, data expressed in
a mixture of these
languages must be treated uniformly by the
software language.
All character and string manipulations should operate uniformly,
regardless of the character set(s) of the character objects.
This applies to array indexing, readtable definitions, read
symbol construction and I/O operations.
\item To ensure efficient performance of string and character
operations.
Many native
languages, such as Japanese and Chinese, use character
sets which contain more characters than the Latin alphabet.
Supporting larger sized character sets frequently means employing
larger data fields to uniquely encode each character.
Common LISP implementations using
larger sized character sets can
incur performance penalties in terms
of space, time, or both.
The use of large and/or multiple character sets by an
implementation
implies the need for a more complex character type representation.
Given a more complex character representation, the efficiency
of language operations on characters (e.g. string operations)
could be affected.
\item To assure forward compatibility of the proposed model
and definition with existing Common LISP implementations.
Developers should not be required to re-write large amounts of either
LISP code or data representations in order to apply the proposed
changes to existing implementations.
The proposed changes should provide an easy
portability path for existing code to many possible implementations.
\end{itemize}
There are a number of issues, some under the general rubric of
internationalization, which this proposal does {\em not} cover.
Among these issues are:
\begin{itemize}
\item Time and date formats
\item Monetary formats
\item Numeric punctuation
\item Fonts
\item Lexicographic orderings
\item Right-to-left and bidirectional languages
\end{itemize}
%----------------------------------------------------------------------
%----------------------------------------------------------------------
%----------------------------------------------------------------------
%----------------------------------------------------------------------
\chapter{Overview}
We use several terms within this document which
are new in the context of Common LISP.
Definitions for the following prominent
terms are provided for the reader's convenience.
A {\em character repertoire} defines a collection of characters
independent of their specific rendered image or font. This
corresponds to the mathematical notion of a {\em set}
\footnote{We avoid the term {\em character set} as it has been
(over)used in the context of character repertoire as well
as in the context of coded character set.}.
Character
repertoires are specified independent of coding and their characters
are only identified with a unique {\em character label},
a graphic symbol, and
a character description.
A {\em coded character set} is a character repertoire plus
an {\em encoding} providing a unique mapping between each character
and a number which serves as the character representation.
There are numerous internationally standardized coded character
sets; for example, \cite{iso8859/1} and \cite{iso646}.
A character may be included in one or more character repertoires.
Similarly, a character may be included in one or more
coded character sets. For example, the Latin letter "A" is contained
in the coded character set standards: ISO 8859/1, ISO 8859/2,
ISO 6937/2, and others.
To universally identify each character, we define a unique
collection of repertoires called {\em character
registries} as a partitioning of all characters.
That is, each character is included
in one and only one character registry.
In Common LISP a {\em character} data object is identified by its
{\em character code}, a unique numerical code.
Each character code is composed from
a character registry and a character label.
Character data objects which are classified as {\em graphic},
or displayable, are each associated with a {\em glyph}. The
glyph is the visual representation of the character.
The primary purpose of introducing these terms is to provide a
consistent naming to Common LISP concepts which are related
to those found in ISO standardization of coded
character sets.
\footnote{The bibliography includes several relevant ISO
coded character set standards.}
They also serve as a demarcation between these
standardization activities. For example, while Common LISP is free to
define unique manipulation facilities for characters, registries
and coded character sets, it should
not define standard coded character sets nor standard character
registries.
A secondary purpose is to detach the language specification from
underlying hardware representation. From a language
specification viewpoint it is inconsequential whether
characters occupy one or more (8-bit) bytes or whether
a Common LISP implementation's
internal representation for characters is distinct from or identical
to any of the numerous
external representations (for example, the text interchange
representation \cite{iso6937/2}).
We specifically do not propose any standard coded character sets.
%----------------------------------------------------------------------
\section{Character Identity}
Characters are uniquely distinguished by their codes,
which are drawn from the set of
non-negative integers. That is, within Common LISP
a unique numerical code
is assigned to each semantically different character.
It is important to separate the notion of glyph from the notion of
character data object when defining a scheme under which issues of
identity can be rigorously decided by a computer language. Glyphs are
the visual aspects of characters, writable on surfaces, and sometimes
called 'graphics'. A language specification valid for more than a
narrow range of systems can only make assumptions about the existence
of {\em abstract} glyphs (for example, the Latin letter A) and not about
glyph variants (for example, the italicized Latin letter {\em A})
or characteristics of display devices. Thus, an important element of
this proposal is the removal of the {\em font} and {\em bits}
attributes from the language specification.
\footnote{These and other attributes may still be supported as
implementation-defined extensions.}
All functions
dealing with the {\em bits} and {\em font} attributes are either
removed or modified by this proposal.
The deleted functions and constants include:
{\em char-font-limit,
char-bits-limit,
int-char,
char-int,
char-bits,
char-font,
make-char,
char-control-bit,
char-meta-bit,
char-super-bit,
char-hyper-bit,
char-bit,
set-char-bit}.
The definition in \cite{steele84} of semi-standard characters has
been eliminated. This is replaced by a more uniform approach
to character naming with the introduction of character registries
(see below).
%----------------------------------------------------------------------
\section{Character Naming}
A Common LISP program must be able to name, compose and decompose
characters in a uniform, portable manner, independent of any
underlying representation. One possible composition is by
the pair $<$ coded character set standard, decimal representation $>$
\footnote{This syntax is for illustration only and is not being
proposed.}.
Thus, for example, one might compose the Latin 'A' with the pair
$<$ ISO8859/2-1987, 65 $>$,
$<$ ISO8859/6-1987, 65 $>$, or
$<$ ISO646-1983, 65 $>$, etc.. The difficulty here is two-fold.
First, there are several ways to compose the same character and
second, there may be multiple answers to
the question: {\em To what coded character set
does character object x belong?}.\footnote{Even
worse, the answer might change yearly.}
The identical problems occur if the pair
$<$ character repertoire standard, decimal representation $>$ is used.
\footnote{Existing ISO repertoires seem to be defined exclusively
in the context of coded character sets and not as standards
in their own right.}
The concept of character registry is introduced by this proposal
to resolve the problem of character naming, composition and
decomposition.
Each character is universally defined by the
pair $<$ character registry name, character label $>$. For this
to be a portable definition, it must have a standard meaning.
Thus we propose the formation of an ISO Working Group to
define an international
{\em Character Registry Standard}.
At this writing there is no existing Character Registry Standard nor
ISO Working Group organized to define such a standard.
\footnote{It is the intention of X3 J13 to promote and adopt
an eventual ANSI or ISO Character Registry Standard. In particular, we
acknowledge that X3 J13 is {\em not} the appropriate forum to
define the standard. We believe
it is a required component of all programming languages
providing support for international characters.}
Common LISP character codes are composed from a character registry and
a character label. The convention by which a character label and
character registry compose a character code is implementation
dependent.
We introduce new functions {\clkwd find-char, char-registry-name,} and
{\clkwd char-label} to
compose and decompose character objects. We also extend the
{\clkwd characterp} predicate to
support testing
membership of a character in a given character registry.
\footnote{
For example,
testing membership in the Japanese Katakana character registry.
}
A global variable {\clkwd *all-character-registry-names*}
is added to
support application determination of supported character registries.
The naming and content of the standard character registries
is left unspecified by this proposal.
\footnote{The only constraint is that character registries be
named using only {\clkwd standard-p} characters.}
Below are some candidate character registry names:
\begin{itemize}
\item Arabic
\item Armenian
\item Bo-po-mo-fo
\item Control (meaning the collection of standard text communication
control codes)
\item Cyrillic
\item Georgian
\item Greek
\item Hangul
\item Hebrew
\item Hiragana
\item Japanese-Punctuation
\item Kanji
\item Katakana
\item Latin
\item Latin-Punctuation
\item Mathematical
\item Pattern
\item Phonetic
\item Technical
\end{itemize}
The list above is provided as a starting point for discussion
and is not intended to be representative
nor exhaustive. The Common LISP language definition does not
depend on these names nor any specific content (for example:
Where should the plus sign appear?). It is application
programs which require a reliable definition of the
registry names and their constituents. The Common LISP language
definition imposes the framework for constructing and manipulating
character objects.
The proposed ISO Character Registry Standard is fixed;
an implementation may not extend a standard registry's
constituent set of characters beyond the
standard definition.
An implementation may provide support for all or part of any
character registry
and may provide new character registries which include characters
having unique semantics (i.e. not defined in any standard
character registry).
Implementation registries must be uniquely
named using only {\clkwd standard-p} characters.
An implementation must document the registries it supports.
For each registry supported the documentation must include
at least the following:
\begin{itemize}
\item Character Labels,
Glyphs, and Descriptions.
\item Reader Canonicalization.
\item Effect of character predicates.
In particular,
\begin{itemize}
\item {\clkwd alpha-char-p}
\item {\clkwd lower-case-p}
\item {\clkwd upper-case-p}
\item {\clkwd both-case-p}
\item {\clkwd graphic-char-p}
\item {\clkwd alphanumericp}
\end{itemize}
\item Interaction with File I/O. In particular, the
coded character sets
\footnote{For example, ISO8859/1-1987.} and
external encoding schemes
\footnote{For example, {\em Xerox System Integration Character
Code Standard}\cite{xerox87}.}
supported are documented.
\end{itemize}
Which coded character sets and encoding schemes
are supported by the overall computing system, the
details of the mapping of glyphs to characters
to character codes are
left unspecified by Common LISP.
The diversity of glyph sets and coded character
set conventions in use worldwide and the desirability
of allowing Common LISP applications
to portabily manipulate symbolic elements from many
languages, perhaps simultaneously, mandate such a flexible approach.
%----------------------------------------------------------------------
\section{Hierarchy of Types}
Providing support for extensive character repertoires may
impact Common LISP implementation performance in terms
of space, time, or both.
\footnote{This does not apply to all implementations.
Unique hardware support and user community requirements must
be taken into consideration.}
In particular, many existing
implementations support variants of the ISO 8859/1 standard.
Supporting large
repertoires argues for a multi-byte internal representation
for each character, even if an application primarily (or exclusively)
uses the ISO 8859/1 characters.
This proposal extends the definition of the character and string
type hierarchy to include specialized subtypes
of character and string. An implementation is free to associate
compact internal representation tailored to each subtype.
The {\clkwd string} type specifier, when used for object
creation, for example in {\clkwd make-sequence},
is defined to mean the most general string subtype supported
by the implementation (similarily for the {\clkwd simple-string}
type specifier). This definition emphasizes portability
of existing Common LISP applications to international
character environments over performance. Applications emphasizing
efficiency of text processing in non-international environments
will require some modification to utilize subtypes with
compact internal representations.
It has been suggested that either a single type is
sufficient to support international characters,
or that a hierarchy of types could be used, in a manner
transparent to the user. A desire to provide flexibility which
encourages implementations to support international
characters without compromising application efficiency
led us to accept the need for more than one type.
We believe that these choices reflect a minimal
modification of this aspect of the type system, and that
exposing the types for string and character construction while
requiring uniform treatment for characters otherwise
is the most reasonable approach.
\subsection{Character Type}
The following type specifier is added as a subtype
of {\clkwd character}:
\begin{itemize}
\item {\clkwd base-character}
\end{itemize}
An implementation may support additional subtypes of {\clkwd character}
which may or may not be supertypes of {\clkwd base-character}.
In addition, an implementation may define {\clkwd base-character}
as equivalent to {\clkwd character}.
Characters of type {\clkwd base-character} are referred to as
{\em base characters}. Characters of type {\clkwd
(and character (not base-character))}
are referred to as {\em extended characters}.
The base characters are
distinguished in the following respects:
\begin{itemize}
\item
The standard characters are a subrepertoire of the base characters.
The selection of base characters which are not standard characters
is implementation defined.
\item
Only members of the base character repertoire
can be elements of a base string.
\item
The base characters are, in general, the default characters for I/O
operations.
\end{itemize}
No upper bound is specified for the number of glyphs in the base
character repertoire--that
is implementation dependent. The lower bound is 96, the
number of standard characters defined for Common LISP.
\footnote{Or, in contrast, the base repertoire may include all
implementation supported characters.}
The distinction of base characters is largely a pragmatic
choice. It permits efficient handling of common situations, is
in some sense privileged for host system I/O, and can serve as an
intermediate basis for portability, less general than the standard
characters, but possibly more useful across a narrower range of
implementations.
Many computers have some "base" character representation which
is a function of hardware instructions for dealing with characters,
as well as the organization of the file system. The base character
representation is likely to be the smallest transaction unit permitted
for text file and terminal I/O operations. On a system with a record
based I/O paradigm, the base character representation is likely to
be the smallest record quantum. On many computer systems,
this representation is a byte.
However, there are often multiple
coded character sets supportable on a
computer, through the use of special display and entry hardware, which
are varying interpretations of the basic system character
representation. For example, ISO 8859/1 and ISO 6937/2 are two
different interpretations of the same 1-byte code representations.
Many countries have their own glyph-to-code mappings for 1-byte
character codes addressing the special requirements of national
languages. Differentiating between these, without reference to
display hardware, is a matter of convention, since they all use the
same set of code representations. When a single byte is not enough,
two or more bytes are sometimes used for character encoding. This
makes character handling even more difficult on machines where the
natural representation size is a byte, since not only is the semantic
value of a character code a matter of convention, which may vary
within the same computing system, but so is the identification of a
set of bits as a complete character code.
It is the intention of this proposal that the composition of
base characters is typically
determined by the code capacity of the natural file system and I/O
transaction representations, and the assumed display glyphs should be
those of the terminals most commonly employed.
There are several advantages to this scheme. Internal representation
of strings of just base characters can be more compact than
strings including extended characters.
Source programs are likely to consist predominantly of base characters
since the standard characters are a subset of the base character
repertoire. Parsing of pure base character text
can be more efficient than parsing of text including
extended characters.
I/O can be performed more simply
with base characters.
The standard characters are the 96 characters used in the Common LISP
definition {\bf or their equivalents}.
This was the Common LISP \cite{steele84} definition, but
{\em equivalents} is a vague term.
The standard characters are not defined by their glyphs, but by their
roles within the language. There are two aspects to the roles of the
standard characters: one is their role in reader and format control
string syntax; the second is their role as components of the names of
all Common LISP
functions, macros, constants, and global variables. As
long as an implementation chooses 96 glyphs
and treats those 96 in a manner consistent with
the language's specification for the standard characters (e.g.
the naming of functions), it doesn't matter what glyphs the I/O
hardware uses to represent those characters: they are the standard
characters. Any program or
data text written wholly in those characters
is portable through simple code conversion.
\footnote{For example, the currency glyph, \$ , might be replaced
uniformly by the currency glyph available on a particular display.}
Additional
mechanisms, such as in \cite{linden87}, which support establishment of
equivalency between otherwise distinct characters are not excluded by
this proposal.
\footnote{We believe this is an important issue but it requires
additional implementation experience. We also encourage
new proposals from JIS and ISO LISP Working Groups on this issue.}
\subsection{String Type}
The {\clkwd string} type
is defined as
a vector of characters. More precisely, a string
is a specialized vector whose elements are of type
{\clkwd character} or a subtype of character. Similarly, a simple
string is a specialized simple vector whose elements are of type
{\clkwd character} or a subtype of character. The following string
subtypes are
distinguished with standardized names: {\clkwd base-string},
{\clkwd general-string}, {\clkwd simple-base-string}, and
{\clkwd simple-general-string}.
All strings which are not base strings
are referred to as {\em extended strings}.
A base string can only contain base characters.
{\clkwd general-string} is equivalent to {\clkwd (vector character)}
and can contain any implementation supported base or extended characters,
in any mixture.
All Common LISP functions defined to operate on strings treat
base and extended strings uniformly with the following
caveat: for any function which inserts a character into a string, it
is an error to insert an extended character
into a base string.
\footnote{An implementation may, optionally, provide automatic
coersion to an extended string.}
An implementation may support string subtypes in addition
to {\clkwd base-string} and
{\clkwd general-string}.
For example, a hypothetical
implementation supporting Arabic and Cyrillic character registries
might provide as extended characters:
\begin{itemize}
\item {\clkwd general-string} -- may contain Arabic, Cyrillic or
base characters in any mixture.
\item {\clkwd region-specialized-string} -- may contain installation
selected repertoire (Arabic/Cyrillic) or base characters in any
mixture.
\item {\clkwd base-string} -- may contain base characters
\end{itemize}
Though, clearly, portability of applications using
{\clkwd region-specialized-string} is limited, a performance
advantage might argue for its use.
\footnote{{\clkwd region-specialized-string} is used here for
illustration only; it is not being proposed as a standardized
string subtype.}
Alternatively,
an implementation
supporting a large base character repertoire
including, say, Japanese Kanji may define
{\clkwd base-character}
as equivalent to {\clkwd character}.
We expect that applications sensitive to the performance
of character handling in some host environments will
utilize the string subtypes to provide performance
improvement. Applications with emphasis on international
portability will likely utilize only {\clkwd general-string}s.
The {\clkwd coerce} function is extended to
allow for explicit coercion between base strings and extended strings.
It is an error to coerce an extended character to a base character.
During reader
construction of symbols, if all the characters
in the symbol's name are of type {\clkwd base-character},
then the name of the symbol may be stored as a base string.
Otherwise it will be stored as an extended string.
The base string type allows for more compact representation of strings
of base characters, which are likely to predominate in any system.
Note that in any particular implementation the base characters
need not be the
most compactly representable, since others might have
a smaller repertoire.
However, in most implementations base strings are
likely to be more space efficient than extended strings.
%----------------------------------------------------------------------
\section{Streams and System I/O}
A lot of the work of ensuring that a
Common LISP implementation operates correctly in a
multiple coded character set environment must be performed by
the I/O interface.
The system I/O interface, abstracted in
Common LISP as streams, is responsible
for ensuring that text input from outside LISP is properly mapped
into character objects internally, and that the inverse mapping
is performed on output. It is beyond the scope of a language
definition to specify the details of this operation, but options
are specified which allow runtime indication from the user as to
what coded character sets a stream uses, and how the mappings
should be done. It is expected that implementations will provide
reasonable defaults and invocation options to accommodate desired use
at an installation.
One keyword argument is proposed as an addition to {\clkwd open}:
\begin{itemize}
\item {\clkwd :external-coded-character-format}
whose value would be:
\begin{itemize}
\item
A name or list of names indicating an implementation recognized
scheme for representing 1 or more coded character sets.
\footnote{
For example, the so/si convention used by IBM on 370
machines could be selected by a list including
the name {\clkwd :ibm-shift-delimited}.
The run-encoding convention defined by XEROX could be
selected by {\clkwd :xerox-run-encoded}.
The convention based on
ASCII which uses leading bit patterns to distinguish two-byte codes
from one-byte codes could be selected by
{\clkwd :ascii-high-byte-delimited}.
}
As many coded character set names must be provided as the
implementation requires for that external coding convention.
\footnote{
For example, if {\clkwd :ibm-shift-delimited} were the
argument, two
coded character set specifiers would have to be provided.
}
\end{itemize}
\end{itemize}
These arguments are provided for input, output, and
bidirectional streams.
It is an error to try to write a character other than a
member of the specified coded character sets
to a stream. (This excludes the
\#$\backslash${\clkwd Newline} character.
Implementations must provide appropriate line division behavior
for all character streams.)
An implementation supporting multiple coded character sets
must allow for the external
representation of characters to be separately (and perhaps
multiply) specified to {\clkwd open},
since there can be circumstances under
which more than one external representation for characters
is in use, or more than one coded character set
is mixed together in an
external representation convention.
In addition to supporting conversion at the system interface, the
language must allow user programs to determine how much space data
objects will require when output in whichever external representations
are available.
The new function {\clkwd external-coded-string-length}
takes a character
or string object as its required argument. It also takes an optional
{\em output-stream}.
It returns the number of implementation-defined
representation units
\footnote{
Often the same as the storage width of a base character, usually a byte.
}
required to externally store that object, using the
representation convention associated with the stream.
If the object cannot be represented in
that convention, the function returns {\clkwd nil}.
This function is necessary
to determine if strings can be written to fixed length
fields in databases or terminal screen templates. Note that this
function does not
address the problem of calculating
screen width of strings printed in proportional fonts.
Related to the I/O interface,
we also introduce the function {\clkwd char-ccs-value}
which takes a character object and a coded character set name
(eg. {\clkwd :ISO8859/1-1987}) and returns the encoding of
the character within the coded character set.
%----------------------------------------------------------------------
%----------------------------------------------------------------------