A multi-byte character extension proposal
- To: common-lisp@SAIL.STANFORD.EDU, ida%tansei.u-tokyo.junet@RELAY.CS.NET
- Subject: A multi-byte character extension proposal
- From: Masayuki Ida <a37078%tansei.u-tokyo.junet%utokyo-relay.csnet@RELAY.CS.NET>
- Date: Mon, 18 May 87 14:06:46+0900
Date: Mon, 18 May 87 09:44:19 JST
From: moto@XXX.XXX.junet (MOTOYOSHI)
To: ida@tansei.cc.u-tokyo.junet
Subject: Kanji Proposal
-------------------- Beginning of text --------------------
@Comment[-*- Mode: Text -*-]
@heading[Digest]
@heading[1. Hierarchy of characters and strings]
Let the value of char-code-limit be large enough to include all
characters.
character > string-char >= internal-thin-char > standard-char
string >= internal-thin-string > simple-internal-thin-string
simple-string >= simple-internal-thin-string
string = (or internal-thin-string (vector string-char))
Type internal-thin-string and (vector string-char) are
disjoint or identical.
simple-string = (or simple-internal-thin-string
(simple-array string-char (*)))
Type simple-internal-thin-string and (simple-array
string-char (*)) are disjoint or identical.
notes: A > B means B is a subtype of A,
A >= B means B is a subtype of A or B is equal to A.
@Heading[2. Print width]
Only standard characters are required to have a fixed-pitch
print width. Use the newly introduced function 'write-width'
to find the print width of an expression.
@Heading[3. Functions]
Functions dealing with strings should work as before,
except those which would change the contents of an internal-thin-string
to characters that are not internal-thin-char's.
Functions producing strings should create a (vector
string-char), not an internal-thin-string, unless explicitly
specified otherwise.
Functions comparing strings should compare them elementwise.
Therefore it is possible that a (vector string-char) is equal
to an internal-thin-string.
@newpage
@Heading[1. A proposal for embedding multi-byte characters]
The Kanji Working Group (KWG) examined how to add facilities for
multi-byte characters to Common Lisp. This report is the result of many
discussions of many proposals. Of course, this report doesn't satisfy
all proposals, but it is very close.
In order to decide on a final proposal, we chose essential and
desirable characteristics of a working multi-byte character system.
Chapter 2 describes these characteristics in some detail.
Chapter 3 describes additional features for Common Lisp which will be useful
not just for multi-byte characters, but also for many other kinds of character sets.
This chapter describes internal data structures. If this proposal is
accepted in Common Lisp, it will be easy for each country to add its own mechanisms.
Chapter 4 describes proposed changes to @I[Common Lisp -- The Language]
(CLtL).
@Heading[2. Additional features for embedding multi-byte characters.]
This chapter describes design principles which can be used to design
multi-byte character language extensions to Common Lisp.
There are many programming languages which can handle multi-byte characters.
Most of them can use multi-byte characters as string character data
but not in variable or function names.
It is necessary for programming languages like Lisp that use symbolic data
to be able to process not only single-byte characters but also multi-byte characters.
That is, it should be possible to use multi-byte characters in character strings and
symbols, and it must be possible to store both kinds of characters in them.
Treating multi-byte characters just like other alphanumeric characters
means that each multi-byte character must be treated as a single character object.
Many of the present implementations of Lisp treat a multi-byte character as
a pair of bytes. Alternatively, they use a different data type which
doesn't permit multi-byte characters to be mixed with standard characters.
Such systems are not useful for users.
Thus, the basic design principles for embedding multi-byte characters in Common Lisp are:
@Begin[Itemize]
A multi-byte character should be treated like a single-byte character, that is,
a multi-byte character is one character object.
@End[Itemize]
@Begin[Itemize]
A program which was coded without explicit attention to multi-byte characters should
handle multi-byte character data as is.
@End[Itemize]
These principles provide sufficient functionality, but we can't ignore
efficiency. So we considered the next principle:
@Begin[Itemize]
The performance of the system in terms of CPU and memory
utilization should not be considerably affected in programs which do not use multi-byte
characters.
@End[Itemize]
This principle conflicts with the other principles, but it can't be
ignored when we consider the users of actual systems, so we have to
compromise. We think that the following methods will satisfy both of these
requirements.
@Heading[3. Common parts which we implement.]
This chapter describes the implementation of multiple character sets in Common Lisp.
To treat multi-byte characters like single-byte characters, the multi-byte character must be
included in the set of possible character codes.
We consider the following implementation methods.
@Begin[Itemize]
Add multi-byte characters by setting the variable char-code-limit to a large number.
@End[Itemize]
In this case, the single-byte character set and the multi-byte character
set must be ordered into a single sequence of character codes. This means that the multi-byte
character set must not overlap with the single-byte character set. Most implementations
could satisfy this requirement with ease.
If we use this method, it is possible to use multi-byte characters with
fonts in Common Lisp, and operations that work for single-byte
characters will also work for multi-byte characters without any change.
This implementation method has problems with efficiency.
When the value of a character code does not fit in 1 byte
(multi-byte characters are in this category), memory utilization is
affected. If every string element is stored in 2 bytes, a string containing only one single-byte character is 2 bytes long.
The same problem would also occur with symbol p-names. If we can solve the problem
for strings, we can solve the other problems, so we will start by considering only strings.
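The memory cost described here can be illustrated with a small Python sketch. It is not part of the proposal: Python's Latin-1 and UTF-16 encodings merely stand in for a 1-byte-per-character and a uniform 2-byte-per-character representation, which is an assumption made for illustration.

```python
# Illustrative model only: Latin-1 stands in for a packed 1-byte
# representation, UTF-16 (big-endian, no BOM) for a uniform 2-byte one.

def packed_size(text: str) -> int:
    """Bytes needed when every character fits in one byte."""
    return len(text.encode("latin-1"))


def wide_size(text: str) -> int:
    """Bytes needed when every character is stored in two bytes."""
    return len(text.encode("utf-16-be"))


single = "a"
mixed = "a\u6f22"  # 'a' followed by one kanji

print(packed_size(single))  # 1
print(wide_size(single))    # 2
print(wide_size(mixed))     # 4
```

A program that never uses multi-byte characters pays the doubled cost on every string, which is the case the efficiency principle is meant to protect.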
To avoid this memory utilization problem, it is possible to optimize
single-byte character strings by packing them internally. In other words,
the implementation has two kinds of data types but does not show this to the user. There is only one type of
data from the viewpoint of users, which means that every function which uses strings
will continue to work as defined.
This can be implemented almost everywhere without much cost. The only
problem occurs when a function attempts to put a multi-byte character into an optimized
and packed single-byte-only string. To work according to the definition, the implementation
must unpack the original packed string. This presents an implementation inefficiency which
the user may find undesirable.
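A minimal Python sketch of this hidden-packing approach follows. The class name `PackedString` and the one-byte limit of 256 are hypothetical choices made for illustration; the proposal does not define them.

```python
# Hypothetical model of hidden packing: one string type for the user,
# two internal representations chosen by the implementation.

class PackedString:
    """A mutable string that stores 1 byte per character while it can."""

    def __init__(self, text: str):
        if all(ord(c) < 256 for c in text):
            self._data = bytearray(ord(c) for c in text)  # packed form
        else:
            self._data = [ord(c) for c in text]           # unpacked form

    @property
    def is_packed(self) -> bool:
        return isinstance(self._data, bytearray)

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index: int) -> str:
        return chr(self._data[index])

    def __setitem__(self, index: int, char: str) -> None:
        code = ord(char)
        if self.is_packed and code >= 256:
            # The costly step: the whole string is copied into the
            # unpacked form before the store can happen.
            self._data = list(self._data)
        self._data[index] = code


s = PackedString("abc")
s[1] = "\u6f22"      # storing a kanji silently unpacks the string
print(s.is_packed)   # False
```

The store succeeds, but a single assignment costs time proportional to the length of the string.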
One solution would be to
@Begin[Itemize]
Generate errors for operations that try to store multi-byte characters into a
single-byte string, and present two string datatypes to users.
@End[Itemize]
We propose this latter implementation. Common Lisp should have 2 string
types to treat multi-byte characters efficiently. The first of these is
@b[string], which stores any character of type @B[string-char], i.e.,
any character whose @I[bits] and @I[font] are both zero. The second type of string is
@B[internal-thin-string], which is the optimized character string.
@B[internal-thin-char] is a subtype of @B[character] and can be inserted into an
@B[internal-thin-string]. The predicate which tests for this type of character is
@B[internal-thin-char-p].
The type @B[internal-thin-char] is a subtype of @B[string-char], and is a
supertype of @B[standard-char].
The data type hierarchy for @B[character] and @B[string] is shown in figure 1.
@b[Internal-thin-char] and @b[string-char] may be equal, since
both sets may describe the same character set.
Such an implementation is equivalent to the type of system that has only one type of character from the
viewpoint of the user, as discussed in the previous chapter. This proposal permits both
kinds of implementations.
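The two-type design can be sketched in Python as follows. Everything here is a hypothetical model: the code ranges, the limit of 256, and the class name `InternalThinString` are assumptions for illustration. Python characters have no bits or font attributes, so every Python character plays the role of a string-char.

```python
# Hypothetical model: the code ranges below are assumptions.
STANDARD_CODES = set(range(32, 127)) | {10}  # space, 94 printing characters, newline
THIN_LIMIT = 256                             # assumed one-byte limit


def standard_char_p(char: str) -> bool:
    return ord(char) in STANDARD_CODES


def internal_thin_char_p(char: str) -> bool:
    return ord(char) < THIN_LIMIT


class InternalThinString:
    """A packed string that rejects characters that are not thin."""

    def __init__(self, text: str):
        self._data = bytearray(len(text))
        for index, char in enumerate(text):
            self[index] = char

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index: int) -> str:
        return chr(self._data[index])

    def __setitem__(self, index: int, char: str) -> None:
        if not internal_thin_char_p(char):
            # Signal an error instead of silently unpacking.
            raise TypeError("not an internal-thin-char: %r" % char)
        self._data[index] = ord(char)


thin = InternalThinString("abc")
try:
    thin[0] = "\u6f22"
except TypeError as error:
    print("rejected:", error)
```

In this model every standard character satisfies `internal_thin_char_p`, mirroring the subtype chain standard-char, internal-thin-char, string-char.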
@newpage
@Begin[Verbatim]
character
|
string-char
|
internal-thin-char
|
standard-char
@Center[Fig-1.a Structure of character type]
                        string
                           |
          -----------------------------------
          |                |                |
          |          simple-string          |
          |                |                |
internal-thin-string       |      (vector string-char)
          |                |                |
          -----------------------------------
             |                             |
             |                             |
simple-internal-thin-string  (simple-array string-char (*))
@Center[Fig-1.b Structure of string type]
@End[Verbatim]
To compare @B[string] characters with @B[internal-thin-string] characters, it is
necessary to convert both to the @B[string-char] format. This means that
the same character is the same object regardless of whether it is found
in an @B[internal-thin-string] or a normal @B[string].
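As a rough illustration, the Python sketch below compares a packed and an unpacked sequence element by element after reducing both to plain character codes. The function name `codes_equal` and the two storage forms are hypothetical stand-ins, not part of the proposal.

```python
from typing import Sequence


def codes_equal(a: Sequence[int], b: Sequence[int]) -> bool:
    """Compare two strings elementwise by character code."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


thin = bytes([97, 98, 99])  # packed: one byte per character ("abc")
wide = [97, 98, 99]         # unpacked: one integer per character ("abc")

print(codes_equal(thin, wide))  # True
print(thin == wide)             # False: the containers differ in type
```

The comparison depends only on the characters, not on which representation holds them.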
Next we must discuss character input. The proposal does not discuss what is stored
in files, nor what happens between the Lisp implementation and a terminal.
Each system will implement this in its own way. Instead, let us discuss the data
as passed to Lisp programs. We think that treating all input data as @B[string]
is the safest possible course. Since a symbol's p-name string should not be modified,
it can be optimized.
This may cause performance problems for programs which use only
single-byte characters. The variable @B[*read-default-string-type*] is
provided for these programs. When its value is @B[internal-thin-string], the system
expects single-byte characters only, so the system will return input data
in the form of @B[internal-thin-string]. An implementation may, however,
choose to ignore this variable.
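The effect of this variable can be modelled with a short Python sketch. The function `read_string` is hypothetical, and falling back to the general representation when a multi-byte character appears is an assumption of the sketch; the proposal does not say what happens in that case.

```python
# Hypothetical model of *read-default-string-type*.

def read_string(text: str, default_type: str = "string"):
    """Return the characters of text in the selected representation."""
    codes = [ord(c) for c in text]
    if default_type == "internal-thin-string" and all(c < 256 for c in codes):
        return bytearray(codes)  # packed, models internal-thin-string
    return codes                 # general form, models string


print(type(read_string("abc")).__name__)                          # list
print(type(read_string("abc", "internal-thin-string")).__name__)  # bytearray
```

An implementation that ignores the variable would correspond to always taking the second `return`.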
@newpage
@Heading[4. Proposed changes to CLtL to support multiple character sets.]
In this section, we list proposed modifications to CLtL. Chapters 13,
14 and 18 of CLtL are closely concerned with multi-byte characters, so we specify
modifications to these chapters by listing all of their constants,
functions and variables. For other chapters we specify only the additions
and modifications. Those portions which are not mentioned are
unchanged.
@b(2 Data Types)
@b(2.5.2 Strings)
@begin(equation)
"a string is a specialized vector .... type string-char"
is changed to :
"a string is a specialized vector .... type string-char or @B[internal-thin-char]"
@end(equation)
@b(2.15 Overlap,Inclusion and Disjointness of Types)
The description of type string-char is changed to :
Type standard-char is a subtype of @B[internal-thin-char].
@B[internal-thin-char] is a subtype of string-char. string-char is a
subtype of character.
and the following is added :
Type @B[internal-thin-string] is a subtype of vector because @B[internal-thin-string] means
(vector internal-thin-char).
The description of type string is changed to :
Type string is a subtype of vector because string means (or
(vector string-char) internal-thin-string). Types (vector
string-char) and @B[internal-thin-string] are either disjoint or identical.
The description of types simple-vector, simple-string ... is changed to :
Types simple-vector, simple-string and simple-bit-vector are disjoint subtypes of
simple-array because they mean (simple-array t (*)),
(or (simple-array string-char (*)) (simple-array internal-thin-char (*))) and
(simple-array bit (*)), respectively.
and the following is added :
Type simple-internal-thin-string means (simple-array
internal-thin-char (*)) and is a subtype of @B[internal-thin-string].
Types (simple-array string-char (*)) and simple-internal-thin-string are either disjoint or
identical.
@b(4 Type Specifiers)
@b(4.1 Type Specifier Symbols)
Add the following to the system-defined type specifiers :
simple-internal-thin-string
internal-thin-string
internal-thin-char
@b(4.5 Type Specifiers That Specialize)
"The specialized types (vector string-char) ... data types."
is changed to :
"The specialized types (vector internal-thin-char), (vector
string-char) and (vector bit) are so useful that they have the
special names string and bit-vector. Every implementation of Common
Lisp must provide distinct representations for string and bit-vector
as distinct specialized data types."
@begin(equation)
@b(13 Characters)
@b(13.1 Character Attributes)
char-code-limit@>[constant]
char-font-limit@>[constant]
char-bits-limit@>[constant]
@b(13.2 Predicates on Characters)
standard-char-p char@>[function]
graphic-char-p char@>[function]
@begin(quotation)
The description "graphic characters of font 0 are all of the same width when printed" in
CLtL is changed to "standard characters of font 0, other than #\Newline, are all of the same
width when printed".
@end(quotation)
string-char-p char @>[function]
internal-thin-char-p char@>[function]
@begin(quotation)
This function must be added.
The argument char must be a character object. internal-thin-char-p
is true if char can be stored into an internal-thin-string, and
otherwise is false.
@end(quotation)
alpha-char-p char@>[function]
upper-case-p char@>[function]
lower-case-p char@>[function]
both-case-p char@>[function]
"If a character is either ... alphabetic."
is changed to :
"If a character is either uppercase or lowercase, it is necessarily a character
for which alpha-char-p returns true."
digit-char-p char &optional (radix 10)@>[function]
alphanumericp char@>[function]
char= character &rest more-characters@>[function]
char/= character &rest more-characters@>[function]
char< character &rest more-characters@>[function]
char> character &rest more-characters@>[function]
char<= character &rest more-characters@>[function]
char>= character &rest more-characters@>[function]
char-equal character &rest more-characters@>[function]
char-not-equal character &rest more-characters@>[function]
char-lessp character &rest more-characters@>[function]
char-greaterp character &rest more-characters@>[function]
char-not-greaterp character &rest more-characters@>[function]
char-not-lessp character &rest more-characters@>[function]
@b(13.3 Character Construction and Selection)
char-code char@>[function]
char-bits char@>[function]
char-font char@>[function]
code-char code &optional (bits 0) (font 0)@>[function]
make-char char &optional (bits 0) (font 0)@>[function]
@b(13.4 Character Conversion)
character object@>[function]
char-upcase char@>[function]
char-downcase char@>[function]
digit-char weight &optional (radix 10) (font 0)@>[function]
char-int char@>[function]
int-char integer@>[function]
char-name char@>[function]
name-char name@>[function]
@b(13.5 Character control-bit functions)
char-control-bit@>[constant]
char-meta-bit@>[constant]
char-super-bit@>[constant]
char-hyper-bit@>[constant]
char-bit char name@>[function]
set-char-bit char name newvalue@>[function]
@b(14 Sequences)
@b(14.1 Simple sequence functions)
elt sequence index@>[Function]
subseq sequence start &optional end@>[Function]
copy-seq sequence@>[Function]
length sequence@>[Function]
reverse sequence@>[Function]
nreverse sequence@>[Function]
make-sequence type size &key :initial-element@>[Function]
@b(14.2 Concatenating, Mapping, and Reducing Sequences)
concatenate result-type &rest sequences@>[Function]
map result-type function sequence &rest more-sequences@>[Function]
some predicate sequence &rest more-sequences@>[Function]
every predicate sequence &rest more-sequences@>[Function]
notany predicate sequence &rest more-sequences@>[Function]
notevery predicate sequence &rest more-sequences@>[Function]
reduce function sequence@>[Function]
&key :from-end :start :end :initial-value
@b(14.3 Modifying Sequences)
fill sequence item &key :start :end@>[Function]
replace sequence1 sequence2 &key :start1 :end1 :start2 :end2@>[Function]
remove item sequence@>[Function]
&key :from-end :test :test-not
:start :end :count :key
remove-if test sequence@>[Function]
&key :from-end :start
:end :count :key
remove-if-not test sequence@>[Function]
&key :from-end :start
:end :count :key
delete item sequence@>[Function]
&key :from-end :test :test-not
:start :end :count :key
delete-if test sequence@>[Function]
&key :from-end :start
:end :count :key
delete-if-not test sequence@>[Function]
&key :from-end :start
:end :count :key
remove-duplicates sequence@>[Function]
&key :from-end :test :test-not
:start :end :key
delete-duplicates sequence@>[Function]
&key :from-end :test :test-not
:start :end :key
substitute newitem olditem sequence@>[Function]
&key :from-end :test :test-not
:start :end :count :key
substitute-if newitem test sequence@>[Function]
&key :from-end :start :end :count :key
substitute-if-not newitem test sequence@>[Function]
&key :from-end :start :end :count :key
nsubstitute newitem olditem sequence@>[Function]
&key :from-end :test :test-not
:start :end :count :key
nsubstitute-if newitem test sequence@>[Function]
&key :from-end :start :end :count :key
nsubstitute-if-not newitem test sequence@>[Function]
&key :from-end :start :end :count :key
@b(14.4 Search)
find item sequence @>[Function]
&key :from-end :test :test-not
:start :end :key
find-if test sequence @>[Function]
&key :from-end :start :end :key
find-if-not test sequence@>[Function]
&key :from-end :start :end :key
position item sequence@>[Function]
&key :from-end :test :test-not
:start :end :key
position-if test sequence@>[Function]
&key :from-end :start :end :key
position-if-not test sequence@>[Function]
&key :from-end :start :end :key
count item sequence@>[Function]
&key :from-end :test :test-not
:start :end :key
count-if test sequence@>[Function]
&key :from-end :start :end :key
count-if-not test sequence@>[Function]
&key :from-end :start :end :key
mismatch sequence1 sequence2@>[Function]
&key :from-end :test :test-not
:key :start1 :start2
:end1 :end2
search sequence1 sequence2@>[Function]
&key :from-end :test :test-not
:key :start1 :start2
:end1 :end2
@b(14.5 Sort,merge)
sort sequence predicate &key :key@>[Function]
stable-sort sequence predicate &key :key@>[Function]
merge result-type sequence1 sequence2 predicate &key :key@>[Function]
@b(18 Strings)
"the type string is identical ... (array string-char (*))."
is changed to :
"the type string is identical to the type
(or (vector internal-thin-char) (vector string-char)),
which in turn is the same as (or (array internal-thin-char (*))
(array string-char (*)))."
@b(18.3 String Construction and Manipulation)
make-string size &key :initial-element@>[function]
@begin(quotation)
Add the following :
To make an internal-thin-string, you should use make-array or make-sequence.
@end(quotation)
@b(22 Input/Output)
@b(22.2 Input Functions)
@b(22.2.1 Input from Character Streams)
Add the following :
*read-default-string-type*@>[variable]
@begin(quotation)
The value is string or internal-thin-string. It determines whether the strings
produced by the function read are of type string or of type internal-thin-string.
@end(quotation)
@b(22.3 Output Functions)
@b(22.3.1 Output to Character Streams)
@begin(quotation)
Add the following :
@end(quotation)
write-width object@>[function]
&key :unit-type :stream :escape :radix :base
:circle :pretty :level :length :case :gensym :array
@begin(quotation)
This function returns the printed width, measured in the unit
specified by :unit-type, that the printed representation of
object occupies when written to the output stream specified by :stream. It
returns nil when object includes control characters
(#\Newline, #\Tab, etc.). The default of :unit-type is byte. The
values which can be specified for :unit-type depend on the implementation.
@end(quotation)
@end(equation)
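The behaviour described for write-width can be sketched in Python as below. This is a simplified model under stated assumptions: it takes an already printed representation as a string rather than an arbitrary object, supports only the byte unit, and assumes that characters with codes below 256 print 1 unit wide and all others 2 units wide.

```python
# Hypothetical model of write-width for the default unit "byte".

def write_width(text: str, unit_type: str = "byte"):
    """Return the print width of text, or None if it has control characters."""
    if unit_type != "byte":
        raise ValueError("this sketch only supports unit_type='byte'")
    width = 0
    for char in text:
        code = ord(char)
        if code < 32 or code == 127:
            return None  # models the nil result for #\Newline, #\Tab, etc.
        width += 1 if code < 256 else 2
    return width


print(write_width("abc"))       # 3
print(write_width("a\u6f22"))   # 3
print(write_width("a\tb"))      # None
```

Returning no width for control characters reflects that their printed width depends on the output device and the current column.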
@newpage
@Heading[Appendix Proposed Japanese character processing facilities for Common Lisp.]
In addition to the modifications to CLtL, here are some suggestions for systems
that include Japanese characters. They cover:
1). How the system should behave for Japanese characters, both
under the unmodified parts of CLtL and under the parts changed for multi-byte
processing.
2). Functions that are specific to Japanese and not at all related
to multi-byte processing.
Notes: All Japanese characters are constituent characters. JIS is an abbreviation of Japanese Industrial
Standard.
@begin(equation)
@b(13. Characters)
@b(13.1. Character Attributes)
char-code-limit@>[Constant]
@begin(quotation)
The value of char-code-limit should be large enough to include Japanese characters,
e.g. 65536.
@end(quotation)
@b(13.2. Predicates on Characters)
standard-char-p char @>[Function]
@begin(quotation)
Return nil for all Japanese characters.
@end(quotation)
graphic-char-p char @>[Function]
@begin(quotation)
Return t for Japanese characters.
@end(quotation)
internal-thin-char-p char @>[Function]
@begin(quotation)
The result depends on the implementation, according to whether the Japanese character
can be stored in an internal-thin-string or not.
@end(quotation)
alpha-char-p char @>[Function]
@begin(quotation)
Return nil for all Japanese characters except the alphabetic characters in the Japanese
character set. It depends on each implementation whether to return t or nil for those alphabetic characters.
@end(quotation)
@newpage
jis-char-p char@>[Function]
@begin(quotation)
The argument char has to be a character type object. jis-char-p is true if the
argument is included in JIS C-6226, and otherwise false.
@end(quotation)
japanese-char-p char@>[Function]
@begin(quotation)
The argument char has to be a character type object. japanese-char-p is true if the
argument is a Japanese character and is otherwise false. All characters that satisfy
jis-char-p must satisfy japanese-char-p; other characters might.
@end(quotation)
kanji-char-p char@>[Function]
@begin(quotation)
The argument char has to be a character type object. kanji-char-p is true if the
argument is one of the 6353 Kanji characters in JIS C6226(3.1.8), the repeat symbol,
the kanji numeric zero, or the "same as above" symbol, for a total of 6356 characters
that also satisfy jis-char-p.
@end(quotation)
hiragana-char-p char@>[Function]
@begin(quotation)
The argument char has to be a character type object.
hiragana-char-p is true if the argument is one of the 83
hiragana characters in JIS C6226(3.1.4), the hiragana repeat
symbol, or dakuten, for a total of 85 characters that also
satisfy jis-char-p.
@end(quotation)
katakana-char-p char@>[Function]
@begin(quotation)
The argument char has to be a character type object.
katakana-char-p is true if the argument is one of the 86
katakana characters in JIS C6226(3.1.5), long-sound-symbol,
katakana-repeat symbol, or katakana-dakuten, for a total of 89
characters that also satisfy jis-char-p.
@end(quotation)
kana-char-p char@>[Function]
@begin(quotation)
Equivalent to (or (hiragana-char-p char) (katakana-char-p char))
@end(quotation)
upper-case-p char@>[Function]
lower-case-p char@>[Function]
both-case-p char@>[Function]
@begin(quotation)
These are nil if the argument does not satisfy alpha-char-p.
Japanese characters which satisfy alpha-char-p should be treated
as normal alphabetic characters.
@end(quotation)
@newpage
digit-char-p char &optional (radix 10)@>[Function]
@begin(quotation)
digit-char-p is nil if the argument is a Japanese character.
@end(quotation)
alphanumericp char@>[Function]
@begin(quotation)
Equivalent to (or (alpha-char-p char) (not (null (digit-char-p char))))
@end(quotation)
char= character &rest more-characters@>[Function]
char/= character &rest more-characters@>[Function]
char< character &rest more-characters@>[Function]
char> character &rest more-characters@>[Function]
char<= character &rest more-characters@>[Function]
char>= character &rest more-characters@>[Function]
@begin(quotation)
The ordering of hiragana, katakana, kanji follows the JIS ordering.
@end(quotation)
@b(13.4. Character Conversions)
char-upcase char@>[Function]
char-downcase char@>[Function]
@begin(quotation)
These return the argument unchanged if it does not satisfy
alpha-char-p. It depends on the implementation whether these
work on the alphabetic characters included in JIS or not, but the behavior should be
consistent with upper-case-p, lower-case-p, and both-case-p.
@end(quotation)
@end(equation)
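The kana predicates above can be modelled in Python as a rough sketch. Unicode code points are used here as a stand-in for JIS C 6226 codes, which is an assumption of the sketch and not part of the proposal. Reading the "repeat symbol" and "dakuten" entries as the plain and voiced iteration marks is likewise an interpretation.

```python
# Hypothetical model using Unicode code points in place of JIS codes.
HIRAGANA_LETTERS = range(0x3041, 0x3094)    # 83 letters
HIRAGANA_MARKS = {0x309D, 0x309E}           # plain and voiced iteration marks
KATAKANA_LETTERS = range(0x30A1, 0x30F7)    # 86 letters
KATAKANA_MARKS = {0x30FC, 0x30FD, 0x30FE}   # long-sound mark, iteration marks


def hiragana_char_p(char: str) -> bool:
    code = ord(char)
    return code in HIRAGANA_LETTERS or code in HIRAGANA_MARKS


def katakana_char_p(char: str) -> bool:
    code = ord(char)
    return code in KATAKANA_LETTERS or code in KATAKANA_MARKS


def kana_char_p(char: str) -> bool:
    # Mirrors (or (hiragana-char-p char) (katakana-char-p char)).
    return hiragana_char_p(char) or katakana_char_p(char)


candidates = [chr(code) for code in range(0x3000, 0x3100)]
print(sum(hiragana_char_p(c) for c in candidates))  # 85
print(sum(katakana_char_p(c) for c in candidates))  # 89
```

The counts match the totals of 85 and 89 given in the descriptions of hiragana-char-p and katakana-char-p.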