===============
Unicode support
===============

Overview
========

Ecmascript E5 requires quite extensive Unicode support, which is difficult to
implement in a very compact fashion. The subsections below discuss Unicode
handling in various parts of the E5 standard. Below, the terms "character"
and "codepoint" are used interchangeably.

The general principles for implementation are:

* Operations on ASCII characters and ASCII strings should have a fast path
  which avoids expensive scanning of conversion tables, etc.

* Simple run-time operations on non-ASCII characters, such as string
  concatenation and character lookups, should be reasonably fast (e.g.,
  avoid a scan of Unicode character information ranges).

* Complex run-time operations on non-ASCII characters, such as case
  conversion, can have a performance penalty in exchange for small size.

* Compile-time operations on non-ASCII characters can have a performance
  penalty in exchange for small size.

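The ASCII fast path principle can be illustrated with a minimal Python sketch.
This is not the Duktape implementation; the function name ``to_upper_fast`` is
hypothetical, and the slow path simply delegates to Python's own ``str.upper``
as a stand-in for a table-driven conversion.

.. code-block:: python

    def to_upper_fast(s, slow_path=str.upper):
        """Uppercase s, avoiding any table lookup when s is pure ASCII."""
        if s.isascii():
            # Fast path: only 'a'..'z' change, each shifted down by 0x20.
            return "".join(
                chr(ord(c) - 0x20) if "a" <= c <= "z" else c for c in s
            )
        # Slow path: full Unicode rules (stand-in for a rule table scan).
        return slow_path(s)

The check is done once per string here; a per-character check would work
equally well and may suit mixed strings better.
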
Handling Unicode case conversion, character classes, etc. in a compact code
size is a bit challenging. The current solution is to fast path ASCII
characters and to use a bit-packed format for encoding case conversion
rules (e.g. range mappings). The rules are created by build-time Python
scripts (see ``src/`` directory) and decoded by run-time code such as the
parser with the help of ``duk_bitdecoder_ctx`` and ``duk_bd_decode()``.

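The general idea of a bit-packed stream can be sketched as follows. The exact
bit layout used by ``duk_bd_decode()`` is not described in this document, so
the MSB-first bit order and the class names ``BitEncoder`` and ``BitDecoder``
are assumptions made purely for illustration.

.. code-block:: python

    class BitEncoder:
        """Build-time side: append fixed-width fields, MSB first (assumed)."""

        def __init__(self):
            self._bits = []

        def bits(self, value, nbits):
            for shift in range(nbits - 1, -1, -1):
                self._bits.append((value >> shift) & 1)

        def to_bytes(self):
            # Pad with zero bits up to a whole number of bytes.
            padded = self._bits + [0] * (-len(self._bits) % 8)
            out = bytearray()
            for i in range(0, len(padded), 8):
                byte = 0
                for bit in padded[i:i + 8]:
                    byte = (byte << 1) | bit
                out.append(byte)
            return bytes(out)


    class BitDecoder:
        """Run-time side: read fixed-width fields back in the same order."""

        def __init__(self, data):
            self._data = data
            self._pos = 0  # current offset in bits

        def decode(self, nbits):
            value = 0
            for _ in range(nbits):
                byte = self._data[self._pos >> 3]
                bit = (byte >> (7 - (self._pos & 7))) & 1
                value = (value << 1) | bit
                self._pos += 1
            return value

Fields do not need to be byte aligned, which is where the size saving comes
from: a 3-bit "skip" value followed by a 16-bit codepoint takes 19 bits
rather than three or four whole bytes.
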
.. note:: There are many Unicode specifications, and I'm not sure
   which ones apply to E5. For instance, which specification governs the
   'end of word' behavior for the Final_Sigma context? Is it #29 or
   something else?

Useful background information:

* Unicode Standard Annex #44: UNICODE CHARACTER DATABASE:
  http://unicode.org/reports/tr44/#Casemapping

* CLDR - Unicode Common Locale Data Repository:
  http://cldr.unicode.org/

* Unicode Technical Standard #35: UNICODE LOCALE DATA MARKUP LANGUAGE (LDML):
  http://www.unicode.org/reports/tr35/tr35-21.html

* Unicode Standard Annex #29: UNICODE TEXT SEGMENTATION:
  http://www.unicode.org/reports/tr29/

Unicode data:

* http://unicode.org/Public/UNIDATA/

* UnicodeData.txt: http://unicode.org/Public/UNIDATA/UnicodeData.txt

* SpecialCasing.txt: http://unicode.org/Public/UNIDATA/SpecialCasing.txt

Source text
===========

The ``IdentifierStart`` and ``IdentifierPart`` codepoint sets are rather
complex. They are currently encoded into about 1.5 kilobytes of bit-packed
match data by ``extract_chars.py``.

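For reference, the two sets as defined in E5 Section 7.6 can be expressed in
terms of Unicode general categories. The sketch below uses Python's
``unicodedata`` module rather than bit-packed match data, so it reflects
whatever Unicode version the Python runtime ships with; the function names
are hypothetical. Escape sequences (``\uXXXX``) in identifiers are not handled.

.. code-block:: python

    import unicodedata

    # UnicodeLetter in E5 Section 7.6: categories Lu, Ll, Lt, Lm, Lo, Nl.
    _LETTER = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
    # Additional categories for IdentifierPart: combining marks (Mn, Mc),
    # digits (Nd) and connector punctuation (Pc).
    _PART_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


    def is_identifier_start(ch):
        if ch < "\x80":
            # ASCII fast path: no category lookup needed.
            return ch.isalpha() or ch in "$_"
        return unicodedata.category(ch) in _LETTER


    def is_identifier_part(ch):
        if ch < "\x80":
            return ch.isalnum() or ch in "$_"
        if ch in "\u200c\u200d":  # ZWNJ and ZWJ
            return True
        return unicodedata.category(ch) in (_LETTER | _PART_EXTRA)
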
Regular expression
==================

The ``Canonicalize()`` abstract operation described in E5 Section 15.10.2.8
shares the case conversion of ``String.prototype.toUpperCase()`` with a few
exceptions. The conversion tables can be shared so no additional tables are
needed.

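The exceptions are easiest to see in code. The following sketch follows the
steps of E5 Section 15.10.2.8, using Python's ``str.upper()`` as a stand-in
for ``String.prototype.toUpperCase()``; that substitution is an assumption,
and the sketch operates on Python characters rather than 16-bit code units.

.. code-block:: python

    def canonicalize(ch, ignore_case=True):
        """Regexp Canonicalize() for a single character (sketch)."""
        if not ignore_case:
            return ch
        upper = ch.upper()
        if len(upper) != 1:
            # 1:n conversions (e.g. sharp s -> "SS") are not used.
            return ch
        if ord(ch) >= 128 and ord(upper) < 128:
            # A non-ASCII character must never canonicalize into ASCII.
            return ch
        return upper

For example, U+0131 (dotless i) uppercases to ASCII ``I``, so it is left
unchanged and does not match ``I`` under the ignore-case flag.
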
String case conversion
======================

Ecmascript E5 requires case conversion for 16-bit Unicode characters with the
``String.prototype`` functions ``toLowerCase()``, ``toLocaleLowerCase()``,
``toUpperCase()``, and ``toLocaleUpperCase()``, see E5 Sections 15.5.4.16 to
15.5.4.19. Titlecase conversion is not required by Ecmascript E5. The regular
expression abstract operation ``Canonicalize()`` also borrows the case
conversion rules (though only for 1:1 conversions), see E5 Section 15.10.2.8.

Unicode data files describe case conversion rules in two parts:

1. ``UnicodeData.txt`` describes simple 1:1 mappings for lowercase, uppercase,
   and titlecase. The titlecase mapping, if missing, defaults to uppercase
   mapping.

2. ``SpecialCasing.txt`` describes complex 1:many mappings for case conversion,
   which are also required by E5. These mappings may be locale sensitive (e.g.
   apply only to a certain language) and/or context sensitive (e.g. apply only
   if a character is preceded or followed by certain codepoints).

UnicodeData.txt lists all Unicode codepoints and optionally gives case
conversion rules for each. Titlecase conversion defaults to uppercase
conversion, and if no conversion is given, the character is assumed to remain
the same unless SpecialCasing.txt has an overriding rule. The actual case
conversion rules are not random: in many cases a contiguous range of
codepoints is shifted to another position in the codepoint space. The ranges
may be fully contiguous or have a "skip", e.g. apply to every other character.

SpecialCasing.txt provides additional rules particularly for handling cases
where the case conversion is not 1:1. For instance, "ß" converted to
uppercase is "SS". There are slightly over 100 such rules, almost entirely
for uppercase and titlecase conversion. The special casing rules can convert
an input codepoint into 1-3 result codepoints (the ligature U+FB03 uppercases
to "FFI", for instance). Some special casing rules are context and/or locale
sensitive. *Context sensitivity* means that a rule only applies when a
codepoint is (or is not) surrounded by certain other codepoints, which means
that characters cannot be case converted individually. *Locale sensitivity*
means that a rule might only apply for a certain language.

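Both files are semicolon-separated text, so reading them is straightforward.
The sketch below is not ``extract_caseconv.py``; the function names are
hypothetical. It assumes the standard field layout: in UnicodeData.txt fields
12, 13 and 14 (counting from zero) hold the simple uppercase, lowercase and
titlecase mappings, and in SpecialCasing.txt the order is code, lower, title,
upper, followed by an optional condition list.

.. code-block:: python

    def parse_unicode_data_line(line):
        """Return (codepoint, upper, lower, title) with defaults applied."""
        fields = line.strip().split(";")
        cp = int(fields[0], 16)
        upper = int(fields[12], 16) if fields[12] else cp
        lower = int(fields[13], 16) if fields[13] else cp
        # A missing titlecase mapping defaults to the uppercase mapping.
        title = int(fields[14], 16) if fields[14] else upper
        return cp, upper, lower, title


    def _hex_sequence(text):
        return [int(part, 16) for part in text.split()]


    def parse_special_casing_line(line):
        """Return (codepoint, lower, title, upper, conditions) or None."""
        body = line.split("#", 1)[0].strip()  # drop trailing comment
        if not body:
            return None  # blank or comment-only line
        fields = [part.strip() for part in body.split(";")]
        conditions = fields[4].split() if len(fields) > 4 else []
        return (
            int(fields[0], 16),
            _hex_sequence(fields[1]),
            _hex_sequence(fields[2]),
            _hex_sequence(fields[3]),
            conditions,
        )

A rule with an empty condition list applies unconditionally; entries such as
``lt`` or ``Final_Sigma`` mark locale and context sensitive rules.
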
The Python script ``extract_caseconv.py`` reads in UnicodeData.txt and
SpecialCasing.txt, extracts the appropriate case conversion rules, scans the
conversion rules to generate a compact rules database (really just a list of
rules), and encodes the rules into a bit-packed format. The bit-packed rule
format has been developed experimentally to minimize data and code space, by
looking at the case conversion data and first detecting simple rules (ranges
which are either contiguous or have a certain "skip"), and then looking at
what remains.

Currently the encoded format consists of three parts:

1. range mappings with a "skip" of 1...6;

2. simple 1:1 character mappings which are not covered by the range rules;

3. complex 1:n character mappings.

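How range rules with a "skip" might be detected can be sketched as below.
This is a simple greedy pass written for illustration, not the algorithm in
``extract_caseconv.py``; the names ``extract_ranges`` and ``convert`` and the
``min_count`` threshold are assumptions. A range is a tuple
``(start, count, skip, delta)`` meaning that ``start``, ``start + skip``, ...
(``count`` codepoints in total) each map to themselves plus ``delta``.

.. code-block:: python

    def extract_ranges(mapping, max_skip=6, min_count=3):
        """Split a 1:1 mapping into range rules and leftover single rules."""
        remaining = dict(mapping)
        ranges = []
        for skip in range(1, max_skip + 1):
            for start in sorted(remaining):
                if start not in remaining:
                    continue  # already consumed by an earlier range
                delta = remaining[start] - start
                count = 1
                while (remaining.get(start + count * skip)
                       == start + count * skip + delta):
                    count += 1
                if count >= min_count:
                    ranges.append((start, count, skip, delta))
                    for i in range(count):
                        del remaining[start + i * skip]
        return ranges, remaining


    def convert(cp, ranges, singles):
        """Apply range rules first, then single rules, else identity."""
        for start, count, skip, delta in ranges:
            offset = cp - start
            if 0 <= offset < count * skip and offset % skip == 0:
                return cp + delta
        return singles.get(cp, cp)

For instance, ``a``..``z`` to ``A``..``Z`` becomes one rule with skip 1, while
alternating pairs such as U+0100/U+0101, U+0102/U+0103 become a rule with
skip 2. Whatever does not fit a range stays in the 1:1 list.
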
There's probably some room for improvement in optimizing the encoding further;
currently it takes almost 2 KiB for uppercase and lowercase rules combined.

.. note:: Context or locale specific rules are currently not processed. This
   violates E5 requirements for both context and locale support.

See also:

* http://www.unicode.org/faq/casemap_charprop.html

* ``src/CaseConversion.java``, which allows easy testing of what Java does

Context and locale sensitive rules
==================================

The following context and locale sensitive rules exist in SpecialCasing.txt
with md5sum of 5cea3d079e2b6c6c3babb0726e47e1db.

Useful background:

* Unicode Standard Annex #44: UNICODE CHARACTER DATABASE, Section 5.6:
  http://unicode.org/reports/tr44/#Casemapping

  - Clarifies that contexts are not formal character properties

* CLDR - Unicode Common Locale Data Repository: http://cldr.unicode.org/

* http://unicode.org/reports/tr44/#General_Category_Values

Final sigma (all languages)
---------------------------

::

  # Special case for final form of sigma

  03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA

The lowercase conversion of U+03A3: GREEK CAPITAL LETTER SIGMA depends
on context as follows:

* Final_Sigma: lowercase is U+03C2: GREEK SMALL LETTER FINAL SIGMA

* Otherwise: lowercase is U+03C3: GREEK SMALL LETTER SIGMA

Other conversions (uppercase or titlecase conversions, or lowercase
conversions of other sigma characters) are not context sensitive.
In particular, codepoints U+03C2 and U+03C3 lowercase to themselves.

.. note:: What is the formal definition of a "final sigma" context?
   Based on the "Unicode demystified" link below, let p = previous
   codepoint (if exists) and n = next codepoint; final_sigma = (p exists)
   and (p is a letter) and (n exists) and (n is not a letter), for
   some meaning of a "letter"?

See also:

* http://unicode.org/faq/greek.html#5
* http://en.wikipedia.org/wiki/Sigma
* http://www.unicode.org/reports/tr29/#Word_Boundaries
* http://books.google.fi/books?id=wn5sXG8bEAcC&pg=PA169&lpg=PA169&dq=%22Final_Sigma%22&source=bl&ots=J07ysYPbVD&sig=tGhPz1VFpi-KE1InQPsjX2diVlg&hl=fi&ei=XHswTqmrA4aSOrSf3X4&sa=X&oi=book_result&ct=result&resnum=5&ved=0CDYQ6AEwBA#v=onepage&q=%22Final_Sigma%22&f=false

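A simplified reading of the final sigma rule can be sketched as follows. The
Unicode Standard phrases the Final_Sigma context in terms of cased and
case-ignorable characters; this sketch deliberately uses the cruder "letter"
test from the note above (general category starting with ``L``) and only
looks at the immediately adjacent codepoints. It also treats the end of the
string as "not followed by a letter", which is an assumption. The function
name is hypothetical.

.. code-block:: python

    import unicodedata

    SIGMA_CAPITAL = "\u03a3"  # GREEK CAPITAL LETTER SIGMA
    SIGMA_FINAL = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA
    SIGMA_SMALL = "\u03c3"    # GREEK SMALL LETTER SIGMA


    def _is_letter(ch):
        return unicodedata.category(ch).startswith("L")


    def lowercase_with_final_sigma(s):
        """Lowercase s, choosing the sigma form from adjacent codepoints."""
        out = []
        for i, ch in enumerate(s):
            if ch != SIGMA_CAPITAL:
                out.append(ch.lower())
                continue
            prev_is_letter = i > 0 and _is_letter(s[i - 1])
            next_is_letter = i + 1 < len(s) and _is_letter(s[i + 1])
            if prev_is_letter and not next_is_letter:
                out.append(SIGMA_FINAL)
            else:
                out.append(SIGMA_SMALL)
        return "".join(out)

The point of the sketch is that the sigma cannot be converted in isolation:
the conversion loop needs access to the neighboring codepoints.
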
Lithuanian (lt)
---------------

::

  # Lithuanian retains the dot in a lowercase i when followed by accents.

  # Remove DOT ABOVE after "i" with upper or titlecase

  0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE

::

  # Introduce an explicit dot above when lowercasing capital I's and J's
  # whenever there are more accents above.
  # (of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)

  0049; 0069 0307; 0049; 0049; lt More_Above; # LATIN CAPITAL LETTER I
  004A; 006A 0307; 004A; 004A; lt More_Above; # LATIN CAPITAL LETTER J
  012E; 012F 0307; 012E; 012E; lt More_Above; # LATIN CAPITAL LETTER I WITH OGONEK
  00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE
  00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE
  0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE

Turkish and Azeri (tr and az)
-----------------------------

::

  # I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
  # The following rules handle those cases.

  0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
  0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

::

  # When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
  # This matches the behavior of the canonically equivalent I-dot_above

  0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
  0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

::

  # When lowercasing, unless an I is before a dot_above, it turns into a dotless i.

  0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
  0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I

::

  # When uppercasing, i turns into a dotted capital I

  0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
  0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

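Taken together, the four rule groups above amount to the following behavior
for the ``tr`` and ``az`` locales. This is a minimal sketch with hypothetical
function names; it only recognizes a COMBINING DOT ABOVE that immediately
follows the ``I``, whereas the ``After_I`` and ``Not_Before_Dot`` contexts, as
far as I understand them, also allow certain intervening combining marks.
All other characters fall back to Python's locale-independent conversion.

.. code-block:: python

    DOT_ABOVE = "\u0307"       # COMBINING DOT ABOVE
    CAPITAL_I_DOT = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
    SMALL_DOTLESS_I = "\u0131" # LATIN SMALL LETTER DOTLESS I


    def lower_tr(s):
        """Lowercase s using the Turkish/Azeri rules for I (sketch)."""
        out = []
        i = 0
        while i < len(s):
            ch = s[i]
            if ch == CAPITAL_I_DOT:
                out.append("i")
            elif ch == "I":
                if i + 1 < len(s) and s[i + 1] == DOT_ABOVE:
                    # I + dot_above -> i, and the dot is removed (After_I).
                    out.append("i")
                    i += 1
                else:
                    # Not_Before_Dot: plain I becomes dotless i.
                    out.append(SMALL_DOTLESS_I)
            else:
                out.append(ch.lower())
            i += 1
        return "".join(out)


    def upper_tr(s):
        """Uppercase s using the Turkish/Azeri rule for i (sketch)."""
        return "".join(CAPITAL_I_DOT if ch == "i" else ch.upper() for ch in s)
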
Various 'i' characters
----------------------

Case conversion rules for various 'i' characters are particularly fun.
There are four separate 'i'-characters:

* U+0049: LATIN CAPITAL LETTER I
* U+0069: LATIN SMALL LETTER I
* U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE
* U+0131: LATIN SMALL LETTER DOTLESS I

Case conversion rules for these characters are locale and context dependent
and differ from standard conversions at least for Lithuanian (lt), Turkish
(tr), and Azeri (az) as follows (ignoring context dependent rules):

+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| Input  | uc/lt    | uc/tr    | uc/az    | uc/other | lc/lt    | lc/tr    | lc/az    | lc/other |
+========+==========+==========+==========+==========+==========+==========+==========+==========+
| U+0049 |          |          |          |          |          |          |          |          |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0069 |          |          |          |          |          |          |          |          |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0130 |          |          |          |          |          |          |          |          |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0131 |          |          |          |          |          |          |          |          |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+

**FIXME: FILL**

Java behavior:

+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| Input  | uc/lt    | uc/tr    | uc/az    | uc/other | lc/lt    | lc/tr    | lc/az    | lc/other |
+========+==========+==========+==========+==========+==========+==========+==========+==========+
| U+0049 | U+0049   | U+0049   | U+0049   | U+0049   | U+0069   |**U+0131**|**U+0131**| U+0069   |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0069 | U+0049   |**U+0130**|**U+0130**| U+0049   | U+0069   | U+0069   | U+0069   | U+0069   |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0130 | U+0130   | U+0130   | U+0130   | U+0130   | U+0069   | U+0069   | U+0069   | U+0069   |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0131 | U+0049   | U+0049   | U+0049   | U+0049   | U+0131   | U+0131   | U+0131   | U+0131   |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+