mirror of https://github.com/svaarala/duktape.git
Sami Vaarala
11 years ago
===============
Unicode support
===============

Overview
========

ECMAScript E5 requires quite extensive Unicode support, which is difficult to
implement in a very compact fashion. The subsections below discuss Unicode
handling in various parts of the E5 standard. Below, the terms "character"
and "codepoint" are used interchangeably.

The general principles for implementation are:

* Operations on ASCII characters and ASCII strings should have a fast path
  which avoids expensive scanning of conversion tables etc.

* Simple run-time operations on non-ASCII characters, such as string
  concatenation and character lookups, should be reasonably fast (e.g.,
  avoid a scan of Unicode character information ranges).

* Complex run-time operations on non-ASCII characters, such as case
  conversion, can have a performance penalty in exchange for small size.

* Compile-time operations on non-ASCII characters can have a performance
  penalty in exchange for small size.

Handling Unicode case conversion, character classes, etc. in a compact code
size is a bit challenging. The current solution is to fast path ASCII
characters and to use a bit-packed format for encoding case conversion
rules (e.g. range mappings). The rules are created by build-time Python
scripts (see the ``src/`` directory) and decoded by run-time code such as the
parser with the help of ``duk_bitdecoder_ctx`` and ``duk_bd_decode()``.

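The fast path idea can be illustrated with a small sketch. This is not the
actual run-time code (which is C); ``slow_path_upper`` is a hypothetical
stand-in for the table-driven lookup and simply delegates to Python's own
Unicode tables here.

```python
def slow_path_upper(cp):
    """Hypothetical stand-in for the table-driven lookup (uses Python's tables)."""
    return [ord(c) for c in chr(cp).upper()]


def to_upper_codepoints(cp):
    """Uppercase one codepoint; ASCII is handled without any table scan."""
    if cp < 0x80:
        # ASCII fast path: only 'a'..'z' change, by a fixed offset.
        if 0x61 <= cp <= 0x7A:
            return [cp - 0x20]
        return [cp]
    # Non-ASCII: fall back to the (slower, compact) rule lookup.
    return slow_path_upper(cp)
```

The result is a list because a single input codepoint may expand to several
output codepoints.
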
.. note:: There are many Unicode specifications, and I'm not sure
   which ones apply to E5. For instance, which specification governs the
   'end of word' behavior for the Final_Sigma context? Is it #29 or
   something else?

Useful background information:

* Unicode Standard Annex #44: UNICODE CHARACTER DATABASE:
  http://unicode.org/reports/tr44/#Casemapping

* CLDR - Unicode Common Locale Data Repository:
  http://cldr.unicode.org/

* Unicode Technical Standard #35: UNICODE LOCALE DATA MARKUP LANGUAGE (LDML):
  http://www.unicode.org/reports/tr35/tr35-21.html

* Unicode Standard Annex #29: UNICODE TEXT SEGMENTATION:
  http://www.unicode.org/reports/tr29/

Unicode data:

* http://unicode.org/Public/UNIDATA/

* UnicodeData.txt: http://unicode.org/Public/UNIDATA/UnicodeData.txt

* SpecialCasing.txt: http://unicode.org/Public/UNIDATA/SpecialCasing.txt

Source text
===========

The ``IdentifierStart`` and ``IdentifierPart`` codepoint sets are rather
complex. They are currently encoded into about 1.5 kilobytes of bit-packed
match data by ``extract_chars.py``.

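For reference, the two sets can be described in terms of Unicode general
categories as listed in E5 Section 7.6. The sketch below is not
``extract_chars.py``; it uses Python's ``unicodedata`` module, whose Unicode
version may differ from the one E5 refers to, and the helper names are
illustrative only. Unicode escape sequences in identifiers are not handled.

```python
import unicodedata

# E5 Section 7.6: UnicodeLetter categories.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Additional categories allowed in IdentifierPart: combining marks,
# decimal digits, and connector punctuation.
PART_EXTRA_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def is_identifier_start(cp):
    ch = chr(cp)
    return ch in "$_" or unicodedata.category(ch) in LETTER_CATEGORIES


def is_identifier_part(cp):
    # U+200C (ZWNJ) and U+200D (ZWJ) are explicitly allowed.
    if is_identifier_start(cp) or cp in (0x200C, 0x200D):
        return True
    return unicodedata.category(chr(cp)) in PART_EXTRA_CATEGORIES


def to_ranges(codepoints):
    """Collapse codepoints into inclusive [first, last] ranges."""
    ranges = []
    for cp in sorted(codepoints):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp
        else:
            ranges.append([cp, cp])
    return ranges
```

Collapsing the matching codepoints into ranges, as ``to_ranges()`` does, is
one plausible first step before any bit packing.
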
Regular expression
==================

The ``Canonicalize()`` abstract operation described in E5 Section 15.10.2.8
shares the case conversion of ``String.prototype.toUpperCase()`` with a few
exceptions. The conversion tables can be shared so no additional tables are
needed.

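The exceptions are easiest to see in code. The following is a sketch of the
steps in E5 Section 15.10.2.8 for a single character, with Python's
``str.upper()`` standing in for ``String.prototype.toUpperCase()`` (an
assumption; the two may differ for some characters depending on Unicode
version).

```python
def canonicalize(ch, ignore_case=True):
    """Sketch of E5 Section 15.10.2.8 Canonicalize() for one character."""
    if not ignore_case:
        return ch
    cu = ch.upper()  # stand-in for String.prototype.toUpperCase()
    if len(cu) != 1:
        # 1:n results are rejected, e.g. U+00DF uppercases to "SS".
        return ch
    if ord(ch) >= 128 and ord(cu) < 128:
        # A non-ASCII character is never canonicalized to an ASCII one.
        return ch
    return cu
```

For example, U+0131 (dotless i) uppercases to ASCII "I", so it canonicalizes
to itself and does not match "I" or "i" in a case-insensitive regexp.
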
String case conversion
======================

ECMAScript E5 requires case conversion for 16-bit Unicode characters with the
``String.prototype`` functions ``toLowerCase()``, ``toLocaleLowerCase()``,
``toUpperCase()``, and ``toLocaleUpperCase()``, see E5 Sections 15.5.4.16 to
15.5.4.19. Titlecase conversion is not required by ECMAScript E5. The regular
expression abstract operation ``Canonicalize()`` also borrows the case
conversion rules (though only for 1:1 conversions), see E5 Section 15.10.2.8.

Unicode data files describe case conversion rules in two parts:

1. ``UnicodeData.txt`` describes simple 1:1 mappings for lowercase, uppercase,
   and titlecase. The titlecase mapping, if missing, defaults to uppercase
   mapping.

2. ``SpecialCasing.txt`` describes complex 1:many mappings for case conversion,
   which are also required by E5. These mappings may be locale sensitive (e.g.
   apply only to a certain language) and/or context sensitive (e.g. apply only
   if a character is preceded or followed by certain codepoints).

UnicodeData.txt lists all Unicode codepoints and optionally gives case
conversion rules for each. Titlecase conversion defaults to uppercase
conversion, and if no conversion is given, the character is assumed to remain
the same unless SpecialCasing.txt has an overriding rule. The actual case
conversion rules are not random, but in many cases continuous ranges are
shifted to another position in the codepoint space; the ranges may be fully
continuous or have a "skip", e.g. apply to every other character.

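As an illustration of the file format, the sketch below extracts the simple
mappings from one UnicodeData.txt line. The field positions (12 = uppercase,
13 = lowercase, 14 = titlecase, counting from zero) follow the Unicode
Character Database layout; the function name is illustrative and is not taken
from the build scripts.

```python
def parse_unicode_data_line(line):
    """Return (codepoint, upper, lower, title); None means "no mapping"."""
    fields = line.strip().split(";")
    cp = int(fields[0], 16)

    def mapping(field):
        return int(field, 16) if field else None

    upper = mapping(fields[12])
    lower = mapping(fields[13])
    title = mapping(fields[14])
    if title is None:
        title = upper  # titlecase defaults to the uppercase mapping
    return cp, upper, lower, title


small_a = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"
capital_a = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
print(parse_unicode_data_line(small_a))
print(parse_unicode_data_line(capital_a))
```
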
SpecialCasing.txt provides additional rules particularly for handling cases
where the case conversion is not 1:1. For instance, "ß" converted to
uppercase is "SS". There are slightly over 100 such rules, almost entirely
for uppercase and titlecase conversion. The special casing rules can convert
an input codepoint into 1-3 result codepoints (the ligature U+FB03 uppercases
to "FFI", for instance). Some special casing rules are context and/or locale
sensitive. *Context sensitivity* means that a rule only applies when a
codepoint is (or is not) surrounded by certain other codepoints, which means
that characters cannot be case converted individually. *Locale sensitivity*
means that a rule might only apply for a certain language.

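A SpecialCasing.txt line has the form ``code; lower; title; upper;
(conditions;) # comment``. The following sketch parses one such line; the
function name is illustrative and not taken from ``extract_caseconv.py``.

```python
def parse_special_casing_line(line):
    """Return (cp, lower, title, upper, conditions) or None for non-data lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None  # comment-only or empty line
    fields = [f.strip() for f in data.split(";")]
    cp = int(fields[0], 16)
    # Each mapping is a (possibly empty) space separated list of codepoints.
    lower, title, upper = (
        [int(x, 16) for x in f.split()] for f in fields[1:4]
    )
    # The optional fifth field holds locale and/or context conditions.
    conditions = fields[4].split() if len(fields) > 4 else []
    return cp, lower, title, upper, conditions


sharp_s = "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
turkish_i = "0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I"
print(parse_special_casing_line(sharp_s))
print(parse_special_casing_line(turkish_i))
```

Rules with a non-empty ``conditions`` list are the context and locale
sensitive ones.
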
The Python script ``extract_caseconv.py`` reads in UnicodeData.txt and
SpecialCasing.txt, extracts the appropriate case conversion rules, scans the
conversion rules to generate a compact rules database (really just a list of
rules), and encodes the rules into a bit-packed format. The bit-packed rule
format has been developed experimentally to minimize data and code space, by
looking at the case conversion data and first detecting simple rules (ranges
which are either continuous or have a certain "skip"), and then looking at
what remains.

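The "detect ranges first, then look at what remains" approach might look
roughly like the greedy sketch below. It is not the algorithm used by
``extract_caseconv.py``; the rule tuple layout ``(start, count, skip, delta)``
and the ``min_count`` threshold are assumptions made for illustration.

```python
def find_ranges(mapping, max_skip=6, min_count=3):
    """Greedily extract (start, count, skip, delta) rules from a 1:1 mapping.

    Returns (rules, leftover); leftover holds the mappings no rule covers.
    """
    remaining = dict(mapping)
    rules = []
    for skip in range(1, max_skip + 1):
        for start in sorted(remaining):
            if start not in remaining:
                continue  # already consumed by an earlier rule
            delta = remaining[start] - start
            count = 1
            # Extend the run while every skip-th codepoint shifts by delta.
            while remaining.get(start + count * skip) == start + count * skip + delta:
                count += 1
            if count >= min_count:
                rules.append((start, count, skip, delta))
                for i in range(count):
                    del remaining[start + i * skip]
    return rules, remaining
```

Given the mappings for ``a``-``z`` plus a few alternating Latin Extended-A
pairs, this yields one rule with a skip of 1 and one with a skip of 2, and
leaves isolated mappings for the 1:1 table.
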
Currently the encoded format consists of three parts:

1. range mappings with a "skip" of 1...6;

2. simple 1:1 character mappings which are not covered by the range rules;

3. complex 1:n character mappings.

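Ignoring the bit packing, a lookup over these three parts can be sketched as
follows. The data layout and the order in which the parts are consulted are
assumptions for illustration, not a description of the actual decoder.

```python
def convert(cp, ranges, singles, multis):
    """Apply range rules, then 1:1 mappings, then 1:n mappings, to one codepoint."""
    for start, count, skip, delta in ranges:
        offset = cp - start
        # The rule covers start, start + skip, ..., start + (count - 1) * skip.
        if 0 <= offset < count * skip and offset % skip == 0:
            return [cp + delta]
    if cp in singles:
        return [singles[cp]]
    if cp in multis:
        return list(multis[cp])
    return [cp]  # no rule: the character maps to itself


# Tiny uppercase rule set for demonstration only.
RANGES = [(0x61, 26, 1, -0x20), (0x101, 4, 2, -1)]
SINGLES = {0xFF: 0x178}
MULTIS = {0xDF: (0x53, 0x53)}
print(convert(0x62, RANGES, SINGLES, MULTIS))
```
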
There's probably some room for improvement in optimizing the encoding further;
currently it takes almost 2 KiB for uppercase and lowercase rules combined.

.. note:: Context or locale specific rules are not currently processed. This
   violates E5 requirements for both context and locale support.

See also:

* http://www.unicode.org/faq/casemap_charprop.html

* ``src/CaseConversion.java``, which allows easy testing of what Java does

Context and locale sensitive rules
==================================

The following context and locale sensitive rules exist in the version of
SpecialCasing.txt with an md5sum of 5cea3d079e2b6c6c3babb0726e47e1db.

Useful background:

* Unicode Standard Annex #44: UNICODE CHARACTER DATABASE, Section 5.6:
  http://unicode.org/reports/tr44/#Casemapping

  - Clarifies that contexts are not formal character properties

* CLDR - Unicode Common Locale Data Repository: http://cldr.unicode.org/

* http://unicode.org/reports/tr44/#General_Category_Values

Final sigma (all languages)
---------------------------

::

  # Special case for final form of sigma

  03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA

The lowercase conversion of U+03A3: GREEK CAPITAL LETTER SIGMA depends
on context as follows:

* Final_Sigma: lowercase is U+03C2: GREEK SMALL LETTER FINAL SIGMA

* Otherwise: lowercase is U+03C3: GREEK SMALL LETTER SIGMA

Other conversions (uppercase or titlecase conversions, or lowercase
conversions of other sigma characters) are not context sensitive.
In particular, codepoints U+03C2 and U+03C3 lowercase to themselves.

.. note:: What is the formal definition of a "final sigma" context?
   Based on the "Unicode demystified" link below, let p = previous
   codepoint (if exists) and n = next codepoint; final_sigma = (p exists)
   and (p is a letter) and (n exists) and (n is not a letter), for
   some meaning of a "letter"?

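The Unicode Standard (Section 3.13, Default Case Algorithms) defines the
Final_Sigma context roughly as: the sigma is preceded by a cased letter,
optionally followed by case-ignorable characters, and is not followed by
(optional case-ignorable characters and then) a cased letter. Under that
definition a sigma at the very end of the input also counts as final. The
sketch below follows that definition, but ``is_cased()`` and
``is_case_ignorable()`` are rough approximations of the Unicode properties
Cased and Case_Ignorable, chosen here for brevity.

```python
import unicodedata

SIGMA, SMALL_SIGMA, FINAL_SIGMA = "\u03a3", "\u03c3", "\u03c2"


def is_cased(ch):
    # Approximation: a character is "cased" if a case mapping changes it.
    return ch.lower() != ch or ch.upper() != ch


def is_case_ignorable(ch):
    # Approximation: marks, format characters, modifier letters/symbols,
    # and a few word-internal punctuation characters.
    return (unicodedata.category(ch) in {"Mn", "Me", "Cf", "Lm", "Sk"}
            or ch in "'.:\u00b7\u2019")


def is_final_sigma(text, i):
    j = i - 1
    while j >= 0 and is_case_ignorable(text[j]):
        j -= 1
    preceded = j >= 0 and is_cased(text[j])
    k = i + 1
    while k < len(text) and is_case_ignorable(text[k]):
        k += 1
    followed = k < len(text) and is_cased(text[k])
    return preceded and not followed


def lower_sigma(text):
    """Lowercase only U+03A3, choosing the final form by context."""
    return "".join(
        (FINAL_SIGMA if is_final_sigma(text, i) else SMALL_SIGMA)
        if ch == SIGMA else ch
        for i, ch in enumerate(text)
    )
```

Because the decision needs lookahead and lookbehind, this rule cannot be
applied one character at a time.
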
See also:

* http://unicode.org/faq/greek.html#5
* http://en.wikipedia.org/wiki/Sigma
* http://www.unicode.org/reports/tr29/#Word_Boundaries
* http://books.google.fi/books?id=wn5sXG8bEAcC&pg=PA169&lpg=PA169&dq=%22Final_Sigma%22&source=bl&ots=J07ysYPbVD&sig=tGhPz1VFpi-KE1InQPsjX2diVlg&hl=fi&ei=XHswTqmrA4aSOrSf3X4&sa=X&oi=book_result&ct=result&resnum=5&ved=0CDYQ6AEwBA#v=onepage&q=%22Final_Sigma%22&f=false

Lithuanian (lt)
---------------

::

  # Lithuanian retains the dot in a lowercase i when followed by accents.

  # Remove DOT ABOVE after "i" with upper or titlecase

  0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE

::

  # Introduce an explicit dot above when lowercasing capital I's and J's
  # whenever there are more accents above.
  # (of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)

  0049; 0069 0307; 0049; 0049; lt More_Above; # LATIN CAPITAL LETTER I
  004A; 006A 0307; 004A; 004A; lt More_Above; # LATIN CAPITAL LETTER J
  012E; 012F 0307; 012E; 012E; lt More_Above; # LATIN CAPITAL LETTER I WITH OGONEK
  00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE
  00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE
  0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE

Turkish and Azeri (tr and az)
-----------------------------

::

  # I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
  # The following rules handle those cases.

  0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
  0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

::

  # When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
  # This matches the behavior of the canonically equivalent I-dot_above

  0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
  0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

::

  # When lowercasing, unless an I is before a dot_above, it turns into a dotless i.

  0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
  0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I

::

  # When uppercasing, i turns into a dotted capital I

  0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
  0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

Various 'i' characters
----------------------

Case conversion rules for various 'i' characters are particularly fun.
There are four separate 'i'-characters:

* U+0049: LATIN CAPITAL LETTER I
* U+0069: LATIN SMALL LETTER I
* U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE
* U+0131: LATIN SMALL LETTER DOTLESS I

Case conversion rules for these characters are locale and context dependent and differ
from standard conversions at least for Lithuanian (lt), Turkish (tr), and Azeri (az)
as follows (ignoring context dependent rules):

+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| Input  | uc/lt    | uc/tr    | uc/az    | uc/other | lc/lt    | lc/tr    | lc/az    | lc/other |
+========+==========+==========+==========+==========+==========+==========+==========+==========+
| U+0049 |          |          |          |          |          |          |          |          |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0069 |          |          |          |          |          |          |          |          |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0130 |          |          |          |          |          |          |          |          |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0131 |          |          |          |          |          |          |          |          |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+

**FIXME: FILL**

Java behavior:

+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| Input  | uc/lt    | uc/tr    | uc/az    | uc/other | lc/lt    | lc/tr    | lc/az    | lc/other |
+========+==========+==========+==========+==========+==========+==========+==========+==========+
| U+0049 | U+0049   | U+0049   | U+0049   | U+0049   | U+0069   |**U+0131**|**U+0131**| U+0069   |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0069 | U+0049   |**U+0130**|**U+0130**| U+0049   | U+0069   | U+0069   | U+0069   | U+0069   |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0130 | U+0130   | U+0130   | U+0130   | U+0130   | U+0069   | U+0069   | U+0069   | U+0069   |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0131 | U+0049   | U+0049   | U+0049   | U+0049   | U+0131   | U+0131   | U+0131   | U+0131   |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+

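The Java behavior in the table above can be captured as a default mapping
plus a small set of locale overrides. The sketch below only encodes the
table cells; it ignores context dependent rules, and the names are
illustrative.

```python
# "other" columns of the Java table: locale independent behavior.
DEFAULT_UPPER = {0x0049: 0x0049, 0x0069: 0x0049, 0x0130: 0x0130, 0x0131: 0x0049}
DEFAULT_LOWER = {0x0049: 0x0069, 0x0069: 0x0069, 0x0130: 0x0069, 0x0131: 0x0131}

# Cells that differ for Turkish and Azeri (shown in bold in the table).
TR_AZ_UPPER = {0x0069: 0x0130}
TR_AZ_LOWER = {0x0049: 0x0131}


def convert_i(cp, to_upper, locale="other"):
    """Convert one of the four 'i' characters according to the Java table."""
    default = DEFAULT_UPPER if to_upper else DEFAULT_LOWER
    override = {}
    if locale in ("tr", "az"):
        override = TR_AZ_UPPER if to_upper else TR_AZ_LOWER
    return override.get(cp, default.get(cp, cp))
```

Only two cells per direction differ from the default, so locale support for
these characters needs very little extra data.
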