===============
Unicode support
===============

Overview
========

Ecmascript E5 requires quite extensive Unicode support, which is difficult to
implement in a very compact fashion. The subsections below discuss Unicode
handling in various parts of the E5 standard. Below, the terms "character"
and "codepoint" are used interchangeably.

The general principles for implementation are:

* Operations on ASCII characters and ASCII strings should have a fast path
  which avoids expensive scanning of conversion tables etc.

* Simple run-time operations on non-ASCII characters like string
  concatenation, character lookups etc, should be reasonably fast (e.g.,
  avoid a scan of Unicode character information ranges).

* Complex run-time operations on non-ASCII characters like case conversion
  can have a performance penalty in exchange for small size.

* Compile-time operations on non-ASCII characters can have a performance
  penalty in exchange for small size.
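The ASCII fast path principle can be illustrated with a minimal Python
sketch. The names ``to_upper_codepoint`` and ``slow_table_lookup`` are
hypothetical, and the slow path simply defers to Python's own ``str.upper()``
as a stand-in for scanning packed conversion rules:

.. code-block:: python

  def slow_table_lookup(cp):
      """Stand-in for an expensive scan of packed conversion rules."""
      res = chr(cp).upper()
      # Keep only 1:1 results in this sketch; 1:n results leave cp unchanged.
      return ord(res) if len(res) == 1 else cp

  def to_upper_codepoint(cp):
      """Uppercase a single codepoint with an ASCII fast path."""
      if cp < 0x80:
          # Fast path: no table access is needed for ASCII.
          if 0x61 <= cp <= 0x7A:  # 'a'..'z'
              return cp - 0x20
          return cp
      return slow_table_lookup(cp)

Only non-ASCII input ever reaches ``slow_table_lookup()``, so its cost does
not affect plain ASCII strings.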
Handling Unicode case conversion, character classes, etc. in a compact code
size is a bit challenging. The current solution is to fast-path ASCII
characters and to use a bit-packed format for encoding case conversion
rules (e.g. range mappings). The rules are created by build-time Python
scripts (see ``src/`` directory) and decoded by run-time code such as the
parser with the help of ``duk_bitdecoder_ctx`` and ``duk_bd_decode()``.
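The idea of reading variable-width fields from a bit-packed byte stream can
be sketched as follows. ``BitDecoder`` and ``pack_bits`` are hypothetical
Python analogues written for illustration; they are not the actual
``duk_bitdecoder_ctx`` / ``duk_bd_decode()`` interface, and the bit order
(most significant bit first) is an assumption:

.. code-block:: python

  class BitDecoder:
      """Reads unsigned bit fields from bytes, most significant bit first."""

      def __init__(self, data):
          self.data = data
          self.bitpos = 0  # index of the next unread bit

      def decode(self, nbits):
          """Return the next nbits bits as an unsigned integer."""
          value = 0
          for _ in range(nbits):
              byte = self.data[self.bitpos >> 3]
              bit = (byte >> (7 - (self.bitpos & 7))) & 1
              value = (value << 1) | bit
              self.bitpos += 1
          return value

  def pack_bits(fields):
      """Pack (value, nbits) pairs into bytes, padding the tail with zero bits."""
      bits = []
      for value, nbits in fields:
          bits.extend((value >> i) & 1 for i in range(nbits - 1, -1, -1))
      bits.extend([0] * (-len(bits) % 8))
      return bytes(
          sum(bit << (7 - j) for j, bit in enumerate(bits[i:i + 8]))
          for i in range(0, len(bits), 8)
      )

  # Three fields of 16, 7 and 3 bits fit into 4 bytes instead of 3 full words.
  packed = pack_bits([(0x41, 16), (26, 7), (1, 3)])
  bd = BitDecoder(packed)
  fields = (bd.decode(16), bd.decode(7), bd.decode(3))

The decoder must know the field widths in advance, which is why the encoder
and the decoder have to agree on the rule layout.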
.. note:: There are many Unicode specifications, and I'm not sure
   which ones apply to E5. For instance, which specification governs the
   'end of word' behavior for the Final_Sigma context? Is it #29 or
   something else?

Useful background information:

* Unicode Standard Annex #44: UNICODE CHARACTER DATABASE:
  http://unicode.org/reports/tr44/#Casemapping

* CLDR - Unicode Common Locale Data Repository:
  http://cldr.unicode.org/

* Unicode Technical Standard #35: UNICODE LOCALE DATA MARKUP LANGUAGE (LDML):
  http://www.unicode.org/reports/tr35/tr35-21.html

* Unicode Standard Annex #29: UNICODE TEXT SEGMENTATION:
  http://www.unicode.org/reports/tr29/

Unicode data:

* http://unicode.org/Public/UNIDATA/

* UnicodeData.txt: http://unicode.org/Public/UNIDATA/UnicodeData.txt

* SpecialCasing.txt: http://unicode.org/Public/UNIDATA/SpecialCasing.txt

Source text
===========
The ``IdentifierStart`` and ``IdentifierPart`` codepoint sets are rather
complex. They are currently encoded into about 1.5 kilobytes of bit-packed
match data by ``extract_chars.py``.
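To give a feel for the size of the problem, the sketch below collapses the
``IdentifierStart`` set into inclusive codepoint ranges and matches against
them with a binary search. This is only an illustration of range-based
matching: it is not the bit-packed format produced by ``extract_chars.py``,
all function names are hypothetical, the ``\`` ``UnicodeEscapeSequence`` form
is left out, and Python's ``unicodedata`` module reflects whatever Unicode
version the interpreter ships with:

.. code-block:: python

  import bisect
  import unicodedata

  # General categories making up UnicodeLetter in E5 Section 7.6.
  LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

  def is_identifier_start_slow(cp):
      """Direct property lookup, used only to build the range list."""
      ch = chr(cp)
      return ch in "$_" or unicodedata.category(ch) in LETTER_CATEGORIES

  def build_ranges(pred, limit=0x10000):
      """Collapse codepoints accepted by pred into inclusive (start, end) ranges."""
      ranges = []
      start = None
      for cp in range(limit):
          if pred(cp):
              if start is None:
                  start = cp
          elif start is not None:
              ranges.append((start, cp - 1))
              start = None
      if start is not None:
          ranges.append((start, limit - 1))
      return ranges

  ID_START_RANGES = build_ranges(is_identifier_start_slow)
  ID_START_STARTS = [r[0] for r in ID_START_RANGES]

  def is_identifier_start(cp):
      """Range lookup by binary search over the precomputed range list."""
      i = bisect.bisect_right(ID_START_STARTS, cp) - 1
      return i >= 0 and cp <= ID_START_RANGES[i][1]

Printing ``len(ID_START_RANGES)`` shows how many ranges a naive encoding
would have to store for the 16-bit codepoint space.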
Regular expression
==================

The ``Canonicalize()`` abstract operation described in E5 Section 15.10.2.8
shares the case conversion of ``String.prototype.toUpperCase()`` with a few
exceptions. The conversion tables can be shared so no additional tables are
needed.
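The exceptions can be seen in a short sketch of ``Canonicalize()`` for a
single character. Python's ``str.upper()`` is used here as a stand-in for
``String.prototype.toUpperCase()``; this is an assumption, since Python works
on full codepoints and uses its own Unicode version:

.. code-block:: python

  def canonicalize(ch, ignore_case):
      """Sketch of Canonicalize() from E5 Section 15.10.2.8 for one character."""
      if not ignore_case:
          return ch
      u = ch.upper()  # full conversion, may yield several characters
      if len(u) != 1:
          return ch   # 1:n results are ignored
      if ord(ch) >= 128 and ord(u) < 128:
          return ch   # a non-ASCII character is never mapped to ASCII
      return u

For example, "ß" stays as is because its uppercase form "SS" is not a single
character, and U+017F (LATIN SMALL LETTER LONG S) stays as is because its
uppercase form "S" is ASCII.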
String case conversion
======================

Ecmascript E5 requires case conversion for 16-bit Unicode characters with the
``String.prototype`` functions ``toLowerCase()``, ``toLocaleLowerCase()``,
``toUpperCase()``, and ``toLocaleUpperCase()``, see E5 Sections 15.5.4.16 to
15.5.4.19. Titlecase conversion is not required by Ecmascript E5. The regular
expression abstract operation ``Canonicalize()`` also borrows the case
conversion rules (though only for 1:1 conversions), see E5 Section 15.10.2.8.

Unicode data files describe case conversion rules in two parts:

1. ``UnicodeData.txt`` describes simple 1:1 mappings for lowercase, uppercase,
   and titlecase. The titlecase mapping, if missing, defaults to uppercase
   mapping.

2. ``SpecialCasing.txt`` describes complex 1:many mappings for case conversion,
   which are also required by E5. These mappings may be locale sensitive (e.g.
   apply only to a certain language) and/or context sensitive (e.g. apply only
   if a character is preceded or followed by certain codepoints).

UnicodeData.txt lists all Unicode codepoints and optionally gives case
conversion rules for each. Titlecase conversion defaults to uppercase
conversion, and if no conversion is given, the character is assumed to remain
the same unless SpecialCasing.txt has an overriding rule. The actual case
conversion rules are not random, but in many cases continuous ranges are
shifted to another position in the codepoint space; the ranges may be fully
continuous or have a "skip", e.g. apply to every other character.
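A minimal sketch of extracting the simple mappings is shown below. It assumes
the semicolon-separated field layout documented in UAX #44, where fields 12,
13 and 14 hold the simple uppercase, lowercase and titlecase mappings. The
function name ``parse_unicode_data`` is hypothetical and the three sample
lines are embedded so that the example is self-contained:

.. code-block:: python

  SAMPLE_UNICODE_DATA = """\
  0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
  0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
  """

  def parse_unicode_data(text):
      """Map codepoint -> (upper, lower, title) using simple 1:1 mappings."""
      result = {}
      for line in text.splitlines():
          line = line.strip()
          if not line:
              continue
          fields = line.split(";")
          cp = int(fields[0], 16)
          # A missing mapping means the character maps to itself.
          upper = int(fields[12], 16) if fields[12] else cp
          lower = int(fields[13], 16) if fields[13] else cp
          # Titlecase defaults to the uppercase mapping when missing.
          title = int(fields[14], 16) if fields[14] else upper
          result[cp] = (upper, lower, title)
      return result

  simple = parse_unicode_data(SAMPLE_UNICODE_DATA)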
SpecialCasing.txt provides additional rules particularly for handling cases
where the case conversion is not 1:1. For instance, "ß" converted to
uppercase is "SS". There are slightly over 100 such rules, almost entirely
for uppercase and titlecase conversion. The special casing rules can convert
an input codepoint into 1-3 result codepoints (the ligature U+FB03 uppercases
to "FFI", for instance). Some special casing rules are context and/or locale
sensitive. *Context sensitivity* means that a rule only applies when a
codepoint is (or is not) surrounded by certain other codepoints, which means
that characters cannot be case converted individually. *Locale sensitivity*
means that a rule might only apply for a certain language.
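The SpecialCasing.txt line format (``code; lower; title; upper;`` followed by
an optional condition list and a comment) can be parsed with a few lines of
Python. This is a sketch, not the logic of ``extract_caseconv.py``; the name
``parse_special_casing`` and the dictionary layout are hypothetical:

.. code-block:: python

  SAMPLE_SPECIAL_CASING = """\
  00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
  FB03; FB03; 0046 0066 0069; 0046 0046 0049; # LATIN SMALL LIGATURE FFI
  03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
  0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
  """

  def parse_codepoints(field):
      """Convert a field such as '0053 0053' into a list of integers."""
      return [int(x, 16) for x in field.split()]

  def parse_special_casing(text):
      """Return one dict per rule; 'conditions' is empty for unconditional rules."""
      rules = []
      for line in text.splitlines():
          line = line.split("#", 1)[0].strip()  # drop the trailing comment
          if not line:
              continue
          fields = [f.strip() for f in line.split(";")]
          rules.append({
              "code": int(fields[0], 16),
              "lower": parse_codepoints(fields[1]),
              "title": parse_codepoints(fields[2]),
              "upper": parse_codepoints(fields[3]),
              # Locale and/or context names, e.g. ['tr', 'Not_Before_Dot'].
              "conditions": fields[4].split(),
          })
      return rules

  rules = parse_special_casing(SAMPLE_SPECIAL_CASING)
  unconditional = [r for r in rules if not r["conditions"]]

Rules with a non-empty ``conditions`` list are the context and locale
sensitive ones discussed later in this document.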
The Python script ``extract_caseconv.py`` reads in UnicodeData.txt and
SpecialCasing.txt, extracts the appropriate case conversion rules, scans the
conversion rules to generate a compact rules database (really just a list of
rules), and encodes the rules into a bit packed format. The bit packed rule
format has been developed experimentally to minimize data and code space, by
looking at the case conversion data and first detecting simple rules (ranges
which are either continuous or have a certain "skip"), and then looking at
what remains.

Currently the encoded format consists of three parts:

1. range mappings with a "skip" of 1...6;

2. simple 1:1 character mappings which are not covered by the range rules;

3. complex 1:n character mappings.
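The first two parts can be illustrated with a greedy range detector. This is
a sketch of the general idea only, not the algorithm used by
``extract_caseconv.py``; the function name ``find_ranges``, the tuple layout
``(start, count, skip, delta)`` and the ``min_count`` threshold are all
assumptions made for the example:

.. code-block:: python

  def find_ranges(mapping, max_skip=6, min_count=3):
      """Split a 1:1 mapping into (start, count, skip, delta) ranges and leftovers."""
      remaining = dict(mapping)
      ranges = []
      for start in sorted(mapping):
          if start not in remaining:
              continue  # already covered by an earlier range
          delta = remaining[start] - start
          best_count, best_skip = 1, 1
          for skip in range(1, max_skip + 1):
              count = 1
              cp = start + skip
              # Extend while every skip'th codepoint shifts by the same delta.
              while cp in remaining and remaining[cp] - cp == delta:
                  count += 1
                  cp += skip
              if count > best_count:
                  best_count, best_skip = count, skip
          if best_count >= min_count:
              ranges.append((start, best_count, best_skip, delta))
              for i in range(best_count):
                  del remaining[start + i * best_skip]
      return ranges, remaining

  # Uppercase mappings: a-z (skip 1), U+0101..U+012F odd codepoints (skip 2),
  # and U+00FF -> U+0178 which fits no range.
  uppercase = {cp: cp - 0x20 for cp in range(0x61, 0x7B)}
  uppercase.update({cp: cp - 1 for cp in range(0x101, 0x130, 2)})
  uppercase[0xFF] = 0x178

  ranges, leftovers = find_ranges(uppercase)

Here 51 individual mappings collapse into two range rules plus one leftover
1:1 mapping.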
There's probably some room for improvement in optimizing the encoding further;
currently it takes almost 2 KiB for uppercase and lowercase rules combined.

.. note:: Context or locale specific rules are not currently processed. This
   violates E5 requirements for both context and locale support.

See also:

* http://www.unicode.org/faq/casemap_charprop.html

* ``src/CaseConversion.java``, which allows easy testing of what Java does

Context and locale sensitive rules
==================================
The following context and locale sensitive rules exist in SpecialCasing.txt
with md5sum of 5cea3d079e2b6c6c3babb0726e47e1db.

Useful background:

* Unicode Standard Annex #44: UNICODE CHARACTER DATABASE, Section 5.6:
  http://unicode.org/reports/tr44/#Casemapping

  - Clarifies that contexts are not formal character properties

* CLDR - Unicode Common Locale Data Repository: http://cldr.unicode.org/

* http://unicode.org/reports/tr44/#General_Category_Values

Final sigma (all languages)
---------------------------

::

  # Special case for final form of sigma
  03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA

The lowercase conversion of U+03A3: GREEK CAPITAL LETTER SIGMA depends
on context as follows:

* Final_Sigma: lowercase is U+03C2: GREEK SMALL LETTER FINAL SIGMA

* Otherwise: lowercase is U+03C3: GREEK SMALL LETTER SIGMA

Other conversions (uppercase or titlecase conversions, or lowercase
conversions of other sigma characters) are not context sensitive.
In particular, codepoints U+03C2 and U+03C3 lowercase to themselves.

.. note:: What is the formal definition of a "final sigma" context?
   Based on the "Unicode demystified" link below, let p = previous
   codepoint (if exists) and n = next codepoint; final_sigma = (p exists)
   and (p is a letter) and (n exists) and (n is not a letter), for
   some meaning of a "letter"?
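One possible reading of the context test is sketched below. It is a
simplification under stated assumptions: a "letter" is taken to be any
codepoint with a general category starting with ``L``, intervening
case-ignorable characters are not skipped, and the end of input is treated
the same as a following non-letter (which differs from the ``(n exists)``
condition in the note above). The function names are hypothetical:

.. code-block:: python

  import unicodedata

  def is_letter(ch):
      # Assumption: "letter" means general category L*. The Unicode casing
      # contexts are phrased in terms of cased letters instead.
      return unicodedata.category(ch).startswith("L")

  def lower_sigma(s, i):
      """Lowercase form of U+03A3 at index i under the simplified context test."""
      assert s[i] == "\u03a3"
      prev_is_letter = i > 0 and is_letter(s[i - 1])
      next_is_letter = i + 1 < len(s) and is_letter(s[i + 1])
      if prev_is_letter and not next_is_letter:
          return "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA
      return "\u03c3"      # GREEK SMALL LETTER SIGMA

With this test the last sigma of a word becomes U+03C2, while a word-initial
or isolated sigma becomes U+03C3.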
See also:

* http://unicode.org/faq/greek.html#5

* http://en.wikipedia.org/wiki/Sigma

* http://www.unicode.org/reports/tr29/#Word_Boundaries

* http://books.google.fi/books?id=wn5sXG8bEAcC&pg=PA169&lpg=PA169&dq=%22Final_Sigma%22&source=bl&ots=J07ysYPbVD&sig=tGhPz1VFpi-KE1InQPsjX2diVlg&hl=fi&ei=XHswTqmrA4aSOrSf3X4&sa=X&oi=book_result&ct=result&resnum=5&ved=0CDYQ6AEwBA#v=onepage&q=%22Final_Sigma%22&f=false

Lithuanian (lt)
---------------

::

  # Lithuanian retains the dot in a lowercase i when followed by accents.
  # Remove DOT ABOVE after "i" with upper or titlecase
  0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE

::

  # Introduce an explicit dot above when lowercasing capital I's and J's
  # whenever there are more accents above.
  # (of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)
  0049; 0069 0307; 0049; 0049; lt More_Above; # LATIN CAPITAL LETTER I
  004A; 006A 0307; 004A; 004A; lt More_Above; # LATIN CAPITAL LETTER J
  012E; 012F 0307; 012E; 012E; lt More_Above; # LATIN CAPITAL LETTER I WITH OGONEK
  00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE
  00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE
  0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE

Turkish and Azeri (tr and az)
-----------------------------

::

  # I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
  # The following rules handle those cases.
  0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
  0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

::

  # When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
  # This matches the behavior of the canonically equivalent I-dot_above
  0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
  0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

::

  # When lowercasing, unless an I is before a dot_above, it turns into a dotless i.
  0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
  0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I

::

  # When uppercasing, i turns into a dotted capital I
  0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
  0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
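The Turkish and Azeri rules above can be sketched in Python as follows. The
sketch simplifies the ``After_I`` and ``Not_Before_Dot`` contexts to look
only at the immediately adjacent codepoint, and falls back to Python's
default ``str.lower()`` / ``str.upper()`` for all other characters; both are
assumptions made for illustration, and the function names are hypothetical:

.. code-block:: python

  DOT_ABOVE = "\u0307"      # COMBINING DOT ABOVE
  CAPITAL_I_DOT = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
  DOTLESS_I = "\u0131"      # LATIN SMALL LETTER DOTLESS I

  def lower_tr(s):
      """Lowercase s with the tr/az rules for the 'i' characters."""
      out = []
      for i, ch in enumerate(s):
          if ch == CAPITAL_I_DOT:
              out.append("i")
          elif ch == "I":
              if s[i + 1:i + 2] == DOT_ABOVE:
                  out.append("i")        # default mapping; the dot is dropped below
              else:
                  out.append(DOTLESS_I)  # Not_Before_Dot
          elif ch == DOT_ABOVE and i > 0 and s[i - 1] == "I":
              continue                   # After_I: remove the combining dot
          else:
              out.append(ch.lower())     # stand-in for the default rules
      return "".join(out)

  def upper_tr(s):
      """Uppercase s with the tr/az rule for U+0069."""
      return "".join(CAPITAL_I_DOT if ch == "i" else ch.upper() for ch in s)

For example, ``lower_tr("DIYARBAKIR")`` yields a string with two dotless
i characters, and ``upper_tr("i")`` yields U+0130.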
Various 'i' characters
----------------------

Case conversion rules for various 'i' characters are particularly fun.
There are four separate 'i'-characters:

* U+0049: LATIN CAPITAL LETTER I

* U+0069: LATIN SMALL LETTER I

* U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE

* U+0131: LATIN SMALL LETTER DOTLESS I

Case conversion rules for these characters are locale and context dependent and differ
from standard conversions at least for Lithuanian (lt), Turkish (tr), and Azeri (az)
as follows (ignoring context dependent rules):

+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| Input | uc/lt | uc/tr | uc/az | uc/other | lc/lt | lc/tr | lc/az | lc/other |
+========+==========+==========+==========+==========+==========+==========+==========+==========+
| U+0049 | | | | | | | | |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0069 | | | | | | | | |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0130 | | | | | | | | |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0131 | | | | | | | | |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+

**FIXME: FILL**

Java behavior:

+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| Input | uc/lt | uc/tr | uc/az | uc/other | lc/lt | lc/tr | lc/az | lc/other |
+========+==========+==========+==========+==========+==========+==========+==========+==========+
| U+0049 | U+0049 | U+0049 | U+0049 | U+0049 | U+0069 |**U+0131**|**U+0131**| U+0069 |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0069 | U+0049 |**U+0130**|**U+0130**| U+0049 | U+0069 | U+0069 | U+0069 | U+0069 |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0130 | U+0130 | U+0130 | U+0130 | U+0130 | U+0069 | U+0069 | U+0069 | U+0069 |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| U+0131 | U+0049 | U+0049 | U+0049 | U+0049 | U+0131 | U+0131 | U+0131 | U+0131 |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+