internal documentation about unicode support

11 years ago · 269db6708f
1 changed files with 295 additions and 0 deletions
--- a/doc/unicode-support.txt
+++ b/doc/unicode-support.txt
@@ -0,0 +1,295 @@
 ===============
 Unicode support
 ===============
 Overview
 ========
 Ecmascript E5 requires quite extensive Unicode support, which is difficult to
 implement in a very compact fashion.  The subsections below discuss Unicode
 handling in various parts of the E5 standard.  Below, the terms "character"
 and "codepoint" are used interchangeably.
 The general principles for implementation are:
 * Operations on ASCII characters and ASCII strings should have a fast path
  which avoids expensive scanning of conversion tables etc.
 * Simple run-time operations on non-ASCII characters, such as string
  concatenation and character lookups, should be reasonably fast (e.g.,
  avoid a scan of Unicode character information ranges).
 * Complex run-time operations on non-ASCII characters like case conversion
  can have a performance penalty in exchange for small size.
 * Compile-time operations on non-ASCII characters can have a performance
  penalty in exchange for small size.
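 The ASCII fast path principle can be illustrated with a minimal Python
 sketch.  The function name and the tiny ``SLOW_UPPER`` table are hypothetical
 and only stand in for whatever compact rule data the real implementation
 decodes on the slow path.

```python
# Hypothetical slow-path data: a real implementation would decode packed
# rules here instead of using a plain dictionary.
SLOW_UPPER = {0x00E4: 0x00C4, 0x00FF: 0x0178}  # a-umlaut, y-diaeresis


def upper_codepoint(cp):
    """Uppercase one codepoint, 1:1 mappings only."""
    if cp < 0x80:
        # ASCII fast path: plain arithmetic, no table access at all.
        if 0x61 <= cp <= 0x7A:
            return cp - 0x20
        return cp
    # Non-ASCII slow path: allowed to be slower in exchange for small size.
    return SLOW_UPPER.get(cp, cp)
```

 The point of the split is that the common ASCII case never touches the
 conversion data, so the slow path can favor size over speed.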
 Handling Unicode case conversion, character classes, etc. in a compact code
 size is a bit challenging.  The current solution is to fast path ASCII
 characters and to use a bit-packed format for encoding case conversion
 rules (e.g. range mappings).  The rules are created by build-time Python
 scripts (see ``src/`` directory) and decoded by run-time code such as the
 parser with the help of ``duk_bitdecoder_ctx`` and ``duk_bd_decode()``.
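 To make the bit-packed idea concrete, here is a small Python sketch of a bit
 encoder and decoder pair.  It is not the ``duk_bd_decode()`` implementation:
 the class names, the MSB-first bit order, and the zero padding are all
 assumptions chosen for illustration.

```python
class BitEncoder:
    """Build-time side: append values using an arbitrary number of bits."""

    def __init__(self):
        self.bits = []

    def encode(self, value, nbits):
        # Assumption: most significant bit is written first.
        for shift in range(nbits - 1, -1, -1):
            self.bits.append((value >> shift) & 1)

    def to_bytes(self):
        # Pad with zero bits up to a whole number of bytes.
        padded = self.bits + [0] * (-len(self.bits) % 8)
        out = bytearray()
        for i in range(0, len(padded), 8):
            byte = 0
            for bit in padded[i:i + 8]:
                byte = (byte << 1) | bit
            out.append(byte)
        return bytes(out)


class BitDecoder:
    """Run-time side: read values back using the same bit widths."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current position in bits

    def decode(self, nbits):
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value
```

 Because fields are not byte aligned, a rule consisting of a 3-bit tag, a
 16-bit codepoint, and a 1-bit flag fits in 20 bits rather than four bytes.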
 .. note:: There are many Unicode specifications, and I'm not sure
   which ones apply to E5.  For instance, which specification governs the
   'end of word' behavior for the Final_Sigma context?  Is it #29 or
   something else?
 Useful background information:
 * Unicode Standard Annex #44: UNICODE CHARACTER DATABASE:
  http://unicode.org/reports/tr44/#Casemapping
 * CLDR - Unicode Common Locale Data Repository:
  http://cldr.unicode.org/
 * Unicode Technical Standard #35: UNICODE LOCALE DATA MARKUP LANGUAGE (LDML):
  http://www.unicode.org/reports/tr35/tr35-21.html
 * Unicode Standard Annex #29: UNICODE TEXT SEGMENTATION:
  http://www.unicode.org/reports/tr29/
 Unicode data:
 * http://unicode.org/Public/UNIDATA/
 * UnicodeData.txt: http://unicode.org/Public/UNIDATA/UnicodeData.txt
 * SpecialCasing.txt: http://unicode.org/Public/UNIDATA/SpecialCasing.txt
 Source text
 ===========
 The ``IdentifierStart`` and ``IdentifierPart`` codepoint sets are rather
 complex.  They are currently encoded into about 1.5 kilobytes of bit-packed
 match data by ``extract_chars.py``.
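 As a reference point for what the match data has to represent, the E5
 Section 7.6 definitions can be sketched with Python's ``unicodedata`` module.
 This only illustrates the codepoint sets and says nothing about how
 ``extract_chars.py`` actually works; Python's Unicode version may also differ
 from the one the build uses.  Unicode escape sequences in identifiers are a
 lexer matter and are not covered.

```python
import unicodedata

# E5 Section 7.6: UnicodeLetter categories.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Additional categories allowed in IdentifierPart:
# UnicodeCombiningMark, UnicodeDigit, UnicodeConnectorPunctuation.
PART_EXTRA_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def is_identifier_start(ch):
    """True if the single character ch may start an identifier."""
    return ch in ("$", "_") or unicodedata.category(ch) in LETTER_CATEGORIES


def is_identifier_part(ch):
    """True if the single character ch may continue an identifier."""
    return (
        is_identifier_start(ch)
        or unicodedata.category(ch) in PART_EXTRA_CATEGORIES
        or ch in ("\u200c", "\u200d")  # ZWNJ and ZWJ
    )
```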
 Regular expression
 ==================
 The ``Canonicalize()`` abstract operation described in E5 Section 15.10.2.8
 shares the case conversion of ``String.prototype.toUpperCase()`` with a few
 exceptions.  The conversion tables can be shared so no additional tables are
 needed.
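 The exceptions are easiest to see by writing out the steps of E5 Section
 15.10.2.8.  The Python sketch below uses ``str.upper()`` as a stand-in for
 ``String.prototype.toUpperCase()``; that substitution is an assumption, since
 Python's Unicode tables need not match the ones used here, and the sketch
 only makes sense for BMP characters.

```python
def canonicalize(ch, ignore_case=True):
    """Sketch of the E5 Canonicalize() steps for a single character."""
    if not ignore_case:
        return ch
    u = ch.upper()
    # Exception 1: a 1:n uppercase result is ignored.
    if len(u) != 1:
        return ch
    # Exception 2: a non-ASCII character never canonicalizes to ASCII.
    if ord(ch) >= 128 and ord(u) < 128:
        return ch
    return u
```

 For example, U+00DF (sharp s) uppercases to "SS" and is therefore returned
 unchanged, and U+017F (long s) would uppercase to ASCII "S" and is also
 returned unchanged.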
 String case conversion
 ======================
 Ecmascript E5 requires case conversion for 16-bit Unicode characters with the
 ``String.prototype`` functions ``toLowerCase()``, ``toLocaleLowerCase()``,
 ``toUpperCase()``, and ``toLocaleUpperCase()``, see E5 Sections 15.5.4.16 to
 15.5.4.19.  Titlecase conversion is not required by Ecmascript E5.  The regular
 expression abstract operation ``Canonicalize()`` also borrows the case
 conversion rules (though only for 1:1 conversions), see E5 Section 15.10.2.8.
 Unicode data files describe case conversion rules in two parts:
 1. ``UnicodeData.txt`` describes simple 1:1 mappings for lowercase, uppercase,
   and titlecase.  The titlecase mapping, if missing, defaults to uppercase
   mapping.
 2. ``SpecialCasing.txt`` describes complex 1:many mappings for case conversion,
   which are also required by E5.  These mappings may be locale sensitive (e.g.
   apply only to a certain language) and/or context sensitive (e.g. apply only
   if a character is preceded or followed by certain codepoints).
 UnicodeData.txt lists all Unicode codepoints and optionally gives case
 conversion rules for each.  Titlecase conversion defaults to uppercase
 conversion, and if no conversion is given, the character is assumed to remain
 the same unless SpecialCasing.txt has an overriding rule.  The actual case
 conversion rules are not random, but in many cases continuous ranges are
 shifted to another position in the codepoint space; the ranges may be fully
 continuous or have a "skip", e.g. apply to every other character.
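 A minimal Python sketch of reading the simple mappings follows.  It assumes
 the standard 15-field, semicolon-separated UnicodeData.txt layout, where
 fields 12, 13, and 14 hold the simple uppercase, lowercase, and titlecase
 mappings.  The function name is hypothetical and the sample is a three-line
 excerpt rather than the full file.

```python
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
"""


def parse_simple_mappings(text):
    """Return {codepoint: (upper, lower, title)} for 1:1 mappings."""
    mappings = {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) != 15:
            continue  # not a UnicodeData.txt record
        cp = int(fields[0], 16)
        # A missing mapping means the character maps to itself.
        upper = int(fields[12], 16) if fields[12] else cp
        lower = int(fields[13], 16) if fields[13] else cp
        # A missing titlecase mapping defaults to the uppercase mapping.
        title = int(fields[14], 16) if fields[14] else upper
        mappings[cp] = (upper, lower, title)
    return mappings
```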
 SpecialCasing.txt provides additional rules particularly for handling cases
 where the case conversion is not 1:1.  For instance, "ß" converted to
 uppercase is "SS".  There are slightly over 100 such rules, almost entirely
 for uppercase and titlecase conversion.  The special casing rules can convert
 an input codepoint into 1-3 result codepoints (the ligature U+FB03 uppercases
 to "FFI", for instance).  Some special casing rules are context and/or locale
 sensitive.  *Context sensitivity* means that a rule only applies when a
 codepoint is (or is not) surrounded by certain other codepoints, which means
 that characters cannot be case converted individually.  *Locale sensitivity*
 means that a rule might only apply for a certain language.
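 The SpecialCasing.txt record layout (code; lower; title; upper; optional
 condition list; comment) can be parsed along the following lines.  This is a
 sketch only: the function name and the dictionary keys are hypothetical, and
 treating all-lowercase condition tokens as language IDs is an assumption that
 holds for the ``lt``, ``tr``, and ``az`` rules quoted in this document.

```python
SAMPLE = """\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
"""


def parse_special_casing(text):
    """Return a list of rule dictionaries, one per data line."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        lower, title, upper = (
            [int(x, 16) for x in f.split()] for f in fields[1:4]
        )
        conditions = fields[4].split() if len(fields) > 4 else []
        rules.append({
            "code": int(fields[0], 16),
            "lower": lower,
            "title": title,
            "upper": upper,
            # Assumption: language IDs are lowercase, contexts are not.
            "locales": [c for c in conditions if c.islower()],
            "contexts": [c for c in conditions if not c.islower()],
        })
    return rules
```

 Rules with empty ``locales`` and ``contexts`` lists are the unconditional
 1:n mappings; the rest need surrounding characters or a language to decide
 whether they apply.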
 The Python script ``extract_caseconv.py`` reads in UnicodeData.txt and
 SpecialCasing.txt, extracts the appropriate case conversion rules, scans the
 conversion rules to generate a compact rules database (really just a list of
 rules), and encodes the rules into a bit-packed format.  The bit-packed rule
 format has been developed experimentally to minimize data and code space, by
 looking at the case conversion data and first detecting simple rules (ranges
 which are either continuous or have a certain "skip"), and then looking at
 what remains.
 Currently the encoded format consists of three parts:
 1. range mappings with a "skip" of 1...6;
 2. simple 1:1 character mappings which are not covered by the range rules;
 3. complex 1:n character mappings.
 There's probably some room for improvement in optimizing the encoding further;
 currently it takes almost 2 KiB for uppercase and lowercase rules combined.
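 The first step, detecting range mappings with a "skip", might look roughly
 like the greedy Python sketch below.  It is not the algorithm used by
 ``extract_caseconv.py``: the function name, the ``(start_in, start_out,
 count, skip)`` tuple layout, and the ``min_count`` threshold are assumptions
 made for illustration.

```python
def extract_ranges(mapping, max_skip=6, min_count=3):
    """Split a 1:1 mapping into range rules and leftover single mappings."""
    remaining = dict(mapping)
    ranges = []
    for skip in range(1, max_skip + 1):
        for start in sorted(remaining):
            if start not in remaining:
                continue  # already consumed by a range found in this pass
            delta = remaining[start] - start
            count = 1
            # Extend the run while every skip-th codepoint shifts by delta.
            while remaining.get(start + count * skip) == start + count * skip + delta:
                count += 1
            if count >= min_count:
                ranges.append((start, remaining[start], count, skip))
                for i in range(count):
                    del remaining[start + i * skip]
    return ranges, remaining


# Example input: a-z (continuous), the odd codepoints U+0101..U+012F
# (every other character), and one isolated mapping.
uppercase = {cp: cp - 0x20 for cp in range(0x61, 0x7B)}
uppercase.update({cp: cp - 1 for cp in range(0x0101, 0x0130, 2)})
uppercase[0x00FF] = 0x0178
ranges, singles = extract_ranges(uppercase)
```

 With this input the 51 individual mappings collapse into two range rules
 plus one leftover 1:1 mapping, which is the kind of reduction the encoded
 format relies on.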
 .. note:: Context or locale specific rules are not currently processed.  This
   violates E5 requirements for both context and locale support.
 See also:
 * http://www.unicode.org/faq/casemap_charprop.html
 * ``src/CaseConversion.java`` which allows easy testing of what Java does
 Context and locale sensitive rules
 ==================================
 The following context and locale sensitive rules exist in the version of
 SpecialCasing.txt with md5sum 5cea3d079e2b6c6c3babb0726e47e1db.
 Useful background:
 * Unicode Standard Annex #44: UNICODE CHARACTER DATABASE, Section 5.6:
  http://unicode.org/reports/tr44/#Casemapping
  - Clarifies that contexts are not formal character properties
 * CLDR - Unicode Common Locale Data Repository: http://cldr.unicode.org/
 * http://unicode.org/reports/tr44/#General_Category_Values
 Final sigma (all languages)
 ---------------------------
 ::
  # Special case for final form of sigma
  03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
 The lowercase conversion of U+03A3: GREEK CAPITAL LETTER SIGMA depends
 on context as follows:
 * Final_Sigma: lowercase is U+03C2: GREEK SMALL LETTER FINAL SIGMA
 * Otherwise: lowercase is U+03C3: GREEK SMALL LETTER SIGMA
 Other conversions (uppercase or titlecase conversions, or lowercase
 conversions of other sigma characters) are not context sensitive.
 In particular, codepoints U+03C2 and U+03C3 lowercase to themselves.
 .. note:: What is the formal definition of a "final sigma" context?
   Based on the "Unicode demystified" link below, let p = previous
   codepoint (if exists) and n = next codepoint; final_sigma = (p exists)
   and (p is a letter) and (n exists) and (n is not a letter), for
   some meaning of a "letter"?
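 For comparison, the Unicode Standard (Section 3.13) defines Final_Sigma as
 follows: the sigma is preceded by a cased letter followed by zero or more
 case-ignorable characters, and is not followed by zero or more
 case-ignorable characters and then a cased letter.  The Python sketch below
 follows that shape, but it only approximates the "cased" and
 "case-ignorable" properties using general categories and a few punctuation
 characters, so treat it as an illustration rather than a conforming
 implementation.  It handles only U+03A3 and leaves other characters as-is.

```python
import unicodedata

CAPITAL_SIGMA = "\u03a3"
SMALL_SIGMA = "\u03c3"
FINAL_SIGMA = "\u03c2"


def is_cased(ch):
    # Approximation: the real property also covers some other characters.
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def is_case_ignorable(ch):
    # Approximation of Case_Ignorable via categories plus a few marks.
    return (unicodedata.category(ch) in ("Mn", "Me", "Cf", "Lm", "Sk")
            or ch in "'.:\u00b7\u2019")


def is_final_sigma(s, i):
    """True if the sigma at index i of s is in a Final_Sigma context."""
    j = i - 1
    while j >= 0 and is_case_ignorable(s[j]):
        j -= 1
    if j < 0 or not is_cased(s[j]):
        return False  # no cased letter before the sigma
    k = i + 1
    while k < len(s) and is_case_ignorable(s[k]):
        k += 1
    return k >= len(s) or not is_cased(s[k])


def lowercase_sigmas(s):
    """Lowercase only U+03A3, choosing the final or non-final form."""
    return "".join(
        (FINAL_SIGMA if is_final_sigma(s, i) else SMALL_SIGMA)
        if ch == CAPITAL_SIGMA else ch
        for i, ch in enumerate(s)
    )
```

 Note that under this definition the end of the string counts as a final
 context, so a following codepoint does not need to exist.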
 See also:
 * http://unicode.org/faq/greek.html#5
 * http://en.wikipedia.org/wiki/Sigma
 * http://www.unicode.org/reports/tr29/#Word_Boundaries
 * http://books.google.fi/books?id=wn5sXG8bEAcC&pg=PA169&lpg=PA169&dq=%22Final_Sigma%22&source=bl&ots=J07ysYPbVD&sig=tGhPz1VFpi-KE1InQPsjX2diVlg&hl=fi&ei=XHswTqmrA4aSOrSf3X4&sa=X&oi=book_result&ct=result&resnum=5&ved=0CDYQ6AEwBA#v=onepage&q=%22Final_Sigma%22&f=false
 Lithuanian (lt)
 ---------------
 ::
  # Lithuanian retains the dot in a lowercase i when followed by accents.
  # Remove DOT ABOVE after "i" with upper or titlecase
  0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE
 ::
  # Introduce an explicit dot above when lowercasing capital I's and J's
  # whenever there are more accents above.
  # (of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)
  0049; 0069 0307; 0049; 0049; lt More_Above; # LATIN CAPITAL LETTER I
  004A; 006A 0307; 004A; 004A; lt More_Above; # LATIN CAPITAL LETTER J
  012E; 012F 0307; 012E; 012E; lt More_Above; # LATIN CAPITAL LETTER I WITH OGONEK
  00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE
  00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE
  0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE
 Turkish and Azeri (tr and az)
 -----------------------------
 ::
  # I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
  # The following rules handle those cases.
  0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
  0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE
 ::
  # When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
  # This matches the behavior of the canonically equivalent I-dot_above
  0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
  0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE
 ::
  # When lowercasing, unless an I is before a dot_above, it turns into a dotless i.
  0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
  0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I
 ::
  # When uppercasing, i turns into a dotted capital I
  0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
  0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
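 The Turkish and Azeri rules above can be sketched in Python as follows.  The
 function names are hypothetical, and the ``After_I`` and ``Not_Before_Dot``
 contexts are simplified to plain adjacency, whereas the formal definitions
 allow certain combining characters to intervene.  Other characters fall back
 to Python's per-character ``str.lower()`` and ``str.upper()``.

```python
DOT_ABOVE = "\u0307"
CAPITAL_I_DOT = "\u0130"
SMALL_DOTLESS_I = "\u0131"


def lower_tr(s):
    """Lowercase s using the tr/az rules (simplified contexts)."""
    out = []
    for i, ch in enumerate(s):
        prev = s[i - 1] if i > 0 else ""
        nxt = s[i + 1] if i + 1 < len(s) else ""
        if ch == "I":
            # Not_Before_Dot: I becomes dotless i unless a dot above follows.
            out.append("i" if nxt == DOT_ABOVE else SMALL_DOTLESS_I)
        elif ch == DOT_ABOVE and prev == "I":
            continue  # After_I: the combining dot is removed
        elif ch == CAPITAL_I_DOT:
            out.append("i")
        else:
            out.append(ch.lower())
    return "".join(out)


def upper_tr(s):
    """Uppercase s using the tr/az rules: i becomes dotted capital I."""
    return "".join(CAPITAL_I_DOT if ch == "i" else ch.upper() for ch in s)
```

 This keeps the two case pairs separate: I with dotless i, and dotted capital
 I with i.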
 Various 'i' characters
 ----------------------
 Case conversion rules for various 'i' characters are particularly fun.
 There are four separate 'i'-characters:
 * U+0049: LATIN CAPITAL LETTER I
 * U+0069: LATIN SMALL LETTER I
 * U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE
 * U+0131: LATIN SMALL LETTER DOTLESS I
 Case conversion rules for these characters are locale and context dependent and differ
 from standard conversions at least for Lithuanian (lt), Turkish (tr), and Azeri (az)
 as follows (ignoring context dependent rules):
 +--------+----------+----------+----------+----------+----------+----------+----------+----------+
 | Input  | uc/lt    | uc/tr    | uc/az    | uc/other | lc/lt    | lc/tr    | lc/az    | lc/other |
 +========+==========+==========+==========+==========+==========+==========+==========+==========+
 | U+0049 |          |          |          |          |          |          |          |          |
 +--------+----------+----------+----------+----------+----------+----------+----------+----------+
 | U+0069 |          |          |          |          |          |          |          |          |
 +--------+----------+----------+----------+----------+----------+----------+----------+----------+
 | U+0130 |          |          |          |          |          |          |          |          |
 +--------+----------+----------+----------+----------+----------+----------+----------+----------+
 | U+0131 |          |          |          |          |          |          |          |          |
 +--------+----------+----------+----------+----------+----------+----------+----------+----------+
 **FIXME: FILL**
 Java behavior:
 +--------+----------+----------+----------+----------+----------+----------+----------+----------+
 | Input  | uc/lt    | uc/tr    | uc/az    | uc/other | lc/lt    | lc/tr    | lc/az    | lc/other |
 +========+==========+==========+==========+==========+==========+==========+==========+==========+
 | U+0049 |  U+0049  |  U+0049  |  U+0049  |  U+0049  |  U+0069  |**U+0131**|**U+0131**|  U+0069  |
 +--------+----------+----------+----------+----------+----------+----------+----------+----------+
 | U+0069 |  U+0049  |**U+0130**|**U+0130**|  U+0049  |  U+0069  |  U+0069  |  U+0069  |  U+0069  |
 +--------+----------+----------+----------+----------+----------+----------+----------+----------+
 | U+0130 |  U+0130  |  U+0130  |  U+0130  |  U+0130  |  U+0069  |  U+0069  |  U+0069  |  U+0069  |
 +--------+----------+----------+----------+----------+----------+----------+----------+----------+
 | U+0131 |  U+0049  |  U+0049  |  U+0049  |  U+0049  |  U+0131  |  U+0131  |  U+0131  |  U+0131  |
 +--------+----------+----------+----------+----------+----------+----------+----------+----------+