Notes on an internal representation alternative

7 years ago · d24f590f48
1 changed files with 141 additions and 0 deletions
--- a/doc/utf8-internal-representation.rst
+++ b/doc/utf8-internal-representation.rst
@@ -0,0 +1,141 @@
+======================================
+Using UTF-8 as internal representation
+======================================
+
+Some notes on using UTF-8 as internal representation for Ecmascript strings
+when surrogate pairs can be combined.
+
+Current representation
+======================
+
+The current internal representation is a union of:
+
+* CESU-8: to support full 16-bit codepoint sequences without limitations.
+  In particular, individual and unpaired surrogates must work without
+  interpretation or conversion.
+
+* UTF-8: to support non-BMP characters, if they are created from C code
+  or e.g. using String.fromCharCode(0x12345).
+
+* Extended UTF-8: to support codepoints up to U+FFFFFFFF.  This is now
+  only needed by the regexp bytecode, which uses extended UTF-8 as its
+  internal representation and needs to represent long offsets as
+  codepoints.
+
+C API problem with current representation
+=========================================
+
+One concrete problem with this arrangement is that non-BMP characters which
+originate from Ecmascript source code end up internally represented as CESU-8:
+
+* If source code contains a non-BMP character, the Ecmascript specification
+  requires that such a character be decoded into surrogates.  Quoting
+  https://www.ecma-international.org/ecma-262/5.1/#sec-6:
+
+  - If an actual source text is encoded in a form other than 16-bit code
+    units it must be processed as if it was first converted to UTF-16.
+
+* This means that ``x = '\u{12345}'`` and ``x = '\ud808\udf45'`` MUST be
+  treated identically.  For example, for both inputs:
+
+  - The string's ``.length`` must be 2.
+
+  - ``x[0]`` must be 0xd808, and ``x[1]`` must be 0xdf45.
+
+  - RegExps must be able to match the individual surrogates, and one must
+    be able to e.g. backtrack each surrogate separately.
+
+  - It must be possible to take a substring whose one end is between
+    the surrogate codepoints.
+
+* In the current C API such a string will appear CESU-8 encoded because
+  that's the internal representation used for surrogate codepoints.
+
+* Applications dealing natively with UTF-8 would often prefer to see UTF-8
+  rather than CESU-8, thus avoiding the need to transcode CESU-8 to UTF-8.
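+
+To make the difference concrete, here is a minimal C sketch that splits
+U+12345 into its surrogate pair and produces both byte encodings.  The helper
+names (``split_surrogates()``, ``encode_3byte()``, ``encode_4byte()``) are
+invented for this illustration and are not part of the Duktape API.
+

```c
#include <assert.h>
#include <stdint.h>

/* Split a non-BMP codepoint (U+10000 to U+10FFFF) into a UTF-16
 * surrogate pair.  Hypothetical helper, not a Duktape function. */
static void split_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo) {
    uint32_t v;
    assert(cp >= 0x10000UL && cp <= 0x10FFFFUL);
    v = cp - 0x10000UL;
    *hi = (uint16_t) (0xD800 + (v >> 10));
    *lo = (uint16_t) (0xDC00 + (v & 0x3FF));
}

/* Encode one 16-bit value in U+0800 to U+FFFF as three bytes.  Applied
 * to each surrogate separately, this yields the CESU-8 form. */
static void encode_3byte(uint16_t cu, uint8_t out[3]) {
    assert(cu >= 0x0800);
    out[0] = (uint8_t) (0xE0 | (cu >> 12));
    out[1] = (uint8_t) (0x80 | ((cu >> 6) & 0x3F));
    out[2] = (uint8_t) (0x80 | (cu & 0x3F));
}

/* Encode a non-BMP codepoint as four bytes of standard UTF-8. */
static void encode_4byte(uint32_t cp, uint8_t out[4]) {
    assert(cp >= 0x10000UL && cp <= 0x10FFFFUL);
    out[0] = (uint8_t) (0xF0 | (cp >> 18));
    out[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t) (0x80 | (cp & 0x3F));
}
```

+
+For U+12345 the surrogates are 0xd808 and 0xdf45, so the CESU-8 form is the
+six bytes ``ED A0 88 ED BD 85`` while the UTF-8 form is the four bytes
+``F0 92 8D 85``.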
+
+The Ecmascript specification doesn't (and cannot) mandate any specific
+internal representation, nor does it provide any requirements on how a
+C API must represent strings.  The current convention of using CESU-8
+for standard Ecmascript strings is thus not really mandatory.  However,
+if an alternative representation is used, it MUST behave identically as
+far as script code is concerned.
+
+Automatically combining surrogates in internal representation
+=============================================================
+
+One alternative to the current internal representation is to:
+
+* Keep the current CESU-8 + UTF-8 + extended UTF-8 as the base representation.
+
+* When conceptual Ecmascript strings contain correctly paired surrogates,
+  combine the surrogates into the actual non-BMP codepoint.  The resulting
+  codepoint is then valid UTF-8 and not CESU-8.
+
+* When a non-paired surrogate is found, encode it as CESU-8 as before.
+
+* This process must be applied to all inputs, both script code and C code,
+  so that a certain conceptual Ecmascript string has a unique duk_hstring
+  representation.  (If this is not the case, string comparison using an
+  interned string pointer would no longer be valid which leads to a lot of
+  complications.)
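+
+A rough sketch of what such a normalization pass could look like follows.
+It assumes the input is already in the CESU-8 + UTF-8 base representation
+and that the output buffer is at least as large as the input (the output
+never grows, since six bytes shrink to four).  The function names are
+hypothetical; this is not existing Duktape code, and a real implementation
+would be integrated into string interning.
+

```c
#include <stddef.h>
#include <stdint.h>

/* Check for a 3-byte encoded surrogate at p.  second_hi is 0xA0 for a
 * high surrogate (U+D800 to U+DBFF) and 0xB0 for a low surrogate
 * (U+DC00 to U+DFFF).  Hypothetical helper. */
static int is_surrogate_seq(const uint8_t *p, uint8_t second_hi) {
    return p[0] == 0xED &&
           (p[1] & 0xF0) == second_hi &&
           (p[2] & 0xC0) == 0x80;
}

/* Decode a 3-byte sequence into its 16-bit value. */
static uint32_t decode_3byte(const uint8_t *p) {
    return ((uint32_t) (p[0] & 0x0F) << 12) |
           ((uint32_t) (p[1] & 0x3F) << 6) |
           (uint32_t) (p[2] & 0x3F);
}

/* Copy in[0..len-1] to out, replacing every correctly paired surrogate
 * (high immediately followed by low) with the 4-byte UTF-8 encoding of
 * the combined codepoint.  Unpaired surrogates and all other bytes are
 * copied unchanged.  Returns the number of bytes written. */
size_t combine_surrogates(const uint8_t *in, size_t len, uint8_t *out) {
    size_t i = 0;
    size_t o = 0;

    while (i < len) {
        if (len - i >= 6 &&
            is_surrogate_seq(in + i, 0xA0) &&
            is_surrogate_seq(in + i + 3, 0xB0)) {
            uint32_t hi = decode_3byte(in + i);
            uint32_t lo = decode_3byte(in + i + 3);
            uint32_t cp = 0x10000UL + ((hi - 0xD800UL) << 10) + (lo - 0xDC00UL);

            out[o++] = (uint8_t) (0xF0 | (cp >> 18));
            out[o++] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (uint8_t) (0x80 | (cp & 0x3F));
            i += 6;
        } else {
            out[o++] = in[i++];
        }
    }
    return o;
}
```

+
+With this sketch the CESU-8 input ``ED A0 88 ED BD 85`` becomes
+``F0 92 8D 85``, while a lone high surrogate such as ``ED A0 88`` followed
+by an ASCII byte passes through unchanged.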
+
+This would have the upside that:
+
+* Valid Unicode strings in UTF-8 codepoint range (U+0000 to U+10FFFF without
+  surrogate range U+D800 to U+DFFF) would appear as valid UTF-8 (not CESU-8)
+  in the C API.
+
+* Pushing UTF-8 strings would produce strings that behaved like standard
+  Ecmascript strings, i.e. they would conceptually have surrogate pairs in
+  place of non-BMP.
+
+And a few downsides:
+
+* All the internal code would need to maintain an "as if" illusion: such
+  strings must appear as uninterpreted 16-bit codepoint sequences, and all
+  16-bit codepoint sequences must still work without difference as far as
+  script code is concerned.  This is not trivial, more on this below.
+
+* One would no longer be able to push an arbitrary byte sequence as a string
+  (duk_push_string()) and then read it back as is.  The automatic surrogate
+  combination would mean the output might be different, with surrogates
+  represented in CESU-8 combined into UTF-8.  This is a loss of current
+  functionality which has been useful for some applications; one can e.g.
+  push ISO-8859-1 strings as is, and read them back.  Script code will see
+  such strings as being somewhat broken, but they have previously passed
+  through without modification.
+
+Some internals where the "as if" illusion must be maintained:
+
+* String ``.length`` must count non-BMP codepoints as 2 codepoints to get
+  the standard length.
+
+* String.charCodeAt() and all other String functions must use an index scheme
+  that references the conceptual 16-bit codepoint sequence index (where each
+  non-BMP counts as two indices), and allow reading, substringing, etc, of
+  each half of a surrogate pair individually.
+
+* There's no longer an easy "char offset to byte offset" internal primitive.
+  Currently such a conversion maps an integer to an integer (or error).  For
+  non-BMP characters the result would now be a tuple: an integer pointing to
+  the start of the codepoint, and a flag indicating whether we want the high
+  or the low surrogate.  All places maintaining "current offset" must track
+  that additional flag somehow (it could maybe be encoded as the high bit of
+  a 32-bit unsigned value?).
+
+* When doing string replacements, code must always check whether the
+  replacements created valid surrogate pairs from previously unpaired
+  surrogates.  Such surrogates may appear at the edges of replacement
+  strings, and they must be merged to maintain a unique string representation.
+
+* When combining strings, code must check for previously unpaired surrogates
+  that meet at the string join point and merge them.
+
+* RegExp matching must match non-BMP codepoints as two surrogates individually
+  as far as patterns are concerned.  It must be possible to capture only one
+  of the surrogates and to backtrack each surrogate individually, and the
+  match start offset must try both surrogates as starting points, etc.
+
+* RegExp /u mode would work trivially with this internal representation, as
+  the codepoints are already combined.
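+
+The ``.length`` and "char offset to byte offset" points above can be sketched
+in C as follows.  This is only an illustration under stated assumptions: the
+input is well-formed CESU-8/UTF-8 (extended UTF-8 is ignored), byte offsets
+fit in 31 bits, and the low surrogate flag is packed into the high bit of a
+32-bit value as speculated above.  All names (``es_length()``,
+``char_to_byte_offset()``, ``LOW_SURROGATE_FLAG``) are hypothetical.
+

```c
#include <stddef.h>
#include <stdint.h>

#define LOW_SURROGATE_FLAG UINT32_C(0x80000000)  /* hypothetical flag bit */
#define OFFSET_NOT_FOUND   UINT32_C(0xFFFFFFFF)  /* hypothetical error value */

/* For lead byte b, report the encoded length in bytes and the number of
 * conceptual 16-bit units.  Anything unexpected is stepped over as one
 * byte and one unit, which is only a placeholder policy for this sketch. */
static void seq_info(uint8_t b, size_t *nbytes, size_t *nunits) {
    *nunits = 1;
    if (b < 0x80) {
        *nbytes = 1;
    } else if ((b & 0xE0) == 0xC0) {
        *nbytes = 2;
    } else if ((b & 0xF0) == 0xE0) {
        *nbytes = 3;
    } else if ((b & 0xF8) == 0xF0) {
        *nbytes = 4;
        *nunits = 2;  /* non-BMP: counts as a surrogate pair */
    } else {
        *nbytes = 1;
    }
}

/* Conceptual Ecmascript .length: non-BMP codepoints count as two. */
size_t es_length(const uint8_t *buf, size_t len) {
    size_t i = 0;
    size_t n = 0;
    while (i < len) {
        size_t nbytes, nunits;
        seq_info(buf[i], &nbytes, &nunits);
        if (nbytes > len - i) {
            nbytes = len - i;  /* truncated sequence at end of buffer */
        }
        n += nunits;
        i += nbytes;
    }
    return n;
}

/* Map a conceptual 16-bit index to a byte offset.  If the index refers to
 * the low surrogate of a combined non-BMP codepoint, the result is the
 * byte offset of that codepoint with LOW_SURROGATE_FLAG set. */
uint32_t char_to_byte_offset(const uint8_t *buf, size_t len, size_t char_idx) {
    size_t i = 0;
    size_t unit = 0;
    while (i < len) {
        size_t nbytes, nunits;
        seq_info(buf[i], &nbytes, &nunits);
        if (nbytes > len - i) {
            nbytes = len - i;
        }
        if (char_idx == unit) {
            return (uint32_t) i;
        }
        if (nunits == 2 && char_idx == unit + 1) {
            return (uint32_t) i | LOW_SURROGATE_FLAG;
        }
        unit += nunits;
        i += nbytes;
    }
    return OFFSET_NOT_FOUND;
}
```

+
+For the six-byte buffer ``61 F0 92 8D 85 62`` (``a``, U+12345, ``b``) the
+conceptual length is 4.  Indices 1 and 2 both map to byte offset 1, the
+latter with the flag bit set, and index 3 maps to byte offset 5.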