
Notes on an internal representation alternative

pull/1695/head
Sami Vaarala 7 years ago
parent
commit
d24f590f48
  1. 141
      doc/utf8-internal-representation.rst


@@ -0,0 +1,141 @@
======================================
Using UTF-8 as internal representation
======================================
Some notes on using UTF-8 as the internal representation for Ecmascript
strings, with correctly paired surrogates combined into non-BMP codepoints.
Current representation
======================
The current internal representation is a union of:
* CESU-8: to support full 16-bit codepoint sequences without limitations.
In particular, individual and unpaired surrogates must work without
interpretation or conversion.
* UTF-8: to support non-BMP characters, if they are created from C code
or e.g. using String.fromCharCode(0x12345).
* Extended UTF-8: to support codepoints up to U+FFFFFFFF. This is now
only needed by the regexp bytecode, which uses extended UTF-8 as its
internal representation and needs to represent long offsets as
codepoints.
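
To make the difference between the first two encodings concrete, here is a
minimal sketch that encodes one codepoint both ways. The function names
``encode_utf8`` and ``encode_cesu8`` are hypothetical helpers written for this
note, not Duktape internals, and extended UTF-8 is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one codepoint (U+0000..U+10FFFF) as standard
 * UTF-8.  Returns the number of bytes written to out (at most 4).
 */
size_t encode_utf8(uint32_t cp, uint8_t *out) {
    if (cp < 0x80) {
        out[0] = (uint8_t) cp;
        return 1;
    } else if (cp < 0x800) {
        out[0] = (uint8_t) (0xc0 | (cp >> 6));
        out[1] = (uint8_t) (0x80 | (cp & 0x3f));
        return 2;
    } else if (cp < 0x10000) {
        /* Surrogates U+D800..U+DFFF also land here (3 bytes). */
        out[0] = (uint8_t) (0xe0 | (cp >> 12));
        out[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3f));
        out[2] = (uint8_t) (0x80 | (cp & 0x3f));
        return 3;
    }
    out[0] = (uint8_t) (0xf0 | (cp >> 18));
    out[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3f));
    out[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3f));
    out[3] = (uint8_t) (0x80 | (cp & 0x3f));
    return 4;
}

/* Hypothetical helper: encode one codepoint as CESU-8.  BMP codepoints use
 * the same 1-3 byte forms as UTF-8; a non-BMP codepoint is split into a
 * surrogate pair and each half is encoded as 3 bytes (6 bytes total).
 */
size_t encode_cesu8(uint32_t cp, uint8_t *out) {
    uint32_t v;
    size_t n;

    if (cp < 0x10000) {
        return encode_utf8(cp, out);
    }
    v = cp - 0x10000;
    n = encode_utf8(0xd800 + (v >> 10), out);             /* high surrogate */
    return n + encode_utf8(0xdc00 + (v & 0x3ff), out + n); /* low surrogate */
}
```

For U+12345 the UTF-8 form is the 4 bytes ``F0 92 8D 85`` while the CESU-8
form is the 6 bytes ``ED A0 88 ED BD 85``.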
C API problem with current representation
=========================================
One concrete problem with this arrangement is that non-BMP characters in
standard Ecmascript strings are internally represented as CESU-8 encoded
surrogate pairs:
* If source code contains a non-BMP character, the Ecmascript specification
requires that such a character is decoded into surrogates, from
https://www.ecma-international.org/ecma-262/5.1/#sec-6:
- If an actual source text is encoded in a form other than 16-bit code
units it must be processed as if it was first converted to UTF-16.
* This means that ``x = '\u{12345}'`` and ``x = '\ud808\udf45'`` MUST be
treated identically. For example, for both inputs:
- The string's ``.length`` must be 2.
- ``x[0]`` must be 0xd808, and ``x[1]`` must be 0xdf45.
- RegExps must be able to match the individual surrogates, and one must
be able to e.g. backtrack each surrogate separately.
- It must be possible to take a substring with one end falling between
the two surrogates.
* In the current C API such a string will appear CESU-8 encoded because
that's the internal representation used for surrogate codepoints.
* Applications dealing natively with UTF-8 would often prefer to see UTF-8
rather than CESU-8, thus avoiding the need to transcode CESU-8 to UTF-8.
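
The surrogate values quoted above follow from the standard UTF-16 split. A
small sketch of that arithmetic, using hypothetical helper names
(``split_surrogates``, ``combine_surrogates``) that are not part of any
existing API:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a non-BMP codepoint (U+10000..U+10FFFF) into
 * the UTF-16 surrogate pair that script code must be able to observe.
 */
void split_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo) {
    uint32_t v;

    assert(cp >= 0x10000 && cp <= 0x10ffff);
    v = cp - 0x10000;
    *hi = (uint16_t) (0xd800 + (v >> 10));    /* top 10 bits */
    *lo = (uint16_t) (0xdc00 + (v & 0x3ff));  /* bottom 10 bits */
}

/* Hypothetical helper: inverse operation for a correctly paired
 * high + low surrogate.
 */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo) {
    assert(hi >= 0xd800 && hi <= 0xdbff);
    assert(lo >= 0xdc00 && lo <= 0xdfff);
    return 0x10000 + (((uint32_t) (hi - 0xd800)) << 10) +
           (uint32_t) (lo - 0xdc00);
}
```

With ``cp = 0x12345`` the split yields 0xd808 and 0xdf45, matching the
``x[0]`` and ``x[1]`` values required above.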
The Ecmascript specification doesn't (and cannot) mandate any specific
internal representation, nor does it provide any requirements on how a
C API must represent strings. The current convention of using CESU-8
for standard Ecmascript strings is thus not really mandatory. However,
if an alternative representation is used, it MUST behave identically as
far as script code is concerned.
Automatically combining surrogates in internal representation
=============================================================
One alternative to the current internal representation is to:
* Keep the current CESU-8 + UTF-8 + extended UTF-8 as the base representation.
* When conceptual Ecmascript strings contain correctly paired surrogates,
combine the surrogates into the actual non-BMP codepoint. The resulting
codepoint is then valid UTF-8 and not CESU-8.
* When a non-paired surrogate is found, encode it as CESU-8 as before.
* This process must be applied to all inputs, both script code and C code,
so that a certain conceptual Ecmascript string has a unique duk_hstring
representation. (If this is not the case, string comparison using an
interned string pointer would no longer be valid which leads to a lot of
complications.)
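
The combining step could look roughly like the sketch below. It is only an
illustration under simplifying assumptions: the function names are
hypothetical, the input is treated as a plain byte buffer, and extended UTF-8
is ignored. It is not how Duktape string interning is implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: if a 3-byte CESU-8 encoded surrogate
 * (U+D800..U+DFFF, bytes ED A0..BF 80..BF) starts at p, store its value in
 * *out and return 1; otherwise return 0.
 */
int read_surrogate(const uint8_t *p, size_t avail, uint32_t *out) {
    if (avail < 3 || p[0] != 0xed || (p[1] & 0xe0) != 0xa0 ||
        (p[2] & 0xc0) != 0x80) {
        return 0;
    }
    *out = 0xd000 | ((uint32_t) (p[1] & 0x3f) << 6) | (uint32_t) (p[2] & 0x3f);
    return 1;
}

/* Hypothetical helper: copy in[0..len) to out, replacing every correctly
 * paired high + low surrogate (6 bytes of CESU-8) with the 4-byte UTF-8
 * encoding of the combined codepoint.  Unpaired surrogates and all other
 * bytes are copied unchanged.  The output never exceeds len bytes.
 * Returns the output length.
 */
size_t combine_surrogate_pairs(const uint8_t *in, size_t len, uint8_t *out) {
    size_t i = 0, n = 0;
    uint32_t hi, lo, cp;

    while (i < len) {
        if (read_surrogate(in + i, len - i, &hi) && hi <= 0xdbff &&
            read_surrogate(in + i + 3, len - i - 3, &lo) && lo >= 0xdc00) {
            cp = 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
            out[n++] = (uint8_t) (0xf0 | (cp >> 18));
            out[n++] = (uint8_t) (0x80 | ((cp >> 12) & 0x3f));
            out[n++] = (uint8_t) (0x80 | ((cp >> 6) & 0x3f));
            out[n++] = (uint8_t) (0x80 | (cp & 0x3f));
            i += 6;
        } else {
            out[n++] = in[i++];
        }
    }
    return n;
}
```

Because every input passes through the same function, two byte sequences that
denote the same conceptual Ecmascript string end up with identical bytes,
which is what pointer comparison of interned strings relies on.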
This would have the upside that:
* Valid Unicode strings in UTF-8 codepoint range (U+0000 to U+10FFFF without
surrogate range U+D800 to U+DFFF) would appear as valid UTF-8 (not CESU-8)
in the C API.
* Pushing UTF-8 strings would produce strings that behaved like standard
Ecmascript strings, i.e. they would conceptually have surrogate pairs in
place of non-BMP.
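
As an illustration of how a combined string would still report the standard
length, the sketch below counts 16-bit code units directly from the bytes. The
name ``ecma_length`` is hypothetical, and the code assumes the buffer holds
only well-formed UTF-8 or CESU-8 (no extended UTF-8).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit code units that script code would
 * see for a string stored as UTF-8 with CESU-8 for unpaired surrogates.
 * Continuation bytes (10xxxxxx) are skipped; a 4-byte lead (0xf0 and up)
 * stands for a non-BMP codepoint and counts as two code units.
 */
size_t ecma_length(const uint8_t *s, size_t len) {
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        if ((s[i] & 0xc0) == 0x80) {
            continue;  /* continuation byte, part of a previous codepoint */
        }
        n += (s[i] >= 0xf0) ? 2 : 1;
    }
    return n;
}
```

A pushed UTF-8 string such as ``'a'``, U+12345, ``'b'`` (6 bytes) would thus
have a conceptual length of 4.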
And a few downsides:
* All the internal code would need to maintain an "as if" illusion: such
strings must appear as uninterpreted 16-bit codepoint sequences, and all
16-bit codepoint sequences must still work without difference as far as
script code is concerned. This is not trivial, more on this below.
* One would no longer be able to push an arbitrary byte sequence as a string
(duk_push_string()) and then read it back as is. The automatic surrogate
combination would mean the output might be different, with surrogates
represented in CESU-8 combined into UTF-8. This is a loss of current
functionality which has been useful for some applications; one can e.g.
push ISO-8859-1 strings as is, and read them back. Script code will see
such strings as being somewhat broken, but they have previously passed
through without modification.
Some internals where the "as if" illusion must be maintained:
* String ``.length`` must count non-BMP codepoints as 2 codepoints to get
the standard length.
* String.charCodeAt() and all other String functions must use an index scheme
that references the conceptual 16-bit codepoint sequence index (where each
non-BMP counts as two indices), and allow reading, substringing, etc, of
each half of a surrogate pair individually.
* There's no longer an easy "char offset to byte offset" internal primitive.
Currently such a conversion maps an integer to an integer (or error). For
non-BMP characters the result would now be a tuple: an integer pointing to
the start of the codepoint, and a flag indicating whether we want the high
or the low surrogate. All places maintaining "current offset" must track
that additional flag somehow (it could maybe be encoded as the high bit of
a 32-bit unsigned value?).
* When doing string replacements, code must always check whether the
replacements created valid surrogate pairs from previously unpaired
surrogates. They must be merged, to maintain a unique string representation.
Such surrogates may appear at the edges of replacement strings.
* When concatenating strings, code must check for previously unpaired
surrogates at the join point and merge them if they now form a valid pair.
* RegExp matching must match non-BMP codepoints as two surrogates individually
as far as patterns are concerned. It must be possible to capture only one
of the surrogates and to backtrack over each surrogate individually, and the
search for a match start must try both surrogates as starting points.
* RegExp /u mode would work trivially with this internal representation, as
the codepoints are already combined.
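
The "char offset to byte offset" tuple described in the list above could be
packed into a single 32-bit value as suggested. The following is a rough
sketch only: all names are hypothetical, the input is assumed to be
well-formed UTF-8 or CESU-8 shorter than 2^31 bytes, and a real implementation
would presumably avoid the linear scan.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packing: low 31 bits hold the byte offset of the start of the
 * encoded codepoint, the high bit selects the low surrogate half.
 */
#define LOW_SURROGATE_FLAG  UINT32_C(0x80000000)
#define OFFSET_NOT_FOUND    UINT32_C(0xffffffff)

/* Map a 16-bit code unit index to a packed byte offset, or return
 * OFFSET_NOT_FOUND if the index is past the end of the string.
 */
uint32_t char_to_byte_offset(const uint8_t *s, uint32_t len, uint32_t char_idx) {
    uint32_t i = 0, nbytes, units;

    while (i < len) {
        uint8_t b = s[i];
        nbytes = (b < 0x80) ? 1 : (b < 0xe0) ? 2 : (b < 0xf0) ? 3 : 4;
        units = (nbytes == 4) ? 2 : 1;  /* non-BMP counts as two indices */
        if (char_idx < units) {
            return (char_idx == 1) ? (i | LOW_SURROGATE_FLAG) : i;
        }
        char_idx -= units;
        i += nbytes;
    }
    return OFFSET_NOT_FOUND;
}

/* Read the 16-bit code unit designated by a packed offset; this is roughly
 * what a charCodeAt() style lookup would need.
 */
uint16_t code_unit_at(const uint8_t *s, uint32_t packed) {
    uint32_t i = packed & UINT32_C(0x7fffffff);
    uint8_t b = s[i];
    uint32_t cp;

    if (b < 0x80) {
        return b;
    }
    if (b < 0xe0) {
        return (uint16_t) (((b & 0x1f) << 6) | (s[i + 1] & 0x3f));
    }
    if (b < 0xf0) {
        return (uint16_t) (((b & 0x0f) << 12) | ((s[i + 1] & 0x3f) << 6) |
                           (s[i + 2] & 0x3f));
    }
    cp = ((uint32_t) (b & 0x07) << 18) | ((uint32_t) (s[i + 1] & 0x3f) << 12) |
         ((uint32_t) (s[i + 2] & 0x3f) << 6) | (uint32_t) (s[i + 3] & 0x3f);
    cp -= 0x10000;
    if (packed & LOW_SURROGATE_FLAG) {
        return (uint16_t) (0xdc00 + (cp & 0x3ff));
    }
    return (uint16_t) (0xd800 + (cp >> 10));
}
```

For the 6-byte string ``'a'``, U+12345, ``'b'``, indices 1 and 2 both map to
byte offset 1, with the flag set only for index 2; the two lookups return
0xd808 and 0xdf45 respectively.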