mirror of https://github.com/svaarala/duktape.git
Sami Vaarala
7 years ago
======================================
Using UTF-8 as internal representation
======================================

Some notes on using UTF-8 as internal representation for Ecmascript strings
when surrogate pairs can be combined.

Current representation
======================

Current internal representation is a union of:

* CESU-8: to support full 16-bit codepoint sequences without limitations.
  In particular, individual and unpaired surrogates must work without
  interpretation or conversion.

* UTF-8: to support non-BMP characters, if they are created from C code
  or e.g. using String.fromCharCode(0x12345).

* Extended UTF-8: to support codepoints up to U+FFFFFFFF. This is now
  only needed by the regexp bytecode, which uses extended UTF-8 as its
  internal representation and needs to represent long offsets as
  codepoints.

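To make the difference between the first two encodings concrete, here is a
minimal sketch that encodes a single codepoint both ways. The function names
``sketch_encode_utf8`` and ``sketch_encode_cesu8`` are hypothetical and not
part of Duktape; extended UTF-8 is assumed to be out of scope.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one codepoint as standard UTF-8; returns bytes written (1 to 4).
 * Surrogate codepoints (U+D800 to U+DFFF) are not rejected here: they get
 * the plain 3-byte form, which is what CESU-8 uses for each surrogate.
 * Hypothetical helper, not a Duktape function.
 */
size_t sketch_encode_utf8(uint32_t cp, uint8_t *out) {
    assert(cp <= 0x10FFFFUL);  /* assumption: no extended UTF-8 here */
    if (cp < 0x80UL) {
        out[0] = (uint8_t) cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (uint8_t) (0xC0 | (cp >> 6));
        out[1] = (uint8_t) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        out[0] = (uint8_t) (0xE0 | (cp >> 12));
        out[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t) (0xF0 | (cp >> 18));
    out[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t) (0x80 | (cp & 0x3F));
    return 4;
}

/* Encode one codepoint as CESU-8: a non-BMP codepoint is first split into
 * a surrogate pair, and each surrogate is then encoded as 3 bytes.
 */
size_t sketch_encode_cesu8(uint32_t cp, uint8_t *out) {
    uint32_t v;
    size_t n;
    if (cp < 0x10000UL) {
        return sketch_encode_utf8(cp, out);
    }
    v = cp - 0x10000UL;
    n = sketch_encode_utf8(0xD800UL + (v >> 10), out);
    return n + sketch_encode_utf8(0xDC00UL + (v & 0x3FFUL), out + n);
}
```

For U+12345 the UTF-8 form is the 4 bytes ``F0 92 8D 85``, while the CESU-8
form is the 6 bytes ``ED A0 88 ED BD 85``.
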
C API problem with current representation
=========================================

One concrete problem with this arrangement is that non-BMP characters
originating from script source code are internally represented as CESU-8:

* If source code contains a non-BMP character, the Ecmascript specification
  requires that such a character is decoded into surrogates, from
  https://www.ecma-international.org/ecma-262/5.1/#sec-6:

  - If an actual source text is encoded in a form other than 16-bit code
    units it must be processed as if it was first converted to UTF-16.

* This means that ``x = '\u{12345}'`` and ``x = '\ud808\udf45'`` MUST be
  treated identically. For example, for both inputs:

  - The string's ``.length`` must be 2.

  - ``x[0]`` must be 0xd808, and ``x[1]`` must be 0xdf45.

  - RegExps must be able to match the individual surrogates, and one must
    be able to e.g. backtrack each surrogate separately.

  - It must be possible to take a substring with one end between the two
    surrogate codepoints.

* In the current C API such a string will appear CESU-8 encoded because
  that's the internal representation used for surrogate codepoints.

* Applications dealing natively with UTF-8 would often prefer to see UTF-8
  rather than CESU-8, thus avoiding the need to transcode CESU-8 to UTF-8.

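The "converted to UTF-16" requirement quoted above can be illustrated with a
small sketch that splits a codepoint into 16-bit code units. The name
``sketch_to_utf16`` is hypothetical; the arithmetic is the standard UTF-16
surrogate formula.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one codepoint to UTF-16 code units.  Returns the number of
 * units written: 1 for a BMP codepoint, 2 (high then low surrogate)
 * for a non-BMP codepoint.  Hypothetical helper, not a Duktape function.
 */
size_t sketch_to_utf16(uint32_t cp, uint16_t units[2]) {
    uint32_t v;
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x10000UL) {
        units[0] = (uint16_t) cp;
        return 1;
    }
    v = cp - 0x10000UL;                                /* 20-bit value */
    units[0] = (uint16_t) (0xD800UL + (v >> 10));      /* top 10 bits */
    units[1] = (uint16_t) (0xDC00UL + (v & 0x3FFUL));  /* low 10 bits */
    return 2;
}
```

For U+12345 this gives two units, 0xd808 and 0xdf45, which matches the
``.length`` and ``x[0]``/``x[1]`` values listed above.
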
The Ecmascript specification doesn't (and cannot) mandate any specific
internal representation, nor does it provide any requirements on how a
C API must represent strings. The current convention of using CESU-8
for standard Ecmascript strings is thus not really mandatory. However,
if an alternative representation is used, it MUST behave identically as
far as script code is concerned.

Automatically combining surrogates in internal representation
=============================================================

One alternative to the current internal representation is to:

* Keep the current CESU-8 + UTF-8 + extended UTF-8 as the base representation.

* When conceptual Ecmascript strings contain correctly paired surrogates,
  combine the surrogates into the actual non-BMP codepoint. The resulting
  codepoint is then encoded as valid UTF-8 rather than CESU-8.

* When a non-paired surrogate is found, encode it as CESU-8 as before.

* This process must be applied to all inputs, both script code and C code,
  so that a certain conceptual Ecmascript string has a unique duk_hstring
  representation. (If this is not the case, string comparison using an
  interned string pointer would no longer be valid, which leads to a lot of
  complications.)

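The combining rule can be sketched as an encoder from a conceptual 16-bit
code unit sequence to bytes. This is only an illustration under stated
assumptions: the names ``sketch_encode_combined`` and ``put_cp`` are
hypothetical, and the caller is assumed to provide an output buffer of at
least 3 bytes per input unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int is_high(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Append one codepoint in UTF-8 form (a lone surrogate gets 3 bytes). */
static size_t put_cp(uint32_t cp, uint8_t *out) {
    if (cp < 0x80UL) {
        out[0] = (uint8_t) cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (uint8_t) (0xC0 | (cp >> 6));
        out[1] = (uint8_t) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        out[0] = (uint8_t) (0xE0 | (cp >> 12));
        out[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t) (0xF0 | (cp >> 18));
    out[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t) (0x80 | (cp & 0x3F));
    return 4;
}

/* Encode n code units: a high surrogate directly followed by a low
 * surrogate is combined into one 4-byte UTF-8 sequence; any other
 * surrogate is left unpaired and encoded as 3 bytes, as in CESU-8.
 * Returns the number of bytes written to out.
 */
size_t sketch_encode_combined(const uint16_t *units, size_t n, uint8_t *out) {
    size_t i = 0, len = 0;
    assert(units != NULL || n == 0);
    while (i < n) {
        uint32_t cp = units[i++];
        if (is_high((uint16_t) cp) && i < n && is_low(units[i])) {
            cp = 0x10000UL + ((cp - 0xD800UL) << 10) + (units[i++] - 0xDC00UL);
        }
        len += put_cp(cp, out + len);
    }
    return len;
}
```

Because every input passes through the same rule, a given code unit sequence
always maps to the same byte sequence, which is the uniqueness property that
interned string comparison relies on.
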
This alternative would have the following upsides:

* Valid Unicode strings in UTF-8 codepoint range (U+0000 to U+10FFFF without
  surrogate range U+D800 to U+DFFF) would appear as valid UTF-8 (not CESU-8)
  in the C API.

* Pushing UTF-8 strings would produce strings that behaved like standard
  Ecmascript strings, i.e. they would conceptually have surrogate pairs in
  place of non-BMP codepoints.

And a few downsides:

* All the internal code would need to maintain an "as if" illusion: such
  strings must appear as uninterpreted 16-bit codepoint sequences, and all
  16-bit codepoint sequences must still work without difference as far as
  script code is concerned. This is not trivial; more on this below.

* One would no longer be able to push an arbitrary byte sequence as a string
  (duk_push_string()) and then read it back as is. The automatic surrogate
  combination would mean the output might be different, with surrogates
  represented in CESU-8 combined into UTF-8. This is a loss of current
  functionality which has been useful for some applications; one can e.g.
  push ISO-8859-1 strings as is, and read them back. Script code will see
  such strings as being somewhat broken, but they have previously passed
  through without modification.

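The loss of byte-exact round trips can be seen in a sketch of what a
normalizing push might do to its input bytes. The function name
``sketch_normalize`` is hypothetical, and copying all other bytes through
unchanged is an assumption of this sketch, not a decided behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if p[0..2] is a 3-byte encoding of a surrogate whose second byte
 * lies in [b1_min, b1_max]: A0..AF for high, B0..BF for low surrogates.
 */
static int is_surr(const uint8_t *p, uint8_t b1_min, uint8_t b1_max) {
    return p[0] == 0xED && p[1] >= b1_min && p[1] <= b1_max &&
           (p[2] & 0xC0) == 0x80;
}

/* Copy n input bytes to out, replacing each CESU-8 surrogate pair
 * (6 bytes) with the 4-byte UTF-8 form of the combined codepoint.
 * All other bytes are copied as is (assumption of this sketch).
 * Returns the output length, which is at most n.
 */
size_t sketch_normalize(const uint8_t *in, size_t n, uint8_t *out) {
    size_t i = 0, len = 0;
    assert(in != NULL && out != NULL);
    while (i < n) {
        if (n - i >= 6 && is_surr(in + i, 0xA0, 0xAF) &&
            is_surr(in + i + 3, 0xB0, 0xBF)) {
            uint32_t hi = 0xD000UL | ((uint32_t) (in[i + 1] & 0x3F) << 6) |
                          (uint32_t) (in[i + 2] & 0x3F);
            uint32_t lo = 0xD000UL | ((uint32_t) (in[i + 4] & 0x3F) << 6) |
                          (uint32_t) (in[i + 5] & 0x3F);
            uint32_t cp = 0x10000UL + ((hi - 0xD800UL) << 10) + (lo - 0xDC00UL);
            out[len++] = (uint8_t) (0xF0 | (cp >> 18));
            out[len++] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
            out[len++] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
            out[len++] = (uint8_t) (0x80 | (cp & 0x3F));
            i += 6;
        } else {
            out[len++] = in[i++];
        }
    }
    return len;
}
```

Pushing the 6 bytes ``ED A0 88 ED BD 85`` would then read back as the 4
bytes ``F0 92 8D 85``, so the caller no longer gets its original bytes.
A lone surrogate such as ``ED A0 88`` is left untouched.
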
Some internals where the "as if" illusion must be maintained:

* String ``.length`` must count non-BMP codepoints as 2 codepoints to get
  the standard length.

* String.charCodeAt() and all other String functions must use an index scheme
  that references the conceptual 16-bit codepoint sequence index (where each
  non-BMP codepoint counts as two indices), and allow reading, substringing,
  etc, of each surrogate of a pair individually.

* There's no longer an easy "char offset to byte offset" internal primitive.
  Currently such a conversion maps an integer to an integer (or error). For
  non-BMP characters the result would now be a tuple: an integer pointing to
  the start of the codepoint, and a flag indicating whether we want the high
  or the low surrogate. All places maintaining "current offset" must track
  that additional flag somehow (it could maybe be encoded as the high bit of
  a 32-bit unsigned value?).

* When doing string replacements, code must always check whether the
  replacements created valid surrogate pairs from previously unpaired
  surrogates. Such pairs must be merged to maintain a unique string
  representation. Such surrogates may appear at the edges of replacement
  strings.

* When combining strings, code must check for previously unpaired surrogates
  at the string join point.

* RegExp matching must match non-BMP codepoints as two surrogates individually
  as far as patterns are concerned. It must be possible to capture only one
  of the surrogates, backtrack each surrogate individually, match start offset
  must try both surrogates as starting points, etc.

* RegExp /u mode would work trivially with this internal representation, as
  the codepoints are already combined.
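
The length and offset points above can be sketched as follows. This is a
rough illustration only: all names (``sketch_length``, ``sketch_char_pos``,
``sketch_char_to_byte_offset``, ``sketch_char_code_at``) are hypothetical,
the input is assumed to be well-formed combined data with no extended UTF-8,
and the lookup is a plain linear scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of a char offset lookup: byte offset of the codepoint plus a
 * flag selecting the low surrogate half of a 4-byte sequence.
 */
typedef struct {
    size_t byte_off;
    int low_half;
} sketch_char_pos;

/* Sequence length from a lead byte (assumes well-formed input). */
static size_t seq_len(uint8_t b) {
    return b < 0x80 ? 1 : (b < 0xE0 ? 2 : (b < 0xF0 ? 3 : 4));
}

/* Conceptual .length: every sequence counts as one 16-bit unit, and a
 * 4-byte (non-BMP) sequence counts as two.
 */
size_t sketch_length(const uint8_t *s, size_t nbytes) {
    size_t i = 0, len = 0;
    while (i < nbytes) {
        size_t clen = seq_len(s[i]);
        len += (clen == 4) ? 2 : 1;
        i += clen;
    }
    return len;
}

/* Map a conceptual char index to (byte offset, low-half flag).
 * Returns 1 on success, 0 if char_idx is out of range.
 */
int sketch_char_to_byte_offset(const uint8_t *s, size_t nbytes,
                               size_t char_idx, sketch_char_pos *pos) {
    size_t i = 0, unit = 0;
    assert(pos != NULL);
    while (i < nbytes) {
        size_t clen = seq_len(s[i]);
        size_t units = (clen == 4) ? 2 : 1;
        if (char_idx < unit + units) {
            pos->byte_off = i;
            pos->low_half = (char_idx == unit + 1);
            return 1;
        }
        unit += units;
        i += clen;
    }
    return 0;
}

/* charCodeAt() on the conceptual sequence; returns -1 if out of range. */
int32_t sketch_char_code_at(const uint8_t *s, size_t nbytes, size_t char_idx) {
    sketch_char_pos pos;
    const uint8_t *p;
    uint32_t v;
    if (!sketch_char_to_byte_offset(s, nbytes, char_idx, &pos)) {
        return -1;
    }
    p = s + pos.byte_off;
    switch (seq_len(p[0])) {
    case 1:
        return p[0];
    case 2:
        return ((p[0] & 0x1F) << 6) | (p[1] & 0x3F);
    case 3:
        return ((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    default:
        v = (((uint32_t) (p[0] & 0x07) << 18) | ((uint32_t) (p[1] & 0x3F) << 12) |
             ((uint32_t) (p[2] & 0x3F) << 6) | (uint32_t) (p[3] & 0x3F)) - 0x10000UL;
        return (int32_t) (pos.low_half ? 0xDC00UL + (v & 0x3FFUL)
                                       : 0xD800UL + (v >> 10));
    }
}
```

For the bytes ``41 F0 92 8D 85 42`` the conceptual length is 4, and indices
1 and 2 both resolve to byte offset 1, differing only in the low-half flag.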