Switch to using WTF-8 for the duk_hstring string representation. The main
differences from the previous extended CESU-8/UTF-8 representation are:
(1) valid surrogate pairs are automatically combined into UTF-8 on string
intern, while invalid (unpaired) surrogate characters are kept in their
CESU-8 encoding, and (2) ECMAScript code always sees surrogate pairs for
non-BMP characters.
Together, these make it more natural to work with non-BMP strings for both
ECMAScript (which no longer sees extended codepoints as before) and native
code (which now sees valid UTF-8 for non-BMP whenever possible).
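
As a rough illustration of point (2), the sketch below counts how many UTF-16
code units ECMAScript code would see for a WTF-8 buffer. The function name
sketch_wtf8_ecmascript_length is hypothetical (it is not a Duktape API), and
the input is assumed to be well-formed WTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Duktape: number of UTF-16 code units
 * (the ECMAScript .length) for a buffer assumed to be well-formed WTF-8.
 */
size_t sketch_wtf8_ecmascript_length(const uint8_t *buf, size_t len) {
    size_t units = 0;
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        uint8_t b = buf[i];

        if ((b & 0xc0) == 0x80) {
            continue;  /* continuation byte, already accounted for */
        }
        /* A 4-byte lead byte (0xf0 and above) starts a non-BMP codepoint,
         * which ECMAScript sees as a surrogate pair (two code units).
         * Everything else, including an unpaired surrogate in its 3-byte
         * form, is a single code unit.
         */
        units += (b >= 0xf0) ? 2 : 1;
    }
    return units;
}
```

For example, "a" followed by U+1F600 occupies five bytes (61 F0 9F 98 80) but
counts as three code units, while a lone high surrogate (ED A0 BD) counts as
one.
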
Internally the main change is in string interning, which now always sanitizes
input strings (but not Symbols) to WTF-8. All call sites that deal with the
byte representation of strings also need fixing. WTF-8 leads to some
challenges because it's no longer possible to e.g. find a substring with a
naive byte compare: surrogate characters may either appear directly (CESU-8)
or be baked into a non-BMP UTF-8 byte sequence.
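
The substring problem can be sketched as follows. Searching for a lone high
surrogate inside a string that contains the matching non-BMP character fails
with a byte compare, but succeeds when both strings are first decoded to
UTF-16 code units. All names here are hypothetical, the fixed 64-unit
buffers are an arbitrary simplification, and the input is assumed to be
well-formed WTF-8; this is not how the commit implements the search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode well-formed WTF-8 into UTF-16 code units; returns the unit count.
 * 'out' needs room for 'len' units (output never exceeds input length).
 */
size_t sketch_wtf8_to_utf16(const uint8_t *in, size_t len, uint16_t *out) {
    size_t i = 0, n = 0;

    while (i < len) {
        uint8_t b = in[i];
        uint32_t cp;
        size_t cont, k;

        if (b < 0x80) { cp = b; cont = 0; }
        else if (b < 0xe0) { cp = b & 0x1f; cont = 1; }
        else if (b < 0xf0) { cp = b & 0x0f; cont = 2; }
        else { cp = b & 0x07; cont = 3; }

        assert(i + cont < len);  /* sequence must not be truncated */
        for (k = 1; k <= cont; k++) {
            cp = (cp << 6) | (uint32_t) (in[i + k] & 0x3f);
        }
        i += cont + 1;

        if (cp >= 0x10000) {
            /* Non-BMP: split into a surrogate pair. */
            cp -= 0x10000;
            out[n++] = (uint16_t) (0xd800 + (cp >> 10));
            out[n++] = (uint16_t) (0xdc00 + (cp & 0x3ff));
        } else {
            out[n++] = (uint16_t) cp;
        }
    }
    return n;
}

/* Naive byte search: byte offset of first match, or -1. */
long sketch_naive_byte_index_of(const uint8_t *hay, size_t hay_len,
                                const uint8_t *needle, size_t needle_len) {
    size_t i;

    for (i = 0; i + needle_len <= hay_len; i++) {
        if (memcmp(hay + i, needle, needle_len) == 0) {
            return (long) i;
        }
    }
    return -1;
}

/* Code unit search: code unit index of first match, or -1. */
long sketch_wtf8_index_of(const uint8_t *hay, size_t hay_len,
                          const uint8_t *needle, size_t needle_len) {
    uint16_t h[64], n[64];
    size_t hn, nn, i;

    assert(hay_len <= 64 && needle_len <= 64);
    hn = sketch_wtf8_to_utf16(hay, hay_len, h);
    nn = sketch_wtf8_to_utf16(needle, needle_len, n);

    for (i = 0; i + nn <= hn; i++) {
        if (memcmp(h + i, n, nn * sizeof(uint16_t)) == 0) {
            return (long) i;
        }
    }
    return -1;
}
```

Decoding the whole string up front is simple but slow, which is in line with
the note that the baseline implementations trade speed for correctness.
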
The main places where this needs complex handling include:
* charCodeAt / codePointAt
* Extracting a substring
* String .replace()
* String .startsWith() and .endsWith()
* String .split() and search functions (like .indexOf())
* RegExp matching
* String cache behavior
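
To make the first item in the list concrete, here is a minimal sketch of a
charCodeAt-style lookup over WTF-8. The name sketch_wtf8_char_code_at is
hypothetical and the input is assumed to be well-formed WTF-8. It is an
unoptimized linear scan with no string cache, so it only illustrates why a
4-byte sequence needs special handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: UTF-16 code unit at index 'char_off' of a
 * well-formed WTF-8 buffer, or -1 if the index is out of range.
 */
long sketch_wtf8_char_code_at(const uint8_t *buf, size_t len, size_t char_off) {
    static const uint8_t lead_mask[4] = { 0x7f, 0x1f, 0x0f, 0x07 };
    size_t i = 0;      /* byte offset */
    size_t units = 0;  /* code units seen before byte offset i */

    while (i < len) {
        uint8_t b = buf[i];
        size_t cont = (b < 0x80) ? 0 : (b < 0xe0) ? 1 : (b < 0xf0) ? 2 : 3;
        size_t width = (cont == 3) ? 2 : 1;  /* non-BMP spans two units */

        assert(i + cont < len);  /* sequence must not be truncated */

        if (char_off < units + width) {
            uint32_t cp = (uint32_t) (b & lead_mask[cont]);
            size_t k;

            for (k = 1; k <= cont; k++) {
                cp = (cp << 6) | (uint32_t) (buf[i + k] & 0x3f);
            }
            if (cont < 3) {
                return (long) cp;
            }
            /* Inside a non-BMP codepoint: return the requested half. */
            cp -= 0x10000;
            if (char_off == units) {
                return (long) (0xd800 + (cp >> 10));
            }
            return (long) (0xdc00 + (cp & 0x3ff));
        }
        units += width;
        i += cont + 1;
    }
    return -1;
}
```

The key difference from a CESU-8 scan is that a single byte sequence can
cover two character offsets, so the scan has to track code units separately
from codepoints.
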
This commit fixes all the necessary sites with minimal baseline implementations
which are in some cases much slower than the previous CESU-8 ones. Further work
is needed to optimize the WTF-8 variants to perform close to CESU-8.
Isolate all char-offset-to-byte-offset and character access calls
behind helpers to help prepare for a switch to WTF-8 representation.
This change should have no visible effect yet.
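
A helper of the kind described above might look roughly like this for the
existing CESU-8 style representation, where every lead byte starts exactly
one character. The name sketch_char_to_byte_offset is hypothetical and does
not correspond to the actual helper names in the source tree; the input is
assumed to be well-formed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of character index 'char_off' in a
 * CESU-8 / extended UTF-8 buffer. Returns 'len' when 'char_off' is at or
 * past the end of the string.
 */
size_t sketch_char_to_byte_offset(const uint8_t *buf, size_t len, size_t char_off) {
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        if ((buf[i] & 0xc0) != 0x80) {
            /* Non-continuation byte: a new character starts here. */
            if (char_off == 0) {
                return i;
            }
            char_off--;
        }
    }
    return len;
}
```

Keeping this logic behind one helper means call sites do not depend on the
one-lead-byte-per-character assumption, which no longer holds once non-BMP
characters are stored as 4-byte sequences.
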
Join surrogate pairs (encoded in CESU-8) in the string intern check,
using unoptimized code. This allows working on the WTF-8 representation
when the joining is manually enabled. The test code is disabled by
default, so it should not affect current behavior.
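
The joining step can be sketched as below: a CESU-8 encoded high surrogate
immediately followed by a low surrogate (six bytes) is rewritten as one
4-byte UTF-8 sequence, and everything else is copied through unchanged. The
name sketch_join_surrogates is hypothetical, the input is assumed to be
well-formed CESU-8, and this is not the code used in the intern check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy 'in' to 'out', combining valid CESU-8
 * surrogate pairs into 4-byte UTF-8. 'out' needs room for 'len' bytes
 * (output is never longer than input). Returns the output length.
 */
size_t sketch_join_surrogates(const uint8_t *in, size_t len, uint8_t *out) {
    size_t i = 0, o = 0;

    assert(in != NULL || len == 0);

    while (i < len) {
        /* High surrogate: ED A0..AF xx, low surrogate: ED B0..BF xx. */
        if (i + 6 <= len &&
            in[i] == 0xed && (in[i + 1] & 0xf0) == 0xa0 &&
            in[i + 3] == 0xed && (in[i + 4] & 0xf0) == 0xb0) {
            /* Each surrogate carries 10 payload bits. */
            uint32_t hi = ((uint32_t) (in[i + 1] & 0x0f) << 6) |
                          (uint32_t) (in[i + 2] & 0x3f);
            uint32_t lo = ((uint32_t) (in[i + 4] & 0x0f) << 6) |
                          (uint32_t) (in[i + 5] & 0x3f);
            uint32_t cp = 0x10000 + ((hi << 10) | lo);

            out[o++] = (uint8_t) (0xf0 | (cp >> 18));
            out[o++] = (uint8_t) (0x80 | ((cp >> 12) & 0x3f));
            out[o++] = (uint8_t) (0x80 | ((cp >> 6) & 0x3f));
            out[o++] = (uint8_t) (0x80 | (cp & 0x3f));
            i += 6;
        } else {
            /* Unpaired surrogates and all other bytes pass through. */
            out[o++] = in[i++];
        }
    }
    return o;
}
```

An unpaired surrogate, or a low surrogate followed by a high one, is left in
its 3-byte form, which matches the WTF-8 rule that only valid pairs are
combined.
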