* Add DUK_F_DJGPP.h.in to detect and provide DJGPP convenience defines.
* Add a separate platform for MSDOS, now supporting DJGPP only.
* Add DJGPP identification to compiler_gcc.h.in.
Fix a realloc() memory leak which would happen when:
- a previous allocation exists ('ptr' is non-NULL); and
- the new 'size' is zero; and
- a voluntary GC (or GC torture) causes the initial realloc() attempt
to be bypassed
In this case the slow path would incorrectly assume that it was entered
after realloc() had returned NULL, which for a zero new 'size' would
mean that 'ptr' was successfully freed and no further action was necessary.
But because the realloc() had actually been bypassed, this would cause the
old 'ptr' to leak.
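
A minimal sketch of the corrected control flow, for illustration only. All
names here (sketch_heap, sketch_raw_realloc, sketch_realloc, the 'frees'
counter) are hypothetical and not taken from the Duktape sources. The raw
allocator is assumed to free 'ptr' and return NULL when 'size' is zero:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the heap state; not Duktape's real structure. */
typedef struct {
    int voluntary_gc_due;  /* nonzero: bypass the first realloc() attempt */
    int frees;             /* instrumentation for this sketch only */
} sketch_heap;

/* Assumed raw allocator semantics: size zero frees 'ptr', returns NULL. */
static void *sketch_raw_realloc(sketch_heap *heap, void *ptr, size_t size) {
    if (size == 0) {
        if (ptr != NULL) {
            free(ptr);
            heap->frees++;
        }
        return NULL;
    }
    return realloc(ptr, size);
}

void *sketch_realloc(sketch_heap *heap, void *ptr, size_t size) {
    void *res;

    if (!heap->voluntary_gc_due) {
        res = sketch_raw_realloc(heap, ptr, size);
        if (res != NULL || size == 0) {
            return res;  /* success, or 'ptr' was freed for size zero */
        }
        /* Non-zero size failed: 'ptr' is still valid, fall through. */
    }

    /* Slow path. It may be entered without any realloc() attempt having
     * been made, so a zero 'size' must not be taken to mean that 'ptr'
     * is already freed. The leaking variant would return NULL here when
     * size == 0.
     */
    heap->voluntary_gc_due = 0;  /* placeholder for running the GC */
    return sketch_raw_realloc(heap, ptr, size);
}
```

With 'voluntary_gc_due' set, sketch_realloc(heap, ptr, 0) now frees 'ptr'
exactly once in the slow path instead of returning NULL without freeing.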
Restructure string intern check:
- Compute string hash and perform strtable lookup; if found, it
must already be a valid Symbol or valid WTF-8 data so no WTF-8
sanitization steps are needed. Return found string.
- Otherwise perform a "keepcheck" to see if the candidate string
can be used as is (i.e. it is valid Symbol or valid WTF-8).
If so, we know it's not in the strtable so intern the string.
- Otherwise the string needs WTF-8 sanitization. After sanitizing,
rehash the sanitized data, perform another strtable lookup and
return existing string or intern the sanitized string.
This speeds up string intern processing for (1) strings already in
the string table and (2) valid WTF-8 strings which should be the
vast majority of strings interned. Only strings that are invalid
WTF-8, i.e. contain uncombined surrogate pairs or outright invalid
data, will need sanitization.
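
The three steps can be sketched roughly as follows. This is illustration
only: every name is hypothetical, the string table is a flat array, and the
keepcheck/sanitize helpers are toy placeholders (byte 0xFF is treated as the
only "invalid" input and replaced with '?'). Real WTF-8 sanitization may
also change the string length, which this sketch ignores:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_TABLE_MAX 64

typedef struct { uint32_t hash; size_t len; unsigned char *data; } sketch_str;

typedef struct {
    sketch_str entries[SKETCH_TABLE_MAX];
    size_t count;
    int sanitize_calls;  /* instrumentation for this sketch only */
} sketch_strtable;

/* FNV-1a, used only as a placeholder hash. */
static uint32_t sketch_hash(const unsigned char *p, size_t len) {
    uint32_t h = 2166136261u;
    size_t i;
    for (i = 0; i < len; i++) { h ^= p[i]; h *= 16777619u; }
    return h;
}

static sketch_str *sketch_lookup(sketch_strtable *t, const unsigned char *p,
                                 size_t len, uint32_t h) {
    size_t i;
    for (i = 0; i < t->count; i++) {
        sketch_str *s = &t->entries[i];
        if (s->hash == h && s->len == len && memcmp(s->data, p, len) == 0) {
            return s;
        }
    }
    return NULL;
}

static sketch_str *sketch_insert(sketch_strtable *t, const unsigned char *p,
                                 size_t len, uint32_t h) {
    sketch_str *s;
    if (t->count >= SKETCH_TABLE_MAX) { return NULL; }
    s = &t->entries[t->count];
    s->data = (unsigned char *) malloc(len + 1);
    if (s->data == NULL) { return NULL; }
    memcpy(s->data, p, len);
    s->data[len] = 0;
    s->hash = h;
    s->len = len;
    t->count++;
    return s;
}

/* Toy keepcheck: a real one would accept a valid Symbol or valid WTF-8. */
static int sketch_keepcheck(const unsigned char *p, size_t len) {
    return memchr(p, 0xff, len) == NULL;
}

/* Toy sanitizer: in-place, length preserving. */
static void sketch_sanitize(unsigned char *p, size_t len) {
    size_t i;
    for (i = 0; i < len; i++) { if (p[i] == 0xff) { p[i] = '?'; } }
}

sketch_str *sketch_intern(sketch_strtable *t, const unsigned char *p, size_t len) {
    uint32_t h = sketch_hash(p, len);
    sketch_str *s = sketch_lookup(t, p, len, h);
    unsigned char *tmp;

    if (s != NULL) { return s; }            /* 1: already interned */
    if (sketch_keepcheck(p, len)) {         /* 2: usable as is, not in table */
        return sketch_insert(t, p, len, h);
    }
    tmp = (unsigned char *) malloc(len + 1); /* 3: sanitize, rehash, re-lookup */
    if (tmp == NULL) { return NULL; }
    memcpy(tmp, p, len);
    sketch_sanitize(tmp, len);
    t->sanitize_calls++;
    h = sketch_hash(tmp, len);
    s = sketch_lookup(t, tmp, len, h);
    if (s == NULL) { s = sketch_insert(t, tmp, len, h); }
    free(tmp);
    return s;
}
```

Only the third branch pays for sanitization and a second hash computation;
the first two branches hash the input once and never copy it for validation.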
Other minor changes:
- Add some WTF-8 documentation to tentative 3.0 release notes.
- Add a 3.0 release entry.
* Remove lazy charlen support. Since we need to WTF-8 sanitize the entire
input string, charlen can be computed while validating (avoiding extra
book-keeping for ASCII eventually).
* Improve WTF-8 search forwards/backwards performance (no substring operations)
when the search string is valid UTF-8. Still use the reference
implementation for non-UTF-8 search strings, to be optimized later.
* Minor testcase improvements.
Switch to using WTF-8 for duk_hstring string representation. The main
differences from the previous extended CESU-8/UTF-8 are: (1) valid surrogate
pairs are automatically combined to UTF-8 on string intern while invalid
surrogate characters are encoded in CESU-8, and (2) ECMAScript code always
sees surrogate pairs for non-BMP characters.
Together, these make it more natural to work with non-BMP strings for both
ECMAScript (which no longer sees extended codepoints as before) and native
code (which now sees valid UTF-8 for non-BMP whenever possible).
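
As an illustration of point (1), the sketch below encodes a sequence of
UTF-16 code units into WTF-8 bytes: a valid high/low surrogate pair is
combined into one 4-byte UTF-8 sequence, while an unpaired surrogate keeps
its 3-byte CESU-8 style encoding. The function names are hypothetical, and
taking UTF-16 code units as input is an assumption made for brevity (actual
string interning operates on byte input):

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one codepoint (lone surrogates included) as 1 to 4 bytes. */
static size_t sketch_encode_cp(uint32_t cp, unsigned char *out) {
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xc0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xe0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char) (0x80 | (cp & 0x3f));
        return 3;
    }
    out[0] = (unsigned char) (0xf0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
    out[3] = (unsigned char) (0x80 | (cp & 0x3f));
    return 4;
}

/* 'out' must have room for 3 * n bytes. Returns the number of bytes written. */
size_t sketch_utf16_to_wtf8(const uint16_t *in, size_t n, unsigned char *out) {
    size_t i = 0, o = 0;

    while (i < n) {
        uint32_t cp = in[i++];
        if (cp >= 0xd800 && cp <= 0xdbff && i < n &&
            in[i] >= 0xdc00 && in[i] <= 0xdfff) {
            /* Valid pair: combine into a single non-BMP codepoint. */
            cp = 0x10000 + ((cp - 0xd800) << 10) + ((uint32_t) in[i++] - 0xdc00);
        }
        o += sketch_encode_cp(cp, out + o);
    }
    return o;
}
```

For example, the pair D83D DE00 (U+1F600) becomes F0 9F 98 80, whereas a
lone D83D becomes ED A0 BD.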
Internally the main change is in string interning which now always sanitizes
input strings (but not Symbols) to WTF-8. Also all call sites where the byte
representation of strings is dealt with need fixing. WTF-8 leads to some
challenges because it's no longer possible to e.g. find a substring with a
naive byte compare: surrogate characters may either appear directly (CESU-8)
or baked into a non-BMP UTF-8 byte sequence.
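
A small demonstration of the problem, using hypothetical names. ECMAScript
code sees U+1F600 as "\uD83D\uDE00", so searching for "\uD83D" should match
at index 0. In WTF-8 the haystack is the 4-byte UTF-8 sequence while the
lone-surrogate needle is a 3-byte CESU-8 style sequence, so a plain byte
search finds nothing:

```c
#include <stddef.h>
#include <string.h>

/* Naive byte-wise substring search; returns a byte offset or -1. */
long sketch_find_bytes(const unsigned char *hay, size_t hlen,
                       const unsigned char *needle, size_t nlen) {
    size_t i;

    if (nlen > hlen) {
        return -1;
    }
    for (i = 0; i + nlen <= hlen; i++) {
        if (memcmp(hay + i, needle, nlen) == 0) {
            return (long) i;
        }
    }
    return -1;
}

/* U+1F600 as stored in WTF-8 (plain UTF-8). */
const unsigned char sketch_hay[] = { 0xf0, 0x9f, 0x98, 0x80 };

/* Lone high surrogate U+D83D as stored in WTF-8 (3 bytes). */
const unsigned char sketch_needle[] = { 0xed, 0xa0, 0xbd };
```

Here sketch_find_bytes(sketch_hay, 4, sketch_needle, 3) returns -1 even
though the ECMAScript-level search is expected to succeed.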
The main places where this needs complex handling include:
* charCodeAt / codePointAt
* Extracting a substring
* String .replace()
* String .startsWith() and .endsWith()
* String .split() and search functions (like .indexOf())
* RegExp matching
* String cache behavior
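
To give an idea of what a baseline implementation for the first item might
look like, here is a sketch of a charCodeAt-style lookup over WTF-8 data.
It is not the code in this commit: the function name is hypothetical, the
input is assumed to be valid WTF-8, and the scan is linear from the start
of the string. A 4-byte sequence counts as two UTF-16 code units and is
split back into a surrogate pair on the fly:

```c
#include <stddef.h>
#include <stdint.h>

/* Return the UTF-16 code unit at index 'idx', or -1 if out of range. */
long sketch_char_code_at(const unsigned char *p, size_t len, size_t idx) {
    size_t i = 0, unit = 0;

    while (i < len) {
        unsigned char b = p[i];
        uint32_t cp;
        size_t n, k;

        /* Lead byte gives the sequence length (valid input assumed). */
        if (b < 0x80) { cp = b; n = 1; }
        else if (b < 0xe0) { cp = b & 0x1fu; n = 2; }
        else if (b < 0xf0) { cp = b & 0x0fu; n = 3; }
        else { cp = b & 0x07u; n = 4; }

        if (i + n > len) {
            return -1;  /* truncated sequence; not expected for valid input */
        }
        for (k = 1; k < n; k++) {
            cp = (cp << 6) | (p[i + k] & 0x3fu);
        }
        i += n;

        if (cp >= 0x10000) {
            /* Non-BMP: visible to ECMAScript as a surrogate pair. */
            cp -= 0x10000;
            if (unit == idx) { return (long) (0xd800 + (cp >> 10)); }
            if (unit + 1 == idx) { return (long) (0xdc00 + (cp & 0x3ff)); }
            unit += 2;
        } else {
            if (unit == idx) { return (long) cp; }
            unit += 1;
        }
    }
    return -1;
}
```

For the bytes 61 F0 9F 98 80 62 the indices 0 to 3 yield 0x61, 0xD83D,
0xDE00 and 0x62 respectively.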
This commit fixes all the necessary sites with minimal baseline implementations
which are in some cases much slower than the previous CESU-8 ones. Further work
is needed to optimize the WTF-8 variants to perform close to CESU-8.