You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
 
 

313 lines
12 KiB

============
String table
============
Overview
========
Duktape 1.x to 2.0
------------------
The ``duk_hstring`` struct doesn't contain heap link (next/prev) pointers and
strings are assumed to be tracked within the string table. Tracking is
necessary at all times so that strings can be properly freed by mark-and-sweep
heap destruction, etc.
Default string table, uses probe sequence
-----------------------------------------
::
heap->strtable
|
`--> duk_hstring * Entries are duk_hstring pointers, with a special
duk_hstring * pointer value for DELETED entries.
...
duk_hstring *
String table size is a prime, probe sequence uses modulus. The string table
is resized based on a load factor check.
Alternative string table for low memory, uses chaining
------------------------------------------------------
::
heap->strtable
|
`--> duk_strtab_entry --> listlen=0/n, union { str, strlist }
duk_strtab_entry
...
duk_strtab_entry
Each string table entry is either:
* Unused (listlen == 0, str pointer == NULL)
* Single string (listlen == 0, str pointer != NULL)
* Array of strings in a separate allocation (listlen > 0, strlist pointer != NULL)
The string table size is a constant, and the string table is not resized.
Duktape 2.1
-----------
The two string table alternatives were replaced by a single algorithm in
Duktape 2.1. The main approach is:
* Add one-way link pointers (h_next field) to the ``duk_hstring`` entries so
that the strings can be efficiently chained without an external container.
* Use a string table of size ``2 ** N`` so that a bit mask can be used to
find string position.
* In-place resize from ``2 ** N`` to ``2 ** (N + 1)`` and ``2 ** (N - 1)``.
For example, to reduce size, rehash in place to half of current size, and
then reallocate the memory chunk. If the memory allocator can satisfy the
realloc() in place, there's no need for a copy.
* The hash table size is changed based on load factor (number of strings
compared to table size). The size can be also be made constant for low
memory environments where predictability is key; this imposes no string
count limit because of chaining.
Duktape 2.1 details
===================
Resizing the top level strtable
-------------------------------
Load factor (average chain length) is easily computed by keeping track of
number of strings inserted into the table::
load_factor = avg_len = num_strings / strtable_size
Average length is a cheap way of making resizing decisions. Also maximum chain
length would be interesting for pathological cases -- however, resizing may not
always solve pathological issues if strings with matching hashes are created on
purpose. To solve that case, the hash algorithm would need to be changed on
the fly to adapt to the input (difficult because existing strings need
rehashing too).
When using chaining there's no mandatory need to do a shrink/grow at any
specific time because the string table can hold an arbitrary number of strings
at some performance cost. There are also no issues like accumulation of
DELETED entries in the hash probe approach. Because of this, it's OK to do a
resize check only from time to time, for example::
if ((num_strings & 0xff) == 0) {
/* Resize check when string count is a multiple of 256. */
}
The grow/shrink check can also be limited to string insertion; while removals
could technically lead to the need to shrink the allocation, it can almost
always be delayed to later insertion processing.
The growth step is limited to doubling the current size of the string table.
Except for very small string table sizes this is sufficient because when the
need to resize is detected, only a limited number of string table inserts have
been processed. So in practice there is no need to do multiple growth steps
at once.
The shrink step is also limited to halving the current size. Because a lot of
strings may be removed without any inserts in between (consider a large array
of strings becoming unreachable), the string table load factor may remain very
low (way below the shrink limit) until some inserts are done to detect the
situation; and even after that it may take multiple resize checks to shrink the
string table enough to get within the desired load factor range. However, this
kind of very fast shrinks are not common in practice, and the current
implementation will just do one halving per resize check. Emergency GC also
does a resize check so each emergency GC round can halve the string table size
if that's useful to satisfy the failing allocation.
The memory cost per string of the top level strtable is::
sizeof(duk_hstring *) * strtable_size / num_strings
For example:
* For a load factor 1.0 the cost per string is exactly the pointer size.
* For a load factor 2.0 the cost per string is half the pointer size.
Config options provide minimum and maximum sizes, and grow/shrink load
factor limits expressed as fixed point integers. When minimum and maximum
size are equal, no resize checks are done; this is useful for very low memory
targets where a fixed size string table is often preferred.
Resizing in place
-----------------
When the top level strtable is resized, it's useful to resize it in place so
that the structure can be ``realloc()``'d. In some cases that will avoid
making an unnecessary copy.
For growing in place:
* Reallocate the structure to twice previous size first.
* Because the hash mask size grows by one bit, the new highest mask bit
maps one existing bucket into two separate hash buckets. Strings need
to be rehashed into their proper bucket.
For shrinking in place:
* Two hash buckets, again distinguished by their highest hash mask bit,
will now be merged into a single bucket. There's no need to do any
has checks, the two buckets are just combined into one.
* When complete, reallocate the structure. This is a shrink ``realloc()``
which we assume will **always** succeed. (Handling a realloc() failure
would also be straightforward: just run the "grow" algorithm to restore
strings into their original buckets.)
Because ``realloc()`` may have side effects, the following precautions are
needed:
* Recursive strtable resizes must be prevented. Because the hash chains don't
have fixed capacity limits, this never leads to a dead end.
* When ``realloc()`` is called, the strtable size, mask, etc must be valid so
that any strings interned/freed by side effects can be handled normally
without being aware of the resize. For example, when shrinking, the buckets
must first be combined, and the strtable size, mask, etc updated *before*
the ``realloc()`` call is made.
Zero-way linking, single linking, and double linking
====================================================
Zero-way linking was used up to Duktape 2.0: ``duk_hstring`` itself doesn't
contain any link pointers. For a hash+probe approach this is fine because
strings don't need any linking: they are in the hash table as is. For a
hash+chain approach Duktape 2.0 used separate allocations for lists of string
pointers.
Single linking means a ``duk_hstring`` has only a "next" pointer; double
linking means a string also has a "prev" pointer which means more pointer
manipulation but makes some operations (e.g. random string unlinking) easier.
Upsides of double linking:
* Unlinking doesn't require scanning to find the previous element, so that
both insert and remove are O(1).
Downsides of double linking:
* One more pointer per ``duk_hstring``, more memory usage.
* More pointer manipulation because both directions of the list need managing.
Upsides of single linking:
* Only a single link pointer, less memory usage.
* Less pointer manipulation.
Downsides of single linking:
* Unlinking by refcount requires scanning from the hash chain root, so that the
previous string can be located and its 'next' pointer updated which makes
removal slower. However, it's only a significant issue if the average chain
length is high, so resizing the top level strtable array should minimize the
issue. For low memory targets with a fixed top level strtable array this may
be a more concrete issue; however, the number of strings there is also limited
by memory.
In Duktape 2.1 a single linked ``duk_hstring`` is used. On average, the cost
of scanning in removal is offset by less pointer manipulation in inserts. This
holds at least when the load factor is small (say <= 2.0).
Delayed freeing
---------------
While refcounts can be used to free strings immediately, it would also be easy
to delay string freeing to the mark-and-sweep pass. This would even allow the
refcount field to be dropped from strings (which would need adjustment to
refcount macros).
The upside of delaying string freeing is that if a string is repeatedly created
and then forgotten, the interned string which is otherwise unreachable can be
reused.
While not very common, this happens in some algorithms repeatedly; for example
when using a string as a "lookup table"::
var nybbles = "0123456789abcdef";
var res = ''; // inefficient concatenation for simplicity
for (var i = 0; i < n; i++) {
res += nybbles[inp[i] >> 4];
res += nybbles[inp[i] & 0x0f];
}
Here each lookup creates a one-character substring which is interned, appended
to ``res``, and then freed (unless a reference exists elsewhere).
For low memory targets delayed freeing would be nice because it reduces the
string header size by the refcount field. The downside is that the refcount
macros need adjustment: not all heaphdr refcount operations are the same.
Memory usage would also be less snappy.
Other changes in Duktape 2.1
============================
External string handling
------------------------
External strings are always supported by string table code in Duktape 2.0 but
only created when some external string macros are enabled. The ``duk_hstring``
accessor macros only support external strings when ``DUK_USE_HSTRING_EXTDATA``
is enabled. Revise this behavior:
* Disable external string checks also in string intern code if external
string support macros are not enabled.
For future consideration:
* External string support might be enabled in the public C API because it's
quite useful for things like memory mapped source files.
* When not pressed for RAM, add an explicit string pointer to the standard
header so that string data access doesn't need a flag check at every turn.
Future work
===========
Header definition
-----------------
In Duktape 2.0 ``duk_hstring`` header starts with a ``duk_heaphdr`` and is
then followed by further fields. If ``duk_heaphdr`` is not naturally aligned
this introduces unnecessary padding inside the struct.
Change so that shared ``duk_heaphdr`` fields are provided by a macro which
can be called in ``duk_hstring`` definition. This also simplifies field
access, e.g. ``h_str->hdr.h_next`` can be written as ``h_str->h_next``.
Raising a looked up string to top of hash chain
-----------------------------------------------
When an intern check is done and the string is already present in the string
table, the string could be "bubbled" to the top of the hash table chain when
found. If application code recreates the same string (or set of strings)
many times over, this would make further lookups faster. This could happen
e.g. when a string is used as a "lookup table" and the same substrings are
looked up over and over.
In practical testing this technique did improve some individual tests, it
didn't have a net positive effect. The added shuffling for every intern
check seems to nullify the potential (rare) benefits.
ROM string link pointer reuse
-----------------------------
ROM strings don't need a ``h_next`` field. It could be used fo string
data, provided that both arridx and clen have been dropped so that the
``duk_hstring`` struct itself is actually empty. This does need a change
to ``DUK_HSTRING_GET_DATA()`` macro though.
Remove heap->st_size field
--------------------------
Because heap->st_size is always equal to heap->st_mask + 1 (except during
heap init) the explicit heap->st_size field can be removed which reduces
book-keeping.