Browse Source

Link to wiki performance page

pull/320/merge
Sami Vaarala 9 years ago
parent
commit
7d48c332c8
  1. 471
      website/guide/performance.html

471
website/guide/performance.html

@ -1,10 +1,8 @@
<a name="performance-overview"></a> <!-- for old links -->
<a name="performance-characteristics"></a> <!-- for old links -->
<a name="performance-json-stringify-fastpath"></a> <!-- for old links -->
<h1 id="performance">Performance</h1>
<p>This section discussed Duktape specific performance characteristics and
provides some hints to avoid Duktape specific performance pitfalls.</p>
<h2 id="performance-overview">Performance overview</h2>
<p>Duktape is an interpreted engine with currently no JIT support. It uses
reference counting which makes memory usage tight at the cost of some execution
performance. Overall Duktape performance should be similar to other interpreted
@ -12,463 +10,6 @@ languages. However, up to Duktape 1.1 there has been very little performance
work and Duktape is significantly slower than e.g. Lua right now. Performance
work has been scheduled for spring 2015.</p>
<h2 id="performance-characteristics">Duktape performance characteristics</h2>
<h3>String interning</h3>
<p>Strings are
<a href="http://en.wikipedia.org/wiki/String_interning">interned</a>: only
a single copy of a certain string exists at any point in time. Interning
a string involves hashing the string and looking up a global string table
to see whether the string is already present. If so, a pointer to the
existing string is returned; if not, the string is inserted into the string
table, potentially involving a string table resize. While a string remains
reachable, it has a unique and a stable pointer which allows byte-by-byte
string comparisons to be converted to simple pointer comparisons. Also,
string hashes are computed during interning which makes the use of string
keys in internal hash tables efficient.</p>
<p>There are many downsides also. Strings cannot be modified in-place but
a copy needs to be made for every modification. For instance, repeated
string concatenation creates a temporary value for each intermediate string
which is especially bad if a result string is built one character at a time.
Duktape internal primitives, such as string case conversion and array
<code>join()</code>, try to avoid these downsides by minimizing the number
of temporary strings created.</p>
<h3>String memory representation and the string cache</h3>
<p>The internal memory representation for strings is extended UTF-8, which
represents each ASCII character with a single byte but uses two or more
bytes to represent non-ASCII characters. This reduces memory footprint for
most strings and makes strings easy to interact with in C code. However,
it makes random access expensive for non-ASCII strings. Random access is
needed for operations such as extracting a substring or looking up a
character at a certain character index.</p>
<p>Duktape automatically detects pure ASCII strings (based on the fact that
their character and byte length are identical) and provides efficient random
access to such strings.</p>
<p>However, when a string contains non-ASCII characters a <b>string cache</b>
is used to resolve a character index to an internal byte index. Duktape
maintains a few (internal define <code>DUK_HEAP_STRCACHE_SIZE</code>,
currently 4) string cache entries which remember the last byte offset and
character offset for recently accessed strings. Character index lookups
near a cached character/byte offset can be efficiently handled by scanning
backwards or forwards from the cached location. When a string access cannot
be resolved using the cache, the string is scanned either from the beginning
or the end, which is obviously very expensive for large strings. The cache
is maintained with a very simple LRU mechanism and is transparent to both
Ecmascript and C code.</p>
<p>The string cache makes simple loops like the following efficient:</p>
<pre class="ecmascript-code">
var i;
var n = inp.length;
for (i = 0; i &lt; n; i++) {
print(inp.charCodeAt(i));
}
</pre>
<p>When random accesses are made from here and there to multiple strings,
the strings may very easily fall out of the cache and become expensive
at least for longer strings.</p>
<p>Note that the cache never maintains more than one entry for each
string, so the following would be very inefficient:</p>
<pre class="ecmascript-code">
var i;
var n = inp.length;
for (i = 0; i &lt; n; i++) {
// Accessing the string alternatively from beginning and end will
// have a major performance impact.
print(inp.charCodeAt(i));
print(inp.charCodeAt(n - 1 - i));
}
</pre>
<p>As mentioned above, these performance issues are avoided entirely for
ASCII strings which behave as one would expect. More generally, Duktape
provides fast paths for ASCII characters and pure ASCII strings in internal
algorithms whenever applicable. This applies to algorithms such as case
conversion, regexp matching, etc.</p>
<h3>Buffer accesses</h3>
<p>There is a fast path for reading and writing numeric indices of plain buffer
values, e.g. <code>x = buf[123]</code> or <code>buf[123] = x</code>. The
fast path avoids coercing the index to a string (here <code>"123"</code>)
before attempting a lookup.</p>
<p>There's a similar fast path for buffer objects (Duktape.Buffer, Node.js
Buffer, ArrayBuffer, and TypedArray views), but it is about 20% slower.</p>
<h3>Object/array storage</h3>
<p>Object properties are stored in a linear key/value list which provides
stable ordering (insertion order). When an object has enough properties
(internal define <code>DUK_HOBJECT_E_USE_HASH_LIMIT</code>, currently 32),
a hash lookup table is also allocated to speed up property lookups. Even
in this case the key ordering is retained which is a practical requirement
for an Ecmascript implementation. The hash part is avoided for most objects
because it increases memory footprint and doesn't significantly speed up
property lookups for very small objects.</p>
<p>For most objects property lookup thus involves a linear comparison
against the object's property table. Because properties are kept in the
property table in their insertion order, properties added earlier are
slightly faster to access than those added later. When the object grows
large enough to gain a hash table this effect disappears.</p>
<p>Array elements are stored in a special "array part" to reduce memory
footprint and to speed up access. Accessing an array with a numeric index
officially first coerces the number to a string (e.g. <code>x[123]</code>
to <code>x["123"]</code>) and then does a string key lookup; when an object
has an array part no temporary string is actually created in most cases.</p>
<p>The array part can be "sparse", i.e. contain unmapped entries. Duktape
occasionally rechecks the density of the array part, and it it becomes too
sparse the array part is abandoned (current limit is roughly: if fewer than
25% of array part elements are mapped, the array part is abandoned). The
array entries are then converted to ordinary object properties, with every
mapped array index converted to an explicit string key (such as
<code>"123"</code>), which is relatively expensive. If an array part has
once been abandoned, it is never recreated even if the object would be dense
enough to warrant an array part.</p>
<p>Elements in the array part are required to be plain properties (not
accessors) and have default property attributes (writable, enumerable,
and configurable). If any element deviates from this, the array part is
again abandoned and array elements converted to ordinary properties.</p>
<h3>Identifier access</h3>
<p>Duktape has two modes for storing and accessing identifiers (function
arguments, local variables, function declarations): a fast path and a slow
path. The fast path is used when an identifier can be bound to a virtual
machine register, i.e., a fixed index in a virtual stack frame allocated
for a function. Identifier access is then simply an array lookup. The
slow path is used when the fast path cannot be safely used; identifier
accesses are then converted to explicit property lookups on either
external or internal objects, which is more than an order of magnitude
slower.</p>
<p>To keep identifier accesses in the fast path:</p>
<ul>
<li>Execute (almost all) inside Ecmascript functions, not in the top-level
global or eval code: global/eval code never uses fast path identifier
accesses (however, function code inside global/eval does)</li>
<li>Store frequently accessed values in local variables instead of looking
them up from the global object or other objects</li>
</ul>
<h3>Enumeration</h3>
<p>When an object is enumerated, with either the <code>for-in</code> statement
or <code>Object.keys()</code>, Duktape first traverses the target object and its
prototype chain and forms an internal enumeration object, which contains all
the enumeration keys as strings.
In particular, all array indices (or character indices in case of
strings) are converted and interned into string values before enumeration and
they remain interned until the enumeration completes. This can be memory
intensive especially if large arrays or strings are enumerated.</p>
<p>Note, however, that iterating a string or an array with <code>for-in</code>
and expecting the array elements or string indices to be enumerated in an
ascending order is non-portable. Such behavior, while guaranteed by many
implementations including Duktape, is not guaranteed by the Ecmascript
standard.</p>
<h3>Function features</h3>
<p>Ecmascript has a lot of features which make function entry and execution
quite expensive. The general goal of the Duktape Ecmascript compiler is to
avoid all the troublesome features for most functions while providing full
compatibility for the rest.</p>
<p>An ideal compiled function has all its variables and functions bound to
virtual machine registers to allow fast path identifier access, avoids
creation of the <code>arguments</code> object on entry, avoids creation of
explicit lexical environment records upon entry and during execution, and
avoids storing any lexical environment related control information such as
internal identifier-to-register binding tables.</p>
<p>The following features have a significant impact on execution performance:</p>
<ul>
<li>access to the <code>arguments</code> object: requires creation of an
expensive object upon function entry in case it is accessed</li>
<li>a direct call to <code>eval()</code>: requires initialization of
the <code>arguments</code> and full identifier binding information
needs to be retained in case evaluated code needs it</li>
<li>global and eval code in general: identifiers are never bound to
virtual machine registers but use explicit property lookups instead</li>
</ul>
<p>The following features have a more moderate impact:</p>
<ul>
<li><code>try-catch-finally</code> statement: the dynamic binding required
by the catch variable is relatively expensive</li>
<li><code>with</code> statement: the object binding required is relatively
expensive</li>
<li>use of bound functions, i.e. functions created with
<code>Function.prototype.bind()</code>: function invocation is slowed
down by handling of bound function objects and argument shuffling</li>
<li>more than about 250 formal arguments, literals, and active temporaries:
causes bytecode to use register shuffling</li>
</ul>
<p>To avoid these, isolate performance critical parts into separate minimal
functions which avoid using the features mentioned above.</p>
<h2>Minimize use of temporary strings</h2>
<p>All temporary strings are interned. It is particularly bad to accumulate
strings in a loop:</p>
<pre class="ecmascript-code">
var t = '';
for (var i = 0; i &lt; 1024; i++) {
t += 'x';
}
</pre>
<p>This will intern 1025 strings. Execution time is <code>O(n^2)</code>
where <code>n</code> is the loop limit. It is better to use a temporary
array instead:</p>
<pre class="ecmascript-code">
var t = [];
for (var i = 0; i &lt; 1024; i++) {
t[i] = 'x';
}
t = t.join('');
</pre>
<p>Here, <code>x</code> will be interned once into a function constant, and
each array entry simply refers to the same string, typically costing only 8
bytes per array entry. The final <code>Array.prototype.join()</code> avoids
unnecessary interning and creates the final string in one go.</p>
<h2>Avoid large non-ASCII strings if possible</h2>
<p>Avoid operations which require access to a random character offset inside
a large string containing one or more non-ASCII characters. Such accesses
require use of the internal "string cache" and may, in the worst case, require
a brute force scanning of the string to find the correct byte offset
corresponding to the character offset.</p>
<p>Case conversion and other Unicode related operations have fast paths for
ASCII codepoints but fall back to a slow path for non-ASCII codepoints. The
slow path is size optimized, not speed optimized, and often involve linear
range matching.</p>
<h2>Iterate over plain buffer values, not Buffer objects</h2>
<p>Both buffer objects and plain buffer values have a fast path when buffer
contents are accessed with numeric indices. The fast path for plain buffer
values is a bit faster, so your code will run faster if you get the plain
buffer before iteration:</p>
<pre class="ecmascript-code">
var b, i, n;
// Buffer object, typeof is 'object'
var bufferValue = new Duktape.Buffer('foo');
print(typeof bufferValue); // 'object'
// Get plain buffer, if already plain, no harm
b = bufferValue.valueOf();
print(typeof b); // always 'buffer'
n = b.length;
for (i = 0; i &lt; n; i++) {
print(i, b[i]); // fast path for plain buffer access
}
</pre>
<p>When creating buffers, note that <code>new Duktape.Buffer(x)</code> always
creates a Buffer object, while <code>Duktape.Buffer(x)</code> returns a plain
buffer value. This mimics how Ecmascript <code>new String()</code> and
<code>String()</code> work. Plain buffers should be preferred whenever
possible.</p>
<h2>Avoid sparse arrays when possible</h2>
<p>If an array becomes too sparse at any point, Duktape will abandon the
array part permanently and convert array properties to explicit string keyed
properties. This may happen for instance if an array is initialized with a
descending index:</p>
<pre class="ecmascript-code">
var arr = [];
for (var i = 1000; i &gt;= 0; i--) {
// bad: first write will abandon array part permanently
arr[i] = i * i;
}
</pre>
<p>Right after the first array write the array part would contain 1001
entries with only one mapped array element. The density of the array would
thus be less than 0.1%. This is way below the density limit for abandoning
the array part, so the array part is abandoned immediately. At the end the
array part would be 100% dense but will never be restored. Using an ascending
index fixes the issue:</p>
<pre class="ecmascript-code">
var arr = [];
for (var i = 0; i &lt;= 1000; i++) {
arr[i] = i * i;
}
</pre>
<p>Setting the <code>length</code> property of an array manually does not,
by itself, cause an array part to be abandoned. To simplify a bit, the array
density check compares the number of mapped elements relative to the highest
used element (actually allocated size). The <code>length</code> property does
not affect the check. Although setting an array length beforehand may
effectively pre-allocate an array in some implementations, it has no such
effect in Duktape, at least at the moment. For example:</p>
<pre class="ecmascript-code">
var arr = [];
arr.length = 1001; // array part not abandoned, but no speedup in Duktape
for (var i = 0; i &lt;= 1000; i++) {
arr[i] = i * i;
}
</pre>
<h2>Iterate arrays with explicit indices, not a "for-in"</h2>
<p>Because the internal enumeration object contains all (used) array
indices converted to string values, avoid <code>for-in</code> enumeration
of at least large arrays. As a concrete example, consider:</p>
<pre class="ecmascript-code">
var a = [];
for (var i = 0; i &lt; 1000000; i++) {
a[i] = i;
}
for (var i in a) {
// Before this loop is first entered, a million strings ("0", "1",
// ..., "999999") will be interned.
print(i, a[i]);
}
// The million strings become garbage collectable only here.
</pre>
<p>The internal enumeration object created in this example would contain a
million interned string keys for "0", "1", ..., "999999". All of these keys
would remain reachable for the entire duration of the enumeration. The
following code would perform much better (and would be more portable, as it
makes no assumptions on enumeration order):</p>
<pre class="ecmascript-code">
var a = [];
for (var i = 0; i &lt; 1000000; i++) {
a[i] = i;
}
var n = a.length;
for (var i = 0; i &lt; n; i++) {
print(i, a[i]);
}
</pre>
<h2>Minimize top-level global/eval code</h2>
<p>Identifier accesses in global and eval code always use slow path instructions
to ensure correctness. This is at least a few orders of magnitude slower than
the fast path where identifiers are mapped to registers of a function
activation.</p>
<p>So, this is slow:</p>
<pre class="ecmascript-code">
for (var i = 0; i &lt; 100; i++) {
print(i);
}
</pre>
<p>Each read and write of <code>i</code> will be an explicit environment
record lookup, essentially a property lookup from an internal environment
record object, with the string key <code>i</code>.</p>
<p>Optimize by putting most code into a function:</p>
<pre class="ecmascript-code">
function main() {
for (var i = 0; i &lt; 100; i++) {
print(i);
}
}
main();
</pre>
<p>Here, <code>i</code> will be mapped to a function register, and each
access will be a simple register reference (basically a pointer to a
tagged value), which is much faster than the slow path.</p>
<p>If you don't want to name an explicit function, use:</p>
<pre class="ecmascript-code">
(function() {
var i;
for (i = 0; i &lt; 100; i++) {
print(i);
}
})();
</pre>
<p>Eval code provides an implicit return value which also has a performance
impact. Consider, for instance, the following:</p>
<pre class="ecmascript-code">
var res = eval("if (4 &gt; 3) { 'foo'; } else { 'bar'; }");
print(res); // prints 'foo'
</pre>
<p>To support such code the compiler emits bytecode to store a statement's
implicit return value to a temporary register in case it is needed. These
instructions slow down execution and increase bytecode size unnecessarily.</p>
<h2>Prefer local variables over external ones</h2>
<p>When variables are bound to virtual machine registers, identifier lookups
are much faster than using explicit property lookups on the global object or
on other objects.</p>
<p>When an external value or function is required multiple times, copy it to
a local variable instead:</p>
<pre class="ecmascript-code">
function slow(x) {
var i;
// 'x.length' is an explicit property lookup and happens on every loop
for (i = 0; i &lt; x.length; i++) {
// 'print' causes a property lookup to the global object
print(x[i]);
}
}
function fast(x) {
var i;
var n = x.length;
var p = print;
// every access in the loop now happens through register-bound identifiers
for (i = 0; i &lt; n; i++) {
p(x[i]);
}
}
</pre>
<p>Use such optimizations only where it matters, because they often reduce
code readability.</p>
<h2 id="performance-json-stringify-fastpath">JSON.stringify() fast path</h2>
<p>There's a fast path for <code>JSON.stringify()</code> serialization which
is used when there is no "replacer" argument and no indent. The fast path
assumes that it can serialize the argument value without risking any side
effects (which might mutate the value being serialized), and will fall back
to the slower default algorithm if necessary. This happens at least when:
(1) any object has a <code>.toJSON()</code> property, and (2) any object
property is a getter. See:
<a href="https://github.com/svaarala/duktape/blob/master/tests/ecmascript/test-bi-json-enc-fastpath.js">test-bi-json-enc-fastpath.js</a>
for detailed notes on current limitations; the fast path preconditions and
limitations are very likely to change between Duktape releases.</p>
<p>The fast path is currently not enabled by default. To enable, ensure
<code>DUK_USE_JSON_STRINGIFY_FASTPATH</code> is enabled in your
<code>duk_config.h</code>.</p>
<p>See <a href="http://wiki.duktape.org/HowtoPerformance.html">How to optimize performance</a>
for discussion of Duktape performance characteristics and hints to optimize code
for performance.</p>

Loading…
Cancel
Save