==================
Benchmarking notes
==================

This document provides some notes on how to benchmark performance or memory
consumption so that you'll get the most relevant results for an actual
target device.

Memory benchmarking
===================

Enable Duktape low memory options
---------------------------------

Enable Duktape low memory config options for benchmarking if the target
would actually be running with these options enabled. For example, if
the target has 128-256 kB of system RAM, low memory options are strongly
recommended; see:

* doc/low-memory.rst

* config/examples/low_memory.yaml

Enable DUK_USE_GC_TORTURE to see actual hard memory usage
---------------------------------------------------------

Enable ``DUK_USE_GC_TORTURE`` for testing so that you'll be measuring actual
memory usage, i.e. actually **reachable objects** which won't be collected
even if an emergency garbage collection takes place. This config option causes
a full mark-and-sweep garbage collection pass for every allocation so that
only actually reachable memory remains in use at any time.
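
If you edit ``duk_config.h`` directly (as mentioned later in this document),
enabling the option for a benchmarking build is just a preprocessor define;
a minimal sketch::

    /* In duk_config.h, for memory benchmarking builds only: force a full
     * mark-and-sweep pass on every allocation so that only reachable memory
     * stays allocated.  This makes execution much slower, so don't combine
     * it with performance measurements.
     */
    #define DUK_USE_GC_TORTURE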

Apparent memory usage seen without this option enabled may differ quite a lot
from what's actually needed. That difference is usually irrelevant: if memory
were to run out, emergency garbage collection would be able to free the
non-reachable objects.

The difference may matter in practice if some *other* component in the system
is out of memory, as it usually cannot trigger an emergency garbage
collection which would free up memory. However, when using a pool allocator
for Duktape this is not an issue: all Duktape allocations will be contained
in the pre-allocated pool.

Measure usage using a pool allocator if target uses one
-------------------------------------------------------

Measurements using ``valgrind --tool=massif`` are relatively accurate (when
GC torture is enabled) but will include allocation overhead not present when
a pool allocator is used. Pool allocators are recommended for low memory
targets to reduce overhead and heap fragmentation.

If the actual target uses a pool allocator, benchmarking should be done
against that allocator, with the pool entry sizes optimized for the actual
application code to be executed. The difference between valgrind massif
reported usage and actual pool allocator usage can be quite large. However,
when the pool configuration is poorly optimized, memory allocation overhead
caused by wasted pool entry bytes can also be significant.
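
For reference, here is a minimal sketch of how a custom allocator is wired
into heap creation so that every Duktape allocation goes through it. The
``my_pool_*`` functions are hypothetical placeholders (here they simply
forward to ``malloc()``/``realloc()``/``free()`` so the sketch compiles); an
actual target would substitute its own pool implementation with entry sizes
tuned for the application::

    #include <stdlib.h>
    #include "duktape.h"

    /* Placeholder allocator entry points (not part of the Duktape API).
     * On a real target these would allocate from pre-sized pools; here
     * they just forward to the C library so the sketch is self-contained.
     */
    static void *my_pool_alloc(void *udata, duk_size_t size) {
        (void) udata;
        return malloc(size);
    }

    static void *my_pool_realloc(void *udata, void *ptr, duk_size_t size) {
        (void) udata;
        return realloc(ptr, size);
    }

    static void my_pool_free(void *udata, void *ptr) {
        (void) udata;
        free(ptr);
    }

    int main(void) {
        /* Route all Duktape allocations through the custom allocator so
         * that measurements reflect what the target's pool allocator
         * would see.
         */
        duk_context *ctx = duk_create_heap(my_pool_alloc,
                                           my_pool_realloc,
                                           my_pool_free,
                                           NULL,   /* heap userdata */
                                           NULL);  /* default fatal handler */
        if (!ctx) {
            return 1;
        }
        duk_eval_string_noresult(ctx, "/* benchmark code here */");
        duk_destroy_heap(ctx);
        return 0;
    }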

Measurements using e.g. process RSS are very inaccurate and should be avoided
if possible, as they don't accurately reflect the actual achievable memory
usage. When measuring without a pool allocator, valgrind massif combined with
GC torture enabled is a much better option.

Example of DUK_USE_GC_TORTURE measurement impact
------------------------------------------------

Let's take an example program which creates a lot of anonymous function
instances, quite typical of callback-oriented code::

    function test() {
        for (var i = 0; i < 10000; i++) {
            var ignored = function () {};
        }
    }
    test();

Because each such anonymous function is in a reference loop with its default
``.prototype`` object (which points back to the function through its
``.constructor`` property), the functions won't be collected by reference
counting and will only be freed by mark-and-sweep. Mark-and-sweep runs
periodically, but an emergency mark-and-sweep is also triggered when an
allocation attempt fails.

Compiling Duktape with default options on x64 (without any low memory options,
ROM built-ins, etc.) shows the following memory usage in valgrind massif::

    [valgrind massif graph, condensed: apparent heap usage repeatedly spikes
    to a peak of ~539.2 KB over a run of ~138.9 Mi instructions]

From this it would appear the program is using ~540 kB of memory. This is very
misleading because almost all of that usage is actually collectable garbage
which is periodically freed by mark-and-sweep (seen as the "spiking" in the
graph above). In particular, if memory were to run out (in concrete terms, if
an attempt to allocate memory were to fail), an emergency mark-and-sweep pass
would free that memory, making it available for other use.

Enabling ``DUK_OPT_GC_TORTURE`` (or ``DUK_USE_GC_TORTURE`` when editing
``duk_config.h`` directly) gives a very different result::

    [valgrind massif graph, condensed: heap usage stays essentially flat at a
    peak of ~118.2 KB over a run of ~20.29 Gi instructions]

The actual "hard" memory usage is ~120 kB, only about 22% of the apparent
memory usage seen by valgrind. This hard memory usage is what really matters,
i.e. it determines whether an application will be able to allocate more
memory or not.

Performance benchmarking
========================

Enable Duktape performance options
----------------------------------

Unless you're running on a memory constrained device and prefer e.g. a small
code footprint over performance, you should enable Duktape performance
options. For more information, see:

* doc/performance-sensitive.rst

* config/examples/performance_sensitive.yaml

As with memory, it's important to measure with options relevant to the actual
target. It's possible to enable most low memory options and performance
options at the same time (which makes sense if there's relatively little RAM
but code ROM footprint is not an issue). Duktape low memory options may have
an effect on performance; in particular, heap pointer compression has a
relatively large performance impact, so it's important to account for whether
the eventual target will use heap pointer compression or not.

Test using function code by default
-----------------------------------

Global code (program code) and eval code have important semantic differences
from function code, i.e. statements residing inside a ``function () { ... }``
expression. For Duktape the performance difference between global/eval code
and function code is very large. The concrete difference is that global and
eval code have no local variables; instead, all variable accesses go through
an internal slow path and are actually property reads and writes on the
global object.

As a concrete example, an empty loop inside a function::

    $ cat test.js
    function test() {
        for (var i = 0; i < 1e7; i++) {
        }
    }
    test();

    $ time ./duk.O2.140 test.js
    real    0m0.256s
    user    0m0.256s
    sys     0m0.000s

An empty loop outside a function::

    $ cat test.js
    // Note that 'i' is actually a property of the global object.
    for (var i = 0; i < 1e7; i++) {
    }

    $ time ./duk.O2.140 test.js
    real    0m4.325s
    user    0m4.319s
    sys     0m0.004s

The loop in global code runs roughly 17x slower than inside a function. The
performance difference for practical code depends on how many variable
accesses are done.

In most programs the majority of actually performance relevant code is inside
functions. In particular, all CommonJS modules are automatically wrapped in
anonymous functions, so all module code will run using the fast path. For
benchmarking, the best default, usually matching the code actually executing
on the target, is to measure performance critical code by placing it inside
a function.

However, if the target will actually be running performance relevant code
in the global or eval context (which is quite possible for specific
applications), then it is of course prudent to measure that code outside a
function.
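
If the benchmark is driven from embedding C code rather than the ``duk``
command line tool, the same consideration applies: code evaluated with e.g.
``duk_eval_string()`` runs as eval code. One way to make the measured code run
as function code is to compile it as a function expression and call it. A
minimal sketch (the loop is just a placeholder benchmark body; error handling
is omitted)::

    #include "duktape.h"

    int main(void) {
        duk_context *ctx = duk_create_heap_default();
        if (!ctx) {
            return 1;
        }

        /* Compile the benchmark body as a function expression so that 'i'
         * becomes a true local variable (fast path), then call it.
         */
        duk_compile_string(ctx, DUK_COMPILE_FUNCTION,
                           "function () { for (var i = 0; i < 1e7; i++) {} }");
        duk_call(ctx, 0);  /* [ func ] -> [ retval ] */
        duk_pop(ctx);      /* drop the (undefined) return value */

        duk_destroy_heap(ctx);
        return 0;
    }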