From 9ae81b9a475445209784f6b69b3717eab674c172 Mon Sep 17 00:00:00 2001
From: Sami Vaarala
Date: Wed, 17 Feb 2016 19:38:07 +0200
Subject: [PATCH] Add some benchmarking notes

---
 doc/benchmarking-notes.rst | 222 +++++++++++++++++++++++++++++++++++++
 1 file changed, 222 insertions(+)
 create mode 100644 doc/benchmarking-notes.rst

diff --git a/doc/benchmarking-notes.rst b/doc/benchmarking-notes.rst
new file mode 100644
index 00000000..4aca990e
--- /dev/null
+++ b/doc/benchmarking-notes.rst
@@ -0,0 +1,222 @@
+==================
+Benchmarking notes
+==================
+
+This document provides some notes on how to benchmark performance or memory
+consumption so that you'll get the most relevant results for an actual
+target device.
+
+Memory benchmarking
+===================
+
+Enable Duktape low memory options
+---------------------------------
+
+Enable Duktape low memory config options for benchmarking if the target
+would actually be running with these options enabled. For example, if
+the target has 128-256 kB of system RAM, low memory options are strongly
+recommended:
+
+* doc/low-memory.rst
+
+* config/examples/low_memory.yaml
+
+Enable DUK_USE_GC_TORTURE to see actual hard memory usage
+---------------------------------------------------------
+
+Enable ``DUK_USE_GC_TORTURE`` for testing so that you'll be measuring actual
+memory, i.e. actually **reachable objects** which won't be collected when
+an emergency garbage collection takes place. This config option causes
+a full mark-and-sweep garbage collection pass for every allocation so that
+only actually reachable memory will remain in use at any time.
+
+Apparent memory usage seen without this option enabled may differ quite a lot
+from what's actually needed. That difference is usually irrelevant: if memory
+were to run out, emergency garbage collection would be able to free the
+non-reachable objects.
+
+The difference may matter in practice if some *other* component in the system
+runs out of memory, as it usually cannot trigger an emergency garbage
+collection which would free up memory. However, when using a pool allocator
+for Duktape this is not an issue: all Duktape allocations will be contained
+in the pre-allocated pool.
+
+Measure usage using a pool allocator if target uses one
+-------------------------------------------------------
+
+Measurements using ``valgrind --tool=massif`` are relatively accurate (when
+GC torture is enabled) but will include allocation overhead not present when
+a pool allocator is used. Pool allocators are recommended for low memory
+targets to reduce overhead and heap fragmentation.
+
+If the actual target uses a pool allocator, benchmarking should be done
+against that allocator, with the pool entry sizes optimized for the actual
+application code to be executed. The difference between valgrind massif
+reported usage and actual pool allocator usage can be quite large. However,
+when the pool configuration is poorly optimized, memory allocation overhead
+caused by wasted pool entry bytes can also be significant.
+
+Measurements using e.g. process RSS are very inaccurate and should be avoided
+if possible as they don't accurately reflect the actual memory usage
+achievable. When measuring without a pool allocator, valgrind massif,
+combined with enabling GC torture, is a much better option.
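The effect of pool entry sizing described above can be estimated with a small script before measuring on the target. The following is only a rough sketch: the helper name ``poolWaste``, the pool entry sizes, and the request sizes are all made-up illustration values, and it assumes each allocation is served from the smallest pool entry that fits. A real pool allocator may behave differently (for example, by having a limited number of entries per size).

```javascript
// Hypothetical helper (not part of Duktape): estimate bytes wasted when
// each request is rounded up to the smallest pool entry that can hold it.
function poolWaste(entrySizes, requestSizes) {
    var requested = 0;  // bytes the application actually asked for
    var reserved = 0;   // bytes consumed from the pool
    var unserved = 0;   // requests larger than every pool entry
    for (var i = 0; i < requestSizes.length; i++) {
        var size = requestSizes[i];
        var entry = -1;
        // entrySizes is assumed to be sorted in ascending order.
        for (var j = 0; j < entrySizes.length; j++) {
            if (entrySizes[j] >= size) {
                entry = entrySizes[j];
                break;
            }
        }
        if (entry < 0) {
            unserved++;
            continue;
        }
        requested += size;
        reserved += entry;
    }
    return {
        requested: requested,
        reserved: reserved,
        wasted: reserved - requested,
        unserved: unserved
    };
}

// Made-up allocation sizes in bytes, standing in for a recorded trace.
var requests = [12, 20, 24, 40, 48, 100];

// A poorly matched pool versus one tuned to the request sizes.
var coarse = poolWaste([64, 256], requests);         // wasted: 332
var tuned = poolWaste([16, 24, 48, 128], requests);  // wasted: 44
```

With these illustrative numbers the same 244 requested bytes reserve 576 bytes in the coarse pool but only 288 bytes in the tuned one, which is why the pool configuration should match the application being benchmarked.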
+
+Example of DUK_USE_GC_TORTURE measurement impact
+------------------------------------------------
+
+Let's take an example program which involves creating a lot of anonymous
+function instances, quite typical in callback oriented code::
+
+    function test() {
+        for (var i = 0; i < 10000; i++) {
+            var ignored = function () {};
+        }
+    }
+    test();
+
+Because each such anonymous function is in a reference loop with its default
+``.prototype`` object (which points back to the function using ``.constructor``
+reference), the functions won't be collected by reference counting and will
+be freed by mark-and-sweep. Mark-and-sweep runs periodically but an emergency
+mark-and-sweep is also triggered when an allocation attempt fails.
+
+Compiling Duktape with defaults on x64 (without any low memory options, ROM
+builtins, etc) shows the following memory usage in valgrind massif::
+
+    ...
+        KB
+    539.2^ :
+         | # : :: : : :@ : :
+         | # : : :: : :: :@ @: ::
+         | # :: : :: : :: :@ :@: @::
+         | @# :: :: :: : :: ::@ :@: @::
+         | @# ::: :: ::: ::: ::: ::@ ::@: @::
+         | @# ::: ::: ::: : : ::: :::@ ::@: :@::
+         | @@# @::: ::: :::: : : ::: :::@ ::@: :@::
+         | @@# @::: @::: :::: :: : :::: :::@ :::@: ::@::
+         | @@@# :@::: @::: ::::: :: : :::: ::::@ ::::@: ::@::
+         | @@@# :@::: :@::: ::::: :: : :::: :::::@ ::::@: ::@::
+         | @@@@# :@::: :@::: :::::: ::: : ::::: :::::@ ::::@: :::@::
+         | @ @@@@# ::@::: :@::: :::::: ::: : ::::: :::::@ :::::@: :::@::
+         | @ @@@@# ::@::: :@::: :::::: :::: : :::::: ::::::@ :::::@: :::@::
+         | @::@@@@# ::@::: ::@::: :::::: :::: : :::::: ::::::@::::::@:::::@::
+         | @@: @@@@# ::@::::::@::: ::::::: :::: : :::::: @:::::@::::::@:::::@:::
+         | @@: @@@@#::::@::::::@::: :::::::::::: ::::::::::@:::::@::::::@:::::@:::
+         | @@: @@@@#: ::@::::::@::: :::::::::::: :: :::::::@:::::@::::::@:::::@:::
+         | @@: @@@@#: ::@::::::@::: :::::::::::: :: :::::::@:::::@::::::@:::::@:::
+         | @@: @@@@#: ::@::::::@::: :::::::::::: :: :::::::@:::::@::::::@:::::@:::
+       0 +----------------------------------------------------------------------->Mi
+         0                                                                   138.9
+
+From this it would appear the program is using ~540 kB of memory. This is very
+misleading because almost all of that usage is actually collectable garbage which
+is periodically collected by mark-and-sweep (as seen above as "spiking"). In
+particular, if memory were to run out (in concrete terms, an attempt to allocate
+memory would fail), an emergency mark-and-sweep pass would free that memory which
+would then be available for other use.
+
+Enabling ``DUK_OPT_GC_TORTURE`` (or ``DUK_USE_GC_TORTURE`` if editing ``duk_config.h``
+directly) we get a very different result::
+
+    ...
+        KB
+    118.2^#
+         |#
+         |#:@@:@:::::::: : ::: @:: :: : : : : ::
+         |#:@ :@:: : : ::::::::::::::::::::@@:::::@: ::: :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+         |#:@ :@:: : : :: : ::::: :::::::::@ : :: @: : : :::::::::: :::@:::::::@::
+       0 +----------------------------------------------------------------------->Gi
+         0                                                                   20.29
+
+The actual "hard" memory usage is ~120 kB, only about 22% of the apparent memory
+usage as seen by valgrind. This hard memory usage is what really matters, i.e.
+it determines whether an application will be able to allocate more memory or not.
+
+Performance benchmarking
+========================
+
+Enable Duktape performance options
+----------------------------------
+
+If you're not running on a memory constrained device and you prefer performance
+over e.g. code footprint, you should enable Duktape performance options.
+For more information, see:
+
+* doc/performance-sensitive.rst
+
+* config/examples/performance_sensitive.yaml
+
+As with memory, it's important to measure with options relevant to the actual
+target. It's possible to enable most low memory options and performance options
+at the same time (which makes sense if there's relatively little RAM but code
+ROM footprint is not an issue). Duktape low memory options may have an effect
+on performance; in particular, heap pointer compression has a relatively large
+performance impact which is important to account for, depending on whether the
+eventual target will use heap pointer compression or not.
+
+Test using function code by default
+-----------------------------------
+
+Global code (program code) and eval code have important semantic differences
+from function code, i.e. statements residing inside a ``function () { ... }``
+expression. For Duktape the performance difference between these two kinds
+of compiled code is very large. The concrete difference is that for global
+and eval code there are no local variables but instead all variable accesses
+go through an internal slow path and are actually property reads and writes
+on the global object.
+
+As a concrete example, an empty loop inside a function::
+
+    $ cat test.js
+    function test() {
+        for (var i = 0; i < 1e7; i++) {
+        }
+    }
+    test();
+
+    $ time ./duk.O2.140 test.js
+    real    0m0.256s
+    user    0m0.256s
+    sys     0m0.000s
+
+Empty loop outside a function::
+
+    $ cat test.js
+    // Note that 'i' is actually a property of the global object.
+    for (var i = 0; i < 1e7; i++) {
+    }
+
+    $ time ./duk.O2.140 test.js
+    real    0m4.325s
+    user    0m4.319s
+    sys     0m0.004s
+
+The loop in global code runs ~17x slower than inside a function. The
+performance difference for practical code depends on how many variable
+accesses are done.
+
+In most programs the majority of actually performance relevant code is inside
+functions. In particular, all CommonJS modules are inside anonymous wrapper
+functions automatically, so all module code will run using the fast path.
+The best default for benchmarking, which usually matches the code actually
+executing on the target, is to measure performance critical code by placing
+it inside a function.
+
+However, if the target will actually be running performance relevant code
+in the global or eval context (which is quite possible for specific applications)
+then it is of course prudent to measure that code outside a function.
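The advice to keep measured code inside a function can be followed with a small timing harness such as the sketch below. This is one possible shape, not an established Duktape tool: ``timeIt`` and ``workload`` are hypothetical names, the repeat count is arbitrary, and ``Date.now()`` only gives millisecond wall clock resolution, so the workload has to run long enough for the reading to be meaningful.

```javascript
// Hypothetical helper: run fn() several times and return the best
// (smallest) wall clock time in milliseconds.
function timeIt(fn, repeats) {
    var best = Infinity;
    for (var r = 0; r < repeats; r++) {
        var start = Date.now();
        fn();
        var elapsed = Date.now() - start;
        if (elapsed < best) {
            best = elapsed;
        }
    }
    return best;
}

// The code under test lives inside a function, so 'i' and 'sum' are
// local variables rather than properties of the global object.
function workload() {
    var sum = 0;
    for (var i = 0; i < 1e6; i++) {
        sum += i;
    }
    return sum;
}

// Only this single assignment touches the global object.
var bestMs = timeIt(workload, 5);
```

Taking the best of several runs is a simple way to reduce noise from other processes. To measure global or eval code instead, the loop body would have to be moved to the top level of the script, since wrapping it in ``workload`` is exactly what turns its variables into locals.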