=========
Execution
=========

Overview
========

This document describes how Duktape manages its execution state. Some details
are omitted, but the goal is to give an overall picture of how execution
proceeds, what state is involved, and which internal functions are most
important.

The discussion is limited to a single Duktape heap, as each Duktape heap is
independent of other Duktape heaps. At any time, only one native thread may
be actively calling into a specific Duktape heap.

Execution states
================

There are three conceptual execution states for a Duktape heap:

* Idle

* Executing a Duktape/C function

* Executing an Ecmascript function

This conceptual model ignores details like heap initialization and
transitions from one state to another by "call handling".

Typical control flow
====================

Execution always begins from an idle state where no calls into Duktape are
active and the user application has control. User code may manipulate the
value stacks of Duktape contexts in this state without making any calls.
User code may also call ``duk_debugger_cooperate()`` to integrate the
debugger into the application event loop (or equivalent).
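
For example, here is a minimal sketch of the idle state from the embedding's
point of view. The event loop shape and the pushed string are made up for
illustration; ``duk_debugger_cooperate()`` may only be called in this idle
state, i.e. when no call into Duktape is active::

  #include "duktape.h"

  /* Hypothetical embedding event loop: while no call into Duktape is
   * active, the application may prepare values on the value stack and
   * periodically let an attached debugger do protocol work.
   */
  static void app_event_loop(duk_context *ctx) {
      for (;;) {
          /* Idle state: manipulate the value stack without making a call. */
          duk_push_string(ctx, "pending-event-name");
          duk_pop(ctx);

          /* Let an attached debugger process incoming messages. */
          duk_debugger_cooperate(ctx);

          /* ...wait for I/O, timers, etc. (application specific)... */
          break;  /* single iteration for this sketch */
      }
  }

  int main(void) {
      duk_context *ctx = duk_create_heap_default();
      if (ctx == NULL) {
          return 1;
      }
      app_event_loop(ctx);
      duk_destroy_heap(ctx);
      return 0;
  }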

Eventually user code makes a call into either a Duktape/C function or an
Ecmascript function. Such a call may be caused by an obvious API call like
``duk_pcall()``. It may also be caused by a less obvious API call such as
``duk_get_prop()``, which may invoke a getter, or ``duk_to_string()``, which
may invoke a ``toString()`` coercion method.
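
To make this concrete, the sketch below (public API only) shows both kinds of
call site: an explicit ``duk_pcall()`` and a property read that happens to
trigger a getter. The evaluated scripts, the property name, and the helper
name are made up; the helper assumes a ``ctx`` created as in the previous
example::

  #include <stdio.h>
  #include "duktape.h"

  static void call_examples(duk_context *ctx) {
      /* Obvious call: push a function and an argument, then duk_pcall(). */
      duk_eval_string(ctx, "(function (x) { return x * 2; })");
      duk_push_int(ctx, 21);
      if (duk_pcall(ctx, 1 /*nargs*/) == 0) {
          printf("result: %ld\n", (long) duk_get_int(ctx, -1));
      } else {
          printf("error: %s\n", duk_safe_to_string(ctx, -1));
      }
      duk_pop(ctx);

      /* Less obvious call: reading a property may invoke a getter, so this
       * simple looking property read also goes through call handling.
       */
      duk_eval_string(ctx,
          "(function () {"
          "  var o = {};"
          "  Object.defineProperty(o, 'dynamicValue',"
          "                        { get: function () { return 123; } });"
          "  return o;"
          "})()");
      duk_get_prop_string(ctx, -1, "dynamicValue");  /* invokes the getter */
      printf("getter result: %ld\n", (long) duk_get_int(ctx, -1));
      duk_pop_2(ctx);
  }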

The initial call into Duktape is always handled using ``duk_handle_call()``,
which can handle a call from any state into any kind of target function.
Setting up a call involves a lot of state changes:

* A setjmp catchpoint is needed for protected calls.

* The call stack is resized if necessary, and an activation record
  (``duk_activation``) is set up for the new call.

* The value stack is resized if necessary, and a fresh value stack frame
  is established for the call. The calling value stack frame and the target
  frame overlap for the call arguments, so that arguments on top of the
  calling stack are directly visible on the bottom of the target stack.

* An arguments object and an explicit environment record are created if
  necessary.

* Other small book-keeping (such as recursion depth tracking) is done.

When a call returns, the state changes are reversed before returning to
the caller. If an error occurs during the call, a ``longjmp()`` will take
place and will be caught by the current (innermost) setjmp catchpoint
without tearing down the call state; the catchpoint will have to do that.

If the target function is a Duktape/C function, the corresponding C function
is looked up and called. The C function now has access to a fresh value stack
frame it can operate on using the Duktape API. It can make further calls which
get handled by ``duk_handle_call()``.
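
As an illustration (the function and property names are made up), a Duktape/C
function sees its call arguments at indices 0 onwards of the fresh value stack
frame, and anything it does through the API that calls back into Duktape goes
through ``duk_handle_call()`` again::

  #include "duktape.h"

  /* Hypothetical Duktape/C function: adds its two arguments.  The
   * duk_safe_to_string() call may invoke toString()/valueOf() and thus
   * illustrates a nested call made from inside a Duktape/C function.
   */
  static duk_ret_t my_adder(duk_context *ctx) {
      double a = duk_require_number(ctx, 0);  /* bottom of the fresh frame */
      double b = duk_require_number(ctx, 1);

      (void) duk_safe_to_string(ctx, 0);      /* may call back into Duktape */

      duk_push_number(ctx, a + b);
      return 1;  /* one return value on the value stack top */
  }

  /* Registration; assumes a ctx created as in the earlier examples. */
  static void register_adder(duk_context *ctx) {
      duk_push_global_object(ctx);
      duk_push_c_function(ctx, my_adder, 2 /*nargs*/);
      duk_put_prop_string(ctx, -2, "myAdder");
      duk_pop(ctx);
  }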

If the target function is an Ecmascript function, the value stack is resized
for the function register count (nregs) established by the compiler during
function compilation; unlike for Duktape/C functions, the value stack is mostly
static for the duration of bytecode execution. Opcode handling may push
temporaries on the value stack but they must always be popped off before
proceeding to dispatch the next opcode.

The bytecode executor has its own setjmp catchpoint. If bytecode makes a call
into a Duktape/C function it is handled normally using ``duk_handle_call()``;
such calls may also happen when the bytecode executor uses the value stack API
for various coercions etc.

If bytecode makes a function call into an Ecmascript function it is handled
specially by ``duk_handle_ecma_call_setup()``. This call handler sets up a
new activation similarly to ``duk_handle_call()``, but instead of doing a
recursive call into the bytecode executor it returns to the bytecode executor
which restarts execution and starts executing the call target without
increasing C stack depth. The call handler also supports tail calls where an
activation record is reused.
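
The shape of this "flat call" arrangement can be illustrated with a toy model
(this is not Duktape code; all names are invented): instead of recursing, the
dispatcher pushes a new activation and restarts its own loop, so C stack depth
stays constant regardless of Ecmascript call depth::

  #include <stdio.h>

  #define MAX_ACTIVATIONS 64

  typedef struct {
      int func_id;  /* which toy "function" is executing */
      int pc;       /* resume point within that function */
  } toy_activation;

  static toy_activation call_stack[MAX_ACTIVATIONS];
  static int call_depth = 0;

  static void toy_dispatch(int initial_func) {
      call_stack[call_depth++] = (toy_activation) { initial_func, 0 };

      while (call_depth > 0) {
          toy_activation *act = &call_stack[call_depth - 1];

          if (act->func_id > 0 && act->pc == 0) {
              /* "Call" another function: push a new activation and restart
               * the loop instead of calling toy_dispatch() recursively.
               */
              act->pc = 1;  /* remember where to resume */
              call_stack[call_depth++] =
                  (toy_activation) { act->func_id - 1, 0 };
              continue;
          }

          printf("returning from toy function %d\n", act->func_id);
          call_depth--;  /* "return": unwind one activation */
      }
  }

  int main(void) {
      toy_dispatch(3);  /* nested calls 3 -> 2 -> 1 -> 0, flat C stack */
      return 0;
  }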

Both Duktape and user code may use ``duk_safe_call()`` to make protected
calls inside the current activation (or outside of any activations in the
idle state). A safe call creates a new setjmp catchpoint but not a new
activation, so safe calls are not actual function calls.
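
A minimal usage sketch, using the Duktape 1.x safe call signature (the helper
names and the error message are made up; assumes a ``ctx`` created as in the
earlier examples). Because a safe call does not create a new activation, the
target function sees the caller's value stack frame and typically addresses
its inputs relative to the stack top::

  #include <stdio.h>
  #include "duktape.h"

  /* Safe call target: divides the two topmost value stack entries. */
  static duk_ret_t safe_divide(duk_context *ctx) {
      double a = duk_require_number(ctx, -2);  /* no rebased frame, use top */
      double b = duk_require_number(ctx, -1);
      if (b == 0.0) {
          duk_error(ctx, DUK_ERR_RANGE_ERROR, "division by zero");
      }
      duk_push_number(ctx, a / b);
      return 1;  /* one result */
  }

  static void safe_call_example(duk_context *ctx) {
      duk_push_number(ctx, 10.0);
      duk_push_number(ctx, 0.0);

      /* Two arguments consumed, one result produced; an error is caught by
       * the new setjmp catchpoint instead of propagating outwards.
       */
      if (duk_safe_call(ctx, safe_divide, 2 /*nargs*/, 1 /*nrets*/) == 0) {
          printf("result: %s\n", duk_safe_to_string(ctx, -1));
      } else {
          printf("error: %s\n", duk_safe_to_string(ctx, -1));
      }
      duk_pop(ctx);
  }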

Threading limitations
=====================

Only one native thread may call into a Duktape heap at any given time.
See ``threading.rst`` for more details.

Bytecode executor
=================

Basic functionality
-------------------

* Setjmp catchpoint which supports yield, resume, slow returns, try-catch, etc.

* Opcode dispatch loop, central for performance

* Executor interrupt which facilitates script timeout and debugger integration

* Debugger support; breakpoint handling, checked and normal execution modes

Setjmp catchpoint
-----------------

The ``duk_handle_call()`` and ``duk_safe_call()`` catchpoints are only used to
handle ordinary error throws which propagate out of the calling function. The
bytecode executor setjmp catchpoint handles a wider variety of longjmp call
types, and in many cases the longjmp may be handled without exiting the current
function:

* A slow break/continue uses a longjmp() so that if the break/continue crosses
  any finally clauses, they get executed as expected. Similarly, 'with'
  statement lexical environments are torn down, etc.

* A slow return uses a longjmp() so that any finally clauses, 'with' statement
  lexical environments, etc. are handled appropriately.

* A coroutine resume is handled using longjmp(): the Duktape.Thread.resume()
  call adjusts the thread states (including their activations) and then uses
  this longjmp() type to restart execution in the target coroutine.

* A coroutine yield is handled using longjmp(): the Duktape.Thread.yield()
  call adjusts the states and uses this longjmp() type to restart execution
  in the target coroutine.

* An ordinary throw is handled as in ``duk_handle_call()``, with the difference
  that there are both 'try' and 'finally' sites.

Returns, coroutine yields, and throws may propagate out of the initial bytecode
executor entry and outwards to whatever code called into the executor.
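
The following toy model (not Duktape code; the type names and the handling
are simplified inventions) illustrates the overall shape of a single setjmp
catchpoint that distinguishes several longjmp types, some of which are handled
without exiting the function::

  #include <setjmp.h>
  #include <stdio.h>

  typedef enum {
      LJ_NONE = 0,     /* setjmp() itself returns 0 on normal entry */
      LJ_SLOW_RETURN,  /* handled locally: run finally clauses, then return */
      LJ_SLOW_BREAK,   /* handled locally: resume dispatch at break target */
      LJ_THROW         /* may propagate outwards if there is no catcher */
  } lj_type;

  static jmp_buf catchpoint;

  static void do_longjmp(lj_type type) {
      longjmp(catchpoint, (int) type);
  }

  static void toy_executor(void) {
      switch (setjmp(catchpoint)) {
      case LJ_NONE:
          /* Normal entry: start dispatching; pretend an opcode triggers a
           * slow break which needs to cross a finally clause.
           */
          do_longjmp(LJ_SLOW_BREAK);
          break;
      case LJ_SLOW_BREAK:
      case LJ_SLOW_RETURN:
          /* Handled without exiting the executor: finally/'with' style
           * bookkeeping would be unwound here, then dispatch continues.
           */
          printf("longjmp handled inside the executor\n");
          break;
      case LJ_THROW:
          printf("no local catcher, would rethrow outwards\n");
          break;
      }
  }

  int main(void) {
      toy_executor();
      return 0;
  }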

Opcode dispatch loop and executor interrupt
-------------------------------------------

The opcode dispatch loop is a central performance critical part of the
executor (a simplified sketch follows the list below). The dispatch loop:

* Checks for an executor interrupt. An interrupt can be taken for every
  opcode or for every N instructions; the interrupt handler provides e.g.
  script timeout and debugger integration. This is performance critical
  because the check occurs for every opcode dispatch. See the separate
  section below on interrupt counter handling.

* Fetches an instruction from the topmost activation's "current PC",
  and increments the PC. Managing the "current PC" is performance critical.
  See the separate section below on current PC handling.

* Decodes and executes the opcode using a large switch-case. The most
  important opcodes are in the main opcode space (64 opcodes); more rarely
  used opcodes are "extra" opcodes and need a double dispatch.

* Usually loops back to execute further opcodes. May also (1) call another
  Duktape/C or Ecmascript function, (2) cause a longjmp, or (3) use
  ``goto restart_execution`` to restart the executor, e.g. after the call
  stack has been changed.
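
A toy sketch of this loop shape (not Duktape code; the instruction encoding,
opcode names, and the interrupt interval are invented)::

  #include <stdint.h>
  #include <stdio.h>

  typedef uint32_t toy_instr;

  enum { OP_INC = 0, OP_PRINT = 1, OP_HALT = 2 };

  static void toy_dispatch_loop(const toy_instr *pc) {
      long counter = 0;
      int interrupt_countdown = 1000;  /* take an interrupt every N opcodes */

      for (;;) {
          /* Executor interrupt check, performed for every dispatch. */
          if (--interrupt_countdown <= 0) {
              interrupt_countdown = 1000;
              /* ...script timeout / debugger processing would go here... */
          }

          /* Fetch and decode; the local 'pc' plays the role of curr_pc. */
          toy_instr ins = *pc++;

          switch (ins & 0xff) {
          case OP_INC:
              counter++;
              break;
          case OP_PRINT:
              printf("counter=%ld\n", counter);
              break;
          case OP_HALT:
              return;
          default:
              /* "extra" opcodes would get a second dispatch here */
              return;
          }
      }
  }

  int main(void) {
      static const toy_instr prog[] = { OP_INC, OP_INC, OP_PRINT, OP_HALT };
      toy_dispatch_loop(prog);
      return 0;
  }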

Debugger support
----------------

Debugger support relies on:

* The executor interrupt mechanism, which is needed to support debugging.

* A precheck in ``restart_execution`` where debugging status and breakpoints
  are checked. The executor then proceeds in either "normal" or "checked"
  execution. Checked execution means running one opcode at a time, and
  calling into the interrupt handler before each one to see e.g. whether a
  breakpoint has been triggered.

* Some additional support outside the executor; e.g. the call stack
  unwinding code handles the "step out" logic.

See ``debugger.rst`` for details.

Managing executor interrupt
===========================

The executor interrupt counter is currently tracked in
``thr->interrupt_counter``. This seems to work well because ``thr`` is a
"hot" variable.

Another alternative would be to track the counter in an executor local
variable. Error handling and other code paths jumping out of the executor
would then need to work similarly to how the stack local ``curr_pc`` is
handled.
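
A toy model of the counter arrangement (not Duktape code; the field names,
the interval, and the handler contents are invented): the dispatch loop
decrements a counter stored in the thread structure and calls the interrupt
handler when it reaches zero; the handler chooses the next interval::

  #include <stdio.h>

  typedef struct {
      long interrupt_counter;
  } toy_thread;

  static void toy_interrupt_handler(toy_thread *thr) {
      /* ...check script timeout, debugger messages, breakpoints... */
      thr->interrupt_counter = 1000;  /* next interval; an interval of 1
                                       * means "interrupt before every
                                       * opcode" */
  }

  static void toy_run(toy_thread *thr, long num_opcodes) {
      long i;
      for (i = 0; i < num_opcodes; i++) {
          if (--thr->interrupt_counter <= 0) {
              toy_interrupt_handler(thr);
          }
          /* ...fetch/decode/execute one opcode... */
      }
  }

  int main(void) {
      toy_thread thr = { 1000 };
      toy_run(&thr, 5000);
      printf("counter at exit: %ld\n", thr.interrupt_counter);
      return 0;
  }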

Managing current PC
===================

Current approach
----------------

The current solution in Duktape 1.3 is to maintain a direct bytecode pointer
in each activation, and to keep a "cached copy" of the topmost activation's
bytecode pointer in a bytecode executor local variable ``curr_pc``. A pointer
to the ``curr_pc`` in the stack frame (of type ``duk_instr_t **``) is
stored in ``thr->ptr_curr_pc`` so that when control exits the opcode dispatch
loop (e.g. when an error is thrown) the value in the stack frame can be read
and synced back into the topmost activation's ``act->curr_pc``.

Consistency depends on the compiler doing correct aliasing analysis, and on
writing back the ``curr_pc`` value to the stack frame before any operation
that may potentially read it through ``thr->ptr_curr_pc``. Using ``volatile``
would be safer, but in practical testing it eliminates the performance benefit
entirely.

For the most part the bytecode executor can keep on dispatching opcodes
using ``curr_pc`` without copying the pointer back to the topmost activation.
However, the pointer needs to be synced (copied back) when:

* The current activation changes, i.e. a new function call is made.

* An error is about to be thrown, to ensure any longjmp handlers
  will see correct PC values in activations.

* The executor interrupt is entered; in particular, the debugger
  must see an up-to-date state in activations.

* A ``goto restart_execution;`` occurs in bytecode dispatch, which
  happens for multiple opcodes.

Care must be taken *not* to sync when ``thr->ptr_curr_pc`` is no longer
pointing to the topmost activation and/or when the C stack frame pointed
to may no longer exist. The current policy is to:

* Sync the PC on function calls, and also backup/restore ``thr->ptr_curr_pc``
  on calls.

* Sync the PC before a longjmp, often a bit earlier, to ensure stack traces
  come out right.

* Never sync or otherwise access ``thr->ptr_curr_pc`` in the setjmp
  catcher and unwind code paths. This ensures we never dereference
  a ``thr->ptr_curr_pc`` no longer related to the topmost activation or
  pointing to an unwound C stack frame. (The ``thr->ptr_curr_pc`` is
  not currently NULLed, so it is intentionally left dangling and must not be
  dereferenced incorrectly.)

Syncing the pointer back unnecessarily or multiple times is safe, however.
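
The arrangement can be illustrated with a toy model (not Duktape code; the
type and field names only mimic the real ones, and unlike Duktape the toy
NULLs the pointer after use for clarity)::

  #include <stdint.h>
  #include <stdio.h>

  typedef uint32_t toy_instr;

  typedef struct {
      const toy_instr *curr_pc;       /* activation's own PC */
  } toy_activation;

  typedef struct {
      toy_activation *topmost;        /* topmost activation */
      const toy_instr **ptr_curr_pc;  /* points at the executor's local PC */
  } toy_thread;

  /* Called from outside the dispatch loop (e.g. by error handling): sync
   * the executor local PC back into the topmost activation.
   */
  static void toy_sync_curr_pc(toy_thread *thr) {
      if (thr->ptr_curr_pc != NULL) {
          thr->topmost->curr_pc = *thr->ptr_curr_pc;
      }
  }

  static void toy_executor(toy_thread *thr, toy_activation *act) {
      const toy_instr *curr_pc = act->curr_pc;  /* cached copy */

      thr->topmost = act;
      thr->ptr_curr_pc = &curr_pc;   /* let outside code find the local */

      while (*curr_pc != 0) {        /* "execute" until a zero instruction */
          curr_pc++;
      }

      /* Sync before leaving the loop; a call or a longjmp would do the
       * same before the C stack frame holding curr_pc goes away.
       */
      toy_sync_curr_pc(thr);
      thr->ptr_curr_pc = NULL;
  }

  int main(void) {
      static const toy_instr code[] = { 1, 2, 3, 0 };
      toy_activation act = { code };
      toy_thread thr = { NULL, NULL };
      toy_executor(&thr, &act);
      printf("executed %d instructions\n", (int) (act.curr_pc - code));
      return 0;
  }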

Function bytecode is behind a stable pointer, so there are no realloc or
other side effect concerns with using direct bytecode pointers. Because
the function being executed is always reachable, a borrowed pointer can
be used.

This approach is a bit error prone, but it is worth the performance difference
compared to the alternatives: it improves dispatch performance by about 20-25%
over Duktape 1.2.

Some alternatives
-----------------

* Duktape 1.3: maintain a direct bytecode pointer in each activation, and a
  "cached" copy of the topmost activation's bytecode pointer in a local
  variable of the executor. Whenever something that might throw an error
  is executed, write the pointer back to the current activation using
  ``thr->ptr_curr_pc`` which points to the stack frame location containing
  ``curr_pc``.

* Duktape 1.2: maintain all PC values as numeric indices (not pointers and
  not pre-multiplied by bytecode opcode size). The current PC is always
  looked up from the current activation.

* Same as the Duktape 1.3 behavior but maintain a cached copy of the topmost
  activation's bytecode pointer in ``thr->curr_pc``. The copy back operation
  is needed but doesn't need to peek into the bytecode executor stack frame.
  This works quite well because ``thr`` is a "hot" variable. However, the
  stack local ``curr_pc`` used in Duktape 1.3 is faster.

* Use direct bytecode pointers in activations, keep a pointer to the current
  activation in the executor, and use ``act->curr_pc`` for dispatch. There's
  no need for a copy back operation because activation states are always in
  sync. This is faster than the Duktape 1.2 approach, but significantly
  slower than the ``thr->curr_pc`` or the Duktape 1.3 approach (part of that
  is probably because there's more register pressure).

Comparison between curr_pc alternatives
---------------------------------------

The current Duktape 1.3 approach is a bit error prone because of the need to
sync the executor local ``curr_pc`` back to ``act->curr_pc`` in multiple code
paths. Another alternative would be to dispatch using ``act->curr_pc``
directly. While that is faster than Duktape 1.2, it is significantly slower
than dispatching using the executor local ``curr_pc`` (or ``thr->curr_pc``).

The measurements below were made using ``gcc -O2`` on x64::

  # Duktape 1.3, dispatch using executor local variable curr_pc
  $ sudo nice -20 python util/time_multi.py --count 10 --mode all --verbose ./duk.O2.local_pc tests/perf/test-empty-loop.js
  Running: 2.180000 2.170000 2.180000 2.290000 2.180000 2.200000 2.190000 2.190000 2.220000 2.200000
  min=2.17, max=2.29, avg=2.20, count=10: [2.18, 2.17, 2.18, 2.29, 2.18, 2.2, 2.19, 2.19, 2.22, 2.2]

  # Duktape 1.2, dispatch using a numeric PC index
  $ sudo nice -20 python util/time_multi.py --count 10 --mode all --verbose ./duk.O2.123 tests/perf/test-empty-loop.js
  Running: 3.100000 3.100000 3.120000 3.120000 3.160000 3.300000 3.370000 3.410000 3.370000 3.390000
  min=3.10, max=3.41, avg=3.24, count=10: [3.1, 3.1, 3.12, 3.12, 3.16, 3.3, 3.37, 3.41, 3.37, 3.39]

  # Alternative; dispatch using thr->curr_pc
  $ sudo nice -20 python util/time_multi.py --count 10 --mode all --verbose ./duk.O2.thr_pc tests/perf/test-empty-loop.js
  Running: 2.310000 2.330000 2.310000 2.300000 2.400000 2.290000 2.310000 2.290000 2.300000 2.300000
  min=2.29, max=2.40, avg=2.31, count=10: [2.31, 2.33, 2.31, 2.3, 2.4, 2.29, 2.31, 2.29, 2.3, 2.3]

  # Alternative; dispatch using act->curr_pc
  $ sudo nice -20 python util/time_multi.py --count 10 --mode all --verbose ./duk.O2.act_pc tests/perf/test-empty-loop.js
  Running: 2.590000 2.580000 2.600000 2.600000 2.600000 2.660000 2.600000 2.640000 2.860000 2.860000
  min=2.58, max=2.86, avg=2.66, count=10: [2.59, 2.58, 2.6, 2.6, 2.6, 2.66, 2.6, 2.64, 2.86, 2.86]

Accessing constants
===================

The executor stores a copy of the ``duk_hcompiledfunction`` constant table
base address into a local variable ``consts``. This reduces code footprint
and performs better compared to reading the consts base address on-the-fly
through the function reference. Because the constants table has a stable
base address, this is easy and safe.

Accessing registers
===================

The executor currently accesses the stack frame base address (needed to read
registers) through ``thr`` as ``thr->valstack_bottom``. This is reasonably
OK because ``thr`` is a "hot" variable.

The register base address could also be copied to a local variable as is done
for constants. However, ``thr->valstack_bottom`` is not a stable address and
may be changed by any side effect (because any side effect can cause a value
stack resize, e.g. if a finalizer is invoked).
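
A toy model of the hazard (not Duktape code; the types and the resize trigger
are invented): a locally cached register base pointer becomes stale whenever
the value stack is reallocated and must be refreshed afterwards::

  #include <stdio.h>
  #include <stdlib.h>

  typedef struct {
      double *valstack;
      double *valstack_bottom;  /* register base of the current frame */
      size_t size;
  } toy_thread;

  static void toy_grow_valstack(toy_thread *thr, size_t new_size) {
      size_t bottom_off = (size_t) (thr->valstack_bottom - thr->valstack);
      double *p = (double *) realloc(thr->valstack, new_size * sizeof(double));
      if (p == NULL) {
          abort();
      }
      thr->valstack = p;
      thr->valstack_bottom = p + bottom_off;  /* rebase after realloc */
      thr->size = new_size;
  }

  int main(void) {
      toy_thread thr;
      thr.valstack = (double *) calloc(4, sizeof(double));
      if (thr.valstack == NULL) {
          return 1;
      }
      thr.valstack_bottom = thr.valstack;
      thr.size = 4;

      double *cached_base = thr.valstack_bottom;  /* like a local 'regs' copy */
      cached_base[0] = 1.0;

      /* A side effect (e.g. a finalizer) grows the value stack; the old
       * cached base may now point into freed memory and must not be used.
       */
      toy_grow_valstack(&thr, 1024);
      cached_base = thr.valstack_bottom;          /* refresh after resize */
      cached_base[0] += 1.0;

      printf("register 0 = %f\n", thr.valstack_bottom[0]);
      free(thr.valstack);
      return 0;
  }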

If a local variable were to be used, it would need to be updated whenever the
value stack is resized. It's not certain whether overall performance would be
improved. This was postponed to Duktape 1.4:

* https://github.com/svaarala/duktape/issues/298