=======================
Duktape bytecode format
=======================
Overview
========
Duktape has API functions to dump a compiled function into bytecode and load
(reinstantiate) a function from a bytecode dump. Bytecode dump/load allows
code to be compiled offline, compiled code to be cached and reused, compiled
code to be moved from one Duktape heap to another, etc. However, Duktape
bytecode format is version specific so it is *not* a version neutral code
distribution format like Java bytecode. (The term "bytecode" is used here
and in other Duktape documentation even though it's a bit inaccurate: the
serialization format includes many other fields besides bytecode
instructions.)
Duktape bytecode is **version specific** and (potentially) **config option
specific**, and may change arbitrarily even in minor releases (but is
guaranteed not to change in a patch release, as long as config options are
kept the same). In other words, the bytecode format is not part of the
ordinary versioning guarantees. If you compile code into bytecode offline,
you must ensure such code is recompiled whenever Duktape source is updated.
In this sense Duktape bytecode differs fundamentally from e.g. Java bytecode
which is used as a version neutral distribution format.
Duktape bytecode is **unvalidated** which means that loading untrusted or
broken bytecode may cause a crash or other memory unsafe behavior, leading
to potentially exploitable vulnerabilities. Calling code is responsible for
ensuring that bytecode for a different Duktape version is not loaded, and that
the bytecode input is not truncated or corrupted. (Validating bytecode is
quite difficult, because one would also need to validate the actual bytecode
instructions, which might otherwise refer to non-existent registers or
constants, jump out of bounds, etc.)
The bytecode format is **platform neutral**, so that it's possible to compile
the bytecode on one platform and load it on another, even if the platforms
have different byte order. This is useful for offline compilation when
cross compiling.
There are a few limitations on what kind of functions can be dumped into
bytecode, and what information is lost in the process. See the separate
section on limitations below. The following API test case provides concrete
examples of usage and current limitations:
* ``api-testcases/test-dump-load-basic.c``
Working with bytecode
=====================
The ``duk_dump_function()`` API call is used to convert a function into a
buffer containing bytecode::

    duk_eval_string(ctx, "(function myfunc() { print('hello world'); })");
    duk_dump_function(ctx);
    /* -> stack top contains bytecode for 'myfunc' */

The ``duk_load_function()`` API call does the reverse, converting a buffer
containing bytecode into a function object::

    /* ... push bytecode to value stack top */
    duk_load_function(ctx);
    /* -> stack top contains function */

The Duktape command line tool "duk" can also be used to compile a file
into bytecode::

    ./duk -c /tmp/program.bin program.js

The input source is compiled as an Ecmascript program and the bytecode
will be for the "program function". The command line tool doesn't support
compiling individual functions, and is mostly useful for playing with
bytecode.
The command line tool can also execute bytecode functions; it will just load
a function and call it without arguments, as if a program function was being
executed::

    ./duk /tmp/program.bin

When to use bytecode dump/load
==============================
There are two main motivations for using bytecode dump/load:
* Performance
* Obfuscation
Performance
-----------
Whenever compilation performance is *not* an issue, it is nearly always
preferable to compile functions from source rather than using bytecode
dump/load. Compiling from source is memory safe, version compatible,
and has no semantic limitations like bytecode.
There are some applications where compilation is a performance issue.
For example, a certain function may be compiled and executed over and
over again in short lived Duktape global contexts or even separate
Duktape heaps (which prevents reusing a single function object). Caching
the compiled function bytecode and instantiating the function by loading
the bytecode is much faster than recompiling it for every execution.
Obfuscation
-----------
Obfuscation is another common reason to use bytecode: it's more difficult
to reverse engineer source code from bytecode than e.g. minified code.
However, when doing so, you should note the following:
* Some minifiers support obfuscation which may be good enough and avoids
  the bytecode limitations and downsides.
* For some targets source code encryption may be a better option than
  relying on bytecode for obfuscation.
* Although Duktape bytecode doesn't currently store source code, it does
  store all variable names (``_Varmap``) and formal argument names
  (``_Formals``) which are needed in some functions. Source code may also
  be included in bytecode at some point to support debugging. In other
  words, **obfuscation is not a design goal for the bytecode format**.
That said, concrete issues to consider when using bytecode for obfuscation:
* Variable names in the ``_Varmap`` property: this cannot be easily avoided
in general but a minifier may be able to rename variables.
* Function name in the ``name`` property: this can be deleted or changed
before dumping a function, but note that some functions (such as
self-recursive functions) may depend on the property being present and
correct.
* Function filename in the ``fileName`` property: this can also be deleted
or changed before dumping a function. You can avoid introducing a filename
at all by using ``duk_compile()`` (rather than e.g. ``duk_eval_string()``)
to compile the function.
* Line number information in the ``_Pc2line`` property: this can be deleted or
changed, or you can configure Duktape not to store this information in the
first place (using option ``DUK_USE_PC2LINE``). Without line information
tracebacks will of course be less useful.
When not to use bytecode dump/load
==================================
Duktape bytecode is **not** a good match for:
* Distributing code
* Minimizing code size
Distributing code
-----------------
It's awkward to use a version specific bytecode format for code distribution.
This is especially true for Ecmascript, because the language itself is
otherwise well suited for writing backwards compatible code, detecting
features at run-time, etc.
It's also awkward for code distribution that the bytecode load operation
relies on calling code to ensure the loaded bytecode is trustworthy and
uncorrupted. In practice this means e.g. cryptographic signatures are
needed to avoid tampering.
Minimizing code size
--------------------
The bytecode format is designed to be fast to dump and load, while still
being platform neutral. It is *not* designed to be compact (and indeed
is not).
For example, for a simple Mandelbrot function (``mandel()`` in
``dist-files/mandel.js``):

+---------------------------+----------------+----------------------+
| Format                    | Size (bytes)   | Gzipped size (bytes) |
+===========================+================+======================+
| Original source           | 884            | 371                  |
+---------------------------+----------------+----------------------+
| Bytecode dump             | 809            | 504                  |
+---------------------------+----------------+----------------------+
| UglifyJS2-minified source | 364            | 267                  |
+---------------------------+----------------+----------------------+

For minimizing code size, using a minifier and ordinary compression is
a much better idea than relying on compressed or uncompressed bytecode.
Bytecode limitations
====================
Function lexical environment is lost
------------------------------------
A function loaded from bytecode always works as if it was defined in the
global environment so that any variable lookups not bound in the function
itself will be resolved through the global object. If you serialize ``bar``
created as::

    function foo() {
        var myValue = 123;
        function bar() {
            // myValue will be 123, looked up from 'foo' scope
            print(myValue);
        }
        return bar;
    }

and then load it back, it will behave as if it was originally created as::

    function bar() {
        // myValue will be read from global object
        print(myValue);
    }

If the original function was established using a function declaration,
the declaration itself is not restored when a function is loaded. This
may be confusing. For example, if you serialize ``foo`` declared as::

    function foo() {
        // Prints 'function' before dump/load; 'foo' is looked up from
        // the global object.
        print(typeof foo);
    }

and then load it back, it will behave as::

    var loadedFunc = (function() {
        // Prints 'undefined' after dump/load; 'foo' is looked up from
        // the global object. Workaround is to assign loadedFunc to
        // globalObject.foo manually before calling to simulate declaration.
        print(typeof foo);
    });

No function name binding for function declarations
--------------------------------------------------
Function name binding for function expressions is supported, e.g. the
following function would work::

    // Can dump and load this function, the reference to 'count' will
    // be resolved using the automatic function name lexical binding
    // provided for function expressions.
    var func = function count(n) { print(n); if (n > 0) { count(n - 1); } };

However, for technical reasons functions that are established as global
declarations work a bit differently::

    // Can dump and load this function, but the reference to 'count'
    // will look up globalObject.count instead of automatically
    // referencing the function itself. Workaround is to assign
    // the function to globalObject.count after loading.
    function count(n) { print(n); if (n > 0) { count(n - 1); } };

(The NAMEBINDING flag controls creation of a lexical environment which
contains the function expression name binding. In Duktape 1.2 the flag
is only set for function templates, not function instances; this was
changed for Duktape 1.3 so that the NAMEBINDING flag could be detected
when loading bytecode, and a lexical environment can then be created
based on the flag.)
Custom internal prototype is lost
---------------------------------
A custom internal prototype is lost, and ``Function.prototype`` is used
on bytecode load.
Custom external prototype is lost
---------------------------------
A custom external prototype (``.prototype`` property) is lost, and a
default empty prototype is created on bytecode load.
Finalizer on the function is lost
---------------------------------
A finalizer on the function being serialized is lost; no finalizer will
exist on the function after bytecode load.
Only specific function object properties are kept
-------------------------------------------------
Only specific function object properties, i.e. those needed to correctly
revive a function, are kept. These properties have type and value
limitations:
* .length: uint32, non-number values replaced by 0
* .name: string required, non-string values replaced by empty string
* .fileName: string required, non-string values replaced by empty string
* ._Formals: internal property, value is an array of strings
* ._Varmap: internal property, value is an object mapping identifier
names to register numbers
Bound functions are not supported
---------------------------------
Currently a ``TypeError`` is thrown when trying to serialize a bound function
object.
CommonJS modules don't work well with bytecode dump/load
--------------------------------------------------------
CommonJS modules cannot be trivially serialized because they're normally
evaluated by embedding the module source code inside a temporary function
wrapper (see ``modules.rst`` for details). User code does not have access
to the temporary wrapped function. This means that:
* If you compile and serialize the module source, the module will
have incorrect scope semantics.
* You could add the function wrapper and compile the wrapped function
instead.
* Module support for bytecode dump/load will probably need future work.
Bytecode format
===============
A function is serialized into a platform neutral byte stream. Multibyte
values are in network order (big endian), and don't have any alignment
guarantees.
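
As an illustration of what "network order without alignment guarantees"
means for code that parses such a stream, here is a minimal sketch in C.
The helper names ``read_u32_be`` and ``write_u32_be`` are hypothetical and
are not Duktape functions; Duktape's own implementation may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit unsigned value stored in network order (big endian).
 * Bytes are fetched one at a time, so 'p' needs no particular alignment
 * and the result does not depend on the host byte order.
 */
uint32_t read_u32_be(const unsigned char *p) {
    assert(p != NULL);
    return ((uint32_t) p[0] << 24) |
           ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] << 8) |
           (uint32_t) p[3];
}

/* Write a 32-bit unsigned value in network order, one byte at a time. */
void write_u32_be(unsigned char *p, uint32_t v) {
    assert(p != NULL);
    p[0] = (unsigned char) ((v >> 24) & 0xffU);
    p[1] = (unsigned char) ((v >> 16) & 0xffU);
    p[2] = (unsigned char) ((v >> 8) & 0xffU);
    p[3] = (unsigned char) (v & 0xffU);
}
```

Because both helpers work on individual bytes, a value can start at any
offset in the buffer, including an odd one.
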
Because the exact format is version specific, it's not documented in full
detail here. Doing so would mean tedious documentation updates whenever
bytecode was changed, and documentation would then easily fall out of date.
The exact format is ultimately defined by the source code, see:
* ``src-input/duk_api_bytecode.c``
* ``tools/dump_bytecode.py``
As a simplified summary of the bytecode format:
* There's a two-byte header: the first byte is a 0xff marker byte (which never
occurs in valid extended UTF-8 strings); the second byte is a bytecode version
which is used as a crude validity check.
* The header is followed by a serialized function. The function may contain
inner functions which are serialized recursively (without duplicating the
two-byte header).
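
The two-byte header summarized above could be checked by calling code
roughly as follows. This is only a sketch: ``has_bytecode_header`` is a
hypothetical helper, the marker value is taken from the description above,
and the expected version byte is left as a parameter because its value
depends on the Duktape version in use.

```c
#include <assert.h>
#include <stddef.h>

#define BYTECODE_MARKER 0xffU  /* marker byte as described in this document */

/* Return 1 if 'buf' begins with the two-byte header (marker byte followed
 * by the expected version byte), 0 otherwise.  'expected_version' is an
 * assumption supplied by the caller, not a value defined here.
 */
int has_bytecode_header(const unsigned char *buf, size_t len,
                        unsigned char expected_version) {
    assert(buf != NULL || len == 0);
    if (len < 2) {
        return 0;  /* too short to contain the header */
    }
    if (buf[0] != BYTECODE_MARKER) {
        return 0;  /* e.g. ordinary string data */
    }
    return buf[1] == expected_version;
}
```

Passing this check says nothing about the bytes that follow the header, so
it is not a substitute for loading bytecode only from trusted sources.
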
The function serialization format is tedious and best looked up directly from
source code.
NOTE: The top level function is a function instance, but all inner functions
are function templates. There are some differences between the two which must
be taken into account in bytecode serialization code.
Security and memory safety
==========================
Duktape bytecode must only be loaded from a trusted source: loading broken
or maliciously crafted bytecode may lead to memory unsafe behavior, even
exploitable behavior.
Because bytecode is version specific, it is generally unsafe to load bytecode
provided by a network peer -- unless you can somehow be certain the bytecode
is specifically compiled for your Duktape version.
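
One way an application might guard against accidental version mismatches
and truncation is to wrap each dump in its own envelope before storing it.
The layout below is purely hypothetical and is not part of Duktape; the
version tag could be, for example, the ``DUK_VERSION`` value of the build
that produced the dump. Such an envelope does not protect against
deliberate tampering, and it does not capture config option differences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENVELOPE_HEADER_LEN 8U  /* 4-byte version tag + 4-byte payload length */

/* Decode a big endian 32-bit value byte by byte. */
static uint32_t be32(const unsigned char *p) {
    return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Hypothetical application-level envelope around a bytecode dump:
 *   bytes 0..3  version tag of the Duktape build that produced the dump
 *   bytes 4..7  payload length in bytes
 *   bytes 8..   the dump itself
 * Returns a pointer to the payload, or NULL if the envelope is too short,
 * the version tag differs, or the stored length does not match.
 */
const unsigned char *envelope_payload(const unsigned char *buf, size_t len,
                                      uint32_t expected_version,
                                      size_t *payload_len) {
    uint32_t version, stored_len;

    assert(payload_len != NULL);
    if (buf == NULL || len < ENVELOPE_HEADER_LEN) {
        return NULL;
    }
    version = be32(buf);
    stored_len = be32(buf + 4);
    if (version != expected_version) {
        return NULL;  /* produced by a different Duktape version */
    }
    if ((size_t) stored_len != len - ENVELOPE_HEADER_LEN) {
        return NULL;  /* truncated or padded input */
    }
    *payload_len = (size_t) stored_len;
    return buf + ENVELOPE_HEADER_LEN;
}
```

Only a payload returned by this function would then be pushed to the value
stack and passed to ``duk_load_function()``; a ``NULL`` result means the
source should be recompiled instead.
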
Design notes
============
Eval and program code
---------------------
The Ecmascript specification recognizes three different types of code:
program code, eval code, and function code, with slightly different scope and
variable binding semantics. The serialization mechanism supports all three
types.
Version specific vs. version neutral
------------------------------------
Duktape bytecode instruction format is already version specific and can change
between even minor releases, so it's quite natural for the serialization
format to also be version specific.
Providing a version neutral format would be possible when Duktape bytecode no
longer changes in minor versions (not easy to see when this would be the case)
or by doing some kind of recompilation for bytecode.
Config option specific
----------------------
Some Duktape options may affect what function metadata is available. E.g. you
can disable line number information (pc2line) which might then be left out of
the bytecode dump altogether. Attempting to load such a dump in a Duktape
environment compiled with line number information enabled might then fail due
to a format error.
(In the initial master merge there are no config option specific format
differences, but there may be such differences in later Duktape versions
if it's convenient to do so.)
Endianness
----------
Network byte order (big endian) was chosen because it's also used elsewhere
in Duktape (e.g. the debugger protocol) as the default, portable endianness.
Faster bytecode dump/load could be achieved by using native endianness and
(if necessary) padding to achieve proper alignment. This additional speed
improvement was considered less important than portability.
Platform neutrality
-------------------
Supporting cross compilation is a useful feature so that bytecode generated on
one platform can be loaded on another, as long as they run the same Duktape
version.
The cost of being platform neutral is rather small. The essential features
are normalizing endianness and avoiding alignment assumptions. Both can be
quite easily accommodated with relatively little run-time cost.
Bytecode header
---------------
The initial 0xFF byte is used because it can never appear in valid UTF-8
(even extended UTF-8) so that using a random string accidentally as bytecode
input will fail.
Memory safety and bytecode validation
-------------------------------------
The bytecode load primitive is memory unsafe, to the extent that trying to
load corrupted (truncated and/or modified) bytecode may lead to memory unsafe
behavior (even exploitable behavior). To keep bytecode loading fast and simple,
there are not even bounds checks when parsing the input bytecode.
This might seem strange but is intentional: while it would be easy to do basic
syntax validation for the serialized data when it is loaded, it still wouldn't
guarantee memory safety. To do so one would also need to validate the bytecode
opcodes, otherwise memory unsafe behavior may happen at run time.
Consider the following example: a function being loaded has ``nregs`` 100, so
that 100 slots are allocated from the value stack for the function. If the
function bytecode then executed either of the following instructions, it
would access the value stack out of bounds::

    LDREG 1, 999    ; read reg 999, out of bounds
    STREG 1, 999    ; write reg 999, out of bounds

Similar issues exist for constants; if the function has 100 constants::

    LDCONST 1, 999  ; read constant 999, out of bounds

In addition to direct out-of-bounds references there are also "indirect"
opcodes which e.g. load a register index from another register. Validating
these would be a lot more difficult and would need some basic control flow
algorithm, etc.
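
To make the distinction concrete, the sketch below checks only *direct*
operands against ``nregs`` and the constant count. Everything in it is
hypothetical: the ``sketch_ins`` structure and opcode names are invented
for illustration and do not match Duktape's real, version specific
instruction encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoded instruction; NOT Duktape's actual encoding. */
enum sketch_op { OP_LDREG, OP_STREG, OP_LDCONST };

struct sketch_ins {
    enum sketch_op op;
    unsigned int a;   /* first operand, always a register index */
    unsigned int bc;  /* register index, or constant index for OP_LDCONST */
};

/* Return 1 if every direct register/constant operand is in range, else 0.
 * Indirect operands (an index read from a register at run time) cannot be
 * checked this way, because their values are not known at load time.
 */
int direct_operands_in_range(const struct sketch_ins *code, size_t count,
                             unsigned int nregs, unsigned int nconsts) {
    size_t i;

    assert(code != NULL || count == 0);
    for (i = 0; i < count; i++) {
        unsigned int limit = (code[i].op == OP_LDCONST) ? nconsts : nregs;
        if (code[i].a >= nregs || code[i].bc >= limit) {
            return 0;
        }
    }
    return 1;
}
```

A check like this would reject the ``LDREG 1, 999`` and ``LDCONST 1, 999``
examples above, but it is exactly the kind of partial solution discussed
here: it says nothing about indirect operands or jump targets.
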
Overall it would be quite difficult to implement bytecode validation that
would correctly catch broken and perhaps maliciously crafted bytecode, and
it's not very useful to have a partial solution in place.
Even so, there is a very simple header signature for bytecode which ensures
that obviously incorrect values are rejected early. The signature ensures
that (1) no ordinary string data can accidentally be loaded as bytecode
(the initial byte 0xFF is invalid extended UTF-8); and (2) there is a basic
bytecode version check. Any bytes beyond this signature are unvalidated.
Future work
===========
Full value serialization
------------------------
Bytecode dump/load is restricted to a subset of function values. It would be
more elegant to support generic value dump/load. However, there are several
practical issues:
* Arbitrary object graphs would need to be supported, which is quite
challenging.
* There'd have to be some mechanism to "revive" any native values on
load. For example, for a native object representing an open file,
the revive operation would reopen the file and perhaps seek the file
to the correct offset.
Support bound functions
-----------------------
Currently a TypeError is thrown for bound functions. As a first step, it's
probably better to follow the bound chain and serialize the final target
function instead, i.e. bound status would be lost during serialization.
This is more in line with serializing with loss of some metadata rather than
throwing an error.
As the second step, it would be nice to serialize the bound ``this`` and
argument values. However, proper generic value serialization would be needed
to do that.
Caching CommonJS modules
------------------------
Caching CommonJS modules would be very useful. Figure out how to do that
when reworking the module mechanism.
Figure out debugger overlap
---------------------------
The debugger protocol has its own value serialization format (with somewhat
different goals):
- Would it be sensible to share value serialization format between dump/load
and debugger protocol?
- Should function values be serialized in the debugger protocol in the
bytecode dump/load format? Would that be useful for the debugger (not
immediately apparent why)?