mirror of https://github.com/svaarala/duktape.git
======================
Duktape ES5.1 compiler
======================

Introduction
============

This document provides an overview of the Duktape ES5.1 compiler structure,
the basic compilation process, and notes on difficult-to-compile constructs.
Assorted lexer and compiler design notes are covered in separate sections in
no specific order (these are raw notes collected on the first implementation
pass). Future work lists known areas for improvement (not exhaustively).

The document does not cover the changes needed to support new ES2015 constructs
like destructuring assignment; the main change will be the addition of some
form of an intermediate representation (IR), perhaps a statement level
expression tree or a full expression tree. Having an IR will also enable
several optimizations.

This document is a snapshot of design issues and will not be kept exactly
up-to-date with the compiler. The document was originally written for Duktape
1.0, so some parts may be out-of-date.

Compilation overview
====================

Basics
------

The compiler converts source code containing global code, eval code, or
function code into an executable form. The most important parts of that
form are bytecode, constants, function metadata, and inner function
templates. Compilation can also end in a ``SyntaxError``, which is
mandated to be an "early error" in Ecmascript, or in an internal error,
such as an out-of-memory error.

The end result of compilation is more specifically a *function template*.
A function template does not yet have a lexical environment; it merely
refers to symbols outside of its own scope by name. A function template
is instantiated into a closure which supplies the missing lexical environment.
Inner functions are also function templates, and multiple closures may
be created from a single template::

  // For each invocation of f(), a separate closure of g() is created
  function f(x) {
      function g() {
          print(x);
      }
      return g;
  }

Compilation depends on two major components:

* A *lexer*, which generates tokens from an underlying input stream on request.
  The lexer supports backtracking and reparsing to allow multi-pass parsing,
  at the cost of not currently supporting streamed parsing (adding support
  for chunked streaming would be possible, see future work).

* A *compiler*, which generates bytecode from a token stream. The compiler
  uses two (or more) passes over a function, and generates bytecode directly,
  avoiding an explicit intermediate representation. An *intermediate value*
  (ivalue) concept is used instead for managing expression values.

Code and memory footprint concerns originally led to the decision to avoid
a comprehensive intermediate representation in the compiler; a limited
intermediate value concept is used instead. This complicates compilation
somewhat but reduces memory footprint. A more comprehensive intermediate
representation will be needed in future versions to fully support ES2015.

Lexer
-----

The lexer has a straightforward design:

* Source code is decoded from UTF-8 into a window of Unicode codepoints,
  with an infinite number of virtual EOF codepoints following the end-of-input.
  The decoder supports rewinding: the current position in the source stream
  can be saved and restored at a later time.

* The main lexing function decodes individual tokens like keywords, numbers,
  identifiers, etc., using the codepoint window for safe lookups. The lexing
  function keeps track of token line number and source byte offset information,
  and the eligibility of tokens for automatic semicolon insertion.

The upside of the codepoint window is that character encoding is effectively
hidden and normalized in the process, and that the main lexing function can
safely do forward lookups up to a certain limit without explicit end-of-input
checks. It would also be possible to support encodings other than UTF-8 quite
easily. The downside is a small performance impact over hardcoding the lexer
for UTF-8 input, which would also be a viable design.

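The window idea above can be sketched as follows. This is a minimal Python
illustration only; the class name, window size, and methods are assumptions
for this sketch and do not match Duktape's actual C internals.

```python
# Sketch of a codepoint window over an input stream (illustrative only;
# names and behavior do not match Duktape's actual lexer internals).

EOF = -1  # virtual EOF codepoint

class CodepointWindow:
    def __init__(self, data):
        # Decode up front for simplicity; a real lexer decodes UTF-8
        # incrementally and keeps only a small window of codepoints.
        self.cps = [ord(c) for c in data]
        self.pos = 0

    def peek(self, offset=0):
        """Look up a codepoint at pos+offset without end-of-input checks:
        past the end, an infinite run of virtual EOFs is returned."""
        i = self.pos + offset
        return self.cps[i] if i < len(self.cps) else EOF

    def advance(self, n=1):
        self.pos += n

    def save(self):
        return self.pos

    def restore(self, saved):
        # Rewind support, e.g. for multi-pass reparsing.
        self.pos = saved

w = CodepointWindow("var x")
assert w.peek(0) == ord("v")
assert w.peek(10) == EOF       # lookups past the end are safe
mark = w.save()
w.advance(3)
assert w.peek(0) == ord(" ")
w.restore(mark)                # rewind to the saved position
assert w.peek(0) == ord("v")
```

The virtual EOF padding is what lets token decoders do fixed-distance
lookahead (e.g. for multi-character operators) with no bounds checks.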
Compiler
--------

The compiler component uses a hand crafted recursive descent statement parser,
with somewhat tricky handling of a few Ecmascript constructs (like the
"for/for-in" statement). In contrast, top-down operator parsing is used for
parsing expressions (see http://javascript.crockford.com/tdop/tdop.html),
which is nice in that it allows single pass parsing while generating mostly
nice bytecode.

Compilation is triggered by Duktape C API compilation/eval calls, ``eval()``
calls, or ``Function`` constructor calls. The calling context provides the
necessary flags to indicate whether global code, eval code, or function code
is being compiled. The global compiler state is set up, and a "current
function" is set up to match the requested compilation context. The
function body is then parsed with a multi-pass parser, resulting in final
bytecode, constants, inner functions, and a lot of other state (for example,
flags indicating whether the function accesses ``arguments`` or ``eval``;
this state is needed to optimize functions). Finally, the "current function"
is converted from the compiler representation to the run-time function
(template) format used by the executor.

Statement parsing is straightforward: a function body is parsed as a sequence
of "source elements", which are otherwise the same as statements except that
function declarations are also allowed (however, Duktape allows function
declarations anywhere in a function for compatibility with existing code and
other engines). For global code and eval code there's a need to support an
"implicit return value" which requires tracking of statement values; an
implicit return value is not needed for function code.

Expressions are parsed into *intermediate values* (ivalues) which are one
step from being concrete, fully referenced values. This missing step allows
ivalues to be used both as left-hand-side values (assignment targets) and
right-hand-side values (evaluated values), and allows limited constant folding
at compile time. An ivalue may, for example, represent:

* A constant value which is not yet registered to the function constant table
  (e.g. the string ``"foo"``).

* A constant registered to the function constant table (denoted e.g. ``C18``
  for constant 18).

* A register in the function register frame (denoted e.g. ``R23`` for register
  23).

* A variable access using a variable name.

* A property access with a base value and a key, with the base value and key
  being registers or constants.

* A unary or binary arithmetic operation.

One way to characterize ivalues is that instead of using a full expression
tree as an intermediate representation, the compiler uses small, local
fragments of that tree without ever constructing the full tree. Each ivalue
is either a leaf node or an internal node with two leaf node children.

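The ivalue variants listed above can be sketched as a shallow tagged union.
This Python sketch is purely illustrative; the class names are assumptions
and do not correspond to the fields of Duktape's actual ``duk_ivalue``/
``duk_ispec`` structs.

```python
# Illustrative sketch of the ivalue idea: each ivalue is a leaf
# (plain value, register, constant, variable name) or a shallow node
# whose children are themselves leaves -- never a full expression tree.

from dataclasses import dataclass

@dataclass
class Plain:      # constant value not yet in the constant table
    value: object

@dataclass
class Reg:        # register in the function register frame, e.g. R23
    num: int

@dataclass
class Const:      # constant table entry, e.g. C18
    num: int

@dataclass
class Var:        # variable access by name
    name: str

@dataclass
class Prop:       # property access: base and key are Reg/Const leaves
    base: object
    key: object

@dataclass
class Arith:      # shallow binary node, children are leaves
    op: str
    left: object
    right: object

# "1 + 2" becomes a shallow Arith node; constant folding can collapse
# it at conversion time without emitting any bytecode.
iv = Arith("+", Plain(1), Plain(2))
if iv.op == "+" and isinstance(iv.left, Plain) and isinstance(iv.right, Plain):
    iv = Plain(iv.left.value + iv.right.value)
assert iv == Plain(3)
```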
When parsing an expression the compiler typically creates an ivalue to
represent the value of the expression; some bytecode may be emitted in the
process to prepare registers/constants for the ivalue, but this is not always
the case. The ivalue may then be used as an input to another expression,
converted from one form to another if necessary. This conversion process
may allocate new constants or registers, and emit bytecode as necessary.
For example, the result of a previously parsed ivalue representing an addition
operation may be needed in a single register/constant. The compiler converts
the ivalue by allocating a temporary register and emitting the ADD opcode to
compute the result. The temporary register can then be used as an input in
another ivalue as needed.

Creating ivalues for expressions and converting ivalues from one form to
another drives much of the code generation process. The ivalue conversion
helpers can also perform limited optimization, such as constant folding for
numbers and strings.

Bytecode emission is conceptually quite simple: expression and ivalue handling
code simply requests opcodes to be emitted as needed. However, the bytecode
emission functions transparently handle *register shuffling* to extend the
range of addressable registers. For example, the binary ``ADD X, Y, Z``
opcode can directly address only an 8-bit target register (X) and two 8-bit
source registers or constants (Y and Z). If any arguments exceed their
allowed range, the bytecode emission functions emit the necessary opcodes
to shuffle source and/or target values through temporary registers. While
such code is not optimal, it is necessary to support very large functions
(for example those produced by Emscripten).

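The shuffling idea can be sketched as follows, using the ``ADD`` example
above. The opcode names (``LDREG``/``STREG``) and the temp-register policy
here are assumptions for illustration, not Duktape's actual emitter API.

```python
# Sketch of transparent register shuffling in the bytecode emitter
# (illustrative; opcode names and limits are assumed, not Duktape's).

MAX_DIRECT = 255  # largest register addressable by an 8-bit operand field

def emit_add(code, dst, src1, src2, next_temp):
    """Emit ADD dst, src1, src2, shuffling any out-of-range operand
    through a temporary register; next_temp is the first free temp."""
    temp = next_temp
    op1, op2 = src1, src2
    if src1 > MAX_DIRECT:
        code.append(("LDREG", temp, src1))  # shuffle source 1 into range
        op1, temp = temp, temp + 1
    if src2 > MAX_DIRECT:
        code.append(("LDREG", temp, src2))  # shuffle source 2 into range
        op2, temp = temp, temp + 1
    if dst > MAX_DIRECT:
        code.append(("ADD", temp, op1, op2))
        code.append(("STREG", dst, temp))   # shuffle result to real target
    else:
        code.append(("ADD", dst, op1, op2))

code = []
emit_add(code, 10, 20, 30, next_temp=0)    # all in range: single opcode
assert code == [("ADD", 10, 20, 30)]

code = []
emit_add(code, 300, 20, 400, next_temp=0)  # dst and src2 out of range
assert code == [("LDREG", 0, 400), ("ADD", 1, 20, 0), ("STREG", 300, 1)]
```

Because the shuffling happens inside the emit helper, expression and ivalue
code never needs to care whether its operands happen to be addressable.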
Two (or more) passes are made over every parsed function. On the first pass
we don't know which variables and inner functions the function will declare,
as such declarations are conceptually "hoisted" to the top of the function.
One purpose of the first pass is to gather this information for the second
pass. Even so, to keep the code simple, the first pass also generates "broken"
throw-away bytecode so that the same parsing code can be used for all passes.

On the second pass all the necessary information has been gathered and actual
bytecode can be emitted. A simple in-place peephole optimizer is applied to
the bytecode before generating final bytecode. The peephole optimizer
currently only straightens out JUMPs (JUMP->JUMP->X is converted to JUMP->X).

The temporary registers needed for shuffling are only allocated when they're
actually needed. Typically this is noticed on the first pass, but in some
cases it is only detected on the second pass; in such cases a third pass is
needed to generate the final bytecode.

This multi-pass approach has several downsides: (1) it requires a lexer which
can backtrack to the beginning of the function; and (2) time is wasted in
lexing and compiling the function twice (in an initial design inner functions
would also get parsed *four times* in total, their inner functions
*eight times* in total, etc., but there's a specific solution to this problem
in the current compiler). The upside of multi-pass parsing is that there is
no need for an intermediate representation, which saves on memory footprint.

The statement parser keeps track of a "statement number" within the
current function. This is not needed for any critical purpose, but it
allows the first compilation pass to stash information related to a
certain statement for the second pass, perhaps allowing more optimal
code generation. For instance, the first pass could note that a loop
statement has no relevant break/continue statements, so a label site is
not actually needed. Similarly, expression counts, token counts, or source
offsets could be used to address constructs to help in multi-pass parsing.
However, no such optimizations are currently used by the compiler.

Recursive expression parsing, statement parsing, and function parsing may
happen during parsing; for example, a function expression may appear almost
anywhere and trigger recursive function compilation. To fully support
recursion in function parsing, all compilation state is kept in the "current
function" state rather than the global compiler state.

Both the lexer and compiler need to deal with the fact that garbage collection
may occur almost anywhere (which may even lead to nested compilation if a
finalizer is invoked), errors may be thrown almost anywhere, and so on. All
resources must thus be visible to the garbage collector and correctly reference
counted at nearly all times. The current approach to deal with this is to use
the current thread's value stack to stash token values, intermediate values,
identifier names, etc. Slots are allocated from the value stack as necessary.
This is a bit complicated; some alternatives:

* Finalization (and possibly mark-and-sweep) could be prevented during
  compilation.

* Make the compiler state a traversable object type visible to garbage
  collection.

Ivalue example
--------------

Expression parsing and ivalue manipulation drive most of the code generation
process. Let's look at a concrete example of how these work together to
generate bytecode.

Consider the statement::

  x.y.z = 1 + 2;

The steps taken to compile the statement are roughly:

* The "x" expression generates ivalue I1 of type "variable access" with the
  variable name "x", which is not yet allocated a constant identifier. No
  code is emitted.

* The ".y" part generates ivalue I2 of type "property access":

  - The base value (I1) needs to be a register or a constant, so a constant C0
    is allocated for the variable name (``"x"``) and a temporary register R0
    for the value, and bytecode to read the variable is emitted (``GETVAR
    R0, C0``).

  - The key needs to be a register or constant, so a constant C1 is allocated
    for the key (``"y"``). No bytecode needs to be emitted.

  - I2 base value is R0, key is C1.

* The ".z" part generates ivalue I3 of type "property access":

  - The base value (I2) is coerced into a new temporary register R1 by
    emitting bytecode for the property load (``GETPROP R1, R0, C1``).

  - A constant C2 is allocated for the key (``"z"``).

  - I3 base value is R1, key is C2.

* The compiler notices an assignment operator and parses the right side.
  The constants 1 and 2 are compiled into ivalues I4 and I5 initially, and
  then combined into an ivalue I6 representing the addition of two constants.
  No code is emitted for the addition yet.

* To perform the assignment the right-hand side (I6) needs to be coerced into
  a single register/constant. For this specific ivalue the compiler notices
  that two integer constants are being added, so constant folding is used.
  The compiler allocates a temporary register R2 and emits bytecode to load
  the integer (``LDINT R2, 3``). The ivalue I7 represents the result in R2.
  (The compiler could also register a new constant instead of using an integer
  load, but (some) integers are more efficiently handled using direct integer
  loads.)

* Finally, the assignment operation uses I3 as its target and I7 as its
  source, emitting a property write (``PUTPROP R1, C2, R2``). Here I3 is
  used as a left-hand side value (write target) rather than as a right-hand
  side value.

While there are multiple steps and ivalues, the bytecode emitted from this
process is relatively short (the opcodes here are for illustration only and
don't match 1:1 with the actual opcodes used by Duktape)::

  ; Constant C0: "x"
  ; Constant C1: "y"
  ; Constant C2: "z"

  GETVAR  R0, C0      ; read variable "x" to R0
  GETPROP R1, R0, C1  ; read property R0["y"] (= x.y) to R1
  LDINT   R2, 3       ; load integer 3 to R2
  PUTPROP R1, C2, R2  ; write property R1["z"] (= x.y.z), value R2 (integer 3)

As can be seen from the example, ivalues are convenient in that the final
result of a property expression has a single format (an ivalue) which is
one step removed from the final value. This allows them to be used both as
left-hand-side and right-hand-side values; the decision is made by the caller
in the final conversion. Optimizations are also possible when converting
ivalues from one form to the next.

Ivalue conversion also provides a lot of flexibility: if the result of a
previous expression isn't directly compatible with the needs of the expression
being parsed, ivalues can be converted to the required form. Because ivalues
are one step away from being completed, inefficient conversions are mostly
(but certainly not always) avoided. For example, an ivalue representing an
integer can be converted either to a register or a constant, with the
necessary bytecode only emitted when it's known which one is preferred.

Several details are omitted from this description; for example:

* The compiler tries to reuse temporary registers where possible to reduce the
  number of temporaries needed.

* Local variables (including arguments) are assigned to registers and are
  accessed directly without an explicit variable read/write operation (GETVAR
  or PUTVAR).

* Register shuffling might be needed; it is currently handled transparently
  by the bytecode emission functions.

Bytecode
--------

The bytecode opcodes used by Duktape are chosen simply to work well for both
compilation and execution. The bytecode is not version compatible, and may
change arbitrarily even in minor versions. The role of Duktape bytecode is
not to be a code distribution format like e.g. Java bytecode.

The bytecode executor is the best source for documentation on exact bytecode
semantics at any given time. Opcode information must be kept in sync in:

* ``src-input/duk_js_bytecode.h`` defines opcode names and various constants

* ``src-input/duk_js_compiler.c`` emits bytecode

* ``src-input/duk_js_executor.c`` interprets bytecode

* ``debugger/duk_opcodes.yaml`` provides opcode metadata in a programmatic
  format, used by the debugger Web UI for bytecode dumping

Code organization
-----------------

The main entry point to compilation is ``duk_js_compile()`` in
``duk_js_compiler.c``.

``duk_lexer.c`` and ``duk_lexer.h`` contain the entire lexer implementation.
Tokens are represented by ``duk_token``. Two slots are reserved from the
value stack for token values (regexp literals need two slots: pattern and
flags) to keep the values reference counted.

``duk_js_compiler.c`` and ``duk_js_compiler.h`` contain the entire compiler
implementation: function, statement and expression parsers, bytecode emission,
ivalue manipulation, and assorted support functionality like label and constant
management. The compiler was originally written as a single file for efficient
inlining, before source files were combined into a single file in the dist
process.

Compilation state is encapsulated into ``duk_compiler_ctx``, which includes:

* Tokenization state

* Control structure for the current function being compiled; the function
  structure includes:

  - Code generation state: bytecode, identifier bindings, constants,
    temporary register state, label state, etc.

  - Control variables for the current expression being parsed

* Various control flags which operate at the entry point level

Intermediate values are represented by ``duk_ivalue`` and ``duk_ispec``.
These need value stack slots for storing values such as strings.

A function being compiled is represented by the inner representation
``duk_compiler_func``, which is converted into an actual function object
(a template) once compilation is finished. The intermediate function
refers to a number of allocated value stack locations for storing
compilation data such as label information, known identifiers, bytecode
emitted, etc. There are also support state and structures like
``duk_labelinfo``.

Bytecode is generated as a sequence of ``duk_compiler_instr`` structs.
These contain an actual instruction (``duk_instr_t``) and line information.
Line information is compressed into a compact bit-packed run-time format
(pc2line) at the end of function compilation.

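The actual pc2line bit packing is internal to Duktape and not described here;
as a generic illustration of why per-pc line data compresses well, line
numbers can be stored as small deltas rather than full values. The scheme
below is hypothetical and chosen only to show the idea.

```python
# Hypothetical delta encoding of per-pc line numbers (NOT the actual
# pc2line format): adjacent instructions usually share a line, so most
# deltas are 0 or +/-1 and fit in very few bits.

def pack_lines(lines):
    """Encode a per-pc line number list as signed deltas."""
    out, prev = [], 0
    for ln in lines:
        out.append(ln - prev)
        prev = ln
    return out

def unpack_lines(deltas):
    """Recover per-pc line numbers from the delta list."""
    out, cur = [], 0
    for d in deltas:
        cur += d
        out.append(cur)
    return out

lines = [10, 10, 10, 11, 11, 13, 12]      # line number for each pc
packed = pack_lines(lines)
assert packed == [10, 0, 0, 1, 0, 2, -1]  # mostly small deltas
assert unpack_lines(packed) == lines
```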
General design notes
====================

This section lists miscellaneous issues affecting lexer and compiler design.

C recursion depth
-----------------

C recursion depth or C stack size needs to be reasonably limited for
compatibility with some embedded environments with small stacks.

Avoiding memory churn
---------------------

Minimizing the number of alloc/realloc/free operations is desirable for
all environments. Memory churn has a performance impact and also increases
the chance that memory gets fragmented, which is an issue for some (but not
all) allocators.

A few examples of how to avoid memory churn:

* Use fixed size buffers when possible, e.g. for the codepoint decode window.

* Use a shared temporary buffer for parsing string valued tokens, reusing
  the buffer. Most keywords and literal strings will easily fit into a
  few hundred bytes without ever needing to resize the temporary buffer.

* Minimize resizes of the bytecode emission buffer. For example, when
  starting the second compilation pass, keep the current bytecode buffer
  without resizing it to a smaller size.

Memory usage patterns for pooled allocators
-------------------------------------------

For low memory environments using pool allocation, any large allocations that
grow without bound are awkward to handle because selecting the pool sizes
becomes difficult. It is preferable to do a lot of smaller allocations of
bounded size instead; typical pool configurations provide a lot of small
buffers from 4 to 64 bytes, and a reasonable number of buffers up to 256
bytes. Above that, buffer counts are smaller and tightly reserved.

There are a few unbounded allocations in the current solution, such as the
current bytecode being emitted.

Lexer design notes
==================

This section has small lexer design notes in no specific order. Larger
issues are covered in dedicated sections below.

Tokenization is stateful
------------------------

Tokenization is affected by:

* Strictness of the current context, which affects the set of recognized
  keywords (reserved words, to be more precise).

* Regexp mode, i.e. whether a literal regexp is allowed in the current
  context. This is the case because regexp literals use the forward slash
  which is easily confused with a division expression. Currently handled
  by having a table indicating which tokens may not be followed by a
  RegExp literal.

* In some contexts reserved words are recognized but in others they must
  be interpreted as identifiers: an ``Identifier`` production accepts
  any ``IdentifierName`` except for ``ReservedWord``. Both ``Identifier``
  and ``IdentifierName`` appear in constructs. The current approach is
  to supply both the raw identifier name and a possible reserved word in
  ``duk_token``. The caller can then decide which is appropriate in the
  caller's context.

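The regexp-mode table mentioned above can be sketched as follows. The token
names and the (incomplete) set contents are assumptions for illustration;
they do not match Duktape's actual token constants or table.

```python
# Sketch of slash disambiguation: whether "/" starts a RegExp literal or
# a division depends on the previous token. Token names are illustrative.

# After these tokens a "/" must be division (they can end a value
# expression); the real table covers the full token set.
NO_REGEXP_AFTER = {
    "IDENTIFIER", "NUMBER", "STRING", "RPAREN", "RBRACKET", "THIS",
}

def slash_is_regexp(prev_token):
    """Decide how to lex a forward slash based on the previous token."""
    return prev_token not in NO_REGEXP_AFTER

assert slash_is_regexp("ASSIGN")      # a = /foo/   -> regexp literal
assert not slash_is_regexp("NUMBER")  # a = 1 / 2   -> division
assert not slash_is_regexp("RPAREN")  # (a + b) / 2 -> division
```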
Source code encoding is not specified
-------------------------------------

The E5.1 specification does not mandate any specific source code encoding.
Instead, source code is assumed to be a 16-bit codepoint sequence for
specification purposes (E5.1 Section 6). The current choice is to decode
source code as UTF-8. Changing the supported encoding(s) would be
easy because of the codepoint decoding window approach, but it's preferred
that calling code transcode non-UTF-8 inputs into UTF-8.

Source code may contain non-BMP characters, but Ecmascript does not
support such characters directly. For instance, if codepoint U+12345 were to
appear (without escaping) inside a string constant, it would need to be
interpreted as two 16-bit surrogate codepoints (a surrogate pair),
if such characters are supported at all.

Duktape strings do support non-BMP characters, but such characters cannot be
created using source literals.

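The surrogate pair interpretation mentioned above is standard UTF-16
encoding; a worked example for U+12345:

```python
# Split a non-BMP codepoint into a UTF-16 surrogate pair (standard
# encoding, shown here for the U+12345 example above).

def to_surrogate_pair(cp):
    assert cp > 0xFFFF
    v = cp - 0x10000            # 20-bit value in range 0..0xFFFFF
    high = 0xD800 + (v >> 10)   # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits -> low surrogate
    return high, low

assert to_surrogate_pair(0x12345) == (0xD808, 0xDF45)
```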
Use strict directive
--------------------

The "use strict" and other directives have somewhat odd semantics (see E5.1
Section 14.1):

* ``"use strict"`` is a valid "use strict directive" and triggers strict mode.

* ``"use\u0020strict"`` is a valid directive but **not** a "use strict
  directive".

* ``("use strict")`` is not a valid directive.

The lexer and the expression parser coordinate to provide enough information
(character escaping, expression "depth") to allow these cases to be
distinguished properly.

Compiler design notes
=====================

This section has small compiler design notes in no specific order. Larger
issues are covered in dedicated sections below.

Expression parsing algorithm
----------------------------

The expression parsing algorithm is based on:

* http://javascript.crockford.com/tdop/tdop.html

* http://effbot.org/zone/simple-top-down-parsing.htm

* http://effbot.org/zone/tdop-index.htm

The ``nud()`` function considers a token as a "value" token. It also parses
unary expressions (such as ``!x``).

The ``led()`` function considers a token as an "operator" token, which
operates on a preceding value.

Some tokens operate in both roles but with different semantics. For instance,
the opening bracket (``[``) may either begin an array literal in ``nud()``, or
a property access in ``led()``.

The simplified algorithm is as follows. The 'rbp' argument defines "right
binding power", which governs when the expression is considered to be
finished. The 'lbp()' value provides token binding power, "left binding
power". The higher 'rbp' is, the more tightly bound an expression we're
parsing::

  nud()               ; parse current token as "value"
  while rbp < lbp():  ; while token binds more tightly than rbp...
      led()           ; combine previous value with operator

The ``led()`` function may parse an expression recursively,
with a higher 'rbp', i.e. a more tightly bound expression.

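The loop above can be turned into a tiny working example. This Python sketch
evaluates arithmetic directly instead of emitting bytecode, and its token
handling and binding powers are simplified far below the real parser.

```python
# Minimal TDOP expression evaluator following the nud()/led() loop above
# (illustrative; the real compiler builds ivalues and emits bytecode).

LBP = {"+": 10, "-": 10, "*": 20, "": 0}   # "" is the end-of-input token

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens + [""]
        self.pos = 0

    def next(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expression(self, rbp=0):
        left = self.nud(self.next())            # current token as "value"
        while rbp < LBP[self.tokens[self.pos]]: # binds more tightly than rbp
            left = self.led(self.next(), left)  # combine with operator
        return left

    def nud(self, tok):
        if tok == "-":                          # unary minus handled in nud()
            return -self.expression(30)
        return int(tok)

    def led(self, op, left):
        # Recursive call with the operator's own binding power makes
        # equal-power operators left-associative.
        right = self.expression(LBP[op])
        return {"+": left + right, "-": left - right, "*": left * right}[op]

assert Parser(["1", "+", "2", "*", "3"]).expression() == 7
assert Parser(["-", "2", "+", "5"]).expression() == 3
```

Note how precedence falls out of the binding powers alone: ``*`` (20) binds
more tightly than ``+`` (10) without any grammar rules for precedence levels.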
In addition to this basic algorithm, some special features are needed:

* Keep track of led() and nud() counts. This allows directives in a
  function "directive prologue" (E5.1 Section 14.1) to be detected correctly.
  For instance::

    function f() {
        'use strict';       // valid directive for strict mode
        'use\u0020strict';  // valid directive, but not for strict mode (!)
        ('use strict');     // not a valid directive, terminates directive prologue
    }

* Keep track of parenthesis nesting count during expression parsing. This
  allows "top level" to be distinguished from nested levels.

* Keep track of whether the expression is a valid LeftHandSideExpression, i.e.
  the top level contains only LeftHandSideExpression level operators.

* Allow a caller to specify that expression parsing should be terminated at
  a top-level ``in`` token. This is needed for the "NoIn" variants, which are
  used in for/for-in statements.

* Allow a caller to specify whether or not an empty expression is allowed.

The expression parser uses both the "previous token" and "current token"
in making parsing decisions. Which token is considered at each point is
not always trivial, and the responsibilities between compiler internal
helper functions are not always obvious; token state assumptions are thus
documented in most functions.

Parsing statements
------------------

Statement parsing is a traditional top-down recursive process which is
relatively straightforward. Some complicated issues are:

* Specific statement types which are difficult to parse without lookahead

* Label site handling

* Tail calls

* Implicit return values

Parsing functions
-----------------

At the end of function parsing, the compiler needs to determine what
flags to set for the function. Some flags have an important performance
impact. In particular, the creation of an ``arguments`` object can be
skipped if the compiler can guarantee that it will never be accessed.

This is not trivial because e.g. the presence of a direct ``eval()``
call may allow indirect access to ``arguments``. The compiler must always
make a conservative choice to ensure compliance and safety.

Distinguishing for/for-in
-------------------------

There are a total of four ``for`` / ``for-in`` statement variants. Each
variant requires slightly different bytecode output. Detecting the correct
variant is difficult, but possible, without multiple passes or arbitrary
length token lookahead. See separate discussion below.

Expressions involving "new"
---------------------------

Expressions involving ``new`` are not trivial to parse without lookahead.
The grammar rules for ``LeftHandSideExpression``, ``CallExpression``,
``NewExpression``, and ``MemberExpression`` are a bit awkward. See separate
discussion below.

Directive detection
-------------------

The "use strict" and other directives are part of a directive prologue, which
is the sequence of initial ExpressionStatements producing only a string
literal (E5.1 Section 14.1).

The expression parser provides a nud/led call count which allows the
statement parser to determine that an expression is a valid directive.
The first non-valid directive terminates the directive prologue, and
no more directives are processed. The lexer provides character escape
metadata in token information to allow "use strict" to be detected correctly.

The transition to strict mode occurs in the directive prologue of the
first compilation pass. Function strictness is already known at the
beginning of the second pass. This is important because strict mode
affects function argument parsing, for instance, so it must be known
before parsing the function body.

Declaration "hoisting"
----------------------

Variable and function declarations affect code generation even before the
declarations themselves appear in the source code: in effect, declarations
are "hoisted" to the top of the function. To be able to generate reasonable
code, compile-time identifier resolution requires multi-pass parsing or some
intermediate representation. The current solution is multi-pass function
parsing.

Some token lookahead is needed
------------------------------

Because we need some lookahead, the compiler currently keeps track of two
tokens at a time, a "current token" and a "previous token".

Implicit return values
----------------------

Global and eval code have an implicit return value; see the separate section
below.

Guaranteed side effects
-----------------------

Sometimes code must be generated even when it might intuitively seem
unnecessary. For example, the argument to a ``void`` operator must be coerced
to a "plain" register/constant so that any side effects are generated. Side
effects might be caused by e.g. getter calls::

  // If foo.x is an accessor, it must be called
  void foo.x

Evaluation order requirements
|
|
-----------------------------
|
|
|
|
Evaluation order requirements complicate one-pass code generation somewhat
|
|
because there's little leeway in reordering emitted bytecode without a
|
|
larger IR.
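For example, operands of a binary operator must be evaluated left to right,
so the code for the left operand must be emitted (and its value captured)
before any code for the right operand:

```javascript
var order = [];
function a() { order.push("a"); return 1; }
function b() { order.push("b"); return 2; }

// The left operand must be fully evaluated before the right one.
var sum = a() + b();

// order is ["a", "b"], sum is 3
```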

Dynamic lexical contexts
------------------------

Ecmascript lexical contexts can be dynamically altered even after a function
call exits.  For example, if a function makes a direct ``eval()`` call with
a variable argument, it is possible to declare new variables when the function
is called::

  var foo = 123;
  var myfunc;

  function f(x) {
    eval(x);

    return function () { print(foo); }
  }

  // declare 'foo' in f(), returned closure sees this 'foo' instead
  // of the global one
  myfunc = f('var foo = 321');
  myfunc();  // prints 321, not 123

  // don't declare 'foo' in f(), returned closure sees the global 'foo'
  myfunc = f('var quux = 432');
  myfunc();  // prints 123

For execution efficiency we should, for example, avoid creation of environment
records and the ``arguments`` object.  The compiler thus needs to
conservatively estimate what optimizations are possible.

Compilation may trigger a GC or recursive compilation
-----------------------------------------------------

At first glance it might seem that the compiler cannot be invoked recursively.
This is not the case, however: the compiler may trigger a garbage collection
or a refzero which triggers finalizer execution, which in turn can use e.g.
``eval()`` to cause a recursive Ecmascript compilation.  Compiler recursion is
not a problem as such, as it is a normal recursive C call which respects value
stack policy.

There are a few practical issues to note with regard to GC and recursion:

* All heap values must be correctly reference counted and reachable.  The
  compiler needs heap values to represent token values, compiler intermediate
  values, etc.  All such values must be reachable through the valstack or a
  temporary object, or else GC must explicitly support compiler state.

* There should be no global (heap- or thread-wide) compiler state that would
  get clobbered by a recursive compilation call.  If there is such state, it
  must be saved and restored by the compiler.

* At the moment there is a "current compiler context" variable in ``duk_hthread``
  which is used to augment SyntaxErrors with a line number.  This state is saved
  and restored in recursive compilation to avoid clobbering.

Unary minus and plus
--------------------

Quite interestingly, the minus sign in ``-123`` is **not** part of the
number token in Ecmascript syntax.  Instead, ``-123`` is parsed as a unary
minus followed by a number literal.

The current compiler follows this required syntax, but constant folding
ensures no extra code or constants are generated for unary minus or unary
plus.
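The distinction is visible in the grammar but not in the folded result.  For
instance:

```javascript
// '-123' is a unary minus applied to the literal 123, so arbitrary
// whitespace (or comments) may appear between the sign and the digits.
var a = - /* still one expression */ 123;

// Similarly '1-123' is binary subtraction, not '1' followed by a
// negative literal.
var b = 1-123;

console.log(a); // -123
console.log(b); // -122
```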

Compile-time vs. run-time errors
--------------------------------

Compilation may fail with an error only if the cause is an "early error",
specified in E5.1 Section 16, or an internal error such as out of memory
occurs.  Other errors must only occur when the result of the compilation
is executed.  Sometimes this includes constructs that we know can never
execute without an error (such as a function call in the left-hand-side
position of an assignment), but perhaps that code is never reached or the
error is intentional.

Label statement handling
------------------------

Label statements essentially prefix actual statements::

  mylabel:
  while (true) { ... }

Labels are currently handled directly by the internal function which
parses a single statement.  This is useful because all labels preceding
an actual statement are coalesced into a single "label site".  All labels,
including an implicit empty label for iteration statements, point to the
same label site::

  // only a single label site is established for the labels:
  // "label1", "label2", ""
  label1:
  label2:
  for (;;) { ... }

Technically, a label initially looks like an expression statement, as a
label begins with an identifier.  The current parsing approach avoids
backtracking by parsing an expression statement normally, and then
noticing that (1) it consisted of a single identifier token, and (2)
it is followed by a colon.

No code is emitted by the expression parser for such a terminal single
token expression (an intermediate value is generated, but it is not
coerced to any code yet), so this works without emitting any invalid
code.

Note that some labels cannot accept break or continue (e.g. a label for
an expression statement), some accept break only (switch), while others
accept both (iteration statements: do-while, for, while).  All the label
names are registered while processing explicit labels, and an empty label
is registered for an iteration/switch statement.  When the final statement
type is known, all labels in the set are updated to indicate whether they
accept break and/or continue.
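The coalescing matches language semantics: every label in the set names the
same iteration statement.  A sketch:

```javascript
var visited = [];

// 'label1', 'label2' and the implicit empty label all share one label
// site pointing at the for statement.
label1:
label2:
for (var i = 0; i < 5; i++) {
  if (i === 1) continue label1;  // same loop as 'continue label2'
  if (i === 3) break label2;     // same loop as 'break label1'
  visited.push(i);
}

// visited is [0, 2]
```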

Backtracking
------------

There is currently only a need to backtrack at the function level, to
restart function compilation when moving from one parsing pass to the next.
The "current function" state needs to be carefully reinitialized during
this transition.

More fine-grained backtracking is not needed right now, but would involve
resetting:

* Emitted bytecode

* Highest used (temp) register

* Emitted constants and inner functions

* Active label set

Temporary register allocation
-----------------------------

Temporary registers are allocated as a strictly increasing sequence from a
specified starting register.  The "next temp" is reset back to a smaller
value whenever we know that the higher temp registers are no longer needed.
This can be done safely because temporaries are always allocated with a
strict stack discipline, and any fixed identifier-to-register bindings are
below the initial temp reg.

The current expression parsing code does not always produce optimal
register allocations.  It would be preferable for expression result values
to be in as low register numbers as possible, which maximizes the number
of temporaries available for later expression code.  This is currently
done on a case-by-case basis as need arises.

The backstop is at the statement level: after every statement is complete,
the "next temp" can be reset to the same value it was before parsing
the statement.  However, it's beneficial to reset "next temp" to a smaller
value whenever possible (inside expression parsing), to minimize function
register count and avoid running out of temp registers.

Unused temporary registers are not set to undefined, and remain reachable
for garbage collection.  Unless they're overwritten by temporary values
needed by another expression, they may cause a "silent leak".  This is
usually not a concrete concern because a function exit will always decref
all such temporaries.  It may be an issue for forever-running functions,
though.

Register shuffling
------------------

The compiler needs to handle the case where it runs out of "easy access"
registers or constants (usually 256 or 512 registers/constants).  Either
this needs to be handled correctly in one pass, or the compiler must
fall back to a different strategy.  The current solution is to use register
shuffling through temporary registers.  Shuffling is handled by the bytecode
emitters.

Pc2line debug data creation
---------------------------

The "pc2line" debug data is a bit-packed format for converting a bytecode
PC into an approximate source line number at run time.

Although almost all of the bytecode is emitted in a linear fashion (appending
to earlier code), some tricky structures insert bytecode instructions in the
middle of already emitted bytecode.  These insertions prevent emitting the
debug data in a streaming fashion during code emission; instead, it needs
to be implemented as a post-step.  This unfortunately doubles the memory
footprint of bytecode during compilation.

The current solution is to keep track of (instruction, line number) pairs
for each bytecode instruction during compile time.  When the intermediate
representation of the compiled function is converted to an actual run-time
representation, it is converted into a plain opcode list and bit-packed
pc2line data.

There is currently some inaccuracy in the line numbers assigned to opcodes:
the bytecode emitter associates the line number of the previous token because
this matches how expression parsing consumes tokens.  However, at some other
call sites the relevant line number would be in the current token.  Fixing
this needs a bit more internal book-keeping.

Peephole optimization
---------------------

Currently a simple in-place peephole optimizer is applied at the end of
function compilation to straighten out jumps.  Consider for instance::

  a:
      JUMP c   -.
  b:            |  <--.
      JUMP d    | -.  |
  c:         <--'  |  |
      JUMP b       | -'
  d:            <--'

The peephole optimizer runs over the bytecode looking for JUMP-to-JUMP
cases until the bytecode no longer changes.  On the first peephole pass
these jumps are straightened to::

  a:
      JUMP b   -.
  b:         <--'
      JUMP d   -.
  c:            |
      JUMP d    | -.
  d:         <--' <-'

(The JUMPs are modified in place, so some changes may be visible to later
jumps on the same pass.)

On the next pass this is further optimized to::

  a:
      JUMP d   -.
  b:            |
      JUMP d    | -.
  c:            |  |
      JUMP d    |  | -.
  d:         <--' <-' <-'

The peephole pass doesn't eliminate any instructions, but it makes some
JUMP chains a bit faster.  JUMP chains are generated by the current compiler
in many cases, so this simple pass cheaply improves generated code slightly.

Avoiding C recursion
--------------------

C recursion happens in multiple ways; controlling recursion at these points
should suffice:

* Recursive expression parsing

* Recursive statement parsing (e.g. an ``if`` statement parses another
  statement)

* Recursive function parsing (e.g. a function expression or function
  declaration inside another function)

Recursion controls placed in these key functions should suffice to guarantee
an upper limit on C recursion, although it is difficult to estimate how much
stack is consumed before the limit is reached.
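The effect can be demonstrated with a deeply nested expression; most engines
(Duktape included) reject it with a recursion or stack related error at
compile time.  A sketch (the exact error type and the depth needed vary by
engine):

```javascript
// Build "((((...1...))))" with very deep nesting.
var depth = 200000;
var src = new Array(depth + 1).join("(") + "1" + new Array(depth + 1).join(")");

var threw = false;
try {
  eval(src);
} catch (e) {
  // e.g. a RangeError or an engine-specific compiler recursion error
  threw = true;
}
```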

ES2015 constructs need an intermediate representation
-----------------------------------------------------

ES2015 constructs such as destructuring assignment will need an intermediate
representation (or at least a much larger fragment of the expression tree)
to compile in a reasonable manner.

Operator precedences (binding powers)
=====================================

Operator precedences (binding powers) are required by the expression parser
for tokens acting as "operators" for ``led()`` calls.  This includes tokens
for binary operators (such as ``+`` and ``instanceof``).

A higher binding power binds more strongly, e.g. ``*`` has a higher binding
power than ``+``.  The binding power of operators can be determined from the
syntax.  Operators of different precedence are apparent from production
nesting level; outer productions have lower binding power.

Operators at the same level have the same binding power if left-associative.
A production can be determined to be left-associative from its form.
For instance::

  AdditiveExpression:
    MultiplicativeExpression
    AdditiveExpression '+' MultiplicativeExpression
    AdditiveExpression '-' MultiplicativeExpression

Abbreviated::

  AE:
    ME
    AE '+' ME
    AE '-' ME

The expression ``1 + 2 + 3 + 4 * 5`` would be derived as (with parentheses
emphasizing order)::

  AE -> AE '+' ME
     -> (AE '+' ME) '+' ME
     -> ((AE '+' ME) '+' ME) '+' ME
     -> ((ME '+' ME) '+' ME) '+' ME
     -> ((1 '+' 2) '+' 3) '+' (4 '*' 5)

Operators at the same level which are right-associative can likewise be
determined from the production.  For instance::

  AssignmentExpression:
    ConditionalExpression
    LeftHandSideExpression '=' AssignmentExpression
    LeftHandSideExpression AssignmentOperator AssignmentExpression

  AssignmentOperator:
    '*='
    (... others omitted)

Abbreviated::

  AE:
    CE
    LE '=' AE
    LE AO AE

  AO:
    '*='

The expression ``a = b = c *= 4`` would be derived as (using parentheses for
emphasis)::

  AE -> LE '=' AE
     -> LE '=' (LE '=' AE)
     -> LE '=' (LE '=' (LE '*=' AE))
     -> LE '=' (LE '=' (LE '*=' CE))
     -> a '=' (b '=' (c '*=' 4))
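Right-associativity is directly observable at run time:

```javascript
var a, b, c = 2;

// Groups as a = (b = (c *= 4)): the rightmost assignment happens first
// and its value propagates to the left.
a = b = c *= 4;

console.log(a, b, c); // 8 8 8
```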

Right associative productions are parsed by using a tweaked 'rbp' argument
to the recursive expression parsing.  For the example above:

* ``a`` is parsed with ``nud()`` and evaluates into a variable reference.

* The first ``=`` operator is parsed with ``led()``, which calls the
  expression parser recursively, with an 'rbp' argument which causes
  the recursive call to consume all further assignment operations.

What is a proper 'rbp' for the recursive ``led()`` call?  It must be
lower than the binding power of the ``=`` operator, but higher than or
equal to that of any operator whose binding power is less than that of
``=``.  For example, if the binding power of ``=`` were 10, the 'rbp' used
could be 9.  The current compiler uses multiples of 2 for binding powers so
that subtracting 1 from the binding power of an operator results in a
binding power below the current operator but never equal to any other
operator.  Technically this is not necessary, because it's OK for the 'rbp'
to be equal to a lower binding operator.

In addition to binary operators, binding powers need to be assigned to:

* Unary operators

* Some tokens which are not strictly operators, for example ``(``, ``[``,
  and ``{``, which begin certain expressions (function calls, property
  access, and object literals)

Token precedences for ``lbp()``, from highest (most tightly bound) to lowest,
are summarized in the list below.  Operators of equal binding power are on
the same line.  The list is determined by inspecting the ``Expression``
production.  Operators are left associative unless indicated otherwise:

* (IdentifierName, literals, ``this``, etc.  Parsed by ``nud()``
  and don't need binding powers.)

* ``.`` ``[``
  (Note: MemberExpression parsed by ``led()``.)

* ``new``
  (Note: unary expression parsed by ``nud()``.  Right-associative.)

* ``(``
  (Note: CallExpression parsed by ``led()``.)

* ``++`` ``--``
  (Note: postfix expressions which are parsed by ``led()`` but which are
  "unary like".  The expression always terminates in such a case.)

* ``delete`` ``void`` ``typeof`` ``++`` ``--`` ``+`` ``-`` ``~`` ``!``
  (Note: unary expressions which are parsed by ``nud()`` and thus don't
  actually need a binding power.  All of these are also right-associative.
  ``++`` and ``--`` are preincrement here; ``+`` and ``-`` are unary plus
  and minus.)

* ``*`` ``/`` ``%``

* ``+`` ``-``

* ``<<`` ``>>`` ``>>>``

* ``<`` ``>`` ``<=`` ``>=`` ``instanceof`` ``in``

* ``==`` ``!=`` ``===`` ``!==``

* ``&``

* ``^``

* ``|``

* ``&&``

* ``||``

* ``?``
  (Note: starts an "a ? b : c" expression.)

* ``=`` ``*=`` ``/=`` ``%=`` ``+=`` ``-=`` ``<<=`` ``>>=`` ``>>>=`` ``&=`` ``^=`` ``|=``
  (Note: right associative.)

* ``,``

* ``)`` ``]``
  (Note: when parsed with ``led()``; see below.)

* EOF
  (Note: when parsed with ``led()``; see below.)

The precedence list is clear from the lowest binding power up to binary
``+`` and ``-``.  Binding powers higher than that get a bit tricky because
some of the tokens are unary operators (parsed by ``nud()``) and some are
parsed by ``led()`` but are not binary operators.

When parsing an expression beginning with ``(`` using ``nud()``, the
remainder of the expression is parsed with a recursive call to the expression
parser and an 'rbp' which guarantees that parsing stops at the closing ``)``.
The 'rbp' here must NOT stop at the comma operator (``,``), so technically
``)`` is considered to have a binding power lower than comma.  The same
applies to ``]``.  Similarly, EOF is considered to have the lowest binding
power of all.  These have been appended to the list above.

Parsing RegExp literals
=======================

The Ecmascript lexer has two goal symbols for its lexical grammar:
``InputElementDiv`` and ``InputElementRegExp``.  The former is used in
all lexical contexts where a division (``/``) or a division-assignment
(``/=``) is allowed; the latter is used elsewhere.  The E5.1 specification
does not really say anything else on the matter (see E5.1 Section 7, 2nd
paragraph).

In the implementation of the compiler, the ``advance()`` set of helpers
knows the current token and consults a token table which has a flag
indicating whether or not a RegExp literal can ever follow that particular
token.  This detail is thus hidden from ordinary parsing code.

Unfortunately, parsing Identifier (which prohibits reserved words) vs.
IdentifierName (which allows them) is context sensitive.  The current lexer
handles this by providing a token type for both interpretations: ``t``
indicates the token type with reserved words being recognized (e.g.
"return" yields the token type DUK_TOK_RETURN) while ``t_nores`` indicates
the token type ignoring reserved words (e.g. "return" yields the token type
DUK_TOK_IDENTIFIER).

``IdentifierName`` occurs only in::

  PropertyName     -> IdentifierName                       (object literal)
  MemberExpression -> MemberExpression '.' IdentifierName
  CallExpression   -> CallExpression '.' IdentifierName

Using ``t_nores`` for determining whether or not a RegExp is allowed does
not work.  For instance, a ``return`` statement allows a return value so
a RegExp must be allowed to follow::

  return /foo/;

On the other hand, a RegExp cannot follow ``return`` here::

  t = foo.return/2;

Using ``t`` has the inverse problem; if DUK_TOK_RETURN allows a RegExp
to follow, this parses correctly::

  return /foo/;

but this will fail::

  t = foo.return/2;

The IdentifierName cases require special handling:

* The ``PropertyName`` in an object literal is not really an issue.  It
  cannot be followed by either a division or a RegExp literal.

* The ``MemberExpression`` case: a RegExp can never follow.  A special
  one-time flag can be used to reject RegExp literals on the next
  ``advance()`` call.

* The ``CallExpression`` case: can be handled similarly.

Currently this special handling is implemented using the ``reject_regexp_in_adv``
flag in the current compiler function state.  It is only set when handling
``DUK_TOK_PERIOD`` in ``expr_led()``, and is automatically cleared by the next
``advance()`` call.

See test case: ``test-dev-regexp-parse.js``.
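The following sketch shows why the distinction must track token context
rather than the raw identifier text:

```javascript
var foo = { return: 8 };  // reserved word allowed as an IdentifierName

// After '.' the 'return' is an IdentifierName, so '/' here is division:
var t = foo.return/2;

// After the 'return' keyword a '/' starts a RegExp literal instead:
function f() { return /foo/.source; }

console.log(t);    // 4
console.log(f());  // "foo"
```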

Automatic semicolon insertion
=============================

Semicolons need to be automatically inserted at certain points of the
token stream.  Only the parser/compiler can handle automatic semicolon
insertion, because automatic semicolons are only allowed in certain
contexts.  Only some statement types have a terminating semicolon and
thus participate in automatic semicolon insertion.

Automatic semicolon insertion is implemented almost completely at the
statement parsing level, the only exception being the handling of
post-increment/decrement.

After the longest valid statement (usually containing an expression) has
been parsed, the statement is either terminated by an explicit semicolon
or is followed by an offending token which permits automatic semicolon
insertion.  In other words, the offending token is preceded by a newline,
or is either the EOF or the ``}`` token, whichever is appropriate for the
statement list in question.  The actual specification for "longest valid
statement" is that an automatic semicolon can only be inserted if a parse
error would otherwise occur.

Some statements also have grammar which prohibits automatic semicolon
insertion in certain places, such as:
``return [no LineTerminator here] Expression;``.
These need to be handled specially.

Some statements have a semicolon terminator while others do not.  Automatic
semicolons are naturally only processed for statements with a semicolon
terminator.

The current implementation:

* The statement list parser parses statements.

* Individual statement type parsers need to be capable of parsing until an
  offending token is encountered (either a semicolon, or some other
  unexpected token), and of indicating whether that specific statement
  type requires a semicolon terminator.

* The general statement parsing wrapper then checks whether a semicolon
  terminator is needed, and if so, whether an explicit semicolon or an
  automatically inserted semicolon terminates the statement.

* Statements which prohibit line terminators in some cases have a special
  check in the parsing code for that statement type.  If the token following
  the restriction has a "lineterm" flag set, the token is considered
  offending and the statement is terminated.  For instance, "return\\n1;"
  is parsed as an empty return because the token ``"1"`` has a lineterm
  preceding it.  The ``duk_token`` struct has a flag indicating whether
  the token was preceded by whitespace which included one or more line
  terminators.

* Checking whether an automatic semicolon is allowed depends on a token
  which is potentially part of the next statement (the first token of
  the next statement).  In the current implementation the statement
  parsing function is expected to "pull in" the token *following* the
  statement into the "current token" slot anyway, so the token can be
  examined for automatic semicolon insertion without backtracking.

* Post-increment/decrement has a restriction on a LineTerminator occurring
  between the preceding expression and the ``++``/``--`` token (note that
  pre-increment/decrement has no such restriction).  This is currently
  handled by ``expr_lbp()`` which will return an artificially low binding
  power if a ``++``/``--`` occurs in a post-increment/decrement position
  (which is always the case if they're encountered in the ``expr_led()``
  context) and the token was preceded by a line terminator.  This
  effectively terminates the preceding expression, treating e.g.
  "a+b\\n++" as "a+b;++;", which causes a SyntaxError.
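The restricted ``return`` production is the classic observable example:

```javascript
function f() {
  return     // ASI inserts a semicolon here: parsed as 'return;'
  1 + 2;     // ...so this becomes an unreachable expression statement
}

console.log(f()); // undefined
```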

There is a custom hack, related to a specification errata, for a statement
like::

  do{print('loop')}while(false)false

Strictly speaking this is a syntax error, but it is allowed by most
implementations in the field.  A specific hack is needed to handle
this case.  See ``test-stmt-dowhile-bug.js``.

Implicit return value of global code and eval code
==================================================

Global code and eval code have an "implicit return value" which comes from
the last non-empty statement executed.  Function code has no implicit return
value.  Statements returning a completion with type "empty" do not change
the implicit return value.  For instance, ``eval("1;;var x=2;")`` returns
``1`` because the empty statement and the ``var`` statement have an empty
completion.  This affects code generation, which is a bit different at the
statement level for global/eval code and function code.

When in a context requiring an implicit return value (eval code or global
code), a register is allocated for the last non-empty statement value.
When such a statement is parsed, its value is coerced into the allocated
register.  Other statements are coerced into a plain value (which is then
ignored) which ensures all side effects have been generated (e.g. a property
access is generated for the expression statement ``x.y;``) without affecting
the implicit return value.

Statement types generating an empty value directly:

* Empty statement (12.3)

* Debugger statement (12.15)

Statement types generating an empty value indirectly:

* Block statement (12.1): may generate an empty value indirectly if all
  statements inside the block are empty.

* ``if`` statement (12.5): may generate an empty value either if a clause has
  an empty value (e.g. ``eval("if (true) {} else {1}")`` returns ``undefined``)
  or a clause is missing (e.g. ``eval("if (false) {1}")`` returns ``undefined``).

* ``do-while``, ``while``, ``for``, ``for-in`` statements (12.6): the statement
  value is the value of the last non-empty statement executed within the loop
  body; it may be empty if only empty statements or no statements are executed.

* ``continue`` and ``break`` statements (12.7, 12.8): have an empty value, but
  ``continue`` and ``break`` are handled by their catching iteration statement,
  so they are a bit special.

* ``with`` statement (12.10): like block statements.

* ``switch`` statement (12.11): the statement value is the value of the last
  non-empty statement executed (in whichever clause).

* Labelled statement (12.12): returns whatever the statement following the
  labels returns.

* ``try`` statement (12.14): the statement value is the value of the last
  non-empty statement executed in the try and/or catch blocks.

Some examples:

+--------------------+-------------+-------------------------------------------------------------------------+
| Eval argument      | Eval result | Notes                                                                   |
+====================+=============+=========================================================================+
| "1+2;"             | 3           | Normal case, expression statement generates implicit return value.      |
+--------------------+-------------+-------------------------------------------------------------------------+
| "1+2;;"            | 3           | An empty statement generates an empty value.                            |
+--------------------+-------------+-------------------------------------------------------------------------+
| "1+2; var a;"      | 3           | A variable declaration generates an empty value.                        |
+--------------------+-------------+-------------------------------------------------------------------------+
| "1+2; var a=5;"    | 3           | A variable declaration, even with assignment, generates an empty value. |
+--------------------+-------------+-------------------------------------------------------------------------+
| "1+2; a=5;"        | 5           | A normal assignment generates a value.                                  |
+--------------------+-------------+-------------------------------------------------------------------------+
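These rules can be exercised directly with ``eval()``.  A sketch (``a`` is
declared up front so the plain assignment doesn't create a new global):

```javascript
var a;

var r1 = eval("1+2;;");          // empty statement keeps the value: 3
var r2 = eval("1+2; var a=5;");  // var declaration has an empty completion: 3
var r3 = eval("1+2; a=5;");      // plain assignment produces a value: 5

console.log(r1, r2, r3); // 3 3 5
```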

Tail call detection and handling
================================

A tail call can be used when:

1. the value of a CALL would become the argument of an explicit
   ``return`` statement or an implicit return value (for global or
   eval code); and

2. there are no active TCF catchers between the return and the
   function entrypoint.

A trivial example is::

  function f(x) {
    return f(x+1);
  }

The generated code would look something like::

  CSREG   r0, c0      ; c0 = 'f'
  GETVAR  r1, c1      ; c1 = 'x'
  ADD     r0, r1, c2  ; c2 = 1
  CALL    r0, 2       ; TAILCALL flag not set
  RETURN  r0          ;

This could be emitted as a tail call instead::

  CSREG   r0, c0      ; c0 = 'f'
  GETVAR  r1, c1      ; c1 = 'x'
  ADD     r0, r1, c2  ; c2 = 1
  CALL    r0, 2       ; TAILCALL flag set
  RETURN  r0          ; kept in case tail call isn't allowed at run time

There are more complex cases, like::

  function f(x) {
    return (g(x) ? f(x+1) : f(x-1));
  }

Here, just before executing a RETURN, both paths of execution end up with
a function call.  Both calls can be converted to tail calls.

The following is not a candidate for a tail call because of a catcher::

  function f(x) {
    try {
      return f(x+1);
    } finally {
      print('cleaning up...');
    }
  }

Detecting anything other than the very basic case is probably not worth the
complexity, especially because E5.1 does not require efficient tail calls at
all (in fact, as of this writing, neither V8 nor Rhino supports tail calls).
ES2015 *does* require tail calls and provides specific guarantees for them;
adding support for ES2015 tail calls will require compiler changes.

The current approach is very simplistic and only detects the most common
cases.  First, it is only applied when compiling function code, not global
or eval code, which restricts consideration to explicit ``return``
statements only.  When parsing a ``return`` statement:

* First request the expression parser to parse the expression for the return
  value normally.

* If the last bytecode instruction generated by the expression parser is a
  CALL whose value would then become the RETURN argument, and there is
  nothing preventing a tail call (such as TCF catchers), convert the last
  CALL to a tail call.  (There are a few more details to this; see
  ``duk_js_compiler.c`` for comments.)

* The RETURN opcode is kept in case the tail call is not allowed at run time.
  This is possible e.g. if the call target is a native function (which are
  never tail called) or has a ``"use duk notail"`` directive.

* Note that active label sites are not a barrier to tail calls; they are
  unwound by the tail call logic.

See ``test-dev-tail-recursion.js``.

Parsing CallExpression / NewExpression / MemberExpression
=========================================================

The grammar for ``CallExpression``, ``NewExpression``, and ``MemberExpression``
is interesting; they're not in a strict binding power sequence.  Instead,
there is a branch, starting from LeftHandSideExpression::

  LeftHandSideExpression
       |                    .--.
       |                    v  |
       |    .---> NewExpression ----.
       |    |                       |
       `----+                       +---> MemberExpression
            |                       |
            `---> CallExpression ---'
                   ^           |
                   `-----------'

Both NewExpression and CallExpression contain productions containing themselves
and MemberExpressions.  However, a NewExpression never produces a CallExpression
and vice versa.

This is unfortunately difficult to parse.  For instance, both productions
(CallExpression and NewExpression) may begin with a 'new' token, so without
lookahead we don't know which one we're parsing.

Consider the two productions::

  Production 1:

    LeftHandSideExpression -> NewExpression
                           -> 'new' MemberExpression
                           -> 'new' 'Foo'

  Production 2:

    LeftHandSideExpression -> CallExpression
                           -> MemberExpression Arguments
                           -> 'new' 'Foo' '(' ')'

These two are syntactically different but semantically identical: they both
cause a constructor call with no arguments.  However, they derive through
different productions.
|
|
|
|
Miscellaneous notes:
|
|
|
|
* A NewExpression is the only production capable of generating "unbalanced"
|
|
'new' tokens, i.e. 'new' tokens without an argument list. A NewExpression
|
|
essentially generates 0...N 'new' tokens before generating a MemberExpression.
|
|
|
|
* A MemberExpression can generate a "'new' MemberExpression Arguments"
|
|
production. These can nest, generating e.g. "new new Foo () ()" which
|
|
parses as "(new (new Foo ()) ())".
|
|
|
|
* If a LeftHandSideExpression generates a NewExpression, it is no longer
|
|
possible to generate more argument lists (open and close parenthesis)
|
|
than there are 'new' tokens. However, it is possible to generate more
|
|
'new' tokens than argument lists.
|
|
|
|
* If a LeftHandSideExpression generates a CallExpression, it is no longer
|
|
possible to generate 'new' tokens without argument list (MemberExpression
|
|
only allows 'new' with argument list). However, it is possible to
|
|
generate more argument lists than 'new' tokens; any argument lists not
|
|
matching a 'new' token are for function calls generated by CallExpression.
|
|
For instance (with angle brackets for illustration)::
|
|
|
|
new new Foo () () () == <(new <new Foo ()> ()> ()
|
|
|
|
where the last parenthesis are for a function call.
|
|
|
|
* Parentheses match innermost 'new' expressions generated by MemberExpression,
|
|
innermost first. There can then be either additional 'new' tokens on the
|
|
left or additional argument lists on the right, but not both.
|
|
|
|
Any additional 'new' tokens on the left are generated by NewExpression.
|
|
Any additional argument lists on the right are generated by CallExpression.
|
|
|
|
For instance::
|
|
|
|
new new new new Foo () ()
|
|
|
|
parses as (with angle brackets used for illustration)::
|
|
|
|
new new new <new Foo ()> ()
|
|
new new <new <new Foo ()> ()>
|
|
new <new <new <new Foo ()> ()>>
|
|
|
|
whereas::
|
|
|
|
new new Foo () () () ()
|
|
|
|
parses as (with angle brackets used for illustration)::
|
|
|
|
<<<new <new Foo ()> ()> ()> ()>
|
|
::: |==========| : : :
|
|
::: constr. call : : :
|
|
::: : : :
|
|
::|==================== : :
|
|
:: constr. call : :
|
|
:: : :
|
|
:|========================| :
|
|
: function call :
|
|
: :
|
|
|=============================|
|
|
function call
|
|
|
|
Current parsing approach:
|
|
|
|
* When a 'new' token is encountered by ``nud()``, eat the 'new' token.
|
|
|
|
* Parse a MemberExpression to get the call target. This expression parsing
|
|
must terminate if a left parenthesis '(' is encountered. The expression
|
|
parsing must not terminate if a property access is encountered (i.e. the
|
|
``.`` or ``[`` token in ``led()``). This is achieved by a suitable binding
|
|
power given to expression parser.
|
|
|
|
* Finally, look ahead to see whether the next token is a left parenthesis ('(').
|
|
If so, the 'new' token has an argument list; parse the argument list.
|
|
If the next token is not a left parenthesis, the 'new' expression is complete,
|
|
and ``nud()`` can return.
|
|
|
|
* There are many tests in ``test-dev-new.js`` which attempt to cover the
|
|
different cases.
|
|
|
|
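The approach can be illustrated with a toy recursive-descent sketch over a
pre-split token list.  The token format and the bracketed result string are
illustrative only; they are not the compiler's actual representation:

```javascript
// Toy 'new' parser: tokens are "new", an identifier, "(" and ")".
// Returns a fully parenthesized string describing the nesting.
function parseNewToy(tokens) {
  var pos = 0;
  function peek() { return tokens[pos]; }
  function next() { return tokens[pos++]; }

  function parseNew() {             // models nud() for a 'new' token
    next();                         // eat 'new'
    var target;
    if (peek() === "new") {
      target = parseNew();          // nested 'new' binds innermost first
    } else {
      target = next();              // call target (identifier)
    }
    if (peek() === "(") {           // look ahead for an argument list
      next(); next();               // eat '(' and ')'
      return "(new " + target + " ())";
    }
    return "(new " + target + ")";  // 'new' without an argument list
  }

  var expr = parseNew();
  while (peek() === "(") {          // remaining argument lists are function calls
    next(); next();
    expr = "(" + expr + " ())";
  }
  return expr;
}
```

For ``new new Foo () () ()`` this yields the nesting described above: two
constructor calls with the last parentheses acting as a function call.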
Compiling "try-catch-finally" statements
========================================

Compiling the try-catch-finally statement is not very complicated.
However, what happens during execution is relatively complex:

* The catch stack is involved with a "TCF catcher".

* A new declarative environment record, containing the "catch variable",
  may need to be used during the catch part.

The execution control flow is described in ``execution.rst``.

The catch variable has a local scope ("let" scope) which differs from the
way variables are normally declared -- they are usually "hoisted" to the top
level of the function.

Implementing the local scope in the general case requires the creation of
a declarative lexical environment which only maps the catch variable and
uses the previous lexical environment as its parent.  This has the effect
of temporarily "masking" a variable of the same name, e.g.::

  var e = "foo"; print(e);

  try { throw new Error("error"); }
  catch (e) { print(e); }

  print(e);

prints::

  foo
  Error: error
  foo

We would like to avoid emitting code for creating and tearing down such
an environment, as it is very often not needed at all.  Instead, the
error caught can be bound to a register (only) at compile time.

To do so, the compiler would need to record some information about the
contents of the catch clause in pass 1, so that the compiler would know
in pass 2 whether the environment record will be needed, and emit the
necessary opcodes only when necessary.  (The "statement number" would be
enough to identify the statement on the second pass.)

The current compiler does not have the necessary intelligence to avoid
creating a lexical environment, so the environment is currently always
established when the catch clause activates.

There is a small footprint impact in having the declarative environment
established for the duration of the catch clause.  The TRYCATCH flags
indicate that the environment is needed, and the variable name is supplied
through a constant.  There is a run-time penalty in that (1) the lexical
environment and associated book-keeping must be established, and (2) access
to the variable within the catch clause happens through the slow path
primitives (GETVAR, PUTVAR, etc).  The latter is a limitation of the
current lexical environment model, where an identifier is either bound as
a normal property of the lexical environment object, or is bound to a
*function-wide* register.  (This will need to change anyway for ES2015
where "let" statements are supported.)

Compiling "with" statements
===========================

A ``with`` statement requires that an object environment record is
established on entry, and cleaned up on exit.

There is no separate catch stack entry for handling ``with`` statements.
Instead, the "TCF" catcher (which implements try-catch-finally) has
enough functionality to implement the semantics of a ``with`` statement,
including the automatic handling of the object environment record.

For example::

  with (A)
    B

Generates code::

  (code for A, loading result to rX)

  TRYCATCH reg_catch=rN
           var_name=none
           with_object=rX
           have_catch=false
           have_finally=false
           catch_binding=false
           with_binding=true
  INVALID
  JUMP done

  (code for B)
  ENDTRY

  done:

Note that neither a "catch" nor a "finally" part is needed: all the
cleanup happens either when the catcher is unwound by an error, or
by ENDTRY (which of course performs an unwind).

Compiling "for"/"for-in" statements
===================================

Four variants
-------------

Parsing a for/for-in statement is a bit complicated because there are
four variants which need different code generation:

1. for (ExpressionNoIn_opt; Expression_opt; Expression_opt) Statement

2. for (var VariableDeclarationListNoIn; Expression_opt; Expression_opt) Statement

3. for (LeftHandSideExpression in Expression) Statement

4. for (var VariableDeclarationNoIn in Expression) Statement

Distinguishing the variants from each other is not easy without
back-tracking.  If back-tracking is avoided, any code generated before
the variant is determined needs to be valid for all potential variants
being considered.  Also, no SyntaxError can be thrown in cases where
one variant would parse correctly.

There are also tricky control flow issues related to each variant.
Because code is generated while parsing, control flow often needs to
be implemented rather awkwardly.

Note that the ``in`` token serves two independent roles in Ecmascript:
(1) as a membership test in ``"foo" in y``, and (2) as part of the for-in
iterator syntax.  These two uses have entirely different semantics and
compile into entirely different code.

Semantics notes on variant 1
----------------------------

Nothing special.

Semantics notes on variant 2
----------------------------

Like all Ecmascript variable declarations, the declaration is "hoisted" to
the top of the function, while a possible initializer assignment only happens
when the related code is executed.

There can be multiple variable declarations in variant 2, but only one in
variant 4.

Semantics notes on variant 3
----------------------------

Variants 1 and 3 cannot be trivially distinguished by looking ahead a
fixed number of tokens, which seems counterintuitive at first.  This is
the case because a LeftHandSideExpression production in E5.1 allows
e.g. function calls, 'new' expressions, and parenthesized arbitrary
expressions.

Although pure E5.1 functions cannot return left-hand-side values, native
functions are allowed to do so if the implementation wishes to support
it.  Hence the syntax supports such cases, e.g.::

  for (new Foo().bar() in quux) { ... }

This MUST NOT cause a SyntaxError during parsing, but rather a
ReferenceError at runtime.

A valid left-hand-side expression (such as an identifier) may also be
wrapped in one or more parentheses (i.e., an arbitrary number of tokens)::

  for ( (((i))) in [ 'foo', 'bar' ] ) { }
  print(i);

  // -> prints 1

The comma expression semantics require that every comma expression
part is coerced with ``GetValue()``, hence a comma expression is *not*
normally a valid left-hand-side expression::

  for ( ("foo", i) in [ 'foo', 'bar' ] ) { }

  // -> ReferenceError (not a SyntaxError, though)

Again, if a native function were allowed to return a Reference, a comma
expression could be a valid left-hand-side expression, but we don't
support that.

A valid left-hand-side expression may also involve multiple property
reference steps with side effects.  The E5.1 specification allows some
leeway in implementing such expressions.  Consider, e.g.::

  y = { "z": null };
  x = {
    get y() { print("getter"); return y; }
  }
  for (x.y.z in [0,1]) {}

Such an expression may (apparently) print "getter" either once or multiple
times: see E5.1 Section 12.6.4, step 6.b, which states that the left-hand-side
expression "may be evaluated repeatedly".  This probably also implies that
"getter" can be printed zero times, if the loop body is executed zero
times.  At least V8 and Rhino both print "getter" two times for the example
above, indicating that the full code for the left-hand-side expression (if
it requires any code emission beyond a property/variable assignment) is
evaluated on every loop iteration.

Another example of the evaluation order for a "for-in" statement::

  function f() { throw new Error("me first"); }
  for ("foo" in f()) {}

The code must throw the "me first" Error before the ReferenceError related
to an invalid left-hand side.

A valid left-hand-side expression must ultimately be either a variable or a
property reference.  Because we don't allow functions to return References,
any left-hand-side expression involving a function call or a 'new' expression
should cause a ReferenceError (but not a compile time SyntaxError).  In fact,
the only acceptable productions for LeftHandSideExpression are::

  LeftHandSideExpression -> NewExpression
  NewExpression -> MemberExpression
  MemberExpression -> MemberExpression [ Expression ]
                    | MemberExpression . IdentifierName
                    | PrimaryExpression
  PrimaryExpression -> this
                     | Identifier
                     | ( Expression )

Actual implementations seem to vary with respect to checking the syntax
validity of the LeftHandSideExpression.  For instance, V8 accepts an
Expression which is not necessarily a valid LeftHandSideExpression
without throwing a SyntaxError, but then throws a ReferenceError at
run time::

  > function f() { for (a+b in [0,1]) {} }
  undefined
  > f()
  ReferenceError: Invalid left-hand side in for-in

This is technically incorrect.

Rhino gives a SyntaxError::

  js> function f() { for (a+b in [0,1]) {} }
  js: line 1: Invalid left-hand side of for..in loop.

So, a passable loose implementation is to parse the LeftHandSideExpression
as just a normal expression, and then check the final intermediate value.
If it is a property or variable reference, generate the respective iteration
code.  Otherwise generate a fixed ReferenceError throw.

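This loose check can be sketched as follows; the ``type`` tags are
illustrative, the real compiler inspects its intermediate value ("ivalue")
representation:

```javascript
// Toy classification of the parsed left-hand-side intermediate value:
// only variable and property references generate iteration code, anything
// else compiles into a fixed ReferenceError throw (not a SyntaxError).
function forInLhsAction(ivalue) {
  if (ivalue.type === "varref" || ivalue.type === "propref") {
    return "emit iteration code";
  }
  return "emit ReferenceError throw";
}
```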
Semantics notes on variant 4
----------------------------

There can be only one declared variable.  However, the variable may have
an initializer::

  for (var i = 8 in [ 0, 1 ]) { ... }

The initializer cannot be safely omitted.  There may be side effects, and
the initialized value *can* be accessed in some cases, e.g.::

  function f() {
    function g() { print(i); return [0,1] };
    for (var i = 8 in g()) { print(i); }
  }

  f();  // -> prints 8, 0, 1

Control flow for variant 1
--------------------------

Control flow for ``for (A; B; C) D``::

  LABEL N
  JUMP L4      ; break
  JUMP L2      ; continue

  (code for A)

  L1:
    (code for B)
    (if ToBoolean(B) is false, jump to L4)
    JUMP L3

  L2:
    (code for C)
    JUMP L1

  L3:
    (code for D)
    JUMP L2

  L4:
    ; finished

If A is an empty expression, no code is emitted for it.  If B is an empty
expression, it is considered "true" for loop termination (i.e. don't
terminate the loop) and can be omitted ("JUMP L3" will occur at L1).  If C
is empty it can be omitted ("JUMP L1" will occur at L2); more optimally, the
"JUMP L2" after L3 can be changed to a direct "JUMP L1".

Control flow for variant 2
--------------------------

Control flow for variant 2 is the same as for variant 1: "code for A" is
replaced by the variable list assignment code for 1 or more variables.

Control flow for variant 3
--------------------------

Control flow for ``for (A in C) D``::

  ; Allocate Rx as temporary register for loop value
  ; Allocate Re as enumerator register

  JUMP L2

  L1:
    (code for A)
    (assign Rx to the variable/property of the left-hand-side expression A)
    JUMP L3

  L2:
    (code for C)
    (initialize enumerator for the value of C into Re)
    JUMP L4

  L3:
    (code for D)

  L4:
    (if enumerator Re is finished, JUMP to L5)
    (else load next enumerated value to Rx)
    JUMP L1

  L5:
    ; finished

Control flow for variant 4
--------------------------

Control flow for ``for (var A = B in C) D`` is similar to that of variant 3.
If the variable declaration has an initializer (B), it needs to be evaluated
before the enumerator target expression (C) is evaluated::

  ; Allocate Rx as temporary register for loop value
  ; Allocate Re as enumerator register

  (code for B)
  (code for assigning result of B to variable A)
  JUMP L2

  L1:
    (assign Rx to the variable A)
    JUMP L3

  L2:
    (code for C)
    (initialize enumerator for the value of C into Re)
    JUMP L4

  L3:
    (code for D)

  L4:
    (if enumerator Re is finished, JUMP to L5)
    (else load next enumerated value to Rx)
    JUMP L1

  L5:
    ; finished

Compiling without backtracking
------------------------------

The first token after the left parenthesis determines whether we're parsing
variant 1/3 or variant 2/4: a ``var`` token can never begin an expression.

Parsing variant 2/4 without backtracking:

* Parse ``var``.

* Parse the identifier name.

* Check whether the next token is the equals sign; if so:

  - Parse the equals sign.

  - Parse the assignment value expression as AssignmentExpressionNoIn:
    terminate parsing if ``in`` is encountered, and use the "rbp" argument
    to start parsing at the "AssignmentExpression" binding power level.

* If the next token is ``in``, we're dealing with variant 4:

  - The code emitted for the variable assignment is proper for variant 4.

  - The variable identifier should be used for the loop iteration.

* Else we're dealing with variant 2:

  - The code emitted for the variable assignment is proper for variant 2.

  - There may be further variable declarations in the declaration list.

Parsing variant 1/3 without backtracking is a bit more complicated.  An
important observation is that:

* The first expression (ExpressionNoIn_opt) before the semicolon in variant 1
  cannot contain a top-level ``in`` token.

* The expression (LeftHandSideExpression) before ``in`` in variant 3 also
  cannot contain a top-level ``in`` token.

This observation allows the following compilation strategy:

* Parse an Expression, prohibiting a top-level ``in`` token and keeping
  track of whether the expression conforms to LeftHandSideExpression.
  Any code generated during this parsing is correct for both variant 1
  and variant 3.

* After Expression parsing, check the next token; if the next token is
  ``in``, parse the remainder of the statement as variant 3.

* Else, if the next token is a semicolon, parse the remainder of the
  statement as variant 1.

* Else, SyntaxError.

Note that if the E5.1 syntax allowed a top-level ``in`` for variant 1,
this approach would not work.

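The decision logic can be sketched over a toy token list.  A real parser
makes this decision while generating code, and also deals with strings,
regexps, object literals, etc. which this sketch ignores:

```javascript
// Toy classifier for the four for/for-in variants; 'tokens' holds the
// tokens between the parentheses of the for statement.
function classifyForStatement(tokens) {
  var isVar = tokens[0] === "var";
  var depth = 0;
  for (var i = 0; i < tokens.length; i++) {
    var t = tokens[i];
    if (t === "(" || t === "[") { depth++; }
    else if (t === ")" || t === "]") { depth--; }
    else if (t === "in" && depth === 0) { return isVar ? 4 : 3; }  // top-level 'in'
    else if (t === ";" && depth === 0) { break; }  // first semicolon settles it
  }
  return isVar ? 2 : 1;
}
```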
Compiling "do-while" statements
===============================

There is a bug filed at:

* https://bugs.ecmascript.org/show_bug.cgi?id=8

The bug is about the expression::

  do{;}while(false)false

which is prohibited by the specification but allowed by actual
implementations.  The syntax error is that a ``do`` statement is supposed
to be followed by a semicolon, and since no newline follows the right
parenthesis, an automatic semicolon should not be allowed.

The workaround in the current implementation is a special flag for automatic
semicolon insertion (ALLOW_AUTO_SEMI_ALWAYS).  If the flag is set, automatic
semicolon insertion is allowed even when no line terminator is present before
the next token.

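ES2015 later added a normative ASI rule for exactly this do-while case, so
modern engines accept the form from the bug report.  The ``eval`` completion
value shows where the statement boundary lands (a quick check, assuming an
ES2015+ engine):

```javascript
// The source parses as "do {;} while (false);" followed by the expression
// statement "42"; ASI supplies the missing semicolon after the ")".
var completion = eval("do{;}while(false) 42");
```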
Compiling "switch" statements
=============================

Compiling switch statements is not complicated as such, but the switch
statement has somewhat tricky control flow.  Essentially there are two
control paths: the "search" code path which looks for the first matching
case (or the default case), and the "case" code path which executes the
case statements starting from the first match, falling through where
appropriate.

The code generated for this matching model is quite heavy in JUMPs.
It would be preferable to structure the code differently, e.g. first emit
all checks, and then emit all statement code.  Intermediate jumps would
not be required, at least in the statement code, in this case.  However,
this would require multi-pass parsing or construction of an intermediate
representation, which the current compilation model explicitly avoids.

The algorithm in E5.1 Section 12.11 seems to contain some ambiguity:
e.g. for a switch statement with a default clause, which B statements
are iterated in step 9 in each case?  The intent seems clear, though,
even if the text is not.  See:

* https://bugs.ecmascript.org/show_bug.cgi?id=345

See ``test-dev-switch*.js``.

Sometimes switch-case statements are used with a large number of integer
case values.  For example, a processor simulator would commonly have such
a switch for decoding opcodes::

  switch (opcode) {
  case 0:   /* ... */
  case 1:   /* ... */
  case 2:   /* ... */
  /* ... */
  case 255: /* ... */
  }

It would be nice to detect such structures and handle them using some sort
of jump table indexed by the switch value.  Doing so would need more state
than is currently available to the compiler, so switch-case statements like
this generate quite suboptimal bytecode at present.  This is definite
future work.

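The jump table idea can be sketched at the JavaScript level, with a
hypothetical dispatch array standing in for a bytecode-level jump table:

```javascript
// Dense integer switch dispatch via an array indexed by the switch value;
// a missing entry falls back to the default clause.
var handlers = [];
handlers[0] = function () { return "nop"; };
handlers[1] = function () { return "load"; };
handlers[2] = function () { return "store"; };

function decode(opcode) {
  var h = handlers[opcode];
  return h ? h() : "unknown opcode";  // default clause
}
```

A jump table replaces the O(n) "search" path with a single indexed lookup,
at the cost of table storage for the covered value range.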
Compiling "break"/"continue" (fast and slow)
============================================

A "fast" break/continue jumps directly to the appropriate jump slot
of the matching LABEL instruction.  The jump slot then jumps to the
correct place; in the case of BREAK, the jump slot jumps directly to
ENDLABEL.  The peephole optimizer then optimizes the extra jump,
creating a direct jump to the desired location.

A "fast" break/continue cannot cross a TCF catcher (i.e. a 'try'
statement or a 'with' statement), and the matching label must be
the innermost label (otherwise a LABEL catcher would be bypassed).

A "slow" break/continue uses a ``longjmp()`` and falls back to the
generic, always correct longjmp handler.

Compiling "return"
==================

Compiling a ``return`` statement is mostly trivial, but tail calls pose
some interesting problems.

If the return value is generated by a preceding ``CALL`` opcode, the call
can be flagged as a tail call.  The ``RETURN`` opcode is still emitted just
in case some feature prevents the tail call from happening at run time --
for example, the call target may be a native function (which is never tail
called) or have a ``"use duk notail"`` directive which prevents tail
calling the function.

Compiling "throw" statements
============================

A ``throw`` is never "fast"; we always use the longjmp handler to
process it.

Compiling logical expressions
=============================

Ecmascript has three logical operators: the binary operators ``&&`` and
``||``, and the unary operator ``!``.  The unary logical NOT operator
coerces its argument to a boolean value and negates the result (E5.1
Section 11.4.9).  The binary AND and OR operators employ ordered, short
circuit evaluation semantics, and the result of a binary operation is one
of its arguments, which is **not** coerced to a boolean value (E5.1
Section 11.11).

The Ecmascript ``ToBoolean()`` specification function is used to coerce
values into booleans (E5.1 Section 9.2) for comparison purposes.  The
following values are coerced to ``false``: ``undefined``, ``null``,
``false``, ``+0``, ``-0``, ``NaN``, ``""``.  All other values are coerced
to ``true``.  Note that the ``ToBoolean`` operation is side-effect free
and cannot throw an error.

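The coercion rules are easy to verify directly; the ``!!`` double negation
below performs exactly a ``ToBoolean()`` coercion:

```javascript
// The seven falsy values of E5.1 Section 9.2; everything else is truthy.
var falsyValues = [undefined, null, false, +0, -0, NaN, ""];

function toBoolean(v) {
  return !!v;  // ToBoolean() coercion
}
```

Note that e.g. ``"0"``, ``[]`` and ``new Boolean(false)`` are all objects
or non-empty strings and therefore coerce to ``true``.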
Evaluation ordering and short circuiting example using Rhino::

  js> function f(x,y) { print("f called for:", y); return x; }
  js> function g(x,y) { print("g called for:", y); throw new Error("" + x); }
  js>
  js> // Illustration of short circuit evaluation and evaluation order
  js> // (0/0 results in NaN)
  js> var a = f(1,"first (t)") && f(0,"second (f)") || f(0/0,"third (f)") && g(0,"fourth (err)");
  f called for: first (t)
  f called for: second (f)
  f called for: third (f)
  js> print(a);
  NaN

The first expression is evaluated and coerced to boolean; since it coerces
to ``true``, evaluation moves on to the second expression.  That coerces to
``false``, so the first AND expression returns the number value ``0``, i.e.
the value of the second expression (which coerced to ``false`` for
comparison).  Because the first part of the OR coerces to ``false``, the
second part is evaluated, starting from the third expression (``NaN``).
Since ``NaN`` coerces to ``false``, the fourth expression is never
evaluated.  The result of the latter AND expression is ``NaN``, which also
becomes the final value of the outer OR expression.

Code generation must respect the ordering and short circuiting semantics of
Ecmascript boolean expressions.  In particular, short circuiting means that
binary logical operations are not simply operations on values, but must
rather be control flow instructions.  Code generation must emit "skip jumps"
when generating expression code, and these jumps must be back-patched later.
It would be nice to generate a minimum number of jumps (e.g. when an AND
expression is contained in a logical NOT).

Logical expressions can be used in deciding the control flow path in a
control flow statement such as ``if`` or ``do-while``, but the expression
result can also be used directly, e.g. assigned to a variable.  For optimal
code generation the context where a logical expression occurs matters; for
example, often we don't need the final evaluation result but only its
"truthiness".  The current compiler doesn't take advantage of this potential
because there's not enough state information to do so.

Let's look at the code generation issues for the following::

  if (!((A && B) || (C && D && E) || F)) {
    print("true");
  } else {
    print("false");
  }

One code sequence for this would be::

  start:
    (t0 <- evaluate A)
    IF t0, 1         ; skip if (coerces to) true
    JUMP skip_and1   ; AND is done, result in t0 (= A)
    (t0 <- evaluate B)
    IF t0, 1         ; skip if (coerces to) true
    JUMP skip_and1   ; AND is done, result in t0 (= B)
    ; first AND evaluates to true, result in t0 (= B)
    JUMP do_lnot

  skip_and1:
    (t0 <- evaluate C)
    IF t0, 1
    JUMP skip_and2
    (t0 <- evaluate D)
    IF t0, 1
    JUMP skip_and2
    (t0 <- evaluate E)
    IF t0, 1
    JUMP skip_and2
    ; second AND evaluates to true, result in t0 (= E)
    JUMP do_lnot

  skip_and2:
    (t0 <- evaluate F)
    IF t0, 1
    JUMP skip_and3
    ; third OR operand evaluates to true, result in t0 (= F)
    JUMP do_lnot

  skip_and3:
    ; the OR sequence resulted in a value (in t0) which
    ; coerces to false.

    ; fall through to do_lnot

  do_lnot:
    ; the AND/OR part is done, with result in t0.  Note that
    ; all code paths must provide the result value in the same
    ; temporary register.

    LNOT t0, t0      ; coerce and negate
    IF t0, 1         ; skip if true
    JUMP false_path

  true_path:
    (code for print("true"))
    JUMP done

  false_path:
    (code for print("false"))
    ; fall through

  done:
    ; "if" is done

Because the result of the logical NOT is not actually needed, other than to
decide which branch of the if statement to execute, some extra jumps can be
eliminated::

  start:
    (t0 <- evaluate A)
    IF t0, 1         ; skip if (coerces to) true
    JUMP skip_and1   ; AND is done, result in t0 (= A)
    (t0 <- evaluate B)
    IF t0, 1         ; skip if (coerces to) true
    JUMP skip_and1   ; AND is done, result in t0 (= B)
    JUMP false_path

  skip_and1:
    (t0 <- evaluate C)
    IF t0, 1
    JUMP skip_and2
    (t0 <- evaluate D)
    IF t0, 1
    JUMP skip_and2
    (t0 <- evaluate E)
    IF t0, 1
    JUMP skip_and2
    JUMP false_path

  skip_and2:
    (t0 <- evaluate F)
    IF t0, 1
    JUMP skip_and3
    JUMP false_path

  skip_and3:
    ; the expression inside LNOT evaluated to false, so LNOT would
    ; yield true, and we fall through to the true path

  true_path:
    (code for print("true"))
    JUMP done

  false_path:
    (code for print("false"))
    ; fall through

  done:
    ; "if" is done

Which can be further refined to::

  start:
    (t0 <- evaluate A)
    IF t0, 1         ; skip if (coerces to) true
    JUMP skip_and1   ; AND is done, result in t0 (= A)
    (t0 <- evaluate B)
    IF t0, 0         ; skip if (coerces to) false (-> skip_and1)
    JUMP false_path

  skip_and1:
    (t0 <- evaluate C)
    IF t0, 1
    JUMP skip_and2
    (t0 <- evaluate D)
    IF t0, 1
    JUMP skip_and2
    (t0 <- evaluate E)
    IF t0, 0         ; -> skip_and2
    JUMP false_path

  skip_and2:
    (t0 <- evaluate F)
    IF t0, 0         ; -> skip_and3
    JUMP false_path

  skip_and3:
    ; the expression inside LNOT evaluated to false, so LNOT would
    ; yield true, and we fall through to the true path

  true_path:
    (code for print("true"))
    JUMP done

  false_path:
    (code for print("false"))
    ; fall through

  done:
    ; "if" is done

The current compilation model for logical AND and OR is quite simple.
It avoids the need for explicit back-patching state (all back-patching
state is kept on the C stack), and allows generation of code on-the-fly.
Although logical AND and OR expressions are syntactically
*left-associative*, they are parsed and evaluated in a *right-associative*
manner.

For instance, ``A && B && C`` is evaluated as ``A && (B && C)``, which
allows the code which processes the first logical AND to generate the code
for the latter part ``B && C`` recursively, and then back-patch a skip
jump over the entire latter part (= short circuiting the evaluation).

Unnecessary jumps are still generated between the boundaries of AND and OR
expressions (e.g. in ``A && B || C && D``).  These jumps are usually
"straightened out" by the final peephole pass, possibly leaving unneeded
instructions in the bytecode, but generating more or less optimal run-time
jumps.

Note that there are no opcodes for logical AND and logical OR.  They would
not be useful because short circuit evaluation requires them to be control
flow instructions rather than logical ones.

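A minimal sketch of the control flow treatment for ``A && B`` follows.  The
string "opcodes" and the generator callbacks are illustrative; the real
compiler emits binary instructions and keeps the back-patch state on the C
stack:

```javascript
// Emit code for 'A && B': evaluate A into t0, short circuit over B's code
// when t0 is falsy, and back-patch the jump target once B's code is known.
function genLogicalAnd(genA, genB) {
  var code = [];
  genA(code);                             // result in t0
  code.push("IF t0, 1");                  // skip next jump if t0 is truthy
  var jumpIdx = code.push("JUMP ?") - 1;  // short circuit jump, target unknown yet
  genB(code);                             // result overwrites t0
  code[jumpIdx] = "JUMP " + code.length;  // back-patch to skip over B's code
  return code;
}
```

Because ``B`` may itself be a logical expression handled by a recursive call,
the whole back-patching state lives in the call stack, mirroring the
right-associative evaluation described above.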
Compiling function calls; direct eval
=====================================

Ecmascript E5.1 handles **direct** ``eval`` calls differently from other
``eval`` calls.  For instance, direct ``eval`` calls may declare new
variables in the calling lexical scope, while variable declarations in
non-direct ``eval`` calls go into the global object.  See:

* E5.1 Section 10.4.2: Entering Eval Code

* E5.1 Section 15.1.2.1.1: Direct Call to Eval

E5.1 Section 15.1.2.1.1 states that:

  A direct call to the eval function is one that is expressed as a
  CallExpression that meets the following two conditions:

  The Reference that is the result of evaluating the MemberExpression
  in the CallExpression has an environment record as its base value
  and its reference name is "eval".

  The result of calling the abstract operation GetValue with that
  Reference as the argument is the standard built-in function defined
  in 15.1.2.1.

Note that it is *not* required that the binding actually be found in
the global object; a local variable with the name ``eval`` and with
the standard built-in ``eval()`` function as its value is also a
direct eval call.

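The distinction is easy to demonstrate in plain Ecmascript; ``scopeProbe``
is just an illustrative name:

```javascript
globalThis.scopeProbe = "global";

function directCall() {
  var scopeProbe = "local";
  return eval("scopeProbe");  // direct call: sees the calling lexical scope
}

function indirectCall() {
  var scopeProbe = "local";
  var e = eval;               // calling through any other reference is indirect
  return e("scopeProbe");     // indirect call: evaluates in the global scope
}
```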
Direct ``eval`` calls cannot be fully detected at compile time, as
we cannot always know the contents of the environment records outside
the current function.  The situation can even change at run time.
See ``test-dev-direct-eval.js`` for an illustration using an
intercepting ``with`` environment.

On the other hand, partial information can be deduced; in particular:

* If a function never performs a function call with the identifier name
  ``eval``, we *can* be sure that there are no direct eval calls, as the
  condition for the identifier name is never fulfilled.

The current approach is quite conservative, favoring correctness and
simple compilation over complicated analysis.  A function call made using
the identifier ``eval`` is currently handled as follows:

* Flag the function as "tainted" by eval, which turns off most function
  optimizations to ensure semantic correctness.  For example, the varmap
  is needed and the ``arguments`` object must be created on function entry
  in case eval code accesses it.

* Call setup is done normally; it doesn't matter whether ``eval`` is bound
  to a register or accessed using ``GETVAR``.  It is perfectly fine for a
  direct eval to happen through a local variable.

* Set the ``DUK_BC_CALL_FLAG_EVALCALL`` flag for the CALL bytecode
  instruction to indicate that the call was made using the identifier
  ``"eval"``.

Then at run time:

* The ``CALL`` handler notices that ``DUK_BC_CALL_FLAG_EVALCALL`` is set.
  It then checks whether the target function is the built-in eval function,
  and if so, triggers direct eval behavior.

Identifier-to-register bindings
===============================

Varmap, fast path and slow path
-------------------------------

Identifiers local to a function are (1) arguments, (2) variables, (3) function
declarations, and (4) dynamic bindings like "catch" or "let" bindings. Local
identifiers are handled in one of two ways:

* An identifier can be bound to a fixed register in the value stack frame
  allocated to the function. For example, an identifier named ``"foo"``
  might be bound to register 7 (R7). This is possible when the identifier
  is known at compile time, a suitable register is available, and the
  identifier binding is not deletable (which is usually, but not always,
  the case).

* An identifier can always be accessed explicitly by name, with its value
  stored in an explicit environment record object. This is possible
  in all cases, including dynamically established and non-deletable bindings.

Only function code identifiers can be register mapped. For global code,
declarations are mapped to the global object (an "object binding"). For
non-strict eval code the situation is a bit different: a variable declaration
inside a direct eval call declares a new variable in the *containing scope*.
Such bindings are also deletable, whereas local declarations in a function are
not.

An example of a function and identifier binding::

    function f(x, y) {
        // Arguments 'x' and 'y' can be mapped to registers R0 and R1.

        // A local variable can be mapped to register R2.
        var a = 123;

        // A dynamically declared variable is created in an explicit
        // environment record and is not register mapped.
        eval('var b = 321');
    }

When the compiler encounters an identifier access in the local function it
looks through the variable map ("varmap") which records identifier names and
their associated registers. If the identifier is found in the varmap, it is
safe to access the identifier with a direct register reference, which is called
a "fast path" access. This is safe because only non-deletable bindings are
register mapped, so there's no way the binding could later be removed, e.g.
by uncontrolled eval() calls. There's also nothing that could come in between
to capture the reference. For example, the Ecmascript statement::

    a += 1;

could be compiled to the following when "a" is in the varmap and mapped to R2::

    INC R2

When the identifier is not in the varmap, the compiler uses the "slow path"
which means addressing identifiers by name. For example, the Ecmascript
statement::

    b += 1;

could be compiled to the following when "b" is *not* in the varmap::

    ; c3 = 'b'
    ; r4 = temp reg

    GETVAR r4, c3    ; read 'b' to r4
    INC r4
    PUTVAR r4, c3    ; write r4 to 'b'

The GETVAR and PUTVAR opcodes (and other slow path opcodes) are handled by
the executor by looking up the variable name through explicit environment
record objects, which is more or less equivalent to a property lookup
through an object's prototype chain. The slow path is available at any
time for looking up any identifier, including a register mapped one.

When a function call exits, the executor copies any register mapped values
from the value stack frame into an environment record object so that any
inner functions which are still active can continue to access values held
by the outer function. An example of inner functions accessing a "closed"
outer function::

    function outer(val) {
        var foo = 'bar';
        return function inner() {
            print(val);
            print(foo);
        }
    }

    // Once outer() returns, 'fn' refers to a function which can still see
    // into the variables held in outer().
    var fn = outer(123);

    fn();  // prints 123, "bar"

Basic optimizations
-------------------

A few optimizations are applied to the conceptual model described above:

* Creation of a lexical environment object is delayed for a function call
  when possible, so that an actual object is only created when necessary.
  Most functions don't establish new local bindings so there's no need to
  create an explicit lexical environment object for every function call.

* When a function exits, identifier values are copied from registers to a
  lexical environment object only when necessary -- e.g. when the function
  has inner functions or eval calls. The compiler makes a conservative
  estimate of when this step can be omitted for better performance.

  Here's an example where an eval() is enough to require access to function
  bindings after function exit::

      duk> function f(x) { var foo=123; return eval(x); }
      = undefined
      duk> g = f('(function myfunc() { print(foo); })');
      = function myfunc() {/* ecmascript */}
      duk> g()
      123
      = undefined

* When there is no possibility of slow path accesses to identifiers nor any
  constructs which might otherwise access the varmap (direct eval calls,
  inner functions, etc), the compiler can omit the "varmap" from the final
  function template. However, when debugger support is enabled, the varmap
  is always kept so that the debugger can inspect variable names for all
  functions.

Arguments object
----------------

The ``arguments`` object is special and quite expensive to create when
calling a function. The need to create an arguments object is recorded
in the final function template with the ``DUK_HOBJECT_FLAG_CREATEARGS``
flag, which is checked in call handling.

The compiler can omit arguments object creation only when it's absolutely
certain the object won't be needed. For example, the following cause the
arguments object to be created on function entry (sometimes unnecessarily):

* If there's an ``eval`` anywhere in the function, there's a risk it will
  access the arguments object.

* If there's an unshadowed identifier reference using the name ``arguments``,
  the arguments object may be referenced.

Delaying arguments object creation to the point of an actual access is not
trivial because argument values may have already been mutated, and those
mutations affect arguments object creation.

Current approach
----------------

* The ``varmap`` keeps track of identifier-to-register bindings. In the
  first pass the ``varmap`` is empty; it is populated before the second
  pass. The first pass gathers argument names, variable declarations,
  and inner function declarations.

* After the first pass but before the second pass, the effects of declaration
  binding instantiation (E5.1 Section 10.5) are considered and a ``varmap``
  is built. The varmap contains all known identifiers, and their names
  are mapped either to an integer (= register number) or ``null``
  (identifier is declared but not register mapped). The rather complex
  shadowing rules for arguments, variable declarations, and inner function
  declarations are handled in this step.

* ``catch`` clause bindings: handled at run time by the try-catch-finally
  opcodes by creating an explicit lexical scope with the catch variable
  binding. All code accessing the catch variable name inside the catch
  clause uses slow path lookups; this leaves room for future work to
  handle catch bindings better.

* ``with`` statements: handled at run time by the try-catch-finally opcodes
  by creating an explicit lexical scope indicating an "object binding". The
  ``with_depth``, the number of nested ``with`` statements, is tracked during
  compilation. A non-zero with_depth prevents fast path variable accesses
  entirely because potentially any identifier access is captured by the object
  binding.

* After the second pass, when creating the final function template, the
  ``varmap`` is cleaned up: ``null`` entries are removed and the map is
  compacted.

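As an example of the shadowing rules handled during declaration binding instantiation (plain ES5): a function declaration with the same name as a formal argument replaces the argument's binding value.

```javascript
function f(a) {
  // Declaration binding instantiation processes function declarations
  // after the formals, so 'a' is the inner function here, not 123.
  return typeof a;
  function a() {}
}
console.log(f(123)); // 'function'
```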
Future work
===========

Some future work (not a comprehensive list by any means), in no particular
order.

Better handling of "catch" variables, "let" bindings
----------------------------------------------------

Current handling for "catch" variables creates an explicit lexical environment
object and uses the slow path for accessing the variable. This is far from
optimal, but solving it better requires more compiler state.

Similarly, the ES2015 "let" binding needs efficient support to be useful.

Improve line number assignment
------------------------------

The current compiler associates opcode line numbers with the "previous token",
which is not always correct. Add the necessary plumbing to associate opcode
line numbers more accurately.

Partial copy of variables when closing a function scope
-------------------------------------------------------

As of Duktape 1.3, when an outer function containing inner functions exits,
its lexical scope is closed with variable values copied from VM registers
(value stack frame) into an explicit scope object. This works correctly
but causes a reference to be held for all variables in the outer scope, even
those that are *never* accessed by any inner function; see
https://github.com/svaarala/duktape/issues/229.

This could be fixed by improving the compiler a bit:

* For every variable in the varmap, track the variable's current register
  mapping and a flag indicating if it has been referenced by an inner
  function ("keep on close").

* Whenever a function dereferences a variable not defined in the function
  itself, scan outer lexical scopes for matching variables. If a match is
  found, mark that variable in the outer function as being referenced by an
  inner function. (Note that if any involved function has an eval(), all
  bets are off and conservative code must be generated, as eval() may
  introduce new bindings at run time.)

* Encode the "keep on close" flags into the final compilation result (the
  function template). If eval()s are involved, mark all variables as
  "keep on close".

* At run time, when a function exits, copy only "keep on close" variables
  into the explicit scope object. Other variables are then decref'd and
  finalized if appropriate.

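A sketch of the underlying problem (variable names are illustrative): a closure keeps the entire closed scope alive, including values it never reads.

```javascript
function outer() {
  var big = new Array(1024).join('x'); // never referenced by inner()
  var msg = 'hello';
  return function inner() { return msg; };
}

// Without the proposed optimization, the scope object closed over by
// inner() still references 'big', keeping it alive as long as 'fn' is
// reachable, even though inner() only ever uses 'msg'.
var fn = outer();
console.log(fn()); // 'hello'
```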
Make ivalue manipulation shuffling aware
----------------------------------------

Current ivalue manipulation is not aware of register shuffling. Instead,
ivalue manipulation relies on bytecode emission helpers to handle shuffling
as necessary. Sometimes this results in sub-optimal opcode sequences (e.g.
the result of an operation shuffled to a high register and then immediately
needed in a subsequent operation). Code quality could be improved by making
ivalue manipulation shuffling aware.

Improve support for large functions
-----------------------------------

Large functions don't produce very good code with the current compiler:

* The method of binding identifiers to registers consumes a lot of useful
  low registers which can be directly addressed by all opcodes. It might
  be better to reserve identifiers in a non-continuous fashion so that a
  reasonable number of temporary registers could also be guaranteed to be
  in the low register range.

* The method of allocating temporaries may reserve low registers as
  temporaries which are then not available for inner expressions, which
  are often more important for performance (think outer loop vs. inner
  loop).

These are not fundamental limitations of the compiler, but there's been
little effort to improve support for large functions so far, other than to
ensure they work correctly.

Chunked stream parsing with rewind
----------------------------------

For low memory environments it would be useful to be able to stream source
code from e.g. flash memory. Because Duktape decodes the source code into
a codepoint window anyway, hiding the streaming process would be relatively
straightforward.

Adding support for streaming would involve using a callback (perhaps a pure
C callback or even an actual Duktape/C or Ecmascript callback) for providing
a chunk of source code for Duktape to decode from. Another callback would be
needed to rewind to a specified position. An alternative approach is to
provide a callback which returns at most N bytes starting from a specified
offset, and let the callback optimize for continuous reads if that's helpful.

Allowing source code compression is also desirable. It's possible to use
an ordinary stateful compression algorithm (like deflate) for the source code,
but in a naive implementation any rewind operation means that the decompression
must restart from the beginning of the entire source text. A more practical
approach is to use chunked compression so that semi-random access is possible
and reasonably efficient.

One more design alternative is to model the source input as a sequence of
Unicode codepoints instead of bytes, so that Duktape would just request a
sequence of codepoints starting from a certain *codepoint* offset and then
put them into the codepoint window. The user callback would handle character
encoding as needed, which would simultaneously add support for custom source
encodings. The downside of this approach is that the user callback needs the
ability to map an arbitrary codepoint offset to a byte offset, which is an
awkward requirement for multibyte character encodings.

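The "at most N bytes starting from a specified offset" callback style could look roughly like this (a sketch only; the names are illustrative, not an actual Duktape API, and a real implementation would read from flash or a compressed chunk store rather than an in-memory string):

```javascript
// makeSource returns a read callback over an in-memory string; the
// callback contract is: return up to maxBytes starting at offset.
function makeSource(text) {
  return function read(offset, maxBytes) {
    return text.substring(offset, offset + maxBytes);
  };
}

var read = makeSource('function f() { return 1; }');
console.log(read(0, 8));  // 'function'
console.log(read(15, 6)); // 'return'
```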
Context aware compilation of logical expressions
------------------------------------------------

When a logical expression occurs in an "if" statement, the final result of the
expression is not actually needed (only its truthiness matters). Further,
the "if" statement only needs to decide between two alternative jumps, so that
the short circuit handling used by the logical expression could just jump to
those targets directly.

Improve pool allocator compatibility
------------------------------------

A small improvement would be to track opcodes and line numbers in separate
buffers rather than a single buffer with ``duk_compiler_instr`` entries.

Split compiler into multiple files
----------------------------------

Example:

* Bytecode emission

* Ivalue handling

* Expression parser

* Statement parser and entry point

Using some "memory" between pass 1 and pass 2
---------------------------------------------

The multi-pass compilation approach allows us to build some "memory"
to help in code generation. In fact, pass 1 is already used to discover
variable declarations, which is a sort of memory that affects code
generation later.

The following would help, for example:

* Avoiding LABEL sites for iteration structures not requiring them.
  For instance, an iteration statement without an explicit label
  and with no "break" or "continue" statement inside the iteration
  construct does not need a LABEL site.

* Alternatively, one could simply record all label sites created in
  pass 1, and note whether any break/continue targeted the label
  site in question. In pass 2, this state could be consulted to
  skip emitting label sites.

Because the source is identical when reparsed, it is possible to address
such "memory" using e.g. statement numbering, expression numbering, or
token numbering, where the numbers are assigned from the start of the
function (the rewind point).

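For instance (plain Ecmascript), only the second loop below would need a LABEL site under such a scheme:

```javascript
var sum = 0;

// No label, no break/continue: a LABEL site would be unnecessary.
for (var i = 0; i < 3; i++) { sum += i; }

// 'continue' targets the loop's implicit label, so a LABEL site is needed.
for (var j = 0; j < 5; j++) {
  if (j % 2 === 0) { continue; }
  sum += j;
}
console.log(sum); // 7 (0+1+2 from the first loop, 1+3 from the second)
```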
Compile time lookups for non-mutable constants
----------------------------------------------

Variable lookups are represented by ivalues which identify a variable by name.
Eventually they get converted to concrete code which reads a variable either
directly from a register, or using a slow path GETVAR lookup.

This could be improved in several ways. For example, if support for ``const``
were added, the ivalue conversion could detect that the variable maps to a
constant in the current function or an outer function (with the necessary
checks to ensure no "capturing" bindings can be established by e.g. an eval).
The ivalue could then be coerced into a registered constant, copying the value
of the constant declaration.

Slow path record skip count
---------------------------

When a slow path access is made, some environment record lookups can be
skipped if the records belong to functions which have no potential for
dynamically introduced bindings. For example::

    var foo = 123;  // global

    function func1() {
        var foo = 321;

        function func2() {
            var bar = 432;

            function func3() {
                var quux = 543;

                // The slow path lookup for 'hello' can skip func3, func2,
                // and func1 entirely because it will never match there.
                // In other words, we could look up 'hello' directly from
                // the global object.
                print('hello');

                // The slow path lookup for 'foo' could bypass func3 and
                // func2, and begin from func1.
                print(foo);
            }
        }
    }

Any function with an ``eval()`` will potentially contain any binding, "with"
statements must be handled correctly, etc.

This optimization would be nice for looking up global bindings like ``print``,
``Math``, ``Array``, etc.

The technical change would be for e.g. GETVAR to get an integer argument
indicating how many prototype levels to skip when looking up the binding.

Slow path non-configurable, non-writable bindings
-------------------------------------------------

When a slow path access is certain to map to a non-configurable, non-writable
value, the value could be copied into the function's constant table and used
directly without an actual slow path lookup at run time. There are a few
problems with this:

* At the moment constants can only be numbers and strings, and this affects
  bytecode dump/load. If a constant were e.g. a function reference, bytecode
  dump/load wouldn't be able to handle it without some backing information to
  reconstruct the reference on bytecode load.

* Even though a binding is non-writable and non-configurable, it can still be
  changed by C code with ``duk_def_prop()``. This is intentional so that C
  code has more freedom for sandboxing and such. For such environments this
  optimization might not always be appropriate.

Better handling of shared constant detection
--------------------------------------------

When a new constant is introduced, the current implementation linearly
walks through existing constants to see if one can be reused. This
walk is capped to ensure reasonable compilation times even for functions
with a large number of constants.

A better solution would be to use a faster search structure for detecting
shared constants, e.g. a hash map with more flexible keys than in Ecmascript
objects (perhaps one of the ES2015 maps).

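A minimal sketch of the idea (illustrative only; the "type:value" key scheme and the ``ConstPool`` name are assumptions, not Duktape code): a map keyed by type and value makes the reuse check O(1) instead of a capped linear scan.

```javascript
function ConstPool() {
  this.lookup = {};   // "type:value" key -> constant index
  this.consts = [];   // the actual constant table
}
ConstPool.prototype.intern = function (v) {
  var key = typeof v + ':' + v;  // distinguishes the number 123 from '123'
  if (key in this.lookup) { return this.lookup[key]; }  // reuse existing
  var idx = this.consts.length;
  this.consts.push(v);
  this.lookup[key] = idx;
  return idx;
};

var pool = new ConstPool();
console.log(pool.intern('foo')); // 0
console.log(pool.intern(123));   // 1
console.log(pool.intern('foo')); // 0 (reused, no duplicate entry)
```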
Better switch-case handling
---------------------------

It would be nice to support at least dense integer ranges and use a jump table
to handle them. This is important, for example, if a switch-case implements
some kind of integer-dependent dispatch such as an opcode decoder.

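An example of the pattern that would benefit (plain Ecmascript): a dense 0..3 case range could be compiled into a jump table indexed directly by the switch value, instead of a chain of comparisons.

```javascript
function dispatch(op) {
  switch (op) {              // dense integer range: jump-table friendly
    case 0: return 'nop';
    case 1: return 'add';
    case 2: return 'sub';
    case 3: return 'mul';
    default: return 'invalid';
  }
}
console.log(dispatch(2)); // 'sub'
console.log(dispatch(9)); // 'invalid'
```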