======================
Duktape ES5.1 compiler
======================

Introduction
============

This document provides an overview of the Duktape ES5.1 compiler structure, the basic compilation process, and notes on difficult-to-compile constructs. Assorted lexer and compiler design notes are covered in separate sections in no specific order (these are raw notes collected on the first implementation pass). Future work lists known areas for improvement (not exhaustively).

The document does not cover the changes needed to support new ES2015 constructs like destructuring assignment; the main change will be the addition of some form of an intermediate representation (IR), perhaps a statement level expression tree or a full expression tree. Having an IR will also enable several optimizations.

This document is a snapshot of design issues and will not be kept exactly up-to-date with the compiler. The document was originally written for Duktape 1.0 so some parts may be out-of-date.

Compilation overview
====================

Basics
------

The compiler converts source code containing global code, eval code, or function code into an executable form. The most important parts of that form are bytecode, constants, function metadata, and inner function templates. Compilation can also end in a ``SyntaxError``, which is mandated to be an "early error" in Ecmascript, or in an internal error, such as an out-of-memory error.

The end result of compilation is more specifically a *function template*. A function template does not yet have a lexical environment; it merely refers to symbols outside of its own scope by name. A function template is instantiated into a closure which supplies the missing lexical environment. Inner functions are also function templates, and multiple closures may be created from a given template::

  // For each invocation of f(), a separate closure of g() is created
  function f(x) {
      function g() {
          print(x);
      }
      return g;
  }

Compilation depends on two major components:

* A *lexer*, which generates tokens from an underlying input stream on request. The lexer supports backtracking and reparsing to allow multi-pass parsing, at the cost of not currently supporting streamed parsing (adding support for chunked streaming would be possible, see future work).

* A *compiler*, which generates bytecode from a token stream. The compiler uses two (or more) passes over a function, and generates bytecode directly, avoiding an explicit intermediate representation. An *intermediate value* (ivalue) concept is used instead for managing expression values.

Code and memory footprint concerns originally led to the decision to avoid a comprehensive intermediate representation in the compiler; a limited intermediate value concept is used instead. This complicates compilation somewhat but reduces memory footprint. A more comprehensive intermediate representation will be needed in future versions to fully support ES2015.

Lexer
-----

The lexer has a straightforward design:

* Source code is decoded from UTF-8 into a window of Unicode codepoints, with an infinite number of virtual EOF codepoints following the end-of-input. The decoder supports rewinding: the current position in the source stream can be saved and restored at a later time.

* The main lexing function decodes individual tokens like keywords, numbers, identifiers, etc, using the codepoint window for a safe lookup. The lexing function keeps track of token line number and source byte offset information, and the eligibility of tokens for automatic semicolon insertion.
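
To illustrate the codepoint window idea, here is a minimal JavaScript sketch (for illustration only; the actual lexer is written in C and the names used here are invented)::

  // Minimal sketch of a codepoint window with virtual EOF codepoints.
  function makeWindow(codepoints) {
      var pos = 0;
      var EOF = -1;   // virtual EOF codepoint returned past the input end

      return {
          // Safe lookahead: no explicit end-of-input check needed at call sites.
          peek: function (offset) {
              var i = pos + offset;
              return i < codepoints.length ? codepoints[i] : EOF;
          },
          advance: function () { pos++; },
          save: function () { return pos; },           // supports rewinding:
          restore: function (saved) { pos = saved; }   // save/restore position
      };
  }

  var w = makeWindow([0x66, 0x6f, 0x6f]);   // codepoints for "foo"
  w.peek(0);    // 0x66 ('f')
  w.peek(10);   // -1, virtual EOF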

The upside of the codepoint window is that character encoding is effectively hidden and normalized in the process, and that the main lexing function can safely do forward lookups up to a certain limit without explicit end-of-input checks. It would also be possible to support encodings other than UTF-8 quite easily. The downside is a small performance impact over hardcoding the lexer for UTF-8 input, which would also be a viable design.

Compiler
--------

The compiler component uses a hand-crafted recursive descent statement parser, with somewhat tricky handling of a few Ecmascript constructs (like the "for/for-in" statement). In contrast, top-down operator parsing is used for parsing expressions (see http://javascript.crockford.com/tdop/tdop.html), which is nice in that it allows single-pass parsing while generating mostly nice bytecode.

Compilation is triggered by Duktape C API compilation/eval calls, ``eval()`` calls, or ``Function`` constructor calls. The calling context provides the necessary flags to indicate whether global code, eval code, or function code is being compiled. The global compiler state is set up, and a "current function" is set up matching the requested compilation context. The function body is then parsed with a multi-pass parser, resulting in final bytecode, constants, inner functions, and a lot of other state (for example, flags indicating whether the function accesses ``arguments`` or ``eval``; this state is needed to optimize functions). Finally, the "current function" is converted from the compiler representation to the run-time function (template) format used by the executor.

Statement parsing is straightforward: a function body is parsed as a sequence of "source elements" which are otherwise the same as statements but function declarations are also allowed (Duktape allows function declarations anywhere in a function for compatibility with existing code and other engines, however). For global code and eval code there's a need to support an "implicit return value" which requires tracking of statement values; an implicit return value is not needed for function code.

Expressions are parsed into *intermediate values* (ivalues) which are one step from being concrete, fully referenced values. This missing step allows ivalues to be used both as left-hand-side values (assignment targets) and right-hand-side values (evaluated values), and allows limited constant folding at compile time. An ivalue may, for example, represent:

* A constant value which is not yet registered to the function constant table (e.g. the string ``"foo"``).

* A constant registered to the function constant table (denoted e.g. ``C18`` for constant 18).

* A register in the function register frame (denoted e.g. ``R23`` for register 23).

* A variable access using a variable name.

* A property access with a base value and a key, with the base value and key being registers or constants.

* A unary or binary arithmetic operation.

One way to characterize ivalues is that instead of using a full expression tree as an intermediate representation, the compiler uses small, local fragments of that tree without ever constructing the full tree. Each ivalue is either a leaf node or an internal node with two leaf node children. When parsing an expression the compiler typically creates an ivalue to represent the value of the expression; some bytecode may be emitted in the process to prepare registers/constants for the ivalue but this is not always the case.
The ivalue may then be used as an input to another expression, converted from one form to another if necessary. This conversion process may allocate new constants or registers, and emit bytecode as necessary. For example, the result of a previously parsed ivalue representing an addition operation may be needed in a single register/constant. The compiler converts the ivalue by allocating a temporary register and emitting the ADD opcode to compute the result. The temporary register can then be used as an input in another ivalue as needed. Creating ivalues for expressions and converting ivalues from one form to another drives much of the code generation process. The ivalue conversion helpers can also perform limited optimization, such as constant folding for numbers and strings. Bytecode emission is conceptually quite simple: expression and ivalue handling code simply request opcodes to be emitted as needed. However, the bytecode emission functions transparently handle *register shuffling* to extend the range of addressable registers. For example, the binary ``ADD X, Y, Z`` opcodes can directly only address an 8-bit target register (X) and two 8-bit source registers or constants (Y and Z). If any arguments exceed their allowed range, the bytecode emission functions emit the necessary opcodes to shuffle source and/or target values through temporary registers. While such code is not optimal, it is necessary to support very large functions (for example those produced by Emscripten). Two (or more) passes are made over every parsed function. On the first pass we don't know which variables and inner functions the function will declare, as such declarations are conceptually "hoisted" to the top of the function. One purpose of the first pass is to gather this information for the second pass. Even so, to keep the code simple, the first pass also generates "broken" throw-away bytecode so that the same parsing code can be used for all passes. On the second pass all the necessary information has been gathered and actual bytecode can be emitted. A simple in-place peephole optimizer is applied to the bytecode before generating final bytecode. The peephole optimizer currently only straightens out JUMPs (JUMP->JUMP->X is converted to JUMP->X). The temporary registers needed for shuffling are only allocated when they're actually needed. Typically this is noticed on the first pass, but in some cases it is only detected on the second pass; in such cases a third pass is needed to generate the final bytecode. This multi-pass approach has several downsides: (1) it requires a lexer which can backtrack to the beginning of the function; and (2) time is wasted in lexing and compiling the function twice (in an initial design inner functions would also get parsed *four times* in total, their inner functions *eight times* in total, etc, but there's a specific solution to this problem in the current compiler). The upside of multi-pass parsing is that there is no need for an intermediate representation which saves on memory footprint. The statement parser keeps track of a "statement number" within the current function. This is not needed for any critical purpose, but it allows the first compilation pass to stash information related to a certain statement for the second pass, perhaps allowing more optimal code generation. For instance, the first pass could note that a loop statement has no relevant break/continue statements, so a label site is not actually needed. 

Similarly, expression counts, token counts, or source offsets could be used to address constructs to help in multi-pass parsing. However, no such optimizations are currently used by the compiler.

Recursive expression parsing, statement parsing, and function parsing may happen during parsing; for example, a function expression may appear almost anywhere and trigger recursive function compilation. To fully support recursion in function parsing, all compilation state is kept in the "current function" state rather than the global compiler state.

Both the lexer and compiler need to deal with the fact that garbage collection may occur almost anywhere (which may even lead to nested compilation if a finalizer is invoked), errors may be thrown almost anywhere, and so on. All resources must thus be visible to the garbage collector and correctly reference counted at nearly all times. The current approach to deal with this is to use the current thread's value stack to stash token values, intermediate values, identifier names, etc. Slots are allocated from the value stack as necessary. This is a bit complicated; some alternatives:

* Finalization (and possibly mark-and-sweep) could be prevented during compilation.

* Make the compiler state a traversable object type visible to garbage collection.

Ivalue example
--------------

Expression parsing and ivalue manipulation drive most of the code generation process. Let's look at a concrete example of how these work together to generate bytecode. Consider the statement::

  x.y.z = 1 + 2;

The steps taken to compile the statement are roughly:

* The "x" expression generates ivalue I1 of type "variable access" with the variable name "x", which is not yet allocated a constant identifier. No code is emitted.

* The ".y" part generates ivalue I2 of type "property access":

  - The base value (I1) needs to be a register or a constant, so a constant C0 is allocated for the variable name (``"x"``) and a temporary register R0 for the value, and bytecode to read the variable is emitted (``GETVAR R0, C0``).

  - The key needs to be a register or constant, so a constant C1 is allocated for the key (``"y"``). No bytecode needs to be emitted.

  - I2 base value is R0, key is C1.

* The ".z" part generates ivalue I3 of type "property access":

  - The base value (I2) is coerced into a new temporary register R1 by emitting bytecode for the property load (``GETPROP R1, R0, C1``).

  - A constant C2 is allocated for the key (``"z"``).

  - I3 base value is R1, key is C2.

* The compiler notices an assignment operator and parses the right side. The constants 1 and 2 are compiled into ivalues I4 and I5 initially, and then combined into an ivalue I6 representing the addition of two constants. No code is emitted for the addition yet.

* To perform the assignment the right-hand side (I6) needs to be coerced into a single register/constant. For this specific ivalue the compiler notices that two integer constants are being added, so constant folding is used. The compiler allocates a temporary register R2 and emits bytecode to load the integer (``LDINT R2, 3``). The ivalue I7 represents the result in R2. (The compiler could also register a new constant instead of using an integer load, but (some) integers are more efficiently handled using direct integer loads.)

* Finally, the assignment operation uses I3 as its target and I7 as its source, emitting a property write (``PUTPROP R1, C2, R2``). Here I3 is used as a left-hand-side value (write target) rather than as a right-hand-side value.

While there are multiple steps and ivalues, the bytecode emitted from this process is relatively short (the opcodes here are for illustration only and don't match 1:1 with the actual opcodes used by Duktape)::

  ; Constant C0: "x"
  ; Constant C1: "y"
  ; Constant C2: "z"

  GETVAR R0, C0       ; read variable "x" to R0
  GETPROP R1, R0, C1  ; read property R0["y"] (= x.y) to R1
  LDINT R2, 3         ; load integer 3 to R2
  PUTPROP R1, C2, R2  ; write property R1["z"] (= x.y.z), value R2 (integer 3)

As can be seen from the example, ivalues are convenient in that the final result of a property expression has a single format (an ivalue) which is one step removed from the final value. This allows them to be used both as left-hand-side and right-hand-side values; the decision is made by the caller in the final conversion. Optimizations are also possible when converting ivalues from one form to the next.

Ivalue conversion also provides a lot of flexibility: if the result of a previous expression isn't directly compatible with the needs of the expression being parsed, ivalues can be converted to the required form. Because ivalues are one step away from being completed, inefficient conversions are mostly (but certainly not always) avoided. For example, an ivalue representing an integer can be converted either to a register or a constant, with the necessary bytecode only emitted when it's known which one is preferred.

Several details are omitted from this description; for example:

* The compiler tries to reuse temporary registers where possible to reduce the number of temporaries needed.

* Local variables (including arguments) are assigned to registers and are accessed directly without an explicit variable read/write operation (GETVAR or PUTVAR).

* Register shuffling might be needed; it is currently handled transparently by the bytecode emission functions.

Bytecode
--------

The bytecode opcodes used by Duktape are chosen simply to work well for both compilation and execution. The bytecode is not version compatible, and may change arbitrarily even in minor versions. The role of Duktape bytecode is not to be a code distribution format like e.g. Java bytecode. The bytecode executor is the best source for documentation on exact bytecode semantics at any given time.

Opcode information must be kept in sync in:

* ``src-input/duk_js_bytecode.h`` defines opcode names and various constants

* ``src-input/duk_js_compiler.c`` emits bytecode

* ``src-input/duk_js_executor.c`` interprets bytecode

* ``debugger/duk_opcodes.yaml`` provides opcode metadata in a programmatic format, used by the debugger Web UI for bytecode dumping

Code organization
-----------------

The main entry point to compilation is ``duk_js_compile()`` in ``duk_js_compiler.c``.

``duk_lexer.c`` and ``duk_lexer.h`` contain the entire lexer implementation. Tokens are represented by ``duk_token``. Two slots are reserved from the value stack for token values (regexp literals need two slots: pattern and flags) to keep the values reference counted.

``duk_js_compiler.c`` and ``duk_js_compiler.h`` contain the entire compiler implementation: function, statement and expression parsers, bytecode emission, ivalue manipulation, and assorted support functionality like label and constant management. The compiler was originally written as a single file for efficient inlining, before source files were combined into a single file in the dist process.

Compilation state is encapsulated into ``duk_compiler_ctx``, which includes:

* Tokenization state

* Control structure for the current function being compiled; the function structure includes:

  - Code generation state: bytecode, identifier bindings, constants, temporary register state, label state, etc

  - Control variables for the current expression being parsed

* Various control flags which operate at the entry point level

Intermediate values are represented by ``duk_ivalue`` and ``duk_ispec``. These need value stack slots for storing values such as strings.

A function being compiled is represented by the inner representation ``duk_compiler_func`` which is converted into an actual function object (a template) once compilation is finished. The intermediate function refers to a number of allocated value stack locations for storing compilation data such as label information, known identifiers, bytecode emitted, etc. There are also supporting state and structures like ``duk_labelinfo``.

Bytecode is generated as a sequence of ``duk_compiler_instr`` structs. These contain an actual instruction (``duk_instr_t``) and line information. Line information is compressed into a compact bit-packed run-time format (pc2line) at the end of function compilation.

General design notes
====================

This section lists miscellaneous issues affecting lexer and compiler design.

C recursion depth
-----------------

C recursion depth or C stack size needs to be reasonably limited for compatibility with some embedded environments with small stacks.

Avoiding memory churn
---------------------

Minimizing the number of alloc/realloc/free operations is desirable for all environments. Memory churn has a performance impact and also increases the chance that memory gets fragmented, which is an issue for some (but not all) allocators. A few examples of how to avoid memory churn:

* Use fixed size buffers when possible, e.g. for the codepoint decode window.

* Use a shared temporary buffer for parsing string valued tokens, reusing the buffer. Most keywords and literal strings will easily fit into a few hundred bytes without ever needing to resize the temporary buffer.

* Minimize resizes of the bytecode emission buffer. For example, when starting the second compilation pass, keep the current bytecode buffer without resizing it to a smaller size.

Memory usage patterns for pooled allocators
-------------------------------------------

For low memory environments using pool allocation, any large allocations that grow without bounds are awkward to handle because selecting the pool sizes becomes difficult. It is preferable to do a lot of smaller allocations with a bounded size instead; typical pool configurations provide a lot of small buffers from 4 to 64 bytes, and a reasonable number of buffers up to 256 bytes. Above that, buffer counts are smaller and more tightly reserved. There are a few unbounded allocations in the current solution, such as the current bytecode being emitted.

Lexer design notes
==================

This section has small lexer design notes in no specific order. Larger issues are covered in dedicated sections below.

Tokenization is stateful
------------------------

Tokenization is affected by:

* Strictness of the current context, which affects the set of recognized keywords (reserved words, to be more precise).

* Regexp mode, i.e. whether a literal regexp is allowed in the current context. This is the case because regexp literals use the forward slash which is easily confused with a division expression.

  Currently handled by having a table indicating which tokens may not be followed by a RegExp literal.

* In some contexts reserved words are recognized but in others they must be interpreted as identifiers: an ``Identifier`` production accepts any ``IdentifierName`` except for ``ReservedWord``. Both ``Identifier`` and ``IdentifierName`` appear in grammar constructs. The current approach is to supply both the raw identifier name and a possible reserved word in ``duk_token``. The caller can then decide which is appropriate in the caller's context.

Source code encoding is not specified
-------------------------------------

The E5.1 specification does not mandate any specific source code encoding. Instead, source code is assumed to be a 16-bit codepoint sequence for specification purposes (E5.1 Section 6). The current choice is for source code to be decoded as UTF-8. Changing the supported encoding(s) would be easy because of the codepoint decoding window approach, but it's preferred that calling code transcode non-UTF-8 inputs into UTF-8.

Source code may contain non-BMP characters but Ecmascript does not support such characters directly. For instance, if codepoint U+12345 appeared (without escaping) inside a string constant, it would need to be interpreted as two 16-bit surrogate codepoints (a surrogate pair), if such characters are supported at all. Duktape strings do support non-BMP characters, but such strings cannot be created using source literals.

Use strict directive
--------------------

The "use strict" and other directives have somewhat odd semantics (see E5.1 Section 14.1):

* ``"use strict"`` is a valid "use strict directive" and triggers strict mode.

* ``"use\u0020strict"`` is a valid directive but **not** a "use strict directive".

* ``("use strict")`` is not a valid directive.

The lexer and the expression parser coordinate to provide enough information (character escaping, expression "depth") to allow these cases to be distinguished properly.

Compiler design notes
=====================

This section has small compiler design notes in no specific order. Larger issues are covered in dedicated sections below.

Expression parsing algorithm
----------------------------

The expression parsing algorithm is based on:

* http://javascript.crockford.com/tdop/tdop.html

* http://effbot.org/zone/simple-top-down-parsing.htm

* http://effbot.org/zone/tdop-index.htm

The ``nud()`` function considers a token as a "value" token. It also parses unary expressions (such as ``!x``). The ``led()`` function considers a token as an "operator" token, which operates on a preceding value. Some tokens operate in both roles but with different semantics. For instance, the opening bracket (``[``) may either begin an array literal in ``nud()``, or a property access in ``led()``.

The simplified algorithm is as follows. The 'rbp' argument defines the "right binding power", which governs when the expression is considered to be finished. The 'lbp()' value provides token binding power, the "left binding power". The higher 'rbp' is, the more tightly bound expression we're parsing::

  nud()                 ; parse current token as "value"
  while rbp < lbp():    ; while token binds more tightly than rbp...
      led()             ; combine previous value with operator

The ``led()`` function may parse an expression recursively, with a higher 'rbp', i.e. a more tightly bound expression.

In addition to this basic algorithm, some special features are needed:

* Keep track of led() and nud() counts.

  This allows directives in a function "directive prologue" (E5.1 Section 14.1) to be detected correctly. For instance::

    function f() {
        'use strict';        // valid directive for strict mode
        'use\u0020strict';   // valid directive, but not for strict mode (!)
        ('use strict');      // not a valid directive, terminates directive prologue
    }

* Keep track of parenthesis nesting count during expression parsing. This allows "top level" to be distinguished from nested levels.

* Keep track of whether the expression is a valid LeftHandSideExpression, i.e. the top level contains only LeftHandSideExpression level operators.

* Allow a caller to specify that expression parsing should be terminated at a top-level ``in`` token. This is needed for the "NoIn" variants, which are used in for/for-in statements.

* Allow a caller to specify whether or not an empty expression is allowed.

The expression parser uses both the "previous token" and the "current token" in making parsing decisions. Which token is considered at each point is not always trivial, and the responsibilities between compiler internal helper functions are not always obvious; token state assumptions are thus documented in most functions.

Parsing statements
------------------

Statement parsing is a traditional top-down recursive process which is relatively straightforward. Some complicated issues are:

* Specific statement types which are difficult to parse without lookahead

* Label site handling

* Tail calls

* Implicit return values

Parsing functions
-----------------

At the end of function parsing, the compiler needs to determine what flags to set for the function. Some flags have an important performance impact. In particular, the creation of an ``arguments`` object can be skipped if the compiler can guarantee that it will never be accessed. This is not trivial because e.g. the presence of a direct ``eval()`` call may allow indirect access to ``arguments``. The compiler must always make a conservative choice to ensure compliance and safety.

Distinguishing for/for-in
-------------------------

There are a total of four ``for`` / ``for-in`` statement variants. Each variant requires slightly different bytecode output. Detecting the correct variant is difficult, but possible, without multiple passes or arbitrary length token lookup. See separate discussion below.

Expressions involving "new"
---------------------------

Expressions involving ``new`` are not trivial to parse without lookahead. The grammar rules for ``LeftHandSideExpression``, ``CallExpression``, ``NewExpression``, and ``MemberExpression`` are a bit awkward. See separate discussion below.

Directive detection
-------------------

The "use strict" and other directives are part of a directive prologue, which is the sequence of initial ExpressionStatements producing only a string literal (E5.1 Section 14.1). The expression parser provides a nud/led call count which allows the statement parser to determine that an expression is a valid directive. The first non-valid directive terminates the directive prologue, and no more directives are processed. The lexer provides character escape metadata in token information to allow "use strict" to be detected correctly.

The transition to strict mode occurs in the directive prologue of the first compilation pass. Function strictness is thus already known at the beginning of the second pass. This is important because strict mode affects function argument parsing, for instance, so it must be known before parsing the function body.
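
As an example of why this matters (an illustrative snippet, not from the compiler sources): duplicate formal parameter names are a ``SyntaxError`` only in strict mode, and the directive which makes the function strict appears after the parameter list in the source::

  function ok(a, a) { return a; }        // accepted in non-strict code
  function bad(a, a) { "use strict"; }   // SyntaxError: duplicate parameter
                                         // names not allowed in strict mode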
Declaration "hoisting" ---------------------- Variable and function declarations affect code generation even before the declarations themselves appear in the source code: in effect, declarations are "hoisted" to the top of the function. To be able to generate reasonable code, compile-time identifier resolution requires multi-pass parsing or some intermediate representation. Current solution is multi-pass function parsing. Some token lookahead is needed ------------------------------ Because we need some lookahead, the compiler currently keeps track of two tokens at a time, a "current token" and a "previous token". Implicit return values ---------------------- Global and eval code have an implicit return value, see separate section below. Guaranteed side effects ----------------------- Sometimes code must be generated even when it might seem intuitive it is not necessary. For example, the argument to a ``void`` operator must be coerced to a "plain" register/constant so that any side effects are generated. Side effects might be caused by e.g. getter calls:: // If foo.x is an accessor, it must be called void foo.x Evaluation order requirements ----------------------------- Evaluation order requirements complicate one-pass code generation somewhat because there's little leeway in reordering emitted bytecode without a larger IR. Dynamic lexical contexts ------------------------ Ecmascript lexical contexts can be dynamically altered even after a function call exits. For example, if a function makes a direct ``eval()`` call with a variable argument, it is possible to declare new variables when the function is called:: var foo = 123; var myfunc; function f(x) { eval(x); return function () { print(foo); } } // declare 'foo' in f(), returned closure sees this 'foo' instead // of the global one myfunc = f('var foo = 321'); myfunc(); // prints 321, not 123 // don't declare 'foo' in f(), returned closure sees the global 'foo' // instead of the global one myfunc = f('var quux = 432'); myfunc(); // prints 123 For execution efficiency we should, for example, avoid creation of environment records and the ``arguments`` object. The compiler thus needs to conservatively estimate what optimizations are possible. Compilation may trigger a GC or recursive compilation ----------------------------------------------------- At first glance it might seem that the compiler cannot be invoked recursively. This is not the case however: the compiler may trigger a garbage collection or a refzero, which triggers a finalizer execution, which in turn can use e.g. ``eval()`` to cause a recursive Ecmascript compilation. Compiler recursion is not a problem as such, as it is a normal recursive C call which respects value stack policy. There are a few practical issues to note with regards to GC and recursion: * All heap values must be correctly reference counted and reachable. The compiler needs heap values to represent token values, compiler intermediate values, etc. All such values must be reachable through the valstack, a temporary object, or GC must explicitly support compiler state. * There should be no global (heap- or thread-wide) compiler state that would get clobbered by a recursive compilation call. If there is such state, it must be saved and restored by the compiler. * At the moment there is a "current compiler context" variable in ``duk_hthread`` which is used to augment SyntaxErrors with a line number. This state is saved and restored in recursive compilation to avoid clobbering. 

Unary minus and plus
--------------------

Quite interestingly, the minus sign in ``-123`` is **not** a part of the number token in Ecmascript syntax. Instead, ``-123`` is parsed as a unary minus followed by a number literal. The current compiler follows this required syntax, but constant folding ensures no extra code or constants are generated for unary minus or unary plus.

Compile-time vs. run-time errors
--------------------------------

Compilation may fail with an error only if the cause is an "early error", specified in E5.1 Section 16, or an internal error, such as out of memory, occurs. Other errors must only occur when the result of the compilation is executed. Sometimes this includes constructs that we know can never be executed without an error (such as a function call being in the left-hand-side position of an assignment), but perhaps that code is never reached or the error is intentional.

Label statement handling
------------------------

Label statements essentially prefix actual statements::

  mylabel:
  while (true) {
      ...
  }

Labels are currently handled directly by the internal function which parses a single statement. This is useful because all labels preceding an actual statement are coalesced into a single "label site". All labels, including an implicit empty label for iteration statements, point to the same label site::

  // only a single label site is established for labels:
  // "label1", "label2", ""
  label1:
  label2:
  for (;;) {
      ...
  }

Technically, a label looks like an expression statement initially, as a label begins with an identifier. The current parsing approach avoids backtracking by parsing an expression statement normally, and then noticing that (1) it consisted of a single identifier token, and (2) it is followed by a colon. No code is emitted by the expression parser for such a terminal single token expression (an intermediate value is generated, but it is not coerced to any code yet), so this works without emitting any invalid code.

Note that some labels cannot accept break or continue (e.g. a label for an expression statement), some can accept a break only (switch) while others can accept both (iteration statements: do-while, for, while). All the label names are registered while processing explicit labels, and an empty label is registered for an iteration/switch statement. When the final statement type is known, all labels in the set of labels are updated to indicate whether they accept break and/or continue.

Backtracking
------------

There is currently only a need to backtrack at the function level, to restart function compilation when moving from one parsing pass to the next. The "current function" state needs to be carefully reinitialized during this transition. More fine-grained backtracking is not needed right now, but would involve resetting:

* Emitted bytecode

* Highest used (temp) register

* Emitted constants and inner functions

* Active label set

Temporary register allocation
-----------------------------

Temporary registers are allocated as a strictly increasing sequence from a specified starting register. The "next temp" is reset back to a smaller value whenever we know that the higher temp values are no longer needed. This can be done safely because temporaries are always allocated with a strict stack discipline, and any fixed identifier-to-register bindings are below the initial temp reg.

The current expression parsing code does not always produce optimal register allocations.
It would be preferable for expression result values to be in as low register numbers as possible, which maximizes the amount of temporaries available for later expression code. This is currently done on a case-by-case basis as need arises. The backstop is at the statement level: after every statement is complete, the "next temp" can be reset to the same value it was before parsing the statement. However, it's beneficial to reset "next temp" to a smaller value whenever possible (inside expression parsing), to minimize function register count and avoid running out of temp registers. Unused temporary registers are not set to undefined, and are reachable for garbage collection. Unless they're overwritten by temporary values needed by another expression, they may cause a "silent leak". This is usually not a concrete concern because a function exit will always decref all such temporaries. This may be an issue for forever-running functions though. Register shuffling ------------------ The compiler needs to handle the case where it runs out of "easy access" registers or constants (usually 256 or 512 registers/constants). Either this needs to be handled correctly in one pass, or the compiler must fall back to a different strategy. Current solution is to use register shuffling through temporary registers. Shuffling is handled by the bytecode emitters. Pc2line debug data creation --------------------------- The "pc2line" debug data is a bit-packed format for converting a bytecode PC into an approximate source line number at run time. Although almost all of the bytecode is emitted in a linear fashion (appending to earlier code), some tricky structures insert bytecode instructions in the middle of already emitted bytecode. These insertions prohibit the emission of debug data in a streaming fashion during code emission. Instead, it needs to be implemented as a post-step. This unfortunately doubles the memory footprint of bytecode during compilation. The current solution is to keep track of (instruction, line number) pairs for each bytecode instruction during compile time. When the intermediate representation of the compiled function is converted to an actual run-time representation, this representation is converted into a plain opcode list and bit-packed pc2line data. There is currently some inaccuracy in the line numbers assigned to opcodes: the bytecode emitter associates the line number of the previous token because this matches how expression parsing consumes tokens. However, in some other call sites the relevant line number would be in the current token. Fixing this needs a bit more internal book-keeping. Peephole optimization --------------------- Currently a simple in-place peephole optimizer is applied at the end of function compilation to straighten out jumps. Consider for instance:: a: JUMP c -. b: | <--. JUMP d | -. | c: <--' | | JUMP b | -' d: <--' The peephole optimizer runs over the bytecode looking for JUMP-to-JUMP cases until the bytecode no longer changes. On the first peephole pass these jumps are straightened to:: a: JUMP b -. b: <--' JUMP d -. c: | JUMP d | -. d: <--' <-' (The JUMPs are modified in place, so some changes may be visible to later jumps on the same pass.) On the next pass this is further optimized to:: a: JUMP d -. b: | JUMP d | -. c: | | JUMP d | | -. d: <--' <--' <-' The peephole pass doesn't eliminate any instructions, but it makes some JUMP chains a bit faster. 

JUMP chains are generated by the current compiler in many cases, so this simple pass cheaply improves generated code slightly.

Avoiding C recursion
--------------------

C recursion happens in multiple ways. These should suffice to control it:

* Recursive expression parsing

* Recursive statement parsing (e.g. ``if`` statement parses another statement)

* Recursive function parsing (e.g. function expression or function declaration inside another function)

Recursion controls placed in these key functions should suffice to guarantee an upper limit on C recursion, although it is difficult to estimate how much stack is consumed before the limit is reached.

ES2015 constructs need an intermediate representation
-----------------------------------------------------

ES2015 constructs such as destructuring assignment will need an intermediate representation (or at least a much larger fragment of the expression tree) to compile in a reasonable manner.

Operator precedences (binding powers)
=====================================

Operator precedences (binding powers) are required by the expression parser for tokens acting as "operators" for ``led()`` calls. This includes tokens for binary operators (such as ``+`` and ``instanceof``). A higher binding power binds more strongly, e.g. ``*`` has a higher binding power than ``+``.

The binding power of operators can be determined from the syntax. Operators of different precedence are apparent from the production nesting level; outer productions have lower binding power. Operators at the same level have the same binding power if left-associative. Whether a production is left-associative can be determined from its form. For instance::

  AdditiveExpression:
      MultiplicativeExpression
      AdditiveExpression '+' MultiplicativeExpression
      AdditiveExpression '-' MultiplicativeExpression

Abbreviated::

  AE:
      ME
      AE '+' ME
      AE '-' ME

The expression ``1 + 2 + 3 + 4 * 5`` would be derived as (with parentheses for emphasizing order)::

  AE -> AE '+' ME
     -> (AE '+' ME) '+' ME
     -> ((AE '+' ME) '+' ME) '+' ME
     -> ((ME '+' ME) '+' ME) '+' ME
     -> ((1 '+' 2) '+' 3) '+' (4 '*' 5)

Right-associativity of operators at the same level can similarly be determined from their productions. For instance::

  AssignmentExpression:
      ConditionalExpression
      LeftHandSideExpression '=' AssignmentExpression
      LeftHandSideExpression AssignmentOperator AssignmentExpression

  AssignmentOperator:
      '*='
      (... others omitted)

Abbreviated::

  AE:
      CE
      LE '=' AE
      LE AO AE

  AO:
      '*='

The expression ``a = b = c *= 4`` would be produced as (using parentheses for emphasis)::

  AE -> LE '=' AE
     -> LE '=' (LE '=' AE)
     -> LE '=' (LE '=' (LE '*=' AE))
     -> LE '=' (LE '=' (LE '*=' CE))
     -> a '=' (b '=' (c '*=' 4))

Right associative productions are parsed by using a tweaked 'rbp' argument for the recursive expression parsing. For the example above:

* ``a`` is parsed with ``nud()`` and evaluates into a variable reference.

* The first ``=`` operator is parsed with ``led()``, which calls the expression parser recursively, with an 'rbp' argument which causes the recursive call to consume all further assignment operations.

What is a proper 'rbp' for the recursive ``led()`` call? It must be lower than the binding power of the ``=`` operator, but higher than or equal to the binding power of any operator which binds less tightly than ``=``. For example, if the binding power of ``=`` was 10, the 'rbp' used could be 9.
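
The following is a minimal JavaScript sketch of this mechanism (illustration only; the token and AST shapes are invented and do not correspond to compiler internals). With '+' given binding power 20 and '=' binding power 10, recursing with ``rbp = 10 - 1`` makes '=' parse right-associatively while '+' remains left-associative::

  function parse(tokens) {
      var pos = 0;
      function next() { return tokens[pos++]; }
      function peek() { return tokens[pos]; }
      function lbp(t) { return t === '+' ? 20 : (t === '=' ? 10 : 0); }

      function expr(rbp) {
          var left = next();                  // nud(): token used as a value
          while (pos < tokens.length && rbp < lbp(peek())) {
              var op = next();                // led(): token used as an operator
              // Right-associative '=' recurses with lbp - 1, so a following
              // '=' (lbp 10 > 9) is consumed by the recursive call instead.
              var sub = expr(op === '=' ? lbp(op) - 1 : lbp(op));
              left = [op, left, sub];
          }
          return left;
      }
      return expr(0);
  }

  parse(['a', '=', 'b', '=', 'c', '+', '1']);
  // -> ['=', 'a', ['=', 'b', ['+', 'c', '1']]]   (assignments nest to the right)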

The current compiler uses multiples of 2 for binding powers so that subtracting 1 from the binding power of an operator results in a binding power below the current operator but never equal to any other operator. Technically this is not necessary, because it's OK for the 'rbp' to be equal to a lower binding operator.

In addition to binary operators, binding powers need to be assigned to:

* Unary operators

* Some tokens which are not strictly operators. For example, ``(``, ``[``, and ``{`` which begin certain expressions (function calls, property accesses, and object literals).

Token precedences for ``lbp()``, from highest (most tightly bound) to lowest, are summarized in the list below. Operators of equal binding power are on the same line. The list is determined based on looking at the ``Expression`` production. Operators are left associative unless indicated otherwise:

* (IdentifierName, literals, ``this``, etc. Parsed by ``nud()`` and don't need binding powers.)

* ``.`` ``[`` (Note: MemberExpression parsed by ``led()``.)

* ``new`` (Note: unary expression parsed by ``nud()``. Right-associative.)

* ``(`` (Note: CallExpression parsed by ``led()``.)

* ``++`` ``--`` (Note: postfix expressions which are parsed by ``led()`` but which are "unary like". The expression always terminates in such a case.)

* ``delete`` ``void`` ``typeof`` ``++`` ``--`` ``+`` ``-`` ``~`` ``!`` (Note: unary expressions which are parsed by ``nud()`` and thus don't actually need a binding power. All of these are also right-associative. ``++`` and ``--`` are preincrement here; ``+`` and ``-`` are unary plus and minus.)

* ``*`` ``/`` ``%``

* ``+`` ``-``

* ``<<`` ``>>`` ``>>>``

* ``<`` ``>`` ``<=`` ``>=`` ``instanceof`` ``in``

* ``==`` ``!=`` ``===`` ``!==``

* ``&``

* ``^``

* ``|``

* ``&&``

* ``||``

* ``?`` (Note: starting a "a ? b : c" expression)

* ``=`` ``*=`` ``/=`` ``%=`` ``+=`` ``-=`` ``<<=`` ``>>=`` ``>>>=`` ``&=`` ``^=`` ``|=`` (Note: right associative.)

* ``,``

* ``)`` ``]`` (Note: when parsed with ``led()``; see below.)

* EOF (Note: when parsed with ``led()``; see below.)

The precedence list is clear starting from the lowest binding up to binary ``+`` and ``-``. Binding powers higher than that get a bit tricky because some of them are unary (parsed by ``nud()``) and some are parsed by ``led()`` but are not binary operators.

When parsing an expression beginning with ``(`` using ``nud()``, the remainder of the expression is parsed with a recursive call to the expression parser and an 'rbp' which guarantees that parsing stops at the closing ``)``. The ``rbp`` here must NOT stop at the comma operator (``,``) so technically ``)`` is considered to have a binding power lower than comma. The same applies to ``]``. Similarly, EOF is considered to have the lowest binding power of all. These have been appended to the list above.

Parsing RegExp literals
=======================

The Ecmascript lexer has two goal symbols for its lexical grammar: ``InputElementDiv`` and ``InputElementRegExp``. The former is used in all lexical contexts where a division (``/``) or a division-assignment (``/=``) is allowed; the latter is used elsewhere. The E5.1 specification does not really say anything else on the matter (see E5.1 Section 7, 2nd paragraph).

In the implementation of the compiler, the ``advance()`` set of helpers knows the current token, and consults a token table which indicates whether a regexp literal is prohibited after the current token. Thus, this detail is hidden from ordinary parsing code.
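
For example (an illustrative snippet), the interpretation of ``/`` depends purely on the token that precedes it::

  var x = 8, y = 2, z = 2;
  var a = x / y / z;   // two divisions: '/' follows identifier tokens
  var re = /y/g;       // RegExp literal: '/' follows '=', so a value is
                       // expected rather than a division
  // A RegExp is also allowed after 'return', e.g.:  return /y/g;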

The ``advance()`` helper knows the current token type and consults a token table which has a flag indicating whether or not a RegExp can ever follow that particular token.

Unfortunately parsing Identifier (which prohibits keywords) vs. IdentifierName (which allows them) is context sensitive. The current lexer handles this by providing a token type for both interpretations: ``t`` indicates the token type with reserved words being recognized (e.g. "return" yields token type DUK_TOK_RETURN) while ``t_nores`` indicates the token type ignoring reserved words (e.g. "return" yields token type DUK_TOK_IDENTIFIER).

``IdentifierName`` occurs only in::

  PropertyName -> IdentifierName                          (object literal)
  MemberExpression -> MemberExpression '.' IdentifierName
  CallExpression -> CallExpression '.' IdentifierName

Using ``t_nores`` for determining whether or not a RegExp is allowed does not work. For instance, the ``return`` statement allows a return value so a RegExp must be allowed to follow::

  return /foo/;

On the other hand, a RegExp cannot follow ``return`` here::

  t = foo.return/2;

Using ``t`` has the inverse problem; if DUK_TOK_RETURN allows a RegExp to follow, this parses correctly::

  return /foo/;

but this will fail::

  t = foo.return/2;

The IdentifierName cases require special handling:

* The ``PropertyName`` in an object literal is not really an issue. It cannot be followed by either a division or a RegExp literal.

* The ``MemberExpression`` case: a RegExp can never follow. A special one-time flag can be used to reject RegExp literals on the next ``advance()`` call.

* The ``CallExpression`` case: can be handled similarly.

Currently this special handling is implemented using the ``reject_regexp_in_adv`` flag in the current compiler function state. It is only set when handling ``DUK_TOK_PERIOD`` in ``expr_led()``, and is automatically cleared by the next ``advance()`` call.

See test case: ``test-dev-regexp-parse.js``.

Automatic semicolon insertion
=============================

Semicolons need to be automatically inserted at certain points of the token stream. Only the parser/compiler can handle automatic semicolon insertion, because automatic semicolons are only allowed in certain contexts. Only some statement types have a terminating semicolon and thus participate in automatic semicolon insertion.

Automatic semicolon insertion is implemented almost completely at the statement parsing level, the only exception being the handling of post-increment/decrement. After the longest valid statement (usually containing an expression) has been parsed, the statement is either terminated by an explicit semicolon or is followed by an offending token which permits automatic semicolon insertion. In other words, the offending token is preceded by a newline, or is either the EOF or the ``}`` token, whichever is appropriate for the statement list in question. The actual specification for "longest valid statement" is that an automatic semicolon can only be inserted if a parse error would otherwise occur.

Some statements also have grammar which prohibits automatic semicolon insertion in certain places, such as: ``return [no LineTerminator here] Expression;``. These need to be handled specially.

Some statements have a semicolon terminator while others do not. Automatic semicolons are naturally only processed for statements with a semicolon terminator.

The current implementation:

* The statement list parser parses statements.
* Individual statement type parsers need to have a capability of parsing until an offending token is encountered (either a semicolon, or some other unexpected token), and to indicate whether that specific statement type requires a semicolon terminator. * The general statement parsing wrapper then checks whether a semicolon termination is needed, and if so, whether an explicit semicolon or an automatically inserted semicolon terminates the statement. * Statements which prohibit line terminators in some cases have a special check in the parsing code for that statement type. If the token following the restriction has a "lineterm" flag set, the token is considered offending and the statement is terminated. For instance, "return\\n1;" is parsed as an empty return because the token ``"1"`` has a lineterm preceding it. The ``duk_token`` struct has a flag indicating whether the token was preceded by whitespace which included one or more line terminators. * Checking whether an automatic semicolon is allowed depends on a token which is potentially part of the next statement (the first token of the next statement). In the current implementation the statement parsing function is expected to "pull in" the token *following* the statement into the "current token" slot anyway, so the token can be examined for automatic semicolon insertion without backtracking. * Post-increment/decrement has a restriction on LineTerminator occurring between the preceding expression and the ``++``/``--`` token (note that pre-increment/decrement has no such restriction). This is currently handled by ``expr_lbp()`` which will return an artificially low binding power if a ``++``/``--`` occurs in a post-increment/decrement position (which is always the case if they're encountered on the ``expr_led()`` context) and the token was preceded by a line terminator. This effectively terminates the preceding expression, treating e.g. "a+b\\n++" as "a+b;++;" which causes a SyntaxError. There is a custom hack for an errata related to a statement like:: do{print('loop')}while(false)false Strictly speaking this is a syntax error, but is allowed by most implementations in the field. A specific hack is needed to handle this case. See ``test-stmt-dowhile-bug.js``. Implicit return value of global code and eval code ================================================== Global code and eval code have an "implicit return value" which comes from the last non-empty statement executed. Function code has no implicit return value. Statements returning a completion with type "empty" do not change the implicit return value. For instance: ``eval("1;;var x=2;")`` returns ``1`` because the empty statement and the ``var`` statement have an empty completion. This affects code generation, which is a bit different at the statement level for global/eval code and function code. When in a context requiring an implicit return value (eval code or global code), a register is allocated for the last non-empty statement value. When such a statement is parsed, its value is coerced to the allocated register. Other statements are coerced into a plain value (which is then ignored) which ensures all side effects have been generated (e.g. property access is generated for the expression statement ``x.y;``) without affecting the implicit return value. 
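
For example (an illustrative snippet): a ``var`` statement produces an empty value and thus doesn't change the implicit return value, but any side effects in its initializer must still be generated::

  var x = { get y() { print('getter called'); return 10; } };
  print(eval("var a = x.y; 1 + 2;"));   // prints 'getter called', then 3:
                                        // the getter runs, but the var
                                        // statement has an empty value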

Statement types generating an empty value directly:

* Empty statement (12.3)

* Debugger statement (12.15)

Statement types generating an empty value indirectly:

* Block statement (12.1): may generate an empty value indirectly if all statements inside the block are empty.

* ``if`` statement (12.5): may generate an empty value either if a clause has an empty value (e.g. ``eval("if (true) {} else {1}")`` returns ``undefined``) or a clause is missing (e.g. ``eval("if (false) {1}")`` returns ``undefined``).

* ``do-while``, ``while``, ``for``, ``for in`` statements (12.6): statement value is the value of the last non-empty statement executed within the loop body; may be empty if only empty statements or no statements are executed.

* ``continue`` and ``break`` statements (12.7, 12.8): have an empty value but ``continue`` and ``break`` are handled by their catching iteration statement, so they are a bit special.

* ``with`` statement (12.10): like block statements

* ``switch`` statement (12.11): return value is the value of the last non-empty statement executed (in whichever clause).

* Labelled statement (12.12): returns whatever the statement following it returns.

* ``try`` statement (12.14): return value is the value of the last non-empty statement executed in try and/or catch blocks.

Some examples:

+--------------------+-------------+--------------------------------------------------------------------------+
| Eval argument      | Eval result | Notes                                                                    |
+====================+=============+==========================================================================+
| "1+2;"             | 3           | Normal case, expression statement generates implicit return value.      |
+--------------------+-------------+--------------------------------------------------------------------------+
| "1+2;;"            | 3           | An empty statement generates an empty value.                             |
+--------------------+-------------+--------------------------------------------------------------------------+
| "1+2; var a;"      | 3           | A variable declaration generates an empty value.                         |
+--------------------+-------------+--------------------------------------------------------------------------+
| "1+2; var a=5;"    | 3           | A variable declaration, even with assignment, generates an empty value. |
+--------------------+-------------+--------------------------------------------------------------------------+
| "1+2; a=5;"        | 5           | A normal assignment generates a value.                                   |
+--------------------+-------------+--------------------------------------------------------------------------+

Tail call detection and handling
================================

A tail call can be used when:

1. the value of a CALL would become the argument for an explicit ``return`` statement or an implicit return value (for global or eval code); and

2. there are no active TCF catchers between the return and the function entrypoint.

A trivial example is::

  function f(x) {
      return f(x+1);
  }

The generated code would look something like::

  CSREG r0, c0     ; c0 = 'f'
  GETVAR r1, c1    ; c1 = 'x'
  ADD r0, r1, c2   ; c2 = 1
  CALL r0, 2       ; TAILCALL flag not set
  RETURN r0        ;

This could be emitted as a tail call instead::

  CSREG r0, c0     ; c0 = 'f'
  GETVAR r1, c1    ; c1 = 'x'
  ADD r0, r1, c2   ; c2 = 1
  CALL r0, 2       ; TAILCALL flag set
  RETURN r0        ; kept in case tail call isn't allowed at run time

There are more complex cases, like::

  function f(x) {
      return (g(x) ? f(x+1) : f(x-1));
  }

Here, just before executing a RETURN, both paths of execution end up with a function call. Both calls can be converted to tail calls.
The following is not a candidate for a tail call because of a catcher:: function f(x) { try { return f(x+1); } finally { print('cleaning up...'); } } Detecting anything other than the very basic case is probably not worth the complexity, especially because E5.1 does not require efficient tail calls at all (in fact, as of this writing, neither V8 nor Rhino support tail calls). ES2015 *does* require tail calls and provides specific guarantees for them. Adding support for ES2015 tail calls will require compiler changes. The current approach is very simplistic and only detects the most common cases. First, it is only applied to compiling function code, not global or eval code, which restricts consideration to explicit ``return`` statements only. When parsing a ``return`` statement: * First request the expression parser to parse the expression for the return value normally. * If the last bytecode instruction generated by the expression parser is a CALL whose value would then become the RETURN argument and there is nothing preventing a tail call (such as TCF catchers), convert the last CALL to a tail call. (There are a few more details to this; see ``duk_js_compiler.c`` for comments.) * The RETURN opcode is kept in case the tail call is not allowed at run time. This is possible e.g. if the call target is a native function (which are never tail called) or has a ``"use duk notail"`` directive. * Note that active label sites are not a barrier to tail calls; they are unwound by the tail call logic. See ``test-dev-tail-recursion.js``. Parsing CallExpression / NewExpression / MemberExpression ========================================================= The grammar for ``CallExpression``, ``NewExpression``, and ``MemberExpression`` is interesting; they're not in a strict binding power sequence. Instead, there is a branch, starting from LeftHandSideExpression:: LeftHandSideExpression | .--. | v | | .---> NewExpression ----. | | | `--+ +---> MemberExpression | | `---> CallExpression ---' ^ | `--' Both NewExpression and CallExpression contain productions containing themselves and MemberExpressions. However, a NewExpression never produces a CallExpression and vice versa. This is unfortunately difficult to parse. For instance, both productions (CallExpression and NewExpression) may begin with a 'new' token, so without lookahead we don't know which we're parsing. Consider the two productions:: Production 1: LeftHandSideExpression -> NewExpression -> 'new' MemberExpression -> 'new' 'Foo' Production 2: LeftHandSideExpression -> CallExpression -> MemberExpression Arguments -> 'new' 'Foo' '(' ')' These two are syntactically different but semantically identical: they both cause a constructor call with no arguments. However, they derive through different productions. Miscellaneous notes: * A NewExpression is the only production capable of generating "unbalanced" 'new' tokens, i.e. 'new' tokens without an argument list. A NewExpression essentially generates 0...N 'new' tokens before generating a MemberExpression. * A MemberExpression can generate a "'new' MemberExpression Arguments" production. These can nest, generating e.g. "new new Foo () ()" which parses as "(new (new Foo ()) ())". * If a LeftHandSideExpression generates a NewExpression, it is no longer possible to generate more argument lists (open and close parenthesis) than there are 'new' tokens. However, it is possible to generate more 'new' tokens than argument lists. 
* If a LeftHandSideExpression generates a CallExpression, it is no longer
  possible to generate 'new' tokens without an argument list
  (MemberExpression only allows 'new' with an argument list).  However, it is
  possible to generate more argument lists than 'new' tokens; any argument
  lists not matching a 'new' token are for function calls generated by
  CallExpression.  For instance (with angle brackets used for illustration)::

    new new Foo () () ()  ==  <new <new Foo ()> ()> ()

  where the last parenthesis pair is for a function call.

* Parentheses match innermost 'new' expressions generated by
  MemberExpression, innermost first.  There can then be either additional
  'new' tokens on the left or additional argument lists on the right, but not
  both.  Any additional 'new' tokens on the left are generated by
  NewExpression.  Any additional argument lists on the right are generated by
  CallExpression.  For instance::

    new new new new Foo () ()

  parses as (with angle brackets used for illustration)::

    new new <new <new Foo ()> ()>

  i.e. the two argument lists match the two innermost 'new' tokens and the
  two outermost 'new' tokens have no argument lists, whereas::

    new new Foo () () () ()

  parses as (with angle brackets used for illustration)::

    <<<new <new Foo ()> ()> ()> ()>

  where the two innermost argument lists are constructor calls matching the
  two 'new' tokens, and the two outermost argument lists are function calls.

Current parsing approach:

* When a 'new' token is encountered by ``nud()``, eat the 'new' token.

* Parse a MemberExpression to get the call target.  This expression parsing
  must terminate if a left parenthesis '(' is encountered.  The expression
  parsing must not terminate if a property access is encountered (i.e. the
  ``.`` or ``[`` token in ``led()``).  This is achieved by a suitable binding
  power given to the expression parser.

* Finally, look ahead to see whether the next token is a left parenthesis
  ('(').  If so, the 'new' token has an argument list; parse the argument
  list.  If the next token is not a left parenthesis, the 'new' expression is
  complete, and ``nud()`` can return.

* There are many tests in ``test-dev-new.js`` which attempt to cover the
  different cases.

Compiling "try-catch-finally" statements
========================================

Compiling the try-catch-finally statement is not very complicated.  However,
what happens during execution is relatively complex:

* The catch stack is involved with a "TCF catcher".

* A new declarative environment record, containing the "catch variable", may
  need to be used during the catch part.

The execution control flow is described in ``execution.rst``.

The catch variable has a local scope ("let" scope) which differs from the
way variables are normally declared -- they are usually "hoisted" to the top
level of the function.  Implementing the local scope in the general case
requires the creation of a declarative lexical environment which only maps
the catch variable and uses the previous lexical environment as its parent.
This has the effect of temporarily "masking" a variable of the same name,
e.g.::

  var e = "foo";
  print(e);
  try {
      throw new Error("error");
  } catch (e) {
      print(e);
  }
  print(e);

prints::

  foo
  Error: error
  foo

We would like to avoid emitting code for creating and tearing down such an
environment, as it is very often not needed at all.  Instead, the error
caught can be bound to a register (only) at compile time.  To do so, the
compiler would need to record some information about the contents of the
catch clause in pass 1, so that in pass 2 it would know whether the
environment record is needed and could emit the related opcodes only when
necessary.
(The "statement number" would be enough to identify the statement on the
second pass.)

The current compiler does not have the necessary intelligence to avoid
creating a lexical environment, so the environment is currently always
established when the catch clause activates.  There is a small footprint
impact in having the declarative environment established for the duration of
the catch clause.  The TRYCATCH instruction's flags indicate that the
environment is needed, and the variable name is supplied through a constant.

There is a run-time penalty for this: (1) the lexical environment and its
associated book-keeping must be established, and (2) accesses to the variable
within the catch clause happen through the slow path primitives (GETVAR,
PUTVAR, etc).  The latter is a limitation in the current lexical environment
model, where an identifier is either bound as a normal property of the
lexical environment object, or is bound to a *function-wide* register.
(This will need to change anyway for ES2015 where "let" statements are
supported.)

Compiling "with" statements
===========================

A ``with`` statement requires that an object environment record is
established on entry, and cleaned up on exit.

There is no separate catch stack entry for handling ``with`` statements.
Instead, the "TCF" catcher (which implements try-catch-finally) has enough
functionality to implement the semantics of the ``with`` statement, including
the automatic handling of the object environment record.  For example::

  with (A) B

generates code::

  (code for A, loading result to rX)
  TRYCATCH reg_catch=rN var_name=none with_object=rX
           have_catch=false have_finally=false
           catch_binding=false with_binding=true
  INVALID
  JUMP done
  (code for B)
  ENDTRY
  done:

Note that neither a "catch" nor a "finally" part is needed: all the cleanup
happens either when the catcher is unwound by an error, or at ENDTRY (which
of course performs an unwind).

Compiling "for"/"for-in" statements
===================================

Four variants
-------------

Parsing a for/for-in statement is a bit complicated because there are four
variants which need different code generation:

1. for (ExpressionNoIn_opt; Expression_opt; Expression_opt) Statement

2. for (var VariableDeclarationListNoIn; Expression_opt; Expression_opt) Statement

3. for (LeftHandSideExpression in Expression) Statement

4. for (var VariableDeclarationNoIn in Expression) Statement

Distinguishing the variants from each other is not easy without
back-tracking.  If back-tracking is avoided, any code generated before the
variant is determined needs to be valid for all potential variants being
considered.  Also, no SyntaxErrors can be thrown in cases where one variant
would parse correctly.

There are also tricky control flow issues related to each variant.  Because
code is generated while parsing, control flow often needs to be implemented
rather awkwardly.

Note that the ``in`` token serves two independent roles in Ecmascript: (1) as
a membership test in ``"foo" in y`` and (2) as part of the for-in iterator
syntax.  These two uses have entirely different semantics and compile to
entirely different code.

Semantics notes on variant 1
----------------------------

Nothing special.

Semantics notes on variant 2
----------------------------

Like all Ecmascript variable declarations, the declaration is "hoisted" to
the top of the function while a possible initializer assignment only happens
when the related code is executed.

There can be multiple variable declarations in variant 2, but only one in
variant 4.
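As an illustration of hoisting in variant 2 (a sketch using ``print()`` as in
the other examples in this document): the declarations are visible before the
loop, while the initializer assignments only take effect when the loop header
is executed::

  function f() {
      print(typeof i, typeof j);   // "undefined undefined": declarations are hoisted
      for (var i = 0, j = 3; i < j; i++) {
          // 'i' and 'j' are ordinary function-wide variables here
      }
      print(i, j);                 // "3 3": the values persist after the loop
  }
  f();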
Semantics notes on variant 3 ---------------------------- Variants 1 and 3 cannot be trivially distinguished by looking ahead a fixed number of tokens, which seems counterintuitive at first. This is the case because a LeftHandSideExpression production in E5.1 allows for e.g. function calls, 'new' expressions, and parenthesized arbitrary expressions. Although pure E5.1 functions cannot return left-hand-side values, native functions are allowed to do so if the implementation wishes to support it. Hence the syntax supports such cases, e.g.:: for (new Foo().bar() in quux) { ... } This MUST NOT cause a SyntaxError during parsing, but rather a ReferenceError at runtime. A valid left-hand-side expression (such as an identifier) may also be wrapped in one or more parentheses (i.e., an arbitrary number of tokens):: for ( (((i))) in [ 'foo', 'bar' ] ) { } print(i); // -> prints 1 The comma expression semantics requires that every comma expression part is coerced with ``GetValue()``, hence a comma expression is *not* normally a valid left-hand-side expression:: for ( ("foo", i) in [ 'foo', 'bar' ] ) { } // -> ReferenceError (not a SyntaxError, though) Again, if a native function is allowed to return a Reference, a comma expression could be a valid left-hand-side expression, but we don't support that. A valid left-hand-side expression may also involve multiple property reference steps with side effects. The E5.1 specification allows some leeway in implementing such expressions. Consider, e.g.:: y = { "z": null }; x = { get y() { print("getter"); return y; } } for (x.y.z in [0,1]) {} Such an expression may (apparently) print "getter" either once or multiple times: see E5.1 Section 12.6.4, step 6.b which states that the left-hand-side expression "may be evaluated repeatedly". This probably also implies that "getter" can also be printed zero times, if the loop body is executed zero times. At least V8 and Rhino both print "getter" two times for the example above, indicating that the full code for the left-hand-side expression (if it requires any code emission beyond a property/variable assignment) is evaluated on every loop. Another example of the evaluation order for a "for-in" statement:: function f() { throw new Error("me first"); } for ("foo" in f()) {} The code must throw the "me first" Error before the ReferenceError related to an invalid left-hand-side. A valid left-hand-side expression must ultimately be either a variable or a property reference. Because we don't allow functions to return references, any left-hand-side expression involving a function call or a 'new' expression should cause a ReferenceError (but not a compile time SyntaxError). In fact, the only acceptable productions for LeftHandSideExpression are:: LeftHandSideExpression -> NewExpression NewExpression -> MemberExpression MemberExpression -> MemberExpression [ Expression ] | MemberExpression . Expression | PrimaryExpression PrimaryExpression -> this | Identifier | ( Expression ) Actual implementations seem to vary with respect to checking the syntax validity of the LeftHandSideExpression. For instance, V8 accepts an Expression which is not necessarily a valid LeftHandSideExpression without throwing a SyntaxError, but then throws a ReferenceError at run time:: > function f() { for (a+b in [0,1]) {} } undefined > f() ReferenceError: Invalid left-hand side in for-in This is technically incorrect. Rhino gives a SyntaxError:: js> function f() { for (a+b in [0,1]) {} } js: line 1: Invalid left-hand side of for..in loop. 
So, a passable loose implementation is to parse the LeftHandSideExpression as
just a normal expression, and then check the final intermediate value.  If it
is a property or variable reference, generate the respective iteration code.
Otherwise generate a fixed ReferenceError throw.

Semantics notes on variant 4
----------------------------

There can be only one declared variable.  However, the variable may have an
initializer::

  for (var i = 8 in [ 0, 1 ]) { ... }

The initializer cannot be safely omitted.  There may be side effects and the
initialized value *can* be accessed in some cases, e.g.::

  function f() {
      function g() { print(i); return [0,1] };
      for (var i = 8 in g()) {
          print(i);
      }
  }
  f();  // -> prints 8, 0, 1

Control flow for variant 1
--------------------------

Control flow for ``for (A; B; C) D``::

  LABEL N
  JUMP L4       ; break
  JUMP L2       ; continue
  (code for A)
  L1:
  (code for B)
  (if ToBoolean(B) is false, jump to L4)
  JUMP L3
  L2:
  (code for C)
  JUMP L1
  L3:
  (code for D)
  JUMP L2
  L4:           ; finished

If A is an empty expression, no code is emitted for it.  If B is an empty
expression, it is considered "true" for loop termination (i.e. don't
terminate the loop) and can be omitted ("JUMP L3" will occur at L1).  If C is
empty it can be omitted ("JUMP L1" will occur at L2); more optimally, the
"JUMP L2" after L3 can be changed to a direct "JUMP L1".

Control flow for variant 2
--------------------------

Control flow for variant 2 is the same as for variant 1: "code for A" is
replaced by the variable list assignment code for one or more variables.

Control flow for variant 3
--------------------------

Control flow for ``for (A in C) D``::

  ; Allocate Rx as temporary register for loop value
  ; Allocate Re as enumerator register

  JUMP L2
  L1:
  (code for A)
  (assign Rx to the variable/property of the left-hand-side expression A)
  JUMP L3
  L2:
  (code for C)
  (initialize enumerator for value of C into Re)
  JUMP L4
  L3:
  (code for D)
  L4:
  (if enumerator Re is finished, JUMP to L5)
  (else load next enumerated value to Rx)
  JUMP L1
  L5:           ; finished

Control flow for variant 4
--------------------------

Control flow for ``for (var A = B in C) D`` is similar to that of variant 3.
If the variable declaration has an initializer (B), it needs to be evaluated
before the enumerator target expression (C) is evaluated::

  ; Allocate Rx as temporary register for loop value
  ; Allocate Re as enumerator register

  (code for B)
  (code for assigning result of B to variable A)
  JUMP L2
  L1:
  (assign Rx to the variable A)
  JUMP L3
  L2:
  (code for C)
  (initialize enumerator for value of C into Re)
  JUMP L4
  L3:
  (code for D)
  L4:
  (if enumerator Re is finished, JUMP to L5)
  (else load next enumerated value to Rx)
  JUMP L1
  L5:           ; finished

Compiling without backtracking
------------------------------

The first token after the left parenthesis determines whether we're parsing
variant 1/3 or variant 2/4: a ``var`` token can never begin an expression.

Parsing variant 2/4 without backtracking:

* Parse ``var``

* Parse identifier name

* Check whether the next token is the equal sign; if so:

  - Parse equal sign

  - Parse assignment value expression as AssignmentExpressionNoIn: terminate
    parsing if ``in`` is encountered, and use the "rbp" argument to start
    parsing at the "AssignmentExpression" binding power level

* If the next token is ``in``, we're dealing with variant 4:

  - The code emitted for the variable assignment is proper for variant 4

  - The variable identifier should be used for the loop iteration

* Else we're dealing with variant 2.
  - The code emitted for the variable assignment is proper for variant 2

  - There may be further variable declarations in the declaration list.

Parsing variant 1/3 without backtracking is a bit more complicated.  An
important observation is that:

* The first expression (ExpressionNoIn_opt) before the semicolon in variant 1
  cannot contain a top-level ``in`` token

* The expression (LeftHandSideExpression) before ``in`` also cannot contain a
  top-level ``in`` token

This observation allows the following compilation strategy:

* Parse an Expression, prohibiting a top-level ``in`` token and keeping track
  of whether the expression conforms to LeftHandSideExpression.  Any code
  generated during this parsing is correct for both variant 1 and variant 3.

* After Expression parsing, check the next token; if the next token is an
  ``in``, parse the remainder of the statement as variant 3.

* Else, if the next token is a semicolon, parse the remainder of the
  statement as variant 1.

* Else, SyntaxError.

Note that if the E5.1 syntax allowed a top-level ``in`` for variant 1, this
approach would not work.

Compiling "do-while" statements
===============================

There is a bug filed at:

* https://bugs.ecmascript.org/show_bug.cgi?id=8

The bug is about the expression::

  do{;}while(false)false

which is prohibited in the specification but allowed in actual
implementations.  The syntax error is that a ``do`` statement is supposed to
be followed by a semicolon, and since there is no newline following the right
parenthesis, an automatic semicolon should not be allowed.

The workaround in the current implementation is a special flag for automatic
semicolon insertion (ALLOW_AUTO_SEMI_ALWAYS).  If the flag is set, automatic
semicolon insertion is allowed even when no line terminator is present before
the next token.

Compiling "switch" statements
=============================

Compiling switch statements is not complicated as such, but the switch
statement has somewhat tricky control flow.  Essentially there are two
control paths: the "search" code path which looks for the first matching case
(or the default case), and the "case" code path which executes the case
statements starting from the first match, falling through where appropriate.

The code generated for this matching model is quite heavy in JUMPs.  It would
be preferable to structure the code differently, e.g. first emit all checks,
and then emit all statement code.  Intermediate jumps would not be required
at least in the statement code in this case.  However, this would require
either additional parsing passes or the construction of an intermediate
representation, which the current model explicitly avoids.

The algorithm in E5.1 Section 12.11 seems to contain some ambiguity, e.g. for
a switch statement with a default clause, what B statements are iterated in
step 9 in each case?  The intent seems clear even though the text is not.
See:

* https://bugs.ecmascript.org/show_bug.cgi?id=345

See ``test-dev-switch*.js``.

Sometimes switch-case statements are used with a large number of integer case
values.  For example, a processor simulator would commonly have such a switch
for decoding opcodes::

  switch (opcode) {
      case 0: /* ... */
      case 1: /* ... */
      case 2: /* ... */
      /* ... */
      case 255: /* ... */
  }

It would be nice to detect such structures and handle them using some sort of
switch-value-indexed jump table.  Doing so would need more state than is
currently available to the compiler, so switch-case statements like this
generate quite suboptimal bytecode at present.  This is definite future work.
Compiling "break"/"continue" (fast and slow) ============================================ A "fast" break/continue jumps directly to the appropriate jump slot of the matching LABEL instruction. The jump slot then jumps to the correct place; in case of BREAK, the jump slot jumps directly to ENDLABEL. The peephole optimizer then optimizes the extra jump, creating a direct jump to the desired location. A "fast" break/continue cannot cross a TCF catcher (i.e. a 'try' statement or a 'with' statement), and the matching label must be the innermost label (otherwise a LABEL catcher would be bypassed). A "slow" break/continue uses a ``longjmp()`` and falls back to the generic, always correct longjmp handler. Compiling "return" ================== Compiling a ``return`` statement is mostly trivial, but tail calls pose some interesting problems. If the return value is generated by a preceding ``CALL`` opcode, the call can be flagged a tail call. The ``RETURN`` opcode is still emitted just in case, if there's some feature preventing the tail call from happening at run time -- for example, the call target may be a native function (which are never tail called) or have a ``use duk notail`` directive which prevents tail calling the function. Compiling "throw" statements ============================ A ``throw`` is never "fast"; we always use the longjmp handler to process them. Compiling logical expressions ============================= Ecmascript has three logical operators: binary operators ``&&`` and ``||``, and a unary operator ``!``. The unary logical NOT operator coerces its argument to a boolean value and negates the result (E5.1 Section 11.4.9). The binary AND and OR operator employ ordered, short circuit evaluation semantics, and the result of a binary operation is one of its arguments, which is **not** coerced to a boolean value (E5.1 Section 11.11). The Ecmascript ``ToBoolean()`` specification function is used to coerce values into booleans (E5.1 Section 9.2) for comparison purposes. The following values are coerced to ``false``: ``undefined``, ``null``, ``false``, ``+0``, ``-0``, ``NaN``, ``""``. All other values are coerced to ``true``. Note that the ``ToBoolean`` operation is side-effect free, and cannot throw an error. Evaluation ordering and short circuiting example using Rhino:: js> function f(x,y) { print("f called for:", y); return x; } js> function g(x,y) { print("g called for:", y); throw new Error("" + x); } js> js> // Illustration of short circuit evaluation and evaluation order js> // (0/0 results in NaN) js> var a = f(1,"first (t)") && f(0,"second (f)") || f(0/0,"third (f)") && g(0,"fourth (err)"); f called for: first (t) f called for: second (f) f called for: third (f) js> print(a); NaN The first expression is evaluated, coerced to boolean, and since it coerces to ``true``, move on to evaluate the second expression. That coerces to ``false``, so the first AND expression returns the number value ``0``, i.e. the value of the second expression (which coerced to ``false`` for comparison). Because the first part of the OR coerces to ``false``, the second part is evaluated starting from the third expression (``NaN``). Since ``NaN`` coerces to ``false``, the fourth expression is never evaluated. The result of the latter AND expression is ``NaN``, which also becomes the final value of the outer OR expression. Code generation must respect the ordering and short circuiting semantics of Ecmascript boolean expressions. 
In particular, short circuiting means that binary logical operations are not
simply operations on values, but must rather be control flow instructions.
Code generation must emit "skip jumps" when generating expression code, and
these jumps must be back-patched later.  It would be nice to generate a
minimal number of jumps (e.g. when an AND expression is contained by a
logical NOT).

Logical expressions can be used in deciding the control flow path in a
control flow statement such as ``if`` or ``do-while``, but the expression
result can also be used and e.g. assigned to a variable.  For optimal code
generation the context where a logical expression occurs matters; for
example, often we don't need the final evaluation result but only its
"truthiness".  The current compiler doesn't take advantage of this potential
because there's not enough state information to do so.

Let's look at the code generation issues for the following::

  if (!((A && B) || (C && D && E) || F)) {
      print("true");
  } else {
      print("false");
  }

One code sequence for this would be::

  start:
        (t0 <- evaluate A)
        IF t0, 1            ; skip if (coerces to) true
        JUMP skip_and1      ; AND is done, result in t0 (= A)
        (t0 <- evaluate B)
        IF t0, 1            ; skip if (coerces to) true
        JUMP skip_and1      ; AND is done, result in t0 (= B)
        ; first AND evaluates to true, result in t0 (= B)
        JUMP do_lnot
  skip_and1:
        (t0 <- evaluate C)
        IF t0, 1
        JUMP skip_and2
        (t0 <- evaluate D)
        IF t0, 1
        JUMP skip_and2
        (t0 <- evaluate E)
        IF t0, 1
        JUMP skip_and2
        ; second AND evaluates to true, result in t0 (= E)
        JUMP do_lnot
  skip_and2:
        (t0 <- evaluate F)
        IF t0, 1
        JUMP skip_and3
        ; third AND evaluates to true, result in t0 (= F)
        JUMP do_lnot
  skip_and3:
        ; the OR sequence resulted in a value (in t0) which
        ; coerces to false.
        ; fall through to do_lnot
  do_lnot:
        ; the AND/OR part is done, with result in t0.  Note that
        ; all code paths must provide the result value in the same
        ; temporary register.
        LNOT t0, t0         ; coerce and negate
        IF t0, 1            ; skip if true
        JUMP false_path
  true_path:
        (code for print("true"))
        JUMP done
  false_path:
        (code for print("false"))
        ; fall through
  done:
        ; "if" is done

Because the result of the logical NOT is not actually needed, other than to
decide which branch of the if statement to execute, some extra jumps can be
eliminated::

  start:
        (t0 <- evaluate A)
        IF t0, 1            ; skip if (coerces to) true
        JUMP skip_and1      ; AND is done, result in t0 (= A)
        (t0 <- evaluate B)
        IF t0, 1            ; skip if (coerces to) true
        JUMP skip_and1      ; AND is done, result in t0 (= B)
        JUMP false_path
  skip_and1:
        (t0 <- evaluate C)
        IF t0, 1
        JUMP skip_and2
        (t0 <- evaluate D)
        IF t0, 1
        JUMP skip_and2
        (t0 <- evaluate E)
        IF t0, 1
        JUMP skip_and2
        JUMP false_path
  skip_and2:
        (t0 <- evaluate F)
        IF t0, 1
        JUMP skip_and3
        JUMP false_path
  skip_and3:
        ; the expression inside LNOT evaluated to false, so LNOT would
        ; yield true, and we fall through to the true path
  true_path:
        (code for print("true"))
        JUMP done
  false_path:
        (code for print("false"))
        ; fall through
  done:
        ; "if" is done

Which can be further refined to::

  start:
        (t0 <- evaluate A)
        IF t0, 1            ; skip if (coerces to) true
        JUMP skip_and1      ; AND is done, result in t0 (= A)
        (t0 <- evaluate B)
        IF t0, 0            ; skip if (coerces to) false (-> skip_and1)
        JUMP false_path
  skip_and1:
        (t0 <- evaluate C)
        IF t0, 1
        JUMP skip_and2
        (t0 <- evaluate D)
        IF t0, 1
        JUMP skip_and2
        (t0 <- evaluate E)
        IF t0, 0            ; -> skip_and2
        JUMP false_path
  skip_and2:
        (t0 <- evaluate F)
        IF t0, 0            ; -> skip_and3
        JUMP false_path
  skip_and3:
        ; the expression inside LNOT evaluated to false, so LNOT would
        ; yield true, and we fall through to the true path
  true_path:
        (code for print("true"))
        JUMP done
  false_path:
        (code for print("false"))
        ; fall through
  done:
        ; "if" is done

The current compilation model for logical AND and OR is quite simple.  It
avoids the need for explicit back-patching (all back-patching state is kept
in the C stack), and allows generation of code on-the-fly.  Although logical
AND and OR expressions are syntactically *left-associative*, they are parsed
and evaluated in a *right-associative* manner.  For instance, ``A && B && C``
is evaluated as ``A && (B && C)``, which allows the code handling the first
logical AND to generate the code for the latter part ``B && C`` recursively,
and then back-patch a skip jump over the entire latter part (= short
circuiting the evaluation).

Unnecessary jumps are still generated at the boundaries of AND and OR
expressions (e.g. in ``A && B || C && D``).  These jumps are usually
"straightened out" by the final peephole pass, possibly leaving unneeded
instructions in bytecode, but generating more or less optimal run-time jumps.

Note that there are no opcodes for logical AND and logical OR.  They would
not be useful because short-circuit evaluation requires them to be control
flow instructions rather than logical ones.

Compiling function calls; direct eval
=====================================

Ecmascript E5.1 handles **direct** ``eval`` calls differently from other
``eval`` calls.  For instance, direct ``eval`` calls may declare new
variables in the calling lexical scope, while variable declarations in
non-direct ``eval`` calls will go into the global object.
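The difference can be illustrated with the following sketch (non-strict code
assumed; ``print()`` as in the other examples in this document)::

  var where = "global";

  function f() {
      var where = "local";
      eval("var a = where;");      // direct eval: runs in f's scope, declares 'a' there
      var indirect = eval;
      indirect("var b = where;");  // indirect eval: runs in global scope, declares global 'b'
      print(a, b);                 // prints: local global
  }

  f();
  print(typeof a, typeof b);       // prints: undefined string ('a' was local to f())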
See: * E5.1 Section 10.4.2: Entering Eval Code * E5.1 Section 15.1.2.1.1: Direct Call to Eval E5.1 Section 15.1.2.1.1 states that: A direct call to the eval function is one that is expressed as a CallExpression that meets the following two conditions: The Reference that is the result of evaluating the MemberExpression in the CallExpression has an environment record as its base value and its reference name is "eval". The result of calling the abstract operation GetValue with that Reference as the argument is the standard built-in function defined in 15.1.2.1. Note that it is *not* required that the binding be actually found in the global object, a local variable with the name ``eval`` and with the standard built-in ``eval()`` function as its value is also a direct eval call. Direct ``eval`` calls cannot be fully detected at compile time, as we cannot always know the contents of the environment records outside the current function. The situation can even change at run time. See ``test-dev-direct-eval.js`` for an illustration using an intercepting ``with`` environment. On the other hand, partial information can be deduced; in particular: * If a function never performs a function call with the identifier name ``eval``, we *can* be sure that there are no direct eval calls, as the condition for the identifier name is never fulfilled. The current approach is quite conservative, favoring correctness and simple compilation over performing complicated analysis. The current approach to handle a function call made using the identifier ``eval`` as follows: * Flag the function as "tainted" by eval, which turns off most function optimizations to ensure semantic correctness. For example, the varmap is needed and the ``arguments`` object must be created on function entry in case eval code accesses it. * Call setup is made normally, it doesn't matter whether ``eval`` is bound to a register or accessed using ``GETVAR``. It is perfectly fine for a direct eval to happen through a local variable. * Set the ``DUK_BC_CALL_FLAG_EVALCALL`` flag for the CALL bytecode instruction to indicate that the call was made using the identifier ``"eval"``. Then at run time: * ``CALL`` handler notices that ``DUK_BC_CALL_FLAG_EVALCALL`` is set. It then checks if the target function is the built-in eval function, and if so, triggers direct eval behavior. Identifier-to-register bindings =============================== Varmap, fast path and slow path ------------------------------- Identifiers local to a function are (1) arguments, (2) variables, (3) function declarations, and (4) dynamic bindings like "catch" or "let" bindings. Local identifiers are handled in one of two ways: * An identifier can be bound to a fixed register in the value stack frame allocated to the function. For example, an identifier named ``"foo"`` might be bound to register 7 (R7). This is possible when the identifier is known at compile time, a suitable register is available, and when the identifier binding is not deletable (which is usually, but not always, the case). * An identifier can be always accessed explicitly by name, and its value will be stored in an explicit environment record object. This is possible in all cases, including dynamically established and non-deletable bindings. Only function code identifiers can be register mapped. For global code declarations are mapped to the global object (an "object binding"). 
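For instance, the following sketch illustrates the object binding for global
code (``print()`` and a global ``this`` binding assumed, as in the other
examples in this document)::

  // Global code: the declaration becomes a non-deletable property of the
  // global object rather than a register mapped binding.
  var topLevel = 123;
  print(this.topLevel);         // 123
  print(delete this.topLevel);  // false, the binding is not deletable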
For non-strict eval code the situation is a bit different: a variable declaration inside a direct eval call will declare new variable to the *containing scope*. Such bindings are also deletable whereas local declarations in a function are not. An example of a function and identifier binding:: function f(x, y) { // Arguments 'x' and 'y' can be mapped to registers R0 and R1. // Local variable can be mapped to register R2. var a = 123; // Dynamically declared variable is created in an explicit environment // record and is not register mapped. eval('var b = 321'); } When the compiler encounters an identifier access in the local function it looks through the variable map ("varmap") which records identifier names and their associated registers. If the identifier is found in the varmap, it is safe to access the identifier with a direct register reference which is called a "fast path" access. This is safe because only non-deletable bindings are register mapped, so there's no way that the binding would later be removed e.g. by uncontrolled eval() calls. There's also nothing that could come in the way to capture the reference. For example, the Ecmascript statement:: a += 1; could be compiled to the following when "a" is in the varmap and mapped to R2:: INC R2 When the identifier is not in the varmap, the compiler uses the "slow path" which means addressing identifers by name. For example, the Ecmascript statement:: b += 1; could be compiled to the following when "b" is *not* in the varmap:: ; c3 = 'b' ; r4 = temp reg GETVAR r4, c3 ; read 'b' to r4 INC r4 PUTVAR r4, c3 ; write r4 to 'b' The GETVAR and PUTVAR opcodes (and other slow path opcodes) are handled by the executor by looking up the variable name through explicit environment record objects, which is more or less equivalent to a property lookup through an object's prototype chain. The slow path is available at any time for looking up any identifier, including a register mapped one. When a function call exits, the executor copies any register mapped values from the value stack frame into an environment record object so that any inner functions which are still active can continue to access values held by the outer function. An example of inner functions accessing a "closed" outer function:: function outer(val) { var foo = 'bar'; return function inner() { print(val); print(foo); } } // Once outer() returns, 'fn' refers to a function which can still see // into the variables held in outer(). var fn = outer(123); fn(); // prints 123, "bar" Basic optimizations ------------------- A few optimizations are applied to the conceptual model described above: * Creation of a lexical environment object is delayed for a function call when possible, so that an actual object is only created when necessary. Most functions don't establish new local bindings so there's no need to create an explicit lexical environment object for every function call. * When a function exits, identifier values are copied from registers to a lexical environment object only when necessary -- e.g. when the function has inner functions or eval calls. The compiler makes a conservative estimate when this step can be omitted for better performance. 
Here's an example when an eval() is enough to access function bindings after function exit:: duk> function f(x) { var foo=123; return eval(x); } = undefined duk> g = f('(function myfunc() { print(foo); })'); = function myfunc() {/* ecmascript */} duk> g() 123 = undefined * When there is no possibility of slow path accesses to identifiers nor any constructs which might otherwise access the varmap (direct eval calls, inner functions, etc), the compiler can omit the "varmap" from the final function template. However, when debugger support is enabled, varmap is always kept so that the debugger can inspect variable names for all functions. Arguments object ---------------- The ``arguments`` object is special and quite expensive to create when calling a function. The need to create an arguments objects is recorded into the final function template with the ``DUK_HOBJECT_FLAG_CREATEARGS`` flag which is checked in call handling. The compiler can omit argument object creation only when it's absolutely certain it won't be needed. For example the following will now cause the arguments object to be created on function entry (sometimes unnecessarily): * If there's an ``eval`` anywhere in the function there's a risk it will access the arguments object. * If there's an identifier reference using the name ``arguments`` which is not shadowed the arguments object may be referenced. Delaying arguments object creation to the point of an actual access is not trivial because argument values may have already been mutated and they affect arguments object creation. Current approach ---------------- * The ``varmap`` keeps track of identifier-to-register bindings. In the first pass the ``varmap`` is empty; the ``varmap`` is populated before the second pass. First pass gathers argument names, variable declarations, and inner function declarations. * After first pass but before second pass the effects of declaration binding instantiation (E5.1 Section 10.5) are considered and a ``varmap`` is built. The varmap contains all known identifiers, and their names are mapped either to an integer (= register number) or ``null`` (identifier is declared but not register mapped). The rather complex shadowing rules for arguments, variable declarations, and inner function declarations are handled in this step. * ``catch`` clause bindings: handled at runtime by the try-catch-finally opcodes by creating an explicit lexical scope with the catch variable binding. All code accessing the catch variable name inside the catch clause uses slow path lookups; this leaves room for future work to handle catch bindings better. * ``with`` statements: handled at runtime by try-catch-finally opcodes by creating an explicit lexical scope indicating an "object binding". The ``with_depth``, the number of nested ``with`` statements, is tracked during compilation. A non-zero with_depth prevents fast path variable accesses entirely because potentially any identifier access is captured by the object binding. * After second pass, when creating the final function template, the ``varmap`` is cleaned up: ``null`` entries are removed and the map is compacted. Future work =========== Some future work (not a comprehensive list by any means), in no particular order. Better handling of "catch" variables, "let" bindings ---------------------------------------------------- Current handling for "catch" variables creates an explicit lexical environment object and uses slow path for accessing the variable. 
This is far from optimal, but solving it better requires more compiler state.
Similarly, the ES2015 "let" binding needs efficient support to be useful.

Improve line number assignment
------------------------------

The current compiler associates opcode line numbers with the "previous
token", which is not always correct.  Add the necessary plumbing to associate
opcode line numbers more accurately.

Partial copy of variables when closing a function scope
-------------------------------------------------------

As of Duktape 1.3, when an outer function containing inner functions exits,
its lexical scope is closed with variable values copied from VM registers
(value stack frame) into an explicit scope object.  This works correctly but
causes a reference to be held for all variables in the outer scope, even
those that are *never* accessed by any inner function, see:
https://github.com/svaarala/duktape/issues/229.

This could be fixed by improving the compiler a bit:

* For every variable in the varmap, track the variable's current register
  mapping and a flag indicating if it has been referenced by an inner
  function ("keep on close").

* Whenever a function references a variable not defined in the function
  itself, scan outer lexical scopes for a matching variable.  If one is
  found, mark that variable in the outer function as being referenced by an
  inner function.  (Note that if any involved function has an eval(), all
  bets are off and conservative code must be generated, as eval() may
  introduce new bindings at run time.)

* Encode the "keep on close" flags into the final compilation result (the
  function template).  If eval()s are involved, mark all variables as "keep
  on close".

* At run time, when a function exits, copy only "keep on close" variables
  into the explicit scope object.  Other variables are then decref'd and
  finalized if appropriate.

Make ivalue manipulation shuffling aware
----------------------------------------

Current ivalue manipulation is not aware of register shuffling.  Instead,
ivalue manipulation relies on bytecode emission helpers to handle shuffling
as necessary.  Sometimes this results in sub-optimal opcode sequences (e.g.
the result of an operation is shuffled to a high register and then
immediately needed in a subsequent operation).  Code quality could be
improved by making ivalue manipulation shuffling aware.

Improve support for large functions
-----------------------------------

Large functions don't produce very good code with the current compiler:

* The method of binding identifiers to registers consumes a lot of useful low
  registers which can be directly addressed by all opcodes.  It might be
  better to reserve identifiers in a non-contiguous fashion so that a
  reasonable number of temporary registers could also be guaranteed to be in
  the low register range.

* The method of allocating temporaries may reserve low registers as
  temporaries which are then not available for inner expressions, which are
  often more important for performance (think outer loop vs. inner loop).

These are not fundamental limitations of the compiler, but there's been
little effort to improve support for large functions so far, other than to
ensure they work correctly.

Chunked stream parsing with rewind
----------------------------------

For low memory environments it would be useful to be able to stream source
code off e.g. flash memory.  Because Duktape decodes the source code into a
codepoint window anyway, hiding the streaming process would be relatively
straightforward.
Adding support for streaming would involve using a callback (perhaps a pure C callback or even an actual Duktape/C or Ecmascript callback) for providing a chunk of source code for Duktape to decode from. Another callback would be needed to rewind to a specified position. Another approach is to provide a callback to provide at most N bytes starting from a specified offset, and let the callback optimize for continuous reads if that's helpful. Allowing source code compression is also preferable. It's possible to use an ordinary stateful compression algorithm (like deflate) for the source code, but in a naive implementation any rewind operation means that the decompression must restart from the beginning of the entire source text. A more practical approach is to use chunked compression so that semi-random access is possible and reasonably efficient. One more design alternative is to model the source input as a sequence of Unicode codepoints instead of bytes, so that Duktape would just request a sequence of codepoints starting from a certain *codepoint* offset and then put them into the codepoint window. The user callback would handle character encoding as needed, which would simultaneously add support for custom source encodings. The downside of this approach is that the user callback needs the ability to map an arbitrary codepoint offset to a byte offset which is an awkward requirement for multibyte character encodings. Context aware compilation of logical expressions ------------------------------------------------ When a logical expression occurs in an "if" statement, the final result of the expression is not actually needed (only its truthiness matters). Further, the "if" statement only needs to decide between two alternative jumps, so that the short circuit handling used by the logical expression could just jump to those targets directly. Improve pool allocator compatibility ------------------------------------ A small improvement would be to track opcodes and line numbers in separate buffers rather than a single buffer with ``duk_compiler_instr`` entries. Split compiler into multiple files ---------------------------------- Example: * Bytecode emission * Ivalue handling * Expression parser * Statement parser and entry point Using some "memory" between pass 1 and pass 2 --------------------------------------------- The multi-pass compilation approach allows us to build some "memory" to help in code generation. In fact, pass 1 is now used to discover variable declarations, which is already a sort of memory which affects code generation later. These would help, for example: * Avoiding LABEL sites for iteration structures not requiring them. For instance, an iteration statement without an explicit label and with no "break" or "continue" statement inside the iteration construct does not need a LABEL site. * More simply, one could simply record all label sites created in pass 1, and note whether any break/continue targeted the label site in question. On pass 2, this state could be consulted to skip emitting label sites. Because the source is identical when reparsed, it is possible to address such "memory" using e.g. statement numbering, expression numbering, or token numbering, where the numbers are assigned from start of the function (the rewind point). Compile time lookups for non-mutable constants ---------------------------------------------- Variable lookups are represented by ivalues which identify a variable by name. 
Eventually they get converted to concrete code which reads a variable either directly from a register, or using a slow path GETVAR lookup. This could be improved in several ways. For example, if support for ``const`` was added, the ivalue conversion could detect that the variable maps to a constant in the current function or an outer function (with the necessary checks to ensure no "capturing" bindings can be established by e.g. an eval). The ivalue could then be coerced into a registered constant, copying the value of the constant declaration. Slow path record skip count --------------------------- When a slow path access is made, some environment record lookups can be skipped if the records belong to functions which have no potential for dynamically introduced bindings. For example:: var foo = 123; // global function func1() { var foo = 321; function func2() { var bar = 432; function func3() { var quux = 543; // The slow path lookup for 'hello' can skip func3, func2, and // func1 entirely because it will never match there. In other // words, we could look up 'hello' directly from the global object. print('hello'); // The slow path lookup for 'foo' could bypass func3 and func2, // and begin from func1. print(foo); } } } Any function with an ``eval()`` will potentially contain any binding, "with" statements must be handled correctly, etc. This optimization would be nice for looking up global bindings like ``print``, ``Math``, ``Array``, etc. The technical change would be for e.g. GETVAR to get an integer argument indicating how many prototype levels to skip when looking up the binding. Slow path non-configurable, non-writable bindings ------------------------------------------------- When a slow path access is certain to map to a non-configurable, non-writable value, the value could be copied into the function's constant table and used directly without an actual slow path lookup at run time. There are a few problems with this: * At the moment constants can only be numbers and strings, and this affects bytecode dump/load. If a constant were e.g. a function reference, bytecode dump/load wouldn't be able to handle it without some backing information to reconstruct the reference on bytecode load. * Even though a binding is non-writable and non-configurable, it can still be changed by C code with ``duk_def_prop()``. This is intentional so that C code has more freedom for sandboxing and such. For such environments this optimization might not always be appropriate. Better handling of shared constant detection -------------------------------------------- When a new constant is introduced, the current implementation linearly walks through existing constants to see if one can be reused. This walk is capped to ensure reasonable compilation times even for functions with a large number of constants. A better solution would be to use a faster search structure for detecting shared constants, e.g. a hash map with more flexible keys than in Ecmascript objects (perhaps one of the ES2015 maps). Better switch-case handling --------------------------- It would be nice to support at least dense integer ranges and use a jump table to handle them. This is important, for example, if a switch-case implements some kind of integer-dependent dispatch such as an opcode decoder.