===================
Regular expressions
===================
This document describes the Duktape ``RegExp`` built-in implementation.
Overview
========
Implementing a regular expression engine in a very small space is
challenging. See the following three excellent articles by Russ Cox
for background:
* http://swtch.com/~rsc/regexp/regexp1.html
* http://swtch.com/~rsc/regexp/regexp2.html
* http://swtch.com/~rsc/regexp/regexp3.html
The Ecmascript regular expression feature set is described in E5 Section
15.10, and includes:
* Disjunction
* Quantifiers, counted repetition and both greedy and minimal variants
* Assertions, negative and positive lookaheads
* Character classes, normal and inverted
* Captures and backreferences
* Unicode character support
* Unanchored matching (only) (e.g. ``/x/.exec('fooxfoo')`` matches ``'x'``)
Counted repetition quantifiers, assertions, captures, and backreferences
all complicate a non-backtracking implementation considerably. For this
reason, the built-in regular expression implementation, described below,
uses a backtracking approach.
The two basic goals of the built-in implementation are Ecmascript compliance
and compactness. More generally, the following prioritized requirements
should be fulfilled:
#. Ecmascript compatibility
#. Compactness
#. Avoiding deep or unbounded C recursion, and providing recursion and
execution time sanity limits
#. Regexp execution performance
#. Regexp compilation performance
Further, it should be possible to leave out regexp support during
compilation, or to plug in a more powerful existing regexp engine should
it be needed by the user.
Architecture
============
The basic implementation approach consists of three parts:
#. A regexp tokenizer, which reuses the lexer model of the Ecmascript
tokenizer and generates a token stream
#. A regexp compiler, which takes the token stream and produces compiled
regexps (represented as interned strings) and *normalized* regexp
patterns (see E5 Section 15.10.6)
#. A regexp executor, which takes a compiled regexp and an input stream
and produces match results
Case insensitive matching poses some surprising challenges in handling
character ranges; see the discussion of canonicalization below.
Tokenizer
---------
The tokenizer is implemented in ``duk_lexer.c`` and is quite simple and
straightforward. It shares the character decoding and character window
model of the Ecmascript tokenizer.
The two main functions are:
* ``duk_lexer.c:duk_lexer_parse_re_token()`` which parses a regexp token,
such as character, quantifier, etc.
* ``duk_lexer.c:duk_lexer_parse_re_ranges()`` which parses character class
ranges (a bit tricky due to canonicalization)
Quantifiers are fully parsed during tokenization, resulting in only two
types of quantifier tokens, greedy and minimal, with each having a minimum
count and a maximum count. An unspecified maximum count (infinite) is
encoded as the maximum unsigned 32-bit value 0xffffffff, which is quite
reasonable considering that Ecmascript strings cannot be longer than that.
The quantifier maximum value 0xffffffff is treated specially by the compiler
too.
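
As an illustration, a fully parsed quantifier token boils down to roughly
the following C structure; the type and field names here are hypothetical
and not the actual Duktape token layout::

  #include <stdint.h>

  #define RE_QUANTIFIER_INFINITE 0xffffffffUL  /* open-ended maximum */

  typedef struct {
      int greedy;     /* non-zero for greedy, zero for minimal variant */
      uint32_t qmin;  /* minimum repeat count */
      uint32_t qmax;  /* maximum repeat count, RE_QUANTIFIER_INFINITE if unbounded */
  } re_quantifier_token;

  /* How source quantifiers would map to (qmin, qmax):
   *   a*      -> greedy=1, qmin=0, qmax=RE_QUANTIFIER_INFINITE
   *   a+?     -> greedy=0, qmin=1, qmax=RE_QUANTIFIER_INFINITE
   *   a{3,5}  -> greedy=1, qmin=3, qmax=5
   */
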
Character classes could be parsed and encoded into token values. However,
this would mean that the token value would need to contain an arbitrary
number of character ranges. Also, character range normalization for case
insensitive matching requires some special treatment. For these reasons,
the lexer simply produces a start token for normal or inverted character
class (``[`` or ``[^``) and the lexer and the compiler co-operate to
process the character ranges. See, for instance:
* ``duk_regexp_compiler.c:parse_disjunction()``, and
* ``duk_lexer.c:duk_lexer_parse_re_ranges()``.
See the detailed discussion on canonicalization and case-insensitive
matching below.
Compiler
--------
The compiler is implemented in ``duk_regexp_compiler.c``. The main
functions are:
* ``duk_regexp_compiler.c:duk_regexp_compile()`` which provides the
compilation wrapper (e.g. initializes the compilation context,
including lexer state, etc) and also produces the *normalized*
regexp source required by E5 Section 15.10.6.
* ``duk_regexp_compiler.c:parse_disjunction()`` which parses a disjunction
(including atoms, quantifiers, and assertions) and calls itself
recursively to implement lookaheads and capture/non-capture groups.
The code generation model is shaped by the fact that linear bytecode
generation is not possible if a regexp is parsed linearly without lookahead.
In other words, one needs to choose between non-linear bytecode generation
and non-linear parsing. For instance, to compile ``a|b`` one would first
generate the bytecode for ``a``, and only then notice that the bytecode for
``a`` must be preceded by bytecode to handle the disjunction. Further, the
disjunction would need to be updated for each new alternative. Similar
problems apply to other constructs; consider, for instance, the quantifier
in ``(a|b)+``.
One common approach to deal with this problem is to first produce an
intermediate representation (e.g. a parse tree), and then perform compilation
using the more convenient intermediate representation. However, an
intermediate representation increases code size considerably, so we try
to make do without one.
Instead, the code generation model attempts to work around these
limitations as follows:
* The regexp bytecode generation is based on a byte buffer which holds
currently generated code. New bytecode instructions are either appended
to the buffer, or inserted into some earlier position e.g. to complete
jump offsets.
* Bytecode is "PC-relative". In particular, bytecode jump/branch offsets
are PC-relative (relative to the first byte of the subsequent instruction,
to be exact) which allows code blocks to be moved and copied freely
without breaking them. This works as long as there are no PC-relative
jumps over the "spliced" sections. There are a few restrictions, though,
discussed below.
* Code generation is bottom-up: a bytecode snippet is emitted for each
token, and these snippets are combined (concatenated, copied, etc) to
form more complex matchers. More complex expressions can backpatch jump
offsets, insert new bytecode into a previous position (bumping any
following code forwards), and clone existing bytecode snippets (e.g.
for counted quantifiers).
The current model has a few drawbacks:
* Insertion into the middle of the regexp buffer requires trailing code to
be moved (with ``memmove()``); see the sketch after this list. This can
lead to quite a lot of copying in pathological cases. However, regular
expressions are typically so short that this does not really matter in
practice, and the approach keeps the implementation simple.
* Because the compiler works without an intermediate representation for the
regexp, some of the back-patching required for code generation is a bit
tricky. This is the case especially for creating disjunction code (see
the example below).
* Because bytecode is variable size (especially, encoded PC-relative jump
offsets are variable size too!), back-patching jump offsets must be done
carefully. See comments in code, and discussion on jump offsets below.
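
The append/insert buffer manipulation described above can be sketched
roughly as follows. This is a simplified illustration with hypothetical
names and no error handling, not the actual Duktape buffer API::

  #include <stdlib.h>
  #include <string.h>

  /* Growable bytecode buffer supporting appends and inserts at an arbitrary
   * offset.  Error handling is omitted for brevity.
   */
  typedef struct {
      unsigned char *data;
      size_t size;   /* bytes currently in use */
      size_t alloc;  /* bytes allocated */
  } re_buffer;

  static void re_buffer_ensure(re_buffer *b, size_t extra) {
      if (b->size + extra > b->alloc) {
          b->alloc = (b->size + extra) * 2;
          b->data = (unsigned char *) realloc(b->data, b->alloc);
      }
  }

  /* Append code to the end of the buffer. */
  static void re_buffer_append(re_buffer *b, const unsigned char *code, size_t len) {
      re_buffer_ensure(b, len);
      memcpy(b->data + b->size, code, len);
      b->size += len;
  }

  /* Insert code at 'offset', shifting trailing code forwards.  Because jump
   * offsets are PC-relative, code after the insertion point remains valid
   * as long as no jump crosses the spliced region.
   */
  static void re_buffer_insert(re_buffer *b, size_t offset, const unsigned char *code, size_t len) {
      re_buffer_ensure(b, len);
      memmove(b->data + offset + len, b->data + offset, b->size - offset);
      memcpy(b->data + offset, code, len);
      b->size += len;
  }
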
Regular expressions are compiled into interned strings, containing both the
regexp flags and the actual regexp body bytecode. This allows compiled
regexps to be conveniently stored and handled as an internal property of a
``RegExp`` instance. The property is internal because the key for the
property uses a non-BMP character, which cannot be generated by standard
Ecmascript code, and cannot therefore be accessed by Ecmascript code. See
the bytecode format details below.
Another output of regexp compilation is the *normalized* regular expression
pattern, described in E5 Section 15.10.6, which goes into the ``source``
property of a ``RegExp`` instance. The normalized pattern is currently
formed simply as follows:
* If the input pattern is empty, output ``(?:)``.
* Else, look for any forward slash which is *not* preceded by a backslash.
Replace all such occurrences with ``\/``.
A run-time instance of a ``RegExp`` is created with only the compiled
bytecode (string) and the normalized pattern as inputs.
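
A minimal sketch of the normalization rule is shown below; the helper name
is hypothetical, and a real implementation would build an interned string
rather than print to stdout::

  #include <stdio.h>

  /* Print the "normalized" pattern: an empty pattern becomes "(?:)", and
   * any forward slash not preceded by a backslash is emitted as "\/".
   */
  static void print_normalized_pattern(const char *pattern) {
      if (pattern[0] == '\0') {
          fputs("(?:)", stdout);
          return;
      }
      for (const char *p = pattern; *p != '\0'; p++) {
          if (*p == '/' && (p == pattern || p[-1] != '\\')) {
              fputs("\\/", stdout);
          } else {
              putchar(*p);
          }
      }
  }
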
Executor
--------
The executor is implemented in ``duk_regexp_executor.c``, see:
* ``duk_regexp_executor.c:duk_regexp_match()`` which initializes the regexp
matcher context and contains most of the logic of E5 Section 15.10.6.2,
except for the innermost match attempt (step 9.b).
* ``duk_regexp_executor.c:match_regexp()`` which does regexp bytecode
execution starting from a certain input offset, calling itself recursively
when necessary (see "current limitations" below).
The basic implementation approach is a recursive back-tracking matcher
which uses the C stack whenever recursion is needed, but explicitly avoids
doing so for *simple quantifiers*: see separate discussion on quantifiers
and backtracking. Without the support for simple quantifiers, *every
character* matching the pattern ``/.*/`` would require one C recursion level
for back-tracking.
A regexp matcher context is maintained during matching to minimize the C
call parameter count. The current state includes ``PC``, the program counter
for bytecode, and ``SP``, the string pointer referring to the (immutable)
input string. Among other book-keeping members, the context also contains
the current *saved pointers*, which are byte pointers to the (extended UTF-8
encoded) input string.
Saved pointers are used to implement capture groups. The start and end
points of the capture are identified with saved pointers (two pointers
are needed per capture group). A capture group is valid if *both* saved
pointers are valid; when in the middle of the capture group, the start
pointer is set but the end pointer is not. Since the input string
is not modified during matching, even for case-insensitive matching, saved
pointers allow capturing without making explicit copies of the captured
values during matching.
Saving a pointer currently involves C recursion: when a pointer is saved,
the previous value is stored and the matcher is called recursively. If
backtracking needs to happen, the previous value can be restored. (One
could also try to erase saved pointers during backtracking based on the
saved pointer value: if we backtrack ``SP`` beyond the saved pointer,
the pointer is erased.)
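
The save-and-recurse pattern can be sketched roughly as follows; the
context type, field names, and helpers are hypothetical, not the actual
executor code::

  /* Matcher context; only the saved pointer array is shown here. */
  typedef struct {
      const unsigned char **saved;  /* saved pointers into the input string */
      /* ... PC, flags and other matcher state ... */
  } re_ctx;

  /* Match the remaining bytecode starting at string pointer 'sp' (placeholder). */
  extern const unsigned char *match_regexp_sketch(re_ctx *ctx, const unsigned char *sp);

  /* Execute a SAVE opcode: remember the previous value, store the new one,
   * and recurse into the rest of the bytecode.  If the sequel fails, the
   * old value is restored, which implements capture backtracking.
   */
  static const unsigned char *handle_save(re_ctx *ctx, unsigned int idx, const unsigned char *sp) {
      const unsigned char *old = ctx->saved[idx];
      const unsigned char *res;

      ctx->saved[idx] = sp;                /* tentatively record the pointer */
      res = match_regexp_sketch(ctx, sp);  /* match the rest of the bytecode */
      if (res == NULL) {
          ctx->saved[idx] = old;           /* backtrack: restore previous value */
      }
      return res;
  }
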
The mapping between saved pointers and capture groups is described in
the following table:
+-------------+------------------------------------------+
| Saved index | Description |
+=============+==========================================+
| 0 | Start of entire matching substring |
+-------------+------------------------------------------+
| 1 | End of entire matching substring |
+-------------+------------------------------------------+
| 2 | Start of capture group 1 |
+-------------+------------------------------------------+
| 3 | End of capture group 1 |
+-------------+------------------------------------------+
| ... | |
+-------------+------------------------------------------+
| 2n | Start of capture group n |
+-------------+------------------------------------------+
| 2n+1 | End of capture group n |
+-------------+------------------------------------------+
Memory allocation is generally avoided during regexp execution.
When it is necessary to allocate temporary buffers, all temporaries
are placed in the value stack for correct memory management in case
of errors. Currently, memory allocation is needed during regexp
execution only to handle lookahead assertions, which need to make
a copy of saved pointers.
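
A minimal sketch of the saved pointer copy needed by lookaheads, assuming a
plain ``malloc()`` for illustration (the real implementation places the
temporary in the value stack so that errors are handled correctly)::

  #include <stdlib.h>
  #include <string.h>

  /* Copy the whole saved pointer array before matching a lookahead body,
   * so that it can be restored when backtracking.  Returns NULL on
   * allocation failure.
   */
  static const unsigned char **copy_saved_pointers(const unsigned char **saved,
                                                   unsigned int nsaved) {
      const unsigned char **copy =
          (const unsigned char **) malloc(sizeof(const unsigned char *) * nsaved);
      if (copy != NULL) {
          memcpy(copy, saved, sizeof(const unsigned char *) * nsaved);
      }
      return copy;
  }
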
About safety: the Ecmascript executor should prevent the user from reading
and replacing regexp bytecode. Even so, the executor must validate all
memory accesses etc. When an invalid access is detected (e.g. a 'save'
opcode referring to an invalid, unallocated index) it must fail with an
internal error but must not cause a segmentation fault.
Current limitations
-------------------
Regexp compiler
:::::::::::::::
C recursion depth limit
The compiler imposes an artificial limit on C recursion depth
(``DUK_RE_COMPILE_RECURSION_LIMIT`` by default). If the recursion limit
is reached, regexp compilation fails with an (internal) error.
The following constructs increase C recursion depth:
* Negative or positive lookahead
* Capture or non-capture group
Regexp atom copy limit
Complex quantifiers with a non-zero minimum or a non-infinite maximum
cause the quantified atom to be duplicated in regexp bytecode. There
is an artificial limit (``DUK_RE_MAX_ATOM_COPIES`` by default) on the
number of copies the compiler is willing to create. Some examples:
* For ``/(?:a|b){10,20}/``, the atom code (``/(?:a|b)/``) is first
copied 10 times to cover the quantifier minimum, and another 10
times to cover the maximum.
* For ``/(?:a|b){10,}/``, the atom code is first copied 10 times to
cover the quantifier minimum, and the remaining (greedy) infinite
match reuses the last emitted atom.
Note that there is no such restriction for *simple quantifiers*, which
can keep track of quantifier counts explicitly.
Regexp executor
:::::::::::::::
C recursion depth limit
The executor imposes an artificial limit on C recursion depth
(``DUK_RE_EXECUTE_RECURSION_LIMIT`` by default). If the recursion limit
is reached, regexp matching fails with an (internal) error.
The following constructs increase C recursion depth:
* Simple quantifier increases recursion depth by one when matching the
sequel (but not for each atom).
* Complex quantifier increases recursion depth for each atom matched and
the sequel (e.g. ``/(?:x|x)+/`` causes C recursion for each ``x``
character matched).
* ``DUK_REOP_SAVE`` increases recursion depth by one (to provide capture
backtracking), so each capture group increases C recursion depth by two.
* Positive and negative lookahead increase recursion depth by one for
matching the lookahead, and for matching the sequel (to provide capture
backtracking).
* Each alternative of a disjunction increases recursion depth by one,
because disjunctions currently generate a sequence of n-1
``DUK_REOP_SPLIT1`` opcodes for an n-alternative disjunction, and the
preferred execution path runs through each of these ``DUK_REOP_SPLIT1``
opcodes on the first attempt.
Regexp opcode steps limit
The executor imposes an artificial limit on the total number of regexp
opcodes executed (``DUK_RE_EXECUTE_STEPS_LIMIT`` by default) to provide
a safeguard against insane execution times. The steps limit applies to
total steps executed during e.g. ``exec()``. The steps count is *not*
zeroed for each attempt of an unanchored match.
The steps limit provides a safety net for avoiding excessive or
even infinite execution time. Infinite execution time may currently
happen for some empty quantifiers, so only the steps limit prevents
them from executing indefinitely.
Empty quantifier bodies in complex quantifiers
Empty quantifier bodies in complex quantifiers may cause unbounded
matcher execution time (eventually terminated by the steps limit).
There is no "progress" instruction or one-character lookahead to
prevent multiple matches of the same empty atom.
* Complex quantifier example: ``/(?:|)*x/.exec('x')`` is terminated by
the steps limit. The problem is that the empty group will match an
infinite number of times, so the greedy quantifier never terminates.
* Simple quantifiers have a workaround if the atom character length is
zero: ``qmin`` and ``qmax`` are capped to 1. This allows the atom
to match once and possibly cause whatever side effects it may have
(for instance, if we allowed captures in simple atoms, the capture
could happen, once). For instance, ``/(?:)*x/`` is, in effect,
converted to ``/(?:){0,1}x/`` and ``/(?:){3,4}x/`` to
``/(?:){1,1}x/``.
This problem could also be fixed for complex quantifiers, but the
fix is not as trivial as for simple quantifiers.
Miscellaneous
:::::::::::::
Incomplete support for characters outside the BMP
Ecmascript only mandates support for 16-bit code points, so this is
not a compliance issue.
The current implementation quite naturally processes code points above
the BMP as such. However, there is no way to express such characters
in patterns (there is for instance no Unicode escape for code points
higher than U+FFFF). Also, the built-in ranges ``\d``, ``\s``, and
``\w`` and their inversions only cover 16-bit code points, so they
will not currently work properly.
This limitation has very little practical impact, because a standard
Ecmascript program cannot construct an input string containing any
non-BMP characters.
Compiled regexp and bytecode format
===================================
A regular expression is compiled into an "extended" UTF-8 string which is
interned into a ``duk_hstring``. The extended UTF-8 string contains
flags, parameters, and code for the regexp body. This simplifies handling
of compiled regexps and minimizes memory overhead. The "extended" UTF-8
encoding also keeps the bytecode quite compact while allowing existing
helpers to deal with encoding and decoding.
Logically, a compiled regexp is a sequence of signed and unsigned integers.
Unsigned integers are encoded directly with "extended" UTF-8 which allows
codepoints of up to 36 bits, although integer values beyond 32 bits are not
used for compiled regexps. Signed integers need special treatment because
UTF-8 does not allow encoding of negative values. Thus, signed integers
are first converted to unsigned by doubling their absolute value and
setting the lowest bit if the number is negative; for example, ``6`` is
converted to ``2*6=12`` and ``-4`` to ``2*4+1=9``. The unsigned result
(again at most 32 bits) is then encoded with "extended" UTF-8. This
special treatment allows signed integers to be encoded with UTF-8 in the
first place, and further provides short encodings for small signed integers
which is useful for encoding bytecode jump distances.
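
The signed-to-unsigned mapping can be expressed as a pair of small helpers
(illustrative only; the values used in regexp bytecode are small enough
that overflow is not a concern here)::

  #include <stdint.h>

  /* 6 -> 12 and -4 -> 9, as in the examples above. */
  static uint32_t re_signed_to_unsigned(int32_t x) {
      return (x < 0) ? ((uint32_t) (-x) * 2u + 1u) : ((uint32_t) x * 2u);
  }

  static int32_t re_unsigned_to_signed(uint32_t u) {
      return (u & 1u) ? -(int32_t) (u >> 1) : (int32_t) (u >> 1);
  }
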
The compiled regexp begins with a header, containing:
* unsigned integer: flags, any combination of ``DUK_RE_FLAG_*``
* unsigned integer: ``nsaved`` (number of save slots), which should be
``2n+2`` where ``n`` equals ``NCapturingParens`` (number of capture
groups)
Regexp body bytecode then follows. Each instruction consists of an opcode
value (``DUK_REOP_*``) (encoded as an unsigned integer) followed by a
variable number of instruction parameters. Each opcode and parameter is
encoded (as described above) as a "code point". When executing the
bytecode, the program counter is maintained as a byte offset, not as an
instruction index, so all jump offsets are byte offsets (not instruction
offsets).
Jump targets are encoded as "skip offsets" relative to the first byte of
the instruction following the jump/branch. Because the skip offset itself
has variable length, this needs to be handled carefully during compilation;
see discussion below.
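
A rough sketch of what the byte-offset based dispatch looks like in the
executor is shown below; the decode helper and opcode constants are
hypothetical placeholders, and only a few opcodes are shown::

  enum { REOP_MATCH, REOP_CHAR, REOP_JUMP /* ... */ };

  /* Decode one "extended UTF-8" code point at *pc, advancing *pc (placeholder). */
  extern unsigned long decode_codepoint(const unsigned char **pc);

  static int execute_sketch(const unsigned char *pc, const unsigned char *pc_end) {
      while (pc < pc_end) {
          unsigned long op = decode_codepoint(&pc);

          switch (op) {
          case REOP_MATCH:
              return 1;  /* successful match */
          case REOP_CHAR: {
              unsigned long cp = decode_codepoint(&pc);  /* parameter: code point */
              /* ... compare (canonicalized) input character against cp ... */
              (void) cp;
              break;
          }
          case REOP_JUMP: {
              unsigned long skip = decode_codepoint(&pc);
              /* Convert to signed; the offset is relative to the first byte
               * of the next instruction, i.e. the current pc (a byte offset).
               */
              long offset = (skip & 1u) ? -(long) (skip >> 1) : (long) (skip >> 1);
              pc += offset;
              break;
          }
          default:
              return -1;  /* internal error: unknown opcode */
          }
      }
      return 0;
  }
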
Regexp opcodes
--------------
The following table summarizes the regexp opcodes and their parameters.
The opcode name prefix ``DUK_REOP_`` is omitted for brevity; for instance,
``DUK_REOP_MATCH`` is listed as ``MATCH``.
+--------------------------+-------------------------------------------------+
| Opcode | Description / parameters |
+==========================+=================================================+
| MATCH | Successful match. |
+--------------------------+-------------------------------------------------+
| CHAR | Match one character. |
| | |
| | * ``uint``: character codepoint |
+--------------------------+-------------------------------------------------+
| PERIOD | ``.`` (period) atom, match next character |
| | against anything except a LineTerminator. |
+--------------------------+-------------------------------------------------+
| RANGES | Match the next character against a set of |
| | ranges; accept if in some range. |
| | |
| | * ``uint``: ``n``, number of ranges |
| | |
| | * ``2n * uint``: ranges, ``[r1,r2]`` encoded as |
| | two unsigned integers ``r1``, ``r2`` |
+--------------------------+-------------------------------------------------+
| INVRANGES | Match the next character against a set of |
| | ranges; accept if not in any range. |
| | |
| | * ``uint``: ``n``, number of ranges |
| | |
| | * ``2n * uint``: ranges, ``[r1,r2]`` encoded as |
| | two unsigned integers ``r1``, ``r2`` |
+--------------------------+-------------------------------------------------+
| JUMP | Jump to target unconditionally. |
| | |
| | * ``int``: ``skip``, signed byte offset for jump|
| | target, relative to the start of the next |
| | instruction |
+--------------------------+-------------------------------------------------+
| SPLIT1 | Split execution. Try direct execution first. |
| | If fails, backtrack to jump target. |
| | |
| | * ``int``: ``skip``, signed byte offset for jump|
| | alternative |
+--------------------------+-------------------------------------------------+
| SPLIT2 | Split execution. Try jump target first. |
| | If fails, backtrack to direct execution. |
| | |
| | * ``int``: ``skip``, signed byte offset for jump|
| | alternative |
+--------------------------+-------------------------------------------------+
| SQMINIMAL | Simple, minimal quantifier. |
| | |
| | * ``uint``: ``qmin``, minimum atom match count |
| | |
| | * ``uint``: ``qmax``, maximum atom match count |
| | |
| | * ``skip``: signed byte offset for sequel |
| | (atom begins directly after instruction and |
| | ends in a DUK_REOP_MATCH instruction). |
+--------------------------+-------------------------------------------------+
| SQGREEDY | Simple, greedy (maximal) quantifier. |
| | |
| | * ``uint``: ``qmin``, minimum atom match count |
| | |
| | * ``uint``: ``qmax``, maximum atom match count |
| | |
| | * ``uint``: ``atomlen``, atom length in |
| | characters (must be known and fixed for all |
| | atom matches; needed for stateless atom |
| | backtracking) |
| | |
| | * ``skip``: signed byte offset for sequel |
| | (atom begins directly after instruction and |
| | ends in a DUK_REOP_MATCH instruction). |
+--------------------------+-------------------------------------------------+
| SAVE | Save ``SP`` (string pointer) to ``saved[i]``. |
| | |
| | * ``uint``: ``i``, saved array index |
+--------------------------+-------------------------------------------------+
| LOOKPOS | Positive lookahead. |
| | |
| | * ``int``: ``skip``, signed byte offset for |
| | sequel (lookahead begins directly after |
| | instruction and ends in a DUK_REOP_MATCH) |
+--------------------------+-------------------------------------------------+
| LOOKNEG | Negative lookahead. |
| | |
| | * ``int``: ``skip``, signed byte offset for |
| | sequel (lookahead begins directly after |
| | instruction and ends in a DUK_REOP_MATCH) |
+--------------------------+-------------------------------------------------+
| BACKREFERENCE | Match next character(s) against a capture. |
| | If the capture is undefined, *always matches*. |
| | |
| | * ``uint``: ``i``, backreference number in |
| | [1,``NCapturingParens``], refers to input |
| | string between saved indices ``i*2`` and |
| | ``i*2+1``. |
+--------------------------+-------------------------------------------------+
| ASSERT_START | ``^`` assertion. |
+--------------------------+-------------------------------------------------+
| ASSERT_END | ``$`` assertion. |
+--------------------------+-------------------------------------------------+
| ASSERT_WORD_BOUNDARY | ``\b`` assertion. |
+--------------------------+-------------------------------------------------+
| ASSERT_NOT_WORD_BOUNDARY | ``\B`` assertion. |
+--------------------------+-------------------------------------------------+
.. FIXME poor layout for esp. ASSERT_NOT_WORD_BOUNDARY
Jump offsets (skips) for jumps/branches
----------------------------------------
The jump offset of a jump/branch instruction is always encoded as the last
parameter of the instruction. The offset is relative to the first byte of
the next instruction. This presents some challenges with variable length
encoding for negative skip offsets.
Assume that the compiler is emitting a JUMP over a 10-byte code block::
JUMP L2
L1:
(10 byte code block)
L2:
The compiler emits a ``DUK_REOP_JUMP`` opcode. It then needs to emit
a skip offset of 10. The offset, 10, does not need to be adjusted because
the length of the encoded skip offset does not affect the offset
(``L2 - L1``).
However, assume that the compiler is emitting a JUMP backwards over a
10-byte code block::
L1:
(10 byte code block)
JUMP L1
L2:
The compiler emits a ``DUK_REOP_JUMP`` opcode. It then needs to emit the
negative offset ``L1 - L2``. To do this, it needs to know the encoded
byte length for representing that *offset value in bytecode*. The offset
thus depends on itself, and we need to find the shortest UTF-8 encoding
that can encode the skip offset successfully. In this case the correct
final skip offset is -12: the 10-byte code block, plus one byte for the
``DUK_REOP_JUMP`` opcode, plus one byte for the skip offset itself (which
encodes into a single byte).
In practice it suffices to first compute the negative offset
``L1 - L2 - 1`` (where the -1 is to account for the ``DUK_REOP_JUMP``,
which always encodes to one byte) without taking the skip parameter into
account, and figure out the length of the UTF-8 encoding of that offset,
``len1``. Then do the same computation for the negative offset
``L1 - L2 - 1 - len1`` to get the encoded length ``len2``.
The final skip offset is ``L1 - L2 - 1 - len2``. In some cases ``len1``
will be one byte shorter than ``len2``, but ``len2`` will be correct.
For instance, if the code block in the second example had been 1022 bytes
long:
* The first offset ``L1 - L2 - 1`` would be -1023 which is converted to
the unsigned value ``2*1023+1 = 2047 = 0x7ff``. This encodes to two
UTF-8 bytes, i.e. ``len1 = 2``.
* The second offset ``L1 - L2 - 1 - 2`` would be -1025 which is converted
to the unsigned value ``2*1025+1 = 2051 = 0x803``. This encodes to
*three* UTF-8 bytes, i.e. ``len2 = 3``.
* The final skip offset ``L1 - L2 - 1 - 3`` is -1026, which converts to
the unsigned value ``2*1026+1 = 2053 = 0x805``. This again encodes to
three UTF-8 bytes, and is thus "self consistent".
This could also be solved in closed form directly.
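
The two-pass computation described above can be sketched as follows,
assuming a hypothetical ``encoded_length()`` helper which returns the
encoded byte length of a signed skip offset::

  /* Encoded byte length of a signed skip offset after the signed->unsigned
   * mapping and "extended UTF-8" encoding (placeholder).
   */
  extern int encoded_length(long signed_offset);

  /* 'block_len' is the byte length of the code block being jumped over
   * (10 in the example above); the +1 accounts for the one-byte
   * DUK_REOP_JUMP opcode itself.
   */
  static long compute_backward_skip(long block_len) {
      long base = -(block_len + 1);            /* L1 - L2 - 1, ignoring the skip field */
      int len1 = encoded_length(base);         /* first guess for the skip field length */
      int len2 = encoded_length(base - len1);  /* second pass accounts for the field itself */
      return base - len2;                      /* final, self-consistent skip offset */
  }
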
Character class escape handling
-------------------------------
There are no opcodes or special constructions for character class escapes
(``\d``, ``\D``, ``\s``, ``\S``, ``\w``, ``\W``) described in E5 Section
15.10.2.12, regardless of whether they appear inside or outside a
character class.
The semantics are essentially ASCII-based except for the white space
character class which contains all characters in the E5 ``WhiteSpace`` and
``LineTerminator`` productions, resulting in a total of 11 ranges (or
individual characters).
Regardless of where they appear, character class escapes are turned into
explicit character range matches during compilation, which also allows
them to be embedded in character classes without complications (such as,
for instance, splitting the character class into a disjunction). The
downside of this is that regular expressions making heavy use of ``\s``
or ``\S`` will result in relatively large regexp bytecode. Another
approach would be to reuse some Unicode code points to act as special
'marker characters' for the execution engine. Such markers would need
to be above U+FFFF because all 16-bit code points must be matchable.
.. FIXME note briefly where these ranges come from, e.g. the script
which can be used to re-generate them
The (inclusive) ranges for positive character class escapes are:
+--------+--------+--------+
| Escape | Start | End |
+========+========+========+
| ``\d`` | U+0030 | U+0039 |
+--------+--------+--------+
| ``\s`` | U+0009 | U+000D |
+--------+--------+--------+
| | U+0020 | U+0020 |
+--------+--------+--------+
| | U+00A0 | U+00A0 |
+--------+--------+--------+
|  | U+1680 | U+1680 |
+--------+--------+--------+
|  | U+180E | U+180E |
+--------+--------+--------+
| | U+2000 | U+200A |
+--------+--------+--------+
|  | U+2028 | U+2029 |
+--------+--------+--------+
| | U+202F | U+202F |
+--------+--------+--------+
|  | U+205F | U+205F |
+--------+--------+--------+
|  | U+3000 | U+3000 |
+--------+--------+--------+
|  | U+FEFF | U+FEFF |
+--------+--------+--------+
| ``\w`` | U+0030 | U+0039 |
+--------+--------+--------+
|  | U+0041 | U+005A |
+--------+--------+--------+
|  | U+005F | U+005F |
+--------+--------+--------+
|  | U+0061 | U+007A |
+--------+--------+--------+
The ranges for negative character class escapes are:
+--------+--------+--------+
| Escape | Start | End |
+========+========+========+
| ``\D`` | U+0000 | U+002F |
+--------+--------+--------+
|  | U+003A | U+FFFF |
+--------+--------+--------+
| ``\S`` | U+0000 | U+0008 |
+--------+--------+--------+
|  | U+000E | U+001F |
+--------+--------+--------+
|  | U+0021 | U+009F |
+--------+--------+--------+
|  | U+00A1 | U+167F |
+--------+--------+--------+
|  | U+1681 | U+180D |
+--------+--------+--------+
|  | U+180F | U+1FFF |
+--------+--------+--------+
|  | U+200B | U+2027 |
+--------+--------+--------+
|  | U+202A | U+202E |
+--------+--------+--------+
|  | U+2030 | U+205E |
+--------+--------+--------+
|  | U+2060 | U+2FFF |
+--------+--------+--------+
|  | U+3001 | U+FEFE |
+--------+--------+--------+
|  | U+FF00 | U+FFFF |
+--------+--------+--------+
| ``\W`` | U+0000 | U+002F |
+--------+--------+--------+
|  | U+003A | U+0040 |
+--------+--------+--------+
|  | U+005B | U+005E |
+--------+--------+--------+
|  | U+0060 | U+0060 |
+--------+--------+--------+
|  | U+007B | U+FFFF |
+--------+--------+--------+
The ``.`` atom (period) matches everything except a ``LineTerminator`` and
behaves like a character class. It is interpreted literally inside a
character class. There is a separate opcode to match the ``.`` atom,
``DUK_REOP_PERIOD`` so there is currently no need to emit ranges for the
period atom. If it were compiled into a character range, its ranges would
be (the negative of ``.`` would not be needed):
+--------+--------+--------+
| Escape | Start | End |
+========+========+========+
| ``.`` | U+0000 | U+0009 |
+--------+--------+--------+
| | U+000B | U+000C |
+--------+--------+--------+
|  | U+000E | U+2027 |
+--------+--------+--------+
| | U+202A | U+FFFF |
+--------+--------+--------+
Each of the above range sets (including the one for ``.``) is affected by
the ignoreCase (``/i``) option. However, the ranges can be emitted verbatim
without canonicalization also when case-insensitive matching is used.
This is not a trivial issue, see discussion on canonicalization below.
Misc notes
----------
There is no opcode for a non-capturing group because there is no need for
it during execution.
During regexp execution, regexp flags are kept in the regexp matching
context, and affect opcode execution as follows:
* global (``/g``): does not affect regexp execution, only the behavior of
``RegExp.prototype.exec()`` and ``RegExp.prototype.toString()``.
* ignoreCase (``/i``): affects all opcodes which match characters or
character ranges, through the ``Canonicalize`` operation defined in
E5 Section 15.10.2.8. It also affects ``RegExp.prototype.toString()``.
* multiline (``/m``): affects the start and end assertion opcodes
(``^`` and ``$``). It also affects ``RegExp.prototype.toString()``.
A bytecode opcode for matching a string instead of an individual character
seems useful at first glance. The compiler could join successive
characters into a string match (by back-patching the preceding string
match instruction, for instance). However, this turns out to be difficult
to implement without lookahead. Consider matching ``/xyz+/`` for instance.
The ``z`` is quantified, so the compiler would need to emit a string match
for ``xy``, followed by a quantifier with ``z`` as its quantified atom.
However, when working on the ``z`` token, the compiler does not know
whether a quantifier will follow but still needs to decide whether or not
to merge it into the previous ``xy`` matcher. Perhaps the quantifier could
pull out the ``z`` later on, but this complicates the compiler. Thus there
is only a character matching opcode, ``DUK_REOP_CHAR``.
Canonicalization (case conversion for ignoreCase flag)
======================================================
The ``Canonicalize`` abstract operator is described in E5 Section 15.10.2.8.
It has a pretty straightforward definition matching the behavior of
``String.prototype.toUpperCase()``, except that:
* If case conversion would turn a single codepoint character into a
multiple codepoint character, case conversion is skipped
* If case conversion would turn a non-ASCII character (>= U+0080) into
an ASCII character (<= U+007F), case conversion is skipped
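
A minimal C sketch of these rules, assuming a hypothetical
``uc_toupper_single()`` helper which uppercases a single code point (and
reports when the result would be more than one code point)::

  /* Uppercase a single code point; return a negative value if uppercasing
   * would produce more than one code point (placeholder).
   */
  extern long uc_toupper_single(long cp);

  static long re_canonicalize(long cp, int ignore_case) {
      long up;

      if (!ignore_case) {
          return cp;
      }
      up = uc_toupper_single(cp);
      if (up < 0) {
          return cp;  /* would expand to multiple code points: skip conversion */
      }
      if (cp >= 0x80 && up < 0x80) {
          return cp;  /* a non-ASCII character must not canonicalize to ASCII */
      }
      return up;
  }
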
``Canonicalize`` is used for the semantics of:
* The abstract ``CharacterSetMatcher`` construct,
E5 Section 15.10.2.8
* Atom ``PatternCharacter`` handling,
E5 Section 15.10.2.8 (through ``CharacterSetMatcher``)
* Atom ``.`` (period) handling,
E5 Section 15.10.2.8 (through ``CharacterSetMatcher``)
* Atom ``CharacterClass`` handling,
E5 Section 15.10.2.8 (through ``CharacterSetMatcher``)
* Atom escape ``DecimalEscape`` handling,
E5 Section 15.10.2.9 (through ``CharacterSetMatcher``)
* Atom escape ``CharacterEscape`` handling,
E5 Section 15.10.2.9 (through ``CharacterSetMatcher``)
* Atom escape ``CharacterClassEscape`` handling,
E5 Section 15.10.2.9 (through ``CharacterSetMatcher``)
* Atom escape (backreference) handling,
E5 Section 15.10.2.9
The ``CharacterSetMatcher`` basically compares a character against all
characters in the set, and produces a match if the input character and
the target character match after canonicalization. Matching character
ranges naively by canonicalizing the character range start and end point
and then comparing the canonicalized input character against the range
**is incorrect**, because a continuous range may turn into multiple
ranges after canonicalization.
Example: the class ``[x-{]`` is a continuous range U+0078-U+007B
(``x``, ``y``, ``z``, ``{``), but converts into two ranges after
canonicalization: U+0058-005A, U+007B (``X``, ``Y``, ``Z``, ``{``).
See test case ``test-regexp-canonicalization.js``.
The current solution has a small footprint but is expensive during
compilation: if ignoreCase (``/i``) option is given, the compiler
preprocesses all character ranges by running through all characters
in the character range, normalizing the character, and emitting output
ranges based on the normalization results. Continuous ranges are kept
continuous, and multiple ranges are emitted if necessary.
This process is relatively simple but has a high compile time impact
(but only if ignoreCase option is specified). Also note that the process
may result in overlapping character ranges (for instance, ``[a-zA-Z]``
results in ``[A-ZA-Z]``). However, overlapping ranges are not eliminated
during compilation of case sensitive regular expressions either, which
wastes some bytecode space and execution time, but causes no other
complications.
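
The brute-force range preprocessing can be sketched as follows; the
canonicalization and range emission helpers are hypothetical::

  /* Canonicalization helper as in the sketch above (placeholder). */
  extern long re_canonicalize(long cp, int ignore_case);
  /* Append one [start,end] range to the RANGES/INVRANGES bytecode (placeholder). */
  extern void emit_range(void *userdata, long start, long end);

  /* Walk every code point of the input range [r1,r2], canonicalize it, and
   * emit maximal continuous output ranges.  Overlapping output ranges are
   * not eliminated.
   */
  static void canonicalize_range(void *userdata, long r1, long r2) {
      long cp, out_start = -1, out_prev = -1;

      for (cp = r1; cp <= r2; cp++) {
          long c = re_canonicalize(cp, 1 /* ignoreCase */);

          if (out_start < 0) {
              out_start = out_prev = c;        /* first output range starts here */
          } else if (c == out_prev + 1) {
              out_prev = c;                    /* output range stays continuous */
          } else {
              emit_range(userdata, out_start, out_prev);
              out_start = out_prev = c;        /* start a new output range */
          }
      }
      if (out_start >= 0) {
          emit_range(userdata, out_start, out_prev);
      }
  }
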
Note that the resulting ranges (after canonicalization) may include or omit
all such characters whose canonicalized (uppercased) counterparts are
included in some character range of the class. For instance, the
normalization of ``[a-z]`` is ``[A-Z]`` but ``[A-Zj]`` would also work,
although it would be sub-optimal. The reason is that a ``j`` will never be
compared during execution, because the input character is normalized before
range comparison (into ``J``) and will thus match the canonicalized
counterpart (here contained in the range ``[A-Z]``). The canonicalization
process could thus, for instance, simply add additional ranges but keep the
original ones too, although this particular approach would serve little
purpose.
However, this fact becomes relevant when built-in character ranges provided
by ``.``, ``\s``, ``\S``, ``\d``, ``\D``, ``\w``, and ``\W`` are considered.
In principle, the ranges they represent should be canonicalized when
ignoreCase has been specified. However, these ranges have the following
property: if a lowercase character ``x`` is contained in the range, its
uppercase (canonicalized) counterpart is also contained in the range (see
test case ``test-misc-regexp-character-range-property.js`` for a
verification). This property is apparent for all the ranges except for
``\w`` and ``\W``: for these ranges to have the property, the refusal of
``Canonicalize`` to canonicalize a non-ASCII character to an ASCII character
is crucial (for instance, U+0131 would map to U+0049 which would cause
problems for ``\W``). Because of this property, the regexp compiler can use
the built-in character ranges without any normalization processing, even
when ignoreCase option has been specified: the normalized characters are
already present.
Alternative solutions to the canonicalization problem include:
* Perform a more intelligent range conversion at compile time or at regexp
execution time. Difficult to implement compactly.
* Preprocess all 65536 possible *input characters* during compile time, and
match them against the character class ranges, generating optimal result
ranges (with overlaps eliminated). The downsides include that this cannot
be done before all the ranges are known, and that the comparison of one
character against an (input) range is still complicated, and possibly
requires another character loop which would result in up to 2^32
comparisons (too high).
Compilation strategies
======================
The examples below use opcode names without the ``DUK_REOP_`` prefix, and use
symbolic labels for clarity.
PC-relative code blocks, jump patching
--------------------------------------
Because addressing of jumps and branches is PC-relative, already compiled
code blocks can be copied or removed without affecting their validity.
Inserting code before and after code blocks is not a problem as such.
However, there are two things to watch out for:
#. Inserting or removing bytecode into an offset which is between a jump /
branch and its target. This breaks the jump offset. The compiler has
no support for 'fixing' already generated jumps (except pending jumps
and branches which are treated specially), so this must be avoided in
general.
#. Inserting or removing bytecode at an offset which affects a previously
stored book-keeping offset (e.g. for a pending jump). This is not
necessarily a problem as long as the offset is fixed, or the order of
patching is chosen so that offsets do not break. See the current
compilation strategy for an example of this.
Disjunction compilation alternatives
------------------------------------
Basic two alternative disjunction::
/a|b/
split L1
(a)
jump L2
L1: (b)
L2:
Assume this code is directly embedded in a three alternative disjunction
(original two alternative code marked with # characters)::
/a|b|c/ == /(?:a|b)|c/
split L3
# split L1
# (a)
# jump L2
# L1: (b)
# L2:
jump L4
L3: (c)
L4:
The "jump L2" instruction will jump directly to the "jump L4" instruction.
So, "jump L2" could be updated to "jump L4" which would not reduce bytecode
size, but would eliminate one extra jump during regexp execution::
/a|b|c/
split L3
# split L1
# (a)
# jump L4 <-- jump updated from L2 to L4
# L1: (b)
jump L4 <-- L2 label eliminated above this instruction
L3: (c)
L4:
Because the compile-time overhead of manipulating code generated for
sub-expressions is quite high, the compiler currently generates suboptimal
jump chains for disjunctions.
Current disjunction compilation model
-------------------------------------
The current disjunction compilation model avoids modifying already
generated code (which is tricky with variable length bytecode) when
possible. However, this is not entirely possible for disjunctions
compiled into a sequence of SPLIT1 opcodes as illustrated above. The
compiler needs to track and back-patch one pending JUMP (for a previous
match) and a SPLIT1 (for a previous alternative). This is illustrated
with an example below, for ``/a|b|c/``.
The bytecode form we create, at the end, for ``a|b|c`` is::
split1 L2
split1 L1
(a)
jump M1
L1: (b)
M1: jump M2
L2: (c)
M2:
This is built as follows. After parsing ``a``, a ``|`` is encountered and
the situation is, simply::
(a)
There is no pending jump/split1 to patch in this case. What we do in that
case is::
split1 (empty) <-- leave unpatched_disjunction_split
(a)
jump (empty) <-- leave unpatched_disjunction_jump
(new atom begins here)
When ``a|b`` has been parsed, a ``|`` is encountered and the situation is::
split1 (empty) <-- unpatched_disjunction_split for 'a'
(a)
jump (empty) <-- unpatched_disjunction_jump for 'a'
(b)
We first patch the pending jump to get::
split1 (empty) <-- unpatched_disjunction_split for 'a'
(a)
jump M1
(b)
M1:
The pending split1 can also now be patched because the jump has its final
length now::
split1 L1
(a)
jump M1
L1: (b)
M1:
We then insert a new pending jump::
split1 L1
(a)
jump M1
L1: (b)
M1: jump (empty) <-- unpatched_disjunction_jump for 'b'
... and a new pending split1::
split1 (empty) <-- unpatched_disjunction_split for 'b'
split1 L1
(a)
jump M1
L1: (b)
M1: jump (empty) <-- unpatched_disjunction_jump for 'b'
After finishing the parsing of ``c``, the disjunction is over and the end
of the ``parse_disjunction()`` function patches the final pending
jump/split1 similarly to what is done after ``b``. We get::
split1 L2
split1 L1
(a)
jump M1
L1: (b)
M1: jump M2
L2: (c)
M2:
... which is the target bytecode.
Regexp feature implications
===========================
Quantifiers with a range
------------------------
Quantifiers with a minimum-maximum range (other than the simple ``*`` and
``+`` quantifiers) cannot be implemented conveniently with a basic NFA-based
design because the NFA does not have state for keeping a count of how many
times each instance of a certain quantifier has been repeated. This is not
trivial to fix, because a certain quantifier may be simultaneously active
multiple times with each quantifier instance having a separate, backtracked
counter.
Ranged quantifiers are not easy for backtracking matchers either.
Consider, for instance, the regexp ``/(?:x{3,4}){5}/``. The matcher needs
to track five separate ``/x{3,4}/`` quantifiers, each of which backtracks.
Even a recursive backtracking implementation cannot easily handle such
quantifiers without resorting to some form of long jumps or continuation
passing style. This is not apparent for simple non-hierarchical quantifier
expressions.
There are multiple ways to implement ranged quantifiers. One can implement
the recursive backtracking engine to incorporate them into the backtracking
logic. This seems to require a control structure that cannot consist of
simple recursion; rather, some form of long jumps or continuation passing
style seems to be required. Another approach is to expand such quantifiers
during compile time into an explicit sequence. For instance, ``/x{3,5}/``
would become, in effect, ``/xxx(?:x?x)?/``. Capture groups in the
expansions need to map to the same capture group number (this cannot be
expressed in a normal regular expression, but is easy with regexp bytecode
which has a ``save i`` instruction). This approach becomes a bit unwieldy
for large counts, e.g. for ``/x{500,10000}/``, though.
The current implementation uses the "bytecode expansion" approach to keep
the regular expression matching engine as simple as possible. Because
bytecode uses relative offsets, and ``DUK_REOP_SAVE`` has a fixed index,
the bytecode for an "atom" may be copied without complications.
Quantifiers and backtracking, simple quantifiers
------------------------------------------------
.. FIXME there is some duplication of discussion with the above section
on ranged quantifiers
Quantifiers (especially greedy) are problematic for a backtracking
implementation. A simple implementation of a backtracking greedy
quantifier (or a minimal one, for that matter) will require one level
of C recursion for each atom match. This is especially problematic
for expressions like::
.+
The recursion is essentially unavoidable for the general case in a
backtracking implementation. Consider, for instance::
(?:x{4,5}){7,8}
Here, each 'instance' of the inner quantifier will individually attempt
to match either 4 or 5 ``x`` characters. This cannot be easily
implemented without unbounded recursion in a backtracking matcher.
However, for many simple cases unbounded recursion *can* be avoided.
In this document, the term **simple quantifier** is used to refer to any
quantifier (greedy or minimal) whose atom fulfills all of the following
properties:
#. The quantifier atom has no alternatives in need of backtracking: it
either matches once or not at all
#. The input portion matching the atom always has the same character length
(though not necessarily the same *byte* length)
#. The quantifier atom has no captures or lookaheads
The first property eliminates the need to backtrack any matched atoms.
For instance, a minimal ``+`` quantifier can match the atom once and then
attempt to match the sequel. If the sequel match fails, it does not need to
consider an alternative match for the first atom match (there can be none).
Instead, it can simply proceed to match the atom once more, try the sequel
again, and so on. Note that although there are no alternatives for each
atom matched, the input portion matching the atom may be different for each
atom match. For instance, in ``.+`` the ``.`` can match a different
character each time. The important thing is that there are no alternative
matches for a *particular* match, like there are in ``(?:a|b)+``.
The second property is needed for greedy matching, where the quantifier
can first match the atom as many times as possible, and then try the
sequel. To 'undo' one atom match, we can simply rewind the input string
by the number of characters matched by the atom (which we know to be a
constant), and then try the sequel again. For instance, the atom length
for ``.+`` is 1, and for ``(?:.x[a-f])+`` it is 3. Because the particular
characters matching a certain atom instance may vary, we don't know the
byte length of the match in advance. To avoid remembering backtrack
positions (input offsets after each atom match) we rewind the input by
"atom length" UTF-8-encoded code points. This keeps a simple, greedy
quantifier stateless and avoids recursion.
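
A sketch of a stateless greedy simple quantifier along these lines is shown
below; the atom/sequel matching and UTF-8 backtracking helpers are
hypothetical::

  /* Match the quantified atom once at 'sp'; return the new string pointer
   * or NULL on failure (placeholder).
   */
  extern const unsigned char *match_atom_once(const unsigned char *sp);
  /* Match the rest of the pattern (the sequel) at 'sp' (placeholder). */
  extern const unsigned char *match_sequel(const unsigned char *sp);
  /* Step the pointer back over one (extended) UTF-8 encoded code point. */
  extern const unsigned char *utf8_backtrack(const unsigned char *sp);

  static const unsigned char *sqgreedy_sketch(const unsigned char *sp,
                                              unsigned long qmin, unsigned long qmax,
                                              unsigned long atomlen) {
      unsigned long q = 0;

      /* Greedy phase: consume as many atoms as allowed (up to qmax). */
      while (q < qmax) {
          const unsigned char *next = match_atom_once(sp);
          if (next == NULL) {
              break;
          }
          sp = next;
          q++;
      }

      /* Backtracking phase: try the sequel, rewinding one atom (= atomlen
       * characters) at a time as long as the minimum count stays satisfied.
       */
      for (;;) {
          if (q >= qmin) {
              const unsigned char *res = match_sequel(sp);
              if (res != NULL) {
                  return res;   /* overall match found */
              }
          }
          if (q <= qmin) {
              return NULL;      /* cannot back off below the minimum */
          }
          for (unsigned long i = 0; i < atomlen; i++) {
              sp = utf8_backtrack(sp);
          }
          q--;
      }
  }
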
The third property is needed because backtracking the ``saved`` array needs
C recursion right now. The condition might be avoidable quite easily for a
minimal quantifier, and with some effort also for a greedy quantifier (by
rematching the atom to refresh any captures). However, these haven't been
considered now. The requirement to have no lookaheads has a similar
motivation: lookaheads currently require recursion for ``saved`` array
management.
Simple quantifiers are expressed with ``DUK_REOP_SQMINIMAL`` and
``DUK_REOP_SQGREEDY``. The atom being matched *must* fulfill the conditions
described above; the compiler needs to track the simple-ness of an atom for
various nested atom expressions such as ``(?:a(?:.))[a-fA-F]``. In theory,
the following can also be expressed as a simple quantifier: ``(?:x{3})+``,
which expands to ``(?:xxx)+``, a simple quantifier with an atom length of 3.
The compiler is not this clever, though, at least not at the time of
writing.
Any quantifiers not matching the simple quantifier properties are complex
quantifiers, and are encoded as explicit bytecode sequences using e.g.
``DUK_REOP_SPLIT1``, ``DUK_REOP_SPLIT2``, and ``DUK_REOP_JUMP``.
Counted quantifiers are expanded by the compiler into straight bytecode.
For instance, ``(?:a|b){3,5}`` is expanded into (something like)
``(?:a|b)(?:a|b)(?:a|b)(?:(?:a|b)(?:a|b)?)?``. Capture groups inside the
atom being matched are encoded into two ``DUK_REOP_SAVE`` instructions.
The *same* save indices are used in the atom being expanded, so later atom
matches overwrite saved indices of earlier matches (which is correct
behavior). Such expressions cannot be expressed as ordinary regexps because
the same capture group index cannot be used twice.
Future work
===========
Compiler and lexer
------------------
* E5 Section 15.10.2.5, step 4 of RepeatMatcher: is it possible that ``cap[k]``
is defined for some ``k``, where ``k > parenCount + parenIndex``? If so, add
an example. This means that we can't just clear all captures for
``k > parenIndex``.
* Handling empty infinite quantifiers, as in: ``/(x*)*/``.
* The regexp lexer is quite simple and could perhaps be integrated into the
regexp compiler - at some loss of clarity but at some gain in code
compactness.
* Add an opcode for disjunctions specifically? Could this reduce the amount
of recursion (currently linear in the number of alternatives) required by
disjunctions?
Executor
--------
* Optimized primitive for testing a regexp (match without captures) would be
easy by just skipping 'save' instructions but would waste space.