diff --git a/doc/regexp.rst b/doc/regexp.rst
index 62ff3a0f..b4fec9e5 100644
--- a/doc/regexp.rst
+++ b/doc/regexp.rst
@@ -12,18 +12,26 @@ challenging.
 See the following three excellent articles by Russ Cox for background:
 
 * http://swtch.com/~rsc/regexp/regexp1.html
+
 * http://swtch.com/~rsc/regexp/regexp2.html
+
 * http://swtch.com/~rsc/regexp/regexp3.html
 
 Ecmascript regular expression set is described in E5 Section 15.10, and
 includes:
 
 * Disjunction
+
 * Quantifiers, counted repetition and both greedy and minimal variants
+
 * Assertions, negative and positive lookaheads
+
 * Character classes, normal and inverted
+
 * Captures and backreferences
+
 * Unicode character support
+
 * Unanchored matching (only) (e.g. ``/x/.exec('fooxfoo')`` matches ``'x'``)
 
 Counted repetition quantifiers, assertions, captures, and backreferences
@@ -36,10 +44,14 @@ and compactness.
 More generally, the following prioritized requirements should be fulfilled:
 
 #. Ecmascript compatibility
+
 #. Compactness
+
 #. Avoiding deep or unbounded C recursion, and providing recursion and
    execution time sanity limits
+
 #. Regexp execution performance
+
 #. Regexp compilation performance
 
 Further, it should be possible to leave out regexp support during
@@ -411,11 +423,11 @@ which is useful for encoding bytecode jump distances.
 
 The compiled regexp begins with a header, containing:
 
- * unsigned integer: flags, any combination of ``DUK_RE_FLAG_*``
+* unsigned integer: flags, any combination of ``DUK_RE_FLAG_*``
 
- * unsigned integer: ``nsaved`` (number of save slots), which should be
-   ``2n+2`` where ``n`` equals ``NCapturingParens`` (number of capture
-   groups)
+* unsigned integer: ``nsaved`` (number of save slots), which should be
+  ``2n+2`` where ``n`` equals ``NCapturingParens`` (number of capture
+  groups)
 
 Regexp body bytecode then follows. Each instruction consists of an opcode
 value (``DUK_REOP_*``) (encoded as an unsigned integer) followed by a
@@ -598,17 +610,17 @@ will be one byte shorter than ``len2``, but ``len2`` will be correct.
 For instance, if the code block in the second example had been 1022 bytes
 long:
 
- * The first offset ``L1 - L2 - 1`` would be -1023 which is converted to
-   the unsigned value ``2*1023+1 = 2047 = 0x7ff``. This encodes to two
-   UTF-8 bytes, i.e. ``len1 = 2``.
+* The first offset ``L1 - L2 - 1`` would be -1023 which is converted to
+  the unsigned value ``2*1023+1 = 2047 = 0x7ff``. This encodes to two
+  UTF-8 bytes, i.e. ``len1 = 2``.
 
- * The second offset ``L1 - L2 - 1 - 2`` would be -1025 which is converted
-   to the unsigned value ``2*1025+1 = 2051 = 0x803``. This encodes to
-   *three* UTF-8 bytes, i.e. ``len2 = 3``.
+* The second offset ``L1 - L2 - 1 - 2`` would be -1025 which is converted
+  to the unsigned value ``2*1025+1 = 2051 = 0x803``. This encodes to
+  *three* UTF-8 bytes, i.e. ``len2 = 3``.
 
- * The final skip offset ``L1 - L2 - 1 - 3`` is -1026, which converts to
-   the unsigned value ``2*1026+1 = 2053 = 0x805``. This again encodes to
-   three UTF-8 bytes, and is thus "self consistent".
+* The final skip offset ``L1 - L2 - 1 - 3`` is -1026, which converts to
+  the unsigned value ``2*1026+1 = 2053 = 0x805``. This again encodes to
+  three UTF-8 bytes, and is thus "self consistent".
 
 This could also be solved into closed form directly.
 
@@ -753,15 +765,15 @@ it during execution.
 During regexp execution, regexp flags are kept in the regexp matching
 context, and affect opcode execution as follows:
 
- * global (``/g``): does not affect regexp execution, only the behavior of
-   ``RegExp.prototype.exec()`` and ``RegExp.prototype.toString()``.
+* global (``/g``): does not affect regexp execution, only the behavior of
+  ``RegExp.prototype.exec()`` and ``RegExp.prototype.toString()``.
 
- * ignoreCase (``/i``): affects all opcodes which match characters or
-   character ranges, through the ``Canonicalize`` operation defined in
-   E5 Section 15.10.2.8. It also affects ``RegExp.prototype.toString()``.
+* ignoreCase (``/i``): affects all opcodes which match characters or
+  character ranges, through the ``Canonicalize`` operation defined in
+  E5 Section 15.10.2.8. It also affects ``RegExp.prototype.toString()``.
 
- * multiline (``/m``): affects the start and end assertion opcodes
-   (``^`` and ``$``). It also affects ``RegExp.prototype.toString()``.
+* multiline (``/m``): affects the start and end assertion opcodes
+  (``^`` and ``$``). It also affects ``RegExp.prototype.toString()``.
 
 A bytecode opcode for matching a string instead of an individual character
 seems useful at first glance. The compiler could join successive
@@ -1201,4 +1213,3 @@ Executor
 
 * Optimized primitive for testing a regexp (match without captures) would be
   easy by just skipping 'save' instructions but would waste space.
-