Regexp doc RST fixes

v1.0-maintenance
Sami Vaarala 10 years ago
parent
commit
3c533bd30e
  1. 53
      doc/regexp.rst

@ -12,18 +12,26 @@ challenging. See the following three excellent articles by Russ Cox
for background:

* http://swtch.com/~rsc/regexp/regexp1.html
* http://swtch.com/~rsc/regexp/regexp2.html
* http://swtch.com/~rsc/regexp/regexp3.html

Ecmascript regular expression set is described in E5 Section 15.10,
and includes:

* Disjunction
* Quantifiers, counted repetition and both greedy and minimal variants
* Assertions, negative and positive lookaheads
* Character classes, normal and inverted
* Captures and backreferences
* Unicode character support
* Unanchored matching (only) (e.g. ``/x/.exec('fooxfoo')`` matches ``'x'``)

Counted repetition quantifiers, assertions, captures, and backreferences
@ -36,10 +44,14 @@ and compactness. More generally, the following prioritized requirements
should be fulfilled:

#. Ecmascript compatibility
#. Compactness
#. Avoiding deep or unbounded C recursion, and providing recursion and
   execution time sanity limits
#. Regexp execution performance
#. Regexp compilation performance

Further, it should be possible to leave out regexp support during
@ -411,11 +423,11 @@ which is useful for encoding bytecode jump distances.
The compiled regexp begins with a header, containing:

* unsigned integer: flags, any combination of ``DUK_RE_FLAG_*``
* unsigned integer: ``nsaved`` (number of save slots), which should be
  ``2n+2`` where ``n`` equals ``NCapturingParens`` (number of capture
  groups)
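
The ``nsaved`` relationship above can be illustrated with a minimal C sketch. The helper names are hypothetical and not part of the implementation, and the slot layout (slot ``2i`` for the start of capture ``i``, slot ``2i+1`` for its end, with capture 0 being the whole match) is an assumption made for illustration only:

```c
#include <assert.h>
#include <stddef.h>

/* nsaved = 2n + 2, where n is NCapturingParens (from the header description). */
static size_t nsaved_for(size_t ncapturing_parens) {
    return 2 * ncapturing_parens + 2;
}

/* ASSUMED layout: capture i uses slots 2i (start) and 2i+1 (end);
 * capture 0 stands for the entire match.
 */
static size_t save_slot_start(size_t capture_index) {
    return 2 * capture_index;
}

static size_t save_slot_end(size_t capture_index) {
    return 2 * capture_index + 1;
}

/* Small self-check: a regexp with three capture groups. */
static void nsaved_example_check(void) {
    size_t n = 3;
    assert(nsaved_for(n) == 8);
    /* The last slot used by the last capture is nsaved - 1. */
    assert(save_slot_end(n) == nsaved_for(n) - 1);
    assert(save_slot_start(0) == 0);
}
```

Under this assumed layout the extra ``+2`` accounts for the start and end of the overall match.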

Regexp body bytecode then follows. Each instruction consists of an opcode
value (``DUK_REOP_*``) (encoded as an unsigned integer) followed by a
@ -598,17 +610,17 @@ will be one byte shorter than ``len2``, but ``len2`` will be correct.
For instance, if the code block in the second example had been 1022 bytes
long:

* The first offset ``L1 - L2 - 1`` would be -1023 which is converted to
  the unsigned value ``2*1023+1 = 2047 = 0x7ff``. This encodes to two
  UTF-8 bytes, i.e. ``len1 = 2``.
* The second offset ``L1 - L2 - 1 - 2`` would be -1025 which is converted
  to the unsigned value ``2*1025+1 = 2051 = 0x803``. This encodes to
  *three* UTF-8 bytes, i.e. ``len2 = 3``.
* The final skip offset ``L1 - L2 - 1 - 3`` is -1026, which converts to
  the unsigned value ``2*1026+1 = 2053 = 0x805``. This again encodes to
  three UTF-8 bytes, and is thus "self consistent".

This could also be solved into closed form directly.
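
The worked example above can be reproduced with a short C sketch. This is not the actual compiler code: the function names are hypothetical, the mapping of non-negative values to ``2*x`` is an assumption (the text only shows the negative case ``2*|x|+1``), and the length table covers plain UTF-8 style encodings of up to six bytes. The loop simply re-encodes the offset until the encoded length stops changing:

```c
#include <assert.h>
#include <stdint.h>

/* Signed to unsigned mapping: negative x -> 2*|x| + 1 (as in the text);
 * non-negative x -> 2*x (ASSUMED, not stated in the text).
 */
static uint32_t encode_signed(int32_t x) {
    if (x < 0) {
        uint32_t mag = (uint32_t) (-(int64_t) x);
        return 2u * mag + 1u;
    }
    return 2u * (uint32_t) x;
}

/* Number of bytes needed to encode cp in UTF-8 style (up to 6 bytes). */
static int utf8_len(uint32_t cp) {
    if (cp < 0x80u) return 1;
    if (cp < 0x800u) return 2;
    if (cp < 0x10000u) return 3;
    if (cp < 0x200000u) return 4;
    if (cp < 0x4000000u) return 5;
    return 6;
}

/* base is the backward offset 'L1 - L2 - 1', i.e. without the bytes of the
 * encoded offset itself. Returns the self-consistent encoded length and
 * stores the final offset. Intended for negative (backward) offsets only.
 */
static int backjump_len(int32_t base, int32_t *final_offset) {
    int len = 0;
    int i;
    for (i = 0; i < 8; i++) {  /* lengths only grow, so this converges fast */
        int32_t off = base - len;
        int next = utf8_len(encode_signed(off));
        if (next == len) {
            break;
        }
        len = next;
    }
    *final_offset = base - len;
    return len;
}

/* Self-check against the numbers in the example above. */
static void backjump_example_check(void) {
    int32_t off = 0;
    assert(encode_signed(-1023) == 0x7ffu);  /* two bytes */
    assert(encode_signed(-1025) == 0x803u);  /* three bytes */
    assert(backjump_len(-1023, &off) == 3);
    assert(off == -1026);                    /* encodes as 0x805 */
}
```

For the 1022 byte block, the iteration visits the lengths 2 and 3 and then stops, matching ``len1`` and ``len2`` in the list above.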
@ -753,15 +765,15 @@ it during execution.
During regexp execution, regexp flags are kept in the regexp matching
context, and affect opcode execution as follows:

* global (``/g``): does not affect regexp execution, only the behavior of
  ``RegExp.prototype.exec()`` and ``RegExp.prototype.toString()``.
* ignoreCase (``/i``): affects all opcodes which match characters or
  character ranges, through the ``Canonicalize`` operation defined in
  E5 Section 15.10.2.8. It also affects ``RegExp.prototype.toString()``.
* multiline (``/m``): affects the start and end assertion opcodes
  (``^`` and ``$``). It also affects ``RegExp.prototype.toString()``.
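
As an illustration of how the multiline flag changes the start and end assertions, here is a minimal C sketch following the E5 assertion semantics (line terminators are LF, CR, U+2028 and U+2029). The flag constant and function names are hypothetical placeholders, not the actual ``DUK_RE_FLAG_*`` values or executor code, and the input is modelled as an array of codepoints for simplicity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit for this sketch; NOT the real DUK_RE_FLAG_* value. */
#define EX_FLAG_MULTILINE (1u << 2)

static int is_line_terminator(uint32_t cp) {
    return cp == 0x000au || cp == 0x000du || cp == 0x2028u || cp == 0x2029u;
}

/* '^': matches at input start; with multiline also right after a line terminator. */
static int assert_start(const uint32_t *input, size_t pos, unsigned flags) {
    if (pos == 0) return 1;
    if (!(flags & EX_FLAG_MULTILINE)) return 0;
    return is_line_terminator(input[pos - 1]);
}

/* '$': matches at input end; with multiline also right before a line terminator. */
static int assert_end(const uint32_t *input, size_t len, size_t pos, unsigned flags) {
    if (pos == len) return 1;
    if (!(flags & EX_FLAG_MULTILINE)) return 0;
    return is_line_terminator(input[pos]);
}

/* Self-check on the three codepoint input "a\nb". */
static void multiline_example_check(void) {
    const uint32_t in[3] = { 0x61u, 0x0au, 0x62u };
    assert(assert_start(in, 2, 0) == 0);                  /* 'b' is not at input start */
    assert(assert_start(in, 2, EX_FLAG_MULTILINE) == 1);  /* but it starts a line */
    assert(assert_end(in, 3, 1, 0) == 0);
    assert(assert_end(in, 3, 1, EX_FLAG_MULTILINE) == 1); /* just before the LF */
}
```

The global flag has no counterpart here because, as noted above, it does not affect opcode execution.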

A bytecode opcode for matching a string instead of an individual character
seems useful at first glance. The compiler could join successive
@ -1201,4 +1213,3 @@ Executor
* Optimized primitive for testing a regexp (match without captures) would be
  easy by just skipping 'save' instructions but would waste space.
