=========================
URI encoding and decoding
=========================

Specification notes
===================

URI syntax
----------

E5.1 Annex F::

  15.1.3: Added notes clarifying that ECMAScript's URI syntax is based upon
  RFC 2396 and not the newer RFC 3986. In the algorithm for Decode, a step
  was removed that immediately preceded the current step 4.d.vii.10.a
  because it tested for a condition that cannot occur.

Changes from RFC 2396 to RFC 3986 are summarized in RFC 3986:

* http://tools.ietf.org/html/rfc3986#appendix-D

Changes relevant to Ecmascript include:

* Additional characters in "reserved" set.

  - RFC 2396::

     reserved      = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                     "$" | ","

     ; / ? : @ & = + $ ,

  - RFC 3986::

     reserved      = gen-delims / sub-delims
     gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
     sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                   / "*" / "+" / "," / ";" / "="

     : / ? # [ ] @ ! $ & ' ( ) * + , ; =

  - New characters in RFC 3986 are::

     # [ ] ! ' ( ) *

Effect on decoding: hex escapes are not decoded into reserved characters.
However, the characters RFC 3986 added to the reserved set are decoded
normally because they are not part of the Ecmascript reserved set (with
the exception of '#', see below).  Thus::

  decodeURI("%23%5B%5D%21%27%28%29%2A") -> "%23[]!'()*"

The '#' character is explicitly added to the reserved set by the
decodeURI() algorithm in E5.1 Section 15.1.3.1.
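
As a quick illustration of the above, the following sketch compares
decodeURI() with decodeURIComponent(), whose reserved set is empty.  The
variable names are made up for this example, and the values in the comments
assume an engine that follows E5.1 Section 15.1.3::

  var escaped = "%23%5B%5D%21%27%28%29%2A";   // escapes for  # [ ] ! ' ( ) *

  // '#' is in the decodeURI() reserved set, so its escape is kept as-is.
  var decodedUri = decodeURI(escaped);                 // "%23[]!'()*"

  // decodeURIComponent() has an empty reserved set: everything is decoded.
  var decodedComponent = decodeURIComponent(escaped);  // "#[]!'()*"

  // Escapes for the RFC 2396 reserved characters  ; / ? : @ & = + $ ,
  // pass through decodeURI() unchanged.
  var rfc2396Escapes = "%3B%2F%3F%3A%40%26%3D%2B%24%2C";
  var keptReserved = decodeURI(rfc2396Escapes);        // same as the input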

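Effect on encoding: reserved characters are not encoded into hex escapes.
However, the characters RFC 3986 added to the reserved set get no special
treatment because Ecmascript does not recognize them as reserved; they are
escaped unless they belong to the unescaped set for some other reason::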

  encodeURI("#[]!'()*") -> "#%5B%5D!'()*"

The '#' character is explicitly added to the unescaped set by the
encodeURI() algorithm in E5.1 Section 15.1.3.3.  The characters
``!'()*`` are already part of the uriMark production which goes into
uriUnescaped.  Brackets are not included in either set so they get escaped
in Ecmascript.
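
The encoding side can be checked the same way.  This is only a sketch with
made-up variable names; encodeURIComponent() is included for contrast because
its unescaped set is uriUnescaped only, without uriReserved or '#'::

  var rfc3986Additions = "#[]!'()*";

  // '#' and the uriMark characters !'()* are left alone, brackets are escaped.
  var encodedUri = encodeURI(rfc3986Additions);                 // "#%5B%5D!'()*"

  // Without uriReserved and '#', the '#' character gets escaped too.
  var encodedComponent = encodeURIComponent(rfc3986Additions);  // "%23%5B%5D!'()*"

  // RFC 2396 reserved characters are never escaped by encodeURI().
  var rfc2396Reserved = ";/?:@&=+$,";
  var encodedReserved = encodeURI(rfc2396Reserved);             // same as the input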

Reserved set / unescaped set
----------------------------

The "unescaped set" for encoding and the "reserved set" for decoding always
consist of only ASCII codepoints.  Thus comparing codepoints against the sets
should only be necessary when processing ASCII range characters.

When encoding, step 4.c will catch characters in the "unescaped set" and
copy them as-is into the output.  Note that these can only be single-byte
ASCII characters.  If we go to step 4.d, the codepoint may either be ASCII
or non-ASCII, and will be escaped regardless.
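
A minimal sketch of the ASCII-only membership test described above, written
for the encodeURI() unescaped set.  The names ``ENCODE_URI_UNESCAPED`` and
``inUnescapedSet`` are hypothetical and are not taken from the Duktape
sources; a real implementation would more likely use a bitmap than a string
search::

  // uriReserved + uriMark + uriAlpha + DecimalDigit + "#" (E5.1 15.1.3.3)
  var ENCODE_URI_UNESCAPED =
      ";/?:@&=+$," +
      "-_.!~*'()" +
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
      "abcdefghijklmnopqrstuvwxyz" +
      "0123456789" +
      "#";

  function inUnescapedSet(cp) {
      // The set contains only ASCII codepoints, so anything at or above
      // 0x80 can be rejected without looking at the set at all.
      if (cp >= 0x80) {
          return false;
      }
      return ENCODE_URI_UNESCAPED.indexOf(String.fromCharCode(cp)) >= 0;
  }

For every ASCII codepoint, ``inUnescapedSet(cp)`` should be true exactly when
encodeURI() returns the character unchanged.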

When decoding percent escaped codepoints, one-byte encoded codepoints (i.e.
ASCII) are checked in step 4.d.vi; multi-byte encoded codepoints in the BMP
range are checked in step 4.d.vii but codepoints above BMP are not checked.

Apparently the idea here is to ensure no characters in the reserved set are
decoded from percent escapes even if invalid UTF-8 (non-shortest) encodings
are allowed.  Because characters above BMP are encoded with surrogate pairs,
the formula for surrogate pairs ensures that the codepoint cannot be below
U+00010000 (0x10000 is added to the surrogate pair bits), and thus no check
against the "reserved set" is needed.
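
The surrogate pair arithmetic can be sketched as follows.  The function names
are made up for this example; the formulas are intended to mirror the ones
used for V >= 0x10000 in the E5.1 Decode algorithm::

  // Split a codepoint V >= 0x10000 into a high/low surrogate pair.
  function toSurrogatePair(v) {
      var offset = v - 0x10000;
      var high = ((offset >> 10) & 0x3FF) + 0xD800;
      var low = (offset & 0x3FF) + 0xDC00;
      return String.fromCharCode(high, low);
  }

  // Combine a surrogate pair back into a codepoint.  Because 0x10000 is
  // always added, the result can never be an ASCII codepoint and hence can
  // never be in the reserved set.
  function fromSurrogatePair(high, low) {
      return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
  }

  var smallest = fromSurrogatePair(0xD800, 0xDC00);   // 0x10000
  var pair = toSurrogatePair(0x1F600);                // "\uD83D\uDE00"
  var decoded = decodeURI("%F0%9F%98%80");            // UTF-8 for U+1F600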

However, at the end of Section 15.1.3:

  RFC 3629 prohibits the decoding of invalid UTF-8 octet sequences. For
  example, the invalid sequence C0 80 must not decode into the character
  U+0000. Implementations of the Decode algorithm are required to throw a
  URIError when encountering such invalid sequences.

Because the "reserved set" and the "unescaped set" always consist of only
ASCII codepoints, and non-shortest encodings must be rejected anyway, the
check in step 4.d.vii should not be necessary.  The UTF-8 validity check
happens in step 4.d.vii.8.
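
The following sketch shows why the reserved set check is redundant for
multi-byte sequences: non-shortest encodings that would otherwise map to
ASCII characters are rejected before any lookup could matter.  The helper
name ``throwsURIError`` is made up for this example, and the results assume
an engine that enforces the RFC 3629 requirement quoted above::

  function throwsURIError(fn) {
      try {
          fn();
          return false;
      } catch (e) {
          return e instanceof URIError;
      }
  }

  // C0 80 is a non-shortest encoding of U+0000.
  var rejectsNul = throwsURIError(function () { decodeURI("%C0%80"); });

  // C0 A3 would be a non-shortest encoding of '#' (U+0023), which is in
  // the decodeURI() reserved set.
  var rejectsHash = throwsURIError(function () { decodeURI("%C0%A3"); });

  // C0 AF would be a non-shortest encoding of '/' (U+002F).
  var rejectsSlash = throwsURIError(function () { decodeURI("%C0%AF"); });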

Decoding characters outside BMP
-------------------------------

The URI decoding algorithm requires that UTF-8 encoded codepoints consisting
of more than 4 encoded bytes are rejected.  A 4-byte encoding carries 21 bits
(3 + 6 + 6 + 6), so the maximum codepoint which can be expressed is U+1FFFFF.
However, since the bytes must also be valid UTF-8 (step 4.d.vii.8) the highest
allowed codepoint is actually U+10FFFF.
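
The limits above can be illustrated as follows.  The helper and variable names
are made up for this example, and the decodeURI() results assume an engine
that follows the standard strictly::

  function rejectsInput(input) {
      try {
          decodeURI(input);
          return false;
      } catch (e) {
          return e instanceof URIError;
      }
  }

  // 4-byte form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 = 21 bits.
  var maxFourByte = (0x07 << 18) | (0x3F << 12) | (0x3F << 6) | 0x3F;  // 0x1FFFFF

  // F4 8F BF BF encodes U+10FFFF, the highest codepoint accepted.
  var highest = decodeURI("%F4%8F%BF%BF");               // "\uDBFF\uDFFF"

  // F4 90 80 80 would encode U+110000, which is not valid UTF-8.
  var rejectsAboveMax = rejectsInput("%F4%90%80%80");

  // F8 starts a 5-byte sequence, which is rejected outright.
  var rejectsFiveByte = rejectsInput("%F8%88%80%80%80");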

It would be nice to be able to:

* decode higher codepoints because Duktape can represent them

* decode codepoints up to U+10FFFF without surrogate pairs

Because the API requirements are strict, these cannot be added to the standard
API without breaking compliance.  Custom URI encoding/decoding functions could
provide these extended semantics.
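
A rough sketch of what such a custom decoder might look like is given below.
It is purely hypothetical: ``decodeToCodepoints`` and ``readHexByte`` are
made-up names, not an existing Duktape or Ecmascript API.  The sketch returns
an array of codepoints instead of a string so that values up to U+1FFFFF can
be represented without surrogate pairs, and it has no reserved set (like
decodeURIComponent())::

  function readHexByte(s, i) {
      var hex = s.substring(i + 1, i + 3);
      if (s.charAt(i) !== "%" || !/^[0-9A-Fa-f]{2}$/.test(hex)) {
          throw new URIError("invalid escape");
      }
      return parseInt(hex, 16);
  }

  function decodeToCodepoints(input) {
      var out = [];
      var i = 0;
      while (i < input.length) {
          if (input.charAt(i) !== "%") {
              out.push(input.charCodeAt(i));   // unescaped code unit, copied as-is
              i++;
              continue;
          }
          var b = readHexByte(input, i);
          i += 3;
          var n, cp, min;
          if (b < 0x80) {
              out.push(b);
              continue;
          } else if ((b & 0xE0) === 0xC0) {
              n = 1; cp = b & 0x1F; min = 0x80;
          } else if ((b & 0xF0) === 0xE0) {
              n = 2; cp = b & 0x0F; min = 0x800;
          } else if ((b & 0xF8) === 0xF0) {
              n = 3; cp = b & 0x07; min = 0x10000;
          } else {
              throw new URIError("invalid initial byte");
          }
          while (n-- > 0) {
              var cont = readHexByte(input, i);
              if ((cont & 0xC0) !== 0x80) {
                  throw new URIError("invalid continuation byte");
              }
              cp = (cp << 6) | (cont & 0x3F);
              i += 3;
          }
          if (cp < min) {
              throw new URIError("non-shortest encoding");
          }
          out.push(cp);   // no U+10FFFF limit and no surrogate pair split
      }
      return out;
  }

With this sketch ``decodeToCodepoints("%F7%BF%BF%BF")`` yields a single
element, 0x1FFFFF, whereas the standard decodeURI() must reject the same
input.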