duktape/doc/uri.rst


								=========================

								URI encoding and decoding

								=========================


								Specification notes

								===================


								URI syntax

								----------


								E5.1 Annex F::


								  15.1.3: Added notes clarifying that ECMAScript's URI syntax is based upon

								  RFC 2396 and not the newer RFC 3986. In the algorithm for Decode, a step

								  was removed that immediately preceded the current step 4.d.vii.10.a

								  because it tested for a condition that cannot occur.


								Changes from RFC 2396 to RFC 3986 are summarized in RFC 3986:


								* http://tools.ietf.org/html/rfc3986#appendix-D


								Changes relevant to Ecmascript include:


								* Additional characters in "reserved" set.


								  - RFC 2396::


								     reserved      = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |

								                     "$" | ","


								     ; / ? : @ & = + $ ,


								  - RFC 3986::


								     reserved      = gen-delims / sub-delims

								     gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"

								     sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"

								                   / "*" / "+" / "," / ";" / "="


								     : / ? # [ ] @ ! $ & ' ( ) * + , ; =


								  - New characters in RFC 3986 are::


								     # [ ] ! ' ( ) *


								Effect on decoding: don't decode hex escapes into reserved characters.

								However, RFC 3986 additional characters should be decoded normally

								because they're not supported in Ecmascript.  Thus::


								  decodeURI("%23%5B%5D%21%27%28%29%2A") -> "%23[]!'()*"


								The '#' character is explicitly added to the reserved set by the

								decodeURI() algorithm in E5.1 Section 15.1.3.1.


								Effect on encoding: don't encode into hex escapes.  However, RFC 3986

								additional characters should be escaped normally because they're not

								supported::


								  encodeURI("#[]!'()*") -> "#%5B%5D!'()*"


								The '#' character is explicitly added to the reserved set by the

								decodeURI() algorithm in E5.1 Section 15.1.3.1.  The characters

								``!'()*`` are already part of the uriMark production which goes into

								uriUnescaped.  Brackets are not included so they get escaped in

								Ecmascript.


								Reserved set / unescaped set

								----------------------------


								The "unescaped set" for encoding and the "reserved set" for decoding always

								consist of only ASCII codepoints.  Thus comparing codepoints against the sets

								should only be necessary when processing ASCII range characters.


								When encoding, step 4.c will catch characters in the "unescaped set" and

								encode them as-is into the output.  Note that these can only be single-byte

								ASCII characters.  If we go to step 4.d, the codepoint may either be ASCII

								or non-ASCII, and will be escaped regardless.


								When decoding percent escaped codepoints, one-byte encoded codepoints (i.e.

								ASCII) are checked in step 4.d.vi; multi-byte encoded codepoints in the BMP

								range are checked in step 4.d.vii but codepoints above BMP are not checked.


								Apparently the idea here is to ensure no characters in the reserved set are

								decoded from percent escapes even if invalid UTF-8 (non-shortest) encodings

								are allowed.  Because characters above BMP are encoded with surrogate pairs,

								the formula for surrogate pairs ensures that the codepoint cannot be below

								U+00010000 (0x10000 is added to the surrogate pair bits), and thus no check

								against the "reserved set" is needed.


								However, at the end of Section 15.1.3:


								  RFC 3629 prohibits the decoding of invalid UTF-8 octet sequences. For

								  example, the invalid sequence C0 80 must not decode into the character

								  U+0000. Implementations of the Decode algorithm are required to throw a

								  URIError when encountering such invalid sequences.


								Because "reserved set" / "unescaped set" always consists of only ASCII

								codepoints, the check in step 4.d.vii should not be necessary.  The UTF-8

								validity check happens in step 4.d.vii.8.


								Decoding characters outside BMP

								-------------------------------


								The URI decoding algorithm requires that UTF-8 encoded codepoints consisting

								of more than 4 encoded bytes are rejected.  4 byte encoding contains 21 bits,

								so the maximum codepoint which can be expressed is U+1FFFFF.  However, since

								the bytes must also be valid UTF-8 (step 4.d.vii.8) the highest allowed

								codepoint is actually U+10FFFF.


								It would be nice to be able to:


								* decode higher codepoints because Duktape can represent them


								* decode codepoints up to U+10FFFF without surrogate pairs


								Because the API requirements are strict, these cannot be added to the standard

								API without breaking compliance.  Custom URI encoding/decoding functions could

								provide these extended semantics.