
Commit daf61a2

committed
Add XML declaration encoding sniffing
Closes whatwg#1438, where we found out that this is required for web compatibility. The algorithm given here is an exact copy of that used by WebKit and Blink, with the exception that it does not detect UTF-32 byte sequences since in web-standards-world, UTF-32 must not be supported.
1 parent 85227d2 commit daf61a2

1 file changed

+108
-7
lines changed

source

Lines changed: 108 additions & 7 deletions
@@ -99996,19 +99996,71 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
9999699996
encoding</dfn>, given some defined <var>end condition</var>, then it must run the
9999799997
following steps. These steps either abort unsuccessfully or return a character encoding. If at any
9999899998
point during these steps (including during instances of the <span
99999-
data-x="concept-get-attributes-when-sniffing">get an attribute</span> algorithm invoked by this
100000-
one) the user agent either runs out of bytes (meaning the <var>position</var> pointer
100001-
created in the first step below goes beyond the end of the byte stream obtained so far) or reaches
100002-
its <var>end condition</var>, then abort the <span>prescan a byte stream to determine its
100003-
encoding</span> algorithm unsuccessfully.</p>
99999+
data-x="concept-get-attributes-when-sniffing">get an attribute</span> algorithm invoked by
100000+
this one) the user agent either runs out of bytes (meaning the <var>position</var> pointer
100001+
created in the second step below goes beyond the end of the byte stream obtained so far) or
100002+
reaches its <var>end condition</var>, then if the below <var>fallback encoding</var> variable is
100003+
set to a non-null value, abort the <span>prescan a byte stream to determine its encoding</span>
100004+
algorithm with <var>fallback encoding</var> as the encoding; otherwise, abort the algorithm
100005+
unsuccessfully.</p>
100004100006

100005100007
<ol>
100006100008

100009+
<li><p>Let <var>fallback encoding</var> be null.</p></li>
100010+
100011+
<li><p>Let <var>position</var> be a pointer to a byte in the input byte stream, initially
100012+
pointing at the first byte.</p></li>
100013+
100007100014
<li>
100015+
<p>Prescan for XML declarations: If <var>position</var> points to:</p>
100016+
100017+
<dl class="switch">
100018+
<dt>A sequence of bytes starting with: 0x3C, 0x3F, 0x78, 0x6D, 0x6C (case-sensitive ASCII
100019+
'&lt;?xml')</dt>
100020+
<dd>
100021+
<p><span data-x="concept-get-xml-encoding-when-sniffing">Get an XML encoding</span>. If this
100022+
does not return failure, set <var>fallback encoding</var> to the returned encoding, and then
100023+
continue with this algorithm.</p>
100024+
</dd>
100025+
100026+
<dt>A sequence of bytes starting with: 0x3C, 0x0, 0x3F, 0x0, 0x78, 0x0 (case-sensitive UTF-16
100027+
little-endian '&lt;?x')</dt>
100028+
<dd>
100008100029

100009-
<p>Let <var>position</var> be a pointer to a byte in the input byte stream, initially
100010-
pointing at the first byte.</p>
100030+
<p>Abort the <span>prescan a byte stream to determine its encoding</span> algorithm,
100031+
returning <span>UTF-16LE</span>.</p>
100032+
100033+
</dd>
100011100034

100035+
<dt>A sequence of bytes starting with: 0x0, 0x3C, 0x0, 0x3F, 0x0, 0x78 (case-sensitive UTF-16
100036+
big-endian '&lt;?x')</dt>
100037+
<dd>
100038+
<p>Abort the <span>prescan a byte stream to determine its encoding</span> algorithm,
100039+
returning <span>UTF-16BE</span>.</p>
100040+
</dd>
100041+
100042+
<!-- the Encoding Standard doesn't support UTF-32:
100043+
https://github.com/whatwg/html/issues/1438#issuecomment-245142577
100044+
100045+
<dt>A sequence of bytes starting with: 0x3C, 0x0, 0x0, 0x0, 0x3F, 0x0, 0x0, 0x0 (case-sensitive
100046+
UTF-32 little-endian '&lt;?')</dt>
100047+
<dd>
100048+
<p>Abort the <span>prescan a byte stream to determine its encoding</span> algorithm,
100049+
returning <span>UTF-32LE</span>.</p>
100050+
</dd>
100051+
100052+
<dt>A sequence of bytes starting with: 0x0, 0x0, 0x0, 0x3C, 0x0, 0x0, 0x0, 0x3F (case-sensitive
100053+
UTF-32 big-endian '&lt;?')</dt>
100054+
<dd>
100055+
<p>Abort the <span>prescan a byte stream to determine its encoding</span> algorithm,
100056+
returning <span>UTF-32BE</span>.</p>
100057+
</dd>
100058+
-->
100059+
</dl>
100060+
100061+
<p class="note">Prescanning for XML declarations, even in HTML documents, must be done for
100062+
compatibility with legacy content. See <a
100063+
href="https://github.com/whatwg/html/issues/1438">issue #1438</a>.</p>
100012100064
</li>
100013100065

100014100066
<li>
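The prescan step added in the hunk above can be sketched in Python as follows. This is a minimal illustration, not the specification's or any browser's implementation: the function name `prescan_xml_declaration`, the tuple return shape, and the `get_xml_encoding` callback parameter are assumptions introduced here. The sketch assumes the ASCII case matches the five bytes of `<?xml`, and that the UTF-16 cases match the six bytes listed in the hunk.

```python
from typing import Callable, Optional, Tuple


def prescan_xml_declaration(
    data: bytes,
    get_xml_encoding: Callable[[bytes], Optional[str]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (definite_encoding, fallback_encoding) for the start of `data`.

    A definite encoding means the prescan aborts and returns it at once.
    A fallback encoding is only used if the rest of the prescan finds nothing.
    """
    # ASCII '<?xml': try to read the encoding label, keep it as a fallback.
    if data.startswith(b"<?xml"):
        return None, get_xml_encoding(data)
    # UTF-16 little-endian '<?x' (0x3C 0x00 0x3F 0x00 0x78 0x00).
    if data.startswith(b"<\x00?\x00x\x00"):
        return "UTF-16LE", None
    # UTF-16 big-endian '<?x' (0x00 0x3C 0x00 0x3F 0x00 0x78).
    if data.startswith(b"\x00<\x00?\x00x"):
        return "UTF-16BE", None
    # No XML declaration found: continue with the rest of the prescan.
    return None, None
```

Returning the ASCII result as a fallback rather than a final answer mirrors the text of the hunk, where a later `meta` declaration can still take effect before the fallback is used.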
@@ -100299,6 +100351,55 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
100299100351

100300100352
</ol>
100301100353

100354+
<p>When the <span>prescan a byte stream to determine its encoding</span> algorithm says to <dfn
100355+
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</dfn>, it means doing
100356+
this. If at any point during these steps the <var>encodingPosition</var> pointer created in the
100357+
first step below goes beyond the end of the byte stream obtained so far, abort the <span
100358+
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</span> algorithm and return
100359+
failure.</p>
100360+
100361+
<ol>
100362+
<li><p>Let <var>encodingPosition</var> be a distinct pointer to the same place in the input byte
100363+
stream as <var>position</var>.</p></li>
100364+
100365+
<li><p>Let <var>xmlDeclarationEnd</var> be a pointer to the next byte in the input byte
100366+
stream which is 0x3E (ASCII '>'). If there is no such byte, abort the <span
100367+
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</span> algorithm
100368+
and return failure.</p></li>
100369+
100370+
<li><p>Set <var>encodingPosition</var> to the position of the first occurrence of the subsequence
100371+
of bytes 0x65, 0x6E, 0x63, 0x6F, 0x64, 0x69, 0x6E, 0x67 (ASCII 'encoding') at or after the
100372+
current <var>encodingPosition</var>. If there is no such sequence, abort the <span
100373+
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</span> algorithm
100374+
and return failure.</p></li>
100375+
100376+
<li><p>Advance <var>encodingPosition</var> past the 0x67 (ASCII 'g') byte.</p></li>
100377+
100378+
<li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e. it is
100379+
either an ASCII space or control character), advance <var>encodingPosition</var> to the next
100380+
byte.</p></li>
100381+
100382+
<li><p>If the byte at <var>encodingPosition</var> is not 0x3D (ASCII =), abort the <span
100383+
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</span> algorithm
100384+
and return failure.</p></li>
100385+
100386+
<li><p>Let <var>quoteMark</var> be the byte at <var>encodingPosition</var>.</p></li>
100387+
100388+
<li><p>Advance <var>encodingPosition</var> to the next byte.</p></li>
100389+
100390+
<li><p>Let <var>encodingEndPosition</var> be the position of the next occurrence of
100391+
<var>quoteMark</var> at or after <var>encodingPosition</var>. If <var>quoteMark</var> does not
100392+
occur again, abort the <span data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding
100393+
</span> algorithm and return failure.</p></li>
100394+
100395+
<li><p>Let <var>potentialEncoding</var> be the Unicode string whose code points are the same as
100396+
the values of the bytes between <var>encodingPosition</var> (inclusive) and
100397+
<var>encodingEndPosition</var> (exclusive).</p></li>
100398+
100399+
<li><p>Return the result of <span>getting an encoding</span> given
100400+
<var>potentialEncoding</var>.</p></li>
100401+
</ol>
100402+
100302100403
<p>For the sake of interoperability, user agents should not use a pre-scan algorithm that returns
100303100404
different results than the one described above. (But, if you do, please at least let us know, so
100304100405
that we can improve this algorithm and benefit everyone...)</p>
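The "get an XML encoding" steps in the second hunk can be sketched in Python as below. This is a hedged illustration only: the function name `get_xml_encoding_label` is hypothetical, and the sketch returns the raw label instead of performing the final "getting an encoding" lookup from the Encoding Standard. It also assumes that the intent is to advance past the `=` byte before reading the quote mark. Read literally, the steps would take the `=` byte itself as the quote mark.

```python
from typing import Optional


def get_xml_encoding_label(data: bytes, position: int = 0) -> Optional[str]:
    """Return the encoding label from an XML declaration, or None on failure."""
    # The declaration must contain a closing '>' (0x3E) somewhere ahead.
    if data.find(b">", position) == -1:
        return None
    # Find the ASCII bytes 'encoding' and move just past the final 'g'.
    pos = data.find(b"encoding", position)
    if pos == -1:
        return None
    pos += len(b"encoding")
    # Skip bytes <= 0x20 (ASCII space and control characters).
    while pos < len(data) and data[pos] <= 0x20:
        pos += 1
    # The next byte must be '=' (0x3D); running out of bytes is a failure.
    if pos >= len(data) or data[pos] != 0x3D:
        return None
    # ASSUMPTION (not in the steps as written): advance past '='.
    pos += 1
    if pos >= len(data):
        return None
    quote_mark = data[pos]
    pos += 1
    # The label runs up to, but not including, the next quote mark.
    end_pos = data.find(bytes([quote_mark]), pos)
    if end_pos == -1:
        return None
    # Map each byte value to the code point with the same value.
    return data[pos:end_pos].decode("latin-1")
```

Decoding with `latin-1` is one way to build a string whose code points equal the byte values, as the penultimate step describes. A caller would still need to pass the result to an Encoding Standard label lookup to obtain an actual encoding.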
