Sunday, September 11, 2022

Why don't we ever talk about volunteer PMs in open source?

 Recently, on Wikipedia, there was an open letter to the Wikimedia Foundation, asking them to improve the New Page Patrol feature.

This started the usual debate between, WMF should do something vs It is open source, {{sofixit}} (i.e. Send a patch). There's valid points on both sides of that debate, which I don't really want to get into.

However, it occurred to me - the people on the {{sofixit}} side always suggest that people should learn how to program (an unreasonable ask), figure out how to fix something, and do it themselves. On the other hand, in a corporate environment, stuff is never done solely by developers. You usually have either a product manager or a program manager organizing the work.

Instead of saying to users - learn PHP and submit a patch, why don't we say: Be the PM for the things you want done, so a programmer can easily just do them without getting bogged down with organizational questions?

At first glance this may sound crazy - after all, ordinary users have no authority. Being a PM is hard enough when people are paid to listen to you, how could it possibly work if nobody has to listen to you. And I agree - not everything a PM does is applicable here, but i think some things are.

Some things a volunteer person could potentially do:

  • Make sure that bugs are clearly described with requirements, so a developer could just do them instead of trying to figure out what the users need
  • Make sure tasks are broken down into appropriate sized tickets
  • Make a plan of what they wish would happen. A volunteer can't force people to follow their plan, but if you have a plan people may just follow it. Too often all that is present is a big list of bugs of varying priority which is hard for a developer to figure out what is important and what isn't
    • For example, what i mean is breaking things into a few milestones, and having each milestone contain a small number (3-5) tickets around a similar theme. This could then be used in order to promote the project to volunteer developers, using language like "Help us achieve milestone 2" and track progress. Perhaps even gamifying things.
    • No plan survives contact with the enemy of course, and the point isn't to stick to any plan religiously. The point is to have a short list of what the most pressing things to work on right now are. Half the battle is figuring out what to work on and what to work on first.
  • Coordinate with other groups as needed. Sometimes work might depend on other work other people have planned to do. Or perhaps the current work is dependent on someone else's requirements (e.g. new extensions require security review). Potentially a volunteer PM could help coordinate this or help ensure that everyone is on the same page about expectations and requirements.
  • [not sure about this one] Help find constructive code reviewers. In MediaWiki development, code much be reviewed by another developer to be merged in. Finding knowledgeable people can often be difficult and a lot of effort. Sometimes this comes down to personal relationships and politely nagging people until someone bites. For many developers this is a frustrating part of the software development process. Its not clear how productive a non-developer would be here, as you may need to understand the code to know who to talk to. Nonetheless, potentially this is something a non-programmer volunteer can help with.

To use the new page patrol feature as an example - Users have a list of 56 feature requests. There's not really any indication of which ones are more important then others. A useful starting point would be to select the 3 most important. There are plenty of volunteer developers in the MediaWiki ecosystem that might work on them. The less time they have to spend figuring out what is wanted, the more likely they might fix one of the things. There are no guarantees of course, but it is a thing that someone who is not a programmer could do to move things forward.

 To be clear, being a good PM is a skill - all of this is hard and takes practice to be good at. People who have not done it before won't be good at it to begin with. But I think it is something we should talk about more, instead of the usual refrain of fix it yourself or be happy with what you got.


p.s. None of this should be taken as saying that WMF shouldn't fix anything and it should only be up to the communities, simply that there are things non-programmers could do to {{sofixit}} if they were so inclined.

 

Sunday, August 21, 2022

Weekly roundup - aug 21

 Some things i read or saw this week that i thought were interesting.

Natural perspective


I found this blog post fascinating. Basically it talks about how human preception is different than what a photograph would be. For example if there is a big object of interest in the distance it usually looks larger. I never thought much about this before,  but now that it has been pointed out to me, it rings very true. 

Kill the hero save the (narrative) world


This was a talk fron the GDC conference, by Hannah Nicklin, who is the narrative lead of a video game called Mutazione, talking about the narrative structure of video games. Essentially the speaker was arguing that many video games follow a hero's journey type of plot where the story follows a protagonist's journey to becoming a hero. They feel that this is a structure that works really well in movies: the director can direct your focus to character traits and growth. The 2 hour length is also very suitable to developing a single character's journey. They argue that video games would be better suited to follow the structure of an emsamble cast tv show. They think that this allows a better balance between players being able to do whatever they want but still getting the plot across as the focus is less on the affect of actions on the main character's phsyche and more driven by characters interacting with a community of other characters.

I know very little of video games, so i don't know how true the premise rings. However i found the reasoning quite interesting, and it gave me a lot to think about.

Galatea, Versu, Character Engine

This was a talk from 2018 by Emily Short about making non playable characters in video games feel like real people with inner lives. She also talks a bit about the pros and cons of interactive fiction, which i have always find interesting.





Saturday, August 13, 2022

Book review: Reamde

 

 

It has been a while since I have read a Stephenson novel, so when I saw this 1044 pages of light reading on the library shelf, I thought I would give it a go.

While I still enjoyed this novel, overall it was one of my least favourite Stephenson novels that I have read. I think perhaps the type of novel this was - essentially a thriller set in basically our current world - showcased Stephenson's weaknesses and hid his strengths, as a writer.

Stephenson is pretty famous for writing giant novels, and this is no exception. However usually it feels like a lot more is going on to justify the length. In this novel, the plot seems much more straight forward, and instead of using the pages to explain complex concepts or explore the world, we instead get 150 page long car/boat chases.

I think fundamentally the reason why Neal Stephenson novels are usually really fun, is their setting. He essentially writes novels set in various nerd utopias. The difference between him and most other science fiction writers writing in utopia settings, is most of those types of novels spend a lot of time arguing for the utopia or telling you how good it is. In Neal Stephenson novels, there is less of that, and the novel is more just set there. We might see how much a character likes the setting through their eyes, and how excited they are by it, but it still seems less directly aimed at the reader. I think this eases suspension of disbelief, because it feels less preachy and allows one to get absorbed in the "coolness" of the fantasy without thinking about all the ways it falls apart. For example, both Ananthem and The Big U are romanticized versions of university life (albeit romanticized in opposite directions), Cryptonomicon is a romanticized version of the cypherpunk movement, Snow crash is kind of romanticized cyberpunk. [I read these a long time ago, I might not be remembering them that well]

In this novel, the only romanticized aspect, is a world where everyone thinks MMORPGs are really cool. Unfortunately, a world where slightly more people play MMORPGs, is a lame utopia.

Additionally it has very little to do with the plot. It does provide the inciting incident, where a ransomware computer virus infecting the MMORPG encrypts some mobsters' private files. However this felt arbitrary - the MMORPG aspect really didn't matter.

Even worse, we spend hundreds of pages on this MMORPG world and the ongoing re-alignment of the in-world factions (The so-called "War of realignment"). Just as it seems to be building to something, we cut away to the main plot, never bothering to resolve or revisit what was building. This is especially sad, because in a lot of ways, it was more interesting than the main plot. Instead it is relegated to being a rather trite metaphor for a theme in the main plot about people coming together to form a chosen tribe over whatever situations they were born into.

Perhaps that's the biggest disappointment in this book: The threads don't really come together to form something bigger. There are a bunch of interesting premises at various points that don't really seem to contribute much to the overall whole. There is the design of the MMORPG/War-of-realignment, which is talked about a lot, but doesn't really go anywhere. There are the motorcycle gangsters with katanas, which are introduced as a back-story for a side character, who just ends up being shot with no katana fights. There are the main character's relatives who are some sort of fundamentalist christian gun-nut cult, but when the action comes down they mostly just hide in their house like normal people.

Why bother with all these tantalizing details when everyone in the end just acts like a normal person. Their uniqueness seems unnecessary to the novel, which feels like a major let down.

Last of all and pretty unrelatedly, the female characters felt like they were written weirdly, in a hard to define way. The main character, Zula, spent most of the novel kidnapped, and there was a bizarre amount of discussion on the differing levels of chivalry her various kidnappers showed her. As if holding the door open for her somehow makes up for the whole kidnapping at gunpoint. Several characters debate whether Yuxia is a "sister" or girlfriend material, where "friend" or "person they just met last week under very traumatic circumstances" apparently didn't cross anyone's mind. All the romantic pairings seemed kind of random and forced (Except maybe Olivia & Sokolov). Yuxia had barely even exchanged any words with Seamus before suddenly being together. Csongor and Zula also had pretty limited interaction before falling in love.

In the end, i did enjoy it, but definitely not his strongest novel.

Wednesday, July 20, 2022

Interviewed on Between the brackets

 This week I was interviewed by Yaron Karon for the second time for his MediaWiki podcast Between the Brackets.

Yaron has been doing this podcast for several years now, and I love how he highlights the different voices of all the different groups that use, interact and develop MediaWiki. He's had some fascinating people on his podcast over the years, and I highly reccomend giving it a listen.

Anyhow, it's an honour to be on the program again for episode 117. I was previously on the program 4 years ago for episode 5


Monday, July 18, 2022

Write up DiceCTF 2022: Flare and stumbling upon a Traefik 0-day (CVE-2022-23632)

 A while back, I participated in DiceCTF. During the contest I was the first person to solve the problem "Flare":

This was pretty exciting. Normally I'm happy just to be able to solve a problem — I'm never first. Even better though, my solution wasn't the intended solution, and I had instead found a 0-day in the load balancing software the contest was using!

The contest was a while ago so this post is severely belated. Nonetheless, given how exciting this all was, I wanted to write it up.

The Problem

The contest problem was short and sweet. You had a flask app behind Cloudflare. If you accessed it with an internal IP address (e.g. 192.168.0.2 or 10.0.0.2), you get the flag. The entire code is just 18 lines, mostly boilerplate:

import os
import ipaddress

from flask import Flask, request
from gevent.pywsgi import WSGIServer

app = Flask(__name__)
flag = os.getenv('FLAG', default='dice{flag}')

@app.route('/')
def index():
    ip = ipaddress.ip_address(request.headers.get('CF-Connecting-IP'))

    if isinstance(ip, ipaddress.IPv4Address) and ip.is_private:
        return flag

    return f'No flag for {format(ip)}'

WSGIServer(('', 8080), app).serve_forever()
  

Avenues of Attack

Simple to understand, but how to attack? The direct approach seems unlikely; The service is behind Cloudflare — if we can trick Cloudflare into thinking we are connecting to a site from an arbitrary IP address, then that would be a very significant vulnerability. If we could actually make a TCP connection to Cloudflare from a IP address that should not be globally routable, then something is seriously wrong with the internet.

Thinking about it, it seems like the following approaches are possible:

  • Hack Cloudflare to send the wrong CF-Connection-IP header (Seems impossible)
  • Use some sort of HTTP smuggling or header normalization issue to send something that flask thinks is a valid CF-Connection-IP header, but Cloudflare doesn't recognize, and hope that when faced with 2 conflicting headers, flask chooses ours instead of Cloudflare's. (Also seems impossible)
  • Find the backend server and connect to it directly, bypassing Cloudflare allowing us to send whatever header we want


 With that in mind, I figured it had to be the third one. After all, this is a challenge, so it must have a solution, hence I figured it's probably neither impossible nor involving a high value vulnerability in a major CDN.

I was wrong of course. The intended solution was sort of a combination of the first possibility which i dismissed as impossible and python having an "interesting" definition of an "internal" IP, which there is no way I ever would have gotten. More on that later.

Trying to Find the Backend

So now that I determined my course of action, I started hunting for the backend server. I tried googling around for snippets from the page to see if any other sites came up in google with the same text. I tried looking at various "dns history" sites that were largely useless. I tried certificate transparency logs, but no. I even tried blindly typing in variations on the challenge's domain name.

The setup for the challenges were as follows: This challenge was on https://flare.mc.ax. The other challenges were all on *.mc.ax. This was the only challenge served by Cloudflare, the rest were served directly by the CTF infrastructure using a single IP address and a wildcard certificate.

With that in mind, I thought, maybe I could connect to the IP serving the other challenges, give the flare.mc.ax SNI and host header, and perhaps I will be directly connected to the backend. So I tried that, as well as the domain fronting version where you give the wrong SNI but the right host header. This did not work. However, to my surprise instead of getting a 404 response, I got a 421 Misdirected Redirect.

421 essentially means you asked for something that this server is not configured to give you, so you should re-check your DNS resolution to make sure you have the right IP address and try again. In HTTP/2, you are allowed to reuse a connection for other domains as long as it served a TLS certificate that would work for the other domain (This is called "Connection coalescing"). However, sometimes that back-fires especially with wildcard certs. Just because a server serves a TLS certificate for *.example.com, doesn't mean it knows how to literally handle everything under example.com since some subdomains might be served by a different server on a different IP. The new error code was created for such cases, to tell the browser it should stop with the connection coalescing, look up the DNS records for the domain name again, and open a separate connection. We needed a new code, because if the server just responded with a 404, the browser wouldn't know if its because the page just doesn't exist, or if its because the connection was inappropriately coalesced.

Looking back, I'm not sure I should have seen this as a sign. After all I was asking for a domain name that this server did not serve but had a correct certificate for, so this was the appropriate HTTP status code to give. However, the uniqueness of the error code and sudden change in behaviour around the domain name I was interested in, made me feel like I was on to something.

So I tried messing around with variations in headers and SNI. I tried silly things like having Host: flare.ac.mx/foo in the hopes that it would maybe confuse one layer, but another layer would strip it off, and get me the site i wanted or something like that.

Why settle for partial qualification? 

Eventually I tried Host: flare.ac.mx. (note the extra dot at the end) with no SNI.

curl 'https://104.196.60.107'  -vk --header 'host: flare.mc.ax.' --http1.1 --header 'CF-Connecting-IP: 127.0.0.1' --header 'X-Forwarded-For: 127.0.0.1' --header 'CF-ray: REDACTED' --header 'CF-IPCountry: CA' --header 'CF-Visitor: {"scheme":"https"}' --header 'CDN-Loop: cloudflare'

It worked.

Wait what?

What does a dot at the end of a domain name even mean?

In DNS, there is the concept of a partially qualified domain name (PQDN), and its opposite, the fully qualified domain name (FQDN). A partially qualified domain name is similar to a relative path - you can setup a default DNS search path in your DNS config (usually set by DHCP) that acts as a default suffix for partially qualified domain names. For example, if your default search path is example.com and you look up the host foo DNS will check foo.example.com.

I imagine this made more sense during the early internet, when it was more a "network of networks", and it was more common that you wanted to look up local resources on your local network.

In DNS, there is the root zone, which is represented by the empty string. This is the top of the DNS tree, which has TLDs like com or net, as its children.

If you add a period to the end of your DNS name, for example foo., this tells the DNS software that it isn't a partially qualified DNS name, but what you actually want, is to look up the foo domain in the root zone. So it does not lookup foo.example.com., but instead just foo..

For the most part, this is an obscure bit of DNS trivia. However, as far as web browsers are concerned, the PQDN example.net and FQDN example.net. are entirely separate domains. The same origin policy treats them differently, cookies set to one do not affect the other (TLS certificates seem to work for both though). In the past, people have used this trick to avoid advertisers on some websites.

So why did this work

So at this point, I solved the puzzle, obtained the flag and submitted it. Yay me!
 
But I still wasn't really sure what happened. I assumed there was some sort of misconfiguration involved, but I wasn't sure what. I did not have the configuration of the load-balancer. For that matter, I wasn't even sure yet which load balancing software was in use.

After I solved the problem, the competition organizers reached out and asked me what method I use. I imagine they wanted to see if I found the intended solution or stumbled upon something else. When I told them, they were very surprised. Apparently mutual TLS had been set up with cloudflare, and it should have been impossible to contact the backend if you did not have the correct client certificate, which I did not.

Wait what!?

The load balancing software in question was Traefik. In it, you can configure various requirements for different hosts. For example, you can say that a certain host needs a specific version of TLS, specific ciphers, specific server certificate or even a specific client certificate (mTLS). There is also a section for default options. In this case, they had one set of TLS settings for most of the domains, and a rule for the flare domain that you needed the correct client certificate to get access.

In normal operation, the SNI is checked and the appropriate requirements are applied. In the event that the SNI doesn't match the host header, and the host header matches a domain with different TLS requirements then the default requirements, a 421 status code is sent. This is all good.

However, if the host header has a period at the end to make it a FQDN, the code checking the TLS requirements doesn't think it matches the non-period version, so only the default requirements apply. However the rest of the code will still forward the request to the correct domain and process it as normal.

Thus, you can bypass any domain specific security requirements by not sending the SNI and adding an extra period to the host header.

This would be one thing for settings like min TLS version. However, it is an entirely different thing for access control settings such as mutual TLS as it allows an attacker to fully bypass the required client certificate, getting access to things they shouldn't be able to.

I reported this to the Traefik maintainers, and they fixed the issue in version 2.6.1. It was assigned CVE-2022-23632 and you can read more about it in their advisory. This was pretty exciting as well, as Traefik is used by quite a few people, and based on their github security advisory page, this is the first high-severity vulnerability that has been found in it.

What was the intended solution?

I found out later from the organizers of the CTF, that the intended solution was something very different. I definitely would never have came up with this myself.

The intended solution was to exploit two weird behaviours:

  • The python ipaddress library considers class E IP addresses (i.e. 240.0.0.0/4) to be "private". Class E addresses are reserved for future use. They are not globally routable, but they aren't private either, so it is odd that python considers them private. Python's own docs say that is_private is "True if the address is allocated for private networks" linking to IANA's official list, even though 240.0.0.0/4 is not listed as private use on that list.
  • Cloudflare has a feature where if your site does not support IPv6, you can enable "Pseudo IPv4", where the ipv6 connections will be forwarded as if they come from a class E IPv4 address. Cloudflare talks more about it in their blog post.

Which is a pretty fascinating combination.

Initially I discarded the possibility you could make Cloudflare give the wrong IP address, because I thought that would be such a major hack, that it wouldn't show up in this type of contest; people would either be reporting it or exploiting it, depending on the colour of their hat. However, my assumption was based on the idea that any sort of exploit would let you pick your IP. Being able to present as a random class E (which class E IP is based on an md5 hash of the top 64 bits of your IPv6 address, so you cannot choose it), is no where near as useful as being able to chose your IP (Everyone is just so trusting of 127.0.0.1). While this is a fascinating challenge, its hard to imagine a non-contrived situation where this would be a useful primitive. Making network access control in the real world that just blacklists all globally routable IPs instead of your own network seems silly. Even sillier would be to whitelist class E for some reason. Sure I guess an attacker could masquerade as one of these class E addresses to confuse anti-abuse systems, but if the site properly processes those connections, then it seems anti-abuse systems are likely to handle them just as easily as a normal IP. Since its still tied to your real IP, you can't hop between them unless you can hop between real IPs. If it ever really became an issue, Cloudflare lets you disable them, and there is also an additional header with the original IPv6 address you can use. At worst, maybe it makes reading logs after an incident more complicated, but this would be a really bad way to hide yourself in a world where VPNs are cheap and tor is free. In conclusion - its a fascinating behaviour, but practically speaking doesn't seem exploitable in the way that "make Cloudflare report your ip is something other than your ip" sounds like it would be exploitable at first glance.

 

 

Tuesday, July 12, 2022

Making Instant Commons Quick

 The Wikimedia family of websites includes one known as Wikimedia Commons. Its mission is to collect and organize freely licensed media so that other people can re-use them. More pragmatically, it collects all the files needed by different language Wikipedias (and other Wikimedia projects) into one place.

 

The 2020 Wikimedia Commons Picture of the Year: Common Kingfisher by Luca Casale / CC BY SA 4.0

 As you can imagine, it's extremely useful to have a library of freely licensed photos that you can just use to illustrate your articles.

However, it is not just useful for people writing encyclopedias. It is also useful for any sort of project.

To take advantage of this, MediaWiki, the software that powers Wikipedia and friends, comes with a feature to use this collection on your own Wiki. It's an option you can select when installing the software and is quite popular. Alternatively, it can be manually configured via $wgUseInstantCommons or the more advanced $wgForeignFileRepos.

The Issue

Unfortunately, instant commons has a reputation for being rather slow. As a weekend project I thought I'd measure how slow, and see if I could make it faster.

How Slow?

First things first, I'll need a test page. Preferably something with a large (but not extreme) number of images but not much else. A Wikipedia list article sounded ideal. I ended up using the English Wikipedia article: List of Governors General of Canada (Long live the Queen!). This has 85 images and not much else, which seemed perfect for my purposes.

I took the expanded Wikitext from https://en.wikipedia.org/w/index.php?title=List_of_governors_general_of_Canada&oldid=1054426240&action=raw&templates=expand, pasted it into my test wiki with instant commons turned on in the default config.

And then I waited...

Then I waited some more...

1038.18761 seconds later (17 minutes, 18 seconds) I was able to view a beautiful list of all my viceroys.

Clearly that's pretty bad. 85 images is not a small number, but it is definitely not a huge number either. Imagine how long [[Comparison_of_European_road_signs]] would take with its 3643 images or [[List_of_paintings_by_Claude_Monet]] with 1676.

Why Slow?

This raises the obvious question of why is it so slow. What is it doing for all that time?

When MediaWiki turns wikitext into html, it reads through the text. When it hits an image, it stops reading through the wikitext and looks for that image. Potentially the image is cached, in which case it can go back to rendering the page right away. Otherwise, it has to actually find it. First it will check the local DB to see if the image is there. If not it will look at Foreign image repositories, such as Commons (if configured).

To see if commons has the file we need to start making some HTTPS requests¹:

  1. We make a metadata request to see if the file is there and get some information about it: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata%7Cmime%7Cmediatype%7Cextmetadata&prop=imageinfo&iimetadataversion=2&iiextmetadatamultilang=1&format=json&action=query&redirects=true&uselang=en
  2.  We make an API request to find the url for the thumbnail of the size we need for the article. For commons, this is just to find the url, but on wikis with 404 thumbnail handling disabled, this is also needed to tell the wiki to generate the file we will need: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=300&iiurlheight=-1&iiurlparam=300px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en
  3.  Some devices now have very high resolution screens. Screen displays are made up of dots. High resolution screens have more dots per inch, and thus can display more fine detailed. Traditionally 1 pixel equalled one dot on the screen. However if you keep that while increasing the dots-per-inch, suddenly everything on the screen that was measured in pixels is very small and hard to see. Thus these devices now sometimes have 1.5 dots per pixel, so they can display fine detail without shrinking everything. To take advantage of this, we use an image 1.5 times bigger than we normally would, so that when it is displayed in its normal size, we can take advantage of the extra dots and display a much more clear picture. Hence we need the same image but 1.5x bigger: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=450&iiurlheight=-1&iiurlparam=450px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en
  4. Similarly, some devices are even higher resolution and use 2 dots per pixel, so we also fetch an image double the normal size:  https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=600&iiurlheight=-1&iiurlparam=600px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en

 

This is the first problem - for every image we include we have to make 4 api requests. If we have 85 images that's 340 requests.

Latency and RTT

It gets worse. All of these requests are done in serial. Before doing request 2, we wait until we have the answer to request 1. Before doing request 3 we wait until we get the answer to request 2, and so on.

Internet speed can be measured in two ways - latency and bandwidth. Bandwidth is the usual measurement we're familiar with: how much data can be transferred in bulk - e.g. 10 Mbps.

Latency, ping time or round-trip-time (RTT) is another important measure - it's how long it takes your message to get somewhere and come back.

When we start to send many small messages in serial, latency starts to matter. How big your latency is depends on how close you are to the server you're talking to. For Wikimedia Commons, the data-centers (DCs) are located in San Francisco (ulsfo), Virginia (eqiad), Texas (codfw), Singapore (eqsin) and Amsterdam (esams). For example, I'm relatively close to SF, so my ping time to the SF servers is about 50ms. For someone with a 50ms ping time, all this back and forth will take at a minimum 17 seconds just from latency.

However, it gets worse; Your computer doesn't just ask for the page and get a response back, it has to setup the connection first (TCP & TLS handshake). This takes additional round-trips.

Additionally, not all data centers are equal. The Virginia data-center (eqiad)² is the main data center which can handle everything, the other DCs only have varnish servers and can only handle cached requests. This makes browsing Wikipedia when logged out very speedy, but the type of API requests we are making here cannot be handled by these caching DCs³. For requests they can't handle, they have to ask the main DC what the answer is, which adds further latency. When I tried to measure mine, i got 255ms, but I didn't measure very rigorously, so I'm not fully confident in that number. In our particular case, the TLS & TCP handshake are handled by the closer DC, but the actual api response has to be fetched all the way from the DC in Virginia.

But wait, you might say: Surely you only need to do the TLS & TCP setup once if communicating to the same host. And the answer would normally be yes, which brings us to major problem #2: Each connection is setup and tore down independently, requiring us to re-establish the TCP/TLS session each time. This adds 2 additional RTT. In our 85 image example, we're now up to 1020 round-trips. If you assume 50ms to caching DC and 255ms to Virginia (These numbers are probably quite idealized, there are probably other things I'm not counting), we're up to 2 minutes.

To put it altogether, here is a diagram representing all the back and forth communication needed just to use a single image:

12 RTT per image used! This is assuming TLS 1.3. Earlier versions of TLS would be even worse.

Introducing HTTP/2

In 2015, HTTP/2 came on the scene. This was the first major revision to the HTTP protocol in almost 20 years.

The primary purpose of this revision of HTTP, was to minimize the effect of latency when you are requesting many separate small resources around the same time. It works by allowing a single connection to be reused for many requests at the same time and allowing the responses to come in out of order or jumbled together. In HTTP/1.1 you can sometimes be stuck waiting for some request to finish before being allowed to start on the next one (Head of line blocking) resulting in inefficient use of network resources

This is exactly the problem that instant commons was having.

Now I should be clear, instant commons wasn't using HTTP/1.1 in a very efficient way, and it would be possible to do much better even with HTTP/1.1. However, HTTP/2 will still be that much better than what an improved usage of HTTP/1.1 would be.

Changing instant commons to use HTTP/2 changed two things:

  1. Instead of creating a new connection each time, with multiple round trips to set up TCP and TLS, we just use a single HTTP/2 connection that only has to do the setup once.
  2. If we have multiple requests ready to go, send them all off at once instead of having to wait for each one to finish before sending the next one.

We still can't do all requests at once, since the MediaWiki parser is serial, and it stops parsing once we hit an image, so we need to get information about the current image before we will know what the next one we need is. However, this still helps as for each image, we send 4 requests (metadata, thumbnail, 1.5dpp thumbnail and 2dpp thumbnail), which we can now send in parallel.


The results are impressive for such a simple change. Where previously my test page took 17 minutes, now it only takes 2 (139 seconds).


Transform via 404

In vanilla MediaWiki, you have to request a specific thumbnail size before fetching it; otherwise it might not exist. This is not true on Wikimedia Commons. If you fetch a thumbnail that doesn't exist, Wikimedia Commons will automatically create it on the spot. MediaWiki calls this feature "TransformVia404".

In instant commons, we make requests to create thumbnails at the appropriate sizes. This is all pointless on a wiki where they will automatically be created on the first attempt to fetch them. We can just output <img> tags, and the first user to look at the page will trigger the thumbnail creation. Thus skipping 3 of the requests.

Adding this optimization took the time down from 139 seconds with just HTTP/2 to 18.5 seconds with both this and HTTP/2. This is 56 times faster than what we started with!



Prefetching

18.5 seconds is pretty good. But can we do better?

We might not be able to if we actually have to fetch all the images, but there is a pattern we can exploit.

Generally when people edit an article, they might change a sentence or two, but often don't alter the images. Other times, MediaWiki might re-parse a page, even if there are no changes to it (e.g. Due to a cache expiry). As a result, often the set of images we need is the same or close to the set that we needed for the previous version of the page. This set is already recorded in the database in order to display what pages use an image on the image description page

We can use this. First we retrieve this list of images used on the (previous version) of the page. We can then fetch all of these at once, instead of having to wait for the parser to tell us one at a time which image we need.

It is possible of course, that this list could be totally wrong. Someone could have replaced all the images on the page. If it's right, we speed up by pre-fetching everything we need, all in parallel. If it's wrong, we fetched some things we didn't need, possibly making things slower than if we did nothing.

I believe in the average case, this will be a significant improvement. Even in the case that the list is wrong, we can send off the fetch in the background while MediaWiki does other page processing - the hope being, that MediaWiki does other stuff while this fetch is running, so if it is fetching the wrong things, time is not wasted.

On my test page, using this brings the time to render (Where the previous version had all the same images) down to 1.06 seconds. A 980 times speed improvement! It should be noted, that this is time to render in total, not just time to fetch images, so most of that time is probably related to rendering other stuff and not instant commons.

Caching

All the above is assuming a local cache miss. It is wasteful to request information remotely, if we just recently fetched it. It makes more sense to reuse information recently fetched.

In many cases, the parser cache, which in MediaWiki caches the entire rendered page, will mean that instant commons isn't called that often. However, some extensions that create dynamic content make the parser cache very short lived, which makes caching in instant commons more important. It is also common for people to use the same images on many pages (e.g. A warning icon in a template). In such a case, caching at the image fetching layer is very important.

There is a downside though, we have no way to tell if upstream has modified the image. This is not that big a deal for most things. Exif data being slightly out of date does not matter that much. However, if the aspect ratio of the image changes, then the image will appear squished until InstantCommons' cache is cleared.

To balance these competing concerns, Quick InstantCommons uses an adaptive cache. If the image has existed for a long time, we cache for a day (configurable). After all, if the image has been stable for years, it seems unlikely it is going to be edited in very soon. However, if the image has been edited recently, we use a dynamically determined shorter time to live. The idea being, if the image was edited 2 minutes ago, there is a much higher possibility that it might be edited a second time. Maybe the previous edit was vandalism, or maybe it just got improved further.

As the cache entry for an image begins to get close to expiring, we refetch it in the background. The hope is that we can use the soon to be expired version now, but as MediaWiki is processing other things, we refetch in background so that next time we have a new version, but at the same time we don't have to stall downloading it when MediaWiki is blocked on getting the image's information. That way things are kept fresh without a negative performance impact.

MediaWiki's built-in instant commons did support caching, however it wasn't configurable and the default time to live was very low. Additionally, the adaptive caching code had a bug in it that prevented it from working correctly. The end result was that often the cache could not be effectively used.

Missing MediaHandler Extensions

In MediaWiki's built-in InstantCommons feature, you need to have the same set of media extensions installed to view all files. For example, PDFs won't render via instant commons without Extension:PDFHandler.

This is really unnecessary where the file type just renders to a normal image. After all, the complicated bit is all on the other server. My extension fixes that, and does its best to show thumbnails for file types it doesn't understand. It can't support advanced features without the appropriate extension e.g. navigating in 3D models, but it will show a static thumbnail.

Conclusion

In the end, by making a few, relatively small changes, we were able to improve the performance of instant commons significantly. 980 times as fast!

Do you run a MediaWiki wiki? Try out the extension and let me know what you think.

Footnotes:

¹ This is assuming default settings and an [object] cache miss. This may be different if $wgResponsiveImages is false in which case high-DPI images won't be fetched, or if apiThumbCacheExpiry is set to non-zero in which case thumbnails will be downloaded locally to the wiki server during the page parse instead of being hotlinked.


² This role actually rotates between the Virginia & Texas data center. Additionally, the Texas DC (when not primary) does do some things that the caching DCs don't that isn't particularly relevant to this topic. There are eventual plans to have multiple active DCs which all would be able to respond to the type of API queries being made here, but they are not complete as of this writing - https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Active-active_MediaWiki


³ The MediaWiki API actually supports an smaxage=<number of seconds> (shared maximum age) url parameter. This tells the API server you don't care if your request is that many seconds out of date, and to serve it from varnish caches in the local caching data center if possible. Unlike with normal Wikipedia page views, there is no cache invalidation here, so it is rarely used and it is not used by instant commons.





Friday, July 8, 2022

Book review: Binti

 I've never written a book review before, but what is a blog if not a platform to give your uninformed opinion to people who don't care about it :)


Recently I read the novella Binti by Nnedi Okorafor. I had recently saw it on the shelf at the library and I remembered it being pretty popular back when it was released, so thought i would give it a go. Overall, i liked it up to the ending, which I strongly disliked.

In general, it seemed like a fun coming of age/space opera-esque story. You have a main character who is an outsider and unsure of herself. Due to circumstances beyond her control she is thrust into the middle of interplanetary conflict which only she can resolve due to unique traits which originally had marked her as an outsider.

None of this is particularly ground breaking - I have read many stories with that type of plot structure - but it still makes for a fun story. The weird part here, is I don't think this is the type of story that the author intended to write. From what I remember of this book's release, the selling point was supposed to be how it explored the main character's culture (Which from what I understand is based on the real human culture of the Himba people). However, to me, that seemed like the most weak and superficial part of the book. All we really get, is that they don't like outsiders, and use an ointment made of mud that is very culturally important. We get very little in terms of why this is meaningful to the character, or how it affects her worldview, etc.

Its a reasonably common theme to explore "culture" in science fiction works. Admittedly, it is somewhat rarer but definitely not unheard of to explore it as an end in itself. More common is to have an alien culture to juxtapose against contemporary mainstream culture to make some point about the mainstream culture of the context the writer is writing in. Novels such as The Dispossessed and Stranger In A Strange Land come to mind as novels that posit some alien culture to contrast it with contemporary culture. Other times alien characters have a unique culture to provide an aesthetic of otherness for them. Consider how boring most aliens in star trek are compared to the Klingons that have a much better developed cultural back story. Or how alien the aliens feel in the Lilith's Brood series. Even if the cultural exploration is arguably not the primary point to these novels, the cultures they explore feel rounded and fleshed out. The cultural depiction in Binti seems flat comparatively, because even if its tied to a real culture, it was barely explored in the novella.

Ultimately I felt like this novella worked as a fun space adventure; it didn't work as an interesting cultural exploration.

Last of all the: The ending [Major spoilers ahead]

The ending ruined the book for me. It did not make sense in terms of character logic nor character emotion.

Basically - the aliens, who are very isolationist, are mad because the university stole the stinger of one of their leaders to display as an artifact. So they decide to murder everyone on a ship full of students in order to sneak in and retrieve it (or die trying). The situation is resolved by the university saying sorry, claiming that doing such a thing was against policy, and the appropriate people will be punished. Additionally one of the aliens was invited to be an exchange student.

To me, this felt just one step above, "it was all a dream". It makes no sense how the university could accidentally have obtained the artifact without realizing it was unethically sourced. Some researchers just show up with an alien body part, from aliens who generally refuse to talk to outsiders, and everyone just assumed the alien gave consent for this? It is difficult to reconcile these plot points together logically, unless the university was intentionally turning a blind eye, in which case, why would their apology be accepted? Second, nobody was at all upset that everyone on the ship was slaughtered. I could understand maybe letting bygones be bygones for the greater good - but they should at least be a little sad about it. Nobody seemed to care at all.

In conclusion: Generally fun plot. The "cultural" aspects were superficial and overrated. Ending sucked.