Bawolff's rants: 2025

Sunday, October 5, 2025

Pure CSS 360° Panorama Viewer

Recently I've been interested in 360° Panoramas (i.e. "Photospheres"). This is when you take a picture of the entire world surrounding your camera. Displaying them allows an immersive experience where you can change the viewpoint to show the entire scene. Using VR it can become truly immersive.

Wikipedia has a couple of these photos, but the way they are displayed leaves something to be desired. Typically they are shown in an equirectangular projection with a link to an external viewer tool.

Here is an example that used to be on the Wikipedia article about the F-4 fighter plane showing the cockpit of the plane:

Image credit: Lauri Veerde (CC-BY-SA 4.0)

As you can see, the image shown has significant distortions. This is because the image is a sphere being projected onto a rectangle. It's impossible to flatten a sphere into a rectangle without distorting the image, for the same reason you need a globe to show the true shape and size of the continents of the earth. This distorted view might be artistically cool but is not very useful for actually illustrating the topic at hand. You can of course click the link to go to a more proper rendering, but that requires an extra click which many readers won't do. It would be much cooler if a proper interactive view was embedded directly on the page.

Can we do better?

The external tool that is linked to for these images is called Panoviewer. In turn, it is a wrapper around an open source library called Pannellum.

In theory that could be embedded directly into MediaWiki. TheDj did some work towards this back in 2019-2023. I'm not exactly sure what happened, but it seems like it didn't make it all the way to the end goal and work on it eventually petered out.

Getting new extensions deployed to Wikimedia as a volunteer is a Kafkaesque hell that I don't really want any part in. But is there an alternative?

Pannellum normally uses WebGL. However for browsers that do not support WebGL, it has a CSS based fallback mode using 3D CSS transforms. We can use CSS in Wikipedia templates. Can we make a perspective viewer as a pure template? No server code needed, no review needed, just a normal edit that anyone could make to Wikipedia?

Pure CSS viewer

Let's look at the fallback code in Pannellum, since that is the inspiration for this.

From https://github.com/mpetroff/pannellum/blob/master/src/js/libpannellum.js#L835-L856

            s = fallbackImgSize / 2;
            var transforms = {
                f: 'translate3d(-' + (s + 2) + 'px, -' + (s + 2) + 'px, -' + s + 'px)',
                b: 'translate3d(' + (s + 2) + 'px, -' + (s + 2) + 'px, ' + s + 'px) rotateX(180deg) rotateZ(180deg)',
                u: 'translate3d(-' + (s + 2) + 'px, -' + s + 'px, ' + (s + 2) + 'px) rotateX(270deg)',
                d: 'translate3d(-' + (s + 2) + 'px, ' + s + 'px, -' + (s + 2) + 'px) rotateX(90deg)',
                l: 'translate3d(-' + s + 'px, -' + (s + 2) + 'px, ' + (s + 2) + 'px) rotateX(180deg) rotateY(90deg) rotateZ(180deg)',
                r: 'translate3d(' + s + 'px, -' + (s + 2) + 'px, -' + (s + 2) + 'px) rotateY(270deg)'
            };
            focal = 1 / Math.tan(hfov / 2);
            var zoom = focal * canvas.clientWidth / 2 + 'px';
            var transform = 'perspective(' + zoom + ') translateZ(' + zoom + ') rotateX(' + pitch + 'rad) rotateY(' + yaw + 'rad) ';
            
            // Apply face transforms
            var faces = Object.keys(transforms);
            for (i = 0; i < 6; i++) {
                var face = world.querySelector('.pnlm-' + faces[i] + 'face');
                if (!face)
                    continue; // ignore missing face to support partial cubemap/fallback image
                face.style.webkitTransform = transform + transforms[faces[i]];
                face.style.transform = transform + transforms[faces[i]];
            }

The first thing to notice is that instead of working with an equirectangular projection, it uses a cubemap representation. This is where the 360° panorama is represented as a cube that has 6 faces. We can't convert the spherical representation into a cube with pure CSS, since CSS transforms are limited to matrix transformations (essentially affine ones, plus perspective) and the sphere-to-cube mapping is non-linear. Thus we'll have to upload all the parts of the cube as separate files:

The same image of an F-4 as before, but as the faces of a cube instead of an equirectangular projection. Image credit: Lauri Veerde (CC-BY-SA 4.0)

Imagine this as sort of an arts and crafts project. If you print it out, take some scissors, and glue all the edges together, you get a cube. If you were inside the cube, you would get a full scene view. This is what we are going to do with CSS - separate the parts and glue them together with perspective shifted to give the 3D view effect.

The different faces of the cube labelled. Image credit: Arieee (CC-BY-SA 3.0)

https://jaxry.github.io/panorama-to-cubemap/ is an easy to use website to do the equirectangular to cubemap conversion, but it tends to fail for large files. I've mostly been using Tim Starling's pano-projector, which is a CLI tool that was part of an effort to bring panorama display to Wikimedia that hasn't come to fruition yet.
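To make the non-linearity concrete, here is a minimal sketch of the coordinate mapping that an equirectangular-to-cubemap converter has to perform for every pixel. It is not taken from either tool mentioned above: the function names, the axis convention (+x right, +y up, +z forward) and the reuse of the face letters from the Pannellum excerpt are assumptions for illustration, and a real converter would also interpolate between source pixels.

```javascript
// Direction vector for a point on a cube face.
// face: 'f', 'b', 'l', 'r', 'u' or 'd' (letters borrowed from the excerpt above).
// u, v: position on the face in [-1, 1]; u grows to the right, v grows downwards.
// Assumed axes: +x is right, +y is up, +z is straight ahead.
function faceToDirection(face, u, v) {
  switch (face) {
    case 'f': return [u, -v, 1];
    case 'b': return [-u, -v, -1];
    case 'r': return [1, -v, -u];
    case 'l': return [-1, -v, u];
    case 'u': return [u, 1, v];
    case 'd': return [u, -1, -v];
    default: throw new Error('unknown face: ' + face);
  }
}

// Where in a width x height equirectangular image that point comes from.
function cubeToEquirect(face, u, v, width, height) {
  const [x, y, z] = faceToDirection(face, u, v);
  const lon = Math.atan2(x, z);                // -PI..PI, 0 is straight ahead
  const lat = Math.atan2(y, Math.hypot(x, z)); // -PI/2..PI/2, positive is up
  return {
    x: (lon / (2 * Math.PI) + 0.5) * width,
    y: (0.5 - lat / Math.PI) * height,
  };
}
```

Because of the atan2 calls, equal steps across a face do not correspond to equal steps in the source image: halfway from the middle of the front face to its edge (u = 0.5) is about 26.6° of yaw, not half of 45°. No single matrix can express that, which is why the conversion has to happen before upload.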

Once we have the 360° view converted to cube form, the idea is pretty simple. We apply 3D translations and rotations to make it into a cube, and then we transform the viewing angle as appropriate to see the part of the image we want.
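The viewing-angle part can be sketched on its own. The following is a simplified reading of the last few lines of the Pannellum excerpt, using degrees instead of radians. `perspectiveDistance` and `viewTransform` are hypothetical names, and the rounding to two decimals is only there to keep the CSS string tidy.

```javascript
// Distance from the viewer to the projection plane, in pixels, for a given
// horizontal field of view and viewport width (same formula as the excerpt).
function perspectiveDistance(hfovDeg, widthPx) {
  const hfov = hfovDeg * Math.PI / 180;
  const focal = 1 / Math.tan(hfov / 2);
  return focal * widthPx / 2;
}

// CSS transform prefix shared by all six faces; each face then appends its
// own translate3d/rotate string to place itself on the cube.
function viewTransform(hfovDeg, widthPx, pitchDeg, yawDeg) {
  const d = perspectiveDistance(hfovDeg, widthPx).toFixed(2) + 'px';
  return 'perspective(' + d + ') translateZ(' + d + ') ' +
    'rotateX(' + pitchDeg + 'deg) rotateY(' + yawDeg + 'deg)';
}
```

For a 90° horizontal field of view in a 600px wide box the perspective distance comes out at 300px, i.e. half the width. A narrower field of view gives a larger distance, which is what zooming in amounts to.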

All this can be translated to a MediaWiki template fairly easily. It is a bit persnickety: Firefox seems to have some weird tearing artifacts sometimes. Chrome is much smoother, but it has the very confusing issue that you need to put some sort of transform on any element overlapping a 3D transformed element, or it will disappear or reappear at random (Thank you Stack Overflow for explaining this to me).

We can even use CSS animations to give a rotating view.

The result is: https://commons.wikimedia.org/wiki/Template:Cubemap_player/doc#Example (My blog software messes with it so I only embedded a video here. Click the link for the full effect). The template code is a little gnarly; in retrospect I probably should have used Lua instead of pure wikitext. But it works!

Enter {{calculator}}

This works but has some notable limitations:

We use TemplateStyles for the animation. The animation needs to know where the image starts but TemplateStyles doesn't support being parameterized. Thus it is difficult to specify a starting yaw and pitch. I wrote a script that created separate stylesheets for some common values, but it's an ugly solution that would be a pain if anything has to be changed. (I made a proposal for TemplateStyles to allow variables in calc() which should be secure and fix my problem, but I'm not holding my breath for it to actually happen)
We use the :target pseudo-class to implement buttons. That's basically all we've got with core MediaWiki, as we can't make checkboxes, and CSS is not really designed with the idea of clickable interaction in mind. Unfortunately that causes the page to scroll when clicked, and it really only lets us start and stop the animation. It does not allow more complex controls.

Previously I was involved with a project to make a calculator gadget. The idea was to make widgets on a page plug into formulas (like a spreadsheet) to update other text on the page or CSS variables.

This is exactly what was needed for this project. Even better it was already enabled on Wikipedia, so I could just go ahead and use it. It lets us make a button that on click adjusts a CSS variable, which we can then use to change the displayed view.

With this we can make the previous template much more parameterized. You can specify any yaw, pitch or zoom you want as a starting point and I didn't have to make a separate stylesheet for every option (It uses CSS transitions instead of animation for the full rotation button). We can also have real buttons to move it left, right, up or down — no more sudden jumps on clicking play.

I even added drag support to {{calculator}} so you can control the viewpoint with your mouse (or finger). Unfortunately more complex gestures like pinch-zoom aren't viable, but dragging to change viewpoint really makes it come alive.
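As an illustration of the kind of arithmetic a drag handler needs, a drag can be turned into new view angles as sketched below. This is not the {{calculator}} gadget's code: the `--yaw` and `--pitch` variable names, the sign conventions and the degrees-per-pixel factor are all made up for the example.

```javascript
// Turn a drag of (dxPx, dyPx) pixels into a new view.
// Yaw wraps around to stay in [0, 360); pitch is clamped to [-90, 90] so the
// view cannot flip over the poles. Signs are an assumption: dragging right
// turns the view left, like grabbing and pulling the scene.
function applyDrag(view, dxPx, dyPx, degPerPx) {
  const rawYaw = view.yaw - dxPx * degPerPx;
  const yaw = ((rawYaw % 360) + 360) % 360;
  const pitch = Math.max(-90, Math.min(90, view.pitch + dyPx * degPerPx));
  return { yaw: yaw, pitch: pitch };
}

// Serialise the view as CSS custom properties (hypothetical names) that a
// stylesheet could read with var() inside rotateX()/rotateY().
function toCssVariables(view) {
  return '--yaw: ' + view.yaw + 'deg; --pitch: ' + view.pitch + 'deg;';
}
```

The point of going through CSS variables is that the stylesheet itself stays static: only the two numbers change, and the transforms are recomputed by the browser.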

You can see the results at the [[McDonnell Douglas F-4 Phantom II]] article.

Conclusion

So far I've put the new template on a few pages. Thus far nobody has objected, so success!

There are still a few limitations. Most notably it requires users to upload the cube faces as separate images which is a bit annoying. On the bright side though, the different faces are usually usable images, albeit perhaps not framed in the most pleasing composition, so there is potential that they might be used independently. The biggest reason why this is an issue is it means only people who know how to extract the faces can use the template.

We can't do full screen or virtual reality mode (Where the viewer uses the device orientation API to show the scene based on how the user holds their phone). Nonetheless we can still link to the external Panoviewer tool for those use cases. Similarly pinch to zoom gestures do not work.

Another major limitation is we can't do dynamic level of detail loading. Ideally if the user zooms in, we should load a higher resolution tile of the part they are zoomed in at. That is not viable with the current approach. Instead we have to choose a resolution from the get-go and load only that. Too high and it affects page load speed. Too low and the image is blurry when zoomed.

In the future it might be interesting to investigate adding support for annotations or linked hotspots.

On the whole though, I think it's a big improvement over the status quo, and a testament to what is possible with modern CSS and a smidge of JavaScript.

Saturday, October 4, 2025

How did XSSProtector do?

A few months ago, frustrated by the lack of defensive anti-XSS measures in MediaWiki, I decided to make my own. Hence was born Extension:XSSProtector.

This extension is a compromise - it's the best I can do from an extension without changing anything in MediaWiki. However I think it does provide real safety against the most likely vulnerabilities in MediaWiki.

Yesterday, MediaWiki released a security update for core and bundled extensions, so let's see how XSSProtector fared:

[Note: Severity ratings for vulns are my own opinion as there is no official rating]

The Vulns

High

✅ (T401099, CVE-2025-61638) SECURITY: Sanitize data- attributes.

Stored XSS in MediaWiki's parser.
As an aside, this is an amazing find by gui-ying233. It's not often that people find stored XSS in MediaWiki's core parser.

Moderate

✅ (T397232, CVE-2025-61656) SECURITY: Sanitize attributes unwrapped from data-ve-attributes.

Basically if you can trick a user into copying and pasting something evil into VisualEditor, you can take over their account. Moderate because it requires complex user interaction, but is ultimately fairly serious.

Low

❌ (T387478, CVE-2025-61634) SECURITY: REST: Set cache-control value of max-age=60 for redirects.
✅ (T394396, CVE-2025-61636) SECURITY: Escape rawElement $content.
✅ (T394856, CVE-2025-61637) SECURITY: Escape three system messages used by live preview.
❌ (T280413, CVE-2025-61639) SECURITY: Use ManualLogEntry::getDeleted in ::getRecentChange.

Not an XSS type bug

❌ (T403757, CVE-2025-61643) SECURITY: Don't send suppressed recent changes to RCFeeds.
❌ (T398706, CVE-2025-61646) SECURITY: Prevent leaking hidden usernames in Watchlist/RecentChanges.
✅ (T402075, CVE-2025-61640) SECURITY: Parse messages instead of inserting them as HTML.
❌ (T298690, CVE-2025-61641) SECURITY: api: Disable maxsize in QueryAllPages in miser mode.
✅ (T402313, CVE-2025-61642) SECURITY: Escape submit button label for Codex-based HTMLForms.
✅ (T403761, CVE-2025-61645) SECURITY: Fix i18n XSS in CodexTablePager.
✅ [CheckUser] (T403408, CVE-2025-61651) SECURITY: fix XSS in tempuser-expired-link-tooltip message.
❌ [CheckUser] (T404805, CVE-2025-61658) SECURITY: Add config variable to exclude from GlobalContributions.
✅ [CheckUser] (T402077, CVE-2025-61648) SECURITY: Escape system messages before inserting them as HTML.
❌ [ConfirmEdit] (T355073, CVE-2025-61635) SECURITY: ApiFancyCaptchaReload: Reuse badcaptcha rate limit.
❌ [DiscussionTools] (T397580, CVE-2025-61652) SECURITY: In API check user read permissions before showing PageInfo.

I consider this low as it requires an unsupported configuration. People who have private wikis using officially supported configs are not affected

❌ [DiscussionTools] (T364910, T396248, CVE-2025-11175) SECURITY: DiscussionTools should use better regex.
❌ [OATHAuth] (T401862, T402094, CVE-2025-11173) SECURITY: Reauth for enabling 2FA can be bypassed by submitting a form.
❌ [OATHAuth] (T396951) FreeOTP refuses to add MediaWiki's 2FA details, because "token is unsafe".
❌ [TextExtracts] (T397577, CVE-2025-61653) SECURITY: Add authorizeRead check for extracts endpoint.

I'm considering this low because it requires configuring MediaWiki in an officially unsupported configuration. Normal private wikis are not affected as far as I can tell.

❌ [Thanks] (T397497, CVE-2025-61654) SECURITY: Exclude deleted entries when counting thanks.

I think most users don't really consider this sensitive information.

✅ [VisualEditor] (T395858, CVE-2025-61655) SECURITY: Properly escape and parse system messages.
✅ [Vector] (T398636, CVE-2025-61657) SECURITY: Insert sticky header labels as text instead of HTML.

In conclusion

It stopped all the XSS vulns, including the two that actually matter for your average MediaWiki setup. Overall it got 11 out of 24 or 46%. However I think it's important to emphasize that most of the low vulnerabilities either can only be triggered by an admin, can only happen in rare configurations, or are DoS vulnerabilities that only matter if you've already spent significant effort doing performance hardening. XSSProtector prevented all the vulnerabilities that your average MediaWiki install should be worried about.

Wednesday, July 30, 2025

Preventing XSS in MediaWiki - A new extension to protect your Wiki

It's no secret that the vast majority of serious security vulnerabilities in MediaWiki are Cross-Site-Scripting (XSS).

XSS is where an attacker can put evil javascript where they aren't supposed to in order to take over other users. For example, the typical attack would look like the attacker putting some javascript in a wiki page. The javascript would contain some instructions for the web browser, like make a specific edit. Then anyone who views the page would make the edit. Essentially it lets evil people take over other users' accounts. This is obviously quite problematic in a wiki environment.

This is the year 2025. We shouldn't have to deal with this anymore. Back in Y2K the advice was that "Web Users Should Not Engage in Promiscuous Browsing". A quarter of a century later, we have a better solution: Content-Security-Policy.

Everyone's favourite security technology: CSP

Content Security Policy (CSP) is a major web browser security technology designed to tackle this problem. It's actually a grab-bag of a lot of things, which sometimes makes it difficult to talk about, as it's not one solution but a bunch of potential solutions to a bunch of different problems. Thus when people bring it up in conversation they can often talk past each other if they are talking about different parts of CSP.

First and foremost though CSP is designed to tackle XSS.

The traditional wisdom with CSP is that it's easy if you start with it, but difficult to apply it afterwards in an effective way. Effective being the operative word. Since CSP has so many options and knobs, it is very easy to apply a CSP policy that does nothing but looks like it's doing something.

This isn't the first time I've tried MediaWiki and CSP. Back when I used to work for the Wikimedia Foundation in 2020, I was tasked with trying to make something work with CSP. Unfortunately it never really got finished. After I left, nobody developed it further and it was never deployed. *sads*

Part of the reason is I think the effort tried to do too much all at once. From Wikimedia's perspective there are two big issues that they might want to solve: XSS and "privacy". XSS is very traditional, but privacy is somewhat unique to Wikimedia. Wikimedia sites allow users and admins to customize javascript. This is about as terrible an idea as it sounds, but here we are. There are various soft-norms around what people can do. Generally it's expected that you are not allowed to send any data (even implicitly such as someone's IP address by loading an off-site resource) without their permission. CSP has the potential to enforce this, but it's a more complex project than just the XSS piece. In theory the previous attempt was going to try and address both, which in retrospect was probably too much scope all at once relative to the resources dedicated to the project. In any case, after I left my job the project died.

Can we go simpler?

Recently I've been kind of curious about the idea of CSP but simple. What is the absolute minimal viable product for CSP in MediaWiki?

For starters this is just going to focus on XSS. Outside of Wikimedia, the privacy piece is not cared about very much. I don't know, maybe Miraheze care (not sure), but I doubt anyone else does. In most MediaWiki installs there is a much closer connection between the "interface-admin" group and the people running the servers, thus there is less need to restrict what the interface-admin group can do. In any case, I don't work for WMF anymore, and I'm not interested in dealing with all the political wrangling that would be needed to make something happen in the Wikimedia world. However, Wikimedia is not the only user of MediaWiki and perhaps there is still something useful we could easily do here.

The main insight is that the body of article and i18n messages generally should not contain javascript at all, but that is where most XSS attacks will occur. So if we can use CSP to disable all forms of javascript except <script> tags, and then use a post processing filter to filter all script tags out of the body of the article, we should be golden. At the same time, this should involve almost no changes to MediaWiki.

This is definitely not the recommended way of using CSP. It's entirely possible I'm missing something here and there is a way to bypass it. That said, I think this will work.

What exactly are we doing

So I made an Extension - XSSProtector. Here is what it does:

Set CSP script-src-attr 'none'.

This disables html attributes like onclick or onfocus. Code following MediaWiki conventions should never use these, but they are very common in attacks where you can bypass attribute sanitization. It is also very common in javascript based attacks, since the .innerHTML JS API ignores <script> tags but processes the on attributes.

Look for <script tag in content added for output (i.e. in OutputPage) and replace it with &lt;script. MediaWiki code following coding conventions should always use ResourceLoader or at least OutputPage::addHeadItem to add scripts, so only evil stuff should match. If it is in an attribute, there should be no harm in replacing it with an entity.
Ditto for <meta and <base tags. Kind of a side point, but you can use <meta http-equiv="refresh" ... to redirect the page. <base can be used to adjust where resources are loaded from, and sometimes to pass data via the target attribute. We also use base-uri CSP directive to restrict this.
Add an additional CSP tag after page load - script-src-elem *, this disables unsafe-inline after page load. MediaWiki uses dynamic inline script tags during initial load for "legacy" scripts. I don't think it needs that after page load (Though I'm honestly not sure). The primary reason to do this is to disable javascript: URIs, which would be a major method to otherwise bypass this system.
We also try to regex out links with javascript URIs, but the regex is sketchy and I don't have great confidence in it the same way I do with the regex for <script.
Restrict form-action targets to 'self' to reduce risk of scriptless XSS that tricks users with forms
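Taken together, the steps above amount to roughly the following. This is a JavaScript sketch for illustration only: the real extension is a MediaWiki (PHP) extension and its code may well differ, the function names are hypothetical, the regex is a simplification, and the 'self' value for base-uri is an assumption since the value isn't stated above.

```javascript
// CSP directives named in the list above, joined into one header value.
// NOTE: "base-uri 'self'" is an assumed value; the post only says base-uri is used.
function cspHeaderValue() {
  return [
    "script-src-attr 'none'", // no onclick=, onfocus=, ... attributes
    "base-uri 'self'",        // restrict <base href>
    "form-action 'self'",     // forms may only submit back to the wiki
  ].join('; ');
}

// Replace the "<" of <script, <meta and <base with an entity so the browser
// renders it as text. Case-insensitive; closing tags are left alone.
function neutralizeTags(html) {
  return html.replace(/<(?=(?:script|meta|base)\b)/gi, '&lt;');
}

// Extra policy added once the page has loaded. "*" allows script files from
// anywhere but, unlike 'unsafe-inline', not inline scripts or javascript: URIs.
function postLoadCspMeta() {
  return '<meta http-equiv="Content-Security-Policy" content="script-src-elem *">';
}
```

For example, `neutralizeTags('<ScRiPt>alert(1)</script>')` gives `&lt;ScRiPt>alert(1)</script>`, which displays as text instead of running, while ordinary markup such as `<p>` passes through unchanged.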

The main thing this misses is <style> tags. Attackers could potentially add them to extract data from a page, either by unclosed markup loading a resource that contains the rest of the page in the url or via attacks that use attribute selectors in CSS (so-called "scriptless XSS"). It also could allow the attacker to make the page display weird in an attempt to trick the user. This would be pretty hard to block, especially if the TemplateStyles extension is enabled, and the risk is relatively low as there is not much you can do with it. In any case, I decided not to care.

The way the extension hooks into the Message class is very hacky. If this turns out to be a good idea, probably the extension would need to become part of core or new hooks would have to be added to Message.

Does it work?

Seems to. Of course, the mere fact I can't hack the thing I myself came up with isn't exactly the greatest brag. Nonetheless I think it works and I haven't been able to think of any bypasses. It also seems to not break anything in my initial testing.

Extension support is a little less clear. I think it will work for most extensions that do normal things. Some extensions probably do things that won't work. In most cases they could be fixed by following MediaWiki coding conventions. In some cases, they are intrinsically problematic, such as Extension:Widgets.

To be very clear, this hasn't been exhaustively tested, so YMMV.

How many vulns will it stop?

Let's take a look at recent vulnerabilities in MediaWiki core. Looking at the vulns in the MediaWiki 1.39 release series, between 1.39.0 and 1.39.13 there were 29 security vulnerabilities.

17 of these vulnerabilities were not XSS. Note that many of these are very low severity, to the point it's debatable if they even are security vulnerabilities. If I was triaging the non-XSS vulnerabilities, I would say there are 6 informational (informational is code for: I don't think this is a security vulnerability but other people disagree), 9 low severity, 2 medium-low severity. None of them come close to the severity of an (unauthenticated) XSS, although some may be on par with an XSS vuln that requires admin rights to exploit.

While I haven't explicitly tested all of them, I believe the remaining 12 would be blocked by this extension. Additionally, if we are just counting by number, this is a bit of an under count, as in many cases multiple issues are being counted as a single phab ticket, if reported at the same time.

In conclusion, this extension would have stopped 41% of the security vulnerabilities found so far in the 1.39.x release series of MediaWiki, including all of the high severity ones. That's pretty good in my opinion.

Try it yourself

You can download the extension from here. I'd be very curious if you find that the extension breaks anything or otherwise causes unusual behaviour. I'd also love for people to test it to see if they can bypass any of its protections.

It should support MediaWiki 1.39 and above, but please use the REL1_XX branch for the version of MediaWiki you have (i.e. on 1.39 use the REL1_39 branch) as the master branch is not compatible with older MediaWiki.

Friday, February 28, 2025

Exploring structured data on commons

Recently I've been helping Yaron do some SPARQL query optimization for his site Commons Walkabout.

It's a cool site. It lets you explore the media on Wikimedia Commons by filtering through various metadata fields.

For example - bodies of water located in a place that is in Ireland.

It's addicting too. The media with the best quality metadata tends to be those donated by museums, which often means they are quite interesting.

An image the commons walkabout app showed me that I thought was pretty: Arabah desert in 1952 photographed by Benno Rothenberg

Structured data on Commons

As a result of helping with the optimization, I've been exploring structured (Wikidata-style) data on commons.

The structured data project has been a bit controversial. I think there is a feeling in the community that WMF abandoned the project at the 90% done point. It does mostly work, but there are still a lot of rough edges that make it less successful than it would otherwise be. Tools for authoring, interacting and maintaining the metadata are lacking from what I understand. Most files are not as well described by metadata as they ought to be. Importantly there is now a dual system of organization - the traditional free text image description pages and category system, along with the newer structured metadata. Doing both means we're never fully committed to either.

Source: XKCD

The biggest criticism is the situation with the commons query service (See T297995 and T376979). Right now the service requires you to log in first. In the beginning this sounded like it would be a temporary restriction, but it now appears permanent.

Logging in is a big issue because it makes it impossible to do client side apps that act as a front-end to the query service (It's not theoretically impossible, but the WMF's implementation of logging in doesn't support that). The auth implementation is not very user friendly, which is a significant hindrance, especially when many people who want to do queries aren't necessarily professional programmers (For example, the official instructions suggest using the browser dev console to look up the value of certain cookies as one of the steps to use the system). Some users have described the authentication system as such a hindrance that it makes more sense to shut the whole thing down than to keep it behind auth.

The SPARQL ecosystem is designed to be a linked ecosystem where you can query remote databases. The auth system means commons query service can't be used in federation. It can talk to other servers but other servers cannot talk to it.

It's a bit hard to understand why WMF is doing this. Wikidata is fully open, and that is a much larger dataset of interest to a much broader group of people. If Blazegraph is hard (Which don't get me wrong, I am sure it is), the commons instance should be trivial compared to the Wikidata one. You can just look at the usage graphs that clearly show almost nobody using the commons query service relative to the wikidata query service. The commons query service seems to be averaging about 0 requests per minute with occasional spikes up to 15-40 reqs/min. In comparison, the wikidata query service seems to average about 7500 reqs/minute.

I've heard various theories: That this is the beginning of an enshittification process so that Wikimedia can eventually sell access under the Wikimedia Enterprise banner, or that they don't want AI companies to scrape all the data (Why an AI company would want to scrape this but not wikidata and why we would want to prevent them, I have no idea). These aren't really super convincing to me.

I suspect the real reason is that WMF has largely cut funding to the structured data project. That just leaves a sysadmin team responsible for the Blazegraph query end point. However normally such a team would work in concert with another team more broadly responsible for SDC. With no such other team, the Blazegraph sysadmin team is very scared of being sucked into a position where suddenly they are solely responsible for things that should be outside their team's remit. They really don't want that to happen (hard to blame them), so they are putting the brakes on moving things forward with the commons query service.

This is just a guess. I don't know for sure what is happening or why, but that is the theory that makes the most sense to me.

The data model

Regardless of the above, the structured data project and SPARQL is really cool. I actually really like it.

While playing with it though, some parts do seem kind of weird to me.

Blank nodes for creator

The creator property says who created the image. Ideally the value is supposed to be a Q-number, but many creators don't have one.

The solution is to use a blank node. This makes sense, blank nodes in RDF are placeholders. They aren't equal to any other node, but allow you to specify properties.

If the relationship chain was:

<sdc:Some image> <wdt:P170 (creator)> < blank node > <wdt:P4174 Wikimedia Username> "Some username"

That would be fine. However it's not. Instead the creator property is reified so that <wdt:P4174 Wikimedia Username> "Some username" is modifying the creator predicate instead of being a property of the blank node.

This feels so ontologically weird to me. It's kind of weird we have to resort to such a hack for what is undoubtedly the main use case of this system.
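In query terms, that means going through the statement node rather than through the creator value. A sketch, assuming the Commons query service exposes the usual Wikibase p: and pq: prefixes for statements and qualifiers; the helper names are hypothetical, the query is untested against the live service, and it is wrapped in JavaScript only so it can be dropped into a script:

```javascript
// Escape a value for use inside a double-quoted SPARQL string literal.
function escapeSparqlString(s) {
  return s
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
    .replace(/\r/g, '\\r');
}

// Find files whose creator statement carries the given username.
// The username hangs off the statement (pq:P4174), not off the creator
// value itself, which is why the pattern goes via p:P170.
function filesByCreatorUsernameQuery(username, limit) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error('limit must be a positive integer');
  }
  return [
    'SELECT ?file WHERE {',
    '  ?file p:P170 ?statement .',
    '  ?statement pq:P4174 "' + escapeSparqlString(username) + '" .',
    '}',
    'LIMIT ' + limit,
  ].join('\n');
}
```

Calling `filesByCreatorUsernameQuery('Some username', 10)` produces a query that never touches the blank node at all, which is convenient but also shows how little the blank node itself is doing.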

Functionally dependent predicates

Some predicates functionally depend on the file in question. I feel like these should be added by the system. They should not be controlled or edited by the user.

For example P3575 data size. That's technical metadata. Users should not be responsible for inserting it. Users should not be able to change it. The fact that they are is a poor design in my opinion. Similarly for P2048 height, P2049 width, P1163 media type, P4092 checksum.

I find P1163 media type (aka the mime type of the file) especially weird. Why is this a string? Surely it should be the Q number of the file format in question if we're going to be manually filling it out?

The especially weird part is some of this data is automatically added to the system in the schema namespace. The system automatically adds schema:contentSize, schema:encodingFormat, schema:height, schema:width, schema:numberOfPages which are equivalent to some of these properties. So why are we duplicating them by hand (or by bot)?

At the same time, there seems to be a lot missing from schema. I don't see sha1 hash (sha1 isn't the best hash since it is now broken, but it's the one MediaWiki uses internally for images). I'd love to see XMP file metadata included here as well, since it is already in RDF format.

The thing really missing (unless I missed it) is it seems impossible to get the url of the file description page or even the MediaWiki page title, without string manipulation. This seems like a bizarre omission. I should be able to go from the canonical url at commons to the SDC item.

Querying the system

Querying the system can be a bit tricky sometimes because the data is usually spread between commons and wikidata, so you have to make use of SERVICE clauses to query the remote data, which can be slow.

The main trick seems to be to try and minimize the cross database communication. If you have a big dataset, try and minimize it before communicating with the remote database (or the label service).

Copious use of named subqueries (due to them being isolated by the optimizer) can really help here.

If you are fetching distinct terms (or counts of distinct terms), ensuring that Blazegraph can use the distinct term optimization is very helpful. It seems like the Blazegraph query optimizer isn't very good and often cannot use this optimization even when it should. Making the group by very simple and putting it into a named subquery can help with this.

The distinct term optimization is critically important for running fast aggregation queries. Often it makes sense to first get the list of distinct terms you are interested in and their count (if applicable) in a named subquery, then go and fetch information for each item or filter them instead of doing it all in one group by.

If you have a slow query that involves any sort of group by, the first thing I would suggest is to try and extract a simple group by into a subquery (by simple I mean: only 1 basic graph pattern, no services, grouping by only 1 term, and either no aggregate functions or the only aggregate function being count(*)) and then use the results of that query as the basis of the rest of your query.

If its still too much, using the bd:sample service can be really helpful. This runs the query over a random subset of results instead of the whole thing. If you just want to get the broad trends this can often be good enough.
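For illustration, a sampled count of creators might look something like the sketch below. The sample size of 100,000 is an arbitrary number I picked, and the resulting counts only mean something relative to each other (or after scaling up by the sampling ratio).

```sparql
# Prefixes (wdt:, bd:) are assumed to be predefined by the query UI.
SELECT ?creator (COUNT(*) AS ?sampled)
WHERE {
  SERVICE bd:sample {
    ?file wdt:P170 ?creator .
    bd:serviceParam bd:sample.limit 100000 .          # illustrative sample size
    bd:serviceParam bd:sample.sampleType "RANDOM" .
  }
}
GROUP BY ?creator
ORDER BY DESC(?sampled)
LIMIT 50
```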

The most complicated thing to query seems to be the P170 creator predicate. There are 72 million items with that predicate, the vast majority are blank nodes, so the number of distinct terms is in the millions (even for files with the same creator, they are considered distinct if they are blank nodes). Queries involving it seem to almost always time out.

Initially I thought the best that could be done was sampling and interpolating. For example, this query gives you the top creators (who have Q numbers). The numbers aren't exactly right, but they seem to be within the right order of magnitude.

Unfortunately filtering via wikibase:isSomeValue() is very slow so we can't just filter out the blank nodes. I did find a hack though. In general blazegraph arranges blank nodes at the end of the result set (or at least, it seems so in this case). If you do a subquery of distinct terms with a limit of about 10,000 you can get all the non-blank nodes (since there are only about 7200 of them and they are at the beginning). This is hacky since you can't use range queries with URI values and you can't even put an order by on the query or it will slow down, so you just have to trust blazegraph is consistently returning things in the order you want even though it is by no means required to. It seems to work. For example, here is a table using this method counting the number of images created by creators (with Q numbers) grouped by their cause of death. A bit morbid, but it is fascinating you can make such an arbitrary query.
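The general shape of the hack is something like the following sketch (a simplified illustration, not the exact query behind the table). The name %creators is arbitrary, and the whole thing relies on the unguaranteed ordering described above.

```sparql
SELECT ?creator (COUNT(?file) AS ?count)
WITH {
  # Distinct creators, deliberately with no ORDER BY. This relies on
  # blazegraph happening to return the non-blank nodes first.
  SELECT DISTINCT ?creator WHERE {
    ?file wdt:P170 ?creator .
  }
  LIMIT 10000
} AS %creators
WHERE {
  INCLUDE %creators
  ?file wdt:P170 ?creator .
}
GROUP BY ?creator
ORDER BY DESC(?count)
```

Any blank nodes that sneak in under the limit each show up with a count of 1, so they end up at the bottom of the list.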

Conclusion

Anyways, I do find RDF and SPARQL really cool, so it's been fun poking around in commons' implementation of it. Also check out Yaron's site at https://commonswalkabout.org/ which is really cool.

Thursday, January 30, 2025

Happy (belated) new year

I was going to write a New Year's post. Suddenly it's almost February.

Last year I had the goal of writing at least 1 blog post a month.

I was inspired by Tyler Cipriani who wrote a post talking about his goal of writing at least once a month. It seemed like a reasonable goal to help practice writing. I've often wanted to maintain a blog in the past, but would always find I petered out after a few posts. So last year I set myself the goal of at least once a month.

How'd I do?

Ok I guess. I wrote 9 blog posts last year. Short of my goal of 12, but still not terrible. Looking back, I realize that I wrote 9 in 2023 and 11 in 2022, so I guess the goal didn't actually increase my writing. Nonetheless I'm still pretty happy that I was writing posts throughout the year.

Based on blogger's internal analytics, it seems like people like when I do CTF writeups. For some unclear reason my post about MUDCon got a tremendous amount of views relative to my other posts. However, the real goal is more to practice writing than to get views, so I suppose it doesn't matter much. That said, I do write the CTF writeups in the hope that others can learn from them, so it is nice to see that people read them.

On to the next one. Maybe this year I'll actually make it to once a month - I'd like to keep going with this goal. I think I'll try and write more short, off-the-cuff things (like this) and fewer big technical posts.

This year

It's already been a month and I've mostly forgotten about my goals, but I did write some down.

I want this to be a year of trying new things. I want to try and branch out to new and different things.

I want to explore my creative side a bit more. To be clear, I do think there is a lot of creativity in computer programming, but I also want to try more traditionally creative things. Paint some pictures! I've been taking a beginner acrylic painting class at the local community center, which has been great so far. Maybe I should try and join the two interests and make a silly computer game.

I'd also like to explore different computer things. I've been doing MediaWiki stuff for a while now, but I don't want that to be the only thing I ever do. I'd like to try and direct my open source contributions towards things I haven't done before. Maybe things that are more low level than PHP web apps. Perhaps I should learn Rust! I'd like to work on stuff where I feel like I'm learning and I think it would be cool to spend some time learning more about traditional CS concepts (algorithms and what not). It's been a long time since I was in university and learning about that sort of thing. Maybe it would be fun to brush up. In my current programming work it's very rare for that sort of thing to be relevant.

Where I do keep with open source contributions in the MediaWiki/Wikimedia sphere of influence, I want to work on things that are unique or new. Things that, whether they are good or bad ideas, at least open up new possibilities for discussion. I can't help but feel that MediaWiki land has been a little stagnant. So many of the core ideas are from the late 2000s. They've been incrementally improved upon, but there really isn't anything new, both in Wikimedia land and in third party MediaWiki land. Perhaps that is just a sign of a maturing ecosystem, but I think it's time to try some crazy new ideas. Some will work and some will fail, but I want to feel like I'm working on something new that makes people think of the possibilities, not just improving what is already there (not that there is anything wrong with that, maintenance is critical and underappreciated, it's just not what I want to work on right now, at least not as a volunteer).

I think I've gotten a start in that direction with the calculator project that Wiki Project Med funded me to do. It spurred a lot of interesting discussion and ideas, which is the sort of thing I want to be involved with.

Maybe I'll explore Wikidata a bit more. I always found RDF databases kind of interesting.

On the third party MW side, I've felt for a long time that we could use some different ideas for querying and reporting in MediaWiki (Cargo and SMW are cool, but I don't think they quite make the right trade-offs). I'd love to explore ideas in that space a little bit.

So in conclusion, what are the yearly goals?

I think I want to be a little more intentional in planning out my life.
I want my open source MediaWiki contribs to be more along prototyping new and unique ideas. I want to avoid being sucked into just fixing things at Wikimedia (After all, that's why WMF allegedly has paid staff).
I want to try and pursue creative hobbies outside of computer programming.
I want to try and do more programming outside of my MediaWiki comfort zone.

See you back here this time next year to see how I did.

Thursday, January 16, 2025

Signpost article on {{Calculator}}

The most recent issue of the Wikipedia Signpost published an article I wrote about the {{calculator}} series of templates, which I worked on.

The Signpost is Wikipedia's internal newspaper. I read it all the time. It's really cool to have contributed something to it.

So what's this all about?

Essentially I was hired by Wiki Project Med to make a script to add medical calculators to their wiki, MDWiki. For example, you might have a calculator to calculate BMI, where the reader enters their weight and height. After being used on MDWiki, the script made its way to Wikipedia.

The goal of this was to make something user programmable, so that users could solve their own problems without having to wait on developers. The model we settled on was a spreadsheet-esque one. You can define cells that readers can write in, and you can define other cells that update based on some formula.

Later we also allowed manipulating CSS, to provide more interactivity. This opened up a surprising number of opportunities, such as interactive diagrams and demonstrations.

See the Signpost article for more details: https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2025-01-15/Technology_report and let me know what you think.
