Tuesday, March 31, 2026

Giving WikiApiary a kick

A few days ago I was listening to some of the talks at MUDCon (the MediaWiki conference aimed at non-Wikimedia uses of MediaWiki).

During James Hare's talk about a project to keep track of various Miraheze wikis using Wikibase (the software behind Wikidata), he briefly mentioned that WikiApiary has been down forever, and that maybe a Wikibase approach would work better than the previous Semantic MediaWiki approach. (Semantic MediaWiki is an extension to MediaWiki that allows annotating links with the "relationship" the link represents and querying those relationships.)

This reminded me that WikiApiary existed, and I thought I would try to give it a kick. For those who don't know, WikiApiary is (or was) a site that tracked which public MediaWiki instances were out there and which extensions they had installed.

I had already gotten server access years ago (people wanted me to help, but I never really did). The site was sort of up, but it was extremely flaky and timing out all the time.

Web server concurrency

If you've been paying attention to web hosting at all lately, you know that AI scraper bots are the bane of everyone's existence.

Initially, I assumed that was what was happening here. I may have been partially wrong about that, but I think it was a contributing factor.

One of the most common performance problems with MediaWiki is people setting up Apache with its default maximum of 150 threads. If a large spike of traffic comes in, all the threads fight over resources and everything becomes really slow, causing even more threads to pile up and grinding everything to a halt. It's even worse if you don't have enough memory and have swap enabled: then you end up with swap death. (This server did not have swap enabled, but I mention it because it's such a common failure case.) It's often better to process a few requests at a time and make the other requests wait their turn than to try to do too much all at once and accomplish nothing.
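
A common first-line mitigation, separate from the Varnish setup I describe below, is simply lowering Apache's worker cap to something the hardware can actually sustain. A rough sketch for the prefork MPM; the numbers here are illustrative, not what this server uses, and should be sized to available RAM divided by per-PHP-process memory:

# mpm_prefork.conf -- illustrative values, not a recommendation
<IfModule mpm_prefork_module>
    StartServers             5
    MinSpareServers          5
    MaxSpareServers         10
    # The common distro default of 150 is usually far too many when
    # each worker is a PHP interpreter using hundreds of MB.
    MaxRequestWorkers       30
    MaxConnectionsPerChild 1000
</IfModule>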

To address this, I installed Varnish. Varnish is a great piece of software that caches recently served pages, which significantly reduces server load in general. It can also help by deduplicating requests to a certain extent: if two people request the same page at the same time, it sends only one request to the backend, so the backend doesn't get overloaded processing the same page twice for two separate people.

It can also cap the number of requests in flight to the backend, which makes sure the backend doesn't get too overloaded.

One of the lesser-used features of Varnish is the ability to set up multiple backends. I like to use this to define different backends for different classes of requests. For WikiApiary I set up four: likely bots; non-normal article views (e.g. diffs, history, special pages); normal article views; and logged-in users & images. Each class got a backend with a different cap on concurrent requests. For bots, I set it to at most one, mostly in case of false positives; my bot heuristic is simply "uses an old version of Chrome". By comparison, normal pages were given 15 and non-normal pages 3. Thus bots should not be able to take down the site; at worst they could take down requests in the same class as them. I gave effectively no limit (actually 60) to logged-in users and to requests for static resources like images, which are super cheap to serve.

As a special case, I outright blocked Facebook's AI scraper, as it was being super aggressive. I also blocked logged-out access to Special:Browse. I hate outright blocking things for logged-out users; how can lurkers become contributors if you block them from everything? However, Special:Browse was using OFFSET-based paging, which made it super expensive to render, while at the same time it was linked everywhere and a very common spider target.

This is what I ended up with for the varnish config:

backend default {
    .host = "127.0.0.1";
    .port = "8088";
    .max_connections = 60;
    # Unfortunately the version of Varnish I have is too old for .wait_limit
    # and .wait_timeout. However, if you're doing something like this with
    # small max_connections, you definitely want a wait queue to even out spikes.
#    .wait_limit = 10;
#    .wait_timeout = 10s;
}

backend page {
    .host = "127.0.0.1";
    .port = "8088";
    .max_connections = 15;
#    .wait_timeout = 30s;
#    .wait_limit = 60;
    .first_byte_timeout = 200s;
    .between_bytes_timeout = 300s;
}

backend bot {
    .host = "127.0.0.1";
    .port = "8088";
    .max_connections = 1;
    .first_byte_timeout = 200s;
    .between_bytes_timeout = 300s;
}

backend special {
    .host = "127.0.0.1";
    .port = "8088";
    .max_connections = 3;
#    .wait_timeout = 60s;
#    .wait_limit = 15;
    .first_byte_timeout = 120s;
    .between_bytes_timeout = 200s;
}
# Access control list for "purge": only localhost is allowed to purge.
acl purge {
    "127.0.0.1";
}

# vcl_recv is called whenever a request is received 
sub vcl_recv {
        # Serve objects up to 500 seconds past their expiry if the backend
        # is slow to respond.
        set req.grace = 500s;

        set req.http.X-Forwarded-For = req.http.X-Forwarded-For + ", " + client.ip;

        set req.backend_hint = default;

        # This uses the ACL action called "purge". Basically if a request to
        # PURGE the cache comes from anywhere other than localhost, ignore it.
        if (req.method == "PURGE") {
            if (!client.ip ~ purge) {
                return (synth(405, "Not allowed."));
            } else {
                return (purge);
            }
        }

	# Facebook's AI crawler; it was crawling very fast, so block it outright.
	if ( req.http.User-Agent ~ "^meta-externalagent" ) {
		return (synth( 403, "No crawling please" ) );
	}
        
        # Pass requests from logged-in users directly.
        # Only treat cookies with "session" or "Token" in the name as login
        # cookies; otherwise nothing gets cached.
        if (req.http.Authorization || req.http.Cookie ~ "([sS]ession|Token)=") {
            return (pass);
        } /* Not cacheable by default */

	# Deep OFFSET-based paging of Property: pages is extremely expensive to
	# render and a popular spider target, so require login for it.
	if ( req.url ~ "^/w/index.php\?(.*&t|t)itle=Property:.*&limit=\d*&offset=\d\d\d\d" ) {
		return (synth( 403, "Log in to view more" ));
	}

	# Rate limit likely bots: old Chrome (< 136), MSIE and Opera user agents
	# (note: the Samsung browser reports itself as Chrome 136, hence the cutoff).
	# Also Applebot, which is only rate limited rather than blocked because
	# it is well behaved.
	if (req.http.User-Agent ~ "(Chrome/[0-9][0-9]\.|Chrome/1[012][0-9]|Chrome/13[0-5]|Opera|MSIE|Applebot)" && req.url ~ "^/(wiki/|w/index.php|w/api.php)" ) {
		set req.backend_hint = bot;
	} elsif (req.method == "GET" &&
		( req.url ~ "^/w/index.php" || req.url ~ "^/wiki/Special:" || req.url ~ "/wiki/.*\?" || req.url ~ "^/w/api.php" ) &&
		!( req.url ~ "(Special:CreateAccount|Special:UserLogin|Special:RecentChanges|Special:Random)" )
	) {
		set req.backend_hint = special;
	} elsif ( req.method == "GET" && req.url ~ "^/wiki/" ) {
		set req.backend_hint = page;
	} elsif ( req.method == "POST" && req.url ~ "(^/w/api.php|Special(:|%3A)Browse)" ) {
		set req.backend_hint = special;
	}

	if ( req.method != "GET" ) {
		return (pass);
	}

        # normalize Accept-Encoding to reduce vary
        if (req.http.Accept-Encoding) {
          if (req.http.User-Agent ~ "MSIE 6") {
            unset req.http.Accept-Encoding;
          } elsif (req.http.Accept-Encoding ~ "gzip") {
            set req.http.Accept-Encoding = "gzip";
          } elsif (req.http.Accept-Encoding ~ "deflate") {
            set req.http.Accept-Encoding = "deflate";
          } else {
            unset req.http.Accept-Encoding;
          }
        }
 
        return (hash);
}

sub vcl_pipe {
        # Note that only the first request to the backend will have
        # X-Forwarded-For set.  If you use X-Forwarded-For and want to
        # have it set for all requests, make sure to have:
        # set req.http.connection = "close";
 
        # This is otherwise not necessary if you do not do any request rewriting.
 
        set req.http.connection = "close";
}

# Called if the cache has a copy of the page.
sub vcl_hit {
        # Cached object has expired; go to the backend instead.
        if (obj.ttl <= 0s) {
            return (pass);
        }
}

# Called after a document has been successfully retrieved from the backend.
sub vcl_backend_response {
        # Don't cache 50x responses
        if (beresp.status == 500 || beresp.status == 502 || beresp.status == 503 || beresp.status == 504) {
            set beresp.uncacheable = true;
            return (deliver);
        }   
        if (beresp.http.Set-Cookie) {
          set beresp.uncacheable = true;
          return (deliver);
        }

        # A TTL of zero or less means not cacheable; don't store it.
        if (beresp.ttl <= 0s) {
          set beresp.uncacheable = true;
          return (deliver);
        }

        # Authorization is a request header, so check the backend request,
        # not the response.
        if (bereq.http.Authorization && beresp.http.Cache-Control !~ "public") {
          set beresp.uncacheable = true;
          return (deliver);
        }

        # Keep objects for up to 2 hours past expiry for grace mode.
        set beresp.grace = 2h;
        return (deliver);
}

  

Request Limits

However, I still saw lots of slow requests piling up. Sometimes DB queries seemed to take longer than the request stayed open, which was entirely pointless, as the user had gone away by the time the query finished.

Ideally, users would not be able to trigger super slow requests at all. However, in a system like Semantic MediaWiki, where users are allowed to make arbitrary queries (against, quite frankly, a questionably optimized DB schema), it's going to happen.

So the important thing is to make sure that when a query arrives that can't be answered in a reasonable amount of time, we stop processing it instead of continuing to waste resources on it. This is extra important because slow requests have a cascade effect: the first slow request ties up a lot of resources, slowing down the other requests running at the same time. Throughput falls, more requests come in than are finishing, and those are slowed in turn by the general busyness of the system. Basically, a traffic jam happens.

To deal with this I did two things:

  • Install the php-excimer package and set $wgRequestTimeLimit to 300. This allows MediaWiki to gracefully set a time limit for itself. Unlike PHP's max_execution_time, the php-excimer extension allows handling the timeout gracefully, and it applies the timeout to wall-clock time instead of CPU time (important because, in an overload, the CPU might be split amongst many PHP processes, so only a little CPU time might have passed).
  • Set MariaDB's max_statement_time config to 200 seconds. This is the maximum amount of time a query can take before it is killed, which ensures that a runaway query is time limited. You have to be careful though: since this config is global, you'll want to disable it for your session before running any sort of DB maintenance. (Both settings are sketched just below.)
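
In config terms this boils down to two lines; treat the following as a sketch rather than a copy of the server's actual config. One line goes in LocalSettings.php:

// Wall-clock request time limit in seconds, enforced via php-excimer.
$wgRequestTimeLimit = 300;

and one goes in the MariaDB server config (file paths and section layout vary by distro):

[mysqld]
# Kill any statement that runs for more than 200 seconds.
max_statement_time = 200

Since max_statement_time has both global and session scope, a maintenance session can lift the limit for itself with SET SESSION max_statement_time=0; without turning it off for everyone else.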

This helped a bit to prevent things from piling up. However, queries were still slow, and when I looked at iotop and top, it seemed like MariaDB was using excessive CPU and disk I/O but almost no memory.

Almost no memory? That doesn't seem right for a database. I'd missed the obvious thing: innodb_buffer_pool_size was set to only 128MB! This is one of the most important settings for MariaDB/MySQL performance; it effectively determines how much RAM the DB uses (well, one kind of RAM anyway, but kind of the important one). The server had 23GB of RAM and the DB was limited to 128MB. No wonder queries were slow. The traditional advice is that on a dedicated server this value should be about 80% of the server's RAM. We have to subtract from that the RAM MediaWiki needs, since both were on the same server, but nonetheless this needed to be way higher. I upped it to 11GB, and query speed increased by orders of magnitude almost instantly.

I also noticed that the in-memory temporary table size was really small. This doesn't matter much for vanilla MediaWiki, but for extensions like Semantic MediaWiki that do a lot of user-defined queries sorting through many results, it is really important, so I bumped max_heap_table_size, tmp_table_size and tmp_memory_table_size to 800MB.
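
Put together, the MariaDB memory settings ended up looking roughly like this (the values are the ones described above, tuned for this particular 23GB server, not general recommendations):

[mysqld]
# How much RAM InnoDB may use to cache data and index pages.
# Was 128M on a 23GB server; the usual advice is ~80% of RAM on a
# dedicated DB box, minus whatever MediaWiki itself needs here.
innodb_buffer_pool_size = 11G

# Let in-memory temporary tables (heavily used by Semantic MediaWiki's
# sorting and grouping queries) grow large before spilling to disk.
max_heap_table_size     = 800M
tmp_table_size          = 800M
tmp_memory_table_size   = 800M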

The Database

As I looked through the slow query log, I saw a lot of Semantic MediaWiki queries that seemed somewhat questionable, at least at first glance. Semantic MediaWiki seems to love SELECT DISTINCT, which in some cases is much harder for MySQL to optimize. I think there is a lot of room for optimization in Semantic MediaWiki. I filed a bug about the things that looked off to me, but I'm not a Semantic MediaWiki dev, so maybe there are reasons things are the way they are.
 
I did make some local schema changes: removing some indexes that seemed to duplicate other indexes, and making the index on smw_hash cover only the first 8 bytes of the SHA-1 hash instead of the whole thing. I don't really know whether that helped, but for a large database like this, making indexes take less room means data is more likely to already be in the buffer pool, and less disk I/O is needed to fetch things. After compressing page revisions and running OPTIMIZE TABLE on all the tables, disk space usage dropped by about 160GB, which was also good, as the disk was about 75% full.
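
The index change was along these lines. This is only a sketch: the table and index names below are from my memory of the SMW schema, so check SHOW CREATE TABLE before copying anything, and note that a prefix index can't remain UNIQUE.

-- Index only the first 8 bytes of the 40-character hex SHA-1 hash.
-- (Names are illustrative; verify against your actual schema first.)
ALTER TABLE smw_object_ids
    DROP INDEX smw_hash,
    ADD INDEX smw_hash (smw_hash(8));

-- Rebuild the table afterwards to reclaim the space.
OPTIMIZE TABLE smw_object_ids;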
 
Honestly, it seems like the SMW query model would be better served by something like InnoDB full-text search. There is already an optional Elasticsearch backend, but it's not the default and requires an external dependency. SMW searches are largely intersection queries between a bunch of properties. There are also some operators like <, >, LIKE and !=, but those seem to be used much less. It's a graph model, but it doesn't seem to implement any sort of path queries. I imagine the Elasticsearch backend would do very well with this.

I also tried to optimize some of the templates. Some of this involved moving templates based on array functions to Lua. Array functions is a MediaWiki extension providing array_map-like functionality in wikitext; however, it tends to be very slow.
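
To give a flavor of that kind of conversion (the module name and input format here are invented, not the wiki's actual templates), an array-mapping template can become a small Scribunto module:

-- Module:DemoList (hypothetical name), replacing an arraymap-style template.
local p = {}

-- Turn a comma-separated list into a list of wiki links.
-- Invoked from wikitext as {{#invoke:DemoList|links|Foo, Bar, Baz}}
function p.links( frame )
	local out = {}
	for item in mw.text.gsplit( frame.args[1] or '', '%s*,%s*' ) do
		if item ~= '' then
			table.insert( out, '[[' .. item .. ']]' )
		end
	end
	return table.concat( out, ', ' )
end

return p

A single Lua invocation like this avoids re-expanding wikitext for every list element, which is where the array functions approach loses its time.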

Some of the SMW queries in templates were essentially doing their own GROUP BY: they would request 5000 results and try to find all the unique answers. This was slow for large properties, such as on the page for Extension:ParserFunctions. It seemed that only part of the cost was on the database side (the sortkey isn't indexed, so MySQL has to look at all the results no matter how many you return); SMW also seemed to be doing an additional query per result, which added latency.

Instead, I tried making it fetch a single result and then use a greater-than query to fetch the next one. This was much faster; however, it did not work properly for pages with multi-valued properties. Apparently in Semantic MediaWiki, multi-valued properties are returned together in a single row, unlike SQL, where an inner join produces multiple rows. At the same time, queries have to match all the properties, so it's essentially impossible to use any condition besides = on a multi-valued property. If you say [[My Property::!Foo]] to match pages not having the value Foo, and the property's values are Foo and Bar, the query still matches because of the Bar value. Even [[My Property::!Foo]] [[My Property::!Bar]] wouldn't work, because each value is tested separately: Foo matches "not Bar" and Bar matches "not Foo", so either way the page matches. This seems like a major oversight in the query model of SMW, but I guess it is what it is.
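
For reference, the one-result-at-a-time pattern looked roughly like this. The property name "Has extension" is invented for illustration, and note that with SMW's default settings the > comparator actually means greater-or-equal, with >> for strictly greater:

<!-- "Has extension" and "last" are hypothetical names. Fetch the single
     smallest value strictly above the previous result. -->
{{#ask: [[Has extension::>>{{{last|}}}]]
 |?Has extension
 |sort=Has extension
 |order=asc
 |limit=1
}}

The very first call would instead use [[Has extension::+]] with no bound; after that, the template feeds each returned value back in as the next "last" parameter, rather than pulling 5000 rows and deduplicating them in wikitext.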

Conclusion

It works now. Not every page is super speedy; some outliers, like the page on Extension:ParserFunctions, still take about a minute to load, but all pages at least load now. Previously, even simple pages were taking about 600 seconds to load, with almost none of them actually finishing before hitting a timeout.

Of course, the update bots are not enabled yet, and I'm not sure what impact they will have. I'm not exactly sure how they worked originally or what it will take to turn them back on.

I also heard there is now a competing WikiApiary-type site: https://catalogue.ai.wu.ac.at/. Check it out!