
A few days ago I was listening to some of the talks at MUDCon (the MediaWiki conference aimed at non-Wikimedia uses of MediaWiki).
During James Hare's talk about a project to keep track of various Miraheze wikis using Wikibase (the software behind Wikidata), he briefly mentioned that WikiApiary has been down forever, and that maybe a Wikibase approach would be better than the previous Semantic MediaWiki approach (Semantic MediaWiki is an extension to MediaWiki that allows annotating links with the "relationship" the link represents, and querying those relationships).
This reminded me that WikiApiary existed, and I thought I would try to give it a kick. For those who don't know, WikiApiary is/was a site that tracked what public MediaWiki instances were out there and which extensions they had installed.
I had already gotten server access years ago (people wanted me to help, but I never really did). The site was sort of up, but extremely flaky and timing out all the time.
Web server concurrency
If you've been paying attention to web hosting at all lately, you know that AI scraper bots are the bane of everyone's existence.
Initially I assumed that was what was happening here. I may have been partially wrong about that, but I think it was a contributing factor.
One of the most common performance problems with MediaWiki is people leaving Apache at its default of 150 worker threads. If a large spike in traffic comes in, all the threads fight over resources and everything becomes really slow, causing even more threads to pile up and grinding everything to a halt. It's even worse if you don't have enough memory and have swap enabled: you end up with swap death (this server did not have swap enabled, but I mention it because it's such a common failure case). It's often better to process a few requests at a time and make the other requests wait their turn than to try to do too much all at once and accomplish nothing.
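As a rough illustration (the exact file, MPM, and numbers are assumptions for a Debian-style prefork setup; size the cap to available RAM divided by per-process memory), lowering Apache's worker limit looks something like:

```apache
# /etc/apache2/mods-available/mpm_prefork.conf (hypothetical values)
<IfModule mpm_prefork_module>
    StartServers            5
    MinSpareServers         5
    MaxSpareServers        10
    # Well below the 150 default; better to queue than to thrash.
    MaxRequestWorkers      30
    MaxConnectionsPerChild 1000
</IfModule>
```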
To address this, I installed Varnish. Varnish is a great piece of software that caches recently requested pages, significantly reducing server load in general. It also deduplicates requests to a certain extent: if two people request the same page at the same time, it sends just one request to the backend, so the backend doesn't get overloaded processing the same page twice for two separate people.
It can also cap the number of requests in flight to the backend, which makes sure the backend doesn't get too overloaded.
One of the less used features of Varnish is the ability to set up multiple backends. I like to use this to define different backends for different classes of requests. For WikiApiary I set up four: likely bots, non-normal article views (e.g. diffs, history, special pages), normal article views, and logged-in users & images. Each class got its own backend with a different cap on concurrent requests. For bots I set it to at most one, mostly in case of false positives; my bot heuristic is simply a user agent claiming an old version of Chrome. By comparison, normal page views were given 15 and non-normal pages 3. Thus bots should not be able to take down the site; at worst they could take down requests in the same class as them. I gave effectively no limit (actually 60) to logged-in users and to requests for static resources like images, which are super cheap to serve.
As a special case, I outright blocked Facebook's AI scraper, as it was being super aggressive. I also blocked logged-out access to Special:Browse. I hate outright blocking things for logged-out users; how can lurkers become contributors if you block them from everything? However, Special:Browse was using OFFSET-based paging, making it super expensive to render, while at the same time being linked everywhere and a very common spider target.
This is what I ended up with for the Varnish config:
backend default {
    .host = "127.0.0.1";
    .port = "8088";
    .max_connections = 60;
    # Unfortunately the version of Varnish I have is too old for .wait_limit
    # and .wait_timeout. However, if you're doing something like this with
    # small max_connections, you definitely want a wait queue to even out spikes.
    # .wait_limit = 10;
    # .wait_timeout = 10s;
}
backend page {
    .host = "127.0.0.1";
    .port = "8088";
    .max_connections = 15;
    # .wait_timeout = 30s;
    # .wait_limit = 60;
    .first_byte_timeout = 200s;
    .between_bytes_timeout = 300s;
}
backend bot {
    .host = "127.0.0.1";
    .port = "8088";
    .max_connections = 1;
    .first_byte_timeout = 200s;
    .between_bytes_timeout = 300s;
}
backend special {
    .host = "127.0.0.1";
    .port = "8088";
    .max_connections = 3;
    # .wait_timeout = 60s;
    # .wait_limit = 15;
    .first_byte_timeout = 120s;
    .between_bytes_timeout = 200s;
}
# Access control list for "purge": open only to localhost and other local nodes.
acl purge {
    "127.0.0.1";
}
# vcl_recv is called whenever a request is received.
sub vcl_recv {
    # Serve objects up to 500 seconds past their expiry if the backend
    # is slow to respond.
    set req.grace = 500s;
    set req.http.X-Forwarded-For = req.http.X-Forwarded-For + ", " + client.ip;
    set req.backend_hint = default;
    # This uses the ACL called "purge". Basically, if a request to
    # PURGE the cache comes from anywhere other than localhost, ignore it.
    if (req.method == "PURGE") {
        if (!client.ip ~ purge) {
            return (synth(405, "Not allowed."));
        } else {
            return (purge);
        }
    }
    # Facebook's crawler. Was crawling very fast.
    if (req.http.User-Agent ~ "^meta-externalagent") {
        return (synth(403, "No crawling please"));
    }
    # Pass requests from logged-in users directly.
    # Only treat cookies with "session" or "Token" in the name as logins,
    # otherwise nothing gets cached.
    if (req.http.Authorization || req.http.Cookie ~ "([sS]ession|Token)=") {
        return (pass); /* Not cacheable by default */
    }
    # Block expensive deep-OFFSET paging for logged-out users.
    if (req.url ~ "^/w/index.php\?(.*&t|t)itle=Property:.*&limit=\d*&offset=\d\d\d\d") {
        return (synth(403, "Log in to view more"));
    }
    # Rate limit Chrome < 136, MSIE and Opera. (Note: Samsung browser is Chrome 136.)
    # Also Applebot; only rate limiting instead of blocking because it is well behaved.
    if (req.http.User-Agent ~ "(Chrome/[0-9][0-9]\.|Chrome/1[012][0-9]|Chrome/13[0-5]|Opera|MSIE|Applebot)" && req.url ~ "^/(wiki/|w/index.php|w/api.php)") {
        set req.backend_hint = bot;
    } elsif (req.method == "GET" &&
        (req.url ~ "^/w/index.php" || req.url ~ "^/wiki/Special:" || req.url ~ "/wiki/.*\?" || req.url ~ "^/w/api.php") &&
        !(req.url ~ "(Special:CreateAccount|Special:UserLogin|Special:RecentChanges|Special:Random)")
    ) {
        set req.backend_hint = special;
    } elsif (req.method == "GET" && req.url ~ "^/wiki/") {
        set req.backend_hint = page;
    } elsif (req.method == "POST" && req.url ~ "(^/w/api.php|Special(:|%3A)Browse)") {
        set req.backend_hint = special;
    }
    if (req.method != "GET") {
        return (pass);
    }
    # Normalize Accept-Encoding to reduce Vary cardinality.
    if (req.http.Accept-Encoding) {
        if (req.http.User-Agent ~ "MSIE 6") {
            unset req.http.Accept-Encoding;
        } elsif (req.http.Accept-Encoding ~ "gzip") {
            set req.http.Accept-Encoding = "gzip";
        } elsif (req.http.Accept-Encoding ~ "deflate") {
            set req.http.Accept-Encoding = "deflate";
        } else {
            unset req.http.Accept-Encoding;
        }
    }
    return (hash);
}
sub vcl_pipe {
    # Note that only the first request to the backend will have
    # X-Forwarded-For set. If you use X-Forwarded-For and want to
    # have it set for all requests, make sure to have:
    # set req.http.connection = "close";
    # This is otherwise not necessary if you do not do any request rewriting.
    set req.http.connection = "close";
}
# Called if the cache has a copy of the page.
sub vcl_hit {
    if (obj.ttl <= 0s) {
        return (pass);
    }
}
# Called after a document has been successfully retrieved from the backend.
sub vcl_backend_response {
    # Don't cache 50x responses.
    if (beresp.status == 500 || beresp.status == 502 || beresp.status == 503 || beresp.status == 504) {
        set beresp.uncacheable = true;
        return (deliver);
    }
    if (beresp.http.Set-Cookie) {
        set beresp.uncacheable = true;
        return (deliver);
    }
    if (beresp.ttl <= 0s) {
        set beresp.uncacheable = true;
        return (deliver);
    }
    if (beresp.http.Authorization && beresp.http.Cache-Control !~ "public") {
        set beresp.uncacheable = true;
        return (deliver);
    }
    set beresp.grace = 2h;
    return (deliver);
}
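Since the bot heuristic is just a regex, it's easy to sanity-check outside of Varnish. Here's a quick Python sketch of the same user-agent pattern (Python's re is close enough to Varnish's regex dialect for this pattern; URL matching is left out):

```python
import re

# Same pattern as the VCL bot heuristic above:
# old Chrome (< 136), Opera, MSIE, or Applebot.
BOT_UA = re.compile(
    r"(Chrome/[0-9][0-9]\.|Chrome/1[012][0-9]|Chrome/13[0-5]|Opera|MSIE|Applebot)"
)

def looks_like_bot(user_agent: str) -> bool:
    """True if this user agent would be routed to the rate-limited bot backend."""
    return BOT_UA.search(user_agent) is not None

print(looks_like_bot("Mozilla/5.0 (X11; Linux x86_64) Chrome/99.0.4844.51"))  # True
print(looks_like_bot("Mozilla/5.0 (X11; Linux x86_64) Chrome/135.0.0.0"))     # True
print(looks_like_bot("Mozilla/5.0 (X11; Linux x86_64) Chrome/136.0.0.0"))     # False
```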
Request Limits
However, I still saw lots of slow requests piling up. Sometimes DB queries seemed to run longer than the request stayed open, which was pointless, since the user had gone away by the time the query finished.
Ideally the user would not be able to trigger super slow requests at all. However, in a system like Semantic MediaWiki, where the user is allowed to make arbitrary queries (against, quite frankly, a questionably optimized DB schema), it is going to happen.
So the important thing is that when a query can't be answered in a reasonable amount of time, we stop processing it instead of continuing to waste resources on it. This is extra important because slow requests can cascade: the first slow request ties up a lot of resources, slowing down other requests running at the same time. Throughput falls, and suddenly the requests still coming in are also slowed by the general busyness of the system. Basically, a traffic jam.
To deal with this I did two things:
- Install the php-excimer package and set $wgRequestTimeLimit to 300. This allows MediaWiki to gracefully set a time limit for itself. Unlike PHP's max_execution_time, the php-excimer extension lets the timeout be handled gracefully, and it applies the limit to wall-clock time instead of CPU time (important because during an overload the CPU might be split amongst many PHP processes, so only a little CPU time might have passed).
- Set MariaDB's max_statement_time to 200 seconds. This is the maximum time a query can run before it is killed, which ensures a runaway query is time-limited. You have to be careful, though: this setting is global, so you'll want to disable it before running any sort of DB maintenance.
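For reference, the MediaWiki side is a one-line addition to LocalSettings.php ($wgRequestTimeLimit = 300;), and the MariaDB side might look like this (the file path is an assumption for a typical Debian-style install):

```ini
# e.g. /etc/mysql/mariadb.conf.d/50-server.cnf
[mysqld]
# Kill any statement that runs longer than 200 seconds.
max_statement_time = 200
```

During maintenance, the limit can be lifted for a single connection with `SET SESSION max_statement_time = 0;` rather than touching the global setting.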
This helped a bit to prevent things from piling up. However, queries were still slow, and when I looked at iotop and top, it seemed like MariaDB was using excessive CPU and disk I/O but almost no memory.
Almost no memory? That doesn't seem right for a database. I'd missed the obvious thing: innodb_buffer_pool_size was set to only 128 MB! This is one of the most important settings for MariaDB/MySQL performance; it effectively determines how much RAM the DB uses (well, one kind of RAM usage anyway, but the important one). The server had 23 GB of RAM and the DB was limited to 128 MB. No wonder queries were slow. Traditional advice is that on a dedicated DB server this value should be about 80% of the server's RAM. We have to subtract from that the RAM MediaWiki needs, since both were on the same server, but nonetheless it needed to be way higher. I upped it to 11 GB and query speed increased by orders of magnitude almost instantly.
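The fix itself is one line in the server config (again, the path is an assumption):

```ini
# e.g. /etc/mysql/mariadb.conf.d/50-server.cnf
[mysqld]
# Cache as much of the working set in RAM as this box can spare
# (~half of 23 GB, leaving room for PHP/MediaWiki on the same host).
innodb_buffer_pool_size = 11G
```

`SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';` shows the live value (in bytes), and comparing the `Innodb_buffer_pool_reads` and `Innodb_buffer_pool_read_requests` status counters gives a rough hit-rate check.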
I also noticed that the in-memory temporary table size was really small. This doesn't matter much for vanilla MediaWiki, but for extensions like Semantic MediaWiki that run a lot of user-defined queries sorting through many results, it is really important, so I bumped max_heap_table_size, tmp_table_size and tmp_memory_table_size to 800MB.
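As a sketch of the same config file (tmp_memory_table_size is the newer MariaDB name for tmp_table_size; max_heap_table_size is kept in sync because the effective in-memory limit is the smaller of the two):

```ini
[mysqld]
# Let larger intermediate result sets stay in memory instead of
# spilling to on-disk temporary tables.
max_heap_table_size   = 800M
tmp_table_size        = 800M
tmp_memory_table_size = 800M
```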


