Recently, I've decided to experiment with making a reporting extension for MediaWiki (working name "Sofa") as a side project. There are already several options in this space, notably Cargo, Semantic MediaWiki (SMW) and DynamicPageList (DPL). However, I think the design space is still a little under-explored. My goal is to experiment with some rather different design choices and see what I end up with. In particular, I want to make one that respects the core wiki philosophy of "quick" - e.g. changes are reflected immediately - and "soft security", where nothing users can do, whether accidentally or maliciously, can cause any (performance) harm; anything can simply be reverted. Current solutions often require manual cache purging to see changes reflected, and can have unpredictable, user-dependent performance characteristics, where subtle user choices in extreme cases could even cause a DB overload.
I don't want to talk too much about the new extension in general, as that's not what this blog post is about and I am still in the early stages. However, one design requirement left me in a bit of a conundrum: automatic cache purging. In the model for this extension, we have ordered lists of items, and pages that display a range of items from a list. To support cache purging when someone adds a new entry that would appear in a used range, we need some way to store which ranges of items are used by which pages, so that, given a specific item, we can query which pages use a range containing that item. This turned out to be surprisingly difficult. I thought I'd write a post about the different methods I considered.
The naive approach
For example, we might have the following list of alphabetically ordered items:
id | item |
---|---|
17 | Alpha |
12 | Atom |
21 | Beta |
34 | Bobcat |
8 | Bonsai Tree |
And a page might want to include all items between Amoeba and Bobcat (Note that Amoeba is not actually in the list).
In this example, we need some way to record that the page is using the items between Amoeba and Bobcat, so that if someone inserts Badger into the list, the page using the list gets refreshed.
The natural way of doing this would be a MariaDB table like the following:
id | page_id | list_start | list_end |
---|---|---|---|
1 | 1234 | Amoeba | Bobcat |
Along with two indices: one on list_start and the other on list_end.[1]
In the event Badger gets added to the list, we would need to query what pages to refresh. In this structure we would use a query like: SELECT page_id FROM list_cache WHERE "Badger" BETWEEN list_start AND list_end;
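For concreteness, here is a minimal sketch of that schema (the exact column types are my assumption):

CREATE TABLE list_cache (
    id int unsigned PRIMARY KEY AUTO_INCREMENT,
    page_id int unsigned NOT NULL,
    list_start varchar(255) NOT NULL,
    list_end varchar(255) NOT NULL
);
CREATE INDEX list_start ON list_cache (list_start);
CREATE INDEX list_end ON list_cache (list_end);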
This all works. However, there is one problem: to answer that query, the DB has to look at a number of rows besides the rows we're interested in. For best performance, we want to make the database look at as few rows as possible to answer our query. Ideally it would only look at rows that are relevant to the answer.
An index in the database is an ordered list of rows from a table - like a phonebook. In this case we have two: one for the start of the range and one for the end. In essence we want information from both; however, that's not how it works.
"'badger' BETWEEN list_start AND list_end" is really just a fancy way of saying list_start < 'badger' AND list_end > 'badger'. The RDBMS then has a choice to make: It can use the list_start index, go to the spot in that index where badger would be, and then work its way down to the bottom of the list, checking each row individually for if the second condition is true. Alternatively it can pull up the list_end index and do the same thing but reversed.
However, once it picks an index and goes to its starting place, it has to look at every entry until it hits the beginning (respectively, end) of the index. If the starting place is in the middle of the index, it would have to look at roughly half the rows in the table to answer the query. If this system were used at scale and the table had 2 million rows, the database might have to look at a million rows just to find the 3 that match. This is definitely not ideal.
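To make this concrete, here is a sketch of the query with the index choice made explicit via FORCE INDEX (assuming the sketch schema above). The DB can seek directly to where 'Badger' would sort in the list_start index, but the list_end condition must then be checked row by row, all the way to the start of the index:

SELECT page_id FROM list_cache FORCE INDEX (list_start)
WHERE list_start <= 'Badger' AND list_end >= 'Badger';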
[1] Potentially you could instead use compound indexes on (list_start, list_end) and (list_end, list_start), which allow the DB to answer directly from the index instead of constantly looking up each row in the underlying table. I'm unclear on whether the benefit of index condition pushdown outweighs the increased index size - I suspect it would mildly - but either way things are still rather inefficient.
Interval trees
At this point, like any good programmer, I started googling and trawling stack overflow. Eventually I stumbled upon the article "A Static Relational Interval Tree" by Laurent Martin on the SolidQ blog.
At first glance this seemed like the exact solution I was looking for. On second glance - not so much. This is a really cool technique; unfortunately it scales with the number of bits needed to represent the data type you are using. If I were looking at ranges of integers, dates or IP addresses, this would be perfect. Unfortunately I am using strings, which take a large number of bits. Nonetheless, I'm going to describe the technique, as I think it's really cool and should be more widely known.
The core idea is to separate the different ranges into a number of buckets. Each bucket can be queried separately but efficiently - any row that the DB has to look at is a row that would match the query. In order to find all the ranges that contain a specific point, you just need to query all the applicable buckets. The number of buckets that are applicable in the worst case is the same as the number of bits needed to represent the data type (hence the applicability to ranges of ints but not long strings).
The end result is: if you are using ranges of 32-bit integers, you would have to do at most 32 really efficient separate queries (or 1 query after UNIONing them together). For a large DB, this is much better than 1 inefficient query that might read millions of rows or more.
For simplicity, in the explanation I will use 4-bit integers (0-15).
Buckets
So, what are these buckets? A range's bucket is simply the shared binary prefix of the two ends of the range, with a 1 followed by 0's appended to fill out the rest of the binary number. Here are some examples:
Range start | Range end | Shared prefix | Bucket |
---|---|---|---|
10 (1010b) | 13 (1101b) | 1 | 12 (1100b) |
9 (1001b) | 10 (1010b) | 10 | 10 (1010b) |
9 (1001b) | 9 (1001b) | 1001 | 9 (1001b) |
In the database, we can use virtual columns to have MariaDB manage this for us, using the formula range_end - range_end % POWER(2, FLOOR(LOG((range_start - 1) ^ range_end) / LOG(2))), where ^ is bitwise XOR:
CREATE TABLE ranges (
    id int unsigned PRIMARY KEY AUTO_INCREMENT,
    range_start int NOT NULL,
    range_end int NOT NULL,
    bucket int AS (range_end - range_end % POWER(2, FLOOR(LOG((range_start - 1) ^ range_end) / LOG(2))))
);
CREATE INDEX start ON ranges (bucket, range_start);
CREATE INDEX end ON ranges (bucket, range_end);
Now we can just insert things like normal, and MariaDB will take care of calculating the bucket number.
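As a quick sanity check, inserting the example ranges from the table above should produce matching bucket values (a sketch; result shown abridged):

INSERT INTO ranges (range_start, range_end) VALUES (10, 13), (9, 10), (9, 9);
SELECT range_start, range_end, bucket FROM ranges;

+-------------+-----------+--------+
| range_start | range_end | bucket |
+-------------+-----------+--------+
|          10 |        13 |     12 |
|           9 |        10 |     10 |
|           9 |         9 |      9 |
+-------------+-----------+--------+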
Querying
This has the following useful property: any range assigned to bucket number n must contain the value n. For example, bucket 12 might have the range [10,13] in it, or [12,12], both of which contain the point 12. It could not have the range [10,11], as that does not go through the point 12.
This is useful, since if we have a specific number of interest, we can easily query a bucket for ranges containing it. For example, if we wanted to know all the ranges in bucket 12 containing 9, we can query WHERE bucket = 12 AND range_start <= 9, since if a range is in bucket 12 it must contain the value 12, so any range that starts at or before 9 must go through 9 in order to get to 12. If we have an index on (bucket, range_start), this is a really efficient query that will only look at the rows that are of interest. Similarly, if the number of interest is greater than the bucket number, we do the same thing, just with range_end >= instead.
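In full, that single-bucket query would look like this (a sketch against the ranges table above):

SELECT id, range_start, range_end FROM ranges WHERE bucket = 12 AND range_start <= 9;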
Now that we know how to get all the relevant ranges from one bucket efficiently, we just need to figure out what all the relevant buckets are, and we have a solution (remembering that at most one bucket per bit of the data type needs to be consulted).
Finding the buckets
To find the buckets, we simply take the first x bits of the binary representation of our point and append a 1 followed by 0s to fill out the number, for each possible x.
For example, if we wanted to find all the buckets that are relevant to ranges containing 9 (1001b):
bits masked | prefix | bucket | top or bottom |
---|---|---|---|
0 | (none) | 8 (1000b) | top |
1 | 1b | 12 (1100b) | bottom |
2 | 10b | 10 (1010b) | bottom |
3 | 100b | 9 (1001b) | top (either) |
We would have to look at buckets 8, 12, 10 and 9. If the bit immediately after the prefix is a 1 (or equivalently, if our point >= bucket), we have to look at the top of the bucket; otherwise, we look at the bottom. By this, we mean whether the query is range_end >= point or range_start <= point respectively.
So with that in mind, in our example we would make a query like:
SELECT id, range_start, range_end FROM ranges WHERE bucket = 8 AND range_end >= 9
UNION ALL
SELECT id, range_start, range_end FROM ranges WHERE bucket = 12 AND range_start <= 9
UNION ALL
SELECT id, range_start, range_end FROM ranges WHERE bucket = 10 AND range_start <= 9
UNION ALL
SELECT id, range_start, range_end FROM ranges WHERE bucket = 9 AND range_end >= 9;
Which is a union of queries that each individually are very efficient.
To generate this, you might use code like the following (in PHP):
define( "MAX_BITS", 4 ); function makeQuery ( $target ) { $mask = 0; $query = ''; for ( $i = MAX_BITS-1; $i >= 0; $i-- ) { $mask = $mask | (1>>$i); $tmpMask = $mask ^ (1>>$i); $pad = 1 >> $i; $bucket = ($target & $tmpMask) | $pad; $query .= "SELECT id, range_start, range_end FROM ranges WHERE bucket = $bucket "; if ( $target < $bucket ) { $query .= "AND range_start >= $target "; } else { $query .= "AND range_end <= $target "; } if ( $i != 0 ) { $query .= "\nUNION ALL "; } } return $query . ";"; }
Unfortunately this technique didn't work for me, as I am using ranges of strings. Nonetheless I was very happy to learn about it as it is quite a cool technique. It can also be easily extended to find intersecting ranges instead of just points in a range. For more information I encourage the interested reader to read the articles "A Static Relational Interval Tree" by Laurent Martin, as well as "Managing Intervals Efficiently in Object-Relational Databases" by Kriegel et al.
Geospatial (R-Tree) index
Newer versions of MariaDB support spatial indexes. These are meant to support 2-D coordinates on a globe. If our ranges only involved numbers, we would be able to (ab)use this for the 1-dimensional case: we simply use LineStrings that have only two points in them and always have a y value of 0.
CREATE TABLE geo_ranges (
id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
ranges LINESTRING NOT NULL, SPATIAL INDEX (ranges)
);
Then we just insert some ranges:
INSERT INTO geo_ranges (ranges) VALUES
(LineString(Point(17,0),Point(34,0))),
(LineString(Point(28,0),Point(40,0))),
(LineString(Point(50,0),Point(52,0)));
Now we can query it using the ST_INTERSECTS function:
SELECT id, x(startpoint(ranges)) AS 'range_start', x(endpoint(ranges)) AS 'range_end'
FROM geo_ranges
WHERE st_intersects(point(30,0),ranges);
+----+-------------+-----------+
| id | range_start | range_end |
+----+-------------+-----------+
| 2 | 28 | 40 |
| 1 | 17 | 34 |
+----+-------------+-----------+
2 rows in set (0.001 sec)
I don't know enough about how MariaDB spatial indexes work to analyze the performance of this approach, but I would assume it is reasonably performant. Unfortunately it does not work for my use case, as I want string ranges, not floats or ints.
Recording every point
So the last method I'm aware of is to give up on trying to record ranges, and instead record every point within the range.
So if we have the following items:
id | item |
---|---|
17 | Alpha |
12 | Atom |
21 | Beta |
34 | Bobcat |
8 | Bonsai Tree |
And if page number 100 uses the range Amulet to Bog, it would be recorded as follows:
page_id | item_id |
---|---|
100 | 12 |
100 | 21 |
100 | 34 |
If you insert a new item, like "Anthill", you would look for the items immediately before and after it (Alpha, id 17, and Atom, id 12), and then purge all pages that use item 17 or 12.
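As a sketch in SQL (the table and column names here are hypothetical):

CREATE TABLE list_item_usage (
    page_id int unsigned NOT NULL,
    item_id int unsigned NOT NULL,
    PRIMARY KEY (item_id, page_id)
);

-- "Anthill" sorts between items 17 (Alpha) and 12 (Atom),
-- so purge every page that uses either of those items:
SELECT DISTINCT page_id FROM list_item_usage WHERE item_id IN (17, 12);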
The queries are efficient, at the cost of significant data duplication. At first, I didn't like this idea: it has to store so much extra data. However, after thinking about it, this is mitigated by a couple of factors. First of all, the usage table only has to store id numbers, not the original strings, which reduces space quite a bit. Second, if we assume that each entry is due to the item actually being used on the page, that puts some limits on things, as we can generally assume that people do not want to make excessively long pages.
If we supported aggregations over large numbers of items, that could be problematic. However, I have another idea in mind for that case.
In the end, the mild data duplication seems a price well worth paying, especially when you consider that saving (writing to the table) happens rarely relative to how often you might have to check whether a cache purge is needed.
Conclusion
Overall, I'm surprised I didn't find any option that I completely liked. I would have assumed this is a somewhat common need, so I expected more information about doing this type of thing to be out there. In the end, I think recording every point meets my needs best.
Is there a method I missed?