Wikipedia stores the complete revision history of each page in snapshot form, not in differential form. While this allows quick diffs between arbitrary revisions, it makes the storage requirements quadratic in the number of revisions. And, in practice, Wikipedia XML dump files are absolutely gigantic: 14 gigabytes uncompressed for the 100,000-article Turkish wiki; 290 frigging gigabytes for the 600,000-article French wiki; the English wiki with its 2,000,000 articles is 133 gigabytes bzip2-compressed, and I don't know how big it really is uncompressed.

Now you might think that if you distribute the articles of each Wikipedia over a number of different servers, it should be all right; after all, the individual articles aren't that big. True, each snapshot may not be too big, but some articles have many, many snapshots.

I will give you some precise figures from the latest dump of the French Wikipedia, taken on February 15th, 2008.

The Napoléon Ier article is ~120,000 bytes long and has 2,393 revisions. The XML portion of the 2,393 revisions in snapshot form, gzipped with compression level 2, is 80,801,744 bytes; uncompressed it is about 200,000,000 bytes. In differential form, all that information measures a meager 5,310,701 bytes as a marshalled OCaml object. Gzipped, that file takes 1,370,456 bytes. Basically, Wikipedia has a ×40 overhead and is wasting 97% of its uncompressed storage.
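A quick sanity check of those two headline figures, using only the numbers quoted above:

```python
snapshot_bytes = 200_000_000  # uncompressed snapshot form (approximate, from the post)
diff_bytes = 5_310_701        # differential form, marshalled OCaml object

overhead = snapshot_bytes / diff_bytes          # how many times bigger snapshots are
wasted = (1 - diff_bytes / snapshot_bytes) * 100  # % of uncompressed storage wasted

print(round(overhead))  # ~38, i.e. roughly a x40 overhead
print(round(wasted))    # ~97% wasted
```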

## 20080219


## 9 comments:

If Wikipedia stores one snapshot per revision, then its storage requirements would be linear in the number of revisions, not quadratic. (One might consider the overall requirement to be "quadratic" if one assumes that aggregate page size will grow without bound, but that's a different thing.)

> If Wikipedia stores one snapshot per revision, then its storage requirements would be linear in the number of revisions, not quadratic.

Nope, because the stored size of a snapshot is not constant, since the whole page is stored, and not the diff. Hence if you change one letter on a 10 megabyte page, you add 10 megabytes.

Hmm--it appears that one of us doesn't understand your example. :-)

1 revision, 10 megabytes

2 revisions, 20 megabytes

3 revisions, 30 megabytes

This is a linear relation.

OK so I have to get formal.

Let p_1, p_2, ..., p_m be m revisions of a page. The size of the i-th revision is |p_i|. Assume p_{i+1} is obtained by adding one letter to p_i, and that |p_1| is 1. Hence the size of the i-th revision is i. However, the size required to store the first i revisions under the current Wikipedia scheme is 1 + 2 + ... + i = i(i+1)/2, which is in O(i^2). Note that user input is, save for copy/paste, proportional to the number of keypresses.
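The toy model in this comment is easy to compute directly; here the revision-i-has-size-i assumption is taken as given:

```python
def snapshot_storage(m):
    # Store every revision in full: 1 + 2 + ... + m = m(m+1)/2 bytes, quadratic in m.
    return sum(range(1, m + 1))

def diff_storage(m):
    # Store the first revision (1 byte) plus a one-byte diff per later revision: linear in m.
    return 1 + (m - 1)

for m in (10, 100, 1000):
    print(m, snapshot_storage(m), diff_storage(m))
# 1000 revisions: 500500 bytes of snapshots vs 1000 bytes of diffs
```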

Okay, I see what you're doing. Under your assumptions, storage is indeed quadratic in the number of revisions.

You're assuming, though, that the size of a page grows linearly in the number of revisions. I don't see that as a realistic assumption over the long term (e.g., 100 years). I would assume that an article on, say, Millard Fillmore would reach a fleshed out size fairly early and then grow slowly from that point, if at all. I'd assume a constant size, or perhaps at most logarithmic growth.

So we agree that the worst case space complexity is quadratic.

Now in practice, that quadratic behaviour occurs during the initial life period of an article where it grows.

Keeping an article whose size has stabilized at n bytes still costs storage proportional to n per unit of time, because of random mutations (trolls) that get reverted but that still cost O(n) bytes each.

The current Wikipedia storage scheme is not sustainable. They have to store the diffs, plus maybe a few snapshots now and then to accelerate diffs between arbitrary revisions. Or build on a proper revision control system.
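The "diffs plus a few snapshots" scheme can be sketched in a few lines. This is only an illustration of the access pattern, not MediaWiki's actual storage code; `SNAPSHOT_EVERY` is a made-up tunable, and `difflib.ndiff` deltas carry context lines (so a real system would use a compact binary patch format for actual space savings):

```python
import difflib

SNAPSHOT_EVERY = 100  # assumed snapshot interval; a real system would tune this

class RevisionStore:
    """Store each revision as a diff against the previous one, with a full
    snapshot every SNAPSHOT_EVERY revisions so that reconstructing any
    revision never replays more than SNAPSHOT_EVERY - 1 diffs."""

    def __init__(self):
        self.entries = []  # list of ("snap", text) or ("diff", delta)

    def add(self, text):
        prev = self.get(len(self.entries) - 1) if self.entries else ""
        if len(self.entries) % SNAPSHOT_EVERY == 0:
            self.entries.append(("snap", text))
        else:
            delta = list(difflib.ndiff(prev.splitlines(True), text.splitlines(True)))
            self.entries.append(("diff", delta))

    def get(self, i):
        # Jump back to the nearest snapshot, then replay diffs forward.
        base = i - (i % SNAPSHOT_EVERY)
        _, text = self.entries[base]
        for j in range(base + 1, i + 1):
            _, delta = self.entries[j]
            text = "".join(difflib.restore(delta, 2))
        return text
```

Reconstruction cost is bounded by the snapshot interval instead of growing with the full revision count, which is the same trade-off keyframes make in video codecs.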

I wonder how much space a git repos of it will take.

