Will Wikipedia collapse under its own weight?

Wikipedia stores the complete revision history of each page in snapshot form, and not in differential form. While this allows quick diffs between arbitrary revisions, it makes the storage requirements quadratic in the number of revisions. And, in practive, Wikipedia XML dump files are absolutely gigantic. 14 gigabytes uncompressed for the 100,000 article Turkish wiki; 290 frigging gigabytes for the 600,000 article French wiki; the English wiki with its 2,000,000 articles is 133 gigabytes bzip2-compressed and I don't know how big it really is.

Now you might think that if you distribute the articles of each Wikipedia over a number of different servers it should be all right, after all the individual articles aren't that big. Each snapshot may not be too big, but some articles have many, many snapshots.

I will give you some precise figures from the latest dump of the French wikipedia, taken on Februrary the 15th, 2008.

The NapolĂ©on Ier article is ~120,000 bytes long and has 2,393 revisions. The XML portion of the 2,393 revisions in snapshot form, gzipped with compression level 2, is 80,801,744 bytes; uncompressed it is about ~200,000,000 bytes. In differential form all that information measures a meager 5,310,701 bytes as a marshalled Ocaml object. Gzipped, that file takes 1,370,456 bytes. Basically, Wikipedia has a ×40 overhead and is wasting 97% of its uncompressed storage.


Monads are a class of hard drugs

I've been doing functional programming (FP) for about twelve years now, first with Scheme, and then with Ocaml, professionally for nearly three years now. I'm a big fan of FP and a big fan of static typing. So I have no prejudice against FP languages and I have tried many FP languages over the years, including, of course, Haskell.

When I try a language, I first want to try a non-trivial program in that language. Ocaml has
Unison, MLDonkey and the defunct Efuns & MMM.

Haskell has Xmonad, Yi and Darcs. I did not have enough patience to compile either Xmonad or Yi. They had many dependencies, required the latest version of GHC which itself took many hours to compile, and I couldn't figure out how to use Cabal. As for Darcs, which is usually available pre-compiled on most systems, it is extremely slow (checking out the source of Darcs itself takes ages), extremely memory hungry, and, word has it, is fundamentally flawed. But these are all practical issues.

Haskell programs tend to be slow, save for shootout-optimized ones. You can talk all you want about how functional purity allows memoization (BTW my PhD thesis was memoizing automata) but, in practice, GHC, as a compiler, is very slow. Haskell programs, unless written very carefully, do get quite slow.

The usual answer is that languages are not slow, programs are, and that there are ways to make your program really fast, and that, actually, as it is purely functional, the compiler can do lots of very smart tricks to produce code undistinguishable from hand-coded imperative assembly.

The problem is that those very smart tricks do not always apply. Further, as they are complex, it is difficult for the programmer to know if they will apply for his particular program. So he has to know a lot about the compiler internals to write programs that play well with the compiler. But, unlike mandatory optimization of tail-recursion in most FP languages, those smart tricks are not part of the language specification. So he can't really rely on them.

In other words, with Haskell/GHC, the programmer doesn't have a simple mental model of the compiler with which he can write efficient programs. So either he takes the academic approach and says "I don't care about efficiency", or delves into the depths of the compiler and actively follows the latest papers on FP optimization.

But maybe those smart tricks apply sufficiently often in practice that most programmers do not have to care about them?

I would like to say that if this was the case, GHC and Darcs wouldn't be dog slow, but I have another argument.

That's the part where we use the M-word. Monads. While they have been turned into a tool of intellectual terrorism, the original paper on monads is actually quite easy to understand and fun to read. Monads are a good and natural solution for expressing imperative constructs in purely functional languages.

The essence of monads is to use abstract types to enclose a mutable state, providing only a set of carefully-crafted combinators for using it in such a way that you can't duplicate the state but have to use it in a linear, non-branching fashion; that way, the implementation of the monad can use side-effects. Monads can be used to access mutable arrays, in which case the side-effects would be writes to the array, or system services such as I/O, in which case the side-effects would be some system calls. Monads are a typing trick to ensure proper linear usage of some updateable resource in a functional language. A very neat idea, indeed.

So monads hide the tricky machinery of managing mutable state under the carpet of the type system. This is good because you don't get to see greasy mechanical parts while casually programming. But mutable state covers everything that is not pure computation and that includes debug statements, input and output, system calls, network access, getting the time of the day or even throwing an exception. So the monad you are using has to provide all these services. Of course not every monad has everything you need so you have the IO monad, the Maybe monad, the STM monad, and so on. Whenever you have a C binding, you get a monad. And this is where things get ugly.

For one thing, when you use a monad, you get hooked to it: the type constructors of the monad start to appear in the signatures of your function. Obviously, you just have to add a layer of indirection by redefining those constructors and combinators in another module or type class. But then you've just substituted your own brand of heroin for your dealer's brand. All your code still has to use your brand. You can change your dealer, but still, if someone imports your code, they will have to "supply" your code with the brand you've selected. Where did the type inference go? Why are we declaring types everywhere as if we were in freaking Java or C++?

And, for another thing, what happens if I need more than one monad? That's where you get the fun concept of "Monad combinators" which allow you to define the "product" of two or more monads and "bundle" your brand of heroin with your brand of crack into a nice "crackoin" product. You're still doing drugs, but at least you can do any combination of them!

So, the situation is this. You write a software component doing imperative tasks in Haskell. You need monads M1, M2 and M3. You thus define your monad bundle M1M2M3 which you use in all your code.

Someone else writes a software comoponent C2 and uses a monad bundle M4M5.

Now what if I want to write a component using C1 and C2? I just define my functions in the monad bundle M1M2M3M4M5, silly! In the end, I might very well have more monad definitions than code.

I do not like mandatory type annotations. Type inference is a great technology and it is very well worth the price of restricted or inexistant overloading. Modules and functors are also very great, and those, understandably, require explicit annotations. In Ocaml, you can often start writing whatever you want with little to no type declarations (and the declarations you use are usually for defining records or sum types); when your program grows, you start isolating parts inside modules; later, those modules get their own file and might transform into a functor or two. In Haskell, you have to start in the correct monad if you plan on doing any I/O (and you do). Ocaml's "implicit" monad is the same monad the overwhelming majority of programs live in: the imperative monad with exceptions. This includes C programs and your operating system. Types are implicitly lifted to this monad. This means that you do not have to worry about monads and that you just write your code, getting all the benefits of a FP language with static typing and type inference. If you want to add a printf there, or a network connection here, you can. So your functions are unpure, but programmers are not priests of type theology. They want to do dirty things, like connecting to sockets or mutating array entries while waiting for external world events.

Programmers might one day accept to do all these actions under the hood of a combination of monads, but only if monadic types do not creep too much in their code and, of course, if the resulting programs are sufficiently fast. This is not the case today with Haskell and this is why I will stick to ML.

Kernel crashes...

I've got two kernel Oopses (attempt to dereference null pointer). This is very annoying. It could be bad memory or overheating, but I think it is just a buggy driver, possibly the graphics or the WiFi driver. I really don't want to go into the whole mess of recompiling your own kernel but then... BTW I bought a 8Gb SD HD card and I can't read it with my old USB driver. I think I'll just back up the whole image and do a network install of eeeubuntu thru TFTP.


Developing on the Eee

We have been visiting some family and thus have been away from home for one day. I of course took my Eee and, fortunately, a welcoming RJ45 socket allowed me to promptly connect to the matrix through my little black terminal.

I thus debugged some Ocaml code using a remote Vim via SSH. The experience was not very pleasing, the main reason being the Eee's keyboard, while great for typing a blog like this, is annoying for development work, because programming, involving the use of many special characters, requires frequent depression of the shift and control keys and of the cursor keys. Unfortunately, the right shift key is single-width and is located right next to the cursor up key. Hence, I am constantly pressing cursor up instead of shift, which gets very annoying especially in Vim. I don't know if I'll get used to it. One solution might be to swap cursor up and shift right. However I didn't manage to get my old Xmodmap scripts working; apparently Xmodmap is deprecated and you have to use that thing called Xkb now. The problem is that Xkb is extremely complex and could be called an "international keyboard mapping meta-description framework environment".

Anyway, I ended adding some Xandros repositories and was able to get gcc, gvim, mercurial and other goodies. I also compiled Ocaml from the CVS. The Eee is indeed slower than my home desktop, a 2.4 GHz PIV, and slower than my work desktop, a dual-core 64-bit machine, but it remains perfectly acceptable. I'd say that today, most computers run at the same speed. Having an ultra-fast solid-state disk certainly helps too.

As you may recall, I had disconnected the fan, resulting in a perfectly silent machine. It works quite well and doesn't get any appreciably hotter than what it got with the fan connected. I hope it won't shorten its life. It's a real pleasure and maybe I'll use it at home, connecting it to a bigger screen, a bigger keyboard and a mouse, turning it into a silent workstation.


Of Wikipedia and pigs

From perusing the forums at Wikipediareview, it just came to my attention that Wikipedians actually debated if the list of pigs over 1000 pounds should be merged with the list of pigs. Note that these are not about zoologic classification of animals. No, these are about, well, "notable" pigs. In the wikiworld, "notable" means, well, "notable". It is an irreducible concept. If you don't understand what notable is, you don't get Wikipedia.


Ocaml and Extlib

At one point or another, you need to select an extension library for your Ocaml programs, or roll your won. For the past ~9 years I have been mostly satisfied with the standard libraries and for the rest, I usually rolled my own. I'm not talking of Netclient bindings or of PCRE, but pretty basic stuff like string splitting, wrapping file opening, and so on. But I needed some Unicode capabilities and I really wasn't in the mood to wrestle with UTF8 encodings, been there, done that, so I grabbed a copy of Extlib and started using it. I was a little bit disappointed since it is much less complete then what I expected, so I started patching it and sending it "upstream". Fortunately their developers are friendly and quickly integrated my change, rightfull asking for a unit test (which actually uncovered a bug!).

Second day in heaven

Okay audience, I now have been using the Eee for half a day now and I'm still delighted. I was slightly annoyed by the noise emitted by the fan, so I proceeded to disconnect it, following instructions available on the "Internet" (a place where you can find lots of useful things). Noiseless computing is an absolute joy! It brings me back memories of the time when I was hacking on my ZX Spectrum 48K+ - no noise either. The screen was bigger, but the resolution - 256 x 292 if I recall correctly, was ridiculous. It had 48 kB of memory. 48 kB. Do you understand? In other words, its whole RAM could barely contain a 128 x 128 pixel color image. Or the first 6.25% of a 1024 x 768 display. Or the front page of a typical blog (HTML only). That's why the ZX didn't last, I guess. Anyway, the Eee's keyboard is way more pleasant than the ZX plus's keyboard as you can depress the keys just with your fingers. I am getting used to its layout and the small size of its keys.

Oh, I'm also trying to use Ethernet mose of the time as I am worried about the health effects of WiFi radiation. It's also faster and uses less power, creating less heat.

The Eee APT repository doesn't contain much. However the installed software is fine for most purposes. I don't know if I will be using this little gem of technology as a stand-alone development platform, or if I will be content SSHing into a server and working there. The problem is that I am living in a very small 35 square meter flat with my soulmate. If I turn on my main PC, the noise is overwhelming. I could SSH to a dedicated served I am renting, but that one is not very powerful, and I'd be struggling with network latency. Also, Gvim is more pleasant than Vim on a terminal - more colors, more mouse functionality.

However I don't know what the performance of this system would be for development. Should be decent, anyway. Maybe I'll upgrade the main memory, maybe not. However I'd like to at least have Mercurial or the SSH daemon, and those are not in the built-in repository.


Frist psot!!eleven

Today I'm very excited! I just got my Eee PC. And, of course, I'm using it right now. So I decided to start another blog. My previous blogs were one-shot ones on particular, controversial topics. This blog will be different. It will be non-controversial. I will talk about technology, science, and what not. I will also blog in the proper sense of the term, that is, write useless things of interest to no one but myself and maybe one or two close friends.

So, how's the Eee you ask? It's small. You already knew it. Mine is a nice black. I had to order it from the US of A to get the keys as god intended them to be
: in a QWERTY layout. Ooops this blog was supposed to be non-offensive. Sorry you DVORAK fanboys. QWERTY rules and that's final. You know why? Because I was raised on a QWERTY. Been using one for twenty years now. But also, all the greatest programs in the world written by the greatest geniuses were typed on QWERTY programs. [Citation needed]

Anyway, so QWERTY it is. What about the rest? Well, the keyboard is quite small. And it's someone using a HHK talking. However its keys are great. I think I may get used to it after a while. I'll tell you if it is comfortable enough after some weeks of daily usage.

Speed. The Eee is perfectly fast. It certainly doesn't lag. And with no spinning, mechanical twentieth century hard drive to swap pages to, you may get out-of-memory errors, but you'll get them fast, and your machine will never be trashing like Windows fucking NT.

Weight. It is very, very lightweight. I am really pleased with this. My previous laptop was so heavy I could barely lift it with my feeble muscles.

I'll save you half an hour by telling you that you can open a terminal using Ctrl+Alt+T, and that to enable the SSH agent, you go to /etc/X11/Xsession.options. I haven't figured out how to log in using another user yet. (I'd like to do this without disrupting the original configuration too much. But I think I'll end up rewriting a simple X startup script by myself. I see too many unnecessarily spawned shells here...)

Silence. It indeed has a fan, which makes a slightly annoying buzz. But before ordering my Eee, I checked various blogs and sites, and (1) you can easily disable the fan and (2) the fan doesn't affect the system temperature much, be it on or off. I think I'll disable it pretty soon.

So all this is already too long for a frivolous blog post. I'll finish here.