Serializing universal types as S-expressions using Sexplib

The standard way in ML to "unify" multiple types is to use sum types, a.k.a. discriminated unions, e.g. type foo = Int of int | String of string | Pair of foo * foo.

Universal types are types without type variables that allow you to store and retrieve values of any arbitrary type. They can be handy in some situations. For instance, you might decorate an abstract syntax tree with cells of a universal type and insert various data inferred at different phases of your compiler without having to polymorphise your AST types.

Sweek blogged about this recently, and the MLton wiki has nice entries about property lists and universal types.

Sexplib, on the other hand, is a very nice piece of engineering by Markus Mottl (again at Jane St) and is an absolute must for developing in Ocaml. The Sexplib camlp4 syntax extension automatically defines converters for any type you define, and handles polymorphism, modules and functors perfectly well.

Obviously, Sexplib can't serialize functions. (You can have a look at AliceML if you absolutely want that; function serialization requires some kind of byte code).

And that's where universal types and Sexplib don't mix well. Why? Because, at least in the implementation techniques of universal types that I know of, universal types work by cheating on the type system using closures and side-effects. Greetings to the relaxed value restriction!

Basically, to store a value of type bar, you create a closure of fixed type unit -> unit that captures within its environment the value of type bar; when invoked, the closure spits it out at a pre-agreed location.

For every type, the closure will have the same closed type unit -> unit and can be stored anywhere you please. To retrieve the value, you invoke the closure, and it writes its content somewhere known. This will usually be a foo option ref. If multi-threading is used, you must protect that reference, for instance with a mutex.
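A minimal sketch of this technique (hypothetical code, not from any particular library; the names embed, inject and project are mine):

```ocaml
(* Universal type via closures and side-effects.
   Each call to [embed] creates a fresh, hidden reference cell.
   Not thread-safe: concurrent use would need a mutex around [cell]. *)
let embed () =
  let cell = ref None in
  (* [inject x] is a closure of closed type [unit -> unit] capturing [x]. *)
  let inject x = fun () -> cell := Some x in
  (* [project f] clears the cell, invokes the closure, and reads back. *)
  let project f = cell := None; f (); !cell in
  (inject, project)

let () =
  let inj_int, prj_int = embed () in
  let inj_str, prj_str = embed () in
  (* Values of different types stored in one homogeneous list. *)
  let u : (unit -> unit) list = [ inj_int 42; inj_str "hello" ] in
  assert (prj_int (List.nth u 0) = Some 42);
  assert (prj_str (List.nth u 1) = Some "hello");
  (* Projecting with the wrong projection yields None: the closure
     writes into the other embedding's cell. *)
  assert (prj_int (List.nth u 1) = None)
```

Note that each call to embed () yields a pair whose type contains a weak type variable; the first use of the injection fixes it, which is exactly the relaxed value restriction at work.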

However, functions are generally not serializable. Marshal does an ugly thing by spitting out a code pointer, the environment and the MD5 of your code, which works until you recompile your program. Not particularly persistent. As for Sexplib, since it works at the preprocessor level, serializing functions is not even possible.

Fortunately you can always define your own foo_of_sexp and sexp_of_foo. With some trickery, I managed to write a Sexp-aware Proplist.


(* Proplist: interface *)

type property_id = string
type 'a propname
type proplist

val propname : ('a -> Sexplib.Sexp.t) -> (Sexplib.Sexp.t -> 'a) -> property_id -> 'a propname
val proplist : unit -> proplist
val set : proplist -> 'a propname -> 'a -> unit
val get : proplist -> 'a propname -> 'a
val get' : proplist -> 'a propname -> 'a option
val sexp_of_proplist : proplist -> Sexplib.Sexp.t
val proplist_of_sexp : Sexplib.Sexp.t -> proplist


(* Proplist: implementation *)

TYPE_CONV_PATH "Util.Proplist"

open Sexplib
open Sexp

type property_id = string

module I = struct type t = property_id let compare (x : property_id) y = compare x y end
module SS = Set.Make(I)
module SM = Map.Make(I)

type 'a propname = {
  id : property_id;
  to_sexp : 'a -> Sexp.t;
  mutable value : 'a option;
}

type property = {
  extract : unit -> unit;
  sexpify : unit -> Sexp.t;
}

type proplist = {
  mutable stuff : property SM.t;
}

let ids = ref SM.empty

let proplist () = { stuff = SM.empty }

let sexp_of_property p = p.sexpify ()

let sexp_of_proplist p =
  List (SM.fold
          (fun k p result -> List [Atom k; sexp_of_property p] :: result)
          p.stuff [])

let set p name x =
  let f () = name.value <- Some x in
  let g () = name.to_sexp x in
  let r = { extract = f; sexpify = g } in
  p.stuff <- SM.add name.id r p.stuff

let propname to_sexp from_sexp id =
  if SM.mem id !ids then failwith "Property already defined"
  else begin
    let name = { id = id; value = None; to_sexp = to_sexp; } in
    let f pl x = set pl name (from_sexp x) in
    ids := SM.add id f !ids;
    name
  end

let get' p name =
  let r = SM.find name.id p.stuff in
  r.extract ();
  let x = name.value in
  name.value <- None;
  x

let get p name =
  match get' p name with
  | Some x -> x
  | None -> raise Not_found

let proplist_of_sexp x =
  let pl = proplist () in
  match x with
  | List l ->
    List.iter
      (function
        | List [Atom k; p] ->
          let f = SM.find k !ids in
          f pl p
        | x -> raise (Conv.Of_sexp_error ("Pair expected", x)))
      l;
    pl
  | _ -> raise (Conv.Of_sexp_error ("List expected", x))

Note that if you generate property names as integers from a global counter, there is no guarantee that a given property will get the same ID from one run to the next. So, when constructing a property name, you have to supply, well, a unique name yourself, and strings are more convenient than integers here. The uniqueness of such strings cannot be checked at compile time, but a runtime check is about the best you can do. Also, when constructing a property name, you have to supply the foo_of_sexp and sexp_of_foo converters. Retrieving them at unsexprification time is a little bit tricky, since the return type must be hidden: they go into a private global table indexed by property names.
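For illustration, here is a hypothetical use of the interface above, assuming the code compiles as a module Proplist and that Sexplib's Conv converters are in scope:

```ocaml
(* Hypothetical usage of the Proplist interface; the property name
   "line" is an example of mine, not from the original code. *)
open Sexplib

(* Registering a property also registers its converters in the
   private global table, keyed by the unique string "line". *)
let line : int Proplist.propname =
  Proplist.propname Conv.sexp_of_int Conv.int_of_sexp "line"

let () =
  let pl = Proplist.proplist () in
  Proplist.set pl line 42;
  assert (Proplist.get pl line = 42);
  (* Round-trip through S-expressions: serialization works because
     proplist_of_sexp looks the converters up by property name. *)
  let pl' = Proplist.proplist_of_sexp (Proplist.sexp_of_proplist pl) in
  assert (Proplist.get pl' line = 42)
```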


Transparent performance

It is widely accepted today that good programming languages should have clean semantics. Semantics usually define the values programs compute, or fail to compute. However, this is insufficient: what good is a correct value if the program that computes it takes forever to terminate, or consumes too much memory?

A good programming language should also guarantee the complexity of its basic constructs. It should be possible for the programmer to usefully predict the performance of his program without depending on compiler internals. Compiler optimizations that may alter the asymptotic performance of programs should be documented as well as the value semantics of the language. For instance, most functional languages guarantee tail-call optimization, and this is good. Tail calls are easy to spot for the programmer; it is easy to conceptualize what happens when a call is a tail call and when it is not (e.g., a call stack growing in proportion to the number of nested non-tail calls).
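To illustrate with a small (hypothetical) Ocaml example: both functions below compute the same sum, but only the second is guaranteed to run in constant stack space.

```ocaml
(* Not a tail call: [x + ...] still has work to do after the recursive
   call returns, so the stack grows linearly with the list length. *)
let rec sum = function
  | [] -> 0
  | x :: rest -> x + sum rest

(* Tail call: nothing remains to do after [go] calls itself, so the
   guaranteed tail-call optimization keeps the stack constant. *)
let sum_tc l =
  let rec go acc = function
    | [] -> acc
    | x :: rest -> go (acc + x) rest
  in
  go 0 l

let () = assert (sum [1; 2; 3; 4] = sum_tc [1; 2; 3; 4])
```

The distinction is purely syntactic, which is exactly why the programmer can rely on it without knowing anything about the compiler's internals.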

I'm tempted to consider compilers that perform optimizations potentially changing the asymptotic performance, or substantially changing the actual performance, of programs to be incorrect when those optimizations are not mandated by the language definition. Why? Because programmers will rely on them. However, compiler optimizations are tricky beasts that do not always apply or work the same way. A new way of compiling may yield much better performance than the old one, save in a few cases. If "better" here means "asymptotically better", and if the performance of your program was depending, by chance, on those corner cases (how would you know, without delving into the internals of complex optimizations?), your program may go from O(n) to O(n^2) and fail.

That will then create compiler-specific subdialects of the language. The maintainers of the compiler will have to ensure that these optimizations always apply in all future versions of the compiler. This is no better than backwards bug-compatibility or browser-specific hacks. This means that they won't be able to change their optimization strategy if it is not formally a superset of the previous one, even if it performs better in almost all known programs.

You will then end up with programs P1 and P2 written in the same language, but such that P1 has usable performance only when compiled with flags F1, while P2 does not have usable performance with those flags. Usually you can't mix compilation flags, so you won't be able to compose P1 and P2 into a bigger program and get acceptable performance.

Whole-program compilers that optimize the hell out of your programs are a good thing, but I'd avoid programming languages that require such complex, non-universal optimizations to have usable performance in practice. Programs written in such languages may very well fail to have the same performance under a different, or future compiler.

In short:

  • Some program optimizations can change the asymptotic complexity of programs, hopefully by lowering it.

  • It is generally impossible to give a compiler-independent and human-understandable definition of the cases in which a particular optimization may apply.

  • The domain of optimizations done by successive versions of a compiler tracking advances in compiling techniques is not necessarily monotone, and it is often impossible to guarantee that the domain of a new optimization is a "superset" of a previous one.

  • Optimizations do not necessarily compose, for instance when they require different internal representations of programs (say, SSA form).

  • Therefore, nothing guarantees that the optimizations that apply to a program P at compiler version n will also apply at compiler version n + 1.

  • Therefore, regular use of compilers that do complicated, asymptotic-complexity-changing optimizations that are not specified in the language may lead people to write programs that will depend on the availability of those particular optimizations and that may break with future versions of the compiler.

  • Lazy-by-default languages such as Haskell seem to require such sophisticated optimizations in order to yield acceptable performance.

  • Hence one should be careful not to depend on unmandated optimizations when writing programs in such languages.


Will Wikipedia collapse under its own weight?

Wikipedia stores the complete revision history of each page in snapshot form, not in differential form. While this allows quick diffs between arbitrary revisions, it makes the storage requirements quadratic in the number of revisions: for an article that grows steadily, each of its n revisions stores a nearly complete copy. And, in practice, Wikipedia XML dump files are absolutely gigantic: 14 gigabytes uncompressed for the 100,000-article Turkish wiki; 290 frigging gigabytes for the 600,000-article French wiki; the English wiki, with its 2,000,000 articles, is 133 gigabytes bzip2-compressed and I don't know how big it really is.

Now you might think that if you distribute the articles of each Wikipedia over a number of different servers it should be all right; after all, the individual articles aren't that big. Each snapshot may not be too big, but some articles have many, many snapshots.
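A back-of-the-envelope model (an assumed linear growth model, not actual Wikipedia measurements) shows where the quadratic blow-up comes from:

```ocaml
(* Suppose an article grows by [d] bytes with each of its [n] revisions,
   so revision i is i*d bytes long. *)

(* Full snapshots store every revision in full:
   d * (1 + 2 + ... + n) = d * n * (n+1) / 2 bytes -- quadratic in n. *)
let snapshot_bytes d n = d * n * (n + 1) / 2

(* Differential storage only keeps the d new bytes per revision. *)
let diff_bytes d n = d * n
```

With d ~ 50 bytes per revision (so the last snapshot is ~120 KB after ~2,400 revisions), the model predicts roughly 140 MB of raw snapshots, which is the right order of magnitude for the figures below.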

I will give you some precise figures from the latest dump of the French Wikipedia, taken on February 15th, 2008.

The Napoléon Ier article is ~120,000 bytes long and has 2,393 revisions. The XML portion of the 2,393 revisions in snapshot form, gzipped with compression level 2, is 80,801,744 bytes; uncompressed, it is about ~200,000,000 bytes. In differential form, all that information measures a meager 5,310,701 bytes as a marshalled Ocaml object. Gzipped, that file takes 1,370,456 bytes. Basically, Wikipedia has a ×40 overhead and is wasting 97% of its uncompressed storage.


Monads are a class of hard drugs

I've been doing functional programming (FP) for about twelve years now, first with Scheme and then with Ocaml, and professionally for nearly three years. I'm a big fan of FP and a big fan of static typing. So I have no prejudice against FP languages, and I have tried many FP languages over the years, including, of course, Haskell.

When I try a language, I first want to try a non-trivial program in that language. Ocaml has Unison, MLDonkey and the defunct Efuns & MMM.

Haskell has Xmonad, Yi and Darcs. I did not have enough patience to compile either Xmonad or Yi. They had many dependencies, required the latest version of GHC which itself took many hours to compile, and I couldn't figure out how to use Cabal. As for Darcs, which is usually available pre-compiled on most systems, it is extremely slow (checking out the source of Darcs itself takes ages), extremely memory hungry, and, word has it, is fundamentally flawed. But these are all practical issues.

Haskell programs tend to be slow, save for shootout-optimized ones. You can talk all you want about how functional purity allows memoization (BTW, my PhD thesis was about memoizing automata) but, in practice, GHC itself is, as a compiler, very slow. And Haskell programs, unless written very carefully, do get quite slow.

The usual answer is that languages are not slow, programs are; that there are ways to make your program really fast; and that, actually, since it is purely functional, the compiler can do lots of very smart tricks to produce code indistinguishable from hand-coded imperative assembly.

The problem is that those very smart tricks do not always apply. Further, as they are complex, it is difficult for the programmer to know whether they will apply to his particular program. So he has to know a lot about the compiler internals to write programs that play well with the compiler. But, unlike the mandatory optimization of tail calls in most FP languages, those smart tricks are not part of the language specification. So he can't really rely on them.

In other words, with Haskell/GHC, the programmer doesn't have a simple mental model of the compiler with which he can write efficient programs. So either he takes the academic approach and says "I don't care about efficiency", or delves into the depths of the compiler and actively follows the latest papers on FP optimization.

But maybe those smart tricks apply sufficiently often in practice that most programmers do not have to care about them?

I would like to say that if this was the case, GHC and Darcs wouldn't be dog slow, but I have another argument.

That's the part where we use the M-word. Monads. While they have been turned into a tool of intellectual terrorism, the original paper on monads is actually quite easy to understand and fun to read. Monads are a good and natural solution for expressing imperative constructs in purely functional languages.

The essence of monads is to use abstract types to enclose a mutable state, providing only a set of carefully-crafted combinators for using it in such a way that you can't duplicate the state but have to use it in a linear, non-branching fashion; that way, the implementation of the monad can use side-effects. Monads can be used to access mutable arrays, in which case the side-effects would be writes to the array, or system services such as I/O, in which case the side-effects would be some system calls. Monads are a typing trick to ensure proper linear usage of some updateable resource in a functional language. A very neat idea, indeed.
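Here is a toy sketch of that idea in Ocaml (a hypothetical "counter monad" of my own, not Haskell's actual machinery): the abstract type hides a mutable cell, and the exported combinators only allow threading it linearly.

```ocaml
(* The abstract type ['a t] hides a real mutable cell; clients can only
   sequence computations with [return] and [bind], never duplicate or
   branch the state, so the implementation is free to use side-effects. *)
module Counter : sig
  type 'a t
  val return : 'a -> 'a t
  val bind : 'a t -> ('a -> 'b t) -> 'b t
  val tick : int t                (* bump the hidden counter, return it *)
  val run : 'a t -> 'a            (* execute with a fresh counter *)
end = struct
  type 'a t = int ref -> 'a       (* the "state" is an actual ref cell *)
  let return x = fun _ -> x
  let bind m f = fun r -> f (m r) r
  let tick = fun r -> incr r; !r
  let run m = m (ref 0)
end

let () =
  let open Counter in
  (* First tick yields 1, second yields 2; their sum is 3. *)
  let program = bind tick (fun a -> bind tick (fun b -> return (a + b))) in
  assert (run program = 3)
```

Outside the module, the ref cell is invisible: the signature is the whole interface, which is precisely the "carpet of the type system" described below.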

So monads hide the tricky machinery of managing mutable state under the carpet of the type system. This is good because you don't get to see greasy mechanical parts while casually programming. But mutable state covers everything that is not pure computation and that includes debug statements, input and output, system calls, network access, getting the time of the day or even throwing an exception. So the monad you are using has to provide all these services. Of course not every monad has everything you need so you have the IO monad, the Maybe monad, the STM monad, and so on. Whenever you have a C binding, you get a monad. And this is where things get ugly.

For one thing, when you use a monad, you get hooked on it: the type constructors of the monad start to appear in the signatures of your functions. Obviously, you just have to add a layer of indirection by redefining those constructors and combinators in another module or type class. But then you've just substituted your own brand of heroin for your dealer's brand. All your code still has to use your brand. You can change your dealer, but still, if someone imports your code, they will have to "supply" your code with the brand you've selected. Where did the type inference go? Why are we declaring types everywhere as if we were in freaking Java or C++?

And, for another thing, what happens if I need more than one monad? That's where you get the fun concept of "Monad combinators" which allow you to define the "product" of two or more monads and "bundle" your brand of heroin with your brand of crack into a nice "crackoin" product. You're still doing drugs, but at least you can do any combination of them!

So, the situation is this. You write a software component C1 doing imperative tasks in Haskell. You need monads M1, M2 and M3. You thus define your monad bundle M1M2M3, which you use in all your code.

Someone else writes a software component C2 and uses a monad bundle M4M5.

Now what if I want to write a component using C1 and C2? I just define my functions in the monad bundle M1M2M3M4M5, silly! In the end, I might very well have more monad definitions than code.

I do not like mandatory type annotations. Type inference is a great technology and is well worth the price of restricted or nonexistent overloading. Modules and functors are also great, and those, understandably, require explicit annotations. In Ocaml, you can often start writing whatever you want with little to no type declarations (and the declarations you use are usually for defining records or sum types); when your program grows, you start isolating parts inside modules; later, those modules get their own file and might transform into a functor or two.

In Haskell, you have to start in the correct monad if you plan on doing any I/O (and you do). Ocaml's "implicit" monad is the same monad the overwhelming majority of programs live in: the imperative monad with exceptions. This includes C programs and your operating system. Types are implicitly lifted to this monad. This means that you do not have to worry about monads: you just write your code, getting all the benefits of an FP language with static typing and type inference. If you want to add a printf there, or a network connection here, you can. So your functions are impure, but programmers are not priests of type theology. They want to do dirty things, like connecting to sockets or mutating array entries while waiting for external-world events.

Programmers might one day accept to do all these actions under the hood of a combination of monads, but only if monadic types do not creep too much in their code and, of course, if the resulting programs are sufficiently fast. This is not the case today with Haskell and this is why I will stick to ML.

Kernel crashes...

I've got two kernel Oopses (attempt to dereference a null pointer). This is very annoying. It could be bad memory or overheating, but I think it is just a buggy driver, possibly the graphics or the WiFi driver. I really don't want to go into the whole mess of recompiling my own kernel, but then... BTW, I bought an 8 GB SDHC card and I can't read it with my old USB card reader. I think I'll just back up the whole image and do a network install of eeeubuntu through TFTP.


Developing on the Eee

We have been visiting some family and thus have been away from home for one day. I of course took my Eee and, fortunately, a welcoming RJ45 socket allowed me to promptly connect to the matrix through my little black terminal.

I thus debugged some Ocaml code using a remote Vim via SSH. The experience was not very pleasing, the main reason being that the Eee's keyboard, while great for typing a blog entry like this one, is annoying for development work: programming involves many special characters and thus requires frequent presses of the shift, control and cursor keys. Unfortunately, the right shift key is single-width and is located right next to the cursor-up key. Hence, I am constantly pressing cursor-up instead of shift, which gets very annoying, especially in Vim. I don't know if I'll get used to it. One solution might be to swap cursor-up and right shift. However, I didn't manage to get my old Xmodmap scripts working; apparently Xmodmap is deprecated and you have to use that thing called Xkb now. The problem is that Xkb is extremely complex and could be called an "international keyboard mapping meta-description framework environment".

Anyway, I ended up adding some Xandros repositories and was able to get gcc, gvim, mercurial and other goodies. I also compiled Ocaml from CVS. The Eee is indeed slower than my home desktop, a 2.4 GHz PIV, and slower than my work desktop, a dual-core 64-bit machine, but it remains perfectly acceptable. I'd say that today, most computers run at about the same speed. Having an ultra-fast solid-state disk certainly helps too.

As you may recall, I had disconnected the fan, resulting in a perfectly silent machine. It works quite well and doesn't get appreciably hotter than it did with the fan connected. I hope this won't shorten its life. It's a real pleasure, and maybe I'll use it at home, connected to a bigger screen, a bigger keyboard and a mouse, turning it into a silent workstation.


Of Wikipedia and pigs

From perusing the forums at Wikipediareview, it just came to my attention that Wikipedians actually debated whether the list of pigs over 1000 pounds should be merged with the list of pigs. Note that these are not about the zoological classification of animals. No, these are about, well, "notable" pigs. In the wikiworld, "notable" means, well, "notable". It is an irreducible concept. If you don't understand what notable is, you don't get Wikipedia.