The idea behind this project started with this entry in my blog. Taking inspiration from Doug Bagley's Computer Language Shootout project, I decided to write a sample program in haskell, erlang, ruby, ocaml, and scheme, in order to evaluate each of these languages for their suitability in rewriting the main program I work on. The sample program would be one that is relatively central to what that program does: an inverted index.
Anybody who's ever worked on a search engine will know that the inverted index is the central data structure. A google search for inverted index turns up 227,000 documents (the most interesting of which is the paid advertisement from google itself, looking for search engine developers).
The idea behind an inverted index is simple: you start with a bunch of documents containing words, and you want to "invert" that, to create a bunch of words each of which lists all the documents that contain that word. Or, in code-speak, you start with a MapDocToWords object, and want to create a MapWordToDocs object (except that I won't be limiting myself to objects).
In this shootout, the first criterion will be how easy it is to use the same code for inverted indices of different types (e.g. indices of words, indices of numeric ranges, etc).
Another criterion will be how easy it is to write the inverted index to disk, and read it back in. This is important because building the inverted index usually takes a while, so it's good to do it offline and store the results for faster loading later. Also, it's not uncommon to use different data structures for building the inverted index and for querying it. This requirement has the advantage of making the test a bit more realistic.
And the last criterion will be performing intersection queries. For example, your typical multi-word query will involve looking up the document list for each word, and then intersecting the resulting lists to produce the documents that contain all the words in the query.
The results will be judged on code size, memory usage, speed, and (highly subjective) cool-ness.
56 lines of code (not counting comments and tests).
About 15 hours to write.
Coolness points because QuickCheck makes testing easy.
No performance numbers available yet.
48 lines of code.
Contributed by Darius Bacon.
Based on the haskell version.
10 lines of code (incomplete).
About 8 hours to write.
Anti-coolness points because the syntax is so clunky, and because I was disappointed to find out that Erlang is not a logic language. I was so disappointed that I didn't actually finish the project. But you can look at the code to see how far I got.
34 lines of code (incomplete)
About 8 hours to write.
Joy almost got coolness points for having a built-in set type. But its sets are implemented as bit-fields and so they can only contain numbers up to 31. So no coolness points.
Negative coolness points for the fact that the "in" function only works for lists of numbers -- lists of lists do not use a recursive comparison.
Negative coolness points for all the stack shuffling I have to do.
Positive coolness points for having such a simple, elegant execution model.
Positive coolness points for conciseness, even though this comes at the cost of clarity.
No performance numbers available yet.
Last updated October 1st, 2003
Back to Kimberley's Code.
Back to Kimberley's Home Page.