Taking inspiration from Doug Bagley's Computer Language Shootout project, I've decided to write a sample program in each of the languages I'm considering for my rewriting-the-dgraph project. The sample program will be one that is relatively central to what the dgraph does: an inverted index.
Anybody who's ever worked on a search engine will know that the inverted index is the central data structure. A google search for inverted index turns up 227,000 documents (the most interesting of which is the paid advertisement from google itself, looking search engine developers).
The idea behind an inverted index is simple: you start with a bunch of documents containing words, and you want to "invert" that, to create a bunch of words each of which lists all the documents that contain that word. Or, in code-speak, you start with a MapDocToWords object, and want to create a MapWordToDocs object (except that I won't be limiting myself to objects).
In my test programs, the first criterion will be how easy it is to use the same code for indices of different types (words, integers, etc).
Another criterion will be how easy it is to write the inverted index to disk, and read it back in. This is important because building the inverted index usually takes a while, so it's good to do it offline and store the results for faster loading later. Also, it's not uncommon to use different data structures for building the inverted index and for querying it. This requirement has the advantage of making the test a bit more realistic.
And the last criterion will be performing intersection queries. For example, your typical multi-word query will involve looking up the document list for each word, and then intersecting the resulting lists to produce the documents that contain all the words in the query.
This is a highly simplified version of the actual inverted index we use at Endeca (we have a lot more features and performance tweaks than I've described here), but it should give me a general idea as to the strengths of each language.
The results will be judged on code size, memory usage, speed, and (highly subjective) cool-ness.
Posted on May 2, 2003 08:15 PM
More languages articles
Could InvertedIndex be reasonable new benchmark for the revived Shootout?
http://shootout.alioth.debian.org/
Posted by: Isaac Gouy at November 16, 2004 08:06 PMI'd like to point that the shootout is flawed. They don't measure Java properly. I found that on many occasions the Java versions actually ran faster than the C versions. I'm randomly visiting websites because I'm very upset over their propaganda
Posted by: X at June 9, 2005 06:27 PMHi all,
i also do some research due to customer's interest. My concerns are around java vs. c/c++ form performance standpoint.
What i can comment, that from my and enterprice point of view
additionally to small points (like thread -safe String Buffer is used, but not StringBuilder)
more important things which are not fair to java are
1. for server side including of jvm and java load time is not fair
2. including of jit-compilation time is not reasonable
What is even more interesting:
i have implemented and plan to implement few more micro-benchmark:
1. intensive memory allocation deallocation in multithreaded env.
2. intensive synchronization at intensive multithreading
in both cases java is not slightly bu much faster
it is something important for enterprice development
i would be glad if some one point me where is appropriate place to publish the codes and results
Posted by: Yegor at May 14, 2007 09:11 AM