I've become very interested in abstract interpretation and model checking recently. The limitations of traditional testing (specifically: you can only try so many combinations of inputs) are starting to bother me, and so I'm looking for new techniques. Abstract interpretation seems to offer a whole new set of possibilities for testing, and I've been studying it for the past couple months.
Here's a very readable introduction to the kinds of things you can do with model checking, written by the people at Coverity. Coverity is a startup based on the technique.
Here's a paper describing the java bytecode verifier as an abstract interpreter.
The BLAST project at Berkeley seems to have made some significant progress in this area. Their work seems to have led to projects like the Static Driver Verifier at Microsoft, which "finds bugs in device drivers at compile time". Their use of boolean programs (programs whose variables keep track of the current value of various boolean predicates) is very interesting. I wish I was more familiar with software theorem provers so that I could understand their work better.
I'm currently playing around with writing an abstract interpreter for java bytecode. If that works out, I'd like to try working with the assembler output from c++ programs (since the c++ language model is too complicated to work with directly). My current goal is to try to use abstract interpretation to automatically recover equivalence classes from code. For example, a statement like "if (x < 10) ..." would indicate that there are two equivalence classes for x: (x<10) and (x>=10).
It would probably be a better use of my time to just try to learn out to use the BLAST tool, but I want to have a more hands-on understanding of the techniques involved. I'm currently at the point where it has become obvious why the monotonicity requirement for fixpoint computation is necessary -- otherwise you might never terminate when faced with a loop. I've also become comfortable with the idea of least-upper-bounds as a technique for handling branches and joins. Previously I hadn't been able to figure out how to organize my data structures so that they could handle loops.
Followups to Model Checking:Posted on February 21, 2005 09:20 PM
More testing articles
http://pag.csail.mit.edu/daikon/
Posted by: Jay McCarthy at February 22, 2005 06:35 AMYes, I thought of mentioning Daikon. But daikon is a runtime invariant detector, while abstract interpretation and model checking are static tools.
In particular, daikon knows nothing about code that never gets executed by your unit tests. In contrast, model checkers include even untested paths when constructing their proofs.
Posted by: Kim at February 22, 2005 01:57 PMYou would loose a lot of type information using the assembly output (besides the question of, which architecture? SPARC? MIPS? x86?). You can recover simple types like characters, integers, floats and pointers but objects? It's a pointer to memory. Also, at the assembly level (and I'll pick the MC68K as an example), your statement of:
if (x < 10)
will most likely be translated to:
move.l [a6 + 12],d0
cmp.l #10,d0
bls.w ...
At that point, you have two equivilence classes, the four bytes at [a6+12] < 10, and the four bytes at [a6+12] >= 10---probably an integer, but careful tracking would be needed for the following:
move.l [a6+8],d0
and.l d0,d0
beq.w ...
are we doing something like:
if (x == 0)
or
if (pointer == NULL)
I'm not saying it's impossible, just that dropping down to assembly may be too low of a level for this.
Sean, the debugging information should have enough information to recover any type info that you might need. I admit that it's entirely possible that assembler is too low level. But even still, it seems like a more feasible approach than attempting to parse C++ in the first place. If I had to parse C++, I'd probably try to find some way of reusing the gcc front-end.
Posted by: Kim at February 23, 2005 01:11 PM