Here's an interesting point I came across while asking Google about "static software testing". The title of the slide is Moore's Low. If that's a typo, it's ironically appropriate.
I've become very interested in abstract interpretation and model checking recently. The limitations of traditional testing (specifically: you can only try so many combinations of inputs) are starting to bother me, and so I'm looking for new techniques. Abstract interpretation seems to offer a whole new set of possibilities for testing, and I've been studying it for the past couple months.
Here's a very readable introduction to the kinds of things you can do with model checking, written by the people at Coverity. Coverity is a startup based on the technique.
Here's a paper describing the Java bytecode verifier as an abstract interpreter.
The BLAST project at Berkeley seems to have made some significant progress in this area. Their work seems to have led to projects like the Static Driver Verifier at Microsoft, which "finds bugs in device drivers at compile time". Their use of boolean programs (programs whose variables keep track of the current value of various boolean predicates) is very interesting. I wish I was more familiar with software theorem provers so that I could understand their work better.
I'm currently playing around with writing an abstract interpreter for Java bytecode. If that works out, I'd like to try working with the assembler output from C++ programs (since the C++ language model is too complicated to work with directly). My current goal is to try to use abstract interpretation to automatically recover equivalence classes from code. For example, a statement like "if (x < 10) ..." would indicate that there are two equivalence classes for x: (x<10) and (x>=10).
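A minimal sketch of the equivalence-class idea, assuming the only conditions are comparisons of an integer x against constants. The function name `equivalence_classes` and the (operator, constant) input format are made up for illustration; a real analysis would have to extract the comparisons from bytecode first.

```python
def equivalence_classes(conditions):
    """Turn comparisons like ("<", 10) on an integer x into intervals.

    Each returned pair (lo, hi) is an inclusive range of x values that take
    the same branch at every condition; None means unbounded on that side.
    """
    # A "cut" is the smallest value that lands on the upper side of a split.
    cuts = set()
    for op, const in conditions:
        if op in ("<", ">="):
            cuts.add(const)        # splits into x <= const-1 and x >= const
        elif op in ("<=", ">"):
            cuts.add(const + 1)    # splits into x <= const and x >= const+1
        else:
            raise ValueError(f"unsupported operator: {op}")

    classes = []
    lo = None
    for cut in sorted(cuts):
        classes.append((lo, cut - 1))
        lo = cut
    classes.append((lo, None))
    return classes


# "if (x < 10) ..." gives the two classes x <= 9 and x >= 10.
print(equivalence_classes([("<", 10)]))
```

One representative value per interval would then be enough to exercise every branch outcome for x.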
It would probably be a better use of my time to just try to learn how to use the BLAST tool, but I want to have a more hands-on understanding of the techniques involved. I'm currently at the point where it has become obvious why the monotonicity requirement for fixpoint computation is necessary -- otherwise you might never terminate when faced with a loop. I've also become comfortable with the idea of least-upper-bounds as a technique for handling branches and joins. Previously I hadn't been able to figure out how to organize my data structures so that they could handle loops.
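A minimal sketch of fixpoint iteration with least-upper-bounds, assuming a toy abstract domain where a value is the set of signs ("-", "0", "+") a variable might have. The names `join`, `add_one`, and `loop_head_fixpoint` are invented for this example. The iteration stops because the transfer function is monotone and the lattice is finite, so the state at the loop head can only grow a bounded number of times.

```python
def join(a, b):
    """Least upper bound in the powerset-of-signs lattice is set union."""
    return a | b


def add_one(value):
    """Abstract effect of `x = x + 1` on a set of possible signs."""
    result = set()
    for sign in value:
        if sign == "-":
            result |= {"-", "0"}   # -5 + 1 stays negative, -1 + 1 is zero
        else:
            result.add("+")        # 0 + 1 and positive + 1 are positive
    return frozenset(result)


def loop_head_fixpoint(entry, body):
    """Iterate state = entry JOIN body(state) until nothing changes."""
    state = frozenset()            # bottom: no values seen yet
    while True:
        new_state = join(entry, body(state))
        if new_state == state:
            return state
        state = new_state


# Models: x = 0; while <unknown condition>: x = x + 1
result = loop_head_fixpoint(frozenset({"0"}), add_one)
print(sorted(result))
```

At the loop head the analysis concludes that x is zero or positive, without ever running the loop for real.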
I have had the opportunity to talk to dozens of engineering managers that have implemented, or tried to implement, developer testing programs. The reports are very consistent: without some practices and tools to set up objectives, prioritize the efforts, and measure the results, it's very difficult to maintain momentum, achieve consistency, and maximize efficiency and effectiveness. It's clear that developer testing, like any other activity that consumes a non-trivial amount of valuable development time and resources, has to be managed.
I'm currently in a similar situation -- our engineering team has nominally agreed that test-driven development is a good thing, but I'm having a lot of trouble keeping it going. Occasionally I push to have a particular feature get more tests, and then it happens (to some degree), but I can't continually push on all features all the time.
The article above suggests several things: first, give developers a sense of accomplishment by measuring, and reporting, how many "test points" they've achieved. Second, set targets for test points, and manage those targets just as you manage features. Third, create a dashboard that shows the current status, compared with the target.
"Designing for testability" is a way of writing code so that testing is easier. "Designing for debuggability", in contrast, is a way of writing tests so that development is easier.
Designing for testability means that when you write code, you design it in such a way that it can be tested. For example, if you're developing a client/server system, make it possible to create a stub API so that the client can be tested in isolation, without having to set up and configure a server. The advantage of this is that it's easier to write tests against single components of the system, instead of trying to test the whole system all at once. It makes the tests easier to maintain (since the client tests aren't affected by changes in how the server is set up). It also makes it easier to debug the tests when they fail, since you know the problem is in the client, not the server.
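A minimal sketch of the stub idea, assuming the client talks to the server through a single send() call. `StubServer`, `Client`, and the "COUNT widgets" command are all hypothetical names for illustration, not the API of any real system.

```python
class StubServer:
    """Hypothetical in-memory stand-in for the real server API."""

    def __init__(self, canned_replies):
        self.canned_replies = canned_replies
        self.requests = []         # record what the client actually sent

    def send(self, command):
        self.requests.append(command)
        return self.canned_replies[command]


class Client:
    """Hypothetical client that only depends on an object with send()."""

    def __init__(self, server):
        self.server = server

    def widget_count(self):
        reply = self.server.send("COUNT widgets")
        return int(reply)


def test_widget_count_parses_reply():
    stub = StubServer({"COUNT widgets": "42"})
    client = Client(stub)
    count = client.widget_count()
    assert count == 42, f"expected 42 widgets from canned reply, got {count}"
    assert stub.requests == ["COUNT widgets"], f"unexpected requests: {stub.requests}"


test_widget_count_parses_reply()
```

Because the client receives its server object as a constructor argument, the same client code runs against either the stub or the real connection.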
Designing for debuggability is the dual of designing for testability. The idea is that when you write a test, you should be writing it with the expectation that some day the test will fail. That's what a good test does -- it fails; it highlights regressions or misunderstandings. And since the goal of the test is for it to fail someday, it's in your interest to make sure that when that day comes it's easy to figure out what went wrong. That means you should keep the test very simple and direct -- you don't want to have to debug the test before it's even possible to start debugging the code. It also means making sure that the error message you get when the test fails is very obvious. Don't settle for "AssertionError". Instead, try for "AssertionError: too slow, expected at least 3000 widgets per second, but got only 2200". Try to avoid even things as simple as if-statements and for-loops, because they make the test more complicated.
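A minimal sketch of an assertion that explains itself, using the throughput message from the paragraph above. The helper name `check_throughput` and its parameters are made up for this example.

```python
def check_throughput(widgets, seconds, minimum_per_second=3000):
    """Fail with a message that states the expectation and the measurement."""
    rate = widgets / seconds
    assert rate >= minimum_per_second, (
        f"too slow, expected at least {minimum_per_second} widgets per second, "
        f"but got only {rate:.0f}"
    )
```

Calling check_throughput(22000, 10) raises an AssertionError whose message already contains both the threshold and the measured rate, so nobody has to rerun the test just to find out how slow it was.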
Some tests are inherently complicated. This is especially true of load tests, which run a long time and use a lot of input. For those tests, designing for debuggability might mean keeping a detailed log of everything that happened in the test. It might mean making sure that the inputs are deterministic instead of randomly generated (or at least it means being able to reuse the same random seed that caused the failure initially). It might mean crafting the test so that a small set of inputs gets magnified into a whole ton of load.
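A minimal sketch of replayable random inputs, assuming the test can be handed a seed. `make_inputs` and `run_load_test` are hypothetical names; the point is only that the seed is logged on every run and can be passed back in.

```python
import random
import time


def make_inputs(seed, count=5):
    """Generate pseudo-random test inputs that depend only on the seed."""
    rng = random.Random(seed)      # private generator, not the global one
    return [rng.randrange(1_000_000) for _ in range(count)]


def run_load_test(seed=None):
    if seed is None:
        seed = int(time.time())    # fresh seed for a normal run
    print(f"load test seed: {seed}")   # log it so a failure can be replayed
    return make_inputs(seed)
```

When a run fails, the seed from the log can be passed to run_load_test to regenerate exactly the same inputs.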
For example, at work I'm testing a client-server system. I have some end-to-end tests that start up the server and then run various commands against the client. Now suppose one of those commands ends up killing the server. It would be helpful if the test told me "command X killed the server, and here's the stderr log from the server process". It's less helpful if the only error message I get is that the command right after command X failed with "connection refused". And it's even less helpful if the test didn't capture stderr, so that I have to run all the steps in the test by hand in order to see the error message from the server.
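A minimal sketch of a test harness that reports which command killed the server and includes the server's stderr. Everything here is an assumption for illustration: the fake server is a tiny Python child process that reads commands on stdin and dies on "X", and the liveness check is a crude one-second wait after each command. A real harness would talk to the real server over its real protocol.

```python
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the real server: it dies when it reads "X".
FAKE_SERVER = r"""
import sys
for line in sys.stdin:
    if line.strip() == "X":
        sys.stderr.write("fatal: cannot handle X\n")
        sys.exit(1)
"""


def run_commands(commands):
    """Send each command, and report exactly which one killed the server."""
    with tempfile.TemporaryFile(mode="w+") as stderr_log:
        with subprocess.Popen(
            [sys.executable, "-c", FAKE_SERVER],
            stdin=subprocess.PIPE, stderr=stderr_log, text=True,
        ) as server:
            for command in commands:
                server.stdin.write(command + "\n")
                server.stdin.flush()
                try:
                    server.wait(timeout=1.0)   # did this command kill it?
                except subprocess.TimeoutExpired:
                    continue                   # still alive, keep going
                stderr_log.seek(0)
                raise AssertionError(
                    f"command {command!r} killed the server "
                    f"(exit code {server.returncode}); "
                    f"server stderr:\n{stderr_log.read()}"
                )
```

Running run_commands(["A", "X", "B"]) fails at X with the server's own error text in the message, instead of failing later at B with "connection refused".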
Here's an amusing tool: Guantanamo is a tool that can delete all code lines that are not covered by tests. It can also delete code that is not covered by the application itself.
(Via Andrew Birkett)
I want to write down this bit of knowledge so that it can get sucked up into the all-encompassing brain that is Google. If you use pexpect on Solaris, and you notice that it hangs in spawn.__del__ as soon as the spawn object goes out of scope, then this may be of interest to you.
Yesterday at work I ran some code coverage metrics for the test suite we have. I learned that there's a substantial percentage of the code that isn't being covered. What I'd like to do is correlate that information with my knowledge of which features aren't yet supposed to be covered. We have one large project in particular that isn't ready for testing, and I expect it accounts for a lot of the uncovered code. So I'd like to know what percentage of the uncovered code is due to that feature.
One idea that occurred to me would be to try to correlate call graph information with code coverage information. For example, maybe a base class grew a new method in order to support a new feature, and that method is overridden by a dozen subclasses. If that feature simply isn't getting tested, it would be nice to be able to "roll up" all those method implementations, and think of them as a single method that isn't getting executed. Then you could also roll up all the callers of that method, and all the methods it calls. Obviously if you go far enough up the call stack, then you're no longer talking about a particular feature, but it would still be useful to be able to see the entire uncovered code tree as a single entity.
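A minimal sketch of the roll-up, assuming the call graph is available as a dict from each function to the functions it calls, and that coverage is a set of function names that were executed. Both input formats and the name `uncovered_clusters` are invented here; overridden methods are modelled as edges from the base method to each override.

```python
def uncovered_clusters(call_graph, covered):
    """Group uncovered functions that are linked by call edges.

    call_graph maps a function name to the names it calls.
    """
    # Collect every function mentioned anywhere in the graph.
    functions = set(call_graph)
    for callees in call_graph.values():
        functions.update(callees)
    uncovered = functions - set(covered)

    # Build undirected links between uncovered functions only.
    neighbours = {name: set() for name in uncovered}
    for caller, callees in call_graph.items():
        for callee in callees:
            if caller in uncovered and callee in uncovered:
                neighbours[caller].add(callee)
                neighbours[callee].add(caller)

    # Each connected component is reported as one uncovered entity.
    clusters, seen = [], set()
    for start in sorted(uncovered):
        if start in seen:
            continue
        cluster, stack = set(), [start]
        while stack:
            name = stack.pop()
            if name not in cluster:
                cluster.add(name)
                stack.extend(neighbours[name] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    return clusters


graph = {
    "main": ["load", "export"],
    "export": ["Base.render"],
    "Base.render": ["Pdf.render", "Html.render"],
    "old_tool": ["cleanup"],
}
for cluster in uncovered_clusters(graph, covered={"main", "load"}):
    print(cluster)
```

Here the untested export feature and all of its render overrides show up as one cluster, separate from the unrelated old_tool code.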
A coworker mentioned that the idea that was "screaming out at him" was aspect oriented programming. If the coverage metrics could be categorized by aspect, that would clearly make it easier to understand the degree to which each aspect is covered. Assuming that aspects correlate with features, that would give you feature coverage. The only problem with this is that I've never yet worked anywhere that used AOP.
I just came across jMock, which is a Java library that uses reflection to make writing mock objects easier. Most of the testing I'm doing at work at the moment is in C++, but once we get to testing the Java parts I expect that jMock may come in very handy.
I still can't say much about the place where I work, but I can share a pretty picture with you. This is the output of a stress/performance test that I wrote a couple weeks ago. It takes a little over a minute to generate a 700x700 pixel version of the picture below. I find this amusing, because the software is not really designed to do this kind of thing -- it's meant to be more of a database than a computation engine. The conversion into GIF format was done by postprocessing the output with ppmtogif. Click on the picture to see the 700x700 version.

At work, I have to test a program that has 18 separate features, nearly all of which can be used in combination with each other, and each of which has multiple ways it can be used. Some of the features are binary -- the feature is either used or it isn't -- but a few features have five or six variations.
I want to test each possible pairwise combination of features, because most bugs are attributable to either a single feature or a combination of two features. A brute force way of doing this pairwise testing would require hundreds of individual tests. A cleverer version created by hand might be able to get away with using only fifty tests or so, except that there are several combinations that are invalid, and keeping track of all of them is a real headache.
It turns out there are programs designed to solve this problem. For example, jenny. I wrote a little Python wrapper around jenny, so that I can use descriptive feature names instead of numbers and letters, and then I set it loose on my problem. It came up with a solution requiring only 35 tests. Since the minimum possible number of tests is at least 30, I figure 35 is pretty good. And since the problem is NP-hard in the general case, "pretty good" is good enough.
A Google search for combinatorial testing will yield a lot more information about the technique, if you're interested.
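To make the technique concrete, here is a minimal greedy sketch of pairwise test selection. It is not how jenny works and it is not my wrapper. It enumerates every full combination, so it only suits small examples, and the feature names and the `is_valid` rule are invented for illustration.

```python
from itertools import combinations, product


def pairs_of(test):
    """All pairs of (feature, value) settings exercised together by one test."""
    return set(combinations(sorted(test.items()), 2))


def pairwise_tests(features, is_valid=lambda test: True):
    """Greedily pick tests until every valid pair of values is covered."""
    names = sorted(features)
    candidates = [
        dict(zip(names, values))
        for values in product(*(features[name] for name in names))
    ]
    candidates = [test for test in candidates if is_valid(test)]

    # Only pairs that occur in some valid combination need covering.
    uncovered = set()
    for test in candidates:
        uncovered |= pairs_of(test)

    chosen = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda test: len(pairs_of(test) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen


features = {
    "compression": ["none", "gzip", "bzip2"],
    "encryption": [False, True],
    "transport": ["tcp", "udp"],
}


def no_encrypted_udp(test):
    """Made-up rule standing in for an invalid feature combination."""
    return not (test["transport"] == "udp" and test["encryption"])


for test in pairwise_tests(features, no_encrypted_udp):
    print(test)
```

Filtering invalid combinations before choosing tests means the bookkeeping that is such a headache by hand happens in one place.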
If you need to test a codebase, one of the first things you should consider is getting code coverage metrics, so that you know what is currently being tested and what isn't. This is one of the first tasks I set myself when I started my new job, three weeks ago.
I've been looking for info about how to make JUnit and CppUnit work together in the same framework. CppUnit has an inherent closed-world assumption because it's based on linking everything together into one big executable. JUnit similarly assumes that all its tests are written in Java. It's not obvious how to create a TestRunner that will span the two frameworks, and the alternative of just having two different frameworks seems gross. While I'm at it, I'd also like to have integration tests use the same framework.
If I end up creating such a thing, maybe I'll call it The One Ring.
Since I frequently get useful pointers from readers when I mention a problem here, I thought I'd put the topic up for discussion.