Reading Building Robust Systems, by Gerald Sussman. I'm not done yet, but so far (page 7) the paper is pretty vague and metaphorical. So I thought I'd scribble down some of the slightly-more-concrete thoughts that it triggers in my head:
Sussman talks about "degeneracy" in biological systems, and how it can emerge by duplicating a section of DNA and allowing the copies to diverge. In programming languages, this might be done by taking a program, copying a section of code, and then changing each caller so it either continues to call the old version or calls a new one. In order to allow broken pieces of code to continue to evolve without destroying the program, you could make callers "prefer" one version over the other, but silently fall back to their non-preferred implementation if the first version didn't work. For example, maybe their preferred version threw an exception, or maybe it started failing some kind of unit test that the caller cares about.
Here's another idea: generate random segments of code by "connecting the dots", where by "dot" I mean "type", or perhaps "function call". Suppose you have a URL and you want to have a file on disk. If you're lucky, you can search the call graphs of a whole bunch of programs and find some code path that starts with a url and ends with a file. If you're really lucky, that code path will do something appropriate, like downloading the content behind the url. If you took this idea and applied it to all the open source projects in the world, you'd probably have a fair chance of implementing something reasonable, purely by accident. Well, not really by accident -- it would actually be by virtue of the fact that you're drawing a random sample from a set of programs that is distributed extremely non-uniformly over the space of all possible programs. Djinn does something like this, but without the benefit of a meaningful dataset of samples to draw from. Haskell probably has an advantage at this kind of thing because it doesn't depend on side effects to determine the meaning of a segment of code.
Combine these two ideas. Generate random code, evolve it by making (fail-safe) copies, and mutate it by replacing randomly-selected code paths with randomly-generated code paths that connect the same dots.
This starts to resemble the way bacteria share plasmids. If you manage to come up with a useful generated program, add it back to the pool of programs from which we draw our samples. Now you start getting positive feedback for code snippets that continue to be useful when copied/mutated. You can start imagining computer scientists doing research into things like what characteristics are necessary for a piece of code to be randomly reusable, or what extraction techniques end up producing more reusable snippets.
To take better advantage of the meaning inherent in existing a program, you also want to pay attention to variable names, not just types. E.g. a variable named "i" is probably an index, while one named "n" is probably a length of some kind. Build up a probabilistic model describing how variables with certain names are likely to be handled (e.g. variables named "i" are unlikely to be divided, but very likely to be incremented). You might even be able to only pay attention to names (or at least parts of names), and just hope that the types work out properly -- that would be simpler than trying to handle both names and types. Try to notice naming conventions within a single class / subsystem / project, and pull them out as patterns. Do all of this probabilistically so that it scales to huge code repositories like SourceForge. I'm thinking you'd treat names as n-grams, and just accumulate counts for how many times an identifier with a particular n-gram appeared in a particular relationship with other variables. Normalize these histograms to produce probability distributions. Now you can use the distributions to produce a "code babler", or to identify (and extract) common patterns that appear to be far from random.
I'll probably have more thoughts later, but I wanted to scribble this down, since it's been so long since I've written anything here.
Posted on February 19, 2007 04:33 PM
More languages articles
That essay is here (your link was a 404). Thanks for pointing it out.
Posted by: Darius Bacon at February 20, 2007 09:01 PMHow did you find this essay? (Are you in his class?)
Posted by: Tobin Fricke at February 20, 2007 10:14 PMI found the essay via programming.reddit.com.
Posted by: Kim at February 23, 2007 10:43 AMno, i don't think the robustness comes from evolutionary programming. but instead higher-level systems made of many programs. finish reading the paper.
Posted by: aaron at March 30, 2007 01:26 PM