An idea occurred to me the other day for an interesting project: using Wikipedia as a source database for machine translation of natural languages. This leaves entirely unspecified which algorithm would achieve that, and I don't expect the results would be very successful, but I think it would be a fun project nonetheless.
It would be fun because, although Wikipedia has articles in multiple languages, those articles are not generally kept in sync. The same article might have completely different structures in different languages. Getting anything even remotely sensible out of such a database would obviously be pretty difficult. But I think it would be fascinating to see what could be produced, even if it falls short of a "gisting translation". The translator is likely to end up with lots of subject knowledge that goes beyond what one would normally think of as translation. For example, it might decide that anything mentioning the word "nazi" is likely to be an incorrect translation, if it notices lots of edits that added that word and were later reverted (defacement).
The first thing you could start with would be to create synonyms based on the titles of pages that are supposed to be about the same thing. After that, it's not clear what approach to try :)
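As a minimal sketch of that first step: Wikipedia does record which pages in different languages are meant to cover the same topic, so suppose that information has already been extracted into one dictionary per topic, keyed by language code. The function name, the input layout, and the sample titles below are all assumptions made for illustration, not an existing API.

```python
from collections import defaultdict

def build_title_dictionary(linked_titles, source_lang, target_lang):
    """Map each source-language title to the target-language titles linked to it.

    linked_titles is a hypothetical input: one dict per topic, mapping a
    language code to that topic's page title in that language.
    """
    dictionary = defaultdict(set)
    for titles_by_lang in linked_titles:
        source = titles_by_lang.get(source_lang)
        target = titles_by_lang.get(target_lang)
        # Skip topics that lack a page in either language.
        if source and target:
            dictionary[source.lower()].add(target)
    return dict(dictionary)

# Hypothetical sample data, not fetched from Wikipedia.
sample = [
    {"en": "Dog", "de": "Haushund", "fr": "Chien"},
    {"en": "Machine translation", "de": "Maschinelle Übersetzung"},
    {"en": "Eurisko"},  # no German page, so it contributes nothing
]
title_dict = build_title_dictionary(sample, "en", "de")
print(title_dict)
```

This only gives you a dictionary of noun phrases that happen to be article titles, which is exactly why the next step is unclear.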
It's not even clear how to know when you've achieved a reasonable translation without requiring a whole lot of human feedback. You could probably get plenty of feedback if you somehow harnessed the efforts of the people who actually contribute to Wikipedia itself. But I doubt you could convince the Wikipedia folks to let you add your automatically translated articles to Wikipedia, since they'd very likely be utter garbage, at least at first.
If I were to approach this project for real, I would probably read up on machine learning techniques, and try to have the program figure out how to approach translation pretty much on its own. I'd prime it with concepts like synonyms, parts of speech, section headers, hyperlinks, talk pages, revisions, maybe even word etymologies. Then I'd let it figure out on its own which concepts seemed useful, and let it invent new concepts, heuristics, and meta-heuristics. I'm currently thinking of something like EURISKO, although I'm sure the state of the art has moved on significantly since that time.
The remaining big question is how to know whether it's doing well. This is necessary in order to guide the results towards "good" translations. Maybe you could restrict it to half of Wikipedia, and then judge it by how well it approximates the other half. Or maybe you could test it with newly-added translations of existing articles. This of course raises the question of how you know whether an automatically-translated article is "similar" to a human-produced one.
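One crude way to put a number on "similar" is to compare the word n-grams of the two texts. The sketch below uses Jaccard overlap of word bigrams; it is in the spirit of n-gram overlap metrics such as BLEU but much simpler, and the function names are my own. Since Wikipedia articles in different languages are rarely direct translations of each other, a low score would not necessarily mean a bad translation.

```python
def ngrams(text, n=2):
    """Return the set of word n-grams in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_similarity(candidate, reference, n=2):
    """Jaccard overlap of word n-grams: 1.0 for identical sets, 0.0 for disjoint."""
    a, b = ngrams(candidate, n), ngrams(reference, n)
    if not a and not b:
        return 0.0  # both texts too short to compare
    return len(a & b) / len(a | b)

# Hypothetical machine output compared against a human-written sentence.
machine = "the cat sat on the mat"
human = "the cat lay on the mat"
print(ngram_similarity(machine, human))
```

A score like this could serve as the feedback signal for the learning program, although it rewards surface overlap rather than meaning.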
An interesting intermediate project might be to use Wikipedia to automatically improve articles written in a single language. For example, it might figure out how to automatically remove defacement. It might even figure out how to use linked articles to provide extra context and background. The advantage of this project over the translation project is that you could use the edit history of an article as your criterion for whether the changes it makes are better or worse. You would just assume that articles generally improve over time.
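To use the edit history that way, you first need to know which edits did not survive. A minimal sketch, assuming the history is available as a chronological list of full revision texts: if a revision's text is identical to an earlier revision's text, everything in between was effectively undone. The function name and the sample history are hypothetical, and this only catches exact reverts, not partial ones.

```python
import hashlib

def find_reverted(revisions):
    """Return indices of revisions undone by a later return to an earlier text.

    revisions is a chronological list of full article texts.
    """
    last_seen = {}    # text digest -> index of the latest revision with that text
    reverted = set()
    for i, text in enumerate(revisions):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in last_seen:
            # Every revision strictly between the two identical texts was undone.
            reverted.update(range(last_seen[digest] + 1, i))
        last_seen[digest] = i
    return sorted(reverted)

# Hypothetical history: revision 2 is defacement, revision 3 restores revision 1.
history = [
    "Cats are small mammals.",
    "Cats are small carnivorous mammals.",
    "Cats are small carnivorous mammals. BUY CHEAP WATCHES",
    "Cats are small carnivorous mammals.",
    "Cats are small carnivorous mammals kept as pets.",
]
print(find_reverted(history))
```

Revisions that were never reverted can then be treated as (weak) examples of improvements, and reverted ones as examples of changes to avoid.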
You could also think about combining these projects: taking improvements from one language and applying them to the same article in a different language. The problem, of course, is that the edit history is less likely to show whether this was a good idea or not, since people aren't usually very prompt about copying edits from one language to another.
Even if the automatic generation of text doesn't end up working out, the same general idea might turn out to be very useful for things like automatically detecting defacement or erroneous changes. As Wikipedia gets larger, tools to help with that task might be quite useful. Such a tool would also be likely to receive lots of human feedback, since it would be in people's interest to use it and to teach it when it guessed incorrectly.
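One simple signal for such a tool, following the earlier observation about words that keep getting added and then reverted, is the fraction of edits introducing a given word that were later undone. This is only a sketch: the input format (a list of added words plus a reverted flag per edit), the `min_count` threshold, and the sample data are all assumptions.

```python
from collections import Counter

def word_revert_rates(edits, min_count=2):
    """Return, per word, the fraction of edits adding it that were later reverted.

    edits is an iterable of (added_words, was_reverted) pairs. Words added in
    fewer than min_count edits are left out, since their rate means little.
    """
    added = Counter()
    reverted = Counter()
    for words, was_reverted in edits:
        # Count each word at most once per edit.
        for word in {w.lower() for w in words}:
            added[word] += 1
            if was_reverted:
                reverted[word] += 1
    return {w: reverted[w] / n for w, n in added.items() if n >= min_count}

# Hypothetical edit log: (words the edit added, whether it was later reverted).
edit_log = [
    (["nazi"], True),
    (["nazi", "party"], True),
    (["nazi", "germany"], False),
    (["party"], False),
    (["germany"], False),
]
rates = word_revert_rates(edit_log)
print(sorted(rates.items(), key=lambda item: item[1], reverse=True))
```

Human corrections fit naturally into this scheme: each time someone tells the tool it guessed wrong, that edit's reverted flag is flipped and the rates are recomputed.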
Posted on February 6, 2006 10:06 PM
Automated translation using Wikipedia as the source doesn't really seem like it would work well. For one thing, articles aren't necessarily translations of each other, and there's no way to guarantee that there's any match at all between the text in two different languages. On the other hand, I think you might be onto something with the automated defacement detection. There are a lot of ways you can determine what defacements look like, by looking at the text, edit history, talk page activity, etc. Then, using the built-up heuristic, you can try to figure out if a given edit is a defacement or not. I wonder what the Wikipedia people would think of bots playing with their wiki.
Posted by: crzwdjk at February 7, 2006 12:19 PM