Roll Your Own Search Engine

This month's edition of ACM Queue is all about search. I found it pretty interesting to read, after my experience at Endeca. For example, the first article was a conversation with Matt Wells, who is apparently writing his own search engine, and making enough money at it to live on. My first thought was, why didn't I think of writing my own search engine? :) It was interesting to read his catalogue of search techniques -- for example, how you should distribute documents across multiple machines for scalability. All the concepts were very familiar.

The "Why writing your own search engine is hard" article was fun. While I read it I kept thinking "yep, yep, yep". However, since I've already had exposure to all of these issues, it didn't seem to me like it all added up to "search is hard" -- but I suppose if you've never worked on a search engine before, there are lots of things that you're likely to overlook.

In my opinion, search engines actually have it easy in many ways. Sure, they're distributed, and that makes things complicated, but at least they're working with a read-only set of data. If one of your nodes goes down, you can just fail over to a backup node. Databases, on the other hand, are really hard. There you're working with distributed, mutable data, and mutability complicates everything in a completely orthogonal way, which makes everything you do that much harder. For example, it means that you can't use the same design to provide redundancy as you do to provide scalability.
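To make the read-only point concrete, here is a minimal sketch of a sharded index where each shard has identical copies and a query simply moves on to the next copy when one is down. Everything in it (the `Replica` class, `build_shards`, `search`, the CRC-based partitioning, the `alive` flag standing in for a crashed machine) is an assumption made up for illustration, not a description of how any real engine does it.

```python
import zlib
from collections import defaultdict


class Replica:
    """One read-only copy of a shard's inverted index (hypothetical)."""

    def __init__(self, postings):
        self.postings = postings  # term -> set of doc ids
        self.alive = True         # stand-in for "this machine is up"

    def search(self, term):
        if not self.alive:
            raise ConnectionError("replica is down")
        return self.postings.get(term, set())


def build_shards(docs, num_shards, copies=2):
    """Partition docs by a hash of the doc id, then copy each shard's index."""
    indexes = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard = zlib.crc32(doc_id.encode()) % num_shards
        for term in text.lower().split():
            indexes[shard][term].add(doc_id)
    # The data never changes after the build, so a replica is just a copy.
    return [[Replica(dict(ix)) for _ in range(copies)] for ix in indexes]


def search(shards, term):
    """Fan out to every shard; within a shard, use the first live replica."""
    hits = set()
    for replicas in shards:
        for replica in replicas:
            try:
                hits |= replica.search(term.lower())
                break
            except ConnectionError:
                continue  # fail over to the next copy
        # If every copy of a shard is down, its documents are silently missed.
    return hits
```

Adding shards gives scalability and adding copies gives redundancy, and because nothing is ever written, the copies never need to be kept in sync. With mutable data that last assumption no longer holds, which is where the database problem starts.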

There was an article on why intranet search is hard: because intranets aren't as hyperlinked as the internet is, and because users in an intranet are looking for authoritative answers, not just popular pages. I haven't seen any really good techniques for dealing with this problem, except for ones that add structured information to the underlying data. However, it's very hard to add structured information in a wholesale way -- you either do it automatically, in which case it's poor quality, or you do it manually, in which case it's expensive and incomplete. I think incorporating user feedback might be the only way to really do this effectively, but it's difficult to come up with a feedback model that's both useful and robust.

The "Searching vs Finding" article started out very familiar but quickly moved into things I had never even thought of, for example requiring that the words in the document be in the same relationship as the words in the query. The example was a query of "black and white dog" which, if you don't pay attention to word relationships, returns articles about dogs being used to stifle racial protests. Whereas if you pay attention to word relationships, you get articles about dogs seeing in black and white, which is at least closer. Presumably an ideal search engine would return matches for Dalmatians instead...

Another interesting thing throughout the issue was the number of advertisements that I saw from companies that I was very familiar with (ex-competitors). For example, the enterprise search article used a screenshot from a Verity product. And the back cover is an ad for Convera. I didn't notice any Endeca ads though...

Posted on April 29, 2004 07:15 PM

Comments

I'm a bit perplexed by that Matt Wells character.

First, he puts himself in the company of Google and Yahoo: "Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast."

Then, later, he says that Gigablast runs on "eight desktop machines" (indexing 320 million web pages, which he hopes to increase to 5 billion by the end of the year).

I'm prepared to believe that you could compete with Google with, say, an order of magnitude fewer machines--but *three* orders of magnitude?? (And a whole lot fewer PhDs?) It doesn't seem possible.

Posted by: Michael S. at April 30, 2004 03:48 AM

I think the trick is that he doesn't get anywhere near the query volume that Google does. If he gets one thousandth as many queries as Google, he can get away with using one thousandth as many machines. Alternatively he can let per-query latency increase from 0.2 seconds to 0.8 seconds, and get away with only using a quarter (or even a tenth) as many machines.
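Here is that arithmetic as a toy model. It assumes each query costs a fixed amount of machine time that can be split evenly across machines, so you need enough machines both to keep up with the query rate and to finish a single query within the latency target. The function name and every number below are made up for illustration; they are not figures for Google or Gigablast.

```python
def ceil_div(a, b):
    """Integer division rounded up, for non-negative a and positive b."""
    return -(-a // b)


def machines_needed(queries_per_sec, work_ms_per_query, target_latency_ms):
    """Toy capacity model (hypothetical): the larger of two lower bounds."""
    # Throughput bound: total machine-milliseconds of work arriving per second.
    for_throughput = ceil_div(queries_per_sec * work_ms_per_query, 1000)
    # Latency bound: machines one query must be spread over to finish in time.
    for_latency = ceil_div(work_ms_per_query, target_latency_ms)
    return max(for_throughput, for_latency)


# Assumed figures: 1600 ms of machine time per query.
busy = machines_needed(5000, 1600, 200)     # 8000 machines, throughput-bound
quiet = machines_needed(5, 1600, 200)       # 8 machines at 1/1000 the queries
relaxed = machines_needed(1, 1600, 800)     # 2 machines at 0.8 s latency
```

In this model a thousandth of the traffic needs a thousandth of the machines, and at low volume relaxing latency from 0.2 to 0.8 seconds takes you from 8 machines to 2. The model only yields the quarter, not the tenth; getting below a quarter would depend on effects it doesn't capture.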

Posted by: Kim at May 4, 2004 11:01 AM

When I was at Lycos I knew a young kid (22) whose company was bought by Ask Jeeves. He came from Kentucky and was dirt poor. He started a search company called ezresults.com with 4-6 desktop PCs. He coded it so that results hashed right to the spot on the disk (not the file name, the physical spot) for maximum speed. Hell, even Direct Hit (who supplied Lycos) only ran on 6 BSD boxes.
Search is one of those problems that ramps up by steps. You can run a great system if you keep the database requirements low, which, during the boom, was enough. Scaling is easy, since the system is so parallel that simply throwing more hardware at it works great.
Indexing is super easy in an environment with interlocking links and text describing them.
So no, not very surprised.

Side note: after he sold his company, Ask Jeeves went bust on the market. Poor guy. Rags to riches to rags in a few months.

Posted by: drlloyd11 at May 15, 2004 05:55 AM

Oh, I dunno about the problem with autoclassifying intranet documents--I put in a few years at Northern Light when it was still a public search engine, and intranet docs (as well as other 'high quality' document sources like all the periodicals we had in our collection) were relatively easy to classify automatically. Web docs were normally much harder to deal with, in part because the quality varied so widely. The active attempts to spoof search engine crawlers never helped, though they were sometimes funny to look at.

Wouldn't surprise me if you could play Bayesian filtering games with intranet docs as part of the classification system, though training the filters would be a pain. On the other hand, if you've got a corpus of well-classified documents available it becomes a lot easier to do, especially since the documents for something like this (intranet crawling and classification) are normally significantly larger than what you'd find on average on the web or in e-mail.

Posted by: Dan at July 19, 2004 10:39 AM

Dan, you're right that intranet classification is easier than internet classification. The body of documents covers a smaller breadth of human endeavor, and you don't have to worry about spoofing.

But I was coming at it from a different angle -- I was saying that adding metadata to content that doesn't already have it is hard. That is, I was comparing structured data versus unstructured data, not internal documents versus public documents.

There are really only two ways to add metadata: automated (and unreliable) or manual (and expensive). A combination of the two (e.g. Bayesian filtering) is probably better than either alone, but it's still not cheap and easy.
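One way to picture that combination is a small naive Bayes tagger that labels a document automatically only when it is confident, and otherwise hands it to a person. This is a minimal sketch under my own assumptions: the class and function names, the whitespace tokenizer, the add-one smoothing and the 0.8 threshold are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTagger:
    """Tiny multinomial naive Bayes over whitespace tokens (hypothetical)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of docs
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def posteriors(self, text):
        # Words never seen in training carry no evidence, so skip them.
        words = [w for w in text.lower().split() if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.doc_counts:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            log_scores[label] = score
        # Normalise in a numerically safe way.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(exp.values())
        return {label: v / norm for label, v in exp.items()}


def tag_or_escalate(tagger, text, threshold=0.8):
    """Return (label, "auto") if confident, else (None, "manual")."""
    probs = tagger.posteriors(text)
    label = max(probs, key=probs.get)
    if probs[label] >= threshold:
        return label, "auto"
    return None, "manual"
```

The threshold is the cost knob: raise it and more documents go to the expensive manual queue, lower it and more unreliable labels slip through. And someone still has to produce the labeled training set in the first place.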

On the other hand, the original article that I was commenting on was really making yet a third point: they were talking about the difficulty of ranking, not of classification. And when it comes to ranking, I think the internet has an advantage, due to its hyperlinked structure. Most intranet documents either don't refer to other intranet documents, or they do so in a vague way (e.g. they'll refer to "the document from April" without giving a url).
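To show what the link structure buys you, here is a rough PageRank-style scoring sketch. I'm using it as a stand-in for link-based ranking in general, not as what any particular engine runs; the function name, damping factor and the two toy link graphs are my own assumptions.

```python
def link_scores(links, damping=0.85, iterations=50):
    """PageRank-style scores by power iteration (illustrative sketch).

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Score held by pages with no out-links is spread over all pages.
        dangling = sum(scores[p] for p in pages if not links[p])
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for target in links[p]:
                new[target] += damping * scores[p] / len(links[p])
        scores = new
    return scores


# A small hyperlinked "web": pages vouch for each other.
web = {"home": ["docs", "blog"], "docs": ["home"], "blog": ["home", "docs"]}

# An "intranet" where documents never link to one another.
intranet = {"memo": [], "spec": [], "report": []}

web_scores = link_scores(web)
intranet_scores = link_scores(intranet)
```

On the linked graph the scores separate the pages, with "home" on top. On the unlinked one every document ends up with exactly the same score, so the links tell you nothing about which document is authoritative.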

Posted by: Kim at July 19, 2004 11:54 AM

Firstly, I agree with the above comments on autocategorisation. We used to perform autoclassification followed by manual checking then captured the stats for classifier improvement. However, the manual check was costly and now we perform autoclassification only and use manual intervention to tweak the classifiers.

Getting to the intranet docs and ranking: I think one has to use other evidence, specific to an intranet environment, in the ranking algorithm. Since an intranet is in a company and therefore the docs are in a more 'controlled' environment, one can assume that metadata added is more reliable - not gamed. Further, since an organisation is highly social (i.e. some authors are well regarded), one could make use of this in the relevance ranking: interactions between authors (e.g. by email), how many times a doc is checked out of a doc mgt system... maybe

Posted by: Lisa Catullus at November 12, 2004 09:19 AM