Roll Your Own Search Engine

This month's edition of ACM Queue is all about search. I found it pretty interesting to read, after my experience at Endeca. For example, the first article was a conversation with Matt Wells, who is apparently writing his own search engine, and making enough money at it to live on. My first thought was, why didn't I think of writing my own search engine? :) It was interesting to read his catalogue of search techniques -- for example, how you should distribute documents across multiple machines for scalability. All the concepts were very familiar.
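
As a rough illustration of the document-distribution idea, here is a minimal sketch. It is not how Gigablast or Endeca does it; the names `Shard`, `Cluster`, and `shard_for` are made up for this example, and the "machines" are just in-memory objects. Each document is hashed to exactly one shard, and a query is sent to every shard and the results are merged.

```python
import hashlib
from collections import defaultdict


class Shard:
    """One machine's slice of the corpus: a tiny in-memory inverted index."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents on this shard containing every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document id (hypothetical scheme)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class Cluster:
    """Scatter each query to all shards, then gather and merge the hits."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        self.shards[shard_for(doc_id, len(self.shards))].add(doc_id, text)

    def search(self, query):
        hits = set()
        for shard in self.shards:
            hits |= shard.search(query)
        return sorted(hits)
```

Because every document lives on exactly one shard, each shard can answer a multi-word query on its own, and adding shards shrinks the amount of index any one machine has to scan.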

The "Why writing your own search engine is hard" article was fun. While I read it I kept thinking "yep, yep, yep". However, since I've already had exposure to all of these issues, it didn't seem to me like it all added up to "search is hard" -- but I suppose if you've never worked on a search engine before, there are lots of things that you're likely to overlook.

In my opinion, search engines actually have it easy in many ways. Sure, they're distributed, and that makes things complicated, but at least they're working with a read-only set of data. If one of your nodes goes down, you can just fail over to a backup node. Databases, on the other hand, are really hard. There you're working with distributed, mutable data. Mutability is a completely orthogonal complication, and it makes everything you do that much harder. For example, it means that you can't use the same design to provide redundancy as you do to provide scalability.

There was an article on why intranet search is hard: because intranets aren't as hyperlinked as the internet is, and because users in an intranet are looking for authoritative answers, not just popular pages. I haven't seen any really good techniques for dealing with this problem, except for ones that add structured information to the underlying data. However, it's very hard to add structured information in a wholesale way -- you either do it automatically, in which case it's poor quality, or you do it manually, in which case it's expensive and incomplete. I think incorporating user feedback might be the only way to really do this effectively, but it's difficult to come up with a feedback model that's both useful and robust.

The "Searching vs Finding" article started out very familiar but quickly moved into things I had never even thought of, for example requiring that the words in the document be in the same relationship as the words in the query. The example was a query of "black and white dog" which, if you don't pay attention to word relationships, returns articles about dogs being used to stifle racial protests. Whereas if you pay attention to word relationships, you get articles about dogs seeing in black and white, which is at least closer. Presumably an ideal search engine would return matches for dalmatian instead...

Another interesting thing throughout the issue was the number of advertisements that I saw from companies that I was very familiar with (ex-competitors). For example, the enterprise search article used a screenshot from a Verity product. And the back cover is an ad for Convera. I didn't notice any Endeca ads though...

Posted on April 29, 2004 07:15 PM

Comments

I'm a bit perplexed by that Matt Wells character.

First, he puts himself in the company of Google and Yahoo: "Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast."

Then, later, he says that Gigablast runs on "eight desktop machines" (indexing 320 million web pages, which he hopes to increase to 5 billion by the end of the year).

I'm prepared to believe that you could compete with Google with, say, an order of magnitude fewer machines--but *three* orders of magnitude?? (And a whole lot fewer PhDs?) It doesn't seem possible.

Posted by: Michael S. at April 30, 2004 03:48 AM

I think the trick is that he doesn't get anywhere near the query volume that Google does. If he gets one thousandth as many queries as Google, he can get away with using one thousandth as many machines. Alternatively he can let per-query latency increase from 0.2 seconds to 0.8 seconds, and get away with only using a quarter (or even a tenth) as many machines.
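
To make that arithmetic concrete, here is a back-of-the-envelope sketch. The model is an assumption, not anything from the article: each query costs a fixed amount of machine time, that work splits evenly across machines, and memory or disk limits are ignored. The function name `machines_needed` and the sample numbers are invented for illustration.

```python
def machines_needed(work_ms, qps, max_latency_ms):
    """Estimate cluster size under a deliberately simple model.

    work_ms        -- total machine-milliseconds one query costs (assumed fixed)
    qps            -- queries per second the cluster must sustain
    max_latency_ms -- slowest acceptable response time for one query

    Two floors apply: enough machines to keep up with the query rate, and
    enough to split a single query finely enough to meet the latency target.
    Integer ceiling division avoids floating-point rounding surprises.
    """
    for_throughput = -(-(qps * work_ms) // 1000)
    for_latency = -(-work_ms // max_latency_ms)
    return max(1, for_throughput, for_latency)


# Throughput-bound case: cutting the query volume cuts the machine count.
busy = machines_needed(work_ms=100, qps=10_000, max_latency_ms=200)
quiet = machines_needed(work_ms=100, qps=10, max_latency_ms=200)

# Latency-bound case: relaxing the latency target cuts the machine count.
fast = machines_needed(work_ms=20_000, qps=1, max_latency_ms=200)
slow = machines_needed(work_ms=20_000, qps=1, max_latency_ms=800)
```

Under these made-up numbers, `quiet` is one thousandth of `busy`, and `slow` is a quarter of `fast`. A real cluster also needs enough machines to hold the index at all, which this sketch leaves out.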

Posted by: Kim at May 4, 2004 11:01 AM

When I was at Lycos I knew a young kid (22) who had his company bought by Ask Jeeves. He came from Kentucky and was dirt poor. He started a search company called ezresults.com with 4-6 desktop PCs. He coded it so that results hashed right to the spot on the disk (not the file name, the physical spot) for maximum speed. Hell, even Direct Hit (who supplied Lycos) only ran on 6 BSD boxes.
Search is one of those problems that ramps up by steps. You can run a great system if you keep the database requirements low. Which, during the boom, was enough. Scaling is easy, since the system is so parallel that simply throwing more hardware at it works great.
Indexing is super easy in an environment with interlocking links and text describing them.
So no, not very surprised.

Side note: after he sold his company, Ask Jeeves went bust on the market. Poor guy. Rags to riches to rags in a few months.

Posted by: drlloyd11 at May 15, 2004 05:55 AM

Oh, I dunno about the problem with autoclassifying intranet documents--I put in a few years at Northern Light when it was still a public search engine, and intranet docs (as well as other 'high quality' document sources like all the periodicals we had in our collection) were relatively easy to classify automatically. Web docs were normally much harder to deal with, in part because the quality varied so widely. The active attempts to spoof search engine crawlers never helped, though they were sometimes funny to look at.

Wouldn't surprise me if you could play Bayesian filtering games with intranet docs as part of the classification system, though training the filters would be a pain. On the other hand, if you've got a corpus of well-classified documents available it becomes a lot easier to do, especially since the documents for something like this (intranet crawling and classification) are normally significantly larger than what you'd find on average on the web or in e-mail.

Posted by: Dan at July 19, 2004 10:39 AM

Dan, you're right that intranet classification is easier than internet classification. The body of documents covers a smaller breadth of human endeavor, and you don't have to worry about spoofing.

But I was coming at it from a different angle -- I was saying that adding metadata to content that doesn't already have it is hard. That is, I was comparing structured data versus unstructured data, not internal documents versus public documents.

There are really only two ways to add metadata: automated (and unreliable) or manual (and expensive). A combination of the two (e.g. Bayesian filtering) is probably better than either alone, but it's still not cheap and easy.
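
One way to picture such a combination is the sketch below: a small naive Bayes classifier (multinomial, with add-one smoothing) tags documents automatically and hands anything it is unsure about to a person. This is only an illustration under assumptions; the class `NaiveBayes`, the helper `tag_or_escalate`, the labels, and the 0.9 threshold are all invented here, and a real system would need far more training data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> training docs seen
        self.term_counts = defaultdict(Counter)  # label -> term -> count
        self.vocab = set()

    def train(self, label, text):
        """Record one manually labelled document."""
        self.doc_counts[label] += 1
        for term in text.lower().split():
            self.term_counts[label][term] += 1
            self.vocab.add(term)

    def posteriors(self, text):
        """Return {label: probability} for the text, summing to 1."""
        terms = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            total_terms = sum(self.term_counts[label].values())
            score = math.log(n_docs / total_docs)
            for term in terms:
                count = self.term_counts[label][term]  # 0 if never seen
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_terms + len(self.vocab)))
            log_scores[label] = score
        # Subtract the max before exponentiating to avoid underflow.
        top = max(log_scores.values())
        weights = {lbl: math.exp(s - top) for lbl, s in log_scores.items()}
        norm = sum(weights.values())
        return {lbl: w / norm for lbl, w in weights.items()}


def tag_or_escalate(model, text, threshold=0.9):
    """Accept the automatic label only when the model is confident."""
    probs = model.posteriors(text)
    label = max(probs, key=probs.get)
    if probs[label] >= threshold:
        return label, "auto"
    return label, "needs manual review"


model = NaiveBayes()
model.train("hr", "vacation policy holiday leave")
model.train("hr", "benefits vacation payroll")
model.train("eng", "build server deploy release")
model.train("eng", "deploy code review build")
```

With this toy training set, `tag_or_escalate(model, "vacation payroll policy")` is tagged automatically, while a document made of words the model has never seen falls below the threshold and goes to a person. The manual cost does not disappear, it just gets concentrated on the documents where the automation is weakest.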

On the other hand, the original article that I was commenting on was really making yet a third point: they were talking about the difficulty of ranking, not of classification. And when it comes to ranking, I think the internet has an advantage, due to its hyperlinked structure. Most intranet documents either don't refer to other intranet documents, or they do so in a vague way (e.g. they'll refer to "the document from April" without giving a url).

Posted by: Kim at July 19, 2004 11:54 AM

Firstly, I agree with the above comments on autocategorisation. We used to perform autoclassification followed by manual checking, then captured the stats for classifier improvement. However, the manual check was costly, and now we perform autoclassification only and use manual intervention to tweak the classifiers.

Getting to the intranet docs and ranking: I think one has to use other evidence, specific to an intranet environment, in the ranking algorithm. Since an intranet is in a company and therefore the docs are in a more 'controlled' environment, one can assume that metadata added is more reliable - not gamed. Further, since an organisation is highly social (i.e. some authors are well regarded), one can make use of this in the relevance ranking: interactions between authors (e.g. by email), how many times a doc is checked out of a doc mgt system...maybe

Posted by: Lisa Catullus at November 12, 2004 09:19 AM
