I was thinking about the relational model today, and I realized that a lot of people probably don't realize that faceted navigation is an example of an application that simply doesn't fit comfortably in the relational model. It also occurs to me that most of you don't care. But I'm going to tell you why anyway.
Faceted navigation is built on faceted classification, which is conceptually very simple. You've got two things: records and categories, and those two things have a many-to-many relationship.
I like to work with concrete examples, so let's take the US states as our dataset. We've got one record for each state. For our categories, let's use adjacent states. That means Massachusetts has the following categories: adjoins-NH, adjoins-VT, adjoins-CT, adjoins-RI, and adjoins-NY. Obviously there are a lot of states that belong to the adjoins-NY category: Pennsylvania, New Jersey, Connecticut, Massachusetts, and Vermont. So clearly this is many-to-many. That's faceted classification.
You can think of each category as a predicate that defines a subset of records -- all the records that have that category. Faceted navigation is based on taking as input a given set of categories, and producing as output all the categories that, if you added them to the input set, would define a smaller, yet non-empty, record set. For example, if I start with the set of states that adjoin Massachusetts, then I could narrow down to states that also adjoin Rhode Island, or states that also adjoin Vermont. But I couldn't narrow down to states that also adjoin California, since that would create an empty record set.
In math terms, if S is the input set of selected categories, then what you want is the set of categories c such that c is not in s, and such that there exists at least one record r such that r has c and r also has every category in S. Or, more tersely:
{c : (∃r : c(r) ∧ (∀s in S : s(r)))} - S
If you map this to a straightforward relational structure, and then try to use SQL to compute the set described above, you will find that you basically end up thrashing the database to the point that you'd do better to suck out the entire database into memory and forget SQL entirely.
The problem is that SQL isn't really capable of expressing set intersections. Suppose we merely want to get the set of records that have our input set of selected categories. We'd have to do something like this:
SELECT DISTINCT record_id FROM Facets WHERE
record_id in (SELECT record_id FROM Facets WHERE category_id=1)
AND record_id in (SELECT record_id FROM Facets WHERE category_id=2)
AND ...
No database that I know of will implement the above query efficiently, even though they could -- all they would need to do is intersect the index for category ids 1 and 2. But even if the above query were done efficiently, we're still not done. What we actually want is not a list of records, but a list of categories that would provide further refinements:
SELECT DISTINCT category_id FROM Facets WHERE
record_id in (
SELECT record_id FROM Facets WHERE
record_id in (SELECT record_id FROM Facets WHERE category_id=1)
AND record_id in (SELECT record_id FROM Facets WHERE category_id=2)
AND ...)
And we're still not done, because we want to remove categories that, if we added them to the input category set, wouldn't reduce the size of our record set. That is, we want to remove "redundant" categories. I'm not even going to bother working that into the above SQL query, because it's more likely you'd implement it as a second pass, in order to keep from increasing the exponent on the query processor's big-O processing time.
It's interesting to note that every text search engine is optimized for handling the first SQL query -- text search is all about intersecting sets of documents, where those sets are defined as all the documents that contain a given word. There's a reason that relational databases aren't usually used to implement text search. For example, SQL Server requires you to explicitly state which fields you want indexed for full-text search.
Hmmm... It might be interesting to see if I could abuse SQL Server's full-text search support to get it to support faceted navigation...
Followups to Why Faceted Navigation is Hard:Posted on September 29, 2003 08:31 PM
More projects articles
Can you explain these examples in more detail please? The first query would have been better written as (i.e. is equivalent to):
SELECT DISTINCT record_id FROM Facets WHERE category_id in (1, 2, ...)... while the second query would have been better written as:
SELECT DISTINCT category_id FROM Facets WHERE category_id in (1, 2, ...)
These are so trivial that I assume I must have misunderstood what you were trying to say. It might be easier if you provide table definitions, example data, and example queries with results.
Posted by: Alistair Bayley at September 30, 2003 07:56 AMYour first example is doing a set union instead of a set intersection. That is, it's asking for record ids that have category id 1 or 2 or 3, when what you want is record ids that have category id 1 and 2 and 3.
As for the second query, the purpose is to take the records returned by the first query, and use it to get a larger set of category ids (using set union). Your version kind of short-circuited it, so that you'd end up with the same set of category ids as you started with.
To use the states example, suppose our input category set is {adjoins-MA}. Then we'd expect the first SQL query to return {NH, VT, CT, RI, NY}. Now for the second query, we want to get the set of all categories that belong to NH or VT or CT or RI or NY:
SELECT DISTINCT category_id FROM Facets WHERE
record_id in ("NH", "VT", "CT", "RI", "NY")
This is the same as the second SQL query, except that in the original article I combined the first and second SQL queries so that I could do everything in a single pass. I suppose that made it harder to read, but then the point was to explain how ill-suited this problem is to the relational model, and breaking the problem into two passes would kind of have been cheating.
Ah, much clearer - thanks for the additional information. That really helps us SQL neophytes follow along a little more closely.
Also, interesting use of logical connectors there! I didn't know that the character entities were so simple; if I had, they would find much more use in my own writing. Thanks for the unintentional tip!
Posted by: Gnomon at September 30, 2003 10:34 AM