Spectral clustering rocks!

In machine learning the other day, we learned about spectral clustering. This has got to be one of the most beautiful algorithms I've seen in months. It doesn't require you to come up with any kind of model for how your data was generated; all you need is a similarity measure between any two data points (e.g. exponential fall off). The measure doesn't have to satisfy the triangle inequality (i.e. it doesn't have to be transitive), but it should be symmetric. Here's how the algorithm works:

If you have N data points, then define an NxN matrix M where the value of Mij is the distance from point i to point j. Now take the second eigenvector of this matrix. Partition the data points according to whether the corresponding entry in the second eigenvector is positive or negative. Ta-da! Now you have a partition of the entries into two sets. You can use the eigenvalues (as opposed to the eigenvectors) to determine how strong the partition is. If you want to make sure that your clusters are roughly the same size, then instead of partitioning based on positive/negative, you can sort the values first and then partition around the median value.

If you need more than two clusters, you can either recursively subdivide binary partitions, or you can scatter-plot points from more than one eigenvector and then use a different clustering algorithm on the result. For example, if you need three partitions, then you can plot the second eigenvector against the third eigenvector, and use k-means clustering on the resulting 2D data set. For more clusters, use more eigenvectors.

The reason this algorithm is beautiful is because of the robustness of the result, the fact that it doesn't need an underlying model to generate the points (such as a sum of gaussians), and the intuitive interpretation of what the eigenvectors mean.

Here's the intuition. Assume that you start out with a pile of ants at each data point, and at each time step each ant traverses an edge, to get to some other data point. The probability of an ant choosing an edge is inversely proportional to the edge's length (because ants are lazy). Now if you let this situation go for an infinite amount of time, the first eigenvector will tell you the expected number of ants at each data point. The second eigenvector, meanwhile, will tell you what happens if you run for just short of an infinite amount of time.

The first eigenvector is really useful for some purposes. For example, the Google PageRank algorithm consists of nothing more than computing the first eigenvector -- replace data points with web pages, edges with hyperlinks, and ants with people, and the first eigenvector will tell you how popular each web page is.

For clustering what you want is the second eigenvector. The idea is that if you start off an ant in one cluster, then it should be more likely to wander around inside that cluster for a while before finally making its way over to some other cluster. That is, points in the same cluster should be more connected to each other than to points in other clusters. The second eigenvalue directly captures this notion, in a simple mathematical way.

Here's a picture of the result for a problem that isn't linearly separable (shamelessly ripped off from my teacher's lecture slides):

There are a couple mild caveats to this technique. First, all the theory seems to assume that your clusters are only sparsely connected. So the recommended technique is to to only connect each point with it's k nearest neighbors. I'm not sure what happens if you fill out every entry in the distance matrix. Second, the web of points is generally assumed to be connected. These caveats are in direct opposition to each other -- you have to choose k carefully in order to ensure a sparse, yet connected, matrix. I don't yet have enough of an understanding to know why these two things are needed. It may just be that computation of eigenvectors for non-sparse matrices is more computationally intensive than for sparse matrices.

I wonder what interpretation can be assigned to complex eigenvectors...

Posted on November 15, 2006 07:30 PM
More school articles

Comments

By 2nd eigenvector, do you mean the eigenvector with the second largest associated eigenvale?

I have been looking into a lot of clustering algorithms recently. My favorites are ones that decide the "natural" number of clusters for the data. For example, Growing K-Means and Growing Neural Gas are classics.

Posted by: George Dahl at November 18, 2006 12:39 AM

Yes, I mean the orthonormal eigenvector with the second largest eigenvalue. Or more precisely, I mean the first eigenvector with an eigenvalue less than one.

I should also have mentioned that you need to normalize the matrix so that the sum of each column is one (that is, each column forms a probability distribution over all the edges -- the probability of an ant traversing *some* edge should be 1).

I think the reason people use k nearest neighbors to produce a sparse matrix is because otherwise if one cluster has a lot of points, and your similarity function falls off slowly, then the probability of staying within the smaller cluster is less than the probability of getting sucked into the larger cluster (simply because there are so many edges that will take you to the larger cluster).

Posted by: Kim at November 18, 2006 10:56 AM

For PageRank, the second eigenvalue corresponds to the graph expansion factor (how connected your graph is), and thus tells you how fast your random walk will converge. Neato...

Posted by: leo at November 20, 2006 01:47 AM

Wow! This is elegant! Previously, I'd been looking at using density based clustering algorithms. However, I think this accomplishes a similar feat and is much simpler.

Thanks!

Posted by: Tanton Gibbs at November 20, 2006 01:26 PM

It's the second *smallest* eigenval and associated eigenvector you wanna be looking at, no?

Posted by: Rich at November 30, 2006 02:23 PM

Rich, umm... no. The smaller the eigenvalue, the less of an influence the eigenvector will have on a long random walk. So you really do want the second largest one. Besides, there are an infinite number of zero eigenvalues, so the second smallest isn't exactly well defined.

Posted by: Kim at November 30, 2006 05:31 PM

Kim, Hmm, there's one eigenvalue with value zero. The second smallest is the one that gives the best partition. Higher eigens are useful tho. Check the literature, or for a very simple exposition... wikipedia
"...Shi-Malik algorithm, commonly used for image segmentation. It partitions points into two sets (S1,S2) based on the eigenvector v corresponding to the second-smallest eigenvalue of the Laplacian matrix"

Maybe we were talking at cross purposes.

Posted by: Rich at December 4, 2006 12:52 PM

Rich, it seems you're right, but that doesn't make sense to me. I'm not 100% solid on this material, but by analogy with the steady-state distribution of a markov process, I'd expect the second-largest eigenvalue to be the one to use for clustering. Unfortunately I don't have time to dig into this further at the moment (end of term craziness).

Posted by: Kim at December 10, 2006 09:28 PM

If you liked spectral clustering, you might like "A New Approach to Data Driven Clustering" presented at ICML 2006. Data points are viewed as nodes on a graph, normalized pairwise similarity is used to create a transition probability matrix for a markov random walk. Walk a few steps and learn the structure of your data. No parametric statistical model & the number of clusters can be learned from the data. Similar to spectral techniques of Melia & Shi in that data is viewed as a graph & a random walk defines the clusters.

Posted by: joe at February 16, 2007 08:11 PM

The eigenvalues to take into account are different from case to case because people define the Laplacian in various ways (standard, normalized, creative variations...).

So when you change technique the eigenvalues refer to a different matrix.

Posted by: st r at March 7, 2007 05:53 AM

hey the confusion over second largest and second smallest may arise if you follow two slightly different algos...one by shi and malik and other by kannan,vempala,(http://cs-www.cs.yale.edu/homes/kannan/Papers/specclustering.pdf...see the spectral algorithm II).there's one negative sign involved thats why one uses the opposite of the other...
not sure if i helped much..

Posted by: Rahul at May 24, 2007 01:14 PM

But what do the 3rd, 4th and 5th etc. eigen vectors mean exactly?

Are they then the states the "ants" will be in
shorter and shorter amounts of time?

If this is correct,can you point to some Maths to back this up.

Posted by: rachel at July 21, 2007 02:51 PM

Now, it seems that we all here are investigating the usefulness of the eigvenvectors of the graph laplacian or (Makrov) transition matrix.

How's about the left eigenvector of the transition matrix? Does it tell us anything?
I am gonna test it now and hopefully i can get something useful out of it.

By the way, regarding to the 3rd, 4th,... eigenvectors, i know that they are responsible to slow mixing in the context of erogodic theory.
But i don't know what they do in the language of clustering.

Posted by: Nara at October 31, 2007 11:34 PM

Second eigenvectors (and above) are all useful for clustering. They give progressively smaller (and orthogonal) global structural dissimilarities between nodes in the network. You can think of them as the "second" and "third" dimension of a physical object.

The negative eigenvectors often expose intransitivity between nodes... i.e nodes that share the same neighbors, yet don't share a link between themselves. This is the common feature in bipartite networks. They're useful for certain types of clustering where you're interested in separating the graph on these features.

Posted by: Justin Donaldson at November 8, 2007 04:34 PM

Kim,
Thanks for your post - helped me on the right track for what I wanted to find.

BTW I later came across this article http://www.kyb.mpg.de/publications/attachments/luxburg06_TR_v2_4139[1].pdf

This has great detail about Spectral Clustering including the math aspect of why it does what it does. Recommended reading.

Posted by: Thomas Mathew at April 15, 2008 03:34 PM


Look up Ng and Jordan's article on the subject. That makes the meaning of the higher order eigenvectors very clear.

http://citeseer.ist.psu.edu/ng01spectral.html

Posted by: Ted Dunning at May 9, 2008 08:24 PM
Post a comment









Remember info?




Prove you're human. Type "human":