Why least-squares makes sense

An offhand comment in my machine learning textbook finally gave me a good intuitive understanding of why you square the error terms when fitting a line to a set of data. Previously, the best explanation I'd ever heard was something like "well, by squaring you guarantee that the total error is non-negative", which wasn't sufficient to explain why x^2 is better than, say, x^4 or abs(x).

The reason that makes sense to me is this: by making your total error be the sum of squares, the derivative of your error becomes linear in the parameters you are fitting. Since the derivative is linear, setting it to zero gives a single root (as long as the data isn't degenerate), which corresponds to the global minimum of the error function. So you use a sum of squares because it gives you a well-defined global minimum that is easy to find.
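To make the "linear derivative" point concrete, here is a minimal sketch for a line y = a*x + b. The function name `fit_line` and the sample data are made up for illustration. Setting the two partial derivatives of the squared error to zero gives two linear equations in a and b, which a plain linear solver handles directly.

```python
import numpy as np

def fit_line(x, y):
    """Fit y ~ a*x + b by solving dE/da = 0 and dE/db = 0 (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # E(a, b) = sum_i (a*x_i + b - y_i)^2
    # dE/da = 0  ->  a*sum(x^2) + b*sum(x) = sum(x*y)
    # dE/db = 0  ->  a*sum(x)   + b*n      = sum(y)
    M = np.array([[np.sum(x * x), np.sum(x)],
                  [np.sum(x), n]])
    rhs = np.array([np.sum(x * y), np.sum(y)])
    a, b = np.linalg.solve(M, rhs)  # one linear system, one root
    return a, b

# Illustrative data lying exactly on y = 2x + 1
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

With x^4 or abs(x) as the error, the corresponding equations would not be linear in a and b, so there would be no single linear solve like this.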

It's pretty obvious once you see this, but nobody had ever really explained it to me before.

Posted on September 18, 2006 11:14 PM

Comments

Great explanation, thanks. Oddly enough, I had never heard a good enough explanation for this either.

Posted by: Hans at October 1, 2006 06:35 PM

For any convex functional, every local minimum is a global minimum, so I wouldn't say that's its raison d'etre, although it's certainly part of its appeal!

Here's another bit of intuition you might like. The general setup is that we have a set of n points p_1, p_2, ..., p_n in R^k, and we are looking for the best-fit hyperplane. Take the simplest case of k=1. Then a hyperplane is simply a point q. The least-squares error is

E(q) = sum(i) (p_i - q)^2

We can find an optimum by differentiating wrt q and then solving for zero:

E'(q) = sum(i) -2*(p_i - q)

E'(q) = 0
sum(i) -2*(p_i - q) = 0
sum(i) (p_i - q) = 0
sum(i) p_i = n*q
q = (sum(i) p_i)/n

So the least-squares optimizer is just the mean of the points. This shows that the least-squares optimizer generalizes the mean to higher-dimensional hyperplanes.
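A quick numerical check of this result, as a sketch: the sample points and the brute-force grid of candidate values are arbitrary choices for illustration. The candidate with the smallest squared error should coincide with the mean.

```python
import numpy as np

def squared_error(points, q):
    """E(q) = sum_i (p_i - q)^2 for the k=1 case."""
    return float(np.sum((np.asarray(points, dtype=float) - q) ** 2))

points = [2.0, 3.0, 7.0, 8.0]          # illustrative data
mean = float(np.mean(points))          # 5.0

# Brute-force search over candidate values of q in steps of 0.01
candidates = np.linspace(0.0, 10.0, 1001)
errors = [squared_error(points, q) for q in candidates]
best = float(candidates[int(np.argmin(errors))])

print(mean, best)
```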

Posted by: Per Vognsen at October 15, 2006 10:19 PM

Also, in engineering problems the least-squares method minimizes the error in the sense of POWER! E.g. in the case of an audio project, the power is what you hear, and therefore the true error.

Posted by: ulrik at October 23, 2006 07:52 AM

Here's my 2 cents...

-- From a theoretical standpoint, least squares (LS) is justified because it gives you the maximum likelihood estimate under the assumption that your measurements are corrupted by i.i.d. Gaussian noise. This is nice because in many cases the Gaussian noise assumption is acceptable.

-- From a practical standpoint, LS is used because it is easy. Suppose your model is linear in the unknown parameters, say f(x) = Ax (where A is a known matrix and x is the parameter vector you're interested in), and you pose the problem as minimization of the sum of squares of the error vector (y - Ax), where y are your noise-corrupted measurements. Then the (minimum norm) LS solution is given in closed form by x = pinv(A)y, where pinv denotes the pseudoinverse. In other words, the solution is a known linear function of the measurements!
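As a rough sketch of the closed form x = pinv(A)y using NumPy: the matrix A, the "true" parameters, and the noise level below are invented for illustration. `np.linalg.lstsq` is included only as a cross-check that both routes give the same answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear model: y = A @ x_true + small Gaussian noise
A = np.column_stack([np.linspace(0.0, 1.0, 20), np.ones(20)])
x_true = np.array([2.0, -1.0])
y = A @ x_true + 0.01 * rng.standard_normal(20)

# Closed-form least-squares solution via the pseudoinverse
x_pinv = np.linalg.pinv(A) @ y

# Cross-check against NumPy's least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(x_pinv, x_lstsq)
```

In practice a dedicated solver such as `lstsq` is usually preferred over forming the pseudoinverse explicitly, but the pseudoinverse makes the "linear function of the measurements" point visible.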

Posted by: diego at January 11, 2007 12:08 PM

Statisticians seem almost embarrassed to admit that, among all the reasons for using least-squares, the most important one is the simplification of the calculus behind the fitting process (which you note).

For my money, though, least absolute errors regression makes more sense, and, today, cheap and plentiful computing power makes this option nearly as easy as least-squares.
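One way to see the difference, in the one-dimensional setting from the earlier comment, is the sketch below. The data (with one deliberate outlier) and the brute-force grid search are assumptions for illustration. Minimizing the sum of absolute errors lands on the median, while minimizing the sum of squares lands on the mean, which the outlier drags far from the bulk of the data.

```python
import numpy as np

points = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is a deliberate outlier

def abs_error(q):
    """Sum of absolute errors for a candidate value q."""
    return float(np.sum(np.abs(points - q)))

def sq_error(q):
    """Sum of squared errors for a candidate value q."""
    return float(np.sum((points - q) ** 2))

# Brute-force search over candidate values in steps of 0.01
candidates = np.linspace(0.0, 100.0, 10001)
best_abs = float(candidates[int(np.argmin([abs_error(q) for q in candidates]))])
best_sq = float(candidates[int(np.argmin([sq_error(q) for q in candidates]))])

print(best_abs, np.median(points))  # absolute-error minimizer vs. median
print(best_sq, np.mean(points))     # squared-error minimizer vs. mean
```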

Posted by: Will Dwinnell at October 26, 2007 01:41 AM