TIL the normal equation method
\[\ {\theta} = (X^{\intercal} X)^{-1} (X^{\intercal} y) \]
The normal equation is another way to derive the optimal parameters \(\ {\theta} \) for your hypothesis as defined by your cost function. It's an alternative to gradient descent that I learned about a few days ago.
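Here's roughly what that looks like in code - a minimal sketch in NumPy, with a made-up design matrix \(\ X \) whose first column is all ones for the intercept term:

```python
import numpy as np

# Toy training set: m = 4 examples, one feature, plus a column of ones for the intercept
X = np.array([
    [1.0, 2104.0],
    [1.0, 1416.0],
    [1.0, 1534.0],
    [1.0,  852.0],
])
y = np.array([460.0, 232.0, 315.0, 178.0])

# theta = (X^T X)^{-1} X^T y: one calculation, no iterating
theta = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Equivalent but usually numerically nicer: solve the linear system instead of inverting
theta_alt = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)
```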
The normal equation method has a number of advantages over gradient descent:
- It’s faster than gradient descent because you solve for \(\ {\theta} \) in one calculation rather than iterating
- Because it solves for \(\ {\theta} \) in a single step, it doesn’t need feature scaling or mean normalization (which gradient descent uses to keep its iterations efficient)
- Ditto for choosing a learning rate \(\ {\alpha} \) (and potentially getting it wrong) - see the comparison sketch below
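To see the difference, here's a quick comparison sketch on a synthetic problem (assuming NumPy; the data and the learning rate are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y ≈ 4 + 3x plus a little noise, with an intercept column of ones
m = 100
x = rng.uniform(0, 2, m)
X = np.column_stack([np.ones(m), x])
y = 4 + 3 * x + rng.normal(0, 0.1, m)

# Normal equation: one shot, no learning rate, no feature scaling
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: needs a learning rate and many iterations
alpha = 0.1
theta_gd = np.zeros(2)
for _ in range(5000):
    gradient = (X.T @ (X @ theta_gd - y)) / m
    theta_gd -= alpha * gradient

print(theta_normal)  # roughly [4, 3]
print(theta_gd)      # converges to (roughly) the same values
```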
But it has two big downsides:
First, it’s slower than gradient descent when the number of features is large - \(\ n \geq 10,000 \) or so. That’s because inverting a 10,000 x 10,000 matrix is computationally expensive: the cost of inverting an \(\ n \times n \) matrix grows roughly with \(\ n^3 \)!
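You can get a rough feel for that growth with a quick (and very unscientific) timing sketch - the exact numbers will depend entirely on your machine and BLAS library:

```python
import time
import numpy as np

# Time the matrix inversion step for increasing n; the cost grows roughly with n^3
for n in (500, 1000, 2000, 4000):
    A = np.random.rand(n, n) + n * np.eye(n)  # a well-conditioned n x n matrix
    start = time.perf_counter()
    np.linalg.inv(A)
    print(f"n = {n:>4}: {time.perf_counter() - start:.3f}s to invert")
```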
Second, sometimes \(\ X^{\intercal} X \) is non-invertible! This usually happens for one of two reasons:
- You have redundant features that are linearly dependent (e.g. \(\ x_1 \) = size in feet^2, \(\ x_2 \) = size in m^2 ) - see the sketch below
- You have too many features and, in particular, the size of your training set is less than or equal to the number of features you’re trying to use - \(\ m {\leq} n \)
In both cases, delete some features to make the matrix invertible!
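To make the redundant-features case concrete, here's a small sketch (again assuming NumPy, with made-up numbers) showing that a linearly dependent column makes \(\ X^{\intercal} X \) singular, and that deleting it fixes things:

```python
import numpy as np

sqft = np.array([2104.0, 1416.0, 1534.0, 852.0])
sqm = sqft * 0.092903  # same information in different units -> linearly dependent columns

# Design matrix: intercept column of ones plus both redundant features
X = np.column_stack([np.ones_like(sqft), sqft, sqm])
y = np.array([460.0, 232.0, 315.0, 178.0])

gram = X.T @ X
print(np.linalg.matrix_rank(gram))  # rank 2 rather than 3 -> X^T X is singular (non-invertible)

# The fix: delete the redundant feature, then the normal equation works again
X_fixed = X[:, :2]
theta = np.linalg.solve(X_fixed.T @ X_fixed, X_fixed.T @ y)
print(theta)
```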