TIL the normal equation method
\[\ {\theta} = (X^{\intercal} X)^{-1} (X^{\intercal} y) \]
The normal equation is another way to derive the optimal parameters \(\ {\theta} \) for your hypothesis as defined by your cost function. It's an alternative to gradient descent that I learned about a few days ago.
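Here's roughly what that looks like in code - a minimal sketch in NumPy, with a made-up design matrix \(\ X \) whose first column is all ones for the intercept term:

```python
import numpy as np

# Toy training set: m = 4 examples, one feature, plus a column of ones for the intercept
X = np.array([
    [1.0, 2104.0],
    [1.0, 1416.0],
    [1.0, 1534.0],
    [1.0,  852.0],
])
y = np.array([460.0, 232.0, 315.0, 178.0])

# theta = (X^T X)^{-1} X^T y: one calculation, no iterating
theta = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Equivalent but usually numerically nicer: solve the linear system instead of inverting
theta_alt = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)
```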
The normal equation method has a number of advantages over gradient descent:
- It’s faster than gradient descent because you solve for \(\ {\theta} \) in one calculation rather than iterating
- Because it solves for \(\ {\theta} \) in a single step, it doesn’t need feature scaling or mean normalization (which gradient descent uses to keep its iterations efficient)
- Ditto for choosing a learning rate \(\ {\alpha} \) (and potentially getting it wrong) - see the comparison sketch below
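To see the difference, here's a quick comparison sketch on a synthetic problem (assuming NumPy; the data and the learning rate are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y ≈ 4 + 3x plus a little noise, with an intercept column of ones
m = 100
x = rng.uniform(0, 2, m)
X = np.column_stack([np.ones(m), x])
y = 4 + 3 * x + rng.normal(0, 0.1, m)

# Normal equation: one shot, no learning rate, no feature scaling
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: needs a learning rate and many iterations
alpha = 0.1
theta_gd = np.zeros(2)
for _ in range(5000):
    gradient = (X.T @ (X @ theta_gd - y)) / m
    theta_gd -= alpha * gradient

print(theta_normal)  # roughly [4, 3]
print(theta_gd)      # converges to (roughly) the same values
```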
But it has two big downsides:
First, it’s slower than gradient descent when the number of features is large - \(\ n \geq 10,000 \) or so. That’s because inverting a 10,000 x 10,000 matrix is computationally expensive: the cost of inverting an \(\ n \times n \) matrix grows roughly with \(\ n^3 \)!
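You can get a rough feel for that growth with a quick (and very unscientific) timing sketch - the exact numbers will depend entirely on your machine and BLAS library:

```python
import time
import numpy as np

# Time the matrix inversion step for increasing n; the cost grows roughly with n^3
for n in (500, 1000, 2000, 4000):
    A = np.random.rand(n, n) + n * np.eye(n)  # a well-conditioned n x n matrix
    start = time.perf_counter()
    np.linalg.inv(A)
    print(f"n = {n:>4}: {time.perf_counter() - start:.3f}s to invert")
```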
Second, sometimes \(\ X^{\intercal} X \) is non-invertible! This usually happens for one of two reasons:
- You have redundant features that are linearly dependent (e.g. \(\ x_1 \) = size in feet^2, \(\ x_2 \) = size in m^2 ) - see the sketch below
- You have too many features and, in particular, the size of your training set is less than or equal to the number of features you’re trying to use - \(\ m {\leq} n \)
In both cases, delete some features to make the matrix invertible!
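To make the redundant-features case concrete, here's a small sketch (again assuming NumPy, with made-up numbers) showing that a linearly dependent column makes \(\ X^{\intercal} X \) singular, and that deleting it fixes things:

```python
import numpy as np

sqft = np.array([2104.0, 1416.0, 1534.0, 852.0])
sqm = sqft * 0.092903  # same information in different units -> linearly dependent columns

# Design matrix: intercept column of ones plus both redundant features
X = np.column_stack([np.ones_like(sqft), sqft, sqm])
y = np.array([460.0, 232.0, 315.0, 178.0])

gram = X.T @ X
print(np.linalg.matrix_rank(gram))  # rank 2 rather than 3 -> X^T X is singular (non-invertible)

# The fix: delete the redundant feature, then the normal equation works again
X_fixed = X[:, :2]
theta = np.linalg.solve(X_fixed.T @ X_fixed, X_fixed.T @ y)
print(theta)
```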