# TIL the normal equation method

$$\ {\theta} = (X^{\intercal} X)^{-1} (X^{\intercal} y)$$

The normal equation is a formula you can use to solve directly for the optimal parameters $$\ {\theta}$$ that minimize your cost function. It’s an alternative to gradient descent that I learned about a few days ago.
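Here’s a minimal sketch of what that looks like in code. NumPy is my choice for illustration; the original post doesn’t include code. It builds a design matrix with a column of ones for the intercept term and solves for $$\ {\theta}$$ in one step:

```python
import numpy as np

# Toy training set: m = 4 examples of one feature, generated
# from y = 2 + 3x so we can check the answer by eye.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix: a column of ones for the intercept term theta_0,
# then the feature column.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the linear system X^T X theta = X^T y is equivalent to
# applying the inverse explicitly, but faster and more stable.
theta = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)  # -> [2. 3.], i.e. theta_0 = 2, theta_1 = 3
```

Note the `np.linalg.solve` call: in practice you solve the linear system rather than computing the inverse directly, even though the formula is written with an inverse.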

The normal equation method has a number of advantages over gradient descent:

- It’s faster than gradient descent for a modest number of features, because you solve for $$\ {\theta}$$ in one calculation instead of iterating
- Because it solves for $$\ {\theta}$$ in a single step, it doesn’t need the feature scaling or mean normalization that keeps gradient descent’s iterations efficient (see the sketch after this list)
- There’s no learning rate $$\ {\alpha}$$ to choose (and potentially get wrong)
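To illustrate the no-scaling point, here’s a sketch (again NumPy; the feature names and numbers are made up for illustration): two features on wildly different scales, and the normal equation still recovers the true coefficients with no normalization step:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100

# Two features on very different scales -- think square footage
# (~10^3) vs. number of rooms (~10^0).
sqft = rng.uniform(500.0, 4000.0, size=m)
rooms = rng.integers(1, 6, size=m).astype(float)

# Noise-free target so the recovered theta should match exactly:
# y = 50 + 0.3 * sqft + 20 * rooms
y = 50 + 0.3 * sqft + 20 * rooms

X = np.column_stack([np.ones(m), sqft, rooms])

# One shot: no feature scaling, no mean normalization, no learning rate.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # -> approximately [50, 0.3, 20]
```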

But it has two big downsides:

First, it’s slower than gradient descent when the number of features is large, say $$\ n \geq 10{,}000$$. That’s because inverting a 10,000 × 10,000 matrix is computationally expensive!
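As a rough operation count (my own back-of-the-envelope, not from the original post): forming $$\ X^{\intercal} X$$ costs about $$\ mn^2$$ operations and inverting the $$\ n \times n$$ result about $$\ n^3$$, while $$\ k$$ iterations of batch gradient descent cost about $$\ kmn$$:

$$\ O(mn^2 + n^3) \text{ (normal equation)} \quad \text{vs.} \quad O(kmn) \text{ (gradient descent)}$$

So as $$\ n$$ grows, the $$\ n^3$$ term eventually dominates and iterating wins.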

Second, sometimes $$\ X^{\intercal} X$$ is non-invertible! This usually happens for one of two reasons:

1. You have redundant features that are linearly dependent (e.g. $$\ x_1$$ = size in feet², $$\ x_2$$ = size in m²; one column is always a constant multiple of the other, as demonstrated in the sketch below)
2. You have too many features: specifically, the size of your training set is less than or equal to the number of features you’re trying to use, $$\ m {\leq} n$$

In both cases, delete some features to make the matrix invertible!
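To make the redundant-feature case concrete, here’s a minimal NumPy sketch (my own illustration; the numbers are made up). When $$\ x_2$$ is a constant multiple of $$\ x_1$$, $$\ X^{\intercal} X$$ loses rank and has no inverse; deleting the redundant column fixes it:

```python
import numpy as np

sqft = np.array([800.0, 1200.0, 1500.0, 2000.0, 2600.0])
sqm = sqft * 0.092903  # 1 ft^2 = 0.092903 m^2: a constant multiple, hence redundant
m = len(sqft)

# Design matrix: intercept column, size in ft^2, size in m^2.
X = np.column_stack([np.ones(m), sqft, sqm])

# X^T X is 3x3 but only rank 2, so it has no inverse.
print(np.linalg.matrix_rank(X.T @ X))  # -> 2

# Fix: delete the redundant feature (drop the m^2 column) and solve as usual.
y = 50 + 0.3 * sqft
X_fixed = X[:, :2]
theta = np.linalg.solve(X_fixed.T @ X_fixed, X_fixed.T @ y)
print(theta)  # -> approximately [50, 0.3]
```

The $$\ m {\leq} n$$ case breaks down the same way: with fewer training examples than columns, $$\ X^{\intercal} X$$ can’t reach full rank.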