This notebook derives the ordinary least squares (OLS) solution $\hat{x} = (A^\top A)^{-1} A^\top b$ from first principles using matrix algebra.
1. Problem Setup
We want to solve the linear system:
$$A x = b$$
where:
$A \in \mathbb{R}^{m \times n}$ is a known matrix (typically $m > n$, more equations than unknowns)
$x \in \mathbb{R}^n$ is the unknown vector we want to find
$b \in \mathbb{R}^m$ is a known vector
When $m > n$, this system is overdetermined — there are more constraints than degrees of freedom, so an exact solution generally does not exist. Instead, we seek the $\hat{x}$ that comes closest to satisfying all equations simultaneously.
2. Defining the Residual
For any candidate solution $x$, define the residual:
$$r = b – Ax$$
This vector measures how far $Ax$ is from $b$. We want to minimize the sum of squared residuals, i.e., the squared Euclidean norm of $r$:
$$\min_{x} \; |r|^2 = \min_{x} \; |b – Ax|^2$$
Expanding using the inner product definition $|v|^2 = v^\top v$:
$$|b – Ax|^2 = (b – Ax)^\top (b – Ax)$$
3. Expanding the Objective
Expand $(b – Ax)^\top(b – Ax)$:
$$
|b – Ax|^2 = b^\top b – b^\top A x – x^\top A^\top b + x^\top A^\top A x
$$
Since $b^\top A x$ is a scalar, it equals its own transpose:
$$b^\top A x = (b^\top A x)^\top = x^\top A^\top b$$
So the two middle terms are equal, and we can write:
$$\boxed{|b – Ax|^2 = b^\top b – 2x^\top A^\top b + x^\top A^\top A x}$$
Call this objective function $f(x)$. It is a quadratic function of $x$, and since $A^\top A$ is positive semi-definite, $f(x)$ is convex — any local minimum is a global minimum.
4. Taking the Gradient
To minimize $f(x)$, we take the gradient with respect to $x$ and set it to zero.
Using standard matrix calculus identities:
- $\nabla_x (c^\top x) = c$
- $\nabla_x (x^\top C x) = 2Cx$ when $C$ is symmetric
Note that $A^\top A$ is symmetric by construction: $(A^\top A)^\top = A^\top A$.
Applying these identities to $f(x)$:
$$\nabla_x f(x) = 0 – 2A^\top b + 2A^\top A x$$
Setting $\nabla_x f(x) = 0$:
$$2A^\top A x – 2A^\top b = 0$$
$$A^\top A x = A^\top b$$
These are called the Normal Equations.
5. The Normal Equations
$$A^\top A \hat{x} = A^\top b$$
Geometric interpretation: The normal equations enforce that the residual $r = b – A\hat{x}$ is orthogonal to every column of $A$:
$$A^\top (b – A\hat{x}) = 0 \iff A^\top r = 0$$
In other words, $\hat{x}$ is the solution for which $A\hat{x}$ is the orthogonal projection of $b$ onto the column space of $A$. The residual points in a direction that $A$ cannot represent — it is orthogonal to $\text{col}(A)$.
6. Solving for $\hat{x}$
If $A$ has full column rank (rank $= n$), then $A^\top A$ is symmetric positive definite and therefore invertible. We can solve the normal equations directly:
$$\hat{x} = (A^\top A)^{-1} A^\top b$$
This is the Ordinary Least Squares (OLS) estimator.
The matrix $A^\dagger = (A^\top A)^{-1} A^\top$ is called the Moore-Penrose pseudoinverse of $A$ (in the full column rank case).
7. The Projection Matrix
The fitted values $\hat{b} = A\hat{x}$ are the projection of $b$ onto $\text{col}(A)$:
$$\hat{b} = A(A^\top A)^{-1} A^\top b = P b$$
where $P = A(A^\top A)^{-1} A^\top$ is the orthogonal projection matrix (also called the hat matrix).
$P$ has two key properties that confirm it is a valid orthogonal projector:
| Property | Statement | Meaning |
|---|---|---|
| Idempotent | $P^2 = P$ | Projecting twice is the same as projecting once |
| Symmetric | $P^\top = P$ | Projection is orthogonal (not oblique) |
Verification of idempotency:
$$P^2 = A(A^\top A)^{-1} A^\top \cdot A(A^\top A)^{-1} A^\top = A(A^\top A)^{-1} (A^\top A)(A^\top A)^{-1} A^\top = A(A^\top A)^{-1} A^\top = P \checkmark$$
8. Confirming the Minimum (Second-Order Condition)
The Hessian of $f(x)$ with respect to $x$ is:
$$H = \nabla^2_x f(x) = 2A^\top A$$
For any nonzero vector $v$:
$$v^\top (A^\top A) v = (Av)^\top (Av) = |Av|^2 \geq 0$$
So $A^\top A$ is positive semi-definite, and strictly positive definite when $A$ has full column rank (since $Av = 0 \Rightarrow v = 0$). Therefore $H \succ 0$, confirming that the stationary point $\hat{x}$ is a strict global minimum.
Summary
| Step | Key result |
|---|---|
| Objective | Minimize $|b – Ax|^2$ |
| Expand | $f(x) = b^\top b – 2x^\top A^\top b + x^\top A^\top A x$ |
| Gradient | $\nabla_x f = 2A^\top A x – 2A^\top b$ |
| Normal equations | $A^\top A \hat{x} = A^\top b$ |
| OLS solution | $\hat{x} = (A^\top A)^{-1} A^\top b$ (when $A$ is full column rank) |
| Geometric meaning | $A\hat{x}$ is the orthogonal projection of $b$ onto $\text{col}(A)$ |
The full column rank condition on $A$ is the precise requirement that makes the least squares solution unique — consistent with the earlier discussion of full rank implications.