Derivation of the Least Squares Solution

This notebook derives the ordinary least squares (OLS) solution $\hat{x} = (A^\top A)^{-1} A^\top b$ from first principles using matrix algebra.

1. Problem Setup

We want to solve the linear system:

$$A x = b$$

where:

$A \in \mathbb{R}^{m \times n}$ is a known matrix (typically $m > n$, more equations than unknowns)

$x \in \mathbb{R}^n$ is the unknown vector we want to find

$b \in \mathbb{R}^m$ is a known vector

When $m > n$, this system is overdetermined — there are more constraints than degrees of freedom, so an exact solution generally does not exist. Instead, we seek the $\hat{x}$ that comes closest to satisfying all equations simultaneously.

2. Defining the Residual

For any candidate solution $x$, define the residual:
$$r = b – Ax$$

This vector measures how far $Ax$ is from $b$. We want to minimize the sum of squared residuals, i.e., the squared Euclidean norm of $r$:

$$\min_{x} \; |r|^2 = \min_{x} \; |b – Ax|^2$$

Expanding using the inner product definition $|v|^2 = v^\top v$:

$$|b – Ax|^2 = (b – Ax)^\top (b – Ax)$$

3. Expanding the Objective

Expand $(b – Ax)^\top(b – Ax)$:

$$
|b – Ax|^2 = b^\top b – b^\top A x – x^\top A^\top b + x^\top A^\top A x
$$

Since $b^\top A x$ is a scalar, it equals its own transpose:
$$b^\top A x = (b^\top A x)^\top = x^\top A^\top b$$

So the two middle terms are equal, and we can write:

$$\boxed{|b – Ax|^2 = b^\top b – 2x^\top A^\top b + x^\top A^\top A x}$$

Call this objective function $f(x)$. It is a quadratic function of $x$, and since $A^\top A$ is positive semi-definite, $f(x)$ is convex — any local minimum is a global minimum.

4. Taking the Gradient

To minimize $f(x)$, we take the gradient with respect to $x$ and set it to zero.

Using standard matrix calculus identities:

$\nabla_x (c^\top x) = c$
$\nabla_x (x^\top C x) = 2Cx$ when $C$ is symmetric

Note that $A^\top A$ is symmetric by construction: $(A^\top A)^\top = A^\top A$.

Applying these identities to $f(x)$:

$$\nabla_x f(x) = 0 – 2A^\top b + 2A^\top A x$$

Setting $\nabla_x f(x) = 0$:

$$2A^\top A x – 2A^\top b = 0$$

$$A^\top A x = A^\top b$$

These are called the Normal Equations.

5. The Normal Equations

$$A^\top A \hat{x} = A^\top b$$

Geometric interpretation: The normal equations enforce that the residual $r = b – A\hat{x}$ is orthogonal to every column of $A$:

$$A^\top (b – A\hat{x}) = 0 \iff A^\top r = 0$$

In other words, $\hat{x}$ is the solution for which $A\hat{x}$ is the orthogonal projection of $b$ onto the column space of $A$. The residual points in a direction that $A$ cannot represent — it is orthogonal to $\text{col}(A)$.

6. Solving for $\hat{x}$

If $A$ has full column rank (rank $= n$), then $A^\top A$ is symmetric positive definite and therefore invertible. We can solve the normal equations directly:

$$\hat{x} = (A^\top A)^{-1} A^\top b$$

This is the Ordinary Least Squares (OLS) estimator.

The matrix $A^\dagger = (A^\top A)^{-1} A^\top$ is called the Moore-Penrose pseudoinverse of $A$ (in the full column rank case).

7. The Projection Matrix

The fitted values $\hat{b} = A\hat{x}$ are the projection of $b$ onto $\text{col}(A)$:

$$\hat{b} = A(A^\top A)^{-1} A^\top b = P b$$

where $P = A(A^\top A)^{-1} A^\top$ is the orthogonal projection matrix (also called the hat matrix).

$P$ has two key properties that confirm it is a valid orthogonal projector:

Property	Statement	Meaning
Idempotent	$P^2 = P$	Projecting twice is the same as projecting once
Symmetric	$P^\top = P$	Projection is orthogonal (not oblique)

Verification of idempotency:
$$P^2 = A(A^\top A)^{-1} A^\top \cdot A(A^\top A)^{-1} A^\top = A(A^\top A)^{-1} (A^\top A)(A^\top A)^{-1} A^\top = A(A^\top A)^{-1} A^\top = P \checkmark$$

8. Confirming the Minimum (Second-Order Condition)

The Hessian of $f(x)$ with respect to $x$ is:

$$H = \nabla^2_x f(x) = 2A^\top A$$

For any nonzero vector $v$:
$$v^\top (A^\top A) v = (Av)^\top (Av) = |Av|^2 \geq 0$$

So $A^\top A$ is positive semi-definite, and strictly positive definite when $A$ has full column rank (since $Av = 0 \Rightarrow v = 0$). Therefore $H \succ 0$, confirming that the stationary point $\hat{x}$ is a strict global minimum.

Summary

Step	Key result
Objective	Minimize $\|b – Ax\|^2$
Expand	$f(x) = b^\top b – 2x^\top A^\top b + x^\top A^\top A x$
Gradient	$\nabla_x f = 2A^\top A x – 2A^\top b$
Normal equations	$A^\top A \hat{x} = A^\top b$
OLS solution	$\hat{x} = (A^\top A)^{-1} A^\top b$ (when $A$ is full column rank)
Geometric meaning	$A\hat{x}$ is the orthogonal projection of $b$ onto $\text{col}(A)$

The full column rank condition on $A$ is the precise requirement that makes the least squares solution unique — consistent with the earlier discussion of full rank implications.