Linear Methods of AI

Normal Equation System Solution

Fundamental Properties

Normal equation systems $A^T A x = A^T b$ have special characteristics that distinguish them from ordinary linear systems. Imagine fitting a line to scattered data points: the normal equations provide a mathematical way to find that optimal fit.

For a matrix $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, the normal equation system $A^T A x = A^T b$ always has a solution. More specifically, this system has a unique solution exactly when matrix $A$ has full rank, that is, when $\text{Rank}(A) = n$. Under this condition, the solution can be expressed as $\hat{x} = (A^T A)^{-1} A^T b$.

When matrix $A$ does not have full rank, the solution set of the normal equation system takes the form $\hat{x} + \text{Kernel}(A)$, where $\hat{x}$ is any particular solution of the system.

Why the System is Always Solvable

The fundamental reason why normal equation systems always have a solution lies in the concept of orthogonal projection. The orthogonal projection of vector $b$ onto the column space $\{Ax : x \in \mathbb{R}^n\}$ always exists, and any $\hat{x}$ with $A\hat{x}$ equal to this projection solves the linear least squares problem; such an $\hat{x}$ is automatically a solution of the normal equation system as well.

To understand why the other solutions take the form $\hat{x} + \text{Kernel}(A)$, consider any candidate $\tilde{x}$. Then $\tilde{x}$ is a solution of the normal equation system $A^T A \tilde{x} = A^T b$ if and only if $A^T A(\tilde{x} - \hat{x}) = 0$, which is equivalent to $(\tilde{x} - \hat{x})^T A^T A(\tilde{x} - \hat{x}) = \|A(\tilde{x} - \hat{x})\|_2^2 = 0$, which is further equivalent to $A(\tilde{x} - \hat{x}) = 0$, or in other words $\tilde{x} - \hat{x} \in \text{Kernel}(A)$.
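
For a quick numerical check of this argument, the following NumPy sketch (the rank-deficient matrix is a made-up example) shifts a particular least squares solution by a kernel vector and confirms that the normal equations still hold

```python
import numpy as np

# Made-up rank-deficient example: the second column duplicates the first,
# so Kernel(A) is spanned by (1, -1)^T.
A = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# One particular least squares solution (lstsq returns the minimum-norm one).
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

# Shifting by any kernel vector yields another solution of A^T A x = A^T b.
kernel_vec = np.array([1.0, -1.0])
x_tilde = x_hat + 0.7 * kernel_vec

print(np.allclose(A.T @ A @ x_hat, A.T @ b))    # True
print(np.allclose(A.T @ A @ x_tilde, A.T @ b))  # True
```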

Moore-Penrose Pseudoinverse

For a matrix $A \in \mathbb{R}^{m \times n}$ with $m \geq n$ and $\text{Rank}(A) = n$, we can define the Moore-Penrose pseudoinverse as

$$A^{\dagger} = (A^T A)^{-1} A^T$$

The Moore-Penrose pseudoinverse functions like the "best inverse" of a non-square matrix. It provides an optimal way to "cancel" linear transformations in the least squares context.

The Moore-Penrose pseudoinverse satisfies four Penrose axioms that uniquely determine it

$$A A^{\dagger} A = A$$
$$A^{\dagger} A A^{\dagger} = A^{\dagger}$$
$$(A A^{\dagger})^T = A A^{\dagger}$$
$$(A^{\dagger} A)^T = A^{\dagger} A$$

These four properties determine the pseudoinverse uniquely: if a matrix $B$ satisfies all four axioms, then automatically $B = A^{\dagger}$. The Moore-Penrose pseudoinverse thus functions as the unique solution operator for linear least squares problems.
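
As a minimal sketch (assuming a random full-rank test matrix), one can build $A^{\dagger}$ from the formula above with NumPy and verify the four axioms numerically

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 3))  # random test matrix; full rank in practice

# Pseudoinverse via the explicit formula (valid since Rank(A) = n).
A_dag = np.linalg.inv(A.T @ A) @ A.T

# The four Penrose axioms:
print(np.allclose(A @ A_dag @ A, A))          # A A^dagger A = A
print(np.allclose(A_dag @ A @ A_dag, A_dag))  # A^dagger A A^dagger = A^dagger
print(np.allclose((A @ A_dag).T, A @ A_dag))  # (A A^dagger)^T symmetric
print(np.allclose((A_dag @ A).T, A_dag @ A))  # (A^dagger A)^T symmetric

# Agrees with NumPy's SVD-based pseudoinverse:
print(np.allclose(A_dag, np.linalg.pinv(A)))
```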

Solution Using QR Decomposition

A more numerically stable approach to solving normal equation systems uses the QR decomposition. For a matrix $A \in \mathbb{R}^{m \times n}$ with full rank and $m \geq n$, we can use the (thin) QR decomposition $A = Q_1 R_1$, where $Q_1 \in \mathbb{R}^{m \times n}$ has orthonormal columns and $R_1 \in \mathbb{R}^{n \times n}$ is upper triangular.

With this decomposition, the normal equation system $A^T A x = A^T b$ can be solved through

$$A^T A x = R_1^T Q_1^T Q_1 R_1 x = R_1^T R_1 x = R_1^T Q_1^T b = A^T b$$

using $Q_1^T Q_1 = I$. Since $R_1$ is an invertible upper triangular matrix, the factor $R_1^T$ can be cancelled on both sides, so this equation is equivalent to

$$R_1 x = Q_1^T b$$

This upper triangular system can be solved using back substitution, providing the solution $x$ directly and efficiently.
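
As an illustration, here is a minimal NumPy sketch of the whole procedure; the helper name `lstsq_qr` is ours, not a library routine

```python
import numpy as np

def lstsq_qr(A, b):
    """Solve min ||Ax - b||_2 via the thin QR decomposition.

    Assumes A (m x n, m >= n) has full column rank.
    """
    Q1, R1 = np.linalg.qr(A)  # mode='reduced' (thin QR) is the default
    y = Q1.T @ b

    # Back substitution on the upper triangular system R1 x = Q1^T b.
    n = R1.shape[1]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - R1[i, i + 1:] @ x[i + 1:]) / R1[i, i]
    return x
```

Hand-rolling the back substitution here is purely illustrative; production code would call an optimized triangular solver or a library least squares routine.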

Numerical Example

Let's apply this method to a concrete example. Suppose we have experimental data that we want to fit with a quadratic polynomial. We will use the following data

$$A = \begin{pmatrix} 9 & -3 & 1 \\ 4 & -2 & 1 \\ 1 & -1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \\ 4 & 2 & 1 \\ 9 & 3 & 1 \end{pmatrix}, \quad b = \begin{pmatrix} -2.2 \\ -4.2 \\ -4.2 \\ -1.8 \\ 1.8 \\ 8.2 \\ 15.8 \end{pmatrix}$$

Each row in matrix $A$ has the format $[t_i^2, t_i, 1]$, set up to find the coefficients of the polynomial $y = at^2 + bt + c$ (here $t_i = -3, -2, \ldots, 3$), while vector $b$ contains the corresponding observation values.

QR Decomposition Process

The QR decomposition of matrix $A$ yields

$$Q_1 = \begin{pmatrix} -0.64286 & -0.56695 & 0.16496 \\ -0.28571 & -0.37796 & -0.24744 \\ -0.07143 & -0.18898 & -0.49487 \\ -0.00000 & 0.00000 & -0.57735 \\ -0.07143 & 0.18898 & -0.49487 \\ -0.28571 & 0.37796 & -0.24744 \\ -0.64286 & 0.56695 & 0.16496 \end{pmatrix}$$

$$R_1 = \begin{pmatrix} -14.00000 & -0.00000 & -2.00000 \\ 0.00000 & 5.29150 & 0.00000 \\ 0.00000 & 0.00000 & -1.73205 \end{pmatrix}$$
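
These factors can be reproduced with NumPy as sketched below; note that the signs of individual columns of $Q_1$ and rows of $R_1$ may differ between QR implementations, though the product $Q_1 R_1$ always equals $A$

```python
import numpy as np

t = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
b = np.array([-2.2, -4.2, -4.2, -1.8, 1.8, 8.2, 15.8])

# Design matrix with rows [t_i^2, t_i, 1]: Vandermonde, decreasing powers.
A = np.vander(t, 3)

Q1, R1 = np.linalg.qr(A)  # 'reduced' mode gives the thin QR by default
print(np.round(Q1, 5))
print(np.round(R1, 5))
```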

Solution Steps

First, we compute $Q_1^T b$ to obtain

$$Q_1^T b = \begin{pmatrix} -9.7143 \\ 16.0257 \\ 3.4806 \end{pmatrix}$$

Next, we solve the upper triangular system $R_1 x = Q_1^T b$ using back substitution. Since $R_1$ is upper triangular, we start from the bottom equation.

From the third equation, $-1.73205\, x_3 = 3.4806$, so $x_3 = -2.00952$.

From the second equation, $5.29150\, x_2 = 16.0257$, so $x_2 = 3.02857$.

From the first equation, $-14.00000\, x_1 - 2.00000 \cdot (-2.00952) = -9.7143$, so $x_1 = 0.98095$.

Thus, the complete solution is

$$\hat{x} = \begin{pmatrix} 0.98095 \\ 3.02857 \\ -2.00952 \end{pmatrix}$$
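
To double-check the back substitution, a short sketch (reusing the data from above) solves $R_1 x = Q_1^T b$ and compares against NumPy's own least squares routine

```python
import numpy as np

t = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
b = np.array([-2.2, -4.2, -4.2, -1.8, 1.8, 8.2, 15.8])
A = np.vander(t, 3)

Q1, R1 = np.linalg.qr(A)
x_hat = np.linalg.solve(R1, Q1.T @ b)  # R1 is 3x3 and triangular

print(np.round(x_hat, 5))  # expected: [ 0.98095  3.02857 -2.00952]
print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```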

Fitting Results

Based on the obtained solution, the quadratic polynomial that best fits the data is

$$y = 0.98095\, t^2 + 3.02857\, t - 2.00952$$

The following visualization shows how well this polynomial represents the original data

[Figure: Quadratic Polynomial Fitting. Polynomial curve generated from the normal equation system solution.]
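
A minimal matplotlib sketch can reproduce a plot of this kind; the sampling range and labels are our own choices

```python
import matplotlib.pyplot as plt
import numpy as np

t = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
b = np.array([-2.2, -4.2, -4.2, -1.8, 1.8, 8.2, 15.8])

# Evaluate the fitted quadratic on a fine grid.
ts = np.linspace(-3.5, 3.5, 200)
ys = 0.98095 * ts**2 + 3.02857 * ts - 2.00952

plt.scatter(t, b, label="data")
plt.plot(ts, ys, label="fitted quadratic")
plt.xlabel("t")
plt.ylabel("y")
plt.legend()
plt.show()
```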

Method Comparison

To solve normal equation systems, there are two main approaches, which can be compared in terms of computational cost and numerical stability.

The Cholesky approach involves explicitly forming the matrix $A^T A$ first, then applying a Cholesky decomposition, since this matrix is positive definite. This method requires approximately $n^2 \cdot m + \frac{1}{6}n^3 + O(n^2) + O(m \cdot n)$ arithmetic operations. However, the multiplication and the decomposition can become sources of large error propagation, especially when $m = n$, where $\text{cond}(A^T A) \approx \text{cond}(A)^2$.
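
A sketch of this approach using SciPy's Cholesky helpers, reusing the example data from above

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

t = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
b = np.array([-2.2, -4.2, -4.2, -1.8, 1.8, 8.2, 15.8])
A = np.vander(t, 3)

# Form the normal equations explicitly; A^T A is symmetric positive
# definite because A has full column rank.
G = A.T @ A
c = A.T @ b
x_hat = cho_solve(cho_factor(G), c)
print(np.round(x_hat, 5))  # [ 0.98095  3.02857 -2.00952]
```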

The QR approach, by contrast, solves this problem with better numerical stability at comparable computational cost. The dominant cost is the $n^2 \cdot m$ operations for the QR decomposition, making it comparable to the Cholesky approach. The decisive advantage of QR, however, is that orthogonal transformations do not worsen the conditioning of the problem, unlike the explicit formation of $A^T A$ in the Cholesky method.
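
The squaring of the condition number can be observed directly; for the example matrix above (a well-conditioned case, so both methods behave well here)

```python
import numpy as np

t = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
A = np.vander(t, 3)

# 2-norm condition numbers: cond(A^T A) = cond(A)^2 for full-rank A.
print(np.linalg.cond(A))        # approx. 8.25
print(np.linalg.cond(A.T @ A))  # approx. 68.1 = 8.25**2
```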

The choice of the appropriate method depends on the data characteristics and the level of accuracy required in the specific application.