Linear Methods of AI

Normal Equation System Solution

Fundamental Properties

Normal equation systems $A^T A x = A^T b$ have special characteristics that distinguish them from ordinary linear systems. Imagine fitting a line to scattered data points: the normal equations provide a mathematical way to find that optimal fit.

For a matrix $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, the normal equation system $A^T A x = A^T b$ always has a solution. More specifically, this system has a unique solution exactly when matrix $A$ has full rank, that is, when $\text{Rank}(A) = n$. Under this condition, the solution can be expressed as $\hat{x} = (A^T A)^{-1} A^T b$.

When matrix $A$ does not have full rank, the solution set of the normal equation system takes the form $\hat{x} + \text{Kernel}(A)$, where $\hat{x}$ is any particular solution of the system.

Why the System is Always Solvable

The fundamental reason why normal equation systems always have a solution lies in the concept of orthogonal projection. The orthogonal projection of vector $b$ onto the column space $\{Ax : x \in \mathbb{R}^n\}$ always exists, and any $\hat{x}$ with $A\hat{x}$ equal to this projection solves the linear least squares problem; such an $\hat{x}$ is automatically a solution of the normal equation system as well.

To understand why the other solutions take the form $\hat{x} + \text{Kernel}(A)$, consider any candidate $\tilde{x}$. Then $\tilde{x}$ is a solution of the normal equation system $A^T A \tilde{x} = A^T b$ if and only if $A^T A(\tilde{x} - \hat{x}) = 0$, which is equivalent to $(\tilde{x} - \hat{x})^T A^T A(\tilde{x} - \hat{x}) = \|A(\tilde{x} - \hat{x})\|_2^2 = 0$, which is further equivalent to $A(\tilde{x} - \hat{x}) = 0$, or in other words $\tilde{x} - \hat{x} \in \text{Kernel}(A)$.
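
For a quick numerical check of this argument, the following NumPy sketch (the rank-deficient matrix is a made-up example) shifts a particular least squares solution by a kernel vector and confirms that the normal equations still hold

```python
import numpy as np

# Made-up rank-deficient example: the second column duplicates the first,
# so Kernel(A) is spanned by (1, -1)^T.
A = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# One particular least squares solution (lstsq returns the minimum-norm one).
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

# Shifting by any kernel vector yields another solution of A^T A x = A^T b.
kernel_vec = np.array([1.0, -1.0])
x_tilde = x_hat + 0.7 * kernel_vec

print(np.allclose(A.T @ A @ x_hat, A.T @ b))    # True
print(np.allclose(A.T @ A @ x_tilde, A.T @ b))  # True
```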

Moore-Penrose Pseudoinverse

For a matrix $A \in \mathbb{R}^{m \times n}$ with $m \geq n$ and $\text{Rank}(A) = n$, we can define the Moore-Penrose pseudoinverse as

$$A^{\dagger} = (A^T A)^{-1} A^T$$

The Moore-Penrose pseudoinverse functions like the "best inverse" of a non-square matrix. It provides an optimal way to "cancel" linear transformations in the least squares context.

The Moore-Penrose pseudoinverse satisfies four Penrose axioms that uniquely determine it

$$A A^{\dagger} A = A$$
$$A^{\dagger} A A^{\dagger} = A^{\dagger}$$
$$(A A^{\dagger})^T = A A^{\dagger}$$
$$(A^{\dagger} A)^T = A^{\dagger} A$$

These four properties determine the pseudoinverse uniquely: if a matrix $B$ satisfies all four axioms, then automatically $B = A^{\dagger}$. The Moore-Penrose pseudoinverse thus functions as the unique solution operator for linear least squares problems.
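
As a minimal sketch (assuming a random full-rank test matrix), one can build $A^{\dagger}$ from the formula above with NumPy and verify the four axioms numerically

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 3))  # random test matrix; full rank in practice

# Pseudoinverse via the explicit formula (valid since Rank(A) = n).
A_dag = np.linalg.inv(A.T @ A) @ A.T

# The four Penrose axioms:
print(np.allclose(A @ A_dag @ A, A))          # A A^dagger A = A
print(np.allclose(A_dag @ A @ A_dag, A_dag))  # A^dagger A A^dagger = A^dagger
print(np.allclose((A @ A_dag).T, A @ A_dag))  # (A A^dagger)^T symmetric
print(np.allclose((A_dag @ A).T, A_dag @ A))  # (A^dagger A)^T symmetric

# Agrees with NumPy's SVD-based pseudoinverse:
print(np.allclose(A_dag, np.linalg.pinv(A)))
```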

Solution Using QR Decomposition

A more numerically stable approach to solving normal equation systems uses the QR decomposition. For a matrix $A \in \mathbb{R}^{m \times n}$ with full rank and $m \geq n$, we can use the (thin) QR decomposition $A = Q_1 R_1$, where $Q_1 \in \mathbb{R}^{m \times n}$ has orthonormal columns and $R_1 \in \mathbb{R}^{n \times n}$ is upper triangular.

With this decomposition, the normal equation system $A^T A x = A^T b$ can be solved through

$$A^T A x = R_1^T Q_1^T Q_1 R_1 x = R_1^T R_1 x = R_1^T Q_1^T b = A^T b$$

using $Q_1^T Q_1 = I$. Since $R_1$ is an invertible upper triangular matrix, the factor $R_1^T$ can be cancelled on both sides, so this equation is equivalent to

$$R_1 x = Q_1^T b$$

This upper triangular system can be solved using back substitution, providing the solution $x$ directly and efficiently.
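
As an illustration, here is a minimal NumPy sketch of the whole procedure; the helper name `lstsq_qr` is ours, not a library routine

```python
import numpy as np

def lstsq_qr(A, b):
    """Solve min ||Ax - b||_2 via the thin QR decomposition.

    Assumes A (m x n, m >= n) has full column rank.
    """
    Q1, R1 = np.linalg.qr(A)  # mode='reduced' (thin QR) is the default
    y = Q1.T @ b

    # Back substitution on the upper triangular system R1 x = Q1^T b.
    n = R1.shape[1]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - R1[i, i + 1:] @ x[i + 1:]) / R1[i, i]
    return x
```

Hand-rolling the back substitution here is purely illustrative; production code would call an optimized triangular solver or a library least squares routine.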

Numerical Example

Let's apply this method to a concrete example. Suppose we have experimental data that we want to fit with a quadratic polynomial. We will use the following data

$$A = \begin{pmatrix} 9 & -3 & 1 \\ 4 & -2 & 1 \\ 1 & -1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \\ 4 & 2 & 1 \\ 9 & 3 & 1 \end{pmatrix}, \quad b = \begin{pmatrix} -2.2 \\ -4.2 \\ -4.2 \\ -1.8 \\ 1.8 \\ 8.2 \\ 15.8 \end{pmatrix}$$

Each row in matrix $A$ has the format $[t_i^2, t_i, 1]$, set up to find the coefficients of the polynomial $y = at^2 + bt + c$ (here $t_i = -3, -2, \ldots, 3$), while vector $b$ contains the corresponding observation values.

QR Decomposition Process

The QR decomposition of matrix $A$ yields

$$Q_1 = \begin{pmatrix} -0.64286 & -0.56695 & 0.16496 \\ -0.28571 & -0.37796 & -0.24744 \\ -0.07143 & -0.18898 & -0.49487 \\ -0.00000 & 0.00000 & -0.57735 \\ -0.07143 & 0.18898 & -0.49487 \\ -0.28571 & 0.37796 & -0.24744 \\ -0.64286 & 0.56695 & 0.16496 \end{pmatrix}$$

$$R_1 = \begin{pmatrix} -14.00000 & -0.00000 & -2.00000 \\ 0.00000 & 5.29150 & 0.00000 \\ 0.00000 & 0.00000 & -1.73205 \end{pmatrix}$$
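
These factors can be reproduced with NumPy as sketched below; note that the signs of individual columns of $Q_1$ and rows of $R_1$ may differ between QR implementations, though the product $Q_1 R_1$ always equals $A$

```python
import numpy as np

t = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
b = np.array([-2.2, -4.2, -4.2, -1.8, 1.8, 8.2, 15.8])

# Design matrix with rows [t_i^2, t_i, 1]: Vandermonde, decreasing powers.
A = np.vander(t, 3)

Q1, R1 = np.linalg.qr(A)  # 'reduced' mode gives the thin QR by default
print(np.round(Q1, 5))
print(np.round(R1, 5))
```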

Solution Steps

First, we compute $Q_1^T b$ to obtain

$$Q_1^T b = \begin{pmatrix} -9.7143 \\ 16.0257 \\ 3.4806 \end{pmatrix}$$

Next, we solve the upper triangular system $R_1 x = Q_1^T b$ using back substitution. Since $R_1$ is upper triangular, we start from the bottom equation.

From the third equation, $-1.73205\, x_3 = 3.4806$, so $x_3 = -2.00952$.

From the second equation, $5.29150\, x_2 = 16.0257$, so $x_2 = 3.02857$.

From the first equation, $-14.00000\, x_1 - 2.00000 \cdot (-2.00952) = -9.7143$, so $x_1 = 0.98095$.

Thus, the complete solution is

$$\hat{x} = \begin{pmatrix} 0.98095 \\ 3.02857 \\ -2.00952 \end{pmatrix}$$
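
To double-check the back substitution, a short sketch (reusing the data from above) solves $R_1 x = Q_1^T b$ and compares against NumPy's own least squares routine

```python
import numpy as np

t = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
b = np.array([-2.2, -4.2, -4.2, -1.8, 1.8, 8.2, 15.8])
A = np.vander(t, 3)

Q1, R1 = np.linalg.qr(A)
x_hat = np.linalg.solve(R1, Q1.T @ b)  # R1 is 3x3 and triangular

print(np.round(x_hat, 5))  # expected: [ 0.98095  3.02857 -2.00952]
print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```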

Fitting Results

Based on the obtained solution, the quadratic polynomial that best fits the data is

$$y = 0.98095\, t^2 + 3.02857\, t - 2.00952$$

The following visualization shows how well this polynomial represents the original data

[Figure: Quadratic Polynomial Fitting. Polynomial curve generated from the normal equation system solution.]
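
A minimal matplotlib sketch can reproduce a plot of this kind; the sampling range and labels are our own choices

```python
import matplotlib.pyplot as plt
import numpy as np

t = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
b = np.array([-2.2, -4.2, -4.2, -1.8, 1.8, 8.2, 15.8])

# Evaluate the fitted quadratic on a fine grid.
ts = np.linspace(-3.5, 3.5, 200)
ys = 0.98095 * ts**2 + 3.02857 * ts - 2.00952

plt.scatter(t, b, label="data")
plt.plot(ts, ys, label="fitted quadratic")
plt.xlabel("t")
plt.ylabel("y")
plt.legend()
plt.show()
```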

Method Comparison

To solve normal equation systems, there are two main approaches, which can be compared in terms of computational cost and numerical stability.

The Cholesky approach involves explicitly forming the matrix $A^T A$ first, then applying a Cholesky decomposition, since this matrix is positive definite. This method requires approximately $n^2 \cdot m + \frac{1}{6}n^3 + O(n^2) + O(m \cdot n)$ arithmetic operations. However, the multiplication and the decomposition can become sources of large error propagation, especially when $m = n$, where $\text{cond}(A^T A) \approx \text{cond}(A)^2$.
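
A sketch of this approach using SciPy's Cholesky helpers, reusing the example data from above

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

t = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
b = np.array([-2.2, -4.2, -4.2, -1.8, 1.8, 8.2, 15.8])
A = np.vander(t, 3)

# Form the normal equations explicitly; A^T A is symmetric positive
# definite because A has full column rank.
G = A.T @ A
c = A.T @ b
x_hat = cho_solve(cho_factor(G), c)
print(np.round(x_hat, 5))  # [ 0.98095  3.02857 -2.00952]
```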

The QR approach, by contrast, solves this problem with better numerical stability at comparable computational cost. The dominant cost is the $n^2 \cdot m$ operations for the QR decomposition, making it comparable to the Cholesky approach. The decisive advantage of QR, however, is that orthogonal transformations do not worsen the conditioning of the problem, unlike the explicit formation of $A^T A$ in the Cholesky method.
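
The squaring of the condition number can be observed directly; for the example matrix above (a well-conditioned case, so both methods behave well here)

```python
import numpy as np

t = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
A = np.vander(t, 3)

# 2-norm condition numbers: cond(A^T A) = cond(A)^2 for full-rank A.
print(np.linalg.cond(A))        # approx. 8.25
print(np.linalg.cond(A.T @ A))  # approx. 68.1 = 8.25**2
```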

The choice of the appropriate method depends on the data characteristics and the level of accuracy required in the specific application.