Gauss-Newton applied to p-norms

Alec Jacobson

July 30, 2022


The Gauss-Newton method is classically applied to non-linear least squares minimization problems of the form

minimize  ‖ f(x) ‖²
   x      ^---.---^
             E(x)

where f(x) : ℝⁿ → ℝᵐ. The method defines an update rule of

x ← x₀ - (JᵀJ)⁻¹ Jᵀ f(x₀)

where J = ∂f/∂x @ x₀ ∈ ℝᵐˣⁿ. We can recognize Jᵀ f(x) as the gradient ∂E/∂x of the minimization objective and JᵀJ ≈ ∂²E/∂x² @ x₀ as its Hessian, where we ignore terms involving ∂²f/∂x² (both up to the same constant factor of 2, which cancels in the update). For descent this is convenient as JᵀJ is symmetric positive semi-definite by construction (and positive definite whenever J has full column rank).
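For concreteness, here is a minimal numpy sketch of a single such step, assuming the caller supplies callables f and J for the residual vector and its Jacobian (the names are just for illustration):

    import numpy as np

    def gauss_newton_step(f, J, x0):
        """One Gauss-Newton step: x0 - (JᵀJ)⁻¹ Jᵀ f(x0)."""
        Jx = J(x0)                                   # m×n Jacobian evaluated at x0
        r = f(x0)                                    # m residuals evaluated at x0
        # Solve the normal equations (JᵀJ) Δ = Jᵀ r rather than forming the inverse.
        delta = np.linalg.solve(Jx.T @ Jx, Jx.T @ r)
        return x0 - delta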

When f(x) is linear (affine) in x, this method degenerates into the usual normal equations:

minimize  ‖ A x - b ‖²
↓
x ← x₀ - (AᵀA)⁻¹ Aᵀ (A x₀ - b)
  = (AᵀA)⁻¹ Aᵀ b

That is, regardless of the initial guess we immediately jump to the solution.
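A quick toy check of that one-step behavior in numpy (random A and b, purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 3))
    b = rng.standard_normal(20)
    x0 = rng.standard_normal(3)                  # arbitrary initial guess

    # One Gauss-Newton step on ‖Ax - b‖² ...
    x_gn = x0 - np.linalg.solve(A.T @ A, A.T @ (A @ x0 - b))
    # ... lands on the least-squares solution, independent of x0.
    x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
    print(np.allclose(x_gn, x_ls))               # True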

Contrast this to gradient descent (where the Hessian is approximated with the identity):

minimize  ‖ A x - b ‖²
↓
x ← x₀ - Aᵀ (A x₀ - b)

We can see that this depends on the initial guess x₀ and in general we might bounce around for many iterations before converging.
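Here's the same kind of toy problem under gradient descent (the fixed step size below is just one conservative choice):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 3))
    b = rng.standard_normal(20)
    x = rng.standard_normal(3)                   # the initial guess matters here

    step = 1.0 / np.linalg.norm(A, 2)**2         # ≤ 1/λ_max(AᵀA) keeps the iteration stable
    for it in range(20):
        x = x - step * (A.T @ (A @ x - b))
    # After a modest number of iterations the gradient is smaller but not yet ~0,
    # i.e., we are still approaching (AᵀA)⁻¹Aᵀb.
    print(np.linalg.norm(A.T @ (A @ x - b)))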

So, what if instead of ‖ A x - b ‖² we had a different norm? Like ‖ A x - b ‖¹?

In this case, gradient descent would step along:

minimize  ‖ A x - b ‖¹
↓
x ← x₀ - Aᵀ sign(A x₀ - b)

Meanwhile, to apply Gauss-Newton we need to massage the objective function a bit:

‖ A x - b ‖¹
‖ | A x - b |¹ᐟ² ‖²
  ^-----.------^
       f(x)

where | A x - b |¹ᐟ² applies the element-wise absolute value and square-root.

With a definition of f(x) we can start differentiating:

J = ½ diag( |A x - b|⁻¹ᐟ² ⊙ sign(A x - b) ) A
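A quick finite-difference spot-check of this Jacobian (illustrative only; it assumes no entry of A x - b is near zero, where the square root isn't differentiable):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((10, 3))
    b = rng.standard_normal(10)
    x = rng.standard_normal(3)

    f = lambda x: np.sqrt(np.abs(A @ x - b))               # f(x) = |Ax - b|^(1/2), element-wise
    r = A @ x - b
    J = 0.5 * np.diag(np.abs(r)**-0.5 * np.sign(r)) @ A    # the formula above

    # Central finite differences, column by column.
    h = 1e-6
    J_fd = np.column_stack([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(3)])
    print(np.allclose(J, J_fd, atol=1e-5))                 # expect True away from Ax = b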

Plugging this into the Gauss-Newton update rule, some cancellations start to appear:

x ← x₀ - (JᵀJ)⁻¹ Jᵀ f(x₀) = x₀ - (¼ Aᵀ diag( |A x₀ - b|⁻¹ ) A )⁻¹ Aᵀ (½ |A x₀ - b|⁻¹ᐟ² ⊙ sign(A x₀ - b) ⊙ |A x₀ - b|¹ᐟ²) 
  = x₀ - 2 (Aᵀ diag( |A x₀ - b|⁻¹ ) A )⁻¹ Aᵀ sign(A x₀ - b)
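As a naive numpy sketch (it inherits the blow-up discussed next):

    import numpy as np

    def gauss_newton_step_l1(A, b, x0):
        """One Gauss-Newton step on ‖Ax - b‖₁ written as ‖|Ax - b|^(1/2)‖²."""
        r = A @ x0 - b
        W = np.abs(r)**-1                        # diag(|Ax₀ - b|⁻¹) weights; blows up as r → 0
        H = A.T @ (W[:, None] * A)               # Aᵀ diag(W) A
        g = A.T @ np.sign(r)
        return x0 - 2.0 * np.linalg.solve(H, g)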

Curiously, this suggests that Gauss-Newton approximates the Hessian as

∂²E/∂x² = ? ≈ Aᵀ diag( |A x - b|⁻¹ ) A 

even though the true Hessian is exactly the zero matrix (almost) everywhere. I'm guessing that the diag( |A x₀ - b|⁻¹ ) term will behave badly numerically (at least if implemented naively) because it blows up as entries of A x - b → 0. Maybe there's a way to construct the action of (Aᵀ diag( |A x₀ - b|⁻¹ ) A )⁻¹ in a numerically safe way?

Meanwhile, I'm curious: does this Hessian approximation actually contribute anything as a preconditioner?

For the sake of completeness, suppose the power is an arbitrary value p; then this Gauss-Newton update generalizes to:

minimize  ‖ A x - b ‖ᵖ
minimize  ‖ |A x - b|ᵖᐟ² ‖²
↓
x ← x₀ - 2/p (Aᵀ diag( |A x₀ - b|ᵖ⁻² ) A )⁻¹ Aᵀ ( sign(A x₀ - b) ⊙ |A x₀ - b|ᵖ⁻¹ )

We can double check for p=2:

x ← x₀ - 2/2 (Aᵀ diag( |A x₀ - b|²⁻² ) A )⁻¹ Aᵀ ( sign(A x₀ - b) ⊙ |A x₀ - b|²⁻¹ )
x ← x₀ - (Aᵀ A )⁻¹ Aᵀ ( sign(A x₀ - b) ⊙ |A x₀ - b|¹ )
x ← x₀ - (Aᵀ A )⁻¹ Aᵀ (A x₀ - b)
x ← (Aᵀ A )⁻¹ Aᵀ b
✓
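A direct transcription of the general update as a numpy sketch (illustrative; for p < 2 it has the same numerical issue near A x = b noted above):

    import numpy as np

    def gauss_newton_step_lp(A, b, x0, p):
        """One Gauss-Newton step on ‖Ax - b‖ₚ via ‖|Ax - b|^(p/2)‖²."""
        r = A @ x0 - b
        W = np.abs(r)**(p - 2)                   # diag(|Ax₀ - b|^(p-2)) weights
        H = A.T @ (W[:, None] * A)               # Aᵀ diag(W) A
        g = A.T @ (np.sign(r) * np.abs(r)**(p - 1))
        return x0 - (2.0 / p) * np.linalg.solve(H, g)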

As pointed out by Rahul in the comments, the Gauss-Newton update rule for Lᵖ norm linear regression looks really similar to iteratively reweighted least squares. In IRLS, a p-norm problem is solved by successively converting it to a 2-norm problem based on the current guess. Each term in the 2-norm is weighted accordingly:

minimize  ‖ A x - b ‖ᵖ
   x
   ↓
x ← argmin ‖ √w(x₀) ⊙ (A x - b) ‖²
       x

with

w(x₀) = |A x₀ - b|ᵖ⁻²

So then the update rule can be expanded as

x ← (Aᵀ diag( |A x₀ - b|ᵖ⁻² ) A )⁻¹ Aᵀ diag( |A x₀ - b|ᵖ⁻² ) b
    ^---------------------------.----------------------------^
                              x_IRLS
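In code, one IRLS update is just a weighted least-squares solve (a sketch in the same style as above):

    import numpy as np

    def irls_step(A, b, x0, p):
        """One IRLS update for ‖Ax - b‖ₚ: a least-squares solve reweighted at x0."""
        w = np.abs(A @ x0 - b)**(p - 2)          # w(x₀) = |Ax₀ - b|^(p-2)
        AtW = A.T * w                            # Aᵀ diag(w), via broadcasting
        return np.linalg.solve(AtW @ A, AtW @ b)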

We can play with the terms in the p-norm Gauss-Newton update rule above to arrive at nearly the same thing:

x ← x₀ - 2/p (Aᵀ diag( |A x₀ - b|ᵖ⁻² ) A )⁻¹ Aᵀ ( sign(A x₀ - b) ⊙ |A x₀ - b|ᵖ⁻¹ )
x ← x₀ - 2/p (Aᵀ diag( |A x₀ - b|ᵖ⁻² ) A )⁻¹ Aᵀ diag( |A x₀ - b|ᵖ⁻² ) (A x₀ - b)
x ← (1 - 2/p) x₀ + 2/p (Aᵀ diag( |A x₀ - b|ᵖ⁻² ) A )⁻¹ Aᵀ diag(|A x₀ - b|ᵖ⁻²) b
x ← (1 - 2/p) x₀ + 2/p x_IRLS

So, the only difference from iteratively reweighted least squares is that the raw Gauss-Newton method suggests a step size of 2/p. If you're using a line search, then there's no difference at all, since the direction is the same.
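This relationship is easy to confirm numerically on toy data (a self-contained check; the sizes and p are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((30, 4))
    b = rng.standard_normal(30)
    x0 = rng.standard_normal(4)
    p = 1.5

    r = A @ x0 - b
    W = np.abs(r)**(p - 2)
    H = A.T @ (W[:, None] * A)                   # Aᵀ diag(W) A

    x_irls = np.linalg.solve(H, A.T @ (W * b))
    x_gn = x0 - (2/p) * np.linalg.solve(H, A.T @ (np.sign(r) * np.abs(r)**(p - 1)))

    print(np.allclose(x_gn, (1 - 2/p) * x0 + (2/p) * x_irls))   # True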

In some quick and dirty experiments it seemed like Gauss-Newton was sometimes faster to converge, but also more unstable (often diverging). Both methods benefit from regularizing the diag( |A x₀ - b|ᵖ⁻² ) term to avoid numerical degeneracies. I'm guessing that a line search would really help, but I haven't tried it yet.
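For reference, here's the flavor of quick-and-dirty experiment described above, with the weights clamped away from zero (the eps, iteration count, and problem sizes are arbitrary choices, not tuned recommendations):

    import numpy as np

    def lp_regression(A, b, p, step, iters=100, eps=1e-8):
        """Iterate x ← (1 - step)·x + step·x_IRLS with clamped weights.
        step = 1 is IRLS; step = 2/p is the raw Gauss-Newton step size."""
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            w = np.maximum(np.abs(A @ x - b), eps)**(p - 2)   # regularized |Ax - b|^(p-2)
            AtW = A.T * w                                     # Aᵀ diag(w)
            x_irls = np.linalg.solve(AtW @ A, AtW @ b)
            x = (1 - step) * x + step * x_irls
        return x

    rng = np.random.default_rng(3)
    A = rng.standard_normal((50, 5))
    b = rng.standard_normal(50)
    p = 1.2
    for step in (1.0, 2.0 / p):
        x = lp_regression(A, b, p, step)
        print(step, np.sum(np.abs(A @ x - b)**p))             # compare objective values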