There's a general setup common to many design optimization problems. We'd like to minimize (or equivalently maximize) some measure of goodness \(f\) of a given object \(x\). Meanwhile, the design of that object is determined by a (usually smaller) set of parameters \(p\). If the object is to exist in the physical world then it will likely have to balance a force equilibrium equation \(g(x,p) = 0\).

Written as a constrained optimization problem, we're looking for the design parameters \(p\) that minimize \(f\) while satisfying the equilibrium according to \(g\):

$$ \mathop{\text{minimize}}_p f(x(p)) \\ \text{ subject to } g(x(p),p) = 0.$$

Even if the objective \(f\) is a simple function (e.g., linear, quadratic in \(x\)) and even if the design \(x\) is a simple function (e.g., linear in \(p\)), if the constraint function \(g\) is a non-linear function of the design parameters \(p\) then this is a difficult problem to solve.
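To make this concrete, here's a minimal toy instance (my own example, not from any particular paper): a spring of stiffness \(p\) under a unit load has equilibrium \(g(x,p) = p x - 1 = 0\), so the equilibrium design is \(x(p) = 1/p\), and we'd like \(x\) to hit a target value:

```python
# Toy design problem (illustrative; names are my own):
# equilibrium g(x, p) = p * x - 1 = 0  (spring stiffness p, unit load),
# so the design satisfying equilibrium is x(p) = 1 / p,
# and the objective asks x to match a target 0.5: f(x) = (x - 0.5)**2.

def g(x, p):
    return p * x - 1.0

def x_of_p(p):
    return 1.0 / p

def f(x):
    return (x - 0.5) ** 2

p = 4.0
x = x_of_p(p)
assert abs(g(x, p)) < 1e-12  # feasible by construction
```

Even in this scalar case, \(f(x(p)) = (1/p - 0.5)^2\) is already non-convex in \(p\), hinting at why the general problem is hard.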

One approach to this problem is called *sensitivity analysis*. It's been used a lot in computer graphics by folks like David Levin and Bernhard Thomaszewski. The idea is to start with *feasible* design parameters \(p_0\). That is, parameters that satisfy our constraint function \(g\):

$$ g(x(p_0),p_0) = 0. $$

The goal now is to improve these parameters with respect to minimizing the objective \(f\) while maintaining feasibility (maintaining \(g=0\)). In other words, we'd like our change in parameters \( ∆p \) to keep us at \(g = 0\); that is, we'd like the total derivative of the constraint \(g\) with respect to our parameters \(p\) to be zero (i.e., zero plus no change is still zero):

$$ \frac{d g}{d p} = 0.$$

By the chain rule, this total derivative is composed of the partial derivative of the constraint with respect to the parameters \(p\) plus the partial derivative of the constraint with respect to the design \(x\) times the partial derivative of the design \(x\) with respect to the parameters \(p\):

$$ \frac{d g}{d p} = \frac{∂ g}{∂ p} + \frac{∂ g}{∂ x} \frac{∂ x}{∂ p}. $$
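We can sanity-check this identity on a scalar toy case (my own example): take \(g(x,p) = x - p^2\) with \(x(p) = p^2\) chosen so that \(g(x(p),p) = 0\) for every \(p\); the two chain-rule terms then cancel exactly:

```python
# Scalar check of the chain-rule identity above (toy example of my own):
# g(x, p) = x - p**2, with x(p) = p**2, so g(x(p), p) = 0 for all p
# and the total derivative dg/dp must vanish.
p = 1.3
dg_dp = -2.0 * p   # ∂g/∂p (holding x fixed)
dg_dx = 1.0        # ∂g/∂x
dx_dp = 2.0 * p    # ∂x/∂p
assert abs(dg_dp + dg_dx * dx_dp) < 1e-12  # total derivative is zero
```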

Let's pause for a second and do a dimensionality check. Suppose we have \(m\) parameters (i.e., \(p\in\mathbb{R}^m\)), our design is represented as an \(n\)-vector (i.e., \(x \in \mathbb{R}^n\)), and finally we have \(k\) constraints (i.e., \(g : \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}^k\)). Then the total/partial derivatives of the constraints w.r.t. the parameters are \(k\times m\) matrices (i.e., \( \frac{d g}{d p},\frac{∂ g}{∂ p} \in \mathbb{R}^{k \times m}\)), similarly the partial derivative of the constraints w.r.t. the design is a \(k \times n\) matrix (i.e., \(\frac{∂ g}{∂ x} \in \mathbb{R}^{k \times n}\)), and the partial derivative of the design w.r.t. the parameters is an \(n \times m\) matrix (i.e., \(\frac{∂ x}{∂ p} \in \mathbb{R}^{n \times m}\)). We can verify that our matrix arithmetic above checks out.

We can solve for the change in the design w.r.t. the parameters that keeps this quantity zero:

$$ \frac{∂ g}{∂ p} + \frac{∂ g}{∂ x} \frac{∂ x}{∂ p} = 0 \\ \frac{∂ g}{∂ x} \frac{∂ x}{∂ p} = -\frac{∂ g}{∂ p} \\ \frac{∂ x}{∂ p} = -\left( \frac{∂ g}{∂ x} \right)^{-1} \frac{∂ g}{∂ p}. $$

Remember \(\frac{∂ g}{∂ x}\) is a \(k \times n\) matrix; for its inverse to make sense we need as many constraints as design degrees of freedom (\(k = n\)), which is typical of equilibrium constraints. Numerically we'll construct this matrix, but we won't actually compute its inverse. Instead, any time we read \(A^{-1} B\) we think of this as invoking a *solver* that solves the equation \(A X = B\).
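A small sketch of this in the square \(k = n\) case, using a linear toy equilibrium \(g(x,p) = K x - p\) (my own example), so \(x(p) = K^{-1} p\) and everything can be checked against finite differences:

```python
import numpy as np

# Toy linear equilibrium g(x, p) = K x - p, so x(p) = K⁻¹ p (my example).
K = np.array([[4.0, 1.0],
              [1.0, 3.0]])          # ∂g/∂x
dg_dp = -np.eye(2)                  # ∂g/∂p

# ∂x/∂p = -(∂g/∂x)⁻¹ ∂g/∂p, computed with a solver, never an explicit inverse:
dx_dp = -np.linalg.solve(K, dg_dp)

# Finite-difference check against the forward solve x(p) = solve(K, p):
p = np.array([1.0, 2.0])
h = 1e-6
fd = np.column_stack([
    (np.linalg.solve(K, p + h * e) - np.linalg.solve(K, p)) / h
    for e in np.eye(2)
])
assert np.allclose(dx_dp, fd, atol=1e-5)
```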

So far we have found the change in the design \(x\) with respect to the parameters \(p\) that will maintain (up to first order) satisfaction of our constraints \(g\). The final step is to use this to minimize our objective \(f\). We can use any first-order optimization strategy (e.g., gradient descent, NADAM, quasi-Newton, L-BFGS). These rely on computing the gradient of the objective \(f\) with respect to the optimization variables (in our case, the design parameters \(p\)). We can derive this by again invoking the chain rule:

$$ \frac{∂ f}{∂ p} = \frac{∂ f}{∂ x} \frac{∂ x}{∂ p}. $$

Let's pause for a second and do another dimensionality check. The objective function outputs a scalar (i.e., \(f : \mathbb{R}^n \rightarrow \mathbb{R}\)), so the partial derivative w.r.t. the design is an \(n\)-dimensional row vector (i.e., \(\frac{∂ f}{∂ x} \in \mathbb{R}^{1 \times n}\)), which implies that the partial derivative w.r.t. the parameters is an \(m\)-dimensional row vector (i.e., \(\frac{∂ f}{∂ p} \in \mathbb{R}^{1 \times m}\)).

Plugging in our formula for \(\frac{∂ x}{∂ p}\) above, we get an expression for the change in the objective with respect to the change in parameters:

$$ \frac{∂ f}{∂ p} = \frac{∂ f}{∂ x} \left( -\left( \frac{∂ g}{∂ x} \right)^{-1} \frac{∂ g}{∂ p} \right). $$

There's one final step that won't change the solution but will speed up computation. This is called the *adjoint method*. By a simple change of parenthesization, we switch from solving a linear system against the matrix \( \frac{∂ g}{∂ p} \) on the right to solving against the vector \(\frac{∂ f}{∂ x}\) on the left:

$$ \frac{∂ f}{∂ p} = -\left(\frac{∂ f}{∂ x} \left( \frac{∂ g}{∂ x} \right)^{-1}\right) \frac{∂ g}{∂ p} . $$

Remember, we don't actually construct the matrix inverse. If we want to compute \(b A^{-1}\), we can solve the system \(A^\top x = b^\top\) and take \(b A^{-1} = x^\top\).
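Both parenthesizations give the same gradient; the adjoint order just replaces a solve against the \(n \times m\) matrix \(\frac{∂ g}{∂ p}\) with a single transposed solve against the vector \(\left(\frac{∂ f}{∂ x}\right)^\top\). A sketch on the same linear toy setup as before (my own example, \(g(x,p) = K x - p\), \(f(x) = \tfrac12\|x - x_*\|^2\)):

```python
import numpy as np

K = np.array([[4.0, 1.0],
              [1.0, 3.0]])                 # ∂g/∂x
dg_dp = -np.eye(2)                         # ∂g/∂p
p = np.array([1.0, 2.0])
x = np.linalg.solve(K, p)                  # forward solve: g(x, p) = 0
x_star = np.array([0.25, 0.75])
df_dx = (x - x_star)[None, :]              # 1 × n row vector

# Naive order: a solve per column of ∂g/∂p (m right-hand sides).
grad_naive = df_dx @ (-np.linalg.solve(K, dg_dp))

# Adjoint order: one transposed solve, Kᵀ y = (∂f/∂x)ᵀ.
y = np.linalg.solve(K.T, df_dx.ravel())
grad_adjoint = -(y[None, :] @ dg_dp)

assert np.allclose(grad_naive, grad_adjoint)
```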

Dave pointed out that often $g = 0$ is the result of minimizing another objective. For example, $A'A x - A'b = 0$ results from $\mathop{\text{minimize}}_x |A x - b|^2$, so sensitivity analysis + adjoint method is actually a method to do multi-objective optimization (in particular a priori lexicographic multiobjective optimization, a.k.a., ??).
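A quick numerical check of this observation (toy data of my own): the least-squares minimizer of $\|Ax-b\|^2$ satisfies the normal equations $A^\top A x - A^\top b = 0$, so that stationarity condition can play the role of $g = 0$:

```python
import numpy as np

# The least-squares minimizer of ‖A x - b‖² satisfies AᵀA x - Aᵀb = 0,
# i.e. the stationarity condition of one objective serves as the
# constraint g = 0 for another. Random toy data (my own):
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x, *_ = np.linalg.lstsq(A, b, rcond=None)
residual = A.T @ A @ x - A.T @ b
assert np.allclose(residual, 0.0, atol=1e-10)
```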

Very nice write up Alec! Nice to see the variational point of view.

But what do you *do* with these gradients?

In non-linear optimization, it's common to combine descent methods with a line search to ensure progress:

```
while not converged
  x ← run forward sim given p
  df/dp ← - ( df/dx (dg/dx)⁻¹ ) dg/dp @ x,p
  dp ← - df/dp
  // Conduct a line search, which requires full function evaluations
  // of f, each in turn requiring a forward simulation to get x
  f₀ ← f(x)
  t ← 1
  for a few iterations
    p₁ ← p + t dp
    // This is very expensive!
    x₁ ← run forward sim given p₁
    if f(x₁) < f₀ + 0.01 t (df/dpᵀ dp)
      break
    t ← ½ t
  p ← p + t dp
```

This is very expensive because every line-search iteration requires a full forward simulation.
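Here's a runnable sketch of this loop on the scalar spring toy from earlier (my own example: \(g(x,p) = p x - 1\), \(f(x) = (x-0.5)^2\), so the optimum is \(p = 2\)); the "forward sim" is just solving \(g = 0\) in closed form:

```python
# Gradient descent with backtracking (Armijo) line search on the
# scalar toy problem; all names and constants are my own choices.
def forward(p):                  # "forward sim": solve g(x, p) = 0 for x
    return 1.0 / p

def f(x):
    return (x - 0.5) ** 2

p = 4.0
for _ in range(300):
    x = forward(p)
    df_dx = 2.0 * (x - 0.5)
    df_dp = -(df_dx / p) * x     # -(df/dx (dg/dx)⁻¹) dg/dp, scalar case
    dp = -df_dp                  # descent direction
    f0 = f(x)
    t = 1.0
    for _ in range(20):          # backtracking line search
        p1 = p + t * dp
        # each trial requires a full forward simulation:
        if p1 > 0 and f(forward(p1)) < f0 + 1e-4 * t * df_dp * dp:
            break
        t *= 0.5
    p = p + t * dp

assert abs(p - 2.0) < 1e-3
```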

Instead, it may be better to use a fixed step size and even *coast* a bit
before re-establishing a valid $x$ via simulation. Rather than re-simulating,
we'll update $x$ directly with a matching step along $\frac{\partial f}{\partial x}$:

```
while not converged
  x ← run forward sim given p
  // Now we know that g(x,p) = 0
  for a few iterations or until ‖ g(x,p) ‖ > ε
    // all these terms need to be recomputed, but we don't need
    // to rerun the forward sim yet.
    df/dp ← - ( df/dx (dg/dx)⁻¹ ) dg/dp @ x,p
    x ← x - γ df/dx
    p ← p - γ df/dp
  // Now we hope that g(x,p) ≈ 0
```