**Newton's method** requires the Jacobian to search for zeros of a function,
and additionally the Hessian to find its extrema.

For root finding:

$$x_{n+1} = x_n - [J_f(x_n)]^{-1} f(x_n)$$
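As a minimal sketch, the one-dimensional case, where the Jacobian reduces to the scalar derivative (function names and tolerances here are illustrative, not from the source):

```python
def newton_root(f, df, x0, tol=1e-12, max_iter=50):
    """1-D Newton iteration x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:   # stop once the update is negligible
            break
    return x

# usage: find the positive root of x^2 - 2, i.e. sqrt(2)
sqrt2 = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, 1.0)
```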

For optimization:

$$x_{n+1} = x_n - [H_f(x_n)]^{-1} \nabla f(x_n)$$

Newton's method converges to a local extremum quadratically if started within a neighborhood of the extremum in which the Hessian is invertible and Lipschitz continuous. This is much faster than gradient descent, which converges only linearly.

Relaxed (damped) Newton's method scales the step by a factor less than 1 to make convergence more robust, at the cost of reducing the convergence rate from quadratic to linear.
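A sketch of the damped iteration in one dimension; the damping factor `gamma` and step count are illustrative. With `gamma = 0.5` the error shrinks by roughly a factor $1 - \gamma$ per step, i.e. linearly:

```python
def newton_relaxed(f, df, x0, gamma=0.5, n_steps=60):
    """Damped Newton iteration x <- x - gamma * f(x) / f'(x), 0 < gamma < 1."""
    x = x0
    for _ in range(n_steps):
        x -= gamma * f(x) / df(x)
    return x

# usage: find sqrt(2); the error roughly halves on every step
root = newton_relaxed(lambda x: x * x - 2, lambda x: 2 * x, 1.5)
```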

To reduce the computational cost, the matrix inversion is replaced by the solution of a linear system:

$$[\mathbf{H} f(\mathbf{x}_n)] \mathbf{\Delta x} = -\nabla f(\mathbf{x}_n)$$
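For example, one Newton step on a quadratic $f(\mathbf{x}) = \tfrac12 \mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x}$ (gradient $A\mathbf{x} - \mathbf{b}$, Hessian $A$), solving the linear system instead of forming the inverse; the matrix values are illustrative:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian of the quadratic
b = np.array([1.0, 1.0])

x = np.zeros(2)
grad = A @ x - b
dx = np.linalg.solve(A, -grad)   # solve H dx = -grad rather than inverting H
x = x + dx

# for a quadratic, a single Newton step lands on the minimizer, where A x = b
print(np.allclose(A @ x, b))     # prints True
```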

**Quasi-Newton methods** avoid computing the Hessian (or Jacobian) in Newton's method
by building an approximation to it from differences of cached gradient (or function) values.

Algorithm (BFGS): With initial guess $\mathbf{x}_0$ and initial Hessian approximation $B_0 = I$, iterate the following until the norm of the gradient falls below a threshold:

- Obtain a direction $\mathbf{p}_k$: $B_k \mathbf{p}_k = -\nabla f(\mathbf{x}_k)$;
- Line search for an acceptable step size $\alpha_k$;
- $\mathbf{s}_k = \alpha_k \mathbf{p}_k$;
- $\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{s}_k$;
- $\mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k)$;
- Update $B_k$ via the update formula (below).

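The steps above can be sketched as follows, with a simple backtracking (Armijo) line search standing in for the unspecified "acceptable step size" search; all names and tolerances are illustrative:

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    """Minimal BFGS sketch following the steps above."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                  # initial Hessian approximation B_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:     # gradient norm below threshold: done
            break
        p = np.linalg.solve(B, -g)      # obtain direction: B_k p_k = -grad f(x_k)
        alpha = 1.0                     # backtracking line search (Armijo condition)
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        s = alpha * p                   # s_k = alpha_k p_k
        x_new = x + s                   # x_{k+1} = x_k + s_k
        g_new = grad(x_new)
        y = g_new - g                   # y_k = grad f(x_{k+1}) - grad f(x_k)
        if y @ s > 1e-12:               # skip update if curvature condition fails
            Bs = B @ s                  # rank-two BFGS update of B_k
            B = B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
        x, g = x_new, g_new
    return x

# usage: minimize the Rosenbrock function from the origin (minimum at (1, 1))
rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
rosen_grad = lambda x: np.array([
    -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
    200 * (x[1] - x[0]**2),
])
x_min = bfgs(rosen, rosen_grad, [0.0, 0.0])
```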
The secant method can be viewed as a finite-difference approximation of Newton's method, although it was developed long before Newton's method.

$$x_n = x_{n-1} - f(x_{n-1}) \frac{x_{n-1} - x_{n-2}}{f(x_{n-1}) - f(x_{n-2})}$$

The secant method converges if the initial values are sufficiently close to the root. If the function is twice continuously differentiable and the root has multiplicity 1, the order of convergence is the golden ratio $(1+\sqrt{5})/2 \approx 1.618$, between linear and quadratic.
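A sketch of the iteration above (the function names and tolerances are illustrative):

```python
def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Secant iteration: x_n = x_{n-1} - f(x_{n-1}) (x_{n-1} - x_{n-2})
    / (f(x_{n-1}) - f(x_{n-2}))."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:              # flat segment: cannot take a secant step
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        x0, f0 = x1, f1
        x1, f1 = x2, f(x2)
        if abs(x1 - x0) < tol:    # stop once the update is negligible
            break
    return x1

# usage: find sqrt(2) from the bracket [1, 2]
sqrt2 = secant(lambda x: x**2 - 2, 1.0, 2.0)
```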

DFP was the first quasi-Newton method to generalize the secant method to multidimensional problems. [@Davidon1959; @FletcherPowell1963]

Update formula:

$$B_{k+1} = \left (I-\frac {y_k \, \Delta x_k^T} {y_k^T \, \Delta x_k} \right ) B_k \left (I-\frac {\Delta x_k y_k^T} {y_k^T \, \Delta x_k} \right )+\frac{y_k y_k^T} {y_k^T \, \Delta x_k}$$

Or with $H_{k+1} = B_{k+1}^{-1}$:

$$H_{k+1} = H_k + \frac {\Delta x_k \Delta x_k^T}{\Delta x_k^{T} \, y_k} - \frac {H_k y_k y_k^T H_k} {y_k^T H_k y_k}$$

**BFGS** is the most widely used quasi-Newton method; it was published independently by its four namesakes in 1970. [@Broyden1970; @Fletcher1970; @Goldfarb1970; @Shanno1970]

Update formula:

$$B_{k+1} = B_k + \frac {y_k y_k^T}{y_k^{T} \Delta x_k} - \frac {B_k \Delta x_k (B_k \Delta x_k)^T} {\Delta x_k^{T} B_k \, \Delta x_k}$$

Or with $H_{k+1} = B_{k+1}^{-1}$:

$$H_{k+1} = \left (I-\frac {\Delta x_k y_k^T } {y_k^T \Delta x_k} \right ) H_k \left (I-\frac { y_k \Delta x_k^T} {y_k^T \Delta x_k} \right )+\frac{\Delta x_k \Delta x_k^T} {y_k^T \, \Delta x_k}$$
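A quick numerical check, on randomly generated (illustrative) data, that the two BFGS forms are consistent: the updated $B_{k+1}$ satisfies the secant equation $B_{k+1}\,\Delta x_k = y_k$, and the $H$ update reproduces exactly $B_{k+1}^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# a random symmetric positive definite B_k and its inverse H_k
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)
H = np.linalg.inv(B)

dx = rng.standard_normal(n)                  # step Delta x_k
y = B @ dx + 0.1 * rng.standard_normal(n)    # gradient difference y_k (y . dx > 0)

# BFGS update of the Hessian approximation B_k
B_next = (B + np.outer(y, y) / (y @ dx)
            - np.outer(B @ dx, B @ dx) / (dx @ B @ dx))

# BFGS update of the inverse approximation H_k
rho = 1.0 / (y @ dx)
I = np.eye(n)
H_next = ((I - rho * np.outer(dx, y)) @ H @ (I - rho * np.outer(y, dx))
          + rho * np.outer(dx, dx))

secant_ok = np.allclose(B_next @ dx, y)                  # secant equation holds
inverse_ok = np.allclose(H_next, np.linalg.inv(B_next))  # the two forms agree
```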

**Nonlinear conjugate gradient methods**