ECSE 507 Lecture Notes (presentation)


These are notes used for lecturing ECSE 507 during the Winter 2024 semester.
Please note that these notes constantly evolve; beware of typos/errors.
These notes are heavily influenced by Boyd and Vandenberghe’s excellent text Convex Optimization.
I appreciate being informed about typos/errors in these notes.

N.B.: This page is formatted to be projected; see here for the unformatted version (i.e., without excessive whitespace).



Introduction
General Problem We are interested in solving and studying minimization problems subject to constraints:

\begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & x \in D \end{cases},

where
  • x \in \mathbb{R}^n is the vector of problem parameters;
  • f_0:\mathbb{R}^n \to \mathbb{R} is the objective function and f_0(x) is usually interpreted as the cost of choosing x ;
  • D \subset\mathbb{R}^n is a constraint set often described “geometrically”.
Example 1. Design Problem Interpretation Let
  • x,y=scalar-valued design variables
    (e.g., dimensions of manufactured object, yaw and pitch of jet).
  • f_0(x,y) = penalty for choosing design (x,y)
    (e.g., cost in material, energy, time, deviation from desired path).
  • D = design specifications
    (i.e., allowable/possible values for (x,y)).
    E.g., D = \{ (x,y) : a<x<b, c<y<d\} specifies minimum and maximum design values.
Then the problem is to find optimal design values (x,y) which minimize cost f_0 and satisfy the design specifications (x,y) \in D.






Example 2. Minimize function over ellipse Let
f_0(x,y) = 1 - \frac{\left(x^{2}+y^{2}\right)}{2}
D = \{ (x,y) \in \mathbb{R}^2 : x^2 + 2 y^2 \leq 1 \} .
Then the problem

\begin{cases} \text{minimize} & 1-\frac{\left(x^{2}+y^{2}\right)}{2}\\ \text{subject to} & x^2 + 2 y^2 \leq 1 \end{cases}

is a familiar kind of purely geometric optimization problem.
It has two solutions: (x,y,f_0(x,y)) = (\pm 1, 0 , \frac{1}{2}) .
Many applied problems can look like this, but with an applied interpretation.















Problems With Structure General optimization problems can be numerically inefficient to solve or analytically difficult, unless f_0 and D have additional structure/properties.
Identifying nice structure/properties of problem \implies problem may become analytically solvable or numerically efficient.
Examples of nice structure:
linearity
f_0(x) = c^T x
D defined in terms of linear equalities/inequalities;
e.g., a_1^T x \geq 0, \ldots, a_m^T x \geq 0, or succinctly written A x \succeq 0.






convexity or quasiconvexity
E.g., f_0 is convex if f_0(tx+(1-t)y) \leq tf_0(x) + (1-t)f_0(y) for all 0<t<1 ,
and D is itself a convex set.






sparsity or other matrix structure
E.g., f_0(x) = x^T C x and D given by Ax \succeq 0
If C,A sparse
\implies many terms vanish and may be skipped
\implies improved computational efficiency
\begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 1\\ 0 & 1 & 0 & 0 & 0 & 0\\ 1 & 0 & 0 & 1 & 0 & 1\\ 0 & 1 & 0 & 1 & 0 & 0\\ 0 & 1 & 0 & 0 & 1 & 0 \end{bmatrix} ,\qquad \begin{bmatrix} 1&2&0&0\\ 2&1&0&0\\ 0&0&2&1\\ 0&0&1&2 \end{bmatrix}















Linear Programming Simplest case: f_0 is linear and D defined in terms of linear constraints.
Notation: if x, y \in \mathbb{R}^n , then

x \succeq y means x_i \geq y_i for i =1,\ldots, n .

c^T = transpose of the column vector c \in \mathbb{R}^n.
Linear program: given c \in \mathbb{R}^n, b \in \mathbb{R}^m, A \in \mathbb{R}^{m \times n}, solve

\begin{cases} \text{minimize}&c^T x\\ \text{subject to}& Ax \succeq b\\ &x \succeq 0 \end{cases}.

Thus f_0(x) = c^T x and D = \{x \in \mathbb{R}^n: Ax \succeq b, x \succeq 0\}.
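N.B.: in practice such LPs are handed to off-the-shelf solvers. Below is a minimal sketch using SciPy's linprog on a small made-up instance (the data c, A, b here are hypothetical, chosen only for illustration); linprog expects constraints in the form A_ub x \leq b_ub, so Ax \succeq b is rewritten as -Ax \preceq -b.

import numpy as np
from scipy.optimize import linprog

# Hypothetical data: minimize c^T x subject to A x >= b, x >= 0.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0],
              [1.0, -1.0]])
b = np.array([1.0, -2.0])

# linprog solves: minimize c^T x subject to A_ub x <= b_ub and bounds on x,
# so A x >= b becomes (-A) x <= (-b).
res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None), (0, None)])
print(res.x, res.fun)  # expect x = (1, 0) with optimal value 1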

Positives of linear programming:
  • Conceptually simple: relies heavily on linear algebra
  • There are classical numerical methods which are often very efficient.
  • If x^\star \in D is a local minimizer of f_0 on D, then it is automatically a global minimizer on D.
  • Can sometimes approximate smooth problems linearly; however, usually can only give “local” results.
    (E.g., \sin(x) \sim x for 0 \leq x \ll 1 .)


Shortcomings of linear programming:
  • Many applied problems are not linear.
  • Many problems may not even be (suitably) approximated by linear programs.
    E.g., the “barrier”

    I_{-}(x) =  \begin{cases} 0 & x <0\\ \infty & x\geq0 \end{cases}

    is better approximated by a “logarithmic barrier” of the form - c \log(-x) than by any linear function.















Convex Optimization Convex optimization problem: f_0 and D are convex.
This is the main focus of the course.

Positives of Convex Optimization:
  • Relatively conceptually simple.
  • Still often have efficient, albeit more sophisticated, numerical methods.
  • Many applied problems may be recast as or approximated by convex optimization problems.
  • If x^\star \in D is a local minimizer of f_0 on D, then it is automatically a global minimizer on D.


Shortcomings of Convex Optimization:
  • Problems may seriously fail to be analytically tractable or numerically efficient.
  • There exist nonlinear problems which cannot be approximated by convex problems.















  • Example (Least Squares) A standard and ubiquitous kind of convex optimization problem is the least squares problem.
    This problem takes the form:

    \begin{cases} \text{minimize} & \Vert Ax-b \Vert^2\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ &h_i(x) =0, i=1,\ldots,p \end{cases},

    where
    • \Vert \cdot \Vert is some norm
    • A \in \mathbb{R}^{m \times n} a matrix
    • b \in \mathbb{R}^m a fixed vector
    • f_0(x) = \Vert A x-b \Vert^2 is the (convex) objective function
    • f_i,h_i are convex.
    N.B.: will come back to this problem.
    Example: Distance from points to ellipse If \Vert x \Vert^2 = x_1^2 + \cdots + x_n^2, A is the identity matrix, and D is an ellipsoid, then the solution is the point in the ellipsoid closest to the point b.
    The image below depicts this situation with

    b = \begin{bmatrix}0\\2\end{bmatrix} and D = \{(x,y): (x-2)^2 + 2(y-2)^2 \leq 2 \} .

    Here, the optimal solution is (x,y) = (2-\sqrt{2},2) .
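    N.B.: a quick numerical sanity check of this solution, sketched with SciPy's general-purpose minimize (SLSQP is used since the problem has an inequality constraint, encoded as g(x,y) \geq 0):

    import numpy as np
    from scipy.optimize import minimize

    b = np.array([0.0, 2.0])

    # Feasible set (x-2)^2 + 2(y-2)^2 <= 2, encoded as g >= 0 for SLSQP.
    con = {"type": "ineq",
           "fun": lambda v: 2.0 - ((v[0] - 2.0)**2 + 2.0 * (v[1] - 2.0)**2)}

    # Minimize the squared Euclidean distance to b, starting from the center.
    res = minimize(lambda v: np.sum((v - b)**2), x0=np.array([2.0, 2.0]),
                   constraints=[con])
    print(res.x)  # approximately (2 - sqrt(2), 2) = (0.5858..., 2)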















    Optimal Control Let
    x(t),u(t),a(t),f(t) be functions \mathbb{R} \to \mathbb{R} .
    Assume x(t) evolves by

    \dot x(t) = a(t) x(t) + f(t) u(t).

    Here,
    • \dot x(t) is the time derivative of x(t) .
    • a(t),f(t) are assumed to be given;
    • we think of x(t) as being the state of some system at time t ;
    • we think of u(t) as an input we are allowed to choose to dictate the evolution of the state x(t) from its initial value x(0) ; i.e., u(t) “controls” the system;
    • When u(t)=u(x(t),t) , the system experiences “feedback.”
    • the goal: choose control law u(t) = u(x(t),t) so that x(t) is as “desirable” as possible.
    Example: Optimal Control Problems Optimal control problem: choose the “best” control u(t) which gives the “most” desirable x(t) .
    Typically “best” and “desirable” are determined by size/cost of u(t) and x(t) ; e.g., one may wish to minimize

    \int \left( u(t)^2 + x(t)^2 \right) dt.

    Therefore: problem is to solve (roughly speaking)

    \begin{cases} \text{minimize} & \int \left( u(t)^2 + x(t)^2 \right) dt\\ \text{subject to} & \dot x(t) = a(t) x(t) + f(t) u(t) \end{cases}

    This will be another focus of the course and we will see some optimal control problems can be recast as convex optimization problems.















    Rough Outline of Course
    Part 1: Basics of Convexity and Convex Optimization Problems.
    Part 2: Applications of Convex Optimization Problems.
    Part 3: Algorithms for Solving Convex Optimization Problems.
    Part 4: Topics in Optimal Control.















    Convex Geometry
    Convex Sets Convex set: a subset X \subset \mathbb{R}^n satisfying:

    for all x,y \in X and t \in [0,1], there holds tx + (1-t) y \in X.

    I.e., X contains all line segments whose endpoints belong to X.

    Examples: some standard convex sets.

    1. Closed or open polytopes in \mathbb{R}^n.
      E.g., the interior of a tetrahedron.
    2. Euclidean balls, ellipsoids.
    3. Linear subspaces and affine spaces (e.g., lines, planes).
    4. Given a norm \Vert \cdot \Vert on \mathbb{R}^n, the \Vert \cdot \Vert-ball

      \{ x \in \mathbb{R}^n : \Vert x - x_{0} \Vert \leq r \}

      with center x_0 \in \mathbb{R}^n and radius r>0 is a convex set.
      Recall: a norm satisfies
      • \Vert x+ y \Vert \leq \Vert x \Vert + \Vert y \Vert for all vectors x,y ;
      • \Vert c x \Vert = |c| \Vert x \Vert for all vectors x and scalars c ;
      • \Vert x \Vert=0 iff x =0.















    Affine Subsets Affine subset: A subset X \subset \mathbb{R}^n satisfying:

    For all x,y \in X and t \in \mathbb{R}, there holds tx + (1-t) y \in X.

    I.e., X contains all lines which pass through two distinct points in X.
    N.B.: An affine subset is just a translated linear subspace:
    “a linear space that’s forgotten its origin”.
    Example 1. Let X = \{ (x,y,0): x,y \in \mathbb{R} \} be the xy-plane in \mathbb{R}^3.
    Then any translation or rotation of X is an affine subset.
    Example 2. If A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m, then X=\{x \in \mathbb{R}^n: Ax = b\} is affine.
    (X is just a translate of \text{ker}\,A.)
    Example 3. Let

    A =  \begin{bmatrix} 0&0&0\\ 0&0&0\\ 0&0&1 \end{bmatrix}\qquad \text{ and } \qquad b =  \begin{bmatrix} 0\\0\\3 \end{bmatrix} .

    Then the solution set to Ax=b is

    \{(x,y,3): x,y \in\mathbb{R}\},

    which is just \text{ker}\, A translated by 3 in the z-direction:















    Cones Cone: A subset X \subset \mathbb{R}^n satisfying:

    For all x \in X and t \geq 0, there holds tx \in X.

    I.e., X contains all “positive” rays emanating from the origin and passing through any of its points.
    Proposition. X \subset \mathbb{R}^n is a convex cone iff for all x_1,x_2 \in X and \theta_1,\theta_2\geq0, there holds \theta_1 x_1 + \theta_2 x_2 \in X.
    Proof.
    Step 1. (\implies ) Suppose X is a convex cone and let x_1,x_2 \in X and \theta_1,\theta_2\geq0 be arbitrary.
    Want to show: \theta_1 x_1 + \theta_2 x_2 \in X .
    Step 2. Being conic implies x = \frac{\theta_1}{t} x_1 and y=\frac{\theta_2}{1-t}x_2 belong to X for all 0<t<1.
    Step 3. Being convex implies tx+ (1-t)y = \theta_1 x_1 + \theta_2 x_2 \in X, as desired.
    Step 4. (\impliedby ) Suppose X is such that \theta_1 x_1 + \theta_2 x_2 \in X for all x_1,x_2 \in X and \theta_1,\theta_2\geq0.
    Want to show: X is a convex cone.
    Step 5. X being conic follows from taking \theta_1\geq0 arbitrary and \theta_2 = 0.
    Step 6. Convexity follows from taking \theta_1 + \theta_2 = 1 with \theta_1,\theta_2 \geq0 and 0 \leq \theta_1 \leq 1 .
    Indeed: \theta_1 + \theta_2 = 1 and t := \theta_1 \implies t x_1 + (1-t) x_2 \in X for 0 \leq t \leq 1 since 1-t = \theta_2 .

    Examples.

    1. Hyperplanes \{ x \in \mathbb{R}^n : a^{T} x = 0\} with normal a \in \mathbb{R}^n,
      halfspaces \{ x \in \mathbb{R}^n : a^{T}x \leq 0 \},
      nonnegative orthants \{x \in \mathbb{R}^n: x \succeq 0\} are all convex cones.
      (Here, u \succeq v if u_i \geq v_i for i = 1,\ldots,n.)
    2. Given a norm \Vert\cdot\Vert on \mathbb{R}^n, the \Vert\cdot\Vert-norm cone is

      \{ (x,t) \in \mathbb{R}^{n+1} : \Vert x \Vert \leq t \},

      which is a convex cone in \mathbb{R}^{n+1}.
    3. See “positive semidefinite cone” below.















    Polyhedra Polyhedron: Any subset X \subset \mathbb{R}^n of the form

    X = \{ x \in \mathbb{R}^n : a_j^T x \leq b_j,\, j=1,\ldots,m,\; c_i^T x = d_i,\, i=1,\ldots,p \}

    given the vectors a_j,c_i \in \mathbb{R}^n and scalars b_j,d_i \in \mathbb{R}.
    Thus, X is a finite intersection of halfspaces and hyperplanes.
    N.B.: Introducing equality constraints can be used to reduce dimension.
    Example. The polyhedron below is given by the indicated system of inequalities:



    \begin{aligned} y-x& \leq0\\ -x-y&\leq -1\\ x &\leq 3\\ -x &\leq -1 \end{aligned}

    It is an easy exercise to rewrite the inequalities in the notation a_j^T x \leq b_j for suitable a_j,b_j .















    Positive Semidefiniteness
    Symmetric matrix: a matrix X \in \mathbb{R}^{n \times n} satisfying X = X^T ; i.e.,

    X = \begin{bmatrix}  x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{nn} \end{bmatrix}  = \begin{bmatrix} x_{11} & x_{21} & \cdots & x_{n1} \\ x_{12} & x_{22} & \cdots & x_{n2}\\ \vdots & \vdots & \ddots & \vdots\\ x_{1n} & x_{2n} & \cdots & x_{nn} \end{bmatrix}  = X^T.

    Set of symmetric matrices:

    \boldsymbol{S}^n = \{ X \in \mathbb{R}^{n \times n}: X = X^T \}.






    Positive semidefinite matrix: X \in \boldsymbol{S}^n satisfying z^T X z \geq 0 for all z \in \mathbb{R}^n.
    Equivalently, X only has nonnegative eigenvalues.
    If X \in \boldsymbol{S}^n is positive semidefinite, then write X \succeq 0 .
    If X,Y \in \boldsymbol{S}^n and X-Y\succeq0 , then write X \succeq Y .
    Set of symmetric positive semidefinite matrices:

    \boldsymbol{S}_+^n = \{ X \in \boldsymbol{S}^n: X \succeq 0 \}.

    (N.B.: \succeq is not the same as component-wise inequality, as was the case for vectors.)




    Positive definite matrix: X \in \boldsymbol{S}_+^n satisfying z^T X z = 0 if and only if z=0.
    Equivalently, X only has positive eigenvalues.
    If X \in \boldsymbol{S}^n_+ is positive definite, then write X \succ 0 .
    If X,Y \in \boldsymbol{S}^n_+ and X-Y\succ0 , then write X \succ Y .
    Set of symmetric positive definite matrices:

    \boldsymbol{S}_{++}^n = \{ X \in \boldsymbol{S}^n: X \succ 0 \}.







    Example 1 Let A = \begin{bmatrix} 2 & 0 \\  0 & 4  \end{bmatrix} .
    Since A has positive eigenvalues \{2,4\} , it follows that A \succ 0 .
    To see A \succ 0 explicitly, observe

    \begin{aligned}  z^T A z &= \begin{bmatrix} z_1 & z_2 \end{bmatrix}  \begin{bmatrix} 2 & 0\\ 0 & 4 \end{bmatrix}  \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}\\ &= 2 z_1^2 + 4 z_2^2\\ &\geq0 \end{aligned}


    with z^T A z = 0 iff z = \begin{bmatrix}0\\0\end{bmatrix} .



    Example 2. Let B = \begin{bmatrix} 4 & 0 \\ 0 & 0 \end{bmatrix}.
    Since B has nonnegative eigenvalues \{ 4,0 \} , it follows that B \succeq 0 .
    To see B \succeq 0 explicitly, observe

    \begin{aligned} z^TBz &= \begin{bmatrix}z_1 & z_2 \end{bmatrix} \begin{bmatrix} 4 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}\\ &=4 z_1^2\\ &\geq 0. \end{aligned}

    Evidently, z^T B z=0 for z = \begin{bmatrix} 0\\z_2 \end{bmatrix} and so B \not\succ 0.



    Example 3. Let C =\begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix} \succ 0 .
    One can conclude C \succ0 by showing that C has positive eigenvalues \{ 5,1 \} .
    To see it directly, compute

    \begin{aligned} z^T C z &= \begin{bmatrix}z_1 & z_2 \end{bmatrix} \begin{bmatrix} 3&2\\2&3 \end{bmatrix} \begin{bmatrix} z_1 \\z_2 \end{bmatrix}\\ &= 3z_1^2 + 4z_1z_2 + 3z_2^2. \end{aligned}

    But the discriminant (with respect to z_1 ) of this quadratic satisfies -20 z_2^2 \leq 0 , from which we conclude the polynomial is positive unless z_1=z_2=0 and hence C \succ 0 .



    Example 4. Let D = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix} .
    We can conclude D \not\succeq 0 : either compute its eigenvalues \{3,-1\} , or observe that \det D = 1 - 4 = -3 < 0 , whence D cannot have only nonnegative eigenvalues.
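    N.B.: such hand checks are easily mirrored numerically; a minimal sketch using NumPy's eigvalsh (eigenvalues of a symmetric matrix) on the four examples above:

    import numpy as np

    # Classify the four example matrices by their eigenvalues.
    mats = {
        "A": np.array([[2.0, 0.0], [0.0, 4.0]]),
        "B": np.array([[4.0, 0.0], [0.0, 0.0]]),
        "C": np.array([[3.0, 2.0], [2.0, 3.0]]),
        "D": np.array([[1.0, 2.0], [2.0, 1.0]]),
    }
    for name, M in mats.items():
        w = np.linalg.eigvalsh(M)  # eigenvalues, ascending
        print(name, w, "PSD:", bool(np.all(w >= 0)), "PD:", bool(np.all(w > 0)))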















    Positive Semidefinite Cone Proposition 1. \boldsymbol{S}^n is a \frac{n(n+1)}{2}-dimensional real vector space and \boldsymbol{S}_+^n is a convex cone in \boldsymbol{S}^n.
    Proof.
    Step 1. \boldsymbol{S}^n is a vector space: if X,Y \in \boldsymbol{S}^n and c \in \mathbb{R}, then it is easy to see:

    (X+cY)^T = X^T + cY^T = X + cY.

    and so X+cY \in \boldsymbol{S}^n.


    Step 2. \text{dim}\,\boldsymbol{S}^n = \frac{n(n+1)}{2}: since X \in \boldsymbol{S}^n implies X=X^T, we have the identification

    \begin{aligned} X&= \begin{bmatrix} \boldsymbol{x_{11}} & \boldsymbol{x_{12}} &  \boldsymbol{x_{13}} & \cdots & \boldsymbol{x_{1n}}\\ x_{12} & \boldsymbol{x_{22}} & \boldsymbol{x_{23}} &\cdots & \boldsymbol{x_{2n}}\\ x_{13} & x_{23} & \boldsymbol{x_{33}} &\cdots & \boldsymbol{x_{3n}}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ x_{1n} & x_{2n} & x_{3n} & \cdots & \boldsymbol{x_{nn}}\\ \end{bmatrix}\\ &\qquad\qquad\qquad\qquad\iff\\ \xi&:= \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} & x_{22} & x_{23} & \cdots & x_{2n} & \cdots & x_{nn} \end{bmatrix}^T, \end{aligned}

    where the bolded entries in X indicate the unique contributions to making \xi.
    Counting the number of bold entries shows \xi has \frac{n(n+1)}{2} entries and hence \xi \in \mathbb{R}^{\frac{n(n+1)}{2}}.


    Step 3. \boldsymbol{S}_+^n is a convex cone: For \theta_1,\theta_2\geq0, X,Y \in \boldsymbol{S}_+^n and z \in \mathbb{R}^n there holds

    z^T(\theta_1 X +\theta_2 Y)z = \theta_1 z^T X z + \theta_2 z^T Y z \geq 0,

    and so \theta_1 X + \theta_2 Y \in \boldsymbol{S}_+^n .
    By the proposition in Convex Geometry.Cones, we conclude the desired result.










    Proposition 2. \begin{bmatrix} a&b\\b&c \end{bmatrix} \in \boldsymbol{S}_+^2 iff

    a,c\geq0 and \det \begin{bmatrix} a&b\\b&c \end{bmatrix} = ac - b^2 \geq 0.

    Proof.
    Step 1. Let

    x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \text{and} \quad  X = \begin{bmatrix} a&b\\b&c \end{bmatrix},


    recalling that X \in \boldsymbol{S}_+^2 iff x^T X x \geq 0 for all x \in \mathbb{R}^2.

    Step 2. (Case a=0 ) First compute

    x^T X x = \begin{bmatrix}x_1 & x_2 \end{bmatrix} \begin{bmatrix} 0&b\\b&c \end{bmatrix}\begin{bmatrix} x_1\\x_2\end{bmatrix} = 2bx_1x_2+cx_2^2.

    Observe that

    2bx_1x_2 + cx_2^2 \geq 0 for all x_1,x_2 \in \mathbb{R} iff b=0,c\geq0 .

    Note (in case a=0 ): ac - b^2 = 0 iff b=0 .
    Can thus conclude (in case a=0 ):

    x^TXx \geq 0 for all x \in \mathbb{R}^2 iff c\geq0, \det X =0 .



    Step 3. (Case a \neq 0 ) Completing the square gives

    \begin{aligned}  x^T X x &= \begin{bmatrix}x_1&x_2\end{bmatrix}\begin{bmatrix}a&b\\b&c\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix}\\ &=ax_1^2 + 2bx_1 x_2 + c x_2^2 \\ &= a(x_1 + a^{-1}bx_2)^2 + a^{-1}\det\, X\, x_2^2. \end{aligned}

    But

    a(x_1 + a^{-1}bx_2)^2 + a^{-1}\det\, X\, x_2^2 \geq0 for all x_1,x_2 \in \mathbb{R}

    iff

    a>0 and \det\,X \geq0 .

    (To conclude a>0 , take x_2=0.)
    N.B.: strictly speaking, c\geq0 was not used anywhere; however, X \succeq0 immediately implies c \geq0 , and a>0,ac-b^2 \geq0 also implies c\geq0 .

    Step 4. Putting Steps 2. and 3. together, we conclude:
    \begin{bmatrix} a&b\\b&c \end{bmatrix} \in \boldsymbol{S}_+^2 iff

    a,c\geq0 and \det \begin{bmatrix} a&b\\b&c \end{bmatrix} = ac - b^2 \geq 0.
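    N.B.: a small randomized sanity check of Proposition 2, comparing the determinant criterion against numerically computed eigenvalues (a tolerance is needed since the eigenvalues are computed in floating point):

    import numpy as np

    def psd2(a, b, c):
        # Proposition 2: [[a, b], [b, c]] lies in S_+^2 iff a, c >= 0 and ac - b^2 >= 0.
        return a >= 0 and c >= 0 and a * c - b * b >= 0

    rng = np.random.default_rng(0)
    mismatches = 0
    for _ in range(100_000):
        a, b, c = rng.uniform(-2, 2, size=3)
        w = np.linalg.eigvalsh(np.array([[a, b], [b, c]]))
        if psd2(a, b, c) != bool(w.min() >= -1e-9):
            mismatches += 1
    print("mismatches:", mismatches)  # expect 0, up to numerical tolerance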











    Image of \boldsymbol{S}_+^2 In light of Proposition 1 and Proposition 2, we can plot

    \boldsymbol{S}_+^2 \subset \boldsymbol{S}^2 \cong \mathbb{R}^3.

    The image below depicts the boundary of

    \boldsymbol{S}_+^2 \cong \{ (x,y,z): xz \geq y^2 ,x\geq0,z\geq0\}.
















    Separating Hyperplanes Let A,B \subset \mathbb{R}^n be two sets.
    Separating hyperplane: a hyperplane given by

    P_{a,b} := \{ x : a^T x = b \}

    for some a \in \mathbb{R}^n and b \in \mathbb{R} such that

    \begin{aligned} a^Tx -b \geq 0 & \text{ on } A\\ a^Tx - b \leq 0 & \text{ on } B \end{aligned}.

    P_{a,b} is said to separate A and B .
    Thus, P_{a,b} cuts \mathbb{R}^n into two halfspaces with one containing all of A and the other containing all of B .
    Separating Hyperplane Theorem. If A,B \subset \mathbb{R}^n are two disjoint convex sets, then there exists a \in \mathbb{R}^n and b \in \mathbb{R} such that P_{a,b} is a separating hyperplane which separates A and B .
    Example 1. Consider the convex sets

    \begin{aligned} A &= \{(x,y) : \left(x-1\right)^{2}+2\left(y-1\right)^{2}\leq1 \}\\ B &= \{(x,y) : \left(x+1\right)^{2}+\left(y+1\right)^{2}\leq1\}\\ C &= \{(x,y) : \left(x-2\right)^{2}+2\left(y-1\right)^{2}\leq1 \}\\ P:&=P_{(1,1),0} = \{(x,y) : y+x=0 \} \end{aligned}.

    These three sets are indicated in the image below.
    Note that P separates the pairs [B,A] and [B,C] . Moreover, the pair [A,C] cannot be separated since A and C have significant overlap.
    Example 2. Consider the convex sets

    \begin{aligned} A &= \{(x,y) : x^{2}+y^{2}<1 \}\\ B &= \{(x,y) : \left(x+2\right)^{2}+y^{2}<1 \}\\ P:&=P_{(1,0),-1} = \{(x,y) : x=-1 \} \end{aligned}.

    These sets are indicated in the image below.
    First note A \cap B = \emptyset since neither set contains its boundary.
    As such, they have a separating hyperplane which is given by P .

    N.B.: Replacing A,B with their respective closures \overline{A},\overline{B} , the plane P still separates \overline{A},\overline{B} .
    Indeed, x+1 \geq 0 for (x,y) \in A and x+1 \leq 0 for (x,y) \in B .















    Supporting Hyperplanes Let A \subset \mathbb{R}^n be a fixed set and fix a boundary point

    x_0 \in \text{bd}A := \overline{A} \setminus \text{int}A .

    If the plane

    P_{a,a^Tx_0} = \{ x : a^Tx = a^Tx_0 \}

    separates A and the singleton \{x_0\} , then P_{a,a^Tx_0} is called a supporting hyperplane of A at x_0 .
    Equivalently, A lies entirely in a halfspace with boundary given by P_{a,a^Tx_0} .

    (Here: \overline{A} indicates the closure of A and \text{int} A indicates its interior.)

    Example. Consider the convex sets

    \begin{aligned} A &= \{ (x,y) : (x-2)^2 + (y-2)^2 \leq 1 \}\\ P :&= P_{(-1,0),-1} = \{ (x,y) : x = 1 \} \end{aligned}

    with boundary point x_0 = (1,2) \in \partial A .
    Letting a = \begin{bmatrix} -1\\0\end{bmatrix} , we note a^T x_0 + 1 = 0 \geq 0 .
    Next, observe that if (x,y) \in A , then x \geq 1 , and so

    a^Tx + 1 = \begin{bmatrix} -1&0 \end{bmatrix} \begin{bmatrix} x\\y \end{bmatrix} +1 = -x +1 \leq 0.

    Thus P separates A and \{ x_0 \} , showing that P is a supporting hyperplane of A at the boundary point x_0 ; see image below.















    Hulls Let X \subset \mathbb{R}^n be a fixed subset.
    Convex hull: the set

    \{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in X,\, \theta_i \geq 0,\, i = 1,\ldots,k,\, \theta_1 + \cdots + \theta_k= 1\}.

    This is just the collection of all convex combinations of points in X and is itself convex.
    Example: The images below depict a set of three points and its convex hull.
    3 points in the plane
    Convex hull of 3 points


    Affine hull: the set

    \{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in X,\, i = 1,\ldots,k,\, \theta_1 + \cdots + \theta_k= 1\}.

    This is just the collection of all affine combinations of points in X and is itself affine.
    Example: The images below depict two points and their affine hull.
    Two points in the plane
      
    Affine hull of two points


    Conic hull: The set

    \{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in X,\, \theta_i \geq 0,\, i = 1,\ldots,k\}.

    This is just the collection of all conic combinations of points in X and is itself a cone.
    Example: The images below depict two points x_1 = (0.5,1) , x_2 = (1.5,0.5) and their conic hull.
    Two points in the plane
      
    Conic hull of two points
    Details To see that the conic hull really is the shaded region, note that, by taking \theta_1 = t s_1 and \theta_2 = (1-t)s_2 , where t \in [0,1] and s_1,s_2\geq0 , the conic hull contains all points of the form ts_1x_1 + (1-t)s_2x_2 .
    Thus, it contains all line segments connecting any two points on the nonnegative rays \{ s x_1 : s \geq 0 \} and \{ s x_2 : s \geq 0 \} .






    N.B.:
    1. Conic hulls are convex cones.
    2. Taking the “___ hull” of X does indeed result in a “___” set.
    3. The “___ hull” is a construction of the smallest “___” subset containing X.















    Generalized Inequalities Proper cone: a convex cone K \subset \mathbb{R}^n satisfying
    1. K is closed (i.e., K contains its boundary)
    2. K has nonempty interior
    3. x,-x \in K \implies x=0 .
    Generalized inequality: given a proper cone K , a partial ordering \preceq_K on \mathbb{R}^n defined by

    x \preceq_K y \iff y-x \in K .

    N.B.: \preceq_K is only a partial ordering, so not every pair x,y is comparable.
    Generalized strict inequality: given a proper cone K , the relation \prec_K on \mathbb{R}^n defined by

    x \prec_K y \iff y-x \in \text{int}\,K .

    Examples
    1. (CO Example 2.14)
      If K = \mathbb{R}^n_+ , then \preceq_K is the standard componentwise vector inequality:

      v \preceq w \iff v_i \leq w_i, \, i = 1,\ldots,n .

      N.B.: \preceq_{\mathbb{R}_+} is the standard inequality on \mathbb{R} .
    2. (CO Example 2.15)
      If K = \boldsymbol{S}_+^n , then

      \begin{aligned} A &\preceq_K B \iff B-A \text{ is positive semidefinite}\\ A &\prec_K B \iff B-A \text{ is positive definite} \end{aligned} .

    3. Let

      K = \{(x_1,x_2) : x_1 \leq 2x_2, x_2 \leq 2x_1 \} .

      Then K is a proper cone.
      In the image below:
      • K is the cone with vertex (0,0) .
      • The cone with vertex x = (-1,1) depicts those y \in \mathbb{R}^2 with x \preceq_K y .
      • The cone with vertex x = (1,-1) depicts those y \in \mathbb{R}^2 with x \preceq_K y .
      N.B.: (0,2)-(-1,1) = (1,1) \in K , and so (1,1) \preceq_K (0,2) , as indicated in the image.
      Moreover, (-1,1) and (1,-1) are not comparable.















    Convex Function Theory
    Conventions and Notations
    1. Writing f:\mathbb{R}^n \to \mathbb{R} always means a partial function with domain \text{dom}\,f possibly smaller than \mathbb{R}^n.
      “Function” will mean “partial function.”
    2. If \text{dom}\, f \neq \mathbb{R}^n, we may work with the extension \tilde{f}:\mathbb{R}^n \to \mathbb{R} \cup \{ +\infty \} given by

      \tilde{f}(x) = \begin{cases} f(x) & x \in \text{dom}\, f\\ +\infty & x \notin \text{dom}\, f \end{cases}.

      It is common to implicitly assume f has been extended and to write f for the partial function f and its extension \tilde{f}.
    3. Given a set C \subset \mathbb{R}^n, its indicator function is

      \tilde{I}_C(x) = \begin{cases} 0 & x \in C\\ +\infty & x \notin C \end{cases}.

    4. We write

      \begin{aligned} \mathbb{R}_+ &:= \{ x \in \mathbb{R}: x\geq0 \}\\ \mathbb{R}_{++} &:= \{ x \in \mathbb{R}: x > 0 \} \end{aligned}.
















    Convex Functions Let f:\mathbb{R}^n \to \mathbb{R} be a function with convex domain \text{dom}\, f.
    Convexity: for all x,y \in \text{dom}\, f, t \in [0,1] there holds

    f(t x + (1-t)y) \leq t f(x) + (1-t)f(y).

    (This inequality is often called Jensen’s inequality.)



    Strict convexity: for all x,y \in \text{dom}f,\,  x\neq y, t \in (0,1) there holds

    f(t x + (1-t)y) < t f(x) + (1-t)f(y).


    Example: failure of strict convexity In the figure, the solid line indicates part of the graph of x^4 and the dashed line indicates part of the graph of a linear function.
    The linear function fails to be strictly convex since linear functions satisfy

    f(tx + (1-t)y) = tf(x) + (1-t)f(y) .





    Concavity and strict concavity: when -f is, respectively, convex and strictly convex.







    Remarks.
    1. It is instructive to compare convexity/concavity with linearity and view the former as weak versions of linearity.

    2. It is common to extend the definition of convexity to extended functions, i.e., those of the form f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\} .

      For example, the indicator function \tilde{I}_{(-\infty,2)} is convex in this sense.
      To give insight, consider the image below, where the thick line is the “graph” of \tilde{I}_{(-\infty,2)} and the dashed line is the “secant line” connecting the points (1,0) to (x,\infty) for any x\geq 2 .
















    Examples
    1. All linear functions are convex and concave on their domains.
    2. e^x is convex on \mathbb{R}.
    3. |x|^p is convex on \mathbb{R} for p \geq 1 .
    4. x^p is convex on \mathbb{R}_{++} for p \geq 1 or p \leq 0 and concave for 0 \leq p \leq 1.
    5. - \log\det X is convex on \boldsymbol{S}_{++}^n .
    6. If C \subset \mathbb{R}^n is convex, then its indicator function \tilde{I}_C is convex (in the extended value sense).
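    N.B.: claims like these are easy to spot-check numerically; a minimal randomized test of Jensen's inequality for |x|^p with p=3 (assuming NumPy; the small 1e-9 slack absorbs floating point error):

    import numpy as np

    # Randomized check of f(tx + (1-t)y) <= t f(x) + (1-t) f(y) for f(x) = |x|^3.
    rng = np.random.default_rng(0)
    f = lambda x: np.abs(x)**3
    x, y = rng.uniform(-5, 5, size=(2, 100_000))
    t = rng.uniform(0, 1, size=100_000)
    print(bool(np.all(f(t*x + (1 - t)*y) <= t*f(x) + (1 - t)*f(y) + 1e-9)))  # True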















    One Dimensional Characterization Proposition. Let f:\mathbb{R}^n \to \mathbb{R} have convex domain and, given

    x \in \text{dom}\, f and v \in \mathbb{R}^n ,

    define the function g:\mathbb{R} \to \mathbb{R} by

    g(t) = f(x+tv)

    with

    \text{dom}\,g := \{ t \in \mathbb{R} : x + tv \in \text{dom}\, f \}.

    Then f is convex iff g is convex for all x \in \text{dom}\,f and v \in \mathbb{R}^n such that g is well-defined.
    Proof.
    Step 1. First note that \text{dom}\,g is convex: it parametrizes the intersection of \text{dom}\,f with the line through x in direction v , and so is an interval.



    Step 2. (\implies ) Suppose f is convex and let x \in \text{dom}\,f and v \in \mathbb{R}^n be arbitrary.
    Then, for \theta \in [0,1] and t_1,t_2 \in \text{dom}\, g, there holds

    \begin{aligned} g(\theta t_1 + (1-\theta) t_2) &= f(x + (\theta t_1 + (1-\theta)t_2)v)\\ &= f(x + \theta t_1 v + (1-\theta) t_2 v)\\ &= f(\theta(x + t_1 v) + (1-\theta)(x+t_2 v) ) \\ &\leq \theta f(x+t_1 v) + (1-\theta)f(x + t_2 v)\\ &= \theta g(t_1) + (1-\theta)g(t_2), \end{aligned}

    proving that g is convex.



    Step 3. (\impliedby ) Suppose now that each g is convex.
    Fix x,y \in \text{dom}\,f and let \theta \in [0,1] .
    We want to show

    f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y) .

    Let v = y-x and

    g(t) = f(x + t(y-x)),

    noting that g(0)=f(x), g(1)=f(y).
    Since g is convex, we conclude

    \begin{aligned} f(\theta x + (1-\theta)y) &= f(x + (1-\theta)(y-x))\\ &= g(1-\theta)\\ &=g(\theta \cdot 0 + (1-\theta)\cdot 1)\\ &\leq \theta g(0) + (1-\theta) g(1)\\ &=\theta f(x) + (1-\theta) f(y). \end{aligned}

    This is enough to conclude f is convex.















    First Order Characterization Proposition. If f:\mathbb{R}^n \to \mathbb{R} is differentiable with convex domain \text{dom}\, f, then f is convex iff

    f(x) + \nabla f(x)^{T}(y-x) \leq f(y), \quad \forall x,y \in \text{dom}\, f.


    Proof (sketch). We prove it in case f:\mathbb{R} \to \mathbb{R} ; the higher dimensional case follows by using that f:\mathbb{R}^n \to \mathbb{R} with convex domain is convex iff it is convex as a single variable function when restricted to lines intersecting \text{dom}\,f .
    Throughout, let x,y \in \text{dom}\,f and t \in [0,1] .
    Step 1. (\implies ) If f is convex, then we obtain the following inequalities

    \begin{aligned} f(ty + (1-t)x) \leq tf(y) + (1-t)f(x)\quad&\text{(convexity)}\\ f(x) + \frac{f(x+t(y-x))-f(x)}{t}  \leq f(y)\quad&\text{(rearranging)}\\ f(x) + \frac{f(x+t(y-x))-f(x)}{t(y-x)}(y-x)  \leq f(y)\quad &\text{(rearranging)}\\ f(x) + f'(x)(y-x)  \leq f(y)\quad&\text{(taking }t \to 0 \text{)}. \end{aligned}





    Step 2. (\impliedby ) Supposing

    f(x) + f'(x)(y-x) \leq f(y)

    we set z = tx + (1-t)y and add the two inequalities

    \begin{aligned} t f(z) + tf'(z)(x - z) &\leq tf(x)\\ (1-t) f(z) + (1-t)f'(z)(y - z) &\leq (1-t)f(y) \end{aligned}

    to obtain

    f(tx+(1-t)y)=f(z) \leq tf(x) + (1-t)f(y).








    Remarks.
    1. For fixed x \in \text{dom}\,f, the mapping

      A_x: \, y \mapsto f(x) + \nabla f(x)^{T}(y-x)

      is affine, and its graph is a hyperplane passing through the point (x,f(x)) .
      Therefore, the inequality A_x(y) \leq f(y) means this hyperplane is a tangent plane at (x,f(x)) of the graph of f lying under the graph of f .
      In fact, this plane is a supporting hyperplane of the epigraph

      \text{epi}\,f := \{ (x,t) \in \mathbb{R}^{n} \times \mathbb{R} : x \in \text{dom}\,f, t \geq f(x)\}

      at the point (x,f(x)) .


    2. The affine mapping A_x is just the first order Taylor approximation of f at x .
      Thus, differentiable convex functions are such that their first order Taylor approximations serve as global underestimators of f .

    Example. In the image below:
    • solid line is the graph of f(x) = e^x ;
    • shaded region is the convex set given by \text{epi}\,f = \{ (x,t): t\geq e^x \} ;
    • dashed line is the supporting hyperplane at (-1,e^{-1}) given by the graph of e^{-1} + e^{-1}(x+1) .


    f(x) + f'(x)(y-x) gives a supporting hyperplane
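    N.B.: this picture is easy to verify on a grid; a minimal sketch for the example above (tangent to f(x) = e^x at x = -1), assuming NumPy:

    import numpy as np

    # The tangent line at x0 = -1 should underestimate e^x everywhere.
    x0 = -1.0
    xs = np.linspace(-4.0, 4.0, 2001)
    tangent = np.exp(x0) + np.exp(x0) * (xs - x0)  # f(x0) + f'(x0)(x - x0)
    print(bool(np.all(tangent <= np.exp(xs))))     # True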















    Second Order Characterization Proposition. If f is twice-differentiable with \text{dom}\,f convex, then f is convex iff

    \nabla^2 f(x) \succeq 0, \quad \forall x \in \text{dom}\, f.


    Recall: if f: \mathbb{R}^n \to \mathbb{R} is twice-differentiable, then its Hessian is

    \nabla^2 f(x) =  \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \frac{\partial^2 f}{\partial x_2 \partial x_1 }(x) & \frac{\partial^2 f}{\partial x_2^2 }(x) & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(x) \\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n \partial x_1}(x) & \frac{\partial^2 f}{\partial x_n \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x) \\ \end{bmatrix} \in \mathbb{R}^{n \times n}.


    Proof (sketch).
    Step 0. The proof is a little more involved, so let us just give two intuitive justifications.



    Justification 1. The second order Taylor approximation gives

    f(y) = f(x) + \nabla f(x)^T (y-x)  + \frac{1}{2} (y-x)^T \nabla^2 f(x)(y-x)

    up to some small error. But

    \nabla^2 f(x) \succeq 0

    implies

    (y-x)^T \nabla^2 f(x)(y-x) \geq0

    and so

    f(y) \geq f(x) + \nabla f(x)^T (y-x)

    (again, up to some small error).
    The first order approximation from Convex Function Theory.First Order Characterization then implies convexity.



    Justification 2. Another intuitive justification is that \nabla^2 f(x) \succeq 0 means the graph of f curves everywhere upward like a paraboloid, which evidently suggests convexity.





    Remarks.
    1. Recall: for x \in \text{dom}\,f , there holds

      \begin{aligned} \nabla^2 f(x) \succeq 0 & \iff \nabla^2 f(x) \text{ is positive semidefinite}\\ & \iff \nabla^2 f(x) \in \boldsymbol{S}_+^n. \end{aligned}

    2. \nabla^2 f (x) \succ 0 for all x \in \text{dom}\, f implies f is strictly convex.
      Converse is false: x^4 is strictly convex, yet its second derivative 12x^2 vanishes at x=0 .
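    N.B.: a minimal numerical illustration of the second order test for f(x,y) = x^2 + e^{y^2} (this function reappears in the level set figures below); its Hessian works out to \text{diag}(2, (2+4y^2)e^{y^2}) , which we check on a grid:

    import numpy as np

    # Hessian of f(x, y) = x^2 + exp(y^2): diag(2, (2 + 4 y^2) exp(y^2)).
    def hessian(x, y):
        return np.array([[2.0, 0.0],
                         [0.0, (2.0 + 4.0 * y**2) * np.exp(y**2)]])

    ok = all(np.all(np.linalg.eigvalsh(hessian(x, y)) >= 0)
             for x in np.linspace(-3, 3, 31) for y in np.linspace(-3, 3, 31))
    print(ok)  # True: Hessian PSD at every grid point, consistent with convexity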
















    Level Sets Fix a function f: \mathbb{R}^n \to \mathbb{R} and let c \in \mathbb{R} .
    c-Level set: the set

    S(c):=\{ x \in \text{dom}\, f : f(x) = c\}.

    The figure below depicts level sets of x^2 + e^{y^2} with c=5,10,17,26.


    c-Sublevel set: the set

    S_c = \{ x \in \text{dom}\,f : f(x) \leq c\}.

    The figure below depicts the sublevel sets of x^2+e^{y^2} with c=5,10,17,26.
    Each shade of gray indicates a new sublevel set and of course S_c \subset S_{c'} for c<c' .


    c-Superlevel set: the set

    S^c = \{ x \in \text{dom}\,f : f(x) \geq c \}.

    The figure below depicts the superlevel sets of x^2+e^{y^2} with c=5,10,17,26.
    Each shade of gray indicates a new superlevel set and of course S^c \supset S^{c'} for c<c' .
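    N.B.: figures like the three above can be reproduced in a few lines of Matplotlib; a sketch for x^2 + e^{y^2} at the levels c = 5, 10, 17, 26 (assuming NumPy/Matplotlib):

    import numpy as np
    import matplotlib.pyplot as plt

    # Sample f(x, y) = x^2 + exp(y^2) on a grid.
    x = np.linspace(-6, 6, 400)
    y = np.linspace(-2, 2, 400)
    X, Y = np.meshgrid(x, y)
    F = X**2 + np.exp(Y**2)

    levels = [5, 10, 17, 26]
    plt.contour(X, Y, F, levels=levels)                          # level sets S(c)
    plt.contourf(X, Y, F, levels=[F.min()] + levels, alpha=0.3)  # sublevel sets S_c
    plt.show()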






    Proposition. If f is convex, then the sublevel set S_c is convex for all c \in \mathbb{R}.
    Equivalently, if f is concave, then the superlevel set S^c is convex for all c \in \mathbb{R} .


    Proof. Want to show: x,y \in S_c implies tx + (1-t)y \in S_c for all t \in [0,1] .
    If x,y \in S_c , then f(x),f(y) \leq c and so convexity of f gives

    \begin{aligned}  f(tx+(1-t)y) &\leq t f(x) + (1-t)f(y) \\ &\leq t c + (1-t)c \\ &= c  \end{aligned}


    and hence tx + (1-t)y \in S_c as desired.















    Graphs Fix a function f: \mathbb{R}^n \to \mathbb{R}.
    Graph: the set

    \{ (x,f(x)):x \in \text{dom}\,f \} \subset \mathbb{R}^n \times \mathbb{R}.

    Example: graph of e^x is given below.


    Epigraph: the set

    \text{epi}\, f = \{ (x,t): x \in \text{dom}\, f, f(x) \leq t \} \subset \mathbb{R}^n \times \mathbb{R}.

    Example: epigraph of e^x is given below.


    Hypograph: the set

    \text{hypo}\, f = \{ (x,t) : x \in \text{dom}\, f, f(x) \geq t \} \subset \mathbb{R}^n \times \mathbb{R}.

    Example: hypograph of e^x is given below.






    Proposition. f is convex iff \text{epi}\, f is convex.
    Equivalently, f is concave iff \text{hypo}\, f is convex.
    Proof. (sketch) We consider the case f:\mathbb{R} \to \mathbb{R} for simplicity.
    Step 1. (\implies ) Suppose f is convex and let x,y \in \text{epi}\,f be distinct points.
    If x,y both lie on a vertical line, then clearly tx + (1-t)y \in \text{epi}\,f for t \in [0,1] ; thus, suppose otherwise.
    Let \ell be the line passing through x,y and let x',y' be the two intersection points of \ell with the graph of f .
    (If at most one intersection point exists, then it is easy to see that the segment connecting x and y is in \text{epi}\,f .)
    By convexity of f , the segment formed by tx' + (1-t)y' for t \in [0,1] lies in \text{epi}\,f , which is enough to conclude the segment given by tx + (1-t)y for t \in [0,1] lies in \text{epi}\,f.
    This shows \text{epi}\, f is convex.
    Step 2. (\impliedby ) Suppose now \text{epi}\,f is convex.
    Let x,y be two distinct points on the graph of f .
    Then x,y \in \text{epi}\, f.
    But convexity of \text{epi}\,f implies the segment formed by tx+(1-t)y for t \in [0,1] lies entirely in \text{epi}\,f.
    This is enough to conclude f is convex.















    Convex Calculus The following list details some operations and actions that preserve convexity.
    The main point: to conclude a function f is convex, one often verifies that f may be built from other convex functions using, for example, the operations below.
    N.B.: Conclusions only hold on common domains of the functions.
    Conical combinations:

    f_1,\ldots,f_m convex and c_1,\ldots,c_m \geq0

    \implies

    c_1 f_1 + \cdots + c_m f_m convex.





    Weighted averages:

    f(x,y) convex in x, w(y) \geq0 \implies \int f(x,y) w(y) dy convex.





    Affine change of variables:

    f:\mathbb{R}^n \to \mathbb{R} convex, A \in \mathbb{R}^{n\times m}, b \in \mathbb{R}^n

    \implies

    f(A x + b) convex.





    Maximum:

    f_1,\ldots,f_m convex

    \implies

    f(x) := \max\{f_1(x),\ldots,f_m(x)\} convex.





    Supremum:

    f(x,a) convex in x for each a \in \mathcal{A}

    \implies

    h(x):=\sup\{f(x,a):a \in \mathcal{A}\} convex.



    Justification. For t \in [0,1] , there holds

    \begin{aligned}  h(tx + (1-t)y) &= \sup \{ f(tx + (1-t)y,a) : a \in \mathcal{A} \}\\ &\leq \sup \{ tf(x,a) + (1-t) f(y,a) : a \in \mathcal{A}\}\\ &\leq t \sup\{ f(x,a) :a \in \mathcal{A}\} + (1-t)\sup\{f(y,a):a \in \mathcal{A} \}\\ &=th(x) + (1-t)h(y). \end{aligned}



    Example. Let

    \begin{aligned} g(x)&:\mathbb{R}^n \to \mathbb{R} \text{ be given}\\ f(x,y)&:= y^T x - g(x). \end{aligned}

    N.B.: for each x , the mapping y \mapsto f(x,y) is affine and hence convex.
    Thus

    h(y):=\sup\{y^Tx - g(x) : x \in \text{dom}\,g \}

    defines a convex function.




    Infimum:

    f:\mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R} convex in (x,y)

    C \subset \mathbb{R}^m convex

    \inf_{y \in C}f(x,y) finite for some x

    \implies

    g(x):=\inf_{y \in C}f(x,y) convex on

    \text{dom}\,g := \{ x : (x,y) \in \text{dom}\, f \text{ for some } y \in C\}.




















    Fenchel conjugation Let f:\mathbb{R}^n \to \mathbb{R} be given (not necessarily convex).
    Fenchel conjugate: f^*(y) = \sup \{ y^T x - f(x) : x \in \text{dom}\,f\}.

    N.B.:

    \text{dom}\, f^* = \{ y \in \mathbb{R}^n : f^*(y) < \infty \} ;

    i.e., those y \in \mathbb{R}^n for which y^Tx - f(x) is bounded above on \text{dom}\,f as a function of x .

    Intuition. Suppose f: \mathbb{R}_+ \to \mathbb{R}_+ is a differentiable convex function denoting the cost to produce x items.
    For a given unit price y \in \mathbb{R}_+ , the profit of selling x units is

    P(x,y) = yx - f(x) .

    Thus f^*(y) is just the optimal profit for selling at price y .
    N.B.: f convex implies P(\cdot,y) is concave for each y .
    Thus, P(x,y_0) is maximal at x_0 satisfying P'(x_0,y_0) = 0 , i.e., when y_0 = f'(x_0) .
    Viz.: the x_0 where f has slope y_0 .
    The tangent line through (x_0,f(x_0)) is then given by y=y_0 (x - x_0) + f(x_0) .
    Lastly, note that the y-intercept of this line is -y_0x_0 + f(x_0)=-f^*(y_0) .


    Remarks
    1. Often f^* is just called the conjugate function of f .
    2. Since f^* is the supremum of a family of affine functions, f^* is always convex, even if f is not.
      (Follows from Convex Function Theory.Convex Calculus.)
    3. If
      • f is convex
      • \text{epi}\,f is a closed subset of \mathbb{R}^n \times \mathbb{R} ,
      then f^{**} = f .


    Example. We will compute the conjugate function of

    \begin{aligned} f(x) &= e^x\\ \text{dom}\,f &= \mathbb{R}. \end{aligned}


    Thus, let

    \begin{aligned} h_y(x) &= yx - f(x) \\ &= yx - e^x . \end{aligned}

    Case y<0 : h_y(x) is unbounded above on \text{dom}\,f = \mathbb{R} since

    h_y(x) \to +\infty as x \to -\infty .

    Thus

    \sup\{yx-e^x: x \in \text{dom}\,f\} = +\infty .



    Case y>0 : Compute

    \begin{aligned} h_y'(x) &= y - e^x =0 \text{ when } x=\log y \\  h_y''(x)&=-e^x \leq 0. \end{aligned}

    Thus x=\log y maximizes h_y and so

    \begin{aligned} \sup\{yx-e^x:x \in \text{dom}\,f\} &= \max\{yx-e^x:x \in \text{dom}\,f \}\\ &= h_y(\log y) \\ &= y\log y - y  \end{aligned}



    Case y=0 : Compute

    h_0(x) = -e^x,

    which evidently has least upper bound 0 and so

    \sup\{ -e^x : x \in \text{dom}\,f \} = 0.



    Conclusion: Since

    yx-e^x

    is bounded above on \text{dom}\,f only for y \geq 0 , it follows that

    \text{dom}\, f^* = \mathbb{R}_{+} .

    Putting everything together:

    f^*(y) = \sup\{yx-e^x\} = y\log y - y for y \in \mathbb{R}_+ ,

    where we take 0 \log 0 = 0 .
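    N.B.: a minimal numerical check of this conclusion, approximating the supremum over a grid (assuming NumPy):

    import numpy as np

    # Compare a grid approximation of sup_x (yx - e^x) with y log y - y.
    xs = np.linspace(-20.0, 10.0, 300_001)
    for y in [0.5, 1.0, 2.0, 5.0]:
        approx = np.max(y * xs - np.exp(xs))
        exact = y * np.log(y) - y
        print(y, approx, exact)  # the last two columns should nearly agree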















    Legendre Transform Let f:\mathbb{R}^n \to \mathbb{R} be convex, differentiable and with \text{dom}\,f = \mathbb{R}^n .
    Then, the Fenchel conjugate f^* of f is often called the Legendre transform of f .


    Proposition. If f is as above, z \in \mathbb{R}^n and y = \nabla f(z), then

    f^*(y) = z^T \nabla f(z) - f(z) .

    Proof.
    Step 1. Let

    h_y(x) = y^T x - f(x) .

    and note

    x^\star \in \mathbb{R}^n maximizes h_y iff \nabla h_y(x^\star) = 0

    since h_y is a sum of concave functions and hence concave.
    Step 2. Using Step 1. and

    \begin{aligned}  \nabla(y^T x) &= y\\ \nabla h_y(x) &= \nabla(y^Tx-f(x)) = y - \nabla f(x). \end{aligned}

    conclude

    y = \nabla f(x^\star) iff x^\star maximizes h_y .

    (In particular, z \in \mathbb{R}^n maximizes h_{\nabla f(z)} .)
    Step 3. Letting z, y \in \mathbb{R}^n satisfy

    \begin{aligned} y &= \nabla f(z) \end{aligned}

    and using Steps 1. and 2., we conclude

    \begin{aligned}  f^*(y) &= \sup \{ y^T x - f(x) : x \in \text{dom}\, f\}\\ &= \max \{ h_y(x) : x \in \text{dom}\,f \}\\ & = z^T \nabla f(z) - f(z)  \end{aligned} ,

    as desired.


    Example 1. Let

    f(x) = e^x ,

    and compute

    f'(x) = e^x .

    Given z \in \mathbb{R} , let

    y = f'(z) = e^z ; i.e., z = \log y .

    Thus

    f^*(y) = z f'(z) - f(z) = y \log y - y ,

    which agrees with our calculation for f^* in a previous example.


    Example 2. Fix Q \in \boldsymbol{S}_{++}^n and let

    \begin{aligned} f(x) &= \frac{1}{2} x^T Q x\\ \text{dom}\,f&= \mathbb{R}^n. \end{aligned}

    We will compute

    f^*(y) = \frac{1}{2}y^T Q^{-1}y .

    Step 0. Observe
    1. f is convex:

      \begin{aligned} \nabla^2 f(x) = Q \succ 0. \end{aligned}

      (Justification) Consider case n = 2 .
      Let Q = \begin{bmatrix}a&b\\b&d\end{bmatrix}.
      Thus f(x_1,x_2) = \frac{1}{2}(ax_1^2 + 2 b x_1 x_2 + d x_2^2) .
      Easy now to see \nabla^2 f = Q .


    2. Q \succ 0 implies Q is invertible since then \det Q > 0 .




    Step 1. Using

    \begin{aligned} \nabla f(x) &=\nabla(\frac{1}{2}x^TQx) \\ &= Qx, \end{aligned}

    we conclude

    y = \nabla f(z) \iff y = Qz \iff z = Q^{-1}y.





    Step 2. Let y = \nabla f(z) .
    By preceding proposition and Step 1., there holds

    \begin{aligned}  f^*(y) &= z^T \nabla f(z) - f(z)\\ &= (Q^{-1}y)^T y - \frac{1}{2}(Q^{-1}y)^TQ(Q^{-1}y)\\ &= y^T Q^{-1}y - \frac{1}{2} y^T Q^{-1} y\\ &= \frac{1}{2}y^T Q^{-1} y. \end{aligned}
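    N.B.: this closed form is easy to validate by maximizing h_y directly; a sketch with a randomly generated positive definite Q (assuming NumPy/SciPy):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    Q = A @ A.T + 4.0 * np.eye(4)   # random symmetric positive definite matrix
    y = rng.standard_normal(4)

    closed_form = 0.5 * y @ np.linalg.solve(Q, y)   # (1/2) y^T Q^{-1} y

    # Maximize h_y(x) = y^T x - (1/2) x^T Q x by minimizing its negative.
    res = minimize(lambda x: -(y @ x - 0.5 * x @ Q @ x), x0=np.zeros(4))
    print(closed_form, -res.fun)    # should agree up to solver tolerance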
















    Other Notions of Convexity There are two other important notions of convexity that we will return to if needed.
    Let f:\mathbb{R}^n \to \mathbb{R} be given.

    Quasiconvexity: \text{dom}\,f and the sublevel sets

    \{x \in \text{dom}\,f: f(x) \leq \alpha \}

    are convex for all \alpha \in \mathbb{R}.
    Features:
    1. Quasiconvex problems may sometimes be suitably approximated by convex problems.
    2. Local minima need not be global minima.


    Log-convexity: f>0 on \text{dom}\, f and \log f is convex; equivalently

    f(tx+(1-t)y) \leq f(x)^tf(y)^{1-t} for all t \in [0,1].
















    Generalized Convexity Let K \subset \mathbb{R}^m be a proper cone and let f: \mathbb{R}^n \to \mathbb{R}^m .

    K -convexity: for all x,y \in \mathbb{R}^n and t \in [0,1] , there holds

    f(tx + (1-t)y) \preceq_K t f(x) + (1-t)f(y) .

    Strict K -convexity: for all x \neq y \in \mathbb{R}^n and t \in (0,1) , there holds

    f(tx + (1-t)y) \prec_K t f(x) + (1-t)f(y) .

    Examples
    1. (CO Example 3.47)
      Let K = \mathbb{R}_+^n .
      Then f: \mathbb{R}^n \to \mathbb{R}^m is K-convex iff: \text{dom}\, f is convex and for all x,y \in \text{dom}f and t \in[0,1], there holds

      f(tx+(1-t)y) \preceq tf(x) + (1-t)f(y)

      which holds iff

      f_i(tx+(1-t)y) \leq tf_i(x) + (1-t)f_i(y)

      for each i = 1,\ldots, m , i.e., iff f is componentwise convex.
    2. (CO Example 3.48)
      A function f: \mathbb{R}^n \to \boldsymbol{S}^m is \boldsymbol{S}_+^m-convex iff : \text{dom}\,f is convex and for all x,y \in \text{dom}\,f and t \in [0,1] , there holds

      f(tx + (1-t)y) \preceq t f(x) + (1-t)f(y) .

      N.B.:
      • this is a matrix inequality and \boldsymbol{S}_+^n -convexity is often called matrix convexity.
      • f is matrix convex iff z^Tf(x)z is convex for all z \in \mathbb{R}^m .
      • The two functions

        \begin{aligned} \mathbb{R}^{n \times m} \ni X &\mapsto XX^T\\ \boldsymbol{S}_{++}^n \ni X &\mapsto X^p, \quad 1 \leq p \leq 2, -1 \leq p \leq 0. \end{aligned}

        are matrix convex.
    Basics of Optimization Problems
    General Optimization Problems By an optimization problem (OP) we mean the following:

    \text{(OP)} \begin{cases} \text{minimize } & f_0(x) \quad \text{(objective)}\\ \text{subject to }& f_i(x) \leq 0, \quad i=1,\ldots,m \quad \text{(inequality constraints)}\\ & h_i(x) = 0, \quad i=1,\ldots,p \quad \text{(equality constraints)}  \end{cases}.

    We call

    f_0:\mathbb{R}^n \to \mathbb{R} the objective function;
    x \in \mathbb{R}^n the optimization variable or parameters;
    f_i:\mathbb{R}^n \to \mathbb{R}, i=1,\ldots,m, the inequality constraint functions; and
    h_i:\mathbb{R}^n \to \mathbb{R}, i=1,\ldots,p, the equality constraint functions.

    The domain of (OP) is the intersection

    D = \bigcap_{i=0}^{m} \text{dom} \, f_i \cap \bigcap_{i=1}^{p} \text{dom} \, h_i.
















    Feasibility Consider an (OP) as above.
    Feasible point: those x \in D satisfying

    \begin{aligned} f_i(x)&\leq 0\quad\text{for } i =1,\ldots,m\\  h_i(x) &= 0 \text{ for }i=1,\ldots,p. \end{aligned}

    Feasible set: the subset F \subset D consisting of the feasible points.
    Feasible problem: A problem with nonempty feasible set, i.e., F \neq \emptyset.
    Infeasible problem: A problem with empty feasible set; i.e., there are no x \in D which satisfy the inequality and equality constraints.

    Remark.
    1. A feasible problem need not have a solution; e.g., f(x) = e^x has no minimizer (and hence no minimum) on \mathbb{R} .
    2. An infeasible problem never has a solution–there are no parameters x to even test.















    Basic Example Consider the problem

    \begin{cases} \text{minimize } & \log(1-x^2-y^2)\\  \text{subject to }& (x-1)^2+(y-1)^2-1\leq0\\ & (x-y-1)^2+(y-1)^2-1\leq0 \end{cases}.

    The objective function is

    \begin{aligned}  f_0(x,y) &= \log(1-x^2-y^2)\\ \text{dom}\,f_0 &= \{ (x,y) \in \mathbb{R}^2 : x^2+y^2<1\} \end{aligned} ,

    The inequality constraint functions are

    \begin{aligned} f_1(x,y) &=(x-1)^2 + (y-1)^2-1\\ f_2(x,y) &=(x-y-1)^2 + (y-1)^2 - 1\\ \text{dom}\,f_1 &= \text{dom}\,f_2 = \mathbb{R}^2. \end{aligned} .

    The domain of the problem is

    D = \text{dom}\, f_0 \cap \text{dom}\, f_1 \cap \text{dom}\, f_2 = \{ x^2+y^2<1\} .

    The feasible set: Let

    \begin{aligned} A &= \text{dom}\,f_0\\ B &= \{(x-1)^2+(y-1)^2-1\leq0\}\\ C &= \{(x-y-1)^2 +(y-1)^2-1\leq0\} \end{aligned} .

    These three sets are depicted in the image below.
    Note that the darkest region given by A \cap B \cap C is the feasible set.
    Can we solve the problem? Noting
    1. \log(1-x^2-y^2) \to -\infty as (x,y) approaches a point on the circle \{ x^2+y^2 = 1 \} , and
    2. such sequences exist in the feasible set,
    we conclude the problem does not have a solution.















    The Feasibility Problem Feasibility problem: Given an (OP) with

    \begin{aligned} &\text{inequality constraint functions } f_i, i = 1,\ldots,m\\ &\text{equality constraint functions } h_i, i = 1, \ldots,p \end{aligned}

    solve

    \begin{cases} \text{find} & x\\ \text{subject to} & f_i(x) \leq 0, i = 1, \ldots, m\\ & h_i(x) = 0, i = 1,\ldots, p \end{cases}.

    Viz.: the feasibility problem determines whether the constraints are consistent.

    Example 1. The problem

    \begin{cases} \text{find} & (x,y)\\ \text{subject to} &f_1(x,y) = x^2 + y^2 - 1 \leq 0\\ &f_2(x,y) = (x-1)^2 + y^2 - 1 \leq 0 \end{cases}

    has a solution since the two inequality constraints describe two intersecting disks.
    This is depicted below.


    Example 2. The problem

    \begin{cases} \text{find} & (x,y)\\ \text{subject to} &f_1(x,y) = x^2 + y^2 - 1 \leq 0\\ &f_2(x,y) = (x-1)^2 + y^2 - 1 \leq 0\\ &h_1(x,y) = (x-\frac{1}{2})^2 + y^2 - 1 =0 \end{cases}

    has no solution since the circle given by h_1=0 lies outside of the intersection of the two disks.
    This is depicted below, where the red circle is given by h_1=0 .
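    N.B.: for low-dimensional problems like these, feasibility can be probed crudely on a grid; a sketch for the two examples (the equality constraint is thickened to |h_1| \leq 10^{-2} so it can register on a finite grid), assuming NumPy:

    import numpy as np

    # Evaluate the constraints of Examples 1 and 2 on a grid over [-2, 2]^2.
    x = np.linspace(-2.0, 2.0, 801)
    X, Y = np.meshgrid(x, x)
    f1 = X**2 + Y**2 - 1 <= 0
    f2 = (X - 1)**2 + Y**2 - 1 <= 0
    h1 = np.abs((X - 0.5)**2 + Y**2 - 1) <= 1e-2  # thickened equality constraint
    print(bool(np.any(f1 & f2)))       # True: Example 1 is feasible
    print(bool(np.any(f1 & f2 & h1)))  # False: Example 2 is infeasible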















    Optimal Value and Solvability Recall:

    \begin{aligned} F &= \text{ feasible set of problem}\\ D &= \text{ domain of problem }. \end{aligned}

    Optimal value: The value

    p^\star = \inf \{ f_0(x) : x \in F \},

    i.e., p^\star is the largest p \in \mathbb{R} such that p\leq f_0(x) for all x \in F .
    N.B.: p^\star \in \mathbb{R} \cup \{ -\infty, + \infty \} .

    Example: Below depicts the graph of f(x) = \frac{1}{x} on \mathbb{R}_{++} .
    Evidently, \inf \{ \frac{1}{x} : x \in \mathbb{R}_{++} \} = 0 .




    Solvable: When the problem satisfies

    there exists x^\star \in F with f_0(x^\star) = p^\star,

    i.e., the minimum value p^\star is attainable.

    Example: Below depicts the graph of a quartic q(x) .
    The problem of minimizing q(x) on \mathbb{R} is solvable with solution given by the minimal point A .
    N.B.: Point B is only a local minimum, not a global one, and hence does not give a solution.






    Remarks.
    1. p^\star = \min\{f_0(x):x \in F \} iff the (OP) is solvable.
      Indeed, \min\{f_0(x):x \in F \} is not well-defined unless the (OP) is solvable.
    2. p^\star need not be finite:
      p^\star = -\infty if f_0 is unbounded below on the feasible set; and
      p^\star = + \infty if the OP is infeasible.



















    Standard Form Optimization problems need not be placed in the form we defined them.
    We therefore introduce the following definition.

    (OP) in Standard form:

    \text{(OP)} \begin{cases} \text{minimize } & f_0(x) \\ \text{subject to }& f_i(x) \leq 0, \quad i=1,\ldots,m \\ & h_i(x) = 0, \quad i=1,\ldots,p  \end{cases}.

    (This is how we defined (OP) before.)




    Example: Rewriting in standard form. We can recast more general optimization problems in standard form; e.g., consider

    \text{(OP2)} \begin{cases} \text{maximize } & F_0(x) \\ \text{subject to }& F_i(x) \leq G_i(x), \quad i=1,\ldots,m \\ & H_i(x) = K_i(x), \quad i=1,\ldots,p \end{cases}.

    Indeed, taking
    f_0 = -F_0 (noting \min f_0 = -\max F_0)
    f_i = F_i - G_i for i=1,\ldots,m
    h_i = H_i - K_i for i=1,\ldots,p
    we readily recast (OP2) into the standard form (OP).















    Equivalent Problems Suppose we are given two OP’s: (OP1) and (OP2).
    We say (OP1) and (OP2) are equivalent if: solving (OP1) allows one to solve (OP2), and vice versa.
    N.B.: Two problems being equivalent does not mean the problems are the same nor that they have the same solutions.

    Example. Consider the two problems:

    \begin{cases} \text{minimize} & f(x) = x^2\\ \text{subject to}& x \in [1,2] \end{cases}  \quad \text{and} \quad  \begin{cases} \text{minimize}&g(x) = (x+1)^2+1 \\ \text{subject to}&x \in [0,1] \end{cases}.

    Observing

    x^\star \in [1,2] minimizes f(x) on [1,2]

    iff

    x^\star - 1 \in [0,1] minimizes g(x) on [0,1] ,

    we readily see the two problems are equivalent.
    Indeed, if we find the solution x^\star = 1 to the first problem, we readily obtain the solution x^\star - 1 = 0 to the second problem, and vice versa.















    Change of Variables Suppose \phi:\mathbb{R}^n \to \mathbb{R}^n is an injective function with D \subset \phi(\text{dom} \,\phi).
    Then, under the change of variable x \mapsto \phi(x) , we have

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ & h_i(x) = 0, i=1,\ldots,p \end{cases}

    is equivalent to

    \text{(OP2)} \begin{cases} \text{minimize} & f_0(\phi(x))\\ \text{subject to}& f_i(\phi(x)) \leq 0, i=1,\ldots,m\\ & h_i(\phi(x)) = 0, i=1,\ldots,p \end{cases}.

    N.B.: such a change of variables does not change the optimal value p^\star .
    Moreover, injectivity may be dropped.

    Justification. Indeed,
    • if x solves (OP1), then \phi^{-1}(x) solves (OP2).
      (More generally, z such that \phi(z) = x solves (OP2).)
    • if z solves (OP2), then \phi(z) solves (OP1).




    Example Consider the problem

    \begin{cases} \text{minimize} & e^x\\ \text{subject to} & \sqrt{x} - y \leq 0\\ &y-5 \leq 0\\ &x-5\leq 0 \end{cases} .

    In the image below, the shaded region is the feasible set and the curve is the graph of f_0(x) = e^x .
    Consider the change of variables

    \phi(x) = x^2 .

    The objective and constraints change as follows:

    \begin{aligned} f_0(x) = e^x & \to f_0(\phi(x)) = e^{x^2}\\ \sqrt{x}-y \leq 0 & \to |x| - y \leq 0\\ y-5 \leq 0 & \to y - 5 \leq 0\\ x - 5 \leq 0 & \to x^2 - 5 \leq 0 \end{aligned}

    The new feasible region and objective function are plotted below.
    Evidently, this change of variable changed a nonconvex (OP) into a convex one.















    Eliminating Linear Constraints Let A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m and x_0 \in \mathbb{R}^n a solution to Ax=b.
    Let B \in \mathbb{R}^{n\times k} be such that \text{range}\, B = \text{kernel}\, A.
    Then Ax=b iff x=By + x_0 for some y \in \mathbb{R}^{k}.

    Consequently

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ & Ax = b \end{cases}

    is equivalent to

    \text{(OP2)} \begin{cases} \text{minimize} & f_0(By+x_0)\\ \text{subject to}& f_i(By+x_0) \leq 0, i=1,\ldots,m\\ \end{cases}.

    N.B.: this can reduce dimension of problem by \text{rank}\, A many variables. (Recall: n = \text{rank}\,A + \text{null}\,A.)

    Justification. Indeed,
    • if x solves (OP1), then any y \in \mathbb{R}^k with x = By+x_0 solves (OP2), and
    • if y solves (OP2), then x = By + x_0 solves (OP1)


    Example. Consider the minimization problem

    \begin{cases} \text{minimize} & x^2 + y^2 \\ \text{subject to} &x\geq0\\ & y-x=1 \end{cases}.

    We may eliminate the variable y by simply using y=x+1 .

    But, to match with above: let

    \begin{aligned} f_0(x,y)=x^2 +y^2, &\quad A = \begin{bmatrix}-1&1\end{bmatrix}, \quad b = 1 \\ x_0 = \begin{bmatrix}0\\1\end{bmatrix},&\quad B = \begin{bmatrix} 1\\1\end{bmatrix} \end{aligned} .

    Thus

    A \begin{bmatrix}x\\y\end{bmatrix}=b \iff \begin{bmatrix}x\\y \end{bmatrix} = Bt + x_0 = \begin{bmatrix}t\\t+1\end{bmatrix}

    for some t \in \mathbb{R} , and so

    f_0(Bt + x_0) = f_0(t,t+1) = t^2 + (1+t)^2 .



    Therefore, the minimization problem becomes

    \begin{cases} \text{minimize} & t^2 + (t+1)^2 \\ \text{subject to} &t\geq0\\ \end{cases},

    which has the obvious solution t^\star = 0 with optimal value p^\star = 1 .
    Thus, the original problem has solution

    \begin{bmatrix}x\\y\end{bmatrix} = \begin{bmatrix}t\\t+1\end{bmatrix} = \begin{bmatrix}0\\1\end{bmatrix} .
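
    This elimination is easy to check numerically. Below is a minimal sketch (assuming NumPy/SciPy), where scipy.linalg.null_space builds a matrix B with \text{range}\,B = \text{kernel}\,A (a rescaled version of the B chosen above) and the reduced one-variable problem is solved directly:

    ```python
    import numpy as np
    from scipy.linalg import null_space
    from scipy.optimize import minimize_scalar

    A = np.array([[-1.0, 1.0]]); b = np.array([1.0])
    x0 = np.array([0.0, 1.0])        # a particular solution of Ax = b
    B = null_space(A)                # columns span kernel(A)
    B = B * np.sign(B[0, 0])         # fix the sign so that B ~ [1, 1]/sqrt(2)

    f0 = lambda t: np.sum((B @ [t] + x0) ** 2)   # f0(Bt + x0) = x^2 + y^2
    # With this sign convention, the constraint x >= 0 again reads t >= 0.
    res = minimize_scalar(f0, bounds=(0.0, 10.0), method="bounded")
    print(res.x, B @ [res.x] + x0)   # t* ~ 0 and (x, y) ~ (0, 1)
    ```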
















    Slack Variables Given f:\mathbb{R}^n \to \mathbb{R} with f(x) \leq 0 , there is a variable s \geq 0 (namely, s = -f(x) ) such that f(x) + s = 0 ; such a variable s is called a slack variable.

    Using slack variables s_i,i=1\ldots,m , the problem

    \text{(OP1)}\begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ & h_i(x) = 0, i = 1,\ldots,p \end{cases}.

    is equivalent to the problem

    \text{(OP2)}\begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & s_i \geq 0, i = 1,\ldots,m\\ & f_i(x)+s_i = 0, i=1,\ldots,m\\ & h_i(x) = 0, i=1,\ldots,p \end{cases}.



    Remarks.
    1. The x which satisfy the constraints of (OP2) for some s_1,\ldots,s_m \geq 0 are exactly those which satisfy the constraints of (OP1); this justifies the equivalence.
    2. Let F_1 be the feasible set of (OP1) and F_2 that of (OP2).
      Then F_1 \subset \mathbb{R}^n and F_2 \subset \mathbb{R}^{n+m} ; i.e., the feasible sets are not the same object.
    3. Example: in the images below, the disk depicts a feasible set F_1 = \{ x^2+y^2-1 \leq 0 \} \subset \mathbb{R}^2 and the paraboloid-type set depicts the feasible set F_2 = \{x^2+y^2-1+s =0, s \geq 0 \} with slack variable s .
      N.B.: the permissible (x,y) coordinates are the same for both sets.


    Main point: Solving the system of equations

    \begin{aligned} f_i(x)+s_i &= 0, i=1,\ldots,m\\  h_i(x) &= 0, i=1,\ldots,p \end{aligned}

    and considering only those solutions with s_i \geq 0 may be easier than solving the system of inequalities

    \begin{aligned} f_i(x)&\leq0, i=1,\ldots,m\\  h_i(x) &= 0, i=1,\ldots,p \end{aligned}.



    Example. Consider

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(x,y)\\ \text{subject to} & a_1 x + b_1 y - c_1 \leq 0\\ & a_2 x + b_2 y - c_2 = 0 \end{cases}.

    Introduce slack variable s\geq0 satisfying

    a_1 x + b_1 y - c_1 +s = 0.

    Then (OP1) is equivalent to the problem

    \text{(OP2)} \begin{cases} \text{minimize} & f_0(x,y)\\ \text{subject to} & s\geq0\\ & a_1 x + b_1 y - c_1 + s = 0\\ & a_2 x + b_2 y - c_2 = 0 \end{cases}.

    Thus, finding feasible (x,y,s) is just a matter of solving a system of equations and choosing those (x,y,s) with s\geq0 .

    Moreover, one can solve the problem

    \text{(OP3)} \begin{cases} \text{minimize} & f_0(x,y)\\ \text{subject to} & a_1 x + b_1 y - c_1 + s = 0\\ & a_2 x + b_2 y - c_2 = 0 \end{cases}.

    and just choose solutions with s \geq 0 (when such solutions exist) to obtain solutions to (OP2), and hence (OP1).















    Epigraph Form Recall: \text{epi}\, f = \{(x,t): x \in \text{dom}\,f, t \geq f(x) \}.

    The optimization problem

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ & h_i(x) = 0, i=1,\ldots,p \end{cases}

    is equivalent to its epigraph form

    \text{(OP2)} \begin{cases} \text{minimize} & t\\ \text{subject to} &f_0(x) - t \leq0\\ & f_i(x) \leq 0, i=1,\ldots,m\\ & h_i(x) = 0, i=1,\ldots,p \end{cases}

    Viz., minimizing f_0 subject to constraints is equivalent to finding the smallest t such that (x,t) \in \text{epi}\, f_0 for some feasible x .

    Proof by picture. The dark curve and shaded region below indicate the epigraph of a function f .
    The red dot indicates the minimum point (x^\star,p^\star) .
    The black dots indicate points (x^\star,t) \in \text{epi}\,f for different values of t .
    Evidently, the smallest t^\star for which (x^\star,t^\star) \in \text{epi}\, f is given by t^\star = p^\star .















    Fragmenting a Problem Proposition. Given f: \mathbb{R}^n \to \mathbb{R} and sets F,F_1,\ldots,F_q with F = F_1 \cup \cdots \cup F_q, let

    \begin{aligned} p^\star &= \inf \{f(x):x \in F\}\\ p_i^\star &= \inf\{ f(x): x \in F_i\}, \quad i = 1,\ldots, q. \end{aligned}

    Then

    p^\star = \min\{p_i^\star: i = 1,\ldots,q\}.

    (Here we use the convention \min\{-\infty,a\} = -\infty for any real number a .)


    Viz., to minimize a function f on a set F, one may instead minimize f over the pieces F_i and then take the minimum of the resulting optimal values.


    Example. Consider the (OP)

    \text{(OP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & x \in F \end{cases}

    where F \subset \text{dom}\, f_0 and where the feasible set F is depicted below.
    Consider breaking up F into three regions F_1,F_2,F_3 as indicated below.
    Now formulate the (OP)’s

    \text{(OPi)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & x \in F_i \end{cases}

    for i = 1,2,3 , and let p_i^\star be the optimal value for (OPi).
    Using the preceding proposition, the optimal value of (OP) is given by

    p^\star = \min \{ p_1^\star, p_2^\star, p_3^\star \} .

    Conclusion: solving (OP), whose feasible set is not convex, may be achieved by solving three subproblems (OP1),(OP2),(OP3) whose feasible sets are convex.















    Basics of Convex Optimization
    Convex Optimization Problems Abstract convex optimization problem: A problem involving minimizing a convex objective function on a convex set.
    Convex optimization problem: a problem of the form

    \text{(COP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0, i = 1,\ldots,m\\ & a_i^T x = b_i, i =1,\ldots, p \end{cases},

    where
    f_0:\mathbb{R}^n \to \mathbb{R} and f_i:\mathbb{R}^n \to \mathbb{R} are convex; and
    a_i \in \mathbb{R}^n and b_i \in \mathbb{R} are fixed.

















    Some Remarks
    Remark 1. As defined, a (COP) is an (OP) in standard form; naturally, there are nonstandard form (OP)’s equivalent to (COP)’s.
    E.g., the abstract (COP)

    \begin{cases} \text{minimize} & f_0(x,y) \\ \text{subject to} & (x+y+1)^2 = 0 \end{cases}

    is readily seen to be equivalent to the standard form (COP)

    \begin{cases} \text{minimize} & f_0(x,y) \\ \text{subject to}& \begin{bmatrix}1&1\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix} + 1 = 0 \end{cases} .







    Remark 2. We emphasize: the equality constraints are assumed to be affine constraints.
    Moreover, the equality constraints

    a_i^T x = b_i , \quad i=1,\ldots,p

    can be rewritten as

    A x = b ,

    where

    \begin{aligned} A &= \begin{bmatrix} a_1^T\\a_2^T\\\vdots\\a_p^T\end{bmatrix} \in \mathbb{R}^{p \times n}, \quad b = \begin{bmatrix}b_1\\b_2\\\vdots\\b_p\end{bmatrix} \in \mathbb{R}^{p} \end{aligned} .







    Remark 3. The affine assumption on the equality constraints can be lifted at the possible expense of an intractable theory/numerical analysis.
    E.g., if h: \mathbb{R}^n \to \mathbb{R} is quasilinear, then h(x) = 0 defines a convex set.





    Remark 4. Generally, h(x) convex does not imply the level set \{h(x) = 0\} is convex; e.g., h(x,y)=x^2+y^2-1 gives a circle, which is not convex.





    Remark 5. The common domain

    D = \bigcap_{i=0}^m \text{dom}\, f_i

    is convex since it is an intersection of convex sets.















    Optimality for Convex Optimization Problems Assume throughout that f_0:\mathbb{R}^n \to \mathbb{R} is the objective function for some given (COP) and that F is the feasible set.

    Proposition 1. If x^\star is a feasible local minimizer for a (COP), then it is the global minimizer for the (COP).
    Proof. We will follow a proof by contradiction; i.e., we will show that assuming x^\star is not a global minimizer leads to a contradiction.
    Step 1. x^\star being a feasible local minimizer means x^\star \in F and that there is an R>0 such that

    f_0(x^\star) = \inf\{ f_0(z) : z \in F, \quad \Vert z-x^\star \Vert_2 \leq R \} ;

    i.e., f_0(x^\star) \leq f_0(z) for all z \in F within distance R of x^\star .



    Step 2. Supposing x^\star is not a global minimizer, then there exists y \in F such that f_0(y) < f_0(x^\star) .
    By choice of R , there must also hold \Vert{y-x^\star}\Vert_2 > R .



    Step 3. Set

    \begin{aligned} z &= (1-t)x^\star + ty\\ t &= \frac{R}{2\Vert y-x^\star \Vert_2}, \end{aligned}

    noting that t \in (0,\tfrac{1}{2}) \subset [0,1] by Step 2., and so z \in F since F is convex.
    It follows that

    \begin{aligned} \Vert z - x^\star \Vert_2 &= \Vert (1-t)x^\star + ty - x^\star \Vert_2\\ &= \Vert t(y-x^\star ) \Vert_2\\ &= \frac{R}{2\Vert y-x^\star\Vert_2} \Vert y-x^\star\Vert_2\\ &= \frac{R}{2}\\ &< R \end{aligned}





    Step 4. Since z is a convex combination of feasible points, since f_0 is convex and since f_0(y)<f_0(x^\star) , there holds

    \begin{aligned}  f_0(z) &= f_0((1-t)x^\star + ty) \\ &\leq (1-t)f_0(x^\star) + t f_0(y)\\ & < (1-t) f_0(x^\star) + t f_0(x^\star)\\ &= f_0(x^\star). \end{aligned}

    But, since x^\star minimizes f_0 on

    \{x \in F : \Vert x - x^\star \Vert_2 \leq R \}

    and since

    \Vert z - x^\star \Vert_2 \leq R

    we also have

    f_0(x^\star) \leq f_0(z).

    This is a contradiction and so x^\star must be a global minimizer.








    Proposition 2. If f_0 is differentiable on F , then x^\star \in F is a minimizer iff for all y \in F there holds

    \nabla f_0(x^\star)^T (y - x^\star) \geq 0 .

    Proof.
    Step 0. N.B.: since f_0 is differentiable and convex on F , then for each x,y \in F there holds

    f_0(y) \geq f_0(x) + \nabla f_0(x)^T (y-x) .

    (C.f., Convex Function Theory.First Order Characterization.)



    Step 1.(\implies ) Suppose x^\star is a minimizer and suppose for contradiction that

    \nabla f_0(x^\star)^T(y-x^\star)<0

    for some y \in F .
    Set z_t = ty+(1-t)x^\star , noting that z_t \in F since F is convex.
    Using

    \frac{d}{dt} f_0(z_t)|_{t=0} = \nabla f_0(x^\star)^T (y-x^\star) < 0 ,

    we conclude f_0 is decreasing near z_0 = x^\star in the direction y-x^\star and so f_0(z_t)< f_0(x^\star) for small t .
    Since z_t \in F , this contradicts x^\star being a minimizer.
    Additional justification Since z_t defines a line passing through x^\star with direction y-x^\star , it follows that \frac{d}{dt} f_0(z_t)|_{t=0} is the directional derivative in direction (y-x^\star), i.e., \nabla f_0(x^\star)^T(y-x^\star).




    Step 2.(\impliedby ) Supposing

    \nabla f_0(x^\star)^T(y-x^\star) \geq 0

    for all y \in F and using the first order characterization at x^\star , namely,

    f_0(y) \geq f_0(x^\star) + \nabla f_0(x^\star)^T(y-x^\star)

    we readily conclude

    f_0(y) \geq f_0(x^\star)

    for all y \in F ; i.e., that x^\star is a minimizer for the problem.








    Corollary In case f_0 is differentiable and F = \text{dom}\,f_0 (equivalently, there are no nontrivial constraints), x^\star \in \text{dom}\,f_0 is a minimizer iff

    \nabla f_0(x^\star) = 0 .

    Proof. By Proposition 2., we have that x^\star \in \text{dom}\,f_0 is a minimizer iff

    \nabla f_0(x^\star)^T(y-x^\star) \geq 0

    for all y \in \text{dom}\, f_0 .
    Differentiability of f_0 requires \text{dom}\,f_0 is open and so, for small t \in \mathbb{R} , there holds

    y: = x^\star - t \nabla f_0(x^\star) \in \text{dom}\,f_0 .

    But then

    \begin{aligned} \nabla f_0(x^\star)^T(y-x^\star) &= \nabla f_0(x^\star)^T(x^\star - t \nabla f_0(x^\star) - x^\star)\\ &= -t\nabla f_0(x^\star)^T\nabla f_0(x^\star)\\ &= -t \Vert \nabla f_0(x^\star) \Vert_2^2\\ &\geq 0, \end{aligned}

    which, for t>0 , forces \nabla f_0(x^\star) = 0 . (Conversely, \nabla f_0(x^\star) = 0 makes the optimality condition of Proposition 2. hold trivially.)















    Some Examples
    Example 1. Let

    \begin{aligned} Q \in &\boldsymbol{S}_+^n, \quad a \in \mathbb{R}^n, \quad b \in \mathbb{R}\\ f_0(x)&= \frac{1}{2} x^T Q x + a^T x + b. \end{aligned}

    Consider the unconstrained problem:

    \text{(OP)} \begin{cases} \text{minimize} & f_0(x) = \frac{1}{2} x^T Q x + a^T x + b\\ \text{subject to }& x \in \mathbb{R}^n \end{cases} .

    Note that \nabla^2 f_0 = Q \succeq 0 implies f_0 is convex.
    C.f.,Convex Function Theory.Second Order Characterization.


    By the preceding corollary, we have x^\star is a solution to (OP) iff

    \nabla f_0(x^\star) = Qx^\star + a = 0.

    Thus solvability of (OP) rests on whether -a \in \text{range}\, Q.

    Three cases:
    1. -a \notin \text{range}\,Q \implies f_0 is unbounded below and hence (OP) is unsolvable;
    2. Q \succ 0 \implies Q is invertible and so x^\star = -Q^{-1}a is the unique solution to (OP);
    3. Q \not\succ 0 and -a \in \text{range}\,Q \implies Qx^\star = -a has an affine set of solutions.
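
    A minimal numerical sketch of cases 1. and 3., assuming NumPy (the matrices below are illustrative, not from the notes; case 2. amounts to np.linalg.solve):

    ```python
    import numpy as np

    # Q >= 0 but singular: range(Q) is the x1-axis.
    Q = np.array([[2.0, 0.0],
                  [0.0, 0.0]])
    a_cases = {"-a not in range(Q)": np.array([0.0, 1.0]),
               "-a in range(Q)":     np.array([-2.0, 0.0])}

    for label, a in a_cases.items():
        x, *_ = np.linalg.lstsq(Q, -a, rcond=None)   # least-squares solve of Qx = -a
        consistent = np.allclose(Q @ x, -a)          # is Qx = -a actually solvable?
        print(f"{label}: solvable = {consistent}, least-squares x = {x}")
    ```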






    Example 2. Let

    \begin{aligned} f_0:\mathbb{R}^n \to \mathbb{R}&  \text{ be convex and differentiable}\\ A &\in \mathbb{R}^{m \times n}, \quad b \in \mathbb{R}^{m} \end{aligned}

    and consider the problem

    \text{(OP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & Ax = b \end{cases} .

    Using a preceding proposition, x^\star satisfying Ax^\star = b is a minimizer iff

    \nabla f_0(x^\star)^T(y-x^\star) \geq0

    for all y satisfying Ay= b.


    Two cases
    1. Ax=b is an inconsistent system \implies the problem is infeasible.
    2. Ax=b is a consistent system; then

      Ax^\star = b, Ay=b

      \iff

      y = x^\star + x for some x \in \text{null}A .



    In case 2., we have

    \nabla f_0(x^\star)^T(y-x^\star) = \nabla f_0(x^\star)^Tx \geq 0

    for all

    x \in \text{null}A and y = x^\star + x .

    Since \text{null}\,A is a linear space (so x \in \text{null}\,A \implies -x \in \text{null}\,A ), this is possible iff

    \nabla f_0(x^\star)^Tx = 0 for all x \in \text{null}\,A ,

    i.e., iff

    \nabla f_0(x^\star) \perp \text{null}A.

    But (\text{null}\,A)^\perp = \text{range}\,A^T and so this condition means there exists \nu \in \mathbb{R}^m such that

    \nabla f_0(x^\star) + A^T \nu = 0 .

    This is just a Lagrange multiplier condition, as we will see later.















    Linear Programming Linear program: a (COP) of the form

    \text{(LP)} \begin{cases} \text{minimize} & c^T x \\ \text{subject to} & Gx \preceq h\\ & Ax= b \end{cases} ,

    where

    \begin{aligned} &G \in \mathbb{R}^{m \times n}, \quad A \in \mathbb{R}^{p\times n}\\ &h \in \mathbb{R}^m, \quad b \in \mathbb{R}^p, \quad x \in \mathbb{R}^n. \end{aligned}

    The feasible set F is a polyhedron (see below).

    Recall (\preceq ): For a,b \in \mathbb{R}^n the vector inequality

    a \preceq b

    means

    a_1 \leq b_1, \, a_2 \leq b_2, \, \ldots, \, a_n \leq b_n .



    Different than: A,B \in \mathbb{R}^{n \times n} satisfying the matrix inequality A \preceq B , which means B - A is positive semidefinite.







    Determining the Feasible set:
    Step 1. (A x = b ) Given A \in \mathbb{R}^{p \times n} , b \in \mathbb{R}^p , then

    \{x: Ax=b \}

    is an affine subspace of \mathbb{R}^n or empty.


    Step 2. Given \gamma \in \mathbb{R}^n, \eta \in \mathbb{R}, then

    \{x:\gamma^Tx \leq \eta \}

    is a half space in \mathbb{R}^n .


    Step 3. (G x \preceq h ) Given

    g_i \in \mathbb{R}^n, \quad G = \begin{bmatrix} g_1^T \\ g_2^T \\ \vdots \\ g_m^T \end{bmatrix} \in \mathbb{R}^{m\times n}, \quad Gx = \begin{bmatrix} g_1^Tx\\g_2^Tx\\ \vdots \\ g_m^Tx \end{bmatrix}

    Step 2. implies

    \{x : G x \preceq h \} = \{x : g_i^Tx\leq h_i, i = 1,\ldots,m\}

    is a finite intersection of half spaces.


    Step 4. Steps 1. and 3. imply the feasible set F of (LP) is the finite intersection of half spaces and an affine space, i.e., F is a polyhedron.
    (c.f. Convex Geometry.Polyhedra.)








    Example. Let

    \begin{aligned} c&=\begin{bmatrix}0.1\\1\end{bmatrix},\, G= \begin{bmatrix} 0.8&0.8\\ -1&-1\\ 0&-1\\ 0&1 \end{bmatrix},\, h= \begin{bmatrix} 4\\-3\\-1\\3 \end{bmatrix} \end{aligned}.

    Consider the resulting (LP):

    \text{(LP)} \begin{cases} \text{minimize} & c^T x \\ \text{subject to} & Gx\preceq h \end{cases}.

    Explicitly, this (LP) is given by

    \text{(LP)} \begin{cases} \text{minimize} & 0.1x_1 + x_2\\ \text{subject to} & 0.8x_1 + 0.8x_2 \leq 4\\ & -x_1-x_2 \leq -3\\ &-x_2\leq -1\\ &x_2\leq3 \end{cases}.

    The feasible set F and the graph of the objective function over F are indicated in the image below.
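
    The optimal value can be verified with an off-the-shelf solver. A minimal sketch, assuming SciPy's linprog (which accepts precisely this \min c^Tx subject to Gx \preceq h data):

    ```python
    import numpy as np
    from scipy.optimize import linprog

    c = np.array([0.1, 1.0])
    G = np.array([[0.8, 0.8],
                  [-1.0, -1.0],
                  [0.0, -1.0],
                  [0.0, 1.0]])
    h = np.array([4.0, -3.0, -1.0, 3.0])

    res = linprog(c, A_ub=G, b_ub=h, bounds=[(None, None)] * 2)
    print(res.x, res.fun)   # expect roughly x* = (2, 1) with p* = 1.2
    ```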








    Remarks.
    1. Given d \in \mathbb{R} , have equivalent problem with objective c^Tx + d .
      Indeed,

      \begin{aligned} \min\{c^Tx + d :x \in F\} &= \min\{ c^Tx: x \in F \} + d\\ \text{argmin}\{c^Tx + d :x \in F\} &=\text{argmin}\{ c^Tx: x \in F \}. \end{aligned}

      WLOG: can assume d=0 to solve problem.


    2. Since

      \begin{aligned} \max\{c^Tx :x \in F\} &= -\min\{ -c^Tx: x \in F \}\\ \text{argmax}\{c^Tx :x \in F\} &=\text{argmin}\{ -c^Tx: x \in F \}. \end{aligned}

      one also calls the problem of maximizing c^Tx over a polyhedron a (LP).















    Example: Integer Linear Programming Relaxation. Integer Linear Program: an (OP) of the form

    \text{(ILP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ & x \in \mathbb{Z}^n \end{cases}

    where

    \begin{aligned} &G \in \mathbb{R}^{m \times n},\qquad h \in \mathbb{R}^m\\ \mathbb{Z}^n : &= \{x \in \mathbb{R}^n : x_i \text{ an integer for each } i=1,\ldots,n\} \end{aligned}

    The constraint x \in \mathbb{Z}^n is suitable for parameters which take on discrete quantities.
    N.B.: An (ILP) is not a convex problem, but may be approximated by one (see below).

    The feasible set Let

    F = \{x \in \mathbb{Z}^n: Gx \preceq h \}

    denote the feasible set of an (ILP).
    Then F is just the collection of integer vectors in the polyhedron

    P = \{ x \in \mathbb{R}^n : Gx \preceq h \} .



    Example. Consider an (ILP) with constraints given by (x,y) \in \mathbb{Z}^2 satisfying

    \begin{aligned} &-x\leq0\\ &-y\leq0\\ &y-2x\leq1\\ &y\leq2.5\\ &y+0.9x\leq4 \end{aligned}.

    The image below depicts the feasible set F (the collection of dots) and polyhedron P (shaded region).








    Remarks.
    1. Can of course also impose equality constraints Ax=b :

      \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ & Ax = b\\ & x \in \mathbb{Z}^n\\ \end{cases}

    2. If we impose x_i \in \{0,1\} , then the problem is called a boolean linear program:

      \text{(BLP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ & x_i \in \{0,1\} \end{cases}

      Suitable for when coordinates of x indicate when something is “off” or “on” or decision is “no” or “yes”.
      Can also use x_i \in \{-1,1\} instead.








    Relaxation of (ILP) The LP

    \text{(LP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h \end{cases}

    is called a relaxation of the (ILP) and is a convex approximation.
    Important points:
    1. The “tightest” convex relaxation is given by

      F=\{ x \in \mathbb{Z}^n : Gx \preceq h \} \to \text{convex hull of } F.

      Generally speaking, finding the convex hull is not an efficient way of approaching (ILPs).
    2. The relaxation (LP) of (ILP) is generally easier to solve, though exact algorithms exist for (ILP).
    3. If

      \begin{aligned} p_{LP}^\star &= \text{ optimal value for (LP)}\\ p_{ILP}^\star &= \text{ optimal value for (ILP)} \end{aligned}

      then p_{LP}^\star \leq p_{ILP}^\star.
      (Indeed, the relaxation has a larger feasible set.)
    4. If x^\star \in \mathbb{Z}^n solves the (LP), then it solves the (ILP).
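
    Points 3. and 4. can be illustrated on the example polyhedron above. A minimal sketch assuming SciPy, with an illustrative cost vector c (not from the notes); the (ILP) is brute-forced over a small grid of integer points:

    ```python
    import numpy as np
    from itertools import product
    from scipy.optimize import linprog

    # The polyhedron P from the example above, written as G z <= h for z = (x, y).
    G = np.array([[-1.0, 0.0],
                  [0.0, -1.0],
                  [-2.0, 1.0],
                  [0.0, 1.0],
                  [0.9, 1.0]])
    h = np.array([0.0, 0.0, 1.0, 2.5, 4.0])
    c = np.array([-1.0, -1.0])   # maximize x + y, i.e., minimize -(x + y)

    lp = linprog(c, A_ub=G, b_ub=h, bounds=[(None, None)] * 2)

    # Brute-force the (ILP) over a small grid of integer points in P.
    best = min(c @ np.array(z) for z in product(range(6), repeat=2)
               if np.all(G @ np.array(z) <= h))
    print(lp.fun, best)   # p*_LP <= p*_ILP: roughly -4.167 vs -4.0
    ```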








    Relaxation of (BLP) Explicitly, a (BLP) is of the form

    \text{(BLP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ &x_i \in \{0,1\}, i=1,\ldots,n \end{cases}.

    The (LP)

    \text{(LP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ &0 \leq x_i \leq 1, i=1,\ldots,n \end{cases}

    is called a relaxation of the (BLP).
    N.B.: the relaxation

    F \to \{x \in [0,1]^n: Gx \preceq h \}

    generally provides a better approximation than the relaxation

    F \to \{x \in \mathbb{R}^n: Gx \preceq h \} .

    Indeed, the former biases approximate solutions to be close to being binary.







    Example. Problem Given m workers and n locations with m \leq n ,
    • assign each worker to work at some location
    • assign at most one worker to a location
    • minimize cost of operation and transportation
    Notational set up

    \begin{aligned} c_j &= \text{cost to operate at location }j\\ c_{ij} &= \text{cost to transport worker } i \text{ to location }j\\ x_j &=  \begin{cases} 0 & \text{ if location }j\text{ is not operating}\\ 1 & \text{ if location }j\text{ is operating} \end{cases}\\ x_{ij} &=  \begin{cases} 0 & \text{ if worker }i \text{ is not working at location }j\\ 1 & \text{ if worker }i \text{ is working at location }j \end{cases}. \end{aligned}

    Let

    \begin{aligned} \text{Cost vectors} & \begin{cases} c &= (c_j) \in \mathbb{R}^n\\ C &= (c_{ij}) \in \mathbb{R}^{m \times n} \end{cases}\\ \text{Optimization variables} &\begin{cases} x &= (x_j) \in \{0,1\}^n\\ X &= (x_{ij}) \in \{0,1\}^{m \times n} \end{cases} \end{aligned}.



    Construct objective function
    We find

    \begin{aligned} \text{total operational cost }&= c^Tx = c_1x_1 + \cdots +c_nx_n\\ \text{total transportation cost }&= \text{tr}\,(C^TX) = \sum_{i=1}^m \sum_{j=1}^n c_{ij}x_{ij}\\ \text{total cost }&= c^Tx + \text{tr}\,(C^TX). \end{aligned}

    Thus the objective function is f_0(x,X) = c^Tx + \text{tr}\,(C^TX).

    Construct constraints
    The constraint that x_j,x_{ij} are binary is of course natural for the problem.
    Since each worker is assigned to exactly one location, we have

    \sum_{j=1}^n x_{ij} = 1, \quad \text{ for each } i=1,\ldots,m.

    Lastly, observe x_j=0 \implies x_{ij}=0 since the j-th location not operating means it cannot host a worker; thus we have x_{ij} \leq x_j .

    Formulate Problem
    Putting everything together, the (BLP) formulation of the problem is

    \begin{cases} \text{minimize} & c^Tx + \text{tr}\,(C^TX)\\ \text{subject to} & x \in \{0,1\}^n\\ &X \in \{0,1\}^{m \times n}\\ &\sum_{j=1}^n x_{ij} =1, i = 1,\ldots,m\\ & x_{ij} \leq x_j, i=1,\ldots,m,j=1,\ldots,n \end{cases}.

    An (LP) relaxation (and hence convex approximation) of this (BLP) is

    \begin{cases} \text{minimize} & c^Tx + \text{tr}\,(C^TX)\\ \text{subject to} & x \in [0,1]^n\\ &X \in [0,1]^{m \times n}\\ &\sum_{j=1}^n x_{ij} =1, i = 1,\ldots,m\\ & x_{ij} \leq x_j, i=1,\ldots,m,j=1,\ldots,n \end{cases}.
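
    A minimal numerical sketch of this (LP) relaxation, assuming SciPy and a small illustrative instance (the costs are made up); the variables (x,X) are flattened into one vector for linprog:

    ```python
    import numpy as np
    from scipy.optimize import linprog

    # Hypothetical small instance: m = 2 workers, n = 3 locations.
    m, n = 2, 3
    c = np.array([1.0, 2.0, 1.5])        # operating costs c_j
    C = np.array([[0.2, 0.9, 0.8],       # transport costs c_ij
                  [0.7, 0.3, 0.6]])

    # Decision vector z = (x_1, ..., x_n, X_11, ..., X_1n, ..., X_mn) in [0, 1].
    cost = np.concatenate([c, C.ravel()])

    # Equalities: sum_j X_ij = 1 for each worker i.
    A_eq = np.zeros((m, n + m * n))
    for i in range(m):
        A_eq[i, n + i * n : n + (i + 1) * n] = 1.0
    b_eq = np.ones(m)

    # Inequalities: X_ij - x_j <= 0.
    A_ub = np.zeros((m * n, n + m * n))
    for i in range(m):
        for j in range(n):
            r = i * n + j
            A_ub[r, n + r] = 1.0     # coefficient of X_ij
            A_ub[r, j] = -1.0        # coefficient of x_j
    b_ub = np.zeros(m * n)

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * (n + m * n))
    x, X = res.x[:n], res.x[n:].reshape(m, n)
    print(np.round(x, 3), np.round(X, 3), res.fun)
    ```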






















    Quadratic Programming Quadratic program: an (OP) of the form

    \text{(QP)} \begin{cases} \text{minimize}&\frac{1}{2}x^TQx + q^Tx\\ \text{subject to} &Gx \preceq h\\ &Ax =b \end{cases}

    where

    \begin{aligned} &Q \in \boldsymbol{S}^n, \quad G \in \mathbb{R}^{m \times n}, \quad A \in \mathbb{R}^{p \times n}\\ &q \in \mathbb{R}^n, \quad h \in \mathbb{R}^m,\quad  b \in \mathbb{R}^p. \end{aligned}

    If Q \in \boldsymbol{S}_+^n , then the problem is convex (see remarks).

    Remarks.
    1. As for (LP), the constraints

      \begin{cases} Gx \preceq h\\ Ax =b \end{cases}

      describe a polyhedron.








    2. Given d \in \mathbb{R} , then

      \begin{aligned} \min\{\frac{1}{2}x^TQx + q^Tx + d :x \in F\} &= \min\{ \frac{1}{2}x^TQx + q^Tx: x \in F \} + d\\ \text{argmin}\{\frac{1}{2}x^TQx + q^Tx + d :x \in F\} &= \text{argmin}\{ \frac{1}{2}x^TQx + q^Tx: x \in F \} \\ \end{aligned}

      WLOG: can assume d=0 to solve problem.








    3. If

      f_0(x) = \frac{1}{2}x^TQx + q^Tx ,

      then

      \begin{aligned} \nabla f_0(x) &= Qx + q\\ \nabla^2 f_0(x) &= Q. \end{aligned}

      Thus Q \succeq 0 implies f_0 is convex.
      N.B.: The factor \frac{1}{2} is just a convenient normalization.








    4. The generalization

      \begin{aligned} &Gx \preceq h \, \to \, \frac{1}{2} x^T Q_ix+g_i^Tx \leq h_i \\ &Q_i \in \mathbb{R}^{n\times n}, g_i \in \mathbb{R}^n, h_i \in \mathbb{R}, \quad i =1,\ldots,m \\ \end{aligned}

      results in quadratically constrained quadratic programming (QCQP).
      Imposing Q,Q_i \in \boldsymbol{S}_+^n ensures the (QCQP) is convex.
      (C.f., Convex Function Theory.Level Sets.)








    5. The generalization

      \begin{aligned} &Ax=b \, \to \, \frac{1}{2} x^T P_ix+a_i^Tx =b_i  \\ &P_i \in \mathbb{R}^{n\times n}, a_i \in \mathbb{R}^n, b_i \in \mathbb{R}, \quad i =1,\ldots,p \\ \end{aligned}

      also results in a (QCQP), but this can break convexity.
      E.g., the quadratic constraints

      x_i(x_i-1)=0 \text{ or } x_i^2 = 1

      result in a (BLP) since these constraints enforce x_i \in \{0,1\} or x_i \in \{-1,1\}, respectively.








    6. There holds

      \begin{aligned} \text{(QCQP)} + (Q_i = P_i= 0)  &\iff \text{(QP)}\\ \text{(QP)} + (Q=0) &\iff \text{(LP)}. \end{aligned}










    7. The assumption Q \in \boldsymbol{S}^n (i.e., Q = Q^T ) is not a serious restriction.
      Indeed, first note that, since x^TQx is a scalar, we have

      x^TQx = (x^TQx)^T = x^TQ^Tx .



      Let

      f(x) = \frac{1}{2}x^TQx + q^T x .



      Thus,

      \begin{aligned} f(x) &= \frac{f(x) + f(x)}{2}\\ &= \frac{1}{2}\left( \frac{1}{2}x^T Q x + q^Tx + \frac{1}{2}x^T Q^T x + q^Tx \right)\\ &=\frac{1}{2}\left( \frac{1}{2}x^T(Q+Q^T)x + 2 q^Tx \right)\\ &=:\frac{1}{2}x^T \tilde{Q} x + q^T x, \end{aligned}

      where

      \tilde{Q} = \frac{1}{2}(Q+Q^T) .



      Lastly, observe the symmetry of \tilde{Q} :

      \tilde{Q}^T = \left( \frac{1}{2}(Q+Q^T) \right)^T = \frac{1}{2}(Q^T+Q) = \tilde{Q}.



      Thus every quadratic

      f(x) = \frac{1}{2}x^TQx + q^Tx

      has a “symmetric representation” of the form

      f(x) = \frac{1}{2}x^T\tilde{Q}x + q^Tx with \tilde{Q} \in \boldsymbol{S}^n.
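
    This symmetrization is a one-line check numerically (a minimal sketch assuming NumPy):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(3, 3))       # generic nonsymmetric Q
    Q_tilde = 0.5 * (Q + Q.T)         # symmetric representative

    x = rng.normal(size=3)
    print(np.isclose(x @ Q @ x, x @ Q_tilde @ x))   # same quadratic form: True
    ```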
















    Example: Least Squares Least squares: an unconstrained (QP) of the form

    \text{(LS)} \begin{cases} \text{minimize}&\Vert Ax-b\Vert_2^2= x^TA^TAx-2b^TAx + b^Tb \end{cases}

    where A \in \mathbb{R}^{m \times n} and b \in \mathbb{R}^m .
    Features:
    • WLOG: may assume columns of A are linearly independent, and so

      m \geq n .

    • N.B.: even without this assumption, the least norm solution to (LS) is given by

      x^\star = A^{\dagger}b ,

      where A^\dagger is the pseudo-inverse (aka Moore-Penrose inverse) of A .
    • Under the WLOG assumption, there holds

      x^\star = (A^TA)^{-1}A^Tb .



    Recall (definition of \Vert \cdot \Vert_2 ) For x \in \mathbb{R}^n , the notation \Vert x \Vert_2 means the vector norm

    \Vert x \Vert_2 := \sqrt{x_1^2 + \cdots + x_n^2 } .

    Thus

    \Vert x- y \Vert_2

    is the distance between the two vectors x,y \in \mathbb{R}^n .







    Rough Justification of WLOG Let

    \begin{aligned} A  &= \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix} \in \mathbb{R}^{2 \times 3}, \quad a_1,a_2,a_3 \in \mathbb{R}^2 \\ x &=  \begin{bmatrix}  x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad b =  \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}. \end{aligned}

    N.B.: a set of three 2-dimensional vectors is always linearly dependent and so

    a_3 = c_1 a_1 + c_2 a_2

    for some c_1,c_2 \in \mathbb{R} .
    Therefore, writing the entries of A as scalars (not to be confused with the vector b above), we may write

    A=: \begin{bmatrix} a & \alpha & u \\ b & \beta & v  \end{bmatrix} = \begin{bmatrix} a & \alpha & c_1 a + c_2 \alpha\\ b & \beta & c_1 b + c_2 \beta \end{bmatrix} .



    Computing

    \begin{aligned} Ax &= \begin{bmatrix} a & \alpha & c_1 a + c_2 \alpha\\ b & \beta & c_1 b + c_2 \beta \end{bmatrix} \begin{bmatrix}  x_1 \\ x_2 \\ x_3 \end{bmatrix}\\ &= \begin{bmatrix} ax_1 + \alpha x_2 + c_1 a x_3 + c_2 \alpha x_3\\ bx_1 + \beta x_2 + c_1 b x_3 + c_2 \beta x_3 \end{bmatrix}\\ &= \begin{bmatrix} a(x_1 + c_1 x_3) + \alpha (x_2 + c_2 x_3)\\ b(x_1 + c_1 x_3) + \beta(x_2 + c_2 x_3) \end{bmatrix} \end{aligned}

    we see that minimizing

    \Vert Ax-b \Vert_2^2 = \bigg\Vert \begin{bmatrix} a(x_1 + c_1 x_3) + \alpha (x_2 + c_2 x_3)\\ b(x_1 + c_1 x_3) + \beta(x_2 + c_2 x_3) \end{bmatrix} - \begin{bmatrix} b_1\\b_2 \end{bmatrix} \bigg\Vert_2^2

    is equivalent to minimizing

    \begin{aligned} \Vert \hat{A}y - b \Vert_2^2 &:= \bigg\Vert \begin{bmatrix} a&\alpha\\b&\beta \end{bmatrix} \begin{bmatrix}y_1\\y_2\end{bmatrix}-\begin{bmatrix}b_1\\b_2\end{bmatrix} \bigg\Vert_2^2\\ &= \bigg\Vert \begin{bmatrix} ay_1 + \alpha y_2\\ by_1 + \beta y_2 \end{bmatrix} -\begin{bmatrix}b_1\\b_2\end{bmatrix} \bigg\Vert_2^2 \end{aligned}.

    Equivalence follows from the change of variables

    \begin{aligned} y_1 &= x_1 + c_1 x_3\\ y_2 &= x_2 + c_2 x_3  \end{aligned} .



    Therefore, a (LS) problem with matrix of size 2 \times 3 is equivalent to a (LS) problem with matrix of size 2 \times 2 .

    This argument holds in general: if A has linearly dependent columns, can use change of variables to ensure linear independence.







    Remarks.
    1. If Ax=b has solution x^\star , then x^\star solves the (LS) problem.

    2. If Ax=b has no solution, then any x^\star solving the (LS) problem gives a “best” estimate solution to Ax = b, where “best” is chosen to mean in terms of the vector norm \Vert \cdot \Vert_2 .

    3. While minimizing \Vert \cdot \Vert_2^2 is equivalent to minimizing \Vert \cdot \Vert_2 , the exponent 2 ensures \Vert \cdot \Vert_2^2 is differentiable (at 0 ).
      (C.f.: |x| is not differentiable at x=0 , but x^2 is.)








    Solving the (LS)
    Step 1. (The problem is convex) The objective function

    f_0(x) = \Vert Ax-b\Vert_2^2 = x^TA^TAx - 2b^TAx+b^Tb

    is convex since

    \nabla^2 f_0 = 2 A^TA \succeq 0 .

    To see A^TA \succeq 0 , observe:
    1. (A^T A)^T = A^T A and so A^TA \in \boldsymbol{S}^n , i.e., A^TA is symmetric;
    2. for all z \in \mathbb{R}^n there holds

      z^T A^T A z = (Az)^T Az = \Vert Az \Vert_2^2 \geq 0 ,

      i.e., A^TA \in \boldsymbol{S}_+^n .




    Step 2. (The critical points) We compute

    \begin{aligned} \nabla f_0 &= \nabla (x^T A^TA x) - \nabla(2b^TAx) + \nabla(b^Tb)\\ &= 2 A^TAx - 2A^Tb + 0\\ &= 2(A^TAx - A^Tb). \end{aligned}

    Therefore, \nabla f_0 = 0 iff

    A^TAx = A^T b .

    This system of equations is known as the normal equations.



    Step 3. (The solution) By corollary proved above: x^\star is a solution iff \nabla f_0(x^\star) = 0 .
    Moreover, A having linearly independent columns ensures that A^TA is invertible:
    1. linearly independent columns \implies Az=0 iff z=0.
    2. A^TAz = 0 \iff Az=0 \iff z = 0 .
    3. Since A^TA is square, can conclude A^TA is invertible.


    Lastly: since

    \nabla f_0(x^\star)= 0 iff A^TA x^\star = A^T b

    and since A^TA is invertible, we conclude that the solution to the (LS) is

    x^\star = (A^TA)^{-1}A^Tb .
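
    A minimal numerical check of the normal equations against library least-squares routines, assuming NumPy and random illustrative data:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(10, 3))   # tall matrix; columns (generically) independent
    b = rng.normal(size=10)

    x_normal = np.linalg.solve(A.T @ A, A.T @ b)         # normal equations
    x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)      # library LS solver
    print(np.allclose(x_normal, x_lstsq))                # True
    print(np.allclose(x_normal, np.linalg.pinv(A) @ b))  # pseudo-inverse agrees
    ```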
















    Example: Distance Between Convex Sets Let K_1,K_2 \subset \mathbb{R}^n be convex subsets.
    Nearest Point Problem (NPP): Among x \in K_1, y \in K_2 , which pair (x,y) minimizes the (squared) distance \Vert x-y \Vert_2^2 ?



    The (NPP) may be expressed as a standard form (COP).
    Indeed, supposing

    \begin{aligned} K_1 &= \{ f_1(x) \leq 0 \}\\ K_2 &= \{ f_2(x) \leq 0 \}, \end{aligned}

    with f_1,f_2 convex, then the (NPP) for K_1,K_2 is the (COP)

    \begin{cases} \text{minimize} & \Vert x - y \Vert_2^2\\ \text{subject to} & f_1(x) \leq 0 \\ &f_2(y) \leq 0 \end{cases}.

    N.B.: the feasible set is

    \begin{aligned} F&=\{(x,y) \in \mathbb{R}^n \times \mathbb{R}^n : f_1(x) \leq 0, f_2(y) \leq 0 \}\\ &=K_1 \times K_2. \end{aligned}

    Viz.: (x,y) \in F iff x \in K_1, y \in K_2 .





    When K_1,K_2 are polyhedra, the (NPP) is a (QP).
    Indeed, supposing

    \begin{aligned} K_1 &= \{ G_1 x \preceq h_1 \}\\ K_2 &= \{ G_2 x \preceq h_2 \}, \end{aligned}

    then the (NPP) for K_1,K_2 is the (QP) given by

    \begin{cases} \text{minimize} & \Vert x - y \Vert_2^2\\ \text{subject to} & G_1 x \preceq h_1\\ & G_2 y \preceq h_2 \end{cases}.

    Can formulate as constrained least squares problem: defining

    \begin{aligned}  B,C \in \mathbb{R}^{n \times n} &\mapsto B \oplus C =  \begin{bmatrix} B & \vline & 0\\ \hline 0 & \vline & C \end{bmatrix} \in \mathbb{R}^{2n \times 2n}, \\ v,w \in \mathbb{R}^n &\mapsto  v \oplus w = \begin{bmatrix} v_1 & \cdots & v_n & w_1 & \cdots & w_n \end{bmatrix}^T \in \mathbb{R}^{n + n}\\ A &= \begin{bmatrix} Id_{n\times n} &\vline & - Id_{n\times n}\\ \hline 0 &\vline& 0 \end{bmatrix} \in \mathbb{R}^{2n\times 2n} \end{aligned}

    we readily see that the problem is equivalent to

    \begin{cases} \text{minimize} & \Vert A(x\oplus y) \Vert_2^2\\ \text{subject to} & (G_1 \oplus G_2) (x \oplus y) \preceq h_1 \oplus h_2 \end{cases}.
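
    For concreteness, here is a minimal sketch of the (NPP) between two boxes, assuming SciPy; a general-purpose constrained solver (SLSQP) stands in for a dedicated (QP) solver, and the boxes are illustrative:

    ```python
    import numpy as np
    from scipy.optimize import minimize

    # Illustrative boxes: K1 = [0,1]^2 and K2 = [2,3] x [0,1], each as {z : G z <= h}.
    G1 = np.vstack([np.eye(2), -np.eye(2)]); h1 = np.array([1.0, 1.0, 0.0, 0.0])
    G2 = np.vstack([np.eye(2), -np.eye(2)]); h2 = np.array([3.0, 1.0, -2.0, 0.0])

    def obj(z):                         # z = (x, y); objective ||x - y||_2^2
        x, y = z[:2], z[2:]
        return np.sum((x - y) ** 2)

    cons = [{"type": "ineq", "fun": lambda z: h1 - G1 @ z[:2]},   # x in K1
            {"type": "ineq", "fun": lambda z: h2 - G2 @ z[2:]}]   # y in K2
    z0 = np.array([0.5, 0.5, 2.5, 0.5])                           # feasible start
    res = minimize(obj, z0, constraints=cons)
    print(res.x, res.fun)   # nearest points ~ (1, t) and (2, t); squared distance 1
    ```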







    Polyhedral Approximation. Let K_1,K_2 be nonempty convex domains, not necessarily polyhedral. Let P_1,P_2,Q_1,Q_2 be polyhedral domains such that

    \begin{aligned} &P_1 \subset K_1 \subset Q_1\\ &P_2 \subset K_2 \subset Q_2. \end{aligned}

    Then the (NPP)s for the pairs [P_1,P_2] and [Q_1,Q_2] are (QP) approximations of the (NPP) for [K_1,K_2] and provide, respectively, upper and lower bounds for its optimal value.















    Geometric Programming Monomial function: given

    c > 0,\, a_i \in \mathbb{R},\, i =1,\ldots,n ,

    a function of the form

    \begin{aligned} f&: \mathbb{R}^n \to \mathbb{R}\\ f(x)&= c x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}\\ \text{dom}\,f&=\mathbb{R}_{++}^n. \end{aligned}

    Example.

    3x_1^2x_2^{3.4}x_3^{-\pi}.









    Posynomial: given

    c_k > 0,\, a_{ik}\in \mathbb{R},\, i=1,\ldots,n,\, k=1,\ldots,K ,

    a function of the form

    \begin{aligned} f&: \mathbb{R}^n \to \mathbb{R}\\ f(x)&= \sum_{k=1}^K c_k x_1^{a_{1k}} x_2^{a_{2k}} \cdots x_n^{a_{nk}}\\ \text{dom}\,f&=\mathbb{R}_{++}^n. \end{aligned}

    Example.

    f(x) = \sqrt{2}x_1x_2\sqrt{x_3} + 3 x_1^4x_2^{-4}x_3^{-1.5}+x_5









    Geometric Programming: An (OP) of the form

    \text{(GP)} \begin{cases} \text{minimize}&f_0(x)\\ \text{subject to}&f_i(x) \leq 1, \, i=1,\ldots,m\\ & h_i(x) = 1, \, i=1,\ldots,p \end{cases},

    where

    \begin{aligned} f_0,f_1,\ldots,f_m & \text{ are posynomials}\\ h_1,\ldots,h_p & \text{ are monomials}. \end{aligned}

    N.B.: this is an undercover (COP).







    Remarks.
    1. Let

      \begin{aligned} Posy_n &= \text{ set of posynomials on } \mathbb{R}_{++}^n\\ Mon_n &= \text{ set of monomials on } \mathbb{R}_{++}^n. \end{aligned}

      Since any monomial is a posynomial, we have Mon_n \subset Posy_n .







    2. If

      \begin{aligned}  \lambda &\geq 0\\ p(x),q(x) &\in Posy_n\\ f(x),g(x) &\in Mon_n , \end{aligned}

      then

      \begin{aligned} \lambda p(x),\, p(x) + q(x),\, p(x) q(x),\, \frac{p(x)}{f(x)} \in Posy_n\\ \lambda f(x),\, f(x)g(x),\, \frac{f(x)}{g(x)} \in Mon_n. \end{aligned}

      Thus Mon_n is a group under multiplication, and Posy_n is the convex cone generated by Mon_n .







    3. As usual, we use the language “writing in standard form” to refer to writing an equivalent (OP) in the form (GP) above.

      General (OPs) clearly equivalent to a (GP) may be called geometric programs in nonstandard form.

      For example, the geometric program

      \begin{cases} \text{maximize}&f_0(x)\\ \text{subject to} & f_i(x) \leq g_i(x)\\ & h_i(x) = k_i(x) \end{cases}

      with

      \begin{aligned} f_i  &\text{ are posynomials}\\ f_0,g_i,h_i,k_i &\text{ are monomials} \end{aligned}

      is readily rewritten as a standard form (GP):

      \begin{cases} \text{minimize}&\frac{1}{f_0(x)}\\ \text{subject to} & \frac{f_i(x)}{g_i(x)} \leq 1\\ & \frac{h_i(x)}{k_i(x)} = 1 \end{cases}.

















    Rewriting (GP) as a (COP) General (GPs) are not convex (e.g., f_0(x) = \sqrt{x} ).
    However, any (GP) is easily recast as a (COP) via change of variable.

    Step 1. (The change of variable) We will write x \mapsto y to mean the change of variable given by

    x_i = e^{y_i}.







    Step 2. (Monomials \to convex function) Let

    \begin{matrix} c>0, \quad b = \log c,& f(x) \in Mon_n,\\  a = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix}^T \in \mathbb{R}^n,\quad & f(x) = c x_1^{a_1}x_2^{a_2} \cdots x_n^{a_n}. \end{matrix}

    Under the change of variable x \mapsto y:

    \begin{aligned} f(x) &= f(x_1,\ldots,x_n)\\ &= f(e^{y_1},\ldots, e^{y_n})\\ &= c(e^{y_1})^{a_1} \cdots (e^{y_n})^{a_n}\\ &= e^{\log c}e^{a_1y_1}\cdots e^{a_ny_n}\\ &= e^{a^Ty+b} \end{aligned}

    But

    F\, convex \implies e^F\, convex

    and so e^{a^Ty+b} is convex since affine functions are convex.







    Step 3. (Posynomial \to convex function) For k=1,\ldots,K , let

    \begin{matrix} c_k>0, \quad b_k = \log c_k,& f(x) \in Posy_n,\\  a_k = \begin{bmatrix} a_{1k} & \cdots & a_{nk} \end{bmatrix}^T \in \mathbb{R}^{n},\quad & f(x) = \sum_{k=1}^K c_k x_1^{a_{1k}}\cdots x_{n}^{a_{nk}}. \end{matrix}

    By Step 2., there holds

    \begin{aligned} f(x) = f(y) = \sum_{k=1}^K e^{a_k^Ty + b_k}, \end{aligned}

    which is again a convex function.
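
    The identity f(x) = \sum_k e^{a_k^Ty+b_k} under x_i = e^{y_i} is easy to sanity-check numerically (a minimal sketch assuming NumPy, with a random illustrative posynomial):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    K, n = 4, 3
    a = rng.normal(size=(K, n))          # exponent vectors a_k (rows)
    c = rng.uniform(0.5, 2.0, size=K)    # positive coefficients c_k
    b = np.log(c)                        # b_k = log c_k

    y = rng.normal(size=n)
    x = np.exp(y)                        # the change of variables x_i = e^{y_i}

    f_x = np.sum(c * np.prod(x ** a, axis=1))   # posynomial form in x
    f_y = np.sum(np.exp(a @ y + b))             # sum-of-exponentials form in y
    print(np.isclose(f_x, f_y))                 # True
    ```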







    Step 4. ((GP) \to (COP)) We explicitly write the (GP) as:

    \begin{cases} \text{minimize}&f_0(x) = \sum_{k=1}^{K_0} c_{0k} x_1^{a_{01k}}\cdots x_n^{a_{0nk}}\\ \text{subject to}& f_i(x) = \sum_{k=1}^{K_i} c_{ik} x_1^{a_{i1k}} \cdots x_n^{a_{ink}} \leq 1, \, i=1,\ldots,m\\ & h_i(x) =d_{i}x_1^{b_{i1}}\cdots x_n^{b_{in}} = 1, \, i=1,\ldots,p \end{cases}.

    Let

    \begin{aligned} a_{ik} &= \begin{bmatrix} a_{i1k} & \cdots & a_{ink} \end{bmatrix}^T \in \mathbb{R}^n\\ b_i &= \begin{bmatrix} b_{i1} & \cdots & b_{in} \end{bmatrix}^T \in \mathbb{R}^n\\ \alpha_{ik} &= \log c_{ik}\\ \delta_i &= \log d_{i}  \end{aligned}.

    Under the change of variable x \mapsto y , this (GP) becomes the (COP)

    \begin{cases} \text{minimize}&f_0(y) = \sum_{k=1}^{K_0} e^{a_{0k}^T y + \alpha_{0k}}\\ \text{subject to}& f_i(y) = \sum_{k=1}^{K_i}  e^{a_{ik}^T y + \alpha_{ik}} \leq 1, \, i=1,\ldots,m\\ & h_i(y) = e^{b_{i}^Ty + \delta_i} = 1, \, i=1,\ldots,p \end{cases}.









    Step 5. ((GP) in convex form) At last, since exponentiation may result in unreasonably large numbers, it is customary to take logarithms, resulting in the geometric problem in convex form:

    \begin{cases} \text{minimize}& \log \left( \sum_{k=1}^{K_0} e^{a_{0k}^T y + \alpha_{0k}} \right)\\ \text{subject to}& \log \left(\sum_{k=1}^{K_i}  e^{a_{ik}^T y + \alpha_{ik}}\right) \leq 0, \, i=1,\ldots,m\\ & b_{i}^Ty + \delta_i = 0, \, i=1,\ldots,p \end{cases}.

    N.B.:
    1. The log-sum-exp function u \mapsto \log\left(\sum_k e^{u_k}\right) is convex, and convexity is preserved under composition with affine maps of y , so the problem is still convex.
    2. The constraints are equivalent since \log is monotonic and injective on \mathbb{R}_{++}.








    Example. (Taken from Boyd-Kim-Vandenberghe-Hassibi: A tutorial on geometric programming)
    Problem Maximize the volume of a box with
    • a limit on total wall area;
    • a limit on total floor area; and
    • upper and lower bounds on the aspect ratios height/width and depth/width .




    Notational set up

    \begin{aligned} \text{Optimization Variables}& \begin{cases} w &= \text{width}\\ h &= \text{height}\\ d &= \text{depth} \end{cases}\\ \text{Problem Parameters}& \begin{cases} A_{\text{wall}} &= \text{max wall area}\\ A_{\text{floor}} &= \text{max floor area}\\ \alpha_{1},\alpha_2 &= \text{lower and upper aspect ratio bounds for }h/w\\ \beta_{1},\beta_2 &= \text{lower and upper aspect ratio bounds for }d/w \end{cases} \end{aligned}





    Construct objective function
    The volume of the box is

    hwd

    and so the objective function is

    f_0(h,w,d) = hwd .

    N.B.: f_0 \in Mon_3



    Construct constraints

    \begin{aligned} \text{wall area bound }:&\quad 2hw+2hd \leq A_{\text{wall}}\\ \text{floor area bound }:&\quad wd \leq A_{\text{floor}}\\ \text{aspect ratio bounds }:&\quad \alpha_1 \leq \frac{h}{w} \leq \alpha_2\\ &\quad \beta_1 \leq \frac{d}{w} \leq \beta_2 \end{aligned}

    N.B.:

    \begin{aligned} 2hw+2hd &\in Posy_3\\ wd,\, hw^{-1},\, dw^{-1} &\in Mon_3 \end{aligned}





    Formulate Problem
    Putting everything together, we realize the problem may be formulated as the following (GP):

    \begin{cases} \text{maximize} & hwd\\ \text{subject to} & 2hw+2hd\leq A_{\text{wall}}\\ & wd \leq A_{\text{floor}}\\ & \alpha_1 \leq hw^{-1} \leq \alpha_2\\ & \beta_1 \leq dw^{-1} \leq \beta_2 \end{cases}.





    To write the problem in standard form: note the following equivalence of constraints

    \begin{aligned} 2hw+2hd\leq A_{\text{wall}} & \iff A_{\text{wall}}^{-1}2hw + A_{\text{wall}}^{-1}2hd \leq 1\\  wd \leq A_{\text{floor}} & \iff A_{\text{floor}}^{-1}wd \leq 1\\ &\\ \alpha_1 \leq hw^{-1} \leq \alpha_2 &\iff \begin{array}{l} \alpha_2^{-1}hw^{-1} \leq 1 \\  \alpha_1 h^{-1}w \leq 1 \end{array}\\ \beta_1 \leq dw^{-1} \leq \beta_2 & \iff \begin{array}{l} \beta_2^{-1} dw^{-1} \leq 1\\ \beta_1 d^{-1}w \leq 1 \end{array} \end{aligned}

    Moreover, maximizing hwd is equivalent to minimizing h^{-1}w^{-1}d^{-1} .



    Therefore, the problem in standard form is given by

    \begin{cases} \text{minimize} & h^{-1}w^{-1}d^{-1}\\ \text{subject to} & A_{\text{wall}}^{-1}2hw +A_{\text{wall}}^{-1}2hd \leq 1\\ & A_{\text{floor}}^{-1} wd \leq 1\\ &  \alpha_2^{-1} hw^{-1}\leq 1 \\ & \alpha_1 wh^{-1} \leq 1\\ & \beta_2^{-1}dw^{-1} \leq 1\\ & \beta_1 wd^{-1} \leq 1 \end{cases}.
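
    Modeling tools can handle this (GP) directly. A minimal sketch assuming cvxpy (with a conic solver supporting the exponential cone) and illustrative parameter values:

    ```python
    import cvxpy as cp

    # Illustrative parameters: A_wall = 100, A_floor = 10, aspect ratios in [0.5, 2].
    h = cp.Variable(pos=True)
    w = cp.Variable(pos=True)
    d = cp.Variable(pos=True)

    constraints = [2 * (h * w + h * d) <= 100,   # wall area bound
                   w * d <= 10,                  # floor area bound
                   0.5 <= h / w, h / w <= 2,     # aspect ratio bounds on h/w
                   0.5 <= d / w, d / w <= 2]     # aspect ratio bounds on d/w
    prob = cp.Problem(cp.Maximize(h * w * d), constraints)
    prob.solve(gp=True)                          # solve as a geometric program
    print(h.value, w.value, d.value, prob.value)
    ```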
















    Semidefinite Programming (Heavily influenced by Vandenberghe-Boyd Semidefinite Programming.)
    Linear matrix inequality (LMI): given

    \begin{aligned}  &F_0,F_1,\ldots,F_n \in \boldsymbol{S}^m\\ &x = \begin{bmatrix}x_1 & \cdots & x_n \end{bmatrix}^T \in \mathbb{R}^n\\ F(x) &:= F_0 + x_1 F_1 + \cdots + x_n F_n \end{aligned}

    an inequality of the form

    F(x) \succeq 0 .

    Recall: for A \in \boldsymbol{S}^m , we write A \succeq 0 to mean A is positive semidefinite, i.e., z^TAz \geq 0 for all z \in \mathbb{R}^m.





    Semidefinite program (SDP): an (OP) of the form

    \text{(SDP)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & F(x) \succeq 0, \end{cases}

    where

    \begin{aligned}  &F_0,F_1,\ldots,F_n \in \boldsymbol{S}^m\\ F(x) &:= F_0 + x_1 F_1 + \cdots + x_n F_n\\ c & \in \mathbb{R}^n \end{aligned}

    N.B.: The LMI F(x) \succeq 0 defines a feasible set which is convex and hence (SDPs) are convex problems.





    Convexity of Feasible Set. To see that (SDP) is a convex problem, first note: if

    t>0 and A \succeq 0,

    then

    z^T(tA)z = t ( z^TAz) \geq 0

    and so

    tA \succeq 0 .





    Next, observe: for x,y feasible and t \in [0,1] , the function

    F(x) = F_0 + x_1 F_1 + \cdots + x_n F_n

    evaluated at the convex combination

    tx + (1-t)y

    is

    \begin{aligned}  F(tx +(1-t)y) &= F_0 + (tx_1 + (1-t)y_1) F_1 + \cdots + (tx_n + (1-t)y_n)F_n. \end{aligned}

    Expanding, rearranging and using

    F_0 = tF_0 + (1-t)F_0

    gives:

    \begin{aligned}  F(tx +(1-t)y) &= tF_0  + tx_1F_1 + \cdots + tx_nF_n \\ &+ (1-t)F_0 + (1-t)y_1F_1 + \cdots  + (1-t)y_nF_n\\ &= tF(x) + (1-t)F(y). \end{aligned}





    Using F(x),F(y)\succeq 0 , we conclude

    F(tx+(1-t)y) = tF(x) + (1-t)F(y) \succeq 0 .

    Thus, x,y feasible \implies tx + (1-t)y feasible for t \in [0,1], i.e., the feasible set is convex.







    Example 1. LPs are SDPs Consider the (LP)

    \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax + b \succeq 0  \end{cases},

    where

    A \in \mathbb{R}^{m \times n}, \quad b \in \mathbb{R}^{m}, \quad c \in \mathbb{R}^n

    and

    Ax+b \succeq 0

    means componentwise inequality.





    Given v = \begin{bmatrix}v_1 & \cdots & v_m \end{bmatrix}^T \in \mathbb{R}^m , define

    \text{diag}(v) =  \begin{bmatrix} v_1 & 0 & \cdots & 0\\ 0 & v_2 &\cdots &0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & v_m \end{bmatrix}.







    Since A \in \boldsymbol{S}^m satisfies A \succeq 0 iff A has nonnegative eigenvalues, we have

    v \succeq 0 \iff \text{diag}(v) \succeq 0.

    (Indeed, the eigenvalues of \text{diag}(v) are the components of v .) Therefore,

    \begin{aligned} Ax+b \succeq 0  &\iff \text{diag}(Ax+b) \succeq 0 \\ \text{(vector inequality)} & \iff \text{(matrix inequality)}. \end{aligned}







    Letting

    A=\begin{bmatrix}a_1 & \cdots & a_n \end{bmatrix}, \quad a_i \in \mathbb{R}^m ,

    we have

    Ax + b = b + x_1 a_1 + \cdots + x_n a_n .



    Therefore, using

    \begin{aligned} \text{diag}(v+\lambda w) &=  \begin{bmatrix}  v_1 + \lambda w_1 &0 & \cdots & 0\\ 0 & v_2 + \lambda w_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & v_n + \lambda w_n  \end{bmatrix}\\ &= \begin{bmatrix}  v_1  &0 & \cdots & 0\\ 0 & v_2  & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & v_n  \end{bmatrix} + \lambda \begin{bmatrix}  w_1 &0 & \cdots & 0\\ 0 &  w_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots &  w_n  \end{bmatrix}\\ &= \text{diag}(v) + \lambda \text{diag}(w) \end{aligned}

    we have

    \begin{aligned}  \text{diag}(Ax + b) &= \text{diag}(b + x_1 a_1 + \cdots + x_n a_n)\\ &= \text{diag}(b) + x_1 \text{diag}(a_1) + \cdots + x_n \text{diag}(a_n). \end{aligned}







    Therefore, defining

    \begin{aligned}  F_0 &= \text{diag}(b), \quad F_i = \text{diag}(a_i)\\ F(x) &= F_0 + x_1 F_1 + \cdots + x_n F_n = \text{diag}(Ax+b), \end{aligned}

    we conclude

    \begin{aligned} Ax+b \succeq 0 \iff \text{diag}(Ax+b) \succeq 0 \iff F(x) \succeq 0 \end{aligned} .







    In conclusion, we have that the (LP)

    \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax + b \succeq 0  \end{cases},

    is equivalent to the (SDP)

    \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & F(x)= \text{diag}(Ax+b) \succeq 0  \end{cases}.
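
    A quick numerical sanity check that the vector and matrix inequalities agree, assuming NumPy and random illustrative data:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 2)); b = rng.normal(size=4)

    for _ in range(5):
        x = rng.normal(size=2)
        v = A @ x + b
        vec_ok = bool(np.all(v >= 0))                             # Ax + b >= 0
        psd_ok = bool(np.all(np.linalg.eigvalsh(np.diag(v)) >= 0))  # F(x) >= 0
        print(vec_ok == psd_ok)   # the two conditions agree
    ```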









    Example 2. Nonlinear OP as a SDP Consider the nonlinear (COP)

    \text{(OP1)} \begin{cases} \text{minimize}& \frac{(c^Tx)^2}{d^Tx}\\ \text{subject to} & Ax + b  \succeq 0\\ & d^Tx > 0 \end{cases}

    where

    \begin{aligned} &c,d \in \mathbb{R}^n\\ &A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m. \end{aligned}

    We will recast (OP1) as a (SDP).
    N.B.:
    1. The constraint d^Tx > 0 fixes the domain of the objective (an open halfspace) and ensures the objective is convex there.
    2. On d^Tx < 0 , the objective is instead concave.




    To begin, recall we may recast the problem in epigraph form:

    \text{(OP1)} \begin{cases} \text{minimize}& t\\ \text{subject to} & Ax + b  \succeq 0\\ & d^Tx > 0\\ & \frac{(c^Tx)^2}{d^Tx} \leq t \end{cases}.

    N.B.: this introduces the new optimization variable t .



    Goal: find a symmetric matrix-valued function

    F(x,t) = F_0 + x_1F_1 + \cdots + x_nF_n + t F_{n+1}

    such that the constraints

    \begin{cases} &Ax + b  \succeq 0\\ & d^Tx > 0\\ & \frac{(c^Tx)^2}{d^Tx} \leq t \end{cases}

    may be recast as the LMI

    \begin{cases} F(x,t) \succeq 0 \end{cases}.





    Idea: we know

    Ax + b  \succeq 0 \iff \text{diag}(Ax+b) \succeq 0 .

    On the other hand:

    \begin{aligned} \frac{(c^Tx)^2}{d^Tx} \leq t &\iff (c^Tx)^2 \leq td^Tx \\ &\iff td^Tx - (c^Tx)^2 \geq0  \end{aligned} .

    Recall: if \gamma>0 , then

    \begin{bmatrix}\alpha&\beta\\\beta&\gamma\end{bmatrix}\succeq0 \iff \alpha\geq0, \alpha\gamma - \beta^2 \geq 0

    Therefore, given d^Tx>0, we have

    \begin{aligned} \frac{(c^Tx)^2}{d^Tx} \leq t & \iff \begin{bmatrix}t & c^Tx\\ c^Tx & d^Tx \end{bmatrix} \succeq 0 \end{aligned} .





    Using that

    \begin{bmatrix} A &\vline &0\\ \hline 0 & \vline& B \end{bmatrix} \succeq 0 \iff A \succeq 0 , B \succeq 0,

    we therefore introduce

    E =  \begin{bmatrix} \text{diag}(Ax+b) & 0 & 0\\ 0 & t  & c^Tx\\ 0 & c^Tx & d^Tx \end{bmatrix}

    to capture the problem’s constraints.
    Indeed, evidently, E \succeq 0 iff

    \text{diag}(Ax+b) \succeq 0 and \begin{bmatrix} t & c^Tx\\c^Tx & d^Tx \end{bmatrix} \succeq 0 .





    Therefore,

    \begin{array}{l}  Ax + b  \succeq 0\\  d^Tx > 0\\  \frac{(c^Tx)^2}{d^Tx} \leq t \end{array} \iff \begin{array}{l} \begin{bmatrix} \text{diag}(Ax+b) & 0 & 0\\ 0 & t  & c^Tx\\ 0 & c^Tx & d^Tx \end{bmatrix} \succeq 0 \end{array} .

    This is enough to conclude (OP1) may be recast as an (SDP).





    To make it clearer, introduce the notation

    A = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix}, \quad a_i \in \mathbb{R}^m

    and (m+2)\times(m+2) matrices

    \begin{aligned} F_0 &= \text{diag}(b,0,0)  \\ F_i & =  \begin{bmatrix} \text{diag}(a_i) &0 &0\\ 0&0&c_i\\ 0&c_i&d_i \end{bmatrix}\\ F_{n+1} & = \begin{bmatrix}0_{m \times m} &0 &0\\0 & 1 & 0\\0 & 0 & 0\end{bmatrix}\\ \end{aligned}.

    Then

    \begin{aligned} \begin{bmatrix} \text{diag}(Ax+b) & 0 & 0\\ 0 & t  & c^Tx\\ 0 & c^Tx & d^Tx \end{bmatrix} = F_0 + x_1F_1 + \cdots + x_nF_n + t F_{n+1}:=F(x,t) \end{aligned}

    and so (OP1) is equivalent to the (SDP)

    \begin{cases} \text{minimize}& t\\ \text{subject to} & F(x,t) \succeq 0. \end{cases}
















    Lagrangian Duality

    Throughout, let (OP) denote a given optimization problem of the form

    \text{(OP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0 , \quad i=1,\ldots,m\\ & h_i(x) = 0, \quad i = 1,\ldots,p \end{cases}.

    Recall:

    \begin{aligned} \text{Problem domain: }& D := \bigcap_{i=0}^m \text{dom}\, f_i \cap \bigcap_{i=1}^p \text{dom}\,h_i\\ \text{Optimal value: } & p^\star := \inf\{f_0(x): x \in D , x \text{ feasible}\}. \end{aligned}

    The Lagrange Dual Lagrangian: the function

    \begin{aligned}  L&:\mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R} \\ \text{dom}\,L &= D \times \mathbb{R}^m \times \mathbb{R}^p \end{aligned}

    given by

    \begin{aligned} L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \end{aligned} .

    N.B.: for each fixed x \in D , the function

    (\lambda,\nu) \mapsto L(x,\lambda,\nu)

    is affine.





    Lagrange multipliers: the variables \lambda_i and \nu_i .
    The vectors

    \begin{aligned} \lambda :=  \begin{bmatrix}\lambda_1\\\lambda_2\\\vdots\\\lambda_m\end{bmatrix}, \quad  \nu := \begin{bmatrix}\nu_1\\\nu_2\\\vdots\\\nu_p\end{bmatrix} \end{aligned}

    are called dual variables.





    Lagrange dual function: the function

    \begin{aligned} g&:\mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}\\ \end{aligned}

    given by

    g(\lambda,\nu) = \inf\{L(x,\lambda,\nu): x \in D\} .

    Those (\lambda,\nu) \in \mathbb{R}^m_+ \times \mathbb{R}^p satisfying

    g(\lambda,\nu) > -\infty

    are called dual feasible.
    N.B.: as an infimum of affine functions, g is automatically concave.





    Proposition. For

    \lambda \in \mathbb{R}^m_+, \quad \nu \in \mathbb{R}^p ,

    there holds

    g(\lambda,\nu) \leq p^\star,

    where p^\star is the optimal value for the given (OP).
    Proof.
    1. Let x \in D be feasible.
      Then

      \begin{aligned} f_i(x) &\leq 0, \quad i=1,\ldots,m\\ h_i(x) &=0, \quad i=1,\ldots,p. \end{aligned}







    2. Let \lambda \succeq 0 and \nu be arbitrary.
      Then feasibility of x implies

      \begin{aligned} \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) = \sum_{i=1}^m \lambda_i f_i(x) \leq 0. \end{aligned}

      Consequently,

      \begin{aligned}  L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \leq f_0(x). \end{aligned}







    3. Therefore, for all feasible x , for \lambda \succeq 0 and for arbitrary \nu , there holds

      \begin{aligned} g(\lambda,\nu) = \inf\{L(z,\lambda,\nu):z \in D \} \leq L(x,\lambda,\nu) \leq f_0(x) \end{aligned}

      and so

      g(\lambda,\nu) \leq p^\star .

      (Indeed, g(\lambda,\nu) is a lower bound of f_0(x) and p^\star is the greatest lower bound of f_0(x) .)






    Lagrangian as underestimator.
    (See CO 5.1.4)
    Define the indicator functions

    I_-(t)= \begin{cases} 0 &t \leq 0\\ +\infty & t> 0 \end{cases},\qquad I_0(t)= \begin{cases} 0 & t=0\\ +\infty & t\neq0 \end{cases}.

    Then

    \begin{aligned} I_-(f_i(x)) &\text{ is } 0 \text{ when the constraint } f_i(x) \leq 0 \text{ holds, and } +\infty \text{ when it is violated}\\ I_0(h_i(x)) &\text{ is } 0 \text{ when the constraint } h_i(x) = 0 \text{ holds, and } +\infty \text{ when it is violated} \end{aligned}

    and the (OP) is equivalent to

    \begin{aligned} \begin{cases} \text{ minimize } f_0(x) + \sum_{i=1}^mI_-(f_i(x)) + \sum_{i=1}^p I_0(h_i(x)). \end{cases} \end{aligned}

    N.B.: the terms in

    \begin{aligned}\sum_{i=1}^mI_-(f_i(x)) + \sum_{i=1}^p I_0(h_i(x)) \end{aligned}

    act as penalties for breaking the desired constraints.





    N.B.: if x \in D and (\lambda,\nu) \in \mathbb{R}_+^m\times\mathbb{R}^p , then

    \begin{aligned} \lambda_i f_i(x) &\leq I_-(f_i(x))\\ \nu_i h_i(x) &\leq I_0(h_i(x)) \end{aligned}

    and hence

    \begin{aligned}L(x,\lambda,\nu) & = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \\ &\leq f_0(x) + \sum_{i=1}^mI_-(f_i(x)) + \sum_{i=1}^p I_0(h_i(x))\end{aligned} .

    Viz., L is an underestimator of the objective function

    f_0(x) + \sum_{i=1}^mI_-(f_i(x)) + \sum_{i=1}^p I_0(h_i(x))

    obtained by “softening” or “weakening” the penalty functions:

    \begin{aligned} I_-(t) & \to ct, \quad c>0\\ I_0(t) & \to bt, \quad b \in \mathbb{R}. \end{aligned}







    In particular, for each (\lambda,\nu) \in \mathbb{R}_+^m \times \mathbb{R}^p , the problem

    \begin{cases} \text{minimize } L(x,\lambda,\nu) \end{cases}

    has optimal value g(\lambda,\nu) and provides an underestimation of the original (OP).















    Example 1 Consider the least squares problem

    (LS) \begin{cases} \text{minimize} & x^Tx\\ \text{subject to}& Ax=b \end{cases}

    for given

    A = \begin{bmatrix} a_1^T \\ \vdots \\ a_p^T \end{bmatrix} \in \mathbb{R}^{p\times n}, \quad a_i \in \mathbb{R}^n, \quad b \in \mathbb{R}^p .

    N.B.:

    \begin{aligned} Ax = b \iff h_i(x) = a_i^Tx - b_i = 0 \text{ for } i=1,\ldots,p \end{aligned}.







    Therefore, the Lagrangian L for (LS) is

    \begin{aligned} L&:\mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}\\ \text{dom}\,L &= \mathbb{R}^n \times \mathbb{R}^p\\ L(x,\nu) &= x^Tx + \sum_{i=1}^p \nu_i (a_i^Tx - b_i) \\ &= x^Tx + \nu^T(Ax-b) \end{aligned}

    and the Lagrange dual is

    \begin{aligned} g&:\mathbb{R}^p \to \mathbb{R}\\ g(\nu) &= \inf\{ x^Tx + \nu^T(Ax-b) : x \in \mathbb{R}^n \}. \end{aligned}







    N.B.:

    \begin{aligned} \nabla^2_x L(x,\nu) &= \nabla_x^2 ( x^Tx + \nu^T(Ax-b)) \\ &= 2Id_{n \times n}\\ & \succeq 0  \end{aligned}

    and so x\mapsto L(x,\nu) is convex.
    Consequently,

    \begin{aligned} L(x^\star,\nu) = \inf\{L(x,\nu):x \in \mathbb{R}^n \} = \text{min}\{L(x,\nu):x \in \mathbb{R}^n \} \end{aligned}

    iff

    \nabla_x L(x^\star,\nu) = 2x^\star + A^T\nu = 0 ,

    i.e., iff

    x^\star = -\frac{1}{2}A^T\nu.







    In conclusion,

    \begin{aligned} g(\nu) &= L(x^\star,\nu)\\ &= (x^\star)^Tx^\star + \nu^T(Ax^\star-b)\\ &= \left(-\frac{1}{2}A^T\nu\right)^T\left(-\frac{1}{2}A^T\nu\right) +\nu^T\left(A\left(-\frac{1}{2}A^T\nu\right) - b\right)\\ &=\frac{1}{4}\nu^TAA^T\nu - \frac{1}{2}\nu^TAA^T\nu - \nu^Tb\\ &=-\frac{1}{4}\nu^TAA^T\nu - \nu^Tb. \end{aligned}

    In particular,

    -\frac{1}{4}\nu^TAA^T\nu - b^T\nu \leq \inf\{x^Tx:Ax=b\}

    for all \nu \in \mathbb{R}^p.
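
    Weak duality here is easy to test numerically (a minimal sketch assuming NumPy; A , b and the sampled \nu are random and illustrative):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(2, 4))     # wide A: Ax = b is (generically) consistent
    b = rng.normal(size=2)

    # Primal optimal value: minimum-norm solution of Ax = b via the pseudo-inverse.
    x_star = np.linalg.pinv(A) @ b
    p_star = x_star @ x_star

    # g(nu) = -(1/4) nu^T A A^T nu - b^T nu should lower-bound p_star for every nu.
    for _ in range(3):
        nu = rng.normal(size=2)
        g = -0.25 * nu @ (A @ A.T) @ nu - b @ nu
        print(g <= p_star + 1e-12)   # True each time
    ```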














    Example 2 Consider the linear program

    \text{(LP)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to}& Ax=b\\ & x\succeq 0 \end{cases}

    for given

    c \in \mathbb{R}^n, \quad A \in \mathbb{R}^{p \times n}, \quad b \in \mathbb{R}^p .

    N.B.:
    • equality constraints given by

      h_i(x) = a_i^Tx-b_i =0, \quad i=1,\ldots,p.

    • x\succeq 0 iff

      x_i \geq 0, \quad i=1,\ldots,n\quad

      iff

      \quad f_i(x) = -x_i \leq 0,\quad i=1,\ldots,n.







    Therefore, the Lagrangian for (LP) is

    \begin{aligned} L&: \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}\\ \text{dom}\, L &=\mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \\ L(x,\lambda,\nu) &= c^Tx - \sum_{i=1}^n \lambda_i x_i + \sum_{i=1}^p \nu_i(a_i^Tx-b_i)\\ &= c^Tx - \lambda^Tx + \nu^T(Ax-b)\\ &=(c - \lambda + A^T\nu)^Tx - \nu^Tb \end{aligned}







    Want to compute

    g(\lambda,\nu) = \inf\{ L(x,\lambda,\nu) : x \in \mathbb{R}^n \},

    but

    x \mapsto (c - \lambda + A^T\nu)^Tx - \nu^Tb

    is an affine function with domain \mathbb{R}^n .
    Therefore,

    x \mapsto (c - \lambda + A^T\nu)^Tx - \nu^Tb

    is bounded below iff (\lambda,\nu) satisfy

    c - \lambda + A^T\nu = 0 .







    Therefore, the Lagrange dual is

    \begin{aligned} g&:\mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}\\ \text{dom}\, g &= \{(\lambda,\nu) \in \mathbb{R}^n \times \mathbb{R}^p : c - \lambda + A^T\nu= 0\}\\ g(\lambda,\nu) &= \begin{cases} -b^T\nu & c - \lambda + A^T\nu = 0\\ -\infty & \text{else} \end{cases}. \end{aligned}



    In particular, for dual feasible (\lambda,\nu), there holds

    -b^T\nu \leq c^Tx

    for all feasible x .














    Return of Conjugate Function Recall: given f:\mathbb{R}^n \to \mathbb{R}, its conjugate function f^* is the convex function

    f^*(y) = \sup\{y^Tx - f(x) : x \in \text{dom}\,f\}.

    Interestingly: the conjugate function is related to the Lagrange dual.



    Example.
    Consider the (OP)

    \text{(OP)} \begin{cases} \text{minimize}&f_0(x)\\ \text{subject to}&Ax \preceq b\\ &Cx=d \end{cases},

    for given

    \begin{aligned} f_0:\mathbb{R}^n \to \mathbb{R},\quad A \in \mathbb{R}^{m \times n}, \quad b \in \mathbb{R}^m\\ C \in \mathbb{R}^{p \times n}, \quad d \in \mathbb{R}^p. \end{aligned}

    We may write

    \begin{aligned} Ax\preceq b &\iff f_i(x) = a_i^Tx - b_i \leq0 , \quad i =1,\ldots, m\\ Cx = d &\iff h_i(x) =c_i^Tx - d_i=0, \quad i=1,\ldots,p \end{aligned}

    for suitable

    a_i ,c_i \in \mathbb{R}^n.







    The Lagrangian is

    \begin{aligned} L&:\mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}\\ \text{dom}\, L &= D \times \mathbb{R}^m \times \mathbb{R}^p\\ L(x,\lambda,\nu) &= f_0(x) + \sum_{i=1}^m \lambda_i(a_i^Tx - b_i) + \sum_{i=1}^p \nu_i(c_i^Tx - d_i)\\ &= f_0(x) + \lambda^T(Ax-b) + \nu^T(Cx-d)\\ &= f_0(x) + \lambda^TAx + \nu^TCx - \lambda^Tb - \nu^Td \end{aligned}







    We may now compute the Lagrange dual in terms of the conjugate f_0^* :

    \begin{aligned} g&:\mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}\\ g(\lambda,\nu)&= \inf\{L(x,\lambda,\nu):x \in D\}\\ &=\inf\{f_0(x) + \lambda^TAx + \nu^TCx - \lambda^Tb - \nu^Td : x \in D\}\\ &=\inf\{ (A^T\lambda + C^T\nu)^Tx + f_0(x):x \in D\} - \lambda^Tb - \nu^Td \\ &=-\sup\{-(A^T\lambda+C^T\nu)^Tx - f_0(x) : x \in D\} - \lambda^Tb - \nu^Td\\ &=-f_0^*(-A^T\lambda-C^T\nu) - \lambda^Tb - \nu^Td. \end{aligned}

    Since

    \begin{aligned} g(\lambda,\nu)>-\infty \iff f_0^*(-A^T\lambda-C^T\nu)< +\infty \end{aligned},

    we conclude

    \begin{aligned}\text{dom}\,g = \{(\lambda,\nu)\in \mathbb{R}^m\times\mathbb{R}^p: -A^T\lambda-C^T\nu \in \text{dom}\,f_0^* \} \end{aligned}.
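    A quick sanity check: if f_0(x) = x^Tx , then f_0^*(y) = \frac{1}{4}y^Ty with \text{dom}\,f_0^* = \mathbb{R}^n , and the formula gives

    g(\lambda,\nu) = -\frac{1}{4}\|A^T\lambda + C^T\nu\|^2 - \lambda^Tb - \nu^Td ,

    recovering the dual of Example 1 when there are no inequality constraints (take m=0 , C=A , d=b ).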
















    Example: A Volume Minimizing Ellipsoid Problem. Given points a_1,\ldots,a_m \in \mathbb{R}^n , among all (closed) origin-centered ellipsoids \mathcal{E} satisfying a_i \in \mathcal{E} , find those with minimal volume.

    Plan. We will formulate this problem as a convex optimization problem and determine the Lagrange dual function.





    Positive semidefinite representation of ellipsoids.
    Given x' \in \mathbb{R}^n (the center; x'=0 for our problem) and X \in \boldsymbol{S}_{++}^n , the set

    \mathcal{E}_X := \{ x \in \mathbb{R}^n : (x-x')^T X (x-x') \leq 1 \}

    is an ellipsoid.
    Moreover, the volume of \mathcal{E}_X is proportional to (\det X^{-1})^{1/2}.
    (This follows from the change of variables formula.)
    Justification. WLOG (after an orthogonal change of coordinates): X = \text{diag}(v) for some v \in \mathbb{R}^n whose entries are the positive eigenvalues of X .
    Then (x-x')^TX(x-x') \leq 1 reads

    \begin{aligned}  (x-x')^TX(x-x') &= \begin{bmatrix} x_1-x'_1 & \cdots & x_n - x_n' \end{bmatrix} \begin{bmatrix}v_1 & 0 & \cdots & 0\\ 0 & v_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\0&0&\cdots&v_n \end{bmatrix}\begin{bmatrix} x_1-x'_1 \\ \vdots \\ x_n - x_n' \end{bmatrix}\\ &=v_1(x_1-x_1')^2 + \cdots + v_n(x_n-x'_n)^2\\ &\leq 1 \end{aligned}

    which is the usual description of a closed ellipsoid with center x' .
    E.g., X = r^{-2}\text{diag}(1,1,\cdots,1) gives the ball of radius r .






    Problem reformulation. Find those X \in \boldsymbol{S}_{++}^n satisfying a_i^T X a_i \leq 1 which minimize (\det X^{-1})^{1/2}.
    In fact, f_0(X) = \log \det X^{-1} is convex, and so we may reformulate the problem as the following (COP):

    \begin{cases} \text{minimize} & f_0(X) = \log \det X^{-1}\\ &\text{dom}\, f_0 = \boldsymbol{S}_{++}^n\\ \text{subject to} & a_i^T X a_i \leq 1, \quad i=1,\ldots,m. \end{cases}







    Recall
    1. \text{trace}(ABC) = \text{trace}(CAB) for A,B,C \in \mathbb{R}^{n \times n};
    2. there is a natural way of identifying a matrix A \in \mathbb{R}^{n \times n} with a vector v_A \in \mathbb{R}^{n^2} ;
    3. Under this identification, we have

      \text{trace}(A^T B) = v_A^T v_B .







    Let A_i = a_ia_i^T , noting A_i^T = A_i .
    Then 1. gives

    \begin{aligned}  a_i^TXa_i &= \text{trace}(a_i^TXa_i)\\ &= \text{trace}(a_ia_i^TX)\\ &= \text{trace}(A_iX). \end{aligned}

    Therefore 2. and 3. allow us to realize the quadratic inequality

    a_i^TXa_i \leq 1

    as the linear inequality

    \text{trace}(A_i X) = v_{A_i}^T v_X \leq 1 .

    These observations allow us to appeal to the Lagrange dual formalism.





    Let

    \mathcal{A} = \begin{bmatrix} v_{A_1}^T\\\vdots\\ v_{A_m}^T \end{bmatrix} \in \mathbb{R}^{m \times n^2}, \quad \boldsymbol{1}_m = \begin{bmatrix} 1\\\vdots\\1\end{bmatrix} \in \mathbb{R}^m .

    Therefore, the problem is equivalent to

    \begin{cases} \text{minimize} & f_0(X) = \log \det X^{-1}\\ &\text{dom}\, f_0 = \boldsymbol{S}_{++}^n\\ \text{subject to} & \mathcal{A} v_X \preceq \boldsymbol{1}_m. \end{cases}

    Introducing the Lagrange multiplier \lambda , observe

    \mathcal{A}^T\lambda = \lambda_1 v_{A_1} + \cdots + \lambda_m v_{A_m} .

    Under our chosen identification \mathbb{R}^{n \times n} \cong \mathbb{R}^{n^2} , we identify

    \lambda_i v_{A_i} \iff \lambda_i A_i = \lambda_i a_ia_i^T

    and so

    \mathcal{A}^T\lambda \iff \sum_{i=1}^m \lambda_i a_ia_i^T.

    We lastly record the conjugate function of f_0(X) = \log \det X^{-1} :

    \begin{aligned} f_0^*(Y) &= \log\det(-Y)^{-1} - n\\ \text{dom}\,f_0^* &= -\boldsymbol{S}_{++}^n \end{aligned} .

    By previous section, the Lagrange dual is given by

    \begin{aligned} g(\lambda) &= - f_0^*(-\mathcal{A}^T\lambda) - \lambda^T\boldsymbol{1}_m\\  &=  \begin{cases} \log\det\left(\sum_{i=1}^m \lambda_i a_i a_i^T \right) - \boldsymbol{1}_m^T\lambda + n & \sum_{i=1}^m \lambda_i a_i a_i^T \succ0 \\ -\infty &\text{else} \end{cases} \end{aligned} .

    Since g(\lambda) provides a lower bound on the optimal value p^\star = \log\det (X^\star)^{-1} , we conclude: if V_0 is the optimal volume and c_n>0 denotes the volume of the unit ball (so V_0 = c_n (\det (X^\star)^{-1})^{1/2} ), then

    2\log(V_0/c_n) \geq \log\det\left(\sum_{i=1}^m \lambda_i a_i a_i^T \right) - \boldsymbol{1}_m^T\lambda + n ,

    which is a very explicit lower bound depending only on the Lagrange multiplier and the problem data.
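    A minimal numerical sketch (assuming numpy) of evaluating this bound: any \lambda \succeq 0 with \sum_i \lambda_i a_ia_i^T \succ 0 yields a valid lower bound.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 2, 8
    a = rng.standard_normal((m, n))            # the points a_1, ..., a_m
    lam = np.abs(rng.standard_normal(m))       # any lambda >= 0

    M = sum(l * np.outer(ai, ai) for l, ai in zip(lam, a))
    sign, logdet = np.linalg.slogdet(M)
    if sign > 0:                               # M is PSD, so det > 0 iff M is PD
        print(logdet - lam.sum() + n)          # lower bound on p* = log det (X*)^{-1}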
    The Dual Problem Let

    \begin{aligned} g(\lambda,\nu) &=\text{ Lagrange dual function of a given (OP)}\\ p^\star &= \text{ optimal value of the (OP)}. \end{aligned}

    Recall

    \lambda \succeq 0 \implies g(\lambda,\nu) \leq p^\star



    Main point:

    \sup\{ g(\lambda,\nu):\lambda \succeq0,\nu \in \mathbb{R}^p\} \leq p^\star ,

    suggests considering the maximization problem with objective g(\lambda,\nu) .
    Gives the best lower bound obtainable from the Lagrange dual function.





    Lagrange dual problem: the problem

    \begin{cases} \text{maximize} & g(\lambda,\nu)\\ \text{subject to} & \lambda \succeq 0 \end{cases}.







    Remarks
    1. The original problem is called the primal problem.
      Viz., the dual problem is dual to the primal problem.

    2. (\lambda,\nu) feasible to dual problem \implies g(\lambda,\nu)>-\infty.
      Viz., (\lambda,\nu) is dual feasible.

    3. As stated, the only constraint is \lambda \succeq 0 ; however, domain of g usually has implicit constraints.

    4. Generally: \text{dom}\,g has “dimension” less than or equal to m+p .

    5. Recall: g is the pointwise infimum of a family of affine functions of (\lambda,\nu) .
      Thus, g is concave and -g is convex.
      So,

      maximizing concave g \iff minimizing convex -g .


      Therefore, since \lambda \succeq0 is convex constraint:

      Dual problems are always convex, even if primal is not.



    6. Solutions (\lambda^\star,\nu^\star) to dual are called dual optimal.
















    Remark on Duality for Equivalent Problems Question: if two primal problems are equivalent, how are their respective duals related?





    Spoiler: The respective dual problems may be quite different; this is demonstrated by example.





    Example. Consider the unconstrained problem

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(Ax+b). \end{cases}

    This problem is equivalent to the constrained problem

    \text{(OP2)} \begin{cases} \text{minimize} & f_0(y)\\ \text{subject to} & y = Ax+b \end{cases}.







    Having no constraints, the Lagrangian for (OP1) is

    L(x) = f_0(Ax+b)

    and so the Lagrange dual function is simply

    g = \inf\{f_0(Ax+b) : x \in \mathbb{R}^n\}= p^\star .

    Therefore, the dual problem of (OP1) trivializes to minimizing a constant.





    Having only the equality constraints y = Ax+b , the Lagrangian for (OP2) is

    L(x,y,\nu) = f_0(y) + \nu^T(Ax+b-y) .

    Observe that L(x,y,\nu) is unbounded below in x whenever A^T\nu \neq 0 , since the term \nu^TAx = (A^T\nu)^Tx is linear in x .
    Moreover, if A^T\nu=0, then

    \begin{aligned}  g(\nu) &= \inf\{ f_0(y) + \nu^T b - \nu^T y \}\\ &=\nu^T b - \sup \{ \nu^Ty - f_0(y)\}\\ &= \nu^T b - f_0^*(\nu). \end{aligned}

    Thus,

    g(\nu) = \begin{cases} \nu^Tb - f_0^*(\nu) & A^T\nu=0\\ -\infty & \text{else}. \end{cases}

    Therefore, the dual problem to (OP2) is

    \begin{cases} \text{maximize} &b^T\nu - f_0^*(\nu)\\ \text{subject to }& A^T\nu = 0. \end{cases}
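    For instance, if f_0(y) = \frac{1}{2}y^Ty , then f_0^*(\nu) = \frac{1}{2}\nu^T\nu and the dual of (OP2) becomes

    \begin{cases} \text{maximize} &b^T\nu - \frac{1}{2}\nu^T\nu\\ \text{subject to }& A^T\nu = 0, \end{cases}

    a concave quadratic maximization over the nullspace of A^T .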





    Conclusion: the dual of (OP2) is conceivably useful, whereas the dual of (OP1) is useless, even though (OP1) and (OP2) are equivalent.














    Example: Duality of standard form and inequality form LP Recall: a standard form LP is of the form

    \text{(LP1)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax = b\\ & x \succeq 0 \end{cases} .

    An inequality form LP is of the form

    \text{(LP2)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax \preceq b \end{cases} .

    We will show the dual of (LP1) is (equivalent to a problem) of the form (LP2), and vice versa.





    The dual of (LP1)
    We consider first:

    \text{(LP1)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax = b\\ & x \succeq 0 \end{cases} .

    For (LP1), the Lagrangian is

    L(x,\lambda,\nu) = (c + A^T\nu - \lambda)^Tx - b^T\nu

    and the Lagrange dual function is

    \begin{aligned} g(\lambda,\nu) &= \begin{cases} -b^T\nu & c - \lambda + A^T\nu = 0\\ -\infty & \text{else} \end{cases}\\ \text{dom}\, g &= \{(\lambda,\nu) \in  \mathbb{R}^n \times \mathbb{R}^p : c - \lambda + A^T\nu = 0\} \end{aligned} .

    (Recall: domain is determined by where L is bounded below.)





    Therefore, the dual problem of (LP1) is

    \begin{cases} \text{maximize}&g(\lambda,\nu)= \begin{cases} -b^T\nu & c - \lambda + A^T\nu = 0\\ -\infty & \text{else} \end{cases}\\ \text{subject to}&\lambda \succeq 0 \end{cases},

    which is evidently equivalent to

    \begin{cases} \text{maximize}& -b^T\nu \\ \text{subject to}&\lambda \succeq 0\\ & c - \lambda + A^T\nu = 0 \end{cases}.

    N.B.: the domain of g had the implicit constraint c - \lambda + A^T\nu = 0 .





    Observe the equivalency of constraints:

    \begin{aligned} \begin{cases} &\lambda \succeq 0\\ & c - \lambda + A^T\nu = 0 \end{cases} & \iff \begin{cases} &\lambda \succeq 0\\ & c+ A^T\nu = \lambda \end{cases}\\ & \iff  \begin{cases} & c+ A^T\nu \succeq 0 \end{cases}\\ & \iff  \begin{cases} & c \succeq -A^T\nu \end{cases} \end{aligned}







    Therefore

    \begin{cases} \text{maximize}& -b^T\nu \\ \text{subject to}&\lambda \succeq 0\\ & c - \lambda + A^T\nu = 0 \end{cases} \iff \begin{cases} \text{minimize}& b^T\nu \\ \text{subject to}&  - A^T\nu \preceq c \end{cases}

    The last problem is of the form (LP2).
    Viz., the dual of a standard form LP is (equivalent to) an inequality form LP.
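    This equivalence can be checked numerically; a minimal sketch (assuming scipy), where b and c are constructed so that both (LP1) and its dual are feasible, hence both are solvable with equal optimal values:

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(2)
    p, n = 3, 6
    A = rng.standard_normal((p, n))
    b = A @ np.abs(rng.standard_normal(n))     # primal feasible by construction
    c = np.abs(rng.standard_normal(n)) - A.T @ rng.standard_normal(p)
    # dual feasible by construction: some nu satisfies -A^T nu <= c

    # (LP1): minimize c^T x  subject to  Ax = b, x >= 0
    primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * n)

    # its dual, in inequality form: minimize b^T nu  subject to  -A^T nu <= c (nu free)
    dual = linprog(b, A_ub=-A.T, b_ub=c, bounds=[(None, None)] * p)

    print(primal.fun, -dual.fun)               # p* and d* = -min b^T nu agree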







    The dual of (LP2)
    We now consider

    \text{(LP2)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax \preceq b \end{cases} .

    For (LP2), the Lagrangian is

    \begin{aligned} L(x,\lambda) &= c^Tx + \lambda^T(Ax-b) \\ &= (A^T\lambda + c)^Tx - b^T\lambda  \end{aligned} .

    N.B.: an affine function \alpha^Tx + \beta is bounded below iff \alpha = 0 .
    Therefore, the Lagrange dual function is

    \begin{aligned} g(\lambda) &= \begin{cases} -b^T\lambda & A^T\lambda + c = 0\\ -\infty & \text{else} \end{cases}\\ \text{dom}\,g &= \{\lambda \in \mathbb{R}^m : A^T\lambda + c = 0 \} \end{aligned} .





    Therefore, the dual problem of (LP2) is

    \begin{cases} \text{maximize}&g(\lambda)= \begin{cases} -b^T\lambda & A^T\lambda + c = 0\\ -\infty & \text{else} \end{cases}\\ \text{subject to}&\lambda \succeq 0 \end{cases},

    which is evidently equivalent to

    \begin{cases} \text{maximize}& -b^T\lambda \\ \text{subject to}& A^T\lambda + c = 0\\ & \lambda \succeq 0 \end{cases}.

    Again, the domain of g had an implicit constraint, namely, A^T\lambda + c = 0 .





    Observe the equivalency of constraints:

    \begin{cases} &A^T\lambda + c = 0\\ & \lambda \succeq 0 \end{cases} \iff \begin{cases} &A^T\lambda = -c\\ & \lambda \succeq 0 \end{cases}





    Therefore, the dual problem is equivalent to

    \begin{cases} \text{minimize}& b^T\lambda \\ \text{subject to}& A^T\lambda =-c\\ &\lambda \succeq 0 \end{cases},

    which is a problem of the form (LP1).
    Viz., the dual of an inequality form LP is a standard form LP.














    Weak and Strong Duality Let

    \begin{aligned} p^\star &= \text{ optimal value of primal problem}\\ d^\star &= \text{ optimal value of dual problem} \end{aligned}.

    Weak duality: the property d^\star \leq p^\star .
    N.B.: Optimization problems of the form (OP) always satisfy weak duality.
    Strong duality: the property d^\star = p^\star .
    N.B.: Having strong duality does not mean the primal and the dual are actually solvable.
    Constraint qualifications: conditions for a given type of problem which ensure strong duality.
    E.g., “A (QCQP) with single quadratic constraint has strong duality if _______.”
    Optimal duality gap: the difference p^\star - d^\star .






    Remarks
    1. Observe

      \begin{aligned} p^\star = -\infty &\iff \text{ primal unbounded}\\ p^\star = +\infty & \iff \text{ primal infeasible}\\ d^\star = + \infty &\iff \text{ dual unbounded}\\ d^\star = -\infty &\iff \text{ dual infeasible}. \end{aligned}

    2. Therefore

      \begin{aligned}  p^\star = - \infty &\implies d^\star = -\infty \implies \text{ dual is infeasible}\\ d^\star = +\infty &\implies p^\star = +\infty \implies \text{ primal is infeasible}. \end{aligned}

    3. 0<p^\star-d^\star<\infty is possible.
    4. Primal and dual may be simultaneously infeasible, i.e.,

      \begin{aligned} p^\star &= + \infty\\ d^\star &=-\infty \end{aligned}

      may occur.
    5. Convex optimization problems often have strong duality, but not always.















    Slater’s Condition Consider the (COP)

    \text{(COP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}&f_i(x) \leq 0, \quad i=1,\ldots,m\\ &Ax = b \end{cases}

    with domain D.
    Let \text{relint}\,D denote the relative interior of D .
    Intuitively: x\in \text{relint}\,D means x \in D is not on the relative “boundary” of D .
    Recall: D may have dimension smaller than the ambient space \mathbb{R}^n .



    Slater’s condition: there exists x \in \text{relint}\,D such that

    f_i(x)<0, \quad i=1,\ldots,m and Ax = b .

    Strict feasibility: a feasible x satisfying Slater’s condition is called strictly feasible.



    Example Consider the inequality constraints

    \begin{cases} &(x_1-1)^2 + x_2^2 \leq 1\\ &(x_1-2)^2 + x_2^2 \leq 4\\ &(x_1-3)^2+x_2^2 \leq 9 \end{cases}

    and suppose f_0 has

    \text{dom}\,f_0 = \{ (x_1-3)^2+x_2^2 \leq 9 \} .

    Then (x_1,x_2)=(0,0) is feasible but not strictly feasible.
    Moreover, any point in the interior of the smallest disk satisfies Slater’s condition.





    Slater’s Theorem. If (COP) satisfies Slater’s condition, then it is strongly dual and the dual problem is solvable.

    Remarks.
    1. Slater’s condition is a constraint qualification for convex optimization problems.
    2. In principle, an (OP) may be strongly dual without the dual being solvable.
      Thus, Slater’s theorem has the strength of implying there is a dual feasible (\lambda^\star,\nu^\star) with g(\lambda^\star,\nu^\star) = d^\star = p^\star .






    Theorem. If a weakened Slater condition holds (strict inequality required only for the non-affine inequality constraints), then the conclusions of Slater’s theorem hold.

    Remarks.
    1. Thus, if f_1,\ldots,f_k are affine and f_{k+1},\ldots,f_{m} are not, then it suffices that the weakened Slater condition holds: there exists x \in \text{relint}\,D such that

      \begin{aligned} f_i(x) &\leq 0, \quad i =1,\ldots,k\\  f_i(x)&<0,\quad i=k+1,\ldots,m\\  Ax&=b  \end{aligned} .

    2. Therefore, Slater’s condition is really a constraint qualification for the non-affine inequality constraints.















    Remark on Slater’s Condition Consider the convex optimization problem

    \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0 , \quad i=1,\ldots,m\\ & Ax = b \end{cases}.

    Let D \subset \mathbb{R}^n be the domain.
    Since D is convex, it lies in its affine hull \text{aff}\,(D) .
    Given x\in \mathbb{R}^n and r>0 , let

    B(x,r) = \{ y \in \mathbb{R}^n : |x-y|<r \}

    denote the Euclidean ball of radius r and center x .
    Relative interior: the set

    \begin{aligned}  \text{relint}(D) = \{ x \in D: \exists r>0 \text{ such that } B(x,r) \cap \text{aff}\,(D) \subset D \}. \end{aligned}





    Example. In the image below:
    • D is the ellipse lying in the xy -plane.
    • The affine hull \text{aff}\,(D) is the xy -plane.
    • The ball depicts a ball B centered at a point in D and with small enough radius so that B \cap \text{aff}\,(D) still lies in D .
    • The relative interior \text{relint}(D) is the shaded region of the ellipse excluding the curve bounding the domain.







    Question How can Slater’s condition fail?
    I.e., what if there exist no x \in \text{relint}(D) such that

    \begin{aligned} f_i(x) &< 0, \quad i=1,\ldots,m\\ Ax&=b? \end{aligned}







    Consider the (COP)

    \text{(COP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_1(x) \leq 0 \\ & Ax = b \end{cases}.

    where f_0,f_1: \mathbb{R}^3 \to \mathbb{R} are convex.
    Suppose
    • f_1 \leq 0 describes a cube:

      \begin{aligned}  \{f_1(x_1,x_2,x_3) \leq 0 \} = \{0 \leq x_1, x_2, x_3 \leq 1 \} = : C,  \end{aligned}

    • \text{dom}\,f_0 = \text{dom}\,f_1 = \{ x_3 \leq 1 \} ,
    • the solution set

      \{Ax = b \} = \{ x_3 = 1 \}

      is a plane intersecting the top face of the cube.






    The images below depict the cube, the plane and their intersection (a square).
    N.B.: the domain is the area below and including the plane.







    Then the domain is D = \{x_3 \leq 1 \} and (COP) fails Slater’s condition:

    if x \in \text{relint}(D) = \{x_3<1\} , then Ax = b must fail.


    Fix: Can project problem onto a lower dimensional face of C .
    Indeed, since

    C \cap \{ Ax=b\}=\{0\leq x_1,x_2 \leq 1, x_3 = 1 \} ,

    (COP) is equivalent to the problem

    \text{(COP1)} \begin{cases} \text{minimize} & F_0(x_1,x_2)\\ \text{subject to} & -x_1 \leq 0\\ &x_1 -1 \leq 0\\ &-x_2 \leq 0\\ &x_2 -1 \leq 0  \end{cases},

    where F_0(x_1,x_2) = f_0(x_1,x_2,1) .
    Taking x_1=x_2=\frac{1}{2} , we see (COP1) satisfies Slater’s condition.





    Comparing Duals and KKT Conditions.
    Let L_0 and L_1 be the Lagrangian for (COP) and (COP1) respectively.
    Then

    \begin{aligned} L_0(x,\lambda,\nu) &= f_0(x) + \lambda f_1(x) + \nu(x_3-1)\\ x &= \begin{bmatrix} x_1 &x_2&x_3 \end{bmatrix}^T\\ L_1(x',\lambda') &= F_0(x_1,x_2) -\lambda_1 x_1 + \lambda_2(x_1-1) - \lambda_3 x_2 + \lambda_4 (x_2-1)\\ x' &= \begin{bmatrix} x_1 &x_2 \end{bmatrix}^T\\ \lambda' &= \begin{bmatrix} \lambda_1 & \lambda_2 & \lambda_3 & \lambda_4 \end{bmatrix}^T. \end{aligned}

    The respective KKT conditions are

    \begin{cases} \begin{aligned} f_1(x) & \leq 0\\ x_3-1&=0\\ \lambda f_1(x) &=0 \\ \lambda &\geq 0\\ \nabla f_0 + \lambda \nabla f_1 + \begin{bmatrix}0\\0\\\nu\end{bmatrix} &=0 \end{aligned} \end{cases} \quad \begin{cases} \begin{aligned} 0 \leq x_1 &\leq 1\\ 0 \leq x_2 &\leq 1\\ \lambda_1 x_1 = \lambda_2 (x_1-1)&=0\\ \lambda_3 x_2 = \lambda_4 (x_2-1)&=0\\ \lambda_1,\lambda_2,\lambda_3,\lambda_4 &\geq 0\\ \nabla F_0 + \begin{bmatrix} \lambda_2-\lambda_1\\\lambda_4 - \lambda_3 \end{bmatrix} &=0 \end{aligned} \end{cases}







    Remarks.
    1. While (COP) does not satisfy Slater’s condition, its projection (COP1) does.
    2. f_1 might not even be differentiable, in which case the KKT conditions for (COP) would be ill-posed.
    3. Identifying the correct face to project the problem onto \implies relatively simpler KKT conditions.
      (Not true in general.)






    Geometric description of Slater’s condition failing:
    • Consider a general (COP) with convex domain D .
    • Suppose the relative boundary of D contains a convex set K (e.g., K is a polygonal face, or D is conic).
    • If \text{aff}(K) = \{Ax=b\} or \{Ax=b\} \cap D = K , then the problem fails Slater’s condition.
    • Indeed, any feasible point must satisfy Ax=b and lie on the relative boundary (and hence not in the relative interior).
    • N.B.: a problem even “almost” failing Slater’s condition can cause numerical issues.
      E.g., a 3-dimensional problem with nearly 2-dimensional domain.















    Examples Example 1.
    Consider the least squares problem

    \text{(LS)} \begin{cases} \text{minimize}&x^Tx\\ \text{subject to}& Ax = b \end{cases}.

    Recall: the Lagrangian is

    L(x,\nu) = x^Tx + \nu^T(Ax-b)

    and Lagrange dual function is

    g(\nu) = -\frac{1}{4}\nu^TAA^T\nu - \nu^Tb .

    Therefore, the dual problem is

    \begin{cases} \text{maximize}& -\frac{1}{4}\nu^TAA^T\nu - \nu^Tb \end{cases}.







    N.B.: (LS) has no inequality constraints and D = \mathbb{R}^n .
    Thus, Slater’s condition is simply:

    there exists x \in \mathbb{R}^n such that Ax=b, i.e., b \in \text{range}\,A.

    We will analyze this more closely.





    Case 1 (b \in \text{range}\,A ):
    Here, Slater’s condition is satisfied since b \in \text{range}\,A implies there exists x with Ax=b.
    Therefore, primal is feasible and hence p^\star < +\infty .
    Slater’s theorem implies

    d^\star = p^\star < +\infty .

    In particular, the dual objective

    -\frac{1}{4}\nu^TAA^T\nu - \nu^Tb

    is bounded above.





    Case 2 (b \notin \text{range}\,A ):
    Here, the primal is infeasible and so p^\star = +\infty .
    Note b \notin\text{range}\,A \implies there exists z such that A^Tz = 0 and b^Tz \neq 0 .
    (Recall: \text{ker}\,A^T \perp \text{range}\,A.)
    But then

    g(tz)=-\frac{1}{4}t^2z^TAA^Tz - t\,b^Tz = -t\,b^Tz ,

    using A^Tz=0 ; this is unbounded above as a function of t , and so d^\star=+\infty .






    In conclusion: for (LS), there holds d^\star = p^\star , even when p^\star = \infty .
    Therefore, (LS) is strongly dual whether feasible or infeasible.
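    In the feasible case, strong duality can also be verified explicitly: maximizing the concave dual objective gives \nu^\star = -2(AA^T)^{-1}b and g(\nu^\star) = b^T(AA^T)^{-1}b = p^\star . A minimal numerical sketch (assuming numpy and A of full row rank):

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((3, 6))            # full row rank, so b is in range(A)
    b = rng.standard_normal(3)

    p_star = b @ np.linalg.solve(A @ A.T, b)   # x*^T x* at x* = A^T (A A^T)^{-1} b
    nu_star = -2 * np.linalg.solve(A @ A.T, b) # maximizer of the dual objective
    d_star = -0.25 * nu_star @ (A @ A.T) @ nu_star - nu_star @ b
    print(p_star, d_star)                      # equal: strong duality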







    Example 2.
    Consider the standard form (LP)

    \text{(LP)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax = b\\ & x \succeq 0 \end{cases}.

    We have already shown that its dual problem is

    \begin{cases} \text{minimize}& b^T\nu \\ \text{subject to}&  - A^T\nu \preceq c \end{cases}.

    Since the inequality constraints of (LP) are affine, namely,

    x_i \geq 0

    the weakened version of Slater’s condition implies the problem is strongly dual whenever it is feasible.
    Interestingly: (LP) may fail to be strongly dual when infeasible, i.e., there may hold p^\star = +\infty and d^\star = -\infty.







    Example 3.
    Consider the QCQP

    \text{(QCQP)} \begin{cases} \text{minimize} & \frac{1}{2}x^TQ_0x + q_0^Tx + r_0\\ \text{subject to}& \frac{1}{2}x^TQ_ix + q_i^Tx + r_i \leq 0, \quad i=1,\ldots,m \end{cases}

    where

    Q_0 \in \boldsymbol{S}_{++}^n, \quad Q_i \in \boldsymbol{S}_+^n,\quad i=1, \ldots, m.

    We now determine the dual problem.





    The Lagrangian of (QCQP) is

    \begin{aligned} L(x,\lambda) &= \frac{1}{2}x^TQ_0x + q_0^Tx + r_0\\ &+\sum_{i=1}^m \frac{1}{2}\lambda_i x^TQ_ix + \lambda_i q_i^Tx + \lambda_i r_i\\ &=  \frac{1}{2}x^T\left(Q_0+\sum_{i=1}^m \lambda_i Q_i \right)x \\ &+  \left(q_0^T + \sum_{i=1}^m \lambda_i q_i^T \right)x \\ &+  r_0 + \sum_{i=1}^m \lambda_i r_i  \end{aligned}







    Defining

    \begin{aligned} Q(\lambda) &= Q_0 + \sum_{i=1}^m \lambda_i Q_i\\ q(\lambda) &= q_0 + \sum_{i=1}^m \lambda_i q_i\\ r(\lambda) &= r_0 + \sum_{i=1}^m \lambda_i r_i, \end{aligned}

    we have

    L(x,\lambda) = \frac{1}{2}x^TQ(\lambda)x + q(\lambda)^Tx + r(\lambda) .







    We now compute the Lagrange dual function g(\lambda) for \lambda \succeq 0 .

    To begin, observe: if \lambda \succeq 0 , then

    Q(\lambda) = Q_0 + \sum_{i=1}^m \lambda_i Q_i \succ 0

    due to positive definiteness of Q_0 and positive semidefiniteness of each Q_i .
    So: Q(\lambda) is invertible and

    L(x,\lambda) = \frac{1}{2}x^TQ(\lambda)x + q(\lambda)^Tx + r(\lambda) .

    is convex in x.
    Therefore, g(\lambda) is determined by critical points of L(x,\lambda).





    Compute

    \begin{aligned} \nabla_x L(x,\lambda) &= \nabla_x \left(\frac{1}{2}x^TQ(\lambda)x\right) + \nabla_x\left( q(\lambda)^Tx \right) + \nabla_x r(\lambda)\\ &=Q(\lambda)x + q(\lambda). \end{aligned}

    Thus, x^\star is a minimizer of L(x,\lambda) iff

    \nabla_x L(x^\star,\lambda)=0 \iff x^\star = -Q(\lambda)^{-1}q(\lambda) .





    Therefore,

    \begin{aligned} g(\lambda) &= \inf\{L(x,\lambda): x \in \mathbb{R}^n\}\\ &=\min\{L(x,\lambda):x \in \mathbb{R}^n \}\\ &=L(x^\star,\lambda)\\ &=\frac{1}{2}(-Q(\lambda)^{-1}q(\lambda))^TQ(\lambda)(-Q(\lambda)^{-1}q(\lambda)) \\ &+ q(\lambda)^T(-Q(\lambda)^{-1}q(\lambda)) + r(\lambda)\\ &=-\frac{1}{2}q(\lambda)^TQ(\lambda)^{-1}q(\lambda) + r(\lambda). \end{aligned}





    Therefore, the primal problem

    \text{(QCQP)} \begin{cases} \text{minimize} & \frac{1}{2}x^TQ_0x + q_0^Tx + r_0\\ \text{subject to}& \frac{1}{2}x^TQ_ix + q_i^Tx + r_i \leq 0, \quad i=1,\ldots,m \end{cases}

    has dual problem

    \begin{cases} \text{maximize}&-\frac{1}{2}q(\lambda)^TQ(\lambda)^{-1}q(\lambda) + r(\lambda)\\ \text{subject to} & \lambda \succeq0 \end{cases}.

    Slater’s theorem implies that these two problems are strongly dual if there exists a strictly feasible x satisfying

    \frac{1}{2}x^TQ_ix + q_i^Tx + r_i < 0

    for all i =1,\ldots,m.
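    A minimal numerical sketch (assuming numpy): build random data with r_i<0 (so x=0 is strictly feasible) and check weak duality g(\lambda) \leq f_0(x) at the feasible point x=0 .

    import numpy as np

    rng = np.random.default_rng(4)
    n, m = 4, 3

    def rand_psd(shift):                       # random PSD matrix (+ shift * identity)
        B = rng.standard_normal((n, n))
        return B @ B.T + shift * np.eye(n)

    Q0, q0, r0 = rand_psd(1.0), rng.standard_normal(n), 0.0   # Q0 positive definite
    Qs = [rand_psd(0.0) for _ in range(m)]
    qs = [rng.standard_normal(n) for _ in range(m)]
    rs = np.full(m, -1.0)                      # r_i < 0: x = 0 is strictly feasible

    lam = np.abs(rng.standard_normal(m))       # any lambda >= 0
    Q = Q0 + sum(l * Qi for l, Qi in zip(lam, Qs))
    q = q0 + sum(l * qi for l, qi in zip(lam, qs))
    r = r0 + lam @ rs
    g = -0.5 * q @ np.linalg.solve(Q, q) + r   # g(lambda)

    assert g <= r0 + 1e-12                     # weak duality: g(lambda) <= p* <= f_0(0) = r0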














    Qualitative Uses of Lagrange Duality Recall: a problem has

    \begin{aligned} \text{weak duality when }& d^\star \leq p^\star;\\ \text{strong duality when }&d^\star = p^\star. \end{aligned}

    These are qualitative properties; e.g., strong duality alone does not provide a means to find a primal optimal x^\star satisfying f_0(x^\star) = p^\star .






    However, strong and weak duality have three useful “qualitative” applications:

    1. Certification: a dual feasible (\lambda,\nu) provides a certificate that g(\lambda,\nu) is a lower bound on the optimal value: g(\lambda,\nu) \leq p^\star .
      Strong duality \implies can (theoretically) certify up to any desired precision.





      Duality gap: for primal feasible x and dual feasible (\lambda,\nu) , the value

      f_0(x) - g(\lambda,\nu) .

      N.B.: if x,(\lambda,\nu) feasible, then

      g(\lambda,\nu) \leq d^\star \leq p^\star \leq f_0(x),

      i.e.,

      p^\star,d^\star \in [g(\lambda,\nu),f_0(x)]

      and the duality gap gives the length of this interval.



      In particular: if duality gap = 0 at a feasible pair x,(\lambda,\nu) , then

      \begin{aligned} g(\lambda,\nu) \leq d^\star &\leq p^\star \leq f_0(x)\\  g(\lambda,\nu) &= f_0(x)  \end{aligned}

      give

      p^\star = f_0(x) = g(\lambda,\nu) = d^\star ,

      i.e., such (\lambda,\nu) certifies that x is optimal, and vice versa.





    2. Stopping Criterion. Observing

      \begin{aligned}  g(\lambda,\nu) \leq p^\star & \iff  -p^\star \leq - g(\lambda,\nu)\\ &\iff f_0(x) - p^\star \leq f_0(x) - g(\lambda,\nu), \end{aligned}

      and setting

      \begin{aligned} \epsilon =  f_0(x) - g(\lambda,\nu)  \end{aligned}

      we see

      \begin{aligned} (\lambda,\nu) \text{ is dual feasible} \ \implies \text{primal feasible }x \text{ is }\epsilon\text{-suboptimal} \end{aligned}.

      Viz.,

      f_0(x) - p^\star \leq \epsilon .

      N.B.: this is showing x is \epsilon -suboptimal without even knowing the primal optimal p^\star .



      As application: suppose we wish to find optimal x^\star but can settle for feasible x' with f_0(x') at worst \epsilon -suboptimal:

      f_0(x') - p^\star \leq \epsilon .

      Suppose we use algorithm producing feasible x^{(k)},(\lambda^{(k)},\nu^{(k)}) in search of optimal x^\star .
      Letting

      \epsilon_k =f_0(x^{(k)}) - g(\lambda^{(k)},\nu^{(k)})

      denote the resulting duality gaps, we may use the following stopping criterion:

      \text{If } \epsilon_k \leq \epsilon \text{, then stop search} .

      Therefore, K with \epsilon_K \leq \epsilon gives feasible x'=x^{(K)} within the allowed error: f_0(x') - p^\star \leq \epsilon .



      N.B.:
      1. Re-emphasize: this stopping criterion does not require knowing the primal optimal p^\star in advance.
      2. strong duality \implies \epsilon can be arbitrarily small.






    3. Complementary slackness. Assume problem has strong duality, i.e., p^\star = d^\star .
      If x^\star is primal optimal and (\lambda^\star,\nu^\star) is dual optimal, then

      \lambda_i^\star f_i(x^\star) = 0, \quad i=1,\ldots,m.

      For example:

      \begin{aligned} \begin{bmatrix} f^T(x^\star)\\ \hline \lambda^{\star T} \end{bmatrix} = \begin{bmatrix} f_1(x^\star) & 0 & 0 & f_4(x^\star) & \cdots & 0 & f_{m-1}(x^\star) & 0\\ \hline 0 & \lambda_2^\star & \lambda_3^\star & 0 & \cdots & \lambda_{m-2}^\star & 0 & \lambda_m^\star \end{bmatrix}, \end{aligned}

      where f is the vector of inequality constraint functions.
      This relationship between the two vectors is called complementary slackness.
      N.B.:
      1. Since \lambda^\star \succeq 0 and f_i(x^\star)\leq0 , we have

        \begin{aligned} \lambda_i^\star > 0 & \implies f_i(x^\star) = 0\\ f_i(x^\star) < 0 & \implies \lambda_i^\star = 0. \end{aligned}

      2. While a qualitative property, complementary slackness can sometimes be used to solve the primal.
      3. Having \lambda_i^\star = f_i(x^\star)= 0 is permissible.






      Justification: for x^\star primal optimal and (\lambda^\star,\nu^\star) dual optimal, we find

      \begin{aligned} f_0(x^\star) &= g(\lambda^\star,\nu^\star)\\ &= \inf\{ f_0(x) + \sum_{i=1}^m \lambda_i^\star f_i(x) + \sum_{i=1}^p \nu_i^\star h_i(x) : x \in D \}\\ &\leq f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star f_i(x^\star) + \sum_{i=1}^p \nu_i^\star h_i(x^\star)\\ &\leq f_0(x^\star). \end{aligned}

      Since

      a\leq b \leq c \leq a \implies a=b=c ,

      and since h_i(x^\star) = 0, we conclude

      \begin{aligned}  f_0(x^\star) = f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star f_i(x^\star)  \end{aligned}

      and so

      \begin{aligned}  \sum_{i=1}^m \lambda_i^\star f_i(x^\star) = 0. \end{aligned}







      Since \lambda^\star \succeq0 and f_i(x^\star) \leq 0 , the sum

      \begin{aligned}  \sum_{i=1}^m \lambda_i^\star f_i(x^\star)  \end{aligned}

      is a sum of nonpositive things and so

      \begin{aligned}  \sum_{i=1}^m \lambda_i^\star f_i(x^\star) = 0 \end{aligned}

      implies

      \begin{aligned} \lambda_i^\star f_i(x^\star) = 0, \quad i=1,\ldots,m, \end{aligned}

      which is the desired complementary slackness.















    Karush-Kuhn-Tucker (KKT) Conditions Assume f_i,h_i are differentiable and have open domains.

    In the previous section, we saw: if x^\star,(\lambda^\star,\nu^\star) are optimal with zero duality gap, then

    \begin{aligned} \inf\{ f_0(x) + &\sum_{i=1}^m \lambda_i^\star f_i(x) + \sum_{i=1}^p \nu_i^\star h_i(x) : x \in D \} \\ &= f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star f_i(x^\star) + \sum_{i=1}^p \nu_i^\star h_i(x^\star). \end{aligned}



    Question: What does this say about relationship between L(x,\lambda,\nu), x^\star and (\lambda^\star,\nu^\star) (under strong duality)?





    The above is equivalent to

    \inf\{L(x,\lambda^\star,\nu^\star):x \in D\} = L(x^\star,\lambda^\star,\nu^\star) ,

    or

    x^\star \in \text{argmin}\{L(x,\lambda^\star,\nu^\star): x \in D \} .

    Viz., if
    • problem is strongly dual,
    • x^\star is primal optimal, and
    • (\lambda^\star,\nu^\star) is dual optimal,
    then x^\star minimizes the Lagrangian L(\cdot,\lambda^\star,\nu^\star) with dual optimal Lagrange multipliers.





    But, if
    • x^\star minimizes L(\cdot,\lambda^\star,\nu^\star) and
    • f_i,h_i are differentiable,
    then x \mapsto L(x,\lambda^\star,\nu^\star) is differentiable and x^\star is a critical point:

    \begin{aligned} \nabla_xL(x^\star,\lambda^\star,\nu^\star) &= \nabla f_0(x^\star) + \sum_{i=1}^m\lambda^\star_i \nabla f_i(x^\star) + \sum_{i=1}^p \nu^\star_i \nabla h_i(x^\star)\\ &=0 \end{aligned}

    Beware: we do not know a priori if \nabla f_0(x^\star) or \nabla h_i (x^\star) are zero.





    Karush-Kuhn-Tucker (KKT) Optimality Conditions:

    \begin{aligned} f_i(x^\star) & \leq 0 , \quad i=1,\ldots,m\\ h_i(x^\star) & = 0 , \quad i=1,\ldots,p\\ \lambda_i^\star f_i(x^\star) &= 0, \quad i =1,\ldots,m\\ \lambda^\star &\succeq 0\\ \nabla_x L(x^\star,\lambda^\star,\nu^\star) &=0 \end{aligned}

    Recap:
    • these are necessary conditions for any strongly dual optimization problem admitting primal optimal x^\star and dual optimal (\lambda^\star,\nu^\star);
    • the first and second conditions just indicate x^\star is primal feasible;
    • the third condition is the complementary slackness derived in previous section;
    • the fourth condition is standard nonnegativity of Lagrange multiplier \lambda;
    • the last condition follows from x^\star minimizing L(\cdot,\lambda^\star,\nu^\star).















    KKT and Convexity Theorem. If the primal problem is differentiable and convex, then the KKT conditions are sufficient for primal and dual optimality and strong duality.

    Viz., for differentiable convex problems, if x' and (\lambda',\nu') satisfy the KKT conditions, then they are automatically primal and dual optimal, respectively.
    Proof. Step 1. Suppose x' and (\lambda',\nu') satisfy the KKT conditions:

    \begin{aligned} f_i(x') & \leq 0 , \quad i=1,\ldots,m\\ h_i(x') & = 0 , \quad i=1,\ldots,p\\ \lambda_i' f_i(x') &= 0, \quad i =1,\ldots,m\\ \lambda' &\succeq 0\\ \nabla_x L(x',\lambda',\nu') &=0 \end{aligned}.

    First two conditions \implies x' is primal feasible.





    Step 2. Observe

    \lambda' \succeq 0 \implies \lambda'_if_i(x) are convex.

    Thus

    \begin{aligned} L(x,\lambda',\nu') = f_0(x) + \sum_{i=1}^m \lambda_i' f_i(x) + \sum_{i=1}^p \nu_i' h_i(x) \end{aligned}

    as a function of x is a sum of convex functions and hence convex.

    Therefore \nabla_x L(x',\lambda',\nu') = 0 \implies x' is a minimizer.
    This also implies (\lambda',\nu') is dual feasible since

    \begin{aligned} g(\lambda',\nu') &= \inf\{ L(x,\lambda',\nu'): x \in D \} \\ &= L(x',\lambda',\nu')\\ &> -\infty  \end{aligned} .







    Step 3. By Step 2., feasibility of x' and the complementary slackness

    \lambda_i' f_i(x') = 0, \quad i =1,\ldots,m,

    there holds

    \begin{aligned} g(\lambda',\nu') &= L(x',\lambda',\nu')\\ &= f_0(x') + \sum_{i=1}^m\lambda'_i f_i(x') + \sum_{i=1}^p \nu'_i h_i(x')\\ &=f_0(x'). \end{aligned}

    Therefore, the duality gap f_0(x') - g(\lambda',\nu') vanishes and hence x' is primal optimal and (\lambda',\nu') is dual optimal.

    Indeed, recall:

    g(\lambda',\nu') \leq d^\star \leq p^\star \leq f_0(x')

    and so g(\lambda',\nu')=f_0(x') implies

    g(\lambda',\nu') = d^\star = p^\star = f_0(x').









    Corollary. If the primal problem is differentiable, convex and satisfies Slater’s condition, then the KKT conditions are necessary and sufficient for primal and dual optimality and strong duality.

    Viz.: in this situation, finding all solutions to KKT conditions provides all solutions to the given problem.





    Remark. In convex optimization, many algorithms are conceived as methods for solving KKT conditions.
    Moreover, the KKT conditions for some problems may be solved analytically.














    Example 1 (CO Example 5.1)
    Consider the quadratic program

    \text{(QP)} \begin{cases} \text{minimize} &  \frac{1}{2}x^TQx + q^Tx + r\\ \text{subject to} & Ax = b \end{cases}

    with Q \in \boldsymbol{S}_+^n .

    Goal: Derive the KKT conditions for (QP) and solve (QP).





    Step 0. Observe that (QP) is a differentiable convex problem with no inequality constraints, so Slater’s condition reduces to feasibility of Ax = b ; assuming feasibility, the KKT conditions may be used to solve it.






    Step 1. Find the Lagrangian and its gradient:
    (QP) only has the equality constraint Ax = b and so the Lagrangian is

    L(x,\nu) = \frac{1}{2}x^TQx + q^Tx + r + \nu^T(Ax-b) .

    Differentiating with respect to x gives

    \begin{aligned} \nabla_x L(x,\nu) &= \nabla_x(\frac{1}{2}x^TQx) + \nabla_x(q^Tx) + \nabla_x r + \nabla_x(\nu^TAx - \nu^T b) \\ &= Qx + q +A^T\nu. \end{aligned}







    Step 2. Construct the KKT conditions:
    Since (QP) has no inequality constraints, the KKT conditions take the form

    \begin{aligned} h_i(x^\star) &= 0, \quad i= 1,\ldots,p\\ \nabla_x L(x^\star,\nu^\star) &= 0. \end{aligned}

    Viz., the KKT conditions for (QP) are

    \begin{aligned} Ax^\star &= b \\ Qx^\star + q + A^T\nu^\star&=0. \end{aligned}







    Rewriting the KKT conditions as

    \begin{aligned} Qx^\star + A^T\nu^\star&=-q\\ Ax^\star &= b, \end{aligned}

    the KKT conditions are evidently equivalent to the matrix equation

    \begin{bmatrix} Q & A^T\\ A & 0 \end{bmatrix} \begin{bmatrix} x^\star\\ \nu^\star \end{bmatrix} = \begin{bmatrix} -q\\ b \end{bmatrix}.







    Conclusion: Solving the quadratic program

    \begin{cases} \text{minimize} &  \frac{1}{2}x^TQx + q^Tx + r\\ \text{subject to} & Ax = b \end{cases}

    is equivalent to solving the linear equation

    \begin{bmatrix} Q & A^T\\ A & 0 \end{bmatrix} \begin{bmatrix} x^\star\\ \nu^\star \end{bmatrix} = \begin{bmatrix} -q\\ b \end{bmatrix}.
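    A minimal numerical sketch (assuming numpy): assemble and solve the KKT system directly.

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 5, 2
    B = rng.standard_normal((n, n))
    Q = B @ B.T + np.eye(n)                    # Q positive definite
    q = rng.standard_normal(n)
    A = rng.standard_normal((p, n))            # full row rank, so K is nonsingular
    b = rng.standard_normal(p)

    K = np.block([[Q, A.T], [A, np.zeros((p, p))]])
    sol = np.linalg.solve(K, np.concatenate([-q, b]))
    x_star, nu_star = sol[:n], sol[n:]

    print(np.allclose(A @ x_star, b))                      # primal feasibility
    print(np.allclose(Q @ x_star + q + A.T @ nu_star, 0))  # stationarity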
















    Example 2 (CO Example 5.2)
    Consider the convex optimization problem

    \text{(WF)} \begin{cases} \text{minimize} &  -\sum_{i=1}^n \log(\alpha_i + x_i)\\ \text{subject to} & \boldsymbol{1}^Tx  = 1\\ & x \succeq 0 \end{cases}

    where \alpha_i > 0 and \boldsymbol{1} \in \mathbb{R}^n is the vector of 1’s.

    Goal: Derive the KKT conditions for (WF) and solve (WF).





    Step 0. Observe
    • the domain of (WF) contains the nonnegative orthant x \succeq 0 .
    • (WF) is a differentiable convex problem.
    • The condition x \succeq 0 is equivalent to -x \preceq 0 .
    • (WF) satisfies Slater’s condition since there exists x \succ 0 with \boldsymbol{1}^Tx = 1 , and so (WF) satisfies strong duality.
    • Therefore, we may use KKT conditions to solve (WF).






    Step 1. Find the Lagrangian and its gradient:
    Given the constraints

    \begin{cases}&\boldsymbol{1}^Tx  = 1\\ & -x \preceq 0 \end{cases}

    the Lagrangian is

    \begin{aligned}  L(x,\lambda,\nu) = -\sum_{i=1}^n \log(\alpha_i + x_i) - \lambda^T x + \nu(\boldsymbol{1}^Tx - 1) \end{aligned}

    with Lagrange multipliers

    \lambda \in \mathbb{R}^n, \quad \nu \in \mathbb{R} .

    Differentiating with respect to x gives

    \begin{aligned}  \nabla_x L(x,\lambda,\nu) &= -\sum_{i=1}^n \nabla_x \log (\alpha_i + x_i) - \nabla_x(\lambda^T x) +\nu \nabla_x(\boldsymbol{1}^Tx)\\ &= -\begin{bmatrix} \frac{1}{\alpha_1 + x_1} & \cdots & \frac{1}{\alpha_n + x_n} \end{bmatrix}^T - \lambda + \nu \boldsymbol{1}. \end{aligned}

    Thus, x^\star is a critical point of L(\cdot,\lambda,\nu) iff it satisfies the system of equations

    -\frac{1}{\alpha_i + x_i} - \lambda_i + \nu = 0, \quad i = 1,\ldots, n .







    Step 2. Construct the KKT conditions:
    Since (WF) has inequality and equality constraints, the KKT conditions take the form

    \begin{aligned} f_i(x^\star) & \leq 0 , \quad i=1,\ldots,m\\ h_i(x^\star) & = 0 , \quad i=1,\ldots,p\\ \lambda_i^\star f_i(x^\star) &= 0, \quad i =1,\ldots,m\\ \lambda^\star &\succeq 0\\ \nabla_x L(x^\star,\lambda^\star,\nu^\star) &=0. \end{aligned}

    Thus, the KKT conditions for (WF) are

    \begin{aligned} x^\star &\succeq 0 \\ \boldsymbol{1}^Tx^\star & = 1\\ \lambda_i^\star x_i^\star &= 0, \quad i =1,\ldots,n\\ \lambda^\star &\succeq 0\\ -\frac{1}{\alpha_i + x_i^\star} - \lambda_i^\star + \nu^\star &= 0, \quad i = 1,\ldots, n . \end{aligned}







    Observe:

    \begin{aligned} \lambda_i^\star x_i^\star &= 0, \quad i =1,\ldots,n\\ \lambda^\star &\succeq 0\\ -\frac{1}{\alpha_i + x_i^\star} - \lambda_i^\star + \nu^\star &= 0, \quad i = 1,\ldots, n . \end{aligned}

    is equivalent to

    \begin{aligned}  \left(\nu^\star - \frac{1}{\alpha_i + x_i^\star} \right) x_i^\star &= 0, \quad i =1,\ldots,n\\ \nu^\star &\geq \frac{1}{\alpha_i + x_i^\star}, \quad i = 1,\ldots, n . \end{aligned}

    (In particular, \lambda^\star is acting as a slack variable.)





    Therefore, we wish to solve:

    \begin{aligned} x^\star &\succeq 0 \\ \boldsymbol{1}^Tx^\star & = 1\\ \left(\nu^\star - \frac{1}{\alpha_i + x_i^\star} \right) x_i^\star &= 0, \quad i =1,\ldots,n\\ \nu^\star &\geq \frac{1}{\alpha_i + x_i^\star}, \quad i = 1,\ldots, n . \end{aligned}

    We will solve for x_i^\star in terms of \nu^\star by considering two cases.





    Case 1: \nu^\star < \frac{1}{\alpha_i} .
    Observe

    \frac{1}{\alpha_i + x_i^\star} \leq \nu^\star < \frac{1}{\alpha_i}

    is only possible for x^\star\succeq 0 if x_i^\star >0 .
    Then the complementary slackness

    \left(\nu^\star - \frac{1}{\alpha_i + x_i^\star} \right) x_i^\star = 0

    enforces \nu^\star = \frac{1}{\alpha_i + x_i^\star} , i.e., x_i^\star = \frac{1}{\nu^\star} - \alpha_i .





    Case 2: \nu^\star \geq \frac{1}{\alpha_i}
    If x_i^\star >0 , then

    \nu^\star - \frac{1}{\alpha_i + x_i^\star}>0

    and so the complementary slackness

    \left(\nu^\star - \frac{1}{\alpha_i + x_i^\star} \right) x_i^\star = 0

    furnishes the contradiction x_i^\star = 0 .
    Thus \nu^\star \geq \frac{1}{\alpha_i} \implies x_i^\star = 0 .





    Putting the two cases together:

    \begin{aligned} x_i^\star &=  \begin{cases} \frac{1}{\nu^\star} - \alpha_i & \nu^\star < \frac{1}{\alpha_i}\\ 0 & \nu^\star \geq \frac{1}{\alpha_i} \end{cases}\\ &= \max\{ 0 ,\frac{1}{\nu^\star} - \alpha_i \}. \end{aligned}





    Next, using \boldsymbol{1}^Tx^\star = 1 , we get

    \begin{aligned} \sum_{i=1}^n \max\{0,\frac{1}{\nu^\star} - \alpha_i\} =1. \end{aligned}

    This is enough to solve for \nu^\star and hence x^\star .





    Further details: Consider the function

    G(t) = \sum_{i=1}^n \max\{0,t-\alpha_i\}

    with 0<\alpha_1 < \alpha_2 < \cdots < \alpha_n .
    Observe

    \begin{aligned} \text{on }[0,\alpha_1] \text{ there holds }& G(t)=0\\ \text{on }[\alpha_1,\alpha_2] \text{ there holds }& G(t) = t-\alpha_1\\ \text{on }[\alpha_2,\alpha_3] \text{ there holds }& G(t) = t-\alpha_1 + t-\alpha_2 = 2t - \alpha_1 - \alpha_2 \end{aligned}

    and so on.
    Moreover, G is continuous.
    Thus G(t) is an increasing continuous piecewise linear function.
    Then G(t) = 1 may be solved by finding when the graph of G(t) crosses the horizontal line y = 1 .
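    A minimal numerical sketch (assuming numpy): solve G(t)=1 by bisection (with t = 1/\nu^\star ) and recover x^\star .

    import numpy as np

    alpha = np.array([0.2, 0.5, 1.0, 2.0])
    G = lambda t: np.maximum(0.0, t - alpha).sum()

    lo, hi = 0.0, alpha.max() + 1.0            # G(lo) = 0 <= 1 <= G(hi)
    for _ in range(60):                        # bisection: G is continuous and increasing
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if G(mid) < 1.0 else (lo, mid)

    t = 0.5 * (lo + hi)                        # t = 1/nu*
    x = np.maximum(0.0, t - alpha)             # x_i* = max{0, 1/nu* - alpha_i}
    print(x, x.sum())                          # x* >= 0 and sums to 1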














    Perturbation Given the OP:

    \text{(OP)} \begin{cases} \text{minimize}& f_0(x)\\ \text{subject to} & f_i(x) \leq 0, \quad i=1,\ldots,m\\ & h_i(x) = 0, \quad i=1,\ldots,p, \end{cases}

    a natural question is:

    How do p^\star and x^\star behave under perturbations of constraints?

    More precisely: given u \in \mathbb{R}^m , v \in \mathbb{R}^p , consider the perturbed problem

    \text{(OP)}_{uv} \begin{cases} \text{minimize}& f_0(x)\\ \text{subject to} & f_i(x) \leq u_i, \quad i=1,\ldots,m\\ & h_i(x) = v_i, \quad i=1,\ldots,p, \end{cases}

    Observe
    • (u,v)=(0,0) results in \text{(OP)}_{uv} = \text{(OP)}_{00} = \text{(OP)} ;
    • u_i>0 results in relaxing f_i(x) \leq 0 ;
    • u_i<0 results in tightening f_i(x) \leq 0 ;
    • v_i \neq 0 results in “translating” solution set of h_i(x)=0.






    Example 1.
    The image below depicts various perturbations in inequality constraints (the shaded regions) and equality constraints (the dashed contours).
    N.B.: perturbing the equality constraint results in using different contours.






    Example 2.
    The three images below depict the constraint

    x^2+y^2 - 1 \leq u

    with u=-0.5,0,0.5, respectively.
    The next three images depict the constraint

    x+y-1=v

    with v=-0.5,0,0.5, respectively.






    The optimality function
    For u \in \mathbb{R}^m and v \in \mathbb{R}^p , let p^\star(u,v) denote the primal optimal value for \text{(OP)}_{uv} .
    Can therefore introduce the function

    \begin{aligned} p^\star(\cdot,\cdot) : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R} \end{aligned}

    which assigns perturbation (u,v) the primal optimal value p^\star(u,v) . N.B.:
    • p^\star(0,0)=p^\star= primal optimal value of unperturbed problem;
    • If p^\star(u,v)=+\infty , then the perturbation (u,v) makes the problem infeasible.
    • If p^\star(u,v)=-\infty , then the perturbation (u,v) makes the problem unbounded.
    • If (OP) is convex, then p^\star(u,v) is convex in (u,v) .






    Example
    Consider the problem

    \begin{cases} \text{minimize} &  f_0(x) = -\sqrt{x}\\ \text{subject to} & x-1 \leq 0 \end{cases}.

    Given \text{dom}\,f_0 = \mathbb{R}_+ and the constraint, problem is:

    minimize -\sqrt{x} on [0,1].

    Evidently, x^\star = 1,  p^\star = -1 .





    Consider now the perturbation

    \begin{cases} \text{minimize} &  f_0(x) = -\sqrt{x}\\ \text{subject to} & x-1 \leq u \end{cases}.

    Viz.,

    minimize -\sqrt{x} on [0,1+u].

    Given \text{dom}\,f_0 = \mathbb{R}_+ , perturbed problem is feasible only for u \geq -1 .
    Can thus conclude

    p^\star(u) =  \begin{cases} -\sqrt{1+u} & u \geq -1\\ +\infty & u<-1 \end{cases}.







    The graph of p^\star(u) is plotted below; observe the behavior of p^\star(u) as the constraint is relaxed.















    Sensitivity Theorem. If the primal has strong duality and the dual optimal d^\star is achieved by dual feasible (\lambda^\star,\nu^\star), then

    p^\star(0,0)-  \lambda^{\star T}u - \nu^{\star T}v \leq p^\star(u,v)

    for any u \in \mathbb{R}^m and v \in \mathbb{R}^p .


    Proof. Fix the perturbation vector (u,v) and let x be feasible for the resulting perturbed problem:

    \begin{aligned} f_i(x) &\leq u_i, \quad i=1,\ldots,m\\ h_i(x) &= v_i, \quad i=1,\ldots,p. \end{aligned}







    Observe:

    \begin{aligned} \begin{aligned} f_i(x) &\leq u_i\\ \lambda^\star & \succeq 0 \end{aligned} &\implies \lambda_i^\star f_i(x) \leq \lambda_i^\star u_i \\ &\implies \sum_{i=1}^m \lambda_i^\star f_i(x) \leq \lambda^{\star T}u\\ h_i(x) = v_i &\implies \sum_{i=1}^p \nu_i^\star h_i(x) = \nu^{\star T} v. \end{aligned}







    Using this and strong duality gives

    \begin{aligned} p^\star(0,0)&=g(\lambda^\star,\nu^\star)\\ &\leq f_0(x) + \sum_{i=1}^m \lambda_i^\star f_i(x) + \sum_{i=1}^p \nu_i^\star h_i(x)\\ &\leq f_0(x) + \lambda^{\star T}u + \nu^{\star T}v. \end{aligned}

    Rearranging gives

    \begin{aligned} p^\star(0,0)-  \lambda^{\star T}u - \nu^{\star T}v \leq f_0(x) \end{aligned}

    for all x feasible for the perturbed problem.
    Since LHS independent of x , there holds

    \begin{aligned} p^\star(0,0)-  \lambda^{\star T}u - \nu^{\star T}v \leq p^\star(u,v). \end{aligned}
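    For instance, applied to the earlier example (minimize -\sqrt{x} subject to x-1 \leq u ): stationarity -\frac{1}{2\sqrt{x}} + \lambda = 0 at x^\star = 1 gives \lambda^\star = \frac{1}{2} , so the theorem predicts

    p^\star(u) \geq p^\star(0) - \lambda^\star u = -1 - \frac{u}{2} ,

    and indeed -\sqrt{1+u} \geq -1-\frac{u}{2} for u \geq -1 , since \left(1+\frac{u}{2}\right)^2 = 1 + u + \frac{u^2}{4} \geq 1+u .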










    Remark. Using this theorem, the sizes and signs of \lambda_i^\star,\nu_i^\star indicate how sensitive the primal optimal value is to perturbations of the constraints.





    Example Suppose m=p=1 , so the Theorem’s inequality takes the form

    p^\star(u,v) \geq p^\star(0,0) - \lambda^\star u - \nu^\star v, \quad \lambda^\star,\nu^\star,u,v\in\mathbb{R}.

    We make four observations.
    1. The larger \lambda^\star is, the more tightening f_1(x)\leq0 (i.e., taking u<0 ) forces p^\star(u,v) to increase.
      Consider \lambda^\star = 100 :

      \begin{aligned} f_1(x) \leq -0.01 &\implies p^\star(-0.01,0) \geq p^\star(0,0) + 1 \\ f_1(x) \leq -0.1 &\implies p^\star(-0.1,0) \geq p^\star(0,0) + 10  \\ f_1(x) \leq -1 &\implies p^\star(-1,0) \geq p^\star(0,0) + 100  \\ \end{aligned}








    2. The smaller \lambda^\star is, the more flexibility we have to relax f_1(x) \leq 0 without decreasing p^\star(u,v) too much.
      Consider \lambda^\star = 0.01 :

      \begin{aligned} f_1(x) \leq 1 &\implies p^\star(1,0) \geq p^\star(0,0) - 0.01 \\ f_1(x) \leq 10 &\implies p^\star(10,0) \geq p^\star(0,0) - 0.1  \\ f_1(x) \leq 100 &\implies p^\star(100,0) \geq p^\star(0,0) - 1 \\ \end{aligned}








    3. When \nu^\star v <0 : the larger |\nu^\star| is, the more changing v forces p^\star(u,v) to increase.
      Consider \nu^\star = \pm 100 :

      \begin{aligned} h_1(x) = \mp 0.01 &\implies p^\star(0,\mp 0.01) \geq p^\star(0,0)  + 1 \\ h_1(x) = \mp 0.1 &\implies p^\star(0,\mp 0.1) \geq p^\star(0,0)  + 10 \\ h_1(x) = \mp 1 &\implies p^\star(0,\mp 1) \geq p^\star(0,0)  + 100 \\ \end{aligned}








    4. When \nu^\star v > 0 : the smaller |\nu^\star| is, the more flexibility we have to change v without decreasing p^\star(u,v) too much.
      Consider \nu^\star = \pm 0.01:

      \begin{aligned} h_1(x) = \pm 1 &\implies p^\star(0,\pm 1) \geq p^\star(0,0)  - 0.01 \\ h_1(x) = \pm 10 &\implies p^\star(0,\pm 10) \geq p^\star(0,0)  - 0.1 \\ h_1(x) = \pm 100 &\implies p^\star(0,\pm 100) \geq p^\star(0,0)  - 1 \\ \end{aligned}
















    Example This example reviews KKT conditions, Slater’s condition, perturbation and sensitivity.
    Consider the problem

    \text{(P)} \begin{cases} \text{minimize} & \frac{1}{2}x - \frac{1}{2}y + 1\\ \text{subject to}& \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 \leq 0 \\ & x+y = R \end{cases},

    where R>0 is fixed. (N.B.: despite the linear objective, (P) is not an LP, since the inequality constraint is quadratic.)





    The Lagrangian and its gradient
    Since the constraints are

    \begin{cases} & \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 \leq 0 \\ & x+y = R \end{cases},

    the Lagrangian takes the form

    \begin{aligned} L(x,y,\lambda,\nu) &= \frac{1}{2}x - \frac{1}{2}y + 1 \\ &+ \frac{\lambda}{R^2}x^2 + \frac{\lambda}{R^2}y^2 - \lambda \\ &+ \nu x + \nu y - \nu R. \end{aligned}

    Differentiating with respect to (x,y) gives

    \nabla L =  \begin{bmatrix} \frac{1}{2} + \frac{2\lambda}{R^2}x + \nu\\ -\frac{1}{2} + \frac{2\lambda}{R^2}y + \nu\\ \end{bmatrix}.







    KKT Conditions
    Since (P) has both inequality and equality constraints, its KKT conditions take the form

    \begin{aligned} f_i(x^\star) & \leq 0 , \quad i=1,\ldots, m\\ h_i(x^\star) & = 0 , \quad i=1,\ldots,p\\ \lambda_i^\star f_i(x^\star) &= 0, \quad i =1,\ldots,m\\ \lambda^\star &\succeq 0\\ \nabla L(x^\star,\lambda^\star,\nu^\star) &=0. \end{aligned}

    Therefore, we wish to solve

    \begin{aligned} \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 &\leq 0\\ x+y-R &=0\\ \lambda\left(\frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1\right) &= 0\\ \lambda &\geq 0\\ \frac{1}{2} + \frac{2\lambda}{R^2}x + \nu &= 0\\ -\frac{1}{2} + \frac{2\lambda}{R^2}y + \nu &=0 \end{aligned}







    Observe:

    \begin{aligned} \lambda &=0\\ \nabla L &= 0 \end{aligned} \implies \begin{aligned} \frac{1}{2} + \nu &= 0\\ -\frac{1}{2} + \nu &= 0 \end{aligned},

    which is impossible and so \lambda > 0 .

    Using complementary slackness and \lambda > 0 gives

    \begin{aligned} \lambda\left(\frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1\right) = 0 \implies \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 =0. \end{aligned}







    Therefore, x and y must solve

    \begin{aligned} x+y-R &=0\\ \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 &=0. \end{aligned}

    Equivalently,

    \begin{aligned} \frac{1}{R^2}(R-y)^2 + \frac{1}{R^2}y^2 - 1 =0, \end{aligned}

    which has solutions y=0,R , and whence x=R,0, respectively.





    Using \nabla L = 0 and (x,y)=(R,0) gives

    \begin{aligned} \frac{1}{2} + \frac{2\lambda}{R^2}x + \nu &= 0\\ -\frac{1}{2} + \frac{2\lambda}{R^2}y + \nu &=0 \end{aligned} \implies \begin{aligned} \frac{1}{2} + \frac{2\lambda}{R} + \nu &= 0\\ -\frac{1}{2} + \nu &=0, \end{aligned}

    which has no solution for \lambda>0 .
    Therefore, (x,y)=(R,0) is not optimal.





    Using \nabla L = 0 and (x,y)=(0,R) gives

    \begin{aligned} \frac{1}{2} + \frac{2\lambda}{R^2}x + \nu &= 0\\ -\frac{1}{2} + \frac{2\lambda}{R^2}y + \nu &=0 \end{aligned} \implies \begin{aligned} \frac{1}{2}  + \nu &= 0\\ -\frac{1}{2} + \frac{2\lambda}{R} +  \nu &=0, \end{aligned}

    which has solution (\lambda^\star,\nu^\star)=(\frac{R}{2},-\frac{1}{2}).





    Conclusion.

    \begin{aligned} \text{primal optimal point} &= (x^\star,y^\star)=(0,R)\\ \text{dual optimal point} &= (\lambda^\star,\nu^\star)=(\frac{R}{2},-\frac{1}{2})\\ \text{primal optimal value} &= p^\star\\ &=\frac{1}{2}\cdot 0 - \frac{1}{2}R + 1\\ & = -\frac{R}{2}+1 \end{aligned}







    Sensitivity
    Recall the sensitivity inequality:

    p^\star(0,0)-  \lambda^{\star T}u - \nu^{\star T}v \leq p^\star(u,v)

    This inequality applies to (P) since it is strongly dual (Slater’s condition holds: e.g., (x,y)=(\tfrac{R}{2},\tfrac{R}{2}) is strictly feasible) and d^\star is achieved.
    Thus

    -\frac{R}{2} + 1 -  \frac{R}{2}u + \frac{1}{2}v \leq p^\star(u,v)

    where p^\star(u,v) is the primal optimal for the perturbed problem

    \begin{cases} \text{minimize} & \frac{1}{2}x - \frac{1}{2}y + 1\\ \text{subject to}& \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 \leq u \\ & x+y - R=v \end{cases}.







    Remarks.
    • By our previous analysis: if R is large, then the problem ought to be sensitive to making u more and more negative.





    • In hindsight, this is obvious: the inequality

      \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 \leq u

      is equivalent to

      x^2 + y^2 \leq R^2 + uR^2

      Therefore, perturbing by making u more and more negative considerably restricts the problem to a smaller and smaller disk.
      In fact, the problem is no longer feasible for any v when u < -1 !






    • Explicitly: it is straightforward to compute (a short derivation follows this list)

      p^\star(u,0) =-\frac{R}{2}\sqrt{2u+1} +1 .

      Compare

      p^\star(0,0) =-\frac{R}{2}+1 \ll p^\star(u\sim -\frac{1}{2},0) \sim 1 .
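    (Derivation sketch: on the line x+y=R , substitute x = \frac{R+s}{2} , y = \frac{R-s}{2} with s = x-y ; the constraint becomes \frac{R^2+s^2}{2} \leq R^2(1+u) , i.e., s^2 \leq R^2(2u+1) , and minimizing the objective \frac{s}{2}+1 gives s^\star = -R\sqrt{2u+1} .)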







    Question: What if we had formulated the problem with the constraint

    x^2 + y^2 - R^2 \leq 0

    instead of

    \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 \leq 0 ?

    Why is this problem no longer sensitive to small negative perturbations u ?














    Geometry of Lagrangian Duality Goal: Provide a geometric description of Lagrange duality.

    Restriction: We consider only the setting of one inequality constraint and no equality constraints; cf. CO Section 5.3 for general dimensions.





    One Constraint Setting
    Consider the OP

    \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_1(x) \leq 0 \end{cases}

    with x \in \mathbb{R}^n and domain D .
    Construct the vector function

    F(x) =  \begin{bmatrix} f_1(x)\\ f_0(x) \end{bmatrix}.

    Consider the sets

    \begin{aligned} \mathcal{G} &= \{(u,t) : u=f_1(x),t=f_0(x)\text{ for some }x \in D \}\\ &= F(D)\\ \mathcal{G}_{\text{feas}} &= \{(u,t) \in \mathcal{G}: u \leq 0\}\\ &= \{ (u,t) : u=f_1(x),t=f_0(x) \text{ for some feasible }x\} \end{aligned}

    Thus (u,t) \in \mathcal{G}_{\text{feas}} iff

    there exists x \in D with f_0(x)=t and u=f_1(x)\leq0.







    Let

    \begin{aligned} t^\star &= \inf\{t : \exists u\leq0 \text{ with } (u,t) \in \mathcal{G} \} \\ &=\inf\{t:(u,t) \in \mathcal{G}_{\text{feas}}\}. \end{aligned}

    Intuitively: t^\star is the smallest t such that (u,t) \in \mathcal{G}_{\text{feas}} for some u \leq 0 .
    Thus t^\star is the smallest value f_0(x) can take among feasible x .
    Can therefore conclude: p^\star = t^\star .
    Explicitly:

    \begin{aligned} t^\star &= \inf\{t : \exists u\leq0 \text{ with } (u,t) \in \mathcal{G} \}\\ &= \inf\{t : \exists x \in D \text{ with } f_1(x) \leq 0 , f_0(x)=t \}\\ &= \inf\{f_0(x) : x \text { feasible}\}\\ &=p^\star. \end{aligned}







    Examples
    1. Consider the problem

      \begin{cases} \text{minimize} & \frac{1}{2}s \cos(s)\\ \text{subject to}& 3\log(s-1) - 3 \leq 0 \end{cases}.

      Define

      F(s) =  \begin{bmatrix} 3\log(s-1) - 3\\ \frac{1}{2}s \cos(s) \end{bmatrix}.

      Then F(s) describes a parametric curve in \mathbb{R}^2 .





      The image \mathcal{G} of F(s) is plotted below:

      Question: Which point (A, B, C, or another) corresponds to the optimal value p^\star ?

      Answer. N.B.: \mathcal{G}_{\text{feas}} corresponds to the portion of \mathcal{G} with u \leq 0 .
      This portion of \mathcal{G} is the dashed line in the graph below.
      Observed above: p^\star corresponds to smallest value the t -coordinate “can take” for (u,t) \in \mathcal{G}_{\text{feas}} .
      Consequently, B corresponds to the point (u,p^\star) .






    2. Consider a problem whose \mathcal{G} is given by the curve and its enclosed region as depicted below:
      Question: Which point (A, B, C, or another) corresponds to the optimal value p^\star ?

      Answer. p^\star = 3 .
      This is the t -coordinate for the points A and B.
      N.B.: both A and B belong to \mathcal{G}_{\text{feas}} .
      Remark: C is not considered because C is not in \mathcal{G}_{\text{feas}}.
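    Example 1 above is also simple enough to check numerically. A minimal sketch (assuming numpy; the grid approach is ours): the constraint 3\log(s-1) - 3 \leq 0 says exactly that 1 < s \leq 1 + e , so we minimize the objective over a grid of that interval.

    import numpy as np

    # Numeric estimate of p* for Example 1: minimize (1/2) s cos(s) over (1, 1+e].
    s = np.linspace(1 + 1e-9, 1 + np.e, 1_000_001)
    f0 = 0.5 * s * np.cos(s)
    i = np.argmin(f0)
    print(s[i], f0[i])   # prints the minimizer and p*; p* is the t-coordinate of B above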






    The Lagrange dual function
    For each \lambda \in \mathbb{R} , define the function

    \begin{aligned} \Gamma_\lambda&:\mathcal{G} \to \mathbb{R}\\ \Gamma_\lambda(u,t) &= \begin{bmatrix} \lambda & 1 \end{bmatrix}\begin{bmatrix}u \\ t \end{bmatrix} = \lambda u + t. \end{aligned}

    But (u,t) \in \mathcal{G} iff u=f_1(x) and t=f_0(x) for some x \in D and so

    \begin{aligned} \Gamma_\lambda(u,t) = \lambda u + t = \lambda f_1(x) + f_0(x). \end{aligned}

    Question: Have we seen this before?
    Answer. \lambda f_1(x) + f_0(x) is exactly the Lagrangian L(x,\lambda) !






    Since (this is an equality of sets of real numbers)

    \begin{aligned} \{ \Gamma_\lambda(u,t) : (u,t) \in \mathcal{G} \} &= \{ \lambda u + t : (u,t) \in \mathcal{G} \}\\ &= \{ L(x,\lambda) : x \in D \} \end{aligned}

    we conclude

    \begin{aligned} g(\lambda) &= \inf \{ L(x,\lambda) : x \in D \}\\ &= \inf \{ \Gamma_\lambda(u,t) : (u,t) \in \mathcal{G} \} \\ &= \inf \{ \lambda u + t: (u,t) \in \mathcal{G} \} . \end{aligned}

    In particular:

    g(\lambda) \leq \lambda u + t

    for all (u,t) \in \mathcal{G} ; i.e.,

    \{ (u,t) \in \mathbb{R}^2 : \lambda u + t = g(\lambda) \}

    is a supporting hyperplane of \mathcal{G} .
    N.B.: g(\lambda) is the t -intercept of this line.
    (This is all only meaningful if g(\lambda) is finite.)





    Weak duality revisited
    Observe:

    \lambda \geq 0 , \quad u \leq 0 \implies \lambda u \leq 0

    and so

    \lambda u + t \leq t.

    Using g(\lambda) \leq \lambda u + t for (u,t) \in \mathcal{G}_{\text{feas}} gives

    g(\lambda) \leq t.

    Since this holds for all t with (u,t) \in \mathcal{G}_{\text{feas}} , we conclude

    g(\lambda) \leq p^\star.

    Since this holds for all \lambda\geq0 , we conclude weak duality:

    d^\star \leq p^\star.







    Example 1.
    Consider a problem whose \mathcal{G} is given by the curve and its enclosed region as depicted below:






    The image below depicts
    • the optimal value p^\star ;
    • the line g(\frac{2}{3}) = \frac{2}{3}u + t ;
    • and the value g(\frac{2}{3}) = 2 given as the t-intercept of this line.






    The image below depicts the line g(\frac{4}{3}) = \frac{4}{3}u+t :






    Remark. Observe that no supporting hyperplane of \mathcal{G} can intersect (0,p^\star).
    Thus, there are no multipliers \lambda^\star such that g(\lambda^\star)= p^\star .
    As a result, this problem is not strongly dual.
    This is further indicated in the image below.






    Example 2.
    Consider a problem whose \mathcal{G} is given by the curve and its enclosed region as depicted below:
    Question: Does this problem satisfy strong duality?
    Answer. Yes!
    As the image below depicts, observe that there is a supporting hyperplane passing through (0,p^\star).
    This is enough to conclude d^\star = p^\star and hence strong duality.















    Sketch of Proof of Slater’s Theorem Recall:

    Slater’s Theorem. If a convex optimization problem satisfies Slater’s condition, then it is strongly dual and the dual problem is solvable.



    As in the previous section, we consider problems with one inequality constraint:

    \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_1(x) \leq 0 \end{cases}

    where x \in \mathbb{R}^n and both f_0,f_1 are convex.

    Define the epigraph

    \begin{aligned} \mathcal{A} = \{ (u,t) : f_1(x) \leq u, f_0(x)\leq t\text{ for some }x \in D \}. \end{aligned}

    Thus, if

    \xi = (f_1(x),f_0(x)) \in \mathcal{G},

    then every point “above and to the right” of \xi is in \mathcal{A} .





    Example. In the images below, the sets \mathcal{G} and \mathcal{A} are given.
    Important remarks:
    • The point (0,p^\star) is generically on the boundary of \mathcal{A}.
    • Strong duality is prevented exactly because \mathcal{A} is not convex.






    Sketch of Proof of Slater’s Theorem.
    Observe: f_0,f_1 convex \implies \mathcal{A} convex.
    Indeed, if (u,t),(u',t') \in \mathcal{A} and s \in [0,1], then
    • f_1(x) \leq u, f_1(x') \leq u' gives

      \begin{aligned}f_1(sx+(1-s)x') &\leq s f_1(x) + (1-s)f_1(x')\\ & \leq su+ (1-s)u' \end{aligned}

    • f_0(x)\leq t, f_0(x')\leq t' gives

      \begin{aligned}f_0(sx+(1-s)x') &\leq s f_0(x) + (1-s)f_0(x')\\ & \leq st+ (1-s)t' \end{aligned}

    Therefore s(u,t) + (1-s)(u',t') \in \mathcal{A} .





    \mathcal{A} convex \implies for each boundary point P \in bd\mathcal{A} of \mathcal{A} there is a supporting hyperplane \ell_P of \mathcal{A} which intersects P .
    This is depicted below.






    Recall: P^\star:=(0,p^\star) \in bd\mathcal{A}
    Therefore, there is a supporting hyperplane \ell_{P^\star} of \mathcal{A} which intersects P^\star.
    N.B.: \ell_{P^\star} lies below \mathcal{A}.
    This is depicted below.






    Assume Slater’s condition: there exists x' \in D with f_1(x')<0 .
    Let u'<0 be such that f_1(x') < u'<0 .
    Let t'>f_0(x').
    N.B.: (u',t') lies above and to the right of (f_1(x'),f_0(x')).
    Then (u',t') \in \text{int}(\mathcal{A}) and lies above \mathcal{G}_{\text{feas}}.
    Slater’s condition is what ensures such an interior point exists.
    This is depicted below.






    Since \ell_{P^\star} is a supporting hyperplane below \mathcal{A}, it follows that (u',t') has to lie above \ell_{P^\star}.
    This ensures \ell_{P^\star} is nonvertical and so it has a finite slope \lambda' .
    This is depicted below.






    Since \mathcal{A} extends upward and to the right from its boundary, the supporting hyperplane \ell_{P^\star} must have \lambda' \geq 0 ; thus \lambda' is dual feasible and

    g(\lambda') = \inf\{\lambda' u + t: (u,t) \in \mathcal{A}\} = p^\star.

    Therefore, strong duality holds.
    N.B.: If g(\lambda') = c \neq p^\star were the case, then the line \{ \lambda' u + t = c \} would fail to be a supporting hyperplane since it passes through (0,c) \neq (0,p^\star).
    In particular, the value c=\lambda' u + t could either be decreased further or is unachievable.
    The lines for c=p^\star,a,b with a<p^\star<b are depicted below.















    Theorems of Alternatives Recall the feasibility problem:

    \begin{cases} \text{find} & x\\ \text{subject to} & f_i(x) \leq 0, i = 1, \ldots, m\\ & h_i(x) = 0, i = 1,\ldots, p \end{cases}.

    Goal: Use Lagrange duality to study the feasibility problem.





    Observe that the feasibility problem is equivalent to the minimization problem:

    \text{(FP)} \begin{cases} \text{minimize} & 0\\ \text{subject to} & f_i(x) \leq 0, i = 1, \ldots, m\\ & h_i(x) = 0, i = 1,\ldots, p \end{cases}.

    Indeed, a solution to (FP) exists iff the constraints are consistent.

    The optimal value for (FP) is given by

    p^\star = \begin{cases} 0 & \text{(FP) is feasible}\\ +\infty & \text{(FP) is infeasible} \end{cases}.







    Duality of Feasibility Problem.
    Let

    f(x) = \begin{bmatrix}f_1(x)\\\vdots\\f_m(x)\end{bmatrix},\quad h(x) = \begin{bmatrix} h_1(x) \\\vdots\\h_p(x) \end{bmatrix}.

    Since the objective function of (FP) is f_0(x)=0 , the Lagrangian is

    L(x,\lambda,\nu) = \lambda^T f(x) + \nu^T h(x)

    with Lagrange multipliers (\lambda,\nu) \in \mathbb{R}^m \times \mathbb{R}^p.
    The Lagrange dual function is thus

    g(\lambda,\nu) = \inf\{ \lambda^T f(x) + \nu^T h(x) : x \in D \}.

    The dual of (FP) is therefore

    \begin{cases} \text{maximize} & g(\lambda,\nu)\\ \text{subject to} & \lambda \succeq 0. \end{cases}.







    The dual feasibility problem is thus

    \text{(DFP)} \begin{cases} \text{find} & (\lambda,\nu)\\ \text{subject to} & \lambda \succeq 0\\ & g(\lambda,\nu)>0. \end{cases}.

    Observe for t \geq 0 :

    \begin{aligned} g(t\lambda,t\nu) &= \inf\{ t\lambda^T f(x) + t\nu^T h(x) : x \in D \}\\ &=t\inf\{ \lambda^T f(x) + \nu^T h(x) : x \in D \}\\ &= tg(\lambda,\nu). \end{aligned}

    Using this and that g(0,0)=0, we conclude

    d^\star =  \begin{cases} +\infty & \text{(DFP) is feasible}\\ 0 & \text{(DFP) is infeasible} \end{cases}.

    Justification. Indeed: if \exists (\lambda,\nu) with \lambda \succeq 0 and g(\lambda,\nu)>0 , then

    g(t\lambda,t\nu) = tg(\lambda,\nu)>0

    can be made as large as desired by letting t \to \infty.
    On the other hand, if no such (\lambda,\nu) exists, then the largest g(\lambda,\nu) can be is 0 , attained at (\lambda,\nu)=(0,0).






    Weak Alternatives.
    Recall weak duality asserts d^\star \leq p^\star .
    We also just derived

    p^\star = \begin{cases} 0 & \text{(FP) is feasible}\\ +\infty & \text{(FP) is infeasible} \end{cases}, \quad d^\star =  \begin{cases} +\infty & \text{(DFP) is feasible}\\ 0 & \text{(DFP) is infeasible} \end{cases}.

    Therefore:
    • (FP) feasible \implies p^\star =0 \implies d^\star = 0 \implies (DFP) infeasible.
    • (DFP) feasible \implies d^\star = \infty \implies p^\star = \infty \implies (FP) infeasible.
    In general:

    If at most one of two problems can be feasible at a time, then they are called weak alternatives.

    Therefore, (FP) and (DFP) are weak alternatives.





    Strong Alternatives.
    In general:

    If exactly one of two problems is feasible at a time, then they are called strong alternatives.

    Farkas’ Lemma.
    Let A \in \mathbb{R}^{m \times n} and c \in \mathbb{R}^n .
    Then the feasibility problems

    \begin{cases} \text{find} &x\\ \text{subject to} & Ax \preceq0\\ &c^Tx < 0 \end{cases} \quad \text{and}\quad \begin{cases} \text{find} & y\\ \text{subject to}&A^Ty + c =0\\ &y \succeq 0 \end{cases}

    are strong alternatives.
    Proof. Consider the LP

    \text{(LP)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax \preceq0 \end{cases}.

    Its Lagrangian and Lagrange dual function are

    \begin{aligned} L(x,\lambda) &= c^Tx+ \lambda^TAx = (c+A^T\lambda)^Tx\\ g(\lambda) &= \begin{cases} 0 & c+A^T\lambda =0\\ -\infty & \text{else} \end{cases} \end{aligned}

    Therefore, the dual problem is

    \text{(DLP)} \begin{cases} \text{maximize} & 0\\ \text{subject to} & A^Ty+c=0\\ &y\succeq 0  \end{cases} .

    N.B.: (LP) and (DLP) are strongly dual and so their respective optimal values p^\star,d^\star satisfy p^\star=d^\star.





    Observe:
    • Ax \preceq 0 , c^Tx<0 infeasible \iff p^\star = 0.
    • Ax \preceq 0 , c^Tx<0 feasible \iff p^\star = -\infty.
    • A^Ty+c=0, y\succeq0 infeasible \iff d^\star =-\infty.
    • A^Ty+c=0, y\succeq0 feasible \iff d^\star =0.
    Using strong duality p^\star=d^\star, we conclude that the feasibility problems

    \begin{cases} \text{find} &x\\ \text{subject to} & Ax \preceq0\\ &c^Tx < 0 \end{cases} \quad \text{and}\quad \begin{cases} \text{find} & y\\ \text{subject to}&A^Ty + c =0\\ &y \succeq 0 \end{cases}

    are strong alternatives.
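    A numeric illustration may help. The following sketch (assuming scipy is available; the random instance and the feasibility logic are ours) tests both systems with scipy.optimize.linprog. Since x=0 always satisfies Ax \preceq 0 with c^Tx = 0 , and the feasible set is a cone, the first system is feasible exactly when \min\{c^Tx : Ax \preceq 0\} is unbounded below.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))
    c = rng.standard_normal(3)

    # System 1: Ax <= 0, c^T x < 0.  Feasible iff the LP is unbounded (status 3).
    res1 = linprog(c, A_ub=A, b_ub=np.zeros(4), bounds=[(None, None)] * 3)
    sys1_feasible = (res1.status == 3)

    # System 2: A^T y = -c, y >= 0.  A pure feasibility LP with zero objective.
    res2 = linprog(np.zeros(4), A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 4)
    sys2_feasible = (res2.status == 0)

    print(sys1_feasible, sys2_feasible)   # exactly one should print True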















    Descent Algorithms for Unconstrained Minimization
    Overview Unconstrained Minimization.
    We will focus on unconstrained problems of the form

    \begin{cases} \text{minimize} & f(x). \end{cases}

    The main assumptions on f will be
    • f is strongly convex (defined below).
    • f is twice continuously differentiable: \nabla^2 f is continuous.
    • the initial sublevel set of f is closed (made precise below).
    Constrained minimization problems will come later.

    Goal.
    Formulate and study algorithms which search for a minimizer x^\star that solves the problem: p^\star = f(x^\star) .







    Idea: Using Descent Methods.
    Find iterative rules

    G_k: \text{dom}\,f \to \text{dom}\,f

    so that the sequence

    x^{(k+1)} = G_k(x^{(k)}), \quad k =0, 1, 2, \ldots,

    stabilizes and satisfies descent:

    \begin{aligned} x^{(k)} &\to x' \text{ for some }x' \text{ as }k \to \infty\\ f(x^{(k+1)}) &< f(x^{(k)}) \text{ whenever } x^{(k)} \text{ is not optimal}. \end{aligned}

    N.B.: such rules are natural for searching for minimizers.
    Without convexity, such rules may get “stuck” at local minimizers.
    Example.
    Consider the iterative rule

    G_k(x^{(k)}) = x^{(k)} + h_k \nu^{(k)}

    where h_k \in \mathbb{R} are step sizes and \nu^{(k)} \in \mathbb{R}^n are search direction vectors.
    Generally, h_k,\nu^{(k)} may depend on x^{(k)} .
    Thus G_k(x) determines how far to step and in what direction from x.







    Remarks.
    1. If x^{(k)} \to x^\star , then continuity would give

      f(x^{(k)}) \to f(x^\star) = p^\star.

    2. In general, p^\star need not be known a priori.
    3. In practice, one specifies tolerance \epsilon>0 and terminates search when an iterate x^{(K)} satisfies this tolerance:

      f(x^{(K)})-p^\star \leq \epsilon.

    4. To start the search: a suitable starting point x^{(0)} needs to be chosen.
    5. To stop the search: a suitable stopping criterion that ensures tolerance is met needs to be determined.
    6. Generally G_k depends on the step; if G_k = G is independent of the step, then G is called stationary.








    General Descent Algorithm.
    Given iterative rule G_k satisfying descent, a desired tolerance \epsilon > 0 , and a stopping criterion

    \sigma(x^{(k)}) = \begin{cases} \text{true} & \text{if } f(x^{(k)}) - p^\star \leq \epsilon\\ \text{false} &\text{ else} \end{cases}

    a general descent algorithm takes the form:
    
    given initial x^{(0)} \in \text{dom}\,f .
    repeat: compute x^{(k+1)} = G_k(x^{(k)}).
    until: \sigma(x^{(k)}) = \text{true}.
    
    A natural kind of stopping criterion may be

    \sigma(x^{(k)}) = \begin{cases} \text{true} & \text{if } \Vert \nabla f(x^{(k)})\Vert_2 \text{ is sufficiently small}\\ \text{false} &\text{ else} \end{cases}.

    Indeed, for differentiable convex functions, if \nabla f(x) = 0, then x=x^\star .
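    The general descent algorithm is short enough to transcribe directly. A minimal sketch in Python (assuming numpy; the names descend and G are ours), using the gradient-norm stopping criterion just mentioned:

    import numpy as np

    def descend(G, grad, x0, eps=1e-6, max_iter=10_000):
        # Iterate x^{(k+1)} = G(x^{(k)}) until sigma(x^{(k)}) = true.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            if np.linalg.norm(grad(x)) <= eps:   # stopping criterion sigma
                break
            x = G(x)
        return x

    # Usage on f(x) = ||x||^2 / 2 with the stationary rule G(x) = x - 0.5 grad f(x):
    grad = lambda x: x
    print(descend(lambda x: x - 0.5 * grad(x), grad, np.ones(3)))   # converges to 0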














    Mathematical Framework Main Assumptions
    For theoretical convenience, we always assume
    1. f satisfies strong convexity (defined below)
    2. f is twice continuously differentiable.
    3. for the chosen initial point x^{(0)}, the sublevel set

      S := \{ x \in \text{dom}\, f : f(x) \leq f(x^{(0)}) \}

      is closed.
    N.B.:
    1. Since f(x^\star) = p^\star \leq f(x^{(0)}), we have x^\star \in S.
    2. S is closed whenever \text{dom}\,f = \mathbb{R}^n and f is continuous.
      However, S may fail to be closed in nontrivial and non-pathological cases; e.g., consider the case where \text{dom}\,f is an open ball and x^{(0)} maximizes f over it.
      Then S = \text{dom}\,f is an open ball in \mathbb{R}^n and therefore not closed.








    Strong convexity.
    We say f satisfies strong convexity on S if there exists m >0 such that

    m Id  \preceq \nabla^2 f(x) , \quad \forall x \in S.

    N.B.:
    1. Id \in \mathbb{R}^{n \times n} indicates the identity matrix.
    2. Fix x \in S and let v \in \mathbb{R}^n be an eigenvector of \nabla^2 f(x) with eigenvalue \lambda .
      Then

      \nabla^2 f(x) - m Id \succeq0

      implies

      0 \leq v^T (\nabla^2 f(x) - m Id) v = (\lambda-m)\Vert v\Vert_2^2.

      Thus

      0< m \leq \lambda and \nabla^2f(x) \succ 0.

    3. In particular, strong convexity implies strict convexity! (A numeric way of estimating the constant m is sketched below.)
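    In practice, a strong convexity constant can be estimated by sampling the smallest Hessian eigenvalue. A small sketch (assuming numpy; the toy objective below is our own, with Hessian \text{diag}(e^{x_i}) + 2Id \succeq 2Id everywhere):

    import numpy as np

    # f(x) = exp(x1) + exp(x2) + ||x||^2 has Hessian diag(exp(xi)) + 2 Id,
    # so m = 2 works; the sampled minimum eigenvalue confirms this.
    def hessian(x):
        return np.diag(np.exp(x)) + 2 * np.eye(2)

    samples = np.random.default_rng(1).uniform(-2, 2, size=(1000, 2))
    m_est = min(np.linalg.eigvalsh(hessian(x))[0] for x in samples)
    print(m_est)   # always >= 2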








    Proposition. If f is strongly convex and S is closed, then there exists M >0 such that

    \nabla^2 f(x) \preceq M Id , \quad \forall x \in S.

    Proof.
    1. The plan is to show that the sublevel set S (closed by assumption) is also bounded, hence compact, and use continuity of \nabla^2 f(x) to conclude each of its matrix entries is bounded.
      This is enough to conclude the inequality.







    2. f twice continuously differentiable implies:
      for each x,y \in S there exists z on the line x\to y such that

      \begin{aligned} f(y) = f(x) + \nabla f(x)^T(y-x) + \frac{1}{2}(y-x)^T \nabla^2 f(z)(y-x). \end{aligned}









    3. Strong convexity \nabla^2 f(x) \succeq m Id implies

      \begin{aligned} \frac{1}{2}(y-x)^T \nabla^2 f(z)(y-x) \geq \frac{m}{2}\Vert y-x \Vert_2^2. \end{aligned}

      Taking y \in S and x = x^\star in previous step, there holds

      \begin{aligned} f(x^{(0)})& \geq f(y)\\ & \geq f(x^\star) + \nabla f(x^\star)^T(y-x^\star) + \frac{m}{2}\Vert y-x^\star \Vert_2^2\\ &= p^\star  + \frac{m}{2}\Vert y-x^\star \Vert_2^2. \end{aligned}









    4. Previous step gives

      \begin{aligned} \frac{2}{m}\left( f(x^{(0)})-p^\star \right) \geq \Vert y-x^\star \Vert_2^2, \end{aligned}

      which implies all y \in S belong to a ball of sufficiently large radius with center x^\star , and therefore S is bounded.







    5. Since S is bounded and closed, it is compact.
      Therefore \nabla^2 f is continuous on a compact set.
      Therefore each entry of \nabla^2 f is bounded and hence \nabla^2 f(x) \preceq M Id for sufficiently large M.










    Remark.
    Just as mId \preceq \nabla^2 f(x) gave a lower bound on the eigenvalues of \nabla^2 f(x) , the bound \nabla^2 f(x) \preceq MId provides an upper bound on the eigenvalues.
    The proof is mutatis mutandis the same.







    Proposition. For x \in S there holds

    \frac{1}{2m}\Vert \nabla f(x) \Vert_2^2  \geq f(x) - p^\star \geq \frac{1}{2M}\Vert \nabla f(x) \Vert_2^2.

    Proof.
    1. Observe: if a matrix A \in \mathbb{R}^{n \times n} satisfies

      m Id \preceq A \preceq M Id,

      then for all v \in \mathbb{R}^n there holds

      \begin{aligned} m\Vert v \Vert_2^2 = v^T(m Id)v \leq v^TAv \leq v^T(M Id)v = M\Vert v \Vert_2^2. \end{aligned}



      Therefore, using

      m Id \preceq \nabla^2 f(x) \preceq M Id,

      we have

      \begin{aligned} \frac{m}{2} \Vert y-x \Vert_2^2 \leq \frac{1}{2}(y-x)^T \nabla^2 f(z) (y-x) \leq \frac{M}{2}\Vert y-x \Vert_2^2. \end{aligned}









    2. As above: for each x,y \in S there exists z on the line x\to y such that

      \begin{aligned} f(y) = f(x) + \nabla f(x)^T(y-x) + \frac{1}{2}(y-x)^T \nabla^2 f(z)(y-x). \end{aligned}









    3. Observe that:

      q(y):= f(x) + \nabla f(x)^T (y-x) + \frac{c}{2}\Vert{y-x}\Vert_2^2

      is a convex quadratic for c>0 .
      Moreover,

      \nabla q(y_0) = \nabla f(x) + c(y_0-x) =0

      iff

      y_0 = x - \frac{1}{c} \nabla f(x).

      Therefore

      \begin{aligned} \text{min}\, q(y) &= q(y_0)\\ &=q(x - \frac{1}{c} \nabla f(x))\\ &= f(x) - \frac{1}{2c}\Vert \nabla f(x) \Vert_2^2 \end{aligned}









    4. Using \nabla^2 f(x) \preceq M Id, we have

      \begin{aligned} f(y) \leq f(x) + \nabla f(x)^T(y-x) + \frac{M}{2}\Vert y-x \Vert_2^2 \end{aligned}

      and minimizing over y gives

      \begin{aligned} p^\star \leq f(x) - \frac{1}{2M}\Vert \nabla f(x) \Vert_2^2. \end{aligned}

      This proves

      \begin{aligned}  \frac{1}{2M}\Vert \nabla f(x) \Vert_2^2 \leq f(x) - p^\star . \end{aligned}









    5. Using m Id \preceq \nabla^2 f(x) we have

      \begin{aligned} f(y) \geq f(x) + \nabla f(x)^T(y-x) + \frac{m}{2}\Vert y-x \Vert_2^2 \end{aligned}

      and minimizing over y gives

      \begin{aligned} p^\star \geq f(x) - \frac{1}{2m}\Vert \nabla f(x) \Vert_2^2. \end{aligned}

      This proves

      \begin{aligned} \frac{1}{2m}\Vert \nabla f(x) \Vert_2^2 \geq f(x) - p^\star . \end{aligned}











    Remark.
    The upper bound provides a stopping criterion: if x^{(K)} satisfies

    \Vert \nabla f(x^{(K)}) \Vert_2 \leq \sqrt{2m\epsilon},

    then

    f(x^{(K)}) - p^\star \leq \epsilon.

    Viz., x^{(K)} satisfies \epsilon-tolerance.
    Yet, p^\star does not even need to be known for this.
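    Both inequalities and the stopping rule can be checked exactly on a quadratic. A sketch (assuming numpy; the instance is ours) with f(x) = \frac{1}{2}x^THx , H = \text{diag}(m,M) , for which p^\star = 0 :

    import numpy as np

    m, M = 0.5, 4.0
    H = np.diag([m, M])
    rng = np.random.default_rng(2)
    for _ in range(5):
        x = rng.standard_normal(2)
        gap = 0.5 * x @ H @ x               # f(x) - p*
        g2 = np.linalg.norm(H @ x) ** 2     # ||grad f(x)||_2^2
        assert g2 / (2 * M) <= gap <= g2 / (2 * m)

    # Stopping rule: ||grad f(x)||_2 <= sqrt(2 m eps) certifies f(x) - p* <= eps.
    eps, x = 1e-6, np.array([1e-4, 1e-4])
    if np.linalg.norm(H @ x) <= np.sqrt(2 * m * eps):
        assert 0.5 * x @ H @ x <= eps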







    Proposition. For x \in S, there holds

    \Vert x^\star - x \Vert_2 \leq \frac{2}{m} \Vert \nabla f(x) \Vert_2.

    Proof. Again, we use

    \begin{aligned} f(y) = f(x) + \nabla f(x)^T(y-x) + \frac{1}{2}(y-x)^T \nabla^2 f(z)(y-x) \end{aligned}

    by taking y = x^\star , which gives

    \begin{aligned} p^\star &= f(x^\star) \\ &= f(x) + \nabla f(x)^T(x^\star-x) + \frac{1}{2}(x^\star-x)^T \nabla^2 f(z)(x^\star-x)\\ &\geq f(x) - \Vert \nabla f(x)\Vert_2 \Vert x^\star - x\Vert_2 + \frac{m}{2} \Vert x^\star - x \Vert_2^2. \end{aligned}

    Here we used Cauchy-Schwarz to conclude

    \begin{aligned}  \Vert \nabla f(x)\Vert_2 \Vert x^\star -x \Vert_2 \geq \nabla f(x)^T(x-x^\star) \end{aligned}

    and so

    \begin{aligned}  -\Vert \nabla f(x)\Vert_2 \Vert x^\star -x \Vert_2 \leq \nabla f(x)^T(x^\star - x). \end{aligned}









    Since p^\star - f(x) \leq 0 , we conclude

    \begin{aligned} 0 \geq - \Vert \nabla f(x)\Vert_2 \Vert x^\star - x\Vert_2 + \frac{m}{2} \Vert x^\star - x \Vert_2^2 \end{aligned}

    and whence

    \begin{aligned}  \frac{2}{m}\Vert \nabla f(x) \Vert_2 \geq \Vert x^\star - x \Vert_2. \end{aligned}



    Corollary. If x^\star,x^{\star\star} are two minimizers, then x^\star = x^{\star\star} since

    \Vert x^\star - x^{\star\star} \Vert_2 \leq \frac{2}{m} \Vert \nabla f(x^{\star\star}) \Vert_2 = 0.
















    General Descent Methods Plan: Further specialize and describe general descent methods.







    Minimizing sequence: a sequence \{ x^{(k)}\} \subset \text{dom}\,f such that

    x^{(k)} \to x^\star as k \to \infty .

    Goal: construct a minimizing sequence \{ x^{(k)}\} satisfying
    • each iterate x^{(k+1)} is defined via

      x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)}

      where

      \begin{aligned} t^{(k)}\geq0 & \text{ is called the step size}\\ \Delta x^{(k)} \in \mathbb{R}^n & \text{ is called the search direction}. \end{aligned}

    • the t^{(k)} and \Delta x^{(k)} are chosen so that the sequence satisfies descent:

      f(x^{(k+1)}) < f(x^{(k)}) whenever x^{(k)} \neq x^\star .









    Remarks.
    1. \Delta x^{(k)} is generally not assumed to be a unit vector;
    2. A minimizing sequence satisfying descent satisfies

      x^{(k)} \in S := \{ x : f(x) \leq f(x^{(0)} ) \} .

    3. Customary to write x:= x+ t \Delta x or x^+ = x + t\Delta x as shorthand for x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)} .








    Necessary condition for descent: since f is convex and differentiable, there holds

    f(x) + \nabla f(x)^T(y-x) \leq f(y),

    and so

    \nabla f(x)^T(y-x) \geq 0 \implies f(x) \leq f(y).



    Therefore, for x:= x+t\Delta x to satisfy descent, it is necessary that

    \nabla f(x^{(k)})^T \Delta x^{(k)} <0.

    Indeed, by the contrapositive of the implication above, descent f(x^{(k+1)}) < f(x^{(k)}) forces \nabla f(x^{(k)})^T (x^{(k+1)} - x^{(k)}) < 0 ; dividing by t^{(k)}>0 and using

    \Delta x^{(k)} = \frac{1}{t^{(k)}} (x^{(k+1)} - x^{(k)} )

    gives the claim.

    Descent direction: any search direction \Delta x satisfying

    -\nabla f(x)^T \Delta x > 0.

    Viz., \Delta x and - \nabla f(x) form an acute angle.








    General descent stopping criterion.
    Recall: strong convexity m Id \preceq \nabla^2 f(x) ensures

    \Vert \nabla f(x) \Vert_2 \leq \sqrt{2m\epsilon} \implies f(x)-p^\star \leq \epsilon .

    Since it is generally impossible to know the strong convexity constant m , one settles for choosing \epsilon'>0 sufficiently small so that

    \Vert \nabla f(x) \Vert_2 \leq \epsilon' \implies f(x)-p^\star \leq \epsilon

    is likely to hold.
    Stopping criteria for the descent methods studied here are often of this form.








    General descent algorithm: Using the iterative rule x := x + t \Delta x, a general descent algorithm takes the following form.
    
    given initial x \in \text{dom}\,f 
    repeat:
    1. Determine descent direction \Delta x .
    2. Choose step size t>0.
    3. Take step: x:= x + t\Delta x .
    until: stopping criterion holds.
    















    Line Searching Observe

    \{ x + t \Delta x : t \geq 0 \}

    is a ray emanating from x in the direction \Delta x.
    Thus Step 2. in the general descent algorithm is to determine where to step onto this line from x .
    Step 2. is therefore called a line search.







    Exact line search.
    Let t_{\text{exact}} minimize f along the line \{ x + t \Delta x: t \geq 0 \} .
    (Such t_{\text{exact}} exists under our standing assumptions on f .)
    Certainly f(x+t_{\text{exact}}\Delta x) \leq f(x) .
    This search for t_{\text{exact}} is called an exact line search.
    A general descent algorithm with exact line search is recorded below.
    
    given initial x \in \text{dom}\,f 
    repeat:
    1. Determine descent direction \Delta x .
    2. Compute t_{\text{exact}} = \text{argmin}\{f(x+t\Delta x): t \geq 0 \} .
    3. Take step: x:= x + t_{\text{exact}}\Delta x .
    until: stopping criterion holds.
    








    Remarks.
    1. Let t^{(k)}_{\text{exact}} be the sequence of exact step sizes and let

      x^{(k+1)} = x^{(k)} + t_{\text{exact}}^{(k)}\Delta x^{(k)}

      be the resulting sequence of iterates.
      Since a descent direction is used, each t_{\text{exact}}^{(k)}>0 and so the sequence x^{(k)} satisfies descent:

      f(x^{(k+1)}) < f(x^{(k)}).

    2. Using exact search is only reasonable when its computational cost is considerably less than the computational cost of finding search directions \Delta x .
      Otherwise, resources can be better spent finding better search directions.
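    For one-dimensional minimization along the ray, an off-the-shelf scalar minimizer suffices. A sketch of one exact-line-search step (assuming scipy; the helper exact_step and the crude bound t_max are ours), reproducing one of the line searches in the examples below:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def exact_step(f, x, dx, t_max=100.0):
        # Minimize t -> f(x + t dx) over [0, t_max] and take the step.
        phi = lambda t: f(x + t * dx)
        t_exact = minimize_scalar(phi, bounds=(0.0, t_max), method='bounded').x
        return x + t_exact * dx, t_exact

    f = lambda z: 0.5 * (z[0] - z[1]) ** 2 + z[1]
    x1, t1 = exact_step(f, np.array([0.0, 1.5]), np.array([1.0, 0.0]))
    print(t1, x1)   # t_exact = 1.5, landing at (1.5, 1.5)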








    Examples
    Example 1.
    Consider the objective

    f(x,y) = \frac{1}{2}(x-y)^2 + y .

    The following image depicts a portion of the graph of f.








    The next figure depicts the restrictions of f to the lines

    \begin{aligned} &\{ (0,0.5) + t (1,0): t \geq 0 \}\\ &\{ (0,1) + t (1,0): t \geq 0 \}\\ &\{ (0,1.5) + t (1,0): t \geq 0 \} \end{aligned}









    The last image depicts only these restrictions together with the corresponding minimizers x + t_{\text{exact}} \Delta x obtained from exact line search on each respective line.








    Example 2.
    Consider the same objective

    f(x,y) = \frac{1}{2}(x-y)^2 + y .

    The first image below depicts the first step of a general descent method using exact line search where

    \begin{aligned} x^{(0)} &= (0,1.5)\\ \Delta x^{(0)} &= (1,0)\\ t_{\text{exact}}^{(0)} &= 1.5\\ x^{(1)} &= x^{(0)} + t_{\text{exact}}^{(0)} \Delta x^{(0)} = (1.5,1.5). \end{aligned}









    The next image below depicts the second step using exact line search where

    \begin{aligned} x^{(1)} &= (1.5,1.5)\\ \Delta x^{(1)} &= (0,-1)\\ t_{\text{exact}}^{(1)} &= 1\\ x^{(2)} &= x^{(1)} + t_{\text{exact}}^{(1)} \Delta x^{(1)} = (1.5,.5). \end{aligned}









    The last image below depicts the third step using exact line search where

    \begin{aligned} x^{(2)} &= (1.5,.5)\\ \Delta x^{(2)} &= (-1,0)\\ t_{\text{exact}}^{(2)} &= 1\\ x^{(3)} &= x^{(2)} + t_{\text{exact}}^{(2)} \Delta x^{(2)} = (.5,.5). \end{aligned}
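    These three steps are easy to reproduce numerically; a sketch (assuming scipy, reusing the exact_step helper sketched earlier):

    import numpy as np
    from scipy.optimize import minimize_scalar

    f = lambda z: 0.5 * (z[0] - z[1]) ** 2 + z[1]

    def exact_step(x, dx):
        t = minimize_scalar(lambda t: f(x + t * dx),
                            bounds=(0.0, 100.0), method='bounded').x
        return x + t * dx, t

    x = np.array([0.0, 1.5])
    for dx in ([1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]):
        x, t = exact_step(x, np.array(dx))
        print(t, x)   # t = 1.5, 1, 1; iterates (1.5,1.5), (1.5,0.5), (0.5,0.5)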









    Backtracking line search.
    Naturally: exact line search may be too computationally expensive.

    Therefore, we may settle for a line search which either
    • decreases the objective f enough or
    • approximately minimizes f in the direction \Delta x.


    Idea: given descent direction \Delta x and parameter \beta \in (0,1),
    1. take step x \mapsto x + t \Delta x
    2. “backtrack”: test smaller steps x \mapsto x + \beta^k t\Delta x until the decrease

      f(x+\beta^k t\Delta x) - f(x)

      behaves suitably at each iteration for convergence to hold.
    N.B.: Even if the initial step x \mapsto x+t\Delta x results in an increase in the objective, convexity ensures x \mapsto x+\beta^k t\Delta x results in a decrease for some k.







    Motivation of backtracking line search:
    Throughout, fix
    • f:\mathbb{R}\to\mathbb{R}
    • descent direction \Delta x, i.e., \Delta x \in \mathbb{R} and f'(x)\Delta x < 0
    • Parameters \alpha,\beta \in (0,1).








    Observations:
    1. For t small, Taylor’s approximation gives

      f(x+t\Delta x) - f(x) \approx tf'(x) \Delta x.

      In particular, small steps guarantee an approximate decrease by the amount tf'(x) \Delta x.








    2. At worst, a general step size t makes

      x\mapsto x+t\Delta x

      either overshoot the minimizer

      x + t_{\text{exact}} \Delta x

      or is too small for

      x + t\Delta x

      to be “near”

      x+t_{\text{exact}}\Delta x .









    3. Consider the linear extrapolation

      y(t) = tf'(x) \Delta x + f(x)

      as a function of t.
      Since y(0) = f(x) , the difference y(t) - f(x) is a linear approximation of the difference f(x+t\Delta x) - f(x) .
      The line searching stopping criterion:
      
      until: f(x+t\Delta x) - f(x) \leq  \alpha tf'(x) \Delta x. 
      
      amounts to searching for t until f(x) decreases by a fraction of what the linear approximation gives.








    4. By 1. and f'(x)\Delta x < 0 , the inequality in 3. is guaranteed for small t .
      Indeed

      f(x+t\Delta x) - f(x) \approx tf'(x) \Delta x \leq  \alpha tf'(x) \Delta x.









    5. By 4., we observe that, if

      f(x+t\Delta x) - f(x) \leq  \alpha tf'(x) \Delta x.

      is not satisfied after the first step x\mapsto x + t \Delta x , then

      f(x+\beta^k t\Delta x) - f(x) \leq  \alpha \beta^k tf'(x) \Delta x

      will be satisfied for large enough k .
      Can therefore consider the largest step size \beta^k t which guarantees a decrease comparable to the decrease predicted by linear extrapolation.








    Improved Idea:
    1. take step x \mapsto x + t \Delta x
    2. “backtrack”: find the first k\geq0 such that

      f(x+\beta^k t\Delta x) - f(x) \leq  \alpha \beta^k tf'(x) \Delta x

      is satisfied.
    3. Then (as we will see) we have “suitable decrease” and we proceed with choosing next descent direction.








    The aforementioned observations and ideas motivate a line search called backtracking line search.
    In higher dimension, the differential inequality

    f(x+t\Delta x) - f(x) \leq  \alpha tf'(x) \Delta x

    takes the form

    f(x+t\Delta x) - f(x) \leq  \alpha t\nabla f(x)^T \Delta x.

    This inequality is called the Armijo-Goldstein inequality or Armijo condition.
    The algorithm for backtracking line search may now be recorded:
    
    given 
    x \in \text{dom}\,f 
    descent direction \Delta x at x
    parameters \alpha \in (0,0.5), \beta \in (0,1)
    t = 1
    while: f(x+t\Delta x) - f(x) >  \alpha  t \nabla f(x)^T \Delta x 
    update: t:=\beta t
    








    Remarks.
    1. The loop
      
      while: f(x+t\Delta x) - f(x) >  \alpha  t \nabla f(x)^T \Delta x 
      update: t:=\beta t
      
      creates a sequence of step sizes

      1, \beta, \beta^2 ,  \beta^3 ,  \cdots,  \beta^k ,   \cdots

      and the sequence terminates once the exit criterion (Armijo condition) is satisfied.
      N.B.: it is possible for the search to terminate at t=1 .







    2. The while condition
      
      while: f(x+t\Delta x) - f(x) >  \alpha  t \nabla f(x)^T \Delta x 
      
      is understood as waiting until the objective is suitably decreased.
      Moreover, if the Armijo-Goldstein inequality holds for some t_0>0 , it also holds for all 0 < t \leq t_0 : the function h(t) := f(x+t\Delta x) - f(x) - \alpha t \nabla f(x)^T \Delta x is convex with h(0)=0 , so h(t) \leq \frac{t}{t_0}h(t_0) \leq 0 for 0 < t \leq t_0 .
      Such a t_0 always exists: by

      \begin{aligned} \lim_{t\to0^+} \frac{f(x+t\Delta x) - f(x)}{t} = \nabla f(x)^T \Delta x \leq \alpha \nabla f(x)^T \Delta x, \end{aligned}

      there is a small enough t_0>0 such that

      \begin{aligned} \frac{f(x+t\Delta x) - f(x)}{t} \leq \alpha \nabla f(x)^T \Delta x, \quad \text{ for } 0 < t \leq t_0. \end{aligned}

      This is the Armijo-Goldstein inequality rearranged.







    3. The assumption \alpha \in (0,0.5) and Armijo’s condition are sufficient for convergence of gradient descent coupled with backtracking line search, which is detailed below.
      The smaller \beta is, the faster \beta^k decreases and hence the quicker the Armijo-Goldstein inequality holds.
      However, this also results in smaller steps x \mapsto x + \beta^k t\Delta x.







    4. One needs to ensure f(x+t\Delta x) is well-defined to start the algorithm, i.e., that x + t \Delta x \in \text{dom}\, f.
      This can be done by taking t to be the first \beta^k t with x + \beta^k t\Delta x \in \text{dom}\, f.







    5. This algorithm always terminates for differentiable and convex f.
      The argument follows the one-dimensional argument.
      Indeed Taylor’s approximation ensures: for small t there holds

      f(x + t \Delta x) - f(x) \approx t \nabla f(x)^T \Delta x < \alpha t \nabla f(x)^T \Delta x.

      Thus, once k is large enough for t=\beta^k to be in the “small t ” range, the Armijo-Goldstein inequality holds.
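    The backtracking algorithm recorded above transcribes directly. A minimal sketch (assuming numpy; the function name is ours, and the caller must supply the gradient at x and a descent direction):

    import numpy as np

    def backtracking(f, grad_x, x, dx, alpha=0.25, beta=0.5):
        # Shrink t until the Armijo-Goldstein inequality holds.
        t = 1.0
        slope = grad_x @ dx    # = grad f(x)^T dx < 0 for a descent direction
        while f(x + t * dx) - f(x) > alpha * t * slope:
            t *= beta          # t := beta t
        return t

    Per remark 4 above, a domain check on x + t\Delta x would be added for objectives with \text{dom}\,f \neq \mathbb{R}^n .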







    Gradient Descent Recall: \Delta x is called a descent direction provided

    \nabla f(x)^T \Delta x < 0 .

    Therefore

    \Delta x = -\nabla f(x)

    provides a natural descent direction associated to the problem.
    This results in the following gradient descent method.
    
    given initial x \in \text{dom}\,f 
    repeat:
    1. Set \Delta x = - \nabla f(x) .
    2. Perform line search to determine step size t .
    3. Take step: x:= x + t\Delta x .
    until: stopping criterion holds.
    
    We focus on the cases where the line search is exact or backtracking.
    Recall: by strong convexity and the initial sublevel set S being closed, there exist constants 0 < m \leq M <\infty such that

    m Id \preceq \nabla^2 f(x) \preceq M Id \quad \text{ for } x \in S.
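    Putting the pieces together, a sketch of gradient descent with backtracking line search (assuming numpy; backtracking is as sketched earlier, and the quadratic test objective is our own):

    import numpy as np

    H = np.diag([1.0, 10.0])            # m = 1, M = 10
    f = lambda x: 0.5 * x @ H @ x       # p* = 0 at x* = 0
    grad = lambda x: H @ x

    def backtracking(x, dx, alpha=0.25, beta=0.5):
        t, slope = 1.0, grad(x) @ dx
        while f(x + t * dx) - f(x) > alpha * t * slope:
            t *= beta
        return t

    x = np.array([10.0, 1.0])
    while np.linalg.norm(grad(x)) > 1e-8:   # stopping criterion
        dx = -grad(x)                       # steepest-descent direction
        x = x + backtracking(x, dx) * dx
    print(x, f(x))                          # near x* = 0, p* = 0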









    Convergence for exact line search.
    We show
    • gradient descent with exact line search converges.
    • number of iterations needed to achieve tolerance f(x)-p^\star \leq \epsilon is bounded in terms of the problem data:
      • optimal value p^\star and initial value f(x^{(0)}),
      • desired tolerance \epsilon>0,
      • and the conditioning of \nabla^2 f(x) .








    Theorem. Suppose f is strongly convex with convexity constants m,M and its initial sublevel set S is closed. Then the gradient descent method with exact line search converges. Moreover, if the desired tolerance is \epsilon>0 , then

    f(x^{(k)})- p^\star \leq \epsilon

    holds after at most

    -\frac{\log\left( \epsilon^{-1}(f(x^{(0)}) - p^\star) \right)}{\log(1 - m/M)}

    many iterations.
    Proof. Step 0.
    The main idea is to establish constants c \in (0,1) and A>0 such that

    f(x^{(k)}) - p^\star \leq c^k A

    for all k \geq 0 .
    Indeed, if this holds,

    \lim_{k} c^k = 0 \implies \lim_k f(x^{(k)}) =p^\star .



    For notational simplicity: we forgo indexing by iteration step.
    For given iterate x , write t_{\text{exact}} for resulting exact line search step size.
    Write x^+ = x - t_{\text{exact}}\nabla f(x) for the next iterate after x using gradient descent with exact line search.








    Step 1.
    Recall: under strong convexity assumptions on f , there holds

    \begin{aligned} \nabla^2 f(x) &\preceq M Id\\ f(y) &\leq f(x) + \nabla f(x)^T(y-x) + \frac{M}{2}\Vert x-y\Vert_2^2. \end{aligned}

    Letting y = x - t \nabla f(x) :

    \begin{aligned} f(x - t\nabla f(x)) &\leq f(x)  - t \Vert \nabla f(x) \Vert_2^2 + \frac{Mt^2}{2}\Vert \nabla f(x) \Vert_2^2. \end{aligned}

    This holds for all t\geq0 with x - t \nabla f(x) \in S .







    Step 2.
    Using exact line search:

    \begin{aligned} t_{\text{exact}} &:= \text{argmin}\{ f(x- t\nabla f(x)): t \geq 0 \}\\ f(x^+)&=f(x-t_{\text{exact}}\nabla f(x)) \leq f(x - t\nabla f(x)) \end{aligned}

    for all t \geq 0 .







    Step 3.
    The convex quadratic

    \begin{aligned} f(x)  - t \Vert \nabla f(x) \Vert_2^2 + \frac{Mt^2}{2}\Vert \nabla f(x) \Vert_2^2 \end{aligned}

    is minimized at

    t = \frac{1}{M}

    and so

    \begin{aligned} f(x) - \frac{1}{2M}\Vert \nabla f(x) \Vert_2^2 \leq f(x)  - t \Vert \nabla f(x) \Vert_2^2 + \frac{Mt^2}{2}\Vert \nabla f(x) \Vert_2^2. \end{aligned}









    Step 4.
    Minimizing both sides of the Step 1. inequality

    \begin{aligned} f(x - t\nabla f(x)) &\leq f(x)  - t \Vert \nabla f(x) \Vert_2^2 + \frac{Mt^2}{2}\Vert \nabla f(x) \Vert_2^2 \end{aligned}

    and using Steps 2. and 3. gives

    f(x^+) = f(x - t_{\text{exact}}\nabla f(x)) \leq f(x) - \frac{1}{2M} \Vert \nabla f(x) \Vert_2^2 .

    Therefore

    f(x^+) - p^\star \leq f(x) - p^\star - \frac{1}{2M} \Vert \nabla f(x) \Vert_2^2 .









    Step 5.
    Recall we derived

    f(x) - p^\star \leq \frac{1}{2m}\Vert \nabla f(x) \Vert_2^2 .

    (This was when we derived a natural stopping criterion.)
    Using this and Step 4. gives

    \begin{aligned}  f(x^+) - p^\star & \leq f(x) - p^\star  - \frac{1}{2M} \Vert \nabla f(x) \Vert_2^2\\ &\leq f(x) - p^\star - \frac{1}{2M}2m(f(x)-p^\star)\\ &=c (f(x)-p^\star) \end{aligned}

    where

    \begin{aligned} c = 1- \frac{m}{M}<1. \end{aligned}









    Step 6.
    Let

    x^{++} = x^+ - t'_{\text{exact}}\nabla f(x^+)

    denote the iterate following x^+ using gradient descent and exact line search to find t'_{\text{exact}}.
    Applying the analysis with (x^{++},x^{+}) in place of (x^{+},x), we conclude

    \begin{aligned}  f(x^{++}) - p^\star &\leq c (f(x^{+})-p^\star) \\ &\leq c (c (f(x) - p^\star))\\ &= c^2 (f(x) - p^\star). \end{aligned}

    Letting x^{(k)} denote the kth iterate, iterating this argument gives

    \begin{aligned} f(x^{(k)}) - p^\star \leq c^k (f(x^{(0)}) -p^\star) \end{aligned} .









    Step 7.
    Since c = 1 - \frac{m}{M} < 1 , conclude c^k \to 0 as k \to \infty .
    Therefore

    \begin{aligned} \lim_{k \to \infty } f(x^{(k)}) - p^\star \leq  \lim_{k \to \infty } c^k (f(x^{(0)}) -p^\star) = 0. \end{aligned}

    It follows that f(x^{(k)}) \to p^\star and so there holds convergence.








    Step 8.
    To find k such that

    f(x^{(k)}) - p^\star \leq \epsilon ,

    we solve

    \begin{aligned} c^k (f(x^{(0)}) -p^\star) \leq \epsilon  \end{aligned}

    in terms of k :

    \frac{\log\left( \epsilon^{-1}(f(x^{(0)}) - p^\star) \right)}{\log(1/c)} \leq k.

    Therefore, as soon as k surpasses this number, the tolerance is satisfied.
    Recall c = 1 - \frac{m}{M}.


    Remarks.
    1. Recall: above we showed that

      \begin{aligned} \frac{1}{2M} \Vert \nabla f(x) \Vert_2^2 &\leq f(x) - p^\star \leq \frac{1}{2m}\Vert \nabla f(x) \Vert_2^2\\ \Vert x^\star - x \Vert_2 &\leq \frac{2}{m} \Vert \nabla f(x) \Vert_2. \end{aligned}

      It follows that f(x^{(k)}) - p^\star \to 0 implies x^{(k)} \to x^\star .
    2. We can use these a priori estimates to observe

      -\frac{\log\left( \epsilon^{-1}(f(x^{(0)}) - p^\star) \right)}{\log(1 - m/M)} \leq -\frac{\log\left( \epsilon^{-1}m^{-1}2^{-1}\Vert \nabla f(x^{(0)}) \Vert_2^2 \right) }{\log(1 - m/M)} .

    3. While the RHS is independent of knowing p^\star , it is conceivably a worse upper bound on the maximum number of steps required to achieve the desired tolerance.
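    The bound can also be compared against practice. A sketch (assuming numpy; the instance is ours) on f(x) = \frac{1}{2}x^T\text{diag}(m,M)x , for which exact line search has the closed form t = \Vert g\Vert_2^2 / (g^THg) with g = \nabla f(x) :

    import numpy as np

    m, M, eps = 1.0, 20.0, 1e-8
    H = np.diag([m, M])
    x = np.array([1.0, 1.0])
    f0 = 0.5 * x @ H @ x                # f(x^(0)); here p* = 0 exactly

    k = 0
    while 0.5 * x @ H @ x > eps:
        g = H @ x
        t = (g @ g) / (g @ H @ g)       # exact step size for a quadratic
        x = x - t * g
        k += 1

    bound = -np.log(f0 / eps) / np.log(1 - m / M)
    print(k, bound)   # the empirical k does not exceed the theoretical bound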








    Convergence for backtracking line search.

    We show
    • gradient descent with backtracking line search converges.
    • number of iterations needed to achieve tolerance f(x)-p^\star \leq \epsilon is bounded in terms of the problem data:
      • optimal value p^\star and initial value f(x^{(0)}),
      • desired tolerance \epsilon>0,
      • and the conditioning of \nabla^2 f(x) .








    Theorem. Suppose f is strongly convex with convexity constants m,M and its initial sublevel set S is closed. Then the gradient descent method with backtracking line search converges. Moreover, the Armijo-Goldstein inequality holds for 0 < t < \frac{1}{M} . Lastly, if the desired tolerance is \epsilon>0 , then

    f(x^{(k)})- p^\star \leq \epsilon

    holds after at most

    -\frac{\log\left( \epsilon^{-1}(f(x^{(0)}) - p^\star) \right)}{\log(1 - \text{min}\{ 2m\alpha,2\alpha\beta m/M\})}

    many iterations.
    Proof. Step 0.
    The main idea is to establish constants c \in (0,1) and A>0 such that

    f(x^{(k)}) - p^\star \leq c^k A

    for all k \geq 0 .
    Indeed, if this holds,

    \lim_{k} c^k = 0 \implies \lim_k f(x^{(k)}) =p^\star .









    Step 1.
    Let 0<t<\frac{1}{M} be fixed but arbitrary and let x^+ = x + t \Delta x.
    In the next step, we will show the Armijo-Goldstein inequality holds whenever 0 < t < \frac{1}{M}.
    Beforehand: since gradient descent chooses the descent direction

    \Delta x = - \nabla f(x) ,

    we have

    \nabla f(x)^T \Delta x = - \nabla f(x)^T \nabla f(x) = -\Vert \nabla f(x) \Vert_2^2

    and so the Armijo-Goldstein inequality takes the form

    f(x^+) - f(x) \leq -\alpha t \Vert \nabla f(x) \Vert_2^2.









    Step 2.
    Recall strong convexity and S closed gave

    f(y) \leq f(x) + \nabla f(x)^T(y-x) + \frac{M}{2}\Vert y-x \Vert_2^2.

    Taking y = x^+ = x - t\nabla f(x) gives

    f(x^+) \leq f(x) -t \Vert \nabla f(x)\Vert_2^2 + \frac{Mt^2}{2}\Vert \nabla f(x) \Vert_2^2.

    Noting

    \begin{aligned} 0 < t < \frac{1}{M} &\implies  Mt^2 \leq t\\ & \implies -t + \frac{Mt^2}{2} \leq - \frac{t}{2}, \end{aligned}

    and using \alpha \in (0,0.5) , we conclude

    \begin{aligned} f(x^+) &\leq f(x) - \frac{t}{2}\Vert \nabla f(x) \Vert_2^2\\ &\leq f(x) - \alpha t \Vert \nabla f(x) \Vert_2^2. \end{aligned}

    In conclusion: if 0<t<\frac{1}{M}, then the Armijo-Goldstein inequality

    f(x^+) - f(x) \leq - \alpha t \Vert \nabla f(x) \Vert_2^2

    holds.







    Step 3.
    By preceding step: the backtracking line search terminates once 0 < t < \frac{1}{M} or at the initial step with t = 1 .
    Supposing
    • \frac{1}{M} < 1
    • line search does not terminate at t = 1,
    then line search terminates for some t \geq \frac{\beta}{M}.
    Indeed, let k be the largest integer such that the line search does not terminate at t= \beta^k .
    Then \beta^k \geq \frac{1}{M} and termination happens at t'=\beta^{k+1}, which consequently satisfies t' \geq \frac{\beta}{M}.
    The claim follows.







    Step 4.
    If backtracking line search terminates at t=1, then

    f(x^+) \leq f(x) - \alpha \Vert\nabla f(x) \Vert_2^2 .

    If backtracking line search terminates for some t' \geq \frac{\beta}{M}, then

    \begin{aligned}  f(x^+) &\leq f(x) - \alpha t' \Vert\nabla f(x) \Vert_2^2 \\ &\leq f(x) - \frac{\alpha \beta}{M} \Vert \nabla f(x) \Vert_2^2. \end{aligned}

    Therefore, if either t=1 or t=t', then

    f(x^+) \leq f(x) - \text{min}\left\{ \alpha, \frac{\alpha\beta}{M} \right\} \Vert\nabla f(x) \Vert_2^2 .









    Step 5.
    Recall

    2m(f(x) - p^\star) \leq \Vert \nabla f(x) \Vert_2^2 .

    The preceding step thus gives

    \begin{aligned} f(x^+) - p^\star &\leq f(x) - p^\star - \text{min}\left\{ \alpha, \frac{\alpha\beta}{M} \right\} \Vert\nabla f(x) \Vert_2^2\\ & \leq f(x) - p^\star - \text{min}\left\{ \alpha, \frac{\alpha\beta}{M} \right\} \cdot 2m(f(x)-p^\star)\\ &\leq \left( 1 - \text{min}\left\{ 2m\alpha, \frac{2m\alpha\beta}{M} \right\} \right) (f(x) - p^\star)\\ &=: c(f(x)-p^\star), \end{aligned}

    where

    c= 1 - \text{min}\left\{ 2m\alpha, \frac{2m\alpha\beta}{M} \right\} < 1.

    Iterating the argument gives

    f(x^{(k)}) - p^\star \leq c^k (f(x^{(0)}) - p^\star)

    and whence the desired convergence since c^k \to 0 .







    Step 6.
    To find k such that

    f(x^{(k)}) - p^\star \leq \epsilon ,

    we solve

    \begin{aligned} c^k (f(x^{(0)}) -p^\star) \leq \epsilon  \end{aligned}

    in terms of k :

    \frac{\log\left( \epsilon^{-1}(f(x^{(0)}) - p^\star) \right)}{\log(1/c)} \leq k.

    Therefore, as soon as k surpasses this number, the tolerance is satisfied.
    Recall

    c=1 - \text{min}\left\{ 2m\alpha, \frac{2m\alpha\beta}{M} \right\} .











    Remark.
    The remarks after the convergence theorem for exact line search apply to this convergence theorem for backtracking line search.














    On the Condition Number Objectives
    1. Define condition numbers for matrices and convex subsets.
    2. Establish a connection between strong convexity of a function and the condition number of its (convex) sublevel sets.
    3. Prepare for understanding how conditioning of the Hessian is important for convergence in gradient descent algorithms.








    Condition Number of a Matrix
    Given a matrix A \in \boldsymbol{S}_{++}^n , let

    \begin{aligned} \lambda_{\text{min}}(A) &= \text{ minimum eigenvalue of } A\\ \lambda_{\text{max}}(A) &= \text{ maximum eigenvalue of } A. \end{aligned}

    Condition number: the ratio

    \kappa(A) := \frac{\lambda_{\text{max}}(A)}{\lambda_{\text{min}}(A)}.









    Remarks.
    1. There holds \kappa(A^{-1}) = \kappa(A).
      Indeed:

      \begin{aligned} \lambda_{\text{min}}(A^{-1}) &= \frac{1}{\lambda_{\text{max}}(A)}\\ \lambda_{\text{max}}(A^{-1}) &= \frac{1}{\lambda_{\text{min}}(A)} \end{aligned}

      and so

      \begin{aligned} \kappa(A^{-1})=\frac{\lambda_{\text{max}}(A^{-1})}{\lambda_{\text{min}}(A^{-1})} = \frac{1/\lambda_{\text{min}}(A)}{1/\lambda_{\text{max}}(A)} = \frac{\lambda_{\text{max}}(A)}{\lambda_{\text{min}}(A)} = \kappa(A). \end{aligned}









    2. For c>0 , there holds \kappa(cA) =  \kappa(A).
      Indeed:

      \begin{aligned} \lambda_{\text{min}}(cA) &= c \lambda_{\text{min}}(A)\\ \lambda_{\text{max}}(cA) &= c \lambda_{\text{max}}(A) \end{aligned}

      and so

      \kappa(cA) = \frac{\lambda_{\text{max}}(cA)}{\lambda_{\text{min}}(cA)} = \frac{c\lambda_{\text{max}}(A)}{c\lambda_{\text{min}}(A)} = \frac{\lambda_{\text{max}}(A)}{\lambda_{\text{min}}(A)} = \kappa(A).
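    Both remarks are immediate to confirm numerically; a minimal sketch (assuming numpy; the helper kappa is ours):

    import numpy as np

    def kappa(A):
        # Ratio of extremal eigenvalues; eigvalsh returns them in ascending order.
        w = np.linalg.eigvalsh(A)
        return w[-1] / w[0]

    A = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 1 and 3
    print(kappa(A), kappa(np.linalg.inv(A)), kappa(5 * A))   # all equal 3.0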









    Condition Number of a Convex Subset
    Let K \subset \mathbb{R}^n be a convex subset.
    Directional Width: given a direction

    \begin{aligned} \nu & \in \mathbb{R}^n\\ \Vert \nu \Vert_2 &= 1, \end{aligned}

    the number

    \begin{aligned} W(K,\nu) := \sup\{\nu^T x : x \in K\} - \inf\{ \nu^Tx : x \in K \}. \end{aligned}









    Remarks.
    1. W(K,\cdot) provides a means of measuring relative eccentricity between any two directions; e.g.,

      W(K,\nu_1)>W(K,\nu_2)

      implies K is elongated more in the direction \nu_1 than \nu_2.
    2. The numerical value W(K,\nu) is independent of the placement of K.
      Justification. Given x_0 \in \mathbb{R}^n , let

      K + x_0 = \{x + x_0 : x \in K \} .

      Then

      \begin{aligned} W(K+x_0,\nu)&=\sup\{\nu^Tx : x \in K + x_0 \} - \inf\{\nu^Tx : x \in K + x_0 \} \\ &= \sup\{\nu^T(x+x_0) : x \in K  \} - \inf\{\nu^T(x+x_0) : x \in K \}\\ &= \sup\{\nu^Tx : x \in K \} + \nu^T x_0 - \inf\{\nu^Tx:x \in K\} - \nu^T x_0\\ &= \sup\{\nu^Tx : x \in K \} - \inf\{\nu^Tx:x \in K\}\\ &= W(K,\nu). \end{aligned}









    Examples:
    1. Let

      B(0,r) = \{ x \in \mathbb{R}^n : \Vert x \Vert_2 < r \}

      be the open ball of radius r>0 and center 0 \in \mathbb{R}^n.
      Then

      \begin{aligned} W(B(0,r),\nu) &= \sup\{\nu^Tx : \Vert x \Vert_2 < r \} - \inf\{\nu^Tx: \Vert x \Vert_2 < r\}\\ &= \nu^T (r\nu) - \nu^T(-r\nu)\\ &= 2r. \end{aligned}

      As expected: directional widths of Euclidean balls are independent of the direction \nu.








    2. Let K_1,K_2 \subset \mathbb{R}^n be convex subsets with K_1 \subset K_2 .
      Using

      \begin{aligned} \sup\{\nu^T x: x \in K_1 \} \leq \sup\{\nu^T x : x \in K_2\}\\ \inf\{\nu^T x: x \in K_1 \} \geq \inf\{\nu^T x : x \in K_2\}\\ \end{aligned}

      we have

      \begin{aligned} W(K_1,\nu) &= \sup\{\nu^T x: x \in K_1 \}  - \inf\{\nu^T x: x \in K_1 \} \\ &\leq \sup\{\nu^T x : x \in K_2\} - \inf\{\nu^T x : x \in K_2\}\\ &=W(K_2,\nu). \end{aligned}

      As expected: larger sets have larger directional widths.








    Extremal Widths: the maximum and minimum widths are defined as

    \begin{aligned} \text{maximum width} &= W_{\text{max}}(K) = \sup\{W(K,\nu): \Vert \nu\Vert_2 = 1 \}\\ \text{minimum width} &= W_{\text{min}}(K) = \inf\{W(K,\nu): \Vert \nu\Vert_2 = 1 \}. \end{aligned}

    Condition number: the ratio

    \kappa(K) = \left(\frac{W_{\text{max}}(K)}{W_{\text{min}}(K)} \right)^2.

    Remark.
    \kappa(K) >1 measures a lack of symmetry and indicates K is thin (or elongated) in some preferred direction; however, \kappa(K)=1 does not indicate K is a ball.







    Examples.
    1. Let

      B(0,r) = \{ x \in \mathbb{R}^n : \Vert x \Vert_2 < r \}

      be the open ball of radius r>0 and center 0 \in \mathbb{R}^n.
      As shown above,

      W(B(0,r),\nu) = 2r

      is constant relative to \nu.
      Thus

      \begin{aligned} W_{\text{max}}(B(0,r)) = \sup\{W(B(0,r),\nu): \Vert \nu\Vert_2 = 1 \} = 2r\\ W_{\text{min}}(B(0,r)) = \inf\{W(B(0,r),\nu): \Vert \nu\Vert_2 = 1 \} = 2r\\ \end{aligned}

      and so

      \kappa(B(0,r)) = \left( \frac{W_{\text{max}}(B(0,r))}{W_{\text{min}}(B(0,r))} \right)^2 = \left( \frac{2r}{2r} \right)^2=1.

      N.B.: Euclidean balls are not the only convex sets satisfying \kappa = 1; e.g., Reuleaux triangles are of constant width.








    2. Let K_1,K_2,\Omega\subset \mathbb{R}^n be convex subsets with K_1 \subset \Omega \subset K_2 .
      From a previous example above, we have

      W(K_1,\nu) \leq W(\Omega,\nu) \leq W(K_2,\nu).

      Thus

      \begin{aligned} W_{\text{min}}(K_1) &= \inf W(K_1,\nu) \leq \inf W(\Omega,\nu) = W_{\text{min}}(\Omega)\\ W_{\text{max}}(\Omega) &= \sup W(\Omega,\nu) \leq \sup W(K_2,\nu) = W_{\text{max}}(K_2) \end{aligned}

      and so

      \begin{aligned} \kappa(\Omega) = \left( \frac{W_{\text{max}}(\Omega)}{W_{\text{min}}(\Omega)} \right)^2  \leq \left( \frac{W_{\text{max}}(K_2)}{W_{\text{min}}(K_1)} \right)^2  \end{aligned}.

      For example, if K_1 = B(0,r_1) and K_2 = B(0,r_2), then

      \begin{aligned} \kappa(\Omega)  \leq \left( \frac{W_{\text{max}}(K_2)}{W_{\text{min}}(K_1)} \right)^2 = \left( \frac{2r_2}{2r_1} \right)^2 = \frac{r_2^2}{r_1^2} \end{aligned}.









    3. Given Q \in \boldsymbol{S}_{++}^n and x_0 \in \mathbb{R}^n, define the set

      \mathcal{E} = \{ x \in \mathbb{R}^n : (x-x_0)^TQ^{-1}(x-x_0) \leq 1 \}.

      N.B.: \mathcal{E} is an ellipsoid and all ellipsoids can be described like this.
      We set out to determine the condition number of \mathcal{E}.
      In fact, we prove the following.

      Proposition. Let \mathcal{E} and Q be as above. Then their condition numbers are the same:

      \kappa(\mathcal{E}) = \kappa(Q) = \kappa(Q^{-1}).

      Proof. Step 0.
      As observed above, we take WLOG x_0=0.
      We need to compute the directional widths W(\mathcal{E},\nu) and therefore need to first compute

      \begin{aligned} \sup\{ \nu^T x: x \in \mathcal{E}\} \quad \text{ and } \quad \inf\{ \nu^T x : x \in \mathcal{E} \}. \end{aligned}

      This can be achieved by solving the optimization problems

      \begin{cases} \text{minimize} & \pm \nu^Tx\\ \text{subject to}& x^TQ^{-1}x \leq 1 \end{cases}.

      We employ Lagrangian duality.
      N.B.: these problems are convex and satisfy Slater’s condition.







      Step 1.
      The Lagrangian is

      L(x,\lambda) = \pm \nu^Tx + \lambda (x^T Q^{-1} x  - 1).

      Its gradient is

      \nabla_x L(x,\lambda) = \pm \nu + 2\lambda Q^{-1} x.

      The KKT conditions demand

      \begin{aligned} \lambda( x^T Q^{-1} x - 1) &= 0\\ \nabla_x L &= 0 \end{aligned}

      Observe that \lambda = 0 would imply \nu=0, which is a contradiction.
      Thus \lambda>0 and so complementary slackness implies

      x^T Q^{-1} x  = 1

      and \nabla L=0 gives

      x = \mp \frac{1}{2\lambda} Q\nu.









      Step 2.
      Using

      \begin{aligned} 1&=x^T Q^{-1} x  \\ x &= \mp \frac{1}{2\lambda} Q\nu \end{aligned}

      we have

      \begin{aligned} 1 &= \left(\mp\frac{1}{2\lambda}Q\nu \right)^T Q^{-1} \left(\mp\frac{1}{2\lambda}Q\nu \right)\\ &=\frac{1}{4\lambda^2}\nu^T Q \nu. \end{aligned}

      Solving for \lambda gives

      \lambda =\frac{1}{2} \sqrt{\nu^T Q \nu}= \frac{1}{2} \sqrt{\nu^T Q^{1/2} Q^{1/2}\nu} = \frac{1}{2} \Vert Q^{1/2}\nu \Vert_2.









      Step 3.
      Using our evaluation of \lambda gives

      \begin{aligned} x^\star &= \mp \frac{1}{2\lambda}Q\nu\\ &= \mp \frac{1}{\Vert Q^{1/2}\nu\Vert_2} Q\nu\\ p^\star &= \pm \nu^T x^\star\\ &= - \frac{1}{\Vert Q^{1/2}\nu\Vert_2} \nu^TQ\nu\\ &= - \Vert Q^{1/2}\nu\Vert_2. \end{aligned}

      This gives

      \begin{aligned} \sup\{ \nu^Tx : x \in \mathcal{E}\} &= - \inf\{ - \nu^T x : x \in \mathcal{E} \}\\ &=\Vert Q^{1/2}\nu\Vert_2\\ \inf\{ \nu^Tx : x \in \mathcal{E}\} &= -\Vert Q^{1/2}\nu\Vert_2. \end{aligned}









      Step 4.
      We can now compute the directional and extremal widths.
      From Step 3. we have

      \begin{aligned} W(\mathcal{E},\nu) & = \sup\{ \nu^Tx : x \in \mathcal{E}\} - \inf\{ \nu^Tx : x \in \mathcal{E}\}\\ &=\Vert Q^{1/2}\nu\Vert_2 - (-\Vert Q^{1/2}\nu\Vert_2)\\ &= 2 \Vert Q^{1/2}\nu\Vert_2, \end{aligned}

      and so

      \begin{aligned} W_{\text{min}}(\mathcal{E}) &= \inf\{2 \Vert Q^{1/2}\nu\Vert_2: \Vert \nu\Vert_2 = 1\}\\ &= 2 \sqrt{\lambda_{\text{min}}(Q)}\\ W_{\text{max}}(\mathcal{E}) &= \sup\{2 \Vert Q^{1/2}\nu\Vert_2: \Vert \nu\Vert_2 = 1\}\\ &= 2 \sqrt{\lambda_{\text{max}}(Q)}. \end{aligned}

      At last, we compute the condition number of \mathcal{E}:

      \kappa(\mathcal{E}) = \left(\frac{W_{\text{max}}(\mathcal{E})}{W_{\text{min}}(\mathcal{E})} \right)^2 = \frac{\lambda_{\text{max}}(Q)}{\lambda_{\text{min}}(Q)} = \kappa(Q)

      which is what we wanted to show.
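    The proposition can also be checked by sampling directions. A sketch (assuming numpy; the random Q and the sampling scheme are ours), using the identity W(\mathcal{E},\nu) = 2\Vert Q^{1/2}\nu \Vert_2 from Step 4:

    import numpy as np

    rng = np.random.default_rng(3)
    B = rng.standard_normal((3, 3))
    Q = B @ B.T + np.eye(3)                  # a generic SPD matrix
    w, V = np.linalg.eigh(Q)
    Q_half = V @ np.diag(np.sqrt(w)) @ V.T   # symmetric square root of Q

    nus = rng.standard_normal((100_000, 3))
    nus /= np.linalg.norm(nus, axis=1, keepdims=True)    # unit directions
    widths = 2 * np.linalg.norm(nus @ Q_half, axis=1)    # W(E, nu)

    print(widths.max(), 2 * np.sqrt(w[-1]))              # approximately W_max
    print(widths.min(), 2 * np.sqrt(w[0]))               # approximately W_min
    print((widths.max() / widths.min()) ** 2, w[-1] / w[0])   # kappa(E) ~ kappa(Q)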








    Strong Convexity and Conditioning
    Let f:\mathbb{R}^n \to \mathbb{R} be strongly convex and have closed sublevel set S = \{ x : f(x) \leq f(x^{(0)})\} for some x^{(0)} \in \text{dom}\,f.
    Recall: it follows that

    mId \preceq \nabla^2f(x) \preceq M Id

    for some 0<m \leq M < \infty .

    For p^\star < \alpha \leq f(x^{(0)}) , let

    S_\alpha = \{ x : f(x) \leq \alpha \}

    be the sublevel set corresponding to \alpha.
    We set out to prove that the condition number \kappa(S_\alpha) is controlled by the conditioning of \nabla^2 f(x) .








    Remark.
    The condition number of \nabla^2 f(x) depends on x .
    However,

    mId \preceq \nabla^2f(x) \preceq M Id

    gives the upper estimate of

    \kappa(\nabla^2 f(x)) \leq \frac{M}{m}

    for all x \in S.
    Important points:
    1. Since M can be taken arbitrarily large and m arbitrarily small, this estimate can be very bad.
    2. M depends on S and hence the initial choice x^{(0)}.
      The further x^{(0)} is chosen from x^\star, the worse (larger) M can be.








    Little example.
    Let f(x_1,x_2) = e^{\frac{1}{2}(x_1^2+x_2^2)} .
    One computes

    \begin{aligned} e^{\frac{1}{2}(x_1^2+x_2^2)} Id& \preceq \nabla^2 f(x) \\ &=  e^{\frac{1}{2}(x_1^2+x_2^2)} \begin{bmatrix} 1+x_1^2 & x_1x_2\\ x_1x_2 & 1+x_2^2 \end{bmatrix}\\ & \preceq e^{\frac{1}{2}(x_1^2+x_2^2)}(1+x_1^2+x_2^2) Id \end{aligned}

    with the inequalities given by the extremal eigenvalues.
    Evidently \kappa(\nabla^2 f(x)) depends on x.
    If x^{(0)}=(1,1) , then S = \{ x_1^2+x_2^2 \leq 2 \} and so we may take

    \begin{aligned}  Id& \preceq \nabla^2 f(x)  \preceq 3e Id \end{aligned}

    with m=1,M=3e .
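    A quick numeric confirmation of these constants (a sketch assuming numpy; we sample S through a uniform grid of the enclosing square):

    import numpy as np

    # Hessian of f(x) = exp(||x||^2 / 2) is exp(||x||^2 / 2) (Id + x x^T).
    def hess(x):
        return np.exp(x @ x / 2) * (np.eye(2) + np.outer(x, x))

    rng = np.random.default_rng(4)
    pts = rng.uniform(-np.sqrt(2), np.sqrt(2), size=(5000, 2))
    pts = pts[(pts ** 2).sum(axis=1) <= 2]           # keep points of S
    eigs = np.array([np.linalg.eigvalsh(hess(x)) for x in pts])
    print(eigs[:, 0].min(), eigs[:, 1].max())        # near m = 1 and M = 3e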







    Proposition. Let f, m , M, \alpha, S_\alpha be as above. Then

    \kappa(S_\alpha) \leq \frac{M}{m} .

    Remark.
    Thus, the better conditioned \nabla^2 f(x) is in the sense

    M/m \approx 1 ,

    the better conditioned its sublevel sets are.
    Conversely, if the sublevel sets of f are poorly conditioned in the sense

    \kappa(S_\alpha) \gg 1 ,

    then \nabla^2 f(x) will be poorly conditioned, i.e.,

    M \gg m .

    Proof. Step 1.
    Using

    mId \preceq \nabla^2f(x) \preceq M Id

    and

    \begin{aligned} f(y) = f(x) + \nabla f(x)^T(y-x) + \frac{1}{2}(y-x)^T \nabla^2 f(z)(y-x)  \end{aligned}

    for suitable z , we have (taking x=x^\star and using \nabla f(x^\star)=0 ):

    \frac{m}{2}\Vert y-x^\star \Vert_2^2 + p^\star \leq f(y) \leq \frac{M}{2}\Vert y-x^\star \Vert_2^2 + p^\star

    for y \in S.






    Let

    \begin{aligned} B_i &= \left\{ y \in S: \Vert y-x^\star \Vert_2 \leq \sqrt{\frac{2(\alpha - p^\star)}{M}} \right\}\\ B_o &= \left\{ y \in S: \Vert y-x^\star \Vert_2 \leq \sqrt{\frac{2(\alpha - p^\star)}{m}} \right\} \end{aligned}

    We show

    B_i \subset S_\alpha \subset B_o .

    By observations above, the conditioning of S_\alpha is estimated in terms of the extremal widths of B_i and B_o .







    Step 2.
    To begin, let y \in B_i .
    Using

    \Vert y - x^\star \Vert_2 \leq \sqrt{\frac{2(\alpha-p^\star)}{M}}

    we have

    \begin{aligned} f(y) &\leq \frac{M}{2}\Vert y-x^\star \Vert_2^2 + p^\star\\ &\leq \frac{M}{2}\frac{2(\alpha-p^\star)}{M} + p^\star\\ &=\alpha \end{aligned}

    and so y \in S_\alpha.








    Step 3.
    For the other containment, let y \in S_\alpha .
    Using

    f(y) \leq \alpha

    we have

    \begin{aligned} \frac{m}{2}\Vert y - x^\star \Vert_2^2 + p^\star \leq \alpha \end{aligned}

    whence

    \begin{aligned} \Vert y - x^\star \Vert_2 \leq \sqrt{\frac{2(\alpha-p^\star)}{m}}, \end{aligned}

    thereby establishing y \in B_o.







    Step 4.
    By observations made above, we use B_i \subset S_\alpha \subset B_o to conclude

    \begin{aligned} \kappa(S_\alpha) &= \left(\frac{W_{\text{max}}(S_\alpha)}{W_{\text{min}}(S_\alpha)} \right)^2\\ &\leq \left(\frac{W_{\text{max}}(B_o)}{W_{\text{min}}(B_i)} \right)^2\\ &\leq \left(\frac{\sqrt{\frac{2(\alpha-p^\star)}{m}}}{\sqrt{\frac{2(\alpha-p^\star)}{M}}} \right)^2\\ &=\frac{M}{m}. \end{aligned}









    Proposition. Let f, m , M, \alpha, S_\alpha be as above. Then

    \begin{aligned} \lim_{\alpha \to p^\star } \kappa(S_\alpha) = \kappa\left(\nabla^2 f(x^\star) \right). \end{aligned}

    Remark.
    The point is that, as the sublevel sets shrink to x^\star , the problem’s conditioning is dictated by the conditioning of \nabla^2 f(x^\star).
    Proof (sketch). Step 1.
    Using Taylor approximation at x^\star , there holds

    \begin{aligned} f(y) &\approx  f(x^\star) + \nabla f(x^\star)^T (y-x^\star) + \frac{1}{2}(y-x^\star)^T \nabla^2 f(x^\star)(y-x^\star)\\ &= p^\star + \frac{1}{2}(y-x^\star)^T \nabla^2 f(x^\star)(y-x^\star) \end{aligned}

    for y near x^\star .







    Step 2.
    Observing that

    \bigcap_{\alpha>p^\star} S_\alpha = S_{p^\star} = \{ x^\star \},

    we can choose \alpha near p^\star so that y \in S_\alpha is near x^\star .
    Then the above Taylor approximation concludes y \in S_\alpha iff

    \begin{aligned} \alpha \gtrsim p^\star + \frac{1}{2}(y-x^\star)^T \nabla^2 f(x^\star)(y-x^\star) \end{aligned}

    which holds iff

    \begin{aligned} 2(\alpha-p^\star) \gtrsim (y-x^\star)^T \nabla^2 f(x^\star)(y-x^\star). \end{aligned}









    Step 3.
    Previous step indicates: if y \in S_\alpha then y belongs to or nearly belongs to

    \begin{aligned} \{ y : (y-x^\star)^T \nabla^2 f(x^\star)(y-x^\star) \leq 2(\alpha-p^\star) \}, \end{aligned}

    which is an ellipsoid defined by the matrix

    \left(\nabla^2 f(x^\star) \right)^{-1}

    and so

    \kappa(S_\alpha) \approx \kappa(\nabla^2 f(x^\star)) .

    (Recall \kappa(A) = \kappa(A^{-1}).)
    Taking \alpha closer to p^\star evidently improves these approximations and so we conclude

    \begin{aligned} \lim_{\alpha \to p^\star } \kappa(S_\alpha) = \kappa\left(\nabla^2 f(x^\star) \right). \end{aligned}
















    Example (See also CO Example 9.3.2)
    Let

    \begin{aligned}  f(x_1,x_2) = e^{\frac{1}{2}(x_1^2+x_2^2)}. \end{aligned}

    Goal
    Argue how changing the conditioning of the sublevel sets of f affects convergence.






    Sublevel set
    Observe

    \begin{aligned} S_\alpha &= \{e^{\frac{1}{2}(x_1^2+x_2^2)} \leq \alpha \}\\ &= \{ x_1^2 + x_2^2 \leq 2\log \alpha \} \end{aligned}

    are disks centered at the origin and with radius \sqrt{2\log\alpha}.
    As such, the sublevel sets are well-conditioned:

    \kappa(S_\alpha)=1.







    Conditioning of the Hessian
    We compute

    \begin{aligned} \nabla f(x_1,x_2)  &=  e^{\frac{1}{2}(x_1^2+x_2^2)} \begin{bmatrix} x_1\\x_2 \end{bmatrix}\\ \nabla^2 f(x_1,x_2) &= e^{\frac{1}{2}(x_1^2+x_2^2)} \begin{bmatrix} 1+x_1^2 & x_1x_2\\ x_1x_2 & 1+x_2^2 \end{bmatrix} \end{aligned}.

    The extremal eigenvalues of \nabla^2 f(x_1,x_2) are

    \begin{aligned} \lambda_{\text{min}}(\nabla^2 f(x_1,x_2)) &= e^{\frac{1}{2}(x_1^2+x_2^2)}\\ \lambda_{\text{max}}(\nabla^2 f(x_1,x_2)) &= e^{\frac{1}{2}(x_1^2+x_2^2)}(1+x_1^2+x_2^2). \end{aligned}

    Therefore

    \kappa(\nabla^2 f(x_1,x_2)) = \frac{\lambda_{\text{max}}(\nabla^2 f(x_1,x_2))}{\lambda_{\text{min}}(\nabla^2 f(x_1,x_2))} = 1 + x_1^2+x_2^2

    and so \nabla^2 f is reasonably well-conditioned near x^\star = (0,0).






    Let x^{(0)} = (1,1) be the initial point and define the initial sublevel set

    S = \{ f(x) \leq f(x^{(0)}) = e\}.

    On S, we have

    \begin{aligned} m=1 &\leq e^{\frac{1}{2}(x_1^2+x_2^2)} = \lambda_{\text{min}}(\nabla^2 f(x_1,x_2)) \\ \lambda_{\text{max}}(\nabla^2 f(x_1,x_2)) &= e^{\frac{1}{2}(x_1^2+x_2^2)}(1+x_1^2+x_2^2)  \leq 3e =M. \end{aligned}

    Thus f satisfies the strong convexity bounds

    \begin{aligned} Id \preceq \nabla^2 f(x_1,x_2) \preceq 3e Id, \quad (x_1,x_2) \in S. \end{aligned}







    Applying Gradient Descent
    Using gradient descent with exact line search solves the problem in one step.
    This is due to the radial symmetry of the sublevel sets: -\nabla f(x) points from x directly toward the origin, where f is minimized.
    Using backtracking line search, the problem is likewise solved without issue.
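    A minimal backtracking gradient descent sketch confirms this from x^{(0)}=(1,1) (the parameters \alpha=0.25, \beta=0.5 are our own illustrative choices, not prescribed by the notes):

    import numpy as np

    f = lambda x: np.exp(0.5 * (x @ x))
    grad = lambda x: f(x) * x

    x = np.array([1.0, 1.0])
    for k in range(500):
        g = grad(x)
        if np.linalg.norm(g) < 1e-8:
            break
        t = 1.0
        # backtrack until Armijo's condition f(x - t g) <= f(x) - 0.25 t ||g||^2 holds
        while f(x - t * g) > f(x) - 0.25 * t * (g @ g):
            t *= 0.5
        x = x - t * g
    print(x, k)  # converges to the origin in a handful of iterations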







    Unconditioning the Problem
    Consider the anisotropic dilation: (x_1,x_2) \mapsto (\gamma x_1,x_2) for an arbitrary \gamma > 1 .

    Applying this dilation to f results in the function

    f_\gamma (x_1,x_2) = f(\gamma x_1,x_2) = e^{\frac{1}{2}(\gamma^2 x_1^2 + x_2^2)}.

    Observe: the sublevel sets of f_\gamma are of the form

    \begin{aligned} S_\alpha &= \{ e^{\frac{1}{2}(\gamma^2 x_1^2 + x_2^2)} \leq \alpha \}\\ &=\{ \gamma^2 x_1^2 + x_2^2 \leq  2\log \alpha\}, \end{aligned}

    which are ellipses with extremal widths in the ratio \gamma , so that \kappa(S_\alpha) = \gamma^2.
    Evidently, the larger \gamma is, the worse conditioned S_\alpha is.
    N.B.: by the anisotropy, f_\gamma is more sensitive to changes in x_1 than in x_2.







    Now compute

    \nabla f_\gamma(x) = e^{\frac{1}{2}(\gamma^2 x_1^2 + x_2^2)} \begin{bmatrix} \gamma^2 x_1\\ x_2 \end{bmatrix}.

    The effects of poor conditioning can already be observed: for large \gamma>1, considering the step

    x^+ = x - t e^{\frac{1}{2}(\gamma^2 x_1^2 + x_2^2)} \begin{bmatrix} \gamma^2 x_1\\ x_2 \end{bmatrix},

    we see x^+ is obtained from x by stepping significantly further in the x_1-coordinate than the x_2-coordinate.
    This is despite the fact that the minimizer is still at the origin.







    Applying Gradient Descent
    While exact line search will still find the minimizer for this problem quickly (2 steps), one finds backtracking line search becomes impractical for large \gamma .
    Recalling that the Armijo–Goldstein inequality is satisfied for t<1/M, this is unsurprising: the larger M is, the smaller t likely needs to be for the inequality to hold.
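    To see the slowdown, one can rerun the same backtracking scheme on f_\gamma (a sketch; \gamma = 10 is an arbitrary illustrative choice):

    import numpy as np
    np.seterr(over="ignore")  # overflowing trial steps evaluate to inf and are simply rejected

    gamma = 10.0
    f = lambda x: np.exp(0.5 * (gamma**2 * x[0]**2 + x[1]**2))
    grad = lambda x: f(x) * np.array([gamma**2 * x[0], x[1]])

    x = np.array([1.0, 1.0])
    for k in range(200000):
        g = grad(x)
        if np.linalg.norm(g) < 1e-8:
            break
        t = 1.0
        while f(x - t * g) > f(x) - 0.25 * t * (g @ g):
            t *= 0.5
        x = x - t * g
    print(k)  # thousands of iterations, versus a handful when gamma = 1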














    Steepest Descent Gradient Descent as Steepest Descent:
    Recall: if t>0 is small and \Vert \nu \Vert_2=1, then Taylor’s approximation gives

    f(x+t\nu) - f(x) \approx t \nabla f(x)^T \nu.

    If \nu is a descent direction, then

    t \nabla f(x)^T \nu<0

    and this quantity records the approximate decrease in f in the direction \nu upon taking the small step

    x=x + t \nu.









    If we want to decrease f efficiently, we ask: for given step size t>0 , which direction \nu effects greatest descent?









    Observing

    \begin{aligned} -\frac{1}{\Vert u \Vert_2} u = \text{argmin}\{  u^T \nu : \Vert \nu \Vert_2 = 1 \} \end{aligned}

    for any nonzero u \in \mathbb{R}^n , we conclude

    \begin{aligned} -\frac{1}{\Vert \nabla f(x) \Vert_2} \nabla f(x) = \text{argmin}\{  \nabla f(x)^T \nu : \Vert \nu \Vert_2 = 1 \}. \end{aligned}

    Therefore, the direction

    \begin{aligned} \Delta x_{\text{nsd}}:=-\frac{1}{\Vert \nabla f(x) \Vert_2} \nabla f(x)  \end{aligned}

    gives the direction of greatest decrease: for small t>0 we have

    \begin{aligned} f(x + t \Delta x_{\text{nsd}}) - f(x) & \approx t \nabla f(x)^T \Delta x_{\text{nsd}}\\ &\leq t \nabla f(x)^T \nu\\ & \approx f(x + t \nu  ) - f(x) . \end{aligned}

    One may call such a direction a steepest descent direction.







    Quadratic Norm Steepest Descent
    Suppose for \alpha near p^\star, the sublevel sets S_\alpha are poorly conditioned.
    Suppose there is a change of variable/coordinates \bar x = P^{1/2} x so that

    \bar f(\bar x) := f(P^{-1/2}\bar x) = f(x)

    has well-conditioned sublevel sets.
    Then gradient descent in \bar x-coordinates is likely to behave well when minimizing \bar f(\bar x).








    Compute

    \nabla_{\bar x} \bar f(\bar x) = P^{-1/2} (\nabla f)(P^{-1/2}\bar x) = P^{-1/2} \nabla f(x).

    Then the steepest descent direction in \bar x -coordinates is

    \Delta \bar x = - \frac{1}{\Vert  P^{-1/2} \nabla f(x)\Vert_2} P^{-1/2} \nabla f(x).

    Converting back to original coordinates x = P^{-1/2}\bar x:

    \begin{aligned} \Delta x &= P^{-1/2}\Delta \bar x = - \frac{1}{\Vert  P^{-1/2} \nabla f(x)\Vert_2} P^{-1} \nabla f(x).  \end{aligned}









    But then (in x-coordinates)

    \begin{aligned} -\frac{1}{\Vert  P^{-1/2} \nabla f(x)\Vert_2} P^{-1} \nabla f(x) &= P^{-1/2}\left(- \frac{1}{\Vert  \nabla_{\bar x} \bar f(\bar x)\Vert_2}  \nabla \bar f(\bar x)\right)\\ &= P^{-1/2}\,\text{argmin}\left\{ \nabla \bar f(\bar x)^T \bar\nu : \Vert \bar \nu \Vert_2=1\right\}\\ &= \text{argmin}\left\{ \left( P^{-1/2} \nabla f(x) \right)^T P^{1/2}\nu : \Vert P^{1/2} \nu \Vert_2=1\right\}\\ &= \text{argmin}\left\{  \nabla f(x)^T \nu : \Vert P^{1/2} \nu \Vert_2=1\right\}, \end{aligned}

    where the third equality substitutes \bar \nu = P^{1/2}\nu .

    Therefore, the gradient descent direction obtained by the change of variable is the “steepest descent direction”

    \begin{aligned} \text{argmin}\left\{  \nabla f(x)^T \nu : \Vert P^{1/2} \nu \Vert_2=1\right\} \end{aligned}

    relative to the norm

    \Vert x \Vert_P := \Vert P^{1/2} x \Vert_2.









    In summary:
    • After a change of variable, the problem becomes well-conditioned and gradient descent may be used to obtain a steepest descent direction relative to the norm \Vert \cdot \Vert_2 (in the \bar x -coordinates).
    • Undoing the change of variable, this steepest descent direction is realized as the steepest descent relative to a different norm.









    The observations above suggest that a change of variable may result in better computational performance (via improving conditioning).
    Also indicated: gradient descent in the new variables was equivalent to finding the steepest descent direction with respect to the norm \Vert P^{1/2} \cdot \Vert_2 in the original variables.
    Motivated by this: consider steepest descent with respect to general norms.







    Review on Norms
    A norm on a vector space V is a function \Vert \cdot \Vert: V \to \mathbb{R}_{+} satisfying
    1. triangle inequality: \Vert x+y \Vert \leq \Vert x \Vert + \Vert y \Vert for all x,y \in V .
    2. Homogeneity: \Vert c x \Vert = |c| \Vert x \Vert for all c \in \mathbb{R} and x \in V .
    3. Positive definiteness: if x \in V satisfies \Vert x \Vert = 0, then x = 0 .








    Examples of important norms are:
    1. Standard Euclidean norm:

      \Vert x \Vert_2 = \sqrt{x_1^2 + \cdots + x_n^2}

    2. Quadratic norms for P \in \boldsymbol{S}_{++}^n:

      \Vert x \Vert_P = \Vert P^{1/2} x \Vert_2.

    3. \ell_p norms for 1 \leq p < \infty :

      \Vert x \Vert_p = \left( |x_1|^p + \cdots + |x_n|^p \right)^{1/p}.

    4. Chebyshev/\ell_\infty norm:

      \Vert x \Vert_\infty = \max\{ |x_1|,\ldots,|x_n| \}.









    Norm spheres: for a given norm \Vert \cdot \Vert , the unit norm sphere

    \{ x : \Vert x \Vert = 1 \}

    defines a collection of “directions” relative to the norm \Vert \cdot \Vert .







    Examples of unit norm spheres are indicated below.








    Given a norm \Vert \cdot \Vert , we define the dual norm \Vert \cdot \Vert_* to be

    \Vert x \Vert_* := \sup\{ x^T \nu : \Vert \nu \Vert = 1 \} .

    (This is the “operator/matrix norm” of x^T .)

    Important examples are recorded in the following table of primal–dual pairs.

    • \Vert \cdot \Vert_2 has dual \Vert \cdot \Vert_2
    • \Vert \cdot \Vert_P := \Vert P^{1/2} \cdot \Vert_2 (for P \in \boldsymbol{S}_{++}^n ) has dual \Vert \cdot \Vert_{P^{-1}} := \Vert P^{-1/2} \cdot \Vert_2
    • \Vert \cdot \Vert_1 has dual \Vert \cdot \Vert_\infty
    • \Vert \cdot \Vert_\infty has dual \Vert \cdot \Vert_1
    • \Vert \cdot \Vert_p (for 1 < p < \infty ) has dual \Vert \cdot \Vert_q , where \frac{1}{p}+\frac{1}{q} = 1
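    As a quick numerical check of the \ell_1/\ell_\infty pairing (a sketch; the supremum over the \ell_1-sphere is attained at a signed standard basis vector \pm e_i ):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=5)

    # sup{ x^T v : ||v||_1 = 1 } is attained at v = +/- e_i, so it equals ||x||_inf
    basis = np.eye(5)
    candidates = np.vstack([basis, -basis])
    dual_via_sup = max(c @ x for c in candidates)
    print(dual_via_sup, np.linalg.norm(x, np.inf))  # the two values agree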









    General Steepest Descent
    Fix a norm \Vert \cdot \Vert on \mathbb{R}^n.

    Normalized steepest descent direction for \Vert \cdot \Vert : any element

    \Delta x_{\text{nsd}} \in \text{argmin}\{ \nabla f(x)^T \nu : \Vert \nu \Vert = 1 \}.



    Intuition: the step x=x+t\Delta x_{\text{nsd}} (for small t>0) effects the greatest decrease in the objective among all directions \nu satisfying \Vert \nu \Vert =1 .
    Recall:

    f(x+t\nu) - f(x) \approx t \nabla f(x)^T \nu.









    In practice:
    1. \Vert \cdot \Vert is chosen depending on the problem; however, the choices corresponding to gradient descent and to Newton’s method (given below) are common.
    2. Convergence results may depend on choice of \Vert \cdot \Vert .
    3. \Delta x_{\text{nsd}} is generally not unique; e.g. this can occur for \Vert \cdot \Vert_1 , as indicated below.
      Here, the red “x” indicates -\nabla f(x) and the black dots indicate two distinct \Delta x_{\text{nsd}} .
      In fact: \Delta x_{\text{nsd}} may be taken to be any point on the unit norm sphere in the first quadrant.








    N.B.: by

    \Delta x_{\text{nsd}} \in \text{argmin}\{\nabla f(x)^T \nu : \Vert \nu \Vert =1\}

    we have

    \begin{aligned} \nabla f(x)^T \Delta x_{\text{nsd}} &= \inf\{ \nabla f(x)^T \nu : \Vert \nu \Vert = 1 \} \\ &= -\sup\{ -\nabla f(x)^T \nu : \Vert \nu \Vert = 1 \} \\ &= -\Vert \nabla f(x)\Vert_*. \end{aligned}

    Recall:

    \Vert x \Vert_* = \sup\{ x^T \nu : \Vert \nu \Vert = 1\}.









    Unnormalized steepest descent direction: for any \Delta x_{\text{nsd}} , a descent direction of the form

    \Delta x_{\text{sd}} = \Vert \nabla f(x) \Vert_* \Delta x_{\text{nsd}}.



    N.B.:
    1. This choice of “unnormalization” gives

      \nabla f(x)^T \Delta x_{\text{sd}} =-\Vert \nabla f(x) \Vert_*^2.

    2. Using \Delta x_{\text{sd}} instead of \Delta x_{\text{nsd}} uses both the direction of steepest descent (with respect to \Vert \cdot \Vert ) and the rate of decrease of f.
    3. Exact line search does not see choice of normalization.
      Indeed for c>0, we have

      \begin{aligned} \text{argmin}\{f(x+tc\Delta x): t \geq 0 \} = \frac{1}{c}\,\text{argmin}\{f(x+t\Delta x): t \geq 0 \}  \end{aligned}

    4. Theoretically: choice of normalization does not matter.
      Pragmatically: choice of normalization may affect behavior of convergence.








    Steepest Descent Method
    The general steepest descent method may now be recorded.
    
    given initial x \in \text{dom}\,f 
    repeat:
    1. Compute: \Delta x_{\text{sd}} .
    2. Perform line search to determine step size t .
    3. Take step: x:= x + t\Delta x_{\text{sd}} .
    until: stopping criterion holds.
    
    N.B.: for exact line search, either \Delta x_{\text{sd}} or \Delta x_{\text{nsd}} may be used.
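    The method can be written generically by passing the map g \mapsto \Delta x_{\text{sd}} as an argument (a sketch; the function and parameter names are our own):

    import numpy as np

    def steepest_descent(f, grad, sd_direction, x, alpha=0.25, beta=0.5,
                         tol=1e-8, max_iter=10000):
        # generic steepest descent with backtracking line search;
        # sd_direction(g) returns the (unnormalized) steepest descent
        # direction for the chosen norm, given the gradient g
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:  # simple stopping criterion
                break
            dx = sd_direction(g)
            t = 1.0
            while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
                t *= beta
            x = x + t * dx
        return x

    # Euclidean norm: sd_direction(g) = -g recovers gradient descent
    print(steepest_descent(lambda x: x @ x, lambda x: 2 * x,
                           lambda g: -g, np.array([3.0, -4.0])))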
















    Examples For various norms, we will compute

    \begin{aligned} &\Delta x_{\text{nsd}}  \\ &\Delta x_{\text{sd}} = \Vert \nabla f(x) \Vert_* \Delta x_{\text{nsd}}. \end{aligned}









    Example 1
    Let \Vert \cdot \Vert = \Vert \cdot \Vert_2 .
    Observe

    \begin{aligned} \Delta x_{\text{nsd}} &=\text{argmin}\{ \nabla f(x)^T\nu : \Vert \nu \Vert_2 = 1 \}\\ &= -\frac{1}{\Vert \nabla f(x) \Vert_2} \nabla f(x). \end{aligned}









    Next, using

    \Vert \cdot \Vert_* = \Vert \cdot \Vert_2

    we have

    \Vert \nabla f(x) \Vert_* = \Vert \nabla f(x) \Vert_2

    and so

    \begin{aligned} \Delta x_{\text{sd}} &= \Vert \nabla f(x) \Vert_* \Delta x_{\text{nsd}}\\ &=\Vert \nabla f(x) \Vert_2 \left( -\frac{1}{\Vert \nabla f(x) \Vert_2} \nabla f(x) \right) \\ &=- \nabla f(x). \end{aligned}


    Conclusion: gradient descent method is the steepest descent with respect to the Euclidean norm \Vert \cdot \Vert_2.







    Example 2
    Let P \in \boldsymbol{S}_{++}^n and let

    \Vert x \Vert = \Vert x \Vert_P := \Vert P^{1/2} x\Vert_2 .

    Compute

    \begin{aligned} \Delta x_{\text{nsd}} &= \text{argmin}\left\{  \nabla f(x)^T \nu : \Vert P^{1/2} \nu \Vert_2=1\right\}\\ &= P^{-1/2} \text{argmin}\left\{  \nabla f(x)^T P^{-1/2}\mu : \Vert \mu \Vert_2=1\right\}\\ &= P^{-1/2} \text{argmin}\left\{  (P^{-1/2}\nabla f(x))^T\mu : \Vert \mu \Vert_2=1\right\}\\ &= P^{-1/2}\left( - \frac{1}{\Vert  P^{-1/2} \nabla f(x)\Vert_2} P^{-1/2} \nabla f(x)  \right)\\ &=  - \frac{1}{\Vert  P^{-1/2} \nabla f(x)\Vert_2} P^{-1} \nabla f(x). \end{aligned}









    Using that the dual of \Vert \cdot \Vert_P is

    \Vert x \Vert_* = \Vert x \Vert_{P^{-1}} := \Vert P^{-1/2} x\Vert_2

    and so

    \Vert \nabla f(x) \Vert_* = \Vert P^{-1/2} \nabla f(x) \Vert_2 ,

    we conclude

    \begin{aligned} \Delta x_{\text{sd}} &= \Vert \nabla f(x) \Vert_* \Delta x_{\text{nsd}}\\ &= \Vert P^{-1/2} \nabla f(x) \Vert_2 \left( - \frac{1}{\Vert  P^{-1/2} \nabla f(x)\Vert_2} P^{-1} \nabla f(x) \right)\\ & = P^{-1}\nabla f(x)  \end{aligned} .









    Consider the change of variable y = P^{1/2} x and

    \bar f(y) = f(P^{-1/2}y) = f(x) .

    The gradient descent step for \bar f in the new variables y is

    \begin{aligned} \Delta y_{\text{sd}} &= - \nabla \bar f(y) \\ &= -\nabla_y (f(P^{-1/2}y))\\ &= - P^{-1/2} (\nabla f)(P^{-1/2}y) \end{aligned} .

    This direction in the original variables x is thus

    \begin{aligned} P^{-1/2}\Delta y_{\text{sd}} &= - P^{-1} \nabla f(x) = \Delta x_{\text{sd}}. \end{aligned} .









    Conclusion: obtaining \Delta x_{\text{sd}} with respect to \Vert \cdot \Vert_P is equivalent to standard gradient descent in a different coordinate system.
    N.B.: suitably chosen P may result in a well-conditioned problem in the new variables y = P^{1/2} x.
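    A sketch of this in action on the ill-conditioned f_\gamma from before: with P = \text{diag}(\gamma^2, 1) (the Hessian of f_\gamma at the minimizer, our choice for illustration), the direction \Delta x_{\text{sd}} = -P^{-1}\nabla f_\gamma(x) removes the anisotropy that stalled plain gradient descent.

    import numpy as np
    np.seterr(over="ignore")  # overflowing trial steps evaluate to inf and get rejected

    gamma = 10.0
    f = lambda x: np.exp(0.5 * (gamma**2 * x[0]**2 + x[1]**2))
    grad = lambda x: f(x) * np.array([gamma**2 * x[0], x[1]])
    P_inv = np.diag([1 / gamma**2, 1.0])   # P = diag(gamma^2, 1)

    x = np.array([1.0, 1.0])
    for k in range(100):
        g = grad(x)
        if np.linalg.norm(g) < 1e-8:
            break
        dx = -P_inv @ g                    # steepest descent step for the P-norm
        t = 1.0
        while f(x + t * dx) > f(x) + 0.25 * t * (g @ dx):
            t *= 0.5
        x = x + t * dx
    print(x, k)  # converges in tens of iterations, unlike plain gradient descent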







    Example 3
    Let

    \Vert x \Vert = \Vert x \Vert_1 = |x_1| + \cdots + |x_n|,

    and recall

    \Vert x \Vert_* = \Vert x \Vert_\infty = \max\{ |x_1|,\ldots,|x_n|\}.

    Given x , let i be such that

    \begin{aligned} \Vert \nabla f(x) \Vert_\infty &= \max\left\{\left\vert \frac{\partial f(x)}{\partial x_j} \right\vert: j=1,\ldots,n\right\}\\ &= \left\vert \frac{\partial f(x)}{\partial x_i} \right\vert \end{aligned}









    Let \left\{e_j\right\} denote the standard basis of \mathbb{R}^n.
    N.B.: \Vert \pm e_i \Vert_1 = 1.
    One can show

    \begin{aligned} \Delta x_{\text{nsd}} &= - \text{sgn}\left( \frac{\partial f(x)}{\partial x_i} \right) e_i. \end{aligned}

    Therefore, for \Vert \cdot \Vert_1 , the steepest descent direction \Delta x_{\text{nsd}} is in the coordinate direction in which f changes the most.







    Lastly, one has

    \begin{aligned} \Delta x_{\text{sd}} &= \Vert \nabla f(x) \Vert_\infty \Delta x_{\text{nsd}}\\ &= \left\vert \frac{\partial f(x)}{\partial x_i} \right\vert \left(- \text{sgn}\left( \frac{\partial f(x)}{\partial x_i} \right)  e_i \right)\\ &=-\frac{\partial f(x)}{\partial x_i}  e_i \end{aligned}

    The resulting steepest descent algorithm is often called a coordinate-descent algorithm.
    Indeed: the step

    x=x-t\frac{\partial f(x)}{\partial x_i}  e_i, \quad t>0

    amounts to increasing or decreasing the i th coordinate of x according to the coordinate direction of greatest change.
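    A sketch of the resulting coordinate-descent step (the example function and the fixed step size are our own illustrative choices):

    import numpy as np

    def l1_sd_step(g):
        # unnormalized steepest descent step for the l1 norm:
        # move along the coordinate with the largest partial derivative
        i = np.argmax(np.abs(g))
        dx = np.zeros_like(g)
        dx[i] = -g[i]                      # Delta x_sd = -(df/dx_i) e_i
        return dx

    grad = lambda x: np.array([2 * x[0], 20 * x[1]])   # f(x) = x1^2 + 10 x2^2
    x = np.array([1.0, 1.0])
    for _ in range(100):
        x = x + 0.04 * l1_sd_step(grad(x))             # fixed step size, for simplicity
    print(x)   # approaches the minimizer (0, 0)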














    Newton’s Method Suppose always that f is strongly convex and twice continuously differentiable.

    Newton step: the direction

    \Delta x_{\text{nt}} = - \nabla^2 f(x)^{-1} \nabla f(x).

    N.B.: if \nabla f(x) \neq 0 , then convexity implies

    \begin{aligned} \nabla f(x)^T \Delta x_{\text{nt}} &= - \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)<0, \end{aligned}

    and so \Delta x_{\text{nt}} is a descent direction.
    (Recall strong convexity \implies \nabla^2 f(x) \in \boldsymbol{S}_{++}^n .)

    N.B.: Computing \Delta x_{\text{nt}} involves solving the system

    \begin{aligned} Hv &= -g,\\ H = \nabla^2 f(x), &\quad g = \nabla f(x). \end{aligned}

    Exploiting special matrix structure of H (e.g., sparsity) may allow the system to be solved efficiently.
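    In code, the step is a linear solve, never an explicit inverse; since H is symmetric positive definite, a Cholesky factorization is the natural choice (a sketch using scipy's cho_factor/cho_solve on arbitrary test data):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def newton_step(H, g):
        # solve H v = -g via Cholesky (H symmetric positive definite)
        return cho_solve(cho_factor(H), -g)

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 4))
    H = A @ A.T + 4 * np.eye(4)            # a random SPD "Hessian"
    g = rng.normal(size=4)
    v = newton_step(H, g)
    print(np.allclose(H @ v, -g))          # True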







    Example.
    Let

    f(x_1,x_2) = x_1^2 + e^{x_2}.

    Then

    \nabla f(x_1,x_2) =  \begin{bmatrix} 2x_1\\ e^{x_2} \end{bmatrix}, \qquad \nabla^2 f(x_1,x_2) = \begin{bmatrix} 2 & 0\\ 0 & e^{x_2} \end{bmatrix},

    and so

    \begin{aligned} \Delta x_{\text{nt}} &= - \nabla^2 f(x_1,x_2)^{-1} \nabla f(x_1,x_2)\\ &=- \begin{bmatrix} \frac{1}{2} & 0\\ 0 & e^{-x_2} \end{bmatrix} \begin{bmatrix} 2x_1\\ e^{x_2} \end{bmatrix} \\ &= \begin{bmatrix} -x_1\\ -1 \end{bmatrix} \end{aligned}









    Observe:
    • (x_1,x_2) + \Delta x_{\text{nt}} = (x_1,x_2) - (x_1,1) = (0,x_2-1) .
      Thus, first step minimizes the quadratic part x_1^2 of f:

      f(0,x_2-1) = e^{x_2-1}.

    • x^{(k)} = (0,x_2-k) for k>0 .
      Thus, sequence decreases the e^{x_2} part of f:

      f(x^{(k)}) = e^{x_2-k}.

    A couple of steps are plotted below.
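    The iterates can equally be checked in code (a small sketch with full Newton steps, t = 1):

    import numpy as np

    grad = lambda x: np.array([2 * x[0], np.exp(x[1])])
    hess_inv = lambda x: np.diag([0.5, np.exp(-x[1])])   # inverse of diag(2, e^{x2})

    x = np.array([2.0, 3.0])
    for _ in range(4):
        x = x - hess_inv(x) @ grad(x)      # x + Delta x_nt
        print(x)                           # (0, 2), (0, 1), (0, 0), (0, -1)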








    Interpretations of Newton Step
    Steepest Descent:
    Let

    P = \nabla^2 f(x).

    Consider the “Hessian norm”:

    \begin{aligned} \Vert y \Vert &:= \Vert y \Vert_P \\ &= \Vert P^{1/2} y \Vert_2\\ &= \left( (P^{1/2}y)^T P^{1/2}y \right)^{1/2}\\ &= \left( y^T P y \right)^{1/2}\\ &= \left( y^T \nabla^2 f(x) y \right)^{1/2}. \end{aligned}









    Recall: unnormalized steepest descent direction \Delta x_{\text{sd}} for quadratic norm \Vert \cdot\Vert_Q is

    \Delta x_{\text{sd}} = - Q^{-1}\nabla f(x) .

    Therefore, \Delta x_{\text{nt}} is the steepest descent direction \Delta x_{\text{sd}} for \Vert \cdot \Vert_{P} :

    \Delta x_{\text{nt}} = - \nabla^2 f(x)^{-1} \nabla f(x) = - P^{-1} \nabla f(x) .



    N.B.: finding \Delta x_{\text{nt}} is equivalent to finding \Delta x_{\text{sd}} after a change of variable that results in well-conditioned sublevel sets near the optimizer.







    Minimizing Quadratic Approximation:
    For given x , define

    F(v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x) v.

    Thus, F(v) \approx f(x+v) is the second order Taylor approximation of f at x .
    N.B.: F(v)  is a convex quadratic in v since \nabla^2 f(x) \in \boldsymbol{S}_{++}^n .
    Moreover, recalling

    \begin{aligned} P \in \boldsymbol{S}_{++}^n \implies \text{argmin}\{c + b^T v + \frac{1}{2} v^T P v\} =  -P^{-1}b \end{aligned}

    we conclude F(v) is minimized at

    v = - \nabla^2 f(x)^{-1} \nabla f(x) = \Delta x_{\text{nt}}.

    Viz., the step \Delta x_{\text{nt}} exactly minimizes the quadratic approximation.







    Conclusion:
    Given
    • \Delta x_{\text{nt}} minimizes the quadratic approximation F of f at x.
    • f(x+v) \approx F(v) is a good approximation for x \approx x^\star .
    we conclude
    • p^\star = f(x^\star) \approx F(\Delta x_{\text{nt}}) and x+\Delta x_{\text{nt}} \approx x^\star .
    • This conclusion is depicted below.








      The Newton Decrement
      The Newton decrement at x is the quantity

      \lambda(x) = \left( \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) \right)^{1/2}.

      Detailed below: \lambda(x) is used to estimate

      \begin{aligned} &f(x) - \inf_{v} F(v)\\ &f(x) - p^\star\\ &f(x+t\Delta x_{\text{nt}}) - f(x). \end{aligned}









      Proposition. There holds

      \begin{aligned} \lambda(x) &= \Vert \Delta x_{\text{nt}} \Vert_{\nabla^2 f(x)}. \end{aligned}

      Proof.

      \begin{aligned} \lambda(x) &= \left( \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) \right)^{1/2}\\ &= \left( \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla^2 f(x) \nabla^2 f(x)^{-1} \nabla f(x) \right)^{1/2}\\ &=\left( (-\nabla^2 f(x)^{-1}\nabla f(x))^T  \nabla^2 f(x) (-\nabla^2 f(x)^{-1} \nabla f(x) )\right)^{1/2}\\ &=\left(\Delta x_{\text{nt}}^T \nabla^2 f(x) \Delta x_{\text{nt}} \right)^{1/2}\\ &= \Vert \Delta x_{\text{nt}} \Vert_{\nabla^2 f(x)}. \end{aligned}









      Proposition. There holds

      \begin{aligned} f(x) - \inf_v F(v) &= \frac{1}{2}\lambda(x)^2. \end{aligned}

      Therefore, using

      \begin{aligned} x \approx x^\star &\implies f(x+v) \approx F(v)\\ & \implies F(\Delta x_{\text{nt}}) \approx p^\star \end{aligned}

      we conclude that \frac{1}{2}\lambda^2(x) \approx f(x)-p^\star.
      Proof.

      \begin{aligned} f(x) - \inf F(v) &= f(x) - F(\Delta x_{\text{nt}})\\ &= f(x) - f(x) - \nabla f(x)^T \Delta x_{\text{nt}} - \frac{1}{2}\Delta x_{\text{nt}}^T \nabla^2 f(x) \Delta x_{\text{nt}}\\ &=  \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) \\ &\qquad -\frac{1}{2}  \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla^2 f(x) \nabla^2 f(x)^{-1} \nabla f(x)\\ &=  \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) \\ &\qquad  - \frac{1}{2}  \nabla f(x)^T \nabla^2 f(x)^{-1}  \nabla f(x)\\ &=  \frac{1}{2}\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\\ &= \frac{1}{2}\lambda(x)^2 \end{aligned}









      Proposition. Armijo’s condition for the Newton step is

      f(x+t\Delta x_{\text{nt}}) - f(x) \leq  -\alpha t  \lambda(x)^2

      Proof. Recall: Armijo’s condition for backtracking line search is

      f(x+t\Delta x) - f(x) \leq  \alpha t\nabla f(x)^T \Delta x.

      Taking

      \Delta x = \Delta x_{\text{nt}} = - \nabla^2 f(x)^{-1} \nabla f(x),

      we find

      \begin{aligned} \nabla f(x)^T \Delta x_{\text{nt}} &= \nabla f(x)^T \left( - \nabla^2 f(x)^{-1} \nabla f(x) \right)\\ &= -\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\\ &= - \lambda(x)^2. \end{aligned}

      Thus Armijo’s condition for the Newton step is

      f(x+t\Delta x_{\text{nt}}) - f(x) \leq  -\alpha t  \lambda(x)^2









      Newton’s Method
      Using the Newton step and decrement, we may now record Newton’s method.
      Traditionally, “Newton’s method” uses step size t=1.
      The following is thus sometimes called a “damped Newton method.”

      
      given 
               initial x \in \text{dom}\,f 
               tolerance \epsilon>0
      repeat: 
      1. Compute 
               \Delta x_{\text{nt}} = - \nabla^2 f(x)^{-1}\nabla f(x) 
      and 
               \lambda^2 = \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) .
      2. Stopping criterion: quit if \frac{1}{2}\lambda^2 \leq \epsilon .
      3. Perform line search to determine step size t .
      4. Take step: x = x + t\Delta x_{\text{nt}} .
      


      N.B.: Stopping criterion comes from

      \begin{aligned} x \approx x^\star &\implies f(x+v) \approx F(v)\\ & \implies F(\Delta x_{\text{nt}}) \approx p^\star \end{aligned}

      which implies \frac{1}{2}\lambda^2(x) \approx f(x)-p^\star.
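      A sketch of the damped Newton method as recorded above (the test function is our own illustrative choice):

      import numpy as np

      def damped_newton(f, grad, hess, x, eps=1e-10, alpha=0.25, beta=0.5):
          while True:
              g, H = grad(x), hess(x)
              dx = np.linalg.solve(H, -g)       # Newton step
              lam2 = -(g @ dx)                  # lambda(x)^2 = g^T H^{-1} g
              if lam2 / 2 <= eps:               # since f(x) - p* is approx lambda^2 / 2
                  return x
              t = 1.0
              while f(x + t * dx) > f(x) - alpha * t * lam2:   # Armijo for the Newton step
                  t *= beta
              x = x + t * dx

      # strongly convex test function
      f = lambda x: np.exp(x[0] + x[1]) + np.exp(x[0] - x[1]) + x[0]**2
      grad = lambda x: np.array([np.exp(x[0] + x[1]) + np.exp(x[0] - x[1]) + 2 * x[0],
                                 np.exp(x[0] + x[1]) - np.exp(x[0] - x[1])])
      def hess(x):
          a, b = np.exp(x[0] + x[1]), np.exp(x[0] - x[1])
          return np.array([[a + b + 2.0, a - b], [a - b, a + b]])

      print(damped_newton(f, grad, hess, np.array([1.0, 1.0])))  # approx (-0.567, 0)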







      Affine Invariance
      Proposition. If
      • T \in \mathbb{R}^{n\times n} is nonsingular
      • x = Ty
      • \bar f(y):= f(Ty)
      then the Newton step \Delta y_{\text{nt}} for \bar f is related to the Newton step \Delta x_{\text{nt}} for f by

      \Delta x_{\text{nt}} = T \Delta y_{\text{nt}} .

      Proof. Computing

      \begin{aligned} \nabla \bar f(y) &= T^T \nabla f(x)\\ \nabla^2 \bar f(y) &= T^T \nabla^2 f(x) T, \end{aligned}

      we find

      \begin{aligned} \Delta y_{\text{nt}} &= - \nabla^2 \bar f(y)^{-1} \nabla \bar f(y)\\ &=-\left( T^T \nabla^2 f(x) T \right)^{-1} T^T \nabla f(x)\\ &=-T^{-1} \nabla^2 f(x)^{-1}(T^T)^{-1}T^T \nabla f(x)\\ &=-T^{-1}\nabla^2 f(x)^{-1}\nabla f(x)\\ &=T^{-1}\Delta x_{\text{nt}}. \end{aligned}









      Remark 1. This proposition asserts that an affine change of variable effects the same affine change in Newton steps.
      In particular:

      x+\Delta x_{\text{nt}} = Ty + T\Delta y_{\text{nt}} = T(y + \Delta y_{\text{nt}}).

      Thus, if minimizing f via Newton’s method gives iterates x^{(k)}, then minimizing \bar f via Newton’s method with y^{(0)} = T^{-1} x^{(0)} gives y^{(k)} = T^{-1} x^{(k)}.
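      A numerical check of the proposition (a sketch with an arbitrary strongly convex f and an arbitrary nonsingular T of our choosing):

      import numpy as np

      rng = np.random.default_rng(0)
      T = rng.normal(size=(2, 2)) + 3 * np.eye(2)    # nonsingular for this seed

      # f(x) = exp(x1) + exp(x2) + ||x||_2^2
      grad = lambda x: np.exp(x) + 2 * x
      hess = lambda x: np.diag(np.exp(x)) + 2 * np.eye(2)

      x = np.array([0.7, -0.4])
      dx = np.linalg.solve(hess(x), -grad(x))        # Newton step for f at x

      gbar = T.T @ grad(x)                           # gradient of f(Ty) at y = T^{-1} x
      Hbar = T.T @ hess(x) @ T                       # Hessian of f(Ty) at y
      dy = np.linalg.solve(Hbar, -gbar)              # Newton step for f-bar at y

      print(np.allclose(dx, T @ dy))                 # True: Delta x_nt = T Delta y_nt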







      Remark 2. The Newton decrement is also invariant under the affine change x=Ty .

      \begin{aligned} \bar\lambda(y)^2&= \nabla \bar f(y)^T \nabla^2 \bar f(y)^{-1} \nabla \bar f(y)\\ &= (\nabla f(x)^T T )( T^{-1} \nabla^2 f(x)^{-1} T^{-T} ) T^T \nabla f(x)\\ &= \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\\ &=\lambda(x)^2. \end{aligned}

      In particular: the stopping criterion for Newton’s method remains the same after an affine change of variables.














    Descent Algorithms for Equality Constrained Minimization
    Overview Equality Constrained Minimization.
    We will focus on equality constrained problems of the form

    \begin{cases} \text{minimize} & f(x)\\ \text{subject to} & Ax = b. \end{cases}

    The main assumptions are
    • f is convex.
    • f is twice continuously differentiable: \nabla^2 f is continuous.
    • An optimal solution x^\star \in \text{dom}\,f exists.
    • A \in \mathbb{R}^{p \times n} with \text{rank}\,A = p < n.
    • b \in \mathbb{R}^p is problem data.
    N.B.:
    1. \text{rank}\,A = p indicates equality constraints are independent (not superfluous).
    2. \text{rank}\, A< n indicates there are fewer equality constraints than variables.








    Goal.
    Demonstrate that Newton’s method naturally extends to equality constrained problems.
    Will cover:
    • Feasible start equality constrained Newton’s method
    • Infeasible start equality constrained Newton’s method








    Warm-up Question.
    Suppose we wish to employ Newton’s method for

    \begin{cases} \text{minimize} & f(x)\\ \text{subject to} & Ax = b. \end{cases}

    Suppose \Delta x_{\text{nt}} denotes the to-be-determined Newton step.
    Suppose starting point x^{(0)} \in \text{dom}\,f is feasible: Ax^{(0)} = b.
    What necessary assumption on \Delta x_{\text{nt}} should we impose so that each step x = x + t \Delta x_{\text{nt}} remains feasible?














    Equivalent Unconstrained Formulations Main Idea.
    We may apply unconstrained optimization techniques to unconstrained problems that are equivalent to the equality constrained problem

    \begin{cases} \text{minimize} & f(x)\\ \text{subject to} & Ax = b. \end{cases}

    Two ways of achieving this are
    1. eliminating equality constraint by the change

      f(x) \mapsto f(Fz+x_0).

    2. formulating the Lagrangian dual problem.
    N.B.: both reformulations of the equality constrained problem may break problem structure (e.g., sparsity) and hence affect numerics.







    Equality Constraint Elimination.
    N.B.: \text{rank}\,A = p < n \implies \dim\ker A = n-p .
    Can find F \in \mathbb{R}^{n \times (n-p)} such that

    \text{range}\,F = \text{ker}\,A.

    Let x_0 solve Ax_0 = b.
    Then

    \{ x \in \mathbb{R}^n : Ax = b \} = \{ Fz + x_0 : z \in \mathbb{R}^{n-p}\}.

    Therefore

    \{ f(x) : Ax = b \} = \{ f(Fz + x_0) : z \in \mathbb{R}^{n-p}\}

    and so

    \begin{aligned} \inf\{f(x): Ax=b\} = \inf\{ f(Fz + x_0) : z \in \mathbb{R}^{n-p}\}. \end{aligned}









    Thus, if z^\star \in \mathbb{R}^{n-p} solves

    \begin{cases} \text{minimize} & f(Fz+x_0) \end{cases}

    then x^\star = Fz^\star + x_0 solves

    \begin{cases} \text{minimize} & f(x)\\ \text{subject to} & Ax=b. \end{cases}

    May therefore use unconstrained optimization for f(Fz+x_0) to solve original equality constrained problem.
    N.B.: this confirms that minimizing the restriction of a function to an affine subspace is theoretically equivalent to an unconstrained problem.







    Using Lagrangian Dual.
    Since there are no inequality constraints, the Lagrangian of the problem is

    L(x,\nu) = f(x) + \nu^TAx - \nu^T b.

    Using Legendre transform f^* , the dual function is

    \begin{aligned} g(\nu) &= \inf\{ L(x,\nu) : x\}\\ &= \inf\{f(x) + \nu^TAx - \nu^T b\}\\ &= - \nu^T b - \sup\{ -(A^T \nu)^T x - f(x)\}\\ &=-\nu^T b - f^*(-A^T\nu). \end{aligned}









    The dual problem is therefore the unconstrained optimization problem

    \begin{cases} \text{maximize} & -\nu^T b - f^*(-A^T\nu). \end{cases}

    N.B.: g(\nu)=-\nu^T b - f^*(-A^T\nu) is not a priori twice continuously differentiable, even if f is.
    If g(\nu) is twice continuously differentiable, then can use unconstrained optimization to find dual optimizer \nu^\star and whence primal optimizer x^\star . (Assumption that x^\star exists implies problem satisfies Slater’s condition since there are no inequality constraints.)














    Quadratic Model Problem Recall:
    1. The quadratic function

      \begin{aligned} f(x) = \frac{1}{2}x^T Q x + q^T x + r, \\ \quad Q \in \boldsymbol{S}_{++}^n, \quad q \in \mathbb{R}^n, \quad r \in \mathbb{R}. \end{aligned}

      is minimized at

      x^\star = -Q^{-1}q.

      N.B.: at any x , the Newton step is \Delta x_{\text{nt}} = -\nabla^2 f(x)^{-1} \nabla f(x) = -x - Q^{-1}q , and so x + \Delta x_{\text{nt}} = x^\star : one full Newton step solves this problem.
    2. For general f , Newton’s method is to solve problem by solving sequence of quadratic approximation problems:

      \begin{aligned} \text{Perform Quad. approx. at }x &\implies \text{Solve Quad. approx. problem for }\Delta x_{\text{nt}}\\ &\implies \text{Take Newton step } x \mapsto x + t \Delta x_{\text{nt}}\\ & \implies \text{Perform Quad. approx. at } x + t\Delta x_{\text{nt}}\\ &\implies \cdots \end{aligned}









    Idea.
    Develop Newton-type method for equality constrained problems based on minimizing the quadratic model problem

    \begin{cases} \text{minimize} & \frac{1}{2}x^T Qx + q^Tx + r\\ \text{subject to} & Ax=b, \end{cases}

    with Q,q,r,A,b as above.
    Then one may follow unconstrained idea:

    \begin{aligned} \text{Perform Quad. approx. at }x &\implies \text{Solve Quad. approx. problem for }\Delta x_{\text{nt}}\\ &\implies \text{Take Newton step } x \mapsto x + t \Delta x_{\text{nt}}\\ & \implies \text{Perform Quad. approx. at } x + t\Delta x_{\text{nt}}\\ &\implies \cdots \end{aligned}

    N.B.: the more general case Q \in \boldsymbol{S}_+^n may also be treated.







    The Model.
    For

    \begin{cases} \text{minimize} & \frac{1}{2}x^T Qx + q^Tx + r\\ \text{subject to} & Ax=b, \end{cases}

    the KKT optimality conditions are

    \begin{cases} \begin{aligned} Ax^\star &= b\\ Qx^\star + q + A^T \nu^\star &=0. \end{aligned} \end{cases}

    The second equation follows from differentiating the Lagrangian

    L(x,\nu) =\frac{1}{2}x^T Qx + q^Tx + r + \nu^T Ax - \nu^T b.

    (N.B.: Slater’s condition is satisfied if Ax=b is consistent.)







    The KKT optimality conditions are equivalent to the KKT system:

    \begin{bmatrix} Q & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} x^\star \\ \nu^\star \end{bmatrix} =  \begin{bmatrix} -q\\ b \end{bmatrix}.

    KKT matrix:

    \begin{bmatrix} Q & A^T\\ A & 0  \end{bmatrix}.

    N.B.:
    • If KKT matrix is nonsingular, then problem has unique solution.
    • If KKT matrix is singular, then either Ax=b is inconsistent, or solutions to problem are not unique.








    Solving the KKT System.
    Suppose KKT matrix is nonsingular.
    Then x^\star is obtained by computing

    \begin{bmatrix} x^\star \\ \nu^\star \end{bmatrix} =  \begin{bmatrix} Q & A^T\\ A & 0  \end{bmatrix}^{-1} \begin{bmatrix} -q\\ b \end{bmatrix}.

    But this may be costly.
    Another idea: use KKT conditions directly.

    \begin{aligned} \begin{cases} \begin{aligned} Ax^\star &= b\\ Qx^\star + q + A^T \nu^\star &=0. \end{aligned} \end{cases} &\implies x^\star = -Q^{-1}(q + A^T \nu^\star)\\ & \implies b= A x^\star = -AQ^{-1}(q + A^T \nu^\star)\\ &\implies \nu^\star = - (AQ^{-1}A^T)^{-1}(b + AQ^{-1}q) \end{aligned}

    Now, computing \nu^\star can be used to compute x^\star.
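    Both routes are a few lines of numpy (a sketch on a small instance; the data Q, q, A, b are arbitrary):

    import numpy as np

    Q = np.array([[3.0, 1.0], [1.0, 2.0]])          # SPD
    q = np.array([1.0, -1.0])
    A = np.array([[1.0, 1.0]])                      # rank 1 < n = 2
    b = np.array([1.0])

    # route 1: solve the KKT system directly
    KKT = np.block([[Q, A.T], [A, np.zeros((1, 1))]])
    sol = np.linalg.solve(KKT, np.concatenate([-q, b]))
    x_kkt, nu_kkt = sol[:2], sol[2:]

    # route 2: eliminate x* and solve for nu* first
    Qinv = np.linalg.inv(Q)
    nu = -np.linalg.solve(A @ Qinv @ A.T, b + A @ Qinv @ q)
    x = -Qinv @ (q + A.T @ nu)

    print(np.allclose(x, x_kkt), np.allclose(A @ x, b))   # True True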







    Main Idea.
    Consider now the general problem

    \begin{cases} \text{minimize} & f(x)\\ \text{subject to} & Ax = b. \end{cases}

    Given initial x \in \text{dom}\,f satisfying Ax=b , let

    \begin{aligned} F(v) &= f(x) + \nabla f(x)^T v + \frac{1}{2}v^T \nabla^2 f(x) v &\approx f(x+v)  \end{aligned}

    be the quadratic approximation of f at x.
    N.B.: if we want to approximate f at x+v with x+v feasible, then need Av=0 :

    A(x+v) = Ax+Av = b + Av = b \iff Av = 0.









    Idea: if feasible x near x^\star , then F approximates f well and so, if v^\star solves

    \begin{cases} \text{minimize} & F(v) = f(x) + \nabla f(x)^T v + \frac{1}{2}v^T \nabla^2 f(x) v\\ \text{subject to} & Av = 0, \end{cases}

    then x+v^\star approximates x^\star well.
    N.B.: The point of Av^\star = 0 is so that x+v^\star remains feasible, i.e.,

    A(x+v^\star) = Ax + Av^\star = Ax = b.

    Viz., v^\star gives the feasible direction which minimizes the quadratic approximation of f .







    Linear Equality Constrained Newton’s Method Feasible directions relative to equality constraint Ax=b are any directions v \in \ker A .
    N.B.: if v is a feasible direction and x is feasible, then x+tv is feasible for all t \in \mathbb{R} . (Viz.: line searching does not break feasibility.)







    Consider now the general problem

    \begin{cases} \text{minimize} & f(x)\\ \text{subject to} & Ax = b. \end{cases}

    Newton step: a direction \Delta x_{\text{nt}} such that the system

    \begin{bmatrix} \nabla^2 f(x)  & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} \Delta x_{\text{nt}}\\ w \end{bmatrix} = \begin{bmatrix} -\nabla f(x)\\ 0 \end{bmatrix}

    is consistent for some w \in \mathbb{R}^p .
    N.B.: we only define \Delta x_{\text{nt}} when the KKT matrix

    \begin{bmatrix} \nabla^2 f(x)  & A^T\\ A & 0  \end{bmatrix}

    is nonsingular, which always holds when f is strongly convex.







    Remarks.
    1. The system

      \begin{bmatrix} \nabla^2 f(x)  & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} z_1\\z_2 \end{bmatrix} = \begin{bmatrix} -\nabla f(x)\\ 0 \end{bmatrix}

      is the KKT system for the quadratic approximation problem

      \begin{cases} \text{minimize} &  f(x) + \nabla f(x)^T v + \frac{1}{2}v^T \nabla^2 f(x) v\\ \text{subject to} & Av = 0. \end{cases}

      Thus, finding \Delta x_{\text{nt}} amounts to minimizing the equality constrained quadratic approximation of the original problem.








    2. \Delta x_{\text{nt}} is a feasible descent direction.
      Indeed, A\Delta x_{\text{nt}} = 0, and

      \nabla^2 f(x) \Delta x_{\text{nt}} + A^T w = - \nabla f(x)

      imply

      \Delta x_{\text{nt}}^T \nabla^2 f(x) \Delta x_{\text{nt}} + \Delta x_{\text{nt}}^T A^T w = - \nabla f(x)^T \Delta x_{\text{nt}},

      which implies

      0>-\Delta x_{\text{nt}}^T \nabla^2 f(x) \Delta x_{\text{nt}}  = \nabla f(x)^T \Delta x_{\text{nt}}.









    Newton Decrement.
    The Newton decrement \lambda(x) for the linear constrained Newton step is

    \lambda(x) = \left( \Delta x_{\text{nt}}^T \nabla^2 f(x) \Delta x_{\text{nt}} \right)^{1/2}.

    The interpretations for the unconstrained Newton decrement also hold; e.g.,

    f(x) - p^\star \approx \frac{1}{2}\lambda(x)^2 for x \approx x^\star .

    In particular: \frac{1}{2}\lambda(x)^2 < \epsilon gives a suitable stopping criterion as before.







    Linear Equality Constrained Newton’s Method.
    The Newton’s method for the linear equality constrained problem may now be stated.
    One may call the following algorithm a feasible descent method since each iteration demands the update x = x + t \Delta x_{\text{nt}} is feasible.
    
    given 
             initial x \in \text{dom}\,f  with Ax=b 
             tolerance \epsilon>0 
    repeat: 
    1. Solve
             \begin{bmatrix} \nabla^2 f(x)  & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} \Delta x_{\text{nt}}\\ w \end{bmatrix} = \begin{bmatrix} -\nabla f(x)\\ 0 \end{bmatrix} 
    and compute
             \lambda^2 = \Delta x_{\text{nt}}^T \nabla^2 f(x) \Delta x_{\text{nt}} .
    2. Stopping criterion: quit if \frac{1}{2}\lambda^2 \leq \epsilon .
    3. Perform line search to determine step size t .
    4. Take step: x = x + t\Delta x_{\text{nt}} .
    
    Interpretation.
    The equality constrained Newton’s method amounts to constructing a sequence of equality constrained quadratic minimization problems whose solutions approximate the solution of the original problem.
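    A sketch of the feasible Newton method above, applied to the entropy problem minimize \sum_i x_i \log x_i subject to \sum_i x_i = 1 (our own illustrative instance; the dom argument keeps the line search inside \text{dom}\,f = \{x \succ 0\} ):

    import numpy as np

    def eq_newton(f, grad, hess, A, x, dom, eps=1e-10, alpha=0.25, beta=0.5):
        p, n = A.shape
        while True:
            g, H = grad(x), hess(x)
            KKT = np.block([[H, A.T], [A, np.zeros((p, p))]])
            rhs = np.concatenate([-g, np.zeros(p)])
            dx = np.linalg.solve(KKT, rhs)[:n]
            lam2 = dx @ H @ dx                       # Newton decrement squared
            if lam2 / 2 <= eps:
                return x
            t = 1.0
            while not dom(x + t * dx) or f(x + t * dx) > f(x) + alpha * t * (g @ dx):
                t *= beta
            x = x + t * dx                           # A dx = 0, so x stays feasible

    f = lambda x: x @ np.log(x)
    grad = lambda x: np.log(x) + 1
    hess = lambda x: np.diag(1 / x)
    A = np.ones((1, 3))

    x0 = np.array([0.7, 0.2, 0.1])                   # feasible: sums to 1
    print(eq_newton(f, grad, hess, A, x0, dom=lambda x: np.all(x > 0)))
    # converges to the uniform distribution (1/3, 1/3, 1/3)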














    Interior Point Methods
    Problem Setup We will focus on general convex optimization problems of the form

    \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0, \quad i=1,\ldots,m\\  & Ax = b. \end{cases}

    The main assumptions are
    • f_0 is convex.
    • f_0 is twice continuously differentiable: \nabla^2 f_0 is continuous.
    • An optimal solution x^\star \in \text{dom}\,f_0 exists.
    • A \in \mathbb{R}^{p \times n} with \text{rank}\,A = p < n.
    • b \in \mathbb{R}^p is problem data.















    Review on Lagrange Duality The main idea of Lagrange duality is detailed in the following steps.
    We assume strong duality holds: d^\star = p^\star .







    1. Build optimization problem

      \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0, \quad i=1,\ldots,m\\  & h_i(x)=0, \quad i=1,\ldots,p \end{cases}









    2. Build Lagrangian

      L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x).

      and Lagrange dual function

      g(\lambda,\nu) = \inf \{L(x,\lambda,\nu) : x \text{ in domain of problem} \}.









    3. Build Lagrange dual problem

      \begin{cases} \text{maximize} & g(\lambda,\nu)\\ \text{subject to} & \lambda \succeq 0. \end{cases}

    4. Recall: if d^\star is the dual optimal value and strong duality holds, then p^\star = d^\star .







    5. Theoretically solve Lagrange dual problem for dual optimal (\lambda^\star,\nu^\star) , noting primal optimal x^\star minimizes

      x \mapsto L(x,\lambda^\star,\nu^\star).

      If

      \begin{cases} \text{minimize} & L(x,\lambda^\star,\nu^\star) \end{cases}

      has unique solution x^\sharp that is primal feasible, then primal optimal is x^\star = x^\sharp .








    6. Implementation of previous step is predicated on dual problem being simpler to solve and L(x,\lambda^\star,\nu^\star) having unique solution.
      In generality, Lagrange duality introduces the KKT optimality conditions

      \begin{aligned} f_i(x^\star) & \leq 0 , \quad i=1,\ldots,m\\ h_i(x^\star) & = 0 , \quad i=1,\ldots,p\\ \lambda_i^\star f_i(x^\star) &= 0, \quad i =1,\ldots,m\\ \lambda^\star &\succeq 0\\ \nabla_x L(x^\star,\lambda^\star,\nu^\star) &=0 \end{aligned}

      For convex problems, these conditions are necessary and sufficient for (x^\star,\lambda^\star,\nu^\star) to be optimal.








    7. In practice: either
      • x^\star is found by directly solving the KKT system, or
      • (\lambda^\star,\nu^\star) is found first and back-substituted, and then x^\star is found.
      N.B.: which route is taken or how (\lambda^\star,\nu^\star) is determined is dictated by problem structure (theoretically or numerically).















    Problem Hierarchy and Outline

    \begin{aligned} \begin{cases} \text{ECQP}:&\\ \text{minimize}& \frac{1}{2}x^TQx + q^T x + r\\ \text{subject to}& Ax=b \end{cases} \boldsymbol{\subset}    \begin{cases} \text{NCOP}:&\\ \text{minimize}& f_0(x)\\ \text{subject to}& Ax=b \end{cases}  \boldsymbol{\subset}    \begin{cases} \text{ICOP}:&\\ \text{minimize}& f_0(x)\\ \text{subject to}& f_i(x)\leq0\\ &Ax=b \end{cases} \end{aligned}

    ECQP:
    Solved via solving KKT system

    \begin{bmatrix} Q & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} x^\star \\ \nu^\star \end{bmatrix} =  \begin{bmatrix} -q\\ b \end{bmatrix}.

    Either solved directly or one appeals to the dual:

    \begin{aligned} \begin{aligned} Qx^\star + A^T \nu^\star  & = -q\\ A x^\star &=  b \end{aligned} &\implies x^\star = -Q^{-1}(q+A^T \nu^\star)\\ &\implies -AQ^{-1}(q+A^T \nu^\star)=b\\ &\implies \nu^\star =-(AQ^{-1}A^T)^{-1}(b+AQ^{-1}q) \end{aligned}

    Solving for \nu^\star allows one to compute x^\star.







    NCOP:
    Solved via sequence of approximating ECQP’s: at each iteration, approximate f_0(x) via

    F(v) = \frac{1}{2}v^T \nabla^2 f_0(x) v + \nabla f_0(x)^T v + f_0(x)

    and solve the ECQP

    \begin{cases} \text{minimize}& \frac{1}{2}v^T \nabla^2 f_0(x) v + \nabla f_0(x)^T v + f_0(x)\\ \text{subject to}& Av=0 \end{cases}

    for the Newton step \Delta x_{\text{nt}} = v^\star .







    This is equivalent to solving the KKT system

    \begin{bmatrix} \nabla^2 f_0(x) & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} \Delta x_{\text{nt}} \\ w \end{bmatrix} =  \begin{bmatrix} -\nabla f_0(x)\\ 0 \end{bmatrix}.

    From above: we may first solve

    w =-(A\nabla^2 f_0(x)^{-1} A^T)^{-1}(A\nabla^2 f_0(x)^{-1} \nabla f_0(x))

    and back substitute to obtain

    \Delta x_{\text{nt}} = -\nabla^2 f_0(x)^{-1}(\nabla f_0(x)+A^T w)

    N.B.: A=0 recovers unconstrained Newton step.







    ICOP:
    The main idea is to create a sequence of NCOP’s whose solutions approximate the solution to the ICOP.
    A first approximation: let

    I_{-}(u) =  \begin{cases} 0 & u \leq 0\\ +\infty & u>0. \end{cases}

    Then the ICOP is equivalent to the equality constrained convex optimization problem

    \begin{cases} \text{minimize}& f_0(x) + \sum_{i=1}^m I_{-}(f_i(x))\\ \text{subject to}& Ax=b \end{cases}

    N.B.: This new objective function need not be smooth and therefore Newton’s method need not apply in general.







    Plan:
    Devise an approximation scheme of this nonsmooth equality constrained convex optimization problem.
    This is achieved via solving a sequence of NCOP’s whose solutions converge to a solution to the ICOP.














    Logarithmic Barriers Logarithmic Approximation of Indicator Function:
    The indicator function

    I_{-}(u) =  \begin{cases} 0 & u \leq 0\\ +\infty & u>0. \end{cases}

    is smoothly approximated by the logarithm

    \hat{I}_{-}(u) =  \begin{cases} -\frac{1}{t}\log(-u) & u < 0\\ +\infty & u \geq 0, \end{cases}

    where t>0 is a parameter dictating the accuracy of approximation.







    The samples t=1,2,4,8,16 (solid curves) and I_- (dashed curve) are plotted below.








    N.B.: recall

    s \to 0  \implies c^{s}\to 1 for c>0 .

    Thus, for large t>0 and fixed u<0 , we have

    -\frac{1}{t}\log(-u) = - \log \left((-u)^{1/t} \right)\approx 0

    and for negative u \approx 0 and fixed t, we have

    -\frac{1}{t}\log(-u) \approx \infty.

    Therefore, -\frac{1}{t}\log(-u) approximates I_-(u) well for large t.







    The Logarithmic Barriers:
    With the equivalence

    \begin{aligned} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0\\  & Ax = b \end{cases} \iff \begin{cases} \text{minimize}& f_0(x) + \sum I_{-}(f_i(x))\\ \text{subject to}& Ax=b \end{cases} \end{aligned}

    and approximation

    I_-(u) \approx \hat{I}_-(u) := -\frac{1}{t}\log(-u)

    we introduce the following approximating logarithmic barrier problems:

    \begin{aligned} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0\\  & Ax = b \end{cases} &\iff \begin{cases} \text{minimize}& f_0(x) + \sum I_{-}(f_i(x))\\ \text{subject to}& Ax=b \end{cases} \\ &\quad\approx \text{(LBP)} \begin{cases} \text{minimize}& f_0(x) -\frac{1}{t} \sum \log(-f_i(x))\\ \text{subject to}& Ax=b \end{cases} \end{aligned}









    Remarks:
      • The approximating term

        \phi(x):=-\sum_{i=1}^m \log(-f_i(x))

        is called a logarithmic barrier for the problem.
      • N.B.: - \log(-u) >0 for 0<-u<1 and so \phi(x) penalizes x closer to boundary of feasible set.
      • The t in f_0(x) + \frac{1}{t}\phi(x) controls this penalty since

        t \approx \infty \implies -\frac{1}{t}\log(-u) \approx 0

        for each fixed u<0.








      • If f_0,f_i are twice continuously differentiable, then so is \phi on its domain

        \text{dom}\,\phi = \{ x : f_i(x) < 0,\, i =1,\ldots,m\}.

      • Moreover, -\log(-u) is convex and increasing in u on u<0 , so f_i convex implies -\log (-f_i(x)) is convex, and hence \phi is also convex.
      • Therefore, the approximating LBP’s are convex and Newton’s method is applicable for each t>0 .
        (“Applicable” means “can be run” and not “will necessarily converge”.)








    1. Thus each LBP is an NCOP for each t>0 .
      Moreover,

      f_0 + \frac{1}{t}\phi \to f_0 + \sum_{i=1}^m I_-(f_i(x))\text{ as } t \to \infty.

      This suggests the approximation scheme: for each t>0 , use Newton’s method to solve the approximating LBP

      \begin{cases} \text{minimize}& f_0(x) + \frac{1}{t} \phi(x) \\ \text{subject to}& Ax=b \end{cases}

      for primal optimal x^\star(t) , and prove x^\star(t) \to x^\star as t \to \infty .








    2. Therefore, informally speaking, we have

      \begin{aligned} \begin{cases} \text{minimize}& f_0(x) + \frac{1}{t}\phi(x)\\ \text{subject to}& Ax=b \end{cases} \to  \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0\\  & Ax = b \end{cases} \end{aligned}

      as t\to \infty .







    3. For sake of notational convenience: there holds the equivalence

      \begin{aligned} \begin{cases} \text{minimize}& f_0(x) + \frac{1}{t} \phi(x) \\ \text{subject to}& Ax=b \end{cases} \iff \begin{cases} \text{minimize}& tf_0(x) + \phi(x) \\ \text{subject to}& Ax=b \end{cases}. \end{aligned}

      In fact, both problems have the same primal optimizers.







    4. To use the KKT conditions for the approximating LBP’s, we record the gradient and Hessian of a general logarithmic barrier \phi(x):

      \begin{aligned} \nabla \phi(x) &= -\nabla \left( \sum_{i=1}^m \log(-f_i (x)) \right) \\ &=- \sum_{i=1}^m \frac{1}{f_i(x)} \nabla f_i(x)\\ \nabla^2 \phi(x) &= \sum_{i=1}^m \frac{1}{f_i(x)^2} \nabla f_i (x) \nabla f_i(x)^T - \sum_{i=1}^m \frac{1}{f_i(x)} \nabla^2 f_i(x). \end{aligned}

      N.B.: \nabla f_i (x) \nabla f_i(x)^T is an n \times n matrix.
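      For linear inequality constraints f_i(x) = g_i^T x - h_i (so \nabla f_i = g_i and \nabla^2 f_i = 0 ), these formulas reduce to sums of rank-one terms; a sketch:

      import numpy as np

      def log_barrier(G, h, x):
          # phi(x) = -sum_i log(h_i - g_i^T x), valid when h - Gx > 0
          s = h - G @ x                        # slacks s_i = -f_i(x)
          phi = -np.sum(np.log(s))
          grad = G.T @ (1 / s)                 # = -sum_i (1/f_i(x)) grad f_i(x)
          hess = G.T @ np.diag(1 / s**2) @ G   # = sum_i (1/f_i(x)^2) g_i g_i^T
          return phi, grad, hess

      # example: barrier for the box 0 <= x <= 1 in R^2
      G = np.vstack([np.eye(2), -np.eye(2)])
      h = np.array([1.0, 1.0, 0.0, 0.0])
      print(log_barrier(G, h, np.array([0.5, 0.25])))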















    Central Path We will always suppose that, for each t>0 , the LBP

    \begin{cases} \text{minimize}& tf_0(x) + \phi(x) \\ \text{subject to}& Ax=b \end{cases}

    has a unique solution x^\star(t) .
    We will also call this problem a centering problem.
    For concreteness, we will assume this problem is uniquely solvable for each t>0 via Newton’s method.







    The x^\star(t) form the central path of

    \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0, \quad i = 1,\ldots,m\\  & Ax = b \end{cases}

    which is a path in the feasible set F of this problem.
    Each x^\star(t) is called a central point.







    An example of a central path is depicted below.
    Each point along the curve indicates a solution to an approximating centering problem.
    Intuitively, these solutions should converge to the solution of the original problem.















    Therefore, since the problem is convex: for each t>0 , a point x^\star(t) is central iff the following hold:

    \begin{aligned} \text{strict feasibility}&  \begin{cases} Ax^\star(t) = b\\ f_i(x^\star(t)) < 0 \end{cases}\\ \text{KKT}& \begin{cases} 0&=t \nabla f_0(x^\star(t)) + \nabla \phi(x^\star(t)) + A^T \nu_t\\ &= t \nabla f_0(x^\star(t)) - \sum_{i=1}^m  \frac{1}{f_i(x^\star(t))} \nabla f_i(x^\star(t)) + A^T \nu_t \end{cases}\\ \end{aligned}

    for some \nu_t \in \mathbb{R}^p .
    N.B.: strict feasibility follows since \phi(x) \to \infty as any f_i(x) \to 0^- .














    Convergence and Dual Central Path Objectives:
    1. show x^\star(t) naturally determines a path of dual feasible points (\lambda^\star(t),\nu^\star(t)).
    2. show that x^\star(t) \to x^\star and f_0(x^\star(t)) \to p^\star as t \to \infty.
    Emphasis: the dual central path (\lambda^\star(t),\nu^\star(t)) can serve as a certificate that x^\star(t) is suboptimal within a desired tolerance: f_0(x^\star(t)) - p^\star < \epsilon .
    In short: the objectives follow from Lagrange duality and the KKT conditions for the approximating LBP’s.







    Theorem. Let x^\star(t) be the central path. Then

    f_0(x^\star(t)) - p^\star \leq \frac{m}{t}

    and there exists \nu_t \in \mathbb{R}^p such that

    \lambda^\star_i(t) := -\frac{1}{t f_i(x^\star(t))},\quad \nu^\star(t) := \frac{1}{t}\nu_t

    form a path (\lambda^\star(t),\nu^\star(t)) of dual feasible points.








    Remarks.
    1. Recall: m denotes the number of inequality constraints of the original problem.
    2. The estimate f_0(x^\star(t)) - p^\star \leq \frac{m}{t} implies

      \begin{aligned} f_0(x^\star(t)) &\to p^\star\\ x^\star(t) & \to x^\star. \end{aligned}

      Moreover, it determines exactly which t>0 ensures \epsilon -suboptimality:

      \frac{m}{\epsilon}\leq t \implies f_0(x^\star(t)) - p^\star \leq \epsilon.

    3. Recall: a pair (\lambda,\nu) is called dual feasible if \lambda \succeq 0 and g(\lambda,\nu)>-\infty, where g is the Lagrange dual function of the original problem.








    Proof.
    Outline
    1. Determine KKT conditions for approximating centering problems; this will determine \nu^\star(t).
    2. Compare with Lagrangian for original problem; this will determine \lambda^\star(t).
    3. Show (\lambda^\star(t),\nu^\star(t)) is dual feasible.
    4. Use Lagrange dual function and duality gap to conclude suboptimality estimate.








    1. For each t>0, the Lagrangian for the approximating centering problem

      \begin{cases} \text{minimize}& tf_0(x) + \phi(x) \\ \text{subject to}& Ax=b \end{cases}

      is

      L_t(x,\nu) = tf_0(x) + \phi(x) + \nu^T Ax - \nu^T b.

      Thus, the KKT conditions are

      \begin{cases} Ax^\star(t)=b\\ t \nabla f_0(x^\star(t)) + \nabla \phi(x^\star(t)) + A^T \nu_t = 0, \end{cases}

      for some \nu_t \in \mathbb{R}^p .
      Set

      \nu^\star(t) := \frac{1}{t}\nu_t.

      (Viz.: \nu^\star(t) is a scaled dual optimal Lagrange multiplier for the approximating centering problem at time t.)








    2. The Lagrangian for the original problem

      \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0, \quad i = 1,\ldots,m\\  & Ax = b \end{cases}

      is

      \begin{aligned} L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \nu^T Ax - \nu^T b, \end{aligned}

      whose gradient is

      \begin{aligned} \nabla_x L(x,\lambda,\nu) = \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^T \nu. \end{aligned}

      We show that L(x,\lambda,\nu) is minimized at x^\star(t) for suitably chosen (\lambda,\nu), which will be given by KKT conditions of the approximating centering problems.















      Compare the gradient of L at x^\star(t) and the KKT conditions for the approximating centering problems:

      \begin{aligned} \nabla_x L(x^\star(t),\lambda,\nu) &= \nabla f_0(x^\star(t)) + \sum_{i=1}^m \lambda_i \nabla f_i(x^\star(t)) + A^T \nu \\ \frac{1}{t}\nabla_x L_t(x^\star(t),\nu_t) & =  \nabla f_0(x^\star(t)) - \sum_{i=1}^m  \frac{1}{tf_i(x^\star(t))} \nabla f_i(x^\star(t)) + A^T \frac{1}{t}\nu_t\\  &=0 \end{aligned}

      Matching coefficients, we see that choosing

      \begin{aligned} \lambda_i &= - \frac{1}{t f_i(x^\star(t))}, \qquad \nu = \frac{1}{t}\nu_t \end{aligned}

      effects

      \nabla_x L(x^\star(t),\lambda,\nu) = 0.









      Indeed, if

      \begin{aligned} \lambda_i^\star(t) &= - \frac{1}{t f_i(x^\star(t))}, \qquad \nu^\star(t) = \frac{1}{t}\nu_t, \end{aligned}

      then

      \begin{aligned} \nabla_x L(x^\star(t),\lambda^\star(t),\nu^\star(t)) &= \nabla f_0(x^\star(t)) + \sum_{i=1}^m \lambda_i^\star(t) \nabla f_i(x^\star(t)) + A^T \nu^\star(t) \\ &=  \nabla f_0(x^\star(t)) - \sum_{i=1}^m  \frac{1}{tf_i(x^\star(t))} \nabla f_i(x^\star(t)) + A^T \frac{1}{t}\nu_t\\ &= \frac{1}{t}\nabla_xL_t(x^\star(t),\nu_t) \\ &=0 \end{aligned}









    3. We now show (\lambda^\star(t),\nu^\star(t)) is dual feasible, i.e., that

      \begin{aligned} \lambda^\star(t) & \succeq 0\\ g(\lambda^\star(t),\nu^\star(t))&> -\infty. \end{aligned}

      First observe:

      \begin{aligned} f_i (x^\star(t)) < 0 & \implies \lambda_i^\star (t) = - \frac{1}{t f_i(x^\star(t))} >0\\ &\implies \lambda^\star(t) \succeq 0. \end{aligned}









    4. Secondly, observe that x\mapsto L(x,\lambda^\star(t),\nu^\star(t)) is convex and so, since x^\star(t) is a critical point of this function, it is a minimizer.
      Consequently, the dual function g(\lambda,\nu) is finite at (\lambda^\star(t),\nu^\star(t)):

      \begin{aligned} g(\lambda^\star(t),\nu^\star(t)) &= \inf\{L(x,\lambda^\star(t),\nu^\star(t)): x \text{ in domain of problem}\}\\ &= f_0(x^\star(t)) + \sum_{i=1}^m \lambda_i^\star(t)f_i(x^\star(t)) + \nu^{\star}(t)^T(Ax^\star(t)-b)\\ &= f_0(x^\star(t)) -  \sum_{i=1}^m \frac{1}{tf_i(x^\star(t))}f_i(x^\star(t)) + \frac{1}{t}\nu_t \cdot 0\\ &= f_0(x^\star(t)) - \frac{m}{t}\\ &>-\infty \end{aligned}

















    5. Lastly, we observe that the evaluation of g(\lambda^\star(t),\nu^\star(t)) paired with weak duality gives:

      \begin{aligned} g(\lambda^\star(t),\nu^\star(t)) = &f_0(x^\star(t)) - \frac{m}{t}\\  &\implies f_0(x^\star(t)) - g(\lambda^\star(t),\nu^\star(t)) = \frac{m}{t}\\ &\implies f_0(x^\star(t)) - \sup g(\lambda,\nu) \leq \frac{m}{t}\\ &\implies f_0(x^\star(t)) - p^\star \leq \frac{m}{t}\\ \end{aligned}
















    The Barrier Method We are now ready to write out the algorithm which uses logarithmic barriers and centering problems to approximate the solution to an inequality constrained convex optimization problem.

    This algorithm is called the barrier method; aka sequential unconstrained minimization technique (SUMT) by Fiacco-McCormick or path-following method.
    
    given 
             initial strictly feasible x 
             initial time t > 0 
             multiplier \mu>1 
             tolerance \epsilon>0 
    repeat: 
    1. Centering step. Find center x^\star(t)  by solving
             \begin{cases} \text{minimize} & tf_0+\phi\\ \text{subject to} & Ax=b \end{cases} 
       with initial point x.
    2. Update. x:= x^\star(t) .
    3. Stopping criterion. quit if \frac{m}{\epsilon}<t .
    4. Increase t. t:=\mu t.
    
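
    To make the method concrete, here is a minimal numpy sketch (not from these notes; all names are illustrative) for the special case

    \begin{cases} \text{minimize} & \frac{1}{2}x^TQx + c^Tx\\ \text{subject to} & x \succeq 0 \end{cases}

    where f_i(x) = -x_i , \phi(x) = -\sum_{i=1}^m \log x_i , and there are no equality constraints:

    import numpy as np

    def center(Q, c, x, t, tol=1e-10, max_iter=50):
        """Inner (Newton) iterations for: minimize t*(0.5 x^T Q x + c^T x) - sum(log(x))."""
        obj = lambda z: t * (0.5 * z @ Q @ z + c @ z) - np.sum(np.log(z))
        for _ in range(max_iter):
            g = t * (Q @ x + c) - 1.0 / x        # gradient of t*f0 + phi
            H = t * Q + np.diag(1.0 / x**2)      # Hessian of t*f0 + phi
            dx = np.linalg.solve(H, -g)          # Newton step
            if -(g @ dx) / 2 < tol:              # half the squared Newton decrement
                break
            s = 1.0
            while np.any(x + s * dx <= 0):       # stay strictly feasible
                s *= 0.5
            while obj(x + s * dx) > obj(x) + 0.25 * s * (g @ dx):  # backtracking
                s *= 0.5
            x = x + s * dx
        return x

    def barrier_method(Q, c, x, t=1.0, mu=10.0, eps=1e-8):
        m = len(x)                     # number of inequality constraints
        while True:
            x = center(Q, c, x, t)     # 1.-2. centering step (outer iteration)
            if m / t < eps:            # 3. stopping criterion: m/eps < t
                return x
            t *= mu                    # 4. increase t

    # Toy usage: minimizing 0.5*||x - x0||^2 over x >= 0 gives max(x0, 0) componentwise.
    x0 = np.array([1.0, -2.0, 3.0])
    print(barrier_method(np.eye(3), -x0, x=np.ones(3)))   # approx. [1, 0, 3]

    Since the sketch returns right after a centering step with m/t < \epsilon, the duality-gap bound above guarantees f_0(x^\star(t)) - p^\star \leq m/t < \epsilon at exit.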








    Remarks.
    1. The iterations of Newton’s method used to solve the centering problem are called inner iterations.
      Each execution of Step 1. is called an outer iteration.







    2. Size of \mu dictates trade-off between number of inner and outer iterations:
      • \mu small: \mu \approx 1 \implies t \approx \mu t \implies x^\star(t) \approx x^\star(\mu t).
        Thus,
        • x^\star(t) is good initial point to compute x^\star(\mu t)
          \implies few Newton steps to move from x^\star(t) to x^\star(\mu t).
          \implies few inner iterations per outer iteration.
        • However: x^\star(t) \approx x^\star(\mu t)
          \implies algorithm moves along central path slowly
          \implies many outer iterations.
        • Newton steps follow along central path quite well.








      • \mu large: \mu \gg 1 \implies \mu t \gg t \implies x^\star(t) and x^\star(\mu t) quite separated.
        Thus,
        • x^\star(t) poor initial point to compute x^\star(\mu t)
          \implies many Newton steps to move from x^\star(t) to x^\star(\mu t)
          \implies many inner iterations.
        • However: large separation of x^\star(t) and x^\star(\mu t)
          \implies algorithm moves along central path quickly
          \implies few outer iterations.
        • Newton steps may diverge far from central path.








    3. Size of the initial t also affects the number of inner and outer iterations:
      • The larger t is:
        • the faster the algorithm moves along central path
          \implies the fewer outer iterations needed
        • the closer x^\star(t) is to x^\star
        • However: if initial x far from x^\star , then may require many initial inner iterations.
        • (These observations apply in the extreme to the aggressive choice t = m/\epsilon, for which the algorithm requires only one outer iteration.)








      • The smaller t is:
        • the slower the algorithm moves along central path
          \implies the more outer iterations needed
        • the farther x^\star(t) is from x^\star
        • Moreover: if initial x far from x^\star(t) and near x^\star , then may require many superfluous initial inner iterations.















    Modified KKT Conditions Recall:
    • the KKT optimality conditions for centering problem at time t are

      \begin{cases} Ax = b\\ t\nabla f_0(x) - \sum_{i=1}^m \frac{1}{f_i(x)} \nabla f_i(x) + A^T \nu = 0. \end{cases}

    • Central path x^\star(t) satisfies strict feasibility

      f_i(x^\star(t)) < 0, \quad i = 1,\ldots,m

      and defines dual central path (\lambda^\star(t),\nu^\star(t)) with

      \lambda_i^\star (t) = - \frac{1}{t f_i(x^\star(t))}, \quad i = 1,\ldots,m.









    Combining the two: x is a central point iff there is a pair (\lambda,\nu) satisfying the modified KKT conditions:

    \begin{cases} Ax = b\\ f_i(x) < 0, \quad i=1,\ldots,m\\ \lambda \succ 0\\ -\lambda_i f_i(x) = \frac{1}{t}, \quad i=1,\ldots,m\\ \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^T \nu = 0 \end{cases}

    Main Point.
    Certain interior point methods amount to iteratively solving this system.







    Remarks.
    1. modified KKT conditions are a “continuous” deformation of KKT conditions for original problem.
      Evidently: as t \to \infty , the modified KKT conditions converge to the original KKT conditions.
    2. Complementary slackness

      \lambda_i f_i(x) = 0

      is now “almost” complementary slackness

      -\lambda_i f_i(x) = \frac{1}{t}
















    Newton Step for Centering Problem Recall: the first step in the barrier method is solving the centering problem

    \begin{cases} \text{minimize} & tf_0+\phi\\ \text{subject to} & Ax=b \end{cases}

    where \phi is a log barrier.
    Solving the centering problem using Newton’s method amounts to iteratively solving the KKT systems

    \begin{bmatrix} t\nabla^2 f_0(x) + \nabla^2 \phi(x) & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} \Delta x_{\text{nt}}\\ \nu_{\text{nt}} \end{bmatrix} = \begin{bmatrix} -t\nabla f_0(x) - \nabla \phi(x)\\ 0 \end{bmatrix}

    Remark.
    Solving this system turns out to be directly related to solving the modified KKT equations, which is generally a nonlinear system.
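
    To make the inner step explicit, here is a schematic numpy fragment (illustrative; the callables grad_f0, hess_f0, grad_phi, hess_phi are assumed supplied, and x is assumed to satisfy Ax=b):

    import numpy as np

    def centering_newton_step(x, t, grad_f0, hess_f0, grad_phi, hess_phi, A):
        """One Newton step for: minimize t*f0 + phi subject to Ax = b."""
        n, p = len(x), A.shape[0]
        g = t * grad_f0(x) + grad_phi(x)
        H = t * hess_f0(x) + hess_phi(x)
        # Block KKT system: [H, A^T; A, 0] [dx_nt; nu_nt] = [-g; 0]
        K = np.block([[H, A.T], [A, np.zeros((p, p))]])
        sol = np.linalg.solve(K, np.concatenate([-g, np.zeros(p)]))
        return sol[:n], sol[n:]   # (dx_nt, nu_nt)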







    To exhibit this relation, first we form

    \begin{cases} Ax = b\\ f_i(x) < 0, \quad i=1,\ldots,m\\ \nabla f_0(x) - \sum_{i=1}^m \frac{1}{t f_i(x)}\nabla f_i(x) + A^T \nu = 0 \end{cases}

    from the modified KKT conditions

    \begin{cases} Ax = b\\ f_i(x) < 0, \quad i=1,\ldots,m\\ \lambda \succ 0\\ -\lambda_i f_i(x) = \frac{1}{t}, \quad i=1,\ldots,m\\ \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^T \nu = 0 \end{cases}

    by solving

    -\lambda_i f_i(x) = \frac{1}{t}

    for \lambda_i = -\frac{1}{t f_i(x)} and substituting, which eliminates \lambda.









    Then finding the Newton step to solve the centering problem is equivalent to finding the Newton step to solve the nonlinear system

    \begin{cases} Ax = b\\ f_i(x) < 0, \quad i=1,\ldots,m\\ \nabla f_0(x) - \sum_{i=1}^m \frac{1}{t f_i(x)}\nabla f_i(x) + A^T \nu = 0. \end{cases}

    Recall (Newton’s Method for Nonlinear Equations) One approach to solving a nonlinear system of equations

    \begin{aligned} F&:\mathbb{R}^k \to \mathbb{R}^k\\ F(X) &= (F_1(X),\ldots,F_k(X)) = 0 \end{aligned}

    is through Newton’s method.








    Here, one considers the iteration scheme

    \begin{aligned} X^{(k+1)} &= X^{(k)} + \Delta X_{\text{nt}}\\ \Delta X_{\text{nt}} &= - J_F^{-1}(X^{(k)})F(X^{(k)})\\ J_F(X)&=  \begin{bmatrix} \frac{\partial F_1}{\partial X_1} & \cdots & \frac{\partial F_1}{\partial X_k}\\ \vdots & \ddots & \vdots\\ \frac{\partial F_k}{\partial X_1} & \cdots & \frac{\partial F_k}{\partial X_k} \end{bmatrix} \end{aligned}

    We call \Delta X_{\text{nt}} the Newton step for solving the nonlinear system F(X)=0.
    Under suitable circumstances, if X^\star solves F(X)=0, and X^{(0)} near X^\star , then X^{(k)} \to X^\star .
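
    A minimal numpy sketch of this iteration on a toy system (illustrative; explicit Jacobian supplied):

    import numpy as np

    def newton_system(F, J, X, tol=1e-12, max_iter=50):
        """Newton's method for the nonlinear system F(X) = 0 with Jacobian J."""
        for _ in range(max_iter):
            X = X + np.linalg.solve(J(X), -F(X))   # X + Delta X_nt
            if np.linalg.norm(F(X)) < tol:
                break
        return X

    # Toy system: x^2 + y^2 = 1 and x = y, with solution (1/sqrt(2), 1/sqrt(2)).
    F = lambda X: np.array([X[0]**2 + X[1]**2 - 1, X[0] - X[1]])
    J = lambda X: np.array([[2 * X[0], 2 * X[1]], [1.0, -1.0]])
    print(newton_system(F, J, np.array([1.0, 1.0])))   # approx. [0.7071, 0.7071]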















    Primal-Dual Interior Point Method Recall: Newton’s method for the inner iterations to solve the centering problems is equivalent to using Newton’s method to solve the nonlinear modified KKT conditions

    \begin{cases} Ax = b\\ f_i(x)<0, \quad i =1,\ldots,m\\ \nabla f_0(x) - \sum_{i=1}^m \frac{1}{t f_i(x)}\nabla f_i(x) + A^T \nu = 0 \end{cases}

    after eliminating \lambda via

    -\lambda_i f_i(x) = \frac{1}{t}.









    If we instead use Newton’s method to solve the full modified KKT conditions

    \begin{cases} Ax = b\\ f_i(x) < 0, \quad i=1,\ldots,m\\ \lambda \succ 0\\ -\lambda_i f_i(x) = \frac{1}{t}, \quad i=1,\ldots,m\\ \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^T \nu = 0 \end{cases}

    we develop a new algorithm which is an example of a “primal-dual interior point method”.







    Main Features.
    1. There are no inner iterations.
    2. Both primal and dual variables are updated each iteration.
    3. Primal and dual iterates need not be feasible.
    4. Often outperforms the barrier method.








    Primal-dual search direction.
    Define

    \begin{aligned} f(x) =  \begin{bmatrix} f_1(x)\\\vdots\\f_m(x) \end{bmatrix} ,\qquad Df(x)= \begin{bmatrix} \nabla f_1(x)^T\\\vdots\\\nabla f_m(x)^T \end{bmatrix}\\ \text{diag}(\lambda) =  \begin{bmatrix} \lambda_1 & 0 &\cdots & 0 \\ 0 & \lambda_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 &\cdots & \lambda_m \end{bmatrix} ,\qquad  \boldsymbol{1} = \begin{bmatrix} 1\\\vdots\\1\end{bmatrix}\\ r_t(x,\lambda,\nu) = \begin{bmatrix}\nabla f_0(x) + Df(x)^T\lambda + A^T\nu\\ -\text{diag}(\lambda)f(x) - \frac{1}{t}\boldsymbol{1}\\ Ax-b \end{bmatrix} \end{aligned}









    N.B.:

    \begin{cases} r_t(x,\lambda,\nu) =0\\ f_i(x)<0 \end{cases}

    are exactly the modified KKT conditions.
    Thus, if (x,\lambda,\nu) solves this system, then

    (x,\lambda,\nu) = (x^\star(t),\lambda^\star(t),\nu^\star(t)).









    Define the following residuals

    \begin{aligned} \text{dual residual} &= r_{\text{dual}} = \nabla f_0(x) + Df(x)^T \lambda + A^T \nu\\ \text{centrality residual} &= r_{\text{cent}} = -\text{diag}(\lambda)f(x) - \frac{1}{t}\boldsymbol{1}\\ \text{primal residual} &= r_{\text{pri}} = Ax-b. \end{aligned}

    Remarks.
    1. These residuals are just the blocks of r_t :

      r_t = \begin{bmatrix} r_{\text{dual}} \\ r_{\text{cent}} \\ r_{\text{pri}} \end{bmatrix}.

    2. r_{\text{dual}}(x,\lambda,\nu) measures the deviation of (\lambda,\nu) from dual feasibility.
    3. r_{\text{pri}}(x,\lambda,\nu) measures the deviation of x from primal feasibility.
    4. r_{\text{cent}}(x,\lambda,\nu) measures the deviation of (x,\lambda,\nu) from centrality.
    5. All three \approx 0 means (x,\lambda,\nu) nearly solves the modified KKT conditions.
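
    In code, these residuals are a direct transcription; a small numpy sketch (illustrative names; the callables grad_f0, f, Df are assumed supplied):

    import numpy as np

    def residuals(x, lam, nu, t, grad_f0, f, Df, A, b):
        r_dual = grad_f0(x) + Df(x).T @ lam + A.T @ nu
        r_cent = -f(x) * lam - 1.0 / t     # elementwise: -diag(lam) f(x) - (1/t) 1
        r_pri  = A @ x - b
        return r_dual, r_cent, r_pri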








    Using Newton’s Method.
    Applying Newton’s method to solve the nonlinear system

    r_t(x,\lambda,\nu) = 0

    at a point y = (x,\lambda,\nu) satisfying f(x) \prec 0 \prec \lambda results in a Newton step

    \Delta y = (\Delta x, \Delta \lambda, \Delta \nu)

    given by

    \Delta y = -J_{r_t}(y)^{-1}r_t(y),

    where

    J_{r_t}(y)= Dr_t(y) = \begin{bmatrix} \nabla^2 f_0(x) + \sum_{i=1}^m \lambda_i \nabla^2 f_i(x) & Df(x)^T & A^T\\ -\text{diag}(\lambda)Df(x) & -\text{diag}(f(x)) & 0\\ A & 0 & 0 \end{bmatrix}.









    Viz.: the Newton step solves

    \begin{aligned} \begin{bmatrix} \nabla^2 f_0(x) + \sum_{i=1}^m \lambda_i \nabla^2 f_i(x) & Df(x)^T & A^T\\ -\text{diag}(\lambda)Df(x) & -\text{diag}(f(x)) & 0\\ A & 0 & 0 \end{bmatrix} \begin{bmatrix} \Delta x\\ \Delta \lambda \\ \Delta \nu \end{bmatrix} = - \begin{bmatrix} r_{\text{dual}} \\ r_{\text{cent}} \\ r_{\text{pri}} \end{bmatrix} \end{aligned}

    The solution to this system is called the primal-dual search direction and will be denoted by

    \Delta y_{\text{pd}} = (\Delta x_{\text{pd}},\Delta \lambda_{\text{pd}},\Delta \nu_{\text{pd}})
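
    Schematically, in numpy (illustrative; hess_L(x, lam) is assumed to return \nabla^2 f_0(x) + \sum_i \lambda_i \nabla^2 f_i(x)):

    import numpy as np

    def pd_search_direction(x, lam, nu, t, hess_L, grad_f0, f, Df, A, b):
        n, m, p = len(x), len(lam), A.shape[0]
        fx, Dfx = f(x), Df(x)
        r = np.concatenate([grad_f0(x) + Dfx.T @ lam + A.T @ nu,   # r_dual
                            -fx * lam - 1.0 / t,                   # r_cent
                            A @ x - b])                            # r_pri
        K = np.block([
            [hess_L(x, lam),       Dfx.T,            A.T],
            [-np.diag(lam) @ Dfx,  -np.diag(fx),     np.zeros((m, p))],
            [A,                    np.zeros((p, m)), np.zeros((p, p))],
        ])
        dy = np.linalg.solve(K, -r)
        return dy[:n], dy[n:n + m], dy[n + m:]   # (dx_pd, dlam_pd, dnu_pd)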









    Surrogate Duality Gap.
    The iterates (x^{(k)},\lambda^{(k)},\nu^{(k)}) produced by taking steps along the primal-dual search directions \Delta y_{\text{pd}} need not be feasible.
    Therefore, f_0(x^{(k)}) - g(\lambda^{(k)},\nu^{(k)}) need not measure the duality gap.

    In place of duality gap, we use the surrogate duality gap: for (x,\lambda) satisfying f(x) \prec 0 \prec \lambda , the number

    \hat\eta(x,\lambda) = - f(x)^T \lambda.









    N.B.: if x is primal feasible and (\lambda,\nu) is dual feasible with r_{\text{dual}} = 0 , then x minimizes the convex function x \mapsto L(x,\lambda,\nu), so

    \begin{aligned} g(\lambda,\nu) &= L(x,\lambda,\nu)\\ &= f_0(x) + f(x)^T\lambda + \nu^T(Ax-b)\\ &= f_0(x) + f(x)^T\lambda\\ &= f_0(x) - \hat\eta(x,\lambda). \end{aligned}

    (On the central path, where additionally -\lambda_i f_i(x) = \frac{1}{t} , this gives \hat\eta = \frac{1}{t} + \cdots + \frac{1}{t} = \frac{m}{t} .)

    Viz.: r_{\text{pri}} =0, \, r_{\text{dual}} = 0 \implies \hat\eta(x,\lambda) is the duality gap.
    Therefore, r_{\text{pri}},r_{\text{dual}},\hat\eta can be used to define stopping criterion.
    Indeed: \Vert r_{\text{pri}} \Vert_2 and \Vert r_{\text{dual}} \Vert_2 small \implies \hat\eta nearly duality gap.
    Therefore, all three quantities small \implies small duality gap.







    Primal-dual interior point method.
    We may now introduce the algorithm called primal-dual interior point method.
    
    given 
             initial x  satisfying f_1(x)< 0,\ldots, f_m(x) < 0 
             initial \lambda \succ 0  and \nu 
             multiplier \mu>1 
             tolerances \epsilon>0, \epsilon_{\text{feas}}>0 
    repeat: 
    1. Determine t . Set t = \mu m / \hat\eta 
    2. Compute primal-dual search direction \Delta y_{\text{pd}}
    3. Perform line search and update. 
             Determine step length s>0  and set y = y+s\Delta y_{\text{pd}} .
    until: \Vert r_{\text{pri}} \Vert_2 \leq \epsilon_{\text{feas}} , \Vert r_{\text{dual}} \Vert_2 \leq \epsilon_{\text{feas}}  and \hat\eta \leq \epsilon .
    
    N.B.:
    1. the line search is a modified back-tracking line search which ensures f(x) \prec 0 \prec \lambda always holds.
    2. for feasible parameters, the value t= m/\hat\eta corresponds to a duality gap of \hat\eta.
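
    Putting the pieces together, here is a minimal runnable numpy sketch (not from these notes) for the special case minimize \frac{1}{2}x^TQx + c^Tx subject to x \succeq 0 , where f(x) = -x , Df(x) = -I , and there are no equality constraints (so \nu and r_{\text{pri}} drop out):

    import numpy as np

    def pd_ipm(Q, c, x, lam, mu=10.0, eps=1e-8, max_iter=100):
        m = len(x)
        rnorm = lambda xx, ll, t: np.linalg.norm(
            np.concatenate([Q @ xx + c - ll, ll * xx - 1.0 / t]))
        for _ in range(max_iter):
            eta = x @ lam                      # surrogate gap: -f(x)^T lam
            if eta <= eps and np.linalg.norm(Q @ x + c - lam) <= eps:
                break
            t = mu * m / eta                   # 1. determine t
            r_dual = Q @ x + c - lam
            r_cent = lam * x - 1.0 / t         # -diag(lam) f(x) - (1/t) 1
            # 2. primal-dual search direction (f_i linear, so no Hessian terms)
            K = np.block([[Q, -np.eye(m)], [np.diag(lam), np.diag(x)]])
            dy = np.linalg.solve(K, -np.concatenate([r_dual, r_cent]))
            dx, dlam = dy[:m], dy[m:]
            # 3. backtracking keeping x > 0 and lam > 0, then requiring ||r_t|| to decrease
            s = 0.99
            while np.any(x + s * dx <= 0) or np.any(lam + s * dlam <= 0):
                s *= 0.5
            while rnorm(x + s * dx, lam + s * dlam, t) > (1 - 0.01 * s) * rnorm(x, lam, t):
                s *= 0.5
            x, lam = x + s * dx, lam + s * dlam
        return x, lam

    # Same toy problem as before: projection of x0 onto the nonnegative orthant.
    x0 = np.array([1.0, -2.0, 3.0])
    x, lam = pd_ipm(np.eye(3), -x0, x=np.ones(3), lam=np.ones(3))
    print(x)   # approx. [1, 0, 3]

    The choice t = \mu m/\hat\eta in step 1 mirrors N.B. 2 above: for feasible iterates, t = m/\hat\eta reproduces the current surrogate gap, and the extra factor \mu pushes the gap down at each iteration.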















    Appendix
    Differentiating b^Tx Given

    b = \begin{bmatrix}b_1\\b_2\\\vdots\\b_n\end{bmatrix} \in \mathbb{R}^n ,

    define the scalar function

    g(x) = b^Tx = b_1x_1 + \cdots + b_n x_n .

    Using the Taylor expansion

    \begin{aligned} g(y) = g(x) + (y-x)^T \nabla g(x) + \cdots \end{aligned}

    at y = 0, we find

    \begin{aligned} 0 = b^Tx - x^T \nabla g(x), \end{aligned}

    since the higher-order terms vanish for linear g, and so x^T \nabla g(x) = b^Tx for all x; hence

    \nabla g(x) = b.

    To see this computed directly: computing

    \begin{aligned}  \frac{\partial}{\partial x_k} g(x) &= \frac{\partial}{\partial x_k}(b_1x_1 + \cdots + b_nx_n)\\ &=b_k, \end{aligned}

    we conclude

    \nabla g = \begin{bmatrix}\frac{\partial}{\partial x_1} g(x) \\ \frac{\partial}{\partial x_2}g(x) \\ \vdots \\ \frac{\partial}{\partial x_n}g(x) \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} = b.
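
    A quick finite-difference check of this identity (illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    b, x = rng.standard_normal(4), rng.standard_normal(4)
    g = lambda z: b @ z                       # g(x) = b^T x
    h = 1e-6
    num_grad = np.array([(g(x + h * e) - g(x - h * e)) / (2 * h) for e in np.eye(4)])
    print(np.allclose(num_grad, b))           # True: grad(b^T x) = b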

    Differentiating \frac{1}{2}x^TQx Given

    Q  = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1n}\\ q_{21} & q_{22} & \cdots & q_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ q_{n1} & q_{n2} & \cdots & q_{nn} \end{bmatrix} \in \boldsymbol{S}^n

    define the scalar function

    f(x) = \frac{1}{2}x^T Q x.

    Using the Taylor expansion

    \begin{aligned} f(y) = f(x) + (y-x)^T \nabla f(x) + \frac{1}{2} (y-x)^T \nabla^2 f(x) (y-x) + \cdots \end{aligned}

    at y = 0 , we find

    \begin{aligned} 0 &= \frac{1}{2}x^TQx - x^T \nabla f(x) + \frac{1}{2} x^T \nabla^2 f(x)x + \cdots\\ &= x^TQx - x^T \nabla f(x) - \frac{1}{2}x^TQx + \frac{1}{2}x^T\nabla^2 f(x) x + \cdots, \end{aligned}

    which is evidently satisfied in case

    \begin{aligned} \nabla f(x) &= Qx\\ \nabla^2 f(x) &= Q. \end{aligned}

    To see this computed directly: expanding the matrix multiplication, we have

    \begin{aligned} f(x) = \frac{1}{2} \sum_{i,j=1}^{n} x_i x_j q_{ij}  \end{aligned}  .

    Computing

    \begin{aligned}  \frac{\partial}{\partial x_k} f(x) &= \frac{1}{2}\sum_{i,j=1}^n \frac{\partial}{\partial x_k} (x_i x_j q_{ij})\\ &=\frac{1}{2} \left(\sum_{i=1}^n x_i q_{ik} + \sum_{j=1}^n x_j q_{kj} \right)\\ &=\frac{1}{2}\sum_{i=1}^n x_i(q_{ik}+q_{ki})\\ &=\sum_{i=1}^n x_iq_{ik} \end{aligned}

    we have

    \begin{aligned} \nabla f(x) &=  \begin{bmatrix} \frac{\partial}{\partial x_1} f(x)\\ \frac{\partial}{\partial x_2} f(x)\\ \vdots\\ \frac{\partial}{\partial x_n} f(x) \end{bmatrix} =Qx \end{aligned} .

    Taking the second derivative gives

    \begin{aligned}  \frac{\partial^2}{\partial x_k \partial x_l} f(x) &= \sum_{i=1}^n\frac{\partial}{\partial x_l}x_i q_{ik}\\ &=q_{lk}. \end{aligned}

    Consequently, there holds

    \begin{aligned} \nabla^2 f(x) &=  \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \frac{\partial^2 f}{\partial x_2 \partial x_1 }(x) & \frac{\partial^2 f}{\partial x_2^2 }(x) & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(x) \\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n \partial x_1}(x) & \frac{\partial^2 f}{\partial x_n \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x) \\ \end{bmatrix}\\ & = Q. \end{aligned}

    Example. Let

    \begin{aligned} f(x_1,x_2) &= \frac{1}{2}\begin{bmatrix}x_1 & x_2 \end{bmatrix} \begin{bmatrix} a&b\\c&d \end{bmatrix} \begin{bmatrix}x_1\\x_2 \end{bmatrix}\\ &=\frac{1}{2} \begin{bmatrix}x_1&x_2\end{bmatrix}\begin{bmatrix}ax_1+bx_2\\cx_1+dx_2\end{bmatrix}\\ &=\frac{1}{2}(ax_1^2 + (b+c)x_1x_2 + dx_2^2). \end{aligned}

    Compute

    \begin{aligned} \frac{\partial}{\partial x_1} f(x_1,x_2) &= \frac{1}{2}\frac{\partial}{\partial x_1} (ax_1^2 + (b+c)x_1x_2 + dx_2^2)\\ &=\frac{1}{2}(2ax_1 + (b+c)x_2)\\ \frac{\partial}{\partial x_2} f(x_1,x_2) &= \frac{1}{2}\frac{\partial}{\partial x_2} (ax_1^2 + (b+c)x_1x_2 + dx_2^2)\\ &=\frac{1}{2}((b+c)x_1 + 2dx_2)\\ \end{aligned}.

    Consequently,

    \begin{aligned} \nabla f(x) &= \begin{bmatrix} \frac{\partial}{\partial x_1} f(x) \\ \frac{\partial}{\partial x_2} f(x) \end{bmatrix}\\ &=\begin{bmatrix} \frac{1}{2}(2ax_1 + (b+c)x_2)\\ \frac{1}{2}((b+c)x_1 + 2dx_2) \end{bmatrix}\\ &=\frac{1}{2} \left(\begin{bmatrix}ax_1 + bx_2 \\ cx_1 + dx_2 \end{bmatrix} + \begin{bmatrix} ax_1 + cx_2 \\ bx_1 + dx_2 \end{bmatrix} \right)\\ &=\frac{1}{2}\left( \begin{bmatrix}a&b\\c&d\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix} + \begin{bmatrix}a&c\\b&d\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix}\right)\\ &=\frac{1}{2}(Q+Q^T)x. \end{aligned}

    Next compute the second derivatives:

    \begin{aligned} \frac{\partial^2}{\partial x_1^2} f(x_1,x_2) &=\frac{1}{2}\frac{\partial}{\partial x_1}(2ax_1 + (b+c)x_2)\\ &=a\\ \frac{\partial^2}{\partial x_1 \partial x_2} f(x_1,x_2) &=\frac{1}{2}\frac{\partial}{\partial x_2}(2ax_1 + (b+c)x_2)\\ &=\frac{1}{2}(b+c)\\ \frac{\partial^2}{\partial x_2 \partial x_1} f(x_1,x_2) &= \frac{1}{2}\frac{\partial}{\partial x_1}((b+c)x_1 + 2dx_2)\\ &=\frac{1}{2}(b+c)\\ \frac{\partial^2}{\partial x_2^2} f(x_1,x_2) &= \frac{1}{2}\frac{\partial}{\partial x_2}((b+c)x_1 + 2dx_2)\\ &=d. \end{aligned}

    Putting this together gives

    \begin{aligned} \nabla^2 f(x) &= \begin{bmatrix} \frac{\partial^2}{\partial x_1^2}f & \frac{\partial^2}{\partial x_1 \partial x_2}f\\ \frac{\partial^2}{\partial x_2 \partial x_1}f & \frac{\partial^2}{\partial x_2^2}f \end{bmatrix}\\ &= \frac{1}{2}\begin{bmatrix} a & b+c\\ c+b & d \end{bmatrix}\\ &= \frac{1}{2}\begin{bmatrix} a & b\\ c& d \end{bmatrix} + \frac{1}{2}\begin{bmatrix}a&c\\b&d\end{bmatrix}\\ &= \frac{1}{2}(Q+Q^T). \end{aligned}
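
    A quick finite-difference check (illustrative), using a deliberately nonsymmetric Q to exercise the \frac{1}{2}(Q+Q^T) formula from the example:

    import numpy as np

    rng = np.random.default_rng(1)
    Q = rng.standard_normal((3, 3))           # nonsymmetric on purpose
    x = rng.standard_normal(3)
    f = lambda z: 0.5 * z @ Q @ z             # f(x) = 0.5 x^T Q x
    h = 1e-6
    num_grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)])
    print(np.allclose(num_grad, 0.5 * (Q + Q.T) @ x))   # True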

    Differentiating b^TAx Given A \in \mathbb{R}^{m \times n} , b \in \mathbb{R}^m , define the scalar function h(x) = b^TAx.
    Using the Taylor expansion

    h(y) = h(x) + (y-x)^T\nabla h(x) + \cdots

    at y = 0, we get

    \begin{aligned} 0 &= b^TAx - x^T \nabla h(x) + \cdots\\ &= x^T(A^Tb) - x^T \nabla h(x) + \cdots \end{aligned}

    which directly implies

    \nabla h(x) = A^Tb .

    To see this computed directly, first compute

    \begin{aligned} b^TAx &= \begin{bmatrix}b_1& \cdots & b_m \end{bmatrix} \begin{bmatrix} a_1^T\\ \vdots \\ a_m^T \end{bmatrix} \begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}\\ &=\begin{bmatrix}b_1& \cdots & b_m \end{bmatrix} \begin{bmatrix} a_1^Tx\\\vdots\\ a_m^T x\end{bmatrix}\\ &= b_1a_1^Tx + \cdots + b_ma_m^Tx. \end{aligned}

    Computing

    \begin{aligned} \frac{\partial}{\partial x_k} b^TAx &= \frac{\partial}{\partial x_k}(b_1a_1^Tx + \cdots + b_ma_m^Tx)\\ &= b_1 a_{1k} + \cdots + b_m a_{mk}, \end{aligned}

    we conclude

    \begin{aligned} \nabla (b^TAx) &= \begin{bmatrix}  b_1 a_{11} + \cdots + b_m a_{m1}\\ \vdots\\ b_1 a_{1n} + \cdots + b_m a_{mn} \end{bmatrix}\\ &= A^Tb. \end{aligned}
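
    And a quick finite-difference check of this identity (illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    A, b = rng.standard_normal((3, 4)), rng.standard_normal(3)
    x = rng.standard_normal(4)
    hfun = lambda z: b @ (A @ z)              # h(x) = b^T A x
    step = 1e-6
    num_grad = np.array([(hfun(x + step * e) - hfun(x - step * e)) / (2 * step)
                         for e in np.eye(4)])
    print(np.allclose(num_grad, A.T @ b))     # True: grad(b^T A x) = A^T b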